Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

JSON

Two layered JSON surfaces. Both are registered in every session.

  1. Trino-named layer in sqe-trino-functions: json_extract, json_extract_scalar, json_array_length, json_parse, json_format, json_object, is_json_scalar, json_array_contains, json_size, json_array_get, to_json. Maps directly to dbt-trino model expectations.
  2. DataFusion JSON layer via datafusion-functions-json (registered in session_context.rs:361 and embedded.rs:172): json_get, json_get_str, json_get_int, json_get_float, json_get_bool, json_get_json, json_get_array, json_contains, json_as_text, json_length. Type-specific extractors that avoid an outer CAST.

JSON columns are stored as VARCHAR (Iceberg has no native JSON type yet). Both layers parse on every call, so for hot paths consider extracting once into a typed column.

Trino-named layer

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
json_parse(s)sqe-trino-functionsParse JSON text. Returns the same VARCHAR after validation; primarily used to fail fast on malformed input. trino_functions.rs:92yesparse_jsonfrom_jsonjson
json_format(j)sqe-trino-functionsFormat / re-emit JSON. trino_functions.rs:60yes-to_jsonto_json
json_object(k1, v1, k2, v2, ...)sqe-trino-functionsBuild a JSON object from key-value pairs. trino_functions.rs:59yesyesyesyes
json_array_length(j)sqe-trino-functionsNumber of elements in a JSON array. NULL if not an array. trino_functions.rs:91yesyesyesjson_array_length
json_extract(j, '$.path')sqe-trino-functionsJSONPath extraction. Returns JSON-encoded value. trino_functions.rs:89yesyesyesyes
json_extract_scalar(j, '$.path')sqe-trino-functionsSame path syntax; returns plain text scalar (or NULL). trino_functions.rs:90yes-get_json_objectjson_extract_string
is_json_scalar(j)sqe-trino-functions (ext)True for JSON null / boolean / number / string. False for objects and arrays. trino_functions_ext.rs:30yes---
json_array_contains(j, v)sqe-trino-functions (ext)True if JSON array contains the value. trino_functions_ext.rs:31yes---
json_size(j, '$.path')sqe-trino-functions (ext)Cardinality at the path. Object key count for objects; array length for arrays; 0 for primitives. trino_functions_ext.rs:72yes---
json_array_get(j, idx)sqe-trino-functions (ext)0-based element from a JSON array. trino_functions_ext.rs:73yes---
to_json(any)sqe-trino-functions (ext)Serialize a SQL value (struct, array, map) as JSON. trino_functions_ext.rs:80-yesyesyes

DataFusion JSON layer

These return typed scalars directly, which means no outer CAST and DataFusion can push predicates through.

FunctionOriginNotesReturns
json_get(j, key_or_index)datafusion-functions-jsonGeneric typed accessor. The argument can be a string key or integer index.Union of possible types
json_get_str(j, key)datafusion-functions-jsonForce string return.VARCHAR
json_get_int(j, key)datafusion-functions-jsonForce integer.BIGINT
json_get_float(j, key)datafusion-functions-jsonForce float.DOUBLE
json_get_bool(j, key)datafusion-functions-jsonForce boolean.BOOLEAN
json_get_json(j, key)datafusion-functions-jsonRe-emit nested JSON.VARCHAR (JSON)
json_get_array(j, key)datafusion-functions-jsonForce array.VARCHAR (JSON array)
json_contains(j, '$.path')datafusion-functions-jsonTrue if path exists.BOOLEAN
json_as_text(j)datafusion-functions-jsonRe-emit as JSON text (round-trip).VARCHAR
json_length(j)datafusion-functions-jsonNumber of object keys / array elements at root.BIGINT

Path syntax

The Trino layer uses JSONPath ($.foo.bar[0]).

The DataFusion JSON layer uses single-step keys: a string for object access or an integer for array access.

-- Trino-style: full JSONPath
SELECT json_extract_scalar(payload, '$.user.id')
FROM events;

-- DataFusion: chained single-step
SELECT json_get_str(json_get_json(payload, 'user'), 'id')
FROM events;

Both compose. Both work on the same VARCHAR column. Choose by ergonomics:

  • Trino-style is one call per path; cleaner SQL.
  • DataFusion JSON layer returns native types; no outer CAST.

For a deeply nested path used in a hot WHERE clause, the DataFusion layer wins on speed (no JSON re-encoding between steps).

Comparison

OperationSQETrinoSnowflakeSpark SQLDuckDB
Parse / validatejson_parsejson_parseparse_jsonfrom_json (typed)json() cast
Path extract (text)json_extract_scalarjson_extract_scalar: syntax (payload:user.id)get_json_objectjson_extract_string
Path extract (JSON)json_extractjson_extractpath (object navigation)nested from_jsonjson_extract
Typed extractjson_get_str/int/float/boolcast(... AS ...):value::TYPEfrom_json schemajson_extract_*
Build objectjson_object(k, v, ...)json_object(k, v, ...)object_constructto_json(struct(...))json_object
Build arrayjson_array(...) (Trino style; not registered yet)json_array(...)array_constructto_json(array(...))json_array
Lengthjson_array_length / json_lengthjson_array_lengtharray_sizejson_array_lengthjson_array_length
Contains key/valuejson_contains / json_array_containsjson_array_containsarray_containsarray_containsjson_contains
To JSON textto_jsoncast(x AS json)to_jsonto_jsonto_json

Examples

Extract a typed field for filtering

-- Slowpath: parse JSON, cast to int, compare
SELECT * FROM events
WHERE CAST(json_extract_scalar(payload, '$.user_id') AS BIGINT) = 12345;

-- Faster: typed extractor, no CAST
SELECT * FROM events
WHERE json_get_int(json_get_json(payload, 'user'), 'id') = 12345;

Project several fields at once

SELECT
    json_extract_scalar(payload, '$.user.id')        AS user_id,
    json_extract_scalar(payload, '$.user.email')     AS email,
    json_extract_scalar(payload, '$.session.locale') AS locale,
    CAST(json_extract_scalar(payload, '$.amount') AS DECIMAL(18, 2)) AS amount
FROM events;

Path-existence filter

SELECT * FROM events
WHERE json_contains(payload, '$.metadata.experimental_flag') = true;

Build a structured field for downstream consumers

SELECT json_object(
    'id', id,
    'host', url_extract_host(url),
    'path', url_extract_path(url),
    'received_at', cast(occurred_at AS VARCHAR)
) AS event_doc
FROM events;

Performance tips

  • Decode once at the boundary: when a JSON column is queried in many downstream views, consider materialising the relevant subfields into typed columns at ingest time. Repeated parsing is the biggest JSON cost.
  • Path index pruning: the DataFusion JSON layer can push json_get_str(payload, 'user_id') = 'X' through to a manifest min/max statistic when the underlying JSON has been pre-extracted. The Trino layer is opaque.
  • Avoid round-trip: json_format(json_parse(j)) is a no-op semantically but burns CPU. Skip the round-trip unless you need re-canonicalisation.

Iceberg JSON column type

Iceberg V3 has no native JSON primitive yet. SQE accepts JSON in CREATE TABLE and stores it as VARCHAR underneath. Predicates and projections work as text. When Iceberg adds a JSON primitive, the storage layout will change but the SQL surface stays the same.

What is not registered

  • json_array(...) (Trino name for build-an-array). Use cast(make_array(...) AS VARCHAR) followed by json_format, or to_json(make_array(...)).
  • json_size at the root: covered. With a path: covered.
  • JSONPath wildcards ($..foo, $[*]): not supported by DataFusion’s path parser. Use unnest(json_get_array(...)) for array-level wildcarding.