Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Encoding, hashing, URL

Three small, related families: binary encoding (base64 / hex), cryptographic hashes, URL parsing. SQE inherits the encoding and crypto helpers from DataFusion and adds Trino-named aliases plus eight URL extractors.

Binary encoding

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
encode(input, encoding)datafusion-builtinEncode binary to text. encoding: 'base64' or 'hex'.yesyesyesyes
decode(input, encoding)datafusion-builtinDecode text to binary. Same encoding set.yesyesyesyes
from_base64(s)sqe-trino-functionsTrino-named base64 decode. trino_functions.rs:78yesyes (base64_decode_string)yes (unbase64)yes
to_base64(b)sqe-trino-functionsTrino-named base64 encode. trino_functions.rs:79yesyes (base64_encode)yes (base64)yes
from_hex(s)sqe-trino-functionsTrino-named hex decode. trino_functions.rs:80yesyes (hex_decode_string)yes (unhex)yes
to_hex(n)datafusion-builtinEncode integer or binary as hex. NOT registered as Trino UDF: DataFusion already has it.yesyesyesyes
SELECT to_base64(CAST('hello' AS bytea));         -- 'aGVsbG8='
SELECT from_base64('aGVsbG8=');                   -- bytea -> 'hello'
SELECT encode(CAST('hi' AS bytea), 'hex');        -- '6869'
SELECT decode('6869', 'hex');                     -- bytea -> 'hi'

Cryptographic hashes

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
md5(s)datafusion-builtin128-bit hex string.yesyesyesyes
sha224(s)datafusion-builtinSHA-224 hex.yes-yes-
sha256(s)datafusion-builtin + sqe-policySHA-256 hex. SQE registers an additional UDF used by column masks.yesyesyesyes
sha384(s)datafusion-builtinSHA-384 hex.yes-yes-
sha512(s)datafusion-builtinSHA-512 hex.yesyesyesyes
digest(s, algo)datafusion-builtinGeneric. Algos: md5, sha224, sha256, sha384, sha512, blake2s, blake2b, blake3.----
checksum(b)sqe-trino-functions (ext)xxHash of bytes; cheaper than crypto hashes. trino_functions_ext.rs:69yes---
SELECT sha256(email) AS email_hash FROM users;
SELECT digest('hello', 'blake3');       -- 64-char blake3 hex
SELECT to_hex(checksum(payload)) FROM events;

URL parsing

All eight URL functions live in sqe-trino-functions. They wrap the url crate from crates.io for correct RFC 3986 handling.

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
url_extract_host(url)sqe-trino-functionsHostname only, no port. trino_functions.rs:67yesparse_url(..., 'HOST')parse_url(..., 'HOST')-
url_extract_path(url)sqe-trino-functionsPath component (after host, before query). trino_functions.rs:68yesparse_url(..., 'PATH')parse_url(..., 'PATH')-
url_extract_port(url)sqe-trino-functionsPort as INT. NULL when absent. trino_functions.rs:69yes---
url_extract_protocol(url)sqe-trino-functionsScheme (https, http, s3, etc). trino_functions.rs:70yesparse_url(..., 'PROTOCOL')parse_url(..., 'PROTOCOL')-
url_extract_query(url)sqe-trino-functionsQuery string (after ?, no leading ?). trino_functions.rs:71yesparse_url(..., 'QUERY')parse_url(..., 'QUERY')-
url_extract_parameter(url, name)sqe-trino-functionsFirst value of named parameter. trino_functions.rs:72yesparse_url(..., 'QUERY:name')parse_url(..., 'QUERY', 'name')-
url_encode(s)sqe-trino-functionsPercent-encode. trino_functions.rs:73yesyesyes-
url_decode(s)sqe-trino-functionsPercent-decode. trino_functions.rs:74yesyesyes-
SELECT
    url,
    url_extract_protocol(url) AS proto,
    url_extract_host(url)     AS host,
    url_extract_port(url)     AS port,
    url_extract_path(url)     AS path,
    url_extract_query(url)    AS query,
    url_extract_parameter(url, 'utm_source') AS utm
FROM events;

For example, on https://example.com:8443/api?utm_source=newsletter&page=2:

ComponentValue
url_extract_protocolhttps
url_extract_hostexample.com
url_extract_port8443
url_extract_path/api
url_extract_queryutm_source=newsletter&page=2
url_extract_parameter(..., 'utm_source')newsletter

When to use which hash

Use caseRecommendation
Equality lookups, deduplicationxxhash via checksum(). ~10x faster than crypto hashes.
Cryptographic integrity, signaturessha256 or sha512. Avoid md5 for new uses.
Column masking via grantssha256 (used in MASKED WITH clauses). Deterministic, no collisions in practice.
Hashing UTF-8 stringsAll accept VARCHAR or BINARY arguments. The result type is VARCHAR (hex) for crypto hashes, BINARY for digest().
SaltingConcatenate the salt: sha256(concat(salt, password)). SQE has no pbkdf2-style helpers.

When to use which URL parser

url_extract_host, url_extract_query, url_extract_parameter use a real RFC 3986 parser, so they handle internationalised domain names (IDN), unusual ports, percent-encoded query values, and trailing fragments correctly. The regex equivalent (regexp_extract(url, 'https?://([^/]+)', 1)) breaks on these edge cases.

Use the regex form only when the input is known to be a fixed shape that the regex covers.

Why no Snowflake-style PARSE_URL

Snowflake’s parse_url(url [, permissive]) returns an OBJECT with named keys. The eight separate url_extract_* functions are equivalent in expressive power and easier to compile statically. dbt-snowflake to dbt-sqe migrations need the rewrite, but it is mechanical: parse_url(u, 'HOST') becomes url_extract_host(u), etc.

Why DataFusion’s decode is not shadowed

A Snowflake-style DECODE(expr, search1, result1, ...) would conflict with DataFusion’s binary decode(input, encoding). SQE keeps the binary one and exposes the conditional logic via CASE WHEN. See Conditional for the trade-off.