Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Aggregate functions

Functions used in GROUP BY queries and OVER clauses. SQE inherits ~40 aggregates from DataFusion plus 12 Trino UDAFs from sqe-trino-functions.

Standard aggregates

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
count(expr)datafusion-builtinCounts non-NULL rows.countcountcountcount
count(*)datafusion-builtinCounts all rows.countcountcountcount
count(distinct expr)datafusion-builtinDistinct non-NULL count.count(distinct ...)count(distinct ...)count(distinct ...)count(distinct ...)
sum(expr)datafusion-builtinSum. NULL-skipping.sumsumsumsum
sum(distinct expr)datafusion-builtinDistinct sum.sum(distinct ...)sum(distinct ...)sum(distinct ...)sum(distinct ...)
avg(expr) / mean(expr)datafusion-builtinArithmetic mean. NULL-skipping.avgavgavg / meanavg / mean
min(expr)datafusion-builtinMinimum.minminminmin
max(expr)datafusion-builtinMaximum.maxmaxmaxmax
median(expr)datafusion-builtinExact median. Slower than approx_median on big inputs.-medianmedianmedian

Statistical / regression

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
variance(x) / var_samp(x)datafusion-builtinSample variance.variance / var_sampvariance_sampvariance / var_sampvariance / var_samp
var_pop(x)datafusion-builtinPopulation variance.var_popvariance_popvar_popvar_pop
stddev(x) / stddev_samp(x)datafusion-builtinSample stddev.stddev / stddev_sampstddev_sampstddev / stddev_sampstddev
stddev_pop(x)datafusion-builtinPopulation stddev.stddev_popstddev_popstddev_popstddev_pop
corr(y, x)datafusion-builtinPearson correlation.corrcorrcorrcorr
covar_samp(y, x) / covar_pop(y, x)datafusion-builtinSample / population covariance.covar_samp / covar_popcovar_samp / covar_popcovar_samp / covar_popcovar_samp / covar_pop
regr_slope(y, x)datafusion-builtinLinear regression slope.regr_sloperegr_slope-regr_slope
regr_intercept(y, x)datafusion-builtiny-intercept.regr_interceptregr_intercept-regr_intercept
regr_r2(y, x)datafusion-builtinR-squared.regr_r2regr_r2-regr_r2
regr_count, regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgydatafusion-builtinRegression sums and counts.yesyes-yes

Distinct and approximation

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
approx_distinct(expr)datafusion-builtinHyperLogLog distinct count. ~1% error.approx_distinctapprox_count_distinctapprox_count_distinctapprox_count_distinct
approx_median(expr)datafusion-builtinMedian estimate via t-digest.--approx_percentile(0.5)approx_quantile(0.5)
approx_percentile_cont(expr, p)datafusion-builtinPercentile estimate via t-digest. p in [0, 1].approx_percentileapprox_percentileapprox_percentileapprox_quantile
approx_percentile(expr, p)sqe-trino-functionsTrino-named alias of approx_percentile_cont. trino_functions.rs:164approx_percentile---

Boolean and bitwise

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
bool_and(x)datafusion-builtinTrue only if every row is true.bool_andbooland_aggbool_and / everybool_and
bool_or(x)datafusion-builtinTrue if any row is true.bool_orboolor_aggbool_or / anybool_or
every(x)sqe-trino-functionsTrino-named alias of bool_and. trino_functions.rs:170every-every-
bit_and(x)datafusion-builtinBitwise AND of all values.bitwise_and_aggbitand_aggbit_and-
bit_or(x)datafusion-builtinBitwise OR.bitwise_or_aggbitor_aggbit_or-
bit_xor(x)datafusion-builtinBitwise XOR.bitwise_xor_aggbitxor_aggbit_xor-
bitwise_and_agg(x)sqe-trino-functionsTrino name for bit_and. trino_functions.rs:144bitwise_and_agg---
bitwise_or_agg(x)sqe-trino-functionsTrino name for bit_or. trino_functions.rs:150bitwise_or_agg---
bitwise_xor_agg(x)sqe-trino-functionsTrino name for bit_xor. trino_functions.rs:158bitwise_xor_agg---

Positional

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
first_value(expr [order by ...])datafusion-builtinFirst row’s value. Most useful with OVER ordering.first_valuefirst_valuefirst_valuefirst_value
last_value(expr [order by ...])datafusion-builtinLast row’s value.last_valuelast_valuelast_valuelast_value
nth_value(expr, n [order by ...])datafusion-builtinNth row’s value.nth_valuenth_valuenth_valuenth_value
max_by(value, key)sqe-trino-functionsvalue from the row with the max key. trino_functions.rs:177max_bymax_by-arg_max
min_by(value, key)sqe-trino-functionsvalue from the row with the min key. trino_functions.rs:178min_bymin_by-arg_min
arbitrary(expr)sqe-trino-functions (ext)Any one non-NULL value. Trino-named alias of any_value. trino_functions_ext.rs:68arbitraryany_valueany_valueany_value

Collection-building

FunctionOriginNotesTrinoSnowflakeSpark SQLDuckDB
array_agg(expr)datafusion-builtinCollect into an array. NULLs included.array_aggarray_aggcollect_listarray_agg / list
array_agg(distinct expr)datafusion-builtinDistinct array.array_agg(distinct)array_agg(distinct)collect_setlist_distinct
string_agg(expr, sep)datafusion-builtinConcatenate with separator. SQL standard.array_join/listagglistagg-string_agg
listagg(expr, sep)sqe-trino-functionsSame as string_agg; Snowflake / Trino name. trino_functions.rs:138listagglistagg--
histogram(expr)sqe-trino-functionsMap of value -> count. trino_functions.rs:188histogram--histogram
map_agg(key, value)sqe-trino-functionsBuild a map by aggregating key-value pairs. Last write wins. trino_functions.rs:189map_aggobject_aggmap_from_arraysmap
multimap_agg(key, value)sqe-trino-functionsMap where each value is an array (collects duplicates). trino_functions.rs:190multimap_agg---
map_union(map_col)sqe-trino-functionsAggregate already-built maps into one. trino_functions.rs:191map_union---

Modifiers

ModifierNotes
agg_func(expr) FILTER (WHERE pred)Filter rows before aggregation. Cleaner than agg_func(CASE WHEN pred THEN expr END).
agg_func(distinct expr)Distinct values only.
agg_func(expr) OVER (...)Window form. Uses PARTITION BY, ORDER BY, frame clauses. See Window functions.
agg_func(expr) WITHIN GROUP (ORDER BY ...)Ordered aggregate (e.g. listagg).

Example using FILTER:

SELECT
    region,
    count(*) AS total_orders,
    count(*) FILTER (WHERE status = 'cancelled') AS cancelled,
    sum(amount) FILTER (WHERE status = 'shipped') AS shipped_revenue
FROM orders
GROUP BY region;

GROUP BY extensions

ConstructOriginNotesTrinoSnowflakeSpark SQLDuckDB
GROUP BY GROUPING SETS ((a, b), (a), ())datafusion-builtinMultiple grouping levels in one query.yesyesyesyes
GROUP BY CUBE (a, b, c)datafusion-builtinAll 2^N grouping combinations.yesyesyesyes
GROUP BY ROLLUP (a, b, c)datafusion-builtinHierarchical: (), (a), (a, b), (a, b, c).yesyesyesyes
GROUPING(col)datafusion-builtinReturns 1 if col was rolled up in this row, else 0.yesyesyesyes
SELECT
    region,
    product,
    sum(amount) AS revenue,
    GROUPING(region) AS region_rolled_up,
    GROUPING(product) AS product_rolled_up
FROM orders
GROUP BY ROLLUP (region, product)
ORDER BY region, product;

Approximation vs exact: when to choose

  • count(distinct) exact. sub-second on millions of rows; avoid above ~1B distinct values.
  • approx_distinct HyperLogLog. order of magnitude faster on huge inputs. ~1% relative error.
  • median exact. sorts the entire group; expensive on big partitions.
  • approx_median / approx_percentile_cont t-digest. sub-percent error, much cheaper memory profile.

For dashboards over multi-billion-row tables, default to approximations. For audit queries that need exact counts, default to exact.