Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Roadmap

SQE is developed in phases, each building on the previous.

Phase Overview

gantt
    title SQE Implementation Phases
    dateFormat YYYY-MM
    axisFormat %b %Y

    section Core
    Phase 1 - Single Node        :done, p1, 2026-01, 2026-02
    Phase 2 - Write Path         :done, p2, 2026-02, 2026-03
    Phase 3 - Row-Level Writes   :done, p3, 2026-03, 2026-04

    section Scale
    Phase 3b - Benchmarks        :done, p3b, 2026-03, 2026-04
    Phase 4 - Pluggable Auth     :done, p4, 2026-03, 2026-04
    Phase 4b - Streaming/Distributed :done, p4b, 2026-04, 2026-04

    section Shipped
    Phase 2c - dbt Compatibility :done, p2c, 2026-03, 2026-04
    Phase 5 - Pluggable Catalogs :done, p5, 2026-04, 2026-05
    Phase 7 - Iceberg V3         :done, p7, 2026-04, 2026-05
    Phase 9 - OpenLineage        :done, p9, 2026-05, 2026-05

    section Next
    Phase 6 - Security Policies  :p6, 2026-05, 2026-07
    Phase 8 - Trino Decommission :p8, 2026-07, 2026-09

Phase 1 — Single-Node Engine (Done)

The foundation: a working SQL engine that queries Iceberg tables through Polaris with Keycloak auth.

  • DataFusion query execution
  • Keycloak OIDC authentication (ROPC grant)
  • Per-session catalog with bearer token passthrough
  • Arrow Flight SQL server
  • CLI client (sqe-cli)
  • SELECT, SHOW CATALOGS/SCHEMAS/TABLES, EXPLAIN
  • Prometheus metrics + structured JSON logging

Phase 2 — Write Path & Views (Done)

SQL write operations and catalog DDL.

  • CREATE TABLE AS SELECT
  • CREATE OR REPLACE TABLE
  • INSERT INTO SELECT
  • CREATE VIEW / DROP VIEW
  • CREATE SCHEMA / DROP SCHEMA
  • DROP TABLE / DROP TABLE IF EXISTS
  • Parquet writer (to S3 via Iceberg)
  • Audit logging (JSONL)
  • OpenTelemetry export (OTLP/gRPC)
  • Trino-compatible HTTP endpoint

Phase 2c — dbt Compatibility (Done)

Native dbt support via dbt-sqe adapter over ADBC Flight SQL.

  • information_schema virtual providers (tables, schemata, columns)
  • dbt-sqe Python adapter (connection manager, materializations)
  • ALTER TABLE RENAME
  • dbt table, view, and append-only incremental materializations
  • incremental with merge strategy (CoW + MoR)
  • Adapter lives at adapters/dbt-sqe/dbt/

Phase 3 — Row-Level Writes (Done)

DELETE FROM, UPDATE, and MERGE INTO are implemented via Copy-on-Write using the iceberg-rust fork vendored at vendor/iceberg-rust/ (DF 53.1 + Arrow 58 rebase of risingwavelabs/iceberg-rust), which provides rewrite_files() transaction support.

Strategy: Copy-on-Write

graph TB
    subgraph "Copy-on-Write (Implemented)"
        READ["Read affected<br/>data files"] --> FILTER["Apply WHERE filter"]
        FILTER --> REWRITE["Rewrite without<br/>deleted/modified rows"]
        REWRITE --> COMMIT["Commit via<br/>rewrite_files()"]
    end

    subgraph "Merge-on-Read (Future)"
        DELFILE["Write position<br/>delete files"] --> COMMITDEL["Commit via<br/>RowDeltaAction"]
        COMMITDEL --> COMPACT["Background<br/>compaction"]
    end

    style READ fill:#6f9
    style FILTER fill:#6f9
    style REWRITE fill:#6f9
    style COMMIT fill:#6f9

CoW rewrites affected data files entirely. MoR has shipped: set TBLPROPERTIES ('write.delete.mode' = 'merge-on-read') to opt in. SQE writes a position-delete file (no PK) or an equality-delete file (with PK) and commits via FastAppendAction / RowDeltaAction. CoW remains the default for backward compatibility.

Delivered

  • DELETE FROM table WHERE condition — removes matching rows; supports cross-table subqueries; DELETE without WHERE = truncate
  • UPDATE table SET col = expr WHERE condition — modifies matching rows; supports CASE WHEN transformations and cross-table subqueries
  • MERGE INTO target USING source ON condition WHEN MATCHED/NOT MATCHED ... — full outer join approach with WHEN MATCHED/NOT MATCHED clauses
  • All operations atomic via Iceberg snapshot isolation
  • dbt incremental with merge strategy
  • Integration tests against Polaris + MinIO
  • TPC-C write queries (17/17 pass), TPC-E write queries enabled

Iceberg Dependency

Uses the iceberg-rust fork vendored at vendor/iceberg-rust/ for rewrite_files(). When upstream apache/iceberg-rust ships OverwriteAction (tracked in Epic #2186), the dependency can be migrated back to the official crate.

SQE Changes

FileChange
Cargo.tomlVendored iceberg-rust fork (DF 53 + Arrow 58 rebase) at vendor/iceberg-rust/
crates/sqe-coordinator/src/delete_handler.rsDELETE FROM execution via CoW
crates/sqe-coordinator/src/update_handler.rsUPDATE execution via CoW
crates/sqe-coordinator/src/merge_handler.rsMERGE INTO execution via CoW
crates/sqe-coordinator/src/query_handler.rsRoutes Merge/Delete/Update to handlers
crates/sqe-coordinator/src/write_handler.rsShared CoW rewrite logic

Phase 7 — Iceberg V3 (Done)

Iceberg V3 table format support landed end-to-end. The vendored fork at vendor/iceberg-rust/ is rebased onto DataFusion 53.1 + Arrow 58 and carries V3 spec coverage. SQE Iceberg matrix score: 167/189 = 88.4% (per docs/iceberg-matrix.md).

V3 Features Shipped

FeatureStatus
Default values (ALTER TABLE ADD COLUMN ... DEFAULT)Done
Schema evolution (ALTER TABLE ADD/DROP COLUMN)Done
Nanosecond timestamps (TIMESTAMP_NS, TIMESTAMPTZ_NS)Done
Partition evolutionDone
Equality deletes + position deletes (MoR)Done

V3 Features Still Blocked Upstream

FeatureBlocker
Variant typeiceberg-rust #2188 not merged
Geometry typeDataFusion UDT #12644
Vector / Embedding typeIceberg V3 vector spec not finalised
Multi-arg partition transformsIceberg Java spec alignment in progress
Row lineageDeferred upstream

Other Hardening

  • Metadata cache invalidation on DDL
  • Large result-set streaming (Flight SQL do_get back-pressure)
  • Error messages tuned for catalog and auth failures
  • Partition pruning across all predicate types

Phase 4b/4c — Distributed Execution (Done)

Scale-out query execution with stateless workers. Implemented via streaming execution in two phases.

graph TB
    subgraph Coordinator
        PLAN["Query Planner"] --> STAGE["Stage Decomposition"]
        STAGE --> SCHED["Scheduler"]
    end

    subgraph Workers
        SCHED -->|ScanTask| W1["Worker 1"]
        SCHED -->|ScanTask| W2["Worker 2"]
        SCHED -->|ScanTask| W3["Worker 3"]
    end

    W1 <-->|DoExchange shuffle| W2
    W2 <-->|DoExchange shuffle| W3
    W1 -->|Arrow stream| MERGE["Multi-endpoint merge"]
    W2 -->|Arrow stream| MERGE
    W3 -->|Arrow stream| MERGE
    MERGE --> CLIENT["Client"]

Delivered

  • Phase A (spill-to-disk): FairSpillPool with watermarks, late materialization, file/page pruning, TopK, S3 I/O pipeline (coalescing, footer cache, prefetch), SortMergeJoin fallback
  • Phase B (distributed): DoExchange shuffle, distributed sort (range-partition with sampling), two-phase aggregation, distributed joins (broadcast, shuffle hash, pre-sorted merge, predicate transfer), multi-endpoint Flight SQL, stage decomposition
  • Adaptive sort stripping — memory-aware sort mode selection
  • Metrics — spill, shuffle, late-mat, pruning, time-to-first-row, S3 I/O, auth, write path

Benchmark Results (SF1, distributed 2-worker)

SuitePass RateTimeSpeedup vs single
TPC-H22/2213.5s2.1x
TPC-DS98/9936.1s2.8x
SSB13/135.3s2.7x
TPC-C17/178.6s2.6x

Phase 5 — Pluggable Catalogs (Done)

CatalogBackend trait replaced the hard-coded Polaris REST catalog. Seven catalog backends ship today:

BackendStatus
Iceberg REST (Polaris, Lakeformation)Done — default
AWS GlueDone
NessieDone
Hive Metastore (Thrift)Done
AWS S3 TablesDone
Snowflake Horizon (REST-compatible)Done
JDBCDone

Multi-cloud storage via object_store: S3 (+ endpoint override for R2, Ceph, Garage, MinIO), Azure ADLS Gen2/Blob, GCS, local filesystem. Engine-side session-manager wiring for Delta + remaining edge cases is the only deferred item; tracked in nextsteps.md.


Phase 6 — Security Policies (Planned)

Fine-grained access control via LogicalPlan rewriting.

  • PolicyEnforcer implementations (OPA via Rego, Cedar)
  • GRANT/REVOKE with ROWS WHERE and MASKED WITH
  • SHOW GRANTS / SHOW EFFECTIVE POLICY
  • Column restriction (invisible columns)
  • Policy caching with TTL (moka)
  • No-information-leakage model (PostgreSQL RLS style)

Phase 9 — OpenLineage Emitter (Done)

Coordinator-side OpenLineage 2-0-2 emitter with column-level lineage. Off by default; zero hot-path overhead when disabled.

  • sqe-lineage crate: event types, observer, emitter task, file/HTTP/spool sinks, multi-catalog dataset extractor, column-lineage trace rules across 11 LogicalPlan node types
  • OpenLineageConfig in sqe-core with TOML + env-var overrides + startup validation
  • Hooks in QueryHandler::execute_statement (START / COMPLETE / FAIL)
  • Documented at docs/book/src/operations/openlineage.md

Deferrals (documented): mTLS, per-event user-OIDC bearer forwarding, MERGE per-branch column annotations, embedded CLI emit, DDL hint extraction.


Phase 10 — Performance & Reliability Testing (Planned)

Validate SQE is production-ready through systematic benchmarking and reliability testing.

Performance Benchmarks

graph LR
    subgraph "Benchmark Suite"
        TPCH["TPC-H<br/>22 queries<br/>SF 10/100/1000"]
        TPCDS["TPC-DS<br/>99 queries<br/>SF 10/100"]
        CUSTOM["Custom workloads<br/>Iceberg-specific<br/>partition pruning"]
    end

    subgraph "Comparison Targets"
        TRINO["Trino<br/>(current prod)"]
        SQE["SQE"]
        SPARK["Spark SQL<br/>(reference)"]
    end

    TPCH --> TRINO
    TPCH --> SQE
    TPCH --> SPARK
    TPCDS --> TRINO
    TPCDS --> SQE
BenchmarkScale FactorsPurpose
TPC-HSF10, SF100, SF1000Standard analytical workload, join-heavy
TPC-DSSF10, SF100Complex analytics, subqueries, window functions
Iceberg-specificVariesPartition pruning, metadata operations, time travel
Write path1M, 10M, 100M rowsCTAS, INSERT, MERGE throughput
Concurrent users10, 50, 100 sessionsConnection handling, session isolation

Key Metrics

  • Query latency — P50, P95, P99 per query
  • Throughput — queries/second under load
  • Memory usage — peak RSS per query complexity
  • Startup time — cold start to first query
  • Scan speed — GB/s from S3 (single-node vs distributed)

Performance Targets

MetricTargetRationale
TPC-H SF100 geometric meanWithin 2x of TrinoParity goal for migration
Cold start to ready< 2 secondsK8s autoscaling responsiveness
Peak memory (SF100 query)< 4GB coordinatorFit in standard K8s pod limits
Concurrent session overhead< 10MB per sessionSupport 100+ sessions

Reliability Testing

TestMethodWhat it validates
Chaos: kill worker mid-querykubectl delete pod during scanCoordinator retries/fails gracefully
Chaos: kill coordinatorSIGKILL during queryIn-flight queries fail cleanly, no data corruption
Chaos: Polaris unavailableBlock network to PolarisGraceful error, no hang, cached metadata still works
Chaos: Keycloak unavailableBlock network to KeycloakExisting sessions continue, new auth fails cleanly
Chaos: S3 latency spiketc netem delay on S3Query timeout, not hang
Memory pressureLarge query + small memory limitSpill-to-disk or clean OOM, no silent corruption
Token expiry during querySet very short token TTLRefresh mid-query, or clean auth error
Concurrent DDL + DMLCTAS while DROP TABLE on same tableIceberg conflict detection, clean error
Long-running soak test24h mixed workloadNo memory leaks, no connection leaks, stable latency

Profiling & Optimization

  • CPU profilingperf + flamegraphs on hot queries
  • Memory profilingjemalloc stats, allocation tracking
  • I/O profiling — S3 request counts, Parquet read amplification
  • Query plan analysis — DataFusion EXPLAIN ANALYZE for bottleneck identification

Deliverables

  • Automated benchmark harness (run TPC-H/DS, collect results, compare)
  • Performance regression CI (catch slowdowns before merge)
  • Published benchmark results: SQE vs Trino on identical data
  • Reliability test playbook with pass/fail criteria
  • Memory/CPU profiling report with optimization recommendations
  • Soak test (24h) passing without degradation

Phase 8 — Trino Decommission (Future)

Complete migration from Trino DCAF fork.

  • Full Trino wire protocol compatibility for remaining tools
  • Dashboard migration playbook (Superset, Grafana, etc.)
  • JDBC driver migration guide (Trino JDBC → Flight SQL JDBC)
  • Performance parity validation (benchmark comparison)
  • Runbook for operators
  • Trino fork sunset and decommission