Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Using the CLI

sqe-cli is the SQL client. By default it connects to a remote coordinator over Arrow Flight SQL or Trino HTTP. Pass --embedded to skip the network entirely and run an in-process engine. That mode is useful for ad-hoc analysis on local Parquet, CSV, or JSON files without standing up a cluster.

Usage

sqe-cli [OPTIONS]

Options:
  -H, --host <HOST>          Coordinator host [default: localhost]
  -p, --port <PORT>          Coordinator port [default: 50051]
      --protocol <PROTOCOL>  Wire protocol: flight or http [default: flight]
  -u, --user <USER>          Username (prompts if not set)
      --token <TOKEN>        Bearer token (skips password flow)
  -e, --execute <SQL>        Execute a single query and exit
      --file <PATH>          Read statements from a SQL script file
      --stop-on-error        Abort the script on first error (default: continue)
      --embedded             Run the engine in-process (no remote coordinator)
      --memory-limit <SIZE>  Per-process memory pool when --embedded [default: 1GB]
      --warehouse <PATH>     Single catalog at PATH named `iceberg`
                             (shorthand for --catalog iceberg=PATH)
      --catalog NAME=PATH    Attach a named persistent catalog (repeatable)
      --memory               Skip persistent catalogs entirely
  -f, --format <FORMAT>      Output format: table, csv, tsv, json [default: table]
      --tls                  Use HTTPS/TLS
      --insecure             Accept invalid TLS certificates
  -h, --help                 Print help
  -V, --version              Print version

Embedded mode

--embedded boots a single-process SessionContext with the same DataFusion tuning the cluster coordinator uses (parse_float_as_decimal, 64MB hash-join broadcast threshold, dynamic filter pushdown, Parquet filter pushdown). It registers all the same scalar functions, Trino-dialect aliases, JSON helpers, and the read_parquet(...) table-valued function. No auth, no Polaris, no network listeners.

# One-shot query against a local Parquet file
sqe-cli --embedded -e "SELECT COUNT(*) FROM read_parquet('data.parquet')"

# Trino-dialect functions work out of the box
sqe-cli --embedded -e "SELECT year(DATE '2026-05-07')"

# Run a script of statements
sqe-cli --embedded --file setup.sql

# Combine: script first, then ad-hoc query
sqe-cli --embedded --file setup.sql -e "SELECT COUNT(*) FROM staging"

# Interactive REPL (the default if no -e or --file is given)
sqe-cli --embedded

S3 access works too. Pass credentials inline to read_parquet:

SELECT *
FROM read_parquet(
    's3://bucket/path/*.parquet',
    access_key  => 'AKIA...',
    secret_key  => '...',
    region      => 'eu-example-1'
);

File format TVFs

Alongside read_parquet(), the embedded engine ships read_csv() and read_json() for direct file access. They share the same calling convention (positional path, named keyword args) and the same S3 credential bag.

-- Local CSV (auto-detect schema, headers on by default)
SELECT count(*) FROM read_csv('/data/sales.csv');

-- Tab- or semicolon-separated, no header
SELECT * FROM read_csv('/data/raw.tsv',
    delimiter   => '\t',
    has_header  => 'false');

-- NDJSON
SELECT * FROM read_json('/data/events.jsonl');

-- S3-hosted CSV with inline credentials
SELECT * FROM read_csv('s3://bucket/sales/*.csv',
    access_key => 'AKIA...',
    secret_key => '...',
    endpoint   => 'http://minio:9000',
    region     => 'us-east-1');

CSV-specific named args: delimiter, has_header, quote, escape, comment, null_regex, file_extension. JSON-specific: newline_delimited, file_extension.

Auto-detect: SELECT * FROM 'file.ext'

DuckDB-style sugar for “I just want to query this file.” The engine looks at the file extension and picks the right reader:

SELECT * FROM '/data/sales.parquet';
SELECT * FROM '/data/events.jsonl';
SELECT * FROM '/data/log.csv';

Works with globs and S3 URLs too. For S3, you still need credentials configured somewhere (default in [storage], or use read_csv()/read_parquet() and pass them inline).

HTTP / HTTPS URLs

Every file-format TVF and the SELECT * FROM 'file.ext' auto-detect accept HTTP and HTTPS URLs out of the box:

-- Public CSV from any HTTP(S) host
SELECT count(*) FROM read_csv(
  'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
);

-- Auto-detect on a quoted URL
SELECT count(*) FROM
  'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv';

-- Parquet over HTTP (range-request reads, no full download)
SELECT count(*) FROM read_parquet('https://example.com/dataset.parquet');

The engine builds an HTTP object store on first request per scheme://host[:port] and caches it for the rest of the session. No configuration needed.

HuggingFace Hub: hf://

hf:// URLs resolve to public HuggingFace Hub download URLs:

-- hf://datasets/<owner>/<name>/<path>
SELECT * FROM read_csv(
  'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv'
);

-- Pin a revision via ?revision=
SELECT * FROM read_parquet(
  'hf://datasets/squad/plain_text/train.parquet?revision=v1.0.0'
);

-- Models and Spaces work the same way
SELECT * FROM read_json('hf://models/<owner>/<name>/config.json');

The resolver expands hf://datasets/<owner>/<name>/<path> to https://huggingface.co/datasets/<owner>/<name>/resolve/<rev>/<path> and routes through the same HTTP object store as raw HTTPS URLs. Default revision is main.

Public datasets work without any auth. Private datasets are not yet supported (HF token plumbing is on the roadmap).

COPY ... TO 'file'

Export query results to disk. Format is auto-detected from the extension.

COPY (SELECT * FROM iceberg.sales.orders WHERE year = 2026)
  TO '/exports/orders-2026.parquet';

COPY (SELECT customer_id, total FROM iceberg.sales.orders)
  TO '/exports/orders.csv';

-- Force a format / pass options
COPY orders TO '/exports/orders.json'
  (FORMAT 'json');

Persistent catalog

By default, --embedded attaches a SQLite-backed Iceberg catalog at ~/.sqe/warehouse/. Tables created here survive across sessions; SQL DDL (CREATE SCHEMA, CREATE TABLE, DROP TABLE, DROP SCHEMA) routes through the iceberg catalog without any out-of-band setup:

# Session 1: declare a schema and a table via plain SQL
sqe-cli --embedded -e "CREATE SCHEMA iceberg.staging"
sqe-cli --embedded -e \
    "CREATE TABLE iceberg.staging.events (event_id BIGINT, ts TIMESTAMP, kind VARCHAR)"

# Session 2: same warehouse, table is still there
sqe-cli --embedded -e "SELECT count(*) FROM iceberg.staging.events"

CREATE TABLE ... AS SELECT ... (CTAS) is a known limitation: the embedded mode does not yet have a Parquet writer + iceberg-transaction commit pipeline, so the upstream provider rejects table providers that carry data. Use a separate CREATE TABLE (schema only) followed by INSERT INTO ... SELECT ... once the embedded INSERT path lands, or load via the cluster path for now.

The on-disk layout:

~/.sqe/warehouse/
├── sqe.db              # SQLite catalog (namespaces, table pointers)
└── iceberg/            # Iceberg metadata + Parquet data files
    └── staging/
        └── events/
            ├── metadata/
            └── data/

The catalog name is iceberg. Three-part identifiers (iceberg.staging.events) work; unqualified names resolve against DataFusion’s default in-memory catalog, so SELECT * FROM read_parquet(...) still works without any catalog interaction.

Override the path:

sqe-cli --embedded --warehouse /data/my-warehouse -e "..."

Skip the catalog entirely (ephemeral session, nothing written to disk):

sqe-cli --embedded --memory -e "SELECT 1"

Tables in the warehouse are valid Iceberg. If you later upgrade to a cluster deployment, point the cluster catalog at the same path and the tables come along. No migration, no re-export.

Multiple catalogs

Attach more than one warehouse with repeated --catalog NAME=PATH flags. Each becomes a top-level SQL identifier; cross-catalog joins work without any session-state setup.

sqe-cli --embedded \
    --catalog prod=/data/prod \
    --catalog stage=/data/stage \
    -e "SELECT *
        FROM prod.sales.orders p
        LEFT JOIN stage.sales.orders s ON p.id = s.id
        WHERE s.id IS NULL"

The catalog name shows up in information_schema.tables.table_catalog, in .catalogs, and in 3-part SQL identifiers. Names cannot contain . (it would clash with the SQL namespace separator) and cannot repeat (DataFusion’s register_catalog would silently overwrite).

--warehouse <path> remains as a shorthand for --catalog iceberg=<path>. The three flags --memory, --warehouse, and --catalog are mutually exclusive. Pick one.

Dot-commands

The REPL recognises sqlite/DuckDB-style commands that start with .. They run client-side, never reach the engine, and don’t end with ;:

sqe> .help
Dot commands:
  .help                show this list
  .exit, .quit         leave the REPL
  .tables [schema]     list tables (optionally filter by schema)
  .schema <table>      describe a table's columns
  .describe <table>    alias for .schema
  .summarize <table>   per-column count, distinct, null, min, max
  .catalogs            list catalogs visible to the session
  .read <path>         execute a SQL script file
  .timer on|off        toggle per-query elapsed-time output
  .format [fmt]        show or set output format (table|csv|tsv|json)

Examples:

sqe> .timer on
Timer: on

sqe> SELECT count(*) FROM read_parquet('events.parquet');
+----------+
| count(*) |
+----------+
| 1500000  |
+----------+
Time: 0.243s

sqe> .tables
sqe> .schema iceberg.staging.events
sqe> .summarize iceberg.staging.events
sqe> .read setup.sql
sqe> .format json

.summarize runs a per-column UNION ALL of count, null_count, distinct_count, min, and max. It is a two-step flow: the REPL fetches the column list from information_schema.columns, then generates and executes the aggregate query. Min/max are cast to VARCHAR so columns of mixed types render in one table.

SQL surface

In addition to the standard dialect, embedded mode (and the cluster coordinator) supports DuckDB-style projection sugar that DataFusion 53.1 ships natively:

-- Drop columns from the projection.
SELECT * EXCLUDE (secret, internal_id) FROM users;

-- Substitute a column with an expression while keeping order.
SELECT * REPLACE (UPPER(name) AS name, total / 100 AS total)
FROM orders;

-- Native column-level metadata.
DESCRIBE iceberg.staging.events;

The legacy \format and \q forms still work for backward compatibility.

What embedded mode does not include

  • Authentication, RBAC, or column masking. Embedded mode runs as the local user. Use the cluster path when you need policy enforcement.
  • Distributed execution. Embedded mode is single-process by design.
  • Concurrent writers. The SQLite catalog is single-process; running two sqe-cli --embedded instances against the same warehouse simultaneously will likely produce errors. The cluster path handles concurrent writes correctly.

Script files

--file reads a SQL script and executes statements in order, separated by ;. The splitter respects single-quoted strings, double-quoted identifiers, line comments (--), and block comments (/* ... */), so semicolons inside those don’t accidentally split a statement.

By default, errors print to stderr and execution continues. Pass --stop-on-error to abort on the first failure. That is the right setting for CI scripts where any failure means the schema setup is broken.

sqe-cli --embedded --file setup.sql

Interactive Mode

sqe-cli --host sqe-coordinator --port 50051 --user alice
Password: ****
sqe-cli 0.1.0 connected to http://sqe-coordinator:50051 (flight)
Type SQL queries, or \q to quit. End multi-line queries with ;

sqe> SELECT * FROM raw.orders LIMIT 3;
 order_id | customer_id | amount | region
----------+-------------+--------+--------
 1        | 100         | 250.00 | EU
 2        | 101         | 150.00 | US
 3        | 100         | 300.00 | EU
(3 rows)

sqe> \q

Multi-line Queries

Queries are executed when you type ;:

sqe> SELECT
  ->   region,
  ->   COUNT(*) AS orders,
  ->   SUM(amount) AS total
  -> FROM raw.orders
  -> GROUP BY region
  -> ORDER BY total DESC;

Commands

CommandAction
\qQuit
quitQuit
exitQuit
Ctrl+CCancel current input / quit
Ctrl+DQuit (EOF)

History is saved to ~/.sqe_history.

Single Query Mode

Execute one query and exit. Useful for scripts:

sqe-cli -H localhost -p 50051 -u alice -e "SELECT COUNT(*) FROM raw.orders;"

Output Formats

Table (default)

sqe-cli -e "SELECT 1 AS a, 'hello' AS b;" --format table
 a | b
---+-------
 1 | hello
(1 rows)

CSV

sqe-cli -e "SELECT 1 AS a, 'hello' AS b;" --format csv
a,b
1,hello

JSON (newline-delimited)

sqe-cli -e "SELECT 1 AS a, 'hello' AS b;" --format json
{"a":"1","b":"hello"}

Authentication

Username/Password

# Interactive prompt
sqe-cli --user alice

# Environment variables (no prompts)
export SQE_USER=alice
export SQE_PASSWORD=secret
sqe-cli -e "SHOW SCHEMAS;"

Bearer Token

Skip the password flow entirely with a pre-obtained token:

sqe-cli --token eyJhbGciOiJSUzI1NiIs... -e "SELECT 1;"

Connecting in Kubernetes

# Port-forward to the coordinator
kubectl port-forward svc/sqe-coordinator 50051:50051

# Then connect locally
sqe-cli --host localhost --port 50051

# Or exec directly into the pod
kubectl exec -it deploy/sqe-coordinator -- sqe-cli

Using with Trino Protocol

For compatibility with tools that speak Trino HTTP:

sqe-cli --protocol http --host localhost --port 8080 --user alice

This uses the Trino-compatible /v1/statement endpoint instead of Flight SQL.