A PostgreSQL extension that receives Substrait plans over Arrow Flight SQL, converts them to PostgreSQL Query trees, executes them via standard_planner, and returns Arrow record batches.
Flight SQL gRPC → Substrait protobuf → PG Query* → standard_planner → executor → Arrow batches
| Relation | Notes |
|---|---|
| ReadRel | Named tables, virtual tables (VALUES) |
| FilterRel | WHERE clauses |
| ProjectRel | Projections, window functions |
| AggregateRel | GROUP BY, ROLLUP/CUBE (via SetRel), DISTINCT aggregates |
| SortRel | ASC/DESC, NULLS FIRST/LAST |
| FetchRel | LIMIT/OFFSET |
| JoinRel | INNER, LEFT, RIGHT, FULL |
| CrossRel | Flattened to multi-table FROM |
| SetRel | UNION/INTERSECT/EXCEPT, ALL/DISTINCT |
- Window functions: rank, dense_rank, row_number, lag, lead, first_value, last_value, ntile
- Scalar and correlated subqueries, EXISTS, IN
- CTE deduplication (auto-detected duplicate subtrees)
- CASE/WHEN, COALESCE, CAST
- Aggregates: COUNT, SUM, AVG, MIN, MAX, STDDEV, VARIANCE, BOOL_AND/OR
Dependencies: Meson >= 1.1.0, C++17, Arrow Flight SQL C++, protobuf, PostgreSQL >= 15.
```sh
meson setup builddir
ninja -C builddir
sudo ninja -C builddir install
```

Installs arrow_flight_sql.so into the PostgreSQL pkglibdir.
Add to postgresql.conf and restart PostgreSQL:

```
shared_preload_libraries = 'arrow_flight_sql'
```
The adapter starts a gRPC server on grpc://127.0.0.1:15432. Authentication uses pg_hba.conf: set the auth method for the Flight SQL client's address to password or trust (SCRAM/MD5 are not yet supported).
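A minimal pg_hba.conf entry for local Flight SQL clients might look like this (the address and all/all scope are illustrative; adjust to your deployment):

```
# TYPE  DATABASE  USER  ADDRESS       METHOD
host    all       all   127.0.0.1/32  password
```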
```python
import pyarrow.flight as flight

client = flight.FlightClient("grpc://127.0.0.1:15432")
token = client.authenticate_basic_token("postgres", "")
opts = flight.FlightCallOptions(headers=[token])

# protobuf wire helpers (tag + length-delimited field)
def varint(v):
    r = []
    while v > 0x7F:
        r.append((v & 0x7F) | 0x80)
        v >>= 7
    r.append(v & 0x7F)
    return bytes(r)

def pb(n, data):
    return varint((n << 3) | 2) + varint(len(data)) + data

sql = "SELECT 1 AS n, 'hello' AS greeting"
cmd = pb(1, b"type.googleapis.com/arrow.flight.protocol.sql"
            b".CommandStatementQuery") \
    + pb(2, pb(1, sql.encode()))
info = client.get_flight_info(flight.FlightDescriptor.for_command(cmd), opts)
reader = client.do_get(info.endpoints[0].ticket, opts)
print(reader.read_all().to_pandas())
#    n greeting
# 0  1    hello
```

```python
# read a pre-generated Substrait plan (binary protobuf)
with open("substrait_test/tpch/plans/01.proto", "rb") as f:
    plan_bytes = f.read()

# wrap in Flight SQL CommandStatementSubstraitPlan
cmd = pb(1, b"type.googleapis.com/arrow.flight.protocol.sql"
            b".CommandStatementSubstraitPlan") \
    + pb(2, pb(1, pb(1, plan_bytes)))
info = client.get_flight_info(flight.FlightDescriptor.for_command(cmd), opts)
reader = client.do_get(info.endpoints[0].ticket, opts)
table = reader.read_all()
print(f"{table.num_rows} rows, {table.num_columns} cols")
print(table.to_pandas().head())
```

Plans are pre-generated Substrait protobuf via isthmus-cli. See substrait_test/{tpch,tpcds}/plans/ for all TPC-H and TPC-DS plans.
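The hand-rolled varint/pb helpers above implement protobuf's base-128 varints and length-delimited (wire type 2) fields; a quick stand-alone sanity check of the encoding (pure stdlib, no server needed):

```python
# protobuf wire helpers, as in the client examples above
def varint(v):
    r = []
    while v > 0x7F:
        r.append((v & 0x7F) | 0x80)
        v >>= 7
    r.append(v & 0x7F)
    return bytes(r)

def pb(n, data):
    # tag = (field_number << 3) | wire_type 2 (length-delimited)
    return varint((n << 3) | 2) + varint(len(data)) + data

assert varint(1) == b"\x01"
assert varint(300) == b"\xac\x02"       # classic protobuf varint example
assert pb(1, b"abc") == b"\x0a\x03abc"  # tag for field 1, length 3, payload
print("wire encoding ok")
```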
Add to postgresql.conf:

```
shared_preload_libraries = 'arrow_flight_sql'
```
| GUC | Default | Description |
|---|---|---|
| arrow_flight_sql.uri | grpc://127.0.0.1:15432 | Flight SQL endpoint URI |
| arrow_flight_sql.session_timeout | 300 (seconds) | Max session duration (-1 = no timeout) |
| arrow_flight_sql.max_sessions | 256 | Max concurrent sessions (-1 = unlimited) |
| arrow_flight_sql.ring_buffer_size | 8MB | Shared ring buffer per session (1MB-1GB) |
| arrow_flight_sql.max_n_rows_per_record_batch | 1048576 | Max rows per Arrow record batch |
| arrow_flight_sql.ipc_compression | false | ZSTD compression for Arrow IPC batches |
| arrow_flight_sql.explain | 0 | EXPLAIN mode: 0=off, 1=plan only, 2=analyze |
| arrow_flight_sql.explain_dir | /tmp/afs_explain | Directory for EXPLAIN JSON output |
TPC-H (22 queries, 22 pass) and TPC-DS (99 queries, 98 pass).
Results are validated against ground truth from psql CSV output, compared row by row with decimal rounding and NULL normalization.
```sh
python3 substrait_test/test_substrait.py --benchmark tpch --sf 1
python3 substrait_test/test_substrait.py --benchmark tpcds --sf 1
```

Scale factors: 1, 5, 10, 15, 20. Data lives in substrait_test/{benchmark}/data/. Plans are pre-generated Substrait protobuf via isthmus-cli in substrait_test/{benchmark}/plans/.
Generate JSON explain plans for all three execution paths (pgsql, arrow, substrait) and visualize side-by-side:
```sh
python3 substrait_test/test_substrait.py --explain 7 --benchmark tpch --sf 1
python3 substrait_test/plot_explain.py tpch
# open substrait_test/explain/tpch.html
```

--explain bitmask: 4=pgsql, 2=arrow, 1=substrait (7=all). Sets --run automatically when --run is not given. Plans are written as individual JSON files per query per method in substrait_test/explain/{benchmark}_sf{sf}/. The HTML viewer shows query tabs, SF sub-tabs, and parallel worker badges (G=gather, W=worker).
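The bitmask composition can be sketched in a few lines (method names taken from the mapping above; the helper is illustrative, not part of the test harness):

```python
# --explain bitmask: 4=pgsql, 2=arrow, 1=substrait
EXPLAIN_BITS = {4: "pgsql", 2: "arrow", 1: "substrait"}

def explain_methods(mask):
    """Return the execution paths selected by an --explain value."""
    return [name for bit, name in EXPLAIN_BITS.items() if mask & bit]

print(explain_methods(7))  # ['pgsql', 'arrow', 'substrait']
print(explain_methods(5))  # ['pgsql', 'substrait']
```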
```sh
scripts/gen_all_perf.sh   # run all SFs, all methods
scripts/gen_perf.sh       # subset of queries
python3 substrait_test/plot_tpch_performance.py
```

AFS_DEBUG: auto-enabled in debug builds. Logs substrait_to_query, standard_planner, and execution timings via elog(LOG).
Apache License 2.0. See LICENSE.txt.
See CONTRIBUTING.md.