
Add sqlite reader and adjust SQL queries to work there#182

Open
georgestagg wants to merge 10 commits into main from sqlite

Conversation


@georgestagg georgestagg commented Mar 10, 2026

I went back and forth many times on this PR on whether to introduce SQL-engine-specific syntax for percentiles and other incompatibilities. In the end, I decided not to, instead opting to use SQL that is as engine-agnostic as possible throughout.

Various things don't work in SQLite; here is the situation as it currently stands, as far as I recall:

  • No EXCLUDE. Instead we keep duplicated columns during the affected queries and drop them from the result.

  • LIMIT 0 does not return the correct column type information - Switch to LIMIT 1.

  • No GREATEST or LEAST - Switch to a utility function to build the equivalent with CASE WHEN.

  • No QUANTILE_CONT or PERCENTILE_CONT - Fall back to an NTILE-based method for boxplot and density.

  • No ANY_VALUE - We can use MIN, which gives us an arbitrary value.

  • No GENERATE_SERIES - Use a RECURSIVE CTE to generate the series values.
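For illustration, two of the workarounds above can be checked directly against SQLite. This is a minimal Python sketch using the stdlib sqlite3 module (not the PR's actual Rust utility functions); the literal values are arbitrary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# GREATEST(a, b) equivalent built with CASE WHEN; more arguments
# would nest additional CASE expressions.
row = conn.execute(
    "SELECT CASE WHEN 3 >= 7 THEN 3 ELSE 7 END"
).fetchone()
print(row[0])  # 7

# GENERATE_SERIES(0, 4) equivalent via a RECURSIVE CTE.
rows = conn.execute(
    """
    WITH RECURSIVE seq(n) AS (
        SELECT 0
        UNION ALL
        SELECT n + 1 FROM seq WHERE n < 4
    )
    SELECT n FROM seq
    """
).fetchall()
print([r[0] for r in rows])  # [0, 1, 2, 3, 4]
```

Note that the CASE WHEN form sidesteps GREATEST's NULL handling entirely, which is fine here because the affected queries operate on non-NULL computed columns.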

Closes #134

@georgestagg georgestagg requested a review from thomasp85 March 10, 2026 11:57
Comment on lines -80 to +88
- default = ["duckdb", "sqlite", "vegalite", "ipc", "builtin-data"]
+ default = ["duckdb", "sqlite", "vegalite", "ipc", "parquet", "builtin-data"]
  ipc = ["polars/ipc"]
  duckdb = ["dep:duckdb", "dep:arrow"]
  polars-sql = ["polars/sql"]
- builtin-data = ["polars/parquet"]
+ parquet = ["polars/parquet"]
  postgres = ["dep:postgres"]
  sqlite = ["dep:rusqlite"]
  vegalite = []
  ggplot2 = []
+ builtin-data = []
Collaborator Author

@georgestagg georgestagg Mar 10, 2026

This tweak to features is just prep for Wasm, not related to sqlite.

@georgestagg
Collaborator Author

Some more context:

I almost added a new SqlDialect trait with generalised methods for GREATEST/LEAST, GENERATE_SERIES, and PERCENTILE_CONT/QUANTILE_CONT. However, I realised that we could use the same (admittedly convoluted) SQL everywhere, even in Snowflake I believe, and so it no longer seemed necessary.
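The trait idea described above might look something like the following. This is a hypothetical Python sketch of the shape of the abstraction, not the actual Rust trait from the PR; all class and method names are illustrative only:

```python
# Hypothetical sketch: each dialect renders its own version of the
# constructs that differ between engines.
class SqlDialect:
    def greatest(self, a: str, b: str) -> str:
        raise NotImplementedError

class DuckDbDialect(SqlDialect):
    def greatest(self, a: str, b: str) -> str:
        # DuckDB has GREATEST natively.
        return f"GREATEST({a}, {b})"

class SqliteDialect(SqlDialect):
    def greatest(self, a: str, b: str) -> str:
        # SQLite lacks GREATEST; build the portable CASE WHEN fallback.
        return f"CASE WHEN {a} >= {b} THEN {a} ELSE {b} END"

print(DuckDbDialect().greatest("x", "y"))
# GREATEST(x, y)
print(SqliteDialect().greatest("x", "y"))
# CASE WHEN x >= y THEN x ELSE y END
```

The trade-off is that the dialect-free route keeps query generation to a single code path, at the cost of every engine paying for the lowest-common-denominator SQL.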

A question arises as to whether there are performance costs to reaching for lowest-common-denominator implementations.

I had Claude whip up a benchmark, and these are the results:

  generate_series

  ┌──────┬────────┬──────────┬──────────┐
  │ Size │ Native │ Portable │ Slowdown │
  ├──────┼────────┼──────────┼──────────┤
  │ 64   │ 51 µs  │ 800 µs   │ ~16x     │
  ├──────┼────────┼──────────┼──────────┤
  │ 512  │ 59 µs  │ 1.10 ms  │ ~19x     │
  ├──────┼────────┼──────────┼──────────┤
  │ 1000 │ 65 µs  │ 1.34 ms  │ ~21x     │
  ├──────┼────────┼──────────┼──────────┤
  │ 4096 │ 117 µs │ 3.15 ms  │ ~27x     │
  └──────┴────────┴──────────┴──────────┘

  percentile

  ┌───────┬────────┬──────────┬──────────┐
  │ Rows  │ Native │ Portable │ Slowdown │
  ├───────┼────────┼──────────┼──────────┤
  │ 100   │ 96 µs  │ 495 µs   │ ~5x      │
  ├───────┼────────┼──────────┼──────────┤
  │ 1000  │ 107 µs │ 540 µs   │ ~5x      │
  ├───────┼────────┼──────────┼──────────┤
  │ 10000 │ 397 µs │ 1.01 ms  │ ~2.5x    │
  └───────┴────────┴──────────┴──────────┘
Details
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};
use duckdb::Connection;
use ggsql::utils::{sql_generate_series, sql_percentile};

fn setup_connection() -> Connection {
    Connection::open_in_memory().expect("Failed to open DuckDB in-memory connection")
}

fn create_data_table(conn: &Connection, name: &str, n_rows: usize) {
    conn.execute_batch(&format!(
        "CREATE OR REPLACE TABLE {name} AS \
         SELECT random() * 1000.0 AS val \
         FROM GENERATE_SERIES(0, {}) AS seq(n)",
        n_rows - 1
    ))
    .expect("Failed to create data table");
}

fn bench_generate_series(c: &mut Criterion) {
    let conn = setup_connection();
    let mut group = c.benchmark_group("generate_series");

    for n in [64, 512, 1000, 4096] {
        group.bench_with_input(BenchmarkId::new("native", n), &n, |b, &n| {
            let sql = format!(
                "SELECT n FROM GENERATE_SERIES(0, {}) AS seq(n)",
                n - 1
            );
            b.iter(|| {
                let mut stmt = conn.prepare(&sql).unwrap();
                let rows = stmt.query_map([], |row| row.get::<_, f64>(0)).unwrap();
                for r in rows {
                    std::hint::black_box(r.unwrap());
                }
            });
        });

        group.bench_with_input(BenchmarkId::new("portable", n), &n, |b, &n| {
            let cte = sql_generate_series(n);
            let sql = format!("WITH RECURSIVE {cte} SELECT n FROM __ggsql_seq__");
            b.iter(|| {
                let mut stmt = conn.prepare(&sql).unwrap();
                let rows = stmt.query_map([], |row| row.get::<_, f64>(0)).unwrap();
                for r in rows {
                    std::hint::black_box(r.unwrap());
                }
            });
        });
    }

    group.finish();
}

fn bench_percentile(c: &mut Criterion) {
    let conn = setup_connection();
    let mut group = c.benchmark_group("percentile");

    for n_rows in [100, 1000, 10000] {
        let table = format!("data_{n_rows}");
        create_data_table(&conn, &table, n_rows);

        group.bench_with_input(BenchmarkId::new("native", n_rows), &n_rows, |b, _| {
            let sql = format!(
                "SELECT QUANTILE_CONT(val, 0.25) AS q1, QUANTILE_CONT(val, 0.75) AS q3 \
                 FROM {table}"
            );
            b.iter(|| {
                let mut stmt = conn.prepare(&sql).unwrap();
                let row = stmt
                    .query_row([], |row| {
                        Ok((row.get::<_, f64>(0)?, row.get::<_, f64>(1)?))
                    })
                    .unwrap();
                std::hint::black_box(row);
            });
        });

        group.bench_with_input(BenchmarkId::new("portable", n_rows), &n_rows, |b, _| {
            let from = format!("SELECT * FROM {table}");
            let q1 = sql_percentile("val", 0.25, &from, &[]);
            let q3 = sql_percentile("val", 0.75, &from, &[]);
            let sql = format!("SELECT {q1} AS q1, {q3} AS q3");
            b.iter(|| {
                let mut stmt = conn.prepare(&sql).unwrap();
                let row = stmt
                    .query_row([], |row| {
                        Ok((row.get::<_, f64>(0)?, row.get::<_, f64>(1)?))
                    })
                    .unwrap();
                std::hint::black_box(row);
            });
        });
    }

    group.finish();
}

criterion_group!(benches, bench_generate_series, bench_percentile);
criterion_main!(benches);

@georgestagg
Collaborator Author

Just a quick comment to note that b3d6b49 reintroduced SqlDialect to combat the less-than-ideal benchmark results.



Development

Successfully merging this pull request may close these issues.

Readers should be able to provide alternative SQL clauses
