
Commit 3b07d4f

Merge pull request #683 from prequel-co/add-databricks
Add Databricks and benchmark results for most SQL warehouse options
2 parents 62364a3 + acdd41f commit 3b07d4f

File tree

16 files changed: +1054 −1 lines changed


README.md

Lines changed: 0 additions & 1 deletion

@@ -284,7 +284,6 @@ Please help us add more systems and run the benchmarks on more types of VMs:
 - [ ] Azure Synapse
 - [ ] Boilingdata
 - [ ] CockroachDB Serverless
-- [ ] Databricks
 - [ ] DolphinDB
 - [ ] Dremio (without publishing)
 - [ ] DuckDB operating like "Athena" on remote Parquet files

databricks/.env.example

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
# Databricks Configuration
# Copy this file to .env and fill in your actual values

# Your Databricks workspace hostname (e.g., dbc-xxxxxxxx-xxxx.cloud.databricks.com)
DATABRICKS_SERVER_HOSTNAME=your-workspace-hostname.cloud.databricks.com

# SQL Warehouse HTTP path (found in your SQL Warehouse settings)
# Uncomment the warehouse size you want to use
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/your-warehouse-id

# Instance type name for results file naming & results machine type label
databricks_instance_type=Large

# Your Databricks personal access token
DATABRICKS_TOKEN=your-databricks-token

# Unity Catalog and Schema names
DATABRICKS_CATALOG=clickbench_catalog
DATABRICKS_SCHEMA=clickbench_schema

# Parquet data location (must use s3:// format)
DATABRICKS_PARQUET_LOCATION=s3://clickhouse-public-datasets/hits_compatible/hits.parquet
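The configuration above is plain `KEY=value` lines. As a point of reference, a minimal sketch of parsing such a file in Python (a hypothetical helper for illustration only; the benchmark scripts may simply source the file from the shell):

```python
def load_env(path=".env"):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # ignore comments and blank lines
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env
```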

databricks/README.md

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@

## Setup

1. Create a Databricks workspace and SQL Warehouse (you can do this in the Databricks UI). Once the SQL Warehouse has been created, copy the warehouse HTTP path to use in the `.env` file.
2. Generate a personal access token from your Databricks workspace.
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```

## Running the Benchmark

```bash
./benchmark.sh
```

## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark:
   - Creates the catalog and schema
   - Creates the `hits` table with explicit schema (including TIMESTAMP conversion)
   - Loads data from the parquet file using `INSERT INTO` with type conversions
   - Runs all queries via `run.sh`
   - Collects timing metrics from the Databricks REST API
   - Outputs results to JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from the Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`)
5. **queries.sql**: Contains the 43 benchmark queries
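The timing-collection step can be sketched as follows. The endpoint path is the one named above; the response handling is an assumption, since this README does not show the response schema:

```python
import json
import urllib.request

def history_url(host: str, query_id: str) -> str:
    # Query History endpoint that query.py reads server-side timings from
    return f"https://{host}/api/2.0/sql/history/queries/{query_id}"

def get_query_metrics(host: str, query_id: str, token: str) -> dict:
    # Authenticated GET; the token is the personal access token from Setup
    req = urllib.request.Request(
        history_url(host, query_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)  # dict containing server-side duration metrics
```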
## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics
- The data is loaded from a parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE)
- The benchmark uses the Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
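The two type conversions noted above can be illustrated in plain Python. In Databricks SQL they would be expressed with built-in functions such as `from_unixtime`; this sketch is only an illustration of the semantics, not the benchmark's actual code:

```python
from datetime import date, datetime, timedelta, timezone

def unix_seconds_to_timestamp(seconds: int) -> datetime:
    # Unix epoch seconds -> TIMESTAMP-style value (UTC)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def unix_days_to_date(days: int) -> date:
    # Days since 1970-01-01 -> DATE value
    return date(1970, 1, 1) + timedelta(days=days)
```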
