## Setup

1. Create a Databricks workspace and SQL Warehouse (you can do this in the Databricks UI). Once the SQL Warehouse has been created, copy the warehouse's HTTP path for use in the `.env` file.
2. Generate a personal access token from your Databricks workspace.
3. Copy `.env.example` to `.env` and fill in your values:

```bash
cp .env.example .env
# Edit .env with your actual credentials
```
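
For reference, a filled-in `.env` might look like the following. The variable names here are assumptions based on the settings the Databricks SQL Connector typically needs (hostname, HTTP path, token); match them to whatever `.env.example` actually defines, and the placeholder values are obviously not real:

```
DATABRICKS_SERVER_HOSTNAME=dbc-xxxxxxxx.cloud.databricks.com
DATABRICKS_HTTP_PATH=/sql/1.0/warehouses/xxxxxxxxxxxxxxxx
DATABRICKS_TOKEN=dapi-xxxxxxxxxxxxxxxx
```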
| 11 | + |
## Running the Benchmark

```bash
./benchmark.sh
```
| 17 | + |
## How It Works

1. **benchmark.sh**: Entry point that installs dependencies via `uv` and runs the benchmark
2. **benchmark.py**: Orchestrates the full benchmark:
   - Creates the catalog and schema
   - Creates the `hits` table with an explicit schema (including TIMESTAMP conversion)
   - Loads data from the parquet file using `INSERT INTO` with type conversions
   - Runs all queries via `run.sh`
   - Collects timing metrics from the Databricks REST API
   - Outputs results to JSON in the `results/` directory
3. **run.sh**: Iterates through `queries.sql` and executes each query
4. **query.py**: Executes individual queries and retrieves execution times from the Databricks REST API (`/api/2.0/sql/history/queries/{query_id}`)
5. **queries.sql**: Contains the 43 benchmark queries
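
As a sketch of step 4, the timing lookup reduces to a GET against the history endpoint plus a small parser. This is not the actual `query.py`: the field names in `execution_seconds` are assumptions about the shape of the Query History API response, so verify them against what your workspace returns.

```python
import json
import urllib.request


def fetch_query_metrics(host: str, token: str, query_id: str) -> dict:
    """Fetch server-side metrics for a finished query from the
    Query History endpoint referenced in step 4 above."""
    url = f"https://{host}/api/2.0/sql/history/queries/{query_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def execution_seconds(payload: dict) -> float:
    """Derive wall-clock execution time from start/end timestamps.

    Field names are assumed; adjust to the response you actually receive.
    """
    return (payload["query_end_time_ms"] - payload["query_start_time_ms"]) / 1000.0
```

Keeping the parsing in a separate pure function makes it easy to unit-test without credentials or network access.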
| 31 | + |
## Notes

- Query execution times are pulled from the Databricks REST API, which provides server-side metrics
- The data is loaded from a parquet file with explicit type conversions (Unix timestamps → TIMESTAMP, Unix dates → DATE)
- The benchmark uses the Databricks SQL Connector for Python
- Results include load time, data size, and individual query execution times (3 runs per query)
- Results are saved to `results/{instance_type}.json`
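
The timestamp/date conversions mentioned above might look like the following in Databricks SQL. This is a sketch, not the actual load statement: the column names (`WatchID`, `EventTime`, `EventDate`) are taken from the typical ClickBench `hits` schema, and the parquet path is a placeholder.

```sql
-- Illustrative only: real column list and path come from the actual schema.
INSERT INTO hits
SELECT
  WatchID,
  timestamp_seconds(EventTime)          AS EventTime,  -- Unix seconds → TIMESTAMP
  date_add(DATE'1970-01-01', EventDate) AS EventDate   -- days since epoch → DATE
  -- ...remaining columns pass through unchanged
FROM parquet.`/path/to/hits.parquet`;
```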