Lakebench

CLI tool for deploying and benchmarking lakehouse architectures on Kubernetes.

Note: This package is published on PyPI as `lakebench-k8s`. Install it with `pip install lakebench-k8s`. The CLI command is `lakebench`.

Choosing between Hive and Polaris, Iceberg and Delta, or sizing Spark for 100 GB vs 10 TB shouldn't require weeks of manual setup. Lakebench deploys a complete lakehouse stack from a single YAML file, generates realistic data at any scale, runs the pipeline, benchmarks query performance, and tears everything down, so you can focus on comparing architectures rather than plumbing.

Installation

pip install lakebench-k8s

Or with pipx: pipx install lakebench-k8s

Pre-built binaries (no Python required) are available on GitHub Releases.

Prerequisites

Python 3.10+
kubectl and helm on PATH
Kubernetes cluster (1.26+)
S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
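
The local prerequisites can be checked before installing. The script below is a minimal sketch, not part of Lakebench: the function names `check_tool`, `check_python`, and `preflight` are made up for illustration, and it assumes the interpreter is available as `python3`. It covers only the Python version and the two PATH tools, not the cluster version or object storage access. The Python check matters only for the pip or pipx install, since the pre-built binaries do not need Python.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: check the local prerequisites listed above.
# Illustrative helper, not part of the lakebench CLI.

# Succeeds when the named executable is on PATH.
check_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Succeeds when python3 exists and reports version 3.10 or newer.
check_python() {
  check_tool python3 &&
    python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'
}

# Prints one line per prerequisite; returns 1 if anything is missing.
preflight() {
  missing=0
  for tool in kubectl helm; do
    if check_tool "$tool"; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  if check_python; then
    echo "ok: python3 >= 3.10"
  else
    echo "missing: python3 >= 3.10"
    missing=1
  fi
  return "$missing"
}

preflight || echo "Some prerequisites are missing."
```

Once a config file exists, `lakebench validate <config>` is the tool's own check for configuration and connectivity.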

Quick Start

# 1. Generate config (interactive prompts for S3 details)
lakebench init --interactive

# 2. Deploy infrastructure
lakebench deploy lakebench.yaml

# 3. Generate test data
lakebench generate lakebench.yaml --wait

# 4. Run the pipeline + benchmark
lakebench run lakebench.yaml

# 5. View results
lakebench report

# 6. Tear down
lakebench destroy lakebench.yaml
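
For unattended runs, steps 2 to 6 can be chained so that a failure stops the sequence. The following is a rough sketch under a few assumptions: the config already exists (step 1 is interactive), `lakebench validate` from the Commands table is added as a first check, and each command exits non-zero on failure. The `step` helper, the `run_benchmark_cycle` function, and the `CONFIG` and `DRY_RUN` variables are illustrative names, not Lakebench features. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: run the Quick Start steps in order, stopping at the first failure.
# CONFIG and DRY_RUN are illustrative variables, not lakebench options.
CONFIG="${CONFIG:-lakebench.yaml}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
step() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Each step runs only if the previous one succeeded.
run_benchmark_cycle() {
  step lakebench validate "$CONFIG" &&
    step lakebench deploy "$CONFIG" &&
    step lakebench generate "$CONFIG" --wait &&
    step lakebench run "$CONFIG" &&
    step lakebench report &&
    step lakebench destroy "$CONFIG"
}

run_benchmark_cycle
```

Because `destroy` sits at the end of the chain, a failed step leaves the stack deployed, which allows inspection with `lakebench status` and `lakebench logs` before tearing down manually.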

Commands

Command	Description
`lakebench init`	Generate a starter configuration file
`lakebench validate <config>`	Validate config and test connectivity
`lakebench info <config>`	Show configuration summary
`lakebench recommend`	Recommend cluster sizing for a scale factor
`lakebench deploy <config>`	Deploy all infrastructure
`lakebench generate <config>`	Generate synthetic data to bronze bucket
`lakebench run <config>`	Execute the medallion pipeline with metrics
`lakebench benchmark <config>`	Run 8-query Trino benchmark
`lakebench query <config>`	Execute SQL queries against Trino
`lakebench status [config]`	Show deployment status
`lakebench logs <component> [config]`	Stream logs from a component
`lakebench report`	Generate HTML benchmark report
`lakebench destroy <config>`	Tear down all resources
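
These commands can be combined to benchmark several configurations back to back. The loop below is a sketch only: the config file names are hypothetical, `step`, `compare_configs`, and `DRY_RUN` are illustrative, and it assumes each command exits non-zero on failure. How reports from separate runs are stored is not covered here, so check the CLI Reference before relying on `lakebench report` inside the loop. The script prints commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: benchmark several configs one after another.
# DRY_RUN is an illustrative variable, not a lakebench option.
DRY_RUN="${DRY_RUN:-1}"

# Print the command; execute it only when DRY_RUN=0.
step() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

compare_configs() {
  for cfg in "$@"; do
    # Skip deployment entirely if the config does not validate.
    step lakebench validate "$cfg" || return 1

    step lakebench deploy "$cfg" &&
      step lakebench generate "$cfg" --wait &&
      step lakebench run "$cfg" &&
      step lakebench report
    status=$?

    # Always tear down so the next config starts from a clean cluster.
    step lakebench destroy "$cfg"
    [ "$status" -eq 0 ] || return "$status"
  done
}

# Hypothetical config names for two catalog choices.
compare_configs hive-iceberg.yaml polaris-iceberg.yaml
```

Unlike a plain chain of commands, this variant tears down after a failed run, trading post-mortem access for a clean start on the next configuration.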

Component Versions

Component	Version
Spark Operator	2.4.0 (Kubeflow)
Apache Spark	3.5.3
Apache Iceberg	1.5.0
Hive Metastore	3.1.3 (Stackable 25.7.0)
Trino	479
PostgreSQL	16

Documentation

Full documentation is in the docs/ directory:

Getting Started — Prerequisites, install, first deployment
Configuration — Full YAML reference
CLI Reference — All commands and flags
Recipes — Supported component combinations
Troubleshooting — Common errors and fixes

License

Apache 2.0

