Deploy, benchmark, and compare lakehouse architectures on Kubernetes with a single YAML file. Lakebench provisions Spark, Iceberg, and Trino, generates synthetic data at any scale, runs medallion pipelines (Bronze→Silver→Gold), and scores performance with an 8-query benchmark across pluggable catalogs, table formats, and query engines.

PureStorage-OpenConnect/lakebench

Lakebench

CLI tool for deploying and benchmarking lakehouse architectures on Kubernetes.

Note: This package is published as lakebench-k8s on PyPI. Install with pip install lakebench-k8s. The CLI command is lakebench.

Choosing between Hive and Polaris, Iceberg and Delta, or sizing Spark for 100 GB vs 10 TB shouldn't require weeks of manual setup. Lakebench deploys a complete lakehouse stack from a single YAML file, generates realistic data at any scale, runs the pipeline, benchmarks query performance, and tears everything down, so you can focus on comparing architectures rather than plumbing.

Installation

pip install lakebench-k8s

Or with pipx: pipx install lakebench-k8s

Pre-built binaries (no Python required) are available on GitHub Releases.

Prerequisites

  • Python 3.10+
  • kubectl and helm on PATH
  • Kubernetes cluster (1.26+)
  • S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
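
Before installing, it can help to confirm that the command-line prerequisites are actually on PATH. The snippet below is a minimal sketch, not part of Lakebench: missing_tools is a hypothetical helper name, and it only checks that each tool can be found, not that its version meets the minimums listed above.

```shell
# Sketch: print the name of every given tool that is not found on PATH.
# missing_tools is a hypothetical helper, not a Lakebench command.
missing_tools() {
  local tool
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Calling missing_tools python3 kubectl helm prints nothing when all three are available. Version checks (Python 3.10+, Kubernetes 1.26+) and S3 connectivity still need to be verified separately; lakebench validate covers config and connectivity once a config file exists.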

Quick Start

# 1. Generate config (interactive prompts for S3 details)
lakebench init --interactive

# 2. Deploy infrastructure
lakebench deploy lakebench.yaml

# 3. Generate test data
lakebench generate lakebench.yaml --wait

# 4. Run the pipeline + benchmark
lakebench run lakebench.yaml

# 5. View results
lakebench report

# 6. Tear down
lakebench destroy lakebench.yaml
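
For unattended runs, the steps above can be chained in a small wrapper. This is a sketch under stated assumptions, not an official workflow: lakebench_cycle and the LAKEBENCH_BIN override are hypothetical names, and it assumes each lakebench subcommand exits non-zero on failure and that destroy can run without interactive confirmation. It adds a validate step before deploying and always attempts teardown once infrastructure has been deployed.

```shell
# Sketch: run validate, deploy, generate, run, report, then always destroy.
# lakebench_cycle and LAKEBENCH_BIN are hypothetical, for illustration only.
lakebench_cycle() {
  local config="${1:-lakebench.yaml}"
  local lb="${LAKEBENCH_BIN:-lakebench}"

  # Fail fast before any infrastructure is created.
  "$lb" validate "$config" || return 1
  "$lb" deploy "$config" || return 1

  local rc=0
  # Stop at the first failing step, but remember its exit code.
  { "$lb" generate "$config" --wait &&
    "$lb" run "$config" &&
    "$lb" report; } || rc=$?

  # Tear down regardless, so a failed run does not leave resources behind.
  "$lb" destroy "$config" || rc=$?
  return "$rc"
}
```

Running lakebench_cycle lakebench.yaml executes one full cycle; looping over several config files is one way to compare architectures back to back. If you want to inspect a failed deployment with lakebench status or lakebench logs, remove the unconditional destroy call.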

Commands

Command                                Description
lakebench init                         Generate a starter configuration file
lakebench validate <config>            Validate config and test connectivity
lakebench info <config>                Show configuration summary
lakebench recommend                    Recommend cluster sizing for a scale factor
lakebench deploy <config>              Deploy all infrastructure
lakebench generate <config>            Generate synthetic data to bronze bucket
lakebench run <config>                 Execute the medallion pipeline with metrics
lakebench benchmark <config>           Run 8-query Trino benchmark
lakebench query <config>               Execute SQL queries against Trino
lakebench status [config]              Show deployment status
lakebench logs <component> [config]    Stream logs from a component
lakebench report                       Generate HTML benchmark report
lakebench destroy <config>             Tear down all resources

Component Versions

Component         Version
Spark Operator    2.4.0 (Kubeflow)
Apache Spark      3.5.3
Apache Iceberg    1.5.0
Hive Metastore    3.1.3 (Stackable 25.7.0)
Trino             479
PostgreSQL        16

Documentation

Full documentation is in the docs/ directory.

License

Apache 2.0
