A development environment for experimenting with Apache Polaris, Apache Spark, and Apache Iceberg integration. This project provides a Docker-based setup for quick prototyping and learning.
> [!IMPORTANT]
> - This environment is intended for development and testing purposes only. It is not suitable for production use.
> - This project is under active development and subject to change.
Polaris Spark DevBox offers a pre-configured environment that combines:
- Apache Polaris - An open-source catalog for Apache Iceberg, implementing Iceberg's REST catalog API
- Apache Spark (v3.5.4) - Unified analytics engine for large-scale data processing
- Apache Iceberg - Open table format for huge analytic datasets
The environment includes:
- Polaris Server (Java 21)
- Apache Spark 3.5.4 with Hadoop 3
- Jupyter Notebook with PySpark integration
- Pre-configured networking and volume management
- Sample datasets and Iceberg table examples
- Zero-configuration setup with Apache Polaris and Spark
- Integrated Jupyter environment with PySpark
- Apache Iceberg table format support
- Containerized development
- Automated initialization
- Example notebooks demonstrating Polaris, Spark, and Iceberg integration
- Development utilities
> [!NOTE]
> Make sure you have all prerequisites installed before proceeding with the setup.
- Python 3.11+
- Docker Desktop
- Windows/macOS: Use Docker Desktop
- Linux: Docker Desktop for Linux or Docker Engine with Docker Compose
- Sufficient disk space for containers and volumes
> [!CAUTION]
> Keep your environment variables secure and never commit the `.env` file to version control.
Required variables in `.env`:

```shell
COMPOSE_PROJECT_NAME=polaris_spark_dev
POLARIS_CATALOG_NAME=my_catalog
POLARIS_DEFAULT_BASE_LOCATION=file:///data/polaris
POLARIS_PRINCIPAL_NAME=polarisuser
POLARIS_PRINCIPAL_ROLE_NAME=polarisuser_role
POLARIS_CATALOG_ROLE_NAME=my_catalog_role
POLARIS_API_HOST=localhost
POLARIS_API_PORT=8181
```

> [!TIP]
> Ensure Docker is running before starting the containers.
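For quick sanity checks, the `.env` values above can be read with a few lines of standard-library Python. This is only a sketch; the project ships `connection_config.py` as its actual configuration utility:

```python
from pathlib import Path


def load_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file, skipping blanks and comments."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env
```

For example, `load_env()["POLARIS_API_PORT"]` returns the configured port as a string once the file exists.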
1. Start the environment:

   ```shell
   docker-compose up -d
   ```

2. Verify container status:

   ```shell
   docker-compose ps
   ```

3. Setup Apache Polaris:

   ```shell
   ./setup
   ```
> [!NOTE]
> This will configure Polaris by:
> - creating a catalog
> - creating a principal and a principal role
> - creating a catalog role
> - assigning the catalog role to the principal role
> - granting privileges
> - generating a simple Jupyter Notebook to verify the setup
> [!NOTE]
> All services are configured to run on localhost by default.
- Jupyter Notebook: http://localhost:8888
- Polaris API: http://localhost:10081
- Polaris Admin: http://localhost:10082
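A minimal reachability check for these endpoints, using only the Python standard library. This is a sketch: it assumes the containers are already up and that the port numbers match your compose file:

```python
import urllib.error
import urllib.request

# Default service endpoints from the setup above; adjust if your ports differ.
SERVICES = {
    "Jupyter Notebook": "http://localhost:8888",
    "Polaris API": "http://localhost:10081",
    "Polaris Admin": "http://localhost:10082",
}


def check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with any HTTP status at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # server responded (e.g. 401/404), so it is reachable
    except (urllib.error.URLError, OSError):
        return False  # connection refused or timed out


if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name:18} {url:28} {'UP' if check(url) else 'DOWN'}")
```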
- Uses Apache Spark 3.5.4
- PySpark for Python interface
- Spark SQL for data querying
- Built-in Spark History Server
- Apache Iceberg table format support
- Schema evolution
- Time travel queries
- Partition evolution
- Hidden partitioning
- REST Catalog
- SQL query support
- Distributed query execution
- Integration with Iceberg tables
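The REST Catalog integration above is wired up through Spark configuration. Below is a hedged sketch of the Iceberg catalog properties involved: the key names follow Iceberg's Spark runtime, while the catalog name, endpoint path, and credential placeholder are assumptions to adapt to your own setup:

```python
def polaris_catalog_conf(
    catalog: str = "my_catalog",
    uri: str = "http://localhost:8181/api/catalog",
    credential: str = "<client_id>:<client_secret>",  # placeholder, not a real secret
) -> dict:
    """Build the Spark properties that register a Polaris REST catalog for Iceberg."""
    p = f"spark.sql.catalog.{catalog}"
    return {
        # Enable Iceberg SQL extensions (time travel, schema/partition evolution DDL)
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        p: "org.apache.iceberg.spark.SparkCatalog",
        f"{p}.type": "rest",            # talk to Polaris over the Iceberg REST protocol
        f"{p}.uri": uri,
        f"{p}.credential": credential,  # OAuth2 client credentials
        f"{p}.warehouse": catalog,
    }


# These properties would typically be applied when building the SparkSession, e.g.:
# builder = SparkSession.builder
# for key, value in polaris_catalog_conf().items():
#     builder = builder.config(key, value)
```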
```
polaris-spark-devbox/
├── connection_config.py   # Configuration utility
├── requirements.txt       # Python dependencies
├── docker-compose.yml     # Container orchestration
├── conf/                  # Configuration files
├── templates/             # Template files
├── notebooks/             # Jupyter notebooks
└── http/                  # HTTP test files
```
> [!NOTE]
> All examples and documentation are automatically generated during setup.
The setup generates two types of documentation:

1. Jupyter Notebooks (`notebooks/polaris_setup_verify.ipynb`)
   - Setup verification
   - API usage examples
   - Iceberg table operations
   - Spark SQL queries

2. HTTP Files (`http/polaris.http`)
   - REST API documentation
   - Testing endpoints
> [!NOTE]
> This section is for advanced users who want to build their own custom images.
The project uses Task to manage image builds. The `Taskfile.yml` provides tasks to build:
- PySpark Notebook image with Jupyter
- Apache Polaris base and server images
- Task installed
- Docker with multi-platform build support
```shell
# Build all images
task

# Build only Spark Notebook image
task build_spark_notebook_image

# Build Polaris base image
task build_polaris_base

# Build Polaris server image
task build_polaris_image
```

You can customize the build by modifying these variables in `Taskfile.yml`:
```yaml
vars:
  JAVA_VERSION: 17       # Java version for builds
  SPARK_VERSION: 3.5.4   # Apache Spark version
  HADOOP_VERSION: 3      # Hadoop version
  POLARIS_VERSION: 0.9.x # Apache Polaris version
  SPARK_NOTEBOOK_IMAGE: ghcr.io/kameshsampath/polaris-spark-devbox/spark35notebook
  POLARIS_BASE_IMAGE: ghcr.io/kameshsampath/polaris-spark-devbox/polaris-base
  POLARIS_SERVER_IMAGE: ghcr.io/kameshsampath/polaris-spark-devbox/polaris
```

> [!TIP]
> The Polaris base image is built for both ARM64 and AMD64 architectures.
- Apache Spark Documentation
- Apache Iceberg Documentation
- Apache Polaris Documentation
- PySpark Documentation
> [!TIP]
> Before submitting a PR, make sure to test your changes thoroughly.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
Apache License 2.0. See LICENSE for details.
For questions and support:
- Open an issue in the GitHub repository
- Connect on LinkedIn
Built with ❤️ for Open Source by Kamesh Sampath, Developer Relations @ Snowflake