A configuration-driven FastAPI service for ingesting data into Databricks tables via Zerobus streams.
- Config-driven endpoints: Automatically creates REST endpoints based on JSON configuration
- Persistent streams: Maintains long-lived Zerobus streams for optimal performance
- Multi-table support: Handle multiple tables with different schemas simultaneously
- Dynamic validation: Automatic request validation using Pydantic models
- Organized structure: Clean separation of proto files and stubs per table
- Flexible durability: Choose between fast async ingestion or guaranteed durability per request
```
zerobus-station/
├── app.py                      # Main FastAPI application
├── stream_manager.py           # Stream lifecycle management
├── config.json                 # Table configuration (created from example)
├── config.example.json         # Example configuration template
├── .env                        # Environment variables (not committed)
├── .env.example                # Example environment variables template
├── databricks_zerobus-*.whl    # Zerobus SDK wheel
├── Dockerfile                  # Docker container definition
├── tables/
│   ├── station_one/
│   │   ├── schema.proto        # Protobuf schema
│   │   └── schema_pb2.py       # Generated Python stubs
│   └── station_two/
│       ├── schema.proto
│       └── schema_pb2.py
└── README.md                   # This file
```
This project includes two example tables (station_one and station_two) with pre-configured protobuf schemas. Follow these steps to get started quickly:
Run these SQL commands in your Databricks workspace (replace YOUR_CATALOG and YOUR_SCHEMA with your values):
```sql
-- Create station_one table
CREATE TABLE YOUR_CATALOG.YOUR_SCHEMA.station_one (
  device_name STRING,
  temp INT,
  humidity BIGINT
)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.enableRowTracking' = 'false'
);

-- Create station_two table
CREATE TABLE YOUR_CATALOG.YOUR_SCHEMA.station_two (
  device_name STRING,
  temp INT,
  humidity BIGINT,
  description STRING
)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.enableRowTracking' = 'false'
);

-- Replace <service-principal-id> with your service principal's application ID
GRANT USE CATALOG ON CATALOG YOUR_CATALOG TO `<service-principal-id>`;
GRANT USE SCHEMA ON SCHEMA YOUR_CATALOG.YOUR_SCHEMA TO `<service-principal-id>`;
GRANT SELECT, MODIFY ON TABLE YOUR_CATALOG.YOUR_SCHEMA.station_one TO `<service-principal-id>`;
GRANT SELECT, MODIFY ON TABLE YOUR_CATALOG.YOUR_SCHEMA.station_two TO `<service-principal-id>`;
```

```bash
# Copy example configuration files
cp .env.example .env
cp config.example.json config.json

# Edit .env with your service principal credentials
# Edit config.json with your workspace details and replace YOUR_CATALOG/YOUR_SCHEMA
```

```bash
# Install dependencies
uv sync

# Run the service
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Test station_one
curl -X POST http://localhost:8000/ingest/station_one \
  -H "Content-Type: application/json" \
  -d '{"device_name": "sensor-001", "temp": 25, "humidity": 60}'

# Test station_two
curl -X POST http://localhost:8000/ingest/station_two \
  -H "Content-Type: application/json" \
  -d '{"device_name": "sensor-002", "temp": 22, "humidity": 55, "description": "Main entrance"}'
```

Requirements:

- Python 3.11+
- Databricks workspace with Zerobus access
- Databricks workspace ID
- Service principal with the following permissions:
  - On catalog: `USE_CATALOG`
  - On schema: `USE_SCHEMA`
  - On table: `SELECT`, `MODIFY`
- Install dependencies:

  ```bash
  pip install fastapi uvicorn python-dotenv
  pip install databricks_zerobus-0.0.17-py3-none-any.whl
  ```

  Or using uv:

  ```bash
  uv sync
  ```

- Create a `.env` file from the example:

  ```bash
  cp .env.example .env
  ```

  Then edit `.env` with your credentials:

  ```
  DATABRICKS_CLIENT_ID=your-service-principal-id
  DATABRICKS_CLIENT_SECRET=your-service-principal-secret
  ```

Note: The `.env` file is automatically loaded on startup and should never be committed to version control.
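How that automatic load typically works with python-dotenv (a minimal sketch; `app.py` may structure this differently):

```python
# Load .env into the process environment before anything reads credentials.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

client_id = os.environ["DATABRICKS_CLIENT_ID"]
client_secret = os.environ["DATABRICKS_CLIENT_SECRET"]
```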
The service is driven by `config.json`, which defines:
- Databricks connection details (server endpoint, workspace ID, workspace URL)
- Table definitions with schemas
- Protobuf message mappings
Quick Start:

```bash
cp config.example.json config.json
```

Then edit `config.json` with your Databricks details and table definitions.
Example `config.json`:

```json
{
  "databricks": {
    "server_endpoint": "workspace-id.zerobus.region.cloud.databricks.com",
    "workspace_id": "workspace-id",
    "workspace_url": "https://workspace-url.cloud.databricks.com"
  },
  "tables": {
    "station_one": {
      "table_name": "catalog.schema.table_name",
      "proto_package": "station_one",
      "message_name": "StationOne",
      "fields": [
        {"name": "device_name", "type": "string", "proto_type": "optional string", "field_num": 1},
        {"name": "temp", "type": "int32", "proto_type": "optional int32", "field_num": 2},
        {"name": "humidity", "type": "int64", "proto_type": "optional int64", "field_num": 3}
      ]
    }
  }
}
```

Start the server:

```bash
uv run uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

The service will:
- Load environment variables from `.env`
- Load configuration from `config.json`
- Create Pydantic validation models for each table
- Initialize the stream manager with an OAuth token factory
- Create dynamic endpoints for each table (sketched below)
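As a rough sketch of the last two steps, endpoints and validation models can be generated from the config like this. `TYPE_MAP` and `register_table` are hypothetical names for illustration, not the service's actual internals:

```python
from fastapi import FastAPI
from pydantic import create_model

app = FastAPI()

# Map config field types to Python types for Pydantic (illustrative).
TYPE_MAP = {"string": str, "int32": int, "int64": int}

def register_table(app: FastAPI, table_key: str, table_cfg: dict) -> None:
    # Build a validation model from the configured fields; optional fields
    # here mirror the `optional` proto fields.
    fields = {
        f["name"]: (TYPE_MAP[f["type"]] | None, None) for f in table_cfg["fields"]
    }
    Model = create_model(f"{table_key.title()}Record", **fields)

    async def ingest(record: Model, wait_for_ack: bool = False):  # type: ignore[valid-type]
        # Hand the validated record to the stream manager here (omitted).
        return {"table": table_key, "queued": True}

    # One POST endpoint per configured table.
    app.post(f"/ingest/{table_key}")(ingest)
```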
Build and run:
```bash
docker build -t zerobus-station .
docker run -p 8000:8000 --env-file .env zerobus-station
```

Service information and available endpoints:

```bash
curl http://localhost:8000/
```

Global health check showing active streams:

```bash
curl http://localhost:8000/health
```

For each table in the config, the following endpoints are created:
`POST /ingest/{table_key}` — Ingest a record into the specified table.

Fast async ingestion (default):

```bash
curl -X POST http://localhost:8000/ingest/station_one \
  -H "Content-Type: application/json" \
  -d '{
    "device_name": "sensor-001",
    "temp": 25,
    "humidity": 60
  }'
```

With durability guarantee:

```bash
curl -X POST "http://localhost:8000/ingest/station_one?wait_for_ack=true" \
  -H "Content-Type: application/json" \
  -d '{
    "device_name": "sensor-001",
    "temp": 25,
    "humidity": 60
  }'
```

Query Parameters:
- `wait_for_ack` (bool, default: false): If true, waits for server acknowledgment before returning. Use false for maximum throughput, true for guaranteed durability.

`GET /health/{table_key}` — Health check for a specific table.

```bash
curl http://localhost:8000/health/station_one
```

`POST /flush/{table_key}` — Flush pending records for a table to ensure durability.

```bash
curl -X POST http://localhost:8000/flush/station_one
```

Follow these steps to add a new table to the service:
Create your table in Databricks SQL:
```sql
CREATE TABLE catalog.schema.my_new_table (
  field1 STRING,
  field2 INT,
  field3 BIGINT
)
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true',
  'delta.enableRowTracking' = 'false'
);

-- Grant the service principal access to the new table
GRANT USE CATALOG ON CATALOG YOUR_CATALOG TO `<service-principal-id>`;
GRANT USE SCHEMA ON SCHEMA YOUR_CATALOG.YOUR_SCHEMA TO `<service-principal-id>`;
GRANT SELECT, MODIFY ON TABLE YOUR_CATALOG.YOUR_SCHEMA.my_new_table TO `<service-principal-id>`;
```

Create a directory for the table's proto files:

```bash
mkdir -p tables/my_new_table
```

Create `tables/my_new_table/schema.proto`:
```protobuf
syntax = "proto2";

package my_new_table;

message MyNewTable {
  optional string field1 = 1;
  optional int32 field2 = 2;
  optional int64 field3 = 3;
}
```

Important Notes:
- Field types must match your Databricks table schema
- Field numbers must be sequential starting from 1
- Use `int32` for INT, `int64` for BIGINT, `string` for STRING
- Package name should match your table key

Generate the Python stubs:

```bash
protoc --python_out=. tables/my_new_table/schema.proto
```

This generates `tables/my_new_table/schema_pb2.py`.
Add your table configuration:
```json
{
  "databricks": {
    "server_endpoint": "workspace-id.zerobus.region.cloud.databricks.com",
    "workspace_id": "workspace-id",
    "workspace_url": "https://workspace-url.cloud.databricks.com"
  },
  "tables": {
    "my_new_table": {
      "table_name": "catalog.schema.my_new_table",
      "proto_package": "my_new_table",
      "message_name": "MyNewTable",
      "fields": [
        {"name": "field1", "type": "string", "proto_type": "optional string", "field_num": 1},
        {"name": "field2", "type": "int32", "proto_type": "optional int32", "field_num": 2},
        {"name": "field3", "type": "int64", "proto_type": "optional int64", "field_num": 3}
      ]
    }
  }
}
```

Configuration Fields:
- `table_name`: Fully qualified table name in Databricks (catalog.schema.table)
- `proto_package`: Must match the package name in your .proto file
- `message_name`: Must match the message name in your .proto file
- `fields`: List of fields for Pydantic validation (must match the proto definition)

Restart the service to pick up the new table:

```bash
# Stop the current service (Ctrl+C), then restart it
uvicorn app:app --reload --host 0.0.0.0 --port 8000
```

Or for Docker:
```bash
docker build -t zerobus-station .
docker run -p 8000:8000 --env-file .env zerobus-station
```

Test the new endpoint:

```bash
curl -X POST http://localhost:8000/ingest/my_new_table \
  -H "Content-Type: application/json" \
  -d '{
    "field1": "test",
    "field2": 123,
    "field3": 456
  }'
```

The endpoint will be automatically available at `/ingest/my_new_table`.
The `StreamManager` class handles:
- Lazy initialization: Streams are created on first request
- Connection pooling: One persistent stream per table
- OAuth token management: Automatic token generation using token factory
- Health monitoring: Automatic stream state checking
- Graceful recovery: Handles stream failures and recreates as needed
- Clean shutdown: Flushes and closes all streams on service shutdown
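Reduced to a sketch, the lazy one-stream-per-table pattern looks roughly like this. The real `stream_manager.py` differs, and `create_stream`/`flush`/`close` are placeholder SDK calls, not confirmed Zerobus API names:

```python
import asyncio

class StreamManagerSketch:
    """Lazy, one-persistent-stream-per-table pattern (illustrative only)."""

    def __init__(self, sdk, table_configs: dict):
        self._sdk = sdk
        self._tables = table_configs
        self._streams: dict[str, object] = {}
        self._lock = asyncio.Lock()

    async def get_stream(self, table_key: str):
        # Lazy initialization: create the stream on first request, reuse after.
        async with self._lock:
            stream = self._streams.get(table_key)
            if stream is None:
                cfg = self._tables[table_key]
                stream = await self._sdk.create_stream(cfg["table_name"])  # placeholder call
                self._streams[table_key] = stream
            return stream

    async def shutdown(self):
        # Clean shutdown: flush and close every open stream.
        for stream in self._streams.values():
            await stream.flush()   # placeholder
            await stream.close()   # placeholder
        self._streams.clear()
```

The single lock keeps concurrent first requests from racing to create duplicate streams for the same table.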
Check service and stream health:

```bash
# Global health
curl http://localhost:8000/health

# Table-specific health
curl http://localhost:8000/health/station_one
curl http://localhost:8000/health/station_two
```

Send test records:

```bash
# Fast async ingestion
curl -X POST http://localhost:8000/ingest/station_one \
  -H "Content-Type: application/json" \
  -d '{"device_name": "sensor-001", "temp": 25, "humidity": 60}'

# With durability guarantee
curl -X POST "http://localhost:8000/ingest/station_one?wait_for_ack=true" \
  -H "Content-Type: application/json" \
  -d '{"device_name": "sensor-001", "temp": 25, "humidity": 60}'
```

FastAPI automatically generates interactive API docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Stream lifecycle:

- First Request: When the first record is sent to `/ingest/{table_key}`, the stream manager creates a new Zerobus stream with an OAuth token factory
- Subsequent Requests: The same stream is reused for better performance
- Health Checks: Stream state is validated before each use
- Recovery: Failed streams are automatically recreated with fresh tokens
- Shutdown: All streams are gracefully flushed and closed
Data flow for a single ingest request:

```text
Client Request
      ↓
FastAPI Endpoint (/ingest/{table_key})
      ↓
JSON Validation (Pydantic)
      ↓
Get/Create Stream (StreamManager)
      ↓
OAuth Token (via token_factory)
      ↓
Convert JSON → Protobuf
      ↓
Ingest via Zerobus Stream
      ↓
[Optional] Wait for Ack
      ↓
Response to Client
```
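The Convert JSON → Protobuf step can be sketched with the generated stubs and `google.protobuf.json_format`; the service's actual conversion code may differ:

```python
from google.protobuf import json_format

# Generated stub for station_one (tables/station_one/schema_pb2.py).
from tables.station_one import schema_pb2

payload = {"device_name": "sensor-001", "temp": 25, "humidity": 60}

# ParseDict fills a new StationOne message from the dict; it raises
# json_format.ParseError if a field name or type does not match the schema.
message = json_format.ParseDict(payload, schema_pb2.StationOne())
record_bytes = message.SerializeToString()
```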
The service uses OAuth 2.0 with client credentials:
- The stream manager creates a `token_factory` function
- The token factory calls `get_zerobus_token()` from the Zerobus SDK
- The token is automatically refreshed on stream creation/recovery
- Tokens are scoped to specific table permissions
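A sketch of that pattern, assuming `get_zerobus_token()` is importable as described above; the exact import path and signature are assumptions here:

```python
import os

def make_token_factory(workspace_url: str):
    # Credentials come from the .env-loaded environment (see Setup).
    client_id = os.environ["DATABRICKS_CLIENT_ID"]
    client_secret = os.environ["DATABRICKS_CLIENT_SECRET"]

    def token_factory() -> str:
        # Invoked on stream creation/recovery, so each new stream
        # starts with a fresh OAuth token.
        from databricks_zerobus import get_zerobus_token  # assumed import path
        return get_zerobus_token(workspace_url, client_id, client_secret)

    return token_factory
```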
Performance considerations:

- Persistent Streams: Streams are kept alive between requests for minimal latency
- Async Operations: FastAPI's async capabilities ensure non-blocking operations
- Buffering: Zerobus SDK handles buffering and flow control automatically (50,000 in-flight records by default)
- Batch Flushing: Use the `/flush/{table_key}` endpoint to ensure durability without waiting per record (see the client sketch below)
- Fast vs. Durable: Use `wait_for_ack=false` for high throughput, `wait_for_ack=true` for guaranteed durability
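For example, a client might take the fast path for bulk records and flush once at the end (illustrative only; `httpx` is just one HTTP client choice):

```python
import httpx

BASE = "http://localhost:8000"

with httpx.Client() as client:
    # Fast async ingestion for the bulk of the records.
    for i in range(1000):
        client.post(
            f"{BASE}/ingest/station_one",
            json={"device_name": f"sensor-{i:03d}", "temp": 25, "humidity": 60},
        )
    # One flush at the end instead of wait_for_ack=true per record.
    client.post(f"{BASE}/flush/station_one")
```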
The service handles various error scenarios:
- Invalid Table: Returns 404 if table not found in config
- Validation Errors: Returns 400 with detailed validation messages
- Stream Failures: Returns 500 and logs detailed error information
- Automatic Recovery: StreamManager recreates failed streams automatically
- OAuth Errors: Logged with full details for debugging
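In FastAPI terms, the first two cases map naturally onto `HTTPException`; a simplified sketch, not the service's actual handlers:

```python
from fastapi import HTTPException
from pydantic import BaseModel, ValidationError

async def ingest_record(table_key: str, payload: dict, models: dict[str, type[BaseModel]]):
    # Invalid Table: 404 if the key is not in config.json
    if table_key not in models:
        raise HTTPException(status_code=404, detail=f"Unknown table: {table_key}")
    # Validation Errors: 400 with Pydantic's detailed messages
    try:
        record = models[table_key].model_validate(payload)
    except ValidationError as exc:
        raise HTTPException(status_code=400, detail=exc.errors())
    # Stream failures during ingestion would surface as 500s (omitted here).
    return record
```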
© 2025 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.