# TrStream

A distributed real-time transaction processing pipeline
- About
- Motivation
- Overview
- Key features
- Architecture
- Data flow
- Screenshots
- Quick Start
- Integrations
- Tech stack
- Local Access Points
- Roadmap
- License
## About

TrStream is a distributed data pipeline designed to simulate and process high-throughput financial transaction streams in real time.
The project explores the architectural patterns behind modern fintech and data engineering systems, including:
- Event-driven ingestion
- Scalable stream processing
- Lakehouse-style storage
- Analytical querying
It was developed as a personal project to apply concepts from Designing Data-Intensive Applications (Martin Kleppmann) with a strong focus on system design and data flow.
## Motivation

Modern financial platforms ingest millions of events per day and must:
- Reliably process asynchronous events
- Retain raw events for auditing and replay
- Progressively optimize data layout for analytics
- Remain horizontally scalable and loosely coupled
TrStream models this workflow locally using open-source technologies, emphasizing clear data stages and realistic ingestion patterns.
## Overview

At a high level, the pipeline:
- Ingests transaction events from external systems (e.g. Stripe webhooks) and internal producers
- Buffers and distributes events via Kafka
- Persists immutable raw events in a data lake (Parquet on S3-compatible storage)
- Processes and reorganizes data into optimized analytical layouts
- Exposes a SQL query layer on analytics-ready data
All components are containerized and orchestrated with Docker Compose, allowing horizontal scaling of producers and consumers.
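The internal producers described above can be sketched as a simple event generator. This is an illustrative, stdlib-only sketch (field names and distributions are assumptions, not TrStream's actual schema): each synthetic transaction mimics the shape of a payment provider event, with a configurable emission rate and a log-normal amount distribution.

```python
import random
import time
import uuid
from dataclasses import asdict, dataclass


@dataclass
class Transaction:
    """A simulated payment event, mirroring fields a provider webhook might carry."""
    id: str
    amount_cents: int
    currency: str
    status: str
    created_at: float


def simulate_transactions(n, rate_per_sec=None, seed=None):
    """Yield n synthetic transactions; amounts follow a log-normal distribution."""
    rng = random.Random(seed)
    for _ in range(n):
        yield Transaction(
            id=str(uuid.uuid4()),
            amount_cents=max(1, int(rng.lognormvariate(7, 1.2))),
            currency=rng.choice(["EUR", "USD", "GBP"]),
            status=rng.choices(["succeeded", "failed"], weights=[95, 5])[0],
            created_at=time.time(),
        )
        if rate_per_sec:
            time.sleep(1 / rate_per_sec)  # throttle to the configured rate


# Generate a small deterministic batch, ready to serialize and publish to Kafka.
events = [asdict(t) for t in simulate_transactions(5, seed=42)]
```

In the real pipeline each dict would be serialized (e.g. as JSON) and published to a Kafka topic; the generator itself stays independent of the transport.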
## Key features

- Event-driven ingestion via webhooks and producers
- Kafka-based buffering with horizontal scalability through partitioning
- Clear separation of data lifecycle stages: raw → processed → analytics
- Immutable Parquet storage for auditing and replay
- File compaction and layout optimization for analytical workloads
- SQL querying on object storage (Athena-like experience)
- Lightweight SQL editor implemented with Streamlit
- Fully containerized local environment with Docker Compose orchestration and explicit health checks
## Architecture

- Event sources
  - External providers (e.g. Stripe) deliver events through a webhook service
  - Internal producers simulate transaction streams with configurable rates and distributions
- Kafka ingestion
  - All events are published to Kafka topics
  - Kafka provides buffering, ordering, partitioning, and backpressure handling
  - Kafka UI exposes end-to-end observability of topics and consumers
- Consumers and storage
  - Kafka consumers persist events to MinIO (S3-compatible storage)
  - Data flows through explicit lifecycle stages:
    - Raw: immutable, append-only Parquet files mirroring incoming events
    - Processed: cleaned and normalized data derived from raw events
    - Analytics: compacted and query-optimized Parquet files
- Processing and compaction
  - A processor transforms raw data into cleaned and normalized datasets
  - A compacter merges small files and optimizes layout for analytical queries
  - Both services are schema- and source-aware, ensuring consistent, query-ready outputs
- Query and visualization
  - A query service exposes SQL access (DuckDB-based) over analytics data
  - A Streamlit UI provides an interactive SQL editor for data exploration
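The compaction step above boils down to grouping many small objects into merge batches of a target size. A minimal sketch of that planning logic, assuming a simple greedy bin-packing strategy (the actual compacter uses PyArrow to read and rewrite Parquet; file names and the 128 MB target here are illustrative):

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Greedily bin-pack small files into merge groups of roughly target_bytes.

    file_sizes: {object_key: size_in_bytes}. Returns a list of key groups;
    each group would be read, concatenated, and rewritten as one Parquet file.
    """
    groups, current, current_size = [], [], 0
    # Largest-first keeps each group close to the target without overshooting badly.
    for key, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Forty 8 MB files compacted toward 64 MB outputs -> five groups of eight files.
small_files = {f"raw/stripe/part-{i:04d}.parquet": 8 * 1024 * 1024 for i in range(40)}
plan = plan_compaction(small_files, target_bytes=64 * 1024 * 1024)
```

Fewer, larger Parquet files mean fewer object-store requests and better row-group scan efficiency for the DuckDB query layer downstream.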
## Screenshots

- Kafka UI: monitor topics, partitions, and consumer groups in real time
- MinIO Console: browse buckets, view file metadata, and manage storage
- SQL Editor: write and execute SQL queries against analytics data
## Quick Start

Prerequisites:

- Docker and Docker Compose installed locally
- Python 3.10+

Steps:

- Clone the repository:

  ```bash
  git clone https://github.com/AlessioCappello2/TrStream.git
  cd TrStream
  ```

- Build all images:

  ```bash
  scripts/scripts-cli/build.sh
  ```

- Start core services:

  ```bash
  scripts/scripts-cli/run.sh
  ```

- Start everything (including UI):

  ```bash
  scripts/scripts-cli/run_all.sh
  ```

- Access the services:
  - Kafka UI: http://localhost:8080
  - MinIO Console: http://localhost:9001
  - SQL Query API: http://localhost:8000
  - SQL Editor: http://localhost:8501

Scale producers and consumers horizontally:

```bash
scripts/scripts-cli/run.sh producer=3 consumer=4
```

See the scripts documentation for more options.
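Scaling consumers is bounded by the topic's partition count: within a consumer group, each partition is consumed by at most one consumer, so running 4 consumers only helps if the topic has at least 4 partitions. A toy sketch of key-based partitioning (real Kafka clients use their own hash functions, e.g. murmur2 in the Java client; MD5 here just illustrates the idea):

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map an event key to a partition index."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# All events keyed by the same account land on one partition, preserving their
# relative order; with 4 partitions, at most 4 consumers in a group do useful work.
assignments = {f"acct-{i}": partition_for(f"acct-{i}", 4) for i in range(8)}
```

This is why keying events by account (or another entity needing per-key ordering) is a common choice: it gives ordered processing per key while still spreading load across partitions.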
## Integrations

TrStream supports real-world payment provider integrations:
- Stripe
  - Webhook ingestion for payment events
  - Signature verification and validation
  - Setup Guide
- Revolut
  - Sandbox API integration for transaction data
  - Redis-backed event queue and serverless webhook deployment with Vercel
  - Setup Guide
Each integration is independently deployable and configurable, allowing users to choose which data sources to include in their pipeline.
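The signature verification step can be sketched with the standard library. Stripe signs webhooks with HMAC-SHA256 over the string `<timestamp>.<payload>`; the sketch below simplifies the scheme (header parsing is omitted, and the secret is a hypothetical placeholder), but shows the two essential checks: timestamp freshness against replays and a constant-time digest comparison.

```python
import hashlib
import hmac
import time


def sign(payload: bytes, secret: str, timestamp: int) -> str:
    """Stripe-style signature: HMAC-SHA256 over '<timestamp>.<payload>'."""
    signed = f"{timestamp}.".encode() + payload
    return hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()


def verify(payload: bytes, secret: str, timestamp: int, signature: str,
           tolerance_sec: int = 300) -> bool:
    """Reject stale timestamps (replay protection), then compare in constant time."""
    if abs(time.time() - timestamp) > tolerance_sec:
        return False
    expected = sign(payload, secret, timestamp)
    return hmac.compare_digest(expected, signature)


secret = "whsec_test"  # hypothetical signing secret, not a real credential
body = b'{"type": "payment_intent.succeeded"}'
ts = int(time.time())
ok = verify(body, secret, ts, sign(body, secret, ts))
```

Only events passing this check should be published to Kafka; anything failing verification is dropped (or logged) at the webhook service boundary.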
## Tech stack

| Component | Technology |
|---|---|
| Ingestion | Kafka, Webhooks |
| Messaging | Kafka (Bitnami legacy image) |
| Storage | MinIO (S3-compatible) |
| Processing | Python, PyArrow, Boto3 |
| Query engine | DuckDB |
| API layer | FastAPI |
| Visualization | Streamlit |
| Monitoring | Kafka UI (Provectus Labs) |
| Orchestration | Docker Compose |
## Local Access Points

| Component | URL |
|---|---|
| Stripe Webhook | http://localhost:8100 |
| Kafka UI | http://localhost:8080 |
| MinIO Browser | http://localhost:9000 |
| MinIO Console | http://localhost:9001 |
| SQL Querier API | http://localhost:8000 |
| Streamlit Editor | http://localhost:8501 |
Backend Services (no web interface):
- Kafka: `localhost:9092` (connect via Kafka clients)
Tip: Use Kafka UI to monitor Kafka topics and messages.
## Roadmap

- Kafka-based event streaming
- Multi-stage data lifecycle (raw → processed → analytics)
- DuckDB query layer
- Streamlit SQL editor
- Stripe integration with webhook handling
- Revolut integration with sandbox API
- Observability
  - Prometheus metrics
  - Grafana dashboards
- Job scheduling
  - Airflow DAGs for processing and compaction
- Additional payment provider integrations (e.g. Adyen, PayPal)
- Real-time alerting and anomaly detection on transaction streams
## License

This project is licensed under the MIT License; see the LICENSE file for details.



