TrStream

A distributed real-time transaction processing pipeline

Quick Start - Architecture - Features - Integrations

About

TrStream is a distributed data pipeline designed to simulate and process high-throughput financial transaction streams in real time.

The project explores the architectural patterns behind modern fintech and data engineering systems, including:

Event-driven ingestion
Scalable stream processing
Lakehouse-style storage
Analytical querying

It was developed as a personal project to apply concepts from Designing Data-Intensive Applications (Martin Kleppmann) with a strong focus on system design and data flow.

Motivation

Modern financial platforms ingest millions of events per day and must:

Reliably process asynchronous events
Retain raw events for auditing and replay
Progressively optimize data layout for analytics
Remain horizontally scalable and loosely coupled

TrStream models this workflow locally using open-source technologies, emphasizing clear data stages and realistic ingestion patterns.

Overview

At a high level, the pipeline:

Ingests transaction events from external systems (e.g. Stripe webhooks) and internal producers
Buffers and distributes events via Kafka
Persists immutable raw events in a data lake (Parquet on S3-compatible storage)
Processes and reorganizes data into optimized analytical layouts
Exposes a SQL query layer on analytics-ready data

All components are containerized and orchestrated with Docker Compose, allowing horizontal scaling of producers and consumers.

Key features

Event-driven ingestion via webhooks and producers
Kafka-based buffering with horizontal scalability through partitioning
Clear separation of data lifecycle stages: raw → processed → analytics
Immutable Parquet storage for auditing and replay
File compaction and layout optimization for analytical workloads
SQL querying on object storage (Athena-like experience)
Lightweight SQL editor implemented with Streamlit
Fully containerized local environment with Docker Compose orchestration and explicit health checks

Architecture

Data flow

Event sources
- External providers (e.g. Stripe) deliver events through a webhook service
- Internal producers simulate transaction streams with configurable rates and distributions
Kafka ingestion
- All events are published to Kafka topics
- Kafka provides buffering, ordering, partitioning and backpressure handling
- Kafka UI exposes end-to-end observability of topics and consumers
Consumers and storage
- Kafka consumers persist events to MinIO (S3-compatible storage)
- Data flows through explicit lifecycle stages:
  - Raw: immutable, append-only Parquet files mirroring incoming events
  - Processed: cleaned and normalized data derived from raw events
  - Analytics: compacted and query-optimized Parquet files
Processing and compaction
- A processor transforms raw data into cleaned and normalized datasets
- A compacter merges small files and optimizes layout for analytical queries
- Both services are schema- and source-aware, ensuring consistent and query-ready outputs
Query and visualization
- A query service exposes SQL access (DuckDB-based) over analytics data
- A Streamlit UI provides an interactive SQL editor for data exploration

Screenshots

Kafka UI: monitor topics, partitions, and consumer groups in real time

MinIO Console: browse buckets, view file metadata, and manage storage

SQL Editor: write and execute SQL queries against analytics data

Quick Start

Prerequisites

Docker and Docker Compose installed locally
Python 3.10+

Launch the pipeline

Clone the repository:

git clone https://github.com/AlessioCappello2/TrStream.git
cd TrStream

Build all images:

scripts/scripts-cli/build.sh

Start core services:

scripts/scripts-cli/run.sh

Start everything (including UI):

scripts/scripts-cli/run_all.sh

Access the services:
- Kafka UI: http://localhost:8080
- MinIO Console: http://localhost:9001
- SQL Query API: http://localhost:8000
- SQL Editor: http://localhost:8501

Scaling

Scale producers and consumers horizontally:

scripts/scripts-cli/run.sh producer=3 consumer=4

See scripts documentation for more options.

Integrations

TrStream supports real-world payment provider integrations:

Stripe

Webhook ingestion for payment events
Signature verification and validation
Setup Guide

Revolut

Sandbox API integration for transaction data
Redis-backed event queue and serverless webhook deployment with Vercel
Setup Guide

Each integration is independently deployable and configurable, allowing users to choose which data sources to include in their pipeline.

Tech stack

Component	Technology
Ingestion	Kafka, Webhooks
Messaging	Kafka (Bitnami legacy image)
Storage	MinIO (S3-compatible)
Processing	Python, PyArrow, Boto3
Query engine	DuckDB
API layer	FastAPI
Visualization	Streamlit
Monitoring	Kafka UI (Provectus Labs)
Orchestration	Docker Compose

Local Access Points

Component	URL
Stripe Webhook	http://localhost:8100
Kafka UI	http://localhost:8080
MinIO Browser	http://localhost:9000
MinIO Console	http://localhost:9001
SQL Querier API	http://localhost:8000
Streamlit Editor	http://localhost:8501

Backend Services (no web interface):

Kafka: localhost:9092 (connect via Kafka clients)

Tip: Use Kafka UI to monitor Kafka topics and messages.

Roadmap

Completed

Kafka-based event streaming
Multi-stage data lifecycle (raw → processed → analytics)
DuckDB query layer
Streamlit SQL editor
Stripe integration with webhook handling
Revolut integration with sandbox API

Planned

Observability
- Prometheus metrics
- Grafana dashboards
Job scheduling
- Airflow DAGs for processing and compaction

Future considerations

Additional payment provider integrations (e.g. Adyen, PayPal)
Real-time alerting and anomaly detection on transaction streams

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 47 Commits
config		config
images		images
keys		keys
minio		minio
scripts		scripts
src		src
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TrStream

Table of Contents

About

Motivation

Overview

Key features

Architecture

Data flow

Screenshots

Quick Start

Prerequisites

Launch the pipeline

Scaling

Integrations

Stripe

Revolut

Tech stack

Local Access Points

Roadmap

Completed

Planned

Future considerations

License

About

Uh oh!

Releases 3

Packages

Languages

License

AlessioCappello2/TrStream

Folders and files

Latest commit

History

Repository files navigation

TrStream

Table of Contents

About

Motivation

Overview

Key features

Architecture

Data flow

Screenshots

Quick Start

Prerequisites

Launch the pipeline

Scaling

Integrations

Stripe

Revolut

Tech stack

Local Access Points

Roadmap

Completed

Planned

Future considerations

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Languages

Packages