Distributed Systems & AI Infrastructure Engineer
I build correctness-first systems, from storage engines and consensus protocols to fault-tolerant pipelines and orchestration platforms.
My focus is simple:
Systems must remain correct under failure, not just under ideal conditions.
End-to-end AI infrastructure stack: workload → orchestration → serving → storage → consensus
AI systems are not just pipelines.
They are multi-layer distributed systems, where each layer handles a different class of failures:
- crashes
- retries & duplicate processing
- network delays and reordering
- resource contention
- adversarial behavior (Byzantine faults)
I design systems where correctness is enforced at every layer.
Fault-tolerant ingestion and semantic retrieval system.
- Kafka replay & duplicate handling
- Idempotent ingestion
- Deterministic processing & recovery
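The core of the duplicate-handling above is idempotent ingestion: a replayed Kafka message must not produce a second side effect. Here is a minimal sketch of that idea, with hypothetical names (`IdempotentIngester`, `ingest`) and an in-memory seen-set standing in for a durable deduplication store:

```python
# Sketch of an idempotent consumer: replayed messages are detected via a
# processed-ID set, so reprocessing the same message is a no-op.
class IdempotentIngester:
    def __init__(self):
        self._seen = set()      # in production: a durable store, not memory
        self.store = []         # stand-in for the downstream sink

    def ingest(self, msg_id, payload):
        if msg_id in self._seen:
            return False        # duplicate delivery: skip all side effects
        self.store.append(payload)
        self._seen.add(msg_id)  # mark processed only after the write succeeds
        return True
```

Marking the ID only after the write means a crash between the two steps leads to a retry, never a lost message; the dedup check turns that retry into a no-op.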
Kubernetes-based AI workload orchestrator.
- GPU-aware scheduling (type + count)
- Gang scheduling for distributed training
- Multi-tenant fairness & isolation
- Idempotent job submission + retries
- Reconciliation-driven execution
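Gang scheduling means a distributed training job is placed all-or-nothing: partial placement would deadlock on collective operations. A minimal sketch of that admission check, under the simplifying assumption that nodes are described by a free-GPU count (function name `gang_schedule` is hypothetical):

```python
def gang_schedule(gpus_per_replica, nodes):
    """All-or-nothing placement: every replica of the job gets a GPU
    slot, or none are scheduled and the job waits in the queue."""
    free = dict(nodes)               # work on a copy; commit only on success
    placement = []
    for need in gpus_per_replica:
        node = next((n for n, cap in free.items() if cap >= need), None)
        if node is None:
            return None              # cannot place the full gang: reject atomically
        free[node] -= need
        placement.append((node, need))
    return placement
```

A real scheduler would also weigh GPU type, tenant quotas, and fairness; this shows only the atomic admit/reject decision.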
Inference-time control plane for cache placement and routing.
- Prefix-aware cache reuse
- Session-affinity routing
- Cache fill on miss, reuse on hit
- Persistent cache state
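Prefix-aware reuse boils down to: given a new prompt, find the longest already-cached token prefix so only the suffix is recomputed. A minimal sketch over a flat list of cached token sequences (a production system would use a trie or radix tree; the name `longest_cached_prefix` is hypothetical):

```python
def longest_cached_prefix(prompt_tokens, cache):
    """Return the length of the longest fully cached token prefix of the
    prompt; the server recomputes only the uncached suffix."""
    best = 0
    for cached in cache:
        n = min(len(cached), len(prompt_tokens))
        k = 0
        while k < n and cached[k] == prompt_tokens[k]:
            k += 1
        # only a complete cached entry is a reusable prefix
        if k == len(cached) and k > best:
            best = k
    return best
```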
Crash-consistent storage engine with S3-style abstraction.
- Write-Ahead Logging (WAL)
- Deterministic recovery via replay
- Snapshot + restore
- Raft-based replication
- Object storage interface built on KV engine
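The WAL discipline above has a simple invariant: every write reaches the log before the in-memory state, so replaying the log after a crash rebuilds exactly the same state. A minimal sketch with hypothetical names (`WalKV`; the fsync and file I/O of a real engine are elided):

```python
import json

class WalKV:
    """Crash-consistent KV sketch: log first, then apply. Recovery is a
    deterministic replay of the log from the beginning."""
    def __init__(self, log_lines=None):
        self.log = list(log_lines or [])
        self.state = {}
        for line in self.log:                          # recovery = replay
            rec = json.loads(line)
            self.state[rec["k"]] = rec["v"]

    def put(self, k, v):
        self.log.append(json.dumps({"k": k, "v": v}))  # 1) append (fsync elided)
        self.state[k] = v                              # 2) apply to memory
```

A crash between steps 1 and 2 is harmless: the record is in the log, so replay applies it on restart.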
Agreement layer across replicas.
- Leader-based consensus (Raft)
- Asynchronous Byzantine agreement (MVBA)
- Handles failures, delays, adversarial nodes
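Both Raft and Byzantine agreement rest on quorum intersection: any two quorums must share at least one correct node. That is a majority under crash faults, and roughly two thirds under Byzantine faults (which is why BFT needs n ≥ 3f + 1). A small illustration (function name `quorum_size` is hypothetical):

```python
def quorum_size(n, byzantine):
    """Smallest quorum such that any two quorums intersect in at least
    one correct node: majority for crash faults (Raft), 2f + 1 out of
    n >= 3f + 1 for Byzantine faults."""
    if byzantine:
        f = (n - 1) // 3          # max tolerable Byzantine nodes
        return 2 * f + 1          # two quorums overlap in > f nodes
    return n // 2 + 1             # simple majority
```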
Failure-aware workflow execution system.
- Step-level retries
- Timeout handling
- Deterministic state transitions
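Step-level retries with explicit state transitions can be sketched as follows; names (`run_step`, the `PENDING`/`DONE`/`FAILED` states) are illustrative, and a real executor would distinguish timeouts from other errors and persist each transition:

```python
def run_step(step, attempts=3):
    """Retry a workflow step up to `attempts` times; the step's outcome
    is an explicit state transition, never an implicit exception."""
    last_err = None
    for _ in range(attempts):
        try:
            result = step()
            return "DONE", result          # explicit success transition
        except Exception as e:             # includes timeouts in practice
            last_err = e                   # transient failure: retry
    return "FAILED", last_err              # retries exhausted: terminal state
```

Because the step itself must be idempotent, re-running it after a crash mid-attempt is safe, which is what makes the retry loop sound.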
Each layer addresses a specific class of failure:
| Layer | Handles |
|---|---|
| Orchestration | Resource contention, partial execution |
| Serving | Redundant computation, routing correctness |
| Pipeline | Retries, duplicates |
| Storage | Crashes, partial writes |
| Consensus | Adversarial nodes, agreement |
- Built systems handling $600M+ annual volume
- Focus: correctness, consistency, reliability
- Published in Springer journals & conferences
- Designed asynchronous BFT protocols
- Bridging theory and real systems
I design systems for failure, not just success.
I ask:
- What if a worker crashes mid-processing?
- What if a write is partially persisted?
- What if messages are replayed?
- What if nodes behave maliciously?
I build systems that:
- recover deterministically
- enforce explicit state transitions
- prevent duplication and corruption
- remain correct under failure
Languages:
Java, C++, Go, Python
Backend & Infra:
Spring Boot, Kafka, PostgreSQL, Kubernetes, Docker
Distributed Systems:
WAL, replication, consensus (Raft, BFT), idempotency, retries
AI Infrastructure:
Embeddings, RAG pipelines, vector search (pgvector)
Prioritized-MVBA: Asynchronous Byzantine Agreement Protocol
Published in Springer journals & international conferences
🔗 Google Scholar: https://scholar.google.com/citations?user=mBIQ1-0AAAAJ&hl=en
- Distributed systems & storage engines
- Fault-tolerant AI infrastructure
- Consensus protocol engineering
🔗 LinkedIn: https://www.linkedin.com/in/nasitsony

