nasitsony/README.md

Hi, I'm Nasit Sony 👋

Distributed Systems & AI Infrastructure Engineer

I build correctness-first systems, from storage engines and consensus protocols to fault-tolerant pipelines and orchestration platforms.

My focus is simple:

Systems must remain correct under failure, not just under ideal conditions.


πŸ—οΈ System Overview

AI Infrastructure Stack

End-to-end AI infrastructure stack: workload → orchestration → serving → storage → consensus


🧠 The Idea

AI systems are not just pipelines.

They are multi-layer distributed systems, where each layer handles a different class of failures:

  • crashes
  • retries & duplicate processing
  • network delays and reordering
  • resource contention
  • adversarial behavior (Byzantine faults)

I design systems where correctness is enforced at every layer.


βš™οΈ What I Build (Layered System)

🔄 SmartSearch - Client Workload

Fault-tolerant ingestion and semantic retrieval system.

  • Kafka replay & duplicate handling
  • Idempotent ingestion
  • Deterministic processing & recovery
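
The idempotent-ingestion idea above can be sketched in a few lines of Python. This is an illustration only, not SmartSearch's actual code: `IdempotentIngestor`, `replay`, and the in-memory `seen_ids` set are hypothetical names, and a real pipeline would persist the deduplication state atomically with the data.

```python
# Hypothetical sketch of idempotent ingestion; not SmartSearch's implementation.
class IdempotentIngestor:
    """Applies each message at most once, keyed by a stable message ID."""

    def __init__(self):
        self.seen_ids = set()   # assumption: kept in memory for brevity
        self.documents = {}     # doc_id -> content

    def ingest(self, message_id, doc_id, content):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.documents[doc_id] = content
        self.seen_ids.add(message_id)
        return True


def replay(ingestor, log):
    """Feed (message_id, doc_id, content) tuples and count how many were applied."""
    return sum(ingestor.ingest(*record) for record in log)
```

Replaying the same log a second time applies nothing, which is the property that makes broker replay and redelivery safe.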

βš™οΈ Veriflow β€” Orchestration Layer

Kubernetes-based AI workload orchestrator.

  • GPU-aware scheduling (type + count)
  • Gang scheduling for distributed training
  • Multi-tenant fairness & isolation
  • Idempotent job submission + retries
  • Reconciliation-driven execution
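
Gang scheduling means a distributed job starts only if every worker can be placed at once. The sketch below shows that all-or-nothing rule combined with a GPU type and count check. It is a simplified illustration under assumed inputs, not Veriflow's scheduler: `gang_schedule` and the `free_gpus` layout are hypothetical.

```python
# Hypothetical sketch of all-or-nothing (gang) placement; not Veriflow's code.
def gang_schedule(free_gpus, gpu_type, workers, gpus_per_worker):
    """Place every worker or none.

    free_gpus: dict mapping node name -> {gpu_type: free_count}
    Returns a list with one node name per worker, or None if the gang
    does not fit. The caller's free_gpus dict is never modified.
    """
    remaining = {node: gpus.get(gpu_type, 0) for node, gpus in free_gpus.items()}
    placement = []
    for _ in range(workers):
        # Nodes are scanned in sorted order so the result is deterministic.
        node = next(
            (n for n in sorted(remaining) if remaining[n] >= gpus_per_worker),
            None,
        )
        if node is None:
            return None  # reject the whole job rather than start a partial gang
        remaining[node] -= gpus_per_worker
        placement.append(node)
    return placement
```

Working on a copy of the capacity map means a rejected job leaves no partially reserved GPUs behind.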

🧠 LLM Serving Cache - Serving Layer

Inference-time control plane for cache placement and routing.

  • Prefix-aware cache reuse
  • Session-affinity routing
  • Cache fill on miss → reuse on hit
  • Persistent cache state
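
Prefix-aware reuse can be pictured as a longest-prefix lookup over token sequences. The following is a minimal sketch under assumptions, not the project's implementation: `PrefixCache` and its methods are hypothetical, and the repeated slicing is chosen for clarity (a trie would avoid it).

```python
# Hypothetical sketch of prefix-aware cache lookup; not the project's code.
class PrefixCache:
    def __init__(self):
        self.entries = {}  # tuple of tokens -> handle to cached state

    def put(self, tokens, state):
        """Record cached state for an exact token sequence (fill on miss)."""
        self.entries[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix.

        Returns (0, None) when no prefix of the request is cached.
        """
        tokens = tuple(tokens)
        for end in range(len(tokens), 0, -1):
            state = self.entries.get(tokens[:end])
            if state is not None:
                return end, state
        return 0, None
```

A hit of length k means only the tokens after position k still need to be computed for that request.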

🧱 VeriStore - Storage Layer (KV + Object Store)

Crash-consistent storage engine with S3-style abstraction.

  • Write-Ahead Logging (WAL)
  • Deterministic recovery via replay
  • Snapshot + restore
  • Raft-based replication
  • Object storage interface built on KV engine
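
To illustrate write-ahead logging with recovery by replay, here is a small Python sketch. VeriStore itself is written in C++ and its record format is not described here, so everything below is an assumption: the length-plus-CRC32 header, `append_record`, and `replay` are hypothetical.

```python
# Hypothetical WAL sketch; the record layout is assumed, not VeriStore's format.
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def append_record(log, payload):
    """Append one length-prefixed, checksummed record to a bytearray log."""
    log.extend(HEADER.pack(len(payload), zlib.crc32(payload)))
    log.extend(payload)


def replay(log):
    """Return intact records in order, stopping at the first torn or corrupt one."""
    records, offset = [], 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = bytes(log[start:start + length])
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partial write or corruption: ignore the tail
        records.append(payload)
        offset = start + length
    return records
```

Because replay reads only the log bytes and stops at the first bad record, running it twice on the same log yields the same state, which is what makes recovery deterministic.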

🧠 Consensus - Raft + Async BFT

Agreement layer across replicas.

  • Leader-based consensus (Raft)
  • Asynchronous Byzantine agreement (MVBA)
  • Handles failures, delays, adversarial nodes
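
The two fault models differ in how many replicas they need. The arithmetic below uses the standard bounds: a majority quorum for crash faults, and n ≥ 3f + 1 for Byzantine faults. The function names are hypothetical and are not taken from the repositories.

```python
# Standard fault-tolerance bounds; function names are illustrative only.
def crash_quorum(n):
    """Majority quorum used by leader-based protocols such as Raft."""
    return n // 2 + 1


def crash_faults_tolerated(n):
    """Crashed replicas a majority-quorum system of n replicas survives."""
    return (n - 1) // 2


def byzantine_faults_tolerated(n):
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


def byzantine_wait_threshold(n):
    """Replies a node can wait for without blocking on faulty peers: n - f."""
    return n - byzantine_faults_tolerated(n)
```

For example, five replicas tolerate two crashes with a quorum of three, while four replicas tolerate one Byzantine node with a threshold of three.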

πŸ” AgentFlow β€” Workflow Engine

Failure-aware workflow execution system.

  • Step-level retries
  • Timeout handling
  • Deterministic state transitions
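
Step-level retries and explicit state transitions can be combined as in the sketch below. It is an illustration, not AgentFlow's code (the project is written in Java): the state names, `ALLOWED` table, and `Step` class are hypothetical, and a timeout is assumed to surface as an exception raised by the step's action.

```python
# Hypothetical sketch of a retried step with checked transitions.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "RETRYING", "FAILED"},
    "RETRYING": {"RUNNING"},
}


class Step:
    def __init__(self, name, action, max_attempts=3):
        self.name = name
        self.action = action
        self.max_attempts = max_attempts
        self.state = "PENDING"
        self.history = ["PENDING"]

    def _move(self, new_state):
        """Change state only along an edge listed in ALLOWED."""
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)

    def run(self):
        """Run the action, retrying on any exception; return None if it never succeeds."""
        for attempt in range(1, self.max_attempts + 1):
            self._move("RUNNING")
            try:
                result = self.action()
            except Exception:
                self._move("RETRYING" if attempt < self.max_attempts else "FAILED")
                continue
            self._move("SUCCEEDED")
            return result
        return None
```

Terminal states have no outgoing edges in `ALLOWED`, so a finished step cannot be silently re-run.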

💥 Failure Model

Each layer addresses a specific class of failure:

| Layer | Handles |
| --- | --- |
| Orchestration | Resource contention, partial execution |
| Serving | Redundant computation, routing correctness |
| Pipeline | Retries, duplicates |
| Storage | Crashes, partial writes |
| Consensus | Adversarial nodes, agreement |

⚑ Experience Snapshot

💰 Production Systems (Fintech)

  • Built systems handling $600M+ annual volume
  • Focus: correctness, consistency, reliability

🔬 Distributed Systems & BFT Research

  • Published in Springer journals & conferences
  • Designed asynchronous BFT protocols
  • Bridging theory ↔ real systems

💡 Engineering Philosophy

I design systems for failure, not just success.

I ask:

  • What if a worker crashes mid-processing?
  • What if a write is partially persisted?
  • What if messages are replayed?
  • What if nodes behave maliciously?

I build systems that:

  • recover deterministically
  • enforce explicit state transitions
  • prevent duplication and corruption
  • remain correct under failure

🧰 Tech Stack

Languages:
Java, C++, Go, Python

Backend & Infra:
Spring Boot, Kafka, PostgreSQL, Kubernetes, Docker

Distributed Systems:
WAL, replication, consensus (Raft, BFT), idempotency, retries

AI Infrastructure:
Embeddings, RAG pipelines, vector search (pgvector)


📚 Research

Prioritized-MVBA: Asynchronous Byzantine Agreement Protocol
Published in Springer journals & international conferences

🔗 https://scholar.google.com/citations?user=mBIQ1-0AAAAJ&hl=en


🎯 Current Focus

  • Distributed systems & storage engines
  • Fault-tolerant AI infrastructure
  • Consensus protocol engineering

📬 Connect

🔗 LinkedIn: https://www.linkedin.com/in/nasitsony

Pinned

  1. VeriStore

    Correctness-first C++ storage engine with WAL durability, crash recovery, Raft replication, and a minimal S3-style object store.

    C++

  2. veriflow-control-plane

    Fault-tolerant Kubernetes job orchestration control plane with persistent lifecycle tracking and reconciliation-driven execution recovery.

    Go

  3. SmartSearch

    Production-style semantic search and RAG backend built as a distributed system. Features async ingestion (Kafka), embedding pipelines, pgvector search, and strong reliability guarantees, including…

    Java

  4. async-bft-suite

    Prototype framework implementing three asynchronous BFT agreement protocols (Cachin MVBA, VABA, pMVBA) with a unified simulation harness and comparable metrics.

    Python

  5. llm-serving-cache

    Distributed inference cache using VeriStore

    C++

  6. agentflow

    Control-plane system for reliable, stateful task orchestration with idempotency, retries, and failure-aware execution.

    Java