Skip to content

Amit-netizen/StoreLite

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

StoreLite — Log-Structured Key-Value Storage Engine

A production-pattern storage engine in C++17 on Linux, implementing the core internals of systems like LevelDB, Bitcask, and NetApp ONTAP's WAFL: append-only writes, write-ahead logging, crash recovery, LRU caching, and compaction.

StoreLite is a log-structured key-value storage engine inspired by LSM-tree based systems such as LevelDB and RocksDB.


Benchmark Results (Ubuntu, VirtualBox)

Mode Write Read Compaction Savings
No-sync / No-WAL (max throughput) 407K ops/s / 58.7 MB/s 308K ops/s 76.3%
WAL disabled, fsync enabled 255 ops/s 445K ops/s 2.2%
Full durable (sync + WAL) 130 ops/s 397K ops/s 55.1%

The 3,100x write throughput gap between durable and no-sync mode is not a bug. It is the fundamental storage engineering tradeoff — fdatasync waits for disk confirmation on every write. This is why real systems use WAL group commit, write coalescing, and battery-backed NVRAM.


Architecture

  Put("k", "v")
       │
       ▼
  ┌─────────┐  O_SYNC   ┌───────────┐
  │   WAL   │ ────────► │  wal.log  │  ← durable before segment write
  └────┬────┘           └───────────┘
       │
       ▼
  ┌──────────────┐  pwrite  ┌──────────────────┐
  │  Active Seg  │ ───────► │   N.seg  (disk)  │  ← append-only
  │ (write-only) │          └──────────────────┘
  └──────┬───────┘
         │  rotate when full (default 64MB)
         ▼
  ┌──────────────┐
  │  Immutable   │  ──►  Compactor  ──►  merged segment
  │  Segments    │        (background)     stale data reclaimed
  └──────────────┘
         │
         ▼
  ┌──────────────┐
  │  Hash Index  │  key → { seg_id, offset, val_len, timestamp_us }
  │ (in-memory)  │
  └──────┬───────┘
         │  cache-first on reads
         ▼
  ┌──────────────┐
  │  LRU Cache   │  O(1) get/put — doubly-linked list + hashmap
  └──────────────┘

On-Disk Record Format

┌──────────┬────────────────┬──────────┬──────────┬──────────────────────┐
│ crc32: 4 │ timestamp_us:8 │ key_len:4│ val_len:4│  key  │    value     │
└──────────┴────────────────┴──────────┴──────────┴───────┴──────────────┘
  ◄──────────────── 20-byte header ─────────────────►  variable length

val_len == 0xFFFFFFFF  →  tombstone (deletion marker)
  • CRC32 covers all bytes after the CRC field — detects bit rot and partial writes
  • Timestamp enables last-write-wins conflict resolution during compaction
  • All multi-byte fields are little-endian

Components

File Lines Responsibility
record.cpp ~130 CRC32 (table-driven), encode/decode, tombstone encoding
segment.cpp ~160 Append-only file; pwrite/pread/fdatasync; full scan for recovery
wal.cpp ~90 O_SYNC write-ahead log; replay on crash; truncate on clean shutdown
lru_cache.cpp ~70 Fixed-capacity LRU; O(1) via doubly-linked list + unordered_map
compactor.cpp ~100 Scan N segments → emit 1; keep latest timestamp per key; unlink stale files
engine.cpp ~300 Orchestrator; std::shared_mutex for concurrent reads; segment rotation
main.cpp ~180 CLI benchmark (write/read/mixed/delete) + interactive REPL

Key Systems Concepts

Concept Where used
Log-structured writes Every write is a sequential pwrite — no random I/O on write path
WAL + crash recovery O_SYNC guarantees durability; WAL replayed on startup; partial records at end ignored
fdatasync vs fsync Data-only flush per write — skips inode metadata update
pwrite / pread Offset-explicit I/O — safe for concurrent readers without seeking
Tombstone deletions Deletions write a sentinel record; compaction purges both tombstone and shadowed data
Segment rotation Active segment sealed at segment_max_bytes; becomes immutable candidate for compaction
LRU O(1) eviction Doubly-linked list (access order) + hashmap (O(1) lookup) — classic interview data structure
std::shared_mutex Multiple concurrent reads; exclusive lock only on writes and segment rotation

Screenshots

Build — zero warnings

build

102 Tests Passing — CRC32, Record, LRU, Segment, WAL, Engine

tests-top tests-pass

Benchmark — Max Throughput (no-sync, no-WAL)

407K writes/sec · 308K reads/sec · 76.3% compaction savings bench-fast

Benchmark — Durable Writes (fsync per write, no WAL)

Demonstrates the fdatasync cost: 255 writes/sec vs 407K — 1,600x slowdown for durability bench-sync

Benchmark — Full Durability (sync + WAL)

Production-safe mode: WAL flushed before segment write, fdatasync on every record bench-full


Build & Run

Requirements: g++ >= 9, make, Linux (kernel ≥ 3.0)

git clone https://github.com/<your-username>/storeLite
cd StoreLite
make

Run all 102 tests

make test

Benchmark

# Max throughput (OS-buffered writes)
./storelite --bench --no-sync --no-wal

# fsync per write, no WAL
./storelite --bench --no-wal -n 1000

# Full durability: WAL + fsync (slow — intended)
./storelite --bench

# Custom: 500k ops, 512B values
./storelite --bench --no-sync --no-wal -n 500000 -v 512

Interactive REPL

./storelite --interactive --data-dir /tmp/mydb
> put name amit
OK
> put project StoreLite
OK
> get name
amit
> del name
OK
> get name
(nil)
> scan
project = StoreLite
(1 keys)
> stats
keys=1 segs=1 bytes=1024 chits=2 cmiss=1
> compact
Compacted: kept=1 dropped=1
> quit

Compaction Deep Dive

Workload:  100k unique keys written twice (100k stale entries)
           50k keys then deleted (50k tombstones)

Before:    2 segments, ~34.6 MB
After:     1 segment,   ~8.2 MB
Savings:   76.3%

Algorithm:
  1. Rotate active segment → immutable
  2. Scan all immutable segments oldest-first
  3. Build map: key → latest {value, timestamp_us}
  4. Emit only live (non-tombstone) entries to new segment
  5. Atomically update in-memory index
  6. unlink() old segment files

Test Coverage (102 tests)

Suite Count What is tested
CRC32 4 Known vector 0xCBF43926, determinism, collision detection
Record encode/decode 6 Round-trip, tombstone, CRC corruption rejection, short buffer
LRU Cache 8 Basic get/put, eviction order, invalidation, hit/miss counters
Segment 4 Append, scan, tombstone records, persist across reopen
WAL 5 Log/replay, tombstone replay, clear, empty check
Engine (integration) 75 put/get/del, overwrite, persistence across reopen, scan, post-compaction reads

Why Log-Structured?

Traditional B-tree storage engines (like the ones used in older databases) require random I/O for updates — finding the right page, modifying it, writing it back. Log-structured engines convert all writes to sequential appends, which is dramatically faster on both spinning disks (no seek time) and SSDs (avoids write amplification).

The tradeoff: read amplification and space amplification grow over time as stale data accumulates — which is exactly what compaction solves.

This is the same architecture used in LevelDB, RocksDB, Apache Cassandra, and the core write path of NetApp ONTAP's WAFL (Write Anywhere File Layout).


License

MIT

About

Log-structured key-value storage engine in C++17 — WAL, crash recovery, compaction, and LRU cache. 407K writes/sec. Inspired by LevelDB and Bitcask.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors