A multi-core CPU cache hierarchy simulator with MESI coherency protocol
Built in C++20. Models the full memory hierarchy (L1 → L2 → L3 → DRAM) across two simulated cores with a complete MESI bus protocol, configurable replacement policies, and trace-driven simulation mode. Every cache line state transition is unit-tested.
Modern CPUs spend more time waiting for memory than executing instructions. Understanding why — and being able to simulate, measure, and prove it — is what separates systems engineers from application developers. This simulator makes the invisible visible: you can watch false sharing destroy cache performance in real time, see LRU vs FIFO behave differently under thrash, and measure exactly how many DRAM fetches a workload costs.
- 3-level hierarchy — L1 (32KB, 8-way) per core, L2 (256KB, 4-way) per core, L3 (4MB, 16-way) shared
- MESI coherency protocol — BusRd, BusRdX, BusUpgr between two cores; M→S downgrade on foreign read, write-invalidate on write
- Replacement policies — LRU, FIFO, Random (selectable at runtime via
--policy) - Write policies — Write-back with allocate (default), Write-through (
--write-through) - Trace-driven mode — replay any memory access trace file (
--trace) - 5 built-in workloads — sequential scan, false sharing, ping-pong, hot loop, cache thrash
- Per-level statistics — accesses, hits, misses, writebacks, MESI invalidations, hit rate, DRAM fetches
- 27 unit tests — every MESI state transition, LRU eviction order, writeback-on-eviction, write-through propagation
# Requirements: GCC 13+ or Clang 14+, CMake >= 3.16, C++20
git clone https://github.com/YOUR_USERNAME/CacheSim.git
cd CacheSim
mkdir build && cd build
cmake ..
make -j$(nproc)./test_cache./cache_sim --all-workloads./cache_sim --workload false_sharing --iters 64./cache_sim --workload sequential --iters 128./cache_sim --workload hot_loop --iters 128./cache_sim --workload ping_pong --iters 64./cache_sim --workload thrash --iters 128./cache_sim --workload false_sharing --iters 10 --verbose./cache_sim --trace ../traces/false_sharing.trace --verbose./cache_sim --workload sequential --policy fifo./cache_sim --workload thrash --policy random./cache_sim --workload false_sharing --write-through --iters 64./cache_sim --help# one access per line: <core_id> <R|W> <hex_address>
0 R 0x1000
1 W 0x1004
0 R 0x1000
L1-Core0 accesses=128 hits=112 misses=16 hit%=87.5%
DRAM fetches: 16
128 elements × 8 bytes = 1024 bytes. Block size 64 bytes → 16 cache line fetches. Each fetch brings 8 elements → 7 free hits per line. Exactly 87.5%.
L1-Core0 accesses=1280 hits=1280 misses=0 hit%=100.0%
L2/L3: accesses=0
DRAM fetches: 0
4-element working set fits in L1. After warmup, 1280 accesses with zero memory traffic.
L1-Core0 accesses=128 hits=0 misses=128 WBs=128 INVs=128 hit%=0.0%
L1-Core1 accesses=128 hits=0 misses=128 WBs=127 INVs=127 hit%=0.0%
L3 accesses=256 hits=255 misses=1
DRAM fetches: 1
Only 1 DRAM fetch. The data fits comfortably in L3. Yet L1 hit rate is 0%. Core 0 writes → BusUpgr invalidates Core 1's copy. Core 1 writes → BusUpgr invalidates Core 0's copy. 128 times each. Two threads writing separate variables that share a cache line destroy each other's performance.
Verbose output shows the bus traffic:
C0 W 0x1000 | MESI BusUpgr/BusRdX: Core 1 S→I L1 write miss L2 write miss L3 write hit
C1 W 0x1004 | MESI BusUpgr/BusRdX: Core 0 M→I (writeback) [WB] L1 write miss L3 write hit
Fix in real code: alignas(64) int a; alignas(64) int b; — one variable per cache line.
L1-Core0 hits=0 INVs=32 hit%=0.0%
L1-Core1 hits=0 INVs=31 hit%=0.0%
DRAM fetches: 1
Same address, alternating writes. Every write invalidates the other core. Zero L1 utility.
L1-Core0 accesses=128 hits=0 misses=128 hit%=0.0%
L3 hits=0 misses=128
DRAM fetches: 128
Stride = L1 size. All accesses map to set 0. 8-way L1 fills after 8 accesses, then evicts on every subsequent access. 128 DRAM fetches for 128 accesses — the cache is completely bypassed despite being empty.
Core 0 ──── L1C0 (32KB, 8-way, 64B) ──── L2C0 (256KB, 4-way, 64B) ──┐
├── L3 (4MB, 16-way, 64B) ── DRAM
Core 1 ──── L1C1 (32KB, 8-way, 64B) ──── L2C1 (256KB, 4-way, 64B) ──┘
CacheLevel::read() and write() only probe and count stats — they never auto-allocate on miss. The CacheHierarchy controller runs the snoop bus, determines the correct MESI state (E vs S depends on whether any other core holds the line), then calls fill() with the correct state. This is how real cache controllers work. Auto-allocating on miss with a hardcoded state is incorrect for MESI.
| State | Meaning | On read by other core | On write by other core |
|---|---|---|---|
| Modified | Dirty, exclusive | Flush + M→S, other gets S | Flush + M→I |
| Exclusive | Clean, exclusive | E→S, other gets S | E→I |
| Shared | Clean, shared | Stay S | S→I |
| Invalid | Not present | — | — |
CacheSim/
├── include/
│ ├── cache_line.h # CacheLine struct, MESIState enum
│ ├── cache_config.h # CacheConfig: sets/ways/block_size, address decode
│ ├── cache_level.h # CacheLevel: probe/read/write/fill/invalidate/downgrade
│ ├── cache_hierarchy.h # CacheHierarchy: L1/L2/L3 wiring + MESI bus protocol
│ └── trace_player.h # TracePlayer: trace file parser and replay engine
├── src/
│ └── main.cpp # CLI driver + 5 built-in workloads
├── tests/
│ └── test_cache.cpp # 27 unit tests (self-registering, no framework needed)
├── traces/
│ └── false_sharing.trace # Example trace demonstrating MESI invalidation storm
└── CMakeLists.txt
test_cold_miss_then_hit — first access misses, second hits
test_mesi_exclusive_on_first_read — sole reader gets Exclusive state
test_mesi_shared_on_second_read — second reader downgrades both to Shared
test_mesi_modified_on_write — write transitions line to Modified
test_mesi_write_invalidates_other_core — BusUpgr invalidates other core's copy
test_mesi_modified_downgrade_on_foreign_read — M→S flush on read from other core
test_lru_eviction_order — LRU evicts least recently used way
test_false_sharing_generates_invalidations — adjacent vars in same line → INVs
test_writeback_on_eviction — dirty eviction increments WB counter
test_write_through_propagates_to_l2 — WT write updates L2 state to Modified
test_spatial_locality_block_reuse — adjacent address in same block hits
Results: 27 passed, 0 failed
- L3 back-invalidation on eviction — when L3 evicts a line, invalidate all L1/L2 copies
- PLRU (Pseudo-LRU) — tree-based O(1) update; what real Intel/AMD CPUs use
perf memtrace import — convert Linux perf binary traces for real workload replay- MOESI / MESIF — add Owned or Forward states for directory-based protocols
- Multi-level inclusion violation tracking — detect when L2 evicts a line still in L1
- Valgrind cachegrind comparison — run same program through both, compare miss rates
Developed a trace-driven CPU cache hierarchy simulator in C++20 — models L1/L2/L3 with MESI coherency protocol (BusRd/BusRdX/BusUpgr), LRU/FIFO/Random eviction, and write-back/write-through policies; demonstrated false sharing reducing L1 hit rate from 87% to 0% via BusUpgr invalidation storms across 2 simulated cores; 27 unit tests covering all MESI state transitions and eviction edge cases.
MIT












