Skip to content

cache: replace read/dirty with a concurrent hash-trie #6

Draft
twmb wants to merge 6 commits into main from trie-cache

Conversation


@twmb twmb commented Apr 21, 2026

Summary

Replaces the read/dirty/promote backing of cache.Cache with a concurrent
hash-trie modeled on Go's internal/sync.HashTrieMap, implemented entirely in
userspace (hash/maphash + unsafe.Sizeof(uintptr(0)); no internal/abi or
internal/goarch). Public API is unchanged; all documented semantics are
preserved, with two minor behavior shifts noted below.

Why

  • The read/dirty design amortizes a write-map promotion across operations; the
    per-op cost is decent on average but adversarial workloads pay heavily for
    the copy. The trie spreads cost across per-bucket locks and eliminates the
    promotion cycle.
  • Upstream sync.Map moved to a hash trie in Go 1.24, so the old inspiration
    in this package's docs was stale.
  • Singleflight, stale values, TTLs, idle-age, and error caching all live above
    the storage layer and carry over unchanged.

Changes, by commit

  1. bump to go 1.24, fix stale-at-epoch, tighten Clean — baseline cleanup.
    go.mod claimed 1.18 but the code required 1.19+ (atomic.Pointer[T],
    atomic.Int64, atomic.Uint32). newStale was producing a stale born at
    the Unix epoch when the main entry's expiry was <= 0; it now bases the
    stale's expiry on now(). Clean now calls e.del() directly instead of
    c.Delete(k) since c.each has already promoted.
  2. expand test suite; fix data race in Get's post-miss return — brings coverage
    to 100% at the baseline. The new race-targeted tests caught a real data
    race: Cache.Get read l.v without any happens-before edge when a
    concurrent Swap had replaced e.p with its own loading. Fix: l.wg.Wait()
    before reading l.v to synchronize with whoever called wg.Done — either
    setve or Swap's defer.
  3. add concurrent hash-trie in isolation — new cache/trie.go + unit tests.
    Standalone, not yet wired into Cache.
  4. back Cache with the hash-trie; retire read/dirty machinery — Cache
    holds a trie[K, loading[V]] instead of r/mu/dirty/misses/pd. An
    ent[K, V] type alias (Go 1.24 generic aliases) names the trie's concrete
    entry type. Entry operations (entGet, entTryGet, entDel, entLoad,
    entMaybeNewStale) are free functions over that alias. Delete physically
    removes via deleteEntryIf with a "still tombstoned under lock" predicate
    to avoid clobbering a concurrent Swap resurrection; keys become eligible
    for GC promptly (TestIssue40999 still passes).
  5. README: rewrite for the trie-backed cache — new design blurb, fresh
    benchmark numbers.
  6. port the stdlib sync.Map tests we were missing — TestConcurrentClear,
    TestMapClearOneAllocation, TestMapRangeNoAllocations, BenchmarkClear.

Semantic shifts

Both acceptable in my read, but flagging explicitly:

  • Range loses its promote-snapshot flavor. Previously c.each promoted
    under c.mu, so Range iterated a fixed snapshot. Now Range is a lock-free
    walk over atomic pointers; concurrent mutations may or may not be visible.
    This is the same guarantee sync.Map.Range documents and the existing
    docstring already uses that wording.
  • Clean no longer serializes on c.mu. Iteration is lock-free; per-key
    physical removal takes only the trie's bucket lock. Faster, but callers who
    relied on Clean holding a global lock should know.

Benchmarks

Apple M1, Go 1.26, -count=5 -benchtime=1s, benchstat comparing the
pre-trie baseline to the current branch.

Time per op — geomean -35.24%.

Largest wins:

Benchmark                    Pre-trie   Trie      Δ
AdversarialAlloc             350.8 ns   11.4 ns   -96.74%
AdversarialDelete            132.2 ns   6.5 ns    -95.10%
SwapMostlyMisses             1032 ns    154 ns    -85.11%
LoadOrStoreUnique            1449 ns    592 ns    -59.17%
LoadAndDeleteBalanced        10.7 ns    4.4 ns    -58.66%
LoadAndDeleteUnique          4.3 ns     1.8 ns    -58.59%
LoadOrStoreBalanced          702 ns     318 ns    -54.77%
CompareAndDeleteMostlyHits   131 ns     91 ns     -30.63%
DeleteCollision              5.1 ns     3.6 ns    -29.23%
SwapCollision                338 ns     242 ns    -28.47%
CompareAndSwapNoExistingKey  2.5 ns     1.9 ns    -22.97%

Regressions:

Benchmark                    Pre-trie   Trie      Δ
LoadAndDeleteCollision       7.1 ns     67.6 ns   +853.35%
CompareAndSwapCollision      10.1 ns    17.6 ns   +73.45%
LoadMostlyMisses             3.4 ns     4.6 ns    +33.40%

The *Collision regressions are intrinsic: the trie physically removes on
Delete (required to keep keys GC-eligible per TestIssue40999) so every
Delete takes the bucket lock. Pre-trie just did a single CAS to nil and
deferred physical removal to the promotion cycle. When many goroutines hammer
the same key, they all serialize on that bucket mutex. LoadMostlyMisses
pays a small fixed cost for the hash-walk on every miss.

Allocations. AdversarialAlloc and AdversarialDelete drop to 0 B/op
(pre-trie was 55 B/op and 36 B/op, respectively — the cost of dirty-map promotion
copying). SwapMostlyMisses drops from 4 allocs/op to 2. Per-op allocations
on the insert/swap paths (LoadOrStore*, Swap*Hits, CompareAndSwap*Hits)
are intrinsic: each operation may install a fresh loading[V] carrying the
singleflight machinery (Mutex, WaitGroup, atomic counters) that a plain
sync.Map does not need.

Test plan

  • Verify go test ./... -race -count=3 passes in the CI environment.
  • Confirm the two semantic shifts (Range snapshot-less, Clean
    concurrency) don't affect any downstream consumers.
  • Spot-check the benchmark regressions against real workloads — the
    LoadAndDeleteCollision shape is narrow (a single hot key repeatedly
    added and removed) but worth checking if anyone depends on it.

twmb added 6 commits April 21, 2026 12:23

bump to go 1.24, fix stale-at-epoch, tighten Clean

- Bump go.mod from 1.18 to 1.24. The code has required at least 1.19
  since the switch to generic atomic types (atomic.Pointer[T],
  atomic.Int64, atomic.Uint32). 1.24 also unlocks maps.Copy, which
  replaces the hand-rolled copy loop in promote.
- newStale: when the main entry's expiry is 0 (infinite cache) or -1
  (immediate expiry), base the stale's lifespan on now() instead of
  producing a stale born at the Unix epoch. This was a latent bug:
  Get paths typically regenerated the stale via maybeNewStale before
  it was observed, but the initial stale set by Swap/Set was wrong.
  A sketch of the corrected basis follows after this list.
- Clean: call e.del() directly instead of c.Delete(k). c.each already
  promotes when necessary, so every entry the callback sees lives in
  the read map and needs no dirty-map bookkeeping or per-key relock.
- Document that Expire is a no-op on in-flight loads for Cache, Item,
  and Set.
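
A minimal sketch of the corrected expiry basis, assuming expiries are stored as unix-nanosecond int64s with 0 meaning "never expires" and -1 meaning "expires immediately"; the real function is newStale, and every other name below is illustrative:

```go
package cache

import "time"

// staleExpiry is a hypothetical stand-in for the deadline computation inside
// newStale. The fix: when the main entry has no positive expiry, count the
// stale's lifespan from now() rather than letting the deadline collapse to a
// point near the Unix epoch, which made the stale look already expired.
func staleExpiry(mainExpiry int64, now func() time.Time, maxStaleAge time.Duration) int64 {
	base := mainExpiry
	if base <= 0 { // 0 = infinite cache, -1 = immediate expiry
		base = now().UnixNano()
	}
	return base + maxStaleAge.Nanoseconds()
}
```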

expand test suite; fix data race in Get's post-miss return

Test coverage climbs from 84.3% to 100% of statements. The new tests
either target concretely uncovered behavior (Item/Set wrappers, autoclean
lifecycle, Clean no-op on MaxStaleAge<0) or target race windows in the
existing read/dirty implementation (Swap-vs-in-flight-Get,
Delete-during-load, Expire-no-op-during-load, maxAge=0 collapsing, the
promote CAS-retry path, Swap observing the promotingDelete sentinel,
Get's unlocked-miss / locked-hit interleaving, CompareAndSwap against a
just-promoted entry). Single -count=1 -race runs land 100% coverage
reliably on this machine; the stress test driving the tightest windows
runs for ~2s against many goroutines.

The Swap-vs-in-flight-Get test caught a real data race (reported by
-race). After Get's final e.get, the original code returned l.v without
synchronizing with any writer: concurrent Swap's defer writes l.v under
l.mu and calls l.wg.Done, but Get was only waiting on whatever loading
e.p pointed to at the time, which could be Swap's replacement (a
different loading). Add l.wg.Wait() before returning l.v to establish
the happens-before edge with whoever finalized l (either setve or
Swap's defer).
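
A minimal, self-contained sketch of the synchronization pattern behind the fix (toy types, not the cache's real ones): the reader must wg.Wait() before touching the value field, because the writer's store followed by wg.Done is the only available happens-before edge.

```go
package main

import (
	"fmt"
	"sync"
)

// loading is a toy stand-in for the cache's loading[V]: one goroutine fills
// v and then signals completion through the WaitGroup.
type loading struct {
	wg sync.WaitGroup
	v  string
}

func main() {
	l := &loading{}
	l.wg.Add(1)
	go func() {
		l.v = "value" // written before Done
		l.wg.Done()
	}()
	l.wg.Wait()      // without this Wait, reading l.v below is a data race
	fmt.Println(l.v) // safe: Wait synchronizes with Done
}
```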

New test files:
  - wrappers_test.go: every Item and Set method
  - autoclean_test.go: AutoCleanInterval goroutine lifecycle and
    StopAutoClean idempotency
  - concurrency_test.go: race-targeted tests and the miss/stale
    correctness properties called out in the trie port plan

add concurrent hash-trie in isolation

Introduces cache/trie.go: a generic concurrent hash-trie map modeled on
Go's internal/sync.HashTrieMap, but implemented entirely in userspace
(hash/maphash for the runtime's typed hasher, unsafe.Sizeof(uintptr(0))
for pointer-size, no internal/abi or internal/goarch). The trie is not
yet wired into Cache; that happens in a later commit.
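
A sketch of the two userspace substitutes just mentioned, assuming Go 1.24 for maphash.Comparable; identifiers are illustrative, not the ones in cache/trie.go:

```go
package trie

import (
	"hash/maphash"
	"unsafe"
)

// Pointer size without internal/goarch: Sizeof of a uintptr is a compile-time
// constant (8 on 64-bit platforms).
const ptrSize = unsafe.Sizeof(uintptr(0))

// Typed hashing without internal/abi: maphash.Comparable hashes any
// comparable key under a per-map seed.
func hashOf[K comparable](seed maphash.Seed, k K) uint64 {
	return maphash.Comparable(seed, k)
}
```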

Design (a structural sketch of the node shapes follows the list):
  - Root is a lazily-allocated indirect node, atomically swapped on
    Clear so readers holding the old root keep operating on it.
  - Interior nodes hold 16 atomic child pointers; leaves hold a key,
    an atomic value slot (*V), and an atomic overflow pointer for
    hash-collision chains.
  - Load is lock-free; LoadOrStore and Delete take the parent's
    per-bucket mutex. Delete prunes empty parents bottom-up while
    holding locks hand-over-hand.
  - expand splits a leaf into a subtree on hash-prefix collision; full
    hash collisions chain through overflow.
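
A structural sketch of those node shapes, loosely following internal/sync.HashTrieMap; the identifiers and the exact embedding are guesses, not the names used in cache/trie.go:

```go
package trie

import (
	"sync"
	"sync/atomic"
)

const nChildren = 16 // 4 hash bits consumed per level

// node carries the type discriminator shared by the two concrete node kinds.
type node struct{ isEntry bool }

// indirect is an interior node: 16 atomic child slots plus the per-bucket
// mutex that LoadOrStore and Delete take for structural edits beneath it.
type indirect[K comparable, V any] struct {
	node
	parent   *indirect[K, V]
	mu       sync.Mutex
	children [nChildren]atomic.Pointer[node]
}

// entry is a leaf: a key, an atomic value slot read lock-free by Load, and
// an atomic overflow pointer chaining entries whose full hashes collide.
type entry[K comparable, V any] struct {
	node
	key      K
	value    atomic.Pointer[V]
	overflow atomic.Pointer[entry[K, V]]
}
```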

Tests (cache/trie_test.go) cover: zero-value trie, basic store/load,
load-or-store semantics with nil-value initialization, single and
collision-chain delete (head and mid-chain), large-key stress (20k keys)
through expand, full hash collisions via a test-only hashFn hook,
concurrent disjoint keys, concurrent contention on the same key,
concurrent delete-prune, range-during-mutation, and the deleteEntry
lock-retry paths under sustained delete contention.

Coverage is 99.1%. The five uncovered statements are defensive panics
at invariant-violation points (tree deeper than the hash has bits, node
type discriminator corrupted, etc.) that cannot fire without memory
corruption; each is documented inline with the reason it is unreachable.

Also:
  - Scale TestMaxIdleAge sleep windows up by ~4x so the timing-based
    assertions tolerate -count=3 -race load. The previous 30ms-within-
    50ms-window spacing had no margin for scheduler delay.
  - Fix TestSwap_CancelsInFlightGet: synchronize on the miss goroutine
    actually completing (via a missDone channel) before asserting the
    miss ran. Previously the test asserted a flag set by a goroutine
    that hadn't been scheduled to run yet under heavy load.
  - Modernize remaining int-range for-loops.

back Cache with the hash-trie; retire read/dirty machinery

The read/dirty/promote structure is replaced by the concurrent hash-trie
from the previous commit. Public API, stale/TTL semantics, singleflight
collapsing, and CompareAndSwap/Delete behavior are preserved.

Shape of the rewrite:
  - Cache holds a single `trie[K, loading[V]]` instead of r/mu/dirty/
    misses/pd.
  - An `ent[K, V]` type alias (Go 1.24 generic aliases) names the trie's
    concrete entry type so callers can read `*ent[K, V]` instead of
    `*trieEntry[K, loading[V]]` everywhere. All entry-manipulating logic
    (entGet, entTryGet, entDel, entLoad, entMaybeNewStale) lives in free
    functions over that alias, since Go generics do not permit methods
    on instantiation-specific aliases (this shape is sketched after the
    list).
  - Delete physically removes the entry via the trie's deleteEntryIf
    primitive (with a "still tombstoned under lock" predicate to avoid
    clobbering a concurrent Swap resurrection). This lets keys and their
    referents be GC'd, preserving TestIssue40999.
  - Clean walks the trie, tombstones expired entries, and then calls
    deleteEntryIf for each tombstone so the trie's memory footprint does
    not grow unboundedly with delete churn.
  - Range iterates via the trie walk; no snapshot (matches sync.Map and
    the existing Range docstring).
  - Swap's fast path CAS on an existing entry's value slot is preserved
    without any global lock; the slow path goes through loadOrStoreEntry
    and resolves races on the trie's per-bucket mutex.
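
A sketch of that alias-plus-free-function shape (Go 1.24 generic type aliases). The stand-in struct bodies exist only so the snippet compiles on its own and do not reflect the real fields:

```go
package cache

// Stand-ins so the sketch is self-contained; the real types live in
// cache/trie.go and cache.go.
type trieEntry[K comparable, V any] struct {
	key K
	val V
}

type loading[V any] struct{ v V }

// ent names the trie's concrete entry type, so call sites read *ent[K, V]
// rather than *trieEntry[K, loading[V]].
type ent[K comparable, V any] = trieEntry[K, loading[V]]

// Methods cannot be declared on an instantiation-specific alias, so entry
// logic becomes free functions over it (illustrative body).
func entGet[K comparable, V any](e *ent[K, V]) (V, bool) {
	var zero V
	if e == nil {
		return zero, false
	}
	return e.val.v, true
}
```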

Added to the trie for Cache's benefit:
  - deleteEntryIf(k, pred): removes only if pred returns true under the
    parent bucket's lock. Callers use this to avoid the
    resurrected-during-delete race (the contract is sketched below).
  - Inline chain removal in deleteEntryIf, retiring the separate
    trieRemoveFromChain helper (it had a "not found" return path that
    became dead after the pred-and-locate restructure).
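
A sketch of that conditional-delete contract, reduced to a single locked bucket; the bucket representation and names are illustrative, not the trie's. The point is that the predicate runs while the lock is held, so the decision to remove and the removal itself are atomic with respect to writers on the same bucket:

```go
package cache

import "sync"

// bucket is a toy stand-in for one locked region of the trie.
type bucket[K comparable, V any] struct {
	mu sync.Mutex
	m  map[K]*V
}

// deleteEntryIf removes k only if pred approves the current value while the
// bucket lock is held. A concurrent Swap that resurrects the entry between
// the caller's tombstone and this call makes pred return false, so the
// resurrected value is never clobbered.
func (b *bucket[K, V]) deleteEntryIf(k K, pred func(*V) bool) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	v, ok := b.m[k]
	if !ok || !pred(v) {
		return false
	}
	delete(b.m, k)
	return true
}
```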

New tests beyond the existing suite:
  - TestGet_ConcurrentStaleDuringInFlight: a second Get observes a
    stale-returning loading that is already in flight from the first Get
    and returns the stale without attempting its own miss.
  - TestGet_StaleRefreshCASRace: 16 concurrent Gets on an expired-with-
    stale key exercise the slow-path CAS-retry that replaces a finalized
    prev loading with a fresh loading carrying a stale snapshot.
  - TestClean_SkipsInFlightLoads: Clean must skip entries whose load has
    not finalized.
  - TestTrie_DeleteRaceAgainstExpand: deleteEntryIf's "slot changed to
    non-entry after lock" branch, triggered by concurrent insert on
    hash-prefix-colliding keys forcing an expand during delete.

Full suite passes under -race; coverage settles at 98.6%-99.0% of
statements across independent runs. The remaining uncovered lines are
documented defensive panics (invariant violations that require
corruption to fire) and two tightly-timed race branches that need
scheduler interleavings we cannot force from userspace without test
hooks.

Benchmark comparison against the pre-trie baseline is deferred to the
cleanup commit.

README: rewrite for the trie-backed cache

Update the design blurb (no more read/dirty promotion), the coverage
footnote (97% → ~99% plus a note on the defensive-panic residue), and
the microbenchmark section (fresh numbers from the current code on Apple
M1; the old table compared sync.Map to the pre-trie cache and is no
longer meaningful). Flag that per-op allocations on the insert/swap
paths are intrinsic — the singleflight machinery (loading[V] with a
Mutex, WaitGroup, and atomic counters) is something sync.Map doesn't
carry, and carrying it costs a single heap allocation per inserted key.

port the stdlib sync.Map tests we were missing

Add three tests and one benchmark that exist in sync/map_test.go and
sync/map_bench_test.go but were not in our copy:

  - TestConcurrentClear: 10 writers, 10 readers, and 10 Clear goroutines
    run concurrently. Correctness is no panic and no phantom keys (keys
    we never Stored appearing in Range after the dust settles).
  - TestMapClearOneAllocation: asserts Clear allocates ≤1 time. Cache's
    Clear replaces the trie root with one fresh indirect node, matching
    the sync.Map guarantee.
  - TestMapRangeNoAllocations: asserts Range does not allocate. The
    benchmarks already showed 0 B/op but an explicit AllocsPerRun test
    guards against a regression in the Range closure wiring (the stdlib
    shape is sketched after this list).
  - BenchmarkClear: Clear throughput.
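
The stdlib shape being ported, sketched against sync.Map so the snippet stands alone; the ported version runs the same AllocsPerRun assertion against Cache's Range:

```go
package cache_test

import (
	"sync"
	"testing"
)

// Mirrors sync/map_test.go's range-allocation guard: iterating the map with
// Range must not allocate, as measured by testing.AllocsPerRun.
func TestMapRangeNoAllocations(t *testing.T) {
	var m sync.Map
	allocs := testing.AllocsPerRun(10, func() {
		m.Range(func(key, value any) bool { return true })
	})
	if allocs > 0 {
		t.Errorf("AllocsPerRun of Range = %v; want 0", allocs)
	}
}
```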

Skipped from stdlib (intentionally):

  - TestMapMatchesDeepCopy: a second mapInterface oracle (DeepCopyMap).
    Our TestCacheMatchesRWMutex already covers the semantics; the stdlib
    keeps a second oracle mainly to catch sync.Map-specific promotion
    bugs we no longer have.
  - TestMapMatchesHashTrieMap: stdlib's internal hash trie as the
    oracle. We are the hash trie.
  - TestHashTrieMap{BadHash,TruncHash}: pathological-hash tests. Our
    TestTrie_FullHashCollision{,DeleteHead} tests already exercise full
    collisions via the hashFn hook.
  - BenchmarkHashTrieMapLoad[Small|Large], LoadOrStore[Large]: trie-
    level microbenchmarks. Redundant with the cache-level benchmarks
    since Cache is a thin layer over the trie.