cache: replace read/dirty with a concurrent hash-trie#6
- Bump go.mod from 1.18 to 1.24. The code has required at least 1.19 since the switch to generic atomic types (`atomic.Pointer[T]`, `atomic.Int64`, `atomic.Uint32`). 1.24 also unlocks `maps.Copy`, which replaces the hand-rolled copy loop in promote.
- newStale: when the main entry's expiry is 0 (infinite cache) or -1 (immediate expiry), base the stale's lifespan on now() instead of producing a stale born at the Unix epoch. This was a latent bug: Get paths typically regenerated the stale via maybeNewStale before it was observed, but the initial stale set by Swap/Set was wrong.
- Clean: call e.del() directly instead of c.Delete(k). c.each already promotes when necessary, so every entry the callback sees lives in the read map and needs no dirty-map bookkeeping or per-key relock.
- Document that Expire is a no-op on in-flight loads for Cache, Item, and Set.
Test coverage climbs from 84.3% to 100% of statements. The new tests
either target concretely uncovered behavior (Item/Set wrappers, autoclean
lifecycle, Clean no-op on MaxStaleAge<0) or target race windows in the
existing read/dirty implementation (Swap-vs-in-flight-Get,
Delete-during-load, Expire-no-op-during-load, maxAge=0 collapsing, the
promote CAS-retry path, Swap observing the promotingDelete sentinel,
Get's unlocked-miss / locked-hit interleaving, CompareAndSwap against a
just-promoted entry). Single -count=1 -race runs land 100% coverage
reliably on this machine; the stress test driving the tightest windows
runs for ~2s against many goroutines.
The Swap-vs-in-flight-Get test caught a real data race (reported by
-race). After Get's final e.get, the original code returned l.v without
synchronizing with any writer: concurrent Swap's defer writes l.v under
l.mu and calls l.wg.Done, but Get was only waiting on whatever loading
e.p pointed to at the time, which could be Swap's replacement (a
different loading). Add l.wg.Wait() before returning l.v to establish
the happens-before edge with whoever finalized l (either setve or
Swap's defer).
New test files:
- wrappers_test.go: every Item and Set method
- autoclean_test.go: AutoCleanInterval goroutine lifecycle and
StopAutoClean idempotency
- concurrency_test.go: race-targeted tests and the miss/stale
correctness properties called out in the trie port plan
Introduces cache/trie.go: a generic concurrent hash-trie map modeled on
Go's internal/sync.HashTrieMap, but implemented entirely in userspace
(hash/maphash for the runtime's typed hasher, unsafe.Sizeof(uintptr(0))
for pointer-size, no internal/abi or internal/goarch). The trie is not
yet wired into Cache; that happens in a later commit.
Design:
- Root is a lazily-allocated indirect node, atomically swapped on
Clear so readers holding the old root keep operating on it.
- Interior nodes hold 16 atomic child pointers; leaves hold a key,
an atomic value slot (*V), and an atomic overflow pointer for
hash-collision chains.
- Load is lock-free; LoadOrStore and Delete take the parent's
per-bucket mutex. Delete prunes empty parents bottom-up while
holding locks hand-over-hand.
- expand splits a leaf into a subtree on hash-prefix collision; full
hash collisions chain through overflow.
Tests (cache/trie_test.go) cover: zero-value trie, basic store/load,
load-or-store semantics with nil-value initialization, single and
collision-chain delete (head and mid-chain), large-key stress (20k keys)
through expand, full hash collisions via a test-only hashFn hook,
concurrent disjoint keys, concurrent contention on the same key,
concurrent delete-prune, range-during-mutation, and the deleteEntry
lock-retry paths under sustained delete contention.
Coverage is 99.1%. The five uncovered statements are defensive panics
at invariant-violation points (tree deeper than the hash has bits, node
type discriminator corrupted, etc.) that cannot fire without memory
corruption; each is documented inline with the reason it is unreachable.
Also:
- Scale TestMaxIdleAge sleep windows up by ~4x so the timing-based
assertions tolerate -count=3 -race load. The previous 30ms-within-
50ms-window spacing had no margin for scheduler delay.
- Fix TestSwap_CancelsInFlightGet: synchronize on the miss goroutine
actually completing (via a missDone channel) before asserting the
miss ran. Previously the test asserted a flag set by a goroutine
that hadn't been scheduled to run yet under heavy load.
- Modernize remaining int-range for-loops.
The read/dirty/promote structure is replaced by the concurrent hash-trie
from the previous commit. Public API, stale/TTL semantics, singleflight
collapsing, and CompareAndSwap/Delete behavior are preserved.
Shape of the rewrite:
- Cache holds a single `trie[K, loading[V]]` instead of r/mu/dirty/
misses/pd.
- An `ent[K, V]` type alias (Go 1.24 generic aliases) names the trie's
concrete entry type so callers can read `*ent[K, V]` instead of
`*trieEntry[K, loading[V]]` everywhere. All entry-manipulating logic
(entGet, entTryGet, entDel, entLoad, entMaybeNewStale) is free
functions over that alias, since Go generics do not permit methods
on instantiation-specific aliases.
- Delete physically removes the entry via the trie's deleteEntryIf
primitive (with a "still tombstoned under lock" predicate to avoid
clobbering a concurrent Swap resurrection). This lets keys and their
referents be GC'd, preserving TestIssue40999.
- Clean walks the trie, tombstones expired entries, and then calls
deleteEntryIf for each tombstone so the trie's memory footprint does
not grow unboundedly with delete churn.
- Range iterates via the trie walk; no snapshot (matches sync.Map and
the existing Range docstring).
- Swap's fast path CAS on an existing entry's value slot is preserved
without any global lock; the slow path goes through loadOrStoreEntry
and resolves races on the trie's per-bucket mutex.
Added to the trie for Cache's benefit:
- deleteEntryIf(k, pred): removes only if pred returns true under the
parent bucket's lock. Callers use this to avoid the
resurrected-during-delete race.
- Inline chain removal in deleteEntryIf, retiring the separate
trieRemoveFromChain helper (it had a "not found" return path that
became dead after the pred-and-locate restructure).
New tests beyond the existing suite:
- TestGet_ConcurrentStaleDuringInFlight: a second Get observes a
stale-returning loading that is already in flight from the first Get
and returns the stale without attempting its own miss.
- TestGet_StaleRefreshCASRace: 16 concurrent Gets on an expired-with-
stale key exercise the slow-path CAS-retry that replaces a finalized
prev loading with a fresh loading carrying a stale snapshot.
- TestClean_SkipsInFlightLoads: Clean must skip entries whose load has
not finalized.
- TestTrie_DeleteRaceAgainstExpand: deleteEntryIf's "slot changed to
non-entry after lock" branch, triggered by concurrent insert on
hash-prefix-colliding keys forcing an expand during delete.
Full suite passes under -race; coverage settles at 98.6%-99.0% of
statements across independent runs. The remaining uncovered lines are
documented defensive panics (invariant violations that require
corruption to fire) and two tightly-timed race branches that need
scheduler interleavings we cannot force from userspace without test
hooks.
Benchmark comparison against the pre-trie baseline is deferred to the
cleanup commit.
Update the design blurb (no more read/dirty promotion), the coverage footnote (97% → ~99% plus a note on the defensive-panic residue), and the microbenchmark section (fresh numbers from the current code on Apple M1; the old table compared sync.Map to the pre-trie cache and is no longer meaningful). Flag that per-op allocations on the insert/swap paths are intrinsic — the singleflight machinery (loading[V] with a Mutex, WaitGroup, and atomic counters) is something sync.Map doesn't carry, and carrying it costs a single heap allocation per inserted key.
Add three tests and one benchmark that exist in sync/map_test.go and
sync/map_bench_test.go but were not in our copy:
- TestConcurrentClear: 10 writers, 10 readers, and 10 Clear goroutines
run concurrently. Correctness is no panic and no phantom keys (keys
we never Stored appearing in Range after the dust settles).
- TestMapClearOneAllocation: asserts Clear allocates ≤1 time. Cache's
Clear replaces the trie root with one fresh indirect node, matching
the sync.Map guarantee.
- TestMapRangeNoAllocations: asserts Range does not allocate. The
benchmarks already showed 0 B/op but an explicit AllocsPerRun test
guards against a regression in the Range closure wiring.
- BenchmarkClear: Clear throughput.
Skipped from stdlib (intentionally):
- TestMapMatchesDeepCopy: a second mapInterface oracle (DeepCopyMap).
Our TestCacheMatchesRWMutex already covers the semantics; the stdlib
keeps a second oracle mainly to catch sync.Map-specific promotion
bugs we no longer have.
- TestMapMatchesHashTrieMap: stdlib's internal hash trie as the
oracle. We are the hash trie.
- TestHashTrieMap{BadHash,TruncHash}: pathological-hash tests. Our
TestTrie_FullHashCollision{,DeleteHead} already exercises full
collisions via the hashFn hook.
- BenchmarkHashTrieMapLoad[Small|Large], LoadOrStore[Large]: trie-
level microbenchmarks. Redundant with the cache-level benchmarks
since Cache is thin over trie.
Summary
Replaces the read/dirty/promote backing of `cache.Cache` with a concurrent hash-trie modeled on Go's `internal/sync.HashTrieMap`, implemented entirely in userspace (`hash/maphash` + `unsafe.Sizeof(uintptr(0))`; no `internal/abi` or `internal/goarch`). Public API is unchanged; all documented semantics are preserved, with two minor behavior shifts noted below.
Why
The read/dirty design's per-op cost is decent on average, but adversarial workloads pay heavily for the dirty-map promotion copy. The trie spreads cost across per-bucket locks and eliminates the promotion cycle.
`sync.Map` moved to a hash trie in Go 1.24, so the old inspiration in this package's docs was stale. Stale/TTL semantics and singleflight collapsing sit above the storage layer and carry over unchanged.
Changes, by commit
- `bump to go 1.24, fix stale-at-epoch, tighten Clean` — baseline cleanup. Go modules had claimed 1.18 but the code required 1.19+ (`atomic.Pointer[T]`, `atomic.Int64`, `atomic.Uint32`). `newStale` was producing a stale born at the Unix epoch when the main entry's expiry was `<= 0` — now bases on `now()`. `Clean` now calls `e.del()` directly instead of `c.Delete(k)` since `c.each` has already promoted.
- `expand test suite; fix data race in Get's post-miss return` — gets coverage to 100% at the baseline. The new race-targeted tests caught a real data race: `Cache.Get` read `l.v` without any happens-before edge when a concurrent `Swap` had replaced `e.p` with its own loading. Fix: `l.wg.Wait()` before reading `l.v` to synchronize with whoever called `wg.Done` — either `setve` or `Swap`'s defer.
- `add concurrent hash-trie in isolation` — new `cache/trie.go` + unit tests. Standalone, not yet wired into Cache.
- `back Cache with the hash-trie; retire read/dirty machinery` — `Cache` holds a `trie[K, loading[V]]` instead of `r`/`mu`/`dirty`/`misses`/`pd`. An `ent[K, V]` type alias (Go 1.24 generic aliases) names the trie's concrete entry type. Entry operations (`entGet`, `entTryGet`, `entDel`, `entLoad`, `entMaybeNewStale`) are free functions over that alias. `Delete` physically removes via `deleteEntryIf` with a "still tombstoned under lock" predicate to avoid clobbering a concurrent Swap resurrection; keys become eligible for GC promptly (`TestIssue40999` still passes).
- `README: rewrite for the trie-backed cache` — new design blurb, fresh benchmark numbers.
- `port the stdlib sync.Map tests we were missing` — `TestConcurrentClear`, `TestMapClearOneAllocation`, `TestMapRangeNoAllocations`, `BenchmarkClear`.

Semantic shifts
Both acceptable in my read, but flagging explicitly:
- `Range` loses its promote-snapshot flavor. Previously `each` promoted under `c.mu`, so Range iterated a fixed snapshot. Now Range is a lock-free walk over atomic pointers; concurrent mutations may or may not be visible. This is the same guarantee `sync.Map.Range` documents, and the existing docstring already uses that wording.
- `Clean` no longer serializes on `c.mu`. Iteration is lock-free; per-key physical removal takes only the trie's bucket lock. Faster, but callers who relied on `Clean` holding a global lock should know.

Benchmarks
Apple M1, Go 1.26, `-count=5 -benchtime=1s`, `benchstat` comparing the pre-trie baseline to the current branch.
Time per op — geomean -35.24%.
Largest wins:
Regressions:
The `*Collision` regressions are intrinsic: the trie physically removes on `Delete` (required to keep keys GC-eligible per `TestIssue40999`), so every Delete takes the bucket lock. Pre-trie just did a single CAS to nil and deferred physical removal to the promotion cycle. When many goroutines hammer the same key, they all serialize on that bucket mutex. `LoadMostlyMisses` pays a small fixed cost for the hash-walk on every miss.
Allocations. `AdversarialAlloc` and `AdversarialDelete` drop to 0 B/op (pre-trie was 55 and 36, respectively — the cost of dirty-map promotion copying). `SwapMostlyMisses` drops from 4 allocs/op to 2. Per-op allocations on the insert/swap paths (`LoadOrStore*`, `Swap*Hits`, `CompareAndSwap*Hits`) are intrinsic: each operation may install a fresh `loading[V]` carrying the singleflight machinery (`Mutex`, `WaitGroup`, atomic counters) that a plain `sync.Map` does not need.

Test plan
`go test ./... -race -count=3` on the CI environment. Verified that the semantic shifts (`Range` snapshot-less, `Clean` concurrency) don't affect any downstream consumers. The `LoadAndDeleteCollision` shape is narrow (a single hot key being repeatedly added and removed) but worth checking if anyone depends on it.