Skip to content

cache: replace read/dirty with a concurrent hash-trie #6

Draft
twmb wants to merge 6 commits into main from trie-cache

Conversation


@twmb twmb commented Apr 21, 2026

Summary

Replaces the read/dirty/promote backing of cache.Cache with a concurrent
hash-trie modeled on Go's internal/sync.HashTrieMap, implemented entirely in
userspace (hash/maphash + unsafe.Sizeof(uintptr(0)); no internal/abi or
internal/goarch). Public API is unchanged; all documented semantics are
preserved, with two minor behavior shifts noted below.

Why

  • The read/dirty design amortizes a write-map promotion across operations; the
    per-op cost is decent on average but adversarial workloads pay heavily for
    the copy. The trie spreads cost across per-bucket locks and eliminates the
    promotion cycle.
  • Upstream sync.Map moved to a hash trie in Go 1.24, so the old inspiration
    in this package's docs was stale.
  • Singleflight, stale values, TTLs, idle-age, and error caching all live above
    the storage layer and carry over unchanged.

Changes, by commit

  1. bump to go 1.24, fix stale-at-epoch, tighten Clean — baseline cleanup.
    go.mod claimed 1.18 but the code required 1.19+ (atomic.Pointer[T],
    atomic.Int64, atomic.Uint32). newStale was producing a stale born at
    the Unix epoch when the main entry's expiry was <= 0; it now bases the
    stale's expiry on now(). Clean now calls e.del() directly instead of
    c.Delete(k) since c.each has already promoted.
  2. expand test suite; fix data race in Get's post-miss return — brings coverage
    to 100% at the baseline. The new race-targeted tests caught a real data
    race: Cache.Get read l.v without any happens-before edge when a
    concurrent Swap had replaced e.p with its own loading. Fix: l.wg.Wait()
    before reading l.v to synchronize with whoever called wg.Done — either
    setve or Swap's defer.
  3. add concurrent hash-trie in isolation — new cache/trie.go + unit tests.
    Standalone, not yet wired into Cache.
  4. back Cache with the hash-trie; retire read/dirty machinery — Cache
    holds a trie[K, loading[V]] instead of r/mu/dirty/misses/pd. An
    ent[K, V] type alias (Go 1.24 generic aliases) names the trie's concrete
    entry type. Entry operations (entGet, entTryGet, entDel, entLoad,
    entMaybeNewStale) are free functions over that alias. Delete physically
    removes via deleteEntryIf with a "still tombstoned under lock" predicate
    to avoid clobbering a concurrent Swap resurrection; keys become eligible
    for GC promptly (TestIssue40999 still passes).
  5. README: rewrite for the trie-backed cache — new design blurb, fresh
    benchmark numbers.
  6. port the stdlib sync.Map tests we were missing — TestConcurrentClear,
    TestMapClearOneAllocation, TestMapRangeNoAllocations, BenchmarkClear.

Semantic shifts

Both acceptable in my read, but flagging explicitly:

  • Range loses its promote-snapshot flavor. Previously c.each promoted
    under c.mu, so Range iterated a fixed snapshot. Now Range is a lock-free
    walk over atomic pointers; concurrent mutations may or may not be visible.
    This is the same guarantee sync.Map.Range documents and the existing
    docstring already uses that wording.
  • Clean no longer serializes on c.mu. Iteration is lock-free; per-key
    physical removal takes only the trie's bucket lock. Faster, but callers who
    relied on Clean holding a global lock should know.

Benchmarks

Apple M1, Go 1.26, -count=5 -benchtime=1s, benchstat comparing the
pre-trie baseline to the current branch.

Time per op — geomean -35.24%.

Largest wins:

Benchmark                    Pre-trie   Trie      Δ
AdversarialAlloc             350.8 ns   11.4 ns   -96.74%
AdversarialDelete            132.2 ns   6.5 ns    -95.10%
SwapMostlyMisses             1032 ns    154 ns    -85.11%
LoadOrStoreUnique            1449 ns    592 ns    -59.17%
LoadAndDeleteBalanced        10.7 ns    4.4 ns    -58.66%
LoadAndDeleteUnique          4.3 ns     1.8 ns    -58.59%
LoadOrStoreBalanced          702 ns     318 ns    -54.77%
CompareAndDeleteMostlyHits   131 ns     91 ns     -30.63%
DeleteCollision              5.1 ns     3.6 ns    -29.23%
SwapCollision                338 ns     242 ns    -28.47%
CompareAndSwapNoExistingKey  2.5 ns     1.9 ns    -22.97%

Regressions:

Benchmark                    Pre-trie   Trie      Δ
LoadAndDeleteCollision       7.1 ns     67.6 ns   +853.35%
CompareAndSwapCollision      10.1 ns    17.6 ns   +73.45%
LoadMostlyMisses             3.4 ns     4.6 ns    +33.40%

The *Collision regressions are intrinsic: the trie physically removes on
Delete (required to keep keys GC-eligible per TestIssue40999) so every
Delete takes the bucket lock. Pre-trie just did a single CAS to nil and
deferred physical removal to the promotion cycle. When many goroutines hammer
the same key, they all serialize on that bucket mutex. LoadMostlyMisses
pays a small fixed cost for the hash-walk on every miss.

Allocations. AdversarialAlloc and AdversarialDelete drop to 0 B/op
(pre-trie was 55 B/op and 36 B/op, respectively — the cost of dirty-map promotion
copying). SwapMostlyMisses drops from 4 allocs/op to 2. Per-op allocations
on the insert/swap paths (LoadOrStore*, Swap*Hits, CompareAndSwap*Hits)
are intrinsic: each operation may install a fresh loading[V] carrying the
singleflight machinery (Mutex, WaitGroup, atomic counters) that a plain
sync.Map does not need.

Test plan

  • Verify go test ./... -race -count=3 passes in the CI environment.
  • Confirm the two semantic shifts (Range snapshot-less, Clean
    concurrency) don't affect any downstream consumers.
  • Spot-check the benchmark regressions against real workloads — the
    LoadAndDeleteCollision shape is narrow (a single hot key repeatedly
    added and removed) but worth checking if anyone depends on it.

twmb added 6 commits April 21, 2026 12:23

bump to go 1.24, fix stale-at-epoch, tighten Clean

- Bump go.mod from 1.18 to 1.24. The code has required at least 1.19
  since the switch to generic atomic types (atomic.Pointer[T],
  atomic.Int64, atomic.Uint32). 1.24 also unlocks maps.Copy, which
  replaces the hand-rolled copy loop in promote.
- newStale: when the main entry's expiry is 0 (infinite cache) or -1
  (immediate expiry), base the stale's lifespan on now() instead of
  producing a stale born at the Unix epoch. This was a latent bug:
  Get paths typically regenerated the stale via maybeNewStale before
  it was observed, but the initial stale set by Swap/Set was wrong.
  A sketch of the corrected basis follows after this list.
- Clean: call e.del() directly instead of c.Delete(k). c.each already
  promotes when necessary, so every entry the callback sees lives in
  the read map and needs no dirty-map bookkeeping or per-key relock.
- Document that Expire is a no-op on in-flight loads for Cache, Item,
  and Set.
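
A minimal sketch of the corrected expiry basis, assuming expiries are stored as unix-nanosecond int64s with 0 meaning "never expires" and -1 meaning "expires immediately"; the real function is newStale, and every other name below is illustrative:

```go
package cache

import "time"

// staleExpiry is a hypothetical stand-in for the deadline computation inside
// newStale. The fix: when the main entry has no positive expiry, count the
// stale's lifespan from now() rather than letting the deadline collapse to a
// point near the Unix epoch, which made the stale look already expired.
func staleExpiry(mainExpiry int64, now func() time.Time, maxStaleAge time.Duration) int64 {
	base := mainExpiry
	if base <= 0 { // 0 = infinite cache, -1 = immediate expiry
		base = now().UnixNano()
	}
	return base + maxStaleAge.Nanoseconds()
}
```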

expand test suite; fix data race in Get's post-miss return

Test coverage climbs from 84.3% to 100% of statements. The new tests
either target concretely uncovered behavior (Item/Set wrappers, autoclean
lifecycle, Clean no-op on MaxStaleAge<0) or target race windows in the
existing read/dirty implementation (Swap-vs-in-flight-Get,
Delete-during-load, Expire-no-op-during-load, maxAge=0 collapsing, the
promote CAS-retry path, Swap observing the promotingDelete sentinel,
Get's unlocked-miss / locked-hit interleaving, CompareAndSwap against a
just-promoted entry). Single -count=1 -race runs land 100% coverage
reliably on this machine; the stress test driving the tightest windows
runs for ~2s against many goroutines.

The Swap-vs-in-flight-Get test caught a real data race (reported by
-race). After Get's final e.get, the original code returned l.v without
synchronizing with any writer: concurrent Swap's defer writes l.v under
l.mu and calls l.wg.Done, but Get was only waiting on whatever loading
e.p pointed to at the time, which could be Swap's replacement (a
different loading). Add l.wg.Wait() before returning l.v to establish
the happens-before edge with whoever finalized l (either setve or
Swap's defer).
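
A minimal, self-contained sketch of the synchronization pattern behind the fix (toy types, not the cache's real ones): the reader must wg.Wait() before touching the value field, because the writer's store followed by wg.Done is the only available happens-before edge.

```go
package main

import (
	"fmt"
	"sync"
)

// loading is a toy stand-in for the cache's loading[V]: one goroutine fills
// v and then signals completion through the WaitGroup.
type loading struct {
	wg sync.WaitGroup
	v  string
}

func main() {
	l := &loading{}
	l.wg.Add(1)
	go func() {
		l.v = "value" // written before Done
		l.wg.Done()
	}()
	l.wg.Wait()      // without this Wait, reading l.v below is a data race
	fmt.Println(l.v) // safe: Wait synchronizes with Done
}
```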

New test files:
  - wrappers_test.go: every Item and Set method
  - autoclean_test.go: AutoCleanInterval goroutine lifecycle and
    StopAutoClean idempotency
  - concurrency_test.go: race-targeted tests and the miss/stale
    correctness properties called out in the trie port plan

add concurrent hash-trie in isolation

Introduces cache/trie.go: a generic concurrent hash-trie map modeled on
Go's internal/sync.HashTrieMap, but implemented entirely in userspace
(hash/maphash for the runtime's typed hasher, unsafe.Sizeof(uintptr(0))
for pointer-size, no internal/abi or internal/goarch). The trie is not
yet wired into Cache; that happens in a later commit.
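
A sketch of the two userspace substitutes just mentioned, assuming Go 1.24 for maphash.Comparable; identifiers are illustrative, not the ones in cache/trie.go:

```go
package trie

import (
	"hash/maphash"
	"unsafe"
)

// Pointer size without internal/goarch: Sizeof of a uintptr is a compile-time
// constant (8 on 64-bit platforms).
const ptrSize = unsafe.Sizeof(uintptr(0))

// Typed hashing without internal/abi: maphash.Comparable hashes any
// comparable key under a per-map seed.
func hashOf[K comparable](seed maphash.Seed, k K) uint64 {
	return maphash.Comparable(seed, k)
}
```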

Design (a structural sketch of the node shapes follows the list):
  - Root is a lazily-allocated indirect node, atomically swapped on
    Clear so readers holding the old root keep operating on it.
  - Interior nodes hold 16 atomic child pointers; leaves hold a key,
    an atomic value slot (*V), and an atomic overflow pointer for
    hash-collision chains.
  - Load is lock-free; LoadOrStore and Delete take the parent's
    per-bucket mutex. Delete prunes empty parents bottom-up while
    holding locks hand-over-hand.
  - expand splits a leaf into a subtree on hash-prefix collision; full
    hash collisions chain through overflow.
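
A structural sketch of those node shapes, loosely following internal/sync.HashTrieMap; the identifiers and the exact embedding are guesses, not the names used in cache/trie.go:

```go
package trie

import (
	"sync"
	"sync/atomic"
)

const nChildren = 16 // 4 hash bits consumed per level

// node carries the type discriminator shared by the two concrete node kinds.
type node struct{ isEntry bool }

// indirect is an interior node: 16 atomic child slots plus the per-bucket
// mutex that LoadOrStore and Delete take for structural edits beneath it.
type indirect[K comparable, V any] struct {
	node
	parent   *indirect[K, V]
	mu       sync.Mutex
	children [nChildren]atomic.Pointer[node]
}

// entry is a leaf: a key, an atomic value slot read lock-free by Load, and
// an atomic overflow pointer chaining entries whose full hashes collide.
type entry[K comparable, V any] struct {
	node
	key      K
	value    atomic.Pointer[V]
	overflow atomic.Pointer[entry[K, V]]
}
```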

Tests (cache/trie_test.go) cover: zero-value trie, basic store/load,
load-or-store semantics with nil-value initialization, single and
collision-chain delete (head and mid-chain), large-key stress (20k keys)
through expand, full hash collisions via a test-only hashFn hook,
concurrent disjoint keys, concurrent contention on the same key,
concurrent delete-prune, range-during-mutation, and the deleteEntry
lock-retry paths under sustained delete contention.

Coverage is 99.1%. The five uncovered statements are defensive panics
at invariant-violation points (tree deeper than the hash has bits, node
type discriminator corrupted, etc.) that cannot fire without memory
corruption; each is documented inline with the reason it is unreachable.

Also:
  - Scale TestMaxIdleAge sleep windows up by ~4x so the timing-based
    assertions tolerate -count=3 -race load. The previous 30ms-within-
    50ms-window spacing had no margin for scheduler delay.
  - Fix TestSwap_CancelsInFlightGet: synchronize on the miss goroutine
    actually completing (via a missDone channel) before asserting the
    miss ran. Previously the test asserted a flag set by a goroutine
    that hadn't been scheduled to run yet under heavy load.
  - Modernize remaining int-range for-loops.

back Cache with the hash-trie; retire read/dirty machinery

The read/dirty/promote structure is replaced by the concurrent hash-trie
from the previous commit. Public API, stale/TTL semantics, singleflight
collapsing, and CompareAndSwap/Delete behavior are preserved.

Shape of the rewrite:
  - Cache holds a single `trie[K, loading[V]]` instead of r/mu/dirty/
    misses/pd.
  - An `ent[K, V]` type alias (Go 1.24 generic aliases) names the trie's
    concrete entry type so callers can read `*ent[K, V]` instead of
    `*trieEntry[K, loading[V]]` everywhere. All entry-manipulating logic
    (entGet, entTryGet, entDel, entLoad, entMaybeNewStale) lives in free
    functions over that alias, since Go generics do not permit methods
    on instantiation-specific aliases (this shape is sketched after the
    list).
  - Delete physically removes the entry via the trie's deleteEntryIf
    primitive (with a "still tombstoned under lock" predicate to avoid
    clobbering a concurrent Swap resurrection). This lets keys and their
    referents be GC'd, preserving TestIssue40999.
  - Clean walks the trie, tombstones expired entries, and then calls
    deleteEntryIf for each tombstone so the trie's memory footprint does
    not grow unboundedly with delete churn.
  - Range iterates via the trie walk; no snapshot (matches sync.Map and
    the existing Range docstring).
  - Swap's fast path CAS on an existing entry's value slot is preserved
    without any global lock; the slow path goes through loadOrStoreEntry
    and resolves races on the trie's per-bucket mutex.
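
A sketch of that alias-plus-free-function shape (Go 1.24 generic type aliases). The stand-in struct bodies exist only so the snippet compiles on its own and do not reflect the real fields:

```go
package cache

// Stand-ins so the sketch is self-contained; the real types live in
// cache/trie.go and cache.go.
type trieEntry[K comparable, V any] struct {
	key K
	val V
}

type loading[V any] struct{ v V }

// ent names the trie's concrete entry type, so call sites read *ent[K, V]
// rather than *trieEntry[K, loading[V]].
type ent[K comparable, V any] = trieEntry[K, loading[V]]

// Methods cannot be declared on an instantiation-specific alias, so entry
// logic becomes free functions over it (illustrative body).
func entGet[K comparable, V any](e *ent[K, V]) (V, bool) {
	var zero V
	if e == nil {
		return zero, false
	}
	return e.val.v, true
}
```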

Added to the trie for Cache's benefit:
  - deleteEntryIf(k, pred): removes only if pred returns true under the
    parent bucket's lock. Callers use this to avoid the
    resurrected-during-delete race (the contract is sketched below).
  - Inline chain removal in deleteEntryIf, retiring the separate
    trieRemoveFromChain helper (it had a "not found" return path that
    became dead after the pred-and-locate restructure).
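
A sketch of that conditional-delete contract, reduced to a single locked bucket; the bucket representation and names are illustrative, not the trie's. The point is that the predicate runs while the lock is held, so the decision to remove and the removal itself are atomic with respect to writers on the same bucket:

```go
package cache

import "sync"

// bucket is a toy stand-in for one locked region of the trie.
type bucket[K comparable, V any] struct {
	mu sync.Mutex
	m  map[K]*V
}

// deleteEntryIf removes k only if pred approves the current value while the
// bucket lock is held. A concurrent Swap that resurrects the entry between
// the caller's tombstone and this call makes pred return false, so the
// resurrected value is never clobbered.
func (b *bucket[K, V]) deleteEntryIf(k K, pred func(*V) bool) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	v, ok := b.m[k]
	if !ok || !pred(v) {
		return false
	}
	delete(b.m, k)
	return true
}
```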

New tests beyond the existing suite:
  - TestGet_ConcurrentStaleDuringInFlight: a second Get observes a
    stale-returning loading that is already in flight from the first Get
    and returns the stale without attempting its own miss.
  - TestGet_StaleRefreshCASRace: 16 concurrent Gets on an expired-with-
    stale key exercise the slow-path CAS-retry that replaces a finalized
    prev loading with a fresh loading carrying a stale snapshot.
  - TestClean_SkipsInFlightLoads: Clean must skip entries whose load has
    not finalized.
  - TestTrie_DeleteRaceAgainstExpand: deleteEntryIf's "slot changed to
    non-entry after lock" branch, triggered by concurrent insert on
    hash-prefix-colliding keys forcing an expand during delete.

Full suite passes under -race; coverage settles at 98.6%-99.0% of
statements across independent runs. The remaining uncovered lines are
documented defensive panics (invariant violations that require
corruption to fire) and two tightly-timed race branches that need
scheduler interleavings we cannot force from userspace without test
hooks.

Benchmark comparison against the pre-trie baseline is deferred to the
cleanup commit.

README: rewrite for the trie-backed cache

Update the design blurb (no more read/dirty promotion), the coverage
footnote (97% → ~99% plus a note on the defensive-panic residue), and
the microbenchmark section (fresh numbers from the current code on Apple
M1; the old table compared sync.Map to the pre-trie cache and is no
longer meaningful). Flag that per-op allocations on the insert/swap
paths are intrinsic — the singleflight machinery (loading[V] with a
Mutex, WaitGroup, and atomic counters) is something sync.Map doesn't
carry, and carrying it costs a single heap allocation per inserted key.

port the stdlib sync.Map tests we were missing

Add three tests and one benchmark that exist in sync/map_test.go and
sync/map_bench_test.go but were not in our copy:

  - TestConcurrentClear: 10 writers, 10 readers, and 10 Clear goroutines
    run concurrently. Correctness is no panic and no phantom keys (keys
    we never Stored appearing in Range after the dust settles).
  - TestMapClearOneAllocation: asserts Clear allocates ≤1 time. Cache's
    Clear replaces the trie root with one fresh indirect node, matching
    the sync.Map guarantee.
  - TestMapRangeNoAllocations: asserts Range does not allocate. The
    benchmarks already showed 0 B/op but an explicit AllocsPerRun test
    guards against a regression in the Range closure wiring (the stdlib
    shape is sketched after this list).
  - BenchmarkClear: Clear throughput.
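
The stdlib shape being ported, sketched against sync.Map so the snippet stands alone; the ported version runs the same AllocsPerRun assertion against Cache's Range:

```go
package cache_test

import (
	"sync"
	"testing"
)

// Mirrors sync/map_test.go's range-allocation guard: iterating the map with
// Range must not allocate, as measured by testing.AllocsPerRun.
func TestMapRangeNoAllocations(t *testing.T) {
	var m sync.Map
	allocs := testing.AllocsPerRun(10, func() {
		m.Range(func(key, value any) bool { return true })
	})
	if allocs > 0 {
		t.Errorf("AllocsPerRun of Range = %v; want 0", allocs)
	}
}
```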

Skipped from stdlib (intentionally):

  - TestMapMatchesDeepCopy: a second mapInterface oracle (DeepCopyMap).
    Our TestCacheMatchesRWMutex already covers the semantics; the stdlib
    keeps a second oracle mainly to catch sync.Map-specific promotion
    bugs we no longer have.
  - TestMapMatchesHashTrieMap: stdlib's internal hash trie as the
    oracle. We are the hash trie.
  - TestHashTrieMap{BadHash,TruncHash}: pathological-hash tests. Our
    TestTrie_FullHashCollision{,DeleteHead} tests already exercise full
    collisions via the hashFn hook.
  - BenchmarkHashTrieMapLoad[Small|Large], LoadOrStore[Large]: trie-
    level microbenchmarks. Redundant with the cache-level benchmarks
    since Cache is a thin layer over the trie.