feat(cache): add in flight deduping#4459
feat(cache): add in flight deduping#4459MasterPtato wants to merge 1 commit into03-18-fix_cache_clean_up_libfrom
Conversation
|
🚅 Deployed to the rivet-pr-4459 environment in rivet-frontend
|
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
175706a to
17d21f5
Compare
75e1e38 to
bc5d6d3
Compare
Code Review: feat(cache): add in flight dedupingThe core idea of in-flight request deduplication is sound and addresses a real stampede problem. Here are my findings: Critical Issue 1: Lease not released on getter error. If the getter returns an Err, try_join short-circuits via ?, so the lease release loop never executes. This permanently stalls subsequent requests for those keys (5-second wait per attempt) until the process restarts. A guard/defer pattern would ensure cleanup regardless of the error path. Critical Issue 2: Broadcast not sent on empty resolution or getter error. broadcast_tx.send is only called when entries_values is non-empty. If the getter resolves no values or errors, waiters are stuck until IN_FLIGHT_TIMEOUT (5 seconds). The broadcast should be sent unconditionally after the getter completes. Combined with issue 1, a getter error causes both a leaked lease and a 5-second stall for all waiters. Moderate Issue 3: HashMap iteration order creates implicit coupling. In req_config.rs, keys and cache_keys are unzipped from ctx.entries() with non-deterministic HashMap iteration order, then keys is zipped with cached_values from the driver. This works because both were derived from the same iterator in the same pass, but it is fragile. A Vec pairing (Key, RawCacheKey) would make the relationship explicit and safe. The same issue applies in the waiting-keys path (succeeded_keys / succeeded_cache_keys). Moderate Issue 4: Rate limit tests silently removed. integration.rs contained test_rate_limit_basic and test_rate_limit_ip_isolation. These do not appear in any of the new test files (fetch.rs, in_flight.rs, ttl.rs). If rate limiting is still a feature of this crate, these tests should be preserved. Minor Issue 5: timeout_falls_back_to_getter test adds 5 seconds to the test suite. The test necessarily waits for IN_FLIGHT_TIMEOUT. Making the timeout configurable via a cfg(test) override or a parameter on CacheInner would allow faster test runs. Minor Issue 6: Inline await changes latency behavior on cache misses. The cache write was previously done in a background task; now it is awaited inline. This is likely intentional (to ensure broadcast happens after write), but it adds write latency to every cache miss response. A short comment documenting the trade-off would help future readers. Positive Observations: Switching GetterCtx from Vec to HashMap removes O(n) deduplication and makes key lookups O(1). scc::HashMap is an appropriate choice for concurrent in-flight tracking. Test coverage for the deduplication logic is solid: single waiter, multiple waiters, independent keys, mixed cached/in-flight, and timeout fallback are all covered. Removing the anyhow glob import aligns with the project style guidelines. The Driver::get signature change to a slice of RawCacheKey is a correct ergonomics improvement. |
bc5d6d3 to
b395fae
Compare
17d21f5 to
de18421
Compare
de18421 to
e65f84d
Compare
b395fae to
6570bf2
Compare
6570bf2 to
b46226c
Compare
65200cf to
ea32d90
Compare
b46226c to
66ec30f
Compare
Code Review: feat(cache): add in-flight deduplicationCritical Issues1. Broadcast never sent when getter resolves no values In engine/packages/cache/src/req_config.rs, broadcast_tx.send is only called inside the if entries_values.is_empty block. If the getter legitimately resolves no keys (a valid cache miss), all waiters remain blocked until IN_FLIGHT_TIMEOUT (5 seconds). The broadcast must be sent unconditionally after the getter completes. 2. Lease permanently left in in_flight after getter error If the getter returns Err, try_join propagates via ? and skips both the broadcast send and the lease cleanup loop. The in_flight entry for each leased key is never removed -- those keys are poisoned until process restart. Fix: replace try_join with join, capture the result, unconditionally broadcast and clean up leases, then propagate the error. These two issues combine: a getter error causes both a leaked lease and a 5-second cascading stall for all current waiters. Moderate Issues3. Race between broadcast send and lease removal Broadcast is sent before leases are removed from in_flight. A new request arriving in this window finds the entry still present, subscribes as a waiter on an already-exhausted channel, and stalls for the full timeout. Fix by removing leases first, then broadcasting. 4. Rate limit tests dropped without replacement The deleted integration.rs contained test_rate_limit_basic and test_rate_limit_ip_isolation. Neither appears in fetch.rs, in_flight.rs, or ttl.rs. If rate limiting is still supported, these should be preserved or replaced. Minor Issues5. timeout_falls_back_to_getter adds 5 real seconds to CI This test waits on the real IN_FLIGHT_TIMEOUT constant. Making the timeout configurable for tests or using tokio::time::pause would eliminate the wall-clock wait. 6. Uppercase log messages Lines with Failed to decode value and Failed to encode value in req_config.rs should be lowercase per project conventions. 7. No comment on inline vs background write trade-off The cache write is now awaited inline (previously spawned as background). A short comment explaining the ordering constraint would help future readers. Positive Observations
Summary
The two critical issues and the race condition should be addressed before merging. They can cause 5-second cascading stalls for any cache miss that errors or returns nothing. |
66ec30f to
97b9cfd
Compare
ddfa969 to
bed6ca4
Compare
97b9cfd to
3fc4f7f
Compare
bed6ca4 to
10a4ff1
Compare
3fc4f7f to
662fee6
Compare
10a4ff1 to
860e71e
Compare
662fee6 to
893ea25
Compare
893ea25 to
715bec8
Compare
860e71e to
22498e8
Compare

Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: