
feat(cache): add in flight deduping#4459

Open
MasterPtato wants to merge 1 commit into 03-18-fix_cache_clean_up_lib from 03-19-feat_cache_add_in_flight_deduping


@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app bot commented Mar 19, 2026

🚅 Deployed to the rivet-pr-4459 environment in rivet-frontend

| Service | Status | Updated (UTC) |
|---|---|---|
| frontend-cloud | 😴 Sleeping | Mar 31, 2026 at 3:58 pm |
| frontend-inspector | 😴 Sleeping | Mar 24, 2026 at 1:03 am |
| website | ❌ Build Failed | Mar 19, 2026 at 8:53 pm |
| mcp-hub | ✅ Success | Mar 19, 2026 at 8:52 pm |
| ladle | ❌ Build Failed | Mar 19, 2026 at 8:51 pm |

Contributor Author

MasterPtato commented Mar 19, 2026

MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch from 175706a to 17d21f5 on March 19, 2026 20:53
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 75e1e38 to bc5d6d3 on March 19, 2026 20:53
@pkg-pr-new

pkg-pr-new bot commented Mar 19, 2026


@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4459

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4459

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4459

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4459

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4459

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4459

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4459

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4459

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4459

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4459

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4459

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4459

commit: 6570bf2

@claude

claude bot commented Mar 19, 2026

Code Review: feat(cache): add in flight deduping

The core idea of in-flight request deduplication is sound and addresses a real stampede problem. Here are my findings:

Critical Issue 1: Lease not released on getter error. If the getter returns an Err, try_join short-circuits via ?, so the lease release loop never runs. Every later request for those keys then waits the full 5 seconds before falling back, and this persists until the process restarts. A guard/defer pattern would ensure cleanup regardless of the error path.
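A minimal sketch of the guard pattern, kept std-only so it compiles on its own. LeaseGuard, fetch_with_lease, and the Mutex<HashSet> lease table are hypothetical stand-ins for the crate's scc::HashMap-based in-flight tracking, and the sketch leases keys unconditionally rather than checking for an existing holder. In async code, a guard owned by the future would also release leases if the caller drops the future mid-fetch, a path an explicit cleanup loop misses as well.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical std-only stand-in for the in-flight lease table.
type InFlight = Arc<Mutex<HashSet<String>>>;

/// Releases its keys on drop, so cleanup runs on success, on an early `?`
/// return, and while unwinding from a panic.
struct LeaseGuard {
    in_flight: InFlight,
    keys: Vec<String>,
}

impl Drop for LeaseGuard {
    fn drop(&mut self) {
        // Recover from a poisoned mutex instead of panicking inside drop.
        let mut leases = self.in_flight.lock().unwrap_or_else(|e| e.into_inner());
        for key in &self.keys {
            leases.remove(key);
        }
    }
}

/// Hypothetical helper: take leases on `keys`, then run the getter. The `?`
/// mirrors try_join's short-circuit; the guard still releases the leases.
fn fetch_with_lease<F>(in_flight: &InFlight, keys: Vec<String>, getter: F) -> Result<Vec<u32>, String>
where
    F: FnOnce(&[String]) -> Result<Vec<u32>, String>,
{
    in_flight.lock().unwrap().extend(keys.iter().cloned());
    let guard = LeaseGuard { in_flight: Arc::clone(in_flight), keys };
    let values = getter(&guard.keys)?;
    Ok(values)
}
```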

Critical Issue 2: Broadcast not sent on empty resolution or getter error. broadcast_tx.send is only called when entries_values is non-empty. If the getter resolves no values or errors, waiters are stuck until IN_FLIGHT_TIMEOUT (5 seconds). The broadcast should be sent unconditionally after the getter completes. Combined with issue 1, a getter error causes both a leaked lease and a 5-second stall for all waiters.

Moderate Issue 3: HashMap iteration order creates implicit coupling. In req_config.rs, keys and cache_keys are unzipped from ctx.entries() with non-deterministic HashMap iteration order, then keys is zipped with cached_values from the driver. This works because both were derived from the same iterator in the same pass, but it is fragile. A Vec pairing (Key, RawCacheKey) would make the relationship explicit and safe. The same issue applies in the waiting-keys path (succeeded_keys / succeeded_cache_keys).
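A sketch of the suggested pairing, assuming (as the current zip already does) that the driver returns results in the same order as the cache keys it is given. Key, RawCacheKey, driver_get, and lookup are hypothetical simplifications, not the crate's definitions.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for the crate's key types.
type Key = u64;
type RawCacheKey = String;

/// Pretend driver: one optional value per requested key, returned in input
/// order (the ordering contract both the current zip and this sketch rely on).
fn driver_get(cache_keys: &[RawCacheKey]) -> Vec<Option<String>> {
    cache_keys
        .iter()
        .map(|ck| ck.ends_with('1').then(|| format!("value-for-{ck}")))
        .collect()
}

/// Builds (Key, RawCacheKey) pairs in one pass so a single Vec carries the
/// association, instead of two Vecs unzipped from HashMap iteration.
fn lookup(entries: &HashMap<Key, RawCacheKey>) -> Vec<(Key, Option<String>)> {
    let pairs: Vec<(Key, RawCacheKey)> =
        entries.iter().map(|(k, ck)| (*k, ck.clone())).collect();
    let cache_keys: Vec<RawCacheKey> = pairs.iter().map(|(_, ck)| ck.clone()).collect();
    let values = driver_get(&cache_keys);
    // Results are matched to pairs by position, so each Key stays attached
    // to its own RawCacheKey whatever order the HashMap yielded.
    pairs.into_iter().zip(values).map(|((k, _ck), v)| (k, v)).collect()
}
```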

Moderate Issue 4: Rate limit tests silently removed. integration.rs contained test_rate_limit_basic and test_rate_limit_ip_isolation. These do not appear in any of the new test files (fetch.rs, in_flight.rs, ttl.rs). If rate limiting is still a feature of this crate, these tests should be preserved.

Minor Issue 5: timeout_falls_back_to_getter test adds 5 seconds to the test suite. The test necessarily waits for IN_FLIGHT_TIMEOUT. Making the timeout configurable via a cfg(test) override or a parameter on CacheInner would allow faster test runs.
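One way to make the timeout overridable, sketched with hypothetical names: a trimmed-down CacheInner carries in_flight_timeout as a field, and a std mpsc channel stands in for the broadcast channel that waiters subscribe to.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Default wait, matching the 5 seconds the review attributes to IN_FLIGHT_TIMEOUT.
const IN_FLIGHT_TIMEOUT: Duration = Duration::from_secs(5);

/// Hypothetical, trimmed-down CacheInner that stores the timeout as a field
/// instead of reading the constant at each call site.
struct CacheInner {
    in_flight_timeout: Duration,
}

impl CacheInner {
    fn new() -> Self {
        Self { in_flight_timeout: IN_FLIGHT_TIMEOUT }
    }

    /// Builder-style override, intended for tests that exercise the fallback.
    fn with_in_flight_timeout(mut self, timeout: Duration) -> Self {
        self.in_flight_timeout = timeout;
        self
    }

    /// Waits for the leader's value; on timeout (or a dropped sender) falls
    /// back to calling the getter directly.
    fn wait_or_fallback<T>(&self, rx: &Receiver<T>, getter: impl FnOnce() -> T) -> T {
        match rx.recv_timeout(self.in_flight_timeout) {
            Ok(value) => value,
            Err(_) => getter(),
        }
    }
}
```

If the code stays on tokio with the constant unchanged, running the test under #[tokio::test(start_paused = true)] (which needs tokio's test-util feature) lets the timeout elapse in virtual time, the same idea as the later review's tokio::time::pause suggestion.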

Minor Issue 6: Inline await changes latency behavior on cache misses. The cache write was previously done in a background task; now it is awaited inline. This is likely intentional (to ensure broadcast happens after write), but it adds write latency to every cache miss response. A short comment documenting the trade-off would help future readers.

Positive Observations:

  • Switching GetterCtx from Vec to HashMap removes O(n) deduplication and makes key lookups O(1).
  • scc::HashMap is an appropriate choice for concurrent in-flight tracking.
  • Test coverage for the deduplication logic is solid: single waiter, multiple waiters, independent keys, mixed cached/in-flight, and timeout fallback are all covered.
  • Removing the anyhow glob import aligns with the project style guidelines.
  • The Driver::get signature change to a slice of RawCacheKey is a correct ergonomics improvement.
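To illustrate the first observation, a simplified GetterCtx keyed by a HashMap: duplicate keys collapse on insertion, and resolving a key is an average O(1) lookup. The entries field layout and the resolve method are assumptions for illustration, not the crate's API.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical, simplified GetterCtx: requested key -> resolved value.
struct GetterCtx<K, V> {
    entries: HashMap<K, Option<V>>,
}

impl<K: Hash + Eq, V> GetterCtx<K, V> {
    /// Collecting into a HashMap collapses duplicate keys, so no separate
    /// O(n) dedup pass over a Vec is needed.
    fn new(keys: impl IntoIterator<Item = K>) -> Self {
        Self {
            entries: keys.into_iter().map(|k| (k, None)).collect(),
        }
    }

    /// Average O(1) lookup; keys that were never requested are ignored.
    fn resolve(&mut self, key: &K, value: V) {
        if let Some(slot) = self.entries.get_mut(key) {
            *slot = Some(value);
        }
    }
}
```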

MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from bc5d6d3 to b395fae on March 21, 2026 01:55
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch from 17d21f5 to de18421 on March 21, 2026 01:55
MasterPtato mentioned this pull request Mar 21, 2026
11 tasks
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch from de18421 to e65f84d on March 24, 2026 00:30
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from b395fae to 6570bf2 on March 24, 2026 00:30
MasterPtato mentioned this pull request Mar 24, 2026
11 tasks
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 6570bf2 to b46226c on March 24, 2026 00:36
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch 2 times, most recently from 65200cf to ea32d90 on March 25, 2026 00:05
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from b46226c to 66ec30f on March 25, 2026 00:05
@claude

claude bot commented Mar 25, 2026

Code Review: feat(cache): add in-flight deduplication

Critical Issues

1. Broadcast never sent when getter resolves no values

In engine/packages/cache/src/req_config.rs, broadcast_tx.send is only called inside the if !entries_values.is_empty() block. If the getter legitimately resolves no keys (a valid cache miss), all waiters remain blocked until IN_FLIGHT_TIMEOUT (5 seconds). The broadcast must be sent unconditionally after the getter completes.

2. Lease permanently left in in_flight after getter error

If the getter returns Err, try_join propagates via ? and skips both the broadcast send and the lease cleanup loop. The in_flight entry for each leased key is never removed, so those keys stay poisoned until the process restarts.

Fix: replace try_join with join, capture the result, unconditionally broadcast and clean up leases, then propagate the error.

These two issues combine: a getter error causes both a leaked lease and a 5-second cascading stall for all current waiters.
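A std-only sketch of the proposed fix: capture the getter's result instead of short-circuiting, release the leases, notify waiters unconditionally, and only then propagate the error. Removing leases before notifying also follows the ordering issue 3 below recommends. Done, lead, wait, and the Mutex<HashMap> table are hypothetical; a Condvar with a finished flag stands in for the PR's broadcast channel, and the leader simply leases every key it is given.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical per-lease completion signal standing in for the broadcast
/// channel; `finished` flips to true once the leader is done, ok or err.
#[derive(Default)]
struct Done {
    finished: Mutex<bool>,
    cv: Condvar,
}

/// Hypothetical in-flight table: key -> signal shared by all waiters on it.
type InFlight = Mutex<HashMap<String, Arc<Done>>>;

/// Leader path. Simplification: it leases every key it is given.
fn lead(
    in_flight: &InFlight,
    keys: &[String],
    getter: impl FnOnce() -> Result<u32, String>,
) -> Result<u32, String> {
    let done = Arc::new(Done::default());
    {
        let mut map = in_flight.lock().unwrap();
        for key in keys {
            map.insert(key.clone(), Arc::clone(&done));
        }
    }

    // Like awaiting join instead of try_join: capture the result, no `?` yet.
    let result = getter();

    // 1. Release leases first, so a request arriving from here on starts its
    //    own fetch instead of subscribing to a signal that may have fired.
    {
        let mut map = in_flight.lock().unwrap();
        for key in keys {
            map.remove(key);
        }
    }

    // 2. Notify unconditionally: on values, on an empty result, and on error.
    *done.finished.lock().unwrap() = true;
    done.cv.notify_all();

    // 3. Only now surface the getter's outcome to the caller.
    result
}

/// Waiter path: true if the leader finished within `timeout`.
fn wait(done: &Done, timeout: Duration) -> bool {
    let finished = done.finished.lock().unwrap();
    let (finished, _) = done
        .cv
        .wait_timeout_while(finished, timeout, |f| !*f)
        .unwrap();
    *finished
}
```

Because the flag is set under the mutex before notify_all, a waiter that checks late still sees finished == true and returns at once. A broadcast channel generally delivers only values sent after a receiver subscribes, which is why the remove-then-send ordering matters more in the PR's version.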


Moderate Issues

3. Race between broadcast send and lease removal

Broadcast is sent before leases are removed from in_flight. A new request arriving in this window finds the entry still present, subscribes as a waiter on an already-exhausted channel, and stalls for the full timeout. Fix by removing leases first, then broadcasting.

4. Rate limit tests dropped without replacement

The deleted integration.rs contained test_rate_limit_basic and test_rate_limit_ip_isolation. Neither appears in fetch.rs, in_flight.rs, or ttl.rs. If rate limiting is still supported, these should be preserved or replaced.


Minor Issues

5. timeout_falls_back_to_getter adds 5 real seconds to CI

This test waits on the real IN_FLIGHT_TIMEOUT constant. Making the timeout configurable for tests or using tokio::time::pause would eliminate the wall-clock wait.

6. Uppercase log messages

The log messages "Failed to decode value" and "Failed to encode value" in req_config.rs should be lowercase per project conventions.

7. No comment on inline vs background write trade-off

The cache write is now awaited inline (previously spawned as background). A short comment explaining the ordering constraint would help future readers.


Positive Observations

  • Switching GetterCtx from Vec with O(n) dedup to HashMap is clean. Lookups are O(1) and dedup is free.
  • scc::HashMap is a good fit for in_flight: fine-grained locking without a global mutex.
  • The Driver::get slice signature is an ergonomics improvement.
  • Removing the anyhow glob import aligns with project conventions.
  • New focused test files are much easier to navigate than the old monolithic integration.rs.
  • In-flight test coverage is solid across all distinct code paths.

Summary

| Issue | Severity |
|---|---|
| Broadcast not sent when getter resolves nothing | Critical |
| Lease leaked in in_flight after getter error | Critical |
| Race between broadcast send and lease removal | Moderate |
| Rate limit tests dropped without replacement | Moderate |
| Timeout test adds 5s to CI wall time | Minor |
| Uppercase log messages | Minor |
| No comment on inline write ordering | Minor |

The two critical issues and the race condition should be addressed before merging. They can cause 5-second cascading stalls for any cache miss that errors or returns nothing.

MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 66ec30f to 97b9cfd on March 26, 2026 01:18
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch 2 times, most recently from ddfa969 to bed6ca4 on March 26, 2026 20:50
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 97b9cfd to 3fc4f7f on March 26, 2026 20:50
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch from bed6ca4 to 10a4ff1 on March 28, 2026 00:20
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 3fc4f7f to 662fee6 on March 28, 2026 00:20
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch from 10a4ff1 to 860e71e on March 30, 2026 19:40
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 662fee6 to 893ea25 on March 30, 2026 19:40
MasterPtato force-pushed the 03-19-feat_cache_add_in_flight_deduping branch from 893ea25 to 715bec8 on March 31, 2026 01:40
MasterPtato force-pushed the 03-18-fix_cache_clean_up_lib branch from 860e71e to 22498e8 on March 31, 2026 01:40
MasterPtato mentioned this pull request Mar 31, 2026
11 tasks

Labels

None yet

Projects

None yet


1 participant