feat(ups): implement queue subs#4486
feat(ups): implement queue subs#4486MasterPtato wants to merge 1 commit into03-19-feat_cache_add_in_flight_dedupingfrom
Conversation
|
🚅 Deployed to the rivet-pr-4486 environment in rivet-frontend
|
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
ec32bdf to
ae0f886
Compare
6570bf2 to
b46226c
Compare
PR Review:
|
| Severity | Issue |
|---|---|
| Critical | DefaultHasher used for persistent DB keys - not stable across processes/versions |
| Medium | No metrics update in queue subscription cleanup task |
| Medium | retain_sync inside async retain - TODO unresolved |
| Medium | No queue depth limit; unbounded growth if consumers are down |
| Low | return Ok(...) vs idiomatic expression form |
| Low | SQL NOTIFY injection pattern fragile (safe today, not future-proof) |
| Low | Unused anyhow! import if all usages replaced |
| Info | Memory metric counts topics not subscribers (expanded surface) |
| Info | Load-balance test 500ms window may be flaky on slow CI |
The DefaultHasher stability issue is the most important fix - it will silently corrupt queue routing in any multi-process deployment.
|
Code Review summary for feat(ups): implement queue subs -- The implementation is well-structured and the test coverage is solid. Two medium issues to address before merge: (1) retain_sync on scc::HashMap inside the memory driver GC task runs synchronously inside an async closure and can block the executor; (2) The SELECT and INSERT+NOTIFY in publish_to_queues are not wrapped in a transaction, creating a race where a message can be orphaned for up to 1 hour if the subscriber dies between the two operations. Minor items: one remaining anyhow! macro call in publish() should use .context() instead; the queue parameter is missing from the tracing span in queue_subscribe (fields(%subject) should also include queue); and the Cargo.lock version downgrade from 2.1.7 to 2.1.6-rc.1 appears to be a stacked-PR artifact. On test coverage: test_queue_subscribe_load_balance only checks total message count, not per-subscriber distribution -- this is the right choice to avoid flakiness but worth noting in a comment. Also missing a test for Postgres reconnection behavior (does not drain pending ups_queue_messages on reconnect). |
|
PR Review: feat(ups): implement queue subs Good implementation of queue subscriptions across all three drivers. The Postgres driver in particular is well thought out with atomic message claiming, heartbeating, and reconnection handling. A few issues worth addressing before merging. BUG: Heartbeat Task Leaks After Subscriber Drop In queue_subscribe in the Postgres driver, the heartbeat task uses a cloned CancellationToken for its stop signal, but the token stored in the struct (_heartbeat_token) is just a CancellationToken — not a DropGuard. Dropping a CancellationToken does NOT cancel it, so the heartbeat loop will run indefinitely after the subscriber is dropped. The child token (heartbeat_token_child) inside the spawned task holds a reference, so neither token gets cancelled when the struct is dropped. Fix: store a DropGuard instead of the bare CancellationToken: This means dropped queue subscriptions will keep heartbeating until the process exits, keeping dead rows alive in ups_queue_subs beyond the TTL. Cleanup in Drop Has No Error Logging The Drop impl for PostgresQueueSubscriber spawns a task to deregister from ups_queue_subs but silently ignores errors. Worth adding a tracing::warn if the delete fails. Memory Driver Metric Counts Topics, Not Subscribers queue_subscribers.len() is the number of topics, not individual queue subscribers (which would require iterating the nested maps). This makes the metric misleading when there are multiple queue groups or multiple subscribers per group. spawn_queue_subscription_cleanup_task Missing Metric Update Unlike spawn_subscription_cleanup_task which calls metrics::POSTGRES_SUBSCRIPTION_COUNT.set(...) after cleanup, the queue variant does not update any metric on removal. Minor: TODO Left in GC Loop Worth resolving before merge. The concern is valid — retain_sync inside retain_async blocks the async executor for the inner retain duration. For small maps this is likely fine, but a brief explanatory comment would be better than a TODO. Minor: rx: Option<...> Is Unnecessary PostgresQueueSubscriber stores rx: Option<broadcast::Receiver<Vec>>, but it is always Some at construction. Consider storing rx directly to avoid the unnecessary Option wrapping. Minor: PostgresQueueSubscriber Visibility pub struct PostgresQueueSubscriber is fully public but only used internally as Box. Should be pub(crate) or private. Observation: Delivery Semantics The FOR UPDATE SKIP LOCKED + DELETE approach means a message is gone the moment it is claimed, before the consumer processes it. If the consumer crashes after claiming but before processing, the message is lost. This is at-most-once delivery (matching NATS queue subscribe semantics). Worth a doc comment since it is a meaningful semantic choice. Overall the implementation is solid. The Postgres driver correctly handles reconnection, wakeup fan-out, atomic claiming, and GC. The heartbeat leak is the one bug that should be fixed before this ships. |
b46226c to
66ec30f
Compare
ae0f886 to
4e8c22e
Compare
e5b3f53 to
6af4511
Compare
97b9cfd to
3fc4f7f
Compare
6af4511 to
f975d35
Compare
3fc4f7f to
662fee6
Compare
662fee6 to
893ea25
Compare
f975d35 to
08159ac
Compare
08159ac to
5113d59
Compare
893ea25 to
715bec8
Compare

Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: