feat(taskbroker): Add Sent Flag to Prevent Dropping Tasks on Push Failure #586
george-sentry wants to merge 4 commits into main
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    let mut rows = self
        .claim_activations(application, namespaces.as_deref(), Some(1), None, true)
        .await?;
Bug: A race condition in pull mode marks tasks as sent=true before delivery. A network failure during response transmission causes retry attempts to be consumed for undelivered tasks.
Severity: HIGH
Suggested Fix
The sent flag should only be marked as true after the gRPC response has been successfully delivered to the worker, similar to the implementation in push mode. This involves moving the logic that sets sent=true to after the gRPC call returns successfully, ensuring the database state reflects the actual delivery status.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/store/inflight_activation.rs#L456-L458
Potential issue: In pull mode, the `sent` flag for a task is set to `true` in the
database before the task data is successfully transmitted to the worker. If a network
failure occurs during the gRPC response transmission after the database write, the task
is marked as `sent` but was never delivered. When the processing deadline for this task
expires, the `handle_processing_deadline()` function will incorrectly increment the
`processing_attempts` counter because it treats the task as successfully sent. This
consumes a retry attempt for a task that never reached a worker, potentially causing it
to be dropped prematurely.
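
The ordering proposed in the suggested fix can be sketched as follows. This is a minimal illustration, not the taskbroker implementation: `Claimed`, `deliver_then_mark`, and the `deliver` closure are hypothetical names, and the closure stands in for whatever step actually hands the task to the worker. Whether the pull-mode handler can observe that the gRPC response was delivered is not established by this comment, so treat the sketch as showing the intended ordering only.

```rust
/// Hypothetical in-memory stand-in for a claimed activation row.
#[derive(Debug)]
struct Claimed {
    id: u64,
    sent: bool,
}

/// Runs `deliver` first and sets `sent = true` only if delivery succeeded.
/// On failure the flag stays `false`, so a later deadline expiry would not
/// be counted as a processing attempt.
fn deliver_then_mark<F>(task: &mut Claimed, deliver: F) -> Result<(), String>
where
    F: FnOnce(&Claimed) -> Result<(), String>,
{
    // An error returns early here, before the flag is touched.
    deliver(&*task)?;
    task.sent = true;
    Ok(())
}
```

With this ordering, a failed delivery leaves the task in `processing` with `sent = false`, which is the state the PR description treats as safe to revert without consuming an attempt.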
      if let Ok(tasks) = store
    -     .get_pending_activations_from_namespaces(None, Some(&demoted_namespaces), None, None)
    +     .claim_activations(None, Some(&demoted_namespaces), None, None, false)
          .await
Bug: Demoted namespace tasks with persistent Kafka publish failures enter an infinite retry loop, as their attempt counters are never incremented upon processing deadline expiration.
Severity: HIGH
Suggested Fix
Modify the logic for handling demoted namespace tasks to ensure that persistent failures consume retry attempts. This could involve either marking the task as sent=true before the Kafka publish attempt or introducing a separate mechanism to increment the attempt counter for this specific failure scenario, preventing the infinite loop.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/upkeep.rs#L301-L303
Potential issue: When handling demoted namespace tasks, the system claims them with
`sent=false` and attempts to publish them to Kafka. If the Kafka publish fails, the task
remains in a `processing` state with `sent=false`. When its processing deadline expires,
the task is reverted to `pending` without incrementing its `processing_attempts`
counter. This creates an infinite loop where a task with a persistent Kafka publish
failure will be repeatedly claimed and reverted without ever consuming its retry budget,
leading to wasted system resources.
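
One way to read the second option in the suggested fix (a separate mechanism that charges publish failures against the retry budget) is sketched below. Everything here is an assumption made for illustration: `DemotedTask`, `Outcome`, `after_publish`, and the `max_attempts` parameter are hypothetical, and the real code may track attempts and limits differently.

```rust
/// Hypothetical result of handling one publish attempt.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    Published,
    RetryLater,
    BudgetExhausted,
}

/// Hypothetical stand-in for a demoted-namespace activation.
struct DemotedTask {
    processing_attempts: u32,
}

/// Counts a failed Kafka publish as a consumed attempt so that a
/// persistently failing task eventually stops being retried.
fn after_publish(
    task: &mut DemotedTask,
    publish: Result<(), String>,
    max_attempts: u32, // assumed limit, not taken from the PR
) -> Outcome {
    match publish {
        Ok(()) => Outcome::Published,
        Err(_) => {
            task.processing_attempts += 1;
            if task.processing_attempts >= max_attempts {
                Outcome::BudgetExhausted
            } else {
                Outcome::RetryLater
            }
        }
    }
}
```

The point of the sketch is only that the failure path itself increments the counter, so the loop terminates even though the task never reaches `sent = true`.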

Linear
Completes STREAM-860
Description
Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works, but has some drawbacks. Therefore, we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.
Right now, I rely on `processing_deadline` to revert `processing` tasks back to `pending` if pushing them failed. This is a problem because every such revert consumes a processing attempt, so tasks that were never delivered can exhaust their attempts and be dropped needlessly.
I want to add a `sent` column to the activation table to track whether a task was successfully sent after being fetched from the table. With this change, upkeep increments processing attempts only for tasks that are `processing` and have `sent = true`.
If the status is `processing` and `sent = false`, pushing failed, timed out, or has not happened yet, so we can revert the task to `pending` without incrementing processing attempts.
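
The deadline rule described above can be sketched in a few lines. This is a simplified in-memory model, not the store's SQL: `Status`, `Activation`, and `on_deadline_expired` are hypothetical names, and resetting `sent` to `false` when the task returns to `pending` is an assumption about how the next claim would start.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Processing,
}

/// Hypothetical stand-in for one row of the activation table.
#[derive(Debug)]
struct Activation {
    status: Status,
    sent: bool,
    processing_attempts: u32,
}

/// Applies the rule to a task whose processing deadline has expired.
fn on_deadline_expired(task: &mut Activation) {
    if task.status != Status::Processing {
        return; // only tasks in `processing` are affected
    }
    if task.sent {
        // The task reached a worker, so the expiry counts as an attempt.
        task.processing_attempts += 1;
    }
    // In both cases the task becomes claimable again.
    // Assumption: the flag is cleared for the next claim.
    task.sent = false;
    task.status = Status::Pending;
}
```

A task that expired with `sent = false` comes back as `pending` with its attempt count unchanged, which is what keeps push failures from draining the retry budget.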