feat(taskbroker): Add Sent Flag to Prevent Dropping Tasks on Push Failure #586

Open
george-sentry wants to merge 4 commits into main from george/push-taskbroker/add-sent-flag
Conversation

@george-sentry
Member

Linear

Completes STREAM-860

Description

Currently, taskworkers pull tasks from taskbrokers via RPC. This approach works but has some drawbacks, so we want taskbrokers to push tasks to taskworkers instead. Read this page on Notion for more information.

Right now, I rely on `processing_deadline` to revert processing tasks back to pending when pushing them fails. The problem is that each revert consumes a processing attempt, so a task that was never delivered can run out of attempts and be dropped needlessly.

This PR adds a `sent` column to the activation table to track whether a task was successfully sent after being fetched from the table. With this change, upkeep increments processing attempts only for tasks that are processing and have `sent = true`.

If the status is processing and `sent = false`, pushing failed, timed out, or has not happened yet, so upkeep can revert the task to pending without incrementing processing attempts.
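
As a rough illustration of the rule described above, here is a minimal in-memory sketch. The `Activation` struct, the `Status` enum, the `max_attempts` parameter, and the `handle_expired_deadline` function are all hypothetical names chosen for this example; the real change operates on rows in the activation table, and what happens once attempts are exhausted (shown here as `Failure`) is an assumption.

```rust
/// Simplified status for illustration; the real store has more states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Processing,
    Failure, // assumed terminal state once attempts are exhausted
}

/// Hypothetical in-memory stand-in for one row of the activation table.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Activation {
    status: Status,
    sent: bool,
    processing_attempts: u32,
}

/// Applies the deadline rule to a task whose processing deadline has expired.
fn handle_expired_deadline(task: &mut Activation, max_attempts: u32) {
    if task.status != Status::Processing {
        return;
    }
    if task.sent {
        // The task reached a worker but was not completed in time,
        // so this expiry costs one processing attempt.
        task.processing_attempts += 1;
    }
    // When `sent` is false the task never reached a worker, so the
    // attempt counter is left untouched.
    task.sent = false;
    task.status = if task.processing_attempts >= max_attempts {
        Status::Failure
    } else {
        Status::Pending
    };
}
```

Under these assumptions, a task with `sent = false` returns to pending with its attempt count unchanged, while a task with `sent = true` loses one attempt per expired deadline.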

@george-sentry george-sentry requested a review from a team as a code owner April 2, 2026 22:03
@linear-code

linear-code bot commented Apr 2, 2026

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment on lines +456 to +458
    let mut rows = self
        .claim_activations(application, namespaces.as_deref(), Some(1), None, true)
        .await?;

Bug: A race condition in pull mode marks tasks as sent=true before delivery. A network failure during response transmission causes retry attempts to be consumed for undelivered tasks.
Severity: HIGH

Suggested Fix

The sent flag should only be marked as true after the gRPC response has been successfully delivered to the worker, similar to the implementation in push mode. This involves moving the logic that sets sent=true to after the gRPC call returns successfully, ensuring the database state reflects the actual delivery status.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/store/inflight_activation.rs#L456-L458

Potential issue: In pull mode, the `sent` flag for a task is set to `true` in the
database before the task data is successfully transmitted to the worker. If a network
failure occurs during the gRPC response transmission after the database write, the task
is marked as `sent` but was never delivered. When the processing deadline for this task
expires, the `handle_processing_deadline()` function will incorrectly increment the
`processing_attempts` counter because it treats the task as successfully sent. This
consumes a retry attempt for a task that never reached a worker, potentially causing it
to be dropped prematurely.
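
The ordering suggested in this comment could look roughly like the sketch below: attempt delivery first, and set the flag only if delivery reports success. `Claimed`, `deliver_then_mark_sent`, and the `deliver` closure (a stand-in for the gRPC send) are hypothetical names, not part of the PR. Whether a pull-mode handler can actually observe that its response reached the worker is an open question this sketch does not answer.

```rust
/// Hypothetical stand-in for a task that has already been claimed.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Claimed {
    id: String,
    sent: bool,
}

/// Marks the task as sent only after `deliver` succeeds.
/// `deliver` is an assumed placeholder for the real transport call.
fn deliver_then_mark_sent<F>(task: &mut Claimed, deliver: F) -> Result<(), String>
where
    F: FnOnce(&Claimed) -> Result<(), String>,
{
    // On failure, return early and leave `sent` as false so that upkeep
    // can revert the task without charging a processing attempt.
    deliver(&*task).map_err(|e| format!("task {}: {}", task.id, e))?;
    task.sent = true;
    Ok(())
}
```

With this ordering, a failed delivery leaves `sent = false`, which is the state that upkeep treats as "never delivered".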

Comment on lines 301 to 303
      if let Ok(tasks) = store
    -     .get_pending_activations_from_namespaces(None, Some(&demoted_namespaces), None, None)
    +     .claim_activations(None, Some(&demoted_namespaces), None, None, false)
          .await

Bug: Demoted namespace tasks with persistent Kafka publish failures enter an infinite retry loop, as their attempt counters are never incremented upon processing deadline expiration.
Severity: HIGH

Suggested Fix

Modify the logic for handling demoted namespace tasks to ensure that persistent failures consume retry attempts. This could involve either marking the task as sent=true before the Kafka publish attempt or introducing a separate mechanism to increment the attempt counter for this specific failure scenario, preventing the infinite loop.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/upkeep.rs#L301-L303

Potential issue: When handling demoted namespace tasks, the system claims them with
`sent=false` and attempts to publish them to Kafka. If the Kafka publish fails, the task
remains in a `processing` state with `sent=false`. When its processing deadline expires,
the task is reverted to `pending` without incrementing its `processing_attempts`
counter. This creates an infinite loop where a task with a persistent Kafka publish
failure will be repeatedly claimed and reverted without ever consuming its retry budget,
leading to wasted system resources.
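
One of the options mentioned in the suggested fix, a separate mechanism that charges an attempt when the publish itself fails, might be sketched as follows. `Demoted`, `forward_demoted`, the `publish` closure (standing in for the Kafka producer call), the `max_attempts` parameter, and the `Complete` and `Failure` states are all assumptions made for illustration and are not taken from the PR.

```rust
/// Simplified status for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Complete, // assumed state after a successful forward
    Failure,  // assumed state once the retry budget is spent
}

/// Hypothetical stand-in for a claimed task from a demoted namespace.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Demoted {
    status: Status,
    processing_attempts: u32,
}

/// Tries to forward a demoted task and charges an attempt on failure,
/// so a persistently failing publish cannot retry forever.
fn forward_demoted<F>(task: &mut Demoted, max_attempts: u32, publish: F)
where
    F: FnOnce(&Demoted) -> Result<(), String>,
{
    match publish(&*task) {
        Ok(()) => task.status = Status::Complete,
        Err(_) => {
            // The publish failure itself consumes retry budget.
            task.processing_attempts += 1;
            task.status = if task.processing_attempts >= max_attempts {
                Status::Failure
            } else {
                Status::Pending
            };
        }
    }
}
```

The design choice here is to count publish failures directly instead of relying on the `sent` flag, which keeps the flag's meaning ("delivered to a worker") unchanged for the push and pull paths.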
