Skip to content

Add JobCancelTx and JobFailTx#1219

Open
mitar wants to merge 1 commit intoriverqueue:masterfrom
mitar:feat-jobfailtx
Open

Add JobCancelTx and JobFailTx#1219
mitar wants to merge 1 commit intoriverqueue:masterfrom
mitar:feat-jobfailtx

Conversation

@mitar
Copy link
Copy Markdown
Contributor

@mitar mitar commented Apr 18, 2026

See discussion in #1173 for background.

I was trying to prepare response for that discussion but more I was thinking and exploring the codebase, more I realized how this could be implemented so I decided just to go with it and implement it, which then makes it much easier to reason about all edge cases and stuff. So here it is, implementation adding support for JobCancelTx and JobFailTx so that also failures can be recorded inside a transaction. At the end the implementation was pretty straightforward and I like the outcome.

@brandur You asked:

Do you think you could write a sample job implementation with transaction that shows how this would be used, and make sure to include a few different sample return err to show how those would interact with the new feature? This'd help me get a clearer understanding of how you're looking to use this sort of thing.

See example_fail_job_within_tx_test.go file for this. I think it is pretty neat with using defer to handle all possible ways job body could return error. There is also a simpler version of it in example_cancel_job_within_tx_test.go.

JobFailTx / JobCancelTx edge-case matrix

Integration tests in job_fail_tx_integration_test.go (TestJobFailTxIntegration) exercise every combination of worker return * *Tx call * tx outcome.

# Work returns Tx fn called Tx outcome Expected final state
1 nil / / Completed
2 nil JobFailTx commit Retryable / Available / Discarded (parameterized by attempt)
3 nil JobFailTx rollback Completed (executor fallback)
4 nil JobCancelTx commit Cancelled
5 nil JobCancelTx rollback Completed (executor fallback)
6 err / / Retryable / Discarded
7 err JobFailTx commit Retryable / Discarded (from Tx call; executor IfRunning no-op)
8 err JobFailTx rollback Retryable / Discarded (executor fallback, using Work's err)
9 err JobCancelTx commit Cancelled
10 err JobCancelTx rollback Retryable / Discarded (executor fallback, using Work's err)
11 panic JobFailTx commit Retryable / Discarded (from Tx; panic handled, IfRunning no-op)
12 JobCancelError JobFailTx commit Retryable / Discarded (from Tx wins; executor's cancel UPDATE is IfRunning no-op; ErrorHandler.HandleError is not called because the cancel branch skips it)
13 nil JobFailTx then JobCompleteTx (same tx) commit Retryable (from first Tx); second Tx call returns no error and a job whose State is not Completed (the row as JobFailTx left it - IfRunning blocked the state change in SQL, so the row is returned unchanged via the UNION-ALL branch)
14 nil JobFailTx returned an error commit Collapses into Row 1 (Completed) if the error fired before the DB UPDATE; into Row 2 (Retryable/Discarded) if the error fired after (args-unmarshal path).
15 err JobFailTx returned an error commit Collapses into Row 6 (Retryable/Discarded, executor fallback with Work's err) if the error fired before the DB UPDATE; into Row 7 (from-Tx state wins) if the error fired after.
16 nil JobFailTx returned an error rollback Collapses into Row 1 - DB rollback undoes any partial Tx work, executor sees Work's nil return and completes.
17 err JobFailTx returned an error rollback Collapses into Row 6 - DB rollback undoes any partial Tx work, executor's normal error path handles Work's err.

Notes

  • Rows 2, 7, 11: "Retryable / Discarded" depends on job.Attempt vs job.MaxAttempts. JobFailTx picks Discarded on the final attempt and Retryable/Available otherwise, mirroring the executor's own error-path logic.

  • Rows 3, 5, 8, 10: on rollback, the *Tx DB write is undone and the row stays running, so the executor's own path runs normally based on what Work returned (nil → complete path; err → error path).

  • Row 7: the AttemptError persisted is the one from JobFailTx (e.g. "from JobFailTx"), not the one Work returned. The executor's later errors-append is guarded by state = 'running' in the SQL and is a no-op.

  • Row 11: same as row 7 in terms of the persisted AttemptError; the panic value is swallowed at the DB level for the same reason (IfRunning guard).

  • Row 12: errors.As(workErr, &JobCancelError{}) succeeds, so the executor takes its cancel branch in reportError. That branch (a) skips ErrorHandler.HandleError, and (b) tries to write JobSetStateCancelled - blocked by the IfRunning guard because JobFailTx already moved the row out of 'running'. Net DB result: what JobFailTx set.

  • Row 13: JobSetStateIfRunningMany's SQL returns the row even on a no-op update (via its UNION ALL branch), so JobCompleteTx finds the row and doesn't return ErrNotFound - it simply returns the row in its post-JobFailTx state.

  • Rows 14–17 (JobFailTx itself returned an error): these don't introduce new terminal states - they collapse into earlier rows depending on where in JobFailTx's body the error fired.

    Error-return sites in JobFailTx (job_fail_tx.go) split into two buckets:

    • Before the DB UPDATE: "job must be running", client-not-in-context, metadata-marshal failure, attempt-error-marshal failure, pilot.JobSetStateIfRunningMany DB error, and len(rows) == 0ErrNotFound. In all of these, the row is never updated, so whether the caller commits or rolls back is irrelevant - the final DB state is whatever the executor itself chooses based on Work's return. Behavior collapses to Row 1 / 6.
    • After the DB UPDATE: only the final json.Unmarshal(EncodedArgs, &updatedJob.Args) at the end of JobFailTx. The row was updated in the caller's tx; whether that persists depends on commit vs rollback. On commit, behavior collapses to Row 2 / 7 (from-Tx state wins). On rollback, behavior collapses to Row 1 / 6 (row never actually changed).

    In practice, callers follow the Go idiom if err != nil { return err } after JobFailTx, which means Work returns the err and their defer tx.Rollback(ctx) fires (committing nothing). That path falls under Row 17. The "DB marked failed despite JobFailTx err" variants (14 on commit after args-unmarshal failure; 15 on commit after args-unmarshal failure) are pathological - the caller got a non-nil err from JobFailTx and chose to commit anyway.

// timeout so the retry loop (attempt 1 -> Available -> fetcher picks it
// up -> attempt 2) completes reliably even under the race detector or
// heavy CI parallelism.
deadline := time.After(30 * time.Second)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default 10 seconds from WaitOrTimeoutN was not enough here for stable CI runs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant