Open
Conversation
mitar
commented
Apr 19, 2026
| // timeout so the retry loop (attempt 1 -> Available -> fetcher picks it | ||
| // up -> attempt 2) completes reliably even under the race detector or | ||
| // heavy CI parallelism. | ||
| deadline := time.After(30 * time.Second) |
Contributor
Author
There was a problem hiding this comment.
Default 10 seconds from WaitOrTimeoutN was not enough here for stable CI runs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
See discussion in #1173 for background.
I was trying to prepare response for that discussion but more I was thinking and exploring the codebase, more I realized how this could be implemented so I decided just to go with it and implement it, which then makes it much easier to reason about all edge cases and stuff. So here it is, implementation adding support for
JobCancelTxandJobFailTxso that also failures can be recorded inside a transaction. At the end the implementation was pretty straightforward and I like the outcome.@brandur You asked:
See
example_fail_job_within_tx_test.gofile for this. I think it is pretty neat with usingdeferto handle all possible ways job body could return error. There is also a simpler version of it inexample_cancel_job_within_tx_test.go.JobFailTx/JobCancelTxedge-case matrixIntegration tests in
job_fail_tx_integration_test.go(TestJobFailTxIntegration) exercise every combination of worker return **Txcall * tx outcome.nilCompletednilJobFailTxRetryable/Available/Discarded(parameterized by attempt)nilJobFailTxCompleted(executor fallback)nilJobCancelTxCancellednilJobCancelTxCompleted(executor fallback)errRetryable/DiscardederrJobFailTxRetryable/Discarded(from Tx call; executor IfRunning no-op)errJobFailTxRetryable/Discarded(executor fallback, using Work's err)errJobCancelTxCancellederrJobCancelTxRetryable/Discarded(executor fallback, using Work's err)JobFailTxRetryable/Discarded(from Tx; panic handled, IfRunning no-op)JobCancelErrorJobFailTxRetryable/Discarded(from Tx wins; executor's cancel UPDATE is IfRunning no-op;ErrorHandler.HandleErroris not called because the cancel branch skips it)nilJobFailTxthenJobCompleteTx(same tx)Retryable(from first Tx); second Tx call returns no error and a job whoseStateis notCompleted(the row asJobFailTxleft it - IfRunning blocked the state change in SQL, so the row is returned unchanged via the UNION-ALL branch)nilJobFailTxreturned an errorCompleted) if the error fired before the DB UPDATE; into Row 2 (Retryable/Discarded) if the error fired after (args-unmarshal path).errJobFailTxreturned an errorRetryable/Discarded, executor fallback with Work's err) if the error fired before the DB UPDATE; into Row 7 (from-Tx state wins) if the error fired after.nilJobFailTxreturned an errornilreturn and completes.errJobFailTxreturned an errorNotes
Rows 2, 7, 11: "
Retryable/Discarded" depends onjob.Attemptvsjob.MaxAttempts.JobFailTxpicksDiscardedon the final attempt andRetryable/Availableotherwise, mirroring the executor's own error-path logic.Rows 3, 5, 8, 10: on rollback, the
*TxDB write is undone and the row staysrunning, so the executor's own path runs normally based on whatWorkreturned (nil → complete path; err → error path).Row 7: the
AttemptErrorpersisted is the one fromJobFailTx(e.g."from JobFailTx"), not the oneWorkreturned. The executor's later errors-append is guarded bystate = 'running'in the SQL and is a no-op.Row 11: same as row 7 in terms of the persisted
AttemptError; the panic value is swallowed at the DB level for the same reason (IfRunning guard).Row 12:
errors.As(workErr, &JobCancelError{})succeeds, so the executor takes its cancel branch inreportError. That branch (a) skipsErrorHandler.HandleError, and (b) tries to writeJobSetStateCancelled- blocked by the IfRunning guard becauseJobFailTxalready moved the row out of'running'. Net DB result: whatJobFailTxset.Row 13:
JobSetStateIfRunningMany's SQL returns the row even on a no-op update (via itsUNION ALLbranch), soJobCompleteTxfinds the row and doesn't returnErrNotFound- it simply returns the row in its post-JobFailTxstate.Rows 14–17 (
JobFailTxitself returned an error): these don't introduce new terminal states - they collapse into earlier rows depending on where inJobFailTx's body the error fired.Error-return sites in
JobFailTx(job_fail_tx.go) split into two buckets:"job must be running", client-not-in-context, metadata-marshal failure, attempt-error-marshal failure,pilot.JobSetStateIfRunningManyDB error, andlen(rows) == 0→ErrNotFound. In all of these, the row is never updated, so whether the caller commits or rolls back is irrelevant - the final DB state is whatever the executor itself chooses based onWork's return. Behavior collapses to Row 1 / 6.json.Unmarshal(EncodedArgs, &updatedJob.Args)at the end ofJobFailTx. The row was updated in the caller's tx; whether that persists depends on commit vs rollback. On commit, behavior collapses to Row 2 / 7 (from-Tx state wins). On rollback, behavior collapses to Row 1 / 6 (row never actually changed).In practice, callers follow the Go idiom
if err != nil { return err }afterJobFailTx, which meansWorkreturns the err and theirdefer tx.Rollback(ctx)fires (committing nothing). That path falls under Row 17. The "DB marked failed despiteJobFailTxerr" variants (14 on commit after args-unmarshal failure; 15 on commit after args-unmarshal failure) are pathological - the caller got a non-nil err fromJobFailTxand chose to commit anyway.