fix: prevent setCheckpoint from overwriting concurrent job state updates#71
Merged
fix: prevent setCheckpoint from overwriting concurrent job state updates#71
Conversation
Replace savePlan(staleSnapshot) calls in setCheckpoint, clearCheckpoint, _doReconcile, and resumePlan with atomic updatePlanFields that only modifies plan-level fields (status, checkpoint, completedAt, prUrl) without touching job states. Add reconciliation safety net that cross-references jobs.json to detect plan jobs stuck as 'running' when they have already completed or failed. Closes #63
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a race condition where
setCheckpoint()in the orchestrator overwrites concurrent job state updates viasavePlan()with a stale plan snapshot. This caused completed plan jobs to appear stuck as "running" when a sibling job fails.Closes #63
Changes
src/lib/plan-state.ts: AddupdatePlanFields()— atomic read-modify-write for plan-level fields only (status, checkpoint, checkpointContext, completedAt, prUrl) inside theplanMutex, preserving job statessrc/lib/orchestrator.ts: Replace allsavePlan(staleSnapshot)calls insetCheckpoint,clearCheckpoint,_doReconcile, andresumePlanwithupdatePlanFields. Add reconciliation safety net that cross-referencesjobs.jsonto detect plan jobs stuck as 'running' when they have already completed or failedtests/lib/plan-state.test.ts: Add 5 tests forupdatePlanFields(basic update, concurrent preservation, ID mismatch, no plan, checkpointContext)tests/lib/orchestrator.test.ts: AddupdatePlanFieldsmock to allbeforeEachblocks. Add race condition regression test and safety net testtests/lib/orchestrator-modes.test.ts: AddupdatePlanFieldsmock tobeforeEachblockTesting
bun run buildpassesbun testpasses (613/613)Notes
handleJobFailedcallingloadPlan()outside the mutex, getting a stale snapshot, then passing it tosetCheckpoint()which calledsavePlan(plan)— writing the entire stale object back to disk, overwriting concurrentupdatePlanJob()changes fromhandleJobCompletefor sibling jobs._doReconcileis a defense-in-depth measure that detects jobs stuck as 'running' in the plan whenjobs.jsonalready shows them as completed/failed. This covers edge cases where the race condition may have already corrupted plan state before this fix was deployed.