
fix: prevent setCheckpoint from overwriting concurrent job state updates #71

Merged
nigel-dev merged 2 commits into main from fix/stale-plan-snapshot-race
Feb 14, 2026

@nigel-dev
Owner

Summary

Fixes a race condition where setCheckpoint() in the orchestrator overwrites concurrent job state updates by calling savePlan() with a stale plan snapshot. This caused completed plan jobs to appear stuck as "running" when a sibling job failed.
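
The interleaving behind this bug can be reproduced deterministically with an in-memory stand-in for the plan file. The sketch below is illustrative only: the `Plan` shape, the job IDs, the `"on_error"` checkpoint value, and the `reproduceLostUpdate` helper are assumptions, not code from this repository.

```typescript
// Hypothetical, simplified shapes; the real Plan type in src/lib/plan-state.ts may differ.
type JobStatus = "running" | "completed" | "failed";

interface Plan {
  id: string;
  checkpoint: string | null;
  jobs: Record<string, JobStatus>;
}

// In-memory stand-in for the plan file on disk.
let disk: Plan = {
  id: "plan-1",
  checkpoint: null,
  jobs: { "job-a": "running", "job-b": "running" },
};

const clone = <T>(value: T): T => JSON.parse(JSON.stringify(value)) as T;
const loadPlan = (): Plan => clone(disk);
const savePlan = (plan: Plan): void => {
  disk = clone(plan);
};

function reproduceLostUpdate(): Plan {
  // 1. The failure handler for job-b takes a snapshot while job-a is still running.
  const staleSnapshot = loadPlan();

  // 2. Meanwhile the completion handler for job-a records its result.
  const fresh = loadPlan();
  fresh.jobs["job-a"] = "completed";
  savePlan(fresh);

  // 3. The failure handler sets a checkpoint on its stale snapshot and writes
  //    the whole object back, which reverts job-a to "running".
  staleSnapshot.checkpoint = "on_error"; // assumed checkpoint value
  savePlan(staleSnapshot);

  return loadPlan();
}
```

After `reproduceLostUpdate()` the stored plan has the checkpoint set, but `job-a` reads `"running"` again even though its completion was saved in step 2.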

Closes #63

Changes

  • src/lib/plan-state.ts: Add updatePlanFields() — atomic read-modify-write for plan-level fields only (status, checkpoint, checkpointContext, completedAt, prUrl) inside the planMutex, preserving job states
  • src/lib/orchestrator.ts: Replace all savePlan(staleSnapshot) calls in setCheckpoint, clearCheckpoint, _doReconcile, and resumePlan with updatePlanFields. Add reconciliation safety net that cross-references jobs.json to detect plan jobs stuck as 'running' when they have already completed or failed
  • tests/lib/plan-state.test.ts: Add 5 tests for updatePlanFields (basic update, concurrent preservation, ID mismatch, no plan, checkpointContext)
  • tests/lib/orchestrator.test.ts: Add updatePlanFields mock to all beforeEach blocks. Add race condition regression test and safety net test
  • tests/lib/orchestrator-modes.test.ts: Add updatePlanFields mock to beforeEach block
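
For readers who want the shape of the fix, here is a minimal sketch of a field-scoped read-modify-write under a mutex. It is not the code from this PR: the `Mutex` class, the in-memory `disk` variable, the `Plan` shape, and returning `null` on a missing plan or ID mismatch are all assumptions made for illustration.

```typescript
// Hypothetical, simplified shapes; the real types in src/lib/plan-state.ts may differ.
type JobStatus = "running" | "completed" | "failed";

interface Plan {
  id: string;
  status: string;
  checkpoint?: string | null;
  checkpointContext?: unknown;
  completedAt?: string;
  prUrl?: string;
  jobs: Record<string, JobStatus>;
}

type PlanFields = Partial<
  Pick<Plan, "status" | "checkpoint" | "checkpointContext" | "completedAt" | "prUrl">
>;

// Minimal promise-chain mutex: each task starts only after the previous one settles.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(task: () => T | Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    this.tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  }
}

const planMutex = new Mutex();
let disk: Plan | null = null; // in-memory stand-in for the plan file

async function updatePlanFields(planId: string, fields: PlanFields): Promise<Plan | null> {
  return planMutex.runExclusive(() => {
    // Read the latest plan inside the lock, never a caller-supplied snapshot.
    if (disk === null || disk.id !== planId) return null; // assumed behaviour
    // Merge plan-level fields only; job states are carried over untouched.
    disk = { ...disk, ...fields, jobs: disk.jobs };
    return disk;
  });
}

async function updatePlanJob(planId: string, jobId: string, status: JobStatus): Promise<void> {
  await planMutex.runExclusive(() => {
    if (disk === null || disk.id !== planId) return;
    disk = { ...disk, jobs: { ...disk.jobs, [jobId]: status } };
  });
}

// Usage: a job update and a checkpoint update issued concurrently both survive.
async function demo(): Promise<Plan | null> {
  disk = { id: "plan-1", status: "running", jobs: { "job-a": "running", "job-b": "running" } };
  await Promise.all([
    updatePlanJob("plan-1", "job-a", "completed"),
    updatePlanFields("plan-1", { status: "paused", checkpoint: "on_error" }),
  ]);
  return disk;
}
```

Because both writers re-read the current plan inside the same lock, the order in which they run does not matter: the job update and the plan-level update are both present afterwards.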

Testing

  • bun run build passes
  • bun test passes (613/613)
  • Manual testing done (if applicable)

Notes

  • The root cause: handleJobFailed called loadPlan() outside the mutex and got a stale snapshot, then passed it to setCheckpoint(), which called savePlan(plan). That wrote the entire stale object back to disk and overwrote concurrent updatePlanJob() changes made by handleJobComplete for sibling jobs.
  • The safety net in _doReconcile is a defense-in-depth measure that detects jobs stuck as 'running' in the plan when jobs.json already shows them as completed/failed. This covers edge cases where the race condition may have already corrupted plan state before this fix was deployed.
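
The cross-reference described in the second note could look roughly like the sketch below. The record shapes and the `findStuckJobs` helper are hypothetical; the actual layout of jobs.json and the logic in _doReconcile are not shown in this PR description.

```typescript
// Hypothetical shapes; the real plan and jobs.json entries may carry more fields.
type JobStatus = "running" | "completed" | "failed";

interface PlanJob {
  id: string;
  status: JobStatus;
}

interface JobRecord {
  id: string;
  status: JobStatus; // assumed shape of an entry in jobs.json
}

interface StuckJob {
  id: string;
  actualStatus: "completed" | "failed";
}

// Return plan jobs that still say "running" although the job record has finished.
function findStuckJobs(planJobs: PlanJob[], jobRecords: JobRecord[]): StuckJob[] {
  const actual = new Map(jobRecords.map((r) => [r.id, r.status] as const));
  const stuck: StuckJob[] = [];
  for (const job of planJobs) {
    if (job.status !== "running") continue;
    const status = actual.get(job.id);
    // Jobs with no record, or a record that is still running, are left alone.
    if (status === "completed" || status === "failed") {
      stuck.push({ id: job.id, actualStatus: status });
    }
  }
  return stuck;
}
```

A reconcile pass could then apply each `actualStatus` back to the plan through the same mutex-protected job update path, so the repair itself cannot reintroduce the race.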

Replace savePlan(staleSnapshot) calls in setCheckpoint, clearCheckpoint,
_doReconcile, and resumePlan with atomic updatePlanFields that only
modifies plan-level fields (status, checkpoint, completedAt, prUrl)
without touching job states.

Add reconciliation safety net that cross-references jobs.json to detect
plan jobs stuck as 'running' when they have already completed or failed.

Closes #63
@nigel-dev nigel-dev merged commit 2869e81 into main Feb 14, 2026
4 checks passed
@nigel-dev nigel-dev deleted the fix/stale-plan-snapshot-race branch February 14, 2026 13:05

Development

Successfully merging this pull request may close these issues.

setCheckpoint overwrites concurrent job state updates — completed plan jobs stuck as running
