fix: prevent setCheckpoint from overwriting concurrent job state updates by nigel-dev · Pull Request #71 · nigel-dev/opencode-mission-control

nigel-dev · 2026-02-14T12:40:29Z

Summary

Fixes a race condition where setCheckpoint() in the orchestrator overwrites concurrent job state updates via savePlan() with a stale plan snapshot. This caused completed plan jobs to appear stuck as "running" when a sibling job fails.
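
For illustration, the lost update can be reproduced in a few lines. This is a simplified sketch, not the project's code: the `Plan` shape, the in-memory `disk` variable, and the `"on_error"` checkpoint value are assumptions made for the example. Only the names `loadPlan` and `savePlan` come from this PR.

```typescript
// Hypothetical, simplified shapes; the real types in plan-state.ts may differ.
type JobState = "running" | "completed" | "failed";
interface Plan {
  id: string;
  status: string;
  checkpoint?: string;
  jobs: Record<string, JobState>;
}

const clone = <T>(value: T): T => JSON.parse(JSON.stringify(value));

// In-memory stand-in for the plan file on disk.
let disk: Plan = { id: "p1", status: "running", jobs: { a: "running", b: "running" } };

const loadPlan = (): Plan => clone(disk);
const savePlan = (plan: Plan): void => {
  disk = clone(plan);
};

// 1. The failure handler for job "b" takes a snapshot of the plan.
const staleSnapshot = loadPlan();

// 2. Meanwhile, the completion handler for sibling job "a" records its result.
const fresh = loadPlan();
fresh.jobs.a = "completed";
savePlan(fresh);

// 3. The failure handler sets a checkpoint on its stale snapshot and writes
//    the whole object back, including the outdated job states.
staleSnapshot.checkpoint = "on_error"; // hypothetical checkpoint value
savePlan(staleSnapshot);

console.log(disk.jobs.a); // "running": the completion of job "a" was lost
```

Step 3 is the problematic write: it persists every field of the snapshot, not just the checkpoint.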

Closes #63

Changes

src/lib/plan-state.ts: Add updatePlanFields() — atomic read-modify-write for plan-level fields only (status, checkpoint, checkpointContext, completedAt, prUrl) inside the planMutex, preserving job states
src/lib/orchestrator.ts: Replace all savePlan(staleSnapshot) calls in setCheckpoint, clearCheckpoint, _doReconcile, and resumePlan with updatePlanFields. Add reconciliation safety net that cross-references jobs.json to detect plan jobs stuck as 'running' when they have already completed or failed
tests/lib/plan-state.test.ts: Add 5 tests for updatePlanFields (basic update, concurrent preservation, ID mismatch, no plan, checkpointContext)
tests/lib/orchestrator.test.ts: Add updatePlanFields mock to all beforeEach blocks. Add race condition regression test and safety net test
tests/lib/orchestrator-modes.test.ts: Add updatePlanFields mock to beforeEach block
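
As a rough sketch of the idea behind `updatePlanFields`, the following re-reads the current plan inside a mutex and overlays only plan-level fields, so job states written by other callers survive. Everything except the names `updatePlanFields`, `updatePlanJob`, and `planMutex` is an assumption: the `Mutex` class, the `Plan` shape, the in-memory `disk` store, and the choice to return `null` when there is no plan or the ID does not match are all illustrative, and the actual implementation may differ.

```typescript
// Hypothetical shapes and helpers; the real plan-state.ts may differ.
type JobState = "running" | "completed" | "failed";
interface Plan {
  id: string;
  status: string;
  checkpoint?: string | null;
  checkpointContext?: unknown;
  completedAt?: string;
  prUrl?: string;
  jobs: Record<string, JobState>;
}
type PlanFields = Partial<
  Pick<Plan, "status" | "checkpoint" | "checkpointContext" | "completedAt" | "prUrl">
>;

// Minimal promise-chain mutex: each caller runs after the previous one settles.
class Mutex {
  private tail: Promise<void> = Promise.resolve();
  runExclusive<T>(fn: () => Promise<T> | T): Promise<T> {
    const result = this.tail.then(() => fn());
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const planMutex = new Mutex();
const clone = <T>(value: T): T => JSON.parse(JSON.stringify(value));

// In-memory stand-in for the plan file on disk.
let disk: Plan | null = { id: "p1", status: "running", jobs: { a: "running", b: "running" } };

async function updatePlanFields(planId: string, fields: PlanFields): Promise<Plan | null> {
  return planMutex.runExclusive(() => {
    // Assumption: a missing plan or an ID mismatch is treated as a no-op.
    if (disk === null || disk.id !== planId) return null;
    // Read the current plan inside the lock, then overlay plan-level fields only.
    disk = { ...clone(disk), ...fields };
    return clone(disk);
  });
}

async function updatePlanJob(planId: string, jobId: string, state: JobState): Promise<void> {
  await planMutex.runExclusive(() => {
    if (disk === null || disk.id !== planId) return;
    disk.jobs[jobId] = state;
  });
}

// Concurrent job and checkpoint updates no longer clobber each other.
async function demo(): Promise<Plan | null> {
  await Promise.all([
    updatePlanJob("p1", "a", "completed"),
    updatePlanFields("p1", { checkpoint: "on_error" }), // hypothetical value
  ]);
  return disk;
}
```

Because the caller passes only the fields it wants to change rather than a whole plan object, there is no snapshot that can go stale between the read and the write.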

Testing

bun run build passes
bun test passes (613/613)
Manual testing done (if applicable)

Notes

The root cause was handleJobFailed calling loadPlan() outside the mutex, getting a stale snapshot, then passing it to setCheckpoint() which called savePlan(plan) — writing the entire stale object back to disk, overwriting concurrent updatePlanJob() changes from handleJobComplete for sibling jobs.
The safety net in _doReconcile is a defense-in-depth measure that detects jobs stuck as 'running' in the plan when jobs.json already shows them as completed/failed. This covers edge cases where the race condition may have already corrupted plan state before this fix was deployed.
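
The safety-net check could look roughly like the sketch below. It is a hypothetical, standalone version: `findStuckJobs`, the `PlanJob` and `JobRecord` shapes, and matching jobs by `name` are assumptions, not the code in `_doReconcile`. Mapping `stopped` to `failed` follows the second commit in this PR.

```typescript
// Hypothetical shapes; the real plan and jobs.json records may differ.
type PlanJobState = "queued" | "running" | "completed" | "failed";
type JobStatus = "running" | "completed" | "failed" | "stopped";

interface PlanJob {
  name: string;
  status: PlanJobState;
}
interface JobRecord {
  name: string;
  status: JobStatus;
}

// Returns corrections for plan jobs that are still 'running' in the plan
// although their jobs.json record has already reached a terminal status.
function findStuckJobs(planJobs: PlanJob[], jobRecords: JobRecord[]): PlanJob[] {
  const actualByName = new Map(jobRecords.map((j) => [j.name, j.status] as const));
  const fixes: PlanJob[] = [];
  for (const job of planJobs) {
    if (job.status !== "running") continue;
    const actual = actualByName.get(job.name);
    if (actual === "completed") {
      fixes.push({ name: job.name, status: "completed" });
    } else if (actual === "failed" || actual === "stopped") {
      // Stopped jobs are recovered as failed in plan state.
      fixes.push({ name: job.name, status: "failed" });
    }
  }
  return fixes;
}

const fixes = findStuckJobs(
  [
    { name: "a", status: "running" },
    { name: "b", status: "running" },
    { name: "c", status: "running" },
  ],
  [
    { name: "a", status: "completed" },
    { name: "b", status: "stopped" },
    { name: "c", status: "running" },
  ],
);
console.log(fixes);
```

Jobs that are genuinely still running, or that have no record in `jobs.json`, are left alone in this sketch.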

Replace savePlan(staleSnapshot) calls in setCheckpoint, clearCheckpoint, _doReconcile, and resumePlan with atomic updatePlanFields that only modifies plan-level fields (status, checkpoint, completedAt, prUrl) without touching job states. Add reconciliation safety net that cross-references jobs.json to detect plan jobs stuck as 'running' when they have already completed or failed. Closes #63

nigel-dev added 2 commits February 14, 2026 06:39

fix: extend safety net to recover stopped jobs as failed in plan state

0bef593

nigel-dev merged commit 2869e81 into main Feb 14, 2026
4 checks passed

nigel-dev deleted the fix/stale-plan-snapshot-race branch February 14, 2026 13:05

nigel-dev merged 2 commits into main from fix/stale-plan-snapshot-race
