Fix race in CloseResults that drops test failure events#3512
Fix race in CloseResults that drops test failure events#3512sean- wants to merge 4 commits intothought-machine:masterfrom
Conversation
ab59fcd to
f641f5b
Compare
|
The |
src/core/state.go
Outdated
| // drainResults waits until all pending results in the internal channel have | ||
| // been forwarded to the external results channel. | ||
| func (state *BuildState) drainResults() { | ||
| for len(state.progress.internalResults) > 0 { |
There was a problem hiding this comment.
I am concerned that this could lead to a deadlocked process - would it make sense to add a watchdog timer here to make it abandon this attempt to clean up?
There was a problem hiding this comment.
Yeah, the spin loop was racy in a second, subtler way too: even after len(internalResults) hit 0, forwardResults could have read a result but not yet forwarded it under the mutex, so CloseResults could still nil out the channel and drop it.
Rather than adding a watchdog timer (which would mask the problem rather than fix it), I've reworked this to use context-based shutdown:
stateProgressnow has aworkerCtx(cancelled inStop()) and adispatcherCtx(cancelled inCloseResults())forwardResultsselects onctx.Done()at every blocking point. When cancelled, it drains remaining items via a non-blocking select loop, then closes the results channel itselfCloseResultsjust cancels the dispatcher context and waits on aWaitGroupforforwardResultsto finish
This eliminates the spin loop, the recover/panic pattern, and the race window entirely. The shutdown ordering is: workers finish -> wg.Wait() in plz.go -> CloseResults() cancels dispatcher -> dispatcher drains and closes.
One thing I'd like to follow up on separately: NewBuildState() currently creates its own context.Background() internally. Ideally it would accept a context.Context parameter so callers can propagate cancellation from signal handlers, but that's a larger change I'd rather do in a follow-up.
f641f5b to
9cef345
Compare
CloseResults could close the external results channel while forwardResults still had pending items to forward from the internal channel. This caused a "send on closed channel" panic that was silently recovered, dropping the result. On FreeBSD CI this manifested as shell_output_test failing because the test failure event was lost, leaving failedTargets empty and suppressing the detailed error output. Replace the recover/panic pattern with structured context-based shutdown. A dispatcher context signals forwardResults to drain remaining results and close the external channel itself. CloseResults cancels the context and waits on a WaitGroup for the drain to complete, eliminating both the spin loop and the race. Add WorkerContext() for build/test/parse workers to use for shutdown signalling (threaded through in a follow-up commit).
9cef345 to
45e4613
Compare
Pass the worker context from BuildState.WorkerContext() through to the process executor calls in build.Build(), test.Test(), and parse.Parse(). This enables cancellation to propagate from Stop() through to running subprocesses. ExecWithTimeoutShell and ExecWithTimeoutShellStdStreams now accept a context.Context parameter. The exec package uses context.TODO() as a placeholder until it is wired to signal handling.
making the dependency on a parent context explicit at every call site. This sets up future signal-based graceful shutdown. Production callers use context.TODO() until wired to signal handlers. Test helpers accept *testing.T and use t.Context(), calling t.Helper() for correct error attribution.
CloseResults could close the external results channel while forwardResults still had pending items to forward from the internal channel. This caused a
send on closed channelpanic that was silently recovered, dropping the result. On FreeBSD CI this manifested asshell_output_testfailing because the test failure event was lost, leavingfailedTargetsempty and suppressing the detailed error output.Fix by draining the internal channel before closing, and setting the external channel to nil after close so late arrivals are skipped rather than panicking.