Fix race in CloseResults that drops test failure events #3512

Open
sean- wants to merge 4 commits into thought-machine:master from sean-:fix-close-results-race

Conversation

sean- (Contributor) commented Mar 26, 2026

CloseResults could close the external results channel while forwardResults still had pending items to forward from the internal channel. This caused a "send on closed channel" panic that was silently recovered, dropping the result. On FreeBSD CI this manifested as shell_output_test failing because the test failure event was lost, leaving failedTargets empty and suppressing the detailed error output.

Fix this by draining the internal channel before closing it, and by setting the external channel to nil after close so late arrivals are skipped rather than panicking.

@sean- sean- force-pushed the fix-close-results-race branch 2 times, most recently from ab59fcd to f641f5b Compare March 26, 2026 16:10
sean- (Contributor, Author) commented Mar 26, 2026

The build-linux-alt failure is likely due to #3510 not being merged yet (which has the fix for the rate limit).

Comment on lines +444 to +447
// drainResults waits until all pending results in the internal channel have
// been forwarded to the external results channel.
func (state *BuildState) drainResults() {
for len(state.progress.internalResults) > 0 {
Contributor commented:

I am concerned that this could lead to a deadlocked process - would it make sense to add a watchdog timer here to make it abandon this attempt to clean up?

sean- (Contributor, Author) replied:

Yeah, the spin loop was racy in a second, subtler way too: even after len(internalResults) hit 0, forwardResults could have read a result but not yet forwarded it under the mutex, so CloseResults could still nil out the channel and drop it.

Rather than adding a watchdog timer (which would mask the problem rather than fix it), I've reworked this to use context-based shutdown:

  • stateProgress now has a workerCtx (cancelled in Stop()) and a dispatcherCtx (cancelled in CloseResults())
  • forwardResults selects on ctx.Done() at every blocking point. When cancelled, it drains remaining items via a non-blocking select loop, then closes the results channel itself
  • CloseResults just cancels the dispatcher context and waits on a WaitGroup for forwardResults to finish

This eliminates the spin loop, the recover/panic pattern, and the race window entirely. The shutdown ordering is: workers finish -> wg.Wait() in plz.go -> CloseResults() cancels dispatcher -> dispatcher drains and closes.

One thing I'd like to follow up on separately: NewBuildState() currently creates its own context.Background() internally. Ideally it would accept a context.Context parameter so callers can propagate cancellation from signal handlers, but that's a larger change I'd rather do in a follow-up.

@sean- sean- force-pushed the fix-close-results-race branch from f641f5b to 9cef345 Compare April 1, 2026 19:16
CloseResults could close the external results channel while
forwardResults still had pending items to forward from the internal
channel. This caused a "send on closed channel" panic that was silently
recovered, dropping the result. On FreeBSD CI this manifested as
shell_output_test failing because the test failure event was lost,
leaving failedTargets empty and suppressing the detailed error output.

Replace the recover/panic pattern with structured context-based
shutdown. A dispatcher context signals forwardResults to drain
remaining results and close the external channel itself. CloseResults
cancels the context and waits on a WaitGroup for the drain to
complete, eliminating both the spin loop and the race.

Add WorkerContext() for build/test/parse workers to use for shutdown
signalling (threaded through in a follow-up commit).
@sean- sean- force-pushed the fix-close-results-race branch from 9cef345 to 45e4613 Compare April 1, 2026 19:43
sean- added 3 commits April 1, 2026 13:06
Pass the worker context from BuildState.WorkerContext() through to
the process executor calls in build.Build(), test.Test(), and
parse.Parse(). This enables cancellation to propagate from Stop()
through to running subprocesses.

ExecWithTimeoutShell and ExecWithTimeoutShellStdStreams now accept a
context.Context parameter, making the dependency on a parent context
explicit at every call site. The exec package uses context.TODO() as
a placeholder until it is wired to signal handling. This sets up
future signal-based graceful shutdown.

Production callers use context.TODO() until wired to signal handlers.
Test helpers accept *testing.T and use t.Context(), calling t.Helper()
for correct error attribution.
@sean- sean- requested a review from toastwaffle April 1, 2026 20:46