Fix race in CloseResults that drops test failure events #3512

Open
sean- wants to merge 4 commits into thought-machine:master from sean-:fix-close-results-race

Conversation

sean- (Contributor) commented Mar 26, 2026

CloseResults could close the external results channel while forwardResults still had pending items to forward from the internal channel. This caused a "send on closed channel" panic that was silently recovered, dropping the result. On FreeBSD CI this manifested as shell_output_test failing because the test failure event was lost, leaving failedTargets empty and suppressing the detailed error output.

Fix this by draining the internal channel before closing it, and by setting the external channel to nil after close so late arrivals are skipped rather than panicking.

@sean- sean- force-pushed the fix-close-results-race branch 2 times, most recently from ab59fcd to f641f5b Compare March 26, 2026 16:10
sean- (Contributor, Author) commented Mar 26, 2026

The build-linux-alt failure is likely due to #3510 not being merged yet (which has the fix for the rate limit).

Comment on lines +444 to +447
// drainResults waits until all pending results in the internal channel have
// been forwarded to the external results channel.
func (state *BuildState) drainResults() {
for len(state.progress.internalResults) > 0 {
Contributor commented:

I am concerned that this could lead to a deadlocked process - would it make sense to add a watchdog timer here to make it abandon this attempt to clean up?

sean- (Contributor, Author) replied:

Yeah, the spin loop was racy in a second, subtler way too: even after len(internalResults) hit 0, forwardResults could have read a result but not yet forwarded it under the mutex, so CloseResults could still nil out the channel and drop it.

Rather than adding a watchdog timer (which would mask the problem rather than fix it), I've reworked this to use context-based shutdown:

  • stateProgress now has a workerCtx (cancelled in Stop()) and a dispatcherCtx (cancelled in CloseResults())
  • forwardResults selects on ctx.Done() at every blocking point. When cancelled, it drains remaining items via a non-blocking select loop, then closes the results channel itself
  • CloseResults just cancels the dispatcher context and waits on a WaitGroup for forwardResults to finish

This eliminates the spin loop, the recover/panic pattern, and the race window entirely. The shutdown ordering is: workers finish -> wg.Wait() in plz.go -> CloseResults() cancels dispatcher -> dispatcher drains and closes.

One thing I'd like to follow up on separately: NewBuildState() currently creates its own context.Background() internally. Ideally it would accept a context.Context parameter so callers can propagate cancellation from signal handlers, but that's a larger change I'd rather do in a follow-up.

@sean- sean- force-pushed the fix-close-results-race branch from f641f5b to 9cef345 Compare April 1, 2026 19:16
CloseResults could close the external results channel while
forwardResults still had pending items to forward from the internal
channel. This caused a "send on closed channel" panic that was silently
recovered, dropping the result. On FreeBSD CI this manifested as
shell_output_test failing because the test failure event was lost,
leaving failedTargets empty and suppressing the detailed error output.

Replace the recover/panic pattern with structured context-based
shutdown. A dispatcher context signals forwardResults to drain
remaining results and close the external channel itself. CloseResults
cancels the context and waits on a WaitGroup for the drain to
complete, eliminating both the spin loop and the race.

Add WorkerContext() for build/test/parse workers to use for shutdown
signalling (threaded through in a follow-up commit).
@sean- sean- force-pushed the fix-close-results-race branch from 9cef345 to 45e4613 Compare April 1, 2026 19:43
sean- added 3 commits April 1, 2026 13:06
Pass the worker context from BuildState.WorkerContext() through to
the process executor calls in build.Build(), test.Test(), and
parse.Parse(). This enables cancellation to propagate from Stop()
through to running subprocesses.

ExecWithTimeoutShell and ExecWithTimeoutShellStdStreams now accept a
context.Context parameter, making the dependency on a parent context
explicit at every call site. The exec package uses context.TODO() as
a placeholder until it is wired to signal handling. This sets up
future signal-based graceful shutdown.

Production callers use context.TODO() until wired to signal handlers.
Test helpers accept *testing.T and use t.Context(), calling t.Helper()
for correct error attribution.
@sean- sean- requested a review from toastwaffle April 1, 2026 20:46