Retry request on WriteFailedException instead of returning 500 by sid-stripe · Pull Request #2081 · brefphp/bref

sid-stripe · 2026-03-04T11:33:42Z

Related to #2077

Summary

When PHP-FPM's Unix socket is broken between Lambda invocations (e.g. the FPM worker died while the master is still alive), sendAsyncRequest throws a WriteFailedException (broken pipe). The current behavior restarts FPM but still throws FastCgiCommunicationFailed, returning a 500 to the caller even though FPM was just successfully restarted and is ready to serve.

This PR retries the request once against the freshly restarted FPM process on WriteFailedException, turning a guaranteed 500 into a transparent recovery.

Why this is safe

A WriteFailedException means the FastCGI request could not be written to the socket, so the request never reached PHP-FPM. Since no application code executed, retrying cannot cause double-execution.

Other exceptions (e.g. ReadFailedException) are not retried because the request may have already been processed by PHP, and retrying could cause double-execution. The existing behavior (log, restart FPM, throw FastCgiCommunicationFailed) is preserved for those cases.

Background

Key findings from production diagnostics for #2077:

The FPM master is alive when the broken pipe occurs — proc_get_status() shows running=true, signaled=false, exitcode=-1. The master is not crashing.
The FPM worker or socket is dead. The write fails because the reader end (worker) has closed its end of the Unix socket, likely due to the worker dying between invocations.
The error is transient. After Bref restarts FPM, the very next request to the same Lambda instance succeeds. The current code already does the right thing (restart FPM) but then discards the recovery by throwing FastCgiCommunicationFailed.
The pattern: Request N completes normally → ~6ms later request N+1 arrives → WriteFailedException immediately → Bref restarts FPM → current code throws 500 → next request works fine.
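For readers unfamiliar with the proc_get_status() values quoted in the first finding, here is a minimal sketch of what they look like for a live child process. It uses a sleeping PHP CLI child as a hypothetical stand-in for the FPM master; it is not Bref's diagnostic code, and it only shows that running=true, signaled=false, exitcode=-1 is what PHP reports for a process that has not exited.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the FPM master: a PHP CLI child that just sleeps.
$descriptors = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
$process = proc_open([PHP_BINARY, '-r', 'sleep(30);'], $descriptors, $pipes);
if (!is_resource($process)) {
    throw new RuntimeException('Could not start the child process');
}

// While the child is alive, no exit code is available yet, so exitcode is -1.
$status = proc_get_status($process);
printf(
    "running=%s signaled=%s exitcode=%d\n",
    var_export($status['running'], true),
    var_export($status['signaled'], true),
    $status['exitcode']
);

// Clean up the stand-in process.
foreach ($pipes as $pipe) {
    fclose($pipe);
}
proc_terminate($process);
proc_close($process);
```

Note that this status only describes the process PHP started (the master). It says nothing about the state of FPM workers or of the Unix socket, which matches the second finding.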

The root cause of the worker death is still under investigation (Lambda freeze/thaw, socket FD corruption, etc.), but regardless of cause, retrying on write failure is a safe mitigation that eliminates unnecessary 500s.

What changed

In the catch (Throwable $e) block of FpmHandler::handleRequest():

FPM is still restarted (stop() + start()) on all errors — no change
New: If the exception is WriteFailedException, retry the request once against the fresh FPM
If the retry also fails, log both exceptions and throw FastCgiCommunicationFailed as before
For all other exceptions, behavior is unchanged (log + throw)
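The control flow in the list above can be sketched roughly as follows. This is a simplified model, not the actual diff: the exception classes are local stand-ins that only mimic the real names, handleWithSingleRetry() is a hypothetical function, and the FastCGI request, stop(), start() and logging are injected as callables so the sketch runs without Bref or the FastCGI client installed.

```php
<?php
declare(strict_types=1);

// Local stand-ins so the sketch is self-contained. The real classes live in
// Bref and the FastCGI client; these only mimic their names.
class WriteFailedException extends RuntimeException {}
class ReadFailedException extends RuntimeException {}
class FastCgiCommunicationFailed extends RuntimeException {}

/**
 * Hypothetical model of the catch block: restart on every error,
 * retry once only when the first failure was a write failure.
 */
function handleWithSingleRetry(callable $send, callable $stop, callable $start, callable $log): string
{
    try {
        return $send();
    } catch (Throwable $first) {
        // FPM is restarted on all errors, as before.
        $stop();
        $start();

        if (!($first instanceof WriteFailedException)) {
            // Unchanged path: the request may have run, so do not retry.
            $log($first->getMessage());
            throw new FastCgiCommunicationFailed('FPM communication failed', 0, $first);
        }

        try {
            // New path: one retry against the freshly started FPM.
            return $send();
        } catch (Throwable $second) {
            // Retry failed too: log both exceptions, then fail as before.
            $log($first->getMessage());
            $log($second->getMessage());
            throw new FastCgiCommunicationFailed('FPM communication failed after retry', 0, $second);
        }
    }
}

// Usage: the first attempt hits a broken pipe, the second one succeeds.
$events = [];
$attempts = 0;
$result = handleWithSingleRetry(
    function () use (&$attempts, &$events): string {
        $attempts++;
        $events[] = "send#$attempts";
        if ($attempts === 1) {
            throw new WriteFailedException('Broken pipe');
        }
        return 'HTTP 200';
    },
    function () use (&$events): void { $events[] = 'stop'; },
    function () use (&$events): void { $events[] = 'start'; },
    function (string $message) use (&$events): void { $events[] = "log: $message"; }
);

echo $result, "\n";
echo implode(' -> ', $events), "\n";
```

Whether the real patch restarts FPM again after a failed retry is not stated in the description, so the sketch leaves that out.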

We've been running this patch via cweagans/composer-patches in production. Broken pipe errors that previously resulted in 500s are now transparently recovered, with no double-execution or other side effects observed.

Happy to hear your thoughts! We thought we would submit a PR since all the retries so far have been successful.


GrahamCampbell

Are you sure that this only occurs in that situation, and not after being able to write something, and then not being able to write something?

sid-stripe · 2026-03-04T12:17:36Z

Are you sure that this only occurs in that situation, and not after being able to write something, and then not being able to write something?

Good question: I'm not 100% certain a partial write can't happen. But I reckon the retry is still safe regardless, because of the sequence of operations:

WriteFailedException is thrown
$this->stop() kills the FPM process that may have received partial data
$this->start() spawns a completely new FPM process
The retry goes to the new process, which has no knowledge of any previous partial request

So even in the partial-write case, the old FPM process is destroyed before the retry happens. The new process starts clean.
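As a rough illustration of steps 2 to 4, the sketch below terminates one child process and starts another, then compares their PIDs. The sleeping PHP CLI child and the helpers spawnStandIn() and stopStandIn() are hypothetical stand-ins for FPM and for Bref's stop()/start(). It only shows that the retry would target a different process; it does not show whether the old process executed anything before it was stopped.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: spawn a sleeping PHP CLI child as a stand-in for FPM. */
function spawnStandIn(): array
{
    $descriptors = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
    $process = proc_open([PHP_BINARY, '-r', 'sleep(30);'], $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException('Could not start the child process');
    }
    return [$process, $pipes];
}

/** Hypothetical helper: terminate the stand-in and wait until it is gone. */
function stopStandIn($process, array $pipes): void
{
    foreach ($pipes as $pipe) {
        fclose($pipe);
    }
    proc_terminate($process);
    proc_close($process); // blocks until the child has exited
}

[$old, $oldPipes] = spawnStandIn();
$oldPid = proc_get_status($old)['pid'];

stopStandIn($old, $oldPipes);        // step 2: the old process is destroyed
[$new, $newPipes] = spawnStandIn();  // step 3: a new process is spawned
$newPid = proc_get_status($new)['pid'];

// Step 4 would send the retry to the new process only.
$pidsDiffer = $oldPid !== $newPid;
echo $pidsDiffer ? "retry targets a new process\n" : "unexpected PID reuse\n";

stopStandIn($new, $newPipes);
```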

mnapoli · 2026-03-04T16:42:26Z

This is very interesting, thank you for opening this.

To piggyback on Graham's comment, just to be sure:

WriteFailedException is thrown

$this->stop() kills the FPM process that may have received partial data

Could there be any possibility that FPM actually executed anything in the PHP worker between these two steps (thus making the retry not idempotent)?

mnapoli · 2026-03-04T20:53:07Z

To follow up, it might be worth looking into https://github.com/hollodotme/fast-cgi-client (the FastCGI client we use) and maybe the FPM implementation to figure out if PHP scripts could start running before everything is written in the FastCGI request?
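For context on that question: per the FastCGI specification, a responder request is written as several records (BEGIN_REQUEST, PARAMS, an empty PARAMS terminator, STDIN, an empty STDIN terminator), so a write can in principle fail after some records were already sent. The sketch below encodes those records by hand to make the framing concrete. The function names are hypothetical, it assumes a non-empty body and parameters under 64 KiB, and it does not reflect how the client batches its writes or when PHP-FPM starts executing, which is exactly what remains to be checked.

```php
<?php
declare(strict_types=1);

// Record types and role taken from the FastCGI specification.
const FCGI_VERSION_1     = 1;
const FCGI_BEGIN_REQUEST = 1;
const FCGI_PARAMS        = 4;
const FCGI_STDIN         = 5;
const FCGI_RESPONDER     = 1;

/** Encode one record: 8-byte header followed by the content (no padding). */
function fcgiRecord(int $type, int $requestId, string $content): string
{
    // version, type, requestId (16-bit BE), contentLength (16-bit BE), paddingLength, reserved
    return pack('CCnnCC', FCGI_VERSION_1, $type, $requestId, strlen($content), 0, 0) . $content;
}

/** Encode one name-value pair: 1-byte lengths below 128, else 4 bytes with the high bit set. */
function fcgiPair(string $name, string $value): string
{
    $len = static fn (int $n): string => $n < 128 ? chr($n) : pack('N', $n | 0x80000000);
    return $len(strlen($name)) . $len(strlen($value)) . $name . $value;
}

/** Hypothetical helper: the records for one request, in the order they are sent. */
function fcgiRequestRecords(int $requestId, array $params, string $body): array
{
    $encodedParams = '';
    foreach ($params as $name => $value) {
        $encodedParams .= fcgiPair((string) $name, (string) $value);
    }
    return [
        // BEGIN_REQUEST body: role (16-bit BE), flags, 5 reserved bytes
        'BEGIN_REQUEST' => fcgiRecord(FCGI_BEGIN_REQUEST, $requestId, pack('nCx5', FCGI_RESPONDER, 0)),
        'PARAMS'        => fcgiRecord(FCGI_PARAMS, $requestId, $encodedParams),
        'PARAMS (end)'  => fcgiRecord(FCGI_PARAMS, $requestId, ''),
        'STDIN'         => fcgiRecord(FCGI_STDIN, $requestId, $body),
        'STDIN (end)'   => fcgiRecord(FCGI_STDIN, $requestId, ''),
    ];
}

// Illustrative values only; the script path is made up.
$records = fcgiRequestRecords(
    1,
    ['REQUEST_METHOD' => 'POST', 'SCRIPT_FILENAME' => '/var/task/index.php'],
    'a=1'
);
foreach ($records as $label => $bytes) {
    printf("%-14s %d bytes\n", $label, strlen($bytes));
}
```

If the socket breaks after BEGIN_REQUEST and PARAMS but before the final STDIN terminator, the server has seen part of the request. Whether PHP-FPM can begin running the script at that point is the open question here.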

sid-stripe · 2026-03-05T05:20:16Z

Could there be any possibility that FPM actually executed anything in the PHP worker between these two steps (thus making the retry not idempotent)?

To follow up, it might be worth looking into https://github.com/hollodotme/fast-cgi-client (the FastCGI client we use) and maybe the FPM implementation to figure out if PHP scripts could start running before everything is written in the FastCGI request?

Thank you, I will look into those.

Commit a03893e: Retry request on WriteFailedException instead of returning 500 (Committed-By-Agent: claude)

sid-stripe mentioned this pull request Mar 4, 2026

Silent FPM master death causing persistent broken pipe errors (~1-2% of requests) #2077

Open

GrahamCampbell reviewed Mar 4, 2026

sid-stripe wants to merge 1 commit into brefphp:master from sid-stripe:fix/retry-on-broken-pipe
