Retry request on WriteFailedException instead of returning 500#2081

Draft
sid-stripe wants to merge 1 commit into brefphp:master from sid-stripe:fix/retry-on-broken-pipe

Conversation

@sid-stripe

Related to #2077

Summary

When PHP-FPM's Unix socket is broken between Lambda invocations (e.g. the FPM worker died while the master is still alive), sendAsyncRequest throws a WriteFailedException (broken pipe). The current behavior restarts FPM but still throws FastCgiCommunicationFailed, returning a 500 to the caller even though FPM was just successfully restarted and is ready to serve.

This PR retries the request once against the freshly restarted FPM process on WriteFailedException, turning a guaranteed 500 into a transparent recovery.

Why this is safe

A WriteFailedException means the FastCGI request could not be written to the socket, so the request never reached PHP-FPM. Since no application code executed, retrying cannot cause double-execution.

Other exceptions (e.g. ReadFailedException) are not retried because the request may have already been processed by PHP, and retrying could cause double-execution. The existing behavior (log, restart FPM, throw FastCgiCommunicationFailed) is preserved for those cases.
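
As a minimal sketch of that rule (not the code in this PR), the retry decision can be reduced to a type check. The two exception classes below are local stand-ins named after the FastCGI client's exceptions so the snippet runs on its own, and `isSafeToRetry()` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

// Local stand-ins for the FastCGI client's exceptions (assumption: only the
// class names matter for this illustration).
final class WriteFailedException extends RuntimeException {}
final class ReadFailedException extends RuntimeException {}

/**
 * Hypothetical helper: only a failed write is treated as retryable, because
 * any other failure may happen after PHP has already started processing.
 */
function isSafeToRetry(Throwable $e): bool
{
    return $e instanceof WriteFailedException;
}

$retryWrite = isSafeToRetry(new WriteFailedException('broken pipe'));
$retryRead  = isSafeToRetry(new ReadFailedException('read timed out'));
```

Anything that is not a `WriteFailedException` falls through to the existing log, restart and throw path.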

Background

Key findings from production diagnostics for #2077:

  • The FPM master is alive when the broken pipe occurs — proc_get_status() shows running=true, signaled=false, exitcode=-1. The master is not crashing.
  • The FPM worker or socket is dead. The write fails because the reader end (worker) has closed its end of the Unix socket, likely due to the worker dying between invocations.
  • The error is transient. After Bref restarts FPM, the very next request to the same Lambda instance succeeds. The current code already does the right thing (restart FPM) but then discards the recovery by throwing FastCgiCommunicationFailed.
  • The pattern: Request N completes normally → ~6ms later request N+1 arrives → WriteFailedException immediately → Bref restarts FPM → current code throws 500 → next request works fine.
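
For readers unfamiliar with the diagnostic in the first bullet, here is a small sketch of what `proc_get_status()` reports for a process that is still alive. A short-lived PHP child stands in for the FPM master, and `describeProcess()` is a hypothetical helper; this is not how Bref spawns FPM.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pick out the three proc_get_status() fields quoted above.
 *
 * @param resource $process handle returned by proc_open()
 */
function describeProcess($process): array
{
    $status = proc_get_status($process);

    return [
        'running'  => $status['running'],
        'signaled' => $status['signaled'],
        'exitcode' => $status['exitcode'],
    ];
}

// A sleeping PHP child plays the role of the long-lived FPM master.
$descriptors = [0 => ['pipe', 'r'], 1 => ['pipe', 'w'], 2 => ['pipe', 'w']];
$process = proc_open([PHP_BINARY, '-r', 'sleep(5);'], $descriptors, $pipes);
if ($process === false) {
    throw new RuntimeException('Could not start the stand-in process');
}

// While the child is alive the exit code is not known yet and is reported as -1.
$whileAlive = describeProcess($process);

// Clean up the stand-in process.
foreach ($pipes as $pipe) {
    fclose($pipe);
}
proc_terminate($process);
proc_close($process);
```

This status describes only the process that was opened directly, so a healthy result here says nothing about the workers it forks or about the socket, which matches the findings above.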

The root cause of the worker death is still under investigation (Lambda freeze/thaw, socket FD corruption, etc.). Regardless of the cause, retrying on write failure is a safe mitigation that eliminates unnecessary 500s.

What changed

In the catch (Throwable $e) block of FpmHandler::handleRequest():

  1. FPM is still restarted (stop() + start()) on all errors — no change
  2. New: If the exception is WriteFailedException, retry the request once against the fresh FPM
  3. If the retry also fails, log both exceptions and throw FastCgiCommunicationFailed as before
  4. For all other exceptions, behavior is unchanged (log + throw)
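
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the diff itself: the exception classes are local stand-ins, and `handleWithSingleRetry()` with its `$send`, `$restart` and `$log` callables is a hypothetical shape chosen so the flow runs without FPM.

```php
<?php
declare(strict_types=1);

// Local stand-ins so the sketch is self-contained.
final class WriteFailedException extends RuntimeException {}
final class FastCgiCommunicationFailed extends RuntimeException {}

/**
 * Hypothetical wrapper around the request/restart/retry flow.
 *
 * @param callable(): string     $send    sends the request to FPM
 * @param callable(): void       $restart stops and starts FPM
 * @param callable(string): void $log     records an error message
 */
function handleWithSingleRetry(callable $send, callable $restart, callable $log): string
{
    try {
        return $send();
    } catch (Throwable $first) {
        // Step 1: FPM is restarted on every error.
        $restart();

        if (! $first instanceof WriteFailedException) {
            // Step 4: any other failure keeps the log-and-throw behavior.
            $log($first->getMessage());
            throw new FastCgiCommunicationFailed('Error communicating with PHP-FPM', 0, $first);
        }

        try {
            // Step 2: one retry against the freshly started FPM.
            return $send();
        } catch (Throwable $second) {
            // Step 3: the retry failed too, so log both and give up.
            $log($first->getMessage());
            $log($second->getMessage());
            throw new FastCgiCommunicationFailed('Error communicating with PHP-FPM', 0, $second);
        }
    }
}

// Usage: the first write hits a broken pipe, the retry succeeds.
$attempts = 0;
$restarts = 0;
$response = handleWithSingleRetry(
    function () use (&$attempts): string {
        $attempts++;
        if ($attempts === 1) {
            throw new WriteFailedException('Broken pipe');
        }
        return 'ok';
    },
    function () use (&$restarts): void {
        $restarts++;
    },
    function (string $message): void {
        error_log($message);
    }
);
```

The retry is deliberately not a loop: a second failure surfaces as `FastCgiCommunicationFailed`, so a persistently broken FPM still produces an error instead of spinning.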

We've been running this patch via cweagans/composer-patches in production. Broken pipe errors that previously resulted in 500s are now transparently recovered, with no double-execution or other side effects observed.

Happy to hear your thoughts! We thought we would submit a PR since every retry so far has succeeded.

Contributor

@GrahamCampbell left a comment

Are you sure that this only occurs in that situation, and not after being able to write something, and then not being able to write something?

@sid-stripe
Author

Are you sure that this only occurs in that situation, and not after being able to write something, and then not being able to write something?

Good question: I'm not 100% certain a partial write can't happen. But I reckon the retry is still safe regardless, because of the sequence of operations:

  1. WriteFailedException is thrown
  2. $this->stop() kills the FPM process that may have received partial data
  3. $this->start() spawns a completely new FPM process
  4. The retry goes to the new process, which has no knowledge of any previous partial request

So even in the partial-write case, the old FPM process is destroyed before the retry happens. The new process starts clean.

@mnapoli
Member

mnapoli commented Mar 4, 2026

This is very interesting, thank you for opening this.

To piggyback on Graham's comment, just to be sure:

  1. WriteFailedException is thrown
  2. $this->stop() kills the FPM process that may have received partial data

Is there any possibility that FPM actually executed anything in the PHP worker between these two steps, which would make the retry non-idempotent?

@mnapoli
Member

mnapoli commented Mar 4, 2026

To follow up, it might be worth looking into https://github.com/hollodotme/fast-cgi-client (the FastCGI client we use) and maybe the FPM implementation to figure out if PHP scripts could start running before everything is written in the FastCGI request?
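
For context on why that question is open, here is a rough sketch of how the FastCGI protocol frames a single request as several records. It encodes only short name/value pairs and small bodies, and the function names are hypothetical. It says nothing about how the fast-cgi-client library batches its writes or when PHP-FPM begins executing a script, which is what remains to be checked.

```php
<?php
declare(strict_types=1);

// Record types and role from the FastCGI specification.
const FCGI_VERSION_1 = 1;
const FCGI_BEGIN_REQUEST = 1;
const FCGI_PARAMS = 4;
const FCGI_STDIN = 5;
const FCGI_RESPONDER = 1;

/** Builds one record: 8-byte header followed by the content (no padding). */
function fcgiRecord(int $type, int $requestId, string $content): string
{
    // Header: version, type, requestId (16-bit), contentLength (16-bit),
    // paddingLength, reserved. Content must stay below 65536 bytes.
    return pack('CCnnCC', FCGI_VERSION_1, $type, $requestId, strlen($content), 0, 0) . $content;
}

/** Encodes one name/value pair; short form only (both under 128 bytes). */
function fcgiNameValue(string $name, string $value): string
{
    return chr(strlen($name)) . chr(strlen($value)) . $name . $value;
}

/** Returns the ordered list of records that make up one request. */
function fcgiRequestRecords(int $requestId, array $params, string $body): array
{
    $encodedParams = '';
    foreach ($params as $name => $value) {
        $encodedParams .= fcgiNameValue((string) $name, (string) $value);
    }

    // BEGIN_REQUEST body: role (16-bit), flags, then 5 reserved bytes.
    $records = [fcgiRecord(FCGI_BEGIN_REQUEST, $requestId, pack('nCx5', FCGI_RESPONDER, 0))];

    if ($encodedParams !== '') {
        $records[] = fcgiRecord(FCGI_PARAMS, $requestId, $encodedParams);
    }
    // An empty PARAMS record closes the params stream.
    $records[] = fcgiRecord(FCGI_PARAMS, $requestId, '');

    if ($body !== '') {
        $records[] = fcgiRecord(FCGI_STDIN, $requestId, $body);
    }
    // An empty STDIN record closes the request body.
    $records[] = fcgiRecord(FCGI_STDIN, $requestId, '');

    return $records;
}

$records = fcgiRequestRecords(1, ['REQUEST_METHOD' => 'GET'], '');
```

Because the request is a sequence of records rather than a single unit, a write could in principle fail after some records were delivered, which is the scenario the two questions above are probing.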

@sid-stripe
Author

Is there any possibility that FPM actually executed anything in the PHP worker between these two steps, which would make the retry non-idempotent?

To follow up, it might be worth looking into https://github.com/hollodotme/fast-cgi-client (the FastCGI client we use) and maybe the FPM implementation to figure out if PHP scripts could start running before everything is written in the FastCGI request?

Thank you, I will look into those.
