Retry request on WriteFailedException instead of returning 500#2081
Retry request on WriteFailedException instead of returning 500#2081sid-stripe wants to merge 1 commit intobrefphp:masterfrom
Conversation
500 Committed-By-Agent: claude
GrahamCampbell
left a comment
There was a problem hiding this comment.
Are you sure that this only occurs in that situation, and not after being able to write something, and then not being able to write something?
Good question: I'm not 100% certain a partial write can't happen. But I reckon the retry is still safe regardless, because of the sequence of operations:
So even in the partial-write case, the old FPM process is destroyed before the retry happens. The new process starts clean. |
|
This is very interesting, thank you for opening this. To piggyback on Graham's comment, just to be sure:
Could there be any possibility that FPM actually executed anything in the PHP worker between these two steps (and thus making the retry not idempotent). |
|
To follow up, it might be worth looking into https://github.com/hollodotme/fast-cgi-client (the FastCGI client we use) and maybe the FPM implementation to figure out if PHP scripts could start running before everything is written in the FastCGI request? |
Thank you, I will look into those. |
Related to #2077
Summary
When PHP-FPM's Unix socket is broken between Lambda invocations (e.g. the FPM worker died while the master is still alive),
sendAsyncRequestthrows aWriteFailedException(broken pipe). The current behavior restarts FPM but still throwsFastCgiCommunicationFailed, returning a 500 to the caller even though FPM was just successfully restarted and is ready to serve.This PR retries the request once against the freshly restarted FPM process on
WriteFailedException, turning a guaranteed 500 into a transparent recovery.Why this is safe
A
WriteFailedExceptionmeans the FastCGI request failed to write to the socket — the request never reached PHP-FPM. Since no application code executed, retrying is idempotent.Other exceptions (e.g.
ReadFailedException) are not retried because the request may have already been processed by PHP, and retrying could cause double-execution. The existing behavior (log, restart FPM, throwFastCgiCommunicationFailed) is preserved for those cases.Background
Key findings from production diagnostics for #2077:
proc_get_status()showsrunning=true,signaled=false,exitcode=-1. The master is not crashing.FastCgiCommunicationFailed.WriteFailedExceptionimmediately → Bref restarts FPM → current code throws 500 → next request works fine.I still don't know the root cause of the worker death is still under investigation (Lambda freeze/thaw, socket FD corruption, etc.), but regardless of cause, retrying on write failure is a safe mitigation that eliminates unnecessary 500s.
What changed
In the
catch (Throwable $e)block ofFpmHandler::handleRequest():stop()+start()) on all errors — no changeWriteFailedException, retry the request once against the fresh FPMFastCgiCommunicationFailedas beforeWe've been running this patch via
cweagans/composer-patchesin production. Broken pipe errors that previously resulted in 500s are now transparently recovered, with no double-execution or other side effects observed.Happy to hear your thoughts! Thought we will submit a PR since all the retries have been successful.