feat: cross-platform force-kill primitive for stuck PHP threads #2365
nicolas-grekas wants to merge 1 commit into php:main
withinboredom
left a comment
Looks pretty good. FYI: bugs like php/php-src#21267 mean that sometimes JIT just will never hit these vm breakpoints. So, be prepared for bug reports that aren't related to this change, but are due to JIT.

Good work! Shouldn't this code be directly on php-src (TSRM)? It could be useful in other contexts than FrankenPHP.

@dunglas dunno in the future, but at the moment, this allows providing the capability to all versions of PHP, without having to wait for a hypothetical merge in 8.6+.
The two don't exclude each other, it would actually help getting this upstreamed because it will open the doors to more calls being switched to alertable. Once that lands in master, we can backport it in FrankenPHP for 8.4+.

PR should be ready! It took me a while to get it correct. PR description is updated. Lots of comments in the patch; let me know if that's too much.

PR ready twice 😅

Introduces a self-contained primitive that wakes a PHP thread parked in a blocking call (sleep, synchronous I/O, etc.) so the graceful drain used by RestartWorkers / DrainWorkers / Shutdown completes promptly instead of waiting for the syscall to return naturally.

Design: each PHP thread, at boot from its own TSRM context, hands a force_kill_slot (pointers to its EG(vm_interrupt) and EG(timed_out) atomic bools, plus pthread_t / Windows HANDLE) back to Go via go_frankenphp_store_force_kill_slot. The slot lives on phpThread and is protected by a per-thread RWMutex so the zero-and-release path at thread exit cannot race an in-flight kill. From any goroutine, Go passes the slot back to frankenphp_force_kill_thread, which stores true into both bools (waking the VM at the next opcode boundary, routing through zend_timeout -> "Maximum execution time exceeded") and delivers a platform-specific wake-up:
- Linux/FreeBSD: pthread_kill(SIGRTMIN+3) with a no-op handler installed via pthread_once, SA_ONSTACK, no SA_RESTART. Signal delivery causes the in-flight blocking syscall to return EINTR.
- Windows: CancelSynchronousIo + QueueUserAPC covers alertable I/O and SleepEx. Non-alertable Sleep (including PHP's usleep) stays uninterruptible.
- macOS: atomic-bool-only path. Threads stuck in blocking syscalls wait for the syscall to complete naturally.

Reserved signal: SIGRTMIN+3. PHP's pcntl_signal(SIGRTMIN+3, ...) clobbers it; embedders whose own Go code uses that signal must patch the constant. glibc NPTL reserves SIGRTMIN..SIGRTMIN+2.

Drain integration: drainWorkerThreads waits drainGracePeriod (5s) for each thread to reach Yielding, then arms force-kill on stragglers and keeps waiting until they yield. phpThread.shutdown does the same. There is no abandon path: if a thread is stuck in a syscall force-kill cannot interrupt (macOS, Windows non-alertable Sleep), the drain blocks until the syscall returns naturally, matching pre-patch behaviour exactly, just typically much faster because force-kill cuts a 60s sleep down to milliseconds. Operators that want a harder bound rely on their orchestrator (systemd, k8s, supervisord) to SIGKILL the process.

worker_test.go + testdata/worker-sleep.php exercise the full path: the test marks a file before sleep(60), polls until the worker is proven parked, then asserts RestartWorkers completes within the grace period and that the post-sleep echo never runs (which would mean the VM interrupt was never observed).

First step of the split suggested in #2287: land the force-kill
infrastructure as a standalone, reviewable primitive independent of
background workers.
Design
Each PHP thread, at boot from its own TSRM context, hands a force_kill_slot (pointers to its EG(vm_interrupt) and EG(timed_out) atomic bools, plus pthread_t / Windows HANDLE) back to Go via go_frankenphp_store_force_kill_slot. The slot lives on phpThread and is protected by a per-thread RWMutex so the zero-and-release path at thread exit cannot race an in-flight kill. From any goroutine, Go passes the slot back to frankenphp_force_kill_thread, which stores true into both atomic bools (waking the VM at the next opcode boundary, routing through zend_timeout -> "Maximum execution time exceeded") and delivers a platform-specific wake-up:
- Linux/FreeBSD: pthread_kill(SIGRTMIN+3) with a no-op handler installed once via pthread_once, SA_ONSTACK, no SA_RESTART. Signal delivery returns any in-flight blocking syscall with EINTR.
- Windows: CancelSynchronousIo + QueueUserAPC covers alertable I/O and SleepEx. Non-alertable Sleep (including PHP's usleep) stays uninterruptible.
- macOS: atomic-bool-only path. Threads stuck in blocking syscalls wait for the syscall to complete naturally.
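The locking scheme described above can be sketched in pure Go. This is a minimal illustration, not the code from the patch: the type and method names (vmFlags, forceKillSlot, thread, storeSlot, releaseSlot, forceKill) are hypothetical, the C side and the platform wake-up are left out, and the order of the two stores is an assumption. It only shows why a read lock on the kill path and a write lock on the release path keep a kill from touching a slot that has already been zeroed.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// vmFlags stands in for the two atomic bools that live in the PHP thread's
// executor globals (EG(vm_interrupt) and EG(timed_out)). Hypothetical type.
type vmFlags struct {
	vmInterrupt atomic.Bool
	timedOut    atomic.Bool
}

// forceKillSlot is a hypothetical Go-side stand-in for the slot: pointers to
// the two flags. The real slot also carries a pthread_t / HANDLE, omitted here.
type forceKillSlot struct {
	vmInterrupt *atomic.Bool
	timedOut    *atomic.Bool
}

// thread models only the part of phpThread that owns the slot.
type thread struct {
	mu   sync.RWMutex
	slot forceKillSlot
}

// storeSlot is called by the thread itself at boot.
func (t *thread) storeSlot(s forceKillSlot) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.slot = s
}

// releaseSlot zeroes the slot at thread exit. Taking the write lock means it
// waits for any in-flight forceKill holding the read lock to finish first.
func (t *thread) releaseSlot() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.slot = forceKillSlot{}
}

// forceKill may be called from any goroutine. It reports false when the slot
// was never stored or has already been released.
func (t *thread) forceKill() bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	if t.slot.vmInterrupt == nil || t.slot.timedOut == nil {
		return false
	}
	// Store order is an assumption of this sketch.
	t.slot.timedOut.Store(true)
	t.slot.vmInterrupt.Store(true)
	return true
}

func main() {
	flags := &vmFlags{}
	t := &thread{}
	t.storeSlot(forceKillSlot{vmInterrupt: &flags.vmInterrupt, timedOut: &flags.timedOut})

	fmt.Println(t.forceKill(), flags.vmInterrupt.Load(), flags.timedOut.Load()) // true true true

	t.releaseSlot()
	fmt.Println(t.forceKill()) // false: the slot is gone, nothing is touched
}
```

A read lock is enough on the kill path because concurrent kills only store true into atomics; only the release path needs exclusivity.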
Reserved signal: SIGRTMIN+3. A PHP script that calls pcntl_signal(SIGRTMIN+3, ...) clobbers this. Embedders whose own Go code uses SIGRTMIN+3 must patch it here. glibc NPTL reserves SIGRTMIN..SIGRTMIN+2, so the offset cannot go lower.
Drain integration
drainWorkerThreads waits drainGracePeriod (5s) for each thread to reach Yielding, then arms force-kill on stragglers and keeps waiting until they yield. phpThread.shutdown does the same. There is no abandon path: if a thread is stuck in a syscall force-kill cannot interrupt (macOS, Windows non-alertable Sleep), the drain blocks until the syscall returns naturally. That worst case matches pre-patch behaviour exactly; in the interruptible cases the drain is typically much faster because force-kill cuts a sleep(60) down to milliseconds. Operators that want a harder bound rely on their orchestrator (systemd, k8s, supervisord) to SIGKILL the process.
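The drain policy (wait for a grace period, arm force-kill, then keep waiting with no upper bound) can be sketched for a single thread as below. This is an illustration under assumptions, not the patch's drainWorkerThreads: the drain function, the yielded channel and the forceKill callback are hypothetical names, and the grace period is shortened so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// drain waits up to grace for yielded to be closed. If the grace period
// expires first, it calls forceKill once and then keeps waiting without any
// further timeout (no abandon path). It reports whether force-kill was armed.
func drain(yielded <-chan struct{}, grace time.Duration, forceKill func()) bool {
	timer := time.NewTimer(grace)
	defer timer.Stop()

	select {
	case <-yielded:
		return false // the thread yielded on its own within the grace period
	case <-timer.C:
	}

	forceKill()
	<-yielded // blocks until the thread actually yields
	return true
}

func main() {
	yielded := make(chan struct{})
	interrupted := make(chan struct{})

	// Stand-in for a worker parked in a blocking call: it yields only after
	// it has been interrupted.
	go func() {
		<-interrupted
		close(yielded)
	}()

	armed := drain(yielded, 50*time.Millisecond, func() { close(interrupted) })
	fmt.Println("force-kill armed:", armed) // force-kill armed: true
}
```

If the callback cannot wake the worker (the macOS and non-alertable Sleep cases), the final receive simply blocks until the worker yields by itself, which is the behaviour the description calls matching pre-patch.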
go_frankenphp_on_thread_shutdown runs on both the healthy path and the unhealthy-during-Shutdown path so state.Done is set even when force-kill bails the thread. Without it, phpThread.shutdown's WaitFor(state.Done) would never unblock.
Testing
TestRestartWorkersForceKillsStuckThread drives the full path via a marker file so RestartWorkers only arms once the worker is proven parked in sleep(), then asserts bounded elapsed time and that the post-sleep echo never runs.
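The marker-file handshake can be illustrated with a small polling helper. This is a sketch of the general pattern only, not the test from the patch: waitFor and fileExists are hypothetical helpers, and a goroutine stands in for the PHP worker that writes the marker just before blocking.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitFor polls cond every interval until it returns true or timeout
// elapses. It reports whether cond became true in time.
func waitFor(cond func() bool, interval, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// fileExists reports whether path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	dir, err := os.MkdirTemp("", "marker")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	marker := filepath.Join(dir, "parked")

	// Stand-in for the worker script: write the marker, then block.
	go func() {
		time.Sleep(20 * time.Millisecond)
		_ = os.WriteFile(marker, nil, 0o600)
	}()

	// Only once the marker is visible would a test trigger the restart and
	// start measuring elapsed time.
	parked := waitFor(func() bool { return fileExists(marker) }, 5*time.Millisecond, 2*time.Second)
	fmt.Println("worker parked:", parked)
}
```

Gating on the marker avoids a false pass in which the restart runs before the worker has entered the blocking call at all.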