Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler#4874
Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler#4874habdelra wants to merge 2 commits into
Conversation
The previous attempt (prepending `exec` to CMD/ts-node) was insufficient: after that change, staging logs still show zero `Shutting down server for worker manager...` / `Stopping N worker(s)...` / `Draining reservations...` lines on rolling deploys, and zero reservations end up with completion_reason='interrupted'. The dying workers' last lines are pnpm's `ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL` and `Command failed with signal "SIGTERM"` — pnpm is intercepting SIGTERM and not forwarding it to Node, so the registered SIGTERM handler in worker-manager.ts never fires and in-flight reservations stay locked until the 2h lease ages out. Change the four deployed Dockerfile CMDs from exec pnpm --filter "./packages/realm-server" $<x>_script to exec /realm-server/packages/realm-server/$<x>_script and switch the `*_script` build-args from npm script names to script paths relative to `packages/realm-server/`. The start scripts already `exec` into ts-node, so removing the pnpm wrapper makes Node the direct child of sh, and Node becomes PID 1 once the script execs. Because the scripts no longer inherit CWD/PATH from `pnpm --filter`, each script now `cd`s into the realm-server package directory and prepends `./node_modules/.bin` to PATH (matching what `pnpm --filter` set up implicitly). Independent insurance — keep regardless of whether the fix lands cleanly: Layer 1 (start scripts): each of the 7 deployed start scripts now writes `[<script-name>] pid=… ppid=… about to exec ts-node` to stderr immediately before `exec ts-node`. Proves the shell got far enough to hand off. Layer 2 (Node): each of the 5 deployed Node entry files now writes a `[<tag>] STARTUP pid=… ppid=… argv=…` line to stderr synchronously near the top of the file, and each SIGTERM/SIGINT/disconnect/uncaughtException handler writes `[<tag>] <signal> received pid=… ppid=…` synchronously as its first action — before calling the existing shutdown function. The writes use `process.stderr.write` (not the logger) so the line lands even if Node terminates milliseconds later. If after a deploy we see STARTUP but no SIGTERM received, signal delivery is broken upstream of Node and we know to look elsewhere; if we see SIGTERM received but no draining log, the shutdown path itself is wedged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates the deployed realm-server/worker/prerender containers so PID 1 ultimately becomes the Node/ts-node process (instead of pnpm), improving SIGTERM delivery to Node signal handlers and adding additional startup/signal diagnostic logging to help confirm signal flow during ECS rolling deploys.
Changes:
- Switch deployed Dockerfile
CMDfrompnpm --filter ... $*_scriptto executing the start-script path directly (/realm-server/packages/realm-server/$*_script). - Update
manual-deploy.ymlbuild-args to pass script paths (e.g.scripts/start-staging.sh) instead of pnpm script names. - Update deployed start scripts to set
CWD+ prepend./node_modules/.bintoPATH, and add stderr “about to exec ts-node” stamps; add stderr startup/signal stamps in Node entrypoints.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/manual-deploy.yml |
Pass script path build-args (instead of pnpm script names) to match new Dockerfile CMD behavior. |
packages/realm-server/main.ts |
Add stderr startup + SIGTERM/SIGINT receipt stamps near process startup/shutdown wiring. |
packages/realm-server/prerender.Dockerfile |
Execute prerender start script directly from CMD (drop pnpm from PID 1). |
packages/realm-server/prerender-manager.Dockerfile |
Execute prerender-manager start script directly from CMD (drop pnpm from PID 1). |
packages/realm-server/prerender/manager-server.ts |
Add stderr startup + SIGTERM/SIGINT receipt stamps. |
packages/realm-server/prerender/prerender-server.ts |
Add stderr startup + SIGTERM/SIGINT receipt stamps. |
packages/realm-server/realm-server.Dockerfile |
Execute realm-server start script directly from CMD (drop pnpm from PID 1). |
packages/realm-server/scripts/start-prerender-manager.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/scripts/start-prerender-production.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/scripts/start-prerender-staging.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/scripts/start-production.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/scripts/start-staging.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/scripts/start-worker-production.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/scripts/start-worker-staging.sh |
Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp. |
packages/realm-server/worker.Dockerfile |
Execute worker start script directly from CMD (drop pnpm from PID 1). |
packages/realm-server/worker-manager.ts |
Add stderr startup + SIGTERM/SIGINT/disconnect/uncaughtException receipt stamps. |
packages/realm-server/worker.ts |
Add stderr startup + SIGTERM/SIGINT/disconnect receipt stamps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
`process.stderr.write` is libuv-async when stderr is a pipe (the Docker / ECS case for these processes), so writes can be lost if the process exits before libuv flushes — which is the exact scenario the SIGTERM / SIGINT / disconnect / uncaughtException stamps are designed to capture. `writeSync(2, ...)` calls write(2) directly, hands the bytes to the kernel pipe buffer synchronously, and returns only after that completes. Apply this to the STARTUP stamp as well, both for consistency and so an immediate post-startup crash doesn't lose the proof-of-life line. No behavioural change other than the write semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Host Test Results 1 files 1 suites 1h 28m 39s ⏱️ Results for commit 6086166. Realm Server Test Results 1 files ± 0 1 suites ±0 8m 25s ⏱️ + 4m 27s Results for commit 6086166. ± Comparison against earlier commit 4f23c66. |
Summary
PR #4860 tried to fix the worker-reservation orphaning by prepending
execto the CMD line and to thets-nodeinvocation inside the start scripts, on the hypothesis that pnpm 11.x would forward signals to its child once it was no longer behind an extrashlayer. That hypothesis is wrong, and this PR replaces it with the actual fix plus a layer of diagnostic logging.Diagnostic data
Staging task-def
boxel-worker-staging:2083(commit a2bf28b — PR #4860'sexec-prepend fix) has been deployed for 1+ hour across multiple rolling deploys. Across that window:Shutting down server for worker manager...,Stopping N worker(s)..., orDraining reservations for N worker(s)...— the three signature lines fromrunShutdowninpackages/realm-server/worker-manager.ts.job_reservationswithcompletion_reason='interrupted'from the new tasks.ERR_PNPM_RECURSIVE_RUN_FIRST_FAILfollowed byCommand failed with signal "SIGTERM", and nothing from Node — i.e. pnpm received the signal and exited, but Node never logged a thing on its way out.Net effect: SIGTERM doesn't reach Node's
process.on('SIGTERM')handler, so the worker-manager never runs its drain UPDATE that marks in-flight reservationsinterrupted. The reservations stay locked until the 2-hour lease ages out. Job 410956 (ctse/personal from-scratch) is currently stuck this way.Fix
Drop pnpm from PID 1. The deployed start scripts already
execintots-node, so removing the pnpm wrapper makes Node the direct child of sh and Node becomes PID 1 once the script execs.exec pnpm --filter \"./packages/realm-server\" \$<x>_scripttoexec /realm-server/packages/realm-server/\$<x>_script.*_scriptbuild-args in.github/workflows/manual-deploy.ymlchange from npm script names (start:staging,start:worker-staging, etc.) to script paths (scripts/start-staging.sh,scripts/start-worker-staging.sh, etc.).pnpm --filter, each script nowcds into the realm-server package directory and prepends./node_modules/.binto PATH up front (matching whatpnpm --filterdid implicitly).Process chain after the change: Docker
sh -c \"exec /realm-server/packages/realm-server/\$worker_script\"→ sh expands the env var, execs into the script (sh replaced — script is PID 1) → the script execs into ts-node (ts-node replaces the script — ts-node is PID 1).Logging (independent insurance — keep even if the fix works)
Two layers, both using
process.stderr.write(synchronous) so the line lands even if Node terminates immediately after.Layer 1 — start scripts. Each of the 7 deployed start scripts now emits
[<script-name>] pid=… ppid=… about to exec ts-node at <iso-ts>to stderr immediately beforeexec ts-node. Proves the shell got far enough to hand off.Layer 2 — Node entry files. Each of
worker-manager.ts,worker.ts,main.ts,prerender-server.ts,manager-server.tsnow writes a[<tag>] STARTUP pid=… ppid=… argv=…line synchronously near the top of the file, and eachSIGTERM/SIGINT/disconnect/uncaughtExceptionhandler writes[<tag>] <signal> received pid=… ppid=…synchronously as its first action — before calling the existing shutdown function. The body ofrunShutdownis unchanged.What we learn from the next staging deploy:
[start-worker-staging] … about to exec ts-nodethen[worker-manager] STARTUP→ process chain is healthy.[worker-manager] SIGTERM received→ signal delivery to Node is working now.Shutting down server for worker manager...+Stopping N worker(s)...+Draining reservations for N worker(s)...→ the existing graceful shutdown is running end-to-end.SIGTERM receivedon the next rolling deploy, signal delivery is still broken upstream of Node and we have a different root cause to chase.Scope
Dev-only scripts (
start-pg.sh,start-matrix.sh,start-host-dist.sh,start-icons.sh,start-without-matrix.sh) are intentionally not touched — their lifetime model is different.Test plan
[start-worker-staging] pid=… ppid=… about to exec ts-node at <ts>[worker-manager] STARTUP pid=… ppid=… argv=[…][worker-manager] SIGTERM received pid=… ppid=…Shutting down server for worker manager...Stopping N worker(s)...Draining reservations for N worker(s)...job_reservationsfrom the killed task should land withcompletion_reason='interrupted'.🤖 Generated with Claude Code