Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler by habdelra · Pull Request #4874 · cardstack/boxel

habdelra · 2026-05-18T22:49:39Z

Summary

PR #4860 tried to fix the worker-reservation orphaning by prepending exec to the CMD line and to the ts-node invocation inside the start scripts, on the hypothesis that pnpm 11.x would forward signals to its child once it was no longer behind an extra sh layer. That hypothesis is wrong, and this PR replaces it with the actual fix plus a layer of diagnostic logging.

Diagnostic data

Staging task-def boxel-worker-staging:2083 (commit a2bf28b — PR #4860's exec-prepend fix) has been deployed for 1+ hour across multiple rolling deploys. Across that window:

Zero occurrences of Shutting down server for worker manager..., Stopping N worker(s)..., or Draining reservations for N worker(s)... — the three signature lines from runShutdown in packages/realm-server/worker-manager.ts.
Zero rows in job_reservations with completion_reason='interrupted' from the new tasks.
The dying workers' last 3 log lines are uniformly pnpm's ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL followed by Command failed with signal "SIGTERM", and nothing from Node — i.e. pnpm received the signal and exited, but Node never logged a thing on its way out.

Net effect: SIGTERM doesn't reach Node's process.on('SIGTERM') handler, so the worker-manager never runs its drain UPDATE that marks in-flight reservations interrupted. The reservations stay locked until the 2-hour lease ages out. Job 410956 (ctse/personal from-scratch) is currently stuck this way.

Fix

Drop pnpm from PID 1. The deployed start scripts already exec into ts-node, so removing the pnpm wrapper makes Node the direct child of sh and Node becomes PID 1 once the script execs.

The 4 deployed Dockerfile CMDs change from exec pnpm --filter \"./packages/realm-server\" \$<x>_script to exec /realm-server/packages/realm-server/\$<x>_script.
The matching *_script build-args in .github/workflows/manual-deploy.yml change from npm script names (start:staging, start:worker-staging, etc.) to script paths (scripts/start-staging.sh, scripts/start-worker-staging.sh, etc.).
Because the scripts no longer inherit CWD/PATH from pnpm --filter, each script now cds into the realm-server package directory and prepends ./node_modules/.bin to PATH up front (matching what pnpm --filter did implicitly).

Process chain after the change: Docker sh -c \"exec /realm-server/packages/realm-server/\$worker_script\" → sh expands the env var, execs into the script (sh replaced — script is PID 1) → the script execs into ts-node (ts-node replaces the script — ts-node is PID 1).

Logging (independent insurance — keep even if the fix works)

Two layers, both using process.stderr.write (synchronous) so the line lands even if Node terminates immediately after.

Layer 1 — start scripts. Each of the 7 deployed start scripts now emits [<script-name>] pid=… ppid=… about to exec ts-node at <iso-ts> to stderr immediately before exec ts-node. Proves the shell got far enough to hand off.

Layer 2 — Node entry files. Each of worker-manager.ts, worker.ts, main.ts, prerender-server.ts, manager-server.ts now writes a [<tag>] STARTUP pid=… ppid=… argv=… line synchronously near the top of the file, and each SIGTERM/SIGINT/disconnect/uncaughtException handler writes [<tag>] <signal> received pid=… ppid=… synchronously as its first action — before calling the existing shutdown function. The body of runShutdown is unchanged.

What we learn from the next staging deploy:

[start-worker-staging] … about to exec ts-node then [worker-manager] STARTUP → process chain is healthy.
[worker-manager] SIGTERM received → signal delivery to Node is working now.
Shutting down server for worker manager... + Stopping N worker(s)... + Draining reservations for N worker(s)... → the existing graceful shutdown is running end-to-end.
If we see STARTUP but no SIGTERM received on the next rolling deploy, signal delivery is still broken upstream of Node and we have a different root cause to chase.

Scope

Dev-only scripts (start-pg.sh, start-matrix.sh, start-host-dist.sh, start-icons.sh, start-without-matrix.sh) are intentionally not touched — their lifetime model is different.

Test plan

After this PR merges and the next staging deploy completes, the rolling deploy that replaces the worker tasks should emit (per killed task, in CloudWatch):
- [start-worker-staging] pid=… ppid=… about to exec ts-node at <ts>
- [worker-manager] STARTUP pid=… ppid=… argv=[…]
- [worker-manager] SIGTERM received pid=… ppid=…
- Shutting down server for worker manager...
- Stopping N worker(s)...
- Draining reservations for N worker(s)...
At least one row in job_reservations from the killed task should land with completion_reason='interrupted'.
If only the STARTUP line appears and the SIGTERM-received line doesn't, the fix isn't enough and we follow up with a different root-cause investigation.

🤖 Generated with Claude Code

The previous attempt (prepending `exec` to CMD/ts-node) was insufficient: after that change, staging logs still show zero `Shutting down server for worker manager...` / `Stopping N worker(s)...` / `Draining reservations...` lines on rolling deploys, and zero reservations end up with completion_reason='interrupted'. The dying workers' last lines are pnpm's `ERR_PNPM_RECURSIVE_RUN_FIRST_FAIL` and `Command failed with signal "SIGTERM"` — pnpm is intercepting SIGTERM and not forwarding it to Node, so the registered SIGTERM handler in worker-manager.ts never fires and in-flight reservations stay locked until the 2h lease ages out. Change the four deployed Dockerfile CMDs from exec pnpm --filter "./packages/realm-server" $<x>_script to exec /realm-server/packages/realm-server/$<x>_script and switch the `*_script` build-args from npm script names to script paths relative to `packages/realm-server/`. The start scripts already `exec` into ts-node, so removing the pnpm wrapper makes Node the direct child of sh, and Node becomes PID 1 once the script execs. Because the scripts no longer inherit CWD/PATH from `pnpm --filter`, each script now `cd`s into the realm-server package directory and prepends `./node_modules/.bin` to PATH (matching what `pnpm --filter` set up implicitly). Independent insurance — keep regardless of whether the fix lands cleanly: Layer 1 (start scripts): each of the 7 deployed start scripts now writes `[<script-name>] pid=… ppid=… about to exec ts-node` to stderr immediately before `exec ts-node`. Proves the shell got far enough to hand off. Layer 2 (Node): each of the 5 deployed Node entry files now writes a `[<tag>] STARTUP pid=… ppid=… argv=…` line to stderr synchronously near the top of the file, and each SIGTERM/SIGINT/disconnect/uncaughtException handler writes `[<tag>] <signal> received pid=… ppid=…` synchronously as its first action — before calling the existing shutdown function. The writes use `process.stderr.write` (not the logger) so the line lands even if Node terminates milliseconds later. If after a deploy we see STARTUP but no SIGTERM received, signal delivery is broken upstream of Node and we know to look elsewhere; if we see SIGTERM received but no draining log, the shutdown path itself is wedged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates the deployed realm-server/worker/prerender containers so PID 1 ultimately becomes the Node/ts-node process (instead of pnpm), improving SIGTERM delivery to Node signal handlers and adding additional startup/signal diagnostic logging to help confirm signal flow during ECS rolling deploys.

Changes:

Switch deployed Dockerfile CMD from pnpm --filter ... $*_script to executing the start-script path directly (/realm-server/packages/realm-server/$*_script).
Update manual-deploy.yml build-args to pass script paths (e.g. scripts/start-staging.sh) instead of pnpm script names.
Update deployed start scripts to set CWD + prepend ./node_modules/.bin to PATH, and add stderr “about to exec ts-node” stamps; add stderr startup/signal stamps in Node entrypoints.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`.github/workflows/manual-deploy.yml`	Pass script path build-args (instead of pnpm script names) to match new Dockerfile CMD behavior.
`packages/realm-server/main.ts`	Add stderr startup + SIGTERM/SIGINT receipt stamps near process startup/shutdown wiring.
`packages/realm-server/prerender.Dockerfile`	Execute prerender start script directly from CMD (drop pnpm from PID 1).
`packages/realm-server/prerender-manager.Dockerfile`	Execute prerender-manager start script directly from CMD (drop pnpm from PID 1).
`packages/realm-server/prerender/manager-server.ts`	Add stderr startup + SIGTERM/SIGINT receipt stamps.
`packages/realm-server/prerender/prerender-server.ts`	Add stderr startup + SIGTERM/SIGINT receipt stamps.
`packages/realm-server/realm-server.Dockerfile`	Execute realm-server start script directly from CMD (drop pnpm from PID 1).
`packages/realm-server/scripts/start-prerender-manager.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/scripts/start-prerender-production.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/scripts/start-prerender-staging.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/scripts/start-production.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/scripts/start-staging.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/scripts/start-worker-production.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/scripts/start-worker-staging.sh`	Ensure correct CWD/PATH without pnpm wrapper; add stderr “about to exec ts-node” stamp.
`packages/realm-server/worker.Dockerfile`	Execute worker start script directly from CMD (drop pnpm from PID 1).
`packages/realm-server/worker-manager.ts`	Add stderr startup + SIGTERM/SIGINT/disconnect/uncaughtException receipt stamps.
`packages/realm-server/worker.ts`	Add stderr startup + SIGTERM/SIGINT/disconnect receipt stamps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

`process.stderr.write` is libuv-async when stderr is a pipe (the Docker / ECS case for these processes), so writes can be lost if the process exits before libuv flushes — which is the exact scenario the SIGTERM / SIGINT / disconnect / uncaughtException stamps are designed to capture. `writeSync(2, ...)` calls write(2) directly, hands the bytes to the kernel pipe buffer synchronously, and returns only after that completes. Apply this to the STARTUP stamp as well, both for consistency and so an immediate post-startup crash doesn't lose the proof-of-life line. No behavioural change other than the write semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-18T23:42:23Z

Host Test Results

1 files 1 suites 1h 28m 39s ⏱️
2 661 tests 2 646 ✅ 15 💤 0 ❌
2 680 runs 2 665 ✅ 15 💤 0 ❌

Results for commit 6086166.

Realm Server Test Results

1 files ± 0 1 suites ±0 8m 25s ⏱️ + 4m 27s
1 408 tests +546 1 408 ✅ +546 0 💤 ±0 0 ❌ ±0
1 495 runs +577 1 495 ✅ +577 0 💤 ±0 0 ❌ ±0

Results for commit 6086166. ± Comparison against earlier commit 4f23c66.

habdelra requested a review from Copilot May 18, 2026 22:51

Copilot started reviewing on behalf of habdelra May 18, 2026 22:51 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

habdelra requested a review from a team May 18, 2026 23:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler#4874

Drop pnpm from CMD so Node is PID 1 and SIGTERM reaches its handler#4874
habdelra wants to merge 2 commits into
mainfrom
sigterm-pid1-drop-pnpm-and-logging

habdelra commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

habdelra commented May 18, 2026

Summary

Diagnostic data

Fix

Logging (independent insurance — keep even if the fix works)

Scope

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Host Test Results

Realm Server Test Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 18, 2026 •

edited

Loading