feat(core): SIGINT/SIGTERM graceful drain + unhandled-rejection redaction (PER-7855 phase 3/3)#2198
Closed
Shivanshu-07 wants to merge 1 commit intomasterfrom
Closed
feat(core): SIGINT/SIGTERM graceful drain + unhandled-rejection redaction (PER-7855 phase 3/3)#2198Shivanshu-07 wants to merge 1 commit intomasterfrom
Shivanshu-07 wants to merge 1 commit intomasterfrom
Conversation
…tion (PER-7855)
Phase 3 of PER-7855 CLI QoS hardening, plus a bonus fix for the
existing POSIX child-tree leak in `browser.js`.
R1 — Graceful drain on SIGINT/SIGTERM (`cli-command/src/command.js`):
- New module-level `shutdownState` bag (`signal`, `forced`,
`drainTimer`, `hardExitTimer`) is exposed to commands as
`ctx.shutdown` so they can call `percy.stop(ctx.shutdown.forced)`
for graceful-on-first-signal, force-on-second-signal behavior.
- First SIGINT/SIGTERM logs `${signal} received, draining (press Ctrl-C
again to force)...` to stderr and arms a 30s drain timer that flips
`shutdown.forced=true` if the runner hasn't completed.
- Second signal (or the 30s timer) flips `forced=true` immediately and
arms a 5s hard-exit safety timer to bail if `percy.stop(true)` hangs.
- Production exit codes: SIGINT→130, SIGTERM→143, surfaced via
`process.exit` only when `definition.exitOnError` is true. Tests
with `exitOnError:false` preserve the legacy clean-resolution
behavior because AbortError still carries `exitCode:0`.
- `start.js`, `snapshot.js`, `exec.js` callbacks now read
`ctx.shutdown.forced` to choose the `percy.stop(force)` argument.
Non-signal errors preserve the original force-stop behavior.
R3 — Global unhandled-rejection / uncaught-exception handlers:
- Attached exactly once per process by `ensureProcessHandlers()` (called
on every runner invocation; no-op after first attach).
- Stack trace routed through `redactSecrets()` so CDP rejections that
include serialized page-script bodies, Authorization headers, or
cookie strings cannot leak via the new log path.
- Sets `activeContext.runFailed=true`; runs that complete cleanly but
saw an unhandled rejection now throw a synthetic exit-1 error at
the end so CI doesn't see a green build.
Bonus — POSIX child-tree leak in `core/src/browser.js:207`:
The previous `this.process.kill('SIGKILL')` targeted only the lead
Chromium pid. Despite spawning detached at `:266`, that left renderer
/ utility / zygote children orphaned on every kill. The fix matches
the Puppeteer / Playwright convention: shell out to `taskkill /T /F`
on Windows; on POSIX use `process.kill(-pid, 'SIGKILL')` to signal
the whole process group. Falls back to the old lead-pid kill on
either path's error so a missing process doesn't wedge `_closed`.
HTTP server graceful drain (`core/src/server.js`):
`Server.close()` becomes async with a `drainMs` option (default 5s).
Uses Node 18.2+ `closeIdleConnections` / `closeAllConnections` when
available; falls back to manual socket-set iteration on Node 14
(Windows CI is pinned there per `.github/workflows/windows.yml:15`).
The `this.draining` flag is set so future request middleware can
emit `Connection: close` headers.
Test infrastructure:
- `_resetShutdownForTest()` exported from `@percy/cli-command` for
spec isolation; module-level state is also auto-reset at the start
of each `runCommandWithContext` so back-to-back specs don't leak
signal state.
- `try/finally` in `runCommandWithContext` ensures per-run signal
listeners are always removed, even on paths where
`generatePromise`'s cleanup callback wouldn't fire — eliminates
the MaxListenersExceededWarning that was a pre-existing concern.
- Updated `command.test.js` and `cli-exec/test/exec.test.js`
assertions for the new "draining" announcement on stderr and the
removal of the legacy "Stopping percy..." log on graceful (single-
signal) interrupts.
Tests added: `cli-command/test/shutdown.test.js` (4 specs) covers
SIGINT→130, SIGTERM→143, `shutdown.forced` transition on first vs
second signal, and the redactSecrets path for unhandled rejections.
Origin: docs/brainstorms/2026-04-24-per-7855-cli-qos-hardening-requirements.md
Plan: docs/plans/2026-04-27-001-feat-per-7855-cli-qos-hardening-plan.md
Phase 1: commit e135e9a (network refactors + redaction + hint)
Phase 2: commit e8a6d44 (per-port lockfile)
Co-Authored-By: Claude Opus 4.7 (1M context, extended thinking) <noreply@anthropic.com>
9 tasks
Contributor
Author
|
Closing in favor of consolidated PR #2199, which contains all three commits (the same content) so review can happen against a single diff. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Final phase of PER-7855 — graceful drain on
SIGINT/SIGTERM, global unhandled-rejection / uncaught-exception handlers, and a bonus fix for the existing POSIX child-tree leak inbrowser.js. Independent of Phase 1 (#2196) and Phase 2 (#2197); all three can land in any order.R1 — Graceful drain on
SIGINT/SIGTERMshutdownStatebag exposed to commands asctx.shutdownso they can callpercy.stop(ctx.shutdown.forced)for graceful-on-first-signal, force-on-second-signal behavior.${signal} received, draining (press Ctrl-C again to force)...to stderr, arms a 30s drain timer that flipsforced=true.forced=trueand arms a 5s hard-exit safety timer.SIGINT→130,SIGTERM→143, surfaced viaprocess.exitonly whendefinition.exitOnErroris true. Tests withexitOnError: falsepreserve the legacy clean-resolution (AbortError carriesexitCode: 0).start.js,snapshot.js,exec.jsreadctx.shutdown.forcedto choosepercy.stop(force). Non-signal errors preserve the original force-stop behavior.R3 — Global unhandled-rejection / uncaught-exception handlers
ensureProcessHandlers().redactSecrets()so CDP rejections that include serialized page-script bodies, Authorization headers, or cookie strings cannot leak via the new log path. (Phase 1 deepening security finding.)activeContext.runFailed = true; runs that complete cleanly but saw an unhandled rejection throw a synthetic exit-1 error at the end so CI doesn't see a green build.Bonus — POSIX child-tree leak in
browser.js:207The previous
this.process.kill('SIGKILL')targeted only the lead Chromium pid. Despite spawning detached at:266, that left renderer / utility / zygote children orphaned on every kill. The fix matches the Puppeteer / Playwright convention:taskkill /pid <pid> /T /Fon Windows;process.kill(-pid, 'SIGKILL')on POSIX (negative-pid signals the whole process group). Falls back to the lead-pid kill on either path's error so a missing process doesn't wedge_closed.HTTP server graceful drain
Server.close()becomes async with adrainMsoption (default 5s). Uses Node 18.2+closeIdleConnections/closeAllConnectionswhen available; falls back to manual socket-set iteration on Node 14 (Windows CI is pinned there).this.drainingflag set for futureConnection: closemiddleware.Test plan
Phase 3: shutdown + unhandled-rejection + exit codes(cli-command/test/shutdown.test.js) — 4 new specs, all passPre-existing test infrastructure note
cli-snapshot/test/file.test.jshas 4 pre-existing ESM-resolver failures (mockfs cannot intercept dynamicimport()of test fixtures likepages.js). Unrelated to this PR. Same 4 failures appear on master.Test infrastructure
Risks
Post-Deploy Monitoring & Validation
Origin / Plan
🤖 Generated with Claude Opus 4.7 (1M context, extended thinking) via Claude Code