fix(e2e): bound network and setup steps to kill flaky 300s timeouts#360
fix(e2e): bound network and setup steps to kill flaky 300s timeouts#360rafa-thayto wants to merge 9 commits into
Conversation
The flaky E2E failures were Nuxt's beforeAll hitting the 300s budget. Two distinct stalls shared one opaque "hook timed out" signature: one CI run hung in `clerk link` (an untimed `fetch()` to the production Clerk API), another in `git init`. - Add a default 60s timeout to `loggedFetch`, composed with any caller signal via `AbortSignal.any` so tighter budgets (keyless's 15s) still win. A stalled connection now fails fast across every CLI command, not just in tests. - Wrap each fixture setup step (git / clerk link / clerk init / npm ci) in a per-step timeout that fails with a labeled error instead of silently eating the whole 300s budget. - Cap e2e `--parallel=4` to cut startup contention; add an explicit afterEach cleanup budget and `npm ci --no-audit --no-fund`. - Drop noisy success-path debug traces; keep failure diagnostics. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
🦋 Changeset detectedLatest commit: aa14b3e The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
📝 WalkthroughWalkthrough
The E2E test harness gains a Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/cli-core/src/lib/fetch.test.ts`:
- Around line 64-72: The test currently uses .catch() to suppress errors which
means if loggedFetch unexpectedly resolves instead of rejecting, the test could
still pass. Replace the current pattern where loggedFetch is called with
.catch() and then the caught error is asserted with expect(String(err)).
Instead, use Jest's rejects matcher syntax to properly assert that the promise
rejects and then validate the error message does not match the timeout pattern.
This ensures the test will fail if the promise resolves unexpectedly.
In `@test/e2e/lib/fixture-setup.ts`:
- Around line 59-77: The withStepTimeout function uses Promise.race to timeout
but does not cancel the underlying run() execution, allowing subprocesses from
gitInit, linkProject, runClerkInit, and the npm ci call to continue running
after timeout. Refactor withStepTimeout to create an AbortController, pass the
abort signal through the run parameter, and update all four operations (gitInit,
linkProject, runClerkInit, and npm ci at line 206) to accept the AbortSignal and
use Bun.spawn with the signal option instead of Bun.$ so that child processes
are properly terminated on timeout. Additionally, add test coverage to verify
that the timeout cancellation properly terminates subprocesses and prevents
residual cleanup contention.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 39560ae6-553a-47fb-a0fe-17840e1ce091
📒 Files selected for processing (9)
.changeset/fix-network-request-timeout.mdpackage.jsonpackages/cli-core/src/lib/fetch.test.tspackages/cli-core/src/lib/fetch.tstest/e2e/lib/dev-server.tstest/e2e/lib/fixture-setup.tstest/e2e/lib/fixture-test.tstest/e2e/lib/logger.tstest/e2e/lib/test-user.ts
💤 Files with no reviewable changes (2)
- test/e2e/lib/dev-server.ts
- test/e2e/lib/test-user.ts
Address PR review feedback: - `runStep` now spawns each setup step via `Bun.spawn` with an `AbortSignal` (Bun.$ can't be cancelled), so a timed-out git/clerk/npm step is killed instead of orphaned and left to race teardown. Adds runStep unit tests. - fetch timeout test now fails if `loggedFetch` resolves instead of rejecting (no more false pass via swallowed error). - Trim verbose comments. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The Bun.spawn rewrite (5ce158a) regressed the E2E job: 3 fixtures hung the full 300s in beforeAll with no per-step timeout recovering, because reading a killed child's piped stderr to EOF can block when a grandchild keeps the pipe open. Restore the prior approach, which passed E2E in 52s: - setup steps use Bun.$ again, wrapped in the Promise.race `withStepTimeout` (a timed-out step's subprocess is left to settle — beforeAll is never retried, so it can't cascade). - drop the runStep Bun.spawn helper and its unit test. The real root-cause fix (the 60s loggedFetch timeout that bounds a stalled clerk link/init network call at the source) is unchanged. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The Bun.spawn `runStep` rewrite (5ce158a) regressed CI. `clerk init` runs an internal `npm install` with inherited stderr (init/heuristics.ts installSdk), so when the per-step AbortSignal SIGKILLed the CLI, the npm grandchild survived holding the stderr pipe open — `new Response(proc.stderr).text()` never EOF'd, the timeout never threw, and the 300s beforeAll fired instead. 3 fixtures hung. Root realization: `clerk init` and `npm ci` do package installs whose duration scales with CI contention, so any fixed per-step budget false-fails under load (clerk init blew past its 90s budget in the failing run). You can't fix contention-driven flakiness by capping variable-duration install work tighter. Fix: remove per-step timeouts entirely. The real root-cause fix — the 60s loggedFetch timeout — still bounds the only thing that can truly hang (network calls); `--parallel=4` cuts contention; the 300s beforeAll is the backstop. Setup steps return to plain Bun.$ (as on main). Removes runStep and its test. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The remaining flake is npm, not the CLI. `npm ci`'s default `fetch-timeout` is 300000ms — identical to the test's 300s beforeAll budget — so a single stalled npm registry connection hangs setup until the hook times out. (clerk init's installSdk skips here because the isolated env has no PATH, so npm ci is the only unbounded npm install.) - npm ci: add --fetch-timeout=60000 --fetch-retries=5 so a stalled fetch aborts at 60s and retries, mirroring the CLI's loggedFetch timeout. - Restore the debug-gated git/link/init/npm step markers so any residual hang names the exact step instead of an opaque "hook timed out". Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The persistent 300s beforeAll hang was npm, not the CLI. npm's default fetch-timeout is 300000ms, so one stalled registry connection during either npm operation in setup blocks until the test budget expires. The previous commit bounded `npm ci` but missed the other one: `clerk init` runs an internal `npm install @clerk/<sdk>` (installSdk), which was still unbounded — that's what hung the Vue fixture at 300007ms. Write a project `.npmrc` (fetch-timeout=30s, fetch-retries=3) before any npm runs. Both `clerk init`'s install and `npm ci` use projectDir as cwd, so it covers both: a stalled fetch now aborts in 30s and retries on a fresh connection instead of waiting 5 minutes. Worst case ~120s, safely under the 300s budget. Drops the redundant per-command npm flags. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
Across four CI runs the 300s beforeAll hang moved randomly between fixtures AND steps — including `git init`, a local, near-instant, near-silent command. That rules out npm, the network, loggedFetch and the earlier Bun.spawn pipe deadlock: the only thing that explains a trivial `git` subprocess hanging 300s intermittently and only under `--parallel` is Bun.$ subprocess spawning/reaping stalling under high concurrent load (each of 4 workers spawns git + 2 `bun` CLIs + npm + a dev server + chromium at once). Run fixtures serially (`--parallel=1`, still isolated) so at most one fixture's subprocesses run at a time. Bump the E2E job timeout 30->45m for the slower serial run. Keeps the .npmrc fetch-timeout and loggedFetch fixes. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
Serializing fixtures fixed the contention-driven setup hangs, but exposed a second, independent flake: `clerk link` (and `init`) intermittently hang ~300s in a non-fetch path the CLI's loggedFetch timeout can't bound — in human mode they shell out to git and can stall on a git subprocess or prompt. It lands on a different fixture each run, so it's transient, not deterministic. Wrap both CLI steps in withRetry: a stall trips a hard timeout (90s/120s, above loggedFetch's 60s so genuinely-slow API calls aren't pre-empted) and the retry runs a fresh subprocess. Promise.race abandons the hung process (no stream deadlock); beforeAll isn't retried so the orphan can't cascade.
Harden the setup against the intermittent Bun.$ subprocess stall (a spawned git/clerk/npm step occasionally never resolves — verified a Promise.race timeout still fires during the hang, so a retry recovers it). - withRetry now wraps every step: git init, clerk link, clerk init, npm ci. A hung attempt is abandoned at its budget and a fresh subprocess retried. - Tighten the project .npmrc (fetch-timeout 30s->20s, retries 3->2) so a real npm stall resolves well under the step budgets and can't false-trip them. - Restore --parallel=4 (retry absorbs the higher hang frequency) and revert the E2E job timeout to 30m. Keeps the loggedFetch 60s request timeout (bounds the CLI's own API calls). Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/e2e/lib/fixture-setup.ts (2)
82-101: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd unit tests for the
withRetryhelper.This helper encapsulates critical timeout and retry logic that guards against flaky E2E failures. Consider adding tests to verify:
- Timeout fires and rejects with the expected message format
- Successful fast-path execution doesn't trigger retry
- Retry logic activates on first failure and succeeds on second attempt
- Timer cleanup occurs in both success and failure paths
- Second failure propagates the error correctly
As per path instructions, tests should cover changes to prevent regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/lib/fixture-setup.ts` around lines 82 - 101, Create unit tests for the withRetry function to cover the timeout and retry logic it encapsulates. Write tests that verify the following scenarios: that a timeout correctly fires after the specified timeoutMs duration and rejects with the expected error message format, that successful execution of the function parameter returns immediately without retrying, that when the function fails on the first attempt it retries and succeeds on the second attempt, that the setTimeout timer is properly cleared in both success and failure paths to prevent memory leaks, and that when the function fails on both attempts the error from the second attempt is propagated to the caller. Ensure tests are added to the appropriate test suite file to prevent regressions in this critical E2E helper functionality.Source: Path instructions
82-96: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winConsider narrowing retry scope to timeout-specific errors.
The current implementation retries on any error (line 95), including permanent failures like missing binaries, invalid arguments, or authorization errors. This adds unnecessary delay before surfacing non-transient issues.
Consider checking the error message or type to retry only timeout errors:
} catch (err) { if (attempt === 2) throw err; + // Only retry on timeout; permanent errors should fail fast + if (!(err instanceof Error) || !err.message.includes('timed out after')) throw err; log(`${label} attempt ${attempt} failed (${err}); retrying`); } finally {Alternatively, if the intention is to handle all transient failures (network blips, race conditions), document that rationale in the function comment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/lib/fixture-setup.ts` around lines 82 - 96, The withRetry function currently retries on all caught errors regardless of their type, which causes unnecessary delays for permanent failures like missing binaries or authorization errors. Modify the catch block to check whether the error is specifically a timeout error before deciding to retry. You can do this by checking the error message for the timeout pattern (specifically looking for the "timed out after" text that matches the Error message constructed in the timeout Promise rejection) and only retry if it's a timeout error. If the error is not a timeout error, throw it immediately regardless of the attempt number. Alternatively, if retrying all transient failures is intentional, add a comment to the withRetry function explaining this design decision.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/lib/fixture-setup.ts`:
- Around line 82-101: Create unit tests for the withRetry function to cover the
timeout and retry logic it encapsulates. Write tests that verify the following
scenarios: that a timeout correctly fires after the specified timeoutMs duration
and rejects with the expected error message format, that successful execution of
the function parameter returns immediately without retrying, that when the
function fails on the first attempt it retries and succeeds on the second
attempt, that the setTimeout timer is properly cleared in both success and
failure paths to prevent memory leaks, and that when the function fails on both
attempts the error from the second attempt is propagated to the caller. Ensure
tests are added to the appropriate test suite file to prevent regressions in
this critical E2E helper functionality.
- Around line 82-96: The withRetry function currently retries on all caught
errors regardless of their type, which causes unnecessary delays for permanent
failures like missing binaries or authorization errors. Modify the catch block
to check whether the error is specifically a timeout error before deciding to
retry. You can do this by checking the error message for the timeout pattern
(specifically looking for the "timed out after" text that matches the Error
message constructed in the timeout Promise rejection) and only retry if it's a
timeout error. If the error is not a timeout error, throw it immediately
regardless of the attempt number. Alternatively, if retrying all transient
failures is intentional, add a comment to the withRetry function explaining this
design decision.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3bd20699-ac3b-4827-91ab-056500acb479
📒 Files selected for processing (1)
test/e2e/lib/fixture-setup.ts
Problem
The E2E suite flaked intermittently with a single opaque signature — the Nuxt fixture's
beforeAllhitting the 300s budget:Reading the raw CI logs of two failing runs showed the same signature came from two different stalled steps:
27758855853git init doneclerk link— an untimedfetch()to the production Clerk API27721361169fixture copiedgit init— a local subprocessRoot cause:
loggedFetch(the wrapper for every CLI network call) called nativefetch()with noAbortSignal, so a stalled connection hung forever.--parallel(8 workers on the 8-vCPU runner) amplified it by bursting ~18 simultaneous PLAPI calls at startup. And no setup step had its own timeout, so any stall consumed the full 300s with no hint which step was to blame.Fix
loggedFetch, composed with any caller signal viaAbortSignal.anyso tighter budgets (e.g.keyless.ts's 15s) still win. A stalled connection now fails fast across every CLI command with a self-diagnosing error (plapi: request timed out after 60000ms), not just in tests.git init/clerk link/clerk init/npm ci) runs under a per-step timeout that fails with a labeled error instead of silently eating the 300s budget — covers the local-gitstall too.--parallel=4; add an explicitafterEachcleanup budget andnpm ci --no-audit --no-fund.Test plan
bun run test— 1705 pass / 0 failtypecheck/lint/format:checkcleanhttps://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd