Skip to content

fix(e2e): bound network and setup steps to kill flaky 300s timeouts#360

Open
rafa-thayto wants to merge 9 commits into
mainfrom
fix/e2e-flaky-network-timeout
Open

fix(e2e): bound network and setup steps to kill flaky 300s timeouts#360
rafa-thayto wants to merge 9 commits into
mainfrom
fix/e2e-flaky-network-timeout

Conversation

@rafa-thayto

Copy link
Copy Markdown
Contributor

Problem

The E2E suite flaked intermittently with a single opaque signature — the Nuxt fixture's beforeAll hitting the 300s budget:

(fail) Nuxt with TypeScript > (unnamed) [300001.43ms]
       ^ a beforeEach/afterEach hook timed out

Reading the raw CI logs of two failing runs showed the same signature came from two different stalled steps:

Run Last step logged Hung in
27758855853 git init done clerk link — an untimed fetch() to the production Clerk API
27721361169 fixture copied git init — a local subprocess

Root cause: loggedFetch (the wrapper for every CLI network call) called native fetch() with no AbortSignal, so a stalled connection hung forever. --parallel (8 workers on the 8-vCPU runner) amplified it by bursting ~18 simultaneous PLAPI calls at startup. And no setup step had its own timeout, so any stall consumed the full 300s with no hint which step was to blame.

Fix

  • Root cause (production): default 60s timeout in loggedFetch, composed with any caller signal via AbortSignal.any so tighter budgets (e.g. keyless.ts's 15s) still win. A stalled connection now fails fast across every CLI command with a self-diagnosing error (plapi: request timed out after 60000ms), not just in tests.
  • Defense-in-depth (e2e): each setup step (git init / clerk link / clerk init / npm ci) runs under a per-step timeout that fails with a labeled error instead of silently eating the 300s budget — covers the local-git stall too.
  • Contention: cap e2e --parallel=4; add an explicit afterEach cleanup budget and npm ci --no-audit --no-fund.
  • Noise: drop success-path debug traces (the per-step labeled timeouts replace their diagnostic role); keep the failure diagnostics.

Test plan

  • New unit tests for the timeout (fires after the budget; a shorter caller signal isn't masked; fast requests unaffected)
  • bun run test — 1705 pass / 0 fail
  • typecheck / lint / format:check clean
  • E2E suite is secret-gated — confirmed by CI on this PR

Note: refactored out of a shared working branch; this PR is flake-fix-only (the --fapi work lives in its own PRs).

https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The flaky E2E failures were Nuxt's beforeAll hitting the 300s budget. Two
distinct stalls shared one opaque "hook timed out" signature: one CI run hung
in `clerk link` (an untimed `fetch()` to the production Clerk API), another in
`git init`.

- Add a default 60s timeout to `loggedFetch`, composed with any caller signal
  via `AbortSignal.any` so tighter budgets (keyless's 15s) still win. A stalled
  connection now fails fast across every CLI command, not just in tests.
- Wrap each fixture setup step (git / clerk link / clerk init / npm ci) in a
  per-step timeout that fails with a labeled error instead of silently eating
  the whole 300s budget.
- Cap e2e `--parallel=4` to cut startup contention; add an explicit afterEach
  cleanup budget and `npm ci --no-audit --no-fund`.
- Drop noisy success-path debug traces; keep failure diagnostics.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
@changeset-bot

changeset-bot Bot commented Jun 22, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: aa14b3e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
clerk Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

loggedFetch in packages/cli-core/src/lib/fetch.ts is updated to enforce a default 60-second per-request timeout by introducing a DEFAULT_REQUEST_TIMEOUT_MS constant and an optional timeoutMs field in LoggedFetchInit. The implementation composes an AbortSignal.timeout with any caller-supplied signal via AbortSignal.any, distinguishes timeout-caused aborts from other errors, and throws a tagged message in the timeout case. Unit tests covering timeout rejection, caller-signal precedence, and fast-request passthrough are added. A Changesets patch entry documents the fix.

The E2E test harness gains a withRetry helper that races an async step against a millisecond budget with single-attempt retry, and a writeNpmrc utility to configure npm registry timeouts. setupFixture is updated to bound git init, clerk link, clerk init, and npm ci (which also gains --no-audit --no-fund) under explicit timeout budgets using withRetry. Across fixture-test.ts, test-user.ts, and logger.ts, informational log(...) calls are removed. The test:e2e script is changed to --parallel=4.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main objective: fixing E2E flakiness by adding network timeouts and setup step bounds to address the 300-second timeout issues.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the problem, root cause, and multi-faceted fix addressing network timeouts and setup step safeguards.
Docstring Coverage ✅ Passed Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cli-core/src/lib/fetch.test.ts`:
- Around line 64-72: The test currently uses .catch() to suppress errors which
means if loggedFetch unexpectedly resolves instead of rejecting, the test could
still pass. Replace the current pattern where loggedFetch is called with
.catch() and then the caught error is asserted with expect(String(err)).
Instead, use Jest's rejects matcher syntax to properly assert that the promise
rejects and then validate the error message does not match the timeout pattern.
This ensures the test will fail if the promise resolves unexpectedly.

In `@test/e2e/lib/fixture-setup.ts`:
- Around line 59-77: The withStepTimeout function uses Promise.race to timeout
but does not cancel the underlying run() execution, allowing subprocesses from
gitInit, linkProject, runClerkInit, and the npm ci call to continue running
after timeout. Refactor withStepTimeout to create an AbortController, pass the
abort signal through the run parameter, and update all four operations (gitInit,
linkProject, runClerkInit, and npm ci at line 206) to accept the AbortSignal and
use Bun.spawn with the signal option instead of Bun.$ so that child processes
are properly terminated on timeout. Additionally, add test coverage to verify
that the timeout cancellation properly terminates subprocesses and prevents
residual cleanup contention.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 39560ae6-553a-47fb-a0fe-17840e1ce091

📥 Commits

Reviewing files that changed from the base of the PR and between 8f54a92 and c6123ea.

📒 Files selected for processing (9)
  • .changeset/fix-network-request-timeout.md
  • package.json
  • packages/cli-core/src/lib/fetch.test.ts
  • packages/cli-core/src/lib/fetch.ts
  • test/e2e/lib/dev-server.ts
  • test/e2e/lib/fixture-setup.ts
  • test/e2e/lib/fixture-test.ts
  • test/e2e/lib/logger.ts
  • test/e2e/lib/test-user.ts
💤 Files with no reviewable changes (2)
  • test/e2e/lib/dev-server.ts
  • test/e2e/lib/test-user.ts

Comment thread packages/cli-core/src/lib/fetch.test.ts
Comment thread test/e2e/lib/fixture-setup.ts
Address PR review feedback:

- `runStep` now spawns each setup step via `Bun.spawn` with an `AbortSignal`
  (Bun.$ can't be cancelled), so a timed-out git/clerk/npm step is killed
  instead of orphaned and left to race teardown. Adds runStep unit tests.
- fetch timeout test now fails if `loggedFetch` resolves instead of rejecting
  (no more false pass via swallowed error).
- Trim verbose comments.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The Bun.spawn rewrite (5ce158a) regressed the E2E job: 3 fixtures hung the
full 300s in beforeAll with no per-step timeout recovering, because reading a
killed child's piped stderr to EOF can block when a grandchild keeps the pipe
open. Restore the prior approach, which passed E2E in 52s:

- setup steps use Bun.$ again, wrapped in the Promise.race `withStepTimeout`
  (a timed-out step's subprocess is left to settle — beforeAll is never
  retried, so it can't cascade).
- drop the runStep Bun.spawn helper and its unit test.

The real root-cause fix (the 60s loggedFetch timeout that bounds a stalled
clerk link/init network call at the source) is unchanged.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The Bun.spawn `runStep` rewrite (5ce158a) regressed CI. `clerk init` runs an
internal `npm install` with inherited stderr (init/heuristics.ts installSdk), so
when the per-step AbortSignal SIGKILLed the CLI, the npm grandchild survived
holding the stderr pipe open — `new Response(proc.stderr).text()` never EOF'd,
the timeout never threw, and the 300s beforeAll fired instead. 3 fixtures hung.

Root realization: `clerk init` and `npm ci` do package installs whose duration
scales with CI contention, so any fixed per-step budget false-fails under load
(clerk init blew past its 90s budget in the failing run). You can't fix
contention-driven flakiness by capping variable-duration install work tighter.

Fix: remove per-step timeouts entirely. The real root-cause fix — the 60s
loggedFetch timeout — still bounds the only thing that can truly hang (network
calls); `--parallel=4` cuts contention; the 300s beforeAll is the backstop.
Setup steps return to plain Bun.$ (as on main). Removes runStep and its test.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The remaining flake is npm, not the CLI. `npm ci`'s default `fetch-timeout`
is 300000ms — identical to the test's 300s beforeAll budget — so a single
stalled npm registry connection hangs setup until the hook times out. (clerk
init's installSdk skips here because the isolated env has no PATH, so npm ci is
the only unbounded npm install.)

- npm ci: add --fetch-timeout=60000 --fetch-retries=5 so a stalled fetch aborts
  at 60s and retries, mirroring the CLI's loggedFetch timeout.
- Restore the debug-gated git/link/init/npm step markers so any residual hang
  names the exact step instead of an opaque "hook timed out".

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
The persistent 300s beforeAll hang was npm, not the CLI. npm's default
fetch-timeout is 300000ms, so one stalled registry connection during either
npm operation in setup blocks until the test budget expires. The previous
commit bounded `npm ci` but missed the other one: `clerk init` runs an internal
`npm install @clerk/<sdk>` (installSdk), which was still unbounded — that's what
hung the Vue fixture at 300007ms.

Write a project `.npmrc` (fetch-timeout=30s, fetch-retries=3) before any npm
runs. Both `clerk init`'s install and `npm ci` use projectDir as cwd, so it
covers both: a stalled fetch now aborts in 30s and retries on a fresh
connection instead of waiting 5 minutes. Worst case ~120s, safely under the
300s budget. Drops the redundant per-command npm flags.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
Across four CI runs the 300s beforeAll hang moved randomly between fixtures
AND steps — including `git init`, a local, near-instant, near-silent command.
That rules out npm, the network, loggedFetch and the earlier Bun.spawn pipe
deadlock: the only thing that explains a trivial `git` subprocess hanging 300s
intermittently and only under `--parallel` is Bun.$ subprocess spawning/reaping
stalling under high concurrent load (each of 4 workers spawns git + 2 `bun`
CLIs + npm + a dev server + chromium at once).

Run fixtures serially (`--parallel=1`, still isolated) so at most one fixture's
subprocesses run at a time. Bump the E2E job timeout 30->45m for the slower
serial run. Keeps the .npmrc fetch-timeout and loggedFetch fixes.

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd
Serializing fixtures fixed the contention-driven setup hangs, but exposed a
second, independent flake: `clerk link` (and `init`) intermittently hang ~300s
in a non-fetch path the CLI's loggedFetch timeout can't bound — in human mode
they shell out to git and can stall on a git subprocess or prompt. It lands on a
different fixture each run, so it's transient, not deterministic.

Wrap both CLI steps in withRetry: a stall trips a hard timeout (90s/120s, above
loggedFetch's 60s so genuinely-slow API calls aren't pre-empted) and the retry
runs a fresh subprocess. Promise.race abandons the hung process (no stream
deadlock); beforeAll isn't retried so the orphan can't cascade.
Harden the setup against the intermittent Bun.$ subprocess stall (a spawned
git/clerk/npm step occasionally never resolves — verified a Promise.race
timeout still fires during the hang, so a retry recovers it).

- withRetry now wraps every step: git init, clerk link, clerk init, npm ci.
  A hung attempt is abandoned at its budget and a fresh subprocess retried.
- Tighten the project .npmrc (fetch-timeout 30s->20s, retries 3->2) so a real
  npm stall resolves well under the step budgets and can't false-trip them.
- Restore --parallel=4 (retry absorbs the higher hang frequency) and revert the
  E2E job timeout to 30m.

Keeps the loggedFetch 60s request timeout (bounds the CLI's own API calls).

Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
test/e2e/lib/fixture-setup.ts (2)

82-101: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add unit tests for the withRetry helper.

This helper encapsulates critical timeout and retry logic that guards against flaky E2E failures. Consider adding tests to verify:

  • Timeout fires and rejects with the expected message format
  • Successful fast-path execution doesn't trigger retry
  • Retry logic activates on first failure and succeeds on second attempt
  • Timer cleanup occurs in both success and failure paths
  • Second failure propagates the error correctly

As per path instructions, tests should cover changes to prevent regressions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/lib/fixture-setup.ts` around lines 82 - 101, Create unit tests for
the withRetry function to cover the timeout and retry logic it encapsulates.
Write tests that verify the following scenarios: that a timeout correctly fires
after the specified timeoutMs duration and rejects with the expected error
message format, that successful execution of the function parameter returns
immediately without retrying, that when the function fails on the first attempt
it retries and succeeds on the second attempt, that the setTimeout timer is
properly cleared in both success and failure paths to prevent memory leaks, and
that when the function fails on both attempts the error from the second attempt
is propagated to the caller. Ensure tests are added to the appropriate test
suite file to prevent regressions in this critical E2E helper functionality.

Source: Path instructions


82-96: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider narrowing retry scope to timeout-specific errors.

The current implementation retries on any error (line 95), including permanent failures like missing binaries, invalid arguments, or authorization errors. This adds unnecessary delay before surfacing non-transient issues.

Consider checking the error message or type to retry only timeout errors:

    } catch (err) {
      if (attempt === 2) throw err;
+     // Only retry on timeout; permanent errors should fail fast
+     if (!(err instanceof Error) || !err.message.includes('timed out after')) throw err;
      log(`${label} attempt ${attempt} failed (${err}); retrying`);
    } finally {

Alternatively, if the intention is to handle all transient failures (network blips, race conditions), document that rationale in the function comment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/lib/fixture-setup.ts` around lines 82 - 96, The withRetry function
currently retries on all caught errors regardless of their type, which causes
unnecessary delays for permanent failures like missing binaries or authorization
errors. Modify the catch block to check whether the error is specifically a
timeout error before deciding to retry. You can do this by checking the error
message for the timeout pattern (specifically looking for the "timed out after"
text that matches the Error message constructed in the timeout Promise
rejection) and only retry if it's a timeout error. If the error is not a timeout
error, throw it immediately regardless of the attempt number. Alternatively, if
retrying all transient failures is intentional, add a comment to the withRetry
function explaining this design decision.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/lib/fixture-setup.ts`:
- Around line 82-101: Create unit tests for the withRetry function to cover the
timeout and retry logic it encapsulates. Write tests that verify the following
scenarios: that a timeout correctly fires after the specified timeoutMs duration
and rejects with the expected error message format, that successful execution of
the function parameter returns immediately without retrying, that when the
function fails on the first attempt it retries and succeeds on the second
attempt, that the setTimeout timer is properly cleared in both success and
failure paths to prevent memory leaks, and that when the function fails on both
attempts the error from the second attempt is propagated to the caller. Ensure
tests are added to the appropriate test suite file to prevent regressions in
this critical E2E helper functionality.
- Around line 82-96: The withRetry function currently retries on all caught
errors regardless of their type, which causes unnecessary delays for permanent
failures like missing binaries or authorization errors. Modify the catch block
to check whether the error is specifically a timeout error before deciding to
retry. You can do this by checking the error message for the timeout pattern
(specifically looking for the "timed out after" text that matches the Error
message constructed in the timeout Promise rejection) and only retry if it's a
timeout error. If the error is not a timeout error, throw it immediately
regardless of the attempt number. Alternatively, if retrying all transient
failures is intentional, add a comment to the withRetry function explaining this
design decision.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3bd20699-ac3b-4827-91ab-056500acb479

📥 Commits

Reviewing files that changed from the base of the PR and between 7d8c1a9 and aa14b3e.

📒 Files selected for processing (1)
  • test/e2e/lib/fixture-setup.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant