fix(e2e): bound network and setup steps to kill flaky 300s timeouts by rafa-thayto · Pull Request #360 · clerk/cli

rafa-thayto · 2026-06-22T21:51:42Z

Problem

The E2E suite flaked intermittently with a single opaque signature — the Nuxt fixture's beforeAll hitting the 300s budget:

(fail) Nuxt with TypeScript > (unnamed) [300001.43ms]
       ^ a beforeEach/afterEach hook timed out

Reading the raw CI logs of two failing runs showed the same signature came from two different stalled steps:

Run	Last step logged	Hung in
`27758855853`	`git init done`	`clerk link` — an untimed `fetch()` to the production Clerk API
`27721361169`	`fixture copied`	`git init` — a local subprocess

Root cause: loggedFetch (the wrapper for every CLI network call) called native fetch() with no AbortSignal, so a stalled connection hung forever. --parallel (8 workers on the 8-vCPU runner) amplified it by bursting ~18 simultaneous PLAPI calls at startup. And no setup step had its own timeout, so any stall consumed the full 300s with no hint which step was to blame.

Fix

Root cause (production): default 60s timeout in loggedFetch, composed with any caller signal via AbortSignal.any so tighter budgets (e.g. keyless.ts's 15s) still win. A stalled connection now fails fast across every CLI command with a self-diagnosing error (plapi: request timed out after 60000ms), not just in tests.
Defense-in-depth (e2e): each setup step (git init / clerk link / clerk init / npm ci) runs under a per-step timeout that fails with a labeled error instead of silently eating the 300s budget — covers the local-git stall too.
Contention: cap e2e --parallel=4; add an explicit afterEach cleanup budget and npm ci --no-audit --no-fund.
Noise: drop success-path debug traces (the per-step labeled timeouts replace their diagnostic role); keep the failure diagnostics.

Test plan

New unit tests for the timeout (fires after the budget; a shorter caller signal isn't masked; fast requests unaffected)
bun run test — 1705 pass / 0 fail
typecheck / lint / format:check clean
E2E suite is secret-gated — confirmed by CI on this PR

Note: refactored out of a shared working branch; this PR is flake-fix-only (the --fapi work lives in its own PRs).

https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The flaky E2E failures were Nuxt's beforeAll hitting the 300s budget. Two distinct stalls shared one opaque "hook timed out" signature: one CI run hung in `clerk link` (an untimed `fetch()` to the production Clerk API), another in `git init`. - Add a default 60s timeout to `loggedFetch`, composed with any caller signal via `AbortSignal.any` so tighter budgets (keyless's 15s) still win. A stalled connection now fails fast across every CLI command, not just in tests. - Wrap each fixture setup step (git / clerk link / clerk init / npm ci) in a per-step timeout that fails with a labeled error instead of silently eating the whole 300s budget. - Cap e2e `--parallel=4` to cut startup contention; add an explicit afterEach cleanup budget and `npm ci --no-audit --no-fund`. - Drop noisy success-path debug traces; keep failure diagnostics. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

changeset-bot · 2026-06-22T21:51:45Z

🦋 Changeset detected

Latest commit: aa14b3e

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
clerk	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

coderabbitai · 2026-06-22T21:58:11Z

📝 Walkthrough

Walkthrough

loggedFetch in packages/cli-core/src/lib/fetch.ts is updated to enforce a default 60-second per-request timeout by introducing a DEFAULT_REQUEST_TIMEOUT_MS constant and an optional timeoutMs field in LoggedFetchInit. The implementation composes an AbortSignal.timeout with any caller-supplied signal via AbortSignal.any, distinguishes timeout-caused aborts from other errors, and throws a tagged message in the timeout case. Unit tests covering timeout rejection, caller-signal precedence, and fast-request passthrough are added. A Changesets patch entry documents the fix.

The E2E test harness gains a withRetry helper that races an async step against a millisecond budget with single-attempt retry, and a writeNpmrc utility to configure npm registry timeouts. setupFixture is updated to bound git init, clerk link, clerk init, and npm ci (which also gains --no-audit --no-fund) under explicit timeout budgets using withRetry. Across fixture-test.ts, test-user.ts, and logger.ts, informational log(...) calls are removed. The test:e2e script is changed to --parallel=4.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately reflects the main objective: fixing E2E flakiness by adding network timeouts and setup step bounds to address the 300-second timeout issues.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, explaining the problem, root cause, and multi-faceted fix addressing network timeouts and setup step safeguards.
Docstring Coverage	✅ Passed	Docstring coverage is 90.91% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/cli-core/src/lib/fetch.test.ts`:
- Around line 64-72: The test currently uses .catch() to suppress errors which
means if loggedFetch unexpectedly resolves instead of rejecting, the test could
still pass. Replace the current pattern where loggedFetch is called with
.catch() and then the caught error is asserted with expect(String(err)).
Instead, use Jest's rejects matcher syntax to properly assert that the promise
rejects and then validate the error message does not match the timeout pattern.
This ensures the test will fail if the promise resolves unexpectedly.

In `@test/e2e/lib/fixture-setup.ts`:
- Around line 59-77: The withStepTimeout function uses Promise.race to timeout
but does not cancel the underlying run() execution, allowing subprocesses from
gitInit, linkProject, runClerkInit, and the npm ci call to continue running
after timeout. Refactor withStepTimeout to create an AbortController, pass the
abort signal through the run parameter, and update all four operations (gitInit,
linkProject, runClerkInit, and npm ci at line 206) to accept the AbortSignal and
use Bun.spawn with the signal option instead of Bun.$ so that child processes
are properly terminated on timeout. Additionally, add test coverage to verify
that the timeout cancellation properly terminates subprocesses and prevents
residual cleanup contention.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 39560ae6-553a-47fb-a0fe-17840e1ce091

📥 Commits

Reviewing files that changed from the base of the PR and between 8f54a92 and c6123ea.

📒 Files selected for processing (9)

.changeset/fix-network-request-timeout.md
package.json
packages/cli-core/src/lib/fetch.test.ts
packages/cli-core/src/lib/fetch.ts
test/e2e/lib/dev-server.ts
test/e2e/lib/fixture-setup.ts
test/e2e/lib/fixture-test.ts
test/e2e/lib/logger.ts
test/e2e/lib/test-user.ts

💤 Files with no reviewable changes (2)

test/e2e/lib/dev-server.ts
test/e2e/lib/test-user.ts

Address PR review feedback: - `runStep` now spawns each setup step via `Bun.spawn` with an `AbortSignal` (Bun.$ can't be cancelled), so a timed-out git/clerk/npm step is killed instead of orphaned and left to race teardown. Adds runStep unit tests. - fetch timeout test now fails if `loggedFetch` resolves instead of rejecting (no more false pass via swallowed error). - Trim verbose comments. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The Bun.spawn rewrite (5ce158a) regressed the E2E job: 3 fixtures hung the full 300s in beforeAll with no per-step timeout recovering, because reading a killed child's piped stderr to EOF can block when a grandchild keeps the pipe open. Restore the prior approach, which passed E2E in 52s: - setup steps use Bun.$ again, wrapped in the Promise.race `withStepTimeout` (a timed-out step's subprocess is left to settle — beforeAll is never retried, so it can't cascade). - drop the runStep Bun.spawn helper and its unit test. The real root-cause fix (the 60s loggedFetch timeout that bounds a stalled clerk link/init network call at the source) is unchanged. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The Bun.spawn `runStep` rewrite (5ce158a) regressed CI. `clerk init` runs an internal `npm install` with inherited stderr (init/heuristics.ts installSdk), so when the per-step AbortSignal SIGKILLed the CLI, the npm grandchild survived holding the stderr pipe open — `new Response(proc.stderr).text()` never EOF'd, the timeout never threw, and the 300s beforeAll fired instead. 3 fixtures hung. Root realization: `clerk init` and `npm ci` do package installs whose duration scales with CI contention, so any fixed per-step budget false-fails under load (clerk init blew past its 90s budget in the failing run). You can't fix contention-driven flakiness by capping variable-duration install work tighter. Fix: remove per-step timeouts entirely. The real root-cause fix — the 60s loggedFetch timeout — still bounds the only thing that can truly hang (network calls); `--parallel=4` cuts contention; the 300s beforeAll is the backstop. Setup steps return to plain Bun.$ (as on main). Removes runStep and its test. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The remaining flake is npm, not the CLI. `npm ci`'s default `fetch-timeout` is 300000ms — identical to the test's 300s beforeAll budget — so a single stalled npm registry connection hangs setup until the hook times out. (clerk init's installSdk skips here because the isolated env has no PATH, so npm ci is the only unbounded npm install.) - npm ci: add --fetch-timeout=60000 --fetch-retries=5 so a stalled fetch aborts at 60s and retries, mirroring the CLI's loggedFetch timeout. - Restore the debug-gated git/link/init/npm step markers so any residual hang names the exact step instead of an opaque "hook timed out". Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

The persistent 300s beforeAll hang was npm, not the CLI. npm's default fetch-timeout is 300000ms, so one stalled registry connection during either npm operation in setup blocks until the test budget expires. The previous commit bounded `npm ci` but missed the other one: `clerk init` runs an internal `npm install @clerk/<sdk>` (installSdk), which was still unbounded — that's what hung the Vue fixture at 300007ms. Write a project `.npmrc` (fetch-timeout=30s, fetch-retries=3) before any npm runs. Both `clerk init`'s install and `npm ci` use projectDir as cwd, so it covers both: a stalled fetch now aborts in 30s and retries on a fresh connection instead of waiting 5 minutes. Worst case ~120s, safely under the 300s budget. Drops the redundant per-command npm flags. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

Across four CI runs the 300s beforeAll hang moved randomly between fixtures AND steps — including `git init`, a local, near-instant, near-silent command. That rules out npm, the network, loggedFetch and the earlier Bun.spawn pipe deadlock: the only thing that explains a trivial `git` subprocess hanging 300s intermittently and only under `--parallel` is Bun.$ subprocess spawning/reaping stalling under high concurrent load (each of 4 workers spawns git + 2 `bun` CLIs + npm + a dev server + chromium at once). Run fixtures serially (`--parallel=1`, still isolated) so at most one fixture's subprocesses run at a time. Bump the E2E job timeout 30->45m for the slower serial run. Keeps the .npmrc fetch-timeout and loggedFetch fixes. Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

Serializing fixtures fixed the contention-driven setup hangs, but exposed a second, independent flake: `clerk link` (and `init`) intermittently hang ~300s in a non-fetch path the CLI's loggedFetch timeout can't bound — in human mode they shell out to git and can stall on a git subprocess or prompt. It lands on a different fixture each run, so it's transient, not deterministic. Wrap both CLI steps in withRetry: a stall trips a hard timeout (90s/120s, above loggedFetch's 60s so genuinely-slow API calls aren't pre-empted) and the retry runs a fresh subprocess. Promise.race abandons the hung process (no stream deadlock); beforeAll isn't retried so the orphan can't cascade.

Harden the setup against the intermittent Bun.$ subprocess stall (a spawned git/clerk/npm step occasionally never resolves — verified a Promise.race timeout still fires during the hang, so a retry recovers it). - withRetry now wraps every step: git init, clerk link, clerk init, npm ci. A hung attempt is abandoned at its budget and a fresh subprocess retried. - Tighten the project .npmrc (fetch-timeout 30s->20s, retries 3->2) so a real npm stall resolves well under the step budgets and can't false-trip them. - Restore --parallel=4 (retry absorbs the higher hang frequency) and revert the E2E job timeout to 30m. Keeps the loggedFetch 60s request timeout (bounds the CLI's own API calls). Claude-Session: https://claude.ai/code/session_01V1YkHZ2Ad1okwkX9bxTYsd

coderabbitai

🧹 Nitpick comments (2)

test/e2e/lib/fixture-setup.ts (2)
82-101: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add unit tests for the withRetry helper.

This helper encapsulates critical timeout and retry logic that guards against flaky E2E failures. Consider adding tests to verify:

Timeout fires and rejects with the expected message format

Successful fast-path execution doesn't trigger retry

Retry logic activates on first failure and succeeds on second attempt

Timer cleanup occurs in both success and failure paths

Second failure propagates the error correctly

As per path instructions, tests should cover changes to prevent regressions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/lib/fixture-setup.ts` around lines 82 - 101, Create unit tests for
the withRetry function to cover the timeout and retry logic it encapsulates.
Write tests that verify the following scenarios: that a timeout correctly fires
after the specified timeoutMs duration and rejects with the expected error
message format, that successful execution of the function parameter returns
immediately without retrying, that when the function fails on the first attempt
it retries and succeeds on the second attempt, that the setTimeout timer is
properly cleared in both success and failure paths to prevent memory leaks, and
that when the function fails on both attempts the error from the second attempt
is propagated to the caller. Ensure tests are added to the appropriate test
suite file to prevent regressions in this critical E2E helper functionality.
Source: Path instructions

82-96: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider narrowing retry scope to timeout-specific errors.

The current implementation retries on any error (line 95), including permanent failures like missing binaries, invalid arguments, or authorization errors. This adds unnecessary delay before surfacing non-transient issues.

Consider checking the error message or type to retry only timeout errors:
    } catch (err) {
      if (attempt === 2) throw err;
+     // Only retry on timeout; permanent errors should fail fast
+     if (!(err instanceof Error) || !err.message.includes('timed out after')) throw err;
      log(`${label} attempt ${attempt} failed (${err}); retrying`);
    } finally {
Alternatively, if the intention is to handle all transient failures (network blips, race conditions), document that rationale in the function comment.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/lib/fixture-setup.ts` around lines 82 - 96, The withRetry function
currently retries on all caught errors regardless of their type, which causes
unnecessary delays for permanent failures like missing binaries or authorization
errors. Modify the catch block to check whether the error is specifically a
timeout error before deciding to retry. You can do this by checking the error
message for the timeout pattern (specifically looking for the "timed out after"
text that matches the Error message constructed in the timeout Promise
rejection) and only retry if it's a timeout error. If the error is not a timeout
error, throw it immediately regardless of the attempt number. Alternatively, if
retrying all transient failures is intentional, add a comment to the withRetry
function explaining this design decision.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/e2e/lib/fixture-setup.ts`:
- Around line 82-101: Create unit tests for the withRetry function to cover the
timeout and retry logic it encapsulates. Write tests that verify the following
scenarios: that a timeout correctly fires after the specified timeoutMs duration
and rejects with the expected error message format, that successful execution of
the function parameter returns immediately without retrying, that when the
function fails on the first attempt it retries and succeeds on the second
attempt, that the setTimeout timer is properly cleared in both success and
failure paths to prevent memory leaks, and that when the function fails on both
attempts the error from the second attempt is propagated to the caller. Ensure
tests are added to the appropriate test suite file to prevent regressions in
this critical E2E helper functionality.
- Around line 82-96: The withRetry function currently retries on all caught
errors regardless of their type, which causes unnecessary delays for permanent
failures like missing binaries or authorization errors. Modify the catch block
to check whether the error is specifically a timeout error before deciding to
retry. You can do this by checking the error message for the timeout pattern
(specifically looking for the "timed out after" text that matches the Error
message constructed in the timeout Promise rejection) and only retry if it's a
timeout error. If the error is not a timeout error, throw it immediately
regardless of the attempt number. Alternatively, if retrying all transient
failures is intentional, add a comment to the withRetry function explaining this
design decision.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3bd20699-ac3b-4827-91ab-056500acb479

📥 Commits

Reviewing files that changed from the base of the PR and between 7d8c1a9 and aa14b3e.

📒 Files selected for processing (1)

test/e2e/lib/fixture-setup.ts

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread packages/cli-core/src/lib/fetch.test.ts

Comment thread test/e2e/lib/fixture-setup.ts

rafa-thayto added 8 commits June 22, 2026 19:39

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): bound network and setup steps to kill flaky 300s timeouts#360

fix(e2e): bound network and setup steps to kill flaky 300s timeouts#360
rafa-thayto wants to merge 9 commits into
mainfrom
fix/e2e-flaky-network-timeout

rafa-thayto commented Jun 22, 2026

Uh oh!

changeset-bot Bot commented Jun 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rafa-thayto commented Jun 22, 2026

Problem

Fix

Test plan

Uh oh!

changeset-bot Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented Jun 22, 2026 •

edited

Loading

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading