Skip to content

Gate the test-realm SW so it stops intercepting when unclaimed#5380

Open
lukemelia wants to merge 1 commit into
mainfrom
fix-test-realm-sw-leak-flake
Open

Gate the test-realm SW so it stops intercepting when unclaimed#5380
lukemelia wants to merge 1 commit into
mainfrom
fix-test-realm-sw-leak-flake

Conversation

@lukemelia

Copy link
Copy Markdown
Contributor

Problem

Host test shards fail intermittently with whole modules failing at setup (Promise rejected before … [object Object]) or hanging to a 60s timeout, with the app stuck on the login form. Recent examples on #5373 hit theme-card-test, fallback-models, show-card, workspace-chooser-delete, and workspace-chooser archive.

Root cause: the test-realm service worker leaks across modules

packages/host/public/test-realm-sw.js intercepts every browser-level fetch to http://test-realm/… and relays it to the window via postMessage, returning 503 "No responsive client available" if nothing replies within its 1500ms relay timeout.

The client-side responder (the test-realm-fetch handler in tests/helpers/test-realm-service-worker.ts, via setupTestRealmServiceWorker) is installed only by the ~13 image-def/* and audio-def/* acceptance modules.

unregister() (called in that helper's hooks.after) does not evict an already-active worker from a still-loaded page — the active worker keeps controlling the QUnit runner until the page unloads. So once an image/audio module registers the SW, it lingers into later modules. A later module that does a http://test-realm/ browser fetch but never installed a responder — e.g. boot-time MatrixService.start → setSystemCard → store.get(http://test-realm/test/SystemCard/default) — gets a 503. Because a 503 (unlike the expected 404 for a missing SystemCard) is fatal to boot, the app renders the login form, and the module cascades or times out (60s waiting for operator-mode DOM that never appears).

Confirmed from the failing logs:

  • The 503 body No responsive client available is emitted only at test-realm-sw.js:~29.
  • Timeout diagnostics show the app fully settled and stuck on the login form; the boot failure is the SystemCard 503 during IndexRootRoute.model.
  • Module order in a failing shard: Acceptance | image def | opening an image URL in a new tab (registers SW) → Acceptance | workspace-chooser-delete (victim), i.e. the exact leaked-SW adjacency.
  • The relation "jobs" does not exist / worker-fatal lines that also appear are shutdown-time noise (timestamped after tests finish); not causal.

Why it's flaky / shard-composition-sensitive: it only bites when an image/audio module lands immediately before a non-SW module that fetches test-realm/ URLs in the same shard/browser session. Adding tests reshuffles shard boundaries, so a PR that adds tests (or a stale branch) can trigger it while main doesn't.

Fix

Gate interception on an active flag in the SW:

  • Defaults to passthrough (off).
  • The owning module turns it on in beforeEach (after the responder is listening) and off in hooks.after (before unregister), each via a small ack'd postMessage.
  • A lingering worker then passes test-realm/ requests straight through — behaving exactly as if no SW were installed, which is the state non-intercepting modules already expect and pass under. No responder is invoked for them, so there is no hanging-fetch failure mode.
  • The flag defaults off so a terminated-and-restarted worker (which loses the in-memory flag) fails safe toward passthrough rather than resurrecting the leak; the owning module re-asserts it every beforeEach.

Rejected alternative

An earlier attempt (on #5373, since reverted) made the responder always-on — installed globally and resolving the network at message time. That backfired: it called virtualNetwork.fetch() for every leaked request, and for slow/not-found/torn-down fetches the promise never resolved, holding a test waiter so settled() never completed — turning fast cascades into 243s timeouts and spreading failures from one shard to three. This PR takes the opposite tack (stop intercepting) rather than (always respond).

Validation

Host tests can't be run locally (the ember test --path dist harness aborts at boot), so this is validated by CI on this branch. Lint + type-check pass locally.

🤖 Generated with Claude Code

The test-realm service worker relays http://test-realm/ browser fetches to the
window and 503s ("No responsive client available") when no client answers
within its 1500ms timeout. `unregister()` does not evict an already-active
worker from a still-loaded page, so once an image/audio module registers it, it
keeps controlling the QUnit runner into later modules. Those modules never
install a `test-realm-fetch` responder, so a boot-time fetch (e.g.
MatrixService.setSystemCard) gets a 503 instead of the expected 404 — boot
fails, the login form renders, and the module cascades or times out. It's
shard-composition-sensitive, which is why it surfaces intermittently.

Gate interception on an `active` flag: the worker defaults to passthrough and
only intercepts while a module has turned it on (in its beforeEach, after the
responder is listening), turning it back off on teardown. A lingering worker
then passes requests straight through, behaving exactly as if no SW were
installed — the state non-intercepting modules already expect and pass under.
The flag defaults off so a terminated-and-restarted worker fails safe toward
passthrough rather than resurrecting the leak; the owning module re-asserts it
each beforeEach.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Preview deployments

Host Test Results

    1 files      1 suites   2h 24m 9s ⏱️
3 366 tests 3 351 ✅ 15 💤 0 ❌
3 385 runs  3 370 ✅ 15 💤 0 ❌

Results for commit fd887ad.

Realm Server Test Results

    1 files      1 suites   8m 19s ⏱️
1 674 tests 1 674 ✅ 0 💤 0 ❌
1 753 runs  1 753 ✅ 0 💤 0 ❌

Results for commit fd887ad.

@lukemelia lukemelia marked this pull request as ready for review July 1, 2026 14:07
@lukemelia

Copy link
Copy Markdown
Contributor Author

[Claude Code 🤖] CI green: all 20 Host Test shards pass, including 10/12/13 which were the leaked-SW victims. Marking ready for review.

@lukemelia lukemelia requested review from a team and richardhjtan July 1, 2026 15:34
@habdelra habdelra requested a review from Copilot July 1, 2026 22:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses intermittent packages/host acceptance-test shard failures caused by the test-realm service worker continuing to intercept http://test-realm/... requests after the module that registered it has finished, leading to unexpected 503 "No responsive client available" responses in later modules.

Changes:

  • Adds an opt-in active gate in test-realm-sw.js so the SW defaults to passthrough unless explicitly enabled.
  • Updates the test helper to turn interception on in beforeEach (after installing the responder) and turn it off in after (before unregister()), using an ack’d postMessage.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
packages/host/tests/helpers/test-realm-service-worker.ts Adds setSwActive() and uses it to enable/disable interception around test module lifetime.
packages/host/public/test-realm-sw.js Introduces an in-memory active flag and a message handler to toggle interception; fetch handler now gates interception on active.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to +67
async function setSwActive(active: boolean): Promise<void> {
let controller = navigator.serviceWorker.controller;
if (!controller) {
return;
}
await new Promise<void>((resolve) => {
let channel = new MessageChannel();
let settled = false;
let finish = () => {
if (settled) {
return;
}
settled = true;
clearTimeout(timeout);
channel.port1.onmessage = null;
resolve();
};
let timeout = setTimeout(finish, 500);
channel.port1.onmessage = finish;
controller.postMessage({ type: 'test-realm-sw-set-active', active }, [
channel.port2,
]);
});
}
Comment on lines +29 to +32
let port = event.ports && event.ports[0];
if (port) {
port.postMessage({ ok: true, active });
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants