Graceful degradation when a trusted realm server is unreachable at boot #5381

Open
lukemelia wants to merge 2 commits into main from cs-11667-graceful-degradation-when-a-trusted-realm-server-is

@lukemelia

What & why

Boot assembly (CS-11658) builds the available-realms list by asking each trusted server for the user's realms via _realm-auth, under Promise.all. One server failing rejected the whole assembly, which bubbled up to start()'s catch and logged the user out — blocking boot and hiding all of the user's realms.

This makes a trusted server being unreachable degrade gracefully instead.

Closes CS-11667.

Changes

  • services/realm-server.ts: fetchUserRealmsFromTrustedServers now assembles via Promise.allSettled, so realms from the reachable servers still load. Unreachable servers are recorded in new tracked unreachableRealmServers state (cleared on logout/reset), which drives the notice.
  • services/matrix-service.ts: the boot-time authenticateToAllAccessibleRealms call is wrapped so an unreachable own-server no longer logs the user out. Added retryUnreachableRealmServers() (merges recovered realms, authenticates them, clears the notice) and a bounded background retry (6 × 10s), scheduled after boot and in the account-data listener.
  • workspace-chooser/index.gts: an unobtrusive notice at the top of the chooser naming the unreachable server(s).
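
To illustrate the move from Promise.all to Promise.allSettled, here is a minimal sketch, not the PR's actual code. The names `assembleRealms`, `RealmFetcher`, and `AssemblyResult` are hypothetical; the real signature of fetchUserRealmsFromTrustedServers is not shown in this conversation.

```typescript
// Hypothetical types for illustration only.
type RealmFetcher = (serverURL: string) => Promise<string[]>;

interface AssemblyResult {
  realmURLs: string[];
  unreachableServers: string[];
}

async function assembleRealms(
  serverURLs: string[],
  fetchRealms: RealmFetcher,
): Promise<AssemblyResult> {
  // allSettled never rejects, so one failing server cannot abort the assembly.
  const settled = await Promise.allSettled(
    serverURLs.map((url) => fetchRealms(url)),
  );

  const realmURLs: string[] = [];
  const unreachableServers: string[] = [];

  // Results come back in the same order as the input URLs.
  serverURLs.forEach((url, i) => {
    const result = settled[i];
    if (result?.status === 'fulfilled') {
      realmURLs.push(...result.value);
    } else {
      // Record the failure instead of throwing; a notice or retry can use it.
      unreachableServers.push(url);
    }
  });

  return { realmURLs, unreachableServers };
}
```

With Promise.all, the first rejection would have discarded the realms from every other server. Here the caller always gets a partial list plus the set of servers to report and retry.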

Acceptance criteria

  • One unreachable server does not prevent other servers' realms from loading, and never blocks boot.
  • A non-blocking notice is shown naming the unreachable server.
  • A retry is attempted, and the notice clears on success.
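
The retry criterion could be met with a bounded loop along the following lines. This is a sketch under assumptions: `retryWithBound` and its parameters are hypothetical, and only the 6 × 10s bound is taken from the description above. The `sleep` parameter is injectable so the loop can be exercised without real delays.

```typescript
// Hypothetical helper: tries `attempt` up to `maxAttempts` times, waiting
// `delayMs` before each try. Resolves true as soon as one attempt succeeds.
async function retryWithBound(
  attempt: () => Promise<boolean>,
  maxAttempts = 6,
  delayMs = 10_000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    await sleep(delayMs);
    try {
      if (await attempt()) {
        return true;
      }
    } catch {
      // A thrown attempt counts as a failed try; continue until the bound.
    }
  }
  // Bound exhausted: give up quietly and leave any notice in place.
  return false;
}
```

Bounding the loop keeps a permanently unreachable server from generating background traffic for the whole session.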

Tests

  • realm-server-mock: new setRealmAuthFailure toggle that makes _realm-auth respond 503.
  • matrix-service-boot-assembly-test.ts: boot completes with the base realm intact, the unreachable server is recorded, and retry recovers the realm + clears the notice.
  • workspace-chooser-unreachable-notice-test.gts: the notice renders naming the server and disappears after a successful retry.
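
A failure toggle of this kind might look roughly like the sketch below. Only the name setRealmAuthFailure and the 503 status come from the description above; the state shape, the `handleRealmAuth` handler, and the placeholder success body are assumptions, since the real mock's routes are not shown here.

```typescript
// Hypothetical mock state; the real realm-server-mock types are not shown.
interface MockRealmServerState {
  realmAuthFails: boolean;
}

const mockState: MockRealmServerState = { realmAuthFails: false };

// Assumed signature: a boolean switch that tests flip on and off.
function setRealmAuthFailure(fail: boolean): void {
  mockState.realmAuthFails = fail;
}

interface MockResponse {
  status: number;
  body: string;
}

// Hypothetical handler for the _realm-auth route.
function handleRealmAuth(): MockResponse {
  if (mockState.realmAuthFails) {
    // Simulate an unreachable or overloaded server.
    return { status: 503, body: 'Service Unavailable' };
  }
  // Placeholder success body; the real response shape is not shown.
  return { status: 200, body: JSON.stringify({}) };
}
```

A test would call `setRealmAuthFailure(true)` before boot, assert the degraded state, then call `setRealmAuthFailure(false)` and trigger a retry.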

Notes

  • Verified locally with pnpm lint:types, eslint, and template-lint (clean). Host tests run in CI per this repo's setup.
  • The notice is a contextual banner inside the workspace chooser (where the realm list lives) rather than a new global toast service, since the host has no generic toast infrastructure. Easy to relocate if a more global placement is preferred.

🤖 Generated with Claude Code

Boot assembly fetched each trusted server's realm list via `_realm-auth`
under `Promise.all`. One server failing rejected the whole assembly, which
bubbled up to `start()`'s catch and logged the user out — blocking boot and
hiding all of the user's realms.

- realm-server: assemble via `Promise.allSettled` so reachable servers still
  contribute their realms; record unreachable servers in tracked
  `unreachableRealmServers` state (cleared on logout/reset).
- matrix-service: don't log out when the boot-time
  `authenticateToAllAccessibleRealms` fails; add `retryUnreachableRealmServers`
  plus a bounded background retry that merges recovered realms and clears the
  notice.
- workspace-chooser: show an unobtrusive notice naming the unreachable
  server(s).
- tests: `setRealmAuthFailure` mock toggle; boot-assembly degradation/retry
  coverage and a notice-rendering test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 1, 2026

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   2h 37m 0s ⏱️ +26s
3 377 tests +1  3 362 ✅ +1  15 💤 ±0  0 ❌ ±0 
3 396 runs  +1  3 381 ✅ +1  15 💤 ±0  0 ❌ ±0 

Results for commit 3ba99f8. ± Comparison against earlier commit 171c609.

Realm Server Test Results

    1 files  ±    0      1 suites  ±0   9m 21s ⏱️ +15s
1 674 tests ±    0  1 673 ✅  -     1  0 💤 ±0  1 ❌ +1 
3 506 runs  +1 753  3 505 ✅ +1 752  0 💤 ±0  1 ❌ +1 

Results for commit 3ba99f8. ± Comparison against earlier commit 171c609.

For more details on these errors, see this check.

@lukemelia lukemelia marked this pull request as ready for review July 1, 2026 18:54
@lukemelia

[Codex]

Findings

  1. P1: Runtime account-data refresh can erase all workspaces during a transient outage
    packages/host/app/services/matrix-service.ts:463-470
    fetchUserRealmsFromTrustedServers() now returns partial results instead of throwing when _realm-auth fails. In this account-data handler, an unreachable trusted server can therefore produce realmURLs = [], and setAvailableRealmIdentifiers([]) immediately removes all user realms from the current session. This contradicts the comment above that event-time assembly failures should “leave the available-realms list as it was” and makes existing workspaces disappear during a temporary outage.

  2. P2: Boot now suppresses unrelated authentication failures
    packages/host/app/services/matrix-service.ts:1055-1066
    The new catch swallows every authenticateToAllAccessibleRealms() failure, not just the expected case where boot assembly already marked a trusted server unreachable. Errors from token fetching, session-room joining, or malformed auth responses can now let startup continue to postLoginCompleted = true, leaving the app appearing logged in without required realm authentication. This should rethrow when there is no recorded unreachable trusted server, or catch a narrower expected failure.

Verdict

Patch is incorrect. The graceful-degradation path is directionally useful, but it introduces a runtime workspace-list wipe on account-data refresh and masks unrelated auth failures at boot.

Confidence: 0.86

…uth catch

Two issues from Codex review:

- P1: the runtime `app.boxel.realm-servers` account-data handler replaced the
  available-realms list with the partial result from
  `fetchUserRealmsFromTrustedServers`, so a transiently-unreachable server
  erased already-loaded workspaces. Extracted the handler body into
  `applyTrustedRealmServersAccountData`, which merges (never removes) while any
  server is unreachable and only replaces on a fully reachable assembly.
- P2: the boot-time catch around `authenticateToAllAccessibleRealms` swallowed
  every failure. Now it only swallows when a trusted server was actually
  recorded unreachable; otherwise it rethrows so boot fails loudly instead of
  reaching `postLoginCompleted` unauthenticated.

Added a regression test that a refresh during an outage keeps the loaded realm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lukemelia

[Claude Code 🤖] Both findings were correct — fixed in 3ba99f8.

P1 (workspace wipe on transient refresh): Extracted the account-data handler body into applyTrustedRealmServersAccountData. When any trusted server is unreachable this round it now merges (adds newly-discovered realms, never removes) and lets the retry reconcile; it only replaces the list on a fully-reachable assembly. Added a regression test (account-data refresh survives a transient outage) that boots healthy, then fires a refresh while _realm-auth returns 503 and asserts the loaded realm is not wiped.
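
The merge-versus-replace rule described for P1 can be sketched as a pure function. This is an illustration under assumptions, not the body of applyTrustedRealmServersAccountData: `reconcileRealms` and its parameters are hypothetical names.

```typescript
// Hypothetical helper showing the rule: replace only when every trusted
// server answered; otherwise merge so nothing already loaded is removed.
function reconcileRealms(
  current: string[],
  assembled: string[],
  unreachableServers: string[],
): string[] {
  if (unreachableServers.length === 0) {
    // Fully reachable assembly is authoritative, so removals are allowed.
    return [...assembled];
  }
  // Partial assembly: add newly discovered realms, never drop existing ones.
  return Array.from(new Set([...current, ...assembled]));
}
```

Under this rule an outage that yields an empty partial result leaves the list untouched, which is the case the regression test covers.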

P2 (boot swallowing unrelated auth failures): The catch around authenticateToAllAccessibleRealms now rethrows when unreachableRealmServers is empty, so only a genuinely-unreachable trusted server is tolerated; any other auth failure falls through to the existing catch → logout instead of reaching postLoginCompleted unauthenticated. I didn't add a dedicated test for this branch because triggering an authenticate failure without also marking the server unreachable requires both _realm-auth calls to diverge (assembly succeeds, authenticate fails for an unrelated reason), which isn't cleanly expressible in the current mock; the happy path (no unreachable server, no throw) is already exercised by the existing boot tests, confirming the guard doesn't affect it.
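
The narrowed catch for P2 might be shaped like the sketch below. The names `authenticateAtBoot` and `getUnreachableServers` are hypothetical, and the real code in matrix-service.ts is not shown in this conversation.

```typescript
// Hypothetical wrapper: tolerate an authentication failure only when a
// trusted server was already recorded as unreachable.
async function authenticateAtBoot(
  authenticate: () => Promise<void>,
  getUnreachableServers: () => readonly string[],
): Promise<void> {
  try {
    await authenticate();
  } catch (error) {
    // Read the state at failure time, after assembly has recorded outages.
    if (getUnreachableServers().length === 0) {
      // No known outage explains this failure, so let boot fail loudly.
      throw error;
    }
    // Otherwise swallow: the outage is expected and a retry will reconcile.
  }
}
```

The guard keys on recorded state rather than on the error type, so it cannot tell an unrelated failure from an outage-related one when both occur in the same boot.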

@lukemelia lukemelia requested review from a team and burieberry July 1, 2026 19:27
@habdelra habdelra requested a review from Copilot July 1, 2026 22:43

Copilot AI left a comment

Pull request overview

This PR makes host boot resilient to trusted realm servers being unreachable during boot assembly, so reachable realms still load and the user is informed (with automatic retry) instead of being logged out.

Changes:

  • Updates boot assembly to tolerate per-server _realm-auth failures via Promise.allSettled, tracking unreachable trusted servers for UI visibility.
  • Adds matrix-service retry orchestration (manual + bounded background retries) and makes boot-time realm authentication non-fatal when an unreachable trusted server is the cause.
  • Adds a workspace chooser notice for unreachable trusted servers and introduces test/mocking support for simulating _realm-auth failures.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/host/app/services/realm-server.ts | Assembles trusted-server realm lists via allSettled and tracks unreachable trusted servers for UI + retry. |
| packages/host/app/services/matrix-service.ts | Avoids boot logout on unreachable trusted server, adds retry APIs + bounded background retry loop, and merges runtime refresh conservatively during outages. |
| packages/host/app/components/operator-mode/workspace-chooser/index.gts | Renders an in-context notice when trusted servers were unreachable during boot, with styling and iconography. |
| packages/host/tests/integration/matrix-service-boot-assembly-test.ts | Adds integration coverage for graceful boot + runtime refresh behavior when _realm-auth fails and then recovers. |
| packages/host/tests/integration/components/workspace-chooser-unreachable-notice-test.gts | Adds rendering test coverage for the unreachable-server notice and its dismissal after retry. |
| packages/host/tests/helpers/realm-server-mock/types.ts | Extends mock state to support simulating _realm-auth failures. |
| packages/host/tests/helpers/realm-server-mock/routes.ts | Implements _realm-auth failure mode (503) when toggled. |
| packages/host/tests/helpers/realm-server-mock/index.ts | Adds setRealmAuthFailure() helper for deterministic tests. |
| packages/host/tests/helpers/index.gts | Re-exports setRealmAuthFailure() for tests. |

Comment on lines +69 to +80
private get unreachableRealmServersMessage() {
  let hosts = this.unreachableRealmServers.map((serverURL) => {
    try {
      return new URL(serverURL).host;
    } catch {
      return serverURL;
    }
  });
  let servers =
    hosts.length === 1 ? hosts[0] : `${hosts.length} realm servers`;
  return `Couldn’t reach ${servers}. Some workspaces may be missing — retrying…`;
}

Comment on lines +392 to +405
hooks.beforeEach(async function (this: RenderingTestContext) {
  await setupIntegrationTestRealm({
    mockMatrixUtils,
    contents: {},
    startMatrix: false,
  });
  let realmServer = getService('realm-server') as RealmServerService;
  await realmServer.setAvailableRealmIdentifiers([]);
  // Boot healthy so the trusted-servers path is authoritative and the
  // user's realm is loaded before the simulated outage.
  let matrixService = getService('matrix-service') as MatrixService;
  await matrixService.ready;
  await matrixService.start();
});

@backspace backspace left a comment

This seems pretty involved to reproduce, but a screen recording or the like would have been nice vs just tests! The bot feedback seems worth considering

3 participants