Graceful degradation when a trusted realm server is unreachable at boot #5381
lukemelia wants to merge 2 commits
Boot assembly fetched each trusted server's realm list via `_realm-auth` under `Promise.all`. One server failing rejected the whole assembly, which bubbled up to `start()`'s catch and logged the user out, blocking boot and hiding all of the user's realms.

- realm-server: assemble via `Promise.allSettled` so reachable servers still contribute their realms; record unreachable servers in tracked `unreachableRealmServers` state (cleared on logout/reset).
- matrix-service: don't log out when the boot-time `authenticateToAllAccessibleRealms` fails; add `retryUnreachableRealmServers` plus a bounded background retry that merges recovered realms and clears the notice.
- workspace-chooser: show an unobtrusive notice naming the unreachable server(s).
- tests: `setRealmAuthFailure` mock toggle; boot-assembly degradation/retry coverage and a notice-rendering test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
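The switch from `Promise.all` to `Promise.allSettled` described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `assembleRealms`, `RealmFetcher`, and `AssemblyResult` are hypothetical names, and the fetcher stands in for whatever call actually hits `_realm-auth`.

```typescript
// Hypothetical types and names for illustration; not the PR's actual code.
type RealmFetcher = (serverURL: string) => Promise<string[]>;

interface AssemblyResult {
  realms: string[];
  unreachableRealmServers: string[];
}

async function assembleRealms(
  trustedServers: string[],
  fetchRealms: RealmFetcher,
): Promise<AssemblyResult> {
  // allSettled never rejects, so one failing server cannot sink the rest.
  const results = await Promise.allSettled(
    trustedServers.map((serverURL) => fetchRealms(serverURL)),
  );

  const realms: string[] = [];
  const unreachableRealmServers: string[] = [];
  results.forEach((result, index) => {
    if (result.status === 'fulfilled') {
      realms.push(...result.value);
    } else {
      // Results keep input order, so the index maps back to the server.
      unreachableRealmServers.push(trustedServers[index]);
    }
  });
  return { realms, unreachableRealmServers };
}
```

Returning the unreachable servers alongside the realms lets the caller decide what to show and what to retry, rather than treating any single rejection as fatal.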
Host Test Results: 1 file (±0), 1 suite (±0), 2h 37m 0s (+26s). Results for commit 3ba99f8, compared against earlier commit 171c609.
Realm Server Test Results: 1 file (±0), 1 suite (±0), 9m 21s (+15s). Results for commit 3ba99f8, compared against earlier commit 171c609.
[Codex] Findings
Verdict: patch is incorrect. The graceful-degradation path is directionally useful, but it introduces a runtime workspace-list wipe on account-data refresh and masks unrelated auth failures at boot.
Confidence: 0.86
…uth catch

Two issues from Codex review:

- P1: the runtime `app.boxel.realm-servers` account-data handler replaced the available-realms list with the partial result from `fetchUserRealmsFromTrustedServers`, so a transiently-unreachable server erased already-loaded workspaces. Extracted the handler body into `applyTrustedRealmServersAccountData`, which merges (never removes) while any server is unreachable and only replaces on a fully reachable assembly.
- P2: the boot-time catch around `authenticateToAllAccessibleRealms` swallowed every failure. Now it only swallows when a trusted server was actually recorded unreachable; otherwise it rethrows so boot fails loudly instead of reaching `postLoginCompleted` unauthenticated.

Added a regression test that a refresh during an outage keeps the loaded realm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
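The P1 rule (merge while any server is unreachable, replace only on a fully reachable assembly) could look something like the sketch below. `reconcileRealms` is a hypothetical pure helper written for illustration; the body of `applyTrustedRealmServersAccountData` itself is not shown on this page.

```typescript
// Hypothetical helper; not the PR's applyTrustedRealmServersAccountData.
function reconcileRealms(
  loaded: string[],
  fetched: string[],
  unreachableRealmServers: string[],
): string[] {
  if (unreachableRealmServers.length === 0) {
    // Every server answered: the fetched list is authoritative.
    return fetched.slice();
  }
  // Partial result: add anything new, but never drop a loaded realm.
  return Array.from(new Set(loaded.concat(fetched)));
}
```

Keeping the decision in a pure function makes the regression case (a refresh during an outage) straightforward to assert without booting the service.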
[Claude Code 🤖] Both findings were correct; fixed in 3ba99f8.

- P1 (workspace wipe on transient refresh): Extracted the account-data handler body into …
- P2 (boot swallowing unrelated auth failures): The catch around …
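For the P2 fix, the commit message for 3ba99f8 says the boot-time catch now swallows a failure only when a trusted server was recorded unreachable, and rethrows otherwise. A minimal sketch of that guard, with hypothetical names (`authenticateAtBoot` and both injected callbacks are assumptions, not the PR's API):

```typescript
// Hypothetical sketch of the boot-time guard; names are illustrative only.
async function authenticateAtBoot(
  authenticate: () => Promise<void>,
  getUnreachableServers: () => string[],
): Promise<'ok' | 'degraded'> {
  try {
    await authenticate();
    return 'ok';
  } catch (error) {
    if (getUnreachableServers().length > 0) {
      // A recorded outage explains the failure: keep booting, degraded.
      return 'degraded';
    }
    // No outage recorded, so this is some other auth failure: fail loudly.
    throw error;
  }
}
```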
Pull request overview
This PR makes host boot resilient to trusted realm servers being unreachable during boot assembly, so reachable realms still load and the user is informed (with automatic retry) instead of being logged out.
Changes:
- Updates boot assembly to tolerate per-server `_realm-auth` failures via `Promise.allSettled`, tracking unreachable trusted servers for UI visibility.
- Adds matrix-service retry orchestration (manual + bounded background retries) and makes boot-time realm authentication non-fatal when an unreachable trusted server is the cause.
- Adds a workspace chooser notice for unreachable trusted servers and introduces test/mocking support for simulating `_realm-auth` failures.
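The bounded background retry mentioned above can be sketched as a simple capped loop. The PR description gives the bound as 6 × 10s; reading that as six attempts spaced ten seconds apart is an assumption, as are the names `retryUntilReachable`, `retryOnce`, and `sleep`. The real `retryUnreachableRealmServers` also merges recovered realms and authenticates them, which this sketch hides behind `retryOnce`.

```typescript
// Hypothetical sketch; retryOnce and sleep are injected so the loop is testable.
async function retryUntilReachable(
  retryOnce: () => Promise<string[]>, // resolves to servers still unreachable
  maxAttempts: number = 6,
  delayMs: number = 10000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await sleep(delayMs);
    const stillUnreachable = await retryOnce();
    if (stillUnreachable.length === 0) {
      return true; // everything recovered; the notice can be cleared
    }
  }
  return false; // bound reached; a manual retry can still be triggered later
}
```

Capping the attempts keeps a long outage from turning into an indefinite polling loop, while the manual retry path remains available.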
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/host/app/services/realm-server.ts | Assembles trusted-server realm lists via allSettled and tracks unreachable trusted servers for UI + retry. |
| packages/host/app/services/matrix-service.ts | Avoids boot logout on unreachable trusted server, adds retry APIs + bounded background retry loop, and merges runtime refresh conservatively during outages. |
| packages/host/app/components/operator-mode/workspace-chooser/index.gts | Renders an in-context notice when trusted servers were unreachable during boot, with styling and iconography. |
| packages/host/tests/integration/matrix-service-boot-assembly-test.ts | Adds integration coverage for graceful boot + runtime refresh behavior when _realm-auth fails and then recovers. |
| packages/host/tests/integration/components/workspace-chooser-unreachable-notice-test.gts | Adds rendering test coverage for the unreachable-server notice and its dismissal after retry. |
| packages/host/tests/helpers/realm-server-mock/types.ts | Extends mock state to support simulating _realm-auth failures. |
| packages/host/tests/helpers/realm-server-mock/routes.ts | Implements _realm-auth failure mode (503) when toggled. |
| packages/host/tests/helpers/realm-server-mock/index.ts | Adds setRealmAuthFailure() helper for deterministic tests. |
| packages/host/tests/helpers/index.gts | Re-exports setRealmAuthFailure() for tests. |
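The mock rows in the table describe a toggle that makes `_realm-auth` respond with 503. A rough sketch of how such a toggle might be wired is below; the state shape, the handler, the response body, and the `Sketch` suffix on the setter are all assumptions for illustration, since the real mock's types and routes are not shown here.

```typescript
// Hypothetical mock state and handler; the real mock's shapes are not shown.
interface MockState {
  realmAuthFailure: boolean;
}

interface MockResponse {
  status: number;
  body: string;
}

const mockState: MockState = { realmAuthFailure: false };

// Illustrative stand-in for the setRealmAuthFailure() test helper.
function setRealmAuthFailureSketch(enabled: boolean): void {
  mockState.realmAuthFailure = enabled;
}

// Illustrative stand-in for the mock's _realm-auth route.
function handleRealmAuth(realms: Record<string, string>): MockResponse {
  if (mockState.realmAuthFailure) {
    // Simulate an unreachable server deterministically.
    return { status: 503, body: 'Service Unavailable' };
  }
  return { status: 200, body: JSON.stringify(realms) };
}
```

A test can flip the toggle on before boot to simulate the outage, then off again before triggering a retry to exercise recovery.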

      private get unreachableRealmServersMessage() {
        let hosts = this.unreachableRealmServers.map((serverURL) => {
          try {
            return new URL(serverURL).host;
          } catch {
            return serverURL;
          }
        });
        let servers =
          hosts.length === 1 ? hosts[0] : `${hosts.length} realm servers`;
        return `Couldn’t reach ${servers}. Some workspaces may be missing — retrying…`;
      }

      hooks.beforeEach(async function (this: RenderingTestContext) {
        await setupIntegrationTestRealm({
          mockMatrixUtils,
          contents: {},
          startMatrix: false,
        });
        let realmServer = getService('realm-server') as RealmServerService;
        await realmServer.setAvailableRealmIdentifiers([]);
        // Boot healthy so the trusted-servers path is authoritative and the
        // user's realm is loaded before the simulated outage.
        let matrixService = getService('matrix-service') as MatrixService;
        await matrixService.ready;
        await matrixService.start();
      });

backspace
left a comment
This seems pretty involved to reproduce, but a screen recording or the like would have been nice vs just tests! The bot feedback seems worth considering
What & why
Boot assembly (CS-11658) builds the available-realms list by asking each trusted server for the user's realms via `_realm-auth`, under `Promise.all`. One server failing rejected the whole assembly, which bubbled up to `start()`'s catch and logged the user out, blocking boot and hiding all of the user's realms.

This makes a trusted server being unreachable degrade gracefully instead.
Closes CS-11667.
Changes
- `services/realm-server.ts`: `fetchUserRealmsFromTrustedServers` now assembles via `Promise.allSettled`, so realms from the reachable servers still load. Unreachable servers are recorded in new tracked `unreachableRealmServers` state (cleared on logout/reset), which drives the notice.
- `services/matrix-service.ts`: the boot-time `authenticateToAllAccessibleRealms` call is wrapped so an unreachable own-server no longer logs the user out. Added `retryUnreachableRealmServers()` (merges recovered realms, authenticates them, clears the notice) and a bounded background retry (6 × 10s), scheduled after boot and in the account-data listener.
- `workspace-chooser/index.gts`: an unobtrusive notice at the top of the chooser naming the unreachable server(s).

Acceptance criteria
Tests
- `realm-server-mock`: new `setRealmAuthFailure` toggle that makes `_realm-auth` respond 503.
- `matrix-service-boot-assembly-test.ts`: boot completes with the base realm intact, the unreachable server is recorded, and retry recovers the realm + clears the notice.
- `workspace-chooser-unreachable-notice-test.gts`: the notice renders naming the server and disappears after a successful retry.

Notes
`pnpm lint:types`, eslint, and template-lint (clean). Host tests run in CI per this repo's setup.

🤖 Generated with Claude Code