Skip to content

CI flake diagnostics: readiness-check + startup phase logging#4873

Open
habdelra wants to merge 3 commits into
mainfrom
worktree-ci-readiness-diagnostics
Open

CI flake diagnostics: readiness-check + startup phase logging#4873
habdelra wants to merge 3 commits into
mainfrom
worktree-ci-readiness-diagnostics

Conversation

@habdelra
Copy link
Copy Markdown
Contributor

Why

The host-test job recurrently fails with:

Error: Timed out waiting for: https-get://localhost:4201/base/_readiness-check, .../node-test/..., .../skills/..., .../test/...

Seen on this branch's run #26061756182 and on a main run ~50 min earlier (#26058294854), same 4 realm URLs each time. wait-on waits 10 min, then aborts.

Today's signal floor:

  • realm-server.log jumps from is starting indexing to nothing for 10 minutes.
  • worker-manager.log silences between completed invalidations in Xms and completed index visit in Yms (no per-file heartbeat in the visit loop).
  • The wait-on failure message tells us nothing about what the realm-server responded — timeout? 500? not listening?

We can't tell whether the hang is in queue.publish, in worker pickup, in the visit loop, in isWorldReadable, or in the realm-server's HTTP stack itself.

What this PR adds

Temporary diagnostic logging, every line tagged [ci-readiness-diag]:

Location Logs
Realm#readinessCheck entry, 15s heartbeat while #startedUp is pending, fulfillment with total wait time
Realm.#startup per-phase info logs: copy / isNewIndex / fullIndexOnStartup / awaiting fullIndex / awaiting isWorldReadable / complete
RealmIndexUpdater.publishFullIndex job_id + queue.publish duration the moment the from-scratch job is durably enqueued
IndexRunner.fromScratch visit loop progress line every 25 files: filesCompleted/totalFiles, lastUrl, elapsedMs
ci-host.yaml host-test step DEBUG=start-server-and-test env so wait-on prints each probe's HTTP status / connection error

Existing log levels already pass info through for the affected loggers (*=info, realm-index-updater=debug, index-runner=debug, index-perf=debug), so no level changes were needed. Lines land in the existing realm-server-log, test-realms-log, and worker-manager-log artifacts, plus the failing step's stdout (for the wait-on trace).

Removal

git grep ci-readiness-diag locates every block — comment markers + the log-string prefix make it findable both in source and in the log artifacts.

Files

  • .github/workflows/ci-host.yaml
  • packages/runtime-common/realm.ts
  • packages/runtime-common/realm-index-updater.ts
  • packages/runtime-common/index-runner.ts

Test plan

  • CI passes (the new logs are info-level, no behavioral change)
  • On the next reproduction, the realm-server artifact should now show which realm is parked at which startup phase, and the wait-on output should show the actual HTTP response code

🤖 Generated with Claude Code

The "Error: Timed out waiting for: .../_readiness-check" failure in the
host-test job recurs without enough log signal to root-cause. Today the
realm-server log artifact jumps from "is starting indexing" to nothing
for 10 minutes; the worker log similarly silences between
"completed invalidations" and "completed index visit"; the wait-on
output never shows what HTTP response (if any) it saw.

Temporary diagnostic logging, all tagged "[ci-readiness-diag]":

  - Realm#readinessCheck: entry log, 15s heartbeat while #startedUp is
    pending, fulfillment log with total wait time.
  - Realm.#startup: per-phase info logs (copy / isNewIndex /
    fullIndexOnStartup / awaiting fullIndex / awaiting isWorldReadable
    / complete) so we can see exactly where a realm is parked.
  - RealmIndexUpdater.publishFullIndex: log job_id and queue.publish
    duration the moment the from-scratch job is durably enqueued.
  - IndexRunner.fromScratch: visit-loop progress every 25 files
    (filesCompleted / totalFiles / lastUrl / elapsedMs).
  - ci-host.yaml host-test step: DEBUG=start-server-and-test so wait-on
    prints each probe's HTTP status / connection error.

Removal: `git grep ci-readiness-diag` locates every block. Each new log
line carries the "[ci-readiness-diag]" prefix in its message so search
on log artifacts also finds them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra habdelra marked this pull request as ready for review May 18, 2026 22:30
@habdelra habdelra requested a review from Copilot May 18, 2026 22:30
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds temporary, tagged diagnostic logging and CI verbosity to better pinpoint where the realm-server startup/indexing pipeline stalls when _readiness-check timeouts occur in CI.

Changes:

  • Add [ci-readiness-diag] logging around _readiness-check waiting and realm startup phases.
  • Add enqueue-time diagnostics for from-scratch indexing jobs and progress heartbeats during the visit loop.
  • Enable start-server-and-test debug output in ci-host so wait-on prints per-probe status/errors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
.github/workflows/ci-host.yaml Enables DEBUG=start-server-and-test to surface wait-on probe diagnostics in CI output.
packages/runtime-common/realm.ts Adds readiness-check waiting logs and startup phase markers.
packages/runtime-common/realm-index-updater.ts Logs from-scratch job enqueue details (job id + timing).
packages/runtime-common/index-runner.ts Adds periodic visit-loop progress logs during from-scratch indexing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/runtime-common/realm.ts Outdated
Comment thread packages/runtime-common/realm-index-updater.ts Outdated
Two findings on PR #4873:

  1. readinessCheck created a setInterval heartbeat per probe. wait-on
     polls every 0.5–2s for up to 10 minutes, so a stalled startup
     would accumulate ~300 concurrent intervals all heartbeating the
     same Deferred, ~300 log lines every 15s. Moved the heartbeat to
     #startup so it runs exactly once per realm-lifetime regardless of
     probe count. readinessCheck keeps only entry + fulfillment logs.

  2. publishFullIndex labeled the elapsed time "queue.publish took",
     but startedAt is captured before getRealmOwnerUsername() and the
     clearLastModified write inside enqueueReindexRealmJob. Renamed
     the label to "enqueue path took" so it accurately covers the
     full path from "is starting indexing" to "durably enqueued".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

maybe these new logs should be at a debug level with a common log category, and then we just enable that logger as part of the realm server startup for host tests, so this is purely opt in

Per review: the diagnostic logs shouldn't be `info`-level with
`[ci-readiness-diag]` baked into each message — they should be `debug`
on a common logger category that's opt-in via `LOG_LEVELS`. That makes
them invisible in production by default, and the cleanup story is
"drop `readiness-diag=debug` from `HOST_TEST_LOG_LEVELS`" with the
code-removal as a follow-up.

  - New named logger `readiness-diag` on each of `Realm`,
    `RealmIndexUpdater`, and `IndexRunner` (as `#readinessDiag`).
  - Every prior `this.#log.info('[ci-readiness-diag] ...')` call site
    rewritten as `this.#readinessDiag.debug('...')`. The string prefix
    drops out — the logger name does that job.
  - `HOST_TEST_LOG_LEVELS` in `mise-tasks/test-services/host` now
    includes `readiness-diag=debug`. Production realm-server keeps
    `*=info` so the new logger stays silent there.
  - Renamed in-code marker comments from `[ci-readiness-diag]` to
    `[readiness-diag]` for consistency with the logger name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@habdelra
Copy link
Copy Markdown
Contributor Author

[Claude Code 🤖]

Good call — moved them all to a new readiness-diag logger at debug, enabled via readiness-diag=debug in HOST_TEST_LOG_LEVELS. Production realm-server keeps its *=info default so this is purely opt-in in CI.

Knock-on cleanup: dropped the [ci-readiness-diag] string prefix from every log message (the logger name carries that signal now), and renamed the in-code marker comments to [readiness-diag] for consistency.

Pushed as 43f127e.

Cleanup path is now two-stage:

  1. Drop readiness-diag=debug from HOST_TEST_LOG_LEVELS → logs silent everywhere.
  2. git grep readiness-diag → delete the call sites + the #readinessDiag fields when convenient.

@habdelra habdelra requested a review from a team May 18, 2026 23:14
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

Host Test Results

    1 files      1 suites   1h 43m 56s ⏱️
2 661 tests 2 646 ✅ 15 💤 0 ❌
2 680 runs  2 665 ✅ 15 💤 0 ❌

Results for commit 43f127e.

Realm Server Test Results

    1 files  ±    0      1 suites  +1   8m 10s ⏱️ + 8m 10s
1 408 tests +1 408  1 408 ✅ +1 408  0 💤 ±0  0 ❌ ±0 
1 495 runs  +1 495  1 495 ✅ +1 495  0 💤 ±0  0 ❌ ±0 

Results for commit 43f127e. ± Comparison against earlier commit 9ec47d4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants