Isolate per-file indexing failures instead of aborting the whole realm #5389

Draft
richardhjtan wants to merge 2 commits into main from fix/index-runner-file-error-isolation

Conversation

@richardhjtan

Contributor

Summary

  • IndexRunner.tryToVisit rethrew any non-404 error from a file's fused visit (e.g. a prerender-visit timeout), which unwound the entire fromScratch/incremental visit loop and skipped batch.done().
  • Since batch.done() is what promotes the working table into the live boxel_index, a single file's transport-level failure discarded every other already-successfully-indexed file in the same job — leaving the realm mounted but with an empty or stale search index, and no error surfaced anywhere.
  • Reproduced locally: a large realm's from-scratch reindex got through 652 of 1166 files, hit one stuck prerender-visit request, and the whole job errored out with zero rows ever landing in boxel_index.
  • Fix: catch non-404 failures per file and record a file-error entry for that URL, mirroring the existing error_doc pattern already used for in-band render errors, so the rest of the realm still indexes and commits.
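
The isolation described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the actual IndexRunner code: `SketchBatch`, `indexAll`, the `status` field on the error, and the entry shapes are all hypothetical stand-ins chosen for the example.

```typescript
// Hypothetical, simplified stand-ins; not the real IndexRunner or Batch types.
type Entry =
  | { type: 'file' }
  | { type: 'file-error'; error: { message: string } };

// Assumption: a not-found failure carries a numeric `status` of 404.
type VisitError = Error & { status?: number };

class SketchBatch {
  working = new Map<string, Entry>();
  live = new Map<string, Entry>();

  async updateEntry(url: string, entry: Entry): Promise<void> {
    this.working.set(url, entry);
  }

  // Promote the working rows into the live index.
  async done(): Promise<void> {
    this.live = new Map(this.working);
  }
}

async function indexAll(
  urls: string[],
  visit: (url: string) => Promise<void>,
  batch: SketchBatch,
): Promise<{ fileErrors: number }> {
  let stats = { fileErrors: 0 };
  for (let url of urls) {
    try {
      await visit(url);
      await batch.updateEntry(url, { type: 'file' });
    } catch (err) {
      let status = (err as VisitError | undefined)?.status;
      if (status === 404) {
        continue; // a missing file is skipped, not recorded as an error
      }
      // Record the failure for this URL only and keep going; no rethrow.
      let message = err instanceof Error ? err.message : String(err);
      await batch.updateEntry(url, { type: 'file-error', error: { message } });
      stats.fileErrors++;
    }
  }
  // Reached even when individual files failed, so successful rows are promoted.
  await batch.done();
  return stats;
}
```

The design point is that the try/catch sits inside the loop: a throw from one `visit` call no longer unwinds past the promotion step.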

Test plan

  • tsc --noEmit clean on the changed file
  • Reproduced the original failure locally, applied the fix, and confirmed a subsequent full reindex populated boxel_index correctly
  • No dedicated unit test added yet — the failure mode requires a transport-level prerender abort mid-batch, which isn't easily simulated in the existing IndexRunner test harness; open to suggestions on the least-effort way to cover this
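
One low-effort way to cover this might be to inject the failure at the visit boundary rather than in the prerender transport. The sketch below is only an illustration under that assumption: `failFor` and `visitAll` are hypothetical names, and `visitAll` is a minimal stand-in for the loop under test, not the existing IndexRunner test harness.

```typescript
// Hypothetical test helper; names are illustrative only.
type Visit = (url: string) => Promise<string>;

// Wrap a visit function so one chosen URL rejects like a transport-level timeout.
function failFor(target: string, visit: Visit): Visit {
  return async (url: string) => {
    if (url === target) {
      throw new Error(`simulated prerender timeout for ${url}`);
    }
    return visit(url);
  };
}

// Minimal stand-in for the loop under test: failures are isolated per URL.
async function visitAll(
  urls: string[],
  visit: Visit,
): Promise<{ indexed: string[]; failed: string[] }> {
  let indexed: string[] = [];
  let failed: string[] = [];
  for (let url of urls) {
    try {
      indexed.push(await visit(url));
    } catch {
      failed.push(url);
    }
  }
  return { indexed, failed };
}
```

A regression test built this way would assert that files after the failing one are still visited and that only the failing URL ends up in the error set.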

🤖 Generated with Claude Code

tryToVisit rethrew any non-404 error from a file visit, which unwound
the entire fromScratch/incremental visit loop and skipped batch.done().
Since batch.done() is what promotes the working table into the live
boxel_index, a single file's transport-level failure (e.g. a prerender
timeout) discarded every other already-successfully-indexed file in
the same job, leaving the realm mounted with an empty or stale search
index and no error surfaced.

Catch non-404 errors per file and record a file-error entry for that
URL instead, mirroring the error_doc pattern already used for in-band
render errors, so the rest of the realm still indexes and commits.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ae5858e07


Comment thread packages/runtime-common/index-runner.ts Outdated
Comment on lines +580 to +584
    let entry: FileErrorIndexEntry = {
      type: 'file-error',
      error,
    };
    await this.batch.updateEntry(url, entry);


[P2] Don't leave card instances tombstoned on visit failures

When an incremental reindex hits a transient prerenderVisit failure for an existing .json card, Batch.invalidate() has already inserted tombstones for every production row at that URL, including the instance row. This catch only overwrites the file tombstone with a file-error, so batch.done() promotes the untouched instance tombstone and removes the card from search/instance results instead of preserving the last-known-good instance with an error row like the in-band render-error path does. This affects existing card files whenever the fused visit throws before indexCardWithResult runs.


Contributor Author


[Claude Code 🤖] Confirmed — Batch.invalidate() tombstones every existing type at a URL up front (instance + file for a card), and the file-error write alone left the instance tombstone to be promoted, silently deleting the card from search on a transient error. Fixed in 5e88b86: now also writes an instance-error entry when the failed URL is a card, so the last-known-good instance survives the same way the in-band render-error path preserves it.
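
The tombstone interaction described in this thread can be modeled with a small sketch. It is a simplified illustration under stated assumptions, not the real Batch class: `TombstoneBatch`, `writeError`, `recordVisitFailure`, the row shapes, and the rule that an error row keeps the previous live document are all hypothetical.

```typescript
// Hypothetical model of the behavior discussed above; not the real Batch class.
type RowType = 'file' | 'instance';
type LiveRow = { doc: string; error?: string };
type WorkingRow = { kind: 'tombstone' } | { kind: 'error'; message: string };

const rowKey = (url: string, type: RowType): string => `${type}:${url}`;

class TombstoneBatch {
  private working = new Map<string, WorkingRow>();
  constructor(public live: Map<string, LiveRow>) {}

  // Tombstone every row type that currently exists for the URL.
  invalidate(url: string): void {
    const types: RowType[] = ['file', 'instance'];
    types.forEach((type) => {
      if (this.live.has(rowKey(url, type))) {
        this.working.set(rowKey(url, type), { kind: 'tombstone' });
      }
    });
  }

  writeError(url: string, type: RowType, message: string): void {
    this.working.set(rowKey(url, type), { kind: 'error', message });
  }

  // Promote: a tombstone deletes the live row; an error row keeps the
  // previous document (assumed last-known-good) and attaches the error.
  done(): void {
    this.working.forEach((row, k) => {
      if (row.kind === 'tombstone') {
        this.live.delete(k);
      } else {
        const previous = this.live.get(k);
        this.live.set(k, { doc: previous?.doc ?? '', error: row.message });
      }
    });
  }
}

// Overwrite both tombstones when the failed URL is a card.
function recordVisitFailure(
  batch: TombstoneBatch,
  url: string,
  isCard: boolean,
  message: string,
): void {
  batch.writeError(url, 'file', message);
  if (isCard) {
    batch.writeError(url, 'instance', message);
  }
}
```

In this model, writing only the file error leaves the instance tombstone in place, so `done()` deletes the instance row; writing both error rows keeps the card's previous document with an error attached.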

@richardhjtan richardhjtan marked this pull request as draft July 2, 2026 14:14
@richardhjtan richardhjtan requested a review from Copilot July 2, 2026 14:14

Copilot AI left a comment

Contributor


Pull request overview

This PR makes realm indexing more robust by preventing a single file’s transport-level prerender/visit failure from aborting the entire indexing batch (and therefore preventing batch.done() from promoting the working index into the live boxel_index table).

Changes:

  • Updates IndexRunner.tryToVisit to treat non-404 visit failures as per-file failures by writing a persisted file-error index entry for the URL.
  • Ensures dependency-row cache invalidation and stats.fileErrors accounting happen on this new error path.


Comment on lines +555 to +559
// A transport-level failure (prerender timeout/abort, network error)
// never reaches performCardIndexing/performFileIndexing's own
// error-entry construction — visitFileForIndexingFused rethrows
// before calling indexCardWithResult/indexFileWithResults. Left
// uncaught here, one file's failure propagates out of the

Contributor Author


[Claude Code 🤖] Fair point on the missing regression test for the mid-batch isolation behavior — flagged as open in the PR description since simulating a transport-level prerender abort isn't trivial in the current IndexRunner test harness. Open to suggestions on the lowest-effort way to cover it if you have one.

Batch.invalidate() tombstones every existing type at a URL up front,
including the instance row for a previously-indexed card. The prior
commit's file-error entry only overwrote the file tombstone, so
batch.done() would promote the untouched instance tombstone and
silently remove an existing card from search over a transient error
(e.g. a prerender timeout) instead of preserving it with an error row
the way the in-band render-error path does.

Re-check whether the failed URL is a card and, if so, also write an
instance-error entry alongside the file-error one.
@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

Host Test Results

1 file, 1 suite, duration 2h 19m 57s ⏱️
3 372 tests: 3 357 ✅ passed, 15 💤 skipped, 0 ❌ failed
3 391 runs: 3 376 ✅ passed, 15 💤 skipped, 0 ❌ failed

Results for commit 5e88b86.

Realm Server Test Results

1 file, 1 suite, duration 9m 32s ⏱️
1 674 tests: 1 674 ✅ passed, 0 💤 skipped, 0 ❌ failed
1 753 runs: 1 753 ✅ passed, 0 💤 skipped, 0 ❌ failed

Results for commit 5e88b86.
