Skip to content

fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899

Open
hamersaw wants to merge 12 commits into
lance-format:mainfrom
hamersaw:bug/wal-stale-row-on-update
Open

fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899
hamersaw wants to merge 12 commits into
lance-format:mainfrom
hamersaw:bug/wal-stale-row-on-update

Conversation

@hamersaw
Copy link
Copy Markdown
Contributor

@hamersaw hamersaw commented May 21, 2026

What

Fixes a stale read in LSM MemWAL vector search: when a primary key is updated and its fresh row falls out of its own source's top-k, the superseded copy from an older generation could win the cross-source dedup and be returned.

Background

LsmGlobalPkDedupExec (introduced in #6881) is exact only over the candidates each source surfaces. If a PK's fresh version is pushed out of its source's top-k by closer rows, the dedup never sees it and cannot suppress the stale copy from an older generation — so the stale row is served. Repro: test_vector_search_stale_read_when_fresh_falls_out_of_top_k.

Approach

Make staleness a per-source filter, in two parts:

1. Block-list production (mem_wal-side).

  • GenPkIndex — per-generation pk_hash → rowids (keyed on the shared compute_pk_hash), with membership, within-generation supersession (write-polarity aware), and cross-generation supersession.
  • compute_block_lists / compute_source_block_lists — assemble a per-source block bitmap (RowAddrMask, keyed by _rowid) from the real LSM sources: active/frozen via BatchStore positions, flushed via disk scan, base blocked cross-generation only.

2. Execution (post-filter).

  • Each per-source KNN over-fetches (STALE_OVERFETCH_FACTOR) and BlockListFilterExec drops _rowid-blocked (superseded) rows before the cross-source union, so a stale row never reaches the merge. The existing global dedup still resolves within-source duplicates.

Scope / caveats

  • Post-filter, not a true prefilter. It relies on over-fetch to backfill dropped rows and does not guarantee k live results in the adversarial case (more superseded rows near the query than the over-fetch covers). A TODO at the filter site tracks promoting the same mask into the KNN as a true prefilter (the index traverses until k rows pass), which removes the over-fetch.
  • Within-base duplicate dedup is intentionally not addressed (base is blocked cross-generation only).
  • Flushed membership is built via a per-query scan today; a build-at-flush + cache optimization is a follow-up.

Tests

test_vector_search_stale_read_when_fresh_falls_out_of_top_k now passes; the other vector-search tests are unchanged. Unit tests cover GenPkIndex membership/supersession, the per-source assembly (in-memory + base disk scan), and BlockListFilterExec. cargo fmt + cargo clippy -p lance --lib --tests -- -D warnings clean.

🤖 Generated with Claude Code

…e-read fix

Foundation for fixing the LSM vector-search stale read where an updated PK's
fresh row falls out of its own source's top-k, letting the superseded copy
from an older generation win (the per-source top-k -> global-dedup gap that
replaced the bloom-based FilterStaleExec in lance-format#6881).

Adds the mem_wal-side machinery that *produces* the per-source block-lists,
keyed on the same compute_pk_hash the dedup nodes use:

- GenPkIndex: per-generation pk-hash -> rowaddrs, with membership,
  within-generation supersession (write-polarity aware), and cross-generation
  supersession (superseded_by NEWER(G)).
- compute_block_lists: folds generations newest-first into per-gen block trees
  plus the membership union base uses.
- compute_source_block_lists: assembles a per-source RowAddrMask bitmap from
  the real LSM sources (active/frozen via BatchStore positions, flushed via
  disk scan of pk + _rowaddr, base blocked cross-generation only).

Scope: this PR only builds the bitmaps; how a mask drives the actual KNN
search (a Scanner pre-filter API) is a deliberate follow-up, so the new
functions are gated dead_code until wired and there is no behavior change yet.
The failing spec test lands with that follow-up.

Unit-tested: membership, within/cross-gen supersession, in-memory assembly,
and the base disk-scan path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 21, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

…ost-filter

A primary-key update whose fresh row falls out of its own source's top-k let
the superseded copy from an older generation win the global dedup, returning a
stale row (the gap LsmGlobalPkDedupExec cannot close, exposed by
test_vector_search_stale_read_when_fresh_falls_out_of_top_k).

Each per-source KNN now over-fetches and drops rows superseded by a newer
generation (matched by _rowid) via BlockListFilterExec, before the cross-source
union, so a stale row never reaches the merge. The existing global dedup still
resolves within-source duplicates. Block-list bitmaps come from the block-list
machinery, re-keyed from _rowaddr to _rowid (what fast_search emits; equal to
_rowaddr unless stable row ids are enabled).

This is a post-filter that relies on over-fetch (STALE_OVERFETCH_FACTOR) to
backfill dropped rows; it does not guarantee k live results in the adversarial
case. A TODO at the filter site tracks promoting the same mask to a true KNN
prefilter (which traverses until k rows pass, removing the over-fetch).

Turns the failing spec test green; existing vector-search tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hamersaw hamersaw changed the title feat(mem_wal): block-list membership + bitmaps for vector-search stale-read fix fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter May 21, 2026
@github-actions github-actions Bot added the bug Something isn't working label May 21, 2026
hamersaw and others added 10 commits May 21, 2026 14:33
…tch default 2.5

Default STALE_OVERFETCH_FACTOR to 2.5 (was 4) and round up at the call site
(ceil(k * factor)), so a filtered source still over-fetches at least k+1.

Add UnderfillFilterWarnExec: a pass-through node placed at the top of the plan
only when a per-source block-list was applied. It logs a tracing warning if the
query returns fewer than k rows — the signature of an under-fetch, where the
over-fetch did not leave enough live candidates after dropping superseded rows.
Gated on filtering having happened so a genuinely small unfiltered result does
not warn; the real fix remains a true KNN prefilter (tracked at the filter site).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…block-list

compute_source_block_lists rebuilt each flushed generation's GenPkIndex on every
vector query via a full PK-column scan. Cache it on FlushedMemTableCache (a
parallel moka cache keyed by the same immutable flushed path) so the index is
built once per generation, single-flight, and reused as an Arc::clone on warm
queries; pruned alongside the dataset cache by retain_paths at compaction.

Cold miss still builds lazily by scanning. Build-at-flush is deliberately not
wired: the flush write path has no handle to the reader-side cache, so seeding
it would mean threading a read cache through the writer for marginal gain (it
only saves the first cold scan). MVCC watermark pinning is likewise deferred:
membership is built at plan time (a subset of execution-time candidates), so it
cannot over-suppress a not-yet-visible row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…K stale reads

Adds three vector-search dedup tests:
- over-fetch backfill: a source whose entire top-k is stale still returns k live
  results from its next-nearest rows (no under-fill).
- cross-flushed suppression: an older flushed generation's stale row is blocked
  by a newer flushed generation (no base/active involved).
- composite primary key: the block-list keys on the full (id1,id2) PK, so an
  updated base row is suppressed end-to-end.

Also Box::pin the now-larger plan_search sub-futures (block-list build and the
per-source KNN build) and the flushed PK-index build future to stay under
clippy::large_futures at every await site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (dangerous direction)

Our existing within-active dedup test keeps the newer copy only because it is
also the closer one, so it passes even with broken newest-wins. This ports the
lance-format#6844 spec: the re-inserted copy is FARTHER than the stale one (which sits on
the query), so a correct result requires real newest-wins. Passes on the
block-list post-filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che method

get_or_build_pk_index is public (on FlushedMemTableCache) but GenPkIndex lives in
a private module, so the [`GenPkIndex`] intra-doc link failed the rustdoc CI job
under -D warnings (rustdoc::private-intra-doc-links). Keep it a plain code span.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-generation row-address block-list (RowAddrMask +
BlockListFilterExec) with PK-hash membership filtering (PkHashFilterExec)
for every source, base included.

- GenPkIndex is now a generation's set of PK hashes; compute_blocked_sets
  yields each generation's NEWER(G) set plus the base table's full-union
  set. No row addresses are tracked.
- Removes the per-query full base PK scan: base is filtered by hashing
  only its KNN candidates against the membership set, like every other
  source, instead of scanning the whole base table.
- Within-generation duplicates are left to the global dedup's
  (generation, freshness) tiebreaker. The address-based within-gen block
  is dropped (no test exercised it). Trade-off: within-gen top-k eviction
  is no longer pre-filtered -- the same bug class we fix cross-source,
  to be closed later by a true KNN prefilter or flush-time dedup.
- Fold the under-fetch warning into PkHashFilterExec as a per-source
  check (the over-fetch is per-source) and delete UnderfillFilterWarnExec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… newtype

Now that a generation's membership is just a set of PK hashes, the
separate module and newtype no longer earn their keep.

- Merge gen_pk_index.rs into block_list.rs. GenPkIndex (a thin wrapper
  over HashSet<u64>) is replaced by free functions over HashSet<u64>:
  pk_hashes_from_batches / pk_hashes_from_batch_store and the private
  compute_blocked_sets.
- FlushedMemTableCache caches Arc<HashSet<u64>> directly
  (get_or_build_pk_hashes), dropping the newtype from the cache type.
- Tighten doc and inline comments across the vector-search block-list
  path (block_list, pk_hash_filter, vector_search, flushed_cache).

No behavior change. 109 scanner tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ry union

`compute_source_block_lists` previously merged every newer generation's
hashes into a fresh `NEWER(G)` set per source, per query — copying a large
flushed gen's PKs on every search. Instead pass `PkHashFilterExec` a
`Vec<Arc<HashSet<u64>>>` of the newer generations' membership sets and have
it block a candidate whose PK hash is in any of them.

- Each generation's membership is a shared `Arc<HashSet<u64>>` referenced,
  never merged; flushed gens reuse their cached set instead of copying it.
- Removes `compute_blocked_sets`; the per-source list is the sets
  accumulated newest-first.

No behavior change. 109 scanner tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a writer-independent way to ask whether primary keys have been
(re)written in the WAL fresh tier — the active + frozen memtables and
flushed generations a scanner spans — i.e. are shadowed above the base
table. Built like any query: construct the `LsmScanner`, then call
`contains_pks`.

- `LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>`: hashes each input
  PK row with the same `compute_pk_hash` the dedup uses and tests it
  against the per-generation membership sets, so callers never hash PKs
  themselves; composite PKs work by including all PK columns.
- Internal `block_list::fresh_tier_block_list` returns those membership
  sets as `Vec<Arc<HashSet<u64>>>` (the shape the vector-search filter
  already consumes); flushed sets are reused from the FlushedMemTableCache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review findings.

1. (correctness) Generations are per-shard, but compute_source_block_lists
   keyed the block-list by generation alone and accumulated membership
   across all shards. With two shards each at the same generation, the
   adjacent same-gen sources cross-blocked: a source ran through
   PkHashFilterExec against its own (or another shard's) membership, so an
   entire shard's memtable/flushed generation was silently filtered out of
   a multi-shard vector search. Now keyed by (shard_id, generation) and
   accumulated per shard, so a source is blocked only by strictly-newer
   generations of its own shard and never by itself. Planner looks up by
   (shard_id, generation). New regression test block_lists_are_keyed_per_shard.

2. (memory) The flushed PK scan folds the projected stream into the hash
   set one batch at a time (scan_pk_hashes) instead of try_collect into a
   Vec<RecordBatch>, so only one PK batch is resident at a time.

3. (test) The stale-read regression test now also asserts the nearest live
   neighbor (pk=2) is the returned top-1, not just that the stale read is
   absent.

112 scanner tests green; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant