Skip to content

feat(mem_wal): FTS search for LSM scanner with Local and global-rescore modes#6910

Draft
touch-of-grey wants to merge 10 commits into
lance-format:mainfrom
touch-of-grey:FTSRead
Draft

feat(mem_wal): FTS search for LSM scanner with Local and global-rescore modes#6910
touch-of-grey wants to merge 10 commits into
lance-format:mainfrom
touch-of-grey:FTSRead

Conversation

@touch-of-grey
Copy link
Copy Markdown
Contributor

Adds full-text search to LsmScanner, spanning the base table, flushed
memtable generations, and the active/frozen in-memory memtables, with two
BM25 scoring modes from the multi-segment FTS discussion (#6789):

  • Local — each source scores with its own corpus statistics; the
    coordinator unions per-source plans and merges by _score (per-partition
    top-k sort + sort-preserving merge). Single-pass, no cross-source coordination.
  • LocalWithGlobalRescore — each source returns top-K' candidates with
    raw BM25 sufficient stats (doc_len, term_freqs); the planner aggregates
    per-source (N, sumdl, df_t) into one global MemBM25Scorer, rescores every
    candidate with global stats, and selects the global top-k. K' = max(rescore_factor·k, 100).

Components

  • LsmFtsSearchPlanner + LsmScanner::full_text_search(column, query, k, mode).
  • lance-index: InvertedIndex::bm25_candidate_search returns candidates with
    (row_id, doc_len, term_freqs, local_score) (parallel to bm25_search).
  • mem_wal FtsMemIndex: search_candidates + bm25_stats_for_terms so the
    in-memory index can feed the global scorer.
  • Rescore handles index-less flushed generations via a flat scan-and-tokenize
    fallback (mirrors the fallback Local mode already gets from scanner.full_text_search).

Benchmark

benches/mem_wal/fts/mem_wal_fts_read_bench.rs (+ run_fts_read_sweep.sh),
following the #6882 CLI/JSON template: ShardWriter ingestion of flushed gens +
active memtable, FineWeb text, 200 single-term queries per config under both
modes. EC2 m7gd.4xlarge (ARM, 16 vCPU), local instance-store NVMe vs S3,
2 flushed generations + active, rescore_factor=10.

storage base k Local p50 Local p99 Rescore p50 Rescore p99 Jaccard
NVMe 100k 10 16.9 ms 19.5 ms 31.7 ms 32.5 ms 0.859
NVMe 100k 100 19.3 ms 21.4 ms 36.2 ms 36.9 ms 0.927
NVMe 1M 10 16.6 ms 18.6 ms 37.6 ms 40.8 ms 0.885
NVMe 1M 100 19.3 ms 21.1 ms 44.4 ms 47.1 ms 0.939
S3 100k 10 905 ms 1475 ms 1556 ms 2391 ms 0.858
S3 100k 100 997 ms 1889 ms 1725 ms 2558 ms 0.927
S3 1M 10 855 ms 1631 ms 1396 ms 2085 ms 0.885
S3 1M 100 963 ms 1684 ms 1756 ms 2636 ms 0.939

Takeaways:

  • Storage dominates. NVMe p50 is ~17–19 ms; the same query on S3 is
    ~0.9–1.0 s (Local) / ~1.4–1.8 s (Rescore). Per-query inverted-index reads
    from S3 are uncached across queries here, so each query pays object-store
    round-trips — the headline cost for a cold object-store FTS read tier.
  • Base size barely matters (100k vs 1M within noise): BM25 over an
    inverted index is ~O(matches) with WAND top-k bounding the work, not corpus size.
  • Rescore costs ~1.7–2.3× over Local (candidate search + global stat
    aggregation + a take to materialize winners). The multiplier shrinks on S3
    (~1.6–1.8×) because S3 read latency is the common denominator both modes pay.

Jaccard here = mean top-k overlap between the two modes,
|A ∩ B| / |A ∪ B| over the per-query result row-id sets (1.0 = identical
top-k, lower = global rescoring pulled different docs in). It rises with k
(0.86 at k=10 → 0.94 at k=100): rescoring swaps ~1–1.5 docs out of the top-10,
a smaller fraction of a top-100. This quantifies how much global-stats
rescoring actually changes results vs local scoring — the #6789 tradeoff.
It's a membership metric (ignores intra-top-k reorder); a rank/score metric
(NDCG / score correlation) is a useful follow-up.

Draft: opening for design feedback on the rescore planner shape and the
scoring-mode API before polishing.

…idate APIs

Adds FTS support to LsmScanner spanning base table, flushed memtable
generations, and active/frozen in-memory memtables. Local scoring mode
is wired end-to-end (each source uses its own BM25 stats; coordinator
unions per-source plans, per-partition top-K sort + sort-preserving
merge). LocalWithGlobalRescore returns a clear NotSupported error until
the rescore exec lands in a follow-up.

Lance-index extensions used by Rescore mode (when wired) are in place:

- InvertedIndex::bm25_candidate_search returns top-K' candidates with
  raw (doc_len, term_freqs[input_order]) and local_score, parallel to
  bm25_search.
- FtsMemIndex::search_candidates returns the same per-doc stats from
  the in-memory tail + frozen partitions.
- FtsMemIndex::bm25_stats_for_terms exports segment-level (N, sumdl,
  df_t) so the in-memory index can feed a global scorer.

Also aligns `_score` nullability on FtsIndexExec with the on-disk FTS
schema so Local mode can UNION active + base/flushed without a schema
mismatch.
Implements wjones127's rescore proposal (discussion lance-format#6789) on the
single-node MemWAL path. The planner synchronously:

  1. Tokenizes the query against the first source's FTS tokenizer.
  2. Resolves a `SourceHandle` for each LSM source — active memtable
     keeps its `FtsMemIndex` reference; Lance sources open the column's
     InvertedIndex once and reuse it.
  3. Gathers `(N_i, sumdl_i, df_t_i)` from every source via
     `bm25_stats_for_terms` and folds them into one global MemBM25Scorer.
  4. Runs each source's candidate search with LOCAL pruning (no
     base_scorer) at `K' = max(rescore_factor * k, MIN_CANDIDATES)`.
  5. Rescores every candidate with the global scorer and picks the
     global top-k.
  6. Materializes user columns per source (BatchStore for active,
     `take_rows` for Lance) and stitches them back into the rescored
     order via a transient `__lsm_fts_order` column.
  7. Returns the pre-materialized batch as a MemorySourceConfig exec.

Two new end-to-end tests cover (a) active-only rescore picks the
highest-tf doc first and (b) base+active rescore produces identical
scores for symmetric hits under the global stats.
New bench `benches/mem_wal/fts/lsm_fts_modes.rs`, sibling of
`mem_wal_fineweb_fts.rs`, sharing the same FineWeb loader shape and
cache-dir convention.

For a configurable LSM shape (balanced / memwal_skewed / growing_lsm)
the bench:

  1. Loads a HuggingFace FineWeb slice into a base Lance dataset plus
     several flushed-generation datasets plus one active in-memory
     memtable, each with its own FTS index.
  2. Picks `num_queries` representative single-term queries from the
     80–99 percentile band of corpus DF.
  3. Runs both FtsScoringMode::Local and
     FtsScoringMode::LocalWithGlobalRescore through LsmScanner and
     records per-query latency.
  4. Builds a single-merged-index baseline (the same FineWeb rows in
     one Lance dataset) and runs the same queries against it.
  5. Reports mean / p50 / p95 / p99 latency per mode plus top-K
     Jaccard and `_score` Pearson for both LSM modes against each
     other and against the baseline.

Output is JSON-pretty-printed to stdout and (optionally) written to
`--output`. Bench is registered in Cargo.toml next to its sibling.
New CLI bench modeled on the vector / point-lookup read benches:
`--phase prepare|search`, `--uri` with local/cloud detection, real
FineWeb text payload, ShardWriter ingestion of flushed generations plus
an active memtable, and the same JSON output contract.

The scoring mode is the panel: each search invocation times the query
set under both FtsScoringMode::Local and LocalWithGlobalRescore and
reports per-mode p50/p95/p99/mean/qps plus the top-k Jaccard between the
two modes.

`run_fts_read_sweep.sh` drives the panel across local NVMe and an
s3:// prefix for a configurable base-size / top-k matrix, mirrors each
result.json to S3, and prints a summary table. Both registered in
Cargo.toml next to the existing FineWeb FTS benches.
The memtable flush trigger is byte/batch-count based, not row-count, so
the FineWeb text payload (variable row size) never reliably flushed at
`max_memtable_rows`. Set `max_memtable_batches` to one generation's worth
of batches so the batch store fills exactly at each generation boundary,
and drain pending flushes via `wait_for_flush_drain()` before snapshotting
the manifest so all flushed generations are visible to the planner.
…teria

The rescore path's open_inverted_index used a manual field-id scan of
load_indices() that failed to match the maintained FTS index on flushed
generation datasets, so rescore errored with "missing an FTS index" even
though Local mode (which uses scanner.full_text_search) resolved it fine.
Switch to the same load_scalar_index(for_column().supports_fts())
criteria lookup the base-table FTS exec path uses.
…core

Flushed memtable generations are written without an on-disk FTS index
(the maintained index lives only in the active/frozen memtable), so the
rescore path can't assume every Lance source has an InvertedIndex. Add a
flat fallback: when open_inverted_index returns None, scan + tokenize the
source's text column once to compute corpus stats (total_tokens, num_docs,
df) and per-doc query-term frequencies, then score candidates with local
stats for top-K' selection — mirroring the flat fallback Local mode gets
from scanner.full_text_search. resolve_tokenizer is replaced by
resolve_params so the same FTS params drive both query tokenization and
the flat scans. Regression test covers indexed base + index-less flushed
gen + indexed active in one rescore query.
Flushed memtable generations are written without an FTS index, so both
scoring modes would flat-scan them per query — an O(rows*queries)
artifact that swamps the scoring-mode signal the bench measures. Build an
inverted index on each flushed generation after flush (modeling the
realistic post-flush multi-segment FTS state) so Local and Rescore both
use the fast indexed path. The index-less flat path remains covered by
unit tests.
…ance-format#6901)

lance lance-format#6901 makes the memtable flush handler build the shard's maintained
secondary indexes on each flushed generation, so the FTS index now exists
on every flushed gen without the bench creating it. Remove the manual
create_index loop; both scoring modes still use the fast indexed path, and
the rescore planner's flat fallback remains for the no-maintained-index
case (covered by unit tests).
…ench

The lance-format#6882 refactor replaced ad-hoc benches with standalone CLI+JSON
benchmarks driven through the real ShardWriter ingestion path.
mem_wal_fts_read_bench follows that template (and is what the EC2 sweep
ran); lsm_fts_modes was an off-template synthetic-shape bench that built
datasets manually. Drop it and its Cargo.toml entry to keep one
template-aligned FTS read benchmark.
@github-actions github-actions Bot added the enhancement New feature or request label May 22, 2026
@touch-of-grey
Copy link
Copy Markdown
Contributor Author

@jackye1995 @hamersaw — opening this as a draft for design feedback on the LSM FTS scoring-mode API and the rescore planner shape. PTAL when you have a chance.

@github-actions
Copy link
Copy Markdown
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant