feat(mem_wal): FTS search for LSM scanner with Local and global-rescore modes by touch-of-grey · Pull Request #6910 · lance-format/lance

touch-of-grey · 2026-05-22T06:30:38Z

Adds full-text search to LsmScanner, spanning the base table, flushed
memtable generations, and the active/frozen in-memory memtables, with two
BM25 scoring modes from the multi-segment FTS discussion (#6789):

Local — each source scores with its own corpus statistics; the
coordinator unions per-source plans and merges by _score (per-partition
top-k sort + sort-preserving merge). Single-pass, no cross-source coordination.
LocalWithGlobalRescore — each source returns top-K' candidates with
raw BM25 sufficient stats (doc_len, term_freqs); the planner aggregates
per-source (N, sumdl, df_t) into one global MemBM25Scorer, rescores every
candidate with global stats, and selects the global top-k. K' = max(rescore_factor·k, 100).

Components

LsmFtsSearchPlanner + LsmScanner::full_text_search(column, query, k, mode).
lance-index: InvertedIndex::bm25_candidate_search returns candidates with
(row_id, doc_len, term_freqs, local_score) (parallel to bm25_search).
mem_wal FtsMemIndex: search_candidates + bm25_stats_for_terms so the
in-memory index can feed the global scorer.
Rescore handles index-less flushed generations via a flat scan-and-tokenize
fallback (mirrors the fallback Local mode already gets from scanner.full_text_search).

Benchmark

benches/mem_wal/fts/mem_wal_fts_read_bench.rs (+ run_fts_read_sweep.sh),
following the #6882 CLI/JSON template: ShardWriter ingestion of flushed gens +
active memtable, FineWeb text, 200 single-term queries per config under both
modes. EC2 m7gd.4xlarge (ARM, 16 vCPU), local instance-store NVMe vs S3,
2 flushed generations + active, rescore_factor=10.

storage	base	k	Local p50	Local p99	Rescore p50	Rescore p99	Jaccard
NVMe	100k	10	16.9 ms	19.5 ms	31.7 ms	32.5 ms	0.859
NVMe	100k	100	19.3 ms	21.4 ms	36.2 ms	36.9 ms	0.927
NVMe	1M	10	16.6 ms	18.6 ms	37.6 ms	40.8 ms	0.885
NVMe	1M	100	19.3 ms	21.1 ms	44.4 ms	47.1 ms	0.939
S3	100k	10	905 ms	1475 ms	1556 ms	2391 ms	0.858
S3	100k	100	997 ms	1889 ms	1725 ms	2558 ms	0.927
S3	1M	10	855 ms	1631 ms	1396 ms	2085 ms	0.885
S3	1M	100	963 ms	1684 ms	1756 ms	2636 ms	0.939

Takeaways:

Storage dominates. NVMe p50 is ~17–19 ms; the same query on S3 is
~0.9–1.0 s (Local) / ~1.4–1.8 s (Rescore). Per-query inverted-index reads
from S3 are uncached across queries here, so each query pays object-store
round-trips — the headline cost for a cold object-store FTS read tier.
Base size barely matters (100k vs 1M within noise): BM25 over an
inverted index is ~O(matches) with WAND top-k bounding the work, not corpus size.
Rescore costs ~1.7–2.3× over Local (candidate search + global stat
aggregation + a take to materialize winners). The multiplier shrinks on S3
(~1.6–1.8×) because S3 read latency is the common denominator both modes pay.

Jaccard here = mean top-k overlap between the two modes,
|A ∩ B| / |A ∪ B| over the per-query result row-id sets (1.0 = identical
top-k, lower = global rescoring pulled different docs in). It rises with k
(0.86 at k=10 → 0.94 at k=100): rescoring swaps ~1–1.5 docs out of the top-10,
a smaller fraction of a top-100. This quantifies how much global-stats
rescoring actually changes results vs local scoring — the #6789 tradeoff.
It's a membership metric (ignores intra-top-k reorder); a rank/score metric
(NDCG / score correlation) is a useful follow-up.

Draft: opening for design feedback on the rescore planner shape and the
scoring-mode API before polishing.

…idate APIs Adds FTS support to LsmScanner spanning base table, flushed memtable generations, and active/frozen in-memory memtables. Local scoring mode is wired end-to-end (each source uses its own BM25 stats; coordinator unions per-source plans, per-partition top-K sort + sort-preserving merge). LocalWithGlobalRescore returns a clear NotSupported error until the rescore exec lands in a follow-up. Lance-index extensions used by Rescore mode (when wired) are in place: - InvertedIndex::bm25_candidate_search returns top-K' candidates with raw (doc_len, term_freqs[input_order]) and local_score, parallel to bm25_search. - FtsMemIndex::search_candidates returns the same per-doc stats from the in-memory tail + frozen partitions. - FtsMemIndex::bm25_stats_for_terms exports segment-level (N, sumdl, df_t) so the in-memory index can feed a global scorer. Also aligns `_score` nullability on FtsIndexExec with the on-disk FTS schema so Local mode can UNION active + base/flushed without a schema mismatch.

Implements wjones127's rescore proposal (discussion lance-format#6789) on the single-node MemWAL path. The planner synchronously: 1. Tokenizes the query against the first source's FTS tokenizer. 2. Resolves a `SourceHandle` for each LSM source — active memtable keeps its `FtsMemIndex` reference; Lance sources open the column's InvertedIndex once and reuse it. 3. Gathers `(N_i, sumdl_i, df_t_i)` from every source via `bm25_stats_for_terms` and folds them into one global MemBM25Scorer. 4. Runs each source's candidate search with LOCAL pruning (no base_scorer) at `K' = max(rescore_factor * k, MIN_CANDIDATES)`. 5. Rescores every candidate with the global scorer and picks the global top-k. 6. Materializes user columns per source (BatchStore for active, `take_rows` for Lance) and stitches them back into the rescored order via a transient `__lsm_fts_order` column. 7. Returns the pre-materialized batch as a MemorySourceConfig exec. Two new end-to-end tests cover (a) active-only rescore picks the highest-tf doc first and (b) base+active rescore produces identical scores for symmetric hits under the global stats.

New bench `benches/mem_wal/fts/lsm_fts_modes.rs`, sibling of `mem_wal_fineweb_fts.rs`, sharing the same FineWeb loader shape and cache-dir convention. For a configurable LSM shape (balanced / memwal_skewed / growing_lsm) the bench: 1. Loads a HuggingFace FineWeb slice into a base Lance dataset plus several flushed-generation datasets plus one active in-memory memtable, each with its own FTS index. 2. Picks `num_queries` representative single-term queries from the 80–99 percentile band of corpus DF. 3. Runs both FtsScoringMode::Local and FtsScoringMode::LocalWithGlobalRescore through LsmScanner and records per-query latency. 4. Builds a single-merged-index baseline (the same FineWeb rows in one Lance dataset) and runs the same queries against it. 5. Reports mean / p50 / p95 / p99 latency per mode plus top-K Jaccard and `_score` Pearson for both LSM modes against each other and against the baseline. Output is JSON-pretty-printed to stdout and (optionally) written to `--output`. Bench is registered in Cargo.toml next to its sibling.

New CLI bench modeled on the vector / point-lookup read benches: `--phase prepare|search`, `--uri` with local/cloud detection, real FineWeb text payload, ShardWriter ingestion of flushed generations plus an active memtable, and the same JSON output contract. The scoring mode is the panel: each search invocation times the query set under both FtsScoringMode::Local and LocalWithGlobalRescore and reports per-mode p50/p95/p99/mean/qps plus the top-k Jaccard between the two modes. `run_fts_read_sweep.sh` drives the panel across local NVMe and an s3:// prefix for a configurable base-size / top-k matrix, mirrors each result.json to S3, and prints a summary table. Both registered in Cargo.toml next to the existing FineWeb FTS benches.

The memtable flush trigger is byte/batch-count based, not row-count, so the FineWeb text payload (variable row size) never reliably flushed at `max_memtable_rows`. Set `max_memtable_batches` to one generation's worth of batches so the batch store fills exactly at each generation boundary, and drain pending flushes via `wait_for_flush_drain()` before snapshotting the manifest so all flushed generations are visible to the planner.

…teria The rescore path's open_inverted_index used a manual field-id scan of load_indices() that failed to match the maintained FTS index on flushed generation datasets, so rescore errored with "missing an FTS index" even though Local mode (which uses scanner.full_text_search) resolved it fine. Switch to the same load_scalar_index(for_column().supports_fts()) criteria lookup the base-table FTS exec path uses.

…core Flushed memtable generations are written without an on-disk FTS index (the maintained index lives only in the active/frozen memtable), so the rescore path can't assume every Lance source has an InvertedIndex. Add a flat fallback: when open_inverted_index returns None, scan + tokenize the source's text column once to compute corpus stats (total_tokens, num_docs, df) and per-doc query-term frequencies, then score candidates with local stats for top-K' selection — mirroring the flat fallback Local mode gets from scanner.full_text_search. resolve_tokenizer is replaced by resolve_params so the same FTS params drive both query tokenization and the flat scans. Regression test covers indexed base + index-less flushed gen + indexed active in one rescore query.

Flushed memtable generations are written without an FTS index, so both scoring modes would flat-scan them per query — an O(rows*queries) artifact that swamps the scoring-mode signal the bench measures. Build an inverted index on each flushed generation after flush (modeling the realistic post-flush multi-segment FTS state) so Local and Rescore both use the fast indexed path. The index-less flat path remains covered by unit tests.

…ance-format#6901) lance lance-format#6901 makes the memtable flush handler build the shard's maintained secondary indexes on each flushed generation, so the FTS index now exists on every flushed gen without the bench creating it. Remove the manual create_index loop; both scoring modes still use the fast indexed path, and the rescore planner's flat fallback remains for the no-maintained-index case (covered by unit tests).

…ench The lance-format#6882 refactor replaced ad-hoc benches with standalone CLI+JSON benchmarks driven through the real ShardWriter ingestion path. mem_wal_fts_read_bench follows that template (and is what the EC2 sweep ran); lsm_fts_modes was an off-template synthetic-shape bench that built datasets manually. Drop it and its Cargo.toml entry to keep one template-aligned FTS read benchmark.

touch-of-grey · 2026-05-22T06:30:53Z

@jackye1995 @hamersaw — opening this as a draft for design feedback on the LSM FTS scoring-mode API and the rescore planner shape. PTAL when you have a chance.

github-actions · 2026-05-22T06:30:55Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

codecov · 2026-05-22T07:04:43Z

Codecov Report

❌ Patch coverage is 87.83322% with 178 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...st/lance/src/dataset/mem_wal/scanner/fts_search.rs	86.21%	91 Missing and 60 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/builder.rs	0.00%	15 Missing ⚠️
rust/lance-index/src/scalar/inverted/index.rs	95.40%	6 Missing and 2 partials ⚠️
rust/lance/src/dataset/mem_wal/index/fts.rs	97.75%	4 Missing ⚠️

📢 Thoughts on this report? Let us know!

touch-of-grey added 10 commits May 21, 2026 17:13

github-actions Bot added the enhancement New feature or request label May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mem_wal): FTS search for LSM scanner with Local and global-rescore modes#6910

feat(mem_wal): FTS search for LSM scanner with Local and global-rescore modes#6910
touch-of-grey wants to merge 10 commits into
lance-format:mainfrom
touch-of-grey:FTSRead

touch-of-grey commented May 22, 2026

Uh oh!

touch-of-grey commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

codecov Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

touch-of-grey commented May 22, 2026

Components

Benchmark

Uh oh!

touch-of-grey commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

codecov Bot commented May 22, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant