feat(mem_wal): FTS search for LSM scanner with Local and global-rescore modes #6910
Draft
touch-of-grey wants to merge 10 commits into
…idate APIs
Adds FTS support to LsmScanner spanning base table, flushed memtable generations, and active/frozen in-memory memtables.
Local scoring mode is wired end-to-end (each source uses its own BM25 stats; coordinator unions per-source plans, per-partition top-K sort + sort-preserving merge). LocalWithGlobalRescore returns a clear NotSupported error until the rescore exec lands in a follow-up.
Lance-index extensions used by Rescore mode (when wired) are in place:
- InvertedIndex::bm25_candidate_search returns top-K' candidates with raw (doc_len, term_freqs[input_order]) and local_score, parallel to bm25_search.
- FtsMemIndex::search_candidates returns the same per-doc stats from the in-memory tail + frozen partitions.
- FtsMemIndex::bm25_stats_for_terms exports segment-level (N, sumdl, df_t) so the in-memory index can feed a global scorer.
Also aligns `_score` nullability on FtsIndexExec with the on-disk FTS schema so Local mode can UNION active + base/flushed without a schema mismatch.
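The Local-mode merge described in this commit can be pictured with a small standalone sketch. It assumes each source has already produced its own top-k as `(row_id, score)` pairs sorted by descending score; `merge_top_k` is a hypothetical helper written for illustration and is not the DataFusion sort-preserving merge the planner actually uses.

```rust
/// Merge per-source result lists, each already sorted by descending score,
/// into one global top-k. Scores are compared as-is, which is what Local
/// mode does: every source scored with its own corpus statistics.
/// (Hypothetical helper, not part of the PR.)
pub fn merge_top_k(sources: &[Vec<(u64, f32)>], k: usize) -> Vec<(u64, f32)> {
    // One read cursor per source.
    let mut cursors = vec![0usize; sources.len()];
    let mut out = Vec::with_capacity(k);

    while out.len() < k {
        // Find the source whose current head has the highest score.
        let mut best: Option<usize> = None;
        for (i, src) in sources.iter().enumerate() {
            if let Some(head) = src.get(cursors[i]) {
                let better = match best {
                    None => true,
                    Some(b) => head.1 > sources[b][cursors[b]].1,
                };
                if better {
                    best = Some(i);
                }
            }
        }
        match best {
            Some(i) => {
                out.push(sources[i][cursors[i]]);
                cursors[i] += 1;
            }
            // Every source is exhausted before k rows were found.
            None => break,
        }
    }
    out
}
```

Because each input is already sorted, the merge only ever inspects the head of each source, so no global re-sort is needed.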
Implements wjones127's rescore proposal (discussion lance-format#6789) on the single-node MemWAL path. The planner synchronously:
1. Tokenizes the query against the first source's FTS tokenizer.
2. Resolves a `SourceHandle` for each LSM source: the active memtable keeps its `FtsMemIndex` reference; Lance sources open the column's InvertedIndex once and reuse it.
3. Gathers `(N_i, sumdl_i, df_t_i)` from every source via `bm25_stats_for_terms` and folds them into one global MemBM25Scorer.
4. Runs each source's candidate search with LOCAL pruning (no base_scorer) at `K' = max(rescore_factor * k, MIN_CANDIDATES)`.
5. Rescores every candidate with the global scorer and picks the global top-k.
6. Materializes user columns per source (BatchStore for active, `take_rows` for Lance) and stitches them back into the rescored order via a transient `__lsm_fts_order` column.
7. Returns the pre-materialized batch as a MemorySourceConfig exec.
Two new end-to-end tests cover (a) active-only rescore picks the highest-tf doc first and (b) base+active rescore produces identical scores for symmetric hits under the global stats.
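Steps 3 to 5 above can be sketched in isolation. The types below (`SourceStats`, `Candidate`, `GlobalScorer`) are hypothetical stand-ins rather than the PR's `MemBM25Scorer`, and the constants `K1 = 1.2`, `B = 0.75` and the IDF form are the common textbook BM25 choices, assumed here for illustration; Lance's actual scorer may differ.

```rust
// Textbook BM25 parameters (assumption, not taken from the PR).
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Per-source stats (N_i, sumdl_i, df_t_i); doc_freqs is in query-term order.
pub struct SourceStats {
    pub num_docs: u64,
    pub total_tokens: u64,
    pub doc_freqs: Vec<u64>,
}

/// Raw sufficient stats returned with each candidate.
pub struct Candidate {
    pub row_id: u64,
    pub doc_len: u32,
    pub term_freqs: Vec<u32>,
}

pub struct GlobalScorer {
    num_docs: u64,
    avgdl: f32,
    doc_freqs: Vec<u64>,
}

impl GlobalScorer {
    /// Fold every source's stats into one global view (step 3).
    pub fn from_sources(sources: &[SourceStats]) -> Self {
        let num_terms = sources.first().map_or(0, |s| s.doc_freqs.len());
        let (mut num_docs, mut total_tokens) = (0u64, 0u64);
        let mut doc_freqs = vec![0u64; num_terms];
        for s in sources {
            num_docs += s.num_docs;
            total_tokens += s.total_tokens;
            for (acc, df) in doc_freqs.iter_mut().zip(&s.doc_freqs) {
                *acc += df;
            }
        }
        let avgdl = if num_docs == 0 { 0.0 } else { total_tokens as f32 / num_docs as f32 };
        Self { num_docs, avgdl, doc_freqs }
    }

    fn idf(&self, t: usize) -> f32 {
        let (n, df) = (self.num_docs as f32, self.doc_freqs[t] as f32);
        ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
    }

    /// BM25 score of one candidate under the global stats (step 5).
    pub fn score(&self, c: &Candidate) -> f32 {
        let norm = if self.avgdl > 0.0 { c.doc_len as f32 / self.avgdl } else { 1.0 };
        c.term_freqs
            .iter()
            .enumerate()
            .map(|(t, &tf)| {
                let tf = tf as f32;
                self.idf(t) * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * norm))
            })
            .sum()
    }
}

/// K' = max(rescore_factor * k, min_candidates) (step 4).
pub fn candidate_limit(k: usize, rescore_factor: usize, min_candidates: usize) -> usize {
    (rescore_factor * k).max(min_candidates)
}

/// Rescore all candidates and keep the global top-k as (row_id, score).
pub fn rescore_top_k(scorer: &GlobalScorer, cands: Vec<Candidate>, k: usize) -> Vec<(u64, f32)> {
    let mut scored: Vec<(u64, f32)> = cands.iter().map(|c| (c.row_id, scorer.score(c))).collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Since a candidate's score depends only on its own `(doc_len, term_freqs)` and the folded global stats, two hits with the same raw stats score identically no matter which source produced them, which is the property test (b) checks.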
New bench `benches/mem_wal/fts/lsm_fts_modes.rs`, sibling of
`mem_wal_fineweb_fts.rs`, sharing the same FineWeb loader shape and
cache-dir convention.
For a configurable LSM shape (balanced / memwal_skewed / growing_lsm)
the bench:
1. Loads a HuggingFace FineWeb slice into a base Lance dataset plus
several flushed-generation datasets plus one active in-memory
memtable, each with its own FTS index.
2. Picks `num_queries` representative single-term queries from the
80–99 percentile band of corpus DF.
3. Runs both FtsScoringMode::Local and
FtsScoringMode::LocalWithGlobalRescore through LsmScanner and
records per-query latency.
4. Builds a single-merged-index baseline (the same FineWeb rows in
one Lance dataset) and runs the same queries against it.
5. Reports mean / p50 / p95 / p99 latency per mode plus top-K
Jaccard and `_score` Pearson for both LSM modes against each
other and against the baseline.
Output is JSON-pretty-printed to stdout and (optionally) written to
`--output`. Bench is registered in Cargo.toml next to its sibling.
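The latency and score-agreement metrics listed in step 5 are simple to state precisely. The sketch below uses the nearest-rank percentile definition and the standard Pearson correlation; `LatencySummary`, `summarize`, and `pearson` are hypothetical names, and the bench itself may compute percentiles differently.

```rust
/// Latency summary in milliseconds (hypothetical shape, not the bench's JSON schema).
#[derive(Debug, PartialEq)]
pub struct LatencySummary {
    pub mean: f64,
    pub p50: f64,
    pub p95: f64,
    pub p99: f64,
}

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// Integer arithmetic avoids floating-point rounding in the rank.
fn percentile(sorted: &[f64], pct: usize) -> f64 {
    let rank = ((pct * sorted.len() + 99) / 100).max(1);
    sorted[rank - 1]
}

/// Returns None when there are no samples.
pub fn summarize(latencies_ms: &[f64]) -> Option<LatencySummary> {
    if latencies_ms.is_empty() {
        return None;
    }
    let mut sorted = latencies_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mean = sorted.iter().sum::<f64>() / sorted.len() as f64;
    Some(LatencySummary {
        mean,
        p50: percentile(&sorted, 50),
        p95: percentile(&sorted, 95),
        p99: percentile(&sorted, 99),
    })
}

/// Pearson correlation of two equally long score vectors.
/// Returns None for mismatched lengths, fewer than two points,
/// or zero variance in either input.
pub fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mx, b - my);
        cov += dx * dy;
        vx += dx * dx;
        vy += dy * dy;
    }
    if vx == 0.0 || vy == 0.0 {
        return None;
    }
    Some(cov / (vx.sqrt() * vy.sqrt()))
}
```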
New CLI bench modeled on the vector / point-lookup read benches: `--phase prepare|search`, `--uri` with local/cloud detection, real FineWeb text payload, ShardWriter ingestion of flushed generations plus an active memtable, and the same JSON output contract. The scoring mode is the panel: each search invocation times the query set under both FtsScoringMode::Local and LocalWithGlobalRescore and reports per-mode p50/p95/p99/mean/qps plus the top-k Jaccard between the two modes. `run_fts_read_sweep.sh` drives the panel across local NVMe and an s3:// prefix for a configurable base-size / top-k matrix, mirrors each result.json to S3, and prints a summary table. Both registered in Cargo.toml next to the existing FineWeb FTS benches.
The memtable flush trigger is byte/batch-count based, not row-count, so the FineWeb text payload (variable row size) never reliably flushed at `max_memtable_rows`. Set `max_memtable_batches` to one generation's worth of batches so the batch store fills exactly at each generation boundary, and drain pending flushes via `wait_for_flush_drain()` before snapshotting the manifest so all flushed generations are visible to the planner.
…teria The rescore path's open_inverted_index used a manual field-id scan of load_indices() that failed to match the maintained FTS index on flushed generation datasets, so rescore errored with "missing an FTS index" even though Local mode (which uses scanner.full_text_search) resolved it fine. Switch to the same load_scalar_index(for_column().supports_fts()) criteria lookup the base-table FTS exec path uses.
…core Flushed memtable generations are written without an on-disk FTS index (the maintained index lives only in the active/frozen memtable), so the rescore path can't assume every Lance source has an InvertedIndex. Add a flat fallback: when open_inverted_index returns None, scan + tokenize the source's text column once to compute corpus stats (total_tokens, num_docs, df) and per-doc query-term frequencies, then score candidates with local stats for top-K' selection — mirroring the flat fallback Local mode gets from scanner.full_text_search. resolve_tokenizer is replaced by resolve_params so the same FTS params drive both query tokenization and the flat scans. Regression test covers indexed base + index-less flushed gen + indexed active in one rescore query.
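A minimal sketch of what such a flat pass computes, assuming a plain lowercase alphanumeric tokenizer and already-tokenized query terms. `FlatStats`, `tokenize`, and `flat_scan` are hypothetical names; the real fallback tokenizes with the index's FTS params rather than this simplified splitter.

```rust
/// Output of one flat pass over a source's text column (hypothetical shape).
pub struct FlatStats {
    pub num_docs: u64,
    pub total_tokens: u64,
    /// Document frequency per query term, in query order.
    pub doc_freqs: Vec<u64>,
    /// Per document: (doc_len, term frequency per query term).
    pub per_doc: Vec<(u32, Vec<u32>)>,
}

/// Simplified tokenizer: split on non-alphanumerics and lowercase.
/// (Assumption for illustration; not the configured FTS tokenizer.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Scan every document once, collecting corpus stats and per-doc
/// query-term frequencies. `query_terms` must already be lowercased tokens.
pub fn flat_scan(docs: &[&str], query_terms: &[&str]) -> FlatStats {
    let mut stats = FlatStats {
        num_docs: 0,
        total_tokens: 0,
        doc_freqs: vec![0; query_terms.len()],
        per_doc: Vec::with_capacity(docs.len()),
    };
    for doc in docs {
        let tokens = tokenize(doc);
        let mut tfs = vec![0u32; query_terms.len()];
        for tok in &tokens {
            for (i, term) in query_terms.iter().enumerate() {
                if tok == term {
                    tfs[i] += 1;
                }
            }
        }
        // A term counts toward df once per document that contains it.
        for (df, &tf) in stats.doc_freqs.iter_mut().zip(&tfs) {
            if tf > 0 {
                *df += 1;
            }
        }
        stats.num_docs += 1;
        stats.total_tokens += tokens.len() as u64;
        stats.per_doc.push((tokens.len() as u32, tfs));
    }
    stats
}
```

The pass yields the same `(N, sumdl, df_t)` and `(doc_len, term_freqs)` shapes an indexed source would report, so an index-less source can feed the global scorer the same way.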
Flushed memtable generations are written without an FTS index, so both scoring modes would flat-scan them per query — an O(rows*queries) artifact that swamps the scoring-mode signal the bench measures. Build an inverted index on each flushed generation after flush (modeling the realistic post-flush multi-segment FTS state) so Local and Rescore both use the fast indexed path. The index-less flat path remains covered by unit tests.
…ance-format#6901) lance lance-format#6901 makes the memtable flush handler build the shard's maintained secondary indexes on each flushed generation, so the FTS index now exists on every flushed gen without the bench creating it. Remove the manual create_index loop; both scoring modes still use the fast indexed path, and the rescore planner's flat fallback remains for the no-maintained-index case (covered by unit tests).
…ench The lance-format#6882 refactor replaced ad-hoc benches with standalone CLI+JSON benchmarks driven through the real ShardWriter ingestion path. mem_wal_fts_read_bench follows that template (and is what the EC2 sweep ran); lsm_fts_modes was an off-template synthetic-shape bench that built datasets manually. Drop it and its Cargo.toml entry to keep one template-aligned FTS read benchmark.
Contributor
Author
@jackye1995 @hamersaw: opening this as a draft for design feedback on the LSM FTS scoring-mode API and the rescore planner shape. PTAL when you have a chance.
Contributor
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Adds full-text search to `LsmScanner`, spanning the base table, flushed memtable generations, and the active/frozen in-memory memtables, with two BM25 scoring modes from the multi-segment FTS discussion (#6789):
- `Local`: each source scores with its own corpus statistics; the coordinator unions per-source plans and merges by `_score` (per-partition top-k sort + sort-preserving merge). Single-pass, no cross-source coordination.
- `LocalWithGlobalRescore`: each source returns top-K' candidates with raw BM25 sufficient stats `(doc_len, term_freqs)`; the planner aggregates per-source `(N, sumdl, df_t)` into one global `MemBM25Scorer`, rescores every candidate with global stats, and selects the global top-k. `K' = max(rescore_factor·k, 100)`.

Components
- `LsmFtsSearchPlanner` + `LsmScanner::full_text_search(column, query, k, mode)`.
- `InvertedIndex::bm25_candidate_search` returns candidates with `(row_id, doc_len, term_freqs, local_score)` (parallel to `bm25_search`).
- `FtsMemIndex`: `search_candidates` + `bm25_stats_for_terms` so the in-memory index can feed the global scorer.
- … fallback (mirrors the fallback Local mode already gets from `scanner.full_text_search`).

Benchmark
`benches/mem_wal/fts/mem_wal_fts_read_bench.rs` (+ `run_fts_read_sweep.sh`), following the #6882 CLI/JSON template: ShardWriter ingestion of flushed gens + active memtable, FineWeb text, 200 single-term queries per config under both modes. EC2 m7gd.4xlarge (ARM, 16 vCPU), local instance-store NVMe vs S3, 2 flushed generations + active, `rescore_factor=10`.

Takeaways:
- … ~0.9–1.0 s (Local) / ~1.4–1.8 s (Rescore). Per-query inverted-index reads from S3 are uncached across queries here, so each query pays object-store round-trips: the headline cost for a cold object-store FTS read tier.
- … inverted index is ~O(matches) with WAND top-k bounding the work, not corpus size.
- … aggregation + a `take` to materialize winners). The multiplier shrinks on S3 (~1.6–1.8×) because S3 read latency is the common denominator both modes pay.
Jaccard here = mean top-k overlap between the two modes, `|A ∩ B| / |A ∪ B|` over the per-query result row-id sets (1.0 = identical top-k, lower = global rescoring pulled different docs in). It rises with k (0.86 at k=10 → 0.94 at k=100): rescoring swaps ~1–1.5 docs out of the top-10, a smaller fraction of a top-100. This quantifies how much global-stats rescoring actually changes results vs local scoring, which is the #6789 tradeoff. It's a membership metric (ignores intra-top-k reorder); a rank/score metric (NDCG / score correlation) is a useful follow-up.
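For concreteness, the metric can be written as a few lines of standalone Rust. `jaccard` and `mean_jaccard` are hypothetical helper names used for illustration, not the bench's functions; treating two empty result sets as identical (1.0) is also an assumption.

```rust
use std::collections::HashSet;

/// |A ∩ B| / |A ∪ B| over two top-k row-id lists. Order is ignored,
/// so this measures membership only, not rank agreement.
pub fn jaccard(a: &[u64], b: &[u64]) -> f64 {
    let a: HashSet<u64> = a.iter().copied().collect();
    let b: HashSet<u64> = b.iter().copied().collect();
    let union = a.union(&b).count();
    if union == 0 {
        // Two empty result sets are treated as identical (assumption).
        return 1.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}

/// Mean Jaccard over per-query (Local, Rescore) result pairs.
/// Returns None when there are no queries.
pub fn mean_jaccard(pairs: &[(Vec<u64>, Vec<u64>)]) -> Option<f64> {
    if pairs.is_empty() {
        return None;
    }
    let total: f64 = pairs.iter().map(|(l, r)| jaccard(l, r)).sum();
    Some(total / pairs.len() as f64)
}
```

As a reference point, a single swapped document in a top-10 leaves 9 shared ids out of 11 distinct ones, so that query scores 9/11 ≈ 0.82.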
Draft: opening for design feedback on the rescore planner shape and the
scoring-mode API before polishing.