feat: support segmented inverted index build and search#6305
PR Review: feat: support segmented inverted index build and search

Overall this is a solid first slice for segmented FTS. The BM25 cross-segment scoring approach is correct (global corpus stats passed to per-segment search, top-k merge via min-heap). A few items worth addressing:

P1: Duplicated scorer-merge logic (3x copy-paste)

The pattern of merging

    let mut base_scorer = first_index.bm25_base_scorer(&tokens);
    for index in indices.iter().skip(1) {
        let segment_scorer = index.bm25_base_scorer(&tokens);
        base_scorer.total_tokens += segment_scorer.total_tokens;
        base_scorer.num_docs += segment_scorer.num_docs;
        for (token, count) in segment_scorer.token_docs {
            *base_scorer.token_docs.entry(token).or_insert(0) += count;
        }
    }

The same top-k heap merge is also duplicated between

P1:
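One way to remove the triplicated scorer merge flagged above is a single helper that folds per-segment statistics together. The sketch below is only an illustration: `Bm25Stats`, `merge`, and `merge_segment_stats` are hypothetical names, and the struct just mirrors the three fields touched in the snippet (`total_tokens`, `num_docs`, `token_docs`), not the actual scorer type in the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-segment BM25 corpus statistics.
/// Field names mirror the reviewed snippet; the real type may differ.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Bm25Stats {
    pub total_tokens: u64,
    pub num_docs: usize,
    /// Number of documents containing each query token.
    pub token_docs: HashMap<String, usize>,
}

impl Bm25Stats {
    /// Fold another segment's statistics into this one.
    pub fn merge(&mut self, other: Bm25Stats) {
        self.total_tokens += other.total_tokens;
        self.num_docs += other.num_docs;
        for (token, count) in other.token_docs {
            *self.token_docs.entry(token).or_insert(0) += count;
        }
    }
}

/// Combine the statistics of every segment into one global view.
/// Returns `None` when there are no segments.
pub fn merge_segment_stats<I>(segments: I) -> Option<Bm25Stats>
where
    I: IntoIterator<Item = Bm25Stats>,
{
    segments.into_iter().reduce(|mut acc, stats| {
        acc.merge(stats);
        acc
    })
}
```

Each of the three call sites would then collect its per-segment scorers and call the helper once, so any future change to the merge rule lives in one place.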
rust/lance/src/io/exec/fts.rs
Outdated
| let mut candidates = std::collections::BinaryHeap::new(); | ||
| for index in &indices { | ||
| let (doc_ids, scores) = index | ||
| .bm25_search( |
should we do concurrent search here?
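A rough sketch of what concurrent per-segment search followed by a top-k merge could look like. Everything here is an assumption made for illustration: `Hit`, `Segment`, and `search_segments` are hypothetical, pre-scored hits stand in for the real `bm25_search` call, and scoped OS threads replace the async executor the actual code runs on (in the async code path this would more likely be a buffered stream of futures).

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;
use std::thread;

/// A scored hit (hypothetical type), totally ordered by score, then row id.
#[derive(Debug, Clone, Copy)]
pub struct Hit {
    pub score: f32,
    pub row_id: u64,
}

impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        self.score
            .total_cmp(&other.score)
            .then_with(|| self.row_id.cmp(&other.row_id))
    }
}
impl PartialOrd for Hit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Hit {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Hit {}

/// Hypothetical segment: pre-scored hits stand in for a real BM25 search.
pub struct Segment {
    pub hits: Vec<Hit>,
}

impl Segment {
    /// Return this segment's best `k` hits, highest score first.
    pub fn search(&self, k: usize) -> Vec<Hit> {
        let mut hits = self.hits.clone();
        hits.sort_by(|a, b| b.cmp(a));
        hits.truncate(k);
        hits
    }
}

/// Search every segment on its own scoped thread, then keep the global top `k`.
pub fn search_segments(segments: &[Segment], k: usize) -> Vec<Hit> {
    let per_segment: Vec<Vec<Hit>> = thread::scope(|s| {
        let handles: Vec<_> = segments
            .iter()
            .map(|segment| s.spawn(move || segment.search(k)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("segment search panicked"))
            .collect()
    });

    // Min-heap bounded to k entries: the weakest retained hit sits on top.
    let mut heap: BinaryHeap<Reverse<Hit>> = BinaryHeap::new();
    for hit in per_segment.into_iter().flatten() {
        heap.push(Reverse(hit));
        if heap.len() > k {
            heap.pop();
        }
    }
    // Ascending order of `Reverse<Hit>` is descending score.
    heap.into_sorted_vec()
        .into_iter()
        .map(|Reverse(hit)| hit)
        .collect()
}
```

Because every segment is scored against the same global corpus statistics, the per-segment scores are directly comparable and the merge needs no re-scoring.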
| .await?; | ||
| index.as_any().downcast_ref::<InvertedIndex>().cloned() | ||
| let segments = load_segments(&ds, &column).await?; | ||
| let (tokenizer, base_scorer) = match segments { |
worth parallelizing this!
wjones127 left a comment
A key benchmark I would recommend for this is FTS over 13 fragments, varying the number of segments: 1, 2, 4, and 6 segments (plus one unindexed fragment). That would show how robust the performance is as the number of segments changes. Ideally we get a roughly flat line.
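To make the suggested benchmark matrix concrete, here is a small sketch of how the fragment-to-segment layouts might be generated. It assumes one reading of the comment: the 13 fragments consist of 12 indexed fragments split evenly across the segments plus one trailing unindexed fragment. `Layout` and `layout` are hypothetical helpers, not part of Lance.

```rust
/// Hypothetical description of one benchmark case.
#[derive(Debug, PartialEq)]
pub struct Layout {
    /// Fragment ids covered by each index segment.
    pub segments: Vec<Vec<u32>>,
    /// The single fragment left without index coverage.
    pub unindexed: u32,
}

/// Split all but the last of `num_fragments` fragments into
/// `num_segments` contiguous, equal-sized groups.
pub fn layout(num_fragments: u32, num_segments: u32) -> Layout {
    assert!(
        num_fragments >= 2,
        "need at least one indexed and one unindexed fragment"
    );
    let indexed = num_fragments - 1;
    assert!(
        num_segments > 0 && indexed % num_segments == 0,
        "segment count must evenly divide the indexed fragments"
    );
    let ids: Vec<u32> = (0..indexed).collect();
    let per_segment = (indexed / num_segments) as usize;
    Layout {
        segments: ids.chunks(per_segment).map(|c| c.to_vec()).collect(),
        unindexed: indexed,
    }
}
```

Calling `layout(13, n)` for `n` in 1, 2, 4, and 6 yields segments of 12, 6, 3, and 2 fragments respectively, with fragment 12 left unindexed in every case.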
# Conflicts:
# java/lance-jni/src/blocking_dataset.rs
# java/src/main/java/org/lance/Dataset.java
# java/src/test/java/org/lance/index/VectorIndexTest.java
# python/python/lance/dataset.py
# python/python/lance/lance/__init__.pyi
# python/python/tests/test_vector_index.py
# python/src/dataset.rs
# rust/lance/src/index/api.rs
# rust/lance/src/index/create.rs
This change adds `with_index_segments()` for vector queries and makes ANN planning prune to the selected index segments instead of always searching the full logical index. It also makes `with_fragments()` participate in segment selection and flat fallback computation, so fragment-filtered and segment-filtered searches stay correct when only part of the logical index is queried. This should make distributed search much faster, because unrelated index segments no longer need to be loaded.

---

FTS should also support this; it will be added after #6305 has been merged.
This PR teaches inverted/FTS indices to participate in the segment-based build workflow and to search across multiple committed segments with a shared BM25 scorer. It keeps the current on-disk inverted format intact while aligning FTS with the newer `execute_uncommitted() -> create_index_segment_builder() -> commit_existing_index_segments()` path.