fix: wrong merge logic for distributed IVF_PQ #5772

yanghua · 2026-01-21T11:24:21Z

No description provided.

github-actions · 2026-01-21T11:26:28Z

PR Review: fix: wrong merge logic for distributed IVF_PQ

Summary

This PR addresses incorrect merge logic in distributed IVF_PQ index building by changing how partial auxiliary shards are ordered during merging. The key change is moving from content-derived sorting (based on min_fragment_id, min_row_id) to lexicographic parent directory name sorting, with an additional per-partition sort by first_row_id within each partition.

P0/P1 Issues

1. Excessive Debug Logging in Production Code (P1 - Performance)

The PR introduces ~30+ unconditional log::info\! calls throughout the codebase in critical hot paths:

open_vector_index (called on every index access)
ANNIvfSubIndexExec::late_search (called per-partition per-query)
merge_partial_vector_auxiliary_files (multiple times per shard)

These info-level logs will be emitted in production by default and may:

Cause I/O overhead on high-throughput systems
Bloat logs significantly during vector search operations
Leak internal implementation details

Recommendation: Either:

Gate ALL verbose logging behind LANCE_IVFPQ_DEBUG (not just some of them)
Use log::debug\! or log::trace\! for non-critical diagnostics
Remove diagnostic logging before merging, or add it in a separate PR focused on observability

2. Duplicated Sorting Logic (P1 - Maintainability)

The parent directory name extraction and sorting logic is duplicated in two places (~lines 486-498 and 1221-1233 in index_merger.rs):

// First occurrence
aux_paths.sort_by(|a, b| {
    let a_parts: Vec<_> = a.parts().collect();
    // ... same logic
});

// Second occurrence  
shard_infos.sort_by(|(path_a, _lens_a), (path_b, _lens_b)| {
    let a_parts: Vec<_> = path_a.parts().collect();
    // ... same logic
});

Recommendation: Extract into a helper function to ensure consistency and reduce maintenance burden.

3. No Tests for the Core Bug Fix (P1 - Test Coverage)

The PR changes the fundamental ordering logic from content-derived keys to directory-name-based sorting, but there are no new tests verifying:

That the new ordering produces correct merged indices
That row_id ordering within partitions is correct after the fix
Regression tests demonstrating the bug is fixed

The existing tests appear to have whitespace formatting changes but no new test cases for the actual fix.

4. Potential Re-opening of Files (P1 - Performance)

In the new per-partition merge loop (lines 1281-1442), each shard's auxiliary file is re-opened for every partition:

for pid in 0..nlist {
    for (path, lens) in shard_infos.iter() {
        // Opens file again for each partition
        let fh = sched.open_file(aux, &CachedFileSize::unknown()).await?;
        let reader = V2Reader::try_open(...).await?;
    }
}

For large indices with many partitions (e.g., 1000+ partitions), this could result in thousands of file open operations per shard.

Recommendation: Consider caching readers across partitions or restructuring to minimize file I/O.

Minor Notes (not blocking)

Many whitespace/formatting changes (e.g., .await? indentation) that make the diff harder to review but don't affect correctness
The compute_shard_content_key function is now described as "for diagnostics" but is still called - consider removing if truly unused

Questions for Author

Can you provide a test case that demonstrates the bug and shows it's fixed by this change?
Is there a reason the logging must be at info level rather than debug/trace?
Was the file re-opening performance tested with large partition counts?

codecov · 2026-01-21T12:36:49Z

Codecov Report

❌ Patch coverage is 48.24903% with 133 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...lance-index/src/vector/distributed/index_merger.rs	48.24%	127 Missing and 6 partials ⚠️

📢 Thoughts on this report? Let us know!

fix: wrong merge logic for distributed IVF_PQ

a3f706a

github-actions bot added the bug Something isn't working label Jan 21, 2026

yanghua added 8 commits January 21, 2026 19:41

revert code

cbb8141

revert rust/lance/src/index/vector/ivf/builder.rs

5681df4

revert rust/lance/src/index/vector/ivf.rs

d4f6380

revert rust/lance/src/index/vector/ivf.rs

3b72433

revert rust/lance/src/index/vector/ivf.rs

a553642

revert rust/lance/src/index/vector.rs

224c4e3

revert rust/lance/src/index.rs

c97c257

revert rust/lance/src/index.rs

87f53d5

yanghua closed this Jan 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: wrong merge logic for distributed IVF_PQ #5772

fix: wrong merge logic for distributed IVF_PQ #5772

Uh oh!

yanghua commented Jan 21, 2026

Uh oh!

github-actions bot commented Jan 21, 2026

Uh oh!

codecov bot commented Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fix: wrong merge logic for distributed IVF_PQ #5772

fix: wrong merge logic for distributed IVF_PQ #5772

Uh oh!

Conversation

yanghua commented Jan 21, 2026

Uh oh!

github-actions bot commented Jan 21, 2026

PR Review: fix: wrong merge logic for distributed IVF_PQ

Summary

P0/P1 Issues

Minor Notes (not blocking)

Questions for Author

Uh oh!

codecov bot commented Jan 21, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant