Skip to content

Conversation

@yanghua
Copy link
Collaborator

@yanghua yanghua commented Jan 21, 2026

No description provided.

@github-actions github-actions bot added the bug Something isn't working label Jan 21, 2026
@github-actions
Copy link
Contributor

PR Review: fix: wrong merge logic for distributed IVF_PQ

Summary

This PR addresses incorrect merge logic in distributed IVF_PQ index building by changing how partial auxiliary shards are ordered during merging. The key change is moving from content-derived sorting (based on min_fragment_id, min_row_id) to lexicographic parent directory name sorting, with an additional per-partition sort by first_row_id within each partition.


P0/P1 Issues

1. Excessive Debug Logging in Production Code (P1 - Performance)

The PR introduces ~30+ unconditional log::info\! calls throughout the codebase in critical hot paths:

  • open_vector_index (called on every index access)
  • ANNIvfSubIndexExec::late_search (called per-partition per-query)
  • merge_partial_vector_auxiliary_files (multiple times per shard)

These info-level logs will be emitted in production by default and may:

  • Cause I/O overhead on high-throughput systems
  • Bloat logs significantly during vector search operations
  • Leak internal implementation details

Recommendation: Either:

  • Gate ALL verbose logging behind LANCE_IVFPQ_DEBUG (not just some of them)
  • Use log::debug\! or log::trace\! for non-critical diagnostics
  • Remove diagnostic logging before merging, or add it in a separate PR focused on observability

2. Duplicated Sorting Logic (P1 - Maintainability)

The parent directory name extraction and sorting logic is duplicated in two places (~lines 486-498 and 1221-1233 in index_merger.rs):

// First occurrence
aux_paths.sort_by(|a, b| {
    let a_parts: Vec<_> = a.parts().collect();
    // ... same logic
});

// Second occurrence  
shard_infos.sort_by(|(path_a, _lens_a), (path_b, _lens_b)| {
    let a_parts: Vec<_> = path_a.parts().collect();
    // ... same logic
});

Recommendation: Extract into a helper function to ensure consistency and reduce maintenance burden.

3. No Tests for the Core Bug Fix (P1 - Test Coverage)

The PR changes the fundamental ordering logic from content-derived keys to directory-name-based sorting, but there are no new tests verifying:

  • That the new ordering produces correct merged indices
  • That row_id ordering within partitions is correct after the fix
  • Regression tests demonstrating the bug is fixed

The existing tests appear to have whitespace formatting changes but no new test cases for the actual fix.

4. Potential Re-opening of Files (P1 - Performance)

In the new per-partition merge loop (lines 1281-1442), each shard's auxiliary file is re-opened for every partition:

for pid in 0..nlist {
    for (path, lens) in shard_infos.iter() {
        // Opens file again for each partition
        let fh = sched.open_file(aux, &CachedFileSize::unknown()).await?;
        let reader = V2Reader::try_open(...).await?;
    }
}

For large indices with many partitions (e.g., 1000+ partitions), this could result in thousands of file open operations per shard.

Recommendation: Consider caching readers across partitions or restructuring to minimize file I/O.


Minor Notes (not blocking)

  • Many whitespace/formatting changes (e.g., .await? indentation) that make the diff harder to review but don't affect correctness
  • The compute_shard_content_key function is now described as "for diagnostics" but is still called - consider removing if truly unused

Questions for Author

  1. Can you provide a test case that demonstrates the bug and shows it's fixed by this change?
  2. Is there a reason the logging must be at info level rather than debug/trace?
  3. Was the file re-opening performance tested with large partition counts?

@codecov
Copy link

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 48.24903% with 133 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...lance-index/src/vector/distributed/index_merger.rs 48.24% 127 Missing and 6 partials ⚠️

📢 Thoughts on this report? Let us know!

@yanghua yanghua closed this Jan 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant