fix(index): preserve range partition structure in BTree index update during optimize #6904

Draft
Jay-ju wants to merge 4 commits into lance-format:main from Jay-ju:fix/range-partitioned-btree-update

@Jay-ju
Contributor

@Jay-ju Jay-ju commented May 22, 2026

Problem

When a BTree index is built with range partitioning (distributed mode), the update() method previously called combine_old_new(), which is unaware of the range partition structure. As a result, the retrained index lost its range partition layout during optimize: it was written as a single monolithic index instead of per-partition page data and lookup files.

Root Cause

BTreeIndex::update() unconditionally calls combine_old_new() + train_btree_index() without checking whether the index was built with range partitioning. The train_btree_index() function with range_id=None generates a non-partitioned index, destroying the original range partition structure.

Fix

Detect range-partitioned indices by checking ranges_to_files in update(). When present:

  1. Combine old and new data as before via combine_old_new()
  2. Collect all merged data into sorted batches
  3. Re-train each partition independently with the correct range_id
  4. Merge per-partition lookup files into a unified lookup with correct page_idx offsets via merge_range_partition_lookups_in_place()
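
Steps 3 and 4 can be sketched in plain Rust as follows. This is only an illustration under stated assumptions, not the code in this PR: `LookupEntry`, `train_partition` and `merge_lookups` are hypothetical stand-ins for the per-partition training and for merge_range_partition_lookups_in_place(), and pages are modelled as fixed-size chunks of sorted `i64` values rather than Arrow data.

```rust
/// Hypothetical stand-in for one lookup row: the key range covered by a page
/// and the index of that page.
#[derive(Debug, Clone, PartialEq)]
struct LookupEntry {
    min: i64,
    max: i64,
    page_idx: u32,
}

/// Hypothetical "train one partition" step: split the partition's sorted
/// values into pages of `page_size` rows (must be non-zero) and emit a
/// partition-local lookup whose page_idx starts at 0.
fn train_partition(sorted_values: &[i64], page_size: usize) -> Vec<LookupEntry> {
    sorted_values
        .chunks(page_size)
        .enumerate()
        .map(|(i, page)| LookupEntry {
            min: page[0],
            max: page[page.len() - 1],
            page_idx: i as u32,
        })
        .collect()
}

/// Merge partition-local lookups into one unified lookup. Each partition's
/// page_idx values are shifted by the number of pages in all earlier
/// partitions. Also returns the page count per partition, which is the kind
/// of information the PR records as pages_per_range_partition.
fn merge_lookups(per_partition: Vec<Vec<LookupEntry>>) -> (Vec<LookupEntry>, Vec<u32>) {
    let mut unified = Vec::new();
    let mut pages_per_partition = Vec::with_capacity(per_partition.len());
    let mut offset = 0u32;
    for lookup in per_partition {
        let page_count = lookup.len() as u32;
        for mut entry in lookup {
            entry.page_idx += offset;
            unified.push(entry);
        }
        pages_per_partition.push(page_count);
        offset += page_count;
    }
    (unified, pages_per_partition)
}
```

With two partitions of 3 and 2 pages, the second partition's local page 0 becomes page 3 in the unified lookup, so a reader holding only the unified lookup plus the per-partition page counts can still locate every page.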

New Helper Functions

  • collect_sorted_batches() — collects a SendableRecordBatchStream into Vec<RecordBatch>
  • slice_batches() — slices a batch vector by row range [start, end)
  • batches_to_stream() — converts Vec<RecordBatch> back to SendableRecordBatchStream
  • merge_range_partition_lookups_in_place() — merges per-partition lookup files into a unified lookup, adjusting page_idx offsets and writing pages_per_range_partition metadata
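
For reviewers, here is a rough sketch of the row-range slicing that slice_batches() is described as doing. It is an assumption-laden model, not the PR's implementation: each batch is a plain `Vec<i64>` instead of an Arrow `RecordBatch` (real code would more likely call `RecordBatch::slice(offset, length)`, which avoids copying), and the signature is invented for illustration.

```rust
/// Slice a sequence of batches by the global row range [start, end).
/// Each batch is modelled as a Vec<i64> (an assumption for this sketch).
fn slice_batches(batches: &[Vec<i64>], start: usize, end: usize) -> Vec<Vec<i64>> {
    let mut out = Vec::new();
    // Global row index of the first row of the current batch.
    let mut batch_start = 0usize;
    for batch in batches {
        let batch_end = batch_start + batch.len();
        // Overlap of this batch's rows [batch_start, batch_end) with [start, end).
        let lo = start.max(batch_start);
        let hi = end.min(batch_end);
        if lo < hi {
            out.push(batch[lo - batch_start..hi - batch_start].to_vec());
        }
        batch_start = batch_end;
        if batch_start >= end {
            break; // every remaining batch lies past the requested range
        }
    }
    out
}
```

Batches that fall entirely outside the range contribute nothing, so an empty range yields an empty vector rather than a vector of empty batches.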

Tests

  • test_optimize_btree_after_append_preserves_data — verifies regular BTree index data correctness after append + optimize
  • test_optimize_btree_index_update_preserves_range_partition_structure — verifies range partition structure (part_* files) is preserved after optimize

…during optimize

When a BTree index is built with range partitioning (distributed mode),
the update() method previously called combine_old_new() which is unaware
of the range partition structure. This caused the retrained index to lose
its range partition layout, producing a single monolithic index instead
of per-partition page data and lookup files.

This fix detects range-partitioned indices by checking ranges_to_files
and, when present, re-trains each partition independently before merging
the per-partition lookup files into a unified lookup with correct
page_idx offsets.

Added tests:
- test_optimize_btree_after_append_preserves_data: verifies regular
  BTree index data correctness after append + optimize
- test_optimize_btree_index_update_preserves_range_partition_structure:
  verifies range partition structure is preserved after optimize

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 22, 2026
@Jay-ju Jay-ju marked this pull request as draft May 22, 2026 02:36
Jay-ju added 2 commits May 22, 2026 10:45
…e-partitioned BTree optimize

- Enhanced test_optimize_btree_index_update_preserves_range_partition_structure
  to verify range query and point query correctness after optimize
- Added test_optimize_btree_range_partition_with_three_partitions to cover
  3-partition scenario with data that doesn't divide evenly, verifying
  both structure preservation and query correctness across partition
  boundaries
…structure preservation

The existing test asserted that update() on a range-partitioned BTree
index should fall back to non-ranged. After the fix that preserves
range partition structure during update, this assertion must be updated
to expect ranges_to_files.is_some() instead.
@codecov

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 98.02632% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/btree.rs | 88.31% | 1 Missing and 8 partials ⚠️ |

@Jay-ju
Contributor Author

Jay-ju commented May 22, 2026

@claude review

…atches in range-partitioned BTree update

- Track trained_partition_ids instead of assuming all partitions are trained
- Skip merge when only one partition was trained
- Remove collect_sorted_batches helper, inline the collection logic
- Pass SchemaRef to batches_to_stream to avoid accessing empty batch vector
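
The last three points can be illustrated with a small stand-alone model. Everything here is hypothetical: `Schema`, `Stream` and the function signatures are invented for the sketch, batches are plain `Vec<i64>`, and the reading that a partition goes untrained because it received no rows is an assumption, not something stated in the commit message.

```rust
/// Hypothetical stand-in for an Arrow schema.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    columns: Vec<&'static str>,
}

/// Hypothetical stand-in for a record batch stream.
#[derive(Debug)]
struct Stream {
    schema: Schema,
    batches: Vec<Vec<i64>>,
}

/// The schema is passed in explicitly. Deriving it from `batches[0]` would
/// panic when a partition has no batches at all.
fn batches_to_stream(schema: Schema, batches: Vec<Vec<i64>>) -> Stream {
    Stream { schema, batches }
}

/// Return the ids of partitions that actually contain rows. In this sketch
/// those are the partitions that get trained (an assumption).
fn trained_partition_ids(partitions: &[Vec<Vec<i64>>]) -> Vec<u32> {
    partitions
        .iter()
        .enumerate()
        .filter(|(_, batches)| batches.iter().any(|b| !b.is_empty()))
        .map(|(id, _)| id as u32)
        .collect()
}

/// Merging lookups is only needed when more than one partition was trained.
fn needs_merge(trained: &[u32]) -> bool {
    trained.len() > 1
}
```

Tracking the trained ids explicitly means the merge step iterates over partitions that really produced files, instead of assuming one output per configured range.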