Conversation

@BubbleCal
Contributor

less contention to improve indexing perf (+3%)

@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@github-actions
Contributor

Code Review

Summary: This PR optimizes inverted index building by processing posting lists on a CPU pool with a channel-based producer-consumer pattern instead of buffered async streams.

P0/P1 Issues

P1: Missing error propagation from producer task

In builder.rs, when the channel's rx.recv() returns None, the loop exits silently without checking whether the producer hit an error before completing. The producer.await? at the end will catch panics, but if the producer sent an error batch and then exited cleanly, correctness depends on the consumer actually processing that error batch.

The current code does handle this correctly (error batches are still received and propagated via batch?), but consider adding a comment to clarify this intentional behavior, as the control flow is subtle.
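The control flow in question can be sketched with std threads and channels (a minimal stand-in for the PR's actual async runtime and CPU pool; the names `build` and the batch type are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;

// Sketch of the subtle flow the review describes: the producer sends
// `Result` batches, so a producer-side error travels through the channel
// and is surfaced by the consumer's `batch?` -- even though the producer
// thread itself exits cleanly after sending it.
fn build() -> Result<u32, String> {
    let (tx, rx) = mpsc::channel::<Result<u32, String>>();

    let producer = thread::spawn(move || {
        for i in 0..3u32 {
            // Hypothetical failure on the last batch.
            let batch = if i == 2 {
                Err(format!("batch {i} failed"))
            } else {
                Ok(i)
            };
            if tx.send(batch).is_err() {
                break; // consumer hung up
            }
        }
        // tx is dropped here; the consumer's recv() then returns Err
        // and its loop ends.
    });

    let mut total: u32 = 0;
    while let Ok(batch) = rx.recv() {
        // `?` propagates the error batch; note this early return skips
        // the join below, mirroring how `producer.await?` at the end
        // would never run in that case.
        total += batch?;
    }

    producer.join().map_err(|_| "producer panicked".to_string())?;
    Ok(total)
}

fn main() {
    assert_eq!(build(), Err("batch 2 failed".to_string()));
}
```

This is why the review calls the behavior correct but subtle: the error reaches the caller through the channel, not through the final join on the producer.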

P1: Potential issue with length variable shadowing in to_batch_with_docs

In index.rs:1976, length holds the posting list length and is passed to build_batch, but build_batch never uses the caller's value; it shadows it with self.len() on line 1919. This works correctly, yet the local length on line 1976 is effectively unused. Consider removing it or using self.len() directly in the idf() call.
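The shadowing in question looks roughly like this (a contrived sketch, not the actual index.rs code; `PostingList` and `build_batch` here are simplified stand-ins):

```rust
// Sketch of the shadowing pattern the review flags: `build_batch`
// immediately rebinds `length` from its own state, so the value the
// caller computed is never read inside the method.
struct PostingList {
    doc_ids: Vec<u32>,
}

impl PostingList {
    fn len(&self) -> usize {
        self.doc_ids.len()
    }

    #[allow(unused_variables)]
    fn build_batch(&self, length: usize) -> usize {
        // Shadows the parameter: from here on, `length` is self.len(),
        // regardless of what the caller passed in.
        let length = self.len();
        length
    }
}

fn main() {
    let pl = PostingList { doc_ids: vec![1, 2, 3] };
    // The caller's value (99) is ignored; the result is self.len() == 3.
    assert_eq!(pl.build_batch(99), 3);
}
```

Rust allows this rebinding without warning, which is why the harmless-but-confusing parameter is worth cleaning up.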

Minor Observations (not blocking)

  • The commented-out channel_capacity line should be removed if not needed
  • Good test coverage added that verifies equivalence between the two batch building methods

Overall the approach looks sound for reducing contention. The channel-based pattern should indeed help by keeping CPU-bound work off the async runtime.
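The overall pattern can be sketched with std threads and a bounded channel standing in for the crate's actual CPU pool and async runtime (`process` and its parameters are hypothetical):

```rust
use std::sync::mpsc;
use std::thread;

// Rough sketch of the pattern the PR adopts: CPU-bound encoding runs on
// dedicated worker threads, and a bounded channel feeds finished batches
// back so the consumer (the async task in the real code) only does cheap
// receives instead of blocking the runtime on heavy work.
fn process(workers: u64, items_per_worker: u64) -> u64 {
    // Bounded capacity provides backpressure on the producers.
    let (tx, rx) = mpsc::sync_channel::<u64>(4);

    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for CPU-bound posting-list encoding.
                let encoded: u64 =
                    (0..items_per_worker).map(|i| w * items_per_worker + i).sum();
                tx.send(encoded).unwrap();
            })
        })
        .collect();
    drop(tx); // close the channel once every worker holds its own sender

    // The consumer just drains the channel until all senders are gone.
    let total: u64 = rx.iter().sum();
    for h in handles {
        h.join().unwrap();
    }
    total
}

fn main() {
    // 4 workers over items 0..3999 sum to 3999 * 4000 / 2.
    assert_eq!(process(4, 1000), 7_998_000);
}
```

Dropping the original sender after spawning is the detail that lets the consumer's loop terminate naturally once every worker finishes.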

@BubbleCal BubbleCal changed the title to perf: use cpu pool to process all posting lists on Jan 22, 2026
@codecov
codecov bot commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 91.87500% with 13 lines in your changes missing coverage. Please review.

Files with missing lines                           Patch %   Lines
rust/lance-index/src/scalar/inverted/encoding.rs   87.50%    0 Missing and 9 partials ⚠️
rust/lance-index/src/scalar/inverted/builder.rs    84.61%    2 Missing ⚠️
rust/lance-index/src/scalar/inverted/index.rs      97.33%    2 Missing ⚠️


let docs_for_batches = docs.clone();
let schema_for_batches = schema.clone();

// let channel_capacity = get_num_compute_intensive_cpus().max(1);
Collaborator

Do we need to remove this?

Contributor Author

removed

where
F: FnMut(u32, u32) -> f32,
{
debug_assert!(length > 0);
Collaborator

Better to provide more information about this assert.

Contributor Author

added
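The reviewer's suggestion amounts to attaching a message to the assertion. A small illustration (the `idf` helper and its formula here are hypothetical, not the actual text added in the PR):

```rust
// Sketch of a debug_assert! with a message, so a failure explains itself
// instead of just reporting `length > 0` was false.
fn idf(length: usize, num_docs: usize) -> f32 {
    debug_assert!(
        length > 0,
        "posting list must be non-empty to compute idf (num_docs = {num_docs})"
    );
    ((num_docs as f32 + 1.0) / (length as f32 + 0.5)).ln()
}

fn main() {
    let v = idf(10, 100);
    assert!(v > 0.0);
}
```

Like all `debug_assert!` checks, this compiles away in release builds, so the message costs nothing in production.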

@BubbleCal BubbleCal requested a review from Xuanwo January 23, 2026 09:31
@BubbleCal BubbleCal merged commit 2e6adce into main Jan 23, 2026
33 checks passed
@BubbleCal BubbleCal deleted the yang/optimize-write-postings branch January 23, 2026 09:51