feat: add early termination for compaction plan with max_compaction_bytes option #6890
Open
Jay-ju wants to merge 4 commits into
…ytes option

Add budget-based early termination to DefaultCompactionPlanner to prevent OOM when planning compaction on datasets with many fragments (e.g., hundreds of thousands).

Changes:
- Add max_compaction_bytes option to CompactionOptions
- Refactor max_source_fragments from post-hoc truncation to in-loop early termination, stopping fragment metrics collection once the budget is exceeded
- Add exceeds_budget() helper checking both fragment count and byte limits during the planning loop
- Update Python bindings and TypedDict docs
- Add functional tests for early termination behavior
- Add benchmark tests for plan performance at scale

Closes: lance-format#6039

…imits

- Add apply_budget_limits() for strict post-hoc truncation on task list
- Move early termination check before fragment is added to bin
- Guarantee at least 1 task is always included
- Fix test_max_source_fragments CI failure

- Fix Issue 1: Remove first-task exemption in apply_budget_limits; the budget is now a strict hard limit (0 tasks if the first task exceeds it)
- Fix Issue 2: Early termination now tracks effective (non-noop) candidate fragments only, preventing budget waste on bins that will be filtered by is_noop()
- Fix Issue 3: Mark benchmark tests as #[ignore] to reduce CI cost
- Update docs to clarify hard-limit semantics
@claude review
The test uses IVF with 2 partitions but default nprobes=1, which only probes 1 partition per segment. With delta indices (2 segments), the search may miss the partition containing ID 0 in the first segment, causing the assertion to fail non-deterministically (e.g., returning [889, 1000] instead of [0, 1000]). Setting nprobes=2 ensures all partitions are probed, making the search exhaustive and the test deterministic.
Hi @hamersaw. Fragment planning takes a long time on large datasets. I discussed this with @zhangyue19921010. Based on the discussion in #6039, we changed the original logic from full planning followed by trimming to on-demand planning: planning now stops once the threshold is reached. Could you take a look when you have time?
Summary
Add budget-based early termination to `DefaultCompactionPlanner` to prevent OOM when planning compaction on datasets with many fragments (e.g., hundreds of thousands).

Closes: #6039
Problem
When a dataset has hundreds of thousands of fragments, `plan_compaction` collects metrics for all fragments before producing the plan, which drives up both planning time and memory use.

The existing `max_source_fragments` option was a post-hoc truncation: it collected all metrics first, then truncated the output. This did not reduce planning time or memory.

In a benchmark with 10K fragments and no deletions, plan time barely changed because all metrics were still collected.
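The cost difference between post-hoc truncation and in-loop early termination can be illustrated with a small, self-contained sketch. This is not the planner's actual code: `FragmentMetrics`, `Budget`, `collect_metrics`, `plan_post_hoc`, and `plan_early_termination` are hypothetical names, the loop is serial for brevity, and the check is assumed to run before a fragment is admitted so the selection never exceeds the budget.

```rust
// Illustrative sketch only; all names and types are hypothetical, not the Lance API.

/// Stand-in for per-fragment metrics that are expensive to collect.
struct FragmentMetrics {
    bytes: usize,
}

/// Hypothetical budget mirroring `max_source_fragments` and `max_compaction_bytes`.
struct Budget {
    max_source_fragments: Option<usize>,
    max_compaction_bytes: Option<usize>,
}

impl Budget {
    /// True if the given totals would pass either configured limit.
    fn exceeds_budget(&self, fragments: usize, bytes: usize) -> bool {
        self.max_source_fragments.map_or(false, |max| fragments > max)
            || self.max_compaction_bytes.map_or(false, |max| bytes > max)
    }
}

/// Simulates the expensive metrics call and counts how often it runs.
fn collect_metrics(sizes: &[usize], idx: usize, calls: &mut usize) -> FragmentMetrics {
    *calls += 1;
    FragmentMetrics { bytes: sizes[idx] }
}

/// Post-hoc truncation: collect metrics for every fragment, then trim.
/// Returns (selected fragment indices, number of metrics calls).
fn plan_post_hoc(sizes: &[usize], budget: &Budget) -> (Vec<usize>, usize) {
    let mut calls = 0;
    let all: Vec<FragmentMetrics> = (0..sizes.len())
        .map(|i| collect_metrics(sizes, i, &mut calls))
        .collect();
    let mut selected = Vec::new();
    let mut bytes = 0;
    for (i, m) in all.iter().enumerate() {
        if budget.exceeds_budget(selected.len() + 1, bytes + m.bytes) {
            break;
        }
        bytes += m.bytes;
        selected.push(i);
    }
    (selected, calls)
}

/// In-loop early termination: stop collecting once the next fragment
/// would push the totals past the budget.
fn plan_early_termination(sizes: &[usize], budget: &Budget) -> (Vec<usize>, usize) {
    let mut calls = 0;
    let mut selected = Vec::new();
    let mut bytes = 0;
    for i in 0..sizes.len() {
        let m = collect_metrics(sizes, i, &mut calls);
        if budget.exceeds_budget(selected.len() + 1, bytes + m.bytes) {
            break;
        }
        bytes += m.bytes;
        selected.push(i);
    }
    (selected, calls)
}

fn main() {
    // 10K fragments of 100 bytes each; the byte limit is the binding one here.
    let sizes = vec![100usize; 10_000];
    let budget = Budget {
        max_source_fragments: Some(50),
        max_compaction_bytes: Some(3_000),
    };
    let (post_hoc, post_hoc_calls) = plan_post_hoc(&sizes, &budget);
    let (early, early_calls) = plan_early_termination(&sizes, &budget);
    assert_eq!(post_hoc, early); // same selection either way
    println!("post-hoc: {post_hoc_calls} metrics calls, early: {early_calls} metrics calls");
}
```

Both strategies select the same fragments in this sketch; only the number of metrics calls differs, which is the part that dominates planning time and memory on large datasets.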
Solution
Refactor `max_source_fragments` from post-hoc truncation to in-loop early termination, and add a new `max_compaction_bytes` option. The planner now tracks `total_candidate_fragments` and `total_candidate_bytes` during the metrics collection loop and breaks out as soon as either budget is exceeded.

Key changes:
- `max_source_fragments`: Now terminates metrics collection early (was post-hoc truncation)
- `max_compaction_bytes`: New option to limit by cumulative fragment byte size
- `exceeds_budget()`: Helper method checking both limits during the planning loop
- Metrics collection uses `.buffered(io_parallelism())`, unlike PR #6095 (feat: support bounded compaction planner), which used serial I/O

Design Rationale
This approach follows hamersaw's review feedback on PR #6095: extending `CompactionOptions` rather than adding a new `BoundedCompactionPlanner` type. Users configure limits directly without needing to choose a planner implementation.

Changes
Rust
- `CompactionOptions`: Add `max_compaction_bytes: Option<usize>` field
- `DefaultCompactionPlanner::plan()`: Replace post-hoc truncation with in-loop early termination
- `DefaultCompactionPlanner::exceeds_budget()`: New helper method
- `CompactionOptions::apply_dataset_config()`: Support `lance.compaction.max_compaction_bytes`

Python
- `CompactionOptions` TypedDict: Add `max_compaction_bytes` field with docs
- `max_compaction_bytes` key

Usage
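A minimal sketch of how the new option might be set, either directly or through the `lance.compaction.max_compaction_bytes` dataset config key. The struct below is a hypothetical stand-in that carries only the two budget fields (with `max_source_fragments` assumed to be `Option<usize>` like the new field), and this `apply_dataset_config` is a simplified illustration whose signature and error handling are assumptions, not the Lance implementation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in with only the two budget fields;
/// the real `CompactionOptions` in Lance has more fields.
#[derive(Debug, Default, Clone, PartialEq)]
struct CompactionOptions {
    max_source_fragments: Option<usize>,
    max_compaction_bytes: Option<usize>,
}

impl CompactionOptions {
    /// Simplified stand-in: reads the byte budget from a string config map.
    /// Unparseable values are ignored here; the real method may behave differently.
    fn apply_dataset_config(&mut self, config: &HashMap<String, String>) {
        if let Some(parsed) = config
            .get("lance.compaction.max_compaction_bytes")
            .and_then(|raw| raw.parse::<usize>().ok())
        {
            self.max_compaction_bytes = Some(parsed);
        }
    }
}

fn main() {
    // Set the byte budget directly: 1 GiB per plan, no fragment-count limit.
    let direct = CompactionOptions {
        max_compaction_bytes: Some(1024 * 1024 * 1024),
        ..Default::default()
    };

    // Or derive the same budget from dataset config.
    let mut config = HashMap::new();
    config.insert(
        "lance.compaction.max_compaction_bytes".to_string(),
        "1073741824".to_string(),
    );
    let mut from_config = CompactionOptions::default();
    from_config.apply_dataset_config(&config);

    assert_eq!(direct, from_config);
    println!("{:?}", from_config);
}
```

Leaving both fields as `None` in this sketch corresponds to planning without a budget.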
Comparison with PR #6095
| PR #6095 | This PR |
|---|---|
| `BoundedCompactionPlanner` type | `DefaultCompactionPlanner` |
| `planner="bounded"` + limits | `max_*` options |