Intermediate result blocked approach to aggregation memory management #15591
Rachelint wants to merge 72 commits into apache:main
Hi @Rachelint, I think I have an alternative proposal that seems relatively easy to implement.
Thanks a lot. The design in this PR indeed still introduces quite a few code changes... I tried not to modify anything about
But I found that this way introduces too much extra cost... Maybe we place the
Finished development (and testing) of all the needed common structs!
It is very close, just need to add more tests!
Hi @Rachelint 👋 just wanted to check in on this! The last commit was about a month ago, any update on where things stand? Also worth noting that this work could help fix or mitigate #19906, so there's renewed interest in getting it over the line. Thanks for all the effort you've put into this. It's really appreciated!
Sorry for the long delay, which was due to some private reasons. I will try to make it ready this weekend:
- Generate dense group indices (0..N) via hash-table-style dedup in fuzz tests, matching real GroupedHashAggregateStream behavior. Previously sparse random indices caused SeenValues::All to mark never-appeared groups as "seen" on transition to Some mode.
- Use build_single_null_buffer() instead of build(EmitTo::All) to avoid panic in blocked mode.
- Replace deprecated gen_range with random_range.
- Extract Fixture::new_fixed() from inline test setup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
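The first bullet above relies on assigning dense group indices by deduplicating keys, the way a hash table would. A minimal sketch of that idea, assuming plain `u64` keys; the helper name `dense_group_indices` is hypothetical and not taken from the PR:

```rust
use std::collections::HashMap;

/// Hypothetical helper (not from the PR): maps each key to a dense group
/// index in first-appearance order, so the indices always cover 0..N.
pub fn dense_group_indices(keys: &[u64]) -> (Vec<usize>, usize) {
    let mut seen: HashMap<u64, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(keys.len());
    for &key in keys {
        // A new key gets the next unused index; a repeated key reuses its own.
        let next = seen.len();
        let idx = *seen.entry(key).or_insert(next);
        indices.push(idx);
    }
    // The second element is the number of distinct groups (N).
    (indices, seen.len())
}
```

Because every index in `0..N` is produced by at least one input row, a structure that treats "all groups below N" as seen cannot mark a group that never appeared.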
Refactor `maybe_enable_blocked_groups` into a pure predicate `can_enable_blocked_groups` that returns bool, moving `alter_block_size` calls to the caller. Adjust OOM mode selection to account for infinite memory pools explicitly, and add assertions before entering ProducingBlocks state. Minor rustfmt fixes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add SessionContextOptions (skip_partial / sort_hint / enable_blocked_groups) with Option<bool> semantics: None → randomized, Some(true) → force on, Some(false) → force off. Thread it through AggregationFuzzerBuilder → AggregationFuzzer → SessionContextGenerator. Also simplify can_enable_blocked_groups: flatten match to && and take &Box<dyn GroupValues> to match call-site ergonomics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
I have fixed all correctness problems; it can be ready again today.
… path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This branch still has conflicts.
Yes, I mainly solved the correctness in
I have been busy at work in recent days... Fixing the last conflicts now.
I was checking it out as well, and playing around with it |
///
#[derive(Debug)]
pub struct Blocks<B: Block> {
    inner: VecDeque<B>,
I think it would be nice to avoid the VecDeque as I believe it is relatively slow to index (because of the %).
I think we can use a start offset instead: during pop, increment the offset and replace the block with an empty one to "pop" it and reclaim the memory.
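A minimal sketch of the suggestion above, assuming each block is simply a `Vec<T>`. The type `VecBlocks` and its method names are illustrative and are not the PR's actual `Blocks` API:

```rust
/// Illustrative sketch only: a block list backed by a plain `Vec`, where
/// "popping" the front block advances a start offset instead of shifting.
#[derive(Debug)]
pub struct VecBlocks<T> {
    blocks: Vec<Vec<T>>,
    /// Index of the first block that has not been popped yet.
    start: usize,
}

impl<T> VecBlocks<T> {
    pub fn new() -> Self {
        Self { blocks: Vec::new(), start: 0 }
    }

    pub fn push_block(&mut self, block: Vec<T>) {
        self.blocks.push(block);
    }

    /// Number of blocks that are still live.
    pub fn len(&self) -> usize {
        self.blocks.len() - self.start
    }

    /// Takes the front block, leaving an empty `Vec` in its slot so the
    /// block's memory is released once the caller drops the returned value.
    pub fn pop_front_block(&mut self) -> Option<Vec<T>> {
        if self.start == self.blocks.len() {
            return None;
        }
        let block = std::mem::take(&mut self.blocks[self.start]);
        self.start += 1;
        Some(block)
    }

    /// Plain slice indexing by absolute block id, with no ring-buffer
    /// wrap-around. Popped blocks are empty, so lookups into them return `None`.
    pub fn get(&self, block_id: usize, offset: usize) -> Option<&T> {
        self.blocks.get(block_id)?.get(offset)
    }
}
```

In this sketch block ids stay stable after a pop, which keeps indexing a direct `Vec` lookup; the cost is one empty `Vec` header left behind per popped block.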
Yes, it is better to use Vec<T>, and I tried it back when I still saw this as a performance improvement feature.
However, after many tries, I found it actually can't help DataFusion run faster (it only helps with better memory management)... So I finally switched to VecDeque for simplicity...
The experiments can be seen in this archived branch:
https://github.com/Rachelint/arrow-datafusion/compare/intermeidate-result-blocked-approach-bak
Ok, but I think the Vec approach is relatively simple as well?
Not to pin you down, but I think that once this is used more widely, it will probably come up later anyway.
Makes sense, I am switching it to Vec.
- Add missing `?` operator on `take_orderings` call in first_last.rs
- Handle `EmitTo::NextBlock` in exhaustive matches for array_agg, first_last/state, and order modules
- Remove duplicate `Result` import in row.rs
- Remove unused imports (correlation.rs, multi_group_by/mod.rs)
- Add `#[cfg(test)]` to `TestSeenValuesResult` enum
|block_id, block_offset, new_value| {
    // SAFETY: `block_id` and `block_offset` are guaranteed to be in bounds
    let value = unsafe {
        self.values[block_id as usize]
This can use unchecked indexing as well (with a plain Vec it would certainly be faster).
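A minimal sketch of what unchecked indexing at both levels could look like, assuming the blocks are stored as a slice of `Vec<i64>`. The function `add_unchecked` and its integer types are assumptions for illustration, not code from the PR:

```rust
/// Illustrative only: adds `new_value` to the slot at (`block_id`, `block_offset`)
/// without bounds checks on either the block lookup or the slot lookup.
///
/// # Safety
/// The caller must guarantee `block_id < blocks.len()` and
/// `block_offset < blocks[block_id].len()`.
pub unsafe fn add_unchecked(
    blocks: &mut [Vec<i64>],
    block_id: u32,
    block_offset: u64,
    new_value: i64,
) {
    // SAFETY: both indices are in bounds per the function-level contract.
    let slot = unsafe {
        blocks
            .get_unchecked_mut(block_id as usize)
            .get_unchecked_mut(block_offset as usize)
    };
    *slot += new_value;
}
```

With a `VecDeque` the outer lookup still has to map a logical index onto the ring buffer, so removing the bounds check there saves less than it does with a plain `Vec` or slice.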
The merge from main accidentally set the expected value of collect_statistics to false in the SHOW ALL assertion. Restore it to true to match the actual config default. Also remove stale #[expect(dead_code)] on query_builder helpers that are now used.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blocked groups pre-allocates memory per block, increasing baseline memory usage. Adjust spill test memory pools to accommodate:
- test_order_is_retained_when_spilling: 600 → 2000 bytes
- test_sort_reservation_fails_during_spill: keep at 500 (still triggers sort reservation failure with blocked groups)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
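To illustrate why per-block pre-allocation raises the baseline, here is a rough sketch of blocked storage for `u64` values. `BlockedValues` is a hypothetical type, and the assumption that each block reserves exactly `block_size` slots up front is mine, not something the commit message states:

```rust
/// Illustrative blocked storage: values live in fixed-capacity blocks,
/// and a full block is never resized or copied.
pub struct BlockedValues {
    block_size: usize,
    blocks: Vec<Vec<u64>>,
}

impl BlockedValues {
    pub fn new(block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        Self { block_size, blocks: Vec::new() }
    }

    /// Appends a value and returns its (block_id, block_offset) position.
    pub fn push(&mut self, value: u64) -> (usize, usize) {
        let needs_block = self
            .blocks
            .last()
            .map_or(true, |b| b.len() == self.block_size);
        if needs_block {
            // Reserve the whole block at once; this is the baseline memory cost.
            self.blocks.push(Vec::with_capacity(self.block_size));
        }
        let block_id = self.blocks.len() - 1;
        let block = &mut self.blocks[block_id];
        block.push(value);
        (block_id, block.len() - 1)
    }

    pub fn num_blocks(&self) -> usize {
        self.blocks.len()
    }

    /// Bytes reserved across all blocks, including slots not yet used.
    pub fn reserved_bytes(&self) -> usize {
        self.blocks
            .iter()
            .map(|b| b.capacity() * std::mem::size_of::<u64>())
            .sum()
    }
}
```

Even a single stored value reserves a full block in this sketch, which is the kind of fixed overhead that can push a very small memory pool over its limit.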
Rather than adjusting memory limits in existing tests, split each into
two variants that explicitly set enable_aggregation_blocked_groups:
- test_order_is_retained_when_spilling_{flat,blocked}: both use 2000
bytes (the original 600 is no longer sufficient after upstream
accumulator memory changes).
- test_sort_reservation_fails_during_spill_{flat,blocked}: both use
500 bytes which still triggers the expected sort reservation OOM.
Also add enable_blocked_groups parameter to new_spill_ctx helper so
each test explicitly controls the grouping mode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In flat mode (block_size=None), `new_block` was using `Vec::with_capacity(DEFAULT_BLOCK_CAP=128)` regardless, causing the reported `size()` to jump from ~32 bytes to 1024 bytes for only 3 groups. This made sort_headroom reservation exceed tight memory pools. Fix: in flat mode, use `Vec::new()` and let `resize` grow via the standard Vec growth strategy, matching the original behavior. Also restore flat test memory limit to original 600 bytes (now passes again), and keep blocked test at 600 bytes (batch_size=1 means per-block capacity is just 1 slot).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
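The arithmetic behind the 1024-byte figure can be sketched as follows, assuming 8-byte `u64` slots. The functions below are simplified stand-ins with assumed signatures, not the PR's actual `new_block`:

```rust
use std::mem::size_of;

/// Mirrors the `DEFAULT_BLOCK_CAP=128` value quoted in the commit message.
const DEFAULT_BLOCK_CAP: usize = 128;

/// Stand-in for the fixed behavior: `block_size` is `None` in flat mode.
pub fn new_block(block_size: Option<usize>) -> Vec<u64> {
    match block_size {
        // Blocked mode: reserve the block up front (assumed to equal block_size).
        Some(size) => Vec::with_capacity(size),
        // Flat mode: start empty and let `Vec` grow on demand.
        None => Vec::new(),
    }
}

/// Stand-in for the old flat-mode behavior that always reserved 128 slots.
pub fn eager_flat_block() -> Vec<u64> {
    Vec::with_capacity(DEFAULT_BLOCK_CAP)
}

/// Bytes reserved by a block, counting unused capacity.
pub fn reserved_bytes(block: &Vec<u64>) -> usize {
    block.capacity() * size_of::<u64>()
}
```

With 128 slots of 8 bytes each, the eager variant reserves at least 1024 bytes before any group is stored, while `Vec::new()` reserves nothing until the first push.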
- Add #[expect(clippy::borrowed_box)] to can_enable_blocked_groups
- Collapse nested if statements in switch_to_skip_aggregation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix formatting in context_generator.rs, fuzzer.rs, array_agg.rs, multi_group_by/mod.rs
- Remove stale #[expect(dead_code)] on with_no_grouping (now used)
- Update configs.md with new enable_aggregation_blocked_groups entry
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Function was renamed to can_enable_blocked_groups but the rustdoc link in GroupedHashAggregateStream doc comment was not updated, causing `cargo doc -D warnings` to fail.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI passed and there are no conflicts again. Remaining things before it is ready:
- Add section comments to categorize tests: helpers, basic correctness,
OOM/cancellation, ordered aggregation, schema/planning, skip
aggregation, spill/memory, statistics, and multi-stage
- Add `task_ctx_with_blocked_groups` helper for non-spill tests
- Add `enable_blocked_groups` param to `check_aggregates`,
`check_grouping_sets`, `first_last_multi_partitions`, and
`run_test_with_spill_pool_if_necessary`
- Wrap all grouped aggregation tests in `for enable_blocked in
[false, true]` loops so both flat and blocked storage modes are
exercised
- Merge `test_order_is_retained_when_spilling_{flat,blocked}` and
`test_sort_reservation_fails_during_spill_{flat,blocked}` back into
single tests with loops
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Which issue does this PR close?
Rationale for this change
As mentioned in #7065, we use a single `Vec` to manage aggregation intermediate results in both `GroupAccumulator` and `GroupValues`.
This is simple but not efficient enough for high-cardinality aggregation, because when the `Vec` is not large enough, we need to allocate a new `Vec` and copy all data from the old one.
So this PR introduces a blocked approach to manage the aggregation intermediate results. In this approach we never resize the `Vec`; instead we split the data into blocks, and when the capacity is not enough, we just allocate a new block. Details can be found in #7065.
What changes are included in this PR?
`PrimitiveGroupsAccumulator` and `GroupValuesPrimitive` as the example
Are these changes tested?
Tested by existing tests, plus new unit tests and new fuzz tests.
Are there any user-facing changes?
Two functions are added to the `GroupValues` and `GroupAccumulator` traits.
But as you can see, there are default implementations for them, and users can choose to support the blocked approach when they want better performance for their UDAFs.
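A rough sketch of how default trait methods keep existing implementations source-compatible. The trait `ToyGroupAccumulator`, its method names, and the `String` error type are all assumptions made for illustration; they are not the real DataFusion trait definitions:

```rust
/// Toy stand-in for the real traits; the method names are assumptions.
pub trait ToyGroupAccumulator {
    /// Default: the blocked approach is not supported.
    fn supports_blocked_groups(&self) -> bool {
        false
    }

    /// Default: accept staying in flat mode, reject switching to blocked mode.
    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        match block_size {
            None => Ok(()),
            Some(_) => Err("blocked groups are not supported".to_string()),
        }
    }
}

/// An existing accumulator keeps compiling without any change.
pub struct LegacyAccumulator;

impl ToyGroupAccumulator for LegacyAccumulator {}

/// An accumulator that opts in overrides both methods.
pub struct BlockedAccumulator {
    block_size: Option<usize>,
}

impl BlockedAccumulator {
    pub fn new() -> Self {
        Self { block_size: None }
    }

    pub fn block_size(&self) -> Option<usize> {
        self.block_size
    }
}

impl ToyGroupAccumulator for BlockedAccumulator {
    fn supports_blocked_groups(&self) -> bool {
        true
    }

    fn alter_block_size(&mut self, block_size: Option<usize>) -> Result<(), String> {
        self.block_size = block_size;
        Ok(())
    }
}
```

A caller can check the predicate first and only then switch the block size, so accumulators that never override the defaults simply stay in flat mode.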