feat(parquet): fuse level encoding with counting and histogram updates#9795
feat(parquet): fuse level encoding with counting and histogram updates#9795HippoBaro wants to merge 1 commit intoapache:mainfrom
Conversation
etseidl
left a comment
There was a problem hiding this comment.
Looks nice, thanks @HippoBaro.
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fuse_lvl_encoding_hist_counting (849685c) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
Thanks @HippoBaro -- I think this looks really close. Thank you @etseidl for the review
| self.rep_levels_encoder | ||
| .put_with_observer(levels, |level, count| { | ||
| new_rows += (count as u32) * (level == 0) as u32; | ||
| if let Some(ref mut h) = self.page_metrics.repetition_level_histogram { |
There was a problem hiding this comment.
You might be able to move this check out of the loop (so call put if self.page_metrics.repetition_level_histogram is none and and call with_with_observer if it s some
Maybe could improve the inner loop even more
There was a problem hiding this comment.
I did a quick test of that idea earlier and it didn't seem worth the added complexity. But it probably deserves a second look on better hardware 😄.
There was a problem hiding this comment.
Just did the experiment locally with this:
match self.page_metrics.definition_level_histogram.as_mut() {
Some(histogram) => encoder.put_with_observer(levels, |level, count| {
values_to_write += count * (level == max_def) as usize;
histogram.increment_by(level, count as i64);
}),
None => encoder.put_with_observer(levels, |level, count| {
values_to_write += count * (level == max_def) as usize;
}),
};Benchmarks show a 2-3% improvements on average for list_primitive.
There was a problem hiding this comment.
I've pushed the above change!
849685c to
8c4bd85
Compare
Add `put_with_observer()` to `LevelEncoder` so callers can piggyback row/null counting and histogram updates onto level encoding without extra passes over the level buffers. Update `write_mini_batch()` to encode definition and repetition levels while collecting the associated metrics in the same pass, and hoist the histogram-enabled branch out of the inner loop. Add `LevelHistogram::increment_by()` for counted updates, keep `update_from_levels()` as a deprecated compatibility wrapper, and remove the now-unnecessary PageMetrics histogram update helpers. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
8c4bd85 to
28f25c0
Compare
|
run benchmark arrow_writer |
|
Hi @HippoBaro, thanks for the request (#9795 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
I thought so 😅 Was worth a try! |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fuse_lvl_encoding_hist_counting (28f25c0) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
I think it looks very nice persoanlly. The only last reamining question in my mind is if we should be super safe and restore put with a deprecated note.
I think if we want to include it in the next arrow (58.2.0, I am hoping to cut the release in the next day or two) we should include the deprecation
If we are ok with waiting until arrow 59 (in a month or so) we can probably not worry about put
| /// Put/encode levels vector into this level encoder. | ||
| /// Returns number of encoded values that are less than or equal to length of the | ||
| /// input buffer. | ||
| /// Put/encode levels vector into this level encoder and call |
There was a problem hiding this comment.
Shall we make a put here that is deprecated that just calls through to put_with_observer?
There was a problem hiding this comment.
Not sure how you manage the release process. I’m happy to go either way as long as it doesn’t delay merging to main. If both options allow that, I don’t have a preference. Otherwise, I’d rather formally deprecate put than wait to remove it.
|
Thanks @etseidl and @HippoBaro |
Which issue does this PR close?
Rationale for this change
See #9731
What changes are included in this PR?
Add
put_with_observer()toLevelEncoderthat calls anFnMut(i16, usize)observer for each value during encoding. This allows callers to piggyback counting and histogram updates into the encoding pass without extra iterations over the level buffer.Previously,
write_mini_batch()made 3 separate passes over each level array: one to count non-null values or row boundaries, one to update the level histogram, and one to RLE-encode. Now all three operations happen in a single pass via the observer closure.Replace
LevelHistogram::update_from_levels()with a newLevelHistogram::increment_by()that accepts a count, and remove the now-unnecessaryupdate_definition_level_histogram()andupdate_repetition_level_histogram()methods from PageMetrics.Are these changes tested?
All tests passing; existing tests give 100% coverage.
Are there any user-facing changes?
None