perf(parquet): optimize DELTA_BINARY_PACKED skip & add scan_filtered #9769
sahuagin wants to merge 1 commit into apache:main from delta-skip-scan-filtered-decoder
Conversation
…red()

Two additions to DeltaBitPackDecoder:

1. skip() optimization: bw=0 miniblocks use an O(1) multiply instead of decoding 32/64 values per miniblock. Terminal skips (discarding all remaining page values) avoid heap allocation and last_value tracking.
2. Decoder::scan_filtered(): new provided method on the Decoder trait (default: decode everything, safe fallback for all encodings). DeltaBitPackDecoder overrides it to compute a conservative value range [lo, hi] per miniblock and skip non-matching miniblocks without decoding individual values.

Benchmarks vs upstream HEAD (arrow_reader bench):

- bw=0 single-value skip: -21.6%
- bw=0 increasing-value skip: -24.3%
- mixed stepped skip: -3.9%

Wall-time scan_filtered on 1M-row DELTA file (monotone column): full decode: 1.96ms -> scan_filtered: 470us (4.2x speedup)
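The conservative per-miniblock range check described in point 2 can be sketched standalone. Everything below (function names, signatures) is an illustrative assumption for this discussion, not the actual arrow-rs code:

```rust
// Sketch of a conservative [lo, hi] bound for one DELTA_BINARY_PACKED
// miniblock: each of the n deltas lies in [min_delta, min_delta + 2^bw - 1],
// so every value in the block falls inside the bounds computed below.
// If a filter's range does not overlap [lo, hi], the whole miniblock can be
// skipped without decoding any individual value.
fn miniblock_bounds(last_value: i64, min_delta: i64, bit_width: u8, n: i64) -> (i64, i64) {
    // largest per-delta remainder representable in bit_width bits
    let max_rem = if bit_width >= 63 { i64::MAX } else { (1i64 << bit_width) - 1 };
    let d_min = min_delta;
    let d_max = min_delta.saturating_add(max_rem);
    // value after k deltas is in [last + k*d_min, last + k*d_max], k = 1..=n
    let lo = last_value.saturating_add(if d_min >= 0 { d_min } else { d_min.saturating_mul(n) });
    let hi = last_value.saturating_add(if d_max >= 0 { d_max.saturating_mul(n) } else { d_max });
    (lo, hi)
}

/// True when a predicate range [pred_lo, pred_hi] could match some block value.
fn may_match(bounds: (i64, i64), pred_lo: i64, pred_hi: i64) -> bool {
    bounds.1 >= pred_lo && bounds.0 <= pred_hi
}

fn main() {
    let b = miniblock_bounds(10, 1, 2, 4); // 4 values, deltas each in [1, 4]
    assert_eq!(b, (11, 26));
    assert!(!may_match(b, 100, 200)); // filter range disjoint: skip the block
    println!("bounds = {:?}", b);
}
```

The bound is conservative by construction: it never excludes a block that could contain a match, which is why the default trait implementation (decode everything) remains a safe fallback.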
|
Thanks for your submission @sahuagin. A quick look at this leads me to believe there are 3 PRs here: one to optimize a special skip case, one to optimize for the case of skipping to the end of a page, and one to introduce a new API (scan_filtered()). Also, if possible, please add new issues as necessary and link to them in the PR description. This helps us greatly when generating release notes. The skip optimization could perhaps piggyback on #9476, but the terminal skip and scan_filtered() likely deserve issues of their own. Finally, if any of this code has been generated by an LLM, please see https://arrow.apache.org/docs/dev/developers/overview.html#ai-generated-code as well as https://www.apache.org/legal/generative-tooling.html Thanks again |
|
I've got some stuff split up and am going to submit shortly; I have an "origin story" for my offering that, in my voice, is much more conversational and story-driven. Would it be ok to put that in the first issue to help explain what is going on, or is a more clinical voice preferred in tickets? |
We don't grade your prose 😅. But if you have a roadmap for what you plan to do a single issue with milestones is fine. That gives others a chance to guide the direction early on in case there are hidden pitfalls. Thanks! |
|
I agree with @etseidl that splitting into smaller PRs will make it more likely that we'll be able to find the time to review this code carefully. Also, I agree that the voice of the PR description is not a key consideration 😆 -- mostly we need to be able to review and understand your proposed changes. Thank you for your contribution @sahuagin |
|
run benchmarks arrow_reader |
|
🤖 Arrow criterion benchmark running (GKE) | Comparing delta-skip-scan-filtered-decoder (095ebfc) to de11d9c (merge-base) |
|
🤖 Arrow criterion benchmark completed (GKE) |
etseidl
left a comment
Just flushing some comments on the optimization. The speedup looks nice for the special cases. I'm not sure the terminal skip is showing up, but I think that case can be optimized further (local testing showed maybe a 4% improvement in the other skip cases).
```rust
// Terminal skip: caller is discarding all remaining values on this page.
// last_value will never be read again, so we can use O(1) arithmetic
// skips (BitReader::skip) instead of decoding through get_batch.
let terminal = to_skip >= self.values_left + skip;
```
I think all of the "terminal" logic can move before taking first_value. If terminal is true, we don't even need to take first_value.
```rust
let terminal = to_skip >= self.values_left + skip;

if terminal {
    while skip < to_skip {
```
I think this can simply set self.values_left to 0, and perhaps take first_value just in case. The only reason for stepping through the headers is to do validation, but if we're skipping anyway, I think we can just ignore invalid data.
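A minimal sketch of that suggestion, with field names assumed for illustration (not the real DeltaBitPackDecoder internals): a terminal skip does not need to step through miniblock headers at all, it can take first_value "just in case" and zero out values_left.

```rust
// Hypothetical decoder state, reduced to the two fields the suggestion touches.
struct SketchDecoder {
    first_value: Option<i64>,
    values_left: usize,
}

impl SketchDecoder {
    /// Terminal skip: discard everything left on the page in O(1),
    /// ignoring (rather than validating) any remaining block headers.
    fn skip_terminal(&mut self) -> usize {
        // take first_value just in case it has not been emitted yet,
        // so the decoder state stays consistent
        let _ = self.first_value.take();
        std::mem::replace(&mut self.values_left, 0)
    }
}

fn main() {
    let mut d = SketchDecoder { first_value: Some(7), values_left: 100 };
    assert_eq!(d.skip_terminal(), 100);
    assert_eq!(d.values_left, 0);
    assert!(d.first_value.is_none());
    println!("ok");
}
```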
```rust
    ));
}

// see commentary in self.get() above regarding optimizations
```
Not sure why this comment was dropped, please restore
```rust
if bit_width == 0 {
    // if min_delta == 0, there's nothing to do. self.last_value is unchanged
    // All remainders are zero: every delta equals min_delta exactly.
    // Advance last_value by n * min_delta with no bit reads.
```
The new comments here do not address the min_delta == 0 case.
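For reference, the bit_width == 0 fast path under discussion reduces to a single multiply; the sketch below uses assumed names, not the actual patch, and covers the min_delta == 0 special case the comment asks about (the multiply is then a no-op and last_value is unchanged).

```rust
// bit_width == 0 means no remainder bits are stored: every delta equals
// min_delta exactly, so skipping n values advances last_value by
// n * min_delta with zero bit reads -- O(1) instead of decoding 32/64
// values per miniblock.
fn skip_zero_bit_width(last_value: i64, min_delta: i64, n: u64) -> i64 {
    // wrapping arithmetic mirrors typical delta-decoder overflow behavior
    last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
}

fn main() {
    assert_eq!(skip_zero_bit_width(100, 3, 32), 196);
    // min_delta == 0: there is nothing to do, last_value is unchanged
    assert_eq!(skip_zero_bit_width(100, 0, 64), 100);
    println!("ok");
}
```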