parquet: O(1) skip for bw=0 miniblocks in DeltaBitPackDecoder#9786
parquet: O(1) skip for bw=0 miniblocks in DeltaBitPackDecoder#9786alamb merged 4 commits intoapache:mainfrom
Conversation
In the non-terminal skip path, miniblocks with bit_width=0 no longer call get_batch. Every delta in a bw=0 miniblock equals min_delta exactly, so last_value can be advanced by n * min_delta using a single wrapping_mul + wrapping_add. When min_delta == 0 no update is needed at all. Common in practice: timestamps with uniform step, run-length-like data. The bw=0 case is the hot path for those columns. Benchmarks vs upstream HEAD: bw=0 single-value skip: -21.6% bw=0 increasing-value skip: -24.3% Split from apache#9769 as requested by reviewer.
| )); | ||
| } | ||
|
|
||
| // see commentary in self.get() above regarding optimizations |
There was a problem hiding this comment.
Please restore this comment as it is still relevant
There was a problem hiding this comment.
Will do as soon as possible.
| if bit_width == 0 { | ||
| // if min_delta == 0, there's nothing to do. self.last_value is unchanged | ||
| // All remainders are zero: every delta equals min_delta exactly. | ||
| // Advance last_value by n * min_delta with no bit reads. |
There was a problem hiding this comment.
please still address the min_delta == 0 case in the comments here
There was a problem hiding this comment.
Will do. I think you have more info in my prior ticket, which I'll incorporate.
| let mini_block_to_skip = self.mini_block_remaining.min(to_skip - skip); | ||
| let mini_block_should_skip = mini_block_to_skip; | ||
|
|
||
| let skip_count = self |
There was a problem hiding this comment.
I love moving this inside bw != 0 branch...saves a memset at least. I wonder if we could lazy initialize skip_buffer as well. Probably better as a follow on.
There was a problem hiding this comment.
Should I make a ticket for this?
There was a problem hiding this comment.
Filed as #9792 for follow-on — thanks for the suggestion.
|
run benchmark arrow_reader |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing pr/bw0-skip (cb208e5) to de11d9c (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Note on benchmark variance: These results were collected on a non-isolated machine without CPU frequency pinning. Small variances of ±5% on non-bw=0 paths (particularly |
|
I took the liberty of fixing the |
- Restore "see commentary in self.get() above" pointer that ties this routine to the equivalent optimizations in get(), per etseidl review. - Call out the min_delta == 0 case explicitly in the bw=0 branch comment so the no-op behavior is documented, not just implicit in the surrounding `if min_delta != 0` guard. Addresses review feedback on apache#9786.
|
Comments addressed in e302663a6. Full benchmark results on the updated branch against the Wins on the bw=0 paths
The magnitude scales with how much of the column is bw=0 (single-value: every miniblock; increasing-value: every miniblock; stepped-increasing: depends on where the step lands within a miniblock). Non-target paths
Measurement conditionsNon-isolated machine, no CPU pinning, browser tabs and background processes active. For the bw=0 target paths the signal (−18% to −22% on single-value) is large enough to read through any plausible noise; for the non-target paths I'd discount the numbers. Happy to rerun on more controlled hardware if useful. |
In the non-terminal skip path of
DeltaBitPackDecoder::skip(), miniblocks withbit_width=0no longer callget_batch. Every delta in a bw=0 miniblock equalsmin_deltaexactly, solast_valuecan be advanced byn * min_deltausing asingle
wrapping_mul+wrapping_add. Whenmin_delta == 0no update isneeded at all.
Benchmarks (
arrow_readerbench vs upstream HEAD,-- skip --baseline upstream):Results are consistent across all integer types (Int8/16/32/64, UInt8/16/32/64,
Decimal128). Benchmarks were run on a non-isolated machine (no CPU frequency
pinning); small variances of ±5% on non-bw=0 paths should be attributed to
measurement noise.
Origin: Observed that constant and near-constant integer columns encoded
with DELTA_BINARY_PACKED are common in practice (timestamps with uniform step,
run-length-like data). The bw=0 case is the hot path for those columns and the
decode loop was doing unnecessary work.
Verification: Existing skip tests pass. Two new focused tests added for the
uniform-step case (
min_delta != 0), which was not covered by prior tests.The bw=0 path produces identical
last_valuestate as the decode-based path byconstruction (
n * min_delta == sum of n copies of min_delta).Generated-by: Claude (claude-sonnet-4-6)