parquet: speed up ByteView dictionary decoder #9745
Dandandan wants to merge 9 commits into apache:main from
Conversation
Replace the `extend(keys.iter().map(...))` loop in `ByteViewArrayDecoderDictionary::read` with a `chunks_exact(8)` loop that bulk-validates each chunk's keys, then uses `get_unchecked` gather plus raw-pointer writes. Matches the pattern in `RleDecoder::get_batch_with_dict`. Drops the per-element bounds check, the per-element `error.is_none()` branch, and `Vec::extend`'s per-push capacity check. Invalid keys now return an error eagerly via a cold helper instead of zero-filling and deferring.

Dictionary-decode microbenchmarks (`parquet/benches/arrow_reader.rs`):

    BinaryView mandatory, no NULLs     102.91 µs -> 74.29 µs    -27.8%
    BinaryView optional, no NULLs      104.63 µs -> 76.65 µs    -26.9%
    BinaryView optional, half NULLs    143.25 µs -> 132.46 µs    -7.3%
    StringView mandatory, no NULLs     105.98 µs -> 73.87 µs    -28.8%
    StringView optional, no NULLs      104.62 µs -> 76.34 µs    -27.4%
    StringView optional, half NULLs    141.86 µs -> 131.85 µs    -7.1%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
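Not the actual patch, but a minimal sketch of the chunked-gather shape this commit describes, with hypothetical names (`gather_keys`, a `u64` dictionary standing in for the view array): each 8-key chunk is validated in bulk, then gathered with `get_unchecked`; the remainder uses a plain checked loop, and bad keys error eagerly through a `#[cold]` helper.

```rust
// Cold, out-of-line error path so the hot loop stays branch-light.
#[cold]
fn invalid_key_error(k: i32) -> String {
    format!("invalid dictionary key: {k}")
}

fn gather_keys(keys: &[i32], dict: &[u64], out: &mut Vec<u64>) -> Result<(), String> {
    out.reserve(keys.len());
    let chunks = keys.chunks_exact(8);
    let remainder = chunks.remainder();
    for chunk in chunks {
        // One bulk validation per 8 keys; `k as u32` maps negative
        // (corrupt) keys to huge values that fail the length check.
        if let Some(&bad) = chunk.iter().find(|&&k| (k as u32 as usize) >= dict.len()) {
            return Err(invalid_key_error(bad));
        }
        for &k in chunk {
            // SAFETY: every key in this chunk was validated above.
            out.push(unsafe { *dict.get_unchecked(k as usize) });
        }
    }
    // Trailing partial chunk: plain checked gather.
    for &k in remainder {
        match dict.get(k as u32 as usize) {
            Some(&v) => out.push(v),
            None => return Err(invalid_key_error(k)),
        }
    }
    Ok(())
}

fn main() {
    let dict = [10u64, 20, 30, 40];
    let mut out = Vec::new();
    gather_keys(&[0, 1, 2, 3, 3, 2, 1, 0, 2], &dict, &mut out).unwrap();
    assert_eq!(out, [10, 20, 30, 40, 40, 30, 20, 10, 30]);
    assert!(gather_keys(&[0, 1, 2, 9, 0, 0, 0, 0], &dict, &mut Vec::new()).is_err());
    assert!(gather_keys(&[-1], &dict, &mut Vec::new()).is_err());
}
```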
run benchmark arrow_reader_clickbench
Two small follow-ups to the chunked-gather rewrite, both driven by
inspecting the aarch64 asm:
1) Rewrite `adjust_buffer_index` without an `if`/`else` so LLVM emits a
   `csel` in the hot chunked loop. Previously the main 8-key gather
   went through an out-of-line block with a conditional branch per
   view; now each view is 5 branchless instructions
   (ldp/cmp/csel/add/stp).
2) Replace `chunk.iter().all(|&k| cond)` with a max-reduction over
   `u32` keys. `.all()` short-circuits, which blocks vectorisation —
   LLVM emitted 8 sequential `ldrsw+cmp+b.ls`. The max-reduction
   compiles on aarch64 NEON to:
       ldp q1, q0, [x1]        ; one load, 8 keys
       umax.4s v2, v1, v0      ; pairwise lane max
       umaxv.4s s2, v2         ; horizontal reduce
       cmp w13, w22            ; one compare
       b.hs <cold error path>  ; one branch
   The NEON registers are then reused for the gather
   (`fmov`/`mov.s v[i]`) so keys are loaded exactly once.
Casting keys via `k as u32` correctly rejects any negative i32
(corrupt data) because a negative value becomes a large u32.
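The two tweaks can be sketched as follows, assuming the Arrow ByteView layout (low 32 bits hold the length; for strings longer than 12 bytes, bits 64..96 hold the buffer index); the function names mirror the description but the bodies are illustrative, not the committed code.

```rust
// Branchless buffer-index adjustment: `is_long` is 0 or 1, so the
// adjustment is pure arithmetic and LLVM can emit `csel` instead of a
// per-view conditional branch.
fn adjust_buffer_index(view: u128, base: u32) -> u128 {
    let len = view as u32; // low 32 bits hold the length
    let is_long = (len > 12) as u128; // inline views carry no buffer index
    view.wrapping_add((is_long * base as u128) << 64)
}

// Non-short-circuiting validation: a max-reduction the compiler can keep
// in NEON registers (umax.4s / umaxv.4s), one compare and one branch per
// chunk. Negative i32 keys become huge u32 values and fail the check.
fn chunk_keys_valid(chunk: &[i32], dict_len: u32) -> bool {
    chunk.iter().fold(0u32, |m, &k| m.max(k as u32)) < dict_len
}

fn main() {
    // Short (inline) view: unchanged by the base offset.
    assert_eq!(adjust_buffer_index(5, 7), 5);
    // Long view with buffer index 2: index becomes 2 + 7 = 9.
    let long = 100u128 | (2u128 << 64);
    assert_eq!((adjust_buffer_index(long, 7) >> 64) & 0xFFFF_FFFF, 9);
    assert!(chunk_keys_valid(&[0, 3, 1, 2], 4));
    assert!(!chunk_keys_valid(&[0, 4, 1, 2], 4));
    assert!(!chunk_keys_valid(&[0, -1, 1, 2], 4));
}
```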
Microbenchmark deltas over the previous commit (criterion, aarch64):
BinaryView mandatory, no NULLs 74.29 µs -> 72.96 µs -1.8%
BinaryView optional, no NULLs 76.65 µs -> 75.01 µs -2.1%
StringView mandatory, no NULLs 73.87 µs -> 72.27 µs -2.2%
StringView optional, no NULLs 76.34 µs -> 75.41 µs -1.2%
Cumulative vs. main HEAD (89b1497):
BinaryView mandatory, no NULLs 102.91 µs -> 72.96 µs -29.2%
BinaryView optional, no NULLs 104.63 µs -> 75.01 µs -28.4%
BinaryView optional, half NULLs 143.25 µs -> 133.06 µs -7.4%
StringView mandatory, no NULLs 105.98 µs -> 72.27 µs -30.7%
StringView optional, no NULLs 104.62 µs -> 75.41 µs -29.2%
StringView optional, half NULLs 141.86 µs -> 132.20 µs -6.8%
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 Arrow criterion benchmark (GKE): comparing optimize-byte-view-dict-decoder (fe1728d) to 89b1497 (merge-base).
…ther

Raises the chunk size from 8 to 16 to match apache#9746's finding for the RLE dict gather, and replaces the raw-pointer writes with a spare-capacity slice of `MaybeUninit` so the unsafe surface is confined to one slice index and one `set_len`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
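A hedged sketch of the spare-capacity pattern (hypothetical names, `u64` standing in for the 128-bit view): values are written through a `MaybeUninit` slice taken from the Vec's uninitialised tail, so the unsafe surface shrinks to one `set_len`. Bulk validation is elided here into an `assert!` for brevity.

```rust
use std::mem::MaybeUninit;

fn gather_into_spare(keys: &[i32], dict: &[u64], out: &mut Vec<u64>) {
    // Stand-in for the chunked max-reduction validation.
    assert!(keys.iter().all(|&k| (k as u32 as usize) < dict.len()));
    out.reserve(keys.len());
    let start = out.len();
    let spare = &mut out.spare_capacity_mut()[..keys.len()];
    for (slot, &k) in spare.iter_mut().zip(keys) {
        // `MaybeUninit::write` is a safe slice write -- no raw pointers.
        slot.write(dict[k as usize]);
    }
    // SAFETY: the first `keys.len()` spare slots were just initialised.
    unsafe { out.set_len(start + keys.len()) };
}

fn main() {
    let dict = [10u64, 20, 30];
    let mut out = vec![999];
    gather_into_spare(&[2, 0, 1, 2], &dict, &mut out);
    assert_eq!(out, [999, 30, 10, 20, 30]);
}
```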
Change `RleDecoder::get_batch_with_dict` (`pub(crate)`) to take `&mut [MaybeUninit<T>]` so callers can gather directly into `Vec::spare_capacity_mut()` without zero-initialising first. In `ByteViewArrayDecoderDictionary::read`, the common `base_buffer_idx == 0` case now calls a new `DictIndexDecoder::read_with_dict` that delegates to `get_batch_with_dict`, skipping the intermediate index-buffer pass. The `base_buffer_idx != 0` branch keeps the chunked-gather fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
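The calling convention this signature enables can be sketched like so, with hypothetical helper names: the decoder writes into a caller-provided `&mut [MaybeUninit<T>]` and reports how many slots it filled, so the caller can pass `Vec::spare_capacity_mut()` and `set_len` afterwards instead of zero-initialising with `resize(.., 0)`. An RLE run does one dictionary lookup and fills slots directly, with no per-key gather.

```rust
use std::mem::MaybeUninit;

// Fill up to `len` output slots with `dict[key]`; return slots written.
fn decode_rle_run<T: Copy>(dict: &[T], key: usize, len: usize, out: &mut [MaybeUninit<T>]) -> usize {
    let n = len.min(out.len());
    let value = dict[key]; // one dictionary lookup per run
    for slot in &mut out[..n] {
        slot.write(value);
    }
    n
}

fn main() {
    let dict = [111u32, 222];
    let mut out: Vec<u32> = Vec::new();
    out.reserve(6);
    let spare = out.spare_capacity_mut();
    // Two RLE runs: 4 copies of dict[1], then 2 copies of dict[0].
    let w1 = decode_rle_run(&dict, 1, 4, &mut spare[..]);
    let w2 = decode_rle_run(&dict, 0, 2, &mut spare[w1..]);
    // SAFETY: exactly `w1 + w2` slots were initialised above.
    unsafe { out.set_len(w1 + w2) };
    assert_eq!(out, [222, 222, 222, 222, 111, 111]);
}
```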
The 1024-entry scratch is only used when decoding bit-packed runs. Moving the `get_or_insert_with` call inside the `else if self.bit_packed_left > 0` branch means RLE-only streams skip the allocation entirely, and the `Option` discriminant check is paid only where the buffer is actually read. Relies on Rust's disjoint field borrows to hold both `self.bit_reader` and `self.index_buf` mutably at once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
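The lazy-scratch pattern can be sketched with a hypothetical type (the real one is parquet's `RleDecoder`): the index buffer is an `Option` allocated only when a bit-packed run is actually hit, so RLE-only streams never pay for it, and `get_or_insert_with` borrows only its own field, leaving the others (the bit reader in the real code) free for a simultaneous mutable borrow.

```rust
struct LazyScratchDecoder {
    rle_left: usize,
    bit_packed_left: usize,
    index_buf: Option<Box<[i32; 1024]>>,
}

impl LazyScratchDecoder {
    /// Returns how many scratch entries were (notionally) refilled.
    fn refill_scratch(&mut self) -> usize {
        if self.rle_left > 0 {
            self.rle_left -= 1;
            0 // RLE run: scratch untouched, never allocated
        } else if self.bit_packed_left > 0 {
            self.bit_packed_left -= 1;
            // Allocated lazily, on the first bit-packed run only.
            let buf = self.index_buf.get_or_insert_with(|| Box::new([0; 1024]));
            buf.len()
        } else {
            0
        }
    }
}

fn main() {
    let mut d = LazyScratchDecoder { rle_left: 2, bit_packed_left: 1, index_buf: None };
    d.refill_scratch();
    d.refill_scratch();
    assert!(d.index_buf.is_none()); // two RLE runs: no allocation
    assert_eq!(d.refill_scratch(), 1024); // first bit-packed run allocates
    assert!(d.index_buf.is_some());
}
```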
run benchmark arrow_reader_clickbench
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize-byte-view-dict-decoder (e344717) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Which issue does this PR close?
None — a targeted optimisation surfaced by profiling `profile_clickbench` locally.

Rationale for this change
`ByteViewArrayDecoderDictionary::read` is the inner loop for reading dictionary-encoded `StringView`/`BinaryView` columns. The previous shape expanded every RLE run through an intermediate index buffer and ran each decoded key through a bounds-checked `get` + `Some`/`None` branch plus a deferred-error capture. `Vec::extend(Map<_, closure>)` also misses `TrustedLen`, so the per-push capacity check stayed in the hot loop.

What changes are included in this PR?
Two layered changes:
1. Fuse RLE decode with view gather (no zero-init).
`RleDecoder::get_batch_with_dict` (internal, `pub(crate)`) now takes `&mut [MaybeUninit<T>]` so callers can gather straight into `Vec::spare_capacity_mut()` + `set_len` — no upfront `resize(..., 0)`. A new `DictIndexDecoder::read_with_dict` exposes this for the dict-view decoder. When `base_buffer_idx == 0` (the common case: dictionary buffers are the last buffers in the output), the dict-view decoder calls `read_with_dict` directly and the intermediate 1024-entry index buffer is bypassed. RLE runs now fill view slots with no per-key gather at all. The scratch `index_buf` inside `RleDecoder` is also allocated lazily, only when a bit-packed run is actually read.

2. Bulk-validated chunked gather where fusion doesn't apply.

For the `base_buffer_idx != 0` fallback (a buffer-index rewrite is needed on every view), the read loop does a 16-key chunked gather with bulk max-reduction validation — the same shape as `RleDecoder::get_batch_with_dict` in #9746. The bit-packed leftover drain in `read_with_dict` follows the same pattern.

Supporting cleanups driven by asm inspection:
- `adjust_buffer_index` rewritten as `view.wrapping_add((is_long * base as u128) << 64)` so LLVM emits `csel` inside the chunked loop instead of a per-view conditional branch to an out-of-line adjustment block.
- `.all(|&k| cond)` replaced with a `u32` max-reduction. `.all()` short-circuits and blocks autovectorisation; the fold form compiles to `ldp q1,q0 + umax.4s + umaxv.4s + cmp + b.hs` on aarch64 — one SIMD load, one branch, reusing NEON registers for the gather.
- `k as u32` correctly rejects negative i32 (corrupt data) — the negative value becomes a very large u32 and fails the max-reduction check.

Are these changes tested?
Existing unit tests in `byte_view_array`, `encodings::rle`, and `arrow::decoder::dictionary_index` pass. The `RleDecoder::get_batch_with_dict` signature change also required rerouting the other in-crate caller (`encodings::decoding::DictDecoder::get`), which is covered by its own tests.

Microbenchmarks (`parquet/benches/arrow_reader.rs`, `arrow_array_reader/(String|Binary)ViewArray/dictionary *`, aarch64 / Apple Silicon, 5 s measurement, baseline = current `apache/main`):

Half-NULL cases gain less because roughly half the views are null padding rather than gather output.
Are there any user-facing changes?
None — same public API, same semantics (invalid dictionary indices still surface as `ParquetError::General`). The `RleDecoder::get_batch_with_dict` signature change is internal to the crate.

🤖 Generated with Claude Code