parquet: speed up ByteView dictionary decoder #9745
Dandandan wants to merge 9 commits into apache:main from
Conversation
Replace the `extend(keys.iter().map(...))` loop in `ByteViewArrayDecoderDictionary::read` with a `chunks_exact(8)` loop that bulk-validates each chunk's keys, then uses `get_unchecked` gather plus raw-pointer writes. Matches the pattern in `RleDecoder::get_batch_with_dict`. Drops the per-element bounds check, the per-element `error.is_none()` branch, and `Vec::extend`'s per-push capacity check. Invalid keys now return an error eagerly via a cold helper instead of zero-filling and deferring.

Dictionary-decode microbenchmarks (`parquet/benches/arrow_reader.rs`):

    BinaryView mandatory, no NULLs     102.91 µs -> 74.29 µs    -27.8%
    BinaryView optional, no NULLs      104.63 µs -> 76.65 µs    -26.9%
    BinaryView optional, half NULLs    143.25 µs -> 132.46 µs    -7.3%
    StringView mandatory, no NULLs     105.98 µs -> 73.87 µs    -28.8%
    StringView optional, no NULLs      104.62 µs -> 76.34 µs    -27.4%
    StringView optional, half NULLs    141.86 µs -> 131.85 µs    -7.1%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
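Not the actual patch, but a minimal sketch of the chunked-gather shape this commit describes, with hypothetical names (`gather_keys`, a `u64` dictionary standing in for the view array): each 8-key chunk is validated in bulk, then gathered with `get_unchecked`; the remainder uses a plain checked loop, and bad keys error eagerly through a `#[cold]` helper.

```rust
// Cold, out-of-line error path so the hot loop stays branch-light.
#[cold]
fn invalid_key_error(k: i32) -> String {
    format!("invalid dictionary key: {k}")
}

fn gather_keys(keys: &[i32], dict: &[u64], out: &mut Vec<u64>) -> Result<(), String> {
    out.reserve(keys.len());
    let chunks = keys.chunks_exact(8);
    let remainder = chunks.remainder();
    for chunk in chunks {
        // One bulk validation per 8 keys; `k as u32` maps negative
        // (corrupt) keys to huge values that fail the length check.
        if let Some(&bad) = chunk.iter().find(|&&k| (k as u32 as usize) >= dict.len()) {
            return Err(invalid_key_error(bad));
        }
        for &k in chunk {
            // SAFETY: every key in this chunk was validated above.
            out.push(unsafe { *dict.get_unchecked(k as usize) });
        }
    }
    // Trailing partial chunk: plain checked gather.
    for &k in remainder {
        match dict.get(k as u32 as usize) {
            Some(&v) => out.push(v),
            None => return Err(invalid_key_error(k)),
        }
    }
    Ok(())
}

fn main() {
    let dict = [10u64, 20, 30, 40];
    let mut out = Vec::new();
    gather_keys(&[0, 1, 2, 3, 3, 2, 1, 0, 2], &dict, &mut out).unwrap();
    assert_eq!(out, [10, 20, 30, 40, 40, 30, 20, 10, 30]);
    assert!(gather_keys(&[0, 1, 2, 9, 0, 0, 0, 0], &dict, &mut Vec::new()).is_err());
    assert!(gather_keys(&[-1], &dict, &mut Vec::new()).is_err());
}
```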
run benchmark arrow_reader_clickbench
Two small follow-ups to the chunked-gather rewrite, both driven by
inspecting the aarch64 asm:
1) Rewrite `adjust_buffer_index` without an `if`/`else` so LLVM emits a
   `csel` in the hot chunked loop. Previously the main 8-key gather
   went through an out-of-line block with a conditional branch per
   view; now each view is 5 branchless instructions
   (ldp/cmp/csel/add/stp).
2) Replace `chunk.iter().all(|&k| cond)` with a max-reduction over
   `u32` keys. `.all()` short-circuits, which blocks vectorisation —
   LLVM emitted 8 sequential `ldrsw+cmp+b.ls`. The max-reduction
   compiles on aarch64 NEON to:
       ldp q1, q0, [x1]        ; one load, 8 keys
       umax.4s v2, v1, v0      ; pairwise lane max
       umaxv.4s s2, v2         ; horizontal reduce
       cmp w13, w22            ; one compare
       b.hs <cold error path>  ; one branch
   The NEON registers are then reused for the gather
   (`fmov`/`mov.s v[i]`) so keys are loaded exactly once.
Casting keys via `k as u32` correctly rejects any negative i32
(corrupt data) because a negative value becomes a large u32.
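The two tweaks can be sketched as follows, assuming the Arrow ByteView layout (low 32 bits hold the length; for strings longer than 12 bytes, bits 64..96 hold the buffer index); the function names mirror the description but the bodies are illustrative, not the committed code.

```rust
// Branchless buffer-index adjustment: `is_long` is 0 or 1, so the
// adjustment is pure arithmetic and LLVM can emit `csel` instead of a
// per-view conditional branch.
fn adjust_buffer_index(view: u128, base: u32) -> u128 {
    let len = view as u32; // low 32 bits hold the length
    let is_long = (len > 12) as u128; // inline views carry no buffer index
    view.wrapping_add((is_long * base as u128) << 64)
}

// Non-short-circuiting validation: a max-reduction the compiler can keep
// in NEON registers (umax.4s / umaxv.4s), one compare and one branch per
// chunk. Negative i32 keys become huge u32 values and fail the check.
fn chunk_keys_valid(chunk: &[i32], dict_len: u32) -> bool {
    chunk.iter().fold(0u32, |m, &k| m.max(k as u32)) < dict_len
}

fn main() {
    // Short (inline) view: unchanged by the base offset.
    assert_eq!(adjust_buffer_index(5, 7), 5);
    // Long view with buffer index 2: index becomes 2 + 7 = 9.
    let long = 100u128 | (2u128 << 64);
    assert_eq!((adjust_buffer_index(long, 7) >> 64) & 0xFFFF_FFFF, 9);
    assert!(chunk_keys_valid(&[0, 3, 1, 2], 4));
    assert!(!chunk_keys_valid(&[0, 4, 1, 2], 4));
    assert!(!chunk_keys_valid(&[0, -1, 1, 2], 4));
}
```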
Microbenchmark deltas over the previous commit (criterion, aarch64):
BinaryView mandatory, no NULLs 74.29 µs -> 72.96 µs -1.8%
BinaryView optional, no NULLs 76.65 µs -> 75.01 µs -2.1%
StringView mandatory, no NULLs 73.87 µs -> 72.27 µs -2.2%
StringView optional, no NULLs 76.34 µs -> 75.41 µs -1.2%
Cumulative vs. main HEAD (89b1497):
BinaryView mandatory, no NULLs 102.91 µs -> 72.96 µs -29.2%
BinaryView optional, no NULLs 104.63 µs -> 75.01 µs -28.4%
BinaryView optional, half NULLs 143.25 µs -> 133.06 µs -7.4%
StringView mandatory, no NULLs 105.98 µs -> 72.27 µs -30.7%
StringView optional, no NULLs 104.62 µs -> 75.41 µs -29.2%
StringView optional, half NULLs 141.86 µs -> 132.20 µs -6.8%
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🤖 Arrow criterion benchmark (GKE): comparing optimize-byte-view-dict-decoder (fe1728d) to 89b1497 (merge-base).
…ther

Raises the chunk size from 8 to 16 to match apache#9746's finding for the RLE dict gather, and replaces the raw-pointer writes with a spare-capacity slice of `MaybeUninit` so the unsafe surface is confined to one slice index and one `set_len`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
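A hedged sketch of the spare-capacity pattern (hypothetical names, `u64` standing in for the 128-bit view): values are written through a `MaybeUninit` slice taken from the Vec's uninitialised tail, so the unsafe surface shrinks to one `set_len`. Bulk validation is elided here into an `assert!` for brevity.

```rust
use std::mem::MaybeUninit;

fn gather_into_spare(keys: &[i32], dict: &[u64], out: &mut Vec<u64>) {
    // Stand-in for the chunked max-reduction validation.
    assert!(keys.iter().all(|&k| (k as u32 as usize) < dict.len()));
    out.reserve(keys.len());
    let start = out.len();
    let spare = &mut out.spare_capacity_mut()[..keys.len()];
    for (slot, &k) in spare.iter_mut().zip(keys) {
        // `MaybeUninit::write` is a safe slice write -- no raw pointers.
        slot.write(dict[k as usize]);
    }
    // SAFETY: the first `keys.len()` spare slots were just initialised.
    unsafe { out.set_len(start + keys.len()) };
}

fn main() {
    let dict = [10u64, 20, 30];
    let mut out = vec![999];
    gather_into_spare(&[2, 0, 1, 2], &dict, &mut out);
    assert_eq!(out, [999, 30, 10, 20, 30]);
}
```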
Change `RleDecoder::get_batch_with_dict` (`pub(crate)`) to take `&mut [MaybeUninit<T>]` so callers can gather directly into `Vec::spare_capacity_mut()` without zero-initialising first. In `ByteViewArrayDecoderDictionary::read`, the common `base_buffer_idx == 0` case now calls a new `DictIndexDecoder::read_with_dict` that delegates to `get_batch_with_dict`, skipping the intermediate index-buffer pass. The `base_buffer_idx != 0` branch keeps the chunked-gather fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
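The calling convention this signature enables can be sketched like so, with hypothetical helper names: the decoder writes into a caller-provided `&mut [MaybeUninit<T>]` and reports how many slots it filled, so the caller can pass `Vec::spare_capacity_mut()` and `set_len` afterwards instead of zero-initialising with `resize(.., 0)`. An RLE run does one dictionary lookup and fills slots directly, with no per-key gather.

```rust
use std::mem::MaybeUninit;

// Fill up to `len` output slots with `dict[key]`; return slots written.
fn decode_rle_run<T: Copy>(dict: &[T], key: usize, len: usize, out: &mut [MaybeUninit<T>]) -> usize {
    let n = len.min(out.len());
    let value = dict[key]; // one dictionary lookup per run
    for slot in &mut out[..n] {
        slot.write(value);
    }
    n
}

fn main() {
    let dict = [111u32, 222];
    let mut out: Vec<u32> = Vec::new();
    out.reserve(6);
    let spare = out.spare_capacity_mut();
    // Two RLE runs: 4 copies of dict[1], then 2 copies of dict[0].
    let w1 = decode_rle_run(&dict, 1, 4, &mut spare[..]);
    let w2 = decode_rle_run(&dict, 0, 2, &mut spare[w1..]);
    // SAFETY: exactly `w1 + w2` slots were initialised above.
    unsafe { out.set_len(w1 + w2) };
    assert_eq!(out, [222, 222, 222, 222, 111, 111]);
}
```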
The 1024-entry scratch is only used when decoding bit-packed runs. Moving the `get_or_insert_with` call inside the `else if self.bit_packed_left > 0` branch means RLE-only streams skip the allocation entirely, and the `Option` discriminant check is paid only where the buffer is actually read. Relies on Rust's disjoint field borrows to hold both `self.bit_reader` and `self.index_buf` mutably at once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
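The lazy-scratch pattern can be sketched with a hypothetical type (the real one is parquet's `RleDecoder`): the index buffer is an `Option` allocated only when a bit-packed run is actually hit, so RLE-only streams never pay for it, and `get_or_insert_with` borrows only its own field, leaving the others (the bit reader in the real code) free for a simultaneous mutable borrow.

```rust
struct LazyScratchDecoder {
    rle_left: usize,
    bit_packed_left: usize,
    index_buf: Option<Box<[i32; 1024]>>,
}

impl LazyScratchDecoder {
    /// Returns how many scratch entries were (notionally) refilled.
    fn refill_scratch(&mut self) -> usize {
        if self.rle_left > 0 {
            self.rle_left -= 1;
            0 // RLE run: scratch untouched, never allocated
        } else if self.bit_packed_left > 0 {
            self.bit_packed_left -= 1;
            // Allocated lazily, on the first bit-packed run only.
            let buf = self.index_buf.get_or_insert_with(|| Box::new([0; 1024]));
            buf.len()
        } else {
            0
        }
    }
}

fn main() {
    let mut d = LazyScratchDecoder { rle_left: 2, bit_packed_left: 1, index_buf: None };
    d.refill_scratch();
    d.refill_scratch();
    assert!(d.index_buf.is_none()); // two RLE runs: no allocation
    assert_eq!(d.refill_scratch(), 1024); // first bit-packed run allocates
    assert!(d.index_buf.is_some());
}
```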
run benchmark arrow_reader_clickbench
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize-byte-view-dict-decoder (e344717) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
Which issue does this PR close?
None — a targeted optimisation surfaced by profiling `profile_clickbench` locally.

Rationale for this change
`ByteViewArrayDecoderDictionary::read` is the inner loop for reading dictionary-encoded `StringView`/`BinaryView` columns. The previous shape expanded every RLE run through an intermediate index buffer and ran each decoded key through a bounds-checked `get` + `Some`/`None` branch plus a deferred-error capture. `Vec::extend(Map<_, closure>)` also misses `TrustedLen`, so the per-push capacity check stayed in the hot loop.

What changes are included in this PR?
Two layered changes:
1. Fuse RLE decode with view gather (no zero-init).
`RleDecoder::get_batch_with_dict` (internal, `pub(crate)`) now takes `&mut [MaybeUninit<T>]` so callers can gather straight into `Vec::spare_capacity_mut()` + `set_len` — no upfront `resize(..., 0)`. A new `DictIndexDecoder::read_with_dict` exposes this for the dict-view decoder. When `base_buffer_idx == 0` (the common case: dictionary buffers are the last buffers in the output), the dict-view decoder calls `read_with_dict` directly and the intermediate 1024-entry index buffer is bypassed. RLE runs now fill view slots with no per-key gather at all. The scratch `index_buf` inside `RleDecoder` is also allocated lazily, only when a bit-packed run is actually read.

2. Bulk-validated chunked gather where fusion doesn't apply.

For the `base_buffer_idx != 0` fallback (a buffer-index rewrite is needed on every view), the read loop does a 16-key chunked gather with bulk max-reduction validation — the same shape as `RleDecoder::get_batch_with_dict` in #9746. The bit-packed leftover drain in `read_with_dict` follows the same pattern.

Supporting cleanups driven by asm inspection:
- `adjust_buffer_index` rewritten as `view.wrapping_add((is_long * base as u128) << 64)` so LLVM emits `csel` inside the chunked loop instead of a per-view conditional branch to an out-of-line adjustment block.
- `.all(|&k| cond)` replaced with a `u32` max-reduction. `.all()` short-circuits and blocks autovectorisation; the fold form compiles to `ldp q1,q0 + umax.4s + umaxv.4s + cmp + b.hs` on aarch64 — one SIMD load, one branch, reusing NEON registers for the gather.
- `k as u32` correctly rejects negative i32 (corrupt data) — the negative value becomes a very large u32 and fails the max-reduction check.

Are these changes tested?
Existing unit tests in `byte_view_array`, `encodings::rle`, and `arrow::decoder::dictionary_index` pass. The `RleDecoder::get_batch_with_dict` signature change also required rerouting the other in-crate caller (`encodings::decoding::DictDecoder::get`), which is covered by its own tests.

Microbenchmarks (`parquet/benches/arrow_reader.rs`, `arrow_array_reader/(String|Binary)ViewArray/dictionary *`, aarch64 / Apple Silicon, 5 s measurement, baseline = current `apache/main`):

Half-NULL cases gain less because roughly half the views are null padding rather than gather output.
Are there any user-facing changes?
None — same public API, same semantics (invalid dictionary indices still surface as `ParquetError::General`). The `RleDecoder::get_batch_with_dict` signature change is internal to the crate.

🤖 Generated with Claude Code