
parquet: speed up ByteView dictionary decoder #9745

Draft
Dandandan wants to merge 9 commits into apache:main from Dandandan:optimize-byte-view-dict-decoder

Conversation


@Dandandan Dandandan commented Apr 16, 2026

Which issue does this PR close?

None — a targeted optimisation surfaced by profiling `profile_clickbench` locally.

Rationale for this change

ByteViewArrayDecoderDictionary::read is the inner loop for reading dictionary-encoded StringView / BinaryView columns. The previous shape expanded every RLE run through an intermediate index buffer and ran each decoded key through a bounds-checked get + Some/None branch + a deferred-error capture. Vec::extend(Map<_, closure>) also misses TrustedLen, so the per-push capacity check stayed in the hot loop.

What changes are included in this PR?

Two layered changes:

1. Fuse RLE decode with view gather (no zero-init).
RleDecoder::get_batch_with_dict (internal, pub(crate)) now takes &mut [MaybeUninit<T>] so callers can gather straight into Vec::spare_capacity_mut() + set_len — no upfront resize(..., 0). A new DictIndexDecoder::read_with_dict exposes this for the dict-view decoder. When base_buffer_idx == 0 (the common case: dictionary buffers are the last buffers in the output), the dict-view decoder calls read_with_dict directly, and the intermediate 1024-entry index buffer is bypassed. RLE runs now fill view slots with no per-key gather at all. The scratch index_buf inside RleDecoder is also allocated lazily, only when a bit-packed run is actually read.

2. Bulk-validated chunked gather where fusion doesn't apply.
For the base_buffer_idx != 0 fallback (buffer-index rewrite needed on every view), the read loop does a 16-key chunked gather with bulk max-reduction validation — same shape as RleDecoder::get_batch_with_dict in #9746. Bit-packed leftover drain in read_with_dict follows the same pattern.

Supporting cleanups driven by asm inspection:

  • adjust_buffer_index rewritten as view.wrapping_add((is_long * base as u128) << 64) so LLVM emits csel inside the chunked loop instead of a per-view conditional branch to an out-of-line adjustment block.
  • .all(|&k| cond) replaced with a u32 max-reduction. .all() short-circuits and blocks autovectorisation; the fold form compiles to ldp q1,q0 + umax.4s + umaxv.4s + cmp + b.hs on aarch64 — one SIMD load, one branch, reusing NEON registers for the gather.
  • Casting keys via k as u32 correctly rejects negative i32 (corrupt data) — the negative value becomes a very large u32 and fails the max-reduction check.
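The first two cleanups can be shown in isolation. This is an assumption-laden sketch, not the decoder's code: it stands in for the internal `adjust_buffer_index` and the chunk-validation fold, using the Arrow ByteView layout (low 4 bytes = length, bits 64..96 = buffer index for long views).

```rust
/// Branchless buffer-index adjustment: for short views (len <= 12) the
/// multiplier is 0 and the add is a no-op, so LLVM can lower this to a
/// conditional select instead of a branch.
fn adjust_buffer_index(view: u128, base: u32) -> u128 {
    let len = view as u32;
    let is_long = (len > 12) as u128;
    view.wrapping_add((is_long * base as u128) << 64)
}

/// Bulk key validation: max-reduce then compare once. Unlike `.all()`,
/// the fold has no short-circuit, so it autovectorises. The `as u32`
/// cast sends negative keys to huge values, which fail the compare.
fn chunk_in_bounds(chunk: &[i32], dict_len: u32) -> bool {
    let max = chunk.iter().fold(0u32, |m, &k| m.max(k as u32));
    max < dict_len
}

fn main() {
    // Short view (len 5): untouched. Long view (len 20): index += base.
    assert_eq!(adjust_buffer_index(5, 7), 5);
    assert_eq!(adjust_buffer_index(20, 7), 20 + (7u128 << 64));
    assert!(chunk_in_bounds(&[0, 3, 1], 4));
    assert!(!chunk_in_bounds(&[0, 4, 1], 4)); // out of range
    assert!(!chunk_in_bounds(&[0, -1, 1], 4)); // negative -> huge u32
}
```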

Are these changes tested?

Existing unit tests in byte_view_array, encodings::rle, and arrow::decoder::dictionary_index pass. The RleDecoder::get_batch_with_dict signature change also required rerouting the other in-crate caller (encodings::decoding::DictDecoder::get), which is covered by its own tests.

Microbenchmarks (parquet/benches/arrow_reader.rs, arrow_array_reader/(String|Binary)ViewArray/dictionary *, aarch64 / Apple Silicon, 5 s measurement, baseline = current apache/main):

Bench                             main      this PR   Δ
BinaryView mandatory, no NULLs    92.2 µs   59.2 µs   −36%
BinaryView optional, no NULLs     94.2 µs   61.5 µs   −35%
BinaryView optional, half NULLs   139.1 µs  106.0 µs  −24%
StringView mandatory, no NULLs    94.0 µs   59.6 µs   −37%
StringView optional, no NULLs     101.5 µs  61.5 µs   −39%
StringView optional, half NULLs   135.4 µs  105.4 µs  −22%

Half-NULL cases gain less because roughly half the views are null padding rather than gather output.

Are there any user-facing changes?

None — same public API, same semantics (invalid dictionary indices still surface as ParquetError::General). The RleDecoder::get_batch_with_dict signature change is internal to the crate.

🤖 Generated with Claude Code

Replace the `extend(keys.iter().map(...))` loop in
`ByteViewArrayDecoderDictionary::read` with a `chunks_exact(8)` loop
that bulk-validates each chunk's keys, then uses `get_unchecked`
gather plus raw-pointer writes. Matches the pattern in
`RleDecoder::get_batch_with_dict`.

Drops per-element bounds check, per-element `error.is_none()` branch,
and `Vec::extend`'s per-push capacity check. Invalid keys now return
an error eagerly via a cold helper instead of zero-filling and
deferring.
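A minimal sketch of that chunked shape, with illustrative names and `u64` payloads rather than the decoder's 16-byte views: each 8-key chunk is bulk-validated once, then gathered with `get_unchecked`; invalid keys error eagerly instead of being zero-filled.

```rust
fn gather(dict: &[u64], keys: &[i32], out: &mut Vec<u64>) -> Result<(), String> {
    out.reserve(keys.len());
    let mut chunks = keys.chunks_exact(8);
    for chunk in &mut chunks {
        // Bulk-validate: max-reduce the chunk, one compare, one branch.
        let max = chunk.iter().fold(0u32, |m, &k| m.max(k as u32));
        if max as usize >= dict.len() {
            return Err("invalid dictionary key".to_string());
        }
        // SAFETY: every key in this chunk was bounds-checked above.
        out.extend(chunk.iter().map(|&k| unsafe { *dict.get_unchecked(k as usize) }));
    }
    // Leftover keys (< 8) take the checked path; negative keys cast to
    // huge usize values and fail the lookup.
    for &k in chunks.remainder() {
        out.push(*dict.get(k as u32 as usize).ok_or("invalid dictionary key")?);
    }
    Ok(())
}

fn main() {
    let dict = vec![10u64, 20, 30];
    let mut out = Vec::new();
    gather(&dict, &[0, 1, 2, 0, 1, 2, 0, 1, 2], &mut out).unwrap();
    assert_eq!(out, vec![10, 20, 30, 10, 20, 30, 10, 20, 30]);
    assert!(gather(&dict, &[0, 1, 2, 3, 0, 1, 2, 0], &mut Vec::new()).is_err());
}
```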

Dictionary-decode microbenchmarks (parquet/benches/arrow_reader.rs):

  BinaryView mandatory, no NULLs    102.91 µs -> 74.29 µs  -27.8%
  BinaryView optional, no NULLs     104.63 µs -> 76.65 µs  -26.9%
  BinaryView optional, half NULLs   143.25 µs -> 132.46 µs  -7.3%
  StringView mandatory, no NULLs    105.98 µs -> 73.87 µs  -28.8%
  StringView optional, no NULLs     104.62 µs -> 76.34 µs  -27.4%
  StringView optional, half NULLs   141.86 µs -> 131.85 µs  -7.1%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the parquet Changes to the parquet crate label Apr 16, 2026
@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

Two small follow-ups to the chunked-gather rewrite, both driven by
inspecting the aarch64 asm:

1) Rewrite `adjust_buffer_index` without an `if/else` so LLVM emits a
   `csel` in the hot chunked loop. Previously the main 8-key gather
   went through an out-of-line block with a conditional branch per
   view; now each view is 5 branchless instructions (ldp/cmp/csel/
   add/stp).

2) Replace `chunk.iter().all(|&k| cond)` with a max-reduction over
   `u32` keys. `.all()` short-circuits, which blocks vectorisation —
   LLVM emitted 8 sequential `ldrsw+cmp+b.ls`. The max-reduction
   compiles on aarch64 NEON to:

      ldp  q1, q0, [x1]         ; one load, 8 keys
      umax.4s  v2, v1, v0       ; pairwise lane max
      umaxv.4s s2, v2           ; horizontal reduce
      cmp  w13, w22             ; one compare
      b.hs <cold error path>    ; one branch

   The NEON registers are then reused for the gather (`fmov`/`mov.s
   v[i]`) so keys are loaded exactly once.

Casting keys via `k as u32` correctly rejects any negative i32
(corrupt data) because a negative value becomes a large u32.

Microbenchmark deltas over the previous commit (criterion, aarch64):

  BinaryView mandatory, no NULLs     74.29 µs -> 72.96 µs   -1.8%
  BinaryView optional,  no NULLs     76.65 µs -> 75.01 µs   -2.1%
  StringView mandatory, no NULLs     73.87 µs -> 72.27 µs   -2.2%
  StringView optional,  no NULLs     76.34 µs -> 75.41 µs   -1.2%

Cumulative vs. main HEAD (89b1497):

  BinaryView mandatory, no NULLs    102.91 µs -> 72.96 µs  -29.2%
  BinaryView optional,  no NULLs    104.63 µs -> 75.01 µs  -28.4%
  BinaryView optional, half NULLs   143.25 µs -> 133.06 µs  -7.4%
  StringView mandatory, no NULLs    105.98 µs -> 72.27 µs  -30.7%
  StringView optional,  no NULLs    104.62 µs -> 75.41 µs  -29.2%
  StringView optional, half NULLs   141.86 µs -> 132.20 µs  -6.8%

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4262428096-1396-tjkfl 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing optimize-byte-view-dict-decoder (fe1728d) to 89b1497 (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                             main                                   optimize-byte-view-dict-decoder
-----                                             ----                                   -------------------------------
arrow_reader_clickbench/async/Q1                  1.01   1101.7±5.83µs        ? ?/sec    1.00   1094.3±7.81µs        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.04      6.7±0.08ms        ? ?/sec    1.00      6.4±0.05ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.03      7.7±0.10ms        ? ?/sec    1.00      7.4±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.02     14.7±0.12ms        ? ?/sec    1.00     14.5±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.01     17.4±0.12ms        ? ?/sec    1.00     17.2±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.01     16.2±0.10ms        ? ?/sec    1.00     16.0±0.09ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.02      3.1±0.03ms        ? ?/sec    1.00      3.1±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.15     95.1±2.13ms        ? ?/sec    1.00     82.5±9.33ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.07    108.7±4.81ms        ? ?/sec    1.00    101.4±5.42ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.00   131.3±10.79ms        ? ?/sec    1.01    132.4±7.64ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.05    254.2±2.16ms        ? ?/sec    1.00    242.0±1.92ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.04     20.2±0.20ms        ? ?/sec    1.00     19.4±0.09ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.05     59.7±0.55ms        ? ?/sec    1.00     57.0±0.17ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.04     60.2±0.61ms        ? ?/sec    1.00     57.7±0.19ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.03     18.8±0.12ms        ? ?/sec    1.00     18.2±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 1.04     15.8±0.24ms        ? ?/sec    1.00     15.2±0.11ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 1.02      5.4±0.04ms        ? ?/sec    1.00      5.3±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.05     14.0±0.27ms        ? ?/sec    1.00     13.4±0.13ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.06     25.5±0.54ms        ? ?/sec    1.00     24.0±0.18ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 1.03      5.9±0.06ms        ? ?/sec    1.00      5.7±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 1.01      5.0±0.04ms        ? ?/sec    1.00      5.0±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.02      3.6±0.03ms        ? ?/sec    1.00      3.5±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.01   1076.9±6.54µs        ? ?/sec    1.00   1061.2±4.90µs        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.06      6.6±0.07ms        ? ?/sec    1.00      6.3±0.07ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.04      7.5±0.07ms        ? ?/sec    1.00      7.3±0.07ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.02     14.7±0.12ms        ? ?/sec    1.00     14.4±0.06ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.03     17.5±0.15ms        ? ?/sec    1.00     17.0±0.07ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.02     16.3±0.12ms        ? ?/sec    1.00     15.9±0.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.03      3.0±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.03     73.6±0.82ms        ? ?/sec    1.00     71.4±0.17ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.03     82.0±0.67ms        ? ?/sec    1.00     79.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.04    100.8±0.80ms        ? ?/sec    1.00     97.4±0.42ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.10    242.1±6.28ms        ? ?/sec    1.00    220.7±7.28ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.03     19.7±0.24ms        ? ?/sec    1.00     19.2±0.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.03     58.3±0.55ms        ? ?/sec    1.00     56.7±0.19ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.02     58.5±0.69ms        ? ?/sec    1.00     57.4±0.24ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.02     18.5±0.15ms        ? ?/sec    1.00     18.0±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    1.03     15.2±0.23ms        ? ?/sec    1.00     14.9±0.11ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    1.01      5.4±0.03ms        ? ?/sec    1.00      5.3±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.02     13.5±0.23ms        ? ?/sec    1.00     13.3±0.10ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.04     24.4±0.52ms        ? ?/sec    1.00     23.6±0.16ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    1.02      5.6±0.06ms        ? ?/sec    1.00      5.5±0.05ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    1.02      4.9±0.05ms        ? ?/sec    1.00      4.8±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.01      3.4±0.02ms        ? ?/sec    1.00      3.4±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00    873.9±2.03µs        ? ?/sec    1.00    871.5±3.76µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.06      5.1±0.02ms        ? ?/sec    1.00      4.8±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.06      6.1±0.02ms        ? ?/sec    1.00      5.7±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.02     21.9±0.06ms        ? ?/sec    1.00     21.4±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.03     30.8±0.18ms        ? ?/sec    1.00     30.0±0.25ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.03     23.4±0.14ms        ? ?/sec    1.00     22.8±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.03      2.7±0.02ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.04    125.2±3.83ms        ? ?/sec    1.00    120.5±0.23ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.04     95.2±0.35ms        ? ?/sec    1.00     91.7±0.35ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  1.01    140.2±0.36ms        ? ?/sec    1.00    139.5±3.42ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.07   286.8±14.73ms        ? ?/sec    1.00   267.8±12.67ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.03     27.5±0.07ms        ? ?/sec    1.00     26.7±0.07ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.05    111.1±0.26ms        ? ?/sec    1.00    105.9±0.16ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.05    109.2±0.21ms        ? ?/sec    1.00    104.4±0.18ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.03     18.9±0.06ms        ? ?/sec    1.00     18.5±0.11ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  1.01     22.7±0.05ms        ? ?/sec    1.00     22.5±0.09ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  1.01      6.9±0.02ms        ? ?/sec    1.00      6.8±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.00     11.6±0.03ms        ? ?/sec    1.00     11.5±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.02     21.4±0.07ms        ? ?/sec    1.00     20.9±0.05ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  1.02      5.3±0.05ms        ? ?/sec    1.00      5.2±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  1.01      5.7±0.04ms        ? ?/sec    1.00      5.6±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.01      4.4±0.03ms        ? ?/sec    1.00      4.3±0.03ms        ? ?/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 792.1s
Peak memory 3.1 GiB
Avg memory 3.0 GiB
CPU user 701.4s
CPU sys 89.0s
Peak spill 0 B

branch

Metric Value
Wall time 784.3s
Peak memory 3.2 GiB
Avg memory 3.1 GiB
CPU user 713.7s
CPU sys 70.7s
Peak spill 0 B

File an issue against this benchmark runner

Dandandan and others added 5 commits April 17, 2026 06:22
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ther

Raises the chunk size from 8 to 16 to match apache#9746's finding for the
RLE dict gather, and replaces the raw-pointer writes with a spare-
capacity slice of MaybeUninit so the unsafe surface is confined to
one slice index and one set_len.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Change `RleDecoder::get_batch_with_dict` (pub(crate)) to take
`&mut [MaybeUninit<T>]` so callers can gather directly into
`Vec::spare_capacity_mut()` without zero-initialising first.

In `ByteViewArrayDecoderDictionary::read`, the common `base_buffer_idx == 0`
case now calls a new `DictIndexDecoder::read_with_dict` that delegates
to `get_batch_with_dict`, skipping the intermediate index-buffer pass.
The `base_buffer_idx != 0` branch keeps the chunked-gather fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dandandan Dandandan changed the title parquet: speed up ByteView dictionary decoder with chunks_exact gather (~28%) parquet: speed up ByteView dictionary decoder Apr 23, 2026
The 1024-entry scratch is only used when decoding bit-packed runs.
Moving the `get_or_insert_with` call inside the `else if self.bit_packed_left > 0`
branch means RLE-only streams skip the allocation entirely, and the `Option`
discriminant check is paid only where the buffer is actually read.

Relies on Rust's disjoint field borrows to hold both `self.bit_reader` and
`self.index_buf` mutably at once.
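A minimal sketch of the lazy-allocation pattern (struct and field names are illustrative, not the crate's): the scratch buffer lives in an `Option` and is allocated by `get_or_insert_with` only on the branch that actually reads it.

```rust
struct Decoder {
    bit_packed_left: usize,
    // Allocated lazily: RLE-only streams never touch this.
    index_buf: Option<Box<[i32; 1024]>>,
}

impl Decoder {
    /// First call allocates the 1024-entry scratch; later calls reuse it.
    /// Borrowing only `self.index_buf` here leaves other fields (e.g. a
    /// bit reader) free to be borrowed mutably in the same scope.
    fn scratch(&mut self) -> &mut [i32; 1024] {
        self.index_buf.get_or_insert_with(|| Box::new([0; 1024]))
    }
}

fn main() {
    let mut d = Decoder { bit_packed_left: 0, index_buf: None };
    assert!(d.index_buf.is_none()); // no bit-packed run seen: no allocation
    d.bit_packed_left = 8;
    if d.bit_packed_left > 0 {
        d.scratch()[0] = 42; // allocate on first bit-packed read
    }
    assert_eq!(d.index_buf.as_ref().unwrap()[0], 42);
}
```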

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dandandan
Contributor Author

run benchmark arrow_reader_clickbench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4302472381-1778-dwktp 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux


Comparing optimize-byte-view-dict-decoder (e344717) to b93240a (merge-base) diff
BENCH_NAME=arrow_reader_clickbench
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_reader_clickbench
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                             main                                   optimize-byte-view-dict-decoder
-----                                             ----                                   -------------------------------
arrow_reader_clickbench/async/Q1                  1.00   1082.2±8.15µs        ? ?/sec    1.01   1087.7±4.95µs        ? ?/sec
arrow_reader_clickbench/async/Q10                 1.08      6.6±0.03ms        ? ?/sec    1.00      6.1±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q11                 1.08      7.7±0.06ms        ? ?/sec    1.00      7.1±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q12                 1.00     14.2±0.06ms        ? ?/sec    1.00     14.2±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q13                 1.00     16.9±0.11ms        ? ?/sec    1.00     16.8±0.08ms        ? ?/sec
arrow_reader_clickbench/async/Q14                 1.00     15.7±0.15ms        ? ?/sec    1.00     15.6±0.06ms        ? ?/sec
arrow_reader_clickbench/async/Q19                 1.04      3.1±0.03ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
arrow_reader_clickbench/async/Q20                 1.17     84.3±0.67ms        ? ?/sec    1.00     71.8±0.21ms        ? ?/sec
arrow_reader_clickbench/async/Q21                 1.06    96.6±11.00ms        ? ?/sec    1.00     91.4±2.44ms        ? ?/sec
arrow_reader_clickbench/async/Q22                 1.05    137.4±0.71ms        ? ?/sec    1.00    131.1±0.57ms        ? ?/sec
arrow_reader_clickbench/async/Q23                 1.04    249.9±3.91ms        ? ?/sec    1.00    240.8±2.43ms        ? ?/sec
arrow_reader_clickbench/async/Q24                 1.00     19.2±0.15ms        ? ?/sec    1.00     19.2±0.07ms        ? ?/sec
arrow_reader_clickbench/async/Q27                 1.03     58.6±0.46ms        ? ?/sec    1.00     57.0±0.22ms        ? ?/sec
arrow_reader_clickbench/async/Q28                 1.03     59.0±0.33ms        ? ?/sec    1.00     57.5±0.19ms        ? ?/sec
arrow_reader_clickbench/async/Q30                 1.01     18.2±0.07ms        ? ?/sec    1.00     18.0±0.05ms        ? ?/sec
arrow_reader_clickbench/async/Q36                 1.02     15.5±0.27ms        ? ?/sec    1.00     15.2±0.15ms        ? ?/sec
arrow_reader_clickbench/async/Q37                 1.00      5.2±0.04ms        ? ?/sec    1.01      5.3±0.03ms        ? ?/sec
arrow_reader_clickbench/async/Q38                 1.01     13.7±0.25ms        ? ?/sec    1.00     13.5±0.15ms        ? ?/sec
arrow_reader_clickbench/async/Q39                 1.04     25.0±0.31ms        ? ?/sec    1.00     24.0±0.26ms        ? ?/sec
arrow_reader_clickbench/async/Q40                 1.00      5.5±0.06ms        ? ?/sec    1.01      5.6±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q41                 1.00      4.8±0.03ms        ? ?/sec    1.02      4.9±0.04ms        ? ?/sec
arrow_reader_clickbench/async/Q42                 1.00      3.5±0.02ms        ? ?/sec    1.00      3.4±0.01ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q1     1.00   1053.1±6.57µs        ? ?/sec    1.01   1065.7±3.58µs        ? ?/sec
arrow_reader_clickbench/async_object_store/Q10    1.07      6.4±0.04ms        ? ?/sec    1.00      6.0±0.04ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q11    1.07      7.4±0.06ms        ? ?/sec    1.00      6.9±0.03ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q12    1.00     14.2±0.04ms        ? ?/sec    1.00     14.2±0.18ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q13    1.00     16.7±0.10ms        ? ?/sec    1.02     17.1±2.48ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q14    1.00     15.6±0.07ms        ? ?/sec    1.01     15.7±0.20ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q19    1.02      2.9±0.03ms        ? ?/sec    1.00      2.9±0.02ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q20    1.02     72.6±0.60ms        ? ?/sec    1.00     71.0±0.28ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q21    1.02     81.4±0.46ms        ? ?/sec    1.00     79.5±0.26ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q22    1.02     98.9±0.69ms        ? ?/sec    1.00     96.6±0.97ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q23    1.00    224.4±3.18ms        ? ?/sec    1.02    228.1±1.08ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q24    1.00     18.9±0.11ms        ? ?/sec    1.02     19.2±0.21ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q27    1.01     57.4±0.59ms        ? ?/sec    1.00     56.6±0.36ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q28    1.02     57.7±0.60ms        ? ?/sec    1.00     56.8±0.25ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q30    1.01     17.9±0.09ms        ? ?/sec    1.00     17.8±0.16ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q36    1.00     14.8±0.20ms        ? ?/sec    1.00     14.8±0.15ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q37    1.00      5.2±0.02ms        ? ?/sec    1.00      5.2±0.04ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q38    1.02     13.3±0.24ms        ? ?/sec    1.00     13.1±0.11ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q39    1.02     23.8±0.33ms        ? ?/sec    1.00     23.3±0.24ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q40    1.00      5.3±0.06ms        ? ?/sec    1.03      5.5±0.09ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q41    1.00      4.6±0.03ms        ? ?/sec    1.03      4.8±0.10ms        ? ?/sec
arrow_reader_clickbench/async_object_store/Q42    1.00      3.3±0.01ms        ? ?/sec    1.01      3.3±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q1                   1.00    879.6±2.41µs        ? ?/sec    1.00    875.7±1.42µs        ? ?/sec
arrow_reader_clickbench/sync/Q10                  1.15      5.1±0.03ms        ? ?/sec    1.00      4.4±0.01ms        ? ?/sec
arrow_reader_clickbench/sync/Q11                  1.13      6.0±0.03ms        ? ?/sec    1.00      5.3±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q12                  1.03     21.5±0.10ms        ? ?/sec    1.00     20.9±0.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q13                  1.03     24.3±0.12ms        ? ?/sec    1.00     23.7±0.12ms        ? ?/sec
arrow_reader_clickbench/sync/Q14                  1.02     22.7±0.07ms        ? ?/sec    1.00     22.4±0.08ms        ? ?/sec
arrow_reader_clickbench/sync/Q19                  1.02      2.6±0.03ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q20                  1.03    124.3±0.31ms        ? ?/sec    1.00    120.4±0.19ms        ? ?/sec
arrow_reader_clickbench/sync/Q21                  1.08     99.1±0.15ms        ? ?/sec    1.00     92.0±0.13ms        ? ?/sec
arrow_reader_clickbench/sync/Q22                  1.03    145.4±0.43ms        ? ?/sec    1.00    141.0±0.34ms        ? ?/sec
arrow_reader_clickbench/sync/Q23                  1.01    294.9±9.09ms        ? ?/sec    1.00    291.1±6.84ms        ? ?/sec
arrow_reader_clickbench/sync/Q24                  1.03     26.7±0.05ms        ? ?/sec    1.00     26.0±0.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q27                  1.04    110.2±0.18ms        ? ?/sec    1.00    106.2±0.16ms        ? ?/sec
arrow_reader_clickbench/sync/Q28                  1.04    107.8±0.13ms        ? ?/sec    1.00    103.8±0.17ms        ? ?/sec
arrow_reader_clickbench/sync/Q30                  1.02     18.3±0.05ms        ? ?/sec    1.00     18.0±0.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q36                  1.00     22.5±0.04ms        ? ?/sec    1.00     22.4±0.06ms        ? ?/sec
arrow_reader_clickbench/sync/Q37                  1.00      6.7±0.02ms        ? ?/sec    1.07      7.2±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q38                  1.01     11.5±0.02ms        ? ?/sec    1.00     11.4±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q39                  1.02     20.9±0.03ms        ? ?/sec    1.00     20.5±0.03ms        ? ?/sec
arrow_reader_clickbench/sync/Q40                  1.00      4.9±0.02ms        ? ?/sec    1.00      5.0±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q41                  1.00      5.4±0.02ms        ? ?/sec    1.00      5.4±0.02ms        ? ?/sec
arrow_reader_clickbench/sync/Q42                  1.01      4.2±0.02ms        ? ?/sec    1.00      4.2±0.02ms        ? ?/sec

Resource Usage

base (merge-base)

| Metric      | Value   |
|-------------|---------|
| Wall time   | 790.2s  |
| Peak memory | 4.6 GiB |
| Avg memory  | 4.5 GiB |
| CPU user    | 702.5s  |
| CPU sys     | 84.0s   |
| Peak spill  | 0 B     |

branch

| Metric      | Value   |
|-------------|---------|
| Wall time   | 785.2s  |
| Peak memory | 4.8 GiB |
| Avg memory  | 4.7 GiB |
| CPU user    | 713.0s  |
| CPU sys     | 72.0s   |
| Peak spill  | 0 B     |
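For illustration, the gather pattern this PR leans on (writing decoded values straight into `Vec::spare_capacity_mut()` and committing with `set_len`, so there is no upfront `resize(.., 0)` zero-fill, plus bulk key validation hoisted out of the write loop) can be sketched roughly as below. This is a minimal standalone sketch, not the decoder's real code; `gather_into_spare` and the `u64` dictionary are hypothetical simplifications of the view-gather in `read_with_dict`:

```rust
use std::mem::MaybeUninit;

// Hypothetical simplified gather: copy dictionary entries selected by `keys`
// into the spare capacity of `out`, avoiding a zero-initializing resize.
fn gather_into_spare(dict: &[u64], keys: &[u32], out: &mut Vec<u64>) {
    out.reserve(keys.len());
    let spare: &mut [MaybeUninit<u64>] = &mut out.spare_capacity_mut()[..keys.len()];

    // Bulk max-reduction validation: one branch for the whole batch instead
    // of a bounds-checked `get` + Some/None branch per key.
    let max_key = keys.iter().copied().max().unwrap_or(0);
    assert!((max_key as usize) < dict.len(), "dictionary key out of bounds");

    // Branch-free write loop into uninitialized slots.
    for (slot, &key) in spare.iter_mut().zip(keys) {
        slot.write(dict[key as usize]);
    }

    // SAFETY: the first `keys.len()` spare slots were just initialized above.
    unsafe { out.set_len(out.len() + keys.len()) };
}

fn main() {
    let dict = [10u64, 20, 30];
    let mut out = Vec::new();
    gather_into_spare(&dict, &[2, 0, 1, 1], &mut out);
    println!("{out:?}");
}
```

In the real decoder the destination slots are views rather than `u64`s and validation of bit-packed runs happens in 16-key chunks, but the shape (reserve, validate in bulk, write uninitialized, `set_len`) is the same.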
