
[SPARK-56633][SQL][TESTS] Add comprehensive Parquet vectorized-reader benchmark coverage #55558

Draft
LuciferYang wants to merge 2 commits into apache:master from LuciferYang:parquet-benchmark-coverage

Conversation

@LuciferYang (Contributor) commented Apr 27, 2026

### What changes were proposed in this pull request?

Add benchmark coverage for the Parquet vectorized-read decode paths that currently have none, and extend the existing VectorizedRleValuesReaderBenchmark to cover its full public API (a minimal sketch of the shared harness shape follows the list):

  • ParquetVectorUpdaterBenchmark (new) — every ParquetVectorUpdater family obtained through ParquetVectorUpdaterFactory.getUpdater. Six groups: identity, type-converting, rebase, unsigned, decimal, FixedLenByteArray.
  • VectorizedDeltaReaderBenchmark (new) — all three delta decoders (VectorizedDeltaBinaryPackedReader, VectorizedDeltaByteArrayReader, VectorizedDeltaLengthByteArrayReader). Five groups covering bulk read/skip across value distributions and prefix-overlap shapes, plus single-value reads and byte/short/unsigned variants.
  • VectorizedPlainValuesReaderBenchmark (new) — every public read/skip method on VectorizedPlainValuesReader. Five groups: fixed-size bulk, conversion bulk (unsigned, with-rebase), variable-length, single-value, skip.
  • VectorizedRleValuesReaderBenchmark (extended) — three new groups: row-index-filtered reads (with-filter code path), single-value reads, skip paths.
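
All of these follow the standard `Benchmark`/`BenchmarkBase` harness used across sql/core benchmarks. A minimal sketch of that shared shape, with a placeholder object name and case body (not the PR's actual code):

```scala
import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}

// Placeholder skeleton of the harness shape shared by the new benchmark files.
object ParquetDecodeBenchmarkSketch extends BenchmarkBase {
  private val numRows = 4096

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    runBenchmark("Parquet decode") {
      val benchmark = new Benchmark("readIntegers", numRows, output = output)
      benchmark.addCase("plain INT32 bulk read") { _ =>
        // decode numRows values here, e.g. reader.readIntegers(numRows, vector, 0)
        ()
      }
      benchmark.run()
    }
  }
}
```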

### Why are the changes needed?

ParquetVectorUpdater and the delta / plain decoders sit on the hot path of every Parquet column read but have no in-repo benchmark coverage. Coverage is intentionally broad: every public read/skip method is included even when it is already memcpy-optimal, so the result files track the long-term performance baseline and future iterative optimization does not have to add benchmark coverage as a prerequisite.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

  • Pass GitHub Actions.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@LuciferYang marked this pull request as draft on April 27, 2026 at 07:59
@LuciferYang (Contributor, Author) commented:

will update benchmark results later

@LuciferYang force-pushed the parquet-benchmark-coverage branch 3 times, most recently from b6f5da6 to 0e9b3fa on April 27, 2026 at 09:33
… benchmark coverage

### What changes were proposed in this pull request?

Add comprehensive benchmark coverage for the Parquet vectorized-read decode
paths via three new benchmark files plus an extension to the existing
`VectorizedRleValuesReaderBenchmark`:

  * `ParquetVectorUpdaterBenchmark` (new) - every `ParquetVectorUpdater`
    family obtained from `ParquetVectorUpdaterFactory`: identity (Boolean,
    Byte, Short, Integer, Long, Float, Double, Binary), type-converting
    (IntegerToLong, IntegerToDouble, FloatToDouble, DateToTimestampNTZ,
    DowncastLong), rebase (IntegerWithRebase, LongWithRebase, LongAsMicros),
    unsigned (UnsignedInteger, UnsignedLong), decimal (IntegerToDecimal,
    LongToDecimal, BinaryToDecimal, FixedLenByteArrayToDecimal), and
    FixedLenByteArray (FixedLenByteArrayUpdater, FixedLenByteArrayAsInt,
    FixedLenByteArrayAsLong); a hand-driven decode sketch follows this list.
  * `VectorizedDeltaReaderBenchmark` (new) - all three delta decoders.
    Group A/B: DELTA_BINARY_PACKED INT32/INT64 read+skip across constant /
    monotonic / small-delta-random / wide-random distributions.
    Group C: DELTA_BYTE_ARRAY read+skip across prefix-overlap shapes.
    Group D: DELTA_LENGTH_BYTE_ARRAY read+skip across payload sizes.
    Group E: variant reads on DeltaBinaryPackedReader (readBytes,
    readShorts, readUnsignedIntegers, readUnsignedLongs, skipBytes,
    skipShorts, single-value readByte/Short/Integer/Long) plus
    DeltaByteArrayReader.readBinary(int len).
  * `VectorizedPlainValuesReaderBenchmark` (new) - every public read/skip
    method on `VectorizedPlainValuesReader` across five groups:
    fixed-size bulk, conversion bulk (unsigned, with-rebase), variable-
    length, single-value, skip.
  * `VectorizedRleValuesReaderBenchmark` (extension) - new groups added:
    Group E: row-index-filtered reads (exercises the with-filter path of
    `readBatchInternal` / `readBatchInternalWithDefLevels`); two filter
    shapes x three null ratios x with/without def-level materialization.
    Group F: per-call overhead of readBoolean / readInteger /
    readValueDictionaryId looped NUM_ROWS times.
    Group G: skipBooleans / skipIntegers across the same parameter sweeps
    as Groups A and B.
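
For orientation, here is a hand-driven decode of one PLAIN-encoded INT32 page through a factory-obtained updater. This is a sketch, not code from the PR: it assumes the package-private `ParquetVectorUpdaterFactory` constructor (logical-type annotation, convert timezone, datetime/int96 rebase mode and timezone) and the `getUpdater(ColumnDescriptor, DataType)` / `readValues(total, offset, vector, valuesReader)` signatures as on current master, and that it runs inside the `org.apache.spark.sql.execution.datasources.parquet` package where the benchmarks live:

```scala
import java.nio.{ByteBuffer, ByteOrder}

import org.apache.parquet.bytes.ByteBufferInputStream
import org.apache.parquet.schema.MessageTypeParser

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType

val numValues = 4096

// PLAIN INT32 is just little-endian 4-byte values laid end to end.
val buf = ByteBuffer.allocate(numValues * 4).order(ByteOrder.LITTLE_ENDIAN)
(0 until numValues).foreach(i => buf.putInt(i))
buf.flip()

val reader = new VectorizedPlainValuesReader()
reader.initFromPage(numValues, ByteBufferInputStream.wrap(buf))

// Descriptor for a bare `required int32` column, no logical-type annotation.
val schema = MessageTypeParser.parseMessageType("message m { required int32 c; }")
val descriptor = schema.getColumns.get(0)

// CORRECTED rebase modes mean no rebase work on the decode path.
val factory = new ParquetVectorUpdaterFactory(
  null, null, "CORRECTED", "UTC", "CORRECTED", "UTC")
val updater = factory.getUpdater(descriptor, IntegerType)

val vector = new OnHeapColumnVector(numValues, IntegerType)
updater.readValues(numValues, 0, vector, reader) // routes to IntegerUpdater
```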

### Why are the changes needed?

Coverage is intentionally broad - every public read/skip method is included
even when no obvious optimization opportunity exists today, so the result
files track the long-term performance baseline of the Parquet decode
surface and future iterative optimization does not have to add benchmark
coverage as a prerequisite.

### Implementation notes

  * Updater instances are obtained via the production
    `ParquetVectorUpdaterFactory.getUpdater` entry point so the benchmark
    exercises the full configuration matrix (logical-type annotation,
    rebase mode, timezone) the production decoder uses. Tricky cases
    (`DowncastLongUpdater`, `BinaryToDecimalUpdater`,
    `FixedLenByteArrayToDecimalUpdater`) include a brief comment noting
    the routing predicate that selects them, since slight changes to the
    descriptor or target Spark type re-route to a different Updater.
  * Each case pre-warms the decode path before `benchmark.addCase` to
    stabilize first-case JIT state (a follow-up to the SPARK-56522
    review feedback); the per-case pattern is sketched after this list.
  * Variable-length cases call `vector.reset()` at the start of each
    iteration so the binary vector's child arrayData does not accumulate
    payload bytes across iterations.
  * For row-index-filtered cases in `VectorizedRleValuesReaderBenchmark`,
    a fresh `ParquetReadState` is constructed per measurement iteration
    because `rowRanges` is iterated forward and not reset by the existing
    resetForNewBatch / resetForNewPage entry points.
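
The pre-warm and reset notes combine into a per-case pattern roughly like the following sketch (`addWarmedCase`, the warm-up count, and the `decodeOnce` body are hypothetical, not the PR's code):

```scala
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.BinaryType

// Hypothetical helper showing the per-case shape for a variable-length case.
def addWarmedCase(benchmark: Benchmark, numRows: Int): Unit = {
  val vector = new OnHeapColumnVector(numRows, BinaryType)
  def decodeOnce(): Unit = {
    vector.reset() // drop payload bytes accumulated in the child arrayData
    // ... re-init the reader from the pre-built page bytes, then bulk-read
    // numRows binary values into `vector`
  }
  // Pre-warm before addCase so the first measured case does not pay one-off
  // JIT compilation cost for this decode path.
  (0 until 3).foreach(_ => decodeOnce())
  benchmark.addCase("variable-length binary read") { _ => decodeOnce() }
}
```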

### Does this PR introduce _any_ user-facing change?

No. Benchmark-only addition.

### How was this patch tested?

  * `build/sbt sql/Test/compile` clean (including scalastyle).
  * Result files to be generated on GHA on JDK 17/21/25 to establish
    baseline.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
@LuciferYang force-pushed the parquet-benchmark-coverage branch from 0e9b3fa to e30ce3e on April 27, 2026 at 13:06
…uet.ParquetVectorUpdaterBenchmark (JDK 17, Scala 2.13, split 1 of 1)