feat(parquet): wire scan_filtered through ArrayReader stack; add with_miniblock_predicate#9770
Draft
sahuagin wants to merge 2 commits into apache:main from
Conversation
…red()
Two additions to DeltaBitPackDecoder:
1. skip() optimization: bw=0 miniblocks use an O(1) multiply instead of decoding 32/64 values per miniblock. Terminal skips (discarding all remaining page values) avoid heap allocation and last_value tracking.
2. Decoder::scan_filtered(): a new provided method on the Decoder trait (default: decode everything, a safe fallback for all encodings). DeltaBitPackDecoder overrides it to compute a conservative value range [lo, hi] per miniblock and skip non-matching miniblocks without decoding individual values.
Benchmarks vs upstream HEAD (arrow_reader bench):
- bw=0 single-value skip: -21.6%
- bw=0 increasing-value skip: -24.3%
- mixed stepped skip: -3.9%
Wall-time scan_filtered on a 1M-row DELTA file (monotone column): full decode 1.96 ms -> scan_filtered 470 µs (4.2x speedup).
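The two decoder tricks can be sketched as follows. This is an illustrative model, not the actual parquet-rs internals: `skip_bw0` and `miniblock_range` are hypothetical names, and the sketch assumes only what DELTA_BINARY_PACKED block headers already provide (a running first value, the block's `min_delta`, and the per-miniblock bit width).

```rust
/// bw=0 skip: every delta in the miniblock equals min_delta, so skipping
/// n values is a single multiply instead of decoding each value.
fn skip_bw0(last: i64, min_delta: i64, n: i64) -> i64 {
    last.wrapping_add(min_delta.wrapping_mul(n))
}

/// Conservative value range [lo, hi] for a miniblock of n values following
/// `first`. Each value adds a delta in [min_delta, min_delta + max_packed],
/// where max_packed is the largest bit-packed residual for this bit width.
fn miniblock_range(first: i64, min_delta: i64, bit_width: u8, n: i64) -> (i64, i64) {
    let max_packed = if bit_width >= 63 {
        i64::MAX
    } else {
        (1i64 << bit_width) - 1
    };
    let d_lo = min_delta;
    let d_hi = min_delta.saturating_add(max_packed);
    // After k of n steps the value lies in [first + k*d_lo, first + k*d_hi];
    // take the loosest bound over k = 1..=n.
    let lo = first.saturating_add(d_lo.min(d_lo.saturating_mul(n)));
    let hi = first.saturating_add(d_hi.max(d_hi.saturating_mul(n)));
    (lo, hi)
}

fn main() {
    // bw=0 monotone run: skip 32 values with one multiply.
    assert_eq!(skip_bw0(100, 1, 32), 132);
    // bw=0 miniblock: all 32 deltas are exactly min_delta = 1.
    assert_eq!(miniblock_range(100, 1, 0, 32), (101, 132));
    // A predicate such as |_lo, hi| hi >= 1_000 can now reject this
    // miniblock without decoding any of its values.
    let (_lo, hi) = miniblock_range(100, 1, 0, 32);
    assert!(hi < 1_000);
}
```

Because the range is conservative (it may be wider than the true min/max, never narrower), skipping a miniblock whose range fails the predicate can never drop a matching value.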
…_miniblock_predicate
Wires DeltaBitPackDecoder::scan_filtered() up the full column reader stack
and exposes it as a public API on both ParquetRecordBatchReaderBuilder
(sync) and ParquetRecordBatchStreamBuilder (async).
Wiring chain (bottom to top):
ColumnValueDecoderImpl::scan_filtered_values()
GenericColumnReader::scan_filtered_records() (mandatory columns only)
GenericRecordReader::scan_filtered_records() (optional/repeated fallback)
ArrayReader::scan_records() trait method + page-switching helper
PrimitiveArrayReader::scan_records() override
StructArrayReader::scan_records() (single-child delegate)
ParquetRecordBatchReader::next_inner (All-selection branch)
ArrowReaderBuilder::with_miniblock_predicate()
Public API additions:
- MiniblockPredicate type alias
(Arc<dyn Fn(i64, i64) -> bool + Send + Sync>)
- ArrowReaderBuilder::with_miniblock_predicate(pred) fluent setter
Example:
let pred: MiniblockPredicate = Arc::new(|_lo, hi| hi >= 1_000_000);
let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
.with_projection(mask)
.with_miniblock_predicate(pred)
.build()?;
Limitations:
- Multi-column projections fall back to full decode (column value ranges
are independent; a shared predicate would produce mismatched lengths)
- Optional/repeated columns fall back (def/rep levels must stay
synchronized with the values buffer)
- No file format changes; miniblock ranges computed on-the-fly from
DELTA_BINARY_PACKED block headers already present in the page data
Which issue does this PR close?
Depends on #9769 (the decoder PR). Please review after #9769 merges; this PR's diff includes those decoder changes while that PR is pending.
Rationale for this change
Exposes the scan_filtered miniblock-level predicate pushdown added in #9769 through the full column reader stack and as a public API. For mandatory DELTA_BINARY_PACKED
INT32/INT64 columns, entire miniblocks (32/64 values) can be skipped without decoding when a caller-supplied range predicate rules them out. This is especially effective
for monotone columns (timestamps, sequence numbers, auto-increment IDs) where bw=0 blocks allow O(1) skipping.
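To make the pruning effect concrete, here is a small simulation of miniblock skipping on a monotone column. The `prune` helper and the block ranges are hypothetical; in the real reader the `(lo, hi)` pairs come from the conservative per-miniblock ranges computed during the scan.

```rust
/// Return the indices of miniblocks whose conservative [lo, hi] range
/// passes the predicate; all other miniblocks are skipped undecoded.
fn prune(blocks: &[(i64, i64)], pred: impl Fn(i64, i64) -> bool) -> Vec<usize> {
    blocks
        .iter()
        .enumerate()
        .filter(|(_, &(lo, hi))| pred(lo, hi))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Four miniblocks of a monotone sequence-number column.
    let blocks = [(0, 31), (32, 63), (64, 95), (96, 127)];
    // Keep only miniblocks that might contain values >= 70.
    let kept = prune(&blocks, |_lo, hi| hi >= 70);
    assert_eq!(kept, vec![2, 3]);
}
```

On a sorted column the surviving miniblocks form a suffix (or prefix, for a `<=` predicate), which is why monotone timestamp and ID columns benefit the most.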
What changes are included in this PR?
Wiring chain (bottom to top): see the list above.
New public API:
pub type MiniblockPredicate = Arc<dyn Fn(i64, i64) -> bool + Send + Sync>;
// on ArrowReaderBuilder:
pub fn with_miniblock_predicate(self, predicate: MiniblockPredicate) -> Self;
Example:
let pred: MiniblockPredicate = Arc::new(|_lo, hi| hi >= 1_000_000);
let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
.with_projection(mask)
.with_miniblock_predicate(pred)
.build()?;
Limitations (documented in the MiniblockPredicate rustdoc): see the list above.
Are these changes tested?
Yes: tests cover miniblock-level skipping with no false negatives, and verify that fewer rows are returned than a full read.
Are there any user-facing changes?
Yes: two additive public API items, the MiniblockPredicate type alias and ArrowReaderBuilder::with_miniblock_predicate.
No breaking changes. The new scan_records method on the ArrayReader trait has a provided default (read_records) so no existing implementations are affected.
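A minimal sketch of why the provided default is non-breaking. The trait shape below is illustrative (the real ArrayReader signature differs); the point is that scan_records delegates to read_records by default, so an implementation written before this PR compiles and behaves unchanged.

```rust
use std::sync::Arc;

pub type MiniblockPredicate = Arc<dyn Fn(i64, i64) -> bool + Send + Sync>;

trait ArrayReader {
    fn read_records(&mut self, batch_size: usize) -> usize;

    /// Provided default: readers that cannot push the predicate down
    /// ignore it and fall back to a full decode, which is always correct.
    fn scan_records(&mut self, batch_size: usize, _pred: &MiniblockPredicate) -> usize {
        self.read_records(batch_size)
    }
}

/// An implementation that predates scan_records: no override needed.
struct LegacyReader {
    produced: usize,
}

impl ArrayReader for LegacyReader {
    fn read_records(&mut self, batch_size: usize) -> usize {
        self.produced += batch_size;
        batch_size
    }
}

fn main() {
    let pred: MiniblockPredicate = Arc::new(|_lo, hi| hi >= 1_000_000);
    let mut r = LegacyReader { produced: 0 };
    // The default method routes through the legacy read path.
    assert_eq!(r.scan_records(1024, &pred), 1024);
    assert_eq!(r.produced, 1024);
}
```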