Filter pushdown selectivity threshold #9414
Conversation
Previously, every predicate in the RowFilter received the same ProjectionMask containing ALL filter columns. This caused unnecessary decoding of expensive string columns when evaluating cheap integer predicates. Now each predicate receives a mask with only the single column it needs.

Key sync improvements (vs baseline):

- Q37: 63.7ms -> 7.3ms (-88.6%, Title LIKE with CounterID=62 filter)
- Q36: 117ms -> 24ms (-79.5%, URL <> '' with CounterID=62 filter)
- Q40: 17.9ms -> 5.1ms (-71.5%, multi-pred with RefererHash eq)
- Q41: 17.3ms -> 5.5ms (-68.1%, multi-pred with URLHash eq)
- Q22: 303ms -> 127ms (-58.2%, 3 string predicates)
- Q42: 7.6ms -> 3.9ms (-48.5%, int-only multi-predicate)
- Q38: 19.1ms -> 12.4ms (-34.9%, 5 int predicates)
- Q21: 159ms -> 98ms (-38.5%, URL LIKE + SearchPhrase)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
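For illustration, a minimal sketch of a per-predicate mask with the parquet crate's `RowFilter` API; the column index `counter_id_idx`, the assumed `Int64` physical type, and the `CounterID = 62` literal are taken from the Q37-style queries above for illustration only, not code from this PR:

```rust
use std::fs::File;

use arrow::array::{BooleanArray, Int64Array};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReader, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;
use parquet::errors::Result;

fn reader_with_counter_id_filter(path: &str) -> Result<ParquetRecordBatchReader> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?;

    // Assumed leaf index of the CounterID column in the Parquet schema.
    let counter_id_idx = 0;

    // The mask holds only the one column this predicate needs, so evaluating
    // CounterID = 62 never forces decoding of the expensive string columns.
    let counter_id_eq_62 = ArrowPredicateFn::new(
        ProjectionMask::leaves(builder.parquet_schema(), [counter_id_idx]),
        |batch| {
            // The batch handed to the predicate contains only the masked column.
            let col = batch
                .column(0)
                .as_any()
                .downcast_ref::<Int64Array>()
                .expect("CounterID assumed to be Int64 for this sketch");
            Ok(BooleanArray::from_iter(
                col.iter().map(|v| Some(v == Some(62))),
            ))
        },
    );

    builder
        .with_row_filter(RowFilter::new(vec![Box::new(counter_id_eq_62)]))
        .build()
}
```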
Use page-level min/max statistics (via StatisticsConverter) to compute a RowSelection that skips pages where equality predicates cannot match. For each equality predicate with an integer literal, we check if the literal falls within each page's [min, max] range and skip pages where it doesn't.

Impact is data-dependent - most effective when data is sorted/clustered by the filter column. For this particular 100K-row sample file the data isn't sorted by filter columns, so improvements are modest (~5% for some CounterID=62 queries). Would show larger gains on sorted datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
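A rough sketch of the page-skipping idea; the `StatisticsConverter` plumbing is omitted, and the hypothetical helper below assumes the per-page mins, maxes, and row counts for the filter column have already been extracted:

```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

/// Hypothetical helper: given per-page [min, max] ranges and row counts for the
/// filter column, build a RowSelection that skips every page whose range cannot
/// contain the equality literal.
fn selection_from_page_stats(
    page_mins: &[i64],
    page_maxes: &[i64],
    page_row_counts: &[usize],
    literal: i64,
) -> RowSelection {
    let selectors: Vec<RowSelector> = page_mins
        .iter()
        .zip(page_maxes)
        .zip(page_row_counts)
        .map(|((&min, &max), &rows)| {
            if literal >= min && literal <= max {
                // The page might contain matching rows: keep (decode) it.
                RowSelector::select(rows)
            } else {
                // The literal is outside [min, max]: the page cannot match, skip it.
                RowSelector::skip(rows)
            }
        })
        .collect();
    RowSelection::from(selectors)
}
```

The resulting selection could then be applied with `with_row_selection` on the reader builder before any row filter runs.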
Put the cheapest/most selective predicate first: SearchPhrase <> '' filters ~87% of rows before expensive LIKE predicates run. This reduces string column decoding for Title and URL significantly. Q22 sync: ~6% improvement, Q22 async: ~13% improvement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
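The ordering is simply the order of the predicate Vec passed to `RowFilter::new`; a sketch, assuming `search_phrase_not_empty`, `title_like`, and `url_like` are hypothetical `ArrowPredicateFn`s built as in the earlier example:

```rust
// Cheapest/most selective predicate first: the expensive LIKE predicates then
// only see the ~13% of rows that survive SearchPhrase <> ''.
let filter = RowFilter::new(vec![
    Box::new(search_phrase_not_empty),
    Box::new(title_like),
    Box::new(url_like),
]);
```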
run benchmark arrow_reader_clickbench

🤖

🤖: Benchmark completed
run benchmark arrow_reader_clickbench

run benchmark arrow_reader_clickbench
@alamb can you get the runner "unstuck" again? :D
Done. What I think is happening is that the runner is being OOM-killed:
I won't schedule any more benchmarks from that branch. Interesting about the mem usage.
run benchmark arrow_reader_clickbench

🤖

🤖: Benchmark completed

Nice
Which issue does this PR close?
Rationale for this change
It can be better to skip pushing down (combined) filters with low effectiveness altogether, as there is still overhead from the many small, individual skip/read operations in the Parquet decoder.
What changes are included in this PR?
This adds a simple threshold that skips pushing the filter down when the current selection is not "effective", i.e. when the fraction of rows it filters out falls below the threshold.
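As a hedged sketch of the idea (the constant name, its value, and the helper below are illustrative, not the names or value used in this PR):

```rust
/// Illustrative threshold: only keep pushing filters down while the selection
/// built so far filters out at least this fraction of rows.
const MIN_FILTERED_FRACTION: f64 = 0.01;

/// Returns true if the selection is "effective" enough to be worth pushing down.
fn selection_is_effective(selected_rows: usize, total_rows: usize) -> bool {
    if total_rows == 0 {
        return false;
    }
    let filtered_fraction = 1.0 - (selected_rows as f64 / total_rows as f64);
    filtered_fraction >= MIN_FILTERED_FRACTION
}
```

When the check fails, the remaining predicates would simply be evaluated on the decoded batches instead of being pushed into the decoder, avoiding many small skip/read operations for little benefit.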
Are these changes tested?
Are there any user-facing changes?