Update aggregate dynamic filter from parquet file stats during file o…#20687
Dandandan wants to merge 2 commits into apache:main
…pening

When a parquet file is opened, its file-level statistics (min/max per column) are now used to update the aggregate dynamic filter bounds before any data is read. This enables earlier pruning of concurrent and subsequent files.

Key changes:
- Add DynamicFilterFileStatsHandler trait for updating filter bounds from file statistics
- Implement the trait for AggrDynFilter using inclusive operators (<=/>= instead of </>) for file-stats-derived bounds to preserve correctness
- Track bound_from_data per accumulator to switch to strict operators once the accumulator confirms the bound from actual data
- Add build_predicate() to AggrDynFilter, unifying predicate construction between file stats updates and accumulator updates
- Walk the predicate tree in the parquet opener to find and update dynamic filter nodes with file-level statistics

https://claude.ai/code/session_016tPwbdpgUiYwZSQNT8onup
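To illustrate why file-stats-derived bounds need inclusive operators, here is a minimal sketch for a single MIN(x) aggregate. It does not use DataFusion types: `MinBound`, `update_from_file_stats`, `update_from_data`, and `keeps` are hypothetical names invented for this example, and the real AggrDynFilter logic may differ. The idea is that a file's min statistic promises a value that no accumulator has read yet, so the row holding that value must still pass the filter; once the accumulator has seen the value, rows equal to it can no longer improve the result and a strict comparison is safe.

```rust
/// Hypothetical, simplified stand-in for one MIN(x) accumulator's bound.
/// Not the PR's actual type; it only models the inclusive/strict switch.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinBound {
    value: i64,
    /// True once the accumulator has actually read `value` from a row.
    bound_from_data: bool,
}

/// Tighten the bound from a file's min statistic (no rows read yet).
/// An existing bound that is already at least as tight is kept unchanged.
fn update_from_file_stats(current: Option<MinBound>, file_min: i64) -> MinBound {
    match current {
        Some(b) if b.value <= file_min => b,
        _ => MinBound { value: file_min, bound_from_data: false },
    }
}

/// Tighten or confirm the bound from a value the accumulator observed.
fn update_from_data(current: Option<MinBound>, observed: i64) -> MinBound {
    match current {
        Some(b) if b.value < observed => b,
        _ => MinBound { value: observed, bound_from_data: true },
    }
}

/// Would a row with value `x` pass the dynamic filter?
fn keeps(bound: MinBound, x: i64) -> bool {
    if bound.bound_from_data {
        x < bound.value // strict: `value` is already in the accumulator
    } else {
        x <= bound.value // inclusive: the row holding `value` is still unread
    }
}
```

With a stats-derived bound of 5, `keeps(bound, 5)` is true, so the row that produces the minimum is not pruned. After `update_from_data(Some(bound), 5)`, the same call returns false, because another 5 cannot change MIN(x).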
run benchmarks

🤖: Benchmark completed
Replace the derived Debug on Inner with a manual implementation that does not recursively expand the file_stats_handler field. This prevents a stack overflow during Debug formatting: AggrDynFilter holds a DynamicFilterPhysicalExpr, which in turn holds a file_stats_handler pointing back to the AggrDynFilter. https://claude.ai/code/session_016tPwbdpgUiYwZSQNT8onup
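A minimal sketch of the pattern this commit describes, using only the standard library. The `FileStatsHandler` trait, the `generation` field, and `NoopHandler` are assumptions made for illustration; only the `Inner` and `file_stats_handler` names come from the commit message, and the real struct has other fields. The point is that the manual `Debug` impl reports whether a handler is set instead of formatting the handler itself, so a handler that points back at the owning filter cannot trigger unbounded recursion.

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for the handler trait; the real trait has methods.
trait FileStatsHandler: fmt::Debug {}

/// Simplified stand-in for the filter's inner state.
struct Inner {
    generation: u64, // assumed field, for illustration only
    file_stats_handler: Option<Arc<dyn FileStatsHandler>>,
}

impl fmt::Debug for Inner {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Deliberately do not format the handler: it may reference the
        // object that owns this `Inner`, which would recurse forever.
        f.debug_struct("Inner")
            .field("generation", &self.generation)
            .field("has_file_stats_handler", &self.file_stats_handler.is_some())
            .finish()
    }
}

/// Trivial handler used only to exercise the Debug output.
#[derive(Debug)]
struct NoopHandler;

impl FileStatsHandler for NoopHandler {}
```

Formatting an `Inner` with `{:?}` then prints the handler's presence as a boolean rather than its contents.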
run benchmarks

🤖: Benchmark completed
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?