OptimizeProjections: prune struct-only UNNEST when outputs are unused and ancestors are multiplicity-insensitive#20668
Draft
kosiew wants to merge 7 commits intoapache:mainfrom
Draft
OptimizeProjections: prune struct-only UNNEST when outputs are unused and ancestors are multiplicity-insensitive#20668kosiew wants to merge 7 commits intoapache:mainfrom
kosiew wants to merge 7 commits intoapache:mainfrom
Conversation
- Added handling for volatile expressions to impact the optimization process within the `optimize_projections` function. - Introduced checks for volatile expressions in both plan and ancestor nodes to adjust required indices accordingly. - Updated `RequiredIndices` struct to track whether it encounters volatile expressions and to handle multiplicity sensitivity. - Implemented new utility functions to streamline the processing of child requirements and eliminate unnecessary unnesting when certain conditions are met. - Added unit tests to validate the new functionality related to unnesting and aggregation on volatile expression scenarios.
- Allow elimination of unnest operation for empty lists while preserving nulls. - Modify the `eliminate_unnest_when_only_group_keys_are_required` test case to specify struct unnest conditions. - Introduce a new test case `keep_list_unnest_when_group_keys_are_only_required_outputs` to verify unnest behavior when only group keys are required. - Ensure that the optimization logic correctly handles different unnest scenarios based on list and struct types.
- Introduced new SQL Logic Tests to validate unnest pruning behavior in DataFusion. - Tests include scenarios with empty lists and null values to ensure correct handling of cardinality-sensitive cases. - Added explanations for expected logical plans for both aggregation and selection queries.
…projections - Removed repetitive code for handling volatile ancestors across different input plans. - Introduced a new helper function `with_volatile_if_needed` to encapsulate the logic of conditionally adding a volatile ancestor. - Improved code readability and maintainability by reducing duplication in `optimize_projections` method.
…city and volatility - Introduced methods `for_multiplicity_sensitive_child` and `for_multiplicity_insensitive_child` for better handling of child requirements in `RequiredIndices`. - Replaced usage of `with_volatile_if_needed` with `with_plan_volatile` and `with_volatile_ancestor_if` for clearer logic when managing volatile context. - Updated `optimize_projections` function to use new methods, improving code readability and maintainability.
…ts for unnest pruning - Updated the `rewrite_projection_given_requirements` function to enhance handling of projection requirements based on additional conditions such as projected benefit, multiplicity sensitivity, and volatile ancestors. - Added a new SQL logic test to validate the pruning of struct unnest in cases where it is cardinality-preserving and outputs are irrelevant. - Improved comments for clarity on unnest semantics regarding null preservation.
…te_projection_given_requirements function This change simplifies the logic in the rewrite_projection_given_requirements function by removing the check for projection benefit, which was deemed unnecessary. This helps streamline the code and improve readability.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
Logical
UNNESTcan sometimes be proven unnecessary when it does not contribute any required output columns and when removing it cannot change query results.However,
UNNESTis not always safe to remove:count(*), certain windows) are multiplicity-sensitive, meaning they can observe row-count changes.This PR adds strict, logical-level gating so
UNNESTis eliminated only when these semantic hazards are ruled out.What changes are included in this PR?
Track multiplicity/volatility context through projection pruning
Extend
RequiredIndiceswith:multiplicity_sensitive: whether ancestor operators can observe row-multiplicity changes.has_volatile_ancestor: whether any ancestor expression is volatile.Propagate volatility presence from each plan node (
plan.expressions().any(Expr::is_volatile)) into child requirements.Mark children as multiplicity-sensitive/insensitive depending on the operator context (e.g., aggregate/window/join paths).
Logical pruning of
LogicalPlan::Unnestwhen safeAdd
can_eliminate_unnest+ helpers:dependency_indicesand qualified-field equality.If eligible, replace
Unnestwith its input and compute child requirements using passthrough dependencies.Otherwise, keep existing behavior and make the child requirement explicitly multiplicity-sensitive.
Tests
Unit tests:
count(1)/count(*)).preserve_nullsis disabled.SQLLogicTest coverage (
optimizer_unnest_prune.slt):EXPLAINassertsUnnestis removed only for safe struct case.EXPLAINassertsUnnestremains for list case and for multiplicity-sensitive count.Are these changes tested?
Yes.
Added unit tests in the optimizer module covering both positive (safe elimination) and negative (must keep) scenarios.
Added SQLLogicTests validating:
Unnestabsent only in the safe struct case).Are there any user-facing changes?
Yes (query-plan / performance behavior).
EXPLAINwill no longer showUnnestand execution may be faster due to avoiding unnecessary expansion.LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.