[SPARK-54864][SQL] Add rCTE nodes to NormalizePlan #53636
Changes Summary
This PR adds support for normalizing UnionLoop and UnionLoopRef nodes in the plan comparison logic. It introduces new normalization methods in CteIdNormalizer and corresponding test cases to enable testing of recursive CTEs in the single-pass analyzer.
Type: feature
Components Affected: catalyst/plans/NormalizePlan, catalyst/plans/CteIdNormalizer, test suite for NormalizePlan
Risk Areas:
- ID mapping correctness: The normalization relies on a traversal order in which UnionLoop is encountered before UnionLoopRef, so that the mapping is built correctly.
- Duplicate ID handling in normalizeDef: The logic checks whether a CTERelationDef ID is already mapped (new check at line 256), which could affect existing behavior for duplicate CTEs.
- UnionLoop ID mapping: The normalizeUnionLoop method only remaps an ID that already exists and does not insert new mappings. This is asymmetric compared to the normalizeUnionLoopRef behavior.
Full review in progress... | Powered by diffray
Changes Summary
This PR adds support for normalizing recursive CTEs (rCTEs) in Spark's query plan normalization logic by introducing handlers for UnionLoop and UnionLoopRef nodes. This enables the single-pass analyzer to correctly compare semantically identical recursive CTE queries with different internal IDs.
Type: feature
Components Affected: Catalyst Query Plan Normalization, Recursive CTE Support, Query Plan Comparison Infrastructure
Risk Areas:
- Bug fix in normalizeDef() changes behavior: need to verify that it doesn't break existing non-recursive CTE normalization.
- ID remapping logic complexity: the UnionLoopRef mapping uses the counter differently than CTERelationRef, with potential for confusion.
- Interaction between UnionLoop ID normalization and UnionLoopRef counter-based normalization: the asymmetric pattern could be error-prone.
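The ID-mapping concern raised in the review comments above can be illustrated with a toy model. The sketch below is not Spark code and does not reproduce the PR's logic: `Leaf`, `Project`, `Loop`, and `LoopRef` are hypothetical stand-ins for Catalyst plan nodes such as `UnionLoop` and `UnionLoopRef`, and `normalize` is an assumed helper. It only shows the general idea of rewriting internal IDs to canonical values in traversal order, so that two plans differing only in those IDs compare equal.

```scala
import scala.collection.mutable

// Toy model only: hypothetical stand-ins, NOT Spark's Catalyst classes.
object CteIdNormalizationSketch {
  sealed trait Plan
  final case class Leaf(name: String) extends Plan
  final case class Project(child: Plan) extends Plan
  // Stand-in for a recursive CTE loop: an anchor term plus a recursive term.
  final case class Loop(id: Long, anchor: Plan, recursion: Plan) extends Plan
  // Stand-in for a self-reference to the enclosing loop.
  final case class LoopRef(loopId: Long) extends Plan

  /** Rewrites loop IDs to 0, 1, 2, ... in the order they are first seen (pre-order). */
  def normalize(plan: Plan): Plan = {
    val mapping = mutable.Map.empty[Long, Long]

    // Returns the canonical ID for `old`, allocating the next one on first sight.
    def canonical(old: Long): Long = mapping.get(old) match {
      case Some(id) => id
      case None =>
        val id = mapping.size.toLong
        mapping(old) = id
        id
    }

    def go(p: Plan): Plan = p match {
      case Loop(id, anchor, recursion) =>
        // Register the loop's ID before descending, so references inside
        // the recursive term resolve to the same canonical ID.
        val newId = canonical(id)
        Loop(newId, go(anchor), go(recursion))
      case LoopRef(id)    => LoopRef(canonical(id))
      case Project(child) => Project(go(child))
      case leaf: Leaf     => leaf
    }

    go(plan)
  }

  def main(args: Array[String]): Unit = {
    // Same shape, different internal IDs.
    val a = Project(Loop(17L, Leaf("t"), Project(LoopRef(17L))))
    val b = Project(Loop(42L, Leaf("t"), Project(LoopRef(42L))))
    println(a == b)                       // raw plans differ
    println(normalize(a) == normalize(b)) // normalized plans match
  }
}
```

In this sketch both the loop and its reference insert into the same mapping, so the result does not depend on which of the two is visited first. The reviews point out that the PR's handlers were asymmetric in this respect, which is why traversal order mattered there.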
JIRA Issue Information: Sub-task SPARK-54864 (this comment was automatically generated by GitHub Actions)
Force-pushed from 8011e89 to b0dde9e.
thanks, merging to master!
### What changes were proposed in this pull request?
Replace `CteIdNormalizer` in `NormalizePlan` with an application of the rule `NormalizeCteIds`.

### Why are the changes needed?
To add testing for recursive CTEs in the single-pass analyzer and to reduce code duplication.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New test in `NormalizePlanSuite`.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.5 via Cursor AI - Original draft of the tests.

Closes apache#53636 from Pajaraja/pavle-martinovic_data/PlanNormalization.

Authored-by: pavle-martinovic_data <pavle.martinovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>