Skip to content

Conversation

@Pajaraja
Copy link
Contributor

@Pajaraja Pajaraja commented Dec 29, 2025

What changes were proposed in this pull request?

Replace CteIdNormalizer in NormalizePlan with an application of the rule NormalizeCteIds.

Why are the changes needed?

To add testing for recursive CTEs for single pass analyzer.
Reduce code duplication.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test in NormalizePlanSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5 via Cursor AI - Original draft of the tests.

@github-actions github-actions bot added the SQL label Dec 29, 2025
@diffray-bot
Copy link

Changes Summary

This PR adds support for normalizing UnionLoop and UnionLoopRef nodes in the plan comparison logic. It introduces new normalization methods in CteIdNormalizer and corresponding test cases to enable testing of recursive CTEs in the single-pass analyzer.

Type: feature

Components Affected: catalyst/plans/NormalizePlan, catalyst/plans/CteIdNormalizer, test suite for NormalizePlan

Files Changed
File Summary Change Impact
...he/spark/sql/catalyst/plans/NormalizePlan.scala Added handling for UnionLoop and UnionLoopRef nodes in the plan normalization transformation and added corresponding normalization methods to CteIdNormalizer class. ✏️ 🟡
...ark/sql/catalyst/plans/NormalizePlanSuite.scala Added three new test cases for UnionLoopRef ID normalization, UnionLoop ID and outputAttrIds normalization, and complete recursive CTE normalization. ✏️ 🟡
Architecture Impact
  • New Patterns: Parallel normalization patterns for recursive CTE nodes (similar to existing CTE handling)
  • Dependencies: added imports: CTERelationDef, UnionLoop, UnionLoopRef in test file
  • Coupling: Introduces tight coupling between UnionLoop/UnionLoopRef and CteIdNormalizer. The normalization logic assumes specific traversal order (UnionLoop before UnionLoopRef) similar to existing CTE handling.

Risk Areas: ID mapping correctness: The normalization relies on traversal order where UnionLoop is encountered before UnionLoopRef to build the mapping correctly, Duplicate ID handling in normalizeDef: The logic checks if a CTERelationDef ID is already mapped (new check at line 256), which could affect existing behavior for duplicate CTEs, UnionLoop ID mapping: The normalizeUnionLoop method only remaps if ID already exists, but doesn't insert new mappings. This is asymmetric compared to normalizeUnionLoopRef behavior

Suggestions
  • Add documentation explaining the assumption about traversal order in plan transformation
  • Consider adding a comment explaining why UnionLoop doesn't insert new mappings unlike UnionLoopRef
  • Verify the logic in normalizeDef change (line 256-257) doesn't introduce unintended regressions for existing CTE handling

Full review in progress... | Powered by diffray

@diffray-bot
Copy link

Changes Summary

This PR adds support for normalizing recursive CTEs (rCTEs) in Spark's query plan normalization logic by introducing handlers for UnionLoop and UnionLoopRef nodes. This enables the single-pass analyzer to correctly compare semantically identical recursive CTE queries with different internal IDs.

Type: feature

Components Affected: Catalyst Query Plan Normalization, Recursive CTE Support, Query Plan Comparison Infrastructure

Files Changed
File Summary Change Impact
...he/spark/sql/catalyst/plans/NormalizePlan.scala Added UnionLoop and UnionLoopRef normalization handlers, plus a fix to prevent duplicate CTE ID mappings ✏️ 🟡
...ark/sql/catalyst/plans/NormalizePlanSuite.scala Added three comprehensive test cases for UnionLoopRef, UnionLoop, and recursive CTE normalization ✏️ 🟢
Architecture Impact
  • New Patterns: Visitor pattern extension for rCTE nodes, ID mapping pattern for maintaining referential integrity
  • Dependencies: added: UnionLoop and UnionLoopRef logical plan classes
  • Coupling: NormalizePlan normalization logic now explicitly depends on recursive CTE node types (UnionLoop, UnionLoopRef), increasing coupling to the CTE implementation details

Risk Areas: Bug fix in normalizeDef() changes behavior - need to verify it doesn't break existing non-recursive CTE normalization, ID remapping logic complexity - UnionLoopRef mapping uses counter differently than CTERelationRef, potential for confusion, Interaction between UnionLoop ID normalization and UnionLoopRef counter-based normalization - asymmetric pattern could be error-prone

Suggestions
  • Consider adding comments explaining why UnionLoopRef uses counter-based assignment vs. mapping lookup (unlike UnionLoop)
  • Verify the bug fix in normalizeDef() doesn't have unintended side effects on regular CTEs
  • Consider adding integration tests that verify recursive CTEs normalize correctly within larger query plans

Full review in progress... | Powered by diffray

@github-actions
Copy link

JIRA Issue Information

=== Sub-task SPARK-54864 ===
Summary: Add plan normalization for recursive CTEs
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@Pajaraja Pajaraja force-pushed the pavle-martinovic_data/PlanNormalization branch from 8011e89 to b0dde9e Compare January 8, 2026 14:48
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 98a17fd Jan 9, 2026
Yicong-Huang pushed a commit to Yicong-Huang/spark that referenced this pull request Jan 9, 2026
### What changes were proposed in this pull request?

Replace `CteIdNormalizer` in `NormalizePlan` with an application of the rule `NormalizeCteIds`.

### Why are the changes needed?

To add testing for recursive CTEs for single pass analyzer.
Reduce code duplication.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test in `NormalizePlanSuite`.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5 via Cursor AI - Original draft of the tests.

Closes apache#53636 from Pajaraja/pavle-martinovic_data/PlanNormalization.

Authored-by: pavle-martinovic_data <pavle.martinovic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants