[SPARK-56632][SQL][CONNECT] Fix AMBIGUOUS_COLUMN_REFERENCE regression for reused DataFrame in natural join#55556
Conversation
b086d2d to
f365869
Compare
3162326 to
ddf46ff
Compare
ddf46ff to
c34827a
Compare
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Prior state and problem. SPARK-55070 fixed a regression introduced by SPARK-53503 by broadening the ancestor filter in resolveDataFrameColumnRecursively from p.outputSet to p.output ++ p.metadataOutput for regular plan-id-tagged column resolution. The broad filter is needed for rhs["join_key"] after a natural/USING join, where the right-side join key is hidden into Project.hiddenOutputTag (metadataOutput). But the broad filter also accepts candidates that should be rejected: when the same DataFrame appears both directly in a join and nested inside a natural/USING-join wrapper that hides its columns, both candidates pass at every ancestor and resolveDataFrameColumnByPlanId's foldLeft throws AMBIGUOUS_COLUMN_REFERENCE.
Design approach. Replace the single broadened walk with a strict-then-broad two-pass pattern, mirroring the output-then-metadata precedence in LogicalPlan.resolve (outputAttributes.resolve orElse outputMetadataAttributes.resolve):
- Regular access: walk first with strict
p.outputSet. That drops candidates hidden at an ancestor (the reused-DataFrame case). If strict resolves, use it. Otherwise retry with broadp.output ++ p.metadataOutputto handle the SPARK-55070rhs["join_key"]case where the only valid resolution is via metadata. - Metadata access (
IS_METADATA_COL): a single walk filtered byp.metadataOutput.
The filter is threaded as a getAllowed: LogicalPlan => AttributeSet through resolveDataFrameColumnByPlanId and resolveDataFrameColumnRecursively; the foldLeft merge logic is unchanged.
Key design decisions.
- Strict-first precedence: when both a directly-visible and a metadata-only candidate exist for the same plan-id, prefer the directly-visible one. This matches
LogicalPlan.resolve's output-before-metadata preference. - Fallback uses union, not metadata-only: the broad fallback uses
p.output ++ p.metadataOutputrather thanp.metadataOutput. Necessary because each candidate is filtered at every ancestor in the walk; metadata-only at the fallback would still reject the SPARK-55070 candidate at the matched node. - Metadata-access filter narrowed from
p.output ++ p.metadataOutput(post-SPARK-55070) top.metadataOutput. This is a quiet behavior change — see inline.
Implementation sketch. All changes are in ColumnResolutionHelper. resolveDataFrameColumn becomes the strategy split (metadata vs. regular; regular runs strict-then-broad). resolveDataFrameColumnByPlanId and resolveDataFrameColumnRecursively gain a getAllowed parameter. Star resolution (resolveDataFrameStarRecursively) is left unchanged — see general comment below.
General comment
The same regression pattern appears reachable on the star path. resolveDataFrameStarRecursively (lines 672–685, unchanged) still uses the single broad filter p.output ++ p.metadataOutput. If the test from this PR is repeated with dim["*"] instead of dim["dim_id"], the direct-dim and nested-under-USING-dim star candidates both pass the broad filter at every ancestor, and resolveDataFrameStarByPlanId's if (r1.nonEmpty && r2.nonEmpty) throw ambiguous would fire. Could you either thread the same strict-then-broad pattern through the star path, or, if the star case isn't user-reachable, document why?
| u, planId, isMetadataAccess, q, 0) | ||
| val (resolved, matched) = if (u.containsTag(LogicalPlan.IS_METADATA_COL)) { | ||
| // Metadata access (e.g. `df["_metadata"]`): the resolved attribute lives | ||
| // in `p.metadataOutput`, so filter ancestors by `p.metadataOutput`. |
There was a problem hiding this comment.
The comment claims "the resolved attribute lives in p.metadataOutput," but getMetadataAttributeByNameOpt (LogicalPlan.scala:56) explicitly looks in (metadataOutput ++ output).collectFirst with the note "An already-referenced column might appear in output instead of metadataOutput." The resolved attribute can be in p.output, not p.metadataOutput.
This matters because the filter is now narrower than pre-SPARK-55070 (p.output ++ p.metadataOutput). At any ancestor that clears metadataOutput to Nil (Aggregate, Limit, Sort, Window, set ops — basicLogicalOperators.scala:404, 448, 510, 853, 885, 1228, 1513, 1573, 1684) but carries the metadata attribute through output, the strict-metadata filter would reject a candidate the previous code accepted. Is the narrowing intentional? If so, can you spell that out in the comment and add a test for the metadata-access path? The PR description doesn't mention this behavior change.
| resolveDataFrameColumnByPlanId( | ||
| u, planId, false, q, 0, plan => plan.outputSet) match { | ||
| case (Some(r), m) => (Some(r), m) | ||
| case _ => resolveDataFrameColumnByPlanId(u, planId, false, q, 0, |
There was a problem hiding this comment.
The fallback re-walks the tree from scratch — re-descending, re-running p.resolve(u.nameParts, conf.resolver) at every matched node, and re-merging — only to swap the filter set. Resolution at matched nodes and the descent are identical between the two passes; only getAllowed(p) differs.
Consider collapsing to a single walk by exposing the two filter components per level (e.g. (p.outputSet, AttributeSet(p.metadataOutput))) and tracking pass-states on each candidate as it flows up. Concretely: drop getAllowed, return candidates as (NamedExpression, depth, passesStrict); at every ancestor, use r.references.subsetOf(AttributeSet(p.output ++ p.metadataOutput)) as the survival gate (matches today's broad filter) and AND-in r.references.subsetOf(p.outputSet) to update passesStrict. At the top of resolveDataFrameColumn, prefer the passesStrict subset and fall back to all survivors. That preserves the foldLeft merge and the strict-then-broad precedence, but pays one walk instead of two.
Not a blocker — just feels like the two passes are doing the same descent twice when the only difference is the filter.
… for reused DataFrame in natural join Fix an AMBIGUOUS_COLUMN_REFERENCE regression introduced by SPARK-55070 when a DataFrame is referenced both directly in a join and also nested under a natural/USING join elsewhere in the same plan. Replace the single broadened ancestor walk in `resolveDataFrameColumn` with a two-walk pattern, mirroring the `outputAttributes.resolve orElse outputMetadataAttributes.resolve` precedence in `LogicalPlan.resolve`. Regular access walks first with the strict `p.outputSet` filter; only on no match does it retry with `p.output ++ p.metadataOutput`. Metadata access keeps a single walk filtered by `p.metadataOutput`. Co-authored-by: Isaac Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
1e69359 to
79a3074
Compare
What changes were proposed in this pull request?
Fix an
AMBIGUOUS_COLUMN_REFERENCEregression introduced by SPARK-55070 when a DataFrame is referenced both directly in a join and also nested under a natural/USING join elsewhere in the same plan.Replace the single broadened ancestor walk in
resolveDataFrameColumnwith a two-walk pattern, mirroring theoutputAttributes.resolve(...) orElse outputMetadataAttributes.resolve(...)precedence used byLogicalPlan.resolve/LogicalPlan.resolveChildren:df["_metadata"],IS_METADATA_COLtagged): a single walk filtered byp.metadataOutput.df["col"]): walk first with the strict filterp.outputSet. That drops candidates hidden at an ancestor — e.g. the right side's join key after a natural/USING join. If strict resolves, use it. Otherwise retry with the broad filterp.output ++ p.metadataOutputto handle the SPARK-55070rhs["join_key"]case where the only valid resolution is viap.metadataOutput.The filter choice is threaded as a
getAllowed: LogicalPlan => AttributeSetargument throughresolveDataFrameColumnByPlanId/resolveDataFrameColumnRecursively; no change to thefoldLeftmerge logic.Why are the changes needed?
SPARK-55070 broadened the ancestor filter in
resolveDataFrameColumnRecursivelyfromp.outputSettop.output ++ p.metadataOutputso thatrhs["join_key"]works after a natural/USING join (where one join key is hidden inProject.hiddenOutputTag). But when the same DataFrame is both used directly in a join and also nested under a natural/USING-join wrapper elsewhere in the plan, the broadened filter lets both candidates throughresolveDataFrameColumnByPlanId's merge, trippingthrow ambiguousColumnReferences(u).For example, queries like:
now resolve
dim["dim_id"]to the direct-usage output candidate.Does this PR introduce any user-facing change?
Yes — bug fix. Queries that referenced a DataFrame both directly in a join and nested under a natural/USING join (where the wrapper hides one of the columns into
metadataOutput) previously raisedAMBIGUOUS_COLUMN_REFERENCE. They now resolve to the direct-usage candidate.How was this patch tested?
test_select_regular_column_with_reused_dataframe_hidden_in_natural_joinadded toColumnTestsMixininpython/pyspark/sql/tests/test_column.py.test_self_join,test_self_join_II/III/IV, andtest_select_join_keys.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)