[SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution#55531

Open
zhengruifeng wants to merge 17 commits into apache:master from zhengruifeng:spark-dev-4

Conversation

@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Apr 24, 2026

What changes were proposed in this pull request?

Add a new internal SQL config spark.sql.analyzer.strictDataFrameColumnResolution (default true) that controls how UnresolvedAttributes carrying a PLAN_ID_TAG (Spark Connect DataFrame columns) are resolved in ColumnResolutionHelper:

  • When true (default): strict plan-id-based resolution only; no name-based fallback.
  • When false: if plan-id-based resolution does not resolve the tagged attribute, also try name-based resolution as a fallback.
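The two modes can be sketched as a minimal, self-contained toy model in plain Python. This is illustrative only, not actual Spark code; the real logic lives in Catalyst's ColumnResolutionHelper, and all function and variable names here are hypothetical.

```python
# Toy model of the two resolution modes for a plan-id-tagged attribute.
# Illustrative only; the real implementation operates on analyzer plan
# nodes inside ColumnResolutionHelper.

CANNOT_RESOLVE = "CANNOT_RESOLVE_DATAFRAME_COLUMN"

def resolve_tagged(name, plan_id, columns_by_plan_id, output_columns, strict):
    """Resolve a column reference carrying a PLAN_ID_TAG.

    strict=True  -> plan-id-based resolution only (the default).
    strict=False -> also try name-based resolution as a fallback.
    """
    # Strict path: look only at the plan node the tag points to.
    resolved = columns_by_plan_id.get(plan_id, {}).get(name)
    if resolved is not None:
        return resolved
    if strict:
        raise ValueError(f"{CANNOT_RESOLVE}: {name}")
    # Lenient fallback: resolve by name against the current output.
    return output_columns.get(name)

# The plan node with id 1 no longer exposes "c" (e.g. it was shadowed
# downstream by withColumn), but the current output still has a "c".
columns_by_plan_id = {1: {"a": "a#1"}}
output_columns = {"a": "a#1", "c": "c#9"}

print(resolve_tagged("c", 1, columns_by_plan_id, output_columns, strict=False))
try:
    resolve_tagged("c", 1, columns_by_plan_id, output_columns, strict=True)
except ValueError as e:
    print(e)  # CANNOT_RESOLVE_DATAFRAME_COLUMN: c
```

Under this model, the same tagged reference resolves via name fallback when the config is false and fails fast when it is true, mirroring the behavior the PR describes.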

Additional test hygiene in this PR:

  • Move test_self_join, test_self_join_II, test_self_join_III, test_self_join_IV, test_drop_notexistent_col, and test_select_join_keys from test_dataframe.py to ColumnTestsMixin in test_column.py; they exercise DataFrame column resolution and belong with the Column tests.
  • In test_parity_column.py, add ColumnParityTestsWithNonStrictDFColResolution which re-runs the Column parity suite with the new config set to false. Both parity classes pin the config value via setUpClass/tearDownClass and include a test_df_col_resolution_mode sanity check.
  • In test_connect_column.py, add test_select_column_replaced_by_withcolumn: under strictDataFrameColumnResolution=false the plan-id-tagged reference to a withColumn-shadowed column resolves via name fallback; under strictDataFrameColumnResolution=true the same query fails with CANNOT_RESOLVE_DATAFRAME_COLUMN.

Why are the changes needed?

Keep Spark Connect DataFrame column references strictly anchored to their plan of origin by default, while giving users an escape hatch to opt into the more permissive name-based fallback path.

Does this PR introduce any user-facing change?

A new internal config is added. Default behavior is equivalent to current Apache Spark master for tagged attributes; setting spark.sql.analyzer.strictDataFrameColumnResolution=false opts into a slightly more permissive path where tagged attributes can also be resolved by name.

How was this patch tested?

  • build/sbt 'catalyst/testOnly *AnalysisSuite'
  • build/sbt 'sql/testOnly *SparkSessionExtensionSuite *ColumnNodeToExpressionConverterSuite *ResolverGuardSuite *PlanResolutionSuite *DataFrameSuite *DataFrameColumnarSuite'
  • python/run-tests --testnames "pyspark.sql.tests.test_column,pyspark.sql.tests.connect.test_parity_column,pyspark.sql.tests.connect.test_connect_column SparkConnectColumnTests.test_select_column_replaced_by_withcolumn"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic), claude-opus-4-7

### What changes were proposed in this pull request?
Adds a new internal SQL config `spark.sql.analyzer.strictDataFrameColumnResolution`
(default `false`) that controls how `UnresolvedAttribute` instances carrying a
`PLAN_ID_TAG` (i.e. DataFrame columns coming through Spark Connect) are resolved
in `ColumnResolutionHelper`:

- When `true`:
  - In `resolveDataFrameColumnRecursively`, if the target plan node is found
    but the column cannot be resolved on it, fail immediately with
    `CANNOT_RESOLVE_DATAFRAME_COLUMN` instead of delaying the failure.
  - In the main `resolveExpression` path, allow name-based resolution as a
    fallback for tagged attributes as well.
- When `false` (default): keep the current behavior of delaying the failure so
  that later analyzer iterations/rules still have a chance to resolve the
  column, and keep skipping name-based resolution for tagged attributes.

### Why are the changes needed?
Give users an opt-in mode that surfaces DataFrame column resolution failures
earlier, matching the semantics used internally in Databricks Runtime
(databricks-eng/runtime@4452ee5). The default preserves existing Apache Spark
behavior.

### Does this PR introduce _any_ user-facing change?
No behavior change by default. A new internal config is added.

### How was this patch tested?
- New unit test in `SparkSessionExtensionSuite` that enables the config and
  verifies `CANNOT_RESOLVE_DATAFRAME_COLUMN` is raised immediately, alongside
  the existing "inject analyzer rule - hidden column" test which still exercises
  the default delayed-failure path.
- `sbt 'catalyst/testOnly *AnalysisSuite'`
- `sbt 'sql/testOnly *SparkSessionExtensionSuite *ColumnNodeToExpressionConverterSuite *ResolverGuardSuite *PlanResolutionSuite'`

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic), claude-opus-4-7

Flip the default to match the semantics of ES_1636208_FIX_DELAY_DF_COL_FALURE =
false (the Databricks default): tagged DataFrame columns that cannot be resolved
on their matched plan node now fail immediately with
CANNOT_RESOLVE_DATAFRAME_COLUMN, and name-based fallback is allowed. Setting
the config to false restores the previous delayed-failure behavior.

Update the existing "inject analyzer rule - hidden column" test to set the
config to false locally, since it exercises the delayed-failure path.

Remove the early-throw in resolveDataFrameColumnRecursively; the config now
only controls whether name-based resolution is attempted as a fallback for
tagged UnresolvedAttributes. Failures from plan-id-based resolution continue
to be delayed so that later analyzer iterations can still resolve the column.

Update the SQLConf doc accordingly and revert the hidden-column test back to
its original form (no longer needs to toggle the config).

Reorder the guard on the UnresolvedAttribute case so the cheap
`!u.containsTag(PLAN_ID_TAG)` check runs first. Most attributes have no
PLAN_ID_TAG, so this avoids the SQLConf lookup on the common path; the
config is only consulted for tagged (Spark Connect) attributes.
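The short-circuit ordering above can be sketched in plain Python. This is a toy illustration under assumed names (the real check is on Catalyst's UnresolvedAttribute and SQLConf); a counting stand-in for the conf shows that reordering keeps the lookup off the common path.

```python
# Toy illustration of guard ordering: the cheap tag check runs first and
# short-circuits, so the conf flag is only read for tagged attributes.

class FakeConf:
    """Stand-in for SQLConf; counts how often the flag is read."""
    def __init__(self, strict=True):
        self.strict = strict
        self.lookups = 0

    def strict_df_col_resolution(self):
        self.lookups += 1
        return self.strict

def skip_name_resolution(has_plan_id_tag, conf):
    # Cheap check first: most attributes carry no PLAN_ID_TAG, so `and`
    # short-circuits before the conf is consulted.
    return has_plan_id_tag and conf.strict_df_col_resolution()

conf = FakeConf(strict=True)
for _ in range(1000):                 # common path: untagged attributes
    skip_name_resolution(False, conf)
skip_name_resolution(True, conf)      # tagged path: one conf read
print(conf.lookups)  # 1
```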
The previous default (true) gave the more lenient behavior (name-based
fallback allowed), which is the opposite of what "strict" implies. Flip
the polarity so that `true` now means strict (plan-id-only resolution
for tagged attributes) and `false` (default) keeps the lenient fallback
behavior. Net user-facing behavior on the default path is unchanged.

Flip the default to true so strict plan-id-based resolution of tagged
DataFrame columns is on by default. Drop the name-fallback wording from
the doc string.

…est_column

Move `test_join_with_cte`, `test_self_join`, `test_self_join_II`,
`test_self_join_III`, `test_self_join_IV`, `test_drop_notexistent_col`,
and `test_select_join_keys` into `ColumnTestsMixin` in `test_column.py`.
These all exercise DataFrame column resolution behavior, so they are a
better fit there and will pick up coverage on both classic
(`ColumnTests`) and Connect (`ColumnParityTests`) via the existing
mixin plumbing.

Drop the Connect-specific parity variant of `test_join_with_cte` since
the mixin already covers both sessions, and remove now-unused
`when`/`concat` imports from `test_dataframe.py`.

…solution disabled

Add ColumnParityWithNonStrictDataFrameColumnResolutionTests that re-runs
the Column parity suite with
`spark.sql.analyzer.strictDataFrameColumnResolution=false`, so the
name-based fallback path for tagged UnresolvedAttributes is exercised
under Spark Connect.

…class

Shorter name: ColumnParityTestsWithLenientDFColResolution.

Revert the move of test_join_with_cte; it stays in
test_connect_basic.py where it exercises cross-session parity
between classic and Connect.

Note the config is specific to Spark Connect (only Connect attaches
plan-id tags) and describe the `false` behavior.

Drop the leading 'Specific to Spark Connect.' sentence and qualify
'DataFrame columns' as 'Spark Connect DataFrame columns'.

…both parity classes

Add `test_df_col_resolution_mode` to ColumnParityTests (expects "true")
and to ColumnParityTestsWithNonStrictDFColResolution (expects "false")
to confirm each class is running under the intended config.

@zhengruifeng zhengruifeng changed the title [SQL] Add config for strict DataFrame column resolution [SPARK-56614][SQL][CONNECT] Add config for strict DataFrame column resolution Apr 24, 2026
@zhengruifeng zhengruifeng marked this pull request as ready for review April 24, 2026 10:26

…ss config for parity

* Add test_select_column_replaced_by_withcolumn in test_connect_column.py,
  covering the case where `df.withColumn("c", ...).select(df["c"])` must
  resolve the tagged reference to the overwritten alias. Wrapped in
  `self.connect_conf(...)` so it exercises the non-strict DataFrame column
  resolution path.
* In test_parity_column.py, set the strict DataFrame column resolution
  config via setUpClass/tearDownClass instead of conf(), so both parity
classes explicitly pin the value on the live session.

…onStrictDFColResolution

Mirror the setUpClass override with an explicit tearDownClass so each
parity class pairs its own set/unset and does not rely on the parent's
teardown to do it.

…ed_by_withcolumn

Add a second case with spark.sql.analyzer.strictDataFrameColumnResolution
set to true, asserting that the tagged reference to a column shadowed by
withColumn fails with CANNOT_RESOLVE_DATAFRAME_COLUMN under strict mode.

@zhengruifeng zhengruifeng requested a review from cloud-fan April 24, 2026 11:18