[SPARK-46367][SQL] Support narrowing projection of `KeyedPartitioning` in `PartitioningPreservingUnaryExecNode` by peter-toth · Pull Request #55519 · apache/spark

peter-toth · 2026-04-23T18:44:25Z

What changes were proposed in this pull request?

When a KeyedPartitioning passes through a PartitioningPreservingUnaryExecNode (e.g. ProjectExec), the previous implementation projected the partitioning as a whole expression via multiTransformDown. If any expression position could not be mapped to an output attribute, the entire KeyedPartitioning was silently dropped, resulting in UnknownPartitioning.

This PR replaces that approach with a per-position projection algorithm implemented in two new private helpers (projectKeyedPartitionings and projectOtherPartitionings), with the main outputPartitioning reduced to a simple split, project, and combine:

1. For each expression position (0..N-1), collect the unique expressions at that position across all input KeyedPartitionings (using ExpressionSet to deduplicate semantically equal expressions), then project each through the output aliases via projectExpression.
2. Positions with at least one projected alternative are projectable; they define the maximum achievable granularity. Positions that cannot be expressed in the output are dropped (narrowing).
3. The shared partitionKeys are projected to the subset of projectable positions via KeyedPartitioning.projectKeys.
4. The final KeyedPartitionings are the cross-product of per-position alternatives, computed lazily via MultiTransform.generateCartesianProduct, deduplicated, and bounded by a single outer take(aliasCandidateLimit).
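
The steps above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code in this PR: expressions are modelled as plain strings, `aliases` is a hypothetical dictionary standing in for `projectExpression` (an attribute that passes through unchanged maps to itself), and `limit` plays the role of `aliasCandidateLimit`. The function name and signature are invented for illustration.

```python
from itertools import islice, product


def project_keyed_partitionings(partitionings, aliases, limit=100):
    """Toy per-position projection.

    partitionings: list of equal-length tuples of expression names.
    aliases: maps an input expression to the output names exposing it
             (hypothetical stand-in for projectExpression).
    Returns (projectable positions, projected partitionings).
    """
    if not partitionings:
        return [], []
    width = len(partitionings[0])

    # Step 1: per position, deduplicate the input expressions, then
    # collect every output name that any of them projects to.
    per_position = []
    for i in range(width):
        unique = dict.fromkeys(p[i] for p in partitionings)
        alternatives = list(dict.fromkeys(
            out for expr in unique for out in aliases.get(expr, [])))
        per_position.append(alternatives)

    # Step 2: a position is projectable if it has at least one alternative;
    # all other positions are dropped (narrowing).
    projectable = [i for i, alts in enumerate(per_position) if alts]
    if not projectable:
        return [], []

    # Step 4: product() yields combinations one at a time; a single outer
    # islice bounds the result. The per-position lists are already
    # deduplicated, so the combinations are unique.
    combos = product(*(per_position[i] for i in projectable))
    return projectable, list(islice(combos, limit))


# (id, name) partitioning, output exposes id both as "id" and as "pk".
positions, result = project_keyed_partitionings(
    [("id", "name")], {"id": ["id", "pk"]})
print(positions, result)
```

In this toy run the `name` position has no output alternative, so only position 0 survives and two single-column partitionings are produced. Step 3 (projecting the shared partition keys to the surviving positions) is left out here to keep the sketch short.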

All resulting KeyedPartitionings at the same granularity share the same partitionKeys object, preserving the invariant required by GroupPartitionsExec.

A new isNarrowed: Boolean flag is added to KeyedPartitioning and set to true when the projection drops one or more key positions. When isNarrowed=true and isGrouped=false, GroupPartitionsExec would merge original partitions that held distinct keys, carrying the same data-skew risk as allowJoinKeysSubsetOfPartitionKeys. groupedSatisfies therefore gates such narrowed partitionings behind that config. When isGrouped=true after narrowing, the projected keys are still distinct so no merging happens and no config is required.
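
The gating rule in this paragraph can be written down as a small decision function. The sketch below models only the extra condition described here; the real `groupedSatisfies` checks more than this, and the function and parameter names are hypothetical.

```python
def narrowed_gate_passes(is_narrowed, is_grouped, allow_subset_config):
    """Toy model of the extra check described for narrowed partitionings.

    is_narrowed: the projection dropped at least one key position.
    is_grouped: the (projected) partition keys are still distinct.
    allow_subset_config: value of allowJoinKeysSubsetOfPartitionKeys.
    """
    if is_narrowed and not is_grouped:
        # Grouping would merge partitions that held distinct original
        # keys, so the config has to opt in to that skew risk.
        return allow_subset_config
    # Not narrowed, or narrowed but keys still distinct: no config needed.
    return True


for narrowed in (False, True):
    for grouped in (False, True):
        for config in (False, True):
            print(narrowed, grouped, config,
                  narrowed_gate_passes(narrowed, grouped, config))
```

Only one of the eight combinations is rejected: narrowed, not grouped, and the config disabled.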

Why are the changes needed?

Without this fix, a ProjectExec that drops any column of a multi-column partition key causes the entire KeyedPartitioning to be lost. This breaks storage-partitioned join optimisations (SPJ) that rely on the partitioning surviving projection (e.g. a subquery that renames or projects away a partition key column).

Does this PR introduce any user-facing change?

Yes. SPJ is now preserved through ProjectExec nodes:

- Alias projections (e.g. SELECT id AS pk FROM t) no longer break SPJ.
- Narrowing projections (e.g. SELECT id FROM t where t is partitioned by (id, name)) enable SPJ when the projected keys remain distinct, or when spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys is enabled and the keys become non-unique.
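
To make the "remain distinct" versus "become non-unique" distinction concrete, here is a worked toy example for a table partitioned by (id, name) and projected to id. It is a sketch with made-up key values; `project_keys` and `keys_remain_distinct` are hypothetical helpers and do not reproduce `KeyedPartitioning.projectKeys`.

```python
def project_keys(partition_keys, positions):
    """Keep only the key columns at the given positions."""
    return [tuple(key[i] for i in positions) for key in partition_keys]


def keys_remain_distinct(projected_keys):
    """True if no two partitions share the same projected key."""
    return len(set(projected_keys)) == len(projected_keys)


# Partition keys for (id, name); the values are invented for illustration.
distinct_case = [(1, "a"), (2, "b"), (3, "c")]
colliding_case = [(1, "a"), (1, "b"), (2, "c")]

# Project to the id column only (position 0).
for keys in (distinct_case, colliding_case):
    projected = project_keys(keys, [0])
    print(projected, keys_remain_distinct(projected))
```

In the first case every partition still has its own id, which corresponds to the situation where SPJ applies without any config. In the second case two partitions share id 1, which corresponds to the situation that needs allowJoinKeysSubsetOfPartitionKeys.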

How was this patch tested?

Unit tests added/updated in ProjectedOrderingAndPartitioningSuite:

- Full-granularity alias substitution
- 2->1 and 3->2 narrowing with and without aliases
- PartitioningCollection with mixed projectability
- isNarrowed=true, isGrouped=false: groupedSatisfies blocked without config, allowed with allowJoinKeysSubsetOfPartitionKeys
- isNarrowed=true, isGrouped=true: satisfies succeeds without config

End-to-end tests added in KeyGroupedPartitioningSuite:

- Alias in subquery does not break SPJ
- Narrowing projection with duplicate projected keys requires allowJoinKeysSubsetOfPartitionKeys
- Narrowing projection with distinct projected keys triggers SPJ without config

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

peter-toth · 2026-04-24T17:20:27Z

I will rebase this PR once #55538 is merged.

peter-toth force-pushed the SPARK-46367-keyedpartitioning-projection branch from 433d560 to 0b3e7bc on April 23, 2026 18:49

peter-toth mentioned this pull request Apr 23, 2026

[SPARK-46367][SQL] Fix KeyedPartitioning not remapped through column aliases in ProjectExec #55475

Closed

peter-toth force-pushed the SPARK-46367-keyedpartitioning-projection branch from 0b3e7bc to 8f8efaa on April 24, 2026 17:11

peter-toth mentioned this pull request Apr 24, 2026

[SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join #54330

Closed

peter-toth force-pushed the SPARK-46367-keyedpartitioning-projection branch from 8f8efaa to 397a8be on April 25, 2026 06:49

peter-toth wants to merge 1 commit into apache:master from peter-toth:SPARK-46367-keyedpartitioning-projection
