[SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join #54330
peter-toth wants to merge 30 commits into apache:master
This PR requires and contains the changes of #54335. Once that PR is merged I will rebase this one.
Why don't you merge #54335, @peter-toth? You already got the required community approval on your PR.
szehon-ho left a comment
I think the coalesceRDD is a good idea, but I feel it's a bit risky to change the DataSourceRDD so much. Is there another way? Maybe have a custom RDD that holds the grouped partitions? Though I'm not so familiar with this part.
Overall I like the GroupPartitionExec idea, but it would definitely be good to have some of @sunchao @viirya @chirag-s-db @cloud-fan also take a look.
Sorry @dongjoon-hyun, I didn't notice your approval yesterday. Thanks for your review! @viirya requested a small change just now, once CI completes I will merge that PR and rebase this one.
Initially I wanted to add a new RDD for
### What changes were proposed in this pull request?

This is a minor refactor of `BroadcastHashJoinExec.outputPartitioning` to:
- simplify the logic and
- make it future proof by using `Partitioning with Expression` instead of `HashPartitioningLike`.

### Why are the changes needed?

Code cleanup and adding support for future partitionings that implement `Expression` but not `HashPartitioningLike` (like `KeyedPartitioning` in #54330).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54335 from peter-toth/SPARK-55551-improve-broadcasthashjoinexec-output-partitioning.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
#54335 is merged and I've rebased this PR on latest
Thanks for the refactor. I was actually wondering if this approach works (using cursor generation, please double check if it makes sense): It's more localized to the SPJ case, and then it can be used in GroupPartitionsExec like: The current code is definitely more Spark-native, reusing coalesceRDD, but my doubt is the ThreadLocal and the chance for a memory leak like the one fixed by @viirya: #51503. But I'll defer to others if people like this approach more.
edit: I guess it's what you are saying you considered, in your previous comment

BTW, didn't get this, are you saying there is some leak in the current DataSourceRDD that needs ThreadLocal to fix? Should it be fixed separately?
As far as I see you assume that the child is a … Also, even if there is a …
Not necessarily a leak, but there are some issues with custom metrics reporting and when the readers get closed. Consider the following plan (without this PR): We have only 1 task in the stage due to coalesce(1) and that task calls the
I see, sorry, I forgot about that case. Interesting, so you mean we are losing metrics. Should we at least add a test? It may make sense to do in a separate PR, but it depends on the final approach. The approach does make sense; I am a bit unsure if ThreadLocal is the best/safest approach, considering the risk of introducing a memory leak. As you can see it's a bit tricky, but I am not so familiar with the DataSourceRDD code.
Sure, let me add a test tomorrow, and maybe someone can come up with a better idea to fix it.
I extracted the metrics reporting bug / fix to SPARK-55619 / #54396 and added a new test.
Thank you!
…roupPartitionsExec` operator, remove old code
      ensureOrdering(child, child.outputPartitioning, o)
    case _ => child
  }
case (c @ GroupedPartitions(p), distribution) if p.satisfies(distribution) =>
I need to revisit this part, as converting a KeyedPartitioning to grouped (building the distinct set of keys) just to check if it can satisfy a distribution doesn't make sense...
This is refactored in 326915b; now we don't compute the distinct set of keys to decide if a KeyedPartitioning can satisfy a distribution.
As KeyedPartitioning is a special partitioning (not just because of this refactor PR) I elaborated on what KeyedPartitioning.satisfies() actually means.
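The point about not needing the distinct key set can be illustrated with a toy model (plain Python; names are illustrative, not Spark's actual classes): whether a keyed partitioning can satisfy a clustered distribution depends only on its partitioning expressions, so computing the grouped (deduplicated) partition keys is unnecessary for the check.

```python
from dataclasses import dataclass

@dataclass
class KeyedPartitioning:
    expressions: tuple     # columns the data is partitioned on
    partition_keys: tuple  # one key per partition; may contain duplicates

    def satisfies_clustered(self, clustering):
        # decided from the expressions alone; partition_keys are never deduplicated
        return set(self.expressions).issubset(set(clustering))

ungrouped = KeyedPartitioning(("a",), ((1,), (1,), (2,), (2,)))
grouped = KeyedPartitioning(("a",), ((1,), (2,)))
```

Under this model, an ungrouped partitioning and its grouped counterpart give identical `satisfies` answers, which is the intuition behind skipping the grouping step in the check.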
Something is wrong with that commit, let me check the test failures.
…ies()` means, partitioning of `KeyGroupedShuffleSpec` don't need to be grouped
@szehon-ho, I added a test case that yields groupedPartitions.isEmpty in 32b563f.
@cloud-fan, @szehon-ho, @viirya I wonder if we can proceed with this refactor? Please note that this change implicitly fixes the correctness issue reported in #54378 / SPARK-55848, but we would need the tests from @naveenp2708's #54679 on
Thank you for the review @cloud-fan, @dongjoon-hyun, @viirya, @szehon-ho and @chirag-s-db. Merged to
…minor improvements to `EnsureRequirements`

### What changes were proposed in this pull request?

This is a follow-up PR to #54330 to fix `OrderedDistribution` handling in `EnsureRequirements` so as to avoid a correctness bug. The PR contains minor improvements to `EnsureRequirements` and configuration docs updates as well.

### Why are the changes needed?

To fix a correctness bug introduced with the refactor.

### Does this PR introduce _any_ user-facing change?

Yes, but the refactor (#54330) hasn't been released.

### How was this patch tested?

Added new UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54727 from peter-toth/SPARK-55535-refactor-kgp-and-spj-follow-up.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
…clustering

### What changes were proposed in this pull request?

Backport fix for SPARK-55848 to branch-4.1. This branch does not have the KeyGroupedPartitioning refactor (#54330) from master. The fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange.

### Why are the changes needed?

SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number). The partially-clustered partitioning is incorrectly treated as satisfying `ClusteredDistribution`, so no Exchange is inserted before dedup operators.

### Does this PR introduce any user-facing change?

Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.

### How was this patch tested?

Three regression tests added to KeyGroupedPartitioningSuite with data correctness checks and plan assertions verifying shuffle Exchange presence. All 95 tests pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54751 from naveenp2708/spark-55848-fix-branch-4.1.

Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
…ring dedup

### What changes were proposed in this pull request?

Test-only PR. Adds regression tests for SPARK-55848 (SPJ partial clustering produces incorrect results for post-join dedup operations). Three tests added to KeyGroupedPartitioningSuite:

1. SPARK-55848: dropDuplicates after SPJ with partial clustering
2. SPARK-55848: Window dedup after SPJ with partial clustering
3. SPARK-55848: checkpointed scan with partial clustering and dedup

### Why are the changes needed?

The fix was merged via #54330, but regression tests for the correctness issue (SPARK-55848 / #54378) were not included. These tests ensure the issue does not regress.

### Does this PR introduce any user-facing change?

No. Test-only change.

### How was this patch tested?

All 73 tests in KeyGroupedPartitioningSuite pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54714 from naveenp2708/spark-55848-tests-master.

Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
### What changes were proposed in this pull request?

This PR adds a new method to SPJ partition key `Reducer`s to return the type of a reduced partition key.

### Why are the changes needed?

After the [SPJ refactor](#54330) some Iceberg SPJ tests, which join an `hours` transform partitioned table with a `days` transform partitioned table, started to fail. This is because after the refactor the keys of a `KeyedPartitioning` partitioning are `InternalRowComparableWrapper`s, which include the type of the key, and when the partition keys are reduced the type of the reduced keys is inherited from their original type.

- #54330

This means that when `hours` transformed hour keys are reduced to days, the keys actually keep their `IntegerType` type, while the `days` transformed keys have `DateType` type in Iceberg. This type difference causes the left and right side `InternalRowComparableWrapper`s not to be considered equal despite their `InternalRow` raw key data being equal.

Before the refactor the type of (possibly reduced) partition keys was not stored in the partitioning. When the left and right side raw keys were compared in `EnsureRequirement` a common comparator was initialized with the type of the left side keys. So in the Iceberg SPJ tests the `IntegerType` keys were forced to be interpreted as `DateType`, or the `DateType` keys were forced to be interpreted as `IntegerType`, depending on the join order of the tables. The reason why this was not causing any issues is that the `PhysicalDataType` of both `DateType` and `IntegerType` logical types is `PhysicalIntegerType`.

This PR introduces a new `resultType()` method of `Reducer` to return the correct type of the reduced keys, properly compares the left and right side reduced key types, and throws an error when they are not the same.

### Does this PR introduce _any_ user-facing change?

Yes, the reduced key types are now properly compared and incompatibilities are reported to users.

### How was this patch tested?

Added new UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54884 from peter-toth/SPARK-56046-typed-spj-reducers.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
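A hypothetical sketch of the failure mode described above (plain Python; the names are invented for illustration, not Spark's real API): keys are compared through a wrapper that carries the key's data type, so equal raw values with different types are not considered equal, and a `resultType()`-style check surfaces the mismatch up front.

```python
def reduce_hours_to_days(hour_key):
    # an hours-transform key reduced to day granularity; before the fix the
    # reduced key kept its original IntegerType
    return hour_key // 24

# (raw value, type) stands in for InternalRowComparableWrapper
left_key = (reduce_hours_to_days(48), "IntegerType")  # reduced hours side
right_key = (2, "DateType")                           # days side in Iceberg

def check_reduced_types(left_type, right_type):
    # a Reducer.resultType()-style comparison reports incompatible reduced
    # key types instead of letting key equality silently fail
    if left_type != right_type:
        raise TypeError(
            f"incompatible reduced key types: {left_type} vs {right_type}")
```

Here `left_key[0] == right_key[0]` holds while `left_key != right_key`, which is exactly the silent inequality the new type check turns into an explicit error.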
 * `KeyedPartitioning` is used in two distinct forms:
 *
 * 1. '''As outputPartitioning''': When used as a node's output partitioning (e.g., in
 *    `BatchScanExec` or `GroupPartitionsExec`), the `partitionKeys` are always in sorted order.
@peter-toth, I wonder if we can relax this assumption - allow non-sorted and ungrouped partitionKeys in outputPartitioning.
I found an interesting case during the experiment of this feature - suppose we have a table like
CREATE OR REPLACE TABLE orders_userid_dt_iceberg (
order_id BIGINT,
user_id BIGINT,
amount DECIMAL(10,2),
dt STRING
)
USING iceberg
PARTITIONED BY (bucket(4, user_id), dt);
INSERT INTO orders_userid_dt_iceberg VALUES
(1001, 1, 120.50, '2025-01-01'),
(1002, 1, 80.00, '2025-01-01'),
(1003, 2, 200.00, '2025-01-01'),
(1004, 2, 50.00, '2025-01-02'),
(1005, 3, 30.00, '2025-01-02'),
(1006, 3, 70.00, '2025-01-03'),
(1007, 1, 60.00, '2025-01-03');
SELECT user_id, count(*)
FROM orders_userid_dt_iceberg
WHERE dt = '2025-01-01'
GROUP BY user_id;
For this query, ColumnPruning injects a Project on the Filter(RelationV2), then
+- == Initial Plan ==
HashAggregate (13)
+- Exchange (12)
+- HashAggregate (11)
+- Project (10) <= KeyedPartitioning is dropped since here, see AliasAwareQueryOutputOrdering#outputPartitioning
+- BatchScan iceberg spark_catalog.default.orders_userid_dt_iceberg (1)
If we make a projection for KeyedPartitioning (obviously, the projected one will break the current assumption in the docs) in AliasAwareQueryOutputOrdering#outputPartitioning instead of dropping it, we can avoid an expensive shuffle. My experiment shows this at least works for such a simple query; do you think this is the right direction?
@pan3793, actually sorted order is not a hard requirement, but it increases the chance that we don't need to add GroupPartitionsExec with explicit expectedPartitionKeys into the query plan to align partitions by keys on both sides of a join.
The idea of adjusting PartitioningPreservingUnaryExecNodes to not drop but project KeyedPartitionings has already come up: #54330 (comment), but I haven't had time to work on it yet.
It is a bit tricky because the current logic requires that all KeyedPartitionings in a partitioning collection have equal sequences of partition keys (and actually identical sequence is even better to decrease the footprint of the partitioning). I think we should maintain this invariant during projection and keep only one sequence of keys but it should have the most granular expressions. Let me give you an example:
Let's suppose we have child.outputPartitioning as
PartitioningCollection(
KeyedPartitioning(expressions = [x, y], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
KeyedPartitioning(expressions = [x_alias, y], partitionKeys = <identical seq>),
KeyedPartitioning(expressions = [x, y_alias], partitionKeys = <identical seq>),
KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq>))
because we have Project x, x as x_alias, y, y as y_alias somewhere in the child subplan.
Now, if we have Project x, x_alias on the top then obviously the node's outputPartitioning could be:
PartitioningCollection(
KeyedPartitioning(expressions = [x], partitionKeys = [(1), (1), (2), (2)]),
KeyedPartitioning(expressions = [x_alias], partitionKeys = <identical seq>))
But if we have Project x, x_alias, y_alias then we should not project the first KeyedPartitioning, but keep those which have more granularity:
PartitioningCollection(
KeyedPartitioning(expressions = [x, y_alias], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq>))
I think what we should avoid is having multiple different partitionKeys in a collection like:
PartitioningCollection(
KeyedPartitioning(expressions = [x], partitionKeys = [(1), (1), (2), (2)]),
KeyedPartitioning(expressions = [x_alias], partitionKeys = <identical seq>),
KeyedPartitioning(expressions = [x, y_alias], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq 2>))
because it would break the current logic and it doesn't have any benefit.
Also, we should probably think through how KeyedPartitioning projection relates to the allowJoinKeysSubsetOfPartitionKeys conf.
Anyways, I can probably open a PR next week or so, but if you would like to work on this just let me know.
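The invariant described in the comment above can be sketched in plain Python (illustrative names and shapes, not Spark code): project each KeyedPartitioning onto the output columns, drop entries with no surviving expression, deduplicate, and keep only the most granular projections so the collection retains a single sequence of partition keys.

```python
def project_partitionings(partitionings, output_columns):
    """partitionings: list of (expressions, partition_keys) pairs, where the
    expressions map positionally onto the fields of each key tuple and all
    entries share an identical key sequence."""
    out = set(output_columns)
    projected = []
    for exprs, keys in partitionings:
        kept = [i for i, e in enumerate(exprs) if e in out]
        if not kept:
            continue  # no expression survives the projection
        p = (tuple(exprs[i] for i in kept),
             tuple(tuple(key[i] for i in kept) for key in keys))
        if p not in projected:  # deduplicate identical projections
            projected.append(p)
    if not projected:
        return []
    # keep only the most granular projections so that every entry in the
    # resulting collection has the same sequence of partition keys
    max_len = max(len(exprs) for exprs, _ in projected)
    return [p for p in projected if len(p[0]) == max_len]

keys = ((1, 1), (1, 2), (2, 1), (2, 2))
collection = [(("x", "y"), keys),
              (("x_alias", "y"), keys),
              (("x", "y_alias"), keys),
              (("x_alias", "y_alias"), keys)]
```

On the example from the comment, projecting `collection` onto `x, x_alias, y_alias` keeps the two 2-column partitionings, while projecting onto `x, x_alias` falls back to the 1-column ones.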
@peter-toth, thanks for the detailed explanation, and looking forward to your subsequent improvements on these parts!
What changes were proposed in this pull request?
This PR extracts partition grouping logic from `BatchScanExec` to a new `GroupPartitionsExec` operator and replaces `KeyGroupedPartitioning` with `KeyedPartitioning`.

- `KeyedPartitioning` represents a partitioning where partition keys are known. It can be grouped (clustered) by partition keys or not. When grouping is required, the new operator can be inserted into a plan at any place (similarly to how exchanges are inserted under joins or aggregates to satisfy expected distributions), creating the necessary grouped/replicated partitions by keys.
- `GroupPartitionsExec` uses the already existing `CoalescedRDD` with a new `GroupedPartitionCoalescer` to ensure that input partitions with the same key end up in a common output partition.
- Reverts `DataSourceRDD` to its pre-SPJ form.
- Uses `PartitionKey` instead of the previous `PartitionValues` to be in sync with the DSv2 `HasPartitionKey` interface.
- `StoragePartitionJoinParams` is not required in `BatchScanExec`; its fields are now part of the new `GroupPartitionsExec` operator.
- `KeyedPartitioning` no longer stores `originalPartitionKeys` for partially clustered joins, as those keys are available as `outputPartitioning` of the join's children (below the inserted `GroupPartitionsExec` if that is inserted).
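The grouping step can be illustrated with a small sketch (plain Python; this is not the actual `GroupedPartitionCoalescer` API): input partitions that share a key are coalesced into one output partition, optionally aligned to an expected key sequence coming from the other join side.

```python
def group_partitions(input_keys, expected_keys=None):
    """input_keys[i] is the partition key of input partition i; returns, per
    output partition, the input partition indexes coalesced into it."""
    by_key = {}
    for i, key in enumerate(input_keys):
        by_key.setdefault(key, []).append(i)
    keys = expected_keys if expected_keys is not None else sorted(by_key)
    # keys missing from the input become empty output partitions, which keeps
    # the two sides of a join aligned partition-by-partition
    return [by_key.get(key, []) for key in keys]
```

For example, grouping keys `[1, 1, 2, 2]` coalesces partitions 0 and 1 together and 2 and 3 together; passing `expected_keys` from the other side inserts empty partitions for keys the input does not have.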
To solve the issue of unnecessary partition grouping SPARK-55092 ([SPARK-55092][SQL] Disable partition grouping in `KeyGroupedPartitioning` when not needed #53859) and to simplify the KGP/SPJ implementation.

A new operator allows more granular control over partition grouping, which can improve multi table joins:
Consider the following example with 3 tables:

- `t1` is partitioned by `(a1, a2)` and returns partitions with keys `(1, 1)`, `(1, 2)`, `(2, 1)`, `(2, 2)`
- `t2` is partitioned by `(b1, b2)` and returns partitions with keys `(2, 1)`, `(2, 3)`, `(3, 1)`, `(3, 2)`
- `t3` is partitioned by `c1` and returns partitions with keys `2`, `3`

When `spark.sql.requireAllClusterKeysForCoPartition=false` and `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` are set and the query is `t1 JOIN t2 ON a1 = b1 AND a2 = b2 JOIN t3 ON a1 = c1`, then storage partition join kicks in.

Before this PR the common set of partition keys is pushed down to all 3 scans:

After this PR `GroupPartitions` operators do the grouping, fully utilizing `t1` and `t2` partitioning in the inner `Join` operator and regrouping the join results for the outer `Join` operator:

Fully utilized partitioning in joins can avoid skews better.
Or consider the following example with 3 tables:

- `t1` is partitioned by `a` and returns partitions with keys `1`, `1`, `2`, `2`
- `t2` is partitioned by `b` and returns partitions with keys `2`, `3`
- `t3` is partitioned by `c` and returns partitions with keys `2`, `4`

When `spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled=true` is set, then partial clustering can be used not only with 2 table joins, but with multi table joins as well.

Before this PR:

After this PR:
Keeping one side unclustered can also help avoiding skews.
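Partial clustering can be sketched the same way (plain Python; an illustrative model, not Spark's implementation): the skewed side stays unclustered, keeping one output partition per input partition, and the matching partitions of the other side are replicated to each of them.

```python
from collections import defaultdict

def partially_clustered_pairs(left_keys, right_keys):
    """The left (skewed) side stays unclustered: each input partition becomes
    its own output partition, and the right-side partitions with a matching
    key are replicated to every one of them."""
    right_by_key = defaultdict(list)
    for j, key in enumerate(right_keys):
        right_by_key[key].append(j)
    # (left partition index, replicated right partition indexes) per output
    return [(i, right_by_key.get(key, [])) for i, key in enumerate(left_keys)]
```

With left keys `[1, 1, 2, 2]` and right keys `[2, 3]`, both skewed key-2 partitions stay separate and each gets its own replica of the right-side key-2 partition, which is how keeping one side unclustered spreads the skew.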
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs adjusted, new UTs from #53859 and additional new UTs to test the above improvements.
Was this patch authored or co-authored using generative AI tooling?
Yes, documentation and some helpers were added by Claude.