[SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join#54330

Closed
peter-toth wants to merge 30 commits intoapache:masterfrom
peter-toth:SPARK-55535-refactor-kgp-and-spj

Conversation

@peter-toth
Contributor

@peter-toth peter-toth commented Feb 15, 2026

What changes were proposed in this pull request?

This PR extracts the partition grouping logic from BatchScanExec into a new GroupPartitionsExec operator and replaces KeyGroupedPartitioning with KeyedPartitioning.

  • KeyedPartitioning represents a partitioning where the partition keys are known. It can be grouped (clustered) by partition keys or not. When grouping is required, the new operator can be inserted into a plan at any place (similarly to how exchanges are inserted under joins or aggregates to satisfy expected distributions), thus creating the necessary grouped/replicated partitions by keys.
  • The implementation of GroupPartitionsExec uses the already existing CoalescedRDD with a new GroupedPartitionCoalescer to ensure that input partitions with the same key end up in a common output partition.
  • This PR kind of restores DataSourceRDD to its pre-SPJ form.
  • This PR tries to unify the terminology and prefers using PartitionKey instead of the previous PartitionValues to be in sync with the DSv2 HasPartitionKey interface.
  • After this PR StoragePartitionJoinParams is not required in BatchScanExec, its fields are now part of the new GroupPartitionsExec operator.
  • KeyedPartitioning no longer stores originalPartitionKeys for partially clustered joins as those keys are available as outputPartitioning of the join's children (below the inserted GroupPartitionsExec if that is inserted).
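The core idea of the new operator, collecting input partitions that share a partition key into a common output partition, can be illustrated with a small self-contained sketch (the names and shapes below are illustrative only, not the PR's actual classes):

```scala
// Illustrative sketch only: group input partition indices by partition key so
// that all partitions sharing a key land in one output group, which is what
// GroupPartitionsExec needs to establish before a join or aggregate.
case class InPart(index: Int, key: Seq[Int])

def groupByKey(parts: Seq[InPart]): Map[Seq[Int], Seq[Int]] =
  parts.groupBy(_.key).map { case (k, ps) => k -> ps.map(_.index) }

val parts = Seq(InPart(0, Seq(1)), InPart(1, Seq(2)), InPart(2, Seq(1)))
val grouped = groupByKey(parts)
// Partitions 0 and 2 share key (1), so they end up in one output group.
assert(grouped(Seq(1)) == Seq(0, 2))
assert(grouped(Seq(2)) == Seq(1))
```

For replicated partitions (the partially clustered case), the same partition index would simply appear in more than one output group.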

Why are the changes needed?

  1. To solve the issue of unnecessary partition grouping SPARK-55092 ([SPARK-55092][SQL] Disable partition grouping in KeyGroupedPartitioning when not needed #53859) and simplify the KGP/SPJ implementation.

  2. A new operator allows more granular control over partition grouping, which can improve multi table joins:

    Consider the following example with 3 tables:

    • t1 is partitioned by (a1, a2) and returns partitions with keys (1, 1), (1, 2), (2, 1), (2, 2)
    • t2 is partitioned by (b1, b2) and returns partitions with keys (2, 1), (2, 3), (3, 1), (3, 2)
    • t3 is partitioned by c1 and returns partitions with keys 2, 3

    When spark.sql.requireAllClusterKeysForCoPartition=false and spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true are set and the query is t1 JOIN t2 ON a1 = b1 AND a2 = b2 JOIN t3 ON a1 = c1, then storage partition join kicks in.
    Before this PR the common set of partition keys is pushed down to all 3 scans:

    Join a1 = c1
      Join a1 = b1 AND a2 = b2
        Scan t1, commonPartitionValues = [1, 2, 3]
        Scan t2, commonPartitionValues = [1, 2, 3]
      Scan t3, commonPartitionValues = [1, 2, 3]
    

    After this PR GroupPartitions operators do the grouping, which fully utilizes the t1 and t2 partitioning in the inner Join operator and regroups the join results for the outer Join operator:

    Join a1 = c1
      GroupPartitions commonPartitionValues = [1, 2, 3]
        Join a1 = b1 AND a2 = b2
          GroupPartitions commonPartitionValues = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]
            Scan t1
          GroupPartitions commonPartitionValues = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]
            Scan t2
      GroupPartitions commonPartitionValues = [1, 2, 3]
        Scan t3
    

    Fully utilized partitioning in joins can avoid skews better.

    Or consider the following example with 3 tables:

    • t1 is partitioned by a and returns partitions with keys 1, 1, 2, 2
    • t2 is partitioned by b and returns partitions with keys 2, 3
    • t3 is partitioned by c and returns partitions with keys 2, 4

    When spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled=true is set, then partial clustering can be used not only with 2 table joins, but with multi table joins as well:
    Before this PR:

    Join a = c
      Join a = b
        Scan t1, commonPartitionValues = [1, 2, 3, 4]
        Scan t2, commonPartitionValues = [1, 2, 3, 4]
      Scan t3, commonPartitionValues = [1, 2, 3, 4]
    

    After this PR:

    Join a = c
      GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3, 4]
        Join a = b
          GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3]
            Scan t1
          GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3], replicatePartitions = true
            Scan t2
      GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3, 4], replicatePartitions = true
        Scan t3
    

    Keeping one side unclustered can also help avoid skews.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs adjusted, new UTs from #53859 and additional new UTs to test the above improvements.

Was this patch authored or co-authored using generative AI tooling?

Yes, documentation and some helpers were added by Claude.

@peter-toth peter-toth force-pushed the SPARK-55535-refactor-kgp-and-spj branch from 0c0e88b to c1e3e93 Compare February 17, 2026 20:21
@peter-toth peter-toth changed the title [WIP][SPARK-55535][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join [WIP][SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join Feb 18, 2026
@peter-toth peter-toth force-pushed the SPARK-55535-refactor-kgp-and-spj branch from c1e3e93 to 53034f5 Compare February 18, 2026 18:48
@peter-toth
Contributor Author

This PR requires and contains the changes of #54335. Once that PR is merged I will rebase this one.

@peter-toth peter-toth changed the title [WIP][SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join [SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join Feb 18, 2026
@peter-toth peter-toth marked this pull request as ready for review February 18, 2026 19:38
@peter-toth
Contributor Author

@dongjoon-hyun
Member

Why don't you merge #54335 , @peter-toth ? You already got the required community approval on your PR.

Member

@szehon-ho szehon-ho left a comment

I think the coalesceRDD is a good idea, but it feels a bit risky to change the DataSourceRDD so much. Is there another way? Maybe have a custom RDD that holds the grouped partitions? Though I'm not so familiar with this part.

@szehon-ho
Member

Overall I like the GroupPartitionExec idea, but it would definitely be good to have some of @sunchao @viirya @chirag-s-db @cloud-fan also take a look.

@peter-toth
Contributor Author

peter-toth commented Feb 19, 2026

Why don't you merge #54335 , @peter-toth ? You already got the required community approval on your PR.

Sorry @dongjoon-hyun, I didn't notice your approval yesterday. Thanks for your review! @viirya requested a small change just now, once CI completes I will merge that PR and rebase this one.

@peter-toth
Contributor Author

peter-toth commented Feb 19, 2026

I think the coalesceRDD is a good idea, but it feels a bit risky to change the DataSourceRDD so much. Is there another way? Maybe have a custom RDD that holds the grouped partitions? Though I'm not so familiar with this part.

Initially I wanted to add a new RDD for GroupPartitionsExec, but it was very similar to CoalescedRDD / CoalescedRDDPartition, so just creating a new GroupedPartitionCoalescer, which holds the grouped partitions, seemed like a cleaner approach.
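Such a coalescer can be sketched against Spark's public PartitionCoalescer API roughly as follows (a minimal sketch with illustrative names; the PR's actual GroupedPartitionCoalescer may differ):

```scala
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Sketch: emit one PartitionGroup per precomputed group of parent partition
// indices, so partitions sharing a key land in a common output partition.
// (Illustrative only, not the PR's actual implementation.)
class GroupedPartitionCoalescerSketch(groups: Seq[Seq[Int]])
    extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] =
    groups.map { indices =>
      val group = new PartitionGroup()
      indices.foreach(i => group.partitions += parent.partitions(i))
      group
    }.toArray
}
```

It would then be passed to `RDD.coalesce(numPartitions, shuffle = false, partitionCoalescer = Some(...))` so that CoalescedRDD builds the output partitions from the precomputed groups.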

DataSourceRDD is now back to its pre-partition-grouping form. IMO we will need to backport the ThreadLocal[ReaderState] fix to previous Spark versions too so as to fix the case when there is a coalesce after the scan.

peter-toth added a commit that referenced this pull request Feb 19, 2026
### What changes were proposed in this pull request?

This is a minor refactor of `BroadcastHashJoinExec.outputPartitioning` to:
- simplify the logic and
- make it future proof by using `Partitioning with Expression` instead of `HashPartitioningLike`.

### Why are the changes needed?
Code cleanup and add support for future partitionings that implement `Expression` but not `HashPartitioningLike`. (Like `KeyedPartitioning` is in #54330.)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #54335 from peter-toth/SPARK-55551-improve-broadcasthashjoinexec-output-partitioning.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
@peter-toth peter-toth force-pushed the SPARK-55535-refactor-kgp-and-spj branch from 69ded9c to 8a087af Compare February 19, 2026 16:50
@peter-toth
Contributor Author

#54335 is merged and I've rebased this PR on latest master.

@szehon-ho
Member

szehon-ho commented Feb 19, 2026

Thanks for the refactor. I was actually wondering if this approach works (using cursor generation, please double check if it makes sense). It's more localized to the SPJ case then:

class GroupedPartitionsRDD(
    @transient private val dataSourceRDD: DataSourceRDD,
    groupedPartitions: Seq[Seq[Int]]
  ) extends RDD[InternalRow](dataSourceRDD) {
  
override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    val groupedPartition = split.asInstanceOf[GroupedPartitionsRDDPartition]
      val readers = new ArrayBuffer[PartitionReader[_]]()
      var listenerAdded = false
      
      def addCompletionListener(): Unit = {
        if (!listenerAdded) {
          context.addTaskCompletionListener[Unit] { _ =>
            readers.foreach { reader =>
              try {
                CustomMetrics.updateMetrics(
                  reader.currentMetricsValues.toImmutableArraySeq,
                  dataSourceRDD.customMetrics)
                reader.close()
              } catch {
                case e: Exception =>
                  logWarning(s"Error closing reader: ${e.getMessage}", e)
              }
            }
          }
          listenerAdded = true
        }
      }
      
      // Use a self-closing iterator wrapper
      new Iterator[InternalRow] {
        private val parentIter = groupedPartition.parentIndices.iterator
        private var currentIterator: Iterator[InternalRow] = null
        private var currentReader: PartitionReader[_] = null
        
        private def advance(): Boolean = {
          while (currentIterator == null || !currentIterator.hasNext) {
            if (!parentIter.hasNext) {
              // Close current reader if exists
              if (currentReader != null) {
                try {
                  CustomMetrics.updateMetrics(
                    currentReader.currentMetricsValues.toImmutableArraySeq,
                    dataSourceRDD.customMetrics)
                  currentReader.close()
                } catch {
                  case e: Exception =>
                    logWarning(s"Error closing reader: ${e.getMessage}", e)
                }
                currentReader = null
              }
              return false
            }
            
            // Close previous reader
            if (currentReader != null) {
              try {
                CustomMetrics.updateMetrics(
                  currentReader.currentMetricsValues.toImmutableArraySeq,
                  dataSourceRDD.customMetrics)
                currentReader.close()
              } catch {
                case e: Exception =>
                  logWarning(s"Error closing reader: ${e.getMessage}", e)
              }
            }
            
            val parentIndex = parentIter.next()
            val inputPartitionOpt = dataSourceRDD.inputPartitions(parentIndex)
            
            currentIterator = inputPartitionOpt.iterator.flatMap { inputPartition =>
              currentReader = if (dataSourceRDD.columnarReads) {
                dataSourceRDD.partitionReaderFactory.createColumnarReader(inputPartition)
              } else {
                dataSourceRDD.partitionReaderFactory.createReader(inputPartition)
              }
              
              addCompletionListener()
              
              val iter = if (dataSourceRDD.columnarReads) {
                new MetricsBatchIterator(
                  new PartitionIterator[ColumnarBatch](
                    currentReader.asInstanceOf[PartitionReader[ColumnarBatch]], 
                    dataSourceRDD.customMetrics))
              } else {
                new MetricsRowIterator(
                  new PartitionIterator[InternalRow](
                    currentReader.asInstanceOf[PartitionReader[InternalRow]], 
                    dataSourceRDD.customMetrics))
              }
              
              iter.asInstanceOf[Iterator[InternalRow]]
            }
          }
          true
        }
    
    override def hasNext: Boolean = advance()
    
    override def next(): InternalRow = {
      if (!hasNext) {
        throw new NoSuchElementException("next on empty iterator")
      }
      currentIterator.next()
    }
  }
}
}


private case class GroupedPartitionsRDDPartition(
    index: Int,
    parentIndices: Array[Int],
    preferredLocation: Option[String] = None
  ) extends Partition

that can be used in GroupPartitionsExec like:

override protected def doExecute(): RDD[InternalRow] = {
  if (groupedPartitions.isEmpty) {
    sparkContext.emptyRDD
  } else {
    child.execute() match {
      case dsRDD: DataSourceRDD =>
        new GroupedPartitionsRDD(
          dsRDD,
          groupedPartitions.map(_._2))
      case _ => // error or fallback?
    }
  }
}

The current code is definitely more Spark-native, reusing coalesceRDD, but my doubt is the ThreadLocal and the chance of a memory leak like the one fixed by @viirya: #51503. But I'll defer to others, if people like this approach more.

@szehon-ho
Member

Edit: I guess this is what you said you considered in your previous comment.

@szehon-ho
Member

DataSourceRDD is now back to its pre-partition-grouping form. IMO we will need to backport the ThreadLocal[ReaderState] fix to previous Spark versions too so as to fix the case when there is a coalesce after the scan.

BTW, I didn't get this: are you saying there is some leak in the current DataSourceRDD that needs ThreadLocal to fix? Should it be fixed separately?

@peter-toth
Contributor Author

peter-toth commented Feb 19, 2026

The current code is definitely more Spark-native, reusing coalesceRDD, but my doubt is the ThreadLocal and the chance of a memory leak like the one fixed by @viirya: #51503. But I'll defer to others, if people like this approach more.

As far as I see, you assume that the child is a DataSourceRDD, but the main point of this change is to move the grouping logic to the new operator (GroupPartitionsExec) so as to be able to insert it into plans where there is no BatchScanExec / DataSourceRDD, e.g. cached or checkpointed plans.

Also, even if there is a BatchScanExec (DataSourceRDD) in the plan, GroupPartitionsExec is inserted right below the join / aggregate where the grouping is needed (like an exchange is inserted), so there can be other nodes / RDDs between GroupPartitionsExec and the data source. So we can't assume that.

DataSourceRDD is now back to its pre-partition-grouping form. IMO we will need to backport the ThreadLocal[ReaderState] fix to previous Spark versions too so as to fix the case when there is a coalesce after the scan.

BTW, I didn't get this: are you saying there is some leak in the current DataSourceRDD that needs ThreadLocal to fix? Should it be fixed separately?

Not necessarily a leak, but there are some issues with custom metrics reporting and with when the readers get closed. Consider the following plan (without this PR):

...
  CoalesceExec (1)
    BatchScanExec (returns 2 partitions)

We have only 1 task in the stage due to coalesce(1), and that task calls DataSourceRDD.compute() for both input partitions. It doesn't matter if those partitions are actually grouped or not. Both invocations create one reader each and install one listener each to close the reader and report the reader's custom metrics. But the listeners run only at the end of the task, so the first reader is kept open for too long. What's worse, the 2 reported metrics conflict and only one will be kept.
So I think yes, we need to fix this issue on other branches as well. I don't think we can/should backport this refactor to older versions, but we can extract the ThreadLocal[ReaderState] logic and apply it on other branches. Technically we could fix it separately, but as this PR hugely simplifies the affected DataSourceRDD, it is easier to do it together with this refactor on the master branch.
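The reader lifetime problem above can be sketched without any Spark code (the names are illustrative only): when each compute() call merely registers a close callback that runs at task completion, the first reader stays open for the whole task while the second one is consumed.

```scala
import scala.collection.mutable.ArrayBuffer

// Simulates one task computing two input partitions where readers are only
// closed by task-completion listeners (illustrative, not Spark code).
val log = ArrayBuffer.empty[String]
val taskCompletionListeners = ArrayBuffer.empty[() => Unit]
for (p <- Seq("p0", "p1")) {
  log += s"open $p"
  taskCompletionListeners += (() => log += s"close $p") // deferred to task end
  log += s"read $p"
}
taskCompletionListeners.foreach(_.apply()) // task finishes: only now do readers close
assert(log == ArrayBuffer("open p0", "read p0", "open p1", "read p1",
  "close p0", "close p1"))
// p0's reader stayed open while p1 was read, and both metric reports arrive
// together at task end, where they can conflict.
```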

@szehon-ho
Member

szehon-ho commented Feb 19, 2026

As far as I see you assume that the child is a DataSourceRDD, but the main point of this change is to move the grouping logic to the new operator (GroupPartitionsExec) so as to be able to insert it into those plans as well where there is no BatchScanExec / DataSourceRDD, e.g. cached or checkpointed plans.

I see, sorry, I forgot about that case.

Interesting, so you mean we are losing metrics. Should we at least add a test? It may make sense to do it in a separate PR, but that depends on the final approach. The approach does make sense; I am a bit unsure if ThreadLocal is the best/safest approach, considering the risk of introducing a memory leak. As you can see it's a bit tricky, but I am not so familiar with the DataSourceRDD code.

@peter-toth
Contributor Author

Sure, let me add a test tomorrow, and maybe someone can come up with a better idea to fix it.

@peter-toth
Contributor Author

peter-toth commented Feb 20, 2026

I extracted the metrics reporting bug / fix to SPARK-55619 / #54396 and added a new test.

@szehon-ho
Member

Thank you!

…roupPartitionsExec` operator, remove old code
ensureOrdering(child, child.outputPartitioning, o)
case _ => child
}
case (c @ GroupedPartitions(p), distribution) if p.satisfies(distribution) =>
Contributor Author

I need to revisit this part, as converting a KeyedPartitioning to grouped (building the distinct set of keys) just to check if it can satisfy a distribution doesn't make sense...

Contributor Author

This is refactored in 326915b; now we don't compute the distinct set of keys to decide if a KeyedPartitioning can satisfy a distribution.
As KeyedPartitioning is a special partitioning (not just because of this refactor PR), I elaborated on what KeyedPartitioning.satisfies() actually means.

Contributor Author

Something is wrong with that commit, let me check the test failures.

Contributor Author

Should be fixed in 7951dc6.

@peter-toth
Contributor Author

Also can we add more tests:

  • Empty grouped partitions: Plan that yields groupedPartitions.isEmpty (ie, partitioned table but no partition values inserted)

@szehon-ho, I added an empty partitioned table test in 4a904ad, but it seems we prevent returning KeyedPartitioning without partitions. This is not new behaviour; it worked the same way with KeyGroupedPartitioning before this PR. If we removed that inputPartitions.nonEmpty guard from BatchScanExec then the 2 shuffles would disappear, but no GroupPartitionsExec is added as those are not needed. Maybe the only way to get GroupPartitionsExec with empty groupedPartitions is to enable spark.sql.sources.v2.bucketing.partition.filter.enabled and use disjoint sets of keys in the join legs to get empty expectedPartitionKeys. Let me check this tomorrow.

@szehon-ho, I added a test case that yields groupedPartitions.isEmpty in 32b563f.
The commit also cleans up the SPARK-55092 (scans should not group partitions) test case, but that's kind of trivial due to moving partition grouping out of scans.

@peter-toth
Contributor Author

@cloud-fan, @szehon-ho, @viirya I wonder if we can proceed with this refactor?

Please note that this change implicitly fixes the correctness issue reported in #54378 / SPARK-55848, but we would need the tests from @naveenp2708's #54679 on master as well.

@peter-toth peter-toth closed this in a1c62dd Mar 9, 2026
@peter-toth
Contributor Author

Thank you for the review @cloud-fan, @dongjoon-hyun, @viirya, @szehon-ho and @chirag-s-db.

Merged to master (4.2.0).

peter-toth added a commit that referenced this pull request Mar 12, 2026
…minor improvements to `EnsureRequirements`

### What changes were proposed in this pull request?

This is a follow-up PR to #54330 to fix `OrderedDistribution` handling in `EnsureRequirements` so as to avoid a correctness bug. The PR contains minor improvements to `EnsureRequirements` and configuration docs updates as well.

### Why are the changes needed?

To fix a correctness bug introduced with the refactor.

### Does this PR introduce _any_ user-facing change?

Yes, but the refactor (#54330) hasn't been released.

### How was this patch tested?

Added new UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54727 from peter-toth/SPARK-55535-refactor-kgp-and-spj-follow-up.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
peter-toth pushed a commit that referenced this pull request Mar 16, 2026
…clustering

### What changes were proposed in this pull request?

Backport fix for SPARK-55848 to branch-4.1. This branch does not have the KeyGroupedPartitioning refactor (#54330) from master.

The fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange.

### Why are the changes needed?

SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number). The partially-clustered partitioning is incorrectly treated as satisfying `ClusteredDistribution`, so no Exchange is inserted before dedup operators.

### Does this PR introduce any user-facing change?

Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.

### How was this patch tested?

Three regression tests added to KeyGroupedPartitioningSuite with data correctness checks and plan assertions verifying shuffle Exchange presence. All 95 tests pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54751 from naveenp2708/spark-55848-fix-branch-4.1.

Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
peter-toth pushed a commit that referenced this pull request Mar 20, 2026
…ring dedup

### What changes were proposed in this pull request?

Test-only PR. Adds regression tests for SPARK-55848 (SPJ partial clustering produces incorrect results for post-join dedup operations).

Three tests added to KeyGroupedPartitioningSuite:
1. SPARK-55848: dropDuplicates after SPJ with partial clustering
2. SPARK-55848: Window dedup after SPJ with partial clustering
3. SPARK-55848: checkpointed scan with partial clustering and dedup

### Why are the changes needed?

The fix was merged via #54330, but regression tests for the correctness issue (SPARK-55848 / #54378) were not included. These tests ensure the issue does not regress.

### Does this PR introduce any user-facing change?

No. Test-only change.

### How was this patch tested?

All 73 tests in KeyGroupedPartitioningSuite pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54714 from naveenp2708/spark-55848-tests-master.

Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
peter-toth added a commit that referenced this pull request Mar 21, 2026
### What changes were proposed in this pull request?

This PR adds a new method to SPJ partition key `Reducer`s to return the type of a reduced partition key.

### Why are the changes needed?
After the [SPJ refactor](#54330) some Iceberg SPJ tests that join an `hours` transform partitioned table with a `days` transform partitioned table started to fail. This is because after the refactor the keys of a `KeyedPartitioning` are `InternalRowComparableWrapper`s, which include the type of the key, and when the partition keys are reduced the type of the reduced keys is inherited from their original type.

- #54330

This means that when `hours` transformed hour keys are reduced to days, the keys actually keep their `IntegerType`, while the `days` transformed keys have `DateType` in Iceberg. This type difference causes the left and right side `InternalRowComparableWrapper`s not to be considered equal despite their raw `InternalRow` key data being equal.

Before the refactor the types of the (possibly reduced) partition keys were not stored in the partitioning. When the left and right side raw keys were compared in `EnsureRequirements`, a common comparator was initialized with the type of the left side keys.
So in the Iceberg SPJ tests the `IntegerType` keys were forced to be interpreted as `DateType`, or the `DateType` keys were forced to be interpreted as `IntegerType`, depending on the join order of the tables.
The reason why this was not causing any issues is that the `PhysicalDataType` of both the `DateType` and `IntegerType` logical types is `PhysicalIntegerType`.

This PR introduces a new `resultType()` method on `Reducer` to return the correct type of the reduced keys, properly compares the left and right side reduced key types, and throws an error when they are not the same.

### Does this PR introduce _any_ user-facing change?
Yes, the reduced key types are now properly compared and incompatibilities are reported to users.

### How was this patch tested?
Added new UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #54884 from peter-toth/SPARK-56046-typed-spj-reducers.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
terana pushed the two commits above (SPARK-55848 regression tests and the `Reducer.resultType()` change) to terana/spark, referencing this pull request, Mar 23, 2026.
* `KeyedPartitioning` is used in two distinct forms:
*
* 1. '''As outputPartitioning''': When used as a node's output partitioning (e.g., in
* `BatchScanExec` or `GroupPartitionsExec`), the `partitionKeys` are always in sorted order.
Member

@pan3793 pan3793 Apr 17, 2026

@peter-toth, I wonder if we can relax this assumption: allow non-sorted and ungrouped partitionKeys in outputPartitioning.

I found an interesting case while experimenting with this feature. Suppose we have a table like:

```sql
CREATE OR REPLACE TABLE orders_userid_dt_iceberg (
  order_id   BIGINT,
  user_id    BIGINT,
  amount     DECIMAL(10,2),
  dt         STRING
)
USING iceberg
PARTITIONED BY (bucket(4, user_id), dt);

INSERT INTO orders_userid_dt_iceberg VALUES
  (1001, 1, 120.50, '2025-01-01'),
  (1002, 1,  80.00, '2025-01-01'),
  (1003, 2, 200.00, '2025-01-01'),
  (1004, 2,  50.00, '2025-01-02'),
  (1005, 3,  30.00, '2025-01-02'),
  (1006, 3,  70.00, '2025-01-03'),
  (1007, 1,  60.00, '2025-01-03');

SELECT user_id, count(*)
FROM orders_userid_dt_iceberg
WHERE dt = '2025-01-01'
GROUP BY user_id;
```

For this query, `ColumnPruning` injects a `Project` on top of `Filter(RelationV2)`, and then:

```
+- == Initial Plan ==
   HashAggregate (13)
   +- Exchange (12)
      +- HashAggregate (11)
         +- Project (10)    <= KeyedPartitioning is dropped from here, see AliasAwareQueryOutputOrdering#outputPartitioning
            +- BatchScan iceberg spark_catalog.default.orders_userid_dt_iceberg (1)
```

If we make a projection for KeyedPartitioning in AliasAwareQueryOutputOrdering#outputPartitioning instead of dropping it (obviously, the projected one will break the current assumption in the docs), we can avoid an expensive shuffle. My experiment shows this works at least for such a simple query. Do you think this is the right direction?
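A rough Python sketch of the relaxation being proposed; the names (`KeyedPartitioning`, `project_partitioning`) are illustrative toy-model stand-ins, not Spark internals:

```python
# Toy model of projecting a keyed partitioning through a Project node instead
# of dropping it. Expressions are modeled as plain strings.
from dataclasses import dataclass

@dataclass
class KeyedPartitioning:
    expressions: list     # partition transform expressions, as strings here
    partition_keys: list  # one tuple of key values per input partition

def project_partitioning(part, output_attrs, alias_map):
    """Rewrite expressions through aliases and narrow each key tuple to the
    expressions that survive the projection. The result may contain duplicate
    (ungrouped) keys, which is exactly the relaxed form asked about above."""
    projected = [alias_map.get(e, e) for e in part.expressions]
    keep = [i for i, e in enumerate(projected) if e in output_attrs]
    if not keep:
        return None  # nothing survives: the partitioning really is lost
    return KeyedPartitioning(
        [projected[i] for i in keep],
        [tuple(k[i] for i in keep) for k in part.partition_keys])

scan = KeyedPartitioning(["bucket4(user_id)", "dt"],
                         [(0, "2025-01-01"), (1, "2025-01-01")])
# dt is pruned by the Project, but the bucket expression survives:
print(project_partitioning(scan, ["bucket4(user_id)"], {}))
```

Under this sketch the partitioning on `bucket4(user_id)` survives the projection, so the aggregate above could be planned without the exchange.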


@peter-toth peter-toth Apr 17, 2026


@pan3793, actually sorted order is not a hard requirement, but it increases the chance that we don't need to add GroupPartitionsExec with explicit expectedPartitionKeys into the query plan to align partitions by keys on both sides of a join.

The idea of adjusting PartitioningPreservingUnaryExecNodes to not drop but project KeyedPartitionings has already come up: #54330 (comment), but I haven't had time to work on it yet.
It is a bit tricky because the current logic requires that all KeyedPartitionings in a partitioning collection have equal sequences of partition keys (and an identical sequence is even better, as it decreases the footprint of the partitioning). I think we should maintain this invariant during projection and keep only one sequence of keys, but it should use the most granular expressions. Let me give you an example:
Let's suppose we have child.outputPartitioning as

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x, y], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
  KeyedPartitioning(expressions = [x_alias, y], partitionKeys = <identical seq>),
  KeyedPartitioning(expressions = [x, y_alias], partitionKeys = <identical seq>),
  KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq>))
```

because we have Project x, x as x_alias, y, y as y_alias somewhere in the child subplan.

Now, if we have Project x, x_alias on top, then obviously the node's outputPartitioning could be:

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x], partitionKeys = [(1), (1), (2), (2)]),
  KeyedPartitioning(expressions = [x_alias], partitionKeys = <identical seq>))
```

But if we have Project x, x_alias, y_alias, then we should drop the projection of the first KeyedPartitioning and keep only those that have more granularity:

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x, y_alias], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
  KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq>))
```

I think what we should avoid is having multiple different partitionKeys in a collection like:

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x], partitionKeys = [(1), (1), (2), (2)]),
  KeyedPartitioning(expressions = [x_alias], partitionKeys = <identical seq>),
  KeyedPartitioning(expressions = [x, y_alias], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
  KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq 2>))
```

because it would break the current logic and it doesn't have any benefit.
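The "keep only the most granular projected partitionings" rule above can be sketched as a small Python toy model; the function name and the string-based expression modeling are assumptions for illustration, not Spark code:

```python
# Toy model: project every KeyedPartitioning in a collection through a Project,
# then keep only the entries with the most granular (longest) expression lists,
# deduplicated, so the whole collection shares one sequence of partition keys.
def project_collection(partitionings, output_attrs):
    projected = []
    for exprs, keys in partitionings:
        keep = [i for i, e in enumerate(exprs) if e in output_attrs]
        if keep:
            projected.append(([exprs[i] for i in keep],
                              [tuple(k[i] for i in keep) for k in keys]))
    if not projected:
        return []
    max_arity = max(len(exprs) for exprs, _ in projected)
    seen, result = set(), []
    for exprs, keys in projected:
        if len(exprs) == max_arity and tuple(exprs) not in seen:
            seen.add(tuple(exprs))
            result.append((exprs, keys))
    return result

keys = [(1, 1), (1, 2), (2, 1), (2, 2)]
coll = [(["x", "y"], keys), (["x_alias", "y"], keys),
        (["x", "y_alias"], keys), (["x_alias", "y_alias"], keys)]

# Project x, x_alias, y_alias: the arity-1 projections of the first two
# entries are dropped in favor of the arity-2 ones.
print(project_collection(coll, {"x", "x_alias", "y_alias"}))
```

This reproduces both cases from the example: projecting onto {x, x_alias} yields two arity-1 partitionings with identical keys, while projecting onto {x, x_alias, y_alias} keeps only the two-column ones, never mixing key sequences of different granularity.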

Also, we should probably think through how KeyedPartitioning projection relates to the allowJoinKeysSubsetOfPartitionKeys conf.

Anyways, I can probably open a PR next week or so, but if you would like to work on this just let me know.


@peter-toth, thanks for the detailed explanation, and looking forward to your subsequent improvements on these parts!


@pan3793, #55519 should fix the problem with AliasAwareQueryOutputOrdering#outputPartitioning.
