[SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join#54330

Closed
peter-toth wants to merge 30 commits intoapache:masterfrom
peter-toth:SPARK-55535-refactor-kgp-and-spj

Conversation

@peter-toth
Contributor

@peter-toth peter-toth commented Feb 15, 2026

What changes were proposed in this pull request?

This PR extracts the partition grouping logic from BatchScanExec into a new GroupPartitionsExec operator and replaces KeyGroupedPartitioning with KeyedPartitioning.

  • KeyedPartitioning represents a partitioning where the partition keys are known. It can be grouped (clustered) by partition keys or not. When grouping is required, the new operator can be inserted into a plan at any place (similarly to how exchanges are inserted under joins or aggregates to satisfy expected distributions), thus creating the necessary grouped/replicated partitions by keys.
  • The implementation of GroupPartitionsExec uses the already existing CoalescedRDD with a new GroupedPartitionCoalescer to ensure that input partitions with the same key end up in a common output partition.
  • This PR kind of restores DataSourceRDD to its pre-SPJ form.
  • This PR tries to unify the terminology and prefers using PartitionKey instead of the previous PartitionValues to be in sync with the DSv2 HasPartitionKey interface.
  • After this PR StoragePartitionJoinParams is not required in BatchScanExec, its fields are now part of the new GroupPartitionsExec operator.
  • KeyedPartitioning no longer stores originalPartitionKeys for partially clustered joins as those keys are available as outputPartitioning of the join's children (below the inserted GroupPartitionsExec if that is inserted).
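The core idea of the new operator, collecting input partitions that share a partition key into a common output partition, can be illustrated with a small self-contained sketch (the names and shapes below are illustrative only, not the PR's actual classes):

```scala
// Illustrative sketch only: group input partition indices by partition key so
// that all partitions sharing a key land in one output group, which is what
// GroupPartitionsExec needs to establish before a join or aggregate.
case class InPart(index: Int, key: Seq[Int])

def groupByKey(parts: Seq[InPart]): Map[Seq[Int], Seq[Int]] =
  parts.groupBy(_.key).map { case (k, ps) => k -> ps.map(_.index) }

val parts = Seq(InPart(0, Seq(1)), InPart(1, Seq(2)), InPart(2, Seq(1)))
val grouped = groupByKey(parts)
// Partitions 0 and 2 share key (1), so they end up in one output group.
assert(grouped(Seq(1)) == Seq(0, 2))
assert(grouped(Seq(2)) == Seq(1))
```

For replicated partitions (the partially clustered case), the same partition index would simply appear in more than one output group.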

Why are the changes needed?

  1. To solve the issue of unnecessary partition grouping SPARK-55092 ([SPARK-55092][SQL] Disable partition grouping in KeyGroupedPartitioning when not needed #53859) and simplify the KGP/SPJ implementation.

  2. A new operator allows more granular control over partition grouping, which can improve multi table joins:

    Consider the following example with 3 tables:

    • t1 is partitioned by (a1, a2) and returns partitions with keys (1, 1), (1, 2), (2, 1), (2, 2)
    • t2 is partitioned by (b1, b2) and returns partitions with keys (2, 1), (2, 3), (3, 1), (3, 2)
    • t3 is partitioned by c1 and returns partitions with keys 2, 3

    When spark.sql.requireAllClusterKeysForCoPartition=false and spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true are set and the query is t1 JOIN t2 ON a1 = b1 AND a2 = b2 JOIN t3 ON a1 = c1, then storage partition join kicks in.
    Before this PR the common set of partition keys is pushed down to all 3 scans:

    Join a1 = c1
      Join a1 = b1 AND a2 = b2
        Scan t1, commonPartitionValues = [1, 2, 3]
        Scan t2, commonPartitionValues = [1, 2, 3]
      Scan t3, commonPartitionValues = [1, 2, 3]
    

    After this PR GroupPartitions operators do the grouping, which fully utilizes the t1 and t2 partitioning in the inner Join operator and regroups the join results for the outer Join operator:

    Join a1 = c1
      GroupPartitions commonPartitionValues = [1, 2, 3]
        Join a1 = b1 AND a2 = b2
          GroupPartitions commonPartitionValues = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]
            Scan t1
          GroupPartitions commonPartitionValues = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)]
            Scan t2
      GroupPartitions commonPartitionValues = [1, 2, 3]
        Scan t3
    

    Fully utilized partitioning in joins can avoid skews better.

    Or consider the following example with 3 tables:

    • t1 is partitioned by a and returns partitions with keys 1, 1, 2, 2
    • t2 is partitioned by b and returns partitions with keys 2, 3
    • t3 is partitioned by c and returns partitions with keys 2, 4

    When spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled=true is set, then partial clustering can be used not only with 2 table joins, but with multi table joins as well:
    Before this PR:

    Join a = c
      Join a = b
        Scan t1, commonPartitionValues = [1, 2, 3, 4]
        Scan t2, commonPartitionValues = [1, 2, 3, 4]
      Scan t3, commonPartitionValues = [1, 2, 3, 4]
    

    After this PR:

    Join a = c
      GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3, 4]
        Join a = b
          GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3]
            Scan t1
          GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3], replicatePartitions = true
            Scan t2
      GroupPartitions commonPartitionValues = [1, 1, 2, 2, 3, 4], replicatePartitions = true
        Scan t3
    

    Keeping one side unclustered can also help avoid skews.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs adjusted, new UTs from #53859 and additional new UTs to test the above improvements.

Was this patch authored or co-authored using generative AI tooling?

Yes, documentation and some helpers were added by Claude.

@peter-toth peter-toth force-pushed the SPARK-55535-refactor-kgp-and-spj branch from 0c0e88b to c1e3e93 Compare February 17, 2026 20:21
@peter-toth peter-toth changed the title [WIP][SPARK-55535][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join [WIP][SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join Feb 18, 2026
@peter-toth peter-toth force-pushed the SPARK-55535-refactor-kgp-and-spj branch from c1e3e93 to 53034f5 Compare February 18, 2026 18:48
@peter-toth
Contributor Author

This PR requires and contains the changes of #54335. Once that PR is merged I will rebase this one.

@peter-toth peter-toth changed the title [WIP][SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join [SPARK-55535][SPARK-55092][SQL] Refactor KeyGroupedPartitioning and Storage Partition Join Feb 18, 2026
@peter-toth peter-toth marked this pull request as ready for review February 18, 2026 19:38
@peter-toth
Contributor Author

@dongjoon-hyun
Member

Why don't you merge #54335 , @peter-toth ? You already got the required community approval on your PR.

Member

@szehon-ho szehon-ho left a comment

I think the coalesceRDD is a good idea, but it feels a bit risky to change the DataSourceRDD so much. Is there another way? Maybe have a custom RDD that holds the grouped partitions? Though I'm not so familiar with this part.

@szehon-ho
Member

Overall I like the GroupPartitionExec idea, but it would definitely be good to have some of @sunchao @viirya @chirag-s-db @cloud-fan also take a look.

@peter-toth
Contributor Author

peter-toth commented Feb 19, 2026

Why don't you merge #54335 , @peter-toth ? You already got the required community approval on your PR.

Sorry @dongjoon-hyun, I didn't notice your approval yesterday. Thanks for your review! @viirya requested a small change just now, once CI completes I will merge that PR and rebase this one.

@peter-toth
Contributor Author

peter-toth commented Feb 19, 2026

I think the coalesceRDD is a good idea, but it feels a bit risky to change the DataSourceRDD so much. Is there another way? Maybe have a custom RDD that holds the grouped partitions? Though I'm not so familiar with this part.

Initially I wanted to add a new RDD for GroupPartitionsExec, but it was very similar to CoalescedRDD / CoalescedRDDPartition, so just creating a new GroupedPartitionCoalescer, which holds the grouped partitions, seemed like a cleaner approach.
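Such a coalescer can be sketched against Spark's public PartitionCoalescer API roughly as follows (a minimal sketch with illustrative names; the PR's actual GroupedPartitionCoalescer may differ):

```scala
import org.apache.spark.rdd.{PartitionCoalescer, PartitionGroup, RDD}

// Sketch: emit one PartitionGroup per precomputed group of parent partition
// indices, so partitions sharing a key land in a common output partition.
// (Illustrative only, not the PR's actual implementation.)
class GroupedPartitionCoalescerSketch(groups: Seq[Seq[Int]])
    extends PartitionCoalescer with Serializable {
  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] =
    groups.map { indices =>
      val group = new PartitionGroup()
      indices.foreach(i => group.partitions += parent.partitions(i))
      group
    }.toArray
}
```

It would then be passed to `RDD.coalesce(numPartitions, shuffle = false, partitionCoalescer = Some(...))` so that CoalescedRDD builds the output partitions from the precomputed groups.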

DataSourceRDD is now back to its pre-partition-grouping form. IMO we will need to backport the ThreadLocal[ReaderState] fix to previous Spark versions too so as to fix the case when there is a coalesce after the scan.

peter-toth added a commit that referenced this pull request Feb 19, 2026
### What changes were proposed in this pull request?

This is a minor refactor of `BroadcastHashJoinExec.outputPartitioning` to:
- simplify the logic and
- make it future proof by using `Partitioning with Expression` instead of `HashPartitioningLike`.

### Why are the changes needed?
Code cleanup and add support for future partitionings that implement `Expression` but not `HashPartitioningLike`. (Like `KeyedPartitioning` is in #54330.)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #54335 from peter-toth/SPARK-55551-improve-broadcasthashjoinexec-output-partitioning.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
@peter-toth peter-toth force-pushed the SPARK-55535-refactor-kgp-and-spj branch from 69ded9c to 8a087af Compare February 19, 2026 16:50
@peter-toth
Contributor Author

#54335 is merged and I've rebased this PR on latest master.

@szehon-ho
Member

szehon-ho commented Feb 19, 2026

Thanks for the refactor. I was actually wondering if this approach works (using cursor generation, please double check if it makes sense). It's more localized to the SPJ case then:

class GroupedPartitionsRDD(
    @transient private val dataSourceRDD: DataSourceRDD,
    groupedPartitions: Seq[Seq[Int]]
  ) extends RDD[InternalRow](dataSourceRDD) {
  
override def compute(split: Partition, context: TaskContext): Iterator[InternalRow] = {
    val groupedPartition = split.asInstanceOf[GroupedPartitionsRDDPartition]
      val readers = new ArrayBuffer[PartitionReader[_]]()
      var listenerAdded = false
      
      def addCompletionListener(): Unit = {
        if (!listenerAdded) {
          context.addTaskCompletionListener[Unit] { _ =>
            readers.foreach { reader =>
              try {
                CustomMetrics.updateMetrics(
                  reader.currentMetricsValues.toImmutableArraySeq,
                  dataSourceRDD.customMetrics)
                reader.close()
              } catch {
                case e: Exception =>
                  logWarning(s"Error closing reader: ${e.getMessage}", e)
              }
            }
          }
          listenerAdded = true
        }
      }
      
      // Use a self-closing iterator wrapper
      new Iterator[InternalRow] {
        private val parentIter = groupedPartition.parentIndices.iterator
        private var currentIterator: Iterator[InternalRow] = null
        private var currentReader: PartitionReader[_] = null
        
        private def advance(): Boolean = {
          while (currentIterator == null || !currentIterator.hasNext) {
            if (!parentIter.hasNext) {
              // Close current reader if exists
              if (currentReader != null) {
                try {
                  CustomMetrics.updateMetrics(
                    currentReader.currentMetricsValues.toImmutableArraySeq,
                    dataSourceRDD.customMetrics)
                  currentReader.close()
                } catch {
                  case e: Exception =>
                    logWarning(s"Error closing reader: ${e.getMessage}", e)
                }
                currentReader = null
              }
              return false
            }
            
            // Close previous reader
            if (currentReader != null) {
              try {
                CustomMetrics.updateMetrics(
                  currentReader.currentMetricsValues.toImmutableArraySeq,
                  dataSourceRDD.customMetrics)
                currentReader.close()
              } catch {
                case e: Exception =>
                  logWarning(s"Error closing reader: ${e.getMessage}", e)
              }
            }
            
            val parentIndex = parentIter.next()
            val inputPartitionOpt = dataSourceRDD.inputPartitions(parentIndex)
            
            currentIterator = inputPartitionOpt.iterator.flatMap { inputPartition =>
              currentReader = if (dataSourceRDD.columnarReads) {
                dataSourceRDD.partitionReaderFactory.createColumnarReader(inputPartition)
              } else {
                dataSourceRDD.partitionReaderFactory.createReader(inputPartition)
              }
              
              addCompletionListener()
              
              val iter = if (dataSourceRDD.columnarReads) {
                new MetricsBatchIterator(
                  new PartitionIterator[ColumnarBatch](
                    currentReader.asInstanceOf[PartitionReader[ColumnarBatch]], 
                    dataSourceRDD.customMetrics))
              } else {
                new MetricsRowIterator(
                  new PartitionIterator[InternalRow](
                    currentReader.asInstanceOf[PartitionReader[InternalRow]], 
                    dataSourceRDD.customMetrics))
              }
              
              iter.asInstanceOf[Iterator[InternalRow]]
            }
          }
          true
        }
    
    override def hasNext: Boolean = advance()
    
    override def next(): InternalRow = {
      if (!hasNext) {
        throw new NoSuchElementException("next on empty iterator")
      }
      currentIterator.next()
    }
  }
}
}


private case class GroupedPartitionsRDDPartition(
    index: Int,
    parentIndices: Array[Int],
    preferredLocation: Option[String] = None
  ) extends Partition

that can be used in GroupPartitionsExec like:

override protected def doExecute(): RDD[InternalRow] = {
  if (groupedPartitions.isEmpty) {
    sparkContext.emptyRDD
  } else {
    child.execute() match {
      case dsRDD: DataSourceRDD =>
        new GroupedPartitionsRDD(
          dsRDD,
          groupedPartitions.map(_._2))
      case _ => // error or fallback?
    }
  }
}

The current code is definitely more Spark-native, reusing coalesceRDD, but my doubt is the ThreadLocal and the chance of a memory leak like the one fixed by @viirya: #51503. But I'll defer to others, if people like this approach more.

@szehon-ho
Member

Edit: I guess this is what you said you considered in your previous comment.

@szehon-ho
Member

DataSourceRDD is now back to its pre-partition-grouping form. IMO we will need to backport the ThreadLocal[ReaderState] fix to previous Spark versions too so as to fix the case when there is a coalesce after the scan.

BTW, I didn't get this: are you saying there is some leak in the current DataSourceRDD that needs ThreadLocal to fix? Should it be fixed separately?

@peter-toth
Contributor Author

peter-toth commented Feb 19, 2026

The current code is definitely more Spark-native, reusing coalesceRDD, but my doubt is the ThreadLocal and the chance of a memory leak like the one fixed by @viirya: #51503. But I'll defer to others, if people like this approach more.

As far as I see, you assume that the child is a DataSourceRDD, but the main point of this change is to move the grouping logic to the new operator (GroupPartitionsExec) so as to be able to insert it into plans where there is no BatchScanExec / DataSourceRDD, e.g. cached or checkpointed plans.

Also, even if there is a BatchScanExec (DataSourceRDD) in the plan, GroupPartitionsExec is inserted right below the join / aggregate where the grouping is needed (like an exchange is inserted), so there can be other nodes / RDDs between GroupPartitionsExec and the data source. So we can't assume that.

DataSourceRDD is now back to its pre-partition-grouping form. IMO we will need to backport the ThreadLocal[ReaderState] fix to previous Spark versions too so as to fix the case when there is a coalesce after the scan.

BTW, I didn't get this: are you saying there is some leak in the current DataSourceRDD that needs ThreadLocal to fix? Should it be fixed separately?

Not necessarily a leak, but there are some issues with custom metrics reporting and with when the readers get closed. Consider the following plan (without this PR):

...
  CoalesceExec (1)
    BatchScanExec (returns 2 partitions)

We have only 1 task in the stage due to coalesce(1), and that task calls DataSourceRDD.compute() for both input partitions. It doesn't matter if those partitions are actually grouped or not. Both invocations create one reader each and install one listener each to close the reader and report the reader's custom metrics. But the listeners run only at the end of the task, so the first reader is kept open for too long. What's worse, the 2 reported metrics conflict and only one will be kept.
So I think yes, we need to fix this issue on other branches as well. I don't think we can/should backport this refactor to older versions, but we can extract the ThreadLocal[ReaderState] logic and apply it on other branches. Technically we could fix it separately, but as this PR hugely simplifies the affected DataSourceRDD, it is easier to do it together with this refactor on the master branch.
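The reader lifetime problem above can be sketched without any Spark code (the names are illustrative only): when each compute() call merely registers a close callback that runs at task completion, the first reader stays open for the whole task while the second one is consumed.

```scala
import scala.collection.mutable.ArrayBuffer

// Simulates one task computing two input partitions where readers are only
// closed by task-completion listeners (illustrative, not Spark code).
val log = ArrayBuffer.empty[String]
val taskCompletionListeners = ArrayBuffer.empty[() => Unit]
for (p <- Seq("p0", "p1")) {
  log += s"open $p"
  taskCompletionListeners += (() => log += s"close $p") // deferred to task end
  log += s"read $p"
}
taskCompletionListeners.foreach(_.apply()) // task finishes: only now do readers close
assert(log == ArrayBuffer("open p0", "read p0", "open p1", "read p1",
  "close p0", "close p1"))
// p0's reader stayed open while p1 was read, and both metric reports arrive
// together at task end, where they can conflict.
```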

@szehon-ho
Member

szehon-ho commented Feb 19, 2026

As far as I see you assume that the child is a DataSourceRDD, but the main point of this change is to move the grouping logic to the new operator (GroupPartitionsExec) so as to be able to insert it into those plans as well where there is no BatchScanExec / DataSourceRDD, e.g. cached or checkpointed plans.

I see, sorry, I forgot about that case.

Interesting, so you mean we are losing metrics. Should we at least add a test? It may make sense to do it in a separate PR, but that depends on the final approach. The approach does make sense; I am a bit unsure if ThreadLocal is the best/safest approach, considering the risk of introducing a memory leak. As you can see it's a bit tricky, but I am not so familiar with the DataSourceRDD code.

@peter-toth
Contributor Author

Sure, let me add a test tomorrow, and maybe someone can come up with a better idea to fix it.

@peter-toth
Contributor Author

peter-toth commented Feb 20, 2026

I extracted the metrics reporting bug / fix to SPARK-55619 / #54396 and added a new test.

@szehon-ho
Member

Thank you!

…roupPartitionsExec` operator, remove old code
ensureOrdering(child, child.outputPartitioning, o)
case _ => child
}
case (c @ GroupedPartitions(p), distribution) if p.satisfies(distribution) =>
Contributor Author

I need to revisit this part, as converting a KeyedPartitioning to grouped (building the distinct set of keys) just to check if it can satisfy a distribution doesn't make sense...

Contributor Author

This is refactored in 326915b; now we don't compute the distinct set of keys to decide if a KeyedPartitioning can satisfy a distribution.
As KeyedPartitioning is a special partitioning (not just because of this refactor PR), I elaborated on what KeyedPartitioning.satisfies() actually means.

Contributor Author

Something is wrong with that commit, let me check the test failures.

Contributor Author

Should be fixed in 7951dc6.

@peter-toth
Contributor Author

Also can we add more tests:

  • Empty grouped partitions: Plan that yields groupedPartitions.isEmpty (ie, partitioned table but no partition values inserted)

@szehon-ho, I added an empty partitioned table test in 4a904ad, but it seems we prevent returning KeyedPartitioning without partitions. This is not new behaviour; it worked the same way with KeyGroupedPartitioning before this PR. If we removed that inputPartitions.nonEmpty guard from BatchScanExec then the 2 shuffles would disappear, but no GroupPartitionsExec is added as those are not needed. Maybe the only way to get GroupPartitionsExec with empty groupedPartitions is to enable spark.sql.sources.v2.bucketing.partition.filter.enabled and use disjoint sets of keys in the join legs to get empty expectedPartitionKeys. Let me check this tomorrow.

@szehon-ho, I added a test case that yields groupedPartitions.isEmpty in 32b563f.
The commit also cleans up the SPARK-55092 (scans should not group partitions) test case, but that's kind of trivial due to moving partition grouping out of scans.

@peter-toth
Contributor Author

@cloud-fan, @szehon-ho, @viirya I wonder if we can proceed with this refactor?

Please note that this change implicitly fixes the correctness issue reported in #54378 / SPARK-55848, but we would need the tests from @naveenp2708's #54679 on master as well.

@peter-toth peter-toth closed this in a1c62dd Mar 9, 2026
@peter-toth
Contributor Author

Thank you for the review @cloud-fan, @dongjoon-hyun, @viirya, @szehon-ho and @chirag-s-db.

Merged to master (4.2.0).

peter-toth added a commit that referenced this pull request Mar 12, 2026
…minor improvements to `EnsureRequirements`

### What changes were proposed in this pull request?

This is a follow-up PR to #54330 to fix `OrderedDistribution` handling in `EnsureRequirements` so as to avoid a correctness bug. The PR contains minor improvements to `EnsureRequirements` and configuration docs updates as well.

### Why are the changes needed?

To fix a correctness bug introduced with the refactor.

### Does this PR introduce _any_ user-facing change?

Yes, but the refactor (#54330) hasn't been released.

### How was this patch tested?

Added new UT.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54727 from peter-toth/SPARK-55535-refactor-kgp-and-spj-follow-up.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
peter-toth pushed a commit that referenced this pull request Mar 16, 2026
…clustering

### What changes were proposed in this pull request?

Backport fix for SPARK-55848 to branch-4.1. This branch does not have the KeyGroupedPartitioning refactor (#54330) from master.

The fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange.

### Why are the changes needed?

SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number). The partially-clustered partitioning is incorrectly treated as satisfying `ClusteredDistribution`, so no Exchange is inserted before dedup operators.

### Does this PR introduce any user-facing change?

Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.

### How was this patch tested?

Three regression tests added to KeyGroupedPartitioningSuite with data correctness checks and plan assertions verifying shuffle Exchange presence. All 95 tests pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54751 from naveenp2708/spark-55848-fix-branch-4.1.

Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
peter-toth pushed a commit that referenced this pull request Mar 20, 2026
…ring dedup

### What changes were proposed in this pull request?

Test-only PR. Adds regression tests for SPARK-55848 (SPJ partial clustering produces incorrect results for post-join dedup operations).

Three tests added to KeyGroupedPartitioningSuite:
1. SPARK-55848: dropDuplicates after SPJ with partial clustering
2. SPARK-55848: Window dedup after SPJ with partial clustering
3. SPARK-55848: checkpointed scan with partial clustering and dedup

### Why are the changes needed?

The fix was merged via #54330, but regression tests for the correctness issue (SPARK-55848 / #54378) were not included. These tests ensure the issue does not regress.

### Does this PR introduce any user-facing change?

No. Test-only change.

### How was this patch tested?

All 73 tests in KeyGroupedPartitioningSuite pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #54714 from naveenp2708/spark-55848-tests-master.

Authored-by: Naveen Kumar Puppala <naveenp2708@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
peter-toth added a commit that referenced this pull request Mar 21, 2026
### What changes were proposed in this pull request?

This PR adds a new method to SPJ partition key `Reducer`s to return the type of a reduced partition key.

### Why are the changes needed?
After the [SPJ refactor](#54330) some Iceberg SPJ tests that join an `hours` transform partitioned table with a `days` transform partitioned table started to fail. This is because after the refactor the keys of a `KeyedPartitioning` are `InternalRowComparableWrapper`s, which include the type of the key, and when the partition keys are reduced the type of the reduced keys is inherited from their original type.

- #54330

This means that when `hours` transformed hour keys are reduced to days, the keys actually keep their `IntegerType`, while the `days` transformed keys have `DateType` in Iceberg. This type difference causes the left and right side `InternalRowComparableWrapper`s not to be considered equal despite their raw `InternalRow` key data being equal.

Before the refactor the types of the (possibly reduced) partition keys were not stored in the partitioning. When the left and right side raw keys were compared in `EnsureRequirements`, a common comparator was initialized with the type of the left side keys.
So in the Iceberg SPJ tests the `IntegerType` keys were forced to be interpreted as `DateType`, or the `DateType` keys were forced to be interpreted as `IntegerType`, depending on the join order of the tables.
The reason why this was not causing any issues is that the `PhysicalDataType` of both the `DateType` and `IntegerType` logical types is `PhysicalIntegerType`.

This PR introduces a new `resultType()` method on `Reducer` to return the correct type of the reduced keys, properly compares the left and right side reduced key types, and throws an error when they are not the same.

### Does this PR introduce _any_ user-facing change?
Yes, the reduced key types are now properly compared and incompatibilities are reported to users.

### How was this patch tested?
Added new UTs.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #54884 from peter-toth/SPARK-56046-typed-spj-reducers.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
terana pushed the two commits above (SPARK-55848 regression tests and the `Reducer.resultType()` change) to terana/spark, referencing this pull request, Mar 23, 2026.
* `KeyedPartitioning` is used in two distinct forms:
*
* 1. '''As outputPartitioning''': When used as a node's output partitioning (e.g., in
* `BatchScanExec` or `GroupPartitionsExec`), the `partitionKeys` are always in sorted order.
Member

@pan3793 pan3793 Apr 17, 2026

@peter-toth, I wonder if we can relax this assumption: allow non-sorted and ungrouped partitionKeys in outputPartitioning.

I found an interesting case while experimenting with this feature. Suppose we have a table like:

```sql
CREATE OR REPLACE TABLE orders_userid_dt_iceberg (
  order_id   BIGINT,
  user_id    BIGINT,
  amount     DECIMAL(10,2),
  dt         STRING
)
USING iceberg
PARTITIONED BY (bucket(4, user_id), dt);

INSERT INTO orders_userid_dt_iceberg VALUES
  (1001, 1, 120.50, '2025-01-01'),
  (1002, 1,  80.00, '2025-01-01'),
  (1003, 2, 200.00, '2025-01-01'),
  (1004, 2,  50.00, '2025-01-02'),
  (1005, 3,  30.00, '2025-01-02'),
  (1006, 3,  70.00, '2025-01-03'),
  (1007, 1,  60.00, '2025-01-03');

SELECT user_id, count(*)
FROM orders_userid_dt_iceberg
WHERE dt = '2025-01-01'
GROUP BY user_id;
```

For this query, `ColumnPruning` injects a `Project` on top of `Filter(RelationV2)`, and then:

```
+- == Initial Plan ==
   HashAggregate (13)
   +- Exchange (12)
      +- HashAggregate (11)
         +- Project (10)    <= KeyedPartitioning is dropped from here, see AliasAwareQueryOutputOrdering#outputPartitioning
            +- BatchScan iceberg spark_catalog.default.orders_userid_dt_iceberg (1)
```

If we make a projection for KeyedPartitioning in AliasAwareQueryOutputOrdering#outputPartitioning instead of dropping it (obviously, the projected one will break the current assumption in the docs), we can avoid an expensive shuffle. My experiment shows this works at least for such a simple query. Do you think this is the right direction?
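A rough Python sketch of the relaxation being proposed; the names (`KeyedPartitioning`, `project_partitioning`) are illustrative toy-model stand-ins, not Spark internals:

```python
# Toy model of projecting a keyed partitioning through a Project node instead
# of dropping it. Expressions are modeled as plain strings.
from dataclasses import dataclass

@dataclass
class KeyedPartitioning:
    expressions: list     # partition transform expressions, as strings here
    partition_keys: list  # one tuple of key values per input partition

def project_partitioning(part, output_attrs, alias_map):
    """Rewrite expressions through aliases and narrow each key tuple to the
    expressions that survive the projection. The result may contain duplicate
    (ungrouped) keys, which is exactly the relaxed form asked about above."""
    projected = [alias_map.get(e, e) for e in part.expressions]
    keep = [i for i, e in enumerate(projected) if e in output_attrs]
    if not keep:
        return None  # nothing survives: the partitioning really is lost
    return KeyedPartitioning(
        [projected[i] for i in keep],
        [tuple(k[i] for i in keep) for k in part.partition_keys])

scan = KeyedPartitioning(["bucket4(user_id)", "dt"],
                         [(0, "2025-01-01"), (1, "2025-01-01")])
# dt is pruned by the Project, but the bucket expression survives:
print(project_partitioning(scan, ["bucket4(user_id)"], {}))
```

Under this sketch the partitioning on `bucket4(user_id)` survives the projection, so the aggregate above could be planned without the exchange.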


@peter-toth peter-toth Apr 17, 2026


@pan3793, actually sorted order is not a hard requirement, but it increases the chance that we don't need to add GroupPartitionsExec with explicit expectedPartitionKeys into the query plan to align partitions by keys on both sides of a join.

The idea of adjusting PartitioningPreservingUnaryExecNodes to not drop but project KeyedPartitionings has already come up: #54330 (comment), but I haven't had time to work on it yet.
It is a bit tricky because the current logic requires that all KeyedPartitionings in a partitioning collection have equal sequences of partition keys (and an identical sequence is even better, as it decreases the footprint of the partitioning). I think we should maintain this invariant during projection and keep only one sequence of keys, but it should use the most granular expressions. Let me give you an example:
Let's suppose we have child.outputPartitioning as

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x, y], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
  KeyedPartitioning(expressions = [x_alias, y], partitionKeys = <identical seq>),
  KeyedPartitioning(expressions = [x, y_alias], partitionKeys = <identical seq>),
  KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq>))
```

because we have Project x, x as x_alias, y, y as y_alias somewhere in the child subplan.

Now, if we have Project x, x_alias on top, then obviously the node's outputPartitioning could be:

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x], partitionKeys = [(1), (1), (2), (2)]),
  KeyedPartitioning(expressions = [x_alias], partitionKeys = <identical seq>))
```

But if we have Project x, x_alias, y_alias, then we should drop the projection of the first KeyedPartitioning and keep only those that have more granularity:

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x, y_alias], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
  KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq>))
```

I think what we should avoid is having multiple different partitionKeys in a collection like:

```
PartitioningCollection(
  KeyedPartitioning(expressions = [x], partitionKeys = [(1), (1), (2), (2)]),
  KeyedPartitioning(expressions = [x_alias], partitionKeys = <identical seq>),
  KeyedPartitioning(expressions = [x, y_alias], partitionKeys = [(1, 1), (1, 2), (2, 1), (2, 2)]),
  KeyedPartitioning(expressions = [x_alias, y_alias], partitionKeys = <identical seq 2>))
```

because it would break the current logic and it doesn't have any benefit.
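The "keep only the most granular projected partitionings" rule above can be sketched as a small Python toy model; the function name and the string-based expression modeling are assumptions for illustration, not Spark code:

```python
# Toy model: project every KeyedPartitioning in a collection through a Project,
# then keep only the entries with the most granular (longest) expression lists,
# deduplicated, so the whole collection shares one sequence of partition keys.
def project_collection(partitionings, output_attrs):
    projected = []
    for exprs, keys in partitionings:
        keep = [i for i, e in enumerate(exprs) if e in output_attrs]
        if keep:
            projected.append(([exprs[i] for i in keep],
                              [tuple(k[i] for i in keep) for k in keys]))
    if not projected:
        return []
    max_arity = max(len(exprs) for exprs, _ in projected)
    seen, result = set(), []
    for exprs, keys in projected:
        if len(exprs) == max_arity and tuple(exprs) not in seen:
            seen.add(tuple(exprs))
            result.append((exprs, keys))
    return result

keys = [(1, 1), (1, 2), (2, 1), (2, 2)]
coll = [(["x", "y"], keys), (["x_alias", "y"], keys),
        (["x", "y_alias"], keys), (["x_alias", "y_alias"], keys)]

# Project x, x_alias, y_alias: the arity-1 projections of the first two
# entries are dropped in favor of the arity-2 ones.
print(project_collection(coll, {"x", "x_alias", "y_alias"}))
```

This reproduces both cases from the example: projecting onto {x, x_alias} yields two arity-1 partitionings with identical keys, while projecting onto {x, x_alias, y_alias} keeps only the two-column ones, never mixing key sequences of different granularity.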

Also, we should probably think through how KeyedPartitioning projection relates to the allowJoinKeysSubsetOfPartitionKeys conf.

Anyways, I can probably open a PR next week or so, but if you would like to work on this just let me know.


@peter-toth, thanks for the detailed explanation, and looking forward to your subsequent improvements on these parts!


@pan3793, #55519 should fix the problem with AliasAwareQueryOutputOrdering#outputPartitioning.
