[SPARK-55848][SQL][4.1] Fix incorrect dedup results with SPJ partial clustering#54749
Closed
naveenp2708 wants to merge 400 commits into apache:master from
Conversation
… `Set` instead of `Seq` ### What changes were proposed in this pull request? This PR aims to fix `PythonPipelineSuite` flakiness via `Set` instead of `Seq` in multiple places. ### Why are the changes needed? Currently, `PythonPipelineSuite` is flaky like the following. We should fix this flakiness. - https://github.com/apache/spark/actions/runs/19396864076/job/55498096472 ``` [info] - referencing internal datasets *** FAILED *** (821 milliseconds) [info] List(`spark_catalog`.`default`.`src`, `spark_catalog`.`default`.`c`, `spark_catalog`.`default`.`a`) did not equal List(`spark_catalog`.`default`.`src`, `spark_catalog`.`default`.`a`, `spark_catalog`.`default`.`c`) (PythonPipelineSuite.scala:366) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53080 from dongjoon-hyun/SPARK-XXX. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 13545a4) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
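The root cause, an order-sensitive comparison of results whose order is not guaranteed, can be sketched in Python (illustrative table names, not the suite's actual assertion):

```python
# Datasets may be reported in any order, so a list comparison is flaky
expected = ["src", "a", "c"]
actual = ["src", "c", "a"]   # same elements, different order

assert expected != actual              # Seq/List equality is order-sensitive
assert set(expected) == set(actual)    # Set equality ignores ordering
```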
… minutes ### What changes were proposed in this pull request? This PR aims to limit the Maven GitHub Action job timeout to 150 minutes. ### Why are the changes needed? Currently, Maven CI runs 6 hours which is the default timeout. In general, this job should pass in 150 minutes. <img width="641" height="444" alt="Screenshot 2025-11-15 at 18 44 28" src="https://github.com/user-attachments/assets/16e3bfce-1eab-4671-8ec1-cec209c4f0e3" /> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53081 from dongjoon-hyun/SPARK-54370. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 0a42f55) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…en is disabled ### What changes were proposed in this pull request? BHJ LeftAnti was missing the `numOutputRows` metric update for the case where `hashed = EmptyHashedRelation`. <img width="1754" height="1148" alt="image" src="https://github.com/user-attachments/assets/a71e4546-578e-4e4d-9434-9287074ebe75" /> ### Why are the changes needed? Fix missing SQL metrics for BHJ ### Does this PR introduce _any_ user-facing change? Yes, BHJ LeftAnti will update `numOutputRows` when `hashed = EmptyHashedRelation` ### How was this patch tested? Existing UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#53014 from AngersZhuuuu/SPARK-54319. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 3757091) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…eSuite` to skip tests when PyConnect dependencies are not available
### What changes were proposed in this pull request?
SPARK-54020 added some new test cases in `PythonPipelineSuite`. This PR incorporates `assume(PythonTestDepsChecker.isConnectDepsAvailable)` for these test cases to ensure that the tests are skipped rather than failing when PyConnect dependencies are missing.
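The `assume`-based guard turns a hard failure into a skip. A minimal Python analogue with `unittest` (hypothetical names; the real suite uses ScalaTest's `assume`):

```python
import unittest

# Stand-in for PythonTestDepsChecker.isConnectDepsAvailable
DEPS_AVAILABLE = False

class PipelineSuiteSketch(unittest.TestCase):
    def test_referencing_internal_datasets(self):
        # Like ScalaTest's assume(): skip instead of fail when deps are missing
        if not DEPS_AVAILABLE:
            self.skipTest("PyConnect dependencies not available")
        self.fail("would run the real pipeline test here")

suite = unittest.TestLoader().loadTestsFromTestCase(PipelineSuiteSketch)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), len(result.failures))  # prints: 1 0
```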
### Why are the changes needed?
Enhance the robustness of test cases. Prior to this, when executing `build/sbt "connect/testOnly org.apache.spark.sql.connect.pipelines.PythonPipelineSuite"`:
```
[info] - reading internal datasets outside query function that trigger eager analysis or execution will fail (spark.sql("SELECT * FROM src")) *** FAILED *** (4 milliseconds)
[info] "org.apache.spark.sql.connect.PythonTestDepsChecker.isConnectDepsAvailable was false" did not contain "TABLE_OR_VIEW_NOT_FOUND" (PythonPipelineSuite.scala:546)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info] at org.apache.spark.sql.connect.pipelines.PythonPipelineSuite.$anonfun$new$43(PythonPipelineSuite.scala:546)
[info] at org.apache.spark.sql.connect.pipelines.PythonPipelineSuite.$anonfun$new$43$adapted(PythonPipelineSuite.scala:532)
[info] at org.apache.spark.SparkFunSuite.$anonfun$gridTest$2(SparkFunSuite.scala:241)
[info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info] at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
[info] at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
[info] at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
[info] at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
...
[info] *** 24 TESTS FAILED ***
[error] Failed tests:
[error] org.apache.spark.sql.connect.pipelines.PythonPipelineSuite
[error] (connect / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GitHub Actions
- Manually verify that the relevant tests will no longer fail when PyConnect dependencies are missing.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#53088 from LuciferYang/SPARK-54375.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 722bcc0)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request? Marks all declarative pipelines configuration options as internal, except for `spark.sql.pipelines.maxFlowRetryAttempts`. ### Why are the changes needed? When implementing Declarative Pipelines, we made several quantities configurable. However, documented configurations are essentially public APIs, and it's too early to commit yet to supporting all of these. We should mark most of them internal except where we think users will really need them. ### Does this PR introduce _any_ user-facing change? Yes, to unreleased software. ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? Closes apache#53090 from sryza/internal-configs. Authored-by: Sandy Ryza <sandy.ryza@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 1db267e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Kills the worker if flush fails in `daemon.py`.
- Spark conf: `spark.python.daemon.killWorkerOnFlushFailure` (default `true`)
- SQL conf: `spark.sql.execution.pyspark.udf.daemonKillWorkerOnFlushFailure` (fallback to the above)
Before the worker dies, it reuses the `faulthandler` feature to record a thread dump, which will appear in the error message if `faulthandler` is enabled.
```
WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 8) (127.0.0.1 executor 1): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Current thread 0x00000001f0796140 (most recent call first):
File "/.../python/pyspark/daemon.py", line 95 in worker
File "/.../python/pyspark/daemon.py", line 228 in manager
File "/.../python/pyspark/daemon.py", line 253 in <module>
File "<frozen runpy>", line 88 in _run_code
File "<frozen runpy>", line 198 in _run_module_as_main
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:679)
...
```
Even when `faulthandler` is not enabled, the error will appear in the executor's `stderr` file.
```
Traceback (most recent call last):
File "/.../python/pyspark/daemon.py", line 228, in manager
code = worker(sock, authenticated)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/.../python/pyspark/daemon.py", line 88, in worker
raise Exception("test")
Exception: test
```
When this is disabled, the behavior is the same as before but with a log.
### Why are the changes needed?
Currently, an exception caused by an `outfile.flush()` failure in `daemon.py` is ignored, but if the last command from `worker_main` has not been flushed yet, a UDF can get stuck in Java waiting for the response from the Python worker.
The worker should just die and let Spark retry the task.
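A minimal sketch of the intended behavior (illustrative names, not the actual `daemon.py` code): on flush failure the worker exits, so the JVM sees a crashed worker and retries the task instead of blocking forever on a response that will never arrive.

```python
import os
import sys

def flush_or_exit(outfile, kill_on_failure=True):
    """Illustrative sketch: exit the worker if the final flush fails."""
    try:
        outfile.flush()
    except Exception:
        if kill_on_failure:
            # Die immediately; Spark treats the worker as crashed and retries
            os._exit(1)
        # Old behavior, but with a log instead of a silent ignore
        print("worker flush failed; keeping worker alive", file=sys.stderr)

class BrokenPipe:
    def flush(self):
        raise OSError("broken pipe")

flush_or_exit(BrokenPipe(), kill_on_failure=False)  # logs, does not raise
```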
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually.
<details>
<summary>Test with the patch to emulate the case</summary>
```patch
% git diff
diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 54c9507..e107216d769 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -84,6 +84,8 @@ def worker(sock, authenticated):
exit_code = compute_real_exit_code(exc.code)
finally:
try:
+ if worker_main.__globals__.get("TEST", False):
+ raise Exception("test")
outfile.flush()
except Exception:
faulthandler_log_path = os.environ.get("PYTHON_FAULTHANDLER_DIR", None)
diff --git a/python/pyspark/worker.py b/python/pyspark/worker.py
index 6e34b04..ff210f4fd97 100644
--- a/python/pyspark/worker.py
+++ b/python/pyspark/worker.py
@@ -3413,7 +3413,14 @@ def main(infile, outfile):
# check end of stream
if read_int(infile) == SpecialLengths.END_OF_STREAM:
- write_int(SpecialLengths.END_OF_STREAM, outfile)
+ import random
+
+ if random.random() < 0.1:
+ # emulate the last command is not flushed yet
+ global TEST
+ TEST = True
+ else:
+ write_int(SpecialLengths.END_OF_STREAM, outfile)
else:
# write a different value to tell JVM to not reuse this worker
write_int(SpecialLengths.END_OF_DATA_SECTION, outfile)
@@ -3423,6 +3430,9 @@ def main(infile, outfile):
faulthandler.cancel_dump_traceback_later()
+TEST = False
+
+
 if __name__ == "__main__":
# Read information about how to connect back to the JVM from the environment.
conn_info = os.environ.get(
```
</details>
With just `pass` (before this change), it gets stuck; after this change, Spark retries the task.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#53055 from ueshin/issues/SPARK-54344/daemon_flush.
Lead-authored-by: Takuya Ueshin <ueshin@databricks.com>
Co-authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ed23cc3)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request? This PR fixes the `+`, `updated`, and `removed` methods of `AttributeMap` to correctly hash with `Attribute.ExprId` instead of `Attribute` as a whole. ### Why are the changes needed? This change fixes non-determinism with the `AttributeMap` when an entry is being added to the `AttributeMap` with `+` such that `attr1 != attr2` but `attr1.exprId = attr2.exprId`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a new test suite. ### Was this patch authored or co-authored using generative AI tooling? Tests were generated by Claude Code on Sonnet 4.5. Closes apache#53044 from kelvinjian-db/fix-attributemap. Authored-by: Kelvin Jiang <kelvin.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 78d1d52) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ng schema changes ### What changes were proposed in this pull request? Follow-up of apache#52876, add tests for cached temp view detecting schema changes ### Why are the changes needed? There is no test coverage after comment apache#52876 (comment) is addressed. This PR is to add a test case for it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New test case ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#53103 from gengliangwang/SPARK-53924-test. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit fd683ce) Signed-off-by: Gengliang Wang <gengliang@apache.org>
…ble comment
### What changes were proposed in this pull request?
This PR fixes a bug where `COMMENT ON TABLE table_name IS NULL` was not properly removing the table comment.
### Why are the changes needed?
The syntax `COMMENT ON TABLE table_name IS NULL` should remove the table comment. However, the previous implementation was setting the comment to null rather than removing the property entirely.
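The distinction between storing a null value and removing the key can be sketched with a plain dict standing in for the table's property map (illustrative, not the actual catalog API):

```python
# A dict standing in for table properties
props = {"comment": "old comment"}

# Buggy behavior: the property survives, just with a null value
props["comment"] = None
assert "comment" in props

# Fixed behavior: COMMENT ON TABLE ... IS NULL removes the property entirely
props.pop("comment", None)
assert "comment" not in props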
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Enhanced the test case `test("COMMENT ON TABLE")` in `DataSourceV2SQLSuite` to verify:
* Comment can be set and is stored correctly
* Comment is completely removed when set to NULL (property no longer exists)
* Literal string "NULL" can still be set as a comment value
* Works for both session catalog and V2 catalogs
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.5
Closes apache#53091 from ganeshashree/SPARK-54377.
Authored-by: Ganesha S <ganesha.s@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e8f0a67)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…and union
### What changes were proposed in this pull request?
Ensure `maxRows` and `maxRowsPerPartition` are calculated at most once.
### Why are the changes needed?
Improve performance, especially when there are dozens of joins and unions.
Before this PR, the number of `maxRows` evaluations for a join/union grew exponentially with the number of joins/unions.
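The fix amounts to memoizing the property (a `lazy val` in Scala); a Python sketch with `functools.cached_property` and hypothetical plan-node classes:

```python
from functools import cached_property

class Leaf:
    max_rows = 10

class Join:
    """Hypothetical plan node; max_rows is computed at most once per node."""
    computations = 0  # counts how many times max_rows is actually evaluated

    def __init__(self, left, right):
        self.left, self.right = left, right

    @cached_property
    def max_rows(self):
        # Without caching, every parent recomputes the whole subtree,
        # so the cost grows exponentially with join depth.
        Join.computations += 1
        return self.left.max_rows * self.right.max_rows

plan = Join(Join(Leaf(), Leaf()), Leaf())
_ = plan.max_rows
_ = plan.max_rows           # second access hits the cache
print(Join.computations)    # one computation per Join node: 2
```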
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Local test: a 28-table join took 36s before this PR and 4s after; a 29-table join took 67s before and 5s after.
```
Seq(1).toDF("a").write.mode("overwrite").parquet("tmp/t1")
spark.read.parquet("tmp/t1").createOrReplaceTempView("t")
val t1 = System.currentTimeMillis()
spark.sql(
"""
|select a,count(1) from (
|select t1.a from (select distinct a from t) t1
|join t t2 on t1.a=t2.a
|join t t3 on t1.a=t3.a
|join t t4 on t1.a=t4.a
|join t t5 on t1.a=t5.a
|join t t6 on t1.a=t6.a
|join t t7 on t1.a=t7.a
|join t t8 on t1.a=t8.a
|join t t9 on t1.a=t9.a
|join t t10 on t1.a=t10.a
|join t t11 on t1.a=t11.a
|join t t12 on t1.a=t12.a
|join t t13 on t1.a=t13.a
|join t t14 on t1.a=t14.a
|join t t15 on t1.a=t15.a
|join t t16 on t1.a=t16.a
|join t t17 on t1.a=t17.a
|join t t18 on t1.a=t18.a
|join t t19 on t1.a=t19.a
|join t t20 on t1.a=t20.a
|join t t21 on t1.a=t21.a
|join t t22 on t1.a=t22.a
|join t t23 on t1.a=t23.a
|join t t24 on t1.a=t24.a
|join t t25 on t1.a=t25.a
|join t t26 on t1.a=t26.a
|join t t27 on t1.a=t27.a
|join t t28 on t1.a=t28.a
|) group by a
|""".stripMargin).show
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#51451 from zml1206/SPARK-52767.
Authored-by: zml1206 <zhuml1206@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit aa387f3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes recaching of DSv2 tables.
### Why are the changes needed?
These changes are needed to restore correct caching behavior for DSv2 tables if a connector doesn't reuse table instances. Currently, the following use case is broken:
```
// create and populate table
sql("CREATE TABLE testcat.ns.tbl (id bigint, data string) USING foo")
Seq((1L, "a"), (2L, "b")).toDF("id", "data").write.insertInto("testcat.ns.tbl")
// cache table
val df1 = spark.table("testcat.ns.tbl")
df1.cache()
df1.show() // 1 -> a, 2 -> b
// insert more data, refreshing cache entry
Seq((3L, "c"), (4L, "d")).toDF("id", "data").write.insertInto("testcat.ns.tbl")
// query
val df2 = spark.table("testcat.ns.tbl")
df2.show() // CACHE MISS BEFORE CHANGE!
```
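The underlying requirement is that cache keys compare by table identity rather than object identity; a sketch with hypothetical classes (not Spark's actual `CacheManager`):

```python
class Table:
    """A connector may return a fresh instance for every table load."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, Table) and self.name == other.name
    def __hash__(self):
        return hash(self.name)

cache = {}
cache[Table("testcat.ns.tbl")] = "cached rows"

# A later lookup gets a different instance of the same table; equality
# by name (not object identity) keeps this a cache hit.
assert cache.get(Table("testcat.ns.tbl")) == "cached rows"
```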
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + new tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#53109 from aokolnychyi/spark-54387.
Authored-by: Anton Okolnychyi <aokolnychyi@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit dce992b)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…uld be 1-based ### What changes were proposed in this pull request? The SparkGetColumnsOperation is mainly used for the JDBC driver, while JDBC uses 1-based ordinal/column-index instead of 0-based. This is also documented in the Hive API. https://github.com/apache/spark/blob/551b922a53acfdfeb2c065d5dedf35cb8cd30e1d/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/operation/GetColumnsOperation.java#L94-L95 Note that `GetColumnsOperation`, which was originally copied from Hive, has a correct implementation; the issue only exists in `SparkGetColumnsOperation`. For safety, a config `spark.sql.legacy.hive.thriftServer.useZeroBasedColumnOrdinalPosition` is added to allow the user to switch back to the previous behavior. ### Why are the changes needed? The SparkGetColumnsOperation is mainly used by JDBC [java.sql.DatabaseMetaData#getColumns](https://docs.oracle.com/en/java/javase/17/docs/api/java.sql/java/sql/DatabaseMetaData.html#getColumns(java.lang.String,java.lang.String,java.lang.String,java.lang.String)); this change makes it satisfy the JDBC API specification. ### Does this PR introduce _any_ user-facing change? Yes, see the above section. ### How was this patch tested? UTs are modified. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53062 from pan3793/SPARK-54350. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 05bc5d4) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
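The off-by-one is the classic 0-based versus 1-based indexing issue: JDBC's `ORDINAL_POSITION` starts at 1, which in Python terms is `enumerate(..., start=1)` (illustrative column names):

```python
columns = ["id", "name", "ts"]

# 0-based (previous Spark behavior) vs 1-based (JDBC specification)
zero_based = list(enumerate(columns))
one_based = list(enumerate(columns, start=1))

assert zero_based[0] == (0, "id")
assert one_based[0] == (1, "id")   # JDBC expects the first column at position 1
```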
…g info ### What changes were proposed in this pull request? This PR extends current canonicalization function for DataSourceV2ScanRelation to normalize the keyGroupedPartitioning and ordering field. Therefore it can apply to partition/ordering-aware data sources. ### Why are the changes needed? In order to apply canonicalization to partition/ordering-aware data sources. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53105 from yhuang-db/SPARK-54163-canonicalization-normalization. Authored-by: yhuang-db <itisyuchuan@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 1012a5f) Signed-off-by: Gengliang Wang <gengliang@apache.org>
…tegration extension ### What changes were proposed in this pull request? PySpark integrates with `faulthandler` in many places, using the same functionality to dump stack traces/record thread dumps. To reduce the complexity of the integration and make it easier to extend, it makes sense to refactor the code so that all of these places share the same implementation. ### Why are the changes needed? Improves developer experience. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing unit tests ### Was this patch authored or co-authored using generative AI tooling? Closes apache#53016 from antban/simplify-faulthandler-integration. Authored-by: antban <dmitry.sorokin@gmail.com> Signed-off-by: Takuya Ueshin <ueshin@databricks.com> (cherry picked from commit 6227fba) Signed-off-by: Takuya Ueshin <ueshin@databricks.com>
…ndler integration extension" This reverts commit 6ee5a16.
…rform a default check for `PythonTestDepsChecker.isConnectDepsAvailable` ### What changes were proposed in this pull request? This pr aims to make the test cases in `PythonPipelineSuite` perform a default check for `PythonTestDepsChecker.isConnectDepsAvailable`. ### Why are the changes needed? Simplify the dependency checks for Python modules in test cases within `PythonPipelineSuite`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#53106 from LuciferYang/refactor-PythonPipelineSuite. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 1aa8d5a) Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request? Introduce a new SQL config for controlling the geospatial feature: ``` spark.sql.geospatial.enabled ``` The default value is `false`, and enabled only in testing. ### Why are the changes needed? Guard the geospatial feature until it's fully finished. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added appropriate unit tests to confirm that the config is effective: - `STExpressionsSuite` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53009 from uros-db/geo-config. Authored-by: Uros Bojanic <uros.bojanic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit d299684) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…r consistency ### What changes were proposed in this pull request? This PR refactors the test "corrupted view metadata: mismatch between viewQueryColumnNames and schema" in `SessionCatalogSuite.scala` to use the `withBasicCatalog` helper method instead of manually creating a `SessionCatalog` instance with `newBasicCatalog()`. ### Why are the changes needed? **Consistency**: All other tests in `SessionCatalogSuite` use the `withBasicCatalog` helper pattern ### Does this PR introduce _any_ user-facing change? No. This is a test-only refactoring with no functional changes. ### How was this patch tested? - Verified no linter errors in the modified file - The test logic remains identical, only the catalog initialization pattern changed - Existing test validates the same corrupted view metadata error message ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Sonnet 4.5 Closes apache#53126 from ganeshashree/SPARK-54030-2. Authored-by: Ganesha S <ganesha.s@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 61668ad) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… in PySpark ### What changes were proposed in this pull request? Make a few updates to ST functions in PySpark: - Remove alias `result`. - Separate examples. ### Why are the changes needed? Improve clarity. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests suffice. ``` python/run-tests --testnames pyspark.sql.functions.builtin ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53125 from uros-db/geo-pyspark-st-examples. Authored-by: Uros Bojanic <uros.bojanic@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit ee0f692) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…Scala and PySpark ### What changes were proposed in this pull request? Implement the `st_setsrid` function in Scala and PySpark API. ### Why are the changes needed? Expand API support for the `ST_SetSrid` expression. ### Does this PR introduce _any_ user-facing change? Yes, the new function is now available in Scala and PySpark API. ### How was this patch tested? Added appropriate Scala function unit tests: - `STFunctionsSuite` Added appropriate PySpark function unit tests: - `test_functions` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53117 from uros-db/geo-ST_SetSrid-scala. Authored-by: Uros Bojanic <uros.bojanic@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 158a132) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… key positions can be fully pushed down ### What changes were proposed in this pull request? When a KeyGroupedShuffleSpec is used to shuffle another child of a JOIN, we must be able to push down JOIN keys or partition values to be able to ensure that both children have matching partitioning. If one child reports a KeyGroupedPartitioning but we can't push down these values (for example, if the child was a key-grouped scan that was checkpointed), then this information cannot be pushed down to the child scan and we should avoid using this shuffle spec to shuffle other children. ### Why are the changes needed? Prevents potential correctness issue when key-grouped partitioning is used on a checkpointed RDD. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? See test changes. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53132 from chirag-s-db/checkpoint-pushdown-4.1. Authored-by: Chirag Singh <chirag.singh@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ubernetes docs ### What changes were proposed in this pull request? The Spark documentation references the use of Pod Security Policies to restrict the privileges of Spark pods on Kubernetes. PodSecurityPolicy was [deprecated](https://kubernetes.io/blog/2021/04/08/kubernetes-1-21-release-announcement/#podsecuritypolicy-deprecation) in Kubernetes v1.21, and removed from Kubernetes in v1.25. The documentation should be updated to remove this reference and to direct users to the currently maintained Pod Security Admission Controller docs instead. ### Why are the changes needed? Maintenance of the documentation to point to relevant and currently supported security features in Kubernetes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Documentation changes only, no function changes to Spark. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53130 from Jimvin/master. Authored-by: Jim Halfpenny <jim.halfpenny@stackable.tech> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit f328b5e) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ctly ### What changes were proposed in this pull request? This PR aims to fix `release-build.sh` to detect `REPO_ID` correctly. ### Why are the changes needed? Previously, we use `grep -A 5` to find `description` tag. https://github.com/apache/spark/blob/f328b5ef14c9ef4e2d04ab69c0578ab461388715/dev/create-release/release-build.sh#L501-L502 However, it's insufficient as of now. According to Today's result, we need to grep 13 lines like the following. ``` $ curl --retry 10 --retry-all-errors -s -u "$ASF_USERNAME:$ASF_PASSWORD" https://repository.apache.org/service/local/staging/profile_repositories | grep -A 5 "<repositoryId>orgapachespark-" <repositoryId>orgapachespark-1505</repositoryId> <type>closed</type> <policy>release</policy> <userId>dongjoon</userId> <userAgent>curl/7.81.0</userAgent> <ipAddress>35.94.112.49</ipAddress> ``` ``` $ curl --retry 10 --retry-all-errors -s -u "$ASF_USERNAME:$ASF_PASSWORD" https://repository.apache.org/service/local/staging/profile_repositories | grep -A 13 "<repositoryId>orgapachespark-" <repositoryId>orgapachespark-1505</repositoryId> <type>closed</type> <policy>release</policy> <userId>dongjoon</userId> <userAgent>curl/7.81.0</userAgent> <ipAddress>35.94.112.49</ipAddress> <repositoryURI>https://repository.apache.org/content/repositories/orgapachespark-1505</repositoryURI> <created>2025-11-16T20:23:35.413Z</created> <createdDate>Sun Nov 16 20:23:35 UTC 2025</createdDate> <createdTimestamp>1763324615413</createdTimestamp> <updated>2025-11-16T21:02:45.041Z</updated> <updatedDate>Sun Nov 16 21:02:45 UTC 2025</updatedDate> <updatedTimestamp>1763326965041</updatedTimestamp> <description>Apache Spark 4.1.0-preview4 (commit c125aea)</description> ``` ### Does this PR introduce _any_ user-facing change? No behavior change. ### How was this patch tested? Manually test. 
``` $ curl --retry 10 --retry-all-errors -s -u "$ASF_USERNAME:$ASF_PASSWORD" https://repository.apache.org/service/local/staging/profile_repositories | grep -A 13 "<repositoryId>orgapachespark-" | awk '/<repositoryId>/ { id = $0 } /<description>/ && $0 ~ /Apache Spark '"$RELEASE_VERSION"'/ { print id }' <repositoryId>orgapachespark-1505</repositoryId> ``` After merging this, I'm going to test with `Apache Spark 4.1.0-preview4`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53136 from dongjoon-hyun/SPARK-54426. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit be281fb) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… method for SDP checkpoint path construction ### What changes were proposed in this pull request? Followups from apache#53070 to improve code clarity. ### Why are the changes needed? Make sure the code for constructing SDP checkpoint directory paths is clear. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? Closes apache#53089 from sryza/collide-followups. Authored-by: Sandy Ryza <sandy.ryza@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 10d1b4c) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…at/WorkDirClean in standalone worker ### What changes were proposed in this pull request? Currently, the [worker](https://github.com/apache/spark/blob/87b3b94232436528f88c9a7aa7ee70758b85a33a/core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala#L495) schedules tasks forwarding `SendHeartbeat` and `WorkDirCleanup` during `handleRegisterResponse`. Since worker registration can happen multiple times in case of heartbeat timeout/disconnection from the master, these tasks could be scheduled multiple times. To fix the issue: - Add `heartbeatTask` and `workDirCleanupTask` fields to the worker to track whether these tasks have been scheduled. - `heartbeatTask` and `workDirCleanupTask` are initialized after the first registration, and scheduling is skipped in later registrations. - Cancel the tasks and reset `heartbeatTask` and `workDirCleanupTask` when the worker stops. ### Why are the changes needed? Fix the issue of repeatedly scheduling SendHeartbeat/WorkDirCleanup tasks after worker registration. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT added ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#53054 from ivoson/duplicate-worker-heartbeat. Authored-by: Tengfei Huang <tengfei.h@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit d51b433) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
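The guard pattern (initialize the recurring task only if it has not been scheduled yet, and cancel/reset it on stop) can be sketched with hypothetical names:

```python
class Worker:
    """Sketch: schedule the heartbeat once across repeated registrations."""

    def __init__(self):
        self.heartbeat_task = None
        self.schedule_count = 0

    def handle_register_response(self):
        # Re-registration after a disconnect must not schedule a second task
        if self.heartbeat_task is None:
            self.heartbeat_task = "scheduled-future"  # stand-in for the real task
            self.schedule_count += 1

    def stop(self):
        # Cancel and reset so a restarted worker can schedule again
        self.heartbeat_task = None

w = Worker()
w.handle_register_response()
w.handle_register_response()  # heartbeat timeout -> re-register
print(w.schedule_count)       # scheduled only once: 1
```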
…R clause
### What changes were proposed in this pull request?
Fix error message for scalar subquery in IDENTIFIER clause
### Why are the changes needed?
Previously, for a query like `SELECT * FROM IDENTIFIER((SELECT 'tableName'))` we would throw an `INTERNAL_ERROR` because we never call `IdentifierResolution.evalIdentifierExpr` to check if the identifier expression is foldable. The call never happens because, in order to call `eval`, the expression has to be resolved first, but the subquery has an `UnresolvedAlias('tableName')` that never gets resolved.
In `CheckAnalysis`, if the identifier expression is not resolved, I propose that we throw `NOT_A_CONSTANT_STRING.NOT_CONSTANT`, because ALL identifier expressions need to be constant, and if the expression is still unresolved by the end of analysis, that means it wasn't constant.
### Does this PR introduce _any_ user-facing change?
Error message improvement.
### How was this patch tested?
Added new test cases to golden file.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#53133 from mihailotim-db/mihailotim-db/fix_error_message_ident.
Authored-by: Mihailo Timotic <mihailo.timotic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3dcb661)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ting directory ### What changes were proposed in this pull request? If the name provided to `spark-pipelines init` matches an existing directory, raises an error instead of overwriting the existing directory's contents. ### Why are the changes needed? Help users avoid accidentally overwriting their code. ### Does this PR introduce _any_ user-facing change? Yes, to an unreleased version. ### How was this patch tested? Added unit test for the cases where init is invoked with an existing directory. ### Was this patch authored or co-authored using generative AI tooling? Closes apache#53140 from sryza/initoverwrite. Authored-by: Sandy Ryza <sandy.ryza@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6811eab) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ta checks ### What changes were proposed in this pull request? This PR fixes error formatting for recently added incompatible table metadata checks. ### Why are the changes needed? These changes are needed to avoid unnecessary empty lines. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53115 from aokolnychyi/spark-53924-54157-followup. Authored-by: Anton Okolnychyi <aokolnychyi@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 8cab074) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Fixes typos and cleans up code formatting (that could've been automated, but was done manually this time) https://issues.apache.org/jira/browse/SPARK-54418 ### Why are the changes needed? Cleaner code with fewer typos ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Built locally. Waiting for the official build to finish once the PR is created. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52538 from jaceklaskowski/sdp-typo-hunting-formatting. Authored-by: Jacek Laskowski <jacek@japila.pl> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 4ac02ab) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… Table ### What changes were proposed in this pull request? This PR renames `currentVersion` to `version` in DSv2 `Table`. This method is supposed to be a simple getter and must not trigger a refresh of the underlying table state. ### Why are the changes needed? These changes are needed to avoid ambiguity in the about to be released API in 4.1. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53118 from aokolnychyi/spark-51771-followup. Authored-by: Anton Okolnychyi <aokolnychyi@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 6578b9b) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…tDf.cache() ### What changes were proposed in this pull request? Backport of apache#53572 to the 4.1 branch. Follow-up of apache#51032. That PR changed V2WriteCommand not to execute eagerly on df.cache(). However, there are a bunch of other commands that still do. ``` val df = sql("CREATE TABLE...") df.cache() // executes again, fails with TableAlreadyExistsException ``` This patch skips the CacheManager for all Command plans, because these are already executed eagerly when sql("COMMAND") is first called. ``` val df = sql("SHOW TABLES.") sql("CREATE TABLE foo") df.cache() // executes again and df now includes foo ``` ### Why are the changes needed? To prevent a command with side effects from being executed again if a user runs df.cache on its result. Many such re-executions are dangerous because they run a second time without the user expecting it (df.cache triggering another action on the table). ### Does this PR introduce _any_ user-facing change? If the user created a resultDf from a command and then ran resultDf.cache, it used to re-run the command. Now it is a no-op. Most of the time this is beneficial, as re-running the command would result in an error or, worse, data corruption. However, in some small cases, like SHOW TABLES or SHOW NAMESPACES, it affects the contents of resultDf, which will no longer refresh when calling resultDf.cache(). Note: in most cases there is no visible user-facing change, because some command plans, like the DescribeTableExec node, already hold an in-memory reference to the Table object and keep the old result despite repeated execution. However, SHOW XXX command plans do not cache in-memory results, so they see some effect. ### How was this patch tested? Existing unit tests, plus new unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#54064 from szehon-ho/4.1. Authored-by: Szehon Ho <szehon.apache@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
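The skip rule above can be sketched with a toy cache manager. All names and the dict-based plan representation here are hypothetical stand-ins, not Spark's internals:

```python
def cache_query(plan, cache_manager):
    """Illustrative sketch: df.cache() on the result of an
    eagerly-executed command becomes a no-op instead of re-running
    the command (and its side effects)."""
    if plan.get("is_command"):
        # The command already ran when sql("...") was first called;
        # caching its result would execute it a second time.
        return "no-op"
    cache_manager.append(plan)
    return "cached"
```

Queries still reach the cache manager; only command plans bypass it.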
…programming guide ### What changes were proposed in this pull request? Documents parameters for the `spark-pipelines` CLI, in the declarative pipelines programming guide ### Why are the changes needed? Complete documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? ### Was this patch authored or co-authored using generative AI tooling? Closes apache#54035 from sryza/refresh-selection-docs. Authored-by: Sandy Ryza <sandy.ryza@databricks.com> Signed-off-by: Sandy Ryza <sandy.ryza@databricks.com> (cherry picked from commit 9788c52) Signed-off-by: Sandy Ryza <sandy.ryza@databricks.com>
…lformed DOT label
### What changes were proposed in this pull request?
This PR fixes a typo in `RDDOperationGraph.scala` where an extra `}` character was accidentally added to the DOT node label string, causing the DAG visualization to fail rendering in both Jobs and Stages pages of the Spark Web UI.
**Before (broken):**
```scala
s"""${node.id} [id="node_${node.id}" labelType="html" label="$label}"]"""
```
**After (fixed):**
```scala
s"""${node.id} [id="node_${node.id}" labelType="html" label="$label"]"""
```
### Why are the changes needed?
The DAG visualization in the Spark Web UI is completely broken. Instead of rendering the graph, it shows the raw DOT source text. This affects both Jobs and Stages views across all browsers (Firefox, Edge, Chrome) and both Spark Shell and PySpark.
This regression was introduced in SPARK-45274 (PR apache#43053).
### Does this PR introduce _any_ user-facing change?
Yes. The DAG visualization in the Spark Web UI will render correctly again.
**Before:** Raw DOT graph source displayed instead of visualization
**After:** Properly rendered DAG graph
### How was this patch tested?
- Existing unit tests pass: `RDDOperationGraphSuite`, `UISeleniumSuite` (DAG visualization test)
- Manual verification with `sc.range(0, 10, 1, 4).count()` in spark-shell
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54171 from yaooqinn/SPARK-55387.
Authored-by: Kent Yao <kentyao@microsoft.com>
Signed-off-by: Kent Yao <kentyao@microsoft.com>
(cherry picked from commit 4e86ae4)
Signed-off-by: Kent Yao <kentyao@microsoft.com>
…b Action jobs ### What changes were proposed in this pull request? This PR aims to set `strategy.max-parallel` to 20 for all GitHub Action jobs. ### Why are the changes needed? ASF Infra team directly requested us via email in the (private@spark) mailing list. - https://lists.apache.org/thread/voqz9tp3m8wj00lp0y81n25qgvc90f3q Here is the `GitHub Action` syntax. - https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax#jobsjob_idstrategymax-parallel <img width="762" height="112" alt="Screenshot 2026-02-07 at 21 09 02" src="https://github.com/user-attachments/assets/770d1b81-390b-49d1-8518-70cb20eb93af" /> ### Does this PR introduce _any_ user-facing change? No Apache Spark behavior change. - Technically, for the PR builder, we use more than 20 jobs on the PR contributor's GitHub repo. This will now be limited. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: `Opus 4.5` on `Claude Code` Closes apache#54204 from dongjoon-hyun/SPARK-55423. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit de34528) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
This is a followup to apache#52402 that addresses backward compatibility concerns: 1. Keep the original `implicit SQLContext` factory methods for full backward compatibility 2. Add new overloads with explicit `SparkSession` parameter for new code 3. Fix `TestGraphRegistrationContext` to provide implicit `spark` and `sqlContext` to avoid name shadowing issues in nested classes 4. Remove redundant `implicit val sparkSession` declarations from pipeline tests that are no longer needed with the fix PR apache#52402 changed the MemoryStream API to use `implicit SparkSession` which broke backward compatibility for code that only has `implicit SQLContext` available. This followup ensures: - Old code continues to work without modification - New code can use SparkSession with explicit parameters - Internal implementation uses SparkSession (modernization from apache#52402) No. This maintains full backward compatibility while adding new API options. Existing tests pass. The API changes are additive. Yes Made with [Cursor](https://cursor.com) Closes apache#54108 from cloud-fan/memory-stream-compat. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit db28b99) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…lize step ### What changes were proposed in this pull request? ``` + OLD_VERSION=4.2.0-preview1 + [[ -n 4.2.0-preview1 ]] + echo 'Removing old version: spark-4.2.0-preview1' Removing old version: spark-4.2.0-preview1 + svn rm https://dist.apache.org/repos/dist/release/spark/spark-4.2.0-preview1 -m 'Remove older 4.2 release after 4.2.0-preview2' svn: E215004: Authentication failed and interactive prompting is disabled; see the --force-interactive option svn: E215004: No more credentials or we tried too many times. Authentication failed ``` We need to set the username and password. ### Why are the changes needed? Add the username and password to the `svn rm` call at the finalize step. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#54217 from HyukjinKwon/fix-release-script. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit f6031fe) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…when join keys are less than cluster keys Backport apache#54182 to branch-4.1 ### What changes were proposed in this pull request? Fix a `java.lang.ArrayIndexOutOfBoundsException` when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true`, by correcting the `expression` passed to `KeyGroupedPartitioning#project` (it should pass the full partition expression instead of the projected one). Also, fix a test code issue: change the calculation result of `BucketTransform` defined in `InMemoryBaseTable.scala` to match `BucketFunctions` defined in `transformFunctions.scala` (thanks peter-toth for pointing this out!) ### Why are the changes needed? It's a bug fix. ### Does this PR introduce _any_ user-facing change? Some queries that failed when `spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled=true` now run normally. ### How was this patch tested? A new UT is added; previously it failed with `ArrayIndexOutOfBoundsException`, now it passes. ``` $ build/sbt "sql/testOnly *KeyGroupedPartitioningSuite -- -z SPARK=55411" ... [info] - bug *** FAILED *** (1 second, 884 milliseconds) [info] java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1 [info] at scala.collection.immutable.ArraySeq$ofRef.apply(ArraySeq.scala:331) [info] at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1(partitioning.scala:471) [info] at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.$anonfun$project$1$adapted(partitioning.scala:471) [info] at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75) [info] at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35) [info] at org.apache.spark.sql.catalyst.plans.physical.KeyGroupedPartitioning$.project(partitioning.scala:471) [info] at org.apache.spark.sql.execution.KeyGroupedPartitionedScan.$anonfun$getOutputKeyGroupedPartitioning$5(KeyGroupedPartitionedScan.scala:58) ... 
``` UTs affected by the `bucket()` calculation logic change are adjusted. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#54259 from pan3793/SPARK-55411-4.1. Authored-by: Cheng Pan <chengpan@apache.org> Signed-off-by: Peter Toth <peter.toth@gmail.com>
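The underlying indexing mistake is generic: positions computed against the full partition expression list must be applied to values aligned with that full list, not to a pre-projected one. A small standalone sketch (hypothetical names, not Spark's actual `KeyGroupedPartitioning#project` code):

```python
def project_values(positions, partition_row):
    """positions: join-key positions computed against the FULL partition
    expression list (e.g. [years(ts), bucket(id)]); partition_row: one
    partition value tuple aligned with that full list."""
    return [partition_row[p] for p in positions]

# Full partition values for [years(ts), bucket(id)]; join keys cover
# only bucket(id), i.e. position 1 of the full expression list.
full_row = (2024, 7)
assert project_values([1], full_row) == [7]
```

Passing an already-projected row while keeping positions from the full list makes index 1 fall outside a length-1 sequence — the Python analogue of the `ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1` in the stack trace above.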
…re-wrapping ### What changes were proposed in this pull request? The current Theta sketch update and merge functions from `TypedImperativeAggregate` unnecessarily re-wrap the aggregation buffer with one of the ThetaSketchState case classes. Since changes to the buffer are mutable, we can avoid this re-wrap entirely. ### Why are the changes needed? Better engineering practice, small optimization. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? re-ran the SQLQueryTestSuite with `SPARK_GENERATE_GOLDEN_FILES=1`, no impact as expected. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#53984 from cboumalh/cboumalh-theta-merge-followup. Authored-by: Chris Boumalhab <cboumalh@amazon.com> Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com> (cherry picked from commit 6112a0b) Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
…checkError` ### What changes were proposed in this pull request? This PR aims to fix `EventLogFileWriters.closeWriter` to handle `checkError`. In general, we need the following three steps. 1. Do `flush` first before closing to isolate any problems at this layer. 2. Do `PrintWriter.close` and fall back to the underlying Hadoop file stream's `close` API. 3. Show warnings properly if `checkError` returns true. ### Why are the changes needed? Currently, Apache Spark's event log writer naively invokes `PrintWriter.close()` without error handling. https://github.com/apache/spark/blob/4e1cb88bba0c031f54dd07e3adc0d464d45cbfce/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala#L80 https://github.com/apache/spark/blob/4e1cb88bba0c031f54dd07e3adc0d464d45cbfce/core/src/main/scala/org/apache/spark/deploy/history/EventLogFileWriters.scala#L133-L135 However, the Java community recommends using `checkError` with `PrintWriter.flush` and `PrintWriter.close`. - https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/io/PrintWriter.html#checkError() When `checkError` returns `true`, a user can lose their event log. For example, the event log is silently not uploaded. Spark should at least show a proper warning and make a best effort to flush or close the underlying Hadoop file streams. ### Does this PR introduce _any_ user-facing change? No, this is a bug fix for a corner case. ### How was this patch tested? Pass the CIs with the newly added test cases. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: `Opus 4.5` on `Claude Code` Closes apache#54280 from dongjoon-hyun/SPARK-55495. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 3484a4a) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
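The three steps can be sketched in Python. Note that Python streams raise `OSError` rather than setting an error flag the way `PrintWriter.checkError` does, so this sketch (with illustrative names, not the actual Scala code) translates the same flush/close/fallback order into exception handling:

```python
import io

def close_writer(primary, fallback_close):
    """Illustrative sketch of the three steps above: flush first to
    isolate problems, then close, then fall back to the underlying
    stream's close, collecting warnings instead of losing them silently."""
    warnings = []
    try:
        primary.flush()
    except OSError as e:
        warnings.append(f"flush failed: {e}")
    try:
        primary.close()
    except OSError as e:
        warnings.append(f"close failed: {e}")
        try:
            fallback_close()  # best effort on the underlying stream
        except OSError as e2:
            warnings.append(f"fallback close also failed: {e2}")
    return warnings

class BrokenClose(io.StringIO):
    """Stand-in for a writer whose close() reports an error."""
    def close(self):
        raise OSError("stream already closed")
```

On the happy path no warnings are produced; on a failing close the caller gets the warnings instead of silent event-log loss.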
…stFrameworks.JUnit` ### What changes were proposed in this pull request? This PR fixes an issue that `TestFrameworks.JUnit` is still used even though newer jupiter plugin (`com.github.sbt.junit.sbt-jupiter-interface`) is used. As a result, `-Dtests.include.tags` and `-Dtests.exclude.tags` don't work. ``` $ build/sbt -Dtest.include.tags=org.apache.spark.tags.SlowHiveTest unsafe/test ... [info] Run completed in 2 seconds, 877 milliseconds. [info] Total number of tests run: 0 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 [info] No tests were executed. [info] Passed: Total 97, Failed 0, Errors 0, Passed 97 ``` ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Confirmed `-Dtests.exclude.tags` works. ``` $ build/sbt -Dtest.include.tags=org.apache.spark.tags.SlowHiveTest unsafe/test ... [info] Run completed in 738 milliseconds. [info] Total number of tests run: 0 [info] Suites: completed 2, aborted 0 [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 [info] No tests were executed. [info] Passed: Total 0, Failed 0, Errors 0, Passed 0 [success] Total time: 2 s, completed 2026/02/12 19:33:53 ``` There are no tagged `.java` tests so I didn't confirm if `-Dtest.exclude.tags` works. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#54281 from sarutak/fix-jupiter-junit5. Authored-by: Kousuke Saruta <sarutak@amazon.co.jp> Signed-off-by: Cheng Pan <chengpan@apache.org> (cherry picked from commit 448681f) Signed-off-by: Cheng Pan <chengpan@apache.org>
… and MutableColumnarRow ### What changes were proposed in this pull request? `ColumnarBatchRow.copy()` and `MutableColumnarRow.copy()`/`get()` do not handle `VariantType`, causing a `RuntimeException: Not implemented. VariantType` when using `VariantType` columns with streaming custom data sources that rely on columnar batch row copying. PR apache#53137 (SPARK-54427) added `VariantType` support to `ColumnarRow` but missed `ColumnarBatchRow` and `MutableColumnarRow`. PR apache#54006 attempted this fix but was closed. This patch adds: - `PhysicalVariantType` branch in `ColumnarBatchRow.copy()` - `VariantType` branch in `MutableColumnarRow.copy()` and `get()` - Test in `ColumnarBatchSuite` validating `VariantVal` round-trip through `copy()` ### Why are the changes needed? Without this fix, any streaming pipeline that returns `VariantType` columns from a custom columnar data source throws a runtime exception when Spark attempts to copy the batch row. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix for an existing feature. ### How was this patch tested? Added a new test `[SPARK-55552] Variant` in `ColumnarBatchSuite` that creates a `VariantType` column vector, populates it with `VariantVal` data (including a null), wraps it in a `ColumnarBatchRow`, calls `copy()`, and verifies the values round-trip correctly. ### Was this patch authored or co-authored using generative AI tooling? Yes. GitHub Copilot was used to assist in drafting portions of this contribution. This contribution is my original work and I license it under the Apache 2.0 license. Closes apache#54337 from tugceozberk/fix-columnar-batch-row-variant-copy. Authored-by: tugce-applied <tugce@appliedcomputing.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit c4188b0) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… pages ### What changes were proposed in this pull request? This PR proposes to fix the file names back to using hyphens, see ``` [TXT] spark-release-4-0-0.html 2026-02-10 09:21 134K [TXT] spark-release-4-0-1.html 2026-02-10 09:21 25K [TXT] spark-release-4-0-2.html 2026-02-10 09:21 11K [TXT] spark-release-4.1.0.html 2026-02-10 09:21 69K ``` at https://spark.apache.org/releases/ ### Why are the changes needed? For consistency. ### Does this PR introduce _any_ user-facing change? Virtually no. ### How was this patch tested? Will be tested in the next release. Also manually tested. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#54372 from HyukjinKwon/SPARK-55599. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit c78b1a1) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…ak when end-offset is not updated ### What changes were proposed in this pull request? Backport apache#54237 to branch-4.1 In `_SimpleStreamReaderWrapper.latestOffset()`, validate that a custom data source implementation based on `SimpleDataSourceStreamReader.read()` does not return a non-empty batch with end == start. If it does, raise PySparkException with error class `SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE` before appending to the cache. Empty batches with end == start remain allowed. ### Why are the changes needed? When a user implements read(start) incorrectly and returns: - Same offset for both: end = start (e.g. both {"offset": 0}). - Non-empty iterator: e.g. 2 rows. If a reader returns end == start with data (e.g. return (it, {"offset": start_idx})), the wrapper keeps appending to its prefetch cache on every trigger while commit(end) never trims entries (the first matching index is 0). The cache grows without bound and driver (non-JVM) memory increases until OOM. Validating and raising the error before appending stops this and fails fast with a clear error. Empty batches with end == start remain allowed, so a Python data source can signal that there is no data to read. ### Does this PR introduce _any_ user-facing change? Yes. Implementations that return end == start with a non-empty iterator now get a PySparkException instead of unbounded memory growth. Empty batches with end == start are unchanged. ### How was this patch tested? Added a unit test in `test_python_streaming_datasource.py` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#54321 from vinodkc/br_SPARK-55416_4.1. Authored-by: vinodkc <vinod.kc.in@gmail.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
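The validation described above amounts to a small check before appending to the prefetch cache. A standalone sketch (the exception class and function names here are illustrative stand-ins; the real check lives in `_SimpleStreamReaderWrapper.latestOffset()` and raises PySparkException):

```python
class OffsetDidNotAdvanceError(Exception):
    """Stand-in for PySparkException with error class
    SIMPLE_STREAM_READER_OFFSET_DID_NOT_ADVANCE."""

def validate_batch(start, end, rows):
    """A non-empty batch must advance the offset; otherwise the prefetch
    cache would grow on every trigger. Empty batches with end == start
    stay allowed so the reader can signal 'no data yet'."""
    if end == start and rows:
        raise OffsetDidNotAdvanceError(
            f"read() returned {len(rows)} row(s) but end offset {end} "
            f"did not advance past start {start}")
    return rows, end
```

Failing fast here converts an eventual driver OOM into an immediate, diagnosable error.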
…itioning` ### What changes were proposed in this pull request? This PR adds a new `initMetricsValues()` method to `PartitionReader` so as to initialize the custom metrics returned by `currentMetricsValues()`. With `KeyGroupedPartitioning`, multiple input partitions are grouped together, so multiple `PartitionReader`s belong to one output partition. A `PartitionReader` needs to be initialized with the metrics calculated by the previous `PartitionReader` of the same partition group to compute the right value. ### Why are the changes needed? To calculate custom metrics correctly. ### Does this PR introduce _any_ user-facing change? It fixes metrics calculation. ### How was this patch tested? New UT is added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#54401 from peter-toth/SPARK-55302-fix-kgp-custom-metrics-4.1. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…tions ### What changes were proposed in this pull request? Replace `PartitionMetricCallback` with a `ConcurrentHashMap` keyed by task attempt ID to correctly track reader state across multiple `compute()` calls when `DataSourceRDD` is coalesced. The completion listener is registered only once per task attempt, and metrics are flushed and carried forward between readers as partitions are advanced. ### Why are the changes needed? When `DataSourceRDD` is coalesced (e.g., via `.coalesce(1)`), `compute()` gets called multiple times per task, which causes the custom metrics to be computed incorrectly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Sonnet 4.6 Closes apache#54407 from viirya/SPARK-55619-branch-4.1. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
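The core idea — keying reader state by task attempt ID so it survives repeated `compute()` calls within one task — can be sketched with a plain dict standing in for the `ConcurrentHashMap` (all names here are hypothetical, not Spark's):

```python
class CoalescedMetricsTracker:
    """Illustrative sketch: one accumulator entry per task attempt, so
    metrics flushed by one reader are carried forward to the next reader
    computed in the same task (e.g. after coalesce(1))."""

    def __init__(self):
        self.totals = {}  # task_attempt_id -> accumulated metric value

    def advance_reader(self, task_attempt_id, reader_metric):
        # Flush the finished reader's metric into the per-attempt total
        # instead of overwriting it, and return the running total.
        self.totals[task_attempt_id] = (
            self.totals.get(task_attempt_id, 0) + reader_metric)
        return self.totals[task_attempt_id]
```

Without the per-attempt key, a second `compute()` call in the same task would reset the metric instead of accumulating it.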
…config files ### What changes were proposed in this pull request? This PR aims to generalize `postgres-krb-setup.sh` to find config files instead of using static locations. ### Why are the changes needed? To improve `PostgresKrbIntegrationSuite` to handle different PostgreSQL image via `POSTGRES_DOCKER_IMAGE_NAME`. Currently, `PostgresKrbIntegrationSuite` fails for other `PostgreSQL` images. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: `Gemini 3.1 Pro (High)` on `Antigravity` Closes apache#54423 from dongjoon-hyun/SPARK-55637. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 920a401) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…ined error message parameter ### What changes were proposed in this pull request? There is a mismatch between the error class definition for UDTF_ARROW_TYPE_CONVERSION_ERROR and its usage. The code in python/pyspark/worker.py was raising UDTF_ARROW_TYPE_CONVERSION_ERROR with the parameters data, schema, and arrow_schema, but the error class definition in error-conditions.json did not expect any parameters. This PR fixes this by introducing a new error class UDTF_ARROW_DATA_CONVERSION_ERROR specifically for data conversion errors, which accepts the necessary parameters. ### Why are the changes needed? Fixing a bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT ### Was this patch authored or co-authored using generative AI tooling? Yes Closes apache#54318 from allisonwang-db/fix-udtf-error. Authored-by: Allison Wang <allison.wang@databricks.com> Signed-off-by: Allison Wang <allison.wang@databricks.com> (cherry picked from commit e7de362) Signed-off-by: Allison Wang <allison.wang@databricks.com>
…n `AtomicReplaceTableExec` (This PR is to backport apache#54322) ### What changes were proposed in this pull request? SPARK-51834 added `.withConstraints()` to `CreateTableExec` and `ReplaceTableExec` but missed `AtomicReplaceTableExec` in the same file. This causes `REPLACE TABLE` and `CREATE OR REPLACE TABLE` with constraints to silently drop them when the catalog implements `StagingTableCatalog`. Also fixes `StagingInMemoryTableCatalog` to forward constraints when constructing `InMemoryTable`, and adds regression tests for all four constraint types through the atomic replace path. ### Why are the changes needed? `AtomicReplaceTableExec` does not call `.withConstraints()` when building `TableInfo`, so `REPLACE TABLE` and `CREATE OR REPLACE TABLE` silently drop constraints when the catalog implements `StagingTableCatalog`. ### Does this PR introduce _any_ user-facing change? Yes. Previously, REPLACE TABLE and CREATE OR REPLACE TABLE with constraints would silently drop constraints when using a `StagingTableCatalog`. After this fix, constraints are correctly passed through to the catalog. ### How was this patch tested? New unit tests in `UniqueConstraintSuite`, `PrimaryKeyConstraintSuite`, `CheckConstraintSuite`, and `ForeignKeyConstraintSuite` that exercise `REPLACE TABLE` with constraints through the atomic catalog path (`StagingInMemoryTableCatalog`). Verified that all 4 new tests fail without the fix and pass with it. All existing tests in these suites continue to pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.6 <noreply@anthropic.com> Closes apache#54322 from yyanyy/staging_catalog_constraint. Authored-by: Yan Yan <yyanyyyy@gmail.com> Closes apache#54404 from yyanyy/backport-SPARK-51834-follow-up-branch-4.1. Authored-by: Yan Yan <yyanyyyy@gmail.com> Signed-off-by: Gengliang Wang <gengliang@apache.org>
…tributes with non-binary-stable collations
### What changes were proposed in this pull request?
* `ConstantPropagation` optimizer rule substitutes attributes with literals derived from equality
predicates (e.g. `c = 'hello'`), then propagates them into other conditions in the same
conjunction. This is unsafe for non-binary-stable collations (e.g. `UTF8_LCASE`) where
equality is non-identity: `c = 'hello'` (case-insensitive) does not imply `c` holds exactly
the bytes `'hello'` - it could also be `'HELLO'`, `'Hello'`, etc.
* Substituting `c → 'hello'` in a second condition like `c = 'HELLO' COLLATE UNICODE` turns it
into the constant expression `'hello' = 'HELLO' COLLATE UNICODE`, which is always `false`,
producing incorrect results.
* Fixed by guarding `safeToReplace` with `isBinaryStable(ar.dataType)` so propagation is skipped
for attributes whose type is not binary-stable.
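The unsoundness above can be modeled outside Spark. Below is a minimal Python sketch (the comparison helpers are illustrative stand-ins, not Spark's collation implementation) showing how substituting the literal flips a satisfiable conjunction to constant false:

```python
# Hypothetical model of why constant propagation is unsafe under
# non-binary-stable collations: UTF8_LCASE-style equality compares
# lowercased values, so `c = 'hello'` admits many byte representations.

def lcase_eq(a: str, b: str) -> bool:
    # Stand-in for UTF8_LCASE (case-insensitive) equality.
    return a.lower() == b.lower()

def unicode_eq(a: str, b: str) -> bool:
    # Stand-in for a binary-sensitive comparison.
    return a == b

row = {"c": "HELLO"}  # satisfies c = 'hello' under UTF8_LCASE

# Original conjunction: c = 'hello' AND c = 'HELLO' COLLATE UNICODE
original = lcase_eq(row["c"], "hello") and unicode_eq(row["c"], "HELLO")

# After unsafe propagation c -> 'hello': the second predicate becomes the
# constant expression 'hello' = 'HELLO', which is always false.
propagated = lcase_eq(row["c"], "hello") and unicode_eq("hello", "HELLO")

print(original, propagated)  # True False -> wrong result after propagation
```

With the fix, propagation is simply skipped for such attributes, leaving the original (correct) conjunction in place.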
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit test.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54435 from ilicmarkodb/fix_collation_.
Authored-by: ilicmarkodb <marko.ilic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ec35791)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `aircompressor` to 2.0.3.
### Why are the changes needed?
To bring the latest bug fixes.
- https://github.com/airlift/aircompressor/releases/tag/2.0.3
### Does this PR introduce _any_ user-facing change?
No behavior change.
### How was this patch tested?
Pass the CIs.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54486 from dongjoon-hyun/SPARK-55688.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0a7908e)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…th undefined error message parameter" This reverts commit bc1bbe6.
### What changes were proposed in this pull request?
The grammar accepts constraint specifications (PRIMARY KEY, UNIQUE, CHECK, FOREIGN KEY) in CREATE TABLE AS SELECT and REPLACE TABLE AS SELECT, but the execution layer silently drops them. Neither the ANSI SQL standard nor PostgreSQL supports this syntax - the SQL standard makes table element lists and AS subquery clauses mutually exclusive. Block this at the parser level, consistent with the existing CTAS/RTAS checks for schema columns and partition column types.
### Why are the changes needed?
Explicitly throw an exception for an unsupported case, for clarity.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
### Was this patch authored or co-authored using generative AI tooling?
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes apache#54454 from yyanyy/block-ctas-constraints.
Authored-by: Yan Yan <yyanyyyy@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 38a7533)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…n UNIQUE constraint Javadoc
### What changes were proposed in this pull request?
Document NULLS DISTINCT behavior: NULL values are treated as distinct from each other, so rows with NULLs in unique columns never conflict. Also note that UNIQUE allows nullable columns (unlike PRIMARY KEY) and that NULLS NOT DISTINCT is not currently supported.
### Why are the changes needed?
Better Javadoc clarity on the UNIQUE constraint expectation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
### Was this patch authored or co-authored using generative AI tooling?
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Closes apache#54357 from yyanyy/unique_definition_clarify.
Authored-by: Yan Yan <yyanyyyy@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 53606f2)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
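As a rough illustration of the NULLS DISTINCT semantics documented by that commit (plain Python, not Spark code; `violates_unique` is a hypothetical helper written for this sketch):

```python
# NULLS DISTINCT: NULL (modeled as None) values never count as
# duplicates of each other in a UNIQUE column, while non-NULL values
# must not repeat.
def violates_unique(values) -> bool:
    seen = set()
    for v in values:
        if v is None:
            continue  # NULLs are mutually distinct: never a conflict
        if v in seen:
            return True  # duplicate non-NULL value
        seen.add(v)
    return False

print(violates_unique([1, None, None, 2]))  # False: NULLs don't conflict
print(violates_unique([1, 1, None]))        # True: duplicate 1
```

Under NULLS NOT DISTINCT (not currently supported, per the commit), the `if v is None: continue` branch would be removed, making repeated NULLs a violation.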
…eV2Filtering` documentation
### What changes were proposed in this pull request?
This is a follow-up to apache#38924 to clarify the behaviour of scans with runtime filters.
### Why are the changes needed?
Please see the discussion at apache#54330 (comment).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a documentation change.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54490 from peter-toth/SPARK-55692-fix-supportsruntimefiltering-docs.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3665506)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…portedV1WriteMetric
### What changes were proposed in this pull request?
The bug was introduced by SPARK-50315 (apache#48867). It does not fail the test, but causes lots of warning logs:
```
$ build/sbt "sql/testOnly *V1WriteFallbackSuite"
...
18:06:25.108 WARN org.apache.spark.sql.execution.ui.SQLAppStatusListener: Unable to load custom metric object for class `org.apache.spark.sql.connector.SupportedV1WriteMetric`. Please make sure that the custom metric class is in the classpath and it has 0-arg constructor.
org.apache.spark.SparkException: org.apache.spark.sql.connector.SupportedV1WriteMetric did not have a zero-argument constructor or a single-argument constructor that accepts SparkConf. Note: if the class is defined inside of another Scala class, then its constructors may accept an implicit parameter that references the enclosing class; in this case, you must define the class as a top-level class in order to prevent this extra parameter from breaking Spark's ability to find a valid constructor.
	at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2871)
	at scala.collection.immutable.List.flatMap(List.scala:283)
	at scala.collection.immutable.List.flatMap(List.scala:79)
	at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2853)
	at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$3(SQLAppStatusListener.scala:220)
	at scala.Option.map(Option.scala:242)
	at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$2(SQLAppStatusListener.scala:214)
	... (repeated many times)
```
### Why are the changes needed?
Fix UT.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified locally; no warnings are printed after the fix.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54544 from pan3793/SPARK-55746.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a0a092f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… is null
### What changes were proposed in this pull request?
The `GetArrayItem` expression incorrectly computed `nullable = false` when indexing into arrays with `containsNull = false` (e.g., from `split()`), even when the array itself could be null. This caused codegen to skip null checks, leading to an NPE on `array.numElements()` during bounds checking.
### Why are the changes needed?
To resolve an NPE within the Spark engine.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests in this PR.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54546 from stevomitric/stevomitric/fix-npe-codegen.
Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c0e367a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
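A toy Python model of the nullability reasoning described in that commit (the function names are illustrative stand-ins for this sketch, not Spark's actual API): the buggy rule derived result nullability only from the element-level `containsNull` flag, ignoring whether the array column itself is nullable.

```python
def result_nullable_buggy(array_nullable: bool, contains_null: bool) -> bool:
    # Buggy rule: only element nullability is consulted, so a nullable
    # array with containsNull = false reports a non-nullable result.
    return contains_null

def result_nullable_fixed(array_nullable: bool, contains_null: bool) -> bool:
    # Fixed rule: a null array also yields a null result, so generated
    # code must keep its null check before touching array.numElements().
    return array_nullable or contains_null

# split() produces arrays with containsNull = false, but the array
# column itself can still be null:
print(result_nullable_buggy(True, False))   # False -> null check skipped, NPE
print(result_nullable_fixed(True, False))   # True  -> null check kept
```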
### What changes were proposed in this pull request?
`ArrowWriter.sizeInBytes()` and `SliceBytesArrowOutputProcessorImpl.getBatchBytes()` both accumulated per-column buffer sizes (each an `Int`) into an `Int` accumulator. When the total exceeds 2 GB, the sum silently wraps negative, causing the byte-limit checks controlled by `spark.sql.execution.arrow.maxBytesPerBatch` and `spark.sql.execution.arrow.maxBytesPerOutputBatch` to behave incorrectly and potentially allow oversized batches through. Fix by changing both accumulators and return types to `Long`.
### Why are the changes needed?
Fix a possible overflow when calculating Arrow batch bytes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes apache#54584 from viirya/fix-arrow-batch-bytes-overflow.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit df195ac)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
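The wraparound can be demonstrated in plain Python by emulating Java's 32-bit `int` arithmetic (the helper below is illustrative, not Spark code):

```python
# Summing per-column buffer sizes with 32-bit wraparound, as a JVM `Int`
# accumulator would, versus a 64-bit `Long` total.
def to_int32(x: int) -> int:
    # Emulate Java int overflow: wrap into [-2^31, 2^31).
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

buffer_sizes = [1_500_000_000, 1_500_000_000]  # ~3 GB total across columns

int_total = 0
for size in buffer_sizes:
    int_total = to_int32(int_total + size)  # wraps negative past 2 GB

long_total = sum(buffer_sizes)  # Long-style accumulator: correct

print(int_total, long_total)  # -1294967296 3000000000
```

A negative total trivially passes any `total > maxBytesPerBatch` check, which is why the batch-size limits stopped working past 2 GB.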
### What changes were proposed in this pull request?
This PR fixes the `spark.range` start value in `SparkDataFramePi` from `0` to `1` to match `SparkPi`'s `1 until n`.
### Why are the changes needed?
We newly added the `SparkDataFramePi` example in Apache Spark 4.0.0 to follow the `SparkPi` example.
- apache#49617
`SparkPi` uses `1 until n`, which generates `n - 1` samples and divides by `(n - 1)`. However, `SparkDataFramePi` uses `spark.range(0, n)`, which generates `n` samples but still divides by `(n - 1)`, resulting in an inaccurate pi approximation.
https://github.com/apache/spark/blob/897e1b828b1a66e0aa7b8a959897fc23f7c29c0c/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala#L34-L39
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a simple example fix. Verified by code inspection against `SparkPi.scala`.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (claude-opus-4-6)
Closes apache#54696 from dongjoon-hyun/SPARK-55893.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 11069d4)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
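The sample-count mismatch is easy to see in a plain-Python Monte Carlo sketch (this illustrates the off-by-one described above; it is not the Spark example itself):

```python
import random

def estimate_pi(num_samples: int, divisor: int) -> float:
    # Throw darts at the unit square; a dart lands inside the quarter
    # circle with probability pi/4.
    rng = random.Random(0)  # fixed seed for a reproducible comparison
    inside = sum(
        1
        for _ in range(num_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1
    )
    return 4.0 * inside / divisor

n = 100_000
buggy = estimate_pi(n, n - 1)      # like spark.range(0, n): n samples / (n - 1)
fixed = estimate_pi(n - 1, n - 1)  # like 1 until n: n - 1 samples / (n - 1)
print(buggy, fixed)
```

Dividing `n` samples by `n - 1` biases the estimate upward by a factor of `n / (n - 1)`; negligible for large `n`, but the fix keeps the two examples consistent.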
…clustering
When SPJ partial clustering splits a partition across multiple tasks, post-join dedup operators (`dropDuplicates`, Window `row_number`) produce incorrect results, because `KeyGroupedPartitioning.satisfies0()` incorrectly reports that `ClusteredDistribution` is satisfied: the `super.satisfies0()` call short-circuits the `isPartiallyClustered` guard. This fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange. Plain SPJ joins without dedup are unaffected.
Closes apache#54378
What changes were proposed in this pull request?
Backport fix for SPARK-55848 to branch-4.1. This branch does not have the KeyGroupedPartitioning refactor (#54330) from master.
The fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange.
Why are the changes needed?
SPJ with partial clustering produces incorrect results for post-join dedup operations (`dropDuplicates`, Window `row_number`). The partially-clustered partitioning is incorrectly treated as satisfying `ClusteredDistribution`, so no Exchange is inserted before dedup operators.
Does this PR introduce any user-facing change?
Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.
How was this patch tested?
Three regression tests added to KeyGroupedPartitioningSuite with data correctness checks and plan assertions verifying shuffle Exchange presence. All 95 tests pass.
Was this patch authored or co-authored using generative AI tooling?
No.
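As a self-contained illustration of the failure mode (a toy Python model, not Spark's execution engine): when partial clustering splits a key's rows across tasks, per-task dedup without an intervening shuffle leaves cross-task duplicates, whereas shuffling by key first, as the inserted Exchange does, restores correctness.

```python
# Partially clustered assignment: key k1's rows land in two tasks.
tasks = [[("k1", 1)], [("k1", 1), ("k2", 2)]]

def dedup(partition):
    # Order-preserving per-partition duplicate removal.
    return list(dict.fromkeys(partition))

# Without an Exchange: dedup runs per task, so k1 survives twice.
wrong = [row for task in tasks for row in dedup(task)]

# With the Exchange inserted by EnsureRequirements: shuffle by key first,
# so each key is fully contained in one partition before dedup runs.
shuffled = {}
for task in tasks:
    for row in task:
        shuffled.setdefault(row[0], []).append(row)
right = [row for part in shuffled.values() for row in dedup(part)]

print(len(wrong), len(right))  # 3 2
```

The buggy plan corresponds to `wrong` (duplicates remain); the patched plan, with the shuffle restored, corresponds to `right`.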