[SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default #53652

ivoson · 2025-12-31T02:04:55Z

What changes were proposed in this pull request?

Enable checksum based indeterminate shuffle retry by default.

Increase the JVM memory size to 6g for SQL module tests. The test case `SPARK-48037: Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data` sets the number of shuffle partitions to 16777216, which requires more memory to compute the order-independent shuffle checksums.
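As a rough illustration of why 16777216 shuffle partitions add memory pressure, the sketch below estimates the checksum state a single map task might hold. The 8-byte-per-partition figure and the helper name `checksum_state_mib` are assumptions made for this example; the PR does not state the real per-partition cost, which may be higher.

```python
# Partition count taken from the test case mentioned above; equals 2**24.
NUM_PARTITIONS = 16_777_216

# Assumption for illustration only: one 64-bit checksum value per partition.
ASSUMED_BYTES_PER_CHECKSUM = 8


def checksum_state_mib(num_partitions, bytes_per_partition):
    """Return the size in MiB of a flat array of per-partition checksum values."""
    return num_partitions * bytes_per_partition / (1024 * 1024)


estimate_mib = checksum_state_mib(NUM_PARTITIONS, ASSUMED_BYTES_PER_CHECKSUM)
print(f"{estimate_mib:.0f} MiB per map task under these assumptions")
```

Even under this optimistic assumption the state is far from negligible when several tasks run in the same test JVM, which is consistent with the need for a larger heap.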

Why are the changes needed?

Because the checksum-based approach detects changes in indeterminate shuffle output more accurately, this PR proposes enabling it by default to avoid query correctness issues caused by indeterminate shuffle retries.
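To illustrate the idea of an order-independent checksum, here is a minimal sketch. It is not Spark's implementation: the function name, the per-row CRC32, and the XOR-plus-sum combination are assumptions chosen for the example. It only shows that a commutative combination ignores row reordering while still reacting to changed data.

```python
import zlib

MASK64 = (1 << 64) - 1  # keep the running sum within 64 bits


def order_independent_checksum(rows):
    """Combine per-row CRC32 values with XOR and a wrapping sum.

    Both operations are commutative, so the result does not depend on
    the order in which rows are visited.
    """
    xor_acc = 0
    sum_acc = 0
    for row in rows:
        h = zlib.crc32(row)
        xor_acc ^= h
        sum_acc = (sum_acc + h) & MASK64
    return (xor_acc, sum_acc)


# Hypothetical shuffle outputs from a first attempt and two retries.
first_attempt = [b"a,1", b"b,2", b"c,3"]
reordered_retry = [b"c,3", b"a,1", b"b,2"]  # same rows, different order
changed_retry = [b"a,1", b"b,2", b"c,4"]    # one row differs

same = order_independent_checksum(first_attempt) == order_independent_checksum(reordered_retry)
differs = order_independent_checksum(first_attempt) != order_independent_checksum(changed_retry)
```

In this sketch a retry that merely reorders rows produces the same checksum, so it would not be treated as a mismatch, whereas a retry that produces different rows would.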

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions · 2025-12-31T02:05:04Z

JIRA Issue Information

=== Task SPARK-54830 ===
Summary: Enable checksum based indeterminate shuffle retry by default
Assignee: Tengfei Huang
Status: Resolved
Affected: ["4.1.0"]

This comment was automatically generated by GitHub Actions

ivoson · 2025-12-31T06:33:45Z

.sbtopts


 -J-Xmx8g
--J-Xms8g
+-J-Xms4g


Change -Xms to 4g to leave memory headroom for the SQL tests, where -Xmx was raised to 6g. Otherwise, CI jobs may be terminated due to memory pressure.

Another option is to modify the unit test to disable the shuffle checksum; then the memory settings would not need to change.

ivoson · 2026-01-05T01:24:20Z

cc @cloud-fan

cloud-fan · 2026-01-05T06:08:06Z

docs/sql-migration-guide.md


+## Upgrading from Spark SQL 4.1 to 4.2
+
+- Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to `false`.


I think setting the first config to false is sufficient?

It would be better to set both to false.

The second config controls whether we rely on the checksum to detect indeterminate shuffle retries, while the first decides whether we compute the checksum for shuffle output at all.
If only the first one is disabled, we'll never detect the indeterminate shuffle retry.
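For reference, the two settings discussed above could be passed to `spark-submit` as `--conf` pairs. The snippet below is only a sketch that renders those arguments; the config keys come from the migration guide text in this PR, while the dictionary and helper names are hypothetical.

```python
# Config keys as quoted in the migration guide entry; both set to "false"
# to restore the previous behavior.
RESTORE_PREVIOUS_BEHAVIOR = {
    "spark.sql.shuffle.orderIndependentChecksum.enabled": "false",
    "spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch": "false",
}


def to_submit_args(conf):
    """Render a config mapping as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(RESTORE_PREVIOUS_BEHAVIOR)
print(" ".join(submit_args))
```

The resulting arguments can be appended to an existing `spark-submit` command line.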

cloud-fan · 2026-01-05T07:21:43Z

thanks, merging to master!

ivoson added 4 commits December 24, 2025 09:44

enable checksum based indeterminate shuffle retry by default

4ab697e

Increase Java memory limit for SQL tests - Add comments

1f6e775

Added Java options to increase memory limit for tests.

Update jira number

602cf76

Merge branch 'apache:master' into SPARK-54556-followup

942eef0

github-actions bot added SQL BUILD labels Dec 31, 2025

ivoson added 2 commits December 31, 2025 11:30

validate memory issue

699f009

add migration guide

62df076

ivoson force-pushed the SPARK-54556-followup branch from 0dff2d9 to 62df076 Compare December 31, 2025 06:23

ivoson commented Dec 31, 2025

ivoson changed the title ~~[WIP][SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default~~ [SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default Jan 2, 2026

github-actions bot added the DOCS label Jan 2, 2026

cloud-fan reviewed Jan 5, 2026

cloud-fan closed this in 25307ab Jan 5, 2026
