Skip to content

Conversation

@ivoson
Copy link
Contributor

@ivoson ivoson commented Dec 31, 2025

What changes were proposed in this pull request?

Enable checksum based indeterminate shuffle retry by default.

Increase jvm memory size to 6g for sql module tests, as test case SPARK-48037: Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data set shuffle partition as 16777216 which will need more memory for computing order independent shuffle checksum.

Why are the changes needed?

As checksum based solution is more accurate to detect indeterminate shuffle output changes, propose to enable it by default to avoid query correctness issues caused by indeterminate shuffle retry.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions
Copy link

JIRA Issue Information

=== Task SPARK-54830 ===
Summary: Enable checksum based indeterminate shuffle retry by default
Assignee: Tengfei Huang
Status: Resolved
Affected: ["4.1.0"]


This comment was automatically generated by GitHub Actions

@ivoson ivoson force-pushed the SPARK-54556-followup branch from 0dff2d9 to 62df076 Compare December 31, 2025 06:23

-J-Xmx8g
-J-Xms8g
-J-Xms4g
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change -Xms to 4g to save some memory for sql test where we modified the -Xmx to 6g. Otherwise, CI jobs may be terminated due to memory pressure.

Another option is to modify the UT case to disable shuffle checksum, then we don't need to change the memory settings.

@ivoson ivoson changed the title [WIP][SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default [SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default Jan 2, 2026
@github-actions github-actions bot added the DOCS label Jan 2, 2026
@ivoson
Copy link
Contributor Author

ivoson commented Jan 5, 2026

cc @cloud-fan


## Upgrading from Spark SQL 4.1 to 4.2

- Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to `false`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think setting the first config to false is sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to set both to false.

Since the second one controls the behavior whether we depend on checksum to detect indeterminate shuffle retry, and the first one decide whether we'll compute checksum for shuffle output.
Only disable the 1st one, we'll never detect the indeterminate shuffle retry.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 25307ab Jan 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants