[SPARK-54830][CORE] Enable checksum based indeterminate shuffle retry by default #53652
Added Java options to increase memory limit for tests.
JIRA Issue Information: Task SPARK-54830 (this comment was automatically generated by GitHub Actions)
0dff2d9 to 62df076
-J-Xmx8g
-J-Xms8g
-J-Xms4g
Change -Xms to 4g to leave some memory for the sql tests, where -Xmx is set to 6g. Otherwise, CI jobs may be terminated due to memory pressure.
Another option is to modify the UT case to disable the shuffle checksum; then we wouldn't need to change the memory settings.
cc @cloud-fan
## Upgrading from Spark SQL 4.1 to 4.2

- Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to `false`.
I think setting the first config to false is sufficient?
It would be better to set both to false.
The second one controls whether we depend on the checksum to detect indeterminate shuffle retries, and the first one decides whether we compute a checksum for the shuffle output.
If only the first one is disabled, we'll never detect the indeterminate shuffle retry.
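For readers who want to apply the advice above, here is a minimal sketch that renders both settings as `--conf` arguments for `spark-submit`. The two config keys are quoted from the migration-guide text in this PR; the helper `to_submit_args` and the dictionary name are hypothetical, introduced only for illustration.

```python
# Config keys are copied from the migration-guide entry in this PR.
# The names below (RESTORE_PREVIOUS_BEHAVIOR, to_submit_args) are hypothetical.
RESTORE_PREVIOUS_BEHAVIOR = {
    "spark.sql.shuffle.orderIndependentChecksum.enabled": "false",
    "spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch": "false",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints the flags to paste into a spark-submit command line.
    print(" ".join(to_submit_args(RESTORE_PREVIOUS_BEHAVIOR)))
```

Both keys are set together, following the reviewer's reasoning that disabling only the first one would leave indeterminate shuffle retries undetected.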
thanks, merging to master!
What changes were proposed in this pull request?
Enable checksum based indeterminate shuffle retry by default.
Increase the JVM memory size to 6g for `sql` module tests: the test case "SPARK-48037: Fix SortShuffleWriter lacks shuffle write related metrics resulting in potentially inaccurate data" sets the number of shuffle partitions to `16777216`, which needs more memory to compute the order-independent shuffle checksum.
Why are the changes needed?
Because the checksum-based solution detects indeterminate shuffle output changes more accurately, this PR proposes enabling it by default to avoid query correctness issues caused by indeterminate shuffle retries.
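To illustrate the idea, here is a toy sketch of an order-independent checksum and the retry decision described in the migration note. It is not Spark's implementation: the use of CRC32, the sum/XOR combination, and the names `order_independent_checksum`, `Stage`, and `stages_to_rerun` are all assumptions made for this example.

```python
import zlib
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


def order_independent_checksum(rows):
    """Combine per-row CRC32 values with commutative operations (sum and XOR),
    so the result does not depend on the order in which rows arrive."""
    total, xored = 0, 0
    for row in rows:
        h = zlib.crc32(repr(row).encode("utf-8"))
        total = (total + h) & MASK64
        xored ^= h
    return (total, xored)


@dataclass
class Stage:
    """Hypothetical stand-in for a succeeding stage; not a Spark class."""
    name: str
    can_roll_back: bool


def stages_to_rerun(old_checksum, new_checksum, succeeding_stages):
    """Return the names of stages to re-execute after a shuffle retry.
    Raises if a mismatch is found and some stage cannot be rolled back."""
    if old_checksum == new_checksum:
        return []  # retry produced the same rows; nothing to redo
    blocked = [s.name for s in succeeding_stages if not s.can_roll_back]
    if blocked:
        raise RuntimeError(f"job fails: cannot roll back {blocked}")
    return [s.name for s in succeeding_stages]


first_attempt = [("a", 1), ("b", 2), ("c", 3)]
same_rows_reordered = [("c", 3), ("a", 1), ("b", 2)]
changed_rows = [("a", 1), ("b", 2), ("c", 4)]
downstream = [Stage("join", True), Stage("aggregate", True)]

base = order_independent_checksum(first_attempt)
print(stages_to_rerun(base, order_independent_checksum(same_rows_reordered), downstream))
print(stages_to_rerun(base, order_independent_checksum(changed_rows), downstream))
```

A reordered retry yields the same checksum and triggers no re-execution, while a retry whose rows actually differ causes every succeeding stage to be rerun, or the job to fail if one of them cannot be rolled back.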
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing UTs.
Was this patch authored or co-authored using generative AI tooling?
No