
Conversation

dotsering commented Dec 24, 2025

What changes were proposed in this pull request?

I am proposing to add a new function to the Dataset class to address the small file problem. We have noticed that when the source data a Spark job reads consists of many small files (KB-sized), Spark creates a very large number of partitions. This PR adds a new function named optimizePartition that, when called, resizes partitions to roughly 128 MB if no size is passed; you can also pass your own desired partition size.
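For illustration only, here is a minimal sketch of what such a Dataset extension could look like. Only the name optimizePartition and the 128 MB default come from this PR's description; the extension-method packaging, the use of the optimized plan's size estimate, and the coalesce/repartition decision are assumptions, not the actual implementation in this PR.

```scala
import org.apache.spark.sql.Dataset

// Hypothetical sketch; names other than optimizePartition are illustrative.
object DatasetPartitionOps {
  implicit class RichDataset[T](private val ds: Dataset[T]) extends AnyVal {
    /** Resize partitions toward a target size; defaults to 128 MB per partition. */
    def optimizePartition(targetBytes: Long = 128L * 1024 * 1024): Dataset[T] = {
      // Spark's own estimate of the plan's output size, in bytes.
      val estimatedBytes: BigInt = ds.queryExecution.optimizedPlan.stats.sizeInBytes
      // Number of partitions needed to keep each partition near targetBytes.
      val desired =
        ((estimatedBytes + BigInt(targetBytes - 1)) / BigInt(targetBytes)).toInt.max(1)
      val current = ds.rdd.getNumPartitions
      if (desired < current) ds.coalesce(desired)          // fewer, larger partitions; no shuffle
      else if (desired > current) ds.repartition(desired)  // more partitions; full shuffle
      else ds
    }
  }
}
```

With the implicit in scope, a caller could write df.optimizePartition() for the 128 MB default or df.optimizePartition(64L * 1024 * 1024) for a custom target.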

Why are the changes needed?

The changes are needed to address the small file problem. They also help reduce the number of files written back to the sink.

Does this PR introduce any user-facing change?

No. It does not change any existing Dataset feature or function; it is a brand new function.

How was this patch tested?

I have added a number of unit tests covering the scenario of many small partitions: when the function is called, it either coalesces to reduce the partition count or repartitions to increase it when the partitions are too large. I also tested it locally in a Dockerized environment running a Spark cluster.
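As an illustration of those scenarios, a unit-test sketch along these lines could look as follows; it assumes the hypothetical extension sketched above, so the suite name, the targetBytes parameter, and the data sizes are illustrative rather than taken from this PR.

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical test sketch, not the tests added in this PR.
class OptimizePartitionSuite extends QueryTest with SharedSparkSession {
  import DatasetPartitionOps._  // hypothetical extension sketched earlier

  test("many tiny partitions are coalesced down") {
    val df = spark.range(1000).toDF("id").repartition(200)    // artificially tiny partitions
    val optimized = df.optimizePartition()                     // default 128 MB target
    assert(optimized.rdd.getNumPartitions < 200)
  }

  test("oversized partitions are split up") {
    val df = spark.range(1000).toDF("id").coalesce(1)
    val optimized = df.optimizePartition(targetBytes = 1024L)  // tiny target forces a split
    assert(optimized.rdd.getNumPartitions > 1)
  }
}
```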

Was this patch authored or co-authored using generative AI tooling?

I did most of the coding myself. I used Gemini to help walk me through the process of opening the PR and the JIRA ticket and using linters/formatters. This PR does not contain a lot of code changes.

dotsering marked this pull request as ready for review December 24, 2025 19:43
dotsering changed the title A new feature to optimize partition size and count → [SPARK-54838] A new feature to optimize partition size and count Dec 24, 2025
wangyum (Member) commented Dec 25, 2025

Could we consider setting spark.sql.files.maxPartitionNum to address the small file issue?
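For context, spark.sql.files.maxPartitionNum (available since Spark 3.5) suggests an upper bound on the number of split file partitions for the whole session; a minimal usage sketch, with an illustrative cap and path, might look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cap-read-partitions")                    // illustrative app name
  .config("spark.sql.files.maxPartitionNum", "200")  // suggested upper bound for file splits
  .getOrCreate()

// The cap applies to every file-based read in this session (hypothetical path):
val df = spark.read.parquet("/data/many-small-files")
// If the initial split count exceeds 200, Spark rescales it toward ~200.
println(df.rdd.getNumPartitions)
```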

dotsering (Author)

> Could we consider setting spark.sql.files.maxPartitionNum to address the small file issue?

Thanks @wangyum for your feedback. Here are a few limitations I can think of with a hardcoded maxPartitionNum property:

  1. Incompatibility with Heterogeneous Data Sources: The configuration lacks flexibility because it applies globally to every read operation, and an ideal value for one source is often catastrophic for another within the same job. For example, maxPartitionNum = 1 is perfect for my example of 10,000 tiny 10 KB files (reducing overhead), but if the same job also reads a large 1 TB table, that table is also forced into a single partition, causing an immediate failure. You cannot optimize the small source without breaking the large one, so you end up fine-tuning the value for every Spark job. That is the main reason I thought of automating this instead of relying on a hardcoded value.

  2. Destructive Merging of Optimized Data (Collateral Damage): This setting acts as a blunt instrument that overrides safe partition-sizing logic. If a job reads a second data source that is already well partitioned (e.g., 1,000 partitions of 150 MB), setting maxPartitionNum = 10 forces Spark to ignore those healthy boundaries and aggressively coalesce the 1,000 partitions down to 10 massive 15 GB partitions, destroying parallelism and all but guaranteeing executor OOM errors on data that was originally fine (see the sketch after this list).
    Here is a link to the code location where maxPartitionBytes could have produced 1,000 partitions of 150 MB, but if we set too low a value for maxPartitionNum, it overrides that partition count and reduces the 1,000 partitions to 10.
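To make the contrast concrete, here is an illustrative sketch; the paths and numbers are hypothetical, and optimizePartition refers to the API proposed in this PR rather than an existing Spark method.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Approach A: a session-wide cap applies to every file-based read in the same job.
spark.conf.set("spark.sql.files.maxPartitionNum", "10")
val tinyFiles = spark.read.parquet("/data/10k-tiny-files")  // squeezed toward 10 partitions: good
val bigTable  = spark.read.parquet("/data/1tb-table")       // also squeezed toward 10: ~100 GB each

// Approach B: no global cap; each source is handled individually.
spark.conf.unset("spark.sql.files.maxPartitionNum")
import DatasetPartitionOps._                                // hypothetical sketch from above
val tinyFixed = spark.read.parquet("/data/10k-tiny-files").optimizePartition()  // compacted
val bigAsIs   = spark.read.parquet("/data/1tb-table")       // keeps its healthy layout
```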

HyukjinKwon changed the title [SPARK-54838] A new feature to optimize partition size and count → [SPARK-54838][SQL] A new feature to optimize partition size and count Dec 26, 2025