[SPARK-57521][ML][CONNECT] Exclude parent from Model.estimatedSize to fix overcounting in ML cache by mkincaid · Pull Request #56584 · apache/spark

mkincaid · 2026-06-18T04:09:05Z

What changes were proposed in this pull request?

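This patch temporarily sets the model's parent to null while SizeEstimator runs and restores it afterward, so the estimate excludes everything reachable only through parent.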

Why are the changes needed?

Currently, SizeEstimator includes the size of the SparkSession because it traverses the model's parent, which eventually refers to the session (this is the case for many estimators that use DataFrame operations when fitting, such as StringIndexer). The session exists regardless, and its size isn't attributable to fitting this specific model; counting it also double-counts the session when more than one model is fit. It therefore shouldn't be included in the size estimate.

The impact of the bug is largest when the SparkSession is large. For example, in Databricks, my testing shows that a 300-800M SparkSession is typical. In some configurations, like Databricks serverless, the size limit for a single model object might be 256M, so this bug causes such models to fail to train regardless of what else is in the cache.

The Jira ticket includes a simple script that reproduces the condition locally, though the session is much smaller in that case (maybe 300k).

Does this PR introduce any user-facing change?

Yes, a favorable one, in that the model cache would fill less quickly (and the reported sizes of cached models would be smaller, if they are among the affected models).

How was this patch tested?

A test is added: training a StringIndexer should estimate at no larger than 50k, in the trivial test case with 3 strings. This test fails before the patch and passes after it. Another similar test is provided for MinMaxScaler. A ModelSuite is added to hold these since the bug is at the Model level, not that of individual models (so the StringIndexer and MinMaxScaler suites aren't really the right place for these tests, although they are examples).

Was this patch authored or co-authored using generative AI tooling?

Yes, the bug was discovered and initial patch/tests were created by pair programming with Claude. I wrote the bug/docs myself and validated the approach and final patch.

Generated-by: Claude Opus 4.6

mkincaid · 2026-06-18T04:40:26Z

Fixed the Actions configuration on my fork. Closing and reopening to trigger the checks to rerun.

uros-b · 2026-06-19T18:43:48Z

+    // shared SparkSession state as part of every model's size.
+    // The parent is @transient (not persisted) and is not needed for transform() or save().
+    val savedParent = parent
+    parent = null


Please investigate and address a possible thread-safety regression here. There is a side-effecting mutation of shared state in a base-class method. Two concurrent estimatedSize calls on the same model can interleave so both save then both restore, with the second finally clobbering parent to null permanently; a concurrent reader (hasParent, transform, save) can also observe parent == null during the window.

In the current path this is masked because estimatedSize is invoked inside MLCache.register (which is synchronized) on a freshly-fit, not-yet-shared model, so it is not an active production bug today, but estimatedSize is private[spark] and the previous implementation was side-effect free, so the new contract is strictly weaker.

Please consider a non-mutating approach rather than mutating shared instance state.
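
To make the interleaving described above concrete, here is a minimal sketch that replays one bad schedule by hand instead of relying on real threads. `Parent`, `ToyModel`, `saveAndClear`, and `restore` are hypothetical names for illustration only; they are not the classes or methods in this patch.

```scala
// Hypothetical stand-ins for illustration; not the real Spark classes.
final class Parent(val name: String)

final class ToyModel(var parent: Parent) {
  // First half of the save/null/restore pattern: remember parent, then clear it.
  def saveAndClear(): Parent = {
    val saved = parent
    parent = null
    saved
  }

  // Second half (the `finally` part): put back whatever this caller saved.
  def restore(saved: Parent): Unit = {
    parent = saved
  }
}

object InterleavingDemo {
  // Replays, step by step, one schedule of two concurrent callers A and B.
  def run(): ToyModel = {
    val model = new ToyModel(new Parent("estimator"))
    val savedA = model.saveAndClear() // A saves the real parent and sets null
    val savedB = model.saveAndClear() // B saves null, because A already cleared it
    model.restore(savedA)             // A restores the real parent
    model.restore(savedB)             // B "restores" null, so the parent is lost
    model
  }

  def main(args: Array[String]): Unit = {
    val model = run()
    println(s"parent lost: ${model.parent == null}") // prints "parent lost: true"
  }
}
```

Between A's `saveAndClear` and A's `restore`, any reader of `parent` would also see null, which is the second hazard mentioned in the comment.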

mkincaid · 2026-06-22 (edited)

Hi @uros-b, thanks for the quick review and input. I pushed a change that adds synchronized so that two concurrent estimatedSize calls can no longer interleave here. However, I'm realizing this doesn't address the second part of your comment (a concurrent reader from elsewhere would still observe parent == null).

As I looked into this further, the truly non-mutating approaches I came up with were:

Create a copy of the Model with empty parent, then size that. But this depends on the implementation of the copy method which is model-specific (so not sure if it can be relied on to faithfully copy everything we care about sizing).

Make the Model object Cloneable, then clone(), clear parent, and size. But changing an interface of Model itself seems less conservative and beyond the scope I was intending for this original fix.

Serialize and deserialize the object before estimating its size. Since the parent is @transient, it would be gone in the deserialized copy. This seems conceptually appealing (in principle, the data the model keeps and serializes is the state we care about sizing), but I'm not sure whether it would be expensive for large models and, as with the copy option, whether something relevant might fail to survive the round trip.

Target the fix elsewhere, e.g., perhaps SizeEstimator itself should skip walking through SparkSession objects (the same way as there are existing exclusions there for ClassLoader and scala.reflect). This also seems less conservative since other users of SizeEstimator might not want the behavior to change.
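
For the serialize/deserialize option in the list above, a rough sketch of the idea using plain Java serialization follows. `Heavy`, `SketchModel`, and `roundTrip` are hypothetical names, and this toy says nothing about the cost for large models or about how real Spark models behave when serialized this way; it only shows that a @transient field is absent from the copy while the original object is left untouched.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for the session-reaching parent; deliberately not Serializable.
class Heavy(val payload: Array[Byte])

// Hypothetical stand-in for a model: `labels` is real state, `parent` is transient.
class SketchModel(val labels: Array[String]) extends Serializable {
  @transient var parent: Heavy = null
}

object RoundTripDemo {
  // Serializes `value` to an in-memory buffer and reads back a fresh copy.
  // Transient fields are skipped on write, so they are null in the copy.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val model = new SketchModel(Array("a", "b", "c"))
    model.parent = new Heavy(new Array[Byte](1024))

    val copy = roundTrip(model)
    println(s"copy has parent: ${copy.parent != null}")         // false
    println(s"original has parent: ${model.parent != null}")    // true
    println(s"labels preserved: ${copy.labels.mkString(",")}")  // a,b,c
  }
}
```

In this sketch a size estimate would be taken on `copy` rather than on `model`, so no shared state is mutated; whether every field worth sizing survives the round trip in real models is exactly the open question raised above.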

Or I may be missing something easier/cleaner. It is probably pretty obvious that I'm new to this code base, so I want to be thoughtful about design and get more input before proceeding. Appreciate your patience with me and looking forward to your thoughts :)

uros-b

Thank you @mkincaid. Left a few comments, passing on to @zhengruifeng and @WeichenXu123 (ML/Connect experts) for further review.


mkincaid closed this Jun 18, 2026

mkincaid reopened this Jun 18, 2026

uros-b reviewed Jun 19, 2026


Comment thread mllib/src/test/scala/org/apache/spark/ml/ModelSuite.scala Outdated

uros-b reviewed Jun 19, 2026


uros-b requested a review from zhengruifeng June 19, 2026 18:46

[SPARK-57521][ML][CONNECT] Exclude parent from Model.estimatedSize to fix overcounting in ML cache

22471c8

mkincaid force-pushed the fix/ml-cache-size-estimator-parent branch from 2f6bb9e to 22471c8 Compare June 22, 2026 18:11

Fix non-ascii character in comment

a093706

mkincaid wants to merge 2 commits into apache:master from mkincaid:fix/ml-cache-size-estimator-parent


