[ML] Refactor data extraction logic and improve cleanup handling #138060

valeriy42 · 2025-11-13T20:18:39Z

Fixes scroll context leaks in datafeed CCS tests by ensuring proper cleanup of scroll contexts.

Changes include:

Wraps DataExtractor usage in a try-finally block within DatafeedJob.run() to guarantee that destroy() is always invoked.
Calls clearScrollLoggingExceptions() in ScrollDataExtractor.clearScroll() so that cleanup failures are logged instead of propagating.
Destroys the previous extractor in ChunkedDataExtractor.advanceTime() before instantiating a new one.
Adds a cleanup wait in the test after stopping the datafeed.
Removes the @AwaitsFix annotation from DatafeedCcsIT.
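
To illustrate the ChunkedDataExtractor item in the list above, here is a minimal sketch of destroying the previous per-chunk extractor before creating the next one. ChunkSwapSketch, Chunk, advance() and openContextCount() are hypothetical names used only for illustration, not the actual ChunkedDataExtractor code, and the counter merely stands in for open scroll contexts.

```java
// Sketch only: all names here are hypothetical stand-ins, not Elasticsearch classes.
final class ChunkSwapSketch {
    // Stand-in for the number of scroll contexts currently held open.
    private int openContexts = 0;
    private Chunk current;

    /** Stand-in for a per-chunk extractor that holds one context until destroyed. */
    final class Chunk {
        private boolean destroyed;

        Chunk() {
            openContexts++;
        }

        void destroy() {
            // Idempotent, so a double destroy does not corrupt the count.
            if (!destroyed) {
                destroyed = true;
                openContexts--;
            }
        }
    }

    /** Moves to the next chunk, releasing the previous one first. */
    void advance() {
        if (current != null) {
            current.destroy();
        }
        current = new Chunk();
    }

    /** Releases whatever chunk is still open. */
    void destroy() {
        if (current != null) {
            current.destroy();
            current = null;
        }
    }

    int openContextCount() {
        return openContexts;
    }
}
```

With this ordering, at most one context is open at any time no matter how many chunks are visited. Without the destroy() call in advance(), each abandoned chunk would keep its context open.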

- Removed the outdated @AwaitsFix annotation from DatafeedCcsIT.
- Enhanced the data extraction process in DatafeedJob by restructuring the while loop for better readability and error handling.
- Added cleanup logic to ensure scroll contexts are properly destroyed after data extraction in both DatafeedJob and ChunkedDataExtractor.
- Updated ScrollDataExtractor to improve error handling during scroll clearing.

These changes aim to improve the robustness and maintainability of the data extraction process in the ML module.

valeriy42 · 2025-11-14T08:30:31Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/DatafeedJob.java

-                                + " using aggregations"
-                        )
-                    );
+        try {


Wrapped DataExtractor usage in a try-finally block so destroy() is always called, including on early returns due to isolation.
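
A minimal sketch of why try-finally covers the early-return case. FakeExtractor, run() and the isolateAfter parameter are hypothetical stand-ins; only destroy() is a name taken from this PR, and the real DatafeedJob.run() loop is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a simplified stand-in for the extraction loop, not the real DatafeedJob.
final class TryFinallyCleanupSketch {

    /** Hypothetical extractor that serves a fixed number of batches. */
    static final class FakeExtractor {
        private final int batches;
        private int served = 0;
        boolean destroyed = false;

        FakeExtractor(int batches) {
            this.batches = batches;
        }

        boolean hasNext() {
            return served < batches;
        }

        String next() {
            served++;
            return "batch-" + served;
        }

        void destroy() {
            destroyed = true;
        }
    }

    /**
     * Drains the extractor, but returns early once isolateAfter batches have
     * been read (a stand-in for the job being isolated mid-run).
     */
    static List<String> run(FakeExtractor extractor, int isolateAfter) {
        List<String> seen = new ArrayList<>();
        try {
            while (extractor.hasNext()) {
                if (seen.size() == isolateAfter) {
                    return seen; // early return: the finally block still runs
                }
                seen.add(extractor.next());
            }
            return seen;
        } finally {
            // Runs on normal completion, early return, and exceptions alike.
            extractor.destroy();
        }
    }
}
```

Calling run(new FakeExtractor(3), 1) returns after one batch and still leaves the extractor destroyed.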

valeriy42 · 2025-11-14T08:34:58Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/DatafeedJob.java

+        } finally {
+            // Ensure the extractor is always destroyed to clean up scroll contexts
+            dataExtractor.destroy();


This is the block I actually added

valeriy42 · 2025-11-14T08:37:09Z

.../src/main/java/org/elasticsearch/xpack/ml/datafeed/extractor/scroll/ScrollDataExtractor.java


    private void clearScroll() {
-        innerClearScroll(scrollId);
+        clearScrollLoggingExceptions(scrollId);


Changed clearScroll() in ScrollDataExtractor to call clearScrollLoggingExceptions() instead of calling innerClearScroll() directly, so cleanup failures are logged rather than propagated. If clearScroll() threw, it could mask the original ResourceNotFoundException. Catching and logging the cleanup failure lets the original exception propagate while cleanup is still attempted.
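
The masking effect can be shown with a small sketch. Every name below is hypothetical, the exception types are plain JDK ones rather than ResourceNotFoundException, and the catch-and-rethrow shape is an assumption about the calling code, not a copy of ScrollDataExtractor.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: demonstrates exception masking during cleanup with stand-in names.
final class CleanupLoggingSketch {

    // Stand-in for a logger so the effect is observable.
    static final List<String> LOG = new ArrayList<>();

    /** Hypothetical cleanup step that always fails. */
    static void failingCleanup() {
        throw new IllegalStateException("clear scroll failed");
    }

    /** Attempts cleanup and records any failure instead of rethrowing it. */
    static void cleanupLoggingExceptions() {
        try {
            failingCleanup();
        } catch (RuntimeException e) {
            LOG.add("cleanup failed: " + e.getMessage());
        }
    }

    /** Fails with an "original" error, then cleans up before rethrowing it. */
    static void search(boolean logCleanupFailures) {
        try {
            throw new IllegalArgumentException("original search failure");
        } catch (RuntimeException original) {
            if (logCleanupFailures) {
                cleanupLoggingExceptions(); // failure is logged, original survives
            } else {
                failingCleanup(); // failure escapes and replaces the original
            }
            throw original;
        }
    }
}
```

search(true) surfaces the IllegalArgumentException and leaves one entry in LOG. search(false) surfaces the IllegalStateException from cleanup instead, and the original failure is lost.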

elasticsearchmachine · 2025-11-14T09:05:52Z

Pinging @elastic/ml-core (Team:ML)

jan-elastic · 2025-11-14T09:17:37Z

...in/ml/src/internalClusterTest/java/org/elasticsearch/xpack/ml/integration/DatafeedCcsIT.java

+            // cluster may have been created but couldn't be cleared until connectivity was restored.
+            // The wait gives time for the destroy() cleanup to complete.
+            try {
+                Thread.sleep(2000); // 2 seconds should be sufficient for cleanup to propagate


Rather than waiting a fixed 2000 ms, can we poll until the cleanup has finished?

The fixed sleep always costs the full 2000 ms, even when cleanup finishes sooner, and the test still fails if a connectivity issue lasts longer than 2000 ms.
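
A minimal sketch of the polling alternative, assuming a condition that can be checked cheaply. awaitCondition and its parameters are hypothetical; the Elasticsearch test framework has its own assertBusy helper for this kind of wait, and this thread does not show which form the test ended up using.

```java
import java.util.function.BooleanSupplier;

// Sketch only: a generic poll-until-true helper with a deadline.
final class PollingWaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true as soon as the condition is met, false on timeout or interrupt.
     */
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true; // finished early: no time wasted
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // gave up only after the full timeout
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

Compared with Thread.sleep(2000), this returns as soon as the condition holds, so the common case is faster, and the timeout can be set generously for slow recoveries without slowing down every run.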

jan-elastic

LGTM. Just one thing that can hopefully be improved

- Added a sleep period after stopping the datafeed and job to ensure scroll requests complete, particularly following network disruptions.
- Increased the timeout for context checks from 5 to 30 seconds to improve reliability in waiting for expected active states.

- Removed the sleep period after stopping the datafeed and job, as the cleanup mechanism should now handle scroll requests effectively.
- Updated documentation to clarify the cleanup process and its reliance on the datafeed's mechanisms.
- Increased the timeout for context checks from 30 to 60 seconds to enhance reliability after network recovery.

- Added exception handling when waiting for contexts to return to baseline to ensure cleanup proceeds even if the wait fails.
- Updated assertions to check for datafeed errors more clearly, enhancing the robustness of the test logic.
- Improved logging to capture failures in context recovery, aiding in debugging and test reliability.

elasticsearchmachine added the v9.3.0 label Nov 13, 2025

valeriy42 self-assigned this Nov 13, 2025

valeriy42 added the >test and :ml labels Nov 13, 2025

valeriy42 commented Nov 14, 2025


valeriy42 marked this pull request as ready for review November 14, 2025 09:05

elasticsearchmachine added the Team:ML label Nov 14, 2025

jan-elastic reviewed Nov 14, 2025


jan-elastic approved these changes Nov 14, 2025


valeriy42 added 5 commits November 14, 2025 11:20

Add busy wait instead of waiting for context baseline.

056a804

declare lambda method throw exception

9ea0115
