-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML] Refactor data extraction logic and improve cleanup handling #138060
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Removed the outdated @AwaitsFix annotation from DatafeedCcsIT. - Enhanced the data extraction process in DatafeedJob by restructuring the while loop for better readability and error handling. - Added cleanup logic to ensure scroll contexts are properly destroyed after data extraction in both DatafeedJob and ChunkedDataExtractor. - Updated ScrollDataExtractor to improve error handling during scroll clearing. These changes aim to improve the robustness and maintainability of the data extraction process in the ML module.
| + " using aggregations" | ||
| ) | ||
| ); | ||
| try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrapped DataExtractor usage in a try-finally block so destroy() is always called, including on early returns due to isolation.
| } finally { | ||
| // Ensure the extractor is always destroyed to clean up scroll contexts | ||
| dataExtractor.destroy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the block I actually added
|
|
||
| private void clearScroll() { | ||
| innerClearScroll(scrollId); | ||
| clearScrollLoggingExceptions(scrollId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed clearScroll() in ScrollDataExtractor to use clearScrollLoggingExceptions() instead of innerClearScroll() directly, so failures are logged instead of propagating. If clearScroll() throws, it could hide the original ResourceNotFoundException. By catching and logging cleanup failures, the original exception can still propagate. This preserves the real error while still attempting cleanup.
|
Pinging @elastic/ml-core (Team:ML) |
| // cluster may have been created but couldn't be cleared until connectivity was restored. | ||
| // The wait gives time for the destroy() cleanup to complete. | ||
| try { | ||
| Thread.sleep(2000); // 2 seconds should be sufficient for cleanup to propagate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of waiting for 2000ms, can we instead poll whether the cleanup has finished?
This 2000ms always costs time when it's unnecessary, and if there's a connectivity issue for more than 2000ms it still fails.
jan-elastic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just one thing that can hopefully be improved
- Added a sleep period after stopping the datafeed and job to ensure scroll requests complete, particularly following network disruptions. - Increased the timeout for context checks from 5 to 30 seconds to improve reliability in waiting for expected active states.
- Removed the sleep period after stopping the datafeed and job, as the cleanup mechanism should now handle scroll requests effectively. - Updated documentation to clarify the cleanup process and its reliance on the datafeed's mechanisms. - Increased the timeout for context checks from 30 to 60 seconds to enhance reliability after network recovery.
- Added exception handling when waiting for contexts to return to baseline to ensure cleanup proceeds even if the wait fails. - Updated assertions to check for datafeed errors more clearly, enhancing the robustness of the test logic. - Improved logging to capture failures in context recovery, aiding in debugging and test reliability.
Fixes scroll context leaks in datafeed CCS tests by ensuring proper cleanup of scroll contexts.
Changes include:
DatafeedJob.run()to guarantee thatdestroy()is always invoked.clearScrollLoggingExceptions()inScrollDataExtractor.clearScroll()to handle failures smoothly.ChunkedDataExtractor.advanceTime()before instantiating a new one.@AwaitsFixannotation fromDatafeedCcsIT.Fixes #84268