@valeriy42 valeriy42 commented Nov 13, 2025

Fixes scroll context leaks in datafeed CCS tests by ensuring proper cleanup of scroll contexts.

Changes include:

  • Wraps DataExtractor usage in DatafeedJob.run() in a try-finally block so that destroy() is always invoked.
  • Calls clearScrollLoggingExceptions() in ScrollDataExtractor.clearScroll() so that failures while clearing the scroll are logged instead of propagating.
  • Destroys the previous extractor in ChunkedDataExtractor.advanceTime() before instantiating a new one.
  • Adds a cleanup wait in the test after stopping the datafeed.
  • Removes the @AwaitsFix annotation from DatafeedCcsIT.
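
The third item above can be illustrated with a minimal sketch. This is not the actual ChunkedDataExtractor code: `ChunkExtractor`, `ChunkedReader`, and `advance()` are hypothetical names invented for illustration, and the only point shown is that the previous per-chunk extractor is destroyed before its replacement is created, so at most one is live at any time.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkAdvanceSketch {

    /** Hypothetical per-chunk extractor that records whether it was destroyed. */
    static final class ChunkExtractor {
        final long start;
        final long end;
        boolean destroyed;

        ChunkExtractor(long start, long end) {
            this.start = start;
            this.end = end;
        }

        void destroy() {
            destroyed = true; // a real extractor would release its scroll context here
        }
    }

    /** Hypothetical reader that walks [start, end) in fixed-size chunks. */
    static final class ChunkedReader {
        private final long end;
        private final long chunkSpan;
        private long currentStart;
        private ChunkExtractor current;
        final List<ChunkExtractor> created = new ArrayList<>();

        ChunkedReader(long start, long end, long chunkSpan) {
            this.currentStart = start;
            this.end = end;
            this.chunkSpan = chunkSpan;
        }

        /** Moves to the next chunk; returns false once the range is exhausted. */
        boolean advance() {
            if (current != null) {
                current.destroy(); // release the old extractor before replacing it
                current = null;
            }
            if (currentStart >= end) {
                return false;
            }
            long chunkEnd = Math.min(currentStart + chunkSpan, end);
            current = new ChunkExtractor(currentStart, chunkEnd);
            created.add(current);
            currentStart = chunkEnd;
            return true;
        }
    }

    public static void main(String[] args) {
        ChunkedReader reader = new ChunkedReader(0, 25, 10);
        while (reader.advance()) {
            // a real caller would consume the current chunk here
        }
        System.out.println("chunks created: " + reader.created.size());
    }
}
```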

Fixes #84268

- Removed the outdated @AwaitsFix annotation from DatafeedCcsIT.
- Enhanced the data extraction process in DatafeedJob by restructuring the while loop for better readability and error handling.
- Added cleanup logic to ensure scroll contexts are properly destroyed after data extraction in both DatafeedJob and ChunkedDataExtractor.
- Updated ScrollDataExtractor to improve error handling during scroll clearing.

These changes aim to improve the robustness and maintainability of the data extraction process in the ML module.
@valeriy42 valeriy42 self-assigned this Nov 13, 2025
@valeriy42 valeriy42 added the >test and :ml labels Nov 13, 2025
+ " using aggregations"
)
);
try {
Wrapped DataExtractor usage in a try-finally block so destroy() is always called, including on early returns due to isolation.
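
A minimal sketch of why the try-finally shape covers the early-return case. `FakeExtractor` and `run` are hypothetical stand-ins, not the DatafeedJob code; the `isolated` flag only mimics the early return mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

final class TryFinallyDestroySketch {

    /** Hypothetical stand-in for an extractor that holds a server-side resource. */
    static final class FakeExtractor {
        private int remainingBatches;
        boolean destroyed = false;

        FakeExtractor(int batches) {
            this.remainingBatches = batches;
        }

        boolean hasNext() {
            return remainingBatches > 0;
        }

        String next() {
            remainingBatches--;
            return "batch";
        }

        void destroy() {
            destroyed = true;
        }
    }

    /** Drains the extractor; returns early when isolated, but always destroys it. */
    static List<String> run(FakeExtractor extractor, boolean isolated) {
        List<String> out = new ArrayList<>();
        try {
            while (extractor.hasNext()) {
                if (isolated) {
                    return out; // early return: the finally block still runs
                }
                out.add(extractor.next());
            }
            return out;
        } finally {
            extractor.destroy();
        }
    }

    public static void main(String[] args) {
        FakeExtractor normal = new FakeExtractor(2);
        System.out.println(run(normal, false).size() + " " + normal.destroyed); // prints "2 true"
        FakeExtractor early = new FakeExtractor(2);
        System.out.println(run(early, true).size() + " " + early.destroyed); // prints "0 true"
    }
}
```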

Comment on lines +474 to +476
} finally {
// Ensure the extractor is always destroyed to clean up scroll contexts
dataExtractor.destroy();
This is the block I actually added


private void clearScroll() {
-    innerClearScroll(scrollId);
+    clearScrollLoggingExceptions(scrollId);
@valeriy42 valeriy42 Nov 14, 2025

Changed clearScroll() in ScrollDataExtractor to use clearScrollLoggingExceptions() instead of innerClearScroll() directly, so failures are logged instead of propagating. If clearScroll() throws, it could hide the original ResourceNotFoundException. By catching and logging cleanup failures, the original exception can still propagate. This preserves the real error while still attempting cleanup.
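
The masking problem can be shown with a small sketch under stated assumptions: `extract`, `clearLoggingExceptions`, and the in-memory `LOG` list are hypothetical, and `IllegalStateException` stands in for the original ResourceNotFoundException. If the cleanup in the finally block were allowed to throw, its exception would replace the one thrown from the try block; catching and logging it keeps the original.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupLoggingSketch {

    /** In-memory stand-in for a logger, so the behaviour is easy to inspect. */
    static final List<String> LOG = new ArrayList<>();

    /** Runs the cleanup and records any failure instead of rethrowing it. */
    static void clearLoggingExceptions(Runnable cleanup) {
        try {
            cleanup.run();
        } catch (RuntimeException e) {
            LOG.add("cleanup failed: " + e.getMessage());
        }
    }

    /** Simulates an extraction that fails and then attempts cleanup. */
    static void extract(Runnable cleanup) {
        try {
            throw new IllegalStateException("original failure");
        } finally {
            // If this call threw, its exception would replace "original failure".
            clearLoggingExceptions(cleanup);
        }
    }

    public static void main(String[] args) {
        try {
            extract(() -> {
                throw new RuntimeException("scroll already gone");
            });
        } catch (IllegalStateException e) {
            System.out.println("caller sees: " + e.getMessage());
        }
        System.out.println("logged: " + LOG);
    }
}
```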

@valeriy42 valeriy42 marked this pull request as ready for review November 14, 2025 09:05
@elasticsearchmachine elasticsearchmachine added the Team:ML label Nov 14, 2025
@elasticsearchmachine
Pinging @elastic/ml-core (Team:ML)

// cluster may have been created but couldn't be cleared until connectivity was restored.
// The wait gives time for the destroy() cleanup to complete.
try {
Thread.sleep(2000); // 2 seconds should be sufficient for cleanup to propagate
Instead of sleeping for a fixed 2000 ms, can we poll until the cleanup has finished?

The fixed sleep costs 2000 ms on every run, even when cleanup finishes sooner, and the test still fails if a connectivity issue lasts longer than 2000 ms.
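
A rough sketch of the polling approach, using only the JDK. `waitUntil` and its parameters are hypothetical names for illustration; the Elasticsearch test framework has its own helper for this kind of check (`assertBusy` in `ESTestCase`), which would likely be the natural choice in the real test.

```java
import java.util.function.BooleanSupplier;

final class PollingSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true as soon as the condition is met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] checks = {0};
        // Stand-in condition: "cleanup" is considered finished on the third check.
        boolean done = waitUntil(() -> ++checks[0] >= 3, 5_000, 10);
        System.out.println("cleanup finished: " + done + " after " + checks[0] + " checks");
    }
}
```

The caller returns as soon as the condition holds, so the happy path no longer pays a fixed delay, and the timeout can be generous without slowing down passing runs.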

@jan-elastic jan-elastic left a comment

LGTM. Just one thing that can hopefully be improved

- Added a sleep period after stopping the datafeed and job to ensure scroll requests complete, particularly following network disruptions.
- Increased the timeout for context checks from 5 to 30 seconds to improve reliability in waiting for expected active states.
- Removed the sleep period after stopping the datafeed and job, as the cleanup mechanism should now handle scroll requests effectively.
- Updated documentation to clarify the cleanup process and its reliance on the datafeed's mechanisms.
- Increased the timeout for context checks from 30 to 60 seconds to enhance reliability after network recovery.
- Added exception handling when waiting for contexts to return to baseline to ensure cleanup proceeds even if the wait fails.
- Updated assertions to check for datafeed errors more clearly, enhancing the robustness of the test logic.
- Improved logging to capture failures in context recovery, aiding in debugging and test reliability.

Labels

:ml (Machine learning), Team:ML (Meta label for the ML team), >test (Issues or PRs that are addressing/adding tests), v9.3.0

Development

Successfully merging this pull request may close these issues.

[CI] DatafeedCcsIT testDatafeedWithCcsRemoteUnavailable failing

3 participants