fix(monitor): prevent sample loss during disruption sampler shutdown #30771
base: main
Conversation
Add graceful shutdown logic to ensure all samples are processed before the disruption sampler terminates. Previously, samples could be lost when the context was cancelled, as the consumer would exit immediately without draining the remaining queue.

Changes:
- Add 30-second timeout when waiting for consumer to finish
- Implement contextCancelled flag to track cancellation state
- Continue processing remaining samples after context cancellation
- Remove early returns that would skip sample processing
- Ensure current sample completes even if context is cancelled

This prevents the "not finished writing all samples" error and ensures data integrity during shutdown.
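For illustration, here is a minimal Go sketch of the drain-on-cancel pattern the description outlines. All names in it (`Sample`, `processSample`, `consume`, the `sampleCh`/`done` channels) are hypothetical stand-ins, not the actual sampler code from this PR:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Sample and processSample are hypothetical stand-ins for the
// disruption sampler's real types.
type Sample struct{ ID int }

func processSample(s Sample) { fmt.Println("wrote sample", s.ID) }

// consume drains sampleCh completely, even after ctx is cancelled,
// instead of returning early and dropping queued samples.
func consume(ctx context.Context, sampleCh <-chan Sample, done chan<- struct{}) {
	defer close(done)
	contextCancelled := false
	for s := range sampleCh { // exits only when the producer closes the channel
		if !contextCancelled {
			select {
			case <-ctx.Done():
				// Note the cancellation, but keep processing what is
				// already queued rather than returning immediately.
				contextCancelled = true
			default:
			}
		}
		processSample(s)
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	sampleCh := make(chan Sample, 16)
	done := make(chan struct{})
	go consume(ctx, sampleCh, done)

	for i := 0; i < 5; i++ {
		sampleCh <- Sample{ID: i}
	}
	cancel()        // shutdown begins while samples may still be queued
	close(sampleCh) // producer stops; consumer drains the rest

	// Give the consumer a bounded window to finish, mirroring the
	// 30-second timeout described above.
	select {
	case <-done:
	case <-time.After(30 * time.Second):
		fmt.Println("consumer did not finish draining in time")
	}
}
```

The key point of the pattern is that the consumer loop exits on channel close rather than on `ctx.Done()`, so cancellation alone can never drop queued samples; the final `select` bounds shutdown at the 30-second window.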
Tested the solution against my dev cluster:
/assign @xueqzhan
Please.
/test ci/prow/okd-scos-images
/test okd-scos-images
@mgencur: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, xueqzhan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Seems there is a known issue for the okd-scos-images job.
The error was spotted in multiple runs for Hypershift on AWS, for example job run 2020875118166675456.
Root cause analysis:
The problematic sequence: