[Monitor OpenTelemetry exporter] Fix retry amplification storm #47002
Open
hectorhdzg wants to merge 1 commit into
Pull request overview
This PR mitigates retry amplification in the Azure Monitor OpenTelemetry exporter by preventing _transmit_from_storage() from aggressively draining local offline-storage blobs after sustained throttling, reducing the likelihood of immediately re-triggering 429s on recovery.
Changes:
- Add a per-invocation cap (`_MAX_STORAGE_DRAIN_BATCH = 10`) on how many stored blobs are processed in `_transmit_from_storage()`.
- Stop draining immediately when a retryable failure occurs while draining, to avoid creating a burst of follow-up requests.
- Add unit tests validating early termination on retryable failure and enforcement of the drain cap.
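
The two behaviors listed above can be sketched in isolation as follows. This is a minimal illustration, not the exporter's actual code: the in-memory `deque`, the `SendResult` enum, the `send` callable, and the `drain_storage` function are all hypothetical stand-ins, and only the cap value of 10 comes from the PR.

```python
from collections import deque
from enum import Enum

# Mirrors the cap introduced by the PR (_MAX_STORAGE_DRAIN_BATCH = 10).
MAX_STORAGE_DRAIN_BATCH = 10


class SendResult(Enum):
    """Hypothetical stand-in for the exporter's transmit result."""
    SUCCESS = 0
    RETRYABLE = 1
    NOT_RETRYABLE = 2


def drain_storage(blobs, send, max_batch=MAX_STORAGE_DRAIN_BATCH):
    """Send at most max_batch stored blobs, stopping on a retryable failure.

    blobs is a deque standing in for offline storage; send is a callable
    that returns a SendResult. Returns the number of send attempts made.
    """
    attempts = 0
    while blobs and attempts < max_batch:
        result = send(blobs[0])
        attempts += 1
        if result is SendResult.RETRYABLE:
            # The service is still under pressure: keep the blob for a
            # later cycle and stop draining right away.
            break
        # Success or non-retryable failure: the blob is finished either way.
        blobs.popleft()
    return attempts


# Healthy service: only 10 of 25 stored blobs are sent in one invocation.
backlog = deque(range(25))
sent = drain_storage(backlog, lambda blob: SendResult.SUCCESS)

# Throttled service: the third send is retryable, so draining stops there.
throttled = deque(range(25))
results = iter([SendResult.SUCCESS, SendResult.SUCCESS, SendResult.RETRYABLE])
tried = drain_storage(throttled, lambda blob: next(results))
```

In the healthy case 15 blobs remain for later export cycles. In the throttled case two blobs are removed and the third stays in storage, so no further requests are issued in that invocation.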
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/export/_base.py` | Caps offline-storage draining per cycle and stops draining on retryable failures to prevent request bursts after throttling. |
| `sdk/monitor/azure-monitor-opentelemetry-exporter/tests/test_base_exporter.py` | Adds tests for early-stop behavior on retryable failures and for enforcing the per-invocation drain cap. |
During sustained 429 throttling, failed telemetry accumulates as blob files in local storage. On recovery, `_transmit_from_storage()` drained all blobs in a tight loop, creating a burst of requests that could immediately re-trigger throttling.

Changes:
- Cap storage drain to 10 blobs per invocation (`_MAX_STORAGE_DRAIN_BATCH`) to spread retry load across export cycles
- Stop draining immediately when a retryable failure occurs, since the service is still under pressure
- Add tests for both drain cap and early termination behaviors
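
To make the "spread retry load across export cycles" effect concrete, here is a small back-of-the-envelope sketch. The `drain_schedule` helper is hypothetical and assumes every send succeeds and that no new blobs arrive while the backlog drains; the cap of 10 is the value from this PR.

```python
def drain_schedule(backlog, max_batch=10):
    """Return how many stored blobs are sent in each successive invocation.

    Assumes every send succeeds and the backlog does not grow meanwhile.
    """
    schedule = []
    while backlog > 0:
        sent = min(backlog, max_batch)
        schedule.append(sent)
        backlog -= sent
    return schedule


# A backlog of 35 blobs is drained over four invocations instead of one burst.
per_cycle = drain_schedule(35)  # [10, 10, 10, 5]
```

Without the cap, the same backlog would produce 35 back-to-back requests in a single invocation.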
4a70184 to 52d7553