fix(run): cap retry backoff at max_delay on OverflowError#3226

Draft
adityasingh2400 wants to merge 2 commits into
openai:mainfrom
adityasingh2400:fix/retry-backoff-overflow

@adityasingh2400
Contributor

Summary

_default_retry_delay in src/agents/run_internal/model_retry.py computes initial_delay * (multiplier ** max(attempt - 1, 0)) and then caps it with min(..., max_delay). The intermediate exponential can overflow Python's float range before the cap is applied, raising OverflowError even though the final value would be immediately clamped.

This happens with realistic-ish misconfigurations:

  • multiplier=100.0, attempt=200 (e.g. an aggressive retry policy on a long-running operation)
  • Default multiplier=2.0, attempt >= 1075 (a stuck retry loop)

The OverflowError escapes the helper and crashes the retry loop instead of just returning max_delay.
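The failure can be reproduced without the SDK. The following is a minimal sketch using only built-in floats; `naive_delay` is a hypothetical stand-in for the pre-fix formula quoted above, not the helper itself:

```python
def naive_delay(attempt: int, initial_delay: float, multiplier: float, max_delay: float) -> float:
    # Hypothetical stand-in for the pre-fix formula: the power is fully
    # evaluated before min() ever sees it.
    return min(initial_delay * (multiplier ** max(attempt - 1, 0)), max_delay)


# Below the overflow point the cap works as intended.
print(naive_delay(100, 1.0, 100.0, 5.0))  # 5.0

# 100.0 ** 199 exceeds the largest finite float, so the power itself raises.
try:
    naive_delay(200, 1.0, 100.0, 5.0)
except OverflowError as exc:
    print("OverflowError:", exc)
```

With a multiplier of 2.0 the power first overflows at an exponent of 1024, so the attempt count listed above is well inside the failing range.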

The fix catches OverflowError and treats it as saturation, so the helper behaves the same way at the overflow point as it does just before it. Two unit tests cover the two scenarios above.
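A rough sketch of the saturating behavior, under stated assumptions: `BackoffSettings` and `capped_delay` are hypothetical names, the field names mirror the keyword arguments used in this PR, only the 2.0 multiplier default comes from the description, and jitter is ignored.

```python
from dataclasses import dataclass


@dataclass
class BackoffSettings:
    # Hypothetical stand-in for ModelRetryBackoffSettings. Defaults other
    # than multiplier=2.0 are illustrative only.
    initial_delay: float = 1.0
    max_delay: float = 5.0
    multiplier: float = 2.0
    jitter: bool = False  # not applied in this sketch


def capped_delay(attempt: int, settings: BackoffSettings) -> float:
    exponent = max(attempt - 1, 0)
    try:
        scaled = settings.initial_delay * (settings.multiplier ** exponent)
    except OverflowError:
        # The power left the float range, so the capped result is max_delay.
        scaled = settings.max_delay
    return min(scaled, settings.max_delay)


print(capped_delay(200, BackoffSettings(multiplier=100.0)))  # 5.0
print(capped_delay(2, BackoffSettings()))  # 2.0
```

Catching the exception keeps the normal path unchanged; only inputs that previously crashed take the new branch.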

Test plan

  • pytest tests/models/test_model_retry.py - all 54 existing + 2 new tests pass
  • ruff check and ruff format --check clean
  • Manual repro: _default_retry_delay(200, ModelRetryBackoffSettings(initial_delay=1.0, max_delay=5.0, multiplier=100.0, jitter=False)) raised OverflowError before, returns 5.0 after

When _default_retry_delay computes initial_delay * (multiplier ** exponent)
with a large attempt count or multiplier (e.g. multiplier=100, attempt>=200,
or default multiplier=2.0 with attempt>=1075), the intermediate exponential
overflows Python's float range and raises OverflowError - even though the
final value would be immediately capped at max_delay. The exception then
escapes the retry helper and crashes the retry loop.

Catch OverflowError and short-circuit to max_delay so the backoff helper
behaves the same way at the saturation point as it does just before it.
@github-actions github-actions Bot added the bug (Something isn't working) and feature:core labels May 8, 2026
@seratch
Member

seratch commented May 8, 2026

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9c515d1ae

Comment thread src/agents/run_internal/model_retry.py Outdated
Comment on lines +292 to +295
    try:
        scaled = initial_delay * (multiplier**exponent)
    except OverflowError:
        scaled = max_delay
[P2] Preserve zero-delay backoff when exponent overflows

The overflow handler always substitutes max_delay, but initial_delay=0.0 is a valid way to request immediate retries. Once multiplier ** exponent overflows, _default_retry_delay(200, ModelRetryBackoffSettings(initial_delay=0, max_delay=5, multiplier=100, jitter=False)) returns 5.0 even though every non-overflowing attempt returns 0.0, unexpectedly stalling retries.

@seratch seratch marked this pull request as draft May 8, 2026 17:35
initial_delay=0 is a valid way to request immediate retries; the
OverflowError fallback was substituting max_delay even though every
non-overflowing attempt returns 0. Short-circuit when initial_delay
is 0 so retries stay immediate at any attempt count.
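One way the short-circuit could look, as a sketch only: the commit message does not show where the guard sits, so the early return below is an assumption, and `zero_safe_delay` is a hypothetical name.

```python
def zero_safe_delay(attempt: int, initial_delay: float, multiplier: float, max_delay: float) -> float:
    # Assumed guard placement: a zero base delay stays zero at every attempt,
    # so return before the power can overflow.
    if initial_delay == 0:
        return 0.0
    exponent = max(attempt - 1, 0)
    try:
        scaled = initial_delay * (multiplier**exponent)
    except OverflowError:
        # Non-zero base delay with an out-of-range power saturates at the cap.
        scaled = max_delay
    return min(scaled, max_delay)


print(zero_safe_delay(200, 0.0, 100.0, 5.0))  # 0.0
print(zero_safe_delay(200, 1.0, 100.0, 5.0))  # 5.0
```

Without the guard, the first call would take the OverflowError branch and return 5.0, which is the stall described in the review comment.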
@seratch seratch removed the bug (Something isn't working) label May 12, 2026