fix(session): retry OpenAI/Codex transient Responses stream errors #30323

Open
spark4862 wants to merge 1 commit into anomalyco:dev from spark4862:fix/openai-codex-stream-error-retry

Conversation

spark4862 (Contributor) commented Jun 2, 2026

Issue for this PR

Closes #16214

Related: #21893, #25730, #25884

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

OpenCode sessions using the OpenAI/Codex Responses provider can stop mid-run when the upstream stream emits a transient error chunk. The chunk often has no HTTP status, so the existing 5xx retry path never runs and the failure becomes UnknownError.

Observed OpenAI/Codex stream envelopes

These payloads were collected from production OpenAI/Codex Responses streams (see #16214, #21893) and mapped to OpenAI streaming docs:

| Payload | Example nested fields | Doc basis |
| --- | --- | --- |
| Stream transport error | type: "error", nested error.type: "upstream_error", error.code: "stream_read_error" | Responses streaming error event (streaming guide, error event ref). upstream_error / stream_read_error are production variants not listed in the flat schema but emitted by the Codex/OpenAI backend. |
| Server failure | nested error.type: "server_error", error.code: "server_error" | error event + response.failed error codes (response.failed ref) |
| Overload | nested error.type: "service_unavailable_error", error.code: "server_is_overloaded" | Same streaming error surface; overload code observed on Codex gpt-5.x (#25730) |
| Rate limit / concurrency | nested error.type: "rate_limit_error", error.code: "rate_limit_error" | response.failed documents rate_limit_exceeded; concurrency-limit payloads observed in stream chunks (#21893) |

Note: OpenAI documents a flat error event (type, code, message, param, sequence_number), but production often returns a nested envelope (error: { type, code, message }). This mismatch is reported upstream (openai-dotnet#881).

Non-retryable OpenAI/Codex codes stay unchanged: context_length_exceeded, insufficient_quota, usage_not_included, invalid_prompt.
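
As a rough illustration of the split between transient and non-retryable codes, the sketch below sorts a code string into one of three buckets. The names `classifyCode`, `TRANSIENT_CODES`, and `FATAL_CODES` are hypothetical and are not taken from the diff; only the code strings themselves come from the table and list above.

```typescript
// Hypothetical helper: not the PR's actual classifier, only the code lists are from the description.
type StreamErrorClass = "retryable" | "non_retryable" | "unknown";

// Codes the description treats as transient.
const TRANSIENT_CODES = new Set<string>([
  "stream_read_error",
  "server_error",
  "server_is_overloaded",
  "rate_limit_error",
]);

// Codes the description says must stay non-retryable.
const FATAL_CODES = new Set<string>([
  "context_length_exceeded",
  "insufficient_quota",
  "usage_not_included",
  "invalid_prompt",
]);

function classifyCode(code: string | undefined): StreamErrorClass {
  if (code === undefined) return "unknown";
  // Check the non-retryable set first so it can never be shadowed by a retry rule.
  if (FATAL_CODES.has(code)) return "non_retryable";
  if (TRANSIENT_CODES.has(code)) return "retryable";
  return "unknown";
}
```

Anything outside both sets stays "unknown" in this sketch, so an unrecognized code would keep falling through to the existing handling rather than being retried by default.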

Fix approach

Native retry only runs when SessionRetry.retryable() accepts the normalized error. These stream chunks were falling through as UnknownError.

This PR adds a single classifier for OpenAI/Codex stream envelopes and applies it at three points:

  1. ProviderError.parseStreamError() — normalize raw/nested/Error.message/validation-wrapped JSON into a retryable or non-retryable result.
  2. SessionRetry.retryable() — fallback when the envelope survives only as JSON text on an UnknownError (no HTTP status).
  3. MessageV2.fromError() — parse Error.message through parseStreamError() before defaulting to UnknownError.
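
The normalization step shared by these three points could look roughly like the sketch below. It is not the code in this PR: `normalizeStreamError`, `NormalizedStreamError`, and the helpers are hypothetical names, and the validation-wrapper case assumes the JSON envelope is embedded somewhere inside a longer message string.

```typescript
// Hypothetical sketch of envelope normalization; names and wrapper format are assumptions.
interface NormalizedStreamError {
  type?: string;
  code?: string;
  message?: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function asString(value: unknown): string | undefined {
  return typeof value === "string" ? value : undefined;
}

// Accepts both the flat event ({ type, code, message }) and the nested
// envelope ({ type: "error", error: { type, code, message } }).
function fromObject(value: unknown): NormalizedStreamError | undefined {
  if (!isRecord(value)) return undefined;
  const nested = value["error"];
  const body = isRecord(nested) ? nested : value;
  const type = asString(body["type"]);
  const code = asString(body["code"]);
  if (type === undefined && code === undefined) return undefined;
  return { type, code, message: asString(body["message"]) };
}

// Tries the whole string as JSON first, then the outermost {...} span,
// which covers text that wraps the envelope in a longer message.
function fromText(text: string): NormalizedStreamError | undefined {
  const candidates = [text];
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) candidates.push(text.slice(start, end + 1));
  for (const candidate of candidates) {
    try {
      const parsed = fromObject(JSON.parse(candidate));
      if (parsed) return parsed;
    } catch {
      // Not JSON; try the next candidate.
    }
  }
  return undefined;
}

function normalizeStreamError(input: unknown): NormalizedStreamError | undefined {
  if (typeof input === "string") return fromText(input);
  if (input instanceof Error) return fromText(input.message);
  return fromObject(input);
}
```

Returning `undefined` for anything that does not look like an envelope keeps the fallback conservative: unrelated errors are left for the existing UnknownError path.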

Flow:

Responses SSE chunk (type: "error")
  -> normalize envelope (direct JSON, nested error object, or SDK validation wrapper)
  -> if non-retryable OpenAI code: context overflow / quota / invalid prompt
  -> if transient OpenAI/Codex code: APIError(isRetryable: true)
  -> SessionRetry.policy() retries the in-flight LLM call with backoff
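
The last step of the flow, retrying the in-flight call with backoff, can be sketched generically as below. This is not SessionRetry.policy(): `backoffDelay`, `retryWithBackoff`, the exponential schedule, and the option names are assumptions chosen for illustration, and the real delays and attempt limits may differ.

```typescript
// Hypothetical retry loop; schedule and limits are illustrative assumptions.
interface RetryOptions {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound on any single delay
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
// `attempt` is 1 for the delay that follows the first failed call.
function backoffDelay(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
}

async function retryWithBackoff<T>(
  run: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  options: RetryOptions,
): Promise<T> {
  let attempt = 0;
  for (;;) {
    try {
      return await run();
    } catch (error) {
      attempt += 1;
      // Give up on non-retryable errors or once the attempt budget is spent.
      if (!isRetryable(error) || attempt >= options.maxAttempts) throw error;
      const delay = backoffDelay(attempt, options.baseDelayMs, options.maxDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

In this sketch the `isRetryable` predicate is where a classifier for stream envelopes would plug in, so a transient chunk without an HTTP status reaches the retry loop instead of surfacing as a terminal error.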

Why three layers: some failures are parsed early as stream objects, others only appear as Error.message strings after SDK validation, and some reach retryable() as serialized JSON on UnknownError. All three paths must classify the same envelopes or retry is skipped.

Compared to #25728 / #25886 (overload-only, one layer) and closed #23841 (retryable() only, stale base): this covers the full OpenAI/Codex stream error surface on current dev.

How did you verify your code works?

  • Added regression tests in packages/opencode/test/session/retry.test.ts for stream_read_error, server_is_overloaded, server_error, rate_limit_error, and nested upstream codes.
  • Added regression tests in packages/opencode/test/session/message-v2.test.ts for stream chunk serialization and Error.message parsing.
  • Local smoke test: ProviderError.parseStreamError() for stream_read_error, overload, and validation-wrapped rate-limit text.
  • CI should run bun test test/session/retry.test.ts on this PR.

Screenshots / recordings

N/A — no UI changes.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Teach parseStreamError, retryable(), and fromError to classify OpenAI
Responses/Codex stream error envelopes (stream_read_error, overload,
rate_limit) as native SessionRetry candidates instead of UnknownError.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions Bot (Contributor) commented Jun 2, 2026

The following comment was made by an LLM, it may be inaccurate:

Related PRs Found

Based on the search, here are the related open/existing PRs:

  1. PR fix: retry OpenAI overload errors #25886 - fix: retry OpenAI overload errors

  2. PR fix(session): retry Codex server_is_overloaded stream errors #25728 - fix(session): retry Codex server_is_overloaded stream errors

These PRs address overlapping concerns with stream error retries for OpenAI/Codex. However, according to the PR #30323 description, it provides more comprehensive coverage (including stream_read_error, rate-limit stream chunks, validation-wrapped envelopes, and Error.message cases) compared to these earlier PRs.

spark4862 (Contributor, Author)

Updated the PR description to follow the repository template: issue link, type of change, what/why, verification, checklist. Also added OpenAI official streaming doc references and a fix-flow explanation for the OpenAI/Codex Responses stream error path.

github-actions Bot (Contributor) commented Jun 2, 2026

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions Bot removed the needs:issue and needs:compliance labels Jun 2, 2026

github-actions Bot (Contributor) commented Jun 2, 2026

Thanks for updating your PR! It now meets our contributing guidelines. 👍

Development

Successfully merging this pull request may close these issues.

Intermittent OpenAI streamed server_error (sequence_number:2) with gpt-5.3-codex; retries degrade session
