fix(session): retry OpenAI/Codex transient Responses stream errors by spark4862 · Pull Request #30323 · anomalyco/opencode

spark4862 · 2026-06-02T04:30:01Z

Issue for this PR

Type of change

Bug fix
New feature
Refactor / code improvement
Documentation

What does this PR do?

OpenCode sessions using the OpenAI/Codex Responses provider can stop mid-run when the upstream stream emits a transient error chunk. The chunk often has no HTTP status, so the existing 5xx retry path never runs and the failure becomes UnknownError.

Observed OpenAI/Codex stream envelopes

These payloads were collected from production OpenAI/Codex Responses streams (see #16214, #21893) and mapped to OpenAI streaming docs:

| Payload | Example nested fields | Doc basis |
| --- | --- | --- |
| Stream transport error | `type: "error"`, nested `error.type: "upstream_error"`, `error.code: "stream_read_error"` | Responses streaming `error` event (streaming guide, error event ref). `upstream_error` / `stream_read_error` are production variants not listed in the flat schema but emitted by the Codex/OpenAI backend. |
| Server failure | nested `error.type: "server_error"`, `error.code: "server_error"` | `error` event + `response.failed` error codes (response.failed ref) |
| Overload | nested `error.type: "service_unavailable_error"`, `error.code: "server_is_overloaded"` | Same streaming error surface; overload code observed on Codex gpt-5.x (#25730) |
| Rate limit / concurrency | nested `error.type: "rate_limit_error"`, `error.code: "rate_limit_error"` | `response.failed` documents `rate_limit_exceeded`; concurrency-limit payloads observed in stream chunks (#21893) |

Note: OpenAI documents a flat error event (type, code, message, param, sequence_number), but production often returns a nested envelope (error: { type, code, message }). This mismatch is reported upstream (openai-dotnet#881).
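
To illustrate the flat/nested mismatch, here is a minimal sketch of folding both shapes into one structure. `StreamErrorInfo` and `normalizeStreamError` are hypothetical names chosen for this example and are not taken from the PR diff; only the field names come from the payloads listed above.

```typescript
// Hypothetical normalized shape; field names follow the payloads above.
type StreamErrorInfo = { type?: string; code?: string; message?: string }

// Accepts either the documented flat event or the nested production envelope.
// Returns undefined for anything that is not a stream `error` chunk.
function normalizeStreamError(chunk: unknown): StreamErrorInfo | undefined {
  if (typeof chunk !== "object" || chunk === null) return undefined
  const record = chunk as Record<string, unknown>
  if (record.type !== "error") return undefined

  // Nested form: { type: "error", error: { type, code, message } }
  // Flat form:   { type: "error", code, message, param, sequence_number }
  const nested = record.error
  const source =
    typeof nested === "object" && nested !== null
      ? (nested as Record<string, unknown>)
      : record

  const pick = (key: string): string | undefined => {
    const value = source[key]
    return typeof value === "string" ? value : undefined
  }

  // For the flat form, `type` is simply "error"; callers should key on `code`.
  return { type: pick("type"), code: pick("code"), message: pick("message") }
}
```

With this shape, a downstream classifier only needs to look at `code`, regardless of which envelope the backend sent.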

Non-retryable OpenAI/Codex codes stay unchanged: context_length_exceeded, insufficient_quota, usage_not_included, invalid_prompt.

Fix approach

Native retry only runs when SessionRetry.retryable() accepts the normalized error. These stream chunks were falling through as UnknownError.

This PR adds a single classifier for OpenAI/Codex stream envelopes and applies it at three points:

- ProviderError.parseStreamError(): normalize raw/nested/Error.message/validation-wrapped JSON into a retryable or non-retryable result.
- SessionRetry.retryable(): fallback when the envelope survives only as JSON text on an UnknownError (no HTTP status).
- MessageV2.fromError(): parse Error.message through parseStreamError() before defaulting to UnknownError.

Flow:

Responses SSE chunk (type: "error")
  -> normalize envelope (direct JSON, nested error object, or SDK validation wrapper)
  -> if non-retryable OpenAI code: context overflow / quota / invalid prompt
  -> if transient OpenAI/Codex code: APIError(isRetryable: true)
  -> SessionRetry.policy() retries the in-flight LLM call with backoff
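
The flow above can be sketched roughly as follows. This is not the PR's implementation: `classifyCode`, `StreamError`, and `withRetry` are hypothetical stand-ins for the classifier and `SessionRetry.policy()`, and the attempt count and delays are assumptions. Only the error codes are taken from this description.

```typescript
// Codes taken from the PR description; the sets and helper names are illustrative.
const TRANSIENT_CODES = new Set(["stream_read_error", "server_error", "server_is_overloaded", "rate_limit_error"])
const FATAL_CODES = new Set(["context_length_exceeded", "insufficient_quota", "usage_not_included", "invalid_prompt"])

type Classification = "retryable" | "non-retryable" | "unknown"

// Non-retryable codes are checked first so they are never retried.
function classifyCode(code: string | undefined): Classification {
  if (code === undefined) return "unknown"
  if (FATAL_CODES.has(code)) return "non-retryable"
  if (TRANSIENT_CODES.has(code)) return "retryable"
  return "unknown"
}

// Hypothetical error type carrying the normalized stream error code.
class StreamError extends Error {
  code: string | undefined
  constructor(code: string | undefined) {
    super(code ?? "unknown stream error")
    this.code = code
  }
}

// Retries `call` with exponential backoff while the thrown error classifies
// as retryable; anything else is rethrown immediately.
async function withRetry<T>(call: () => Promise<T>, maxAttempts = 3, baseDelayMs = 10): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call()
    } catch (err) {
      const code = err instanceof StreamError ? err.code : undefined
      const shouldRetry = classifyCode(code) === "retryable" && attempt < maxAttempts
      if (!shouldRetry) throw err
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)))
    }
  }
}
```

Unrecognized codes are deliberately left as `"unknown"` here rather than retried, which mirrors the description's point that only envelopes accepted by the classifier become retry candidates.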

Why three layers: some failures are parsed early as stream objects, others only appear as Error.message strings after SDK validation, and some reach retryable() as serialized JSON on UnknownError. All three paths must classify the same envelopes or retry is skipped.
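
As a rough illustration of the second and third paths, the sketch below recovers an error code from an envelope that survives only as text inside `Error.message`. The helper names and the wrapper text in the usage note are hypothetical; the real SDK validation message format is not shown in this PR description.

```typescript
// Tries to pull a JSON object out of free-form error text, for example an
// SDK validation message that embeds the raw chunk. Purely illustrative:
// it takes the span from the first "{" to the last "}" and gives up if
// that span is not valid JSON.
function extractJsonObject(text: string): Record<string, unknown> | undefined {
  const start = text.indexOf("{")
  const end = text.lastIndexOf("}")
  if (start === -1 || end <= start) return undefined
  try {
    const parsed: unknown = JSON.parse(text.slice(start, end + 1))
    return typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : undefined
  } catch {
    return undefined
  }
}

// Reads `code` from either the nested envelope or the flat event.
function codeFromErrorMessage(message: string): string | undefined {
  const obj = extractJsonObject(message)
  if (!obj) return undefined
  const inner =
    typeof obj.error === "object" && obj.error !== null
      ? (obj.error as Record<string, unknown>)
      : obj
  return typeof inner.code === "string" ? inner.code : undefined
}
```

For example, `codeFromErrorMessage('Validation failed: {"type":"error","error":{"code":"rate_limit_error"}}')` yields `"rate_limit_error"`, while a plain message such as `"socket hang up"` yields `undefined` and would still fall through to UnknownError.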

Compared to #25728 / #25886 (overload-only, one layer) and closed #23841 (retryable() only, stale base): this covers the full OpenAI/Codex stream error surface on current dev.

How did you verify your code works?

- Added regression tests in packages/opencode/test/session/retry.test.ts for stream_read_error, server_is_overloaded, server_error, rate_limit_error, and nested upstream codes.
- Added regression tests in packages/opencode/test/session/message-v2.test.ts for stream chunk serialization and Error.message parsing.
- Local smoke test: ProviderError.parseStreamError() for stream_read_error, overload, and validation-wrapped rate-limit text.
- CI should run `bun test test/session/retry.test.ts` on this PR.

Screenshots / recordings

N/A — no UI changes.

Checklist

I have tested my changes locally
I have not included unrelated changes in this PR

Teach parseStreamError, retryable(), and fromError to classify OpenAI Responses/Codex stream error envelopes (stream_read_error, overload, rate_limit) as native SessionRetry candidates instead of UnknownError.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-06-02T04:30:53Z

The following comment was made by an LLM, it may be inaccurate:

Related PRs Found

Based on the search, here are the related open/existing PRs:

fix: retry OpenAI overload errors #25886
- Directly related: focuses on server_is_overloaded errors, which is one of the error types covered in PR #30323

fix(session): retry Codex server_is_overloaded stream errors #25728
- Directly related: addresses server_is_overloaded stream errors, a subset of what PR #30323 handles

These PRs address overlapping concerns with stream error retries for OpenAI/Codex. However, according to the PR #30323 description, it provides more comprehensive coverage (including stream_read_error, rate-limit stream chunks, validation-wrapped envelopes, and Error.message cases) compared to these earlier PRs.

spark4862 · 2026-06-02T04:39:57Z

Updated the PR description to follow the repository template: issue link, type of change, what/why, verification, checklist. Also added OpenAI official streaming doc references and a fix-flow explanation for the OpenAI/Codex Responses stream error path.

github-actions · 2026-06-02T04:40:25Z

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

Open an issue describing the bug/feature (if one doesn't exist)
Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions · 2026-06-02T04:40:40Z

Thanks for updating your PR! It now meets our contributing guidelines. 👍

github-actions Bot added the contributor and needs:compliance labels Jun 2, 2026

github-actions Bot added the needs:issue label Jun 2, 2026

github-actions Bot removed the needs:issue and needs:compliance labels Jun 2, 2026

spark4862 wants to merge 1 commit into anomalyco:dev from spark4862:fix/openai-codex-stream-error-retry
