
feat(responses): stream-teed credit reconciliation for native passthrough#428

Open
HaruHunab1320 wants to merge 1 commit into fix/responses-native-passthrough from fix/responses-passthrough-stream-reconciliation

Conversation

@HaruHunab1320
Contributor

Summary

Follow-up to #427, addressing the "settlement fires before stream consumed" finding from that PR's review. Stacked on top of fix/responses-native-passthrough: merge #427 first, then this rebases onto dev cleanly.

#427 settled the credit reservation to the reserved estimate on every request regardless of actual usage. A Codex turn that emitted 200 output tokens was charged at the same rate as one that emitted 4000, as long as both fit under the reserved upper bound. This PR fixes that by extracting real usage from the SSE stream and reconciling on stream close.

What changed

New: packages/lib/utils/responses-stream-reconcile.ts

Pure helper that wraps a ReadableStream<Uint8Array> so bytes flow to the client unchanged while the wrapper scans for the terminal response.completed SSE event and extracts response.usage. Exactly one terminal callback per stream lifecycle (end | cancel | error), regardless of how the stream ends.

const trackedBody = wrapWithUsageExtraction(upstream.body, (usage, reason) => {
  if (usage) reconcile(calculateCost(model, provider, usage.inputTokens, usage.outputTokens));
  else       reconcile(reservedAmount);
});

Design constraints (all enforced by tests):

  • Byte-exact passthrough — bytes flow in the exact order and size the upstream produced them. No batching, rewriting, or buffering beyond a single SSE frame for parser bookkeeping.
  • Parse errors are swallowed — a malformed frame must never break the forward path.
  • Chunk-boundary handling — SSE frames split mid-JSON across multiple reads are reassembled correctly.
  • Exactly-one terminal callback — fires on end, cancel, or error. A buggy callback that throws does not break the stream.
  • Pull-based — upstream is only drained when the client reads, matching direct-proxy semantics.
  • [DONE] sentinel ignored.
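The chunk-boundary constraint is the subtle one: a data: line can be split anywhere, so the parser must buffer partial frames across reads. A minimal sketch of that bookkeeping, separate from the byte-forwarding path (class name and event shape are illustrative, not the PR's actual code):

```typescript
// Incremental scanner: feed decoded SSE text in arbitrary slices; it
// buffers at most one incomplete frame and remembers usage from the
// terminal `response.completed` event. Parse errors are swallowed.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

class SseUsageScanner {
  private buffer = "";
  usage: Usage | null = null;

  push(chunk: string): void {
    this.buffer += chunk;
    let idx: number;
    // An SSE frame ends at a blank line ("\n\n").
    while ((idx = this.buffer.indexOf("\n\n")) !== -1) {
      const frame = this.buffer.slice(0, idx);
      this.buffer = this.buffer.slice(idx + 2);
      for (const line of frame.split("\n")) {
        if (!line.startsWith("data:")) continue;
        const payload = line.slice(5).trim();
        if (payload === "[DONE]") continue; // sentinel, not JSON
        try {
          const event = JSON.parse(payload);
          if (event.type === "response.completed" && event.response?.usage) {
            this.usage = {
              inputTokens: event.response.usage.input_tokens ?? 0,
              outputTokens: event.response.usage.output_tokens ?? 0,
            };
          }
        } catch {
          // Malformed JSON must never break the forward path.
        }
      }
    }
  }
}
```

A frame split mid-JSON across two push() calls still parses, because nothing is interpreted until the closing blank line arrives.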

app/api/v1/responses/route.ts — wired into the passthrough

Replaces the previous fire-and-forget background settle(reservedAmount) with:

const reconciledBody = wrapWithUsageExtraction(upstreamResponse.body, (usage, reason) => {
  logger.debug(...);
  void runReconciliation(usage);
});

return new Response(reconciledBody ?? upstreamResponse.body, { ... });

runReconciliation():

  1. If usage is null (no response.completed seen) → settle to reserved (same as pre-stream-wrap behavior)
  2. If usage is present → calculateCost(model, provider, inputTokens, outputTokens) then settle to min(computed, reserved)
  3. If calculateCost throws → fall back to settling at reserved
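The three cases above reduce to a small pure function. A sketch (names are illustrative; costFn stands in for the real calculateCost(model, provider, inputTokens, outputTokens)):

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Decide what to settle the reservation to, given what the stream told us.
function settleAmount(
  usage: Usage | null,
  reserved: number,
  costFn: (u: Usage) => number,
): number {
  if (usage === null) return reserved;        // no response.completed seen
  try {
    return Math.min(costFn(usage), reserved); // cap at the reservation
  } catch {
    return reserved;                          // cost calc failed: fall back
  }
}
```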

Trade-offs (documented in inline comments)

  • Cap at reserved, not over-collect. If the model somehow runs hotter than the reservation covered, we can't retroactively collect more from this path — the reserved amount is the upper bound. Anything beyond would need a separate post-hoc ledger entry which this PR doesn't implement.
  • Cancel-before-completed charges the upper bound. A client abort (Codex CLI Ctrl-C, tab close) before response.completed arrives settles to reserved. That's the same behavior as before this PR for the cancel case — the 50% safety buffer in estimateRequestCost is still our protection there.
  • No impact when upstream.body === null. We fall back to an immediate synchronous settle-to-reserved so we don't strand the reservation.

Tests

14 unit tests — packages/tests/unit/utils/responses-stream-reconcile.test.ts

Passthrough fidelity

  • Forwards bytes unchanged across multiple chunks
  • Handles SSE frames that span chunk boundaries
  • Handles the [DONE] sentinel without crashing
  • Malformed JSON in a data: line does not break the forward path

Usage extraction

  • Extracts input_tokens / output_tokens from response.completed
  • Extracts cached_tokens and reasoning_tokens when present
  • Omits cached/reasoning fields when zero or missing
  • Returns null when no response.completed event arrives
  • Defaults missing numeric fields to 0 (zero-usage event is still a reconciliation)

Termination

  • Fires onComplete exactly once on normal close
  • Fires with 'cancel' when reader is cancelled mid-stream
  • Still extracts usage when cancel fires AFTER response.completed
  • Fires with 'error' when the upstream throws
  • A throwing onComplete callback does not break the stream
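The exactly-once-and-never-throws contract the last two bullets exercise can be isolated into a tiny guard (illustrative sketch, not the PR's actual code):

```typescript
type Reason = "end" | "cancel" | "error";

// Wrap a terminal callback so it fires at most once per stream lifecycle
// and its own exceptions are swallowed: a buggy callback must not break
// the byte-forwarding path.
function onceSafe(cb: (reason: Reason) => void): (reason: Reason) => void {
  let fired = false;
  return (reason) => {
    if (fired) return;
    fired = true;
    try {
      cb(reason);
    } catch {
      // swallow: the stream keeps flowing regardless
    }
  };
}
```

The wrapper then calls the guarded callback from all three exits (close, cancel, error) without worrying about double-fires.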

4 new integration tests — packages/tests/unit/api/v1-responses-route.test.ts

  • Reconciles to actual cost when response.completed reports usage
  • Caps actual cost at reserved when model over-runs the estimate
  • Falls back to reserved when no response.completed arrives
  • Existing mockReconcileCredits assertion tests updated to drain the wrapped body (pull-based wrapper gates the callback on reader progress)

Test totals: 48 passing across the two affected suites (14 unit + 34 route, was 31).

Test plan

  • bun test packages/tests/unit/utils/responses-stream-reconcile.test.ts — 14/14
  • bun test packages/tests/unit/api/v1-responses-route.test.ts — 34/34
  • bunx tsc --noEmit clean on all changed files (only pre-existing bs58 errors)
  • bun run lint clean (biome auto-fix applied)
  • Merge #427 (fix(responses): native Responses-API passthrough for gpt-5.x + Codex CLI) first (this branch is stacked on it)
  • Deploy to dev and test end-to-end with Codex CLI via milady's PARALLAX_LLM_PROVIDER=cloud flow, confirm small turns get reconciled down from the estimate and large turns stay capped at reserved

Follow-ups (not in this PR)

  • Post-hoc over-run ledger — if the model uses more than reserved, we silently cap. A proper fix would emit a separate ledger entry for the delta, visible in the user's billing dashboard. Out of scope here.
  • Per-model usage details — we extract cached_tokens and reasoning_tokens from the usage object but don't yet use them in calculateCost (which is input/output only). When cost calculation grows to support cached/reasoning tiers, the plumbing is already in place.

🤖 Generated with Claude Code

@vercel

vercel bot commented Apr 7, 2026

The latest updates on your projects:

Project: eliza-cloud-v2 | Deployment: Ready | Updated (UTC): Apr 8, 2026 3:21am

@coderabbitai

coderabbitai bot commented Apr 7, 2026

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@HaruHunab1320 HaruHunab1320 force-pushed the fix/responses-passthrough-stream-reconciliation branch from 7e4ad31 to 8631087 Compare April 7, 2026 22:01
HaruHunab1320 added a commit that referenced this pull request Apr 8, 2026
Addresses the Claude bot's third review on #427. All 5 substantive
findings fixed plus the 4 test gaps the reviewer flagged.

1. Body size guard bypassable via chunked encoding (medium): the
   Content-Length header check is now an early fast path only. The
   real enforcement is a post-read length check on the buffered
   request text. Clients using `Transfer-Encoding: chunked` (which
   omits Content-Length) or lying about Content-Length will be
   caught by the post-read check before we touch req.json(). The
   header check stays for the cheap-fast-path benefit.

2. Background settle in serverless (medium): for non-streaming
   responses (content-type != "text/event-stream") we now `await`
   the settle synchronously before returning. The body is fully
   materialized at that point so deferring serves no purpose, and
   on Vercel the function can be frozen once the Response is
   returned, leaving a background promise to never complete and
   stranding the reservation. Streaming responses still
   fire-and-forget — that path is the one PR #428 addresses
   properly via stream-teed reconciliation.

3. Upstream error object forwarded verbatim (low): added
   `sanitizeGatewayError()` which extracts only the well-known
   OpenAI-compatible fields (message, type, code, optional param)
   from an unknown gateway error envelope. Stack traces,
   infrastructure host names, and arbitrary nested objects are
   stripped. Whitelist + length cap means even a hostile gateway
   payload can't leak internals through this path.

4. `isNativeResponsesPayload` non-string instructions (low): now
   triggers passthrough on ANY presence of `instructions`, not
   just string. A malformed `instructions: 42` payload routes
   through the passthrough so the upstream returns a coherent
   validation error instead of falling through to Chat Completions
   which would choke on the unexpected field.

5. Double `providerResponses` null guard (nit): captured the
   narrowed local at the early-bail check site so the second
   guard before the forward call is gone. Removed both the redundant
   if-block AND the unreachable 500 fallback that was load-bearing
   only for the type system.
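A whitelist-plus-cap sanitizer along the lines of finding 3 (sketch; the field list comes from the commit message, the length cap is an assumption):

```typescript
const MAX_FIELD_LEN = 256; // illustrative cap, not the repo's actual value

// Keep only the well-known OpenAI-compatible error fields, each coerced
// to a length-capped string. Stack traces, host names, and arbitrary
// nested objects are dropped because they are never copied over.
function sanitizeGatewayError(raw: unknown): Record<string, string> {
  const out: Record<string, string> = {};
  if (typeof raw !== "object" || raw === null) return out;
  const src = raw as Record<string, unknown>;
  for (const key of ["message", "type", "code", "param"] as const) {
    const v = src[key];
    if (typeof v === "string") out[key] = v.slice(0, MAX_FIELD_LEN);
  }
  return out;
}
```

Because the copy is allowlist-driven, a hostile gateway payload cannot leak internals through this path no matter what extra fields it carries.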

New tests (6, bringing total to 42):
- 413 with no Content-Length (chunked encoding bypass test)
- 400 invalid_json on malformed JSON body
- Synchronous settle for non-streaming passthrough (no flush needed)
- Non-streaming JSON passthrough body forwarded unchanged
- Sanitized gateway error envelope (hostile fields stripped)
- Non-string `instructions: 42` routes to passthrough

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat(responses): stream-teed credit reconciliation for native passthrough

Builds on #427. The native passthrough previously settled the credit
reservation to the reserved estimate on every request regardless of
actual usage — so a Codex turn that emitted 200 output tokens was
charged at the same rate as one that emitted 4000, as long as both
fit under the reserved upper bound.

This PR wraps the upstream ReadableStream with a pass-through reader
that also extracts `response.usage` from the terminal `response.completed`
SSE event. When the stream ends, we compute the real cost via
`calculateCost(model, provider, inputTokens, outputTokens)` and
reconcile the reservation down to actual (capped at the reservation
as an upper bound — over-runs would need a separate post-hoc ledger
entry which is out of scope).

- Zero behavioral impact on the client: bytes flow in the exact order
  and size the upstream produced them. We do not batch, rewrite, or
  buffer beyond a single SSE frame for parser bookkeeping.
- SSE events are parsed out-of-band on the side. Parse errors are
  swallowed — a malformed frame must never break the forward path.
- Exactly one terminal callback per stream lifecycle: `end`, `cancel`,
  or `error`. Client cancel (Codex CLI Ctrl-C, tab close) fires the
  callback with whatever usage we had seen before the cancel, so a
  turn that completed-then-cancelled still reconciles to actual.
- Pull-based ReadableStream so the upstream is only drained when the
  client reads, matching the semantics of a direct proxy.
- Cost is clamped at the reservation: `actualCost = min(computed, reserved)`.

- If the client aborts mid-stream before `response.completed` arrives,
  we settle to the reserved estimate (same as pre-stream-wrap
  behavior). The 50% safety buffer in `estimateRequestCost` stays as
  the upper bound for that case.
- If `calculateCost` throws, we fall back to settling at the reserved
  amount rather than crashing the reconciliation.
- We cap at reserved rather than allowing over-collection. Anything
  beyond the reservation would need a separate post-hoc charge which
  this PR doesn't implement.

14 unit tests for `wrapWithUsageExtraction` covering:
- Passthrough fidelity (byte-exact, chunk-split frames, [DONE] sentinel,
  malformed JSON recovery)
- Usage extraction (headline + cached + reasoning tokens, missing
  fields default to 0, null when no completed event)
- Termination paths (end, cancel before completed, cancel after
  completed, error, throwing callback swallowed)

4 new integration tests in the route suite:
- Reconciles to actual cost when response.completed reports usage
- Caps actual cost at reserved when model over-runs the estimate
- Falls back to reserved when no response.completed arrives
- Existing reconcile/api_key_id tests updated to actually drain the
  stream (pull-based wrapper gates the callback on reader progress)

Total: 48 passing tests across the two affected suites (14 + 34).

This branch is stacked on `fix/responses-native-passthrough`. Merge #427 first.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>