feat(diagnostics): structured bridge-failure reasons by mvanhorn · Pull Request #16 · PrimeIntellect-ai/renderers

mvanhorn · 2026-05-10T22:49:58Z

bridge_to_next_turn returns list[int] | None. When it returns
None, the caller has no way to know which of the 6 README-documented
failure modes hit. The README's own empirical table (Qwen3.5-35B-A3B +
mini-swe-agent-plus) shows 32 of 64 rollouts silently dropped via
apply_chat_template. Today the only way to discover this is to notice
77 vs. 64 training samples after the run.

Demo

13 seconds: the current None return, the 7-reason enum, an example BridgeDiagnostic, and the per-turn log line it enables.

What changes

- New renderers/diagnostics.py: BridgeFailureReason StrEnum covers
  the 6 documented modes (ASSISTANT_IN_EXTENSION,
  TRUNCATION_ZEROED_ANCHOR, BPE_DRIFT, BOOL_ROUND_TRIP,
  TOOL_CALL_XML_DRIFT, THINKING_STRIPPED) plus
  UNKNOWN_TEMPLATE_CLOSE for the DefaultRenderer fall-through.
- BridgeDiagnostic dataclass carries (reason, message_index, token_span, detail), suitable for one logger.info line per turn.
- diagnose_bridge(renderer, prev_prompt_ids, prev_completion_ids, new_messages, *, tools=None) -> BridgeDiagnostic | None returns
  None when the bridge succeeds cleanly. Otherwise:
  1. Contract-level checks first: reject_assistant_in_extension →
     ASSISTANT_IN_EXTENSION, DefaultRenderer →
     UNKNOWN_TEMPLATE_CLOSE, len(prev_prompt) > tokenizer.model_max_length → TRUNCATION_ZEROED_ANCHOR.
  2. Comparison: call bridge_to_next_turn and render_ids, then locate
     the first divergent token. Single-token bool literals are classified as
     BOOL_ROUND_TRIP; everything else falls through to BPE_DRIFT (the empirical
     majority). Per-renderer hints (Qwen3 tool_call_id, GPT-OSS
     harmony channels) intentionally stay outside the protocol.
- tests/test_diagnostics.py: 9 pure-function tests driven by a small
  _StubRenderer, so the suite runs in ~30ms without HuggingFace
  downloads. Covers every branch.
- renderers/__init__.py: exports BridgeFailureReason,
  BridgeDiagnostic, diagnose_bridge.

Surface stability

- No changes to the Renderer Protocol.
- No new dependencies; uses enum.StrEnum (Python 3.11+) and
  dataclasses.dataclass.
- Enum values are strings (locked via test_enum_str_values_stable),
  so downstream log consumers and dashboards stay stable across
  refactors.

Note on placement

Diagnostics could live in verifiers instead. The failure reasons are tied to
the renderer's bridge logic (and reuse base.py helpers such as
reject_assistant_in_extension), so this PR keeps them here.
Happy to move the module if you'd rather have verifiers / prime-rl own
the surface, or to add a Bridge Diagnostics docs page alongside
the existing Bridge Contract page.

Built with Claude Code (Opus 4.7).

bridge_to_next_turn returns list[int] | None. When the bridge bails, the caller has no way to learn which of the 6 README-documented failure modes hit. Adds renderers.diagnostics.diagnose_bridge, which returns a typed BridgeDiagnostic so prime-rl and verifiers can observe bridge health per-turn during rollouts instead of discovering 32/64 silent drops after training.

- renderers/diagnostics.py (~245 LOC): BridgeFailureReason StrEnum covering all 6 documented modes plus UNKNOWN_TEMPLATE_CLOSE for the DefaultRenderer fall-through. BridgeDiagnostic dataclass carries the reason, message_index, token_span, and a short detail string suitable for a logger.info line. diagnose_bridge orchestrates: contract-level checks first (assistant in extension, default renderer, truncation_zeroed_anchor), then runs the bridge and a fresh render and classifies the first divergent token. Per-renderer hints (Qwen3 tool_call_id, GPT-OSS harmony channels) intentionally stay out of the protocol; the fall-through is BPE_DRIFT, which empirically covers the majority case.
- tests/test_diagnostics.py (~155 LOC): pure-function tests that exercise every branch with a small _StubRenderer rather than a real tokenizer, so the diagnostic suite runs in 30ms without HuggingFace model downloads.
- renderers/__init__.py: exports BridgeFailureReason, BridgeDiagnostic, and diagnose_bridge.

No changes to the Renderer protocol; no new dependencies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
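On the consumer side, one log line per turn plus a running tally is enough to surface silent fallbacks during a run rather than after it. A minimal sketch, assuming a rollout loop that calls diagnose_bridge once per turn; Diagnostic is a local stand-in for BridgeDiagnostic (field types assumed), and bridge_log_line, BridgeHealth, and the logger name are hypothetical.

```python
from __future__ import annotations

import logging
from collections import Counter
from dataclasses import dataclass

logger = logging.getLogger("rollout.bridge")  # hypothetical logger name


@dataclass(frozen=True)
class Diagnostic:
    """Stand-in for renderers.BridgeDiagnostic; field types are assumed."""
    reason: str
    message_index: int | None = None
    token_span: tuple[int, int] | None = None
    detail: str = ""


def bridge_log_line(rollout_id: int, turn: int, diag: Diagnostic | None) -> str:
    # None means the bridge succeeded cleanly.
    if diag is None:
        return f"rollout={rollout_id} turn={turn} bridge=ok"
    return (f"rollout={rollout_id} turn={turn} bridge=failed reason={diag.reason} "
            f"message_index={diag.message_index} token_span={diag.token_span} "
            f"detail={diag.detail!r}")


class BridgeHealth:
    """Hypothetical per-run tally of bridge failures by reason."""

    def __init__(self) -> None:
        self.turns = 0
        self.failures: Counter[str] = Counter()

    def record(self, rollout_id: int, turn: int, diag: Diagnostic | None) -> None:
        # In a real loop, diag would come from
        # diagnose_bridge(renderer, prev_prompt_ids, prev_completion_ids, new_messages, tools=tools).
        self.turns += 1
        if diag is not None:
            self.failures[str(diag.reason)] += 1
        logger.info(bridge_log_line(rollout_id, turn, diag))

    def failure_rate(self) -> float:
        # Fraction of recorded turns whose bridge failed.
        return sum(self.failures.values()) / self.turns if self.turns else 0.0
```

Keying the counter by the reason string means the tally lines up with whatever downstream dashboards group on.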

mvanhorn wants to merge 1 commit into PrimeIntellect-ai:main from mvanhorn:feat/diagnostics