
feat(parsing): per-attempt ParsedToolCall with status + token spans#22

Open
hallerite wants to merge 2 commits into main from feat/parsed-tool-call-status

Conversation

@hallerite
Member

Summary

Replaces ParsedResponse.tool_calls (list[dict] | None) with a typed list[ParsedToolCall] where every attempted tool-call block is recorded — successful and malformed alike — carrying a ToolCallParseStatus enum, raw block text, and a token_span pointing back into the completion stream.

class ToolCallParseStatus(str, Enum):
    OK = "ok"
    INVALID_JSON = "invalid_json"
    UNCLOSED_BLOCK = "unclosed_block"
    MISSING_NAME = "missing_name"
    MALFORMED_STRUCTURE = "malformed_structure"

@dataclass
class ParsedToolCall:
    raw: str
    name: str | None = None
    arguments: dict[str, Any] | str | None = None
    token_span: tuple[int, int] | None = None
    status: ToolCallParseStatus = ToolCallParseStatus.OK
    id: str | None = None

Why this diverges from vLLM / SGLang on purpose

Both engines erase exactly the signal that verifier and RL-loss code needs:

  • vLLM (ExtractedToolCallInformation) collapses outcomes into a single tools_called: bool. hermes_tool_parser.py:138 dumps the entire raw output into content on any exception — partial success in parallel tool calls cannot be represented (one bad JSON kills the whole list).
  • SGLang (StreamingParseResult) silently drops failed calls in base_format_detector.py:84-89 with a logger.warning.

These choices are fine for inference serving. They aren't fine when the consumer is a training loop:

  • Verifier rubrics scoring "did the model follow the tool schema?" can't inspect what failed.
  • TITO-aware loss masking can't selectively zero out malformed-call tokens without span information.
  • Parallel-call rollouts can't distinguish "3/3 succeeded" from "2/3 succeeded, second was broken."

We keep the parser-vs-schema separation Will called out (parser does JSON→dict, schema validation is the tool's job), but stop discarding the parse-time failure receipt. The renderer is consciously a superset of vLLM's contract — additive richness for use cases the inference engines don't serve.

Behavioral changes

  • tool_calls is always a list, never None. Empty list = "model emitted no tool calls"; a list with non-OK entries = "parser caught a failure." These were collapsed under the old shape.
  • Qwen3 parser no longer falls through "preserve raw <tool_call> block as content" on JSON decode failure. The malformed signal lives on ParsedToolCall(status=INVALID_JSON) instead. The original EmptyModelResponseError prevention contract is still satisfied: tool_calls is non-empty for a malformed attempt, so downstream consumers that gate on "did anything come back" still see a non-empty response — without lying about its shape.
  • client.generate() only promotes finish_reason "stop"→"tool_calls" when at least one OK call is present. Malformed-only responses keep "stop" but still surface the attempt in result["tool_calls"] for verifier inspection.

Coverage

Every parser updated to emit per-block status + spans:

  • parsing.py: parse_qwen3, parse_qwen35, parse_glm, parse_deepseek_v3, parse_minimax, parse_kimi_k2, parse_gpt_oss
  • parsers.py (plugin parsers for DefaultRenderer): Qwen3ToolParser, Qwen35ToolParser, GlmToolParser, DeepSeekV3ToolParser
  • kimi_k25.py (inline text-based parser; token_span=None here — documented limitation, that branch walks decoded text, not token ids)

Downstream impact

Verifiers (renderer_client.py) and prime-rl orchestrator both consume parsed.tool_calls — those callers will need a small adapter (tc.function.name → tc.name, etc.) and a decision on what to do with non-OK entries. That's intentional: the whole point is to surface the signal. The PR description in those repos should call out the migration.

Test plan

  • pytest — 983 passed, 49 skipped (suite-wide)
  • Targeted: test_parsers.py, test_parse_response.py, test_parse_response_robustness.py, test_client.py, test_roundtrip.py
  • New cases added:
    • test_qwen3_tool_parser_records_invalid_json — malformed JSON surfaces as INVALID_JSON, not silently dropped
    • test_qwen3_tool_parser_parallel_partial_success — [OK, INVALID_JSON, OK] ordering preserved across parallel calls
    • test_qwen3_vl_malformed_tool_call_surfaces_as_invalid_json — replaces the old "fall back to content" assertion
    • test_generate_does_not_promote_finish_reason_for_malformed_tool_calls — client.generate() keeps "stop" for malformed-only responses
  • ruff check clean
  • ruff format --check clean

🤖 Generated with Claude Code

hallerite and others added 2 commits May 12, 2026 18:25
Replaces ParsedResponse.tool_calls (list[dict] | None) with a typed
list[ParsedToolCall] where every attempted tool-call block is recorded —
successful and malformed alike — with a ToolCallParseStatus enum
(OK / INVALID_JSON / UNCLOSED_BLOCK / MISSING_NAME / MALFORMED_STRUCTURE),
raw block text, and a token_span pointing back into the completion stream.

Why this diverges from vLLM/SGLang on purpose:

vLLM's ExtractedToolCallInformation collapses parse outcomes into a single
tools_called: bool — partial success in parallel tool calls cannot be
represented, and downstream callers cannot tell whether a malformed
attempt happened (hermes_tool_parser.py:138 dumps the entire raw output
into content on any exception). SGLang silently drops failed calls in
base_format_detector.py with a logger.warning. Both choices are fine for
inference serving, but they erase exactly the signal verifier / RL-loss
code needs:

- verifier rubrics that score "did the model follow the tool schema?"
  can't inspect what failed
- TITO-aware loss masking can't selectively zero out malformed-call
  tokens without span information
- parallel-call rollouts can't tell "all 3 calls succeeded" from "2 of 3
  succeeded; the second was broken"

This commit keeps the parser-vs-schema separation (parser does JSON→dict,
schema validation is the tool's job), but stops discarding the parse-time
failure receipt. Empty tool_calls now means "model did not emit any tool
calls"; a list with non-OK entries means "the parser caught a failure" —
these are different states the old list-or-None shape collapsed.

Behavioral changes:

- Qwen3 parser no longer falls through "preserve raw <tool_call> block
  as content" on JSON decode failure. The malformed signal now lives on
  the ParsedToolCall entry; downstream EmptyModelResponseError prevention
  is satisfied by tool_calls being non-empty (with status=INVALID_JSON)
  rather than by lying about the response shape.
- client.generate() only promotes finish_reason "stop"→"tool_calls" when
  there is at least one OK call; malformed-only responses keep finish
  reason "stop" but still surface the attempt in result["tool_calls"].
- Default tool_calls is [] instead of None across the board.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The K2.5 inline parser was the lone holdout — it walked decoded text and
left token_span=None on every ParsedToolCall it produced. Switches the
special-token (plural ``<|tool_calls_section_*|>``) path to walk token
ids via parse_kimi_k2_section, so K2.5 entries now carry spans like
every other parser. The regex-on-text fallback stays for the singular
``<|tool_call_section_*|>`` variant — confirmed via tokenizer probe that
form is NOT in K2.5's special-token vocab (returns unk_id), so it can
only ever appear as literal text the model emitted; no spans there
(text→token offset mapping is lossy across BPE).

Factoring:
- Extracted parse_kimi_k2_section() out of parse_kimi_k2() so both K2
  and K2.5 share the section-walking + per-call status logic. parse_kimi_k2
  is now a thin wrapper that adds stop-token stripping and text-level
  <think> extraction on top.
- Helper takes *sets* of begin/end token IDs (singletons for K2 today,
  but the surface is ready if a future K2.x ships two variants both as
  special tokens).

KimiK25Renderer.parse_response now also adjusts spans back to the
caller's token_ids frame after _normalize_response_tokens (which may
prepend the synthetic <think> prefill when the sampler stripped it).
Without this shift, spans would be off by len(<think>) tokens whenever
prefill recovery fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>