feat(parsing): per-attempt ParsedToolCall with status + token spans #22
Open
hallerite wants to merge 2 commits into
Conversation
Replaces `ParsedResponse.tool_calls` (`list[dict] | None`) with a typed `list[ParsedToolCall]` where every attempted tool-call block is recorded — successful and malformed alike — with a `ToolCallParseStatus` enum (OK / INVALID_JSON / UNCLOSED_BLOCK / MISSING_NAME / MALFORMED_STRUCTURE), raw block text, and a `token_span` pointing back into the completion stream.

Why this diverges from vLLM/SGLang on purpose: vLLM's `ExtractedToolCallInformation` collapses parse outcomes into a single `tools_called: bool` — partial success in parallel tool calls cannot be represented, and downstream callers cannot tell whether a malformed attempt happened (`hermes_tool_parser.py:138` dumps the entire raw output into `content` on any exception). SGLang silently drops failed calls in `base_format_detector.py` with a `logger.warning`. Both choices are fine for inference serving, but they erase exactly the signal verifier / RL-loss code needs:

- verifier rubrics that score "did the model follow the tool schema?" can't inspect what failed
- TITO-aware loss masking can't selectively zero out malformed-call tokens without span information
- parallel-call rollouts can't tell "all 3 calls succeeded" from "2 of 3 succeeded; the second was broken"

This commit keeps the parser-vs-schema separation (parser does JSON→dict, schema validation is the tool's job), but stops discarding the parse-time failure receipt. Empty `tool_calls` now means "model did not emit any tool calls"; a list with non-OK entries means "the parser caught a failure" — these are different states the old list-or-None shape collapsed.

Behavioral changes:

- Qwen3 parser no longer falls through to "preserve raw `<tool_call>` block as content" on JSON decode failure. The malformed signal now lives on the `ParsedToolCall` entry; downstream `EmptyModelResponseError` prevention is satisfied by `tool_calls` being non-empty (with `status=INVALID_JSON`) rather than by lying about the response shape.
- `client.generate()` only promotes finish_reason `"stop"` → `"tool_calls"` when there is at least one OK call; malformed-only responses keep finish reason `"stop"` but still surface the attempt in `result["tool_calls"]`.
- Default `tool_calls` is `[]` instead of `None` across the board.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The K2.5 inline parser was the lone holdout — it walked decoded text and left `token_span=None` on every `ParsedToolCall` it produced. This switches the special-token (plural `<|tool_calls_section_*|>`) path to walk token ids via `parse_kimi_k2_section`, so K2.5 entries now carry spans like every other parser. The regex-on-text fallback stays for the singular `<|tool_call_section_*|>` variant — confirmed via tokenizer probe that form is NOT in K2.5's special-token vocab (it returns `unk_id`), so it can only ever appear as literal text the model emitted; no spans there (text→token offset mapping is lossy across BPE).

Factoring:

- Extracted `parse_kimi_k2_section()` out of `parse_kimi_k2()` so both K2 and K2.5 share the section-walking + per-call status logic. `parse_kimi_k2` is now a thin wrapper that adds stop-token stripping and text-level `<think>` extraction on top.
- The helper takes *sets* of begin/end token IDs (singletons for K2 today, but the surface is ready if a future K2.x ships two variants both as special tokens).

`KimiK25Renderer.parse_response` now also adjusts spans back to the caller's `token_ids` frame after `_normalize_response_tokens` (which may prepend the synthetic `<think>` prefill when the sampler stripped it). Without this shift, spans would be off by `len(<think>)` tokens whenever prefill recovery fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
willccbb
approved these changes
May 12, 2026
Summary
Replaces `ParsedResponse.tool_calls` (`list[dict] | None`) with a typed `list[ParsedToolCall]` where every attempted tool-call block is recorded — successful and malformed alike — carrying a `ToolCallParseStatus` enum, raw block text, and a `token_span` pointing back into the completion stream.

Why this diverges from vLLM / SGLang on purpose
Both engines erase exactly the signal verifier and RL-loss code needs:

- vLLM (`ExtractedToolCallInformation`) collapses outcomes into a single `tools_called: bool`. `hermes_tool_parser.py:138` dumps the entire raw output into `content` on any exception — partial success in parallel tool calls cannot be represented (one bad JSON kills the whole list).
- SGLang (`StreamingParseResult`) silently drops failed calls in `base_format_detector.py:84-89` with a `logger.warning`.

These choices are fine for inference serving. They aren't fine when the consumer is a training loop:

- verifier rubrics that score "did the model follow the tool schema?" can't inspect what failed
- TITO-aware loss masking can't selectively zero out malformed-call tokens without span information
- parallel-call rollouts can't tell "all 3 calls succeeded" from "2 of 3 succeeded; the second was broken"

We keep the parser-vs-schema separation Will called out (parser does JSON→dict, schema validation is the tool's job), but stop discarding the parse-time failure receipt. The renderer is consciously a superset of vLLM's contract — additive richness for use cases the inference engines don't serve.
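To make the training-loop use case concrete: with per-call status and spans, a loss can zero out only the tokens inside malformed attempts. A sketch under stated assumptions — dict-shaped entries with a `"status"` string and a `(start, end)` half-open `"token_span"`, which mirrors but is not literally the PR's `ParsedToolCall` type:

```python
def malformed_call_mask(tool_calls: list[dict], seq_len: int) -> list[int]:
    """Return a 0/1 mask over completion tokens, 1 = inside a non-OK tool call.

    A TITO-style objective could multiply per-token losses by (1 - mask)
    to avoid reinforcing malformed blocks while keeping the rest of the
    completion in the loss. Spanless entries can't be masked and are
    skipped — exactly why token_span matters.
    """
    mask = [0] * seq_len
    for tc in tool_calls:
        if tc["status"] != "ok" and tc.get("token_span") is not None:
            start, end = tc["token_span"]
            for i in range(start, min(end, seq_len)):
                mask[i] = 1
    return mask
```

With the old `list[dict] | None` shape, none of this was expressible: a failed parse left no entry, no status, and no span.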
Behavioral changes
- `tool_calls` is always a list, never `None`. Empty list = "model emitted no tool calls"; a list with non-OK entries = "parser caught a failure." These were collapsed under the old shape.
- Qwen3 parser no longer falls through to "preserve raw `<tool_call>` block as content" on JSON decode failure. The malformed signal lives on `ParsedToolCall` (`status=INVALID_JSON`) instead. The original `EmptyModelResponseError` prevention contract is still satisfied: `tool_calls` is non-empty for a malformed attempt, so downstream consumers that gate on "did anything come back" still see a non-empty response — without lying about its shape.
- `client.generate()` only promotes `finish_reason "stop"` → `"tool_calls"` when at least one OK call is present. Malformed-only responses keep `"stop"` but still surface the attempt in `result["tool_calls"]` for verifier inspection.

Coverage
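The promotion rule from the last bullet is small enough to sketch in full. Dict-shaped entries with a `"status"` string are an assumption standing in for the real `ParsedToolCall`; `promote_finish_reason` is an illustrative name, not the function in the diff:

```python
def promote_finish_reason(finish_reason: str, tool_calls: list[dict]) -> str:
    """Promote "stop" to "tool_calls" only when at least one call parsed OK.

    Malformed-only responses keep "stop" so callers don't dispatch tools
    that were never validly requested; the attempts stay visible in
    tool_calls either way. Other finish reasons (e.g. "length") pass
    through unchanged.
    """
    has_ok_call = any(tc["status"] == "ok" for tc in tool_calls)
    if finish_reason == "stop" and has_ok_call:
        return "tool_calls"
    return finish_reason
```

This is what keeps "the model tried and failed to call a tool" distinguishable from "the model called a tool" at the client boundary.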
Every parser updated to emit per-block status + spans:
- `parsing.py`: `parse_qwen3`, `parse_qwen35`, `parse_glm`, `parse_deepseek_v3`, `parse_minimax`, `parse_kimi_k2`, `parse_gpt_oss`
- `parsers.py` (plugin parsers for `DefaultRenderer`): `Qwen3ToolParser`, `Qwen35ToolParser`, `GlmToolParser`, `DeepSeekV3ToolParser`
- `kimi_k25.py` (inline text-based parser; `token_span=None` here — documented limitation: that branch walks decoded text, not token ids)

Downstream impact
Verifiers (`renderer_client.py`) and the prime-rl orchestrator both consume `parsed.tool_calls` — those callers will need a small adapter (`tc.function.name` → `tc.name`, etc.) and a decision on what to do with non-OK entries. That's intentional: the whole point is to surface the signal. The PR description in those repos should call out the migration.

Test plan
- `pytest` — 983 passed, 49 skipped (suite-wide)
- `test_parsers.py`, `test_parse_response.py`, `test_parse_response_robustness.py`, `test_client.py`, `test_roundtrip.py`
- `test_qwen3_tool_parser_records_invalid_json` — malformed JSON surfaces as `INVALID_JSON`, not silently dropped
- `test_qwen3_tool_parser_parallel_partial_success` — `[OK, INVALID_JSON, OK]` ordering preserved across parallel calls
- `test_qwen3_vl_malformed_tool_call_surfaces_as_invalid_json` — replaces the old "fall back to content" assertion
- `test_generate_does_not_promote_finish_reason_for_malformed_tool_calls` — `client.generate()` keeps `"stop"` for malformed-only responses
- `ruff check` clean
- `ruff format --check` clean

🤖 Generated with Claude Code