
feat(parsing): per-attempt ParsedToolCall with status + token spans#22

Open
hallerite wants to merge 2 commits into main from feat/parsed-tool-call-status

Conversation

@hallerite
Member

Summary

Replaces ParsedResponse.tool_calls (list[dict] | None) with a typed list[ParsedToolCall] where every attempted tool-call block is recorded — successful and malformed alike — carrying a ToolCallParseStatus enum, raw block text, and a token_span pointing back into the completion stream.

class ToolCallParseStatus(str, Enum):
    OK = "ok"
    INVALID_JSON = "invalid_json"
    UNCLOSED_BLOCK = "unclosed_block"
    MISSING_NAME = "missing_name"
    MALFORMED_STRUCTURE = "malformed_structure"

@dataclass
class ParsedToolCall:
    raw: str
    name: str | None = None
    arguments: dict[str, Any] | str | None = None
    token_span: tuple[int, int] | None = None
    status: ToolCallParseStatus = ToolCallParseStatus.OK
    id: str | None = None

Why this diverges from vLLM / SGLang on purpose

Both engines erase exactly the signal that verifier and RL-loss code needs:

  • vLLM (ExtractedToolCallInformation) collapses outcomes into a single tools_called: bool. hermes_tool_parser.py:138 dumps the entire raw output into content on any exception — partial success in parallel tool calls cannot be represented (one bad JSON kills the whole list).
  • SGLang (StreamingParseResult) silently drops failed calls in base_format_detector.py:84-89 with a logger.warning.

These choices are fine for inference serving. They aren't fine when the consumer is a training loop:

  • Verifier rubrics scoring "did the model follow the tool schema?" can't inspect what failed.
  • TITO-aware loss masking can't selectively zero out malformed-call tokens without span information.
  • Parallel-call rollouts can't distinguish "3/3 succeeded" from "2/3 succeeded, second was broken."

We keep the parser-vs-schema separation Will called out (parser does JSON→dict, schema validation is the tool's job), but stop discarding the parse-time failure receipt. The renderer is consciously a superset of vLLM's contract — additive richness for use cases the inference engines don't serve.

Behavioral changes

  • tool_calls is always a list, never None. Empty list = "model emitted no tool calls"; a list with non-OK entries = "parser caught a failure." These were collapsed under the old shape.
  • Qwen3 parser no longer falls through "preserve raw <tool_call> block as content" on JSON decode failure. The malformed signal lives on ParsedToolCall(status=INVALID_JSON) instead. The original EmptyModelResponseError prevention contract is still satisfied: tool_calls is non-empty for a malformed attempt, so downstream consumers that gate on "did anything come back" still see a non-empty response — without lying about its shape.
  • client.generate() only promotes finish_reason "stop"→"tool_calls" when at least one OK call is present. Malformed-only responses keep "stop" but still surface the attempt in result["tool_calls"] for verifier inspection.

Coverage

Every parser updated to emit per-block status + spans:

  • parsing.py: parse_qwen3, parse_qwen35, parse_glm, parse_deepseek_v3, parse_minimax, parse_kimi_k2, parse_gpt_oss
  • parsers.py (plugin parsers for DefaultRenderer): Qwen3ToolParser, Qwen35ToolParser, GlmToolParser, DeepSeekV3ToolParser
  • kimi_k25.py (inline text-based parser; token_span=None here — documented limitation, that branch walks decoded text, not token ids)

Downstream impact

Verifiers (renderer_client.py) and prime-rl orchestrator both consume parsed.tool_calls — those callers will need a small adapter (tc.function.name → tc.name, etc.) and a decision on what to do with non-OK entries. That's intentional: the whole point is to surface the signal. The PR description in those repos should call out the migration.

Test plan

  • pytest — 983 passed, 49 skipped (suite-wide)
  • Targeted: test_parsers.py, test_parse_response.py, test_parse_response_robustness.py, test_client.py, test_roundtrip.py
  • New cases added:
    • test_qwen3_tool_parser_records_invalid_json — malformed JSON surfaces as INVALID_JSON, not silently dropped
    • test_qwen3_tool_parser_parallel_partial_success — [OK, INVALID_JSON, OK] ordering preserved across parallel calls
    • test_qwen3_vl_malformed_tool_call_surfaces_as_invalid_json — replaces the old "fall back to content" assertion
    • test_generate_does_not_promote_finish_reason_for_malformed_tool_calls — client.generate() keeps "stop" for malformed-only responses
  • ruff check clean
  • ruff format --check clean

🤖 Generated with Claude Code

hallerite and others added 2 commits May 12, 2026 18:25
Replaces ParsedResponse.tool_calls (list[dict] | None) with a typed
list[ParsedToolCall] where every attempted tool-call block is recorded —
successful and malformed alike — with a ToolCallParseStatus enum
(OK / INVALID_JSON / UNCLOSED_BLOCK / MISSING_NAME / MALFORMED_STRUCTURE),
raw block text, and a token_span pointing back into the completion stream.

Why this diverges from vLLM/SGLang on purpose:

vLLM's ExtractedToolCallInformation collapses parse outcomes into a single
tools_called: bool — partial success in parallel tool calls cannot be
represented, and downstream callers cannot tell whether a malformed
attempt happened (hermes_tool_parser.py:138 dumps the entire raw output
into content on any exception). SGLang silently drops failed calls in
base_format_detector.py with a logger.warning. Both choices are fine for
inference serving, but they erase exactly the signal verifier / RL-loss
code needs:

- verifier rubrics that score "did the model follow the tool schema?"
  can't inspect what failed
- TITO-aware loss masking can't selectively zero out malformed-call
  tokens without span information
- parallel-call rollouts can't tell "all 3 calls succeeded" from "2 of 3
  succeeded; the second was broken"

This commit keeps the parser-vs-schema separation (parser does JSON→dict,
schema validation is the tool's job), but stops discarding the parse-time
failure receipt. Empty tool_calls now means "model did not emit any tool
calls"; a list with non-OK entries means "the parser caught a failure" —
these are different states the old list-or-None shape collapsed.

Behavioral changes:

- Qwen3 parser no longer falls through "preserve raw <tool_call> block
  as content" on JSON decode failure. The malformed signal now lives on
  the ParsedToolCall entry; downstream EmptyModelResponseError prevention
  is satisfied by tool_calls being non-empty (with status=INVALID_JSON)
  rather than by lying about the response shape.
- client.generate() only promotes finish_reason "stop"→"tool_calls" when
  there is at least one OK call; malformed-only responses keep finish
  reason "stop" but still surface the attempt in result["tool_calls"].
- Default tool_calls is [] instead of None across the board.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The K2.5 inline parser was the lone holdout — it walked decoded text and
left token_span=None on every ParsedToolCall it produced. Switches the
special-token (plural ``<|tool_calls_section_*|>``) path to walk token
ids via parse_kimi_k2_section, so K2.5 entries now carry spans like
every other parser. The regex-on-text fallback stays for the singular
``<|tool_call_section_*|>`` variant — confirmed via tokenizer probe that
form is NOT in K2.5's special-token vocab (returns unk_id), so it can
only ever appear as literal text the model emitted; no spans there
(text→token offset mapping is lossy across BPE).

Factoring:
- Extracted parse_kimi_k2_section() out of parse_kimi_k2() so both K2
  and K2.5 share the section-walking + per-call status logic. parse_kimi_k2
  is now a thin wrapper that adds stop-token stripping and text-level
  <think> extraction on top.
- Helper takes *sets* of begin/end token IDs (singletons for K2 today,
  but the surface is ready if a future K2.x ships two variants both as
  special tokens).

KimiK25Renderer.parse_response now also adjusts spans back to the
caller's token_ids frame after _normalize_response_tokens (which may
prepend the synthetic <think> prefill when the sampler stripped it).
Without this shift, spans would be off by len(<think>) tokens whenever
prefill recovery fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>