Skip to content

Agentic Inference updates#314

Open
hvagadia wants to merge 14 commits into
mlcommons:mainfrom
hvagadia:hvagadia/cache-salt-and-correctness
Open

Agentic Inference updates#314
hvagadia wants to merge 14 commits into
mlcommons:mainfrom
hvagadia:hvagadia/cache-salt-and-correctness

Conversation

@hvagadia
Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@hvagadia hvagadia requested a review from a team May 18, 2026 19:41
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@hvagadia hvagadia marked this pull request as draft May 18, 2026 19:41
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive support for multi-turn conversation benchmarking, including a new MultiTurnDataset class, a MultiTurnStrategy for turn sequencing, and a ConversationManager for state tracking. It also provides extensive documentation, validation schemas, and utility scripts for dataset conversion and analysis. Feedback focuses on optimizing performance and memory efficiency within the MultiTurnDataset class, specifically recommending the use of vectorized pandas operations to avoid memory-intensive dictionary conversions and suggesting an optimization to reduce the algorithmic complexity of building message histories from quadratic to linear.

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py
hvagadia added 8 commits May 18, 2026 13:05
Several OpenAI-compatible servers (notably SGLang and vLLM with their
reasoning/tool parsers) drift from the strict OpenAI spec in ways the
existing decoders silently swallowed:

* SSEDelta.reasoning was the wrong field name. SGLang/vLLM emit
  `reasoning_content` on streaming deltas (matching the non-streaming
  ChatCompletionResponseMessage). The previous code never matched, so
  every reasoning chunk on a thinking-mode response was dropped on the
  floor with no error.

* SSEDelta fields default to None (servers send `null`, not `""`).
  Previous defaults of `""` looked truthy when the server actually had
  no payload for that channel in a given chunk.

* ChatCompletionResponseMessage / ChatCompletionChoice /
  ChatCompletionResponse: many fields (`content`, `refusal`,
  `finish_reason`, `usage`, `system_fingerprint`) are not always
  emitted. Without defaults, msgspec rejected non-streaming responses
  from these servers entirely.

* decode_sse_message now catches msgspec.ValidationError, logs the raw
  chunk preview, and returns None so the stream keeps draining. The
  prior behaviour bubbled into the worker's outer Exception handler
  and lost the diagnostic.

TextModelOutput now carries `tool_calls` and `finish_reason` as
first-class fields (in addition to the existing metadata copy) so
multi-turn replay can build history without reaching into metadata.
A diagnostic `chunk_stats` dict counts content / reasoning / tool-call
chunks per response for replay-determinism debugging.

`array_like=False` on TextModelOutput is required to safely add fields
without breaking the positional wire layout. PromptData and ErrorData
flip too for consistency.
Two correctness changes for multi-turn replay against thinking-mode and
tool-using models:

1. reasoning_content propagation. Prior assistant turns must replay
   their thinking trace as part of the message history; without it,
   the chat-template-rendered prompt diverges from what the original
   capture sent and outputs differ from the captured trajectory even
   at temperature=0.

2. tools propagation across all turns. Every request must carry the
   same tools array as turn 1; SGLang's tool-call parser is gated on
   a non-empty `tools` field in the request, and turns that omit it
   silently bypass the parser, leaking literal tool-call markup into
   the assistant `content` channel.

New optional cache-bursting salt (multi_turn.enable_salt: bool, default
False) appends a per-conversation blake2b digest to the end of each
trajectory's system prompt. This keeps within-trajectory prefix caching
intact while preventing cross-trajectory KV-cache leak during replay
of datasets that share a long system prompt across many trajectories.

The salt is computed once per conversation_id and reused on every turn
of that conversation. apply_salt is idempotent on already-salted
prompts so re-runs are stable. See examples/09_MultiTurn/docs for the
full methodology and validation; not included in this commit.

Mechanical change in dataset.py: load_from_file gains a **dataset_kwargs
passthrough so the factory can forward enable_salt without expanding
the signature for every future option.
The customer_support example and its two near-duplicate config files
predated the agentic datasets and no longer earn their keep:

- multi_turn_benchmark.yaml and multi_turn_with_concurrency.yaml differ
  only in `name`, two inline comments, and `report_dir`. Both already
  set target_concurrency: 32 — the README's Concurrency Control section
  documents the same knob inline. Keeping a second YAML to teach a
  setting that's already in the first is noise.
- customer_support_conversations.jsonl was the toy dataset backing
  those YAMLs. With them gone it has no consumer.

The remaining agentic YAMLs are tuned to actually work as written:
- model_params.name: "/model" matches SGLang's --model-path mount
- temperature: 0 (greedy) per MLPerf-inference reproducibility convention
- max_new_tokens sized to the longest observed turn in each capture
- target_concurrency reduced from theoretical maxima to values that
  match real B200-class server capacity on a 1T MoE
- client.warmup_connections: 0 / max_idle_time: 0.5 work around uvicorn
  closing pre-warmed idle sockets after 5s, which otherwise causes
  every first request to fail with ConnectionResetError

README updated: Basic Configuration example uses agentic_coding /
agentic_coding_flat.jsonl (the dataset that's actually in scope post-
deletion); Using Configuration File invokes agentic_coding_benchmark.yaml;
Example Datasets section dropped (the only entry was customer_support).
…alysis utility

- score_inline_accuracy.py: single-script scorer for multi-turn benchmark
  runs. Coding turns score by multiset IoU on a curated whitelist of ~40
  canonical bash exes; workflow turns score by exact-match on
  `intent: IXXX`. Folds in events.jsonl -> model_assistants.jsonl
  conversion so a benchmark report dir can be scored in one call.

- analyze_flat_jsonl.py: produces a single composite summary plot
  (turns/conv, ISL/OSL distributions, per-turn growth, token-class
  violins) for any flat multi-turn JSONL.
…ance

- MultiTurnConfig: add `enable_salt: bool = False` knob.
- MultiTurnDataset: when enabled, append `\n\n[cache_salt: <hex>]` to
  the system message once per trajectory, where hex is
  `blake2b(conversation_id, digest_size=8).hexdigest()`. Same salt is
  reused across all turns of one trajectory so within-trajectory
  prefix caching is preserved; differs across trajectories so the
  cross-trajectory cache match terminates at the salt boundary.
- MultiTurnDataset: drop rows with no `conversation_id` after
  load (e.g. the `_type: dataset_metadata` license/source sentinel
  some upstream snapshots prepend).

Methodology + full-dataset salt-vs-no-salt accuracy comparison in
examples/09_MultiTurn/docs/EVALUATION.md.
- README: new "Accuracy Evaluation" subsection under "Running Multi-Turn
  Benchmarks" documenting the score_inline_accuracy.py command and what
  it writes into report_dir. Completes the convert -> replay -> score
  pipeline.
- score_inline_accuracy.py: ruff-format/lint pass (formatting only, no
  behavior change).
- Add ModelParams.chat_template_kwargs (forwarded per request to
  vLLM/SGLang). Enables Kimi-K2.6 Thinking-mode by sending
  {'thinking': True, 'preserve_thinking': True} on every request so
  prior assistant reasoning is rendered back into the prompt as
  <think>...</think> blocks during multi-turn replays.
- Wire field through ChatCompletionRequest msgspec struct and
  openai_msgspec_adapter so it lands on the wire as a top-level
  request key.
- Score workflow accuracy from the structured intent_codes set
  stamped on each scorable assistant row (no regex fallback): a turn
  is a hit iff the model's extracted I### code is a member of the
  per-turn ground-truth set. Tool-only assistant turns remain
  unscorable by design.
- inject_tool_delay knob on multi_turn dataset config: when set, the
  strategy defers the next turn issue by the dataset's embedded
  delay_seconds via loop.call_later, modelling the producer-side
  tool/human pause that the original capture saw.
- Update agentic_{coding,workflow}_benchmark.yaml to point at
  the t1 datasets, T=1.0/top_p=0.95 sampling, salt enabled, and
  chat_template_kwargs preset for Kimi-K2.6.
- Update README + score_inline_accuracy.py + JSONL schema to match
  the new _t1.jsonl naming and intent_codes/delay_seconds fields.
- Regenerate full templates to expose the new schema field.

Tests cover end-to-end chat_template_kwargs propagation, the
intent_codes scoring rule, and the inject_tool_delay scheduling
path.
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment thread docs/MULTI_TURN_QUICKSTART.md
@hvagadia hvagadia force-pushed the hvagadia/cache-salt-and-correctness branch from 66f0a3d to 1bbc7ab Compare May 18, 2026 20:26
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment thread docs/MULTI_TURN_QUICKSTART.md Outdated
Comment thread docs/MULTI_TURN_QUICKSTART.md
Comment thread docs/MULTI_TURN_QUICKSTART.md
Comment thread docs/MULTI_TURN_QUICKSTART.md
Comment thread docs/MULTI_TURN_QUICKSTART.md
@hvagadia hvagadia changed the title Agentic Inference updates [Not ready for review] Agentic Inference updates May 18, 2026
@hvagadia hvagadia marked this pull request as ready for review May 18, 2026 22:55
@hvagadia hvagadia changed the title [Not ready for review] Agentic Inference updates Agentic Inference updates May 18, 2026
@@ -0,0 +1,525 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Inline accuracy scorer for multi-turn benchmark runs.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the file needs to be more explicit (and aligned with this summary)

Comment on lines +6 to +7
Coding turn: IoU of the multisets of canonical bash exes used in the turn
Workflow turn: 1 if `intent: IXXX` code matches gt, else 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be 2 separate scorer? IIRC the design was to have modularized accuracy checker for different components, so they can be reused for different dataset (e.g. if we have a longer coding dataset, we can reuse this for that)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nv-alicheng to suggest as well

# Canonical exe whitelist. Synonyms collapse to one canonical name.
# Anything not in this map is dropped. cd / pwd / echo are intentionally
# absent: they're navigation/output verbs, not action classes.
EXE_MAP: dict[str, str] = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this map haave an original reference (paper/code)? Good to put a reference if it's Aapache2.0 license

multi_turn:
turn_timeout_s: 600.0
enable_salt: true # add salt after system prompt to prevent cache reuse across trajectories
inject_tool_delay: true # add delay before user/tool turns
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I might have missed the prev PR) is the tool delay injected dynamically (calced based on distribution), or statically (embedded in the dataset)?

(Same Q for the salt as well)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tool delay is embedded in the dataset from the proposed distribution and used statically by client. Salt is deterministically computed at runtime.

Copy link
Copy Markdown
Collaborator

@nv-alicheng nv-alicheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 11 issues across 7 files (3 couldn't be posted inline — see summary comment).


idx, turn = pair
idx, turn = turns[cursor]
self._active_iters[conv_id] = (turns, cursor + 1)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex+Claude] high (bug): Cursor is advanced to cursor + 1 at this line before the scheduled delay executes.

If a timeout on the preceding in-flight turn triggers _abort_remaining_turns, it reads the already-advanced cursor (line 417: turns, cursor = state) and iterates turns[cursor:] — skipping the turn whose delay was just cancelled. mark_turn_failed and register_skipped/_publish_synthetic_failure are never called for that turn, leaving ConversationManager's completion count off by one and the benchmark missing the failure event entirely.

"n_scorable": n,
"n_perfect": n_perfect,
"n_zero": n_zero,
"pass_rate": round(total / n, 4) if n else float("nan"),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex+Claude] high (data-integrity): float("nan") is serialized as the bare token NaN by json.dumps (line 518) without allow_nan=False.

NaN is not valid JSON per RFC 8259 — jq, Go's encoding/json, and most strict parsers will refuse to read the output file. Replace float("nan") with None so the field becomes null in JSON, which is valid and clearly signals no data.

i += 1
if i >= len(tokens):
return None
leaf = _PATH_LEAF.search(tokens[i]).group(0).lower()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (bug): _PATH_LEAF.search(tokens[i]).group(0) raises AttributeError when tokens[i] is exactly "/" (a bare slash).

re.compile(r"[^/]+$") returns None for "/" because there are no non-slash characters before $. This aborts the entire scoring run. Guard against it:

m = _PATH_LEAF.search(tokens[i])
if m is None:
    return None
leaf = m.group(0).lower()

# ---------------------------------------------------------------------------


def _build_index_to_key(gt_jsonl: Path) -> list[tuple[str, int]]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (data-integrity): _build_index_to_key silently produces wrong results if the gt JSONL has a different turn count or order than the benchmark run dataset.

The positional index list is used to map sample_idx_map.json integer indices to (conv_id, turn) keys. If the gt file has extra or missing conversations, all subsequent UUID-to-key mappings past the mismatch point attribute model outputs to the wrong turns — producing incorrect scores with no warning. At minimum, assert or warn when len(index_to_key) differs from the number of UUIDs in uuid_to_index.

try:
payload = await request.json()
received_payloads.append(payload)
except Exception:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (error-handling): except Exception silently swallows request-parse failures.

logger = None # noqa: F841 suppresses the ruff warning but provides zero signal when parsing fails. Per project rules (AGENTS.md), every except block must log or include a comment explaining why the error is ignored. A parse failure here would leave received_payloads empty with no diagnostic output, making failures very hard to debug.

logit_bias: dict[str, float] | None = None
user: str | None = None
chat_template: str | None = None
chat_template_kwargs: dict[str, Any] | None = None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (design): The gc=False AT-RISK audit comment at line 112 lists messages, tools, and logit_bias but omits the newly added field at this line.

chat_template_kwargs: dict[str, Any] | None is also a mutable container. While the current Kimi use-case values are dict[str, bool] (cycle-free), the Any type admits cyclic references in general. Add chat_template_kwargs to the audit comment so future reviewers know it was considered.

if val and isinstance(val, str):
system_content = val
break
if self._enable_salt and system_content:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (design): Salt is silently skipped for conversations that have no system prompt (system_content is None).

if self._enable_salt and system_content: means that when enable_salt=True is requested, conversations without a system row receive no cache differentiation — partially defeating the purpose of the flag. At minimum, log a warning at INFO level identifying which conversations were skipped so the caller knows they are unprotected.

def _build_index_to_key(gt_jsonl: Path) -> list[tuple[str, int]]:
keys: list[tuple[str, int]] = []
by_conv: dict[str, list[dict]] = {}
for line in gt_jsonl.read_text().splitlines():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (performance): gt_jsonl.read_text().splitlines() loads the entire ground-truth file into memory as a single string before splitting.

For large benchmark datasets this doubles peak memory vs. streaming. _iter_assistant_turns (line 268) already uses the correct pattern — iterate the open file handle line-by-line:

with open(gt_jsonl) as fh:
    for line in fh:
        ...

@nv-alicheng
Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 11 issues across 7 files.

⚠️ 3 issues below couldn't be posted as inline comments (lines not in the diff hunk) — they are included here with file:line references.


🔴 Must Fix (critical/high)

Issues that will cause hangs, data loss, or incorrect behavior in production.

# File Line Category Reviewer(s) Summary
1 src/inference_endpoint/load_generator/multi_turn_strategy.py 326 bug Codex ⚠️ inline not possible — Timeout path decrements inflight but never calls _drain_event.set(). If the last in-flight turn times out, _drain_inflight in session.py waits forever — the benchmark hangs indefinitely. Fix: mirror session.py:540 (if self._phase_issuer.inflight <= 0: self._drain_event.set()) in the timeout handler.
2 src/inference_endpoint/load_generator/multi_turn_strategy.py 232 bug Codex+Claude Cursor advanced to cursor + 1 before the delayed turn fires. If a timeout aborts the conversation, _abort_remaining_turns reads the already-advanced cursor and skips the cancelled delayed turn — mark_turn_failed and _publish_synthetic_failure are never called for it, silently under-counting failures.
3 src/inference_endpoint/load_generator/session.py 537 bug Codex ⚠️ inline not possiblecompleted_uuids set accumulates every completed + timed-out query ID for the full phase and is never cleared. At 50 k QPS / 600 s this can hold millions of UUID strings and consume several hundred MB before the phase ends. The set's purpose (guard against late double-responses) only requires remembering IDs that were synthetically completed after a timeout — not every successful query.
4 examples/09_MultiTurn/accuracy/score_inline_accuracy.py 328 data-integrity Claude float("nan") serialized as bare NaN token by json.dumps without allow_nan=False. NaN is not valid JSON per RFC 8259; jq, Go's encoding/json, and most strict parsers reject the output file. Replace with Nonenull.

🟡 Should Fix (medium)

Real issues that produce incorrect results under reachable conditions.

# File Line Category Reviewer(s) Summary
5 src/inference_endpoint/core/types.py 224 bug Codex ⚠️ inline not possibleas_message_parts_after_first_chunk returns the full self.tool_calls for TPOT tokenization, but for streaming tool-call responses some of those tokens arrived in the first chunk (before TTFT). TPOT denominator is systematically inflated for all tool-calling runs.
6 examples/09_MultiTurn/accuracy/score_inline_accuracy.py 132 bug Claude _PATH_LEAF.search(tokens[i]).group(0) raises AttributeError when tokens[i] is "/" (bare slash). re.compile(r"[^/]+$") returns None for that input; unguarded .group(0) aborts the entire scoring run.
7 examples/09_MultiTurn/accuracy/score_inline_accuracy.py 340 data-integrity Claude _build_index_to_key has no guard against a mismatch between the gt JSONL and the benchmark run dataset. A difference in turn count/order silently maps model outputs to wrong turns, producing incorrect scores with no warning.

🔵 Consider (low)

Valid improvements, suitable as follow-ups.

# File Line Category Reviewer(s) Summary
8 tests/integration/test_multi_turn.py 939 error-handling Claude except Exception swallows parse failures with logger = None — no log, no comment. AGENTS.md requires every except to explain or log.
9 src/inference_endpoint/openai/types.py 132 design Claude gc=False AT-RISK comment (line 112) lists messages, tools, logit_bias but omits the newly added chat_template_kwargs: dict[str, Any] | None.
10 src/inference_endpoint/dataset_manager/multi_turn_dataset.py 413 design Claude Salt silently not applied to conversations without a system prompt; no warning logged when enable_salt=True is partially defeated.
11 examples/09_MultiTurn/accuracy/score_inline_accuracy.py 343 performance Claude gt_jsonl.read_text().splitlines() loads the entire file into memory. Use line-by-line streaming (same pattern as _iter_assistant_turns line 268).

🤖 Generated by /review-council (Codex + Claude, thorough depth).



def _extract_intent_code(turn: dict) -> str | None:
"""Two-stage extractor:
Copy link
Copy Markdown
Collaborator

@viraatc viraatc May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any OSS implementations / papers / conventions we could link to for this section?

# Mandatory: with the default warmup behaviour, every request fails with
# ConnectionResetError because uvicorn closes pre-warmed idle sockets after 5s.
client:
warmup_connections: 0
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the http error code ur seeing on this?

dont remember seeing this in the past with sglang.
(i expect tcp-keepalive is disabled (default))

also max-idle-time should is likely not making a difference here, lets use default if we can.

# Event-driven state — populated in execute().
self._pending_convs: deque[tuple[str, list[tuple[int, int]]]] = deque()
self._active_iters: dict[str, Iterator[tuple[int, int]]] = {}
self._active_iters: dict[str, tuple[list[tuple[int, int]], int]] = {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a typedef/alias, hard to understand what tuple[list[tuple[int, int]], int] is supposed to mean here and in self._pending_convs as well?

Copy link
Copy Markdown
Collaborator

@arekay-nv arekay-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Claude-only review (Codex unavailable — bwrap/pivot_root not permitted in this environment). Depth: thorough.

num_repeats=num_repeats,
)
if config.multi_turn is not None and config.multi_turn.enable_salt:
assert isinstance(dataloader, MultiTurnDataset)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (bug): assert isinstance(dataloader, MultiTurnDataset) raises AssertionError — not a user-facing error — when enable_salt=True but the loader is not a MultiTurnDataset. Worse, the assert is silently skipped under python -O, leading to AttributeError on dataloader.enable_salt() with no diagnostic. The project convention for config-validation failures is InputValidationError. Replace with:

if not isinstance(dataloader, MultiTurnDataset):
    raise InputValidationError(
        "multi_turn.enable_salt requires a multi-turn dataset, got "
        f"{type(dataloader).__name__}"
    )

handle = self._loop.call_later(
delay, self._issue_turn_now, conv_id, idx, turn
)
self._delay_handles[conv_id] = handle
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (concurrency): The early-exit guard in execute() (a few lines before this hunk) reads if not self._active_iters and not self._inflight. This guard fires incorrectly when every conversation has its first turn delayed: _issue_next_turn advances the cursor and stores the handle here, but returns before populating _inflight. If all _active_iters slots are occupied by pending-delay conversations, the guard sees _inflight empty and never sets _all_done, causing execute() to hang at await self._all_done.wait(). The guard should also check not self._delay_handles:

if not self._active_iters and not self._inflight and not self._delay_handles:
    return phase_issuer.issued_count

Similarly, _fill_slot (further below) checks elif not self._active_iters to set _all_done — this will fire prematurely while delays are pending, terminating the run before delayed turns execute.

"name",
"tool_calls",
"tool_results",
"reasoning_content",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (api-contract): reasoning_content is unconditionally forwarded in prior-assistant-turn history messages. This field is a Kimi/SGLang-specific extension and is not part of the OpenAI Chat Completions API spec for request message objects. Any server that validates input strictly (standard vLLM, OpenAI-compatible endpoints, TRT-LLM serve) will reject requests with a 400/422 error when prior assistant turns carry this field.

Since ChatMessage.reasoning_content has omit_defaults=True and defaults to None, the field is only included when the dataset row has a non-null value — so this only triggers for datasets recorded from Kimi runs. But there is no api_type gate or config option to strip it, and no documentation warning. Recommend gating on api_type in (APIType.OPENAI_MSGSPEC, APIType.SGLANG) or adding a strip_reasoning_from_history option to MultiTurnConfig.

data = _parse_model_output(ev.get("data"))
if data is None:
continue
conv_id, client_turn = key
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (data-integrity): derive_model_assistants assigns assistant turn number as client_turn + 1 (the turn number of the preceding user/tool row + 1). This works only because _validate_turn_numbering enforces that every client turn is immediately followed by an assistant turn at the next sequential turn number. That invariant is not visible at this call site.

If a caller passes a JSONL file from a non-validating source (e.g. a manually constructed or future-format dataset with non-contiguous turn numbers), all turns are silently scored against the wrong ground-truth rows — missing_in_model will equal the full dataset size and pass rate will be 0 with no error or warning. A comment citing the coupling to _validate_turn_numbering, or a runtime assertion that client_turn + 1 exists in the gt index, would prevent silent misscoring.

if args.report_dir is not None:
model_path = args.report_dir / "model_assistants.jsonl"
derive_model_assistants(args.report_dir, args.gt, model_path, args.dataset_name)
else:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (error-handling): When --model is passed directly (not --report-dir), model_path = args.model is used without checking whether the file exists. score_run then calls _iter_assistant_turns(model_jsonl) which raises a raw FileNotFoundError with a Python traceback. The --report-dir path is consistent but the --model path should add the same guard:

if not model_path.exists():
    sys.exit(f"model file not found: {model_path}")

Copy link
Copy Markdown
Collaborator

@arekay-nv arekay-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Claude-only review (Codex unavailable — bwrap/pivot_root not permitted in this environment). Depth: thorough.

num_repeats=num_repeats,
)
if config.multi_turn is not None and config.multi_turn.enable_salt:
assert isinstance(dataloader, MultiTurnDataset)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (bug): assert isinstance(dataloader, MultiTurnDataset) raises AssertionError — not a user-facing error — when enable_salt=True but the loader is not a MultiTurnDataset. Worse, the assert is silently skipped under python -O, leading to AttributeError on dataloader.enable_salt() with no diagnostic. The project convention for config-validation failures is InputValidationError. Replace with:

if not isinstance(dataloader, MultiTurnDataset):
    raise InputValidationError(
        "multi_turn.enable_salt requires a multi-turn dataset, got "
        f"{type(dataloader).__name__}"
    )

handle = self._loop.call_later(
delay, self._issue_turn_now, conv_id, idx, turn
)
self._delay_handles[conv_id] = handle
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (concurrency): The early-exit guard in execute() (a few lines before this hunk) reads if not self._active_iters and not self._inflight. This guard fires incorrectly when every conversation has its first turn delayed: _issue_next_turn advances the cursor and stores the handle here, but returns before populating _inflight. If all _active_iters slots are occupied by pending-delay conversations, the guard sees _inflight empty and never sets _all_done, causing execute() to hang at await self._all_done.wait(). The guard should also check not self._delay_handles:

if not self._active_iters and not self._inflight and not self._delay_handles:
    return phase_issuer.issued_count

Similarly, _fill_slot (further below) checks elif not self._active_iters to set _all_done — this will fire prematurely while delays are pending, terminating the run before delayed turns execute.

"name",
"tool_calls",
"tool_results",
"reasoning_content",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (api-contract): reasoning_content is unconditionally forwarded in prior-assistant-turn history messages. This field is a Kimi/SGLang-specific extension and is not part of the OpenAI Chat Completions API spec for request message objects. Any server that validates input strictly (standard vLLM, OpenAI-compatible endpoints, TRT-LLM serve) will reject requests with a 400/422 error when prior assistant turns carry this field.

Since ChatMessage.reasoning_content has omit_defaults=True and defaults to None, the field is only included when the dataset row has a non-null value — so this only triggers for datasets recorded from Kimi runs. But there is no api_type gate or config option to strip it, and no documentation warning. Recommend gating on api_type in (APIType.OPENAI_MSGSPEC, APIType.SGLANG) or adding a strip_reasoning_from_history option to MultiTurnConfig.

data = _parse_model_output(ev.get("data"))
if data is None:
continue
conv_id, client_turn = key
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (data-integrity): derive_model_assistants assigns assistant turn number as client_turn + 1 (the turn number of the preceding user/tool row + 1). This works only because _validate_turn_numbering enforces that every client turn is immediately followed by an assistant turn at the next sequential turn number. That invariant is not visible at this call site.

If a caller passes a JSONL file from a non-validating source (e.g. a manually constructed or future-format dataset with non-contiguous turn numbers), all turns are silently scored against the wrong ground-truth rows — missing_in_model will equal the full dataset size and pass rate will be 0 with no error or warning. A comment citing the coupling to _validate_turn_numbering, or a runtime assertion that client_turn + 1 exists in the gt index, would prevent silent misscoring.

if args.report_dir is not None:
model_path = args.report_dir / "model_assistants.jsonl"
derive_model_assistants(args.report_dir, args.gt, model_path, args.dataset_name)
else:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (error-handling): When --model is passed directly (not --report-dir), model_path = args.model is used without checking whether the file exists. score_run then calls _iter_assistant_turns(model_jsonl) which raises a raw FileNotFoundError with a Python traceback. The --report-dir path is consistent but the --model path should add the same guard:

if not model_path.exists():
    sys.exit(f"model file not found: {model_path}")

num_repeats=num_repeats,
)
if config.multi_turn is not None and config.multi_turn.enable_salt:
assert isinstance(dataloader, MultiTurnDataset)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (bug): assert isinstance(dataloader, MultiTurnDataset) raises AssertionError when enable_salt=True but the loader is not a MultiTurnDataset. The assert is silently skipped under python -O, leading to AttributeError on dataloader.enable_salt() with no diagnostic. The project convention for config-validation failures is InputValidationError:

if not isinstance(dataloader, MultiTurnDataset):
    raise InputValidationError(
        "multi_turn.enable_salt requires a multi-turn dataset, got "
        f"{type(dataloader).__name__}"
    )

handle = self._loop.call_later(
delay, self._issue_turn_now, conv_id, idx, turn
)
self._delay_handles[conv_id] = handle
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (concurrency): The early-exit guard in execute() reads if not self._active_iters and not self._inflight. When every conversation has its first turn delayed, _issue_next_turn advances the cursor and stores the handle here, but returns before populating _inflight. The guard then sees _inflight empty and _all_done is never set, causing execute() to hang at await self._all_done.wait(). Fix: also check not self._delay_handles.

The same issue affects _fill_slot: elif not self._active_iters fires prematurely while delays are pending, terminating the run before delayed turns execute.

"name",
"tool_calls",
"tool_results",
"reasoning_content",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (api-contract): reasoning_content is unconditionally forwarded in prior-assistant-turn history messages. This is a Kimi/SGLang-specific extension, not in the OpenAI Chat Completions spec for request message objects. Servers that validate input strictly (standard vLLM, TRT-LLM serve) will reject with 400/422 when prior assistant turns carry this field. Since ChatMessage.reasoning_content defaults to None and uses omit_defaults=True, this only triggers for datasets recorded from Kimi runs — but there is no api_type gate, config option to strip it, or documentation warning.

data = _parse_model_output(ev.get("data"))
if data is None:
continue
conv_id, client_turn = key
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (data-integrity): client_turn + 1 works only because _validate_turn_numbering enforces sequential turn numbering — an invariant not visible here. If called with a non-validating source (manually constructed dataset, future format with non-contiguous turn numbers), all turns are silently scored against the wrong ground-truth rows: missing_in_model equals the full dataset size, pass rate is 0, with no error. A comment or runtime assertion would prevent silent misscoring.

if args.report_dir is not None:
model_path = args.report_dir / "model_assistants.jsonl"
derive_model_assistants(args.report_dir, args.gt, model_path, args.dataset_name)
else:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (error-handling): When --model is passed directly, model_path = args.model is used without checking existence. score_run raises a raw FileNotFoundError with a traceback. The --report-dir path is consistent — the --model path needs the same existence check before use.

@arekay-nv
Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — bwrap/pivot_root not permitted in this environment) | Depth: thorough | PR: #314 "Agentic Inference updates"

Found 5 new issues posted as inline comments (3 previously covered by earlier review pass were deduplicated). Commit hygiene: 14 commits, 6 chore/fix — consider squashing before merge.

🔴 Must Fix (critical/high)

# File Line Category Summary
1 factory.py 121 bug assert isinstance(...) → must be InputValidationError; silently skipped under python -O
2 multi_turn_strategy.py 247 concurrency execute() early-exit + _fill_slot completion check both missing not self._delay_handles; first-turn-delay scenario can hang or terminate early

🟡 Should Fix (medium)

# File Line Category Summary
3 multi_turn_dataset.py 438 api-contract reasoning_content unconditionally forwarded in history messages; breaks non-Kimi endpoints (vLLM, TRT-LLM) with 400/422
4 score_inline_accuracy.py 461 data-integrity client_turn + 1 implicitly couples to _validate_turn_numbering invariant; silent misscoring if invariant not enforced upstream

🔵 Consider (low)

# File Line Category Summary
5 score_inline_accuracy.py 507 error-handling --model path has no existence check; raises raw FileNotFoundError traceback vs --report-dir path's clean error

⚠️ Commit hygiene: This PR has 14 commits including 6 apparent chore/fix commits. Consider squashing before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants