Skip to content

feat(bfcl): add BFCL v4 edge-agentic accuracy + performance integration#346

Open
Palanivelg wants to merge 48 commits into
mlcommons:mainfrom
Palanivelg:feat/bfcl-v4-combined
Open

feat(bfcl): add BFCL v4 edge-agentic accuracy + performance integration#346
Palanivelg wants to merge 48 commits into
mlcommons:mainfrom
Palanivelg:feat/bfcl-v4-combined

Conversation

@Palanivelg

@Palanivelg Palanivelg commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Integrates the edge-agentic example (BFCL-v4 as the accuracy set): single-turn + multi-turn pipelines, datasets, adapters, and a reproducible run script. See examples/10_Edge_Agentic_Example/README.md.

Type of change

  • Bug fix
  • [x ] New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Self-contained: this PR includes the numpy relaxation (numpy>=1.26.4) required by bfcl-eval (which hard-pins numpy==1.26.4), so it no longer depends on a separate prerequisite PR. The [bfcl] extra is isolated via [tool.uv].conflicts so its old pins don't constrain the shared tooling deps. This supersedes #345 (closed as redundant).

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

@Palanivelg Palanivelg requested a review from a team June 8, 2026 17:06
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for Berkeley Function Calling Leaderboard (BFCL) v4 accuracy benchmarking, introducing single-turn and multi-turn evaluation pipelines, datasets, and adapters, along with sequential sample ordering for deterministic evaluation. The code review feedback identifies three key improvements: moving the tool_calls parsing loop inside the try-except block in bfcl_v4_execution.py to prevent crashes on invalid JSON, conditionally including tools and tool_choice in the request payload in bfcl_v4_multi_turn_runner.py to avoid API errors when no tools are present, and guarding the n_repeats calculation in bfcl_v4_scorer.py to ensure it does not evaluate to zero if some samples fail.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/evaluation/bfcl_v4_execution.py Outdated
Comment thread src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py
Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated
@Palanivelg Palanivelg changed the title Feat/bfcl v4 combined feat(bfcl): add BFCL v4 edge-agentic accuracy integration Jun 8, 2026
Palanivelg added 14 commits June 8, 2026 17:16
Integrate Berkeley Function Calling Leaderboard (BFCL) v4 evaluation
into the accuracy pipeline. Covers both single-turn function-calling
subsets (non_live, live, hallucination) and the agentic multi-turn
subsets (multi_turn_base + variants), validated against evalscope.

Single-turn (drop-in scorer):
  - BFCLv4 predefined dataset (categories=[non_live|live|hallucination],
    configurable sample_pct) plus a default preset.
  - BFCLv4Scorer wires bfcl-eval's ast_checker into the standard
    accuracy phase via the existing scorer registry.
  - FunctionCallExtractor: normalizes native tool_calls JSON, JSON
    arrays of function-call objects, and text-form function calls
    into the canonical BFCL input format.
  - openai_msgspec_adapter now:
      * passes tool_choice through,
      * emits tool_calls verbatim as output_text for scoring,
      * coerces whole-number temperatures (vLLM strictness),
      * makes max_completion_tokens optional and uses a permissive
        ColumnFilter so messages+tools datasets pass through.

Multi-turn (agentic loop, outside the standard scorer):
  - BFCLExecutionBridge: parses tool_calls, executes them locally
    against bfcl-eval's class instances, and constructs the tool
    response messages for the next turn.
  - BFCLMultiTurnRunner: drives the per-entry agentic conversation
    via httpx (bounded by max_steps_per_turn / timeout_s).
  - BFCLv4MultiTurnScorer: invokes bfcl-eval's multi_turn_checker.
  - bfcl_v4_multi_turn_cli: standalone CLI for the multi-turn flow.

Supporting plumbing:
  - SequentialSampleOrder + `sequential=` flag on create_sample_order,
    used by accuracy phases so ordering matches reference runs.
  - BenchmarkConfig.Dataset.params: dict for dataset-specific kwargs
    (e.g. categories, sample_pct) plumbed through DataLoaderFactory.
  - ScorerMethod.BFCL_V4.
  - Dataset.load() preserves user-provided ColumnFilters when the
    adapter would otherwise inject a conflicting one.
  - `--accuracy-only` benchmark flag: skip the performance phase
    entirely (forces num_workers=1, max_connections=1 for
    deterministic per-sample ordering).

Optional dep: `pip install -e ".[bfcl]"` (`bfcl-eval==2026.3.23`).
The top-level numpy pin must be relaxed to `>=1.26.4` because
bfcl-eval hard-pins `numpy==1.26.4` — shipped as a separate
prerequisite commit on the chore/relax-numpy-pin branch.

Validation (Qwen3.6-27B-Q4_K_M, temperature=0):
  - Single-turn live (10%): ~82% accuracy.
  - Multi-turn base (full 200): 140/200 = 70.00%, exact parity
    (100% match, 0 mismatches) with evalscope on identical inputs.
Add a committed config that reproduces the validated <3h BFCL v4 accuracy
run on an embedded device (Thor).

Dataset (BFCLv4.generate):
  - category_sample_pct: per-category sampling rates (e.g. non_live 20 /
    live 10 / hallucination 5), resolved per subset via SUBSET_TO_CATEGORY.
  - subset_floor: subsets whose TOTAL size is <= floor are taken in full,
    preventing tiny subsets (live_parallel=16, live_parallel_multiple=24)
    from collapsing to one or two noisy samples. Selection stays
    deterministic (head(n)). Plain sample_pct behavior is unchanged.

Multi-turn CLI:
  - --sample-pct: deterministic per-subset sub-sampling so a 3% (~24 entry)
    multi-turn run is reproducible. Defaults to all entries.

examples/10_BFCLv4_Example/:
  - offline_bfcl_v4_single_turn.yaml: single-turn (non_live + live +
    hallucination) accuracy config, run with --accuracy-only.
  - README.md: documents the two run paths (single-turn YAML vs multi-turn
    CLI), the per-category sampling + floor, and the ~2h49m Thor budget.
The example and docstring referenced a specific device; reword to the
generic "edge device" since the <3h budget applies to embedded targets
broadly.
In --accuracy-only runs there is no performance phase, so ctx.rt_settings
is None. _run_benchmark_async read ctx.rt_settings.max_duration_ms
unconditionally, raising AttributeError at session setup. The global
timeout only applies to the performance phase, so default max_duration_ms
to None when rt_settings is absent.
--accuracy-only forces a single connection for deterministic sample
ordering, which serializes the offline MAX_THROUGHPUT burst. For large
accuracy datasets the sequential processing time exceeds the hardcoded
240 s drain cap, so the phase aborted mid-run dropping in-flight samples.

Make drain_timeout a per-phase field defaulting to 240.0 (performance
phases unchanged). Accuracy phases pass None to drain unbounded, since
every sample must complete; a dropped transport still unblocks the wait
via the _receive_responses close path. Re-check inflight after clear()
to close a completion/clear race on the unbounded path.
The msgspec adapter serialized tool_calls into `output` AND kept them in the
structured `tool_calls` field. TextModelOutput.__str__ then appended the
tool_calls a second time, producing duplicated, malformed JSON
(`[{...}][{...}]`) that FunctionCallExtractor could not parse. This made every
single-turn function-calling subset score 0% (and gave hallucination subsets a
spurious 100%).

Keep `output` as the textual content only; the structured tool_calls field is
the single source serialized once by __str__. This matches the non-msgspec
OpenAI adapter and the streaming accumulator, which already did this.

Multi-turn is unaffected: it uses a separate httpx runner that reads structured
tool_calls from the raw response and never touches TextModelOutput.
…mbles

When a model emits a prose preamble alongside a tool call (e.g.
"To compute this, I'll call...\n[{...}]"), str(TextModelOutput) prefixes the
tool-call JSON with that text, so the function-call parser's json.loads fast
path fails and the sample scores 0.

Override BFCLv4Scorer.get_outputs to prefer the structured tool_calls field
when present (the function call is the answer; the prose is chatter), falling
back to the full string for plain-text responses such as hallucination
refusals. Verified deterministic across repeated fresh-server single-turn runs.
Some servers (e.g. trtllm-serve on edge devices) stall when tools are present
but tool_choice is omitted, relying on a server default. Set tool_choice="auto"
explicitly on each single-turn sample and pass it through the function_calling
preset's ColumnFilter so it reaches the request payload.

Multi-turn already sends tool_choice="auto" via its dedicated runner, so this
only affects the single-turn path.
Add ModelParams.seed field and propagate it to the OpenAI wire format
so runs can be made reproducible:

- config/schema.py: add seed field to ModelParams
- openai/types.py: add seed field to ChatCompletionRequest
- openai/openai_adapter.py: include seed in metadata dict
- openai/openai_msgspec_adapter.py: include seed in metadata dict and
  ChatCompletionRequest construction
- evaluation/bfcl_v4_multi_turn_runner.py: accept seed param; inject
  payload["seed"] when set
- evaluation/bfcl_v4_multi_turn_cli.py: expose --seed CLI arg and pass
  it to BFCLMultiTurnRunner
- commands/benchmark/cli.py: expose --seed and --report-dir overrides
  on the from-config subcommand
- tests: unit coverage for seed propagation in msgspec adapter,
  multi-turn runner, and from-config CLI
Expand examples/10_BFCLv4_Example/README.md:
- Add a "Reproducing from the PRs" section explaining that PR #1
  (numpy pin) is a prerequisite for PR #2 to install [bfcl]
- Show how to check out and install from the branches
- Document --seed flag for both single-turn (from-config) and
  multi-turn CLI paths
- Replace placeholder accuracy numbers with confirmed Thor validation
  results (Qwen3.6-27B-Q4_K_M, temperature=0, 456 ST samples):
    non_live 86.98%, live 84.12%, hallucination 94.32%, overall 87.50%
    (both seed runs identical); MT base 70.00% (exact evalscope parity)
- Add output file paths and a quick sanity-check script
Replace the terse reference doc with a numbered walkthrough that
works for someone unfamiliar with the endpoints repo:
- What is this / What is the endpoints repo
- Step 0: prerequisites including a llama.cpp Docker quick-start
- Step 1: install from the two PRs with conflict explanation
- Step 2: run single-turn (with YAML config notes)
- Step 3: run multi-turn
- Step 4: verify results with one-liners
- Seed reproducibility section
- Reference results table (Thor, two seed runs, evalscope parity)
- Fix memory requirement: ~24 GB (not 16 GB) for the Q4 GGUF + KV cache
- Replace generic Docker quick-start with Thor-specific llama.cpp native
  build instructions (Docker CUDA images don't target sm_110/aarch64-SBSA)
- Add x86_64 Docker quick-start in a collapsible details block
- Fix Step 4 result path: results.json under accuracy_scores key, not
  a separate accuracy_scores.json; add report.txt note
- Add server-side determinism note (--seed 42, -np 1 on llama-server)
- Replace placeholder MT numbers with actual sampled-run Thor results:
    multi_turn_base 66.67% (4/6), miss_func 33.33% (2/6),
    miss_param 16.67% (1/6), long_context 66.67% (4/6),
    overall 45.84% (24 entries) — identical across both seed runs
- Separate full multi_turn_base parity result (140/200 = 70.00%) into
  its own subsection to avoid conflating sampled and full-set numbers
- Update wall-clock: ~82 min ST + ~64 min MT ≈ 2.4–2.5 h total
bfcl-eval's Qwen model handler transitively imports qwen_agent which
requires soundfile; without it the import fails on Thor and any machine
where soundfile is not already installed.
… run_accuracy.sh

Renames examples/10_BFCLv4_Example to examples/10_Edge_Agentic_Example to
align with the MLPerf edge-agentic submission category name.

Adds run_accuracy.sh — a single script that reproduces both single-turn and
multi-turn reference accuracy numbers end-to-end with the exact validated
parameters (sampling rates, temperature=0, seed=42, max-steps-per-turn=25).

Updates README to lead with the one-liner quick-start referencing the script,
fixes the install instructions to point to mlcommons/endpoints (not the fork),
adds --seed and --max-steps-per-turn to the Step 3 MT snippet, and corrects
the internal path reference in online_agentic_coding_perf.yaml.
@Palanivelg Palanivelg force-pushed the feat/bfcl-v4-combined branch from 9f0186a to 37c74d2 Compare June 8, 2026 17:21
- bfcl_v4_execution: move tool-call argument JSON parsing inside the
  try-except block so json.JSONDecodeError from malformed model output
  (common on small/quantized models) is caught and handled gracefully
  rather than crashing the evaluation run.

- bfcl_v4_multi_turn_runner: only include tools/tool_choice in the
  request payload when the tools list is non-empty, avoiding 400 errors
  on endpoints that reject tool_choice without accompanying tools.

- bfcl_v4_scorer: guard n_repeats with max(1, ...) so a partial run
  where fewer samples completed than num_samples() does not produce a
  zero divisor and incorrect reporting.
- Relax base numpy to >=1.26.4 so bfcl-eval's numpy==1.26.4 pin resolves;
  regenerate uv.lock.
- Regenerate stale *_template_full.yaml config templates after schema change.
- Fix mypy: annotate tool_calls/tool_call_ids and narrow Optional
  messages/tools in the multi-turn runner; mark BFCLv4Scorer.score override.
- Isolate the bfcl extra via [tool.uv].conflicts and add patched
  filelock/virtualenv floors to the dev extra so bfcl-eval's filelock==3.20.0
  pin no longer drags shared tooling deps into CVE versions
  (CVE-2025-68146, CVE-2026-22701, CVE-2026-22702).
The accuracy phase hardcoded drain_timeout=None, which ignored a
user-configured DrainConfig.accuracy_timeout_s and failed
test_configured_drain_timeouts_propagate_to_phases. accuracy_timeout_s
already defaults to None (unbounded), so reading it preserves the
unbounded default while honoring an explicit timeout.
from-config has no --model-params.name / --endpoint-config.endpoints
overrides, so the script errored immediately. Render a temp YAML with
MODEL/ENDPOINT substituted into the committed config (anchored on the
"# set to your ..." comments) so the one-liner still works without
editing the tracked file.
@viraatc viraatc self-requested a review June 29, 2026 20:51
Comment thread src/inference_endpoint/config/schema.py Outdated

@viraatc viraatc left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few integration questions from this review.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py
Comment thread src/inference_endpoint/commands/benchmark/cli.py Outdated
Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated
Comment thread examples/11_Edge_Agentic_Example/online_edge_full_run.yaml
Comment thread src/inference_endpoint/commands/benchmark/cli.py
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread tests/unit/commands/test_benchmark.py Outdated
Comment thread scripts/publish_submission.py
Comment thread src/inference_endpoint/commands/benchmark/execute.py
Comment thread src/inference_endpoint/dataset_manager/dataset.py
Comment thread src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py

@viraatc viraatc left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few follow-up integration comments.

Comment thread src/inference_endpoint/compliance/checker.py Outdated
Comment thread src/inference_endpoint/compliance/__init__.py
Comment thread src/inference_endpoint/compliance/checker.py Outdated
…compliance + nits)

- session: add stop_current_phase() and have the perf max_duration cap call it
  instead of session.stop(), so a combined perf+accuracy run still runs accuracy
  when the performance phase hits its time cap (was silently skipped). The
  stop-check observes a per-phase flag (reset each phase) so polling strategies
  stop issuing, and the strategy task is cancelled to interrupt a blocked await.
  Adds a regression test (perf timeout -> accuracy still completes).
- compliance: all_turns_observed now compares observed to expected (scorable)
  turns rather than issued; a dataset with 1007 issued / 1006 scorable no longer
  fails a valid run. Tests updated to carry expected and encode that case.
- schema: rename ambiguous Dataset.params -> generate_params (feeds generate());
  regenerated templates and updated the example config.
- dataset: elaborate why only the adapter's ColumnFilter is dropped when a preset
  supplies its own (other adapter transforms still apply).
- example yaml: drop the "do not change" noise; annotate the BFCL gate block once
  as fixed and keep per-field notes only where users tune values.
- test: assert the from-config seed override is meaningful (seed != 42 before).
…only

Drop the `accuracy_only` bool threaded through run_benchmark/setup_benchmark/
_load_datasets; derive it on BenchmarkContext from `test_mode == TestMode.ACC`.
`--accuracy-only` remains a CLI convenience alias that resolves to TestMode.ACC.
Coalesce the dataset-presence guards in _load_datasets around the resolved mode.

Move the `seed` Cyclopts alias onto ModelParams (`--seed`) so offline/online
expose it via normal model-field unwrapping, and remove the bespoke from-config
`--seed` override flag (set seed in the YAML for from-config runs).
…ed breakdown

BFCLv4Scorer.score() now returns the scalar overall accuracy (fraction) like
every other Scorer, dropping the `# type: ignore[override]`. The per-subset /
per-category detail is cached and exposed via a new optional
`Scorer.score_breakdown()` hook, and finalize writes it to results.json under a
dedicated `breakdown` key (numeric percentages, not formatted strings).

Consumers (compliance gate, results plotting, publish helper) read the typed
`breakdown` block instead of special-casing a score-that-might-be-a-dict; the
string-coercion paths are kept only as defensive fallbacks for older artifacts.

Compliance also now loads config.yaml via BenchmarkConfig.from_yaml_file()
(schema validation + env resolution + discriminated union) rather than a raw
yaml.safe_load, with test fixtures updated to write full schema-valid configs.

@viraatc viraatc left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — integration & minimalism pass

Placeholder; summary posted separately.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/openai/openai_adapter.py
Comment thread src/inference_endpoint/openai/openai_msgspec_adapter.py Outdated
Comment thread src/inference_endpoint/openai/openai_msgspec_adapter.py Outdated
Comment thread src/inference_endpoint/load_generator/session.py Outdated
Comment thread src/inference_endpoint/dataset_manager/factory.py Outdated
return default


class FunctionCallExtractor(Extractor, extractor_id="function_call_extractor"):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Council: integration review] medium (design): FunctionCallExtractor Priority 3 (_try_parse_text_function_calls, ~100 lines: _SKIP_NAMES frozenset of Python builtins, balanced-paren scanner, ast.parse("dict(...)") arg parsing) is a heavy heuristic on a scoring path, and is reachable (bfcl_v4_scorer.py:252 calls self.extractor.extract(...)). Its own docstring concedes it "may produce false positives from explanatory text or fail on nested parentheses" — i.e. it can silently mis-score. Given BFCL endpoints return structured tool_calls (Priority 1) or a JSON array (Priority 2), is the free-text fallback needed at all? Recommend dropping Priority 3 (return default after the JSON paths) unless you can show a real endpoint that only emits prose function calls — for a benchmark, a clean miss is safer than a heuristic guess. This is the largest single chunk of new logic on the integration surface.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping Priority 3 as the last-resort recall path. For the validated Qwen3.6 runs, structured tool_calls (Priority 1) dominate, but some models/quants still emit text-form calls, and dropping it would lower recall with no accuracy upside on the compliant path.

Comment thread src/inference_endpoint/load_generator/sample_order.py Outdated
@viraatc

viraatc commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Review Council — integration & minimalism pass

Reviewers: Claude (full integration-surface pass) + Codex (timed out at max-effort before emitting findings; not counted). Depth: thorough, scoped per the maintainer's ask — is each new line / API change / new member on an existing class actually needed? Focus was the integration surface (changes to pre-existing files), not the self-contained BFCL internals. All findings verified against HEAD; deduped against the 64 existing comments.

8 new inline findings posted (1 high, 6 medium, 1 low). Findings already raised by you/@arekay-nv were intentionally skipped.

🔴 Must fix

# File:Line Cat Summary
1 commands/benchmark/execute.py:1246 bug invalid_reasons block is dead code — reads valid/invalid_reason keys no scorer ever puts on the entry dict; always empty, and latently KeyError-prone if they were.

🟡 Should fix

# File:Line Cat Summary
2 openai/openai_adapter.py:53 bug seed added to metadata but to_endpoint_request never sends it → seed silently dropped on this adapter (msgspec path works; compliance checker asserts a seed that wasn't sent).
3 openai/openai_msgspec_adapter.py:113 design required_columns=["prompt"][] weakens load-time validation for all msgspec datasets. Gratuitous — BFCL preset supplies its own filter that dataset.py already preserves.
4 openai/openai_msgspec_adapter.py:76 design temperature int-coercion runs on every request, no test/reason; no-op given `float
5 load_generator/session.py:96 design Revert — there should be no drain timeout. Default flips None240.0 (and _drain_inflight at :496), a silent global change: every phase now abandons in-flight responses after 4 min. Put both defaults back to None. (Race-fix re-check is good — keep.)
6 dataset_manager/factory.py:90 api-bloat dataset_params.pop("force") invents a 2nd spelling for the existing force_regenerate kwarg — silent rename + two ways to do one thing.
7 evaluation/extractor.py:308 design FunctionCallExtractor Priority-3 free-text parser (~100 lines, self-admitted false positives) on a scoring path. Endpoints return structured tool_calls/JSON; recommend dropping the heuristic — a clean miss beats a wrong guess in a benchmark.

🔵 Consider

# File:Line Cat Summary
8 load_generator/sample_order.py:102 non-pythonic SequentialSampleOrder.__init__(*args, **kwargs) — loose signature accepting params it ignores; make it explicit.

Independent agreement with existing threads

The council independently reached the same conclusion as the open --accuracy-only thread (cli.py @arekay-nv / your TestMode.ACC-as-source-of-truth note): the flag is pure redundancy — _run does nothing but if accuracy_only: mode = TestMode.ACC, and everything downstream already keys off test_mode == TestMode.ACC. +1 to dropping the flag and keeping only the internal BenchmarkContext.accuracy_only property. (Not re-posted inline to avoid duplicate noise.)

Net read on the ask

The BFCL subsystem is cleanly self-contained; the friction is in the shared-code edits made to accommodate it. Most can shrink: drop the redundant flag (already flagged), drop the dead validity block (#1), keep seed/temp/required-columns changes off the shared adapters (#2-4), revert the drain-timeout default to None — no drain timeout (#5), and reconsider the text-extraction heuristic (#7). Genuinely-needed new surface: ModelParams.seed, ChatCompletionRequest.seed/tool_choice, Scorer.score_breakdown() hook, SequentialSampleOrder, stop_current_phase().

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need accuracy_only? Might be simpler to rely on the config to have perf/accuracy phases and run them as-is. This will create corner cases - what if the config has only perf, but we specify accuracy_only.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

( might need a larger refactor of execute.py though: #384 )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The corner case is guarded: _load_datasets raises InputValidationError when --accuracy-only is set with no accuracy dataset. TestMode.ACC stays an explicit mode because the compliance workflow runs accuracy standalone under forced single-stream (now persisted to config.yaml, 4ed5bf8). Fully driving scoring off the config's declared phases is the right longer-term shape — tracked in #387.

Comment thread src/inference_endpoint/commands/benchmark/execute.py
- execute.py: remove dead invalid_reasons block (entries never carry
  valid/invalid_reason; comprehension was always empty + KeyError-prone)
- session.py: revert PhaseConfig.drain_timeout and _drain_inflight defaults
  to None (no implicit drain bound); keep the post-clear inflight re-check
- openai_msgspec_adapter.py: drop the global temperature int-coercion;
  restore default ColumnFilter required_columns=["prompt"] (preset supplies
  its own filter for BFCL)
- openai_adapter.py: thread seed through to_endpoint_request so the
  non-msgspec path no longer silently drops model_params.seed
- factory.py: drop the "force" alias; use the canonical force_regenerate kwarg
- sample_order.py: give SequentialSampleOrder an explicit
  n_samples_in_dataset signature
Comment thread src/inference_endpoint/evaluation/bfcl_v4_execution.py Outdated

@arekay-nv arekay-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Reviewed by Codex (gpt-5.5) + Claude | Depth: thorough

8 inline findings below (2 high, 2 medium, 4 low). Full synthesis in the summary comment.

Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py
Comment thread src/inference_endpoint/dataset_manager/dataset.py
Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py
Comment thread scripts/plot_results.py
Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated
Comment thread src/inference_endpoint/compliance/checker.py Outdated
Comment thread src/inference_endpoint/evaluation/bfcl_v4_multi_turn_scorer.py
@arekay-nv

Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: thorough

Found 8 issues across 6 files (2 high, 2 medium, 4 low). The empty-events scorer crash was flagged independently by both reviewers. All line numbers verified against HEAD (2ff32c1); 3 candidate findings were dropped (1 failed verification, 2 duplicated existing threads).

🔴 Must Fix (high)

# File Line Category Reviewer(s) Summary
1 evaluation/bfcl_v4_scorer.py 247 bug Both Empty COMPLETE-events log → pd.DataFrame([]) has no columns → df["sample_uuid"] KeyError before the L305 zero-breakdown guard (dead code). Scoring crashes; results.json may not be written.
2 commands/benchmark/execute.py 841 data-integrity Codex --accuracy-only forces num_workers/max_connections=1 only on the live client, but config.yaml was already written (L399) with the original values. Compliance single_stream check (checker.py:142) reads the stale file → fails a run that actually ran single-stream.

🟡 Should Fix (medium)

# File Line Category Reviewer(s) Summary
3 dataset_manager/dataset.py 395 design Claude "Skip adapter ColumnFilter" heuristic papers over the adapter's hard required_columns=["prompt"]; a messages-schema dataset without a preset ColumnFilter still fails. Also drops all adapter ColumnFilters when any user transform is one.
4 evaluation/bfcl_v4_scorer.py 236 testing Claude score() / _score_ast / _score_hallucination / all 3 aggregation strategies — the math feeding the compliance gate — have no unit test (only get_outputs() is covered).

🔵 Consider (low)

# File Line Category Reviewer(s) Summary
5 scripts/plot_results.py 64 data-integrity Claude Multiple report_dir + one --out-dir → fixed-name PNGs clobber each other; only the last run's plots survive.
6 evaluation/bfcl_v4_scorer.py 178 design Claude _score_hallucination reaches into FunctionCallExtractor._try_parse_tool_calls_json (private cross-class); expose a public predicate.
7 compliance/checker.py 137 error-handling Claude Seed check conflates "no seed set" with "seed ≠ 42" in its FAIL message.
8 evaluation/bfcl_v4_multi_turn_scorer.py 171 bug Claude Unweighted subset-mean over-weights small subsets; untested (exploratory multi-turn path, #382).
Dropped after verification (3)

⚠️ Commit hygiene: 42 commits including ~17 apparent fixups. Consider squashing before merge.

…rds, compliance, nits)

- bfcl_v4_scorer.score(): guard empty/unmatched events log so an events
  file with no COMPLETE records (or none mapping to a known sample_uuid)
  emits a zero breakdown instead of KeyError; factor out _zero_breakdown()
  and drop the now-unreachable all_scores guard.
- execute.py: bake the accuracy-only single-stream override (num_workers=1,
  max_connections=1) into config before persisting config.yaml so the
  compliance gate sees the settings that actually ran; drop the redundant
  runtime override.
- extractor: add public has_native_tool_calls() and use it from the
  hallucination scorer instead of the private _try_parse_tool_calls_json.
- compliance/checker: distinguish "no seed set" from "seed != 42".
- bfcl_v4 dataset: document sample_pct=None (kept in full, not 0%) and warn
  when generate() returns an empty single-turn DataFrame.
- bfcl_v4_execution: coalesce duplicated assistant content assignment.
- multi_turn_scorer: document the unweighted per-subset mean trade-off.
- plot_results: write per-report subdirs when multiple report dirs share
  one --out-dir so fixed-name plots no longer clobber each other.
- tests: cover score()/_score_ast/_score_hallucination, sample-weighted
  aggregation, empty-events guards, and has_native_tool_calls.
@codecov-commenter

Copy link
Copy Markdown

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

The combined perf+accuracy run loads a repo-root-relative performance dataset
path, so it must be launched from the repo root; the README Step 5 and the
config header comment previously said to cd into the example dir, which breaks
dataset resolution. The inline-checker verify one-liner also read a top-level
`valid` key that the scorer never writes (it emits score/turns/domains/per_turn)
— derive validity from turns.missing instead.
Resolve conflicts:
- config/schema.py: keep both ScorerMethod members (BFCL_V4 + LEGACY_MLPERF_DEEPSEEK_R1)
- commands/benchmark/execute.py: keep BFCL score_breakdown() + entry storage, adopt main's richer completeness log line
- AGENTS.md: union Key Components table (main's DeepSeek-R1 row + BFCL Compliance row)
- config/templates/*_full.yaml: regenerated from schema
Comment thread scripts/bfcl_import_smoke.py Outdated
# The scorer guards `bfcl_eval` behind try/except → None on ImportError.
# With the extra installed these must resolve to the real objects, otherwise
# scoring silently degrades.
from inference_endpoint.evaluation import bfcl_v4_scorer

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should move the imports to the top of the file, lazy imports generally bad

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7542f63 — hoisted all four function-body imports (bfcl_eval plus the first-party bfcl_v4_scorer/bfcl_v4_execution/scoring modules) to the top of the file per the AGENTS.md no-lazy-imports rule. The smoke script is run in an env with the bfcl extra installed, so top-level imports are the intended behavior.

Comment thread scripts/bfcl_import_smoke.py Outdated
# The execution bridge imports a different bfcl-eval submodule (also guarded
# behind try/except → None), so check it resolved too — catches a partial
# bfcl-eval breakage that leaves the ast_checker path working.
from inference_endpoint.evaluation import bfcl_v4_execution

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly to this, lazy import

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed in 7542f63 — this import was hoisted to module top level with the others.

Comment thread scripts/bfcl_import_smoke.py Outdated
return 1

# The scorer/dataset are registered via their import side effects.
from inference_endpoint.evaluation.scoring import Scorer

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this as well import

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed in 7542f63 — hoisted to the top; Scorer.get("bfcl_v4") now relies on the top-level scoring import's registration side effect.

Remove throughput/latency/runtime figures (tok/s, TTFT, TPOT, per-turn
latency, ISL/OSL percentiles, wall-clock/runtime columns) that are
hardware-specific and not the reference target. Retain accuracy numbers
(86.23% overall, 87.96% normalized, per-category, IoU 0.6335) and the
reasoning ON vs OFF comparison used to justify running with reasoning off.
Address PR mlcommons#346 review: move the function-body imports (bfcl_eval and the
first-party bfcl_v4_scorer/bfcl_v4_execution/scoring modules) to the top of
the file per AGENTS.md no-lazy-imports rule. The smoke script is run in an
env with the `bfcl` extra installed, so a top-level import is the intended
behavior.

@arekay-nv arekay-nv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough. 15 verified findings posted as inline comments (summary comment follows).

if func_data is None:
return None

name = func_data.get("name", "")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] high (bug): func_data = item.get("function") is only checked for None — when the model emits JSON where function is not an object (e.g. [{"function": "foo"}]), func_data.get("name", "") raises AttributeError, which is not in the except (json.JSONDecodeError, ValueError, TypeError) list below, so one malformed response aborts BFCL scoring instead of falling through to the next parser. Guard with isinstance(func_data, dict) and return None like the other malformed-shape branches.

import json, pathlib
r = json.loads(pathlib.Path('results/edge_agentic_full_run/results.json').read_text())
print('Overall ST accuracy:',
r['accuracy_scores']['bfcl_v4::function_calling']['score']['overall_accuracy'], '%')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] high (documentation): This verify snippet (and its copy at line 324) indexes ['score']['overall_accuracy'], but the pipeline writes "score" as a scalar float (execute.py finalize_benchmark"score": score, where BFCLv4Scorer.score() returns a 0–1 fraction); the per-subset dict lives under the separate "breakdown" key. Running the documented command raises TypeError: 'float' object is not subscriptable. Use ['breakdown']['overall_accuracy'] (which is also the percentage the % suffix implies). The multi-turn snippet at line 254 is unaffected because the multi-turn CLI writes score as a dict — that shape asymmetry between the two results.json files is itself worth a note.

config = config.with_updates(
settings=config.settings.model_copy(
update={
"client": config.settings.client.with_updates(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Codex] medium (bug): Residual of the single-stream persistence fix (4ed5bf8): this override bakes num_workers=1, max_connections=1 into the persisted config.yaml, but leaves load_pattern.target_concurrency untouched. check_config_lock also gates on target_concurrency == 1 (compliance/checker.py:154-156), so an --accuracy-only run from a combined config declaring target_concurrency > 1 still fails the single_stream compliance check even though the client is forced to one connection. Normalize (or clear) the load-pattern concurrency here too, matching this comment's stated intent that the written config.yaml matches what actually runs.

inference-endpoint benchmark from-config \
--config online_edge_full_run.yaml \
--accuracy-only \
--seed 42

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (documentation): benchmark from-config does not accept --seed: the handler (commands/benchmark/cli.py from_config) only defines config, timeout, mode, accuracy-only, and report-dir. Reproduced: cyclopts fails with UnknownOptionError: Unknown option: "--seed". For from-config runs the seed must come from the YAML (model_params.seed: 42, already set in online_edge_full_run.yaml). Either fix this example to point at the YAML key or add a seed override to from_config.

message = choice["message"]
except (KeyError, IndexError) as exc:
logger.warning("Malformed response: %s", exc)
return None, None, None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] medium (error-handling): A structurally malformed 200 body (proxy error page, {"error": ...} payload, empty choices) returns (None, None, None) here, which process_response treats as "no tool calls" (bfcl_v4_execution.py:192-199) — it appends a fabricated empty assistant message and advances the turn, so the entry is scored as a normal completion with degraded accuracy. That's inconsistent with the transport-error path, where _send_request returning None force-terminates the entry and the scorer marks it force_terminated. A malformed body is the same failure class as an HTTP 500 and should force-terminate (or at least be surfaced distinctly); otherwise a flaky gateway silently degrades reported accuracy.

if score is None:
return [Check("accuracy_results_present", False, "no accuracy score found")]

for golden_key, result_key in _ACCURACY_METRIC_KEYS.items():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (error-handling): If no _ACCURACY_METRIC_KEYS key intersects the model's golden_accuracy (every non-Qwen model in models.py), this loop continues on every iteration and check_accuracy returns zero accuracy:* checks — ComplianceReport.passed (an all() over whatever checks exist) then reports PASS based on config-lock alone for an accuracy results.json. "No applicable accuracy metric" should be an explicit FAIL/warning check, mirroring the accuracy_results_present failure above, rather than an empty list.

a combined run whose performance phase hits its ``max_duration`` cap
still proceeds to the accuracy phase.
"""
self._current_phase_stopped = True

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (bug): Residual of the 25f84a7 perf-timeout fix: unlike stop(), stop_current_phase does not set self._drain_event, and _drain_inflight only consults _stop_requested — so if the max_duration_ms cap fires while the performance phase is already inside its drain wait (strategy task finished and reset to None), the callback is a no-op. With the default performance_timeout_s: 240 that only stretches the cap by up to 4 minutes, but with the documented drain.performance_timeout_s: null ("wait indefinitely") a hung request keeps the run alive indefinitely despite the cap. Consider setting _drain_event here too, or having _drain_inflight observe _current_phase_stopped.

]
return self.turns[turn_idx]

def get_tools_for_turn(self, turn_idx: int) -> list[dict[str, Any]]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (design): Dead/duplicated multi-turn plumbing that will drift: (1) get_tools_for_turn is never called anywhere in src/ or tests — the execution bridge implements its own holdout-tool logic in get_initial_request — and excluded_function is likewise stored but never read. (2) MULTI_TURN_SUBSETS is defined identically here and in the package __init__.py; if the two copies diverge, generate()'s multi-turn filtering and load_multi_turn_entries()/scorer aggregation silently disagree about what "multi-turn" means. Same pattern for DEFAULT_MAX_STEPS_PER_TURN = 25 duplicated in bfcl_v4_execution.py:47 and bfcl_v4_multi_turn_runner.py:40.

widths = [(hi - lo) for lo, hi in dist.hist_buckets]
axes[1].bar(
centers,
dist.hist_counts[: len(centers)],

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (bug): The length-mismatch guard is one-directional: dist.hist_counts[: len(centers)] truncates extra counts, but counts and buckets are extracted independently (hist.get("counts") / hist.get("buckets") in extract_distribution, no length validation), so a shorter hist_counts makes axes[1].bar(centers, ...) raise a shape-mismatch ValueError — and the per-plot loop in generate_plots doesn't catch it, so one malformed histogram aborts plotting for the whole report dir. Validate both arrays to a common length in extract_distribution, or slice both sides.

}
if model_params.streaming == StreamingMode.ON:
metadata["stream"] = True
if model_params.max_new_tokens is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude] low (design): Dead guard: ModelParams.max_new_tokens is non-optional with a default (max_new_tokens: Annotated[int, ...] = 1024, config/schema.py), so is not None is always true and every request always carries max_completion_tokens. The apparent intent — omit the token cap when unset, symmetric with the stream guard above — is unreachable through the schema. Either make the field int | None so the omission path is reachable, or drop the guard so the code doesn't suggest a configuration knob that doesn't exist.

@arekay-nv

Copy link
Copy Markdown
Collaborator

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 15 issues across 12 files (17 candidates; 2 dropped in synthesis — numpy floor-pin trade-off is already documented in-code with uv.lock preserving reproducibility, and the runner-docstring/httpx mismatch is covered by the existing sample-issuer thread → #382). Every finding was line-verified against HEAD (7542f63) and deduplicated against the 114 existing inline comments; the two "residual" items below are new gaps left by fixes to earlier review comments, not repeats. Upstream bfcl-eval claims (eval() denylist, globals() instance cache) were verified against the gorilla repo source. uv build --no-sources passes; the legacy_mlperf_deepseek_r1 subproject remains correctly excluded from the wheel.

🔴 Must Fix (critical/high)

Issues that will cause incorrect behavior in normal usage.

# File Line Category Reviewer(s) Summary
1 src/inference_endpoint/evaluation/extractor.py 441 bug Codex Non-dict function payload (e.g. [{"function":"foo"}]) raises uncaught AttributeError — one malformed model response aborts BFCL scoring
2 examples/11_Edge_Agentic_Example/README.md 246 documentation Claude Verify snippets (l.246, l.324) index ['score']['overall_accuracy'] but score is a scalar float → TypeError; correct path is ['breakdown']['overall_accuracy']

🟡 Should Fix (medium)

Real issues that trigger under specific conditions or flaws that will compound.

# File Line Category Reviewer(s) Summary
3 src/inference_endpoint/commands/benchmark/execute.py 397 bug Codex Residual of 4ed5bf8: accuracy-only override persists num_workers/max_connections but not load_pattern.target_concurrency → valid run fails single_stream compliance
4 examples/11_Edge_Agentic_Example/README.md 386 documentation Claude from-config doesn't accept --seed (reproduced UnknownOptionError); seed must come from YAML
5 src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py 236 error-handling Claude Malformed 200 body is scored as "model done with turn" instead of force-terminated like transport errors — flaky gateway silently degrades accuracy
6 src/inference_endpoint/evaluation/bfcl_v4_execution.py 235 security Claude Endpoint-controlled tool-call strings reach bfcl-eval's eval() behind a bypassable 8-name denylist → RCE from an untrusted --endpoint
7 src/inference_endpoint/evaluation/bfcl_v4_execution.py 316 bug Claude cleanup() never frees bfcl-eval's globals() instance cache → silent state reuse on same-entry re-runs + unbounded growth over ~800 entries
8 src/inference_endpoint/evaluation/bfcl_v4_scorer.py 204 bug Claude live_relevance scored with inverted semantics (credit for NOT calling) and the backwards number is published under unscored_subsets
9 src/inference_endpoint/evaluation/bfcl_v4_multi_turn_scorer.py 59 testing Claude BFCLv4MultiTurnScorer.score() entirely untested (force-terminated, turn-mismatch, checker-exception, aggregation branches)
10 src/inference_endpoint/dataset_manager/predefined/bfcl_v4/__init__.py 180 testing Claude generate() — the frozen MLPerf gate composition — has zero tests for its sampling/filtering logic

🔵 Consider (low)

Valid improvements that could be follow-ups.

# File Line Category Reviewer(s) Summary
11 src/inference_endpoint/compliance/checker.py 187 error-handling Claude Model with no matching golden-accuracy keys yields zero accuracy:* checks → gate silently passes on config-lock alone
12 src/inference_endpoint/load_generator/session.py 380 bug Claude Residual of 25f84a7: stop_current_phase doesn't set _drain_event → perf cap is a no-op once the phase is draining (unbounded if performance_timeout_s: null)
13 src/inference_endpoint/dataset_manager/predefined/bfcl_v4/multi_turn.py 120 design Claude Dead get_tools_for_turn/excluded_function; MULTI_TURN_SUBSETS and DEFAULT_MAX_STEPS_PER_TURN each defined twice
14 src/inference_endpoint/metrics/results_plots.py 358 bug Claude Counts/buckets mismatch guarded in one direction only — shorter hist_counts raises and aborts plotting for the report dir
15 src/inference_endpoint/openai/openai_msgspec_adapter.py 88 design Claude max_new_tokens is not None guard is dead — schema field is non-optional, so the omission path it implies is unreachable

⚠️ Commit hygiene: This PR has 48 commits including ~20 apparent fixups. Prior discussion noted the incremental review-checkpoint commits are intentional — flagging only for the merge decision (squash recommended).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants