feat(bfcl): add BFCL v4 edge-agentic accuracy + performance integration by Palanivelg · Pull Request #346 · mlcommons/endpoints

Palanivelg · 2026-06-08T17:06:02Z

What does this PR do?

Integrates the edge-agentic example (BFCL-v4 as the accuracy set): single-turn + multi-turn pipelines, datasets, adapters, and a reproducible run script. See examples/10_Edge_Agentic_Example/README.md.

Type of change

Bug fix
[x ] New feature
Documentation update
Refactor/cleanup

Related issues

Self-contained: this PR includes the numpy relaxation (numpy>=1.26.4) required by bfcl-eval (which hard-pins numpy==1.26.4), so it no longer depends on a separate prerequisite PR. The [bfcl] extra is isolated via [tool.uv].conflicts so its old pins don't constrain the shared tooling deps. This supersedes #345 (closed as redundant).

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-06-08T17:06:16Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request adds support for Berkeley Function Calling Leaderboard (BFCL) v4 accuracy benchmarking, introducing single-turn and multi-turn evaluation pipelines, datasets, and adapters, along with sequential sample ordering for deterministic evaluation. The code review feedback identifies three key improvements: moving the tool_calls parsing loop inside the try-except block in bfcl_v4_execution.py to prevent crashes on invalid JSON, conditionally including tools and tool_choice in the request payload in bfcl_v4_multi_turn_runner.py to avoid API errors when no tools are present, and guarding the n_repeats calculation in bfcl_v4_scorer.py to ensure it does not evaluate to zero if some samples fail.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Integrate Berkeley Function Calling Leaderboard (BFCL) v4 evaluation into the accuracy pipeline. Covers both single-turn function-calling subsets (non_live, live, hallucination) and the agentic multi-turn subsets (multi_turn_base + variants), validated against evalscope. Single-turn (drop-in scorer): - BFCLv4 predefined dataset (categories=[non_live|live|hallucination], configurable sample_pct) plus a default preset. - BFCLv4Scorer wires bfcl-eval's ast_checker into the standard accuracy phase via the existing scorer registry. - FunctionCallExtractor: normalizes native tool_calls JSON, JSON arrays of function-call objects, and text-form function calls into the canonical BFCL input format. - openai_msgspec_adapter now: * passes tool_choice through, * emits tool_calls verbatim as output_text for scoring, * coerces whole-number temperatures (vLLM strictness), * makes max_completion_tokens optional and uses a permissive ColumnFilter so messages+tools datasets pass through. Multi-turn (agentic loop, outside the standard scorer): - BFCLExecutionBridge: parses tool_calls, executes them locally against bfcl-eval's class instances, and constructs the tool response messages for the next turn. - BFCLMultiTurnRunner: drives the per-entry agentic conversation via httpx (bounded by max_steps_per_turn / timeout_s). - BFCLv4MultiTurnScorer: invokes bfcl-eval's multi_turn_checker. - bfcl_v4_multi_turn_cli: standalone CLI for the multi-turn flow. Supporting plumbing: - SequentialSampleOrder + `sequential=` flag on create_sample_order, used by accuracy phases so ordering matches reference runs. - BenchmarkConfig.Dataset.params: dict for dataset-specific kwargs (e.g. categories, sample_pct) plumbed through DataLoaderFactory. - ScorerMethod.BFCL_V4. - Dataset.load() preserves user-provided ColumnFilters when the adapter would otherwise inject a conflicting one. - `--accuracy-only` benchmark flag: skip the performance phase entirely (forces num_workers=1, max_connections=1 for deterministic per-sample ordering). Optional dep: `pip install -e ".[bfcl]"` (`bfcl-eval==2026.3.23`). The top-level numpy pin must be relaxed to `>=1.26.4` because bfcl-eval hard-pins `numpy==1.26.4` — shipped as a separate prerequisite commit on the chore/relax-numpy-pin branch. Validation (Qwen3.6-27B-Q4_K_M, temperature=0): - Single-turn live (10%): ~82% accuracy. - Multi-turn base (full 200): 140/200 = 70.00%, exact parity (100% match, 0 mismatches) with evalscope on identical inputs.

Add a committed config that reproduces the validated <3h BFCL v4 accuracy run on an embedded device (Thor). Dataset (BFCLv4.generate): - category_sample_pct: per-category sampling rates (e.g. non_live 20 / live 10 / hallucination 5), resolved per subset via SUBSET_TO_CATEGORY. - subset_floor: subsets whose TOTAL size is <= floor are taken in full, preventing tiny subsets (live_parallel=16, live_parallel_multiple=24) from collapsing to one or two noisy samples. Selection stays deterministic (head(n)). Plain sample_pct behavior is unchanged. Multi-turn CLI: - --sample-pct: deterministic per-subset sub-sampling so a 3% (~24 entry) multi-turn run is reproducible. Defaults to all entries. examples/10_BFCLv4_Example/: - offline_bfcl_v4_single_turn.yaml: single-turn (non_live + live + hallucination) accuracy config, run with --accuracy-only. - README.md: documents the two run paths (single-turn YAML vs multi-turn CLI), the per-category sampling + floor, and the ~2h49m Thor budget.

The example and docstring referenced a specific device; reword to the generic "edge device" since the <3h budget applies to embedded targets broadly.

In --accuracy-only runs there is no performance phase, so ctx.rt_settings is None. _run_benchmark_async read ctx.rt_settings.max_duration_ms unconditionally, raising AttributeError at session setup. The global timeout only applies to the performance phase, so default max_duration_ms to None when rt_settings is absent.

--accuracy-only forces a single connection for deterministic sample ordering, which serializes the offline MAX_THROUGHPUT burst. For large accuracy datasets the sequential processing time exceeds the hardcoded 240 s drain cap, so the phase aborted mid-run dropping in-flight samples. Make drain_timeout a per-phase field defaulting to 240.0 (performance phases unchanged). Accuracy phases pass None to drain unbounded, since every sample must complete; a dropped transport still unblocks the wait via the _receive_responses close path. Re-check inflight after clear() to close a completion/clear race on the unbounded path.

The msgspec adapter serialized tool_calls into `output` AND kept them in the structured `tool_calls` field. TextModelOutput.__str__ then appended the tool_calls a second time, producing duplicated, malformed JSON (`[{...}][{...}]`) that FunctionCallExtractor could not parse. This made every single-turn function-calling subset score 0% (and gave hallucination subsets a spurious 100%). Keep `output` as the textual content only; the structured tool_calls field is the single source serialized once by __str__. This matches the non-msgspec OpenAI adapter and the streaming accumulator, which already did this. Multi-turn is unaffected: it uses a separate httpx runner that reads structured tool_calls from the raw response and never touches TextModelOutput.

…mbles When a model emits a prose preamble alongside a tool call (e.g. "To compute this, I'll call...\n[{...}]"), str(TextModelOutput) prefixes the tool-call JSON with that text, so the function-call parser's json.loads fast path fails and the sample scores 0. Override BFCLv4Scorer.get_outputs to prefer the structured tool_calls field when present (the function call is the answer; the prose is chatter), falling back to the full string for plain-text responses such as hallucination refusals. Verified deterministic across repeated fresh-server single-turn runs.

Some servers (e.g. trtllm-serve on edge devices) stall when tools are present but tool_choice is omitted, relying on a server default. Set tool_choice="auto" explicitly on each single-turn sample and pass it through the function_calling preset's ColumnFilter so it reaches the request payload. Multi-turn already sends tool_choice="auto" via its dedicated runner, so this only affects the single-turn path.

Add ModelParams.seed field and propagate it to the OpenAI wire format so runs can be made reproducible: - config/schema.py: add seed field to ModelParams - openai/types.py: add seed field to ChatCompletionRequest - openai/openai_adapter.py: include seed in metadata dict - openai/openai_msgspec_adapter.py: include seed in metadata dict and ChatCompletionRequest construction - evaluation/bfcl_v4_multi_turn_runner.py: accept seed param; inject payload["seed"] when set - evaluation/bfcl_v4_multi_turn_cli.py: expose --seed CLI arg and pass it to BFCLMultiTurnRunner - commands/benchmark/cli.py: expose --seed and --report-dir overrides on the from-config subcommand - tests: unit coverage for seed propagation in msgspec adapter, multi-turn runner, and from-config CLI

Expand examples/10_BFCLv4_Example/README.md: - Add a "Reproducing from the PRs" section explaining that PR #1 (numpy pin) is a prerequisite for PR #2 to install [bfcl] - Show how to check out and install from the branches - Document --seed flag for both single-turn (from-config) and multi-turn CLI paths - Replace placeholder accuracy numbers with confirmed Thor validation results (Qwen3.6-27B-Q4_K_M, temperature=0, 456 ST samples): non_live 86.98%, live 84.12%, hallucination 94.32%, overall 87.50% (both seed runs identical); MT base 70.00% (exact evalscope parity) - Add output file paths and a quick sanity-check script

Replace the terse reference doc with a numbered walkthrough that works for someone unfamiliar with the endpoints repo: - What is this / What is the endpoints repo - Step 0: prerequisites including a llama.cpp Docker quick-start - Step 1: install from the two PRs with conflict explanation - Step 2: run single-turn (with YAML config notes) - Step 3: run multi-turn - Step 4: verify results with one-liners - Seed reproducibility section - Reference results table (Thor, two seed runs, evalscope parity)

- Fix memory requirement: ~24 GB (not 16 GB) for the Q4 GGUF + KV cache - Replace generic Docker quick-start with Thor-specific llama.cpp native build instructions (Docker CUDA images don't target sm_110/aarch64-SBSA) - Add x86_64 Docker quick-start in a collapsible details block - Fix Step 4 result path: results.json under accuracy_scores key, not a separate accuracy_scores.json; add report.txt note - Add server-side determinism note (--seed 42, -np 1 on llama-server) - Replace placeholder MT numbers with actual sampled-run Thor results: multi_turn_base 66.67% (4/6), miss_func 33.33% (2/6), miss_param 16.67% (1/6), long_context 66.67% (4/6), overall 45.84% (24 entries) — identical across both seed runs - Separate full multi_turn_base parity result (140/200 = 70.00%) into its own subsection to avoid conflating sampled and full-set numbers - Update wall-clock: ~82 min ST + ~64 min MT ≈ 2.4–2.5 h total

bfcl-eval's Qwen model handler transitively imports qwen_agent which requires soundfile; without it the import fails on Thor and any machine where soundfile is not already installed.

… run_accuracy.sh Renames examples/10_BFCLv4_Example to examples/10_Edge_Agentic_Example to align with the MLPerf edge-agentic submission category name. Adds run_accuracy.sh — a single script that reproduces both single-turn and multi-turn reference accuracy numbers end-to-end with the exact validated parameters (sampling rates, temperature=0, seed=42, max-steps-per-turn=25). Updates README to lead with the one-liner quick-start referencing the script, fixes the install instructions to point to mlcommons/endpoints (not the fork), adds --seed and --max-steps-per-turn to the Step 3 MT snippet, and corrects the internal path reference in online_agentic_coding_perf.yaml.

- bfcl_v4_execution: move tool-call argument JSON parsing inside the try-except block so json.JSONDecodeError from malformed model output (common on small/quantized models) is caught and handled gracefully rather than crashing the evaluation run. - bfcl_v4_multi_turn_runner: only include tools/tool_choice in the request payload when the tools list is non-empty, avoiding 400 errors on endpoints that reject tool_choice without accompanying tools. - bfcl_v4_scorer: guard n_repeats with max(1, ...) so a partial run where fewer samples completed than num_samples() does not produce a zero divisor and incorrect reporting.

- Relax base numpy to >=1.26.4 so bfcl-eval's numpy==1.26.4 pin resolves; regenerate uv.lock. - Regenerate stale *_template_full.yaml config templates after schema change. - Fix mypy: annotate tool_calls/tool_call_ids and narrow Optional messages/tools in the multi-turn runner; mark BFCLv4Scorer.score override. - Isolate the bfcl extra via [tool.uv].conflicts and add patched filelock/virtualenv floors to the dev extra so bfcl-eval's filelock==3.20.0 pin no longer drags shared tooling deps into CVE versions (CVE-2025-68146, CVE-2026-22701, CVE-2026-22702).

The accuracy phase hardcoded drain_timeout=None, which ignored a user-configured DrainConfig.accuracy_timeout_s and failed test_configured_drain_timeouts_propagate_to_phases. accuracy_timeout_s already defaults to None (unbounded), so reading it preserves the unbounded default while honoring an explicit timeout.

from-config has no --model-params.name / --endpoint-config.endpoints overrides, so the script errored immediately. Render a temp YAML with MODEL/ENDPOINT substituted into the committed config (anchored on the "# set to your ..." comments) so the one-liner still works without editing the tracked file.

viraatc

A few integration questions from this review.

viraatc

A few follow-up integration comments.

…compliance + nits) - session: add stop_current_phase() and have the perf max_duration cap call it instead of session.stop(), so a combined perf+accuracy run still runs accuracy when the performance phase hits its time cap (was silently skipped). The stop-check observes a per-phase flag (reset each phase) so polling strategies stop issuing, and the strategy task is cancelled to interrupt a blocked await. Adds a regression test (perf timeout -> accuracy still completes). - compliance: all_turns_observed now compares observed to expected (scorable) turns rather than issued; a dataset with 1007 issued / 1006 scorable no longer fails a valid run. Tests updated to carry expected and encode that case. - schema: rename ambiguous Dataset.params -> generate_params (feeds generate()); regenerated templates and updated the example config. - dataset: elaborate why only the adapter's ColumnFilter is dropped when a preset supplies its own (other adapter transforms still apply). - example yaml: drop the "do not change" noise; annotate the BFCL gate block once as fixed and keep per-field notes only where users tune values. - test: assert the from-config seed override is meaningful (seed != 42 before).

…only Drop the `accuracy_only` bool threaded through run_benchmark/setup_benchmark/ _load_datasets; derive it on BenchmarkContext from `test_mode == TestMode.ACC`. `--accuracy-only` remains a CLI convenience alias that resolves to TestMode.ACC. Coalesce the dataset-presence guards in _load_datasets around the resolved mode. Move the `seed` Cyclopts alias onto ModelParams (`--seed`) so offline/online expose it via normal model-field unwrapping, and remove the bespoke from-config `--seed` override flag (set seed in the YAML for from-config runs).

…ed breakdown BFCLv4Scorer.score() now returns the scalar overall accuracy (fraction) like every other Scorer, dropping the `# type: ignore[override]`. The per-subset / per-category detail is cached and exposed via a new optional `Scorer.score_breakdown()` hook, and finalize writes it to results.json under a dedicated `breakdown` key (numeric percentages, not formatted strings). Consumers (compliance gate, results plotting, publish helper) read the typed `breakdown` block instead of special-casing a score-that-might-be-a-dict; the string-coercion paths are kept only as defensive fallbacks for older artifacts. Compliance also now loads config.yaml via BenchmarkConfig.from_yaml_file() (schema validation + env resolution + discriminated union) rather than a raw yaml.safe_load, with test fixtures updated to write full schema-valid configs.

viraatc

Review Council — integration & minimalism pass

Placeholder; summary posted separately.

viraatc · 2026-06-30T21:16:38Z

        return default
+
+
+class FunctionCallExtractor(Extractor, extractor_id="function_call_extractor"):


[Council: integration review] medium (design): FunctionCallExtractor Priority 3 (_try_parse_text_function_calls, ~100 lines: _SKIP_NAMES frozenset of Python builtins, balanced-paren scanner, ast.parse("dict(...)") arg parsing) is a heavy heuristic on a scoring path, and is reachable (bfcl_v4_scorer.py:252 calls self.extractor.extract(...)). Its own docstring concedes it "may produce false positives from explanatory text or fail on nested parentheses" — i.e. it can silently mis-score. Given BFCL endpoints return structured tool_calls (Priority 1) or a JSON array (Priority 2), is the free-text fallback needed at all? Recommend dropping Priority 3 (return default after the JSON paths) unless you can show a real endpoint that only emits prose function calls — for a benchmark, a clean miss is safer than a heuristic guess. This is the largest single chunk of new logic on the integration surface.

Keeping Priority 3 as the last-resort recall path. For the validated Qwen3.6 runs, structured tool_calls (Priority 1) dominate, but some models/quants still emit text-form calls, and dropping it would lower recall with no accuracy upside on the compliant path.

viraatc · 2026-06-30T21:17:06Z

Review Council — integration & minimalism pass

Reviewers: Claude (full integration-surface pass) + Codex (timed out at max-effort before emitting findings; not counted). Depth: thorough, scoped per the maintainer's ask — is each new line / API change / new member on an existing class actually needed? Focus was the integration surface (changes to pre-existing files), not the self-contained BFCL internals. All findings verified against HEAD; deduped against the 64 existing comments.

8 new inline findings posted (1 high, 6 medium, 1 low). Findings already raised by you/@arekay-nv were intentionally skipped.

🔴 Must fix

#	File:Line	Cat	Summary
1	`commands/benchmark/execute.py:1246`	bug	`invalid_reasons` block is dead code — reads `valid`/`invalid_reason` keys no scorer ever puts on the entry dict; always empty, and latently `KeyError`-prone if they were.

🟡 Should fix

#	File:Line	Cat	Summary
2	`openai/openai_adapter.py:53`	bug	`seed` added to metadata but `to_endpoint_request` never sends it → seed silently dropped on this adapter (msgspec path works; compliance checker asserts a seed that wasn't sent).
3	`openai/openai_msgspec_adapter.py:113`	design	`required_columns=["prompt"]` → `[]` weakens load-time validation for all msgspec datasets. Gratuitous — BFCL preset supplies its own filter that dataset.py already preserves.
4	`openai/openai_msgspec_adapter.py:76`	design	`temperature` int-coercion runs on every request, no test/reason; no-op given `float
5	`load_generator/session.py:96`	design	Revert — there should be no drain timeout. Default flips `None`→`240.0` (and `_drain_inflight` at :496), a silent global change: every phase now abandons in-flight responses after 4 min. Put both defaults back to `None`. (Race-fix re-check is good — keep.)
6	`dataset_manager/factory.py:90`	api-bloat	`dataset_params.pop("force")` invents a 2nd spelling for the existing `force_regenerate` kwarg — silent rename + two ways to do one thing.
7	`evaluation/extractor.py:308`	design	`FunctionCallExtractor` Priority-3 free-text parser (~100 lines, self-admitted false positives) on a scoring path. Endpoints return structured `tool_calls`/JSON; recommend dropping the heuristic — a clean miss beats a wrong guess in a benchmark.

🔵 Consider

#	File:Line	Cat	Summary
8	`load_generator/sample_order.py:102`	non-pythonic	`SequentialSampleOrder.__init__(args, *kwargs)` — loose signature accepting params it ignores; make it explicit.

Independent agreement with existing threads

The council independently reached the same conclusion as the open --accuracy-only thread (cli.py @arekay-nv / your TestMode.ACC-as-source-of-truth note): the flag is pure redundancy — _run does nothing but if accuracy_only: mode = TestMode.ACC, and everything downstream already keys off test_mode == TestMode.ACC. +1 to dropping the flag and keeping only the internal BenchmarkContext.accuracy_only property. (Not re-posted inline to avoid duplicate noise.)

Net read on the ask

The BFCL subsystem is cleanly self-contained; the friction is in the shared-code edits made to accommodate it. Most can shrink: drop the redundant flag (already flagged), drop the dead validity block (#1), keep seed/temp/required-columns changes off the shared adapters (#2-4), revert the drain-timeout default to None — no drain timeout (#5), and reconsider the text-extraction heuristic (#7). Genuinely-needed new surface: ModelParams.seed, ChatCompletionRequest.seed/tool_choice, Scorer.score_breakdown() hook, SequentialSampleOrder, stop_current_phase().

arekay-nv · 2026-06-30T21:38:29Z

Do we really need accuracy_only? Might be simpler to rely on the config to have perf/accuracy phases and run them as-is. This will create corner cases - what if the config has only perf, but we specify accuracy_only.

+1

( might need a larger refactor of execute.py though: #384 )

The corner case is guarded: _load_datasets raises InputValidationError when --accuracy-only is set with no accuracy dataset. TestMode.ACC stays an explicit mode because the compliance workflow runs accuracy standalone under forced single-stream (now persisted to config.yaml, 4ed5bf8). Fully driving scoring off the config's declared phases is the right longer-term shape — tracked in #387.

- execute.py: remove dead invalid_reasons block (entries never carry valid/invalid_reason; comprehension was always empty + KeyError-prone) - session.py: revert PhaseConfig.drain_timeout and _drain_inflight defaults to None (no implicit drain bound); keep the post-clear inflight re-check - openai_msgspec_adapter.py: drop the global temperature int-coercion; restore default ColumnFilter required_columns=["prompt"] (preset supplies its own filter for BFCL) - openai_adapter.py: thread seed through to_endpoint_request so the non-msgspec path no longer silently drops model_params.seed - factory.py: drop the "force" alias; use the canonical force_regenerate kwarg - sample_order.py: give SequentialSampleOrder an explicit n_samples_in_dataset signature

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by Codex (gpt-5.5) + Claude | Depth: thorough

8 inline findings below (2 high, 2 medium, 4 low). Full synthesis in the summary comment.

arekay-nv · 2026-07-01T01:18:59Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: thorough

Found 8 issues across 6 files (2 high, 2 medium, 4 low). The empty-events scorer crash was flagged independently by both reviewers. All line numbers verified against HEAD (2ff32c1); 3 candidate findings were dropped (1 failed verification, 2 duplicated existing threads).

🔴 Must Fix (high)

#	File	Line	Category	Reviewer(s)	Summary
1	`evaluation/bfcl_v4_scorer.py`	247	bug	Both	Empty COMPLETE-events log → `pd.DataFrame([])` has no columns → `df["sample_uuid"]` `KeyError` before the L305 zero-breakdown guard (dead code). Scoring crashes; `results.json` may not be written.
2	`commands/benchmark/execute.py`	841	data-integrity	Codex	`--accuracy-only` forces `num_workers/max_connections=1` only on the live client, but `config.yaml` was already written (L399) with the original values. Compliance `single_stream` check (`checker.py:142`) reads the stale file → fails a run that actually ran single-stream.

🟡 Should Fix (medium)

#	File	Line	Category	Reviewer(s)	Summary
3	`dataset_manager/dataset.py`	395	design	Claude	"Skip adapter ColumnFilter" heuristic papers over the adapter's hard `required_columns=["prompt"]`; a messages-schema dataset without a preset ColumnFilter still fails. Also drops all adapter ColumnFilters when any user transform is one.
4	`evaluation/bfcl_v4_scorer.py`	236	testing	Claude	`score()` / `_score_ast` / `_score_hallucination` / all 3 aggregation strategies — the math feeding the compliance gate — have no unit test (only `get_outputs()` is covered).

🔵 Consider (low)

#	File	Line	Category	Reviewer(s)	Summary
5	`scripts/plot_results.py`	64	data-integrity	Claude	Multiple `report_dir` + one `--out-dir` → fixed-name PNGs clobber each other; only the last run's plots survive.
6	`evaluation/bfcl_v4_scorer.py`	178	design	Claude	`_score_hallucination` reaches into `FunctionCallExtractor._try_parse_tool_calls_json` (private cross-class); expose a public predicate.
7	`compliance/checker.py`	137	error-handling	Claude	Seed check conflates "no seed set" with "seed ≠ 42" in its FAIL message.
8	`evaluation/bfcl_v4_multi_turn_scorer.py`	171	bug	Claude	Unweighted subset-mean over-weights small subsets; untested (exploratory multi-turn path, #382).

Dropped after verification (3)

factory.py:90 — Claude claimed force_regenerate is popped and would collide; actual code pops "force" correctly. Failed verification.
bfcl_v4_multi_turn_runner.py:30 (httpx in production) — already covered by open thread Make results compatible to MLPerf submission checker for 6.0 #29 / resolved Documentation: Add example of how to ramp up vLLM/SGLang endpoints for benchmarking #11.
checker.py:268 (benchmark_rulesets read twice) — minor, adjacent to a resolved thread.

⚠️ Commit hygiene: 42 commits including ~17 apparent fixups. Consider squashing before merge.

…rds, compliance, nits) - bfcl_v4_scorer.score(): guard empty/unmatched events log so an events file with no COMPLETE records (or none mapping to a known sample_uuid) emits a zero breakdown instead of KeyError; factor out _zero_breakdown() and drop the now-unreachable all_scores guard. - execute.py: bake the accuracy-only single-stream override (num_workers=1, max_connections=1) into config before persisting config.yaml so the compliance gate sees the settings that actually ran; drop the redundant runtime override. - extractor: add public has_native_tool_calls() and use it from the hallucination scorer instead of the private _try_parse_tool_calls_json. - compliance/checker: distinguish "no seed set" from "seed != 42". - bfcl_v4 dataset: document sample_pct=None (kept in full, not 0%) and warn when generate() returns an empty single-turn DataFrame. - bfcl_v4_execution: coalesce duplicated assistant content assignment. - multi_turn_scorer: document the unweighted per-subset mean trade-off. - plot_results: write per-report subdirs when multiple report dirs share one --out-dir so fixed-name plots no longer clobber each other. - tests: cover score()/_score_ast/_score_hallucination, sample-weighted aggregation, empty-events guards, and has_native_tool_calls.

codecov-commenter · 2026-07-01T19:04:02Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

The combined perf+accuracy run loads a repo-root-relative performance dataset path, so it must be launched from the repo root; the README Step 5 and the config header comment previously said to cd into the example dir, which breaks dataset resolution. The inline-checker verify one-liner also read a top-level `valid` key that the scorer never writes (it emits score/turns/domains/per_turn) — derive validity from turns.missing instead.

Resolve conflicts: - config/schema.py: keep both ScorerMethod members (BFCL_V4 + LEGACY_MLPERF_DEEPSEEK_R1) - commands/benchmark/execute.py: keep BFCL score_breakdown() + entry storage, adopt main's richer completeness log line - AGENTS.md: union Key Components table (main's DeepSeek-R1 row + BFCL Compliance row) - config/templates/*_full.yaml: regenerated from schema

leopck · 2026-07-02T21:51:03Z

+    # The scorer guards `bfcl_eval` behind try/except → None on ImportError.
+    # With the extra installed these must resolve to the real objects, otherwise
+    # scoring silently degrades.
+    from inference_endpoint.evaluation import bfcl_v4_scorer


we should move the imports to the top of the file, lazy imports generally bad

Fixed in 7542f63 — hoisted all four function-body imports (bfcl_eval plus the first-party bfcl_v4_scorer/bfcl_v4_execution/scoring modules) to the top of the file per the AGENTS.md no-lazy-imports rule. The smoke script is run in an env with the bfcl extra installed, so top-level imports are the intended behavior.

leopck · 2026-07-02T21:51:18Z

+    # The execution bridge imports a different bfcl-eval submodule (also guarded
+    # behind try/except → None), so check it resolved too — catches a partial
+    # bfcl-eval breakage that leaves the ast_checker path working.
+    from inference_endpoint.evaluation import bfcl_v4_execution


similarly to this, lazy import

Also fixed in 7542f63 — this import was hoisted to module top level with the others.

leopck · 2026-07-02T21:51:33Z

+        return 1
+
+    # The scorer/dataset are registered via their import side effects.
+    from inference_endpoint.evaluation.scoring import Scorer


this as well import

Also fixed in 7542f63 — hoisted to the top; Scorer.get("bfcl_v4") now relies on the top-level scoring import's registration side effect.

Remove throughput/latency/runtime figures (tok/s, TTFT, TPOT, per-turn latency, ISL/OSL percentiles, wall-clock/runtime columns) that are hardware-specific and not the reference target. Retain accuracy numbers (86.23% overall, 87.96% normalized, per-category, IoU 0.6335) and the reasoning ON vs OFF comparison used to justify running with reasoning off.

Address PR mlcommons#346 review: move the function-body imports (bfcl_eval and the first-party bfcl_v4_scorer/bfcl_v4_execution/scoring modules) to the top of the file per AGENTS.md no-lazy-imports rule. The smoke script is run in an env with the `bfcl` extra installed, so a top-level import is the intended behavior.

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough. 15 verified findings posted as inline comments (summary comment follows).

arekay-nv · 2026-07-02T23:50:18Z

+                if func_data is None:
+                    return None
+
+                name = func_data.get("name", "")


[Codex] high (bug): func_data = item.get("function") is only checked for None — when the model emits JSON where function is not an object (e.g. [{"function": "foo"}]), func_data.get("name", "") raises AttributeError, which is not in the except (json.JSONDecodeError, ValueError, TypeError) list below, so one malformed response aborts BFCL scoring instead of falling through to the next parser. Guard with isinstance(func_data, dict) and return None like the other malformed-shape branches.

arekay-nv · 2026-07-02T23:50:18Z

+import json, pathlib
+r = json.loads(pathlib.Path('results/edge_agentic_full_run/results.json').read_text())
+print('Overall ST accuracy:',
+      r['accuracy_scores']['bfcl_v4::function_calling']['score']['overall_accuracy'], '%')


[Claude] high (documentation): This verify snippet (and its copy at line 324) indexes ['score']['overall_accuracy'], but the pipeline writes "score" as a scalar float (execute.py finalize_benchmark — "score": score, where BFCLv4Scorer.score() returns a 0–1 fraction); the per-subset dict lives under the separate "breakdown" key. Running the documented command raises TypeError: 'float' object is not subscriptable. Use ['breakdown']['overall_accuracy'] (which is also the percentage the % suffix implies). The multi-turn snippet at line 254 is unaffected because the multi-turn CLI writes score as a dict — that shape asymmetry between the two results.json files is itself worth a note.

arekay-nv · 2026-07-02T23:50:18Z

+        config = config.with_updates(
+            settings=config.settings.model_copy(
+                update={
+                    "client": config.settings.client.with_updates(


[Codex] medium (bug): Residual of the single-stream persistence fix (4ed5bf8): this override bakes num_workers=1, max_connections=1 into the persisted config.yaml, but leaves load_pattern.target_concurrency untouched. check_config_lock also gates on target_concurrency == 1 (compliance/checker.py:154-156), so an --accuracy-only run from a combined config declaring target_concurrency > 1 still fails the single_stream compliance check even though the client is forced to one connection. Normalize (or clear) the load-pattern concurrency here too, matching this comment's stated intent that the written config.yaml matches what actually runs.

arekay-nv · 2026-07-02T23:50:18Z

+inference-endpoint benchmark from-config \
+  --config online_edge_full_run.yaml \
+  --accuracy-only \
+  --seed 42


[Claude] medium (documentation): benchmark from-config does not accept --seed: the handler (commands/benchmark/cli.py from_config) only defines config, timeout, mode, accuracy-only, and report-dir. Reproduced: cyclopts fails with UnknownOptionError: Unknown option: "--seed". For from-config runs the seed must come from the YAML (model_params.seed: 42, already set in online_edge_full_run.yaml). Either fix this example to point at the YAML key or add a seed override to from_config.

arekay-nv · 2026-07-02T23:50:18Z

+            message = choice["message"]
+        except (KeyError, IndexError) as exc:
+            logger.warning("Malformed response: %s", exc)
+            return None, None, None


[Claude] medium (error-handling): A structurally malformed 200 body (proxy error page, {"error": ...} payload, empty choices) returns (None, None, None) here, which process_response treats as "no tool calls" (bfcl_v4_execution.py:192-199) — it appends a fabricated empty assistant message and advances the turn, so the entry is scored as a normal completion with degraded accuracy. That's inconsistent with the transport-error path, where _send_request returning None force-terminates the entry and the scorer marks it force_terminated. A malformed body is the same failure class as an HTTP 500 and should force-terminate (or at least be surfaced distinctly); otherwise a flaky gateway silently degrades reported accuracy.

arekay-nv · 2026-07-02T23:50:19Z

+    if score is None:
+        return [Check("accuracy_results_present", False, "no accuracy score found")]
+
+    for golden_key, result_key in _ACCURACY_METRIC_KEYS.items():


[Claude] low (error-handling): If no _ACCURACY_METRIC_KEYS key intersects the model's golden_accuracy (every non-Qwen model in models.py), this loop continues on every iteration and check_accuracy returns zero accuracy:* checks — ComplianceReport.passed (an all() over whatever checks exist) then reports PASS based on config-lock alone for an accuracy results.json. "No applicable accuracy metric" should be an explicit FAIL/warning check, mirroring the accuracy_results_present failure above, rather than an empty list.

arekay-nv · 2026-07-02T23:50:19Z

+        a combined run whose performance phase hits its ``max_duration`` cap
+        still proceeds to the accuracy phase.
+        """
+        self._current_phase_stopped = True


[Claude] low (bug): Residual of the 25f84a7 perf-timeout fix: unlike stop(), stop_current_phase does not set self._drain_event, and _drain_inflight only consults _stop_requested — so if the max_duration_ms cap fires while the performance phase is already inside its drain wait (strategy task finished and reset to None), the callback is a no-op. With the default performance_timeout_s: 240 that only stretches the cap by up to 4 minutes, but with the documented drain.performance_timeout_s: null ("wait indefinitely") a hung request keeps the run alive indefinitely despite the cap. Consider setting _drain_event here too, or having _drain_inflight observe _current_phase_stopped.

arekay-nv · 2026-07-02T23:50:19Z

+            ]
+        return self.turns[turn_idx]
+
+    def get_tools_for_turn(self, turn_idx: int) -> list[dict[str, Any]]:


[Claude] low (design): Dead/duplicated multi-turn plumbing that will drift: (1) get_tools_for_turn is never called anywhere in src/ or tests — the execution bridge implements its own holdout-tool logic in get_initial_request — and excluded_function is likewise stored but never read. (2) MULTI_TURN_SUBSETS is defined identically here and in the package __init__.py; if the two copies diverge, generate()'s multi-turn filtering and load_multi_turn_entries()/scorer aggregation silently disagree about what "multi-turn" means. Same pattern for DEFAULT_MAX_STEPS_PER_TURN = 25 duplicated in bfcl_v4_execution.py:47 and bfcl_v4_multi_turn_runner.py:40.

arekay-nv · 2026-07-02T23:50:19Z

+        widths = [(hi - lo) for lo, hi in dist.hist_buckets]
+        axes[1].bar(
+            centers,
+            dist.hist_counts[: len(centers)],


[Claude] low (bug): The length-mismatch guard is one-directional: dist.hist_counts[: len(centers)] truncates extra counts, but counts and buckets are extracted independently (hist.get("counts") / hist.get("buckets") in extract_distribution, no length validation), so a shorter hist_counts makes axes[1].bar(centers, ...) raise a shape-mismatch ValueError — and the per-plot loop in generate_plots doesn't catch it, so one malformed histogram aborts plotting for the whole report dir. Validate both arrays to a common length in extract_distribution, or slice both sides.

arekay-nv · 2026-07-02T23:50:19Z

        }
+        if model_params.streaming == StreamingMode.ON:
+            metadata["stream"] = True
+        if model_params.max_new_tokens is not None:


[Claude] low (design): Dead guard: ModelParams.max_new_tokens is non-optional with a default (max_new_tokens: Annotated[int, ...] = 1024, config/schema.py), so is not None is always true and every request always carries max_completion_tokens. The apparent intent — omit the token cap when unset, symmetric with the stream guard above — is unreachable through the schema. Either make the field int | None so the omission path is reachable, or drop the guard so the code doesn't suggest a configuration knob that doesn't exist.

arekay-nv · 2026-07-02T23:50:50Z

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 15 issues across 12 files (17 candidates; 2 dropped in synthesis — numpy floor-pin trade-off is already documented in-code with uv.lock preserving reproducibility, and the runner-docstring/httpx mismatch is covered by the existing sample-issuer thread → #382). Every finding was line-verified against HEAD (7542f63) and deduplicated against the 114 existing inline comments; the two "residual" items below are new gaps left by fixes to earlier review comments, not repeats. Upstream bfcl-eval claims (eval() denylist, globals() instance cache) were verified against the gorilla repo source. uv build --no-sources passes; the legacy_mlperf_deepseek_r1 subproject remains correctly excluded from the wheel.

🔴 Must Fix (critical/high)

Issues that will cause incorrect behavior in normal usage.

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/evaluation/extractor.py`	441	bug	Codex	Non-dict `function` payload (e.g. `[{"function":"foo"}]`) raises uncaught `AttributeError` — one malformed model response aborts BFCL scoring
2	`examples/11_Edge_Agentic_Example/README.md`	246	documentation	Claude	Verify snippets (l.246, l.324) index `['score']['overall_accuracy']` but `score` is a scalar float → `TypeError`; correct path is `['breakdown']['overall_accuracy']`

🟡 Should Fix (medium)

Real issues that trigger under specific conditions or flaws that will compound.

#	File	Line	Category	Reviewer(s)	Summary
3	`src/inference_endpoint/commands/benchmark/execute.py`	397	bug	Codex	Residual of `4ed5bf8`: accuracy-only override persists `num_workers`/`max_connections` but not `load_pattern.target_concurrency` → valid run fails `single_stream` compliance
4	`examples/11_Edge_Agentic_Example/README.md`	386	documentation	Claude	`from-config` doesn't accept `--seed` (reproduced `UnknownOptionError`); seed must come from YAML
5	`src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py`	236	error-handling	Claude	Malformed 200 body is scored as "model done with turn" instead of force-terminated like transport errors — flaky gateway silently degrades accuracy
6	`src/inference_endpoint/evaluation/bfcl_v4_execution.py`	235	security	Claude	Endpoint-controlled tool-call strings reach bfcl-eval's `eval()` behind a bypassable 8-name denylist → RCE from an untrusted `--endpoint`
7	`src/inference_endpoint/evaluation/bfcl_v4_execution.py`	316	bug	Claude	`cleanup()` never frees bfcl-eval's `globals()` instance cache → silent state reuse on same-entry re-runs + unbounded growth over ~800 entries
8	`src/inference_endpoint/evaluation/bfcl_v4_scorer.py`	204	bug	Claude	`live_relevance` scored with inverted semantics (credit for NOT calling) and the backwards number is published under `unscored_subsets`
9	`src/inference_endpoint/evaluation/bfcl_v4_multi_turn_scorer.py`	59	testing	Claude	`BFCLv4MultiTurnScorer.score()` entirely untested (force-terminated, turn-mismatch, checker-exception, aggregation branches)
10	`src/inference_endpoint/dataset_manager/predefined/bfcl_v4/__init__.py`	180	testing	Claude	`generate()` — the frozen MLPerf gate composition — has zero tests for its sampling/filtering logic

🔵 Consider (low)

Valid improvements that could be follow-ups.

#	File	Line	Category	Reviewer(s)	Summary
11	`src/inference_endpoint/compliance/checker.py`	187	error-handling	Claude	Model with no matching golden-accuracy keys yields zero `accuracy:*` checks → gate silently passes on config-lock alone
12	`src/inference_endpoint/load_generator/session.py`	380	bug	Claude	Residual of `25f84a7`: `stop_current_phase` doesn't set `_drain_event` → perf cap is a no-op once the phase is draining (unbounded if `performance_timeout_s: null`)
13	`src/inference_endpoint/dataset_manager/predefined/bfcl_v4/multi_turn.py`	120	design	Claude	Dead `get_tools_for_turn`/`excluded_function`; `MULTI_TURN_SUBSETS` and `DEFAULT_MAX_STEPS_PER_TURN` each defined twice
14	`src/inference_endpoint/metrics/results_plots.py`	358	bug	Claude	Counts/buckets mismatch guarded in one direction only — shorter `hist_counts` raises and aborts plotting for the report dir
15	`src/inference_endpoint/openai/openai_msgspec_adapter.py`	88	design	Claude	`max_new_tokens is not None` guard is dead — schema field is non-optional, so the omission path it implies is unreachable

⚠️ Commit hygiene: This PR has 48 commits including ~20 apparent fixups. Prior discussion noted the incremental review-checkpoint commits are intentional — flagging only for the merge decision (squash recommended).

hvagadia added 3 commits June 3, 2026 11:27

feat(multiturn): add kimi agentic benchmark controls

36cd0b2

feat: add multi-turn benchmark controls

3425267

fix: skip disabled multi-turn inline accuracy

0a13672

Palanivelg requested a review from a team June 8, 2026 17:06

gemini-code-assist Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread src/inference_endpoint/evaluation/bfcl_v4_execution.py Outdated

Comment thread src/inference_endpoint/evaluation/bfcl_v4_multi_turn_runner.py

Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated

Palanivelg changed the title ~~Feat/bfcl v4 combined~~ feat(bfcl): add BFCL v4 edge-agentic accuracy integration Jun 8, 2026

Palanivelg added 14 commits June 8, 2026 17:16

docs(bfcl): generalize "Thor" to "edge device" in BFCL v4 example

5bef6d7

The example and docstring referenced a specific device; reword to the generic "edge device" since the <3h budget applies to embedded targets broadly.

fix(bfcl): add soundfile dependency for qwen_agent transitive import

ab8b591

bfcl-eval's Qwen model handler transitively imports qwen_agent which requires soundfile; without it the import fails on Thor and any machine where soundfile is not already installed.

Palanivelg force-pushed the feat/bfcl-v4-combined branch from 9f0186a to 37c74d2 Compare June 8, 2026 17:21

Palanivelg mentioned this pull request Jun 8, 2026

feat(edge-agentic): add BFCL v4 edge-agentic reference implementation… mlcommons/inference#2598

Merged

docs(bfcl): annotate YAML with do-not-change submission rules

d628805

Palanivelg mentioned this pull request Jun 8, 2026

feat(edge-agentic): add BFCL v4 edge-agentic workload to edge benchma… mlcommons/inference_policies#331

Draft

3 tasks

Palanivelg added 2 commits June 8, 2026 21:14

Palanivelg mentioned this pull request Jun 8, 2026

chore(deps): relax numpy pin to >=1.26.4 for bfcl-eval compatibility #345

Closed

10 tasks

viraatc self-requested a review June 29, 2026 20:51

viraatc reviewed Jun 29, 2026

View reviewed changes

Comment thread src/inference_endpoint/config/schema.py Outdated

viraatc reviewed Jun 29, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated

Comment thread src/inference_endpoint/commands/benchmark/execute.py

Comment thread src/inference_endpoint/commands/benchmark/cli.py Outdated

Comment thread src/inference_endpoint/evaluation/bfcl_v4_scorer.py Outdated

arekay-nv reviewed Jun 29, 2026

View reviewed changes

viraatc reviewed Jun 29, 2026

View reviewed changes

Comment thread src/inference_endpoint/compliance/checker.py Outdated

Comment thread src/inference_endpoint/compliance/__init__.py

Comment thread src/inference_endpoint/compliance/checker.py Outdated

Palanivelg added 3 commits June 29, 2026 23:50

Palanivelg mentioned this pull request Jun 30, 2026

BFCL-v4 multi-turn runner: reuse HttpClientSampleIssuer instead of bespoke HTTP client #382

Open

viraatc reviewed Jun 30, 2026

View reviewed changes

arekay-nv reviewed Jun 30, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py

arekay-nv reviewed Jun 30, 2026

View reviewed changes

Comment thread src/inference_endpoint/dataset_manager/predefined/bfcl_v4/__init__.py

arekay-nv reviewed Jun 30, 2026

View reviewed changes

Comment thread src/inference_endpoint/dataset_manager/predefined/bfcl_v4/__init__.py

arekay-nv reviewed Jun 30, 2026

View reviewed changes

Comment thread src/inference_endpoint/evaluation/bfcl_v4_execution.py Outdated

arekay-nv reviewed Jul 1, 2026

View reviewed changes

This was referenced Jul 1, 2026

Unify accuracy scoring path for perf+accuracy and accuracy-only runs #387

Open

Schema-aware adapter selection (drop hard-required 'prompt' column / ColumnFilter skip) #388

Open

Palanivelg added 2 commits July 1, 2026 22:16

leopck reviewed Jul 2, 2026

View reviewed changes

Palanivelg added 2 commits July 2, 2026 22:15

arekay-nv reviewed Jul 2, 2026

View reviewed changes

		return default


		class FunctionCallExtractor(Extractor, extractor_id="function_call_extractor"):

Uh oh!

Conversation

Palanivelg commented Jun 8, 2026 • edited by zihaok Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Type of change

Related issues

Testing

Checklist

Uh oh!

github-actions Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viraatc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viraatc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viraatc left a comment

Choose a reason for hiding this comment

Review Council — integration & minimalism pass

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

viraatc commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review Council — integration & minimalism pass

🔴 Must fix

🟡 Should fix

🔵 Consider

Independent agreement with existing threads

Net read on the ask

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review

Uh oh!

Palanivelg commented Jun 8, 2026 •

edited by zihaok

Loading

github-actions Bot commented Jun 8, 2026 •

edited

Loading

viraatc commented Jun 30, 2026 •

edited

Loading