feat(bfcl): add BFCL v4 edge-agentic accuracy + performance integration#346
feat(bfcl): add BFCL v4 edge-agentic accuracy + performance integration#346Palanivelg wants to merge 48 commits into
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request adds support for Berkeley Function Calling Leaderboard (BFCL) v4 accuracy benchmarking, introducing single-turn and multi-turn evaluation pipelines, datasets, and adapters, along with sequential sample ordering for deterministic evaluation. The code review feedback identifies three key improvements: moving the tool_calls parsing loop inside the try-except block in bfcl_v4_execution.py to prevent crashes on invalid JSON, conditionally including tools and tool_choice in the request payload in bfcl_v4_multi_turn_runner.py to avoid API errors when no tools are present, and guarding the n_repeats calculation in bfcl_v4_scorer.py to ensure it does not evaluate to zero if some samples fail.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Integrate Berkeley Function Calling Leaderboard (BFCL) v4 evaluation
into the accuracy pipeline. Covers both single-turn function-calling
subsets (non_live, live, hallucination) and the agentic multi-turn
subsets (multi_turn_base + variants), validated against evalscope.
Single-turn (drop-in scorer):
- BFCLv4 predefined dataset (categories=[non_live|live|hallucination],
configurable sample_pct) plus a default preset.
- BFCLv4Scorer wires bfcl-eval's ast_checker into the standard
accuracy phase via the existing scorer registry.
- FunctionCallExtractor: normalizes native tool_calls JSON, JSON
arrays of function-call objects, and text-form function calls
into the canonical BFCL input format.
- openai_msgspec_adapter now:
* passes tool_choice through,
* emits tool_calls verbatim as output_text for scoring,
* coerces whole-number temperatures (vLLM strictness),
* makes max_completion_tokens optional and uses a permissive
ColumnFilter so messages+tools datasets pass through.
Multi-turn (agentic loop, outside the standard scorer):
- BFCLExecutionBridge: parses tool_calls, executes them locally
against bfcl-eval's class instances, and constructs the tool
response messages for the next turn.
- BFCLMultiTurnRunner: drives the per-entry agentic conversation
via httpx (bounded by max_steps_per_turn / timeout_s).
- BFCLv4MultiTurnScorer: invokes bfcl-eval's multi_turn_checker.
- bfcl_v4_multi_turn_cli: standalone CLI for the multi-turn flow.
Supporting plumbing:
- SequentialSampleOrder + `sequential=` flag on create_sample_order,
used by accuracy phases so ordering matches reference runs.
- BenchmarkConfig.Dataset.params: dict for dataset-specific kwargs
(e.g. categories, sample_pct) plumbed through DataLoaderFactory.
- ScorerMethod.BFCL_V4.
- Dataset.load() preserves user-provided ColumnFilters when the
adapter would otherwise inject a conflicting one.
- `--accuracy-only` benchmark flag: skip the performance phase
entirely (forces num_workers=1, max_connections=1 for
deterministic per-sample ordering).
Optional dep: `pip install -e ".[bfcl]"` (`bfcl-eval==2026.3.23`).
The top-level numpy pin must be relaxed to `>=1.26.4` because
bfcl-eval hard-pins `numpy==1.26.4` — shipped as a separate
prerequisite commit on the chore/relax-numpy-pin branch.
Validation (Qwen3.6-27B-Q4_K_M, temperature=0):
- Single-turn live (10%): ~82% accuracy.
- Multi-turn base (full 200): 140/200 = 70.00%, exact parity
(100% match, 0 mismatches) with evalscope on identical inputs.
Add a committed config that reproduces the validated <3h BFCL v4 accuracy
run on an embedded device (Thor).
Dataset (BFCLv4.generate):
- category_sample_pct: per-category sampling rates (e.g. non_live 20 /
live 10 / hallucination 5), resolved per subset via SUBSET_TO_CATEGORY.
- subset_floor: subsets whose TOTAL size is <= floor are taken in full,
preventing tiny subsets (live_parallel=16, live_parallel_multiple=24)
from collapsing to one or two noisy samples. Selection stays
deterministic (head(n)). Plain sample_pct behavior is unchanged.
Multi-turn CLI:
- --sample-pct: deterministic per-subset sub-sampling so a 3% (~24 entry)
multi-turn run is reproducible. Defaults to all entries.
examples/10_BFCLv4_Example/:
- offline_bfcl_v4_single_turn.yaml: single-turn (non_live + live +
hallucination) accuracy config, run with --accuracy-only.
- README.md: documents the two run paths (single-turn YAML vs multi-turn
CLI), the per-category sampling + floor, and the ~2h49m Thor budget.
The example and docstring referenced a specific device; reword to the generic "edge device" since the <3h budget applies to embedded targets broadly.
In --accuracy-only runs there is no performance phase, so ctx.rt_settings is None. _run_benchmark_async read ctx.rt_settings.max_duration_ms unconditionally, raising AttributeError at session setup. The global timeout only applies to the performance phase, so default max_duration_ms to None when rt_settings is absent.
--accuracy-only forces a single connection for deterministic sample ordering, which serializes the offline MAX_THROUGHPUT burst. For large accuracy datasets the sequential processing time exceeds the hardcoded 240 s drain cap, so the phase aborted mid-run dropping in-flight samples. Make drain_timeout a per-phase field defaulting to 240.0 (performance phases unchanged). Accuracy phases pass None to drain unbounded, since every sample must complete; a dropped transport still unblocks the wait via the _receive_responses close path. Re-check inflight after clear() to close a completion/clear race on the unbounded path.
The msgspec adapter serialized tool_calls into `output` AND kept them in the
structured `tool_calls` field. TextModelOutput.__str__ then appended the
tool_calls a second time, producing duplicated, malformed JSON
(`[{...}][{...}]`) that FunctionCallExtractor could not parse. This made every
single-turn function-calling subset score 0% (and gave hallucination subsets a
spurious 100%).
Keep `output` as the textual content only; the structured tool_calls field is
the single source serialized once by __str__. This matches the non-msgspec
OpenAI adapter and the streaming accumulator, which already did this.
Multi-turn is unaffected: it uses a separate httpx runner that reads structured
tool_calls from the raw response and never touches TextModelOutput.
…mbles
When a model emits a prose preamble alongside a tool call (e.g.
"To compute this, I'll call...\n[{...}]"), str(TextModelOutput) prefixes the
tool-call JSON with that text, so the function-call parser's json.loads fast
path fails and the sample scores 0.
Override BFCLv4Scorer.get_outputs to prefer the structured tool_calls field
when present (the function call is the answer; the prose is chatter), falling
back to the full string for plain-text responses such as hallucination
refusals. Verified deterministic across repeated fresh-server single-turn runs.
Some servers (e.g. trtllm-serve on edge devices) stall when tools are present but tool_choice is omitted, relying on a server default. Set tool_choice="auto" explicitly on each single-turn sample and pass it through the function_calling preset's ColumnFilter so it reaches the request payload. Multi-turn already sends tool_choice="auto" via its dedicated runner, so this only affects the single-turn path.
Add ModelParams.seed field and propagate it to the OpenAI wire format so runs can be made reproducible: - config/schema.py: add seed field to ModelParams - openai/types.py: add seed field to ChatCompletionRequest - openai/openai_adapter.py: include seed in metadata dict - openai/openai_msgspec_adapter.py: include seed in metadata dict and ChatCompletionRequest construction - evaluation/bfcl_v4_multi_turn_runner.py: accept seed param; inject payload["seed"] when set - evaluation/bfcl_v4_multi_turn_cli.py: expose --seed CLI arg and pass it to BFCLMultiTurnRunner - commands/benchmark/cli.py: expose --seed and --report-dir overrides on the from-config subcommand - tests: unit coverage for seed propagation in msgspec adapter, multi-turn runner, and from-config CLI
Expand examples/10_BFCLv4_Example/README.md: - Add a "Reproducing from the PRs" section explaining that PR #1 (numpy pin) is a prerequisite for PR #2 to install [bfcl] - Show how to check out and install from the branches - Document --seed flag for both single-turn (from-config) and multi-turn CLI paths - Replace placeholder accuracy numbers with confirmed Thor validation results (Qwen3.6-27B-Q4_K_M, temperature=0, 456 ST samples): non_live 86.98%, live 84.12%, hallucination 94.32%, overall 87.50% (both seed runs identical); MT base 70.00% (exact evalscope parity) - Add output file paths and a quick sanity-check script
Replace the terse reference doc with a numbered walkthrough that works for someone unfamiliar with the endpoints repo: - What is this / What is the endpoints repo - Step 0: prerequisites including a llama.cpp Docker quick-start - Step 1: install from the two PRs with conflict explanation - Step 2: run single-turn (with YAML config notes) - Step 3: run multi-turn - Step 4: verify results with one-liners - Seed reproducibility section - Reference results table (Thor, two seed runs, evalscope parity)
- Fix memory requirement: ~24 GB (not 16 GB) for the Q4 GGUF + KV cache
- Replace generic Docker quick-start with Thor-specific llama.cpp native
build instructions (Docker CUDA images don't target sm_110/aarch64-SBSA)
- Add x86_64 Docker quick-start in a collapsible details block
- Fix Step 4 result path: results.json under accuracy_scores key, not
a separate accuracy_scores.json; add report.txt note
- Add server-side determinism note (--seed 42, -np 1 on llama-server)
- Replace placeholder MT numbers with actual sampled-run Thor results:
multi_turn_base 66.67% (4/6), miss_func 33.33% (2/6),
miss_param 16.67% (1/6), long_context 66.67% (4/6),
overall 45.84% (24 entries) — identical across both seed runs
- Separate full multi_turn_base parity result (140/200 = 70.00%) into
its own subsection to avoid conflating sampled and full-set numbers
- Update wall-clock: ~82 min ST + ~64 min MT ≈ 2.4–2.5 h total
bfcl-eval's Qwen model handler transitively imports qwen_agent which requires soundfile; without it the import fails on Thor and any machine where soundfile is not already installed.
… run_accuracy.sh Renames examples/10_BFCLv4_Example to examples/10_Edge_Agentic_Example to align with the MLPerf edge-agentic submission category name. Adds run_accuracy.sh — a single script that reproduces both single-turn and multi-turn reference accuracy numbers end-to-end with the exact validated parameters (sampling rates, temperature=0, seed=42, max-steps-per-turn=25). Updates README to lead with the one-liner quick-start referencing the script, fixes the install instructions to point to mlcommons/endpoints (not the fork), adds --seed and --max-steps-per-turn to the Step 3 MT snippet, and corrects the internal path reference in online_agentic_coding_perf.yaml.
9f0186a to
37c74d2
Compare
- bfcl_v4_execution: move tool-call argument JSON parsing inside the try-except block so json.JSONDecodeError from malformed model output (common on small/quantized models) is caught and handled gracefully rather than crashing the evaluation run. - bfcl_v4_multi_turn_runner: only include tools/tool_choice in the request payload when the tools list is non-empty, avoiding 400 errors on endpoints that reject tool_choice without accompanying tools. - bfcl_v4_scorer: guard n_repeats with max(1, ...) so a partial run where fewer samples completed than num_samples() does not produce a zero divisor and incorrect reporting.
- Relax base numpy to >=1.26.4 so bfcl-eval's numpy==1.26.4 pin resolves; regenerate uv.lock. - Regenerate stale *_template_full.yaml config templates after schema change. - Fix mypy: annotate tool_calls/tool_call_ids and narrow Optional messages/tools in the multi-turn runner; mark BFCLv4Scorer.score override. - Isolate the bfcl extra via [tool.uv].conflicts and add patched filelock/virtualenv floors to the dev extra so bfcl-eval's filelock==3.20.0 pin no longer drags shared tooling deps into CVE versions (CVE-2025-68146, CVE-2026-22701, CVE-2026-22702).
The accuracy phase hardcoded drain_timeout=None, which ignored a user-configured DrainConfig.accuracy_timeout_s and failed test_configured_drain_timeouts_propagate_to_phases. accuracy_timeout_s already defaults to None (unbounded), so reading it preserves the unbounded default while honoring an explicit timeout.
from-config has no --model-params.name / --endpoint-config.endpoints overrides, so the script errored immediately. Render a temp YAML with MODEL/ENDPOINT substituted into the committed config (anchored on the "# set to your ..." comments) so the one-liner still works without editing the tracked file.
viraatc
left a comment
There was a problem hiding this comment.
A few integration questions from this review.
viraatc
left a comment
There was a problem hiding this comment.
A few follow-up integration comments.
…compliance + nits) - session: add stop_current_phase() and have the perf max_duration cap call it instead of session.stop(), so a combined perf+accuracy run still runs accuracy when the performance phase hits its time cap (was silently skipped). The stop-check observes a per-phase flag (reset each phase) so polling strategies stop issuing, and the strategy task is cancelled to interrupt a blocked await. Adds a regression test (perf timeout -> accuracy still completes). - compliance: all_turns_observed now compares observed to expected (scorable) turns rather than issued; a dataset with 1007 issued / 1006 scorable no longer fails a valid run. Tests updated to carry expected and encode that case. - schema: rename ambiguous Dataset.params -> generate_params (feeds generate()); regenerated templates and updated the example config. - dataset: elaborate why only the adapter's ColumnFilter is dropped when a preset supplies its own (other adapter transforms still apply). - example yaml: drop the "do not change" noise; annotate the BFCL gate block once as fixed and keep per-field notes only where users tune values. - test: assert the from-config seed override is meaningful (seed != 42 before).
…only Drop the `accuracy_only` bool threaded through run_benchmark/setup_benchmark/ _load_datasets; derive it on BenchmarkContext from `test_mode == TestMode.ACC`. `--accuracy-only` remains a CLI convenience alias that resolves to TestMode.ACC. Coalesce the dataset-presence guards in _load_datasets around the resolved mode. Move the `seed` Cyclopts alias onto ModelParams (`--seed`) so offline/online expose it via normal model-field unwrapping, and remove the bespoke from-config `--seed` override flag (set seed in the YAML for from-config runs).
…ed breakdown BFCLv4Scorer.score() now returns the scalar overall accuracy (fraction) like every other Scorer, dropping the `# type: ignore[override]`. The per-subset / per-category detail is cached and exposed via a new optional `Scorer.score_breakdown()` hook, and finalize writes it to results.json under a dedicated `breakdown` key (numeric percentages, not formatted strings). Consumers (compliance gate, results plotting, publish helper) read the typed `breakdown` block instead of special-casing a score-that-might-be-a-dict; the string-coercion paths are kept only as defensive fallbacks for older artifacts. Compliance also now loads config.yaml via BenchmarkConfig.from_yaml_file() (schema validation + env resolution + discriminated union) rather than a raw yaml.safe_load, with test fixtures updated to write full schema-valid configs.
viraatc
left a comment
There was a problem hiding this comment.
Review Council — integration & minimalism pass
Placeholder; summary posted separately.
| return default | ||
|
|
||
|
|
||
| class FunctionCallExtractor(Extractor, extractor_id="function_call_extractor"): |
There was a problem hiding this comment.
[Council: integration review] medium (design): FunctionCallExtractor Priority 3 (_try_parse_text_function_calls, ~100 lines: _SKIP_NAMES frozenset of Python builtins, balanced-paren scanner, ast.parse("dict(...)") arg parsing) is a heavy heuristic on a scoring path, and is reachable (bfcl_v4_scorer.py:252 calls self.extractor.extract(...)). Its own docstring concedes it "may produce false positives from explanatory text or fail on nested parentheses" — i.e. it can silently mis-score. Given BFCL endpoints return structured tool_calls (Priority 1) or a JSON array (Priority 2), is the free-text fallback needed at all? Recommend dropping Priority 3 (return default after the JSON paths) unless you can show a real endpoint that only emits prose function calls — for a benchmark, a clean miss is safer than a heuristic guess. This is the largest single chunk of new logic on the integration surface.
There was a problem hiding this comment.
Keeping Priority 3 as the last-resort recall path. For the validated Qwen3.6 runs, structured tool_calls (Priority 1) dominate, but some models/quants still emit text-form calls, and dropping it would lower recall with no accuracy upside on the compliant path.
Review Council — integration & minimalism passReviewers: Claude (full integration-surface pass) + Codex (timed out at max-effort before emitting findings; not counted). Depth: thorough, scoped per the maintainer's ask — is each new line / API change / new member on an existing class actually needed? Focus was the integration surface (changes to pre-existing files), not the self-contained BFCL internals. All findings verified against HEAD; deduped against the 64 existing comments. 8 new inline findings posted (1 high, 6 medium, 1 low). Findings already raised by you/@arekay-nv were intentionally skipped. 🔴 Must fix
🟡 Should fix
🔵 Consider
Independent agreement with existing threadsThe council independently reached the same conclusion as the open Net read on the askThe BFCL subsystem is cleanly self-contained; the friction is in the shared-code edits made to accommodate it. Most can shrink: drop the redundant flag (already flagged), drop the dead validity block (#1), keep seed/temp/required-columns changes off the shared adapters (#2-4), revert the drain-timeout default to |
There was a problem hiding this comment.
Do we really need accuracy_only? Might be simpler to rely on the config to have perf/accuracy phases and run them as-is. This will create corner cases - what if the config has only perf, but we specify accuracy_only.
There was a problem hiding this comment.
+1
( might need a larger refactor of execute.py though: #384 )
There was a problem hiding this comment.
The corner case is guarded: _load_datasets raises InputValidationError when --accuracy-only is set with no accuracy dataset. TestMode.ACC stays an explicit mode because the compliance workflow runs accuracy standalone under forced single-stream (now persisted to config.yaml, 4ed5bf8). Fully driving scoring off the config's declared phases is the right longer-term shape — tracked in #387.
- execute.py: remove dead invalid_reasons block (entries never carry valid/invalid_reason; comprehension was always empty + KeyError-prone) - session.py: revert PhaseConfig.drain_timeout and _drain_inflight defaults to None (no implicit drain bound); keep the post-clear inflight re-check - openai_msgspec_adapter.py: drop the global temperature int-coercion; restore default ColumnFilter required_columns=["prompt"] (preset supplies its own filter for BFCL) - openai_adapter.py: thread seed through to_endpoint_request so the non-msgspec path no longer silently drops model_params.seed - factory.py: drop the "force" alias; use the canonical force_regenerate kwarg - sample_order.py: give SequentialSampleOrder an explicit n_samples_in_dataset signature
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by Codex (gpt-5.5) + Claude | Depth: thorough
8 inline findings below (2 high, 2 medium, 4 low). Full synthesis in the summary comment.
Review Council — Multi-AI Code ReviewReviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: thorough Found 8 issues across 6 files (2 high, 2 medium, 4 low). The empty-events scorer crash was flagged independently by both reviewers. All line numbers verified against HEAD ( 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Dropped after verification (3)
|
…rds, compliance, nits) - bfcl_v4_scorer.score(): guard empty/unmatched events log so an events file with no COMPLETE records (or none mapping to a known sample_uuid) emits a zero breakdown instead of KeyError; factor out _zero_breakdown() and drop the now-unreachable all_scores guard. - execute.py: bake the accuracy-only single-stream override (num_workers=1, max_connections=1) into config before persisting config.yaml so the compliance gate sees the settings that actually ran; drop the redundant runtime override. - extractor: add public has_native_tool_calls() and use it from the hallucination scorer instead of the private _try_parse_tool_calls_json. - compliance/checker: distinguish "no seed set" from "seed != 42". - bfcl_v4 dataset: document sample_pct=None (kept in full, not 0%) and warn when generate() returns an empty single-turn DataFrame. - bfcl_v4_execution: coalesce duplicated assistant content assignment. - multi_turn_scorer: document the unweighted per-subset mean trade-off. - plot_results: write per-report subdirs when multiple report dirs share one --out-dir so fixed-name plots no longer clobber each other. - tests: cover score()/_score_ast/_score_hallucination, sample-weighted aggregation, empty-events guards, and has_native_tool_calls.
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
The combined perf+accuracy run loads a repo-root-relative performance dataset path, so it must be launched from the repo root; the README Step 5 and the config header comment previously said to cd into the example dir, which breaks dataset resolution. The inline-checker verify one-liner also read a top-level `valid` key that the scorer never writes (it emits score/turns/domains/per_turn) — derive validity from turns.missing instead.
Resolve conflicts: - config/schema.py: keep both ScorerMethod members (BFCL_V4 + LEGACY_MLPERF_DEEPSEEK_R1) - commands/benchmark/execute.py: keep BFCL score_breakdown() + entry storage, adopt main's richer completeness log line - AGENTS.md: union Key Components table (main's DeepSeek-R1 row + BFCL Compliance row) - config/templates/*_full.yaml: regenerated from schema
| # The scorer guards `bfcl_eval` behind try/except → None on ImportError. | ||
| # With the extra installed these must resolve to the real objects, otherwise | ||
| # scoring silently degrades. | ||
| from inference_endpoint.evaluation import bfcl_v4_scorer |
There was a problem hiding this comment.
we should move the imports to the top of the file, lazy imports generally bad
There was a problem hiding this comment.
Fixed in 7542f63 — hoisted all four function-body imports (bfcl_eval plus the first-party bfcl_v4_scorer/bfcl_v4_execution/scoring modules) to the top of the file per the AGENTS.md no-lazy-imports rule. The smoke script is run in an env with the bfcl extra installed, so top-level imports are the intended behavior.
| # The execution bridge imports a different bfcl-eval submodule (also guarded | ||
| # behind try/except → None), so check it resolved too — catches a partial | ||
| # bfcl-eval breakage that leaves the ast_checker path working. | ||
| from inference_endpoint.evaluation import bfcl_v4_execution |
There was a problem hiding this comment.
similarly to this, lazy import
There was a problem hiding this comment.
Also fixed in 7542f63 — this import was hoisted to module top level with the others.
| return 1 | ||
|
|
||
| # The scorer/dataset are registered via their import side effects. | ||
| from inference_endpoint.evaluation.scoring import Scorer |
There was a problem hiding this comment.
Also fixed in 7542f63 — hoisted to the top; Scorer.get("bfcl_v4") now relies on the top-level scoring import's registration side effect.
Remove throughput/latency/runtime figures (tok/s, TTFT, TPOT, per-turn latency, ISL/OSL percentiles, wall-clock/runtime columns) that are hardware-specific and not the reference target. Retain accuracy numbers (86.23% overall, 87.96% normalized, per-category, IoU 0.6335) and the reasoning ON vs OFF comparison used to justify running with reasoning off.
Address PR mlcommons#346 review: move the function-body imports (bfcl_eval and the first-party bfcl_v4_scorer/bfcl_v4_execution/scoring modules) to the top of the file per AGENTS.md no-lazy-imports rule. The smoke script is run in an env with the `bfcl` extra installed, so a top-level import is the intended behavior.
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by Codex + Claude | Depth: thorough. 15 verified findings posted as inline comments (summary comment follows).
| if func_data is None: | ||
| return None | ||
|
|
||
| name = func_data.get("name", "") |
There was a problem hiding this comment.
[Codex] high (bug): func_data = item.get("function") is only checked for None — when the model emits JSON where function is not an object (e.g. [{"function": "foo"}]), func_data.get("name", "") raises AttributeError, which is not in the except (json.JSONDecodeError, ValueError, TypeError) list below, so one malformed response aborts BFCL scoring instead of falling through to the next parser. Guard with isinstance(func_data, dict) and return None like the other malformed-shape branches.
| import json, pathlib | ||
| r = json.loads(pathlib.Path('results/edge_agentic_full_run/results.json').read_text()) | ||
| print('Overall ST accuracy:', | ||
| r['accuracy_scores']['bfcl_v4::function_calling']['score']['overall_accuracy'], '%') |
There was a problem hiding this comment.
[Claude] high (documentation): This verify snippet (and its copy at line 324) indexes ['score']['overall_accuracy'], but the pipeline writes "score" as a scalar float (execute.py finalize_benchmark — "score": score, where BFCLv4Scorer.score() returns a 0–1 fraction); the per-subset dict lives under the separate "breakdown" key. Running the documented command raises TypeError: 'float' object is not subscriptable. Use ['breakdown']['overall_accuracy'] (which is also the percentage the % suffix implies). The multi-turn snippet at line 254 is unaffected because the multi-turn CLI writes score as a dict — that shape asymmetry between the two results.json files is itself worth a note.
| config = config.with_updates( | ||
| settings=config.settings.model_copy( | ||
| update={ | ||
| "client": config.settings.client.with_updates( |
There was a problem hiding this comment.
[Codex] medium (bug): Residual of the single-stream persistence fix (4ed5bf8): this override bakes num_workers=1, max_connections=1 into the persisted config.yaml, but leaves load_pattern.target_concurrency untouched. check_config_lock also gates on target_concurrency == 1 (compliance/checker.py:154-156), so an --accuracy-only run from a combined config declaring target_concurrency > 1 still fails the single_stream compliance check even though the client is forced to one connection. Normalize (or clear) the load-pattern concurrency here too, matching this comment's stated intent that the written config.yaml matches what actually runs.
| inference-endpoint benchmark from-config \ | ||
| --config online_edge_full_run.yaml \ | ||
| --accuracy-only \ | ||
| --seed 42 |
There was a problem hiding this comment.
[Claude] medium (documentation): benchmark from-config does not accept --seed: the handler (commands/benchmark/cli.py from_config) only defines config, timeout, mode, accuracy-only, and report-dir. Reproduced: cyclopts fails with UnknownOptionError: Unknown option: "--seed". For from-config runs the seed must come from the YAML (model_params.seed: 42, already set in online_edge_full_run.yaml). Either fix this example to point at the YAML key or add a seed override to from_config.
| message = choice["message"] | ||
| except (KeyError, IndexError) as exc: | ||
| logger.warning("Malformed response: %s", exc) | ||
| return None, None, None |
There was a problem hiding this comment.
[Claude] medium (error-handling): A structurally malformed 200 body (proxy error page, {"error": ...} payload, empty choices) returns (None, None, None) here, which process_response treats as "no tool calls" (bfcl_v4_execution.py:192-199) — it appends a fabricated empty assistant message and advances the turn, so the entry is scored as a normal completion with degraded accuracy. That's inconsistent with the transport-error path, where _send_request returning None force-terminates the entry and the scorer marks it force_terminated. A malformed body is the same failure class as an HTTP 500 and should force-terminate (or at least be surfaced distinctly); otherwise a flaky gateway silently degrades reported accuracy.
| if score is None: | ||
| return [Check("accuracy_results_present", False, "no accuracy score found")] | ||
|
|
||
| for golden_key, result_key in _ACCURACY_METRIC_KEYS.items(): |
There was a problem hiding this comment.
[Claude] low (error-handling): If no _ACCURACY_METRIC_KEYS key intersects the model's golden_accuracy (every non-Qwen model in models.py), this loop continues on every iteration and check_accuracy returns zero accuracy:* checks — ComplianceReport.passed (an all() over whatever checks exist) then reports PASS based on config-lock alone for an accuracy results.json. "No applicable accuracy metric" should be an explicit FAIL/warning check, mirroring the accuracy_results_present failure above, rather than an empty list.
| a combined run whose performance phase hits its ``max_duration`` cap | ||
| still proceeds to the accuracy phase. | ||
| """ | ||
| self._current_phase_stopped = True |
There was a problem hiding this comment.
[Claude] low (bug): Residual of the 25f84a7 perf-timeout fix: unlike stop(), stop_current_phase does not set self._drain_event, and _drain_inflight only consults _stop_requested — so if the max_duration_ms cap fires while the performance phase is already inside its drain wait (strategy task finished and reset to None), the callback is a no-op. With the default performance_timeout_s: 240 that only stretches the cap by up to 4 minutes, but with the documented drain.performance_timeout_s: null ("wait indefinitely") a hung request keeps the run alive indefinitely despite the cap. Consider setting _drain_event here too, or having _drain_inflight observe _current_phase_stopped.
| ] | ||
| return self.turns[turn_idx] | ||
|
|
||
| def get_tools_for_turn(self, turn_idx: int) -> list[dict[str, Any]]: |
There was a problem hiding this comment.
[Claude] low (design): Dead/duplicated multi-turn plumbing that will drift: (1) get_tools_for_turn is never called anywhere in src/ or tests — the execution bridge implements its own holdout-tool logic in get_initial_request — and excluded_function is likewise stored but never read. (2) MULTI_TURN_SUBSETS is defined identically here and in the package __init__.py; if the two copies diverge, generate()'s multi-turn filtering and load_multi_turn_entries()/scorer aggregation silently disagree about what "multi-turn" means. Same pattern for DEFAULT_MAX_STEPS_PER_TURN = 25 duplicated in bfcl_v4_execution.py:47 and bfcl_v4_multi_turn_runner.py:40.
| widths = [(hi - lo) for lo, hi in dist.hist_buckets] | ||
| axes[1].bar( | ||
| centers, | ||
| dist.hist_counts[: len(centers)], |
There was a problem hiding this comment.
[Claude] low (bug): The length-mismatch guard is one-directional: dist.hist_counts[: len(centers)] truncates extra counts, but counts and buckets are extracted independently (hist.get("counts") / hist.get("buckets") in extract_distribution, no length validation), so a shorter hist_counts makes axes[1].bar(centers, ...) raise a shape-mismatch ValueError — and the per-plot loop in generate_plots doesn't catch it, so one malformed histogram aborts plotting for the whole report dir. Validate both arrays to a common length in extract_distribution, or slice both sides.
| } | ||
| if model_params.streaming == StreamingMode.ON: | ||
| metadata["stream"] = True | ||
| if model_params.max_new_tokens is not None: |
There was a problem hiding this comment.
[Claude] low (design): Dead guard: ModelParams.max_new_tokens is non-optional with a default (max_new_tokens: Annotated[int, ...] = 1024, config/schema.py), so is not None is always true and every request always carries max_completion_tokens. The apparent intent — omit the token cap when unset, symmetric with the stream guard above — is unreachable through the schema. Either make the field int | None so the omission path is reachable, or drop the guard so the code doesn't suggest a configuration knob that doesn't exist.
Review Council — Multi-AI Code ReviewReviewed by: Codex + Claude | Depth: thorough Found 15 issues across 12 files (17 candidates; 2 dropped in synthesis — numpy floor-pin trade-off is already documented in-code with 🔴 Must Fix (critical/high)Issues that will cause incorrect behavior in normal usage.
🟡 Should Fix (medium)Real issues that trigger under specific conditions or flaws that will compound.
🔵 Consider (low)Valid improvements that could be follow-ups.
|
What does this PR do?
Integrates the edge-agentic example (BFCL-v4 as the accuracy set): single-turn + multi-turn pipelines, datasets, adapters, and a reproducible run script. See
examples/10_Edge_Agentic_Example/README.md.Type of change
Related issues
Self-contained: this PR includes the numpy relaxation (
numpy>=1.26.4) required bybfcl-eval(which hard-pinsnumpy==1.26.4), so it no longer depends on a separate prerequisite PR. The[bfcl]extra is isolated via[tool.uv].conflictsso its old pins don't constrain the shared tooling deps. This supersedes #345 (closed as redundant).Testing
Checklist