feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework by wu6u3tw · Pull Request #332 · mlcommons/endpoints

wu6u3tw · 2026-06-03T20:53:15Z

Summary

Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an audit: block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.

TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream compliance/TEST04/verify_performance.py).

Design (the two axes)

Axis A — run modification: expressed as a generic typed SampleOrderSpec (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a RunSpec. No test-specific knowledge leaks into the load generator.
Axis B — verification: a pure post-run check, AuditTest.verify(runs) -> AuditVerdict, registered per test.

A generic orchestrator (commands/audit.py::run_audit) runs each RunSpec phase back-to-back via the existing setup_benchmark / run_benchmark_async path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.

Config shape

audit:
  test: test04
  samples: 64         # reference phase query count
  audit_samples: 64   # audit (fixed-sample) phase count
  sample_index: 3     # MLCommons performance_issue_same_index
  threshold: 0.10     # audit qps must stay < ref qps * (1 + threshold)

AuditConfig is a discriminated-union-ready sub-model on BenchmarkConfig (parallel to AccuracyConfig) — no DatasetType.AUDIT, no audit fields polluting Dataset, no test04 boolean in RuntimeSettings.

What's included

compliance/__init__.py — AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry
compliance/verdict.py — AuditVerdict + atomic write_verdict (tmp → fsync → rename → fsync)
compliance/tests/test04.py — Test04Audit + verify_test04
commands/audit.py — generic run_audit orchestrator
config/schema.py — AuditTestId + Test04Config/AuditConfig + BenchmarkConfig.audit
load_generator — SampleOrderSpec + SingleSampleOrder + factory dispatch
Unit tests + e2e integration test (offline + single-stream) against the echo server
docs/compliance_audit_plan.md — the design plan
WAN2.2 submission examples: offline_wan22_submission.yaml, single_stream_wan22_submission.yaml

Exit codes

benchmark from-config with an audit: block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError → 2, SetupError → 3, ExecutionError → 4). The on-disk audit_verdict.json is the durable record.

Testing

Unit + integration green; pre-commit run --all-files clean. The e2e test exercises the full audit: → run_audit → AuditVerdict flow for both max_throughput (offline) and concurrency=1 (single-stream).

Implementation notes (moved from the design doc)

File-by-file changes (against `main`)

File	Change
`compliance/__init__.py`	new — `AuditTest` protocol, `AuditRunSpec`, `AuditRunStats`, `AuditResult`, `get_audit_test()` registry
`compliance/result.py`	new — `AuditResult` + atomic `write_result` (reference `verify_OUTPUT_CACHING_TEST.txt` wording + JSON)
`compliance/audit_test/__init__.py`	new — imports submodules so registration fires
`compliance/audit_test/output_caching_test.py`	new — `OutputCachingAudit.plan_runs` (reference + audit specs) + `verify_output_caching` core
`commands/audit.py`	new — generic `run_audit` loop (plan → validate-all → execute → verify → write)
`config/schema.py`	+ `AuditTestId`, `AuditConfig`, `audit: AuditConfig \| None` on `BenchmarkConfig`
`load_generator/sample_order.py`	+ `SampleOrderSpec` + `SingleSampleOrder`; `create_sample_order` switches on the spec
`config/runtime_settings.py`	+ `sample_order: SampleOrderSpec` (generic; default `WITHOUT_REPLACEMENT`)
`commands/benchmark/execute.py`	+ typed `run_spec` seam in `setup_benchmark`; `run_benchmark` dispatches to `run_audit`
`examples/09_Wan22_VideoGen_Example/offline_wan22_submission.yaml`	new — WAN2.2 Offline submission (perf + accuracy + TEST04)
`examples/09_Wan22_VideoGen_Example/single_stream_wan22_submission.yaml`	new — WAN2.2 SingleStream submission (perf + accuracy + TEST04)

Requirements traceability

Covers every comment thread on PR #332 — the maintainer workflow threads
(@nvzhihanj, @viraatc), both "Review Council" passes, and the Gemini robustness comments.

Maintainer workflow & example-config threads

Comment	Resolution
Run one command, not two/three; phases back-to-back	`run_audit` generic loop; `audit:` block on `benchmark from-config`
Perf + accuracy + audit from a single config	`type: submission` YAML; `run_benchmark` runs perf [+acc], then `run_audit` additively (§5)
Comparing 50 distinct vs 20/25 repeated "doesn't seem fair"	resolved — shipped examples use equal counts (Offline 64/64, SingleStream 20). `audit_samples` allows independent counts (upstream TEST04 uses 5000/500 for SDXL) as an opt-in to shorten the audit phase; see §5.
"Forced to run 248 in audit … too long"	`samples` (reference) and `audit_samples` (audit) are independent subsets; no full-dataset requirement
Audit sample "shuffled or fixed?"	fixed — reference = `WITHOUT_REPLACEMENT`, audit = `SINGLE(sample_index)` (MLPerf `issue_same`)
Need an audit config for single-stream too	load-pattern validation admits `concurrency` (single-stream) and `max_throughput` (offline)
Paced loads should not silently pass	`poisson` rejected up front (§4 step 3) — pacing caps throughput and masks caching
Inconsistent / context-free example file names	the shipped example YAMLs use context-rich sibling names (`offline_wan22_submission.yaml` / `single_stream_wan22_submission.yaml`); result artifacts use fixed `verify_OUTPUT_CACHING_TEST.txt` + JSON
`num_workers` hard-coded in example YAMLs; use default	omitted from the shipped examples — they carry only what TEST04 requires (endpoint defaults otherwise)
README / unrelated dependency churn (`pip`, `aiohttp`) in the diff	this PR contains only the design doc + the two WAN2.2 example configs — no README or dependency changes bundled

Design-review findings (both Review Council passes)

Finding (severity)	Resolution
`ref_samples` dead write / mismatched counts (high)	each phase's count is an explicit `AuditRunSpec.n_samples` honored via `n_samples_to_issue` (the bug was the reference count being silently dropped)
No `AuditTest` abstraction; TEST04 hardcoded (high)	`AuditTest` protocol + `get_audit_test` registry; generic loop
`DatasetType.AUDIT` abstraction leak (high)	dropped; phases derive from a normal PERFORMANCE dataset
`test04` boolean in `RuntimeSettings`/load-gen (high)	generic `SampleOrderSpec`; load-gen has no test knowledge
`_OVERRIDE_TEST04_SAMPLE_INDEX` stringly-typed kwarg (med)	typed `run_spec` seam
Two-phase `model_copy` surgery; ref skips validation (med)	declarative `AuditRunSpec`; validate all specs before any run
Orchestrator untested (med)	unit tests assert per-phase counts + early-return paths
Scattered params / hardcoded threshold (med)	per-test config model (`OutputCachingTestConfig`), discriminated on `test` — each test carries only its own knobs
Unfair QPS comparison across counts/contents (med)	examples use equal counts; per-phase completion guard + qps rate-normalization keep unequal counts sound when opted into (upstream-faithful, §5)
Audit params belong in `AuditConfig`, not `Dataset` (med)	`AuditConfig` sub-model on `BenchmarkConfig`; `Dataset` untouched
Two parallel verifier entry points (low)	one `verify_output_caching(AuditRunStats, AuditRunStats)` core + a single `AuditRunStats.from_report` adapter
`sample_index` bound-checked late (low)	validated vs loaded dataset size before any run
`audit_config` re-entrancy trap (critical)	every phase config sets `audit=None`; cannot re-enter `run_audit`
Orchestrator returns `None`; PASS/FAIL indistinguishable	`run_audit` returns a typed `AuditResult`; CLI exits `0` (PASS) / `1` (FAIL); errors via the standard handler (`SetupError` → 3, `ExecutionError` → 4)
Non-atomic result write (high)	`write_result` uses `tmp → fsync → rename → fsync(parent)`
Duplicates `setup_benchmark` dir / `config.yaml` logic (med)	phases reuse `setup_benchmark`; no recomputed report-dir
`_audit_marker` parsed twice in error path (low)	n/a — orchestrator owns phase labels, so no directory-swap guard

Robustness & API hygiene (Gemini + Review Council)

Comment	Resolution
CLI catches only `FileNotFoundError`/`ValueError`; write outside try	n/a in redesign (no standalone verifier CLI); the audit runs via `from-config` and errors propagate to `main.py`'s handler
`_audit_marker` `AttributeError` on non-`dict` JSON	n/a in redesign (no standalone dir-swap marker; the orchestrator owns phase labels)
`Report.from_snapshot` `KeyError`/`TypeError` uncaught	n/a — snapshots are consumed in-process per phase; a phase that fails to produce a usable report aborts with `ExecutionError`
Public entry points missing from `__all__`	`compliance/__init__.py` `__all__` exports the full public surface

Success criteria (goal-driven)

Integration — benchmark from-config with an audit: block runs both phases
back-to-back and writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json; PASS against a
no-caching mock_http_echo_server, FAIL against a caching mock.
Completion guard — a phase that completes far fewer than its requested count fails
the result (completed < requested × (1 − threshold) → FAIL), independent of the other
phase's count.
Unit — SingleSampleOrder always yields the configured index (bounds-checked);
verify_output_caching PASS within threshold, FAIL above, boundary at the strict < line,
slower-passes, custom threshold, and the completion guard trips; AuditRunStats.from_report
raises on a None-duration or non-positive qps; OutputCachingAudit.plan_runs emits a
reference spec at samples and an audit spec at audit_samples (which may differ).
Unit (orchestrator) — assert the reference phase issues samples and the audit phase
issues audit_samples (defaulting to samples when omitted), validation fires before any
run, the typed result propagates (PASS/FAIL distinguishable), and a phase config never
carries audit (no re-entry).
Validation — a paced (poisson) load and an out-of-range sample_index are both
rejected before any phase runs.
Robustness — AuditRunStats.from_report raises a clean ValueError on a report with no
duration (qps is None) or non-positive throughput (qps <= 0); a phase whose
Report.complete is False (metrics drain timeout / interrupt) aborts the audit with
ExecutionError rather than certifying a result on partial data.
No leakage — grep -r test04 src/inference_endpoint/{load_generator,config/runtime_settings.py}
returns nothing.
pre-commit run --all-files clean (ruff / mypy / license headers).

🤖 Generated with Claude Code

github-actions · 2026-06-03T20:53:31Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…review Address gemini-code-assist review on PR mlcommons#332: - CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError — all map to exit 2. - _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError. - _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wu6u3tw · 2026-06-04T22:57:49Z

Update summary

All review feedback has been addressed. Here is what changed since the original submission:

Architecture (main concern)

audit: test04 now runs both phases in a single command — reference run then audit run back-to-back against the same endpoint, with automatic comparison and verdict output. No more 3-step manual workflow.

Config shape

type: audit dataset replaces the old settings.runtime.test04_sample_index and audit_n_samples runtime variables. Reference and audit sample counts are now independent and co-located with the dataset config — consistent with how type: accuracy datasets carry their own accuracy_config.

audit: "test04"

datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50          # reference phase query count (50–144)

  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25          # audit phase query count (25–50)
    audit_sample_index: 0

Robustness

Warning logged when audit: test04 is set but no type: audit dataset is present (previously silent fallback to index 0).
Phase failures (SetupError/ExecutionError) are caught and logged cleanly — no unhandled traceback, verdict not lost.
Report.from_snapshot wrapped in try/except in _run_stats_from_dir — malformed snapshots exit with code 2 instead of crashing.
Pre-flight audit_sample_index bounds check before dataset load.

Testing

New e2e integration test (test_audit_test04_two_phase_flow) exercises the full run_benchmark → two-phase flow against the echo server and asserts both phase subdirs are created and the flow completes gracefully.

Example config

Renamed offline_wan22_test04.yaml → wan22_audit_test04.yaml per review suggestion.

wu6u3tw · 2026-06-05T21:26:19Z

All review feedback has been addressed. Here's a summary of what changed:

Architecture
audit: test04 now runs reference and audit phases in a single command back-to-back against the same endpoint — no more 3-step workflow, no endpoint-change risk. A single type: audit dataset entry drives both phases (carrying ref_samples, audit_samples, audit_sample_index).

Sample counts & index
ref_samples: 50, audit_samples: 25 — sized for WAN2.2 throughput. audit_sample_index: 3 — fixed per MLCommons audit.config (performance_issue_same_index=3 for WAN2.2).

SingleStream
Added wan22_single_stream_test04.yaml (concurrency=1, ref/audit samples=20 matching MLCommons min_query_count).

Durations
Perf configs: min=10min, max=4hr. Audit configs: min=10min, max=2hr. The 10-min minimum documents MLCommons compliance intent; counts take priority in the current session stop logic, with AND-semantics available as a future improvement.

Robustness fixes (Gemini)

write_verdict moved inside try-except in CLI
_audit_marker uses isinstance guards — no AttributeError possible
Report.from_snapshot wrapped in try/except (KeyError, TypeError) in _run_stats_from_dir

Cleanup

Test renamed to test_audit_test04.py
README.md removed from diff (rebased onto main)
Orphaned type: audit datasets in non-TEST04 configs now emit a warning; multiple audit datasets raise InputValidationError

nvzhihanj

Review Council — first-principles design review

Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough

Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.

nvzhihanj · 2026-06-07T04:48:42Z

Review Council — Multi-AI Code Review (first-principles design review)

Reviewed by: Claude · Depth: thorough
Codex review timed out on this 2046-line diff at xhigh reasoning (the load-gen + compliance surface is large); this pass is Claude-led. The one HIGH bug below was independently verified against the source.

Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.

🔴 Re-design / Must-fix

#	File	Line	Cat	Why it needs a re-design
1	`commands/benchmark/execute.py`	1151	bug	`ref_samples` is a dead write. `Dataset.samples` is consumed nowhere; `ref_config` never sets `n_samples_to_issue`, so the reference phase runs duration-driven and ignores `ref_samples` while the audit phase honors `audit_samples` → the compared phases run mismatched counts. Set `n_samples_to_issue=ref_samples`.
2	`compliance/__init__.py`	18	design	No `AuditTest` abstraction. `run_benchmark` hardcodes `if audit==TEST04`; package exports only `test04_*`. Adding TEST01/05 = cross-cutting edits everywhere. Introduce an `AuditTest` protocol (`plan_runs`+`verify`) registered by `AuditMode`.
3	`config/schema.py`	82	design	`DatasetType.AUDIT` is a fake dataset type the loader ignores, carrying test params on the shared `Dataset` model, then converted to `PERFORMANCE`. Move params to a structured `audit:` block; drop the fake type.
4	`config/runtime_settings.py`	90	design	`test04` boolean leaks into core load-gen. `RuntimeSettings.test04`/`test04_sample_index` + `create_sample_order`'s `if settings.test04`. Use a generic sample-order strategy selector, not a per-test flag.

🟡 Should-fix

#	File	Line	Cat	Summary
5	`commands/benchmark/execute.py`	113	design	`_OVERRIDE_TEST04_SAMPLE_INDEX` stringly-typed magic kwarg through `**runtime_overrides`; pass a typed `run_spec` instead.
6	`commands/benchmark/execute.py`	1146	design	Two-phase `model_copy` surgery is fragile (root cause of #1; ref phase also skips `_validate_audit_test04`). Use a declarative `RunSpec` + validate before any phase runs.
7	`tests/integration/commands/test_benchmark_command.py`	209	testing	`_run_benchmark_test04` has no unit test; the one integration test asserts `verdict OR error` with `min_duration_ms=0` — the regime that hides bug #1.
8	`config/schema.py`	666	design	`audit` bare top-level enum; params scattered, threshold hardcoded. Use a structured compliance sub-config (like `accuracy_config`).
9	`compliance/test04.py`	206	design	QPS compared across phases with different counts/contents (upstream TEST04 uses the same query set); completion guard only protects the FAIL direction. Extends the existing fairness thread; compounded by #1.

🔵 Consider

#	File	Line	Cat	Summary
10	`compliance/test04.py`	175	design	`verify_test04_dirs` vs `verify_test04_from_reports` duplication; dir-swap guard in one path only. Collapse to one core + thin adapters.
11	`commands/benchmark/execute.py`	446	bug	`audit_sample_index` bound-checked vs requested counts, not the loaded dataset size, until phase 2 — an out-of-range index wastes a full reference run.

Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto run_benchmark via per-phase config surgery and untyped overrides instead of a first-class audit-test abstraction (#2). Fixing #2/#3/#4 (an AuditTest that emits typed RunSpecs + a generic ordering strategy) would dissolve most of the others structurally.

Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.

wu6u3tw · 2026-07-01T23:00:21Z

Rebased onto latest main (picked up the deepseek-r1 accuracy commit, #378). One real conflict: execute.py's service-launch timeout — upstream independently added a configurable service_ready_timeout_s (default 30s) solving the same problem our earlier hardcoded 120s + comment was working around, so took upstream's version and dropped ours. AGENTS.md's Compliance table row conflicted at every commit that touched it (rebase replays each commit against a moving base); resolved each to the row's latest wording. All 42 compliance/audit tests pass, pre-commit run --all-files clean. History rewritten again — as before, expect review threads to show as outdated.

codecov-commenter · 2026-07-01T23:06:56Z

Codecov Report

❌ Patch coverage is 83.33333% with 67 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@1577095). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
...c/inference_endpoint/commands/benchmark/execute.py	64.74%	55 Missing ⚠️
src/inference_endpoint/compliance/__init__.py	89.18%	4 Missing ⚠️
src/inference_endpoint/compliance/result.py	89.65%	3 Missing ⚠️
src/inference_endpoint/commands/audit.py	96.72%	2 Missing ⚠️
...point/compliance/audit_test/output_caching_test.py	96.66%	2 Missing ⚠️
...sync_utils/services/metrics_aggregator/__main__.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #332   +/-   ##
=======================================
  Coverage        ?   80.69%           
=======================================
  Files           ?      119           
  Lines           ?    15720           
  Branches        ?        0           
=======================================
  Hits            ?    12685           
  Misses          ?     3035           
  Partials        ?        0

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

arekay-nv

Review Council — Multi-AI Code Review (Codex + Claude, depth: thorough)

Inline findings posted below. See the summary comment for the tiered overview.

arekay-nv · 2026-07-02T02:46:02Z

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough | Focus: extensibility (basis for future audit tests)

Found 10 issues across 8 files. The framework is well-structured — run_audit is genuinely test-agnostic, the AuditTest protocol (plan_runs/verify/validate) is a clean seam, result.py's atomic write (tmp → fsync → rename → fsync-parent) is correct, and prior review rounds (the AuditMode re-entrancy trap, non-atomic verdict write, verify() ignoring threshold, "no AuditTest abstraction") are all resolved. The remaining items are one correctness issue in how the audit is wired to run, and a set of extensibility/robustness refinements.

🔴 Must Fix

Incorrect behavior under normal usage.

#	File	Line	Category	Reviewer(s)	Summary
1	`commands/benchmark/cli.py`	66	correctness	Codex	Main run executes before the audit and primes a caching SUT with the same dataset/seed → reference QPS inflated → TEST04 can false-PASS a caching SUT. This is the default path in the PR's own submission examples.

🟡 Should Fix

Real issues under specific conditions, or design debt that compounds.

#	File	Line	Category	Reviewer(s)	Summary
2	`commands/audit.py`	106	bug	Both	Per-phase tmpfs dir (`/dev/shm/benchmark_*`) never cleaned up — `run_audit` bypasses `run_benchmark`'s `finally` rmtree. Leaks accumulate across audits.
3	`commands/audit.py`	144	error-handling	Claude	A clean-but-zero-throughput phase makes `verify()`→`stats()` raise raw `ValueError`, which escapes as an exit-1 traceback indistinguishable from a real FAIL.
4	`compliance/__init__.py`	50	design	Claude	Extensibility: `AuditRunStats`/`stats()` are QPS/TEST04-specific but sit in the shared framework module; a non-QPS audit can't use them. Move into `output_caching_test.py`.
5	`tests/unit/compliance/test_output_caching.py`	301	testing	Claude	Zero-QPS path is tested only in isolation on `from_report`, never through `run_audit`/`verify` — the crash in #3 has no coverage.

🔵 Consider

Valid improvements / follow-ups — several directly serve the extensibility goal.

#	File	Line	Category	Reviewer(s)	Summary
6	`compliance/__init__.py`	106	design	Claude	Extensibility: `get_audit_test` raises bare `KeyError` on a registry miss → exit-1 traceback. A contributor who forgets the `AUDIT_TESTS` entry gets no clear diagnostic. Raise `SetupError`.
7	`config/runtime_settings.py`	44	design	Claude	Extensibility: `SampleOrderSpec` can express only 2 of 3 sample orders (no with-replacement); `fixed_index` conflates strategy with parameter.
8	`compliance/audit_test/__init__.py`	18	documentation	Claude	Docstring says this `__init__` wires the registry; the wiring is actually in the parent `compliance/__init__.py`.
9	`compliance/audit_test/output_caching_test.py`	152	documentation	Claude	Sequential phase-ordering warmth bias is real but undocumented in the README pass criteria (why examples use `threshold` 0.10 vs 0.20).
10	`AGENTS.md`	188	documentation	Claude	Code Organization tree omits `audit_test/__init__.py`.

On extensibility (the review's focus): the protocol + registry design is the right shape — adding a test should be implement AuditTest + one AUDIT_TESTS entry + one enum member, and run_audit needs no changes. Findings #4, #6, #7 are the gaps between that intent and the current code: metric-specific types in the shared module (#4), an unregistered-test failure mode that isn't diagnosable (#6), and a sample-order selector that can't yet express all existing orders (#7). Addressing them now — while the seam is new and has exactly one implementor — is much cheaper than after a second test hardens the shape. One further note (not posted inline, as docs/compliance_audit_plan.md documents it as deliberate): audit: holds a single config, so a full compliance suite (TEST01+TEST04+…) needs N separate runs; the documented migration is audit: list[AuditConfig] + a loop in run_audit.

⚠️ Commit hygiene: 14 commits including 5 fix/wip commits (e.g. wip(compliance): move audit validation…, several fix(compliance): …). Consider squashing the fixups into their feature commits before merge.

_{Codex reviewed against the true merge-base (157709…); the reference/audit phase code, atomic writer, and validate() bounds checks were read directly to confirm every line anchor.}

The audit's reference phase re-issues the same performance dataset (same dataloader_random_seed) the main run also issues. Running the main run first primes a response-caching SUT, so the reference phase inherits those cache hits and its QPS rises to match the fixed-sample audit phase — the audit could PASS even though the SUT is caching, which is exactly what TEST04 exists to detect (PR mlcommons#332 review finding). cli._run now resolves the report_dir once and runs the audit before run_benchmark, so both share one directory tree regardless of order. A crashed or interrupted audit skips the main run entirely (mirroring how a main-run crash used to skip the audit); a completed audit (PASS or FAIL) still lets the main run execute so a FAIL doesn't cost the submission its perf report.

…ions run_audit drives setup_benchmark -> run_benchmark_async -> finalize_benchmark directly for each phase, bypassing run_benchmark()'s own finally block — the only place that salvages and removes bench.tmpfs_dir. Without per-phase cleanup, each audit phase leaked its RAM-backed /dev/shm directory (PR mlcommons#332 review finding). Also wrap test.verify(...) the same way phase execution already is: a phase that completes zero samples yields qps=0, and AuditRunStats.from_report raises a bare ValueError for that — previously outside any try/except, it escaped run_audit uncaught instead of mapping to the documented exit code 4 (ExecutionError).

…spec PR mlcommons#332 review requested this rename to make the parameter's audit-only purpose explicit. setup_benchmark itself stays as-is: it's shared by both run_benchmark (main run) and run_audit (phase execution), so renaming the function to something audit-specific would be incorrect.

…ng PERF run_audit previously called setup_benchmark with a hardcoded TestMode.PERF for every phase, unconditionally stripping accuracy datasets. That's correct for output_caching_test's two PERF-only phases, but a future audit test needing ACC/BOTH mode would silently run as PERF, never invoking the accuracy scorer despite the phase being labeled otherwise (PR mlcommons#332 review finding). AuditRunSpec gains a test_mode field (defaults to TestMode.PERF, so output_caching_test is unaffected); run_audit now branches phase-dataset selection and the setup_benchmark call on spec.test_mode instead of assuming every audit is performance-only.

wu6u3tw · 2026-07-02T21:30:24Z

Update summary

Addressed this round of review findings:

High: audit now runs before the main benchmark (cli._run), not after — the reference phase was getting primed by the main run's traffic against a caching SUT, masking exactly the caching TEST04 exists to detect.
Medium: each audit phase now cleans up its own /dev/shm tmpfs dir (run_audit bypassed run_benchmark's cleanup finally, leaking one per phase per run).
Medium: test.verify(...) is now wrapped like phase execution already was — a zero-QPS phase's bare ValueError maps to ExecutionError (exit 4) instead of escaping as an uncaught traceback. Added a unit test exercising this through the real run_audit path.
Low: AuditRunSpec gains a test_mode field (defaults to PERF); run_audit reads it per-phase instead of hardcoding TestMode.PERF, so a future ACC/BOTH-mode audit phase won't silently run as PERF.
Naming: setup_benchmark's run_spec param renamed to audit_run_spec. Left setup_benchmark itself as-is — it's shared by both run_benchmark (main run) and run_audit (phase execution), so renaming it to something audit-specific would be incorrect.

Also confirmed and resolved two threads that were already fixed but never marked resolved (the Ctrl-C exit-code distinction at audit.py:124, and the e2e test's result.passed/verify .txt assertions at test_benchmark_command.py:217).

All unit tests pass (1325), integration audit tests pass, pre-commit run --all-files clean including mypy.

wu6u3tw · 2026-07-02T22:07:02Z

History squashed to 3 commits

Branch was force-pushed to reorganize into three focused commits (docs+examples / implementation / tests), following the same pattern as the earlier squash on this PR. No content changed — the resulting tree is byte-identical to the prior tip; verified via git diff <old>..<new> --stat showing zero changes.

Heads-up: existing review threads (including already-resolved ones) will likely show as outdated since their anchored lines moved. Nothing was dropped — only re-anchored.

wu6u3tw · 2026-07-02T22:42:57Z

Fixed 6 more findings (naming, tmpfs, design, docs); 3 discussions remain.

wu6u3tw

(retracted)

arekay-nv

Thanks, there are minor issues that we can address in followup.

arekay-nv · 2026-07-02T21:46:27Z

+    if config.audit is None:
+        run_benchmark(config, mode)
+        return
+
+    # Freeze the report dir now so the audit and the main run share the same
+    # directory tree regardless of which one executes first.
+    report_dir = resolve_report_dir(config)
+    config = config.with_updates(report_dir=str(report_dir))
+    # Run the audit BEFORE the main benchmark. The audit's reference phase
+    # must observe an unprimed SUT: if the main run went first, a
+    # response-caching SUT would already have cached every sample in the same
+    # performance dataset the reference phase re-issues (same
+    # dataloader_random_seed), inflating the reference phase's QPS to match
+    # the fixed-sample audit phase and masking exactly the caching behavior
+    # TEST04 exists to detect. If the audit crashes (SetupError/
+    # ExecutionError) or is interrupted, it re-raises here and the main run
+    # never starts — mirroring how a main-run crash used to skip the audit
+    # entirely. A completed audit (PASS or FAIL) still lets the main run
+    # produce its own submission-quality report either way.
+    # All audit artifacts (phase subdirs + verify_<TEST>.txt +
+    # audit_result.json) nest under <report_dir>/audit/ so they don't
+    # intermingle with the main run's top-level output.
+    result = run_audit(config, report_dir / "audit")


I believe that the auto-review flagged this, but is this the intended behavior from MLPerf?

MLPerf runs these as separate processes.

mlperf inference do not have it in a decided/specific order in the rules or any design docs.
But mlperf inference run ref first.

Design plan for the MLPerf TEST04 output-caching compliance audit (docs/compliance_audit_plan.md): the AuditTest protocol, the generic run_audit orchestrator, and the output-caching detection algorithm (reference phase over distinct samples vs. a fixed-sample audit phase, comparing achieved QPS). Adds the Compliance entry to AGENTS.md, the usage README under compliance/audit_test/, and the two WAN2.2 submission example configs (Offline + SingleStream) wiring perf, accuracy, and audit into a single from-config run. Notes two known limitations as pending future work: multiple performance datasets (bounds-checking assumes exactly one) and multiple audit instances (audit: is a single AuditConfig, not a list). Documents the reference-phase-first ordering bias absorbed by threshold, and lists audit_test/__init__.py in the Code Organization tree.

Generic AuditTest framework (compliance/): protocol + AuditRunSpec / AuditRunArtifacts + test registry, wired to OutputCachingAudit (compliance/audit_test/), which also owns the QPS-specific AuditRunStats (its only consumer) rather than the shared framework module. run_audit (commands/audit.py) orchestrates the planned phases back-to-back, dispatched by cli._run before the main benchmark run (so the reference phase observes an unprimed SUT rather than one already warmed by the main run's own traffic against the same dataset). A crashed or interrupted audit skips the main run; a completed audit (PASS or FAIL) still lets it execute, so a FAIL doesn't cost the submission its perf report. A registry miss in get_audit_test raises SetupError (exit 3), not a bare KeyError. Each audit phase reuses the existing setup_benchmark / run_benchmark_async / finalize_benchmark engine via AuditRunSpec.test_mode (defaults to PERF; ACC/BOTH keeps accuracy datasets) and its own tmpfs salvage + cleanup, since it bypasses run_benchmark()'s own finally block. _run_benchmark_async itself now guards its tmpfs directory with a try/except BaseException around the whole ZMQ-context body, so a crash partway through (worker failure, metrics-drain exception) still salvages logs and removes the tmpfs dir instead of leaking it, regardless of which caller invoked it. Per-phase failures normalize to SetupError/ExecutionError (exit 3/4); an interrupted phase propagates KeyboardInterrupt (exit 130) instead. compliance/result.py writes audit_result.json + verify_<TEST>.txt atomically (tmp, fsync, rename, fsync-parent). SampleOrderSpec (config/runtime_settings.py) gains a SampleOrderKind tag (without_replacement / with_replacement / single) instead of overloading fixed_index alone, so it can express all three SampleOrder strategies (load_generator/sample_order.py) including with-replacement, which the audit phase's fixed-index repeat-sample traversal sits alongside. schema.py adds the AuditConfig discriminated union (AuditTestId.OUTPUT_CACHING_TEST / OutputCachingTestConfig) and the audit: field on BenchmarkConfig.

Unit coverage (tests/unit/compliance/test_output_caching.py): verify_output_caching's pass/fail/threshold/boundary logic (including the exact completion-guard boundary), OutputCachingAudit.plan_runs/ validate, AuditRunStats.from_report (including the zero-QPS ValueError path, normalized to ExecutionError through the real run_audit call), write_result's atomic write, the registry (including get_audit_test's SetupError on a miss), and run_audit's orchestrator guards: out-of-range sample_index, rate-paced load patterns, reference count vs dataset size, incomplete/interrupted phases, accuracy-dataset stripping, per-phase tmpfs cleanup, and AuditRunSpec.test_mode branching. tests/unit/commands/test_benchmark.py adds a test confirming _run_benchmark_async cleans up its tmpfs directory when it crashes mid-run, before ever returning a BenchmarkResult. Integration coverage (tests/integration/commands/test_benchmark_command.py): the full reference + output_caching two-phase flow against a real echo server for both offline and single-stream, asserting the PASS verdict, verify_<TEST>.txt, and audit_result.json contents; a dedicated cli._run test confirming the audit dispatches before the main benchmark run against one shared report_dir. Also covers SampleOrderSpec/SampleOrderKind (including the new with-replacement variant) and SingleSampleOrder (tests/unit/config/, tests/unit/load_generator/), and the signal-handling .ready-file fix in the metrics aggregator (tests/integration/async_utils/).

wu6u3tw requested a review from a team June 3, 2026 20:53

gemini-code-assist Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/inference_endpoint/compliance/__main__.py Outdated

Comment thread src/inference_endpoint/compliance/test04.py Outdated

Comment thread src/inference_endpoint/compliance/test04.py Outdated

wu6u3tw requested review from arekay-nv and nv-alicheng June 3, 2026 22:19

nvzhihanj reviewed Jun 4, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 4, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated

nvzhihanj reviewed Jun 4, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated

wu6u3tw force-pushed the feat/test04-compliance branch from 9057190 to b547f1d Compare June 4, 2026 23:14

viraatc force-pushed the main branch from 19b20bb to 28f5ac1 Compare June 4, 2026 23:25

wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from cdbae64 to eae1234 Compare June 4, 2026 23:40

nvzhihanj reviewed Jun 5, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 5, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 5, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 5, 2026

View reviewed changes

Comment thread tests/unit/compliance/test_audit_test04.py Outdated

nvzhihanj reviewed Jun 5, 2026

View reviewed changes

Comment thread README.md Outdated

wu6u3tw force-pushed the feat/test04-compliance branch 3 times, most recently from b190d21 to e0de06f Compare June 5, 2026 21:03

wu6u3tw commented Jun 5, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22.yaml

wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from 385630c to c1e48bf Compare June 5, 2026 21:22

nvzhihanj reviewed Jun 6, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 7, 2026

View reviewed changes

viraatc reviewed Jun 8, 2026

View reviewed changes

Comment thread examples/09_Wan22_VideoGen_Example/single_stream_wan22.yaml Outdated

wu6u3tw force-pushed the feat/test04-compliance branch from 802acf6 to f5b7f15 Compare July 1, 2026 23:00

wu6u3tw commented Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated

arekay-nv reviewed Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated

wu6u3tw commented Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py

arekay-nv reviewed Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py

wu6u3tw commented Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/benchmark/execute.py

wu6u3tw commented Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/audit.py

wu6u3tw commented Jul 1, 2026

View reviewed changes

Comment thread src/inference_endpoint/commands/audit.py Outdated

arekay-nv reviewed Jul 2, 2026

View reviewed changes

arekay-nv requested a review from nvzhihanj July 2, 2026 17:58

wu6u3tw force-pushed the feat/test04-compliance branch from bd1396a to 7ebb69a Compare July 2, 2026 22:03

wu6u3tw force-pushed the feat/test04-compliance branch 3 times, most recently from 178f94e to e09ce01 Compare July 2, 2026 22:41

wu6u3tw commented Jul 2, 2026

View reviewed changes

arekay-nv approved these changes Jul 2, 2026

View reviewed changes

wu6u3tw added 3 commits July 2, 2026 16:38

wu6u3tw force-pushed the feat/test04-compliance branch from e09ce01 to 0bf10bd Compare July 2, 2026 23:40

Uh oh!

Conversation

wu6u3tw commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design (the two axes)

Config shape

What's included

Exit codes

Testing

Implementation notes (moved from the design doc)

File-by-file changes (against main)

Requirements traceability

Maintainer workflow & example-config threads

Design-review findings (both Review Council passes)

Robustness & API hygiene (Gemini + Review Council)

Success criteria (goal-driven)

Uh oh!

github-actions Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wu6u3tw commented Jun 4, 2026

Update summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wu6u3tw commented Jun 5, 2026

Uh oh!

Uh oh!

nvzhihanj left a comment

Choose a reason for hiding this comment

Review Council — first-principles design review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nvzhihanj commented Jun 7, 2026

Review Council — Multi-AI Code Review (first-principles design review)

🔴 Re-design / Must-fix

🟡 Should-fix

🔵 Consider

Uh oh!

Uh oh!

wu6u3tw commented Jul 1, 2026

Uh oh!

codecov-commenter commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arekay-nv left a comment

Choose a reason for hiding this comment

Review Council — Multi-AI Code Review (Codex + Claude, depth: thorough)

wu6u3tw commented Jun 3, 2026 •

edited

Loading

File-by-file changes (against `main`)

github-actions Bot commented Jun 3, 2026 •

edited

Loading

codecov-commenter commented Jul 1, 2026 •

edited

Loading

wu6u3tw left a comment •

edited

Loading

wu6u3tw Jul 2, 2026 •

edited

Loading

wu6u3tw Jul 2, 2026 •

edited

Loading