feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332
wu6u3tw wants to merge 3 commits into
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.
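For readers new to TEST04, the throughput comparison summarized above can be sketched roughly as follows. This is a minimal illustration, not the PR's `verify_output_caching`: `PhaseStats` and `caching_suspected` are hypothetical names, and the 10% default is taken from the PR description's stated threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseStats:
    """Hypothetical per-phase summary (not the PR's AuditRunStats)."""

    completed: int
    duration_s: float

    @property
    def qps(self) -> float:
        # A phase with no measurable duration cannot yield a throughput.
        if self.duration_s <= 0:
            raise ValueError("duration must be positive")
        return self.completed / self.duration_s


def caching_suspected(
    reference: PhaseStats, audit: PhaseStats, threshold: float = 0.10
) -> bool:
    """Return True when the fixed-sample audit phase is more than
    `threshold` faster than the distinct-sample reference phase."""
    return audit.qps > reference.qps * (1.0 + threshold)


reference = PhaseStats(completed=100, duration_s=10.0)  # 10 QPS
fast_audit = PhaseStats(completed=50, duration_s=4.0)   # 12.5 QPS
same_audit = PhaseStats(completed=50, duration_s=5.0)   # 10 QPS
print(caching_suspected(reference, fast_audit))
print(caching_suspected(reference, same_audit))
```

A slower audit phase passes under this rule; only a speed-up beyond the threshold is treated as evidence of caching.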
…review
Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError — all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update summary
All review feedback has been addressed. Here is what changed since the original submission:
Architecture (main concern)
Config shape
audit: "test04"
datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50 # reference phase query count (50–144)
  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25 # audit phase query count (25–50)
    audit_sample_index: 0
Robustness
Testing
Example config
9057190 to b547f1d
cdbae64 to eae1234
b190d21 to e0de06f
385630c to c1e48bf
All review feedback has been addressed. Here's a summary of what changed: Architecture Sample counts & index SingleStream Durations Robustness fixes (Gemini)
Cleanup
nvzhihanj left a comment
Review Council — first-principles design review
Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough
Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.
Review Council — Multi-AI Code Review (first-principles design review)
Reviewed by: Claude · Depth: thorough
Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.
🔴 Re-design / Must-fix
🟡 Should-fix
🔵 Consider
Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto
Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.
802acf6 to f5b7f15
Rebased onto latest
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #332 +/- ##
=======================================
Coverage ? 80.69%
=======================================
Files ? 119
Lines ? 15720
Branches ? 0
=======================================
Hits ? 12685
Misses ? 3035
Partials ? 0
arekay-nv left a comment
Review Council — Multi-AI Code Review (Codex + Claude, depth: thorough)
Inline findings posted below. See the summary comment for the tiered overview.
Review Council — Multi-AI Code Review
Reviewed by: Codex + Claude | Depth: thorough | Focus: extensibility (basis for future audit tests)
Found 10 issues across 8 files. The framework is well-structured —
🔴 Must Fix
Incorrect behavior under normal usage.
🟡 Should Fix
Real issues under specific conditions, or design debt that compounds.
🔵 Consider
Valid improvements / follow-ups — several directly serve the extensibility goal.
On extensibility (the review's focus): the protocol + registry design is the right shape — adding a test should be implement
Codex reviewed against the true merge-base (
The audit's reference phase re-issues the same performance dataset (same dataloader_random_seed) the main run also issues. Running the main run first primes a response-caching SUT, so the reference phase inherits those cache hits and its QPS rises to match the fixed-sample audit phase — the audit could PASS even though the SUT is caching, which is exactly what TEST04 exists to detect (PR mlcommons#332 review finding). cli._run now resolves the report_dir once and runs the audit before run_benchmark, so both share one directory tree regardless of order. A crashed or interrupted audit skips the main run entirely (mirroring how a main-run crash used to skip the audit); a completed audit (PASS or FAIL) still lets the main run execute so a FAIL doesn't cost the submission its perf report.
…ions run_audit drives setup_benchmark -> run_benchmark_async -> finalize_benchmark directly for each phase, bypassing run_benchmark()'s own finally block — the only place that salvages and removes bench.tmpfs_dir. Without per-phase cleanup, each audit phase leaked its RAM-backed /dev/shm directory (PR mlcommons#332 review finding). Also wrap test.verify(...) the same way phase execution already is: a phase that completes zero samples yields qps=0, and AuditRunStats.from_report raises a bare ValueError for that — previously outside any try/except, it escaped run_audit uncaught instead of mapping to the documented exit code 4 (ExecutionError).
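The two fixes in this commit message (per-phase scratch cleanup and normalizing a zero-QPS `ValueError`) could be sketched along these lines. This is an assumption-laden illustration: `run_phase` and its callback signature are hypothetical, `ExecutionError` is a local stand-in for the repo's exception, and an ordinary temp directory stands in for the `/dev/shm` tmpfs directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


class ExecutionError(RuntimeError):
    """Stand-in for the repo's ExecutionError (exit code 4 per the PR)."""


def run_phase(body: Callable[[Path], float]) -> float:
    """Run one audit phase in a scratch directory and return its QPS.

    The scratch directory is removed on every exit path, and a
    ValueError from the phase is re-raised as ExecutionError.
    """
    # Hypothetical stand-in for the per-phase tmpfs directory.
    scratch = Path(tempfile.mkdtemp(prefix="audit_phase_"))
    try:
        qps = body(scratch)
        if qps <= 0:
            # A phase that completed zero samples has no usable throughput.
            raise ValueError("phase completed zero samples")
        return qps
    except ValueError as exc:
        raise ExecutionError(str(exc)) from exc
    finally:
        # Runs on success, on failure, and on KeyboardInterrupt alike.
        shutil.rmtree(scratch, ignore_errors=True)
```

Putting the cleanup in `finally` rather than in the success path is what keeps an interrupted phase from leaking its directory.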
…spec PR mlcommons#332 review requested this rename to make the parameter's audit-only purpose explicit. setup_benchmark itself stays as-is: it's shared by both run_benchmark (main run) and run_audit (phase execution), so renaming the function to something audit-specific would be incorrect.
…ng PERF run_audit previously called setup_benchmark with a hardcoded TestMode.PERF for every phase, unconditionally stripping accuracy datasets. That's correct for output_caching_test's two PERF-only phases, but a future audit test needing ACC/BOTH mode would silently run as PERF, never invoking the accuracy scorer despite the phase being labeled otherwise (PR mlcommons#332 review finding). AuditRunSpec gains a test_mode field (defaults to TestMode.PERF, so output_caching_test is unaffected); run_audit now branches phase-dataset selection and the setup_benchmark call on spec.test_mode instead of assuming every audit is performance-only.
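A rough sketch of the `test_mode` branching this commit describes, assuming a simplified spec and dataset shape. Only `AuditRunSpec.test_mode`, `n_samples`, and the `TestMode` members come from the PR text; the `name` field, the dict-based datasets, and `select_datasets` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class TestMode(Enum):
    PERF = "perf"
    ACC = "acc"
    BOTH = "both"


@dataclass(frozen=True)
class AuditRunSpec:
    name: str                            # hypothetical field
    n_samples: int
    test_mode: TestMode = TestMode.PERF  # default keeps PERF-only audits unchanged


def select_datasets(spec: AuditRunSpec, datasets: list[dict]) -> list[dict]:
    """Pick the datasets a phase should load (hypothetical helper)."""
    if spec.test_mode is TestMode.PERF:
        # Performance-only phases drop accuracy datasets.
        return [d for d in datasets if d["type"] != "accuracy"]
    # ACC and BOTH phases keep accuracy datasets so the scorer can run.
    return list(datasets)


datasets = [
    {"name": "prompts", "type": "performance"},
    {"name": "scoring", "type": "accuracy"},
]
perf_phase = AuditRunSpec(name="reference", n_samples=50)
both_phase = AuditRunSpec(name="future_audit", n_samples=50, test_mode=TestMode.BOTH)
print([d["name"] for d in select_datasets(perf_phase, datasets)])
print([d["name"] for d in select_datasets(both_phase, datasets)])
```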
Update summary
Addressed this round of review findings:
Also confirmed and resolved two threads that were already fixed but never marked resolved (the Ctrl-C exit-code distinction at
All unit tests pass (1325), integration audit tests pass,
bd1396a to 7ebb69a
History squashed to 3 commits
Branch was force-pushed to reorganize into three focused commits (docs+examples / implementation / tests), following the same pattern as the earlier squash on this PR. No content changed — the resulting tree is byte-identical to the prior tip; verified via
Heads-up: existing review threads (including already-resolved ones) will likely show as outdated since their anchored lines moved. Nothing was dropped — only re-anchored.
178f94e to e09ce01
Fixed 6 more findings (naming, tmpfs, design, docs); 3 discussions remain.
arekay-nv left a comment
Thanks, there are minor issues that we can address in followup.
if config.audit is None:
    run_benchmark(config, mode)
    return

# Freeze the report dir now so the audit and the main run share the same
# directory tree regardless of which one executes first.
report_dir = resolve_report_dir(config)
config = config.with_updates(report_dir=str(report_dir))
# Run the audit BEFORE the main benchmark. The audit's reference phase
# must observe an unprimed SUT: if the main run went first, a
# response-caching SUT would already have cached every sample in the same
# performance dataset the reference phase re-issues (same
# dataloader_random_seed), inflating the reference phase's QPS to match
# the fixed-sample audit phase and masking exactly the caching behavior
# TEST04 exists to detect. If the audit crashes (SetupError/
# ExecutionError) or is interrupted, it re-raises here and the main run
# never starts — mirroring how a main-run crash used to skip the audit
# entirely. A completed audit (PASS or FAIL) still lets the main run
# produce its own submission-quality report either way.
# All audit artifacts (phase subdirs + verify_<TEST>.txt +
# audit_result.json) nest under <report_dir>/audit/ so they don't
# intermingle with the main run's top-level output.
result = run_audit(config, report_dir / "audit")
I believe that the auto-review flagged this, but is this the intended behavior from MLPerf?
MLPerf runs these as separate processes.
MLPerf Inference does not prescribe a specific order in the rules or in any design docs, but in practice MLPerf Inference runs the reference first.
Design plan for the MLPerf TEST04 output-caching compliance audit (docs/compliance_audit_plan.md): the AuditTest protocol, the generic run_audit orchestrator, and the output-caching detection algorithm (reference phase over distinct samples vs. a fixed-sample audit phase, comparing achieved QPS). Adds the Compliance entry to AGENTS.md, the usage README under compliance/audit_test/, and the two WAN2.2 submission example configs (Offline + SingleStream) wiring perf, accuracy, and audit into a single from-config run. Notes two known limitations as pending future work: multiple performance datasets (bounds-checking assumes exactly one) and multiple audit instances (audit: is a single AuditConfig, not a list). Documents the reference-phase-first ordering bias absorbed by threshold, and lists audit_test/__init__.py in the Code Organization tree.
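To make the protocol-plus-registry shape in this design plan concrete, here is a minimal sketch. It is not the PR's code: the method signatures are simplified assumptions (the real `AuditTest` takes richer spec and stats types), `register` is a hypothetical helper, and `SetupError` is a local stand-in for the repo's exception.

```python
from typing import Protocol


class SetupError(Exception):
    """Stand-in for the repo's SetupError (exit code 3 per the PR)."""


class AuditTest(Protocol):
    """Simplified protocol; signatures here are assumptions."""

    def plan_runs(self) -> list[str]: ...

    def verify(self, qps_by_phase: dict[str, float]) -> bool: ...


_REGISTRY: dict[str, type] = {}


def register(test_id: str):
    """Class decorator that records an audit test under its id."""
    def decorator(cls: type) -> type:
        _REGISTRY[test_id] = cls
        return cls
    return decorator


@register("output_caching_test")
class OutputCachingAudit:
    def plan_runs(self) -> list[str]:
        return ["reference", "audit"]

    def verify(self, qps_by_phase: dict[str, float]) -> bool:
        # PASS when the audit phase is at most 10% faster than the reference.
        return qps_by_phase["audit"] <= qps_by_phase["reference"] * 1.10


def get_audit_test(test_id: str) -> AuditTest:
    cls = _REGISTRY.get(test_id)
    if cls is None:
        # Surface a typed setup failure rather than a bare KeyError.
        raise SetupError(f"unknown audit test: {test_id!r}")
    return cls()
```

With this shape, adding another audit is a new registered class; the orchestrator only calls `plan_runs` and `verify` through the protocol.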
Generic AuditTest framework (compliance/): protocol + AuditRunSpec / AuditRunArtifacts + test registry, wired to OutputCachingAudit (compliance/audit_test/), which also owns the QPS-specific AuditRunStats (its only consumer) rather than the shared framework module.
run_audit (commands/audit.py) orchestrates the planned phases back-to-back, dispatched by cli._run before the main benchmark run (so the reference phase observes an unprimed SUT rather than one already warmed by the main run's own traffic against the same dataset). A crashed or interrupted audit skips the main run; a completed audit (PASS or FAIL) still lets it execute, so a FAIL doesn't cost the submission its perf report. A registry miss in get_audit_test raises SetupError (exit 3), not a bare KeyError.
Each audit phase reuses the existing setup_benchmark / run_benchmark_async / finalize_benchmark engine via AuditRunSpec.test_mode (defaults to PERF; ACC/BOTH keeps accuracy datasets) and its own tmpfs salvage + cleanup, since it bypasses run_benchmark()'s own finally block. _run_benchmark_async itself now guards its tmpfs directory with a try/except BaseException around the whole ZMQ-context body, so a crash partway through (worker failure, metrics-drain exception) still salvages logs and removes the tmpfs dir instead of leaking it, regardless of which caller invoked it. Per-phase failures normalize to SetupError/ExecutionError (exit 3/4); an interrupted phase propagates KeyboardInterrupt (exit 130) instead.
compliance/result.py writes audit_result.json + verify_<TEST>.txt atomically (tmp, fsync, rename, fsync-parent).
SampleOrderSpec (config/runtime_settings.py) gains a SampleOrderKind tag (without_replacement / with_replacement / single) instead of overloading fixed_index alone, so it can express all three SampleOrder strategies (load_generator/sample_order.py) including with-replacement, which the audit phase's fixed-index repeat-sample traversal sits alongside.
schema.py adds the AuditConfig discriminated union (AuditTestId.OUTPUT_CACHING_TEST / OutputCachingTestConfig) and the audit: field on BenchmarkConfig.
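The atomic write sequence named in this commit (tmp, fsync, rename, fsync-parent) can be illustrated as below. This is a sketch of the general POSIX technique, not the contents of compliance/result.py; `write_json_atomically` is a hypothetical name, and the directory fsync assumes a POSIX filesystem.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write `payload` so readers see either the old file or the new one."""
    path = Path(path)
    parent = path.parent
    # 1. tmp: create the temp file in the same directory, so the rename
    #    below stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2)
            fh.flush()
            # 2. fsync: force the file contents to stable storage.
            os.fsync(fh.fileno())
        # 3. rename: atomically swap the temp file into place.
        os.replace(tmp_name, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    # 4. fsync-parent: persist the directory entry itself (POSIX only).
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping the final directory fsync would still give atomic visibility, but the rename itself might not survive a power loss.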
Unit coverage (tests/unit/compliance/test_output_caching.py): verify_output_caching's pass/fail/threshold/boundary logic (including the exact completion-guard boundary), OutputCachingAudit.plan_runs/ validate, AuditRunStats.from_report (including the zero-QPS ValueError path, normalized to ExecutionError through the real run_audit call), write_result's atomic write, the registry (including get_audit_test's SetupError on a miss), and run_audit's orchestrator guards: out-of-range sample_index, rate-paced load patterns, reference count vs dataset size, incomplete/interrupted phases, accuracy-dataset stripping, per-phase tmpfs cleanup, and AuditRunSpec.test_mode branching.
tests/unit/commands/test_benchmark.py adds a test confirming _run_benchmark_async cleans up its tmpfs directory when it crashes mid-run, before ever returning a BenchmarkResult.
Integration coverage (tests/integration/commands/test_benchmark_command.py): the full reference + output_caching two-phase flow against a real echo server for both offline and single-stream, asserting the PASS verdict, verify_<TEST>.txt, and audit_result.json contents; a dedicated cli._run test confirming the audit dispatches before the main benchmark run against one shared report_dir.
Also covers SampleOrderSpec/SampleOrderKind (including the new with-replacement variant) and SingleSampleOrder (tests/unit/config/, tests/unit/load_generator/), and the signal-handling .ready-file fix in the metrics aggregator (tests/integration/async_utils/).
e09ce01 to 0bf10bd
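The three sample-order strategies these tests cover might behave roughly like the generator below. Only the `SampleOrderKind` member names come from the commit text; `sample_indices`, its parameters, and the reshuffle-per-pass behavior for without-replacement are assumptions made for illustration.

```python
import random
from enum import Enum
from typing import Iterator


class SampleOrderKind(Enum):
    WITHOUT_REPLACEMENT = "without_replacement"
    WITH_REPLACEMENT = "with_replacement"
    SINGLE = "single"


def sample_indices(
    kind: SampleOrderKind,
    dataset_size: int,
    n_to_issue: int,
    *,
    seed: int = 0,
    fixed_index: int = 0,
) -> Iterator[int]:
    """Yield `n_to_issue` dataset indices according to `kind` (hypothetical)."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    rng = random.Random(seed)
    if kind is SampleOrderKind.SINGLE:
        # Audit phase: the same bounds-checked index for every query.
        if not 0 <= fixed_index < dataset_size:
            raise ValueError(f"fixed_index {fixed_index} out of range")
        for _ in range(n_to_issue):
            yield fixed_index
    elif kind is SampleOrderKind.WITH_REPLACEMENT:
        for _ in range(n_to_issue):
            yield rng.randrange(dataset_size)
    else:
        # Without replacement: shuffled passes, no repeat within a pass.
        issued = 0
        while issued < n_to_issue:
            order = list(range(dataset_size))
            rng.shuffle(order)
            for idx in order:
                if issued == n_to_issue:
                    return
                yield idx
                issued += 1


print(list(sample_indices(SampleOrderKind.SINGLE, 10, 4, fixed_index=3)))
```

Because this is a generator, the range check fires on first iteration rather than at call time; the PR instead validates the index before any phase runs.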
Summary
Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an audit: block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.
TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream compliance/TEST04/verify_performance.py).
Design (the two axes)
- SampleOrderSpec (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a RunSpec. No test-specific knowledge leaks into the load generator.
- AuditTest.verify(runs) -> AuditVerdict, registered per test.
A generic orchestrator (commands/audit.py::run_audit) runs each RunSpec phase back-to-back via the existing setup_benchmark/run_benchmark_async path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.
Config shape
AuditConfig is a discriminated-union-ready sub-model on BenchmarkConfig (parallel to AccuracyConfig) — no DatasetType.AUDIT, no audit fields polluting Dataset, no test04 boolean in RuntimeSettings.
What's included
- compliance/__init__.py — AuditTest protocol + RunSpec / RunStats / RunArtifacts + registry
- compliance/verdict.py — AuditVerdict + atomic write_verdict (tmp → fsync → rename → fsync)
- compliance/tests/test04.py — Test04Audit + verify_test04
- commands/audit.py — generic run_audit orchestrator
- config/schema.py — AuditTestId + Test04Config / AuditConfig + BenchmarkConfig.audit
- load_generator — SampleOrderSpec + SingleSampleOrder + factory dispatch
- docs/compliance_audit_plan.md — the design plan
- offline_wan22_submission.yaml, single_stream_wan22_submission.yaml
Exit codes
benchmark from-config with an audit: block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError → 2, SetupError → 3, ExecutionError → 4). The on-disk audit_verdict.json is the durable record.
Testing
Unit + integration green; pre-commit run --all-files clean. The e2e test exercises the full audit: → run_audit → AuditVerdict flow for both max_throughput (offline) and concurrency=1 (single-stream).
Implementation notes (moved from the design doc)
File-by-file changes (against main)
- compliance/__init__.py: AuditTest protocol, AuditRunSpec, AuditRunStats, AuditResult, get_audit_test() registry
- compliance/result.py: AuditResult + atomic write_result (reference verify_OUTPUT_CACHING_TEST.txt wording + JSON)
- compliance/audit_test/__init__.py
- compliance/audit_test/output_caching_test.py: OutputCachingAudit.plan_runs (reference + audit specs) + verify_output_caching core
- commands/audit.py: run_audit loop (plan → validate-all → execute → verify → write)
- config/schema.py: AuditTestId, AuditConfig, audit: AuditConfig | None on BenchmarkConfig
- load_generator/sample_order.py: SampleOrderSpec + SingleSampleOrder; create_sample_order switches on the spec
- config/runtime_settings.py: sample_order: SampleOrderSpec (generic; default WITHOUT_REPLACEMENT)
- commands/benchmark/execute.py: run_spec seam in setup_benchmark; run_benchmark dispatches to run_audit
- examples/09_Wan22_VideoGen_Example/offline_wan22_submission.yaml
- examples/09_Wan22_VideoGen_Example/single_stream_wan22_submission.yaml
Requirements traceability
Covers every comment thread on PR #332 — the maintainer workflow threads (@nvzhihanj, @viraatc), both "Review Council" passes, and the Gemini robustness comments.
Maintainer workflow & example-config threads
run_audit generic loop; audit: block on benchmark from-config
type: submission YAML; run_benchmark runs perf [+acc], then run_audit additively (§5)
audit_samples allows independent counts (upstream TEST04 uses 5000/500 for SDXL) as an opt-in to shorten the audit phase; see §5.
samples (reference) and audit_samples (audit) are independent subsets; no full-dataset requirement
WITHOUT_REPLACEMENT, audit = SINGLE(sample_index) (MLPerf issue_same)
concurrency (single-stream) and max_throughput (offline)
poisson rejected up front (§4 step 3) — pacing caps throughput and masks caching
offline_wan22_submission.yaml / single_stream_wan22_submission.yaml); result artifacts use fixed verify_OUTPUT_CACHING_TEST.txt + JSON
num_workers hard-coded in example YAMLs; use default
pip, aiohttp) in the diff
Design-review findings (both Review Council passes)
ref_samples dead write / mismatched counts (high)
AuditRunSpec.n_samples honored via n_samples_to_issue (the bug was the reference count being silently dropped)
AuditTest abstraction; TEST04 hardcoded (high)
AuditTest protocol + get_audit_test registry; generic loop
DatasetType.AUDIT abstraction leak (high)
test04 boolean in RuntimeSettings/load-gen (high)
SampleOrderSpec; load-gen has no test knowledge
_OVERRIDE_TEST04_SAMPLE_INDEX stringly-typed kwarg (med)
run_spec seam
model_copy surgery; ref skips validation (med)
AuditRunSpec; validate all specs before any run
OutputCachingTestConfig), discriminated on test — each test carries only its own knobs
AuditConfig, not Dataset (med)
AuditConfig sub-model on BenchmarkConfig; Dataset untouched
verify_output_caching(AuditRunStats, AuditRunStats) core + a single AuditRunStats.from_report adapter
sample_index bound-checked late (low)
audit_config re-entrancy trap (critical)
audit=None; cannot re-enter run_audit
None; PASS/FAIL indistinguishable
run_audit returns a typed AuditResult; CLI exits 0 (PASS) / 1 (FAIL); errors via the standard handler (SetupError → 3, ExecutionError → 4)
write_result uses tmp → fsync → rename → fsync(parent)
setup_benchmark dir / config.yaml logic (med)
setup_benchmark; no recomputed report-dir
_audit_marker parsed twice in error path (low)
Robustness & API hygiene (Gemini + Review Council)
FileNotFoundError/ValueError; write outside try
from-config and errors propagate to main.py's handler
_audit_marker AttributeError on non-dict JSON
Report.from_snapshot KeyError/TypeError uncaught
ExecutionError
__all__
compliance/__init__.py __all__ exports the full public surface
Success criteria (goal-driven)
- benchmark from-config with an audit: block runs both phases back-to-back and writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json; PASS against a no-caching mock_http_echo_server, FAIL against a caching mock.
- the result (completed < requested × (1 − threshold) → FAIL), independent of the other phase's count.
- SingleSampleOrder always yields the configured index (bounds-checked); verify_output_caching PASS within threshold, FAIL above, boundary at the strict < line, slower-passes, custom threshold, and the completion guard trips; AuditRunStats.from_report raises on a None-duration or non-positive qps; OutputCachingAudit.plan_runs emits a reference spec at samples and an audit spec at audit_samples (which may differ).
- samples and the audit phase issues audit_samples (defaulting to samples when omitted), validation fires before any run, the typed result propagates (PASS/FAIL distinguishable), and a phase config never carries audit (no re-entry).
- poisson) load and an out-of-range sample_index are both rejected before any phase runs.
- AuditRunStats.from_report raises a clean ValueError on a report with no duration (qps is None) or non-positive throughput (qps <= 0); a phase whose Report.complete is False (metrics drain timeout / interrupt) aborts the audit with ExecutionError rather than certifying a result on partial data.
- grep -r test04 src/inference_endpoint/{load_generator,config/runtime_settings.py} returns nothing.
- pre-commit run --all-files clean (ruff / mypy / license headers).
🤖 Generated with Claude Code