feat(wan22): add VBench accuracy scorer for WAN 2.2 video generation by wu6u3tw · Pull Request #315 · mlcommons/endpoints

wu6u3tw · 2026-05-18T20:56:50Z

Summary

Adds VBenchScorer (scorer_id="vbench") that scores video-generation outputs on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and returns the mean of the per-dimension aggregates.
Runs VBench in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/ (vbench pins transformers==4.33.2 + numpy<2, incompatible with the parent env). VBenchScorer.score() shells out to vbench_runner.py via uv run --project, so the benchmark process never imports vbench.
VideoGenAdapter.decode_response now mirrors video_path into response_output so the event log carries it to the scorer.
Wires ScorerMethod.VBENCH into the config enum and adds offline_wan22_accuracy.yaml as a peer to the existing perf example.
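The "mean of the per-dimension aggregates" can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function name `mean_of_dimensions` is hypothetical, and the results layout (`{dimension: [aggregate, per_video_details]}`) is inferred from the `results[dim][0]` access quoted in the review below.

```python
# Sketch only: averages the per-dimension aggregate scores.
# Assumed results layout: {"<dimension>": [aggregate, per_video_details], ...}

DIMENSIONS = (
    "subject_consistency",
    "background_consistency",
    "motion_smoothness",
    "dynamic_degree",
    "appearance_style",
    "scene",
)


def mean_of_dimensions(results: dict, dimensions=DIMENSIONS) -> float:
    """Return the unweighted mean of the aggregate score of each dimension."""
    # Report every missing dimension at once instead of failing on the first.
    missing = [dim for dim in dimensions if dim not in results]
    if missing:
        raise KeyError(f"dimensions missing from VBench results: {missing}")
    per_dim_scores = [float(results[dim][0]) for dim in dimensions]
    return sum(per_dim_scores) / len(per_dim_scores)
```

Each dimension contributes equally to the final score, regardless of how many videos VBench scored for it.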

Test plan

Unit: tests/unit/evaluation/test_scoring.py covers registration, mean-of-6-dims with mocked subprocess.run, missing-subproject FileNotFoundError.
Unit: tests/unit/videogen/test_adapter.py + integration adapter test updated for the new response_output contract.
examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml validates as OfflineBenchmarkConfig.
Subproject uv lock resolves (115 packages).
pre-commit run --all-files clean.
End-to-end VBench run on a GPU host with uv sync in the accuracy subproject — not yet executed.

Draft for code review.

🤖 Generated with Claude Code

Scores generated videos on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and averages the per-dimension aggregates as the accuracy score. The MLPerf WAN 2.2 prompt set is a subset of the VBench standard prompt suite, so we use VBench's default evaluate() flow with its bundled prompt-to-dimension lookup.

VBench pins transformers==4.33.2 and numpy<2, both incompatible with the parent endpoints package (transformers==5.5.0, numpy==2.4.4). To keep the parent lockfile solvable and the accuracy environment reproducible, vbench lives in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/, with its own pyproject.toml and uv.lock. VBenchScorer.score() shells out to a vbench_runner.py script in that subproject via uv run --project, so the benchmark process never imports vbench.

VideoGenAdapter now mirrors video_path into response_output so the event log carries it to the scorer (event publishing only forwards response_output, not metadata).

Added offline_wan22_accuracy.yaml as a peer to the existing perf example, wired to eval_method: vbench.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
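The out-of-process pattern described in this commit message could look roughly like the sketch below. Only `uv run --project` and the script name `vbench_runner.py` come from the PR text. The helper names and the `--videos-dir`, `--output-dir`, and `--dimensions` flags are hypothetical placeholders, since the runner's actual arguments are not shown here.

```python
import subprocess
from pathlib import Path


def build_runner_command(subproject_dir: Path, videos_dir: Path,
                         output_dir: Path, dimensions: list[str]) -> list[str]:
    """Build the argv that runs the runner inside the subproject's environment."""
    return [
        # `uv run --project <dir>` uses that directory's pyproject.toml and uv.lock.
        "uv", "run", "--project", str(subproject_dir),
        "python", str(subproject_dir / "vbench_runner.py"),
        "--videos-dir", str(videos_dir),   # hypothetical flag
        "--output-dir", str(output_dir),   # hypothetical flag
        "--dimensions", *dimensions,       # hypothetical flag
    ]


def run_vbench(subproject_dir: Path, videos_dir: Path,
               output_dir: Path, dimensions: list[str]) -> None:
    """Run the scorer out of process so the caller never imports vbench."""
    # Fail early with a clear error if the isolated subproject is absent.
    if not (subproject_dir / "pyproject.toml").is_file():
        raise FileNotFoundError(f"no uv subproject found at {subproject_dir}")
    cmd = build_runner_command(subproject_dir, videos_dir, output_dir, dimensions)
    # check=True raises CalledProcessError if the runner exits non-zero.
    subprocess.run(cmd, check=True)
```

Because the parent process only exchanges file paths and a results file with the runner, the conflicting transformers and numpy pins never need to coexist in one environment.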

- VBenchScorer.score(): drop empty-output rows (failed queries whose record.data is None produce output="") before _stage_videos, so that Path("").resolve() never stages the repo cwd as a video and corrupts the run.
- _stage_videos: call dst.unlink(missing_ok=True) before creating the symlink, so re-scoring an existing report_dir is safe.
- vbench_runner.py: when --full-info-json is omitted, default to the VBench_full_info.json bundled in the vbench package. The default `vbench_standard` mode (required for scene + appearance_style) needs a real file path; None crashes inside VBench.
- offline_wan22_accuracy.yaml: the prerequisites comment now points to `uv sync` in the accuracy subproject instead of `pip install vbench`, which would have broken the parent lockfile (vbench pins transformers==4.33.2 and numpy<2).
- AGENTS.md VideoGen entry: replace the stale "switch to video_bytes for accuracy mode" claim with a description of VBenchScorer and the out-of-process subproject pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
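The first two fixes above can be illustrated with a small sketch. It assumes rows are dicts with an `output` key holding the video path, which is a hypothetical shape chosen for illustration. The helper names are likewise not taken from the PR.

```python
from pathlib import Path


def drop_empty_outputs(rows: list[dict]) -> list[dict]:
    """Keep only rows whose output is a non-empty path string."""
    # An empty string is dangerous here: Path("") behaves like Path("."),
    # so resolving it yields the current working directory, not a video.
    return [row for row in rows if row.get("output")]


def stage_video(src: Path, staged_dir: Path, name: str) -> Path:
    """Symlink src into staged_dir, replacing any link left by an earlier run."""
    dst = staged_dir / name
    # symlink_to raises FileExistsError if dst already exists, so remove it first.
    dst.unlink(missing_ok=True)
    dst.symlink_to(src.resolve())
    return dst
```

Filtering before staging means a failed query simply contributes no video, instead of silently pointing the scorer at the repository directory.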

3-stage progressive validation for the accuracy subproject:

1. vbench_runner.py arg plumbing + VBench bundled JSON resolution
2. VBenchScorer.score() against a hand-picked video subset
3. End-to-end inference-endpoint benchmark from-config

Captures the unit-test gap (mocks vbench entirely) so the next person can validate VBench API drift, prompt-suite coverage, and filename-convention mismatches before marking the PR ready.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-05-18T20:57:00Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces the VBenchScorer to evaluate video generation accuracy across six MLPerf dimensions. To manage conflicting dependencies, VBench is executed within an isolated uv subproject using a standalone runner script. The changes include a comprehensive runbook, configuration templates, and updates to the VideoGenAdapter to pass video paths to the scorer. Review feedback highlights the need for more robust error handling when parsing VBench results and ensuring that symlink creation is resilient to prompts containing directory separators.

gemini-code-assist · 2026-05-18T20:59:05Z

+        with results_path.open() as f:
+            results = msgspec.json.decode(f.read())
+
+        per_dim_scores: list[float] = [
+            float(results[dim][0]) for dim in self.dimensions
+        ]


This block can be improved in two ways:
1. Use read_bytes() for more efficient JSON decoding with msgspec.
2. Add explicit error handling for missing dimension keys in the results JSON. If VBench fails to score a dimension (e.g., due to prompt mismatch), a raw KeyError is raised, which is difficult to debug.

Suggested change

-        with results_path.open() as f:
-            results = msgspec.json.decode(f.read())
-
-        per_dim_scores: list[float] = [
-            float(results[dim][0]) for dim in self.dimensions
-        ]
+        results = msgspec.json.decode(results_path.read_bytes())
+        try:
+            per_dim_scores: list[float] = [
+                float(results[dim][0]) for dim in self.dimensions
+            ]
+        except KeyError as e:
+            raise KeyError(
+                f"Dimension {e} missing from VBench results at {results_path}. "
+                "Ensure prompts match VBench's standard suite for all dimensions."
+            ) from e

gemini-code-assist · 2026-05-18T20:59:05Z

+            dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
+            dst.unlink(missing_ok=True)
+            dst.symlink_to(src.resolve())


Prompts can contain characters like slashes (/) which are interpreted as directory separators in file paths. If a prompt contains a slash, dst.symlink_to will fail because the nested parent directory won't exist. Adding dst.parent.mkdir ensures the symlink can be created regardless of the prompt content.

Suggested change

-            dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
-            dst.unlink(missing_ok=True)
-            dst.symlink_to(src.resolve())
+            dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
+            dst.parent.mkdir(parents=True, exist_ok=True)
+            dst.unlink(missing_ok=True)
+            dst.symlink_to(src.resolve())
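To make the failure mode concrete, the sketch below shows how a slash in a prompt changes the staged path. The example prompts and the `staged_name` helper are hypothetical. Only the `f"{prompt}-{idx}{suffix}"` naming pattern comes from the diff above. `PurePosixPath` is used so the behaviour is the same on any platform.

```python
from pathlib import PurePosixPath


def staged_name(prompt: str, idx: int, suffix: str = ".mp4") -> str:
    """Mirror the file naming pattern used when staging videos."""
    return f"{prompt}-{idx}{suffix or '.mp4'}"


staged_dir = PurePosixPath("staged")

# A prompt without a slash stays directly inside the staging directory.
plain = staged_dir / staged_name("a cat on a beach", 0)

# A slash splits the prompt into a directory component and a file name,
# so the symlink's parent directory is no longer staged_dir itself.
nested = staged_dir / staged_name("a cat/dog hybrid", 0)

print(plain.parent)   # staged
print(nested.parent)  # staged/a cat
print(nested.name)    # dog hybrid-0.mp4
```

Creating the parent directory lets the symlink succeed, but the staged file name then no longer contains the full prompt. Whether VBench still matches such a file to its prompt is not established in this thread and would need checking.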

wu6u3tw and others added 3 commits May 15, 2026 10:02

wu6u3tw marked this pull request as ready for review May 18, 2026 20:57

wu6u3tw requested a review from a team May 18, 2026 20:57

gemini-code-assist Bot reviewed May 18, 2026

wu6u3tw requested review from arekay-nv and nv-alicheng May 19, 2026 18:01

wu6u3tw wants to merge 3 commits into mlcommons:main from wu6u3tw:feat/vbench-eval
