feat(wan22): add VBench accuracy scorer for WAN 2.2 video generation#315

Open
wu6u3tw wants to merge 3 commits into
mlcommons:mainfrom
wu6u3tw:feat/vbench-eval

Collaborator

@wu6u3tw wu6u3tw commented May 18, 2026

Summary

  • Adds VBenchScorer (scorer_id="vbench") that scores video-generation outputs on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and returns the mean of the per-dimension aggregates.
  • Runs VBench in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/ (vbench pins transformers==4.33.2 + numpy<2, incompatible with the parent env). VBenchScorer.score() shells out to vbench_runner.py via uv run --project, so the benchmark process never imports vbench.
  • VideoGenAdapter.decode_response now mirrors video_path into response_output so the event log carries it to the scorer.
  • Wires ScorerMethod.VBENCH into the config enum and adds offline_wan22_accuracy.yaml as a peer to the existing perf example.
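
The out-of-process pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `build_runner_command`, `run_out_of_process`, and the `--videos-dir`, `--output-dir`, and `--dimensions` flags are hypothetical names chosen for the example. Only `uv run --project` and the `vbench_runner.py` filename come from the description.

```python
import subprocess
from pathlib import Path


def build_runner_command(
    project_dir: Path,
    videos_dir: Path,
    output_dir: Path,
    dimensions: list[str],
) -> list[str]:
    """Build a `uv run --project` command targeting the isolated subproject."""
    runner = project_dir / "vbench_runner.py"
    if not runner.is_file():
        # Fail early with a clear message when the subproject is missing.
        raise FileNotFoundError(f"VBench runner not found: {runner}")
    return [
        "uv", "run", "--project", str(project_dir),
        "python", str(runner),
        "--videos-dir", str(videos_dir),   # hypothetical flag
        "--output-dir", str(output_dir),   # hypothetical flag
        "--dimensions", *dimensions,       # hypothetical flag
    ]


def run_out_of_process(cmd: list[str]) -> None:
    """Run the command in a child process; the caller never imports vbench."""
    subprocess.run(cmd, check=True)
```

Because the parent process only builds an argument list and waits on a child process, the conflicting `transformers` and `numpy` pins stay confined to the subproject's own environment.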

Test plan

  • Unit: tests/unit/evaluation/test_scoring.py covers registration, the mean over six dimensions with a mocked subprocess.run, and the FileNotFoundError raised when the subproject is missing.
  • Unit: tests/unit/videogen/test_adapter.py + integration adapter test updated for the new response_output contract.
  • examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml validates as OfflineBenchmarkConfig.
  • Subproject uv lock resolves (115 packages).
  • pre-commit run --all-files clean.
  • End-to-end VBench run on a GPU host with uv sync in the accuracy subproject — not yet executed.

Draft for code review.

🤖 Generated with Claude Code

wu6u3tw and others added 3 commits May 15, 2026 10:02
Scores generated videos on six VBench dimensions (subject_consistency,
background_consistency, motion_smoothness, dynamic_degree,
appearance_style, scene) and averages the per-dimension aggregates as
the accuracy score. The MLPerf WAN 2.2 prompt set is a subset of
VBench's standard prompt suite, so we use VBench's default evaluate()
flow with its bundled prompt-to-dimension lookup.
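
As a rough worked example of the averaging step: the sketch below assumes each dimension's entry in the results mapping holds its aggregate at index 0 (which matches the `results[dim][0]` access in the reviewed code), followed by per-video details. The function name `accuracy_score` and the sample numbers are illustrative only.

```python
from statistics import fmean

DIMENSIONS = (
    "subject_consistency",
    "background_consistency",
    "motion_smoothness",
    "dynamic_degree",
    "appearance_style",
    "scene",
)


def accuracy_score(results: dict, dimensions=DIMENSIONS) -> float:
    """Average the per-dimension aggregates into a single accuracy score."""
    # Assumed layout: results[dim][0] is the aggregate for that dimension.
    per_dim = [float(results[dim][0]) for dim in dimensions]
    return fmean(per_dim)


# Made-up numbers, purely to show the arithmetic.
example = {dim: [0.5, []] for dim in DIMENSIONS}
example["scene"] = [0.8, []]
example["dynamic_degree"] = [0.2, []]
score = accuracy_score(example)  # (4 * 0.5 + 0.8 + 0.2) / 6
```

Each dimension carries equal weight in this scheme, regardless of how many videos contributed to its aggregate.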

VBench pins transformers==4.33.2 and numpy<2, both incompatible with
the parent endpoints package (transformers==5.5.0, numpy==2.4.4). To
keep the parent lockfile solvable and the accuracy environment
reproducible, vbench lives in an isolated uv subproject under
examples/09_Wan22_VideoGen_Example/accuracy/, with its own
pyproject.toml and uv.lock. VBenchScorer.score() shells out to a
vbench_runner.py script in that subproject via uv run --project,
so the benchmark process never imports vbench.

VideoGenAdapter now mirrors video_path into response_output so the
event log carries it to the scorer (event publishing only forwards
response_output, not metadata). Added offline_wan22_accuracy.yaml as
a peer to the existing perf example, wired to eval_method: vbench.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- VBenchScorer.score(): drop empty-output rows (failed queries whose
  record.data is None produce output="") before _stage_videos, so
  Path("").resolve() can no longer stage the repo cwd as a video and
  corrupt the run.
- _stage_videos: dst.unlink(missing_ok=True) before symlink, so
  re-scoring an existing report_dir is safe.
- vbench_runner.py: when --full-info-json is omitted, default to
  the VBench_full_info.json bundled in the vbench package. The
  default `vbench_standard` mode (required for scene + appearance_style)
  needs a real file path; None crashes inside VBench.
- offline_wan22_accuracy.yaml: prerequisites comment now points to
  `uv sync` in the accuracy subproject instead of `pip install vbench`,
  which would have broken the parent lockfile (vbench pins
  transformers==4.33.2 and numpy<2).
- AGENTS.md VideoGen entry: replace stale "switch to video_bytes for
  accuracy mode" claim with a description of VBenchScorer and the
  out-of-process subproject pattern.
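
The first two fixes in this list can be illustrated with a small sketch. It assumes rows are `(prompt, output_path)` pairs and that `idx` is simply the row's position. `drop_empty_outputs` and `stage_videos` are hypothetical stand-ins for the PR's helpers, not their real signatures.

```python
from pathlib import Path


def drop_empty_outputs(rows):
    """Keep only (prompt, output) rows whose output path is non-empty."""
    # Path("") is the current directory, so an empty output would otherwise
    # resolve to the cwd and be staged as if it were a video.
    return [(prompt, output) for prompt, output in rows if output]


def stage_videos(rows, staged_dir):
    """Symlink each video into staged_dir as '<prompt>-<idx><suffix>'."""
    staged_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for idx, (prompt, output) in enumerate(rows):
        src = Path(output)
        dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
        # Remove any link left by an earlier run so re-scoring is safe.
        dst.unlink(missing_ok=True)
        dst.symlink_to(src.resolve())
        staged.append(dst)
    return staged
```

Calling `stage_videos` twice on the same directory succeeds because the stale link is removed first; without the `unlink`, the second `symlink_to` would raise `FileExistsError`.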

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3-stage progressive validation for the accuracy subproject:
1. vbench_runner.py arg plumbing + VBench bundled JSON resolution
2. VBenchScorer.score() against a hand-picked video subset
3. End-to-end inference-endpoint benchmark from-config

Documents the unit-test gap (the unit tests mock vbench entirely) so
the next person can check for VBench API drift, prompt-suite coverage
gaps, and filename-convention mismatches before marking the PR ready.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@wu6u3tw wu6u3tw marked this pull request as ready for review May 18, 2026 20:57
@wu6u3tw wu6u3tw requested a review from a team May 18, 2026 20:57

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the VBenchScorer to evaluate video generation accuracy across six VBench dimensions. To manage conflicting dependencies, VBench is executed within an isolated uv subproject using a standalone runner script. The changes include a comprehensive runbook, configuration templates, and updates to the VideoGenAdapter to pass video paths to the scorer. Review feedback highlights the need for more robust error handling when parsing VBench results and for making symlink creation resilient to prompts that contain directory separators.

Comment on lines +1021 to +1026
with results_path.open() as f:
    results = msgspec.json.decode(f.read())

per_dim_scores: list[float] = [
    float(results[dim][0]) for dim in self.dimensions
]

high

This block can be improved in two ways:
1. Use read_bytes() for more efficient JSON decoding with msgspec.
2. Add explicit error handling for missing dimension keys in the results JSON. If VBench fails to score a dimension (e.g., due to prompt mismatch), a raw KeyError is raised, which is difficult to debug.

Suggested change
- with results_path.open() as f:
-     results = msgspec.json.decode(f.read())
- per_dim_scores: list[float] = [
-     float(results[dim][0]) for dim in self.dimensions
- ]
+ results = msgspec.json.decode(results_path.read_bytes())
+ try:
+     per_dim_scores: list[float] = [
+         float(results[dim][0]) for dim in self.dimensions
+     ]
+ except KeyError as e:
+     raise KeyError(
+         f"Dimension {e} missing from VBench results at {results_path}. "
+         "Ensure prompts match VBench's standard suite for all dimensions."
+     ) from e
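
One possible variation on this suggestion, sketched under assumptions: check for all missing dimensions up front so a single error names every gap, rather than stopping at the first. The sketch uses the standard-library `json` module in place of `msgspec` to stay self-contained, and `load_dimension_scores` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_dimension_scores(results_path: Path, dimensions) -> list[float]:
    """Read the results JSON and return one aggregate score per dimension."""
    # json.loads accepts bytes, so no explicit file handle or decode step.
    results = json.loads(results_path.read_bytes())
    missing = [dim for dim in dimensions if dim not in results]
    if missing:
        # Report every missing dimension at once, with the file location.
        raise KeyError(
            f"Dimensions {missing} missing from VBench results at {results_path}. "
            "Ensure prompts match VBench's standard suite for all dimensions."
        )
    return [float(results[dim][0]) for dim in dimensions]
```

Listing all missing keys in one message saves a rerun per dimension when several fail for the same underlying reason, such as a prompt mismatch.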

Comment on lines +948 to +950
dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
dst.unlink(missing_ok=True)
dst.symlink_to(src.resolve())

medium

Prompts can contain characters like slashes (/) which are interpreted as directory separators in file paths. If a prompt contains a slash, dst.symlink_to will fail because the nested parent directory won't exist. Adding dst.parent.mkdir ensures the symlink can be created regardless of the prompt content.

Suggested change
- dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
- dst.unlink(missing_ok=True)
- dst.symlink_to(src.resolve())
+ dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
+ dst.parent.mkdir(parents=True, exist_ok=True)
+ dst.unlink(missing_ok=True)
+ dst.symlink_to(src.resolve())
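
To make the failure mode concrete, here is a small sketch of how a slash in the prompt changes the destination path on a POSIX filesystem. `staged_path` and `stage_one` are hypothetical helpers for illustration, and the prompt text is made up.

```python
from pathlib import Path


def staged_path(staged_dir: Path, prompt: str, idx: int, suffix: str = ".mp4") -> Path:
    """Compute the destination path using the '<prompt>-<idx><suffix>' convention."""
    return staged_dir / f"{prompt}-{idx}{suffix}"


def stage_one(staged_dir: Path, prompt: str, idx: int, src: Path) -> Path:
    """Symlink src into staged_dir, creating nested parents when needed."""
    dst = staged_path(staged_dir, prompt, idx, src.suffix or ".mp4")
    # A '/' in the prompt makes dst.parent a subdirectory of staged_dir.
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.unlink(missing_ok=True)
    dst.symlink_to(src.resolve())
    return dst


# "a road/rail crossing" splits into directory "a road" and file "rail crossing-0.mp4".
nested = staged_path(Path("staged"), "a road/rail crossing", 0)
```

The `mkdir` call prevents the `FileNotFoundError`, but the video then lives one directory below `staged_dir`. Whether VBench's filename-based prompt lookup still matches such a file is a separate question that this thread does not settle.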

@wu6u3tw wu6u3tw requested review from arekay-nv and nv-alicheng May 19, 2026 18:01