feat(wan22): add VBench accuracy scorer for WAN 2.2 video generation#315
wu6u3tw wants to merge 3 commits into
Scores generated videos on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and averages the per-dimension aggregates as the accuracy score. The MLPerf WAN 2.2 prompt set is a subset of the VBench standard prompt suite, so we use VBench's default evaluate() flow with its bundled prompt-to-dimension lookup.

VBench pins transformers==4.33.2 and numpy<2, both incompatible with the parent endpoints package (transformers==5.5.0, numpy==2.4.4). To keep the parent lockfile solvable and the accuracy environment reproducible, vbench lives in an isolated uv subproject under examples/09_Wan22_VideoGen_Example/accuracy/, with its own pyproject.toml and uv.lock. VBenchScorer.score() shells out to a vbench_runner.py script in that subproject via uv run --project, so the benchmark process never imports vbench.

VideoGenAdapter now mirrors video_path into response_output so the event log carries it to the scorer (event publishing only forwards response_output, not metadata).

Added offline_wan22_accuracy.yaml as a peer to the existing perf example, wired to eval_method: vbench.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
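The out-of-process pattern in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `build_runner_command` and the `--videos-dir` and `--output-json` flags are hypothetical names; only `uv run --project` and the `vbench_runner.py` script name come from the description above.

```python
from pathlib import Path


def build_runner_command(
    subproject: Path, videos_dir: Path, output_json: Path
) -> list[str]:
    """Build an argv list that runs vbench_runner.py inside the uv subproject.

    The flag names below are hypothetical placeholders for whatever the
    real runner script accepts.
    """
    # Fail early with a clear error if the isolated subproject is missing.
    if not (subproject / "pyproject.toml").is_file():
        raise FileNotFoundError(f"no uv subproject found at {subproject}")
    return [
        "uv", "run", "--project", str(subproject),
        "python", str(subproject / "vbench_runner.py"),
        "--videos-dir", str(videos_dir),      # hypothetical flag
        "--output-json", str(output_json),    # hypothetical flag
    ]
```

The resulting list could be handed to `subprocess.run(cmd, check=True)`. Because vbench is only imported inside the child process, its pinned dependencies never need to be installed in the parent environment.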
- VBenchScorer.score(): drop empty-output rows (failed queries whose record.data is None produce output="") before _stage_videos, so that Path("").resolve() never stages the repo cwd as a video and corrupts the run.
- _stage_videos: dst.unlink(missing_ok=True) before symlink, so re-scoring an existing report_dir is safe.
- vbench_runner.py: when --full-info-json is omitted, default to the VBench_full_info.json bundled in the vbench package. The default `vbench_standard` mode (required for scene + appearance_style) needs a real file path; None crashes inside VBench.
- offline_wan22_accuracy.yaml: prerequisites comment now points to `uv sync` in the accuracy subproject instead of `pip install vbench`, which would have broken the parent lockfile (vbench pins transformers==4.33.2 and numpy<2).
- AGENTS.md VideoGen entry: replace stale "switch to video_bytes for accuracy mode" claim with a description of VBenchScorer and the out-of-process subproject pattern.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
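The first fix above hinges on how `pathlib` treats an empty string. The sketch below shows the behavior and one way to filter such rows; the row layout (dicts with an `"output"` key) and the `drop_empty_outputs` helper are assumptions for illustration, not the scorer's actual data model.

```python
from pathlib import Path

# An empty path string is equivalent to ".", so resolving it yields the
# current working directory rather than raising an error.
empty_resolves_to_cwd = Path("").resolve() == Path.cwd().resolve()


def drop_empty_outputs(rows: list[dict]) -> list[dict]:
    """Keep only rows whose "output" field is a non-empty string."""
    return [row for row in rows if row.get("output")]


rows = [
    {"prompt": "a dog running", "output": "/tmp/videos/0.mp4"},
    {"prompt": "a cat sleeping", "output": ""},  # failed query
]
kept = drop_empty_outputs(rows)
```

Filtering before staging means a failed query is simply absent from the scored set instead of silently pointing at a directory.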
3-stage progressive validation for the accuracy subproject:

1. vbench_runner.py arg plumbing + VBench bundled JSON resolution
2. VBenchScorer.score() against a hand-picked video subset
3. End-to-end inference-endpoint benchmark from-config

Captures the unit-test gap (mocks vbench entirely) so the next person can validate VBench API drift, prompt-suite coverage, and filename-convention mismatches before marking the PR ready.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces the VBenchScorer to evaluate video generation accuracy across six MLPerf dimensions. To manage conflicting dependencies, VBench is executed within an isolated uv subproject using a standalone runner script. The changes include a comprehensive runbook, configuration templates, and updates to the VideoGenAdapter to pass video paths to the scorer. Review feedback highlights the need for more robust error handling when parsing VBench results and ensuring that symlink creation is resilient to prompts containing directory separators.
    with results_path.open() as f:
        results = msgspec.json.decode(f.read())

    per_dim_scores: list[float] = [
        float(results[dim][0]) for dim in self.dimensions
    ]
This block can be improved in two ways:

1. Use read_bytes() for more efficient JSON decoding with msgspec.
2. Add explicit error handling for missing dimension keys in the results JSON. If VBench fails to score a dimension (e.g., due to prompt mismatch), a raw KeyError is raised, which is difficult to debug.
Suggested change:

    - with results_path.open() as f:
    -     results = msgspec.json.decode(f.read())
    - per_dim_scores: list[float] = [
    -     float(results[dim][0]) for dim in self.dimensions
    - ]
    + results = msgspec.json.decode(results_path.read_bytes())
    + try:
    +     per_dim_scores: list[float] = [
    +         float(results[dim][0]) for dim in self.dimensions
    +     ]
    + except KeyError as e:
    +     raise KeyError(
    +         f"Dimension {e} missing from VBench results at {results_path}. "
    +         "Ensure prompts match VBench's standard suite for all dimensions."
    +     ) from e
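A variation on this suggestion is to check for every missing dimension up front, so that the error names all of them rather than only the first. The sketch below is an alternative, not the reviewer's suggestion or the PR's code: it uses the standard-library `json` module as a stand-in for msgspec, `mean_dimension_score` is a hypothetical helper, and the sample scores are made up. It assumes, as the snippet under review does, that each dimension's entry holds the aggregate score at index 0.

```python
import json
from statistics import fmean


def mean_dimension_score(results: dict, dimensions: list[str]) -> float:
    """Average the aggregate (index 0) score of each requested dimension."""
    # Collect every missing dimension so the error message is complete.
    missing = [dim for dim in dimensions if dim not in results]
    if missing:
        raise KeyError(f"dimensions missing from VBench results: {missing}")
    return fmean(float(results[dim][0]) for dim in dimensions)


# Made-up sample payload: [aggregate, per-video details] for each dimension.
sample = json.loads('{"scene": [0.5, []], "dynamic_degree": [1.0, []]}')
score = mean_dimension_score(sample, ["scene", "dynamic_degree"])
```

With the sample payload the two aggregates are 0.5 and 1.0, so `score` is 0.75.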
    dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
    dst.unlink(missing_ok=True)
    dst.symlink_to(src.resolve())
Prompts can contain characters like slashes (/) which are interpreted as directory separators in file paths. If a prompt contains a slash, dst.symlink_to will fail because the nested parent directory won't exist. Adding dst.parent.mkdir ensures the symlink can be created regardless of the prompt content.
Suggested change:

    - dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
    - dst.unlink(missing_ok=True)
    - dst.symlink_to(src.resolve())
    + dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
    + dst.parent.mkdir(parents=True, exist_ok=True)
    + dst.unlink(missing_ok=True)
    + dst.symlink_to(src.resolve())
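To make the effect of this suggestion concrete, the sketch below wraps the suggested lines in a hypothetical `stage_video` helper and runs them against a temporary directory with a made-up prompt containing a slash. It assumes a platform where creating symlinks is permitted.

```python
import tempfile
from pathlib import Path


def stage_video(staged_dir: Path, prompt: str, idx: int, src: Path) -> Path:
    """Symlink src into staged_dir under a prompt-derived name."""
    dst = staged_dir / f"{prompt}-{idx}{src.suffix or '.mp4'}"
    # A slash in the prompt makes dst a nested path; create its parent.
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Remove any link left over from an earlier scoring run.
    dst.unlink(missing_ok=True)
    dst.symlink_to(src.resolve())
    return dst


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "clip.mp4"
    src.write_bytes(b"")
    staged = root / "staged"
    dst = stage_video(staged, "a cat/dog chase", 0, src)
    nested = dst.parent != staged  # link lives in staged/"a cat"/
    # Staging the same video twice must not raise FileExistsError.
    is_link = stage_video(staged, "a cat/dog chase", 0, src).is_symlink()
```

Note that the link ends up one directory below `staged_dir`, so its file name no longer starts with the full prompt. Whether VBench's filename-based prompt lookup tolerates that layout is not something this sketch checks.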
Summary
- Adds `VBenchScorer` (`scorer_id="vbench"`), which scores video-generation outputs on six VBench dimensions (subject_consistency, background_consistency, motion_smoothness, dynamic_degree, appearance_style, scene) and returns the mean of the per-dimension aggregates.
- Runs vbench in an isolated `uv` subproject under `examples/09_Wan22_VideoGen_Example/accuracy/` (vbench pins `transformers==4.33.2` + `numpy<2`, incompatible with the parent env). `VBenchScorer.score()` shells out to `vbench_runner.py` via `uv run --project`, so the benchmark process never imports vbench.
- `VideoGenAdapter.decode_response` now mirrors `video_path` into `response_output` so the event log carries it to the scorer.
- Wires `ScorerMethod.VBENCH` into the config enum and adds `offline_wan22_accuracy.yaml` as a peer to the existing perf example.

Test plan

- `tests/unit/evaluation/test_scoring.py` covers registration, mean-of-6-dims with mocked `subprocess.run`, and the missing-subproject `FileNotFoundError`.
- `tests/unit/videogen/test_adapter.py` + integration adapter test updated for the new `response_output` contract.
- `examples/09_Wan22_VideoGen_Example/offline_wan22_accuracy.yaml` validates as `OfflineBenchmarkConfig`.
- `uv lock` resolves (115 packages).
- `pre-commit run --all-files` clean.
- `uv sync` in the accuracy subproject: not yet executed.

Draft for code review.
🤖 Generated with Claude Code