feat: Plug in warmup phase#305
Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by Codex + Claude | Depth: thorough
Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.
See inline comments for details. A grouped summary will follow as a separate top-level comment.
Review Council — Multi-AI Code ReviewReviewed by Codex + Claude | Depth: thorough | HEAD: Found 14 issues across 6 files. Inline comments posted in this review. Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude. 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Theme summary The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end:
🤖 Posted by |
- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling - Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings - Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload) - Fix multimodal salting to find first text part at any index (handles image-first prompts) - Log warning when input_tokens present without prompt (salting not possible) - Fix ruff-format CI failure in test_async_session.py - Move inline imports to top of test_benchmark.py (AGENTS.md compliance) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review Council — Multi-AI Code ReviewReviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64` Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix. Cross-reviewer note: Issues already addressed in the previous round (unseeded RNGs, 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
🤖 Posted by |
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
arekay-nv
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review (Round 2)
Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)
Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.
Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.
Review Council — Multi-AI Code Review (Round 2)Reviewed by Codex + Claude | Depth: thorough | HEAD: Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review. Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2. 🔴 Must Fix (high)
🟡 Should Fix (medium)
🔵 Consider (low)
Round 1 follow-up status
Theme summary (round 2)The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what The drain-safety claim in 🤖 Posted by |
nv-alicheng
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review (run #2)
Reviewed by: Codex + Claude | Depth: thorough
Found 12 new issues (deduped against the 31+ existing inline comments). Severity: 1 high, 2 medium, 9 low.
See the inline comments for details. A summary table is posted as a separate top-level comment after this review.
Review Council — Multi-AI Code Review (run #2)Reviewed by: Codex + Claude | Depth: thorough Found 12 new issues across 6 files (deduped against 31 existing inline comments). 🔴 Must Fix (high)Issues that produce incorrect benchmark results in normal usage.
🟡 Should Fix (medium)Real issues that trigger under specific conditions or design flaws that compound.
🔵 Consider (low)Valid improvements; could be follow-ups.
What we deduped
Notes on Codex coverage this runCodex returned only 2 findings on this run (both correctness-grade). The first reproduced an existing high-severity finding (input_tokens salting); the second (the |
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
|
Tested on example_08/q3vl and the warmup worked as expected and reduced starting overhead nicely. |
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Merging since most unresolved comments are ranked low. Will address remaining in followup once we start collecting numbers with warmup. |
Wires the warmup phase into the standard benchmark execution flow as a config-driven feature, replacing the old ENABLE_WARMUP=1 env-var gate.
Config — new WarmupConfig on Settings with CLI shorthands: --warmup, --warmup-requests, --warmup-salt, --warmup-drain, --warmup-seed.
Prompt salting — Dataset.with_salt(rng) returns a shallow copy that prepends a seeded [<16-hex>] prefix on every load_sample() call to defeat KV-cache reuse. Handles
plain-string, multimodal (image-first), and pre-tokenized prompts (warns when salting is impossible).
Execution — warmup is prepended as a PhaseType.WARMUP phase before perf. Independent seeded RNGs keep warmup scheduling reproducible and isolated from the perf phase. The global
max_duration_ms timer starts only when the perf phase begins, so warmup time isn't charged against the perf budget. Drain is now config-driven via PhaseConfig.drain_after rather
than hardcoded.
Session — warmup responses are suppressed from metrics (COMPLETE event publish and _on_sample_complete callback both gated on current phase). _drain_inflight is bounded at 240 s
to prevent hangs.
Test infrastructure — EchoServer accepts an optional sync/async request_handler; new mock_http_echo_server_factory fixture supports multiple servers with distinct handlers per
test.
Tests — integration tests for salt and drain behaviour (test_warmup.py), unit tests for Dataset.with_salt() (test_salted_dataset.py), and fixture tests
(test_http_mock_fixtures.py).
Type of change
Related issues
Testing
Checklist