feat: Plug in warmup phase by arekay-nv · Pull Request #305 · mlcommons/endpoints

arekay-nv · 2026-05-05T01:42:47Z

Wires the warmup phase into the standard benchmark execution flow as a config-driven feature, replacing the old ENABLE_WARMUP=1 env-var gate.

Config — new WarmupConfig on Settings with CLI shorthands: --warmup, --warmup-requests, --warmup-salt, --warmup-drain, --warmup-seed.

Prompt salting — Dataset.with_salt(rng) returns a shallow copy that prepends a seeded [<16-hex>] prefix on every load_sample() call to defeat KV-cache reuse. Handles
plain-string, multimodal (image-first), and pre-tokenized prompts (warns when salting is impossible).
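
The salting step described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper names `make_salt` and `salt_prompt` are hypothetical, and the only details taken from this thread are the seeded RNG, the 16-hex-character salt, and the `"[{salt}] "` prefix shape.

```python
import random


def make_salt(rng: random.Random) -> str:
    """Return a 16-hex-character salt drawn from the given (seeded) RNG."""
    # 64 random bits format to exactly 16 zero-padded hex digits.
    return f"{rng.getrandbits(64):016x}"


def salt_prompt(prompt: str, rng: random.Random) -> str:
    """Prepend a bracketed salt so consecutive requests share no prompt prefix."""
    return f"[{make_salt(rng)}] {prompt}"


# Same seed, same salted prompt: the warmup workload stays reproducible.
first_run = salt_prompt("Summarise this text.", random.Random(42))
second_run = salt_prompt("Summarise this text.", random.Random(42))
assert first_run == second_run
```

Drawing the salt from a seeded `random.Random` rather than `os.urandom` is what keeps two runs of the same config identical while still giving each request a distinct prefix.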

Execution — warmup is prepended as a PhaseType.WARMUP phase before perf. Independent seeded RNGs keep warmup scheduling reproducible and isolated from the perf phase. The global
max_duration_ms timer starts only when the perf phase begins, so warmup time isn't charged against the perf budget. Drain is now config-driven via PhaseConfig.drain_after rather
than hardcoded.

Session — warmup responses are suppressed from metrics (COMPLETE event publish and _on_sample_complete callback both gated on current phase). _drain_inflight is bounded at 240 s
to prevent hangs.
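
A bounded drain of this kind might look like the sketch below. It assumes the drain waits on an `asyncio.Event` that is set once all in-flight requests have completed; the function name, the boolean return convention, and the constant name are illustrative only.

```python
import asyncio

DRAIN_TIMEOUT_S = 240.0  # bound quoted in the PR description


async def drain_inflight(
    drain_event: asyncio.Event, timeout_s: float = DRAIN_TIMEOUT_S
) -> bool:
    """Wait for in-flight requests to finish.

    Returns True if the event was set in time, False if the wait timed out.
    """
    try:
        await asyncio.wait_for(drain_event.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # A dropped request never sets the event; give up instead of hanging.
        return False
```

Returning a flag rather than raising lets the caller decide whether a timed-out warmup drain should abort the run or only log a warning.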

Test infrastructure — EchoServer accepts an optional sync/async request_handler; new mock_http_echo_server_factory fixture supports multiple servers with distinct handlers per
test.
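
One way to accept both sync and async handlers is to call the handler and await the result only when it is awaitable. The sketch below is an assumption about the shape of such a dispatcher, not the actual `EchoServer` code; `dispatch`, the `dict` payload type, and the alias definition are hypothetical.

```python
import asyncio
import inspect
from typing import Any, Awaitable, Callable, Optional, Union

# Hypothetical alias: a handler returns a value directly or an awaitable.
RequestHandler = Callable[[dict], Union[Any, Awaitable[Any]]]


async def dispatch(payload: dict, handler: Optional[RequestHandler] = None) -> Any:
    """Run the custom handler if one is given, otherwise echo the payload."""
    if handler is None:
        return payload  # default echo behaviour
    result = handler(payload)
    if inspect.isawaitable(result):
        result = await result  # async handler: wait for its coroutine
    return result


async def demo() -> list:
    """Exercise the default, sync, and async paths."""

    async def async_handler(p: dict) -> dict:
        await asyncio.sleep(0)
        return {"async": p["id"]}

    return [
        await dispatch({"id": 1}),
        await dispatch({"id": 1}, lambda p: {"sync": p["id"]}),
        await dispatch({"id": 1}, async_handler),
    ]
```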

Tests — integration tests for salt and drain behaviour (test_warmup.py), unit tests for Dataset.with_salt() (test_salted_dataset.py), and fixture tests
(test_http_mock_fixtures.py).

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

github-actions · 2026-05-05T01:42:54Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces a formal warmup phase to the benchmark execution process, replacing the previous environment variable-based implementation. Key changes include the addition of a WarmupConfig schema, a SaltedDataset wrapper that injects random salts into prompts to prevent KV-cache reuse, and a drain_after configuration to manage phase transitions. The EchoServer and testing fixtures were also updated to support custom request handling for new integration tests. Review feedback identifies missing imports for os and logging in dataset.py and notes that the SaltedDataset constructor signature is incompatible with its base class, which could break inherited class methods.

arekay-nv

Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough

Found 14 issues across 6 files: 0 critical, 3 high, 4 medium, 7 low.

See inline comments for details. A grouped summary will follow as a separate top-level comment.

arekay-nv · 2026-05-05T03:04:05Z

Review Council — Multi-AI Code Review

Reviewed by Codex + Claude | Depth: thorough | HEAD: 56258be

Found 14 issues across 6 files. Inline comments posted in this review.

Cross-reviewer agreement (boosted confidence): execute.py:365 (warmup RNG seeding) and dataset.py:467 (multimodal salt) were flagged by both Codex and Claude.

🔴 Must Fix (high)

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/dataset_manager/dataset.py`	414	api-contract	Claude	Default `datasets_dir` silently changed `datasets/` → `dataset_cache/` (orthogonal to warmup; risks losing existing caches)
2	`src/inference_endpoint/dataset_manager/dataset.py`	459	bug	Claude	`SaltedDataset.load_sample()` mutates `prompt` only, leaving `input_tokens` (used by SGLang adapter) unchanged → KV-cache reuse not actually prevented for tokenized datasets
3	`src/inference_endpoint/commands/benchmark/execute.py`	365	bug	Both	Warmup `rng_sched` / `rng_sample_index` use bare `random.Random()` (unseeded) → warmup sample order is nondeterministic across runs, breaking reproducibility

🟡 Should Fix (medium)

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/dataset_manager/dataset.py`	467	bug	Both	Multimodal salting only fires when `prompt[0]` is text; image-first prompts (common shape) silently skip salting
2	`src/inference_endpoint/dataset_manager/dataset.py`	444	bug	Claude	`SaltedDataset.data` snapshots `inner.data` at construction; later `inner.load(force=True)` desyncs the wrapper from the inner dataset
3	`src/inference_endpoint/commands/benchmark/execute.py`	357	design	Claude	`warmup_rt` built ad-hoc instead of via `dataclasses.replace`; load pattern hardcoded to `MAX_THROUGHPUT` (no way to warm up at the perf-phase QPS/concurrency)
4	`src/inference_endpoint/load_generator/session.py`	273	api-contract	Claude	`ENABLE_WARMUP=1` env-var escape hatch removed without a deprecation warning; existing scripts now silently no-op

🔵 Consider (low)

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/load_generator/session.py`	68	api-contract	Claude	`PhaseConfig.drain_after=True` default flips warmup drain semantics for direct callers (was effectively `False`)
2	`src/inference_endpoint/load_generator/session.py`	423	data-integrity	Claude	`drain_after=False` filter only covers `_on_sample_complete`; warmup `COMPLETE` events still publish to ZMQ → event log pollution
3	`src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py`	100	error-handling	Claude	Tokenizer load failure silently downgraded to `nullcontext()`; user-requested `--tokenizer` produces empty ISL/OSL/TPOT with only a warning
4	`src/inference_endpoint/dataset_manager/dataset.py`	464	design	Claude	`os.urandom(8)` salt ignores seeded RNG → salted prompts are nondeterministic even when the user sets `dataloader_random_seed`
5	`tests/unit/commands/test_benchmark.py`	471	design	Claude	`TestBuildPhases` uses lazy/inline imports throughout (~15 sites) — explicit AGENTS.md violation
6	`tests/integration/commands/test_warmup.py`	292	testing	Claude	`_DELAY = 0.15` is fragile for overlap-observation tests — risk of CI flakes; replace with explicit `asyncio.Event` synchronization
7	`tests/integration/commands/test_warmup.py`	65	testing	Claude	All warmup integration tests use `num_workers=1`; multi-worker warmup path (ZMQ ordering, salt distinctness) is uncovered
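
For the `_DELAY = 0.15` finding, the suggested `asyncio.Event` synchronization could be sketched like this. It is a self-contained toy rather than the real test: `observe_overlap` and `fake_request` are hypothetical stand-ins for the integration test and the echo-server handler.

```python
import asyncio


async def observe_overlap(n_requests: int) -> int:
    """Return the peak number of concurrently running fake requests."""
    all_started = asyncio.Event()
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        if in_flight == n_requests:
            all_started.set()  # the last arrival releases everyone
        await all_started.wait()  # hold the request open until overlap is certain
        in_flight -= 1

    # The timeout turns "requests never overlapped" into a failure, not a hang.
    await asyncio.wait_for(
        asyncio.gather(*(fake_request() for _ in range(n_requests))),
        timeout=5.0,
    )
    return peak
```

Because every request blocks until the last one has arrived, the observed overlap no longer depends on how long a sleep happens to take on a loaded CI machine.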

Theme summary

The warmup feature lands the structural plumbing well, but several correctness/reproducibility properties the YAML knobs claim to provide are not actually enforced end-to-end:

Reproducibility — warmup RNGs are unseeded (execute.py:365) and the salt uses os.urandom (dataset.py:464). Two runs of the same config produce different warmup workloads, indirectly affecting perf-phase reproducibility when warmup sizes the cache.
Salt completeness — SaltedDataset only handles plain-text prompts. Tokenized datasets (SGLang adapter via input_tokens) and image-first multimodal prompts silently skip salting, defeating warmup.salt=true for those paths.
API/back-compat surface — the ENABLE_WARMUP env var was removed without a deprecation warning; the new PhaseConfig.drain_after=True default flips warmup drain semantics for direct callers; datasets_dir default rename couples an orthogonal cache-directory change to this PR.
Test coverage — single-worker only; multi-worker ZMQ ordering and salt distinctness across worker processes are the riskiest parts of the architecture and are currently unexercised. The _DELAY = 0.15 overlap tests are also flake-prone.
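
The salt-completeness point for multimodal prompts can be illustrated with a short sketch. It assumes prompts are lists of content parts tagged with a `type` key (`"text"` or `"image_url"`); that shape, and the function name `salt_multimodal`, are assumptions made for illustration.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def salt_multimodal(parts: list[dict[str, Any]], salt: str) -> list[dict[str, Any]]:
    """Return a copy of `parts` with the salt prepended to the first text part."""
    salted = [dict(p) for p in parts]  # copy each part; the input stays untouched
    for part in salted:
        if part.get("type") == "text":
            part["text"] = f"[{salt}] {part['text']}"
            return salted
    # Image-only prompt: nothing to salt, so say so instead of silently skipping.
    logger.warning("no text part found; prompt left unsalted")
    return salted


image_first = [
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
    {"type": "text", "text": "Describe the image."},
]
salted = salt_multimodal(image_first, "0123456789abcdef")
assert salted[1]["text"] == "[0123456789abcdef] Describe the image."
```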

🤖 Posted by /review-council skill.

- Add `random_seed` field to WarmupConfig for deterministic warmup scheduling
- Use `dataclasses.replace` + warmup seed for warmup RuntimeSettings
- Fix SaltedDataset.data to be a property (avoids stale snapshot after inner reload)
- Fix multimodal salting to find first text part at any index (handles image-first prompts)
- Log warning when input_tokens present without prompt (salting not possible)
- Fix ruff-format CI failure in test_async_session.py
- Move inline imports to top of test_benchmark.py (AGENTS.md compliance)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

arekay-nv

test

arekay-nv · 2026-05-05T11:56:21Z

Review Council — Multi-AI Code Review

Reviewed by Claude (Codex CLI unavailable) | Depth: thorough | HEAD: `1143d64`

Found 7 new issues across 5 files. Inline comments posted for 4; 3 are out-of-diff or post-fix.

Cross-reviewer note: Issues already addressed in the previous round (unseeded RNGs, SaltedDataset.data snapshot, multimodal salting at index 0, inline imports, ruff-format) are not re-reported.

🔴 Must Fix (high)

#	File	Line	Category	Summary
1	`commands/benchmark/execute.py`	365	design	Inline ↑ After RNG-seed fix: `rng_sched` and `rng_sample_index` both seeded with same `random_seed` value — identical pseudo-random sequences for scheduler and sample selection
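
One common way to avoid identical streams is to derive a separate child seed for each RNG from a single parent RNG. The sketch below is an assumption about how that could be done, not the fix that landed; `derive_rngs` is a hypothetical helper.

```python
import random


def derive_rngs(seed: int) -> tuple[random.Random, random.Random]:
    """Build two RNGs with different streams that both depend on one seed."""
    parent = random.Random(seed)
    # Each child gets its own 64-bit seed drawn from the parent stream.
    rng_sched = random.Random(parent.getrandbits(64))
    rng_sample_index = random.Random(parent.getrandbits(64))
    return rng_sched, rng_sample_index


sched, sample_index = derive_rngs(42)
# The two streams no longer mirror each other...
assert [sched.random() for _ in range(4)] != [sample_index.random() for _ in range(4)]
# ...and the whole arrangement is still reproducible from the one configured seed.
again_sched, _ = derive_rngs(42)
assert again_sched.random() == derive_rngs(42)[0].random()
```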

🟡 Should Fix (medium)

#	File	Line	Category	Summary
2	`dataset_manager/dataset.py`	434	design	Inline ↑ `SaltedDataset` is registered in `Dataset.PREDEFINED` under `"SaltedDataset"` via `__init_subclass__`; instantiating via registry raises `TypeError` (incompatible constructor). Fix: `class SaltedDataset(Dataset, dataset_id=None)`
3	`dataset_manager/dataset.py`	478	bug	Inline ↑ Silent no-salt fallback for image-only multimodal prompts (all parts are `image_url`, no text part exists) — no `logger.warning` emitted, `salt=True` silently no-ops
4	`load_generator/session.py`	366	bug	`_drain_inflight` awaits `self._drain_event.wait()` with no timeout. If a warmup request is dropped (network error, server restart), the drain hangs indefinitely and the benchmark never starts. Only `SIGINT` unblocks it.
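
The registry issue in item 2 comes from `__init_subclass__` registering every subclass. A class keyword argument can opt a wrapper out, as in this toy sketch. The `register` keyword mirrors the `register=False` form reported later in this thread; `JsonlDataset` and the registry keyed by class name are illustrative assumptions.

```python
class Dataset:
    # Registry of concrete dataset classes, keyed by class name.
    PREDEFINED: dict[str, type] = {}

    def __init_subclass__(cls, register: bool = True, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if register:
            Dataset.PREDEFINED[cls.__name__] = cls


class JsonlDataset(Dataset):
    """Ordinary dataset: registered automatically."""


class SaltedDataset(Dataset, register=False):
    """Wrapper with a different constructor: kept out of the registry."""

    def __init__(self, inner: Dataset) -> None:
        self._inner = inner


assert "JsonlDataset" in Dataset.PREDEFINED
assert "SaltedDataset" not in Dataset.PREDEFINED
```

Note that in this sketch a subclass of `SaltedDataset` would register again unless it also passes `register=False`, since the keyword defaults to `True` per class.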

🔵 Consider (low)

#	File	Line	Category	Summary
5	`tests/unit/dataset_manager/test_salted_dataset.py`	143	testing	Inline ↑ `sd.data = inner.data` is redundant now that `data` is a property delegating to `_inner.data`
6	`config/schema.py`	413	api-contract	`random_seed: int = Field(0, ...)` default is implicitly deterministic; `Optional[int] = None` with "use fresh seed when None" semantics would better match `salt=True` cache-busting intent
7	`config/schema.py`	395	api-contract	`WarmupConfig` has no `@cyclopts.Parameter` annotations — warmup is YAML-only with no CLI flags. Undocumented limitation; users testing via CLI will find warmup silently inactive.

🤖 Posted by /review-council skill.

arekay-nv

Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)

Re-reviewing after the author's response commits. Most round-1 findings are addressed; this round focuses on new code (global timeout, warmup_random_seed, register=False) and incomplete fixes.

Found 4 new issues across 3 files: 0 critical, 1 high, 1 medium, 2 low. See inline comments — a tiered summary will follow.

arekay-nv · 2026-05-05T15:37:20Z

Review Council — Multi-AI Code Review (Round 2)

Reviewed by Codex + Claude | Depth: thorough | HEAD: c9e80b0 (was 56258be in round 1)

Re-review after the author's response commits. 4 new issues across 3 files. Inline comments in this review.

Cross-reviewer agreement (boosted confidence): execute.py:570 — global timeout semantic change — was the only finding both Codex and Claude flagged in round 2.

🔴 Must Fix (high)

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/commands/benchmark/execute.py`	570	api-contract	Both	Global timer makes warmup eat into the per-phase `max_duration_ms` budget; `_make_stop_check` still uses per-phase semantics → two clocks on the same configured value, accuracy phases configured for unbounded duration can be truncated

🟡 Should Fix (medium)

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/load_generator/session.py`	361	error-handling	Claude	`_drain_inflight` docstring claims a safety guarantee that only holds when `max_duration_ms is not None`; the default offline/online run still has unbounded drain on a hung request

🔵 Consider (low)

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/commands/benchmark/execute.py`	568	concurrency	Claude	`_on_global_timeout` has no `_done` guard; can fire during the cleanup gap between `session.run()` returning and `global_timeout_handle.cancel()` running
2	`tests/unit/commands/test_benchmark.py`	775	testing	Claude	`test_performance_sample_order_identical_with_and_without_warmup` builds two separate RuntimeSettings with fresh seeded RNGs — it doesn't actually exercise warmup-perturbing-perf-RNG; gives a false sense of coverage
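
The `_done` guard suggested for `_on_global_timeout` could take the following shape. This is a sketch under assumptions: `RunState`, `run_with_timer`, and the use of `asyncio.sleep` as a stand-in for `session.run()` are all hypothetical.

```python
import asyncio


class RunState:
    def __init__(self) -> None:
        self.done = False
        self.timed_out = False

    def on_global_timeout(self) -> None:
        # Guard: a timer firing after the run has finished must be a no-op.
        if self.done:
            return
        self.timed_out = True


async def run_with_timer(work_s: float, timeout_s: float) -> RunState:
    state = RunState()
    loop = asyncio.get_running_loop()
    handle = loop.call_later(timeout_s, state.on_global_timeout)
    try:
        await asyncio.sleep(work_s)  # stands in for session.run()
    finally:
        state.done = True  # set before any cleanup that might yield to the loop
        handle.cancel()
    return state
```

Setting the flag before cancelling the handle closes the window in which the callback could run between the session returning and the timer being cancelled.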

Round 1 follow-up status

✅ Fixed — execute.py:365 — warmup RNGs unseeded — now random.Random(warmup_random_seed)
✅ Fixed — dataset.py:467 — multimodal salt only first text — now scans for first text part at any index
✅ Fixed — dataset.py:444 — SaltedDataset.data snapshot — now @property delegating to _inner.data
✅ Fixed — test_benchmark.py:471 — lazy imports — imports moved to top
✅ Fixed — dataset.py:434 — registry pollution — now class SaltedDataset(Dataset, register=False)
🟡 Partial — dataset.py:459 — input_tokens not salted — warning added for input_tokens-without-prompt; common case (both present) still silently drops salt
🟡 Partial — execute.py:357 — ad-hoc warmup_rt — now uses dataclass_replace ✅ but LoadPattern(MAX_THROUGHPUT) is still hardcoded
⏸️ Unaddressed — dataset.py:414 — `datasets/` → `dataset_cache/` — still a silent breaking default change
⏸️ Unaddressed — session.py:273 — ENABLE_WARMUP env-var removed — no deprecation warning
⏸️ Unaddressed — session.py:68 — drain_after=True default flips warmup — direct callers get opposite semantics
⏸️ Unaddressed — session.py:423 — late warmup COMPLETE events still publish to ZMQ
⏸️ Unaddressed — metrics_aggregator/__main__.py:100 — tokenizer fail silently downgrades
⏸️ Unaddressed — dataset.py:464 — os.urandom salt ignores seeded RNG
⏸️ Unaddressed — test_warmup.py:292 — _DELAY=0.15 flake risk
⏸️ Unaddressed — test_warmup.py:65 — no multi-worker tests

Theme summary (round 2)

The biggest new concern is the global wall-clock timer added to bound the warmup drain. The mechanism works, but it silently changes what max_duration_ms means for any user who enables warmup: warmup time now eats into the perf-phase budget, and accuracy phases configured with None (unbounded) are truncated by the global cap. Either scope the timer to the perf phase, expose a separate warmup.max_duration_ms, or rename the field, but don't give the same configured value two meanings.

The drain-safety claim in _drain_inflight is also overstated: with the schema default (max_duration_ms=0 → None), no global timer is scheduled and the drain is unbounded — a single hung request can still hang the benchmark.

🤖 Posted by /review-council skill.

nv-alicheng

Review Council — Multi-AI Code Review (run #2)

Reviewed by: Codex + Claude | Depth: thorough

Found 12 new issues (deduped against the 31+ existing inline comments). Severity: 1 high, 2 medium, 9 low.

See the inline comments for details. A summary table is posted as a separate top-level comment after this review.

nv-alicheng · 2026-05-07T18:59:05Z

Review Council — Multi-AI Code Review (run #2)

Reviewed by: Codex + Claude | Depth: thorough

Found 12 new issues across 6 files (deduped against 31 existing inline comments).

🔴 Must Fix (high)

Issues that produce incorrect benchmark results in normal usage.

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/commands/benchmark/execute.py`	557	bug	Codex	Global `max_duration_ms` timer covers warmup + perf + accuracy combined; slow warmup eats into perf budget. Silent semantic shift from previous "perf-only" meaning.

🟡 Should Fix (medium)

Real issues that trigger under specific conditions or design flaws that compound.

#	File	Line	Category	Reviewer(s)	Summary
2	`src/inference_endpoint/async_utils/services/metrics_aggregator/__main__.py`	98	bug	Claude	`TokenizePool` partial-init failure leaks the executor's worker threads on retry.
3	`src/inference_endpoint/commands/benchmark/execute.py`	367	design	Claude	Warmup hardcoded to `MAX_THROUGHPUT` regardless of perf load pattern — surprising for online (Poisson/concurrency) runs since warmup exercises a different server code path than the steady-state pattern it's meant to warm up for.

🔵 Consider (low)

Valid improvements; could be follow-ups.

#	File	Line	Category	Reviewer(s)	Summary
4	`src/inference_endpoint/testing/echo_server.py`	36	api-contract	Claude	`RequestHandler` type alias doesn't allow `None`, but `_dispatch` relies on `None` for fall-through. Either widen the type or assert non-None.
5	`src/inference_endpoint/dataset_manager/dataset.py`	465	design	Claude	`SaltedDataset.load()` is a no-op — unloaded inner causes opaque `AssertionError` later. Either delegate or raise in `__init__`.
6	`src/inference_endpoint/dataset_manager/dataset.py`	480	data-integrity	Claude	Salt prefix `"[{salt}] "` adds ~5 token overhead — distorts ISL distribution between warmup and perf, especially for short prompts.
7	`src/inference_endpoint/config/schema.py`	428	api-contract	Claude	`WarmupConfig` lacks `cyclopts.Parameter` aliases — only verbose `--settings.warmup.*` flags, unlike sibling `runtime`/`load_pattern`.
8	`src/inference_endpoint/commands/benchmark/execute.py`	360	bug	Claude	`min_duration_ms=0` + `n_requests=None` falls back to "issue every dataset sample at MAX_THROUGHPUT" — undocumented expansion of warmup's intent.
9	`tests/unit/commands/test_benchmark.py`	411	testing	Claude	`test_defaults` doesn't assert the `warmup_random_seed=42` default; same gap in `test_all_flags_enabled` and `test_yaml_roundtrip`.
10	`tests/unit/commands/test_benchmark.py`	721	testing	Claude	`test_warmup_uses_independent_rng_instances` only checks `is not` identity; passes even if both RNGs were seeded identically.
11	`tests/unit/dataset_manager/test_salted_dataset.py`	109	testing	Claude	`test_salt_unique_across_different_indices` compares already-different prompts; passes even if salting is broken.
12	`tests/unit/dataset_manager/test_salted_dataset.py`	218	testing	Claude	No tests exercise the `input_tokens`-present branches of `SaltedDataset.load_sample` (the ones that produce silent token-ID misses for SGLang).
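
Items 10 and 11 both describe assertions that pass even when the behaviour under test is broken. Stronger checks might look like the sketch below. The helper names and the toy loader are hypothetical; they only illustrate comparing RNG output instead of object identity, and loading the same index twice instead of two different indices.

```python
import random
from typing import Callable


def streams_differ(a: random.Random, b: random.Random, n: int = 16) -> bool:
    """True when the next n draws of the two RNGs are not identical."""
    return [a.random() for _ in range(n)] != [b.random() for _ in range(n)]


def same_sample_salted_differently(load_sample: Callable[[int], str], index: int = 0) -> bool:
    """True when two loads of the same index yield different prompts."""
    return load_sample(index) != load_sample(index)


# Identity alone is too weak: distinct objects, identical streams.
rng_a, rng_b = random.Random(42), random.Random(42)
assert rng_a is not rng_b
assert not streams_differ(rng_a, rng_b)

# Toy salted loader: only the salt can make two loads of index 0 differ.
prompts = ["alpha", "beta"]
salt_rng = random.Random(0)


def toy_load_sample(i: int) -> str:
    return f"[{salt_rng.getrandbits(64):016x}] {prompts[i]}"


assert same_sample_salted_differently(toy_load_sample)
```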

What we deduped

Codex P2 (SaltedDataset doesn't salt input_tokens) — already covered by existing comment on dataset.py:467 ([Review Council — Claude] high · bug). Dropped.
Claude test_warmup.py:433 (max_concurrent > 5 flakiness) — already covered by existing comment on test_warmup.py:292 ([Review Council — Claude] low · testing), which names that exact test. Dropped.

Notes on Codex coverage this run

Codex returned only 2 findings on this run (both correctness-grade). The first reproduced an existing high-severity finding (input_tokens salting); the second (the max_duration_ms timing bug) is genuinely new and is the most impactful issue in this review. The remaining coverage came from Claude.

Victor49152 · 2026-05-19T04:23:33Z

Tested on example_08/q3vl and the warmup worked as expected and reduced starting overhead nicely.

viraatc

looks great, ty!

arekay-nv · 2026-05-19T22:56:02Z

Merging since most unresolved comments are ranked low. Will address remaining in followup once we start collecting numbers with warmup.

Plug in warmup phase

06a9237

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

github-actions Bot requested a review from nvzhihanj May 5, 2026 01:42

gemini-code-assist Bot reviewed May 5, 2026

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated

Comment thread src/inference_endpoint/dataset_manager/dataset.py Outdated

github-code-quality Bot found potential problems May 5, 2026

Comment thread src/inference_endpoint/dataset_manager/dataset.py Fixed

Missing changes.

56258be

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
