DRAFT: feat(power): vendor-agnostic power/energy monitoring#383
DRAFT: feat(power): vendor-agnostic power/energy monitoring#383viraatc wants to merge 2 commits into
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces vendor-agnostic power and energy monitoring telemetry to benchmark runs. It launches a sidecar process during the performance phase to stream power samples from built-in sources (nvidia_smi, prometheus) or custom commands, subsequently writing the integrated energy metrics to a sibling power.json file and report.txt. The review feedback highlights several robustness and compatibility issues: potential uncaught exceptions in parse.py when handling malformed JSONL or non-integer CSV fields; a design bug in templates.py where headerless CSV commands fail with a KeyError; an incomplete counter-reset check in window.py that can miss intermediate resets; a Windows compatibility issue in collector.py due to platform-specific os functions; and a potential TypeError in prom_poll.py if a Prometheus series contains a null metric.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Note: this PR threads power start/stop hooks + a payload builder into the already-large |
daa231e to
5fd49ee
Compare
Add first-class, opt-in power telemetry collected during the performance phase and written to a sibling power.json (+ a report.txt section); the Report stays a pure snapshot-derived struct, mirroring the profiling precedent. Design (council-reviewed): - Sources: nvidia_smi and prometheus built-ins, plus a `command` escape hatch — any program that prints one sample/line plugs in with no core changes. The process boundary is the agnosticism boundary. - Collection: best-effort sidecar subprocess (own process group, SIGTERM->SIGKILL) so it never perturbs the hot path or fails the run. - Windowing: trace sliced to the PERFORMANCE phase via the shared monotonic->wall anchor; trapezoid integration (power_w) or counter delta (energy_j). - energy_per_output_token_j emitted only when token count and energy share the same window, else suppressed with a note (avoids mixed denominators). Config under settings.power; docs in docs/POWER_MONITORING.md; example in examples/11_Power_Monitoring/. Unit tests cover parsing, integration, consistency guard, status precedence, and an end-to-end sidecar. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5fd49ee to
8a5b43d
Compare
- parse.py _parse_jsonl: skip non-dict JSONL lines (a list/scalar would raise AttributeError on obj.get, escaping the except). - parse.py _parse_csv: wrap the field-index int() conversions in try/except so a bad mapping degrades to empty rather than raising in finalization. - window.py _energy_j: detect mid-run counter resets by summing non-negative adjacent deltas (last-minus-first missed an intermediate reset). - collector.py _signal_group: guard os.killpg/getpgid with hasattr for platforms without process groups (Windows). - prom_poll.py _series_label: handle a null metric dict (accept None, check first) to avoid TypeError on `key in None`. - test: add mid-run counter-reset regression test. (The templates.py headerless-command-CSV thread is obsolete: command sources are JSONL-only now; CSV is the built-in nvidia-smi source.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
First-class, opt-in, vendor-neutral power/energy monitoring. When
settings.power.sourceis set, a best-effort sidecar streams telemetry during the performance phase; at finalization the trace is windowed, integrated into energy, and written to a siblingpower.json(+ areport.txtsection).Report/result_summary.jsonstay snapshot-pure — mirroring the profiling precedent.Design: general framework + pluggable sources
The core
PowerConfigis vendor-neutral — it knows nothing about GPUs/Prometheus. A source is a tiny registered builder (@power_source("name")) that turns config into a sidecar argv + parse spec.nvidia_smiis just the first reference source; all NVIDIA-specific bits live in its builder, not the schema.source,interval_s,options(+ 3 hidden advanced). Per-source settings live inoptions(nvidia_smi: {gpu_indices},prometheus: {url, query},command: {argv}); each builder owns + validates its own keys.nvidia_smi,prometheus(DCGM exporter for remote GPUs), andcommand(any program emitting canonical JSONL — the process boundary is the plugin API for non-Python users).source=Noneserializes to{}.Robustness / correctness
subprocess.Popen(notServiceLauncher, which aborts the run on no-ready-signal); own process group, SIGTERM→SIGKILL,PR_SET_PDEATHSIGbackstop,stdbufline-buffering. Every path swallows errors — a broken collector never fails or perturbs the benchmark.power_w) or counter-delta (energy_j).energy_per_output_token_jemitted only when tokens and energy share the same window, else suppressed with a note (no mixed-denominator metric).Files
New
src/inference_endpoint/power/(sources.pyregistry,collector.py,parse.py,window.py,render.py,prom_poll.py);settings.powerinconfig/schema.py; integration incommands/benchmark/execute.py. Docs:docs/POWER_MONITORING.md; example:examples/11_Power_Monitoring/.Test plan
tests/unit/power/— config/serialization, plugin registration, unknown-source error, nvidia/JSONL parsing (+ malformed drops), trapezoid + counter integration, J/token consistency guard, status precedence, end-to-end sidecar.uv run pytest tests/unit/{power,config,commands}→ 231 passed;pre-commit run --all-filesgreen.Follow-ups (v2)
Multi-source totals + scope-aware double-count guards; entry-point auto-discovery (intentionally omitted — the in-process registry +
commandcover today's needs). Tracked alongside theexecute.pyrefactor in #384.🤖 Generated with Claude Code