Add report generation strategy for the MegatronRun #787
base: main
Conversation
Walkthrough: Adds `MegatronRunReportGenerationStrategy`: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.
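The walkthrough above can be sketched as a minimal parse-then-summarize pass. This is an illustrative reconstruction, not the actual CloudAI implementation; the regex, `parse_stdout`/`summarize` names, and the choice of statistics are assumptions for the sketch.

```python
import re
import statistics

# Hypothetical pattern mirroring Megatron-style progress lines; the real
# strategy's regex may differ in detail.
ITER_RE = re.compile(
    r"elapsed time per iteration \(ms\):\s*([0-9.]+)"
    r".*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9.]+)"
)

def parse_stdout(text: str) -> tuple[list[float], list[float]]:
    """Collect per-iteration times (ms) and per-GPU TFLOP/s from log text."""
    times: list[float] = []
    tflops: list[float] = []
    for line in text.splitlines():
        m = ITER_RE.search(line)
        if m:
            times.append(float(m.group(1)))
            tflops.append(float(m.group(2)))
    return times, tflops

def summarize(values: list[float]) -> dict[str, float]:
    """Basic statistics over the retained iterations."""
    return {
        "avg": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }
```

The real strategy additionally windows out warmup iterations and writes the summary rows to a CSV report.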
amaslenn left a comment:
We are in the code freeze stage now; this PR will be merged later.
src/cloudai/workloads/megatron_run/report_generation_strategy.py
```python
# Keep only the last 10 iterations for statistics (to exclude warmup)
if len(iter_times_ms) > 10:
    iter_times_ms = iter_times_ms[-10:]
```
I'm wondering whether taking the last 10 iterations is the most relevant choice. What if the training has some ups and downs (as I have already seen)? Maybe just skipping the warmup, say the first 20 iterations, is enough?
Yes, skipping the warmup stage makes more sense; I have updated the code to skip the first 20 iterations. Originally I was following the format in the Megatron-Bridge report. We may need to unify the formats for computing statistics later.
The last 10 iterations are what the GPU perf team uses. I have seen those runs on IB clusters, and towards the end they mostly remain stable.
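The trade-off debated in this thread (a fixed last-10 window versus skipping the first 20 warmup iterations) can be illustrated on a made-up series with a warmup spike; all numbers below are invented for illustration.

```python
# 35 iterations total: two slow warmup iterations, then steady state.
iter_times_ms = [500.0, 480.0] + [100.0] * 33

# GPU-perf-team style: keep only the last 10 iterations.
last_10 = iter_times_ms[-10:]

# Warmup-skip style (as updated in the PR): drop the first 20 iterations.
skip_warmup = iter_times_ms[20:] if len(iter_times_ms) > 20 else iter_times_ms

# Both windows exclude the warmup spike here, but the warmup-skip window
# retains more samples, averaging over mid-run ups and downs.
```

With mid-run instability, the two windows can disagree noticeably; on a stable IB-cluster run like the one described above, they converge.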
Actionable comments posted: 2
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)
30-121: Iteration-time parsing fails when TFLOP/s isn't logged.
`ITERATION_REGEX` requires TFLOP/s, so logs that only print iteration time will be ignored; `can_handle_directory()` then returns False and `get_metric("iteration-time")` returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.

🔧 Proposed fix
```diff
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

 def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
     """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-    iter_times_ms: list[float] = []
-    gpu_tflops: list[float] = []
+    records: list[tuple[float, float | None]] = []
     with log_path.open("r", encoding="utf-8", errors="ignore") as f:
         for line in f:
             m = ITERATION_REGEX.search(line)
             if m:
                 try:
-                    iter_times_ms.append(float(m.group(1)))
-                    gpu_tflops.append(float(m.group(2)))
+                    iter_time = float(m.group(1))
+                    tflops = float(m.group(2)) if m.group(2) is not None else None
+                    records.append((iter_time, tflops))
                 except (ValueError, TypeError):
                     logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))
     # Skip the first 20 iterations for statistics (to exclude warmup)
-    if len(iter_times_ms) > 20:
-        iter_times_ms = iter_times_ms[20:]
-        gpu_tflops = gpu_tflops[20:]
-    return iter_times_ms, gpu_tflops
+    if len(records) > 20:
+        records = records[20:]
+    iter_times_ms = [t for t, _ in records]
+    gpu_tflops = [g for _, g in records if g is not None]
+    return iter_times_ms, gpu_tflops
```
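As a quick sanity check of the proposed pattern, here is a standalone sketch (the sample log lines are invented) showing that making the TFLOP/s clause optional lets the regex match both line shapes:

```python
import re

# The pattern proposed in the review comment above, with the TFLOP/s clause
# wrapped in an optional non-capturing group so time-only lines still match.
ITERATION_REGEX = re.compile(
    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
    re.IGNORECASE,
)

# Invented sample lines: one with throughput logged, one without.
full = "elapsed time per iteration (ms): 812.4 | throughput per GPU (TFLOP/s/GPU): 471.9"
time_only = "elapsed time per iteration (ms): 812.4"

m1 = ITERATION_REGEX.search(full)       # both groups captured
m2 = ITERATION_REGEX.search(time_only)  # group 2 is None, match still succeeds
```

Because the second group can be `None`, callers must guard before converting it to `float`, which is exactly what the proposed `_extract` rewrite does.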
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_run/report_generation_strategy.py`:
- Around lines 139-167: The CSV currently writes zeros when `gpu_tflops` is empty. Change the branch that sets `tflops_avg`/`tflops_median`/`tflops_min`/`tflops_max`/`tflops_std` (used when writing via `writer` to `report_file` in report_generation_strategy.py) to instead set those values to empty strings (or the existing METRIC_ERROR sentinel used by `get_metric`) so the CSV shows missing TFLOP/s data explicitly rather than zeros. Keep the same `writer.writerow` call that writes the "tflops_per_gpu" row, but use the new empty/sentinel values when `gpu_tflops` is falsy.
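A minimal sketch of the suggested behavior, using a hypothetical `write_report` helper (the strategy's real writer code differs and computes more statistics):

```python
import csv
import io

def write_report(iter_stats: dict, tflops: list[float]) -> str:
    """Write summary rows; leave TFLOP/s cells blank when no data was parsed."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["metric", "avg", "min", "max"])
    writer.writerow(
        ["iteration_time_ms", iter_stats["avg"], iter_stats["min"], iter_stats["max"]]
    )
    if tflops:
        writer.writerow(
            ["tflops_per_gpu", sum(tflops) / len(tflops), min(tflops), max(tflops)]
        )
    else:
        # Blank cells make missing data explicit instead of misleading zeros;
        # a METRIC_ERROR sentinel string would work the same way.
        writer.writerow(["tflops_per_gpu", "", "", ""])
    return buf.getvalue()
```

Downstream consumers can then distinguish "no TFLOP/s logged" from "TFLOP/s measured as 0".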
In `@tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py`:
- Around lines 32-83: The tests currently (the `megatron_run_tr` and `megatron_run_tr_no_data` fixtures) only provide 3 iterations and do not exercise the warmup-skip behavior. Add a new fixture (e.g., `megatron_run_tr_with_warmup`) that builds a TestRun whose stdout.txt contains at least 21 iteration log lines formatted like the existing `stdout_content`, so the first 20 iterations are present and can be skipped; then add a corresponding test that uses this fixture to assert the report generation logic ignores the first 20 iterations (reference `MegatronRunTestDefinition`, `TestRun`, and the stdout.txt written in the fixtures).
tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py
```diff
 "agent": self.agent,
 "agent_steps": self.agent_steps,
-"agent_metrics": self.agent_metrics,
+"agent_metrics": self.agent_metrics if "agent_metrics" in self.model_fields_set else None,
```
That would change default values to None. Why is it needed?
If it is None here, then the definition of `agent_metrics` in the test config can propagate. Otherwise, if `agent_metrics` is not defined in the scenario config, the final merged config would always be `[default]` even though `agent_metrics` is set in the test config.
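This relies on Pydantic's `model_fields_set`, which records which fields were explicitly passed at construction, as opposed to filled from defaults. A minimal sketch, assuming pydantic v2 and a hypothetical `AgentConfig` model standing in for the real scenario config:

```python
from pydantic import BaseModel

class AgentConfig(BaseModel):
    # Stand-in for the scenario config's agent_metrics field and its default.
    agent_metrics: list[str] = ["default"]

# Explicitly set at init: the field name appears in model_fields_set.
explicit = AgentConfig(agent_metrics=["loss", "tflops"])

# Left to its default: the field name is absent from model_fields_set,
# so a merge step can emit None and let the test config's value win.
implicit = AgentConfig()
```

Checking `"agent_metrics" in self.model_fields_set` therefore distinguishes "the user set `[default]` on purpose" from "the default was never overridden", which is exactly what the merge logic needs.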
Summary
Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.
Also fixes the configuration overwrite behavior for `agent_metrics` in test and scenario TOML setups. Originally, if `agent_metrics` was set in the test configuration but not in the scenario configuration, `agent_metrics` would be overwritten by `[default,]`. Now `agent_metrics` from the test configuration is correctly propagated when `agent_metrics` is not set in the scenario configuration.

Test Plan
Adds the following to the test configurations:

`megatron_run_report.csv` would be generated, and `trajectory.csv` would reflect the observations and rewards.