
Conversation

@juntaowww (Contributor) commented Jan 22, 2026

Summary

Adds a new report generation strategy for the MegatronRun workload to extract iteration time and GPU TFLOP/s metrics from training logs. This enables Design Space Exploration (DSE) capabilities for MegatronRun workloads.

Also fixes the configuration overwrite behavior for agent_metrics in test and scenario TOML setups. Previously, if agent_metrics was set in the test configuration but not in the scenario configuration, it was overwritten by ["default"]. Now agent_metrics from the test configuration propagates correctly when the scenario configuration leaves it unset.
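
For concreteness, a hedged TOML illustration of the intended precedence (file names and values are hypothetical, not taken from the repository):

# test.toml
agent_metrics = ["tflops-per-gpu"]

# scenario.toml: agent_metrics deliberately left unset

Before the fix, the merged configuration fell back to ["default"]; after the fix, the test-level ["tflops-per-gpu"] survives the merge.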

Test Plan

Add the following to the test configuration:

agent_metrics = ["tflops-per-gpu"]
agent_reward_function = "identity"  # For maximizing TFLOP/s

# Or
# agent_metrics = ["iteration-time"] # For minimizing iteration time

After the run, megatron_run_report.csv is generated and trajectory.csv reflects the observations and rewards.

coderabbitai bot (Contributor) commented Jan 22, 2026

📝 Walkthrough

Adds MegatronRunReportGenerationStrategy: parses Megatron-Run stdout to extract per-iteration times and per-GPU TFLOP/s, computes statistics over recent iterations, writes a CSV report, exports and registers the strategy, and adds unit tests.
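
As a rough sketch of the statistics step (stdlib statistics only; the summarize helper is hypothetical, though the walkthrough and the sequence diagram below name mean, median, and pstdev):

import statistics

def summarize(values: list[float]) -> dict[str, float]:
    # Aggregate post-warmup per-iteration samples into the reported stats.
    return {
        "avg": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.pstdev(values),  # population std, per the sequence diagram
    }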

Changes

Report strategy implementation (src/cloudai/workloads/megatron_run/report_generation_strategy.py):
New class MegatronRunReportGenerationStrategy. Parses stdout.txt for iteration-time and tflops-per-gpu, skips initial iterations, computes avg/median/min/max/std, writes megatron_run_report.csv, exposes get_metric and helpers, and handles missing/invalid data.

Package exports (src/cloudai/workloads/megatron_run/__init__.py):
Exported MegatronRunReportGenerationStrategy added to the module namespace and __all__; copyright year updated.

Registration (src/cloudai/registration.py):
Imported and registered MegatronRunReportGenerationStrategy for MegatronRunTestDefinition via Registry().add_report(...); minor copyright year bump.

Models (src/cloudai/models/scenario.py):
TestRunModel.tdef_model_dump adjusted to emit agent_metrics only when present in the model's fields set (otherwise None).

Tests (tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py, tests/test_test_scenario.py):
New unit tests for the MegatronRun strategy: fixtures with stdout samples, tests for detection, CSV generation structure and contents, metric extraction (including invalid/no-data cases), and an updated test scenario including the new strategy.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐇 I nibble logs with whiskers keen,
find iteration times and TFLOP/s unseen.
I skip the first hops, then average the rest,
a neat little CSV to show which run's best.
Hooray — the rabbit reports pass the test! ✨

🚥 Pre-merge checks: ✅ 2 passed

Title check: ✅ Passed. The title accurately summarizes the main change: adding a report generation strategy for the MegatronRun workload.
Description check: ✅ Passed. The pull request description clearly explains the addition of a new MegatronRunReportGenerationStrategy for extracting metrics and fixing agent_metrics configuration behavior, directly aligned with the changeset.




greptile-apps bot (Contributor) commented Jan 22, 2026

Greptile Summary

Adds MegatronRunReportGenerationStrategy to extract iteration time and GPU TFLOP/s metrics from MegatronRun training logs, enabling Design Space Exploration (DSE) capabilities. The implementation follows the established pattern from MegatronBridgeReportGenerationStrategy with similar structure and error handling.

Key changes:

  • New report generation strategy parses stdout.txt for iteration metrics using regex pattern matching
  • Skips first 20 iterations to exclude warmup period (vs MegatronBridge's approach of keeping last 10)
  • Generates CSV reports with statistics (avg, median, min, max, std) for both iteration time (ms) and TFLOP/s per GPU
  • Supports three metric types: default, iteration-time, and tflops-per-gpu for DSE agent integration
  • Updates scenario.py to conditionally serialize agent_metrics only when explicitly set, preventing default values from being written
  • Comprehensive test coverage with multiple edge cases and validation scenarios
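
For illustration of the regex parsing mentioned above, a hedged sketch (the pattern matches the ITERATION_REGEX quoted later in the review thread; the sample log line is hypothetical):

import re

ITERATION_REGEX = re.compile(
    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
    r".*?"
    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
    re.IGNORECASE,
)

line = "iteration 21/100 | elapsed time per iteration (ms): 1234.5 | throughput per GPU (TFLOP/s/GPU): 456.7"
m = ITERATION_REGEX.search(line)
if m:
    iter_time_ms, tflops_per_gpu = float(m.group(1)), float(m.group(2))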

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation follows well-established patterns from MegatronBridge, has comprehensive test coverage including edge cases, properly handles errors, and all changes are additive without modifying existing functionality. The scenario.py change is a minor improvement to serialization logic that correctly uses Pydantic's model_fields_set.
  • No files require special attention

Important Files Changed

src/cloudai/workloads/megatron_run/report_generation_strategy.py:
Adds the new MegatronRunReportGenerationStrategy class to extract iteration time and TFLOP/s metrics from stdout.txt, following similar patterns to MegatronBridge. The implementation looks solid, with proper error handling and warmup iteration skipping.

tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py:
Comprehensive test coverage for the new report generation strategy, including the happy path, edge cases, and metric extraction validation.

src/cloudai/models/scenario.py:
Updates tdef_model_dump to serialize agent_metrics only if explicitly set, preventing default values from being serialized.

Sequence Diagram

sequenceDiagram
    participant User
    participant System
    participant Registry
    participant MegatronRunRGS as MegatronRunReportGenerationStrategy
    participant TestRun
    participant LogFile as stdout.txt
    participant ReportFile as megatron_run_report.csv

    User->>System: Execute MegatronRun workload
    System->>TestRun: Create TestRun instance
    TestRun->>LogFile: Generate stdout.txt with iteration metrics
    
    User->>System: Request report generation
    System->>Registry: Get report strategies for MegatronRunTestDefinition
    Registry-->>System: Return [CheckpointTimingRGS, MegatronRunRGS]
    
    System->>MegatronRunRGS: can_handle_directory()
    MegatronRunRGS->>LogFile: Open and search for ITERATION_REGEX
    LogFile-->>MegatronRunRGS: Pattern found/not found
    MegatronRunRGS-->>System: Return true/false
    
    alt Can handle directory
        System->>MegatronRunRGS: generate_report()
        MegatronRunRGS->>LogFile: Extract iteration times and TFLOP/s
        LogFile-->>MegatronRunRGS: Raw metrics data
        MegatronRunRGS->>MegatronRunRGS: Skip first 20 iterations (warmup)
        MegatronRunRGS->>MegatronRunRGS: Calculate statistics (mean, median, min, max, pstdev)
        MegatronRunRGS->>ReportFile: Write CSV with metrics
        ReportFile-->>MegatronRunRGS: Success
        
        User->>System: Request metric value
        System->>MegatronRunRGS: get_metric("iteration-time" or "tflops-per-gpu")
        MegatronRunRGS->>LogFile: Re-extract data
        MegatronRunRGS->>MegatronRunRGS: Calculate mean
        MegatronRunRGS-->>System: Return metric value
        System-->>User: Metric value for DSE
    else Cannot handle directory
        MegatronRunRGS-->>System: Cannot handle
        System->>Registry: Try next strategy
    end

@amaslenn (Contributor) left a comment:

We are in the code freeze stage now; this PR will be merged later.


# Keep only the last 10 iterations for statistics (to exclude warmup)
if len(iter_times_ms) > 10:
    iter_times_ms = iter_times_ms[-10:]
Contributor commented:

I'm wondering whether taking the last 10 iterations is the most relevant choice. What if the training has ups and downs (as I have already seen)? Maybe just skipping the warmup, say the first 20 iterations, is enough?

Contributor (Author) commented:

Yes, skipping the warmup stage makes more sense; I have updated it to skip the first 20 iterations. Originally this followed the format of the Megatron-Bridge report. We may need to unify the formats for computing statistics later.

Contributor commented:

The last 10 iterations are what the GPU perf team uses. I have seen those runs on IB clusters, and toward the end they mostly remain stable.

greptile-apps bot (Contributor) left a comment:

No files reviewed, no comments


coderabbitai bot (Contributor) left a comment:

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/megatron_run/report_generation_strategy.py (1)

30-121: Iteration-time parsing fails when TFLOP/s isn’t logged.

ITERATION_REGEX requires TFLOP/s, so logs that only print iteration time will be ignored; can_handle_directory() then returns False and get_metric("iteration-time") returns METRIC_ERROR despite valid data. Consider making TFLOP/s optional and skipping warmup by iteration count, not by list position.

🔧 Proposed fix
-ITERATION_REGEX = re.compile(
-    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
-    r".*?"
-    r"throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?)",
-    re.IGNORECASE,
-)
+ITERATION_REGEX = re.compile(
+    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
+    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
+    re.IGNORECASE,
+)

 def _extract(self, log_path: Path) -> tuple[list[float], list[float]]:
     """Extract iteration times (ms) and GPU TFLOPS from the log file."""
-    iter_times_ms: list[float] = []
-    gpu_tflops: list[float] = []
+    records: list[tuple[float, float | None]] = []
     with log_path.open("r", encoding="utf-8", errors="ignore") as f:
         for line in f:
             m = ITERATION_REGEX.search(line)
             if m:
                 try:
-                    iter_times_ms.append(float(m.group(1)))
-                    gpu_tflops.append(float(m.group(2)))
+                    iter_time = float(m.group(1))
+                    tflops = float(m.group(2)) if m.group(2) is not None else None
+                    records.append((iter_time, tflops))
                 except (ValueError, TypeError):
                     logging.debug("Failed to parse iteration metrics line: %s", line.rstrip("\n"))

     # Skip the first 20 iterations for statistics (to exclude warmup)
-    if len(iter_times_ms) > 20:
-        iter_times_ms = iter_times_ms[20:]
-        gpu_tflops = gpu_tflops[20:]
-    return iter_times_ms, gpu_tflops
+    if len(records) > 20:
+        records = records[20:]
+    iter_times_ms = [t for t, _ in records]
+    gpu_tflops = [g for _, g in records if g is not None]
+    return iter_times_ms, gpu_tflops
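
A quick illustrative check of the proposed optional TFLOP/s group (the pattern is copied from the diff above; the sample log line is hypothetical):

import re

PROPOSED_REGEX = re.compile(
    r"elapsed time per iteration \(ms\):\s*([0-9]+(?:\.[0-9]+)?)"
    r"(?:.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9]+(?:\.[0-9]+)?))?",
    re.IGNORECASE,
)

m = PROPOSED_REGEX.search("elapsed time per iteration (ms): 1234.5")
assert m is not None
assert m.group(1) == "1234.5"  # iteration time still parsed
assert m.group(2) is None      # TFLOP/s absent, reported as missing rather than an error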
🤖 Fix all issues with AI agents
In src/cloudai/workloads/megatron_run/report_generation_strategy.py:
- Around lines 139-167: The CSV currently writes zeros when gpu_tflops is empty. Change the branch that sets tflops_avg/tflops_median/tflops_min/tflops_max/tflops_std (used when writing via writer to report_file) to instead set those values to empty strings (or the existing METRIC_ERROR sentinel used by get_metric), so the CSV shows missing TFLOP/s data explicitly rather than zeros. Keep the same writer.writerow call that writes the "tflops_per_gpu" row, but use the new empty/sentinel values when gpu_tflops is falsy.

In tests/report_generation_strategy/test_megatron_run_report_generation_strategy.py:
- Around lines 32-83: The megatron_run_tr and megatron_run_tr_no_data fixtures only provide 3 iterations and do not exercise the warmup-skip behavior. Add a new fixture (e.g., megatron_run_tr_with_warmup) that builds a TestRun whose stdout.txt contains at least 21 iteration log lines formatted like the existing stdout_content, so that the first 20 iterations are present and can be skipped. Then add a corresponding test that uses this fixture to assert the report generation logic ignores the first 20 iterations (reference MegatronRunTestDefinition, TestRun, and the stdout.txt written in the fixtures).

"agent": self.agent,
"agent_steps": self.agent_steps,
"agent_metrics": self.agent_metrics,
"agent_metrics": self.agent_metrics if "agent_metrics" in self.model_fields_set else None,
Contributor commented:

That would change default values to None. Why is it needed?

Contributor (Author) commented:

If it is None here, the agent_metrics defined in the test config can propagate. Otherwise, when agent_metrics is not defined in the scenario config, the final merged config would always be [default] even though agent_metrics is set in the test config.
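
As a minimal Pydantic sketch of why model_fields_set distinguishes the two cases (the class name is a hypothetical stand-in, not the actual CloudAI model):

from pydantic import BaseModel

class ScenarioEntry(BaseModel):  # hypothetical stand-in for TestRunModel
    agent_metrics: list[str] = ["default"]

unset = ScenarioEntry()  # scenario TOML left agent_metrics out
print(unset.agent_metrics)                        # ['default'], the field default
print("agent_metrics" in unset.model_fields_set)  # False, so dump None and let the test config win

explicit = ScenarioEntry(agent_metrics=["iteration-time"])
print("agent_metrics" in explicit.model_fields_set)  # True, so the scenario value is kept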
