Conversation

@srivatsankrishnan srivatsankrishnan commented Jan 20, 2026

Summary

Currently M-Bridge relies on NeMo-Run for Slurm execution. When installing NeMo-Run, we point it to main. There has been a slight change in how the job ID is printed to stdout. We now support the following variations so that job ID extraction is robust. Without this, a failed extraction might look like a CloudAI bug.

  • Job id: 694112 (original format)
  • - Job id: 694112 (NeMo-Run format with a leading dash)
  • Job ID: 694112 (uppercase ID variant)
  • Any of the above with varying amounts of whitespace
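
A minimal sketch of a parser that accepts the variants listed above. It is a Python translation written for illustration only; the wrapper itself uses `grep`, and the names `JOB_ID_RE` and `extract_job_id` are hypothetical.

```python
import re
from typing import Optional

# Assumed Python equivalent of a tolerant pattern: case-insensitive,
# any whitespace between "Job" and "id", then colons/spaces, then digits.
JOB_ID_RE = re.compile(r"Job\s+id[: ]+([0-9]+)", re.IGNORECASE)


def extract_job_id(line: str) -> Optional[str]:
    """Return the numeric job ID found in one launcher output line, or None."""
    match = JOB_ID_RE.search(line)
    return match.group(1) if match else None


samples = [
    "Job id: 694112",      # original format
    "- Job id: 694112",    # NeMo-Run format with a leading dash
    "Job ID: 694112",      # uppercase ID variant
    "Job  id:   694112",   # extra whitespace
]
results = [extract_job_id(s) for s in samples]
print(results)
```

All four sample lines should yield the same ID, and a line without a job ID yields `None`.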

Test Plan

  • CI/CD
  • real cluster

Additional Notes

coderabbitai bot commented Jan 20, 2026

📝 Walkthrough

Relaxed the wrapper's strict exit behavior, captured wrapper stdout/stderr to files while streaming, recorded launcher exit codes, broadened job ID parsing to accept multiple formats, and added post-submission diagnostic log dumps and messages on failures.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Wrapper script error handling & diagnostics: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py | Removed -e from shell flags (now set -uo pipefail); added WRAPPER_STDOUT/WRAPPER_STDERR and process-substitution tee to capture wrapper output while streaming; introduced LAUNCH_RC to record non-zero launcher exit codes; expanded JOB_ID extraction to accept formats like Job <id> or Job id with flexible spacing; added post-submission diagnostic messages and tail of logs when launcher or submission fails. |
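
The capture-while-streaming and exit-code recording described above can be sketched in Python. This is an analogue for illustration, not the wrapper's shell code: it merges stderr into stdout for brevity (the wrapper keeps separate files), and `run_and_capture` and the stand-in launcher command are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_capture(cmd: list, log_path: Path) -> int:
    """Run cmd, stream its output to our stdout, mirror it to log_path,
    and return the exit code instead of aborting on failure."""
    with log_path.open("w") as log, subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)  # keep streaming to the caller (like tee)
            log.write(line)         # keep a copy for later parsing
    return proc.returncode


# Stand-in launcher (hypothetical): prints a job ID line, then exits non-zero.
fake_launcher = [sys.executable, "-c", "print('Job id: 694112'); raise SystemExit(3)"]
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "launcher.log"
    launch_rc = run_and_capture(fake_launcher, log_file)
    captured = log_file.read_text()
```

Recording the exit code rather than stopping at the first failure is what lets the later steps still look for a job ID in the log.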

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped into the wrapper's lair,
I logged each whisper, stdout and err,
I noted when the launcher sighed,
and traced the job ids far and wide,
A quiet hop — I guard the chair.

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main focus of the changeset: improving Job ID extraction robustness for M-Bridge. |
| Description check | ✅ Passed | The description clearly explains the problem being solved (Job ID format variations from NeMo-Run) and lists the supported formats, directly relating to the changeset modifications. |

@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review January 20, 2026 22:23

greptile-apps bot commented Jan 20, 2026

Greptile Overview

Greptile Summary

Enhanced job ID extraction to handle multiple NeMo-Run output formats including case-insensitive matching and variable whitespace. The simplified regex pattern now properly matches all documented formats: Job id: XXXXX, - Job id: XXXXX, Job ID: XXXXX, and variations with multiple spaces.

Key improvements:

  • Simplified regex from (^|[^a-zA-Z])Job id[: ]+[0-9]+ to Job[[:space:]]+id[: ]+[0-9]+ for better robustness with NeMo-Run output variations
  • Removed -e flag from bash set options to allow capturing launcher exit codes while continuing execution
  • Added comprehensive logging: wrapper stdout/stderr now saved to separate files via tee for debugging
  • Enhanced error handling: script now reports when job submission succeeds but launcher exits non-zero
  • Better error diagnostics: exit codes and log tails included in failure messages

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk
  • The changes improve robustness of job ID extraction and error handling. The simplified regex pattern has been verified to match all documented NeMo-Run output formats. Removing -e from the shell's set options (leaving set -uo pipefail) is necessary and correct for capturing exit codes. Enhanced logging and error messages improve debugging capabilities. While the regex could theoretically match false positives like DebugJob id: 123, this is acceptable per custom guidelines for MegatronBridge implementations, and the NeMo-Run output is controlled and won't contain such patterns.
  • No files require special attention
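
To make the trade-off between the two patterns concrete, here is a small comparison in Python. It rests on assumptions: `[[:space:]]` is translated to `\s`, the old pattern is taken to be case-sensitive, and the new one case-insensitive. `OLD` and `NEW` are illustrative names, not identifiers from the PR.

```python
import re

# Assumed Python equivalents of the two grep -E patterns quoted above.
OLD = re.compile(r"(^|[^a-zA-Z])Job id[: ]+[0-9]+")        # assumed case-sensitive
NEW = re.compile(r"Job\s+id[: ]+[0-9]+", re.IGNORECASE)    # assumed grep -i

lines = [
    "Job id: 694112",
    "- Job id: 694112",
    "Job ID: 694112",
    "Job   id: 694112",
    "DebugJob id: 123",   # the false positive mentioned in the review
]

# Map each line to (matched by OLD, matched by NEW).
comparison = {
    line: (bool(OLD.search(line)), bool(NEW.search(line))) for line in lines
}
for line, (old_hit, new_hit) in comparison.items():
    print(f"{line!r:22} old={old_hit} new={new_hit}")
```

Under these assumptions the new pattern accepts every documented variant, at the cost of also accepting a line where `Job` is the tail of a longer word.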

Important Files Changed

| Filename | Overview |
|---|---|
| src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py | Improved job ID extraction robustness with better error handling and logging; regex pattern simplified to support multiple job ID formats from NeMo-Run |

Sequence Diagram

sequenceDiagram
    participant CloudAI
    participant WrapperScript as Wrapper Script
    participant NeMoRun as Megatron-Bridge/NeMo-Run
    participant Slurm
    participant LogFiles as Log Files

    CloudAI->>WrapperScript: Execute wrapper script
    WrapperScript->>WrapperScript: Setup logging (tee stdout/stderr)
    WrapperScript->>LogFiles: Create launcher.log
    WrapperScript->>NeMoRun: Execute launcher command
    NeMoRun->>Slurm: Submit sbatch job
    Slurm-->>NeMoRun: Job submitted
    NeMoRun->>LogFiles: Write "Job id: XXXXX" (or variants)
    NeMoRun-->>WrapperScript: Exit (RC captured)
    WrapperScript->>LogFiles: Read launcher.log
    WrapperScript->>WrapperScript: Extract job ID using regex
    alt Job ID found
        alt Launcher exited non-zero
            WrapperScript->>CloudAI: Warning + tail log (stderr)
        end
        WrapperScript->>CloudAI: "Submitted batch job XXXXX"
    else Job ID not found
        WrapperScript->>CloudAI: Error message + exit code (stderr)
        WrapperScript->>CloudAI: tail log (stderr)
        WrapperScript->>CloudAI: Exit 1
    end
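
The branches in the diagram can be sketched as a pure function. This is an illustrative Python model, not the wrapper's shell code: the function name, the warning and error wording, and the `tail_lines` parameter are hypothetical; only the "Submitted batch job" line comes from the diagram.

```python
import re
from typing import List, Tuple

JOB_ID_RE = re.compile(r"Job\s+id[: ]+([0-9]+)", re.IGNORECASE)


def summarize_submission(
    log_text: str, launch_rc: int, tail_lines: int = 5
) -> Tuple[int, List[str], List[str]]:
    """Return (exit_code, stdout_lines, stderr_lines) for the wrapper's final report."""
    stdout: List[str] = []
    stderr: List[str] = []
    log_lines = log_text.splitlines()
    tail = log_lines[-tail_lines:]
    # Scan line by line, as grep would, and collect every job ID seen.
    ids = [m.group(1) for line in log_lines for m in JOB_ID_RE.finditer(line)]
    if ids:
        if launch_rc != 0:
            # Job was submitted even though the launcher reported a failure.
            stderr.append(f"WARNING: launcher exited with code {launch_rc}")
            stderr.extend(tail)
        stdout.append(f"Submitted batch job {ids[-1]}")
        return 0, stdout, stderr
    # No job ID: report the launcher exit code and the end of the log.
    stderr.append(f"ERROR: no job ID found (launcher exit code {launch_rc})")
    stderr.extend(tail)
    return 1, stdout, stderr


ok = summarize_submission("setup\n- Job id: 694112\n", launch_rc=0)
warned = summarize_submission("setup\nJob ID: 694112\n", launch_rc=2)
failed = summarize_submission("setup\nsbatch: error\n", launch_rc=1)
```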

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 135-139: Update the Megatron-Bridge job-id extraction regex so it
allows variable whitespace between "Job" and "id": replace the grep pattern in
the string that builds JOB_ID (the line containing 'JOB_ID=$(grep -Eio "Job id[:
]+[0-9]+" "$LOG" | tail -n1 | grep -Eo "[0-9]+" | tail -n1 || true)') with a
pattern using POSIX whitespace (e.g., "Job[[:space:]]+id[: ]+[0-9]+") so formats
like "Job   id: 694112" are matched while keeping case-insensitivity and the
rest of the pipeline unchanged.
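
For readers less familiar with the shell pipeline quoted above, the following is a rough Python rendering of each stage. It is a sketch under assumptions: `[[:space:]]` is approximated by `\s`, lines are scanned one at a time as grep would, and `last_job_id` is a hypothetical name.

```python
import re

PATTERN = re.compile(r"Job\s+id[: ]+[0-9]+", re.IGNORECASE)


def last_job_id(log_text: str) -> str:
    """Mimic: grep -Eio PATTERN "$LOG" | tail -n1 | grep -Eo "[0-9]+" | tail -n1 || true"""
    # grep -Eio: every case-insensitive match, scanning the log line by line.
    matches = [
        m.group(0) for line in log_text.splitlines() for m in PATTERN.finditer(line)
    ]
    if not matches:
        return ""  # no match: JOB_ID is left empty rather than failing
    # tail -n1: keep only the last match in the log.
    last = matches[-1]
    # grep -Eo "[0-9]+" | tail -n1: keep the last run of digits in that match.
    return re.findall(r"[0-9]+", last)[-1]


log = "starting\nJob id: 111\nretrying\n- Job   ID: 694112\ndone\n"
job_id = last_job_id(log)
print(job_id)
```

Taking the last match means a resubmission logged later in the same file wins over an earlier ID.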

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

@srivatsankrishnan srivatsankrishnan merged commit bd81345 into NVIDIA:main Jan 20, 2026
4 checks passed