M-bridge Job ID extraction #783

srivatsankrishnan · 2026-01-20T20:30:41Z

Summary

M-Bridge currently relies on NeMo-Run for Slurm execution, and we install NeMo-Run from its main branch. NeMo-Run recently made a slight change to how it prints the job ID to stdout. To make job ID extraction robust, we now support the following variations. Without this change, a failed extraction might look like a CloudAI bug.

- `Job id: 694112` (original format)
- `- Job id: 694112` (NeMo-Run format with a leading dash)
- `Job ID: 694112` (uppercase ID variant)
- Any of the above with varying amounts of whitespace
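The variations listed above can be sketched in Python as a single case-insensitive pattern. This is only an illustration of the matching idea, not the wrapper's actual code (the wrapper uses a `grep` pipeline in bash); the names `JOB_ID_RE` and `extract_job_id` are hypothetical, and taking the last match in the log is an assumption.

```python
import re
from typing import Optional

# Hypothetical pattern: "Job", one or more spaces/tabs, "id" (any case),
# then colons/spaces, then the numeric job ID captured in group 1.
JOB_ID_RE = re.compile(r"Job[ \t]+id[: ]+([0-9]+)", re.IGNORECASE)


def extract_job_id(log_text: str) -> Optional[str]:
    """Return the last job ID found in the log text, or None if there is none."""
    matches = JOB_ID_RE.findall(log_text)
    return matches[-1] if matches else None
```

Because the pattern is searched rather than anchored, a leading `- ` in front of `Job id` does not need special handling.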

Test Plan

CI/CD
real cluster

Additional Notes

coderabbitai · 2026-01-20T20:30:52Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Relaxed the wrapper's strict exit behavior, captured wrapper stdout/stderr to files while streaming, recorded launcher exit codes, broadened job ID parsing to accept multiple formats, and added post-submission diagnostic log dumps and messages on failures.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Wrapper script error handling & diagnostics `src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` | Removed `-e` from shell flags (now `set -uo pipefail`); added `WRAPPER_STDOUT`/`WRAPPER_STDERR` and process-substitution `tee` to capture wrapper output while streaming; introduced `LAUNCH_RC` to record non-zero launcher exit codes; expanded `JOB_ID` extraction to accept formats like `Job <id>` or `Job id` with flexible spacing; added post-submission diagnostic messages and `tail` of logs when launcher or submission fails. |
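As a rough illustration of the "capture while streaming, then record the exit code" behavior summarized above, here is a minimal Python sketch. It is not the wrapper itself, which is a bash script using `tee` and `LAUNCH_RC`; the function name `run_and_tee` is hypothetical, and merging stderr into stdout is a simplification of the separate `WRAPPER_STDOUT`/`WRAPPER_STDERR` files.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def run_and_tee(cmd: List[str], log_path: Path) -> int:
    """Run cmd, echo its output line by line, save a copy to log_path,
    and return the exit code instead of aborting on failure."""
    with log_path.open("w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # simplification: one combined stream
            text=True,
        )
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)  # keep streaming to the caller
            log.write(line)         # and keep a copy for later diagnostics
        return proc.wait()
```

A caller would keep the returned code (the analogue of `LAUNCH_RC`) and go on to inspect the log file even when the code is non-zero, which is the reason `-e` had to be dropped from the shell flags.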

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I hopped into the wrapper's lair,
I logged each whisper, stdout and err,
I noted when the launcher sighed,
and traced the job ids far and wide,
A quiet hop — I guard the chair.

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main focus of the changeset: improving Job ID extraction robustness for M-Bridge. |
| Description check | ✅ Passed | The description clearly explains the problem being solved (Job ID format variations from NeMo-Run) and lists the supported formats, directly relating to the changeset modifications. |

greptile-apps · 2026-01-20T22:26:30Z

Greptile Overview

Greptile Summary

Enhanced job ID extraction to handle multiple NeMo-Run output formats, including case-insensitive matching and variable whitespace. The simplified regex pattern now properly matches all documented formats: `Job id: XXXXX`, `- Job id: XXXXX`, `Job ID: XXXXX`, and variations with multiple spaces.

Key improvements:

- Simplified regex from `(^|[^a-zA-Z])Job id[: ]+[0-9]+` to `Job[[:space:]]+id[: ]+[0-9]+` for better robustness with NeMo-Run output variations
- Removed `-e` flag from bash `set` options to allow capturing launcher exit codes while continuing execution
- Added comprehensive logging: wrapper stdout/stderr now saved to separate files via `tee` for debugging
- Enhanced error handling: script now reports when job submission succeeds but launcher exits non-zero
- Better error diagnostics: exit codes and log tails included in failure messages

Confidence Score: 4/5

This PR is safe to merge with minimal risk
The changes improve the robustness of job ID extraction and error handling. The simplified regex pattern has been verified to match all documented NeMo-Run output formats. Removing `-e` from the shell's `set` flags (leaving `set -uo pipefail`) is necessary and correct for capturing exit codes. Enhanced logging and error messages improve debugging. The regex could in theory produce false positives such as `DebugJob id: 123`, but this is acceptable per the custom guidelines for MegatronBridge implementations, and NeMo-Run output is controlled and won't contain such patterns.
No files require special attention
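The trade-off between the two patterns can be checked with a small Python sketch. The patterns below are translations of the `grep -E` expressions quoted in the summary; applying case-insensitivity to both and using `[ \t]+` in place of `[[:space:]]+` are assumptions made so that the comparison isolates the boundary and whitespace differences.

```python
import re

# Python translations of the two grep -E patterns (flags are an assumption).
OLD_RE = re.compile(r"(^|[^a-zA-Z])Job id[: ]+[0-9]+", re.IGNORECASE)
NEW_RE = re.compile(r"Job[ \t]+id[: ]+[0-9]+", re.IGNORECASE)

SAMPLES = [
    "Job id: 694112",
    "- Job id: 694112",
    "Job   id: 694112",
    "DebugJob id: 123",  # the theoretical false positive mentioned above
]


def compare(samples):
    """Map each sample line to (old pattern matched, new pattern matched)."""
    return {s: (bool(OLD_RE.search(s)), bool(NEW_RE.search(s))) for s in samples}


for line, (old_hit, new_hit) in compare(SAMPLES).items():
    print(f"{line!r}: old={old_hit} new={new_hit}")
```

Under these assumptions, only the new pattern tolerates several spaces between `Job` and `id`, and only the new pattern matches the `DebugJob` line, because the old pattern required a non-letter (or line start) before `Job`.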

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py | Improved job ID extraction robustness with better error handling and logging; regex pattern simplified to support multiple job ID formats from NeMo-Run |

Sequence Diagram

sequenceDiagram
    participant CloudAI
    participant WrapperScript as Wrapper Script
    participant NeMoRun as Megatron-Bridge/NeMo-Run
    participant Slurm
    participant LogFiles as Log Files

    CloudAI->>WrapperScript: Execute wrapper script
    WrapperScript->>WrapperScript: Setup logging (tee stdout/stderr)
    WrapperScript->>LogFiles: Create launcher.log
    WrapperScript->>NeMoRun: Execute launcher command
    NeMoRun->>Slurm: Submit sbatch job
    Slurm-->>NeMoRun: Job submitted
    NeMoRun->>LogFiles: Write "Job id: XXXXX" (or variants)
    NeMoRun-->>WrapperScript: Exit (RC captured)
    WrapperScript->>LogFiles: Read launcher.log
    WrapperScript->>WrapperScript: Extract job ID using regex
    alt Job ID found
        alt Launcher exited non-zero
            WrapperScript->>CloudAI: Warning + tail log (stderr)
        end
        WrapperScript->>CloudAI: "Submitted batch job XXXXX"
    else Job ID not found
        WrapperScript->>CloudAI: Error message + exit code (stderr)
        WrapperScript->>CloudAI: tail log (stderr)
        WrapperScript->>CloudAI: Exit 1
    end
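The branching at the end of the diagram can be sketched as a pure Python function. This is an illustration under assumptions, not the wrapper's code: `WrapperResult`, `resolve_submission`, the `tail_lines` default, and the wording of the warning and error messages are all hypothetical; only the `Submitted batch job XXXXX` line and the exit code of 1 come from the diagram.

```python
import re
from dataclasses import dataclass, field
from typing import List

JOB_ID_RE = re.compile(r"Job[ \t]+id[: ]+([0-9]+)", re.IGNORECASE)


@dataclass
class WrapperResult:
    exit_code: int
    stdout: List[str] = field(default_factory=list)
    stderr: List[str] = field(default_factory=list)


def resolve_submission(log_text: str, launch_rc: int, tail_lines: int = 20) -> WrapperResult:
    """Decide what the wrapper reports, given the launcher log and exit code."""
    tail = log_text.splitlines()[-tail_lines:]
    ids = JOB_ID_RE.findall(log_text)
    if ids:
        result = WrapperResult(exit_code=0)
        if launch_rc != 0:
            # Submission worked, but the launcher itself exited non-zero.
            result.stderr.append(f"warning: launcher exited with code {launch_rc}")
            result.stderr.extend(tail)
        result.stdout.append(f"Submitted batch job {ids[-1]}")
        return result
    # No job ID in the log: report the exit code and the log tail, then fail.
    result = WrapperResult(exit_code=1)
    result.stderr.append(f"error: no job ID found (launcher exit code {launch_rc})")
    result.stderr.extend(tail)
    return result
```

Keeping the decision separate from the process handling makes each of the three outcomes (clean submission, submission with a launcher warning, and failure) easy to test without a cluster.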

greptile-apps

_{1 file reviewed, 1 comment}

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around lines 135-139: Update the Megatron-Bridge job-id extraction regex so it
allows variable whitespace between "Job" and "id": replace the grep pattern in
the string that builds JOB_ID (the line containing
'JOB_ID=$(grep -Eio "Job id[: ]+[0-9]+" "$LOG" | tail -n1 | grep -Eo "[0-9]+" | tail -n1 || true)')
with a pattern using POSIX whitespace (e.g., "Job[[:space:]]+id[: ]+[0-9]+") so
formats like "Job   id: 694112" are matched while keeping case-insensitivity and
the rest of the pipeline unchanged.

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

greptile-apps

_{1 file reviewed, 1 comment}

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

greptile-apps

_{No files reviewed, no comments}

make job_id extraction robust

a52fc51

Merge branch 'main' into m-bridge

137f2e7

srivatsankrishnan marked this pull request as ready for review January 20, 2026 22:23

srivatsankrishnan requested review from alexmanle, amaslenn and jeffnvidia as code owners January 20, 2026 22:23

greptile-apps bot reviewed Jan 20, 2026

simplify regex from redirected log

37c796d

coderabbitai bot reviewed Jan 20, 2026

greptile-apps bot reviewed Jan 20, 2026

white space check--coderabbit

0501fdd

greptile-apps bot reviewed Jan 20, 2026

alexmanle approved these changes Jan 20, 2026

srivatsankrishnan merged commit bd81345 into NVIDIA:main Jan 20, 2026
4 checks passed
