-
Notifications
You must be signed in to change notification settings - Fork 42
M-bridge Job ID extraction #783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughRelaxed the wrapper's strict exit behavior, capture wrapper stdout/stderr to files while streaming, record launcher exit codes, broaden job ID parsing to accept multiple formats, and add post-submission diagnostic log dumps and messages on failures. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Greptile OverviewGreptile SummaryEnhanced job ID extraction to handle multiple NeMo-Run output formats including case-insensitive matching and variable whitespace. The simplified regex pattern now properly matches all documented formats: Key improvements:
Confidence Score: 4/5
Important Files Changed
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 file reviewed, 1 comment
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 135-139: Update the Megatron-Bridge job-id extraction regex so it
allows variable whitespace between "Job" and "id": replace the grep pattern in
the string that builds JOB_ID (the line containing 'JOB_ID=$(grep -Eio "Job id[:
]+[0-9]+" "$LOG" | tail -n1 | grep -Eo "[0-9]+" | tail -n1 || true)') with a
pattern using POSIX whitespace (e.g., "Job[[:space:]]+id[: ]+[0-9]+") so formats
like "Job id: 694112" are matched while keeping case-insensitivity and the
rest of the pipeline unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 file reviewed, 1 comment
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No files reviewed, no comments
Summary
Currently M-Bridge relies on Nemo-Run for slurm execution. While installing Nemo-Run, we point it to main. There has been slight modification in how the Job ID is directed at stdout. We now support the following variations to make it robust to extract the job ID. Without this, it might seems like its a CloudAI Bug.
Test Plan
Additional Notes