Improvements for NCCL over k8s #786
+ make ssh port usage more reliable
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough: Added SSH configuration (port 2222) and host networking to the NCCL Kubernetes JSON generation; updated worker SSH startup and MPI launcher args to include the SSH port; tightened performance report parsing to accept only data rows with exactly 13 fields.
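The parsing change described in the walkthrough (accept only data rows with exactly 13 fields) could look roughly like the sketch below. This is not the PR's actual code: the function name `parse_data_rows`, the constant `EXPECTED_FIELDS`, the comment-line handling, and the sample lines are all hypothetical and only illustrate the filtering idea.

```python
from typing import Iterable, List

# Assumption: a well-formed data row has exactly 13 whitespace-separated fields,
# as stated in the walkthrough. The constant name is hypothetical.
EXPECTED_FIELDS = 13


def parse_data_rows(lines: Iterable[str]) -> List[List[str]]:
    """Return only rows that split into exactly EXPECTED_FIELDS fields."""
    rows: List[List[str]] = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment/header lines (assumed to start with '#').
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        # Drop truncated or interleaved rows instead of failing on them.
        if len(fields) != EXPECTED_FIELDS:
            continue
        rows.append(fields)
    return rows


# Made-up sample input, for illustration only.
sample = [
    "# size count type redop root time algbw busbw #wrong time algbw busbw #wrong",
    "8 2 float sum -1 10.0 0.1 0.1 0 10.0 0.1 0.1 0",
    "16 4 float sum -1 10.0 0.2 0.2 0 10.0 0.2",  # truncated row, dropped
]
rows = parse_data_rows(sample)
```

Filtering on the field count, rather than raising on the first malformed line, keeps one garbled row from discarding an otherwise usable report.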
Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 1 passed, 1 failed (inconclusive)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py`:
- Around line 34-36: The ssh_port property is hardcoded to 2222; make it
configurable by adding an instance attribute (e.g., self._ssh_port) set from a
constructor argument or environment variable with a default of 2222, change the
ssh_port property to return that attribute, and update any callers/construction
sites of the class (the class in kubernetes_json_gen_strategy.py that defines
ssh_port) to pass a custom port when needed so pods can avoid host port
conflicts when hostNetwork is true.
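A minimal sketch of the suggestion above, assuming a constructor argument with an environment-variable fallback. The class name `JsonGenStrategySketch` and the variable `NCCL_K8S_SSH_PORT` are hypothetical placeholders, not names from the repository; only the default of 2222 and the `ssh_port` property come from the review comment.

```python
import os
from typing import Optional

DEFAULT_SSH_PORT = 2222  # default named in the review comment


class JsonGenStrategySketch:
    """Hypothetical stand-in for the class that defines ssh_port."""

    def __init__(self, ssh_port: Optional[int] = None) -> None:
        if ssh_port is None:
            # Hypothetical environment variable; falls back to the default.
            ssh_port = int(os.environ.get("NCCL_K8S_SSH_PORT", DEFAULT_SSH_PORT))
        if not 1 <= ssh_port <= 65535:
            raise ValueError(f"invalid SSH port: {ssh_port}")
        self._ssh_port = ssh_port

    @property
    def ssh_port(self) -> int:
        # Returns the configured value instead of a hardcoded constant.
        return self._ssh_port
```

With `hostNetwork` enabled, pods share the node's port space, so a caller could pass for example `JsonGenStrategySketch(ssh_port=2300)` on nodes where 2222 is already taken.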
Greptile Summary: This PR improves NCCL test reliability on Kubernetes by implementing three key changes: SSH port specification for better connectivity, host networking by default for improved performance, and robust result parsing that filters malformed output rows.
Confidence Score: 4/5
2 files reviewed, 3 comments
src/cloudai/workloads/nccl_test/performance_report_generation_strategy.py
alexmanle left a comment
Looks good, AI did a thorough review.