Skip to content

Conversation

@amaslenn
Copy link
Contributor

Summary

  1. Better ssh port spec for reliability.
  2. Use host's network by default.
  3. More robust NCCL results parsing.

Test Plan

  1. CI
  2. Manual runs.

Additional Notes

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 21, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Added SSH configuration (port 2222) and host networking to the NCCL Kubernetes JSON generation; updated worker SSH startup and MPI launcher args to include SSH port; tightened performance report parsing to only accept data rows with exactly 13 fields.

Changes

Cohort / File(s) Summary
NCCL Kubernetes JSON generation
src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py
Added ssh_port property (2222); enabled hostNetwork: True for launcher and worker pod specs; exposed SSH port on worker container (containerPort: ssh_port, name ssh); changed worker SSH startup to install/configure SSH, set Port 2222, restart service, and sleep; updated MPI launcher args to include -p {ssh_port} and disable strict host key checking.
Performance report parsing
src/cloudai/workloads/nccl_test/performance_report_generation_strategy.py
Updated copyright year; _parse_data_rows now appends parsed rows only when the split line yields exactly 13 fields, filtering other digit-starting lines.
Tests
tests/json_gen_strategy/test_nccl_kubernetes_json_gen_strategy.py
Adjusted expected launcher command in tests to use dynamic ssh_port value in plm_rsh_args assertion.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I dig a tunnel to port two-two-two,
Host nets open wide, my hops now true,
SSH starts clean, MPI strings sing,
Thirteen fields line up — what joy they bring,
A rabbit cheers the tests anew!

🚥 Pre-merge checks | ✅ 1 | ❌ 1
❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Improvements for NCCL over k8s' is vague and does not clearly describe the specific changes made in the pull request. Use a more descriptive title that highlights the main changes, such as 'Add SSH port configuration and host networking for NCCL Kubernetes runs' or 'Enhance NCCL Kubernetes reliability with SSH port and host network configuration'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description check ✅ Passed The description is related to the changeset and outlines the three main improvements made, though it could be more detailed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py`:
- Around line 34-36: The ssh_port property is hardcoded to 2222; make it
configurable by adding an instance attribute (e.g., self._ssh_port) set from a
constructor argument or environment variable with a default of 2222, change the
ssh_port property to return that attribute, and update any callers/construction
sites of the class (the class in kubernetes_json_gen_strategy.py that defines
ssh_port) to pass a custom port when needed so pods can avoid host port
conflicts when hostNetwork is true.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 21, 2026

Greptile Summary

This PR improves NCCL test reliability on Kubernetes by implementing three key changes: SSH port specification for better connectivity, host networking by default for improved performance, and robust result parsing that filters malformed output rows.

Key changes:

  • Added explicit SSH port (2222) configuration as a property and integrated it into worker pod specs and MPI SSH arguments for reliable connections
  • Enabled hostNetwork: True on both launcher and worker pods to use the host's network stack directly
  • Modified NCCL result parsing to only process rows with exactly 13 fields, preventing crashes from malformed data
  • Changed SSH daemon initialization from exec /usr/sbin/sshd -D to service ssh restart; sleep infinity with enhanced logging (set -ex)
  • Updated copyright years to 2026
  • Updated tests to reflect the new SSH port parameter in MPI arguments

Confidence Score: 4/5

  • This PR is safe to merge with minor considerations around the SSH daemon approach
  • The changes are well-tested and address real reliability issues. The SSH port specification and robust parsing are solid improvements. Minor concerns exist around the SSH daemon restart approach (using service ssh restart instead of exec /usr/sbin/sshd -D) with hostNetwork: True, but this is likely intentional given the test plan mentions manual runs
  • Pay attention to src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py for the SSH daemon implementation approach

Important Files Changed

Filename Overview
src/cloudai/workloads/nccl_test/kubernetes_json_gen_strategy.py Adds SSH port configuration, host networking, and improved SSH setup reliability for NCCL tests
src/cloudai/workloads/nccl_test/performance_report_generation_strategy.py Adds validation to only parse NCCL result rows with exactly 13 fields, preventing malformed data
tests/json_gen_strategy/test_nccl_kubernetes_json_gen_strategy.py Updates test expectations to match new SSH port parameter in MPI arguments

Sequence Diagram

sequenceDiagram
    participant Launcher as MPIJob Launcher
    participant Worker1 as Worker Pod 1
    participant Worker2 as Worker Pod N
    participant SSH as SSH Daemon (port 2222)
    
    Note over Worker1,Worker2: Workers start with hostNetwork: True
    Worker1->>SSH: service ssh restart (port 2222)
    Worker1->>Worker1: sleep infinity
    Worker2->>SSH: service ssh restart (port 2222)
    Worker2->>Worker2: sleep infinity
    
    Note over Launcher: Launcher starts with hostNetwork: True
    Launcher->>Worker1: mpirun -mca plm_rsh_args '-p 2222'
    Launcher->>Worker2: mpirun -mca plm_rsh_args '-p 2222'
    
    Worker1->>Worker1: Execute NCCL test
    Worker2->>Worker2: Execute NCCL test
    
    Worker1->>Launcher: NCCL results (stdout)
    Worker2->>Launcher: NCCL results (stdout)
    
    Note over Launcher: Parse results (filter rows with != 13 fields)
    Launcher->>Launcher: Generate performance report
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Copy link

@alexmanle alexmanle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, AI did a thorough review.

@amaslenn amaslenn merged commit f5410ee into main Jan 21, 2026
6 checks passed
@amaslenn amaslenn deleted the am/nccl-k8s branch January 21, 2026 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants