Update slurm examples to v1 #1901
examples/slurm/utils.py
Outdated
num_nodes: int,
gpus_per_node: int,
time_limit: str = "06:00:00",
python_exe: str = None,
In case the question comes up: we are using this in an upcoming AMD enablement to start the monarch bootstrap in a Docker container, so it would be great to have this optional flag in the example.
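To illustrate why an optional `python_exe` is handy for container launches, here is a minimal sketch. The helper `build_worker_command` and the module name passed to it are hypothetical and are not taken from `examples/slurm/utils.py`. The sketch assumes only that, when the flag is left unset, the current interpreter is the sensible default.

```python
import shlex
import sys
from typing import List, Optional


def build_worker_command(module: str, python_exe: Optional[str] = None) -> List[str]:
    """Build the argv used to start a worker module.

    Hypothetical helper for illustration only. `python_exe` may be a plain
    interpreter path or a wrapper such as "docker exec my_container python3".
    """
    if python_exe is None:
        # No override given: fall back to the interpreter running this script.
        launcher = [sys.executable]
    else:
        # Split a wrapper command into separate argv entries.
        launcher = shlex.split(python_exe)
    return launcher + ["-m", module]
```

With the flag unset the command starts with the local interpreter; with a container wrapper passed in, the same call site launches inside the container without further changes.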
amirafzali
left a comment
Left some comments, but looks good to me! Thank you for validating these examples with SlurmJob.
examples/slurm/utils.py
Outdated
appdef = hyperactor.host_mesh(
image=image,
meshes=[f"mesh0:{num_hosts}:{host_type}"], # mesh_name:num_hosts:host_type
def create_slurm_job(
At this point we could probably get rid of utils. It was necessary when the job setup was less intuitive, but SlurmJob should be usable standalone.
examples/slurm_allreduce.ipynb
Outdated
"monarch.tools.network 2025-08-29 21:02:45 INFO no AF_INET6 address that can bind TCP sockets for `gpu-queue-st-gpu-compute-2:26600` (error: [Errno -5] No address associated with hostname)\n",
"monarch.tools.network 2025-08-29 21:02:45 INFO resolved AF_INET address `10.0.2.132:26600` for `gpu-queue-st-gpu-compute-2:26600`\n",
"monarch._src.actor.allocator 2025-08-29 21:02:45 INFO initializing alloc on remote allocator addresses: ['tcp!10.0.2.236:26600', 'tcp!10.0.2.132:26600']\n"
"Found cached job at path: .monarch/job_state.pkl\n",
We can probably disable the cached job for these examples (pass it as None).
Seconded, please remove all outputs from the examples. Keep it to just the code to run.
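For anyone wanting to script the output cleanup, the following is a minimal sketch using only the standard library. It relies on the notebook file being JSON with a top-level `cells` list, where code cells carry `outputs` and `execution_count` fields. The function names are illustrative and not part of this repository.

```python
import json
from typing import Any, Dict


def strip_outputs(notebook: Dict[str, Any]) -> Dict[str, Any]:
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_notebook_file(path: str) -> None:
    """Rewrite the notebook at `path` with all code-cell outputs removed."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    strip_outputs(notebook)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(notebook, f, indent=1)
        f.write("\n")
```

Calling `strip_notebook_file("examples/slurm_allreduce.ipynb")` before committing would leave only the code and markdown cells. Running `jupyter nbconvert --clear-output --inplace` on the notebook should achieve the same result where Jupyter is installed.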
@amirafzali has imported this pull request. If you are a Meta employee, you can view this in D87157070.
@amirafzali merged this pull request in 199aff7.
Summary: This PR updates the slurm examples to use v1 api (SlurmJob + jobTrait)
Testplan: Tested on GB200 cluster, see notebook output
cc chriscai-amd
Pull Request resolved: meta-pytorch#1901
Reviewed By: dulinriley
Differential Revision: D87157070
Pulled By: amirafzali
fbshipit-source-id: 228ab64808604f74b554b6a7a8f50535bcb23964
This PR updates the Slurm examples to use the v1 API (SlurmJob + jobTrait).
Test plan:
Tested on a GB200 cluster; see the notebook output.
cc @chriscai-amd