@mreso mreso commented Nov 15, 2025

This PR updates the Slurm examples to use the v1 API (SlurmJob + jobTrait).

Test plan:
Tested on a GB200 cluster; see the notebook output.

cc @chriscai-amd

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 15, 2025
@mreso mreso requested a review from amirafzali November 15, 2025 01:34
num_nodes: int,
gpus_per_node: int,
time_limit: str = "06:00:00",
python_exe: str = None,
Contributor Author

In case the question comes up: we plan to use this in an upcoming AMD enablement to start the Monarch bootstrap inside a Docker container, so it would be great to have this optional flag in the example.
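
A minimal sketch of how an optional `python_exe` override like this might be consumed, assuming the value can be a multi-word launcher command. The helper `resolve_python_exe` and the image name `my-image` are hypothetical and are not part of Monarch or of this PR:

```python
import shlex
import sys
from typing import List, Optional


def resolve_python_exe(python_exe: Optional[str] = None) -> List[str]:
    """Return the argv prefix used to launch Python on a node.

    Hypothetical helper for illustration only. With no override it falls
    back to the interpreter running this script; an override may be a
    multi-word command such as a container wrapper.
    """
    if python_exe is None:
        return [sys.executable]
    # shlex.split handles quoting the same way a POSIX shell would.
    return shlex.split(python_exe)


# Default: the current interpreter.
default_prefix = resolve_python_exe()

# Override: run Python inside a container (image name is a placeholder).
docker_prefix = resolve_python_exe("docker run --rm my-image python")
```

Keeping the default at `None` means existing callers are unaffected, while a containerized setup only has to pass one extra string.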

Member

@amirafzali amirafzali left a comment

Left some comments, but looks good to me! Thank you for validating these examples with SlurmJob.

appdef = hyperactor.host_mesh(
image=image,
meshes=[f"mesh0:{num_hosts}:{host_type}"], # mesh_name:num_hosts:host_type
def create_slurm_job(
Member

At this point we could probably get rid of utils. It was necessary when the job setup was less intuitive, but SlurmJob should be usable standalone.

"monarch.tools.network 2025-08-29 21:02:45 INFO no AF_INET6 address that can bind TCP sockets for `gpu-queue-st-gpu-compute-2:26600` (error: [Errno -5] No address associated with hostname)\n",
"monarch.tools.network 2025-08-29 21:02:45 INFO resolved AF_INET address `10.0.2.132:26600` for `gpu-queue-st-gpu-compute-2:26600`\n",
"monarch._src.actor.allocator 2025-08-29 21:02:45 INFO initializing alloc on remote allocator addresses: ['tcp!10.0.2.236:26600', 'tcp!10.0.2.132:26600']\n"
"Found cached job at path: .monarch/job_state.pkl\n",
Member

We can probably disable the cached job for these examples (pass it as None).

Contributor

Seconded. Please remove all outputs from the examples and keep just the code to run.
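
One way to do this is sketched below, assuming the examples are standard nbformat 4 `.ipynb` files. The helper names `strip_outputs` and `strip_file` are made up for illustration; `jupyter nbconvert --clear-output --inplace` should achieve the same result from the command line.

```python
import json
from pathlib import Path


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def strip_file(path: Path) -> None:
    """Rewrite an .ipynb file on disk without its outputs."""
    nb = json.loads(path.read_text(encoding="utf-8"))
    cleaned_text = json.dumps(strip_outputs(nb), indent=1) + "\n"
    path.write_text(cleaned_text, encoding="utf-8")


# Small in-memory demonstration with one markdown and one code cell.
demo = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print(1)"],
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["1\n"]}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}
cleaned = strip_outputs(demo)
```

Markdown cells are left untouched, so only the recorded run results (such as the log lines quoted above) disappear from the diff.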

meta-codesync bot commented Nov 15, 2025

@amirafzali has imported this pull request. If you are a Meta employee, you can view this in D87157070.

meta-codesync bot commented Nov 20, 2025

@amirafzali merged this pull request in 199aff7.

AlirezaShamsoshoara pushed a commit to AlirezaShamsoshoara/monarch that referenced this pull request Nov 20, 2025
Summary:
This PR updates the slurm examples to use v1 api (SlurmJob + jobTrait)

Testplan:
Tested on GB200 cluster, see notebook output

cc chriscai-amd

Pull Request resolved: meta-pytorch#1901

Reviewed By: dulinriley

Differential Revision: D87157070

Pulled By: amirafzali

fbshipit-source-id: 228ab64808604f74b554b6a7a8f50535bcb23964