Skip to content

Comments

Enable multiple runners for b200-multinode-slurm#771

Open
csahithi wants to merge 1 commit intomainfrom
sahithi/b200-multi-runner
Open

Enable multiple runners for b200-multinode-slurm#771
csahithi wants to merge 1 commit intomainfrom
sahithi/b200-multi-runner

Conversation

@csahithi
Copy link
Collaborator

No description provided.

Comment on lines +123 to +127
if [[ "${{ runner.name }}" == mi355x-amds* ]] || [[ "${{ runner.name }}" == b200-dgxc-slurm* ]]; then
echo "[Slurm] Cleaning up jobs with name: ${{ runner.name }} for user: $USER ..."
scancel -u "$USER" --name="${{ runner.name }}" || true
while [ -n "$(squeue -u "$USER" --name='${{ runner.name }}' --noheader --format='%i')" ]; do
squeue -u "$USER" --name="${{ runner.name }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csahithi is each runner on an separate linux username or on the same linux username?

on mi355, all 9 runners on are the same linux username this PR would break mi355.

+viz @cquil11

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change wouldn't break the existing implementation. All the runners are under the same linux username for B200 as well. It just adds an additional check to make sure we are checking for jobs with the runner name only under the current user, just to ensure jobs run with the same name by any other user are not disturbed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh gotcha! can u launch an test run to help validate that this works?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@functionstackx @cquil11 the test run is completed

@functionstackx
Copy link
Contributor

thanks!

@cquil11 can help review & merge this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants