
[feat] Support multi-cluster operation in Slurm backends#3639

Open
vkarak wants to merge 2 commits into reframe-hpc:develop from vkarak:feat/slurm-multi-cluster

Conversation

@vkarak
Contributor

@vkarak vkarak commented Mar 9, 2026

This PR introduces a new configuration option for Slurm backends, named slurm_multi_cluster_mode, that supports Slurm's multi-cluster operation. If it is not specified, nothing changes. If it is specified, the listed clusters are passed to Slurm's -M option. If set to ["all"], this is equivalent to -M all, and all clusters are queried.

Closes #3559.

@JimPaine Would you mind trying this PR with your setup?
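For reference, a minimal sketch of how the new option might look in a ReFrame partition configuration. The partition and cluster names below are placeholders, not taken from the PR; only the slurm_multi_cluster_mode key and its -M mapping are described by this PR:

```python
# Hypothetical partition entry; 'c1' and 'c2' are placeholder cluster names.
partition = {
    'name': 'multi',
    'scheduler': 'slurm',
    'launcher': 'local',
    'sched_options': {
        # Clusters passed to Slurm's -M option;
        # ['all'] would be equivalent to `-M all`.
        'slurm_multi_cluster_mode': ['c1', 'c2'],
    },
}

# Conceptually, the option maps to a -M argument like this:
m_option = '-M ' + ','.join(partition['sched_options']['slurm_multi_cluster_mode'])
print(m_option)  # -M c1,c2
```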

@codecov

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 92.30769% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.70%. Comparing base (1eee4f9) to head (dbfc43f).

Files with missing lines Patch % Lines
reframe/core/schedulers/slurm.py 91.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3639      +/-   ##
===========================================
- Coverage    91.70%   91.70%   -0.01%     
===========================================
  Files           62       62              
  Lines        13713    13724      +11     
===========================================
+ Hits         12576    12586      +10     
- Misses        1137     1138       +1     


@JimPaine
Contributor

JimPaine commented Mar 10, 2026

@vkarak I have pulled from your fork and can confirm it is polling the correct cluster.

Something that I think could improve the user experience would be to include it in the sbatch command as well. Currently I need to set the cluster twice, once for submission and once for job polling.

Here is a snippet of my partitions for the test I ran, you can see that I currently need to set it in access and slurm_multi_cluster_mode to be able to run the test.

                {
                    'name': 'cluster1',
                    'scheduler': 'slurm',
                    'launcher': 'local',
                    'environs': ['slurm_multi_cluster_mode'],
                    'access': ['-M tst1'],
                    'sched_options': {
                        'slurm_multi_cluster_mode': ['cluster1']
                    }
                },
                {
                    'name': 'cluster2',
                    'scheduler': 'slurm',
                    'launcher': 'local',
                    'environs': ['slurm_multi_cluster_mode'],
                    'access': ['-M tst2'],
                    'sched_options': {
                        'slurm_multi_cluster_mode': ['cluster2']
                    }
                }

@vkarak
Contributor Author

vkarak commented Mar 10, 2026

Something that I think could improve the user experience would be to include it in the sbatch command as well. Currently I need to set the cluster twice, once for submission and once for job polling.

Yes, that makes sense! I'll update the PR, so that the access options take multi-cluster mode into account.

@vkarak vkarak force-pushed the feat/slurm-multi-cluster branch from c30ab53 to dbfc43f on March 25, 2026
@vkarak
Contributor Author

vkarak commented Mar 25, 2026

Yes, that makes sense! I'll update the PR, so that the access options take multi-cluster mode into account.

I just updated it; now there is no need to pass the -M option explicitly. Let me know if that works fine for you, so that we can merge this.
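If I understand the update correctly, the two-partition example from earlier in this thread could presumably drop the explicit -M flags from access. A sketch under that assumption (the tst1 cluster name is taken from the earlier snippet):

```python
# Hypothetical simplified partition after the update: the -M flag is now
# derived from sched_options, so 'access' no longer needs to duplicate it.
partition = {
    'name': 'cluster1',
    'scheduler': 'slurm',
    'launcher': 'local',
    'access': [],  # previously needed '-M tst1' here
    'sched_options': {
        'slurm_multi_cluster_mode': ['tst1'],
    },
}

# No explicit -M option remains in the access list:
has_explicit_m = any(opt.startswith('-M') for opt in partition['access'])
print(has_explicit_m)  # False
```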

Contributor

@gppezzi gppezzi left a comment


Works on Alps Daint.

@github-project-automation github-project-automation bot moved this from Todo to In Progress in ReFrame Backlog Mar 26, 2026

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Slurm Scheduler doesn't support multi-cluster

3 participants