Labels
CI/CD (CI/CD infrastructure), P0 (High priority - Must do!), enhancement (Any code-related improvements)
Description
Summary
Enable multi-GPU CI testing using the newly available 2-GPU runners. This will allow us to rigorously test peer access, device switching, and other multi-GPU functionality on every PR.
Background
Per nv-gha-runners/enterprise-runner-configuration#258, we now have access to two multi-GPU runner types:
| Runner | GPUs | Architecture |
|---|---|---|
| nv-gpu-amd64-t4-2gpu | 2x T4 | Linux amd64 |
| nv-gpu-amd64-h100-2gpu | 2x H100 | Linux amd64 |
Motivation
Once this is in place, we can rigorously test:
- Peer access: `DeviceMemoryResource` peer access control, cross-device memory operations
- Device switching: Context management across multiple devices
- IPC with multiple devices: Inter-process communication scenarios involving different GPUs
- Other multi-GPU functionality: Any code paths that behave differently with multiple devices present
Currently, these scenarios can be tested only locally or sporadically. Adding multi-GPU CI ensures consistent coverage on every PR.
Implementation
Add multi-GPU test configurations to `ci/test-matrix.yml`. These can go in the existing `special_runners` section or in a new dedicated section.
Suggested configurations (1-2 jobs to start):
```yaml
special_runners:
  amd64:
    # Existing H100 single-GPU entries...
    # New multi-GPU entries:
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 't4-2gpu', DRIVER: 'latest' }
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 'h100-2gpu', DRIVER: 'latest' }
```

The workflow files may need updates to:
- Map the GPU names to the correct runner labels
- Ensure tests detect and utilize multiple GPUs
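
As a sketch of the first point, the GPU-name-to-runner-label mapping could be a simple string transformation. The label pattern below is inferred from the runner names in the table above and is an assumption; the real mapping lives in the workflow files and must match the labels actually registered for the runners:

```python
def runner_label(arch: str, gpu: str) -> str:
    """Build the self-hosted runner label for a test-matrix entry.

    Assumes labels follow the nv-gpu-<arch>-<gpu> pattern seen in the
    runner names above (e.g. nv-gpu-amd64-t4-2gpu); adjust if the
    registered labels differ.
    """
    return f"nv-gpu-{arch}-{gpu}"

# runner_label("amd64", "t4-2gpu")   -> "nv-gpu-amd64-t4-2gpu"
# runner_label("amd64", "h100-2gpu") -> "nv-gpu-amd64-h100-2gpu"
```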
Scope & Limitations
- Linux amd64 only: Both available runners are amd64. No multi-GPU coverage for:
  - Windows (no multi-GPU runners available)
  - Linux arm64 (no multi-GPU runners available)
- Per-PR execution: Tests are fast, so running unconditionally on every PR is acceptable
- Parallel execution: Multi-GPU jobs should run in parallel with existing CI jobs, not blocking the critical path
Tasks
- Add multi-GPU configurations to `ci/test-matrix.yml`
- Update workflow files to support multi-GPU runner selection
- Verify existing multi-GPU tests run correctly (peer access, IPC, etc.)
- Consider adding a pytest marker or environment variable to identify multi-GPU test runs
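
For the last task, a minimal sketch of the environment-variable approach. The `CUDA_CORE_MULTI_GPU` variable name is hypothetical — the workflow would export it for the 2-GPU jobs — and the `CUDA_VISIBLE_DEVICES` fallback only approximates a real device count:

```python
import os


def is_multi_gpu_run() -> bool:
    """Report whether this CI job is running on a multi-GPU runner.

    CUDA_CORE_MULTI_GPU is a hypothetical variable the workflow would
    export for the 2-GPU jobs; parsing CUDA_VISIBLE_DEVICES is a
    best-effort fallback (a real implementation would query the driver).
    """
    if os.environ.get("CUDA_CORE_MULTI_GPU") == "1":
        return True
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in visible.split(",") if d.strip()]) >= 2
```

Multi-GPU tests could then be gated with `pytest.mark.skipif(not is_multi_gpu_run(), reason="requires >= 2 GPUs")`.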