Enable multi-GPU CI testing #1501

@Andy-Jost

Summary

Enable multi-GPU CI testing using the newly available 2-GPU runners. This will allow us to rigorously test peer access, device switching, and other multi-GPU functionality on every PR.

Background

Per nv-gha-runners/enterprise-runner-configuration#258, we now have access to two multi-GPU runner types:

Runner                  GPUs     Architecture
nv-gpu-amd64-t4-2gpu    2x T4    Linux amd64
nv-gpu-amd64-h100-2gpu  2x H100  Linux amd64

Motivation

Once this is in place, we can rigorously test:

  • Peer access: DeviceMemoryResource peer access control, cross-device memory operations
  • Device switching: Context management across multiple devices
  • IPC with multiple devices: Inter-process communication scenarios involving different GPUs
  • Other multi-GPU functionality: Any code paths that behave differently with multiple devices present

Currently, these scenarios are tested only locally or sporadically, if at all. Adding multi-GPU CI ensures consistent coverage on every PR.

Implementation

Add multi-GPU test configurations to ci/test-matrix.yml. These can be added to the existing special_runners section or a new dedicated section.

Suggested configurations (1-2 jobs to start):

special_runners:
  amd64:
    # Existing H100 single-GPU entries...
    # New multi-GPU entries:
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 't4-2gpu', DRIVER: 'latest' }
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 'h100-2gpu', DRIVER: 'latest' }

The workflow files may need updates to:

  1. Map the GPU names to the correct runner labels
  2. Ensure tests detect and utilize multiple GPUs
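As a rough illustration of the first point, the two runner names listed under Background can be rebuilt from the ARCH and GPU fields of a matrix entry. The sketch below assumes the pattern nv-gpu-<ARCH>-<GPU>, which matches the two multi-GPU runners above but is not confirmed for other runner types (existing labels may carry additional fields). The helper names runner_label and expected_gpu_count are hypothetical and do not exist in the repository.

```python
import re


def runner_label(entry: dict) -> str:
    """Build a runner label such as 'nv-gpu-amd64-t4-2gpu' from a matrix entry.

    Assumption: the label is 'nv-gpu-<ARCH>-<GPU>'. This reproduces the two
    multi-GPU runner names in this issue; other runners may differ.
    """
    return f"nv-gpu-{entry['ARCH']}-{entry['GPU']}"


def expected_gpu_count(gpu: str) -> int:
    """Return N for GPU names ending in '-<N>gpu'; assume 1 otherwise."""
    match = re.search(r"-(\d+)gpu$", gpu)
    return int(match.group(1)) if match else 1


# The two suggested entries from ci/test-matrix.yml, written as dicts.
entries = [
    {"ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
     "LOCAL_CTK": "1", "GPU": "t4-2gpu", "DRIVER": "latest"},
    {"ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
     "LOCAL_CTK": "1", "GPU": "h100-2gpu", "DRIVER": "latest"},
]

for entry in entries:
    print(runner_label(entry), expected_gpu_count(entry["GPU"]))
```

The same count could be exported to the test job so that tests can check they really see the number of devices the runner is supposed to provide.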

Scope & Limitations

  • Linux amd64 only: Both available runners are amd64. No multi-GPU coverage for:
    • Windows (no multi-GPU runners available)
    • Linux arm64 (no multi-GPU runners available)
  • Per-PR execution: Tests are fast, so running unconditionally on every PR is acceptable
  • Parallel execution: Multi-GPU jobs should run in parallel with existing CI jobs, not blocking the critical path

Tasks

  • Add multi-GPU configurations to ci/test-matrix.yml
  • Update workflow files to support multi-GPU runner selection
  • Verify existing multi-GPU tests run correctly (peer access, IPC, etc.)
  • Consider adding a pytest marker or environment variable to identify multi-GPU test runs
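For the last task, one possible shape is a conftest.py that registers a marker and skips marked tests when too few GPUs are present. This is only a sketch: the marker name multi_gpu, its min_devices argument, and the environment variable MULTI_GPU_COUNT are all hypothetical, and it assumes the CI job exports the device count itself instead of querying the driver.

```python
import os

# Hypothetical variable that a multi-GPU CI job would export, e.g. "2".
ENV_VAR = "MULTI_GPU_COUNT"


def available_gpus(environ=os.environ) -> int:
    """Read the advertised GPU count; fall back to 1 if unset or invalid."""
    try:
        return max(int(environ.get(ENV_VAR, "1")), 1)
    except ValueError:
        return 1


def pytest_configure(config):
    # Register the (hypothetical) marker so --strict-markers accepts it.
    config.addinivalue_line(
        "markers",
        "multi_gpu(min_devices=2): test needs at least min_devices GPUs",
    )


def pytest_collection_modifyitems(config, items):
    # Imported here so available_gpus() stays usable outside a pytest run.
    import pytest

    have = available_gpus()
    for item in items:
        marker = item.get_closest_marker("multi_gpu")
        if marker is None:
            continue
        need = marker.kwargs.get("min_devices", 2)
        if have < need:
            item.add_marker(
                pytest.mark.skip(reason=f"needs {need} GPUs, found {have}")
            )
```

A test would then opt in with @pytest.mark.multi_gpu (or @pytest.mark.multi_gpu(min_devices=2)), and single-GPU jobs would report it as skipped, not failed.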

Labels

CI/CD (CI/CD infrastructure), P0 (High priority - Must do!), enhancement (Any code-related improvements)