Enable multi-GPU CI testing

## Summary

Enable multi-GPU CI testing using the newly available 2-GPU runners. This will allow us to rigorously test peer access, device switching, and other multi-GPU functionality on every PR.

## Background

Per [nv-gha-runners/enterprise-runner-configuration#258](https://github.com/nv-gha-runners/enterprise-runner-configuration/pull/258), we now have access to two multi-GPU runner types:

| Runner | GPUs | Platform |
|--------|------|----------|
| `nv-gpu-amd64-t4-2gpu` | 2x T4 | Linux amd64 |
| `nv-gpu-amd64-h100-2gpu` | 2x H100 | Linux amd64 |

## Motivation

Once this is in place, we can rigorously test:
- **Peer access**: `DeviceMemoryResource` peer access control, cross-device memory operations
- **Device switching**: Context management across multiple devices
- **IPC with multiple devices**: Inter-process communication scenarios involving different GPUs
- **Other multi-GPU functionality**: Any code paths that behave differently with multiple devices present

Today these scenarios are exercised only locally or sporadically. Adding multi-GPU CI ensures consistent coverage on every PR.

## Implementation

Add multi-GPU test configurations to `ci/test-matrix.yml`. These can be added to the existing `special_runners` section or a new dedicated section.

Suggested configurations (1-2 jobs to start):
```yaml
special_runners:
  amd64:
    # Existing H100 single-GPU entries...
    # New multi-GPU entries:
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 't4-2gpu', DRIVER: 'latest' }
    - { ARCH: 'amd64', PY_VER: '3.13', CUDA_VER: '13.1.0', LOCAL_CTK: '1', GPU: 'h100-2gpu', DRIVER: 'latest' }
```

The workflow files may need updates to:
1. Map the GPU names to the correct runner labels
2. Ensure tests detect and utilize multiple GPUs
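
Step 1 could be as small as an explicit lookup table, and the GPU value can also tell the test step how many devices to expect (step 2). Below is a minimal Python sketch, assuming the workflow or a helper script sees each matrix entry as a dict. `resolve_runner`, `expected_gpu_count`, and the `-<N>gpu` suffix convention are hypothetical. Only the two runner labels come from the runner configuration PR.

```python
# Hypothetical helper: map test-matrix entries to the 2-GPU runner labels.
# Only the two labels below come from the runner configuration PR; the
# function names and the "-<N>gpu" suffix convention are assumptions.

MULTI_GPU_RUNNERS = {
    ("amd64", "t4-2gpu"): "nv-gpu-amd64-t4-2gpu",
    ("amd64", "h100-2gpu"): "nv-gpu-amd64-h100-2gpu",
}


def resolve_runner(entry: dict) -> str:
    """Return the runner label for a matrix entry's ARCH and GPU values."""
    key = (entry["ARCH"], entry["GPU"])
    try:
        return MULTI_GPU_RUNNERS[key]
    except KeyError:
        raise ValueError(
            f"no multi-GPU runner for ARCH={key[0]!r}, GPU={key[1]!r}"
        ) from None


def expected_gpu_count(gpu: str) -> int:
    """Parse a trailing '-<N>gpu' suffix (e.g. 't4-2gpu' -> 2); default to 1."""
    _, sep, suffix = gpu.rpartition("-")
    if sep and suffix.endswith("gpu") and suffix[:-3].isdigit():
        return int(suffix[:-3])
    return 1


# Usage with the T4 entry proposed above:
entry = {"ARCH": "amd64", "PY_VER": "3.13", "CUDA_VER": "13.1.0",
         "LOCAL_CTK": "1", "GPU": "t4-2gpu", "DRIVER": "latest"}
print(resolve_runner(entry), expected_gpu_count(entry["GPU"]))
# → nv-gpu-amd64-t4-2gpu 2
```

An explicit table is used here instead of building the label by string concatenation, because the existing single-GPU runner labels may not follow the same `nv-gpu-<arch>-<gpu>` pattern. An unsupported combination (for example arm64) fails loudly rather than silently picking the wrong runner.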

## Scope & Limitations

- **Linux amd64 only**: Both available runners are `amd64`. No multi-GPU coverage for:
  - Windows (no multi-GPU runners available)
  - Linux arm64 (no multi-GPU runners available)
- **Per-PR execution**: Tests are fast, so running unconditionally on every PR is acceptable
- **Parallel execution**: Multi-GPU jobs should run in parallel with existing CI jobs, not blocking the critical path

## Tasks

- [x] Add multi-GPU configurations to `ci/test-matrix.yml`
- [x] Update workflow files to support multi-GPU runner selection
- [x] Verify existing multi-GPU tests run correctly (peer access, IPC, etc.)
- [x] Consider adding a pytest marker or environment variable to identify multi-GPU test runs
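
For the last task, one option is a pytest marker that skips multi-GPU tests on runners with too few devices. Below is a minimal `conftest.py` sketch, assuming the 2-GPU jobs export a hypothetical `CI_GPU_COUNT` environment variable. The marker name `multi_gpu` and the helper functions are illustrative, not existing project names. `pytest_configure`, `config.addinivalue_line`, and `pytest_collection_modifyitems` are standard pytest hooks and APIs.

```python
# conftest.py (sketch): hypothetical "multi_gpu" marker driven by a
# hypothetical CI_GPU_COUNT environment variable set by the 2-GPU CI jobs.
import os

MIN_GPUS = 2  # tests marked multi_gpu need at least this many devices


def gpu_count_from_env(env=None):
    """Return the GPU count from CI_GPU_COUNT, or 0 if unset or not an integer."""
    env = os.environ if env is None else env
    raw = env.get("CI_GPU_COUNT", "")
    return int(raw) if raw.isdigit() else 0


def multi_gpu_skip_reason(gpu_count):
    """Return a skip reason when there are too few GPUs, otherwise None."""
    if gpu_count < MIN_GPUS:
        return f"needs >= {MIN_GPUS} GPUs, found {gpu_count}"
    return None


def pytest_configure(config):
    # Register the marker so `pytest --strict-markers` accepts it.
    config.addinivalue_line("markers", "multi_gpu: test requires two or more GPUs")


def pytest_collection_modifyitems(config, items):
    import pytest  # imported here so the helpers above work without pytest

    reason = multi_gpu_skip_reason(gpu_count_from_env())
    if reason is None:
        return
    skip = pytest.mark.skip(reason=reason)
    for item in items:
        if "multi_gpu" in item.keywords:
            item.add_marker(skip)
```

A peer-access or IPC test would then opt in with `@pytest.mark.multi_gpu`. Reading the count from an environment variable keeps test collection independent of the CUDA driver. Querying the device count at session start would be a reasonable alternative, and it would also cover local machines with two GPUs.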

Enable multi-GPU CI testing #1501

Summary

Background

Motivation

Implementation

Scope & Limitations

Tasks

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Runner	GPUs	Architecture
`nv-gpu-amd64-t4-2gpu`	2x T4	Linux amd64
`nv-gpu-amd64-h100-2gpu`	2x H100	Linux amd64

Enable multi-GPU CI testing #1501

Description

Summary

Background

Motivation

Implementation

Scope & Limitations

Tasks

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions