1 change: 1 addition & 0 deletions test/extended/include.go
@@ -40,6 +40,7 @@ import (
_ "github.com/openshift/origin/test/extended/machines"
_ "github.com/openshift/origin/test/extended/networking"
_ "github.com/openshift/origin/test/extended/node"
_ "github.com/openshift/origin/test/extended/node/dra/example"
_ "github.com/openshift/origin/test/extended/node/dra/nvidia"
_ "github.com/openshift/origin/test/extended/node/node_e2e"
_ "github.com/openshift/origin/test/extended/node_tuning"
17 changes: 17 additions & 0 deletions test/extended/node/dra/example/OWNERS
@@ -0,0 +1,17 @@
approvers:
- sairameshv
- harche
- haircommander
- rphillips
- mrunalp

reviewers:
- sairameshv
- harche
- haircommander
- rphillips
- mrunalp

labels:
- sig/scheduling
- area/dra
138 changes: 138 additions & 0 deletions test/extended/node/dra/example/README.md
@@ -0,0 +1,138 @@
# DRA Example Driver Extended Tests for OpenShift

This directory contains extended tests for the upstream [dra-example-driver](https://github.com/kubernetes-sigs/dra-example-driver) on OpenShift clusters. These tests provide **hardware-independent** DRA regression coverage — no GPU or special hardware is required.

## Overview

These tests validate:
- DRA example driver installation and lifecycle
- Single device allocation via ResourceClaims
- Multi-device allocation
- Pod lifecycle and resource cleanup
- Claim sharing behavior
- ResourceClaimTemplate-based claim creation and cleanup
> **Review comment (on lines +7 to +13):** Can we also have a few more tests covering the DRAExtendedResources and DRAPartitionableDevices sub-features of DRA? We want to claim these features work on OpenShift, hence the ask.

## Prerequisites

1. **OpenShift 4.21+** cluster (DRA API enabled by default)
2. **Helm 3** installed and available in PATH
3. **git** installed and available in PATH
4. **Cluster-admin** access

The test framework automatically:
- Clones the upstream `dra-example-driver` repository
- Installs the driver via Helm with OpenShift SCC permissions
- Waits for driver components to be ready

## Quick Start

```bash
# 1. Build test binary
make WHAT=cmd/openshift-tests

# 2. Set kubeconfig
export KUBECONFIG=/path/to/kubeconfig

# 3. Run all DRA example driver tests (local binary)
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 \
./openshift-tests run --dry-run all 2>&1 | \
grep "\[Feature:DRA-Example\]" | \
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 ./openshift-tests run -f -

# OR run a specific test
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 ./openshift-tests run-test \
'[sig-scheduling][Feature:DRA-Example][Suite:openshift/dra-example][Serial] Basic Device Allocation should allocate single device to pod via DRA'

# OR list all available tests
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 \
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:DRA-Example\]"
```

> **Note**: `OPENSHIFT_SKIP_EXTERNAL_TESTS=1` is required when running a locally
> built binary. Without it, the `run` command attempts to extract test binaries
> from the cluster's release payload, which does not contain your local changes.
> This variable is NOT needed in CI where the binary is part of the payload.

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DRA_EXAMPLE_DRIVER_REF` | `main` | Git ref (branch/tag) of the upstream dra-example-driver to install |
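
For example, to pin the upstream driver to a fixed ref instead of `main` (the tag below is illustrative, not a verified release):

```bash
# Pin the driver checkout for reproducible runs (v0.2.0 is a hypothetical tag)
export DRA_EXAMPLE_DRIVER_REF=v0.2.0
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 ./openshift-tests run-test \
  '[sig-scheduling][Feature:DRA-Example][Suite:openshift/dra-example][Serial] Basic Device Allocation should allocate single device to pod via DRA'
```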

## Test Scenarios

### 1. Single Device Allocation
- Creates DeviceClass with CEL selector for `gpu.example.com` driver
- Creates ResourceClaim requesting 1 device
- Schedules pod with ResourceClaim
- Validates device allocation in ResourceClaim status
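
A minimal sketch of the kind of manifests this scenario exercises, assuming the `resource.k8s.io/v1` API that the test code imports; object names are illustrative and the test's actual fixtures may differ:

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpu                  # illustrative name
spec:
  selectors:
  - cel:
      expression: 'device.driver == "gpu.example.com"'
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu-claim             # illustrative name
  namespace: default
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: example-gpu
        count: 1
EOF
```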

### 2. Resource Cleanup
- Creates pod with device ResourceClaim
- Deletes pod
- Verifies ResourceClaim persists after pod deletion but is unreserved
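
A quick way to spot-check this behavior by hand, reusing the illustrative claim name from the scenario 1 sketch:

```bash
# After the consuming pod is deleted, the claim should still exist
oc get resourceclaim single-gpu-claim -n default
# ...but its reservation list should be empty (no output from this jsonpath)
oc get resourceclaim single-gpu-claim -n default -o jsonpath='{.status.reservedFor}'
```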

### 3. Multi-Device Allocation
- Creates ResourceClaim requesting 2 devices
- Schedules pod requiring multiple devices
- Validates all devices are allocated (driver publishes 9 virtual devices per node)
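
The multi-device case is the scenario 1 claim with a higher count (again a sketch, not the test's literal fixture):

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: multi-gpu-claim              # illustrative name
  namespace: default
spec:
  devices:
    requests:
    - name: gpus
      exactly:
        deviceClassName: example-gpu # DeviceClass from the scenario 1 sketch
        allocationMode: ExactCount
        count: 2
EOF
```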

### 4. Claim Sharing
- Creates a single ResourceClaim
- Creates two pods referencing the same ResourceClaim
- Verifies one of two acceptable outcomes: both pods run (sharing supported) or the second pod stays Pending
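
A sketch of the sharing setup: two pods that name the same pre-created ResourceClaim (pod names, image, and claim name are illustrative):

```bash
for pod in sharer-a sharer-b; do
oc apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod}
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu                    # refers to the pod-level resourceClaims entry below
  resourceClaims:
  - name: gpu
    resourceClaimName: single-gpu-claim   # both pods point at the same claim
EOF
done
```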

### 5. ResourceClaimTemplate
- Creates a ResourceClaimTemplate
- Creates pod with ResourceClaimTemplate reference
- Validates that ResourceClaim is auto-created from template
- Validates automatic cleanup of template-generated claim when pod is deleted
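
A sketch of the template flow (names and image illustrative; note the nested `spec.spec`, since the template wraps a full claim spec):

```bash
oc apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template                 # illustrative name
  namespace: default
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: example-gpu
          count: 1
---
apiVersion: v1
kind: Pod
metadata:
  name: templated-pod
  namespace: default
spec:
  restartPolicy: Never
  containers:
  - name: ctr
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: gpu-template   # a per-pod claim is generated from this template
EOF
```

Deleting `templated-pod` should garbage-collect the generated claim, which is exactly what the test asserts.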

## OpenShift-Specific Adaptations

The upstream `dra-example-driver` Helm chart requires the following OpenShift adaptations (handled automatically by the test framework):

1. **SCC Grant**: The kubelet plugin DaemonSet runs with `privileged: true` and mounts hostPath volumes. A ClusterRoleBinding grants the `system:openshift:scc:privileged` ClusterRole to the driver ServiceAccount (see the sketch after this list).

2. **SNO Tolerations**: Control-plane tolerations are added to allow scheduling on single-node OpenShift clusters.
> **Review comment:** These tolerations need not be just for SNO (Single Node OpenShift). Suggested change: retitle item 2 to "**Tolerations**: Control-plane tolerations are added to allow scheduling on OpenShift clusters."
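
For reference, the SCC grant from item 1 amounts to a binding like the following; the binding name is illustrative, while the ServiceAccount and namespace match the troubleshooting section below:

```bash
oc apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dra-example-driver-privileged-scc   # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
- kind: ServiceAccount
  name: dra-example-driver-service-account
  namespace: dra-example-driver
EOF
```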

## Troubleshooting

### Helm not found

**Cause**: Helm 3 not installed.

**Solution**: Install Helm following [official instructions](https://helm.sh/docs/intro/install/).

### SCC denied — kubelet plugin pod rejected

**Cause**: ClusterRoleBinding for privileged SCC not created.

**Solution**: The test framework creates this automatically. For manual debugging:

```bash
oc adm policy add-scc-to-user privileged \
-n dra-example-driver \
-z dra-example-driver-service-account
```

### ResourceSlices not appearing

**Cause**: DRA driver DaemonSet not ready.

**Solution**:

```bash
# Check DRA driver pods
oc get pods -n dra-example-driver

# Check DaemonSet logs
oc logs -n dra-example-driver -l app.kubernetes.io/name=dra-example-driver --all-containers
```
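
If the driver pods look healthy, check whether ResourceSlices are being published at all:

```bash
# List slices cluster-wide; the example driver's slices should reference gpu.example.com
oc get resourceslices
# Inspect the devices a specific slice publishes (substitute a real slice name)
oc get resourceslice <slice-name> -o yaml
```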

## References

- **Upstream repository**: https://github.com/kubernetes-sigs/dra-example-driver
- **Kubernetes DRA docs**: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- **OpenShift Extended Tests**: https://github.com/openshift/origin/tree/master/test/extended
- **NVIDIA DRA tests (reference)**: `test/extended/node/dra/nvidia/`
128 changes: 128 additions & 0 deletions test/extended/node/dra/example/device_validator.go
@@ -0,0 +1,128 @@
package example

import (
"context"
"fmt"

resourceapi "k8s.io/api/resource/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes"
"k8s.io/kubernetes/test/e2e/framework"
)

// DeviceValidator validates DRA device allocation and ResourceSlice state for the example driver.
type DeviceValidator struct {
client kubernetes.Interface
framework *framework.Framework
}

// NewDeviceValidator creates a DeviceValidator using the provided test framework.
func NewDeviceValidator(f *framework.Framework) *DeviceValidator {
return &DeviceValidator{
client: f.ClientSet,
framework: f,
}
}

// ValidateDeviceAllocation checks that the given ResourceClaim has exactly expectedCount devices allocated.
func (dv *DeviceValidator) ValidateDeviceAllocation(ctx context.Context, namespace, claimName string, expectedCount int) error {
framework.Logf("Validating ResourceClaim allocation for %s/%s (expected %d device(s))", namespace, claimName, expectedCount)

claim, err := dv.client.ResourceV1().ResourceClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
if err != nil {
return fmt.Errorf("failed to get ResourceClaim %s/%s: %w", namespace, claimName, err)
}

if claim.Status.Allocation == nil {
return fmt.Errorf("ResourceClaim %s/%s is not allocated", namespace, claimName)
}

deviceCount := len(claim.Status.Allocation.Devices.Results)
if deviceCount != expectedCount {
return fmt.Errorf("ResourceClaim %s/%s expected %d device(s) but got %d",
namespace, claimName, expectedCount, deviceCount)
}

framework.Logf("ResourceClaim %s/%s has %d device(s) allocated", namespace, claimName, deviceCount)

for i, result := range claim.Status.Allocation.Devices.Results {
if result.Driver != exampleDriverName {
return fmt.Errorf("device %d has incorrect driver %q, expected %q", i, result.Driver, exampleDriverName)
}
if result.Pool == "" {
return fmt.Errorf("device %d has empty pool field", i)
}
if result.Device == "" {
return fmt.Errorf("device %d has empty device field", i)
}
if result.Request == "" {
return fmt.Errorf("device %d has empty request field", i)
}

framework.Logf("Device %d validated: driver=%s, pool=%s, device=%s, request=%s",
i, result.Driver, result.Pool, result.Device, result.Request)
}

return nil
}

// ValidateResourceSlice finds and validates the ResourceSlice published by the example driver on the given node.
func (dv *DeviceValidator) ValidateResourceSlice(ctx context.Context, nodeName string) (*resourceapi.ResourceSlice, error) {
framework.Logf("Validating ResourceSlice for node %s", nodeName)

sliceList, err := dv.client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
if err != nil {
return nil, fmt.Errorf("failed to list ResourceSlices: %w", err)
}

var nodeSlice *resourceapi.ResourceSlice
totalDevices := 0
for i := range sliceList.Items {
slice := &sliceList.Items[i]
if slice.Spec.NodeName != nil && *slice.Spec.NodeName == nodeName &&
slice.Spec.Driver == exampleDriverName {
totalDevices += len(slice.Spec.Devices)
if nodeSlice == nil && len(slice.Spec.Devices) > 0 {
nodeSlice = slice
}
}
}

if nodeSlice == nil {
return nil, fmt.Errorf("no ResourceSlice with devices found for driver %s on node %s", exampleDriverName, nodeName)
}

framework.Logf("Node %s has %d total device(s) across matching ResourceSlices (returning slice %s)",
nodeName, totalDevices, nodeSlice.Name)
return nodeSlice, nil
}

// GetTotalDeviceCount returns the total number of devices published by the example driver across all nodes.
func (dv *DeviceValidator) GetTotalDeviceCount(ctx context.Context) (int, error) {
framework.Logf("Counting total devices from %s driver via ResourceSlices", exampleDriverName)

sliceList, err := dv.client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
if err != nil {
return 0, fmt.Errorf("failed to list ResourceSlices: %w", err)
}

totalDevices := 0
for _, slice := range sliceList.Items {
if slice.Spec.Driver == exampleDriverName {
totalDevices += len(slice.Spec.Devices)
}
}

framework.Logf("Found %d total device(s) from %s driver", totalDevices, exampleDriverName)
return totalDevices, nil
}

// IsDriverPublishingDevices returns true if the example driver has published at least one device.
func (dv *DeviceValidator) IsDriverPublishingDevices(ctx context.Context) bool {
count, err := dv.GetTotalDeviceCount(ctx)
if err != nil {
framework.Logf("Failed to check if %s is publishing devices: %v", exampleDriverName, err)
return false
}
return count > 0
}