OCPNODE-4108: add E2E tests for upstream dra-example-driver #31064
Status: Open. sabujmaity wants to merge 1 commit into `openshift:main` from `sabujmaity:feat/OCPNODE-4108-add-dra-example-e2e`.
The diff adds an OWNERS file:

```yaml
approvers:
  - sairameshv
  - harche
  - haircommander
  - rphillips
  - mrunalp

reviewers:
  - sairameshv
  - harche
  - haircommander
  - rphillips
  - mrunalp

labels:
  - sig/scheduling
  - area/dra
```
# DRA Example Driver Extended Tests for OpenShift

This directory contains extended tests for the upstream [dra-example-driver](https://github.com/kubernetes-sigs/dra-example-driver) on OpenShift clusters. These tests provide **hardware-independent** DRA regression coverage — no GPU or special hardware is required.

## Overview

These tests validate:

- DRA example driver installation and lifecycle
- Single-device allocation via ResourceClaims
- Multi-device allocation
- Pod lifecycle and resource cleanup
- Claim-sharing behavior
- ResourceClaimTemplate-based claim creation and cleanup
## Prerequisites

1. **OpenShift 4.21+** cluster (DRA API enabled by default)
2. **Helm 3** installed and available in `PATH`
3. **git** installed and available in `PATH`
4. **Cluster-admin** access

The test framework automatically:

- Clones the upstream `dra-example-driver` repository
- Installs the driver via Helm with OpenShift SCC permissions
- Waits for the driver components to become ready
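The prerequisite tooling can be checked up front with a small preflight script (a sketch only; it merely probes `PATH` and does not verify versions such as Helm 3 or cluster-admin access):

```shell
#!/bin/sh
# Preflight sketch: report which of the required CLI tools are on PATH.
# Informational only — it counts results instead of exiting early.
found=0
missing=0
for tool in helm git oc; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
    found=$((found + 1))
  else
    echo "MISSING: $tool"
    missing=$((missing + 1))
  fi
done
echo "checked $((found + missing)) tool(s): $found found, $missing missing"
```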
## Quick Start

```bash
# 1. Build the test binary
make WHAT=cmd/openshift-tests

# 2. Set the kubeconfig
export KUBECONFIG=/path/to/kubeconfig

# 3. Run all DRA example driver tests (local binary)
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 \
./openshift-tests run --dry-run all 2>&1 | \
grep "\[Feature:DRA-Example\]" | \
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 ./openshift-tests run -f -

# OR run a specific test
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 ./openshift-tests run-test \
  '[sig-scheduling][Feature:DRA-Example][Suite:openshift/dra-example][Serial] Basic Device Allocation should allocate single device to pod via DRA'

# OR list all available tests
OPENSHIFT_SKIP_EXTERNAL_TESTS=1 \
./openshift-tests run --dry-run all 2>&1 | grep "\[Feature:DRA-Example\]"
```

> **Note**: `OPENSHIFT_SKIP_EXTERNAL_TESTS=1` is required when running a locally
> built binary. Without it, the `run` command attempts to extract test binaries
> from the cluster's release payload, which does not contain your local changes.
> The variable is NOT needed in CI, where the binary is part of the payload.
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `DRA_EXAMPLE_DRIVER_REF` | `main` | Git ref (branch or tag) of the upstream dra-example-driver to install |
## Test Scenarios

### 1. Single Device Allocation

- Creates a DeviceClass with a CEL selector for the `gpu.example.com` driver
- Creates a ResourceClaim requesting 1 device
- Schedules a pod with the ResourceClaim
- Validates device allocation in the ResourceClaim status
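The objects in this scenario look roughly like the following manifests (a minimal sketch: all object names are illustrative, not the ones the test creates, and the `resource.k8s.io/v1` API group matches the `k8s.io/api/resource/v1` import used by the validator code in this PR):

```yaml
# Illustrative sketch — names are not those used by the actual test.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: gpu-example
spec:
  selectors:
    - cel:
        expression: device.driver == "gpu.example.com"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: single-gpu-claim
spec:
  devices:
    requests:
      - name: gpu
        exactly:
          deviceClassName: gpu-example
---
apiVersion: v1
kind: Pod
metadata:
  name: single-gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: ctr
      image: registry.k8s.io/e2e-test-images/busybox:1.36.1-1
      command: ["sleep", "60"]
      resources:
        claims:
          - name: gpu        # references the entry in spec.resourceClaims
  resourceClaims:
    - name: gpu
      resourceClaimName: single-gpu-claim
```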
### 2. Resource Cleanup

- Creates a pod with a device ResourceClaim
- Deletes the pod
- Verifies the ResourceClaim persists after pod deletion but is unreserved

### 3. Multi-Device Allocation

- Creates a ResourceClaim requesting 2 devices
- Schedules a pod requiring multiple devices
- Validates that all devices are allocated (the driver publishes 9 virtual devices per node)

### 4. Claim Sharing

- Creates a single ResourceClaim
- Creates two pods referencing the same ResourceClaim
- Verifies the behavior: either both pods run (sharing supported) or the second pod stays Pending
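The shared-claim setup in scenario 4 can be sketched as two pods pointing at the same pre-created ResourceClaim via `resourceClaimName` (illustrative names throughout):

```yaml
# Illustrative sketch — both pods reference the same ResourceClaim,
# so they either share its devices or the second pod stays Pending.
apiVersion: v1
kind: Pod
metadata:
  name: sharer-a
spec:
  restartPolicy: Never
  containers:
    - name: ctr
      image: registry.k8s.io/e2e-test-images/busybox:1.36.1-1
      command: ["sleep", "300"]
      resources:
        claims:
          - name: shared-gpu
  resourceClaims:
    - name: shared-gpu
      resourceClaimName: shared-claim
---
apiVersion: v1
kind: Pod
metadata:
  name: sharer-b
spec:
  restartPolicy: Never
  containers:
    - name: ctr
      image: registry.k8s.io/e2e-test-images/busybox:1.36.1-1
      command: ["sleep", "300"]
      resources:
        claims:
          - name: shared-gpu
  resourceClaims:
    - name: shared-gpu
      resourceClaimName: shared-claim
```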
### 5. ResourceClaimTemplate

- Creates a ResourceClaimTemplate
- Creates a pod referencing the ResourceClaimTemplate
- Validates that a ResourceClaim is auto-created from the template
- Validates automatic cleanup of the template-generated claim when the pod is deleted
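A template-based claim differs from scenario 1 only in that the pod uses `resourceClaimTemplateName`, so the claim is generated (and garbage-collected) per pod. A rough sketch with illustrative names:

```yaml
# Illustrative sketch — the claim is auto-created from the template
# for each pod and deleted when the pod goes away.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-template
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu-example
---
apiVersion: v1
kind: Pod
metadata:
  name: templated-pod
spec:
  restartPolicy: Never
  containers:
    - name: ctr
      image: registry.k8s.io/e2e-test-images/busybox:1.36.1-1
      command: ["sleep", "60"]
      resources:
        claims:
          - name: gpu
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: gpu-template
```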
## OpenShift-Specific Adaptations

The upstream `dra-example-driver` Helm chart requires the following OpenShift adaptations (handled automatically by the test framework):

1. **SCC Grant**: The kubelet plugin DaemonSet runs with `privileged: true` and mounts hostPath volumes. A ClusterRoleBinding grants the `system:openshift:scc:privileged` ClusterRole to the driver ServiceAccount.
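The SCC grant described above amounts to a binding like the following sketch (the binding name is hypothetical; the namespace and ServiceAccount names are assumptions taken from the troubleshooting section of this README):

```yaml
# Sketch of the SCC grant — binding name is illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dra-example-driver-privileged-scc
subjects:
  - kind: ServiceAccount
    name: dra-example-driver-service-account
    namespace: dra-example-driver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
```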
2. **SNO Tolerations**: Control-plane tolerations are added to allow scheduling on single-node OpenShift clusters.

> **Review comment (Member)**: These tolerations need not be just for SNO (Single Node OpenShift).
## Troubleshooting

### Helm not found

**Cause**: Helm 3 is not installed.

**Solution**: Install Helm following the [official instructions](https://helm.sh/docs/intro/install/).

### SCC denied — kubelet plugin pod rejected

**Cause**: The ClusterRoleBinding for the privileged SCC was not created.

**Solution**: The test framework creates this automatically. For manual debugging:

```bash
oc adm policy add-scc-to-user privileged \
  -n dra-example-driver \
  -z dra-example-driver-service-account
```

### ResourceSlices not appearing

**Cause**: The DRA driver DaemonSet is not ready.

**Solution**:

```bash
# Check the DRA driver pods
oc get pods -n dra-example-driver

# Check the DaemonSet logs
oc logs -n dra-example-driver -l app.kubernetes.io/name=dra-example-driver --all-containers
```

## References

- **Upstream repository**: https://github.com/kubernetes-sigs/dra-example-driver
- **Kubernetes DRA docs**: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
- **OpenShift Extended Tests**: https://github.com/openshift/origin/tree/master/test/extended
- **NVIDIA DRA tests (reference)**: `test/extended/node/dra/nvidia/`
```go
package example

import (
	"context"
	"fmt"

	resourceapi "k8s.io/api/resource/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubernetes/test/e2e/framework"
)

// DeviceValidator validates DRA device allocation and ResourceSlice state for the example driver.
type DeviceValidator struct {
	client    kubernetes.Interface
	framework *framework.Framework
}

// NewDeviceValidator creates a DeviceValidator using the provided test framework.
func NewDeviceValidator(f *framework.Framework) *DeviceValidator {
	return &DeviceValidator{
		client:    f.ClientSet,
		framework: f,
	}
}

// ValidateDeviceAllocation checks that the given ResourceClaim has exactly expectedCount devices allocated.
func (dv *DeviceValidator) ValidateDeviceAllocation(ctx context.Context, namespace, claimName string, expectedCount int) error {
	framework.Logf("Validating ResourceClaim allocation for %s/%s (expected %d device(s))", namespace, claimName, expectedCount)

	claim, err := dv.client.ResourceV1().ResourceClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("failed to get ResourceClaim %s/%s: %w", namespace, claimName, err)
	}

	if claim.Status.Allocation == nil {
		return fmt.Errorf("ResourceClaim %s/%s is not allocated", namespace, claimName)
	}

	deviceCount := len(claim.Status.Allocation.Devices.Results)
	if deviceCount != expectedCount {
		return fmt.Errorf("ResourceClaim %s/%s expected %d device(s) but got %d",
			namespace, claimName, expectedCount, deviceCount)
	}

	framework.Logf("ResourceClaim %s/%s has %d device(s) allocated", namespace, claimName, deviceCount)

	for i, result := range claim.Status.Allocation.Devices.Results {
		if result.Driver != exampleDriverName {
			return fmt.Errorf("device %d has incorrect driver %q, expected %q", i, result.Driver, exampleDriverName)
		}
		if result.Pool == "" {
			return fmt.Errorf("device %d has empty pool field", i)
		}
		if result.Device == "" {
			return fmt.Errorf("device %d has empty device field", i)
		}
		if result.Request == "" {
			return fmt.Errorf("device %d has empty request field", i)
		}

		framework.Logf("Device %d validated: driver=%s, pool=%s, device=%s, request=%s",
			i, result.Driver, result.Pool, result.Device, result.Request)
	}

	return nil
}

// ValidateResourceSlice finds and validates the ResourceSlice published by the example driver on the given node.
func (dv *DeviceValidator) ValidateResourceSlice(ctx context.Context, nodeName string) (*resourceapi.ResourceSlice, error) {
	framework.Logf("Validating ResourceSlice for node %s", nodeName)

	sliceList, err := dv.client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("failed to list ResourceSlices: %w", err)
	}

	var nodeSlice *resourceapi.ResourceSlice
	totalDevices := 0
	for i := range sliceList.Items {
		slice := &sliceList.Items[i]
		if slice.Spec.NodeName != nil && *slice.Spec.NodeName == nodeName &&
			slice.Spec.Driver == exampleDriverName {
			totalDevices += len(slice.Spec.Devices)
			if nodeSlice == nil && len(slice.Spec.Devices) > 0 {
				nodeSlice = slice
			}
		}
	}

	if nodeSlice == nil {
		return nil, fmt.Errorf("no ResourceSlice with devices found for driver %s on node %s", exampleDriverName, nodeName)
	}

	framework.Logf("Node %s has %d total device(s) across matching ResourceSlices (returning slice %s)",
		nodeName, totalDevices, nodeSlice.Name)
	return nodeSlice, nil
}

// GetTotalDeviceCount returns the total number of devices published by the example driver across all nodes.
func (dv *DeviceValidator) GetTotalDeviceCount(ctx context.Context) (int, error) {
	framework.Logf("Counting total devices from %s driver via ResourceSlices", exampleDriverName)

	sliceList, err := dv.client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, fmt.Errorf("failed to list ResourceSlices: %w", err)
	}

	totalDevices := 0
	for _, slice := range sliceList.Items {
		if slice.Spec.Driver == exampleDriverName {
			totalDevices += len(slice.Spec.Devices)
		}
	}

	framework.Logf("Found %d total device(s) from %s driver", totalDevices, exampleDriverName)
	return totalDevices, nil
}

// IsDriverPublishingDevices returns true if the example driver has published at least one device.
func (dv *DeviceValidator) IsDriverPublishingDevices(ctx context.Context) bool {
	count, err := dv.GetTotalDeviceCount(ctx)
	if err != nil {
		framework.Logf("Failed to check if %s is publishing devices: %v", exampleDriverName, err)
		return false
	}
	return count > 0
}
```
> **Review comment**: Can we also have a few more tests covering the `DRAExtendedResources` and `DRAPartitionableDevices` sub-features of DRA? We want to claim these features work on OpenShift, hence the ask.
DRAExtendedResourcesand theDRAPartitionableDevicessub features of DRA?We want to claim these features work on Openshift and hence the ask.