
OCPNODE-4043: Add CI jobs to validate DRA APIs on NVIDIA GPU #74984

Draft
sairameshv wants to merge 2 commits into openshift:main from sairameshv:nvidia-dra-ocp

Conversation

@sairameshv
Member

@sairameshv sairameshv commented Feb 17, 2026

Add CI infrastructure for DRA testing on NVIDIA GPUs

This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with
NVIDIA GPUs on GCP across two repositories:

openshift/kubernetes (DRAPartitionableDevices) - Periodic:
- Tests upstream Kubernetes DRA Partitionable Devices (MIG) functionality
- Uses H100 GPU (a3-highgpu-1g) for DynamicMIG support
- Runs weekly in us-central1-a

openshift/origin (Basic DRA) - Presubmit:
- Tests OpenShift DRA integration with NVIDIA GPUs
- Uses L4 GPU (g2-standard-4) for cost-effective testing
- Triggers on changes to test/extended/node/dra/nvidia or test/extended/dra/nvidia

Reusable step-registry components:
- nvidia-nfd-operator-install: Installs NFD operator via OLM
- nvidia-gpu-operator-install: Installs NVIDIA GPU operator with CDI
- nvidia-dra-driver-install: Installs NVIDIA DRA driver via Helm
- nvidia-cleanup: Cleans up NVIDIA components and resources
- openshift-kubernetes-e2e-nvidia-dra-partitionable: Workflow for DRA testing

Note: G2/A3 GPU-optimized machine types include GPUs automatically.
All tests use us-central1 to match CI quota allocation.

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
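
For illustration, a minimal sketch of what a Helm-based nvidia-dra-driver-install step can boil down to; the chart repository, chart name, namespace, and values here are assumptions for the sketch, not necessarily what the actual step uses:

#!/bin/bash
# Sketch only: Helm-based install of the NVIDIA DRA driver.
# Repo URL, chart name, namespace, and values are assumptions.
set -o errexit -o nounset -o pipefail

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Chart version pinning and any DynamicMIG-related values would go here;
# the exact keys depend on the chart and are not shown in this PR excerpt.
helm upgrade --install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver --create-namespace \
  --wait --timeout 10m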

Summary by CodeRabbit

  • New Features

    • Added NVIDIA DRA partitionable devices testing on GCP clusters with GPU support.
    • Added NVIDIA DRA end-to-end testing on AWS with GPU worker nodes.
    • Enhanced NVIDIA GPU operator installation with verification for driver readiness.
  • Chores

    • Added CI/periodic test configurations to validate NVIDIA DRA functionality across multiple cloud environments.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 17, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 17, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Details

In response to this:

  • Add CI infrastructure for testing DRA in OCP with an NVIDIA GPU
  • Step-registry workflow for GPU operator installation and DRA testing
  • AWS e2e job using GPU worker nodes (g4dn.xlarge)
  • Tests run conditionally on DRA test file changes in openshift/origin for now
  • The same infrastructure can be extended with more jobs for different GPU variants, based on the DRA features to test in the future, e.g. MIG-capable GPUs such as the H100 for the DRAPartitionableDevices feature

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 17, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 17, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

sairameshv commented Feb 18, 2026

The rehearsal failure is expected here, as the tests PR is not yet merged and hence there are no tests to run:

Suite run returned error: no tests to run
error: error running a test suite: no tests to run
✗ FAILED (exit code: 1)

@openshift-ci openshift-ci Bot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 19, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 19, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Details

In response to this:

Add CI infrastructure for DRA testing on NVIDIA GPUs

This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with
NVIDIA GPUs on GCP:

  • Basic DRA testing on openshift/origin with T4 GPU (n1-standard-4)
  • Partitionable Devices testing on openshift/kubernetes with MIG i.e. A100 GPU (a2-highgpu-1g)
  • Enables only DRAPartitionableDevices feature gate via CustomNoUpgrade
  • New steps to install NFD and NVIDIA GPU operators
  • Adds NVIDIA DRA driver installation via Helm

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
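
As an aside on the CustomNoUpgrade bullet above, a rough sketch of how a CI step can enable a single feature gate on an OpenShift cluster; whether the job here does it exactly this way is not shown in this excerpt:

#!/bin/bash
# Sketch only: enable the DRAPartitionableDevices gate via the CustomNoUpgrade
# feature set (which marks the cluster as not upgradable).
set -o errexit -o nounset -o pipefail

oc patch featuregate cluster --type merge -p '{
  "spec": {
    "featureSet": "CustomNoUpgrade",
    "customNoUpgrade": {"enabled": ["DRAPartitionableDevices"]}
  }
}'

# The change rolls out to the control plane and kubelets; let nodes settle.
oc wait --for=condition=Ready node --all --timeout=30m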

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sairameshv
Member Author

/pj-rehearse

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-release-4.22-gcp-nvidia-dra-e2e-gcp-nvidia-dra-partitionable-devices

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-release-4.22-gcp-nvidia-dra-partitionable-devices

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv sairameshv force-pushed the nvidia-dra-ocp branch 3 times, most recently from f253c59 to dd181ab on March 19, 2026 at 16:37
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Details

In response to this:

Add CI infrastructure for DRA testing on NVIDIA GPUs

This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with
NVIDIA GPUs on GCP across two repositories:

openshift/kubernetes (DRAPartitionableDevices):

  • Tests upstream Kubernetes DRA Partitionable Devices (MIG) functionality
  • Uses H100 GPU (a3-highgpu-1g) with DynamicMIG support
  • Enables DRAPartitionableDevices feature gate via CustomNoUpgrade
  • Triggers on changes to test/e2e/dra/*partitionable or scheduler DRA code
  • Installs NVIDIA DRA driver v25.12.0 with DynamicMIG=true

openshift/origin (Basic DRA):

  • Tests OpenShift DRA integration with NVIDIA GPUs
  • Uses T4 GPU (n1-standard-4) for basic DRA testing
  • Triggers on changes to test/extended/node/dra/nvidia or test/extended/dra/nvidia

Reusable step-registry components:

  • nvidia-nfd-operator-install: Installs NFD operator via OLM
  • nvidia-gpu-operator-install: Installs NVIDIA GPU operator with CDI
  • nvidia-dra-driver-install: Installs NVIDIA DRA driver via Helm
  • nvidia-cleanup: Cleans up NVIDIA components and resources
  • openshift-kubernetes-e2e-nvidia-dra-partitionable: Workflow and test step
    for DRA Partitionable Devices testing (consolidated in single directory)

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
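
To make the workflow/test-step bullet above concrete, a sketch of the kind of prerequisite checks such a step can run before invoking the e2e suite; the node label and the expectation that the driver publishes DeviceClasses/ResourceSlices are assumptions based on the GPU operator and the Kubernetes DRA API:

#!/bin/bash
# Sketch only: prerequisite checks before running DRA partitionable-devices e2e.
set -o errexit -o nounset -o pipefail

# MIG-capable GPU nodes (label assumed to be applied by GPU Feature Discovery).
oc get nodes -l nvidia.com/mig.capable=true -o name

# DRA API objects the NVIDIA DRA driver is expected to publish.
oc get deviceclass
oc get resourceslice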

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh (1)

324-331: Use full DaemonSet selector instead of hardcoding app for diagnostics.

Line 324 assumes .spec.selector.matchLabels.app; if that key changes, timeout debugging will miss driver pods/logs even when the DaemonSet exists.

Suggested adjustment
-    DRIVER_SELECTOR=$(oc get "${DRIVER_DS}" -n nvidia-gpu-operator -o jsonpath='{.spec.selector.matchLabels.app}' 2>/dev/null || echo "")
+    DRIVER_SELECTOR=$(oc get "${DRIVER_DS}" -n nvidia-gpu-operator -o json 2>/dev/null | jq -r '.spec.selector.matchLabels | to_entries | map("\(.key)=\(.value)") | join(",")')
     if [ -n "${DRIVER_SELECTOR}" ]; then
-      oc get pods -n nvidia-gpu-operator -l "app=${DRIVER_SELECTOR}" -o wide 2>/dev/null || echo "  No driver pods found"
+      oc get pods -n nvidia-gpu-operator -l "${DRIVER_SELECTOR}" -o wide 2>/dev/null || echo "  No driver pods found"
       echo ""
       echo "Driver pod logs (if available):"
-      DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l "app=${DRIVER_SELECTOR}" -o name 2>/dev/null | head -1 || echo "")
+      DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l "${DRIVER_SELECTOR}" -o name 2>/dev/null | head -1 || echo "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh`
around lines 324 - 331, The current diagnostic extracts only
.spec.selector.matchLabels.app into DRIVER_SELECTOR which breaks if the
DaemonSet uses different label keys; update the logic that sets DRIVER_SELECTOR
(and subsequent DRIVER_POD usage) to build the full label selector from the
DaemonSet's .spec.selector.matchLabels (e.g., read the map and join keys/values
into a comma-separated selector like "key1=val1,key2=val2") using oc/jsonpath or
jq, then pass that full selector to oc get pods -l and oc logs; keep the
existing fallbacks (empty string) and error messages but replace the hardcoded
"app=${DRIVER_SELECTOR}" usage with the constructed selector variable so
DRIVER_DS/DRIVER_SELECTOR/DRIVER_POD logic works for any matchLabels shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`:
- Around line 16-19: The curl invocation in the commands block currently can
write HTTP error pages into the repo file; update the curl command (the line
starting with "curl http://base-openstack-4-21.ocp.svc >
/etc/yum.repos.art/ci/base-openstack-4-21.repo") to use curl's fail mode (e.g.,
add --fail or -f) and write to the target with curl's -o/-O rather than shell
redirection so that curl returns a non-zero exit on HTTP errors and the step
fails immediately.
- Around line 49-52: The AWS pre chain is missing the NVIDIA DRA driver
installation step; update the pre steps under the "pre" list to add a new entry
"- ref: nvidia-dra-driver-install" immediately after
"nvidia-gpu-operator-install" so the AWS workflow mirrors the
"openshift-kubernetes-e2e-nvidia-dra-partitionable" workflow and ensures the
DeviceClass/ResourceSlice objects required by [Feature:NVIDIA-DRA] tests are
created.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`:
- Around line 31-36: The script currently reads RESOURCE_SLICES once and
proceeds even if zero; change it to poll until ResourceSlices exist (or a
configurable timeout) before continuing: replace the single read of
RESOURCE_SLICES with a loop that runs oc get resourceslice -o name repeatedly
(sleep between attempts) and exits successfully when RESOURCE_SLICES > 0 or
fails after the timeout, emitting a clear error; reference the RESOURCE_SLICES
variable and the oc get resourceslice command and ensure the failure path
returns non-zero so downstream test-kubernetes-e2e.sh won’t run when no slices
are advertised.
- Around line 15-19: The readonly assignment for GPU_NODES using command
substitution masks failures from the oc/get pipeline; change the code so you run
oc get nodes -l "${GPU_LABEL}" -o name first, check its exit status, capture its
output to a variable (e.g., nodes_out), then compute GPU_NODES from that output
(e.g., with wc -l) and finally mark GPU_NODES readonly; update the logic around
the oc get nodes call and the subsequent if that references GPU_NODES to fail
fast on oc errors rather than returning a misleading zero count (refer to
GPU_NODES and the oc get nodes -l "${GPU_LABEL}" -o name usage).

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml`:
- Around line 23-24: Update the YAML step description to state that the
DRAPartitionableDevices feature-gate check is conditional: change the
unconditional bullet to something like "Verifies DRAPartitionableDevices feature
gate is enabled when DRA_TEST_FOCUS includes 'DRAPartitionableDevices'"; ensure
any other sentence mentioning this check is similarly clarified. Reference the
check logic in openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh and
the DRA_TEST_FOCUS variable so readers understand the gate validation only runs
when that focus is present.

---

Nitpick comments:
In
`@ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh`:
- Around line 324-331: The current diagnostic extracts only
.spec.selector.matchLabels.app into DRIVER_SELECTOR which breaks if the
DaemonSet uses different label keys; update the logic that sets DRIVER_SELECTOR
(and subsequent DRIVER_POD usage) to build the full label selector from the
DaemonSet's .spec.selector.matchLabels (e.g., read the map and join keys/values
into a comma-separated selector like "key1=val1,key2=val2") using oc/jsonpath or
jq, then pass that full selector to oc get pods -l and oc logs; keep the
existing fallbacks (empty string) and error messages but replace the hardcoded
"app=${DRIVER_SELECTOR}" usage with the constructed selector variable so
DRIVER_DS/DRIVER_SELECTOR/DRIVER_POD logic works for any matchLabels shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f85fd8b-0c12-47fb-80fc-7246a8443f2c

📥 Commits

Reviewing files that changed from the base of the PR and between 78a388c and 6466435.

📒 Files selected for processing (11)
  • ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml
  • ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml
  • ci-operator/jobs/openshift/kubernetes/openshift-kubernetes-main-periodics.yaml
  • ci-operator/jobs/openshift/origin/openshift-origin-main-presubmits.yaml
  • ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/OWNERS
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.metadata.json
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.metadata.json
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.yaml

Comment on lines +16 to +19
commands: |
curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
from: tools
to: tools-openstack
Contributor


⚠️ Potential issue | 🟠 Major

Make the repo download fail on HTTP errors.

curl without --fail can write an error page into the repo file and let the step continue. That turns a fetch problem into a much harder-to-diagnose package install failure later.

Suggested fix
-      curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
+      curl --fail --silent --show-error --retry 5 \
+        http://base-openstack-4-21.ocp.svc \
+        -o /etc/yum.repos.art/ci/base-openstack-4-21.repo
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-commands: |
-  curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
-from: tools
-to: tools-openstack
+commands: |
+  curl --fail --silent --show-error --retry 5 \
+    http://base-openstack-4-21.ocp.svc \
+    -o /etc/yum.repos.art/ci/base-openstack-4-21.repo
+from: tools
+to: tools-openstack
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`
around lines 16 - 19, The curl invocation in the commands block currently can
write HTTP error pages into the repo file; update the curl command (the line
starting with "curl http://base-openstack-4-21.ocp.svc >
/etc/yum.repos.art/ci/base-openstack-4-21.repo") to use curl's fail mode (e.g.,
add --fail or -f) and write to the target with curl's -o/-O rather than shell
redirection so that curl returns a non-zero exit on HTTP errors and the step
fails immediately.

Comment on lines +49 to +52
pre:
- chain: ipi-aws-pre
- ref: nvidia-nfd-operator-install
- ref: nvidia-gpu-operator-install
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the target file
find . -name "*nvidia-dra*" -type f | head -20

Repository: openshift/release

Length of output: 1221


🏁 Script executed:

# Check the specific file and lines mentioned
find . -path "*/ci-operator/config/openshift/origin/*nvidia-dra*" -type f

Repository: openshift/release

Length of output: 137


🏁 Script executed:

# Get the file structure around the repository
ls -la ci-operator/config/openshift/origin/ 2>/dev/null | grep -i nvidia

Repository: openshift/release

Length of output: 151


🏁 Script executed:

# Read the target file with line numbers
cat -n ./ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml

Repository: openshift/release

Length of output: 3028


🏁 Script executed:

# Also read the sibling Kubernetes flow mentioned
cat -n ./ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml

Repository: openshift/release

Length of output: 1369


🏁 Script executed:

# Find and read the ipi-aws-pre chain definition
find . -path "*/step-registry/*ipi*aws*pre*" -type f | head -5

Repository: openshift/release

Length of output: 1011


🏁 Script executed:

# Find the nvidia-dra-partitionable workflow
find . -path "*/nvidia-dra-partitionable*workflow*" -type f

Repository: openshift/release

Length of output: 353


🏁 Script executed:

# Read the nvidia-dra-driver-install step definition
cat -n ./ci-operator/step-registry/nvidia/dra-driver/install/nvidia-dra-driver-install-ref.yaml

Repository: openshift/release

Length of output: 1648


🏁 Script executed:

# Search for ipi-aws-pre more broadly
find . -name "*ipi-aws-pre*" -o -name "*ipi*aws*pre*" | grep -i "ipi.*aws.*pre"

Repository: openshift/release

Length of output: 2649


🏁 Script executed:

# Try searching in openshift/installer step registry
find . -path "*/step-registry/openshift/installer/*aws*" -name "*pre*" -type f

Repository: openshift/release

Length of output: 43


🏁 Script executed:

# Read the nvidia-dra-partitionable workflow
cat -n ./ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.yaml

Repository: openshift/release

Length of output: 2574


🏁 Script executed:

# Read ipi-aws-pre chain to confirm it doesn't include DRA driver installation
cat -n ./ci-operator/step-registry/ipi/aws/pre/ipi-aws-pre-chain.yaml

Repository: openshift/release

Length of output: 667


Add nvidia-dra-driver-install to the AWS pre chain.

The AWS job is missing the NVIDIA DRA driver installation step that exists in the sibling Kubernetes workflow. Without it, the [Feature:NVIDIA-DRA] tests will fail after cluster bring-up because the required DeviceClass and ResourceSlice objects will not exist.

Add - ref: nvidia-dra-driver-install to the pre steps after nvidia-gpu-operator-install, matching the structure in the openshift-kubernetes-e2e-nvidia-dra-partitionable workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`
around lines 49 - 52, The AWS pre chain is missing the NVIDIA DRA driver
installation step; update the pre steps under the "pre" list to add a new entry
"- ref: nvidia-dra-driver-install" immediately after
"nvidia-gpu-operator-install" so the AWS workflow mirrors the
"openshift-kubernetes-e2e-nvidia-dra-partitionable" workflow and ensures the
DeviceClass/ResourceSlice objects required by [Feature:NVIDIA-DRA] tests are
created.

Comment on lines +15 to +19
readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
if [ "${GPU_NODES}" -eq 0 ]; then
echo "ERROR: No GPU nodes found with label ${GPU_LABEL}"
oc get nodes --show-labels
exit 1
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and examine the context around lines 15-19
fd -t f "openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh" --exec cat -n {} \; 2>/dev/null | head -30

Repository: openshift/release

Length of output: 1057


🏁 Script executed:

# Also search for the file more broadly
find . -name "*nvidia-dra-partitionable-commands.sh" -type f 2>/dev/null | head -5

Repository: openshift/release

Length of output: 200


🏁 Script executed:

# Test 1: Does readonly with command substitution preserve exit status?
bash -c '
set -o pipefail
set -o errexit
readonly TEST=$(false); 
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 95


🏁 Script executed:

# Test 2: Does separating assignment from readonly preserve exit status?
bash -c '
set -o pipefail
set -o errexit
TEST=$(false)
readonly TEST
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 73


🏁 Script executed:

# Test 3: What about when command substitution succeeds but underlying command fails?
bash -c '
set -o pipefail
set -o errexit
readonly RESULT=$(exit 1; echo "value")
echo "Result: $RESULT"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 82


🏁 Script executed:

# Test 4: With pipefail and readonly - does it catch pipeline failures?
bash -c '
set -o pipefail
set -o errexit
readonly COUNT=$(false | wc -l)
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 95


🏁 Script executed:

# Test 5: Check if 2>/dev/null masks the failure in command substitution
bash -c '
set -o pipefail
set -o errexit
readonly COUNT=$(false 2>/dev/null | wc -l)
echo "Result: $COUNT"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 83


Inline readonly assignment masks pipeline exit status; separate the assignment.

Line 15 uses readonly GPU_NODES=$(...), which prevents set -o errexit and set -o pipefail from catching failures in the oc command. If oc get nodes fails due to API/auth issues, the command substitution still returns 0 (from wc -l), and the script proceeds with a misleading "no GPU nodes" error instead of surfacing the real failure.

Suggested fix
-readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+readonly GPU_NODES
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+readonly GPU_NODES
 if [ "${GPU_NODES}" -eq 0 ]; then
   echo "ERROR: No GPU nodes found with label ${GPU_LABEL}"
   oc get nodes --show-labels
   exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 15-15: Declare and assign separately to avoid masking return values.

(SC2155)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`
around lines 15 - 19, The readonly assignment for GPU_NODES using command
substitution masks failures from the oc/get pipeline; change the code so you run
oc get nodes -l "${GPU_LABEL}" -o name first, check its exit status, capture its
output to a variable (e.g., nodes_out), then compute GPU_NODES from that output
(e.g., with wc -l) and finally mark GPU_NODES readonly; update the logic around
the oc get nodes call and the subsequent if that references GPU_NODES to fail
fast on oc errors rather than returning a misleading zero count (refer to
GPU_NODES and the oc get nodes -l "${GPU_LABEL}" -o name usage).

Comment on lines +31 to +36
# Check ResourceSlices
RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
echo "ResourceSlices: ${RESOURCE_SLICES}"
if [ "${RESOURCE_SLICES}" -eq 0 ]; then
echo "WARNING: No ResourceSlices found"
fi
Contributor


⚠️ Potential issue | 🟠 Major

Wait for ResourceSlices before starting e2e.

A zero count here usually means the DRA driver has not advertised allocatable resources yet. Continuing makes the job timing-dependent and pushes the real failure into test-kubernetes-e2e.sh.

Suggested fix
-RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
-echo "ResourceSlices: ${RESOURCE_SLICES}"
-if [ "${RESOURCE_SLICES}" -eq 0 ]; then
-  echo "WARNING: No ResourceSlices found"
-fi
+for _ in $(seq 1 30); do
+  RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
+  if [ "${RESOURCE_SLICES}" -gt 0 ]; then
+    break
+  fi
+  sleep 10
+done
+echo "ResourceSlices: ${RESOURCE_SLICES}"
+if [ "${RESOURCE_SLICES}" -eq 0 ]; then
+  echo "ERROR: No ResourceSlices found after waiting for the DRA driver to advertise resources"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`
around lines 31 - 36, The script currently reads RESOURCE_SLICES once and
proceeds even if zero; change it to poll until ResourceSlices exist (or a
configurable timeout) before continuing: replace the single read of
RESOURCE_SLICES with a loop that runs oc get resourceslice -o name repeatedly
(sleep between attempts) and exits successfully when RESOURCE_SLICES > 0 or
fails after the timeout, emitting a clear error; reference the RESOURCE_SLICES
variable and the oc get resourceslice command and ensure the failure path
returns non-zero so downstream test-kubernetes-e2e.sh won’t run when no slices
are advertised.

Comment on lines +23 to +24
- Verifies DRAPartitionableDevices feature gate is enabled
- Runs test-kubernetes-e2e.sh with specified focus pattern
Contributor


⚠️ Potential issue | 🟡 Minor

Clarify that feature-gate validation is conditional on test focus.

In this file, Lines 23-24 and Line 30 describe DRAPartitionableDevices gate verification as unconditional, but ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh (Lines 36-43) performs that check only when DRA_TEST_FOCUS includes DRAPartitionableDevices.

Also applies to: 30-30
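
A small sketch of the conditional check being described; the jsonpath and variable handling are assumptions rather than the script's exact code:

# Only verify the gate when the test focus actually exercises it.
if [[ "${DRA_TEST_FOCUS:-}" == *DRAPartitionableDevices* ]]; then
  enabled=$(oc get featuregate cluster \
    -o jsonpath='{.status.featureGates[0].enabled[*].name}')
  if [[ "${enabled}" != *DRAPartitionableDevices* ]]; then
    echo "ERROR: DRAPartitionableDevices feature gate is not enabled"
    exit 1
  fi
fi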

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml`
around lines 23 - 24, Update the YAML step description to state that the
DRAPartitionableDevices feature-gate check is conditional: change the
unconditional bullet to something like "Verifies DRAPartitionableDevices feature
gate is enabled when DRA_TEST_FOCUS includes 'DRAPartitionableDevices'"; ensure
any other sentence mentioning this check is similarly clarified. Reference the
check logic in openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh and
the DRA_TEST_FOCUS variable so readers understand the gate validation only runs
when that focus is present.

HarshwardhanPatil07 pushed a commit to HarshwardhanPatil07/release that referenced this pull request Apr 23, 2026
* Add reusable NVIDIA step-registry components for DRA testing

Based on openshift#74984.
Adds NFD operator, GPU operator, DRA driver install, and cleanup steps.

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-Authored-By: Sai Ramesh Vanka <svanka@redhat.com>

* Fix nvidia-dra-driver-install for OpenShift CI compatibility

* Add DRA e2e CI job for kueue-operator on AWS with T4 GPU

---------

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-authored-by: Sai Ramesh Vanka <svanka@redhat.com>
Co-authored-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@pbhojara-thinkpadp16vgen1.rht.csb>
@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Prucek pushed a commit to Prucek/release that referenced this pull request Apr 29, 2026
* Add reusable NVIDIA step-registry components for DRA testing

Based on openshift#74984.
Adds NFD operator, GPU operator, DRA driver install, and cleanup steps.

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-Authored-By: Sai Ramesh Vanka <svanka@redhat.com>

* Fix nvidia-dra-driver-install for OpenShift CI compatibility

* Add DRA e2e CI job for kueue-operator on AWS with T4 GPU

---------

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-authored-by: Sai Ramesh Vanka <svanka@redhat.com>
Co-authored-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@pbhojara-thinkpadp16vgen1.rht.csb>
@sairameshv
Member Author

Opened a PR to fix a few failures with the existing openshift/origin tests
openshift/origin#31104

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sairameshv
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

- Origin: Switch from GCP L4 (g2-standard-4) to AWS T4 (g4dn.xlarge)
  Follows kueue-operator pattern (PR openshift#77108), better availability
- Kubernetes: Fix cluster profile from 'gcp' to 'openshift-org-gcp'
  Keep zone as us-central1-a (H100/A3 available in zones a/b/c, not f)
- Add nvidia-dra-driver-install step for DRA driver installation
- Install helm in test container for prerequisite validation
- Disable monitor tests for focused NVIDIA DRA testing

Fixes rehearse failures:
1. Invalid cluster profile 'gcp' (should be 'openshift-org-gcp')
2. Test failures due to missing helm binary in tests container
3. Unrelated monitor test failures (94 cluster health checks)

Note: us-central1-f has A100 (a2) but not H100 (a3) instances

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
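
For the "install helm in test container" item above, a rough sketch of fetching a helm binary at runtime; the version and install path are placeholders:

#!/bin/bash
# Sketch only: fetch a helm binary when the test image does not ship one.
set -o errexit -o nounset -o pipefail

HELM_VERSION="v3.14.0"   # placeholder version
curl -fsSL "https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz" \
  | tar -xzf - -C /tmp linux-amd64/helm
export PATH="/tmp/linux-amd64:${PATH}"
helm version
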
@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@sairameshv: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name | Repo | Type | Reason
pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra | openshift/origin | presubmit | Presubmit changed
pull-ci-openshift-origin-main-nvidia-dra-images | openshift/origin | presubmit | Presubmit changed
periodic-ci-openshift-kueue-operator-main-test-e2e-dra-gpu-4-21 | N/A | periodic | Registry content changed
periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices | N/A | periodic | Periodic changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

@sairameshv: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/rehearse/openshift/kubernetes/main/gcp-nvidia-dra-partitionable-devices | dd181ab | link | unknown | /pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices
ci/rehearse/openshift/origin/main/nvidia-dra-e2e-gcp-nvidia-dra | b32d41a | link | unknown | /pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-gcp-nvidia-dra
ci/rehearse/openshift/origin/main/nvidia-dra-e2e-aws-nvidia-dra | 416fdbf | link | unknown | /pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra
ci/rehearse/periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices | d14e0b0 | link | unknown | /pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
