
OCPNODE-4043: Add CI jobs to validate DRA APIs on NVIDIA GPU #74984

Draft
sairameshv wants to merge 2 commits into openshift:main from sairameshv:nvidia-dra-ocp

Conversation

@sairameshv
Member

@sairameshv sairameshv commented Feb 17, 2026

Add CI infrastructure for DRA testing on NVIDIA GPUs

This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with
NVIDIA GPUs on GCP across two repositories:

openshift/kubernetes (DRAPartitionableDevices) - Periodic:
- Tests upstream Kubernetes DRA Partitionable Devices (MIG) functionality
- Uses H100 GPU (a3-highgpu-1g) for DynamicMIG support
- Runs weekly in us-central1-a

openshift/origin (Basic DRA) - Presubmit:
- Tests OpenShift DRA integration with NVIDIA GPUs
- Uses L4 GPU (g2-standard-4) for cost-effective testing
- Triggers on changes to test/extended/node/dra/nvidia or test/extended/dra/nvidia

Reusable step-registry components:
- nvidia-nfd-operator-install: Installs NFD operator via OLM
- nvidia-gpu-operator-install: Installs NVIDIA GPU operator with CDI
- nvidia-dra-driver-install: Installs NVIDIA DRA driver via Helm
- nvidia-cleanup: Cleans up NVIDIA components and resources
- openshift-kubernetes-e2e-nvidia-dra-partitionable: Workflow for DRA testing

Note: G2/A3 GPU-optimized machine types include GPUs automatically.
All tests use us-central1 to match CI quota allocation.

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
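
For illustration, a minimal sketch of what a Helm-based nvidia-dra-driver-install step can boil down to; the chart repository, chart name, namespace, and values here are assumptions for the sketch, not necessarily what the actual step uses:

#!/bin/bash
# Sketch only: Helm-based install of the NVIDIA DRA driver.
# Repo URL, chart name, namespace, and values are assumptions.
set -o errexit -o nounset -o pipefail

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Chart version pinning and any DynamicMIG-related values would go here;
# the exact keys depend on the chart and are not shown in this PR excerpt.
helm upgrade --install nvidia-dra-driver nvidia/nvidia-dra-driver-gpu \
  --namespace nvidia-dra-driver --create-namespace \
  --wait --timeout 10m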

Summary by CodeRabbit

  • New Features

    • Added NVIDIA DRA partitionable devices testing on GCP clusters with GPU support.
    • Added NVIDIA DRA end-to-end testing on AWS with GPU worker nodes.
    • Enhanced NVIDIA GPU operator installation with verification for driver readiness.
  • Chores

    • Added CI/periodic test configurations to validate NVIDIA DRA functionality across multiple cloud environments.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 17, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 17, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Details

In response to this:

  • Add CI infrastructure for testing DRA in OCP with an NVIDIA GPU
  • Step-registry workflow for GPU operator installation and DRA testing
  • AWS e2e job using GPU worker nodes (g4dn.xlarge)
  • Tests run conditionally on DRA test file changes in openshift/origin for now
  • The same infrastructure can be extended with more jobs for different GPU variants, based on the DRA features to test in the future, e.g. MIG-capable GPUs such as the H100 for the DRAPartitionableDevices feature

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 17, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Feb 17, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

sairameshv commented Feb 18, 2026

The rehearsal failure is expected here, as the tests PR is not yet merged and hence there are no tests to run:

Suite run returned error: no tests to run
error: error running a test suite: no tests to run
✗ FAILED (exit code: 1)

@openshift-ci openshift-ci Bot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 19, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 19, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Details

In response to this:

Add CI infrastructure for DRA testing on NVIDIA GPUs

This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with
NVIDIA GPUs on GCP:

  • Basic DRA testing on openshift/origin with T4 GPU (n1-standard-4)
  • Partitionable Devices testing on openshift/kubernetes with MIG i.e. A100 GPU (a2-highgpu-1g)
  • Enables only DRAPartitionableDevices feature gate via CustomNoUpgrade
  • New steps to install NFD and NVIDIA GPU operators
  • Adds NVIDIA DRA driver installation via Helm

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
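
As an aside on the CustomNoUpgrade bullet above, a rough sketch of how a CI step can enable a single feature gate on an OpenShift cluster; whether the job here does it exactly this way is not shown in this excerpt:

#!/bin/bash
# Sketch only: enable the DRAPartitionableDevices gate via the CustomNoUpgrade
# feature set (which marks the cluster as not upgradable).
set -o errexit -o nounset -o pipefail

oc patch featuregate cluster --type merge -p '{
  "spec": {
    "featureSet": "CustomNoUpgrade",
    "customNoUpgrade": {"enabled": ["DRAPartitionableDevices"]}
  }
}'

# The change rolls out to the control plane and kubelets; let nodes settle.
oc wait --for=condition=Ready node --all --timeout=30m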

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sairameshv
Member Author

/pj-rehearse

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-release-4.22-gcp-nvidia-dra-e2e-gcp-nvidia-dra-partitionable-devices

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-release-4.22-gcp-nvidia-dra-partitionable-devices

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv sairameshv force-pushed the nvidia-dra-ocp branch 3 times, most recently from f253c59 to dd181ab on March 19, 2026 at 16:37
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Details

In response to this:

Add CI infrastructure for DRA testing on NVIDIA GPUs

This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with
NVIDIA GPUs on GCP across two repositories:

openshift/kubernetes (DRAPartitionableDevices):

  • Tests upstream Kubernetes DRA Partitionable Devices (MIG) functionality
  • Uses H100 GPU (a3-highgpu-1g) with DynamicMIG support
  • Enables DRAPartitionableDevices feature gate via CustomNoUpgrade
  • Triggers on changes to test/e2e/dra/*partitionable or scheduler DRA code
  • Installs NVIDIA DRA driver v25.12.0 with DynamicMIG=true

openshift/origin (Basic DRA):

  • Tests OpenShift DRA integration with NVIDIA GPUs
  • Uses T4 GPU (n1-standard-4) for basic DRA testing
  • Triggers on changes to test/extended/node/dra/nvidia or test/extended/dra/nvidia

Reusable step-registry components:

  • nvidia-nfd-operator-install: Installs NFD operator via OLM
  • nvidia-gpu-operator-install: Installs NVIDIA GPU operator with CDI
  • nvidia-dra-driver-install: Installs NVIDIA DRA driver via Helm
  • nvidia-cleanup: Cleans up NVIDIA components and resources
  • openshift-kubernetes-e2e-nvidia-dra-partitionable: Workflow and test step
    for DRA Partitionable Devices testing (consolidated in single directory)

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
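
To make the workflow/test-step bullet above concrete, a sketch of the kind of prerequisite checks such a step can run before invoking the e2e suite; the node label and the expectation that the driver publishes DeviceClasses/ResourceSlices are assumptions based on the GPU operator and the Kubernetes DRA API:

#!/bin/bash
# Sketch only: prerequisite checks before running DRA partitionable-devices e2e.
set -o errexit -o nounset -o pipefail

# MIG-capable GPU nodes (label assumed to be applied by GPU Feature Discovery).
oc get nodes -l nvidia.com/mig.capable=true -o name

# DRA API objects the NVIDIA DRA driver is expected to publish.
oc get deviceclass
oc get resourceslice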

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-ci-robot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh (1)

324-331: Use full DaemonSet selector instead of hardcoding app for diagnostics.

Line 324 assumes .spec.selector.matchLabels.app; if that key changes, timeout debugging will miss driver pods/logs even when the DaemonSet exists.

Suggested adjustment
-    DRIVER_SELECTOR=$(oc get "${DRIVER_DS}" -n nvidia-gpu-operator -o jsonpath='{.spec.selector.matchLabels.app}' 2>/dev/null || echo "")
+    DRIVER_SELECTOR=$(oc get "${DRIVER_DS}" -n nvidia-gpu-operator -o json 2>/dev/null | jq -r '.spec.selector.matchLabels | to_entries | map("\(.key)=\(.value)") | join(",")')
     if [ -n "${DRIVER_SELECTOR}" ]; then
-      oc get pods -n nvidia-gpu-operator -l "app=${DRIVER_SELECTOR}" -o wide 2>/dev/null || echo "  No driver pods found"
+      oc get pods -n nvidia-gpu-operator -l "${DRIVER_SELECTOR}" -o wide 2>/dev/null || echo "  No driver pods found"
       echo ""
       echo "Driver pod logs (if available):"
-      DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l "app=${DRIVER_SELECTOR}" -o name 2>/dev/null | head -1 || echo "")
+      DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l "${DRIVER_SELECTOR}" -o name 2>/dev/null | head -1 || echo "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh`
around lines 324 - 331, The current diagnostic extracts only
.spec.selector.matchLabels.app into DRIVER_SELECTOR which breaks if the
DaemonSet uses different label keys; update the logic that sets DRIVER_SELECTOR
(and subsequent DRIVER_POD usage) to build the full label selector from the
DaemonSet's .spec.selector.matchLabels (e.g., read the map and join keys/values
into a comma-separated selector like "key1=val1,key2=val2") using oc/jsonpath or
jq, then pass that full selector to oc get pods -l and oc logs; keep the
existing fallbacks (empty string) and error messages but replace the hardcoded
"app=${DRIVER_SELECTOR}" usage with the constructed selector variable so
DRIVER_DS/DRIVER_SELECTOR/DRIVER_POD logic works for any matchLabels shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`:
- Around line 16-19: The curl invocation in the commands block currently can
write HTTP error pages into the repo file; update the curl command (the line
starting with "curl http://base-openstack-4-21.ocp.svc >
/etc/yum.repos.art/ci/base-openstack-4-21.repo") to use curl's fail mode (e.g.,
add --fail or -f) and write to the target with curl's -o/-O rather than shell
redirection so that curl returns a non-zero exit on HTTP errors and the step
fails immediately.
- Around line 49-52: The AWS pre chain is missing the NVIDIA DRA driver
installation step; update the pre steps under the "pre" list to add a new entry
"- ref: nvidia-dra-driver-install" immediately after
"nvidia-gpu-operator-install" so the AWS workflow mirrors the
"openshift-kubernetes-e2e-nvidia-dra-partitionable" workflow and ensures the
DeviceClass/ResourceSlice objects required by [Feature:NVIDIA-DRA] tests are
created.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`:
- Around line 31-36: The script currently reads RESOURCE_SLICES once and
proceeds even if zero; change it to poll until ResourceSlices exist (or a
configurable timeout) before continuing: replace the single read of
RESOURCE_SLICES with a loop that runs oc get resourceslice -o name repeatedly
(sleep between attempts) and exits successfully when RESOURCE_SLICES > 0 or
fails after the timeout, emitting a clear error; reference the RESOURCE_SLICES
variable and the oc get resourceslice command and ensure the failure path
returns non-zero so downstream test-kubernetes-e2e.sh won’t run when no slices
are advertised.
- Around line 15-19: The readonly assignment for GPU_NODES using command
substitution masks failures from the oc/get pipeline; change the code so you run
oc get nodes -l "${GPU_LABEL}" -o name first, check its exit status, capture its
output to a variable (e.g., nodes_out), then compute GPU_NODES from that output
(e.g., with wc -l) and finally mark GPU_NODES readonly; update the logic around
the oc get nodes call and the subsequent if that references GPU_NODES to fail
fast on oc errors rather than returning a misleading zero count (refer to
GPU_NODES and the oc get nodes -l "${GPU_LABEL}" -o name usage).

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml`:
- Around line 23-24: Update the YAML step description to state that the
DRAPartitionableDevices feature-gate check is conditional: change the
unconditional bullet to something like "Verifies DRAPartitionableDevices feature
gate is enabled when DRA_TEST_FOCUS includes 'DRAPartitionableDevices'"; ensure
any other sentence mentioning this check is similarly clarified. Reference the
check logic in openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh and
the DRA_TEST_FOCUS variable so readers understand the gate validation only runs
when that focus is present.

---

Nitpick comments:
In
`@ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh`:
- Around line 324-331: The current diagnostic extracts only
.spec.selector.matchLabels.app into DRIVER_SELECTOR which breaks if the
DaemonSet uses different label keys; update the logic that sets DRIVER_SELECTOR
(and subsequent DRIVER_POD usage) to build the full label selector from the
DaemonSet's .spec.selector.matchLabels (e.g., read the map and join keys/values
into a comma-separated selector like "key1=val1,key2=val2") using oc/jsonpath or
jq, then pass that full selector to oc get pods -l and oc logs; keep the
existing fallbacks (empty string) and error messages but replace the hardcoded
"app=${DRIVER_SELECTOR}" usage with the constructed selector variable so
DRIVER_DS/DRIVER_SELECTOR/DRIVER_POD logic works for any matchLabels shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8f85fd8b-0c12-47fb-80fc-7246a8443f2c

📥 Commits

Reviewing files that changed from the base of the PR and between 78a388c and 6466435.

📒 Files selected for processing (11)
  • ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml
  • ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml
  • ci-operator/jobs/openshift/kubernetes/openshift-kubernetes-main-periodics.yaml
  • ci-operator/jobs/openshift/origin/openshift-origin-main-presubmits.yaml
  • ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/OWNERS
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.metadata.json
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.metadata.json
  • ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.yaml

Comment on lines +16 to +19
commands: |
curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
from: tools
to: tools-openstack
Contributor


⚠️ Potential issue | 🟠 Major

Make the repo download fail on HTTP errors.

curl without --fail can write an error page into the repo file and let the step continue. That turns a fetch problem into a much harder-to-diagnose package install failure later.

Suggested fix
-      curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
+      curl --fail --silent --show-error --retry 5 \
+        http://base-openstack-4-21.ocp.svc \
+        -o /etc/yum.repos.art/ci/base-openstack-4-21.repo
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-commands: |
-  curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
-from: tools
-to: tools-openstack
+commands: |
+  curl --fail --silent --show-error --retry 5 \
+    http://base-openstack-4-21.ocp.svc \
+    -o /etc/yum.repos.art/ci/base-openstack-4-21.repo
+from: tools
+to: tools-openstack
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`
around lines 16 - 19, The curl invocation in the commands block currently can
write HTTP error pages into the repo file; update the curl command (the line
starting with "curl http://base-openstack-4-21.ocp.svc >
/etc/yum.repos.art/ci/base-openstack-4-21.repo") to use curl's fail mode (e.g.,
add --fail or -f) and write to the target with curl's -o/-O rather than shell
redirection so that curl returns a non-zero exit on HTTP errors and the step
fails immediately.

Comment on lines +49 to +52
pre:
- chain: ipi-aws-pre
- ref: nvidia-nfd-operator-install
- ref: nvidia-gpu-operator-install
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate and read the target file
find . -name "*nvidia-dra*" -type f | head -20

Repository: openshift/release

Length of output: 1221


🏁 Script executed:

# Check the specific file and lines mentioned
find . -path "*/ci-operator/config/openshift/origin/*nvidia-dra*" -type f

Repository: openshift/release

Length of output: 137


🏁 Script executed:

# Get the file structure around the repository
ls -la ci-operator/config/openshift/origin/ 2>/dev/null | grep -i nvidia

Repository: openshift/release

Length of output: 151


🏁 Script executed:

# Read the target file with line numbers
cat -n ./ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml

Repository: openshift/release

Length of output: 3028


🏁 Script executed:

# Also read the sibling Kubernetes flow mentioned
cat -n ./ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml

Repository: openshift/release

Length of output: 1369


🏁 Script executed:

# Find and read the ipi-aws-pre chain definition
find . -path "*/step-registry/*ipi*aws*pre*" -type f | head -5

Repository: openshift/release

Length of output: 1011


🏁 Script executed:

# Find the nvidia-dra-partitionable workflow
find . -path "*/nvidia-dra-partitionable*workflow*" -type f

Repository: openshift/release

Length of output: 353


🏁 Script executed:

# Read the nvidia-dra-driver-install step definition
cat -n ./ci-operator/step-registry/nvidia/dra-driver/install/nvidia-dra-driver-install-ref.yaml

Repository: openshift/release

Length of output: 1648


🏁 Script executed:

# Search for ipi-aws-pre more broadly
find . -name "*ipi-aws-pre*" -o -name "*ipi*aws*pre*" | grep -i "ipi.*aws.*pre"

Repository: openshift/release

Length of output: 2649


🏁 Script executed:

# Try searching in openshift/installer step registry
find . -path "*/step-registry/openshift/installer/*aws*" -name "*pre*" -type f

Repository: openshift/release

Length of output: 43


🏁 Script executed:

# Read the nvidia-dra-partitionable workflow
cat -n ./ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.yaml

Repository: openshift/release

Length of output: 2574


🏁 Script executed:

# Read ipi-aws-pre chain to confirm it doesn't include DRA driver installation
cat -n ./ci-operator/step-registry/ipi/aws/pre/ipi-aws-pre-chain.yaml

Repository: openshift/release

Length of output: 667


Add nvidia-dra-driver-install to the AWS pre chain.

The AWS job is missing the NVIDIA DRA driver installation step that exists in the sibling Kubernetes workflow. Without it, the [Feature:NVIDIA-DRA] tests will fail after cluster bring-up because the required DeviceClass and ResourceSlice objects will not exist.

Add - ref: nvidia-dra-driver-install to the pre steps after nvidia-gpu-operator-install, matching the structure in the openshift-kubernetes-e2e-nvidia-dra-partitionable workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`
around lines 49 - 52, The AWS pre chain is missing the NVIDIA DRA driver
installation step; update the pre steps under the "pre" list to add a new entry
"- ref: nvidia-dra-driver-install" immediately after
"nvidia-gpu-operator-install" so the AWS workflow mirrors the
"openshift-kubernetes-e2e-nvidia-dra-partitionable" workflow and ensures the
DeviceClass/ResourceSlice objects required by [Feature:NVIDIA-DRA] tests are
created.

Comment on lines +15 to +19
readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
if [ "${GPU_NODES}" -eq 0 ]; then
echo "ERROR: No GPU nodes found with label ${GPU_LABEL}"
oc get nodes --show-labels
exit 1
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and examine the context around lines 15-19
fd -t f "openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh" --exec cat -n {} \; 2>/dev/null | head -30

Repository: openshift/release

Length of output: 1057


🏁 Script executed:

# Also search for the file more broadly
find . -name "*nvidia-dra-partitionable-commands.sh" -type f 2>/dev/null | head -5

Repository: openshift/release

Length of output: 200


🏁 Script executed:

# Test 1: Does readonly with command substitution preserve exit status?
bash -c '
set -o pipefail
set -o errexit
readonly TEST=$(false); 
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 95


🏁 Script executed:

# Test 2: Does separating assignment from readonly preserve exit status?
bash -c '
set -o pipefail
set -o errexit
TEST=$(false)
readonly TEST
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 73


🏁 Script executed:

# Test 3: What about when command substitution succeeds but underlying command fails?
bash -c '
set -o pipefail
set -o errexit
readonly RESULT=$(exit 1; echo "value")
echo "Result: $RESULT"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 82


🏁 Script executed:

# Test 4: With pipefail and readonly - does it catch pipeline failures?
bash -c '
set -o pipefail
set -o errexit
readonly COUNT=$(false | wc -l)
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 95


🏁 Script executed:

# Test 5: Check if 2>/dev/null masks the failure in command substitution
bash -c '
set -o pipefail
set -o errexit
readonly COUNT=$(false 2>/dev/null | wc -l)
echo "Result: $COUNT"
' 2>&1; echo "Exit code: $?"

Repository: openshift/release

Length of output: 83


Inline readonly assignment masks pipeline exit status; separate the assignment.

Line 15 uses readonly GPU_NODES=$(...), which prevents set -o errexit and set -o pipefail from catching failures in the oc command. If oc get nodes fails due to API/auth issues, the command substitution still returns 0 (from wc -l), and the script proceeds with a misleading "no GPU nodes" error instead of surfacing the real failure.

Suggested fix
-readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+readonly GPU_NODES
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+readonly GPU_NODES
 if [ "${GPU_NODES}" -eq 0 ]; then
   echo "ERROR: No GPU nodes found with label ${GPU_LABEL}"
   oc get nodes --show-labels
   exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 15-15: Declare and assign separately to avoid masking return values.

(SC2155)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`
around lines 15 - 19, The readonly assignment for GPU_NODES using command
substitution masks failures from the oc/get pipeline; change the code so you run
oc get nodes -l "${GPU_LABEL}" -o name first, check its exit status, capture its
output to a variable (e.g., nodes_out), then compute GPU_NODES from that output
(e.g., with wc -l) and finally mark GPU_NODES readonly; update the logic around
the oc get nodes call and the subsequent if that references GPU_NODES to fail
fast on oc errors rather than returning a misleading zero count (refer to
GPU_NODES and the oc get nodes -l "${GPU_LABEL}" -o name usage).

Comment on lines +31 to +36
# Check ResourceSlices
RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
echo "ResourceSlices: ${RESOURCE_SLICES}"
if [ "${RESOURCE_SLICES}" -eq 0 ]; then
echo "WARNING: No ResourceSlices found"
fi
Contributor


⚠️ Potential issue | 🟠 Major

Wait for ResourceSlices before starting e2e.

A zero count here usually means the DRA driver has not advertised allocatable resources yet. Continuing makes the job timing-dependent and pushes the real failure into test-kubernetes-e2e.sh.

Suggested fix
-RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
-echo "ResourceSlices: ${RESOURCE_SLICES}"
-if [ "${RESOURCE_SLICES}" -eq 0 ]; then
-  echo "WARNING: No ResourceSlices found"
-fi
+for _ in $(seq 1 30); do
+  RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
+  if [ "${RESOURCE_SLICES}" -gt 0 ]; then
+    break
+  fi
+  sleep 10
+done
+echo "ResourceSlices: ${RESOURCE_SLICES}"
+if [ "${RESOURCE_SLICES}" -eq 0 ]; then
+  echo "ERROR: No ResourceSlices found after waiting for the DRA driver to advertise resources"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`
around lines 31 - 36, The script currently reads RESOURCE_SLICES once and
proceeds even if zero; change it to poll until ResourceSlices exist (or a
configurable timeout) before continuing: replace the single read of
RESOURCE_SLICES with a loop that runs oc get resourceslice -o name repeatedly
(sleep between attempts) and exits successfully when RESOURCE_SLICES > 0 or
fails after the timeout, emitting a clear error; reference the RESOURCE_SLICES
variable and the oc get resourceslice command and ensure the failure path
returns non-zero so downstream test-kubernetes-e2e.sh won’t run when no slices
are advertised.

Comment on lines +23 to +24
- Verifies DRAPartitionableDevices feature gate is enabled
- Runs test-kubernetes-e2e.sh with specified focus pattern
Contributor


⚠️ Potential issue | 🟡 Minor

Clarify that feature-gate validation is conditional on test focus.

In this file, Lines 23-24 and Line 30 describe DRAPartitionableDevices gate verification as unconditional, but ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh (Lines 36-43) performs that check only when DRA_TEST_FOCUS includes DRAPartitionableDevices.

Also applies to: 30-30
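
A small sketch of the conditional check being described; the jsonpath and variable handling are assumptions rather than the script's exact code:

# Only verify the gate when the test focus actually exercises it.
if [[ "${DRA_TEST_FOCUS:-}" == *DRAPartitionableDevices* ]]; then
  enabled=$(oc get featuregate cluster \
    -o jsonpath='{.status.featureGates[0].enabled[*].name}')
  if [[ "${enabled}" != *DRAPartitionableDevices* ]]; then
    echo "ERROR: DRAPartitionableDevices feature gate is not enabled"
    exit 1
  fi
fi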

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml`
around lines 23 - 24, Update the YAML step description to state that the
DRAPartitionableDevices feature-gate check is conditional: change the
unconditional bullet to something like "Verifies DRAPartitionableDevices feature
gate is enabled when DRA_TEST_FOCUS includes 'DRAPartitionableDevices'"; ensure
any other sentence mentioning this check is similarly clarified. Reference the
check logic in openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh and
the DRA_TEST_FOCUS variable so readers understand the gate validation only runs
when that focus is present.

HarshwardhanPatil07 pushed a commit to HarshwardhanPatil07/release that referenced this pull request Apr 23, 2026
* Add reusable NVIDIA step-registry components for DRA testing

Based on openshift#74984.
Adds NFD operator, GPU operator, DRA driver install, and cleanup steps.

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-Authored-By: Sai Ramesh Vanka <svanka@redhat.com>

* Fix nvidia-dra-driver-install for OpenShift CI compatibility

* Add DRA e2e CI job for kueue-operator on AWS with T4 GPU

---------

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-authored-by: Sai Ramesh Vanka <svanka@redhat.com>
Co-authored-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@pbhojara-thinkpadp16vgen1.rht.csb>
@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Prucek pushed a commit to Prucek/release that referenced this pull request Apr 29, 2026
* Add reusable NVIDIA step-registry components for DRA testing

Based on openshift#74984.
Adds NFD operator, GPU operator, DRA driver install, and cleanup steps.

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-Authored-By: Sai Ramesh Vanka <svanka@redhat.com>

* Fix nvidia-dra-driver-install for OpenShift CI compatibility

* Add DRA e2e CI job for kueue-operator on AWS with T4 GPU

---------

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-authored-by: Sai Ramesh Vanka <svanka@redhat.com>
Co-authored-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@pbhojara-thinkpadp16vgen1.rht.csb>
@sairameshv
Member Author

Opened a PR to fix a few failures with the existing openshift/origin tests
openshift/origin#31104

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sairameshv
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

- Origin: Switch from GCP L4 (g2-standard-4) to AWS T4 (g4dn.xlarge)
  Follows kueue-operator pattern (PR openshift#77108), better availability
- Kubernetes: Fix cluster profile from 'gcp' to 'openshift-org-gcp'
  Keep zone as us-central1-a (H100/A3 available in zones a/b/c, not f)
- Add nvidia-dra-driver-install step for DRA driver installation
- Install helm in test container for prerequisite validation
- Disable monitor tests for focused NVIDIA DRA testing

Fixes rehearse failures:
1. Invalid cluster profile 'gcp' (should be 'openshift-org-gcp')
2. Test failures due to missing helm binary in tests container
3. Unrelated monitor test failures (94 cluster health checks)

Note: us-central1-f has A100 (a2) but not H100 (a3) instances

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
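
For the "install helm in test container" item above, a rough sketch of fetching a helm binary at runtime; the version and install path are placeholders:

#!/bin/bash
# Sketch only: fetch a helm binary when the test image does not ship one.
set -o errexit -o nounset -o pipefail

HELM_VERSION="v3.14.0"   # placeholder version
curl -fsSL "https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz" \
  | tar -xzf - -C /tmp linux-amd64/helm
export PATH="/tmp/linux-amd64:${PATH}"
helm version
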
@sairameshv
Member Author

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@openshift-merge-bot
Contributor

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot
Contributor

[REHEARSALNOTIFIER]
@sairameshv: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name | Repo | Type | Reason
pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra | openshift/origin | presubmit | Presubmit changed
pull-ci-openshift-origin-main-nvidia-dra-images | openshift/origin | presubmit | Presubmit changed
periodic-ci-openshift-kueue-operator-main-test-e2e-dra-gpu-4-21 | N/A | periodic | Registry content changed
periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices | N/A | periodic | Periodic changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci
Contributor

openshift-ci Bot commented May 5, 2026

@sairameshv: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/rehearse/openshift/kubernetes/main/gcp-nvidia-dra-partitionable-devices | dd181ab | link | unknown | /pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices
ci/rehearse/openshift/origin/main/nvidia-dra-e2e-gcp-nvidia-dra | b32d41a | link | unknown | /pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-gcp-nvidia-dra
ci/rehearse/openshift/origin/main/nvidia-dra-e2e-aws-nvidia-dra | 416fdbf | link | unknown | /pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra
ci/rehearse/periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices | d14e0b0 | link | unknown | /pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
