OCPNODE-4043: Add CI jobs to validate DRA APIs on NVIDIA GPU #74984
sairameshv wants to merge 2 commits into openshift:main
Conversation
@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Skipping CI for Draft Pull Request.

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The rehearsal failure is expected here: the tests PR is not yet merged, so there are no tests to run.
Force-pushed from f1eac4d to 048200a.
Force-pushed from 048200a to 3cfed85.
@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/pj-rehearse

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse pull-ci-openshift-kubernetes-release-4.22-gcp-nvidia-dra-e2e-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Force-pushed from 3cfed85 to 1acbb70.

/pj-rehearse pull-ci-openshift-kubernetes-release-4.22-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Force-pushed from f253c59 to dd181ab.

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse pull-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices
Actionable comments posted: 5
🧹 Nitpick comments (1)
ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh (1)
324-331: Use full DaemonSet selector instead of hardcoding `app` for diagnostics.

Line 324 assumes `.spec.selector.matchLabels.app`; if that key changes, timeout debugging will miss driver pods/logs even when the DaemonSet exists.

Suggested adjustment
- DRIVER_SELECTOR=$(oc get "${DRIVER_DS}" -n nvidia-gpu-operator -o jsonpath='{.spec.selector.matchLabels.app}' 2>/dev/null || echo "")
+ DRIVER_SELECTOR=$(oc get "${DRIVER_DS}" -n nvidia-gpu-operator -o json 2>/dev/null | jq -r '.spec.selector.matchLabels | to_entries | map("\(.key)=\(.value)") | join(",")')
  if [ -n "${DRIVER_SELECTOR}" ]; then
-   oc get pods -n nvidia-gpu-operator -l "app=${DRIVER_SELECTOR}" -o wide 2>/dev/null || echo "  No driver pods found"
+   oc get pods -n nvidia-gpu-operator -l "${DRIVER_SELECTOR}" -o wide 2>/dev/null || echo "  No driver pods found"
    echo ""
    echo "Driver pod logs (if available):"
-   DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l "app=${DRIVER_SELECTOR}" -o name 2>/dev/null | head -1 || echo "")
+   DRIVER_POD=$(oc get pods -n nvidia-gpu-operator -l "${DRIVER_SELECTOR}" -o name 2>/dev/null | head -1 || echo "")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh` around lines 324 - 331, The current diagnostic extracts only .spec.selector.matchLabels.app into DRIVER_SELECTOR which breaks if the DaemonSet uses different label keys; update the logic that sets DRIVER_SELECTOR (and subsequent DRIVER_POD usage) to build the full label selector from the DaemonSet's .spec.selector.matchLabels (e.g., read the map and join keys/values into a comma-separated selector like "key1=val1,key2=val2") using oc/jsonpath or jq, then pass that full selector to oc get pods -l and oc logs; keep the existing fallbacks (empty string) and error messages but replace the hardcoded "app=${DRIVER_SELECTOR}" usage with the constructed selector variable so DRIVER_DS/DRIVER_SELECTOR/DRIVER_POD logic works for any matchLabels shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`:
- Around line 16-19: The curl invocation in the commands block currently can
write HTTP error pages into the repo file; update the curl command (the line
starting with "curl http://base-openstack-4-21.ocp.svc >
/etc/yum.repos.art/ci/base-openstack-4-21.repo") to use curl's fail mode (e.g.,
add --fail or -f) and write to the target with curl's -o/-O rather than shell
redirection so that curl returns a non-zero exit on HTTP errors and the step
fails immediately.
- Around line 49-52: The AWS pre chain is missing the NVIDIA DRA driver
installation step; update the pre steps under the "pre" list to add a new entry
"- ref: nvidia-dra-driver-install" immediately after
"nvidia-gpu-operator-install" so the AWS workflow mirrors the
"openshift-kubernetes-e2e-nvidia-dra-partitionable" workflow and ensures the
DeviceClass/ResourceSlice objects required by [Feature:NVIDIA-DRA] tests are
created.
In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`:
- Around line 31-36: The script currently reads RESOURCE_SLICES once and
proceeds even if zero; change it to poll until ResourceSlices exist (or a
configurable timeout) before continuing: replace the single read of
RESOURCE_SLICES with a loop that runs oc get resourceslice -o name repeatedly
(sleep between attempts) and exits successfully when RESOURCE_SLICES > 0 or
fails after the timeout, emitting a clear error; reference the RESOURCE_SLICES
variable and the oc get resourceslice command and ensure the failure path
returns non-zero so downstream test-kubernetes-e2e.sh won’t run when no slices
are advertised.
- Around line 15-19: The readonly assignment for GPU_NODES using command
substitution masks failures from the oc/get pipeline; change the code so you run
oc get nodes -l "${GPU_LABEL}" -o name first, check its exit status, capture its
output to a variable (e.g., nodes_out), then compute GPU_NODES from that output
(e.g., with wc -l) and finally mark GPU_NODES readonly; update the logic around
the oc get nodes call and the subsequent if that references GPU_NODES to fail
fast on oc errors rather than returning a misleading zero count (refer to
GPU_NODES and the oc get nodes -l "${GPU_LABEL}" -o name usage).
In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml`:
- Around line 23-24: Update the YAML step description to state that the
DRAPartitionableDevices feature-gate check is conditional: change the
unconditional bullet to something like "Verifies DRAPartitionableDevices feature
gate is enabled when DRA_TEST_FOCUS includes 'DRAPartitionableDevices'"; ensure
any other sentence mentioning this check is similarly clarified. Reference the
check logic in openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh and
the DRA_TEST_FOCUS variable so readers understand the gate validation only runs
when that focus is present.
---
Nitpick comments:
In
`@ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh`:
- Around line 324-331: The current diagnostic extracts only
.spec.selector.matchLabels.app into DRIVER_SELECTOR which breaks if the
DaemonSet uses different label keys; update the logic that sets DRIVER_SELECTOR
(and subsequent DRIVER_POD usage) to build the full label selector from the
DaemonSet's .spec.selector.matchLabels (e.g., read the map and join keys/values
into a comma-separated selector like "key1=val1,key2=val2") using oc/jsonpath or
jq, then pass that full selector to oc get pods -l and oc logs; keep the
existing fallbacks (empty string) and error messages but replace the hardcoded
"app=${DRIVER_SELECTOR}" usage with the constructed selector variable so
DRIVER_DS/DRIVER_SELECTOR/DRIVER_POD logic works for any matchLabels shape.
📒 Files selected for processing (11)
- ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml
- ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml
- ci-operator/jobs/openshift/kubernetes/openshift-kubernetes-main-periodics.yaml
- ci-operator/jobs/openshift/origin/openshift-origin-main-presubmits.yaml
- ci-operator/step-registry/nvidia/gpu-operator/install/nvidia-gpu-operator-install-commands.sh
- ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/OWNERS
- ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh
- ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.metadata.json
- ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml
- ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.metadata.json
- ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.yaml
commands: |
  curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
from: tools
to: tools-openstack
Make the repo download fail on HTTP errors.
curl without --fail can write an error page into the repo file and let the step continue. That turns a fetch problem into a much harder-to-diagnose package install failure later.
Suggested fix
- curl http://base-openstack-4-21.ocp.svc > /etc/yum.repos.art/ci/base-openstack-4-21.repo
+ curl --fail --silent --show-error --retry 5 \
+ http://base-openstack-4-21.ocp.svc \
+ -o /etc/yum.repos.art/ci/base-openstack-4-21.repo
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`
around lines 16 - 19, The curl invocation in the commands block currently can
write HTTP error pages into the repo file; update the curl command (the line
starting with "curl http://base-openstack-4-21.ocp.svc >
/etc/yum.repos.art/ci/base-openstack-4-21.repo") to use curl's fail mode (e.g.,
add --fail or -f) and write to the target with curl's -o/-O rather than shell
redirection so that curl returns a non-zero exit on HTTP errors and the step
fails immediately.
pre:
- chain: ipi-aws-pre
- ref: nvidia-nfd-operator-install
- ref: nvidia-gpu-operator-install
🧩 Analysis chain

🏁 Script executed:
# First, locate and read the target file
find . -name "*nvidia-dra*" -type f | head -20
Repository: openshift/release
Length of output: 1221

🏁 Script executed:
# Check the specific file and lines mentioned
find . -path "*/ci-operator/config/openshift/origin/*nvidia-dra*" -type f
Repository: openshift/release
Length of output: 137

🏁 Script executed:
# Get the file structure around the repository
ls -la ci-operator/config/openshift/origin/ 2>/dev/null | grep -i nvidia
Repository: openshift/release
Length of output: 151

🏁 Script executed:
# Read the target file with line numbers
cat -n ./ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml
Repository: openshift/release
Length of output: 3028

🏁 Script executed:
# Also read the sibling Kubernetes flow mentioned
cat -n ./ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml
Repository: openshift/release
Length of output: 1369

🏁 Script executed:
# Find and read the ipi-aws-pre chain definition
find . -path "*/step-registry/*ipi*aws*pre*" -type f | head -5
Repository: openshift/release
Length of output: 1011

🏁 Script executed:
# Find the nvidia-dra-partitionable workflow
find . -path "*/nvidia-dra-partitionable*workflow*" -type f
Repository: openshift/release
Length of output: 353

🏁 Script executed:
# Read the nvidia-dra-driver-install step definition
cat -n ./ci-operator/step-registry/nvidia/dra-driver/install/nvidia-dra-driver-install-ref.yaml
Repository: openshift/release
Length of output: 1648

🏁 Script executed:
# Search for ipi-aws-pre more broadly
find . -name "*ipi-aws-pre*" -o -name "*ipi*aws*pre*" | grep -i "ipi.*aws.*pre"
Repository: openshift/release
Length of output: 2649

🏁 Script executed:
# Try searching in openshift/installer step registry
find . -path "*/step-registry/openshift/installer/*aws*" -name "*pre*" -type f
Repository: openshift/release
Length of output: 43

🏁 Script executed:
# Read the nvidia-dra-partitionable workflow
cat -n ./ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-workflow.yaml
Repository: openshift/release
Length of output: 2574

🏁 Script executed:
# Read ipi-aws-pre chain to confirm it doesn't include DRA driver installation
cat -n ./ci-operator/step-registry/ipi/aws/pre/ipi-aws-pre-chain.yaml
Repository: openshift/release
Length of output: 667
Add nvidia-dra-driver-install to the AWS pre chain.
The AWS job is missing the NVIDIA DRA driver installation step that exists in the sibling Kubernetes workflow. Without it, the [Feature:NVIDIA-DRA] tests will fail after cluster bring-up because the required DeviceClass and ResourceSlice objects will not exist.
Add - ref: nvidia-dra-driver-install to the pre steps after nvidia-gpu-operator-install, matching the structure in the openshift-kubernetes-e2e-nvidia-dra-partitionable workflow.
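For illustration, the adjusted pre chain would look roughly like this (a minimal sketch assuming the quoted context above is current; only the last ref is new, and the trailing comment is mine):

pre:
- chain: ipi-aws-pre
- ref: nvidia-nfd-operator-install
- ref: nvidia-gpu-operator-install
- ref: nvidia-dra-driver-install  # new: installs the DRA driver so DeviceClass/ResourceSlice objects exist before e2e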
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ci-operator/config/openshift/origin/openshift-origin-main__nvidia-dra.yaml`
around lines 49 - 52, The AWS pre chain is missing the NVIDIA DRA driver
installation step; update the pre steps under the "pre" list to add a new entry
"- ref: nvidia-dra-driver-install" immediately after
"nvidia-gpu-operator-install" so the AWS workflow mirrors the
"openshift-kubernetes-e2e-nvidia-dra-partitionable" workflow and ensures the
DeviceClass/ResourceSlice objects required by [Feature:NVIDIA-DRA] tests are
created.
readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
if [ "${GPU_NODES}" -eq 0 ]; then
  echo "ERROR: No GPU nodes found with label ${GPU_LABEL}"
  oc get nodes --show-labels
  exit 1
🧩 Analysis chain

🏁 Script executed:
# Check if the file exists and examine the context around lines 15-19
fd -t f "openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh" --exec cat -n {} \; 2>/dev/null | head -30
Repository: openshift/release
Length of output: 1057

🏁 Script executed:
# Also search for the file more broadly
find . -name "*nvidia-dra-partitionable-commands.sh" -type f 2>/dev/null | head -5
Repository: openshift/release
Length of output: 200

🏁 Script executed:
# Test 1: Does readonly with command substitution preserve exit status?
bash -c '
set -o pipefail
set -o errexit
readonly TEST=$(false);
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"
Repository: openshift/release
Length of output: 95

🏁 Script executed:
# Test 2: Does separating assignment from readonly preserve exit status?
bash -c '
set -o pipefail
set -o errexit
TEST=$(false)
readonly TEST
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"
Repository: openshift/release
Length of output: 73

🏁 Script executed:
# Test 3: What about when command substitution succeeds but underlying command fails?
bash -c '
set -o pipefail
set -o errexit
readonly RESULT=$(exit 1; echo "value")
echo "Result: $RESULT"
' 2>&1; echo "Exit code: $?"
Repository: openshift/release
Length of output: 82

🏁 Script executed:
# Test 4: With pipefail and readonly - does it catch pipeline failures?
bash -c '
set -o pipefail
set -o errexit
readonly COUNT=$(false | wc -l)
echo "Should not reach here"
' 2>&1; echo "Exit code: $?"
Repository: openshift/release
Length of output: 95

🏁 Script executed:
# Test 5: Check if 2>/dev/null masks the failure in command substitution
bash -c '
set -o pipefail
set -o errexit
readonly COUNT=$(false 2>/dev/null | wc -l)
echo "Result: $COUNT"
' 2>&1; echo "Exit code: $?"
Repository: openshift/release
Length of output: 83
Inline readonly assignment masks pipeline exit status; separate the assignment.
Line 15 uses readonly GPU_NODES=$(...), which prevents set -o errexit and set -o pipefail from catching failures in the oc command. If oc get nodes fails due to API/auth issues, the command substitution still returns 0 (from wc -l), and the script proceeds with a misleading "no GPU nodes" error instead of surfacing the real failure.
Suggested fix
-readonly GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+GPU_NODES=$(oc get nodes -l "${GPU_LABEL}" -o name 2>/dev/null | wc -l)
+readonly GPU_NODES
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 15-15: Declare and assign separately to avoid masking return values.
(SC2155)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`
around lines 15 - 19, The readonly assignment for GPU_NODES using command
substitution masks failures from the oc/get pipeline; change the code so you run
oc get nodes -l "${GPU_LABEL}" -o name first, check its exit status, capture its
output to a variable (e.g., nodes_out), then compute GPU_NODES from that output
(e.g., with wc -l) and finally mark GPU_NODES readonly; update the logic around
the oc get nodes call and the subsequent if that references GPU_NODES to fail
fast on oc errors rather than returning a misleading zero count (refer to
GPU_NODES and the oc get nodes -l "${GPU_LABEL}" -o name usage).
# Check ResourceSlices
RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
echo "ResourceSlices: ${RESOURCE_SLICES}"
if [ "${RESOURCE_SLICES}" -eq 0 ]; then
  echo "WARNING: No ResourceSlices found"
fi
Wait for ResourceSlices before starting e2e.
A zero count here usually means the DRA driver has not advertised allocatable resources yet. Continuing makes the job timing-dependent and pushes the real failure into test-kubernetes-e2e.sh.
Suggested fix
-RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
-echo "ResourceSlices: ${RESOURCE_SLICES}"
-if [ "${RESOURCE_SLICES}" -eq 0 ]; then
- echo "WARNING: No ResourceSlices found"
-fi
+for _ in $(seq 1 30); do
+ RESOURCE_SLICES=$(oc get resourceslice -o name 2>/dev/null | wc -l)
+ if [ "${RESOURCE_SLICES}" -gt 0 ]; then
+ break
+ fi
+ sleep 10
+done
+echo "ResourceSlices: ${RESOURCE_SLICES}"
+if [ "${RESOURCE_SLICES}" -eq 0 ]; then
+ echo "ERROR: No ResourceSlices found after waiting for the DRA driver to advertise resources"
+ exit 1
+fi

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh`
around lines 31 - 36, The script currently reads RESOURCE_SLICES once and
proceeds even if zero; change it to poll until ResourceSlices exist (or a
configurable timeout) before continuing: replace the single read of
RESOURCE_SLICES with a loop that runs oc get resourceslice -o name repeatedly
(sleep between attempts) and exits successfully when RESOURCE_SLICES > 0 or
fails after the timeout, emitting a clear error; reference the RESOURCE_SLICES
variable and the oc get resourceslice command and ensure the failure path
returns non-zero so downstream test-kubernetes-e2e.sh won’t run when no slices
are advertised.
- Verifies DRAPartitionableDevices feature gate is enabled
- Runs test-kubernetes-e2e.sh with specified focus pattern
Clarify that feature-gate validation is conditional on test focus.
In this file, Lines 23-24 and Line 30 describe DRAPartitionableDevices gate verification as unconditional, but ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh (Lines 36-43) performs that check only when DRA_TEST_FOCUS includes DRAPartitionableDevices.
Also applies to: 30-30
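For readers cross-checking the two files, the conditional described above has roughly this shape in bash (a hedged sketch: DRA_TEST_FOCUS and the gate name come from the review, but the oc query is illustrative rather than the script's verbatim code):

if [[ "${DRA_TEST_FOCUS:-}" == *"DRAPartitionableDevices"* ]]; then
  # Only validate the feature gate when the focus pattern actually exercises it.
  if ! oc get featuregate cluster -o yaml | grep -q "DRAPartitionableDevices"; then
    echo "ERROR: DRAPartitionableDevices feature gate is not enabled"
    exit 1
  fi
fi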
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@ci-operator/step-registry/openshift/kubernetes/e2e/nvidia-dra-partitionable/openshift-kubernetes-e2e-nvidia-dra-partitionable-ref.yaml`
around lines 23 - 24, Update the YAML step description to state that the
DRAPartitionableDevices feature-gate check is conditional: change the
unconditional bullet to something like "Verifies DRAPartitionableDevices feature
gate is enabled when DRA_TEST_FOCUS includes 'DRAPartitionableDevices'"; ensure
any other sentence mentioning this check is similarly clarified. Reference the
check logic in openshift-kubernetes-e2e-nvidia-dra-partitionable-commands.sh and
the DRA_TEST_FOCUS variable so readers understand the gate validation only runs
when that focus is present.
* Add reusable NVIDIA step-registry components for DRA testing

  Based on openshift#74984. Adds NFD operator, GPU operator, DRA driver install, and cleanup steps.

  Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
  Co-Authored-By: Sai Ramesh Vanka <svanka@redhat.com>

* Fix nvidia-dra-driver-install for OpenShift CI compatibility

* Add DRA e2e CI job for kueue-operator on AWS with T4 GPU

---------

Signed-off-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@redhat.com>
Co-authored-by: Sai Ramesh Vanka <svanka@redhat.com>
Co-authored-by: Pannaga Rao Bhoja Ramamanohara <pbhojara@pbhojara-thinkpadp16vgen1.rht.csb>
Force-pushed from 6466435 to 5c4b6df.
/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Force-pushed from 5c4b6df to 416fdbf.

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse pull-ci-openshift-origin-main-nvidia-dra-e2e-aws-nvidia-dra

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
Opened a PR to fix a few failures with the existing
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sairameshv

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
This adds CI jobs to test Kubernetes Dynamic Resource Allocation (DRA) with NVIDIA GPUs on GCP across two repositories:

openshift/kubernetes (DRAPartitionableDevices) - Periodic:
- Tests upstream Kubernetes DRA Partitionable Devices (MIG) functionality
- Uses H100 GPU (a3-highgpu-1g) for DynamicMIG support
- Runs weekly in us-central1-a

openshift/origin (Basic DRA) - Presubmit:
- Tests OpenShift DRA integration with NVIDIA GPUs
- Uses L4 GPU (g2-standard-4) for cost-effective testing
- Triggers on changes to test/extended/node/dra/nvidia or test/extended/dra/nvidia

Reusable step-registry components:
- nvidia-nfd-operator-install: Installs NFD operator via OLM
- nvidia-gpu-operator-install: Installs NVIDIA GPU operator with CDI
- nvidia-dra-driver-install: Installs NVIDIA DRA driver via Helm
- nvidia-cleanup: Cleans up NVIDIA components and resources
- openshift-kubernetes-e2e-nvidia-dra-partitionable: Workflow for DRA testing

Note: G2/A3 GPU-optimized machine types include GPUs automatically. All tests use us-central1 to match CI quota allocation.

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
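As a rough sketch of how such a periodic is wired together (hypothetical test name and cron; the real entry lives in ci-operator/config/openshift/kubernetes/openshift-kubernetes-main__gcp-nvidia-dra.yaml):

tests:
- as: partitionable-devices              # hypothetical name
  cron: 0 0 * * 0                        # weekly, per the description above
  steps:
    cluster_profile: openshift-org-gcp
    workflow: openshift-kubernetes-e2e-nvidia-dra-partitionable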
/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
- Origin: Switch from GCP L4 (g2-standard-4) to AWS T4 (g4dn.xlarge).
  Follows the kueue-operator pattern (PR openshift#77108); better availability.
- Kubernetes: Fix cluster profile from 'gcp' to 'openshift-org-gcp'.
  Keep zone as us-central1-a (H100/A3 available in zones a/b/c, not f).
- Add nvidia-dra-driver-install step for DRA driver installation.
- Install helm in the test container for prerequisite validation.
- Disable monitor tests for focused NVIDIA DRA testing.

Fixes rehearse failures:
1. Invalid cluster profile 'gcp' (should be 'openshift-org-gcp')
2. Test failures due to missing helm binary in the tests container
3. Unrelated monitor test failures (94 cluster health checks)

Note: us-central1-f has A100 (a2) but not H100 (a3) instances.

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
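For the helm prerequisite specifically, a minimal sketch of bootstrapping helm inside the test container (illustrative only: the pinned version and install path are my assumptions, not the PR's verbatim code; get.helm.sh hosts the official release tarballs):

# Hypothetical helm bootstrap for the tests container
curl --fail --silent --show-error --location \
  "https://get.helm.sh/helm-v3.14.0-linux-amd64.tar.gz" \
  | tar -xz -C /tmp linux-amd64/helm
export PATH="/tmp/linux-amd64:${PATH}"
helm version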
/pj-rehearse periodic-ci-openshift-kubernetes-main-gcp-nvidia-dra-partitionable-devices

@sairameshv: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

[REHEARSALNOTIFIER]

Interacting with pj-rehearse: Once you are satisfied with the results of the rehearsals, comment:

@sairameshv: The following tests failed, say

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Add CI infrastructure for DRA testing on NVIDIA GPUs