Skip to content

OCPNODE-4043: Fix NVIDIA DRA driver helm repo configuration#31104

Open
sairameshv wants to merge 1 commit intoopenshift:mainfrom
sairameshv:nvidia_dra_ocp
Open

OCPNODE-4043: Fix NVIDIA DRA driver helm repo configuration#31104
sairameshv wants to merge 1 commit intoopenshift:mainfrom
sairameshv:nvidia_dra_ocp

Conversation

@sairameshv
Copy link
Copy Markdown
Member

@sairameshv sairameshv commented Apr 30, 2026

  • Fix Helm repo URL helm.ngc.nvidia.com/nvidia
  • Add retry logic for flaky NVIDIA Helm repository updates

Summary by CodeRabbit

  • Chores

    • Improved Helm operations with automatic retry and per-attempt logging to reduce transient failures and surface final errors.
    • Updated NVIDIA Helm repository endpoint to the new official URL.
    • Standardized Helm release and namespace naming for the NVIDIA driver deployment.
  • Documentation

    • Updated installation and troubleshooting docs to reflect the new repository endpoint and namespace/release naming.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: c3d6abbc-2243-484d-92f7-478d9fcfc273

📥 Commits

Reviewing files that changed from the base of the PR and between 0e2b8bb and c2f3fbb.

📒 Files selected for processing (2)
  • test/extended/node/dra/nvidia/README.md
  • test/extended/node/dra/nvidia/prerequisites_installer.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/node/dra/nvidia/README.md

Walkthrough

Helm-related NVIDIA DRA prerequisites updated: release/namespace renamed to nvidia-dra-driver, Helm repo URL changed to https://helm.ngc.nvidia.com/nvidia (added with --force-update), and a single helm repo update was replaced by a timed retry loop that logs attempts and returns the final error/output on exhaustion.

Changes

NVIDIA DRA Driver Helm Installer + Docs

Layer / File(s) Summary
Constants / Names
test/extended/node/dra/nvidia/prerequisites_installer.go
Rename constants to draDriverNamespace = "nvidia-dra-driver" and draDriverRelease = "nvidia-dra-driver".
Helm Repo Add
test/extended/node/dra/nvidia/prerequisites_installer.go
Change Helm repo URL to https://helm.ngc.nvidia.com/nvidia and add with --force-update.
Repo Update / Retry Logic
test/extended/node/dra/nvidia/prerequisites_installer.go
Replace single helm repo update with a PollUntilContextTimeout loop that runs helm repo update nvidia, logs each retry and captured output, suppresses intermediate errors, and returns the last error plus last output if timeout is reached.
Manual Install & Troubleshooting Docs
test/extended/node/dra/nvidia/README.md
Update README to use new Helm repo URL (https://helm.ngc.nvidia.com/nvidia --force-update), new namespace/release nvidia-dra-driver for namespace creation, SCC bindings, helm install/uninstall, verification (oc get pods, oc logs) and troubleshooting (oc delete pod) commands.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Ipv6 And Disconnected Network Test Compatibility ⚠️ Warning Tests require external connectivity to helm.ngc.nvidia.com for Helm chart installation and nvcr.io for container images. Will fail in disconnected environments. Add [Skipped:Disconnected] label or configure internal registry mirrors for helm.ngc.nvidia.com and nvcr.io/nvidia/cuda image pulls.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly references the main change: fixing NVIDIA DRA driver Helm repository configuration, which aligns with the primary objectives of updating the Helm repo URL and adding retry logic.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The PR only modifies utility functions and documentation, not Ginkgo test names. All test names in nvidia_dra.go are static and deterministic with no dynamic content.
Test Structure And Quality ✅ Passed Check requires reviewing Ginkgo test code quality. PR modifies helper utility code and documentation, not test code. Actual Ginkgo tests were not modified.
Microshift Test Compatibility ✅ Passed This PR does not add new Ginkgo e2e tests. It only modifies prerequisites_installer.go (helper file) and README.md (documentation). Existing NVIDIA DRA tests were added in a previous commit.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The PR does not add any new Ginkgo e2e tests. It only modifies prerequisites_installer.go (a helper utility class) and README.md (documentation). The SNO Test Compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies test setup code and documentation only. Does not introduce deployment manifests, operator code, or controllers. Existing tolerations for control-plane nodes predate this change.
Ote Binary Stdout Contract ✅ Passed Code properly uses framework.Logf for logging, captures command output with CombinedOutput, and has no main/init/TestMain/suite-level code writing to stdout. Complies with OTE Binary Stdout Contract.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sairameshv

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2026
@openshift-ci openshift-ci Bot requested review from haircommander and harche April 30, 2026 14:34
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/dra/nvidia/prerequisites_installer.go`:
- Around line 118-122: Replace the current helm repo add invocation that
tolerates "already exists" with a forced update: update the exec.CommandContext
call that constructs cmd (currently using "helm", "repo", "add", "nvidia",
"https://helm.ngc.nvidia.com/nvidia") to include the "--force-update" flag, and
remove the special-case check on output containing "already exists" so any
non-nil err is treated as failure and returned (i.e., eliminate the
strings.Contains(...) branch and keep the existing error return when
cmd.CombinedOutput() fails).
- Around line 125-131: The closure passed to wait.PollUntilContextTimeout
declares output, but never uses it; modify the retry closure used in
wait.PollUntilContextTimeout to capture both the command output and error from
exec.CommandContext ("helm repo update"), log the output on failure (e.g.,
include output in framework.Logf when err != nil), and store the last seen
output and error in outer-scope variables so that when the poll ultimately times
out you can return or wrap the original command error/output instead of only the
context deadline error; update references around the closure and the timeout
error handling to use these stored lastOutput/lastErr values for better
diagnostics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9f8e0bc0-bdbb-4dd6-8fb9-e9622fb3ba10

📥 Commits

Reviewing files that changed from the base of the PR and between 4078649 and cefb01e.

📒 Files selected for processing (1)
  • test/extended/node/dra/nvidia/prerequisites_installer.go

Comment thread test/extended/node/dra/nvidia/prerequisites_installer.go Outdated
Comment thread test/extended/node/dra/nvidia/prerequisites_installer.go
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/extended/node/dra/nvidia/prerequisites_installer.go (1)

127-129: ⚡ Quick win

Scope Helm update to the NVIDIA repo for clarity and efficiency.

At line 128, helm repo update updates all configured repos. Prefer scoping to just nvidia for clarity and to avoid unnecessary network calls to unrelated repositories.

Suggested patch
-		cmd := exec.CommandContext(ctx, "helm", "repo", "update")
+		cmd := exec.CommandContext(ctx, "helm", "repo", "update", "nvidia")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/dra/nvidia/prerequisites_installer.go` around lines 127 -
129, The helm command currently runs exec.CommandContext(ctx, "helm", "repo",
"update") inside the anonymous function passed to wait.PollUntilContextTimeout,
which updates all repos; change the command to scope the update to only the
NVIDIA repo (e.g., exec.CommandContext(ctx, "helm", "repo", "update", "nvidia"))
so lastOutput/lastErr reflect only the nvidia repo update and avoid updating
unrelated repos.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/dra/nvidia/prerequisites_installer.go`:
- Around line 26-27: The README for the NVIDIA DRA test still shows the old
repo/namespace/release commands; update the README in the NVIDIA DRA test
directory to reflect the new workflow by replacing the old namespace and release
values with the current constants (use draDriverNamespace = "nvidia-dra-driver"
and draDriverRelease = "nvidia-dra-driver") and adjust any example repo commands
to match the new repo workflow so the manual setup commands are accurate and
consistent with the code.

---

Nitpick comments:
In `@test/extended/node/dra/nvidia/prerequisites_installer.go`:
- Around line 127-129: The helm command currently runs exec.CommandContext(ctx,
"helm", "repo", "update") inside the anonymous function passed to
wait.PollUntilContextTimeout, which updates all repos; change the command to
scope the update to only the NVIDIA repo (e.g., exec.CommandContext(ctx, "helm",
"repo", "update", "nvidia")) so lastOutput/lastErr reflect only the nvidia repo
update and avoid updating unrelated repos.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: abf3f06d-cf2d-4bb9-8df8-fefebbe3599d

📥 Commits

Reviewing files that changed from the base of the PR and between cefb01e and ee084b5.

📒 Files selected for processing (1)
  • test/extended/node/dra/nvidia/prerequisites_installer.go

Comment thread test/extended/node/dra/nvidia/prerequisites_installer.go
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/dra/nvidia/prerequisites_installer.go`:
- Around line 127-129: The helm repo update call currently runs
exec.CommandContext(ctx, "helm", "repo", "update") inside the
wait.PollUntilContextTimeout callback which updates all repos and can fail due
to unrelated broken repos; change the exec.CommandContext invocation in that
callback to target only the nvidia repo by passing "nvidia" as an argument
(i.e., call exec.CommandContext(ctx, "helm", "repo", "update", "nvidia")),
keeping the same handling of lastOutput and lastErr and the surrounding
wait.PollUntilContextTimeout logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b7718c6a-c096-43bb-aa2b-ab29ca6fc928

📥 Commits

Reviewing files that changed from the base of the PR and between ee084b5 and 0e2b8bb.

📒 Files selected for processing (2)
  • test/extended/node/dra/nvidia/README.md
  • test/extended/node/dra/nvidia/prerequisites_installer.go

Comment thread test/extended/node/dra/nvidia/prerequisites_installer.go
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@sairameshv sairameshv changed the title Fix NVIDIA DRA driver helm repo configuration OCPNODE-4043: Fix NVIDIA DRA driver helm repo configuration May 5, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 5, 2026

@sairameshv: This pull request references OCPNODE-4043 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

Details

In response to this:

  • Fix Helm repo URL helm.ngc.nvidia.com/nvidia
  • Add retry logic for flaky NVIDIA Helm repository updates

Summary by CodeRabbit

  • Chores

  • Improved reliability of Helm repository operations by adding automatic retry logic with per-attempt logging and final-error reporting to reduce transient failures.

  • Updated NVIDIA Helm repository endpoint to the new official URL.

  • Standardized Helm release and namespace naming for the NVIDIA driver deployment.

  • Documentation

  • Updated installation and troubleshooting docs to reflect the new repository endpoint and namespace/release naming.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

- Fix Helm repo URL (helm.ngc.nvidia.com/nvidia)
- Add retry logic for flaky NVIDIA Helm repository updates

Signed-off-by: Sai Ramesh Vanka <svanka@redhat.com>
@openshift-merge-bot openshift-merge-bot Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 5, 2026
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 5, 2026

@sairameshv: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sairameshv
Copy link
Copy Markdown
Member Author

/verified by ci

@openshift-ci-robot
Copy link
Copy Markdown

@sairameshv: This PR has been marked as verified by ci.

Details

In response to this:

/verified by ci

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review verified Signifies that the PR passed pre-merge verification criteria

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants