
[None][fix] Update CI Agg test's mpi2 to mpix #13491

Open
chenfeiz0326 wants to merge 21 commits into NVIDIA:main from chenfeiz0326:chenfeiz/fix-emma-mpi-issue

Conversation


chenfeiz0326 (Collaborator) commented Apr 27, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed CUDA virtual memory allocation property setup for improved compatibility.
  • Documentation

    • Updated supported dependency versions and compatibility matrix.
  • Chores

    • Bumped CUDA to 13.2.0, PyTorch to 2.11.0, and TensorRT to 10.16.0.
    • Updated base container images to version 26.03.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

EmmaQiaoCh and others added 21 commits April 1, 2026 02:29
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
… 2.11.0a0

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
chenfeiz0326 (Collaborator, Author) commented:

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge*"


coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

This PR updates multiple dependency and container image versions across the project, including CUDA (13.1.1→13.2.0), PyTorch (2.10.0→2.11.0), TensorRT (10.15.1.29→10.16.0.72), and base container tags (26.02→26.03). Additionally, it updates CUDA virtual memory API usage to employ brace-initialization for CUmemLocation construction.

Changes

  • Documentation and Configuration Version Updates (README.md, constraints.txt, docs/source/legacy/reference/support-matrix.md, requirements.txt): Updated version references and badges: CUDA 13.1.1→13.2.0, PyTorch 2.10.0→2.11.0, TensorRT 10.15.1→10.16.0, base image tags 26.02→26.03, TensorRT pins, NCCL and onnxscript versions.
  • CUDA Virtual Memory API Initialization Updates (cpp/include/tensorrt_llm/runtime/virtualMemory.h, cpp/tensorrt_llm/runtime/virtualMemory.cpp): Updated CUmemLocation and CUmemAccessDesc initialization to use brace-initialization style (CUmemLocation{CU_MEM_LOCATION_TYPE_DEVICE, {device}}) in place of nested initializer lists, impacting allocation property setup.
  • CUDA Virtual Memory Test Initialization (cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp, cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp): Aligned test setup with the updated CUmemLocation brace-initialization pattern across multiple test cases.
  • Docker Build Configuration (docker/Dockerfile.multi, docker/Makefile): Updated base image tags from 26.02 to 26.03 and CUDA versions from 13.1.1 to 13.2.0 across multiple build targets (rockylinux8, ubuntu22, ubuntu24).
  • Installation Scripts (docker/common/install_cuda_toolkit.sh, docker/common/install_pytorch.sh, docker/common/install_tensorrt.sh): Updated CUDA (13.1.1→13.2.0), PyTorch (2.10.0→2.11.0), and TensorRT (10.15.1.29→10.16.0.72) installer versions and corresponding dependent package pins (CUDNN, NCCL, CUBLAS, CUDA components).
  • Jenkins CI Configuration (jenkins/Build.groovy, jenkins/L0_Test.groovy, jenkins/current_image_tags.properties): Updated Triton tags from r26.02 to r26.03, Docker image references to new version combinations (pytorch-26.03+trt10.16.0.72, cuda-13.2.0+trt10.16.0.72), adjusted the MPI runtime flag (pmi2→pmix), and updated sanity-check configs and bare-metal PyTorch installation to CUDA 13.2-compatible artifacts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • chzblych
  • Tabrizian
  • litaotju
  • leslie-fang25
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description has no substantive content; only the template comments and a checked checkbox are present, and the Description and Test Coverage sections are missing. Resolution: add a Description section explaining the MPI change and its impact, and a Test Coverage section documenting how this change is tested.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 18.18%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (3 passed)

  • Title check — Passed: the title directly relates to a change in the PR (MPI argument update from pmi2 to pmix in L0_Test.groovy) and matches the final commit message.
  • Linked Issues check — Skipped: no linked issues were found for this pull request.
  • Out of Scope Changes check — Skipped: no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai Bot left a comment (Contributor)

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
constraints.txt (1)

1-8: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary vulnerability workarounds that are not present in pytorch:26.03-py3.

The vulnerabilities listed are already addressed in the updated base image:

  • wheel: CVE-2026-24049 is fixed in wheel 0.46.2, which is already installed via the base image's pip upgrade
  • tornado: Not installed in pytorch:26.03-py3 as it is not a core PyTorch dependency
  • black: Not installed in pytorch:26.03-py3 as it is not a runtime dependency

Remove the constraints for tornado>=6.5.5 and black>=26.3.1 entirely. Consider removing the wheel>=0.46.2 constraint as well since the fixed version is already provided by the base image's dependency resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@constraints.txt` around lines 1 - 8, Remove the unnecessary vulnerability
workaround constraints: delete the lines containing "tornado>=6.5.5" and
"black>=26.3.1" from constraints.txt (these packages are not present in the
pytorch:26.03-py3 runtime), and optionally remove "wheel>=0.46.2" as well since
the base image already provides a fixed wheel; ensure only the remaining
necessary constraints stay in the file.
cpp/tensorrt_llm/runtime/virtualMemory.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year for this modified file.

The file was changed in this PR, but the header still ends at 2025.

Proposed fix
- * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/runtime/virtualMemory.cpp` at line 2, The file header in
virtualMemory.cpp still shows "2025"; update the NVIDIA copyright header year to
the latest modification year (2026) in the top-of-file header comment so the
file complies with the TensorRT-LLM guideline requiring the year of latest
meaningful modification.
cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Refresh the copyright header year range.

This test file is modified, but the header still ends at 2024.

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp` at line 2, Update the
copyright header in cudaDriverWrapperTest.cpp: replace the ending year range
"2022-2024" with the current latest modification year (e.g., "2022-2026") so the
top-of-file NVIDIA copyright header reflects the latest meaningful modification.
cpp/include/tensorrt_llm/runtime/virtualMemory.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year for this modified file.

This file was modified, but the header still ends at 2025.

Proposed fix
- * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/runtime/virtualMemory.h` at line 2, Update the
copyright header year in the file virtualMemory.h to reflect the latest
meaningful modification (replace "2025" with the correct current year) so the
NVIDIA copyright header complies with project guidelines; locate the top-of-file
header comment in cpp/include/tensorrt_llm/runtime/virtualMemory.h and change
the year token in the existing copyright line.
cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the header year to reflect this modification.

The file is updated in this PR, but the copyright year still ends at 2025.

Proposed fix
- * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp` at line 2, Update the
copyright header in virtualMemoryTest.cpp to reflect the latest modification
year (change 2025 to 2026) so the NVIDIA copyright line shows the current year
of modification.
🧹 Nitpick comments (1)
jenkins/Build.groovy (1)

409-412: Avoid hardcoding tritonShortTag; derive it from docker/Dockerfile.multi.

Line 410 says Dockerfile is the source of truth, but Line 411 is hardcoded. This can drift on the next tag bump.

Refactor sketch
-    // Get triton tag from docker/dockerfile.multi
-    def tritonShortTag = "r26.03"
+    // Get triton tag from docker/Dockerfile.multi (source of truth)
+    def tritonBaseTag = sh(
+        script: "grep '^ARG TRITON_BASE_TAG=' ${LLM_ROOT}/docker/Dockerfile.multi | cut -d= -f2",
+        returnStdout: true
+    ).trim()
+    def tritonShortTag = "r${tritonBaseTag.replace('-py3', '')}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/Build.groovy` around lines 409 - 412, The tritonShortTag is hardcoded
as "r26.03" which can drift from the Dockerfile; modify the Build.groovy snippet
that defines tritonShortTag and the sh invocation so tritonShortTag is parsed
from docker/Dockerfile.multi instead of being a literal. Locate the def
tritonShortTag declaration and replace it with code that reads the
Dockerfile.multi contents (path: docker/Dockerfile.multi), extracts the TRITON
tag value (the same tag used in that file), assigns it to tritonShortTag, and
then use that variable in the existing sh command that references LLM_ROOT,
llmPath, and buildJobs (no other sh changes needed). Ensure the parsing is
robust to whitespace and comment lines so the extracted tag matches the format
expected by the cmake flags.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker/common/install_pytorch.sh`:
- Around line 5-8: Update the NVIDIA release-notes URL in the header comment to
the proper 26.03 stack to match TORCH_VERSION="2.11.0"; locate the commented
reference to
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02"
and replace it with the 26.03-equivalent release notes URL so the guidance
matches the TORCH_VERSION variable and current maintenance baseline.

In `@jenkins/L0_Test.groovy`:
- Line 1: The pipeline is pinned to a mutable feature branch in the `@Library`
declaration; update the library reference in the `@Library` annotation (the line
starting with "@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node',
...])") to reference a stable source instead—either change
"bloom-jenkins-shared-lib@emma/update_nsc_login_node" to
"bloom-jenkins-shared-lib@main" or pin it to an immutable revision (commit SHA
or tagged release) so the shared library cannot change unexpectedly.
- Around line 3662-3667: The comment points out a CUDA version mismatch: the
block says "CUDA 13.2" but installs cuda-toolkit-13-1 while pip installs
torch==2.11.0+cu130; fix by aligning all three to CUDA 13.2 — update the toolkit
install from cuda-toolkit-13-1 to cuda-toolkit-13-2 and change the pip install
inside trtllm_utils.llmExecStepWithRetry to install torch and torchvision built
for cu132 (e.g., torch==2.11.0+cu132 torchvision==0.26.0+cu132 and corresponding
index URL), ensuring the echo/header text, toolkit package name, and pip package
tags all match.
- Around line 1165-1167: The code now unconditionally adds "--mpi=pmix" when
nodeCount > 1 (see srunArgs.add("--mpi=pmix")) for any stage where stageName
does not contain "Disagg-PerfSanity", but the docs
(jenkins/scripts/perf/README.md) still reference "--mpi=pmi2"; either update
that README entry to document "--mpi=pmix" for non-disaggregated multi-node
jobs, or restrict the code change by gating the pmix addition to the specific
aggregated stages (e.g., check stageName for the exact aggregated stage(s)
instead of using a broad negation of "Disagg-PerfSanity"), ensuring consistency
between srunArgs behavior and the README.

In `@requirements.txt`:
- Line 30: Update the comment in requirements.txt that currently reads
"rel-26-0" to the correct "rel-26-03" so the release-notes URL is accurate;
locate the comment line containing
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0"
and replace the trailing fragment "rel-26-0" with "rel-26-03".

---

Outside diff comments:
In `@constraints.txt`:
- Around line 1-8: Remove the unnecessary vulnerability workaround constraints:
delete the lines containing "tornado>=6.5.5" and "black>=26.3.1" from
constraints.txt (these packages are not present in the pytorch:26.03-py3
runtime), and optionally remove "wheel>=0.46.2" as well since the base image
already provides a fixed wheel; ensure only the remaining necessary constraints
stay in the file.

In `@cpp/include/tensorrt_llm/runtime/virtualMemory.h`:
- Line 2: Update the copyright header year in the file virtualMemory.h to
reflect the latest meaningful modification (replace "2025" with the correct
current year) so the NVIDIA copyright header complies with project guidelines;
locate the top-of-file header comment in
cpp/include/tensorrt_llm/runtime/virtualMemory.h and change the year token in
the existing copyright line.

In `@cpp/tensorrt_llm/runtime/virtualMemory.cpp`:
- Line 2: The file header in virtualMemory.cpp still shows "2025"; update the
NVIDIA copyright header year to the latest modification year (2026) in the
top-of-file header comment so the file complies with the TensorRT-LLM guideline
requiring the year of latest meaningful modification.

In `@cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp`:
- Line 2: Update the copyright header in cudaDriverWrapperTest.cpp: replace the
ending year range "2022-2024" with the current latest modification year (e.g.,
"2022-2026") so the top-of-file NVIDIA copyright header reflects the latest
meaningful modification.

In `@cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp`:
- Line 2: Update the copyright header in virtualMemoryTest.cpp to reflect the
latest modification year (change 2025 to 2026) so the NVIDIA copyright line
shows the current year of modification.

---

Nitpick comments:
In `@jenkins/Build.groovy`:
- Around line 409-412: The tritonShortTag is hardcoded as "r26.03" which can
drift from the Dockerfile; modify the Build.groovy snippet that defines
tritonShortTag and the sh invocation so tritonShortTag is parsed from
docker/Dockerfile.multi instead of being a literal. Locate the def
tritonShortTag declaration and replace it with code that reads the
Dockerfile.multi contents (path: docker/Dockerfile.multi), extracts the TRITON
tag value (the same tag used in that file), assigns it to tritonShortTag, and
then use that variable in the existing sh command that references LLM_ROOT,
llmPath, and buildJobs (no other sh changes needed). Ensure the parsing is
robust to whitespace and comment lines so the extracted tag matches the format
expected by the cmake flags.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cf49de2b-5061-4f4d-8ebc-0c231f17cf0b

📥 Commits

Reviewing files that changed from the base of the PR and between 2f745de and b4c21fa.

📒 Files selected for processing (16)
  • README.md
  • constraints.txt
  • cpp/include/tensorrt_llm/runtime/virtualMemory.h
  • cpp/tensorrt_llm/runtime/virtualMemory.cpp
  • cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp
  • cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp
  • docker/Dockerfile.multi
  • docker/Makefile
  • docker/common/install_cuda_toolkit.sh
  • docker/common/install_pytorch.sh
  • docker/common/install_tensorrt.sh
  • docs/source/legacy/reference/support-matrix.md
  • jenkins/Build.groovy
  • jenkins/L0_Test.groovy
  • jenkins/current_image_tags.properties
  • requirements.txt

Comment on lines 5 to +8
 # Use latest stable version from https://pypi.org/project/torch/#history
 # and closest to the version specified in
 # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
-TORCH_VERSION="2.10.0"
+TORCH_VERSION="2.11.0"

⚠️ Potential issue | 🟡 Minor

Update the release-notes link to match the new baseline.

Line 7 still points to rel-26-02, but Line 8 now targets PyTorch 2.11.0 (26.03 stack). This leaves stale maintenance guidance.

Suggested fix
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Use latest stable version from https://pypi.org/project/torch/#history
# and closest to the version specified in
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
TORCH_VERSION="2.10.0"
TORCH_VERSION="2.11.0"
# Use latest stable version from https://pypi.org/project/torch/#history
# and closest to the version specified in
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03
TORCH_VERSION="2.11.0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/common/install_pytorch.sh` around lines 5 - 8, Update the NVIDIA
release-notes URL in the header comment to the proper 26.03 stack to match
TORCH_VERSION="2.11.0"; locate the commented reference to
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02"
and replace it with the 26.03-equivalent release notes URL so the guidance
matches the TORCH_VERSION variable and current maintenance baseline.

Comment thread jenkins/L0_Test.groovy
@@ -1,4 +1,4 @@
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node', 'trtllm-jenkins-shared-lib@main']) _

⚠️ Potential issue | 🟠 Major

Avoid pinning the pipeline to a feature-branch shared library.

This makes the PR depend on mutable branch state outside the repo, so the pipeline can change under the same commit or break if that branch is rebased/deleted. Please switch back to @main or pin an immutable revision instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` at line 1, The pipeline is pinned to a mutable
feature branch in the `@Library` declaration; update the library reference in the
`@Library` annotation (the line starting with
"@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node', ...])") to
reference a stable source instead—either change
"bloom-jenkins-shared-lib@emma/update_nsc_login_node" to
"bloom-jenkins-shared-lib@main" or pin it to an immutable revision (commit SHA
or tagged release) so the shared library cannot change unexpectedly.

Comment thread jenkins/L0_Test.groovy
Comment on lines 1165 to 1167
 if(nodeCount > 1) {
-    srunArgs.add("--mpi=pmi2")
+    srunArgs.add("--mpi=pmix")
 }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how MPI modes are used and documented in the repo.
rg -n -C2 --fixed-strings -- '--mpi=' \
  jenkins/L0_Test.groovy \
  jenkins/scripts/perf/README.md \
  jenkins/scripts/perf/local/submit.py \
  jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh

Repository: NVIDIA/TensorRT-LLM

Length of output: 4895


🏁 Script executed:

# Check the broader context around lines 1165-1167
sed -n '1130,1180p' jenkins/L0_Test.groovy | cat -n

# Also check what config/stage gating exists in the surrounding code
sed -n '1100,1200p' jenkins/L0_Test.groovy | cat -n

Repository: NVIDIA/TensorRT-LLM

Length of output: 9405


🏁 Script executed:

# Find the function/method containing lines 1165-1167
sed -n '1000,1165p' jenkins/L0_Test.groovy | grep -E '(def |stage |if \(|else if)' | tail -20

# Also check for stage references near this code
sed -n '1100,1200p' jenkins/L0_Test.groovy | grep -E '(stageName|stage\(|if.*agg|if.*config)' | head -15

Repository: NVIDIA/TensorRT-LLM

Length of output: 1593


🏁 Script executed:

# Find the function/stage this code belongs to and how it's called
grep -n "def.*disaggMode\|def.*nodeCount\|if.*disaggMode.*{" jenkins/L0_Test.groovy | head -20

# Search for where stageName is checked/filtered to see if there's config-level gating
rg -n "stageName.*==" jenkins/L0_Test.groovy | head -20

# Look for the broader scope - find function definition containing line 1165
sed -n '800,1165p' jenkins/L0_Test.groovy | grep -n "^\s*def \|^\s*stage(" | tail -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2019


🏁 Script executed:

# Check git history to see what the original code was
git log --oneline -20 jenkins/L0_Test.groovy

# Look for any comments or rationale about pmi2 vs pmix
rg -i "pmi2|pmix" jenkins/L0_Test.groovy -B2 -A2

# Check the README more carefully about when to use which MPI
sed -n '85,130p' jenkins/scripts/perf/README.md

Repository: NVIDIA/TensorRT-LLM

Length of output: 2592


Update README.md to reflect pmix for non-disaggregated multi-node jobs, or add stage-specific filtering.

The code now uses --mpi=pmix for all non-disaggregated multi-node jobs (any stage where stageName does not contain "Disagg-PerfSanity"), but jenkins/scripts/perf/README.md:114 still documents --mpi=pmi2 for this path. Either update the documentation to reflect the change to pmix, or if pmix should only apply to specific aggregated stages, add stage/config gating instead of changing the shared default for all non-disaggregated jobs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 1165 - 1167, The code now
unconditionally adds "--mpi=pmix" when nodeCount > 1 (see
srunArgs.add("--mpi=pmix")) for any stage where stageName does not contain
"Disagg-PerfSanity", but the docs (jenkins/scripts/perf/README.md) still
reference "--mpi=pmi2"; either update that README entry to document "--mpi=pmix"
for non-disaggregated multi-node jobs, or restrict the code change by gating the
pmix addition to the specific aggregated stages (e.g., check stageName for the
exact aggregated stage(s) instead of using a broad negation of
"Disagg-PerfSanity"), ensuring consistency between srunArgs behavior and the
README.

Comment thread jenkins/L0_Test.groovy
Comment on lines +3662 to +3667
 // Extra PyTorch CUDA 13.2 install for all bare-metal environments (Default PyTorch is for CUDA 12.8)
 if (values[6]) {
-echo "###### Extra PyTorch CUDA 13.0 install Start ######"
+echo "###### Extra PyTorch CUDA 13.2 install Start ######"
 // Use internal mirror instead of https://download.pytorch.org/whl/cu130 for better network stability.
 // PyTorch CUDA 13.0 package and torchvision package can be installed as expected.
-trtllm_utils.llmExecStepWithRetry(pipeline, script: "pip3 install torch==2.10.0+cu130 torchvision==0.25.0+cu130 --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/pytorch-cu128-remote/simple --extra-index-url https://download.pytorch.org/whl/cu130")
+trtllm_utils.llmExecStepWithRetry(pipeline, script: "pip3 install torch==2.11.0+cu130 torchvision==0.26.0+cu130 --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/pytorch-cu128-remote/simple --extra-index-url https://download.pytorch.org/whl/cu130")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the bare-metal CUDA/PyTorch provisioning block stays internally consistent.
rg -n -C3 'cuda-toolkit-13-1|Extra PyTorch CUDA 13\.2|torch==2\.11\.0\+cu130|torchvision==0\.26\.0\+cu130' \
  jenkins/L0_Test.groovy

Repository: NVIDIA/TensorRT-LLM

Length of output: 1558


Align the bare-metal CUDA toolkit version with the stated PyTorch target.

The block at line 3662 claims "CUDA 13.2 alignment," but installs cuda-toolkit-13-1 (CUDA 13.1) on line 3660, while the PyTorch packages target cu130 (CUDA 13.0). This creates a three-way version mismatch: the block header claims 13.2, the toolkit is 13.1, and PyTorch targets 13.0. Update the toolkit installation to match the PyTorch CUDA version (either upgrade to cuda-toolkit-13-2 and update PyTorch to cu132, or downgrade PyTorch to match the 13.1 toolkit) to ensure the sanity jobs validate a consistent environment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 3662 - 3667, The comment points out a
CUDA version mismatch: the block says "CUDA 13.2" but installs cuda-toolkit-13-1
while pip installs torch==2.11.0+cu130; fix by aligning all three to CUDA 13.2 —
update the toolkit install from cuda-toolkit-13-1 to cuda-toolkit-13-2 and
change the pip install inside trtllm_utils.llmExecStepWithRetry to install torch
and torchvision built for cu132 (e.g., torch==2.11.0+cu132
torchvision==0.26.0+cu132 and corresponding index URL), ensuring the echo/header
text, toolkit package name, and pip package tags all match.

Comment thread requirements.txt
 # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02 uses 2.29.2
 # torch 2.10.0+cu130 depends on nvidia-nccl-cu13==2.28.9
 nvidia-nccl-cu13>=2.28.9,<=2.29.2
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7

⚠️ Potential issue | 🟡 Minor

Fix the release-notes reference typo in the comment.

Line 30 references rel-26-0; this should be rel-26-03 to keep the guidance accurate.

Suggested fix
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03 uses 2.29.7
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03 uses 2.29.7
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` at line 30, Update the comment in requirements.txt that
currently reads "rel-26-0" to the correct "rel-26-03" so the release-notes URL
is accurate; locate the comment line containing
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0"
and replace the trailing fragment "rel-26-0" with "rel-26-03".

tensorrt-cicd (Collaborator) commented:

PR_Github #45654 [ run ] triggered by Bot. Commit: b4c21fa Link to invocation

