
[None][fix] Update CI Agg test's mpi2 to mpix #13491

Open
chenfeiz0326 wants to merge 21 commits into NVIDIA:main from chenfeiz0326:chenfeiz/fix-emma-mpi-issue

Conversation


chenfeiz0326 (Collaborator) commented Apr 27, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed CUDA virtual memory allocation property setup for improved compatibility.
  • Documentation

    • Updated supported dependency versions and compatibility matrix.
  • Chores

    • Bumped CUDA to 13.2.0, PyTorch to 2.11.0, and TensorRT to 10.16.0.
    • Updated base container images to version 26.03.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

EmmaQiaoCh and others added 21 commits April 1, 2026 02:29
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
… 2.11.0a0

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: Emma Qiao <qqiao@nvidia.com>
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
chenfeiz0326 (Collaborator, Author) commented:

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge*"


coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

This PR updates multiple dependency and container image versions across the project, including CUDA (13.1.1→13.2.0), PyTorch (2.10.0→2.11.0), TensorRT (10.15.1.29→10.16.0.72), and base container tags (26.02→26.03). Additionally, it updates CUDA virtual memory API usage to employ brace-initialization for CUmemLocation construction.

Changes

  • Documentation and Configuration Version Updates (README.md, constraints.txt, docs/source/legacy/reference/support-matrix.md, requirements.txt): Updated version references and badges: CUDA 13.1.1→13.2.0, PyTorch 2.10.0→2.11.0, TensorRT 10.15.1→10.16.0, base image tags 26.02→26.03, TensorRT pins, NCCL and onnxscript versions.
  • CUDA Virtual Memory API Initialization Updates (cpp/include/tensorrt_llm/runtime/virtualMemory.h, cpp/tensorrt_llm/runtime/virtualMemory.cpp): Updated CUmemLocation and CUmemAccessDesc initialization to use brace-initialization style (CUmemLocation{CU_MEM_LOCATION_TYPE_DEVICE, {device}}) in place of nested initializer lists, impacting allocation property setup.
  • CUDA Virtual Memory Test Initialization (cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp, cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp): Aligned test setup with the updated CUmemLocation brace-initialization pattern across multiple test cases.
  • Docker Build Configuration (docker/Dockerfile.multi, docker/Makefile): Updated base image tags from 26.02 to 26.03 and CUDA versions from 13.1.1 to 13.2.0 across multiple build targets (rockylinux8, ubuntu22, ubuntu24).
  • Installation Scripts (docker/common/install_cuda_toolkit.sh, docker/common/install_pytorch.sh, docker/common/install_tensorrt.sh): Updated CUDA (13.1.1→13.2.0), PyTorch (2.10.0→2.11.0), and TensorRT (10.15.1.29→10.16.0.72) installer versions and corresponding dependent package pins (CUDNN, NCCL, CUBLAS, CUDA components).
  • Jenkins CI Configuration (jenkins/Build.groovy, jenkins/L0_Test.groovy, jenkins/current_image_tags.properties): Updated Triton tags from r26.02 to r26.03, Docker image references to new version combinations (pytorch-26.03+trt10.16.0.72, cuda-13.2.0+trt10.16.0.72), adjusted the MPI runtime flag (pmi2→pmix), and updated sanity-check configs and bare-metal PyTorch installation to CUDA 13.2-compatible artifacts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • chzblych
  • Tabrizian
  • litaotju
  • leslie-fang25
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description has no substantive content; only the template comments and a checked checkbox are present, and the Description and Test Coverage sections are missing. Resolution: add a Description section explaining the MPI change and its impact, and a Test Coverage section documenting how this change is tested.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 18.18%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (3 passed)

  • Title check — Passed: the title directly relates to a change in the PR (MPI argument update from pmi2 to pmix in L0_Test.groovy) and matches the final commit message.
  • Linked Issues check — Skipped: no linked issues were found for this pull request.
  • Out of Scope Changes check — Skipped: no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai Bot left a comment (Contributor)

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
constraints.txt (1)

1-8: ⚠️ Potential issue | 🟡 Minor

Remove unnecessary vulnerability workarounds that are not present in pytorch:26.03-py3.

The vulnerabilities listed are already addressed in the updated base image:

  • wheel: CVE-2026-24049 is fixed in wheel 0.46.2, which is already installed via the base image's pip upgrade
  • tornado: Not installed in pytorch:26.03-py3 as it is not a core PyTorch dependency
  • black: Not installed in pytorch:26.03-py3 as it is not a runtime dependency

Remove the constraints for tornado>=6.5.5 and black>=26.3.1 entirely. Consider removing the wheel>=0.46.2 constraint as well since the fixed version is already provided by the base image's dependency resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@constraints.txt` around lines 1 - 8, Remove the unnecessary vulnerability
workaround constraints: delete the lines containing "tornado>=6.5.5" and
"black>=26.3.1" from constraints.txt (these packages are not present in the
pytorch:26.03-py3 runtime), and optionally remove "wheel>=0.46.2" as well since
the base image already provides a fixed wheel; ensure only the remaining
necessary constraints stay in the file.
cpp/tensorrt_llm/runtime/virtualMemory.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year for this modified file.

The file was changed in this PR, but the header still ends at 2025.

Proposed fix
- * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/runtime/virtualMemory.cpp` at line 2, The file header in
virtualMemory.cpp still shows "2025"; update the NVIDIA copyright header year to
the latest modification year (2026) in the top-of-file header comment so the
file complies with the TensorRT-LLM guideline requiring the year of latest
meaningful modification.
cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Refresh the copyright header year range.

This test file is modified, but the header still ends at 2024.

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp` at line 2, Update the
copyright header in cudaDriverWrapperTest.cpp: replace the ending year range
"2022-2024" with the current latest modification year (e.g., "2022-2026") so the
top-of-file NVIDIA copyright header reflects the latest meaningful modification.
cpp/include/tensorrt_llm/runtime/virtualMemory.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year for this modified file.

This file was modified, but the header still ends at 2025.

Proposed fix
- * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/tensorrt_llm/runtime/virtualMemory.h` at line 2, Update the
copyright header year in the file virtualMemory.h to reflect the latest
meaningful modification (replace "2025" with the correct current year) so the
NVIDIA copyright header complies with project guidelines; locate the top-of-file
header comment in cpp/include/tensorrt_llm/runtime/virtualMemory.h and change
the year token in the existing copyright line.
cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the header year to reflect this modification.

The file is updated in this PR, but the copyright year still ends at 2025.

Proposed fix
- * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp` at line 2, Update the
copyright header in virtualMemoryTest.cpp to reflect the latest modification
year (change 2025 to 2026) so the NVIDIA copyright line shows the current year
of modification.
🧹 Nitpick comments (1)
jenkins/Build.groovy (1)

409-412: Avoid hardcoding tritonShortTag; derive it from docker/Dockerfile.multi.

Line 410 says Dockerfile is the source of truth, but Line 411 is hardcoded. This can drift on the next tag bump.

Refactor sketch
-    // Get triton tag from docker/dockerfile.multi
-    def tritonShortTag = "r26.03"
+    // Get triton tag from docker/Dockerfile.multi (source of truth)
+    def tritonBaseTag = sh(
+        script: "grep '^ARG TRITON_BASE_TAG=' ${LLM_ROOT}/docker/Dockerfile.multi | cut -d= -f2",
+        returnStdout: true
+    ).trim()
+    def tritonShortTag = "r${tritonBaseTag.replace('-py3', '')}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/Build.groovy` around lines 409 - 412, The tritonShortTag is hardcoded
as "r26.03" which can drift from the Dockerfile; modify the Build.groovy snippet
that defines tritonShortTag and the sh invocation so tritonShortTag is parsed
from docker/Dockerfile.multi instead of being a literal. Locate the def
tritonShortTag declaration and replace it with code that reads the
Dockerfile.multi contents (path: docker/Dockerfile.multi), extracts the TRITON
tag value (the same tag used in that file), assigns it to tritonShortTag, and
then use that variable in the existing sh command that references LLM_ROOT,
llmPath, and buildJobs (no other sh changes needed). Ensure the parsing is
robust to whitespace and comment lines so the extracted tag matches the format
expected by the cmake flags.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docker/common/install_pytorch.sh`:
- Around line 5-8: Update the NVIDIA release-notes URL in the header comment to
the proper 26.03 stack to match TORCH_VERSION="2.11.0"; locate the commented
reference to
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02"
and replace it with the 26.03-equivalent release notes URL so the guidance
matches the TORCH_VERSION variable and current maintenance baseline.

In `@jenkins/L0_Test.groovy`:
- Line 1: The pipeline is pinned to a mutable feature branch in the `@Library`
declaration; update the library reference in the `@Library` annotation (the line
starting with "@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node',
...])") to reference a stable source instead—either change
"bloom-jenkins-shared-lib@emma/update_nsc_login_node" to
"bloom-jenkins-shared-lib@main" or pin it to an immutable revision (commit SHA
or tagged release) so the shared library cannot change unexpectedly.
- Around line 3662-3667: The comment points out a CUDA version mismatch: the
block says "CUDA 13.2" but installs cuda-toolkit-13-1 while pip installs
torch==2.11.0+cu130; fix by aligning all three to CUDA 13.2 — update the toolkit
install from cuda-toolkit-13-1 to cuda-toolkit-13-2 and change the pip install
inside trtllm_utils.llmExecStepWithRetry to install torch and torchvision built
for cu132 (e.g., torch==2.11.0+cu132 torchvision==0.26.0+cu132 and corresponding
index URL), ensuring the echo/header text, toolkit package name, and pip package
tags all match.
- Around line 1165-1167: The code now unconditionally adds "--mpi=pmix" when
nodeCount > 1 (see srunArgs.add("--mpi=pmix")) for any stage where stageName
does not contain "Disagg-PerfSanity", but the docs
(jenkins/scripts/perf/README.md) still reference "--mpi=pmi2"; either update
that README entry to document "--mpi=pmix" for non-disaggregated multi-node
jobs, or restrict the code change by gating the pmix addition to the specific
aggregated stages (e.g., check stageName for the exact aggregated stage(s)
instead of using a broad negation of "Disagg-PerfSanity"), ensuring consistency
between srunArgs behavior and the README.

In `@requirements.txt`:
- Line 30: Update the comment in requirements.txt that currently reads
"rel-26-0" to the correct "rel-26-03" so the release-notes URL is accurate;
locate the comment line containing
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0"
and replace the trailing fragment "rel-26-0" with "rel-26-03".

---

Outside diff comments:
In `@constraints.txt`:
- Around line 1-8: Remove the unnecessary vulnerability workaround constraints:
delete the lines containing "tornado>=6.5.5" and "black>=26.3.1" from
constraints.txt (these packages are not present in the pytorch:26.03-py3
runtime), and optionally remove "wheel>=0.46.2" as well since the base image
already provides a fixed wheel; ensure only the remaining necessary constraints
stay in the file.

In `@cpp/include/tensorrt_llm/runtime/virtualMemory.h`:
- Line 2: Update the copyright header year in the file virtualMemory.h to
reflect the latest meaningful modification (replace "2025" with the correct
current year) so the NVIDIA copyright header complies with project guidelines;
locate the top-of-file header comment in
cpp/include/tensorrt_llm/runtime/virtualMemory.h and change the year token in
the existing copyright line.

In `@cpp/tensorrt_llm/runtime/virtualMemory.cpp`:
- Line 2: The file header in virtualMemory.cpp still shows "2025"; update the
NVIDIA copyright header year to the latest modification year (2026) in the
top-of-file header comment so the file complies with the TensorRT-LLM guideline
requiring the year of latest meaningful modification.

In `@cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp`:
- Line 2: Update the copyright header in cudaDriverWrapperTest.cpp: replace the
ending year range "2022-2024" with the current latest modification year (e.g.,
"2022-2026") so the top-of-file NVIDIA copyright header reflects the latest
meaningful modification.

In `@cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp`:
- Line 2: Update the copyright header in virtualMemoryTest.cpp to reflect the
latest modification year (change 2025 to 2026) so the NVIDIA copyright line
shows the current year of modification.

---

Nitpick comments:
In `@jenkins/Build.groovy`:
- Around line 409-412: The tritonShortTag is hardcoded as "r26.03" which can
drift from the Dockerfile; modify the Build.groovy snippet that defines
tritonShortTag and the sh invocation so tritonShortTag is parsed from
docker/Dockerfile.multi instead of being a literal. Locate the def
tritonShortTag declaration and replace it with code that reads the
Dockerfile.multi contents (path: docker/Dockerfile.multi), extracts the TRITON
tag value (the same tag used in that file), assigns it to tritonShortTag, and
then use that variable in the existing sh command that references LLM_ROOT,
llmPath, and buildJobs (no other sh changes needed). Ensure the parsing is
robust to whitespace and comment lines so the extracted tag matches the format
expected by the cmake flags.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cf49de2b-5061-4f4d-8ebc-0c231f17cf0b

📥 Commits

Reviewing files that changed from the base of the PR and between 2f745de and b4c21fa.

📒 Files selected for processing (16)
  • README.md
  • constraints.txt
  • cpp/include/tensorrt_llm/runtime/virtualMemory.h
  • cpp/tensorrt_llm/runtime/virtualMemory.cpp
  • cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp
  • cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp
  • docker/Dockerfile.multi
  • docker/Makefile
  • docker/common/install_cuda_toolkit.sh
  • docker/common/install_pytorch.sh
  • docker/common/install_tensorrt.sh
  • docs/source/legacy/reference/support-matrix.md
  • jenkins/Build.groovy
  • jenkins/L0_Test.groovy
  • jenkins/current_image_tags.properties
  • requirements.txt

Comment on lines 5 to +8
 # Use latest stable version from https://pypi.org/project/torch/#history
 # and closest to the version specified in
 # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
-TORCH_VERSION="2.10.0"
+TORCH_VERSION="2.11.0"

⚠️ Potential issue | 🟡 Minor

Update the release-notes link to match the new baseline.

Line 7 still points to rel-26-02, but Line 8 now targets PyTorch 2.11.0 (26.03 stack). This leaves stale maintenance guidance.

Suggested fix
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Use latest stable version from https://pypi.org/project/torch/#history
# and closest to the version specified in
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
TORCH_VERSION="2.10.0"
TORCH_VERSION="2.11.0"
# Use latest stable version from https://pypi.org/project/torch/#history
# and closest to the version specified in
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03
TORCH_VERSION="2.11.0"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/common/install_pytorch.sh` around lines 5 - 8, Update the NVIDIA
release-notes URL in the header comment to the proper 26.03 stack to match
TORCH_VERSION="2.11.0"; locate the commented reference to
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02"
and replace it with the 26.03-equivalent release notes URL so the guidance
matches the TORCH_VERSION variable and current maintenance baseline.

Comment thread jenkins/L0_Test.groovy
@@ -1,4 +1,4 @@
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node', 'trtllm-jenkins-shared-lib@main']) _

⚠️ Potential issue | 🟠 Major

Avoid pinning the pipeline to a feature-branch shared library.

This makes the PR depend on mutable branch state outside the repo, so the pipeline can change under the same commit or break if that branch is rebased/deleted. Please switch back to @main or pin an immutable revision instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` at line 1, The pipeline is pinned to a mutable
feature branch in the `@Library` declaration; update the library reference in the
`@Library` annotation (the line starting with
"@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node', ...])") to
reference a stable source instead—either change
"bloom-jenkins-shared-lib@emma/update_nsc_login_node" to
"bloom-jenkins-shared-lib@main" or pin it to an immutable revision (commit SHA
or tagged release) so the shared library cannot change unexpectedly.

Comment thread jenkins/L0_Test.groovy
Comment on lines 1165 to 1167
 if(nodeCount > 1) {
-    srunArgs.add("--mpi=pmi2")
+    srunArgs.add("--mpi=pmix")
 }

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify how MPI modes are used and documented in the repo.
rg -n -C2 --fixed-strings -- '--mpi=' \
  jenkins/L0_Test.groovy \
  jenkins/scripts/perf/README.md \
  jenkins/scripts/perf/local/submit.py \
  jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh

Repository: NVIDIA/TensorRT-LLM

Length of output: 4895


🏁 Script executed:

# Check the broader context around lines 1165-1167
sed -n '1130,1180p' jenkins/L0_Test.groovy | cat -n

# Also check what config/stage gating exists in the surrounding code
sed -n '1100,1200p' jenkins/L0_Test.groovy | cat -n

Repository: NVIDIA/TensorRT-LLM

Length of output: 9405


🏁 Script executed:

# Find the function/method containing lines 1165-1167
sed -n '1000,1165p' jenkins/L0_Test.groovy | grep -E '(def |stage |if \(|else if)' | tail -20

# Also check for stage references near this code
sed -n '1100,1200p' jenkins/L0_Test.groovy | grep -E '(stageName|stage\(|if.*agg|if.*config)' | head -15

Repository: NVIDIA/TensorRT-LLM

Length of output: 1593


🏁 Script executed:

# Find the function/stage this code belongs to and how it's called
grep -n "def.*disaggMode\|def.*nodeCount\|if.*disaggMode.*{" jenkins/L0_Test.groovy | head -20

# Search for where stageName is checked/filtered to see if there's config-level gating
rg -n "stageName.*==" jenkins/L0_Test.groovy | head -20

# Look for the broader scope - find function definition containing line 1165
sed -n '800,1165p' jenkins/L0_Test.groovy | grep -n "^\s*def \|^\s*stage(" | tail -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 2019


🏁 Script executed:

# Check git history to see what the original code was
git log --oneline -20 jenkins/L0_Test.groovy

# Look for any comments or rationale about pmi2 vs pmix
rg -i "pmi2|pmix" jenkins/L0_Test.groovy -B2 -A2

# Check the README more carefully about when to use which MPI
sed -n '85,130p' jenkins/scripts/perf/README.md

Repository: NVIDIA/TensorRT-LLM

Length of output: 2592


Update README.md to reflect pmix for non-disaggregated multi-node jobs, or add stage-specific filtering.

The code now uses --mpi=pmix for all non-disaggregated multi-node jobs (any stage where stageName does not contain "Disagg-PerfSanity"), but jenkins/scripts/perf/README.md:114 still documents --mpi=pmi2 for this path. Either update the documentation to reflect the change to pmix, or if pmix should only apply to specific aggregated stages, add stage/config gating instead of changing the shared default for all non-disaggregated jobs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 1165 - 1167, The code now
unconditionally adds "--mpi=pmix" when nodeCount > 1 (see
srunArgs.add("--mpi=pmix")) for any stage where stageName does not contain
"Disagg-PerfSanity", but the docs (jenkins/scripts/perf/README.md) still
reference "--mpi=pmi2"; either update that README entry to document "--mpi=pmix"
for non-disaggregated multi-node jobs, or restrict the code change by gating the
pmix addition to the specific aggregated stages (e.g., check stageName for the
exact aggregated stage(s) instead of using a broad negation of
"Disagg-PerfSanity"), ensuring consistency between srunArgs behavior and the
README.

Comment thread jenkins/L0_Test.groovy
Comment on lines +3662 to +3667
 // Extra PyTorch CUDA 13.2 install for all bare-metal environments (Default PyTorch is for CUDA 12.8)
 if (values[6]) {
-echo "###### Extra PyTorch CUDA 13.0 install Start ######"
+echo "###### Extra PyTorch CUDA 13.2 install Start ######"
 // Use internal mirror instead of https://download.pytorch.org/whl/cu130 for better network stability.
 // PyTorch CUDA 13.0 package and torchvision package can be installed as expected.
-trtllm_utils.llmExecStepWithRetry(pipeline, script: "pip3 install torch==2.10.0+cu130 torchvision==0.25.0+cu130 --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/pytorch-cu128-remote/simple --extra-index-url https://download.pytorch.org/whl/cu130")
+trtllm_utils.llmExecStepWithRetry(pipeline, script: "pip3 install torch==2.11.0+cu130 torchvision==0.26.0+cu130 --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/pytorch-cu128-remote/simple --extra-index-url https://download.pytorch.org/whl/cu130")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the bare-metal CUDA/PyTorch provisioning block stays internally consistent.
rg -n -C3 'cuda-toolkit-13-1|Extra PyTorch CUDA 13\.2|torch==2\.11\.0\+cu130|torchvision==0\.26\.0\+cu130' \
  jenkins/L0_Test.groovy

Repository: NVIDIA/TensorRT-LLM

Length of output: 1558


Align the bare-metal CUDA toolkit version with the stated PyTorch target.

The block at line 3662 claims "CUDA 13.2 alignment," but installs cuda-toolkit-13-1 (CUDA 13.1) on line 3660, while the PyTorch packages target cu130 (CUDA 13.0). This creates a three-way version mismatch: the block header claims 13.2, the toolkit is 13.1, and PyTorch targets 13.0. Update the toolkit installation to match the PyTorch CUDA version (either upgrade to cuda-toolkit-13-2 and update PyTorch to cu132, or downgrade PyTorch to match the 13.1 toolkit) to ensure the sanity jobs validate a consistent environment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 3662 - 3667, The comment points out a
CUDA version mismatch: the block says "CUDA 13.2" but installs cuda-toolkit-13-1
while pip installs torch==2.11.0+cu130; fix by aligning all three to CUDA 13.2 —
update the toolkit install from cuda-toolkit-13-1 to cuda-toolkit-13-2 and
change the pip install inside trtllm_utils.llmExecStepWithRetry to install torch
and torchvision built for cu132 (e.g., torch==2.11.0+cu132
torchvision==0.26.0+cu132 and corresponding index URL), ensuring the echo/header
text, toolkit package name, and pip package tags all match.

Comment thread requirements.txt
 # https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02 uses 2.29.2
 # torch 2.10.0+cu130 depends on nvidia-nccl-cu13==2.28.9
 nvidia-nccl-cu13>=2.28.9,<=2.29.2
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7

⚠️ Potential issue | 🟡 Minor

Fix the release-notes reference typo in the comment.

Line 30 references rel-26-0; this should be rel-26-03 to keep the guidance accurate.

Suggested fix
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03 uses 2.29.7
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03 uses 2.29.7
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@requirements.txt` at line 30, Update the comment in requirements.txt that
currently reads "rel-26-0" to the correct "rel-26-03" so the release-notes URL
is accurate; locate the comment line containing
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0"
and replace the trailing fragment "rel-26-0" with "rel-26-03".

tensorrt-cicd (Collaborator) commented:

PR_Github #45654 [ run ] triggered by Bot. Commit: b4c21fa Link to invocation

