[None][fix] Update CI Agg test's mpi2 to mpix #13491
chenfeiz0326 wants to merge 21 commits into NVIDIA:main
Conversation
Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Node2-GPU8-Post-Merge*"
📝 Walkthrough

This PR updates multiple dependency and container image versions across the project, including CUDA (13.1.1 → 13.2.0), PyTorch (2.10.0 → 2.11.0), TensorRT (10.15.1.29 → 10.16.0.72), and base container tags (26.02 → 26.03). Additionally, it updates CUDA virtual memory API usage to employ brace-initialization for CUmemLocation construction.
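The brace-initialization style the walkthrough refers to can be sketched as follows. The type definitions below are minimal stand-ins mirroring the CUDA driver API shapes so the snippet compiles on its own (real code includes `<cuda.h>`), and `makeDevicePinnedProp` is an illustrative helper, not a function from the PR diff:

```cpp
#include <cassert>

// Stand-ins mirroring the CUDA driver API shapes (illustrative only; real
// code includes <cuda.h> and uses the driver's definitions).
enum CUmemLocationType { CU_MEM_LOCATION_TYPE_DEVICE = 1 };
enum CUmemAllocationType { CU_MEM_ALLOCATION_TYPE_PINNED = 1 };
struct CUmemLocation
{
    CUmemLocationType type;
    int id;
};
struct CUmemAllocationProp
{
    CUmemAllocationType type;
    CUmemLocation location;
};

// Illustrative helper: construct the location with brace-initialization in a
// single expression instead of zero-initializing and assigning field by field.
CUmemAllocationProp makeDevicePinnedProp(int deviceId)
{
    CUmemAllocationProp prop{};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location = CUmemLocation{CU_MEM_LOCATION_TYPE_DEVICE, deviceId};
    return prop;
}
```

Aggregate brace-initialization keeps the member order explicit and guarantees no field is left uninitialized, which is the usual motivation for this kind of cleanup.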
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
constraints.txt (1)
1-8: ⚠️ Potential issue | 🟡 Minor
Remove unnecessary vulnerability workarounds that are not present in pytorch:26.03-py3.
The vulnerabilities listed are already addressed in the updated base image:
- wheel: CVE-2026-24049 is fixed in wheel 0.46.2, which is already installed via the base image's pip upgrade
- tornado: Not installed in pytorch:26.03-py3 as it is not a core PyTorch dependency
- black: Not installed in pytorch:26.03-py3 as it is not a runtime dependency
Remove the constraints for `tornado>=6.5.5` and `black>=26.3.1` entirely. Consider removing the `wheel>=0.46.2` constraint as well, since the fixed version is already provided by the base image's dependency resolution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@constraints.txt` around lines 1 - 8, Remove the unnecessary vulnerability workaround constraints: delete the lines containing "tornado>=6.5.5" and "black>=26.3.1" from constraints.txt (these packages are not present in the pytorch:26.03-py3 runtime), and optionally remove "wheel>=0.46.2" as well since the base image already provides a fixed wheel; ensure only the remaining necessary constraints stay in the file.

cpp/tensorrt_llm/runtime/virtualMemory.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update the copyright year for this modified file.
The file was changed in this PR, but the header still ends at 2025.
Proposed fix
Proposed fix

- * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION. All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/runtime/virtualMemory.cpp` at line 2, The file header in virtualMemory.cpp still shows "2025"; update the NVIDIA copyright header year to the latest modification year (2026) in the top-of-file header comment so the file complies with the TensorRT-LLM guideline requiring the year of latest meaningful modification.

cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Refresh the copyright header year range.
This test file is modified, but the header still ends at 2024.
Proposed fix
Proposed fix

- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp` at line 2, Update the copyright header in cudaDriverWrapperTest.cpp: replace the ending year range "2022-2024" with the current latest modification year (e.g., "2022-2026") so the top-of-file NVIDIA copyright header reflects the latest meaningful modification.

cpp/include/tensorrt_llm/runtime/virtualMemory.h (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update the copyright year for this modified file.
This file was modified, but the header still ends at 2025.
Proposed fix
Proposed fix

- * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION. All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/tensorrt_llm/runtime/virtualMemory.h` at line 2, Update the copyright header year in the file virtualMemory.h to reflect the latest meaningful modification (replace "2025" with the correct current year) so the NVIDIA copyright header complies with project guidelines; locate the top-of-file header comment in cpp/include/tensorrt_llm/runtime/virtualMemory.h and change the year token in the existing copyright line.

cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor
Update the header year to reflect this modification.
The file is updated in this PR, but the copyright year still ends at 2025.
Proposed fix
Proposed fix

- * Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2025-2026, NVIDIA CORPORATION. All rights reserved.

As per coding guidelines: "All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification" and "update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp` at line 2, Update the copyright header in virtualMemoryTest.cpp to reflect the latest modification year (change 2025 to 2026) so the NVIDIA copyright line shows the current year of modification.
🧹 Nitpick comments (1)
jenkins/Build.groovy (1)
409-412: Avoid hardcoding `tritonShortTag`; derive it from `docker/Dockerfile.multi`. Line 410 says the Dockerfile is the source of truth, but Line 411 is hardcoded. This can drift on the next tag bump.
Refactor sketch
- // Get triton tag from docker/dockerfile.multi
- def tritonShortTag = "r26.03"
+ // Get triton tag from docker/Dockerfile.multi (source of truth)
+ def tritonBaseTag = sh(
+     script: "grep '^ARG TRITON_BASE_TAG=' ${LLM_ROOT}/docker/Dockerfile.multi | cut -d= -f2",
+     returnStdout: true
+ ).trim()
+ def tritonShortTag = "r${tritonBaseTag.replace('-py3', '')}"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/Build.groovy` around lines 409 - 412, The tritonShortTag is hardcoded as "r26.03" which can drift from the Dockerfile; modify the Build.groovy snippet that defines tritonShortTag and the sh invocation so tritonShortTag is parsed from docker/Dockerfile.multi instead of being a literal. Locate the def tritonShortTag declaration and replace it with code that reads the Dockerfile.multi contents (path: docker/Dockerfile.multi), extracts the TRITON tag value (the same tag used in that file), assigns it to tritonShortTag, and then use that variable in the existing sh command that references LLM_ROOT, llmPath, and buildJobs (no other sh changes needed). Ensure the parsing is robust to whitespace and comment lines so the extracted tag matches the format expected by the cmake flags.
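The grep/cut/suffix-stripping approach in the refactor sketch can be exercised in isolation like this. The ARG name `TRITON_BASE_TAG` and the `26.03-py3` value are assumptions for illustration, not read from the actual Dockerfile:

```shell
# Create a stand-in Dockerfile fragment (assumed ARG name and value).
dockerfile="$(mktemp)"
cat > "$dockerfile" <<'EOF'
# comment lines and surrounding whitespace should be tolerated
ARG TRITON_BASE_TAG=26.03-py3
EOF

# Extract the tag value, strip the -py3 suffix, and prepend the 'r' prefix.
triton_base_tag="$(grep -E '^ARG TRITON_BASE_TAG=' "$dockerfile" | cut -d= -f2 | tr -d '[:space:]')"
triton_short_tag="r${triton_base_tag%-py3}"
echo "$triton_short_tag"   # prints r26.03

rm -f "$dockerfile"
```

Anchoring the match to `^ARG TRITON_BASE_TAG=` keeps comment lines from matching, which addresses the robustness concern in the prompt above.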
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/common/install_pytorch.sh`:
- Around line 5-8: Update the NVIDIA release-notes URL in the header comment to
the proper 26.03 stack to match TORCH_VERSION="2.11.0"; locate the commented
reference to
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02"
and replace it with the 26.03-equivalent release notes URL so the guidance
matches the TORCH_VERSION variable and current maintenance baseline.
In `@jenkins/L0_Test.groovy`:
- Line 1: The pipeline is pinned to a mutable feature branch in the `@Library`
declaration; update the library reference in the `@Library` annotation (the line
starting with "@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node',
...])") to reference a stable source instead—either change
"bloom-jenkins-shared-lib@emma/update_nsc_login_node" to
"bloom-jenkins-shared-lib@main" or pin it to an immutable revision (commit SHA
or tagged release) so the shared library cannot change unexpectedly.
- Around line 3662-3667: The comment points out a CUDA version mismatch: the
block says "CUDA 13.2" but installs cuda-toolkit-13-1 while pip installs
torch==2.11.0+cu130; fix by aligning all three to CUDA 13.2 — update the toolkit
install from cuda-toolkit-13-1 to cuda-toolkit-13-2 and change the pip install
inside trtllm_utils.llmExecStepWithRetry to install torch and torchvision built
for cu132 (e.g., torch==2.11.0+cu132 torchvision==0.26.0+cu132 and corresponding
index URL), ensuring the echo/header text, toolkit package name, and pip package
tags all match.
- Around line 1165-1167: The code now unconditionally adds "--mpi=pmix" when
nodeCount > 1 (see srunArgs.add("--mpi=pmix")) for any stage where stageName
does not contain "Disagg-PerfSanity", but the docs
(jenkins/scripts/perf/README.md) still reference "--mpi=pmi2"; either update
that README entry to document "--mpi=pmix" for non-disaggregated multi-node
jobs, or restrict the code change by gating the pmix addition to the specific
aggregated stages (e.g., check stageName for the exact aggregated stage(s)
instead of using a broad negation of "Disagg-PerfSanity"), ensuring consistency
between srunArgs behavior and the README.
In `@requirements.txt`:
- Line 30: Update the comment in requirements.txt that currently reads
"rel-26-0" to the correct "rel-26-03" so the release-notes URL is accurate;
locate the comment line containing
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0"
and replace the trailing fragment "rel-26-0" with "rel-26-03".
---
Outside diff comments:
In `@constraints.txt`:
- Around line 1-8: Remove the unnecessary vulnerability workaround constraints:
delete the lines containing "tornado>=6.5.5" and "black>=26.3.1" from
constraints.txt (these packages are not present in the pytorch:26.03-py3
runtime), and optionally remove "wheel>=0.46.2" as well since the base image
already provides a fixed wheel; ensure only the remaining necessary constraints
stay in the file.
In `@cpp/include/tensorrt_llm/runtime/virtualMemory.h`:
- Line 2: Update the copyright header year in the file virtualMemory.h to
reflect the latest meaningful modification (replace "2025" with the correct
current year) so the NVIDIA copyright header complies with project guidelines;
locate the top-of-file header comment in
cpp/include/tensorrt_llm/runtime/virtualMemory.h and change the year token in
the existing copyright line.
In `@cpp/tensorrt_llm/runtime/virtualMemory.cpp`:
- Line 2: The file header in virtualMemory.cpp still shows "2025"; update the
NVIDIA copyright header year to the latest modification year (2026) in the
top-of-file header comment so the file complies with the TensorRT-LLM guideline
requiring the year of latest meaningful modification.
In `@cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp`:
- Line 2: Update the copyright header in cudaDriverWrapperTest.cpp: replace the
ending year range "2022-2024" with the current latest modification year (e.g.,
"2022-2026") so the top-of-file NVIDIA copyright header reflects the latest
meaningful modification.
In `@cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp`:
- Line 2: Update the copyright header in virtualMemoryTest.cpp to reflect the
latest modification year (change 2025 to 2026) so the NVIDIA copyright line
shows the current year of modification.
---
Nitpick comments:
In `@jenkins/Build.groovy`:
- Around line 409-412: The tritonShortTag is hardcoded as "r26.03" which can
drift from the Dockerfile; modify the Build.groovy snippet that defines
tritonShortTag and the sh invocation so tritonShortTag is parsed from
docker/Dockerfile.multi instead of being a literal. Locate the def
tritonShortTag declaration and replace it with code that reads the
Dockerfile.multi contents (path: docker/Dockerfile.multi), extracts the TRITON
tag value (the same tag used in that file), assigns it to tritonShortTag, and
then use that variable in the existing sh command that references LLM_ROOT,
llmPath, and buildJobs (no other sh changes needed). Ensure the parsing is
robust to whitespace and comment lines so the extracted tag matches the format
expected by the cmake flags.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cf49de2b-5061-4f4d-8ebc-0c231f17cf0b
📒 Files selected for processing (16)
README.md
constraints.txt
cpp/include/tensorrt_llm/runtime/virtualMemory.h
cpp/tensorrt_llm/runtime/virtualMemory.cpp
cpp/tests/unit_tests/common/cudaDriverWrapperTest.cpp
cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp
docker/Dockerfile.multi
docker/Makefile
docker/common/install_cuda_toolkit.sh
docker/common/install_pytorch.sh
docker/common/install_tensorrt.sh
docs/source/legacy/reference/support-matrix.md
jenkins/Build.groovy
jenkins/L0_Test.groovy
jenkins/current_image_tags.properties
requirements.txt
# Use latest stable version from https://pypi.org/project/torch/#history
# and closest to the version specified in
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
-TORCH_VERSION="2.10.0"
+TORCH_VERSION="2.11.0"
Update the release-notes link to match the new baseline.
Line 7 still points to rel-26-02, but Line 8 now targets PyTorch 2.11.0 (26.03 stack). This leaves stale maintenance guidance.
Suggested fix
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/common/install_pytorch.sh` around lines 5 - 8, Update the NVIDIA
release-notes URL in the header comment to the proper 26.03 stack to match
TORCH_VERSION="2.11.0"; locate the commented reference to
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02"
and replace it with the 26.03-equivalent release notes URL so the guidance
matches the TORCH_VERSION variable and current maintenance baseline.
@@ -1,4 +1,4 @@
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node', 'trtllm-jenkins-shared-lib@main']) _
Avoid pinning the pipeline to a feature-branch shared library.
This makes the PR depend on mutable branch state outside the repo, so the pipeline can change under the same commit or break if that branch is rebased/deleted. Please switch back to @main or pin an immutable revision instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jenkins/L0_Test.groovy` at line 1, The pipeline is pinned to a mutable
feature branch in the `@Library` declaration; update the library reference in the
`@Library` annotation (the line starting with
"@Library(['bloom-jenkins-shared-lib@emma/update_nsc_login_node', ...])") to
reference a stable source instead—either change
"bloom-jenkins-shared-lib@emma/update_nsc_login_node" to
"bloom-jenkins-shared-lib@main" or pin it to an immutable revision (commit SHA
or tagged release) so the shared library cannot change unexpectedly.
if(nodeCount > 1) {
-    srunArgs.add("--mpi=pmi2")
+    srunArgs.add("--mpi=pmix")
}
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify how MPI modes are used and documented in the repo.
rg -n -C2 --fixed-strings -- '--mpi=' \
jenkins/L0_Test.groovy \
jenkins/scripts/perf/README.md \
jenkins/scripts/perf/local/submit.py \
  jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh

Repository: NVIDIA/TensorRT-LLM
Length of output: 4895
🏁 Script executed:
# Check the broader context around lines 1165-1167
sed -n '1130,1180p' jenkins/L0_Test.groovy | cat -n
# Also check what config/stage gating exists in the surrounding code
sed -n '1100,1200p' jenkins/L0_Test.groovy | cat -n

Repository: NVIDIA/TensorRT-LLM
Length of output: 9405
🏁 Script executed:
# Find the function/method containing lines 1165-1167
sed -n '1000,1165p' jenkins/L0_Test.groovy | grep -E '(def |stage |if \(|else if)' | tail -20
# Also check for stage references near this code
sed -n '1100,1200p' jenkins/L0_Test.groovy | grep -E '(stageName|stage\(|if.*agg|if.*config)' | head -15

Repository: NVIDIA/TensorRT-LLM
Length of output: 1593
🏁 Script executed:
# Find the function/stage this code belongs to and how it's called
grep -n "def.*disaggMode\|def.*nodeCount\|if.*disaggMode.*{" jenkins/L0_Test.groovy | head -20
# Search for where stageName is checked/filtered to see if there's config-level gating
rg -n "stageName.*==" jenkins/L0_Test.groovy | head -20
# Look for the broader scope - find function definition containing line 1165
sed -n '800,1165p' jenkins/L0_Test.groovy | grep -n "^\s*def \|^\s*stage(" | tail -5

Repository: NVIDIA/TensorRT-LLM
Length of output: 2019
🏁 Script executed:
# Check git history to see what the original code was
git log --oneline -20 jenkins/L0_Test.groovy
# Look for any comments or rationale about pmi2 vs pmix
rg -i "pmi2|pmix" jenkins/L0_Test.groovy -B2 -A2
# Check the README more carefully about when to use which MPI
sed -n '85,130p' jenkins/scripts/perf/README.md

Repository: NVIDIA/TensorRT-LLM
Length of output: 2592
Update README.md to reflect pmix for non-disaggregated multi-node jobs, or add stage-specific filtering.
The code now uses --mpi=pmix for all non-disaggregated multi-node jobs (any stage where stageName does not contain "Disagg-PerfSanity"), but jenkins/scripts/perf/README.md:114 still documents --mpi=pmi2 for this path. Either update the documentation to reflect the change to pmix, or if pmix should only apply to specific aggregated stages, add stage/config gating instead of changing the shared default for all non-disaggregated jobs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jenkins/L0_Test.groovy` around lines 1165 - 1167, The code now
unconditionally adds "--mpi=pmix" when nodeCount > 1 (see
srunArgs.add("--mpi=pmix")) for any stage where stageName does not contain
"Disagg-PerfSanity", but the docs (jenkins/scripts/perf/README.md) still
reference "--mpi=pmi2"; either update that README entry to document "--mpi=pmix"
for non-disaggregated multi-node jobs, or restrict the code change by gating the
pmix addition to the specific aggregated stages (e.g., check stageName for the
exact aggregated stage(s) instead of using a broad negation of
"Disagg-PerfSanity"), ensuring consistency between srunArgs behavior and the
README.
// Extra PyTorch CUDA 13.2 install for all bare-metal environments (Default PyTorch is for CUDA 12.8)
if (values[6]) {
-    echo "###### Extra PyTorch CUDA 13.0 install Start ######"
+    echo "###### Extra PyTorch CUDA 13.2 install Start ######"
    // Use internal mirror instead of https://download.pytorch.org/whl/cu130 for better network stability.
    // PyTorch CUDA 13.0 package and torchvision package can be installed as expected.
-    trtllm_utils.llmExecStepWithRetry(pipeline, script: "pip3 install torch==2.10.0+cu130 torchvision==0.25.0+cu130 --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/pytorch-cu128-remote/simple --extra-index-url https://download.pytorch.org/whl/cu130")
+    trtllm_utils.llmExecStepWithRetry(pipeline, script: "pip3 install torch==2.11.0+cu130 torchvision==0.26.0+cu130 --extra-index-url https://urm.nvidia.com/artifactory/api/pypi/pytorch-cu128-remote/simple --extra-index-url https://download.pytorch.org/whl/cu130")
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the bare-metal CUDA/PyTorch provisioning block stays internally consistent.
rg -n -C3 'cuda-toolkit-13-1|Extra PyTorch CUDA 13\.2|torch==2\.11\.0\+cu130|torchvision==0\.26\.0\+cu130' \
  jenkins/L0_Test.groovy

Repository: NVIDIA/TensorRT-LLM
Length of output: 1558
Align the bare-metal CUDA toolkit version with the stated PyTorch target.
The block at line 3662 claims "CUDA 13.2 alignment," but installs cuda-toolkit-13-1 (CUDA 13.1) on line 3660, while the PyTorch packages target cu130 (CUDA 13.0). This creates a three-way version mismatch: the block header claims 13.2, the toolkit is 13.1, and PyTorch targets 13.0. Update the toolkit installation to match the PyTorch CUDA version (either upgrade to cuda-toolkit-13-2 and update PyTorch to cu132, or downgrade PyTorch to match the 13.1 toolkit) to ensure the sanity jobs validate a consistent environment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@jenkins/L0_Test.groovy` around lines 3662 - 3667, The comment points out a
CUDA version mismatch: the block says "CUDA 13.2" but installs cuda-toolkit-13-1
while pip installs torch==2.11.0+cu130; fix by aligning all three to CUDA 13.2 —
update the toolkit install from cuda-toolkit-13-1 to cuda-toolkit-13-2 and
change the pip install inside trtllm_utils.llmExecStepWithRetry to install torch
and torchvision built for cu132 (e.g., torch==2.11.0+cu132
torchvision==0.26.0+cu132 and corresponding index URL), ensuring the echo/header
text, toolkit package name, and pip package tags all match.
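One lightweight way to guard against this class of drift is a string-level check that the wheel's `+cuXYZ` tag agrees with the toolkit package name. A minimal sketch, with both values hardcoded purely for illustration (a real check would read them from the pipeline configuration):

```shell
# Hypothetical inputs; in the pipeline these would come from the pip spec
# and the apt package name used in the provisioning block.
pip_spec="torch==2.11.0+cu132"
toolkit_pkg="cuda-toolkit-13-2"

# '132' from the wheel tag; '132' from 'cuda-toolkit-13-2'.
wheel_tag="${pip_spec##*+cu}"
toolkit_tag="$(printf '%s' "$toolkit_pkg" | sed -E 's/^cuda-toolkit-([0-9]+)-([0-9]+)$/\1\2/')"

if [ "$wheel_tag" = "$toolkit_tag" ]; then
    echo "CUDA versions match (cu$wheel_tag)"
else
    echo "CUDA version mismatch: wheel cu$wheel_tag vs toolkit cu$toolkit_tag" >&2
    exit 1
fi
```

Run as an early pipeline step, a mismatch like `cu130` against `cuda-toolkit-13-1` would fail fast instead of surfacing later as a confusing runtime error.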
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-02.html#rel-26-02 uses 2.29.2
# torch 2.10.0+cu130 depends on nvidia-nccl-cu13==2.28.9
nvidia-nccl-cu13>=2.28.9,<=2.29.2
# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7
Fix the release-notes reference typo in the comment.
Line 30 references rel-26-0; this should be rel-26-03 to keep the guidance accurate.
Suggested fix
-# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0 uses 2.29.7
+# https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-03 uses 2.29.7
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@requirements.txt` at line 30, Update the comment in requirements.txt that
currently reads "rel-26-0" to the correct "rel-26-03" so the release-notes URL
is accurate; locate the comment line containing
"https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-26-03.html#rel-26-0"
and replace the trailing fragment "rel-26-0" with "rel-26-03".
PR_Github #45654 [ run ] triggered by Bot. Commit:
Summary by CodeRabbit
Bug Fixes
Documentation
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.