build and test against CUDA 13.1.0 #747
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
The label checker was stuck with an error, so I removed and re-added a label to rerun it.
/ok to test

/ok to test 5d8893c

/ok to test 4356fe0

/ok to test 4fd715f

/ok to test c2c614f

/ok to test c4e79e4
@jameslamb This PR passes all the jobs.
Wow amazing, thank you for working on this @rgsl888prabhu !!! I'd just assumed we wouldn't be able to do this until more of RAPIDS was building, glad you got it working. I just took it out of draft. If you approve this, let's merge it 😁
📝 Walkthrough

Updates CUDA version references from 13.0 to 13.1 across GitHub Actions workflows, conda environment files, and build configurations. Switches shared-workflow references from `@main` to `@cuda-13.1.0`.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
.github/workflows/test.yaml (1)

31-68: The `cuda-13.1.0` tag does not exist in rapidsai/shared-workflows.

All five workflow references (conda-cpp-tests, conda-python-tests, wheel-tests-cuopt, wheel-tests-cuopt-server, and conda-notebook-tests) are pinned to `@cuda-13.1.0`, which is not a published tag in the repository. Verify the correct tag name and update all references to use the intended tag version.

.github/workflows/build.yaml (1)

47-203: The `cuda-13.1.0` tag does not exist in rapidsai/shared-workflows and will cause CI/CD failures.

All 14 workflow references are pinned to `@cuda-13.1.0`, but this tag does not exist in the repository. The shared-workflows repository uses semantic versioning tags like `v26.02.00a`, `v25.12.00a`, etc. Update the workflow references to use an existing tag, or ensure the `cuda-13.1.0` tag is created in rapidsai/shared-workflows before merging.
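For context, these references use GitHub's reusable-workflow pinning syntax, where everything after `@` must name an existing branch or tag. A hypothetical excerpt (job inputs omitted for brevity) showing the pattern the comments are about:

```yaml
# Hypothetical excerpt from .github/workflows/test.yaml — inputs omitted.
jobs:
  conda-cpp-tests:
    # The ref after '@' must exist in rapidsai/shared-workflows; the review
    # notes 'cuda-13.1.0' was not a published tag at review time.
    uses: rapidsai/shared-workflows/.github/workflows/conda-cpp-tests.yaml@cuda-13.1.0
    secrets: inherit
```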
🤖 Fix all issues with AI agents
In @dependencies.yaml:
- Line 10: Update the README examples that still reference the old Docker tag "cuda13.0" to the correct tag "cuda13.1-py3.13"; specifically, replace occurrences of "cuda13.0" in the examples (currently around the README lines that show the Docker image tags) with "cuda13.1-py3.13" so the documentation matches the conda environment variants (cuda: ["12.9", "13.1"]).
🧹 Nitpick comments (1)
cpp/src/utilities/driver_helpers.cuh (1)
21-25: Hardcoded CUDA 13.0 version for backward compatibility is appropriate, but consider making it configurable per guidelines.

The code requests version `13000` (CUDA 13.0) symbols despite the PR updating to CUDA 13.1.0. This is a deliberate backward-compatibility strategy: a binary built with the CUDA 13.1 toolkit can run on systems with older CUDA 13.0 drivers. This approach is sound because:

- The code uses only CUDA 13.0 APIs (`cuDevSmResourceSplitByCount`, `cuGreenCtxCreate`, etc.), not CUDA 13.1-specific features like `cuDevSmResourceSplit` or workqueue resources.
- Requesting `13000` ensures the binary doesn't break on CUDA 13.0 drivers.

However, this hardcoded version conflicts with the coding guideline to "abstract multi-backend support for different CUDA versions." Consider making the version configurable (e.g., as a build-time macro or a compile constant matching `CUDART_VERSION`) rather than hardcoded, allowing future flexibility if CUDA 13.1 features are needed.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)

- .github/workflows/build.yaml
- .github/workflows/pr.yaml
- .github/workflows/test.yaml
- .github/workflows/trigger-breaking-change-alert.yaml
- conda/environments/all_cuda-131_arch-aarch64.yaml
- conda/environments/all_cuda-131_arch-x86_64.yaml
- cpp/src/utilities/driver_helpers.cuh
- dependencies.yaml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cu,cuh}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh}: Every CUDA kernel launch and memory operation must have error checking with CUDA_CHECK or equivalent verification
Avoid reinventing functionality already available in Thrust, CCCL, or RMM libraries; prefer standard library utilities over custom implementations
Files:
cpp/src/utilities/driver_helpers.cuh
**/*.{cu,cuh,cpp,hpp,h}
📄 CodeRabbit inference engine (.github/.coderabbit_review_guide.md)
**/*.{cu,cuh,cpp,hpp,h}: Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results
Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks
Validate correct initialization of variable bounds, constraint coefficients, and algorithm state before solving; ensure reset when transitioning between algorithm phases (presolve, simplex, diving, crossover)
Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations
For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Assess algorithmic complexity for large-scale problems (millions of variables/constraints); ensure O(n log n) or better complexity, not O(n²) or worse
Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Identify assertions with overly strict numerical tolerances that fail on legitimate degenerate/edge cases (near-zero pivots, singular matrices, empty problems)
Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Check that hard-coded GPU de...
Files:
cpp/src/utilities/driver_helpers.cuh
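The first guideline above assumes a `CUDA_CHECK`-style wrapper. For readers unfamiliar with the idiom, a generic minimal sketch (not the project's actual macro, which may throw or propagate an error code instead of aborting):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Generic error-checking idiom: evaluate a CUDA runtime call once and
// report file/line context on failure.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t status_ = (call);                                     \
    if (status_ != cudaSuccess) {                                     \
      std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                   cudaGetErrorString(status_), __FILE__, __LINE__);  \
      std::abort();                                                   \
    }                                                                 \
  } while (0)

// Usage: CUDA_CHECK(cudaMemcpyAsync(dst, src, n, cudaMemcpyDefault, stream));
```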
🧠 Learnings (12)
📓 Common learnings
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*test*.{cpp,cu,py} : Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment
Applied to files:
.github/workflows/test.yaml
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Track GPU device memory allocations and deallocations to prevent memory leaks; ensure cudaMalloc/cudaFree balance and cleanup of streams/events
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cpp,hpp,h} : Avoid inappropriate use of exceptions in performance-critical GPU operation paths; prefer error codes or CUDA error checking for latency-sensitive code
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Refactor code duplication in solver components (3+ occurrences) into shared utilities; for GPU kernels, use templated device functions to avoid duplication
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.cu : Verify race conditions and correctness of GPU kernel shared memory, atomics, and warp-level operations
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : Eliminate unnecessary host-device synchronization (cudaDeviceSynchronize) in hot paths that blocks GPU pipeline; use streams and events for async execution
Applied to files:
cpp/src/utilities/driver_helpers.cuh
📚 Learning: 2025-11-25T10:20:49.822Z
Learnt from: CR
Repo: NVIDIA/cuopt PR: 0
File: .github/.coderabbit_review_guide.md:0-0
Timestamp: 2025-11-25T10:20:49.822Z
Learning: Applies to **/*.{cu,cuh,cpp,hpp,h} : For concurrent CUDA operations (barriers, async operations), explicitly create and manage dedicated streams instead of reusing the default stream; document stream lifecycle
Applied to files:
cpp/src/utilities/driver_helpers.cuh
🔇 Additional comments (9)
conda/environments/all_cuda-131_arch-aarch64.yaml (2)
20-20: LGTM: CUDA version update is correct.

The update from `cuda-version=13.0` to `cuda-version=13.1` aligns with the PR objective to build and test against CUDA 13.1.0.
78-78: LGTM: Environment name correctly updated.

The environment name change from `all_cuda-130_arch-aarch64` to `all_cuda-131_arch-aarch64` is consistent with the CUDA version update.

.github/workflows/test.yaml (1)
1-1: LGTM: Copyright year updated appropriately.

The copyright year bump to 2026 is appropriate for changes made in 2026.
.github/workflows/trigger-breaking-change-alert.yaml (1)
1-18: LGTM: Consistent workflow and copyright updates.

The copyright year update and the workflow reference change from `@main` to `@cuda-13.1.0` are consistent with the other workflow files in this PR.

conda/environments/all_cuda-131_arch-x86_64.yaml (1)
20-78: LGTM: Consistent CUDA version update across architectures.

The CUDA version and environment name updates are consistent with the aarch64 variant, ensuring uniform CUDA 13.1 support across both x86_64 and aarch64 architectures.
.github/workflows/pr.yaml (2)
1-1: LGTM: Copyright year updated.

Copyright year appropriately bumped to 2026.
37-296: LGTM: Comprehensive workflow reference updates.

All 15 external workflow references have been consistently updated from `@main` to `@cuda-13.1.0`. The local workflow reference at line 305 (self_hosted_service_test.yaml) is correctly unchanged. This thorough update ensures all CI jobs use the CUDA 13.1.0-compatible workflows.

dependencies.yaml (1)
675-678: LGTM!

The new CUDA 13.1 matrix entry follows the established pattern and is correctly structured.
cpp/src/utilities/driver_helpers.cuh (1)
18-18: Good defensive initialization.

Initializing `func` to `nullptr` prevents undefined behavior and is a best-practice improvement.
```cpp
if (driver_status != cudaDriverEntryPointSuccess) {
  fprintf(stderr, "Failed to fetch symbol for %s\n", name);
  return static_cast<void*>(nullptr);
}
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find all call sites of get_driver_entry_point and show surrounding context
rg -nP --type=cpp --type=cuda -C5 '\bget_driver_entry_point\s*\(' -g '!driver_helpers.cuh'
```

Repository: NVIDIA/cuopt
Length of output: 84

🏁 Script executed:

```bash
cat -n cpp/src/utilities/driver_helpers.cuh | head -40
```

Repository: NVIDIA/cuopt
Length of output: 1312

🏁 Script executed:

```bash
rg -nP --type=cpp --type=cuda 'nullptr' cpp/src/utilities/ -A 2 -B 2
```

Repository: NVIDIA/cuopt
Length of output: 84

🏁 Script executed:

```bash
rg -nP --type=cpp -C5 'get_driver_entry_point\s*\(' -g '!driver_helpers.cuh'
```

Repository: NVIDIA/cuopt
Length of output: 38

🏁 Script executed:

```bash
rg -l 'driver_helpers.cuh' --type=cpp
```

Repository: NVIDIA/cuopt
Length of output: 38

🏁 Script executed:

```bash
fd -e cuh | xargs rg -l 'driver_helpers.cuh'
```

Repository: NVIDIA/cuopt
Length of output: 96

🏁 Script executed:

```bash
rg -nP 'get_driver_entry_point' cpp/src/dual_simplex/sparse_cholesky.cuh -C5
```

Repository: NVIDIA/cuopt
Length of output: 4328

🏁 Script executed:

```bash
fd -e cuh -type f
```

Repository: NVIDIA/cuopt
Length of output: 226

🏁 Script executed:

```bash
rg -nP 'CU_CHECK|CUDA_CHECK|auto.*get_driver_entry_point' cpp/src/dual_simplex/sparse_cholesky.cuh -B 2 -A 2
```

Repository: NVIDIA/cuopt
Length of output: 2660

🏁 Script executed:

```bash
rg -nP 'void.*function_name|auto.*function_name|CUdevResource.*function|void setup_green_context|solve\(' cpp/src/dual_simplex/sparse_cholesky.cuh -B 5 -A 15
```

Repository: NVIDIA/cuopt
Length of output: 3025

🏁 Script executed:

```bash
sed -n '150,175p' cpp/src/dual_simplex/sparse_cholesky.cuh
```

Repository: NVIDIA/cuopt
Length of output: 1266

🏁 Script executed:

```bash
sed -n '155,170p' cpp/src/dual_simplex/sparse_cholesky.cuh
```

Repository: NVIDIA/cuopt
Length of output: 879
Add nullptr checks before dereferencing function pointers returned by get_driver_entry_point.

The function returns `nullptr` on symbol lookup failure, but call sites in sparse_cholesky.cuh (lines 159, 163, 182, 204, 227, 235, 367, 370) immediately cast and call the returned pointer without validation. If `get_driver_entry_point` returns `nullptr`, the code will dereference a null function pointer, causing undefined behavior. Callers must check the return value and propagate errors appropriately:

```cpp
auto func = cuopt::detail::get_driver_entry_point("symbol_name");
if (func == nullptr) {
  // Handle error and propagate to caller
  return error_code;
}
CU_CHECK(reinterpret_cast<decltype(...)>(func)(...));
```

This is required by the coding guideline: error propagation from CUDA to user-facing APIs must be complete, and the solve() method must communicate failures rather than crashing silently.
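One way to make these checks hard to forget is to centralize them in a typed wrapper. A hypothetical sketch (the wrapper name is illustrative, not existing project code; it assumes `get_driver_entry_point` takes a symbol name and returns `void*`, as the snippet above shows):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: fetches a driver symbol and fails loudly instead of
// letting callers dereference a null function pointer.
template <typename FnPtr>
FnPtr get_driver_entry_point_checked(const char* name)
{
  void* func = cuopt::detail::get_driver_entry_point(name);
  if (func == nullptr) {
    throw std::runtime_error(std::string{"missing CUDA driver symbol: "} + name);
  }
  return reinterpret_cast<FnPtr>(func);
}

// Usage sketch:
// auto split = get_driver_entry_point_checked<decltype(&cuDevSmResourceSplitByCount)>(
//   "cuDevSmResourceSplitByCount");
```

Since the guidelines above discourage exceptions in performance-critical GPU paths, returning a status code with an out-parameter would fit this codebase better; the throw here is only for brevity.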
```diff
 output: conda
 matrix:
-  cuda: ["12.9", "13.0"]
+  cuda: ["12.9", "13.1"]
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Search for remaining references to CUDA 13.0 in the repository
echo "=== Searching for cuda 13.0 or cuda-13.0 references ==="
rg -n -i 'cuda[_-]?13\.0' --glob '!dependencies.yaml'
echo ""
echo "=== Checking all YAML files for cuda 13.0 ==="
fd -e yaml -e yml | xargs rg -l '13\.0' 2>/dev/null | head -20
echo ""
echo "=== Full repository file listing to understand structure ==="
git ls-files | head -30
```

Repository: NVIDIA/cuopt
Length of output: 1870

🏁 Script executed:

```bash
#!/bin/bash
# Check for any CUDA 13.0 environment files or references
echo "=== Looking for cuda-130 or 13.0 environment files ==="
fd -e yaml -e yml | xargs ls -la 2>/dev/null | grep -i "130\|13\.0"
echo ""
echo "=== List all conda environment files ==="
fd -e yaml . conda/environments/ 2>/dev/null | sort
echo ""
echo "=== Check recipe.yaml for CUDA versions ==="
rg -n "13\." conda/recipes/cuopt/recipe.yaml 2>/dev/null
echo ""
echo "=== Check README for docker tag references in context ==="
rg -n -A2 -B2 "cuda.*13" README.md
```

Repository: NVIDIA/cuopt
Length of output: 1422
Update README.md Docker tags from CUDA 13.0 to 13.1.

The conda environment files have been correctly updated (only 12.9 and 13.1 variants exist), but the README.md documentation at lines 131 and 134 still references the outdated `cuda13.0` Docker tags. Update these examples to use `cuda13.1-py3.13` tags instead. The version constraints in conda/recipes/cuopt/recipe.yaml (lines 69-70 and 90-91) are conda package version ranges and do not need changes.
🤖 Prompt for AI Agents
In @dependencies.yaml at line 10: update the README examples that still reference the old Docker tag "cuda13.0" to the correct tag "cuda13.1-py3.13"; specifically, replace occurrences of "cuda13.0" in the examples (around the README lines that show the Docker image tags) with "cuda13.1-py3.13" so the documentation matches the conda environment variants (cuda: ["12.9", "13.1"]).
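For context on the `dependencies.yaml` matrix change reviewed here: in RAPIDS-style repositories, `dependencies.yaml` is consumed by rapids-dependency-file-generator, and each value in `matrix.cuda` produces a corresponding `all_cuda-*` conda environment file. A simplified, hypothetical sketch of the relevant shape (not the project's full file):

```yaml
# Simplified sketch of a dependencies.yaml entry.
files:
  all:
    output: conda
    matrix:
      cuda: ["12.9", "13.1"]   # each entry generates an all_cuda-<ver> env file
      arch: [x86_64, aarch64]  # combined with cuda to name the file, e.g.
                               # all_cuda-131_arch-x86_64.yaml
    includes:
      - build
      - run
```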
hlinsen left a comment
Thanks @rgsl888prabhu for the change and upgrade! Lgtm for C++
/merge
Contributes to rapidsai/build-planning#236

Tests that CI here will work with the changes from rapidsai/shared-workflows#483, switches CUDA 13 builds to CUDA 13.1.0, and adds some CUDA 13.1.0 test jobs.

Summary by CodeRabbit

New Features
- Added support for CUDA Toolkit 13.1, providing compatibility with the latest CUDA runtime and libraries.

Chores
- Updated CI/CD workflows for building, testing, and deployment to target CUDA 13.1.0
- Updated conda environment configurations to CUDA 13.1 for ARM and x86_64 architectures
- Updated copyright year to 2026

Authors:
- James Lamb (https://github.com/jameslamb)
- https://github.com/jakirkham
- Ramakrishnap (https://github.com/rgsl888prabhu)

Approvers:
- Hugo Linsenmaier (https://github.com/hlinsen)
- Ramakrishnap (https://github.com/rgsl888prabhu)

URL: NVIDIA#747