Fix nvbug6084457: Make NVLINK_MAX_LINKS version-dependent by mdboom · Pull Request #2192 · NVIDIA/cuda-python

mdboom · 2026-06-10T17:54:00Z

Summary

This PR fixes NvlinkInfo.max_links being a static compile-time constant (NVML_NVLINK_MAX_LINKS) that overestimates the actual number of NVLink ports on a given device. Because not all devices support all links up to the macro ceiling, iterating over that range would attempt to query links the device doesn't have, producing incorrect behavior or errors.

Changes

New Device APIs (get_nvlink_count, get_nvlinks) — query the device-specific NVLink count at runtime via nvmlDeviceGetFieldValues(NVML_FI_DEV_NVLINK_LINK_COUNT), and iterate over only the links actually present on that device.
get_nvlink validation fix — the link-index range check now uses get_nvlink_count() instead of the static NvlinkInfo.max_links, so ValueError is raised precisely for links not supported by the specific device.
NvlinkInfo.max_links deprecated — the class attribute now emits a DeprecationWarning (via a metaclass property) pointing users to Device.get_nvlink_count(). The static value is still returned for compatibility but callers should migrate.
Vendored Deprecated library — a trimmed copy of the Deprecated package (v1.3.1, MIT) is vendored under cuda_core/cuda/core/_vendored/deprecated/ with the wrapt dependency removed. Used to emit DeprecationWarning with Sphinx-compatible docstring decorators (@deprecated, @versionadded, @versionchanged).
Tests — test_nvlink updated to iterate via get_nvlink_count() and exercise get_nvlinks(); new test_nvlink_max_links_deprecated asserts the deprecation warning fires.
Release notes — 1.1.0-notes.rst updated with new APIs, the behaviour change to get_nvlink, and the deprecation of NvlinkInfo.max_links.
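
The device-specific enumeration and the stricter range check described above can be sketched roughly as follows. This is a stand-in, not the cuda.core implementation: FakeDevice and its link_count argument are hypothetical, the returned strings are placeholders, and only the method names get_nvlink_count, get_nvlink, and get_nvlinks come from this PR. Their real signatures and return types may differ.

```python
STATIC_MAX_LINKS = 18  # header ceiling quoted in this thread for CUDA before 13.3


class FakeDevice:
    """Hypothetical stand-in for a device that has `link_count` NVLink ports."""

    def __init__(self, link_count):
        self._link_count = link_count

    def get_nvlink_count(self):
        # In the PR this value comes from an NVML field query at runtime.
        return self._link_count

    def get_nvlink(self, link):
        # Validate against the device-specific count, not the static ceiling.
        count = self.get_nvlink_count()
        if not 0 <= link < count:
            raise ValueError(f"link must be in [0, {count}), got {link}")
        return f"nvlink-{link}"  # placeholder for a real link-info object

    def get_nvlinks(self):
        # Yield only the links that are present on this device.
        for link in range(self.get_nvlink_count()):
            yield self.get_nvlink(link)


device = FakeDevice(link_count=4)
links = list(device.get_nvlinks())
```

With a four-link device, iterating up to STATIC_MAX_LINKS would have probed fourteen links that do not exist; here an out-of-range index fails early with ValueError instead.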

rwgk · 2026-06-10T18:55:18Z


-NVLINK_MAX_LINKS = 18
+
+if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13):


Could we obtain this value from the source of truth?

E.g. I see:

$ grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
/usr/local/cuda_13.3.0_610.43.02_linux_kitpick035/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.

I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.

Maybe something like this could work?

int internal_only_get_NVLINK_MAX_LINKS() {
#ifdef NVML_NVLINK_MAX_LINKS
    return NVML_NVLINK_MAX_LINKS;
#else
    return 18;  // Prior to CUDA 13.3 this value was hard-wired.
#endif
}

If there isn't a practical solution, could you please add a comment to explain?

I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.

Are you sure? I see it in 12.9 through 13.3 (though the value changed in 13.3).

We have to build a single binary that works for every version 12.9 - 13.3, so the only way to do this is with a runtime computation. Additionally, we build without the nvml.h header present. I'll add a comment.
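
A runtime computation of this kind might look roughly like the sketch below. The helper names and the threshold argument are hypothetical; the values 18 and 36 are the header values quoted in this thread, and the version at which the ceiling changes has to be checked against what the version query actually reports.

```python
def parse_version(version):
    """Turn a dotted version string such as "13.3" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def max_links_for(version, threshold=(13, 3), old=18, new=36):
    # `threshold` is a placeholder, not a value confirmed by this PR.
    # Versions below it get the old ceiling; all others get the new one.
    return old if parse_version(version) < threshold else new


old_ceiling = max_links_for("12.9")
new_ceiling = max_links_for("13.3")
two_digit_minor = max_links_for("13.10")
```

Comparing tuples of integers rather than raw strings matters for two-digit components: as strings, "13.10" sorts before "13.3", while as tuples (13, 10) correctly sorts after (13, 3).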

Oh, sorry, I messed up like this (retrieved from my bash history; note the cuda_):

grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h

(I have a custom softlink for 13.3, which is why that matched.)

It looks much different like this:

$ grep NVLINK_MAX_LINKS /usr/local/cuda-*/include/nvml.h
/usr/local/cuda-12.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.4/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.5/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.6/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.8/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.9/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.

I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.

I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.

This is a module-level constant, so there's not much we can do with a helper function. @mdboom that said, this would force NVML to be loaded at import time, which breaks CPU-only envs. Can we hide this behind a module-level __getattr__?

(For example this would break the import test.)

Experimentally, it doesn't break CPU-only builds, but it does break on systems without NVML installed, so I agree the __getattr__ trick is probably justified, but it's narrower than you think. nvmlSystemGetVersion doesn't require nvmlInit and doesn't require a GPU.

CPU-only envs have a broad definition, including but not limited to: GPU driver is not installed 😉
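
The module-level __getattr__ idea raised above (PEP 562) can be sketched like this. The module name demo and the function expensive_lookup are hypothetical stand-ins; the point is only that the value is computed on first attribute access rather than at import time, so importing the module never touches the driver library.

```python
import types


def expensive_lookup():
    # Stand-in for a call that needs the driver library to be loadable.
    expensive_lookup.calls += 1
    return 18


expensive_lookup.calls = 0


def _module_getattr(name):
    # Called only when normal attribute lookup on the module fails (PEP 562).
    if name == "NVLINK_MAX_LINKS":
        return expensive_lookup()
    raise AttributeError(f"module 'demo' has no attribute {name!r}")


demo = types.ModuleType("demo")
demo.__getattr__ = _module_getattr  # same effect as `def __getattr__` in demo.py

calls_after_import = expensive_lookup.calls  # creating the module ran nothing
value = demo.NVLINK_MAX_LINKS  # first access triggers the lookup
```

In this sketch the result is not cached, so each access repeats the lookup; a real module would likely store the value after the first call.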

github-actions · 2026-06-12T19:13:02Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2192/
https://nvidia.github.io/cuda-python/pr-preview/pr-2192/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2192/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2192/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

mdboom · 2026-06-12T19:50:59Z

@leofang: There seem to be failures on specific hardware (A100, T4), on Windows only, where we can't use all of the stated NVLinks. I have asked internally why this might be the case. It /shouldn't/ be a driver vs. CTK difference, since NVML ships with the driver and we are asking NVML (not the CTK or cuda-bindings itself) which version we have in order to determine how many links we have. But check my work; it could be that I'm not checking the appropriate thing.

mdboom · 2026-06-16T13:47:04Z

As discussed in the meeting this has been changed so that:

The deprecated library is vendored. Unfortunately, dependabot doesn't support vendored libraries for Python, so it is on us to keep it updated, but this is not a big or complex dependency.
We have given up on getting NVML_NVLINK_MAX_LINKS (the header macro) to work correctly, and its usage in cuda.core is deprecated. The recommended alternative is the device-specific field lookup implemented here.
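
For readers unfamiliar with deprecating a class attribute, the metaclass-property approach mentioned in the PR summary can be sketched as below. The class names, the warning text, and the private attribute are hypothetical; the PR itself uses the vendored Deprecated library, which this sketch does not reproduce.

```python
import warnings


class _DeprecatedMaxLinks(type):
    @property
    def max_links(cls):
        # Accessing the attribute on the class warns, then returns the
        # static value so existing callers keep working.
        warnings.warn(
            "max_links is deprecated; query the device-specific count instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return cls._STATIC_MAX_LINKS


class LinkInfo(metaclass=_DeprecatedMaxLinks):
    _STATIC_MAX_LINKS = 18


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = LinkInfo.max_links
```

A property defined on the metaclass is visible on the class but not on its instances, so this pattern covers LinkInfo.max_links only; instance access would need separate handling.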

rwgk · 2026-06-16T19:09:37Z

codex gpt-5.5 had two medium findings and one low finding. I asked it to create commits with suggested fixes, which is easier these days than explaining:

rwgk · 2026-06-16T19:10:34Z

It'd be nice to update the PR description.

1. Review finding: the 1.1.0 release notes did not mention the new Device.get_nvlink_count and Device.get_nvlinks APIs, the changed Device.get_nvlink validation behavior, or the NvlinkInfo.max_links deprecation.

2. Suggested fix implemented in this commit: add release-note entries for device-specific NVLink enumeration, the stricter get_nvlink ValueError behavior, and the deprecated NvlinkInfo.max_links replacement path.

mdboom · 2026-06-22T13:37:38Z

rwgk@44e0197

This one seems out of date

rwgk@d0606ff

This has been cherry-picked

rwgk@b8eded2

This one is wrong. It shouldn't be updating the vendored dependency.

It'd be nice to update the PR description.

Sure.

mdboom added this to the cuda.bindings 13.3.1 milestone Jun 10, 2026

mdboom self-assigned this Jun 10, 2026

mdboom added the bug and cuda.bindings labels Jun 10, 2026

github-actions bot added the cuda.core label Jun 10, 2026

mdboom modified the milestones: cuda.bindings 13.3.1, cuda.bindings next Jun 10, 2026

rwgk reviewed Jun 10, 2026

mdboom requested a review from rwgk June 11, 2026 19:13

rwgk approved these changes Jun 12, 2026

mdboom force-pushed the nvlink-max-links-dynamic branch from 62f83f4 to a55a771 Compare June 15, 2026 13:53

Improve NVML_NVLINK_MAX_LINKS dynamic handling

637a218

mdboom force-pushed the nvlink-max-links-dynamic branch from fde1d3f to 637a218 Compare June 15, 2026 14:57

Fix bug

7e78548

mdboom added the PR review get-together label Jun 15, 2026

mdboom added 4 commits June 16, 2026 08:07

Fix bug

16721e5

Vendor the Deprecated library

8a51cf0

Fix error message

0120702

Cleanup

93c3a77

mdboom requested review from leofang and rwgk June 16, 2026 13:47

Update AGENTS.md instructions

42f2189

