
Fix nvbug6084457: Make NVLINK_MAX_LINKS version-dependent #2192

Open
mdboom wants to merge 8 commits into NVIDIA:main from mdboom:nvlink-max-links-dynamic

Conversation

@mdboom mdboom commented Jun 10, 2026

Contributor

Summary

This PR fixes NvlinkInfo.max_links being a static compile-time constant (NVML_NVLINK_MAX_LINKS) that overestimates the actual number of NVLink ports on a given device. Because not all devices support all links up to the macro ceiling, iterating over that range would attempt to query links the device doesn't have, producing incorrect behavior or errors.
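
As a rough illustration of the problem, here is a minimal sketch using a hypothetical FakeDevice stand-in. It is not the cuda.core Device class and makes no NVML calls; the method names get_nvlink_count and get_nvlink are borrowed from the changes listed in this PR, while the link count of 4 and the returned strings are invented for the example.

```python
STATIC_MAX_LINKS = 18  # compile-time ceiling, same for every device


class FakeDevice:
    """Hypothetical stand-in for a GPU with a fixed number of NVLink ports."""

    def __init__(self, link_count):
        self._link_count = link_count

    def get_nvlink_count(self):
        # In the PR this comes from an NVML field query; here it is hard-coded.
        return self._link_count

    def get_nvlink(self, link):
        # Reject links this particular device does not have.
        if not 0 <= link < self._link_count:
            raise ValueError(f"link {link} is not present on this device")
        return f"link-{link}"


def links_via_static_ceiling(device):
    # Old pattern: walks past the last real link on most devices.
    return [device.get_nvlink(i) for i in range(STATIC_MAX_LINKS)]


def links_via_device_count(device):
    # New pattern: only touches links the device reports.
    return [device.get_nvlink(i) for i in range(device.get_nvlink_count())]
```

With a four-link fake device, links_via_device_count returns four entries, whereas links_via_static_ceiling raises ValueError as soon as it reaches index 4.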

Changes

  • New Device APIs (get_nvlink_count, get_nvlinks) — query the device-specific NVLink count at runtime via nvmlDeviceGetFieldValues(NVML_FI_DEV_NVLINK_LINK_COUNT), and iterate over only the links actually present on that device.

  • get_nvlink validation fix — the link-index range check now uses get_nvlink_count() instead of the static NvlinkInfo.max_links, so ValueError is raised precisely for links not supported by the specific device.

  • NvlinkInfo.max_links deprecated — the class attribute now emits a DeprecationWarning (via a metaclass property) pointing users to Device.get_nvlink_count(). The static value is still returned for compatibility but callers should migrate.

  • Vendored Deprecated library — a trimmed copy of the Deprecated package (v1.3.1, MIT) is vendored under cuda_core/cuda/core/_vendored/deprecated/ with the wrapt dependency removed. Used to emit DeprecationWarning with Sphinx-compatible docstring decorators (@deprecated, @versionadded, @versionchanged).

  • Tests: test_nvlink updated to iterate via get_nvlink_count() and exercise get_nvlinks(); new test_nvlink_max_links_deprecated asserts the deprecation warning fires.

  • Release notes: 1.1.0-notes.rst updated with new APIs, the behaviour change to get_nvlink, and the deprecation of NvlinkInfo.max_links.
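
The "metaclass property" approach to deprecating a class attribute can be sketched as follows. This is a minimal illustration using only the standard library warnings module, not the PR's actual implementation (which uses the vendored Deprecated package); the names _NvlinkInfoMeta, NvlinkInfoSketch, and _MAX_LINKS are hypothetical.

```python
import warnings


class _NvlinkInfoMeta(type):
    """Hypothetical metaclass: a property here is evaluated on class access."""

    @property
    def max_links(cls):
        warnings.warn(
            "max_links is deprecated; use Device.get_nvlink_count() instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return cls._MAX_LINKS


class NvlinkInfoSketch(metaclass=_NvlinkInfoMeta):
    _MAX_LINKS = 18  # static value still returned for compatibility


def read_max_links_with_warnings():
    # Capture warnings so the caller can inspect what was emitted.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = NvlinkInfoSketch.max_links
    return value, caught
```

One consequence of this design is that the property is visible only on the class itself: NvlinkInfoSketch.max_links works and warns, but NvlinkInfoSketch().max_links raises AttributeError, because instance lookup does not consult the metaclass.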

@mdboom mdboom added this to the cuda.bindings 13.3.1 milestone Jun 10, 2026
@mdboom mdboom self-assigned this Jun 10, 2026
@mdboom mdboom added bug Something isn't working cuda.bindings Everything related to the cuda.bindings module labels Jun 10, 2026
@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Jun 10, 2026
Comment thread cuda_bindings/cuda/bindings/nvml.pyx Outdated

NVLINK_MAX_LINKS = 18

if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13):

Contributor

Could we obtain this value from the source of truth?

E.g. I see:

$ grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
/usr/local/cuda_13.3.0_610.43.02_linux_kitpick035/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.

I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.

Maybe something like this could work?

int internal_only_get_NVLINK_MAX_LINKS() {
#ifdef NVML_NVLINK_MAX_LINKS
    return NVML_NVLINK_MAX_LINKS;
#else
    return 18; // Prior to CUDA 13.3 this value was hard-wired. 
#endif
}

If there isn't a practical solution, could you please add a comment to explain?

@mdboom mdboom Jun 11, 2026

Contributor Author

I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.

Are you sure? I see it in 12.9 through 13.3 (though the value changed in 13.3).

We have to build a single binary that works for every version 12.9 - 13.3, so the only way to do this is with a runtime computation. Additionally, we build without the nvml.h header present. I'll add a comment.
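
A runtime computation of this kind might look roughly like the sketch below. The values 18 and 36 come from the headers quoted in this thread; everything else is an assumption for illustration: the helper names parse_version and max_links_for are hypothetical, and the (13, 3) threshold simply mirrors the toolkit release where the header value changed rather than the actual NVML version string format, which is not modeled here.

```python
def parse_version(text):
    """Turn a dotted version string such as '13.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def max_links_for(version_text, threshold=(13, 3)):
    """Pick the link ceiling from a version string (illustrative threshold)."""
    # Tuples compare element by element, so (13, 10) >= (13, 3) is True.
    if parse_version(version_text) >= threshold:
        return 36
    return 18
```

Comparing integer tuples rather than raw strings matters here: as strings, "13.10" sorts before "13.3", which would select the wrong value.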

Contributor

Oh, sorry, I messed up like this (retrieved from my bash history; note the cuda_):

grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h

(I have a custom softlink for 13.3, which is why that matched.)

It looks much different like this:

$ grep NVLINK_MAX_LINKS /usr/local/cuda-*/include/nvml.h
/usr/local/cuda-12.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.4/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.5/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.6/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.8/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.9/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.

Contributor

I didn't think it through before, but on second look, of course the helper function I was envisioning would need to live in nvml.h itself.

I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.

Member

This is a module-level constant, so there's not much we can do with a helper function. @mdboom that said, this would force NVML to be loaded at import time, which breaks CPU-only envs. Can we hide this behind a module-level __getattr__?

Member

(For example this would break the import test.)
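
For readers unfamiliar with the module __getattr__ technique (PEP 562), a minimal sketch follows. It builds a throwaway module object so the example is self-contained; in a real module the function would simply be defined at top level as def __getattr__(name). The names lazy_mod, fake_query_version, and the (13, 3) threshold are hypothetical stand-ins, and no NVML call is made.

```python
import types

query_calls = []  # records each time the (pretend) driver query runs


def fake_query_version():
    """Hypothetical stand-in for a query that may fail when no driver exists."""
    query_calls.append(1)
    return (13, 3)


lazy_mod = types.ModuleType("lazy_nvml_demo")


def _module_getattr(name):
    # Called only when normal module attribute lookup fails.
    if name == "NVLINK_MAX_LINKS":
        value = 36 if fake_query_version() >= (13, 3) else 18
        # Cache in the module namespace so later lookups skip this function.
        setattr(lazy_mod, name, value)
        return value
    raise AttributeError(f"module {lazy_mod.__name__!r} has no attribute {name!r}")


# Equivalent to defining __getattr__ at the top level of a real module.
lazy_mod.__getattr__ = _module_getattr
```

Creating (or importing) the module does not run the query; it runs on the first access to lazy_mod.NVLINK_MAX_LINKS and is cached afterwards, so environments that never touch the constant never need the driver.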

Contributor Author

Experimentally, it doesn't break CPU-only builds, but it does break on systems without NVML installed, so I agree the __getattr__ trick is probably justified, but it's narrower than you think. nvmlSystemGetVersion doesn't require nvmlInit and doesn't require a GPU.

Member

CPU-only envs have a broad definition, including but not limited to: GPU driver is not installed 😉

@mdboom mdboom requested a review from rwgk June 11, 2026 19:13

mdboom commented Jun 12, 2026

Contributor Author

@leofang: There seem to be failures on specific hardware (A100, T4), on Windows only, where we can't use all of the stated NVLinks. I have asked internally why this might be the case. It /shouldn't/ be a driver vs. CTK difference, since NVML ships with the driver and we are asking NVML (not the CTK, cuda-bindings, or anything else) which version we have in order to determine how many links we have. But check my work; it could be that I'm not checking the appropriate thing.

@mdboom mdboom force-pushed the nvlink-max-links-dynamic branch from 62f83f4 to a55a771 Compare June 15, 2026 13:53
@mdboom mdboom force-pushed the nvlink-max-links-dynamic branch from fde1d3f to 637a218 Compare June 15, 2026 14:57
@mdboom mdboom added the PR review get-together Mark PRs you'd like the team to review at the weekly PR review get-together. label Jun 15, 2026

mdboom commented Jun 16, 2026

Contributor Author

As discussed in the meeting, this has been changed so that:

  • The deprecated library is vendored. Unfortunately, dependabot doesn't support vendored libraries for Python, so it is on us to keep it updated, but this is not a big or complex dependency.
  • We have given up on getting NVML_NVLINK_MAX_LINKS (the header macro) to work correctly, and its usage in cuda.core is deprecated. The recommended alternative is the device-specific field lookup implemented here.

@mdboom mdboom requested review from leofang and rwgk June 16, 2026 13:47

rwgk commented Jun 16, 2026

Contributor

codex gpt-5.5 had two medium findings and one low finding. I asked it to create commits with suggested fixes, which is easier these days than explaining:

rwgk commented Jun 16, 2026

Contributor

It'd be nice to update the PR description.

1. Review finding: the 1.1.0 release notes did not mention the new Device.get_nvlink_count and Device.get_nvlinks APIs, the changed Device.get_nvlink validation behavior, or the NvlinkInfo.max_links deprecation.

2. Suggested fix implemented in this commit: add release-note entries for device-specific NVLink enumeration, the stricter get_nvlink ValueError behavior, and the deprecated NvlinkInfo.max_links replacement path.

mdboom commented Jun 22, 2026

Contributor Author

rwgk@44e0197

This one seems out of date

rwgk@d0606ff

This has been cherry-picked

rwgk@b8eded2

This one is wrong. It shouldn't be updating the vendored dependency.

It'd be nice to update the PR description.

Sure.

3 participants