Fix nvbug6084457: Make NVLINK_MAX_LINKS version-dependent#2192
Conversation
|
|
||
| NVLINK_MAX_LINKS = 18 | ||
|
|
||
| if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13): |
There was a problem hiding this comment.
Could we obtain this value from the source of truth?
E.g. I see:
$ grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
/usr/local/cuda_13.3.0_610.43.02_linux_kitpick035/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.
I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.
Maybe something like this could work?
int internal_only_get_NVLINK_MAX_LINKS() {
#ifdef NVML_NVLINK_MAX_LINKS
return NVML_NVLINK_MAX_LINKS;
#else
return 18; // Prior to CUDA 13.3 this value was hard-wired.
#endif
}
If there isn't a practical solution, could you please add a comment to explain?
There was a problem hiding this comment.
I have every single /usr/local/cuda-. from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.
Are you sure? I see it in 12.9 through 13.3 (though the value changed in 13.3).
We have to build a single binary that works for every version 12.9 - 13.3, so the only way to do this is with a runtime computation. Additionally, we build without the nvml.h header present. I'll add a comment.
There was a problem hiding this comment.
Oh, sorry, I messed up like this (retrieved from my bash history; note the cuda_):
grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
(I have a custom softlink for 13.3, which is why that matched.)
It looks much different like this:
$ grep NVLINK_MAX_LINKS /usr/local/cuda-*/include/nvml.h
/usr/local/cuda-12.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.4/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.5/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.6/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.8/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.9/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.
There was a problem hiding this comment.
I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.
I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.
There was a problem hiding this comment.
This is a module-level constant so there's not much we can do with a helper function. @mdboom that said, this would force nvml to be loaded at import time, which breaks CPU-only envs. Can we hide this behind module __getattr__
There was a problem hiding this comment.
Experimentally, it doesn't break CPU-only builds, but it does break on systems without NVML installed, so I agree the __getattr__ trick is probably justified, but it's narrower than you think. nvmlSystemGetVersion doesn't require nvmlInit and doesn't require a GPU.
There was a problem hiding this comment.
CPU-only envs have a broad definition, including but not limited to: GPU driver is not installed 😉
|
|
||
| NVLINK_MAX_LINKS = 18 | ||
|
|
||
| if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13): |
There was a problem hiding this comment.
I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.
I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.
|
|
@leofang: There seem to be failures on specific hardware (a100, t4) on Windows only that we can't use all of the stated nvlinks. I have asked internally why this might be the case. It /shouldn't/ be a driver vs. CTK difference since NVML ships with the driver and we are asking NVML (not the CTK or cuda-bindings itself or something) what version we have to determine how many links we have. But check my work, it could be that I'm not checking the appropriate thing. |
62f83f4 to
a55a771
Compare
fde1d3f to
637a218
Compare
|
As discussed in the meeting this has been changed so that:
|
|
codex gpt-5.5 had two medium findings and one low finding. I asked it to create commits with suggested fixes, which is easier these days than explaining: |
|
It'd be nice to update the PR description. |
1. Review finding: the 1.1.0 release notes did not mention the new Device.get_nvlink_count and Device.get_nvlinks APIs, the changed Device.get_nvlink validation behavior, or the NvlinkInfo.max_links deprecation.\n\n2. Suggested fix implemented in this commit: add release-note entries for device-specific NVLink enumeration, the stricter get_nvlink ValueError behavior, and the deprecated NvlinkInfo.max_links replacement path.
|
This one seems out of date This has been cherry-picked This one is wrong. It shouldn't be updating the vendored dependency.
Sure. |
Summary
This PR fixes
NvlinkInfo.max_linksbeing a static compile-time constant (NVML_NVLINK_MAX_LINKS) that overestimates the actual number of NVLink ports on a given device. Because not all devices support all links up to the macro ceiling, iterating over that range would attempt to query links the device doesn't have, producing incorrect behavior or errors.Changes
New
DeviceAPIs (get_nvlink_count,get_nvlinks) — query the device-specific NVLink count at runtime vianvmlDeviceGetFieldValues(NVML_FI_DEV_NVLINK_LINK_COUNT), and iterate over only the links actually present on that device.get_nvlinkvalidation fix — the link-index range check now usesget_nvlink_count()instead of the staticNvlinkInfo.max_links, soValueErroris raised precisely for links not supported by the specific device.NvlinkInfo.max_linksdeprecated — the class attribute now emits aDeprecationWarning(via a metaclass property) pointing users toDevice.get_nvlink_count(). The static value is still returned for compatibility but callers should migrate.Vendored
Deprecatedlibrary — a trimmed copy of theDeprecatedpackage (v1.3.1, MIT) is vendored undercuda_core/cuda/core/_vendored/deprecated/with thewraptdependency removed. Used to emitDeprecationWarningwith Sphinx-compatible docstring decorators (@deprecated,@versionadded,@versionchanged).Tests —
test_nvlinkupdated to iterate viaget_nvlink_count()and exerciseget_nvlinks(); newtest_nvlink_max_links_deprecatedasserts the deprecation warning fires.Release notes —
1.1.0-notes.rstupdated with new APIs, the behaviour change toget_nvlink, and the deprecation ofNvlinkInfo.max_links.