[Common] NVTETensor peer-handle annotation + nccl_comm backend by phu0ngng · Pull Request #3017 · NVIDIA/TransformerEngine

phu0ngng · 2026-05-20T20:10:11Z

Description

Add a generic peer-handle annotation on NVTETensor so consumers can issue one-sided RMA against a tensor's storage without owning the underlying resource.
Register NCCL symmetric-memory windows as the first concrete backend behind this annotation (nccl_comm.h).
Gate newton_schulz on NVTE_WITH_CUSOLVERMP to unblock builds without cuSOLVERMp.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

common/include/transformer_engine/comm_handle.h: payload-agnostic C API — kind tag (NVTEPeerHandleKind), nvte_tensor_peer_handle_kind, nvte_tensor_detach_peer_handle. Reserves slots for future backends (NVSHMEM, CUDA-IPC, UCX).
common/include/transformer_engine/nccl_comm.h: NCCL-specific setter/getter — nvte_tensor_attach_nccl_window(t, window, offset) / nvte_tensor_nccl_window(t, &window, &offset). Window is borrowed; tensor never owns it.
common/common.h: extend Tensor with peer_handle_kind / peer_handle_data / peer_handle_offset fields (ignored by paths that don't issue cross-rank ops).
common/comm_handle.cpp: implementations.
common/CMakeLists.txt: build the new comm_handle.cpp.
pytorch/csrc/extensions/newton_schulz.cpp + build_tools/pytorch.py: compile/include newton_schulz only when NVTE_WITH_CUSOLVERMP is set.
Bump 3rdparty/cudnn-frontend and minor doc tweak under examples/pytorch/comm_gemm_overlap/README.md.

Testing

Covered by the cpp-distributed EP tests in the follow-up branch (phuong/ep-2-common-cpp), which exercise the NCCL-window attach/detach path end-to-end.
Standard TE build sanity (NVTE_CUDA_ARCHS="90;100" NVTE_FRAMEWORK=jax pip install --no-build-isolation -e .).

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…ewton_schulz on NVTE_WITH_CUSOLVERMP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

greptile-apps · 2026-05-20T20:14:40Z

Greptile Summary

This PR introduces a generic peer-handle annotation on NVTETensor for one-sided RMA, with NCCL symmetric-memory windows as the first concrete backend. It also fixes the newton_schulz build for configurations without cuSOLVERMp by consistently gating the feature across all three affected files (newton_schulz.cpp, extensions.h, pybind.cpp) and the Python build script.

comm_handle.h / nccl_comm.h / comm_handle.cpp: New C API to attach, query, and detach an opaque borrowed peer handle (NCCL window + byte offset) on any NVTETensor; the Tensor struct gains three corresponding fields with correct default initialisation.
newton_schulz build gate: #ifdef NVTE_WITH_CUSOLVERMP consistently applied in the .cpp, extensions.h, pybind.cpp, and build_tools/pytorch.py, fixing undefined-symbol linker failures when cuSOLVERMp is absent.

Confidence Score: 4/5

Mostly safe to merge, but Tensor::clear() does not reset the three new peer-handle fields, leaving a dangling window pointer visible to consumers after a reset.

The new comm-handle machinery is well-structured and null-safe, and the Newton-Schulz build gate is correctly applied across all three affected files. The one concrete defect is that Tensor::clear() skips the three new peer_handle_* fields, so any code path that calls clear() and then re-uses the tensor can read a stale NVTE_PEER_HANDLE_NCCL_WINDOW annotation pointing at freed memory.

transformer_engine/common/common.h — the clear() method needs to reset peer_handle_kind, peer_handle_data, and peer_handle_offset.

Important Files Changed

Filename	Overview
transformer_engine/common/common.h	Adds three peer-handle fields to `Tensor` struct with correct default initialization, but `clear()` does not reset them — stale handle state will persist after `clear()`, risking use-after-free if the underlying window is freed.
transformer_engine/common/comm_handle.cpp	New implementation file for peer-handle get/set APIs; logic is straightforward and null-safe, but the matching `clear()` gap in common.h affects correctness when tensors are reused.
transformer_engine/common/include/transformer_engine/comm_handle.h	Clean C header defining `NVTEPeerHandleKind` enum and the generic detach API; correct include guards and extern "C" wrapping.
transformer_engine/common/include/transformer_engine/nccl_comm.h	NCCL-specific attach/get API using opaque `void*` window — no NCCL header dependency, correct include guards.
transformer_engine/pytorch/csrc/extensions/newton_schulz.cpp	Correctly guarded with `#ifdef NVTE_WITH_CUSOLVERMP`, completing the fix that also gates `extensions.h` and `pybind.cpp`.

Comments Outside Diff (1)

transformer_engine/common/common.h, line 206-217 (link)

clear() leaves stale peer-handle state after reset

peer_handle_kind, peer_handle_data, and peer_handle_offset are not cleared in Tensor::clear(). Any consumer that calls clear() before reusing a tensor will see a non-NONE kind with a potentially freed peer_handle_data pointer. Since peer_handle_data is a borrowed raw pointer (the caller owns the window), a stale NVTE_PEER_HANDLE_NCCL_WINDOW annotation combined with a freed window is a use-after-free hazard for any code path that reads the handle after clear().

The three new fields should be reset in clear(), mirroring the existing pattern for with_gemm_swizzled_scales and row_scaled_nvfp4.

_{Reviews (2): Last reviewed commit: "pytorch: guard cusolvermp/newton_schulz ..." | Re-trigger Greptile}

greptile-apps · 2026-05-20T20:14:44Z

+// behind NVTE_WITH_CUSOLVERMP in the common CMakeLists. Without the gate, the
+// pytorch ext glob would pick this file up and produce undefined symbols
+// (nvte_cusolvermp_ctx_*). Keep this gate aligned with common/.
+#ifdef NVTE_WITH_CUSOLVERMP


Incomplete guard — pybind.cpp and extensions.h still reference ungated symbols

The #ifdef NVTE_WITH_CUSOLVERMP here makes newton_schulz.cpp compile to an empty TU when the flag is unset, but pybind.cpp (lines 592–601) and extensions.h (lines 634–639) still reference transformer_engine::pytorch::cusolvermp_ctx_create, cusolvermp_ctx_destroy, and newton_schulz unconditionally. Those symbols will be undefined at link time, producing the same linker failure the PR aims to fix. The guard needs to be mirrored in both pybind.cpp and extensions.h.

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

…SOLVERMP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

ptrendx · 2026-05-21T00:32:14Z

+    # Mirror the cuSOLVERMp gate. newton_schulz.cpp is conditionally compiled
+    # in the common lib; the pytorch ext glob pulls the same source so it must
+    # see the same define, otherwise the pybind layer refers to undefined
+    # cusolvermp_ctx_* symbols.
+    if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
+        cxx_flags.append("-DNVTE_WITH_CUSOLVERMP")
+


? What is the error you are seeing without that? At least from the cursory look at those sources I don't see how the pytorch files would be affected here. Is the comments implying that the PyTorch compilation is somehow taking the common files and compiling them again?

ptrendx · 2026-05-21T00:38:15Z

Ok, a general question - this looks like a thing that is basically not compatible with everything else in nvtetensor, so why do we even want to use NVTETensor at all? If you just need a C struct that would hold the pointer and offset then we should make that separate. It is hard to review this without seeing what the concrete implementation would actually use this mechanism for.

common: NVTETensor peer-handle annotation + nccl_comm backend; gate n…

e4cfa62

…ewton_schulz on NVTE_WITH_CUSOLVERMP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng requested review from ksivaman and ptrendx as code owners May 20, 2026 20:10

phu0ngng requested a review from timmoon10 May 20, 2026 20:10

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

phu0ngng added 2 commits May 20, 2026 20:18

common: tighten peer-handle docstrings

93ee116

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

pytorch: guard cusolvermp/newton_schulz pybind + decl on NVTE_WITH_CU…

4bd2f5b

…SOLVERMP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng force-pushed the phuong/ep-1-nvtetensor branch from a573c9a to 4bd2f5b Compare May 20, 2026 21:24

ptrendx reviewed May 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Common] NVTETensor peer-handle annotation + nccl_comm backend#3017

[Common] NVTETensor peer-handle annotation + nccl_comm backend#3017
phu0ngng wants to merge 3 commits into
NVIDIA:mainfrom
phu0ngng:phuong/ep-1-nvtetensor

phu0ngng commented May 20, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 20, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 20, 2026

Uh oh!

ptrendx May 21, 2026

Uh oh!

ptrendx commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

phu0ngng commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Changes

Testing

Checklist:

Uh oh!

greptile-apps Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx May 21, 2026

Choose a reason for hiding this comment

Uh oh!

ptrendx commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

phu0ngng commented May 20, 2026 •

edited

Loading

greptile-apps Bot commented May 20, 2026 •

edited

Loading