Skip to content

[Common] NVTETensor peer-handle annotation + nccl_comm backend#3017

Open
phu0ngng wants to merge 3 commits into
NVIDIA:mainfrom
phu0ngng:phuong/ep-1-nvtetensor
Open

[Common] NVTETensor peer-handle annotation + nccl_comm backend#3017
phu0ngng wants to merge 3 commits into
NVIDIA:mainfrom
phu0ngng:phuong/ep-1-nvtetensor

Conversation

@phu0ngng
Copy link
Copy Markdown
Collaborator

@phu0ngng phu0ngng commented May 20, 2026

Description

  • Add a generic peer-handle annotation on NVTETensor so consumers can issue one-sided RMA against a tensor's storage without owning the underlying resource.
  • Register NCCL symmetric-memory windows as the first concrete backend behind this annotation (nccl_comm.h).
  • Gate newton_schulz on NVTE_WITH_CUSOLVERMP to unblock builds without cuSOLVERMp.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • common/include/transformer_engine/comm_handle.h: payload-agnostic C API — kind tag (NVTEPeerHandleKind), nvte_tensor_peer_handle_kind, nvte_tensor_detach_peer_handle. Reserves slots for future backends (NVSHMEM, CUDA-IPC, UCX).
  • common/include/transformer_engine/nccl_comm.h: NCCL-specific setter/getter — nvte_tensor_attach_nccl_window(t, window, offset) / nvte_tensor_nccl_window(t, &window, &offset). Window is borrowed; tensor never owns it.
  • common/common.h: extend Tensor with peer_handle_kind / peer_handle_data / peer_handle_offset fields (ignored by paths that don't issue cross-rank ops).
  • common/comm_handle.cpp: implementations.
  • common/CMakeLists.txt: build the new comm_handle.cpp.
  • pytorch/csrc/extensions/newton_schulz.cpp + build_tools/pytorch.py: compile/include newton_schulz only when NVTE_WITH_CUSOLVERMP is set.
  • Bump 3rdparty/cudnn-frontend and minor doc tweak under examples/pytorch/comm_gemm_overlap/README.md.

Testing

  • Covered by the cpp-distributed EP tests in the follow-up branch (phuong/ep-2-common-cpp), which exercise the NCCL-window attach/detach path end-to-end.
  • Standard TE build sanity (NVTE_CUDA_ARCHS="90;100" NVTE_FRAMEWORK=jax pip install --no-build-isolation -e .).

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…ewton_schulz on NVTE_WITH_CUSOLVERMP

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng requested review from ksivaman and ptrendx as code owners May 20, 2026 20:10
@phu0ngng phu0ngng requested a review from timmoon10 May 20, 2026 20:10
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR introduces a generic peer-handle annotation on NVTETensor for one-sided RMA, with NCCL symmetric-memory windows as the first concrete backend. It also fixes the newton_schulz build for configurations without cuSOLVERMp by consistently gating the feature across all three affected files (newton_schulz.cpp, extensions.h, pybind.cpp) and the Python build script.

  • comm_handle.h / nccl_comm.h / comm_handle.cpp: New C API to attach, query, and detach an opaque borrowed peer handle (NCCL window + byte offset) on any NVTETensor; the Tensor struct gains three corresponding fields with correct default initialisation.
  • newton_schulz build gate: #ifdef NVTE_WITH_CUSOLVERMP consistently applied in the .cpp, extensions.h, pybind.cpp, and build_tools/pytorch.py, fixing undefined-symbol linker failures when cuSOLVERMp is absent.

Confidence Score: 4/5

Mostly safe to merge, but Tensor::clear() does not reset the three new peer-handle fields, leaving a dangling window pointer visible to consumers after a reset.

The new comm-handle machinery is well-structured and null-safe, and the Newton-Schulz build gate is correctly applied across all three affected files. The one concrete defect is that Tensor::clear() skips the three new peer_handle_* fields, so any code path that calls clear() and then re-uses the tensor can read a stale NVTE_PEER_HANDLE_NCCL_WINDOW annotation pointing at freed memory.

transformer_engine/common/common.h — the clear() method needs to reset peer_handle_kind, peer_handle_data, and peer_handle_offset.

Important Files Changed

Filename Overview
transformer_engine/common/common.h Adds three peer-handle fields to Tensor struct with correct default initialization, but clear() does not reset them — stale handle state will persist after clear(), risking use-after-free if the underlying window is freed.
transformer_engine/common/comm_handle.cpp New implementation file for peer-handle get/set APIs; logic is straightforward and null-safe, but the matching clear() gap in common.h affects correctness when tensors are reused.
transformer_engine/common/include/transformer_engine/comm_handle.h Clean C header defining NVTEPeerHandleKind enum and the generic detach API; correct include guards and extern "C" wrapping.
transformer_engine/common/include/transformer_engine/nccl_comm.h NCCL-specific attach/get API using opaque void* window — no NCCL header dependency, correct include guards.
transformer_engine/pytorch/csrc/extensions/newton_schulz.cpp Correctly guarded with #ifdef NVTE_WITH_CUSOLVERMP, completing the fix that also gates extensions.h and pybind.cpp.

Comments Outside Diff (1)

  1. transformer_engine/common/common.h, line 206-217 (link)

    P1 clear() leaves stale peer-handle state after reset

    peer_handle_kind, peer_handle_data, and peer_handle_offset are not cleared in Tensor::clear(). Any consumer that calls clear() before reusing a tensor will see a non-NONE kind with a potentially freed peer_handle_data pointer. Since peer_handle_data is a borrowed raw pointer (the caller owns the window), a stale NVTE_PEER_HANDLE_NCCL_WINDOW annotation combined with a freed window is a use-after-free hazard for any code path that reads the handle after clear().

    The three new fields should be reset in clear(), mirroring the existing pattern for with_gemm_swizzled_scales and row_scaled_nvfp4.

Reviews (2): Last reviewed commit: "pytorch: guard cusolvermp/newton_schulz ..." | Re-trigger Greptile

// behind NVTE_WITH_CUSOLVERMP in the common CMakeLists. Without the gate, the
// pytorch ext glob would pick this file up and produce undefined symbols
// (nvte_cusolvermp_ctx_*). Keep this gate aligned with common/.
#ifdef NVTE_WITH_CUSOLVERMP
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Incomplete guard — pybind.cpp and extensions.h still reference ungated symbols

The #ifdef NVTE_WITH_CUSOLVERMP here makes newton_schulz.cpp compile to an empty TU when the flag is unset, but pybind.cpp (lines 592–601) and extensions.h (lines 634–639) still reference transformer_engine::pytorch::cusolvermp_ctx_create, cusolvermp_ctx_destroy, and newton_schulz unconditionally. Those symbols will be undefined at link time, producing the same linker failure the PR aims to fix. The guard needs to be mirrored in both pybind.cpp and extensions.h.

phu0ngng added 2 commits May 20, 2026 20:18
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…SOLVERMP

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng force-pushed the phuong/ep-1-nvtetensor branch from a573c9a to 4bd2f5b Compare May 20, 2026 21:24
Comment thread build_tools/pytorch.py
Comment on lines +79 to +85
# Mirror the cuSOLVERMp gate. newton_schulz.cpp is conditionally compiled
# in the common lib; the pytorch ext glob pulls the same source so it must
# see the same define, otherwise the pybind layer refers to undefined
# cusolvermp_ctx_* symbols.
if bool(int(os.getenv("NVTE_WITH_CUSOLVERMP", "0"))):
cxx_flags.append("-DNVTE_WITH_CUSOLVERMP")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? What is the error you are seeing without that? At least from the cursory look at those sources I don't see how the pytorch files would be affected here. Is the comments implying that the PyTorch compilation is somehow taking the common files and compiling them again?

@ptrendx
Copy link
Copy Markdown
Member

ptrendx commented May 21, 2026

Ok, a general question - this looks like a thing that is basically not compatible with everything else in nvtetensor, so why do we even want to use NVTETensor at all? If you just need a C struct that would hold the pointer and offset then we should make that separate. It is hard to review this without seeing what the concrete implementation would actually use this mechanism for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants