Expert Parallelism: common C API + NCCL EP backend #3034
Greptile Summary
This PR lands the foundational Expert Parallelism (EP) layer for TransformerEngine: a common C API (
Confidence Score: 4/5
Safe to merge with one build issue addressed: the public The
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant C_API as nvte_ep_* (ep_api.cpp)
participant Backend as EPBackend singleton
participant NCCL_EP as ncclEp* (libnccl_ep.so)
Caller->>C_API: nvte_ep_initialize(ep_comm, group_config)
C_API->>Backend: EPBackend::initialize()
Backend->>NCCL_EP: ncclEpCreateGroup()
Caller->>C_API: "nvte_ep_register_layer(layer_config, &mem_size)"
C_API->>Backend: register_layer()
Backend->>NCCL_EP: ncclEpHandleMemSize()
Backend-->>Caller: handle_id + required mem_size
Note over Caller: Allocates handle_mem buffer
loop Per training step
Caller->>C_API: nvte_ep_prepare(handle, topk_idx, token_counts, stream)
C_API->>Backend: prepare() → ncclEpUpdateHandle()
Backend->>NCCL_EP: ncclEpUpdateHandle (AllGather routing map)
Caller->>C_API: nvte_ep_dispatch(handle, tokens, [win], weights, [win], stream)
C_API->>Backend: dispatch() → ncclEpDispatch()
Backend->>NCCL_EP: ncclEpDispatch (scatter tokens to expert ranks)
Note over Caller: Expert computation on recv_tokens
Caller->>C_API: nvte_ep_combine(handle, expert_out, [win], result, stream)
C_API->>Backend: combine() → ncclEpCombine()
Backend->>NCCL_EP: ncclEpCombine (scatter-sum back to source ranks)
end
Caller->>C_API: nvte_ep_shutdown()
C_API->>Backend: EPBackend::shutdown()
Backend->>NCCL_EP: ncclEpHandleDestroy + ncclEpGroupDestroy
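The call order in the diagram can be illustrated with a small mock. This is only a sketch: `MockEp`, its methods, and `run_steps` are hypothetical stand-ins that record a trace, not the `nvte_ep_*` signatures from this PR, and the 256-byte handle size is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Mock backend that only records call order. Every name here is a
// hypothetical stand-in, not the nvte_ep_* API from this PR.
struct MockEp {
  std::vector<std::string> trace;
  bool initialized = false;

  void initialize() { initialized = true; trace.push_back("initialize"); }

  // Returns a handle id and reports the buffer size the caller must allocate.
  int register_layer(std::size_t* mem_size) {
    assert(initialized);  // registration is only valid after initialize()
    *mem_size = 256;      // arbitrary size for the sketch
    trace.push_back("register_layer");
    return 0;
  }

  void prepare() { trace.push_back("prepare"); }
  void dispatch() { trace.push_back("dispatch"); }
  void combine() { trace.push_back("combine"); }
  void shutdown() { initialized = false; trace.push_back("shutdown"); }
};

// Drives the mock through the order shown in the diagram.
inline std::vector<std::string> run_steps(MockEp& ep, int steps) {
  ep.initialize();
  std::size_t mem_size = 0;
  ep.register_layer(&mem_size);
  std::vector<unsigned char> handle_mem(mem_size);  // caller-owned buffer
  for (int i = 0; i < steps; ++i) {
    ep.prepare();
    ep.dispatch();
    // Expert computation on the received tokens would run here.
    ep.combine();
  }
  ep.shutdown();
  return ep.trace;
}
```

The point of the sketch is the ownership split: registration happens once and only reports a size, the caller owns the handle buffer, and the per-step calls reuse it without allocating.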
Reviews (4): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
    env_home = os.environ.get("NCCL_HOME")
    if env_home and (Path(env_home) / "include" / "nccl.h").exists():
        return env_home
NCCL_HOME set to a wrong path is silently ignored
If a user sets NCCL_HOME to an incorrect prefix that doesn't contain include/nccl.h, the function falls through to the system probe list without any warning. The function should warn when NCCL_HOME is set but doesn't resolve to a valid NCCL install.
-    env_home = os.environ.get("NCCL_HOME")
-    if env_home and (Path(env_home) / "include" / "nccl.h").exists():
-        return env_home
+    env_home = os.environ.get("NCCL_HOME")
+    if env_home:
+        if (Path(env_home) / "include" / "nccl.h").exists():
+            return env_home
+        print(
+            f"[NCCL EP] WARNING: NCCL_HOME='{env_home}' is set but "
+            f"'{env_home}/include/nccl.h' was not found; falling back to system probes."
+        )
    cfg.algorithm = NCCL_EP_ALGO_HIGH_THROUGHPUT;
    cfg.num_experts = static_cast<unsigned int>(group_config.num_experts);
    cfg.max_dispatch_tokens_per_rank = static_cast<unsigned int>(group_config.max_tokens_per_rank);
    cfg.max_token_bytes = static_cast<unsigned int>(group_config.hidden_dim * sizeof(nv_bfloat16));
max_token_bytes hardcoded to sizeof(nv_bfloat16) breaks float32 dispatch
cfg.max_token_bytes is computed as hidden_dim * sizeof(nv_bfloat16) (2 bytes), but nvte_dtype_to_nccl supports float32, float16, int32, int64, float8, etc. When a caller creates the EP group with this config and later dispatches float32 tokens (via nvte_ep_dispatch), the pre-allocated max_token_bytes is half the required size. NCCL EP uses this value to size internal staging buffers at group creation; dispatching a wider dtype silently overruns those buffers or triggers an internal NCCL error. NVTEEpGroupConfig needs a dtype (or max_token_element_bytes) field so callers can declare the maximum element width they will use.
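To make the size mismatch concrete, here is a minimal sketch of sizing the per-token buffer from a declared maximum element width. `GroupConfigSketch`, `max_token_bytes`, and `token_fits` are hypothetical names for illustration; `max_token_element_bytes` is the field proposed above and does not exist in `NVTEEpGroupConfig` in this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical config for the sketch. max_token_element_bytes mirrors the
// field proposed in this comment; it is not part of the PR.
struct GroupConfigSketch {
  std::size_t hidden_dim;
  std::size_t max_token_element_bytes;  // widest element the caller will dispatch
};

// Per-token staging size, derived from the widest declared element.
inline std::size_t max_token_bytes(const GroupConfigSketch& cfg) {
  assert(cfg.max_token_element_bytes > 0);
  return cfg.hidden_dim * cfg.max_token_element_bytes;
}

// True when a token with the given element width fits in the sized buffer.
inline bool token_fits(const GroupConfigSketch& cfg, std::size_t element_bytes) {
  return cfg.hidden_dim * element_bytes <= max_token_bytes(cfg);
}
```

With `hidden_dim = 4096`, sizing for a 2-byte element gives 8192 bytes per token, while a float32 token needs 16384 bytes, so `token_fits` returns false for a 4-byte element unless the group was configured with a width of at least 4.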
Note for myself: Need to expose this option for users to set in ep_bootstrap.
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
099857f to 17e5126
…em_reloc gating Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
Summary
First PR in the TE Expert Parallelism (EP) series. Lands the common C API and NCCL EP backend that later framework PRs (PyTorch, JAX) build on. No Python bindings yet — common-lib foundation plus build wiring only. Build/load works on any arch; SM and NCCL version gates fire at runtime.
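As a rough illustration of the runtime gating, the sketch below checks the two numeric thresholds listed later in this description (SM >= 90 and NCCL >= 2.30.4). `DeviceInfo`, `nccl_version_code`, and `check_ep_gates` are hypothetical names; the real checks live in `EPBackend::initialize`, read their inputs from `cudaDeviceGetAttribute` and `ncclGetVersion`, and also test multicast/NVLS support, which is omitted here.

```cpp
#include <cassert>
#include <string>

// Hypothetical inputs for the sketch. A real backend would fill these from
// cudaDeviceGetAttribute and ncclGetVersion.
struct DeviceInfo {
  int sm_major;      // compute capability, major
  int sm_minor;      // compute capability, minor
  int nccl_version;  // integer version code reported by NCCL
};

// NCCL encodes version X.Y.Z as X*10000 + Y*100 + Z for releases after 2.8.
constexpr int nccl_version_code(int major, int minor, int patch) {
  return major * 10000 + minor * 100 + patch;
}

// Returns an empty string when both gates pass, otherwise a reason.
inline std::string check_ep_gates(const DeviceInfo& d) {
  assert(d.sm_major >= 0 && d.sm_minor >= 0);
  const int sm = d.sm_major * 10 + d.sm_minor;
  if (sm < 90) {
    return "EP requires SM >= 90, got SM " + std::to_string(sm);
  }
  if (d.nccl_version < nccl_version_code(2, 30, 4)) {
    return "EP requires NCCL >= 2.30.4, got version code " +
           std::to_string(d.nccl_version);
  }
  return "";
}
```

Deferring these checks to runtime is what lets the library build and load on any architecture: nothing fails until the caller actually tries to initialize EP on an unsupported device.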
Every network-bound payload tensor takes an optional NVTECommWindow. When the window is provided, the backend uses NCCL EP's symmetric-memory zero-copy path, which skips the D2D memcpy from the user buffers to the symmetric staging buffers.
Implementation
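The window-or-fallback choice for payload tensors might look roughly like the sketch below. `CommWindowSketch`, `PayloadSketch`, and `make_payload_sketch` are hypothetical stand-ins; the real `NVTECommWindow` and `ncclEpTensor_t` are defined by TransformerEngine and NCCL EP and may differ in layout and field names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the sketch. The real NVTECommWindow and
// ncclEpTensor_t are defined elsewhere and may differ.
struct CommWindowSketch {
  void* window = nullptr;  // stands in for ncclWindow_t
  std::size_t offset = 0;  // byte offset of the tensor inside the window
};

struct PayloadSketch {
  void* data = nullptr;     // plain device pointer (staged-copy path)
  void* win_hdl = nullptr;  // symmetric-memory window (zero-copy path)
  std::size_t win_offset = 0;
};

// Use the zero-copy path when a window is supplied; otherwise fall back to
// the raw data pointer, which implies a copy into staging buffers.
inline PayloadSketch make_payload_sketch(void* tensor_data,
                                         const CommWindowSketch& win) {
  PayloadSketch p;
  if (win.window != nullptr) {
    p.win_hdl = win.window;
    p.win_offset = win.offset;
  } else {
    assert(tensor_data != nullptr);  // fallback needs a valid pointer
    p.data = tensor_data;
  }
  return p;
}
```

Keeping the window as a side-band argument means a caller that never registers symmetric memory passes a default-constructed window and gets the fallback path with no other changes.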
Public C API (transformer_engine/common/include/transformer_engine/{ep.h,comm_window.h})
Types: NVTEEpGroupConfig, NVTEEpLayerConfig, NVTEEpHandle, NVTECommWindow (side-band {ncclWindow_t window, size_t offset}; NCCL peer handles are not carried on NVTETensor).
Lifecycle (host-only, eager):
- nvte_ep_initialize: borrow an external ncclComm_t for the EP sub-group and init the singleton backend.
- nvte_ep_shutdown: tear down the backend; idempotent; does not destroy ep_comm.
- nvte_ep_register_layer: reserve a handle_id for a layer config and report the handle_mem buffer size the caller must allocate. The pair {id, mem} becomes the per-step NVTEEpHandle.
Per-step (allocation-free, CUDA-graph capturable)
- nvte_ep_prepare: all-gather the routing map and write routing maps to handle.mem.
- nvte_ep_dispatch: scatter tokens and routing weights from source ranks to expert ranks. tokens, topk_weights, recv_tokens, recv_topk_weights each accept an optional symm-mem window for zero-copy.
- nvte_ep_combine: scatter-sum expert outputs back to source ranks (unweighted; caller pre-multiplies by recv_topk_weights). expert_out accepts a window.
- nvte_ep_dispatch_bwd: backward of dispatch; routes token and weight grads back to source. grad and g_recv_topk_weights accept windows; the gathered outputs (grad_tokens, grad_topk_weights).
- nvte_ep_combine_bwd: backward of combine; grad and grad_expert_out accept windows. Padded slots in grad_expert_out are zeroed.
Backend + build
- transformer_engine/common/ep/): EPBackend singleton, HT-mode dispatch/combine over NCCL EP (libnccl_ep.so), group/layer registration. Internal helper make_payload_tensor() builds the per-call ncclEpTensor_t: when the caller's NVTECommWindow.window != nullptr it sets win_hdl + win_offset (zero-copy); otherwise it sets data from nvte_tensor_data(t) (HBM fallback).
- EPBackend::initialize): SM>=90 (via cudaDeviceGetAttribute), NCCL>=2.30.4 (via ncclGetVersion), CUDA multicast/NVLS support.
- NVTE_WITH_NCCL_EP=OFF, ep/ep_api_stub.cpp provides throwing nvte_ep_* stubs so framework bindings link unconditionally; failure surfaces at first nvte_ep_initialize.
- setup.py builds libnccl_ep.so from 3rdparty/nccl by default; auto-disables NCCL EP when no requested CUDA arch >= 90. Explicit NVTE_BUILD_WITH_NCCL_EP=1 with all archs < 90 is treated as user error; NVTE_BUILD_WITH_NCCL_EP=0 to opt out.
- NCCL_HOME resolved dynamically: explicit env → /opt/nvidia/nccl, /usr/local/nccl, /usr → ldconfig -p fallback.
Testing
tests/cpp_distributed/.
Type of change
Checklist: