Optimize function that loads pointers on GPU #3001

Open
timmoon10 wants to merge 23 commits into NVIDIA:main from timmoon10:tmoon/optimize-get_device_pointer_for_data_and_scales

timmoon10 (Member) commented May 16, 2026

Description

tex.get_device_pointer_for_data_and_scales has two problems:

  1. It has significant CPU overhead (see [PyTorch] Reduce CPU overhead in grouped MLP block #2897). In a representative benchmark on a GB200, it takes ~70 us per call.
  2. The meaning is extremely unintuitive. The most natural interpretation is that it takes an FP8/MXFP8/NVFP4 tensor and returns pointers as two ints. In fact, it takes the buffers from multiple MXFP8/NVFP4 tensors (all assumed to have the same shape), swizzles the scaling factors, and transfers the pointers to a GPU array in a CUDA Graph-friendly way.

This PR makes several optimizations to reduce CPU overhead, mostly by avoiding heap allocations and mutex acquisition. I've also attempted to make the functionality more general and logical:

  • nvte_load_value_on_device: A general function for copying a small amount of data to GPU in a CUDA Graph-friendly way. Unlike nvte_convert_pointers_to_tensor, it makes no assumptions that the data is a list of pointers.
  • tex.load_data_ptrs_on_device: Takes a list of tensors and puts their data pointers into a GPU buffer.
  • tex.transform_and_load_data_ptrs_on_device: Performs a user-specified transform on a list of tensors and puts the resulting data pointers into a GPU buffer. Currently it only supports scale swizzles on uniformly shaped tensors, but the transform names help make the contracts explicit.
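
To illustrate the "data travels as a kernel argument" idea behind nvte_load_value_on_device, here is a minimal host-only sketch. It is not the PR's implementation: `Payload`, `kMaxBytes` (64 is an arbitrary assumed value), `fake_kernel`, and `load_value_sketch` are hypothetical names, and the GPU kernel is replaced by an ordinary function that reads only its by-value argument. The point is that the source buffer is copied into a fixed-size struct before each launch, so larger inputs simply need several launches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumed payload capacity for illustration; the real limit is not stated in this thread.
constexpr std::size_t kMaxBytes = 64;

// Fixed-size blob that would be passed by value as a kernel argument.
struct Payload {
  unsigned char bytes[kMaxBytes];
};

// Host stand-in for the device kernel: it reads only its by-value argument,
// so the caller's buffer is no longer needed once the "launch" returns.
inline void fake_kernel(Payload payload, unsigned char *dst, std::size_t num_bytes) {
  assert(num_bytes <= kMaxBytes);
  std::memcpy(dst, payload.bytes, num_bytes);
}

// Copies num_bytes from src to dst in chunks of at most kMaxBytes.
// Returns the number of simulated launches.
inline std::size_t load_value_sketch(const void *src, void *dst, std::size_t num_bytes) {
  const auto *in = static_cast<const unsigned char *>(src);
  auto *out = static_cast<unsigned char *>(dst);
  std::size_t launches = 0;
  for (std::size_t offset = 0; offset < num_bytes; offset += kMaxBytes) {
    const std::size_t chunk = std::min(kMaxBytes, num_bytes - offset);
    Payload payload{};  // zero-initialize so unused tail bytes are defined
    std::memcpy(payload.bytes, in + offset, chunk);
    fake_kernel(payload, out + offset, chunk);
    ++launches;
  }
  return launches;
}
```

With these assumed sizes, a list of twenty 8-byte pointers (160 bytes) would take three launches, which is why very long pointer lists remain correct but become slower.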

With these changes, per-call CPU runtime has dropped from 70 us to 31 us on a GB200 node.

This is progress toward #2897.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring
  • Performance optimization

Changes

  • Add transformer_engine::Tensor::flat_2d_dims to compute first and last dims simultaneously
  • Generalize and rename nvte_load_value_on_device
  • Refactor and rename tex.load_data_ptrs_on_device and tex.transform_and_load_data_ptrs_on_device
  • Add internal wrapper class for NVTEShape with an API similar to std::vector
  • Remove heap allocations in transformer_engine::SimpleTensor
  • Remove heap allocations in transformer_engine::Tensor shape functions
  • Add batched tensor allocation and deallocation to reduce mutex overhead
  • Avoid heap allocations in tensor checking functions

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 and others added 8 commits May 15, 2026 01:35
Avoid constructing temporary std::vector when converting NVTEBasicTensor to SimpleTensor. Avoid string operations in multi-tensor swizzle. Avoid temporary std::vector when checking scale tensors.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Tensor::shape() returns a std::vector<size_t> by value, allocating
on the heap. flat_first_dim and flat_last_dim only need to walk
the dims, so the allocation was pure overhead in hot paths.

Introduce Tensor::compute_shape() returning an NVTEShape (fixed
inline buffer, no heap) as the single source of truth for the
format-dependent shape logic. shape() is now a thin std::vector
wrapper around it for callers that want a vector; flat_first_dim
and flat_last_dim call compute_shape() directly.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
flat_first_dim() and flat_last_dim() each called compute_shape()
independently. flat_2d_dims() computes both in a single pass; the
scalar helpers now delegate to it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Replace all paired flat_first_dim() + flat_last_dim() calls on the
same tensor with a single flat_2d_dims() call. Saves one compute_shape()
per tensor in CheckScaleTensorShape, the multi-tensor swizzle loop, and
various cast/GEMM dispatch paths.

Also adds reserve() to the local vectors in
nvte_multi_tensor_swizzle_scaling_factors to avoid reallocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Replace the inline swizzle implementation with a call to
multi_tensor_swizzle_scales_for_gemm, which has identical logic
(16B-aligned contiguous output buffer, TensorWrapper construction,
nvte_multi_tensor_swizzle_scaling_factors kernel). Swizzled pointers
are read back from the updated TensorWrappers after the call.

Add reserve() to vectors in multi_tensor_swizzle_scales_for_gemm_impl
now that this function is on the hot path for get_device_pointer_for_data_and_scales.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
greptile-apps Bot (Contributor) commented May 16, 2026

Greptile Summary

This PR refactors and optimizes the path for loading GPU data/scale pointers on device, reducing CPU overhead from 72 µs to 41 µs per call by eliminating unnecessary heap allocations and improving API clarity.

  • Introduces nvte_load_value_on_device (and batch nvte_create_tensors/nvte_destroy_tensors) to replace the old nvte_convert_pointers_to_tensor, encoding payload in kernel arguments for CUDA-graph compatibility.
  • Splits the monolithic get_device_pointer_for_data_and_scales into load_data_ptrs_on_device + transform_and_load_data_ptrs_on_device, each with explicit semantics; heap allocations are reduced throughout by using std::array, pre-reserved vectors, and std::string_view.
  • Adds Tensor::flat_2d_dims() to compute first/last dims in one pass and refactors CheckScaleTensorShape to use constexpr std::array for block/alignment constants.
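
As a rough sketch of the single-pass idea behind flat_2d_dims(), the following computes both flattened dimensions in one walk over a fixed-size shape. `FixedShape` and `kMaxDims` are hypothetical stand-ins for NVTEShape, and the convention that an empty shape flattens to (1, 1) is an assumption rather than something stated in this thread.

```cpp
#include <cstddef>
#include <utility>

constexpr std::size_t kMaxDims = 8;  // assumed capacity, for illustration only

// Hypothetical inline shape: no heap allocation, dims stored in place.
struct FixedShape {
  std::size_t data[kMaxDims];
  std::size_t ndim;
};

// Returns {product of all dims except the last, last dim} in a single pass.
inline std::pair<std::size_t, std::size_t> flat_2d_dims(const FixedShape &shape) {
  if (shape.ndim == 0) {
    return {std::size_t{1}, std::size_t{1}};  // assumed convention for scalars
  }
  std::size_t first = 1;
  for (std::size_t i = 0; i + 1 < shape.ndim; ++i) {
    first *= shape.data[i];
  }
  return {first, shape.data[shape.ndim - 1]};
}

// The scalar helpers can delegate to the combined function.
inline std::size_t flat_first_dim(const FixedShape &shape) { return flat_2d_dims(shape).first; }
inline std::size_t flat_last_dim(const FixedShape &shape) { return flat_2d_dims(shape).second; }
```

A caller that needs both values would call flat_2d_dims once instead of calling the two scalar helpers separately, which is the saving the commit messages describe.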

Confidence Score: 5/5

Safe to merge; changes are a well-scoped performance refactor with no logic regressions found.

All numerical computations in CheckScaleTensorShape were verified to be equivalent to the old code. The new batch TensorAllocator::Allocate is correct because the vector is pre-reserved to MAX_TENSOR_NUM in the constructor, so capacity() - size() faithfully tracks remaining room. The nvte_load_value_on_device kernel correctly handles byte-granularity tails and multi-chunk payloads. The only notable regression is the removal of the is_cuda() validation guard in load_data_ptrs_on_device, but all current call sites pass CUDA tensors, so there is no present defect.
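
The capacity argument above can be sketched with a simplified pool. This is not TensorAllocator itself: `SlotPool` is a hypothetical class with no free list, and it only shows how one lock acquisition can cover a whole batch when the backing vector was reserved up front. Note that std::vector::reserve guarantees at least the requested capacity, so a real implementation that must enforce an exact limit may prefer to compare against the limit directly.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

class SlotPool {
 public:
  // Reserve all storage up front so later emplace_back calls never reallocate.
  explicit SlotPool(std::size_t max_slots) { slots_.reserve(max_slots); }

  // Allocates n slots under a single lock. On failure nothing is allocated.
  bool allocate_batch(std::size_t n, std::size_t *out_indices) {
    std::lock_guard<std::mutex> lock(mutex_);
    ++lock_acquisitions_;
    // Remaining room is tracked by the reserved capacity.
    if (slots_.capacity() - slots_.size() < n) return false;
    for (std::size_t i = 0; i < n; ++i) {
      out_indices[i] = slots_.size();
      slots_.emplace_back(0);
    }
    return true;
  }

  std::size_t lock_acquisitions() const { return lock_acquisitions_; }

 private:
  std::mutex mutex_;
  std::vector<int> slots_;
  std::size_t lock_acquisitions_ = 0;
};
```

Allocating six slots through allocate_batch takes the mutex once, whereas six single-slot calls would take it six times.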

transformer_engine/pytorch/csrc/extensions/utils.cpp — the new load_data_ptrs_on_device no longer validates that input tensors are on CUDA.

Important Files Changed

Filename | Overview
transformer_engine/common/util/utils.cu | New nvte_load_value_on_device kernel packs payload in kernel args for CUDA-graph safety; kernel logic is correct for aligned, unaligned, and multi-chunk cases. Deprecated wrapper retained with NVTE_API_CALL.
transformer_engine/pytorch/csrc/extensions/utils.cpp | load_data_ptrs_on_device and transform_and_load_data_ptrs_on_device replace the old monolithic function; missing CUDA-device validation on input tensors compared to the old API.
transformer_engine/common/transformer_engine.cpp | Batch Allocate/Free added to TensorAllocator; correctness relies on the pre-reserved MAX_TENSOR_NUM capacity, which holds. CheckScaleTensorShape refactored to std::array; logic verified equivalent to old code for all branches.
transformer_engine/common/common.h | New Shape wrapper class over NVTEShape with vector-like interface avoids heap allocation; flat_2d_dims() computes both dims in a single pass; SimpleTensor constructor now implicit-converts from NVTEBasicTensor.
transformer_engine/pytorch/csrc/extensions/swizzle.cpp | Refactored to use batch NVTETensor allocation with RAII DestroyGuard; correctness preserved by caching output scale dtype/shape before reading from output NVTETensors after the kernel.
transformer_engine/common/swizzle/swizzle.cu | get_max_dynamic_smem switched to C++11 static-local init (thread-safe), fixing a latent race in the old lazy-init pattern; flat_2d_dims() reuse avoids duplicate shape() calls.
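
For readers unfamiliar with the static-local pattern mentioned for get_max_dynamic_smem, a minimal sketch follows. The query function, its placeholder return value, and the call counter are all hypothetical; the real function presumably queries the CUDA runtime. The only claim relied on here is the C++11 guarantee that a function-local static is initialized exactly once, even when several threads reach it concurrently.

```cpp
#include <atomic>
#include <cstddef>

// Counts how often the (hypothetical) expensive query runs.
inline std::atomic<int> &query_count() {
  static std::atomic<int> count{0};
  return count;
}

// Hypothetical stand-in for a device attribute query; the value is a placeholder.
inline std::size_t query_max_dynamic_smem() {
  ++query_count();
  return 48 * 1024;
}

inline std::size_t get_max_dynamic_smem_sketch() {
  // Initialized on first call only; the language guarantees thread-safe initialization.
  static const std::size_t value = query_max_dynamic_smem();
  return value;
}
```

By contrast, a hand-written "check a flag, then compute and store" pattern without synchronization can let two threads race on the stored value.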

Sequence Diagram

sequenceDiagram
    participant PY as Python caller
    participant LDPOD as load_data_ptrs_on_device
    participant TALDPOD as transform_and_load_data_ptrs_on_device
    participant NLVOD as nvte_load_value_on_device
    participant GPU as GPU (CUDA stream)

    PY->>LDPOD: tensors, device
    LDPOD->>LDPOD: collect data_ptr() → ptrs_host[]
    LDPOD->>NLVOD: "ptrs_host, ptrs_device, n*8 bytes"
    NLVOD->>GPU: "kernel(payload=ptrs_host, dst=ptrs_device)"
    LDPOD-->>PY: ptrs_device (at::Tensor)

    PY->>TALDPOD: transform_type, scale_tensors, device
    TALDPOD->>TALDPOD: nvte_create_tensors (batch)
    TALDPOD->>GPU: nvte_multi_tensor_swizzle_scaling_factors(inputs, outputs)
    TALDPOD->>TALDPOD: collect swizzled ptr offsets → ptrs_host[]
    TALDPOD->>NLVOD: "ptrs_host, ptrs_device, n*8 bytes"
    NLVOD->>GPU: "kernel(payload=ptrs_host, dst=ptrs_device)"
    TALDPOD->>TALDPOD: nvte_destroy_tensors (RAII)
    TALDPOD-->>PY: (ptrs_device, swizzled_scales_buffer)

    PY->>GPU: "GEMM kernel(b_ptrs=ptrs_device, sfb_ptrs=ptrs_device)"

Reviews (6): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."

Comment thread transformer_engine/common/transformer_engine.cpp Outdated
Comment thread transformer_engine/common/util/utils.cu
Comment thread transformer_engine/common/common.h Outdated
dtype(static_cast<DType>(tensor.dtype)) {}

SimpleTensor() : SimpleTensor(nullptr, std::vector<size_t>{0}, DType::kFloat32) {}
SimpleTensor &operator=(const NVTEBasicTensor &tensor) {
Member Author

Without this assignment operator, assigning from an NVTEBasicTensor triggers a heap allocation in the SimpleTensor constructor that takes an NVTEBasicTensor. We do this assignment frequently within nvte_set_tensor_param_v2.
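
A simplified sketch of the point above, using hypothetical stand-in types (`RawShape`, `BasicTensor`, `SimpleTensorSketch`) rather than the real TE structs: with a dedicated assignment operator, the existing std::vector storage can be reused through assign() instead of building a temporary object and moving it in.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for NVTEShape / NVTEBasicTensor (fields reduced for brevity).
struct RawShape {
  std::size_t data[15];
  std::size_t ndim;
};
struct BasicTensor {
  void *data_ptr;
  RawShape shape;
};

struct SimpleTensorSketch {
  void *dptr = nullptr;
  std::vector<std::size_t> shape;

  SimpleTensorSketch() = default;

  // Converting constructor: always builds a fresh vector.
  SimpleTensorSketch(const BasicTensor &t)
      : dptr(t.data_ptr), shape(t.shape.data, t.shape.data + t.shape.ndim) {}

  // Direct assignment: no temporary object; assign() reuses the existing
  // buffer when its capacity is already large enough.
  SimpleTensorSketch &operator=(const BasicTensor &t) {
    dptr = t.data_ptr;
    shape.assign(t.shape.data, t.shape.data + t.shape.ndim);
    return *this;
  }
};
```

If the operator were absent, `tensor = basic` would go through the converting constructor and then move assignment, which allocates a new vector and frees the old one on every call.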

Comment thread transformer_engine/common/util/utils.cu Outdated
NVTE_CHECK(data_tensors[0].is_cuda(), "data_tensors must be on CUDA.");
const auto device = data_tensors[0].device();
auto stream = at::cuda::getCurrentCUDAStream();
std::tuple<at::Tensor, std::optional<at::Tensor>> transform_and_load_data_ptrs_on_device(
Member Author

I'm not committed to this name. I based it on std::transform. I suppose "map" would be more Python-focused, but that sounds worse.

Comment thread transformer_engine/common/util/utils.cu
Comment thread transformer_engine/common/transformer_engine.cpp Outdated
timmoon10 and others added 3 commits May 16, 2026 11:49
- Use size_t in kernel tail loop (was int64_t)
- Zero-initialize Payload before memcpy (Payload{})
- Rename Payload members to kMaxBytes/kVectorSize/kMaxVectors (linter)
- Consistent at::empty shape pattern: {static_cast<int64_t>(N)}
- Drop intermediate swizzled_scales_bytes variable
- Add comment explaining uniform-stride assumption in
  transform_and_load_data_ptrs_on_device
- Rename sfb_buffer -> _sfb_buffer (keepalive, not directly used)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 force-pushed the tmoon/optimize-get_device_pointer_for_data_and_scales branch from 7946e5d to 48cc585 on May 16, 2026 11:53
timmoon10 (Member Author)

/te-ci

ptrendx (Member) commented May 18, 2026

It seems a lot of those changes would not be needed if we did not use std::vector in Tensor/SimpleTensor and just used NVTEShape everywhere. This would effectively make SimpleTensor and NVTEBasicTensor the same thing (we could even put the constructor in the public header, just behind the __cplusplus guard).

vthumbe1503 (Collaborator) left a comment

Thanks for cleaning up the APIs. They look much nicer now. Since the CPU overhead is caused by heap allocations for shapes, I wonder whether we should revive this PR to standardize on NVTEShape and avoid going back and forth between vector<size_t> and NVTEShape.

Comment thread transformer_engine/common/util/utils.cu
Comment on lines +505 to +508
fc2_sfb_ptrs, _fc2_sfb_buffer = tex.transform_and_load_data_ptrs_on_device(
"uniform_mxfp8_columnwise_swizzle",
[w._columnwise_scale_inv for w in grouped_fc2_weight],
swizzle=True,
rowwise=False,
data_dtype=grouped_fc2_weight[0]._fp8_dtype,
device,
Collaborator

Another optimization would be to load both the fc1 and fc2 data and scale inv pointers together at the start of the backward pass. I hope that wouldn't make the code ugly.

Comment thread transformer_engine/common/common.h Outdated
SimpleTensor() : SimpleTensor(nullptr, std::vector<size_t>{0}, DType::kFloat32) {}
SimpleTensor &operator=(const NVTEBasicTensor &tensor) {
dptr = tensor.data_ptr;
shape.assign(tensor.shape.data, tensor.shape.data + tensor.shape.ndim);
Collaborator

When you say heap allocations are being done redundantly again and again, do you mean the vector to NVTEShape conversions?

I remember this problem being observed even when profiling a basic TE linear layer, and I hadn't gotten this PR merged:
#2514

It essentially standardizes on NVTEShape everywhere instead of using vector at all, to avoid bouncing back and forth between the two. Maybe it would be worth reviving the PR?

Member Author

Previously, assigning an NVTEBasicTensor to a SimpleTensor would trigger the constructor and then the move assignment operator. This would allocate a std::vector, move it, and deallocate the old std::vector.

One other approach I was thinking about was implementing a Shape class that wraps NVTEShape and has an API similar to std::vector. That way we can keep the nice ergonomics while avoiding heap allocations.

Collaborator

I tried your other approach in the #2514 PR above, but eventually removed it due to some complications. I'll have to refer back to my notes on why it didn't work out for me.

But here is the commit that reverted it
b599776

I had called it NVTEShapeWrapper and implemented all the vector based APIs.

Collaborator

One complication I do remember was having to change a lot of attention interfaces to take NVTEShapeWrapper instead of vector.

Member Author

I've found that adding a cast operator to std::vector helps reduce the number of places we need to change the interfaces.
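
A hedged sketch of what such a wrapper with a cast operator might look like. `RawShape`, `kMaxDims`, `Shape`, and `legacy_numel` are hypothetical names for illustration and do not reproduce the class added in this PR; the capacity of 15 is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

constexpr std::size_t kMaxDims = 15;  // assumed capacity of the inline buffer

// Hypothetical stand-in for NVTEShape.
struct RawShape {
  std::size_t data[kMaxDims];
  std::size_t ndim;
};

// Vector-like view over an inline shape; never touches the heap itself.
class Shape {
 public:
  Shape() : raw_{} {}
  Shape(std::initializer_list<std::size_t> dims) : raw_{} {
    for (std::size_t d : dims) push_back(d);
  }

  std::size_t size() const { return raw_.ndim; }
  bool empty() const { return raw_.ndim == 0; }
  const std::size_t *begin() const { return raw_.data; }
  const std::size_t *end() const { return raw_.data + raw_.ndim; }
  std::size_t operator[](std::size_t i) const { return raw_.data[i]; }
  std::size_t back() const {
    assert(!empty());
    return raw_.data[raw_.ndim - 1];
  }
  void push_back(std::size_t dim) {
    assert(raw_.ndim < kMaxDims);
    raw_.data[raw_.ndim++] = dim;
  }

  // Cast operator: lets existing std::vector-based interfaces keep working
  // (this conversion does allocate, so hot paths should avoid it).
  operator std::vector<std::size_t>() const {
    return std::vector<std::size_t>(begin(), end());
  }

 private:
  RawShape raw_;
};

// Example of an unchanged legacy interface that still takes a vector.
inline std::size_t legacy_numel(const std::vector<std::size_t> &shape) {
  std::size_t n = 1;
  for (std::size_t d : shape) n *= d;
  return n;
}
```

With the conversion in place, a call such as `legacy_numel(shape)` compiles without touching the legacy signature, so interfaces can be migrated gradually.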

timmoon10 and others added 4 commits May 20, 2026 01:25
Provides a std::vector<size_t>-like interface around NVTEShape without
heap allocation, used as the return type of Tensor::shape() in place of
the previous std::vector. Disambiguate cute::Shape from
transformer_engine::Shape in the hadamard_transform kernels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Store shape in Shape class rather than std::vector.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 (Member Author)

/te-ci

timmoon10 and others added 7 commits May 21, 2026 01:31
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Expose nvte_create_tensors and nvte_destroy_tensors so multi-tensor
callers can amortize the TensorAllocator mutex across N tensors
instead of locking once per call. nvte_destroy_tensors was already
defined internally but not declared in the public header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
…evice

The uniform swizzle path constructed 2N TensorWrappers and then
extracted their raw NVTETensors into separate vectors. Replace with a
single 2N nvte_create_tensors call into one contiguous buffer (inputs
in the first half, outputs in the second), an RAII guard for
nvte_destroy_tensors, and a local set_param lambda for the setters.
Drops the separate pack pass and reduces the allocator mutex
acquisitions from 4N to 2 per call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
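
The "one contiguous buffer plus RAII guard" layout described in the commit message above could look roughly like this. `Handle`, `create_handles`, `destroy_handles`, and `run_with_guard` are hypothetical stand-ins; the actual signatures of nvte_create_tensors and nvte_destroy_tensors are not shown in this thread and are not assumed here.

```cpp
#include <cstddef>
#include <vector>

using Handle = int;  // hypothetical opaque tensor handle

// Tracks how many handles are currently alive (for illustration only).
inline int &live_handles() {
  static int count = 0;
  return count;
}

// Hypothetical batch create/destroy, each standing in for one locked call.
inline void create_handles(Handle *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = ++live_handles();
}
inline void destroy_handles(const Handle *, std::size_t n) {
  live_handles() -= static_cast<int>(n);
}

// Destroys the whole batch when it goes out of scope, including on early return.
class DestroyGuard {
 public:
  DestroyGuard(const Handle *handles, std::size_t n) : handles_(handles), n_(n) {}
  ~DestroyGuard() { destroy_handles(handles_, n_); }
  DestroyGuard(const DestroyGuard &) = delete;
  DestroyGuard &operator=(const DestroyGuard &) = delete;

 private:
  const Handle *handles_;
  std::size_t n_;
};

// Creates 2N handles in one batch: inputs in the first half, outputs in the second.
// Returns the number of live handles observed while the batch is in use.
inline int run_with_guard(std::size_t num_tensors) {
  std::vector<Handle> handles(2 * num_tensors);
  create_handles(handles.data(), handles.size());
  DestroyGuard guard(handles.data(), handles.size());
  const Handle *inputs = handles.data();
  const Handle *outputs = handles.data() + num_tensors;
  (void)inputs;   // a real caller would set parameters and launch work here
  (void)outputs;
  return live_handles();
}
```

In this sketch the create and destroy steps are each a single batched call, which mirrors the "2 per call" figure in the commit message.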
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
timmoon10 (Member Author)

/te-ci

} // namespace load_value_on_device
} // namespace transformer_engine

void nvte_load_value_on_device(const void *host_ptr, void *device_ptr, size_t num_bytes,
Collaborator

I think this function name is a bit confusing for public API. The name says "load_value", but the arguments are raw pointers and num_bytes, so it looks more like copying host bytes to device. Also the implementation launches multiple kernels if num_bytes > Payload::kMaxBytes, which may be unexpected for users who read this as a generic H2D copy helper.

Would it be better if we make it more explicit, e.g.:

  • rename to nvte_copy_small_host_payload_to_device
  • add a check to make sure this is only used for small CUDA graph-friendly payloads

timmoon10 (Member Author), May 21, 2026

I'm not quite happy with the name either. My intention was "load an arbitrarily sized object onto the GPU, optimized for small things like structs". Some other ideas I had:

  • "memcpy" or "copy": Perfect match for the API, but it also doesn't give any hint that it is optimized for very small copies. Also, it doesn't communicate that the data is immediately passed as a kernel arg, so the host buffer can be immediately freed and this is compatible with CUDA Graphs.
  • "fill": Communicates that we are copying a single thing and that the data is included as a kernel arg. However, std::fill and torch.Tensor.fill_ repeat the value multiple times in the output buffer, rather than copying directly.
  • "load": Consistent with my intended meaning, but vague. Reminds me of cuModuleLoadDataEx, which has some similarities but is also different enough that we need to avoid confusion.
  • "store": Similar to "load".

I don't think we should enforce a single kernel launch though. The intended use-case is to copy lists of device pointers, which can become large for MoE models with large numbers of experts. The existing implementation can handle arbitrarily-sized data correctly, although perf may be terrible.

timmoon10 (Member Author), May 21, 2026

Perplexity likes nvte_copy_host_to_device_via_kernel, nvte_copy_host_to_device_immediate, nvte_copy_host_to_device_graph_safe. The first one seems the clearest and least awkward to me.

transformer_engine::DType data_dtype, scale_dtype;
switch (scaling_mode) {
case NVTE_MXFP8_1D_SCALING:
data_dtype = transformer_engine::DType::kFloat8E4M3;
Collaborator

Do we really want to hardcode data_dtype = kFloat8E4M3 here?

timmoon10 (Member Author), May 21, 2026

We don't actually access the fp8e4m3 values when swizzling; this is a fake configuration so that the tensor passes validation checks.


4 participants