Conversation

@sarthak-amd (Collaborator) commented Jan 20, 2026

Description

Implements the rowwise and columnwise FP32/BF16 -> MXFP4 fused quantization + cast kernel.

  • Verify tolerances and functional unit tests.

  • The Triton te_cast_transpose_mxfp4_triton kernel currently outputs FP4 data in a linear [M, N/2] layout with contiguous byte packing. AITER's gemm_a4w4 requires the B matrix in the MFMA shuffle layout for the tensor cores; this layout shuffle can be fused into the Triton kernel in the future.
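For reference, a minimal sketch of what the linear [M, N/2] byte packing means. This is not the kernel itself; the nibble order shown (even column in the low nibble) is an assumption for illustration, and the MFMA shuffle layout required by gemm_a4w4 would be a permutation of this packed form.

import torch

def pack_fp4_linear(codes: torch.Tensor) -> torch.Tensor:
    """Pack 4-bit FP4 (E2M1) code points into a contiguous [M, N/2] uint8 tensor.

    `codes` is an [M, N] uint8 tensor of values in 0..15. Nibble order is an
    illustrative assumption, not the kernel's documented convention.
    """
    assert codes.dtype == torch.uint8 and codes.shape[-1] % 2 == 0
    lo = codes[..., 0::2] & 0x0F  # even columns -> low nibble (assumed)
    hi = codes[..., 1::2] & 0x0F  # odd columns  -> high nibble (assumed)
    return (hi << 4) | lo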

@wangye805 (Collaborator) left a comment

import numpy as np
import os

os.environ["USE_TRITON_FUSED_CAST_TRANSPOSE"] = "1"
Collaborator:

We previously defined the env var NVTE_USE_CAST_TRANSPOSE_TRITON.
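For illustration, a minimal sketch of gating on the existing switch instead of introducing a new one; the "1"/"0" parsing convention here is an assumption, not TE's documented behavior.

import os

# Reuse the already-defined switch; exact parsing in TE may differ.
use_triton_cast_transpose = bool(int(os.getenv("NVTE_USE_CAST_TRANSPOSE_TRITON", "0")))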

def test_quantize_mxfp4(shape, in_dtype, rowwise, columnwise, shuffle_B_matrix):
    """Test MXFP4 quantization for rowwise/columnwise modes with/without FP4 shuffle.

    Note: FP4 data shuffle (shuffle_B_matrix_for_aiter) is not yet supported in Triton kernel.
Collaborator:

If the FP4 data shuffle is not yet supported in the Triton kernel, why do we need to add it here?
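If the parameter stays in the test for now, one option is to mark the unsupported shuffle case explicitly rather than exercising it. A hedged pytest sketch; the list name is hypothetical:

import pytest

# Sketch: keep shuffle_B_matrix in the parametrization but skip the True case
# until the Triton kernel supports the FP4 shuffle layout.
SHUFFLE_B_MATRIX_PARAMS = [
    False,
    pytest.param(True, marks=pytest.mark.skip(reason="FP4 shuffle not yet supported in Triton kernel")),
]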

(32768, 160),
(4096, 1632),
(8, 32, 1024),
(16, 8, 4, 512),
Collaborator:

Can we add some prime numbers like

{1, 3221}, // Prime 456
{2333, 1}, // Prime 345
{1481, 677}}; // Primes 234, 123
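For illustration, the suggested prime dimensions translated into the pytest shape list quoted above. This is only a sketch: MXFP4's two-values-per-byte packing and 32-element scale blocks may require padding or skipping some of these shapes, which ties into the alignment question raised later in this review.

# Sketch: prime-dimension shapes mirroring the quoted mxfp8 C++ test cases.
PRIME_SHAPES = [
    (1, 3221),
    (2333, 1),
    (1481, 677),
]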

Comment on lines +127 to +128
data_atol = 20.0 if in_dtype != torch.float32 else 16.0
scale_atol = 2.0 if in_dtype != torch.float32 else 1.0
Collaborator:

The data tolerance seems quite large. You can follow our mxfp8 scale and data adjustment scheme:

void adjust_ref_for_e8m0_scale_error(const std::string &name,
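A hypothetical Python analogue of that C++ helper, sketching the idea only; the exact TE semantics and bookkeeping are assumptions.

import torch

MXFP4_BLOCK = 32  # elements covered by one E8M0 scale

def adjust_ref_for_e8m0_scale_error(ref_vals, ref_scale_u8, test_scale_u8):
    # ref_vals:      reference FP4 values decoded to float, shape [M, K]
    # ref_scale_u8:  reference E8M0 scales (biased exponents), shape [M, K // 32]
    # test_scale_u8: kernel-produced E8M0 scales, same shape
    #
    # A one-step exponent difference is a legitimate rounding outcome of the
    # amax -> E8M0 conversion and changes the decode scale by a factor of 2.
    # Where that happens, re-express the reference value under the test scale
    # instead of loosening the data tolerance for every element.
    diff = test_scale_u8.to(torch.int32) - ref_scale_u8.to(torch.int32)
    factor = torch.where(diff.abs() == 1,
                         torch.exp2(-diff.float()),
                         torch.ones_like(diff, dtype=torch.float32))
    factor = factor.repeat_interleave(MXFP4_BLOCK, dim=-1)
    return ref_vals * factor[..., : ref_vals.shape[-1]]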

use_torch_semantics=True
)

# Compare only valid (non-padded) region - no shuffle extraction needed
Collaborator:

What is the FP4 shuffle?

.value("kFloat8E4M3", transformer_engine::DType::kFloat8E4M3) \
.value("kFloat8E5M2", transformer_engine::DType::kFloat8E5M2); \
.value("kFloat8E5M2", transformer_engine::DType::kFloat8E5M2) \
.value("kFloat4E2M1", transformer_engine::DType::kFloat4E2M1); \
Collaborator:

If we are going to enable kFloat4E2M1, there are other related changes needed; see https://github.com/search?q=repo%3AROCm%2FTransformerEngine%20kFloat4E2M1&type=code for more details.


Comment on lines +61 to +62
- Data: [M, K/2] uint8 tensor (2 FP4 values packed per byte)
- Scale: [M, K/32] uint8 tensor (E8M0 format, one scale per 32-element block)
Collaborator:

Are there alignment/padding requirements for M and K?

Comment on lines +113 to +114
if inp.ndim < 2:
    return False
Collaborator:

TE currently supports 2D matrices flattened from high-dimensional tensors:

  size_t flat_first_dim() const {
    const auto &full_shape = shape();
    size_t ret = 1;
    if (!full_shape.empty()) {
      for (size_t i = 0; i < full_shape.size() - 1; i++) {
        ret *= full_shape[i];
      }
    }
    return ret;
  }
  /*! Matrix width after tensor is flattened to 2D
   *
   * If a tensor has dimensions (D1, D2, ..., Dn), it is reinterpreted
   * as a (D1*D2*...*D(n-1), Dn) matrix.
   */
  size_t flat_last_dim() const {
    const auto &full_shape = shape();
    if (full_shape.empty()) {
      return 1;
    } else {
      return full_shape.back();
    }
  }
};
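A minimal Python sketch of that flattening convention on the Triton side, so higher-dimensional inputs are flattened rather than rejected by the ndim check; the helper name is hypothetical.

import math
import torch

def flatten_to_2d(inp: torch.Tensor) -> torch.Tensor:
    # Mirror flat_first_dim()/flat_last_dim(): (D1, ..., Dn) is reinterpreted
    # as a (D1*...*D(n-1), Dn) matrix before quantization.
    if inp.ndim < 2:
        return inp.reshape(1, -1)  # 0-D/1-D inputs become a single-row matrix
    return inp.reshape(math.prod(inp.shape[:-1]), inp.shape[-1])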


# Allocate PADDED scale tensors for shuffle compatibility
rowwise_scale_N = K // MXFP4_BLOCK_SCALING_SIZE
rowwise_scale_M_pad = cdiv(M, 256) * 256
Collaborator:

I presume this 256 is from some alignment/padding requirement?
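For reference, a minimal sketch of the ceiling-division helper assumed in the quoted allocation and the resulting rounded-up size; why the multiple is 256 specifically is the open question above.

def cdiv(a: int, b: int) -> int:
    # Ceiling division: number of b-sized blocks needed to cover a.
    return (a + b - 1) // b

# Example: M = 1000 -> rowwise_scale_M_pad = cdiv(1000, 256) * 256 = 1024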

@@ -0,0 +1,178 @@
# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
Collaborator:

You will need to add this pytest to our CI script (somewhere near
run_default_fa 1 triton_kernels/test_norms.py
), otherwise it won't be tested.
