[Performance][B200] silu_mul_quant: pack scales in int32 #28358
Conversation
cc @elvircrn for changes to the kernel. PTAL! Thanks 🙌
Code Review
This pull request introduces a significant performance optimization by fusing the scale packing for DeepGEMM directly into the silu_mul_quant kernel. This avoids slow transformations in the hot path and shows promising performance gains. The changes are well-tested with an expanded unit test suite that covers various scale formats.
My review focuses on two main points:
- The reference implementation in the Python unit test appears to have an incorrect memory layout compared to the kernel's output, which could lead to a misleading test result.
- The comment in the CUDA kernel describing the memory layout is inconsistent and could cause confusion for future developers.
Addressing these points will improve the correctness of the tests and the maintainability of this complex but important kernel.
// Int32 packed ue8m0 scales tensor.
// Let E, T, G be the number to experts, number of tokens and number of groups
// respectively. Let, E = 2, T = 4, G = 6, in this case the int32 scales
// tensor are of shape [1, 4, 2] and stride [8, 1, 4]. The scales are expected
// to be arranged as follows,
// [[T0G0-T0G1-T0G2-T0G3, T0G4-T0G5-X-X,],
//  [T1G0-T1G1-T1G2-T1G3, T1G4-T1G5-X-X,]
//  [T2G0-T2G1-T2G2-T2G3, T2G4-T2G5-X-X,]
//  [T3G0-T3G1-T3G2-T3G3, T3G4-T3G5-X-X,]]
// where, TxGy is the scale ue8m0 scale value of Token x, Group y.
//
// In memory (in bytes) the scale values are arranged as,
// [T0G0, T0G1, T0G2, T0G3, T1G0, T1G2, T1G3, T1G4, T2G0, T2G1, T2G3, T2G4,
//  T3G0, T3G1, T3G2, T3G3, T0G4, T0G5, X, X, T1G4, T1G5, X, X, T2G4, T2G5,
//  X, X, T3G4, T3G5, X, X]
//
// An Int32 tensor of size [1, 4, 2] and stride [8, 1, 4] can be represented
// as an uint8 tensor of shape [1, 2, 4, 4] and stride [32, 16, 4, 1]. In
// english, ignoring the Experts dimension, the original int32 tensor is
// simply treated as two packed [4, 4] uint8 tensor (or two [4, 1] int32
// tensor). The following strides setting reflects this change. Caveat: This
// means that the G dimension is no longer contiguous. i.e. Note that to move
// from G3 to G4, we need to jump along the packing dimension. The kernel
// handles this case.
This comment explaining the packed int32 memory layout is confusing and contains inconsistencies.
- The 2D array visualization (lines 704-707) suggests a `(T, G_packed)` C-contiguous layout, where `G_packed` is the fastest-moving dimension.
- The linear memory layout description (lines 710-713) suggests a `(G_packed, T)` C-contiguous layout, where `T` is the fastest-moving dimension. There's also a typo (`T1G2` instead of `T1G1`).
- The implementation follows the second layout (`(G_packed, T)`).
To improve clarity and maintainability, please revise this comment to be consistent and accurately describe the `(G_packed, T)` memory layout that the kernel implements and expects.
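For concreteness, below is a minimal torch sketch (an editorial illustration for this discussion, not code from the PR) of the `(G_packed, T)` packing described above. It assumes the input is a `[T, G]` uint8 tensor of ue8m0 exponent bytes and that int32 words are reinterpreted in the platform's native little-endian byte order; the function name is made up for the example.

```python
import torch

def pack_ue8m0_scales(scales_u8: torch.Tensor) -> torch.Tensor:
    """Pack [T, G] uint8 ue8m0 exponents into int32 words in (G_packed, T) order."""
    T, G = scales_u8.shape
    G_packed = (G + 3) // 4                       # 4 uint8 scales per int32 word
    padded = torch.zeros((T, G_packed * 4), dtype=torch.uint8)
    padded[:, :G] = scales_u8                     # trailing bytes are the "X" padding
    # Group bytes as [T, G_packed, 4], then make the token dim fastest-moving
    # inside each packed-group block: memory order becomes (g_packed, t, byte).
    blocks = padded.view(T, G_packed, 4).permute(1, 0, 2).contiguous()
    words = blocks.flatten().view(torch.int32)    # reinterpret every 4 bytes as one int32
    # Expose a [T, G_packed] view with strides [1, T] (in int32 elements), i.e. the
    # non-contiguous G dimension that the kernel comment warns about.
    return words.view(G_packed, T).t()

# Tiny usage example matching the comment's T = 4, G = 6 case (expert dim ignored):
# scales = torch.arange(24, dtype=torch.uint8).view(4, 6)
# packed = pack_ue8m0_scales(scales)   # shape [4, 2], strides (1, 4)
```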
Reviewers take note. If the description is confusing, I'll rewrite it. Thanks.
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM
mgoin left a comment
Okay, I think this looks good to me. Nice work!
Purpose
This PR focuses on the deepep_low_latency All2All code path.
DeepGEMM expects the activation scales to be packed in int32 format. Given unpacked scales, DeepGEMM performs the scale transformation itself; this transformation is slow and unnecessarily delays kernel execution.
This PR fuses the scale transformation into the silu_mul_quant kernel, thereby eliminating a few torch operations in the hot path.
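As a rough editorial reference (not the PR's kernel and not a vLLM API), the math of such a fused activation-quant step can be sketched in torch as below, assuming a SiLU-and-multiply activation, 128-wide quantization groups, fp8 e4m3 outputs, and power-of-two (ue8m0-style) scales; the actual kernel additionally writes the scales already packed into int32, as illustrated in the packing sketch earlier in this thread.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3
GROUP = 128                                      # assumed quantization group size

def silu_mul_quant_ref(x: torch.Tensor):
    """x: [T, 2*H], gate in the first half, up-projection in the second (H % 128 == 0)."""
    d = x.shape[-1] // 2
    y = torch.nn.functional.silu(x[..., :d].float()) * x[..., d:].float()
    g = y.view(y.shape[0], d // GROUP, GROUP)                # per-group view
    amax = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-10)
    # Round each group scale up to a power of two so it is exactly representable
    # as a ue8m0 exponent byte.
    scale = torch.exp2(torch.ceil(torch.log2(amax / FP8_MAX)))
    y_q = (g / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_q.view(y.shape), scale.squeeze(-1)              # fp8 activations, fp32 scales
```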
Main


PR
Test Plan
command:
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
lm_eval:
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
Test Result