
Conversation

@varun-sundar-rabindranath (Contributor) commented Nov 9, 2025

Purpose

This PR focuses on the deepep_low_latency All2All code path.

DeepGEMM expects the activation scales to be packed in int32 format. Given unpacked scales, DeepGEMM performs the scale transformation itself; this transformation is slow and unnecessarily delays kernel execution.

This PR fuses the scale transformation into the silu_mul_quant kernel, thereby eliminating a few torch operations in the hot path.
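
For context, the sketch below illustrates what "packed in int32 format" means here. It is illustrative only (not the kernel's implementation) and assumes power-of-two fp32 group scales and little-endian byte order; the function name and padding behaviour are assumptions.

```python
import torch
import torch.nn.functional as F

def pack_ue8m0_scales_ref(scales: torch.Tensor) -> torch.Tensor:
    """Illustrative only: pack [T, G] power-of-two fp32 scales into [T, ceil(G/4)] int32.

    ue8m0 keeps just the biased 8-bit fp32 exponent of each scale; four consecutive
    group exponents are packed into one little-endian int32 word, padding G up to a
    multiple of 4.
    """
    T, G = scales.shape
    # Extract the biased exponent byte of each fp32 scale.
    exp = (scales.contiguous().view(torch.int32) >> 23) & 0xFF
    G_pad = (G + 3) // 4 * 4
    exp = F.pad(exp, (0, G_pad - G)).to(torch.uint8)  # [T, G_pad]
    # Reinterpret each run of 4 exponent bytes as one int32 word.
    return exp.view(T, G_pad // 4, 4).contiguous().view(torch.int32).squeeze(-1)
```

For example, a scale of 1.0 has biased exponent 127 (0x7F), so a row of all-ones scales packs into words of 0x7F7F7F7F.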

Main: act-mul-main (screenshot)
PR: act-mul-pr (screenshot)

Test Plan

  • Add unit tests for the transformation
  • E2E:
    command: VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
    lm_eval: lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100

Test Result

  • Unit tests pass
  • E2E passes with
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.89|±  |0.0314|
|     |       |strict-match    |     5|exact_match|↑  | 0.89|±  |0.0314|

@varun-sundar-rabindranath (Contributor, Author)

cc @elvircrn for changes to the kernel. PTAL ! Thanks 🙌

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant performance optimization by fusing the scale packing for DeepGEMM directly into the silu_mul_quant kernel. This avoids slow transformations in the hot path and shows promising performance gains. The changes are well-tested with an expanded unit test suite that covers various scale formats.

My review focuses on two main points:

  1. The reference implementation in the Python unit test appears to have an incorrect memory layout compared to the kernel's output, which could lead to a misleading test result.
  2. The comment in the CUDA kernel describing the memory layout is inconsistent and could cause confusion for future developers.

Addressing these points will improve the correctness of the tests and the maintainability of this complex but important kernel.

Comment on lines +699 to +724
// Int32 packed ue8m0 scales tensor.
// Let E, T, G be the number of experts, number of tokens, and number of groups
// respectively. Let E = 2, T = 4, G = 6; in this case the int32 scales
// tensor is of shape [1, 4, 2] and stride [8, 1, 4]. The scales are expected
// to be arranged as follows,
// [[T0G0-T0G1-T0G2-T0G3, T0G4-T0G5-X-X,],
// [T1G0-T1G1-T1G2-T1G3, T1G4-T1G5-X-X,]
// [T2G0-T2G1-T2G2-T2G3, T2G4-T2G5-X-X,]
// [T3G0-T3G1-T3G2-T3G3, T3G4-T3G5-X-X,]]
// where TxGy is the ue8m0 scale value of Token x, Group y.
//
// In memory (in bytes) the scale values are arranged as,
// [T0G0, T0G1, T0G2, T0G3, T1G0, T1G2, T1G3, T1G4, T2G0, T2G1, T2G3, T2G4,
// T3G0, T3G1, T3G2, T3G3, T0G4, T0G5, X, X, T1G4, T1G5, X, X, T2G4, T2G5,
// X, X, T3G4, T3G5, X, X]
//
// An Int32 tensor of size [1, 4, 2] and stride [8, 1, 4] can be represented
// as a uint8 tensor of shape [1, 2, 4, 4] and stride [32, 16, 4, 1]. In
// English, ignoring the Experts dimension, the original int32 tensor is
// simply treated as two packed [4, 4] uint8 tensors (or two [4, 1] int32
// tensors). The following stride setting reflects this change. Caveat: this
// means that the G dimension is no longer contiguous, i.e. to move
// from G3 to G4, we need to jump along the packing dimension. The kernel
// handles this case.
@gemini-code-assist (bot), severity: high

This comment explaining the packed int32 memory layout is confusing and contains inconsistencies.

  1. The 2D array visualization (lines 704-707) suggests a (T, G_packed) C-contiguous layout, where G_packed is the fastest-moving dimension.
  2. The linear memory layout description (lines 710-713) suggests a (G_packed, T) C-contiguous layout, where T is the fastest-moving dimension. There's also a typo (T1G2 instead of T1G1).
  3. The implementation follows the second layout ((G_packed, T)).

To improve clarity and maintainability, please revise this comment to be consistent and accurately describe the (G_packed, T) memory layout that the kernel implements and expects.
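
For illustration, the standalone snippet below (not code from this PR; it assumes little-endian hardware and the [1, 4, 2] example from the kernel comment) shows how an int32 view of shape [1, 4, 2] with stride [8, 1, 4] and a uint8 view of shape [1, 2, 4, 4] with stride [32, 16, 4, 1] alias the same bytes, with T as the fastest-moving dimension:

```python
import torch

# 1 expert x 4 tokens x 6 groups -> 2 packed int32 words per token -> 32 bytes total.
buf = torch.arange(32, dtype=torch.uint8)

# Packed-scales tensor as described: int32, shape [E, T, G_packed] = [1, 4, 2],
# stride [8, 1, 4] in int32 elements, so T (stride 1) moves fastest, i.e. a
# (G_packed, T) layout.
scales_i32 = torch.as_strided(buf.view(torch.int32), size=(1, 4, 2), stride=(8, 1, 4))

# Byte-level view of the same storage: shape [E, G_packed, T, 4], stride [32, 16, 4, 1].
scales_u8 = torch.as_strided(buf, size=(1, 2, 4, 4), stride=(32, 16, 4, 1))

# Byte b of word [e, t, g] in the int32 view is byte [e, g, t, b] in the uint8 view.
t, g = 3, 1
word = int(scales_i32[0, t, g])
assert [(word >> (8 * b)) & 0xFF for b in range(4)] == scales_u8[0, g, t].tolist()
```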

@varun-sundar-rabindranath (Contributor, Author)

Reviewers take note. If the description is confusing, I'll rewrite it. Thanks.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@mgoin added the performance, kernel, deepseek, and nvidia labels on Nov 9, 2025
@mgoin added the ready label on Nov 9, 2025
@mergify (bot) commented Nov 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify (bot) added the needs-rebase label on Nov 10, 2025
@elvircrn (Contributor)

LGTM

@heheda12345 (Collaborator)

CC @mgoin @yewentao256

Varun Sundar Rabindranath added 3 commits on November 12, 2025 (each Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>)
@mgoin (Member) left a comment

Okay I think this looks good to me, nice work

@github-project-automation (bot) moved this to In review in NVIDIA on Nov 13, 2025
@vllm-bot merged commit fe1cd77 into vllm-project:main on Nov 13, 2025
91 of 93 checks passed
@github-project-automation (bot) moved this from In review to Done in NVIDIA on Nov 13, 2025
geodavic pushed a commit to geodavic/vllm that referenced this pull request on Nov 16, 2025
bwasti pushed a commit to bwasti/vllm that referenced this pull request on Nov 17, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request on Nov 29, 2025
@yewentao256 deleted the varun/silu-mul-packed-e8m0 branch on December 1, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request on Dec 5, 2025

Labels

deepseek (Related to DeepSeek models), kernel, nvidia, performance (Performance-related issues), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done
