[Performance][B200] silu_mul_quant: pack scales in int32 #28358
Conversation
cc @elvircrn for changes to the kernel. PTAL! Thanks 🙌
Code Review
This pull request introduces a significant performance optimization by fusing the scale packing for DeepGEMM directly into the silu_mul_quant kernel. This avoids slow transformations in the hot path and shows promising performance gains. The changes are well-tested with an expanded unit test suite that covers various scale formats.
My review focuses on two main points:
- The reference implementation in the Python unit test appears to have an incorrect memory layout compared to the kernel's output, which could lead to a misleading test result.
- The comment in the CUDA kernel describing the memory layout is inconsistent and could cause confusion for future developers.
Addressing these points will improve the correctness of the tests and the maintainability of this complex but important kernel.
// Int32 packed ue8m0 scales tensor.
// Let E, T, G be the number to experts, number of tokens and number of groups
// respectively. Let, E = 2, T = 4, G = 6, in this case the int32 scales
// tensor are of shape [1, 4, 2] and stride [8, 1, 4]. The scales are expected
// to be arranged as follows,
// [[T0G0-T0G1-T0G2-T0G3, T0G4-T0G5-X-X,],
//  [T1G0-T1G1-T1G2-T1G3, T1G4-T1G5-X-X,]
//  [T2G0-T2G1-T2G2-T2G3, T2G4-T2G5-X-X,]
//  [T3G0-T3G1-T3G2-T3G3, T3G4-T3G5-X-X,]]
// where, TxGy is the scale ue8m0 scale value of Token x, Group y.
//
// In memory (in bytes) the scale values are arranged as,
// [T0G0, T0G1, T0G2, T0G3, T1G0, T1G2, T1G3, T1G4, T2G0, T2G1, T2G3, T2G4,
//  T3G0, T3G1, T3G2, T3G3, T0G4, T0G5, X, X, T1G4, T1G5, X, X, T2G4, T2G5,
//  X, X, T3G4, T3G5, X, X]
//
// An Int32 tensor of size [1, 4, 2] and stride [8, 1, 4] can be represented
// as an uint8 tensor of shape [1, 2, 4, 4] and stride [32, 16, 4, 1]. In
// english, ignoring the Experts dimension, the original int32 tensor is
// simply treated as two packed [4, 4] uint8 tensor (or two [4, 1] int32
// tensor). The following strides setting reflects this change. Caveat: This
// means that the G dimension is no longer contiguous. i.e. Note that to move
// from G3 to G4, we need to jump along the packing dimension. The kernel
// handles this case.
This comment explaining the packed int32 memory layout is confusing and contains inconsistencies.
- The 2D array visualization (lines 704-707) suggests a `(T, G_packed)` C-contiguous layout, where `G_packed` is the fastest-moving dimension.
- The linear memory layout description (lines 710-713) suggests a `(G_packed, T)` C-contiguous layout, where `T` is the fastest-moving dimension. There's also a typo (`T1G2` instead of `T1G1`).
- The implementation follows the second layout (`(G_packed, T)`).
To improve clarity and maintainability, please revise this comment to be consistent and accurately describe the `(G_packed, T)` memory layout that the kernel implements and expects.
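For concreteness, below is a minimal torch sketch (an editorial illustration for this discussion, not code from the PR) of the `(G_packed, T)` packing described above. It assumes the input is a `[T, G]` uint8 tensor of ue8m0 exponent bytes and that int32 words are reinterpreted in the platform's native little-endian byte order; the function name is made up for the example.

```python
import torch

def pack_ue8m0_scales(scales_u8: torch.Tensor) -> torch.Tensor:
    """Pack [T, G] uint8 ue8m0 exponents into int32 words in (G_packed, T) order."""
    T, G = scales_u8.shape
    G_packed = (G + 3) // 4                       # 4 uint8 scales per int32 word
    padded = torch.zeros((T, G_packed * 4), dtype=torch.uint8)
    padded[:, :G] = scales_u8                     # trailing bytes are the "X" padding
    # Group bytes as [T, G_packed, 4], then make the token dim fastest-moving
    # inside each packed-group block: memory order becomes (g_packed, t, byte).
    blocks = padded.view(T, G_packed, 4).permute(1, 0, 2).contiguous()
    words = blocks.flatten().view(torch.int32)    # reinterpret every 4 bytes as one int32
    # Expose a [T, G_packed] view with strides [1, T] (in int32 elements), i.e. the
    # non-contiguous G dimension that the kernel comment warns about.
    return words.view(G_packed, T).t()

# Tiny usage example matching the comment's T = 4, G = 6 case (expert dim ignored):
# scales = torch.arange(24, dtype=torch.uint8).view(4, 6)
# packed = pack_ue8m0_scales(scales)   # shape [4, 2], strides (1, 4)
```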
Reviewers take note. If the description is confusing, I'll rewrite it. Thanks.
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM
mgoin left a comment
Okay, I think this looks good to me. Nice work!
Purpose
This PR focuses on the deepep_low_latency All2All code path.
DeepGEMM expects the activation scales to be packed in int32 format. Given unpacked scales, DeepGEMM performs the scale transformation itself; this transformation is slow and unnecessarily delays kernel execution.
This PR fuses the scale transformation into the silu_mul_quant kernel, thereby eliminating a few torch operations in the hot path.
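As a rough editorial reference (not the PR's kernel and not a vLLM API), the math of such a fused activation-quant step can be sketched in torch as below, assuming a SiLU-and-multiply activation, 128-wide quantization groups, fp8 e4m3 outputs, and power-of-two (ue8m0-style) scales; the actual kernel additionally writes the scales already packed into int32, as illustrated in the packing sketch earlier in this thread.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max   # 448.0 for e4m3
GROUP = 128                                      # assumed quantization group size

def silu_mul_quant_ref(x: torch.Tensor):
    """x: [T, 2*H], gate in the first half, up-projection in the second (H % 128 == 0)."""
    d = x.shape[-1] // 2
    y = torch.nn.functional.silu(x[..., :d].float()) * x[..., d:].float()
    g = y.view(y.shape[0], d // GROUP, GROUP)                # per-group view
    amax = g.abs().amax(dim=-1, keepdim=True).clamp_min(1e-10)
    # Round each group scale up to a power of two so it is exactly representable
    # as a ue8m0 exponent byte.
    scale = torch.exp2(torch.ceil(torch.log2(amax / FP8_MAX)))
    y_q = (g / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_q.view(y.shape), scale.squeeze(-1)              # fp8 activations, fp32 scales
```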
Main


PR
Test Plan
command:
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 canhazgpu run -g2 -- vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --no-enable-prefix-caching --port 9010
lm_eval:
lm_eval --model local-completions --tasks gsm8k --model_args model=Qwen/Qwen3-30B-A3B-FP8,base_url=http://localhost:9010/v1/completions,num_concurrent=30,max_retries=3 --limit 100
Test Result