【0.11.0-dev】optimization of kimi-k2 in cann8.3 #4555
base: v0.11.0-dev
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces an optimization for the 'kimi' model on CANN 8.3 by enabling a fused MoE gating kernel. The changes are applied across several files related to MoE, including different quantization paths. My review includes a critical comment about an inconsistency in one of the quantization files that could lead to incorrect behavior, and a high-severity comment about code duplication that impacts maintainability. Addressing these points will improve the robustness and clarity of the implementation.
    if global_num_experts == 256 or (global_num_experts == 384 and
            torch.version.cann.startswith("8.3")):
There's an inconsistency in how the model type is determined here. This file checks global_num_experts directly, while other files in this PR (e.g., torchair_w8a8_dynamic.py) check against the effective number of experts (global_num_experts - global_redundant_expert_num). This could lead to incorrect kernel selection if global_redundant_expert_num is non-zero, which would be a bug. The logic should be consistent across all files. The original logic if global_num_experts == 256: was also likely incorrect for the same reason.
Suggested change:
-    if global_num_experts == 256 or (global_num_experts == 384 and
-            torch.version.cann.startswith("8.3")):
+    if (global_num_experts - global_redundant_expert_num == 256) or \
+            ((global_num_experts - global_redundant_expert_num == 384) and torch.version.cann.startswith("8.3")):
    is_deepseek_v3_r1 = global_num_experts - global_redundant_expert_num == 256
    is_kimi = global_num_experts - global_redundant_expert_num == 384
    # NOTE: now npu_moe_gating_top_k can support `group_count=256` pattern, and `group_count=384` pattern in cann8.3
    if is_deepseek_v3_r1 or (is_kimi and torch.version.cann.startswith("8.3")):
The logic to identify deepseek_v3_r1 and kimi models using magic numbers (256, 384), and the check for CANN version 8.3, is duplicated across multiple files (experts_selector.py, torchair_fused_moe.py, torchair_w8a8_dynamic.py, and torchair_w4a8_dynamic.py). This makes the code harder to maintain and increases the risk of inconsistencies when adding support for new models or CANN versions. Consider centralizing this logic into a helper function or a configuration object for better maintainability and readability.
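One way to act on this review comment is to collect the expert-count and CANN-version checks behind a single predicate that all four files can import. The sketch below is illustrative, not code from the PR; the function name `use_fused_gating_topk` and its parameters are hypothetical, and the CANN version is passed in as a string so the logic stays testable off-device (the real call sites would pass `torch.version.cann`).

```python
def use_fused_gating_topk(global_num_experts: int,
                          global_redundant_expert_num: int,
                          cann_version: str) -> bool:
    """Return True if the fused npu_moe_gating_top_k kernel can be used.

    group_count=256 (DeepSeek-V3/R1 layout) is supported on all CANN
    versions covered here; group_count=384 (kimi layout) requires CANN 8.3.
    """
    # Always compare the effective expert count, so redundant experts
    # do not change which kernel is selected.
    effective_experts = global_num_experts - global_redundant_expert_num
    is_deepseek_v3_r1 = effective_experts == 256
    is_kimi = effective_experts == 384
    return is_deepseek_v3_r1 or (is_kimi and cann_version.startswith("8.3"))
```

With this helper, each call site reduces to `if use_fused_gating_topk(...)`, and adding a new supported `group_count` or CANN version is a one-line change in one place.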
    is_kimi = global_num_experts - global_redundant_expert_num == 384
    # NOTE: now npu_moe_gating_top_k can support `group_count=256` pattern, and `group_count=384` pattern in cann8.3
    if is_deepseek_v3_r1 or (is_kimi and torch.version.cann.startswith("8.3")):
There is no need to check the expert count here; just remove the check.
This PR only modifies the torchair part; [experts_selector.py] is changed by PR4352.
Force-pushed 820697b to 492d78c
Force-pushed f0edcbd to b78b7a9
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Force-pushed b78b7a9 to 3c32524
Force-pushed 9271e19 to 5e779f1
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Force-pushed 5e779f1 to 9facab7
wangxiyuan left a comment
I think this is already done by c4a11a7?
This PR is for torchair.
What this PR does / why we need it?
In CANN 8.3, the npu_moe_gating_top_k operator supports an expert count of 384, so kimi can use this operator to get better performance.
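The selection logic this description implies can be sketched as a simple dispatch: use the fused gating kernel when the expert layout and CANN version allow it, and fall back to the generic gating path otherwise. The function and path names below are illustrative (they are not identifiers from the PR), and the CANN version is a plain string argument for readability.

```python
def select_gating_path(effective_experts: int, cann_version: str) -> str:
    """Pick a gating implementation for the given MoE expert layout.

    CANN 8.3 extends npu_moe_gating_top_k support from group_count=256
    (DeepSeek-V3/R1) to also cover group_count=384 (kimi).
    """
    if effective_experts == 256 or (effective_experts == 384
                                    and cann_version.startswith("8.3")):
        return "npu_moe_gating_top_k"    # fused gating kernel
    return "generic_softmax_topk"        # unfused fallback path
```

Before CANN 8.3, a 384-expert model would take the fallback path; on 8.3 it gets the fused kernel, which is the performance win this PR targets.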
Does this PR introduce any user-facing change?
How was this patch tested?