[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc #7777

ShaneGZhu wants to merge 5 commits into
Conversation
Thanks for your contribution!
CI report (generated from the code below; refreshed every 30 minutes):

1. Task overview: CI is still in progress; 1 Required task failed
2. Task status summary
   - 2.1 Required tasks: 5/10 passed
   - 2.2 Optional tasks: 24/28 passed
3. Failure details (Required only)
   - Approval: PR approval-process issue (confidence: high)
   - Root cause: both required approvals are currently missing
   - Suggested fix: ask the designated FastDeploy RD and the designated PaddlePaddle RD to each approve once. Link: view logs
```python
from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
    fused_cast_sigmoid_bias,
)
```
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           develop    #7777  +/-  ##
=======================================
  Coverage         ?   63.77%
=======================================
  Files            ?      458
  Lines            ?    63513
  Branches         ?     9728
=======================================
  Hits             ?    40507
  Misses           ?    20235
  Partials         ?     2771
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 16:48:06
📋 Review Summary
PR overview: fuses the cast+sigmoid+bias+noauxtc operations of the MoE routing stage into a single CUDA kernel; only the non-EP CUDA path is supported; a measured TPS improvement of roughly 4% on GLM-4.5-Air
Change scope: custom_ops/gpu_ops/ (new CUDA kernel), model_executor/layers/moe/moe.py (integration), tests/operators/ (new unit tests)
Impact tags: [OP] [Optimization]
📝 PR Convention Check
The title tag casing is wrong ([Op] should be [OP]) and a space is missing between [Optimization] and the description; the Usage or Command section is empty; no Checklist item is checked (unit tests and accuracy data were provided, so the corresponding items should be checked).
Suggested title (copy-paste ready):
[OP][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc
Suggested PR description (copy-paste ready):
## Motivation
Fuses the four MoE-routing-stage operations cast+sigmoid+bias+grouped topk into a single CUDA kernel (`grouped_topk`), reducing memory-bandwidth consumption and kernel-launch overhead. Currently only CUDA devices (non-EP path) are supported.
## Modifications
- Add `custom_ops/gpu_ops/grouped_topk_kernels.cu`: implements `grouped_topk_fused_kernel`, which performs cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 inputs
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: adds the new `.cu` file to both compile-source lists
- `fastdeploy/model_executor/layers/moe/moe.py`: when `use_fused=True`, `get_moe_scores` takes the new `grouped_topk` path, replacing the previous `fused_cast_sigmoid_bias + noaux_tc` double call
- `tests/operators/test_grouped_topk_op.py`: adds correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2
## Usage or Command
N/A
## Accuracy Tests
Accuracy comparison of fused_cast+noaux (A) vs fused_cast_grouped_topk (C) on DeepSeek-V3 / GLM-4.5-Air: numerical differences ≤ 1.2e-7 at every token count, expert indices fully identical (✓); see the accuracy table in the PR description.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/moe.py:40 | `grouped_topk` shares one try block with `noaux_tc`/`noaux_tc_redundant`; if an older compiled artifact lacks the new op, all three imports fail together, breaking the existing non-fused path |
| 📝 PR convention | — | Title tag casing/spacing, empty Usage or Command section, unchecked Checklist |
Overall Assessment
The fused kernel implementation is complete, the warp-level bitonic sort and group-selection logic are clear, and the performance and accuracy data are thorough. The main item to address is the import-compatibility issue in moe.py, so that older compiled artifacts cannot unexpectedly break `noaux_tc` on the non-fused path.
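The operation being fused (cast, then sigmoid, then bias add, then grouped top-k routing) can be sketched as a plain NumPy reference. This is illustrative only: the function name, argument shapes, and the group-scoring rule (sum of the top-2 biased scores per group, DeepSeek-V3 style) are assumptions here, not the actual kernel's contract.

```python
import numpy as np

def grouped_topk_ref(logits, bias, n_group, topk_group, top_k,
                     routed_scaling_factor=1.0, renormalize=True):
    """Reference semantics for a fused cast+sigmoid+bias+grouped-topk op.

    Assumed shapes: logits [num_tokens, num_experts] (any float dtype),
    bias [num_experts]. Scoring each group by the sum of its top-2 biased
    scores follows the DeepSeek-V3 convention and is an assumption here.
    """
    # cast to a wide dtype, then sigmoid
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float64)))
    biased = scores + bias  # bias participates in routing only
    num_tokens, num_experts = biased.shape
    group_size = num_experts // n_group
    grouped = biased.reshape(num_tokens, n_group, group_size)
    # Score each group by the sum of its two largest biased scores.
    group_scores = np.sort(grouped, axis=-1)[:, :, -2:].sum(axis=-1)
    top_groups = np.argsort(-group_scores, axis=-1)[:, :topk_group]
    mask = np.zeros((num_tokens, n_group), dtype=bool)
    np.put_along_axis(mask, top_groups, True, axis=-1)
    # Mask out experts belonging to unselected groups.
    masked = np.where(np.repeat(mask, group_size, axis=-1), biased, -np.inf)
    topk_idx = np.argsort(-masked, axis=-1)[:, :top_k]
    # Final weights come from the un-biased sigmoid scores.
    topk_w = np.take_along_axis(scores, topk_idx, axis=-1)
    if renormalize:
        topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)
    return topk_idx, topk_w * routed_scaling_factor
```

A fused CUDA kernel computes the same result in one launch instead of materializing the intermediate cast, sigmoid, and biased-score tensors.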
```python
try:
    from fastdeploy.model_executor.ops.gpu import noaux_tc, noaux_tc_redundant
    from fastdeploy.model_executor.ops.gpu import (
        grouped_topk,
```
🟡 Suggestion: `grouped_topk` is merged into the same try block as `noaux_tc`/`noaux_tc_redundant`.
If the new op is not compiled (for example, when running against an older build artifact), the whole import block fails with an ImportError, so `noaux_tc` and `noaux_tc_redundant` go missing as well; those two ops are still used on the non-fused path, which then raises a NameError.
Suggested fix: move `grouped_topk` into its own try block, guarded by a flag:

```python
try:
    from fastdeploy.model_executor.ops.gpu import noaux_tc, noaux_tc_redundant
except Exception:
    logger.warning("import noaux_tc Failed!")

_grouped_topk_available = False
try:
    from fastdeploy.model_executor.ops.gpu import grouped_topk
    _grouped_topk_available = True
except Exception:
    pass
```

Then, in `get_moe_scores`, change `use_fused` to `use_fused and _grouped_topk_available`.
```diff
@@ -126,9 +123,20 @@ def get_moe_scores(
             renormalize,
             routed_scaling_factor,
         )
```
🟡 Suggestion: if `grouped_topk` was not imported successfully, execution reaching this branch raises `NameError: name 'grouped_topk' is not defined`.
This should be fixed together with the import change above: add a `_grouped_topk_available` guard here, or force `use_fused` to False when the import fails.
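One way to realize the guard the reviewer asks for is a small import probe plus a path selector. The helper names (`_try_import_grouped_topk`, `select_routing_path`) are illustrative sketches, not FastDeploy's actual API:

```python
def _try_import_grouped_topk():
    """Probe for the fused op; return it if the compiled extension has it.

    The module path mirrors this PR; whether the op exists depends on the
    build, so we probe defensively instead of assuming it is present.
    """
    try:
        from fastdeploy.model_executor.ops.gpu import grouped_topk
        return grouped_topk
    except Exception:
        return None

_grouped_topk = _try_import_grouped_topk()

def select_routing_path(use_fused: bool) -> str:
    """Pick a routing path; names here are illustrative, not FastDeploy's."""
    if use_fused and _grouped_topk is not None:
        return "fused"      # single grouped_topk kernel launch
    return "unfused"        # fused_cast_sigmoid_bias + noaux_tc fallback
```

With this shape, an older build artifact silently falls back to the non-fused path instead of raising a NameError at routing time.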
Motivation
Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.
Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: implements `grouped_topk_fused_kernel`, which performs cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 inputs
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: adds the new `.cu` file to both compile-source lists
- `fastdeploy/model_executor/layers/moe/moe.py`: when `use_fused=True`, `get_moe_scores` takes the new `grouped_topk` path, replacing the previous `fused_cast_sigmoid_bias + noaux_tc` double call
- `tests/operators/test_grouped_topk_op.py`: adds correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2

Usage or Command
N/A
Accuracy Tests
Accuracy comparison of fused_cast+noaux (A) vs fused_cast_grouped_topk (C)
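A minimal harness for this kind of A-vs-C comparison might look like the sketch below. The helper name and return shape are illustrative assumptions; the repository's real tests live in tests/operators/test_grouped_topk_op.py.

```python
import numpy as np

def compare_routing_outputs(ref_weights, ref_ids, fused_weights, fused_ids,
                            atol=1.2e-7):
    """Compare a reference routing path against the fused kernel's output.

    Returns (max_abs_weight_diff, ids_match, within_tolerance). The default
    atol mirrors the <= 1.2e-7 difference quoted in this PR; the helper
    itself is an illustrative sketch, not the repository's test code.
    """
    ref = np.asarray(ref_weights, dtype=np.float64)
    fused = np.asarray(fused_weights, dtype=np.float64)
    max_diff = float(np.max(np.abs(ref - fused)))
    # Expert indices must match exactly; weights only within tolerance.
    ids_match = bool(np.array_equal(ref_ids, fused_ids))
    return max_diff, ids_match, max_diff <= atol
```

Checking indices for exact equality while bounding the weight difference matches how the PR's accuracy table reports results (numeric diff per token count, ✓ for identical expert indices).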
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.