[Op][Optimization]Kernel fusion: cast+sigmoid+bias+noauxtc #7777

ShaneGZhu wants to merge 5 commits into
Conversation
Thanks for your contribution!
CI report (generated from the code below; refreshed every 30 minutes):

1. Task overview: CI is still in progress; 1 Required task failed
2. Task status summary
   - 2.1 Required tasks: 5/10 passed
   - 2.2 Optional tasks: 24/28 passed
3. Failure details (Required only)
   - Approval: PR approval-process issue (confidence: high)
   - Root cause: both required approvals are currently missing
   - Suggested fix: ask the designated FastDeploy RD and the designated PaddlePaddle RD to each approve once. Link: view logs
```python
from fastdeploy.model_executor.layers.moe.fused_cast_sigmoid_bias import (
    fused_cast_sigmoid_bias,
)
```
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##           develop    #7777  +/-  ##
=======================================
  Coverage         ?   63.77%
=======================================
  Files            ?      458
  Lines            ?    63513
  Branches         ?     9728
=======================================
  Hits             ?    40507
  Misses           ?    20235
  Partials         ?     2771
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 16:48:06
📋 Review Summary
PR overview: fuses the cast+sigmoid+bias+noauxtc operations of the MoE routing stage into a single CUDA kernel; only the non-EP CUDA path is supported; a measured TPS improvement of roughly 4% on GLM-4.5-Air
Change scope: custom_ops/gpu_ops/ (new CUDA kernel), model_executor/layers/moe/moe.py (integration), tests/operators/ (new unit tests)
Impact tags: [OP] [Optimization]
📝 PR Convention Check
The title tag casing is wrong ([Op] should be [OP]) and a space is missing between [Optimization] and the description; the Usage or Command section is empty; no Checklist item is checked (unit tests and accuracy data were provided, so the corresponding items should be checked).
Suggested title (copy-paste ready):
[OP][Optimization] Kernel fusion: cast+sigmoid+bias+noauxtc
Suggested PR description (copy-paste ready):
## Motivation
Fuses the four MoE-routing-stage operations cast+sigmoid+bias+grouped topk into a single CUDA kernel (`grouped_topk`), reducing memory-bandwidth consumption and kernel-launch overhead. Currently only CUDA devices (non-EP path) are supported.
## Modifications
- Add `custom_ops/gpu_ops/grouped_topk_kernels.cu`: implements `grouped_topk_fused_kernel`, which performs cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 inputs
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: adds the new `.cu` file to both compile-source lists
- `fastdeploy/model_executor/layers/moe/moe.py`: when `use_fused=True`, `get_moe_scores` takes the new `grouped_topk` path, replacing the previous `fused_cast_sigmoid_bias + noaux_tc` double call
- `tests/operators/test_grouped_topk_op.py`: adds correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2
## Usage or Command
N/A
## Accuracy Tests
Accuracy comparison of fused_cast+noaux (A) vs fused_cast_grouped_topk (C) on DeepSeek-V3 / GLM-4.5-Air: numerical differences ≤ 1.2e-7 at every token count, expert indices fully identical (✓); see the accuracy table in the PR description.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/moe/moe.py:40 | `grouped_topk` shares one try block with `noaux_tc`/`noaux_tc_redundant`; if an older compiled artifact lacks the new op, all three imports fail together, breaking the existing non-fused path |
| 📝 PR convention | — | Title tag casing/spacing, empty Usage or Command section, unchecked Checklist |
Overall Assessment
The fused kernel implementation is complete, the warp-level bitonic sort and group-selection logic are clear, and the performance and accuracy data are thorough. The main item to address is the import-compatibility issue in moe.py, so that older compiled artifacts cannot unexpectedly break `noaux_tc` on the non-fused path.
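The operation being fused (cast, then sigmoid, then bias add, then grouped top-k routing) can be sketched as a plain NumPy reference. This is illustrative only: the function name, argument shapes, and the group-scoring rule (sum of the top-2 biased scores per group, DeepSeek-V3 style) are assumptions here, not the actual kernel's contract.

```python
import numpy as np

def grouped_topk_ref(logits, bias, n_group, topk_group, top_k,
                     routed_scaling_factor=1.0, renormalize=True):
    """Reference semantics for a fused cast+sigmoid+bias+grouped-topk op.

    Assumed shapes: logits [num_tokens, num_experts] (any float dtype),
    bias [num_experts]. Scoring each group by the sum of its top-2 biased
    scores follows the DeepSeek-V3 convention and is an assumption here.
    """
    # cast to a wide dtype, then sigmoid
    scores = 1.0 / (1.0 + np.exp(-logits.astype(np.float64)))
    biased = scores + bias  # bias participates in routing only
    num_tokens, num_experts = biased.shape
    group_size = num_experts // n_group
    grouped = biased.reshape(num_tokens, n_group, group_size)
    # Score each group by the sum of its two largest biased scores.
    group_scores = np.sort(grouped, axis=-1)[:, :, -2:].sum(axis=-1)
    top_groups = np.argsort(-group_scores, axis=-1)[:, :topk_group]
    mask = np.zeros((num_tokens, n_group), dtype=bool)
    np.put_along_axis(mask, top_groups, True, axis=-1)
    # Mask out experts belonging to unselected groups.
    masked = np.where(np.repeat(mask, group_size, axis=-1), biased, -np.inf)
    topk_idx = np.argsort(-masked, axis=-1)[:, :top_k]
    # Final weights come from the un-biased sigmoid scores.
    topk_w = np.take_along_axis(scores, topk_idx, axis=-1)
    if renormalize:
        topk_w = topk_w / topk_w.sum(axis=-1, keepdims=True)
    return topk_idx, topk_w * routed_scaling_factor
```

A fused CUDA kernel computes the same result in one launch instead of materializing the intermediate cast, sigmoid, and biased-score tensors.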
```python
try:
    from fastdeploy.model_executor.ops.gpu import noaux_tc, noaux_tc_redundant
    from fastdeploy.model_executor.ops.gpu import (
        grouped_topk,
```
🟡 Suggestion: `grouped_topk` is merged into the same try block as `noaux_tc`/`noaux_tc_redundant`.
If the new op is not compiled (for example, when running against an older build artifact), the whole import block fails with an ImportError, so `noaux_tc` and `noaux_tc_redundant` go missing as well; those two ops are still used on the non-fused path, which then raises a NameError.
Suggested fix: move `grouped_topk` into its own try block, guarded by a flag:

```python
try:
    from fastdeploy.model_executor.ops.gpu import noaux_tc, noaux_tc_redundant
except Exception:
    logger.warning("import noaux_tc Failed!")

_grouped_topk_available = False
try:
    from fastdeploy.model_executor.ops.gpu import grouped_topk
    _grouped_topk_available = True
except Exception:
    pass
```

Then, in `get_moe_scores`, change `use_fused` to `use_fused and _grouped_topk_available`.
```diff
@@ -126,9 +123,20 @@ def get_moe_scores(
             renormalize,
             routed_scaling_factor,
         )
```
🟡 Suggestion: if `grouped_topk` was not imported successfully, execution reaching this branch raises `NameError: name 'grouped_topk' is not defined`.
This should be fixed together with the import change above: add a `_grouped_topk_available` guard here, or force `use_fused` to False when the import fails.
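One way to realize the guard the reviewer asks for is a small import probe plus a path selector. The helper names (`_try_import_grouped_topk`, `select_routing_path`) are illustrative sketches, not FastDeploy's actual API:

```python
def _try_import_grouped_topk():
    """Probe for the fused op; return it if the compiled extension has it.

    The module path mirrors this PR; whether the op exists depends on the
    build, so we probe defensively instead of assuming it is present.
    """
    try:
        from fastdeploy.model_executor.ops.gpu import grouped_topk
        return grouped_topk
    except Exception:
        return None

_grouped_topk = _try_import_grouped_topk()

def select_routing_path(use_fused: bool) -> str:
    """Pick a routing path; names here are illustrative, not FastDeploy's."""
    if use_fused and _grouped_topk is not None:
        return "fused"      # single grouped_topk kernel launch
    return "unfused"        # fused_cast_sigmoid_bias + noaux_tc fallback
```

With this shape, an older build artifact silently falls back to the non-fused path instead of raising a NameError at routing time.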
Motivation
Kernel fusion: cast + sigmoid + bias + noauxtc. Currently, this is supported only on CUDA devices.
Modifications
- `custom_ops/gpu_ops/grouped_topk_kernels.cu`: implements `grouped_topk_fused_kernel`, which performs cast, sigmoid, bias addition, and grouped top-k routing in a single kernel launch; supports float32/bfloat16/float16 inputs
- `custom_ops/gpu_ops/cpp_extensions.cc`: adds the `grouped_topk` function declaration and pybind11 binding
- `custom_ops/setup_ops.py`: adds the new `.cu` file to both compile-source lists
- `fastdeploy/model_executor/layers/moe/moe.py`: when `use_fused=True`, `get_moe_scores` takes the new `grouped_topk` path, replacing the previous `fused_cast_sigmoid_bias + noaux_tc` double call
- `tests/operators/test_grouped_topk_op.py`: adds correctness and numerical-alignment tests covering four model configurations: DeepSeek-V3, GLM-4.5-Air, Qwen3-30B-A3B, and Kimi-K2

Usage or Command
N/A
Accuracy Tests
Accuracy comparison of fused_cast+noaux (A) vs fused_cast_grouped_topk (C)
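A minimal harness for this kind of A-vs-C comparison might look like the sketch below. The helper name and return shape are illustrative assumptions; the repository's real tests live in tests/operators/test_grouped_topk_op.py.

```python
import numpy as np

def compare_routing_outputs(ref_weights, ref_ids, fused_weights, fused_ids,
                            atol=1.2e-7):
    """Compare a reference routing path against the fused kernel's output.

    Returns (max_abs_weight_diff, ids_match, within_tolerance). The default
    atol mirrors the <= 1.2e-7 difference quoted in this PR; the helper
    itself is an illustrative sketch, not the repository's test code.
    """
    ref = np.asarray(ref_weights, dtype=np.float64)
    fused = np.asarray(fused_weights, dtype=np.float64)
    max_diff = float(np.max(np.abs(ref - fused)))
    # Expert indices must match exactly; weights only within tolerance.
    ids_match = bool(np.array_equal(ref_ids, fused_ids))
    return max_diff, ids_match, max_diff <= atol
```

Checking indices for exact equality while bounding the weight difference matches how the PR's accuracy table reports results (numeric diff per token count, ✓ for identical expert indices).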
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.