
[Cherry-Pick][Optimization] optmize tritonmoe_preprocess op #7786

Open
yongqiangma wants to merge 1 commit into PaddlePaddle:release/2.6 from yongqiangma:moe_align_26

Conversation

@yongqiangma
Collaborator

Motivation

The kernel embedded in tritonmoe_preprocess hard-codes a whitelist of num_experts values (2/8/32/64/128/160/256), so arbitrary expert counts are not supported, and performing sorting and alignment as two separate steps is inefficient. This refactor extracts the alignment logic into a standalone moe_align_kernel.cu and uses a three-way dispatch strategy to broaden applicability and improve performance.
Test platform: NVIDIA A100-SXM4-80GB, Driver Version: 535.230.02, CUDA Version: 12.9

| num_tokens | num_experts | old (ms) | new (ms) | enhance |
| ---: | ---: | ---: | ---: | ---: |
| 256 | 64 | 0.0208 | 0.0209 | -0.48% |
| 512 | 64 | 0.0212 | 0.0209 | 1.44% |
| 1024 | 64 | 0.0196 | 0.0195 | 0.51% |
| 2048 | 64 | 0.0211 | 0.0205 | 2.93% |
| 4096 | 64 | 0.0232 | 0.0227 | 2.20% |
| 8192 | 64 | 0.0281 | 0.0263 | 6.84% |
| 16384 | 64 | 0.0374 | 0.0285 | 31.23% |
| 32768 | 64 | 0.0557 | 0.0288 | 93.40% |
| 65536 | 64 | 0.0917 | 0.0291 | 215.12% |
| 163840 | 64 | 0.2026 | 0.0306 | 562.09% |
| 256 | 256 | 0.026 | 0.0209 | 24.40% |
| 512 | 256 | 0.0262 | 0.0197 | 32.99% |
| 1024 | 256 | 0.0245 | 0.0197 | 24.37% |
| 2048 | 256 | 0.0257 | 0.0207 | 24.15% |
| 4096 | 256 | 0.0284 | 0.0229 | 24.02% |
| 8192 | 256 | 0.0323 | 0.0264 | 22.35% |
| 16384 | 256 | 0.0411 | 0.041 | 0.24% |
| 32768 | 256 | 0.0594 | 0.0415 | 43.13% |
| 65536 | 256 | 0.096 | 0.0417 | 130.22% |
| 163840 | 256 | 0.1985 | 0.0433 | 358.43% |



Modifications

  • Added custom_ops/gpu_ops/moe/moe_align_kernel.cu: implements the three-way-dispatch moe_align_block_size function template and the corresponding CUDA kernels (small-batch single-block kernel, large-batch cooperative kernel, generic two-kernel path)
  • custom_ops/gpu_ops/helper.h: moved the CEILDIV utility macro up into the shared header
  • custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu: removed the old embedded kernel in favor of declaring and calling moe_align_block_size; supports arbitrary num_experts; fixed the max_num_tokens_padded computation in the small-token branch; renamed the output from expert_ids to experts_ids
  • custom_ops/setup_ops.py: added the new file to both compilation source lists
  • tests/operators/test_tritonmoe_preprocess.py: added correctness tests covering each branch scenario (empty input, single token, uniform distribution, large batch, etc.)
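The three-way dispatch above can be sketched as host-side selection logic. This is a minimal sketch: the small-batch cutoff mirrors the snippet quoted later in the review (numel < 1024 && num_experts <= 64), while the cooperative threshold and the function name are illustrative assumptions, not taken from the PR:

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch of the dispatch decision; only the small-batch
// condition is taken from the review snippet, the cooperative threshold
// below is a placeholder.
std::string choose_moe_align_path(int64_t numel, int64_t num_experts,
                                  bool cooperative_supported) {
    if (numel < 1024 && num_experts <= 64) {
        // One block does counting, prefix-sum, and scatter in shared memory.
        return "small_batch_single_block";
    }
    if (cooperative_supported && numel >= 65536) {  // placeholder threshold
        // Grid-wide sync (cudaLaunchCooperativeKernel) fuses both phases.
        return "large_batch_cooperative";
    }
    // Fallback: the original two-kernel approach (count+scan, then scatter).
    return "generic_two_kernel";
}
```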

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 12, 2026 06:19
@paddle-bot

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors and optimizes the alignment/sorting logic of tritonmoe_preprocess: it removes the old kernel's whitelist restriction on num_experts, extracts the alignment logic into a standalone CUDA implementation, and adds more comprehensive correctness tests, broadening applicability and improving performance.

Changes:

  • Added moe_align_kernel.cu, implementing a three-way dispatch strategy (small-batch single-block / large-batch cooperative / generic two-kernel) for MoE alignment and token sorting.
  • tritonmoe_preprocess.cu now calls moe_align_block_size, supports numel computation for inputs of arbitrary dimensionality, and corrects the padded-size and block-count computations in the small-token branch.
  • Expanded test_tritonmoe_preprocess.py to cover more shapes and branch scenarios; setup_ops.py adds the new file to the build; helper.h hoists the CEILDIV macro.

Note: "optmize" in the PR title should be corrected to "optimize", for consistency with the repository's title conventions and searchability.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| custom_ops/gpu_ops/moe/moe_align_kernel.cu | New MoE alignment and token-sorting CUDA kernels (multi-strategy dispatch) |
| custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | Removes the old embedded kernel in favor of calling moe_align_block_size; adjusts shape inference and output naming |
| custom_ops/gpu_ops/helper.h | Promotes CEILDIV to a shared macro for reuse across CUDA code |
| custom_ops/setup_ops.py | Adds the new moe_align_kernel.cu to the compilation source lists |
| tests/operators/test_tritonmoe_preprocess.py | New/refactored correctness tests covering more input shapes and edge cases |

Comment on lines +76 to +80
    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
      out_ptr[i] = fill_vec;
    }
Comment on lines +274 to +279
    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = bid * nthreads + tid; i < total_vecs;
         i += nblocks * nthreads) {
      out_ptr[i] = fill_vec;
    }
Comment on lines +553 to +571
  // Original 2-kernel approach (for medium inputs or cooperative fallback)
  auto align_kernel = moe_align_block_size_kernel<scalar_t>;

  const size_t scan_size = next_pow_2(num_experts);
  const size_t shared_mem_size =
      (num_experts + (num_experts + 1) + scan_size + WARP_SIZE) *
      sizeof(int32_t);
  align_kernel<<<2, threads, shared_mem_size, stream>>>(
      topk_ids.data<scalar_t>(),
      sorted_token_ids.data<int32_t>(),
      experts_ids.data<int32_t>(),
      num_tokens_post_pad.data<int32_t>(),
      num_experts,
      block_size,
      numel,
      cumsum_buffer.data<int32_t>(),
      pad_sorted_token_ids,
      scan_size,
      max_num_tokens_padded);
Comment on lines +467 to +475
  bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);

  if (small_batch_expert_mode) {
    const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
    constexpr int32_t fill_threads = 256;
    const int32_t shared_mem_size =
        ((expert_threads + 1) * num_experts + (num_experts + 1)) *
        sizeof(int32_t);

Comment on lines +267 to +275
    def setUp(self):
        if not _AVAILABLE:
            self.skipTest("CUDA or fastdeploy not available")

    def test_docstring_example(self):
        """Reproduce the example from the function docstring."""
        topk_ids = paddle.to_tensor([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], dtype="int64")
        _verify(topk_ids, block_size=4, num_experts=5, label="docstring_example")


Comment on lines +447 to +486
if not _AVAILABLE:
    print("SKIP: CUDA or fastdeploy not available.")
else:
    basic = TestTritonMoePreprocessBasic()
    basic.test_docstring_example()
    basic.test_single_token_single_expert()
    basic.test_all_tokens_same_expert()
    basic.test_uniform_1d()
    basic.test_topk_equals_num_experts()
    basic.test_num_tokens_less_than_num_experts()
    basic.test_exact_block_boundary()
    basic.test_block_size_1()

    edge = TestTritonMoePreprocessEdgeCases()
    edge.test_empty_topk_ids()
    edge.test_one_expert()
    edge.test_large_block_size()
    edge.test_int64_dtype()

    real = TestTritonMoePreprocessRealistic()
    for num_tokens, num_experts, block_size in [
        (256, 8, 16),
        (1024, 16, 16),
        (4096, 64, 16),
        (8192, 64, 32),
        (8192, 128, 64),
        (16384, 256, 128),
    ]:
        real._run_uniform_distribution(num_tokens, num_experts, block_size)
    for num_tokens, top_k, num_experts, block_size in [
        (512, 2, 8, 16),
        (1024, 4, 16, 16),
        (2048, 8, 64, 16),
    ]:
        real._run_topk_2d(num_tokens, top_k, num_experts, block_size)
    for alpha in [0.5, 1.2, 2.0]:
        real._run_zipf_distribution(alpha)
    real.test_deterministic_with_fixed_seed()

    print("\n*** All direct-run tests passed ***")
Comment on lines +48 to +64
DEVICE = "gpu"

# Print full tensors only for small cases; beyond this threshold, print only a summary
_PRINT_TENSOR_NUMEL_LIMIT = 64


def _fmt_tensor(t: paddle.Tensor, name: str) -> str:
    t_cpu = t.cpu()
    if t_cpu.numel() <= _PRINT_TENSOR_NUMEL_LIMIT:
        return f"{name}{list(t_cpu.shape)} = {t_cpu.tolist()}"
    return (
        f"{name}{list(t_cpu.shape)} | "
        f"min={int(t_cpu.min())} max={int(t_cpu.max())} "
        f"mean={float(t_cpu.cast('float32').mean()):.2f} numel={t_cpu.numel()}"
    )



@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-12 14:50:00

📋 Review Summary

PR overview: refactors the tritonmoe_preprocess operator by extracting the embedded kernel into a standalone moe_align_kernel.cu with a three-way dispatch strategy (small-batch single-block kernel, large-batch cooperative kernel, generic two-kernel path), removing the hard-coded num_experts whitelist and supporting arbitrary expert counts.
Scope of changes: custom_ops/gpu_ops/moe/, custom_ops/gpu_ops/helper.h, custom_ops/setup_ops.py, tests/operators/
Impact tag: [OP]

📝 PR Convention Check

The title has two issues: ① the Cherry-Pick format is missing the original PR number; ② "optmize" is a typo (should be "optimize"). All sections of the description template are present with complete content, the checklist is fully ticked, and the description is compliant.

Suggested title (copy directly after filling in the original PR number):

  • [Cherry-Pick][Optimization] Optimize tritonmoe_preprocess op (#<original PR number>)

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | custom_ops/gpu_ops/moe/moe_align_kernel.cu:499 | The static cache is not keyed by device ID, so the block-count estimate may be wrong on multi-GPU setups |

Overall Assessment

The refactoring approach is clear, the three-way dispatch strategy is well designed, and test coverage is thorough (empty input, boundary cases, and uniform/Zipf distributions are all covered). The measured gains for large batches are substantial (over 5× for the largest token counts in the table above). Attention is needed on the cooperative kernel's static cache in multi-GPU scenarios and on Cherry-Pick title-format compliance.


auto coop_kernel = moe_align_block_size_cooperative_kernel<scalar_t>;

static int cached_max_blocks_per_sm = 0;


🟡 Suggestion: the static variables cached_max_blocks_per_sm / cached_num_sms are not cached per device ID

In FastDeploy's TP/EP multi-GPU scenarios, different threads (workers) within the same process may be bound to different CUDA devices. Because C++ static variables are shared process-wide, once the first GPU to take this path populates the cache, every subsequent GPU uses the same SM count and per-SM block occupancy, so the cooperative kernel's block-count estimate may not match the actual device.

When the actual SM count is lower than the cached value, cudaLaunchCooperativeKernel fails because the block count exceeds the limit and the code falls back, which only costs performance; but on GPUs with more SMs (e.g., a later worker bound to a larger card), the block count is underestimated and utilization drops.

Suggested fix: key the cache by device ID, or simply query on every call (device-info queries are very cheap):

int device_id;
cudaGetDevice(&device_id);
static std::unordered_map<int, std::pair<int,int>> device_cache;
auto it = device_cache.find(device_id);
if (it == device_cache.end()) {
    int max_blocks_per_sm = 0, num_sms = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, ...);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id);
    device_cache[device_id] = {max_blocks_per_sm, num_sms};
    it = device_cache.find(device_id);
}
int max_coop_blocks = it->second.first * it->second.second;
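The per-device caching pattern suggested above can be sketched in plain C++, with the CUDA queries replaced by a stub so the logic is self-contained (names and the stubbed values are illustrative; the real code would call cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaDeviceGetAttribute):

```cpp
#include <unordered_map>
#include <utility>

// Stub standing in for the CUDA occupancy / device-attribute queries.
// Returns (max blocks per SM, SM count); values here are made up.
std::pair<int, int> query_device_limits(int device_id) {
    return {device_id == 0 ? 4 : 2, device_id == 0 ? 108 : 132};
}

// Cache keyed by device ID so each GPU gets its own block-count estimate,
// unlike a single process-wide static pair.
int max_coop_blocks_for(int device_id) {
    static std::unordered_map<int, std::pair<int, int>> device_cache;
    auto it = device_cache.find(device_id);
    if (it == device_cache.end()) {
        it = device_cache.emplace(device_id, query_device_limits(device_id)).first;
    }
    return it->second.first * it->second.second;
}
```

A real multi-threaded implementation would also guard the map with a mutex (or use a per-thread cache), since workers on different devices may race on the first insertion.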

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 14:58:10

The CI report is generated from the following code (updated every 30 minutes):


1. Task Overview

⚠️ CI has not finished yet: 1 required task failed, 5 required tasks are running, 1 required task is waiting. The Approval failure should be addressed first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 39 (0) | 39 | 29 | 3 | 5 | 2 | 0 |

2. Task Status Summary

2.1 Required tasks: 3/10 passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Logs | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 7s | PR issue: Cherry-Pick title missing the original develop PR number | Add the original develop PR number (e.g., #XXXX) to the title and request RD approval | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | - | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | - | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | - | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | - | - |
| ⏳ | Run Four Cards Tests / run_4_cards_tests | - | Running | - | - | - |
| ⏸️ | Run Stable Tests / stable_tests | - | Waiting | - | - | - |
| ✅ | The remaining 3 required tasks passed | - | - | - | - | - |

2.2 Optional tasks — 26/29 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Logs | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 22m8s | Job | - |
| ❌ | Trigger Jenkins for PR | 1m4s | Job | - |
| ✅ | The remaining 26 optional tasks passed | - | - | - |

3. Failure Details (required only)

Approval — PR issue (confidence: high)

  • Status: ❌ Failed
  • Error type: PR issue
  • Confidence: high
  • Root-cause summary: the Cherry-Pick PR title is missing the original develop-branch PR number
  • Analyzer: generic analysis (fallback)

Root-cause details:
scripts/check_approval.sh reported "There are 1 approved errors." and exited with code 6. Per the script output, a Cherry-Pick PR must: 1) have a title containing [Cherry-Pick] and the original develop PR number (e.g., #5010); and 2) be approved by a designated FastDeploy RD. The current PR title is missing the original PR number, or the required approval has not yet been granted.

Key logs:

==> PR title: [Cherry-Pick][Optimization] optmize tritonmoe_preprocess op
0. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick] and the original develop PR number (e.g., #5010).
Approval required from FastDeploy RD: qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng).
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Add the original develop-branch PR number to the PR title, e.g., [Cherry-Pick][Optimization] optmize tritonmoe_preprocess op (#XXXX) (replace XXXX with the actual PR number)
  2. Request approval from one of the FastDeploy RDs (qingqing01/dangqingqing, Jiang-Jia-Jun/jiangjiajun, heavengate/dengkaipeng)

Fix summary: add the original develop PR number (#XXXX) to the title and request RD approval

Related changes: the Approval check concerns only PR title format and approval status; it is unrelated to the code changes

Link: view logs

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@a5191f2). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7786   +/-   ##
==============================================
  Coverage               ?   71.89%           
==============================================
  Files                  ?      378           
  Lines                  ?    53933           
  Branches               ?     8435           
==============================================
  Hits                   ?    38773           
  Misses                 ?    12395           
  Partials               ?     2765           
| Flag | Coverage | Δ |
| --- | --- | --- |
| GPU | 71.89% <ø> | (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

