[Cherry-Pick][Optimization] optmize tritonmoe_preprocess op #7786
yongqiangma wants to merge 1 commit into release/2.6
Conversation

Thanks for your contribution!

Pull request overview
This PR refactors and optimizes the alignment/sorting logic of tritonmoe_preprocess: it removes the old kernel's num_experts whitelist, extracts the alignment logic into a standalone CUDA implementation, and adds more thorough correctness tests, broadening applicability and improving performance.
Changes:
- Add moe_align_kernel.cu, implementing a three-way dispatch strategy (single-block kernel for small batches / cooperative kernel for large batches / generic two-kernel path) for MoE alignment and token sorting.
- tritonmoe_preprocess.cu now calls moe_align_block_size, supports numel computation for inputs of any rank, and fixes how the padded size and block count are computed on the small-token branch.
- Expand test_tritonmoe_preprocess.py to cover more shapes and branch scenarios; setup_ops.py gains a build entry for the new file; helper.h hoists the CEILDIV macro.
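For intuition, the padding arithmetic behind the alignment logic can be sketched in plain Python. This is an illustrative model only — the function names and the worst-case sizing formula are assumptions based on the description above, not code from the PR:

```python
def ceildiv(a: int, b: int) -> int:
    """Python analogue of the CEILDIV macro hoisted into helper.h."""
    return -(-a // b)


def max_num_tokens_padded(numel: int, num_experts: int, block_size: int) -> int:
    """Assumed worst-case size of sorted_token_ids: each expert can waste
    at most block_size - 1 pad slots when its token count is rounded up
    to a multiple of block_size."""
    return numel + num_experts * (block_size - 1)


# 12 routed tokens, 5 experts, block_size 4 -> at most 12 + 5 * 3 = 27 slots.
print(max_num_tokens_padded(12, 5, 4))  # 27
```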
Note: "optmize" in the PR title should be corrected to "optimize" for title-convention compliance and searchability.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| custom_ops/gpu_ops/moe/moe_align_kernel.cu | New MoE alignment and token-sorting CUDA kernels (multi-strategy dispatch) |
| custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | Remove the old inlined kernel; call moe_align_block_size and adjust shape inference/output naming |
| custom_ops/gpu_ops/helper.h | Promote CEILDIV to a shared macro reused across CUDA code |
| custom_ops/setup_ops.py | Add the new moe_align_kernel.cu to the compilation source list |
| tests/operators/test_tritonmoe_preprocess.py | New/refactored correctness tests covering more input shapes and edge cases |
Single-block fill:

```cpp
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
  out_ptr[i] = fill_vec;
}
```

Grid-stride fill:

```cpp
int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
for (int32_t i = bid * nthreads + tid; i < total_vecs;
     i += nblocks * nthreads) {
  out_ptr[i] = fill_vec;
}
```
```cpp
// Original 2-kernel approach (for medium inputs or cooperative fallback)
auto align_kernel = moe_align_block_size_kernel<scalar_t>;

const size_t scan_size = next_pow_2(num_experts);
const size_t shared_mem_size =
    (num_experts + (num_experts + 1) + scan_size + WARP_SIZE) *
    sizeof(int32_t);
align_kernel<<<2, threads, shared_mem_size, stream>>>(
    topk_ids.data<scalar_t>(),
    sorted_token_ids.data<int32_t>(),
    experts_ids.data<int32_t>(),
    num_tokens_post_pad.data<int32_t>(),
    num_experts,
    block_size,
    numel,
    cumsum_buffer.data<int32_t>(),
    pad_sorted_token_ids,
    scan_size,
    max_num_tokens_padded);
```
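The shared-memory sizing in this snippet can be checked with a small Python sketch. The buffer breakdown in the comments is an assumption inferred from the formula's terms, not documented in the PR:

```python
def next_pow_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def shared_mem_bytes(num_experts: int, warp_size: int = 32) -> int:
    """Mirror of the shared_mem_size expression in the diff, all int32
    (4 bytes): assumed to hold per-expert counts, a (num_experts + 1)
    cumsum, a pow2-sized scan buffer, and one slot per warp."""
    scan_size = next_pow_2(num_experts)
    return (num_experts + (num_experts + 1) + scan_size + warp_size) * 4


print(shared_mem_bytes(64))  # (64 + 65 + 64 + 32) * 4 = 900
```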
```cpp
bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);

if (small_batch_expert_mode) {
  const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
  constexpr int32_t fill_threads = 256;
  const int32_t shared_mem_size =
      ((expert_threads + 1) * num_experts + (num_experts + 1)) *
      sizeof(int32_t);
```
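The three-way dispatch can be sketched in Python. The small-batch condition is the one shown in the diff; the large-batch threshold and the cooperative-support flag are hypothetical stand-ins, since the diff excerpts here do not show those checks:

```python
def choose_strategy(numel: int, num_experts: int,
                    large_batch_threshold: int = 8192,
                    cooperative_supported: bool = True) -> str:
    """Mirror of the kernel dispatch: small batches take a single-block
    kernel, large batches a cooperative launch, and everything else the
    generic two-kernel path. Only the small-batch condition is taken
    from the diff; the rest is illustrative."""
    if numel < 1024 and num_experts <= 64:
        return "small_batch_single_block"
    if numel >= large_batch_threshold and cooperative_supported:
        return "cooperative"
    return "generic_two_kernel"


print(choose_strategy(512, 8))     # small batch: few tokens, few experts
print(choose_strategy(4096, 128))  # too many experts for the small path
```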
```python
def setUp(self):
    if not _AVAILABLE:
        self.skipTest("CUDA or fastdeploy not available")

def test_docstring_example(self):
    """Reproduce the example from the function docstring."""
    topk_ids = paddle.to_tensor([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], dtype="int64")
    _verify(topk_ids, block_size=4, num_experts=5, label="docstring_example")
```
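For intuition, the docstring example can be worked through with a pure-Python reference of the assumed alignment semantics (vLLM-style moe_align_block_size). This is a model of what the op is believed to compute, not the repository's implementation:

```python
def moe_align_ref(topk_ids, block_size, num_experts):
    """Sort token slots by expert and pad each expert's count up to a
    multiple of block_size. Pad entries use numel as a sentinel id
    (an assumption of this sketch)."""
    flat = [e for row in topk_ids for e in row]
    numel = len(flat)
    counts = [0] * num_experts
    for e in flat:
        counts[e] += 1
    padded = [-(-c // block_size) * block_size for c in counts]
    num_tokens_post_pad = sum(padded)
    sorted_ids = [numel] * num_tokens_post_pad
    cursor, offsets = 0, []
    for p in padded:
        offsets.append(cursor)
        cursor += p
    fill = list(offsets)
    for slot, e in enumerate(flat):
        sorted_ids[fill[e]] = slot
        fill[e] += 1
    expert_ids = [e for e in range(num_experts)
                  for _ in range(padded[e] // block_size)]
    return sorted_ids, expert_ids, num_tokens_post_pad


# Docstring example: 4 tokens x top-3 routing, block_size=4, num_experts=5.
ids, eids, post = moe_align_ref([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], 4, 5)
print(post)  # 16: experts 1-4 each hold 3 tokens, padded to 4
```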
```python
if not _AVAILABLE:
    print("SKIP: CUDA or fastdeploy not available.")
else:
    basic = TestTritonMoePreprocessBasic()
    basic.test_docstring_example()
    basic.test_single_token_single_expert()
    basic.test_all_tokens_same_expert()
    basic.test_uniform_1d()
    basic.test_topk_equals_num_experts()
    basic.test_num_tokens_less_than_num_experts()
    basic.test_exact_block_boundary()
    basic.test_block_size_1()

    edge = TestTritonMoePreprocessEdgeCases()
    edge.test_empty_topk_ids()
    edge.test_one_expert()
    edge.test_large_block_size()
    edge.test_int64_dtype()

    real = TestTritonMoePreprocessRealistic()
    for num_tokens, num_experts, block_size in [
        (256, 8, 16),
        (1024, 16, 16),
        (4096, 64, 16),
        (8192, 64, 32),
        (8192, 128, 64),
        (16384, 256, 128),
    ]:
        real._run_uniform_distribution(num_tokens, num_experts, block_size)
    for num_tokens, top_k, num_experts, block_size in [
        (512, 2, 8, 16),
        (1024, 4, 16, 16),
        (2048, 8, 64, 16),
    ]:
        real._run_topk_2d(num_tokens, top_k, num_experts, block_size)
    for alpha in [0.5, 1.2, 2.0]:
        real._run_zipf_distribution(alpha)
    real.test_deterministic_with_fixed_seed()

    print("\n*** All direct-run tests passed ***")
```
```python
DEVICE = "gpu"

# Print full tensors only for small cases; above this threshold, print a
# statistical summary instead.
_PRINT_TENSOR_NUMEL_LIMIT = 64


def _fmt_tensor(t: paddle.Tensor, name: str) -> str:
    t_cpu = t.cpu()
    if t_cpu.numel() <= _PRINT_TENSOR_NUMEL_LIMIT:
        return f"{name}{list(t_cpu.shape)} = {t_cpu.tolist()}"
    return (
        f"{name}{list(t_cpu.shape)} | "
        f"min={int(t_cpu.min())} max={int(t_cpu.max())} "
        f"mean={float(t_cpu.cast('float32').mean()):.2f} numel={t_cpu.numel()}"
    )
```
PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 14:50:00
📋 Review summary
PR overview: refactors the tritonmoe_preprocess operator, extracting the inlined kernel into a standalone moe_align_kernel.cu with a three-way dispatch strategy (single-block kernel for small batches, cooperative kernel for large batches, generic two-kernel path), removing the hard-coded num_experts whitelist, and supporting arbitrary expert counts.
Scope of changes: custom_ops/gpu_ops/moe/, custom_ops/gpu_ops/helper.h, custom_ops/setup_ops.py, tests/operators/
Impact tag: [OP]
📝 PR convention check
The title has two issues: ① the Cherry-Pick format is missing the original PR number (#&lt;original PR number&gt;); ② "optmize" is a typo (should be "optimize"). All sections of the description template are present and complete, the checklist is fully ticked, and the description is compliant.
Suggested title (copy directly after filling in the original PR number):
[Cherry-Pick][Optimization] Optimize tritonmoe_preprocess op (#&lt;original PR number&gt;)
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/gpu_ops/moe/moe_align_kernel.cu:499 | Static cache is not keyed by device ID; the block-count estimate may be wrong on multi-GPU setups |
Overall assessment
The refactor is clearly structured, the three-way dispatch strategy is sound, and test coverage is thorough (empty input, boundaries, and uniform/Zipf distributions are all covered). The large-batch performance gains are substantial (>100x for large token counts). Watch the cooperative kernel's static-cache issue on multi-GPU setups and the Cherry-Pick title-format compliance.
```cpp
auto coop_kernel = moe_align_block_size_cooperative_kernel<scalar_t>;

static int cached_max_blocks_per_sm = 0;
```
🟡 Suggestion: the static variables cached_max_blocks_per_sm / cached_num_sms are not cached per device ID
In FastDeploy's TP/EP multi-GPU scenarios, different threads (workers) in the same process may bind to different CUDA devices. Since C++ static variables are shared process-wide, once the first GPU to take this path writes the cache, every subsequent GPU uses the same SM count and per-SM block occupancy, so the cooperative kernel's block-count estimate may not match the actual device.
If the actual SM count is lower than the cached value, cudaLaunchCooperativeKernel fails because the block count exceeds the limit and falls back, costing only performance; but on GPUs with more SMs (e.g. a later worker bound to a larger card), the block count is underestimated and utilization drops.
Suggested fix: key the cache by device ID, or simply query on every call (device-info queries are very cheap):
```cpp
int device_id;
cudaGetDevice(&device_id);
static std::unordered_map<int, std::pair<int, int>> device_cache;
auto it = device_cache.find(device_id);
if (it == device_cache.end()) {
  int max_blocks_per_sm = 0, num_sms = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, ...);
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id);
  device_cache[device_id] = {max_blocks_per_sm, num_sms};
  it = device_cache.find(device_id);
}
int max_coop_blocks = it->second.first * it->second.second;
```
CI report generated from the code below (refreshed every 30 minutes):
1. Task overview
2. Task status summary
 2.1 Required tasks: 3/10 passed
 2.2 Optional tasks: 26/29 passed
3. Failure details (required only): Approval — PR issue (confidence: high)
Suggested fix: add the original develop PR number #XXXX to the title and request RD approval. Related changes: the Approval check concerns only PR title format and approval status, not the code changes. Link: view logs
Codecov Report
✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff            @@
##           release/2.6    #7786   +/-   ##
==============================================
  Coverage             ?   71.89%
==============================================
  Files                ?      378
  Lines                ?    53933
  Branches             ?     8435
==============================================
  Hits                 ?    38773
  Misses               ?    12395
  Partials             ?     2765
```

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Motivation
The original tritonmoe_preprocess inlined kernel hard-coded a num_experts whitelist (2/8/32/64/128/160/256), could not handle arbitrary expert counts, and performed sorting and alignment as two separate steps, which was inefficient. This refactor extracts the alignment logic into a standalone moe_align_kernel.cu and uses a three-way dispatch strategy to broaden applicability and improve performance.
Test platform: NVIDIA A100-SXM4-80GB, Driver Version: 535.230.02, CUDA Version: 12.9
Modifications
- custom_ops/gpu_ops/moe/moe_align_kernel.cu: implement the three-way-dispatch moe_align_block_size function template and its CUDA kernels (single-block kernel for small batches, cooperative kernel for large batches, generic two-kernel path)
- custom_ops/gpu_ops/helper.h: hoist the CEILDIV macro utility into the shared header
- custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu: delete the old inlined kernel and declare/call moe_align_block_size instead; support arbitrary num_experts; fix the max_num_tokens_padded computation on the small-token branch; rename the output from expert_ids to experts_ids
- custom_ops/setup_ops.py: add the new file to both compilation source lists
- tests/operators/test_tritonmoe_preprocess.py: add correctness tests covering each branch scenario (empty input, single token, uniform distribution, large batches, etc.)
Usage or Command
N/A
Accuracy Tests
N/A
Checklist
- Add one or more PR tags from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR to a release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.