
[Cherry-Pick][Optimization] optmize tritonmoe_preprocess op #7786

Open
yongqiangma wants to merge 1 commit into PaddlePaddle:release/2.6 from yongqiangma:moe_align_26

Conversation

@yongqiangma
Collaborator

Motivation

The kernel embedded in tritonmoe_preprocess hard-codes a whitelist of num_experts values (2/8/32/64/128/160/256), so arbitrary expert counts are not supported, and performing sorting and alignment as two separate steps is inefficient. This refactor extracts the alignment logic into a standalone moe_align_kernel.cu and uses a three-way dispatch strategy to broaden applicability and improve performance.
Test platform: NVIDIA A100-SXM4-80GB, Driver Version: 535.230.02, CUDA Version: 12.9

| num_tokens | num_experts | old (ms) | new (ms) | enhance |
| ---: | ---: | ---: | ---: | ---: |
| 256 | 64 | 0.0208 | 0.0209 | -0.48% |
| 512 | 64 | 0.0212 | 0.0209 | 1.44% |
| 1024 | 64 | 0.0196 | 0.0195 | 0.51% |
| 2048 | 64 | 0.0211 | 0.0205 | 2.93% |
| 4096 | 64 | 0.0232 | 0.0227 | 2.20% |
| 8192 | 64 | 0.0281 | 0.0263 | 6.84% |
| 16384 | 64 | 0.0374 | 0.0285 | 31.23% |
| 32768 | 64 | 0.0557 | 0.0288 | 93.40% |
| 65536 | 64 | 0.0917 | 0.0291 | 215.12% |
| 163840 | 64 | 0.2026 | 0.0306 | 562.09% |
| 256 | 256 | 0.026 | 0.0209 | 24.40% |
| 512 | 256 | 0.0262 | 0.0197 | 32.99% |
| 1024 | 256 | 0.0245 | 0.0197 | 24.37% |
| 2048 | 256 | 0.0257 | 0.0207 | 24.15% |
| 4096 | 256 | 0.0284 | 0.0229 | 24.02% |
| 8192 | 256 | 0.0323 | 0.0264 | 22.35% |
| 16384 | 256 | 0.0411 | 0.041 | 0.24% |
| 32768 | 256 | 0.0594 | 0.0415 | 43.13% |
| 65536 | 256 | 0.096 | 0.0417 | 130.22% |
| 163840 | 256 | 0.1985 | 0.0433 | 358.43% |



Modifications

  • Added custom_ops/gpu_ops/moe/moe_align_kernel.cu: implements the three-way-dispatch moe_align_block_size function template and the corresponding CUDA kernels (small-batch single-block kernel, large-batch cooperative kernel, generic two-kernel path)
  • custom_ops/gpu_ops/helper.h: moved the CEILDIV utility macro up into the shared header
  • custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu: removed the old embedded kernel in favor of declaring and calling moe_align_block_size; supports arbitrary num_experts; fixed the max_num_tokens_padded computation in the small-token branch; renamed the output from expert_ids to experts_ids
  • custom_ops/setup_ops.py: added the new file to both compilation source lists
  • tests/operators/test_tritonmoe_preprocess.py: added correctness tests covering each branch scenario (empty input, single token, uniform distribution, large batch, etc.)
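The three-way dispatch above can be sketched as host-side selection logic. This is a minimal sketch: the small-batch cutoff mirrors the snippet quoted later in the review (numel < 1024 && num_experts <= 64), while the cooperative threshold and the function name are illustrative assumptions, not taken from the PR:

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch of the dispatch decision; only the small-batch
// condition is taken from the review snippet, the cooperative threshold
// below is a placeholder.
std::string choose_moe_align_path(int64_t numel, int64_t num_experts,
                                  bool cooperative_supported) {
    if (numel < 1024 && num_experts <= 64) {
        // One block does counting, prefix-sum, and scatter in shared memory.
        return "small_batch_single_block";
    }
    if (cooperative_supported && numel >= 65536) {  // placeholder threshold
        // Grid-wide sync (cudaLaunchCooperativeKernel) fuses both phases.
        return "large_batch_cooperative";
    }
    // Fallback: the original two-kernel approach (count+scan, then scatter).
    return "generic_two_kernel";
}
```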

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings May 12, 2026 06:19
@paddle-bot

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors and optimizes the alignment/sorting logic of tritonmoe_preprocess: it removes the old kernel's whitelist restriction on num_experts, extracts the alignment logic into a standalone CUDA implementation, and adds more comprehensive correctness tests, broadening applicability and improving performance.

Changes:

  • Added moe_align_kernel.cu, implementing a three-way dispatch strategy (small-batch single-block / large-batch cooperative / generic two-kernel) for MoE alignment and token sorting.
  • tritonmoe_preprocess.cu now calls moe_align_block_size, supports numel computation for inputs of arbitrary dimensionality, and corrects the padded-size and block-count computations in the small-token branch.
  • Expanded test_tritonmoe_preprocess.py to cover more shapes and branch scenarios; setup_ops.py adds the new file to the build; helper.h hoists the CEILDIV macro.

Note: "optmize" in the PR title should be corrected to "optimize", for consistency with the repository's title conventions and searchability.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| custom_ops/gpu_ops/moe/moe_align_kernel.cu | New MoE alignment and token-sorting CUDA kernels (multi-strategy dispatch) |
| custom_ops/gpu_ops/moe/tritonmoe_preprocess.cu | Removes the old embedded kernel in favor of calling moe_align_block_size; adjusts shape inference and output naming |
| custom_ops/gpu_ops/helper.h | Promotes CEILDIV to a shared macro for reuse across CUDA code |
| custom_ops/setup_ops.py | Adds the new moe_align_kernel.cu to the compilation source lists |
| tests/operators/test_tritonmoe_preprocess.py | New/refactored correctness tests covering more input shapes and edge cases |

Comment on lines +76 to +80
    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = threadIdx.x; i < total_vecs; i += blockDim.x) {
      out_ptr[i] = fill_vec;
    }
Comment on lines +274 to +279
    int32_t total_vecs = (max_num_tokens_padded + VEC_SIZE - 1) / VEC_SIZE;
    Vec* out_ptr = reinterpret_cast<Vec*>(sorted_token_ids);
    for (int32_t i = bid * nthreads + tid; i < total_vecs;
         i += nblocks * nthreads) {
      out_ptr[i] = fill_vec;
    }
Comment on lines +553 to +571
  // Original 2-kernel approach (for medium inputs or cooperative fallback)
  auto align_kernel = moe_align_block_size_kernel<scalar_t>;

  const size_t scan_size = next_pow_2(num_experts);
  const size_t shared_mem_size =
      (num_experts + (num_experts + 1) + scan_size + WARP_SIZE) *
      sizeof(int32_t);
  align_kernel<<<2, threads, shared_mem_size, stream>>>(
      topk_ids.data<scalar_t>(),
      sorted_token_ids.data<int32_t>(),
      experts_ids.data<int32_t>(),
      num_tokens_post_pad.data<int32_t>(),
      num_experts,
      block_size,
      numel,
      cumsum_buffer.data<int32_t>(),
      pad_sorted_token_ids,
      scan_size,
      max_num_tokens_padded);
Comment on lines +467 to +475
  bool small_batch_expert_mode = (numel < 1024) && (num_experts <= 64);

  if (small_batch_expert_mode) {
    const int32_t expert_threads = max((int32_t)num_experts, WARP_SIZE);
    constexpr int32_t fill_threads = 256;
    const int32_t shared_mem_size =
        ((expert_threads + 1) * num_experts + (num_experts + 1)) *
        sizeof(int32_t);

Comment on lines +267 to +275
    def setUp(self):
        if not _AVAILABLE:
            self.skipTest("CUDA or fastdeploy not available")

    def test_docstring_example(self):
        """Reproduce the example from the function docstring."""
        topk_ids = paddle.to_tensor([[2, 3, 4], [1, 2, 4], [1, 3, 4], [1, 2, 3]], dtype="int64")
        _verify(topk_ids, block_size=4, num_experts=5, label="docstring_example")


Comment on lines +447 to +486
if not _AVAILABLE:
    print("SKIP: CUDA or fastdeploy not available.")
else:
    basic = TestTritonMoePreprocessBasic()
    basic.test_docstring_example()
    basic.test_single_token_single_expert()
    basic.test_all_tokens_same_expert()
    basic.test_uniform_1d()
    basic.test_topk_equals_num_experts()
    basic.test_num_tokens_less_than_num_experts()
    basic.test_exact_block_boundary()
    basic.test_block_size_1()

    edge = TestTritonMoePreprocessEdgeCases()
    edge.test_empty_topk_ids()
    edge.test_one_expert()
    edge.test_large_block_size()
    edge.test_int64_dtype()

    real = TestTritonMoePreprocessRealistic()
    for num_tokens, num_experts, block_size in [
        (256, 8, 16),
        (1024, 16, 16),
        (4096, 64, 16),
        (8192, 64, 32),
        (8192, 128, 64),
        (16384, 256, 128),
    ]:
        real._run_uniform_distribution(num_tokens, num_experts, block_size)
    for num_tokens, top_k, num_experts, block_size in [
        (512, 2, 8, 16),
        (1024, 4, 16, 16),
        (2048, 8, 64, 16),
    ]:
        real._run_topk_2d(num_tokens, top_k, num_experts, block_size)
    for alpha in [0.5, 1.2, 2.0]:
        real._run_zipf_distribution(alpha)
    real.test_deterministic_with_fixed_seed()

    print("\n*** All direct-run tests passed ***")
Comment on lines +48 to +64
DEVICE = "gpu"

# Print full tensors only for small cases; beyond this threshold, print only a summary
_PRINT_TENSOR_NUMEL_LIMIT = 64


def _fmt_tensor(t: paddle.Tensor, name: str) -> str:
    t_cpu = t.cpu()
    if t_cpu.numel() <= _PRINT_TENSOR_NUMEL_LIMIT:
        return f"{name}{list(t_cpu.shape)} = {t_cpu.tolist()}"
    return (
        f"{name}{list(t_cpu.shape)} | "
        f"min={int(t_cpu.min())} max={int(t_cpu.max())} "
        f"mean={float(t_cpu.cast('float32').mean()):.2f} numel={t_cpu.numel()}"
    )



@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-12 14:50:00

📋 Review Summary

PR overview: refactors the tritonmoe_preprocess operator by extracting the embedded kernel into a standalone moe_align_kernel.cu with a three-way dispatch strategy (small-batch single-block kernel, large-batch cooperative kernel, generic two-kernel path), removing the hard-coded num_experts whitelist and supporting arbitrary expert counts.
Scope of changes: custom_ops/gpu_ops/moe/, custom_ops/gpu_ops/helper.h, custom_ops/setup_ops.py, tests/operators/
Impact tag: [OP]

📝 PR Convention Check

The title has two issues: ① the Cherry-Pick format is missing the original PR number; ② "optmize" is a typo (should be "optimize"). All sections of the description template are present with complete content, the checklist is fully ticked, and the description is compliant.

Suggested title (copy directly after filling in the original PR number):

  • [Cherry-Pick][Optimization] Optimize tritonmoe_preprocess op (#<original PR number>)

Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | custom_ops/gpu_ops/moe/moe_align_kernel.cu:499 | The static cache is not keyed by device ID, so the block-count estimate may be wrong on multi-GPU setups |

Overall Assessment

The refactoring approach is clear, the three-way dispatch strategy is well designed, and test coverage is thorough (empty input, boundary cases, and uniform/Zipf distributions are all covered). The measured gains for large batches are substantial (over 5× for the largest token counts in the table above). Attention is needed on the cooperative kernel's static cache in multi-GPU scenarios and on Cherry-Pick title-format compliance.


auto coop_kernel = moe_align_block_size_cooperative_kernel<scalar_t>;

static int cached_max_blocks_per_sm = 0;


🟡 Suggestion: the static variables cached_max_blocks_per_sm / cached_num_sms are not cached per device ID

In FastDeploy's TP/EP multi-GPU scenarios, different threads (workers) within the same process may be bound to different CUDA devices. Because C++ static variables are shared process-wide, once the first GPU to take this path populates the cache, every subsequent GPU uses the same SM count and per-SM block occupancy, so the cooperative kernel's block-count estimate may not match the actual device.

When the actual SM count is lower than the cached value, cudaLaunchCooperativeKernel fails because the block count exceeds the limit and the code falls back, which only costs performance; but on GPUs with more SMs (e.g., a later worker bound to a larger card), the block count is underestimated and utilization drops.

Suggested fix: key the cache by device ID, or simply query on every call (device-info queries are very cheap):

int device_id;
cudaGetDevice(&device_id);
static std::unordered_map<int, std::pair<int,int>> device_cache;
auto it = device_cache.find(device_id);
if (it == device_cache.end()) {
    int max_blocks_per_sm = 0, num_sms = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, ...);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id);
    device_cache[device_id] = {max_blocks_per_sm, num_sms};
    it = device_cache.find(device_id);
}
int max_coop_blocks = it->second.first * it->second.second;
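The per-device caching pattern suggested above can be sketched in plain C++, with the CUDA queries replaced by a stub so the logic is self-contained (names and the stubbed values are illustrative; the real code would call cudaOccupancyMaxActiveBlocksPerMultiprocessor and cudaDeviceGetAttribute):

```cpp
#include <unordered_map>
#include <utility>

// Stub standing in for the CUDA occupancy / device-attribute queries.
// Returns (max blocks per SM, SM count); values here are made up.
std::pair<int, int> query_device_limits(int device_id) {
    return {device_id == 0 ? 4 : 2, device_id == 0 ? 108 : 132};
}

// Cache keyed by device ID so each GPU gets its own block-count estimate,
// unlike a single process-wide static pair.
int max_coop_blocks_for(int device_id) {
    static std::unordered_map<int, std::pair<int, int>> device_cache;
    auto it = device_cache.find(device_id);
    if (it == device_cache.end()) {
        it = device_cache.emplace(device_id, query_device_limits(device_id)).first;
    }
    return it->second.first * it->second.second;
}
```

A real multi-threaded implementation would also guard the map with a mutex (or use a per-thread cache), since workers on different devices may race on the first insertion.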

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 14:58:10

The CI report is generated from the following code (updated every 30 minutes):


1. Task Overview

⚠️ CI has not finished yet: 1 required task failed, 5 required tasks are running, 1 required task is waiting. The Approval failure should be addressed first.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 39 (0) | 39 | 29 | 3 | 5 | 2 | 0 |

2. Task Status Summary

2.1 Required tasks: 3/10 passed

Required tasks block merging; failures must be addressed first.

| Status | Task | Duration | Root cause | Suggested fix | Logs | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 7s | PR issue: Cherry-Pick title missing the original develop PR number | Add the original develop PR number (e.g., #XXXX) to the title and request RD approval | Job | - |
| ⏳ | xpu_4cards_case_test / run_xpu_4cards_cases | - | Running | - | - | - |
| ⏳ | Extracted partial CE model tasks to run in CI. / run_ce_cases | - | Running | - | - | - |
| ⏳ | Run Base Tests / base_tests | - | Running | - | - | - |
| ⏳ | Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | - | Running | - | - | - |
| ⏳ | Run Four Cards Tests / run_4_cards_tests | - | Running | - | - | - |
| ⏸️ | Run Stable Tests / stable_tests | - | Waiting | - | - | - |
| ✅ | The remaining 3 required tasks passed | - | - | - | - | - |

2.2 Optional tasks — 26/29 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Logs | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | Run iluvatar Tests / run_iluvatar_cases | 22m8s | Job | - |
| ❌ | Trigger Jenkins for PR | 1m4s | Job | - |
| ✅ | The remaining 26 optional tasks passed | - | - | - |

3. Failure Details (required only)

Approval — PR issue (confidence: high)

  • Status: ❌ Failed
  • Error type: PR issue
  • Confidence: high
  • Root-cause summary: the Cherry-Pick PR title is missing the original develop-branch PR number
  • Analyzer: generic analysis (fallback)

Root-cause details:
scripts/check_approval.sh reported "There are 1 approved errors." and exited with code 6. Per the script output, a Cherry-Pick PR must: 1) have a title containing [Cherry-Pick] and the original develop PR number (e.g., #5010); and 2) be approved by a designated FastDeploy RD. The current PR title is missing the original PR number, or the required approval has not yet been granted.

Key logs:

==> PR title: [Cherry-Pick][Optimization] optmize tritonmoe_preprocess op
0. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick] and the original develop PR number (e.g., #5010).
Approval required from FastDeploy RD: qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng).
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

  1. Add the original develop-branch PR number to the PR title, e.g., [Cherry-Pick][Optimization] optmize tritonmoe_preprocess op (#XXXX) (replace XXXX with the actual PR number)
  2. Request approval from one of the FastDeploy RDs (qingqing01/dangqingqing, Jiang-Jia-Jun/jiangjiajun, heavengate/dengkaipeng)

Fix summary: add the original develop PR number (#XXXX) to the title and request RD approval

Related changes: the Approval check concerns only PR title format and approval status; it is unrelated to the code changes

Link: view logs

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@a5191f2). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7786   +/-   ##
==============================================
  Coverage               ?   71.89%           
==============================================
  Files                  ?      378           
  Lines                  ?    53933           
  Branches               ?     8435           
==============================================
  Hits                   ?    38773           
  Misses                 ?    12395           
  Partials               ?     2765           
| Flag | Coverage | Δ |
| --- | --- | --- |
| GPU | 71.89% <ø> | (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

