[OPTIMIZATION] save some compute in dsv3 #7785
Conversation
Thanks for your contribution!
CI report generated from the following code (refreshed every 30 minutes):
1 Task overview: all Required tasks passed ✅, merge recommended; 1 optional task failed (does not block merging), 2 tasks running, 2 tasks pending.
2 Task status summary
2.1 Required tasks: 2/2 passed
2.2 Optional tasks: 8/13 passed
3 Failure details (Required only): no failed Required tasks.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##           develop    #7785   +/-   ##
==========================================
  Coverage         ?   63.98%
==========================================
  Files            ?      461
  Lines            ?    64145
  Branches         ?     9826
==========================================
  Hits             ?    41041
  Misses           ?    20279
  Partials         ?     2825

Flags with carried forward coverage won't be shown.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 17:52:26
📋 Review Summary
PR overview: moves `extract_decoder_token_from_q` in the DSv3 MLA decode path ahead of the `kv_b_proj_bmm` GEMM, saving wasted computation on encoder tokens in mixed batches.
Scope of changes: model_executor/models/deepseek_v3.py, layers/attention/mla_attention_backend.py, tests/operators/
Impact tags: [Models] [OP] [Optimization]
📝 PR Convention Check
The title uses [OPTIMIZATION] (all caps), which does not match the official tag [Optimization]; the PR description is completely empty and lacks all required sections.
Suggested title (copy-paste ready):
[Optimization] save some compute in DSv3 MLA decode path
Suggested PR description (copy-paste ready):
## Motivation
In mixed prefill+decode batches, the MLA decode path previously filtered decoder tokens via `extract_decoder_token_from_q` only after the `kv_b_proj_bmm` GEMM had completed. This PR moves the filtering ahead of the GEMM (performed in `deepseek_v3.forward`), so the expensive GEMM operates only on decoder tokens and no computation is wasted on encoder tokens. It also removes a hard-coded development path left over in `mla_blackwell`, and switches the allocation in `insert_decoder_result_back` from `paddle.zeros` to `paddle.empty` to reduce initialization overhead.
## Modifications
- `fastdeploy/model_executor/models/deepseek_v3.py`: in the `need_do_decode` branch (flash_mla/Blackwell path), move `extract_decoder_token_from_q` ahead of `kv_b_proj_bmm` so the GEMM runs only on decoder tokens; after the GEMM, call `insert_decoder_result_back` to restore the full token sequence.
- `fastdeploy/model_executor/layers/attention/mla_attention_backend.py`: remove the duplicate `extract_decoder_token_from_q` call inside `forward_mixed`; switch `insert_decoder_result_back` to `paddle.empty`; delete the hard-coded `/root/...` sys.path entry in `mla_blackwell`; change the Blackwell check from `cc >= 100` to `prop.major == 10`; add shape-consistency assertions.
- `tests/operators/test_flashmla_precision.py`: add SM100/SM90 branch precision tests; introduce a `page_size` variable to replace the hard-coded 64.
- `tests/operators/test_deepgemm_precision.py`: update the DeepGEMM Blackwell tests to support 2-CTA instructions and the TMA pipeline.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | mla_attention_backend.py:862 | SM8x regression: `prop.major == 9` sends A100/L40S (SM89) into the flash_mla path when USE_FLASH_MLA=0, raising ImportError |
| 🔴 Bug | test_flashmla_precision.py:93 | `decoder_res` undefined: this line raises NameError on non-SM9/SM10 hardware |
| 📝 PR convention | — | Title tag casing mismatch ([OPTIMIZATION] → [Optimization]); description completely empty |
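The `decoder_res` bug is the classic branch-local-variable pattern: a name assigned only inside hardware-gated branches and then used unconditionally. A minimal reproduction (names and return values hypothetical):

```python
def pick_decode_result(sm_major):
    # decoder_res is bound only in the SM90/SM100 branches; on any other
    # architecture the return line raises UnboundLocalError, which is a
    # subclass of NameError -- exactly the failure flagged in the review.
    if sm_major == 9:
        decoder_res = "sm90 path"
    elif sm_major == 10:
        decoder_res = "sm100 path"
    return decoder_res

ok = pick_decode_result(9)
try:
    pick_decode_result(8)       # e.g. an A100/L40S-class part
    raised = False
except NameError:
    raised = True
```

The usual fix is to initialize `decoder_res` before the branches or to raise an explicit, descriptive error in a final `else`.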
Overall assessment
The optimization approach is correct: narrowing the GEMM from all tokens to decoder-only tokens effectively reduces computation in mixed batches. However, the tightened SM version condition (cc >= 100 → prop.major == 9) may introduce architecture regressions on A100 (SM80) and L40S (SM89); recommend fixing before merging.
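Taking the Modifications description at face value (`cc >= 100` → `prop.major == 10`), the two gating styles can be compared in isolation. This is an illustrative sketch, not the backend's actual detection code; the SM values follow NVIDIA's published compute capabilities (L40S = SM89, H100 = SM90, B200 = SM100):

```python
def is_blackwell_cc(major, minor):
    # Original gate: numeric compute capability; matches SM100 and
    # anything newer.
    return major * 10 + minor >= 100

def is_blackwell_major(major):
    # Narrowed gate: only the SM10x (Blackwell) major version matches.
    return major == 10

# The gates agree on current hardware: (major, minor, expected_blackwell).
checks = [(8, 9, False), (9, 0, False), (10, 0, True)]
agree = all(
    is_blackwell_cc(m, n) == expected == is_blackwell_major(m)
    for m, n, expected in checks
)
# But they diverge on a hypothetical future SM11x part: the cc-based gate
# would include it, the major-version gate would not.
diverge = is_blackwell_cc(11, 0) and not is_blackwell_major(11)
```

Which behavior is desired for unreleased architectures is a design decision; the review's concern is that any tightening must not accidentally reroute SM8x/SM9x parts.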
EmmonsCurse
left a comment
LGTM~ Skip coverage check as it mainly relies on tests with sm_version >= 100