[OPTIMIZATION] save some compute in dsv3 #7785
Conversation
Thanks for your contribution!
CI report generated from the following code (refreshed every 30 minutes):
1 Task overview: all Required tasks passed ✅, merge recommended; 1 optional task failed (does not block merging), 2 tasks running, 2 tasks pending.
2 Task status summary
2.1 Required tasks: 2/2 passed
2.2 Optional tasks: 8/13 passed
3 Failure details (Required only): no failed Required tasks.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##           develop    #7785   +/-   ##
==========================================
  Coverage         ?   63.98%
==========================================
  Files            ?      461
  Lines            ?    64145
  Branches         ?     9826
==========================================
  Hits             ?    41041
  Misses           ?    20279
  Partials         ?     2825

Flags with carried forward coverage won't be shown.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 17:52:26
📋 Review Summary
PR overview: moves `extract_decoder_token_from_q` in the DSv3 MLA decode path ahead of the `kv_b_proj_bmm` GEMM, saving wasted computation on encoder tokens in mixed batches.
Scope of changes: model_executor/models/deepseek_v3.py, layers/attention/mla_attention_backend.py, tests/operators/
Impact tags: [Models] [OP] [Optimization]
📝 PR Convention Check
The title uses [OPTIMIZATION] (all caps), which does not match the official tag [Optimization]; the PR description is completely empty and lacks all required sections.
Suggested title (copy-paste ready):
[Optimization] save some compute in DSv3 MLA decode path
Suggested PR description (copy-paste ready):
## Motivation
In mixed prefill+decode batches, the MLA decode path previously filtered decoder tokens via `extract_decoder_token_from_q` only after the `kv_b_proj_bmm` GEMM had completed. This PR moves the filtering ahead of the GEMM (performed in `deepseek_v3.forward`), so the expensive GEMM operates only on decoder tokens and no computation is wasted on encoder tokens. It also removes a hard-coded development path left over in `mla_blackwell`, and switches the allocation in `insert_decoder_result_back` from `paddle.zeros` to `paddle.empty` to reduce initialization overhead.
## Modifications
- `fastdeploy/model_executor/models/deepseek_v3.py`: in the `need_do_decode` branch (flash_mla/Blackwell path), move `extract_decoder_token_from_q` ahead of `kv_b_proj_bmm` so the GEMM runs only on decoder tokens; after the GEMM, call `insert_decoder_result_back` to restore the full token sequence.
- `fastdeploy/model_executor/layers/attention/mla_attention_backend.py`: remove the duplicate `extract_decoder_token_from_q` call inside `forward_mixed`; switch `insert_decoder_result_back` to `paddle.empty`; delete the hard-coded `/root/...` sys.path entry in `mla_blackwell`; change the Blackwell check from `cc >= 100` to `prop.major == 10`; add shape-consistency assertions.
- `tests/operators/test_flashmla_precision.py`: add SM100/SM90 branch precision tests; introduce a `page_size` variable to replace the hard-coded 64.
- `tests/operators/test_deepgemm_precision.py`: update the DeepGEMM Blackwell tests to support 2-CTA instructions and the TMA pipeline.
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | mla_attention_backend.py:862 | SM8x regression: `prop.major == 9` sends A100/L40S (SM89) into the flash_mla path when USE_FLASH_MLA=0, raising ImportError |
| 🔴 Bug | test_flashmla_precision.py:93 | `decoder_res` undefined: this line raises NameError on non-SM9/SM10 hardware |
| 📝 PR convention | — | Title tag casing mismatch ([OPTIMIZATION] → [Optimization]); description completely empty |
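The `decoder_res` bug is the classic branch-local-variable pattern: a name assigned only inside hardware-gated branches and then used unconditionally. A minimal reproduction (names and return values hypothetical):

```python
def pick_decode_result(sm_major):
    # decoder_res is bound only in the SM90/SM100 branches; on any other
    # architecture the return line raises UnboundLocalError, which is a
    # subclass of NameError -- exactly the failure flagged in the review.
    if sm_major == 9:
        decoder_res = "sm90 path"
    elif sm_major == 10:
        decoder_res = "sm100 path"
    return decoder_res

ok = pick_decode_result(9)
try:
    pick_decode_result(8)       # e.g. an A100/L40S-class part
    raised = False
except NameError:
    raised = True
```

The usual fix is to initialize `decoder_res` before the branches or to raise an explicit, descriptive error in a final `else`.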
Overall assessment
The optimization approach is correct: narrowing the GEMM from all tokens to decoder-only tokens effectively reduces computation in mixed batches. However, the tightened SM version condition (cc >= 100 → prop.major == 9) may introduce architecture regressions on A100 (SM80) and L40S (SM89); recommend fixing before merging.
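Taking the Modifications description at face value (`cc >= 100` → `prop.major == 10`), the two gating styles can be compared in isolation. This is an illustrative sketch, not the backend's actual detection code; the SM values follow NVIDIA's published compute capabilities (L40S = SM89, H100 = SM90, B200 = SM100):

```python
def is_blackwell_cc(major, minor):
    # Original gate: numeric compute capability; matches SM100 and
    # anything newer.
    return major * 10 + minor >= 100

def is_blackwell_major(major):
    # Narrowed gate: only the SM10x (Blackwell) major version matches.
    return major == 10

# The gates agree on current hardware: (major, minor, expected_blackwell).
checks = [(8, 9, False), (9, 0, False), (10, 0, True)]
agree = all(
    is_blackwell_cc(m, n) == expected == is_blackwell_major(m)
    for m, n, expected in checks
)
# But they diverge on a hypothetical future SM11x part: the cc-based gate
# would include it, the major-version gate would not.
diverge = is_blackwell_cc(11, 0) and not is_blackwell_major(11)
```

Which behavior is desired for unreleased architectures is a design decision; the review's concern is that any tightening must not accidentally reroute SM8x/SM9x parts.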
EmmonsCurse
left a comment
LGTM~ Skip coverage check as it mainly relies on tests with sm_version >= 100