[OPTIMIZATION] save some compute in dsv3 #7785

Merged
zhoutianzi666 merged 14 commits into PaddlePaddle:develop from zhoutianzi666:add_test_in_flashmla
May 13, 2026

Conversation

@zhoutianzi666
Collaborator

@zhoutianzi666 zhoutianzi666 commented May 12, 2026

Motivation

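In mixed prefill+decode batches, the MLA decode path previously filtered decoder tokens with extract_decoder_token_from_q only after the kv_b_proj_bmm GEMM had completed. This PR moves the filtering ahead of the GEMM (it now runs in deepseek_v3.forward), so the expensive GEMM is applied only to decoder tokens and the wasted computation on encoder tokens is avoided. It also removes a leftover hard-coded development path in mla_blackwell and switches the allocation in insert_decoder_result_back from paddle.zeros to paddle.empty to reduce initialization overhead.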

Modifications

  • fastdeploy/model_executor/models/deepseek_v3.py: in the need_do_decode branch (flash_mla/Blackwell path), move extract_decoder_token_from_q ahead of kv_b_proj_bmm so that the GEMM runs only on decoder tokens; after the GEMM, call insert_decoder_result_back to restore the full token sequence.
  • fastdeploy/model_executor/layers/attention/mla_attention_backend.py: remove the duplicate extract_decoder_token_from_q call in forward_mixed; switch insert_decoder_result_back to paddle.empty; delete the hard-coded /root/... sys.path entry in mla_blackwell; change the Blackwell check from cc >= 100 to prop.major == 10; add shape-consistency assertions.
  • tests/operators/test_flashmla_precision.py: add precision tests for the SM100/SM90 branches and introduce a page_size variable to replace the hard-coded 64.
  • tests/operators/test_deepgemm_precision.py: update the DeepGEMM Blackwell test to support 2CTA instructions and the TMA pipeline.
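
The reordering described in the first bullet can be illustrated with a minimal pure-Python sketch. The helpers `project_rows`, `extract_rows`, and `insert_rows_back` are hypothetical stand-ins for `kv_b_proj_bmm`, `extract_decoder_token_from_q`, and `insert_decoder_result_back`. They are not the FastDeploy implementations, which operate on Paddle tensors, and the token layout is assumed for illustration only.

```python
from typing import List, Optional, Sequence

Matrix = List[List[float]]


def project_rows(x: Matrix, w: Matrix) -> Matrix:
    """Row-wise x @ w on plain lists; stands in for the expensive GEMM."""
    inner, cols = len(w), len(w[0])
    return [[sum(row[k] * w[k][j] for k in range(inner)) for j in range(cols)] for row in x]


def extract_rows(x: Matrix, decoder_idx: Sequence[int]) -> Matrix:
    """Keep only the rows that belong to decode requests."""
    return [x[i] for i in decoder_idx]


def insert_rows_back(
    compact: Matrix, decoder_idx: Sequence[int], total: int
) -> List[Optional[List[float]]]:
    """Scatter compact rows back to their original positions.

    Slots that are not written stay None, mimicking an uninitialised buffer:
    callers must never read rows that were not written.
    """
    full: List[Optional[List[float]]] = [None] * total
    for row, i in zip(compact, decoder_idx):
        full[i] = row
    return full


# Mixed batch of 5 tokens; tokens 1 and 4 are decode tokens (assumed layout).
q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
decoder_idx = [1, 4]

# Old order: project all 5 rows, then select the decoder rows.
gemm_then_extract = extract_rows(project_rows(q, w), decoder_idx)
# New order: select first, so only 2 rows go through the projection.
extract_then_gemm = project_rows(extract_rows(q, decoder_idx), w)

# Restore the full-length layout; non-decoder slots are left unwritten.
restored = insert_rows_back(extract_then_gemm, decoder_idx, total=len(q))
```

Both orders produce the same decoder rows because each output row depends only on its own input row. The unwritten slots in `restored` show why switching from a zero-filled to an uninitialised buffer is only safe when nothing downstream reads the non-decoder rows.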

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

PaddlePaddle-bot commented May 12, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-13 17:42:32

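CI report generated from the following code (updated every 30 minutes):


1 Task overview

All Required tasks have passed ✅, merge recommended; 1 optional task failed (does not block merging), 2 tasks are running, 2 tasks are waiting.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 15(0) | 15 | 10 | 1 | 2 | 2 | 0 |

2 Task status summary

2.1 Required tasks: 2/2 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| | The other 2 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 8/13 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| | Check PR Template | 14s | Job | - |
| | FD-Build-Linux / fd-build | - | Job | - |
| | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| | The other 8 optional tasks passed | - | - | - |

3 Failure details (required only)

No required tasks failed.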

@codecov-commenter

codecov-commenter commented May 12, 2026

Codecov Report

❌ Patch coverage is 28.57143% with 20 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f01bcde). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/model_executor/models/deepseek_v3.py | 38.88% | 9 Missing and 2 partials ⚠️ |
| ...executor/layers/attention/mla_attention_backend.py | 10.00% | 7 Missing and 2 partials ⚠️ |
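
As a cross-check of the figures above, the following sketch recomputes the patch coverage. The per-file totals of changed lines (18 and 10) are not reported by Codecov here; they are inferred as the values consistent with the reported percentages, assuming partial lines count as uncovered.

```python
def patch_coverage(covered: int, total: int) -> float:
    """Percentage of changed lines that are covered."""
    return 100.0 * covered / total


# file -> (missing, partials, inferred total of changed lines)
reported = {
    "deepseek_v3.py": (9, 2, 18),            # total is an inference, not from the report
    "mla_attention_backend.py": (7, 2, 10),  # total is an inference, not from the report
}

uncovered = sum(missing + partial for missing, partial, _ in reported.values())
changed = sum(total for _, _, total in reported.values())
overall = patch_coverage(changed - uncovered, changed)

per_file = {
    name: patch_coverage(total - missing - partial, total)
    for name, (missing, partial, total) in reported.items()
}
```

Under these assumptions the 20 uncovered lines out of 28 changed lines reproduce the reported 28.57143%, and the per-file values match 38.88% (truncated to two decimals) and 10.00%.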
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7785   +/-   ##
==========================================
  Coverage           ?   63.98%           
==========================================
  Files              ?      461           
  Lines              ?    64145           
  Branches           ?     9826           
==========================================
  Hits               ?    41041           
  Misses             ?    20279           
  Partials           ?     2825           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 73.22% <28.57%> (?) |
| XPU | 7.13% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

@zhoutianzi666 zhoutianzi666 changed the title from "merge deve" to "[test]" May 13, 2026
@zhoutianzi666 zhoutianzi666 changed the title from "[test]" to "[test][OPTIMIZATION] save some compute in dsv3" May 13, 2026
@zhoutianzi666 zhoutianzi666 changed the title from "[test][OPTIMIZATION] save some compute in dsv3" to "[OPTIMIZATION] save some compute in dsv3" May 13, 2026

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-13 17:52:26

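📋 Review summary

PR overview: moves extract_decoder_token_from_q ahead of the kv_b_proj_bmm GEMM in the DSv3 MLA decode path, saving the wasted computation on encoder tokens in mixed batches
Scope of changes: model_executor/models/deepseek_v3.py, layers/attention/mla_attention_backend.py, tests/operators/
Impact tags: [Models] [OP] [Optimization]

📝 PR convention check

The title uses [OPTIMIZATION] (all caps), which does not match the casing of the official tag [Optimization]; the PR description is completely empty and is missing all required sections.

Suggested title (can be copied directly):

  • [Optimization] save some compute in DSv3 MLA decode path

Suggested PR description (can be copied directly):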

## Motivation
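In mixed prefill+decode batches, the MLA decode path previously filtered decoder tokens with `extract_decoder_token_from_q` only after the `kv_b_proj_bmm` GEMM had completed. This PR moves the filtering ahead of the GEMM (it now runs in `deepseek_v3.forward`), so the expensive GEMM is applied only to decoder tokens and the wasted computation on encoder tokens is avoided. It also removes a leftover hard-coded development path in `mla_blackwell` and switches the allocation in `insert_decoder_result_back` from `paddle.zeros` to `paddle.empty` to reduce initialization overhead.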

## Modifications
- `fastdeploy/model_executor/models/deepseek_v3.py`: in the `need_do_decode` branch (flash_mla/Blackwell path), move `extract_decoder_token_from_q` ahead of `kv_b_proj_bmm` so that the GEMM runs only on decoder tokens; after the GEMM, call `insert_decoder_result_back` to restore the full token sequence.
- `fastdeploy/model_executor/layers/attention/mla_attention_backend.py`: remove the duplicate `extract_decoder_token_from_q` call in `forward_mixed`; switch `insert_decoder_result_back` to `paddle.empty`; delete the hard-coded `/root/...` sys.path entry in `mla_blackwell`; change the Blackwell check from `cc >= 100` to `prop.major == 10`; add shape-consistency assertions.
- `tests/operators/test_flashmla_precision.py`: add precision tests for the SM100/SM90 branches and introduce a `page_size` variable to replace the hard-coded 64.
- `tests/operators/test_deepgemm_precision.py`: update the DeepGEMM Blackwell test to support 2CTA instructions and the TMA pipeline.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

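Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | mla_attention_backend.py:862 | SM8x regression: prop.major == 9 causes A100/L40S (SM89) to enter the flash_mla path when USE_FLASH_MLA=0 and fail with ImportError |
| 🔴 Bug | test_flashmla_precision.py:93 | decoder_res is undefined: on hardware other than SM9/SM10 this line raises NameError |
| 📝 PR conventions | Title | Tag casing mismatch ([OPTIMIZATION] → [Optimization]); the description is completely empty |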
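
The second issue follows a common pattern: a result variable is bound only inside hardware-specific branches and then read unconditionally. The sketch below reproduces that pattern and one possible guard. The function names and return values are hypothetical and are not taken from test_flashmla_precision.py.

```python
from typing import Optional


def run_decode_unsafe(sm_major: int) -> str:
    """Mirrors the reported pattern: the result is bound only in some branches."""
    if sm_major == 10:
        decoder_res = "sm100 kernel result"
    elif sm_major == 9:
        decoder_res = "sm90 kernel result"
    # On any other architecture nothing was assigned, so this read fails with
    # UnboundLocalError, which is a subclass of NameError.
    return decoder_res


def run_decode_safe(sm_major: int) -> Optional[str]:
    """Returns None on unsupported hardware so the caller can skip the check."""
    if sm_major == 10:
        return "sm100 kernel result"
    if sm_major == 9:
        return "sm90 kernel result"
    return None
```

A test built on the safe variant could skip the comparison when it receives `None`, for example with `unittest.TestCase.skipTest`, instead of failing on unsupported GPUs.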

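Overall assessment

The optimization approach is sound: narrowing the GEMM from all tokens to decoder tokens only effectively reduces computation in mixed batches. However, the narrowed SM version condition (cc >= 100 → prop.major == 9) may introduce regressions on the A100 (SM80) and L40S (SM89) architectures; fixing this before merging is recommended.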
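
To make the SM-version discussion concrete, the sketch below evaluates the three predicates mentioned in this PR on a few compute capabilities. It assumes `cc` means `major * 10 + minor`, and it does not reproduce the surrounding branch logic in mla_attention_backend.py; the function and table names are hypothetical.

```python
from typing import Callable, Dict, Tuple


def cc_at_least_100(major: int, minor: int) -> bool:
    """Threshold check on the combined capability (assumed major * 10 + minor)."""
    return major * 10 + minor >= 100


def major_is_10(major: int, minor: int) -> bool:
    """Accepts only the 10.x family."""
    return major == 10


def major_is_9(major: int, minor: int) -> bool:
    """Accepts only the 9.x family."""
    return major == 9


# Architecture label -> (major, minor) compute capability.
DEVICES: Dict[str, Tuple[int, int]] = {
    "SM80": (8, 0),
    "SM89": (8, 9),
    "SM90": (9, 0),
    "SM100": (10, 0),
    "SM120": (12, 0),
}


def evaluate(predicate: Callable[[int, int], bool]) -> Dict[str, bool]:
    """Apply one predicate to every device in the table."""
    return {name: predicate(*cc) for name, cc in DEVICES.items()}


by_threshold = evaluate(cc_at_least_100)
by_major_10 = evaluate(major_is_10)
by_major_9 = evaluate(major_is_9)
```

The threshold form also accepts SM120, while the equality forms each select exactly one family. None of the three is true for SM80 or SM89, so whichever branch those devices fall into is determined by the code around the check, which is why the fallback path deserves an explicit test.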

Comment thread fastdeploy/model_executor/layers/attention/mla_attention_backend.py
Comment thread tests/operators/test_flashmla_precision.py
Collaborator

@EmmonsCurse EmmonsCurse left a comment

LGTM~ Skip coverage check as it mainly relies on tests with sm_version >= 100

@zhoutianzi666 zhoutianzi666 merged commit 6e149e3 into PaddlePaddle:develop May 13, 2026
74 of 80 checks passed

6 participants