[Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap(#7323) by freeliuzc · Pull Request #7794 · PaddlePaddle/FastDeploy

freeliuzc · 2026-05-12T09:26:04Z

…de with insert_task overlap (#7323)

support mtp overlap in pd-split mode with insert_task overlap

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…de with insert_task overlap (PaddlePaddle#7323) * support mtp overlap in pd-split mode with insert_task overlap

paddle-bot · 2026-05-12T09:26:10Z

Thanks for your contribution!

Copilot

Pull request overview

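This PR (Cherry-Pick of #7323) adapts Speculative Decoding / MTP to support insert-task (insert_task) overlap in PD-disaggregated (splitwise) mode: it uses async_set_value more widely to write to the share buffer asynchronously where possible, which reduces synchronization overhead, and it adds unit tests covering the key branches.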

Changes:

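In GPUModelRunner.insert_tasks_v1, several share_inputs writes are changed to async_set_value, and the draft-token write and cached-token-count update logic is adjusted on the splitwise_role=decode + SUFFIX path.
MTPProposer.insert_tasks_v1 likewise adopts async_set_value and guards the input_ids_cpu write with a hybrid_mode condition.
Adds unit tests for the splitwise SUFFIX branch, and makes async_expert_loader more tolerant of the import-structure change in cuda-python 13.x.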

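PR title/description check (per the repository template):

The title follows the Cherry-Pick convention (it starts with [Cherry-Pick] and ends with the original PR number (#7323)).
The template sections in the PR description (Motivation/Modifications/Usage/Accuracy Tests) are not filled in; they should be completed before merging, especially the performance gain and, if output consistency is affected, the accuracy results.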

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

File	Description
tests/worker/test_gpu_model_runner.py	Adds insert_tasks_v1 unit tests for the splitwise_role=decode + SUFFIX branch, replacing async_set_value with a synchronous stub so the tests run without CUDA.
fastdeploy/worker/gpu_model_runner.py	Several writes in insert_tasks_v1 are changed to async_set_value; the splitwise decode + SUFFIX branch writes draft_tokens/seq_lens and updates the cached token count; stop_seqs is now padded and written as a whole block.
fastdeploy/spec_decode/mtp.py	Several writes in the MTP proposer's insert_tasks_v1 are changed to async_set_value, and the input_ids_cpu write under hybrid_mode is guarded.
fastdeploy/model_executor/xpu_pre_and_post_process.py	Adds an async_set_value implementation for the XPU path, used by the XPU MTP proposer and similar paths.
fastdeploy/model_executor/pre_and_post_process.py	Makes async_set_value available on all platforms (no longer CUDA-only); on CUDA it still takes the optimized custom_numpy_to_tensor path.
fastdeploy/eplb/async_expert_loader.py	Handles the import difference between cuda-python 13.x (cuda.bindings.runtime) and older versions (cuda.cudart), and improves the message shown when the dependency is missing.

freeliuzc · 2026-05-12T10:29:07Z

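+                # Pad each stop sequence to stop_seqs_max_len, fill the remaining rows, then write the whole block at once
+                # Avoids a partial slice on the third dimension (non-contiguous memory), which would misalign strides in async_set_value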
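
The padding step described in these comments can be sketched in plain Python. This is an illustration under assumptions, not the PR's code: the function name `pad_stop_seqs`, its parameters, and the pad value `-1` are all hypothetical, and nested lists stand in for the tensor block.

```python
def pad_stop_seqs(stop_seqs, max_num, max_len, pad_id=-1):
    """Return a full max_num x max_len block of token ids.

    Each stop sequence is right-padded to max_len, and empty rows are
    appended until there are max_num rows, so the caller can write the
    whole block in one go instead of slicing part of the last dimension.
    """
    if len(stop_seqs) > max_num:
        raise ValueError("too many stop sequences")
    block = []
    for seq in stop_seqs:
        if len(seq) > max_len:
            raise ValueError("stop sequence longer than max_len")
        block.append(list(seq) + [pad_id] * (max_len - len(seq)))
    # Fill the remaining rows so the block always has exactly max_num rows.
    block.extend([[pad_id] * max_len for _ in range(max_num - len(stop_seqs))])
    return block


block = pad_stop_seqs([[5, 6], [7]], max_num=3, max_len=4)
```

Writing a block of fixed shape means the destination is always a whole row range, which is the contiguous case the comments refer to.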


freeliuzc · 2026-05-12T10:29:13Z

+
+    _cuda_ver = getattr(_cuda_pkg, "__version__", None)
+    if _cuda_ver is None:
+        # cuda-python >= 13.x has no top-level __version__; detect the version via the cuda-bindings subpackage


PaddlePaddle-bot · 2026-05-12T09:51:08Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 17:49:38

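The CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 460dacc
Merge base: 66dea60 (branch: release/2.6)
View full diff
CI details

1 Task overview

CI has 1 failed required task, with 4 more required tasks running and 4 required tasks pending. Hold off on merging until the failed required task is resolved.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
39(0)	39	26	3	5	5	0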

2 Task status summary

2.1 Required tasks: 1/10 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	10s	PR issue: changes under the spec_decode directory lack RD approval	Ask @Deleter-D to submit an approving review on this PR	Job	-
⏳	`Extracted partial CE model tasks / run_ce_cases`	-	Running	-	Job	-
⏳	`Run Base Tests / base_tests`	-	Running	-	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`xpu_8cards_case_test / run_xpu_8cards_cases`	-	Running	-	Job	-
⏸️	`Run FastDeploy LogProb Tests / run_tests_logprob`	-	Pending	-	-	-
⏸️	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Pending	-	-	-
⏸️	`Run Four Cards Tests / run_4_cards_tests`	-	Pending	-	-	-
⏸️	`Run Stable Tests / stable_tests`	-	Pending	-	-	-
✅	The remaining 1 required task passed	-	-	-	-	-

2.2 Optional tasks: 25/29 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Run iluvatar Tests / run_iluvatar_cases`	9m23s	Job	-
❌	`Check PR Template`	13s	Job	-
⏳	`Trigger Jenkins for PR`	-	Job	-
⏸️	`CI_HPU`	-	-	-
✅	The remaining 25 optional tasks passed	-	-	-

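3 Failure details (required only)

Approval: missing PR approval (confidence: high)

Approval

Status: ❌ Failed
Error type: PR approval
Confidence: high
Root-cause summary: PR issue: changes under the spec_decode directory lack RD approval
Analyzer: generic analysis (fallback)

Root-cause details:
The PR modifies files under fastdeploy/spec_decode and/or custom_ops/gpu_ops/speculate_decoding. Under the repository rules, such changes require an approving review from at least one designated FastDeploy RD (freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04)). Because the author of this PR is freeliuzc, Deleter-D(wangyanpeng04) must provide the approval.

Key log: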

==> PR title: [Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap...
0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval
   for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

Ask @Deleter-D (wangyanpeng04) to submit an approving review on this PR

Suggested-fix summary: Ask @Deleter-D to submit an approving review on this PR

Link: view log

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 18:34:17

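📋 Review summary

PR overview: Adds insert_tasks_v1 overlap support for MTP speculative decoding in PD-disaggregated mode by replacing all synchronous tensor assignments with asynchronous async_set_value writes; also fixes compatibility with cuda-python ≥13.x and adds an async_set_value implementation for XPU.

Scope of changes: spec_decode/mtp.py, worker/gpu_model_runner.py, model_executor/pre_and_post_process.py, model_executor/xpu_pre_and_post_process.py, eplb/async_expert_loader.py

Affected tags: [Speculative Decoding] [PD Disaggregation] [XPU]

📝 PR convention check

The PR title follows the Cherry-Pick convention, but every body section (Motivation / Modifications / Usage or Command / Accuracy Tests) is an empty placeholder and no Checklist item is checked.

Suggested title (already compliant, no change needed)

Suggested PR description (can be copied as-is):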

## Motivation
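In PD-disaggregated (splitwise) mode, the many synchronous tensor assignments in `insert_tasks_v1` for MTP speculative decoding block overlapped execution. This PR replaces all of these assignments with asynchronous `async_set_value` writes, enabling overlap in the insert_task stage and improving inference throughput in PD-split mode. It also fixes import compatibility with cuda-python ≥13.x (the cuda-bindings package) and adds a matching `async_set_value` implementation for XPU.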

## Modifications
- `fastdeploy/eplb/async_expert_loader.py`: adds version detection for cuda-python ≥13.x and uses `cuda.bindings.runtime`; `except ImportError` is changed to `except Exception` and a warning is emitted
- `fastdeploy/model_executor/pre_and_post_process.py`: `async_set_value` changes from CUDA-only to a general implementation (the non-CUDA path uses `paddle.to_tensor`); fixes the `isinstance(src, np.array)` → `np.ndarray` bug
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: adds an `async_set_value` implementation for the XPU platform (async numpy-to-tensor to be supported later)
- `fastdeploy/spec_decode/mtp.py`: all direct assignments in `insert_tasks_v1` are changed to `async_set_value`; adds the XPU `async_set_value` import; the `input_ids_cpu` update is guarded by `hybrid_mode`
- `fastdeploy/worker/gpu_model_runner.py`: all direct assignments in `insert_tasks_v1` are changed to `async_set_value`; the PD-split speculative decoding path now writes only 2 draft tokens; the 3-D `stop_seqs` write is changed to a whole-block fill; adds an `enable_mm` guard
- `tests/worker/test_gpu_model_runner.py`: adds the `TestInsertTasksV1SplitwiseSuffix` test class, covering 5 PD-split suffix scenarios

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

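Issues

Severity	File	Summary
🟡 Suggestion	`fastdeploy/eplb/async_expert_loader.py:38`	`except Exception` is too broad and may silently swallow non-import exceptions
❓ Question	`fastdeploy/worker/gpu_model_runner.py:911`	`is_chunk_step` is still assigned directly while `is_block_step` in the same block was switched to `async_set_value`; was this missed?
🟡 Suggestion	`fastdeploy/spec_decode/mtp.py:53`	`async_set_value` is marked "temporary solution" and the same function is defined in two modules; consider opening an issue to track unifying them
📝 PR conventions	-	All PR description sections are empty and no Checklist item is checked

Overall assessment

The overall approach is clear; the bulk replacement with async_set_value is thorough and comes with unit tests. The cross-module duplication of async_set_value is known technical debt (marked with a TODO) and does not block merging. The author should confirm whether the direct is_chunk_step assignment and the overly broad except Exception are oversights.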

PaddlePaddle-bot · 2026-05-12T10:39:01Z

+        _cuda_ver = _meta.version("cuda-bindings")
+    _cuda_major = int(_cuda_ver.split(".")[0])
+    if _cuda_major >= 13:
+        from cuda.bindings import runtime as cudart


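🟡 Suggestion: except Exception is too broad

Changing except ImportError to except Exception catches every exception (including non-import errors such as AttributeError, NameError, and TypeError), which can make debugging hard. For example, if _cuda_ver.split(".")[0] returns an unexpected value, int(...) raises ValueError, which is silently swallowed; cudart is then set to None with only a one-line warning, so the root cause is hard to trace.

Consider narrowing to more specific exception types:

except (ImportError, AttributeError, ValueError) as _e:
    import warnings
    warnings.warn(f"cuda-python import failed, async_expert_loader will be unavailable: {_e}")
    cudart = None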
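
To make the suggestion concrete, here is a small sketch of version-based selection with a narrow exception list. It is not the PR's code: `cuda_major` and `pick_runtime_module` are hypothetical names, and the function only returns the dotted module path rather than importing it, so the sketch runs without cuda-python installed (which is also why ImportError does not appear in the tuple here). The 13.x threshold and the two module paths are taken from the file summary in this PR.

```python
import warnings


def cuda_major(version_string):
    """Parse the major component of a version string such as '13.0.1'."""
    return int(version_string.split(".")[0])


def pick_runtime_module(version_string):
    """Return the dotted module path to import for a cuda-python version.

    Only the narrow failures of version parsing are turned into a warning
    and a None result; any other exception propagates to the caller.
    """
    try:
        major = cuda_major(version_string)
    except (AttributeError, ValueError) as exc:
        # AttributeError: version_string is not a str (for example None).
        # ValueError: the major component is not an integer.
        warnings.warn(f"cuda-python version check failed: {exc}")
        return None
    return "cuda.bindings.runtime" if major >= 13 else "cuda.cudart"
```

With a narrow tuple, an unrelated bug such as a misspelled name inside the try block surfaces as a traceback instead of being reduced to a warning.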

PaddlePaddle-bot · 2026-05-12T10:39:01Z

-                self.share_inputs["prompt_lens"][idx : idx + 1] = len(input_ids)
-                self.share_inputs["is_block_step"][idx : idx + 1] = False
+                async_set_value(self.share_inputs["step_seq_lens_decoder"][idx : idx + 1], 0)
+                async_set_value(self.share_inputs["prompt_lens"][idx : idx + 1], len(input_ids))


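❓ Question: was async_set_value missed for is_chunk_step?

In the same code block, is_block_step has been switched to an asynchronous write via async_set_value:

async_set_value(self.share_inputs["is_block_step"][idx : idx + 1], False)

But is_chunk_step still uses a synchronous direct assignment:

self.share_inputs["is_chunk_step"][idx : idx + 1] = prefill_end_index < len(input_ids)

If is_chunk_step is also read during overlap, there may be a race. Please confirm whether is_chunk_step is a CPU tensor (handled like preempted_idx); if so, consider adding a comment that says so.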

PaddlePaddle-bot · 2026-05-12T10:39:01Z

        update_attn_mask_offsets,
    )
+
+    # temporary solution


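🟡 Suggestion: async_set_value is defined in more than one module

async_set_value is currently implemented separately in pre_and_post_process.py (CUDA path) and xpu_pre_and_post_process.py (XPU path), and the comment already marks it # temporary solution. The two implementations are very similar; only the CUDA branch has the custom_numpy_to_tensor optimization.

Consider opening a separate issue to track this: unify the two implementations in fastdeploy/model_executor/pre_and_post_process.py or a new common_ops.py, branching on current_platform.is_cuda(), so that two copies of the code need not be maintained.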
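
One possible shape for the unified entry point is sketched below. Everything here is hypothetical: the two backend functions are placeholders that write into a Python list, and the `is_cuda` flag stands in for a platform check such as the current_platform.is_cuda() call mentioned above. It shows only the dispatch structure, not how FastDeploy performs the copy.

```python
def _set_value_cuda(target, index, value):
    # Placeholder for the optimized CUDA path (custom_numpy_to_tensor in the PR).
    target[index] = value
    return "cuda"


def _set_value_generic(target, index, value):
    # Placeholder for the generic path used on other platforms.
    target[index] = value
    return "generic"


def set_value(target, index, value, *, is_cuda):
    """Single entry point that picks a backend from a platform flag."""
    backend = _set_value_cuda if is_cuda else _set_value_generic
    return backend(target, index, value)
```

Callers would import one function, and the platform branch would live in one place.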

PaddlePaddle-bot · 2026-05-12T11:09:43Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 19:07:42

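The CI report is generated from the following code (refreshed every 30 minutes):

PR commit: 0f41924
Merge base: 66dea60 (branch: release/2.6)
View full diff
CI details

1 Task overview

CI has not finished: 1 required task failed (Approval) and 3 required tasks are running; the approval issue needs to be resolved.

Total runs (reruns)	Total tasks	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
36(0)	36	28	2	4	2	0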

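2 Task status summary

2.1 Required tasks: 6/10 passed

Required tasks block merging; failures must be handled first.

Status	Task	Duration	Root cause	Suggested fix	Log	Rerun
❌	`Approval`	11s	PR issue: changes under the spec_decode directory have not been approved by a designated RD	Ask Deleter-D(wangyanpeng04) to submit an approving review	Job	-
⏳	`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	-	Running	-	Job	-
⏳	`xpu_4cards_case_test / run_xpu_4cards_cases`	-	Running	-	Job	-
⏳	`Run Four Cards Tests / run_4_cards_tests`	-	Running	-	Job	-
✅	The remaining 6 required tasks passed	-	-	-	-	-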

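2.2 Optional tasks: 22/26 passed

Optional tasks do not block merging; failures are for reference only.

Status	Task	Duration	Log	Rerun
❌	`Check PR Template`	15s	Job	-
⏳	`Trigger Jenkins for PR`	-	-	-
⏸️	`CI_HPU`	-	-	-
⏸️	`Run iluvatar Tests / run_iluvatar_cases`	-	-	-
✅	The remaining 22 optional tasks passed	-	-	-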

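3 Failure details (required only)

Approval: PR approval check (confidence: high)

Approval

Status: ❌ Failed
Error type: code conventions (PR approval)
Confidence: high
Root-cause summary: the PR modifies the spec_decode/speculate_decoding directories, which require approval from a designated RD, but no approval has been given
Analyzer: generic analysis (fallback)

Root-cause details:
check_approval.sh found 1 approval error. The PR modifies the fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding directories; under the repository rules, these directories require a review approval from a FastDeploy RD (freeliuzc or Deleter-D/wangyanpeng04). Because the PR author is freeliuzc, Deleter-D(wangyanpeng04) must provide the approval.

Key log: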

==> PR title: [Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap(#7323)
0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Suggested fix:

Ask Deleter-D(wangyanpeng04) to review this PR and click "Approve"

Suggested-fix summary: Ask Deleter-D(wangyanpeng04) to submit an approving review on this PR

Related changes: the PR modifies files under fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding
Link: view log

codecov-commenter · 2026-05-12T11:52:31Z

Codecov Report

❌ Patch coverage is 65.78947% with 39 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@66dea60). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/worker/gpu_model_runner.py	76.27%	14 Missing ⚠️
fastdeploy/model_executor/pre_and_post_process.py	31.57%	7 Missing and 6 partials ⚠️
fastdeploy/eplb/async_expert_loader.py	50.00%	4 Missing and 2 partials ⚠️
fastdeploy/spec_decode/mtp.py	75.00%	5 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7794   +/-   ##
==============================================
  Coverage               ?   71.90%           
==============================================
  Files                  ?      378           
  Lines                  ?    53954           
  Branches               ?     8440           
==============================================
  Hits                   ?    38793           
  Misses                 ?    12391           
  Partials               ?     2770

Flag	Coverage Δ
GPU	`71.90% <65.78%> (?)`
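
The figures in this report can be cross-checked with a little arithmetic. The sketch below is an inference from the reported numbers, not something Codecov prints: it assumes that partials count as uncovered lines for patch coverage, and it recovers each file's changed-line count from its reported percentage and uncovered count.

```python
# (file, reported patch %, missing lines + partials) from the Codecov table.
REPORTED = [
    ("fastdeploy/worker/gpu_model_runner.py", 76.27, 14),
    ("fastdeploy/model_executor/pre_and_post_process.py", 31.57, 13),
    ("fastdeploy/eplb/async_expert_loader.py", 50.00, 6),
    ("fastdeploy/spec_decode/mtp.py", 75.00, 6),
]


def inferred_changed_lines(patch_pct, uncovered):
    # uncovered / total = 1 - pct / 100, so total = uncovered / (1 - pct / 100).
    return round(uncovered / (1 - patch_pct / 100))


total_uncovered = sum(u for _, _, u in REPORTED)
total_changed = sum(inferred_changed_lines(p, u) for _, p, u in REPORTED)
patch_coverage = 100 * (total_changed - total_uncovered) / total_changed

# Project totals from the coverage diff block.
hits, misses, partials, lines = 38793, 12391, 2770, 53954
project_coverage = 100 * hits / lines
```

Under these assumptions the per-file counts sum to 39 uncovered lines out of an inferred 114 changed lines, which matches the reported 65.78947% patch coverage, and the hits, misses, and partials add up to the reported line total.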

[Speculative Decoding] Support mtp super ultra overlap in pd-split mo…

460dacc

…de with insert_task overlap (PaddlePaddle#7323) * support mtp overlap in pd-split mode with insert_task overlap

Copilot AI review requested due to automatic review settings May 12, 2026 09:26

freeliuzc had a problem deploying to Metax_ci May 12, 2026 09:26 — with GitHub Actions Failure

Copilot started reviewing on behalf of freeliuzc May 12, 2026 09:26 View session

Copilot AI reviewed May 12, 2026

fix note

0f41924

freeliuzc had a problem deploying to Metax_ci May 12, 2026 10:28 — with GitHub Actions Failure

PaddlePaddle-bot reviewed May 12, 2026

4 participants