
[Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap(#7323)#7794

Open
freeliuzc wants to merge 2 commits into
PaddlePaddle:release/2.6 from
freeliuzc:cp_mtp_pd_overlap

Conversation

@freeliuzc
Collaborator

…de with insert_task overlap (#7323)

  • support mtp overlap in pd-split mode with insert_task overlap

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…de with insert_task overlap (PaddlePaddle#7323)

* support mtp overlap in pd-split mode with insert_task overlap
Copilot AI review requested due to automatic review settings May 12, 2026 09:26
@paddle-bot

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

Contributor

Copilot AI left a comment


Pull request overview

This PR (Cherry-Pick of #7323) adapts the insert_task overlap capability for Speculative Decoding / MTP in PD-disaggregated (splitwise) mode: it switches most share-buffer writes to async_set_value for best-effort asynchronous writes, reducing synchronization overhead, and adds unit tests covering the key branches.

Changes:

  • In GPUModelRunner.insert_tasks_v1, switch several share_inputs writes to async_set_value, and adjust the draft-token writes and cached-token accounting on the splitwise_role=decode + SUFFIX path.
  • MTPProposer.insert_tasks_v1 likewise adopts async_set_value, and the input_ids_cpu write is now guarded by a hybrid_mode condition.
  • Add unit tests for the splitwise SUFFIX branch; also make async_expert_loader compatible with the changed import layout in cuda-python 13.x.

PR title/description check (per the repository template):

  • The title follows the Cherry-Pick convention (starts with [Cherry-Pick] and ends with the original PR number (#7323)).
  • The template sections in the PR description (Motivation/Modifications/Usage/Accuracy Tests) are not filled in; they should be completed before merging, especially the performance gain and, if output consistency is affected, accuracy results.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/worker/test_gpu_model_runner.py Adds insert_tasks_v1 unit tests for the splitwise_role=decode + SUFFIX branch, stubbing async_set_value with a synchronous version so the tests run without CUDA.
fastdeploy/worker/gpu_model_runner.py Switches multiple insert_tasks_v1 writes to async_set_value; the splitwise decode + SUFFIX branch writes draft_tokens/seq_lens and updates the cached-token count; stop_seqs is now padded and written as one whole block.
fastdeploy/spec_decode/mtp.py Switches the MTP proposer's insert_tasks_v1 writes to async_set_value and guards the input_ids_cpu write under hybrid_mode.
fastdeploy/model_executor/xpu_pre_and_post_process.py Adds an XPU async_set_value implementation for the XPU MTP proposer and related paths.
fastdeploy/model_executor/pre_and_post_process.py Makes async_set_value available on all platforms (no longer CUDA-only), keeping the custom_numpy_to_tensor fast path on CUDA.
fastdeploy/eplb/async_expert_loader.py Handles the import difference between cuda-python 13.x (cuda.bindings.runtime) and older releases (cuda.cudart) and improves the missing-dependency message.

Comment thread fastdeploy/worker/gpu_model_runner.py Outdated
Comment on lines +1041 to +1042
# pad each stop sequence to stop_seqs_max_len, fill the empty rows, then write the whole block at once
# to avoid partial slices along dim 3 (non-contiguous memory) misaligning async_set_value strides
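The pad-then-bulk-write trick this comment describes can be shown with plain numpy. The shapes and names below (stop_seqs_max_num, stop_seqs_max_len) are illustrative assumptions, not the actual buffer layout: rather than writing each variable-length stop sequence into a partial slice of the 3-D buffer (a non-contiguous view), every sequence is padded to the maximum length and copied as one contiguous block.

```python
import numpy as np

# Illustrative buffer shapes; real values come from the model config.
stop_seqs_max_num, stop_seqs_max_len = 4, 8
buf = np.zeros((1, stop_seqs_max_num, stop_seqs_max_len), dtype=np.int64)

stop_seqs = [[5, 6], [7, 8, 9]]  # variable-length stop sequences

# Pad every sequence to stop_seqs_max_len (unused rows stay zero) ...
padded = np.zeros((stop_seqs_max_num, stop_seqs_max_len), dtype=np.int64)
for i, seq in enumerate(stop_seqs):
    padded[i, : len(seq)] = seq

# ... then perform a single contiguous write instead of many strided ones.
buf[0] = padded
```

One contiguous copy is both cheaper and safe to hand to an asynchronous setter, since no stride bookkeeping on a partial dim-3 slice is involved.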
Collaborator Author


Done

Comment thread fastdeploy/eplb/async_expert_loader.py Outdated

_cuda_ver = getattr(_cuda_pkg, "__version__", None)
if _cuda_ver is None:
# cuda-python >= 13.x has no top-level __version__; detect via the cuda-bindings subpackage instead
Collaborator Author


Done

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 17:49:38

CI report generated from the code below (refreshed every 30 minutes):


1 Task overview

CI has 1 failing required task, with 4 more required tasks running and 4 required tasks pending. Merging is best deferred until the failing required task is resolved.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
39(0) 39 26 3 5 5 0

2 Task status summary

2.1 Required tasks: 1/10 passed

Required tasks block merging; failures must be resolved first.

Status Task Duration Root cause Fix suggestion Log Rerun
Approval 10s PR issue: changes under spec_decode lack RD approval; @Deleter-D should Approve-review this PR Job -
Extracted partial CE model tasks / run_ce_cases - Running - Job -
Run Base Tests / base_tests - Running - Job -
xpu_4cards_case_test / run_xpu_4cards_cases - Running - Job -
xpu_8cards_case_test / run_xpu_8cards_cases - Running - Job -
⏸️ Run FastDeploy LogProb Tests / run_tests_logprob - Pending - - -
⏸️ Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - Pending - - -
⏸️ Run Four Cards Tests / run_4_cards_tests - Pending - - -
⏸️ Run Stable Tests / stable_tests - Pending - - -
The remaining 1 required task passed - - - - -

2.2 Optional tasks: 25/29 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Run iluvatar Tests / run_iluvatar_cases 9m23s Job -
Check PR Template 13s Job -
Trigger Jenkins for PR - Job -
⏸️ CI_HPU - - -
The remaining 25 optional tasks passed - - -

3 Failure details (required only)

Approval: missing PR approval (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: PR approval
  • Confidence: high
  • Root-cause summary: PR issue: changes under the spec_decode directory lack RD approval
  • Analyzer: generic analysis (fallback)

Root-cause details:
The PR modifies files under fastdeploy/spec_decode and/or custom_ops/gpu_ops/speculate_decoding. Repository rules require an Approve review from at least one designated FastDeploy RD (freeliuzc(liuzichang01) or Deleter-D(wangyanpeng04)) for such changes. Since the PR author is freeliuzc himself, Deleter-D(wangyanpeng04) must provide the approval.

Key log:

==> PR title: [Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap...
0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval
   for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Fix suggestion:

  1. @Deleter-D (wangyanpeng04) should Approve-review this PR

Fix summary: ask @Deleter-D to Approve-review this PR

Link: view log

PaddlePaddle-bot

This comment was marked as outdated.


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-12 18:34:17

📋 Review summary

PR overview: supports insert_tasks_v1 overlap for MTP speculative decoding in PD-disaggregated mode by replacing all synchronous tensor assignments with asynchronous async_set_value writes; also fixes cuda-python >= 13.x compatibility and adds an async_set_value implementation for XPU.

Changed files: spec_decode/mtp.py, worker/gpu_model_runner.py, model_executor/pre_and_post_process.py, model_executor/xpu_pre_and_post_process.py, eplb/async_expert_loader.py

Impact tags: [Speculative Decoding] [PD Disaggregation] [XPU]

📝 PR convention check

The PR title follows the Cherry-Pick convention, but every body section (Motivation / Modifications / Usage or Command / Accuracy Tests) is an empty placeholder and no Checklist item is checked.

Title suggestion (already compliant; no change needed)

Suggested PR description (copy-paste ready):

## Motivation
In PD-disaggregated (splitwise) mode, the many synchronous tensor assignments in the MTP speculative-decoding `insert_tasks_v1` block overlapped execution. This PR replaces those assignments with asynchronous `async_set_value` writes, enabling overlap during the insert_task phase and improving PD-split inference throughput. It also fixes import compatibility with cuda-python >= 13.x (the cuda-bindings package) and adds a matching `async_set_value` implementation for XPU.

## Modifications
- `fastdeploy/eplb/async_expert_loader.py`: add cuda-python >= 13.x version detection using `cuda.bindings.runtime`; change `except ImportError` to `except Exception` and emit a warning
- `fastdeploy/model_executor/pre_and_post_process.py`: make `async_set_value` generally available instead of CUDA-only (the non-CUDA path uses `paddle.to_tensor`); fix the `isinstance(src, np.array)` → `np.ndarray` bug
- `fastdeploy/model_executor/xpu_pre_and_post_process.py`: add the XPU `async_set_value` implementation (async numpy-to-tensor support to follow)
- `fastdeploy/spec_decode/mtp.py`: switch all direct assignments in `insert_tasks_v1` to `async_set_value`; import the XPU `async_set_value`; guard the `input_ids_cpu` update behind `hybrid_mode`
- `fastdeploy/worker/gpu_model_runner.py`: switch all direct assignments in `insert_tasks_v1` to `async_set_value`; the PD-split speculative-decoding path now writes only 2 draft tokens; the 3-D `stop_seqs` write becomes a whole-block fill; add an `enable_mm` guard
- `tests/worker/test_gpu_model_runner.py`: add the `TestInsertTasksV1SplitwiseSuffix` test class covering 5 PD-split suffix scenarios

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues

Level File Summary
🟡 Suggestion fastdeploy/eplb/async_expert_loader.py:38 except Exception is too broad and may silently swallow non-import exceptions
❓ Question fastdeploy/worker/gpu_model_runner.py:911 is_chunk_step is still assigned synchronously while is_block_step in the same block now uses async_set_value; is this an oversight?
🟡 Suggestion fastdeploy/spec_decode/mtp.py:53 async_set_value is marked "temporary solution" and the same function is defined in two modules; file an issue to track unification
📝 PR convention Every PR description section is empty and no Checklist item is checked

Overall assessment

The overall approach is clear: the async_set_value replacement is comprehensive and backed by unit tests. The cross-module duplication of async_set_value is known technical debt (a TODO is noted) and does not block merging; please confirm whether the direct is_chunk_step assignment and the overly broad except Exception are oversights.

_cuda_ver = _meta.version("cuda-bindings")
_cuda_major = int(_cuda_ver.split(".")[0])
if _cuda_major >= 13:
from cuda.bindings import runtime as cudart


🟡 Suggestion: except Exception is too broad

Changing except ImportError to except Exception catches every exception (including non-import errors such as AttributeError, NameError, TypeError), which can make debugging difficult. For example, if _cuda_ver.split(".")[0] returns something unexpected, int(...) raises ValueError; swallowed silently, cudart ends up None with only a one-line warning, and the root cause is hard to trace.

A narrower set of exception types is preferable:

except (ImportError, AttributeError, ValueError) as _e:
    import warnings
    warnings.warn(f"cuda-python import failed, async_expert_loader will be unavailable: {_e}")
    cudart = None
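A self-contained sketch of the full version-aware import discussed in this thread. It assumes the documented cuda-python layouts (cuda.bindings.runtime under the "cuda-bindings" distribution for >= 13.x, cuda.cudart for older releases); the helper name load_cudart is illustrative, not the PR's actual code. It catches only the specific failure modes and degrades to None when cuda-python is absent.

```python
import importlib
import importlib.metadata as _meta
import warnings


def load_cudart():
    """Return the CUDA runtime bindings module, or None if unavailable.

    cuda-python >= 13.x ships the runtime as cuda.bindings.runtime
    (distribution name "cuda-bindings"); older releases expose cuda.cudart.
    """
    try:
        major = int(_meta.version("cuda-bindings").split(".")[0])
    except _meta.PackageNotFoundError:
        major = 0  # pre-13 layout, or cuda-python not installed at all
    except ValueError as exc:
        warnings.warn(f"unparseable cuda-bindings version: {exc}")
        return None
    try:
        if major >= 13:
            return importlib.import_module("cuda.bindings.runtime")
        return importlib.import_module("cuda.cudart")
    except ImportError as exc:
        warnings.warn(f"cuda-python import failed: {exc}")
        return None
```

On a machine without cuda-python this returns None after one explicit warning, so callers can gate the async expert loader on the result without a bare except hiding unrelated bugs.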

self.share_inputs["prompt_lens"][idx : idx + 1] = len(input_ids)
self.share_inputs["is_block_step"][idx : idx + 1] = False
async_set_value(self.share_inputs["step_seq_lens_decoder"][idx : idx + 1], 0)
async_set_value(self.share_inputs["prompt_lens"][idx : idx + 1], len(input_ids))


❓ Question: was is_chunk_step missed when switching to async_set_value?

In the same code block, is_block_step was switched to an asynchronous async_set_value write:

async_set_value(self.share_inputs["is_block_step"][idx : idx + 1], False)

while is_chunk_step still uses a synchronous direct assignment:

self.share_inputs["is_chunk_step"][idx : idx + 1] = prefill_end_index < len(input_ids)

If is_chunk_step is also read during overlap, a race is possible. Please confirm whether is_chunk_step is a CPU tensor (handled like preempted_idx); if so, a clarifying comment would help.

update_attn_mask_offsets,
)

# temporary solution


🟡 Suggestion: async_set_value is duplicated across modules

async_set_value is currently implemented separately in pre_and_post_process.py (CUDA path) and xpu_pre_and_post_process.py (XPU path), with a # temporary solution comment already noting this. The two implementations are nearly identical; only the CUDA branch has the custom_numpy_to_tensor optimization.

Suggest filing a follow-up issue to track unification: merge the two implementations into fastdeploy/model_executor/pre_and_post_process.py or a new common_ops.py, branching on current_platform.is_cuda(), to avoid maintaining two copies.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 19:07:42

CI report generated from the code below (refreshed every 30 minutes):


1 Task overview

CI has not finished: 1 required task is failing (Approval) and 3 required tasks are running; the approval issue needs handling.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
36(0) 36 28 2 4 2 0

2 Task status summary

2.1 Required tasks: 6/10 passed

Required tasks block merging; failures must be resolved first.

Status Task Duration Root cause Fix suggestion Log Rerun
Approval 11s PR issue: spec_decode changes lack the designated RD approval; Deleter-D(wangyanpeng04) should Approve-review Job -
Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - Running - Job -
xpu_4cards_case_test / run_xpu_4cards_cases - Running - Job -
Run Four Cards Tests / run_4_cards_tests - Running - Job -
The remaining 6 required tasks passed - - - - -

2.2 Optional tasks: 22/26 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Check PR Template 15s Job -
Trigger Jenkins for PR - - -
⏸️ CI_HPU - - -
⏸️ Run iluvatar Tests / run_iluvatar_cases - - -
The remaining 22 optional tasks passed - - -

3 Failure details (required only)

Approval: PR approval check (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code compliance (PR approval)
  • Confidence: high
  • Root-cause summary: the PR modifies the spec_decode/speculate_decoding directories, which need the designated RD approval that has not been granted
  • Analyzer: generic analysis (fallback)

Root-cause details:
check_approval.sh detected 1 approval error. The PR modifies the fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding directories; repository rules require a review approval from a FastDeploy RD (freeliuzc or Deleter-D/wangyanpeng04) for these directories. Since the PR author is freeliuzc, Deleter-D(wangyanpeng04) must approve.

Key log:

==> PR title: [Cherry-Pick][Speculative Decoding] Support mtp super ultra overlap in pd-split mode with insert_task overlap(#7323)
0. You must have one FastDeploy RD (freeliuzc(liuzichang01), Deleter-D(wangyanpeng04)) approval for modifing [fastdeploy/spec_decode,custom_ops/gpu_ops/speculate_decoding].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Fix suggestion:

  1. Contact Deleter-D(wangyanpeng04) to review this PR and click "Approve"

Fix summary: ask Deleter-D(wangyanpeng04) to Approve-review this PR

Related changes: the PR modifies files under fastdeploy/spec_decode and custom_ops/gpu_ops/speculate_decoding
Link: view log

@codecov-commenter

Codecov Report

❌ Patch coverage is 65.78947% with 39 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@66dea60). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/worker/gpu_model_runner.py 76.27% 14 Missing ⚠️
fastdeploy/model_executor/pre_and_post_process.py 31.57% 7 Missing and 6 partials ⚠️
fastdeploy/eplb/async_expert_loader.py 50.00% 4 Missing and 2 partials ⚠️
fastdeploy/spec_decode/mtp.py 75.00% 5 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7794   +/-   ##
==============================================
  Coverage               ?   71.90%           
==============================================
  Files                  ?      378           
  Lines                  ?    53954           
  Branches               ?     8440           
==============================================
  Hits                   ?    38793           
  Misses                 ?    12391           
  Partials               ?     2770           
Flag Coverage Δ
GPU 71.90% <65.78%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

4 participants