
[Speculative Decoding] fix pd-split metrics and support other model runner#6995

Open
freeliuzc wants to merge 2 commits into PaddlePaddle:develop from freeliuzc:merge_mtp_support

Conversation

@freeliuzc
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests; if there are none, explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 24, 2026 13:03
@paddle-bot

paddle-bot bot commented Mar 24, 2026

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to improve metrics recording for Speculative Decoding (especially MTP) in the PD-split scenario, and to support a specific runner's execution path via an environment-variable switch, avoiding abnormal behavior in EP / empty-input scenarios.

Changes:

  • In specific branches of GPUModelRunner, add logging plus a fallback empty-input forward for MTP.
  • In the MTP proposer, introduce an environment-variable switch that controls the handling path for the multimodal attn_mask_offsets.
  • In the PD-split resource management flow, fix/isolate the metrics object and add a decoder-side inference start timestamp field with an update method.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
fastdeploy/worker/gpu_model_runner.py: adds logging for the speculative decoding path; adds a fallback empty-input forward for MTP+EP when there is no valid input/output
fastdeploy/spec_decode/mtp.py: adds the EB5_ENABLE_FD_RUNNER environment-variable switch and conditionally skips building/updating the multimodal mask offsets on the CUDA path and when inserting tasks
fastdeploy/engine/sched/resource_manager_v1.py: deep-copies metrics when the PD decode side receives a prefilled request, and sets the decoder inference start time
fastdeploy/engine/request.py: adds a decoder-engine send timestamp field to RequestMetrics and a new update method to set it
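The deep-copy fix described for resource_manager_v1.py can be sketched as follows; the helper name and the request/metrics shapes are hypothetical stand-ins, not the actual FastDeploy structures:

```python
import copy


def attach_metrics(decode_request: dict, prefill_metrics: dict) -> dict:
    """Attach an isolated copy of the prefill-side metrics to a decode request.

    Deep-copying prevents the prefill and decode sides from mutating a
    shared metrics object, which would corrupt per-stage timestamps.
    """
    decode_request["metrics"] = copy.deepcopy(prefill_metrics)
    return decode_request
```

With a shallow reference, a later prefill-side update would silently rewrite the decode side's recorded values; the deep copy isolates the two.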
Comments suppressed due to low confidence (1)

fastdeploy/spec_decode/mtp.py:904

  • Same issue as above: _propose_cuda() skips update_attn_mask_offsets when eb5_runner=true, but forward still runs attention based on ForwardMeta.attn_mask_offsets (whenever enable_mm=True). attn_mask_offsets may therefore keep the previous batch's values, or an uninitialized default, and pollute the current step's inference. Consider explicitly clearing ForwardMeta.attn_mask_offsets / disabling the multimodal mask offsets in the eb5_runner case, or guaranteeing the offsets are refreshed correctly on every step.
                if self.enable_mm and not self.eb5_runner:
                    attn_mask_offsets = update_attn_mask_offsets(
                        ids_remove_padding,
                        getattr(
                            self.model_inputs, "seq_lens_this_time", self.model_inputs["seq_lens_this_time_buffer"]
                        ),
                        self.model_inputs["seq_lens_encoder"],
                        self.model_inputs["seq_lens_decoder"],
                        cu_seqlens_q,
                        self.model_inputs["attn_mask_offsets_full"],
                        self.model_inputs["attn_mask_offsets_decoder"],
                        self.model_inputs["is_block_step"],
                        self.model_inputs["decode_states"],
                        self.model_inputs["mask_rollback"],
                    )
                    self.model_inputs["attn_mask_offsets"].copy_(attn_mask_offsets, False)
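One way to act on the reviewer's suggestion of explicitly disabling the offsets under `eb5_runner` is a small resolver like this sketch; the names mirror the diff, but the function itself is illustrative, not FastDeploy code:

```python
def resolve_attn_mask_offsets(enable_mm: bool, eb5_runner: bool, model_inputs: dict):
    """Return the offsets tensor the attention backend should see.

    When the eb5 runner path skips update_attn_mask_offsets(), returning
    None keeps the backend from reading values left over from a previous
    batch; only the normal multimodal path exposes the real buffer.
    """
    if enable_mm and not eb5_runner:
        return model_inputs.get("attn_mask_offsets")
    return None
```

Routing every read of the buffer through one function makes it impossible for a skipped update to leave a stale tensor reachable.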

Comment on lines +873 to +874
logger.info(f"seq_lens_this_time: {length + 1}")
logger.info(f"draft_tokens: {request.draft_token_ids}")
Copilot AI commented Mar 24, 2026

This new info-level log prints each request's draft_tokens (a list of token ids derived from user-generated content) on the hot decoding path, which is a clear privacy / data-leak risk, and under high concurrency it will flood the logs and hurt throughput. Consider removing the log, or at least downgrading it to debug and recording only a length or summary (e.g. token count, hash), with a strict switch (deterministic logger / env flag) controlling when it is enabled.
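The suggested debug-level summary logging could be sketched like this; the `FD_DEBUG_LOG_DRAFT_TOKENS` flag and the helper are hypothetical names, not part of FastDeploy:

```python
import hashlib
import logging
import os

logger = logging.getLogger("spec_decode")

# Gate the log behind an explicit opt-in flag so the hot path stays quiet
# by default (hypothetical flag name).
LOG_DRAFT_TOKENS = os.getenv("FD_DEBUG_LOG_DRAFT_TOKENS", "0") == "1"


def draft_token_summary(draft_token_ids) -> str:
    """Summarize draft tokens without exposing the raw token ids.

    Logs only the count plus a short digest, so correlating runs is still
    possible while user content never reaches the log file.
    """
    digest = hashlib.sha1(repr(list(draft_token_ids)).encode()).hexdigest()[:8]
    return f"n_draft_tokens={len(draft_token_ids)} digest={digest}"


if LOG_DRAFT_TOKENS:
    logger.debug(draft_token_summary([101, 102, 103]))
```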

len(request.output_token_ids) if prefill_end_index >= len(input_ids) else 0
)
if self.enable_mm:
if self.enable_mm and not self.eb5_runner:
Copilot AI commented Mar 24, 2026

When eb5_runner=true this skips filling the multimodal attn_mask_offsets_full/decoder buffers, but _initialize_forward_meta() in this class still passes model_inputs["attn_mask_offsets"] into ForwardMeta whenever enable_mm=True (so the attention backend uses that tensor). If the offsets are not correctly initialized/updated, the attention mask can easily be wrong and the outputs incorrect. Consider also ensuring ForwardMeta.attn_mask_offsets is None in the eb5_runner case (or setting enable_mm to False), or providing equivalent offsets computation/initialization, to avoid using stale/undefined data.

Comment on lines +956 to +958
def update_decoder_start_time(self):
self.llm_engine_send_req_to_decoder_engine_timestamp = self.decode_inference_start_time
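A minimal, self-contained sketch of how the new metrics field and update method from the snippet above might fit together; the class here is a simplified stand-in for FastDeploy's RequestMetrics, not the real definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestMetrics:
    """Simplified stand-in showing the decoder-side timing fields.

    decode_inference_start_time is set when the decode side begins
    inference; update_decoder_start_time mirrors it into the
    "sent to decoder engine" timestamp, as in the PR's snippet.
    """

    decode_inference_start_time: Optional[float] = None
    llm_engine_send_req_to_decoder_engine_timestamp: Optional[float] = None

    def update_decoder_start_time(self) -> None:
        self.llm_engine_send_req_to_decoder_engine_timestamp = self.decode_inference_start_time
```

Keeping the copy inside a method (rather than assigning the field at call sites) gives one place to change if the two timestamps ever need to diverge.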

Copilot AI commented Mar 24, 2026

The PR description still mostly keeps the template content and does not describe the motivation for this change, the concrete modifications, the usage / regression commands, or (if outputs are affected) accuracy results. To help review and later maintenance, please fill in at least Motivation / Modifications / Usage (or Command) / Accuracy Tests per the template, and state the reason if there are no tests.

@codecov-commenter

codecov-commenter commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 64.28571% with 5 lines in your changes missing coverage. Please review.
⚠️ Please upload a report for BASE (develop@522d12c).

Files with missing lines Patch % Lines
fastdeploy/worker/gpu_model_runner.py 20.00% 3 Missing and 1 partial ⚠️
fastdeploy/spec_decode/mtp.py 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6995   +/-   ##
==========================================
  Coverage           ?   73.26%           
==========================================
  Files              ?      399           
  Lines              ?    56056           
  Branches           ?     8851           
==========================================
  Hits               ?    41071           
  Misses             ?    12075           
  Partials           ?     2910           
Flag Coverage Δ
GPU 73.26% <64.28%> (?)

Flags with carried-forward coverage won't be shown.

