[XPU] fix_same_req_id #8040

Open
cmcamdy wants to merge 1 commit into PaddlePaddle:develop from cmcamdy:fix_same_req_id

@cmcamdy cmcamdy commented Jun 11, 2026

Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@codecov-commenter

codecov-commenter commented Jun 11, 2026


Codecov Report

❌ Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@fab344e). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/engine/common_engine.py 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8040   +/-   ##
==========================================
  Coverage           ?   67.72%           
==========================================
  Files              ?      471           
  Lines              ?    66361           
  Branches           ?    10217           
==========================================
  Hits               ?    44946           
  Misses             ?    18546           
  Partials           ?     2869           
Flag Coverage Δ
GPU 77.79% <80.00%> (?)
XPU 6.99% <0.00%> (?)


@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-11 17:13:21

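📋 Review Summary

PR overview: Adds logic that rejects a duplicate request_id when the PD decode side preallocates resources, and preserves the error reason returned by the D side.
Change scope: fastdeploy/engine/common_engine.py, fastdeploy/engine/sched/resource_manager_v1.py
Impact tags: [Engine] [Scheduler] [PD Disaggregation]

Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | fastdeploy/engine/sched/resource_manager_v1.py:1596 | In cache-task mode, a duplicate request_id is treated as a resource shortage and retried, so P/D waits forever |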

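📝 PR convention check

The title tag is [XPU], but this diff changes the PD decode resource preallocation logic in Engine/Scheduler and does not touch the XPU-specific worker/model_runner/ops. The PR description is still the template placeholder and lacks concrete Motivation/Modifications/Usage/Accuracy Tests content. Replacing it with the complete content below is recommended.

Suggested title (can be copied directly):

  • [PD Disaggregation] Fix duplicate request id handling in decode preallocation

Suggested PR description (can be copied directly):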
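## Motivation
Fix an issue in P/D disaggregation where the Decode side, on receiving a duplicate request_id, could reuse or corrupt an existing KV cache.

## Modifications
- `fastdeploy/engine/sched/resource_manager_v1.py`: During Decode-side resource preallocation, check whether `request_id` already exists in `self.requests`; on a duplicate, set an error message and refuse the allocation.
- `fastdeploy/engine/common_engine.py`: When a resource preallocation failure is reported back to Prefill, keep the error reason already set by the Decode side instead of always overwriting it with `Not enough resources`.

## Usage or Command
N/A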

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
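
The two modifications described in the suggested PR description can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual FastDeploy code: `Request`, `ToyResourceManager`, `failure_reason`, the `need_blocks` parameter, and the block accounting are hypothetical stand-ins. Only `preallocate_resource_in_d`, `self.requests`, `error_msg`, and the two message strings appear in this PR conversation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

NOT_ENOUGH_RESOURCES = "Not enough resources"
DUPLICATE_REQUEST_ID = "Duplicate request id in decode"


@dataclass
class Request:
    """Hypothetical stand-in for a request object."""
    request_id: str
    error_msg: Optional[str] = None


@dataclass
class ToyResourceManager:
    """Hypothetical decode-side manager that only models the duplicate check."""
    free_blocks: int
    requests: Dict[str, Request] = field(default_factory=dict)

    def preallocate_resource_in_d(self, request: Request, need_blocks: int = 1) -> bool:
        # Reject an id that is already tracked so existing blocks are not reused.
        if request.request_id in self.requests:
            request.error_msg = DUPLICATE_REQUEST_ID
            return False
        # Temporary shortage: fail without setting error_msg.
        if need_blocks > self.free_blocks:
            return False
        self.free_blocks -= need_blocks
        self.requests[request.request_id] = request
        return True


def failure_reason(task: Request) -> str:
    # Keep a reason already set on the decode side; otherwise use the generic one.
    return task.error_msg or NOT_ENOUGH_RESOURCES
```

With one free block, a first `req-1` is accepted, a second `req-1` fails with the duplicate message, and a new `req-2` fails with the generic message because no blocks remain.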

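Overall evaluation

The fix can prevent the D side from reusing existing blocks for the same request_id, but permanent failure and temporary resource shortage currently share the return value False, which leaves duplicate requests stuck in cache-task mode. The failure semantics need to be split first, or, when error_msg is already set, the error should be returned and the request removed from the queue.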
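
One way to split the failure semantics mentioned in the evaluation is to return a three-valued result instead of a bare boolean. The sketch below is an assumption-laden illustration: `Prealloc`, `preallocate`, `drain`, and the queue layout are all hypothetical and do not reflect how FastDeploy's cache-task loop is actually written.

```python
from collections import deque
from enum import Enum, auto
from typing import Deque, Dict, List, Tuple


class Prealloc(Enum):
    OK = auto()
    RETRY = auto()     # temporary: not enough resources right now
    REJECTED = auto()  # permanent: for example, a duplicate request id


def preallocate(req_id: str, active: Dict[str, int], free_blocks: int) -> Prealloc:
    if req_id in active:
        return Prealloc.REJECTED
    if free_blocks < 1:
        return Prealloc.RETRY
    return Prealloc.OK


def drain(
    queue: Deque[str], active: Dict[str, int], free_blocks: int
) -> Tuple[List[str], Deque[str]]:
    """Run one pass over the waiting queue and return (errors, still_waiting)."""
    errors: List[str] = []
    waiting: Deque[str] = deque()
    while queue:
        req_id = queue.popleft()
        result = preallocate(req_id, active, free_blocks)
        if result is Prealloc.OK:
            active[req_id] = 1  # one block per request in this toy model
            free_blocks -= 1
        elif result is Prealloc.RETRY:
            waiting.append(req_id)  # stays queued for a later pass
        else:
            errors.append(req_id)  # reported back and dropped from the queue
    return errors, waiting
```

In this model a rejected request leaves the queue on the first pass, so only requests that are genuinely waiting for resources are retried.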

Comment thread fastdeploy/engine/sched/resource_manager_v1.py
@PaddlePaddle-bot

PaddlePaddle-bot commented Jun 13, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-27 04:36:54

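The CI report is generated from the following code (updated every 30 minutes):
PR commit: cb98744 | Merge base: fab344e (branch: develop)


1 Required jobs: 9/10 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 41(0) | 41 | 37 | 4 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |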

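2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: medium)

Error type: PR issue | Confidence: medium
Analyzer: ci_analyze_unittest_fastdeploy
Failed cases: no specific failed case; the unit test stage did not surface any failing case, and the failure occurred in the incremental coverage check stage.

| Case | Error summary |
| --- | --- |
| Coverage check | The diff-cover incremental coverage check returned non-zero and was mapped to exit code 9 |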

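Key logs:

[FAILURE]: Process completed with exit code 9.
.github/workflows/_unit_test_coverage.yml:254 diff-cover python_coverage_all.xml --diff-file=diff.txt --fail-under=80 --json-report diff_coverage.json || COVERAGE_EXIT_CODE=9
.github/workflows/_unit_test_coverage.yml:387-404 when COVERAGE_EXIT_CODE=9, prints the coverage details and runs exit 9
  • Root cause summary: incremental coverage did not meet the 80% threshold

The PR only modifies fastdeploy/engine/common_engine.py and fastdeploy/engine/sched/resource_manager_v1.py, adding or changing 10 lines in total. The new branch that rejects a duplicate request_id in decode and the branch that keeps an existing error_msg have no accompanying test changes in this diff. The CI job's exit code 9 corresponds to the diff-cover --fail-under=80 failure path in this workflow. The diff_coverage.json details were not obtained, so the uncovered line numbers cannot be listed precisely.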

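Suggested fixes:

  1. In tests/v1/test_resource_manager_v1.py, add a unit test of preallocate_resource_in_d for a duplicate request_id, covering that it returns False, that request.error_msg == "Duplicate request id in decode", and that it does not go on to allocate blocks.
  2. Add coverage for the branch in fastdeploy/engine/common_engine.py that keeps an existing task.error_msg when resource preallocation fails, or adjust existing engine/splitwise tests to cover that path.
  3. Rerun bash scripts/coverage_run.sh locally or in CI, then check diff-cover ... --fail-under=80 to confirm that incremental coverage of the new lines reaches 80%.

Related changes: fastdeploy/engine/common_engine.py:2128-2129, fastdeploy/engine/sched/resource_manager_v1.py:1590-1596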
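
The first suggested fix could take roughly the following shape. This is a hedged sketch only: `StubResourceManager`, its `cache_manager` attribute, and `allocate_blocks` are hypothetical stand-ins so that the example runs on its own. A real test in tests/v1/test_resource_manager_v1.py would construct the project's actual resource manager and use its real allocation path.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import MagicMock


class StubResourceManager:
    """Hypothetical stand-in for the decode-side resource manager."""

    def __init__(self, cache_manager):
        self.requests = {}
        self.cache_manager = cache_manager

    def preallocate_resource_in_d(self, request):
        # Duplicate ids are rejected before any block allocation happens.
        if request.request_id in self.requests:
            request.error_msg = "Duplicate request id in decode"
            return False
        request.block_tables = self.cache_manager.allocate_blocks(1)
        self.requests[request.request_id] = request
        return True


class TestDuplicateRequestId(unittest.TestCase):
    def test_duplicate_is_rejected_without_allocation(self):
        cache_manager = MagicMock()
        cache_manager.allocate_blocks.return_value = [0]
        manager = StubResourceManager(cache_manager)
        first = SimpleNamespace(request_id="req-1", error_msg=None)
        duplicate = SimpleNamespace(request_id="req-1", error_msg=None)

        self.assertTrue(manager.preallocate_resource_in_d(first))
        self.assertFalse(manager.preallocate_resource_in_d(duplicate))
        self.assertEqual(duplicate.error_msg, "Duplicate request id in decode")
        # Only the first request may have triggered an allocation.
        cache_manager.allocate_blocks.assert_called_once()
```

The `assert_called_once()` check is what covers the "does not go on to allocate blocks" part of the suggestion; the test can be run with `python -m unittest`.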
