[Cherry-Pick][Scheduler][Optimization] Only preempt decode requests and better manage reserved blocks in scheduler (#7444) #7783
Conversation
…nage reserved blocks in scheduler (PaddlePaddle#7444)

* [Optimization] Use new_token_ratio to control reserved blocks in scheduler
* Only decode req can be preempted
* Optimize scheduler for chunk prefill
* [chore] add env var to switch reserved blocks policy
* [chore] remove useless code
* [test] fix some ci test
* [test] fix embedding serving
* [fix] fix for cache manager v1
* [fix] fix for cache manager v1
* [test] fix test_resource_manager_v1.py
* [opt] stepped scheduling after model forward in mixed mode
* Revert "[opt] stepped scheduling after model forward in mixed mode"

  This reverts commit 40f774e.
* [chore] remove unused code

---------

Co-authored-by: juncaipeng <13006307475@163.com>
Co-authored-by: rainyfly <1435317881@qq.com>
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):

**1. Task overview** — 1 Required task (Approval) is currently failing, and another 7 Required tasks are still running or pending; the PR does not yet meet the merge criteria.

**2. Task status summary**
- 2.1 Required tasks: 2/10 passed
- 2.2 Optional tasks: 23/26 passed

**3. Failure details (Required only)**
Approval — code approval (confidence: high)
Suggested fix: a FastDeploy RD member (jiangjiajun et al.) needs to approve this PR. Link: view logs
**Codecov Report**

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           release/2.6    #7783   +/- ##
==============================================
  Coverage             ?   71.90%
==============================================
  Files                ?      378
  Lines                ?    53987
  Branches             ?     8443
==============================================
  Hits                 ?    38821
  Misses               ?    12395
  Partials             ?     2771
```
Flags with carried forward coverage won't be shown.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-12 15:33:33
📋 Review Summary

PR overview: optimizes the scheduler's preemption policy and KV-cache reservation mechanism: only decode-phase requests are preempted, and a per-request dynamic `new_token_ratio` reservation replaces the fixed-block reservation.
Scope: engine/sched/resource_manager_v1.py, engine/request.py, envs.py, output/token_processor.py, and related tests.
Impact tags: [Scheduler] [Engine] [DataProcessor]

📝 PR Convention Check

The description structure is complete; all five required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) are filled in. However, the Checklist item `[ ] Add unit tests` is unchecked and the PR does not explain why — since the test files were in fact updated, either check the item or state the reason the new logic has no dedicated tests (see the P1 suggestion below).
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | resource_manager_v1.py:440 | The three new core functions lack dedicated unit tests |
| ❓ Question | resource_manager_v1.py:453 | `num_running_decode` is detected via token counts rather than the newly introduced status field |
| ❓ Question | resource_manager_v1.py:1637 | `update_metrics(verbose=True)` silently changed to `update_metrics()` |
Overall Assessment

The overall design is clear: splitting `RUNNING_PREFILL`/`RUNNING_DECODE` and introducing a per-request dynamic reservation ratio effectively improve the preemption policy and block-reservation precision. The logic is sound, and the old policy path remains available (`FD_USE_NEW_TOKEN_RATIO_RESERVE=0` falls back to it). Recommend adding dedicated unit tests for the new functions before merging.
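The per-request dynamic ratio the review refers to can be sketched as below. The preemption-time formula comes from the PR description; the initial value and the multiplicative decay schedule are illustrative assumptions, not FastDeploy's actual constants.

```python
# Sketch of per-request new_token_ratio mechanics (illustrative only).
class ReserveRatio:
    def __init__(self, init_new_token_ratio: float = 0.7, decay: float = 0.95):
        self.init_new_token_ratio = init_new_token_ratio
        self.decay = decay  # assumed multiplicative per-step decay
        self.ratio = init_new_token_ratio

    def step(self) -> None:
        # new_token_ratio decays each scheduling step.
        self.ratio *= self.decay

    def on_preemption(self, total_decoded: int, block_size: int,
                      num_decode: int, total_max_new_tokens: int) -> None:
        # Recompute from actual decode progress (formula per the PR):
        # ratio = (total_decoded + 16 * block_size * num_decode)
        #         / (total_max_new_tokens + 1), capped at init_new_token_ratio.
        est = (total_decoded + 16 * block_size * num_decode) / (total_max_new_tokens + 1)
        self.ratio = min(est, self.init_new_token_ratio)
```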
Review comment on `resource_manager_v1.py` (diff context):

```python
        return can_schedule

    def _reset_reserve_on_preemption(self):
```
🟡 Suggestion: This PR adds three core scheduling functions — `_select_preempt_candidate`, `_reset_reserve_on_preemption` (the new_token_ratio branch), and `_get_running_request_reserve_blocks` — but none of them has a dedicated unit test.
The existing tests only update the `status` field in pre-existing cases and do not cover these new scenarios:

- `_select_preempt_candidate`: should return `None` when every running request is `RUNNING_PREFILL`
- `_reset_reserve_on_preemption`: correctness of the ratio estimation logic
- `_get_running_request_reserve_blocks`: edge cases such as negative `remaining_tokens` and `max_tokens=None`

The "Add unit tests" item in the PR checklist is unchecked, with no reason given.
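A test-target sketch of the reserve-blocks computation, combining the reservation formula from the PR description with the `max_tokens=None` fallback shown in the diff context below. The function name and argument list are hypothetical stand-ins; the real `_get_running_request_reserve_blocks` in resource_manager_v1.py may differ in detail.

```python
import math

def reserve_blocks(num_generated: int, max_tokens, prompt_len: int,
                   max_model_len: int, clip_max_new_tokens: int,
                   new_token_ratio: float, block_size: int) -> int:
    # Edge case: max_tokens=None falls back to the model-length budget.
    if max_tokens is None:
        max_tokens = max_model_len - prompt_len
    # Edge case: clamp negative remaining_tokens to zero.
    remaining_tokens = max(max_tokens - num_generated, 0)
    # Formula from the PR description:
    # reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio
    reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio
    return math.ceil(reserved_tokens / block_size)
```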
```python
if max_tokens is None:
    max_tokens = self.config.model_config.max_model_len - req.prompt_token_ids_len
total_max_new_tokens += max_tokens
num_running_decode = sum(
```
❓ Question: `num_running_decode` uses `num_total_tokens > need_prefill_tokens` to decide whether a request is in the decode phase, rather than the `req.status == RequestStatus.RUNNING_DECODE` check this PR introduces.
The two decode checks are inconsistent: `_select_preempt_candidate` uses the status field, while this site compares token counts. Is that deliberate (e.g. to avoid a timing issue with `token_processor` updating `status` asynchronously)? If so, please add a comment explaining the reason.
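To make the inconsistency concrete, here is a minimal illustration of the two checks with simplified stand-ins for the real `Request` class (field names follow the diff above; the lagging-status scenario is a hypothetical example of the timing concern):

```python
from dataclasses import dataclass
from enum import Enum, auto

class RequestStatus(Enum):
    RUNNING_PREFILL = auto()
    RUNNING_DECODE = auto()

@dataclass
class Req:
    num_total_tokens: int
    need_prefill_tokens: int
    status: RequestStatus

def is_decode_by_tokens(req: Req) -> bool:
    # Check used when computing num_running_decode.
    return req.num_total_tokens > req.need_prefill_tokens

def is_decode_by_status(req: Req) -> bool:
    # Check used by _select_preempt_candidate.
    return req.status == RequestStatus.RUNNING_DECODE

# The two checks can disagree while the first decoded token has been
# appended but the status field has not been flipped yet.
lagging = Req(num_total_tokens=11, need_prefill_tokens=10,
              status=RequestStatus.RUNNING_PREFILL)
```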
```diff
 def finish_requests(self, request_ids: Union[str, Iterable[str]]):
     llm_logger.info(f"recycle resources for requests: {request_ids}")
-    self.update_metrics(verbose=True)
+    self.update_metrics()
```
❓ Question: in `finish_requests`, `update_metrics(verbose=True)` was changed to `update_metrics()`, dropping the INFO-level metrics log (running/waiting/queuing state) emitted when a request finishes.
This change is not mentioned in the PR description. Is it an intentional reduction of log noise, or an incidental edit? Removing the verbose log reduces visibility into scheduler state at request completion when debugging in production.
Merged d38eeb8 into PaddlePaddle:release/2.6
Motivation
Improve scheduler preemption and KV-cache reservation strategy: (1) preempting prefill requests is costly since they must recompute, while decode requests only need to replay the last token; (2) chunked-prefill requests transitioning to decode are not accounted for in output-block reservation; (3) the fixed-block reservation strategy cannot adapt to per-request decode progress.
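A minimal sketch of the decode-only preemption this motivates. The `Req`/`RequestStatus` shapes are illustrative stand-ins, and the back-to-front scan order is an assumption rather than FastDeploy's actual `_select_preempt_candidate` implementation; returning `None` when all running requests are still prefilling follows the PR's stated behavior.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class RequestStatus(Enum):
    RUNNING_PREFILL = auto()
    RUNNING_DECODE = auto()

@dataclass
class Req:
    request_id: str
    status: RequestStatus

def select_preempt_candidate(running: List[Req]) -> Optional[Req]:
    # Scan newest-first and skip prefill requests entirely; preempting a
    # prefill request would force a full recompute, while a decode request
    # only needs to replay its last token.
    for req in reversed(running):
        if req.status == RequestStatus.RUNNING_DECODE:
            return req
    # Every running request is still prefilling: nothing is preemptable.
    return None
```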
Modifications
**Only preempt decode requests**

- Split `RequestStatus.RUNNING` into `RUNNING_PREFILL` / `RUNNING_DECODE`; only `RUNNING_DECODE` requests are preemptable. Prefill and extend-table requests are skipped during preemption.

**Reserve decode output blocks for chunked-prefill requests**

- Use the `chunk_prefill_in_running_not_satisfied` flag to reserve capacity for upcoming decode phases.

**Use new_token_ratio for per-request block reservation**

- `reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio`.
- `new_token_ratio` decays each step; on preemption it is recomputed from actual decode progress: `ratio = (total_decoded + 16 * block_size * num_decode) / (total_max_new_tokens + 1)`, capped at `init_new_token_ratio`.
- Switchable via `FD_USE_NEW_TOKEN_RATIO_RESERVE` (default `1`).

**Minor changes**

- Add `evictable_block_num` and `free_blocks` to scheduler metrics / resource-manager info log.

Usage or Command
New token ratio policy is enabled by default. To switch back to fixed-block policy:
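The PR's own command snippet is not reproduced here; based on the switch named in Modifications (`FD_USE_NEW_TOKEN_RATIO_RESERVE`, default `1`), disabling the new policy presumably looks like:

```shell
# Fall back to the fixed-block reservation policy
# (FD_USE_NEW_TOKEN_RATIO_RESERVE defaults to 1 per the PR description).
export FD_USE_NEW_TOKEN_RATIO_RESERVE=0
```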
To customize our new policy:
Accuracy Tests
N/A
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch PR, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.