
[Cherry-Pick][Scheduler][Optimization] Only preempt decode requests and better manage reserved blocks in scheduler (#7444)#7783

Merged
Jiang-Jia-Jun merged 2 commits into
PaddlePaddle:release/2.6from
liyonghua0910:release/2.6+20260512_opt_schedule
May 13, 2026

Conversation

liyonghua0910 (Collaborator) commented May 12, 2026

Motivation

Improve scheduler preemption and KV-cache reservation strategy: (1) preempting prefill requests is costly since they must recompute, while decode requests only need to replay the last token; (2) chunked-prefill requests transitioning to decode are not accounted for in output-block reservation; (3) the fixed-block reservation strategy cannot adapt to per-request decode progress.

Modifications

Only preempt decode requests

  • Split RequestStatus.RUNNING into RUNNING_PREFILL / RUNNING_DECODE; only RUNNING_DECODE requests are preemptable. Prefill and extend-table requests are skipped during preemption.
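The candidate-selection rule above can be sketched as follows. This is an illustrative sketch, not FastDeploy's actual code: `RequestStatus` here is a stand-in enum and `select_preempt_candidate` is an assumed helper name.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RequestStatus(Enum):
    RUNNING_PREFILL = auto()
    RUNNING_DECODE = auto()

@dataclass
class Request:
    req_id: str
    status: RequestStatus

def select_preempt_candidate(running):
    """Pick a preemption victim: scan from the back of the running list
    and return the first decode-phase request. Prefill requests are never
    preempted, since they would have to recompute from scratch."""
    for req in reversed(running):
        if req.status == RequestStatus.RUNNING_DECODE:
            return req
    return None  # all requests still prefilling: nothing preemptable
```

When every running request is still in prefill, the sketch returns `None`, which corresponds to the "skip preemption" path described above.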

Reserve decode output blocks for chunked-prefill requests

  • When a running chunked-prefill request cannot satisfy the reserved-block threshold, skip preemption and break scheduling instead of forcing it. Block the waiting queue via chunk_prefill_in_running_not_satisfied flag to reserve capacity for upcoming decode phases.
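A minimal sketch of that gate, with assumed names (`reserve_need`, `free_blocks`) standing in for the real resource-manager fields:

```python
def should_block_waiting_queue(running_chunked_prefill, free_blocks, reserve_need):
    """If any running chunked-prefill request cannot meet its reserved-block
    threshold, stop admitting waiting requests (the
    chunk_prefill_in_running_not_satisfied condition) rather than preempting,
    so blocks stay available for its upcoming decode phase."""
    return any(reserve_need(req) > free_blocks for req in running_chunked_prefill)
```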

Use new_token_ratio for per-request block reservation

  • Replace the legacy fixed-block reservation with per-request estimation: reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio.
  • new_token_ratio decays each step; on preemption it is recomputed from actual decode progress: ratio = (total_decoded + 16 * block_size * num_decode) / (total_max_new_tokens + 1), capped at init_new_token_ratio.
  • Controlled by FD_USE_NEW_TOKEN_RATIO_RESERVE (default 1).
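The reservation and recompute formulas above can be sketched as plain functions. The constant 16 and the +1 denominator come from the description; the function names and everything else are illustrative, not FastDeploy's API.

```python
def reserved_tokens(remaining_tokens, clip_max_new_tokens, new_token_ratio):
    """Per-request reservation: scale the clipped remaining-token budget
    by the current new_token_ratio estimate."""
    return min(remaining_tokens, clip_max_new_tokens) * new_token_ratio

def recompute_ratio_on_preemption(total_decoded, block_size, num_decode,
                                  total_max_new_tokens, init_new_token_ratio):
    """On preemption, re-estimate new_token_ratio from actual decode
    progress, padded by 16 blocks per decode request and capped at the
    initial ratio."""
    ratio = (total_decoded + 16 * block_size * num_decode) / (total_max_new_tokens + 1)
    return min(ratio, init_new_token_ratio)
```

The cap at `init_new_token_ratio` keeps a preemption-triggered recompute from ever reserving more aggressively than the initial policy.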

Minor changes

  • Add evictable_block_num and free_blocks to scheduler metrics / resource-manager info log.

Usage or Command

The new_token_ratio policy is enabled by default. Set the flag to switch between policies:

export FD_USE_NEW_TOKEN_RATIO_RESERVE=1    # enable new_token_ratio policy (default)
export FD_USE_NEW_TOKEN_RATIO_RESERVE=0    # fall back to legacy fixed-block policy

To customize the new policy:

export FD_INIT_NEW_TOKEN_RATIO=0.7         # initial ratio
export FD_MIN_NEW_TOKEN_RATIO=0.1          # minimum ratio
export FD_NEW_TOKEN_RATIO_DECAY=0.001      # per-step decay
export FD_CLIP_MAX_NEW_TOKENS=4096         # per-request reservation cap

Accuracy Tests

N/A

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

…nage reserved blocks in scheduler (PaddlePaddle#7444)

* [Optimization] Use new_token_ratio to control reserved blocks in scheduler

* Only decode req can be preempted

* Optimize scheduler for chunk prefill

* [chore] add env var to switch reserved blocks policy

* [chore] remove useless code

* [test] fix some ci test

* [test] fix embedding serving

* [fix] fix for cache manager v1

* [fix] fix for cache manager v1

* [test] fix test_resource_manager_v1.py

* [opt] stepped scheduling after model forward in mixed mode

* Revert "[opt] stepped scheduling after model forward in mixed mode"

This reverts commit 40f774e.

* [chore] remove unused code

---------

Co-authored-by: juncaipeng <13006307475@163.com>
Co-authored-by: rainyfly <1435317881@qq.com>

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!


PaddlePaddle-bot commented May 12, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 15:38:30

The CI report is generated from the following code (updated every 30 minutes):


1 Task Overview

Currently 1 required task has failed (Approval) and another 7 required tasks are still running or pending; the PR does not yet meet the merge criteria.

Total runs (reruns) Total tasks ✅ Passed ❌ Failed ⏳ Running ⏸️ Pending Skipped
36(0) 36 25 2 4 5 0

2 Task Status Summary

2.1 Required tasks: 2/10 passed

Required tasks block merging; failures must be handled first.

Status Task Duration Root cause Fix suggestion Log Rerun
Approval 10s PR issue: fastdeploy/envs.py was modified, requires RD member approval. Ask a FastDeploy RD (jiangjiajun et al.) to approve Job -
Run Base Tests / base_tests - running - Job -
Run Four Cards Tests / run_4_cards_tests - running - Job -
xpu_4cards_case_test / run_xpu_4cards_cases - running - Job -
⏸️ Extracted partial CE model tasks to run in CI. / run_ce_cases - pending - - -
⏸️ Run FastDeploy LogProb Tests / run_tests_logprob - pending - - -
⏸️ Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage - pending - - -
⏸️ xpu_8cards_case_test / run_xpu_8cards_cases - pending - - -
The remaining 2 required tasks passed - - - - -

2.2 Optional tasks — 23/26 passed

Optional tasks do not block merging; failures are informational only.

Status Task Duration Log Rerun
Run iluvatar Tests / run_iluvatar_cases 17m20s Job -
Trigger Jenkins for PR - - -
⏸️ CI_HPU - - -
The remaining 23 optional tasks passed - - -

3 Failure Details (required only)

Approval — code approval (confidence: high)

Approval

  • Status: ❌ failed
  • Error type: code approval
  • Confidence: high
  • Root-cause summary: the PR modifies fastdeploy/envs.py, which requires approval from a FastDeploy RD member
  • Analyzer: generic analysis (fallback)

Root-cause details:
The check_approval.sh script detected that the PR modifies the protected file fastdeploy/envs.py, which requires approval from at least one FastDeploy RD core member (jiangjiajun/liuyuanle/chenjian26/wanglongzhi); the PR has not yet received that approval. Exit code 6 indicates 1 unmet approval requirement.

Key log:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle),
   rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
There are 1 approved errors.
##[error]Process completed with exit code 6.

Fix suggestion:

  1. Have any one of jiangjiajun, liuyuanle, chenjian26, or wanglongzhi approve this PR

Fix-suggestion summary: ask a FastDeploy RD member (jiangjiajun et al.) to approve this PR

Related change: the PR modifies fastdeploy/envs.py

Link: view log


codecov-commenter commented May 12, 2026

Codecov Report

❌ Patch coverage is 87.00000% with 13 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@a5191f2). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/engine/sched/resource_manager_v1.py 88.17% 0 Missing and 11 partials ⚠️
fastdeploy/output/token_processor.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7783   +/-   ##
==============================================
  Coverage               ?   71.90%           
==============================================
  Files                  ?      378           
  Lines                  ?    53987           
  Branches               ?     8443           
==============================================
  Hits                   ?    38821           
  Misses                 ?    12395           
  Partials               ?     2771           
Flag Coverage Δ
GPU 71.90% <87.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@PaddlePaddle-bot PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-05-12 15:33:33

📋 Review Summary

PR overview: optimizes the scheduler's preemption strategy and KV-cache reservation: only decode-phase requests are preempted, and a per-request new_token_ratio dynamic reservation replaces the fixed-block reservation.
Scope of changes: engine/sched/resource_manager_v1.py, engine/request.py, envs.py, output/token_processor.py, and related tests.
Impact tags: [Scheduler] [Engine] [DataProcessor]

📝 PR Convention Check

The description is structurally complete: all five required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) are filled in. The checklist item [ ] Add unit tests is unchecked and no reason is given in the PR — since test files were updated, either check it or explain why the new logic has no dedicated tests (see the P1 suggestion below).

Issues

Level File Summary
🟡 Suggestion resource_manager_v1.py:440 The three new core functions lack dedicated unit tests
❓ Question resource_manager_v1.py:453 num_running_decode is derived from token counts rather than the newly introduced status field
❓ Question resource_manager_v1.py:1637 update_metrics(verbose=True) silently changed to update_metrics()

Overall Assessment

The overall design is clear: splitting RUNNING_PREFILL/RUNNING_DECODE and introducing a per-request dynamic reservation ratio effectively improves the preemption strategy and block-reservation accuracy. The logic is sound, and the legacy policy path is preserved (fallback via FD_USE_NEW_TOKEN_RATIO_RESERVE=0). Recommend adding dedicated unit tests for the new functions before merging.


return can_schedule

def _reset_reserve_on_preemption(self):

🟡 Suggestion: This PR adds three core scheduling functions — _select_preempt_candidate, _reset_reserve_on_preemption (the new_token_ratio branch), and _get_running_request_reserve_blocks — but none of them has a dedicated unit test.

The existing tests only update the status field in pre-existing cases and do not cover the following new scenarios:

  • _select_preempt_candidate: should return None when everything in running is RUNNING_PREFILL
  • _reset_reserve_on_preemption: correctness of the ratio-estimation logic
  • _get_running_request_reserve_blocks: edge cases such as negative remaining_tokens or max_tokens=None

The "Add unit tests" item in the PR checklist is unchecked, with no reason given.

if max_tokens is None:
    max_tokens = self.config.model_config.max_model_len - req.prompt_token_ids_len
total_max_new_tokens += max_tokens
num_running_decode = sum(

❓ Question: num_running_decode uses num_total_tokens > need_prefill_tokens to decide whether a request is in the decode phase, rather than the req.status == RequestStatus.RUNNING_DECODE introduced in this PR.

The two decode-detection paths are inconsistent: _select_preempt_candidate uses the status field, while this code compares token counts. Is this intentional (e.g. to sidestep timing issues with token_processor updating status asynchronously)? If so, please add a comment explaining why.

def finish_requests(self, request_ids: Union[str, Iterable[str]]):
    llm_logger.info(f"recycle resources for requests: {request_ids}")
    self.update_metrics(verbose=True)
    self.update_metrics()

❓ Question: In finish_requests, update_metrics(verbose=True) was changed to update_metrics(), dropping the INFO-level metrics log (running/waiting/queuing status) at request completion.

This change is not mentioned in the PR description — is it an intentional reduction of log noise, or an incidental edit? Removing the verbose log reduces visibility into end-of-request state when debugging in production.

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit d38eeb8 into PaddlePaddle:release/2.6 May 13, 2026
58 of 64 checks passed