[Cherry-Pick][Scheduler][Optimization] Only preempt decode requests and better manage reserved blocks in scheduler (#7444) #7783
Conversation
…nage reserved blocks in scheduler (PaddlePaddle#7444)

* [Optimization] Use new_token_ratio to control reserved blocks in scheduler
* Only decode req can be preempted
* Optimize scheduler for chunk prefill
* [chore] add env var to switch reserved blocks policy
* [chore] remove useless code
* [test] fix some ci test
* [test] fix embedding serving
* [fix] fix for cache manager v1
* [fix] fix for cache manager v1
* [test] fix test_resource_manager_v1.py
* [opt] stepped scheduling after model forward in mixed mode
* Revert "[opt] stepped scheduling after model forward in mixed mode"

  This reverts commit 40f774e.
* [chore] remove unused code

---------

Co-authored-by: juncaipeng <13006307475@163.com>
Co-authored-by: rainyfly <1435317881@qq.com>
Thanks for your contribution!
CI report generated from the code below (refreshed every 30 minutes):

**1. Task overview** — 1 Required task (Approval) is currently failing, and another 7 Required tasks are still running or pending; the PR does not yet meet the merge criteria.

**2. Task status summary**
- 2.1 Required tasks: 2/10 passed
- 2.2 Optional tasks: 23/26 passed

**3. Failure details (Required only)**
Approval — code approval (confidence: high)
Suggested fix: a FastDeploy RD member (jiangjiajun et al.) needs to approve this PR. Link: view logs
**Codecov Report**

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           release/2.6    #7783   +/- ##
==============================================
  Coverage             ?   71.90%
==============================================
  Files                ?      378
  Lines                ?    53987
  Branches             ?     8443
==============================================
  Hits                 ?    38821
  Misses               ?    12395
  Partials             ?     2771
```
Flags with carried forward coverage won't be shown.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-12 15:33:33
📋 Review Summary

PR overview: optimizes the scheduler's preemption policy and KV-cache reservation mechanism: only decode-phase requests are preempted, and a per-request dynamic `new_token_ratio` reservation replaces the fixed-block reservation.
Scope: engine/sched/resource_manager_v1.py, engine/request.py, envs.py, output/token_processor.py, and related tests.
Impact tags: [Scheduler] [Engine] [DataProcessor]

📝 PR Convention Check

The description structure is complete; all five required sections (Motivation / Modifications / Usage or Command / Accuracy Tests / Checklist) are filled in. However, the Checklist item `[ ] Add unit tests` is unchecked and the PR does not explain why — since the test files were in fact updated, either check the item or state the reason the new logic has no dedicated tests (see the P1 suggestion below).
Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | resource_manager_v1.py:440 | The three new core functions lack dedicated unit tests |
| ❓ Question | resource_manager_v1.py:453 | `num_running_decode` is detected via token counts rather than the newly introduced status field |
| ❓ Question | resource_manager_v1.py:1637 | `update_metrics(verbose=True)` silently changed to `update_metrics()` |
Overall Assessment

The overall design is clear: splitting `RUNNING_PREFILL`/`RUNNING_DECODE` and introducing a per-request dynamic reservation ratio effectively improve the preemption policy and block-reservation precision. The logic is sound, and the old policy path remains available (`FD_USE_NEW_TOKEN_RATIO_RESERVE=0` falls back to it). Recommend adding dedicated unit tests for the new functions before merging.
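The per-request dynamic ratio the review refers to can be sketched as below. The preemption-time formula comes from the PR description; the initial value and the multiplicative decay schedule are illustrative assumptions, not FastDeploy's actual constants.

```python
# Sketch of per-request new_token_ratio mechanics (illustrative only).
class ReserveRatio:
    def __init__(self, init_new_token_ratio: float = 0.7, decay: float = 0.95):
        self.init_new_token_ratio = init_new_token_ratio
        self.decay = decay  # assumed multiplicative per-step decay
        self.ratio = init_new_token_ratio

    def step(self) -> None:
        # new_token_ratio decays each scheduling step.
        self.ratio *= self.decay

    def on_preemption(self, total_decoded: int, block_size: int,
                      num_decode: int, total_max_new_tokens: int) -> None:
        # Recompute from actual decode progress (formula per the PR):
        # ratio = (total_decoded + 16 * block_size * num_decode)
        #         / (total_max_new_tokens + 1), capped at init_new_token_ratio.
        est = (total_decoded + 16 * block_size * num_decode) / (total_max_new_tokens + 1)
        self.ratio = min(est, self.init_new_token_ratio)
```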
Review comment on `resource_manager_v1.py` (diff context):

```python
        return can_schedule

    def _reset_reserve_on_preemption(self):
```
🟡 Suggestion: This PR adds three core scheduling functions — `_select_preempt_candidate`, `_reset_reserve_on_preemption` (the new_token_ratio branch), and `_get_running_request_reserve_blocks` — but none of them has a dedicated unit test.
The existing tests only update the `status` field in pre-existing cases and do not cover these new scenarios:

- `_select_preempt_candidate`: should return `None` when every running request is `RUNNING_PREFILL`
- `_reset_reserve_on_preemption`: correctness of the ratio estimation logic
- `_get_running_request_reserve_blocks`: edge cases such as negative `remaining_tokens` and `max_tokens=None`

The "Add unit tests" item in the PR checklist is unchecked, with no reason given.
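A test-target sketch of the reserve-blocks computation, combining the reservation formula from the PR description with the `max_tokens=None` fallback shown in the diff context below. The function name and argument list are hypothetical stand-ins; the real `_get_running_request_reserve_blocks` in resource_manager_v1.py may differ in detail.

```python
import math

def reserve_blocks(num_generated: int, max_tokens, prompt_len: int,
                   max_model_len: int, clip_max_new_tokens: int,
                   new_token_ratio: float, block_size: int) -> int:
    # Edge case: max_tokens=None falls back to the model-length budget.
    if max_tokens is None:
        max_tokens = max_model_len - prompt_len
    # Edge case: clamp negative remaining_tokens to zero.
    remaining_tokens = max(max_tokens - num_generated, 0)
    # Formula from the PR description:
    # reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio
    reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio
    return math.ceil(reserved_tokens / block_size)
```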
```python
if max_tokens is None:
    max_tokens = self.config.model_config.max_model_len - req.prompt_token_ids_len
total_max_new_tokens += max_tokens
num_running_decode = sum(
```
❓ Question: `num_running_decode` uses `num_total_tokens > need_prefill_tokens` to decide whether a request is in the decode phase, rather than the `req.status == RequestStatus.RUNNING_DECODE` check this PR introduces.
The two decode checks are inconsistent: `_select_preempt_candidate` uses the status field, while this site compares token counts. Is that deliberate (e.g. to avoid a timing issue with `token_processor` updating `status` asynchronously)? If so, please add a comment explaining the reason.
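To make the inconsistency concrete, here is a minimal illustration of the two checks with simplified stand-ins for the real `Request` class (field names follow the diff above; the lagging-status scenario is a hypothetical example of the timing concern):

```python
from dataclasses import dataclass
from enum import Enum, auto

class RequestStatus(Enum):
    RUNNING_PREFILL = auto()
    RUNNING_DECODE = auto()

@dataclass
class Req:
    num_total_tokens: int
    need_prefill_tokens: int
    status: RequestStatus

def is_decode_by_tokens(req: Req) -> bool:
    # Check used when computing num_running_decode.
    return req.num_total_tokens > req.need_prefill_tokens

def is_decode_by_status(req: Req) -> bool:
    # Check used by _select_preempt_candidate.
    return req.status == RequestStatus.RUNNING_DECODE

# The two checks can disagree while the first decoded token has been
# appended but the status field has not been flipped yet.
lagging = Req(num_total_tokens=11, need_prefill_tokens=10,
              status=RequestStatus.RUNNING_PREFILL)
```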
```diff
 def finish_requests(self, request_ids: Union[str, Iterable[str]]):
     llm_logger.info(f"recycle resources for requests: {request_ids}")
-    self.update_metrics(verbose=True)
+    self.update_metrics()
```
❓ Question: in `finish_requests`, `update_metrics(verbose=True)` was changed to `update_metrics()`, dropping the INFO-level metrics log (running/waiting/queuing state) emitted when a request finishes.
This change is not mentioned in the PR description. Is it an intentional reduction of log noise, or an incidental edit? Removing the verbose log reduces visibility into scheduler state at request completion when debugging in production.
Merged d38eeb8 into PaddlePaddle:release/2.6
Motivation
Improve scheduler preemption and KV-cache reservation strategy: (1) preempting prefill requests is costly since they must recompute, while decode requests only need to replay the last token; (2) chunked-prefill requests transitioning to decode are not accounted for in output-block reservation; (3) the fixed-block reservation strategy cannot adapt to per-request decode progress.
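A minimal sketch of the decode-only preemption this motivates. The `Req`/`RequestStatus` shapes are illustrative stand-ins, and the back-to-front scan order is an assumption rather than FastDeploy's actual `_select_preempt_candidate` implementation; returning `None` when all running requests are still prefilling follows the PR's stated behavior.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class RequestStatus(Enum):
    RUNNING_PREFILL = auto()
    RUNNING_DECODE = auto()

@dataclass
class Req:
    request_id: str
    status: RequestStatus

def select_preempt_candidate(running: List[Req]) -> Optional[Req]:
    # Scan newest-first and skip prefill requests entirely; preempting a
    # prefill request would force a full recompute, while a decode request
    # only needs to replay its last token.
    for req in reversed(running):
        if req.status == RequestStatus.RUNNING_DECODE:
            return req
    # Every running request is still prefilling: nothing is preemptable.
    return None
```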
Modifications
**Only preempt decode requests**

- Split `RequestStatus.RUNNING` into `RUNNING_PREFILL` / `RUNNING_DECODE`; only `RUNNING_DECODE` requests are preemptable. Prefill and extend-table requests are skipped during preemption.

**Reserve decode output blocks for chunked-prefill requests**

- Use the `chunk_prefill_in_running_not_satisfied` flag to reserve capacity for upcoming decode phases.

**Use new_token_ratio for per-request block reservation**

- `reserved_tokens = min(remaining_tokens, clip_max_new_tokens) * new_token_ratio`.
- `new_token_ratio` decays each step; on preemption it is recomputed from actual decode progress: `ratio = (total_decoded + 16 * block_size * num_decode) / (total_max_new_tokens + 1)`, capped at `init_new_token_ratio`.
- Switchable via `FD_USE_NEW_TOKEN_RATIO_RESERVE` (default `1`).

**Minor changes**

- Add `evictable_block_num` and `free_blocks` to scheduler metrics / resource-manager info log.

Usage or Command
New token ratio policy is enabled by default. To switch back to fixed-block policy:
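The PR's own command snippet is not reproduced here; based on the switch named in Modifications (`FD_USE_NEW_TOKEN_RATIO_RESERVE`, default `1`), disabling the new policy presumably looks like:

```shell
# Fall back to the fixed-block reservation policy
# (FD_USE_NEW_TOKEN_RATIO_RESERVE defaults to 1 per the PR description).
export FD_USE_NEW_TOKEN_RATIO_RESERVE=0
```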
To customize our new policy:
Accuracy Tests
N/A
Checklist
- Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch PR, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.