[Feature]report PD info to IM by ChowMingSing · Pull Request #8082 · PaddlePaddle/FastDeploy

ChowMingSing · 2026-06-26T07:03:20Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

Add at least a tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 15:08:18

📋 Review Summary

PR overview: Adds endpoints that let IM query FastDeploy/PD registration info, a readiness health check, and the /fastdeploy/server/info reporting endpoint
Scope of changes: fastdeploy/entrypoints/openai/api_server.py
Affected tags: [APIServer] [PD Disaggregation]

Issues

Severity	File	Summary
🔴 Bug	`fastdeploy/entrypoints/openai/api_server.py:960`	`dp_rank` is a string compared with an integer, so `is_master` is never set to 1
🔴 Bug	`fastdeploy/entrypoints/openai/api_server.py:985`	In async LLM mode `llm_engine` has no `.engine`, so the new info endpoint returns 500

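📝 PR Convention Check

The title contains an official tag, but every section of the PR description is still template placeholder text or empty. Consider replacing it with the complete description below.

Suggested title (can be copied as is):

[APIServer] Report PD info to IM

Suggested PR description (can be copied as is):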

## Motivation
Report FastDeploy PD disaggregation/register information to IM, including server identity, role, resource information, connected decode nodes, and readiness status.

## Modifications
- Add `/register_info` for decode node registration metadata.
- Add `/v2/health/ready` for IM readiness checks backed by existing `/health`.
- Add `/fastdeploy/server/info` to report API server/PD fields, resource ranges, master flag, and connected decode node list.
- Start a background decode-node poller that reads `D_IP_LIST`/`DECODE_PORTS` and collects `/register_info` from decode nodes.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

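Overall Assessment

The direction of the new endpoints and the scope of the change are clear, but the current implementation produces a wrong result in master detection and returns 500 from the new endpoint under async LLM deployments. Fix the two functional issues above and add endpoint-level tests before merging.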

PaddlePaddle-bot · 2026-06-26T07:11:14Z

+            with open(fed_member_file, 'r') as f:
+                fed_member_list = f.read().strip().split(',')
+                if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and \
+                        dp_rank == 0:


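🔴 Bug dp_rank has already been converted to a string above, so comparing it with the integer 0 here always evaluates to False.

When FED_MEMBER_FILE is configured, the current HOST_IP is the first entry in the member list, and the DP rank is 0, is_master still stays 0, so IM cannot identify the master node.

Suggested fix: keep an integer rank for the logic check, and convert it to a string only when building pod_name or writing the response.

    dp_rank = cfg.parallel_config.local_data_parallel_id
    # use str(dp_rank) where pod_name is built
    if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and dp_rank == 0:
        is_master = 1
    cfg_dict["dp_rank"] = str(dp_rank)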
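
The comparison problem described in this comment can be reproduced outside FastDeploy. The following is a minimal sketch, not the PR's code: `compute_is_master` is a hypothetical helper, and the member list and IP addresses are made-up illustration values. It keeps the rank as an integer for the check, and it compares against the first list entry instead of calling `list.index`, which raises `ValueError` when the host IP is not in the list.

```python
def compute_is_master(fed_members, host_ip, dp_rank):
    """Return 1 if this node is first in the member list and has DP rank 0.

    Hypothetical helper for illustration; not part of FastDeploy.
    """
    # Compare integers: in Python, "0" == 0 evaluates to False.
    is_first = bool(fed_members) and fed_members[0] == host_ip
    return 1 if is_first and int(dp_rank) == 0 else 0


members = "10.0.0.1,10.0.0.2".split(",")  # illustration values only

# The buggy pattern: the rank was stringified before the comparison.
buggy = members.index("10.0.0.1") == 0 and str(0) == 0

fixed = compute_is_master(members, "10.0.0.1", 0)
print(buggy, fixed)
```

With the integer rank preserved, the string form is only needed at the point where `pod_name` or the response field is produced.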

PaddlePaddle-bot · 2026-06-26T07:11:14Z

+    cfg_dict["is_stopping"] = "running"
+    cfg_dict["is_master"] = is_master
+    cfg_dict["container_host_ip"] = os.getenv("HOST_IP", "None")
+    cfg_dict["free_block_num"] = llm_engine.engine.resource_manager.available_block_num()


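🔴 Bug This accesses llm_engine.engine.resource_manager directly, which makes the new endpoint return 500 when FD_ENABLE_ASYNC_LLM=1.

In async mode, load_engine() sets the global llm_engine to an AsyncLLM. The EngineServiceClient that AsyncLLM inherits from creates the EngineService only in a child process, so the object in the main process has no .engine attribute. The existing lifecycle code in this file also uses not isinstance(llm_engine, AsyncLLM) to single out the synchronous engine path.

Suggested fix: for AsyncLLM, obtain free_block_num through a separate cross-process status query or control API, or return an explicit "unavailable" value in async mode. Do not read llm_engine.engine.resource_manager directly in the API server main process.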
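
The second option in this comment (returning an explicit "unavailable" value) could look roughly like the sketch below. Everything here is an assumption for illustration: `safe_free_block_num`, the `-1` sentinel, and the `Fake*` stand-in classes are hypothetical and only mimic the attribute layout the comment describes. The real fix might instead query the engine process through a control API.

```python
def safe_free_block_num(llm_engine, unavailable=-1):
    """Read the free block count if the engine exposes it in this process.

    Hypothetical helper: returns `unavailable` when the object has no
    `.engine.resource_manager`, as the review describes for async mode.
    """
    engine = getattr(llm_engine, "engine", None)
    resource_manager = getattr(engine, "resource_manager", None)
    if resource_manager is None:
        return unavailable
    return resource_manager.available_block_num()


# Stand-ins that mimic the two shapes described in the review comment.
class FakeResourceManager:
    def available_block_num(self):
        return 128


class FakeInnerEngine:
    resource_manager = FakeResourceManager()


class FakeSyncEngine:
    engine = FakeInnerEngine()


class FakeAsyncLLM:
    """Has no `.engine` attribute in the API server process."""


print(safe_free_block_num(FakeSyncEngine()))
print(safe_free_block_num(FakeAsyncLLM()))
```

A sentinel keeps the endpoint responding in both modes, but IM would then need to treat that value as "unknown" and not as zero free blocks.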

codecov-commenter · 2026-06-26T07:38:19Z

Codecov Report

❌ Patch coverage is 8.63309% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
fastdeploy/entrypoints/openai/api_server.py	8.63%	127 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #8082   +/-   ##
==========================================
  Coverage           ?   67.39%           
==========================================
  Files              ?      475           
  Lines              ?    67048           
  Branches           ?    10335           
==========================================
  Hits               ?    45187           
  Misses             ?    18990           
  Partials           ?     2871

Flag	Coverage Δ
GPU	`77.37% <8.63%> (?)`
XPU	`6.94% <0.00%> (?)`

PaddlePaddle-bot · 2026-06-27T04:11:49Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-27 12:10:37 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: a931d80 | Merge base: f4eda5a (branch: develop)

1 Required jobs: 7/10 passed

Total runs (reruns)	Total jobs	✅ Passed	❌ Failed	⏳ Running	⏸️ Pending	Skipped
41(0)	41	35	6	0	0	0

Job	Error type	Confidence	Log
`Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage`	PR issue	High	Job
`Pre Commit`	PR issue	High	Job
`Approval`	Approval required	High	Job

2 Failure details

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

Case	Error summary
`Verify Code Coverage Threshold (80%)`	Diff coverage of the code added by the PR is 8%, below the 80% threshold; the step exited with code 9

Key logs:

GPU Patch Coverage Details:
{"src_stats":{"fastdeploy/entrypoints/openai/api_server.py":{"percent_covered":8.63309352517986,
"violation_lines":[131,132,133,134,135,136,137,138,143,144,145,146,147,148,149,...,994]}},
"total_num_lines":139,"total_num_violations":127,"total_percent_covered":8,"num_changed_lines":224}
##[error]Process completed with exit code 9.

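Root cause summary: coverage of the new API server logic is only 8%

The PR modifies only fastdeploy/entrypoints/openai/api_server.py, adding /register_info, /v2/health/ready, /fastdeploy/server/info, and the decode-node polling logic. The coverage check counts 139 lines of new code in this file, 127 of which are uncovered. The uncovered lines are concentrated in _fetch_decode_node_register_info, _poll_decode_nodes, launch_decode_node_poller, register_info, im_check_health, im_report, and the call that starts the poller in lifespan.

Suggested fixes:

Add or extend unit tests for fastdeploy/entrypoints/openai/api_server.py that cover the success branch and the llm_engine is None branch of register_info(), im_check_health(), and im_report().
For _fetch_decode_node_register_info(), mock requests.get and cover the 200, non-200, and exception paths. For _poll_decode_nodes(), extract the single-round polling logic or mock time.sleep so that the test does not loop forever.
If part of the IM reporting logic cannot be covered reliably in the unit test environment for now, adjust the test strategy or the coverage exclusions according to project conventions. The direct cause of the current CI failure, however, is insufficient coverage of the new code.

Related changes: fastdeploy/entrypoints/openai/api_server.py:129, fastdeploy/entrypoints/openai/api_server.py:141, fastdeploy/entrypoints/openai/api_server.py:164, fastdeploy/entrypoints/openai/api_server.py:363, fastdeploy/entrypoints/openai/api_server.py:820, fastdeploy/entrypoints/openai/api_server.py:869, fastdeploy/entrypoints/openai/api_server.py:883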
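
One way to make the fetch and polling logic testable along the lines suggested above is sketched here. This is not the PR's implementation: `fetch_register_info`, `poll_once`, and `poll_forever` are hypothetical names, the URLs are placeholders, and the HTTP call is injected as `http_get` (in production this could be `requests.get`) so the test needs only the standard library. The sketch covers the 200, non-200, and exception paths and isolates a single polling round.

```python
import threading
from unittest import mock


def fetch_register_info(url, http_get, timeout=2.0):
    """Return the decoded JSON body on HTTP 200, otherwise None."""
    try:
        resp = http_get(url, timeout=timeout)
    except Exception:
        return None  # network errors are treated as "node not reachable"
    if resp.status_code != 200:
        return None
    return resp.json()


def poll_once(urls, http_get):
    """Run a single polling round; this is the unit that tests call."""
    results = {}
    for url in urls:
        info = fetch_register_info(url, http_get)
        if info is not None:
            results[url] = info
    return results


def poll_forever(urls, http_get, stop_event, interval=5.0, on_result=print):
    """Thin loop around poll_once; stop_event lets callers end it cleanly."""
    while not stop_event.is_set():
        on_result(poll_once(urls, http_get))
        stop_event.wait(interval)


# Test double: first call returns 200, second 503, third raises.
ok = mock.Mock(status_code=200)
ok.json.return_value = {"role": "decode"}
bad = mock.Mock(status_code=503)
http_get = mock.Mock(side_effect=[ok, bad, ConnectionError("down")])

urls = [
    "http://d0/register_info",  # placeholder hosts
    "http://d1/register_info",
    "http://d2/register_info",
]
result = poll_once(urls, http_get)
print(result)
```

Because the loop body lives in `poll_once`, a test never has to enter the infinite loop, and `threading.Event.wait` replaces `time.sleep` so a running poller can be stopped promptly.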

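🔴 Pre Commit: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

Case	Error summary
`Check pre-commit`	black and isort modified `api_server.py`, which means the committed content was not formatted with pre-commit

Key logs: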

black....................................................................Failed
- hook id: black
- files were modified by this hook
reformatted fastdeploy/entrypoints/openai/api_server.py
isort....................................................................Failed
- hook id: isort
- files were modified by this hook
Fixing .../fastdeploy/entrypoints/openai/api_server.py

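Root cause summary: api_server.py was not formatted with black/isort

The file that fails Pre Commit is the only file changed by this PR. The log shows that black reformatted the file and isort also adjusted it; the subsequent flake8 and ruff checks passed, so this is a PR issue caused by the formatting results not being committed.

Suggested fixes:

Run pre-commit run --files fastdeploy/entrypoints/openai/api_server.py locally and commit the formatting changes produced by black/isort.
Pay particular attention to the ordering of the newly added requests import and to the regions black changed, such as the indentation of the multi-line pod_name = (...) concatenation.

Related changes: fastdeploy/entrypoints/openai/api_server.py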

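🔴 Approval: approval required (confidence: high)

Analyzer: built-in analysis

Failed cases:

Case	Error summary
`Approval`	This job requires manual approval; CI continues only after the approval is completed

Key logs: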

[FAILURE]: Process completed with exit code 6.

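Root cause summary: manual approval required

This job is an approval gate, not a code execution failure. The related CI continues only after manual approval is completed.

Suggested fixes:

Obtain manual approval.

Related changes: no associated code changes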

[Feature]report PD info to IM

a931d80

ChowMingSing had a problem deploying to Metax_ci June 26, 2026 07:03 — with GitHub Actions Failure

PaddlePaddle-bot suggested changes Jun 26, 2026

ChowMingSing wants to merge 1 commit into PaddlePaddle:develop from ChowMingSing:feature-im-report-v2
