[Feature] report PD info to IM #8082

Open
ChowMingSing wants to merge 1 commit into PaddlePaddle:develop from ChowMingSing:feature-im-report-v2

@ChowMingSing ChowMingSing commented Jun 26, 2026

Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-26 15:08:18

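📋 Review Summary

PR overview: Adds endpoints for IM to query FastDeploy/PD registration info, a ready health check, and the /fastdeploy/server/info reporting endpoint.
Change scope: fastdeploy/entrypoints/openai/api_server.py
Impact tags: [APIServer] [PD Disaggregation]

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:960 | dp_rank (a string) is compared with an integer, so is_master is never set to 1 |
| 🔴 Bug | fastdeploy/entrypoints/openai/api_server.py:985 | In async LLM mode llm_engine has no .engine, so the new info endpoint returns 500 |

📝 PR Convention Check

The title contains an official tag, but every section of the PR description is still a template placeholder or empty. Consider replacing it with the full description below.

Suggested title (can be copied directly):

  • [APIServer] Report PD info to IM
Suggested PR description (can be copied directly):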
## Motivation
Report FastDeploy PD disaggregation/register information to IM, including server identity, role, resource information, connected decode nodes, and readiness status.

## Modifications
- Add `/register_info` for decode node registration metadata.
- Add `/v2/health/ready` for IM readiness checks backed by existing `/health`.
- Add `/fastdeploy/server/info` to report API server/PD fields, resource ranges, master flag, and connected decode node list.
- Start a background decode-node poller that reads `D_IP_LIST`/`DECODE_PORTS` and collects `/register_info` from decode nodes.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

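Overall assessment

The direction and scope of the new endpoints are clear, but the current implementation produces wrong results for master detection and a 500 from the endpoint under async LLM deployments. Fix the two functional issues above and add endpoint-level tests before merging.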

with open(fed_member_file, 'r') as f:
    fed_member_list = f.read().strip().split(',')
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and \
        dp_rank == 0:

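🔴 Bug dp_rank has already been converted to a string above, so comparing it with the integer 0 here is always False.

When FED_MEMBER_FILE is configured, the current HOST_IP is the first entry in the member list, and the DP rank is 0, is_master still stays 0, so the IM side cannot identify the master node.

Suggested fix: keep an integer rank for the logic check, and convert it to a string only when building pod_name or writing the response.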

dp_rank = cfg.parallel_config.local_data_parallel_id
# use str(dp_rank) where pod_name is concatenated
if fed_member_list.index(os.getenv("HOST_IP", "None")) == 0 and dp_rank == 0:
    is_master = 1
cfg_dict["dp_rank"] = str(dp_rank)
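
A minimal, standalone sketch of why the original condition never fires, assuming (as the comment above describes) that dp_rank had already been passed through str() before the comparison. The helper name is_master_node and the sample member list are hypothetical and exist only for illustration; they are not taken from the PR.

```python
def is_master_node(fed_member_list, host_ip, dp_rank):
    """Return 1 if this host is first in the member list and its DP rank is 0.

    Hypothetical helper for illustration; dp_rank is expected to be an int.
    """
    return 1 if fed_member_list.index(host_ip) == 0 and dp_rank == 0 else 0


members = "10.0.0.1,10.0.0.2".split(",")  # made-up member file contents

# Buggy path: the rank was stringified first, and "0" == 0 is False in Python.
buggy = is_master_node(members, "10.0.0.1", str(0))

# Fixed path: keep the integer for logic, stringify only for output fields.
fixed = is_master_node(members, "10.0.0.1", 0)

print(buggy, fixed)  # 0 1
```

Separately from the type issue, list.index raises ValueError when the host IP is not in the list, so a membership check before the lookup may be worth considering.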

cfg_dict["is_stopping"] = "running"
cfg_dict["is_master"] = is_master
cfg_dict["container_host_ip"] = os.getenv("HOST_IP", "None")
cfg_dict["free_block_num"] = llm_engine.engine.resource_manager.available_block_num()

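🔴 Bug This accesses llm_engine.engine.resource_manager directly, which makes the new endpoint return 500 when FD_ENABLE_ASYNC_LLM=1.

In async mode, load_engine() sets the global llm_engine to an AsyncLLM. The EngineServiceClient that AsyncLLM inherits from creates the EngineService only in a child process, so the main-process object has no .engine attribute. The existing lifecycle code in this file also uses not isinstance(llm_engine, AsyncLLM) to single out the synchronous engine path.

Suggested fix: for AsyncLLM, obtain free_block_num through a separate cross-process status query or control API, or return an explicit "unavailable" value in async mode; do not read llm_engine.engine.resource_manager directly in the API server main process.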
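
The second option (returning an explicit "unavailable" value) could look roughly like the sketch below. It is not the project's implementation: get_free_block_num, the UNAVAILABLE sentinel, and the Fake* classes are all hypothetical, and getattr is used instead of an isinstance check on AsyncLLM only so the example stays self-contained.

```python
UNAVAILABLE = -1  # hypothetical sentinel meaning "not readable from this process"


def get_free_block_num(llm_engine):
    """Read the free block count without assuming llm_engine has an .engine attribute."""
    engine = getattr(llm_engine, "engine", None)
    resource_manager = getattr(engine, "resource_manager", None)
    if resource_manager is None:
        return UNAVAILABLE
    return resource_manager.available_block_num()


# Stand-in objects for illustration only; these are not FastDeploy classes.
class FakeResourceManager:
    def available_block_num(self):
        return 128


class FakeEngine:
    resource_manager = FakeResourceManager()


class FakeSyncLLM:
    engine = FakeEngine()


class FakeAsyncLLM:
    pass  # no .engine in the main process, as described in the comment above


print(get_free_block_num(FakeSyncLLM()))   # 128
print(get_free_block_num(FakeAsyncLLM()))  # -1
```

Whether a sentinel is acceptable depends on how the IM side interprets free_block_num; if it needs a real number in async mode, the cross-process query is the option to pursue.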

codecov-commenter commented Jun 26, 2026

Codecov Report

❌ Patch coverage is 8.63309% with 127 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@f4eda5a). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/entrypoints/openai/api_server.py | 8.63% | 127 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #8082   +/-   ##
==========================================
  Coverage           ?   67.39%           
==========================================
  Files              ?      475           
  Lines              ?    67048           
  Branches           ?    10335           
==========================================
  Hits               ?    45187           
  Misses             ?    18990           
  Partials           ?     2871           

| Flag | Coverage Δ |
|---|---|
| GPU | 77.37% <8.63%> (?) |
| XPU | 6.94% <0.00%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-27 12:10:37 UTC+08:00

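The CI report is generated from the following code (updated every 30 minutes):
PR commit: a931d80 | Merge base: f4eda5a (branch: develop)

1 Required tasks: 7/10 passed

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 41 (0) | 41 | 35 | 6 | 0 | 0 | 0 |

| Task | Error type | Confidence | Log |
|---|---|---|---|
| Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage | PR issue | | Job |
| Pre Commit | PR issue | | Job |
| Approval | Approval required | | Job |

2 Failure details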

🔴 Run FastDeploy Unit Tests and Coverage / run_tests_with_coverage: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| Verify Code Coverage Threshold (80%) | Diff coverage of the code added by this PR is 8%, below the 80% threshold; the step exited with code 9 |

Key logs:

GPU Patch Coverage Details:
{"src_stats":{"fastdeploy/entrypoints/openai/api_server.py":{"percent_covered":8.63309352517986,
"violation_lines":[131,132,133,134,135,136,137,138,143,144,145,146,147,148,149,...,994]}},
"total_num_lines":139,"total_num_violations":127,"total_percent_covered":8,"num_changed_lines":224}
##[error]Process completed with exit code 9.
  • Root cause summary: coverage of the new API server logic is only 8%

The PR only modifies fastdeploy/entrypoints/openai/api_server.py, adding /register_info, /v2/health/ready, /fastdeploy/server/info, and the decode node polling logic. The coverage check counted 139 added lines in this file, of which 127 are uncovered. The uncovered lines are concentrated in _fetch_decode_node_register_info, _poll_decode_nodes, launch_decode_node_poller, register_info, im_check_health, im_report, and the call that starts them in lifespan.
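
The figures in the log can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers printed above; the estimate of how many lines would need coverage assumes the 80% threshold is applied to the same 139-line count, which is an assumption about how the CI tool rounds rather than something the log states.

```python
import math

total_num_lines = 139       # "total_num_lines" in the log
total_num_violations = 127  # "total_num_violations" (uncovered lines)
threshold = 80              # percentage named by the failing step

covered = total_num_lines - total_num_violations
percent = covered / total_num_lines * 100

# Assumption: the threshold applies to the same line count, rounded up.
needed_covered = math.ceil(threshold * total_num_lines / 100)
additional = needed_covered - covered

print(covered)            # 12
print(round(percent, 2))  # 8.63
print(needed_covered)     # 112
print(additional)         # 100
```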

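Suggested fixes:

  1. Add or extend unit tests for fastdeploy/entrypoints/openai/api_server.py that cover the success branch and the llm_engine is None branch of register_info(), im_check_health(), and im_report().
  2. For _fetch_decode_node_register_info(), mock requests.get to cover the 200, non-200, and exception paths; for _poll_decode_nodes(), consider extracting the single-round polling logic or mocking time.sleep so the test does not loop forever.
  3. If part of the IM reporting logic cannot yet be covered reliably in the unit-test environment, adjust the test strategy or the coverage exclusions according to project conventions; the direct cause of the current CI failure, however, is insufficient coverage of the new code.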
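
As a rough illustration of the second suggestion, the sketch below tests a stand-in fetch function on the 200, non-200, and exception paths, and extracts one polling round so no test enters an endless loop. The functions fetch_register_info, poll_once, and poll_forever, the URLs, and the {"role": "decode"} payload are hypothetical; the real signatures in api_server.py are not shown in this thread. The HTTP call is injected as a parameter to keep the example dependency-free; if the real function calls requests.get directly, unittest.mock.patch could replace it instead.

```python
import threading
from unittest.mock import Mock


def fetch_register_info(url, http_get, timeout=3):
    """Hypothetical stand-in for _fetch_decode_node_register_info().

    Returns the decoded JSON body on HTTP 200, otherwise None.
    """
    try:
        resp = http_get(url, timeout=timeout)
    except Exception:
        return None
    if resp.status_code != 200:
        return None
    return resp.json()


def poll_once(urls, http_get):
    """One polling round, extracted so it can be tested without a loop."""
    results = {}
    for url in urls:
        info = fetch_register_info(url, http_get)
        if info is not None:
            results[url] = info
    return results


def poll_forever(urls, http_get, stop_event, interval=5.0):
    """Loop wrapper; stop_event.wait() stands in for time.sleep() so it can be stopped."""
    while not stop_event.is_set():
        poll_once(urls, http_get)
        stop_event.wait(interval)


# 200 path: a fake response object with a JSON body.
ok = Mock(status_code=200)
ok.json.return_value = {"role": "decode"}
assert fetch_register_info("http://d1/register_info", Mock(return_value=ok)) == {"role": "decode"}

# Non-200 path.
assert fetch_register_info("http://d1/register_info", Mock(return_value=Mock(status_code=503))) is None

# Exception path.
assert fetch_register_info("http://d1/register_info", Mock(side_effect=ConnectionError("down"))) is None

# Single round: the first node answers, the second returns 500.
getter = Mock(side_effect=[ok, Mock(status_code=500)])
urls = ["http://d1/register_info", "http://d2/register_info"]
assert poll_once(urls, getter) == {"http://d1/register_info": {"role": "decode"}}

# The loop exits immediately once the stop event is set.
stopped = threading.Event()
stopped.set()
poll_forever(urls, Mock(), stopped)
```

Passing a threading.Event into the loop is one possible design; mocking time.sleep, as the suggestion mentions, would work without changing the function signature.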

Related changes: fastdeploy/entrypoints/openai/api_server.py:129, fastdeploy/entrypoints/openai/api_server.py:141, fastdeploy/entrypoints/openai/api_server.py:164, fastdeploy/entrypoints/openai/api_server.py:363, fastdeploy/entrypoints/openai/api_server.py:820, fastdeploy/entrypoints/openai/api_server.py:869, fastdeploy/entrypoints/openai/api_server.py:883

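🔴 Pre Commit: PR issue (confidence: high)

Analyzer: generic analysis (fallback)

Failed cases:

| Case | Error summary |
|---|---|
| Check pre-commit | black and isort modified api_server.py, which means the committed content was not formatted by pre-commit |

Key logs:

black....................................................................Failed
- hook id: black
- files were modified by this hook
reformatted fastdeploy/entrypoints/openai/api_server.py
isort....................................................................Failed
- hook id: isort
- files were modified by this hook
Fixing .../fastdeploy/entrypoints/openai/api_server.py
  • Root cause summary: api_server.py did not pass black/isort formatting

The file that failed Pre Commit is the only file changed in this PR. The log shows that black reformatted the file and isort also adjusted it; the subsequent flake8 and ruff checks passed, so this is a PR issue caused by the formatting results not being committed.

Suggested fixes:

  1. Run pre-commit run --files fastdeploy/entrypoints/openai/api_server.py locally and commit the formatting changes produced by black/isort.
  2. Pay particular attention to the ordering of the newly added requests import and to the regions black changed, such as the indentation of the multi-line pod_name = (...) concatenation.

Related changes: fastdeploy/entrypoints/openai/api_server.py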

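🔴 Approval: approval required (confidence: high)

Analyzer: built-in analysis

Failed cases:

| Case | Error summary |
|---|---|
| Approval | This job requires manual approval; CI continues only after the approval is completed |

Key logs:

[FAILURE]: Process completed with exit code 6.
  • Root cause summary: manual approval required

This job is an approval gate, not a code execution failure. The related CI continues only after manual approval is completed.

Suggested fixes:

  1. Obtain manual approval.

Related changes: no associated code change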
