[Cherry-Pick][BugFix] Seperate prometheus multiproc dir for single-server multi-dp services (#8059) #8062

Open
liyonghua0910 wants to merge 5 commits into PaddlePaddle:release/online/20260415 from liyonghua0910:release/online/20260415+20260616_fix_dp_metrics

liyonghua0910 commented Jun 17, 2026

Collaborator

Motivation

Fix metric interference when multiple data-parallel services run on one server by isolating Prometheus multiprocess files per DP rank.

Modifications

  • Track the original PROMETHEUS_MULTIPROC_DIR set during metrics initialization.
  • Add setup_dp_prometheus_dir() to create per-DP dp{i} subdirectories and switch the target environment.
  • Apply DP-specific Prometheus dirs when launching internal-adapter DP services from LLMEngine / EngineService and when multi_api_server starts per-DP API server processes.
  • Update unit tests for multi API server Prometheus dirs, Prometheus setup behavior, cache transfer setup, and graph optimization metadata.
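
The per-DP directory idea in the list above can be illustrated with a minimal sketch. This is not the PR's implementation: the helper name `sketch_dp_prometheus_dir`, its signature, and the temporary base directory are assumptions made for illustration. Only the `PROMETHEUS_MULTIPROC_DIR` variable and the `dp{i}` subdirectory naming come from the description.

```python
import os
import tempfile


def sketch_dp_prometheus_dir(base_dir, dp_id, env=None):
    """Hypothetical helper: create base_dir/dp{dp_id} and point env at it.

    If env is None, the current process environment is modified;
    otherwise only the given mapping is updated.
    """
    target_env = os.environ if env is None else env
    dp_dir = os.path.join(base_dir, f"dp{dp_id}")
    os.makedirs(dp_dir, exist_ok=True)
    target_env["PROMETHEUS_MULTIPROC_DIR"] = dp_dir
    return dp_dir


# Build one environment per DP rank, as a launcher might do before
# starting each child process (the base directory here is a temp dir).
base = tempfile.mkdtemp(prefix="prom_base_")
child_envs = []
for dp_id in range(2):
    env = dict(os.environ)  # copy, so the parent environment is untouched
    sketch_dp_prometheus_dir(base, dp_id, env)
    child_envs.append(env)
```

Each entry in `child_envs` could then be passed as the `env` argument of `subprocess.Popen`, so that every DP service writes its multiprocess metric files into its own directory.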

Usage or Command

N/A

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


codecov-commenter commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 75.67568% with 9 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/online/20260415@eb7ea99).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/engine/engine.py | 16.66% | 5 Missing ⚠️ |
| fastdeploy/engine/common_engine.py | 33.33% | 2 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@                    Coverage Diff                     @@
##             release/online/20260415    #8062   +/-   ##
==========================================================
  Coverage                           ?   71.94%           
==========================================================
  Files                              ?      389           
  Lines                              ?    54539           
  Branches                           ?     8550           
==========================================================
  Hits                               ?    39238           
  Misses                             ?    12585           
  Partials                           ?     2716           

| Flag | Coverage Δ |
|---|---|
| GPU | 71.94% <75.67%> (?) |


PaddlePaddle-bot commented Jun 18, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-26 11:44:30 UTC+08:00

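This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: c4788ec0 | Merge base: eb7ea99 (branch: release/online/20260415)

1 Required jobs: 7/7 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 20 (0) | 20 | 20 | 0 | 0 | 0 | 0 |

2 Failure details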

liyonghua0910 force-pushed the release/online/20260415+20260616_fix_dp_metrics branch from b604581 to 4e3b6e6 on June 18, 2026 09:24

liyonghua0910 force-pushed the release/online/20260415+20260616_fix_dp_metrics branch from 4e3b6e6 to a30775d on June 22, 2026 02:53
liyonghua0910 changed the title from "[BugFix] Seperate prometheus multiproc dir for single-server multi-dp…" to "[BugFix] Seperate prometheus multiproc dir for single-server multi-dp services" on Jun 22, 2026
liyonghua0910 changed the title from "[BugFix] Seperate prometheus multiproc dir for single-server multi-dp services" to "[Cherry-Pick][BugFix] Seperate prometheus multiproc dir for single-server multi-dp services (#8059)" on Jun 22, 2026

PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-23 19:34:02

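📋 Review summary

PR overview: Splits the Prometheus multiprocess directory for single-server multi-DP services and adds related test fixes.
Scope of changes: fastdeploy/metrics/, fastdeploy/engine/, fastdeploy/entrypoints/openai/, and related unit tests.
Impact tags: [Engine] [APIServer]

Issues

| Severity | File | Summary |
|---|---|---|
| - | - | No new blocking issues found. PR convention issues are reported in the section below and are not repeated here. |

Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | When setup_dp_prometheus_dir() is used in multi_api_server for Popen subprocesses, dp_id == 0 still migrates existing .db files from the base directory. | ⚠️ Still present |

📝 PR convention check

The title now includes [Cherry-Pick][BugFix] and the original PR number. The description still contains template placeholders: Motivation / Modifications / Usage or Command / Accuracy Tests are not filled in.

Suggested PR description (can be copied directly)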
## Motivation
Fix metric interference when multiple data-parallel services run on one server by isolating Prometheus multiprocess files per DP rank.

## Modifications
- Track the original `PROMETHEUS_MULTIPROC_DIR` set during metrics initialization.
- Add `setup_dp_prometheus_dir()` to create per-DP `dp{i}` subdirectories and switch the target environment.
- Apply DP-specific Prometheus dirs when launching internal-adapter DP services from `LLMEngine` / `EngineService` and when `multi_api_server` starts per-DP API server processes.
- Update unit tests for multi API server Prometheus dirs, Prometheus setup behavior, cache transfer setup, and graph optimization metadata.

## Usage or Command
N/A

## Accuracy Tests
N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

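Overall assessment

The core changes center on isolating the Prometheus multiprocess directory, and no new blocking code issues were found in this round. Previous finding F1 is still unfixed in the current diff. As a follow-up, consider separating the pure environment-derivation logic in multi_api_server from the DP0 .db migration logic, so that files are not migrated by mistake at the launcher stage.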
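
One way to read the suggested split is sketched below. It is only an illustration under stated assumptions: `derive_dp_env` and `migrate_db_files` are hypothetical names, not functions from the PR, and the file name `counter_1.db` is made up for the demo. The point is that a launcher could call the pure function to build a child environment without touching any files, while migration stays an explicit, separate step.

```python
import os
import shutil
import tempfile


def derive_dp_env(base_env, base_dir, dp_id):
    """Pure (hypothetical): return a copy of base_env with a per-DP dir.

    Creates no directories and moves no files.
    """
    env = dict(base_env)
    env["PROMETHEUS_MULTIPROC_DIR"] = os.path.join(base_dir, f"dp{dp_id}")
    return env


def migrate_db_files(base_dir, dp_dir):
    """Side-effecting (hypothetical): move *.db files from base_dir into dp_dir."""
    os.makedirs(dp_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(base_dir)):
        src = os.path.join(base_dir, name)
        if name.endswith(".db") and os.path.isfile(src):
            shutil.move(src, os.path.join(dp_dir, name))
            moved.append(name)
    return moved


# Demo: a base directory that already holds one metric file.
base = tempfile.mkdtemp(prefix="prom_base_")
open(os.path.join(base, "counter_1.db"), "w").close()

# Launcher stage: derive the environment only; the base dir is unchanged.
launcher_env = derive_dp_env(os.environ, base, 0)
files_after_derive = sorted(os.listdir(base))

# Migration happens only where it is explicitly requested.
moved = migrate_db_files(base, launcher_env["PROMETHEUS_MULTIPROC_DIR"])
```

With this separation, whether DP0 should migrate existing files becomes a decision made by the caller rather than a side effect of computing the environment.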
