[Cherry-Pick] [Refactor] Organize token processor metrics/traces code by liyonghua0910 · Pull Request #7795 · PaddlePaddle/FastDeploy

liyonghua0910 · 2026-05-12T10:04:01Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

- [ ] Add at least a tag in the PR title.
  - Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run pre-commit before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-05-12T10:04:08Z

Thanks for your contribution!

PaddlePaddle-bot · 2026-05-12T10:32:50Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 18:30:58

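This CI report was generated from the following code (refreshed every 30 minutes):

PR commit: 2c0cab2
Merge base: a5191f2 (branch: release/2.6)

1 Task overview

Current CI status: 1 required task has failed (action needed), 3 required tasks are running, and 1 required task is pending.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 35 (0) | 35 | 26 | 2 | 5 | 2 | 0 |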

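2 Task status summary

2.1 Required tasks: 5/10 passed

Required tasks block merging; failures must be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | `Approval` | 8s | PR issue: the Cherry-Pick title lacks the original PR number, and the envs.py change needs approval | Add the original PR number to the title and ask the relevant RD to approve | Job | - |
| ⏳ | `run_tests_with_coverage` | - | Running | - | Job | - |
| ⏳ | `run_ce_cases` | - | Running | - | Job | - |
| ⏳ | `run_xpu_4cards_cases` | - | Running | - | Job | - |
| ⏸️ | `run_4_cards_tests` | - | Pending | - | - | - |
| ✅ | The other 5 required tasks passed | - | - | - | - | - |

2.2 Optional tasks: 21/25 passed

Optional tasks do not block merging; failures are informational only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | `Check PR Template` | 12s | Job | - |
| ⏳ | `run_iluvatar_cases` | - | Job | - |
| ⏳ | `Trigger Jenkins for PR` | - | Job | - |
| ⏸️ | `CI_HPU` | - | - | - |
| ✅ | The other 21 optional tasks passed | - | - | - |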

3 Failure details (required tasks only)

Approval: code conventions (approval check) (confidence: high)

Status: ❌ Failed
Error type: code conventions (approval check)
Confidence: high
Root cause summary: the PR approval conditions are not met; there are 2 approval errors
Analyzer: generic analysis (fallback)

Root cause details:
The approval check script scripts/check_approval.sh found 2 problems:

- fastdeploy/envs.py was modified, which requires approval from one of the FastDeploy RDs (jiangjiajun, liuyuanle, chenjian26, wanglongzhi).
- The Cherry-Pick PR title lacks the original develop PR number (e.g. #5010, "[BugFix] adjust max_tokens and min_tokens when continue to generate tokens"), and approval is required from one of the FastDeploy RDs (dangqingqing, jiangjiajun, dengkaipeng).

Key log lines:

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
1. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick] and the original develop PR number (e.g., #5010). Approval required from FastDeploy RD: qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng).
There are 2 approved errors.
##[error]Process completed with exit code 6.

Suggested fixes:

- Edit the PR title to add the original develop PR number (for example: [Cherry-Pick] [Refactor] Organize token processor metrics/traces code (#XXXX)).
- Request approval of the fastdeploy/envs.py change from any one of these RDs: @Jiang-Jia-Jun, @yuanlehome, @rainyfly, @Wanglongzhi2001.
- Request the Cherry-Pick approval from any one of these RDs: @qingqing01, @Jiang-Jia-Jun, @heavengate.

Fix summary: add the original PR number to the title and ask the relevant RD to approve.

Related change: fastdeploy/envs.py (modified by this PR, which triggers the approval requirement)

Link: view logs
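
The title rule quoted in the log can be checked locally before pushing. The following is a minimal sketch, not the logic of scripts/check_approval.sh (which is not shown in this thread): the regular expression and the function name `is_valid_cherry_pick_title` are hypothetical, and the pattern only encodes the rule as stated above ([Cherry-Pick] first, at least one more tag, the original PR number in parentheses at the end).

```python
import re

# Hypothetical pattern, written from the rule quoted in the CI log:
# "[Cherry-Pick]" at the start, one or more further tags, then free text,
# then the original develop PR number as "(#NNNN)" at the very end.
CHERRY_PICK_TITLE = re.compile(
    r"^\[Cherry-Pick\]\s*(\[[^\]]+\]\s*)+.*\(#\d+\)\s*$"
)


def is_valid_cherry_pick_title(title: str) -> bool:
    """Return True if the title follows the Cherry-Pick format described above."""
    return CHERRY_PICK_TITLE.match(title) is not None


if __name__ == "__main__":
    # Example title taken from the PR template hint.
    print(is_valid_cherry_pick_title("[Cherry-Pick][CI] Add check trigger and logic(#5191)"))
    # Current title of this PR, which lacks the original PR number.
    print(is_valid_cherry_pick_title(
        "[Cherry-Pick] [Refactor] Organize token processor metrics/traces code"
    ))
```

The check is purely syntactic; it cannot confirm that the referenced PR exists on develop or that the required RD approval has been given.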

PaddlePaddle-bot

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 18:36:03

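📋 Review summary

PR overview: reorganizes the metrics/tracing/logging code scattered through token_processor into standalone helper methods, and adds the FD_ENABLE_OBSERVABILITY switch to control observability overhead in one place.
Scope of changes: fastdeploy/output/token_processor.py, fastdeploy/entrypoints/openai/, fastdeploy/engine/request.py, fastdeploy/envs.py
Impact tags: [DataProcessor] [APIServer] [Engine]

📝 PR convention check

The following convention issues were found:

- [Refactor] is not in the official tag list and should be mapped to [Others] (semantically close but not official, so the nearest official tag applies).
- The Cherry-Pick format requires the original PR number at the end of the title (e.g. (#XXXX)); the current title lacks it.
- All sections of the PR description (Motivation, Modifications, Usage or Command, Accuracy Tests) are empty, and no Checklist item is ticked.

Suggested title (can be copied directly; replace #XXXX with the original PR number):

[Cherry-Pick][Others] Organize token processor metrics/traces code(#XXXX)

Suggested PR description (can be copied directly):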

## Motivation

Reorganize the metrics / tracing / logging code that was scattered inline across `_process_per_token`, `_process_batch_output`, and `_process_batch_output_use_zmq` in `token_processor.py` into clearly scoped helper methods (`_record_trace_*`, `_record_task_metrics_*`, `_record_prometheus_metrics_*`, `_log_request_on_completion`, etc.), and gate them behind the `FD_ENABLE_OBSERVABILITY` environment variable so that no needless overhead is incurred when observability is turned off.

## Modifications

- `fastdeploy/output/token_processor.py`: add the `_observability_enabled` attribute; split the inline logic of the former `_record_metrics`, `_record_first_token_metrics`, `_record_completion_metrics`, etc. into 10 finer-grained helper methods, all guarded by `_observability_enabled`; the ZMQ and non-ZMQ execution paths both call the new methods
- `fastdeploy/envs.py`: add the `FD_ENABLE_OBSERVABILITY` environment variable (default `"1"`, i.e. enabled)
- `fastdeploy/engine/request.py`: `record_recv_first_token` and `record_decode_recv_second_token` gain an optional `cur_time` parameter to avoid repeated `time.time()` calls
- `fastdeploy/entrypoints/openai/serving_chat.py`, `serving_completion.py`: add an `or 0` fallback to metrics fields so that arithmetic does not fail on `None` fields when observability is off

## Usage or Command

N/A

## Accuracy Tests

N/A (pure code refactoring; model inference logic is not affected)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
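
The gating pattern described under Modifications can be illustrated with a small stand-alone sketch. It is not the FastDeploy implementation: the class `MetricsRecorder` and its attribute names are hypothetical, and treating only the string "1" as enabled is an assumption about how the environment variable is parsed. Only the variable name `FD_ENABLE_OBSERVABILITY`, its default of "1", the optional `cur_time` parameter, and the `or 0` fallback come from the review.

```python
import os
import time


class MetricsRecorder:
    """Hypothetical stand-in for the observability guard described above."""

    def __init__(self):
        # Assumption: "1" means enabled; the review only states the default is "1".
        self.enabled = os.getenv("FD_ENABLE_OBSERVABILITY", "1") == "1"
        self.first_token_time = None

    def record_first_token(self, cur_time=None):
        # Skip all bookkeeping when observability is switched off.
        if not self.enabled:
            return
        # Reuse the caller's timestamp when given, to avoid a second time.time() call.
        self.first_token_time = cur_time if cur_time is not None else time.time()


def time_to_first_token(recorder, arrival_time):
    # With observability off the field stays None; `or 0` keeps the subtraction
    # from raising TypeError (the resulting value is then not meaningful).
    return (recorder.first_token_time or 0) - (arrival_time or 0)
```

A consumer of these fields still has to decide whether a zero produced by the fallback should be reported or suppressed; the sketch only shows that the arithmetic no longer raises.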

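Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | `fastdeploy/output/token_processor.py:314` | `_process_per_token` calls `_record_trace_on_completion(task)` without the `rid` argument, so tracing fails on the ZMQ path |
| 🟡 Suggestion | `fastdeploy/output/token_processor.py:1073` | `inference_start_time` has no None guard and may raise TypeError |

Overall assessment

The refactoring approach is clear and the code organization is noticeably better. However, the missing rid in the completion stage of the ZMQ path is a real bug introduced by this PR: when FD_ENABLE_OBSERVABILITY=1, the DECODE span cannot be associated with a request ID. It needs to be fixed before merging.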

PaddlePaddle-bot · 2026-05-12T10:42:23Z


-                main_process_metrics.request_token_ratio.observe(token_ratio)
-                llm_logger.info(f"{self.resource_manager.info()}")
+                llm_logger.info(self.resource_manager.info())


🔴 Bug: the call _record_trace_on_completion(task) omits the rid argument, so rid defaults to None and is passed on to tracing.trace_report_span(rid=None, ...). The trace span then cannot be associated with a request ID, and tracing fails when FD_ENABLE_OBSERVABILITY=1 and the ZMQ path is used.

_process_batch_output (the non-ZMQ path) passes rid correctly; the ZMQ path here omits it.

Suggested fix:

    rid = task_id.split("_")[0]
    self._record_trace_on_completion(task, rid)

Or add a fallback inside _record_trace_on_completion:

    if rid is None:
        rid = task.request_id.split("_")[0]
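
The second option (a fallback inside the helper) can be sketched in isolation as follows. The function name `resolve_rid` and the example request ID `"abc123_0"` are hypothetical; the only thing taken from the review is the rule that rid is the part of `task.request_id` before the first underscore.

```python
from types import SimpleNamespace


def resolve_rid(task, rid=None):
    """Return the caller-supplied rid, or derive it from task.request_id."""
    if rid is None:
        # Same derivation as in the suggested fix: text before the first "_".
        rid = task.request_id.split("_")[0]
    return rid


if __name__ == "__main__":
    # Hypothetical task object; the real task type is not shown in this thread.
    task = SimpleNamespace(request_id="abc123_0")
    print(resolve_rid(task))            # derived from request_id
    print(resolve_rid(task, "custom"))  # explicit rid wins
```

Putting the fallback in the helper protects every call site, while passing rid explicitly at the call site keeps the ZMQ and non-ZMQ paths symmetrical; the review leaves the choice to the author.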

PaddlePaddle-bot · 2026-05-12T10:42:23Z

+        elif role == "decode":
+            trace_print(LoggingEventName.DECODE_INFERENCE_END, task.request_id, getattr(task, "user", ""))
+        trace_print(LoggingEventName.POSTPROCESSING_START, task.request_id, getattr(task, "user", ""))
+        tracing.trace_report_span(


🟡 Suggestion: task.metrics.inference_start_time is used directly in int(... * 1e9) without a None check. If the field is None, this raises TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'.

The original _setup_trace_context had an explicit guard:

    t = task.metrics.inference_start_time
    ts = int(t * 1_000_000_000) if t is not None else 0

After the refactor, this guard is reproduced neither in _record_trace_on_first_token (line ~1053) nor in _record_trace_on_completion (here). Consider using the same form in both:

    t = task.metrics.inference_start_time
    start_time_ns = int(t * 1e9) if t is not None else 0
    tracing.trace_report_span(
        ...,
        start_time_ns=start_time_ns,
        ...
    )
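
Because the guard is needed in two helpers, it may be convenient to factor it out. A minimal sketch, assuming a hypothetical helper named `to_start_time_ns` (no such function is mentioned in the PR):

```python
def to_start_time_ns(start_time_s):
    """Convert a timestamp in seconds to integer nanoseconds, mapping None to 0."""
    # Mirrors the guard quoted from the original _setup_trace_context.
    return int(start_time_s * 1e9) if start_time_s is not None else 0


if __name__ == "__main__":
    print(to_start_time_ns(None))  # guarded case, no TypeError
    print(to_start_time_ns(1.5))
```

Both `_record_trace_on_first_token` and `_record_trace_on_completion` could then pass `to_start_time_ns(task.metrics.inference_start_time)` as `start_time_ns`, which keeps the None handling in a single place.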

codecov-commenter · 2026-05-12T11:37:17Z

Codecov Report

❌ Patch coverage is 77.77778% with 26 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@a5191f2). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/output/token_processor.py | 75.72% | 12 Missing and 13 partials ⚠️ |
| ...astdeploy/entrypoints/openai/serving_completion.py | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@              Coverage Diff               @@
##             release/2.6    #7795   +/-   ##
==============================================
  Coverage               ?   72.43%           
==============================================
  Files                  ?      378           
  Lines                  ?    53964           
  Branches               ?     8446           
==============================================
  Hits                   ?    39090           
  Misses                 ?    12072           
  Partials               ?     2802

| Flag | Coverage Δ |
| --- | --- |
| GPU | `72.43% <77.77%> (?)` |


liyonghua0910 added 2 commits May 12, 2026 09:49

[Refactor] Organize token processor metrics/traces code

435e5ec

[feat] add env var to disable observability

2c0cab2

liyonghua0910 had a problem deploying to Metax_ci May 12, 2026 10:04 — with GitHub Actions Failure


PaddlePaddle-bot suggested changes May 12, 2026


[Cherry-Pick] [Refactor] Organize token processor metrics/traces code #7795
liyonghua0910 wants to merge 2 commits into PaddlePaddle:release/2.6 from liyonghua0910:release/2.6+20260509_token_processor

3 participants
