[Cherry-Pick] [Refactor] Organize token processor metrics/traces code #7795
liyonghua0910 wants to merge 2 commits into
Thanks for your contribution!
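CI report generated from the following code (refreshed every 30 minutes):
1 Task overview
Current CI status: 1 Required task failed (needs attention), 3 Required tasks running, 1 Required task pending.
2 Task status summary
2.1 Required tasks: 5/10 passed
2.2 Optional tasks: 21/25 passed
3 Failure details (Required only)
Approval: code conventions (approval check) (confidence: high)
Root cause details:
Key logs:
Fix suggestion summary: add the original PR number to the title and ask the responsible RD to complete the approval.
Related changes: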
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-12 18:36:03
📋 Review summary
PR overview: Reorganizes the metrics/tracing/logging code scattered through token_processor into dedicated helper methods, and adds an FD_ENABLE_OBSERVABILITY switch that controls observability overhead in one place.
Scope of changes: fastdeploy/output/token_processor.py, fastdeploy/entrypoints/openai/, fastdeploy/engine/request.py, fastdeploy/envs.py
Impact tags: [DataProcessor] [APIServer] [Engine]
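To make the gating idea in the overview concrete, here is a minimal sketch of an environment-variable switch guarding a helper method. It is not the FastDeploy implementation: `TokenProcessorSketch`, `observability_enabled`, the `recorded` list, and the string comparison against `"1"` are all assumptions made for illustration. Only the variable name `FD_ENABLE_OBSERVABILITY`, its default of `"1"`, and the helper name `_record_trace_on_first_token` come from the review.

```python
import os
import time


def observability_enabled() -> bool:
    # Assumption: the flag is a string and only "1" means enabled.
    # The review states the default is "1" (observability on).
    return os.getenv("FD_ENABLE_OBSERVABILITY", "1") == "1"


class TokenProcessorSketch:
    """Hypothetical stand-in for the real token processor."""

    def __init__(self):
        # Read the switch once so hot-path checks are a plain attribute lookup.
        self._observability_enabled = observability_enabled()
        self.recorded = []  # stands in for a metrics/tracing backend

    def _record_trace_on_first_token(self, request_id):
        # Guard first: when observability is off, do no work at all.
        if not self._observability_enabled:
            return
        self.recorded.append(("first_token", request_id, time.time()))

    def process(self, request_id):
        self._record_trace_on_first_token(request_id)
        return f"processed {request_id}"
```

With `FD_ENABLE_OBSERVABILITY=0` set before the object is constructed, `process` still returns its result but nothing is appended to `recorded`. Reading the flag once at construction is a design choice of this sketch; the real code may evaluate it differently.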
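📝 PR convention check
The following convention issues were found:
- [Refactor] is not in the official tag list and should be mapped to [Others] (semantically close but unofficial, so use the nearest official tag).
- The Cherry-Pick format requires the original PR number at the end of the title (e.g. (#XXXX)); the current title lacks it.
- All sections of the PR description (Motivation, Modifications, Usage or Command, Accuracy Tests) are empty, and no Checklist item is ticked.
Suggested title (copy as-is, replacing #XXXX with the original PR number):
[Cherry-Pick][Others] Organize token processor metrics/traces code(#XXXX)
Suggested PR description (copy as-is):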
## Motivation
Consolidate the metrics / tracing / logging code inlined in `_process_per_token`, `_process_batch_output`, and `_process_batch_output_use_zmq` of `token_processor.py` into fine-grained helper methods (`_record_trace_*`, `_record_task_metrics_*`, `_record_prometheus_metrics_*`, `_log_request_on_completion`, etc.), and gate them behind the `FD_ENABLE_OBSERVABILITY` environment variable so that no wasted work is done when observability is disabled.
## Modifications
- `fastdeploy/output/token_processor.py`: add an `_observability_enabled` attribute; split the logic formerly inlined in `_record_metrics`, `_record_first_token_metrics`, `_record_completion_metrics`, etc. into 10 finer-grained helper methods, each guarded by `_observability_enabled`; the ZMQ and non-ZMQ execution paths both call the new methods
- `fastdeploy/envs.py`: add the `FD_ENABLE_OBSERVABILITY` environment variable (default `"1"`, i.e. enabled)
- `fastdeploy/engine/request.py`: `record_recv_first_token` and `record_decode_recv_second_token` gain an optional `cur_time` parameter to avoid repeated calls to `time.time()`
- `fastdeploy/entrypoints/openai/serving_chat.py`, `serving_completion.py`: add an `or 0` fallback to metrics fields so that arithmetic does not fail on a `None` field when observability is disabled
## Usage or Command
N/A
## Accuracy Tests
N/A (pure code refactor; model inference logic is not affected)
## Checklist
- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/output/token_processor.py:314 | _process_per_token calls _record_trace_on_completion(task) without the rid argument, so tracing breaks on the ZMQ path |
| 🟡 Suggestion | fastdeploy/output/token_processor.py:1073 | inference_start_time has no None guard and may raise TypeError |
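Overall assessment
The refactoring approach is clear and the code organization is noticeably better. However, the missing rid on the completion stage of the ZMQ path is a real bug introduced by this change: with FD_ENABLE_OBSERVABILITY=1, the DECODE span cannot be associated with a request ID. This needs to be fixed before merging.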

    main_process_metrics.request_token_ratio.observe(token_ratio)
    llm_logger.info(f"{self.resource_manager.info()}")
    llm_logger.info(self.resource_manager.info())

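🔴 Bug: The call _record_trace_on_completion(task) omits the rid argument, so rid defaults to None and is passed on as tracing.trace_report_span(rid=None, ...). The trace span then cannot be associated with a request ID, and tracing is broken whenever FD_ENABLE_OBSERVABILITY=1 and the ZMQ path is in use.
_process_batch_output (the non-ZMQ path) passes rid correctly; the ZMQ path here omits it.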
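The failure mode can be reproduced in isolation. The sketch below is not the FastDeploy code: `trace_report_span_stub`, the `reported` list, both `record_*` functions, and the `abc123_0` request ID are hypothetical stand-ins. Only the `rid=None` default and the `request_id.split("_")[0]` derivation are taken from this review.

```python
from types import SimpleNamespace

reported = []  # hypothetical stand-in for the tracing backend


def trace_report_span_stub(rid, name):
    # Hypothetical stub: the real tracing.trace_report_span takes more arguments.
    reported.append((rid, name))


def record_without_fallback(task, rid=None):
    # Mirrors the reported bug: an omitted rid silently stays None.
    trace_report_span_stub(rid=rid, name="DECODE")


def record_with_fallback(task, rid=None):
    # Derive rid from the request ID when the caller did not pass one.
    if rid is None:
        rid = task.request_id.split("_")[0]
    trace_report_span_stub(rid=rid, name="DECODE")


task = SimpleNamespace(request_id="abc123_0")  # assumed "<rid>_<index>" format
record_without_fallback(task)  # span is recorded with rid=None
record_with_fallback(task)     # span is recorded with rid="abc123"
print(reported)
```

Because the default argument hides the omission, nothing fails at the call site; the only symptom is a span with no request ID, which is why the problem is easy to miss in review.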
Suggested fix:

    rid = task_id.split("_")[0]
    self._record_trace_on_completion(task, rid)

Or add a fallback inside _record_trace_on_completion:

    if rid is None:
        rid = task.request_id.split("_")[0]

    elif role == "decode":
        trace_print(LoggingEventName.DECODE_INFERENCE_END, task.request_id, getattr(task, "user", ""))
        trace_print(LoggingEventName.POSTPROCESSING_START, task.request_id, getattr(task, "user", ""))
        tracing.trace_report_span(

🟡 Suggestion: task.metrics.inference_start_time is used directly in int(... * 1e9) without a None check. If the field is None, this raises TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'.
The original _setup_trace_context had an explicit guard:

    t = task.metrics.inference_start_time
    ts = int(t * 1_000_000_000) if t is not None else 0

After the refactor, that guard is missing from both _record_trace_on_first_token (line ~1053) and _record_trace_on_completion (here). Suggest changing both to:

    t = task.metrics.inference_start_time
    start_time_ns = int(t * 1e9) if t is not None else 0
    tracing.trace_report_span(
        ...,
        start_time_ns=start_time_ns,
        ...
    )
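
One way to keep the guard from being dropped again is to put the conversion in a single helper that both call sites share. This is only a sketch: `seconds_to_ns` and `unguarded_seconds_to_ns` are hypothetical names that do not appear in the PR, and returning 0 for a missing timestamp simply follows the guard quoted above.

```python
from typing import Optional


def seconds_to_ns(t: Optional[float]) -> int:
    """Convert a timestamp in seconds to integer nanoseconds; None maps to 0."""
    return int(t * 1e9) if t is not None else 0


def unguarded_seconds_to_ns(t):
    # Same conversion without the guard, as in the refactored code.
    return int(t * 1e9)


print(seconds_to_ns(1.5))   # guarded path with a real timestamp
print(seconds_to_ns(None))  # guarded path with a missing timestamp

try:
    unguarded_seconds_to_ns(None)
except TypeError as exc:
    print(f"unguarded call failed: {exc}")
```

The unguarded call fails with the same TypeError quoted in the comment above, while the guarded helper returns 0 for a missing timestamp.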
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7795 +/- ##
==============================================
Coverage ? 72.43%
==============================================
Files ? 378
Lines ? 53964
Branches ? 8446
==============================================
Hits ? 39090
Misses ? 12072
Partials ? 2802
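Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run pre-commit before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.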