[Cherry-Pick] [Refactor] Organize token processor metrics/traces code #7795

Draft
liyonghua0910 wants to merge 2 commits into
PaddlePaddle:release/2.6 from
liyonghua0910:release/2.6+20260509_token_processor

@liyonghua0910
Collaborator

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot Bot commented May 12, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

@PaddlePaddle-bot

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-05-12 18:30:58

The CI report is generated from the following code (refreshed every 30 minutes):


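1 Job overview

Current CI status: 1 required job has failed (needs attention), 3 required jobs are running, and 1 required job is pending.

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
|---|---|---|---|---|---|---|
| 35 (0) | 35 | 26 | 2 | 5 | 2 | 0 |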

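2 Job status summary

2.1 Required jobs: 5/10 passed

Required jobs block merging; failures should be handled first.

| Status | Job | Duration | Root cause | Suggested fix | Log | Rerun |
|---|---|---|---|---|---|---|
| ❌ | Approval | 8s | PR issue: the Cherry-Pick title lacks the original PR number, and the envs.py change needs approval | Add the original PR number to the title and ask the relevant RDs to approve | Job | - |
| ⏳ | run_tests_with_coverage | - | Running | - | Job | - |
| ⏳ | run_ce_cases | - | Running | - | Job | - |
| ⏳ | run_xpu_4cards_cases | - | Running | - | Job | - |
| ⏸️ | run_4_cards_tests | - | Pending | - | - | - |
| ✅ | The other 5 required jobs passed | - | - | - | - | - |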

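2.2 Optional jobs: 21/25 passed

Optional jobs do not block merging; failures are for reference only.

| Status | Job | Duration | Log | Rerun |
|---|---|---|---|---|
| ❌ | Check PR Template | 12s | Job | - |
| ⏳ | run_iluvatar_cases | - | Job | - |
| ⏳ | Trigger Jenkins for PR | - | Job | - |
| ⏸️ | CI_HPU | - | - | - |
| ✅ | The other 21 optional jobs passed | - | - | - |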
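
The counts in the overview row and in the two status sections can be cross-checked with a little arithmetic. The sketch below assumes that every job that is not required is optional; the dictionary names are illustrative and not part of the CI tooling.

```python
# Totals from the overview row of the CI report.
total = {"passed": 26, "failed": 2, "running": 5, "pending": 2, "skipped": 0}

# Required-job breakdown from section 2.1 (5 of 10 passed).
required = {"passed": 5, "failed": 1, "running": 3, "pending": 1, "skipped": 0}

# Assumption: whatever is not a required job is an optional job.
optional = {state: total[state] - required[state] for state in total}

print(sum(total.values()))     # 35 jobs overall
print(sum(required.values()))  # 10 required jobs
print(optional["passed"], sum(optional.values()))  # 21 of 25 optional jobs passed
```

Under that assumption the optional section should contain one failed, two running, and one pending job, which is consistent with the "21/25 passed" heading.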

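3 Failure details (required jobs only)

Approval: code convention (approval check) (confidence: high)

Approval

  • Status: ❌ Failed
  • Error type: code convention (approval check)
  • Confidence: high
  • Root-cause summary: the PR approval conditions are not met; there are 2 approval errors
  • Analyzer: generic analysis (fallback)

Root-cause details:
The approval check script scripts/check_approval.sh found 2 problems:

  1. fastdeploy/envs.py was modified, which requires approval from one of the FastDeploy RDs (jiangjiajun, liuyuanle, chenjian26, wanglongzhi)
  2. The Cherry-Pick PR title lacks the original develop PR number (e.g. [BugFix] adjust max_tokens and min_tokens when continue to generate tokens #5010), and approval is required from one of the FastDeploy RDs (dangqingqing, jiangjiajun, dengkaipeng)

Key log: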

0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
1. Cherry-Pick PR must come from develop and the title must contain [Cherry-Pick] and the original develop PR number (e.g., #5010). Approval required from FastDeploy RD: qingqing01(dangqingqing), Jiang-Jia-Jun(jiangjiajun), heavengate(dengkaipeng).
There are 2 approved errors.
##[error]Process completed with exit code 6.

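Suggested fixes:

  1. Edit the PR title to add the original develop PR number, e.g. [Cherry-Pick] [Refactor] Organize token processor metrics/traces code (#XXXX)
  2. Request approval of the fastdeploy/envs.py change from any one of these RDs: @Jiang-Jia-Jun, @yuanlehome, @rainyfly, @Wanglongzhi2001
  3. Request Cherry-Pick approval from any one of these RDs: @qingqing01, @Jiang-Jia-Jun, @heavengate

Fix summary: add the original PR number to the title and ask the relevant RDs to approve

Related change: fastdeploy/envs.py (modified by this PR, which triggers the approval requirement)

Link: view logs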
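
The title rule quoted in the log can be checked locally before pushing. The following is a minimal sketch and not the logic of scripts/check_approval.sh, whose source is not shown here. The function name and the regular expression are assumptions. Because the template example writes the number as `(#5191)` while the log example writes `#5010`, the sketch accepts the number at the end of the title with or without parentheses.

```python
import re

# Hypothetical re-statement of the rule from the CI log; not the real script.
CHERRY_PICK_PREFIX = "[Cherry-Pick]"
# Original develop PR number at the very end: "(#5191)" or "#5010".
ORIGINAL_PR_SUFFIX = re.compile(r"\(?#\d+\)?\s*$")


def cherry_pick_title_problems(title: str) -> list:
    """Return human-readable problems; an empty list means the title passes."""
    problems = []
    if not title.lstrip().startswith(CHERRY_PICK_PREFIX):
        problems.append("title should start with [Cherry-Pick]")
    if not ORIGINAL_PR_SUFFIX.search(title):
        problems.append("title should end with the original develop PR number")
    return problems


# The current title of this PR (without the page's own "#7795" suffix).
print(cherry_pick_title_problems(
    "[Cherry-Pick] [Refactor] Organize token processor metrics/traces code"))
# The example title from the PR template.
print(cherry_pick_title_problems(
    "[Cherry-Pick][CI] Add check trigger and logic(#5191)"))
```

The check covers only the title; the two RD approvals still have to be requested on the PR itself.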

@PaddlePaddle-bot PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-05-12 18:36:03

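📋 Review summary

PR overview: Reorganizes the metrics/tracing/logging code scattered through token_processor into standalone helper methods, and adds an FD_ENABLE_OBSERVABILITY switch that controls observability overhead in one place.
Scope of change: fastdeploy/output/token_processor.py, fastdeploy/entrypoints/openai/, fastdeploy/engine/request.py, fastdeploy/envs.py
Impact tags: [DataProcessor] [APIServer] [Engine]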
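
The switch described in the overview can be pictured as a flag that is read once and then checked at the top of every recording helper. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and treating every value other than "0" as enabled is a guess, since the review only states that the default is "1".

```python
import os
import time


def observability_enabled() -> bool:
    # Assumption: any value other than "0" enables observability.
    return os.getenv("FD_ENABLE_OBSERVABILITY", "1") != "0"


class TokenProcessorSketch:
    """Hypothetical stand-in for the real token processor."""

    def __init__(self) -> None:
        # Read the switch once so the hot path only tests a bool attribute.
        self._observability_enabled = observability_enabled()
        self.recorded = []  # stand-in for metrics/trace sinks

    def _record_first_token(self, request_id: str) -> None:
        # Every helper returns early when observability is switched off.
        if not self._observability_enabled:
            return
        self.recorded.append((request_id, time.time()))

    def process_token(self, request_id: str, token: int) -> int:
        self._record_first_token(request_id)
        return token  # token handling itself does not depend on the switch
```

Reading the environment variable in the constructor means a change to the variable only takes effect for processors created afterwards. Whether the real code behaves that way is not confirmed by the review.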


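📝 PR convention check

The following convention issues were found:

  1. [Refactor] is not in the official tag list and should be mapped to [Others] (semantically close but unofficial, so use the nearest official tag)
  2. The Cherry-Pick format requires the original PR number at the end of the title (e.g. (#XXXX)); the current title lacks it
  3. Every section of the PR description (Motivation, Modifications, Usage or Command, Accuracy Tests) is empty, and no Checklist item is ticked

Suggested title (can be copied as-is; replace #XXXX with the original PR number):

  • [Cherry-Pick][Others] Organize token processor metrics/traces code(#XXXX)

Suggested PR description (can be copied as-is):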

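## Motivation

Consolidate the metrics / tracing / logging code that is inlined in `_process_per_token`, `_process_batch_output`, and `_process_batch_output_use_zmq` of `token_processor.py` into well-scoped helper methods (`_record_trace_*`, `_record_task_metrics_*`, `_record_prometheus_metrics_*`, `_log_request_on_completion`, etc.), and gate them behind a single `FD_ENABLE_OBSERVABILITY` environment variable so that no needless overhead is incurred when observability is disabled.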

## Modifications

- `fastdeploy/output/token_processor.py`: add the `_observability_enabled` attribute; split the inline logic of the former `_record_metrics`, `_record_first_token_metrics`, `_record_completion_metrics`, etc. into 10 finer-grained helper methods, all guarded by `_observability_enabled`; the ZMQ and non-ZMQ execution paths both call the new methods
- `fastdeploy/envs.py`: add the `FD_ENABLE_OBSERVABILITY` environment variable (default `"1"`, i.e. enabled)
- `fastdeploy/engine/request.py`: `record_recv_first_token` and `record_decode_recv_second_token` gain an optional `cur_time` parameter to avoid repeated calls to `time.time()`
- `fastdeploy/entrypoints/openai/serving_chat.py`, `serving_completion.py`: add an `or 0` fallback to metrics fields so that arithmetic does not fail on `None` fields when observability is disabled

## Usage or Command

N/A

## Accuracy Tests

N/A (pure code refactor; model inference logic is not touched)

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.
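
Two of the modifications listed in the suggested description, the optional `cur_time` parameter and the `or 0` fallback, can be sketched as follows. The class, its fields, and the helper function are hypothetical and trimmed down for illustration; only the parameter name `cur_time`, the method name `record_recv_first_token`, and the `or 0` idiom come from the review.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestMetricsSketch:
    """Hypothetical metrics holder; field names are illustrative."""
    arrival_time: Optional[float] = None
    first_token_time: Optional[float] = None

    def record_recv_first_token(self, cur_time: Optional[float] = None) -> None:
        # Reuse the caller's timestamp when given, so a single time.time()
        # call can serve several records taken at the same moment.
        self.first_token_time = cur_time if cur_time is not None else time.time()


def time_to_first_token(metrics: RequestMetricsSketch) -> float:
    # With observability off the fields may stay None; "or 0" keeps the
    # subtraction from raising TypeError.
    return (metrics.first_token_time or 0) - (metrics.arrival_time or 0)


now = 12.5  # a fixed timestamp instead of time.time(), to keep the demo stable
m = RequestMetricsSketch(arrival_time=10.0)
m.record_recv_first_token(cur_time=now)
print(time_to_first_token(m))                       # 2.5
print(time_to_first_token(RequestMetricsSketch()))  # 0
```

One consequence of the `or 0` idiom is that a missing value and a genuine zero become indistinguishable, so a reported latency of 0 may simply mean that nothing was recorded.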

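Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/output/token_processor.py:314 | _process_per_token calls _record_trace_on_completion(task) without the rid argument, so tracing fails on the ZMQ path |
| 🟡 Suggestion | fastdeploy/output/token_processor.py:1073 | inference_start_time has no None guard and may raise TypeError |

Overall assessment

The refactoring approach is clear and the code organization is noticeably better. However, the missing rid in the completion stage of the ZMQ path is a real bug introduced by this PR: with FD_ENABLE_OBSERVABILITY=1, the DECODE span cannot be associated with a request ID. It needs to be fixed before merging.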


main_process_metrics.request_token_ratio.observe(token_ratio)
llm_logger.info(f"{self.resource_manager.info()}")
llm_logger.info(self.resource_manager.info())

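🔴 Bug: The call _record_trace_on_completion(task) omits the rid argument, so rid defaults to None and is passed on as tracing.trace_report_span(rid=None, ...). The trace span then cannot be associated with a request ID, and tracing is broken when FD_ENABLE_OBSERVABILITY=1 and the ZMQ path is used.

_process_batch_output (the non-ZMQ path) passes rid correctly; the ZMQ path here omits it.

Suggested fix: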

rid = task_id.split("_")[0]
self._record_trace_on_completion(task, rid)

Or add a fallback inside _record_trace_on_completion:

if rid is None:
    rid = task.request_id.split("_")[0]
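
The second option, the fallback inside the helper, can be tried in isolation with a stubbed tracer. This is a sketch, not the project's code: `trace_report_span` here is a hypothetical stub that only records its arguments (the real tracing.trace_report_span takes more parameters), and the request ID format `<rid>_<index>` is inferred from the `split("_")[0]` in the suggestion.

```python
from types import SimpleNamespace
from typing import Optional

reported_spans = []  # stand-in sink so the effect is observable


def trace_report_span(rid, name):
    # Hypothetical stub for the tracing call.
    reported_spans.append((rid, name))


def record_trace_on_completion(task, rid: Optional[str] = None) -> None:
    if rid is None:
        # Fallback from the review: derive rid from the request ID prefix.
        rid = task.request_id.split("_")[0]
    trace_report_span(rid=rid, name="DECODE")


task = SimpleNamespace(request_id="req-42_0")
record_trace_on_completion(task)            # caller omitted rid (the ZMQ-path case)
record_trace_on_completion(task, "req-42")  # caller passed rid explicitly
print(reported_spans)  # [('req-42', 'DECODE'), ('req-42', 'DECODE')]
```

With the fallback in place both call styles report the same rid, so a caller that forgets the argument no longer produces a span with rid=None.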

elif role == "decode":
trace_print(LoggingEventName.DECODE_INFERENCE_END, task.request_id, getattr(task, "user", ""))
trace_print(LoggingEventName.POSTPROCESSING_START, task.request_id, getattr(task, "user", ""))
tracing.trace_report_span(

🟡 Suggestion: task.metrics.inference_start_time is used directly in int(... * 1e9) without a None check; if the field is None, this raises TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'.

_setup_trace_context has an explicit guard:

t = task.metrics.inference_start_time
ts = int(t * 1_000_000_000) if t is not None else 0

After the refactor, that guard is reproduced neither in _record_trace_on_first_token (line ~1053) nor in _record_trace_on_completion (here). Suggest changing both to:

t = task.metrics.inference_start_time
start_time_ns = int(t * 1e9) if t is not None else 0
tracing.trace_report_span(
    ...,
    start_time_ns=start_time_ns,
    ...
)
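
Since the same guard is needed in more than one helper, it could live in one small function. The sketch below uses a hypothetical helper name, `seconds_to_ns`, and first reproduces the failure the review describes.

```python
from typing import Optional

NANOS_PER_SECOND = 1_000_000_000


def seconds_to_ns(t: Optional[float]) -> int:
    """Convert a time.time()-style timestamp to integer nanoseconds, or 0 if unset."""
    return int(t * NANOS_PER_SECOND) if t is not None else 0


# The unguarded expression fails exactly as the review describes.
try:
    int(None * 1e9)
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for *: 'NoneType' and 'float'

print(seconds_to_ns(None))  # 0
print(seconds_to_ns(1.5))   # 1500000000
```

Falling back to 0 mirrors the guard in _setup_trace_context. Whether a start time of 0 is meaningful to the tracing backend is a separate question that the review does not address.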

@codecov-commenter

Codecov Report

❌ Patch coverage is 77.77778% with 26 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release/2.6@a5191f2). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/output/token_processor.py | 75.72% | 12 Missing and 13 partials ⚠️ |
| ...astdeploy/entrypoints/openai/serving_completion.py | 80.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7795   +/-   ##
==============================================
  Coverage               ?   72.43%           
==============================================
  Files                  ?      378           
  Lines                  ?    53964           
  Branches               ?     8446           
==============================================
  Hits                   ?    39090           
  Misses                 ?    12072           
  Partials               ?     2802           
Flag Coverage Δ
GPU 72.43% <77.77%> (?)
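
The figures in the report can be reconciled with a few lines of arithmetic. The sketch assumes that partially covered lines count as not covered in the patch percentage, which is consistent with 12 + 13 + 1 = 26 lines being reported as missing coverage; the variable names are illustrative.

```python
# Project-level numbers from the coverage diff.
hits, misses, partials = 39090, 12072, 2802
lines = hits + misses + partials
project_coverage = 100 * hits / lines  # close to the reported 72.43%

# Patch-level numbers: 26 uncovered lines at 77.77778% patch coverage.
patch_uncovered = 12 + 13 + 1  # missing + partials across the two files
patch_total = round(patch_uncovered / (1 - 77.77778 / 100))
patch_coverage = 100 * (patch_total - patch_uncovered) / patch_total

print(lines)        # 53964, matching the Lines row
print(patch_total)  # 117 changed lines under the stated assumption
```

Under that assumption the patch touches 117 coverable lines, of which 91 are fully covered.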

3 participants