
[Optimization] Streaming requests return complete special tokens. #6998

Open
luukunn wants to merge 3 commits into PaddlePaddle:develop from luukunn:return_token_ids

Conversation

@luukunn (Collaborator) commented Mar 24, 2026

Motivation

💡 If this PR is a cherry pick, the PR title must follow the format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic (#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but their meaning must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain why in this PR.
  • Provide accuracy results.
  • If this PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 24, 2026 13:25
@paddle-bot

paddle-bot bot commented Mar 24, 2026

Thanks for your contribution!

Copilot AI (Contributor) left a comment


Pull request overview

This PR optimizes OpenAI-compatible streaming responses: when an engine output is marked as skipped, the corresponding token ids are still returned if the client has enabled return_token_ids, so that the stream of special/complete tokens is not lost in streaming scenarios.

Changes:

  • In completion streaming: skip an output only when it is skipped and return_token_ids=False; otherwise return empty text together with the token_ids.
  • In chat completion streaming: adjust the skipped branch in the same way, returning empty content (text/multimodal) for skipped frames.
  • Move the tool_calls detection earlier, so that even a skipped frame correctly affects the final finish_reason decision.
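The adjusted skip branch can be sketched as follows (a minimal Python sketch; `build_stream_text` and the shape of the `output` dict are illustrative stand-ins, not the actual FastDeploy serving code):

```python
def build_stream_text(output: dict, return_token_ids: bool):
    """Return None to drop the chunk, or the text payload to stream.

    A skipped frame is dropped only when the client did not request
    token ids; otherwise an empty-text chunk (carrying token ids) is sent.
    """
    if output["skipped"] and not return_token_ids:
        return None
    return "" if output["skipped"] else (output["text"] or "")
```

With this shape, a skipped frame yields `None` (no chunk) for ordinary clients and an empty string (chunk with token ids only) for clients that asked for token ids.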

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| fastdeploy/entrypoints/openai/serving_completion.py | Adjusts the skipped handling in streaming so that token ids are still emitted (with empty text) when return_token_ids=True. |
| fastdeploy/entrypoints/openai/serving_chat.py | Adjusts the skipped handling and content filling in chat streaming (including the multimodal branch). |


```diff
 if response_processor.enable_multimodal_content():
-    delta_message.multimodal_content = output["multipart"]
+    delta_message.multimodal_content = [{}] if output["skipped"] else output["multipart"]
```

Copilot AI Mar 24, 2026


When multimodal output is enabled and output["skipped"] is True, setting delta_message.multimodal_content to [{}] returns a multipart item that lacks the fields agreed upon upstream (for example, in response_processors the text part is {type: "text", text: ...}), which may break client-side parsing of the multimodal_content structure. Consider returning a structurally consistent placeholder item for skipped frames as well (e.g. type "text" with an empty text string), and then attaching completion_token_ids to it.

Suggested change

```diff
-delta_message.multimodal_content = [{}] if output["skipped"] else output["multipart"]
+delta_message.multimodal_content = (
+    [{"type": "text", "text": ""}] if output["skipped"] else output["multipart"]
+)
```
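The suggested placeholder behavior could be factored out like this (hypothetical helper for illustration, not part of the PR):

```python
def build_multimodal_content(output: dict) -> list:
    """Return a multipart list whose items always follow the
    {type, text} contract, even for skipped frames."""
    if output["skipped"]:
        # structurally consistent placeholder; the caller attaches token ids
        return [{"type": "text", "text": ""}]
    return output["multipart"]
```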

Comment on lines 563 to 566
```diff
 delta_message = CompletionResponseStreamChoice(
     index=idx,
-    text=output["text"],
+    text="" if output["skipped"] else (output["text"] or ""),
     prompt_token_ids=None,
```

Copilot AI Mar 24, 2026


The PR description is still the unfilled template (Motivation/Modifications/Usage/Accuracy Tests are empty), and the title format should drop the quotes and add a space after the tag (e.g. [Optimization] Streaming ...); otherwise it will be hard to trace the purpose and verification of this change later. Please add: why streaming should return the token_ids of skipped tokens, the expected behavior and compatibility notes, and at least one reproducible command or test case.

Comment on lines +560 to 566
```diff
+if output["skipped"] and not request.return_token_ids:
+    continue
+
 delta_message = CompletionResponseStreamChoice(
     index=idx,
-    text=output["text"],
+    text="" if output["skipped"] else (output["text"] or ""),
     prompt_token_ids=None,
```

Copilot AI Mar 24, 2026


This adds the output["skipped"] and not request.return_token_ids branch; the behavior change is that when return_token_ids=True, a chunk with empty text (carrying token_ids) is still sent even for skipped frames. The existing streaming cases in tests/entrypoints/openai/test_serving_completion.py only cover skipped=False; consider adding skipped=True scenarios (covering both return_token_ids=True and return_token_ids=False) to guard against regressions (e.g. accidentally skipping again and losing the token_ids).
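The two missing scenarios could be exercised along these lines (hypothetical helper mirroring the new branch; `chunk_for` and the dict shapes are not from the real test fixtures):

```python
def chunk_for(output: dict, return_token_ids: bool):
    """Mirror the new branch: drop skipped frames only when
    token ids were not requested; otherwise emit empty text."""
    if output["skipped"] and not return_token_ids:
        return None
    chunk = {"text": "" if output["skipped"] else (output["text"] or "")}
    if return_token_ids:
        chunk["completion_token_ids"] = output["token_ids"]
    return chunk

skipped = {"skipped": True, "text": "<eos>", "token_ids": [2]}
assert chunk_for(skipped, return_token_ids=False) is None
assert chunk_for(skipped, return_token_ids=True) == {"text": "", "completion_token_ids": [2]}
```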

Copilot generated this review using guidance from repository custom instructions.
Comment on lines 423 to +426
```diff
 if response_processor.enable_multimodal_content():
-    delta_message.multimodal_content = output["multipart"]
+    delta_message.multimodal_content = [{}] if output["skipped"] else output["multipart"]
 else:
-    delta_message.content = output["text"]
+    delta_message.content = "" if output["skipped"] else (output["text"] or "")
```

Copilot AI Mar 24, 2026


For the skipped + multimodal + return_token_ids combination, the current implementation produces a multipart item without type/text (carrying only completion_token_ids). tests/entrypoints/openai/test_serving_chat.py does cover skipped=True, but it does not assert the structural consistency of the multipart item. Consider adding/updating a test here asserting that the multimodal_content of a skipped chunk contains at least type: "text" with an empty text, plus completion_token_ids.
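The structural assertion the comment asks for could look like this (hypothetical sketch; the dict literal stands in for a skipped chunk's `multimodal_content`):

```python
def check_skipped_multipart(parts: list) -> bool:
    """True iff every item keeps the {type, text} contract and
    carries completion_token_ids, as a skipped chunk should."""
    return all(
        part.get("type") == "text"
        and part.get("text") == ""
        and isinstance(part.get("completion_token_ids"), list)
        for part in parts
    )

# what a skipped multimodal chunk should look like
assert check_skipped_multipart([{"type": "text", "text": "", "completion_token_ids": [42]}])
assert not check_skipped_multipart([{}])  # a bare placeholder fails the contract
```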

Copilot generated this review using guidance from repository custom instructions.
@codecov-commenter

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@6cff780). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/entrypoints/openai/serving_chat.py 66.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6998   +/-   ##
==========================================
  Coverage           ?   73.84%           
==========================================
  Files              ?      399           
  Lines              ?    56045           
  Branches           ?     8849           
==========================================
  Hits               ?    41385           
  Misses             ?    11732           
  Partials           ?     2928           
Flag Coverage Δ
GPU 73.84% <80.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.



