fix: release multimodal payload after mp handoff #4725

Merged
lvhan028 merged 2 commits into InternLM:main from CUHKSZzxy:fix/multimodal-payload-lifetime on Jul 3, 2026
@CUHKSZzxy CUHKSZzxy commented Jul 2, 2026

Collaborator

Motivation

Large multimodal requests can keep API-server CPU RSS growing under high concurrency. In the Qwen3.5 image backpressure workload, the API process retained large preprocessed multimodal payloads after the request had already been handed off to the PyTorch MP engine path.

Result

In the 128-concurrency Qwen3.5 image backpressure workload:

| Version | API RSS Behavior |
| --- | --- |
| Baseline main | Continued growing under load, observed above 200 GiB in long runs |
| This PR: AsyncEngine + MP streaming-wrapper cleanup | Plateaued around ~38 GiB |

Ablation also confirmed that partial cleanup is insufficient:

| Variant | API RSS Behavior |
| --- | --- |
| AsyncEngine cleanup only | Still grew to ~92 GiB before early stop |
| MP/ZMQ cleanup only | Still grew to ~94 GiB before early stop |
| AsyncEngine + MP/ZMQ cleanup | Plateaued around ~38 GiB |

Root Cause

The large multimodal payload is passed through several async generator / task wrappers:

  • AsyncEngine.generate
  • AsyncEngine.safe_run
  • MPEngineInstance.async_stream_infer
  • MP backend streaming wrappers, such as ZMQMPEngine._collective_rpc_streaming_async, RayMPEngine._collective_rpc_streaming_async, and AsyncRPCClient.async_stream_call

Even after the next stage has captured the request, the caller-side prompt_input / kwargs dictionaries can stay alive in suspended async frames for the lifetime of the streamed response. Those stale references keep large preprocessed image tensors alive in the API process.

Clearing only one side is insufficient: the API-side and MP wrapper-side references both need to be released after handoff.
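
The retention described above can be reproduced in isolation. The following is a minimal sketch, not lmdeploy code: `Payload`, `backend`, `inner_wrapper`, `outer_wrapper`, and `alive_mid_stream` are hypothetical names, and the immediate-release behavior assumes CPython reference counting. It chains two async generator wrappers and checks, while the stream is suspended, whether the payload is still reachable.

```python
import asyncio
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large preprocessed image tensor."""


async def backend(request):
    # Stand-in for the next stage: it consumes the request, then streams.
    request.clear()
    for i in range(2):
        yield i


async def inner_wrapper(release, **kwargs):
    # Hypothetical MP-side wrapper: hands a copy of kwargs to the backend.
    gen = backend(dict(kwargs))
    if release:
        kwargs.pop("multimodal", None)  # drop this frame's reference
    try:
        async for item in gen:
            yield item  # frame is suspended here while the response streams
    finally:
        await gen.aclose()


async def outer_wrapper(release_outer, release_inner, prompt_input):
    # Hypothetical API-side wrapper: forwards prompt_input downstream.
    gen = inner_wrapper(release_inner, **prompt_input)
    if release_outer:
        prompt_input.pop("multimodal", None)  # drop this frame's reference
    try:
        async for item in gen:
            yield item
    finally:
        await gen.aclose()


async def alive_mid_stream(release_outer, release_inner):
    payload = Payload()
    ref = weakref.ref(payload)
    stream = outer_wrapper(release_outer, release_inner, {"multimodal": payload})
    del payload
    await stream.__anext__()  # both wrappers are now suspended at their yield
    gc.collect()
    alive = ref() is not None
    await stream.aclose()
    return alive


if __name__ == "__main__":
    for outer in (False, True):
        for inner in (False, True):
            alive = asyncio.run(alive_mid_stream(outer, inner))
            print(f"release_outer={outer} release_inner={inner} alive={alive}")
```

In this sketch the payload becomes unreachable mid-stream only when both wrappers drop their reference, which mirrors the ablation result that one-sided cleanup is not enough.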

Changes

  • Release multimodal from AsyncEngine after inference handoff.
  • Release multimodal from MP streaming wrapper frames after the RPC generator/task is created.
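
For the task-based handoff mentioned in the second change, the pattern might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `fake_send` and `stream_call` are hypothetical names, `pickle` stands in for whatever serialization the real transport uses, and an `asyncio.Queue` stands in for the reply channel.

```python
import asyncio
import pickle


async def fake_send(data, replies):
    """Hypothetical transport: pretends a worker streams three replies."""
    request = pickle.loads(data)  # the "worker" owns its own copy
    for i in range(3):
        await replies.put(f"{request['prompt']}:{i}")
    await replies.put(None)  # end-of-stream sentinel


async def stream_call(**kwargs):
    replies = asyncio.Queue()
    data = pickle.dumps(kwargs)  # serialized copy for the transport
    task = asyncio.create_task(fake_send(data, replies))
    # Handoff is done: the task holds what it needs, so this frame can
    # drop its references before it suspends for the whole stream.
    kwargs.pop("multimodal", None)
    del data
    try:
        while (item := await replies.get()) is not None:
            yield item
    finally:
        task.cancel()  # no-op if the task has already finished


async def collect():
    return [x async for x in stream_call(prompt="hi", multimodal=[b"x" * 1024])]


if __name__ == "__main__":
    print(asyncio.run(collect()))
```

The point of the ordering is that the `pop` happens after the task is created but before the first `await`, so the wrapper frame never sits suspended while still holding the original payload.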

Validation

  • Compile check for touched runtime files.
  • Real Qwen3.5 multimodal backpressure reproduction confirmed API RSS plateau with the combined fix.

Assistance

Assisted with Codex + GPT-5.5 xHigh Fast, reviewed manually

@CUHKSZzxy CUHKSZzxy force-pushed the fix/multimodal-payload-lifetime branch from 1b48ab4 to cbc5eec on July 2, 2026 06:23
@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review July 2, 2026 06:25
Copilot AI review requested due to automatic review settings July 2, 2026 06:25

Copilot AI left a comment

Contributor

Pull request overview

This PR addresses API-server RSS growth under high-concurrency multimodal workloads by proactively releasing large multimodal payload references after they’ve been handed off to the PyTorch MP/ZMQ inference path, preventing those tensors from being kept alive by suspended async frames during streaming.

Changes:

  • Clear multimodal from AsyncEngine.safe_run’s local kwargs after creating the engine generator to avoid retaining large tensors in the API async context.
  • Clear multimodal from AsyncEngine.generate’s prompt_input once inference has started, reducing long-lived references in the request coroutine.
  • Clear multimodal from MP/ZMQ streaming wrapper frames (base.py, zmq_engine.py, zmq_rpc.py) after the downstream generator/task has captured the request, minimizing retention in streaming wrappers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| lmdeploy/serve/core/async_engine.py | Drops multimodal references in the API async engine after inference handoff to reduce RSS retention during streaming. |
| lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py | Drops multimodal from RPC client wrapper kwargs after payload/task creation to avoid retaining original tensor refs. |
| lmdeploy/pytorch/engine/mp_engine/zmq_engine.py | Refactors streaming RPC call to a named generator so wrapper kwargs can be cleared post-handoff. |
| lmdeploy/pytorch/engine/mp_engine/base.py | Clears multimodal from MP engine wrapper kwargs after creating the downstream RPC generator. |

@lvhan028 lvhan028 requested a review from grimoire July 3, 2026 06:11
@lvhan028 lvhan028 added the Bug:P1 label Jul 3, 2026

@grimoire grimoire left a comment

Collaborator

LGTM, do we need to do the same to RayEngine?

@lvhan028 lvhan028 merged commit 3a85c05 into InternLM:main Jul 3, 2026
4 checks passed
@CUHKSZzxy CUHKSZzxy deleted the fix/multimodal-payload-lifetime branch July 3, 2026 07:22