fix: release multimodal payload after mp handoff by CUHKSZzxy · Pull Request #4725 · InternLM/lmdeploy

CUHKSZzxy · 2026-07-02T06:15:16Z

Motivation

Under high concurrency, large multimodal requests can cause the API server's CPU-side RSS to keep growing. In the Qwen3.5 image backpressure workload, the API process retained large preprocessed multimodal payloads after the request had already been handed off to the PyTorch MP engine path.

Result

In the 128-concurrency Qwen3.5 image backpressure workload:

| Version | API RSS Behavior |
| --- | --- |
| Baseline `main` | Continued growing under load; observed above 200 GiB in long runs |
| This PR: AsyncEngine + MP streaming-wrapper cleanup | Plateaued at ~38 GiB |

Ablation also confirmed that partial cleanup is insufficient:

| Variant | API RSS Behavior |
| --- | --- |
| AsyncEngine cleanup only | Still grew to ~92 GiB before early stop |
| MP/ZMQ cleanup only | Still grew to ~94 GiB before early stop |
| AsyncEngine + MP/ZMQ cleanup | Plateaued at ~38 GiB |

Root Cause

The large multimodal payload is passed through several async generator / task wrappers:

1. `AsyncEngine.generate`
2. `AsyncEngine.safe_run`
3. `MPEngineInstance.async_stream_infer`
4. MP backend streaming wrappers, such as `ZMQMPEngine._collective_rpc_streaming_async`, `RayMPEngine._collective_rpc_streaming_async`, and `AsyncRPCClient.async_stream_call`

Even after the next stage has captured the request, the caller-side prompt_input / kwargs dictionaries can stay alive in suspended async frames for the lifetime of the streamed response. Those stale references keep large preprocessed image tensors alive in the API process.
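The retention mechanism can be reproduced in isolation. The following is a minimal sketch, not lmdeploy code: `Payload`, `next_stage`, `caller`, and `probe` are hypothetical names, and the "handoff" is simulated by the next stage clearing its own copy of the request. It uses a weak reference to check whether the payload is still reachable while the caller's async generator frame is suspended.

```python
import asyncio
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large preprocessed image tensor."""


async def next_stage(**request):
    # Simulated handoff: the next stage consumes the request right away
    # (imagine it being serialized and sent to another process) and keeps
    # no reference to the payload afterwards.
    request.clear()
    for token in range(3):
        await asyncio.sleep(0)
        yield token


async def caller(**kwargs):
    # kwargs is a local of this frame. While the frame is suspended at
    # `yield`, the dict and everything in it stay reachable.
    async for token in next_stage(**kwargs):
        yield token


async def probe():
    payload = Payload()
    ref = weakref.ref(payload)
    gen = caller(multimodal=payload)
    del payload                    # drop our own reference
    await gen.__anext__()          # stream has started; caller is suspended
    gc.collect()
    alive_mid_stream = ref() is not None
    await gen.aclose()             # end the stream; caller's frame is released
    del gen
    gc.collect()
    alive_after_close = ref() is not None
    return alive_mid_stream, alive_after_close


if __name__ == "__main__":
    print(asyncio.run(probe()))
```

On CPython this should report that the payload is still alive mid-stream and only released once the stream is closed, even though the next stage dropped its own reference immediately.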

Clearing only one side is insufficient: the API-side and MP wrapper-side references both need to be released after handoff.

Changes

- Release `multimodal` from AsyncEngine after inference handoff.
- Release `multimodal` from MP streaming wrapper frames after the RPC generator/task is created.
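The pattern behind both changes, and the reason neither is enough alone, can be sketched with two stacked wrappers. This is an illustrative sketch under assumptions, not the code in this PR: `api_generate`, `mp_stream`, and `engine_handoff` are hypothetical stand-ins for the API-side and MP-side layers, and the handoff is modeled as an eager call that finishes with the request before streaming begins. Each wrapper pops `multimodal` from its own dict only after the downstream generator has been created.

```python
import asyncio
import gc
import weakref


class Payload:
    """Hypothetical stand-in for a large preprocessed image tensor."""


def engine_handoff(request):
    # Eager handoff: the request is inspected here, before streaming starts.
    # The returned generator does not reference `request`.
    assert "multimodal" in request

    async def _results():
        for token in range(3):
            await asyncio.sleep(0)
            yield token

    return _results()


async def mp_stream(clear, **kwargs):
    stream = engine_handoff(kwargs)         # downstream has taken the request
    if clear:
        kwargs.pop("multimodal", None)      # release this frame's reference
    async for token in stream:
        yield token


async def api_generate(prompt_input, clear_api, clear_mp):
    stream = mp_stream(clear_mp, **prompt_input)  # callee gets its own dict
    if clear_api:
        prompt_input.pop("multimodal", None)      # release this frame's reference
    async for token in stream:
        yield token


async def payload_alive_mid_stream(clear_api, clear_mp):
    payload = Payload()
    ref = weakref.ref(payload)
    gen = api_generate({"multimodal": payload}, clear_api, clear_mp)
    del payload
    await gen.__anext__()          # both wrappers are now suspended mid-stream
    gc.collect()
    alive = ref() is not None
    await gen.aclose()
    return alive


if __name__ == "__main__":
    for flags in [(False, False), (True, False), (False, True), (True, True)]:
        print(flags, asyncio.run(payload_alive_mid_stream(*flags)))
```

In this sketch the payload is released mid-stream only when both layers clear their reference, which mirrors the ablation result. Binding the downstream generator to a name before iterating is what creates a point at which the wrapper can drop its reference.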

Validation

- Compile check for touched runtime files.
- Real Qwen3.5 multimodal backpressure reproduction confirmed API RSS plateau with the combined fix.

Assistance

Assisted with Codex + GPT-5.5 xHigh Fast, reviewed manually

Copilot

Pull request overview

This PR addresses API-server RSS growth under high-concurrency multimodal workloads by proactively releasing large multimodal payload references after they’ve been handed off to the PyTorch MP/ZMQ inference path, preventing those tensors from being kept alive by suspended async frames during streaming.

Changes:

- Clear `multimodal` from `AsyncEngine.safe_run`’s local kwargs after creating the engine generator to avoid retaining large tensors in the API async context.
- Clear `multimodal` from `AsyncEngine.generate`’s `prompt_input` once inference has started, reducing long-lived references in the request coroutine.
- Clear `multimodal` from MP/ZMQ streaming wrapper frames (`base.py`, `zmq_engine.py`, `zmq_rpc.py`) after the downstream generator/task has captured the request, minimizing retention in streaming wrappers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| lmdeploy/serve/core/async_engine.py | Drops `multimodal` references in the API async engine after inference handoff to reduce RSS retention during streaming. |
| lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py | Drops `multimodal` from RPC client wrapper kwargs after payload/task creation to avoid retaining original tensor refs. |
| lmdeploy/pytorch/engine/mp_engine/zmq_engine.py | Refactors streaming RPC call to a named generator so wrapper kwargs can be cleared post-handoff. |
| lmdeploy/pytorch/engine/mp_engine/base.py | Clears `multimodal` from MP engine wrapper kwargs after creating the downstream RPC generator. |

grimoire

LGTM, do we need to do the same to RayEngine?

fix: release multimodal payload after mp handoff

cbc5eec

CUHKSZzxy force-pushed the fix/multimodal-payload-lifetime branch from 1b48ab4 to cbc5eec July 2, 2026 06:23

CUHKSZzxy marked this pull request as ready for review July 2, 2026 06:25

Copilot AI review requested due to automatic review settings July 2, 2026 06:25

Copilot started reviewing on behalf of CUHKSZzxy July 2, 2026 06:25

Copilot AI reviewed Jul 2, 2026

lvhan028 requested a review from grimoire July 3, 2026 06:11

lvhan028 approved these changes Jul 3, 2026

lvhan028 added the Bug:P1 label Jul 3, 2026

grimoire approved these changes Jul 3, 2026

fix: release multimodal payload in ray mp stream

3a4444a

lvhan028 merged commit 3a85c05 into InternLM:main Jul 3, 2026
4 checks passed

CUHKSZzxy deleted the fix/multimodal-payload-lifetime branch July 3, 2026 07:22

lvhan028 merged 2 commits into InternLM:main from CUHKSZzxy:fix/multimodal-payload-lifetime
