fix: release multimodal payload after mp handoff #4725
Merged
lvhan028 merged 2 commits on Jul 3, 2026
Pull request overview
This PR addresses API-server RSS growth under high-concurrency multimodal workloads by proactively releasing large multimodal payload references after they’ve been handed off to the PyTorch MP/ZMQ inference path, preventing those tensors from being kept alive by suspended async frames during streaming.
Changes:
- Clear `multimodal` from `AsyncEngine.safe_run`'s local `kwargs` after creating the engine generator to avoid retaining large tensors in the API async context.
- Clear `multimodal` from `AsyncEngine.generate`'s `prompt_input` once inference has started, reducing long-lived references in the request coroutine.
- Clear `multimodal` from MP/ZMQ streaming wrapper frames (`base.py`, `zmq_engine.py`, `zmq_rpc.py`) after the downstream generator/task has captured the request, minimizing retention in streaming wrappers.
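The "release after handoff" idea for an RPC-style wrapper can be sketched roughly as follows. This is an illustration only, not the code in `zmq_rpc.py`: `stream_call`, `send`, `replies`, and the use of `pickle` as the serializer are all hypothetical stand-ins. Once the request has been serialized and sent, the wrapper frame drops its own reference to the original payload before it starts waiting on streamed replies.

```python
import asyncio
import pickle


async def stream_call(send, replies, **kwargs):
    """Hypothetical RPC client wrapper: serialize, send, then stream replies."""
    # The serialized copy is what actually crosses the process boundary.
    message = pickle.dumps(kwargs)
    await send(message)
    del message

    # Handoff is complete, so this frame no longer needs the original payload.
    # Without this pop, `kwargs` would pin it for the whole streamed response.
    kwargs.pop("multimodal", None)

    while True:
        reply = await replies.get()
        if reply is None:  # sentinel (an assumption of this sketch): stream done
            break
        yield reply
```

The `pop` uses a default of `None` so that text-only requests, which carry no `multimodal` key, pass through unchanged.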
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| lmdeploy/serve/core/async_engine.py | Drops multimodal references in the API async engine after inference handoff to reduce RSS retention during streaming. |
| lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py | Drops multimodal from RPC client wrapper kwargs after payload/task creation to avoid retaining original tensor refs. |
| lmdeploy/pytorch/engine/mp_engine/zmq_engine.py | Refactors streaming RPC call to a named generator so wrapper kwargs can be cleared post-handoff. |
| lmdeploy/pytorch/engine/mp_engine/base.py | Clears multimodal from MP engine wrapper kwargs after creating the downstream RPC generator. |
lvhan028
approved these changes
Jul 3, 2026
grimoire
approved these changes
Jul 3, 2026
grimoire
left a comment
Collaborator
LGTM, do we need to do the same to RayEngine?
Motivation
Large multimodal requests can keep API-server CPU RSS growing under high concurrency. In the Qwen3.5 image backpressure workload, the API process retained large preprocessed multimodal payloads after the request had already been handed off to the PyTorch MP engine path.
Result
In the 128-concurrency Qwen3.5 image backpressure workload:
[Results table with a `main` entry; values not preserved.]
Ablation also confirmed that partial cleanup is insufficient:
[Ablation table; values not preserved.]
Root Cause
The large `multimodal` payload is passed through several async generator / task wrappers:
- `AsyncEngine.generate`
- `AsyncEngine.safe_run`
- `MPEngineInstance.async_stream_infer`
- `ZMQMPEngine._collective_rpc_streaming_async`, `RayMPEngine._collective_rpc_streaming_async`, and `AsyncRPCClient.async_stream_call`
Even after the next stage has captured the request, the caller-side `prompt_input` / `kwargs` dictionaries can stay alive in suspended async frames for the lifetime of the streamed response. Those stale references keep large preprocessed image tensors alive in the API process.
Clearing only one side is insufficient: the API-side and MP wrapper-side references both need to be released after handoff.
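The retention mechanism can be reproduced in isolation. The sketch below is a toy model, not lmdeploy code: `Payload`, `downstream`, `wrapper`, and `payload_alive_during_stream` are hypothetical names, and the downstream stage is assumed to drop its own reference once it has consumed the payload. A weak reference then shows whether the suspended wrapper frame is still keeping the payload alive mid-stream.

```python
import asyncio
import gc
import weakref


class Payload:
    """Stand-in for a large preprocessed image tensor (hypothetical)."""

    def __init__(self, size):
        self.data = bytearray(size)


async def downstream(**kwargs):
    # Simulated handoff: consume the payload, then drop the local reference.
    payload = kwargs.pop("multimodal", None)
    sent_bytes = len(payload.data) if payload is not None else 0
    del payload
    for i in range(3):
        await asyncio.sleep(0)
        yield i, sent_bytes


async def wrapper(release, **kwargs):
    # Arguments are bound when the generator object is created, so the
    # downstream stage has captured the request before it runs any code.
    gen = downstream(**kwargs)
    if release:
        kwargs.pop("multimodal", None)  # drop the caller-side reference
    try:
        async for item in gen:
            yield item  # this frame, including `kwargs`, is suspended here
    finally:
        await gen.aclose()


async def payload_alive_during_stream(release):
    payload = Payload(1024)
    ref = weakref.ref(payload)
    stream = wrapper(release, multimodal=payload)
    del payload
    await stream.__anext__()  # the wrapper is now suspended mid-stream
    gc.collect()
    alive = ref() is not None
    await stream.aclose()
    return alive


if __name__ == "__main__":
    print("without cleanup:", asyncio.run(payload_alive_during_stream(False)))
    print("with cleanup:   ", asyncio.run(payload_alive_during_stream(True)))
```

In this toy setup the payload survives for the whole stream unless the wrapper clears its own `kwargs`, even though the downstream stage released it right away. The same reasoning applies at every wrapper layer, which is consistent with the observation that clearing only one side is not enough.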
Changes
- Clear `multimodal` from `AsyncEngine` after inference handoff.
- Clear `multimodal` from MP streaming wrapper frames after the RPC generator/task is created.
Validation
Assistance
Assisted with Codex + GPT-5.5 xHigh Fast, reviewed manually