prepare chunk indices before cache initialize #4458
Open
grimoire wants to merge 2 commits into InternLM:main from
Pull request overview
This PR adjusts the PyTorch engine’s prefill path for SSM / gated-delta (flash-linear-attention) models so that chunk-gated-delta “chunk indices” preparation (which forces a CUDA stream sync) happens during step-context construction, before state-cache initialization and forward execution.
Changes:
- Move state-cache initialization for SSM from the input-update path into `model_forward()`, after `build_context()`.
- In the CUDA backend `update_step_context()`, eagerly call `fla.ops.utils.prepare_chunk_indices(...)` during prefill to trigger the required synchronization earlier.
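To make the role of the chunk indices concrete, here is a minimal pure-Python sketch of how such indices could be derived from cumulative sequence lengths. It assumes the indices are (sequence, chunk) pairs, one per chunk of each packed sequence. The function name `chunk_indices_from_cu_seqlens` and the list-based inputs are illustrative only and are not the `fla` API; the real helper works on tensors and may lay out its result differently. The number of pairs depends on the sequence lengths themselves, which is presumably why the real computation has to read those values back to the host and therefore synchronizes the stream.

```python
from typing import List, Tuple


def chunk_indices_from_cu_seqlens(cu_seqlens: List[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Return (sequence_id, chunk_id) pairs for a packed batch.

    Hypothetical helper for illustration; not part of fla or lmdeploy.
    cu_seqlens holds cumulative sequence lengths, e.g. [0, 5, 12] for
    two sequences of lengths 5 and 7.
    """
    pairs: List[Tuple[int, int]] = []
    for seq_id in range(len(cu_seqlens) - 1):
        seq_len = cu_seqlens[seq_id + 1] - cu_seqlens[seq_id]
        # Ceiling division: a partially filled last chunk still counts.
        num_chunks = -(-seq_len // chunk_size)
        pairs.extend((seq_id, chunk_id) for chunk_id in range(num_chunks))
    return pairs
```

With `cu_seqlens=[0, 5, 12]` and `chunk_size=4`, both sequences need two chunks, so the result has four pairs. The output size is data-dependent, so it cannot be known without inspecting the lengths.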
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `lmdeploy/pytorch/engine/model_agent/agent.py` | Moves SSM state cache initialization to occur after `build_context()` (and removes prior prefill-only init hook). |
| `lmdeploy/pytorch/backends/cuda/op_backend.py` | Adds gated-delta chunk-index preparation during prefill step-context update. |
The chunk gated delta kernel requires `chunk_indices`, and computing them requires a stream synchronization. This PR computes `chunk_indices` before the forward pass and cache initialization.