[None][fix] Use one mamba slot sentinel to save memory #13489

Draft
Wanli-Jiang wants to merge 3 commits into NVIDIA:main from Wanli-Jiang:user/williamj/fix-mamba-mtp-continue

@Wanli-Jiang commented Apr 27, 2026

- Reserve max_draft_len + 1 extra Mamba slots in
  MambaHybridCacheManager so real requests and CUDA-graph padding
  dummies both fit.
- Allocate a permanent slot for the CUDA-graph sentinel; padding
  reuses it via direct mamba_cache_index lookup and no longer aliases
  live requests parked under the overlap scheduler.
- update_mamba_states scatters into the caller's state_indices
  (mamba_metadata.state_indices under MTP), removing the stale-tail
  read.
- Relax mamba2_mtp_ssm_cache_update's intermediate_states.size(0)
  check to ">= bs"; it's indexed by batch position, not slot.
- Release Phase-1 CUDA-graph pools before final KV allocation.
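
The slot accounting described in the first two items can be sketched roughly as follows. This is a toy model, not the TensorRT-LLM implementation: the class `MambaSlotPool`, the `SENTINEL_ID` value, and the assumption that the extra slots are added on top of `max_batch_size` are all hypothetical and chosen only for illustration.

```python
class MambaSlotPool:
    """Toy model of a fixed pool of Mamba state slots (hypothetical names)."""

    SENTINEL_ID = -1  # assumed request id for the CUDA-graph padding dummy

    def __init__(self, max_batch_size, max_draft_len):
        # Assumption: the "max_draft_len + 1 extra" slots sit on top of
        # max_batch_size.
        self.num_slots = max_batch_size + max_draft_len + 1
        self._free = list(range(self.num_slots))
        self.cache_index = {}  # request id -> slot index
        # One permanent slot for the sentinel; it never returns to the free list.
        self.cache_index[self.SENTINEL_ID] = self._free.pop()

    def allocate(self, request_id):
        # Return the existing slot if the request already owns one.
        if request_id in self.cache_index:
            return self.cache_index[request_id]
        if not self._free:
            raise RuntimeError("no free Mamba slots")
        slot = self._free.pop()
        self.cache_index[request_id] = slot
        return slot

    def release(self, request_id):
        if request_id == self.SENTINEL_ID:
            return  # the sentinel slot is permanent
        self._free.append(self.cache_index.pop(request_id))

    def state_indices(self, request_ids, padded_batch_size):
        # Real requests use their own slots; every padding row reuses the
        # single sentinel slot through a direct index lookup.
        real = [self.cache_index[r] for r in request_ids]
        sentinel = self.cache_index[self.SENTINEL_ID]
        return real + [sentinel] * (padded_batch_size - len(real))


pool = MambaSlotPool(max_batch_size=4, max_draft_len=3)
pool.allocate(10)
pool.allocate(11)
indices = pool.state_indices([10, 11], padded_batch_size=4)
```

In this sketch every padding row maps to the same sentinel slot, so padding costs one slot regardless of how many dummy rows a CUDA graph needs, and a padding row can never share a slot with a live request.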

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
@Wanli-Jiang changed the title from "User/williamj/fix mamba mtp continue" to "[None][fix] Use one mamba slot sentinel to save memory" on Apr 27, 2026
@Wanli-Jiang force-pushed the user/williamj/fix-mamba-mtp-continue branch from ce5c15b to 9a04e03 on April 27, 2026 05:08
@Wanli-Jiang

/bot run --disable-fail-fast

@tensorrt-cicd

PR_Github #45653 [ run ] triggered by Bot. Commit: 9a04e03
