[None][fix] Use one mamba slot sentinel to save memory by Wanli-Jiang · Pull Request #13489 · NVIDIA/TensorRT-LLM

Wanli-Jiang · 2026-04-27T05:01:20Z

Features

Based on [None][fix] Fix Mamba cache correctness under MTP + CUDA-graph padding #13151.
Removed the extra sentinel for MTP cases; all dummy requests now share the same single extra slot.
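
As a rough illustration of the "one shared sentinel slot" idea, here is a minimal toy sketch. It is not the TensorRT-LLM implementation: `ToySlotPool`, `assign`, `release`, and `sentinel_slot` are hypothetical names, and the layout (sentinel placed right after the real slots) is an assumption made only for this example.

```python
class ToySlotPool:
    """Toy pool (hypothetical): real requests own a slot, all dummies share one sentinel."""

    def __init__(self, max_batch_size: int) -> None:
        # Assumption for this sketch: exactly one extra slot is reserved
        # permanently for dummy (padding) requests, placed after the real slots.
        self.sentinel_slot = max_batch_size
        self.num_slots = max_batch_size + 1
        self._free = list(range(max_batch_size - 1, -1, -1))
        self._index = {}  # request_id -> slot

    def assign(self, request_id: int, is_dummy: bool = False) -> int:
        if is_dummy:
            # Dummies never consume a real slot, so they cannot alias a live request.
            return self.sentinel_slot
        if request_id not in self._index:
            if not self._free:
                raise RuntimeError("no free slot for a real request")
            self._index[request_id] = self._free.pop()
        return self._index[request_id]

    def release(self, request_id: int) -> None:
        # Return the slot of a finished real request to the free list.
        slot = self._index.pop(request_id, None)
        if slot is not None:
            self._free.append(slot)
```

In this sketch the pool size is `max_batch_size + 1` no matter how many dummy requests are padded into a batch, which is the memory-saving property the PR title refers to.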

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

- Reserve max_draft_len + 1 extra Mamba slots in MambaHybridCacheManager so real requests and CUDA-graph padding dummies both fit.
- Allocate a permanent slot for the CUDA-graph sentinel; padding reuses it via direct mamba_cache_index lookup and no longer aliases live requests parked under the overlap scheduler.
- update_mamba_states scatters into the caller's state_indices (mamba_metadata.state_indices under MTP), removing the stale-tail read.
- Relax mamba2_mtp_ssm_cache_update's intermediate_states.size(0) check to ">= bs"; it's indexed by batch position, not slot.
- Release Phase-1 CUDA-graph pools before final KV allocation.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
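
The distinction between "batch position" and "slot" in the commit message can be sketched with plain Python lists. This is a hypothetical illustration, not the actual kernel or `update_mamba_states`: `scatter_states` and its arguments are names invented for this example, chosen to echo the commit message.

```python
def scatter_states(cache, intermediate_states, state_indices):
    """Copy per-request states into their cache slots (toy sketch).

    intermediate_states is indexed by batch position (0..bs-1);
    state_indices[i] is the cache slot owned by the request at position i.
    """
    bs = len(state_indices)
    # Relaxed check: the buffer only has to cover the batch, so ">= bs" is
    # enough. Requiring it to match the slot count would be wrong, because
    # it is never indexed by slot.
    if len(intermediate_states) < bs:
        raise ValueError("intermediate_states must have at least bs rows")
    for batch_pos, slot in enumerate(state_indices):
        cache[slot] = intermediate_states[batch_pos]
    return cache
```

For example, with a five-slot cache and a batch of two requests living in slots 3 and 0, a three-row buffer passes the check and only its first two rows are written.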

Wanli-Jiang · 2026-04-27T05:13:47Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-27T05:19:20Z

PR_Github #45653 [ run ] triggered by Bot. Commit: 9a04e03 Link to invocation

Wanli-Jiang added 2 commits April 27, 2026 02:13

Fix comments and unify non-mtp sources to mamba metadata as well.

67c04d1

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

github-actions Bot assigned Wanli-Jiang Apr 27, 2026

Wanli-Jiang changed the title ~~User/williamj/fix mamba mtp continue~~ [None][fix] Use one mamba slot sentinel to save memory Apr 27, 2026

[None][fix] Use one mamba slot sentinel to save memory

9a04e03

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

Wanli-Jiang force-pushed the user/williamj/fix-mamba-mtp-continue branch from ce5c15b to 9a04e03 Compare April 27, 2026 05:08

Wanli-Jiang wants to merge 3 commits into NVIDIA:main from Wanli-Jiang:user/williamj/fix-mamba-mtp-continue


2 participants
