
[None][fix] Port KV cache V2 follow-up fixes #13488

Open
yizhang-nv wants to merge 1 commit into NVIDIA:main from yizhang-nv:yizhan/v2-fixes-from-pr12837

Conversation

@yizhang-nv
Member

@yizhang-nv yizhang-nv commented Apr 27, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed attention workspace sizing calculation to accurately reflect KV cache reuse scenarios.
  • Improvements

    • Enhanced KV cache scheduler capacity calculation for better resource allocation.
    • Improved fallback mechanisms for KV cache initialization when encountering memory constraints.
    • Optimized memory migration stream synchronization for improved performance.

Description

Port the KV cache V2 follow-up fixes from #12837 that are independent of enabling V2 by default.

This PR keeps the use_kv_cache_manager_v2 default unchanged and excludes test waives/skips and cleanup workarounds. It includes:

  • Use max_batch_size as the V2 scheduler per-iteration request budget.
  • Size FP8 MLA context workspace with total_kv_len when KV cache reuse makes it larger than max_num_tokens.
  • Cap auto-provisioned V2 host cache by available memory and RLIMIT_MEMLOCK (see the sketch after this list), and retry without the host tier when host registration fails.
  • Order V2 batched migrations behind the executor stream before copy streams consume cache pages.
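
For illustration, a minimal sketch of the host-quota capping described in the third bullet; the helper name, the psutil dependency, and the 80% headroom factor are assumptions, not the actual resource_manager.py code:

```python
import resource

import psutil  # assumed available; any host-memory probe would work


def cap_host_cache_bytes(requested_bytes: int) -> int:
    """Clamp an auto-provisioned host KV cache size to what the system can pin."""
    # Leave some headroom below the currently available host memory (factor is illustrative).
    available = int(psutil.virtual_memory().available * 0.8)
    # cuMemHostRegister needs pinned (locked) pages, so RLIMIT_MEMLOCK also bounds the quota.
    soft_memlock, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_memlock != resource.RLIM_INFINITY:
        available = min(available, soft_memlock)
    return min(requested_bytes, available)
```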

Test Coverage

  • pre-commit run --files cpp/tensorrt_llm/common/attentionOp.cpp cpp/tensorrt_llm/common/attentionOp.h cpp/tensorrt_llm/thop/attentionOp.cpp tensorrt_llm/_torch/pyexecutor/_util.py tensorrt_llm/_torch/pyexecutor/resource_manager.py tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
  • python3 -m py_compile tensorrt_llm/_torch/pyexecutor/_util.py tensorrt_llm/_torch/pyexecutor/resource_manager.py tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
  • git diff --check

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv yizhang-nv requested review from a team as code owners April 27, 2026 04:53
@yizhang-nv yizhang-nv requested a review from achartier April 27, 2026 04:53
@yizhang-nv
Member Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

The changes extend workspace-sizing methods across C++ and Python layers by adding a total_kv_len parameter to support improved FP8 buffer sizing in attention operations. Additionally, scheduler capacity computation is adjusted to use max_batch_size instead of max_num_sequences for V2 executor initialization, and KV cache manager startup now includes fallback error handling and explicit execution stream wiring for batched migration synchronization.
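
As a rough illustration of the scheduler-budget change, the sketch below uses a hypothetical function name and a hypothetical clamping adjustment; it is not the real _util.py logic:

```python
def v2_scheduler_request_budget(max_batch_size: int, max_num_sequences: int) -> int:
    # V2 budgets scheduled requests per iteration by max_batch_size (how many
    # requests execute together) rather than by max_num_sequences (total
    # sequence slots).
    budget = max_batch_size
    # Hypothetical follow-up adjustment: the real code applies its own tweak as
    # a separate step; clamping to the available sequence slots is only an
    # illustration.
    return min(budget, max_num_sequences)
```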

Changes

Cohort / File(s) Summary
AttentionOp Workspace Sizing
cpp/tensorrt_llm/common/attentionOp.h, cpp/tensorrt_llm/common/attentionOp.cpp
Added optional total_kv_len parameter (defaults to 0) to AttentionOp::getWorkspaceSizeForContext method. Implementation uses this value to compute FP8 K/V buffer workspace when separate Q/K/V inputs are employed, deriving kv_buf_tokens from total_kv_len with fallback logic.
Thop AttentionOp Runner
cpp/tensorrt_llm/thop/attentionOp.cpp
Added optional ctx_total_kv_len parameter (defaults to 0) to RunnerBase::getWorkspaceSize and Runner<T,...>::getWorkspaceSize methods. Context workspace computation now forwards this parameter to AttentionOp::getWorkspaceSizeForContext.
Executor Scheduler & Resource Manager
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py
V2 scheduler capacity now derives from max_batch_size instead of max_num_sequences, with adjustment logic applied separately. KV cache v2 initialization includes explicit exception handling for CUDA/memlock errors with controlled fallback to GPU-only cache tiers; imports explicit error types and constrains host quota by available memory and process memlock limit.
KV Cache Manager Storage
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
Added optional _execution_stream field to StorageManager. Batched migration now captures a fresh CUDA event from execution stream and includes it in prior-events dependencies to ensure proper synchronization with copy-stream operations.
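
A minimal sketch of the migration-ordering idea from the last row above, written with plain torch.cuda primitives; the class shape and method names are assumptions, not the real _storage_manager.py interface:

```python
import torch


class StorageManagerSketch:
    def __init__(self, execution_stream: torch.cuda.Stream, copy_stream: torch.cuda.Stream):
        self._execution_stream = execution_stream
        self._copy_stream = copy_stream

    def migrate_batch(self, pages):
        # Record a fresh event on the execution stream so the copy stream only
        # starts consuming cache pages after all prior executor work has finished.
        barrier = torch.cuda.Event()
        barrier.record(self._execution_stream)
        with torch.cuda.stream(self._copy_stream):
            self._copy_stream.wait_event(barrier)
            for src, dst in pages:
                dst.copy_(src, non_blocking=True)
```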

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[None][fix] Port KV cache V2 follow-up fixes' clearly and concisely describes the main change—porting specific KV cache V2 fixes from another PR.
Description check ✅ Passed The PR description covers all key sections from the template: it clearly explains what is being ported and why, lists specific technical changes, provides test coverage details, and includes the completed checklist.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82ada20d-56a1-4ef8-af86-ba785a7734fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2f745de and 3fc8886.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • cpp/tensorrt_llm/common/attentionOp.h
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py

Comment on lines +806 to +811
            // Use total_kv_len when available (KV cache reuse causes total_kv_len >> max_num_tokens).
            // enqueueContext sizes these buffers by total_kv_len, so workspace must match.
            size_t const kv_buf_tokens = std::max(
                static_cast<size_t>(total_kv_len), static_cast<size_t>(mChunkPrefillBufferBatchSize) * max_num_tokens);
            fp8_k_buf_size = kv_buf_tokens * static_cast<size_t>(total_k_dim_all_heads);
            fp8_v_buf_size = kv_buf_tokens * static_cast<size_t>(total_v_dim_all_heads);

⚠️ Potential issue | 🟡 Minor

Guard total_kv_len before casting to size_t.

static_cast<size_t>(total_kv_len) will wrap if total_kv_len < 0, which can explode workspace size calculations.

💡 Suggested fix
-            size_t const kv_buf_tokens = std::max(
-                static_cast<size_t>(total_kv_len), static_cast<size_t>(mChunkPrefillBufferBatchSize) * max_num_tokens);
+            int32_t const safeTotalKvLen = std::max(total_kv_len, 0);
+            size_t const kv_bufTokensByChunk
+                = static_cast<size_t>(mChunkPrefillBufferBatchSize) * static_cast<size_t>(max_num_tokens);
+            size_t const kv_buf_tokens = std::max(static_cast<size_t>(safeTotalKvLen), kv_bufTokensByChunk);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/common/attentionOp.cpp` around lines 806 - 811, total_kv_len
may be negative and casting it directly to size_t causes wrap/large values;
before computing kv_buf_tokens (used to size fp8_k_buf_size and fp8_v_buf_size
and matched to enqueueContext), clamp or check total_kv_len to ensure
non-negative and then cast: compute a safe size_t kv_len = (total_kv_len > 0 ?
static_cast<size_t>(total_kv_len) : 0) and use std::max(kv_len,
static_cast<size_t>(mChunkPrefillBufferBatchSize) * max_num_tokens) when
assigning kv_buf_tokens, so fp8_k_buf_size and fp8_v_buf_size are computed from
a valid non-wrapped value.

Comment on lines +1932 to +1952
        try:
            self.impl = KVCacheManagerPy(config)
        except (CuError, KVCacheOutOfMemoryError):
            if len(cache_tiers) > 1:
                logger.warning(
                    "Failed to initialize KV cache manager with host cache "
                    "tier (cuMemHostRegister may have failed). "
                    "Retrying without host cache tier.")
                cache_tiers_gpu_only = [
                    t for t in cache_tiers if isinstance(t, GpuCacheTierConfig)
                ]
                config = self._build_cache_config(
                    kv_cache_config,
                    tokens_per_block=tokens_per_block,
                    vocab_size=vocab_size,
                    cache_tiers=cache_tiers_gpu_only,
                )
                self.kv_cache_manager_py_config = config
                self.impl = KVCacheManagerPy(config)
            else:
                raise

⚠️ Potential issue | 🟠 Major

GPU-only fallback can still deadlock the default V2 scheduler.

This retry drops the host tier, but _util.py still defaults V2 to CapacitySchedulerPolicy.MAX_UTILIZATION. Your own comment above says that policy relies on suspend/resume succeeding; without a host tier, resume eventually fails once the GPU tier fills, so this turns a startup-time host-registration failure into a latent runtime hang. Please either downgrade to a non-suspending policy when this fallback fires or surface the init failure instead of silently continuing.
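
A hedged sketch of the first suggested remediation (downgrading the policy when the fallback fires); the import path, the scheduler_policy field name, and the use of dataclasses.replace are assumptions about the config object, not verified against resource_manager.py:

```python
import dataclasses

# Import path is an assumption; the review text refers to the enum only by name.
from tensorrt_llm.bindings.executor import CapacitySchedulerPolicy


def gpu_only_fallback_config(config):
    # If the original config relied on MAX_UTILIZATION (which suspends/resumes
    # requests via the host tier), switch the GPU-only retry to a policy that
    # never needs to resume from a tier that no longer exists.
    # Assumes `config` is a dataclass exposing a `scheduler_policy` field.
    if getattr(config, "scheduler_policy", None) == CapacitySchedulerPolicy.MAX_UTILIZATION:
        config = dataclasses.replace(
            config, scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT)
    return config
```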

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 1932 - 1952,
The retry path that constructs a GPU-only config after KVCacheManagerPy init
fails can still leave the scheduler using
CapacitySchedulerPolicy.MAX_UTILIZATION (a suspending policy), causing
deadlocks; in the except block where you call _build_cache_config and set
self.kv_cache_manager_py_config and self.impl, modify the new config to use a
non-suspending scheduler policy (i.e. explicitly set the scheduler policy to a
non-suspending alternative instead of CapacitySchedulerPolicy.MAX_UTILIZATION)
before constructing KVCacheManagerPy, or alternatively re-raise the original
error instead of retrying; update the retry branch around KVCacheManagerPy,
_build_cache_config, and kv_cache_manager_py_config to implement one of these
two behaviors.

@tensorrt-cicd
Collaborator

PR_Github #45650 [ run ] triggered by Bot. Commit: 3fc8886 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45650 [ run ] completed with state FAILURE. Commit: 3fc8886
/LLM/main/L0_MergeRequest_PR pipeline #35863 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

