
[None][fix] Port KV cache V2 follow-up fixes #13488

Open
yizhang-nv wants to merge 1 commit into NVIDIA:main from yizhang-nv:yizhan/v2-fixes-from-pr12837

Conversation

@yizhang-nv
Member

@yizhang-nv yizhang-nv commented Apr 27, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed attention workspace sizing calculation to accurately reflect KV cache reuse scenarios.
  • Improvements

    • Enhanced KV cache scheduler capacity calculation for better resource allocation.
    • Improved fallback mechanisms for KV cache initialization when encountering memory constraints.
    • Optimized memory migration stream synchronization for improved performance.

Description

Port the KV cache V2 follow-up fixes from #12837 that are independent of enabling V2 by default.

This PR keeps the use_kv_cache_manager_v2 default unchanged and excludes test waives/skips and cleanup workarounds. It includes:

  • Use max_batch_size as the V2 scheduler per-iteration request budget.
  • Size FP8 MLA context workspace with total_kv_len when KV cache reuse makes it larger than max_num_tokens.
  • Cap auto-provisioned V2 host cache by available memory and RLIMIT_MEMLOCK (see the sketch after this list), and retry without the host tier when host registration fails.
  • Order V2 batched migrations behind the executor stream before copy streams consume cache pages.
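
For illustration, a minimal sketch of the host-quota capping described in the third bullet; the helper name, the psutil dependency, and the 80% headroom factor are assumptions, not the actual resource_manager.py code:

```python
import resource

import psutil  # assumed available; any host-memory probe would work


def cap_host_cache_bytes(requested_bytes: int) -> int:
    """Clamp an auto-provisioned host KV cache size to what the system can pin."""
    # Leave some headroom below the currently available host memory (factor is illustrative).
    available = int(psutil.virtual_memory().available * 0.8)
    # cuMemHostRegister needs pinned (locked) pages, so RLIMIT_MEMLOCK also bounds the quota.
    soft_memlock, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_memlock != resource.RLIM_INFINITY:
        available = min(available, soft_memlock)
    return min(requested_bytes, available)
```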

Test Coverage

  • pre-commit run --files cpp/tensorrt_llm/common/attentionOp.cpp cpp/tensorrt_llm/common/attentionOp.h cpp/tensorrt_llm/thop/attentionOp.cpp tensorrt_llm/_torch/pyexecutor/_util.py tensorrt_llm/_torch/pyexecutor/resource_manager.py tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
  • python3 -m py_compile tensorrt_llm/_torch/pyexecutor/_util.py tensorrt_llm/_torch/pyexecutor/resource_manager.py tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
  • git diff --check

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv yizhang-nv requested review from a team as code owners April 27, 2026 04:53
@yizhang-nv yizhang-nv requested a review from achartier April 27, 2026 04:53
@yizhang-nv
Member Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

The changes extend workspace-sizing methods across C++ and Python layers by adding a total_kv_len parameter to support improved FP8 buffer sizing in attention operations. Additionally, scheduler capacity computation is adjusted to use max_batch_size instead of max_num_sequences for V2 executor initialization, and KV cache manager startup now includes fallback error handling and explicit execution stream wiring for batched migration synchronization.
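
As a rough illustration of the scheduler-budget change, the sketch below uses a hypothetical function name and a hypothetical clamping adjustment; it is not the real _util.py logic:

```python
def v2_scheduler_request_budget(max_batch_size: int, max_num_sequences: int) -> int:
    # V2 budgets scheduled requests per iteration by max_batch_size (how many
    # requests execute together) rather than by max_num_sequences (total
    # sequence slots).
    budget = max_batch_size
    # Hypothetical follow-up adjustment: the real code applies its own tweak as
    # a separate step; clamping to the available sequence slots is only an
    # illustration.
    return min(budget, max_num_sequences)
```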

Changes

Cohort / File(s) Summary
AttentionOp Workspace Sizing
cpp/tensorrt_llm/common/attentionOp.h, cpp/tensorrt_llm/common/attentionOp.cpp
Added optional total_kv_len parameter (defaults to 0) to AttentionOp::getWorkspaceSizeForContext method. Implementation uses this value to compute FP8 K/V buffer workspace when separate Q/K/V inputs are employed, deriving kv_buf_tokens from total_kv_len with fallback logic.
Thop AttentionOp Runner
cpp/tensorrt_llm/thop/attentionOp.cpp
Added optional ctx_total_kv_len parameter (defaults to 0) to RunnerBase::getWorkspaceSize and Runner<T,...>::getWorkspaceSize methods. Context workspace computation now forwards this parameter to AttentionOp::getWorkspaceSizeForContext.
Executor Scheduler & Resource Manager
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py
V2 scheduler capacity now derives from max_batch_size instead of max_num_sequences, with adjustment logic applied separately. KV cache v2 initialization includes explicit exception handling for CUDA/memlock errors with controlled fallback to GPU-only cache tiers; imports explicit error types and constrains host quota by available memory and process memlock limit.
KV Cache Manager Storage
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py
Added optional _execution_stream field to StorageManager. Batched migration now captures a fresh CUDA event from execution stream and includes it in prior-events dependencies to ensure proper synchronization with copy-stream operations.
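
A minimal sketch of the migration-ordering idea from the last row above, written with plain torch.cuda primitives; the class shape and method names are assumptions, not the real _storage_manager.py interface:

```python
import torch


class StorageManagerSketch:
    def __init__(self, execution_stream: torch.cuda.Stream, copy_stream: torch.cuda.Stream):
        self._execution_stream = execution_stream
        self._copy_stream = copy_stream

    def migrate_batch(self, pages):
        # Record a fresh event on the execution stream so the copy stream only
        # starts consuming cache pages after all prior executor work has finished.
        barrier = torch.cuda.Event()
        barrier.record(self._execution_stream)
        with torch.cuda.stream(self._copy_stream):
            self._copy_stream.wait_event(barrier)
            for src, dst in pages:
                dst.copy_(src, non_blocking=True)
```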

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[None][fix] Port KV cache V2 follow-up fixes' clearly and concisely describes the main change—porting specific KV cache V2 fixes from another PR.
Description check ✅ Passed The PR description covers all key sections from the template: it clearly explains what is being ported and why, lists specific technical changes, provides test coverage details, and includes the completed checklist.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 82ada20d-56a1-4ef8-af86-ba785a7734fe

📥 Commits

Reviewing files that changed from the base of the PR and between 2f745de and 3fc8886.

📒 Files selected for processing (6)
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • cpp/tensorrt_llm/common/attentionOp.h
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py

Comment on lines +806 to +811
            // Use total_kv_len when available (KV cache reuse causes total_kv_len >> max_num_tokens).
            // enqueueContext sizes these buffers by total_kv_len, so workspace must match.
            size_t const kv_buf_tokens = std::max(
                static_cast<size_t>(total_kv_len), static_cast<size_t>(mChunkPrefillBufferBatchSize) * max_num_tokens);
            fp8_k_buf_size = kv_buf_tokens * static_cast<size_t>(total_k_dim_all_heads);
            fp8_v_buf_size = kv_buf_tokens * static_cast<size_t>(total_v_dim_all_heads);

⚠️ Potential issue | 🟡 Minor

Guard total_kv_len before casting to size_t.

static_cast<size_t>(total_kv_len) will wrap if total_kv_len < 0, which can explode workspace size calculations.

💡 Suggested fix
-            size_t const kv_buf_tokens = std::max(
-                static_cast<size_t>(total_kv_len), static_cast<size_t>(mChunkPrefillBufferBatchSize) * max_num_tokens);
+            int32_t const safeTotalKvLen = std::max(total_kv_len, 0);
+            size_t const kv_bufTokensByChunk
+                = static_cast<size_t>(mChunkPrefillBufferBatchSize) * static_cast<size_t>(max_num_tokens);
+            size_t const kv_buf_tokens = std::max(static_cast<size_t>(safeTotalKvLen), kv_bufTokensByChunk);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/common/attentionOp.cpp` around lines 806 - 811, total_kv_len
may be negative and casting it directly to size_t causes wrap/large values;
before computing kv_buf_tokens (used to size fp8_k_buf_size and fp8_v_buf_size
and matched to enqueueContext), clamp or check total_kv_len to ensure
non-negative and then cast: compute a safe size_t kv_len = (total_kv_len > 0 ?
static_cast<size_t>(total_kv_len) : 0) and use std::max(kv_len,
static_cast<size_t>(mChunkPrefillBufferBatchSize) * max_num_tokens) when
assigning kv_buf_tokens, so fp8_k_buf_size and fp8_v_buf_size are computed from
a valid non-wrapped value.

Comment on lines +1932 to +1952
        try:
            self.impl = KVCacheManagerPy(config)
        except (CuError, KVCacheOutOfMemoryError):
            if len(cache_tiers) > 1:
                logger.warning(
                    "Failed to initialize KV cache manager with host cache "
                    "tier (cuMemHostRegister may have failed). "
                    "Retrying without host cache tier.")
                cache_tiers_gpu_only = [
                    t for t in cache_tiers if isinstance(t, GpuCacheTierConfig)
                ]
                config = self._build_cache_config(
                    kv_cache_config,
                    tokens_per_block=tokens_per_block,
                    vocab_size=vocab_size,
                    cache_tiers=cache_tiers_gpu_only,
                )
                self.kv_cache_manager_py_config = config
                self.impl = KVCacheManagerPy(config)
            else:
                raise

⚠️ Potential issue | 🟠 Major

GPU-only fallback can still deadlock the default V2 scheduler.

This retry drops the host tier, but _util.py still defaults V2 to CapacitySchedulerPolicy.MAX_UTILIZATION. Your own comment above says that policy relies on suspend/resume succeeding; without a host tier, resume eventually fails once the GPU tier fills, so this turns a startup-time host-registration failure into a latent runtime hang. Please either downgrade to a non-suspending policy when this fallback fires or surface the init failure instead of silently continuing.
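
A hedged sketch of the first suggested remediation (downgrading the policy when the fallback fires); the import path, the scheduler_policy field name, and the use of dataclasses.replace are assumptions about the config object, not verified against resource_manager.py:

```python
import dataclasses

# Import path is an assumption; the review text refers to the enum only by name.
from tensorrt_llm.bindings.executor import CapacitySchedulerPolicy


def gpu_only_fallback_config(config):
    # If the original config relied on MAX_UTILIZATION (which suspends/resumes
    # requests via the host tier), switch the GPU-only retry to a policy that
    # never needs to resume from a tier that no longer exists.
    # Assumes `config` is a dataclass exposing a `scheduler_policy` field.
    if getattr(config, "scheduler_policy", None) == CapacitySchedulerPolicy.MAX_UTILIZATION:
        config = dataclasses.replace(
            config, scheduler_policy=CapacitySchedulerPolicy.GUARANTEED_NO_EVICT)
    return config
```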

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 1932 - 1952,
The retry path that constructs a GPU-only config after KVCacheManagerPy init
fails can still leave the scheduler using
CapacitySchedulerPolicy.MAX_UTILIZATION (a suspending policy), causing
deadlocks; in the except block where you call _build_cache_config and set
self.kv_cache_manager_py_config and self.impl, modify the new config to use a
non-suspending scheduler policy (i.e. explicitly set the scheduler policy to a
non-suspending alternative instead of CapacitySchedulerPolicy.MAX_UTILIZATION)
before constructing KVCacheManagerPy, or alternatively re-raise the original
error instead of retrying; update the retry branch around KVCacheManagerPy,
_build_cache_config, and kv_cache_manager_py_config to implement one of these
two behaviors.

@tensorrt-cicd
Collaborator

PR_Github #45650 [ run ] triggered by Bot. Commit: 3fc8886 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45650 [ run ] completed with state FAILURE. Commit: 3fc8886
/LLM/main/L0_MergeRequest_PR pipeline #35863 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

