[https://nvbugs/6094068][fix] Cap Mamba cache max_batch_size for memory and add CUDA P2P check for DeepEP #13474
Conversation
Fix OOM crash in MambaHybridCacheManager when running Qwen3-Next on H100 with attention_dp=True. The SSM state allocation tried to allocate 72 GiB for 2048 sequences, but only ~36 GiB was free after model loading. Three changes:
1. Add _cap_max_batch_size_for_memory() to MambaHybridCacheManager that estimates per-sequence Mamba state size and caps max_batch_size to fit within half of free GPU memory.
2. Add check_cuda_p2p_access() to deep_ep_utils.py and integrate it into DeepEP.is_platform_supported() and DeepEPLowLatency.is_platform_supported() to prevent a fatal NVSHMEM process exit when P2P access is unavailable.
3. Add a Mamba capacity check in MambaHybridCacheManager.add_dummy_requests() so CUDA graph warmup gracefully skips batch sizes exceeding the capped Mamba cache capacity instead of raising RuntimeError.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
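To make the capping approach concrete, here is a minimal, hypothetical sketch of the memory-based cap. The real helper in mamba_cache_manager.py derives the per-sequence size from the conv and SSM state shapes per layer; the names and the simplified formula below are assumptions for illustration only.

```python
import torch


def cap_batch_size_for_memory(bytes_per_seq_state: int,
                              requested_max_batch_size: int,
                              memory_fraction: float = 0.5) -> int:
    """Hypothetical sketch: shrink max_batch_size so the Mamba state cache
    fits within a fraction of the currently free GPU memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    budget = int(free_bytes * memory_fraction)
    affordable = max(1, budget // bytes_per_seq_state)
    return min(requested_max_batch_size, affordable)


# Example using the numbers from this bug: 72 GiB for 2048 sequences is
# ~36 MiB of state per sequence; with ~36 GiB free, half of that holds
# roughly 512 sequences, so a requested 2048 would be capped accordingly.
capped = cap_batch_size_for_memory(36 * 1024**2, 2048)
```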
📝 Walkthrough
This PR adds CUDA P2P connectivity validation to DeepEP platform support checks and introduces max batch size capping for Mamba models based on GPU memory constraints. It includes a new utility function for verifying peer-to-peer GPU access availability across local GPUs.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)
52-56: Add return type annotation. Per coding guidelines, all functions should have return type annotations.
```diff
 def _cap_mamba_max_batch_size(mamba_d_state, mamba_d_conv, mamba_num_heads,
                               mamba_n_groups, mamba_head_dim, mamba_num_layers,
                               mamba_layer_mask, mamba_cache_dtype,
                               mamba_ssm_cache_dtype, max_batch_size, mapping,
-                              spec_config):
+                              spec_config) -> int:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 52 - 56, The function _cap_mamba_max_batch_size is missing a return type annotation; update its signature to include an explicit return type that matches what the function actually returns (inspect the function body to determine whether it returns an int, tuple, dict, or other type) and annotate it accordingly (e.g., -> int or -> Tuple[...], using typing imports as needed) so the signature correctly reflects the returned value(s).
781-790: Mamba resources are allocated before KV cache availability is confirmed. On line 789, `MambaCacheManager.add_dummy_requests` allocates Mamba blocks. If the subsequent `KVCacheManager.add_dummy_requests` call on line 790 returns `None` due to insufficient KV blocks, the Mamba allocation persists while the method returns `None`, creating a partial resource state. This is mitigated by the idempotent nature of Mamba allocation (reused requests hit the cache) and the fixed IDs used for dummy requests, which prevent resource leaks on retries. However, the code would be clearer if both capacities were verified before any allocation:
🔧 Optional improvement: pre-check both capacities
```diff
 def add_dummy_requests(self, request_ids: List[int], **kwargs):
     # Check Mamba cache capacity before allocation. CUDA graph warmup
     # creates dummy requests for various batch sizes and expects None
     # (not an exception) when resources are insufficient so it can skip
     # oversized batch configurations gracefully.
     max_mamba = MambaCacheManager.get_max_resource_count(self)
     if len(request_ids) > max_mamba:
         return None
+    # Check KV cache availability before allocating any resources
+    available_kv_blocks = self.get_num_free_blocks()
+    if available_kv_blocks < 1:
+        return None
     MambaCacheManager.add_dummy_requests(self, request_ids)
     return KVCacheManager.add_dummy_requests(self, request_ids, **kwargs)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 781 - 790, The method add_dummy_requests currently calls MambaCacheManager.add_dummy_requests before confirming KV capacity, risking a partial allocation if KV add returns None; change it to first query both capacities (call MambaCacheManager.get_max_resource_count(self) and the KV equivalent, e.g. KVCacheManager.get_max_resource_count(self)), return None if the requested count exceeds either, and only then call MambaCacheManager.add_dummy_requests(self, request_ids) followed by KVCacheManager.add_dummy_requests(self, request_ids, **kwargs) so allocations only occur after both capacity checks pass.
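For context, a hedged caller-side sketch of the behavior this comment relies on: warmup requests dummy allocations per batch size and simply skips sizes for which the manager returns None. Everything here except the None return contract of add_dummy_requests is a hypothetical stand-in for the real warmup code.

```python
from typing import Callable, Iterable


def warmup_cuda_graphs(cache_manager,
                       batch_sizes: Iterable[int],
                       capture: Callable[[int, object], None]) -> None:
    """Skip CUDA graph capture for batch sizes whose dummy requests
    cannot be allocated (capped Mamba cache or insufficient KV blocks)."""
    for batch_size in batch_sizes:
        request_ids = list(range(batch_size))
        requests = cache_manager.add_dummy_requests(request_ids)
        if requests is None:
            # Capacity exceeded for this batch size: skip instead of raising.
            continue
        capture(batch_size, requests)
```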
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py`:
- Around line 27-31: Wrap the CUDA probing calls so they fail closed: surround
the calls to torch.cuda.current_device(), torch.cuda.device_count(), and
torch.cuda.can_device_access_peer(...) (the P2P probe block in deep_ep_utils.py)
with a try/except that catches RuntimeError and AssertionError and returns False
on any exception so callers can gracefully fall back; do not let these
exceptions propagate and trigger NVSHMEM initialization. Also add the required
NVIDIA copyright/header at the top of the file.
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 52-56: The function _cap_mamba_max_batch_size is missing a return
type annotation; update its signature to include an explicit return type that
matches what the function actually returns (inspect the function body to
determine whether it returns an int, tuple, dict, or other type) and annotate it
accordingly (e.g., -> int or -> Tuple[...], using typing imports as needed) so
the signature correctly reflects the returned value(s).
- Around line 781-790: The method add_dummy_requests currently calls
MambaCacheManager.add_dummy_requests before confirming KV capacity, risking a
partial allocation if KV add returns None; change it to first query both
capacities (call MambaCacheManager.get_max_resource_count(self) and the KV
equivalent, e.g. KVCacheManager.get_max_resource_count(self)), return None if
the requested count exceeds either, and only then call
MambaCacheManager.add_dummy_requests(self, request_ids) followed by
KVCacheManager.add_dummy_requests(self, request_ids, **kwargs) so allocations
only occur after both capacity checks pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 21c04848-c3ce-46fa-b532-afd0fc25fc15
📒 Files selected for processing (4)
- tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
- tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
```python
current = torch.cuda.current_device()
num_devices = torch.cuda.device_count()
for i in range(num_devices):
    if i != current and not torch.cuda.can_device_access_peer(current, i):
        return False
```
🧩 Analysis chain
🌐 Web query:
In PyTorch, what exceptions can torch.cuda.current_device() and torch.cuda.can_device_access_peer() raise when CUDA is unavailable or not initialized, and what is the recommended fail-closed guard pattern for capability probes?
💡 Result:
torch.cuda.current_device() raises RuntimeError (wrapping CUDA driver errors like "No CUDA GPUs are available", "device >= 0 && device < num_gpus INTERNAL ASSERT FAILED", or "CUDA Call Failed") or DeferredCudaCallError when CUDA is unavailable or not properly initialized. These occur during _lazy_init if CUDA initialization fails (e.g., no GPUs detected, driver issues). torch.cuda.can_device_access_peer(device, peer_device) raises AssertionError("Invalid peer device id") if either device or peer_device is invalid (e.g., >= device_count or negative when CUDA is unavailable/not initialized), as seen in source code and usage errors.

Recommended fail-closed guard pattern for capability probes: check torch.cuda.is_available() and torch.cuda.is_initialized() first, then verify device indices < torch.cuda.device_count(). Use try/except for probes, but prefer explicit guards:

```python
if torch.cuda.is_available():
    if torch.cuda.device_count() > 0:
        try:
            dev = torch.cuda.current_device()
            # further checks
        except RuntimeError:
            # fallback
            pass
else:
    # CPU fallback
    pass
```

This ensures fail-closed behavior (disable the CUDA path if unavailable) before probing capabilities like peer access.
Citations:
- 1: https://github.com/pytorch/pytorch/blob/master/torch/cuda/__init__.py
- 2: https://docs.pytorch.org/docs/2.6/_modules/torch/cuda.html
- 3: https://github.com/pytorch/pytorch/blob/v2.11.0/torch/cuda/__init__.py
- 4: RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED pytorch/pytorch#110000
- 5: RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED caused by early CUDA init pytorch/pytorch#172500
- 6: Distributed inference on multi machine (error Invalid peer device id) vllm-project/vllm#2795
- 7: https://pytorch.org/docs/stable/generated/torch.cuda.current_device.html
- 8: https://docs.pytorch.org/docs/stable/generated/torch.cuda.can_device_access_peer.html
- 9: https://github.com/pytorch/pytorch/blob/main/torch/csrc/api/src/cuda.cpp
🏁 Script executed:
```sh
# Check file existence and get overall context
file_path="tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py"
if [ -f "$file_path" ]; then
  echo "=== File Stats ==="
  wc -l "$file_path"
  echo ""
  echo "=== First 50 lines (copyright header + context) ==="
  head -50 "$file_path"
  echo ""
  echo "=== Lines 19-35 (function in question) ==="
  sed -n '19,35p' "$file_path" | cat -n
else
  echo "File not found: $file_path"
fi
```
Repository: NVIDIA/TensorRT-LLM
Length of output: 2581
🏁 Script executed:
```sh
# Search for usages of check_cuda_p2p_access
rg "check_cuda_p2p_access" --type py -B 2 -A 2
```
Repository: NVIDIA/TensorRT-LLM
Length of output: 2441
Guard CUDA probe calls to fail closed instead of raising.
The function is intended to gracefully report P2P access unavailability so callers can fall back without triggering NVSHMEM initialization (which fatally exits on incompatible hardware). However, torch.cuda.current_device() (line 27) and torch.cuda.can_device_access_peer() (line 30) raise RuntimeError and AssertionError respectively when CUDA is unavailable or not initialized. These exceptions break the graceful fallback contract.
Proposed fix

```diff
 def check_cuda_p2p_access() -> bool:
     """Check if CUDA P2P access is available between all local GPUs.

     NVSHMEM (used by DeepEP) fatally exits the process when P2P access is
     unavailable, instead of raising a catchable exception. This check lets
     callers avoid triggering nvshmem init on incompatible hardware so the
     communication factory can fall back gracefully.
     """
+    if not torch.cuda.is_available():
+        return False
+
+    try:
+        current = torch.cuda.current_device()
+        num_devices = torch.cuda.device_count()
+    except RuntimeError:
+        return False
+
-    current = torch.cuda.current_device()
-    num_devices = torch.cuda.device_count()
-    for i in range(num_devices):
-        if i != current and not torch.cuda.can_device_access_peer(current, i):
-            return False
+    for peer_device in range(num_devices):
+        if peer_device == current:
+            continue
+        try:
+            if not torch.cuda.can_device_access_peer(current, peer_device):
+                return False
+        except (RuntimeError, AssertionError):
+            return False
     return True
```

Additionally, the file is missing the required NVIDIA copyright header.
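For reference, here is a hedged caller-side sketch of how the platform checks described in this PR might consume the guard. The is_platform_supported wiring below is an assumption based on the PR description; only check_cuda_p2p_access itself comes from this change.

```python
from tensorrt_llm._torch.modules.fused_moe.deep_ep_utils import check_cuda_p2p_access


def is_platform_supported() -> bool:
    """Simplified stand-in for DeepEP.is_platform_supported()."""
    # Probe CUDA P2P access before any NVSHMEM/DeepEP initialization; NVSHMEM
    # exits the process (no catchable exception) when P2P is unavailable.
    if not check_cuda_p2p_access():
        return False
    # ... remaining architecture / NVLink checks from the existing implementation
    return True
```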
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py` around lines 27 - 31,
Wrap the CUDA probing calls so they fail closed: surround the calls to
torch.cuda.current_device(), torch.cuda.device_count(), and
torch.cuda.can_device_access_peer(...) (the P2P probe block in deep_ep_utils.py)
with a try/except that catches RuntimeError and AssertionError and returns False
on any exception so callers can gracefully fall back; do not let these
exceptions propagate and trigger NVSHMEM initialization. Also add the required
NVIDIA copyright/header at the top of the file.
Summary
The Mamba hybrid cache was sized for the full `max_batch_size` without checking available GPU memory, causing OOM on large hybrid models (e.g., Qwen3-Next-80B) where the model itself consumed most GPU memory before cache allocation. Additionally, DeepEP's platform check relied on NVML NVLink status to infer P2P capability, which misreported availability inside containers, causing NVSHMEM to call `exit()` at the C level with no opportunity for graceful fallback.

For the memory issue, added a `_cap_mamba_max_batch_size` helper that estimates per-sequence Mamba state size (conv + SSM per layer, including speculative decoding overhead), queries free GPU memory, and caps `max_batch_size` to fit within a configurable fraction of available memory. For the DeepEP issue, added an explicit `check_cuda_p2p_access()` function using `torch.cuda.can_device_access_peer` to verify actual CUDA P2P connectivity between all local GPUs, and wired it into both `DeepEP.is_platform_supported()` and `DeepEPLowLatency.is_platform_supported()` so the communication factory can fall back to alternative backends instead of hitting an uncatchable process exit.

Test plan
Links
Summary by CodeRabbit
New Features
Improvements