[https://nvbugs/6094068][fix] Cap Mamba cache max_batch_size for memory and add CUDA P2P check for DeepEP #13474
Conversation
Fix OOM crash in MambaHybridCacheManager when running Qwen3-Next on H100 with attention_dp=True. The SSM state allocation tried to allocate 72 GiB for 2048 sequences, but only ~36 GiB was free after model loading. Three changes:
1. Add _cap_max_batch_size_for_memory() to MambaHybridCacheManager that estimates per-sequence Mamba state size and caps max_batch_size to fit within half of free GPU memory.
2. Add check_cuda_p2p_access() to deep_ep_utils.py and integrate it into DeepEP.is_platform_supported() and DeepEPLowLatency.is_platform_supported() to prevent a fatal NVSHMEM process exit when P2P access is unavailable.
3. Add a Mamba capacity check in MambaHybridCacheManager.add_dummy_requests() so CUDA graph warmup gracefully skips batch sizes exceeding the capped Mamba cache capacity instead of raising RuntimeError.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
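To make the capping approach concrete, here is a minimal, hypothetical sketch of the memory-based cap. The real helper in mamba_cache_manager.py derives the per-sequence size from the conv and SSM state shapes per layer; the names and the simplified formula below are assumptions for illustration only.

```python
import torch


def cap_batch_size_for_memory(bytes_per_seq_state: int,
                              requested_max_batch_size: int,
                              memory_fraction: float = 0.5) -> int:
    """Hypothetical sketch: shrink max_batch_size so the Mamba state cache
    fits within a fraction of the currently free GPU memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    budget = int(free_bytes * memory_fraction)
    affordable = max(1, budget // bytes_per_seq_state)
    return min(requested_max_batch_size, affordable)


# Example using the numbers from this bug: 72 GiB for 2048 sequences is
# ~36 MiB of state per sequence; with ~36 GiB free, half of that holds
# roughly 512 sequences, so a requested 2048 would be capped accordingly.
capped = cap_batch_size_for_memory(36 * 1024**2, 2048)
```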
📝 Walkthrough
This PR adds CUDA P2P connectivity validation to DeepEP platform support checks and introduces max batch size capping for Mamba models based on GPU memory constraints. It includes a new utility function for verifying peer-to-peer GPU access availability across local GPUs.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)
52-56: Add return type annotation. Per coding guidelines, all functions should have return type annotations.
```diff
 def _cap_mamba_max_batch_size(mamba_d_state, mamba_d_conv, mamba_num_heads,
                               mamba_n_groups, mamba_head_dim, mamba_num_layers,
                               mamba_layer_mask, mamba_cache_dtype,
                               mamba_ssm_cache_dtype, max_batch_size, mapping,
-                              spec_config):
+                              spec_config) -> int:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 52 - 56, The function _cap_mamba_max_batch_size is missing a return type annotation; update its signature to include an explicit return type that matches what the function actually returns (inspect the function body to determine whether it returns an int, tuple, dict, or other type) and annotate it accordingly (e.g., -> int or -> Tuple[...], using typing imports as needed) so the signature correctly reflects the returned value(s).
781-790: Mamba resources are allocated before KV cache availability is confirmed. On line 789, `MambaCacheManager.add_dummy_requests` allocates Mamba blocks. If the subsequent `KVCacheManager.add_dummy_requests` call on line 790 returns `None` due to insufficient KV blocks, the Mamba allocation persists while the method returns `None`, creating a partial resource state. This is mitigated by the idempotent nature of Mamba allocation (reused requests hit the cache) and the fixed IDs used for dummy requests, which prevent resource leaks on retries. However, the code would be clearer if both capacities were verified before any allocation:
🔧 Optional improvement: pre-check both capacities
```diff
 def add_dummy_requests(self, request_ids: List[int], **kwargs):
     # Check Mamba cache capacity before allocation. CUDA graph warmup
     # creates dummy requests for various batch sizes and expects None
     # (not an exception) when resources are insufficient so it can skip
     # oversized batch configurations gracefully.
     max_mamba = MambaCacheManager.get_max_resource_count(self)
     if len(request_ids) > max_mamba:
         return None
+    # Check KV cache availability before allocating any resources
+    available_kv_blocks = self.get_num_free_blocks()
+    if available_kv_blocks < 1:
+        return None
     MambaCacheManager.add_dummy_requests(self, request_ids)
     return KVCacheManager.add_dummy_requests(self, request_ids, **kwargs)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 781 - 790, The method add_dummy_requests currently calls MambaCacheManager.add_dummy_requests before confirming KV capacity, risking a partial allocation if KV add returns None; change it to first query both capacities (call MambaCacheManager.get_max_resource_count(self) and the KV equivalent, e.g. KVCacheManager.get_max_resource_count(self)), return None if the requested count exceeds either, and only then call MambaCacheManager.add_dummy_requests(self, request_ids) followed by KVCacheManager.add_dummy_requests(self, request_ids, **kwargs) so allocations only occur after both capacity checks pass.
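For context, a hedged caller-side sketch of the behavior this comment relies on: warmup requests dummy allocations per batch size and simply skips sizes for which the manager returns None. Everything here except the None return contract of add_dummy_requests is a hypothetical stand-in for the real warmup code.

```python
from typing import Callable, Iterable


def warmup_cuda_graphs(cache_manager,
                       batch_sizes: Iterable[int],
                       capture: Callable[[int, object], None]) -> None:
    """Skip CUDA graph capture for batch sizes whose dummy requests
    cannot be allocated (capped Mamba cache or insufficient KV blocks)."""
    for batch_size in batch_sizes:
        request_ids = list(range(batch_size))
        requests = cache_manager.add_dummy_requests(request_ids)
        if requests is None:
            # Capacity exceeded for this batch size: skip instead of raising.
            continue
        capture(batch_size, requests)
```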
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py`:
- Around line 27-31: Wrap the CUDA probing calls so they fail closed: surround
the calls to torch.cuda.current_device(), torch.cuda.device_count(), and
torch.cuda.can_device_access_peer(...) (the P2P probe block in deep_ep_utils.py)
with a try/except that catches RuntimeError and AssertionError and returns False
on any exception so callers can gracefully fall back; do not let these
exceptions propagate and trigger NVSHMEM initialization. Also add the required
NVIDIA copyright/header at the top of the file.
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 52-56: The function _cap_mamba_max_batch_size is missing a return
type annotation; update its signature to include an explicit return type that
matches what the function actually returns (inspect the function body to
determine whether it returns an int, tuple, dict, or other type) and annotate it
accordingly (e.g., -> int or -> Tuple[...], using typing imports as needed) so
the signature correctly reflects the returned value(s).
- Around line 781-790: The method add_dummy_requests currently calls
MambaCacheManager.add_dummy_requests before confirming KV capacity, risking a
partial allocation if KV add returns None; change it to first query both
capacities (call MambaCacheManager.get_max_resource_count(self) and the KV
equivalent, e.g. KVCacheManager.get_max_resource_count(self)), return None if
the requested count exceeds either, and only then call
MambaCacheManager.add_dummy_requests(self, request_ids) followed by
KVCacheManager.add_dummy_requests(self, request_ids, **kwargs) so allocations
only occur after both capacity checks pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 21c04848-c3ce-46fa-b532-afd0fc25fc15
📒 Files selected for processing (4)
- tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
- tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
- tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
- tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
```python
current = torch.cuda.current_device()
num_devices = torch.cuda.device_count()
for i in range(num_devices):
    if i != current and not torch.cuda.can_device_access_peer(current, i):
        return False
```
🧩 Analysis chain
🌐 Web query:
In PyTorch, what exceptions can torch.cuda.current_device() and torch.cuda.can_device_access_peer() raise when CUDA is unavailable or not initialized, and what is the recommended fail-closed guard pattern for capability probes?
💡 Result:
torch.cuda.current_device() raises RuntimeError (wrapping CUDA driver errors like "No CUDA GPUs are available", "device >= 0 && device < num_gpus INTERNAL ASSERT FAILED", or "CUDA Call Failed") or DeferredCudaCallError when CUDA is unavailable or not properly initialized. These occur during _lazy_init if CUDA initialization fails (e.g., no GPUs detected, driver issues). torch.cuda.can_device_access_peer(device, peer_device) raises AssertionError("Invalid peer device id") if either device or peer_device is invalid (e.g., >= device_count or negative when CUDA is unavailable/not initialized), as seen in source code and usage errors.

Recommended fail-closed guard pattern for capability probes: check torch.cuda.is_available() and torch.cuda.is_initialized() first, then verify device indices < torch.cuda.device_count(). Use try/except for probes, but prefer explicit guards:

```python
if torch.cuda.is_available():
    if torch.cuda.device_count() > 0:
        try:
            dev = torch.cuda.current_device()
            # further checks
        except RuntimeError:
            # fallback
            pass
else:
    # CPU fallback
    pass
```

This ensures fail-closed behavior (disable the CUDA path if unavailable) before probing capabilities like peer access.
Citations:
- 1: https://github.com/pytorch/pytorch/blob/master/torch/cuda/__init__.py
- 2: https://docs.pytorch.org/docs/2.6/_modules/torch/cuda.html
- 3: https://github.com/pytorch/pytorch/blob/v2.11.0/torch/cuda/__init__.py
- 4: RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED pytorch/pytorch#110000
- 5: RuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED caused by early CUDA init pytorch/pytorch#172500
- 6: Distributed inference on multi machine (error Invalid peer device id) vllm-project/vllm#2795
- 7: https://pytorch.org/docs/stable/generated/torch.cuda.current_device.html
- 8: https://docs.pytorch.org/docs/stable/generated/torch.cuda.can_device_access_peer.html
- 9: https://github.com/pytorch/pytorch/blob/main/torch/csrc/api/src/cuda.cpp
🏁 Script executed:
```sh
# Check file existence and get overall context
file_path="tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py"
if [ -f "$file_path" ]; then
  echo "=== File Stats ==="
  wc -l "$file_path"
  echo ""
  echo "=== First 50 lines (copyright header + context) ==="
  head -50 "$file_path"
  echo ""
  echo "=== Lines 19-35 (function in question) ==="
  sed -n '19,35p' "$file_path" | cat -n
else
  echo "File not found: $file_path"
fi
```
Repository: NVIDIA/TensorRT-LLM
Length of output: 2581
🏁 Script executed:
```sh
# Search for usages of check_cuda_p2p_access
rg "check_cuda_p2p_access" --type py -B 2 -A 2
```
Repository: NVIDIA/TensorRT-LLM
Length of output: 2441
Guard CUDA probe calls to fail closed instead of raising.
The function is intended to gracefully report P2P access unavailability so callers can fall back without triggering NVSHMEM initialization (which fatally exits on incompatible hardware). However, torch.cuda.current_device() (line 27) and torch.cuda.can_device_access_peer() (line 30) raise RuntimeError and AssertionError respectively when CUDA is unavailable or not initialized. These exceptions break the graceful fallback contract.
Proposed fix

```diff
 def check_cuda_p2p_access() -> bool:
     """Check if CUDA P2P access is available between all local GPUs.

     NVSHMEM (used by DeepEP) fatally exits the process when P2P access is
     unavailable, instead of raising a catchable exception. This check lets
     callers avoid triggering nvshmem init on incompatible hardware so the
     communication factory can fall back gracefully.
     """
+    if not torch.cuda.is_available():
+        return False
+
+    try:
+        current = torch.cuda.current_device()
+        num_devices = torch.cuda.device_count()
+    except RuntimeError:
+        return False
+
-    current = torch.cuda.current_device()
-    num_devices = torch.cuda.device_count()
-    for i in range(num_devices):
-        if i != current and not torch.cuda.can_device_access_peer(current, i):
-            return False
+    for peer_device in range(num_devices):
+        if peer_device == current:
+            continue
+        try:
+            if not torch.cuda.can_device_access_peer(current, peer_device):
+                return False
+        except (RuntimeError, AssertionError):
+            return False
     return True
```

Additionally, the file is missing the required NVIDIA copyright header.
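For reference, here is a hedged caller-side sketch of how the platform checks described in this PR might consume the guard. The is_platform_supported wiring below is an assumption based on the PR description; only check_cuda_p2p_access itself comes from this change.

```python
from tensorrt_llm._torch.modules.fused_moe.deep_ep_utils import check_cuda_p2p_access


def is_platform_supported() -> bool:
    """Simplified stand-in for DeepEP.is_platform_supported()."""
    # Probe CUDA P2P access before any NVSHMEM/DeepEP initialization; NVSHMEM
    # exits the process (no catchable exception) when P2P is unavailable.
    if not check_cuda_p2p_access():
        return False
    # ... remaining architecture / NVLink checks from the existing implementation
    return True
```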
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py` around lines 27 - 31,
Wrap the CUDA probing calls so they fail closed: surround the calls to
torch.cuda.current_device(), torch.cuda.device_count(), and
torch.cuda.can_device_access_peer(...) (the P2P probe block in deep_ep_utils.py)
with a try/except that catches RuntimeError and AssertionError and returns False
on any exception so callers can gracefully fall back; do not let these
exceptions propagate and trigger NVSHMEM initialization. Also add the required
NVIDIA copyright/header at the top of the file.
Summary
The Mamba hybrid cache was sized for the full `max_batch_size` without checking available GPU memory, causing OOM on large hybrid models (e.g., Qwen3-Next-80B) where the model itself consumed most GPU memory before cache allocation. Additionally, DeepEP's platform check relied on NVML NVLink status to infer P2P capability, which misreported availability inside containers, causing NVSHMEM to call `exit()` at the C level with no opportunity for graceful fallback.

For the memory issue, added a `_cap_mamba_max_batch_size` helper that estimates per-sequence Mamba state size (conv + SSM per layer, including speculative decoding overhead), queries free GPU memory, and caps `max_batch_size` to fit within a configurable fraction of available memory. For the DeepEP issue, added an explicit `check_cuda_p2p_access()` function using `torch.cuda.can_device_access_peer` to verify actual CUDA P2P connectivity between all local GPUs, and wired it into both `DeepEP.is_platform_supported()` and `DeepEPLowLatency.is_platform_supported()` so the communication factory can fall back to alternative backends instead of hitting an uncatchable process exit.

Test plan
Links
Summary by CodeRabbit
New Features
Improvements