
[https://nvbugs/6094068][fix] Cap Mamba cache max_batch_size for memory and add CUDA P2P check for DeepEP #13474

Open
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6094068

Conversation

tensorrt-cicd (Collaborator) commented Apr 26, 2026

Summary

  • Root cause: The MambaCacheManager allocated SSM state buffers based solely on max_batch_size without checking available GPU memory, causing OOM on large hybrid models (e.g., Qwen3-Next-80B) where the model itself consumed most GPU memory before cache allocation. Additionally, DeepEP's platform check relied on NVML NVLink status to infer P2P capability, which misreported availability inside containers, causing NVSHMEM to call exit() at the C level with no opportunity for graceful fallback.
  • Fix: Added a _cap_mamba_max_batch_size helper that estimates the per-sequence Mamba state size (conv + SSM state per layer, including speculative-decoding overhead), queries free GPU memory, and caps max_batch_size to fit within a configurable fraction of available memory (a rough sketch of this estimate follows this list). For the DeepEP issue, added an explicit check_cuda_p2p_access() function using torch.cuda.can_device_access_peer to verify actual CUDA P2P connectivity between all local GPUs, and wired it into both DeepEP.is_platform_supported() and DeepEPLowLatency.is_platform_supported() so the communication factory can fall back to alternative backends instead of hitting an uncatchable process exit.
  • Automated fix generated by repair-bot
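
A rough sketch of the capping estimate described above, for illustration only: the helper names and Mamba2-style state shapes below are assumptions, and the real _cap_mamba_max_batch_size additionally accounts for tensor-parallel sharding, layer masks, and speculative-decoding overhead.

import torch

def estimate_mamba_state_bytes_per_seq(d_state: int, d_conv: int,
                                        num_heads: int, n_groups: int,
                                        head_dim: int, num_layers: int,
                                        conv_dtype: torch.dtype = torch.bfloat16,
                                        ssm_dtype: torch.dtype = torch.float32) -> int:
    """Rough per-sequence Mamba cache footprint: conv + SSM state per layer."""
    d_inner = num_heads * head_dim
    # Assumed Mamba2-style conv state: (d_inner + 2 * n_groups * d_state)
    # channels, each retaining the last (d_conv - 1) inputs.
    conv_elems = (d_inner + 2 * n_groups * d_state) * (d_conv - 1)
    # SSM recurrent state: one (head_dim x d_state) matrix per head.
    ssm_elems = num_heads * head_dim * d_state
    per_layer = (conv_elems * conv_dtype.itemsize +  # dtype.itemsize needs PyTorch >= 2.1
                 ssm_elems * ssm_dtype.itemsize)
    return per_layer * num_layers

def cap_max_batch_size(max_batch_size: int, per_seq_bytes: int,
                       mem_fraction: float = 0.5) -> int:
    """Cap max_batch_size so the Mamba cache fits in a fraction of free memory."""
    free_bytes, _total = torch.cuda.mem_get_info()  # (free, total) on current device
    budget = int(free_bytes * mem_fraction)
    return max(1, min(max_batch_size, budget // per_seq_bytes))

Plugging in the numbers from the commit message below (72 GiB requested for 2048 sequences, so ~36 MiB per sequence, with ~36 GiB free and a 0.5 fraction) would cap the batch at roughly 512 sequences.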

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • New Features

    • Added CUDA peer-to-peer connectivity validation for platform compatibility checks
    • Implemented GPU memory-aware batch size optimization for Mamba models
  • Improvements

    • Enhanced platform support detection with stricter hardware requirements validation
    • Improved cache management robustness for edge case scenarios

[https://nvbugs/6094068][fix] Cap Mamba cache max_batch_size for memory and add CUDA P2P check for DeepEP

Fix OOM crash in MambaHybridCacheManager when running Qwen3-Next on H100
with attention_dp=True. The SSM state allocation tried to allocate 72 GiB
for 2048 sequences but only ~36 GiB was free after model loading.

Three changes:
1. Add _cap_max_batch_size_for_memory() to MambaHybridCacheManager that
   estimates per-sequence Mamba state size and caps max_batch_size to fit
   within half of free GPU memory.
2. Add check_cuda_p2p_access() to deep_ep_utils.py and integrate it into
   DeepEP.is_platform_supported() and DeepEPLowLatency.is_platform_supported()
   to prevent fatal NVSHMEM process exit when P2P access is unavailable
   (wiring sketched after this message).
3. Add Mamba capacity check in MambaHybridCacheManager.add_dummy_requests()
   so CUDA graph warmup gracefully skips batch sizes exceeding the capped
   Mamba cache capacity instead of raising RuntimeError.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
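
For change 2, the wiring into the platform checks might look roughly like this minimal sketch (assumptions: _meets_existing_requirements is a hypothetical stand-in for the pre-existing architecture/import checks, only check_cuda_p2p_access comes from this change, and the real method bodies in deep_ep.py may differ):

from tensorrt_llm._torch.modules.fused_moe.deep_ep_utils import \
    check_cuda_p2p_access

class DeepEP:

    @classmethod
    def _meets_existing_requirements(cls) -> bool:
        # Placeholder for the pre-existing checks (GPU arch, deep_ep import,
        # ...); the actual conditions live in deep_ep.py.
        return True

    @classmethod
    def is_platform_supported(cls) -> bool:
        """Feasibility check run before any NVSHMEM initialization.

        NVSHMEM (used by DeepEP) requires CUDA P2P access between local
        GPUs and calls exit() at the C level when it is missing, so the
        probe has to happen here, where the communication factory can
        still fall back to another backend.
        """
        if not cls._meets_existing_requirements():
            return False
        return check_cuda_p2p_access()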

coderabbitai Bot commented Apr 26, 2026

📝 Walkthrough

This PR adds CUDA P2P connectivity validation to DeepEP platform support checks and introduces max batch size capping for Mamba models based on GPU memory constraints. It includes a new utility function for verifying peer-to-peer GPU access availability across local GPUs.

Changes

• DeepEP Platform Support
  Files: tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py, tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
  Both modules now require CUDA P2P connectivity in addition to the existing requirements for is_platform_supported(). Updated to import and use check_cuda_p2p_access() in their platform feasibility checks; docstrings updated to document the NVSHMEM P2P requirement.
• DeepEP Utilities
  Files: tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
  New check_cuda_p2p_access() function that verifies peer-to-peer GPU access across all local GPUs using torch.cuda APIs.
• Mamba Cache Management
  Files: tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  Added max-batch-size capping that estimates the GPU memory needed for per-sequence Mamba state and reduces the batch size when the memory budget is insufficient. Updated add_dummy_requests() to return None when CUDA-graph warmup requests exceed cache limits, preventing allocation failures (a hypothetical consumer of this contract is sketched below).
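
To illustrate the new add_dummy_requests() contract, a hypothetical CUDA-graph warmup loop that consumes the None return could look like the following sketch (warmup_cuda_graphs and capture_fn are illustrative names, not the actual pyexecutor code):

from typing import Callable, Iterable

def warmup_cuda_graphs(cache_manager,
                       candidate_batch_sizes: Iterable[int],
                       capture_fn: Callable[[int], None]) -> None:
    """Capture CUDA graphs only for batch sizes the caches can hold."""
    for batch_size in sorted(candidate_batch_sizes):
        request_ids = list(range(batch_size))
        if cache_manager.add_dummy_requests(request_ids) is None:
            # The capped Mamba cache (or the KV cache) cannot hold this
            # many sequences; skip this graph size instead of letting the
            # allocation raise RuntimeError.
            continue
        capture_fn(batch_size)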

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage: ⚠️ Warning. Docstring coverage is 72.73%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
• Title check: ✅ Passed. The title accurately captures the two main changes, capping Mamba cache max_batch_size for memory and adding a CUDA P2P check for DeepEP, matching the file-level changes described in the summary.
• Description check: ✅ Passed. The description comprehensively explains both root causes, the implementations, and test verification. It clearly documents the Mamba OOM issue, the DeepEP P2P detection problem, and how each was fixed.
• Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
• Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)

52-56: Add return type annotation.

Per coding guidelines, all functions should have return type annotations.

 def _cap_mamba_max_batch_size(mamba_d_state, mamba_d_conv, mamba_num_heads,
                               mamba_n_groups, mamba_head_dim, mamba_num_layers,
                               mamba_layer_mask, mamba_cache_dtype,
                               mamba_ssm_cache_dtype, max_batch_size, mapping,
-                              spec_config):
+                              spec_config) -> int:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 52 - 56,
The function _cap_mamba_max_batch_size is missing a return type annotation;
update its signature to include an explicit return type that matches what the
function actually returns (inspect the function body to determine whether it
returns an int, tuple, dict, or other type) and annotate it accordingly (e.g.,
-> int or -> Tuple[...], using typing imports as needed) so the signature
correctly reflects the returned value(s).

781-790: Mamba resources are allocated before KV cache availability is confirmed.

On line 789, MambaCacheManager.add_dummy_requests allocates Mamba blocks. If the subsequent KVCacheManager.add_dummy_requests call on line 790 returns None due to insufficient KV blocks, the Mamba allocation persists while the method returns None, creating a partial resource state.

This is mitigated by the idempotent nature of Mamba allocation (reused requests hit the cache) and the fixed IDs used for dummy requests, which prevent resource leaks on retries. However, the code would be clearer if both capacities were verified before any allocation:

🔧 Optional improvement: pre-check both capacities
 def add_dummy_requests(self, request_ids: List[int], **kwargs):
     # Check Mamba cache capacity before allocation.  CUDA graph warmup
     # creates dummy requests for various batch sizes and expects None
     # (not an exception) when resources are insufficient so it can skip
     # oversized batch configurations gracefully.
     max_mamba = MambaCacheManager.get_max_resource_count(self)
     if len(request_ids) > max_mamba:
         return None
+    # Check KV cache availability before allocating any resources
+    available_kv_blocks = self.get_num_free_blocks()
+    if available_kv_blocks < 1:
+        return None
     MambaCacheManager.add_dummy_requests(self, request_ids)
     return KVCacheManager.add_dummy_requests(self, request_ids, **kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 781 -
790, The method add_dummy_requests currently calls
MambaCacheManager.add_dummy_requests before confirming KV capacity, risking a
partial allocation if KV add returns None; change it to first query both
capacities (call MambaCacheManager.get_max_resource_count(self) and the KV
equivalent, e.g. KVCacheManager.get_max_resource_count(self)), return None if
the requested count exceeds either, and only then call
MambaCacheManager.add_dummy_requests(self, request_ids) followed by
KVCacheManager.add_dummy_requests(self, request_ids, **kwargs) so allocations
only occur after both capacity checks pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21c04848-c3ce-46fa-b532-afd0fc25fc15

📥 Commits

Reviewing files that changed from the base of the PR and between eeba2eb and 7b91bcf.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
  • tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py

Comment on lines +27 to +31
current = torch.cuda.current_device()
num_devices = torch.cuda.device_count()
for i in range(num_devices):
    if i != current and not torch.cuda.can_device_access_peer(current, i):
        return False

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch, what exceptions can torch.cuda.current_device() and torch.cuda.can_device_access_peer() raise when CUDA is unavailable or not initialized, and what is the recommended fail-closed guard pattern for capability probes?

💡 Result:

torch.cuda.current_device() raises RuntimeError (wrapping CUDA driver errors such as "No CUDA GPUs are available", "device >= 0 && device < num_gpus INTERNAL ASSERT FAILED", or "CUDA Call Failed") or DeferredCudaCallError when CUDA is unavailable or not properly initialized; these occur during _lazy_init if CUDA initialization fails (e.g., no GPUs detected, driver issues). torch.cuda.can_device_access_peer(device, peer_device) raises AssertionError("Invalid peer device id") if either device index is invalid (e.g., >= device_count, or negative when CUDA is unavailable or uninitialized), as seen in the source code and usage errors. Recommended fail-closed guard pattern for capability probes: check torch.cuda.is_available() and torch.cuda.device_count() first, then wrap the probes in try/except, preferring explicit guards:

if torch.cuda.is_available() and torch.cuda.device_count() > 0:
    try:
        dev = torch.cuda.current_device()
        # further capability checks
    except RuntimeError:
        pass  # fall back
else:
    pass  # CPU fallback

This ensures the CUDA path fails closed (is disabled) before probing capabilities like peer access.



🏁 Script executed:

# Check file existence and get overall context
file_path="tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py"
if [ -f "$file_path" ]; then
  echo "=== File Stats ==="
  wc -l "$file_path"
  echo ""
  echo "=== First 50 lines (copyright header + context) ==="
  head -50 "$file_path"
  echo ""
  echo "=== Lines 19-35 (function in question) ==="
  sed -n '19,35p' "$file_path" | cat -n
else
  echo "File not found: $file_path"
fi

Repository: NVIDIA/TensorRT-LLM

Length of output: 2581


🏁 Script executed:

# Search for usages of check_cuda_p2p_access
rg "check_cuda_p2p_access" --type py -B 2 -A 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 2441


Guard CUDA probe calls to fail closed instead of raising.

The function is intended to gracefully report P2P access unavailability so callers can fall back without triggering NVSHMEM initialization (which fatally exits on incompatible hardware). However, torch.cuda.current_device() (line 27) and torch.cuda.can_device_access_peer() (line 30) raise RuntimeError and AssertionError respectively when CUDA is unavailable or not initialized. These exceptions break the graceful fallback contract.

Proposed fix
 def check_cuda_p2p_access() -> bool:
     """Check if CUDA P2P access is available between all local GPUs.
 
     NVSHMEM (used by DeepEP) fatally exits the process when P2P access is
     unavailable, instead of raising a catchable exception.  This check lets
     callers avoid triggering nvshmem init on incompatible hardware so the
     communication factory can fall back gracefully.
     """
+    if not torch.cuda.is_available():
+        return False
+
+    try:
+        current = torch.cuda.current_device()
+        num_devices = torch.cuda.device_count()
+    except RuntimeError:
+        return False
+
-    current = torch.cuda.current_device()
-    num_devices = torch.cuda.device_count()
-    for i in range(num_devices):
-        if i != current and not torch.cuda.can_device_access_peer(current, i):
-            return False
+    for peer_device in range(num_devices):
+        if peer_device == current:
+            continue
+        try:
+            if not torch.cuda.can_device_access_peer(current, peer_device):
+                return False
+        except RuntimeError:
+            return False
     return True

Additionally, the file is missing the required NVIDIA copyright header.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/deep_ep_utils.py` around lines 27 - 31,
Wrap the CUDA probing calls so they fail closed: surround the calls to
torch.cuda.current_device(), torch.cuda.device_count(), and
torch.cuda.can_device_access_peer(...) (the P2P probe block in deep_ep_utils.py)
with a try/except that catches RuntimeError and AssertionError and returns False
on any exception so callers can gracefully fall back; do not let these
exceptions propagate and trigger NVSHMEM initialization. Also add the required
NVIDIA copyright/header at the top of the file.

