[https://nvbugs/6017720][fix] Fix moe backend mismatch on Blackwell in perf test.#13470

Open
dominicshanshan wants to merge 1 commit into NVIDIA:main from dominicshanshan:user/shanshan/nvbug_6017720

Conversation

@dominicshanshan
Collaborator

@dominicshanshan dominicshanshan commented Apr 26, 2026

Summary by CodeRabbit

  • New Features

    • Added support for DeepSeek FP8 block-scale models with optimized performance configuration for advanced GPU architectures.
    • Enhanced GPU architecture detection for improved device compatibility.
  • Bug Fixes

    • Fixed model configuration label detection to work case-insensitively.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@dominicshanshan dominicshanshan requested a review from a team as a code owner April 26, 2026 07:05
@dominicshanshan dominicshanshan changed the title [https://nvbugs/6017720][fix]Fix moe backend mismatch on Blackwell in perf test. [https://nvbugs/6017720][fix] Fix moe backend mismatch on Blackwell in perf test. Apr 26, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 26, 2026

📝 Walkthrough

Walkthrough

Adds support for DeepSeek FP8 block-scale model handling by introducing a model-name allowlist and safe SM-version detection function. Updates model YAML configuration detection to match PyTorch model labels case-insensitively and conditionally inject the DEEPGEMM MoE backend for SM versions 100 or higher.

Changes

Cohort / File(s): DeepSeek FP8 Model Configuration — tests/integration/defs/perf/pytorch_model_config.py
Summary: Added the _DEEPSEEK_FP8_BLOCK_SCALE_MODELS constant and a _get_sm_version_safe() function for CUDA-independent SM version detection, and updated get_model_yaml_config to match PyTorch model labels case-insensitively and conditionally inject the DEEPGEMM MoE backend for SM ≥ 100.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description is largely empty, containing only the template placeholder text with no explanation of the issue, solution, or test coverage. Resolution: add a meaningful description explaining the MoE backend mismatch, why it occurs on Blackwell, how the fix resolves it, and the test coverage for these changes.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title clearly and specifically describes the fix for a MoE backend mismatch on Blackwell in the perf test, directly related to the code changes.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/pytorch_model_config.py`:
- Around line 40-44: The helper _get_sm_version_safe() currently catches a bare
Exception which can hide unrelated bugs; change the except block to only catch
specific exceptions (e.g., except (ImportError, RuntimeError):) when calling
from tensorrt_llm._utils.get_sm_version() so import failures and CUDA runtime
errors are handled but other exceptions propagate; update the handler to return
0 for those specific exceptions (optionally bind the exception to a variable if
you want to log it) and remove the bare except Exception.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a0012d56-058d-4b28-816d-92a4f3786d60

📥 Commits

Reviewing files that changed from the base of the PR and between eeba2eb and 0d1a46f.

📒 Files selected for processing (1)
  • tests/integration/defs/perf/pytorch_model_config.py

Comment on lines +40 to +44
    try:
        from tensorrt_llm._utils import get_sm_version
        return get_sm_version()
    except Exception:
        return 0

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the get_sm_version function in tensorrt_llm._utils
fd -t f "_utils.py" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 950


🏁 Script executed:

# Search for get_sm_version implementation
rg "def get_sm_version" -A 15

Repository: NVIDIA/TensorRT-LLM

Length of output: 3106


🏁 Script executed:

# Also check the context of the pytorch_model_config.py file
cat -n tests/integration/defs/perf/pytorch_model_config.py | sed -n '35,50p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 680


Narrow _get_sm_version_safe() exception handling to specific exception types.

Line 43 catches bare Exception, which can mask unrelated bugs and hide unexpected failures in the SM100+ backend override path. Narrow the exception handling to import and runtime errors.

Proposed fix
 def _get_sm_version_safe() -> int:
-    try:
-        from tensorrt_llm._utils import get_sm_version
-        return get_sm_version()
-    except Exception:
+    try:
+        from tensorrt_llm._utils import get_sm_version
+    except ImportError:
         return 0
+
+    try:
+        return get_sm_version()
+    except (RuntimeError, OSError):
+        return 0

Per coding guidelines: "Avoid broad exception handling — catch specific exceptions, not bare except:" — torch.cuda.get_device_properties(0) raises RuntimeError when CUDA is unavailable or the device is invalid.

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 43-43: Do not catch blind exception: Exception

(BLE001)

