[https://nvbugs/6017720][fix] Fix moe backend mismatch on Blackwell in perf test.#13470

Open
dominicshanshan wants to merge 1 commit into NVIDIA:main from dominicshanshan:user/shanshan/nvbug_6017720

Conversation

@dominicshanshan
Collaborator

@dominicshanshan dominicshanshan commented Apr 26, 2026

Summary by CodeRabbit

  • New Features

    • Added support for DeepSeek FP8 block-scale models with optimized performance configuration for advanced GPU architectures.
    • Enhanced GPU architecture detection for improved device compatibility.
  • Bug Fixes

    • Fixed model configuration label detection to work case-insensitively.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@dominicshanshan dominicshanshan requested a review from a team as a code owner April 26, 2026 07:05
@dominicshanshan dominicshanshan changed the title [https://nvbugs/6017720][fix]Fix moe backend mismatch on Blackwell in perf test. [https://nvbugs/6017720][fix] Fix moe backend mismatch on Blackwell in perf test. Apr 26, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 26, 2026

📝 Walkthrough

Walkthrough

Adds support for DeepSeek FP8 block-scale model handling by introducing a model-name allowlist and safe SM-version detection function. Updates model YAML configuration detection to match PyTorch model labels case-insensitively and conditionally inject the DEEPGEMM MoE backend for SM versions 100 or higher.

Changes

Cohort / File(s): DeepSeek FP8 Model Configuration — tests/integration/defs/perf/pytorch_model_config.py
Summary: Added the _DEEPSEEK_FP8_BLOCK_SCALE_MODELS constant and a _get_sm_version_safe() function for CUDA-independent SM version detection, and updated get_model_yaml_config to match PyTorch model labels case-insensitively and conditionally inject the DEEPGEMM MoE backend for SM ≥ 100.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description is largely empty, containing only the template placeholder text with no explanation of the issue, solution, or test coverage. Resolution: add a meaningful description explaining the MoE backend mismatch, why it occurs on Blackwell, how the fix resolves it, and the test coverage for these changes.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (3 passed)
  • Title check — ✅ Passed: The title clearly and specifically describes the fix for a MoE backend mismatch on Blackwell in the perf test, directly related to the code changes.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/perf/pytorch_model_config.py`:
- Around line 40-44: The helper _get_sm_version_safe() currently catches a bare
Exception which can hide unrelated bugs; change the except block to only catch
specific exceptions (e.g., except (ImportError, RuntimeError):) when calling
from tensorrt_llm._utils.get_sm_version() so import failures and CUDA runtime
errors are handled but other exceptions propagate; update the handler to return
0 for those specific exceptions (optionally bind the exception to a variable if
you want to log it) and remove the bare except Exception.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a0012d56-058d-4b28-816d-92a4f3786d60

📥 Commits

Reviewing files that changed from the base of the PR and between eeba2eb and 0d1a46f.

📒 Files selected for processing (1)
  • tests/integration/defs/perf/pytorch_model_config.py

Comment on lines +40 to +44
    try:
        from tensorrt_llm._utils import get_sm_version
        return get_sm_version()
    except Exception:
        return 0

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the get_sm_version function in tensorrt_llm._utils
fd -t f "_utils.py" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 950


🏁 Script executed:

# Search for get_sm_version implementation
rg "def get_sm_version" -A 15

Repository: NVIDIA/TensorRT-LLM

Length of output: 3106


🏁 Script executed:

# Also check the context of the pytorch_model_config.py file
cat -n tests/integration/defs/perf/pytorch_model_config.py | sed -n '35,50p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 680


Narrow _get_sm_version_safe() exception handling to specific exception types.

Line 43 catches bare Exception, which can mask unrelated bugs and hide unexpected failures in the SM100+ backend override path. Narrow the exception handling to import and runtime errors.

Proposed fix
 def _get_sm_version_safe() -> int:
-    try:
-        from tensorrt_llm._utils import get_sm_version
-        return get_sm_version()
-    except Exception:
+    try:
+        from tensorrt_llm._utils import get_sm_version
+    except ImportError:
         return 0
+
+    try:
+        return get_sm_version()
+    except (RuntimeError, OSError):
+        return 0

Per coding guidelines: "Avoid broad exception handling — catch specific exceptions, not bare except:" — torch.cuda.get_device_properties(0) raises RuntimeError when CUDA is unavailable or the device is invalid.

🧰 Tools
🪛 Ruff (0.15.11)

[warning] 43-43: Do not catch blind exception: Exception

(BLE001)

