
[None][test] refresh test constraints #13482

Open
crazydemo wants to merge 1 commit into NVIDIA:main from crazydemo:refresh

Conversation

@crazydemo
Collaborator

@crazydemo crazydemo commented Apr 27, 2026

Summary by CodeRabbit

Tests

  • Enhanced accuracy test coverage for Ada/Blackwell architectures with FP8/NVFP4 precision support.
  • Optimized Gemma3 1B accuracy testing requirements for Hopper and newer architectures.
  • Added performance benchmark test for Llama 3.1 8B model in MIG deployment context with concurrency scaling validation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
@crazydemo
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

This change adds hardware architecture gating to accuracy tests for Ada/Blackwell and Hopper GPUs, and introduces a new integration performance benchmark test that measures Llama 3.1 8B throughput and latency scaling under increasing concurrency in MIG mode.

Changes

Cohort / File(s) Summary
  • Accuracy Test Architecture Gating (tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py): Added Ada/Blackwell/Hopper architecture skip decorators and pytest parameter marks to FP8/NVFP4 accuracy tests; moved Gemma3 1B skip logic from method-level to class-level decoration.
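A class-level gating pattern like the one described might look as follows; the `get_sm_version` helper and the marker names are illustrative assumptions for this sketch, not the repository's actual utilities:

```python
import pytest

# Hypothetical helper: in the real repo a utility reports the GPU's SM version.
def get_sm_version():
    return 90  # pretend we are on Hopper (SM 90)

# Reusable skip markers gated on compute capability.
skip_pre_hopper = pytest.mark.skipif(
    get_sm_version() < 90, reason="requires Hopper (SM 90) or newer")
skip_non_ada_blackwell = pytest.mark.skipif(
    get_sm_version() not in (89, 100, 120),
    reason="requires Ada (SM 89) or Blackwell (SM 100/120)")

# Class-level decoration applies the gate to every test method at once,
# instead of repeating the skip marker on each individual method.
@skip_pre_hopper
class TestGemma3_1B:
    def test_accuracy(self):
        assert True
```

Moving the skip from method level to class level keeps the gate in one place, so new test methods inherit it automatically.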
  • Integration Performance Benchmark (tests/integration/defs/test_e2e.py): Added a new test function that executes BenchRunner benchmarks on Llama 3.1 8B in MIG mode across increasing concurrency levels, parses throughput and latency metrics, and validates monotonic throughput scaling (1.3x minimum per step).
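The concurrency sweep and scaling check described above can be sketched like this; the `run_benchmark` stub stands in for the real BenchRunner invocation, and the numbers are synthetic:

```python
# Sketch of the sweep: run at each concurrency level, then require each
# step to improve throughput by at least 1.3x over the previous step.
def run_benchmark(concurrency):
    # Stand-in for BenchRunner: throughput grows sublinearly, latency mildly.
    return {"throughput": 100.0 * concurrency ** 0.8,
            "latency_ms": 50.0 + 2.0 * concurrency}

concurrency_list = [1, 4, 16, 64]
results = {c: run_benchmark(c) for c in concurrency_list}

for idx, concurrency in enumerate(concurrency_list):
    throughput = float(results[concurrency]["throughput"])
    if idx > 0:
        prev = float(results[concurrency_list[idx - 1]]["throughput"])
        assert throughput > prev * 1.3, (
            f"Throughput is not increasing for concurrency {concurrency}")

# Summary table: concurrency vs. parsed metrics.
print("concurrency  throughput  latency_ms")
for c in concurrency_list:
    m = results[c]
    print(f"{c:>11}  {m['throughput']:>10.1f}  {m['latency_ms']:>10.1f}")
```

With the synthetic model above each step roughly triples throughput, so the 1.3x gate passes; the review comments below discuss whether that gate is robust on real hardware.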

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Performance Test
    participant BR as BenchRunner
    participant Model as Llama 3.1 8B Model
    participant MIG as MIG Context
    participant Parser as Metrics Parser
    participant Validator as Scaling Validator

    Test->>BR: Iterate concurrency levels
    loop For each concurrency
        BR->>MIG: Launch benchmark (streaming=False, pytorch_backend=True)
        MIG->>Model: Execute inference requests
        Model-->>MIG: Return results
        MIG-->>BR: Return benchmark metrics (dict)
        BR-->>Test: Metrics dict
        Test->>Parser: Extract throughput & latency
        Parser-->>Test: Parsed values
        Test->>Validator: Check throughput > 1.3x previous
        alt Scaling valid
            Validator-->>Test: Continue to next concurrency
        else Scaling failed
            Validator-->>Test: Fail test
        end
    end
    Test->>Test: Print summary table (concurrency vs. metrics)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description is entirely a template with no actual content filled in: no description, test coverage details, or explanation of the changes provided. Resolution: add a substantive description explaining what test constraints were refreshed and why, including test coverage details and checklist completion.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Title check (✅ Passed): The title clearly summarizes the main change: refreshing test constraints across multiple test files to add hardware architecture gating and new integration tests.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/test_e2e.py (1)

2629-2683: ⚠️ Potential issue | 🟠 Major

Register this new integration perf test in test-db and QA perf lists.

This adds a new benchmark-style integration test, but I don’t see corresponding test-list updates in this PR scope. Without list registration, scheduled/pre-merge coverage can miss it.

Please add/update:

  • tests/integration/test_lists/test-db/l0_perf.yml (or the appropriate l0_*.yml)
  • tests/integration/test_lists/qa/llm_perf_*.yml (or multinode perf list if applicable)

As per coding guidelines, performance/integration test changes must be reflected in both CI test-db and QA list files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/test_e2e.py` around lines 2629 - 2683, The new
benchmark integration test test_trtllm_bench_mig_launch in
tests/integration/defs/test_e2e.py is not registered in the CI test lists; add
entries for this test to the test-db and QA perf lists by updating
tests/integration/test_lists/test-db/l0_perf.yml (or the appropriate l0_*.yml)
and tests/integration/test_lists/qa/llm_perf_*.yml (or the multinode perf list)
to include the test path and any required tags/labels (e.g., l0, perf,
integration, multinode) so the CI and QA runners will execute it; ensure the
YAML entry references the test module and function name
(tests.integration.defs.test_e2e::test_trtllm_bench_mig_launch) and matches
existing formatting and gating rules in those files.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 56b52213-9959-4a73-a2f1-9e5d80d70c3a

📥 Commits

Reviewing files that changed from the base of the PR and between 035de5d and 67a10cc.

📒 Files selected for processing (3)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/defs/test_e2e.py

The inline comment below is attached to this snippet from tests/integration/defs/test_e2e.py:

```python
output = runner()
results[concurrency] = output

print(f"\n=== Benchmark Results Comparison ===")
```
Contributor


⚠️ Potential issue | 🟡 Minor

Remove the placeholder-less f-string at Line 2656.

This triggers F541 and fails lint.

Proposed fix

```diff
-    print(f"\n=== Benchmark Results Comparison ===")
+    print("\n=== Benchmark Results Comparison ===")
```
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 2656-2656: f-string is missing placeholders

(F541)

🪛 Ruff (0.15.11)

[error] 2656-2656: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/test_e2e.py` at line 2656, Replace the
placeholder-less f-string print(f"\n=== Benchmark Results Comparison ===") with
a normal string literal or add a formatted placeholder; e.g., change it to
print("\n=== Benchmark Results Comparison ===") so the f-prefix is removed and
lint F541 is resolved; locate the exact call print(f"\n=== Benchmark Results
Comparison ===") in the test file and update it accordingly.

Comment on lines +2679 to +2682:

```python
if idx > 0:
    prev_throughput = float(results[concurrency_list[idx - 1]].get(
        'throughput', 0))
    assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}"
```
Contributor


⚠️ Potential issue | 🟠 Major

The stepwise 1.3x throughput assertion is too brittle for CI perf variance.

Requiring every jump to exceed 30% can fail on normal noise/saturation, especially in shared MIG environments.

Stabilization option

```diff
         if idx > 0:
             prev_throughput = float(results[concurrency_list[idx - 1]].get(
                 'throughput', 0))
-            assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}"
+            # Allow small variance between adjacent points.
+            assert throughput >= prev_throughput * 0.9, (
+                f"Throughput regressed for concurrency {concurrency_list[idx]}"
+            )
+
+    baseline = float(results[concurrency_list[0]].get('throughput', 0))
+    final_tp = float(results[concurrency_list[-1]].get('throughput', 0))
+    assert final_tp >= baseline * 1.3, "End-to-end scaling did not improve enough"
```

As per coding guidelines, “Flag tests that assert overly brittle behavior … unless the product contract requires it.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/test_e2e.py` around lines 2679 - 2682, The test's
strict stepwise assertion (throughput > prev_throughput * 1.3) is brittle;
change it to a tolerant check using a configurable relative threshold or
pytest.approx to allow normal CI variance: compute prev_throughput =
float(results[concurrency_list[idx - 1]].get('throughput', 0)) and then assert
throughput >= prev_throughput * (1.0 + RELATIVE_THRESHOLD) (e.g.,
RELATIVE_THRESHOLD = 0.05) or use assert throughput >
pytest.approx(prev_throughput, rel=0.05); make RELATIVE_THRESHOLD configurable
at top of the test file and handle the zero/near-zero prev_throughput case by
skipping the comparison or requiring an absolute minimum delta.
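The tolerant check the prompt describes can be sketched like so; `RELATIVE_THRESHOLD`, `check_scaling`, and the zero-baseline guard follow the reviewer's suggestion and are not existing code:

```python
RELATIVE_THRESHOLD = 0.05  # configurable minimum relative improvement per step

def check_scaling(prev_throughput, throughput, min_abs=1e-6):
    """Return True if throughput improved enough over prev_throughput.

    A near-zero baseline skips the relative comparison, as the reviewer
    suggests, since a ratio against ~0 is meaningless.
    """
    if prev_throughput < min_abs:
        return True  # skip comparison when the baseline is effectively zero
    return throughput >= prev_throughput * (1.0 + RELATIVE_THRESHOLD)

print(check_scaling(100.0, 106.0))  # 6% gain clears the 5% bar
print(check_scaling(100.0, 103.0))  # 3% gain does not
print(check_scaling(0.0, 50.0))     # baseline ~0: comparison skipped
```

A small relative threshold tolerates run-to-run noise in shared MIG environments while still catching genuine regressions.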

@tensorrt-cicd
Collaborator

PR_Github #45614 [ run ] triggered by Bot. Commit: 67a10cc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45614 [ run ] completed with state SUCCESS. Commit: 67a10cc
/LLM/main/L0_MergeRequest_PR pipeline #35829 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

