
[None][test] refresh test constraints #13482

Open
crazydemo wants to merge 1 commit into NVIDIA:main from crazydemo:refresh

Conversation

@crazydemo
Collaborator

@crazydemo crazydemo commented Apr 27, 2026

Summary by CodeRabbit

Tests

  • Enhanced accuracy test coverage for Ada/Blackwell architectures with FP8/NVFP4 precision support.
  • Optimized Gemma3 1B accuracy testing requirements for Hopper and newer architectures.
  • Added performance benchmark test for Llama 3.1 8B model in MIG deployment context with concurrency scaling validation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
@crazydemo
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

This change adds hardware architecture gating to accuracy tests for Ada/Blackwell and Hopper GPUs, and introduces a new integration performance benchmark test that measures Llama 3.1 8B throughput and latency scaling under increasing concurrency in MIG mode.

Changes

Cohort / File(s) Summary
  • Accuracy Test Architecture Gating (tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py): Added Ada/Blackwell/Hopper architecture skip decorators and pytest parameter marks to FP8/NVFP4 accuracy tests; moved Gemma3 1B skip logic from method-level to class-level decoration.
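A class-level gating pattern like the one described might look as follows; the `get_sm_version` helper and the marker names are illustrative assumptions for this sketch, not the repository's actual utilities:

```python
import pytest

# Hypothetical helper: in the real repo a utility reports the GPU's SM version.
def get_sm_version():
    return 90  # pretend we are on Hopper (SM 90)

# Reusable skip markers gated on compute capability.
skip_pre_hopper = pytest.mark.skipif(
    get_sm_version() < 90, reason="requires Hopper (SM 90) or newer")
skip_non_ada_blackwell = pytest.mark.skipif(
    get_sm_version() not in (89, 100, 120),
    reason="requires Ada (SM 89) or Blackwell (SM 100/120)")

# Class-level decoration applies the gate to every test method at once,
# instead of repeating the skip marker on each individual method.
@skip_pre_hopper
class TestGemma3_1B:
    def test_accuracy(self):
        assert True
```

Moving the skip from method level to class level keeps the gate in one place, so new test methods inherit it automatically.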
  • Integration Performance Benchmark (tests/integration/defs/test_e2e.py): Added a new test function that executes BenchRunner benchmarks on Llama 3.1 8B in MIG mode across increasing concurrency levels, parses throughput and latency metrics, and validates monotonic throughput scaling (1.3x minimum per step).
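The concurrency sweep and scaling check described above can be sketched like this; the `run_benchmark` stub stands in for the real BenchRunner invocation, and the numbers are synthetic:

```python
# Sketch of the sweep: run at each concurrency level, then require each
# step to improve throughput by at least 1.3x over the previous step.
def run_benchmark(concurrency):
    # Stand-in for BenchRunner: throughput grows sublinearly, latency mildly.
    return {"throughput": 100.0 * concurrency ** 0.8,
            "latency_ms": 50.0 + 2.0 * concurrency}

concurrency_list = [1, 4, 16, 64]
results = {c: run_benchmark(c) for c in concurrency_list}

for idx, concurrency in enumerate(concurrency_list):
    throughput = float(results[concurrency]["throughput"])
    if idx > 0:
        prev = float(results[concurrency_list[idx - 1]]["throughput"])
        assert throughput > prev * 1.3, (
            f"Throughput is not increasing for concurrency {concurrency}")

# Summary table: concurrency vs. parsed metrics.
print("concurrency  throughput  latency_ms")
for c in concurrency_list:
    m = results[c]
    print(f"{c:>11}  {m['throughput']:>10.1f}  {m['latency_ms']:>10.1f}")
```

With the synthetic model above each step roughly triples throughput, so the 1.3x gate passes; the review comments below discuss whether that gate is robust on real hardware.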

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test as Performance Test
    participant BR as BenchRunner
    participant Model as Llama 3.1 8B Model
    participant MIG as MIG Context
    participant Parser as Metrics Parser
    participant Validator as Scaling Validator

    Test->>BR: Iterate concurrency levels
    loop For each concurrency
        BR->>MIG: Launch benchmark (streaming=False, pytorch_backend=True)
        MIG->>Model: Execute inference requests
        Model-->>MIG: Return results
        MIG-->>BR: Return benchmark metrics (dict)
        BR-->>Test: Metrics dict
        Test->>Parser: Extract throughput & latency
        Parser-->>Test: Parsed values
        Test->>Validator: Check throughput > 1.3x previous
        alt Scaling valid
            Validator-->>Test: Continue to next concurrency
        else Scaling failed
            Validator-->>Test: Fail test
        end
    end
    Test->>Test: Print summary table (concurrency vs. metrics)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): The PR description is entirely a template with no actual content filled in: no description, test coverage details, or explanation of the changes provided. Resolution: add a substantive description explaining what test constraints were refreshed and why, including test coverage details and checklist completion.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Title check (✅ Passed): The title clearly summarizes the main change: refreshing test constraints across multiple test files to add hardware architecture gating and new integration tests.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/test_e2e.py (1)

2629-2683: ⚠️ Potential issue | 🟠 Major

Register this new integration perf test in test-db and QA perf lists.

This adds a new benchmark-style integration test, but I don’t see corresponding test-list updates in this PR scope. Without list registration, scheduled/pre-merge coverage can miss it.

Please add/update:

  • tests/integration/test_lists/test-db/l0_perf.yml (or the appropriate l0_*.yml)
  • tests/integration/test_lists/qa/llm_perf_*.yml (or multinode perf list if applicable)

As per coding guidelines, performance/integration test changes must be reflected in both CI test-db and QA list files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/test_e2e.py` around lines 2629 - 2683, The new
benchmark integration test test_trtllm_bench_mig_launch in
tests/integration/defs/test_e2e.py is not registered in the CI test lists; add
entries for this test to the test-db and QA perf lists by updating
tests/integration/test_lists/test-db/l0_perf.yml (or the appropriate l0_*.yml)
and tests/integration/test_lists/qa/llm_perf_*.yml (or the multinode perf list)
to include the test path and any required tags/labels (e.g., l0, perf,
integration, multinode) so the CI and QA runners will execute it; ensure the
YAML entry references the test module and function name
(tests.integration.defs.test_e2e::test_trtllm_bench_mig_launch) and matches
existing formatting and gating rules in those files.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 56b52213-9959-4a73-a2f1-9e5d80d70c3a

📥 Commits

Reviewing files that changed from the base of the PR and between 035de5d and 67a10cc.

📒 Files selected for processing (3)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/defs/test_e2e.py

The inline comment below is attached to this snippet from tests/integration/defs/test_e2e.py:

```python
output = runner()
results[concurrency] = output

print(f"\n=== Benchmark Results Comparison ===")
```
Contributor


⚠️ Potential issue | 🟡 Minor

Remove the placeholder-less f-string at Line 2656.

This triggers F541 and fails lint.

Proposed fix

```diff
-    print(f"\n=== Benchmark Results Comparison ===")
+    print("\n=== Benchmark Results Comparison ===")
```
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 2656-2656: f-string is missing placeholders

(F541)

🪛 Ruff (0.15.11)

[error] 2656-2656: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/test_e2e.py` at line 2656, Replace the
placeholder-less f-string print(f"\n=== Benchmark Results Comparison ===") with
a normal string literal or add a formatted placeholder; e.g., change it to
print("\n=== Benchmark Results Comparison ===") so the f-prefix is removed and
lint F541 is resolved; locate the exact call print(f"\n=== Benchmark Results
Comparison ===") in the test file and update it accordingly.

Comment on lines +2679 to +2682:

```python
if idx > 0:
    prev_throughput = float(results[concurrency_list[idx - 1]].get(
        'throughput', 0))
    assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}"
```
Contributor


⚠️ Potential issue | 🟠 Major

The stepwise 1.3x throughput assertion is too brittle for CI perf variance.

Requiring every jump to exceed 30% can fail on normal noise/saturation, especially in shared MIG environments.

Stabilization option

```diff
         if idx > 0:
             prev_throughput = float(results[concurrency_list[idx - 1]].get(
                 'throughput', 0))
-            assert throughput > prev_throughput * 1.3, f"Throughput is not increasing for concurrency {concurrency_list[idx]}"
+            # Allow small variance between adjacent points.
+            assert throughput >= prev_throughput * 0.9, (
+                f"Throughput regressed for concurrency {concurrency_list[idx]}"
+            )
+
+    baseline = float(results[concurrency_list[0]].get('throughput', 0))
+    final_tp = float(results[concurrency_list[-1]].get('throughput', 0))
+    assert final_tp >= baseline * 1.3, "End-to-end scaling did not improve enough"
```

As per coding guidelines, “Flag tests that assert overly brittle behavior … unless the product contract requires it.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/test_e2e.py` around lines 2679 - 2682, The test's
strict stepwise assertion (throughput > prev_throughput * 1.3) is brittle;
change it to a tolerant check using a configurable relative threshold or
pytest.approx to allow normal CI variance: compute prev_throughput =
float(results[concurrency_list[idx - 1]].get('throughput', 0)) and then assert
throughput >= prev_throughput * (1.0 + RELATIVE_THRESHOLD) (e.g.,
RELATIVE_THRESHOLD = 0.05) or use assert throughput >
pytest.approx(prev_throughput, rel=0.05); make RELATIVE_THRESHOLD configurable
at top of the test file and handle the zero/near-zero prev_throughput case by
skipping the comparison or requiring an absolute minimum delta.
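The tolerant check the prompt describes can be sketched like so; `RELATIVE_THRESHOLD`, `check_scaling`, and the zero-baseline guard follow the reviewer's suggestion and are not existing code:

```python
RELATIVE_THRESHOLD = 0.05  # configurable minimum relative improvement per step

def check_scaling(prev_throughput, throughput, min_abs=1e-6):
    """Return True if throughput improved enough over prev_throughput.

    A near-zero baseline skips the relative comparison, as the reviewer
    suggests, since a ratio against ~0 is meaningless.
    """
    if prev_throughput < min_abs:
        return True  # skip comparison when the baseline is effectively zero
    return throughput >= prev_throughput * (1.0 + RELATIVE_THRESHOLD)

print(check_scaling(100.0, 106.0))  # 6% gain clears the 5% bar
print(check_scaling(100.0, 103.0))  # 3% gain does not
print(check_scaling(0.0, 50.0))     # baseline ~0: comparison skipped
```

A small relative threshold tolerates run-to-run noise in shared MIG environments while still catching genuine regressions.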

@tensorrt-cicd
Collaborator

PR_Github #45614 [ run ] triggered by Bot. Commit: 67a10cc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45614 [ run ] completed with state SUCCESS. Commit: 67a10cc
/LLM/main/L0_MergeRequest_PR pipeline #35829 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

