
[https://nvbugs/6115036][fix] Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench#13498

Open
hyukn wants to merge 1 commit into NVIDIA:main from hyukn:user/yukunh/fix-nvfp4-engine-size-and-adp-batch-size

Conversation

@hyukn
Collaborator

@hyukn hyukn commented Apr 27, 2026

Summary by CodeRabbit

  • New Features

    • Added attention data parallelism (DP) support for improved memory allocation and batch sizing calculations.
    • Implemented checkpoint size tracking in model configurations for more accurate engine sizing.
  • Enhancements

    • Engine size estimation now uses checkpoint size metadata when available, with fallback to parameter count-based estimation.

Description

Two fixes for the autotuner in trtllm-bench:

Issue 1: NVFP4 engine size underestimated by ~2x

The previous calculation used param_count * byte_per_elem (with byte_per_elem=0.5 for NVFP4) to estimate engine size. This significantly underestimates the size of quantized models because NVFP4 checkpoints contain mixed dtypes: U8 for packed FP4 weights, BF16 for non-quantized layers (embeddings, norms), and F8_E4M3 for scale factors. The param_count reported by safetensors sums element counts across all dtypes, so multiplying by a single byte_per_elem is incorrect.
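To see why a single multiplier breaks down, here is a toy Python illustration; the element counts are made up and do not reflect the real DeepSeek-V3.2 tensor layout:

# Hypothetical NVFP4-style checkpoint; element counts are illustrative only.
tensor_elements = {"U8": 300e9, "BF16": 2e9, "F8_E4M3": 5e9}
bytes_per_elem = {"U8": 1, "BF16": 2, "F8_E4M3": 1}

param_count = sum(tensor_elements.values())                    # safetensors element counts summed across dtypes
buggy_gb = param_count * 0.5 / 1024**3                         # single byte_per_elem = 0.5
fixed_gb = sum(n * bytes_per_elem[d]
               for d, n in tensor_elements.items()) / 1024**3  # per-dtype accounting
print(f"buggy={buggy_gb:.1f} GB, fixed={fixed_gb:.1f} GB")     # roughly a 2x gap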

Fix: Compute checkpoint size from per-dtype byte sizes using safetensors metadata. Added SAFETENSORS_DTYPE_BYTES mapping and checkpoint_size_in_gb field to ModelConfig.
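For context, a minimal sketch of the per-dtype accounting follows. The SAFETENSORS_DTYPE_BYTES name and the checkpoint_size_in_gb field come from this PR; parsing the raw safetensors headers with the standard library is just one assumed way to read the per-tensor metadata, not necessarily how the patch does it, and the dtype table is abbreviated:

import json
import struct
from pathlib import Path

# Bytes per stored element (assumed subset of the SAFETENSORS_DTYPE_BYTES mapping).
SAFETENSORS_DTYPE_BYTES = {"U8": 1, "F8_E4M3": 1, "F16": 2, "BF16": 2, "F32": 4}

def checkpoint_size_in_gb(checkpoint_dir: str) -> float:
    """Sum per-dtype tensor bytes across all safetensors shards in a checkpoint."""
    total_bytes = 0
    for shard in Path(checkpoint_dir).glob("*.safetensors"):
        with open(shard, "rb") as f:
            header_len = struct.unpack("<Q", f.read(8))[0]  # 8-byte little-endian header size
            header = json.loads(f.read(header_len))         # name -> {dtype, shape, data_offsets}
        for name, info in header.items():
            if name == "__metadata__":
                continue
            n_elems = 1
            for dim in info["shape"]:
                n_elems *= dim
            total_bytes += n_elems * SAFETENSORS_DTYPE_BYTES[info["dtype"]]
    return total_bytes / (1024**3)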

Verified with DeepSeek-V3.2-NVFP4 on disk: buggy=183.70 GB vs fixed=386.54 GB (2.10x ratio).

Issue 2: Attention DP max_batch_size computed from pooled GPU memory instead of per-rank

When attention data parallelism is enabled, each TP rank independently manages its own KV cache. The autotuner was pooling GPU memory across all ranks (n_gpus = tp_size * pp_size), producing a total max_batch_size that exceeds what a single rank can handle. This caused OOM when each rank tried to allocate the full pooled cache (e.g., 360 GB on a 276.5 GB GPU).

Fix: When enable_attention_dp=True (read from the extra_llm_api_options YAML config), compute per-rank available memory (single_gpu * pp_size) and divide engine size by tp_size (since TP still shards model weights).
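A simplified sketch of that branch is below; the variable names follow the tuning.py hunk quoted later in this thread, but the helper itself is illustrative rather than the actual function signature:

def kv_cache_budget_gb(engine_size_gb: float, single_gpu_memory_gb: float,
                       tp_size: int, pp_size: int,
                       enable_attention_dp: bool) -> float:
    """Memory budget the KV cache (and hence max_batch_size) is sized against."""
    if enable_attention_dp:
        # Each TP rank manages its own KV cache: budget one GPU per PP stage
        # and charge only this rank's TP shard of the weights against it.
        total_gpu_memory_gb = single_gpu_memory_gb * pp_size
        engine_size_gb = engine_size_gb / tp_size
    else:
        # Previous behavior: pool memory across all tp_size * pp_size GPUs.
        total_gpu_memory_gb = single_gpu_memory_gb * tp_size * pp_size
    return total_gpu_memory_gb - engine_size_gb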

Test Coverage

  • No existing unit tests cover the autotuner code (bench/build/tuning.py, bench/build/dataclasses.py).
  • Verified with a standalone repro script using real DeepSeek-V3.2-NVFP4 safetensors metadata.
  • Integration coverage via trtllm-bench perf tests in CI (e.g., test_perf.py, test_perf_sanity.py).

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…tention DP batch size in trtllm-bench

Two fixes for the autotuner in trtllm-bench:

1. NVFP4 engine size estimation: The previous calculation used
   `param_count * byte_per_elem` which significantly underestimates
   quantized models (~2x for NVFP4). NVFP4 checkpoints contain mixed
   dtypes (U8 for packed weights, BF16 for non-quantized layers,
   F8_E4M3 for scales). Now computes checkpoint size from per-dtype
   byte sizes in safetensors metadata.

2. Attention DP max_batch_size: When attention data parallelism is
   enabled, each TP rank independently manages its own KV cache.
   The autotuner was pooling GPU memory across all ranks, producing
   a total max_batch_size, but pyexecutor expects per-rank limits.
   Now uses per-rank memory when enable_attention_dp is set in the
   extra_llm_api_options config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
@hyukn hyukn requested a review from a team as a code owner April 27, 2026 07:30
@hyukn hyukn requested a review from nv-yilinf April 27, 2026 07:30
@coderabbitai
Contributor

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

This change adds attention data parallelism (DP) support to the benchmarking pipeline. It introduces an enable_attention_dp flag that flows from configuration through settings to engine tuning, adds checkpoint size computation to model configurations, and adjusts GPU memory calculations in engine tuning to account for per-rank memory allocation when attention DP is enabled.

Changes

  • Configuration & Settings Flow (tensorrt_llm/bench/benchmark/utils/general.py, tensorrt_llm/bench/build/build.py): Added enable_attention_dp flag with default value False; the flag is optionally overridden from the YAML config and passed through the settings pipeline to the engine tuning logic.
  • Model Metadata (tensorrt_llm/bench/build/dataclasses.py): Added checkpoint_size_in_gb field to ModelConfig; replaced the get_param_count classmethod with get_param_count_and_checkpoint_size, which computes checkpoint bytes from safetensors metadata using a dtype-to-bytes lookup, with a fallback to the param-count formula for GPT-J.
  • Engine Tuning & Memory Calculation (tensorrt_llm/bench/build/tuning.py): Updated calc_engine_setting to accept the enable_attention_dp flag; replaced param-count-only estimation with checkpoint-size-based engine sizing when available; modified the GPU memory calculation to allocate memory per rank (get_device_memory() * pp_size) when attention DP is enabled, affecting batch/token limits accordingly.

Sequence Diagram

sequenceDiagram
    participant Config as Configuration/YAML
    participant Settings as get_settings()
    participant BuildEngine as get_benchmark_engine_settings()
    participant Tuning as calc_engine_setting()
    participant ModelCfg as ModelConfig

    Config->>Settings: Load config (YAML)
    Settings->>Settings: Read enable_attention_dp<br/>(default False, or from YAML)
    Settings->>BuildEngine: Pass enable_attention_dp flag
    BuildEngine->>Tuning: Forward enable_attention_dp<br/>+ model_config
    Tuning->>ModelCfg: Access checkpoint_size_in_gb
    Tuning->>Tuning: Calculate engine size<br/>(checkpoint-based vs param-based)
    alt enable_attention_dp = True
        Tuning->>Tuning: Allocate memory per-rank<br/>(get_device_memory() * pp_size)
        Tuning->>Tuning: Divide engine size by tp_size
    else enable_attention_dp = False
        Tuning->>Tuning: Use total GPU memory<br/>(n_gpus * get_device_memory())
    end
    Tuning->>Tuning: Compute max_batch_size<br/>& max_num_tokens
    Tuning->>BuildEngine: Return batch/token limits
    BuildEngine->>Settings: Return tuned settings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 62.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly and specifically summarizes the two main fixes, NVFP4 engine size estimation and attention DP batch size in trtllm-bench, with proper ticket format and type annotation.
  • Description check ✅ Passed: The PR description comprehensively covers the issues, solutions, verification details, and test coverage, following the template structure with all major sections populated.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/bench/benchmark/utils/general.py (1)

86-98: ⚠️ Potential issue | 🟡 Minor

Validate extra_llm_api_options before reading keys.

yaml.safe_load() can return None or a non-mapping value, and the subsequent .get(...) calls will fail with AttributeError. Since this file now controls enable_attention_dp sizing too, please fail fast with a typed config error here.

Suggested fix
     if extra_llm_api_options:
-        with open(extra_llm_api_options, 'r') as f:
+        with open(extra_llm_api_options, "r") as f:
             llm_args_dict = yaml.safe_load(f)
+        if not isinstance(llm_args_dict, dict):
+            raise TypeError(
+                f"extra_llm_api_options must contain a mapping, got {type(llm_args_dict)}"
+            )
             kv_cache_config = llm_args_dict.get("kv_cache_config", {
                 "dtype": "auto",
             })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/bench/benchmark/utils/general.py` around lines 86 - 98, Validate
the result of yaml.safe_load before using .get: after loading into llm_args_dict
(when extra_llm_api_options is provided) check that llm_args_dict is a mapping
(e.g., isinstance(llm_args_dict, dict)); if not, raise a clear typed
configuration error (ValueError or a ConfigError) explaining that
extra_llm_api_options must be a mapping. Then safely obtain kv_cache_config =
llm_args_dict.get("kv_cache_config", {"dtype": "auto"}) and validate
kv_cache_config is a mapping before calling kv_cache_config.get, and similarly
read enable_chunked_prefill and enable_attention_dp from the validated
llm_args_dict.
tensorrt_llm/bench/build/build.py (1)

28-37: ⚠️ Potential issue | 🟠 Major

Plumb enable_attention_dp into the build CLI path too.

This helper now supports attention-DP sizing, but build_command() below still calls it without enable_attention_dp, and this file does not expose any way to set that flag. As written, trtllm-bench build will keep using the old non-attention-DP heuristic and can still overestimate max_batch_size for attention-DP workloads.

Also applies to: 61-70

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/bench/build/build.py` around lines 28 - 37, The build path
currently calls get_benchmark_engine_settings without passing the new
enable_attention_dp flag, so add an enable_attention_dp parameter to the
build_command entry point (and the CLI parsing that constructs its args), thread
that boolean into the call to get_benchmark_engine_settings, and expose a CLI
flag (e.g., --enable-attention-dp) that sets it; ensure the default behavior
stays False when the flag is not provided and update any doc/help text for
build_command accordingly so the attention-DP sizing heuristic is used when
requested.
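For illustration, here is a hypothetical shape the suggested flag could take, assuming the click-style options used by the trtllm-bench CLI; the option name and wiring below sketch the reviewer's suggestion and are not part of this PR:

import click

@click.command()
@click.option("--enable-attention-dp", is_flag=True, default=False,
              help="Size max_batch_size per TP rank, assuming attention data parallelism.")
@click.option("--tp-size", type=int, default=1)
@click.option("--pp-size", type=int, default=1)
def build_command(enable_attention_dp: bool, tp_size: int, pp_size: int) -> None:
    # In the real build path this flag would be forwarded to
    # get_benchmark_engine_settings(..., enable_attention_dp=enable_attention_dp).
    click.echo(f"attention DP sizing enabled: {enable_attention_dp}")

if __name__ == "__main__":
    build_command()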

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 32d4e908-787c-4828-b5fa-47395a5228b9

📥 Commits

Reviewing files that changed from the base of the PR and between f3e458e and d0d51e2.

📒 Files selected for processing (4)
  • tensorrt_llm/bench/benchmark/utils/general.py
  • tensorrt_llm/bench/build/build.py
  • tensorrt_llm/bench/build/dataclasses.py
  • tensorrt_llm/bench/build/tuning.py

Comment on lines +220 to +223
        assert param_count, (f"Can't get valid parameter count for model: "
                             f"{hf_model_path or model_hf_name}.")

-        return param_count
+        return param_count, checkpoint_size_in_gb

⚠️ Potential issue | 🟡 Minor

Use a real exception for metadata sanity checks.

assert param_count disappears under python -O. If this path ever sees zero-count metadata in optimized runs, tuning can continue with an invalid engine-size estimate instead of failing fast. Please raise ValueError or RuntimeError here.

Suggested fix
-        assert param_count, (f"Can't get valid parameter count for model: "
-                             f"{hf_model_path or model_hf_name}.")
+        if not param_count:
+            raise ValueError(
+                f"Can't get valid parameter count for model: "
+                f"{hf_model_path or model_hf_name}."
+            )

As per coding guidelines, use built-in Python exception types; use exceptions for error handling, not return values.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        assert param_count, (f"Can't get valid parameter count for model: "
-                             f"{hf_model_path or model_hf_name}.")
-        return param_count, checkpoint_size_in_gb
+        if not param_count:
+            raise ValueError(
+                f"Can't get valid parameter count for model: "
+                f"{hf_model_path or model_hf_name}."
+            )
+        return param_count, checkpoint_size_in_gb
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/bench/build/dataclasses.py` around lines 220 - 223, Replace the
assertion check on param_count with a real exception: instead of using "assert
param_count, (...)" raise a built-in exception (ValueError or RuntimeError) when
param_count is falsy, including the same message that references hf_model_path
or model_hf_name; keep the subsequent "return param_count,
checkpoint_size_in_gb" unchanged so callers still receive the values when valid.

Comment on lines +74 to +89
    # Use checkpoint size from safetensors metadata for accurate estimation.
    # This is critical for quantized models (e.g. NVFP4) where different
    # tensors have different dtypes.
    if model_config.checkpoint_size_in_gb > 0:
        engine_size = model_config.checkpoint_size_in_gb
    else:
        # Fallback to param_count-based estimation for backward compatibility.
        engine_size = model_config.param_count * byte_per_elem / (1024**3)
    # With attention DP, each TP rank independently manages its own KV cache
    # and holds a TP shard of the model weights. Compute per-rank available
    # memory: single GPU memory minus per-rank engine size.
    if enable_attention_dp:
        total_gpu_memory = get_device_memory() * pp_size
        engine_size = engine_size / tp_size
    else:
        total_gpu_memory = get_device_memory() * n_gpus

⚠️ Potential issue | 🟡 Minor

Keep the later “per GPU” warning in the same accounting mode.

This branch switches to per-TP-rank budgeting, but the later Line 156 warning still divides by n_gpus = tp_size * pp_size. With attention DP enabled, that underreports per-GPU KV cache by tp_size, so the warning becomes misleading.

Suggested fix
     # Number of GPU used for this run.
     n_gpus = tp_size * pp_size
+    warning_gpu_count = n_gpus
@@
     if enable_attention_dp:
         total_gpu_memory = get_device_memory() * pp_size
         engine_size = engine_size / tp_size
+        warning_gpu_count = pp_size
     else:
         total_gpu_memory = get_device_memory() * n_gpus
@@
-    if cache_memory / n_gpus < 10.0:
+    if cache_memory / warning_gpu_count < 10.0:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/bench/build/tuning.py` around lines 74 - 89, The warning later
uses n_gpus as the divisor even when enable_attention_dp switches accounting to
per-TP-rank, which underreports per-GPU KV cache; fix by computing a consistent
divisor (e.g. per_rank_divisor = pp_size if enable_attention_dp else n_gpus) and
use that variable in the warning/calculation instead of always dividing by
n_gpus so the per-GPU/per-TP-rank math matches how total_gpu_memory and
engine_size were computed (refer to enable_attention_dp, engine_size,
total_gpu_memory, tp_size, pp_size, n_gpus).

