[https://nvbugs/6115036][fix] Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench#13498
Conversation
Two fixes for the autotuner in `trtllm-bench`:
1. NVFP4 engine size estimation: The previous calculation used `param_count * byte_per_elem`, which significantly underestimates quantized models (~2x for NVFP4). NVFP4 checkpoints contain mixed dtypes (U8 for packed weights, BF16 for non-quantized layers, F8_E4M3 for scales). Now computes checkpoint size from per-dtype byte sizes in safetensors metadata.
2. Attention DP max_batch_size: When attention data parallelism is enabled, each TP rank independently manages its own KV cache. The autotuner was pooling GPU memory across all ranks, producing a total max_batch_size, but pyexecutor expects per-rank limits. Now uses per-rank memory when `enable_attention_dp` is set in the `extra_llm_api_options` config.

Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
📝 Walkthrough
This change adds attention data parallelism (DP) support to the benchmarking pipeline.
Sequence Diagram
```mermaid
sequenceDiagram
participant Config as Configuration/YAML
participant Settings as get_settings()
participant BuildEngine as get_benchmark_engine_settings()
participant Tuning as calc_engine_setting()
participant ModelCfg as ModelConfig
Config->>Settings: Load config (YAML)
Settings->>Settings: Read enable_attention_dp<br/>(default False, or from YAML)
Settings->>BuildEngine: Pass enable_attention_dp flag
BuildEngine->>Tuning: Forward enable_attention_dp<br/>+ model_config
Tuning->>ModelCfg: Access checkpoint_size_in_gb
Tuning->>Tuning: Calculate engine size<br/>(checkpoint-based vs param-based)
alt enable_attention_dp = True
Tuning->>Tuning: Allocate memory per-rank<br/>(get_device_memory() * pp_size)
Tuning->>Tuning: Divide engine size by tp_size
else enable_attention_dp = False
Tuning->>Tuning: Use total GPU memory<br/>(n_gpus * get_device_memory())
end
Tuning->>Tuning: Compute max_batch_size<br/>& max_num_tokens
Tuning->>BuildEngine: Return batch/token limits
BuildEngine->>Settings: Return tuned settings
```
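For reference, the `extra_llm_api_options` file that the diagram reads `enable_attention_dp` from could look like the minimal sketch below; the key names follow this PR's description, and the rest of the file's contents are illustrative only:

```yaml
# Hypothetical extra_llm_api_options file passed to trtllm-bench.
# enable_attention_dp switches the autotuner to per-rank max_batch_size sizing.
enable_attention_dp: true
kv_cache_config:
  dtype: auto
```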
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/bench/benchmark/utils/general.py (1)
86-98: ⚠️ Potential issue | 🟡 Minor
Validate `extra_llm_api_options` before reading keys.
`yaml.safe_load()` can return `None` or a non-mapping value, and the subsequent `.get(...)` calls will fail with `AttributeError`. Since this file now controls `enable_attention_dp` sizing too, please fail fast with a typed config error here.
Suggested fix
```diff
 if extra_llm_api_options:
-    with open(extra_llm_api_options, 'r') as f:
+    with open(extra_llm_api_options, "r") as f:
         llm_args_dict = yaml.safe_load(f)
+        if not isinstance(llm_args_dict, dict):
+            raise TypeError(
+                f"extra_llm_api_options must contain a mapping, got {type(llm_args_dict)}"
+            )
     kv_cache_config = llm_args_dict.get("kv_cache_config", {
         "dtype": "auto",
     })
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/bench/benchmark/utils/general.py` around lines 86 - 98, validate the result of yaml.safe_load before using .get: after loading into llm_args_dict (when extra_llm_api_options is provided) check that llm_args_dict is a mapping (e.g., isinstance(llm_args_dict, dict)); if not, raise a clear typed configuration error (ValueError or a ConfigError) explaining that extra_llm_api_options must be a mapping. Then safely obtain kv_cache_config = llm_args_dict.get("kv_cache_config", {"dtype": "auto"}) and validate kv_cache_config is a mapping before calling kv_cache_config.get, and similarly read enable_chunked_prefill and enable_attention_dp from the validated llm_args_dict.
tensorrt_llm/bench/build/build.py (1)
28-37: ⚠️ Potential issue | 🟠 Major
Plumb `enable_attention_dp` into the build CLI path too.
This helper now supports attention-DP sizing, but `build_command()` below still calls it without `enable_attention_dp`, and this file does not expose any way to set that flag. As written, `trtllm-bench build` will keep using the old non-attention-DP heuristic and can still overestimate `max_batch_size` for attention-DP workloads.
Also applies to: 61-70
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/bench/build/build.py` around lines 28 - 37, The build path currently calls get_benchmark_engine_settings without passing the new enable_attention_dp flag, so add an enable_attention_dp parameter to the build_command entry point (and the CLI parsing that constructs its args), thread that boolean into the call to get_benchmark_engine_settings, and expose a CLI flag (e.g., --enable-attention-dp) that sets it; ensure the default behavior stays False when the flag is not provided and update any doc/help text for build_command accordingly so the attention-DP sizing heuristic is used when requested.
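For illustration only, the kind of plumbing this comment asks for could look like the standalone sketch below. It assumes a Click-based command and uses a stubbed `get_benchmark_engine_settings`; the flag name, option set, and signatures are assumptions, not the actual `trtllm-bench build` code:

```python
import click


def get_benchmark_engine_settings(tp_size: int, pp_size: int,
                                  enable_attention_dp: bool = False) -> dict:
    # Stand-in for the real tuning helper; only the keyword threading matters here.
    return {
        "tp_size": tp_size,
        "pp_size": pp_size,
        "enable_attention_dp": enable_attention_dp,
    }


@click.command("build")
@click.option("--tp_size", type=int, default=1)
@click.option("--pp_size", type=int, default=1)
@click.option("--enable_attention_dp", is_flag=True, default=False,
              help="Size max_batch_size per rank for attention-DP workloads.")
def build_command(tp_size: int, pp_size: int, enable_attention_dp: bool) -> None:
    # Default False keeps the existing pooled-memory heuristic.
    settings = get_benchmark_engine_settings(
        tp_size, pp_size, enable_attention_dp=enable_attention_dp)
    click.echo(settings)


if __name__ == "__main__":
    build_command()
```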
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 32d4e908-787c-4828-b5fa-47395a5228b9
📒 Files selected for processing (4)
- tensorrt_llm/bench/benchmark/utils/general.py
- tensorrt_llm/bench/build/build.py
- tensorrt_llm/bench/build/dataclasses.py
- tensorrt_llm/bench/build/tuning.py
```diff
     assert param_count, (f"Can't get valid parameter count for model: "
                          f"{hf_model_path or model_hf_name}.")

-    return param_count
+    return param_count, checkpoint_size_in_gb
```
Use a real exception for metadata sanity checks.
`assert param_count` disappears under `python -O`. If this path ever sees zero-count metadata in optimized runs, tuning can continue with an invalid engine-size estimate instead of failing fast. Please raise `ValueError` or `RuntimeError` here.
Suggested fix
```diff
-    assert param_count, (f"Can't get valid parameter count for model: "
-                         f"{hf_model_path or model_hf_name}.")
+    if not param_count:
+        raise ValueError(
+            f"Can't get valid parameter count for model: "
+            f"{hf_model_path or model_hf_name}."
+        )
```
As per coding guidelines, use built-in Python exception types; use exceptions for error handling, not return values.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    if not param_count:
        raise ValueError(
            f"Can't get valid parameter count for model: "
            f"{hf_model_path or model_hf_name}."
        )
    return param_count, checkpoint_size_in_gb
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/bench/build/dataclasses.py` around lines 220 - 223, Replace the
assertion check on param_count with a real exception: instead of using "assert
param_count, (...)" raise a built-in exception (ValueError or RuntimeError) when
param_count is falsy, including the same message that references hf_model_path
or model_hf_name; keep the subsequent "return param_count,
checkpoint_size_in_gb" unchanged so callers still receive the values when valid.
```python
    # Use checkpoint size from safetensors metadata for accurate estimation.
    # This is critical for quantized models (e.g. NVFP4) where different
    # tensors have different dtypes.
    if model_config.checkpoint_size_in_gb > 0:
        engine_size = model_config.checkpoint_size_in_gb
    else:
        # Fallback to param_count-based estimation for backward compatibility.
        engine_size = model_config.param_count * byte_per_elem / (1024**3)

    # With attention DP, each TP rank independently manages its own KV cache
    # and holds a TP shard of the model weights. Compute per-rank available
    # memory: single GPU memory minus per-rank engine size.
    if enable_attention_dp:
        total_gpu_memory = get_device_memory() * pp_size
        engine_size = engine_size / tp_size
    else:
        total_gpu_memory = get_device_memory() * n_gpus
```
Keep the later “per GPU” warning in the same accounting mode.
This branch switches to per-TP-rank budgeting, but the later warning at line 156 still divides by `n_gpus = tp_size * pp_size`. With attention DP enabled, that underreports per-GPU KV cache by `tp_size`, so the warning becomes misleading.
Suggested fix
```diff
     # Number of GPU used for this run.
     n_gpus = tp_size * pp_size
+    warning_gpu_count = n_gpus
@@
     if enable_attention_dp:
         total_gpu_memory = get_device_memory() * pp_size
         engine_size = engine_size / tp_size
+        warning_gpu_count = pp_size
     else:
         total_gpu_memory = get_device_memory() * n_gpus
@@
-    if cache_memory / n_gpus < 10.0:
+    if cache_memory / warning_gpu_count < 10.0:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tensorrt_llm/bench/build/tuning.py` around lines 74 - 89, The warning later
uses n_gpus as the divisor even when enable_attention_dp switches accounting to
per-TP-rank, which underreports per-GPU KV cache; fix by computing a consistent
divisor (e.g. per_rank_divisor = pp_size if enable_attention_dp else n_gpus) and
use that variable in the warning/calculation instead of always dividing by
n_gpus so the per-GPU/per-TP-rank math matches how total_gpu_memory and
engine_size were computed (refer to enable_attention_dp, engine_size,
total_gpu_memory, tp_size, pp_size, n_gpus).
Summary by CodeRabbit
New Features
Enhancements
Description
Two fixes for the autotuner in `trtllm-bench`:
Issue 1: NVFP4 engine size underestimated by ~2x
The previous calculation used `param_count * byte_per_elem` (with `byte_per_elem=0.5` for NVFP4) to estimate engine size. This significantly underestimates quantized models because NVFP4 checkpoints contain mixed dtypes: `U8` for packed FP4 weights, `BF16` for non-quantized layers (embeddings, norms), and `F8_E4M3` for scale factors. The `param_count` from safetensors sums element counts across all dtypes, so multiplying by a single `byte_per_elem` is incorrect.
Fix: Compute checkpoint size from per-dtype byte sizes using safetensors metadata. Added a `SAFETENSORS_DTYPE_BYTES` mapping and a `checkpoint_size_in_gb` field to `ModelConfig`.
Verified with DeepSeek-V3.2-NVFP4 on disk: buggy = 183.70 GB vs fixed = 386.54 GB (2.10x ratio).
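As a rough sketch of the per-dtype accounting described above (not the PR's actual implementation, which stores the result on `ModelConfig`; the helper name and the exact contents of the dtype table are assumptions), one way to compute checkpoint size directly from safetensors headers is:

```python
import json
import struct
from pathlib import Path

# Bytes per element keyed by safetensors dtype names; the real
# SAFETENSORS_DTYPE_BYTES table in the PR may cover more dtypes.
SAFETENSORS_DTYPE_BYTES = {
    "F32": 4, "F16": 2, "BF16": 2,
    "F8_E4M3": 1, "U8": 1, "I8": 1,
}


def checkpoint_size_in_gb(checkpoint_dir: str) -> float:
    """Sum per-tensor sizes across all *.safetensors shards using header metadata."""
    total_bytes = 0
    for shard in sorted(Path(checkpoint_dir).glob("*.safetensors")):
        with open(shard, "rb") as f:
            # A safetensors file starts with an 8-byte little-endian header length,
            # followed by a JSON header describing every tensor's dtype and shape.
            header_len = struct.unpack("<Q", f.read(8))[0]
            header = json.loads(f.read(header_len))
        for name, meta in header.items():
            if name == "__metadata__":
                continue
            n_elems = 1
            for dim in meta["shape"]:
                n_elems *= dim
            total_bytes += n_elems * SAFETENSORS_DTYPE_BYTES[meta["dtype"]]
    return total_bytes / 1024**3
```

Summing shape-by-dtype this way naturally handles mixed U8/BF16/F8_E4M3 checkpoints, whereas `param_count * byte_per_elem` collapses everything to one element width.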
Issue 2: Attention DP max_batch_size computed from pooled GPU memory instead of per-rank
When attention data parallelism is enabled, each TP rank independently manages its own KV cache. The autotuner was pooling GPU memory across all ranks (
n_gpus = tp_size * pp_size), producing a totalmax_batch_sizethat exceeds what a single rank can handle. This caused OOM when each rank tried to allocate the full pooled cache (e.g., 360 GB on a 276.5 GB GPU).Fix: When
enable_attention_dp=True(read from theextra_llm_api_optionsYAML config), compute per-rank available memory (single_gpu * pp_size) and divide engine size bytp_size(since TP still shards model weights).Test Coverage
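To make the two accounting modes concrete, here is a small self-contained sketch of the sizing arithmetic. The function is a simplified stand-in for the autotuner logic, and the `tp_size=8` configuration and rounded GPU capacity are illustrative; only the 386.54 GB checkpoint size and ~276.5 GB per-GPU figure come from the description above:

```python
def available_kv_cache_memory_gb(engine_size_gb: float, gpu_mem_gb: float,
                                 tp_size: int, pp_size: int,
                                 enable_attention_dp: bool) -> float:
    """Memory left for KV cache under the two accounting modes in this PR."""
    if enable_attention_dp:
        # Per-rank budget: one GPU per PP stage, weights sharded across TP ranks.
        total_gpu_memory = gpu_mem_gb * pp_size
        engine_size_gb = engine_size_gb / tp_size
    else:
        # Pooled budget across every GPU in the run.
        total_gpu_memory = gpu_mem_gb * tp_size * pp_size
    return total_gpu_memory - engine_size_gb


# Illustrative numbers (roughly the DeepSeek-V3.2-NVFP4 case described above):
# pooled accounting budgets the KV cache against the whole cluster's memory,
# while attention-DP accounting budgets it against a single rank.
print(available_kv_cache_memory_gb(386.54, 276.5, 8, 1, False))  # ~1825.5 GB total
print(available_kv_cache_memory_gb(386.54, 276.5, 8, 1, True))   # ~228.2 GB per rank
```

The pooled figure is what previously got turned into a single `max_batch_size` that every rank then tried to satisfy, which is how a per-rank allocation can exceed a single GPU's capacity.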
Test Coverage
- `bench/build/tuning.py`, `bench/build/dataclasses.py`
- `trtllm-bench` perf tests in CI (e.g., `test_perf.py`, `test_perf_sanity.py`)

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.