Conversation
…-RTX and ORT-GenAI
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
📝 Walkthrough

This pull request introduces a comprehensive ONNX graph surgery package for model transformations. It includes five transformation/utility modules for attention replacement with GroupQueryAttention, cross-attention KV cache addition, FP16-to-BF16 conversion, and DequantizeLinear weight transposition, plus a command-line interface and supporting utilities.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant GQA as GQA Replacement
    participant Model as ONNX Model
    participant Cache as RoPE Caches
    participant Attn as Attention Fusion
    participant QKV as Q/K/V Handling
    participant GQANode as GQA Node Creation
    participant Save as Model Save
    User->>GQA: replace_attention_with_gqa(model_path, params)
    GQA->>Model: Load ONNX model
    GQA->>Model: Detect/convert dtype (FP16/BF16)
    GQA->>Cache: Compute RoPE cos/sin caches
    Cache-->>GQA: cos_cache, sin_cache initializers
    GQA->>Model: Add RoPE initializers to graph
    GQA->>Model: Format attention mask (seqlens_k, total_seq_len)
    loop Per Layer
        GQA->>Attn: Identify attention pattern (quantized/non-quantized)
        Attn->>QKV: Fuse/handle Q/K/V projections
        QKV->>GQANode: Create GroupQueryAttention node with inputs/outputs
        GQANode->>Model: Connect past/present KV caches
        GQANode->>Model: Wire GQA output to o_proj
    end
    GQA->>Model: Clean unused I/O and initializers
    GQA->>Save: Save model with external data (optional)
    Save-->>User: Modified ONNX model
```
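The "Compute RoPE cos/sin caches" step in the diagram above can be sketched as follows. This is a minimal reference computation, not the package's actual implementation; the function name, the default base of 10000.0, and the [max_seq_len, head_dim // 2] cache layout are assumptions based on the common rotary-embedding convention used by GroupQueryAttention-style kernels.

```python
import numpy as np

def compute_rope_caches(max_seq_len: int, head_dim: int, theta: float = 10000.0):
    """Sketch of RoPE cos/sin cache computation.

    Returns (cos_cache, sin_cache), each of shape [max_seq_len, head_dim // 2].
    """
    # Inverse frequencies for each pair of channels: theta^(-2i/d)
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))
    positions = np.arange(max_seq_len, dtype=np.float32)
    # Outer product gives the rotation angle for every (position, frequency) pair
    angles = np.outer(positions, inv_freq)  # [max_seq_len, head_dim // 2]
    return np.cos(angles), np.sin(angles)

cos_cache, sin_cache = compute_rope_caches(max_seq_len=4096, head_dim=64)
```

These arrays would then be registered on the graph as the `cos_cache`/`sin_cache` initializers shown in the diagram.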
```mermaid
sequenceDiagram
    participant User as User/CLI
    participant CrossKV as Cross-KV Encoder
    participant HF as HuggingFace
    participant Encoder as Encoder Model
    participant Layer as Per-Layer Processing
    participant Config as Config Generation
    User->>CrossKV: add_cross_kv_to_encoder(encoder_path, hf_model_id, ...)
    CrossKV->>HF: Load Whisper model from HuggingFace
    HF-->>CrossKV: Model weights & config (num_heads, head_size)
    CrossKV->>Encoder: Load encoder ONNX model
    CrossKV->>Encoder: Detect dtype (FP16/FP32/BF16)
    loop Per Decoder Layer
        Layer->>Layer: Extract cross-attention K/V weights from HF
        Layer->>Encoder: Create MatMul nodes (K, V projections)
        Layer->>Encoder: Add optional bias via Add nodes
        Layer->>Encoder: Insert Reshape & Transpose for output shape
        Layer->>Encoder: Create present_key_cross_X, present_value_cross_X outputs
    end
    CrossKV->>Encoder: Rename inputs (input_features → audio_features if needed)
    CrossKV->>Config: generate_genai_config(encoder, decoder filenames)
    CrossKV->>Config: generate_audio_processor_config(num_mel_bins)
    CrossKV->>Encoder: Save model with external data (optional)
    Encoder-->>User: Modified encoder + genai_config.json + audio_processor_config.json
```
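The per-layer loop above (MatMul, optional Add, Reshape, Transpose) is numerically a standard projection into BNSH layout. A minimal sketch assuming shapes typical of a small Whisper encoder; the function and variable names are hypothetical, not taken from the package:

```python
import numpy as np

def cross_kv_projection(hidden, w, b, num_heads, head_size):
    """Reference computation for one present_{key,value}_cross_X output:
    MatMul -> optional Add -> Reshape -> Transpose, yielding
    [batch, num_heads, seq_len, head_size]."""
    batch, seq_len, _ = hidden.shape
    proj = hidden @ w.T                  # MatMul against the HF projection weight
    if b is not None:
        proj = proj + b                  # the optional Add (bias) node
    proj = proj.reshape(batch, seq_len, num_heads, head_size)  # Reshape
    return proj.transpose(0, 2, 1, 3)    # Transpose to BNSH

# Whisper-tiny-like dimensions (assumed): 1500 audio frames, hidden size 384
hidden = np.zeros((1, 1500, 384), dtype=np.float32)
w = np.zeros((384, 384), dtype=np.float32)
out = cross_kv_projection(hidden, w, None, num_heads=6, head_size=64)
```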
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/dq_transpose.py (1)
125-143: Consider consolidating duplicate helper functions.

These helper functions (`_find_initializer`, `_find_node_by_output`, `_get_consumers`) duplicate functionality in `modelopt/onnx/graph_surgery/utils/graph_utils.py`. Consider importing from there to reduce code duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/dq_transpose.py` around lines 125 - 143, The three helper functions _find_initializer, _find_node_by_output, and _get_consumers in dq_transpose.py duplicate utilities already provided in modelopt/onnx/graph_surgery/utils/graph_utils.py; remove these local implementations and import the corresponding functions from graph_utils (e.g., import find_initializer, find_node_by_output, get_consumers or their exact exported names) and update all local call sites in dq_transpose.py to use the imported functions (ensure names match or alias imports to keep existing call sites like _find_initializer, _find_node_by_output, _get_consumers working).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 926-933: The code is adding cos_cache and sin_cache to
graph.value_info twice; remove the manual creation and extension of
cos_cache_info and sin_cache_info and rely on add_initializer (from graph_utils)
to register the value_info; specifically delete the
helper.make_tensor_value_info(...) lines and the
graph.value_info.extend([cos_cache_info, sin_cache_info]) call in
gqa_replacement.py so only add_initializer creates the metadata, or if you must
keep them, first check for existing entries before extending to avoid
duplicates.
- Around line 1398-1408: external_data_name can stay None when output_path has
no directory, causing use_external_data to attempt saving with location=None;
ensure external_data_name is set whenever use_external_data is true and
external_data_name is None by assigning external_data_name =
os.path.basename(output_path) + "_data" outside (or after) the output_dir
conditional so that a default filename is created even for cwd outputs; update
the block around output_dir/external_data_name/use_external_data (references:
external_data_name, output_path, use_external_data) to set the default
unconditionally when needed.
- Around line 1306-1316: The current code reads initializer bytes with
np.frombuffer(init.raw_data) which can corrupt data for external/typed
initializers; instead use onnx.numpy_helper.to_array(init) to get a safe numpy
array, cast it to float16, then create a new TensorProto with
onnx.numpy_helper.from_array(arr_fp16, name=init.name) and replace the original
initializer in graph.initializer (or assign its raw_data/data_type from the new
TensorProto), ensuring you preserve init.name and dims; update the block that
inspects graph.node for quant_mul_name and scale_name to use
numpy_helper.to_array/from_array rather than direct raw_data access.
In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py`:
- Around line 121-130: Remove the special-case handling that rewrites
"value_float" into BF16: delete the branch that checks attr.name ==
"value_float" and the code that creates new_tensor, sets data_type to
TensorProto.BFLOAT16, converts via fp16_to_bf16_array, assigns raw_data, calls
attr.t.CopyFrom and renames attr.name to "value"; leave the existing "value"
attribute logic (the guarded path that checks tensor element types) intact so
native FP32 scalar attributes are not downconverted by the FP16→BF16 pass.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 243-246: Guard against None for hidden_size/num_attention_heads
before computing head_size: check that hidden_size and num_attention_heads are
not None (and num_attention_heads != 0) when hf_model_id is None, and if they
are missing raise a clear ValueError indicating that either hf_model_id must be
provided or both hidden_size and num_attention_heads must be supplied; update
the code around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 37-38: The cross-attention K/V weight and bias initializers are
always converted with numpy_helper.from_array so BF16 models detected by
detect_model_dtype end up with FLOAT proto types; update the serialization code
that creates these initializers (the places that currently call
numpy_helper.from_array for cross-attention key/value weights/biases) to use
array_to_initializer with the is_bfloat16 flag when onnx_dtype ==
TensorProto.BFLOAT16, and convert the underlying float32 arrays via
fp16_to_bf16_array (do float32 → float16 → bfloat16) before passing to
array_to_initializer so the TensorProto dtype matches BF16 models.
---
Nitpick comments:
In `@modelopt/onnx/graph_surgery/dq_transpose.py`:
- Around line 125-143: The three helper functions _find_initializer,
_find_node_by_output, and _get_consumers in dq_transpose.py duplicate utilities
already provided in modelopt/onnx/graph_surgery/utils/graph_utils.py; remove
these local implementations and import the corresponding functions from
graph_utils (e.g., import find_initializer, find_node_by_output, get_consumers
or their exact exported names) and update all local call sites in
dq_transpose.py to use the imported functions (ensure names match or alias
imports to keep existing call sites like _find_initializer,
_find_node_by_output, _get_consumers working).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8fa348a4-244d-4101-9817-ae0b39af1d53
📒 Files selected for processing (12)
- modelopt/onnx/graph_surgery/__init__.py
- modelopt/onnx/graph_surgery/__main__.py
- modelopt/onnx/graph_surgery/dq_transpose.py
- modelopt/onnx/graph_surgery/gqa_replacement.py
- modelopt/onnx/graph_surgery/utils/__init__.py
- modelopt/onnx/graph_surgery/utils/dtype_conversion.py
- modelopt/onnx/graph_surgery/utils/graph_utils.py
- modelopt/onnx/graph_surgery/utils/rope_cache.py
- modelopt/onnx/graph_surgery/utils/whisper_utils.py
- modelopt/onnx/graph_surgery/whisper/__init__.py
- modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
- tests/unit/onnx/test_gqa_graph_surgery.py
```python
add_initializer(graph, "cos_cache", cos_cache, onnx_dtype)
add_initializer(graph, "sin_cache", sin_cache, onnx_dtype)

# Add value_info for cos/sin caches
cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
graph.value_info.extend([cos_cache_info, sin_cache_info])
```
cos_cache/sin_cache are registered twice in value_info.
add_initializer already appends corresponding value_info entries (modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 435-461). Re-adding them here creates duplicate metadata entries.
Suggested change

```diff
-# Add value_info for cos/sin caches
-cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
-sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
-graph.value_info.extend([cos_cache_info, sin_cache_info])
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 926 - 933, The
code is adding cos_cache and sin_cache to graph.value_info twice; remove the
manual creation and extension of cos_cache_info and sin_cache_info and rely on
add_initializer (from graph_utils) to register the value_info; specifically
delete the helper.make_tensor_value_info(...) lines and the
graph.value_info.extend([cos_cache_info, sin_cache_info]) call in
gqa_replacement.py so only add_initializer creates the metadata, or if you must
keep them, first check for existing entries before extending to avoid
duplicates.
```python
# 1) Convert scale initializer to float16
for node in graph.node:
    if node.name == quant_mul_name:
        scale_name = node.input[1]
        for init in graph.initializer:
            if init.name == scale_name and init.data_type == TensorProto.FLOAT:
                arr = np.frombuffer(init.raw_data, dtype=np.float32).copy()
                arr_fp16 = arr.astype(np.float16)
                init.raw_data = arr_fp16.tobytes()
                init.data_type = TensorProto.FLOAT16
                if verbose:
```
🧩 Analysis chain: verification scripts were run against the NVIDIA/Model-Optimizer repository to inspect gqa_replacement.py (lines 1295-1325), its imports, initializer handling across modelopt/onnx, and the `array_to_initializer` / `initializer_to_array` helpers in graph_utils.py; script output is omitted here.
Use onnx.numpy_helper to read and convert initializer data.
At lines 1312–1315, reading only init.raw_data via np.frombuffer() loses data when the initializer uses external storage or typed fields (e.g., float_data, int32_data). This corrupts the quantization scale, invalidating the quantized model.
Suggested change

```diff
-                arr = np.frombuffer(init.raw_data, dtype=np.float32).copy()
-                arr_fp16 = arr.astype(np.float16)
-                init.raw_data = arr_fp16.tobytes()
-                init.data_type = TensorProto.FLOAT16
+                arr_fp16 = onnx.numpy_helper.to_array(init).astype(np.float16)
+                converted_init = onnx.numpy_helper.from_array(arr_fp16, name=init.name)
+                init.CopyFrom(converted_init)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1306 - 1316, The
current code reads initializer bytes with np.frombuffer(init.raw_data) which can
corrupt data for external/typed initializers; instead use
onnx.numpy_helper.to_array(init) to get a safe numpy array, cast it to float16,
then create a new TensorProto with onnx.numpy_helper.from_array(arr_fp16,
name=init.name) and replace the original initializer in graph.initializer (or
assign its raw_data/data_type from the new TensorProto), ensuring you preserve
init.name and dims; update the block that inspects graph.node for quant_mul_name
and scale_name to use numpy_helper.to_array/from_array rather than direct
raw_data access.
```python
output_dir = os.path.dirname(output_path)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)

    if external_data_name is None:
        external_data_name = os.path.basename(output_path) + "_data"

if use_external_data:
    if verbose:
        logger.info(f"  Saving weights to external file: {external_data_name}")
```
external_data_name can remain None for cwd outputs.
If output_path has no directory component, the default external filename is never set, but use_external_data=True still tries to save with location=None.
Suggested change

```diff
 output_dir = os.path.dirname(output_path)
 if output_dir:
     os.makedirs(output_dir, exist_ok=True)
-    if external_data_name is None:
-        external_data_name = os.path.basename(output_path) + "_data"
+if external_data_name is None:
+    external_data_name = os.path.basename(output_path) + "_data"
```

🤖 Prompt for AI Agents
+ external_data_name = os.path.basename(output_path) + "_data"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1398 - 1408,
external_data_name can stay None when output_path has no directory, causing
use_external_data to attempt saving with location=None; ensure
external_data_name is set whenever use_external_data is true and
external_data_name is None by assigning external_data_name =
os.path.basename(output_path) + "_data" outside (or after) the output_dir
conditional so that a default filename is created even for cwd outputs; update
the block around output_dir/external_data_name/use_external_data (references:
external_data_name, output_path, use_external_data) to set the default
unconditionally when needed.
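A minimal sketch of the corrected control flow, with the default moved out of the directory-only branch; the function name is hypothetical and only illustrates the fix:

```python
import os

def resolve_external_data_name(output_path, external_data_name=None):
    """Default the external-data filename unconditionally, so saving to the
    current working directory (empty dirname) still yields a valid location."""
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    # Moved out of the `if output_dir:` block: runs for bare filenames too.
    if external_data_name is None:
        external_data_name = os.path.basename(output_path) + "_data"
    return external_data_name
```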
```python
# Handle value_float attribute
if attr.name == "value_float":
    fp32_val = np.array([attr.f], dtype=np.float32)
    bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
    new_tensor = onnx.TensorProto()
    new_tensor.data_type = TensorProto.BFLOAT16
    new_tensor.raw_data = bf16_uint16.tobytes()
    attr.t.CopyFrom(new_tensor)
    attr.name = "value"
    return True
```
🧩 Analysis chain: verification scripts inspected modelopt/onnx/graph_surgery/utils/dtype_conversion.py (lines 1-150) in NVIDIA/Model-Optimizer; script output is omitted here.
Remove the value_float attribute conversion block.
The function converts FP16 constants to BF16, but value_float encodes native FP32 scalars. Converting them to BF16 introduces unnecessary precision loss (FP32→FP16→BF16) and changes semantics—these constants were not originally FP16 and should not be rewritten in an FP16-to-BF16 pass. The value attribute conversion correctly guards against this with a type check (line 105); value_float lacks this guard.
Suggested change

```diff
-        # Handle value_float attribute
-        if attr.name == "value_float":
-            fp32_val = np.array([attr.f], dtype=np.float32)
-            bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
-            new_tensor = onnx.TensorProto()
-            new_tensor.data_type = TensorProto.BFLOAT16
-            new_tensor.raw_data = bf16_uint16.tobytes()
-            attr.t.CopyFrom(new_tensor)
-            attr.name = "value"
-            return True
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py` around lines 121 -
130, Remove the special-case handling that rewrites "value_float" into BF16:
delete the branch that checks attr.name == "value_float" and the code that
creates new_tensor, sets data_type to TensorProto.BFLOAT16, converts via
fp16_to_bf16_array, assigns raw_data, calls attr.t.CopyFrom and renames
attr.name to "value"; leave the existing "value" attribute logic (the guarded
path that checks tensor element types) intact so native FP32 scalar attributes
are not downconverted by the FP16→BF16 pass.
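To see why the FP32→FP16→BF16 path loses information, here is a plausible stand-in for the repo's `fp16_to_bf16_array` helper (its real implementation may differ): widen to float32, then keep the high 16 bits of each word, which matches bfloat16's 1-8-7 sign/exponent/mantissa layout.

```python
import numpy as np

def fp16_to_bf16_array(arr_fp16):
    """Hypothetical stand-in: widen FP16 to FP32, then truncate each
    float32 word to its high 16 bits, which is the bfloat16 bit layout.
    Returns the raw uint16 words."""
    as_fp32 = arr_fp16.astype(np.float32)
    return (as_fp32.view(np.uint32) >> 16).astype(np.uint16)

# 1.0 survives (exactly representable in FP16, FP32, and BF16), but a
# native FP32 scalar like 0.1 would already be rounded by the FP16 hop,
# which is the review's objection to converting value_float this way.
words = fp16_to_bf16_array(np.array([1.0], dtype=np.float16))
```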
```python
# Compute head_size
head_size = hidden_size // num_attention_heads
```
Guard against None values when hf_model_id is not provided.
If hf_model_id is None and the caller doesn't supply hidden_size or num_attention_heads, line 245 will raise a TypeError when attempting division with None.
🛡️ Proposed fix to add validation

```diff
+    # Validate required parameters
+    if hidden_size is None or num_attention_heads is None:
+        raise ValueError(
+            "hidden_size and num_attention_heads must be provided when hf_model_id is not specified"
+        )
+
     # Compute head_size
     head_size = hidden_size // num_attention_heads
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py` around lines 243 - 246,
Guard against None for hidden_size/num_attention_heads before computing
head_size: check that hidden_size and num_attention_heads are not None (and
num_attention_heads != 0) when hf_model_id is None, and if they are missing
raise a clear ValueError indicating that either hf_model_id must be provided or
both hidden_size and num_attention_heads must be supplied; update the code
around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
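A minimal sketch of the guarded computation the review asks for; the function name and signature are illustrative, and the HF-config loading path (used when hf_model_id is supplied) is elided:

```python
def compute_head_size(hidden_size=None, num_attention_heads=None, hf_model_id=None):
    """Guarded head_size computation: when dimensions are not available
    (e.g. no hf_model_id to read them from), fail with a clear error
    instead of a TypeError on None division."""
    # In the real code, hf_model_id would be used to fetch these from the
    # HF config; that path is elided in this sketch.
    if hidden_size is None or num_attention_heads is None:
        raise ValueError(
            "Either hf_model_id must be provided, or both hidden_size and "
            "num_attention_heads must be supplied"
        )
    if num_attention_heads <= 0:
        raise ValueError("num_attention_heads must be positive")
    return hidden_size // num_attention_heads
```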
```python
from ..utils.graph_utils import detect_model_dtype
```
🧩 Analysis chain: verification scripts located encoder_cross_kv.py, graph_utils.py, and dtype_conversion.py in NVIDIA/Model-Optimizer, inspected the initializer-creation sites in encoder_cross_kv.py (around lines 180-240), and examined the `array_to_initializer` and `fp16_to_bf16_array` helpers; script output is omitted here.
Add BF16-aware initializer serialization for cross-attention K/V weights and biases.
detect_model_dtype allows detecting BF16 models, but these initializers are always created with numpy_helper.from_array(...). Since NumPy doesn't support bfloat16, when the model is detected as BF16 the weights remain as FP32 arrays, causing numpy_helper.from_array() to create FLOAT TensorProtos instead of BFLOAT16. This dtype mismatch will cause inference failures.
Use array_to_initializer with the is_bfloat16 flag and convert float32 → float16 → bfloat16 via fp16_to_bf16_array when onnx_dtype == TensorProto.BFLOAT16.
Suggested change

```diff
-from ..utils.graph_utils import detect_model_dtype
+from ..utils.graph_utils import array_to_initializer, detect_model_dtype
+from ..utils.dtype_conversion import fp16_to_bf16_array
@@
-            new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
-            new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+            if onnx_dtype == TensorProto.BFLOAT16:
+                k_bf16 = fp16_to_bf16_array(k_weight_t.astype(np.float16))
+                v_bf16 = fp16_to_bf16_array(v_weight_t.astype(np.float16))
+                new_initializers.append(array_to_initializer(k_bf16, k_weight_name, is_bfloat16=True))
+                new_initializers.append(array_to_initializer(v_bf16, v_weight_name, is_bfloat16=True))
+            else:
+                new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
+                new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
@@
-                new_initializers.append(
-                    numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
-                )
+                if onnx_dtype == TensorProto.BFLOAT16:
+                    k_bias_bf16 = fp16_to_bf16_array(k_bias.astype(np.float16))
+                    new_initializers.append(
+                        array_to_initializer(k_bias_bf16, k_bias_name, is_bfloat16=True)
+                    )
+                else:
+                    new_initializers.append(
+                        numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
+                    )
@@
-                new_initializers.append(
-                    numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
-                )
+                if onnx_dtype == TensorProto.BFLOAT16:
+                    v_bias_bf16 = fp16_to_bf16_array(v_bias.astype(np.float16))
+                    new_initializers.append(
+                        array_to_initializer(v_bias_bf16, v_bias_name, is_bfloat16=True)
+                    )
+                else:
+                    new_initializers.append(
+                        numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
+                    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py` around lines 37 -
38, The cross-attention K/V weight and bias initializers are always converted
with numpy_helper.from_array so BF16 models detected by detect_model_dtype end
up with FLOAT proto types; update the serialization code that creates these
initializers (the places that currently call numpy_helper.from_array for
cross-attention key/value weights/biases) to use array_to_initializer with the
is_bfloat16 flag when onnx_dtype == TensorProto.BFLOAT16, and convert the
underlying float32 arrays via fp16_to_bf16_array (do float32 → float16 →
bfloat16) before passing to array_to_initializer so the TensorProto dtype
matches BF16 models.
What does this PR do?
Type of change: new feature
Adds the ONNX Graph Surgery framework — graph-level transformations on exported ONNX models for optimized inference with ONNX Runtime.
GQA Attention Replacement — Replaces native attention subgraphs (Q/K/V projections, RoPE, softmax, KV cache) with a single fused `GroupQueryAttention` operator. Handles RoPE cache computation, attention mask reformatting, Q/K/V weight fusion, and KV cache I/O automatically from a HuggingFace model ID. Supports FP16/BF16, INT4/AWQ quantized, and combined QKV models.

DequantizeLinear Weight Transpose — Transposes quantized weights in `DequantizeLinear` nodes to column-major layout for providers like NvTensorRtRtx. Handles INT4/UINT4 packed formats.

Whisper Encoder Cross-Attention KV — Adds cross-attention K/V projection outputs to the Whisper encoder for the ONNX Runtime GenAI pipeline. Loads cross-attention weights from HuggingFace and generates `genai_config.json`.

Usage
Testing
Added test case for GQA graph surgery at `tests/unit/onnx/test_gqa_graph_surgery.py`. Builds a toy attention subgraph with native ops matching the real Optimum export pattern and applies GQA surgery on it. Compared end outputs of both models — outputs match exactly.

Summary by CodeRabbit
Release Notes
New Features
Tests