
Graph Surgery Framework for TRT-RTX#992

Open
hthadicherla wants to merge 3 commits into main from hthadicherla/graph-surgery-framework

Conversation

@hthadicherla
Contributor

@hthadicherla commented Mar 6, 2026

What does this PR do?

Type of change: new feature

Adds the ONNX Graph Surgery framework — graph-level transformations on exported ONNX models for optimized inference with ONNX Runtime.

GQA Attention Replacement — Replaces native attention subgraphs (Q/K/V projections, RoPE, softmax, KV cache) with a single fused GroupQueryAttention operator. Handles RoPE cache computation, attention mask reformatting, Q/K/V weight fusion, and KV cache I/O automatically from a HuggingFace model ID. Supports FP16/BF16, INT4/AWQ quantized, and combined QKV models.
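The RoPE cache computation mentioned above can be sketched as follows. This is a minimal illustration rather than the PR's implementation; the function name, default theta, and FP16 output dtype are assumptions:

```python
import numpy as np

def compute_rope_caches(max_seq_len: int, head_dim: int, theta: float = 10000.0):
    # Inverse frequencies over the even dimensions of each attention head.
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))
    positions = np.arange(max_seq_len, dtype=np.float32)
    angles = np.outer(positions, inv_freq)  # shape: (max_seq_len, head_dim // 2)
    # GroupQueryAttention consumes cos/sin caches as two separate initializers.
    return np.cos(angles).astype(np.float16), np.sin(angles).astype(np.float16)

cos_cache, sin_cache = compute_rope_caches(max_seq_len=4096, head_dim=64)
```

With `--max-seq-len 4096` and a head dimension of 64, each cache has shape (4096, 32).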

DequantizeLinear Weight Transpose — Transposes quantized weights in DequantizeLinear nodes to column-major layout for providers like NvTensorRtRtx. Handles INT4/UINT4 packed formats.
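For the packed 4-bit formats, the transpose boils down to unpack, transpose, repack. A minimal sketch under assumed conventions (unsigned nibbles, low nibble stored first):

```python
import numpy as np

def unpack_int4(packed: np.ndarray, shape: tuple) -> np.ndarray:
    # Two unsigned 4-bit values per byte, low nibble first.
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    flat = np.stack([lo, hi], axis=-1).reshape(-1)[: int(np.prod(shape))]
    return flat.reshape(shape)

def pack_int4(values: np.ndarray) -> np.ndarray:
    # Pack uint8 values (each < 16) back into bytes, low nibble first.
    flat = values.reshape(-1).astype(np.uint8)
    if flat.size % 2:
        flat = np.concatenate([flat, np.zeros(1, dtype=np.uint8)])
    return (flat[0::2] | (flat[1::2] << 4)).astype(np.uint8)

def transpose_packed_int4(packed: np.ndarray, shape: tuple) -> np.ndarray:
    # Unpack to elements, transpose to column-major layout, repack to bytes.
    return pack_int4(unpack_int4(packed, shape).T.copy())
```

The round trip matters because nibble boundaries shift under transposition, so the packed bytes cannot simply be permuted in place.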

Whisper Encoder Cross-Attention KV — Adds cross-attention K/V projection outputs to the Whisper encoder for the ONNX Runtime GenAI pipeline. Loads cross-attention weights from HuggingFace and generates genai_config.json.
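The added K/V outputs follow the usual MatMul, Reshape, Transpose chain; the shape flow can be sketched with toy sizes (real dimensions come from the HuggingFace config, and all names below are illustrative):

```python
import numpy as np

# Toy sizes standing in for the Whisper config's d_model / num_attention_heads.
batch, seq, hidden, num_heads = 1, 1500, 64, 4
head_size = hidden // num_heads

encoder_hidden = np.zeros((batch, seq, hidden), dtype=np.float32)
w_k = np.zeros((hidden, hidden), dtype=np.float32)  # cross-attention K weight

k = encoder_hidden @ w_k                          # the inserted MatMul node
k = k.reshape(batch, seq, num_heads, head_size)   # the inserted Reshape node
present_key_cross = k.transpose(0, 2, 1, 3)       # the inserted Transpose node
```

The final layout (batch, num_heads, seq, head_size) matches what the ONNX Runtime GenAI decoder expects for its cross-attention KV cache inputs.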

Usage

# GQA attention replacement
python -m modelopt.onnx.graph_surgery replace-gqa \
    -i model.onnx -o model_gqa.onnx \
    -m meta-llama/Llama-3.2-1B \
    --max-seq-len 4096 --dtype float16

# DequantizeLinear weight transpose
python -m modelopt.onnx.graph_surgery transpose-dq \
    -i model_quantized.onnx -o model_transposed.onnx

# Whisper encoder cross-attention KV
python -m modelopt.onnx.graph_surgery add-cross-kv \
    -i encoder_model.onnx -o encoder_with_kv.onnx \
    -m openai/whisper-large-v3-turbo

Testing

Added a test case for GQA graph surgery at tests/unit/onnx/test_gqa_graph_surgery.py. It builds a toy attention subgraph from native ops matching the real Optimum export pattern and applies GQA surgery to it. The end outputs of both models were compared and match exactly.
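The exact-match check reduces to a max/mean absolute-difference comparison. Sketched here with the logits stubbed in as arrays; the real test feeds identical inputs to both models through ONNX Runtime sessions:

```python
import numpy as np

# Stand-ins for session outputs from the original and GQA-transformed models.
orig_logits = np.array([-2906.0, 7704.0, 15248.0, 8488.0], dtype=np.float32)
gqa_logits = np.array([-2906.0, 7704.0, 15248.0, 8488.0], dtype=np.float32)

max_abs_diff = float(np.max(np.abs(orig_logits - gqa_logits)))
mean_abs_diff = float(np.mean(np.abs(orig_logits - gqa_logits)))
assert max_abs_diff == 0.0 and mean_abs_diff == 0.0
```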

tests/unit/onnx/test_gqa_graph_surgery.py::TestGQAGraphSurgery::test_gqa_node_exists PASSED
tests/unit/onnx/test_gqa_graph_surgery.py::TestGQAGraphSurgery::test_gqa_attributes PASSED
tests/unit/onnx/test_gqa_graph_surgery.py::TestGQAGraphSurgery::test_node_count_reduced PASSED
tests/unit/onnx/test_gqa_graph_surgery.py::TestGQAGraphSurgery::test_rotary_emb_nodes_removed PASSED
tests/unit/onnx/test_gqa_graph_surgery.py::TestGQAGraphSurgery::test_position_ids_removed PASSED
tests/unit/onnx/test_gqa_graph_surgery.py::TestGQAGraphSurgery::test_logits_match
  Original nodes: 106  ->  GQA nodes: 13
  Logits shape:   (1, 4, 64)
  Original[0,:4]: [-2906.  7704. 15248.  8488.]
  GQA     [0,:4]: [-2906.  7704. 15248.  8488.]
  Max  abs diff:  0.000000
  Mean abs diff:  0.000000
PASSED

Summary by CodeRabbit

Release Notes

  • New Features

    • Added command-line interface for ONNX graph surgery operations including attention pattern replacement, weight optimization, and dtype conversion.
    • Introduced GroupQueryAttention replacement for efficient attention mechanism transformation.
    • Added FP16 to BF16 data type conversion for ONNX models.
    • Implemented DequantizeLinear weight optimization for column-major storage format.
    • Added Whisper encoder cross-attention support with cache generation and configuration utilities.
  • Tests

    • Added comprehensive unit tests validating graph surgery transformations and numerical accuracy.

…-RTX and ORT-GenAI

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
@hthadicherla requested a review from a team as a code owner March 6, 2026 12:33
@hthadicherla requested a review from ajrasane March 6, 2026 12:33
@coderabbitai
Contributor

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive ONNX graph surgery package for model transformations. It includes five transformation/utility modules for attention replacement with GroupQueryAttention, cross-attention KV cache addition, FP16-to-BF16 conversion, and DequantizeLinear weight transposition, plus a command-line interface and supporting utilities.

Changes

Cohort / File(s) Summary
Package Initialization
modelopt/onnx/graph_surgery/__init__.py, modelopt/onnx/graph_surgery/utils/__init__.py, modelopt/onnx/graph_surgery/whisper/__init__.py
Public API exposure and re-exports of transformation functions and utilities (replace_attention_with_gqa, add_cross_kv_to_encoder, convert_fp16_to_bf16, transpose_dequantize_linear_weights) from internal submodules.
CLI Implementation
modelopt/onnx/graph_surgery/__main__.py
Command-line interface with subcommands for GQA replacement, cross-KV addition, BF16 conversion, DQ transposition, and attention pattern analysis; dispatches to corresponding handlers.
GQA Replacement Transformation
modelopt/onnx/graph_surgery/gqa_replacement.py
Comprehensive ONNX transformation to replace standard attention with GroupQueryAttention; handles dtype conversion, RoPE cache injection, Q/K/V fusion/packing, attention mask reformatting, bias fusion, and quantized model adjustments (1503 lines).
DequantizeLinear Transposition
modelopt/onnx/graph_surgery/dq_transpose.py
Graph surgery module for transposing DequantizeLinear weights to column-major storage; includes INT4/UINT4 packing utilities, tensor transposition logic, and initializer manipulation for 4-bit quantized models.
Dtype Conversion Utilities
modelopt/onnx/graph_surgery/utils/dtype_conversion.py
FP16-to-BF16 conversion module with array conversion, initializer processing, Constant node handling, and full model transformation with optional external data support.
Graph Manipulation Utilities
modelopt/onnx/graph_surgery/utils/graph_utils.py
General-purpose ONNX graph utilities including topological sorting, dtype detection, node/initializer queries, cleanup, tensor array conversions, and dtype migrations (615 lines).
RoPE and Whisper Utilities
modelopt/onnx/graph_surgery/utils/rope_cache.py, modelopt/onnx/graph_surgery/utils/whisper_utils.py
RoPE cache computation for rotary embeddings and Whisper configuration utilities for audio processor and GenAI config generation/persistence with HuggingFace auto-detection support.
Whisper Encoder Cross-KV
modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
Augments encoder models with cross-attention K/V cache outputs for ONNX Runtime GenAI compatibility; extracts weights from HuggingFace, constructs per-layer MatMul/bias/reshape/transpose subgraphs, and generates supporting configurations.
Unit Tests
tests/unit/onnx/test_gqa_graph_surgery.py
Comprehensive test suite for GQA replacement; constructs synthetic LLaMA-like ONNX models, validates GQA node attributes, verifies node count reduction, and checks numerical fidelity between original and transformed models.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant GQA as GQA Replacement
    participant Model as ONNX Model
    participant Cache as RoPE Caches
    participant Attn as Attention Fusion
    participant QKV as Q/K/V Handling
    participant GQANode as GQA Node Creation
    participant Save as Model Save

    User->>GQA: replace_attention_with_gqa(model_path, params)
    GQA->>Model: Load ONNX model
    GQA->>Model: Detect/convert dtype (FP16/BF16)
    GQA->>Cache: Compute RoPE cos/sin caches
    Cache-->>GQA: cos_cache, sin_cache initializers
    GQA->>Model: Add RoPE initializers to graph
    GQA->>Model: Format attention mask (seqlens_k, total_seq_len)
    loop Per Layer
        GQA->>Attn: Identify attention pattern (quantized/non-quantized)
        Attn->>QKV: Fuse/handle Q/K/V projections
        QKV->>GQANode: Create GroupQueryAttention node with inputs/outputs
        GQANode->>Model: Connect past/present KV caches
        GQANode->>Model: Wire GQA output to o_proj
    end
    GQA->>Model: Clean unused I/O and initializers
    GQA->>Save: Save model with external data (optional)
    Save-->>User: Modified ONNX model
sequenceDiagram
    participant User as User/CLI
    participant CrossKV as Cross-KV Encoder
    participant HF as HuggingFace
    participant Encoder as Encoder Model
    participant Layer as Per-Layer Processing
    participant Config as Config Generation

    User->>CrossKV: add_cross_kv_to_encoder(encoder_path, hf_model_id, ...)
    CrossKV->>HF: Load Whisper model from HuggingFace
    HF-->>CrossKV: Model weights & config (num_heads, head_size)
    CrossKV->>Encoder: Load encoder ONNX model
    CrossKV->>Encoder: Detect dtype (FP16/FP32/BF16)
    loop Per Decoder Layer
        Layer->>Layer: Extract cross-attention K/V weights from HF
        Layer->>Encoder: Create MatMul nodes (K, V projections)
        Layer->>Encoder: Add optional bias via Add nodes
        Layer->>Encoder: Insert Reshape & Transpose for output shape
        Layer->>Encoder: Create present_key_cross_X, present_value_cross_X outputs
    end
    CrossKV->>Encoder: Rename inputs (input_features → audio_features if needed)
    CrossKV->>Config: generate_genai_config(encoder, decoder filenames)
    CrossKV->>Config: generate_audio_processor_config(num_mel_bins)
    CrossKV->>Encoder: Save model with external data (optional)
    Encoder-->>User: Modified encoder + genai_config.json + audio_processor_config.json

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

  • Security Anti-Patterns ❌ Error: Functions loading transformers models do not expose trust_remote_code as a caller-configurable parameter, violating SECURITY.md requirements. Resolution: update the affected functions to accept a trust_remote_code: bool = False parameter and pass it explicitly to all from_pretrained() calls.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 79.69%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: The title 'Graph Surgery Framework for TRT-RTX' clearly and concisely summarizes the main change: introducing a new ONNX graph surgery framework. It is specific enough to convey the primary feature without being overly verbose.
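The security check's requested fix amounts to threading the flag through explicitly. A minimal sketch with the loader stubbed out; the real code would pass the flag to transformers' from_pretrained, and the function name below is illustrative:

```python
def from_pretrained(model_id, trust_remote_code=False):
    # Stand-in for the transformers loader, used only to show the threading.
    return {"model_id": model_id, "trust_remote_code": trust_remote_code}

def load_whisper_model(hf_model_id: str, trust_remote_code: bool = False):
    # The caller decides whether remote code may run; the default stays False.
    return from_pretrained(hf_model_id, trust_remote_code=trust_remote_code)
```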


Contributor

@coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/dq_transpose.py (1)

125-143: Consider consolidating duplicate helper functions.

These helper functions (_find_initializer, _find_node_by_output, _get_consumers) duplicate functionality in modelopt/onnx/graph_surgery/utils/graph_utils.py. Consider importing from there to reduce code duplication.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8fa348a4-244d-4101-9817-ae0b39af1d53

📥 Commits

Reviewing files that changed from the base of the PR and between 37d3f10 and 51fdee8.

📒 Files selected for processing (12)
  • modelopt/onnx/graph_surgery/__init__.py
  • modelopt/onnx/graph_surgery/__main__.py
  • modelopt/onnx/graph_surgery/dq_transpose.py
  • modelopt/onnx/graph_surgery/gqa_replacement.py
  • modelopt/onnx/graph_surgery/utils/__init__.py
  • modelopt/onnx/graph_surgery/utils/dtype_conversion.py
  • modelopt/onnx/graph_surgery/utils/graph_utils.py
  • modelopt/onnx/graph_surgery/utils/rope_cache.py
  • modelopt/onnx/graph_surgery/utils/whisper_utils.py
  • modelopt/onnx/graph_surgery/whisper/__init__.py
  • modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
  • tests/unit/onnx/test_gqa_graph_surgery.py

Comment on lines +926 to +933
    add_initializer(graph, "cos_cache", cos_cache, onnx_dtype)
    add_initializer(graph, "sin_cache", sin_cache, onnx_dtype)

    # Add value_info for cos/sin caches
    cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
    sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
    graph.value_info.extend([cos_cache_info, sin_cache_info])


⚠️ Potential issue | 🟡 Minor

cos_cache/sin_cache are registered twice in value_info.

add_initializer already appends corresponding value_info entries (modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 435-461). Re-adding them here creates duplicate metadata entries.

Suggested change
-    # Add value_info for cos/sin caches
-    cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
-    sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
-    graph.value_info.extend([cos_cache_info, sin_cache_info])

Comment on lines +1306 to +1316
            # 1) Convert scale initializer to float16
            for node in graph.node:
                if node.name == quant_mul_name:
                    scale_name = node.input[1]
                    for init in graph.initializer:
                        if init.name == scale_name and init.data_type == TensorProto.FLOAT:
                            arr = np.frombuffer(init.raw_data, dtype=np.float32).copy()
                            arr_fp16 = arr.astype(np.float16)
                            init.raw_data = arr_fp16.tobytes()
                            init.data_type = TensorProto.FLOAT16
                            if verbose:

⚠️ Potential issue | 🔴 Critical


Use onnx.numpy_helper to read and convert initializer data.

At lines 1312–1315, reading only init.raw_data via np.frombuffer() loses data when the initializer uses external storage or typed fields (e.g., float_data, int32_data). This corrupts the quantization scale, invalidating the quantized model.

Suggested change
-                            arr = np.frombuffer(init.raw_data, dtype=np.float32).copy()
-                            arr_fp16 = arr.astype(np.float16)
-                            init.raw_data = arr_fp16.tobytes()
-                            init.data_type = TensorProto.FLOAT16
+                            arr_fp16 = onnx.numpy_helper.to_array(init).astype(np.float16)
+                            converted_init = onnx.numpy_helper.from_array(arr_fp16, name=init.name)
+                            init.CopyFrom(converted_init)

Comment on lines +1398 to +1408
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

        if external_data_name is None:
            external_data_name = os.path.basename(output_path) + "_data"

    if use_external_data:
        if verbose:
            logger.info(f" Saving weights to external file: {external_data_name}")


⚠️ Potential issue | 🔴 Critical

external_data_name can remain None for cwd outputs.

If output_path has no directory component, the default external filename is never set, but use_external_data=True still tries to save with location=None.

Suggested change
     output_dir = os.path.dirname(output_path)
     if output_dir:
         os.makedirs(output_dir, exist_ok=True)

-        if external_data_name is None:
-            external_data_name = os.path.basename(output_path) + "_data"
+    if external_data_name is None:
+        external_data_name = os.path.basename(output_path) + "_data"
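The corrected logic can be sanity-checked with a small os.path experiment. The function name below is illustrative, and the makedirs call is omitted to keep the sketch side-effect free:

```python
import os

def default_external_name(output_path: str, external_data_name=None) -> str:
    # Fixed logic: the default no longer depends on output_path having a
    # directory component, so bare filenames in the cwd also get a name.
    if external_data_name is None:
        external_data_name = os.path.basename(output_path) + "_data"
    return external_data_name
```

For "model.onnx" in the current directory, os.path.dirname returns "", which is why a default nested under `if output_dir:` is never reached in the original code.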

Comment on lines +121 to +130
        # Handle value_float attribute
        if attr.name == "value_float":
            fp32_val = np.array([attr.f], dtype=np.float32)
            bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
            new_tensor = onnx.TensorProto()
            new_tensor.data_type = TensorProto.BFLOAT16
            new_tensor.raw_data = bf16_uint16.tobytes()
            attr.t.CopyFrom(new_tensor)
            attr.name = "value"
            return True

⚠️ Potential issue | 🟠 Major


Remove the value_float attribute conversion block.

The function converts FP16 constants to BF16, but value_float encodes native FP32 scalars. Converting them to BF16 introduces unnecessary precision loss (FP32→FP16→BF16) and changes semantics—these constants were not originally FP16 and should not be rewritten in an FP16-to-BF16 pass. The value attribute conversion correctly guards against this with a type check (line 105); value_float lacks this guard.

Suggested change
-        # Handle value_float attribute
-        if attr.name == "value_float":
-            fp32_val = np.array([attr.f], dtype=np.float32)
-            bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
-            new_tensor = onnx.TensorProto()
-            new_tensor.data_type = TensorProto.BFLOAT16
-            new_tensor.raw_data = bf16_uint16.tobytes()
-            attr.t.CopyFrom(new_tensor)
-            attr.name = "value"
-            return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py` around lines 121 -
130, Remove the special-case handling that rewrites "value_float" into BF16:
delete the branch that checks attr.name == "value_float" and the code that
creates new_tensor, sets data_type to TensorProto.BFLOAT16, converts via
fp16_to_bf16_array, assigns raw_data, calls attr.t.CopyFrom and renames
attr.name to "value"; leave the existing "value" attribute logic (the guarded
path that checks tensor element types) intact so native FP32 scalar attributes
are not downconverted by the FP16→BF16 pass.
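
The precision claim is easy to demonstrate. Below is a standalone NumPy sketch (not code from this PR; the truncation-based helpers are simplified stand-ins for fp16_to_bf16_array) contrasting a direct FP32 → BF16 conversion with the FP32 → FP16 → BF16 path used by the flagged branch:

```python
import numpy as np

def f32_to_bf16_bits(a):
    # Keep the upper 16 bits of the float32 pattern (truncation, no rounding)
    return (np.ascontiguousarray(a, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(bits):
    # Re-expand bfloat16 bit patterns to float32 for comparison
    return (bits.astype(np.uint32) << 16).view(np.float32)

x = np.array([1e-6, 3.14159265, 70000.0], dtype=np.float32)

# Direct FP32 -> BF16: BF16 shares FP32's 8-bit exponent, so the range survives
direct = bf16_bits_to_f32(f32_to_bf16_bits(x))

# FP32 -> FP16 -> BF16: 70000.0 overflows FP16's ~65504 max and becomes inf
via_fp16 = bf16_bits_to_f32(f32_to_bf16_bits(x.astype(np.float16).astype(np.float32)))

print(direct)    # all finite
print(via_fp16)  # last element is inf
```

The range clamp, not just mantissa rounding, is what makes the detour through FP16 unsafe for native FP32 scalars.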

Comment on lines +243 to +246

    # Compute head_size
    head_size = hidden_size // num_attention_heads


⚠️ Potential issue | 🟡 Minor

Guard against None values when hf_model_id is not provided.

If hf_model_id is None and the caller doesn't supply hidden_size or num_attention_heads, line 245 will raise a TypeError when attempting division with None.

🛡️ Proposed fix to add validation
+    # Validate required parameters
+    if hidden_size is None or num_attention_heads is None:
+        raise ValueError(
+            "hidden_size and num_attention_heads must be provided when hf_model_id is not specified"
+        )
+
     # Compute head_size
     head_size = hidden_size // num_attention_heads
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    # Validate required parameters
    if hidden_size is None or num_attention_heads is None:
        raise ValueError(
            "hidden_size and num_attention_heads must be provided when hf_model_id is not specified"
        )

    # Compute head_size
    head_size = hidden_size // num_attention_heads
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py` around lines 243 - 246,
Guard against None for hidden_size/num_attention_heads before computing
head_size: check that hidden_size and num_attention_heads are not None (and
num_attention_heads != 0) when hf_model_id is None, and if they are missing
raise a clear ValueError indicating that either hf_model_id must be provided or
both hidden_size and num_attention_heads must be supplied; update the code
around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
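
As a standalone illustration of the suggested guard (a hypothetical helper, not the actual whisper_utils API), the failure mode and the fix look like this:

```python
def compute_head_size(hidden_size, num_attention_heads):
    # Hypothetical helper mirroring the validation proposed above; without the
    # guard, hidden_size // num_attention_heads raises TypeError when either is None.
    if hidden_size is None or num_attention_heads is None:
        raise ValueError(
            "hidden_size and num_attention_heads must be provided "
            "when hf_model_id is not specified"
        )
    if num_attention_heads == 0:
        raise ValueError("num_attention_heads must be non-zero")
    return hidden_size // num_attention_heads

print(compute_head_size(2048, 32))  # 64

try:
    compute_head_size(None, None)
except ValueError as exc:
    print(f"rejected: {exc}")
```

A ValueError with an actionable message is far easier to debug than a bare TypeError raised from the division itself.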

Comment on lines +37 to +38
from ..utils.graph_utils import detect_model_dtype


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

find modelopt -name "encoder_cross_kv.py" -type f
find modelopt -name "graph_utils.py" -o -name "dtype_conversion.py" | head -20
cd modelopt && find . -type f -name "*.py" | grep -E "(graph_utils|dtype_conversion)" | head -20
wc -l modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
cat -n modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
cat -n modelopt/onnx/graph_surgery/utils/graph_utils.py
head -100 modelopt/onnx/graph_surgery/utils/dtype_conversion.py
sed -n '180,195p' modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
sed -n '215,240p' modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
tail -50 modelopt/onnx/graph_surgery/utils/dtype_conversion.py
rg "array_to_initializer" modelopt --type py -B 2 -A 2
sed -n '1,50p' modelopt/onnx/graph_surgery/gqa_replacement.py | head -60
grep -B 5 -A 10 "is_bfloat16" modelopt/onnx/graph_surgery/gqa_replacement.py | head -60
grep -B 3 -A 3 "fp16_to_bf16_array" modelopt/onnx/graph_surgery/utils/graph_utils.py


Add BF16-aware initializer serialization for cross-attention K/V weights and biases.

detect_model_dtype allows detecting BF16 models, but these initializers are always created with numpy_helper.from_array(...). Since NumPy doesn't support bfloat16, when the model is detected as BF16 the weights remain as FP32 arrays, causing numpy_helper.from_array() to create FLOAT TensorProtos instead of BFLOAT16. This dtype mismatch will cause inference failures.

Use array_to_initializer with the is_bfloat16 flag and convert float32 → float16 → bfloat16 via fp16_to_bf16_array when onnx_dtype == TensorProto.BFLOAT16.

Suggested change
-from ..utils.graph_utils import detect_model_dtype
+from ..utils.graph_utils import array_to_initializer, detect_model_dtype
+from ..utils.dtype_conversion import fp16_to_bf16_array
@@
-        new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
-        new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+        if onnx_dtype == TensorProto.BFLOAT16:
+            k_bf16 = fp16_to_bf16_array(k_weight_t.astype(np.float16))
+            v_bf16 = fp16_to_bf16_array(v_weight_t.astype(np.float16))
+            new_initializers.append(array_to_initializer(k_bf16, k_weight_name, is_bfloat16=True))
+            new_initializers.append(array_to_initializer(v_bf16, v_weight_name, is_bfloat16=True))
+        else:
+            new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
+            new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
@@
-            new_initializers.append(
-                numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
-            )
+            if onnx_dtype == TensorProto.BFLOAT16:
+                k_bias_bf16 = fp16_to_bf16_array(k_bias.astype(np.float16))
+                new_initializers.append(
+                    array_to_initializer(k_bias_bf16, k_bias_name, is_bfloat16=True)
+                )
+            else:
+                new_initializers.append(
+                    numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
+                )
@@
-            new_initializers.append(
-                numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
-            )
+            if onnx_dtype == TensorProto.BFLOAT16:
+                v_bias_bf16 = fp16_to_bf16_array(v_bias.astype(np.float16))
+                new_initializers.append(
+                    array_to_initializer(v_bias_bf16, v_bias_name, is_bfloat16=True)
+                )
+            else:
+                new_initializers.append(
+                    numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py` around lines 37 -
38, The cross-attention K/V weight and bias initializers are always converted
with numpy_helper.from_array so BF16 models detected by detect_model_dtype end
up with FLOAT proto types; update the serialization code that creates these
initializers (the places that currently call numpy_helper.from_array for
cross-attention key/value weights/biases) to use array_to_initializer with the
is_bfloat16 flag when onnx_dtype == TensorProto.BFLOAT16, and convert the
underlying float32 arrays via fp16_to_bf16_array (do float32 → float16 →
bfloat16) before passing to array_to_initializer so the TensorProto dtype
matches BF16 models.
