Conversation
…-RTX and ORT-GenAI
Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
📝 Walkthrough

This pull request introduces a comprehensive ONNX graph surgery package for model transformations. It includes five transformation/utility modules for attention replacement with GroupQueryAttention, cross-attention KV cache addition, FP16-to-BF16 conversion, and DequantizeLinear weight transposition, plus a command-line interface and supporting utilities.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant GQA as GQA Replacement
    participant Model as ONNX Model
    participant Cache as RoPE Caches
    participant Attn as Attention Fusion
    participant QKV as Q/K/V Handling
    participant GQANode as GQA Node Creation
    participant Save as Model Save
    User->>GQA: replace_attention_with_gqa(model_path, params)
    GQA->>Model: Load ONNX model
    GQA->>Model: Detect/convert dtype (FP16/BF16)
    GQA->>Cache: Compute RoPE cos/sin caches
    Cache-->>GQA: cos_cache, sin_cache initializers
    GQA->>Model: Add RoPE initializers to graph
    GQA->>Model: Format attention mask (seqlens_k, total_seq_len)
    loop Per Layer
        GQA->>Attn: Identify attention pattern (quantized/non-quantized)
        Attn->>QKV: Fuse/handle Q/K/V projections
        QKV->>GQANode: Create GroupQueryAttention node with inputs/outputs
        GQANode->>Model: Connect past/present KV caches
        GQANode->>Model: Wire GQA output to o_proj
    end
    GQA->>Model: Clean unused I/O and initializers
    GQA->>Save: Save model with external data (optional)
    Save-->>User: Modified ONNX model
```
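The "Compute RoPE cos/sin caches" step in the diagram above can be sketched as follows. This is a minimal reference computation, not the package's actual implementation; the function name, the default base of 10000.0, and the [max_seq_len, head_dim // 2] cache layout are assumptions based on the common rotary-embedding convention used by GroupQueryAttention-style kernels.

```python
import numpy as np

def compute_rope_caches(max_seq_len: int, head_dim: int, theta: float = 10000.0):
    """Sketch of RoPE cos/sin cache computation.

    Returns (cos_cache, sin_cache), each of shape [max_seq_len, head_dim // 2].
    """
    # Inverse frequencies for each pair of channels: theta^(-2i/d)
    inv_freq = 1.0 / (theta ** (np.arange(0, head_dim, 2, dtype=np.float32) / head_dim))
    positions = np.arange(max_seq_len, dtype=np.float32)
    # Outer product gives the rotation angle for every (position, frequency) pair
    angles = np.outer(positions, inv_freq)  # [max_seq_len, head_dim // 2]
    return np.cos(angles), np.sin(angles)

cos_cache, sin_cache = compute_rope_caches(max_seq_len=4096, head_dim=64)
```

These arrays would then be registered on the graph as the `cos_cache`/`sin_cache` initializers shown in the diagram.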
```mermaid
sequenceDiagram
    participant User as User/CLI
    participant CrossKV as Cross-KV Encoder
    participant HF as HuggingFace
    participant Encoder as Encoder Model
    participant Layer as Per-Layer Processing
    participant Config as Config Generation
    User->>CrossKV: add_cross_kv_to_encoder(encoder_path, hf_model_id, ...)
    CrossKV->>HF: Load Whisper model from HuggingFace
    HF-->>CrossKV: Model weights & config (num_heads, head_size)
    CrossKV->>Encoder: Load encoder ONNX model
    CrossKV->>Encoder: Detect dtype (FP16/FP32/BF16)
    loop Per Decoder Layer
        Layer->>Layer: Extract cross-attention K/V weights from HF
        Layer->>Encoder: Create MatMul nodes (K, V projections)
        Layer->>Encoder: Add optional bias via Add nodes
        Layer->>Encoder: Insert Reshape & Transpose for output shape
        Layer->>Encoder: Create present_key_cross_X, present_value_cross_X outputs
    end
    CrossKV->>Encoder: Rename inputs (input_features → audio_features if needed)
    CrossKV->>Config: generate_genai_config(encoder, decoder filenames)
    CrossKV->>Config: generate_audio_processor_config(num_mel_bins)
    CrossKV->>Encoder: Save model with external data (optional)
    Encoder-->>User: Modified encoder + genai_config.json + audio_processor_config.json
```
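The per-layer loop above (MatMul, optional Add, Reshape, Transpose) is numerically a standard projection into BNSH layout. A minimal sketch assuming shapes typical of a small Whisper encoder; the function and variable names are hypothetical, not taken from the package:

```python
import numpy as np

def cross_kv_projection(hidden, w, b, num_heads, head_size):
    """Reference computation for one present_{key,value}_cross_X output:
    MatMul -> optional Add -> Reshape -> Transpose, yielding
    [batch, num_heads, seq_len, head_size]."""
    batch, seq_len, _ = hidden.shape
    proj = hidden @ w.T                  # MatMul against the HF projection weight
    if b is not None:
        proj = proj + b                  # the optional Add (bias) node
    proj = proj.reshape(batch, seq_len, num_heads, head_size)  # Reshape
    return proj.transpose(0, 2, 1, 3)    # Transpose to BNSH

# Whisper-tiny-like dimensions (assumed): 1500 audio frames, hidden size 384
hidden = np.zeros((1, 1500, 384), dtype=np.float32)
w = np.zeros((384, 384), dtype=np.float32)
out = cross_kv_projection(hidden, w, None, num_heads=6, head_size=64)
```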
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (1)
modelopt/onnx/graph_surgery/dq_transpose.py (1)
125-143: Consider consolidating duplicate helper functions.

These helper functions (`_find_initializer`, `_find_node_by_output`, `_get_consumers`) duplicate functionality in `modelopt/onnx/graph_surgery/utils/graph_utils.py`. Consider importing from there to reduce code duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/graph_surgery/dq_transpose.py` around lines 125 - 143, The three helper functions _find_initializer, _find_node_by_output, and _get_consumers in dq_transpose.py duplicate utilities already provided in modelopt/onnx/graph_surgery/utils/graph_utils.py; remove these local implementations and import the corresponding functions from graph_utils (e.g., import find_initializer, find_node_by_output, get_consumers or their exact exported names) and update all local call sites in dq_transpose.py to use the imported functions (ensure names match or alias imports to keep existing call sites like _find_initializer, _find_node_by_output, _get_consumers working).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/graph_surgery/gqa_replacement.py`:
- Around line 926-933: The code is adding cos_cache and sin_cache to
graph.value_info twice; remove the manual creation and extension of
cos_cache_info and sin_cache_info and rely on add_initializer (from graph_utils)
to register the value_info; specifically delete the
helper.make_tensor_value_info(...) lines and the
graph.value_info.extend([cos_cache_info, sin_cache_info]) call in
gqa_replacement.py so only add_initializer creates the metadata, or if you must
keep them, first check for existing entries before extending to avoid
duplicates.
- Around line 1398-1408: external_data_name can stay None when output_path has
no directory, causing use_external_data to attempt saving with location=None;
ensure external_data_name is set whenever use_external_data is true and
external_data_name is None by assigning external_data_name =
os.path.basename(output_path) + "_data" outside (or after) the output_dir
conditional so that a default filename is created even for cwd outputs; update
the block around output_dir/external_data_name/use_external_data (references:
external_data_name, output_path, use_external_data) to set the default
unconditionally when needed.
- Around line 1306-1316: The current code reads initializer bytes with
np.frombuffer(init.raw_data) which can corrupt data for external/typed
initializers; instead use onnx.numpy_helper.to_array(init) to get a safe numpy
array, cast it to float16, then create a new TensorProto with
onnx.numpy_helper.from_array(arr_fp16, name=init.name) and replace the original
initializer in graph.initializer (or assign its raw_data/data_type from the new
TensorProto), ensuring you preserve init.name and dims; update the block that
inspects graph.node for quant_mul_name and scale_name to use
numpy_helper.to_array/from_array rather than direct raw_data access.
In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py`:
- Around line 121-130: Remove the special-case handling that rewrites
"value_float" into BF16: delete the branch that checks attr.name ==
"value_float" and the code that creates new_tensor, sets data_type to
TensorProto.BFLOAT16, converts via fp16_to_bf16_array, assigns raw_data, calls
attr.t.CopyFrom and renames attr.name to "value"; leave the existing "value"
attribute logic (the guarded path that checks tensor element types) intact so
native FP32 scalar attributes are not downconverted by the FP16→BF16 pass.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py`:
- Around line 243-246: Guard against None for hidden_size/num_attention_heads
before computing head_size: check that hidden_size and num_attention_heads are
not None (and num_attention_heads != 0) when hf_model_id is None, and if they
are missing raise a clear ValueError indicating that either hf_model_id must be
provided or both hidden_size and num_attention_heads must be supplied; update
the code around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py`:
- Around line 37-38: The cross-attention K/V weight and bias initializers are
always converted with numpy_helper.from_array so BF16 models detected by
detect_model_dtype end up with FLOAT proto types; update the serialization code
that creates these initializers (the places that currently call
numpy_helper.from_array for cross-attention key/value weights/biases) to use
array_to_initializer with the is_bfloat16 flag when onnx_dtype ==
TensorProto.BFLOAT16, and convert the underlying float32 arrays via
fp16_to_bf16_array (do float32 → float16 → bfloat16) before passing to
array_to_initializer so the TensorProto dtype matches BF16 models.
---
Nitpick comments:
In `@modelopt/onnx/graph_surgery/dq_transpose.py`:
- Around line 125-143: The three helper functions _find_initializer,
_find_node_by_output, and _get_consumers in dq_transpose.py duplicate utilities
already provided in modelopt/onnx/graph_surgery/utils/graph_utils.py; remove
these local implementations and import the corresponding functions from
graph_utils (e.g., import find_initializer, find_node_by_output, get_consumers
or their exact exported names) and update all local call sites in
dq_transpose.py to use the imported functions (ensure names match or alias
imports to keep existing call sites like _find_initializer,
_find_node_by_output, _get_consumers working).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8fa348a4-244d-4101-9817-ae0b39af1d53
📒 Files selected for processing (12)
- modelopt/onnx/graph_surgery/__init__.py
- modelopt/onnx/graph_surgery/__main__.py
- modelopt/onnx/graph_surgery/dq_transpose.py
- modelopt/onnx/graph_surgery/gqa_replacement.py
- modelopt/onnx/graph_surgery/utils/__init__.py
- modelopt/onnx/graph_surgery/utils/dtype_conversion.py
- modelopt/onnx/graph_surgery/utils/graph_utils.py
- modelopt/onnx/graph_surgery/utils/rope_cache.py
- modelopt/onnx/graph_surgery/utils/whisper_utils.py
- modelopt/onnx/graph_surgery/whisper/__init__.py
- modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py
- tests/unit/onnx/test_gqa_graph_surgery.py
```python
add_initializer(graph, "cos_cache", cos_cache, onnx_dtype)
add_initializer(graph, "sin_cache", sin_cache, onnx_dtype)

# Add value_info for cos/sin caches
cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
graph.value_info.extend([cos_cache_info, sin_cache_info])
```
cos_cache/sin_cache are registered twice in value_info.
add_initializer already appends corresponding value_info entries (modelopt/onnx/graph_surgery/utils/graph_utils.py, Lines 435-461). Re-adding them here creates duplicate metadata entries.
Suggested change

```diff
-# Add value_info for cos/sin caches
-cos_cache_info = helper.make_tensor_value_info("cos_cache", onnx_dtype, list(cos_cache.shape))
-sin_cache_info = helper.make_tensor_value_info("sin_cache", onnx_dtype, list(sin_cache.shape))
-graph.value_info.extend([cos_cache_info, sin_cache_info])
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 926 - 933, The
code is adding cos_cache and sin_cache to graph.value_info twice; remove the
manual creation and extension of cos_cache_info and sin_cache_info and rely on
add_initializer (from graph_utils) to register the value_info; specifically
delete the helper.make_tensor_value_info(...) lines and the
graph.value_info.extend([cos_cache_info, sin_cache_info]) call in
gqa_replacement.py so only add_initializer creates the metadata, or if you must
keep them, first check for existing entries before extending to avoid
duplicates.
```python
# 1) Convert scale initializer to float16
for node in graph.node:
    if node.name == quant_mul_name:
        scale_name = node.input[1]
        for init in graph.initializer:
            if init.name == scale_name and init.data_type == TensorProto.FLOAT:
                arr = np.frombuffer(init.raw_data, dtype=np.float32).copy()
                arr_fp16 = arr.astype(np.float16)
                init.raw_data = arr_fp16.tobytes()
                init.data_type = TensorProto.FLOAT16
                if verbose:
```
🧩 Analysis chain: verification scripts were run against the NVIDIA/Model-Optimizer repository to inspect gqa_replacement.py (lines 1295-1325), its imports, initializer handling across modelopt/onnx, and the `array_to_initializer` / `initializer_to_array` helpers in graph_utils.py; script output is omitted here.
Use onnx.numpy_helper to read and convert initializer data.
At lines 1312–1315, reading only init.raw_data via np.frombuffer() loses data when the initializer uses external storage or typed fields (e.g., float_data, int32_data). This corrupts the quantization scale, invalidating the quantized model.
Suggested change

```diff
-                arr = np.frombuffer(init.raw_data, dtype=np.float32).copy()
-                arr_fp16 = arr.astype(np.float16)
-                init.raw_data = arr_fp16.tobytes()
-                init.data_type = TensorProto.FLOAT16
+                arr_fp16 = onnx.numpy_helper.to_array(init).astype(np.float16)
+                converted_init = onnx.numpy_helper.from_array(arr_fp16, name=init.name)
+                init.CopyFrom(converted_init)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1306 - 1316, The
current code reads initializer bytes with np.frombuffer(init.raw_data) which can
corrupt data for external/typed initializers; instead use
onnx.numpy_helper.to_array(init) to get a safe numpy array, cast it to float16,
then create a new TensorProto with onnx.numpy_helper.from_array(arr_fp16,
name=init.name) and replace the original initializer in graph.initializer (or
assign its raw_data/data_type from the new TensorProto), ensuring you preserve
init.name and dims; update the block that inspects graph.node for quant_mul_name
and scale_name to use numpy_helper.to_array/from_array rather than direct
raw_data access.
```python
output_dir = os.path.dirname(output_path)
if output_dir:
    os.makedirs(output_dir, exist_ok=True)

    if external_data_name is None:
        external_data_name = os.path.basename(output_path) + "_data"

if use_external_data:
    if verbose:
        logger.info(f"  Saving weights to external file: {external_data_name}")
```
external_data_name can remain None for cwd outputs.
If output_path has no directory component, the default external filename is never set, but use_external_data=True still tries to save with location=None.
Suggested change

```diff
 output_dir = os.path.dirname(output_path)
 if output_dir:
     os.makedirs(output_dir, exist_ok=True)
-    if external_data_name is None:
-        external_data_name = os.path.basename(output_path) + "_data"
+if external_data_name is None:
+    external_data_name = os.path.basename(output_path) + "_data"
```

🤖 Prompt for AI Agents
+ external_data_name = os.path.basename(output_path) + "_data"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/gqa_replacement.py` around lines 1398 - 1408,
external_data_name can stay None when output_path has no directory, causing
use_external_data to attempt saving with location=None; ensure
external_data_name is set whenever use_external_data is true and
external_data_name is None by assigning external_data_name =
os.path.basename(output_path) + "_data" outside (or after) the output_dir
conditional so that a default filename is created even for cwd outputs; update
the block around output_dir/external_data_name/use_external_data (references:
external_data_name, output_path, use_external_data) to set the default
unconditionally when needed.
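A minimal sketch of the corrected control flow, with the default moved out of the directory-only branch; the function name is hypothetical and only illustrates the fix:

```python
import os

def resolve_external_data_name(output_path, external_data_name=None):
    """Default the external-data filename unconditionally, so saving to the
    current working directory (empty dirname) still yields a valid location."""
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    # Moved out of the `if output_dir:` block: runs for bare filenames too.
    if external_data_name is None:
        external_data_name = os.path.basename(output_path) + "_data"
    return external_data_name
```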
```python
# Handle value_float attribute
if attr.name == "value_float":
    fp32_val = np.array([attr.f], dtype=np.float32)
    bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
    new_tensor = onnx.TensorProto()
    new_tensor.data_type = TensorProto.BFLOAT16
    new_tensor.raw_data = bf16_uint16.tobytes()
    attr.t.CopyFrom(new_tensor)
    attr.name = "value"
    return True
```
🧩 Analysis chain: verification scripts inspected modelopt/onnx/graph_surgery/utils/dtype_conversion.py (lines 1-150) in NVIDIA/Model-Optimizer; script output is omitted here.
Remove the value_float attribute conversion block.
The function converts FP16 constants to BF16, but value_float encodes native FP32 scalars. Converting them to BF16 introduces unnecessary precision loss (FP32→FP16→BF16) and changes semantics—these constants were not originally FP16 and should not be rewritten in an FP16-to-BF16 pass. The value attribute conversion correctly guards against this with a type check (line 105); value_float lacks this guard.
Suggested change

```diff
-        # Handle value_float attribute
-        if attr.name == "value_float":
-            fp32_val = np.array([attr.f], dtype=np.float32)
-            bf16_uint16 = fp16_to_bf16_array(fp32_val.astype(np.float16))
-            new_tensor = onnx.TensorProto()
-            new_tensor.data_type = TensorProto.BFLOAT16
-            new_tensor.raw_data = bf16_uint16.tobytes()
-            attr.t.CopyFrom(new_tensor)
-            attr.name = "value"
-            return True
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/utils/dtype_conversion.py` around lines 121 -
130, Remove the special-case handling that rewrites "value_float" into BF16:
delete the branch that checks attr.name == "value_float" and the code that
creates new_tensor, sets data_type to TensorProto.BFLOAT16, converts via
fp16_to_bf16_array, assigns raw_data, calls attr.t.CopyFrom and renames
attr.name to "value"; leave the existing "value" attribute logic (the guarded
path that checks tensor element types) intact so native FP32 scalar attributes
are not downconverted by the FP16→BF16 pass.
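To see why the FP32→FP16→BF16 path loses information, here is a plausible stand-in for the repo's `fp16_to_bf16_array` helper (its real implementation may differ): widen to float32, then keep the high 16 bits of each word, which matches bfloat16's 1-8-7 sign/exponent/mantissa layout.

```python
import numpy as np

def fp16_to_bf16_array(arr_fp16):
    """Hypothetical stand-in: widen FP16 to FP32, then truncate each
    float32 word to its high 16 bits, which is the bfloat16 bit layout.
    Returns the raw uint16 words."""
    as_fp32 = arr_fp16.astype(np.float32)
    return (as_fp32.view(np.uint32) >> 16).astype(np.uint16)

# 1.0 survives (exactly representable in FP16, FP32, and BF16), but a
# native FP32 scalar like 0.1 would already be rounded by the FP16 hop,
# which is the review's objection to converting value_float this way.
words = fp16_to_bf16_array(np.array([1.0], dtype=np.float16))
```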
```python
# Compute head_size
head_size = hidden_size // num_attention_heads
```
Guard against None values when hf_model_id is not provided.
If hf_model_id is None and the caller doesn't supply hidden_size or num_attention_heads, line 245 will raise a TypeError when attempting division with None.
🛡️ Proposed fix to add validation

```diff
+    # Validate required parameters
+    if hidden_size is None or num_attention_heads is None:
+        raise ValueError(
+            "hidden_size and num_attention_heads must be provided when hf_model_id is not specified"
+        )
+
     # Compute head_size
     head_size = hidden_size // num_attention_heads
```

📝 Committable suggestion
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/utils/whisper_utils.py` around lines 243 - 246,
Guard against None for hidden_size/num_attention_heads before computing
head_size: check that hidden_size and num_attention_heads are not None (and
num_attention_heads != 0) when hf_model_id is None, and if they are missing
raise a clear ValueError indicating that either hf_model_id must be provided or
both hidden_size and num_attention_heads must be supplied; update the code
around the head_size computation (the variables head_size, hidden_size,
num_attention_heads, and hf_model_id) to perform this validation and only
compute head_size = hidden_size // num_attention_heads after the checks.
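A minimal sketch of the guarded computation the review asks for; the function name and signature are illustrative, and the HF-config loading path (used when hf_model_id is supplied) is elided:

```python
def compute_head_size(hidden_size=None, num_attention_heads=None, hf_model_id=None):
    """Guarded head_size computation: when dimensions are not available
    (e.g. no hf_model_id to read them from), fail with a clear error
    instead of a TypeError on None division."""
    # In the real code, hf_model_id would be used to fetch these from the
    # HF config; that path is elided in this sketch.
    if hidden_size is None or num_attention_heads is None:
        raise ValueError(
            "Either hf_model_id must be provided, or both hidden_size and "
            "num_attention_heads must be supplied"
        )
    if num_attention_heads <= 0:
        raise ValueError("num_attention_heads must be positive")
    return hidden_size // num_attention_heads
```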
```python
from ..utils.graph_utils import detect_model_dtype
```
🧩 Analysis chain: verification scripts located encoder_cross_kv.py, graph_utils.py, and dtype_conversion.py in NVIDIA/Model-Optimizer, inspected the initializer-creation sites in encoder_cross_kv.py (around lines 180-240), and examined the `array_to_initializer` and `fp16_to_bf16_array` helpers; script output is omitted here.
Add BF16-aware initializer serialization for cross-attention K/V weights and biases.
detect_model_dtype allows detecting BF16 models, but these initializers are always created with numpy_helper.from_array(...). Since NumPy doesn't support bfloat16, when the model is detected as BF16 the weights remain as FP32 arrays, causing numpy_helper.from_array() to create FLOAT TensorProtos instead of BFLOAT16. This dtype mismatch will cause inference failures.
Use array_to_initializer with the is_bfloat16 flag and convert float32 → float16 → bfloat16 via fp16_to_bf16_array when onnx_dtype == TensorProto.BFLOAT16.
Suggested change

```diff
-from ..utils.graph_utils import detect_model_dtype
+from ..utils.graph_utils import array_to_initializer, detect_model_dtype
+from ..utils.dtype_conversion import fp16_to_bf16_array
@@
-            new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
-            new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
+            if onnx_dtype == TensorProto.BFLOAT16:
+                k_bf16 = fp16_to_bf16_array(k_weight_t.astype(np.float16))
+                v_bf16 = fp16_to_bf16_array(v_weight_t.astype(np.float16))
+                new_initializers.append(array_to_initializer(k_bf16, k_weight_name, is_bfloat16=True))
+                new_initializers.append(array_to_initializer(v_bf16, v_weight_name, is_bfloat16=True))
+            else:
+                new_initializers.append(numpy_helper.from_array(k_weight_t, name=k_weight_name))
+                new_initializers.append(numpy_helper.from_array(v_weight_t, name=v_weight_name))
@@
-                new_initializers.append(
-                    numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
-                )
+                if onnx_dtype == TensorProto.BFLOAT16:
+                    k_bias_bf16 = fp16_to_bf16_array(k_bias.astype(np.float16))
+                    new_initializers.append(
+                        array_to_initializer(k_bias_bf16, k_bias_name, is_bfloat16=True)
+                    )
+                else:
+                    new_initializers.append(
+                        numpy_helper.from_array(k_bias.astype(np_dtype), name=k_bias_name)
+                    )
@@
-                new_initializers.append(
-                    numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
-                )
+                if onnx_dtype == TensorProto.BFLOAT16:
+                    v_bias_bf16 = fp16_to_bf16_array(v_bias.astype(np.float16))
+                    new_initializers.append(
+                        array_to_initializer(v_bias_bf16, v_bias_name, is_bfloat16=True)
+                    )
+                else:
+                    new_initializers.append(
+                        numpy_helper.from_array(v_bias.astype(np_dtype), name=v_bias_name)
+                    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/graph_surgery/whisper/encoder_cross_kv.py` around lines 37 -
38, The cross-attention K/V weight and bias initializers are always converted
with numpy_helper.from_array so BF16 models detected by detect_model_dtype end
up with FLOAT proto types; update the serialization code that creates these
initializers (the places that currently call numpy_helper.from_array for
cross-attention key/value weights/biases) to use array_to_initializer with the
is_bfloat16 flag when onnx_dtype == TensorProto.BFLOAT16, and convert the
underlying float32 arrays via fp16_to_bf16_array (do float32 → float16 →
bfloat16) before passing to array_to_initializer so the TensorProto dtype
matches BF16 models.
What does this PR do?
Type of change: new feature
Adds the ONNX Graph Surgery framework — graph-level transformations on exported ONNX models for optimized inference with ONNX Runtime.
GQA Attention Replacement — Replaces native attention subgraphs (Q/K/V projections, RoPE, softmax, KV cache) with a single fused `GroupQueryAttention` operator. Handles RoPE cache computation, attention mask reformatting, Q/K/V weight fusion, and KV cache I/O automatically from a HuggingFace model ID. Supports FP16/BF16, INT4/AWQ quantized, and combined QKV models.

DequantizeLinear Weight Transpose — Transposes quantized weights in `DequantizeLinear` nodes to column-major layout for providers like NvTensorRtRtx. Handles INT4/UINT4 packed formats.

Whisper Encoder Cross-Attention KV — Adds cross-attention K/V projection outputs to the Whisper encoder for the ONNX Runtime GenAI pipeline. Loads cross-attention weights from HuggingFace and generates `genai_config.json`.

Usage
Testing
Added test case for GQA graph surgery at `tests/unit/onnx/test_gqa_graph_surgery.py`. Builds a toy attention subgraph with native ops matching the real Optimum export pattern and applies GQA surgery on it. Compared end outputs of both models — outputs match exactly.

Summary by CodeRabbit
Release Notes
New Features
Tests