
[OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune#951

Open
gcunhase wants to merge 34 commits into NVIDIA:main from gcunhase:dev/gcunhasergio/autotune_real_qdq_scales

Conversation

@gcunhase
Contributor

@gcunhase gcunhase commented Mar 2, 2026

What does this PR do?

Type of change: New feature

Overview: ONNX Autotune (also called Auto Q/DQ) is currently a standalone feature of ModelOpt that automatically adds Q/DQ nodes where relevant, according to information obtained from TensorRT inference. One issue is that the scales in those Q/DQ nodes are random.

This PR does 2 major things:

  1. Integrates Auto Q/DQ into the ONNX quantization workflow; and
  2. Enables calibration data to be used to obtain the correct scales for the Q/DQ nodes.

Usage

$ python -m modelopt.onnx.quantization --onnx_path=model.onnx --autotune={quick,default,extensive}

Please see __main__.py for other args.

Testing

  1. Added unittest for Q/DQ node placement validation: tests/gpu/onnx/quantization/test_autotune_quantization_integration.py

  2. Verified that accuracy was recovered by integrating MOQ with Autotune. Results on an RTX 3090 with TRT 10.12.0.36 (--stronglyTyped) with ViT, as per examples/onnx_ptq:

Model                     Top-1 acc   Top-5 acc
FP32                      85.1%       97.5%
FP16 (FP32 with --fp16)   85.1%       97.5%
Quant (MOQ)               82.4%       96.4%
Quant (Autotune)          0.1%        0.5%
Quant (MOQ + Autotune)    79.6%       95.0%

Note that accuracy was mostly recovered when moving from standalone Autotune (random scales) to MOQ + Autotune (real Q/DQ scales). The remaining gap between MOQ and MOQ + Autotune is likely due to some sensitive nodes being quantized, such as BiasAdd.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes
  • Did you add or update any necessary documentation?: No (will be done in a different PR)
  • Did you update Changelog?: No

Summary by CodeRabbit

  • New Features
    • Added --autotune flag to the ONNX PTQ CLI for automatic Q/DQ placement tuning.
    • Autotune workflow now accepts in-memory models and optionally retains the output directory.
    • Added finer control over output quantization for specific op types.
  • Bug Fixes
    • Improved handling of partial input Q/DQ removal to avoid incorrect graph rewiring.
  • Tests
    • Updated test metadata.
  • Chores
    • Updated optional ONNX extras configuration.

Additional information

To reproduce accuracy with ViT, call download_example_onnx.py and image_prep.py without --fp16.

If --fp16 is used here, quantizing this model with --autotune results in the following error:

[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type

This is fixed in #978.

@gcunhase gcunhase requested a review from a team as a code owner March 2, 2026 18:15
@gcunhase gcunhase requested a review from ajrasane March 2, 2026 18:15
@coderabbitai
Contributor

coderabbitai bot commented Mar 2, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98e74528-cf93-461e-8d45-8a4bc55f2584

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This pull request introduces autotune functionality for ONNX quantization to detect optimal Q/DQ node placements. It adds a CLI flag, autotuning workflow with flexible model input handling, activation operation taxonomy, and quantized node identification utilities.

Changes

Cohort / File(s) Summary
Activation Operations
modelopt/onnx/op_types.py
Added new public function get_activation_ops() that returns a set of activation operation names (Relu, LeakyRelu, PRelu, Elu, Selu, etc.).
CLI and Entry Point
modelopt/onnx/quantization/__main__.py
Added --autotune boolean CLI flag to the ONNX PTQ parser and propagated it to the quantize() function call.
Autotune Workflow
modelopt/onnx/quantization/autotune/workflows.py
Made region_pattern_autotuning_workflow() accept model input as either string path or in-memory ONNX ModelProto, made output_dir optional with automatic temporary directory creation, and added keep_output_dir flag to control directory cleanup.
Quantization Modes
modelopt/onnx/quantization/fp8.py, modelopt/onnx/quantization/int8.py
Added autotune: bool = False parameter to quantize() functions. When enabled, skips GEMV-TRT optimization, Conv-node exclusion, and node derivation logic; instead relies on autotune-provided inputs.
Quantization Infrastructure
modelopt/onnx/quantization/ort_utils.py
Added optional op_types_needing_output_quant parameter to configure_ort() to exclude specific operation types from output quantization.
Graph Utilities
modelopt/onnx/quantization/graph_utils.py
Updated remove_partial_input_qdq() to correctly identify and rewire target nodes by matching DequantizeLinear output names instead of assuming fixed input positions.
Core Quantization Logic
modelopt/onnx/quantization/quantize.py
Implemented new _find_nodes_to_quantize_autotune() function that runs autotuning workflow to determine quantizable nodes and per-op quantization behavior. Extended main quantize() to conditionally invoke autotune when enabled, merging results into quantization workflow.
Utilities
modelopt/onnx/utils.py
Added get_quantized_nodes() function that identifies ONNX nodes preceded by DequantizeLinear or followed by QuantizeLinear.
Dependencies
setup.py
Added cuda-python to ONNX optional dependencies.
Tests
tests/unit/onnx/quantization/autotune/test_region.py
Updated copyright year from 2024 to 2026.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as __main__.py
    participant Quantize as quantize.py
    participant Autotune as Autotune Workflow
    participant FP8/INT8 as FP8/INT8 Quantize
    participant ORT as ORT Config

    User->>CLI: Run with --autotune flag
    CLI->>Quantize: quantize(..., autotune=True)
    
    alt autotune enabled
        Quantize->>Autotune: _find_nodes_to_quantize_autotune()
        Quantize->>Autotune: region_pattern_autotuning_workflow()
        Autotune->>Autotune: Create/use temp output_dir
        Autotune->>Autotune: Load/use model
        Autotune-->>Quantize: Return nodes_to_quantize, op_types, no_quantize_inputs, op_types_needing_output_quant
        Quantize->>FP8/INT8: quantize(..., autotune=True, no_quantize_inputs, op_types_needing_output_quant)
        FP8/INT8->>FP8/INT8: Skip GEMV optimization & Conv exclusions
        FP8/INT8->>ORT: configure_ort(..., op_types_needing_output_quant)
        ORT-->>FP8/INT8: Configured with output quant exclusions
        FP8/INT8-->>Quantize: Quantized model
    else autotune disabled
        Quantize->>FP8/INT8: quantize(..., autotune=False)
        FP8/INT8->>FP8/INT8: Apply standard optimizations & exclusions
        FP8/INT8->>ORT: configure_ort(...)
        ORT-->>FP8/INT8: Standard configuration
        FP8/INT8-->>Quantize: Quantized model
    end
    
    Quantize-->>User: Quantized ONNX model

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name           Status     Explanation
Docstring Coverage   ✅ Passed   Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.
Description Check    ✅ Passed   Check skipped - CodeRabbit's high-level summary is enabled.
Title check          ✅ Passed   The PR title directly references the main feature: adding real Q/DQ scales in autotune. This accurately summarizes the primary objective of integrating Auto Q/DQ into the ONNX quantization workflow to use calibration data for correct scales.



@gcunhase gcunhase requested a review from cjluo-nv March 2, 2026 18:16
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/unit/onnx/quantization/autotune/test_region.py (1)

16-21: ⚠️ Potential issue | 🟡 Minor

Remove duplicate license text block.

Lines 16-21 duplicate the license disclaimer already present in lines 10-14. This appears to be a copy-paste error.

🔧 Proposed fix
 # limitations under the License.
-
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 """Tests for the Region class in the autotuner."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/autotune/test_region.py` around lines 16 - 21,
Remove the duplicated license disclaimer block that was accidentally
copy-pasted; locate the repeated Apache license/disclaimer text that appears a
second time and delete the redundant block so only the original license header
remains at the top of the file (ensure the first license header is preserved and
no other content is altered).
modelopt/onnx/quantization/fp8.py (1)

219-232: ⚠️ Potential issue | 🟠 Major

Potential AttributeError if nodes_to_exclude is None.

Same issue as in int8.py: line 232 calls nodes_to_exclude.extend() before validation on line 236. If nodes_to_exclude is passed as None, this will fail.

🐛 Proposed fix
     enable_gemv_detection_for_trt = kwargs.get("enable_gemv_detection_for_trt", True)
+    nodes_to_exclude = nodes_to_exclude or []
     if enable_gemv_detection_for_trt and not autotune:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/fp8.py` around lines 219 - 232, The block that
calls nodes_to_exclude.extend(...) when enable_gemv_detection_for_trt and not
autotune can raise AttributeError if nodes_to_exclude is None; before calling
find_nodes_from_matmul_to_exclude and extending, ensure nodes_to_exclude is
initialized to a list (e.g., if nodes_to_exclude is None assign an empty list)
or guard the extend call by creating a new list and assigning it back to
nodes_to_exclude; update the code around enable_gemv_detection_for_trt /
autotune, the find_nodes_from_matmul_to_exclude call, and the nodes_to_exclude
handling so extend is only called on a list.
modelopt/onnx/quantization/int8.py (1)

161-174: ⚠️ Potential issue | 🟠 Major

Potential AttributeError if nodes_to_exclude is None.

When enable_gemv_detection_for_trt is True and autotune is False, line 174 calls nodes_to_exclude.extend() before nodes_to_exclude is validated/converted by find_nodes_to_exclude() on line 178. If nodes_to_exclude is passed as None, this will raise an AttributeError.

🐛 Proposed fix
     enable_gemv_detection_for_trt = kwargs.get("enable_gemv_detection_for_trt", True)
+    nodes_to_exclude = nodes_to_exclude or []
     if enable_gemv_detection_for_trt and not autotune:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/int8.py` around lines 161 - 174, The code may call
nodes_to_exclude.extend(...) when nodes_to_exclude can be None; ensure
nodes_to_exclude is a list before extending: in the block guarded by
enable_gemv_detection_for_trt and not autotune, either initialize
nodes_to_exclude if None (e.g., nodes_to_exclude = nodes_to_exclude or []) or
call find_nodes_to_exclude() earlier and assign/normalize nodes_to_exclude
before using extend; update the logic around nodes_to_exclude,
find_nodes_from_matmul_to_exclude, and find_nodes_to_exclude to guarantee
nodes_to_exclude is always a list when extending.
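The fp8.py and int8.py findings above reduce to the same pitfall. A minimal stdlib-only sketch of the suggested normalization (the function body and node names here are illustrative, not ModelOpt's actual code):

```python
def quantize_sketch(nodes_to_exclude=None, enable_gemv_detection_for_trt=True, autotune=False):
    # Normalize up front so a None argument can never reach .extend() and raise AttributeError.
    nodes_to_exclude = nodes_to_exclude or []
    if enable_gemv_detection_for_trt and not autotune:
        # Stand-in for find_nodes_from_matmul_to_exclude(...): pretend one GEMV MatMul was found.
        nodes_to_exclude.extend(["matmul_gemv_0"])
    return nodes_to_exclude
```

With the guard in place, passing None behaves the same as passing an empty list.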
🧹 Nitpick comments (3)
modelopt/onnx/quantization/quantize.py (1)

272-274: Filename replacement may fail with edge-case paths.

Using onnx_path.replace(".onnx", ".quant_autotune.onnx") could produce unexpected results if ".onnx" appears elsewhere in the path (e.g., /models/my.onnx.dir/model.onnx).

💡 Safer alternative using path manipulation
+    import os
     # Export model with Q/DQ insertion
-    onnx_path_autotune = onnx_path.replace(".onnx", ".quant_autotune.onnx")
+    base, ext = os.path.splitext(onnx_path)
+    onnx_path_autotune = f"{base}.quant_autotune{ext}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/quantize.py` around lines 272 - 274, The filename
construction using onnx_path.replace(".onnx", ".quant_autotune.onnx") can
mis-replace when ".onnx" appears elsewhere in the path; change the logic that
computes onnx_path_autotune to use proper path/suffix manipulation (e.g.,
Path(onnx_path).with_suffix(".quant_autotune.onnx") or equivalent) before
calling autotuner.export_onnx and appending to intermediate_generated_files,
updating references to onnx_path_autotune, onnx_path, and the
autotuner.export_onnx call accordingly.
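To illustrate the difference, a small runnable comparison of the two approaches (the example path is hypothetical):

```python
import os

# Hypothetical path where ".onnx" also appears inside a directory name.
onnx_path = "/models/my.onnx.dir/model.onnx"

# str.replace rewrites every occurrence, corrupting the directory component.
naive = onnx_path.replace(".onnx", ".quant_autotune.onnx")
print(naive)  # /models/my.quant_autotune.onnx.dir/model.quant_autotune.onnx

# os.path.splitext only touches the final extension.
base, ext = os.path.splitext(onnx_path)
safe = f"{base}.quant_autotune{ext}"
print(safe)   # /models/my.onnx.dir/model.quant_autotune.onnx
```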
modelopt/onnx/utils.py (1)

175-191: Potential IndexError if input/output lists contain unexpected elements.

The list comprehension assumes inp.inputs[0] and out.outputs[0] exist when inp.inputs / out.outputs are truthy. While graphsurgeon typically ensures non-empty lists here, adding explicit length checks would make this more robust.

🛡️ Proposed defensive fix
     return [
         node
         for node in graph.nodes
-        if any(inp.inputs[0].op == "DequantizeLinear" for inp in node.inputs if inp.inputs)
-        or any(out.outputs[0].op == "QuantizeLinear" for out in node.outputs if out.outputs)
+        if any(
+            len(inp.inputs) > 0 and inp.inputs[0].op == "DequantizeLinear"
+            for inp in node.inputs
+            if inp.inputs
+        )
+        or any(
+            len(out.outputs) > 0 and out.outputs[0].op == "QuantizeLinear"
+            for out in node.outputs
+            if out.outputs
+        )
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/utils.py` around lines 175 - 191, The comprehension in
get_quantized_nodes assumes inp.inputs[0] and out.outputs[0] exist and can raise
IndexError; change the two any() guards to explicitly check length (or
truthiness plus index-safe access) before indexing (e.g., ensure len(inp.inputs)
> 0 and len(out.outputs) > 0) so you only evaluate inp.inputs[0].op ==
"DequantizeLinear" and out.outputs[0].op == "QuantizeLinear" when the lists have
at least one element; update the generator to use these safe conditions around
node.inputs and node.outputs to avoid crashes.
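The guard pattern in question can be demonstrated with a tiny graphsurgeon-like stand-in (these class names are illustrative, not the onnx-graphsurgeon API):

```python
class Tensor:
    """Minimal stand-in for a graphsurgeon Variable: tracks producer/consumer nodes."""
    def __init__(self, producers=(), consumers=()):
        self.inputs = list(producers)    # nodes that produce this tensor
        self.outputs = list(consumers)   # nodes that consume this tensor

class Node:
    def __init__(self, op, inputs=(), outputs=()):
        self.op = op
        self.inputs = list(inputs)       # input tensors
        self.outputs = list(outputs)     # output tensors

def is_quantized(node):
    # Index-safe: the truthiness check runs before [0] is ever evaluated,
    # so empty producer/consumer lists never raise IndexError.
    return any(inp.inputs and inp.inputs[0].op == "DequantizeLinear" for inp in node.inputs) or any(
        out.outputs and out.outputs[0].op == "QuantizeLinear" for out in node.outputs
    )
```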
modelopt/onnx/quantization/autotune/workflows.py (1)

202-203: Docstring should document temp directory behavior.

The docstring for output_dir doesn't mention that when None is provided, a temporary directory is automatically created via tempfile.mkdtemp(). This is important for API consumers to understand, especially since temp directories may accumulate if keep_output_dir=True (the default).

📝 Suggested docstring update
-        output_dir: Directory for output files (state, logs, models). Created if it doesn't exist.
+        output_dir: Directory for output files (state, logs, models). Created if it doesn't exist.
+                   If None, a temporary directory is created via tempfile.mkdtemp().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/workflows.py` around lines 202 - 203,
Update the docstring for the output_dir parameter in the function/class that
defines output_dir (the docstring in workflows.py around the autotune workflow)
to explicitly state that when output_dir is None a temporary directory is
created via tempfile.mkdtemp(), and note that the temporary directory will be
retained if keep_output_dir=True (the default), so callers may need to remove it
to avoid accumulation; reference the output_dir parameter name and the
keep_output_dir flag in the description.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 386-390: The log message in the cleanup branch is inverted: inside
the if not keep_output_dir block (where shutil.rmtree(output_dir) is called)
update the logger.debug message to tell users to set keep_output_dir=True to
retain the directory; specifically modify the message emitted by logger.debug
near the removal call that references output_dir and keep_output_dir so it
correctly reads that setting keep_output_dir=True will keep the directory.

In `@modelopt/onnx/quantization/quantize.py`:
- Around line 246-253: The function _find_nodes_to_quantize_autotune uses a
mutable default for intermediate_generated_files (list[str] = []); change the
signature to use None as the default (intermediate_generated_files:
Optional[list[str]] = None) and inside the function, if
intermediate_generated_files is None then set intermediate_generated_files = []
so each call gets a fresh list; update any type hints/imports if needed and
ensure all code in _find_nodes_to_quantize_autotune that appends or inspects
intermediate_generated_files works with the new initialization.

In `@setup.py`:
- Line 62: The dependency entry for "cuda-python" in setup.py lacks a version
constraint and the inline comment "For autotune" is misleading; change the
dependency to include a minimum version compatible with your CUDA/driver/ONNX
Runtime stack (e.g., "cuda-python>=13.0") and update the comment to accurately
state its purpose (e.g., "CUDA Python bindings for GPU/driver interactions -
ensure matches CUDA/ONNX Runtime version"). Ensure this follows the same pinning
style as other dependencies like "onnxslim>=0.1.76" and "polygraphy>=0.49.22".
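The mutable-default finding for quantize.py above is a classic Python pitfall; a minimal demonstration of why the None-default idiom is suggested:

```python
def collect_bad(files=[]):
    # The default list is created once, at function-definition time, and shared across calls.
    files.append("model.onnx")
    return files

def collect_good(files=None):
    # A None default is replaced by a fresh list on every call.
    if files is None:
        files = []
    files.append("model.onnx")
    return files
```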


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f668a3 and 52b9c31.

📒 Files selected for processing (11)
  • modelopt/onnx/op_types.py
  • modelopt/onnx/quantization/__main__.py
  • modelopt/onnx/quantization/autotune/workflows.py
  • modelopt/onnx/quantization/fp8.py
  • modelopt/onnx/quantization/graph_utils.py
  • modelopt/onnx/quantization/int8.py
  • modelopt/onnx/quantization/ort_utils.py
  • modelopt/onnx/quantization/quantize.py
  • modelopt/onnx/utils.py
  • setup.py
  • tests/unit/onnx/quantization/autotune/test_region.py

)
argparser.add_argument(
"--autotune",
action="store_true",
Collaborator

Do we plan to add some modes for the autotune? E.g. fast, default, extensive etc?

This is the design doc: https://docs.google.com/document/d/1gDxHLiJTyBQInd8lV-EFoNbLoxGn9fRCqlhSFx5sd7U/edit?tab=t.0

Contributor Author

Good point, will add those.

Contributor Author

Done, please check, thanks.

@@ -242,6 +243,81 @@ def _preprocess_onnx(
)



Overall Assessment

This PR is well-structured and achieves its stated goals of:

  1. Integrating Auto Q/DQ into the ONNX quantization workflow
  2. Enabling calibration data to obtain correct scales for Q/DQ nodes

The changes are substantial but well-organized across multiple files. Below are my detailed review comments.

Contributor

Copilot AI left a comment


Pull request overview

Integrates ONNX Auto Q/DQ (TensorRT-driven autotuning) into the existing ONNX quantization workflow so Q/DQ placement can be derived from TensorRT profiling and then calibrated to produce real (non-random) Q/DQ scales.

Changes:

  • Added an --autotune flag (and autotune plumbing) to route INT8/FP8 quantization through the Auto Q/DQ placement workflow.
  • Introduced utilities to detect “quantized nodes” from a Q/DQ-inserted model and used this to drive node selection + ORT configuration tweaks (output quantization for certain producers).
  • Updated autotune workflow API to accept in-memory models and optionally auto-manage its output directory.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/unit/onnx/quantization/autotune/test_region.py Updates file header metadata.
setup.py Adds cuda-python to ONNX optional dependencies to support TensorRT Python autotune benchmarking.
modelopt/onnx/utils.py Adds get_quantized_nodes() helper for extracting quantized nodes from a Q/DQ graph.
modelopt/onnx/quantization/quantize.py Adds autotune flag, integrates Auto Q/DQ placement, and feeds results into INT8/FP8 quantizers.
modelopt/onnx/quantization/ort_utils.py Extends ORT configuration to optionally allow output quantization for selected op types.
modelopt/onnx/quantization/int8.py Adds autotune plumbing and bypasses some default heuristics when autotune is enabled.
modelopt/onnx/quantization/graph_utils.py Fixes partial-input Q/DQ removal to patch the intended consumer branch (shared Q/DQ case).
modelopt/onnx/quantization/fp8.py Adds autotune plumbing and bypasses some default heuristics when autotune is enabled.
modelopt/onnx/quantization/autotune/workflows.py Allows ModelProto input, optional output_dir, and adds optional output-dir cleanup.
modelopt/onnx/quantization/__main__.py Adds CLI flag --autotune.
modelopt/onnx/op_types.py Adds get_activation_ops() used by autotune integration logic.



@modelopt-bot modelopt-bot left a comment


Review completed. I've posted several inline comments on specific lines. Overall this is a well-structured PR that successfully integrates Auto Q/DQ into the ONNX quantization workflow. Key highlights include good integration via _find_nodes_to_quantize_autotune, flexible API changes for in-memory models, and an important bug fix for shared Q/DQ pair handling. Please address the inline comments regarding documentation completion, code organization suggestions, and the copyright year consistency. Recommend approving with minor changes.

@gcunhase gcunhase changed the title [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune Draft: [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune Mar 2, 2026
gcunhase added 9 commits March 2, 2026 15:31
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…de_inputs''

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
gcunhase added 12 commits March 2, 2026 15:35
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
… Q/DQ need to be added or removed

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…ntization workflow

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…iginal Auto Q/DQ PR.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
@gcunhase gcunhase force-pushed the dev/gcunhasergio/autotune_real_qdq_scales branch from ab4c5a3 to c23208f Compare March 2, 2026 20:40
@codecov

codecov bot commented Mar 2, 2026

Codecov Report

❌ Patch coverage is 45.80153% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.38%. Comparing base (0f668a3) to head (c147979).
⚠️ Report is 12 commits behind head on main.

Files with missing lines                                Patch %   Lines
...elopt/onnx/quantization/autotune/autotuner_base.py   32.72%    37 Missing ⚠️
modelopt/onnx/quantization/autotune/__main__.py         27.77%    13 Missing ⚠️
modelopt/onnx/quantization/quantize.py                  40.00%     9 Missing ⚠️
modelopt/onnx/quantization/autotune/workflows.py        14.28%     6 Missing ⚠️
modelopt/onnx/quantization/graph_utils.py               66.66%     3 Missing ⚠️
modelopt/onnx/quantization/int8.py                      84.61%     2 Missing ⚠️
modelopt/onnx/op_types.py                               50.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #951      +/-   ##
==========================================
- Coverage   72.12%   71.38%   -0.74%     
==========================================
  Files         209      211       +2     
  Lines       23617    23951     +334     
==========================================
+ Hits        17034    17098      +64     
- Misses       6583     6853     +270     

☔ View full report in Codecov by Sentry.

gcunhase added 6 commits March 3, 2026 10:02
…e(). Directly use Insertion Points information.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…es in the nodes_to_quantize

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…eded

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.



uses <output_dir>/autotuner_state.yaml (default: None)
quant_type: Quantization data type - "int8" for INT8 quantization (default),
"fp8" for FP8 quantization
default_dq_dtype: Dtype for DequantizeLinear output; "float32" (default) or "float16".

Copilot AI Mar 3, 2026


The default_dq_dtype docstring says only "float32" or "float16", but the quantization workflow now maps high_precision_dtype="bf16" to default_dq_dtype="bfloat16". Please update the docstring to include bfloat16 (or more generally state that any dtype supported by export_utils.resolve_dtype() is accepted).

Suggested change
default_dq_dtype: Dtype for DequantizeLinear output; "float32" (default) or "float16".
default_dq_dtype: Dtype for DequantizeLinear output; e.g. "float32" (default), "float16",
"bfloat16", or any dtype supported by export_utils.resolve_dtype().

Contributor Author

@willg-nv is bfloat16 also supported in default_dq_dtype?

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
@gcunhase gcunhase changed the title Draft: [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune [OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune Mar 5, 2026
gcunhase added 4 commits March 5, 2026 10:03
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
… silent corruption of the graph.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
