Integrate Automated QDQ placement tool - part 4.4 #961
willg-nv wants to merge 5 commits into NVIDIA:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough: Adds CLI mode presets (quick/default/extensive) with explicit-flag tracking and application at runtime; expands unit tests for presets and PatternCache, adjusts test model tensor shapes, and enables a GPU autotune test.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 4 passed.
Force-pushed 88793fa to 4604b84
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/quantization/autotune/__main__.py`:
- Around line 36-57: Add pytest unit tests for the new
MODE_PRESETS/apply_mode_presets behavior: create tests that verify (1) selecting
each preset name in MODE_PRESETS sets args.num_schemes, args.warmup_runs, and
args.timing_runs when those fields are the DEFAULT_* values, (2) explicitly
passing non-default CLI values prevents overriding by apply_mode_presets, and
(3) explicitly passing values equal to the DEFAULT_* constants still counts as
"explicit" and should be overridden only if the code intends otherwise (cover
both expected behaviors). Use the apply_mode_presets function and the constants
DEFAULT_NUM_SCHEMES, DEFAULT_WARMUP_RUNS, DEFAULT_TIMING_RUNS and test args.mode
invalid case as well; place tests under tests/ (pytest) to ensure coverage and
assert that preset lookup uses MODE_PRESETS entries.
- Around line 43-56: The apply_mode_presets function currently detects
"unspecified" args by comparing against DEFAULT_* constants, which causes preset
values to override explicit CLI flags that happen to equal those defaults;
change the CLI parsing so the relevant options (num_schemes / warmup_runs /
timing_runs / schemes_per_region) default to None when not provided, then update
apply_mode_presets to only apply MODE_PRESETS when the corresponding args
attribute is None (e.g., check args.num_schemes is None instead of equality to
DEFAULT_NUM_SCHEMES); add unit tests exercising "--mode X" together with
explicit flags (including values equal to previous defaults) to assert explicit
flags are preserved. Ensure references: apply_mode_presets, MODE_PRESETS,
args.mode, DEFAULT_NUM_SCHEMES, DEFAULT_WARMUP_RUNS, DEFAULT_TIMING_RUNS.
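The second prompt above boils down to a small argparse pattern. Here is a minimal, self-contained sketch of that pattern — not the actual module (the real parser has many more options), with preset numbers taken from the values quoted elsewhere in this review thread:

```python
import argparse

# Preset values as quoted elsewhere in this review thread; illustrative only.
MODE_PRESETS = {
    "quick": {"schemes_per_region": 30, "warmup_runs": 10, "timing_runs": 50},
    "default": {"schemes_per_region": 50, "warmup_runs": 50, "timing_runs": 100},
    "extensive": {"schemes_per_region": 200, "warmup_runs": 50, "timing_runs": 200},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=tuple(MODE_PRESETS), default="default")
    # Default to None so "flag not provided" is distinguishable from
    # "flag provided with a value that happens to equal a default".
    parser.add_argument("--num_schemes", type=int, default=None)
    parser.add_argument("--warmup_runs", type=int, default=None)
    parser.add_argument("--timing_runs", type=int, default=None)
    return parser


def apply_mode_presets(args: argparse.Namespace) -> argparse.Namespace:
    """Fill in only the options the user did not set explicitly."""
    preset = MODE_PRESETS[args.mode]
    if args.num_schemes is None:
        args.num_schemes = preset["schemes_per_region"]
    if args.warmup_runs is None:
        args.warmup_runs = preset["warmup_runs"]
    if args.timing_runs is None:
        args.timing_runs = preset["timing_runs"]
    return args


# The preset fills unspecified flags...
args = apply_mode_presets(build_parser().parse_args(["--mode", "quick"]))
print(args.num_schemes, args.warmup_runs, args.timing_runs)  # 30 10 50

# ...but an explicit flag always wins, even when it equals some default.
args = apply_mode_presets(build_parser().parse_args(["--mode", "quick", "--timing_runs", "100"]))
print(args.timing_runs)  # 100
```

With None-based defaults, the "explicit value equal to a default" ambiguity the reviewer describes disappears entirely, at the cost of deriving the displayed per-flag defaults for `--help` from `MODE_PRESETS` instead of constants.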
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- modelopt/onnx/quantization/autotune/__main__.py
- tests/_test_utils/onnx/quantization/autotune/models.py
- tests/unit/onnx/quantization/autotune/test_pattern_cache.py
- tests/unit/onnx/quantization/autotune/test_region.py
Force-pushed 148771e to 27b930d
/ok to test 27b930d
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #961      +/-   ##
==========================================
- Coverage   72.13%   71.75%   -0.39%
==========================================
  Files         209      211       +2
  Lines       23631    23890     +259
==========================================
+ Hits        17046    17142      +96
- Misses       6585     6748     +163
```

☔ View full report in Codecov by Sentry.
/ok to test f3491cf
@willg-nv the test passes in this PR but is too slow. Any way to speed it up?
Thank you for filing this bug. Can you please:
Will approve this PR after those 2 comments are fulfilled. Thanks!
@gcunhase @cjluo-nv can we make these slow tests manual tests or move them to nightly tests?
🧹 Nitpick comments (1)
CHANGELOG.rst (1)
20-20: Consider mentioning the preset mode feature. The changelog entry is comprehensive and well-written. However, based on the PR objectives and changes, this release includes CLI mode presets (quick, default, extensive) that allow users to control the optimization depth. Since this is a user-facing CLI feature, consider adding it to the feature list for completeness.
📝 Suggested addition
```diff
-**AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+**AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, mode presets (quick/default/extensive) for controlling optimization depth, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CHANGELOG.rst` at line 20, Update the AutoQDQ changelog entry to mention the new CLI preset modes by adding a short sentence referencing the CLI entry point (python -m modelopt.onnx.quantization.autotune) and the available presets (quick, default, extensive); e.g., append that users can control optimization depth via these presets so the AutoQDQ feature description (AutoQDQ) includes the preset-mode capability in the feature list.
Force-pushed e47c678 to 6ada4d8
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/__main__.py (1)
30-40: ⚠️ Potential issue | 🟡 Minor — Effective defaults can drift from displayed defaults.
Current parser defaults (`30/5/20`) and `MODE_PRESETS["default"]` (`50/50/100`) diverge, while `--mode` defaults to `"default"`. That makes the runtime defaults differ from what the per-flag help currently advertises, which is confusing for users and performance expectations.
Suggested consolidation (single source of truth):

```diff
-DEFAULT_NUM_SCHEMES = 30
+DEFAULT_MODE = "default"
 DEFAULT_QUANT_TYPE = "int8"
 DEFAULT_DQ_DTYPE = "float32"
 DEFAULT_TIMING_CACHE = str(Path(tempfile.gettempdir()) / "trtexec_timing.cache")
-DEFAULT_WARMUP_RUNS = 5
-DEFAULT_TIMING_RUNS = 20
 MODE_PRESETS = {
     "quick": {"schemes_per_region": 30, "warmup_runs": 10, "timing_runs": 50},
     "default": {"schemes_per_region": 50, "warmup_runs": 50, "timing_runs": 100},
     "extensive": {"schemes_per_region": 200, "warmup_runs": 50, "timing_runs": 200},
 }
+DEFAULT_NUM_SCHEMES = MODE_PRESETS[DEFAULT_MODE]["schemes_per_region"]
+DEFAULT_WARMUP_RUNS = MODE_PRESETS[DEFAULT_MODE]["warmup_runs"]
+DEFAULT_TIMING_RUNS = MODE_PRESETS[DEFAULT_MODE]["timing_runs"]
@@
-    default="default",
+    default=DEFAULT_MODE,
```

Also applies to: 242-252, 257-262, 322-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/__main__.py` around lines 30 - 40, The help and parser defaults (DEFAULT_NUM_SCHEMES, DEFAULT_WARMUP_RUNS, DEFAULT_TIMING_RUNS) currently disagree with MODE_PRESETS["default"], causing --mode default behavior to differ from per-flag help; fix by making a single source of truth: either (A) update DEFAULT_NUM_SCHEMES/WARMUP_RUNS/TIMING_RUNS to match MODE_PRESETS["default"] (50/50/100) and update help strings, or (B) change the argument parser to derive its per-flag defaults and help text from MODE_PRESETS["default"] (e.g., use MODE_PRESETS["default"]["schemes_per_region"] etc.) so --mode="default" and per-flag help stay consistent; apply the same consolidation where the other occurrences of these constants are used (refer to DEFAULT_QUANT_TYPE, DEFAULT_DQ_DTYPE, DEFAULT_TIMING_CACHE if needed) and ensure --mode default remains "default".
🧹 Nitpick comments (3)
tests/unit/onnx/quantization/autotune/test_pattern_cache.py (1)
107-109: Reduce test brittleness from positional cache indexing. These assertions depend on list ordering (`pattern_schemes[0]`). If internal ordering changes, tests can fail despite correct behavior. Prefer selecting by `pattern_signature` before asserting details.
♻️ Suggested refactor:

```diff
-    restored_ps = restored.pattern_schemes[0]
-    assert restored_ps.pattern_signature == "Conv->Relu"
+    matches = [x for x in restored.pattern_schemes if x.pattern_signature == "Conv->Relu"]
+    assert len(matches) == 1
+    restored_ps = matches[0]
@@
-    conv_relu_ps = cache.pattern_schemes[0]
-    assert conv_relu_ps.pattern_signature == "Conv->Relu"
+    matches = [x for x in cache.pattern_schemes if x.pattern_signature == "Conv->Relu"]
+    assert len(matches) == 1
+    conv_relu_ps = matches[0]
@@
-    conv_relu_ps = cache.pattern_schemes[0]
-    assert conv_relu_ps.pattern_signature == "Conv->Relu"
+    matches = [x for x in cache.pattern_schemes if x.pattern_signature == "Conv->Relu"]
+    assert len(matches) == 1
+    conv_relu_ps = matches[0]
```

Also applies to: 152-154, 176-178
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_pattern_cache.py` around lines 107 - 109, The test is brittle because it assumes ordering by using restored.pattern_schemes[0]; change the assertions to locate the PatternScheme by matching pattern_signature (e.g., find the element in restored.pattern_schemes whose pattern_signature == "Conv->Relu") and then assert on its .schemes length and other properties; update the other occurrences that use positional indexing (the similar assertions around the other checks) to the same pattern-signature lookup to make the tests order-independent.
tests/_test_utils/onnx/quantization/autotune/models.py (1)
28-33: Consider parameterizing/reducing test shapes to avoid worsening GPU autotune runtime. The new shapes (lines 28-33) plus weights (line 45) imply a very compute-heavy Conv if this model is actually executed in GPU autotune tests. Given the existing complaint about a ~168s GPU autotune test, this change risks making it slower.
A low-friction option is to keep defaults but allow call sites (especially GPU tests) to override sizes.
Suggested refactor (backward compatible defaults):

```diff
-def _create_simple_conv_onnx_model():
+def _create_simple_conv_onnx_model(*, batch: int = 64, in_ch: int = 32, out_ch: int = 64, hw: int = 224):
     """Build ONNX model: Input -> Conv -> Relu -> Output (minimal for autotuner tests)."""
     input_tensor = helper.make_tensor_value_info(
-        "input", onnx.TensorProto.FLOAT, [64, 32, 224, 224]
+        "input", onnx.TensorProto.FLOAT, [batch, in_ch, hw, hw]
     )
     output_tensor = helper.make_tensor_value_info(
-        "output", onnx.TensorProto.FLOAT, [64, 64, 224, 224]
+        "output", onnx.TensorProto.FLOAT, [batch, out_ch, hw, hw]
     )
@@
     initializer=[
         helper.make_tensor(
-            "conv_weight", onnx.TensorProto.FLOAT, [64, 32, 3, 3], [0.1] * (64 * 32 * 3 * 3)
+            "conv_weight",
+            onnx.TensorProto.FLOAT,
+            [out_ch, in_ch, 3, 3],
+            [0.1] * (out_ch * in_ch * 3 * 3),
         )
     ],
 )
```

If stability only depends on batch, consider keeping `batch=64` but shrinking `hw` (e.g., 32/64) in the GPU workflow tests to cut runtime.
Also applies to: 44-46
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/_test_utils/onnx/quantization/autotune/models.py` around lines 28 - 33, The test model uses hard-coded large tensors (input_tensor and output_tensor and the conv weight tensor around line 45) which makes GPU autotune slow; refactor the model builder to accept parameters for batch, in_channels, out_channels, height, and width with sensible, smaller defaults (e.g., batch=64 kept if needed but default hw=32 or lower) so call sites can override sizes for GPU tests, and ensure the conv weight tensor shape is derived from those parameters (maintaining current defaults to preserve backward compatibility).
tests/unit/onnx/quantization/autotune/test_autotune_config.py (1)
125-132: Add one regression test for the implicit parser-default path (`--mode` omitted). You already validate `--mode default`; adding the no-`--mode` case would lock in the intended runtime behavior if parser defaults change later.
Suggested test addition:

```diff
 class TestModePresets:
@@
+    def test_mode_omitted_uses_parser_default_mode_preset(self):
+        """When --mode is omitted, parser default mode preset is applied."""
+        args = self._parse_cli(["--onnx_path", "model.onnx"])
+        preset = MODE_PRESETS["default"]
+        assert args.num_schemes == preset["schemes_per_region"]
+        assert args.warmup_runs == preset["warmup_runs"]
+        assert args.timing_runs == preset["timing_runs"]
```
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_autotune_config.py` around lines 125 - 132, Add a new unit test that mirrors test_mode_default_applies_preset_when_no_explicit_flags but omits the --mode flag to cover the implicit parser-default path: call the existing _parse_cli helper with ["--onnx_path", "model.onnx"] (no --mode), look up preset = MODE_PRESETS["default"], and assert that args.num_schemes == preset["schemes_per_region"], args.warmup_runs == preset["warmup_runs"], and args.timing_runs == preset["timing_runs"]; place it alongside test_mode_default_applies_preset_when_no_explicit_flags with a clear name like test_mode_implicit_applies_preset_when_no_explicit_flags so future parser-default changes are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/_test_utils/onnx/quantization/autotune/models.py`:
- Around line 28-33: The declared output_tensor shape ([64, 64, 224, 224]) is
inconsistent with a 3x3 Conv without padding (spatial dims should be 222x222);
update the model either by adding symmetric padding to the Conv node (set pads
or auto_pad to keep spatial dims at 224) or change output_tensor to [64, 64,
222, 222] to match no-padding behavior; locate and modify the tensor declaration
named output_tensor and/or the Conv node attributes (pads/auto_pad) in the model
definition so shapes are consistent.
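The inconsistency flagged above follows directly from the standard convolution output-size formula; as a quick sanity check (hypothetical helper, not part of the codebase):

```python
def conv_out_dim(in_dim: int, kernel: int, pad: int = 0, stride: int = 1) -> int:
    """Standard convolution output size: floor((in + 2*pad - kernel) / stride) + 1."""
    return (in_dim + 2 * pad - kernel) // stride + 1


# A 3x3 Conv with no padding shrinks 224 -> 222, so the declared
# [64, 64, 224, 224] output only matches if the Conv node sets pads=1
# on each side (or an equivalent auto_pad such as "SAME_UPPER").
print(conv_out_dim(224, 3, pad=0))  # 222
print(conv_out_dim(224, 3, pad=1))  # 224
```

Either fix in the prompt (adding padding, or correcting the declared output shape to 222x222) makes the model shape-consistent; which one is right depends on whether the tests rely on the output staying 224x224.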
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 670b0a53-c9e9-494b-8034-fb49328b288b
📒 Files selected for processing (8)
- CHANGELOG.rst
- modelopt/onnx/quantization/autotune/__main__.py
- pyproject.toml
- tests/_test_utils/onnx/quantization/autotune/models.py
- tests/gpu/onnx/quantization/autotune/test_workflow.py
- tests/unit/onnx/quantization/autotune/test_autotune_config.py
- tests/unit/onnx/quantization/autotune/test_pattern_cache.py
- tests/unit/onnx/quantization/autotune/test_region.py
💤 Files with no reviewable changes (1)
- tests/gpu/onnx/quantization/autotune/test_workflow.py
✅ Files skipped from review due to trivial changes (1)
- tests/unit/onnx/quantization/autotune/test_region.py
🚧 Files skipped from review as they are similar to previous changes (2)
- CHANGELOG.rst
- pyproject.toml
@cjluo-nv for input on this.
Do we know why this test is so slow in the first place? Is there an opportunity to use a smaller model, less data, fewer iterations, or something like that to make it faster? Or, even in the smallest case, does it still take ~3 minutes to run?
This is an integration test which runs trtexec multiple times to find the best Q/DQ insertion points for a Conv->Relu model. Ideally it would run trtexec 4 times, which is why this test takes ~3 minutes to complete. If the problem size of the Conv->Relu model is decreased, the benchmark timings become unstable and the test fails intermittently.
Force-pushed 12ac037 to b865901
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Force-pushed c3e8e12 to c1b363b
[project.optional-dependencies]
onnx = [
    "cppimport",
    "cuda-python",
What does this PR do?
Many minor changes:
[onnx]
Usage
# Add a code snippet demonstrating how to use this
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
Additional Information
Summary by CodeRabbit
New Features
Documentation
Tests
Chores