Integrate Automated QDQ placement tool - part 4.4 #961
willg-nv wants to merge 5 commits into NVIDIA:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough: Adds CLI mode presets (quick/default/extensive) with explicit-flag tracking and application at runtime; expands unit tests for presets and PatternCache, adjusts test model tensor shapes, and enables a GPU autotune test.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 4 passed.
Force-pushed 88793fa to 4604b84
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/quantization/autotune/__main__.py`:
- Around line 36-57: Add pytest unit tests for the new
MODE_PRESETS/apply_mode_presets behavior: create tests that verify (1) selecting
each preset name in MODE_PRESETS sets args.num_schemes, args.warmup_runs, and
args.timing_runs when those fields are the DEFAULT_* values, (2) explicitly
passing non-default CLI values prevents overriding by apply_mode_presets, and
(3) explicitly passing values equal to the DEFAULT_* constants still counts as
"explicit" and should be overridden only if the code intends otherwise (cover
both expected behaviors). Use the apply_mode_presets function and the constants
DEFAULT_NUM_SCHEMES, DEFAULT_WARMUP_RUNS, DEFAULT_TIMING_RUNS and test args.mode
invalid case as well; place tests under tests/ (pytest) to ensure coverage and
assert that preset lookup uses MODE_PRESETS entries.
- Around line 43-56: The apply_mode_presets function currently detects
"unspecified" args by comparing against DEFAULT_* constants, which causes preset
values to override explicit CLI flags that happen to equal those defaults;
change the CLI parsing so the relevant options (num_schemes / warmup_runs /
timing_runs / schemes_per_region) default to None when not provided, then update
apply_mode_presets to only apply MODE_PRESETS when the corresponding args
attribute is None (e.g., check args.num_schemes is None instead of equality to
DEFAULT_NUM_SCHEMES); add unit tests exercising "--mode X" together with
explicit flags (including values equal to previous defaults) to assert explicit
flags are preserved. Ensure references: apply_mode_presets, MODE_PRESETS,
args.mode, DEFAULT_NUM_SCHEMES, DEFAULT_WARMUP_RUNS, DEFAULT_TIMING_RUNS.
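The second prompt above boils down to a small argparse pattern. Here is a minimal, self-contained sketch of that pattern — not the actual module (the real parser has many more options), with preset numbers taken from the values quoted elsewhere in this review thread:

```python
import argparse

# Preset values as quoted elsewhere in this review thread; illustrative only.
MODE_PRESETS = {
    "quick": {"schemes_per_region": 30, "warmup_runs": 10, "timing_runs": 50},
    "default": {"schemes_per_region": 50, "warmup_runs": 50, "timing_runs": 100},
    "extensive": {"schemes_per_region": 200, "warmup_runs": 50, "timing_runs": 200},
}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=tuple(MODE_PRESETS), default="default")
    # Default to None so "flag not provided" is distinguishable from
    # "flag provided with a value that happens to equal a default".
    parser.add_argument("--num_schemes", type=int, default=None)
    parser.add_argument("--warmup_runs", type=int, default=None)
    parser.add_argument("--timing_runs", type=int, default=None)
    return parser


def apply_mode_presets(args: argparse.Namespace) -> argparse.Namespace:
    """Fill in only the options the user did not set explicitly."""
    preset = MODE_PRESETS[args.mode]
    if args.num_schemes is None:
        args.num_schemes = preset["schemes_per_region"]
    if args.warmup_runs is None:
        args.warmup_runs = preset["warmup_runs"]
    if args.timing_runs is None:
        args.timing_runs = preset["timing_runs"]
    return args


# The preset fills unspecified flags...
args = apply_mode_presets(build_parser().parse_args(["--mode", "quick"]))
print(args.num_schemes, args.warmup_runs, args.timing_runs)  # 30 10 50

# ...but an explicit flag always wins, even when it equals some default.
args = apply_mode_presets(build_parser().parse_args(["--mode", "quick", "--timing_runs", "100"]))
print(args.timing_runs)  # 100
```

With None-based defaults, the "explicit value equal to a default" ambiguity the reviewer describes disappears entirely, at the cost of deriving the displayed per-flag defaults for `--help` from `MODE_PRESETS` instead of constants.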
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- modelopt/onnx/quantization/autotune/__main__.py
- tests/_test_utils/onnx/quantization/autotune/models.py
- tests/unit/onnx/quantization/autotune/test_pattern_cache.py
- tests/unit/onnx/quantization/autotune/test_region.py
Force-pushed 148771e to 27b930d
/ok to test 27b930d
Codecov Report: ❌ Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #961      +/-   ##
==========================================
- Coverage   72.13%   71.75%   -0.39%
==========================================
  Files         209      211       +2
  Lines       23631    23890     +259
==========================================
+ Hits        17046    17142      +96
- Misses       6585     6748     +163
```

☔ View full report in Codecov by Sentry.
/ok to test f3491cf
@willg-nv the test passes in this PR but is too slow. Any way to speed it up?
Thank you for filing this bug. Can you please:
Will approve this PR after those 2 comments are fulfilled. Thanks!
@gcunhase @cjluo-nv can we make these slow tests manual tests or move them to nightly tests?
🧹 Nitpick comments (1)
CHANGELOG.rst (1)
20-20: Consider mentioning the preset mode feature. The changelog entry is comprehensive and well-written. However, based on the PR objectives and changes, this release includes CLI mode presets (quick, default, extensive) that allow users to control the optimization depth. Since this is a user-facing CLI feature, consider adding it to the feature list for completeness.
📝 Suggested addition
```diff
-**AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+**AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, mode presets (quick/default/extensive) for controlling optimization depth, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CHANGELOG.rst` at line 20, Update the AutoQDQ changelog entry to mention the new CLI preset modes by adding a short sentence referencing the CLI entry point (python -m modelopt.onnx.quantization.autotune) and the available presets (quick, default, extensive); e.g., append that users can control optimization depth via these presets so the AutoQDQ feature description (AutoQDQ) includes the preset-mode capability in the feature list.
Force-pushed e47c678 to 6ada4d8
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/__main__.py (1)
30-40: ⚠️ Potential issue | 🟡 Minor — Effective defaults can drift from displayed defaults.
Current parser defaults (`30/5/20`) and `MODE_PRESETS["default"]` (`50/50/100`) diverge, while `--mode` defaults to `"default"`. That makes the runtime defaults differ from what the per-flag help currently advertises, which is confusing for users and performance expectations.
Suggested consolidation (single source of truth):

```diff
-DEFAULT_NUM_SCHEMES = 30
+DEFAULT_MODE = "default"
 DEFAULT_QUANT_TYPE = "int8"
 DEFAULT_DQ_DTYPE = "float32"
 DEFAULT_TIMING_CACHE = str(Path(tempfile.gettempdir()) / "trtexec_timing.cache")
-DEFAULT_WARMUP_RUNS = 5
-DEFAULT_TIMING_RUNS = 20
 MODE_PRESETS = {
     "quick": {"schemes_per_region": 30, "warmup_runs": 10, "timing_runs": 50},
     "default": {"schemes_per_region": 50, "warmup_runs": 50, "timing_runs": 100},
     "extensive": {"schemes_per_region": 200, "warmup_runs": 50, "timing_runs": 200},
 }
+DEFAULT_NUM_SCHEMES = MODE_PRESETS[DEFAULT_MODE]["schemes_per_region"]
+DEFAULT_WARMUP_RUNS = MODE_PRESETS[DEFAULT_MODE]["warmup_runs"]
+DEFAULT_TIMING_RUNS = MODE_PRESETS[DEFAULT_MODE]["timing_runs"]
@@
-    default="default",
+    default=DEFAULT_MODE,
```

Also applies to: 242-252, 257-262, 322-336
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/__main__.py` around lines 30 - 40, The help and parser defaults (DEFAULT_NUM_SCHEMES, DEFAULT_WARMUP_RUNS, DEFAULT_TIMING_RUNS) currently disagree with MODE_PRESETS["default"], causing --mode default behavior to differ from per-flag help; fix by making a single source of truth: either (A) update DEFAULT_NUM_SCHEMES/WARMUP_RUNS/TIMING_RUNS to match MODE_PRESETS["default"] (50/50/100) and update help strings, or (B) change the argument parser to derive its per-flag defaults and help text from MODE_PRESETS["default"] (e.g., use MODE_PRESETS["default"]["schemes_per_region"] etc.) so --mode="default" and per-flag help stay consistent; apply the same consolidation where the other occurrences of these constants are used (refer to DEFAULT_QUANT_TYPE, DEFAULT_DQ_DTYPE, DEFAULT_TIMING_CACHE if needed) and ensure --mode default remains "default".
🧹 Nitpick comments (3)
tests/unit/onnx/quantization/autotune/test_pattern_cache.py (1)
107-109: Reduce test brittleness from positional cache indexing. These assertions depend on list ordering (`pattern_schemes[0]`). If internal ordering changes, tests can fail despite correct behavior. Prefer selecting by `pattern_signature` before asserting details.
♻️ Suggested refactor:

```diff
-    restored_ps = restored.pattern_schemes[0]
-    assert restored_ps.pattern_signature == "Conv->Relu"
+    matches = [x for x in restored.pattern_schemes if x.pattern_signature == "Conv->Relu"]
+    assert len(matches) == 1
+    restored_ps = matches[0]
@@
-    conv_relu_ps = cache.pattern_schemes[0]
-    assert conv_relu_ps.pattern_signature == "Conv->Relu"
+    matches = [x for x in cache.pattern_schemes if x.pattern_signature == "Conv->Relu"]
+    assert len(matches) == 1
+    conv_relu_ps = matches[0]
@@
-    conv_relu_ps = cache.pattern_schemes[0]
-    assert conv_relu_ps.pattern_signature == "Conv->Relu"
+    matches = [x for x in cache.pattern_schemes if x.pattern_signature == "Conv->Relu"]
+    assert len(matches) == 1
+    conv_relu_ps = matches[0]
```

Also applies to: 152-154, 176-178
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_pattern_cache.py` around lines 107 - 109, The test is brittle because it assumes ordering by using restored.pattern_schemes[0]; change the assertions to locate the PatternScheme by matching pattern_signature (e.g., find the element in restored.pattern_schemes whose pattern_signature == "Conv->Relu") and then assert on its .schemes length and other properties; update the other occurrences that use positional indexing (the similar assertions around the other checks) to the same pattern-signature lookup to make the tests order-independent.
tests/_test_utils/onnx/quantization/autotune/models.py (1)
28-33: Consider parameterizing/reducing test shapes to avoid worsening GPU autotune runtime. The new shapes (lines 28-33) plus weights (line 45) imply a very compute-heavy Conv if this model is actually executed in GPU autotune tests. Given the existing complaint about a ~168s GPU autotune test, this change risks making it slower.
A low-friction option is to keep defaults but allow call sites (especially GPU tests) to override sizes.
Suggested refactor (backward compatible defaults):

```diff
-def _create_simple_conv_onnx_model():
+def _create_simple_conv_onnx_model(*, batch: int = 64, in_ch: int = 32, out_ch: int = 64, hw: int = 224):
     """Build ONNX model: Input -> Conv -> Relu -> Output (minimal for autotuner tests)."""
     input_tensor = helper.make_tensor_value_info(
-        "input", onnx.TensorProto.FLOAT, [64, 32, 224, 224]
+        "input", onnx.TensorProto.FLOAT, [batch, in_ch, hw, hw]
     )
     output_tensor = helper.make_tensor_value_info(
-        "output", onnx.TensorProto.FLOAT, [64, 64, 224, 224]
+        "output", onnx.TensorProto.FLOAT, [batch, out_ch, hw, hw]
     )
@@
     initializer=[
         helper.make_tensor(
-            "conv_weight", onnx.TensorProto.FLOAT, [64, 32, 3, 3], [0.1] * (64 * 32 * 3 * 3)
+            "conv_weight",
+            onnx.TensorProto.FLOAT,
+            [out_ch, in_ch, 3, 3],
+            [0.1] * (out_ch * in_ch * 3 * 3),
         )
     ],
 )
```

If stability only depends on batch, consider keeping `batch=64` but shrinking `hw` (e.g., 32/64) in the GPU workflow tests to cut runtime.
Also applies to: 44-46
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/_test_utils/onnx/quantization/autotune/models.py` around lines 28 - 33, The test model uses hard-coded large tensors (input_tensor and output_tensor and the conv weight tensor around line 45) which makes GPU autotune slow; refactor the model builder to accept parameters for batch, in_channels, out_channels, height, and width with sensible, smaller defaults (e.g., batch=64 kept if needed but default hw=32 or lower) so call sites can override sizes for GPU tests, and ensure the conv weight tensor shape is derived from those parameters (maintaining current defaults to preserve backward compatibility).
tests/unit/onnx/quantization/autotune/test_autotune_config.py (1)
125-132: Add one regression test for the implicit parser-default path (`--mode` omitted). You already validate `--mode default`; adding the no-`--mode` case would lock in the intended runtime behavior if parser defaults change later.
Suggested test addition:

```diff
 class TestModePresets:
@@
+    def test_mode_omitted_uses_parser_default_mode_preset(self):
+        """When --mode is omitted, parser default mode preset is applied."""
+        args = self._parse_cli(["--onnx_path", "model.onnx"])
+        preset = MODE_PRESETS["default"]
+        assert args.num_schemes == preset["schemes_per_region"]
+        assert args.warmup_runs == preset["warmup_runs"]
+        assert args.timing_runs == preset["timing_runs"]
```
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_autotune_config.py` around lines 125 - 132, Add a new unit test that mirrors test_mode_default_applies_preset_when_no_explicit_flags but omits the --mode flag to cover the implicit parser-default path: call the existing _parse_cli helper with ["--onnx_path", "model.onnx"] (no --mode), look up preset = MODE_PRESETS["default"], and assert that args.num_schemes == preset["schemes_per_region"], args.warmup_runs == preset["warmup_runs"], and args.timing_runs == preset["timing_runs"]; place it alongside test_mode_default_applies_preset_when_no_explicit_flags with a clear name like test_mode_implicit_applies_preset_when_no_explicit_flags so future parser-default changes are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/_test_utils/onnx/quantization/autotune/models.py`:
- Around line 28-33: The declared output_tensor shape ([64, 64, 224, 224]) is
inconsistent with a 3x3 Conv without padding (spatial dims should be 222x222);
update the model either by adding symmetric padding to the Conv node (set pads
or auto_pad to keep spatial dims at 224) or change output_tensor to [64, 64,
222, 222] to match no-padding behavior; locate and modify the tensor declaration
named output_tensor and/or the Conv node attributes (pads/auto_pad) in the model
definition so shapes are consistent.
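The inconsistency flagged above follows directly from the standard convolution output-size formula; as a quick sanity check (hypothetical helper, not part of the codebase):

```python
def conv_out_dim(in_dim: int, kernel: int, pad: int = 0, stride: int = 1) -> int:
    """Standard convolution output size: floor((in + 2*pad - kernel) / stride) + 1."""
    return (in_dim + 2 * pad - kernel) // stride + 1


# A 3x3 Conv with no padding shrinks 224 -> 222, so the declared
# [64, 64, 224, 224] output only matches if the Conv node sets pads=1
# on each side (or an equivalent auto_pad such as "SAME_UPPER").
print(conv_out_dim(224, 3, pad=0))  # 222
print(conv_out_dim(224, 3, pad=1))  # 224
```

Either fix in the prompt (adding padding, or correcting the declared output shape to 222x222) makes the model shape-consistent; which one is right depends on whether the tests rely on the output staying 224x224.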
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 670b0a53-c9e9-494b-8034-fb49328b288b
📒 Files selected for processing (8)
- CHANGELOG.rst
- modelopt/onnx/quantization/autotune/__main__.py
- pyproject.toml
- tests/_test_utils/onnx/quantization/autotune/models.py
- tests/gpu/onnx/quantization/autotune/test_workflow.py
- tests/unit/onnx/quantization/autotune/test_autotune_config.py
- tests/unit/onnx/quantization/autotune/test_pattern_cache.py
- tests/unit/onnx/quantization/autotune/test_region.py
💤 Files with no reviewable changes (1)
- tests/gpu/onnx/quantization/autotune/test_workflow.py
✅ Files skipped from review due to trivial changes (1)
- tests/unit/onnx/quantization/autotune/test_region.py
🚧 Files skipped from review as they are similar to previous changes (2)
- CHANGELOG.rst
- pyproject.toml
@cjluo-nv for input on this.
Do we know why this test is so slow in the first place? Is there an opportunity to use a smaller model, less data, fewer iterations, or something like that to make it faster? Or, even in the smallest case, does it still take ~3 minutes to run?
This is an integration test which runs trtexec multiple times to find the best Q/DQ insertion points for a Conv->Relu model. Ideally it would run trtexec 4 times, which is why this test takes ~3 minutes to complete. If the problem size of the Conv->Relu model is decreased, the benchmark timings become unstable and the test fails intermittently.
Force-pushed 12ac037 to b865901
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Will Guo <willg@nvidia.com>
Force-pushed c3e8e12 to c1b363b
[project.optional-dependencies]
onnx = [
    "cppimport",
    "cuda-python",
What does this PR do?
Many minor changes:
[onnx]
Usage
# Add a code snippet demonstrating how to use this
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
Additional Information
Summary by CodeRabbit
New Features
Documentation
Tests
Chores