feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag by DingmaomaoBJTU · Pull Request #872 · microsoft/winml-cli

DingmaomaoBJTU · 2026-06-11T03:04:44Z

Summary

Adds a unified --precision flag that auto-selects the quantization algorithm based on the target precision. This replaces the need to manually configure --weight-type, --activation-type, and algorithm-specific flags.

Closes #867

Precision → Algorithm Mapping

| `--precision` | Algorithm | Description |
| --- | --- | --- |
| `fp16` | FP16 conversion | Weights + activations → FP16 (I/O stays FP32) |
| `int4` / `w4a32` | RTN (weight-only) | 4-bit weight via MatMulNBits, activation stays FP32 |
| `w4a16` | RTN + FP16 | 4-bit weight via MatMulNBits + FP16 post-processing |
| `int8` | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
| `int16` / `w16a16` | Static QDQ | Calibrated QDQ (int16 weight + uint16 activation) |
| `w8a16` | Static QDQ | Calibrated QDQ (uint8 weight + uint16 activation) |
| `w8a8` | Static QDQ | Calibrated QDQ (uint8 weight + uint8 activation) |
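
The mapping in the table above can be read as a plain lookup. The following is a minimal sketch of that idea, not the code in this PR: `PRECISION_TO_ALGORITHM`, `select_algorithm`, and `needs_calibration` are hypothetical names introduced only for illustration, and the table rows are the only source for its contents.

```python
# Illustrative lookup only; these names are hypothetical and not taken from winml-cli.
PRECISION_TO_ALGORITHM = {
    "fp16": "FP16 conversion",
    "int4": "RTN (weight-only)",
    "w4a32": "RTN (weight-only)",
    "w4a16": "RTN + FP16",
    "int8": "Static QDQ",
    "int16": "Static QDQ",
    "w16a16": "Static QDQ",
    "w8a16": "Static QDQ",
    "w8a8": "Static QDQ",
}


def select_algorithm(precision: str) -> str:
    """Return the algorithm label the table lists for a --precision value."""
    key = precision.strip().lower()
    try:
        return PRECISION_TO_ALGORITHM[key]
    except KeyError:
        raise ValueError(f"unsupported precision: {precision!r}") from None


def needs_calibration(precision: str) -> bool:
    """Only the static QDQ rows are described as calibrated."""
    return select_algorithm(precision) == "Static QDQ"


if __name__ == "__main__":
    for name in ("fp16", "int4", "w4a16", "w8a16"):
        print(name, "->", select_algorithm(name), "| calibration:", needs_calibration(name))
```

A flat table keeps the alias pairs (`int4` / `w4a32`, `int16` / `w16a16`) explicit, at the cost of listing every accepted spelling.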

Multi-Pass vs Single-Pass Execution

quantize_onnx(precision=...) handles pass decomposition internally. The caller only needs one call.

| `--precision` | Passes | Execution |
| --- | --- | --- |
| `fp16` | 1 (single) | FP16 conversion only |
| `int4` / `w4a32` | 1 (single) | RTN 4-bit quantization only |
| `int8` / `int16` / `w8a8` / `w8a16` / `w16a16` | 1 (single) | QDQ calibrated quantization only |
| `w4a16` | 2 (multi) | Pass 1: RTN int4 → Pass 2: FP16 conversion |

Architecture: expand_precision("w4a16") → ["int4", "fp16"]. The quant layer (quantize_onnx) orchestrates intermediate files and cleanup. The command layer (build, quantize) calls quantize_onnx once with precision=, with no if-else routing or manual multi-pass loops.
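
A rough sketch of how such pass expansion and intermediate-file cleanup could fit together is shown below. It is written under assumptions: only the name `expand_precision` and the `w4a16` → `["int4", "fp16"]` expansion come from the description above, while `run_passes` and the `run_single_pass` callback are hypothetical stand-ins for whatever `quantize_onnx` does internally.

```python
import os
import tempfile
from typing import Callable


def expand_precision(precision: str) -> list[str]:
    """Decompose a precision into single-operation passes (only w4a16 expands)."""
    key = precision.strip().lower()
    return ["int4", "fp16"] if key == "w4a16" else [key]


def run_passes(
    src: str,
    dst: str,
    precision: str,
    run_single_pass: Callable[[str, str, str], None],  # hypothetical (input, output, step) callback
) -> list[str]:
    """Chain the passes, writing intermediates to temp files and removing them afterwards."""
    passes = expand_precision(precision)
    intermediates: list[str] = []
    current = src
    try:
        for index, step in enumerate(passes):
            if index == len(passes) - 1:
                out = dst  # the last pass writes the final model
            else:
                fd, out = tempfile.mkstemp(suffix=f".{step}.onnx")
                os.close(fd)
                intermediates.append(out)
            run_single_pass(current, out, step)
            current = out
    finally:
        # Clean up intermediate files even if a pass raised.
        for path in intermediates:
            if os.path.exists(path):
                os.remove(path)
    return passes
```

With this shape the caller makes one call regardless of how many passes a precision expands to, which matches the single-call contract described above.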

Key Design Decisions

- int4 is equivalent to w4a32: both produce RTN 4-bit weight-only quantization with activations unchanged (FP32)
- w4a16 is the ONLY precision that expands to multi-pass; w8a16 is a single QDQ pass (uint8 weight + uint16 activation), NOT "int8 then fp16"
- a32 (e.g., w4a32) means "activation stays FP32 (no quantization)" and is only valid with weight-only (4-bit) precisions
- RTN and FP16 paths skip calibration entirely; warnings are shown if calibration flags are provided
- QDQ precisions (int8, int16, w8a16, etc.) still require calibration data
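
The a32 rule in the decisions above can be sketched as a small validator for the w{x}a{y} form. This is an assumption-laden illustration: `parse_precision` and the two bit-width sets are hypothetical, and the exact sets of supported widths in the PR are not stated here beyond the examples given (w4a32 accepted; w8a32, w4a4, and w3a8 rejected).

```python
import re

# Assumed supported widths for illustration; the real sets live in the PR's code.
VALID_WEIGHT_BITS = {4, 8, 16}
VALID_ACTIVATION_BITS = {8, 16, 32}  # 32 means "activation stays FP32"


def parse_precision(precision: str) -> tuple[int, int]:
    """Parse 'w{x}a{y}' into (weight_bits, activation_bits), rejecting invalid combinations."""
    match = re.fullmatch(r"w(\d+)a(\d+)", precision.strip().lower())
    if match is None:
        raise ValueError(f"not a w{{x}}a{{y}} precision: {precision!r}")
    weight_bits, activation_bits = int(match.group(1)), int(match.group(2))
    if weight_bits not in VALID_WEIGHT_BITS or activation_bits not in VALID_ACTIVATION_BITS:
        raise ValueError(f"unsupported bit-width combination: {precision!r}")
    if activation_bits == 32 and weight_bits != 4:
        # a32 is only meaningful for weight-only (4-bit) precisions.
        raise ValueError(f"a32 requires a 4-bit weight-only precision: {precision!r}")
    return weight_bits, activation_bits
```

Rejecting w8a32 at parse time gives a clear error up front instead of a failure deep inside a quantization pass.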

Supported Commands

| Command | `--precision` support | Notes |
| --- | --- | --- |
| `winml build` | ✅ | Full pipeline: export → optimize → quantize → compile |
| `winml quantize` | ✅ | Standalone quantization on existing ONNX |
| `winml config` | ✅ | Config generation respects precision |
| `winml perf` | ✅ | Performance testing with precision-aware builds |
| `winml eval` | ✅ | Evaluation with precision-aware builds |

E2E Test Results (convnext-tiny-224)

| Command | Result | Notes |
| --- | --- | --- |
| `winml quantize --precision fp16` | ✅ | 109 → 54.6 MB, 4.7 s |
| `winml quantize --precision int4` | ✅ | 109 → 23.7 MB, 3.7 s (RTN 4-bit) |
| `winml quantize --precision w4a32` | ✅ | 109 → 23.7 MB (same as int4) |
| `winml quantize --precision w4a16` | ✅ | 109 → 18.1 MB (RTN + FP16) |
| `winml quantize --precision int8` | ✅ | 109 → 28.0 MB, 46 s (QDQ) |
| `winml build --precision fp16` | ✅ | Full pipeline, 87 s |
| `winml build --precision int4` | ✅ | Full pipeline with RTN |
| `winml build --precision w4a16` | ✅ | Full pipeline, RTN + FP16, 208 s |

TODO (follow-up PRs)

- Mixed precision: QDQ on top of FP16 (e.g., --precision int8 --fp16)
- RTN tuning flags: expose --block-size, --symmetric, --accuracy-level
- Dynamic quantization: wire algorithm="dynamic" path (no calibration, quantize at runtime)

timenick

Three findings on PR #872.

🤖 Generated with GitHub Copilot CLI

Add FP16 precision conversion support across all model pipeline commands:

- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py

Design: FP16 is a precision transformation (not a graph optimization), so it lives as a command-layer utility rather than an optimizer pipe. All three commands share the same convert_to_fp16() function.

Fixes #867

- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list, and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ) and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile, where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)

- Remove --precision flag and FP16 conversion from export command
- Remove --precision, --fp16-keep-io-types, --fp16-op-block-list from optimize command and all FP16 conversion logic
- Add --precision fp16 support to quantize command (creates fp16_only config, uses quantize_onnx FP16 fast path)
- FP16 precision is now only available through:
  - winml quantize --precision fp16 (standalone)
  - winml build --precision fp16 (E2E pipeline)
  - winml perf/eval --precision fp16 (E2E commands)

Expand build's --precision from fp32/fp16 only to the full precision range: auto, fp32, fp16, int8, int16, and w{x}a{y} format (e.g., w8a8, w8a16). This unifies the build and quantize CLI experience.

Changes:

- Update precision_option() to accept free-form string instead of click.Choice restricted to fp32/fp16
- Pass precision to generate_build_config() for proper quant config resolution at config generation time
- Pass precision to resolve_quant_compile_config() in _patch_device for config-file builds with --precision override
- Propagate fp16/fp16_only fields when patching existing quant config
- Add early validation using _is_valid_precision() for clear error messages
- Add precision examples to build command help text

Replace 'import onnx' + 'from onnx import ...' dual-import pattern with consistent 'from onnx import ...' style to satisfy CodeQL's 'Module is imported with import and import from' check.

- Remove duplicate old precision_option (main already has expanded version)
- Update test_precision_fp16_clears_quant to expect fp16_only quant config instead of quant=None (matches our FP16-in-quantize design)
- Remove duplicate --precision fp16 build example (main already has one)

When --precision fp16 is used, calibration-related flags (--samples, --method, --weight-type, --activation-type) have no effect. Add explicit warnings in both the CLI layer (quantize command) and the API layer (quantize_onnx) so users are not silently surprised.
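
One way to picture this warning check is the sketch below. It is an assumption-based illustration, not the helper from the PR: `ignored_calibration_flags` is a hypothetical name, it returns the offending flags instead of printing, and it treats any option that is not `None` as explicitly provided.

```python
# Flags that the text above says have no effect with --precision fp16.
CALIBRATION_OPTIONS = ("samples", "method", "weight_type", "activation_type")


def ignored_calibration_flags(precision: str, **options: object) -> list[str]:
    """Return the CLI flags that were provided but will be ignored for FP16."""
    if precision.strip().lower() != "fp16":
        return []
    return [
        "--" + name.replace("_", "-")
        for name in CALIBRATION_OPTIONS
        if options.get(name) is not None
    ]


if __name__ == "__main__":
    for flag in ignored_calibration_flags("fp16", samples=128, method="minmax"):
        print(f"warning: {flag} has no effect with --precision fp16")
```

Returning the list rather than printing keeps the check reusable from both a CLI layer and an API layer, each of which can decide how to surface the warning.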

FP16-only quantization configs do not perform calibration, so they do not need task or model_name fields. The validation now treats fp16_only the same as ONNX builds and submodule builds.

Only static QDQ quantization requires calibration data (and thus task/model_name). RTN (weight-only) and dynamic quantization do not need calibration, so they should not require these fields.

- Add int4 to named precisions, support w4a{8,16} as weight-only RTN
- Add is_weight_only_precision() and extract_weight_bits() helpers
- resolve_quant_compile_config creates RTN config for weight-only
- quantize command: add RTN fast path between FP16 and QDQ paths
- quantize_onnx: implement RTN path using ORT MatMulNBitsQuantizer
- Update tests for new valid precision values (int4, w4a16)

…tion

- _patch_device now propagates algorithm/rtn_bits to existing quant config
- _run_quantize_stage: add RTN path with proper StageLive output
- quantizer: extract .model (ModelProto) from ONNXModel wrapper

- Add type annotation to fp16.py convert result (no-any-return)
- Add assert for precision not None in quantize.py (union-attr)
- Remove duplicate imports in build.py _run_quantize_stage

- Add RTN branch in generate_hf_build_config (int4/w4a16 was silently skipped)
- Pass use_external_data to save_onnx in FP16 and RTN paths (quantizer.py)
- Extract _warn_ignored_calibration_options helper to remove duplication

- QDQ FP16 post-processing: apply convert_to_fp16 in-memory instead of save-reload-save round-trip (matches RTN pattern)
- Pass use_external_data consistently to all save_onnx calls
- extract_weight_bits: validate bit-widths against supported sets
- Add test for unsupported bit-width combinations (w4a4, w3a8, etc.)
- Clarify dynamic algorithm as planned-not-yet-wired in config comment

…t-processing

- Add 32 to _VALID_ACTIVATION_BITS (a32 = activation stays FP32)
- w4a32 is treated as weight-only RTN, equivalent to int4
- w4a16 now correctly sets fp16=True for FP16 post-processing after RTN
- Add extract_activation_bits() helper to derive activation bit-width
- Validate that a32 is only valid with weight-only (4-bit) precisions; w8a32 is rejected
- Add tests for w4a32, extract_activation_bits, and a32 validation

- Remove fp16_only field from WinMLQuantizationConfig
- Rename fp16 field to fp16_postprocess (post-quant FP16 conversion)
- Use algorithm='fp16' for pure FP16 mode (replaces fp16_only=True)
- Update all references in commands, config builders, and tests
- Backward compat: from_dict still reads old 'fp16' key

- Remove fp16_postprocess from WinMLQuantizationConfig
- Add expand_precision() to decompose w4a16 into [int4, fp16] passes
- Refactor _run_quantize_stage into multi-pass loop with helper functions
- Each quantize_onnx call now does exactly one operation (single responsibility)
- Update standalone quantize command for two-pass w4a16 flow
- Add precision field to WinMLBuildConfig for pass expansion
- Add expand_precision tests

- Add 'precision' parameter to quantize_onnx() that handles multi-pass expansion internally (e.g., w4a16 → [int4, fp16])
- Simplify _run_quantize_stage in build.py to a single quantize_onnx() call; no more _make_step_config or _run_single_quantize_pass helpers
- Simplify commands/quantize.py RTN path: remove manual expand_precision loop and intermediate file management
- Delete unused _should_run_quantization() dead code from quantizer.py
- All multi-pass orchestration (intermediate files, cleanup, pass config construction) now lives in the quant layer where it belongs

Move calibration warning logic from commands/quantize.py into utils/cli.py as warn_ignored_calibration_options() so any command that needs the check can reuse it without duplicating the logic.

FP16 conversion is exclusively used by the quantizer's algorithm='fp16' path. It's not an optimizer pipe — move it to quant/fp16.py where it logically belongs. Remove optim/fp16.py entirely.

Address reviewer comment: mode and algorithm are redundant. algorithm is the active routing field; mode is kept only for serialization backward-compatibility and marked deprecated.

Remove redundant 'algorithm' field. Expand 'mode' to cover all quantization modes: static, dynamic, rtn, fp16. The old 'qdq' value is mapped to 'static' for backward compatibility. from_dict() prefers the old 'algorithm' key over 'mode' when both are present (old to_dict emitted both), preventing silent data loss when deserializing configs with algorithm='rtn' or 'fp16'.
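
The deserialization rule described here can be sketched as follows. This is a simplified illustration under assumptions: `QuantConfigSketch` is a hypothetical stand-in for the real config class, it carries only the `mode` field, and defaulting to `static` when neither key is present is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Any

VALID_MODES = {"static", "dynamic", "rtn", "fp16"}


@dataclass
class QuantConfigSketch:
    """Hypothetical, trimmed-down config holding only the routing field."""

    mode: str = "static"

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QuantConfigSketch":
        # Prefer the legacy 'algorithm' key: old serializers wrote both keys,
        # and 'mode' could be a stale 'qdq' while 'algorithm' held 'rtn' or 'fp16'.
        raw = data.get("algorithm") or data.get("mode") or "static"
        if raw == "qdq":
            raw = "static"  # legacy spelling of static QDQ
        if raw not in VALID_MODES:
            raise ValueError(f"unknown quantization mode: {raw!r}")
        return cls(mode=raw)

    def to_dict(self) -> dict[str, Any]:
        # New serializations emit only 'mode'.
        return {"mode": self.mode}
```

For example, `QuantConfigSketch.from_dict({"mode": "qdq", "algorithm": "rtn"})` resolves to `rtn`, which is the silent-data-loss case the commit message calls out.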

…ize command paths

- Split _quantize_single_pass into 3 focused methods: _quantize_fp16, _quantize_rtn, _quantize_qdq with a dispatch dict
- Consolidate 3 separate FP16/RTN/QDQ paths in commands/quantize.py into a single if/elif/else that builds config then shares execution logic
- Remove duplicated try/except, console output, and output path logic
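
A dispatch dict of the kind mentioned here might look like the sketch below. Only the three method names come from the commit message; the class name, the handler signature, and the placeholder bodies (which just return a description string) are assumptions made so the example is self-contained.

```python
from typing import Callable


class QuantizerSketch:
    """Hypothetical illustration of routing a single pass through a typed dispatch dict."""

    def __init__(self) -> None:
        # An explicit value type keeps static checkers from inferring Any for the handlers.
        self._dispatch: dict[str, Callable[[str, str], str]] = {
            "fp16": self._quantize_fp16,
            "rtn": self._quantize_rtn,
            "static": self._quantize_qdq,
        }

    def quantize_single_pass(self, mode: str, src: str, dst: str) -> str:
        handler = self._dispatch.get(mode)
        if handler is None:
            raise ValueError(f"no handler wired for mode: {mode!r}")
        return handler(src, dst)

    # Placeholder handlers: the real ones would read, transform, and save a model.
    def _quantize_fp16(self, src: str, dst: str) -> str:
        return f"fp16: {src} -> {dst}"

    def _quantize_rtn(self, src: str, dst: str) -> str:
        return f"rtn: {src} -> {dst}"

    def _quantize_qdq(self, src: str, dst: str) -> str:
        return f"qdq: {src} -> {dst}"
```

Adding a new mode then means adding one method and one dict entry, rather than another branch in a growing conditional.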

DingmaomaoBJTU requested a review from a team as a code owner June 11, 2026 03:04

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/pipes/test_pipe_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 8f5a1d2 to 9e7d8fd Compare June 11, 2026 04:15

DingmaomaoBJTU changed the title ~~feat: add --enable-fp16-conversion to winml optimize~~ feat: add --precision fp16 to optimize, build, and export commands Jun 11, 2026

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/test_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 9e7d8fd to 7d7a0ae Compare June 11, 2026 04:22

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread src/winml/modelkit/optim/fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 7d7a0ae to 328b5ab Compare June 11, 2026 04:32

timenick reviewed Jun 11, 2026

Comment thread src/winml/modelkit/commands/build.py Outdated

Comment thread src/winml/modelkit/commands/build.py Outdated

Comment thread tests/unit/optim/test_fp16.py Outdated

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 328b5ab to b859627 Compare June 11, 2026 05:26

github-advanced-security AI found potential problems Jun 11, 2026

Comment thread tests/unit/optim/test_fp16.py Fixed

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch 2 times, most recently from 837330d to fede96c Compare June 11, 2026 07:43

DingmaomaoBJTU changed the title ~~feat: add --precision fp16 to optimize, build, and export commands~~ feat: FP16 precision support via quantize stage + extended build --precision Jun 23, 2026

DingmaomaoBJTU and others added 7 commits June 23, 2026 15:16

chore: remove spurious .data files

37f12a4

fix: resolve CodeQL import warnings in fp16 module

3b4e69f

DingmaomaoBJTU force-pushed the dingmaomaobjtu/feat-fp16-conversion branch from 82c92cb to 75be8d3 Compare June 23, 2026 07:37

github-advanced-security AI found potential problems Jun 23, 2026

Comment thread src/winml/modelkit/commands/build.py Fixed

github-actions Bot added 6 commits June 23, 2026 15:43

fix: skip task/model_name validation for fp16_only quant configs

4597e07

fix: skip calibration validation for rtn and dynamic algorithms

e882dd5

fix: build pipeline RTN routing and MatMulNBitsQuantizer model extrac…

762f2d0

fix: resolve lint warnings (raw regex strings, unused variable)

43d27b3

DingmaomaoBJTU changed the title ~~feat: FP16 precision support via quantize stage + extended build --precision~~ feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag Jun 23, 2026

github-actions Bot added 5 commits June 23, 2026 18:12

fix: resolve mypy type errors and remove duplicate imports

1183861

fix: address code review findings

b99fabf

xieofxie reviewed Jun 24, 2026

Comment thread src/winml/modelkit/quant/config.py Outdated

github-actions Bot added 6 commits June 24, 2026 14:56

fix: clean up intermediate pass files in multi-pass quantize stage

9339354

chore: remove duplicate is_submodule assignment in build config valid…

47181aa

…ation

refactor: extract warn_ignored_calibration_options to shared cli utils

f110178

refactor: move convert_to_fp16 from optim to quant module

2e79eb0

xieofxie reviewed Jun 24, 2026

Comment thread src/winml/modelkit/quant/config.py Outdated

chore: mark legacy mode field as deprecated in quant config

997009e

zhenchaoni requested changes Jun 24, 2026

xieofxie reviewed Jun 24, 2026

Comment thread src/winml/modelkit/quant/quantizer.py

zhenchaoni reviewed Jun 24, 2026

Comment thread src/winml/modelkit/commands/build.py Outdated

xieofxie reviewed Jun 24, 2026

Comment thread src/winml/modelkit/commands/quantize.py Outdated

xieofxie reviewed Jun 24, 2026

Comment thread src/winml/modelkit/quant/config.py

github-actions Bot added 2 commits June 24, 2026 17:21

fix: type dispatch dict properly to satisfy mypy no-any-return

37da6f1

DingmaomaoBJTU mentioned this pull request Jun 24, 2026

Apply fp16_op_block_list / fp16_keep_io_types to QDQ path (and vice versa) #963

Open

DingmaomaoBJTU wants to merge 28 commits into main from dingmaomaobjtu/feat-fp16-conversion

5 participants
