feat: precision-driven quantization (FP16, RTN int4, static QDQ) via --precision flag #872
Open
DingmaomaoBJTU wants to merge 28 commits into
timenick
reviewed
Jun 11, 2026
Add FP16 precision conversion support across all model pipeline commands:
- Create optim/fp16.py with convert_to_fp16() utility (wraps ORT float16)
- optimize: --precision fp16 with --fp16-keep-io-types and --fp16-op-block-list
- build: --precision fp16 stage between optimize and quantize
- export: --precision fp16 as post-export conversion
- Add shared precision_option() CLI decorator in utils/cli.py
Design: FP16 is a precision transformation (not a graph optimization), so it lives as a command-layer utility rather than an optimizer pipe. All three commands share the same convert_to_fp16() function.
Fixes #867
- Add algorithm, fp16, fp16_only, fp16_keep_io_types, fp16_op_block_list, and RTN fields to WinMLQuantizationConfig
- quantize_onnx now supports pure-FP16 fast path (fp16_only=True skips QDQ) and FP16 post-processing after QDQ (fp16=True, fp16_only=False)
- resolve_quant_compile_config returns fp16_only quant config for precision=fp16
- Remove _run_fp16_stage and skip-quantize hack from build.py pipelines
- Build pipeline unified: Export -> Optimize -> Quantize Stage -> Compile, where Quantize Stage handles both QDQ and FP16 conversion
- Update tests to reflect new behavior (fp16 produces quant config, not None)
- Remove --precision flag and FP16 conversion from export command
- Remove --precision, --fp16-keep-io-types, --fp16-op-block-list from optimize command and all FP16 conversion logic
- Add --precision fp16 support to quantize command (creates fp16_only config, uses quantize_onnx FP16 fast path)
- FP16 precision is now only available through:
  - winml quantize --precision fp16 (standalone)
  - winml build --precision fp16 (E2E pipeline)
  - winml perf/eval --precision fp16 (E2E commands)
Expand build's --precision from fp32/fp16 only to the full precision
range: auto, fp32, fp16, int8, int16, and w{x}a{y} format (e.g., w8a8,
w8a16). This unifies the build and quantize CLI experience.
Changes:
- Update precision_option() to accept free-form string instead of
click.Choice restricted to fp32/fp16
- Pass precision to generate_build_config() for proper quant config
resolution at config generation time
- Pass precision to resolve_quant_compile_config() in _patch_device
for config-file builds with --precision override
- Propagate fp16/fp16_only fields when patching existing quant config
- Add early validation using _is_valid_precision() for clear error
messages
- Add precision examples to build command help text
Replace 'import onnx' + 'from onnx import ...' dual-import pattern with consistent 'from onnx import ...' style to satisfy CodeQL's 'Module is imported with import and import from' check.
- Remove duplicate old precision_option (main already has expanded version)
- Update test_precision_fp16_clears_quant to expect fp16_only quant config instead of quant=None (matches our FP16-in-quantize design)
- Remove duplicate --precision fp16 build example (main already has one)
When --precision fp16 is used, calibration-related flags (--samples, --method, --weight-type, --activation-type) have no effect. Add explicit warnings in both the CLI layer (quantize command) and the API layer (quantize_onnx) so users are not silently surprised.
FP16-only quantization configs do not perform calibration, so they do not need task or model_name fields. The validation now treats fp16_only the same as ONNX builds and submodule builds.
Only static QDQ quantization requires calibration data (and thus task/model_name). RTN (weight-only) and dynamic quantization do not need calibration, so they should not require these fields.
- Add int4 to named precisions, support w4a{8,16} as weight-only RTN
- Add is_weight_only_precision() and extract_weight_bits() helpers
- resolve_quant_compile_config creates RTN config for weight-only
- quantize command: add RTN fast path between FP16 and QDQ paths
- quantize_onnx: implement RTN path using ORT MatMulNBitsQuantizer
- Update tests for new valid precision values (int4, w4a16)
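For readers unfamiliar with RTN, the following is a minimal pure-Python sketch of block-wise round-to-nearest 4-bit weight quantization. It is not the ORT MatMulNBitsQuantizer code path used by this PR; the function names, the asymmetric zero-point scheme, and the default block size are assumptions chosen only to illustrate the idea.

```python
def rtn_quantize_block(block, bits=4):
    """Quantize one block of float weights to unsigned integer codes (illustrative)."""
    qmax = (1 << bits) - 1                      # 15 for 4-bit codes
    lo = min(min(block), 0.0)                   # keep 0.0 inside the range
    hi = max(max(block), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)             # integer code that represents 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in block]
    return codes, scale, zero_point


def rtn_quantize(weights, block_size=32, bits=4):
    """Split a flat weight list into blocks and quantize each block independently."""
    return [
        rtn_quantize_block(weights[start:start + block_size], bits)
        for start in range(0, len(weights), block_size)
    ]


def rtn_dequantize(blocks):
    """Map integer codes back to approximate float weights."""
    restored = []
    for codes, scale, zero_point in blocks:
        restored.extend((code - zero_point) * scale for code in codes)
    return restored
```

No calibration data is involved: each block's scale and zero point come from the weights alone, which is why this path does not need the calibration-related options.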
…tion
- _patch_device now propagates algorithm/rtn_bits to existing quant config
- _run_quantize_stage: add RTN path with proper StageLive output
- quantizer: extract .model (ModelProto) from ONNXModel wrapper
- Add type annotation to fp16.py convert result (no-any-return)
- Add assert for precision not None in quantize.py (union-attr)
- Remove duplicate imports in build.py _run_quantize_stage
- Add RTN branch in generate_hf_build_config (int4/w4a16 was silently skipped)
- Pass use_external_data to save_onnx in FP16 and RTN paths (quantizer.py)
- Extract _warn_ignored_calibration_options helper to remove duplication
- QDQ FP16 post-processing: apply convert_to_fp16 in-memory instead of save-reload-save round-trip (matches RTN pattern)
- Pass use_external_data consistently to all save_onnx calls
- extract_weight_bits: validate bit-widths against supported sets
- Add test for unsupported bit-width combinations (w4a4, w3a8, etc.)
- Clarify dynamic algorithm as planned-not-yet-wired in config comment
…t-processing
- Add 32 to _VALID_ACTIVATION_BITS (a32 = activation stays FP32)
- w4a32 is treated as weight-only RTN, equivalent to int4
- w4a16 now correctly sets fp16=True for FP16 post-processing after RTN
- Add extract_activation_bits() helper to derive activation bit-width
- Validate a32 only valid with weight-only (4-bit); w8a32 is rejected
- Add tests for w4a32, extract_activation_bits, and a32 validation
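The bit-width rules in these commits can be sketched as a small parser. The helper names mirror those mentioned in the commit messages, but their real signatures and the exact contents of the supported sets are not shown in this PR page, so the bodies and the weight-bit set below are assumptions.

```python
import re

# Assumed supported sets; only the presence of 32 in the activation set
# and the rejection of w4a4 / w3a8 / w8a32 come from the commit notes.
_VALID_WEIGHT_BITS = {4, 8, 16}
_VALID_ACTIVATION_BITS = {8, 16, 32}    # 32 means "activation stays FP32"
_ALIASES = {"int4": "w4a32"}            # int4 is described as equivalent to w4a32
_WXAY = re.compile(r"w(\d+)a(\d+)")


def parse_wxay(precision):
    """Return (weight_bits, activation_bits), or raise ValueError if unsupported."""
    match = _WXAY.fullmatch(_ALIASES.get(precision, precision))
    if match is None:
        raise ValueError(f"not a w{{x}}a{{y}} precision: {precision!r}")
    w_bits, a_bits = int(match.group(1)), int(match.group(2))
    if w_bits not in _VALID_WEIGHT_BITS or a_bits not in _VALID_ACTIVATION_BITS:
        raise ValueError(f"unsupported bit-widths in {precision!r}")
    if a_bits == 32 and w_bits != 4:
        # a32 is only meaningful for weight-only (4-bit) precisions
        raise ValueError(f"a32 requires 4-bit weights: {precision!r}")
    return w_bits, a_bits


def extract_weight_bits(precision):
    return parse_wxay(precision)[0]


def extract_activation_bits(precision):
    return parse_wxay(precision)[1]


def is_weight_only_precision(precision):
    """True for int4 and w4a{8,16,32}; False for anything else."""
    try:
        return extract_weight_bits(precision) == 4
    except ValueError:
        return False
```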
- Remove fp16_only field from WinMLQuantizationConfig
- Rename fp16 field to fp16_postprocess (post-quant FP16 conversion)
- Use algorithm='fp16' for pure FP16 mode (replaces fp16_only=True)
- Update all references in commands, config builders, and tests
- Backward compat: from_dict still reads old 'fp16' key
xieofxie
reviewed
Jun 24, 2026
- Remove fp16_postprocess from WinMLQuantizationConfig
- Add expand_precision() to decompose w4a16 into [int4, fp16] passes
- Refactor _run_quantize_stage into multi-pass loop with helper functions
- Each quantize_onnx call now does exactly one operation (single responsibility)
- Update standalone quantize command for two-pass w4a16 flow
- Add precision field to WinMLBuildConfig for pass expansion
- Add expand_precision tests
- Add 'precision' parameter to quantize_onnx() that handles multi-pass expansion internally (e.g., w4a16 → [int4, fp16])
- Simplify _run_quantize_stage in build.py to a single quantize_onnx() call; no more _make_step_config or _run_single_quantize_pass helpers
- Simplify commands/quantize.py RTN path: remove manual expand_precision loop and intermediate file management
- Delete unused _should_run_quantization() dead code from quantizer.py
- All multi-pass orchestration (intermediate files, cleanup, pass config construction) now lives in the quant layer where it belongs
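The multi-pass orchestration described here might look roughly like the sketch below. Only the mapping expand_precision("w4a16") → ["int4", "fp16"] comes from the PR; the function quantize_multi_pass, its run_pass callback, and the intermediate file naming are hypothetical stand-ins for whatever quantize_onnx does internally.

```python
import os
import tempfile


def expand_precision(precision):
    """Decompose a precision into single-operation passes.

    Assumption: every precision other than w4a16 maps to itself.
    """
    if precision == "w4a16":
        return ["int4", "fp16"]
    return [precision]


def quantize_multi_pass(src, dst, precision, run_pass):
    """Run each pass in order, chaining files; run_pass(step, in_path, out_path)
    is a caller-supplied callable (hypothetical) that performs one operation."""
    steps = expand_precision(precision)
    # Intermediate models live in a temporary directory that is deleted on exit.
    with tempfile.TemporaryDirectory() as tmp:
        current = src
        for index, step in enumerate(steps):
            is_last = index == len(steps) - 1
            target = dst if is_last else os.path.join(tmp, f"pass{index}_{step}.onnx")
            run_pass(step, current, target)
            current = target
    return steps
```

Keeping the loop in one place means callers pass a precision string and never see the intermediate files.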
Move calibration warning logic from commands/quantize.py into utils/cli.py as warn_ignored_calibration_options() so any command that needs the check can reuse it without duplicating the logic.
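A shared helper of this kind could be sketched as follows. The function name matches the commit message, but the signature, the use of the standard warnings module, and the keyword names are assumptions; the project may well report through its own console layer instead.

```python
import warnings

# Options the earlier commit lists as having no effect with --precision fp16.
_CALIBRATION_OPTIONS = ("samples", "method", "weight_type", "activation_type")


def warn_ignored_calibration_options(precision, **options):
    """Warn about explicitly set calibration options that fp16 ignores.

    Returns the names of the ignored options (hypothetical signature).
    """
    if precision != "fp16":
        return []
    ignored = [name for name in _CALIBRATION_OPTIONS if options.get(name) is not None]
    for name in ignored:
        flag = "--" + name.replace("_", "-")
        warnings.warn(
            f"{flag} has no effect with --precision {precision}",
            UserWarning,
            stacklevel=2,
        )
    return ignored
```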
FP16 conversion is exclusively used by the quantizer's algorithm='fp16' path. It's not an optimizer pipe — move it to quant/fp16.py where it logically belongs. Remove optim/fp16.py entirely.
xieofxie
reviewed
Jun 24, 2026
Address reviewer comment: mode and algorithm are redundant. algorithm is the active routing field; mode is kept only for serialization backward-compatibility and marked deprecated.
zhenchaoni
requested changes
Jun 24, 2026
xieofxie
reviewed
Jun 24, 2026
zhenchaoni
reviewed
Jun 24, 2026
xieofxie
reviewed
Jun 24, 2026
Remove redundant 'algorithm' field. Expand 'mode' to cover all quantization modes: static, dynamic, rtn, fp16. The old 'qdq' value is mapped to 'static' for backward compatibility. from_dict() prefers the old 'algorithm' key over 'mode' when both are present (old to_dict emitted both), preventing silent data loss when deserializing configs with algorithm='rtn' or 'fp16'.
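The deserialization rule described in this commit can be illustrated with a stripped-down config class. QuantConfigSketch is a hypothetical stand-in for WinMLQuantizationConfig, which has many more fields; only the key precedence and the 'qdq' → 'static' mapping are taken from the commit message.

```python
from dataclasses import dataclass

_VALID_MODES = {"static", "dynamic", "rtn", "fp16"}
_LEGACY_ALIASES = {"qdq": "static"}     # old value kept readable


@dataclass
class QuantConfigSketch:
    mode: str = "static"

    @classmethod
    def from_dict(cls, data):
        # Old to_dict() emitted both keys, with 'algorithm' carrying the real
        # routing value, so it wins when both are present.
        raw = data.get("algorithm") or data.get("mode") or "static"
        mode = _LEGACY_ALIASES.get(raw, raw)
        if mode not in _VALID_MODES:
            raise ValueError(f"unknown quantization mode: {raw!r}")
        return cls(mode=mode)

    def to_dict(self):
        # New configs serialize a single field.
        return {"mode": self.mode}
```

Without the precedence rule, an old config of the form {"mode": "qdq", "algorithm": "rtn"} would silently load as static QDQ.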
xieofxie
reviewed
Jun 24, 2026
…ize command paths
- Split _quantize_single_pass into 3 focused methods: _quantize_fp16, _quantize_rtn, _quantize_qdq with a dispatch dict
- Consolidate 3 separate FP16/RTN/QDQ paths in commands/quantize.py into a single if/elif/else that builds config then shares execution logic
- Remove duplicated try/except, console output, and output path logic
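A dispatch dict of the kind mentioned in this commit might be structured like the sketch below. The three private method names come from the commit message; the class, the mode keys, and the placeholder bodies (which just tag a string) are assumptions for illustration.

```python
class QuantizerSketch:
    """Hypothetical stand-in showing mode-based dispatch to focused methods."""

    def __init__(self):
        # One entry per mode; adding a mode means adding one method and one key.
        self._dispatch = {
            "fp16": self._quantize_fp16,
            "rtn": self._quantize_rtn,
            "static": self._quantize_qdq,
        }

    def quantize_single_pass(self, model, mode):
        try:
            handler = self._dispatch[mode]
        except KeyError:
            raise ValueError(f"unsupported mode: {mode!r}") from None
        return handler(model)

    # Placeholder bodies: the real methods would transform an ONNX model.
    def _quantize_fp16(self, model):
        return f"fp16({model})"

    def _quantize_rtn(self, model):
        return f"rtn({model})"

    def _quantize_qdq(self, model):
        return f"qdq({model})"
```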
Summary
Adds a unified --precision flag that auto-selects the quantization algorithm based on the target precision. This replaces the need to manually configure --weight-type, --activation-type, and algorithm-specific flags.
Closes #867
Precision → Algorithm Mapping
--precision
- fp16
- int4 / w4a32
- w4a16
- int8
- int16 / w16a16
- w8a16
- w8a8
Multi-Pass vs Single-Pass Execution
quantize_onnx(precision=...) handles pass decomposition internally. The caller only needs one call.
--precision
- fp16
- int4 / w4a32
- int8 / int16 / w8a8 / w8a16 / w16a16
- w4a16
Architecture: expand_precision("w4a16") → ["int4", "fp16"]. The quant layer (quantize_onnx) orchestrates intermediate files and cleanup. The command layer (build, quantize) only calls quantize_onnx once with precision=, with no if-else routing or manual multi-pass loops.
Key Design Decisions
- int4 is equivalent to w4a32: both produce RTN 4-bit weight-only quantization with activations unchanged (FP32)
- w4a16 is the ONLY precision that expands to multi-pass; w8a16 is a single QDQ pass (uint8 weight + uint16 activation), NOT "int8 then fp16"
- a32 (e.g., w4a32) means "activation stays FP32 (no quantization)" and is only valid with weight-only (4-bit) precisions
Supported Commands
--precision support
- winml build
- winml quantize
- winml config
- winml perf
- winml eval
E2E Test Results (convnext-tiny-224)
- winml quantize --precision fp16
- winml quantize --precision int4
- winml quantize --precision w4a32
- winml quantize --precision w4a16
- winml quantize --precision int8
- winml build --precision fp16
- winml build --precision int4
- winml build --precision w4a16
TODO (follow-up PRs)
- --precision int8 --fp16)
- --block-size, --symmetric, --accuracy-level
- algorithm="dynamic" path (no calibration, quantize at runtime)