support model_free WOQ quantization #1699

Open
xin3he wants to merge 33 commits into main from xinhe/4-14

Conversation

@xin3he (Contributor) commented Apr 17, 2026

Description

Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.

Auto-enabled by default. As of v0.13, when you pass --iters 0 --disable_opt_rtn together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. This is bit-exactly equivalent to the regular --iters 0 --disable_opt_rtn flow but uses far less memory. Use --disable_model_free to opt out and force the original flow.
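
For reference, the auto-routing decision reduces to a predicate like the one below. This is an illustrative sketch only, not the actual code in `auto_round/__main__.py`; the helper `is_supported_int_woq_scheme` is hypothetical.

```python
# Illustrative sketch of the CLI auto-routing rule described above.
# Not the actual implementation; `is_supported_int_woq_scheme` is a
# hypothetical predicate standing in for the real scheme check.
def should_use_model_free(args, scheme) -> bool:
    return (
        args.iters == 0                          # RTN only, no iterative tuning
        and args.disable_opt_rtn                 # plain RTN, no opt-RTN refinement
        and not args.disable_model_free          # user has not opted out
        and is_supported_int_woq_scheme(scheme)  # INT WOQ presets only
    )
```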

Key features:

  • No model object required – only config.json and safetensors files are needed
  • Low disk memory required (if no local model files) – downloads and quantizes one shard at a time, deleting the source shard after processing
  • Per-layer configuration – supports --layer_config for per-layer bit-width overrides and --ignore_layers to keep specific layers in full precision
  • Predefined ignore layers – automatically skips model-specific layers (e.g., MoE gates, MTP layers) based on config detection
  • Bit-exact parity with the standard --iters 0 --disable_opt_rtn flow for all supported schemes
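
To make the shard-by-shard flow concrete, here is a minimal sketch of the streaming loop. The helper names (`download_shard`, `is_linear_weight`, `quantize_rtn`, `pack_int_weight`) are hypothetical placeholders, not the actual functions in `auto_round.compressors.model_free`:

```python
import os
from safetensors.torch import load_file, save_file

def quantize_model_free(shard_sources, output_dir, bits=4, group_size=128):
    """Sketch: stream shards one at a time, quantize Linear weights, free disk."""
    for src in shard_sources:
        local_path = download_shard(src)        # hypothetical: fetch one shard
        tensors = load_file(local_path)
        out = {}
        for name, weight in tensors.items():
            if is_linear_weight(name, weight):  # hypothetical: 2D Linear check
                qweight, scale, zp = quantize_rtn(weight, bits, group_size)
                out.update(pack_int_weight(name, qweight, scale, zp))
            else:
                out[name] = weight              # non-Linear tensors pass through
        save_file(out, os.path.join(output_dir, os.path.basename(local_path)))
        os.remove(local_path)                   # delete source shard after processing
```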

Supported schemes

Model-free mode currently supports the following integer weight-only preset schemes (packed in the auto_round:auto_gptq format):

| Preset | Bits | Group size | Sym |
| --- | --- | --- | --- |
| W2A16 | 2 | 128 | true |
| W2A16G32 | 2 | 32 | true |
| W2A16G64 | 2 | 64 | true |
| W4A16 (default) | 4 | 128 | true |
| W4A16_MIXED | 4 | 128 | true |
| W8A16 | 8 | 128 | true |

The 2-bit and 8-bit presets (W2A16, W2A16G32, W2A16G64, W8A16) also support asymmetric quantization (sym=False), producing auto_round:auto_gptq-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization, the regular flow packs in the auto_round:auto_awq format instead; use the standard AutoRound flow for that case.

You can also pass a custom `QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16)` with bits ∈ {2, 4, 8} and any group_size / sym configuration.
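
For example (constructor fields as quoted above; the import path for `QuantizationScheme` is assumed here):

```python
from auto_round import AutoRound
from auto_round.schemes import QuantizationScheme  # import path assumed

# Custom 2-bit asymmetric scheme with group size 64.
scheme = QuantizationScheme(bits=2, group_size=64, sym=False, data_type="int", act_bits=16)

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme=scheme,
    model_free=True,
).quantize_and_save("./int2-asym-llama")
```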

Schemes that require special packing kernels (W3A16, FPW8A16, BF16, MXFP4, MXFP8, MXINT4, NVFP4, FP8_BLOCK, FP8_STATIC, INT8_W8A8, GGUF:*, ...) are not supported in model-free mode and will raise ValueError. Use the regular AutoRound flow for those.

CLI Usage

# Easiest: --iters 0 --disable_opt_rtn auto-routes to model-free
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn \
  --output_dir ./int4-llama

# Equivalent explicit invocation
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --output_dir ./int4-llama

# Opt out of auto-routing and use the regular flow instead
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn --disable_model_free \
  --output_dir ./int4-llama

# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --group_size 32 \
  --asym \
  --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
  --ignore_layers "mlp" \
  --output_dir ./int4-llama

API Usage

from auto_round import AutoRound

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme="W4A16",  # Or a QuantizationScheme instance for custom group_size / sym.
    layer_config={
        ".*k_proj": {"bits": 8, "group_size": 32},
        ".*v_proj": {"bits": 8, "group_size": 32},
    },
    ignore_layers="mlp",
    model_free=True,
).quantize_and_save("./int4-llama")

Note: Model-free mode only supports the auto_round output format and uses RTN (no calibration data, no iterative tuning). For higher-quality quantization or schemes outside the supported list, use the standard AutoRound flow.

Memory and performance optimizations

  • Improved the quantize_weight_rtn function to minimize peak memory usage by using in-place operations, avoiding unnecessary intermediate allocations, and vectorizing bit packing. This makes quantization more efficient, especially on large models and GPUs.
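
For reference, the core symmetric RTN math is small; the sketch below follows the formula quoted in the review thread further down (`maxq = 2^(bits-1)`, `scale = signed_max(wmin, wmax) / maxq`, `q = clamp(round(w / scale), -maxq, maxq - 1)`). It is a plain reference version under stated assumptions, not the optimized in-place implementation:

```python
import torch

def rtn_quantize_sym(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Reference symmetric RTN, one scale per group. Assumes
    in_features % group_size == 0; the shipped quantize_weight_rtn adds
    in-place ops and vectorized packing on top of this math."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, -1, group_size)      # [out, n_groups, group]
    maxq = 1 << (bits - 1)                            # e.g. 8 for 4-bit
    wmin = wg.amin(dim=-1, keepdim=True)
    wmax = wg.amax(dim=-1, keepdim=True)
    # signed max: the bound with the larger magnitude, sign preserved
    scale = torch.where(wmax.abs() >= wmin.abs(), wmax, wmin) / maxq
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)  # guard /0
    q = torch.clamp(torch.round(wg / scale), -maxq, maxq - 1)
    return q.reshape(out_features, in_features).to(torch.int8), scale.squeeze(-1)
```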

Fused expert tensor handling

  • Added logic to automatically split fused 3D expert tensors (common in MoE models) into per-expert 2D tensors, ensuring compatibility with quantization routines and improving support for a wider range of model architectures.
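
A sketch of the splitting idea, assuming a fused layout of `[num_experts, out_features, in_features]`; the per-expert naming scheme below is illustrative, not the exact one used:

```python
import torch

def split_fused_experts(name: str, tensor: torch.Tensor) -> dict:
    """Split a fused 3D expert tensor into per-expert 2D tensors so each
    slice can take the normal 2D quantization path. Naming is illustrative."""
    if tensor.dim() != 3:
        return {name: tensor}  # not fused, pass through unchanged
    num_experts = tensor.shape[0]
    return {f"{name}.{i}.weight": tensor[i].contiguous() for i in range(num_experts)}
```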

Utility function improvements

  • Enhanced the compress_layer_names utility to repeatedly compress multi-level numbered layer names until fully reduced, supporting more complex naming patterns.
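
The interesting part is the repeat-until-stable loop; a minimal sketch follows (the real regex, and whether indices are dropped or replaced with a wildcard, live in `auto_round/utils/common.py` and may differ):

```python
import re

# Assumed pattern: a numeric path component between dots, e.g. ".3." in
# "model.layers.3.mlp". Back-to-back indices like ".0.0." share a dot, so a
# single sub() pass only removes every other one; the loop repeats until
# the name stops changing.
_NUM = re.compile(r"\.\d+\.")

def compress_layer_names(name: str) -> str:
    prev = None
    while prev != name:
        prev = name
        name = _NUM.sub(".", name)  # drop one level of numbering per pass
    return name

# compress_layer_names("model.layers.3.up_proj")  -> "model.layers.up_proj"
# compress_layer_names("model.layers.0.0.mlp")    -> "model.layers.mlp" (two passes)
```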

Documentation and minor corrections

  • Updated documentation for both English and Chinese users, including new sections on model-free mode, corrected quantization scheme tables, and clarified quantization backend support.

Time and memory usage

Qwen/Qwen3.5-35B-A3B

| Flow | Total time | Peak RAM | Peak VRAM |
| --- | --- | --- | --- |
| Model-free | 153.61 s | 8.86 GB | 0.7 GB |
| `--iters 0 --disable_opt_rtn` | 220 s | 48.13 GB | 0.44 GB |

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1491

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

xin3he added 13 commits April 14, 2026 13:00
Copilot AI review requested due to automatic review settings April 17, 2026 03:56
xin3he and others added 4 commits April 17, 2026 11:56
Copilot AI (Contributor) left a comment
Pull request overview

Adds a new “model-free” weight-only (RTN) quantization path that operates directly on safetensors shards (without instantiating a full model), along with related utilities, CLI plumbing, tests, and documentation updates.

Changes:

  • Introduces auto_round.compressors.model_free with shard-by-shard quantization, ignore-layer handling, and FP8 source dequant support.
  • Enhances missing-tensor handling with fused MoE expert-tensor splitting and RTN memory/perf optimizations.
  • Updates CLI/docs and adds/updates CPU tests covering model-free mode and new utilities.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| auto_round/compressors/model_free.py | New model-free RTN WOQ implementation with shard streaming and per-layer config support. |
| auto_round/utils/missing_tensors.py | Adds fused-expert tensor splitting and reduces RTN peak memory via in-place ops/vectorized packing. |
| auto_round/utils/common.py | Improves compress_layer_names by repeatedly compressing until stable. |
| auto_round/__main__.py | Adds --model_free CLI flag and routes to the model-free quantization flow. |
| test/test_cpu/quantization/test_model_free.py | New unit tests for model-free quantization behavior and helpers. |
| test/test_cpu/utils/test_missing_tensors.py | Migrates tests to pytest and adds coverage for fused-expert splitting and updated WOQ behaviors. |
| docs/step_by_step.md | Documents Model-Free Mode usage (CLI/API). |
| docs/step_by_step_CN.md | Chinese documentation updates aligned with the English Model-Free Mode section, plus related corrections. |

Comment threads:

  • auto_round/compressors/model_free.py (outdated)
  • test/test_cpu/utils/test_missing_tensors.py
  • auto_round/utils/missing_tensors.py (×2)
  • auto_round/__main__.py
  • auto_round/compressors/model_free.py (×2)
  • auto_round/compressors/model_free.py (outdated, ×2)
  • docs/step_by_step.md
@n1ck-guo (Contributor):
I think it would be better to wrap it as a class and use a unified interface with auto_round.

xin3he added 6 commits April 24, 2026 09:41
@xin3he (Author) commented Apr 25, 2026

> Have you tested a model with Conv1D layers where the weights need to be transposed before quantization? How do you detect the layer type?

Thanks for the reminder; Conv1D is skipped now.

@xin3he (Author) commented Apr 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

@xin3he xin3he requested review from n1ck-guo and yiliu30 April 25, 2026 14:31
model_free=True,
device_map=device,
)
assert getattr(ar_a, "model_free", False) is True
Contributor:
add model inference test for W4G128

Author:
Actually, many existing UTs already verify it, since --iters 0 --disable_opt_rtn calls model-free by default.

ar_a = AutoRound(
tiny_opt_model_path,
scheme=scheme_preset,
bits=scheme_kwargs["bits"],
Contributor:
Better to add a Conv1D model test, and remove Conv1D from the default target layers in the AR API.

Author:
Already covered in test/test_cpu/models/test_conv1d.py.

Comment thread docs/step_by_step_CN.md

Model-Free Mode can perform RTN WOQ quantization **without loading the full model into memory**. It downloads safetensors files directly, quantizes each Linear weight tensor shard by shard, and saves the packed result. This mode is very useful when you need fast, calibration-free quantization under tight resource constraints.

> **Auto-enabled by default.** Since v0.13, when you pass `--iters 0 --disable_opt_rtn` together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. It is **bit-exact** with the original `--iters 0 --disable_opt_rtn` flow but uses far less memory. To turn off auto-routing and force the original flow, add `--disable_model_free`.
Contributor:
Move some of this into details to keep the doc clean.

Comment thread docs/step_by_step.md
**Key features:**
- **No model object required** – only `config.json` and safetensors files are needed
- **Low disk memory required** (if no local model files) – downloads and quantizes one shard at a time, deleting the source shard after processing
- **Per-layer configuration** – supports `--layer_config` for per-layer bit-width overrides and `--ignore_layers` to keep specific layers in full precision
Contributor:
All interfaces should be the same as the other APIs'. If not, you need to log a warning and mention these limitations here.

Author:
Agreed, the usage of layer_config and ignore_layers is not changed.

# maxq = 2^(bits-1) (e.g. 8 for 4-bit)
# scale = signed_max(wmin, wmax) / maxq (can be negative)
# q = clamp(round(w / scale), -maxq, maxq - 1)
maxq = 1 << (bits - 1) # e.g. 8 for 4-bit
Contributor:
Better to share code with the other packing functions.

Author:
Right, the current packing function implementation relies on the module object and the configurations stored in it. I worry that the adapter code needed to fit the original code could outweigh the code reused.
If we plan to support more formats, reuse should be considered first.

result: dict[str, torch.Tensor] = {}
split_count = 0

for tensor_name, tensor in tensors_dict.items():
Contributor:
Same issue; better to reuse code.

modules are kept in full precision.
"""

SUPPORTED_FORMATS: tuple[str, ...] = ("auto_round",)
Contributor:
Sync with Heng to check whether the new arc could support this compressor.

# Multimodal keywords kept in full precision by default.
_NONTEXT_KEYWORDS: tuple[str, ...] = (
"vision",
"visual",
Contributor:
There are many places, including auto-scheme, that need the same code; we'd better reuse it.

Author:
will consider it.

Author:
I did consider reusing code; however, due to the model-free design, doing so would actually introduce even more code.


# Allowed ``bits`` values for integer WOQ.
# 3-bit is excluded — see note above.
_SUPPORTED_INT_BITS: tuple[int, ...] = (2, 4, 8)
Contributor:
Why not support 3 bits?

Author:
If there is no urgent request for 3 bits, we can do it later

"W2A16",
"W2A16G32",
"W2A16G64",
"W4A16",
Contributor:
Is there a better way? We may support W4A16G32 later and forget to update this tuple.

Author:
Right, in the current implementation, users can set group_size to enable it.

Comment on lines +240 to +241
auto-routing trigger conditions.
"""
Contributor:
Add one more test combining auto-scheme with model-free?

Author:
It's not supported; I will cover it. Thanks.

@xin3he (Author) commented Apr 26, 2026:
AutoScheme relies on model objects and CUDA devices, so supporting it requires extensive modifications to the model-free logic. Additional time is needed to investigate implementation strategies.
Let's track it in the next release if necessary. #1745

xin3he added 3 commits April 26, 2026 09:49
@xin3he (Author) commented Apr 26, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 2 commits April 27, 2026 07:32

Labels: None yet

Projects: None yet

5 participants