support model_free WOQ quantization #1699

Open
xin3he wants to merge 33 commits into main from xinhe/4-14

Conversation

@xin3he (Contributor) commented Apr 17, 2026

Description

Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.

Auto-enabled by default. As of v0.13, when you pass --iters 0 --disable_opt_rtn together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. This is bit-exactly equivalent to the regular --iters 0 --disable_opt_rtn flow but uses far less memory. Use --disable_model_free to opt out and force the original flow.
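
For reference, the auto-routing decision reduces to a predicate like the one below. This is an illustrative sketch only, not the actual code in `auto_round/__main__.py`; the helper `is_supported_int_woq_scheme` is hypothetical.

```python
# Illustrative sketch of the CLI auto-routing rule described above.
# Not the actual implementation; `is_supported_int_woq_scheme` is a
# hypothetical predicate standing in for the real scheme check.
def should_use_model_free(args, scheme) -> bool:
    return (
        args.iters == 0                          # RTN only, no iterative tuning
        and args.disable_opt_rtn                 # plain RTN, no opt-RTN refinement
        and not args.disable_model_free          # user has not opted out
        and is_supported_int_woq_scheme(scheme)  # INT WOQ presets only
    )
```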

Key features:

  • No model object required – only config.json and safetensors files are needed
  • Low disk memory required (if no local model files) – downloads and quantizes one shard at a time, deleting the source shard after processing
  • Per-layer configuration – supports --layer_config for per-layer bit-width overrides and --ignore_layers to keep specific layers in full precision
  • Predefined ignore layers – automatically skips model-specific layers (e.g., MoE gates, MTP layers) based on config detection
  • Bit-exact parity with the standard --iters 0 --disable_opt_rtn flow for all supported schemes
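
To make the shard-by-shard flow concrete, here is a minimal sketch of the streaming loop. The helper names (`download_shard`, `is_linear_weight`, `quantize_rtn`, `pack_int_weight`) are hypothetical placeholders, not the actual functions in `auto_round.compressors.model_free`:

```python
import os
from safetensors.torch import load_file, save_file

def quantize_model_free(shard_sources, output_dir, bits=4, group_size=128):
    """Sketch: stream shards one at a time, quantize Linear weights, free disk."""
    for src in shard_sources:
        local_path = download_shard(src)        # hypothetical: fetch one shard
        tensors = load_file(local_path)
        out = {}
        for name, weight in tensors.items():
            if is_linear_weight(name, weight):  # hypothetical: 2D Linear check
                qweight, scale, zp = quantize_rtn(weight, bits, group_size)
                out.update(pack_int_weight(name, qweight, scale, zp))
            else:
                out[name] = weight              # non-Linear tensors pass through
        save_file(out, os.path.join(output_dir, os.path.basename(local_path)))
        os.remove(local_path)                   # delete source shard after processing
```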

Supported schemes

Model-free mode currently supports the following integer weight-only preset schemes (packed in the auto_round:auto_gptq format):

| Preset | Bits | Group size | Sym |
| --- | --- | --- | --- |
| W2A16 | 2 | 128 | true |
| W2A16G32 | 2 | 32 | true |
| W2A16G64 | 2 | 64 | true |
| W4A16 (default) | 4 | 128 | true |
| W4A16_MIXED | 4 | 128 | true |
| W8A16 | 8 | 128 | true |

The 2-bit and 8-bit presets (W2A16, W2A16G32, W2A16G64, W8A16) also support asymmetric quantization (sym=False), producing auto_round:auto_gptq-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization, the regular flow packs in the auto_round:auto_awq format instead; use the standard AutoRound flow for that case.

You can also pass a custom `QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16)` with bits ∈ {2, 4, 8} and any group_size / sym configuration.
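
For example (constructor fields as quoted above; the import path for `QuantizationScheme` is assumed here):

```python
from auto_round import AutoRound
from auto_round.schemes import QuantizationScheme  # import path assumed

# Custom 2-bit asymmetric scheme with group size 64.
scheme = QuantizationScheme(bits=2, group_size=64, sym=False, data_type="int", act_bits=16)

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme=scheme,
    model_free=True,
).quantize_and_save("./int2-asym-llama")
```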

Schemes that require special packing kernels (W3A16, FPW8A16, BF16, MXFP4, MXFP8, MXINT4, NVFP4, FP8_BLOCK, FP8_STATIC, INT8_W8A8, GGUF:*, ...) are not supported in model-free mode and will raise ValueError. Use the regular AutoRound flow for those.

CLI Usage

# Easiest: --iters 0 --disable_opt_rtn auto-routes to model-free
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn \
  --output_dir ./int4-llama

# Equivalent explicit invocation
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --output_dir ./int4-llama

# Opt out of auto-routing and use the regular flow instead
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --scheme W4A16 \
  --iters 0 --disable_opt_rtn --disable_model_free \
  --output_dir ./int4-llama

# With per-layer configuration and ignored layers
auto_round meta-llama/Llama-3.2-1B-Instruct \
  --model_free \
  --scheme W4A16 \
  --group_size 32 \
  --asym \
  --layer_config "{k_proj:{bits:8},v_proj:{bits:8}}" \
  --ignore_layers "mlp" \
  --output_dir ./int4-llama

API Usage

from auto_round import AutoRound

AutoRound(
    model="meta-llama/Llama-3.2-1B-Instruct",
    scheme="W4A16",  # Or a QuantizationScheme instance for custom group_size / sym.
    layer_config={
        ".*k_proj": {"bits": 8, "group_size": 32},
        ".*v_proj": {"bits": 8, "group_size": 32},
    },
    ignore_layers="mlp",
    model_free=True,
).quantize_and_save("./int4-llama")

Note: Model-free mode only supports the auto_round output format and uses RTN (no calibration data, no iterative tuning). For higher-quality quantization or schemes outside the supported list, use the standard AutoRound flow.

Memory and performance optimizations

  • Improved the quantize_weight_rtn function to minimize peak memory usage by using in-place operations, avoiding unnecessary intermediate allocations, and vectorizing bit packing. This makes quantization more efficient, especially on large models and GPUs.
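
For reference, the core symmetric RTN math is small; the sketch below follows the formula quoted in the review thread further down (`maxq = 2^(bits-1)`, `scale = signed_max(wmin, wmax) / maxq`, `q = clamp(round(w / scale), -maxq, maxq - 1)`). It is a plain reference version under stated assumptions, not the optimized in-place implementation:

```python
import torch

def rtn_quantize_sym(w: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Reference symmetric RTN, one scale per group. Assumes
    in_features % group_size == 0; the shipped quantize_weight_rtn adds
    in-place ops and vectorized packing on top of this math."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, -1, group_size)      # [out, n_groups, group]
    maxq = 1 << (bits - 1)                            # e.g. 8 for 4-bit
    wmin = wg.amin(dim=-1, keepdim=True)
    wmax = wg.amax(dim=-1, keepdim=True)
    # signed max: the bound with the larger magnitude, sign preserved
    scale = torch.where(wmax.abs() >= wmin.abs(), wmax, wmin) / maxq
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)  # guard /0
    q = torch.clamp(torch.round(wg / scale), -maxq, maxq - 1)
    return q.reshape(out_features, in_features).to(torch.int8), scale.squeeze(-1)
```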

Fused expert tensor handling

  • Added logic to automatically split fused 3D expert tensors (common in MoE models) into per-expert 2D tensors, ensuring compatibility with quantization routines and improving support for a wider range of model architectures.
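
A sketch of the splitting idea, assuming a fused layout of `[num_experts, out_features, in_features]`; the per-expert naming scheme below is illustrative, not the exact one used:

```python
import torch

def split_fused_experts(name: str, tensor: torch.Tensor) -> dict:
    """Split a fused 3D expert tensor into per-expert 2D tensors so each
    slice can take the normal 2D quantization path. Naming is illustrative."""
    if tensor.dim() != 3:
        return {name: tensor}  # not fused, pass through unchanged
    num_experts = tensor.shape[0]
    return {f"{name}.{i}.weight": tensor[i].contiguous() for i in range(num_experts)}
```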

Utility function improvements

  • Enhanced the compress_layer_names utility to repeatedly compress multi-level numbered layer names until fully reduced, supporting more complex naming patterns.
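
The interesting part is the repeat-until-stable loop; a minimal sketch follows (the real regex, and whether indices are dropped or replaced with a wildcard, live in `auto_round/utils/common.py` and may differ):

```python
import re

# Assumed pattern: a numeric path component between dots, e.g. ".3." in
# "model.layers.3.mlp". Back-to-back indices like ".0.0." share a dot, so a
# single sub() pass only removes every other one; the loop repeats until
# the name stops changing.
_NUM = re.compile(r"\.\d+\.")

def compress_layer_names(name: str) -> str:
    prev = None
    while prev != name:
        prev = name
        name = _NUM.sub(".", name)  # drop one level of numbering per pass
    return name

# compress_layer_names("model.layers.3.up_proj")  -> "model.layers.up_proj"
# compress_layer_names("model.layers.0.0.mlp")    -> "model.layers.mlp" (two passes)
```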

Documentation and minor corrections

  • Updated documentation for both English and Chinese users, including new sections on model-free mode, corrected quantization scheme tables, and clarified quantization backend support.

Time and memory usage

Qwen/Qwen3.5-35B-A3B

| Flow | Total time | Peak RAM | Peak VRAM |
| --- | --- | --- | --- |
| Model-free | 153.61 s | 8.86 GB | 0.7 GB |
| `--iters 0 --disable_opt_rtn` | 220 s | 48.13 GB | 0.44 GB |

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #1491

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

xin3he added 13 commits April 14, 2026 13:00
Copilot AI review requested due to automatic review settings April 17, 2026 03:56
xin3he and others added 4 commits April 17, 2026 11:56
Copilot AI (Contributor) left a comment
Pull request overview

Adds a new “model-free” weight-only (RTN) quantization path that operates directly on safetensors shards (without instantiating a full model), along with related utilities, CLI plumbing, tests, and documentation updates.

Changes:

  • Introduces auto_round.compressors.model_free with shard-by-shard quantization, ignore-layer handling, and FP8 source dequant support.
  • Enhances missing-tensor handling with fused MoE expert-tensor splitting and RTN memory/perf optimizations.
  • Updates CLI/docs and adds/updates CPU tests covering model-free mode and new utilities.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| auto_round/compressors/model_free.py | New model-free RTN WOQ implementation with shard streaming and per-layer config support. |
| auto_round/utils/missing_tensors.py | Adds fused-expert tensor splitting and reduces RTN peak memory via in-place ops/vectorized packing. |
| auto_round/utils/common.py | Improves compress_layer_names by repeatedly compressing until stable. |
| auto_round/__main__.py | Adds --model_free CLI flag and routes to the model-free quantization flow. |
| test/test_cpu/quantization/test_model_free.py | New unit tests for model-free quantization behavior and helpers. |
| test/test_cpu/utils/test_missing_tensors.py | Migrates tests to pytest and adds coverage for fused-expert splitting and updated WOQ behaviors. |
| docs/step_by_step.md | Documents Model-Free Mode usage (CLI/API). |
| docs/step_by_step_CN.md | Chinese documentation updates aligned with the English Model-Free Mode section, plus related corrections. |

Comment threads:

  • auto_round/compressors/model_free.py (outdated)
  • test/test_cpu/utils/test_missing_tensors.py
  • auto_round/utils/missing_tensors.py (×2)
  • auto_round/__main__.py
  • auto_round/compressors/model_free.py (×2)
  • auto_round/compressors/model_free.py (outdated, ×2)
  • docs/step_by_step.md
@n1ck-guo (Contributor):
I think it would be better to wrap it as a class and use a unified interface with auto_round.

xin3he added 6 commits April 24, 2026 09:41
@xin3he (Author) commented Apr 25, 2026

> Have you tested a model with Conv1D layers where the weights need to be transposed before quantization? How do you detect the layer type?

Thanks for the reminder; Conv1D is skipped now.

@xin3he (Author) commented Apr 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

@xin3he xin3he requested review from n1ck-guo and yiliu30 April 25, 2026 14:31
model_free=True,
device_map=device,
)
assert getattr(ar_a, "model_free", False) is True
Contributor:
add model inference test for W4G128

Author:
Actually, many existing UTs already verify it, since --iters 0 --disable_opt_rtn calls model-free by default.

ar_a = AutoRound(
tiny_opt_model_path,
scheme=scheme_preset,
bits=scheme_kwargs["bits"],
Contributor:
Better to add a Conv1D model test, and remove Conv1D from the default target layers in the AR API.

Author:
Already covered in test/test_cpu/models/test_conv1d.py.

Comment thread docs/step_by_step_CN.md

Model-Free Mode can perform RTN WOQ quantization **without loading the full model into memory**. It downloads safetensors files directly, quantizes each Linear weight tensor shard by shard, and saves the packed result. This mode is very useful when you need fast, calibration-free quantization under tight resource constraints.

> **Auto-enabled by default.** Since v0.13, when you pass `--iters 0 --disable_opt_rtn` together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. It is **bit-exact** with the original `--iters 0 --disable_opt_rtn` flow but uses far less memory. To turn off auto-routing and force the original flow, add `--disable_model_free`.
Contributor:
Move some of this into details to keep the doc clean.

Comment thread docs/step_by_step.md
**Key features:**
- **No model object required** – only `config.json` and safetensors files are needed
- **Low disk memory required** (if no local model files) – downloads and quantizes one shard at a time, deleting the source shard after processing
- **Per-layer configuration** – supports `--layer_config` for per-layer bit-width overrides and `--ignore_layers` to keep specific layers in full precision
Contributor:
All interfaces should be the same as the other APIs'. If not, you need to log a warning and mention these limitations here.

Author:
Agreed, the usage of layer_config and ignore_layers is not changed.

# maxq = 2^(bits-1) (e.g. 8 for 4-bit)
# scale = signed_max(wmin, wmax) / maxq (can be negative)
# q = clamp(round(w / scale), -maxq, maxq - 1)
maxq = 1 << (bits - 1) # e.g. 8 for 4-bit
Contributor:
Better to share code with the other packing functions.

Author:
Right, the current packing function implementation relies on the module object and the configurations stored in it. I worry that the adapter code needed to fit the original code could outweigh the code reused.
If we plan to support more formats, reuse should be considered first.

result: dict[str, torch.Tensor] = {}
split_count = 0

for tensor_name, tensor in tensors_dict.items():
Contributor:
Same issue; better to reuse code.

modules are kept in full precision.
"""

SUPPORTED_FORMATS: tuple[str, ...] = ("auto_round",)
Contributor:
Sync with Heng to check whether the new arc could support this compressor.

# Multimodal keywords kept in full precision by default.
_NONTEXT_KEYWORDS: tuple[str, ...] = (
"vision",
"visual",
Contributor:
There are many places, including auto-scheme, that need the same code; we'd better reuse it.

Author:
will consider it.

Author:
I did consider reusing code; however, due to the model-free design, doing so would actually introduce even more code.


# Allowed ``bits`` values for integer WOQ.
# 3-bit is excluded — see note above.
_SUPPORTED_INT_BITS: tuple[int, ...] = (2, 4, 8)
Contributor:
Why not support 3 bits?

Author:
If there is no urgent request for 3 bits, we can do it later

"W2A16",
"W2A16G32",
"W2A16G64",
"W4A16",
Contributor:
Is there a better way? We may support W4A16G32 later and forget to update this tuple.

Author:
Right, in the current implementation, users can set group_size to enable it.

Comment on lines +240 to +241
auto-routing trigger conditions.
"""
Contributor:
Add one more test combining auto-scheme with model-free?

Author:
It's not supported; I will cover it. Thanks.

@xin3he (Author) commented Apr 26, 2026:
AutoScheme relies on model objects and CUDA devices, so supporting it requires extensive modifications to the model-free logic. Additional time is needed to investigate implementation strategies.
Let's track it in the next release if necessary. #1745

xin3he added 3 commits April 26, 2026 09:49
@xin3he (Author) commented Apr 26, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines:
Azure Pipelines successfully started running 1 pipeline(s).

xin3he added 2 commits April 27, 2026 07:32

Labels: None yet

Projects: None yet

5 participants