Conversation
Pull request overview
Adds a new “model-free” weight-only (RTN) quantization path that operates directly on safetensors shards (without instantiating a full model), along with related utilities, CLI plumbing, tests, and documentation updates.
Changes:
- Introduces `auto_round.compressors.model_free` with shard-by-shard quantization, ignore-layer handling, and FP8 source dequant support.
- Enhances missing-tensor handling with fused MoE expert-tensor splitting and RTN memory/perf optimizations.
- Updates CLI/docs and adds/updates CPU tests covering model-free mode and new utilities.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| auto_round/compressors/model_free.py | New model-free RTN WOQ implementation with shard streaming and per-layer config support. |
| auto_round/utils/missing_tensors.py | Adds fused-expert tensor splitting and reduces RTN peak memory via in-place ops/vectorized packing. |
| auto_round/utils/common.py | Improves compress_layer_names by repeatedly compressing until stable. |
| auto_round/main.py | Adds --model_free CLI flag and routes to model-free quantization flow. |
| test/test_cpu/quantization/test_model_free.py | New unit tests for model-free quantization behavior and helpers. |
| test/test_cpu/utils/test_missing_tensors.py | Migrates tests to pytest and adds coverage for fused-expert splitting + updated WOQ behaviors. |
| docs/step_by_step.md | Documents Model-Free Mode usage (CLI/API). |
| docs/step_by_step_CN.md | Chinese documentation updates aligned with the English Model-Free Mode section and related corrections. |
I think it would be better to wrap it as a class and use a unified interface with auto_round.
Thanks for the reminder, conv1d is skipped now.

/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
```python
    model_free=True,
    device_map=device,
)
assert getattr(ar_a, "model_free", False) is True
```
add model inference test for W4G128
Actually, many existing UTs already verify it, since `--iters 0 --disable_opt_rtn` takes the model-free path by default.
```python
ar_a = AutoRound(
    tiny_opt_model_path,
    scheme=scheme_preset,
    bits=scheme_kwargs["bits"],
```
better add a conv1d model test, and remove conv1d from the default target layer in AR API
Already covered in test/test_cpu/models/test_conv1d.py.
Model-Free Mode performs RTN WOQ quantization **without loading the full model into memory**. It downloads the safetensors files directly, quantizes each Linear weight tensor shard by shard, and saves the packed result. This mode is useful when you need fast, calibration-free quantization with limited resources.

> **Enabled automatically by default.** Since v0.13, when you pass `--iters 0 --disable_opt_rtn` together with a supported INT WOQ scheme, the CLI automatically takes the model-free path. This path is **bit-exact** with the original `--iters 0 --disable_opt_rtn` flow but uses far less memory. To turn off the auto-routing and force the original flow, add `--disable_model_free`.
Move some of this into a details block to keep the doc clean.
**Key features:**
- **No model object required** – only `config.json` and safetensors files are needed
- **Low disk memory required** (if no local model files) – downloads and quantizes one shard at a time, deleting the source shard after processing
- **Per-layer configuration** – supports `--layer_config` for per-layer bit-width overrides and `--ignore_layers` to keep specific layers in full precision
All interfaces should be the same as the other APIs'. If not, you need to log a warning and mention these limitations here.
Agreed, the usage of layer_config and ignore_layers is not changed.
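For reference, a minimal sketch of the unchanged usage (model id and layer name are placeholders; only `layer_config` is shown on the API side, while the CLI additionally exposes `--ignore_layers`):

```python
from auto_round import AutoRound

ar = AutoRound(
    "facebook/opt-125m",  # placeholder model id or local path
    scheme="W4A16",
    model_free=True,
    # per-layer override: keep one projection at 8 bits (placeholder layer name)
    layer_config={"model.decoder.layers.0.self_attn.q_proj": {"bits": 8}},
)
```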
```python
# maxq = 2^(bits-1) (e.g. 8 for 4-bit)
# scale = signed_max(wmin, wmax) / maxq (can be negative)
# q = clamp(round(w / scale), -maxq, maxq - 1)
maxq = 1 << (bits - 1)  # e.g. 8 for 4-bit
```
Better to share code with the other packing functions.
Right, the current packing function implementation relies on the module object and the configuration stored in the module. I worry that the code needed to adapt to the original code could exceed the code actually reused.
If we plan to support more formats, reuse should be considered first.
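For reference, a self-contained sketch of the symmetric RTN step described in the quoted comments (function and variable names are illustrative, not the PR's actual helpers; packing the result into int32 words is omitted):

```python
import torch

def rtn_quantize_sym(weight: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Illustrative symmetric RTN: scale = signed max / maxq, q in [-maxq, maxq-1]."""
    out_features, in_features = weight.shape
    # assumes in_features is divisible by group_size
    w = weight.reshape(out_features, in_features // group_size, group_size)
    wmin = w.amin(dim=-1, keepdim=True)
    wmax = w.amax(dim=-1, keepdim=True)
    # signed_max: whichever of wmin/wmax has the larger magnitude, keeping its sign
    signed_max = torch.where(wmin.abs() > wmax.abs(), wmin, wmax)
    maxq = 1 << (bits - 1)                 # e.g. 8 for 4-bit
    scale = signed_max / maxq              # can be negative
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)  # avoid div-by-zero
    q = torch.clamp(torch.round(w / scale), -maxq, maxq - 1)
    return q.reshape(out_features, in_features), scale.squeeze(-1)
```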
```python
result: dict[str, torch.Tensor] = {}
split_count = 0

for tensor_name, tensor in tensors_dict.items():
```
Same issue: better to reuse code.
```python
modules are kept in full precision.
"""

SUPPORTED_FORMATS: tuple[str, ...] = ("auto_round",)
```
sync with heng to check whether the new arc could support this compressor
```python
# Multimodal keywords kept in full precision by default.
_NONTEXT_KEYWORDS: tuple[str, ...] = (
    "vision",
    "visual",
```
There are many places, including auto-scheme, that need the same code; we'd better reuse it.
I did indeed consider reusing code; however, due to the model-free nature of the system, doing so would actually result in the introduction of even more code.
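For illustration only, the filtering presumably amounts to a substring check along these lines (truncated keyword list, hypothetical helper name):

```python
_NONTEXT_KEYWORDS: tuple[str, ...] = ("vision", "visual")  # truncated for illustration

def is_nontext_layer(layer_name: str) -> bool:
    """Heuristic: treat layers whose names contain multimodal keywords as non-text."""
    name = layer_name.lower()
    return any(keyword in name for keyword in _NONTEXT_KEYWORDS)
```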
```python
# Allowed ``bits`` values for integer WOQ.
# 3-bit is excluded — see note above.
_SUPPORTED_INT_BITS: tuple[int, ...] = (2, 4, 8)
```
If there is no urgent request for 3 bits, we can do it later
| "W2A16", | ||
| "W2A16G32", | ||
| "W2A16G64", | ||
| "W4A16", |
Is there a better way? We may support W4A16G32 later and forget to update this tuple.
Right, in the current implementation, the user can set group_size to enable it.
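For example, a custom scheme equivalent to a hypothetical `W4A16G32` preset (a sketch; the import path is an assumption, while the constructor signature follows the PR description):

```python
from auto_round.schemes import QuantizationScheme

scheme = QuantizationScheme(bits=4, group_size=32, sym=True, data_type="int", act_bits=16)
```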
```python
auto-routing trigger conditions.
"""
```
add one more test for combining auto-scheme with model free?
It's not supported; I will cover it. Thanks.
AutoScheme relies on model objects and CUDA devices, which necessitates extensive modifications to the "model-free" logic.
To ensure proper functionality, additional time is required to investigate specific implementation strategies.
Let's track it in the next release if necessary. #1745
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Description
Model-free mode performs RTN WOQ quantization without loading the full model into memory. It downloads safetensors files directly, quantizes each Linear weight tensor shard-by-shard, and saves the packed result. This is useful when you want fast, no-calibration quantization with minimal resource requirements.
Key features:
- No model object required – only `config.json` and safetensors files are needed
- Per-layer configuration – supports `--layer_config` for per-layer bit-width overrides and `--ignore_layers` to keep specific layers in full precision
- Bit-exact with the regular `--iters 0 --disable_opt_rtn` flow for all supported schemes

Supported schemes
Model-free mode currently supports the following integer weight-only preset schemes (packed in the `auto_round:auto_gptq` format):

- `W2A16`
- `W2A16G32`
- `W2A16G64`
- `W4A16` (default)
- `W4A16_MIXED`
- `W8A16`

All of the above presets also support asymmetric quantization (`sym=False`) for 2-bit and 8-bit variants (`W2A16`, `W2A16G32`, `W2A16G64`, `W8A16`), producing `auto_round:auto_gptq`-packed output with bit-exact parity to the regular flow. For 4-bit asymmetric quantization the regular flow uses `auto_round:auto_awq` packing as suggested; use the standard AutoRound flow for that case.

You can also pass a custom `QuantizationScheme(bits=N, group_size=G, sym=True/False, data_type="int", act_bits=16)` with `bits ∈ {2, 4, 8}` and any group_size / sym configuration.

Schemes that require special packing kernels (`W3A16`, `FPW8A16`, `BF16`, `MXFP4`, `MXFP8`, `MXINT4`, `NVFP4`, `FP8_BLOCK`, `FP8_STATIC`, `INT8_W8A8`, `GGUF:*`, ...) are not supported in model-free mode and will raise `ValueError`. Use the regular AutoRound flow for those.

CLI Usage
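As a sketch of what the invocation could look like (the model id and output directory are placeholders; flags other than `--model_free`, `--iters`, `--disable_opt_rtn`, and `--disable_model_free` are assumed to match the existing CLI):

```bash
# Explicitly request the model-free path (placeholder model id / output dir)
auto-round --model Qwen/Qwen2.5-7B-Instruct --scheme W4A16 --model_free --output_dir ./qmodel-w4a16

# Since v0.13 the same path is taken automatically for supported INT WOQ schemes
auto-round --model Qwen/Qwen2.5-7B-Instruct --scheme W4A16 --iters 0 --disable_opt_rtn

# Force the original flow instead
auto-round --model Qwen/Qwen2.5-7B-Instruct --scheme W4A16 --iters 0 --disable_opt_rtn --disable_model_free
```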
API Usage
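And a rough Python sketch under the same assumptions (placeholder model id and output dir; assuming the usual `quantize_and_save` entry point):

```python
from auto_round import AutoRound

ar = AutoRound(
    "Qwen/Qwen2.5-7B-Instruct",  # placeholder model id or local path
    scheme="W4A16",
    model_free=True,             # stream safetensors shards instead of loading the full model
)
ar.quantize_and_save(output_dir="./qmodel-w4a16", format="auto_round")
```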
Memory and performance optimizations
- Optimized the `quantize_weight_rtn` function to minimize peak memory usage by using in-place operations, avoiding unnecessary intermediate allocations, and vectorizing bit packing. This makes quantization more efficient, especially on large models and GPUs.

Fused expert tensor handling
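Roughly, the idea is the following (shapes and name handling here are illustrative, not the PR's exact implementation): a fused MoE weight of shape `[num_experts, out_features, in_features]` is split into per-expert 2-D tensors so each can be quantized like an ordinary Linear weight.

```python
import torch

def split_fused_expert_tensor(name: str, fused: torch.Tensor) -> dict[str, torch.Tensor]:
    """Illustrative split of a fused [num_experts, out, in] MoE weight into per-expert tensors."""
    experts: dict[str, torch.Tensor] = {}
    for idx in range(fused.shape[0]):
        # e.g. "...mlp.experts.gate_up_proj.weight" -> "...mlp.experts.0.gate_up_proj.weight"
        experts[name.replace("experts.", f"experts.{idx}.", 1)] = fused[idx].contiguous()
    return experts
```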
Utility function improvements
- Improved the `compress_layer_names` utility to repeatedly compress multi-level numbered layer names until fully reduced, supporting more complex naming patterns.

Documentation and minor corrections
Time and memory usage
Qwen/Qwen3.5-35B-A3B

| Flow | Total time | Peak RAM | Peak VRAM |
|---|---|---|---|
| Model free | 153.61 seconds | 8.86 GB | 0.7 GB |
| `--iters=0 --disable_opt_rtn` | 220 seconds | 48.13 GB | 0.44 GB |
Type of Change
Related Issues
Fixes or relates to #1491
Checklist Before Submitting