
Add 1-bit affine quantization support#3478

Closed
dusterbloom wants to merge 4 commits into ml-explore:main from dusterbloom:feat/bits-1-affine-quantization

Conversation

@dusterbloom

Summary

Adds support for bits=1 to affine_quantize / affine_dequantize /
quantized_matmul, enabling 1.25-bpw model serving in MLX (1 bit/weight +
fp16 scale + bias per group of 128 input columns).
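With this change the 1-bit path goes through the same Python API as the other bit widths. A minimal usage sketch, assuming the public mx.quantize / mx.dequantize / mx.quantized_matmul entry points (bits=1 is only accepted once this PR is applied; exact defaults may differ):

```python
import mlx.core as mx

group_size, bits = 128, 1

# 1 bit per weight + one fp16 scale and one fp16 bias per group of 128
# input columns -> 1 + (2 * 16) / 128 = 1.25 bits per weight.
w = mx.random.normal(shape=(1024, 1024)).astype(mx.float16)
w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)

# Dequantize (dispatches to the new affine_dequantize_*_gs_128_b_1 kernel) ...
w_hat = mx.dequantize(w_q, scales, biases, group_size=group_size, bits=bits)

# ... or multiply directly against the packed weights.
x = mx.random.normal(shape=(1, 1024)).astype(mx.float16)
y = mx.quantized_matmul(
    x, w_q, scales, biases,
    transpose=True, group_size=group_size, bits=bits,
)
```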

The 3 commits (originally authored by Pasha Khosravi, cherry-picked from
research work) are:

| SHA | Author | Description |
| --- | --- | --- |
| 309f947 | Pasha Khosravi | Add 1-bit affine quantization support — relaxes the bits < 2 guard, adds the bit-0/bit-1 → w_min/w_max codepath, ships the Metal kernel affine_dequantize_*_gs_128_b_1, adds Python tests |
| 17edfc9 | Pasha Khosravi | Guard fast-path Metal kernel dispatch for 1-bit quantization (1-line fix) |
| 06ee46c | Pasha Khosravi | Fix qmv_fast tail iteration for non-aligned K (26-line correctness fix) |

Total: 11 files changed, +484 / −98 in the kernel commit, plus 27 lines of
tail-iteration / dispatch hardening.
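
For reference, the 1-bit affine codepath reduces each group to a two-value codebook: ŵ = scale · q + bias with q ∈ {0, 1}, so bit 0 decodes to bias (≈ the group's w_min) and bit 1 to scale + bias (≈ w_max). A tiny pure-Python sketch of that mapping (the LSB-first byte packing used here is an illustrative assumption, not the layout of the Metal kernel):

```python
import numpy as np

def dequantize_1bit_ref(packed, scales, biases, group_size=128):
    """Reference 1-bit affine dequantization: w_hat = scale * q + bias, q in {0, 1}."""
    q = np.unpackbits(packed, bitorder="little").astype(np.float32)  # one bit per weight
    groups = q.reshape(-1, group_size)                               # one (scale, bias) per group
    # bit 0 -> bias (~group minimum), bit 1 -> scale + bias (~group maximum)
    return (groups * scales.reshape(-1, 1) + biases.reshape(-1, 1)).reshape(-1)
```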

Motivation: 1.25-bpw production models

Several models in the wild are now shipping in 1.25-bpw (1 bit/weight +
group-128 affine scale/bias) format:

  • prism-ml/Bonsai-1.7B-mlx-1bit (~260 MB residency)
  • prism-ml/Bonsai-8B-mlx-1bit (~1.25 GB residency)
  • prism-ml/Bonsai-4B-mlx-1bit

These checkpoints declare quantization.bits == 1 in config.json and
require the affine_dequantize_*_gs_128_b_1 Metal kernel that this PR adds.
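
For concreteness, a loader only needs the quantization block of config.json to select the 1-bit kernel parameters. The snippet below shows the shape of that block (keys follow the usual MLX checkpoint convention; the exact contents of the Bonsai configs are an assumption here):

```python
import json

# Illustrative quantization block as it would appear in a 1.25-bpw
# checkpoint's config.json (assumed contents, following MLX convention).
config = json.loads('{"quantization": {"group_size": 128, "bits": 1}}')

q = config["quantization"]
assert q["bits"] == 1 and q["group_size"] == 128
# A loader forwards these two fields to mx.quantized_matmul / mx.dequantize,
# which is what routes execution to affine_dequantize_*_gs_128_b_1.
```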

Validated end-to-end

The kernels have been integrated into the higgs inference engine
(PR #142) and validated on
real Bonsai-8B inference:

| Metric | Bonsai-8B (M4 base, 32 GB) |
| --- | --- |
| Decode tps (median, 3 trials, max_tokens=200, T=0) | 61.10 tok/s |
| Decode tps stdev | 0.32 tok/s |
| TTFT | 323 ms |
| Output coherence (greedy 3-word smoke) | "Hello. World. Friend." ✅ |

Runtime parity was validated against the original feat/magic-canvas research
branch: performance matches within thermal noise on M4.

Test coverage

The first commit (309f947) ships:

  • python/tests/test_quantized.py — adds test_quantize_1bit covering
    affine_quantize round-trip + quantized_matmul correctness for the
    1-bit path (96 lines added)
  • python/tests/cuda_skip.py — explicit skip entry for the new test on CUDA

The two follow-up commits are correctness fixes guarded by the existing
mlx_quantize / mlx_qmm test suite.
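
A hedged sketch of the kind of check test_quantize_1bit performs — a round-trip through quantize/dequantize plus quantized_matmul agreement with a plain matmul against the dequantized weights (the actual test body in 309f947 may differ):

```python
import unittest
import mlx.core as mx

class TestQuantize1Bit(unittest.TestCase):
    def test_round_trip_and_qmm(self):
        group_size, bits = 128, 1
        w = mx.random.normal(shape=(512, 512))
        x = mx.random.normal(shape=(2, 512))

        w_q, scales, biases = mx.quantize(w, group_size=group_size, bits=bits)
        w_hat = mx.dequantize(w_q, scales, biases, group_size=group_size, bits=bits)
        self.assertEqual(w_hat.shape, w.shape)

        # quantized_matmul must agree with a matmul on the dequantized weights.
        y_ref = x @ w_hat.T
        y_q = mx.quantized_matmul(
            x, w_q, scales, biases,
            transpose=True, group_size=group_size, bits=bits,
        )
        self.assertTrue(mx.allclose(y_ref, y_q, atol=1e-4).item())

if __name__ == "__main__":
    unittest.main()
```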

Downstream chain

Once this PR merges, two follow-up PRs drop the fork dependency for downstream
Rust consumers:

  1. ml-explore/mlx-c — bump submodule to a merged-mlx SHA, no signature
    changes anticipated (the v0.6.0-3 C bindings already accept global_scale
    as a nullable mlx_array)
  2. oxideai/mlx-rs — 12-line plumbing fix in src/ops/quantization.rs
    to pass null global_scale arrays through mlx_quantize /
    mlx_dequantize / mlx_qqmm (matches the v0.6.0-3 C signature)

Acknowledgements

All three commits authored by Pasha Khosravi.
This PR is a packaging step to bring his research work into upstream MLX so
production model loaders can drop their fork chains.

Cherry-picks are clean against current main (the branch base sits on
upstream ce45c52 "[CUDA] Use qmv kernel for fp quantizations (#3239)").

🤖 PR prepared with Claude Code

@zcbenz
Collaborator

zcbenz commented May 5, 2026

Closing as a duplicate of #3161.

@zcbenz closed this May 5, 2026
