ATQ (Adaptive Ternary Quantization) is a framework for post-training quantization and quantization-aware training that compresses large language model weights to a ternary representation {-1, 0, +1} using layer-specific dynamic thresholds. Unlike fixed-threshold ternary methods, ATQ adapts each layer's threshold to its empirical weight distribution, either by magnitude ranking (sparsity-target mode) or by an absolute magnitude cutoff, enabling ~16x compression of the ternarized layers versus FP32 while keeping perplexity within acceptable degradation bounds. The framework supports mixed-precision assignment for sensitivity-critical layers, optional calibration-data-driven threshold tuning, straight-through estimators for quantization-aware training, and 2-bit packed storage for efficient on-device deployment.
| Method | Bits | Perplexity | Effective Size | Compression | Source |
|---|---|---|---|---|---|
| FP32 Baseline | 32 | 35.70 | 474.7 MB | 1.0x | Measured |
| RTN Ternary (naive) | 2 | 1,320,412 | 29.7 MB | 16.0x | Measured |
| ATQ (ours) | 2 | 110,062 | 95.6 MB | 5.0x | Measured |
| GPTQ | 4 | 32.1 | 59.3 MB | 8.0x | Frantar et al., 2022 |
| AWQ | 4 | 31.5 | 59.3 MB | 8.0x | Lin et al., 2023 |
Note on post-training ternary quantization: Ternary quantization (2-bit, only 3 possible values per weight) is fundamentally more aggressive than 4-bit methods like GPTQ/AWQ. Post-training ternary quantization without fine-tuning causes significant perplexity degradation across all methods. However, ATQ's adaptive thresholding achieves ~12x lower perplexity than naive RTN ternary, demonstrating that intelligent threshold selection is critical. The ternary layers themselves achieve 16x compression; the 5.0x overall ratio reflects that embeddings and the LM head are kept at full precision.
Quantization-aware training (QAT) with the included STE-based training loop is expected to substantially close the gap with FP32, as the model can adapt its weights to the ternary constraint during fine-tuning.
```mermaid
graph TD
    A[HuggingFace Model] --> B[Layer Analysis]
    B --> C{Mixed Precision?}
    C -->|Yes| D[Importance Scoring]
    D --> E[Assign Precision Map]
    C -->|No| F[Full Ternary]
    E --> G[Apply ATQ per Layer]
    F --> G
    G --> H[Calibration Optional]
    H --> I[Quantized Model]
    I --> J[2-bit Packed Storage]
    I --> K[Perplexity Evaluation]
```
```
ATQ-LLM/
├── atq/                        # Core quantization library
│   ├── __init__.py
│   ├── quantizers.py           # Adaptive ternary quantizers (magnitude & sparsity modes)
│   ├── layers.py               # ATQ-wrapped linear layers with STE
│   ├── mixed_precision.py      # Importance scoring and precision map assignment
│   ├── calibration.py          # Calibration-data-driven threshold optimization
│   └── bit_packing.py          # 2-bit packed storage and unpacking utilities
├── llm/                        # LLM-specific pipeline
│   ├── __init__.py
│   ├── quantize_model.py       # End-to-end model quantization entry point
│   ├── evaluate.py             # Perplexity and token-level evaluation
│   └── benchmark.py            # Compression ratio, memory, and latency benchmarks
├── experiments/                # Reproducibility scripts
│   ├── ablation.py             # Sparsity sweep and mixed-precision ablations
│   ├── train_atq_gpt2.py       # QAT training loop for GPT-2
│   └── train_atq_tinyllama.py  # QAT training loop for TinyLlama-1.1B
├── notebooks/                  # Interactive exploration
│   ├── 01_atq_demo.ipynb       # End-to-end quantization demo
│   ├── 02_ablation_results.ipynb
│   └── 03_layer_analysis.ipynb
├── tests/                      # Unit tests
│   ├── test_quantizers.py
│   ├── test_layers.py
│   └── test_bit_packing.py
├── results/                    # Saved benchmark outputs
├── requirements.txt
├── LICENSE
└── README.md
```
```bash
git clone https://github.com/as567-code/ATQ-LLM.git
cd ATQ-LLM
pip install -r requirements.txt
```

Requirements: Python 3.8+, PyTorch 2.0+, Transformers 4.30+.
```python
from llm.quantize_model import quantize_model

result = quantize_model(model_name="gpt2", use_calibration=False)
print(f"Compression: {result['stats']['compression_ratio']:.1f}x")
```

```bash
# Quantize and evaluate
python llm/quantize_model.py --model gpt2

# Run benchmarks
python llm/benchmark.py --model gpt2

# Run ablation studies
python experiments/ablation.py --model gpt2

# Train with QAT
python experiments/train_atq_gpt2.py --epochs 3 --mode magnitude
```

The `experiments/ablation.py` script sweeps sparsity targets and mixed-precision settings. Run with:
```bash
python experiments/ablation.py --model gpt2 --max-batches 50
```

Key observations from post-training ablations:
- Sparsity vs. perplexity trade-off: Higher sparsity targets zero out more weights, increasing compression but also increasing perplexity. ATQ consistently outperforms naive RTN at all sparsity levels.
- Mixed-precision: Retaining the most sensitivity-critical layers at FP16 significantly improves perplexity at the cost of reduced aggregate compression.
- QAT potential: The included training scripts (`experiments/train_atq_gpt2.py`) use STE gradients and optional knowledge distillation to fine-tune quantized models, which is expected to substantially reduce perplexity degradation.
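To make the distillation option concrete, a blended hard-label/soft-teacher loss for STE-based fine-tuning can be sketched as follows. The `distillation_loss` helper and its `T`/`alpha` defaults are illustrative assumptions, not the repo's actual training API:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with temperature-scaled teacher KL (illustrative)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-loss gradients match the hard-loss scale
    return alpha * hard + (1 - alpha) * soft
```

During QAT the teacher is the full-precision model and the student is the ternarized one; the soft term vanishes when their logits agree.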
Adaptive Thresholding. Each linear layer's threshold is computed independently from its own weight distribution. In sparsity-target mode, the threshold is set to the quantile of absolute weight magnitudes at the target sparsity s, so that a fraction s of the weights collapse to zero. In magnitude mode, an absolute cutoff is used directly.
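A minimal sketch of sparsity-target mode, using a hypothetical `ternary_quantize` helper rather than the actual `atq/quantizers.py` API:

```python
import torch

def ternary_quantize(w, sparsity=0.5):
    """Map weights to {-1, 0, +1} with a per-layer quantile threshold (sketch)."""
    # Threshold at the `sparsity` quantile of |w|: roughly that fraction becomes 0.
    t = torch.quantile(w.abs().flatten(), sparsity)
    q = torch.zeros_like(w)
    q[w > t] = 1.0
    q[w < -t] = -1.0
    return q
```

In practice a per-layer scaling factor (e.g. the mean magnitude of the surviving weights) is stored alongside the ternary codes to recover the original dynamic range.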
Straight-Through Estimator (STE). During quantization-aware training, the forward pass applies the ternary mapping while the backward pass passes gradients through unchanged, allowing end-to-end gradient-based optimization despite the non-differentiable quantization step.
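The STE pattern can be sketched with a custom autograd function; `TernarySTE` is an illustrative name, not the implementation in `atq/layers.py`:

```python
import torch

class TernarySTE(torch.autograd.Function):
    """Forward: ternary mapping. Backward: identity (straight-through)."""

    @staticmethod
    def forward(ctx, w, threshold):
        q = torch.zeros_like(w)
        q[w > threshold] = 1.0
        q[w < -threshold] = -1.0
        return q

    @staticmethod
    def backward(ctx, grad_output):
        # Pass gradients through unchanged; the threshold gets no gradient.
        return grad_output, None
```

A wrapped linear layer would call `TernarySTE.apply(self.weight, t)` in its forward pass, so the optimizer updates the latent full-precision weights while inference sees only ternary values.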
2-bit Packed Storage. Ternary values {-1, 0, +1} are encoded as {0, 1, 2} and packed four-per-byte, reducing on-disk and in-memory footprint by ~16x versus FP32.
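The packing scheme can be sketched in NumPy as follows; the `pack_ternary`/`unpack_ternary` names are assumptions, not the `atq/bit_packing.py` API:

```python
import numpy as np

def pack_ternary(t):
    """Encode {-1, 0, +1} as {0, 1, 2} and pack four 2-bit codes per byte."""
    codes = (np.asarray(t, dtype=np.int8) + 1).astype(np.uint8)  # -1->0, 0->1, +1->2
    pad = (-len(codes)) % 4                                      # pad to a multiple of 4
    codes = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)])
    c = codes.reshape(-1, 4)
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed, n):
    """Recover the first n ternary values from a packed byte array."""
    b = np.asarray(packed, dtype=np.uint8)
    codes = np.stack([b & 3, (b >> 2) & 3, (b >> 4) & 3, (b >> 6) & 3], axis=1).reshape(-1)
    return codes[:n].astype(np.int8) - 1
```

Four weights per byte is 2 bits per weight versus 32 for FP32, hence the ~16x footprint reduction quoted above.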
Mixed Precision. An importance scorer ranks layers by gradient-weighted activation sensitivity. The top-k most sensitive layers are kept in FP16 or FP32, while the remainder are ternarized, trading aggregate compression ratio for preserved accuracy on critical computations.
Calibration. A small calibration dataset (e.g., 128 samples from WikiText-2) can be used to minimize per-layer reconstruction error and find globally optimal thresholds before inference.
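One simple way to realize this is a per-layer grid search over candidate thresholds, picking the one that minimizes output reconstruction error on calibration activations. This `calibrate_threshold` sketch is an assumption about the approach, not the `atq/calibration.py` implementation:

```python
import torch

def calibrate_threshold(w, x, candidates):
    """Pick the threshold minimizing ||xW^T - x Q(W)^T||^2 on calibration data x."""
    ref = x @ w.t()                      # full-precision layer output
    best_t, best_err = None, float("inf")
    for t in candidates:
        q = torch.zeros_like(w)
        q[w > t] = 1.0
        q[w < -t] = -1.0
        # Per-layer scale: mean magnitude of surviving weights.
        alpha = w.abs()[q != 0].mean() if (q != 0).any() else 1.0
        err = (ref - x @ (alpha * q).t()).pow(2).mean().item()
        if err < best_err:
            best_t, best_err = t, err
    return best_t
```

Running this over, say, 128 WikiText-2 samples per layer selects thresholds that account for how quantization error propagates through real activations, not just the weight distribution.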
- GPTQ (Frantar et al., 2022) — post-training weight quantization via optimal second-order updates; complementary to ATQ's threshold-based approach.
- AWQ (Lin et al., 2023) — activation-aware weight quantization that scales weights before quantization; ATQ instead adapts thresholds per layer without requiring activation statistics at quantization time.
- ATQ-Multimodal — a related ternary quantization effort extending these ideas to vision-language models: github.com/ak736/ATQ-Multimodal.
```bibtex
@inproceedings{atq2025,
  title={Adaptive Ternary Quantization for On-Device LLM Compression},
  author={Swaroop, Aditya and Kumar, Akshat},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025},
  note={Under Review}
}
```

This project is licensed under the MIT License. See LICENSE for details.