ATQ-LLM: Adaptive Ternary Quantization for On-Device LLM Compression

Abstract

ATQ (Adaptive Ternary Quantization) is a framework for post-training quantization and quantization-aware training that compresses large language model weights to a ternary representation {-1, 0, +1} using layer-specific dynamic thresholds. Unlike fixed-threshold ternary methods, ATQ sets each layer's threshold from its empirical weight distribution, either by magnitude ranking (sparsity-target mode) or by an absolute magnitude cutoff (magnitude mode). Ternarized layers are stored at 2 bits per weight, about 16x smaller than FP32; the perplexity cost of doing this without fine-tuning is large and is reported under Key Results. The framework supports mixed-precision assignment for sensitivity-critical layers, optional calibration-data-driven threshold tuning, straight-through estimators for quantization-aware training, and 2-bit packed storage for efficient on-device deployment.

Key Results (GPT-2 Small, WikiText-2)

Method	Bits	Perplexity	Effective Size	Compression	Source
FP32 Baseline	32	35.70	474.7 MB	1.0x	Measured
RTN Ternary (naive)	2	1,320,412	29.7 MB	16.0x	Measured
ATQ (ours)	2	110,062	95.6 MB	5.0x	Measured
GPTQ	4	32.1	59.3 MB	8.0x	Frantar et al., 2022
AWQ	4	31.5	59.3 MB	8.0x	Lin et al., 2023

Note on post-training ternary quantization: Ternary quantization (2-bit, only 3 possible values per weight) is fundamentally more aggressive than 4-bit methods like GPTQ/AWQ. Post-training ternary quantization without fine-tuning causes significant perplexity degradation across all methods. However, ATQ's adaptive thresholding achieves ~12x lower perplexity than naive RTN ternary, demonstrating that intelligent threshold selection is critical. The ternary layers themselves achieve 16x compression; the 5.0x overall ratio reflects that embeddings and the LM head are kept at full precision.
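
The gap between the 16x per-layer ratio and the lower overall ratio follows from a bit-weighted average over all parameters. The sketch below illustrates the arithmetic only; the helper name `overall_compression` and the parameter counts are hypothetical and are not GPT-2's actual layer sizes.

```python
def overall_compression(groups, baseline_bits=32):
    """Return baseline size / quantized size for (num_params, bits) groups."""
    total_params = sum(n for n, _ in groups)
    quantized_bits = sum(n * bits for n, bits in groups)
    return (total_params * baseline_bits) / quantized_bits

# Hypothetical model: 70% of parameters ternarized (2 bits),
# 30% (e.g. embeddings and LM head) left at 32 bits.
groups = [(70_000_000, 2), (30_000_000, 32)]
ratio = overall_compression(groups)
print(f"{ratio:.2f}x")  # 2.91x
```

Even a modest share of full-precision parameters dominates the quantized size, which is why the overall ratio sits well below 16x.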

Quantization-aware training (QAT) with the included STE-based training loop is expected to substantially close the gap with FP32, as the model can adapt its weights to the ternary constraint during fine-tuning.

Architecture

graph TD
    A[HuggingFace Model] --> B[Layer Analysis]
    B --> C{Mixed Precision?}
    C -->|Yes| D[Importance Scoring]
    D --> E[Assign Precision Map]
    C -->|No| F[Full Ternary]
    E --> G[Apply ATQ per Layer]
    F --> G
    G --> H[Calibration Optional]
    H --> I[Quantized Model]
    I --> J[2-bit Packed Storage]
    I --> K[Perplexity Evaluation]

Repository Structure

ATQ-LLM/
├── atq/                        # Core quantization library
│   ├── __init__.py
│   ├── quantizers.py           # Adaptive ternary quantizers (magnitude & sparsity modes)
│   ├── layers.py               # ATQ-wrapped linear layers with STE
│   ├── mixed_precision.py      # Importance scoring and precision map assignment
│   ├── calibration.py          # Calibration-data-driven threshold optimization
│   └── bit_packing.py          # 2-bit packed storage and unpacking utilities
├── llm/                        # LLM-specific pipeline
│   ├── __init__.py
│   ├── quantize_model.py       # End-to-end model quantization entry point
│   ├── evaluate.py             # Perplexity and token-level evaluation
│   └── benchmark.py            # Compression ratio, memory, and latency benchmarks
├── experiments/                # Reproducibility scripts
│   ├── ablation.py             # Sparsity sweep and mixed-precision ablations
│   ├── train_atq_gpt2.py       # QAT training loop for GPT-2
│   └── train_atq_tinyllama.py  # QAT training loop for TinyLlama-1.1B
├── notebooks/                  # Interactive exploration
│   ├── 01_atq_demo.ipynb       # End-to-end quantization demo
│   ├── 02_ablation_results.ipynb
│   └── 03_layer_analysis.ipynb
├── tests/                      # Unit tests
│   ├── test_quantizers.py
│   ├── test_layers.py
│   └── test_bit_packing.py
├── results/                    # Saved benchmark outputs
├── requirements.txt
├── LICENSE
└── README.md

Installation

git clone https://github.com/as567-code/ATQ-LLM.git
cd ATQ-LLM
pip install -r requirements.txt

Requirements: Python 3.8+, PyTorch 2.0+, Transformers 4.30+.

Quick Start

Quantize GPT-2 in Python

from llm.quantize_model import quantize_model

result = quantize_model(model_name="gpt2", use_calibration=False)
print(f"Compression: {result['stats']['compression_ratio']:.1f}x")

Run the Full Pipeline

# Quantize and evaluate
python llm/quantize_model.py --model gpt2

# Run benchmarks
python llm/benchmark.py --model gpt2

# Run ablation studies
python experiments/ablation.py --model gpt2

# Train with QAT
python experiments/train_atq_gpt2.py --epochs 3 --mode magnitude

Ablation Studies

The experiments/ablation.py script sweeps sparsity targets and mixed-precision settings. Run with:

python experiments/ablation.py --model gpt2 --max-batches 50

Key observations from post-training ablations:

Sparsity vs. perplexity trade-off: Higher sparsity targets zero out more weights, increasing compression but also increasing perplexity. ATQ consistently outperforms naive RTN at all sparsity levels.
Mixed-precision: Retaining the most sensitivity-critical layers at FP16 significantly improves perplexity at the cost of reduced aggregate compression.
QAT potential: The included training scripts (experiments/train_atq_gpt2.py) use STE gradients and optional knowledge distillation to fine-tune quantized models, which is expected to substantially reduce perplexity degradation.

How ATQ Works

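Adaptive Thresholding. Each linear layer's threshold is computed independently from its own weight distribution. In sparsity-target mode, the threshold is set to the s-quantile (the 100·s-th percentile) of the absolute weight magnitudes, so that approximately a fraction s of the layer's weights collapse to zero; ties at the threshold can shift the fraction slightly. In magnitude mode, a fixed absolute cutoff is applied directly, and the resulting sparsity varies with each layer's weight distribution.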
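
The two threshold modes described above can be sketched in a few lines of dependency-free Python. This is an illustration under assumptions, not the code in atq/quantizers.py: the function names are hypothetical, weights exactly at the threshold are assumed to map to zero, and no per-layer scale factor is applied.

```python
def sparsity_threshold(weights, sparsity):
    """Pick a threshold so that about `sparsity` of the weights are zeroed."""
    mags = sorted(abs(w) for w in weights)
    k = int(round(sparsity * len(mags)))  # number of weights to zero out
    if k == 0:
        return 0.0  # nothing to prune beyond exact zeros
    return mags[k - 1]  # k-th smallest magnitude

def ternarize(weights, threshold):
    """Map each weight to -1, 0, or +1 given a magnitude threshold."""
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]

layer = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.6]

# Sparsity-target mode: the threshold is derived from this layer's weights.
t = sparsity_threshold(layer, sparsity=0.5)
print(t, ternarize(layer, t))   # 0.3 [1, 0, 0, -1, 0, 0, -1, 1]

# Magnitude mode: an absolute cutoff is supplied directly.
print(ternarize(layer, 0.25))   # [1, 0, 1, -1, 0, 0, -1, 1]
```

In sparsity-target mode half of the eight weights become zero by construction; in magnitude mode the same layer keeps five nonzero weights because only three magnitudes fall at or below 0.25.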

Straight-Through Estimator (STE). During quantization-aware training, the forward pass applies the ternary mapping, while the backward pass propagates gradients through the quantizer unchanged, as if it were the identity. This allows end-to-end gradient-based optimization despite the non-differentiable quantization step.
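
To make the STE idea concrete, the sketch below performs the gradient step by hand for a single dot-product "layer" with a squared-error loss. It is a dependency-free illustration, not the PyTorch implementation in atq/layers.py; the function names, the threshold of 0.5, and the learning rate are assumptions chosen for the example.

```python
def ternary(w, threshold):
    """Forward quantizer: map one latent weight to -1, 0, or +1."""
    if abs(w) <= threshold:
        return 0
    return 1 if w > 0 else -1

def ste_step(latent, x, y, threshold=0.5, lr=0.1):
    """One SGD step on loss = (q . x - y)^2, where q = ternary(latent)."""
    q = [ternary(w, threshold) for w in latent]      # forward uses ternary weights
    pred = sum(qi * xi for qi, xi in zip(q, x))
    err = pred - y
    grad_q = [2.0 * err * xi for xi in x]            # dLoss/dq
    # STE: treat dq/dlatent as 1, so grad_q updates the latent weights directly.
    new_latent = [w - lr * g for w, g in zip(latent, grad_q)]
    return new_latent, err * err

latent = [0.4, -0.8]                 # full-precision latent weights
x, y = [1.0, 1.0], 0.0
latent, loss1 = ste_step(latent, x, y)
latent, loss2 = ste_step(latent, x, y)
print(loss1, loss2)                  # 1.0 0.0
```

The first step pushes the latent weight 0.4 past the threshold, so its ternary value flips from 0 to +1 and the loss drops to zero. Without the straight-through assumption the quantizer's true derivative would be zero almost everywhere and the latent weights would never move.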

2-bit Packed Storage. Ternary values {-1, 0, +1} are encoded as {0, 1, 2} and packed four-per-byte, reducing on-disk and in-memory footprint by ~16x versus FP32.
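
A minimal sketch of the four-per-byte packing described above. The function names and the bit order within each byte (first value in the lowest two bits) are assumptions; atq/bit_packing.py may lay the bits out differently.

```python
def pack_ternary(values):
    """Pack ternary values (-1, 0, +1) into bytes, four 2-bit codes per byte."""
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("values must be -1, 0, or +1")
    codes = [v + 1 for v in values]              # {-1, 0, +1} -> {0, 1, 2}
    padded = codes + [1] * (-len(codes) % 4)     # pad with the code for 0
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_ternary(packed, count):
    """Recover the first `count` ternary values from packed bytes."""
    values = []
    for byte in packed:
        for shift in (0, 2, 4, 6):
            values.append(((byte >> shift) & 0b11) - 1)
    return values[:count]

weights = [-1, 0, 1, 1, 0, -1]
packed = pack_ternary(weights)
print(list(packed))                          # [164, 81]
print(unpack_ternary(packed, len(weights)))  # [-1, 0, 1, 1, 0, -1]
```

The original length has to be stored alongside the packed bytes, because the padding codes are indistinguishable from real zeros. The fourth 2-bit code (3) is unused in this scheme.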

Mixed Precision. An importance scorer ranks layers by gradient-weighted activation sensitivity. The top-k most sensitive layers are kept in FP16 or FP32, while the remainder are ternarized, trading aggregate compression ratio for preserved accuracy on critical computations.
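
The top-k assignment step can be sketched as follows. The scores, layer names, and the `assign_precision` helper are hypothetical; computing the gradient-weighted sensitivity scores themselves is outside the scope of this example.

```python
def assign_precision(scores, k, high="fp16", low="ternary"):
    """Keep the k highest-scoring layers at `high` precision, ternarize the rest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = set(ranked[:k])
    return {name: (high if name in keep else low) for name in scores}

# Hypothetical sensitivity scores; real values would come from the importance scorer.
scores = {"h.0.attn": 0.91, "h.0.mlp": 0.35, "h.1.attn": 0.62, "h.1.mlp": 0.12}
precision_map = assign_precision(scores, k=2)
for name, precision in precision_map.items():
    print(name, precision)
```

With k=2 the two attention layers stay at FP16 and both MLP layers are ternarized. Raising k preserves more layers at higher precision and lowers the aggregate compression ratio accordingly.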

Calibration. A small calibration dataset (e.g., 128 samples from WikiText-2) can be used to tune each layer's threshold before inference by minimizing that layer's reconstruction error on the calibration inputs.
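
One simple way to realize calibration-driven threshold tuning is a grid search over candidate thresholds, scoring each by the output error on calibration inputs. The sketch below does this for a single weight row. It is an assumption-laden illustration rather than the method in atq/calibration.py: the function names, the candidate grid, and the per-row scale (mean magnitude of the surviving weights) are all choices made for the example.

```python
def ternarize(row, t):
    """Map each weight in a row to -1, 0, or +1 given threshold t."""
    return [0 if abs(w) <= t else (1 if w > 0 else -1) for w in row]

def output_error(row, t, inputs):
    """Squared error between full-precision and scaled-ternary outputs."""
    q = ternarize(row, t)
    kept = [abs(w) for w, qi in zip(row, q) if qi != 0]
    scale = sum(kept) / len(kept) if kept else 0.0   # assumed per-row scale
    err = 0.0
    for x in inputs:
        full = sum(w * xi for w, xi in zip(row, x))
        quant = scale * sum(qi * xi for qi, xi in zip(q, x))
        err += (full - quant) ** 2
    return err

def calibrate_threshold(row, inputs, candidates):
    """Return the candidate threshold with the lowest calibration error."""
    return min(candidates, key=lambda t: output_error(row, t, inputs))

row = [0.9, -0.1, 0.5, -0.6]                      # one row of a weight matrix
calib = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]  # toy calibration inputs
best = calibrate_threshold(row, calib, candidates=[0.0, 0.3, 0.55])
print(best)  # 0.3
```

Here the middle candidate wins: a threshold of 0.0 keeps the small weight -0.1 and inflates its contribution, while 0.55 prunes 0.5 and loses too much signal. The search is local to one row, so it makes no claim about optimality across layers.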

Related Work

GPTQ (Frantar et al., 2022): post-training weight quantization using approximate second-order (Hessian-based) weight updates; complementary to ATQ's threshold-based approach.
AWQ (Lin et al., 2023): activation-aware weight quantization that scales weights before quantization; ATQ instead adapts thresholds per layer without requiring activation statistics at quantization time.
ATQ-Multimodal: a related ternary quantization effort extending these ideas to vision-language models: github.com/ak736/ATQ-Multimodal.

Citation

@inproceedings{atq2025,
  title={Adaptive Ternary Quantization for On-Device LLM Compression},
  author={Swaroop, Aditya and Kumar, Akshat},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025},
  note={Under Review}
}

License

This project is licensed under the MIT License. See LICENSE for details.
