ChaosEngine: Spectral Triage KV Cache Compression

PCA-based decorrelation + channel truncation + hybrid quantization + layer-adaptive precision for LLM inference.

ChaosEngine compresses the KV cache of large language models by 3.7x with 0.034 average attention output error, saving up to 13 GB of VRAM at 128K context length on an 8B model.

Paper: paper/ChaosEngine_Technical_Report.pdf

Key Results

| Metric | FP16 (baseline) | ChaosEngine |
|---|---|---|
| Compression | 1.0x | 3.7x |
| KV cache @ 128K ctx | 18.0 GB | 4.9 GB |
| Avg attention error | 0.0 | 0.034 |
| Easy layer error | 0.0 | 0.013 |
| PCA decorrelation | -- | 100% |
| Adaptive precision | No | Yes (4 tiers) |
| Calibration time | -- | 32s |

Validated on RTX 4090 (CUDA) and M4 Max (MPS) with identical results. Cross-model tested on Qwen3-8B (3.7x) and Mistral 7B (2.6x).

Novel Findings

  1. PCA achieves 100% decorrelation where Givens rotation achieves only 3% -- full covariance capture vs pairwise
  2. Key-only rotation outperforms rotating both K and V -- values have different correlation structure
  3. Bottom PCA channels can be truncated to zero on easy layers with no quality loss -- 62.5% key bit savings for free
  4. Value quantization dominates error on hard layers (75% of total) -- motivates PCA-hybrid value treatment
  5. Group-wise quantization after PCA is 3-5x worse than per-channel -- PCA sorts by variance, breaking group assumptions
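Finding 1 follows from a basic property of eigendecomposition: rotating data into the eigenbasis of its full covariance matrix diagonalizes that covariance in one shot, whereas a single Givens rotation zeros only one off-diagonal pair at a time. A minimal NumPy sketch of the PCA side (illustrative only, not the repo's implementation; the synthetic data stands in for real key activations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "key" activations with deliberately correlated channels.
head_dim = 8
mix = rng.normal(size=(head_dim, head_dim))
keys = rng.normal(size=(1024, head_dim)) @ mix  # channels are now correlated

# PCA rotation: eigenvectors of the full channel covariance matrix.
cov = np.cov(keys, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
rotated = keys @ eigvecs

# In the eigenbasis the covariance is diagonal: all off-diagonal
# correlation is removed, up to floating-point noise.
cov_rot = np.cov(rotated, rowvar=False)
off_diag = cov_rot - np.diag(np.diag(cov_rot))
print(np.abs(off_diag).max())  # ~0 (numerical noise)
```

Because `eigh` sorts eigenvalues in ascending order, the last columns of `eigvecs` carry the most variance, which is what makes the top/bottom channel splits in the tier table below meaningful.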

How It Works

ChaosEngine uses a four-tier layer-adaptive system based on per-layer quantization sensitivity:

| Tier | Layers | Keys | Values | Avg Bits |
|---|---|---|---|---|
| easy | L1-L12, L14, L19-L20 | PCA top48@K4 + truncate | Uniform V4 | 2.75 |
| mid | L0, L13, L15, L17-L18, L21 | PCA K4/V4 | Uniform V4 | 4.00 |
| mhard | L16, L22 | PCA top80@K8 + bot48@K4 | Uniform V4 | 5.25 |
| vhard | L23-L35 | PCA top96@K8 + bot32@K4 | PCA top48@V8 + bot80@V4 | 6.25 |

Layer assignments shown for Qwen3-8B; tiers are profiled automatically per model.
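The "Avg Bits" column can be reproduced from the channel/bit specs in the table, assuming a 128-channel head dimension (Qwen3-8B) and averaging key and value bits with equal weight; truncated channels cost 0 bits. A quick arithmetic sketch:

```python
# Reproduce the Avg Bits column from the tier specs, assuming head_dim = 128
# and a simple mean of key and value bits per channel.
HEAD_DIM = 128

def avg_bits(key_spec, value_spec):
    """Each spec is a list of (num_channels, bits); truncated channels are omitted."""
    key_bits = sum(n * b for n, b in key_spec) / HEAD_DIM
    value_bits = sum(n * b for n, b in value_spec) / HEAD_DIM
    return (key_bits + value_bits) / 2

tiers = {
    "easy":  ([(48, 4)],          [(128, 4)]),          # top48@K4 + truncate, V4
    "mid":   ([(128, 4)],         [(128, 4)]),          # K4, V4
    "mhard": ([(80, 8), (48, 4)], [(128, 4)]),          # top80@K8 + bot48@K4, V4
    "vhard": ([(96, 8), (32, 4)], [(48, 8), (80, 4)]),  # mixed keys and values
}

for name, (k, v) in tiers.items():
    print(name, avg_bits(k, v))
```

For the easy tier, for example: keys average 48 x 4 / 128 = 1.5 bits, values 4 bits, giving (1.5 + 4) / 2 = 2.75, which is the 62.5% key bit saving cited in finding 3.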

Quick Start

# Clone
git clone https://github.com/cryptopoly/ChaosEngine.git
cd ChaosEngine

# Create virtual environment
python -m venv .venv
source .venv/bin/activate        # macOS/Linux
# .venv\Scripts\Activate.ps1     # Windows PowerShell

# Install PyTorch (pick one)
pip install torch                                              # CPU / MPS (macOS)
pip install torch --index-url https://download.pytorch.org/whl/cu124  # CUDA 12.4

# Install ChaosEngine + dependencies
pip install -e ".[dev]"
pip install transformers accelerate safetensors

# Run tests
python -m pytest tests/ -v

# Run benchmark (auto-detects CUDA/MPS/CPU)
python benchmarks/bench_4090.py

# Run on a different model
python benchmarks/bench_4090.py --model mistralai/Mistral-7B-v0.3

Requirements

  • Python 3.10-3.12 (3.13+ not yet supported by PyTorch CUDA)
  • PyTorch 2.2+
  • 16+ GB RAM (for 8B models)
  • CUDA GPU recommended (MPS and CPU also supported)

Project Structure

chaos_engine/
  config.py                  # Tier definitions, hyperparameters
  calibration/               # PCA center computation, sensitivity profiling
  scoring/                   # Trigonometric importance, friendliness, triage
  quantization/
    pca_rotation.py          # PCA decorrelation (100% off-diagonal removal)
    whitening.py             # PCA whitening + norm factoring + importance-weighted quant
    scalar_quantize.py       # 8/4/2-bit asymmetric per-channel quantization
    pack_unpack.py           # Bit packing for 4-bit and 2-bit storage
    givens_rotation.py       # Givens rotation (baseline comparison)
  kernels/                   # Triton fused attention kernels (Linux/CUDA)
  cache/                     # Mixed-precision KV cache data structures
  integration/               # HuggingFace transformers + vLLM integration
benchmarks/
  bench_4090.py              # Main benchmark (works on CUDA, MPS, CPU)
  bench_m4_v2.py             # M4 Max specific benchmark
paper/
  ChaosEngine_Technical_Report.pdf
  generate_paper.py          # Regenerate the PDF
tests/                       # 66 unit + integration tests
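The structure above names scalar_quantize.py as 8/4/2-bit asymmetric per-channel quantization. A minimal NumPy sketch of that general scheme (illustrative only, not the repo's code): each channel gets its own scale and zero-point, so one high-variance channel, as PCA-sorted channels tend to be, cannot inflate the quantization step of the others, which is the point of finding 5.

```python
import numpy as np

def quantize_per_channel(x, bits):
    """Asymmetric per-channel quantization: per-column scale and zero-point."""
    qmax = (1 << bits) - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.round((x - lo) / scale).clip(0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

# Channels with very different spreads, as variance-sorted PCA channels have.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128)) * rng.uniform(0.1, 10.0, size=(1, 128))

errs = {}
for bits in (8, 4, 2):
    q, scale, lo = quantize_per_channel(x, bits)
    errs[bits] = np.abs(dequantize(q, scale, lo) - x).max()
    print(f"{bits}-bit max abs error: {errs[bits]:.4f}")
```

Rounding bounds the per-element error by half a quantization step of that element's own channel, regardless of how wide the other channels are.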

Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -v

Memory Savings (Qwen3-8B at 3.7x)

| Context Length | FP16 KV Cache | ChaosEngine | Saved |
|---|---|---|---|
| 4,096 | 0.6 GB | 0.2 GB | 0.4 GB |
| 8,192 | 1.1 GB | 0.3 GB | 0.8 GB |
| 32,768 | 4.5 GB | 1.2 GB | 3.3 GB |
| 65,536 | 9.0 GB | 2.5 GB | 6.5 GB |
| 131,072 | 18.0 GB | 4.9 GB | 13.1 GB |
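The FP16 column is a straightforward back-of-envelope calculation. Assuming Qwen3-8B's published configuration (36 layers, 8 KV heads under GQA, head dimension 128, 2 bytes per FP16 element, and a factor of 2 for keys plus values), and reading the table's "GB" as binary GiB:

```python
# FP16 KV cache size for an assumed Qwen3-8B config:
# 36 layers, 8 KV heads (GQA), head_dim 128, 2 bytes/element, K + V.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 36, 8, 128, 2

def kv_cache_gib(ctx_len):
    elems = 2 * LAYERS * KV_HEADS * HEAD_DIM * ctx_len  # 2 = keys + values
    return elems * BYTES / 2**30

for ctx in (4096, 8192, 32768, 65536, 131072):
    fp16 = kv_cache_gib(ctx)
    print(f"{ctx:>7}: {fp16:.1f} GiB fp16, {fp16 / 3.7:.1f} GiB at 3.7x")
```

This reproduces the FP16 column exactly (e.g. 18.0 GiB at 131,072 tokens); dividing by the headline 3.7x recovers the ChaosEngine column to within ~0.1 GB of rounding, since the effective ratio varies slightly with context length.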

License

Apache 2.0

Citation

@techreport{chaosengine2026,
  title={ChaosEngine: Spectral Triage KV Cache Compression via PCA Truncation and Layer-Adaptive Hybrid Quantization},
  year={2026},
  url={https://github.com/cryptopoly/ChaosEngine}
}
