ChaosEngine: Spectral Triage KV Cache Compression

PCA-based decorrelation + channel truncation + hybrid quantization + layer-adaptive precision for LLM inference.

ChaosEngine compresses the KV cache of large language models by 3.7x with 0.034 average attention output error, saving up to 13 GB of VRAM at 128K context length on an 8B model.

Paper: paper/ChaosEngine_Technical_Report.pdf

Key Results

| Metric | FP16 (baseline) | ChaosEngine |
|---|---|---|
| Compression | 1.0x | 3.7x |
| KV cache @ 128K ctx | 18.0 GB | 4.9 GB |
| Avg attention error | 0.0 | 0.034 |
| Easy layer error | 0.0 | 0.013 |
| PCA decorrelation | -- | 100% |
| Adaptive precision | No | Yes (4 tiers) |
| Calibration time | -- | 32s |

Validated on RTX 4090 (CUDA) and M4 Max (MPS) with identical results. Cross-model tested on Qwen3-8B (3.7x) and Mistral 7B (2.6x).

Novel Findings

  1. PCA achieves 100% decorrelation where Givens rotation achieves only 3% -- full covariance capture vs pairwise
  2. Key-only rotation outperforms rotating both K and V -- values have different correlation structure
  3. Bottom PCA channels can be truncated to zero on easy layers with no quality loss -- 62.5% key bit savings for free
  4. Value quantization dominates error on hard layers (75% of total) -- motivates PCA-hybrid value treatment
  5. Group-wise quantization after PCA is 3-5x worse than per-channel -- PCA sorts by variance, breaking group assumptions
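Finding 1 follows from a basic property of eigendecomposition: rotating data into the eigenbasis of its full covariance matrix diagonalizes that covariance in one shot, whereas a single Givens rotation zeros only one off-diagonal pair at a time. A minimal NumPy sketch of the PCA side (illustrative only, not the repo's implementation; the synthetic data stands in for real key activations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "key" activations with deliberately correlated channels.
head_dim = 8
mix = rng.normal(size=(head_dim, head_dim))
keys = rng.normal(size=(1024, head_dim)) @ mix  # channels are now correlated

# PCA rotation: eigenvectors of the full channel covariance matrix.
cov = np.cov(keys, rowvar=False)
_, eigvecs = np.linalg.eigh(cov)
rotated = keys @ eigvecs

# In the eigenbasis the covariance is diagonal: all off-diagonal
# correlation is removed, up to floating-point noise.
cov_rot = np.cov(rotated, rowvar=False)
off_diag = cov_rot - np.diag(np.diag(cov_rot))
print(np.abs(off_diag).max())  # ~0 (numerical noise)
```

Because `eigh` sorts eigenvalues in ascending order, the last columns of `eigvecs` carry the most variance, which is what makes the top/bottom channel splits in the tier table below meaningful.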

How It Works

ChaosEngine uses a four-tier layer-adaptive system based on per-layer quantization sensitivity:

| Tier | Layers | Keys | Values | Avg Bits |
|---|---|---|---|---|
| easy | L1-L12, L14, L19-L20 | PCA top48@K4 + truncate | Uniform V4 | 2.75 |
| mid | L0, L13, L15, L17-L18, L21 | PCA K4/V4 | Uniform V4 | 4.00 |
| mhard | L16, L22 | PCA top80@K8 + bot48@K4 | Uniform V4 | 5.25 |
| vhard | L23-L35 | PCA top96@K8 + bot32@K4 | PCA top48@V8 + bot80@V4 | 6.25 |

Layer assignments shown for Qwen3-8B; tiers are profiled automatically per model.
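The "Avg Bits" column can be reproduced from the channel/bit specs in the table, assuming a 128-channel head dimension (Qwen3-8B) and averaging key and value bits with equal weight; truncated channels cost 0 bits. A quick arithmetic sketch:

```python
# Reproduce the Avg Bits column from the tier specs, assuming head_dim = 128
# and a simple mean of key and value bits per channel.
HEAD_DIM = 128

def avg_bits(key_spec, value_spec):
    """Each spec is a list of (num_channels, bits); truncated channels are omitted."""
    key_bits = sum(n * b for n, b in key_spec) / HEAD_DIM
    value_bits = sum(n * b for n, b in value_spec) / HEAD_DIM
    return (key_bits + value_bits) / 2

tiers = {
    "easy":  ([(48, 4)],          [(128, 4)]),          # top48@K4 + truncate, V4
    "mid":   ([(128, 4)],         [(128, 4)]),          # K4, V4
    "mhard": ([(80, 8), (48, 4)], [(128, 4)]),          # top80@K8 + bot48@K4, V4
    "vhard": ([(96, 8), (32, 4)], [(48, 8), (80, 4)]),  # mixed keys and values
}

for name, (k, v) in tiers.items():
    print(name, avg_bits(k, v))
```

For the easy tier, for example: keys average 48 x 4 / 128 = 1.5 bits, values 4 bits, giving (1.5 + 4) / 2 = 2.75, which is the 62.5% key bit saving cited in finding 3.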

Quick Start

# Clone
git clone https://github.com/cryptopoly/ChaosEngine.git
cd ChaosEngine

# Create virtual environment
python -m venv .venv
source .venv/bin/activate        # macOS/Linux
# .venv\Scripts\Activate.ps1     # Windows PowerShell

# Install PyTorch (pick one)
pip install torch                                              # CPU / MPS (macOS)
pip install torch --index-url https://download.pytorch.org/whl/cu124  # CUDA 12.4

# Install ChaosEngine + dependencies
pip install -e ".[dev]"
pip install transformers accelerate safetensors

# Run tests
python -m pytest tests/ -v

# Run benchmark (auto-detects CUDA/MPS/CPU)
python benchmarks/bench_4090.py

# Run on a different model
python benchmarks/bench_4090.py --model mistralai/Mistral-7B-v0.3

Requirements

  • Python 3.10-3.12 (3.13+ not yet supported by PyTorch CUDA)
  • PyTorch 2.2+
  • 16+ GB RAM (for 8B models)
  • CUDA GPU recommended (MPS and CPU also supported)

Project Structure

chaos_engine/
  config.py                  # Tier definitions, hyperparameters
  calibration/               # PCA center computation, sensitivity profiling
  scoring/                   # Trigonometric importance, friendliness, triage
  quantization/
    pca_rotation.py          # PCA decorrelation (100% off-diagonal removal)
    whitening.py             # PCA whitening + norm factoring + importance-weighted quant
    scalar_quantize.py       # 8/4/2-bit asymmetric per-channel quantization
    pack_unpack.py           # Bit packing for 4-bit and 2-bit storage
    givens_rotation.py       # Givens rotation (baseline comparison)
  kernels/                   # Triton fused attention kernels (Linux/CUDA)
  cache/                     # Mixed-precision KV cache data structures
  integration/               # HuggingFace transformers + vLLM integration
benchmarks/
  bench_4090.py              # Main benchmark (works on CUDA, MPS, CPU)
  bench_m4_v2.py             # M4 Max specific benchmark
paper/
  ChaosEngine_Technical_Report.pdf
  generate_paper.py          # Regenerate the PDF
tests/                       # 66 unit + integration tests
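The structure above names scalar_quantize.py as 8/4/2-bit asymmetric per-channel quantization. A minimal NumPy sketch of that general scheme (illustrative only, not the repo's code): each channel gets its own scale and zero-point, so one high-variance channel, as PCA-sorted channels tend to be, cannot inflate the quantization step of the others, which is the point of finding 5.

```python
import numpy as np

def quantize_per_channel(x, bits):
    """Asymmetric per-channel quantization: per-column scale and zero-point."""
    qmax = (1 << bits) - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.round((x - lo) / scale).clip(0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

# Channels with very different spreads, as variance-sorted PCA channels have.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128)) * rng.uniform(0.1, 10.0, size=(1, 128))

errs = {}
for bits in (8, 4, 2):
    q, scale, lo = quantize_per_channel(x, bits)
    errs[bits] = np.abs(dequantize(q, scale, lo) - x).max()
    print(f"{bits}-bit max abs error: {errs[bits]:.4f}")
```

Rounding bounds the per-element error by half a quantization step of that element's own channel, regardless of how wide the other channels are.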

Running Tests

pip install -e ".[dev]"
python -m pytest tests/ -v

Memory Savings (Qwen3-8B at 3.7x)

| Context Length | FP16 KV Cache | ChaosEngine | Saved |
|---|---|---|---|
| 4,096 | 0.6 GB | 0.2 GB | 0.4 GB |
| 8,192 | 1.1 GB | 0.3 GB | 0.8 GB |
| 32,768 | 4.5 GB | 1.2 GB | 3.3 GB |
| 65,536 | 9.0 GB | 2.5 GB | 6.5 GB |
| 131,072 | 18.0 GB | 4.9 GB | 13.1 GB |
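The FP16 column is a straightforward back-of-envelope calculation. Assuming Qwen3-8B's published configuration (36 layers, 8 KV heads under GQA, head dimension 128, 2 bytes per FP16 element, and a factor of 2 for keys plus values), and reading the table's "GB" as binary GiB:

```python
# FP16 KV cache size for an assumed Qwen3-8B config:
# 36 layers, 8 KV heads (GQA), head_dim 128, 2 bytes/element, K + V.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 36, 8, 128, 2

def kv_cache_gib(ctx_len):
    elems = 2 * LAYERS * KV_HEADS * HEAD_DIM * ctx_len  # 2 = keys + values
    return elems * BYTES / 2**30

for ctx in (4096, 8192, 32768, 65536, 131072):
    fp16 = kv_cache_gib(ctx)
    print(f"{ctx:>7}: {fp16:.1f} GiB fp16, {fp16 / 3.7:.1f} GiB at 3.7x")
```

This reproduces the FP16 column exactly (e.g. 18.0 GiB at 131,072 tokens); dividing by the headline 3.7x recovers the ChaosEngine column to within ~0.1 GB of rounding, since the effective ratio varies slightly with context length.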

License

Apache 2.0

Citation

@techreport{chaosengine2026,
  title={ChaosEngine: Spectral Triage KV Cache Compression via PCA Truncation and Layer-Adaptive Hybrid Quantization},
  year={2026},
  url={https://github.com/cryptopoly/ChaosEngine}
}
