UIC-InDeXLab/Louver

Louver: A Halfspace Range-Searching Index for KV Cache

Louver is a sparse attention system for the KV cache that reduces token selection to halfspace range searching, which gives a provable zero-false-negative guarantee. It supports GPU-only, CPU-only, and CPU-offloaded KV caches, with custom fused kernels for both GPU and CPU.
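
To make the idea concrete, here is a minimal pure-Python sketch of token selection as halfspace range searching: given a query q and a threshold tau, return every key k with q·k >= tau. It is an illustration only, not Louver's implementation. The grouping into fixed-size contiguous blocks, the enclosing-ball bound, and the names build_groups and halfspace_search are all assumptions made for this example. A group is skipped only when an upper bound on its scores falls below tau, so no qualifying key can be missed, which is the sense of "zero false negatives" used here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_groups(keys, group_size):
    """Split keys into contiguous groups and give each an enclosing ball.

    Hypothetical index layout for illustration: (center, radius, key ids).
    """
    dim = len(keys[0])
    groups = []
    for start in range(0, len(keys), group_size):
        ids = list(range(start, min(start + group_size, len(keys))))
        center = [sum(keys[i][d] for i in ids) / len(ids) for d in range(dim)]
        radius = max(math.dist(keys[i], center) for i in ids)
        groups.append((center, radius, ids))
    return groups

def halfspace_search(query, keys, groups, tau, eps=1e-9):
    """Return (ids of all keys with dot(query, key) >= tau, keys scanned)."""
    q_norm = math.sqrt(dot(query, query))
    selected, scanned = [], 0
    for center, radius, ids in groups:
        # Cauchy-Schwarz: any key k inside the ball has q.k <= q.c + |q| * r.
        # eps is a small slack against floating-point rounding.
        upper_bound = dot(query, center) + q_norm * radius
        if upper_bound + eps < tau:
            continue  # no key in this group can reach tau, safe to prune
        for i in ids:
            scanned += 1
            if dot(query, keys[i]) >= tau:
                selected.append(i)
    return selected, scanned
```

The result always equals a brute-force scan over all keys. The index only reduces how many keys are scored, and the saving depends on how tightly the groups enclose their keys.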

Installation

cd hira
pip install -e .

Project Structure

hira/
├── cache/              # KV cache wrapper (HiraCache, HiraConfig)
├── indexer/            # Index construction: CPU and CUDA backends
├── searcher/           # Halfspace range search: CPU and CUDA backends
├── attention/          # Sparse attention kernels (v1/v2) and baselines
├── threshold/          # Threshold oracle algorithms (sample-max, sample-gap, budget)
├── kernels/            # Low-level CUDA/C++ kernel sources
├── tests/              # Unit and integration tests (pytest tests/ -vv)
└── benchmark_area/     # All paper experiments (see below)

benchmark_area

All paper experiments live here. The core Louver implementation used in experiments is in benchmark_area/kernel_impl/.

benchmark_area/
├── kernel_impl/        # Main Louver GPU+CPU kernel implementation used in experiments
├── experiments/        # Paper experiment scripts and results
│   ├── accuracy/       # LongBench v1 and RULER accuracy vs. baselines (10% KV budget)
│   ├── latency/        # Per-step attention latency vs. context length (up to 40k tokens)
│   │   ├── gpu_bench.py          # GPU latency benchmark (RTX 5090)
│   │   ├── cpu_bench.py          # CPU latency benchmark (Threadripper 7970X)
│   │   ├── captures/             # Saved QKV tensors (.pt) for offline benchmarking
│   │   ├── reports/              # CSV results per model
│   │   └── plots/latency_plot.py # 2×2 grid latency figure
│   ├── recall/         # Recall@k vs. ANN and sparse-attention baselines
│   │   ├── recall_bench.py       # Main recall benchmark
│   │   ├── reports/              # Per-model recall CSVs
│   │   └── recall_plot.py        # Recall figure (3-panel)
│   ├── offload/        # KV offloading: Louver vs. HNSW/IVF/LSH on LongBench
│   │   ├── results/              # Per-method JSON summaries + task breakdowns
│   │   └── run.sh
│   ├── pruning/        # Index design ablation: grouping strategy × enclosing × S
│   │   └── results/              # CSVs: scanned fraction, speedup, recall per config
│   ├── threshold_oracle/         # Threshold oracle ablation (sample-max, sample-gap, budget)
│   │   ├── bench.py              # Offline oracle benchmark using QKV captures
│   │   ├── run_online.py         # Online oracle benchmark (hooks into live model)
│   │   └── threshold_oracle/results/  # Timeseries CSVs and figures
│   ├── false_negativity/         # Effect of false negatives on LLM accuracy
│   ├── score_distribution/       # Attention score concentration analysis
│   └── experiments.md            # Experiment plan and workflow notes
├── baselines/          # Baseline implementations (H2O, StreamingLLM, Quest, Twilight)
├── cuda_bench/         # CUDA microbenchmarks for kernel development
├── cpu_bench/          # CPU microbenchmarks
├── indexes_benchmark/  # ANN index benchmarks (HNSW, IVF, PQ, LSH)
└── pruning_v2/         # Pruning power experiments (v2)
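
For readers unfamiliar with the metric in experiments/recall/, the following is a small sketch of how Recall@k is commonly computed for token selection: the fraction of the k highest-scoring tokens that a method actually retrieved. The function name recall_at_k and its signature are hypothetical, and recall_bench.py may define or compute the metric differently.

```python
def recall_at_k(scores, retrieved_ids, k):
    """Fraction of the k highest-scoring tokens that appear in retrieved_ids.

    scores: exact attention scores, one per token (the ground truth).
    retrieved_ids: token indices returned by the method under test.
    """
    if k <= 0 or not scores:
        raise ValueError("k must be positive and scores must be non-empty")
    k = min(k, len(scores))
    # Ground-truth top-k by exact score; ties keep the earlier token first.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = len(set(top_k) & set(retrieved_ids))
    return hits / k
```

Under this definition, a method that returns every token whose score clears the k-th largest score reaches a recall of 1.0, while an approximate index that misses some of those tokens scores below 1.0.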

Running experiments

Each experiment directory has a run.sh. Accuracy and offloading experiments run on SLURM (Delta cluster, NCSA). Latency and recall experiments run locally on the RTX 5090 workstation.

# Accuracy (submit to SLURM on Delta)
cd benchmark_area/experiments/accuracy && sbatch run.sh

# Latency (local, requires QKV captures)
cd benchmark_area/experiments/latency && python gpu_bench.py && python cpu_bench.py

# Recall (local, requires QKV captures)
cd benchmark_area/experiments/recall && bash run.sh

# Offloading (Delta)
cd benchmark_area/experiments/offload && sbatch run.sh

QKV captures (.pt files, ~400–600 MB each) are generated by latency/capture_all.sh and shared across latency, recall, and threshold oracle experiments.
