Louver is a KV cache sparse attention system that reduces token selection to halfspace range searching, giving a provable zero-false-negative guarantee. It supports GPU-only, CPU-only, and CPU-offloaded KV caches, with custom fused kernels for both the GPU and CPU paths.
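Concretely, the selection predicate being accelerated is a halfspace test: for a query q and score threshold τ, token i is relevant iff ⟨q, k_i⟩ ≥ τ, i.e. its key lies in a halfspace of key space. Zero false negatives means an index may over-select but never drops a token from this set. A brute-force reference sketch (hypothetical helper, not the library API):

```python
# Minimal sketch of the selection predicate Louver accelerates, not the
# library's API: token i is kept iff its key lies in the halfspace
# {k : <q, k> >= tau}. An index with zero false negatives must return a
# superset of this set; the exact scan below is the reference.
import torch

def halfspace_select(q: torch.Tensor, keys: torch.Tensor, tau: float) -> torch.Tensor:
    """Return indices of all keys with <q, k_i> >= tau (exact, no misses)."""
    scores = keys @ q                              # [n_tokens] dot products
    return (scores >= tau).nonzero(as_tuple=True)[0]
```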
cd hira
pip install -e .
hira/
├── cache/ # KV cache wrapper (HiraCache, HiraConfig)
├── indexer/ # Index construction: CPU and CUDA backends
├── searcher/ # Halfspace range search: CPU and CUDA backends
├── attention/ # Sparse attention kernels (v1/v2) and baselines
├── threshold/ # Threshold oracle algorithms (sample-max, sample-gap, budget; sketch below)
├── kernels/ # Low-level CUDA/C++ kernel sources
├── tests/ # Unit and integration tests (pytest tests/ -vv)
└── benchmark_area/ # All paper experiments (see below)
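The threshold oracles in threshold/ set τ from a small sample of keys rather than a full scan. A minimal sketch of the sample-max idea, inferred from the name and not the shipped implementation: score S randomly sampled keys and derive τ from the sample maximum, with a relaxation factor trading extra scanned tokens for recall.

```python
# Hypothetical sketch of a "sample-max"-style threshold oracle. This is an
# assumption based on the algorithm's name, not the code in threshold/.
import torch

def sample_max_threshold(q: torch.Tensor, keys: torch.Tensor,
                         S: int = 64, alpha: float = 0.9) -> float:
    """Estimate tau from S randomly sampled keys; alpha < 1 loosens it."""
    idx = torch.randint(keys.shape[0], (S,))   # uniform sample of key rows
    sample_scores = keys[idx] @ q              # [S] dot products
    return alpha * sample_scores.max().item()
```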
All paper experiments live here. The core Louver implementation used in experiments is in benchmark_area/kernel_impl/.
benchmark_area/
├── kernel_impl/ # Main Louver GPU+CPU kernel implementation used in experiments
├── experiments/ # Paper experiment scripts and results
│ ├── accuracy/ # LongBench v1 and RULER accuracy vs. baselines (10% KV budget)
│ ├── latency/ # Per-step attention latency vs. context length (up to 40k tokens)
│ │ ├── gpu_bench.py # GPU latency benchmark (RTX 5090)
│ │ ├── cpu_bench.py # CPU latency benchmark (Threadripper 7970X)
│ │ ├── captures/ # Saved QKV tensors (.pt) for offline benchmarking
│ │ ├── reports/ # CSV results per model
│ │ └── plots/latency_plot.py # 2×2 grid latency figure
│ ├── recall/ # Recall@k vs. ANN and sparse-attention baselines (metric sketch below)
│ │ ├── recall_bench.py # Main recall benchmark
│ │ ├── reports/ # Per-model recall CSVs
│ │ └── recall_plot.py # Recall figure (3-panel)
│ ├── offload/ # KV offloading: Louver vs. HNSW/IVF/LSH on LongBench
│ │ ├── results/ # Per-method JSON summaries + task breakdowns
│ │ └── run.sh
│ ├── pruning/ # Index design ablation: grouping strategy × enclosing × S
│ │ └── results/ # CSVs: scanned fraction, speedup, recall per config
│ ├── threshold_oracle/ # Threshold oracle ablation (sample-max, gap, budget)
│ │ ├── bench.py # Offline oracle benchmark using QKV captures
│ │ ├── run_online.py # Online oracle benchmark (hooks into live model)
│ │ └── results/ # Time-series CSVs and figures
│ ├── false_negativity/ # Effect of false negatives on LLM accuracy
│ ├── score_distribution/ # Attention score concentration analysis
│ └── experiments.md # Experiment plan and workflow notes
├── baselines/ # Baseline implementations (H2O, StreamingLLM, Quest, Twilight)
├── cuda_bench/ # CUDA microbenchmarks for kernel development
├── cpu_bench/ # CPU microbenchmarks
├── indexes_benchmark/ # ANN index benchmarks (HNSW, IVF, PQ, LSH)
└── pruning_v2/ # Pruning power experiments (v2)
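The recall experiments report Recall@k: the fraction of the exact top-k attention tokens that a selector returns. A minimal reference for the metric (hypothetical helper, not recall_bench.py itself):

```python
# Hypothetical reference for the Recall@k metric used to compare token
# selectors against exact attention -- not the benchmark's own code.
import torch

def recall_at_k(q: torch.Tensor, keys: torch.Tensor,
                selected: torch.Tensor, k: int) -> float:
    """Fraction of the exact top-k scoring tokens present in `selected`."""
    true_topk = (keys @ q).topk(k).indices    # exact top-k by dot product
    hit = torch.isin(true_topk, selected)     # which of them were selected
    return hit.float().mean().item()
```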
Each experiment directory has a run.sh. Accuracy and offloading experiments run on SLURM (Delta cluster, NCSA). Latency and recall experiments run locally on the RTX 5090 workstation.
# Accuracy (submit to SLURM on Delta)
cd benchmark_area/experiments/accuracy && sbatch run.sh
# Latency (local, requires QKV captures)
cd benchmark_area/experiments/latency && python gpu_bench.py && python cpu_bench.py
# Recall (local, requires QKV captures)
cd benchmark_area/experiments/recall && bash run.sh
# Offloading (Delta)
cd benchmark_area/experiments/offload && sbatch run.sh

QKV captures (.pt files, ~400–600 MB each) are generated by latency/capture_all.sh and shared across the latency, recall, and threshold oracle experiments.
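The captures are standard torch serialization, so the offline benchmarks can load them directly. A sketch under the assumption that each .pt file holds a dict of q/k/v tensors; the actual layout (and the filename used here) is defined by capture_all.sh:

```python
# Minimal sketch of loading a QKV capture for offline benchmarking.
# The dict keys ("q", "k", "v") and the filename are assumptions; the real
# layout is whatever latency/capture_all.sh writes.
import torch

capture = torch.load("captures/example_capture.pt", map_location="cpu")
q, k, v = capture["q"], capture["k"], capture["v"]
print(q.shape, k.shape, v.shape)
```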
