From b064e00706d8505855f3277909f1194575b0027f Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 09:49:04 -0400 Subject: [PATCH 1/4] review Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 186 ++++++++++++++++++++---------- 1 file changed, 128 insertions(+), 58 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 1338e83..f19391d 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -9,14 +9,15 @@ We propose modifying the [TurboQuant vector quantization encoding][current-impl] to use uniform blocks of a tunable power-of-2 size B (to be experimentally determined; expected range 64-256) with per-block SORF rotations for the MSE -stage. The QJL correction uses per-block Gaussian projection matrices as -prescribed by [1], not SORF — this eliminates the need for power-of-2 padding in -the QJL stage. The lowest-level TurboQuant array operates only on unit-norm -vectors, with norms externalized. In a second stage, a PDX-style -dimension-major code layout within groups of 64 vectors enables SIMD-friendly -vectorized distance computation. The TQ block size B (for quantization quality) -and the PDX vector group size (always 64, for SIMD utilization) are independent -parameters. +stage. The QJL correction strategy is an open design question with four options +to be experimentally compared: per-block Gaussian (paper-correct), per-block +SORF (fast but approximate), full-dimension padded SORF (current approach, +better convergence but wastes sign bits), or MSE-only (no QJL). The +lowest-level TurboQuant array operates only on unit-norm vectors, with norms +externalized. In a second stage, a PDX-style dimension-major code layout within +groups of 64 vectors enables SIMD-friendly vectorized distance computation. The +TQ block size B (for quantization quality) and the PDX vector group size (always +64, for SIMD utilization) are independent parameters. 
[current-impl]: https://github.com/vortex-data/vortex/pull/7167 @@ -136,8 +137,14 @@ The paper's MSE bound (Theorem 1 in [1]) is: E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ``` -This bound is dimension-independent — it depends only on the bit width `b`. For -a vector split into blocks with per-block normalization (assuming ‖xₖ‖ > 0): +This bound is dimension-independent — it depends only on the bit width `b`. +**Crucially, Theorem 1 is proved for true random orthogonal matrices.** Our use +of SORF is an approximation; the bound holds exactly only if the per-block +rotation is a true random orthogonal matrix (or if the SORF approximation is +validated empirically — see Experimental plan). + +Assuming the per-block MSE bound holds, for a vector split into blocks with +per-block normalization (‖xₖ‖ > 0): ``` ‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) @@ -231,12 +238,18 @@ computation. **QJL strategy options** (to be experimentally compared): -| Strategy | Theoretical | Variance | Padding waste | Storage | Speed (per block) | -| -------------------- | ----------------- | -------------- | ------------- | --------------- | ----------------- | -| Per-block Gaussian | Correct (Lemma 4) | (π/(2B))·‖y‖² | None | k×B²×4 bytes | O(B²) | -| Per-block SORF | Approximate | ~(π/(2B))·‖y‖² | None | k×3×B bits | O(B log B) | -| Full-dim padded SORF | Approximate | ~(π/(2d))·‖y‖² | (pad-d)/pad | 3×padded_d bits | O(d log d) | -| MSE-only | N/A | N/A | N/A | None | 0 | +| Strategy | Theoretical | Variance | Padding waste | Storage | Speed | +| -------------------- | ----------------- | ------------------- | --------------- | ------------ | ---------------- | +| Per-block Gaussian | Correct (Lemma 4) | (π/(2B))·‖y‖² | None | k×B²×4 bytes | O(B²)/block | +| Per-block SORF | Approximate | ~(π/(2B))·‖y‖² | None | k×3×B bits | O(B log B)/block | +| Full-dim padded SORF | Approximate | ~(π/(2·pad_d))·‖y‖² | (pad_d-d)/pad_d | 3×pad_d bits | O(d log d) total | +| MSE-only | 
N/A | N/A | N/A | None | 0 | + +Note: the full-dim padded SORF variance bound formally uses `pad_d` (e.g., +1024), not `d` (768). However, the `pad_d - d` sign bits spent on zero-padded +coordinates carry no information about the residual, so the effective variance +reduction may be closer to `(π/(2d))·‖y‖²`. The experiment should measure +actual variance to resolve this. #### Norm architecture @@ -301,21 +314,34 @@ Store: codes (k × B per vector), block_norms (k per vector), centroids (shared), SORF signs (k × 3 × B, shared across vectors) # QJL stage (optional, one of four strategies) + +# --- Per-block strategies (Gaussian or SORF) --- +# Operate in unit-norm space, per block: for i in 0..k: if nᵢ > 0: x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ) rᵢ = ûᵢ - x̂ᵢ (per-block unit-norm residual) γᵢ = ‖rᵢ‖ - sᵢ = sign(Projᵢ × rᵢ) (see QJL strategy options) + sᵢ = sign(Projᵢ × rᵢ) (Gaussian B×B or SORF at B-dim) else: γᵢ = 0, sᵢ = zeros - -# Projᵢ is one of: -# - Sᵢ ∈ ℝ^(B×B) Gaussian i.i.d. N(0,1) (per-block Gaussian) -# - SORF_qjlᵢ at B-dim (per-block SORF) -# - SORF_qjl at padded_d-dim on full r (full-dim padded SORF) - -Store: qjl_signs, qjl_residual_norms, projection params (strategy-dependent) +Store: qjl_signs (k × B per vector), qjl_residual_norms (k per vector) + +# --- Full-dim padded SORF strategy --- +# Requires full decode + denormalization to compute the d-dim residual. +# This crosses the norm externalization boundary: the TurboQuant array +# operates on unit-norm sub-vectors, but the full residual lives in the +# original scale. The full-dim QJL encode path is: +x̂ = concat(nᵢ × decode_mse_block(cᵢ, ...) 
for i in 0..k)[0..d] +r = x - x̂ (d-dim, original scale) +γ = ‖r‖ +r_pad = [r; zeros(padded_d - d)] (zero-pad to padded_d) +s = sign(SORF_qjl(r_pad)) (single padded_d-dim SORF) +Store: qjl_signs (padded_d per vector), qjl_residual_norm (1 per vector) + +# Note: the full-dim strategy is more complex to implement because it +# requires a full MSE decode + denormalization during encoding, adding +# O(d) work and coupling the QJL stage to the norm externalization layer. ``` #### Decoding algorithm @@ -327,16 +353,24 @@ for i in 0..k: ûᵢ = SORF⁻¹ᵢ(r̂ᵢ) # QJL correction (if present) + +# --- Per-block strategies (Gaussian or SORF) --- for i in 0..k: if γᵢ > 0: - correctionᵢ = (√(π/2) / dim_proj) × γᵢ × Projᵢᵀ × sᵢ + correctionᵢ = (√(π/2) / B) × γᵢ × Projᵢᵀ × sᵢ ûᵢ += correctionᵢ -# dim_proj = B for per-block strategies, padded_d for full-dim strategy - -# Denormalize and concatenate (using externalized norms) +# Denormalize and concatenate for i in 0..k: x̂ᵢ = nᵢ × ûᵢ x̃ = concat(x̂₀, x̂₁, ..., x̂ₖ₋₁)[0..d] + +# --- Full-dim padded SORF strategy --- +# Denormalize first (QJL operates in original scale) +for i in 0..k: + x̂ᵢ = nᵢ × ûᵢ +x̂ = concat(x̂₀, ..., x̂ₖ₋₁)[0..d] +correction = (√(π/2) / padded_d) × γ × SORF⁻¹_qjl(s) +x̃ = x̂ + correction[0..d] ``` ### Stage 2: PDX dimension-major layout @@ -410,7 +444,7 @@ row-major before per-vector inverse SORF decoding. 
``` TurboQuantArray (operates on unit-norm B-dim sub-vectors) ├── metadata: { dimension: u32, bit_width: u32, block_size: u32, -│ num_blocks: u32, has_qjl: bool, is_pdx: bool } +│ num_blocks: u32, qjl_strategy: enum, is_pdx: bool } │ │ # Per-row children (sliced/taken on row operations) ├── codes: FixedSizeListArray # len=num_rows, list_size=num_blocks×B @@ -419,37 +453,63 @@ TurboQuantArray (operates on unit-norm B-dim sub-vectors) ├── centroids: PrimitiveArray # len=2^(bit_width-1) [MSE codebook] ├── mse_rotation_signs: PrimitiveArray # len=num_blocks×3×B [per-block MSE SORF] │ -│ # Optional QJL children (per-block Gaussian) -├── [qjl_signs]: FixedSizeListArray # len=num_rows, list_size=num_blocks×B -├── [qjl_matrices]: PrimitiveArray # len=num_blocks×B×B [Gaussian i.i.d. N(0,1)] +│ # QJL children (strategy-dependent, all optional) +│ # +│ # Per-block Gaussian: +│ # qjl_signs: FSL, list_size=num_blocks×B (per row) +│ # qjl_residual_norms: FSL, list_size=num_blocks (per row, externalized) +│ # qjl_matrices: Primitive, len=num_blocks×B×B (shared) +│ # +│ # Per-block SORF: +│ # qjl_signs: FSL, list_size=num_blocks×B (per row) +│ # qjl_residual_norms: FSL, list_size=num_blocks (per row, externalized) +│ # qjl_sorf_signs: Primitive, len=num_blocks×3×B (shared) +│ # +│ # Full-dim padded SORF: +│ # qjl_signs: FSL, list_size=padded_d (per row — larger than per-block!) +│ # qjl_residual_norm: Primitive, len=num_rows (per row, single norm, externalized) +│ # qjl_sorf_signs: Primitive, len=3×padded_d (shared) Externalized (lives in parent encoding, not in TurboQuantArray): ├── block_norms: FixedSizeListArray # len=num_rows, list_size=num_blocks -└── [qjl_residual_norms]: FixedSizeListArray # len=num_rows, list_size=num_blocks +└── [qjl_residual_norms]: (see strategy above) ``` The `is_pdx` flag in metadata determines whether `codes` (and `qjl_signs`) are in row-major or PDX-transposed layout. 
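The strategy-dependent `qjl_signs` widths in the layout tree above can be
cross-checked with a short sketch. This is illustrative only — the helper name
and strategy strings are hypothetical, not part of the proposed API:

```python
import math

def qjl_sign_width(strategy: str, d: int, block_size: int) -> int:
    """Per-row list_size of qjl_signs, in sign bits, for each QJL strategy."""
    num_blocks = math.ceil(d / block_size)
    if strategy in ("per_block_gaussian", "per_block_sorf"):
        return num_blocks * block_size       # one sign bit per MSE code
    if strategy == "full_dim_padded_sorf":
        return 1 << math.ceil(math.log2(d))  # pad d up to the next power of 2
    return 0                                 # MSE-only: no QJL children

# d=768, B=128: per-block strategies store 768 sign bits per row;
# full-dim padded SORF stores padded_d = 1024 — larger, as noted above.
```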
+Note: the full-dim padded SORF strategy has different per-vector storage than +the per-block strategies: `padded_d` QJL sign bits per vector (e.g., 1024 for +d=768) vs. `num_blocks × B` bits (768 for d=768, B=128). It also stores a +single residual norm per vector rather than one per block. + ## Compression ratio -For f32 input at dimension d with bit width b (QJL, so b-1 MSE bits + 1 QJL -bit), k = ⌈d/B⌉ blocks: +For f32 input at dimension d with MSE bit width b_mse (b_mse = b-1 for QJL +strategies, b_mse = b for MSE-only), k = ⌈d/B⌉ blocks: + +**MSE components (all strategies):** -| Component | Bits per vector | -| ------------------ | --------------- | -| MSE codes | k × B × (b-1) | -| QJL signs | k × B × 1 | -| Block norms | k × norm_bits | -| QJL residual norms | k × norm_bits | +| Component | Bits per vector | +| ----------- | --------------- | +| MSE codes | k × B × b_mse | +| Block norms | k × norm_bits | -| Component | Shared bits | -| --------------------- | ------------ | -| Centroids | 2^(b-1) × 32 | -| MSE SORF signs | k × 3 × B | -| QJL Gaussian matrices | k × B² × 32 | +| Component | Shared bits | +| -------------- | ------------ | +| Centroids | 2^b_mse × 32 | +| MSE SORF signs | k × 3 × B | -### Example: f32, d=768, b=5, B=128, N=1000 vectors, k=6 +**QJL components (strategy-dependent):** + +| Strategy | Signs/vec | Res. 
norms/vec | Shared | +| -------------------- | --------- | -------------- | ------------ | +| Per-block Gaussian | k × B | k × norm_bits | k × B² × 32 | +| Per-block SORF | k × B | k × norm_bits | k × 3 × B | +| Full-dim padded SORF | padded_d | norm_bits | 3 × padded_d | +| MSE-only | 0 | 0 | 0 | + +### Example: f32, d=768, b=5, B=128, N=1000, k=6 (per-block Gaussian QJL) - Uncompressed: 768 × 32 × 1000 = 24,576,000 bits (3,000 KB) - MSE codes: 6 × 128 × 4 × 1000 = 3,072,000 bits @@ -460,20 +520,25 @@ bit), k = ⌈d/B⌉ blocks: - **Total compressed: 7,372,544 bits (899 KB)** - **Ratio: 3.3×** (QJL Gaussian matrices dominate at small N) -At N=100K vectors the shared overhead amortizes to <0.04 bits/vector, giving -ratio ≈ 5.7×. At N=1M, ratio ≈ 5.8×. +At N=100K, the shared overhead is ~31.5 bits/vector (<1% of per-vector cost), +giving ratio ≈ 5.8×. At N=1M, ratio ≈ 5.8×. -### Comparison across configurations (f32, d=768, b=5) +### Comparison across configurations (f32, d=768, 5 bits/coordinate total) -| Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes | -| ------------------------------------------------ | --- | ------------ | -------------- | ----------------------------- | -| Block MSE-only | 128 | 6.5× | 6.5× | No QJL; biased inner products | -| Block + per-block Gaussian QJL | 128 | 3.3× | 5.7× | Unbiased; matrices amortize | -| [Current][current-impl] (padded SORF + SORF QJL) | — | 4.7× | 4.7× | 33% padding waste | +All configurations use 5 total bits per coordinate. For QJL strategies, this is +4-bit MSE + 1-bit QJL. For MSE-only, all 5 bits go to MSE (32 centroids). -For small columns, MSE-only or the current padded approach may be preferable. -For large columns (the common case for embedding tables), per-block Gaussian QJL -gives the best ratio. 
+| Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes | +| ------------------------------------- | --- | ------------ | -------------- | ----------------------------- | +| Block MSE-only (5-bit MSE) | 128 | 6.1× | 6.1× | No QJL; biased inner products | +| Block + per-block SORF QJL | 128 | 5.8× | 5.8× | Approximate; minimal overhead | +| Block + per-block Gaussian QJL | 128 | 3.3× | 5.8× | Correct; matrices amortize | +| [Current][current-impl] (padded SORF) | — | 4.7× | 4.7× | 33% padding waste | + +Per-block SORF QJL has the best ratio at all column sizes (SORF signs are +negligible overhead). Per-block Gaussian QJL is competitive only for large +columns where the B²×k×4 byte matrices amortize. For small columns, MSE-only +or per-block SORF QJL is preferable. ## Performance analysis @@ -672,6 +737,11 @@ The B-dim block structure ensures rotation matrices fit in GPU shared memory compressor; GPU decode requires either host-side decompression + GPU transfer, or direct GPU decompression of FastLanes/ALP. +Note: the GPU decode pipeline described above assumes per-block QJL. The +full-dim padded SORF QJL strategy requires a padded_d-dim inverse SORF, which +is a different (larger) kernel and may not fit the per-block tiling model as +cleanly. + ## References [1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. 
"TurboQuant: Online From 02d6c9d72bd9ea2de2337e40053042d19f2aed55 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 10:01:16 -0400 Subject: [PATCH 2/4] review Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 155 +++++++++++++++++++----------- 1 file changed, 101 insertions(+), 54 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index f19391d..deadb96 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -2,7 +2,7 @@ **Authors:** Will Manning **Status:** Proposal -**Date:** 2026-04-02 +**Date:** 2026-04-03 ## Summary @@ -170,10 +170,12 @@ quantified empirically (see Experimental plan). **SORF approximation caveat.** Theorems 1 and 2 in [1] are proved for true random orthogonal matrices (QR of Gaussian), not SORF. The 3-round SORF -construction `HD₃·HD₂·HD₁` [5] is a structured approximation. The -approximation quality depends on dimension: 3 rounds provides 3 × log₂(B) -mixing stages (18 at B=64, 21 at 128, 24 at 256, 30 at 1024). Empirical -validation is needed for each candidate B — see Experimental plan. +construction `HD₃·HD₂·HD₁` [5] is a structured approximation. The approximation quality depends on dimension: each round of the Walsh-Hadamard +transform mixes all B coordinates through log₂(B) butterfly stages, so 3 rounds +provides 3 × log₂(B) total butterfly stages (18 at B=64, 21 at 128, 24 at 256). +This is a rough heuristic for mixing quality, not a formal convergence metric — +[5] does not analyze convergence rate as a function of rounds × dimension. +Empirical validation is needed for each candidate B — see Experimental plan. **Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a B × B random orthogonal matrix (QR of Gaussian) instead. Storage at B=128: @@ -202,31 +204,46 @@ independent Gaussian projection Sₖ ∈ ℝ^(B×B) with i.i.d. N(0,1) entries. 
Gaussian matrices work at any dimension, so no padding is needed for the QJL
stage. Each block's QJL is provably unbiased by Lemma 4 in [1], and the sum
over blocks is also unbiased: `E[⟨x̂, y⟩] = ⟨x, y⟩`. However, the per-block
-variance is **d/B times higher** than full-dimension QJL:
+variance is **d/B times higher** than full-dimension QJL.
+
+Lemma 4 gives the variance for QJL of a unit vector. Since QJL is applied to the
+residual rₖ (with norm γₖ = ‖rₖ‖, typically ≪ 1), the actual variance scales
+by γₖ²:

```
-Per-block (B-dim): Var[⟨x̂, y⟩] ≤ (π / (2B)) × ‖y‖²
-Full-dim (d-dim):  Var[⟨x̂, y⟩] ≤ (π / (2d)) × ‖y‖²
+Per-block (B-dim): Var[⟨x̂, y⟩] ≤ (π / (2B)) × ‖r‖² × ‖y‖²
+Full-dim (d-dim):  Var[⟨x̂, y⟩] ≤ (π / (2d)) × ‖r‖² × ‖y‖²
```

-At d=768, B=128: 6× more variance. Storage: B×B×4 bytes per block (384 KB for
-k=6 at B=128). Encode/decode cost: O(B²) matmul per block.
+The ‖r‖² factor cancels when comparing strategies (same MSE quality → same
+residual norms), so the **relative** variance ratio is d/B regardless. At
+d=768, B=128: per-block has 6× more variance than full-dim. The absolute
+variance is small — at b=4 MSE, ‖r‖² ≈ 0.01, so the per-block variance is
+≈ 0.01 × (π/(2×128)) × ‖y‖² ≈ 1.2×10⁻⁴ × ‖y‖².
+
+Storage: B×B×4 bytes per block (384 KB for k=6 at B=128). Encode/decode cost:
+O(B²) matmul per block.

**Per-block SORF QJL** substitutes a B-dim SORF (`HD₃·HD₂·HD₁` [5]) for the
-Gaussian matrix. This is NOT theoretically justified — Lemma 4 requires Gaussian
-or Haar-orthogonal S, and SORF is only an approximation to Haar measure.
+Gaussian matrix. This is NOT theoretically justified — Lemma 4 in [1] is proved
+specifically for Gaussian S. For Haar-distributed random orthogonal S,
+unbiasedness follows from rotational invariance (a separate argument), but the
+variance constant may differ from π/(2B). SORF is not Gaussian, and it only
+approximates Haar measure.
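The per-block Gaussian QJL estimator can be checked numerically. The following
Monte Carlo sketch (illustrative numpy code, not the proposed implementation)
uses the `√(π/2)/B` scale from the decoding algorithm and verifies that the
inner-product estimate is unbiased with variance near the `π/(2B)` coefficient:

```python
import numpy as np

rng = np.random.default_rng(42)
B = 128
r = rng.standard_normal(B); r /= np.linalg.norm(r)  # unit residual direction
y = rng.standard_normal(B); y /= np.linalg.norm(y)  # fixed query direction

trials = 2000
ests = np.empty(trials)
for t in range(trials):
    S = rng.standard_normal((B, B))                 # fresh B x B Gaussian per trial
    s = np.sign(S @ r)                              # 1-bit QJL codes
    correction = (np.sqrt(np.pi / 2) / B) * np.linalg.norm(r) * (S.T @ s)
    ests[t] = correction @ y                        # estimate of <r, y>

true = r @ y
assert abs(ests.mean() - true) < 0.02               # unbiased, up to MC noise
assert ests.var() < np.pi / (2 * B) * 1.5           # variance near pi/(2B) bound
```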
However, the [current implementation][current-impl] already uses SORF for QJL at d=1024 with acceptable results (~11% mean relative error for power-of-2 dims), demonstrating practical viability. The tradeoff vs Gaussian is compelling: -O(B log B) speed (5× faster at B=128), O(B) storage (over 1000× less). Quality +O(B log B) speed (~10× faster than Gaussian at B=128), O(B) storage (over +1000× less). Quality at B=128 needs validation — with only 21 mixing stages, the approximation to Haar measure is weaker than at d=1024 (30 stages). **Full-dimension padded SORF QJL** applies a single SORF at the padded dimension (e.g., 1024 for d=768) to the full residual vector `r = x - x̂`, matching the [current implementation][current-impl]. The higher dimension gives -better SORF-to-Haar convergence (30 mixing stages at d=1024 vs 21 at B=128) and -full-dimension variance `(π/(2d))·‖y‖²`, but wastes `(padded_d - d)/padded_d` +better SORF-to-Haar convergence (30 butterfly stages at d=1024 vs 21 at B=128) +and full-dimension variance `~(π/(2·padded_d))·‖r‖²·‖y‖²`, but wastes +`(padded_d - d)/padded_d` of the sign bits on zero-padded coordinates (25% at 768→1024). This approach requires computing the full residual from all blocks before applying QJL, adding a full-dimension decode step to the encode path. @@ -238,24 +255,30 @@ computation. 
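The `HD₃·HD₂·HD₁` construction discussed above can be sketched with a textbook
fast Walsh-Hadamard transform. This is an illustrative numpy version (the
`fwht`/`sorf` names are hypothetical); it checks only that each scaled `H·D`
round — and hence the 3-round composition — is orthogonal:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Unnormalized fast Walsh-Hadamard transform, O(B log B); len(x) a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def sorf(x: np.ndarray, signs: np.ndarray) -> np.ndarray:
    """3-round SORF: each (1/sqrt(B)) * H * D round is orthogonal. signs is 3 x B of +-1."""
    B = len(x)
    for d_round in signs:                    # D1, D2, D3 random sign flips
        x = fwht(d_round * x) / np.sqrt(B)
    return x

rng = np.random.default_rng(0)
B = 128
signs = rng.choice([-1.0, 1.0], size=(3, B))
u, v = rng.standard_normal(B), rng.standard_normal(B)
# Orthogonality: norms and inner products are preserved exactly.
assert np.isclose(np.linalg.norm(sorf(u, signs)), np.linalg.norm(u))
assert np.isclose(sorf(u, signs) @ sorf(v, signs), u @ v)
```

Orthogonality is exact by construction; how well the resulting matrix
approximates a Haar-random rotation at small B is the empirical question.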
**QJL strategy options** (to be experimentally compared): -| Strategy | Theoretical | Variance | Padding waste | Storage | Speed | -| -------------------- | ----------------- | ------------------- | --------------- | ------------ | ---------------- | -| Per-block Gaussian | Correct (Lemma 4) | (π/(2B))·‖y‖² | None | k×B²×4 bytes | O(B²)/block | -| Per-block SORF | Approximate | ~(π/(2B))·‖y‖² | None | k×3×B bits | O(B log B)/block | -| Full-dim padded SORF | Approximate | ~(π/(2·pad_d))·‖y‖² | (pad_d-d)/pad_d | 3×pad_d bits | O(d log d) total | -| MSE-only | N/A | N/A | N/A | None | 0 | +| Strategy | Theoretical | Variance (×‖r‖²‖y‖²) | Padding waste | Storage | Speed | +| -------------------- | ----------------- | -------------------- | --------------- | ------------ | ---------------- | +| Per-block Gaussian | Correct (Lemma 4) | π/(2B) | None | k×B²×4 bytes | O(B²)/block | +| Per-block SORF | Approximate | ~π/(2B) | None | k×3×B bits | O(B log B)/block | +| Full-dim padded SORF | Approximate | ~π/(2·pad_d) | (pad_d-d)/pad_d | 3×pad_d bits | O(d log d) total | +| MSE-only | N/A | N/A | N/A | None | 0 | + +Variance entries show the coefficient of `‖r‖²×‖y‖²` where `‖r‖²` is the +residual MSE (≈ 0.01 at b=4). The ‖r‖² factor is the same across strategies +(same MSE quality), so relative comparisons reduce to the coefficient alone: +per-block is d/B times higher than full-dim (6× at d=768, B=128). Note: the full-dim padded SORF variance bound formally uses `pad_d` (e.g., -1024), not `d` (768). However, the `pad_d - d` sign bits spent on zero-padded +1024), not `d` (768). The `pad_d - d` sign bits spent on zero-padded coordinates carry no information about the residual, so the effective variance -reduction may be closer to `(π/(2d))·‖y‖²`. The experiment should measure -actual variance to resolve this. +reduction may be closer to `π/(2d)`. The experiment should measure actual +variance to resolve this. 
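For concreteness, the coefficient comparison in the note above reduces to
simple arithmetic (a sketch; `pad_d = 1024` assumes d = 768 padded to the next
power of 2):

```python
import math

d, B, pad_d = 768, 128, 1024
per_block   = math.pi / (2 * B)      # per-block coefficient, ~0.0123
full_formal = math.pi / (2 * pad_d)  # formal full-dim bound uses pad_d
full_eff    = math.pi / (2 * d)      # plausible effective value (see note)

assert round(per_block / full_formal) == pad_d // B  # 8x vs the formal bound
assert round(per_block / full_eff) == d // B         # 6x vs the effective value
```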
#### Norm architecture

The TurboQuant array itself operates only on unit-norm B-dim sub-vectors. Norms
-are externalized into a separate child array, following the pattern established
-by the NormVector encoding (PR #7251).
+are externalized into a separate child array, following the pattern explored in
+the NormVector encoding prototype (PR #7251, closed — the concept will need to
+be implemented as part of this work or adapted from a different source).

The per-block norms are stored as a single `FixedSizeListArray<F>` with
`list_size = num_blocks`, where `F` matches or widens the input element type:
@@ -271,6 +294,13 @@ encoding to them.

The cascading compressor treats norms like any other float
column and is free to re-encode them with ALP, Pco, FastLanes, or other float
compression schemes.

+Note: centroids and quantization always operate in f32 (the
+[current implementation][current-impl] converts all input to f32 before
+quantization). For f64 input, the decode path produces f32 reconstructions
+scaled by f64 norms — a mixed-precision multiply. This preserves the precision
+of the norms (which capture the bulk of the vector's magnitude) while accepting
+f32 precision for the unit-direction reconstruction.
+
#### Quantized-domain operations with per-block norms

All quantized-domain operations require reading the block norms for both
@@ -285,12 +315,19 @@ centroids[code_bₖ[j]]`.
Per-block: compute unit-norm quantized dot product
(sum of B centroid products), then weight by both vectors' block norms.
- **Cosine similarity**: `cos(a, b) ≈ (Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ) /
  (√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))`. Requires global norms reconstructed from
-  block norms. The norms tensor should be read once per scan query and cached.
+  block norms.
+- **L2 distance** (squared Euclidean): `‖a-b‖² = ‖a‖² + ‖b‖² - 2⟨a,b⟩
+  = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2 × Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ`. Reuses the
+  per-block dot product and per-block norms; this is the primary ANN metric.
+ +The norms tensor should be read once per scan query and cached. #### Encoding algorithm ``` -Input: x ∈ ℝ^d, bit_width b, block_size B (power of 2) +Input: x ∈ ℝ^d, total_bits b, block_size B (power of 2) +b_mse = b - 1 for QJL strategies, b_mse = b for MSE-only +num_centroids = 2^b_mse k = ⌈d/B⌉ # Block split and normalize @@ -306,7 +343,7 @@ for i in 0..k: for i in 0..k: if nᵢ > 0: rᵢ = SORFᵢ(ûᵢ) (3-round HD, independent per block) - cᵢ[j] = nearest_centroid(rᵢ[j]) (shared codebook) + cᵢ[j] = nearest_centroid(rᵢ[j]) (shared codebook, num_centroids levels) else: cᵢ[j] = 0 @@ -316,7 +353,11 @@ Store: codes (k × B per vector), block_norms (k per vector), # QJL stage (optional, one of four strategies) # --- Per-block strategies (Gaussian or SORF) --- -# Operate in unit-norm space, per block: +# Operate in unit-norm space, per block. Note: the current implementation +# computes the QJL residual in original scale (r = x - x̂). With externalized +# norms, we instead compute the unit-norm residual (rᵢ = ûᵢ - x̂_unitᵢ) and +# let denormalization handle the scaling. These are mathematically equivalent: +# nᵢ × correctionᵢ gives the same result either way. for i in 0..k: if nᵢ > 0: x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ) @@ -355,6 +396,7 @@ for i in 0..k: # QJL correction (if present) # --- Per-block strategies (Gaussian or SORF) --- +# Scale factor uses B (block dimension) because Lemma 4 applies per-block. for i in 0..k: if γᵢ > 0: correctionᵢ = (√(π/2) / B) × γᵢ × Projᵢᵀ × sᵢ @@ -528,17 +570,19 @@ giving ratio ≈ 5.8×. At N=1M, ratio ≈ 5.8×. All configurations use 5 total bits per coordinate. For QJL strategies, this is 4-bit MSE + 1-bit QJL. For MSE-only, all 5 bits go to MSE (32 centroids). 
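The ratios quoted in these comparisons follow from straightforward bit
accounting. A sketch for the per-block Gaussian QJL configuration (assumes
32-bit norms, as in the worked example; variable names are illustrative):

```python
# d=768, B=128, k=6 blocks, 4-bit MSE + 1-bit QJL, 32-bit norms.
d, B, k, b_mse, norm_bits = 768, 128, 6, 4, 32

# Per-vector bits: MSE codes + QJL signs + block norms + QJL residual norms.
per_vec = k * B * b_mse + k * B + 2 * k * norm_bits
# Shared bits: centroids + MSE SORF signs + Gaussian QJL matrices.
shared = (2 ** b_mse) * 32 + k * 3 * B + k * B * B * 32

def ratio(n_vectors: int) -> float:
    return d * 32 * n_vectors / (per_vec * n_vectors + shared)

assert round(ratio(1_000), 1) == 3.3     # Gaussian matrices dominate at small N
assert round(ratio(100_000), 1) == 5.8   # shared overhead amortized
```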
-| Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes | -| ------------------------------------- | --- | ------------ | -------------- | ----------------------------- | -| Block MSE-only (5-bit MSE) | 128 | 6.1× | 6.1× | No QJL; biased inner products | -| Block + per-block SORF QJL | 128 | 5.8× | 5.8× | Approximate; minimal overhead | -| Block + per-block Gaussian QJL | 128 | 3.3× | 5.8× | Correct; matrices amortize | -| [Current][current-impl] (padded SORF) | — | 4.7× | 4.7× | 33% padding waste | +| Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes | +| ------------------------------------- | --- | ------------ | -------------- | ---------------------------------- | +| Block MSE-only (5-bit MSE) | 128 | 6.1× | 6.1× | No QJL; biased inner products | +| Block + per-block SORF QJL | 128 | 5.8× | 5.8× | Approximate; minimal overhead | +| Block + full-dim padded SORF QJL | 128 | 5.7× | 5.7× | Lower variance; padded_d signs/vec | +| Block + per-block Gaussian QJL | 128 | 3.3× | 5.8× | Paper-correct; matrices amortize | +| [Current][current-impl] (padded SORF) | — | 4.7× | 4.7× | 33% padding waste | Per-block SORF QJL has the best ratio at all column sizes (SORF signs are -negligible overhead). Per-block Gaussian QJL is competitive only for large -columns where the B²×k×4 byte matrices amortize. For small columns, MSE-only -or per-block SORF QJL is preferable. +negligible overhead). Full-dim padded SORF QJL is close behind (the extra +padded_d − d = 256 sign bits per vector are a small cost). Per-block Gaussian +QJL is competitive only for large columns where the B²×k×4 byte matrices +amortize. ## Performance analysis @@ -546,16 +590,18 @@ or per-block SORF QJL is preferable. 
With k blocks at B-dim, encoding requires per block: -| Operation | FLOPs (B=128) | -| --------------------------- | ------------------------------------- | -| MSE SORF (3-round) | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072 | -| Centroid lookup | 128 binary searches | -| QJL Gaussian matmul (S × r) | B² = 16,384 | -| Norm computation | 128 FMA + sqrt ≈ 129 | +| Operation | FLOPs (B=128) | +| ---------------------------- | ------------------------------------- | +| MSE SORF (3-round) | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072 | +| Centroid lookup | 128 binary searches | +| QJL Gaussian matmul (S × r) | 2B² = 32,768 (multiply + add) | +| QJL SORF (if per-block SORF) | ≈ 3,072 (same as MSE) | +| Norm computation | 128 FMA + sqrt ≈ 129 | -For d=768, k=6: MSE total ≈ 18K FLOPs, QJL matmul ≈ 98K FLOPs. The QJL -Gaussian matmul dominates encode cost — ~5× more expensive than the -[current][current-impl] SORF-based QJL. Acceptable for offline encoding. +For d=768, k=6: MSE total ≈ 18K FLOPs. QJL depends on strategy: Gaussian +matmul ≈ 197K FLOPs (~10× more than SORF QJL at ≈ 18K). The Gaussian QJL +dominates encode cost. SORF QJL adds negligible overhead. Acceptable for +offline encoding in both cases. ### Decode throughput @@ -563,13 +609,14 @@ Gaussian matmul dominates encode cost — ~5× more expensive than the | -------------------------------- | ----------------------- | | Codebook lookup | 128 table reads | | Inverse SORF | ≈ 3,072 | -| QJL Gaussian matmul (Sᵀ × signs) | B² = 16,384 | +| QJL Gaussian matmul (Sᵀ × signs) | 2B² = 32,768 | +| QJL SORF (if per-block SORF) | ≈ 3,072 | | Denormalize | 128 multiplies | -For d=768: MSE decode ≈ 18K FLOPs, QJL decode ≈ 98K FLOPs. QJL decode is -significantly more expensive due to the dense matmul. For scan workloads that -only need inner products (not full reconstruction), the fused distance -computation path avoids full decode entirely. +For d=768, k=6: MSE decode ≈ 18K FLOPs. QJL decode: Gaussian ≈ 197K FLOPs, +SORF ≈ 18K FLOPs. 
Gaussian QJL decode is ~10× more expensive than SORF QJL. +For scan workloads that only need inner products (not full reconstruction), the +fused distance computation path avoids full decode entirely. ### Scan throughput (PDX, Stage 2) @@ -614,7 +661,7 @@ Compare all four strategies at d=768 with B ∈ {64, 128, 256}: B. Quantify the quality cost of the SORF approximation at small block dimensions. Test at 3, 4, 5 SORF rounds. - **Full-dimension padded SORF QJL** (current approach): measure for comparison. - Higher dimension gives better SORF-to-Haar convergence (30 mixing stages at + Higher dimension gives better SORF-to-Haar convergence (30 butterfly stages at d=1024) which may compensate for the padding waste. This is the key comparison — does the better convergence of full-dim SORF outweigh the 25% wasted sign bits? From 6f303a3baf94e766fad681a45b96b74f979d3469 Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 10:15:31 -0400 Subject: [PATCH 3/4] rewrite Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 927 +++++++++++------------------- 1 file changed, 321 insertions(+), 606 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index deadb96..65db85b 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -6,18 +6,22 @@ ## Summary -We propose modifying the [TurboQuant vector quantization encoding][current-impl] -to use uniform blocks of a tunable power-of-2 size B (to be experimentally -determined; expected range 64-256) with per-block SORF rotations for the MSE -stage. The QJL correction strategy is an open design question with four options -to be experimentally compared: per-block Gaussian (paper-correct), per-block -SORF (fast but approximate), full-dimension padded SORF (current approach, -better convergence but wastes sign bits), or MSE-only (no QJL). The -lowest-level TurboQuant array operates only on unit-norm vectors, with norms -externalized. 
In a second stage, a PDX-style dimension-major code layout within -groups of 64 vectors enables SIMD-friendly vectorized distance computation. The -TQ block size B (for quantization quality) and the PDX vector group size (always -64, for SIMD utilization) are independent parameters. +We propose evolving the [TurboQuant vector quantization encoding][current-impl] +in three stages: + +1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only + encoding. This is a complete, self-contained building block. +2. **Block decomposition** (next): for non-power-of-2 dimensions, split into + blocks of size B = the largest power-of-2 ≥ 64 that divides d. For + power-of-2 dimensions, B = d (single block, same as current). Per-block + norms externalized. +3. **PDX layout** (later): within each block, transpose codes into groups of + 64 vectors for SIMD scan performance. + +QJL correction is deferred to a later stage and may ultimately be dropped. +Community findings from 6+ independent TurboQuant implementations consistently +show that MSE-only outperforms MSE+QJL for attention and ANN ranking in +practice [7]. [current-impl]: https://github.com/vortex-data/vortex/pull/7167 @@ -47,389 +51,247 @@ rotation and the QJL projection, giving O(d) storage and O(d log d) per-vector. The 3-round SORF construction was introduced for kernel approximation [5] and approximates a random orthogonal matrix. Note that this is distinct from the single-round SRHT (`R·H·D`) analyzed by Tropp [3] and the FJLT (`P·H·D`) of -Ailon-Chazelle [2], both of which are dimensionality-reducing projections. Using -SORF for QJL (where the paper prescribes Gaussian S) is an additional -approximation that may affect the unbiasedness guarantee of Theorem 2. +Ailon-Chazelle [2], both of which are dimensionality-reducing projections. + +### Reference implementation bugs + +The Eviox corrections study [7] identified six material bugs in the paper's +reference Python implementation. 
The most critical is a mathematical error in +the QJL scale factor: the reference code used `√(π/(2d))` instead of +`√(π/2)/d` (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128). +Our [current implementation][current-impl] uses the correct formula +(`sqrt(FRAC_PI_2) / padded_dim` in Rust), so this bug does **not** affect us. + +Other notable Eviox findings: (a) the reference code recomputes codebooks at +every instantiation (we cache in a `DashMap`); (b) the reference uses float16 +for codebook distance computation, causing misassignment at small centroid +spacings (we cast to f32 before quantization). See [7] for the full list. + +### Theorem 1 constant + +There is an ambiguity in the paper's notation for the MSE bound constant. The +formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72. +The Eviox report [7] interprets the notation as `√(3π)/2 ≈ 1.535`, but this is +incorrect: the measured distortion values from the paper (b=2: 0.117, b=3: 0.03) +exceed the putative `√(3π)/2` bound (b=2: 0.096, b=3: 0.024), confirming that +2.72 is the correct constant. The paper's "explicit values" (0.36, 0.117, 0.03, +0.009) are the actual computed distortion of the optimal quantizer, not the +bound itself — they are well below the 2.72/4^b bound. + +### Community findings on QJL + +Multiple independent TurboQuant implementations [7] have converged on a +significant practical finding: **MSE-only consistently outperforms MSE+QJL for +attention and ANN ranking**. The mechanism is a variance-bias tradeoff: +TurboQuant's QJL correction eliminates bias but increases variance, and softmax +attention (and cosine/L2 ranking) amplifies variance more than bias. At the same +total bit budget, allocating all bits to MSE (more centroids, lower variance) +beats splitting between MSE + QJL (fewer centroids + 1-bit correction). This has +been confirmed by 6+ groups across Python, C, and Rust implementations. 
+ +This finding strongly supports making MSE-only the default strategy for our +columnar storage use case (ANN search, cosine similarity ranking). ### Current limitations The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions -(e.g., 768-d embeddings from common transformer models), the input is -zero-padded to the next power of 2 (1024). This causes: - -- **33% wasted storage** for 768-d vectors: 1024 quantized codes stored per - vector when only 768 carry information. -- **QJL variance inflation**: The QJL correction also uses a padded SORF. Of - the 1024 QJL sign bits, 256 encode information about the zero-padded region - rather than the actual residual. The estimator remains unbiased in expectation, - but the wasted measurement budget inflates variance. Empirically, this - manifests as ~23% mean signed relative error in inner product estimation for - 768-d vs. ~11% for power-of-2 dimensions. -- **No scan-optimized layout**: Codes are stored in row-major order (all codes - for one vector contiguous), which prevents SIMD-over-vectors distance - computation. +(e.g., 768-d embeddings), the input is zero-padded to the next power of 2 +(1024). This causes: + +- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful. +- **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors + distance computation. ### PDX PDX [4] is a data layout for vector similarity search that stores dimensions in a vertical (dimension-major) layout within fixed-size blocks of 64 vectors. This enables the compiler to auto-vectorize the inner distance loop over vectors -rather than dimensions, achieving on average 2x speedups over SIMD-optimized -row-major kernels (up to 5.5x at low dimensions) on modern CPUs. The block size -of 64 is empirically optimal across AVX-512, AVX2, and NEON architectures [4]. +rather than dimensions, achieving on average 2× speedups over SIMD-optimized +row-major kernels on modern CPUs. 
The block size of 64 is empirically optimal +across AVX-512, AVX2, and NEON architectures [4]. ## Proposal -### Stage 1: Tunable block decomposition - -Partition d dimensions into `⌈d/B⌉` blocks of B dimensions each, where B is a -power-of-2 block size (configurable; expected range 64-256, to be determined -experimentally). The last block ("straggler") may have fewer than B dimensions -— see Straggler handling below. Each full block gets an independent B-dim SORF -rotation for the MSE stage. - -Larger B gives better quantization quality (more concentrated coordinate -distribution, better SORF mixing) but more per-block norm overhead. Smaller B -gives more blocks (more norms) but each block has higher practical variance. -The optimal B should be determined experimentally — see Experimental plan. +### Block size strategy + +For each dimension d, choose B = the largest power-of-2 ≥ 64 that evenly +divides d. This eliminates stragglers entirely for common embedding dimensions: + +| Dimension d | Block size B | Blocks k | Notes | +| ----------- | ------------ | -------- | --------------------------- | +| 512 | 512 | 1 | Single block (= current TQ) | +| 768 | 256 | 3 | Largest dividing power-of-2 | +| 1024 | 1024 | 1 | Single block | +| 1536 | 512 | 3 | | +| 2048 | 2048 | 1 | Single block | +| 3072 | 1024 | 3 | | +| 4096 | 4096 | 1 | Single block | + +**Key observations:** + +- **Power-of-2 dimensions** (512, 1024, 2048, 4096) use B = d — a single block, + identical to the current implementation except with PDX underneath (Stage 3). + No block decomposition overhead, no per-block norms. These dimensions are + already well-served by the current design. +- **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at + B=256 or B=512. Zero padding waste. Each block has its own SORF rotation and + shares a single centroid set. +- **Stragglers are eliminated** for all common embedding dimensions. 
Dimensions + that are not multiples of 64 (e.g., 100, 200) would need straggler handling, + but these are rare in practice for modern model architectures. +- **The SORF approximation at B=256+ is well-tested**: 3 rounds at B=256 provides + 24 butterfly stages, and at B=512 provides 27 — both comparable to the current + B=1024 (30 stages). + +### Stage 1: MSE-only TurboQuant (immediate — split from current PR) + +Merge the [current MSE-only TurboQuant implementation][current-impl] as a +standalone encoding. This provides: + +- SORF-based random rotation at the padded dimension +- Max-Lloyd scalar quantization with shared centroids +- Per-vector norm storage (single f32 per vector) +- Slice, take, scalar_at compute pushdowns +- Quantized-domain cosine similarity and dot product +- File format integration via the compression scheme + +This is a complete, useful encoding for power-of-2 dimensions. For non-power-of-2 +dimensions it has the padding overhead described above. + +### Stage 2: Block decomposition + +For non-power-of-2 dimensions, split into blocks of size B (as determined by the +table above). Each full block gets an independent B-dim SORF rotation. **Key properties:** -- **B is a power of 2**, so every full block has zero SORF padding waste. - Common embedding dimensions are divisible by 64 and 128; most are also - divisible by 256. The straggler case is rare for well-chosen B. -- **One shared centroid set** for full blocks. All full blocks use the same - B-dim marginal distribution, so a single Max-Lloyd codebook serves every full - block. +- **One shared centroid set** for all blocks. All blocks use the same B-dim + marginal distribution, so a single Max-Lloyd codebook serves every block. - **Unit-norm assumption.** The TurboQuant array operates only on pre-normalized - sub-vectors. Per-block norms are externalized (see Norm architecture below). -- **Per-block SORF rotation signs.** Each block's MSE SORF is independent - (different seed). 
Signs are 3 × B bits per block. For d=768 with B=128:
-  6 blocks × 3 × 128 = 2304 bits = 288 bytes.
+  sub-vectors. Per-block norms are externalized, following the pattern explored
+  in PR #7251 (closed; concept will need reimplementation).
+- **Per-block SORF rotation signs.** Each block's SORF is independent (different
+  seed). Signs are 3 × B bits per block.
+- **For power-of-2 dimensions**: B = d, k = 1. The block decomposition is a
+  no-op; the encoding is identical to Stage 1 except norms may be externalized.

-#### Straggler handling
+#### Norm architecture

-If `d % B ≠ 0`, the final block has `d mod B` dimensions. Initially, the
-straggler is **zero-padded to B** and uses the same SORF and centroids as all
-other blocks. This is the simplest approach and reuses all existing
-infrastructure. The padding waste is bounded by B-1 dimensions — much smaller
-than the current full-dimension padding (up to d-1 dimensions).
+The per-block norms are stored as a `FixedSizeListArray` with
+`list_size = num_blocks`, where the norm element type `F` matches or widens the
+input element type:

-As a follow-up experiment, we will test using a **dense random orthogonal
-matrix** (QR of Gaussian) at the straggler's actual dimension with its own
-centroid set. This eliminates the straggler's padding waste entirely and may
-allow increasing B (since the straggler cost is bounded). This requires separate
-storage for the straggler rotation matrix and centroids.
+| Input dtype | Norm dtype | Rationale |
+| ----------- | ---------- | ---------------------------------------------- |
+| f16 | f32 | f16 has insufficient range/precision for norms |
+| f32 | f32 | Same type |
+| f64 | f64 | Preserve full precision |
+
+Norms are stored as plain child arrays; the cascading compressor handles
+secondary encoding (ALP, Pco, etc.).
+
+Note: centroids and quantization always operate in f32 internally (the
+[current implementation][current-impl] converts all input to f32 before
+quantization).
For f64 input, decode produces f32 unit-direction reconstructions +scaled by f64 norms — a mixed-precision multiply that preserves norm precision. #### Zero-norm sub-vectors -When splitting a vector into B-dim blocks, some blocks may have zero norm -(e.g., sparse embeddings, or dimensions that are structurally unused). The -encoding must handle ‖xₖ‖ = 0 explicitly: skip rotation and quantization for -that block, store norm = 0, and decode as all zeros. The codes for a zero-norm -block are arbitrary (never read during decode). +When splitting a vector into B-dim blocks, some blocks may have zero norm. The +encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store +norm = 0, decode as all zeros. #### Theoretical MSE bound The paper's MSE bound (Theorem 1 in [1]) is: ``` -E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b +E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b ``` -This bound is dimension-independent — it depends only on the bit width `b`. -**Crucially, Theorem 1 is proved for true random orthogonal matrices.** Our use -of SORF is an approximation; the bound holds exactly only if the per-block -rotation is a true random orthogonal matrix (or if the SORF approximation is -validated empirically — see Experimental plan). +**Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of +Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly +only with a true random orthogonal rotation or with empirical SORF validation +(see Experimental plan). -Assuming the per-block MSE bound holds, for a vector split into blocks with -per-block normalization (‖xₖ‖ > 0): +Assuming the per-block MSE bound holds, for a vector split into blocks: ``` ‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) - ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) - = MSE_bound + ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound ``` -The Pythagorean identity `Σ_k ‖xₖ‖² = ‖x‖²` ensures the weights sum to 1. 
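The weighted decomposition above is easy to sanity-check numerically. A minimal
NumPy sketch — the noisy `x_hat` is a stand-in for the real SORF + Max-Lloyd
round trip, not the actual quantizer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, B = 768, 256
k = d // B

x = rng.standard_normal(d)
# Stand-in reconstruction: small additive noise instead of real quantization.
x_hat = x + 0.05 * rng.standard_normal(d)

blocks, blocks_hat = x.reshape(k, B), x_hat.reshape(k, B)
norms_sq = (blocks**2).sum(axis=1)                       # ‖xₖ‖²
weights = norms_sq / norms_sq.sum()                      # ‖xₖ‖² / ‖x‖²
block_rel_mse = ((blocks - blocks_hat) ** 2).sum(axis=1) / norms_sq

global_rel_mse = ((x - x_hat) ** 2).sum() / (x**2).sum()
assert np.isclose(weights.sum(), 1.0)                    # Pythagorean identity
assert np.isclose(global_rel_mse, (weights * block_rel_mse).sum())
```

The identity holds for any reconstruction, which is why a per-block bound
transfers directly to the full vector.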
-Blocks with ‖xₖ‖ = 0 contribute zero error and are excluded from the sum. - -**Practical MSE vs. theoretical bound.** The bound `(√3π/2)/4^b` is a worst-case -upper bound that is dimension-independent. However, the **actual** MSE depends -on the block dimension B. At higher dimensions, the coordinate distribution -after rotation is more concentrated (variance ~1/B), and the Max-Lloyd quantizer -exploits this concentration — the actual MSE can be well below the bound. At -smaller B, the distribution is wider and the quantizer has less concentration to -exploit. - -Block decomposition preserves the theoretical bound but may give worse -**practical** MSE than a single full-dimension rotation, because each B-dim -block operates at a less favorable point on the distortion curve. This must be -quantified empirically (see Experimental plan). - -**SORF approximation caveat.** Theorems 1 and 2 in [1] are proved for true -random orthogonal matrices (QR of Gaussian), not SORF. The 3-round SORF -construction `HD₃·HD₂·HD₁` [5] is a structured approximation. The approximation quality depends on dimension: each round of the Walsh-Hadamard -transform mixes all B coordinates through log₂(B) butterfly stages, so 3 rounds -provides 3 × log₂(B) total butterfly stages (18 at B=64, 21 at 128, 24 at 256). -This is a rough heuristic for mixing quality, not a formal convergence metric — -[5] does not analyze convergence rate as a function of rounds × dimension. -Empirical validation is needed for each candidate B — see Experimental plan. +The actual MSE may depend on block dimension B: at larger B the coordinate +distribution is more concentrated (variance ~1/B), giving the Max-Lloyd +quantizer more to exploit. See Experimental plan. -**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a -B × B random orthogonal matrix (QR of Gaussian) instead. Storage at B=128: -128×128×4 = 64 KB per block. With k blocks for MSE: total = k × 64 KB. 
For
-d=768, B=128 (k=6): 384 KB of MSE rotation matrices. This is significant for
-small columns (1K vectors at 3 MB uncompressed ≈ 13% overhead) but amortizes to
-negligible for large columns (100K+ vectors). Each block must have an
-**independent** rotation matrix (different seeds) to avoid correlation between
-quantization errors across blocks.
-
-#### QJL correction
-
-The QJL stage in the TurboQuant paper [1] uses a random Gaussian projection
-matrix S ∈ ℝ^(d×d) with i.i.d. N(0,1) entries — **not** an orthogonal rotation.
-The unbiasedness proof (Lemma 4 in [1]) and the scale factor `√(π/2)/d` are
-derived specifically for Gaussian S. The [current implementation][current-impl]
-uses SORF instead, which is an additional approximation.
-
-The QJL correction has several possible strategies, differing in projection
-type (Gaussian vs SORF), scope (per-block vs full-dimension), and whether QJL
-is used at all. The choice significantly affects correctness guarantees,
-variance, storage, and speed.
-
-**Per-block Gaussian QJL** is the paper-correct approach: each block gets an
-independent Gaussian projection Sₖ ∈ ℝ^(B×B) with i.i.d. N(0,1) entries. Since
-Gaussian matrices work at any dimension, no padding is needed for the QJL stage.
-Each block's QJL is provably unbiased by Lemma 4 in [1], and the sum over
-blocks is also unbiased: `E[] = `. However, the per-block
-variance is **d/B times higher** than full-dimension QJL.
-
-Lemma 4 gives the variance for QJL of a unit vector. Since QJL is applied to the
-residual rₖ (with norm γₖ = ‖rₖ‖, typically ≪ 1), the actual variance scales
-by γₖ²:
-
-```
-Per-block (B-dim): Var[] ≤ (π / (2B)) × ‖r‖² × ‖y‖²
-Full-dim (d-dim): Var[] ≤ (π / (2d)) × ‖r‖² × ‖y‖²
-```
+**SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides
+3 × log₂(B) total butterfly stages, log₂(B) per round (18 at B=64, 24 at
+B=256, 27 at B=512).
+This is a rough heuristic for mixing quality — [5] does not analyze convergence +rate as a function of rounds × dimension. Empirical validation is needed. -The ‖r‖² factor cancels when comparing strategies (same MSE quality → same -residual norms), so the **relative** variance ratio is d/B regardless. At -d=768, B=128: per-block has 6× more variance than full-dim. The absolute -variance is small — at b=4 MSE, ‖r‖² ≈ 0.01, so the per-block variance is -≈ 0.01 × (π/(2×128)) × ‖y‖² ≈ 1.2×10⁻⁴ × ‖y‖². - -Storage: B×B×4 bytes per block (384 KB for k=6 at B=128). Encode/decode cost: -O(B²) matmul per block. - -**Per-block SORF QJL** substitutes a B-dim SORF (`HD₃·HD₂·HD₁` [5]) for the -Gaussian matrix. This is NOT theoretically justified — Lemma 4 in [1] is proved -specifically for Gaussian S. For Haar-distributed random orthogonal S, -unbiasedness follows from rotational invariance (a separate argument), but the -variance constant may differ from π/(2B). SORF is an approximation to neither -Gaussian nor Haar measure. -However, the [current implementation][current-impl] already uses SORF for QJL at -d=1024 with acceptable results (~11% mean relative error for power-of-2 dims), -demonstrating practical viability. The tradeoff vs Gaussian is compelling: -O(B log B) speed (~10× faster than Gaussian at B=128), O(B) storage (over -1000× less). Quality -at B=128 needs validation — with only 21 mixing stages, the approximation to -Haar measure is weaker than at d=1024 (30 stages). - -**Full-dimension padded SORF QJL** applies a single SORF at the padded -dimension (e.g., 1024 for d=768) to the full residual vector `r = x - x̂`, -matching the [current implementation][current-impl]. The higher dimension gives -better SORF-to-Haar convergence (30 butterfly stages at d=1024 vs 21 at B=128) -and full-dimension variance `~(π/(2·padded_d))·‖r‖²·‖y‖²`, but wastes -`(padded_d - d)/padded_d` -of the sign bits on zero-padded coordinates (25% at 768→1024). 
This approach -requires computing the full residual from all blocks before applying QJL, -adding a full-dimension decode step to the encode path. - -**MSE-only (no QJL)** skips the QJL correction entirely. Inner product -estimates are biased but this may be acceptable for ANN ranking where relative -ordering matters more than absolute accuracy. Eliminates all QJL storage and -computation. - -**QJL strategy options** (to be experimentally compared): - -| Strategy | Theoretical | Variance (×‖r‖²‖y‖²) | Padding waste | Storage | Speed | -| -------------------- | ----------------- | -------------------- | --------------- | ------------ | ---------------- | -| Per-block Gaussian | Correct (Lemma 4) | π/(2B) | None | k×B²×4 bytes | O(B²)/block | -| Per-block SORF | Approximate | ~π/(2B) | None | k×3×B bits | O(B log B)/block | -| Full-dim padded SORF | Approximate | ~π/(2·pad_d) | (pad_d-d)/pad_d | 3×pad_d bits | O(d log d) total | -| MSE-only | N/A | N/A | N/A | None | 0 | - -Variance entries show the coefficient of `‖r‖²×‖y‖²` where `‖r‖²` is the -residual MSE (≈ 0.01 at b=4). The ‖r‖² factor is the same across strategies -(same MSE quality), so relative comparisons reduce to the coefficient alone: -per-block is d/B times higher than full-dim (6× at d=768, B=128). - -Note: the full-dim padded SORF variance bound formally uses `pad_d` (e.g., -1024), not `d` (768). The `pad_d - d` sign bits spent on zero-padded -coordinates carry no information about the residual, so the effective variance -reduction may be closer to `π/(2d)`. The experiment should measure actual -variance to resolve this. +**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a +B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per +block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+ +vectors). Each block must have an **independent** rotation matrix. 
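The fallback path can be sketched in a few lines. A NumPy sketch (names are
illustrative; the sign fix on R's diagonal makes QR of an i.i.d. Gaussian
matrix exactly Haar-distributed rather than QR-biased):

```python
import numpy as np

def random_orthogonal(dim: int, seed: int) -> np.ndarray:
    """dim × dim random orthogonal matrix via QR of an i.i.d. Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the result is Haar-distributed, not QR-biased.
    return q * np.sign(np.diag(r))

B, k = 256, 3
# Independent seed per block, so quantization errors decorrelate across blocks.
rotations = [random_orthogonal(B, seed=i) for i in range(k)]

v = np.random.default_rng(42).standard_normal(B)
for q in rotations:
    assert np.allclose(q.T @ q, np.eye(B), atol=1e-10)           # orthogonal
    assert np.isclose(np.linalg.norm(q @ v), np.linalg.norm(v))  # norm-preserving
```

Because each rotation is orthogonal, it preserves block norms exactly, so the
externalized-norm architecture is unaffected by which rotation (SORF or dense)
is used.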
-#### Norm architecture +#### Quantized-domain operations -The TurboQuant array itself operates only on unit-norm B-dim sub-vectors. Norms -are externalized into a separate child array, following the pattern explored in -the NormVector encoding prototype (PR #7251, closed — the concept will need to -be implemented as part of this work or adapted from a different source). +All quantized operations require per-block norms: -The per-block norms are stored as a single `FixedSizeListArray` with -`list_size = num_blocks`, where `F` matches or widens the input element type: - -| Input dtype | Norm dtype | Rationale | -| ----------- | ---------- | ---------------------------------------------- | -| f16 | f32 | f16 has insufficient range/precision for norms | -| f32 | f32 | Same type | -| f64 | f64 | Preserve full precision | - -Norms are stored as plain child arrays — TurboQuant does not apply any secondary -encoding to them. The cascading compressor treats norms like any other float -column and is free to re-encode them with ALP, Pco, FastLanes, or other float -compression schemes. - -Note: centroids and quantization always operate in f32 (the -[current implementation][current-impl] converts all input to f32 before -quantization). For f64 input, the decode path produces f32 reconstructions -scaled by f64 norms — a mixed-precision multiply. This preserves the precision -of the norms (which capture the bulk of the vector's magnitude) while accepting -f32 precision for the unit-direction reconstruction. - -#### Quantized-domain operations with per-block norms - -All quantized-domain operations require reading the block norms for both -operands. This is a cost increase vs. the [current implementation][current-impl] -which stores a single global norm per vector. - -- **L2 norm**: Read block norms, compute `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a - regression from the current O(1) single-norm readthrough, but modest compared - to full decompression. 
-- **Dot product**: ` ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
-centroids[code_bₖ[j]]`. Per-block: compute unit-norm quantized dot product
-  (sum of B centroid products), then weight by both vectors' block norms.
-- **Cosine similarity**: `cos(a, b) ≈ (Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ) /
-(√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))`. Requires global norms reconstructed from
-  block norms.
-- **L2 distance** (squared Euclidean): `‖a-b‖² = ‖a‖² + ‖b‖² - 2
-= Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2 × Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ`. Reuses the
-  per-block dot product and per-block norms; this is the primary ANN metric.
-
-The norms tensor should be read once per scan query and cached.
+- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·
+unit_dotₖ`. Primary ANN metric; reuses per-block dot product and norms.
+- **Dot product**: `<a, b> ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
+centroids[code_bₖ[j]]`.
+- **Cosine similarity**: `cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖)` where
+  `‖a‖ = √(Σ_k ‖aₖ‖²)`.
+- **L2 norm**: `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a regression from the
+  current O(1) single-norm readthrough, but modest.

#### Encoding algorithm

```
-Input: x ∈ ℝ^d, total_bits b, block_size B (power of 2)
-b_mse = b - 1 for QJL strategies, b_mse = b for MSE-only
+Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B
+k = d / B (exact division, no straggler for chosen B)
num_centroids = 2^b_mse
-k = ⌈d/B⌉

# Block split and normalize
for i in 0..k:
-  xᵢ = x[i*B .. min((i+1)*B, d)], zero-pad to B if straggler
+  xᵢ = x[i*B ..
(i+1)*B] nᵢ = ‖xᵢ‖ if nᵢ > 0: ûᵢ = xᵢ / nᵢ else: - ûᵢ = zeros(B) (zero-norm block: skip quantization) + ûᵢ = zeros(B) # MSE stage (per block, SORF rotation) for i in 0..k: if nᵢ > 0: - rᵢ = SORFᵢ(ûᵢ) (3-round HD, independent per block) - cᵢ[j] = nearest_centroid(rᵢ[j]) (shared codebook, num_centroids levels) + rᵢ = SORFᵢ(ûᵢ) + cᵢ[j] = nearest_centroid(rᵢ[j]) else: cᵢ[j] = 0 Store: codes (k × B per vector), block_norms (k per vector), - centroids (shared), SORF signs (k × 3 × B, shared across vectors) - -# QJL stage (optional, one of four strategies) - -# --- Per-block strategies (Gaussian or SORF) --- -# Operate in unit-norm space, per block. Note: the current implementation -# computes the QJL residual in original scale (r = x - x̂). With externalized -# norms, we instead compute the unit-norm residual (rᵢ = ûᵢ - x̂_unitᵢ) and -# let denormalization handle the scaling. These are mathematically equivalent: -# nᵢ × correctionᵢ gives the same result either way. -for i in 0..k: - if nᵢ > 0: - x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ) - rᵢ = ûᵢ - x̂ᵢ (per-block unit-norm residual) - γᵢ = ‖rᵢ‖ - sᵢ = sign(Projᵢ × rᵢ) (Gaussian B×B or SORF at B-dim) - else: - γᵢ = 0, sᵢ = zeros -Store: qjl_signs (k × B per vector), qjl_residual_norms (k per vector) - -# --- Full-dim padded SORF strategy --- -# Requires full decode + denormalization to compute the d-dim residual. -# This crosses the norm externalization boundary: the TurboQuant array -# operates on unit-norm sub-vectors, but the full residual lives in the -# original scale. The full-dim QJL encode path is: -x̂ = concat(nᵢ × decode_mse_block(cᵢ, ...) 
for i in 0..k)[0..d] -r = x - x̂ (d-dim, original scale) -γ = ‖r‖ -r_pad = [r; zeros(padded_d - d)] (zero-pad to padded_d) -s = sign(SORF_qjl(r_pad)) (single padded_d-dim SORF) -Store: qjl_signs (padded_d per vector), qjl_residual_norm (1 per vector) - -# Note: the full-dim strategy is more complex to implement because it -# requires a full MSE decode + denormalization during encoding, adding -# O(d) work and coupling the QJL stage to the norm externalization layer. + centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared) ``` #### Decoding algorithm ``` -# MSE decode (produces unit-norm sub-vectors) for i in 0..k: r̂ᵢ[j] = centroids[cᵢ[j]] ûᵢ = SORF⁻¹ᵢ(r̂ᵢ) - -# QJL correction (if present) - -# --- Per-block strategies (Gaussian or SORF) --- -# Scale factor uses B (block dimension) because Lemma 4 applies per-block. -for i in 0..k: - if γᵢ > 0: - correctionᵢ = (√(π/2) / B) × γᵢ × Projᵢᵀ × sᵢ - ûᵢ += correctionᵢ -# Denormalize and concatenate -for i in 0..k: - x̂ᵢ = nᵢ × ûᵢ -x̃ = concat(x̂₀, x̂₁, ..., x̂ₖ₋₁)[0..d] - -# --- Full-dim padded SORF strategy --- -# Denormalize first (QJL operates in original scale) -for i in 0..k: x̂ᵢ = nᵢ × ûᵢ -x̂ = concat(x̂₀, ..., x̂ₖ₋₁)[0..d] -correction = (√(π/2) / padded_d) × γ × SORF⁻¹_qjl(s) -x̃ = x̂ + correction[0..d] +x̃ = concat(x̂₀, ..., x̂ₖ₋₁) ``` -### Stage 2: PDX dimension-major layout - -In a second stage, we transpose the code storage from row-major to -dimension-major within groups of 64 vectors, following the PDX layout [4]. The -64-vector group size is independent of the TQ block size B. - -**Row-major (Stage 1):** +### Stage 3: PDX dimension-major layout -``` -Vector 0: [tq0_d0 ... tq0_d(B-1) | tq1_d0 ... tq1_d(B-1) | ... ] -Vector 1: [tq0_d0 ... tq0_d(B-1) | tq1_d0 ... tq1_d(B-1) | ... ] -... -``` +Transpose code storage from row-major to dimension-major within groups of 64 +vectors [4]. The 64-vector group size is independent of B. 
-**PDX (Stage 2), within each 64-vector chunk:** +Within each 64-vector chunk, codes are stored dimension-major: ``` TQ block 0, dim 0: [v0 v1 v2 ... v63] @@ -440,302 +302,186 @@ TQ block 1, dim 0: [v0 v1 v2 ... v63] ... ``` -All codes for the same dimension across 64 vectors are contiguous. The inner -loop (over 64 vectors) has no inter-vector data dependencies and -auto-vectorizes into SIMD. TQ block boundaries affect only where norm weighting -occurs in the distance kernel — they don't affect the PDX transpose itself. - -This creates **B × 64 tiles**: B dimension coordinates × 64 vectors per chunk. - -#### Relationship to chunking - -The PDX 64-vector groups are effectively chunks. Shared metadata (centroids, -rotation signs/matrices) lives at the array level, not per-chunk. +The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block +boundaries only affect where norm weighting occurs — they don't affect the +transpose. -**Open design questions:** - -- Should slice/take produce row-major or PDX results? Row-major is simpler - (un-transpose on the fly) but loses the scan benefit. -- Should the PDX flag be a property of the encoding or a separate layout layer? -- How does the PDX transpose interact with the cascading compressor? 
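A minimal sketch of the transpose itself (NumPy; `to_pdx` is an illustrative
helper, and the tail case where the vector count is not a multiple of 64 is
ignored):

```python
import numpy as np

def to_pdx(codes: np.ndarray) -> np.ndarray:
    """Row-major (n, d) codes -> (n // 64, d, 64) dimension-major groups.

    Within each group, the 64 codes for one dimension are contiguous,
    which is what lets the inner distance loop vectorize over vectors.
    """
    n, d = codes.shape
    assert n % 64 == 0, "tail groups need padding or a scalar path"
    return np.ascontiguousarray(codes.reshape(n // 64, 64, d).transpose(0, 2, 1))

codes = (np.arange(128 * 8).reshape(128, 8) % 16).astype(np.uint8)  # n=128, d=8
pdx = to_pdx(codes)
assert pdx.shape == (2, 8, 64)
assert pdx[1, 3, 6] == codes[64 + 6, 3]  # group 1, dim 3, slot 6 = vector 70
```

Un-transposing for `to_canonical`/`scalar_at` is the inverse `transpose(0, 2, 1)`
followed by a reshape back to `(n, d)`.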
-
-#### Quantized distance computation
+**Quantized distance kernel (dot product):**

```rust
-// Precompute: dist_table[i][j] = centroids[i] * centroids[j]
-// At b=4: 16×16 = 256 floats = 1KB, fits in L1 cache
+let dist_table = precompute_product_table(&centroids);
+// At b=4: 16×16 = 256 floats = 1KB, fits in L1

-for tq_block in 0..num_tq_blocks {
+for tq_block in 0..k {
  for dim in 0..B {
    let qd = query_codes[tq_block * B + dim];
    let row = &dist_table[qd as usize];
-    for v in 0..64 { // SIMD-friendly: no inter-vector deps
-      distances[v] += row[codes[offset + v] as usize];
+    for v in 0..64 { // SIMD-friendly
+      unit_dots[v] += row[codes[offset] as usize];
+      offset += 1;
    }
  }
-  // Weight by query_block_norm × data_block_norms[v] for this TQ block
+  for v in 0..64 {
+    distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
+      * unit_dots[v];
+    unit_dots[v] = 0.0;
+  }
}
```

-**Full decompression from PDX layout.** When a scan is not the use case (e.g.,
-`scalar_at` or `to_canonical`), 64-vector chunks are un-transposed back to
-row-major before per-vector inverse SORF decoding.
+**Open design questions:**
+
+- Slice/take on PDX-transposed codes: produce row-major (simpler) or preserve
+  PDX (aligned 64-vector slices only)?
+- Is PDX a property of the encoding or a separate layout layer?
+- How does the compressor see the transposed codes?
+
+### QJL correction (deferred — experimental)
+
+Based on community findings [7], QJL is deferred to after the MSE stages are
+validated.
If pursued, four strategies should be compared: + +| Strategy | Theoretical | Speed | Storage | +| -------------------- | --------------------- | ---------------- | --------------- | +| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes | +| Per-block SORF | Approximate | O(B log B)/block | k×3×B bits | +| Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | +| MSE-only (no QJL) | N/A | 0 | None | + +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically +for Gaussian. SORF for QJL is an additional approximation (the +[current implementation][current-impl] uses SORF for QJL). Per-block QJL has +d/B times more variance than full-dimension QJL (Lemma 4 [1]). + +The community consensus is that MSE-only likely wins for ANN ranking at all +bit widths, so QJL may not be worth the complexity. ## Array layout +### Stage 1 (single block, current) + +Identical to the [current PR][current-impl] array structure. + +### Stage 2 (block decomposition) + ``` TurboQuantArray (operates on unit-norm B-dim sub-vectors) -├── metadata: { dimension: u32, bit_width: u32, block_size: u32, -│ num_blocks: u32, qjl_strategy: enum, is_pdx: bool } -│ -│ # Per-row children (sliced/taken on row operations) -├── codes: FixedSizeListArray # len=num_rows, list_size=num_blocks×B +├── metadata: { dimension, b_mse, block_size, num_blocks, is_pdx } │ -│ # Shared children (cloned on row operations, not sliced) -├── centroids: PrimitiveArray # len=2^(bit_width-1) [MSE codebook] -├── mse_rotation_signs: PrimitiveArray # len=num_blocks×3×B [per-block MSE SORF] +│ # Per-row children +├── codes: FixedSizeListArray # list_size = k × B │ -│ # QJL children (strategy-dependent, all optional) -│ # -│ # Per-block Gaussian: -│ # qjl_signs: FSL, list_size=num_blocks×B (per row) -│ # qjl_residual_norms: FSL, list_size=num_blocks (per row, externalized) -│ # qjl_matrices: Primitive, len=num_blocks×B×B (shared) -│ # -│ # Per-block SORF: -│ # qjl_signs: 
FSL, list_size=num_blocks×B (per row) -│ # qjl_residual_norms: FSL, list_size=num_blocks (per row, externalized) -│ # qjl_sorf_signs: Primitive, len=num_blocks×3×B (shared) -│ # -│ # Full-dim padded SORF: -│ # qjl_signs: FSL, list_size=padded_d (per row — larger than per-block!) -│ # qjl_residual_norm: Primitive, len=num_rows (per row, single norm, externalized) -│ # qjl_sorf_signs: Primitive, len=3×padded_d (shared) - -Externalized (lives in parent encoding, not in TurboQuantArray): -├── block_norms: FixedSizeListArray # len=num_rows, list_size=num_blocks -└── [qjl_residual_norms]: (see strategy above) -``` +│ # Shared children +├── centroids: PrimitiveArray # len = 2^b_mse +├── mse_rotation_signs: PrimitiveArray # len = k × 3 × B -The `is_pdx` flag in metadata determines whether `codes` (and `qjl_signs`) are -in row-major or PDX-transposed layout. - -Note: the full-dim padded SORF strategy has different per-vector storage than -the per-block strategies: `padded_d` QJL sign bits per vector (e.g., 1024 for -d=768) vs. `num_blocks × B` bits (768 for d=768, B=128). It also stores a -single residual norm per vector rather than one per block. +Externalized: +├── block_norms: FixedSizeListArray # list_size = k +``` ## Compression ratio -For f32 input at dimension d with MSE bit width b_mse (b_mse = b-1 for QJL -strategies, b_mse = b for MSE-only), k = ⌈d/B⌉ blocks: - -**MSE components (all strategies):** +For f32 input, b_mse bits MSE, k = d/B blocks, N vectors: | Component | Bits per vector | | ----------- | --------------- | | MSE codes | k × B × b_mse | -| Block norms | k × norm_bits | - -| Component | Shared bits | -| -------------- | ------------ | -| Centroids | 2^b_mse × 32 | -| MSE SORF signs | k × 3 × B | - -**QJL components (strategy-dependent):** - -| Strategy | Signs/vec | Res. 
norms/vec | Shared | -| -------------------- | --------- | -------------- | ------------ | -| Per-block Gaussian | k × B | k × norm_bits | k × B² × 32 | -| Per-block SORF | k × B | k × norm_bits | k × 3 × B | -| Full-dim padded SORF | padded_d | norm_bits | 3 × padded_d | -| MSE-only | 0 | 0 | 0 | - -### Example: f32, d=768, b=5, B=128, N=1000, k=6 (per-block Gaussian QJL) - -- Uncompressed: 768 × 32 × 1000 = 24,576,000 bits (3,000 KB) -- MSE codes: 6 × 128 × 4 × 1000 = 3,072,000 bits -- QJL signs: 6 × 128 × 1 × 1000 = 768,000 bits -- Block norms: 6 × 32 × 1000 = 192,000 bits -- QJL residual norms: 6 × 32 × 1000 = 192,000 bits -- Shared: 512 + 2,304 + 3,145,728 = 3,148,544 bits -- **Total compressed: 7,372,544 bits (899 KB)** -- **Ratio: 3.3×** (QJL Gaussian matrices dominate at small N) - -At N=100K, the shared overhead is ~31.5 bits/vector (<1% of per-vector cost), -giving ratio ≈ 5.8×. At N=1M, ratio ≈ 5.8×. - -### Comparison across configurations (f32, d=768, 5 bits/coordinate total) - -All configurations use 5 total bits per coordinate. For QJL strategies, this is -4-bit MSE + 1-bit QJL. For MSE-only, all 5 bits go to MSE (32 centroids). - -| Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes | -| ------------------------------------- | --- | ------------ | -------------- | ---------------------------------- | -| Block MSE-only (5-bit MSE) | 128 | 6.1× | 6.1× | No QJL; biased inner products | -| Block + per-block SORF QJL | 128 | 5.8× | 5.8× | Approximate; minimal overhead | -| Block + full-dim padded SORF QJL | 128 | 5.7× | 5.7× | Lower variance; padded_d signs/vec | -| Block + per-block Gaussian QJL | 128 | 3.3× | 5.8× | Paper-correct; matrices amortize | -| [Current][current-impl] (padded SORF) | — | 4.7× | 4.7× | 33% padding waste | - -Per-block SORF QJL has the best ratio at all column sizes (SORF signs are -negligible overhead). Full-dim padded SORF QJL is close behind (the extra -padded_d − d = 256 sign bits per vector are a small cost). 
Per-block Gaussian -QJL is competitive only for large columns where the B²×k×4 byte matrices -amortize. +| Block norms | k × 32 | -## Performance analysis - -### Encode throughput +| Component | Shared bits | +| ---------- | ------------ | +| Centroids | 2^b_mse × 32 | +| SORF signs | k × 3 × B | -With k blocks at B-dim, encoding requires per block: +### Worked examples (f32, b_mse=5, N=1000) -| Operation | FLOPs (B=128) | -| ---------------------------- | ------------------------------------- | -| MSE SORF (3-round) | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072 | -| Centroid lookup | 128 binary searches | -| QJL Gaussian matmul (S × r) | 2B² = 32,768 (multiply + add) | -| QJL SORF (if per-block SORF) | ≈ 3,072 (same as MSE) | -| Norm computation | 128 FMA + sqrt ≈ 129 | +| d | B | k | Per-vec bits | Ratio | Notes | +| ------------- | ---- | --- | --------------------- | ----- | -------------------------- | +| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | +| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | +| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | -For d=768, k=6: MSE total ≈ 18K FLOPs. QJL depends on strategy: Gaussian -matmul ≈ 197K FLOPs (~10× more than SORF QJL at ≈ 18K). The Gaussian QJL -dominates encode cost. SORF QJL adds negligible overhead. Acceptable for -offline encoding in both cases. +Block decomposition improves d=768 from 4.8× to 6.2× — a 30% storage +improvement. For d=1024 the encoding is identical to current. -### Decode throughput +## Performance analysis -| Operation | FLOPs per block (B=128) | -| -------------------------------- | ----------------------- | -| Codebook lookup | 128 table reads | -| Inverse SORF | ≈ 3,072 | -| QJL Gaussian matmul (Sᵀ × signs) | 2B² = 32,768 | -| QJL SORF (if per-block SORF) | ≈ 3,072 | -| Denormalize | 128 multiplies | +### Encode/decode throughput -For d=768, k=6: MSE decode ≈ 18K FLOPs. 
QJL decode: Gaussian ≈ 197K FLOPs, -SORF ≈ 18K FLOPs. Gaussian QJL decode is ~10× more expensive than SORF QJL. -For scan workloads that only need inner products (not full reconstruction), the -fused distance computation path avoids full decode entirely. +SORF at B dimensions: 3 × B × log₂(B) + 3 × B FLOPs per block. For k blocks: -### Scan throughput (PDX, Stage 2) +| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | +| -------------- | ------------------------- | --------- | --------------- | +| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 | +| 512 | 3×512×9 + 1536 = 15,360 | — | — | +| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 | -The PDX scan kernel is memory-bandwidth bound, not compute bound. The lookup -table fits in L1 (1 KB at b=4). With 64 vectors per SIMD pass, this achieves -near-peak memory bandwidth utilization. Norm weighting adds k scalar multiplies -per 64-vector group — negligible. +Block decomposition at d=768 is ~40% fewer FLOPs than the current padded +approach, despite more blocks, because each block is smaller. ### Benchmarking plan -Stage 1: - -1. Encode throughput: block TQ vs. current TQ at d=128, 768, 1024 -2. Decode throughput: same dimensions -3. Quantized cosine similarity: block vs. current (includes norm overhead) -4. L2 norm readthrough: O(k) block norms vs. O(1) current - -Stage 2: - -5. PDX scan throughput vs. row-major scan at d=768, 1024 -6. Full decompression from PDX layout (includes un-transpose overhead) +1. Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024 +2. Quantized cosine similarity: block vs. current +3. L2 norm readthrough: O(k) vs. O(1) +4. PDX scan throughput vs. row-major (Stage 3) ## Experimental plan -Before committing to specific parameters, the following experiments are needed: - ### MSE quality vs. block size -- Compare actual normalized MSE at B ∈ {64, 128, 256} vs. 
full-dimension - single-SORF at bit widths b ∈ {2, 3, 4, 5, 8} +- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-SORF at + padded dimension, at bit widths b ∈ {2, 3, 4, 5, 8} - Test SORF coordinate distribution at each B: histogram vs. analytical Beta -- Test 3, 4, and 5 SORF rounds at each B (3 rounds provides 3 × log₂(B) - mixing stages: 18 at B=64, 21 at 128, 24 at 256) - -### QJL strategy comparison - -Compare all four strategies at d=768 with B ∈ {64, 128, 256}: +- Test 3, 4, 5 SORF rounds at each B +- Determine if the practical MSE constant is worse at smaller B -- **Per-block Gaussian QJL** (paper-correct): measure inner product bias and - variance. Baseline for correctness. -- **Per-block SORF QJL**: measure bias/variance vs. per-block Gaussian at same - B. Quantify the quality cost of the SORF approximation at small block - dimensions. Test at 3, 4, 5 SORF rounds. -- **Full-dimension padded SORF QJL** (current approach): measure for comparison. - Higher dimension gives better SORF-to-Haar convergence (30 butterfly stages at - d=1024) which may compensate for the padding waste. This is the key - comparison — does the better convergence of full-dim SORF outweigh the 25% - wasted sign bits? -- **MSE-only** (no QJL): measure inner product bias to establish baseline. +### QJL strategy comparison (if pursued) -Key metrics: +- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL + vs. MSE-only +- Key metric: ANN recall@k on standard benchmarks (SIFT, GloVe) +- Per community findings, MSE-only is expected to win [7] -- Mean signed relative error on inner products (bias) -- Variance of inner product estimates -- ANN recall@k on standard benchmarks (e.g., SIFT, GloVe) +### Straggler handling (if needed) -Per-block Gaussian QJL has d/B times more variance than full-dim QJL (6× at -d=768, B=128). Per-block SORF QJL has additional approximation error on top of -that. 
Full-dim padded SORF QJL has lower variance but wastes bits on padding.
-The experiment should determine which effect dominates in practice.
-
-### Straggler handling
-
-- Compare padded straggler (zero-pad to B, use SORF) vs. dense rotation at
-  actual straggler dimension with own centroids
-- Measure whether dense straggler enables increasing B (e.g., B=256 with
-  d=768 = 3×256, no straggler at all)
+Stragglers are rare for common dimensions. If one appears, zero-pad it to B
+(simplest); a dense rotation at the actual dimension is a possible follow-up.

## Phasing

-**Stage 1** (block decomposition): Modify the existing TurboQuant encoding to
-use B-dim blocks with per-block SORF rotation, externalized norms, and
-configurable QJL strategy (per-block Gaussian, per-block SORF, full-dim padded
-SORF, or MSE-only). Row-major code layout.
-
-**Stage 2** (PDX layout): Add dimension-major code transposition within
-64-vector chunks. PDX-native distance computation kernels. Requires resolving
-open design questions around chunk alignment and compressor interaction.
+**Phase 1** — MSE-only single-block TurboQuant: Split the [current PR][current-impl]
+to merge MSE-only (no QJL). This is a complete encoding for all dimensions
+(with padding for non-power-of-2).

-## Test migration
+**Phase 2** — Block decomposition: Add block splitting for non-power-of-2
+dimensions. Externalize norms. B = largest power-of-2 ≥ 64 dividing d.

-The [current TurboQuant test suite][current-impl] will need rewriting:
+**Phase 3** — PDX layout: Dimension-major code transposition within 64-vector
+chunks. Distance computation kernels.

-- **Slot counts**: New structure with per-block SORF signs, QJL Gaussian
-  matrices, externalized norms.
-- **Serde roundtrips**: New metadata (block_size, num_blocks).
-- **MSE bound tests**: Add tests at B ∈ {64, 128, 256}.
-- **QJL tests**: Compare all four strategies (per-block Gaussian, per-block
-  SORF, full-dim padded SORF, MSE-only).
-- **Quantized cosine similarity**: Update for per-block norm weighting.
-- **Slice/take**: More children to manage, same semantics.
+**Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves
+recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on
+community findings, this may not be pursued.

-New tests:
+## Practical recommendations

-- SORF quality at B=64, 128, 256: distribution vs. Beta at 3, 4, 5 rounds
-- Practical MSE comparison: block vs. single-rotation at same bit width
-- QJL comparison: per-block Gaussian vs. per-block SORF vs. full-dim padded SORF
-- Zero-norm sub-vector handling
-- Straggler handling (padded initially, dense rotation as follow-up)
-- Encode/decode benchmarks
+For common model dimensions, the most promising configurations are:

-## Future work: GPU decode and fused distance computation
-
-### Motivation
-
-For ANN search workloads, the dominant operation is computing distances between
-a query vector and millions of database vectors. On GPU, the goal is to perform
-this computation directly on the compressed representation, avoiding the cost
-of materializing full decompressed vectors in HBM.
+| Dimension             | Recommendation              | Rationale                                                                    |
+| --------------------- | --------------------------- | ---------------------------------------------------------------------------- |
+| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout.   |
+| 768, 1536, 3072       | 3-block MSE-only + PDX      | B = d/3 (256, 512, or 1024). Zero padding waste. 3 blocks, shared centroids. |
+| Arbitrary d (rare)    | Padded single-block         | Fall back to current approach. Pad to next power of 2; overhead is 33% at d=768, below 2× in the worst case. |

-### Decode as GEMM
+In all cases, MSE-only is the recommended starting point. QJL should only be
+added if experiments demonstrate clear recall@k improvements for the target
+workload.
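The block-size rule behind these recommendations ("largest power-of-2 ≥ 64 dividing d") can be sketched as a small helper; `choose_block_size` is an illustrative name, not part of the current PR:

```rust
/// Illustrative sketch (not the actual PR API): pick the largest power-of-2
/// block size >= 64 that divides d. None means a straggler dimension that
/// falls back to the padded single-block path.
fn choose_block_size(d: usize) -> Option<usize> {
    if d == 0 {
        return None;
    }
    // d & d.wrapping_neg() isolates the lowest set bit, which is exactly
    // the largest power of two dividing d.
    let b = d & d.wrapping_neg();
    if b >= 64 { Some(b) } else { None }
}
```

This reproduces the table: 768 maps to B=256, 1536 to B=512, 3072 to B=1024, and any power-of-2 d maps to itself; d=100 (whose largest dividing power of 2 is 4) has no valid B and falls back to padding.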
-The decode path for a single B-dim block is: - -``` -decoded_block = norm × R⁻¹ × codebook_lookup(codes) -``` +## Future work: GPU decode and fused distance computation -For a batch of N vectors sharing the same block's rotation matrix R⁻¹: +The B-dim block structure maps naturally to GPU tile sizes and tensor cores. +For a batch of N vectors sharing the same rotation matrix R⁻¹: ``` decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) @@ -743,70 +489,39 @@ decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) ↑ B×B × B×N = GEMM ``` -The codebook lookup produces a B×N matrix, and the inverse rotation is a B×B -matrix multiply — a GEMM that maps directly to tensor cores. - -**Partial decompression pipeline on GPU:** - -1. **Decompress rotation matrix** (once per block, shared across all vectors): - - If stored as bitpacked SORF signs: FastLanes unpack on CUDA cores, then - expand to B×B matrix in shared memory - - If stored as dense B×B matrix: direct load to shared memory - -2. **Decompress norms** (per vector, per block): - - If cascade-compressed with ALP or Pco: decompress on CUDA cores - - If uncompressed: direct load - -3. **Codebook gather + fused GEMM + scale**: - - Stage codebook in shared memory (trivially small) - - Stream code bytes from HBM, look up centroids, assemble B×N tile - - R⁻¹ × tile on tensor cores, scale by norms - - Follows the double-buffered streaming pattern from Flash-KMeans [6] - -### Fused distance computation (no decode) +The codebook gather + inverse rotation + norm scaling can be fused into a single +kernel following the double-buffered streaming pattern from Flash-KMeans [6]. +For distance computation without full decode, a precomputed B²-entry distance +table (1 KB at b=4) fits in shared memory; the kernel streams code bytes from +HBM with gather-reduce accumulation, using 4-8× less bandwidth than full float +vectors. 
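As a shape check on the GEMM above, a naive CPU reference of the per-block decode might look like the following (illustrative names; a dimension-major code layout is assumed, and the GPU version would tile this onto tensor cores):

```rust
/// Naive reference for decoded = diag(norms) * R_inv * codebook_lookup(codes).
/// `r_inv` is a row-major block_dim x block_dim inverse rotation; `codes`
/// holds n vectors in dimension-major layout (codes[dim * n + vec]).
fn decode_block(
    r_inv: &[f32],
    centroids: &[f32],
    codes: &[u8],
    norms: &[f32],
    block_dim: usize,
) -> Vec<f32> {
    let n = norms.len();
    // Codebook gather: a block_dim x n matrix of unit-sphere coordinates.
    let looked_up: Vec<f32> = codes.iter().map(|&c| centroids[c as usize]).collect();
    // GEMM (block_dim x block_dim) * (block_dim x n), then per-column norm scale.
    let mut out = vec![0.0f32; block_dim * n];
    for row in 0..block_dim {
        for col in 0..n {
            let mut acc = 0.0f32;
            for i in 0..block_dim {
                acc += r_inv[row * block_dim + i] * looked_up[i * n + col];
            }
            out[row * n + col] = acc * norms[col];
        }
    }
    out
}
```

With R⁻¹ set to the identity this reduces to a norm-scaled codebook lookup, which makes a convenient unit test.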
-Stage the distance table in shared memory (1 KB at b=4), stream code bytes from -HBM, gather-reduce on-chip, write only final distances. The memory access -pattern follows Flash-KMeans [6]: double-buffered prefetch, on-chip -accumulation. We stream quantized code bytes — 4-8× less HBM bandwidth than -full float vectors. - -### Int8 tensor core path (b=9) - -At b=9, MSE codes are 8-bit indices. Direct int8 tensor core GEMM requires -approximately linear centroids (sacrificing Max-Lloyd optimality). Viable for -ANN ranking; prefer codebook gather for reconstruction. - -### Interaction with Vortex file format - -The B-dim block structure ensures rotation matrices fit in GPU shared memory -(B×B×4 bytes). Child arrays are individually compressed by the cascading -compressor; GPU decode requires either host-side decompression + GPU transfer, -or direct GPU decompression of FastLanes/ALP. - -Note: the GPU decode pipeline described above assumes per-block QJL. The -full-dim padded SORF QJL strategy requires a padded_d-dim inverse SORF, which -is a different (larger) kernel and may not fit the per-block tiling model as -cleanly. +At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires +approximately linear centroids (sacrificing Max-Lloyd optimality); viable for +ANN ranking but not reconstruction. ## References [1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online -Vector Quantization with Near-optimal Distortion Rate." arXiv:2504.19874, -April 2025. +Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. +arXiv:2504.19874, April 2025. [2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and -Approximate Nearest Neighbors." SIAM Journal on Computing, 39(1):302-322, 2009. +Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009. [3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard -Transform." Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011. +Transform." Adv. 
Adaptive Data Analysis 3(1-2):115-126, 2011. [4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector -Similarity Search." Proceedings of SIGMOD '25. arXiv:2503.04422, March 2025. +Similarity Search." SIGMOD '25. arXiv:2503.04422, March 2025. [5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S. -"Orthogonal Random Features." Advances in Neural Information Processing -Systems 29 (NeurIPS), 2016. arXiv:1610.09072. +"Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072. [6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means." arXiv:2603.09229, March 2026. + +[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production +Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, +March 2026. Community implementations: tonbistudio/turboquant-pytorch, +ggml-org/llama.cpp#20969, 0xSero/turboquant, others. From 4ecdecbef64e66a223846bab92be35f6e06e89ce Mon Sep 17 00:00:00 2001 From: Will Manning Date: Fri, 3 Apr 2026 10:21:52 -0400 Subject: [PATCH 4/4] review Signed-off-by: Will Manning --- proposed/0033-block-turboquant.md | 67 ++++++++++++++++++++----------- 1 file changed, 44 insertions(+), 23 deletions(-) diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 65db85b..a024c58 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -21,7 +21,7 @@ in three stages: QJL correction is deferred to a later stage and may ultimately be dropped. Community findings from 6+ independent TurboQuant implementations consistently show that MSE-only outperforms MSE+QJL for attention and ANN ranking in -practice [7]. +practice [8]. [current-impl]: https://github.com/vortex-data/vortex/pull/7167 @@ -80,14 +80,14 @@ bound itself — they are well below the 2.72/4^b bound. 
### Community findings on QJL -Multiple independent TurboQuant implementations [7] have converged on a +Multiple independent TurboQuant implementations have converged on a significant practical finding: **MSE-only consistently outperforms MSE+QJL for attention and ANN ranking**. The mechanism is a variance-bias tradeoff: TurboQuant's QJL correction eliminates bias but increases variance, and softmax attention (and cosine/L2 ranking) amplifies variance more than bias. At the same total bit budget, allocating all bits to MSE (more centroids, lower variance) beats splitting between MSE + QJL (fewer centroids + 1-bit correction). This has -been confirmed by 6+ groups across Python, C, and Rust implementations. +been confirmed by 6+ groups across Python, C, and Rust implementations [8]. This finding strongly supports making MSE-only the default strategy for our columnar storage use case (ANN search, cosine similarity ranking). @@ -98,7 +98,8 @@ The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions (e.g., 768-d embeddings), the input is zero-padded to the next power of 2 (1024). This causes: -- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful. +- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful + (equivalently, 25% of stored codes are wasted on zero-padded dimensions). - **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors distance computation. @@ -140,18 +141,22 @@ divides d. This eliminates stragglers entirely for common embedding dimensions: - **Stragglers are eliminated** for all common embedding dimensions. Dimensions that are not multiples of 64 (e.g., 100, 200) would need straggler handling, but these are rare in practice for modern model architectures. -- **The SORF approximation at B=256+ is well-tested**: 3 rounds at B=256 provides - 24 butterfly stages, and at B=512 provides 27 — both comparable to the current - B=1024 (30 stages). 
+- **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at + B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable + to the current B=1024 (30 stages). This needs empirical validation; see + Experimental plan. ### Stage 1: MSE-only TurboQuant (immediate — split from current PR) -Merge the [current MSE-only TurboQuant implementation][current-impl] as a -standalone encoding. This provides: +Split the [current PR][current-impl] to extract and merge the MSE-only subset +(removing QJL encoding, QJL array slots, and QJL-specific tests). The QJL code +can be preserved on a separate branch for Phase 4. The MSE-only encoding +provides: - SORF-based random rotation at the padded dimension - Max-Lloyd scalar quantization with shared centroids -- Per-vector norm storage (single f32 per vector) +- Per-vector norm storage (single f32, regardless of input dtype — the + dtype-matching norm behavior described in Stage 2 is a later change) - Slice, take, scalar_at compute pushdowns - Quantized-domain cosine similarity and dot product - File format integration via the compression scheme @@ -173,8 +178,10 @@ table above). Each full block gets an independent B-dim SORF rotation. in PR #7251 (closed; concept will need reimplementation). - **Per-block SORF rotation signs.** Each block's SORF is independent (different seed). Signs are 3 × B bits per block. -- **For power-of-2 dimensions**: B = d, k = 1. The block decomposition is a - no-op; the encoding is identical to Stage 1 except norms may be externalized. +- **For power-of-2 dimensions**: B = d, k = 1. The encoding is functionally + identical to Stage 1. The norm remains a single value per vector (not a + FixedSizeList with list_size=1). Norm externalization is optional for k=1 and + can be deferred to when it provides concrete benefit (e.g., GPU decode). #### Norm architecture @@ -310,21 +317,26 @@ transpose. 
```rust
let dist_table = precompute_product_table(&centroids);
-// At b=4: 16×16 = 256 floats = 1KB, fits in L1
+// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1
+
+let mut distances = [0.0f32; 64];
+let mut unit_dots = [0.0f32; 64];
+let mut offset = 0;

for tq_block in 0..k {
  for dim in 0..B {
    let qd = query_codes[tq_block * B + dim];
    let row = &dist_table[qd as usize];
-    for v in 0..64 { // SIMD-friendly
+    for v in 0..64 { // SIMD-friendly: no inter-vector deps
      unit_dots[v] += row[codes[offset] as usize];
      offset += 1;
    }
  }
+  // Weight per-block unit-norm dot product by both vectors' block norms
  for v in 0..64 {
    distances[v] +=
      query_norms[tq_block] * data_norms[v][tq_block] * unit_dots[v];
-    unit_dots[v] = 0.0;
+    unit_dots[v] = 0.0; // reset for next TQ block
  }
}
```

@@ -441,7 +453,7 @@ approach, despite more blocks, because each block is smaller.
- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL
  vs. MSE-only
- Key metric: ANN recall@k on standard benchmarks (SIFT, GloVe)
-- Per community findings, MSE-only is expected to win [7]
+- Per community findings, MSE-only is expected to win [8]

### Straggler handling (if needed)

@@ -455,7 +467,10 @@ to merge MSE-only (no QJL). This is a complete encoding for all dimensions
(with padding for non-power-of-2).

**Phase 2** — Block decomposition: Add block splitting for non-power-of-2
-dimensions. Externalize norms. B = largest power-of-2 ≥ 64 dividing d.
+dimensions. Externalize norms. B = largest power-of-2 ≥ 64 dividing d. The
+`TurboQuantScheme::compress()` method must be updated to: (a) choose B based on
+d, (b) split input into blocks, (c) normalize per-block, (d) encode each block,
+and (e) store per-block norms in the parent encoding layer.

**Phase 3** — PDX layout: Dimension-major code transposition within 64-vector
chunks. Distance computation kernels.
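The split/normalize steps of the Phase 2 compress path ((b), (c), and (e)) can be sketched as follows; names are illustrative, not the actual `TurboQuantScheme::compress()` signature, and steps (a) and (d) (block-size choice, per-block SORF + Max-Lloyd encode) are elided:

```rust
/// Illustrative sketch: split x into d/B blocks, normalize each onto the unit
/// sphere, and return the per-block norms for the parent encoding layer.
/// Each unit block would then go through the per-block SORF + Max-Lloyd encode.
fn split_and_normalize(x: &[f32], block_dim: usize) -> (Vec<Vec<f32>>, Vec<f32>) {
    assert_eq!(x.len() % block_dim, 0, "d must be a multiple of B");
    let mut unit_blocks = Vec::new();
    let mut norms = Vec::new();
    for block in x.chunks_exact(block_dim) {
        let norm = block.iter().map(|v| v * v).sum::<f32>().sqrt();
        norms.push(norm);
        if norm > 0.0 {
            unit_blocks.push(block.iter().map(|v| v / norm).collect());
        } else {
            // Zero-norm block: keep zeros; the stored norm of 0 makes the
            // reconstruction exact regardless of the codes.
            unit_blocks.push(block.to_vec());
        }
    }
    (unit_blocks, norms)
}
```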
@@ -491,10 +506,10 @@ decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) The codebook gather + inverse rotation + norm scaling can be fused into a single kernel following the double-buffered streaming pattern from Flash-KMeans [6]. -For distance computation without full decode, a precomputed B²-entry distance -table (1 KB at b=4) fits in shared memory; the kernel streams code bytes from -HBM with gather-reduce accumulation, using 4-8× less bandwidth than full float -vectors. +For distance computation without full decode, a precomputed (2^b_mse)²-entry +distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the +kernel streams code bytes from HBM with gather-reduce accumulation, using +4-8× less bandwidth than full float vectors. At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires approximately linear centroids (sacrificing Max-Lloyd optimality); viable for @@ -523,5 +538,11 @@ arXiv:2603.09229, March 2026. [7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, -March 2026. Community implementations: tonbistudio/turboquant-pytorch, -ggml-org/llama.cpp#20969, 0xSero/turboquant, others. +March 2026. + +[8] Community TurboQuant implementations and findings. Key sources: +tonbistudio/turboquant-pytorch (PyTorch, V3 MSE-only findings), +ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), +0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), +scos-lab/turboquant (reference reproduction). Consensus: MSE-only beats +MSE+QJL for attention and ANN ranking at all tested bit widths.