diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index fc1588f..9a53d81 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -10,18 +10,23 @@ We propose evolving the [TurboQuant vector quantization encoding][current-impl] in three stages: 1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only - encoding. This is a complete, self-contained building block. -2. **Block decomposition** (next): for non-power-of-2 dimensions, split into - blocks of size B = the largest power-of-2 ≥ 64 that divides d. For - power-of-2 dimensions, B = d (single block, same as current). Per-block - norms stored as internal children. -3. **PDX layout** (later): within each block, transpose codes into groups of - 64 vectors for SIMD scan performance. + encoding with d ≥ 128 scheme selection (see Minimum dimension; smaller d + available via explicit construction). This is a complete, self-contained + building block. +2. **Block decomposition** (next): for dimensions where a valid B exists + (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B. For + power-of-2 dimensions, B = d (single block). Dimensions with no qualifying + B fall back to padded single-block. Per-block norms stored as internal + children. +3. **PDX layout** (later): transpose codes into dimension-major order within + groups of 64 vectors for SIMD scan performance. QJL correction is deferred to a later stage and may ultimately be dropped. -Community findings from 6+ independent TurboQuant implementations consistently -show that MSE-only outperforms MSE+QJL for attention and ANN ranking in -practice [8]. +Community findings from multiple independent TurboQuant implementations +often show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. +For ANN ranking and vector-search workloads, the evidence is currently less +complete, so QJL should remain an empirical question rather than a settled +conclusion. 
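The stage 2 block-size rule can be sketched in a few lines. This is an illustrative sketch only — `choose_block_size` is a hypothetical name, not part of the current PR:

```python
def choose_block_size(d):
    """Greatest power-of-2 >= 64 that evenly divides d, or None when no
    such B exists (the caller falls back to the padded single-block path)."""
    best = None
    b = 64
    while b <= d:
        if d % b == 0:
            best = b
        b *= 2
    return best

assert choose_block_size(1024) == 1024  # power-of-2: single block, B = d
assert choose_block_size(768) == 256    # k = 768 / 256 = 3 blocks
assert choose_block_size(96) is None    # no qualifying B: padded fallback
```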
[current-impl]: https://github.com/vortex-data/vortex/pull/7167 @@ -40,10 +45,35 @@ embeddings. It works by: 3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction on the residual for unbiased inner product estimation (Theorem 2 in [1]). -The paper prescribes a full random orthogonal rotation (QR of Gaussian) for the -MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the paper -uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not an -orthogonal rotation); this distinction matters for the unbiasedness proof. +The paper prescribes a full random orthogonal rotation (QR decomposition of a +matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) +for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the +paper uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not +an orthogonal rotation); this distinction matters for the unbiasedness proof. + +**Comparison to Product Quantization.** TurboQuant's block decomposition (Stage +2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both +partition a vector into sub-vectors and quantize each independently. 
The key +differences are: + +| | TurboQuant | PQ | +| ---------------------- | --------------------------------------------------------------- | -------------------------------------------------------- | +| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) | +| Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** | +| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) | +| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound | +| Codebook training | None (centroids derived from theory) | Requires training pass over data | +| Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) | + +TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit +structure) for data-obliviousness (no training, provable bounds, no offline +index-training phase). Encode-time work (rotation + quantization) still applies. +In return, PQ and OPQ retain a major advantage in expressivity: they learn +sub-vector codebooks from data rather than applying an analytic scalar quantizer. +In practice this means TurboQuant is attractive when training-free operation, +simple deployment, and theoretical guarantees matter most, while PQ or OPQ may +still win empirically when a learned vector codebook can exploit dataset-specific +structure. ### Current Vortex implementation @@ -81,13 +111,15 @@ norms) while sharing rotation signs and centroids. Quantized cosine similarity and dot product operate directly on codes and centroids without decompression. L2 norm returns the stored norm directly (O(1) readthrough). -**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the -BtrBlocks cascading compressor. 
It matches `Vector` and `FixedShapeTensor` -extension arrays with non-nullable float elements and dimension ≥ 3, using the -default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). +**Compression scheme (pre-Stage 1).** `TurboQuantScheme` implements the `Scheme` +trait for the BtrBlocks cascading compressor. It matches `Vector` and +`FixedShapeTensor` extension arrays with non-nullable float elements and +dimension ≥ 3 (to be raised to ≥ 128 in Stage 1; see Minimum dimension below), +using the default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42). -**Input handling.** All float types (f16, f32, f64) are converted to f32 before -quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 +**Input handling (pre-Stage 1).** All float types (f16, f32, f64) are converted +to f32 before quantization. Per-vector L2 norms are computed and stored as f32 +(Stage 1 changes this to dtype-matching: f64 for f64 input). Non-power-of-2 dimensions are zero-padded to the next power of 2 for SORF compatibility. The minimum dimension is 3 (d=2 causes a singularity in the Beta distribution exponent). @@ -110,26 +142,34 @@ spacings (we cast to f32 before quantization). See [7] for the full list. There is an ambiguity in the paper's notation for the MSE bound constant. The formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72. -The Eviox report [7] interprets the notation as `√(3π)/2 ≈ 1.535`, but this is -incorrect: the measured distortion values from the paper (b=2: 0.117, b=3: 0.03) -exceed the putative `√(3π)/2` bound (b=2: 0.096, b=3: 0.024), confirming that -2.72 is the correct constant. The paper's "explicit values" (0.36, 0.117, 0.03, -0.009) are the actual computed distortion of the optimal quantizer, not the -bound itself — they are well below the 2.72/4^b bound. +The Eviox report [7] (Item 7) deliberately adopts the alternative parsing +`√(3π)/2 ≈ 1.535`, claiming it is "consistent with the formal proof." 
We treat +`√3·π/2 ≈ 2.72` as the theorem constant because: (a) the paper's prose +describes the constant as "≈ 2.7," which matches 2.72 not 1.535; and (b) the +paper's reported distortion values (b=2: 0.117, b=3: 0.03) exceed the 1.535- +based bound (b=2: 0.096, b=3: 0.024), ruling out `√(3π)/2` as a valid +**upper** bound on the measured quantity. The definitive resolution requires +checking the exact LaTeX grouping in the ICLR 2026 camera-ready proof. The +paper's "explicit values" (0.36, 0.117, 0.03, 0.009) are the actual computed +distortion of the optimal quantizer, not the bound itself — they are well below +the 2.72/4^b bound. ### Community findings on QJL -Multiple independent TurboQuant implementations have converged on a -significant practical finding: **MSE-only consistently outperforms MSE+QJL for -attention and ANN ranking**. The mechanism is a variance-bias tradeoff: -TurboQuant's QJL correction eliminates bias but increases variance, and softmax -attention (and cosine/L2 ranking) amplifies variance more than bias. At the same -total bit budget, allocating all bits to MSE (more centroids, lower variance) -beats splitting between MSE + QJL (fewer centroids + 1-bit correction). This has -been confirmed by 6+ groups across Python, C, and Rust implementations [8]. - -This finding strongly supports making MSE-only the default strategy for our -columnar storage use case (ANN search, cosine similarity ranking). +Multiple independent TurboQuant implementations have repeatedly reported a +practical finding for **KV-cache attention**: MSE-only often outperforms MSE+QJL +at the same bit budget. The likely mechanism is a variance-bias tradeoff: QJL +removes bias in raw inner-product estimation but adds variance, and the softmax +nonlinearity amplifies variance more than it penalizes bias. In that setting, +allocating all bits to MSE (more centroids, lower quantization variance) can beat +splitting the budget between MSE + QJL. 
This behavior has been reported by +multiple groups across Python, C, and Rust implementations [8]. + +For ANN search, cosine ranking, and other non-softmax vector-search workloads, +the evidence is currently less settled. MSE-only is still a reasonable default +because it is simpler and better supported by the current implementation work, +but the ANN question should be treated as empirical until evaluated on ANN +datasets with recall@k and ranking metrics (see Experimental plan). ### Current limitations @@ -147,8 +187,10 @@ The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) describes a dimension-major layout within fixed-size blocks of 64 vectors, enabling the compiler to auto-vectorize the inner distance loop over vectors -rather than dimensions, achieving on average 2× speedups over SIMD-optimized -row-major kernels on modern CPUs. The block size of 64 is empirically optimal +rather than dimensions. In the paper, this yields average speedups of about 40% +over SIMD-optimized row-major kernels for the direct kernel comparison, while +dimension-pruning methods (ADSampling, BSA) recover much larger gains (2-7×) +when paired with the PDX layout [4]. The block size of 64 is empirically optimal across AVX-512, AVX2, and NEON architectures [4]. **PDX implementation evolution.** The [open-source implementation][pdx-impl] @@ -184,24 +226,26 @@ could skip entire TQ blocks (B dimensions at a time) if the partial distance already exceeds the candidate threshold. This combines the storage efficiency of quantization with the computational savings of early termination. 
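A minimal pure-Python sketch of the dimension-major scan idea described above (illustrative only; `to_dimension_major` and `pruned_l2_scan` are hypothetical names, and a real kernel would use SIMD over 64-vector groups rather than Python loops):

```python
def to_dimension_major(chunk):
    """Transpose a row-major chunk (list of equal-length vectors) so all
    values for one dimension are contiguous: out[dim][v] == chunk[v][dim].
    PDX uses chunks of 64 vectors; the sketch works for any chunk width."""
    dims = len(chunk[0])
    return [[vec[d] for vec in chunk] for d in range(dims)]

def pruned_l2_scan(dm_chunk, query, block_size, threshold):
    """Accumulate squared L2 distances dimension-by-dimension for all
    vectors at once, dropping a vector once its partial distance exceeds
    the threshold; pruning is checked only at TQ-block boundaries."""
    n = len(dm_chunk[0])
    partial = [0.0] * n
    alive = set(range(n))
    for dim, values in enumerate(dm_chunk):
        for v in alive:
            diff = values[v] - query[dim]
            partial[v] += diff * diff
        if (dim + 1) % block_size == 0:  # TQ-block boundary checkpoint
            alive = {v for v in alive if partial[v] <= threshold}
    return partial, alive
```

Because pruned vectors are skipped for all remaining dimensions, block boundaries act as natural checkpoints for skipping whole TQ blocks.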
-[pdx-impl]: https://github.com/cwida/PDX +[pdx-impl]: https://github.com/cwida/PDX "specific files: `include/pdx/quantizers/scalar.hpp` for SQ8, `include/pdx/pruners/adsampling.hpp` for ADSampling/DCT, `include/pdx/layout.hpp` for int8 interleaving, `include/pdx/distance_computers/avx512_computers.hpp` for VPDPBUSD kernels" ## Proposal ### Block size strategy -For each dimension d, choose B = the largest power-of-2 ≥ 64 that evenly -divides d. This eliminates stragglers entirely for common embedding dimensions: - -| Dimension d | Block size B | Blocks k | Notes | -| ----------- | ------------ | -------- | --------------------------- | -| 512 | 512 | 1 | Single block (= current TQ) | -| 768 | 256 | 3 | Largest dividing power-of-2 | -| 1024 | 1024 | 1 | Single block | -| 1536 | 512 | 3 | | -| 2048 | 2048 | 1 | Single block | -| 3072 | 1024 | 3 | | -| 4096 | 4096 | 1 | Single block | +For each dimension d, choose B = the greatest power-of-2 ≥ 64 that evenly +divides d. If no such B exists (e.g., d=96), fall back to the padded +single-block path from Stage 1. For common embedding dimensions, this rule +always produces a valid B and eliminates padding entirely: + +| Dimension d | Block size B | Blocks k | Notes | +| ----------- | ------------ | -------- | ---------------------------- | +| 512 | 512 | 1 | Single block (= current TQ) | +| 768 | 256 | 3 | Greatest dividing power-of-2 | +| 1024 | 1024 | 1 | Single block | +| 1536 | 512 | 3 | | +| 2048 | 2048 | 1 | Single block | +| 3072 | 1024 | 3 | | +| 4096 | 4096 | 1 | Single block | **Key observations:** @@ -210,16 +254,48 @@ divides d. This eliminates stragglers entirely for common embedding dimensions: No block decomposition overhead, no per-block norms. These dimensions are already well-served by the current design. - **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at - B=256 or B=512. Zero padding waste. Each block has its own SORF rotation and - shares a single centroid set. 
-- **Stragglers are eliminated** for all common embedding dimensions. Dimensions - that are not multiples of 64 (e.g., 100, 200) would need straggler handling, - but these are rare in practice for modern model architectures. + B=256 or B=512. No padding waste (vs. 33% for the padded single-block path). + Each block has its own SORF rotation and shares a single centroid set. +- **No qualifying B is rare** for common embedding dimensions. Dimensions where + no power-of-2 ≥ 64 divides d (e.g., 96, 100) fall back to Stage 1's padded + single-block path. These are uncommon in modern model architectures. - **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable to the current B=1024 (30 stages). This needs empirical validation; see Experimental plan. +### Minimum dimension + +The compression scheme should only select TurboQuant for vectors with +dimension ≥ 128. Below this threshold, several factors degrade quality and +efficiency: + +- **SORF mixing quality:** 3-round SORF at d=64 provides only 18 butterfly + stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates + more from the analytical Beta, making Max-Lloyd centroids less optimal. +- **Practical MSE:** At smaller d, the SORF mixing quality and coordinate- + independence approximations are weaker, potentially worsening practical + quantization quality beyond what the dimension-free theoretical bound + captures. The actual MSE at each d is an empirical question. +- **Overhead ratio:** Per-vector norm (32 bits) is a larger fraction of the + compressed representation at small d. At d=32, b=5: codes=160 bits, + norm=32 bits, total=192 — norm is ~17% of compressed size. At d=768: <1%. +- **Diminishing returns for high bit widths:** With fewer coordinates, the + fine-grained centroid structure of high-b quantization has less to exploit. 
+ +The threshold of 128 is conservative: + +- d=128 (SIFT) is the smallest dimension in our recommended benchmark table. +- SORF at d=128 has 21 butterfly stages — tested and adequate in the current + implementation. +- The block-size rule produces B=128 for d=128 (single block, no decomposition). + +The array-level minimum remains d=3 (for the Beta distribution to be +well-defined), so users can still explicitly construct a TurboQuantArray at +smaller dimensions. The scheme minimum (128) controls automatic selection only. + +The exact threshold should be validated experimentally — see Experimental plan. + ### Stage 1: MSE-only TurboQuant (immediate — split from current PR) Split the [current PR][current-impl] to extract and merge the MSE-only subset. @@ -234,24 +310,29 @@ The QJL code can be preserved on a separate branch for Phase 4. | Scheme default | 5-bit QJL (4-bit MSE + 1-bit QJL) | **5-bit MSE-only** (32 centroids) | | Norms dtype | Always f32 | **Same-or-wider**: f64 for f64 input, f32 for f32/f16 | | Metadata | `has_qjl: bool` | **Removed** (always MSE-only) | +| Scheme minimum | dimension ≥ 3 | **dimension ≥ 128** (see Minimum dimension below) | **Unchanged from current PR:** SORF rotation, Max-Lloyd centroids, zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized -cosine similarity and dot product, compression scheme integration, minimum dim=3. +cosine similarity and dot product, compression scheme integration. **Added to metadata (for forward compat):** `block_size: u32` (always = -padded_dim), `num_blocks: u32` (always = 1), `is_pdx: bool` (always = false). -These fields are inert in Stage 1 but enable Stage 2/3 decoders to read -Stage 1 files. +padded_dim), `num_blocks: u32` (always = 1). These fields are inert in Stage 1 +but enable Stage 2 decoders to read Stage 1 files. (PDX is handled via the +codes child type, not a metadata flag — see Stage 3.) -This is a complete, useful encoding for all dimensions. 
Power-of-2 dimensions +This is a complete, useful encoding for all dimensions ≥ 3 (automatic scheme +selection applies only for d ≥ 128; smaller d remains available via explicit +array construction). Power-of-2 dimensions have zero padding waste; non-power-of-2 dimensions have the padding overhead described above. ### Stage 2: Block decomposition -For non-power-of-2 dimensions, split into blocks of size B (as determined by the -table above). Each full block gets an independent B-dim SORF rotation. +For dimensions where the block-size rule produces a valid B (see table above), +split into blocks of size B. Each full block gets an independent B-dim SORF +rotation. Dimensions with no qualifying B (e.g., d=96) remain on the padded +single-block path from Stage 1. **Changes vs. Stage 1:** @@ -328,13 +409,27 @@ Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly only with a true random orthogonal rotation or with empirical SORF validation (see Experimental plan). -Assuming the per-block MSE bound holds, for a vector split into blocks: +Assuming the per-block MSE bound holds, for a vector split into blocks the +first line is an **algebraic** identity (exact); the inequality on the second +line applies Theorem 1's **probabilistic** bound to each block and should be +read as holding in **expectation** over independent per-block rotations, not +almost surely: ``` -‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) - ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound +‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) (exact) + E[...] ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound (in expectation) ``` +The conclusion: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent +per-block rotations. (Theorem 1 applies because each block is normalized to +unit norm before rotation and quantization; the per-block encoding pipeline is: +split → normalize → rotate → quantize, matching the theorem's unit-sphere +assumption.) 
Note that TurboQuant's original analysis uses a single +global rotation in high-d where coordinates are nearly independent; with +smaller block dimension B, within-block coordinate dependence after rotation may +be stronger even when marginals are correct — this is an additional motivation +for the experimental plan's comparison of block sizes. + The actual MSE may depend on block dimension B: at larger B the coordinate distribution is more concentrated (variance ~1/B), giving the Max-Lloyd quantizer more to exploit. See Experimental plan. @@ -353,19 +448,25 @@ vectors). Each block must have an **independent** rotation matrix. **Why not DCT?** The PDX implementation [pdx-impl] uses DCT (via FFTW) as a fast rotation for ADSampling. DCT is O(B log B) and invertible, but it is a **fixed structured transform**, not a random rotation — it does not produce the Beta -marginal distribution `(1-x²)^((d-3)/2)` that TurboQuant's Max-Lloyd centroids -are optimized for. ADSampling only needs approximate coordinate independence +marginal distribution `(1-x²)^((B-3)/2)` (in block dimension B) that +TurboQuant's Max-Lloyd centroids are optimized for. ADSampling only needs +approximate coordinate independence (for hypothesis-testing pruning), so DCT suffices there. TurboQuant needs a specific known marginal distribution, so only random orthogonal rotations (QR or SORF) are suitable. -**Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a -random orthogonal rotation to make coordinates independent. If we integrate -ADSampling-style dimension pruning (see Stage 3), the same rotation could serve -both purposes: producing the Beta distribution for quantization AND enabling -hypothesis-testing for early pruning. This would avoid rotating the data twice -and is a natural future optimization when combining block-TurboQuant with -PDX-style scans. 
+**Shared rotation with ADSampling (speculative).** Both TurboQuant and
+ADSampling apply a random orthogonal rotation to make coordinates independent.
+If we integrate ADSampling-style dimension pruning (see Stage 3), the same
+rotation could in principle serve both purposes. However, this is not automatic
+under the Stage 2 block-decomposed design: ADSampling is formulated around a
+single full-dimensional random projection whose coordinates can be sequentially
+sampled, whereas Stage 2 introduces per-block rotations and per-block norm
+weighting. Reusing one rotation across both systems should be treated as a
+**future research direction** that requires new analysis or direct empirical
+validation. If it proves viable, it would avoid rotating the data twice. The
+query would also need to be rotated at query time with the same stored
+transform.

#### Quantized-domain operations

@@ -383,71 +484,112 @@ centroids[code_bₖ[j]]`.

#### Encoding algorithm

```
Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B
k = d / B            (exact division, no straggler for chosen B)
num_centroids = 2^b_mse

# Block split and normalize
for i in 0..k:
    xᵢ = x[i*B .. (i+1)*B]
    nᵢ = ‖xᵢ‖
    if nᵢ > 0:
        ûᵢ = xᵢ / nᵢ
    else:
        ûᵢ = zeros(B)

# MSE stage (per block, SORF rotation)
for i in 0..k:
    if nᵢ > 0:
        rᵢ = SORFᵢ(ûᵢ)
        cᵢ[j] = nearest_centroid(rᵢ[j])
    else:
        cᵢ[j] = 0

Store (all as internal children):
    codes (k × B per vector), norms (k per vector),
    centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
```

#### Decoding algorithm

```
for i in 0..k:
    r̂ᵢ[j] = centroids[cᵢ[j]]
    ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
    x̂ᵢ = nᵢ × ûᵢ            (nᵢ read from internal norms child)

x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
```

### Stage 3: PDX dimension-major layout

-Transpose code storage from row-major to dimension-major within groups of 64
-vectors [4]. The 64-vector group size is independent of B.
+Introduce a new `PDXArray` encoding type that wraps any `FixedSizeListArray`
+with a dimension-major layout within groups of 64 vectors [4]. PDXArray is
+**not TurboQuant-specific** — it is a general-purpose layout optimization for
+any FixedSizeList of scalar elements (raw float vectors, scalar-quantized
+vectors, TurboQuant codes, etc.).

**Changes vs.
Stage 2:** -| Aspect | Stage 2 | Stage 3 | -| ---------------------- | ------------------------------------------------ | ----------------------------------------------------------------- | -| Codes layout | Row-major (all codes for one vector contiguous) | **Dimension-major within 64-vector chunks** | -| Metadata | `is_pdx = false` | **`is_pdx = true`** | -| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | -| Decode path | Direct inverse SORF per vector | **Un-transpose 64-vector chunk first**, then inverse SORF | -| QJL signs (if present) | Row-major | **Also transposed** (same PDX layout as codes) | +| Aspect | Stage 2 | Stage 3 | +| ---------------- | ------------------------------------------------ | ------------------------------------------------------------------------------- | +| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | +| Codes detection | N/A (codes always FSL) | **TQ checks child type**: FSL → row-major decode, PDXArray → un-transpose first | +| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | +| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | **Unchanged from Stage 2:** Block size B, centroid computation, norm storage, -SORF rotation, all encoding logic (PDX transpose is applied after encoding). -The encode path produces row-major codes then transposes; the decode path -un-transposes then decodes. +SORF rotation, all encoding logic. The encode path produces row-major codes +(FSL), then the compressor wraps them in a PDXArray; the decode path converts +PDXArray back to FSL then decodes. 
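The transpose/un-transpose round-trip can be modeled in a few lines of pure Python. The flat-buffer index math here is illustrative — the real `PDXArray` element layout may differ:

```python
CHUNK = 64  # PDX group size: 64 vectors per chunk

def pdx_index(row, dim, list_size):
    """Flat index of (row, dim) in the dimension-major buffer: chunks of
    64 rows, each chunk storing one contiguous 64-value run per dim."""
    chunk, lane = divmod(row, CHUNK)
    return chunk * CHUNK * list_size + dim * CHUNK + lane

def transpose_to_pdx(rows, list_size):
    """FSL (row-major) -> PDX flat buffer. Assumes len(rows) % 64 == 0."""
    flat = [0] * (len(rows) * list_size)
    for r, vec in enumerate(rows):
        for d, v in enumerate(vec):
            flat[pdx_index(r, d, list_size)] = v
    return flat

def untranspose_from_pdx(flat, num_rows, list_size):
    """PDX flat buffer -> FSL rows (the to_fsl() direction)."""
    return [[flat[pdx_index(r, d, list_size)] for d in range(list_size)]
            for r in range(num_rows)]
```

Note that all 64 values for one dimension of one chunk are contiguous in the flat buffer, which is what makes O(1) per-dimension slicing possible.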
+ +**PDXArray design:** + +``` + +PDXArray (general-purpose dimension-major layout for FixedSizeList) +├── metadata: { list_size, chunk_size (= 64) } +├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous +├── validity: ... # same as FSL validity + +``` + +- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout +- `PDXArray::to_fsl()` — un-transposes back to row-major FSL (for decode, + scalar_at, or non-aligned slice/take) +- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice + of 64 values for one dimension within one chunk +- Slice/take: un-transpose to FSL (simplest). Un-transpose cost is + O(rows × list_size) per operation; consider 64-row-aligned fast paths for + hot scan workloads. Preserving PDX layout is possible only for + 64-vector-aligned ranges. +- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. + +**Benefits of PDXArray as a separate type:** + +- PDX logic tested and maintained independently of TurboQuant +- Other encodings (raw float vectors, scalar quantization, future encodings) + get PDX scan performance for free +- TurboQuant doesn't need an `is_pdx` metadata flag — it checks its codes + child's type at runtime +- The distance kernel operates on PDXArray's dimension-contiguous slices Within each 64-vector chunk, codes are stored dimension-major: ``` -TQ block 0, dim 0: [v0 v1 v2 ... v63] -TQ block 0, dim 1: [v0 v1 v2 ... v63] + +TQ block 0, dim 0: [v0 v1 v2 ... v63] +TQ block 0, dim 1: [v0 v1 v2 ... v63] ... -TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] -TQ block 1, dim 0: [v0 v1 v2 ... v63] +TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] +TQ block 1, dim 0: [v0 v1 v2 ... v63] ... + ``` The inner SIMD loop (64 vectors) has no inter-vector dependencies. 
TQ block @@ -484,12 +626,15 @@ for tq_block in 0..k { **Int8 layout variant.** The PDX implementation [pdx-impl] uses a different tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware -dot-product instructions. For TurboQuant codes at b_mse ≤ 8, codes are u8 -centroid indices (not linear values), so VPDPBUSD doesn't apply directly — we -need the distance-table-lookup path shown above. However, if we support a linear -quantization mode (b_mse=8 with uniform centroids), the "4 dims × 16 vecs" -layout could enable direct hardware dot-product on the codes, bypassing the -lookup table entirely. This is a potential Stage 3 optimization to evaluate. +dot-product instructions (which process 4 unsigned×signed byte pairs per +operation). For TurboQuant codes at b_mse ≤ 8, codes are uint8 centroid indices, +so VPDPBUSD doesn't apply directly — we need the distance-table-lookup path +shown above. However, at b_mse=8 with high B, the Max-Lloyd centroids are +near-uniformly spaced (see GPU section), potentially enabling direct hardware +dot-product on the codes. Whether this requires a separate linear quantization +mode or works with the existing Max-Lloyd centroids is an empirical question. The +"4 dims × 16 vecs" layout would be a Stage 3 optimization to evaluate alongside +the "1 dim × 64 vecs" float-style layout. **ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4]) is complementary to TurboQuant's block structure. During a scan, the pruner @@ -500,12 +645,12 @@ boundaries (as shown in the kernel above), which our design already provides. **Open design questions:** -- Slice/take on PDX-transposed codes: produce row-major (simpler) or preserve - PDX (aligned 64-vector slices only)? -- Is PDX a property of the encoding or a separate layout layer? -- How does the compressor see the transposed codes? -- Should we support the "4 dims × 16 vecs" int8 layout variant alongside the - "1 dim × 64 vecs" float-style layout? 
+- Should PDXArray live in `vortex-array` (general infrastructure) or + `vortex-tensor` (vector-specific)? +- Should the cascade compressor automatically PDX-transpose FSL children when + it detects a scan-heavy workload, or should PDX be opt-in? +- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware + dot-product) alongside the "1 dim × 64 vecs" float-style layout? ### QJL correction (deferred — experimental) @@ -517,7 +662,7 @@ validated. | Aspect | MSE-only | MSE + QJL | | ---------------------- | -------------------------------- | --------------------------------------------------------------- | | Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) | -| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction, Theorem 2 [1]) | +| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see TurboQuant_prod in [1]) | | Additional children | None | QJL signs, QJL residual norms, QJL projection params | | Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) | | Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection | @@ -533,11 +678,15 @@ If pursued, four strategies should be compared: The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically for Gaussian. SORF for QJL is an additional approximation (the -[current implementation][current-impl] uses SORF for QJL). Per-block QJL has -d/B times more variance than full-dimension QJL (Lemma 4 [1]). +[current implementation][current-impl] uses SORF for QJL). Per-block QJL can +incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 +[1]), depending on how query and residual energy are distributed across blocks. -The community consensus is that MSE-only likely wins for ANN ranking at all -bit widths, so QJL may not be worth the complexity. 
+Community reports indicate MSE-only often wins for KV-cache attention at all +tested bit widths [8]. Whether this extends to ANN ranking is an empirical +question (see Experimental plan); QJL may not be worth the complexity. Note: +the [current PR][current-impl] flags a known SORF-related QJL bias for +non-power-of-2 padded dimensions (#7245); MSE-only Stage 1 avoids this path. ## Array layout @@ -546,10 +695,11 @@ bit widths, so QJL may not be worth the complexity. ``` TurboQuantArray ├── metadata: { dimension, b_mse, block_size (= padded_dim), -│ num_blocks (= 1), is_pdx (= false) } +│ num_blocks (= 1) } │ │ # Per-row children ├── codes: FixedSizeListArray # list_size = padded_dim +│ (or PDXArray after Stage 3) ├── norms: PrimitiveArray # len = num_rows (F = f64 for f64, f32 otherwise) │ │ # Shared children @@ -558,16 +708,19 @@ TurboQuantArray ``` Same structure as the [current PR][current-impl] minus the 3 QJL slots, plus -the forward-compatible metadata fields and dtype-matching norms. +the forward-compatible metadata fields and dtype-matching norms. The codes child +is `FixedSizeListArray` in Stages 1-2 and may be swapped to `PDXArray` in Stage +3 — TurboQuant checks the child type at runtime, not via a metadata flag. 
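The runtime child-type dispatch might look like the following sketch (hypothetical stand-in classes; actual Vortex types and APIs differ):

```python
# Stand-ins for the two possible encodings of the codes child.
class FixedSizeListArray:
    def __init__(self, rows):
        self.rows = rows

class PDXArray:
    def __init__(self, fsl):
        self._fsl = fsl

    def to_fsl(self):
        return self._fsl  # un-transpose back to row-major

def row_major_codes(codes_child):
    """TurboQuant decode path: dispatch on the child's encoding rather
    than on an is_pdx metadata flag."""
    if isinstance(codes_child, PDXArray):
        codes_child = codes_child.to_fsl()
    return codes_child.rows
```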
### Stage 2 (block decomposition) ``` TurboQuantArray (self-contained, handles blocks internally) -├── metadata: { dimension, b_mse, block_size, num_blocks, is_pdx } +├── metadata: { dimension, b_mse, block_size, num_blocks } │ │ # Per-row children (sliced/taken on row operations) ├── codes: FixedSizeListArray # list_size = k × B +│ (or PDXArray after Stage 3) ├── norms: PrimitiveArray # len = num_rows (k=1) │ or FixedSizeListArray # list_size = k (k>1) │ @@ -578,7 +731,8 @@ TurboQuantArray (self-contained, handles blocks internally) ## Compression ratio -For f32 input, b_mse bits MSE, k = d/B blocks, N vectors: +For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input, +replace 32 with 64 in the norms row — ratios decrease accordingly): | Component | Bits per vector | | ----------- | --------------- | @@ -592,20 +746,27 @@ For f32 input, b_mse bits MSE, k = d/B blocks, N vectors: ### Worked examples (f32, b_mse=5, N=1000) -| d | B | k | Per-vec bits | Ratio | Notes | -| ------------- | ---- | --- | --------------------- | ----- | -------------------------- | -| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; zero padding | -| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | -| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | +| d | B | k | Per-vec bits | Ratio | Notes | +| ------------- | ---- | --- | --------------------- | ----- | ------------------------ | +| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding | +| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | +| 768 (current) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | -Block decomposition improves d=768 from 4.8× to 6.2× — a 30% storage -improvement. For d=1024 the encoding is identical to current. 
+Block decomposition improves the compression ratio for d=768 from ~4.8× to +~6.2× (about 29% higher ratio; equivalently, about 24% fewer compressed bits +per vector: 5152 → 3936). For d=1024 the encoding is identical to current. + +**Shared overhead note:** centroids and SORF signs are amortized over N vectors; +for small N, per-column shared metadata is significant — report totals with and +without amortization when publishing ratios. ## Performance analysis ### Encode/decode throughput -SORF at B dimensions: 3 × B × log₂(B) + 3 × B FLOPs per block. For k blocks: +SORF at B dimensions (heuristic — real cost is dominated by memory bandwidth +and constant factors): 3 × B × log₂(B) butterflies + 3 × B sign applications +per block (plus B normalization multiplies, omitted). For k blocks: | B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | | -------------- | ------------------------- | --------- | --------------- | @@ -625,54 +786,122 @@ approach, despite more blocks, because each block is smaller. ## Experimental plan -### MSE quality vs. block size +### Minimum dimension threshold + +Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme +minimum of 128: + +- Compare TurboQuant MSE distortion and ANN recall@k against scalar + quantization matched on **total compressed bits per vector** (codes + norm + + amortized shared metadata), not just bits-per-coordinate — this is critical + at small d where norm overhead is significant +- Plot the crossover point: at what d does TurboQuant's recall@k drop below + the rate-matched scalar baseline? +- Test SORF coordinate distribution quality at each d (histogram vs. Beta) +- Measure overhead ratio (norm bits / total compressed bits) at each d + +The scheme minimum should be set at the smallest d where TurboQuant reliably +beats the scalar baseline on recall@k across the benchmarking datasets. 
Default +scalar baseline: per-dimension linear min-max quantization at b bits per +coordinate plus an f32 norm (matching TurboQuant's norm overhead). Report +results at a reference N (e.g., N=100K vectors) where shared metadata is +amortized; optionally show sensitivity to small N where shared costs dominate. +The current proposal of 128 is conservative; experiments may justify lowering +to 64 or raising to 256. + +### MSE quality and scan performance vs. block size - Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-SORF at padded dimension, at bit widths b ∈ {2, 3, 4, 5, 8} +- Compare ANN recall@k and scan throughput at fixed d (e.g., d=3072) across + B ∈ {256, 512, 1024} — smaller B gives more pruning checkpoints for + ADSampling-style early termination but increases norm overhead - Test SORF coordinate distribution at each B: histogram vs. analytical Beta - Test 3, 4, 5 SORF rounds at each B - Determine if the practical MSE constant is worse at smaller B +The block-size rule ("greatest qualifying B") is a starting heuristic that +maximizes per-block quality and minimizes norm count. Experiments may show that +smaller B with more pruning checkpoints yields better end-to-end scan +performance despite higher per-block overhead. + ### QJL strategy comparison (if pursued) - Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL vs. MSE-only -- Key metric: ANN recall@k on standard benchmarks (SIFT, GloVe) -- Per community findings, MSE-only is expected to win [8] +- Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT) +- Per community findings for attention, MSE-only is expected to win [8]; ANN + ranking is the key open question + +### Benchmarking datasets + +The current test suite uses i.i.d. Gaussian vectors as a theory anchor and +sanity check: for isotropic data, a random orthogonal transform is +distributionally neutral, which cleanly validates theoretical bounds. 
This is +not a universal "worst case" for all production workloads — heavy-tailed or +clustered embeddings can behave differently. Recent work +(VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no longer +representative of modern ANN workloads. + +**Recommended datasets:** + +| Dataset | Dim | Size | Source | Why | +| ----------------------------- | ------ | ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | +| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings | +| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | +| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | +| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | +| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 < scheme min (128) and has no B ≥ 64 — requires explicit TurboQuantArray construction or benchmark-only scheme override | +| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case | + +**Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): + +- Recall@10, Recall@100 (ANN ranking quality) +- Normalized MSE distortion (reconstruction quality) +- Inner product mean signed relative error (bias measurement) +- Encode/decode throughput (vectors/sec) + +The Gaussian baseline validates that theoretical bounds hold. The real-embedding +datasets measure practical quality — which may be **better** than Gaussian +(structured data benefits more from rotation) or **worse** (if the data has +adversarial properties for the specific rotation). -### Straggler handling (if needed) +### Dimensions with no qualifying B -Rare for common dimensions. If encountered: zero-pad to B (simplest). Follow-up: -dense rotation at actual dimension. +Rare for common embedding dimensions (e.g., d=96). 
These fall back to the +Stage 1 padded single-block path (pad to next power-of-2, single SORF). No +block decomposition is attempted. ## Phasing **Phase 1** — MSE-only single-block TurboQuant: Split the [current PR][current-impl] -to merge MSE-only (no QJL). This is a complete encoding for all dimensions -(with padding for non-power-of-2). +to merge MSE-only (no QJL). Scheme auto-selects for d ≥ 128; smaller d available +via explicit construction. Padding for non-power-of-2 dimensions. -**Phase 2** — Block decomposition: Add block splitting for non-power-of-2 -dimensions. B = largest power-of-2 ≥ 64 dividing d. Per-block norms stored as +**Phase 2** — Block decomposition: Add block splitting for dimensions where a +valid B exists (greatest power-of-2 ≥ 64 dividing d). Per-block norms stored as internal children. The `TurboQuantScheme::compress()` method must be updated to: (a) choose B based on d, (b) split input into blocks, (c) normalize per-block, (d) encode each block, and (e) store per-block norms as an internal child array. -**Phase 3** — PDX layout: Dimension-major code transposition within 64-vector -chunks. Distance computation kernels. +**Phase 3** — PDXArray + scan kernels: Introduce `PDXArray` as a general-purpose +dimension-major layout for `FixedSizeListArray`. TurboQuant's codes child is +swapped from FSL to PDXArray by the compressor. Distance computation kernels +operate on PDXArray's dimension-contiguous slices. **Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on -community findings, this may not be pursued. +KV-cache community reports [8], this may not be pursued. 
## Practical recommendations

For common model dimensions, the most promising configurations are:

-| Dimension | Recommendation | Rationale |
-| --------------------- | --------------------------- | -------------------------------------------------------------------------- |
-| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout.  |
-| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. Zero padding waste. 3 blocks, shared centroids. |
-| Arbitrary d (rare) | Padded single-block | Fall back to current approach. Padding overhead bounded by B-1 dims. |
+| Dimension              | Recommendation              | Rationale                                                                      |
+| ---------------------- | --------------------------- | ------------------------------------------------------------------------------ |
+| 512, 1024, 2048, 4096  | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout.     |
+| 768, 1536, 3072        | 3-block MSE-only + PDX      | B = 256, 512, 1024 respectively. No padding waste. 3 blocks, shared centroids. |
+| No qualifying B (rare) | Padded single-block         | Fall back to Stage 1: pad to next power-of-2, single SORF.                     |

In all cases, MSE-only is the recommended starting point. QJL should only be
added if experiments demonstrate clear recall@k improvements for the target
workload.

@@ -681,7 +910,8 @@ workload.

## Future work: GPU decode and fused distance computation

The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
-For a batch of N vectors sharing the same rotation matrix R⁻¹:
+For a single block (k=1; Stage 2 generalizes to k independent per-block GEMMs)
+with a batch of N vectors sharing the same rotation matrix R⁻¹:

```
decoded_batch =
  diag(norms) × R⁻¹ × codebook_lookup_batch(codes)
@@ -690,15 +920,75 @@ decoded_batch =
```

The codebook gather + inverse rotation + norm scaling can be fused into a single
-kernel following the double-buffered streaming pattern from Flash-KMeans [6].
+kernel using an IO-aware streaming pattern analogous to Flash-KMeans [6] — not
+the same algorithm (Flash-KMeans is GPU k-means), but a similar systems goal:
+reduce HBM traffic and avoid full materialization.

For distance computation without full decode, a precomputed (2^b_mse)²-entry
distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the
kernel streams code bytes from HBM with gather-reduce accumulation, using 4-8×
less bandwidth than full float vectors.

-At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires
-approximately linear centroids (sacrificing Max-Lloyd optimality); viable for
-ANN ranking but not reconstruction.
+At b_mse=8, codes are uint8 indices (0-255). Direct low-precision GEMM on
+hardware accelerators (tensor cores on GPU, byte-dot-product instructions on
+CPU) requires approximately linear centroids — and at high B the Max-Lloyd
+centroids may already be close: the coordinate distribution concentrates
+toward a Gaussian, and high-resolution optimal quantizer density follows the
+cube root of the pdf, so levels cluster near zero far less aggressively than
+the pdf itself, though not strictly uniformly. Whether the existing Max-Lloyd
+centroids are "linear enough" for hardware dot-product instructions is an
+empirical question worth testing before introducing a separate linear
+quantization mode.
+
+## Integration with Vortex scan engine
+
+TurboQuant's quantized-domain operations must integrate with Vortex's expression
+evaluation and scan pushdown infrastructure.
The current implementation provides +this via `ScalarFnVTable` implementations in `vortex-tensor`. + +**Current integration path.** The `CosineSimilarity`, `DotProduct`, and `L2Norm` +scalar functions check whether their input storage arrays are TurboQuant-encoded +(via `TurboQuant::try_match()`). If both operands are TurboQuant and the +`ApproxOptions::Approximate` flag is set, the scalar function dispatches to the +quantized-domain kernel (e.g., `cosine_similarity_quantized_column`), bypassing +full decompression. Otherwise, it falls back to the exact path (decompress → +compute on floats). + +**Stage 2 changes.** With block decomposition, the quantized kernels must be +updated to iterate over TQ blocks, weighting by per-block norms: + +- `cosine_similarity_quantized_column`: currently computes a single unit-norm + dot product per row pair. Must change to `Σ_k norm_a_k · norm_b_k · +unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. +- `dot_product_quantized_column`: same per-block weighting. +- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to + `√(Σ_k norm_k²)` — read the norms child (`PrimitiveArray` for k=1, + `FixedSizeListArray` for k>1) and compute. +- Both operands must have the **same block size B**, compatible centroids (same + `b_mse` and B-dim codebook), and **bit-identical MSE rotation parameters** + (`mse_rotation_signs` and same SORF construction) for the quantized + inner-product path to be valid. Two stored columns with different rotations + must **fall back to exact** (decompress → float). The common **column vs. + constant query** path avoids this: the query is re-encoded with the column's + rotation and centroids at query time. + +**Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a +new execution path that operates on `PDXArray`-typed codes. 
It should be exposed +as an alternative `ScalarFnVTable` implementation that activates when the codes +child is a `PDXArray` and the scan is over a contiguous 64-vector-aligned range. +For non-aligned ranges or single-vector access (`scalar_at`), the PDXArray is +converted to FSL first via `PDXArray::to_fsl()`. + +**Expression tree integration.** The typical ANN scan expression is: + +``` +top_k(cosine_similarity(column, constant_query), k=10) +``` + +The `constant_query` is broadcast to match the column length. The +`CosineSimilarity` scalar function receives both the column (TurboQuant-encoded) +and the query (ConstantArray wrapping a single vector). For the quantized path, +the query is first encoded with the column's rotation and centroids to produce +query codes and query block norms, then the PDX kernel runs over the column's +codes without decompressing them. ## Migration and compatibility @@ -706,10 +996,20 @@ TurboQuant has not shipped yet, so there are no existing files to migrate. We can design the metadata for forward compatibility from day one. **Strategy: single array ID, versioned metadata.** All stages use the same array -ID (`vortex.turboquant`). The metadata includes `block_size`, `num_blocks`, and -`is_pdx` fields from Stage 1 onward. Stage 1 always writes `num_blocks=1, -is_pdx=false`, but the fields exist so that Stage 2 and 3 decoders can read -Stage 1 files without migration. +ID (`vortex.turboquant`). The metadata includes `block_size` and `num_blocks` +fields from Stage 1 onward. Stage 1 always writes `num_blocks=1`, but the field +exists so that Stage 2 decoders can read Stage 1 files without migration. + +**Decoder invariant:** `block_size` is always the per-block SORF dimension B. +`codes.list_size` = `num_blocks × block_size`. The decoder **validates** +`num_blocks == codes.list_size / block_size` (exact integer division; reject +files where this does not hold). 
Note that `metadata.dimension` may differ +from `codes.list_size`: + +- Stage 1, non-power-of-2 d: `dimension=768`, `block_size=1024` (padded), + `list_size=1024`. `dimension < list_size` is expected; trailing code slots + are structural zeros from padding. +- Stage 2, no stragglers: `dimension = list_size = num_blocks × block_size`. **Norms are always internal children.** The TurboQuant array is self-contained — it stores norms as a child slot, not in a parent encoding. This means: @@ -723,24 +1023,29 @@ The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata. A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a new code path that only applies to files written by Stage 2+. -**Stage 3 (PDX) is additive.** The `is_pdx` flag in metadata tells the decoder -whether codes are row-major or dimension-major. Stage 1/2 files have -`is_pdx=false`; Stage 3 files have `is_pdx=true`. The decoder un-transposes -PDX files on read if needed. No migration required. +**Stage 3 (PDXArray) is additive.** PDX is not a TurboQuant metadata flag — it's +a separate array type (`PDXArray`) that wraps the codes child. Stage 1/2 files +have `FixedSizeListArray` codes; Stage 3 files have `PDXArray` codes. The +TurboQuant decoder checks the child type and un-transposes PDXArray on decode if +needed. `PDXArray` itself is registered as a new encoding, independent of +TurboQuant. **Incremental shipping:** -| Stage | Ships to users? | Reads Stage 1 files? | Notes | -| ------------ | ---------------- | ---------------------- | ----------------------------------- | -| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern | -| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder | -| 3 (PDX) | Yes | Yes (is_pdx=false) | PDX files need Stage 3 decoder | +| Stage | Ships to users? | Reads Stage 1 files? 
| Notes | +| ------------ | ---------------- | -------------------------- | ----------------------------------- | +| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern | +| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder | +| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered | Each stage is independently shippable. Users can upgrade incrementally. Files written by earlier stages are always readable by later decoders. ## References +_All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1. +The ICLR 2026 camera-ready proceedings may use different numbering._ + [1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874, April 2025. @@ -762,11 +1067,26 @@ arXiv:2603.09229, March 2026. [7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, -March 2026. - -[8] Community TurboQuant implementations and findings. Key sources: -tonbistudio/turboquant-pytorch (PyTorch, V3 MSE-only findings), -ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), -0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), -scos-lab/turboquant (reference reproduction). Consensus: MSE-only beats -MSE+QJL for attention and ANN ranking at all tested bit widths. +March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf + +[8] Community TurboQuant implementation reports (primarily KV-cache attention): + +- https://github.com/tonbistudio/turboquant-pytorch — MSE-only (V3) vs + MSE+QJL (V2); reports MSE-only wins for attention and generation quality. +- https://github.com/ggml-org/llama.cpp/discussions/20969 — TurboQuant + discussion; quantized attention analysis and MSE vs Prod comparison. 
+- https://github.com/0xSero/turboquant — Triton kernels; paper validation. +- https://github.com/scos-lab/turboquant — Reference reproduction; MSE vs + Prod/QJL comparison. + Multiple groups report MSE-only beating MSE+QJL for attention metrics at tested + bit widths. ANN ranking conclusions remain preliminary pending dedicated + benchmarks. + +[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest +Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011. + +[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization." +IEEE Trans. PAMI 36(4):744-755, 2014. + +[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M. +"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.