diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index 1338e83..a024c58 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -2,21 +2,26 @@ **Authors:** Will Manning **Status:** Proposal -**Date:** 2026-04-02 +**Date:** 2026-04-03 ## Summary -We propose modifying the [TurboQuant vector quantization encoding][current-impl] -to use uniform blocks of a tunable power-of-2 size B (to be experimentally -determined; expected range 64-256) with per-block SORF rotations for the MSE -stage. The QJL correction uses per-block Gaussian projection matrices as -prescribed by [1], not SORF — this eliminates the need for power-of-2 padding in -the QJL stage. The lowest-level TurboQuant array operates only on unit-norm -vectors, with norms externalized. In a second stage, a PDX-style -dimension-major code layout within groups of 64 vectors enables SIMD-friendly -vectorized distance computation. The TQ block size B (for quantization quality) -and the PDX vector group size (always 64, for SIMD utilization) are independent -parameters. +We propose evolving the [TurboQuant vector quantization encoding][current-impl] +in three stages: + +1. **MSE-only TurboQuant** (immediate): merge the current PR as an MSE-only + encoding. This is a complete, self-contained building block. +2. **Block decomposition** (next): for non-power-of-2 dimensions, split into + blocks of size B = the largest power-of-2 ≥ 64 that divides d. For + power-of-2 dimensions, B = d (single block, same as current). Per-block + norms externalized. +3. **PDX layout** (later): within each block, transpose codes into groups of + 64 vectors for SIMD scan performance. + +QJL correction is deferred to a later stage and may ultimately be dropped. +Community findings from 6+ independent TurboQuant implementations consistently +show that MSE-only outperforms MSE+QJL for attention and ANN ranking in +practice [8]. 
[current-impl]: https://github.com/vortex-data/vortex/pull/7167 @@ -46,314 +51,254 @@ rotation and the QJL projection, giving O(d) storage and O(d log d) per-vector. The 3-round SORF construction was introduced for kernel approximation [5] and approximates a random orthogonal matrix. Note that this is distinct from the single-round SRHT (`R·H·D`) analyzed by Tropp [3] and the FJLT (`P·H·D`) of -Ailon-Chazelle [2], both of which are dimensionality-reducing projections. Using -SORF for QJL (where the paper prescribes Gaussian S) is an additional -approximation that may affect the unbiasedness guarantee of Theorem 2. +Ailon-Chazelle [2], both of which are dimensionality-reducing projections. + +### Reference implementation bugs + +The Eviox corrections study [7] identified six material bugs in the paper's +reference Python implementation. The most critical is a mathematical error in +the QJL scale factor: the reference code used `√(π/(2d))` instead of +`√(π/2)/d` (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128). +Our [current implementation][current-impl] uses the correct formula +(`sqrt(FRAC_PI_2) / padded_dim` in Rust), so this bug does **not** affect us. + +Other notable Eviox findings: (a) the reference code recomputes codebooks at +every instantiation (we cache in a `DashMap`); (b) the reference uses float16 +for codebook distance computation, causing misassignment at small centroid +spacings (we cast to f32 before quantization). See [7] for the full list. + +### Theorem 1 constant + +There is an ambiguity in the paper's notation for the MSE bound constant. The +formal proof gives `(√3 · π / 2) · 4^{-b}` where the constant √3·π/2 ≈ 2.72. +The Eviox report [7] interprets the notation as `√(3π)/2 ≈ 1.535`, but this is +incorrect: the measured distortion values from the paper (b=2: 0.117, b=3: 0.03) +exceed the putative `√(3π)/2` bound (b=2: 0.096, b=3: 0.024), confirming that +2.72 is the correct constant. 
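Both numeric claims above (the √d scale-factor discrepancy and the Theorem 1 constant) are cheap to sanity-check. A standalone sketch — hypothetical helper names, not part of the PR — that verifies them against the figures quoted in this section:

```rust
/// Correct QJL scale factor per Definition 1 in [1]: √(π/2)/d.
fn qjl_scale_correct(d: f64) -> f64 {
    (std::f64::consts::PI / 2.0).sqrt() / d
}

/// The reference implementation's buggy variant: √(π/(2d)), off by √d.
fn qjl_scale_buggy(d: f64) -> f64 {
    (std::f64::consts::PI / (2.0 * d)).sqrt()
}

/// Theorem 1 MSE bound with the constant read as √3·π/2 ≈ 2.72.
fn mse_bound(b: i32) -> f64 {
    3.0_f64.sqrt() * std::f64::consts::PI / 2.0 / 4.0_f64.powi(b)
}

fn main() {
    // √d discrepancy at d = 128: the buggy scale is ≈ 11.3× too large.
    let d = 128.0_f64;
    assert!((qjl_scale_buggy(d) / qjl_scale_correct(d) - d.sqrt()).abs() < 1e-9);

    // The paper's measured distortions (b=2: 0.117, b=3: 0.03) fit under
    // 2.72/4^b but would violate the misread bound √(3π)/2 / 4^b ≈ 1.535/4^b.
    let misread = (3.0 * std::f64::consts::PI).sqrt() / 2.0;
    for (b, measured) in [(2, 0.117), (3, 0.03)] {
        assert!(measured < mse_bound(b));
        assert!(measured > misread / 4.0_f64.powi(b));
    }
}
```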
The paper's "explicit values" (0.36, 0.117, 0.03, +0.009) are the actual computed distortion of the optimal quantizer, not the +bound itself — they are well below the 2.72/4^b bound. + +### Community findings on QJL + +Multiple independent TurboQuant implementations have converged on a +significant practical finding: **MSE-only consistently outperforms MSE+QJL for +attention and ANN ranking**. The mechanism is a variance-bias tradeoff: +TurboQuant's QJL correction eliminates bias but increases variance, and softmax +attention (and cosine/L2 ranking) amplifies variance more than bias. At the same +total bit budget, allocating all bits to MSE (more centroids, lower variance) +beats splitting between MSE + QJL (fewer centroids + 1-bit correction). This has +been confirmed by 6+ groups across Python, C, and Rust implementations [8]. + +This finding strongly supports making MSE-only the default strategy for our +columnar storage use case (ANN search, cosine similarity ranking). ### Current limitations The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions -(e.g., 768-d embeddings from common transformer models), the input is -zero-padded to the next power of 2 (1024). This causes: - -- **33% wasted storage** for 768-d vectors: 1024 quantized codes stored per - vector when only 768 carry information. -- **QJL variance inflation**: The QJL correction also uses a padded SORF. Of - the 1024 QJL sign bits, 256 encode information about the zero-padded region - rather than the actual residual. The estimator remains unbiased in expectation, - but the wasted measurement budget inflates variance. Empirically, this - manifests as ~23% mean signed relative error in inner product estimation for - 768-d vs. ~11% for power-of-2 dimensions. -- **No scan-optimized layout**: Codes are stored in row-major order (all codes - for one vector contiguous), which prevents SIMD-over-vectors distance - computation. 
+(e.g., 768-d embeddings), the input is zero-padded to the next power of 2 +(1024). This causes: + +- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful + (equivalently, 25% of stored codes are wasted on zero-padded dimensions). +- **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors + distance computation. ### PDX PDX [4] is a data layout for vector similarity search that stores dimensions in a vertical (dimension-major) layout within fixed-size blocks of 64 vectors. This enables the compiler to auto-vectorize the inner distance loop over vectors -rather than dimensions, achieving on average 2x speedups over SIMD-optimized -row-major kernels (up to 5.5x at low dimensions) on modern CPUs. The block size -of 64 is empirically optimal across AVX-512, AVX2, and NEON architectures [4]. +rather than dimensions, achieving on average 2× speedups over SIMD-optimized +row-major kernels on modern CPUs. The block size of 64 is empirically optimal +across AVX-512, AVX2, and NEON architectures [4]. ## Proposal -### Stage 1: Tunable block decomposition - -Partition d dimensions into `⌈d/B⌉` blocks of B dimensions each, where B is a -power-of-2 block size (configurable; expected range 64-256, to be determined -experimentally). The last block ("straggler") may have fewer than B dimensions -— see Straggler handling below. Each full block gets an independent B-dim SORF -rotation for the MSE stage. - -Larger B gives better quantization quality (more concentrated coordinate -distribution, better SORF mixing) but more per-block norm overhead. Smaller B -gives more blocks (more norms) but each block has higher practical variance. -The optimal B should be determined experimentally — see Experimental plan. +### Block size strategy + +For each dimension d, choose B = the largest power-of-2 ≥ 64 that evenly +divides d. 
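The selection rule can be sketched as a small helper (hypothetical name, not in the PR):

```rust
/// Largest power-of-2 block size >= 64 that evenly divides `d`.
/// Returns None for dimensions not divisible by 64, which would need
/// straggler handling (rare for modern embedding models).
fn choose_block_size(d: u32) -> Option<u32> {
    let mut b = 1u32 << 31;
    while b >= 64 {
        if d % b == 0 {
            return Some(b);
        }
        b >>= 1;
    }
    None
}

fn main() {
    assert_eq!(choose_block_size(768), Some(256));
    assert_eq!(choose_block_size(1024), Some(1024));
    assert_eq!(choose_block_size(1536), Some(512));
    assert_eq!(choose_block_size(100), None); // straggler case
}
```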
This eliminates stragglers entirely for common embedding dimensions: + +| Dimension d | Block size B | Blocks k | Notes | +| ----------- | ------------ | -------- | --------------------------- | +| 512 | 512 | 1 | Single block (= current TQ) | +| 768 | 256 | 3 | Largest dividing power-of-2 | +| 1024 | 1024 | 1 | Single block | +| 1536 | 512 | 3 | | +| 2048 | 2048 | 1 | Single block | +| 3072 | 1024 | 3 | | +| 4096 | 4096 | 1 | Single block | + +**Key observations:** + +- **Power-of-2 dimensions** (512, 1024, 2048, 4096) use B = d — a single block, + identical to the current implementation except with PDX underneath (Stage 3). + No block decomposition overhead, no per-block norms. These dimensions are + already well-served by the current design. +- **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at + B=256 or B=512. Zero padding waste. Each block has its own SORF rotation and + shares a single centroid set. +- **Stragglers are eliminated** for all common embedding dimensions. Dimensions + that are not multiples of 64 (e.g., 100, 200) would need straggler handling, + but these are rare in practice for modern model architectures. +- **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at + B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable + to the current B=1024 (30 stages). This needs empirical validation; see + Experimental plan. + +### Stage 1: MSE-only TurboQuant (immediate — split from current PR) + +Split the [current PR][current-impl] to extract and merge the MSE-only subset +(removing QJL encoding, QJL array slots, and QJL-specific tests). The QJL code +can be preserved on a separate branch for Phase 4. 
The MSE-only encoding +provides: + +- SORF-based random rotation at the padded dimension +- Max-Lloyd scalar quantization with shared centroids +- Per-vector norm storage (single f32, regardless of input dtype — the + dtype-matching norm behavior described in Stage 2 is a later change) +- Slice, take, scalar_at compute pushdowns +- Quantized-domain cosine similarity and dot product +- File format integration via the compression scheme + +This is a complete, useful encoding for power-of-2 dimensions. For non-power-of-2 +dimensions it has the padding overhead described above. + +### Stage 2: Block decomposition + +For non-power-of-2 dimensions, split into blocks of size B (as determined by the +table above). Each full block gets an independent B-dim SORF rotation. **Key properties:** -- **B is a power of 2**, so every full block has zero SORF padding waste. - Common embedding dimensions are divisible by 64 and 128; most are also - divisible by 256. The straggler case is rare for well-chosen B. -- **One shared centroid set** for full blocks. All full blocks use the same - B-dim marginal distribution, so a single Max-Lloyd codebook serves every full - block. +- **One shared centroid set** for all blocks. All blocks use the same B-dim + marginal distribution, so a single Max-Lloyd codebook serves every block. - **Unit-norm assumption.** The TurboQuant array operates only on pre-normalized - sub-vectors. Per-block norms are externalized (see Norm architecture below). -- **Per-block SORF rotation signs.** Each block's MSE SORF is independent - (different seed). Signs are 3 × B bits per block. For d=768 with B=128: - 6 blocks × 3 × 128 = 2304 bits = 288 bytes. + sub-vectors. Per-block norms are externalized, following the pattern explored + in PR #7251 (closed; concept will need reimplementation). +- **Per-block SORF rotation signs.** Each block's SORF is independent (different + seed). Signs are 3 × B bits per block. +- **For power-of-2 dimensions**: B = d, k = 1. 
The encoding is functionally
+  identical to Stage 1. The norm remains a single value per vector (not a
+  FixedSizeList with list_size=1). Norm externalization is optional for k=1 and
+  can be deferred to when it provides concrete benefit (e.g., GPU decode).

-#### Straggler handling

#### Norm architecture

-If `d % B ≠ 0`, the final block has `d mod B` dimensions. Initially, the
-straggler is **zero-padded to B** and uses the same SORF and centroids as all
-other blocks. This is the simplest approach and reuses all existing
-infrastructure. The padding waste is bounded by B-1 dimensions — much smaller
-than the current full-dimension padding (up to d-1 dimensions).

+The per-block norms are stored as a `FixedSizeListArray<F>` with
+`list_size = num_blocks`, where `F` matches or widens the input element type:
+
+| Input dtype | Norm dtype | Rationale |
+| ----------- | ---------- | ---------------------------------------------- |
+| f16 | f32 | f16 has insufficient range/precision for norms |
+| f32 | f32 | Same type |
+| f64 | f64 | Preserve full precision |

-As a follow-up experiment, we will test using a **dense random orthogonal
-matrix** (QR of Gaussian) at the straggler's actual dimension with its own
-centroid set. This eliminates the straggler's padding waste entirely and may
-allow increasing B (since the straggler cost is bounded). This requires separate
-storage for the straggler rotation matrix and centroids.

+Norms are stored as plain child arrays; the cascading compressor handles
+secondary encoding (ALP, Pco, etc.).
+
+Note: centroids and quantization always operate in f32 internally (the
+[current implementation][current-impl] converts all input to f32 before
+quantization). For f64 input, decode produces f32 unit-direction reconstructions
+scaled by f64 norms — a mixed-precision multiply that preserves norm precision.
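The dtype-widening rule in the table reduces to a small total function. A sketch with a hypothetical `Dtype` enum (the actual Vortex dtype machinery differs):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Dtype {
    F16,
    F32,
    F64,
}

/// Norm dtype for a given input element type, per the table above:
/// f16 widens to f32 (insufficient range/precision); f32 and f64 keep
/// their own precision.
fn norm_dtype(input: Dtype) -> Dtype {
    match input {
        Dtype::F16 | Dtype::F32 => Dtype::F32,
        Dtype::F64 => Dtype::F64,
    }
}

fn main() {
    assert_eq!(norm_dtype(Dtype::F16), Dtype::F32);
    assert_eq!(norm_dtype(Dtype::F32), Dtype::F32);
    assert_eq!(norm_dtype(Dtype::F64), Dtype::F64);
}
```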
#### Zero-norm sub-vectors -When splitting a vector into B-dim blocks, some blocks may have zero norm -(e.g., sparse embeddings, or dimensions that are structurally unused). The -encoding must handle ‖xₖ‖ = 0 explicitly: skip rotation and quantization for -that block, store norm = 0, and decode as all zeros. The codes for a zero-norm -block are arbitrary (never read during decode). +When splitting a vector into B-dim blocks, some blocks may have zero norm. The +encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store +norm = 0, decode as all zeros. #### Theoretical MSE bound The paper's MSE bound (Theorem 1 in [1]) is: ``` -E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b -``` - -This bound is dimension-independent — it depends only on the bit width `b`. For -a vector split into blocks with per-block normalization (assuming ‖xₖ‖ > 0): - -``` -‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) - ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) - = MSE_bound +E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b ``` -The Pythagorean identity `Σ_k ‖xₖ‖² = ‖x‖²` ensures the weights sum to 1. -Blocks with ‖xₖ‖ = 0 contribute zero error and are excluded from the sum. - -**Practical MSE vs. theoretical bound.** The bound `(√3π/2)/4^b` is a worst-case -upper bound that is dimension-independent. However, the **actual** MSE depends -on the block dimension B. At higher dimensions, the coordinate distribution -after rotation is more concentrated (variance ~1/B), and the Max-Lloyd quantizer -exploits this concentration — the actual MSE can be well below the bound. At -smaller B, the distribution is wider and the quantizer has less concentration to -exploit. - -Block decomposition preserves the theoretical bound but may give worse -**practical** MSE than a single full-dimension rotation, because each B-dim -block operates at a less favorable point on the distortion curve. This must be -quantified empirically (see Experimental plan). 
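The blockwise decomposition above is pure algebra (the Pythagorean identity `Σ_k ‖xₖ‖² = ‖x‖²`) and easy to check numerically. An illustrative sketch with a perturbation standing in for quantization:

```rust
fn norm_sq(s: &[f64]) -> f64 {
    s.iter().map(|v| v * v).sum()
}

fn err_sq(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(p, q)| (p - q) * (p - q)).sum()
}

/// Norm-weighted average of per-block squared relative errors; by the
/// Pythagorean identity this equals the global squared relative error.
fn blockwise_rel_err(x: &[f64], x_hat: &[f64], block: usize) -> f64 {
    x.chunks(block)
        .zip(x_hat.chunks(block))
        .map(|(xk, xk_hat)| {
            (norm_sq(xk) / norm_sq(x)) * (err_sq(xk, xk_hat) / norm_sq(xk))
        })
        .sum()
}

fn main() {
    let x: Vec<f64> = (0..8).map(|i| i as f64 - 3.5).collect();
    // Stand-in for quantization error: perturb each coordinate slightly.
    let x_hat: Vec<f64> = x.iter().map(|v| v + 0.01).collect();
    let global = err_sq(&x, &x_hat) / norm_sq(&x);
    assert!((global - blockwise_rel_err(&x, &x_hat, 4)).abs() < 1e-12);
}
```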
- -**SORF approximation caveat.** Theorems 1 and 2 in [1] are proved for true -random orthogonal matrices (QR of Gaussian), not SORF. The 3-round SORF -construction `HD₃·HD₂·HD₁` [5] is a structured approximation. The -approximation quality depends on dimension: 3 rounds provides 3 × log₂(B) -mixing stages (18 at B=64, 21 at 128, 24 at 256, 30 at 1024). Empirical -validation is needed for each candidate B — see Experimental plan. +**Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of +Gaussian), not SORF.** Our SORF is an approximation. The bound holds exactly +only with a true random orthogonal rotation or with empirical SORF validation +(see Experimental plan). -**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a -B × B random orthogonal matrix (QR of Gaussian) instead. Storage at B=128: -128×128×4 = 64 KB per block. With k blocks for MSE: total = k × 64 KB. For -d=768, B=128 (k=6): 384 KB of MSE rotation matrices. This is significant for -small columns (1K vectors at 3 MB uncompressed ≈ 13% overhead) but amortizes to -negligible for large columns (100K+ vectors). Each block must have an -**independent** rotation matrix (different seeds) to avoid correlation between -quantization errors across blocks. - -#### QJL correction - -The QJL stage in the TurboQuant paper [1] uses a random Gaussian projection -matrix S ∈ ℝ^(d×d) with i.i.d. N(0,1) entries — **not** an orthogonal rotation. -The unbiasedness proof (Lemma 4 in [1]) and the scale factor `√(π/2)/d` are -derived specifically for Gaussian S. The [current implementation][current-impl] -uses SORF instead, which is an additional approximation. - -The QJL correction has several possible strategies, differing in projection -type (Gaussian vs SORF), scope (per-block vs full-dimension), and whether QJL -is used at all. The choice significantly affects correctness guarantees, -variance, storage, and speed. 
-
-**Per-block Gaussian QJL** is the paper-correct approach: each block gets an
-independent Gaussian projection Sₖ ∈ ℝ^(B×B) with i.i.d. N(0,1) entries. Since
-Gaussian matrices work at any dimension, no padding is needed for the QJL stage.
-Each block's QJL is provably unbiased by Lemma 4 in [1], and the sum over
-blocks is also unbiased: `E[⟨x̂, y⟩] = ⟨x, y⟩`. However, the per-block
-variance is **d/B times higher** than full-dimension QJL:

+Assuming the per-block MSE bound holds, for a vector split into blocks:

```
-Per-block (B-dim):  Var[⟨x̂, y⟩] ≤ (π / (2B)) × ‖y‖²
-Full-dim (d-dim):   Var[⟨x̂, y⟩] ≤ (π / (2d)) × ‖y‖²
+‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²)
+                 ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound
```

-At d=768, B=128: 6× more variance. Storage: B×B×4 bytes per block (384 KB for
-k=6 at B=128). Encode/decode cost: O(B²) matmul per block.

-**Per-block SORF QJL** substitutes a B-dim SORF (`HD₃·HD₂·HD₁` [5]) for the
-Gaussian matrix. This is NOT theoretically justified — Lemma 4 requires Gaussian
-or Haar-orthogonal S, and SORF is only an approximation to Haar measure.
-However, the [current implementation][current-impl] already uses SORF for QJL at
-d=1024 with acceptable results (~11% mean relative error for power-of-2 dims),
-demonstrating practical viability. The tradeoff vs Gaussian is compelling:
-O(B log B) speed (5× faster at B=128), O(B) storage (over 1000× less). Quality
-at B=128 needs validation — with only 21 mixing stages, the approximation to
-Haar measure is weaker than at d=1024 (30 stages).

-**Full-dimension padded SORF QJL** applies a single SORF at the padded
-dimension (e.g., 1024 for d=768) to the full residual vector `r = x - x̂`,
-matching the [current implementation][current-impl]. The higher dimension gives
-better SORF-to-Haar convergence (30 mixing stages at d=1024 vs 21 at B=128) and
-full-dimension variance `(π/(2d))·‖y‖²`, but wastes `(padded_d - d)/padded_d`
-of the sign bits on zero-padded coordinates (25% at 768→1024).
This approach
-requires computing the full residual from all blocks before applying QJL,
-adding a full-dimension decode step to the encode path.

-**MSE-only (no QJL)** skips the QJL correction entirely. Inner product
-estimates are biased but this may be acceptable for ANN ranking where relative
-ordering matters more than absolute accuracy. Eliminates all QJL storage and
-computation.

-**QJL strategy options** (to be experimentally compared):

-| Strategy | Theoretical | Variance | Padding waste | Storage | Speed (per block) |
-| -------------------- | ----------------- | -------------- | ------------- | --------------- | ----------------- |
-| Per-block Gaussian | Correct (Lemma 4) | (π/(2B))·‖y‖² | None | k×B²×4 bytes | O(B²) |
-| Per-block SORF | Approximate | ~(π/(2B))·‖y‖² | None | k×3×B bits | O(B log B) |
-| Full-dim padded SORF | Approximate | ~(π/(2d))·‖y‖² | (pad-d)/pad | 3×padded_d bits | O(d log d) |
-| MSE-only | N/A | N/A | N/A | None | 0 |

-#### Norm architecture

-The TurboQuant array itself operates only on unit-norm B-dim sub-vectors. Norms
-are externalized into a separate child array, following the pattern established
-by the NormVector encoding (PR #7251).

-The per-block norms are stored as a single `FixedSizeListArray<F>` with
-`list_size = num_blocks`, where `F` matches or widens the input element type:

| Input dtype | Norm dtype | Rationale |
| ----------- | ---------- | ---------------------------------------------- |
| f16 | f32 | f16 has insufficient range/precision for norms |
| f32 | f32 | Same type |
| f64 | f64 | Preserve full precision |

+**SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides
+3 × log₂(B) butterfly stages in total (log₂(B) per round): 18 at B=64, 24 at
+B=256, 27 at B=512.
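For concreteness, a minimal 3-round SORF sketch — fast Walsh–Hadamard butterflies preceded by independent per-round random sign flips. Illustrative only, not the production kernel; sign generation and seeding are elided:

```rust
/// One HD round: ±1 sign flip (D) followed by a normalized fast
/// Walsh–Hadamard transform (H). Three rounds with independent sign
/// vectors give the SORF rotation HD₃·HD₂·HD₁. Length must be a power of 2.
fn hd_round(x: &mut [f32], signs: &[i8]) {
    let n = x.len();
    assert!(n.is_power_of_two() && signs.len() == n);
    for i in 0..n {
        x[i] *= signs[i] as f32; // D: diagonal ±1
    }
    let mut h = 1;
    while h < n {
        // one butterfly stage; log2(n) stages total
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt(); // make H orthonormal
    for v in x.iter_mut() {
        *v *= scale;
    }
}

fn sorf(x: &mut [f32], rounds: &[Vec<i8>; 3]) {
    for signs in rounds {
        hd_round(x, signs);
    }
}

fn main() {
    let mut x = vec![1.0f32, 2.0, 3.0, 4.0];
    let rounds = [vec![1i8, -1, 1, -1], vec![-1, 1, 1, 1], vec![1, 1, -1, 1]];
    let before: f32 = x.iter().map(|v| v * v).sum();
    sorf(&mut x, &rounds);
    let after: f32 = x.iter().map(|v| v * v).sum();
    assert!((before - after).abs() < 1e-3); // orthogonal ⇒ norm-preserving
}
```

Because each HD factor is exactly orthogonal, the composed rotation preserves norms regardless of how well it approximates Haar measure; only the *mixing* quality needs empirical validation.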
+This stage count is a rough heuristic for mixing quality — [5] does not analyze
+convergence rate as a function of rounds × dimension. Empirical validation is
+needed.

-Norms are stored as plain child arrays — TurboQuant does not apply any secondary
-encoding to them. The cascading compressor treats norms like any other float
-column and is free to re-encode them with ALP, Pco, FastLanes, or other float
-compression schemes.

+**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a
+B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per
+block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+
+vectors). Each block must have an **independent** rotation matrix.

-#### Quantized-domain operations with per-block norms

+#### Quantized-domain operations

-All quantized-domain operations require reading the block norms for both
-operands. This is a cost increase vs. the [current implementation][current-impl]
-which stores a single global norm per vector.

+All quantized operations require per-block norms:

-- **L2 norm**: Read block norms, compute `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a
-  regression from the current O(1) single-norm readthrough, but modest compared
-  to full decompression.
-- **Dot product**: `⟨a, b⟩ ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
-centroids[code_bₖ[j]]`. Per-block: compute unit-norm quantized dot product
-  (sum of B centroid products), then weight by both vectors' block norms.
-- **Cosine similarity**: `cos(a, b) ≈ (Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ) /
-(√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))`. Requires global norms reconstructed from
-  block norms. The norms tensor should be read once per scan query and cached.

+- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·
+unit_dotₖ`. Primary ANN metric; reuses per-block dot product and norms.
+- **Dot product**: `⟨a, b⟩ ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
+centroids[code_bₖ[j]]`.
+- **Cosine similarity**: `cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖)` where + `‖a‖ = √(Σ_k ‖aₖ‖²)`. +- **L2 norm**: `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a regression from the + current O(1) single-norm readthrough, but modest. #### Encoding algorithm ``` -Input: x ∈ ℝ^d, bit_width b, block_size B (power of 2) -k = ⌈d/B⌉ +Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B +k = d / B (exact division, no straggler for chosen B) +num_centroids = 2^b_mse # Block split and normalize for i in 0..k: - xᵢ = x[i*B .. min((i+1)*B, d)], zero-pad to B if straggler + xᵢ = x[i*B .. (i+1)*B] nᵢ = ‖xᵢ‖ if nᵢ > 0: ûᵢ = xᵢ / nᵢ else: - ûᵢ = zeros(B) (zero-norm block: skip quantization) + ûᵢ = zeros(B) # MSE stage (per block, SORF rotation) for i in 0..k: if nᵢ > 0: - rᵢ = SORFᵢ(ûᵢ) (3-round HD, independent per block) - cᵢ[j] = nearest_centroid(rᵢ[j]) (shared codebook) + rᵢ = SORFᵢ(ûᵢ) + cᵢ[j] = nearest_centroid(rᵢ[j]) else: cᵢ[j] = 0 Store: codes (k × B per vector), block_norms (k per vector), - centroids (shared), SORF signs (k × 3 × B, shared across vectors) - -# QJL stage (optional, one of four strategies) -for i in 0..k: - if nᵢ > 0: - x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ) - rᵢ = ûᵢ - x̂ᵢ (per-block unit-norm residual) - γᵢ = ‖rᵢ‖ - sᵢ = sign(Projᵢ × rᵢ) (see QJL strategy options) - else: - γᵢ = 0, sᵢ = zeros - -# Projᵢ is one of: -# - Sᵢ ∈ ℝ^(B×B) Gaussian i.i.d. 
N(0,1) (per-block Gaussian) -# - SORF_qjlᵢ at B-dim (per-block SORF) -# - SORF_qjl at padded_d-dim on full r (full-dim padded SORF) - -Store: qjl_signs, qjl_residual_norms, projection params (strategy-dependent) + centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared) ``` #### Decoding algorithm ``` -# MSE decode (produces unit-norm sub-vectors) for i in 0..k: r̂ᵢ[j] = centroids[cᵢ[j]] ûᵢ = SORF⁻¹ᵢ(r̂ᵢ) - -# QJL correction (if present) -for i in 0..k: - if γᵢ > 0: - correctionᵢ = (√(π/2) / dim_proj) × γᵢ × Projᵢᵀ × sᵢ - ûᵢ += correctionᵢ -# dim_proj = B for per-block strategies, padded_d for full-dim strategy - -# Denormalize and concatenate (using externalized norms) -for i in 0..k: x̂ᵢ = nᵢ × ûᵢ -x̃ = concat(x̂₀, x̂₁, ..., x̂ₖ₋₁)[0..d] +x̃ = concat(x̂₀, ..., x̂ₖ₋₁) ``` -### Stage 2: PDX dimension-major layout - -In a second stage, we transpose the code storage from row-major to -dimension-major within groups of 64 vectors, following the PDX layout [4]. The -64-vector group size is independent of the TQ block size B. - -**Row-major (Stage 1):** +### Stage 3: PDX dimension-major layout -``` -Vector 0: [tq0_d0 ... tq0_d(B-1) | tq1_d0 ... tq1_d(B-1) | ... ] -Vector 1: [tq0_d0 ... tq0_d(B-1) | tq1_d0 ... tq1_d(B-1) | ... ] -... -``` +Transpose code storage from row-major to dimension-major within groups of 64 +vectors [4]. The 64-vector group size is independent of B. -**PDX (Stage 2), within each 64-vector chunk:** +Within each 64-vector chunk, codes are stored dimension-major: ``` TQ block 0, dim 0: [v0 v1 v2 ... v63] @@ -364,266 +309,194 @@ TQ block 1, dim 0: [v0 v1 v2 ... v63] ... ``` -All codes for the same dimension across 64 vectors are contiguous. The inner -loop (over 64 vectors) has no inter-vector data dependencies and -auto-vectorizes into SIMD. TQ block boundaries affect only where norm weighting -occurs in the distance kernel — they don't affect the PDX transpose itself. 
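The PDX transpose itself is a plain 64-wide gather per dimension. A minimal sketch (hypothetical helper; row-major codes in, dimension-major codes out):

```rust
/// Transpose one 64-vector group of codes from row-major (all dims of one
/// vector contiguous) to the PDX dimension-major layout (all 64 vectors'
/// codes for one dimension contiguous). `dims` is the total k × B.
fn pdx_transpose(codes: &[u8], dims: usize) -> Vec<u8> {
    assert_eq!(codes.len(), 64 * dims);
    let mut out = vec![0u8; codes.len()];
    for v in 0..64 {
        for d in 0..dims {
            out[d * 64 + v] = codes[v * dims + d];
        }
    }
    out
}

fn main() {
    // 64 vectors × 2 dims; vector v holds codes [2v, 2v+1].
    let codes: Vec<u8> = (0..128).map(|i| i as u8).collect();
    let pdx = pdx_transpose(&codes, 2);
    assert_eq!(pdx[5], 10); // dim 0 of vector 5
    assert_eq!(pdx[64 + 5], 11); // dim 1 of vector 5
}
```

Running the same function on PDX input with swapped loop roles gives the inverse (un-transpose) needed for `scalar_at`/`to_canonical` paths.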
-
-This creates **B × 64 tiles**: B dimension coordinates × 64 vectors per chunk.

-#### Relationship to chunking

-The PDX 64-vector groups are effectively chunks. Shared metadata (centroids,
-rotation signs/matrices) lives at the array level, not per-chunk.

+The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block
+boundaries only affect where norm weighting occurs — they don't affect the
+transpose.

-**Open design questions:**

-- Should slice/take produce row-major or PDX results? Row-major is simpler
-  (un-transpose on the fly) but loses the scan benefit.
-- Should the PDX flag be a property of the encoding or a separate layout layer?
-- How does the PDX transpose interact with the cascading compressor?

-#### Quantized distance computation

+**Quantized distance kernel (dot product):**

```rust
-// Precompute: dist_table[i][j] = centroids[i] * centroids[j]
-// At b=4: 16×16 = 256 floats = 1KB, fits in L1 cache
+let dist_table = precompute_product_table(&centroids);
+// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1
+
+let mut distances = [0.0f32; 64];
+let mut unit_dots = [0.0f32; 64];
+let mut offset = 0;

-for tq_block in 0..num_tq_blocks {
+for tq_block in 0..k {
    for dim in 0..B {
        let qd = query_codes[tq_block * B + dim];
        let row = &dist_table[qd as usize];
        for v in 0..64 { // SIMD-friendly: no inter-vector deps
-            distances[v] += row[codes[offset + v] as usize];
+            unit_dots[v] += row[codes[offset] as usize];
+            offset += 1;
        }
    }
-    // Weight by query_block_norm × data_block_norms[v] for this TQ block
+    // Weight per-block unit-norm dot product by both vectors' block norms
+    for v in 0..64 {
+        distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
+            * unit_dots[v];
+        unit_dots[v] = 0.0; // reset for next TQ block
+    }
}
```

-**Full decompression from PDX layout.** When a scan is not the use case (e.g.,
-`scalar_at` or `to_canonical`), 64-vector chunks are un-transposed back to
-row-major before per-vector inverse SORF decoding.
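The kernel above assumes a precomputed centroid-product table; `precompute_product_table` is not defined in the PR excerpt, but a minimal version could look like:

```rust
/// dist_table[i][j] = centroids[i] * centroids[j]; at b_mse = 4 this is a
/// 16×16 f32 table (1 KB), small enough to stay L1-resident during a scan.
fn precompute_product_table(centroids: &[f32]) -> Vec<Vec<f32>> {
    centroids
        .iter()
        .map(|&ci| centroids.iter().map(|&cj| ci * cj).collect())
        .collect()
}

fn main() {
    let t = precompute_product_table(&[-1.0, 0.5]);
    assert_eq!(t[0][1], -0.5);
    assert_eq!(t[1][1], 0.25);
}
```

A flat `[f32; 2^b_mse * 2^b_mse]` would avoid the pointer chase in a production kernel; the nested `Vec` keeps the sketch short.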
+**Open design questions:** -## Array layout +- Slice/take on PDX-transposed codes: produce row-major (simpler) or preserve + PDX (aligned 64-vector slices only)? +- Is PDX a property of the encoding or a separate layout layer? +- How does the compressor see the transposed codes? -``` -TurboQuantArray (operates on unit-norm B-dim sub-vectors) -├── metadata: { dimension: u32, bit_width: u32, block_size: u32, -│ num_blocks: u32, has_qjl: bool, is_pdx: bool } -│ -│ # Per-row children (sliced/taken on row operations) -├── codes: FixedSizeListArray # len=num_rows, list_size=num_blocks×B -│ -│ # Shared children (cloned on row operations, not sliced) -├── centroids: PrimitiveArray # len=2^(bit_width-1) [MSE codebook] -├── mse_rotation_signs: PrimitiveArray # len=num_blocks×3×B [per-block MSE SORF] -│ -│ # Optional QJL children (per-block Gaussian) -├── [qjl_signs]: FixedSizeListArray # len=num_rows, list_size=num_blocks×B -├── [qjl_matrices]: PrimitiveArray # len=num_blocks×B×B [Gaussian i.i.d. N(0,1)] +### QJL correction (deferred — experimental) -Externalized (lives in parent encoding, not in TurboQuantArray): -├── block_norms: FixedSizeListArray # len=num_rows, list_size=num_blocks -└── [qjl_residual_norms]: FixedSizeListArray # len=num_rows, list_size=num_blocks -``` +Based on community findings [7], QJL is deferred to after the MSE stages are +validated. If pursued, four strategies should be compared: -The `is_pdx` flag in metadata determines whether `codes` (and `qjl_signs`) are -in row-major or PDX-transposed layout. 
+| Strategy | Theoretical | Speed | Storage | +| -------------------- | --------------------- | ---------------- | --------------- | +| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes | +| Per-block SORF | Approximate | O(B log B)/block | k×3×B bits | +| Full-dim padded SORF | Approximate | O(d log d) total | 3×padded_d bits | +| MSE-only (no QJL) | N/A | 0 | None | -## Compression ratio +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically +for Gaussian. SORF for QJL is an additional approximation (the +[current implementation][current-impl] uses SORF for QJL). Per-block QJL has +d/B times more variance than full-dimension QJL (Lemma 4 [1]). -For f32 input at dimension d with bit width b (QJL, so b-1 MSE bits + 1 QJL -bit), k = ⌈d/B⌉ blocks: +The community consensus is that MSE-only likely wins for ANN ranking at all +bit widths, so QJL may not be worth the complexity. -| Component | Bits per vector | -| ------------------ | --------------- | -| MSE codes | k × B × (b-1) | -| QJL signs | k × B × 1 | -| Block norms | k × norm_bits | -| QJL residual norms | k × norm_bits | +## Array layout -| Component | Shared bits | -| --------------------- | ------------ | -| Centroids | 2^(b-1) × 32 | -| MSE SORF signs | k × 3 × B | -| QJL Gaussian matrices | k × B² × 32 | +### Stage 1 (single block, current) -### Example: f32, d=768, b=5, B=128, N=1000 vectors, k=6 +Identical to the [current PR][current-impl] array structure. 
-- Uncompressed: 768 × 32 × 1000 = 24,576,000 bits (3,000 KB) -- MSE codes: 6 × 128 × 4 × 1000 = 3,072,000 bits -- QJL signs: 6 × 128 × 1 × 1000 = 768,000 bits -- Block norms: 6 × 32 × 1000 = 192,000 bits -- QJL residual norms: 6 × 32 × 1000 = 192,000 bits -- Shared: 512 + 2,304 + 3,145,728 = 3,148,544 bits -- **Total compressed: 7,372,544 bits (899 KB)** -- **Ratio: 3.3×** (QJL Gaussian matrices dominate at small N) +### Stage 2 (block decomposition) -At N=100K vectors the shared overhead amortizes to <0.04 bits/vector, giving -ratio ≈ 5.7×. At N=1M, ratio ≈ 5.8×. +``` +TurboQuantArray (operates on unit-norm B-dim sub-vectors) +├── metadata: { dimension, b_mse, block_size, num_blocks, is_pdx } +│ +│ # Per-row children +├── codes: FixedSizeListArray # list_size = k × B +│ +│ # Shared children +├── centroids: PrimitiveArray # len = 2^b_mse +├── mse_rotation_signs: PrimitiveArray # len = k × 3 × B -### Comparison across configurations (f32, d=768, b=5) +Externalized: +├── block_norms: FixedSizeListArray # list_size = k +``` -| Config | B | Ratio (N=1K) | Ratio (N=100K) | Notes | -| ------------------------------------------------ | --- | ------------ | -------------- | ----------------------------- | -| Block MSE-only | 128 | 6.5× | 6.5× | No QJL; biased inner products | -| Block + per-block Gaussian QJL | 128 | 3.3× | 5.7× | Unbiased; matrices amortize | -| [Current][current-impl] (padded SORF + SORF QJL) | — | 4.7× | 4.7× | 33% padding waste | +## Compression ratio -For small columns, MSE-only or the current padded approach may be preferable. -For large columns (the common case for embedding tables), per-block Gaussian QJL -gives the best ratio. 
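A small helper (hypothetical) makes the MSE-only compression arithmetic concrete for the block layout above:

```rust
/// Per-vector compressed size in bits for the MSE-only block layout:
/// k × B codes at b_mse bits each, plus k f32 block norms. Shared
/// centroid/sign storage is omitted (it amortizes across the column).
fn bits_per_vector(b_mse: u64, k: u64, block: u64) -> u64 {
    k * block * b_mse + k * 32
}

fn main() {
    // d = 768, B = 256, k = 3, b_mse = 5
    let bits = bits_per_vector(5, 3, 256);
    assert_eq!(bits, 3936);
    let ratio = (768.0 * 32.0) / bits as f64; // vs. uncompressed f32
    assert!(ratio > 6.2 && ratio < 6.3);

    // d = 1024, single block
    assert_eq!(bits_per_vector(5, 1, 1024), 5152);
}
```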
+For f32 input with b_mse MSE bits per coordinate, k = d/B blocks, and N
+vectors:

-## Performance analysis

+| Component   | Bits per vector |
+| ----------- | --------------- |
+| MSE codes   | k × B × b_mse   |
+| Block norms | k × 32          |

-### Encode throughput

+| Component  | Shared bits  |
+| ---------- | ------------ |
+| Centroids  | 2^b_mse × 32 |
+| SORF signs | k × 3 × B    |

-With k blocks at B-dim, encoding requires per block:

+### Worked examples (f32, b_mse=5, N=1000)

-| Operation                   | FLOPs (B=128)                          |
-| --------------------------- | ------------------------------------- |
-| MSE SORF (3-round)          | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072 |
-| Centroid lookup             | 128 binary searches                   |
-| QJL Gaussian matmul (S × r) | B² = 16,384                           |
-| Norm computation            | 128 FMA + sqrt ≈ 129                  |

+| d             | B    | k   | Per-vec bits          | Ratio | Notes                      |
+| ------------- | ---- | --- | --------------------- | ----- | -------------------------- |
+| 768           | 256  | 3   | 3×256×5 + 3×32 = 3936 | 6.2×  | Block decomp; no padding   |
+| 1024          | 1024 | 1   | 1024×5 + 32 = 5152    | 6.4×  | Single block (= current)   |
+| 768 (current) | 1024 | 1   | 1024×5 + 32 = 5152    | 4.8×  | Padded; 33% overhead       |

-For d=768, k=6: MSE total ≈ 18K FLOPs, QJL matmul ≈ 98K FLOPs. The QJL
-Gaussian matmul dominates encode cost — ~5× more expensive than the
-[current][current-impl] SORF-based QJL. Acceptable for offline encoding.

+Block decomposition improves d=768 from 4.8× to 6.2× (24% fewer bits per
+vector, equivalently a 30% higher compression ratio). For d=1024 the encoding
+is identical to current.

-### Decode throughput

+## Performance analysis

-| Operation                        | FLOPs per block (B=128) |
-| -------------------------------- | ----------------------- |
-| Codebook lookup                  | 128 table reads         |
-| Inverse SORF                     | ≈ 3,072                 |
-| QJL Gaussian matmul (Sᵀ × signs) | B² = 16,384             |
-| Denormalize                      | 128 multiplies          |

+### Encode/decode throughput

-For d=768: MSE decode ≈ 18K FLOPs, QJL decode ≈ 98K FLOPs. QJL decode is
-significantly more expensive due to the dense matmul. 
For scan workloads that -only need inner products (not full reconstruction), the fused distance -computation path avoids full decode entirely. +SORF at B dimensions: 3 × B × log₂(B) + 3 × B FLOPs per block. For k blocks: -### Scan throughput (PDX, Stage 2) +| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | +| -------------- | ------------------------- | --------- | --------------- | +| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 | +| 512 | 3×512×9 + 1536 = 15,360 | — | — | +| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 | -The PDX scan kernel is memory-bandwidth bound, not compute bound. The lookup -table fits in L1 (1 KB at b=4). With 64 vectors per SIMD pass, this achieves -near-peak memory bandwidth utilization. Norm weighting adds k scalar multiplies -per 64-vector group — negligible. +Block decomposition at d=768 is ~40% fewer FLOPs than the current padded +approach, despite more blocks, because each block is smaller. ### Benchmarking plan -Stage 1: - -1. Encode throughput: block TQ vs. current TQ at d=128, 768, 1024 -2. Decode throughput: same dimensions -3. Quantized cosine similarity: block vs. current (includes norm overhead) -4. L2 norm readthrough: O(k) block norms vs. O(1) current - -Stage 2: - -5. PDX scan throughput vs. row-major scan at d=768, 1024 -6. Full decompression from PDX layout (includes un-transpose overhead) +1. Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024 +2. Quantized cosine similarity: block vs. current +3. L2 norm readthrough: O(k) vs. O(1) +4. PDX scan throughput vs. row-major (Stage 3) ## Experimental plan -Before committing to specific parameters, the following experiments are needed: - ### MSE quality vs. block size -- Compare actual normalized MSE at B ∈ {64, 128, 256} vs. full-dimension - single-SORF at bit widths b ∈ {2, 3, 4, 5, 8} +- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. 
single-SORF at + padded dimension, at bit widths b ∈ {2, 3, 4, 5, 8} - Test SORF coordinate distribution at each B: histogram vs. analytical Beta -- Test 3, 4, and 5 SORF rounds at each B (3 rounds provides 3 × log₂(B) - mixing stages: 18 at B=64, 21 at 128, 24 at 256) - -### QJL strategy comparison - -Compare all four strategies at d=768 with B ∈ {64, 128, 256}: +- Test 3, 4, 5 SORF rounds at each B +- Determine if the practical MSE constant is worse at smaller B -- **Per-block Gaussian QJL** (paper-correct): measure inner product bias and - variance. Baseline for correctness. -- **Per-block SORF QJL**: measure bias/variance vs. per-block Gaussian at same - B. Quantify the quality cost of the SORF approximation at small block - dimensions. Test at 3, 4, 5 SORF rounds. -- **Full-dimension padded SORF QJL** (current approach): measure for comparison. - Higher dimension gives better SORF-to-Haar convergence (30 mixing stages at - d=1024) which may compensate for the padding waste. This is the key - comparison — does the better convergence of full-dim SORF outweigh the 25% - wasted sign bits? -- **MSE-only** (no QJL): measure inner product bias to establish baseline. +### QJL strategy comparison (if pursued) -Key metrics: +- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim padded SORF QJL + vs. MSE-only +- Key metric: ANN recall@k on standard benchmarks (SIFT, GloVe) +- Per community findings, MSE-only is expected to win [8] -- Mean signed relative error on inner products (bias) -- Variance of inner product estimates -- ANN recall@k on standard benchmarks (e.g., SIFT, GloVe) +### Straggler handling (if needed) -Per-block Gaussian QJL has d/B times more variance than full-dim QJL (6× at -d=768, B=128). Per-block SORF QJL has additional approximation error on top of -that. Full-dim padded SORF QJL has lower variance but wastes bits on padding. -The experiment should determine which effect dominates in practice. 
- -### Straggler handling - -- Compare padded straggler (zero-pad to B, use SORF) vs. dense rotation at - actual straggler dimension with own centroids -- Measure whether dense straggler enables increasing B (e.g., B=256 with - d=768 = 3×256, no straggler at all) +Rare for common dimensions. If encountered: zero-pad to B (simplest). Follow-up: +dense rotation at actual dimension. ## Phasing -**Stage 1** (block decomposition): Modify the existing TurboQuant encoding to -use B-dim blocks with per-block SORF rotation, externalized norms, and -configurable QJL strategy (per-block Gaussian, per-block SORF, full-dim padded -SORF, or MSE-only). Row-major code layout. +**Phase 1** — MSE-only single-block TurboQuant: Split the [current PR][current-impl] +to merge MSE-only (no QJL). This is a complete encoding for all dimensions +(with padding for non-power-of-2). -**Stage 2** (PDX layout): Add dimension-major code transposition within -64-vector chunks. PDX-native distance computation kernels. Requires resolving -open design questions around chunk alignment and compressor interaction. +**Phase 2** — Block decomposition: Add block splitting for non-power-of-2 +dimensions. Externalize norms. B = largest power-of-2 ≥ 64 dividing d. The +`TurboQuantScheme::compress()` method must be updated to: (a) choose B based on +d, (b) split input into blocks, (c) normalize per-block, (d) encode each block, +and (e) store per-block norms in the parent encoding layer. -## Test migration +**Phase 3** — PDX layout: Dimension-major code transposition within 64-vector +chunks. Distance computation kernels. -The [current TurboQuant test suite][current-impl] will need rewriting: +**Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves +recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on +community findings, this may not be pursued. -- **Slot counts**: New structure with per-block SORF signs, QJL Gaussian - matrices, externalized norms. 
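A hedged end-to-end sketch of steps (a) through (d), in Python for illustration (the production code is Rust; `choose_block_size`, `fwht`, and `compress_blocks` are hypothetical names):

```python
import numpy as np

def choose_block_size(d):
    """Largest power of 2 >= 64 that divides d, or None if no such factor
    exists (the padded single-block fallback applies in that case)."""
    best, b = None, 64
    while b <= d:
        if d % b == 0:
            best = b
        b *= 2
    return best

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = np.array(x, dtype=np.float64)
    h = 1
    while h < len(x):
        x = x.reshape(-1, 2, h)
        x = np.stack([x[:, 0] + x[:, 1], x[:, 0] - x[:, 1]], axis=1).reshape(-1)
        h *= 2
    return x

def sorf_rotate(block, signs):
    """3-round SORF: flip random signs, then Hadamard, three times.
    The B**-1.5 scale keeps the rotation norm-preserving."""
    out = np.asarray(block, dtype=np.float64)
    for r in range(3):
        out = fwht(signs[r] * out)
    return out * len(block) ** -1.5

def compress_blocks(v, signs):
    """Steps (a)-(d): choose B, split into k = d/B blocks, normalize each,
    rotate each with per-block SORF. Returns rotated unit blocks (ready for
    centroid coding) plus the per-block norms for step (e)."""
    v = np.asarray(v, dtype=np.float32)
    B = choose_block_size(len(v))
    if B is None:
        raise ValueError("no power-of-2 block factor; use the padded fallback")
    blocks = v.reshape(-1, B).astype(np.float64)
    norms = np.linalg.norm(blocks, axis=1)
    safe = np.where(norms == 0, 1.0, norms)  # zero-norm blocks stay all-zero
    unit = blocks / safe[:, None]
    return np.stack([sorf_rotate(b, signs) for b in unit]), norms
```

Step (e) then stores the returned `norms` in the parent encoding layer, and the rotated unit blocks feed the existing single-block centroid coder from Phase 1.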
-- **Serde roundtrips**: New metadata (block_size, num_blocks).
-- **MSE bound tests**: Add tests at B ∈ {64, 128, 256}.
-- **QJL tests**: Compare all four strategies (per-block Gaussian, per-block
-  SORF, full-dim padded SORF, MSE-only).
-- **Quantized cosine similarity**: Update for per-block norm weighting.
-- **Slice/take**: More children to manage, same semantics.

+## Practical recommendations

-New tests:

+For common model dimensions, the most promising configurations are:

-- SORF quality at B=64, 128, 256: distribution vs. Beta at 3, 4, 5 rounds
-- Practical MSE comparison: block vs. single-rotation at same bit width
-- QJL comparison: per-block Gaussian vs. per-block SORF vs. full-dim padded SORF
-- Zero-norm sub-vector handling
-- Straggler handling (padded initially, dense rotation as follow-up)
-- Encode/decode benchmarks

+| Dimension             | Recommendation              | Rationale                                                                        |
+| --------------------- | --------------------------- | -------------------------------------------------------------------------------- |
+| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout.       |
+| 768, 1536, 3072       | 3-block MSE-only + PDX      | B=256, 512, or 1024 respectively. No padding waste. 3 blocks, shared centroids.  |
+| Arbitrary d (rare)    | Padded single-block         | Fall back to current approach. Padding to the next power of 2 at most doubles stored dims. |

+In all cases, MSE-only is the recommended starting point. QJL should only be
+added if experiments demonstrate clear recall@k improvements for the target
+workload.

-## Future work: GPU decode and fused distance computation
-
-### Motivation
-
-For ANN search workloads, the dominant operation is computing distances between
-a query vector and millions of database vectors. On GPU, the goal is to perform
-this computation directly on the compressed representation, avoiding the cost
-of materializing full decompressed vectors in HBM.
- -### Decode as GEMM - -The decode path for a single B-dim block is: - -``` -decoded_block = norm × R⁻¹ × codebook_lookup(codes) -``` +## Future work: GPU decode and fused distance computation -For a batch of N vectors sharing the same block's rotation matrix R⁻¹: +The B-dim block structure maps naturally to GPU tile sizes and tensor cores. +For a batch of N vectors sharing the same rotation matrix R⁻¹: ``` decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) @@ -631,65 +504,45 @@ decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) ↑ B×B × B×N = GEMM ``` -The codebook lookup produces a B×N matrix, and the inverse rotation is a B×B -matrix multiply — a GEMM that maps directly to tensor cores. - -**Partial decompression pipeline on GPU:** - -1. **Decompress rotation matrix** (once per block, shared across all vectors): - - If stored as bitpacked SORF signs: FastLanes unpack on CUDA cores, then - expand to B×B matrix in shared memory - - If stored as dense B×B matrix: direct load to shared memory - -2. **Decompress norms** (per vector, per block): - - If cascade-compressed with ALP or Pco: decompress on CUDA cores - - If uncompressed: direct load - -3. **Codebook gather + fused GEMM + scale**: - - Stage codebook in shared memory (trivially small) - - Stream code bytes from HBM, look up centroids, assemble B×N tile - - R⁻¹ × tile on tensor cores, scale by norms - - Follows the double-buffered streaming pattern from Flash-KMeans [6] - -### Fused distance computation (no decode) +The codebook gather + inverse rotation + norm scaling can be fused into a single +kernel following the double-buffered streaming pattern from Flash-KMeans [6]. +For distance computation without full decode, a precomputed (2^b_mse)²-entry +distance table fits in shared memory (1 KB at b_mse=4, 4 KB at b_mse=5); the +kernel streams code bytes from HBM with gather-reduce accumulation, using +4-8× less bandwidth than full float vectors. 
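A CPU-side analogue of that gather-reduce kernel, sketched in Python for illustration (hypothetical helper names; per-block norm scaling omitted for brevity):

```python
import numpy as np

def build_distance_table(centroids):
    """(2^b_mse)² table of per-coordinate squared differences between all
    centroid pairs; a GPU kernel would stage this in shared memory."""
    c = np.asarray(centroids, dtype=np.float32)
    return (c[:, None] - c[None, :]) ** 2

def code_to_code_sqdist(codes_a, codes_b, table):
    """Gather-reduce: sum table entries indexed by the two code streams,
    never reconstructing float vectors."""
    return float(table[codes_a, codes_b].sum())
```

At b_mse=4 the table is 16×16 float32 entries, i.e. exactly the 1 KB figure quoted above; only the code bytes stream from HBM.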
-Stage the distance table in shared memory (1 KB at b=4), stream code bytes from -HBM, gather-reduce on-chip, write only final distances. The memory access -pattern follows Flash-KMeans [6]: double-buffered prefetch, on-chip -accumulation. We stream quantized code bytes — 4-8× less HBM bandwidth than -full float vectors. - -### Int8 tensor core path (b=9) - -At b=9, MSE codes are 8-bit indices. Direct int8 tensor core GEMM requires -approximately linear centroids (sacrificing Max-Lloyd optimality). Viable for -ANN ranking; prefer codebook gather for reconstruction. - -### Interaction with Vortex file format - -The B-dim block structure ensures rotation matrices fit in GPU shared memory -(B×B×4 bytes). Child arrays are individually compressed by the cascading -compressor; GPU decode requires either host-side decompression + GPU transfer, -or direct GPU decompression of FastLanes/ALP. +At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires +approximately linear centroids (sacrificing Max-Lloyd optimality); viable for +ANN ranking but not reconstruction. ## References [1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online -Vector Quantization with Near-optimal Distortion Rate." arXiv:2504.19874, -April 2025. +Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. +arXiv:2504.19874, April 2025. [2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and -Approximate Nearest Neighbors." SIAM Journal on Computing, 39(1):302-322, 2009. +Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009. [3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard -Transform." Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011. +Transform." Adv. Adaptive Data Analysis 3(1-2):115-126, 2011. [4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector -Similarity Search." Proceedings of SIGMOD '25. arXiv:2503.04422, March 2025. +Similarity Search." SIGMOD '25. 
arXiv:2503.04422, March 2025. [5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S. -"Orthogonal Random Features." Advances in Neural Information Processing -Systems 29 (NeurIPS), 2016. arXiv:1610.09072. +"Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072. [6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means." arXiv:2603.09229, March 2026. + +[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production +Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, +March 2026. + +[8] Community TurboQuant implementations and findings. Key sources: +tonbistudio/turboquant-pytorch (PyTorch, V3 MSE-only findings), +ggml-org/llama.cpp#20969 (C/C++, quantized attention analysis), +0xSero/turboquant (Triton kernels), vivekvar-dl/turboquant (pip package), +scos-lab/turboquant (reference reproduction). Consensus: MSE-only beats +MSE+QJL for attention and ANN ranking at all tested bit widths.