From 635ac99f6fceaa70a03b44d588e3a56f764b7359 Mon Sep 17 00:00:00 2001 From: dor-forer Date: Mon, 1 Jun 2026 17:40:29 +0300 Subject: [PATCH 01/24] =?UTF-8?q?Add=20SQ8=E2=86=94FP16=20x86=20SIMD=20dis?= =?UTF-8?q?tance=20kernels=20[MOD-14954]=20(#970)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Add design doc for SQ8↔FP16 SIMD x86 kernels [MOD-14954] Captures the architecture, file-level plan, CMake F16C gating, and risk register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them into the existing dispatcher tables and SQ8_FP16 unit/benchmark scaffolding from MOD-15141. Co-Authored-By: Claude Opus 4.7 (1M context) * Append -mf16c to AVX2_FMA/AVX2/SSE4 dispatcher sources [MOD-14954] Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes to FP32. The flag is appended only when CXX_F16C is detected; existing SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so emitted code for those kernels is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) * Add SQ8_FP16_SpacesOptimizationTest skeleton [MOD-14954] Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest; currently asserts only the scalar fallback path. Per-tier SIMD assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside the kernel implementations in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) * Add AVX-512 SQ8↔FP16 SIMD distance kernels [MOD-14954] Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product, Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8 storage and FP16 query metadata reads use load_unaligned to tolerate odd dimensions. 
Dispatcher branches in IP_space.cpp / L2_space.cpp select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI when features.avx512f && features.avx512bw && features.avx512vl && features.avx512vnni; otherwise behaviour is unchanged from MOD-15141. A parameterised gtest fixture exercises every residual class in [16, 32] against the scalar baseline. Co-Authored-By: Claude Opus 4.7 (1M context) * Add AVX2+FMA SQ8↔FP16 SIMD distance kernels [MOD-14954] 8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a 256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte FP16 block, convert, then blend zero across unused lanes — mirroring the existing F16C FP16 kernel pattern. Dispatcher branch in {IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 && features.fma3 && features.f16c. Co-Authored-By: Claude Opus 4.7 (1M context) * Add AVX2 (no FMA) SQ8↔FP16 SIMD distance kernels [MOD-14954] Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware without FMA support (uncommon but matches the existing SQ8_FP32 tiering). Dispatcher gate requires features.avx2 && features.f16c and runs between the AVX2+FMA and SSE4 tiers. Co-Authored-By: Claude Opus 4.7 (1M context) * Add SSE4+F16C SQ8↔FP16 SIMD distance kernels [MOD-14954] 4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires features.sse4_1 && features.f16c && features.avx since F16C is VEX-encoded — matches the existing F16C/FP16 dispatcher gate. 
Co-Authored-By: Claude Opus 4.7 (1M context) * Update SQ8_FP16 dispatcher assertions to walk SIMD tiers [MOD-14954] The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA / AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set (only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts to compute the expected function for the host's highest supported tier. Co-Authored-By: Claude Opus 4.7 (1M context) * Register per-ISA SQ8↔FP16 microbenchmarks [MOD-14954] Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier on the corresponding OPT_* defines plus the runtime feature checks that mirror the dispatcher in IP_space.cpp / L2_space.cpp. Co-Authored-By: Claude Opus 4.7 (1M context) * Reformat SQ8↔FP16 SIMD kernels for consistent line breaks * Address PR review findings for SQ8↔FP16 x86 kernels [MOD-14954] - CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp recipe and prevents miscompiles on toolchains with F16C but without AVX. - IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast(pVect1)` with `load_unaligned(pVect1)` to remove strict-aliasing UB on the uint8_t SQ8 lane load. - IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16 query blend forces those FP32 query lanes to 0 → garbage·0=0). - IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)` algebraic-identity comment header that AVX-512 already carried, plus a precondition note that callers must enforce dim >= 16 (matches the established SQ8_FP32 convention; no runtime assert because sibling SQ8_FP32 SIMD kernels also rely on the dispatcher gate). 
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery, ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc so the runtime-selected SIMD tier is actually exercised on those inputs, not just the scalar reference. - test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256, 512, 1024} so multi-iteration do-while loop bugs would fire (the existing [16, 32] range covers at most two AVX-512 chunk iterations). - test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a single test that emits per-tier coverage to stderr and GTEST_SKIPs when no SIMD tier is available, so CI runners without AVX-512 do not silently report zero tier-1 coverage. - test_spaces.cpp: scalar-fallback `alignment` checks now seed the value with 0xFF and assert it remains 0xFF, verifying the dispatcher contract ("scalar leaves caller's value untouched") instead of just measuring that the variable's pre-zeroed init survived. - test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired. - bm_spaces_sq8_fp16.cpp: drop the matching stale comment. Out of scope (separate ticket): two-accumulator FMA refactor (also affects SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity. Co-Authored-By: Claude Opus 4.7 (1M context) * Add multi-accumulator ILP to SQ8↔FP16 x86 kernels [MOD-14954] Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels by widening the inner loop to use multiple independent accumulators. L2 kernels inherit the change through their `…InnerProductImp_…` call. - IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks. - IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the existing 2-step unrolled body now routes each step to an independent accumulator. 
The `residual >= 8` half-chunk feeds the second accumulator so the prologue also breaks the dependency chain. - IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate between sum_a and sum_b for prologue ILP. Correctness invariant: residual block consumes exactly `residual` lanes (0..15) → remaining tail is always a multiple of 16, so the unrolled loops (multiples of 8 / 16 / 64) terminate exactly. Verified by 131 SQ8_FP16 unit tests + 115 under ASan. * Drop misleading VNNI suffix from SQ8↔FP16 AVX-512 kernel [MOD-14954] The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction — the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16 (`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded query, which is a different kernel entirely. The file/function names are renamed to match what the kernel actually uses (AVX-512F). The dispatcher .cpp/.h files stay named after the runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 / INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim. The same misnomer exists for SQ8↔FP32; tracked separately so the rename there can ship as its own commit. Also: fix an out-of-bounds pointer-arithmetic UB introduced by the AVX-512 unroll-4 loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer past one-past-end of the SQ8 storage object when fewer than 64 lane bytes remain, which is UB in C++ regardless of dereference. Switched to pointer subtraction (`static_cast<size_t>(pEnd1 - pVec1) >= 64`).
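The loop-bound fix and the multi-accumulator unroll described above can be sketched in scalar form. This is an illustrative stand-in, not the kernel from the patch: `dot_unrolled4` is a hypothetical name, it works on 4 scalar lanes per iteration instead of 64 SIMD lanes, and it takes an FP32 query rather than FP16.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar stand-in for the unrolled kernel loop (hypothetical helper):
// 4 lanes per iteration, one independent accumulator per lane so that
// consecutive additions do not depend on each other.
float dot_unrolled4(const uint8_t *a, const float *b, size_t n) {
    const uint8_t *end = a + n; // one-past-end is always valid to form
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    // Writing `a + 4 <= end` could form a pointer beyond one-past-end when
    // fewer than 4 elements remain. Comparing the remaining count keeps all
    // pointer arithmetic inside [begin, end].
    while (static_cast<size_t>(end - a) >= 4) {
        s0 += static_cast<float>(a[0]) * b[0];
        s1 += static_cast<float>(a[1]) * b[1];
        s2 += static_cast<float>(a[2]) * b[2];
        s3 += static_cast<float>(a[3]) * b[3];
        a += 4;
        b += 4;
    }

    // Scalar tail for the 0..3 leftover lanes.
    while (a != end) {
        s0 += static_cast<float>(*a++) * *b++;
    }

    // Pairwise combine, mirroring the (sum0 + sum1) + (sum2 + sum3) reduction.
    return (s0 + s1) + (s2 + s3);
}
```

The count-based condition costs the same as the pointer comparison after optimization, so the change is about well-defined behavior rather than speed.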
Renames: - IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h - L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h - SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F - Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F Verified: 131 SQ8_FP16 unit tests + 115 under ASan. * Remove SQ8↔FP16 design doc from PR [MOD-14954] Design doc was added in ad941b8f for planning; not appropriate as a long-lived in-repo artifact. Keep externally (Confluence / scratch) rather than ship with the kernel commit. * Simplify SQ8↔FP16 tests to match sister conventions [MOD-14954] Two trims, both restoring pre-existing patterns elsewhere in the file: 1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`. Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8 sister tests at lines 540-548 and 551-559. 2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`, `MixedSignQueryTest`) were routed through `{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD dispatch on adversarial inputs. Reverted to direct scalar calls (`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape. Coverage rationale: the SIMD kernels are branchless on input values (verified by grep — no value-dependent `if` in any tier). Every code path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s random inputs at multiple dims. The edge-case tests verify the *algebraic identity* (IP of zero query = 1.0, constant storage matches dequant baseline, mixed-sign handling) — scalar correctness on these inputs is what was actually being checked, and the SIMD path matches scalar via the SpacesOptimizationTest tier walk. Net: 77 lines removed from the test file, matches sister conventions, no coverage gap. 
* Split SQ8↔FP16 F16C kernels into sibling TUs [MOD-14954] The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in those dispatcher TUs is F16C-clean. The previous arrangement mixed both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h files. Split each tier's F16C-dependent kernels off into a sibling TU: functions/SSE4.cpp → SSE4 + SQ8_FP32 (no F16C) functions/SSE4_F16C.cpp → SQ8_FP16 only (requires -mavx -mf16c) functions/AVX2.cpp → AVX2 + BF16 + SQ8_FP32 (no F16C) functions/AVX2_F16C.cpp → SQ8_FP16 only (requires -mf16c) functions/AVX2_FMA.cpp → SQ8_FP32 (no F16C) functions/AVX2_FMA_F16C.cpp → SQ8_FP16 only (requires -mf16c) The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses `_mm512_cvtph_ps`, which is part of AVX-512F and not F16C. CMake now compiles each sibling TU conditionally on `_has_full_f16c` and applies the F16C flags only there. Base TUs no longer carry `-mf16c`, since they no longer reference F16C intrinsics. Result: - No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`. - Compile-time isolation: an F16C intrinsic accidentally added outside a `_F16C` sibling will fail to build, not silently miscompile. - Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`, `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new sibling .h includes are unconditional, since declarations alone don't link-error and the calls remain guarded. Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm no regression from the dispatcher restructure). * Move SQ8↔FP16 AVX-512 dispatch to AVX512F tier + flatten F16C guards [MOD-14954] Two related cleanups in the SQ8↔FP16 dispatch path: 1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions (`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) 
but was registered under the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni). That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake variants, etc.) would fall through to AVX2_FMA even though they can run the AVX-512 kernel. Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F` definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to `functions/AVX512F.cpp`, with matching header reshuffle. Dispatch sites now gate on `OPT_AVX512F` + `features.avx512f`. 2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16 tiers (SSE4, AVX2, AVX2+FMA) — every one of them widens FP16 query lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted into a single outer block around the three ISA branches in `IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and `L2_SQ8_FP16_GetDistFunc`. Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite. * Clean up whitespace and formatting inconsistencies Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions. * Remove obsolete SQ8-to-FP16 dispatch comments The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files. * Hoist OPT_F16C guard around lower SIMD tiers in SQ8↔FP16 tests [MOD-14954] Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a single OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each test body (L2/IP/Cosine) and the TierCoverage report now use the same single-guard shape. Also retargets the TierCoverage AVX-512 check from OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new AVX-512F-only gate. 
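The resulting tier order can be summarized with a small selection function. This is a sketch under stated assumptions: `Features`, `Tier`, and `choose_sq8_fp16_tier` are hypothetical names, and the compile-time `OPT_*` guards that surround each branch in the real dispatchers are left out so that only the runtime gates remain.

```cpp
#include <cstddef>

// Hypothetical stand-in for the runtime CPU feature flags named in the text.
struct Features {
    bool avx512f = false;
    bool avx2 = false;
    bool fma3 = false;
    bool f16c = false;
    bool avx = false;
    bool sse4_1 = false;
};

enum class Tier { Scalar, SSE4, AVX2, AVX2_FMA, AVX512F };

// Picks the highest tier whose runtime gate, as described in the commit
// messages, is satisfied. Compile-time OPT_* guards are omitted here.
Tier choose_sq8_fp16_tier(size_t dim, const Features &f) {
    if (dim < 16)
        return Tier::Scalar; // SIMD kernels assume dim >= 16
    if (f.avx512f)
        return Tier::AVX512F; // _mm512_cvtph_ps is AVX-512F, no F16C needed
    if (f.f16c) { // single guard hoisted around the three lower tiers
        if (f.avx2 && f.fma3)
            return Tier::AVX2_FMA;
        if (f.avx2)
            return Tier::AVX2;
        if (f.sse4_1 && f.avx)
            return Tier::SSE4; // F16C is VEX-encoded, so AVX is required
    }
    return Tier::Scalar;
}
```

With this shape, a CPU that reports AVX-512F but not VNNI selects the AVX-512 tier instead of falling through to AVX2+FMA, which is the behavior change the commit describes.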
Co-Authored-By: Claude Opus 4.7 (1M context) * Drop non-idiomatic SQ8↔FP16 tier-coverage reporter test [MOD-14954] SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no other data type has a std::cerr-based coverage reporter. Per-tier coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature flags), and ISA-lane presence is handled by the CI matrix, matching the convention used by every other type's SpacesOptimizationTest. Co-Authored-By: Claude Opus 4.8 (1M context) * Simplify SQ8↔FP16 kernels and trim PR churn [MOD-14954] - AVX512F IP: keep the <=3 tail chunks on distinct accumulators (sum0/sum1/sum2) instead of serializing into one, preserving ILP when the main 64-lane loop runs few or zero times. - Condense kernel header comments; drop redundant float16.h/alignment.h includes (pulled in transitively) and the direct include (provided via space_includes.h, matching the other AVX512F headers). - test_spaces: align the SQ8_FP16 scalar-fallback alignment assertion with the convention used by the other SpacesOptimizationTest suites. - Revert unrelated CMake message/quote churn on the base AVX2/SSE4 TUs and the stray blank line in AVX512F_BW_VL_VNNI.cpp, leaving only the additive F16C build blocks in this PR. Co-Authored-By: Claude Opus 4.8 (1M context) * Document why OPT_F16C differs from the other OPT_* macros [MOD-14954] Explain at the definition site that OPT_F16C is a cross-cutting capability gate (not a 1:1 dispatch tier), why it is a compound CXX_F16C/FMA/AVX guard (F16C is VEX-encoded and needs AVX state), and why the AVX-512 SQ8<->FP16 path stays outside it (_mm512_cvtph_ps is part of AVX512F). 
Co-Authored-By: Claude Opus 4.8 (1M context) * Cover AVX512 three-chunk tail and dim<16 dispatcher guard in SQ8_FP16 tests [MOD-14954] Codecov flagged 4 uncovered lines on PR #970: - The AVX512F `remaining >= 48` third tail step in IP_AVX512F_SQ8_FP16.h was never executed: the test dims never satisfied (dim / 16) % 4 == 3. Add 48 (zero main-loop iterations) and 112 (one main-loop iteration) to exercise it. - The `dim < 16` scalar early-return in the IP/Cosine/L2 SQ8_FP16 dispatchers was never taken. Assert the three dispatchers return the scalar funcs at dim 8. Test-only change. Local release + ASan: SQ8_FP16 137/137, ASan clean. Co-Authored-By: Claude Opus 4.8 (1M context) --------- Co-authored-by: Claude Opus 4.7 (1M context) --- cmake/x86_64InstructionFlags.cmake | 18 ++ src/VecSim/spaces/CMakeLists.txt | 30 ++ src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h | 102 +++++++ src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h | 101 +++++++ src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h | 113 ++++++++ src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h | 118 ++++++++ src/VecSim/spaces/IP_space.cpp | 98 ++++++- src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h | 32 +++ src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h | 32 +++ src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h | 32 +++ src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h | 31 ++ src/VecSim/spaces/L2_space.cpp | 52 +++- src/VecSim/spaces/functions/AVX2_F16C.cpp | 35 +++ src/VecSim/spaces/functions/AVX2_F16C.h | 23 ++ src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp | 35 +++ src/VecSim/spaces/functions/AVX2_FMA_F16C.h | 23 ++ src/VecSim/spaces/functions/AVX512F.cpp | 21 ++ src/VecSim/spaces/functions/AVX512F.h | 5 + src/VecSim/spaces/functions/SSE4_F16C.cpp | 35 +++ src/VecSim/spaces/functions/SSE4_F16C.h | 23 ++ tests/benchmark/spaces_benchmarks/bm_spaces.h | 3 + .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 42 ++- tests/unit/test_spaces.cpp | 269 +++++++++++++++++- 23 files changed, 1245 insertions(+), 28 deletions(-) create mode 100644 
src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h create mode 100644 src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h create mode 100644 src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h create mode 100644 src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h create mode 100644 src/VecSim/spaces/functions/AVX2_F16C.cpp create mode 100644 src/VecSim/spaces/functions/AVX2_F16C.h create mode 100644 src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp create mode 100644 src/VecSim/spaces/functions/AVX2_FMA_F16C.h create mode 100644 src/VecSim/spaces/functions/SSE4_F16C.cpp create mode 100644 src/VecSim/spaces/functions/SSE4_F16C.h diff --git a/cmake/x86_64InstructionFlags.cmake b/cmake/x86_64InstructionFlags.cmake index f19ef7662..ff0e43e97 100644 --- a/cmake/x86_64InstructionFlags.cmake +++ b/cmake/x86_64InstructionFlags.cmake @@ -73,6 +73,24 @@ if(CXX_AVX512F AND CXX_AVX512BW AND CXX_AVX512VL AND CXX_AVX512VNNI) add_compile_definitions(OPT_AVX512_F_BW_VL_VNNI) endif() +# OPT_F16C is unusual compared to the other OPT_* macros above: +# +# 1. It is a *capability* gate, not a dispatch tier. Every other OPT_* maps 1:1 to a single +# ISA tier that owns its own translation unit (OPT_AVX2 -> AVX2.cpp, OPT_SSE4 -> SSE4.cpp). +# F16C owns no tier of its own; it only enables the vcvtph2ps (FP16<->FP32) conversion that +# several tiers need. So it is hoisted *around* multiple tiers (AVX2_FMA / AVX2 / SSE4 for +# the SQ8<->FP16 kernels) rather than selecting one. +# +# 2. It is a compound guard (CXX_F16C AND CXX_FMA AND CXX_AVX), not a single flag. F16C is +# VEX-encoded, so vcvtph2ps requires AVX state to execute -- emitting it without AVX is +# invalid. 
Defining OPT_F16C therefore implies AVX is present, and the F16C kernels must be +# compiled with -mf16c added *on top of* -mavx (see functions/*_F16C.cpp in +# src/VecSim/spaces/CMakeLists.txt). The base AVX2.cpp / SSE4.cpp objects stay F16C-free so +# they still run on CPUs without F16C. +# +# 3. The AVX-512 tier deliberately does NOT use this gate: _mm512_cvtph_ps is part of AVX512F +# itself, so the AVX-512 SQ8<->FP16 path needs only OPT_AVX512F and lives outside any +# OPT_F16C guard. if(CXX_F16C AND CXX_FMA AND CXX_AVX) add_compile_definitions(OPT_F16C) endif() diff --git a/src/VecSim/spaces/CMakeLists.txt b/src/VecSim/spaces/CMakeLists.txt index fe354ded5..309d3f3a4 100644 --- a/src/VecSim/spaces/CMakeLists.txt +++ b/src/VecSim/spaces/CMakeLists.txt @@ -50,18 +50,40 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") list(APPEND OPTIMIZATIONS functions/AVX512F_BW_VL_VNNI.cpp) endif() + # F16C is VEX-encoded and requires AVX state, so it is only meaningful when the toolchain + # can also emit AVX/FMA. Mirrors the OPT_F16C macro condition in x86_64InstructionFlags.cmake. + set(_has_full_f16c FALSE) + if(CXX_F16C AND CXX_FMA AND CXX_AVX) + set(_has_full_f16c TRUE) + endif() + + # Base AVX2 / AVX2+FMA dispatcher TUs hold only kernels with no F16C dependency. + # SQ8↔FP16 kernels (which require F16C) live in sibling TUs functions/AVX2_F16C.cpp and + # functions/AVX2_FMA_F16C.cpp, compiled only when _has_full_f16c is true. 
if(CXX_AVX2) message("Building with AVX2") set_source_files_properties(functions/AVX2.cpp PROPERTIES COMPILE_FLAGS -mavx2) list(APPEND OPTIMIZATIONS functions/AVX2.cpp) endif() + if(CXX_AVX2 AND _has_full_f16c) + message("Building functions/AVX2_F16C.cpp with AVX2 and F16C") + set_source_files_properties(functions/AVX2_F16C.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mf16c") + list(APPEND OPTIMIZATIONS functions/AVX2_F16C.cpp) + endif() + if(CXX_AVX2 AND CXX_FMA) message("Building with AVX2 and FMA") set_source_files_properties(functions/AVX2_FMA.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma") list(APPEND OPTIMIZATIONS functions/AVX2_FMA.cpp) endif() + if(CXX_AVX2 AND CXX_FMA AND _has_full_f16c) + message("Building functions/AVX2_FMA_F16C.cpp with AVX2, FMA, and F16C") + set_source_files_properties(functions/AVX2_FMA_F16C.cpp PROPERTIES COMPILE_FLAGS "-mavx2 -mfma -mf16c") + list(APPEND OPTIMIZATIONS functions/AVX2_FMA_F16C.cpp) + endif() + if(CXX_F16C AND CXX_FMA AND CXX_AVX) message("Building with CXX_F16C") set_source_files_properties(functions/F16C.cpp PROPERTIES COMPILE_FLAGS "-mf16c -mfma -mavx") @@ -86,6 +108,14 @@ if(CMAKE_SYSTEM_PROCESSOR MATCHES "(x86_64)|(AMD64|amd64)|(^i.86$)") list(APPEND OPTIMIZATIONS functions/SSE4.cpp) endif() + # SSE4 SQ8↔FP16 kernels need F16C, which is VEX-encoded → require -mavx alongside -mf16c + # (mirrors the F16C.cpp recipe above). 
+ if(CXX_SSE4 AND _has_full_f16c) + message("Building functions/SSE4_F16C.cpp with SSE4.1, AVX, and F16C") + set_source_files_properties(functions/SSE4_F16C.cpp PROPERTIES COMPILE_FLAGS "-msse4.1 -mavx -mf16c") + list(APPEND OPTIMIZATIONS functions/SSE4_F16C.cpp) + endif() + if(CXX_SSE) message("Building with SSE") set_source_files_properties(functions/SSE.cpp PROPERTIES COMPILE_FLAGS -msse) diff --git a/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h new file mode 100644 index 000000000..3800f1e8a --- /dev/null +++ b/src/VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h @@ -0,0 +1,102 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/AVX_utils.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity: + * IP(x, y) = min * y_sum + delta * Σ(q_i * y_i) + * + * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C); + * inner-loop arithmetic runs in FP32 with _mm256_fmadd_ps. + */ + +// 8-wide AVX2+FMA step: 8 SQ8 lanes + 8 FP16 lanes -> 8 FP32 fused-multiply-add. 
+static inline void SQ8_FP16_InnerProductStep_AVX2_FMA(const uint8_t *&pVect1, + const float16 *&pVect2, __m256 &sum256) { + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect1)); + pVect1 += 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVect2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + pVect2 += 8; + + sum256 = _mm256_fmadd_ps(v1_f, v2_f, sum256); +} + +// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher). +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductImp_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) { + const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); + const float16 *pVec2 = static_cast<const float16 *>(pVec2v); + const uint8_t *pEnd1 = pVec1 + dimension; + + // Two accumulators break the FMA dependency chain across consecutive iterations. + __m256 sum_a = _mm256_setzero_ps(); + __m256 sum_b = _mm256_setzero_ps(); + + if constexpr (residual % 8) { + constexpr int mask = (1 << (residual % 8)) - 1; + + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1)); + pVec1 += residual % 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask); + pVec2 += residual % 8; + + sum_a = _mm256_mul_ps(v1_f, v2_f); + } + + if constexpr (residual >= 8) { + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b); + } + + do { + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_a); + SQ8_FP16_InnerProductStep_AVX2_FMA(pVec1, pVec2, sum_b); + } while (pVec1 < pEnd1); + + __m256 sum256 = _mm256_add_ps(sum_a, sum_b); + float quantized_dot = my_mm256_reduce_add_ps(sum256); + + const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v); + const uint8_t *params_bytes = pVec1Base + dimension; + const float min_val = load_unaligned<float>(params_bytes + 
sq8::MIN_VAL * sizeof(float)); + const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float)); + + const float16 *pVec2Base = static_cast<const float16 *>(pVec2v); + const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension); + const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductSIMD16_AVX2_FMA(const void *pVec1v, const void *pVec2v, + size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductImp_AVX2_FMA<residual>(pVec1v, pVec2v, dimension); +} + +template <unsigned char residual> // 0..15 +float SQ8_FP16_CosineSIMD16_AVX2_FMA(const void *pVec1v, const void *pVec2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD16_AVX2_FMA<residual>(pVec1v, pVec2v, dimension); +} diff --git a/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h new file mode 100644 index 000000000..acec6102c --- /dev/null +++ b/src/VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h @@ -0,0 +1,101 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/AVX_utils.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity: + * IP(x, y) = min * y_sum + delta * Σ(q_i * y_i) + * + * FP16 query lanes are widened to FP32 per 8-lane chunk via _mm256_cvtph_ps (F16C); + * inner-loop arithmetic runs in FP32 with separate _mm256_mul_ps + _mm256_add_ps (no FMA). + */ + +// 8-wide AVX2 step (no FMA): 8 SQ8 lanes + 8 FP16 lanes -> mul + add into sum. 
+static inline void SQ8_FP16_InnerProductStep_AVX2(const uint8_t *&pVect1, const float16 *&pVect2, + __m256 &sum256) { + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect1)); + pVect1 += 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVect2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + pVect2 += 8; + + sum256 = _mm256_add_ps(sum256, _mm256_mul_ps(v1_f, v2_f)); +} + +// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher). +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductImp_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { + const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v); + const float16 *pVec2 = static_cast<const float16 *>(pVec2v); + const uint8_t *pEnd1 = pVec1 + dimension; + + // Two accumulators break the mul->add dependency chain (no FMA on this tier). + __m256 sum_a = _mm256_setzero_ps(); + __m256 sum_b = _mm256_setzero_ps(); + + if constexpr (residual % 8) { + constexpr int mask = (1 << (residual % 8)) - 1; + + __m128i v1_128 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVec1)); + pVec1 += residual % 8; + __m256i v1_256 = _mm256_cvtepu8_epi32(v1_128); + __m256 v1_f = _mm256_cvtepi32_ps(v1_256); + + __m128i v2_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec2)); + __m256 v2_f = _mm256_cvtph_ps(v2_128); + v2_f = _mm256_blend_ps(_mm256_setzero_ps(), v2_f, mask); + pVec2 += residual % 8; + + sum_a = _mm256_mul_ps(v1_f, v2_f); + } + + if constexpr (residual >= 8) { + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b); + } + + do { + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_a); + SQ8_FP16_InnerProductStep_AVX2(pVec1, pVec2, sum_b); + } while (pVec1 < pEnd1); + + __m256 sum256 = _mm256_add_ps(sum_a, sum_b); + float quantized_dot = my_mm256_reduce_add_ps(sum256); + + const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v); + const uint8_t *params_bytes = pVec1Base + dimension; + const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL 
* sizeof(float)); + const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float)); + + const float16 *pVec2Base = static_cast<const float16 *>(pVec2v); + const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension); + const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductSIMD16_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductImp_AVX2<residual>(pVec1v, pVec2v, dimension); +} + +template <unsigned char residual> // 0..15 +float SQ8_FP16_CosineSIMD16_AVX2(const void *pVec1v, const void *pVec2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD16_AVX2<residual>(pVec1v, pVec2v, dimension); +} diff --git a/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h new file mode 100644 index 000000000..60d0ba719 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h @@ -0,0 +1,113 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include "VecSim/utils/alignment.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity: + * IP(x, y) = min * y_sum + delta * Σ(q_i * y_i) + * + * FP16 query lanes are widened to FP32 per 16-lane chunk via _mm512_cvtph_ps (AVX512F); + * inner-loop arithmetic runs in FP32 with _mm512_fmadd_ps. + */ + +// 16-wide AVX512F step: 16 SQ8 lanes + 16 FP16 lanes -> 16 FP32 fused-multiply-add. 
+static inline void SQ8_FP16_InnerProductStep_AVX512(const uint8_t *&pVec1, const float16 *&pVec2,
+                                                    __m512 &sum) {
+    __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
+    __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
+    __m512 v1_f = _mm512_cvtepi32_ps(v1_512);
+
+    __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
+    __m512 v2_f = _mm512_cvtph_ps(v2_16);
+
+    sum = _mm512_fmadd_ps(v1_f, v2_f, sum);
+
+    pVec1 += 16;
+    pVec2 += 16;
+}
+
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductImp_AVX512(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    // Four accumulators break the FMA dependency chain to saturate both FMA ports.
+    __m512 sum0 = _mm512_setzero_ps();
+    __m512 sum1 = _mm512_setzero_ps();
+    __m512 sum2 = _mm512_setzero_ps();
+    __m512 sum3 = _mm512_setzero_ps();
+
+    if constexpr (residual > 0) {
+        __mmask16 mask = (1U << residual) - 1;
+
+        __m128i v1_128 = _mm_loadu_si128(reinterpret_cast<const __m128i *>(pVec1));
+        __m512i v1_512 = _mm512_cvtepu8_epi32(v1_128);
+        __m512 v1_f = _mm512_cvtepi32_ps(v1_512);
+
+        __m256i v2_16 = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(pVec2));
+        __m512 v2_f = _mm512_cvtph_ps(v2_16);
+
+        sum0 = _mm512_maskz_mul_ps(mask, v1_f, v2_f);
+
+        pVec1 += residual;
+        pVec2 += residual;
+    }
+
+    // Main loop: 4 chunks of 16 lanes per iteration, one chunk per accumulator.
+    while (static_cast<size_t>(pEnd1 - pVec1) >= 64) {
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2);
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum3);
+    }
+
+    // Tail: at most three remaining 16-lane chunks (post-residual remainder is a multiple of 16).
+    // Keep chunks on distinct accumulators to preserve ILP when the main loop did not run.
+    const size_t remaining = pEnd1 - pVec1;
+    if (remaining >= 16)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum0);
+    if (remaining >= 32)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum1);
+    if (remaining >= 48)
+        SQ8_FP16_InnerProductStep_AVX512(pVec1, pVec2, sum2);
+
+    __m512 sum = _mm512_add_ps(_mm512_add_ps(sum0, sum1), _mm512_add_ps(sum2, sum3));
+    float quantized_dot = _mm512_reduce_add_ps(sum);
+
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_AVX512F(const void *pVec1v, const void *pVec2v,
+                                          size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductImp_AVX512<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_AVX512F(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return SQ8_FP16_InnerProductSIMD16_AVX512F<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
new file mode 100644
index 000000000..1cc3cb153
--- /dev/null
+++ b/src/VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h
@@ -0,0 +1,118 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/types/sq8.h"
+#include "VecSim/types/float16.h"
+#include "VecSim/utils/alignment.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+/*
+ * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity:
+ *   IP(x, y) = min * y_sum + delta * Σ(q_i * y_i)
+ *
+ * FP16 query lanes are widened to FP32 per 4-lane chunk via _mm_cvtph_ps (F16C);
+ * inner-loop arithmetic runs in FP32 with separate _mm_mul_ps + _mm_add_ps (no FMA).
+ */
+
+// 4-wide SSE4+F16C step: 4 SQ8 lanes + 4 FP16 lanes -> mul + add into sum.
+static inline void SQ8_FP16_InnerProductStep_SSE4(const uint8_t *&pVect1, const float16 *&pVect2,
+                                                  __m128 &sum) {
+    __m128i v1_i = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(load_unaligned<int32_t>(pVect1)));
+    pVect1 += 4;
+    __m128 v1_f = _mm_cvtepi32_ps(v1_i);
+
+    __m128i v2_8 = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(pVect2));
+    __m128 v2_f = _mm_cvtph_ps(v2_8);
+    pVect2 += 4;
+
+    sum = _mm_add_ps(sum, _mm_mul_ps(v1_f, v2_f));
+}
+
+// pVec1v = SQ8 storage, pVec2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher).
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_SSE4_IMP(const void *pVec1v, const void *pVec2v,
+                                           size_t dimension) {
+    const uint8_t *pVec1 = static_cast<const uint8_t *>(pVec1v);
+    const float16 *pVec2 = static_cast<const float16 *>(pVec2v);
+    const uint8_t *pEnd1 = pVec1 + dimension;
+
+    // Two accumulators break the mul->add dependency chain (no FMA on this tier).
+    __m128 sum_a = _mm_setzero_ps();
+    __m128 sum_b = _mm_setzero_ps();
+
+    if constexpr (residual % 4) {
+        __m128 v1_f;
+        __m128 v2_f;
+
+        if constexpr (residual % 4 == 3) {
+            v1_f = _mm_set_ps(0.0f, static_cast<float>(pVec1[2]), static_cast<float>(pVec1[1]),
+                              static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, vecsim_types::FP16_to_FP32(pVec2[2]),
+                              vecsim_types::FP16_to_FP32(pVec2[1]),
+                              vecsim_types::FP16_to_FP32(pVec2[0]));
+        } else if constexpr (residual % 4 == 2) {
+            v1_f =
+                _mm_set_ps(0.0f, 0.0f, static_cast<float>(pVec1[1]), static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[1]),
+                              vecsim_types::FP16_to_FP32(pVec2[0]));
+        } else if constexpr (residual % 4 == 1) {
+            v1_f = _mm_set_ps(0.0f, 0.0f, 0.0f, static_cast<float>(pVec1[0]));
+            v2_f = _mm_set_ps(0.0f, 0.0f, 0.0f, vecsim_types::FP16_to_FP32(pVec2[0]));
+        }
+
+        pVec1 += residual % 4;
+        pVec2 += residual % 4;
+
+        sum_a = _mm_mul_ps(v1_f, v2_f);
+    }
+
+    if constexpr (residual >= 4) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
+    }
+    if constexpr (residual >= 8) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a);
+    }
+    if constexpr (residual >= 12) {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
+    }
+
+    do {
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_a);
+        SQ8_FP16_InnerProductStep_SSE4(pVec1, pVec2, sum_b);
+    } while (pVec1 < pEnd1);
+
+    __m128 sum = _mm_add_ps(sum_a, sum_b);
+    float PORTABLE_ALIGN16 TmpRes[4];
+    _mm_store_ps(TmpRes, sum);
+    float quantized_dot = TmpRes[0] + TmpRes[1] + TmpRes[2] + TmpRes[3];
+
+    const uint8_t *pVec1Base = static_cast<const uint8_t *>(pVec1v);
+    const uint8_t *params_bytes = pVec1Base + dimension;
+    const float min_val = load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float));
+    const float delta = load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float));
+
+    const float16 *pVec2Base = static_cast<const float16 *>(pVec2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVec2Base + dimension);
+    const float y_sum = load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float));
+
+    return min_val * y_sum + delta * quantized_dot;
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_SSE4(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return 1.0f - SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVec1v, pVec2v, dimension);
+}
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_CosineSIMD16_SSE4(const void *pVec1v, const void *pVec2v, size_t dimension) {
+    return SQ8_FP16_InnerProductSIMD16_SSE4<residual>(pVec1v, pVec2v, dimension);
+}
diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp
index 55979e25a..b57971b60 100644
--- a/src/VecSim/spaces/IP_space.cpp
+++ b/src/VecSim/spaces/IP_space.cpp
@@ -20,9 +20,12 @@
 #include "VecSim/spaces/functions/AVX512BF16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/NEON.h"
 #include "VecSim/spaces/functions/NEON_DOTPROD.h"
 #include "VecSim/spaces/functions/NEON_HP.h"
@@ -172,31 +175,106 @@ dist_func_t<float> Cosine_SQ8_FP32_GetDistFunc(size_t dim, unsigned char *alignm
 }

 // SQ8-FP16: asymmetric inner product distance between SQ8 storage and FP16 query.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always
-// returns the scalar implementation.
 dist_func_t<float> IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
                                            const void *arch_opt) {
     unsigned char dummy_alignment;
     if (alignment == nullptr) {
         alignment = &dummy_alignment;
     }
-    (void)dim;
-    (void)arch_opt;
-    return SQ8_FP16_InnerProduct;
+
+    dist_func_t<float> ret_dist_func = SQ8_FP16_InnerProduct;
+    [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim < 16) {
+        return ret_dist_func;
+    }
+    // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
+    // AVX-512 tier only needs AVX-512F (cvtph_ps is part of AVX-512F, no VNNI/BW/VL required).
+#ifdef OPT_AVX512F
+    if (features.avx512f) {
+        if (dim % 16 == 0) // SQ8 chunk = 16 bytes
+            *alignment = 16 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_AVX512F(dim);
+    }
+#endif
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three.
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+    if (features.avx2 && features.fma3 && features.f16c) {
+        if (dim % 8 == 0) // SQ8 chunk = 8 bytes
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim);
+    }
+#endif
+#ifdef OPT_AVX2
+    if (features.avx2 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_AVX2(dim);
+    }
+#endif
+#ifdef OPT_SSE4
+    // F16C is VEX-encoded — require AVX as well, matching the existing F16C/FP16 dispatcher.
+    if (features.sse4_1 && features.f16c && features.avx) {
+        if (dim % 4 == 0)
+            *alignment = 4 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_IP_implementation_SSE4(dim);
+    }
+#endif
+#endif // OPT_F16C
+#endif // x86_64
+    return ret_dist_func;
 }

 // SQ8-FP16: asymmetric cosine distance between SQ8 storage and FP16 query.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always
-// returns the scalar implementation.
 dist_func_t<float> Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
                                                const void *arch_opt) {
     unsigned char dummy_alignment;
     if (alignment == nullptr) {
         alignment = &dummy_alignment;
     }
-    (void)dim;
-    (void)arch_opt;
-    return SQ8_FP16_Cosine;
+
+    dist_func_t<float> ret_dist_func = SQ8_FP16_Cosine;
+    [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim < 16) {
+        return ret_dist_func;
+    }
+#ifdef OPT_AVX512F
+    if (features.avx512f) {
+        if (dim % 16 == 0)
+            *alignment = 16 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim);
+    }
+#endif
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+    if (features.avx2 && features.fma3 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim);
+    }
+#endif
+#ifdef OPT_AVX2
+    if (features.avx2 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_AVX2(dim);
+    }
+#endif
+#ifdef OPT_SSE4
+    if (features.sse4_1 && features.f16c && features.avx) {
+        if (dim % 4 == 0)
+            *alignment = 4 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_Cosine_implementation_SSE4(dim);
+    }
+#endif
+#endif // OPT_F16C
+#endif // x86_64
+    return ret_dist_func;
 }

 // SQ8-to-SQ8 Inner Product distance function (both vectors are uint8 quantized with precomputed
diff --git a/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
new file mode 100644
index 000000000..c855b62ca
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h
@@ -0,0 +1,32 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/AVX_utils.h"
+#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_AVX2_FMA(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductImp_AVX2_FMA<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
new file mode 100644
index 000000000..7c2cbfcd8
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h
@@ -0,0 +1,32 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/AVX_utils.h"
+#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_AVX2(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductImp_AVX2<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
new file mode 100644
index 000000000..9d7b1569f
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h
@@ -0,0 +1,32 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+// L2² = x_sum_squares + y_sum_squares - 2 * IP(x, y), computed via the AVX-512 IP impl above.
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_AVX512F(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductImp_AVX512<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
new file mode 100644
index 000000000..d0a0fea06
--- /dev/null
+++ b/src/VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+#include "VecSim/spaces/space_includes.h"
+#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
+#include "VecSim/types/sq8.h"
+
+using sq8 = vecsim_types::sq8;
+using float16 = vecsim_types::float16;
+
+template <unsigned char residual> // 0..15
+float SQ8_FP16_L2SqrSIMD16_SSE4(const void *pVect1v, const void *pVect2v, size_t dimension) {
+    const float ip = SQ8_FP16_InnerProductSIMD16_SSE4_IMP<residual>(pVect1v, pVect2v, dimension);
+
+    const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v);
+    const uint8_t *params_bytes = pVect1 + dimension;
+    const float x_sum_sq = load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float));
+
+    const float16 *pVect2 = static_cast<const float16 *>(pVect2v);
+    const auto *query_meta_bytes = reinterpret_cast<const uint8_t *>(pVect2 + dimension);
+    const float y_sum_sq =
+        load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float));
+
+    return x_sum_sq + y_sum_sq - 2.0f * ip;
+}
diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp
index ba3dd7cab..43020399f 100644
--- a/src/VecSim/spaces/L2_space.cpp
+++ b/src/VecSim/spaces/L2_space.cpp
@@ -19,9 +19,12 @@
 #include "VecSim/spaces/functions/AVX512FP16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/NEON.h"
 #include "VecSim/spaces/functions/NEON_DOTPROD.h"
 #include "VecSim/spaces/functions/NEON_HP.h"
@@ -104,17 +107,56 @@ dist_func_t<float> L2_SQ8_FP32_GetDistFunc(size_t dim, unsigned char *alignment,
 }

 // SQ8-FP16: asymmetric L2 distance between SQ8 storage and FP16 query.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now this always
-// returns the scalar implementation.
 dist_func_t<float> L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment,
                                            const void *arch_opt) {
     unsigned char dummy_alignment;
     if (!alignment) {
         alignment = &dummy_alignment;
     }
-    (void)dim;
-    (void)arch_opt;
-    return SQ8_FP16_L2Sqr;
+
+    dist_func_t<float> ret_dist_func = SQ8_FP16_L2Sqr;
+    [[maybe_unused]] auto features = getCpuOptimizationFeatures(arch_opt);
+
+#ifdef CPU_FEATURES_ARCH_X86_64
+    if (dim < 16) {
+        return ret_dist_func;
+    }
+    // Alignment hints below refer to the SQ8 (first) operand per the GetDistFunc contract.
+    // AVX-512 tier only needs AVX-512F (cvtph_ps is part of AVX-512F, no VNNI/BW/VL required).
+#ifdef OPT_AVX512F
+    if (features.avx512f) {
+        if (dim % 16 == 0)
+            *alignment = 16 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_AVX512F(dim);
+    }
+#endif
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three.
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+    if (features.avx2 && features.fma3 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim);
+    }
+#endif
+#ifdef OPT_AVX2
+    if (features.avx2 && features.f16c) {
+        if (dim % 8 == 0)
+            *alignment = 8 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_AVX2(dim);
+    }
+#endif
+#ifdef OPT_SSE4
+    if (features.sse4_1 && features.f16c && features.avx) {
+        if (dim % 4 == 0)
+            *alignment = 4 * sizeof(uint8_t);
+        return Choose_SQ8_FP16_L2_implementation_SSE4(dim);
+    }
+#endif
+#endif // OPT_F16C
+#endif // x86_64
+    return ret_dist_func;
 }

 dist_func_t<float> L2_FP32_GetDistFunc(size_t dim, unsigned char *alignment, const void *arch_opt) {
diff --git a/src/VecSim/spaces/functions/AVX2_F16C.cpp b/src/VecSim/spaces/functions/AVX2_F16C.cpp
new file mode 100644
index 000000000..3d298e81b
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_F16C.cpp
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#include "AVX2_F16C.h"
+#include "VecSim/spaces/IP/IP_AVX2_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX2_SQ8_FP16.h"
+
+namespace spaces {
+
+#include "implementation_chooser.h"
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2);
+    return ret_dist_func;
+}
+
+#include "implementation_chooser_cleanup.h"
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_F16C.h b/src/VecSim/spaces/functions/AVX2_F16C.h
new file mode 100644
index 000000000..95a171199
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_F16C.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+
+#include "VecSim/spaces/spaces.h"
+
+// SQ8↔FP16 kernels for the AVX2 (no FMA) tier. Live in a sibling TU compiled only when the
+// toolchain supports F16C (via `-mf16c`), so this header has no preprocessor guard. Callers
+// still gate the calls themselves with `#ifdef OPT_F16C`.
+
+namespace spaces {
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2(size_t dim);
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp b/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp
new file mode 100644
index 000000000..4e9dd8131
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_FMA_F16C.cpp
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#include "AVX2_FMA_F16C.h"
+#include "VecSim/spaces/IP/IP_AVX2_FMA_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_AVX2_FMA_SQ8_FP16.h"
+
+namespace spaces {
+
+#include "implementation_chooser.h"
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX2_FMA);
+    return ret_dist_func;
+}
+
+#include "implementation_chooser_cleanup.h"
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX2_FMA_F16C.h b/src/VecSim/spaces/functions/AVX2_FMA_F16C.h
new file mode 100644
index 000000000..7943ff4eb
--- /dev/null
+++ b/src/VecSim/spaces/functions/AVX2_FMA_F16C.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+
+#include "VecSim/spaces/spaces.h"
+
+// SQ8↔FP16 kernels for the AVX2+FMA tier. Live in a sibling TU compiled only when the
+// toolchain supports F16C (via `-mf16c`), so this header has no preprocessor guard. Callers
+// still gate the calls themselves with `#ifdef OPT_F16C`.
+
+namespace spaces {
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX2_FMA(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX2_FMA(size_t dim);
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F.cpp b/src/VecSim/spaces/functions/AVX512F.cpp
index e765f4c8b..feb261fb4 100644
--- a/src/VecSim/spaces/functions/AVX512F.cpp
+++ b/src/VecSim/spaces/functions/AVX512F.cpp
@@ -11,10 +11,12 @@
 #include "VecSim/spaces/L2/L2_AVX512F_FP16.h"
 #include "VecSim/spaces/L2/L2_AVX512F_FP32.h"
 #include "VecSim/spaces/L2/L2_AVX512F_FP64.h"
+#include "VecSim/spaces/L2/L2_AVX512F_SQ8_FP16.h"

 #include "VecSim/spaces/IP/IP_AVX512F_FP16.h"
 #include "VecSim/spaces/IP/IP_AVX512F_FP32.h"
 #include "VecSim/spaces/IP/IP_AVX512F_FP64.h"
+#include "VecSim/spaces/IP/IP_AVX512F_SQ8_FP16.h"

 namespace spaces {

@@ -56,6 +58,25 @@ dist_func_t<float> Choose_FP16_L2_implementation_AVX512F(size_t dim) {
     return ret_dist_func;
 }

+// SQ8↔FP16 kernels only use AVX-512F (cvtph_ps + FMA), so they register here rather than under
+// the VNNI tier — CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake variants) can use
+// these kernels.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_AVX512F);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_AVX512F);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_AVX512F);
+    return ret_dist_func;
+}
+
 #include "implementation_chooser_cleanup.h"

 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/AVX512F.h b/src/VecSim/spaces/functions/AVX512F.h
index fd36f312f..8d600f961 100644
--- a/src/VecSim/spaces/functions/AVX512F.h
+++ b/src/VecSim/spaces/functions/AVX512F.h
@@ -20,4 +20,9 @@ dist_func_t<float> Choose_FP16_L2_implementation_AVX512F(size_t dim);
 dist_func_t<float> Choose_FP32_L2_implementation_AVX512F(size_t dim);
 dist_func_t<double> Choose_FP64_L2_implementation_AVX512F(size_t dim);

+// SQ8↔FP16 kernels — only need AVX-512F, not VNNI/BW/VL.
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_AVX512F(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_AVX512F(size_t dim);
+
 } // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4_F16C.cpp b/src/VecSim/spaces/functions/SSE4_F16C.cpp
new file mode 100644
index 000000000..91a11885f
--- /dev/null
+++ b/src/VecSim/spaces/functions/SSE4_F16C.cpp
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#include "SSE4_F16C.h"
+#include "VecSim/spaces/IP/IP_SSE4_SQ8_FP16.h"
+#include "VecSim/spaces/L2/L2_SSE4_SQ8_FP16.h"
+
+namespace spaces {
+
+#include "implementation_chooser.h"
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_SSE4);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_SSE4);
+    return ret_dist_func;
+}
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim) {
+    dist_func_t<float> ret_dist_func;
+    CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_SSE4);
+    return ret_dist_func;
+}
+
+#include "implementation_chooser_cleanup.h"
+
+} // namespace spaces
diff --git a/src/VecSim/spaces/functions/SSE4_F16C.h b/src/VecSim/spaces/functions/SSE4_F16C.h
new file mode 100644
index 000000000..2459c216c
--- /dev/null
+++ b/src/VecSim/spaces/functions/SSE4_F16C.h
@@ -0,0 +1,23 @@
+/*
+ * Copyright (c) 2006-Present, Redis Ltd.
+ * All rights reserved.
+ *
+ * Licensed under your choice of the Redis Source Available License 2.0
+ * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
+ * GNU Affero General Public License v3 (AGPLv3).
+ */
+#pragma once
+
+#include "VecSim/spaces/spaces.h"
+
+// SQ8↔FP16 kernels for the SSE4 tier. Live in a sibling TU compiled only when the toolchain
+// supports F16C (via `-mf16c -mavx`), so this header has no preprocessor guard. Callers
+// still gate the calls themselves with `#ifdef OPT_F16C`.
+
+namespace spaces {
+
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_SSE4(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SSE4(size_t dim);
+dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SSE4(size_t dim);
+
+} // namespace spaces
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces.h b/tests/benchmark/spaces_benchmarks/bm_spaces.h
index d99bcc4ca..2303eac0a 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces.h
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces.h
@@ -24,9 +24,12 @@
 #include "VecSim/spaces/functions/AVX512BF16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/F16C.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE.h"
 #include "VecSim/spaces/functions/NEON.h"
diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
index 2133a047e..ba3030064 100644
--- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
+++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
@@ -15,8 +15,9 @@ using float16 = vecsim_types::float16;

 /**
  * SQ8-to-FP16 benchmarks: SQ8 quantized storage with FP16 query.
- * Only naive (scalar) benchmarks are registered for now; SIMD chooser symbols are added
- * by P1b (MOD-15152, x86) and P1c (MOD-15153, ARM).
+ * Registers the naive (scalar) baseline plus per-ISA SIMD variants (x86: AVX-512 / AVX2+FMA /
+ * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels
+ * land via MOD-14972.
  */
 class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture {
 protected:
@@ -50,8 +51,41 @@ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture {
     }
 };

-// Naive (scalar) algorithms. SIMD chooser slots will be added by P1b (MOD-15152) and
-// P1c (MOD-15153), following the SQ8_FP32 layout in bm_spaces_sq8_fp32.cpp.
+#ifdef CPU_FEATURES_ARCH_X86_64
+cpu_features::X86Features opt = cpu_features::GetX86Info().features;
+
+// AVX-512F is sufficient — _mm512_cvtph_ps is part of AVX-512F, no F16C/VNNI/BW/VL needed.
+#ifdef OPT_AVX512F
+bool avx512f_supported = opt.avx512f;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16, avx512f_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX512F, 16,
+                                 avx512f_supported);
+#endif
+
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+bool avx2_fma3_f16c_supported = opt.avx2 && opt.fma3 && opt.f16c;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16,
+                                avx2_fma3_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2_FMA, 16,
+                                 avx2_fma3_f16c_supported);
+#endif
+
+#ifdef OPT_AVX2
+bool avx2_f16c_supported = opt.avx2 && opt.f16c;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, AVX2, 16, avx2_f16c_supported);
+#endif
+
+#ifdef OPT_SSE4
+bool sse4_f16c_supported = opt.sse4_1 && opt.f16c && opt.avx;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
+INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, sse4_f16c_supported);
+#endif
+#endif // OPT_F16C
+#endif // x86_64
+
+// Naive (scalar) baseline — always registered as the comparison anchor.
 INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, InnerProduct, 16);
 INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, Cosine, 16);
 
diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp
index a6bb88cef..474ac5c75 100644
--- a/tests/unit/test_spaces.cpp
+++ b/tests/unit/test_spaces.cpp
@@ -32,9 +32,12 @@
 #include "VecSim/spaces/functions/AVX512FP16_VL.h"
 #include "VecSim/spaces/functions/AVX512F_BW_VL_VNNI.h"
 #include "VecSim/spaces/functions/AVX2.h"
+#include "VecSim/spaces/functions/AVX2_F16C.h"
 #include "VecSim/spaces/functions/AVX2_FMA.h"
+#include "VecSim/spaces/functions/AVX2_FMA_F16C.h"
 #include "VecSim/spaces/functions/SSE3.h"
 #include "VecSim/spaces/functions/SSE4.h"
+#include "VecSim/spaces/functions/SSE4_F16C.h"
 #include "VecSim/spaces/functions/F16C.h"
 #include "VecSim/spaces/functions/NEON.h"
 #include "VecSim/spaces/functions/NEON_DOTPROD.h"
@@ -560,9 +563,8 @@ TEST_F(SpacesTest, GetDistFuncSQ8Asymmetric) {
 }
 
 TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
-    // SQ8 storage with FP16 query (asymmetric) - should return scalar SQ8_FP16 functions.
-    // SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); for now the
-    // dispatcher returns the scalar implementations regardless of dim or arch.
+    // SQ8 storage with FP16 query (asymmetric) - should return SQ8_FP16 functions.
+    // Per-ISA dispatcher walk coverage lives in the SQ8_FP16 SpacesOptimizationTest below.
     size_t dim = 128;
     auto l2_func = spaces::GetDistFunc(VecSimMetric_L2, dim, nullptr);
     auto ip_func = spaces::GetDistFunc(VecSimMetric_IP, dim, nullptr);
@@ -570,9 +572,12 @@ TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) {
     ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr));
     ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr));
-    ASSERT_EQ(l2_func, SQ8_FP16_L2Sqr);
-    ASSERT_EQ(ip_func, SQ8_FP16_InnerProduct);
-    ASSERT_EQ(cosine_func, SQ8_FP16_Cosine);
+
+    // dim < 16 takes the scalar early-return in every SQ8_FP16 dispatcher (no SIMD tier).
+    size_t small_dim = 8;
+    ASSERT_EQ(L2_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_L2Sqr);
+    ASSERT_EQ(IP_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_InnerProduct);
+    ASSERT_EQ(Cosine_SQ8_FP16_GetDistFunc(small_dim, nullptr), SQ8_FP16_Cosine);
 }
 
 #ifdef CPU_FEATURES_ARCH_X86_64
@@ -3000,8 +3005,9 @@ TEST(SQ8_FP32_EdgeCases, CosineExtremeValuesTest) {
 
 // Parameterized tests that verify the scalar SQ8_FP16 kernels against the not-optimized
 // baseline across multiple dimensions, including odd dimensions and SIMD-boundary residues.
-// SIMD chooser slots are added by P1b (MOD-15152) / P1c (MOD-15153); the dispatcher always
-// returns the scalar implementation for now.
+// The SIMD-tier dispatcher coverage lives in SQ8_FP16_SpacesOptimizationTest below; this
+// suite intentionally exercises the scalar reference directly to keep it as a fixed baseline
+// the SIMD tiers are compared against.
 class SQ8_FP16_NoOptimizationSpacesTest : public testing::TestWithParam<size_t> {};
 
 TEST_P(SQ8_FP16_NoOptimizationSpacesTest, SQ8_FP16_L2SqrTest) {
@@ -3070,10 +3076,255 @@ INSTANTIATE_TEST_SUITE_P(SQ8_FP16_NoOpt, SQ8_FP16_NoOptimizationSpacesTest,
                          testing::Values(1, 5, 7, 8, 9, 15, 16, 17, 31, 32, 33, 47, 48, 49, 63, 64, 65, 127, 128));
 
+/* ======================== SQ8_FP16 SIMD optimisation tests ========================= */
+
+// Walks down the x86 ISA tiers (AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar) and asserts
+// that {IP,Cosine,L2}_SQ8_FP16_GetDistFunc returns the expected Choose_* symbol and that
+// its output matches the scalar baseline within 0.01.
+class SQ8_FP16_SpacesOptimizationTest : public testing::TestWithParam<size_t> {};
+
+TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) {
+    auto optimization = getCpuOptimizationFeatures();
+    size_t dim = GetParam();
+
+    size_t query_count =
+        dim + sq8::query_metadata_count() * (sizeof(float) / sizeof(float16));
+    std::vector<float16> v1_query(query_count);
+    test_utils::populate_sq8_fp16_query(v1_query.data(), dim, false, 1234);
+
+    size_t quantized_size =
+        dim * sizeof(uint8_t) + sq8::storage_metadata_count() * sizeof(float);
+    std::vector<uint8_t> v2_compressed(quantized_size);
+    test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, false, 5678);
+
+    dist_func_t arch_opt_func;
+    float baseline = SQ8_FP16_L2Sqr(v2_compressed.data(), v1_query.data(), dim);
+
+#ifdef OPT_AVX512F
+    if (optimization.avx512f) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX512F(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX512 with dim " << dim;
+        optimization.avx512f = 0;
+    }
+#endif
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around
all three — matches the dispatcher layout in L2_space.cpp.
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+    if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX2_FMA(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2+FMA with dim " << dim;
+        optimization.fma3 = 0;
+    }
+#endif
+#ifdef OPT_AVX2
+    if (optimization.avx2 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_AVX2(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2 with dim " << dim;
+        optimization.avx2 = 0;
+    }
+#endif
+#ifdef OPT_SSE4
+    if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+        unsigned char alignment = 0;
+        arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SSE4(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "SSE4 with dim " << dim;
+        optimization.sse4_1 = 0;
+    }
+#endif
+#endif // OPT_F16C
+
+    unsigned char alignment = 0;
+    arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+    ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr)
+        << "Unexpected scalar fallback function for dim " << dim;
+    ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+        << "Scalar fallback with dim " << dim;
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
+}
+
+TEST_P(SQ8_FP16_SpacesOptimizationTest,
SQ8_FP16_InnerProductTest) {
+    auto optimization = getCpuOptimizationFeatures();
+    size_t dim = GetParam();
+
+    size_t query_count =
+        dim + sq8::query_metadata_count() * (sizeof(float) / sizeof(float16));
+    std::vector<float16> v1_query(query_count);
+    test_utils::populate_sq8_fp16_query(v1_query.data(), dim, true, 1234);
+
+    size_t quantized_size =
+        dim * sizeof(uint8_t) + sq8::storage_metadata_count() * sizeof(float);
+    std::vector<uint8_t> v2_compressed(quantized_size);
+    test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, true, 5678);
+
+    dist_func_t arch_opt_func;
+    float baseline = SQ8_FP16_InnerProduct(v2_compressed.data(), v1_query.data(), dim);
+
+#ifdef OPT_AVX512F
+    if (optimization.avx512f) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX512F(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX512 with dim " << dim;
+        optimization.avx512f = 0;
+    }
+#endif
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three — matches the dispatcher layout in IP_space.cpp.
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+    if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX2_FMA(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2+FMA with dim " << dim;
+        optimization.fma3 = 0;
+    }
+#endif
+#ifdef OPT_AVX2
+    if (optimization.avx2 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_AVX2(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2 with dim " << dim;
+        optimization.avx2 = 0;
+    }
+#endif
+#ifdef OPT_SSE4
+    if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+        unsigned char alignment = 0;
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SSE4(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "SSE4 with dim " << dim;
+        optimization.sse4_1 = 0;
+    }
+#endif
+#endif // OPT_F16C
+
+    unsigned char alignment = 0;
+    arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+    ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct)
+        << "Unexpected scalar fallback function for dim " << dim;
+    ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+        << "Scalar fallback with dim " << dim;
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
+}
+
+TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) {
+    auto optimization =
getCpuOptimizationFeatures();
+    size_t dim = GetParam();
+
+    size_t query_count =
+        dim + sq8::query_metadata_count() * (sizeof(float) / sizeof(float16));
+    std::vector<float16> v1_query(query_count);
+    test_utils::populate_sq8_fp16_query(v1_query.data(), dim, true, 1234);
+
+    size_t quantized_size =
+        dim * sizeof(uint8_t) + sq8::storage_metadata_count() * sizeof(float);
+    std::vector<uint8_t> v2_compressed(quantized_size);
+    test_utils::populate_float_vec_to_sq8_with_metadata(v2_compressed.data(), dim, true, 5678);
+
+    dist_func_t arch_opt_func;
+    float baseline = SQ8_FP16_Cosine(v2_compressed.data(), v1_query.data(), dim);
+
+#ifdef OPT_AVX512F
+    if (optimization.avx512f) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX512F(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX512 with dim " << dim;
+        optimization.avx512f = 0;
+    }
+#endif
+    // F16C is required by every non-AVX-512 SQ8↔FP16 tier (vcvtph2ps), so the guard is hoisted
+    // around all three — matches the dispatcher layout in IP_space.cpp.
+#ifdef OPT_F16C
+#ifdef OPT_AVX2_FMA
+    if (optimization.avx2 && optimization.fma3 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX2_FMA(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2+FMA with dim " << dim;
+        optimization.fma3 = 0;
+    }
+#endif
+#ifdef OPT_AVX2
+    if (optimization.avx2 && optimization.f16c) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_AVX2(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "AVX2 with dim " << dim;
+        optimization.avx2 = 0;
+    }
+#endif
+#ifdef OPT_SSE4
+    if (optimization.sse4_1 && optimization.f16c && optimization.avx) {
+        unsigned char alignment = 0;
+        arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_SSE4(dim))
+            << "Unexpected distance function chosen for dim " << dim;
+        ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+            << "SSE4 with dim " << dim;
+        optimization.sse4_1 = 0;
+    }
+#endif
+#endif // OPT_F16C
+
+    unsigned char alignment = 0;
+    arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);
+    ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine)
+        << "Unexpected scalar fallback function for dim " << dim;
+    ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01)
+        << "Scalar fallback with dim " << dim;
+    ASSERT_EQ(alignment, 0) << "No optimization with dim " << dim;
+}
+
+// Dim range [16, 32] covers every residual class for the 16-element chunk used by
every tier.
+INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD, SQ8_FP16_SpacesOptimizationTest,
+                         testing::Range(16UL, 16 * 2UL + 1));
+
+// Higher dimensions surface multi-iteration loop bugs (pointer stride, do-while termination
+// off-by-one) that the [16, 32] range does not exercise because the AVX-512 inner loop runs at
+// most twice in that range. 48 and 112 specifically hit the AVX-512 three-chunk tail
+// (remaining == 48, i.e. (dim / 16) % 4 == 3): 48 with zero main-loop iterations, 112 with one.
+INSTANTIATE_TEST_SUITE_P(SQ8_FP16_SIMD_HighDim, SQ8_FP16_SpacesOptimizationTest,
+                         testing::Values(48UL, 64UL, 112UL, 128UL, 256UL, 512UL, 1024UL));
+
 /* ======================== Tests SQ8_FP16 (edge cases) ========================= */
 
 // Zero FP16 query against a non-zero SQ8 storage. IP must be exactly 1.0 (1 - 0),
-// L2² must equal Σ dequantized².
+// L2² must equal Σ dequantized². Math correctness on adversarial inputs is verified
+// against the scalar reference; SIMD tier coverage with branchless kernels is provided
+// separately by SQ8_FP16_SpacesOptimizationTest.
 TEST(SQ8_FP16_EdgeCases, ZeroQueryTest) {
     size_t dim = 64;

From f0f2ec4e04a59933b1cc3c880a944554021b4df4 Mon Sep 17 00:00:00 2001
From: Dor Forer
Date: Thu, 28 May 2026 17:24:23 +0300
Subject: [PATCH 02/24] =?UTF-8?q?Add=20design=20spec=20for=20SQ8=E2=86=94F?=
 =?UTF-8?q?P16=20ARM=20SIMD=20kernels=20[MOD-14972]?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto
NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU
compile flags. Scalar fallback already on main serves as reference.
Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP,
formula anchor, load_unaligned metadata, dispatcher-routed tier-walk
tests).
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../specs/2026-05-28-arm-sq8-fp16-design.md   | 354 ++++++++++++++++++
 1 file changed, 354 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md

diff --git a/docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md b/docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md
new file mode 100644
index 000000000..f4188d38b
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md
@@ -0,0 +1,354 @@
+# SQ8↔FP16 ARM SIMD Distance Kernels — Design Spec
+
+- **Ticket**: [MOD-14972](https://redislabs.atlassian.net/browse/MOD-14972)
+- **Branch**: `dor-forer-sq8-fp16-arm-kernels-mod-14972`
+- **Base**: `dor-forer-sq8-fp16-x86-kernels-mod-14954` (PR #970) — stacked
+- **Sibling**: MOD-14954 / PR #970 delivers x86 SIMD kernels (AVX-512, AVX2, SSE4) for the same operation
+
+## Goal
+
+Add SQ8↔FP16 SIMD distance kernels for IP and L2 on the ARM ISA tiers (NEON_HP, SVE, SVE2). FP16 is the query data type; SQ8 is the stored vector representation. Match the contract and structure of the x86 kernels delivered in PR #970 so dispatch tables, metadata layout, and acceptance criteria stay symmetric across architectures.
+
+The scalar fallback (`SQ8_FP16_InnerProduct`, `SQ8_FP16_L2Sqr`, `SQ8_FP16_Cosine` in `src/VecSim/spaces/IP/IP.cpp` and `src/VecSim/spaces/L2/L2.cpp`) already exists on `main`. This spec does not modify it; it serves as the reference implementation for all platforms.
+
+## Algebraic identity (shared with x86 PR + SQ8_FP32 sister)
+
+```
+IP(x, y) ≈ min · y_sum + delta · Σ(q_i · y_i)
+L2(x, y) = x_sum_sq + y_sum_sq - 2 · IP(x, y)
+```
+
+Hot loop accumulates `Σ(q_i · y_i)` only. No per-element dequantization. FP16 query lanes are widened to FP32 per SIMD chunk; everything in the hot loop is FP32.
+
+## Metadata layout
+
+```
+SQ8 storage (pVect1): [uint8 × dim] [min_val] [delta] [x_sum] [x_sum_squares]
+FP16 query  (pVect2): [float16 × dim] [y_sum] [y_sum_squares]
+```
+
+Both metadata trailers are FP32 scalars. Storage metadata is not 4-byte aligned whenever `dim % 4 != 0`; query metadata is not 4-byte aligned whenever `dim` is odd. The blanket rule: every FP32 metadata read uses the global `load_unaligned` helper, matching scalar `_Impl` in `IP.cpp` / `L2.cpp`. `sq8` namespace constants: `MIN_VAL`, `DELTA`, `SUM_QUERY`, `SUM_SQUARES`, `SUM_SQUARES_QUERY`.
+
+## File layout
+
+```
+src/VecSim/spaces/IP/
+  IP_NEON_SQ8_FP16.h   (new)
+  IP_SVE_SQ8_FP16.h    (new) — also #included from SVE2.cpp
+src/VecSim/spaces/L2/
+  L2_NEON_SQ8_FP16.h   (new)
+  L2_SVE_SQ8_FP16.h    (new) — also #included from SVE2.cpp
+src/VecSim/spaces/functions/
+  NEON_HP.cpp          (+ Choose_SQ8_FP16_{IP,L2,Cosine}_implementation_NEON_HP)
+  NEON_HP.h            (+ 3 declarations)
+  SVE.cpp              (+ Choose_SQ8_FP16_*_implementation_SVE)
+  SVE.h                (+ 3 declarations)
+  SVE2.cpp             (+ Choose_SQ8_FP16_*_implementation_SVE2; owns its own chooser symbols; instantiates SVE kernel templates under SVE2 compile flags)
+  SVE2.h               (+ 3 declarations)
+src/VecSim/spaces/
+  IP_space.cpp         (2 dispatcher block edits: IP, Cosine)
+  L2_space.cpp         (1 dispatcher block edit)
+```
+
+**Zero CMake changes.** Existing TU flags carry exactly what we need:
+
+| TU | Flags |
+|----|-------|
+| `NEON_HP.cpp` | `-march=armv8.2-a+fp16fml` (covers fp16 cvt + fma) |
+| `SVE.cpp` | `-march=armv8-a+sve` (SVE includes f16↔f32 cvt) |
+| `SVE2.cpp` | `-march=armv9-a+sve2` |
+
+## Dispatcher tier order
+
+Same precedence as existing SQ8_FP32 ARM dispatch:
+
+```cpp
+#ifdef OPT_SVE2
+    if (features.sve2 && dim >= 16) {
+        return Choose_SQ8_FP16_IP_implementation_SVE2(dim);
+    }
+#endif
+#ifdef OPT_SVE
+    if (features.sve && dim >= 16) {
+        return Choose_SQ8_FP16_IP_implementation_SVE(dim);
+    }
+#endif
+#ifdef OPT_NEON_HP
+    if (features.asimdhp && dim >= 16) {
+        return
Choose_SQ8_FP16_IP_implementation_NEON_HP(dim);
+    }
+#endif
+// dim < 16 or no ARM SIMD → scalar fallback (existing return at function tail)
+```
+
+The `dim >= 16` guard in the dispatcher is what lets each SIMD kernel hold an internal `assert(dim >= 16)` as a real precondition. Edge cases for `dim < 16` are routed to scalar.
+
+## NEON kernel design
+
+### Header: `IP_NEON_SQ8_FP16.h`
+
+Template signature mirrors SQ8_FP32 NEON sister:
+
+```cpp
+template <unsigned char residual> // 0..15
+float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, size_t dimension);
+```
+
+Hot loop — 16 lanes per iteration, 4 FP32 accumulators:
+
+```cpp
+// SQ8 load: 16 × uint8 → 4 × float32x4_t
+uint8x16_t v1_u8 = vld1q_u8(pVect1);
+uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8));
+uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8));
+float32x4_t v1_0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_lo)));
+float32x4_t v1_1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_lo)));
+float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi)));
+float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi)));
+
+// FP16 query load: 16 × f16 → 4 × float32x4_t via vcvt_f32_f16
+float16x8_t q_lo = vld1q_f16(pVect2);
+float16x8_t q_hi = vld1q_f16(pVect2 + 8);
+float32x4_t v2_0 = vcvt_f32_f16(vget_low_f16(q_lo));
+float32x4_t v2_1 = vcvt_f32_f16(vget_high_f16(q_lo));
+float32x4_t v2_2 = vcvt_f32_f16(vget_low_f16(q_hi));
+float32x4_t v2_3 = vcvt_f32_f16(vget_high_f16(q_hi));
+
+// 4-accumulator FMA
+sum0 = vfmaq_f32(sum0, v1_0, v2_0);
+sum1 = vfmaq_f32(sum1, v1_1, v2_1);
+sum2 = vfmaq_f32(sum2, v1_2, v2_2);
+sum3 = vfmaq_f32(sum3, v1_3, v2_3);
+```
+
+Residual ladder (`dim % 16`, residual 0..15):
+
+- **`residual >= 8`**: one 8-lane safe load each side — `vld1_u8` (8 bytes) for SQ8 and `vld1q_f16` (8 × FP16 = 16 bytes, fits before query metadata) for FP16. Convert + FMA. Remaining `residual - 8` lanes handled scalar.
+- **`residual < 8`**: full scalar residual loop using `vecsim_types::FP16_to_FP32`.
+
+Rationale: a 16-byte SQ8 load (`vld1q_u8`) or a 16-byte FP16 load (`vld1q_f16` past the 8-lane boundary) on a residual < 8 would overread past valid query data into metadata — `y_sum` is only 4 bytes for IP and `y_sum_sq` adds 4 more for L2, not enough headroom for an 8-lane FP16 load.
+
+Final reduction: `vaddvq_f32(sum0 + sum1 + sum2 + sum3)`, then return `min_val * y_sum + delta * quantized_dot`.
+
+`assert(dim >= 16)` at the top.
+
+### Header: `L2_NEON_SQ8_FP16.h`
+
+Calls `SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(...)` to compute raw IP, then returns `x_sum_sq + y_sum_sq - 2.0f * ip`. Mirrors `L2_NEON_SQ8_FP32.h` exactly.
+
+### Wrapper symbols (NEON_HP.cpp)
+
+```cpp
+dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim) {
+    dist_func_t ret;
+    CHOOSE_IMPLEMENTATION(ret, dim, 16, SQ8_FP16_InnerProductSIMD16_NEON_HP);
+    return ret;
+}
+// L2 + Cosine identical shape (Cosine reuses IP wrapper per repo convention)
+```
+
+## SVE kernel design
+
+### Header: `IP_SVE_SQ8_FP16.h`
+
+Template signature mirrors SVE SQ8_FP32 sister:
+
+```cpp
+template <bool partial_chunk, unsigned char additional_steps>
+float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v, size_t dimension);
+```
+
+Inner step (one SVE vector width `svcntw()` lanes of FP32):
+
+```cpp
+svbool_t pg = svptrue_b32();
+// SQ8: zero-extend uint8 → uint32 (predicated b32 load)
+svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset);
+svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32);
+// FP16: load chunk fp16 lanes, widen to fp32
+svbool_t pg16 = svwhilelt_b16(uint32_t(0), uint32_t(chunk));
+svfloat16_t q_h = svld1_f16(pg16, pVect2 + offset);
+svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h); // verify exact ACLE/packing during impl
+sum = svmla_f32_x(pg, sum, v1_f, v2_f);
+offset += chunk;
+```
+
+**ACLE caveat**: exact f16→f32 widening intrinsic and lane packing — confirm `svcvt_f32_f16_x(pg, q_h)` compiles cleanly against the loaded
`svfloat16_t`. If lane packing needs an unpack/interleave step, verify against `IP_SVE_FP16.h`.
+
+4 accumulators `sum0..sum3`; main loop processes 4 chunks via 4 `InnerProductStep` calls. `partial_chunk` template branch handles `dim % chunk` via `svwhilelt_b32`.
+
+Inactive-lane discipline on the partial path: the predicated `svld1_f16` / `svld1ub_u32` cover lane *liveness*, but the final reduction with `svaddv_f32(svptrue_b32(), ...)` walks *all* lanes. To keep inactive lanes from contributing garbage, the partial step uses the zeroing form `svmla_f32_z(pg_partial, sum0, v1_f, v2_f)` (matches `IP_SVE_SQ8_FP32.h` partial-chunk pattern). Alternative: reduce only active lanes via `svaddv_f32(pg_partial, sum0)` for the partial-step accumulator, then sum into the main reduction. The `_z` form is the simpler choice and is what the SQ8_FP32 SVE sister already does.
+
+Predicate widths on the partial path: FP32 math (load/widen/mla) uses a `b32` predicate sized to `remaining` 32-bit lanes (`svwhilelt_b32(0, remaining)`); the FP16 query load needs its own `b16` predicate sized to the same `remaining` half lanes (`svwhilelt_b16(0, remaining)`) since `svld1_f16` is governed by a 16-bit predicate. SQ8 load via `svld1ub_u32` is governed by the `b32` predicate (it widens uint8 → uint32 lanewise).
+
+Final reduction: `svaddv_f32(svptrue_b32(), sum0 + sum1 + sum2 + sum3)`.
+
+### Header: `L2_SVE_SQ8_FP16.h`
+
+Calls `SQ8_FP16_InnerProductSIMD_SVE_IMP(...)` then returns `x_sum_sq + y_sum_sq - 2.0f * ip`. Mirrors `L2_SVE_SQ8_FP32.h`.
+
+### Wrapper symbols
+
+`SVE.cpp`:
+
+```cpp
+dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim) {
+    dist_func_t ret;
+    CHOOSE_SVE_IMPLEMENTATION(ret, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw);
+    return ret;
+}
+// L2 + Cosine identical shape
+```
+
+`SVE2.cpp`:
+
+```cpp
+#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE
+#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h"
+
+dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim) {
+    dist_func_t ret;
+    CHOOSE_SVE_IMPLEMENTATION(ret, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw);
+    return ret;
+}
+// L2 + Cosine identical shape
+```
+
+SVE2 owns its own chooser symbols (does **not** call the SVE chooser); template instantiated under SVE2 compile flags.
+
+## Tests
+
+### Class
+
+Branch base is PR #970. During implementation, verify whether the base branch already exposes `SQ8_FP16_SpacesOptimizationTest` (extend) or only `SQ8_FP16_NoOptimizationSpacesTest` (add the optimization class here mirroring `SQ8_FP32_SpacesOptimizationTest`).
+
+### Tier-walk pattern
+
+Per-tier `if (features.<tier>)` block; **unset higher flag** after each block so the next tier is exercised on hosts that support multiple ISAs. Do not use `GTEST_SKIP()` here — it would abort the entire walk.
+
+```cpp
+auto expected = SQ8_FP16_InnerProduct; // scalar reference
+
+#ifdef OPT_SVE2
+    if (features.sve2) {
+        arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &features);
+        ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SVE2(dim))
+            << "SVE2 dispatch mismatch";
+        ASSERT_NEAR(arch_opt_func(v1, v2, dim), expected(v1, v2, dim), 0.01);
+        features.sve2 = 0; // exercise next tier
+    }
+#endif
+#ifdef OPT_SVE
+    if (features.sve) { /* same shape */ features.sve = 0; }
+#endif
+#ifdef OPT_NEON_HP
+    if (features.asimdhp) { /* same shape */ features.asimdhp = 0; }
+#endif
+// final fallback assertion: IP_SQ8_FP16_GetDistFunc(...)
== SQ8_FP16_InnerProduct (scalar)
+```
+
+Three dispatch entry points exercised per tier: `IP_SQ8_FP16_GetDistFunc`, `L2_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`.
+
+### Scalar-fallback tests
+
+`GetDistFuncSQ8FP16Asymmetric` — currently asserts `dim=128` returns scalar; that assertion breaks once SIMD dispatch lands. Change to `dim=15` (below the `dim >= 16` SIMD threshold). Add a small `dim=0` (empty) scalar-fallback assertion to cover the Jira "empty" edge case.
+
+### Dim parameterization
+
+Base branch already has both parameterized suites against `SQ8_FP16_SpacesOptimizationTest`:
+- `SQ8_FP16_SIMD` — `testing::Range(16UL, 33UL)` (dims 16..32; residual + threshold boundaries)
+- `SQ8_FP16_SIMD_HighDim` — `64, 128, 256, 512, 1024` (multi-iteration main loop)
+
+Both suites pick up the ARM tier-walk additions automatically since the test class body is what's extended. No new instantiation needed.
+
+### Tier coverage report
+
+`SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised` (test_spaces.cpp) currently reports only x86 tiers. Extend it with ARM tier entries (SVE2 / SVE / NEON_HP) so an ARM-only SIMD host reports its exercised tiers instead of going silent.
+
+## Microbench
+
+`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp` already registers x86 ISA benchmarks.
Add ARM registrations under `#ifdef OPT_*` guards using the existing `bm_spaces.h` macros:
+
+```cpp
+#ifdef CPU_FEATURES_ARCH_AARCH64
+    cpu_features::Aarch64Features opt = cpu_features::GetAarch64Info().features;
+    bool sve2_supported = opt.sve2;
+    bool sve_supported = opt.sve;
+    bool neon_hp_supported = opt.asimdhp;
+#ifdef OPT_SVE2
+    INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported);
+    INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported);
+#endif
+#ifdef OPT_SVE
+    INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported);
+    INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported);
+#endif
+#ifdef OPT_NEON_HP
+    INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported);
+    INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported);
+#endif
+#endif // CPU_FEATURES_ARCH_AARCH64
+```
+
+Verify exact `cpu_features` helper names against the x86 sister block already in `bm_spaces_sq8_fp16.cpp` (e.g. `GetX86Info`).
+
+`bm_spaces_sq8_fp16` and `bm_spaces_sq8_fp32` are separate executables; the per-ISA throughput comparison requested by Jira is done by running both benches and comparing matched ISA rows.
+
+## Acceptance criteria (Jira MOD-14972 → spec mapping)
+
+| Jira requirement | Where this spec delivers it |
+|------------------|------------------------------|
+| Kernels: IP + L2 for NEON | NEON_HP TU hosts kernel headers + chooser symbols |
+| Kernels: IP + L2 for SVE | SVE TU hosts kernel headers + chooser symbols |
+| Kernels: IP + L2 for SVE2 | SVE2 TU includes SVE headers, instantiates templates under SVE2 flags |
+| Scalar fallback (reference for all platforms) | Already present in `IP.cpp` / `L2.cpp`; unchanged |
+| FP16 query → FP32 per SIMD chunk | `vcvt_f32_f16` (NEON), `svcvt_f32_f16_x` (SVE) |
+| FP32 metadata + correction terms | `load_unaligned` for all FP32 trailer scalars |
+| Wire into dispatch table per ISA flag | `IP_space.cpp` (2 blocks), `L2_space.cpp` (1 block), `OPT_SVE2/SVE/NEON_HP` |
+| Unit tests vs. scalar reference per ISA | Tier-walk in `SQ8_FP16_SpacesOptimizationTest` |
+| Edge cases (empty, dim-alignment boundaries) | `dim=0` + `dim=15` scalar tests; `dim=16..32` SIMD boundary param suite |
+| Microbench per ISA throughput vs. SQ8↔FP32 | ARM registrations in `bm_spaces_sq8_fp16.cpp`; matched-ISA comparison vs.
`bm_spaces_sq8_fp32` |
+
+## Diff size estimate
+
+| Area | Files | LoC (rough) |
+|------|-------|-------------|
+| Kernel headers | 4 new | ~600 |
+| Dispatcher TU additions | NEON_HP.cpp/h, SVE.cpp/h, SVE2.cpp/h | ~80 |
+| Dispatcher wiring | IP_space.cpp, L2_space.cpp | ~45 |
+| Tests | test_spaces.cpp | ~80 |
+| Bench | bm_spaces_sq8_fp16.cpp | ~25 |
+| CMakeLists.txt | none | 0 |
+| **Total** | **~14 files** | **~830** |
+
+## PR mechanics
+
+- **Branch**: `dor-forer-sq8-fp16-arm-kernels-mod-14972`
+- **Base branch**: `dor-forer-sq8-fp16-x86-kernels-mod-14954` (PR #970)
+- **PR target**: opens against PR #970 head; retarget to `main` once #970 merges
+- **Commit prefix**: `[MOD-14972]` (matches repo convention)
+- **PR title**: `Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]`
+
+## Verification gates before opening PR
+
+1. **x86 host build clean** — verifies generic dispatch and tests remain clean; ARM kernels require ARM build or cross-compile, so the kernels themselves are not exercised here.
+2. **ARM host build + unit tests** — NEON_HP / SVE / SVE2 paths exercised. Requires coordination with the user for ARM hardware or a cross-compile setup.
+3. **ASan clean** on every host that runs unit tests.
+4. **Microbench compiles + runs on ARM host.**
+
+## Out of scope (deferred, separate PRs)
+
+- Dispatcher-routed edge-case tests (`ZeroQueryTest`, `ConstantStorageTest`, `MixedSignQueryTest`) — they currently bypass the dispatcher and call scalar directly; cross-arch debt, also PR #970 H1.
+- Multi-accumulator ILP tuning beyond the 4-accumulator baseline established here.
+- Unrelated x86 review-feedback fixes (M1–M4, H1–H2 on x86 files from PR #970 review). This ARM PR will modify some files that PR #970 also touches (dispatchers, test class, bench), but only with ARM-relevant additions — x86 review fixes land in #970.
+ +## Inheritance from PR #970 review findings + +The following lessons from the PR #970 review are baked into this design so they do not need to be re-flagged on ARM kernels: + +- `assert(dim >= 16)` at the top of every kernel template (paired with dispatcher `dim >= 16` guard). +- 4-accumulator ILP in both NEON and SVE hot loops. +- Algebraic-identity formula anchor comment at the top of each kernel header. +- `load_unaligned` for all FP32 metadata reads (matches scalar). +- Dispatcher-routed tier-walk test pattern (no scalar-bypass). +- Per-ISA microbench registration alongside SQ8↔FP32 sister for direct comparison. From e5e74750b93d1fe6876a30a8d1ef4c2a6146fcb3 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 17:37:01 +0300 Subject: [PATCH 03/24] =?UTF-8?q?Add=20implementation=20plan=20for=20SQ8?= =?UTF-8?q?=E2=86=94FP16=20ARM=20SIMD=20kernels=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../plans/2026-05-28-arm-sq8-fp16-kernels.md | 1195 +++++++++++++++++ 1 file changed, 1195 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md diff --git a/docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md b/docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md new file mode 100644 index 000000000..2759ba046 --- /dev/null +++ b/docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md @@ -0,0 +1,1195 @@ +# SQ8↔FP16 ARM SIMD Distance Kernels — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. 
+ +**Goal:** Add SQ8↔FP16 asymmetric distance kernels (IP, L2, Cosine) for ARM ISA tiers — NEON_HP, SVE, SVE2 — plugged into the existing dispatcher. Mirrors the x86 work delivered in PR #970. + +**Architecture:** Header-only SIMD kernel templates (one per metric × ISA), instantiated via the existing `CHOOSE_IMPLEMENTATION` / `CHOOSE_SVE_IMPLEMENTATION` macros inside ISA-specific TUs (`NEON_HP.cpp`, `SVE.cpp`, `SVE2.cpp`). Wiring lives in `IP_space.cpp` and `L2_space.cpp` under a `#ifdef CPU_FEATURES_ARCH_AARCH64` block that parallels the existing x86 block. L2 reuses the IP `_IMP` template via the algebraic identity `L2² = x_sum_sq + y_sum_sq − 2·IP`. Scalar fallback already on `main` is unchanged and stays as the reference for every tier. + +**Tech Stack:** C++20, ARM NEON intrinsics (`arm_neon.h`), ARM SVE/SVE2 intrinsics (`arm_sve.h`), GoogleTest, Google Benchmark, cpu_features. + +**Branch:** `dor-forer-sq8-fp16-arm-kernels-mod-14972` (stacked on PR #970 / `dor-forer-sq8-fp16-x86-kernels-mod-14954`). + +**Build / test loop:** The user runs `make build` (per project memory). After each build cycle confirmed, the assistant runs `make unit_test` / ASan / benchmarks on the appropriate host (ARM hardware or cross-compile/qemu — coordinate with user). Each task ends in a commit; commits are pushed only when explicitly requested. 
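The algebraic identities named in the Architecture paragraph can be checked on any host with a portable scalar sketch. Everything below is illustrative rather than the repository's code: `QuantizedVec`, `ip_dequantised`, `ip_factored`, and `l2sqr_from_ip` are hypothetical names, the struct stands in for the real layout (which appends FP32 metadata after the element bytes), and the query is plain `float` instead of FP16.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container, for illustration only.
struct QuantizedVec {
    std::vector<uint8_t> q; // quantised codes q_i
    float min_val;          // dequantisation offset
    float delta;            // dequantisation step
};

// Direct form: dequantise every element, then multiply.
inline float ip_dequantised(const QuantizedVec &x, const std::vector<float> &y) {
    float acc = 0.0f;
    for (size_t i = 0; i < y.size(); ++i)
        acc += (x.min_val + x.delta * static_cast<float>(x.q[i])) * y[i];
    return acc;
}

// Factored form: the loop accumulates only sum(q_i * y_i); min and delta are
// applied once at the end, as in IP ~= min * y_sum + delta * sum(q_i * y_i).
inline float ip_factored(const QuantizedVec &x, const std::vector<float> &y) {
    float y_sum = 0.0f;
    float dot = 0.0f;
    for (size_t i = 0; i < y.size(); ++i) {
        y_sum += y[i];
        dot += static_cast<float>(x.q[i]) * y[i];
    }
    return x.min_val * y_sum + x.delta * dot;
}

// L2^2 = x_sum_sq + y_sum_sq - 2 * IP, reusing the inner product above.
inline float l2sqr_from_ip(float x_sum_sq, float y_sum_sq, float ip) {
    return x_sum_sq + y_sum_sq - 2.0f * ip;
}
```

In the kernels, `y_sum` and the two sums of squares are precomputed metadata, so only the `dot` accumulation remains in the hot loop.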
+ +**Spec:** [`docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md`](../specs/2026-05-28-arm-sq8-fp16-design.md) + +--- + +## File Structure + +### Files created + +| Path | Responsibility | +|------|----------------| +| `src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h` | NEON IP kernel template (`SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP` + thin wrappers) | +| `src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h` | NEON L2 kernel template (calls NEON IP impl, applies L2 identity) | +| `src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h` | SVE IP kernel template (`SQ8_FP16_InnerProductSIMD_SVE_IMP` + wrappers); also `#include`d from SVE2.cpp | +| `src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h` | SVE L2 kernel template; also `#include`d from SVE2.cpp | + +### Files modified + +| Path | Change | +|------|--------| +| `src/VecSim/spaces/functions/NEON_HP.h` | +3 chooser declarations (IP, L2, Cosine) | +| `src/VecSim/spaces/functions/NEON_HP.cpp` | +#include kernel headers; +3 chooser definitions | +| `src/VecSim/spaces/functions/SVE.h` | +3 chooser declarations | +| `src/VecSim/spaces/functions/SVE.cpp` | +#include kernel headers; +3 chooser definitions | +| `src/VecSim/spaces/functions/SVE2.h` | +3 chooser declarations | +| `src/VecSim/spaces/functions/SVE2.cpp` | +#include SVE kernel headers; +3 chooser definitions (own symbols, templates instantiated under SVE2 compile flags) | +| `src/VecSim/spaces/IP_space.cpp` | +#ifdef AArch64 block in `IP_SQ8_FP16_GetDistFunc` and `Cosine_SQ8_FP16_GetDistFunc` (2 dispatcher blocks) | +| `src/VecSim/spaces/L2_space.cpp` | +#ifdef AArch64 block in `L2_SQ8_FP16_GetDistFunc` (1 dispatcher block) | +| `tests/unit/test_spaces.cpp` | retarget `GetDistFuncSQ8FP16Asymmetric` to dim=15; add dim=0 test; extend the three `SQ8_FP16_SpacesOptimizationTest` test bodies with ARM tier walks; extend `SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised` with AArch64 tier reporting | +| `tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp` | +AArch64 `cpu_features` block; +ARM ISA 
benchmark registrations | + +### Files NOT modified + +`src/VecSim/spaces/CMakeLists.txt` — zero CMake changes. Existing TU flags (`-march=armv8.2-a+fp16fml` for NEON_HP, `-march=armv8-a+sve` for SVE, `-march=armv9-a+sve2` for SVE2) already carry everything the new kernels need. + +--- + +## Task 1: Retarget the scalar-fallback dispatcher test + +**Why first:** Builds and runs on x86 today, has nothing to do with the ARM kernels, and tightens the contract the rest of the plan relies on (the dispatcher returns scalar for `dim < 16`). + +**Files:** +- Modify: `tests/unit/test_spaces.cpp` — locate test named `GetDistFuncSQ8FP16Asymmetric` (added by PR #970; currently asserts `dim=128` returns the scalar fallback) + +- [ ] **Step 1: Locate the existing test** + +Run: +```bash +grep -n 'GetDistFuncSQ8FP16Asymmetric' tests/unit/test_spaces.cpp +``` +Expected: one or more line hits pointing at the `TEST(...)` block. + +- [ ] **Step 2: Modify the test to cover dim=0 and dim=15 instead of dim=128** + +Replace the body of the existing `TEST(..., GetDistFuncSQ8FP16Asymmetric)` so it walks two below-threshold dims and asserts the scalar fallback for each of L2 / IP / Cosine. Drop in this exact body (rename the test fixture symbol to match what is already there if it differs): + +```cpp +TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) { + // SQ8 storage with FP16 query (asymmetric) - should return SQ8_FP16 functions. + // Per-ISA dispatcher walk coverage lives in the SQ8_FP16 SpacesOptimizationTest below. + // + // Walk two below-threshold dims (0 and 15) so the assertions hold regardless of which + // SIMD tiers the host advertises: dim < 16 must always short-circuit to scalar fallback. + // The template-mapping form (spaces::GetDistFunc) and the direct + // *_SQ8_FP16_GetDistFunc form must agree for every dim, and both must match the scalar + // reference at sub-threshold dims. 
+ for (size_t dim : {static_cast<size_t>(0), static_cast<size_t>(15)}) { + auto l2_func = spaces::GetDistFunc(VecSimMetric_L2, dim, nullptr); + auto ip_func = spaces::GetDistFunc(VecSimMetric_IP, dim, nullptr); + auto cosine_func = + spaces::GetDistFunc(VecSimMetric_Cosine, dim, nullptr); + + ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr)) + << "Template mapping disagrees with direct dispatcher for L2 at dim=" << dim; + ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr)) + << "Template mapping disagrees with direct dispatcher for IP at dim=" << dim; + ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr)) + << "Template mapping disagrees with direct dispatcher for Cosine at dim=" << dim; + + ASSERT_EQ(l2_func, SQ8_FP16_L2Sqr) + << "dim=" << dim << " must short-circuit to scalar L2 fallback"; + ASSERT_EQ(ip_func, SQ8_FP16_InnerProduct) + << "dim=" << dim << " must short-circuit to scalar IP fallback"; + ASSERT_EQ(cosine_func, SQ8_FP16_Cosine) + << "dim=" << dim << " must short-circuit to scalar Cosine fallback"; + } +} +``` + +- [ ] **Step 3: User builds** + +Ask the user to run `make build` (their normal x86 build is sufficient — this test is host-agnostic). + +- [ ] **Step 4: Run the test** + +Run: +```bash +./bin/<host-triple>/unit_tests --gtest_filter='SpacesTest.GetDistFuncSQ8FP16Asymmetric' +``` +(Use `find bin -name unit_tests -type f` if the host-triple subdir is unknown.) +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add tests/unit/test_spaces.cpp +git commit -m "Retarget SQ8↔FP16 scalar-fallback dispatcher test to dim=0/15 [MOD-14972]" +``` + +--- + +## Task 2: NEON IP kernel header + +**Files:** +- Create: `src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h` + +- [ ] **Step 1: Author the kernel file** + +Create exactly this file (modeled on `IP_NEON_SQ8_FP32.h` + the NEON FP16 widening pattern from `IP_NEON_FP16.h`): + +```cpp +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. 
+ * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include <arm_neon.h> +#include <cassert> + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: + * + * IP(x, y) = sum(x_i * y_i) + * ~= sum((min + delta * q_i) * y_i) + * = min * y_sum + delta * sum(q_i * y_i) + * + * The hot loop only accumulates sum(q_i * y_i) - no per-element dequantisation. + * FP16 query lanes are widened to FP32 via vcvt_f32_f16 per 16-lane chunk. + */ + +// Helper: 16 lanes per call, four FP32 accumulators (one per quarter). +static inline void +SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2, + float32x4_t &sum0, float32x4_t &sum1, + float32x4_t &sum2, float32x4_t &sum3) { + // SQ8 storage: 16 * uint8 -> 4 * float32x4_t + uint8x16_t v1_u8 = vld1q_u8(pVect1); + uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); + uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); + float32x4_t v1_0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_lo))); + float32x4_t v1_1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_lo))); + float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi))); + float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi))); + + // FP16 query: 16 * f16 -> 4 * float32x4_t via vcvt_f32_f16 + const float16_t *q = reinterpret_cast<const float16_t *>(pVect2); + float16x8_t q_lo = vld1q_f16(q); + float16x8_t q_hi = vld1q_f16(q + 8); + float32x4_t v2_0 = vcvt_f32_f16(vget_low_f16(q_lo)); + float32x4_t v2_1 = vcvt_f32_f16(vget_high_f16(q_lo)); + float32x4_t v2_2 = vcvt_f32_f16(vget_low_f16(q_hi)); + float32x4_t v2_3 = vcvt_f32_f16(vget_high_f16(q_hi)); + + sum0 = vfmaq_f32(sum0, v1_0, v2_0); + sum1 = vfmaq_f32(sum1, v1_1, 
v2_1); + sum2 = vfmaq_f32(sum2, v1_2, v2_2); + sum3 = vfmaq_f32(sum3, v1_3, v2_3); + + pVect1 += 16; + pVect2 += 16; +} + +// pVect1v = SQ8 storage, pVect2v = FP16 query +template <unsigned char residual> // 0..15 +float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); + + const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v); // SQ8 storage + const float16 *pVect2 = static_cast<const float16 *>(pVect2v); // FP16 query + + float32x4_t sum0 = vdupq_n_f32(0.0f); + float32x4_t sum1 = vdupq_n_f32(0.0f); + float32x4_t sum2 = vdupq_n_f32(0.0f); + float32x4_t sum3 = vdupq_n_f32(0.0f); + + const size_t num_of_chunks = dimension / 16; + for (size_t i = 0; i < num_of_chunks; i++) { + SQ8_FP16_InnerProductStep_NEON_HP(pVect1, pVect2, sum0, sum1, sum2, sum3); + } + + // Residual handling: dim % 16 lanes. + // residual >= 8: one safe 8-lane SQ8 + 8-lane FP16 load (FP16 trailer is wide enough). + // residual < 8: scalar-only - a 4-lane FP16 load would overread y_sum metadata. + constexpr unsigned char r = residual; + if constexpr (r >= 8) { + uint8x8_t v1_u8 = vld1_u8(pVect1); + uint16x8_t v1_u16 = vmovl_u8(v1_u8); + float32x4_t v1_a = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_u16))); + float32x4_t v1_b = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_u16))); + float16x8_t q_h = vld1q_f16(reinterpret_cast<const float16_t *>(pVect2)); + float32x4_t v2_a = vcvt_f32_f16(vget_low_f16(q_h)); + float32x4_t v2_b = vcvt_f32_f16(vget_high_f16(q_h)); + sum0 = vfmaq_f32(sum0, v1_a, v2_a); + sum1 = vfmaq_f32(sum1, v1_b, v2_b); + pVect1 += 8; + pVect2 += 8; + } + // Lane-by-lane scalar for the final 0..7 (residual % 8) elements. + constexpr unsigned char tail = r & 0x7; + float scalar_dot = 0.0f; + for (unsigned char k = 0; k < tail; ++k) { + scalar_dot += static_cast<float>(pVect1[k]) * vecsim_types::FP16_to_FP32(pVect2[k]); + } + + // Reduce the four NEON accumulators. 
+ float32x4_t sum_lo = vaddq_f32(sum0, sum1); + float32x4_t sum_hi = vaddq_f32(sum2, sum3); + float quantized_dot = vaddvq_f32(vaddq_f32(sum_lo, sum_hi)) + scalar_dot; + + // Metadata loads - use load_unaligned because odd dim leaves trailers unaligned. + const uint8_t *params_bytes = static_cast<const uint8_t *>(pVect1v) + dimension; + const float min_val = + load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = + load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float)); + const uint8_t *query_meta_bytes = + reinterpret_cast<const uint8_t *>(static_cast<const float16 *>(pVect2v) + dimension); + const float y_sum = + load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template <unsigned char residual> +float SQ8_FP16_InnerProductSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + return 1.0f - + SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP<residual>(pVect1v, pVect2v, dimension); +} + +template <unsigned char residual> +float SQ8_FP16_CosineSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { + // Cosine = 1 - IP (vectors are pre-normalised); reuses the IP wrapper. + return SQ8_FP16_InnerProductSIMD16_NEON_HP<residual>(pVect1v, pVect2v, dimension); +} +``` + +- [ ] **Step 2: Header-only smoke (no build yet)** + +Run: +```bash +grep -n 'load_unaligned\|FP16_to_FP32' src/VecSim/spaces/space_includes.h \ + src/VecSim/spaces/IP/IP.cpp src/VecSim/types/float16.h 2>/dev/null +``` +Expected: confirm the global `load_unaligned` is reachable through `space_includes.h` (matches the include path used by `IP_NEON_SQ8_FP32.h`) and `FP16_to_FP32` is reachable through `VecSim/types/float16.h`. If either include is missing, add it. 
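The control flow of the NEON kernel above (16-element chunks feeding four accumulators, one optional 8-element block, then a scalar tail of `residual % 8`) can be exercised without ARM hardware. The following is a portable sketch under stated assumptions: `dot_chunked` and `Residual` are hypothetical names, each `acc[i]` is a single float standing in for one `float32x4_t`, and the query is `float` rather than FP16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for the kernel's chunk/residual split. `Residual` plays
// the role of the compile-time dim % 16 selected by the chooser macro.
template <unsigned char Residual>
float dot_chunked(const uint8_t *q, const float *y, size_t dim) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    // Main loop: 16 elements per iteration, one accumulator per quarter.
    for (size_t chunk = 0; chunk < dim / 16; ++chunk) {
        for (int lane = 0; lane < 16; ++lane)
            acc[lane / 4] += static_cast<float>(q[lane]) * y[lane];
        q += 16;
        y += 16;
    }

    // Residual >= 8: one 8-element block into the first two accumulators.
    if constexpr (Residual >= 8) {
        for (int lane = 0; lane < 8; ++lane)
            acc[lane / 4] += static_cast<float>(q[lane]) * y[lane];
        q += 8;
        y += 8;
    }

    // Remaining 0..7 elements are handled one at a time.
    constexpr unsigned char tail = Residual & 0x7;
    float scalar_dot = 0.0f;
    for (unsigned char k = 0; k < tail; ++k)
        scalar_dot += static_cast<float>(q[k]) * y[k];

    // Pairwise reduction, mirroring the sum_lo / sum_hi step.
    return (acc[0] + acc[1]) + (acc[2] + acc[3]) + scalar_dot;
}
```

For example, `dot_chunked<13>(q, y, 29)` runs one 16-element chunk, the 8-element block, and a 5-element tail; the caller is responsible for passing `Residual == dim % 16`, which the chooser macro guarantees in the real kernels.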
+ +- [ ] **Step 3: Commit** + +```bash +git add src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h +git commit -m "Add NEON_HP SQ8↔FP16 IP kernel header [MOD-14972]" +``` + +--- + +## Task 3: NEON L2 kernel header + +**Files:** +- Create: `src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h` + +- [ ] **Step 1: Author the kernel file** + +```cpp +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Optimised asymmetric SQ8<->FP16 L2 squared distance using the algebraic identity: + * + * ||x - y||^2 = sum(x_i^2) - 2 * IP(x, y) + sum(y_i^2) + * = x_sum_squares - 2 * IP(x, y) + y_sum_squares + * + * IP is computed by SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP; metadata is FP32. 
+ */ + +template <unsigned char residual> // 0..15 +float SQ8_FP16_L2SqrSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = + SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP<residual>(pVect1v, pVect2v, dimension); + + const uint8_t *params_bytes = static_cast<const uint8_t *>(pVect1v) + dimension; + const float x_sum_sq = + load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const uint8_t *query_meta_bytes = reinterpret_cast<const uint8_t *>( + static_cast<const float16 *>(pVect2v) + dimension); + const float y_sum_sq = + load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} +``` + +- [ ] **Step 2: Commit** + +```bash +git add src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h +git commit -m "Add NEON_HP SQ8↔FP16 L2 kernel header [MOD-14972]" +``` + +--- + +## Task 4: NEON_HP dispatcher TU additions + +**Files:** +- Modify: `src/VecSim/spaces/functions/NEON_HP.h` — add 3 declarations +- Modify: `src/VecSim/spaces/functions/NEON_HP.cpp` — add 3 chooser definitions + +- [ ] **Step 1: Add chooser declarations to NEON_HP.h** + +In `src/VecSim/spaces/functions/NEON_HP.h`, inside `namespace spaces { ... }`, append these three declarations alongside the existing `Choose_FP16_*_implementation_NEON_HP`: + +```cpp +dist_func_t<float> Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim); +dist_func_t<float> Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim); +dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim); +``` + +- [ ] **Step 2: Add chooser definitions to NEON_HP.cpp** + +In `src/VecSim/spaces/functions/NEON_HP.cpp`, add the kernel `#include`s alongside the existing FP16 includes: + +```cpp +#include "VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_NEON_SQ8_FP16.h" +``` + +Then inside `namespace spaces { ... 
}` (between `#include "implementation_chooser.h"` and `#include "implementation_chooser_cleanup.h"`), append: + +```cpp +dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_NEON_HP); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_NEON_HP); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_NEON_HP); + return ret_dist_func; +} +``` + +- [ ] **Step 3: Commit** + +```bash +git add src/VecSim/spaces/functions/NEON_HP.h src/VecSim/spaces/functions/NEON_HP.cpp +git commit -m "Wire NEON_HP SQ8↔FP16 choosers [MOD-14972]" +``` + +--- + +## Task 5: NEON_HP dispatcher wiring in IP_space.cpp + L2_space.cpp + +**Files:** +- Modify: `src/VecSim/spaces/IP_space.cpp` — `IP_SQ8_FP16_GetDistFunc` + `Cosine_SQ8_FP16_GetDistFunc` +- Modify: `src/VecSim/spaces/L2_space.cpp` — `L2_SQ8_FP16_GetDistFunc` + +Each of those three `_GetDistFunc` functions currently has an `#ifdef CPU_FEATURES_ARCH_X86_64` block with an early `if (dim < 16) return ret_dist_func;` guard followed by per-tier dispatch. We append an `#ifdef CPU_FEATURES_ARCH_AARCH64` block with the matching shape. Only NEON_HP is wired in this task; SVE/SVE2 land in a later task. + +- [ ] **Step 1: Confirm the #include for NEON_HP.h is present** + +Run: +```bash +grep -n 'functions/NEON_HP.h' src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp +``` +Expected: both files already `#include "VecSim/spaces/functions/NEON_HP.h"`. If a file is missing it, add the include. + +- [ ] **Step 2: Wire IP_SQ8_FP16_GetDistFunc** + +In `src/VecSim/spaces/IP_space.cpp`, locate `IP_SQ8_FP16_GetDistFunc`. 
After the closing `#endif // x86_64`, insert a parallel AArch64 block immediately before the trailing `return ret_dist_func;`: + +```cpp +#ifdef CPU_FEATURES_ARCH_AARCH64 + if (dim < 16) { + return ret_dist_func; + } +#ifdef OPT_NEON_HP + if (features.asimdhp) { + // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers + // leave *alignment untouched on ARM tiers. The corresponding tests assert + // 0xFF passthrough on the scalar path and do not assert any non-zero value here. + return Choose_SQ8_FP16_IP_implementation_NEON_HP(dim); + } +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 +``` + +- [ ] **Step 3: Wire Cosine_SQ8_FP16_GetDistFunc** + +In the same file, locate `Cosine_SQ8_FP16_GetDistFunc`. Insert the same block, swapping `Choose_SQ8_FP16_IP_implementation_NEON_HP` for `Choose_SQ8_FP16_Cosine_implementation_NEON_HP`. + +- [ ] **Step 4: Wire L2_SQ8_FP16_GetDistFunc** + +In `src/VecSim/spaces/L2_space.cpp`, locate `L2_SQ8_FP16_GetDistFunc`. Insert the same block, swapping the call for `Choose_SQ8_FP16_L2_implementation_NEON_HP`. + +- [ ] **Step 5: User builds** + +Ask the user to run `make build` — first time the new NEON_HP TU additions compile. If they have ARM hardware or a cross-compile target, that build path; otherwise the x86 build must at least confirm the new headers don't accidentally break non-ARM compilation (the new headers are only `#include`d from `NEON_HP.cpp`, which is excluded on non-ARM hosts, so x86 builds should be clean). 
+ +- [ ] **Step 6: Commit** + +```bash +git add src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp +git commit -m "Dispatch SQ8↔FP16 to NEON_HP tier on AArch64 [MOD-14972]" +``` + +--- + +## Task 6: Extend `SQ8_FP16_SpacesOptimizationTest` with NEON_HP tier-walk + +**Files:** +- Modify: `tests/unit/test_spaces.cpp` — three test bodies (`SQ8_FP16_L2SqrTest`, `SQ8_FP16_InnerProductTest`, `SQ8_FP16_CosineTest`) + +After the existing `#ifdef OPT_SSE4` block in each test, append: + +- [ ] **Step 1: Add NEON_HP tier to L2 test** + +In `SQ8_FP16_L2SqrTest`, immediately after the closing `#endif` that follows the SSE4 block and before `// Scalar fallback`: + +```cpp +#ifdef OPT_NEON_HP + if (optimization.asimdhp) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_NEON_HP(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_HP with dim " << dim; + optimization.asimdhp = 0; + } +#endif +``` + +- [ ] **Step 2: Add NEON_HP tier to IP test** + +In `SQ8_FP16_InnerProductTest`, append the same block but swap `L2_SQ8_FP16_GetDistFunc` → `IP_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_L2_implementation_NEON_HP` → `Choose_SQ8_FP16_IP_implementation_NEON_HP`. + +- [ ] **Step 3: Add NEON_HP tier to Cosine test** + +In `SQ8_FP16_CosineTest`, append the same block with `Cosine_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_Cosine_implementation_NEON_HP`. + +- [ ] **Step 4: Confirm the include path for the NEON_HP chooser declarations** + +Run: +```bash +grep -n 'functions/NEON_HP.h' tests/unit/test_spaces.cpp +``` +Expected: include present. If not, add `#include "VecSim/spaces/functions/NEON_HP.h"` near the other space-function includes at the top of the file. 
+ +- [ ] **Step 5: User builds (ARM target)** + +Ask the user to run `make build` for an ARM target (hardware or cross-compile). On x86 the new test code is gated by `#ifdef OPT_NEON_HP` and stays inert. + +- [ ] **Step 6: Run NEON_HP tests** + +Once the ARM build is reported clean, run: +```bash +./bin/<host-triple>/unit_tests --gtest_filter='SQ8_FP16_*Test*' +``` +Expected: all parametrized cases PASS, including the dims-16..32 and high-dim suites. + +- [ ] **Step 7: Commit** + +```bash +git add tests/unit/test_spaces.cpp +git commit -m "Extend SQ8↔FP16 tier-walk tests with NEON_HP [MOD-14972]" +``` + +--- + +## Task 7: SVE IP kernel header + +**Files:** +- Create: `src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h` + +- [ ] **Step 1: Author the kernel file** + +Modeled on `IP_SVE_SQ8_FP32.h`. The shape: an `InnerProductStep` helper that consumes `chunk = svcntw()` FP32 lanes per call (FP16 query loaded under a `b16` predicate, SQ8 storage under a `b32` predicate that drives uint8→uint32 widening), then a templated `_IMP` over `<partial_chunk, additional_steps>`. + +```cpp +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include <arm_sve.h> +#include <cassert> + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: + * + * IP(x, y) ~= min * y_sum + delta * sum(q_i * y_i) + * + * Hot loop accumulates sum(q_i * y_i) only; FP16 query lanes are widened to FP32 + * inside each step via svcvt_f32_f16_x. Metadata loads use load_unaligned. + */ + +// Helper: one SVE-vector-width-of-FP32 step. +// chunk = svcntw() - number of FP32 lanes per step. +// pg = svptrue_b32() - predicate for FP32 lanes. 
+static inline void +SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, size_t &offset, + svfloat32_t &sum, svbool_t pg, size_t chunk) { + // SQ8 -> uint32 (widen on load), then to FP32. + svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); + svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); + + // FP16 query -> FP32. svld1_f16 uses a b16 predicate sized to `chunk` half lanes. + svbool_t pg16 = svwhilelt_b16(uint32_t(0), uint32_t(chunk)); + svfloat16_t q_h = + svld1_f16(pg16, reinterpret_cast<const float16_t *>(pVect2) + offset); + svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h); + + sum = svmla_f32_x(pg, sum, v1_f, v2_f); + offset += chunk; +} + +// pVect1v = SQ8 storage, pVect2v = FP16 query +template <bool partial_chunk, unsigned char additional_steps> +float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); + + const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v); + const float16 *pVect2 = static_cast<const float16 *>(pVect2v); + size_t offset = 0; + svbool_t pg = svptrue_b32(); + const size_t chunk = svcntw(); + + svfloat32_t sum0 = svdup_f32(0.0f); + svfloat32_t sum1 = svdup_f32(0.0f); + svfloat32_t sum2 = svdup_f32(0.0f); + svfloat32_t sum3 = svdup_f32(0.0f); + + // Partial chunk for dim % chunk lanes. Use _z form so inactive lanes are zero - + // the final reduction below walks all lanes via svptrue_b32(). 
+ if constexpr (partial_chunk) { + size_t remaining = dimension % chunk; + if (remaining > 0) { + svbool_t pg_partial = + svwhilelt_b32(uint32_t(0), uint32_t(remaining)); + svbool_t pg16_partial = + svwhilelt_b16(uint32_t(0), uint32_t(remaining)); + svuint32_t v1_u32 = svld1ub_u32(pg_partial, pVect1 + offset); + svfloat32_t v1_f = svcvt_f32_u32_z(pg_partial, v1_u32); + svfloat16_t q_h = svld1_f16( + pg16_partial, reinterpret_cast<const float16_t *>(pVect2) + offset); + svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, q_h); + sum0 = svmla_f32_z(pg_partial, sum0, v1_f, v2_f); + offset += remaining; + } + } + + // Main loop: 4 chunks per iteration via 4 accumulators. + const size_t chunk_size = 4 * chunk; + const size_t number_of_chunks = + (dimension - (partial_chunk ? dimension % chunk : 0)) / chunk_size; + for (size_t i = 0; i < number_of_chunks; i++) { + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum1, pg, chunk); + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum3, pg, chunk); + } + + // Additional steps 0..3. + if constexpr (additional_steps > 0) + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); + if constexpr (additional_steps > 1) + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum1, pg, chunk); + if constexpr (additional_steps > 2) + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); + + svfloat32_t sum = svadd_f32_x(pg, sum0, sum1); + sum = svadd_f32_x(pg, sum, sum2); + sum = svadd_f32_x(pg, sum, sum3); + float quantized_dot = svaddv_f32(pg, sum); + + // Metadata loads - unaligned because odd dim leaves trailers unaligned. 
+ const uint8_t *params_bytes = static_cast<const uint8_t *>(pVect1v) + dimension; + const float min_val = + load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = + load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float)); + const uint8_t *query_meta_bytes = reinterpret_cast<const uint8_t *>( + static_cast<const float16 *>(pVect2v) + dimension); + const float y_sum = + load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template <bool partial_chunk, unsigned char additional_steps> +float SQ8_FP16_InnerProductSIMD_SVE(const void *pVect1v, const void *pVect2v, + size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductSIMD_SVE_IMP<partial_chunk, additional_steps>( + pVect1v, pVect2v, dimension); +} + +template <bool partial_chunk, unsigned char additional_steps> +float SQ8_FP16_CosineSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD_SVE<partial_chunk, additional_steps>( + pVect1v, pVect2v, dimension); +} +``` + +**Note for the implementer:** `svcvt_f32_f16_x(pg, q_h)` converts the FP16 value held in the low 16 bits of each 32-bit container, i.e. the even-indexed half lanes of `q_h`, not the lower half of the vector. A contiguous `svld1_f16` load therefore does not line up with the widened SQ8 lanes as written, and the code still compiles, so a clean build alone does not prove the widening is correct. Verify the FP16->FP32 widening sequence against the scalar reference on the actual ARM build and adjust as needed (potential alternatives: an extending load such as `svld1uh_u32` followed by reinterpretation to `svfloat16_t`, or an explicit `svunpklo_*` unpack before the convert). Commit only after the build is clean and the tier-walk tests pass. Do not blindly copy `IP_SVE_FP16.h`'s pattern - that file accumulates in FP16 and is not a direct widening reference. + +- [ ] **Step 2: Commit** + +```bash +git add src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h +git commit -m "Add SVE SQ8↔FP16 IP kernel header [MOD-14972]" +``` + +--- + +## Task 8: SVE L2 kernel header + +**Files:** +- Create: `src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h` + +- [ ] **Step 1: Author the kernel file** + +```cpp +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. 
+ * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * SVE SQ8<->FP16 L2 squared distance: + * ||x - y||^2 = x_sum_squares - 2 * IP(x, y) + y_sum_squares + * IP is computed by SQ8_FP16_InnerProductSIMD_SVE_IMP; metadata is FP32. + */ + +template <bool partial_chunk, unsigned char additional_steps> +float SQ8_FP16_L2SqrSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = SQ8_FP16_InnerProductSIMD_SVE_IMP<partial_chunk, additional_steps>( + pVect1v, pVect2v, dimension); + + const uint8_t *params_bytes = static_cast<const uint8_t *>(pVect1v) + dimension; + const float x_sum_sq = + load_unaligned<float>(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const uint8_t *query_meta_bytes = reinterpret_cast<const uint8_t *>( + static_cast<const float16 *>(pVect2v) + dimension); + const float y_sum_sq = + load_unaligned<float>(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} +``` + +- [ ] **Step 2: Commit** + +```bash +git add src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h +git commit -m "Add SVE SQ8↔FP16 L2 kernel header [MOD-14972]" +``` + +--- + +## Task 9: SVE + SVE2 dispatcher TU additions + +**Files:** +- Modify: `src/VecSim/spaces/functions/SVE.h` — +3 declarations +- Modify: `src/VecSim/spaces/functions/SVE.cpp` — +#includes; +3 chooser definitions +- Modify: `src/VecSim/spaces/functions/SVE2.h` — +3 declarations +- Modify: `src/VecSim/spaces/functions/SVE2.cpp` — +#includes; +3 chooser definitions (own symbols, template instantiated under SVE2 flags) + +- [ ] **Step 1: Declarations in SVE.h** + +Inside `namespace spaces { ... 
}`, alongside the existing `Choose_SQ8_FP32_*_SVE` declarations: + +```cpp +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE(size_t dim); +``` + +- [ ] **Step 2: Definitions in SVE.cpp** + +Add includes alongside the existing SQ8_FP32 includes: + +```cpp +#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" +``` + +Inside `namespace spaces { ... }` (between `implementation_chooser.h` and `implementation_chooser_cleanup.h`), append: + +```cpp +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); + return ret_dist_func; +} +``` + +- [ ] **Step 3: Declarations in SVE2.h** + +```cpp +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim); +``` + +- [ ] **Step 4: Definitions in SVE2.cpp** + +Add includes alongside the existing SQ8_FP32 includes — note the SVE header is included from SVE2 (SVE2 instantiates the template under SVE2 compile flags): + +```cpp +#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE +#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE +``` + +Inside `namespace spaces { ... 
}`, append: + +```cpp +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); + return ret_dist_func; +} +``` + +- [ ] **Step 5: Commit** + +```bash +git add src/VecSim/spaces/functions/SVE.h src/VecSim/spaces/functions/SVE.cpp \ + src/VecSim/spaces/functions/SVE2.h src/VecSim/spaces/functions/SVE2.cpp +git commit -m "Wire SVE/SVE2 SQ8↔FP16 choosers [MOD-14972]" +``` + +--- + +## Task 10: SVE + SVE2 dispatcher wiring in IP_space.cpp + L2_space.cpp + +The NEON_HP block added in Task 5 lives inside `#ifdef CPU_FEATURES_ARCH_AARCH64`. Extend the same block in all three `_GetDistFunc` functions with SVE2 and SVE tiers — ordered SVE2 → SVE → NEON_HP, matching every other SQ8/FP32 dispatcher in the file. + +**Files:** +- Modify: `src/VecSim/spaces/IP_space.cpp` (two functions) +- Modify: `src/VecSim/spaces/L2_space.cpp` (one function) + +- [ ] **Step 1: Confirm the SVE/SVE2 dispatcher includes are present** + +Run: +```bash +grep -n 'functions/SVE\.h\|functions/SVE2\.h' src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp +``` +Expected: both files already include both headers. If not, add them. 
+ +- [ ] **Step 2: Extend IP_SQ8_FP16_GetDistFunc** + +Inside the AArch64 block of `IP_SQ8_FP16_GetDistFunc`, after the `if (dim < 16) return ret_dist_func;` guard and **before** the existing `#ifdef OPT_NEON_HP`, prepend: + +```cpp +#ifdef OPT_SVE2 + if (features.sve2) { + return Choose_SQ8_FP16_IP_implementation_SVE2(dim); + } +#endif +#ifdef OPT_SVE + if (features.sve) { + return Choose_SQ8_FP16_IP_implementation_SVE(dim); + } +#endif +``` + +(SVE/SVE2 paths don't compute alignment hints — the SVE vector width is runtime-variable, so the SQ8_FP32 sister doesn't set `*alignment` here either. Mirror that.) + +- [ ] **Step 3: Extend Cosine_SQ8_FP16_GetDistFunc** + +Same as Step 2, with `Cosine` in the chooser names. + +- [ ] **Step 4: Extend L2_SQ8_FP16_GetDistFunc** + +Same as Step 2, with `L2` in the chooser names. + +- [ ] **Step 5: User builds (ARM target)** + +Ask user to run `make build` for an ARM target. + +- [ ] **Step 6: Commit** + +```bash +git add src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp +git commit -m "Dispatch SQ8↔FP16 to SVE/SVE2 tiers on AArch64 [MOD-14972]" +``` + +--- + +## Task 11: Extend `SQ8_FP16_SpacesOptimizationTest` with SVE2 + SVE tier-walks + +**Files:** +- Modify: `tests/unit/test_spaces.cpp` — the same three test bodies extended in Task 6 + +For each test (L2, IP, Cosine), inside the existing `#ifdef CPU_FEATURES_ARCH_AARCH64` region (which currently holds only NEON_HP from Task 6), **prepend** SVE2 and SVE blocks so the dispatch-precedence order is SVE2 → SVE → NEON_HP. If the existing NEON_HP block is not yet inside an AArch64 outer ifdef, wrap all three together. 
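The dispatch precedence (SVE2, then SVE, then NEON_HP, then scalar) and the way the tests walk it by clearing one feature flag at a time can be pictured with a small standalone sketch. It is illustrative only: `HostFeatures`, `choose_tier`, and `walk_tiers` are hypothetical names that stand in for the `cpu_features` flags, the `*_GetDistFunc` dispatchers, and the test bodies, and it returns tier names rather than function pointers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the three AArch64 feature bits the dispatcher reads.
struct HostFeatures {
    bool sve2;
    bool sve;
    bool asimdhp;
};

// Mirrors the precedence described above: widest tier first, scalar last.
std::string choose_tier(std::size_t dim, const HostFeatures &f) {
    if (dim < 16) return "scalar"; // small dims never reach a SIMD tier
    if (f.sve2) return "SVE2";
    if (f.sve) return "SVE";
    if (f.asimdhp) return "NEON_HP";
    return "scalar";
}

// Walks the tiers the way the tests do: record the choice, clear its flag, retry.
std::vector<std::string> walk_tiers(std::size_t dim, HostFeatures f) {
    std::vector<std::string> visited;
    while (true) {
        const std::string tier = choose_tier(dim, f);
        visited.push_back(tier);
        if (tier == "SVE2") f.sve2 = false;
        else if (tier == "SVE") f.sve = false;
        else if (tier == "NEON_HP") f.asimdhp = false;
        else break; // scalar fallback reached
    }
    return visited;
}
```

Because each tier clears only its own flag, a host that advertises all three features visits every tier exactly once before reaching the scalar function, which is the property the ordered `#ifdef` blocks in the tests depend on.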
+ +- [ ] **Step 1: Wrap and extend the L2 test** + +Replace the NEON_HP-only AArch64 block in `SQ8_FP16_L2SqrTest` with: + +```cpp +#ifdef CPU_FEATURES_ARCH_AARCH64 +#ifdef OPT_SVE2 + if (optimization.sve2) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SVE2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE2 with dim " << dim; + optimization.sve2 = 0; + } +#endif +#ifdef OPT_SVE + if (optimization.sve) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SVE(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE with dim " << dim; + optimization.sve = 0; + } +#endif +#ifdef OPT_NEON_HP + if (optimization.asimdhp) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_NEON_HP(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_HP with dim " << dim; + optimization.asimdhp = 0; + } +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 +``` + +- [ ] **Step 2: Same for IP test** + +Replicate the block in `SQ8_FP16_InnerProductTest` with `IP_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_IP_implementation_`. + +- [ ] **Step 3: Same for Cosine test** + +Replicate with `Cosine_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_Cosine_implementation_`. + +- [ ] **Step 4: User builds** + +ARM target build. 
+ +- [ ] **Step 5: Run the optimization tests** + +```bash +./bin//unit_tests --gtest_filter='SQ8_FP16_SpacesOptimizationTest.*' +``` +Expected: all parametrized cases PASS — dims 16..32 + high-dim suite (64..1024) — exercising whichever ARM tiers the host advertises. + +- [ ] **Step 6: Commit** + +```bash +git add tests/unit/test_spaces.cpp +git commit -m "Extend SQ8↔FP16 tier-walk tests with SVE/SVE2 [MOD-14972]" +``` + +--- + +## Task 12: Extend `SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised` with ARM rows + +**Files:** +- Modify: `tests/unit/test_spaces.cpp` — `TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised)` + +The existing test body has an outer `#ifdef CPU_FEATURES_ARCH_X86_64` block that loops over each x86 tier and logs presence to stderr. Add a sibling `#ifdef CPU_FEATURES_ARCH_AARCH64` block with the same shape. + +- [ ] **Step 1: Append the AArch64 reporting block** + +Locate the trailing `#endif // CPU_FEATURES_ARCH_X86_64` and immediately after, insert: + +```cpp +#ifdef CPU_FEATURES_ARCH_AARCH64 +#ifdef OPT_SVE2 + if (opt.sve2) { + std::cerr << "[SQ8_FP16] SVE2 tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] SVE2 tier NOT exercised on this host\n"; + } +#endif +#ifdef OPT_SVE + if (opt.sve) { + std::cerr << "[SQ8_FP16] SVE tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] SVE tier NOT exercised on this host\n"; + } +#endif +#ifdef OPT_NEON_HP + if (opt.asimdhp) { + std::cerr << "[SQ8_FP16] NEON_HP tier exercised\n"; + any_simd = true; + } else { + std::cerr << "[SQ8_FP16] NEON_HP tier NOT exercised on this host\n"; + } +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 +``` + +(The trailing `if (!any_simd) { GTEST_SKIP() << ...; }` already at the bottom of the existing test handles the all-quiet case across both archs.) 
+ +- [ ] **Step 2: Build + run on an ARM host** + +Ask the user to build for ARM, then run: +```bash +./bin//unit_tests --gtest_filter='SQ8_FP16_SIMD_TierCoverage.*' +``` +Expected: stderr shows at least one ARM tier marked "exercised", test PASS. + +- [ ] **Step 3: Commit** + +```bash +git add tests/unit/test_spaces.cpp +git commit -m "Report ARM tiers in SQ8↔FP16 tier-coverage test [MOD-14972]" +``` + +--- + +## Task 13: Microbench AArch64 block + +**Files:** +- Modify: `tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp` + +The existing file already opens `#ifdef CPU_FEATURES_ARCH_X86_64` and pulls `cpu_features::X86Features opt = cpu_features::GetX86Info().features;`. Add the parallel AArch64 block at the end of that `#endif // CPU_FEATURES_ARCH_X86_64`. + +- [ ] **Step 1: Append the AArch64 bench block** + +After the closing `#endif // CPU_FEATURES_ARCH_X86_64` (or after the last x86 `INITIALIZE_BENCHMARKS_SET_*` macro if no such comment exists), insert: + +```cpp +#ifdef CPU_FEATURES_ARCH_AARCH64 +cpu_features::Aarch64Features arm_opt = cpu_features::GetAarch64Info().features; + +#ifdef OPT_SVE2 +bool sve2_supported = arm_opt.sve2; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); +#endif + +#ifdef OPT_SVE +bool sve_supported = arm_opt.sve; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); +#endif + +#ifdef OPT_NEON_HP +bool neon_hp_supported = arm_opt.asimdhp; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, + neon_hp_supported); +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 +``` + +Verify the exact `cpu_features` helper name during build. 
If the toolchain uses `Aarch64Info` vs `Aarch64Features` vs `ArmFeatures`, adjust to match the sister x86 block. + +- [ ] **Step 2: Update the file-header comment** + +The current file-header comment (around the top) ends with `ARM kernels land via MOD-14972.` — change that line to `ARM kernels (NEON_HP / SVE / SVE2) are registered below.` so the doc stays accurate. + +- [ ] **Step 3: User builds (ARM target)** + +- [ ] **Step 4: Run the bench on ARM** + +```bash +./bin//bm_spaces_sq8_fp16 --benchmark_filter='SQ8_FP16_.*(SVE2|SVE|NEON_HP)' +``` +Expected: per-ISA throughput rows for L2, IP, Cosine. If no rows match, list all benchmarks first with `--benchmark_list_tests` to see the exact generated names, then adjust the regex. + +- [ ] **Step 5: Side-by-side compare against SQ8_FP32** + +```bash +./bin//bm_spaces_sq8_fp32 --benchmark_filter='SQ8_FP32_.*(SVE2|SVE|NEON)' +``` +Compare matched-ISA rows manually. Acceptance per Jira: per-ISA throughput data captured. + +- [ ] **Step 6: Commit** + +```bash +git add tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +git commit -m "Register ARM SQ8↔FP16 microbenchmarks [MOD-14972]" +``` + +--- + +## Task 14: ASan + final pre-PR verification + +- [ ] **Step 1: Full unit-test pass on ARM host (no filter)** + +```bash +./bin//unit_tests +``` +Expected: all tests PASS. + +- [ ] **Step 2: ASan build + run** + +Ask user to run `make build SAN=address` (or the repo's equivalent — verify against `Makefile`). After confirmed: + +```bash +./bin/-asan/unit_tests --gtest_filter='SQ8_FP16_*' +``` +Expected: zero ASan reports; all SQ8_FP16 tests PASS. + +- [ ] **Step 3: x86 sanity build** + +User runs `make build` on x86 (no ARM target). Confirms the new test extensions and dispatcher AArch64 ifdefs stay inert on x86 and the build is clean. + +- [ ] **Step 4: Push branch (ASK USER FIRST)** + +Pushes are user-gated. 
Confirm with the user before running: + +```bash +git push -u origin dor-forer-sq8-fp16-arm-kernels-mod-14972 +``` + +- [ ] **Step 5: Open PR against PR #970 (ASK USER FIRST)** + +PR creation is user-gated. Confirm with the user before running: + +```bash +gh pr create \ + --base dor-forer-sq8-fp16-x86-kernels-mod-14954 \ + --title 'Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]' \ + --body "$(cat <<'EOF' +## Summary + +- Add asymmetric SQ8↔FP16 distance kernels (IP, L2, Cosine) for ARM NEON_HP, SVE, SVE2 tiers +- Wire kernels into the existing dispatcher (`IP_space.cpp`, `L2_space.cpp`) +- Extend `SQ8_FP16_SpacesOptimizationTest` and `SQ8_FP16_SIMD_TierCoverage` with ARM tiers +- Register per-ISA microbenchmarks for cross-arch throughput comparison + +Stacked on PR #970 (MOD-14954 x86 kernels); retarget to `main` once #970 merges. + +Spec: `docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md` + +## Test plan + +- [ ] Unit tests on ARM host pass — `SQ8_FP16_SpacesOptimizationTest` (dims 16..32 + 64..1024), `SQ8_FP16_SIMD_TierCoverage`, `GetDistFuncSQ8FP16Asymmetric` +- [ ] ASan build on ARM host clean across SQ8_FP16 tests +- [ ] x86 build remains clean (new AArch64 dispatcher block + tests stay inert) +- [ ] Microbench output captured for SVE2 / SVE / NEON_HP, compared against matched SQ8_FP32 ARM rows +EOF +)" +``` + +- [ ] **Step 6: Retarget once #970 merges (ASK USER FIRST)** + +When PR #970 lands on `main`, change this PR's base to `main`: + +```bash +gh pr edit --base main +``` + +--- + +## Self-review checklist + +- [x] **Spec coverage:** every requirement in `2026-05-28-arm-sq8-fp16-design.md` is covered: + - Kernel headers (4 new): Tasks 2, 3, 7, 8 + - Wrapper symbols: Tasks 4 (NEON_HP), 9 (SVE/SVE2) + - Dispatcher wiring: Tasks 5 (NEON_HP), 10 (SVE/SVE2) + - Tier-walk tests: Tasks 6 (NEON_HP), 11 (SVE/SVE2) + - TierCoverage report: Task 12 + - Scalar-fallback edge tests (dim=0, dim=15): Task 1 + - Microbench: Task 13 + - ASan + verification: Task 
14 +- [x] **No CMake changes** — confirmed in file structure table. +- [x] **Zero placeholders** — every code block is concrete; ambiguous spots (SVE FP16 widening ACLE) are called out with the fallback strategy spelled in-task. +- [x] **Type/symbol consistency:** + - NEON kernel template names: `SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP` / `…NEON_HP` / `SQ8_FP16_L2SqrSIMD16_NEON_HP` / `SQ8_FP16_CosineSIMD16_NEON_HP` — match across kernel header, NEON_HP chooser, dispatcher call, and test. + - SVE kernel template names: `SQ8_FP16_InnerProductSIMD_SVE_IMP` / `…SVE` / `SQ8_FP16_L2SqrSIMD_SVE` / `SQ8_FP16_CosineSIMD_SVE` — match across kernel header, SVE chooser, SVE2 chooser, dispatcher call, and test. + - Chooser symbol names: `Choose_SQ8_FP16_{IP,L2,Cosine}_implementation_{NEON_HP,SVE,SVE2}` — match across `.h` declarations, `.cpp` definitions, dispatcher calls, tests, and bench. + - Test fixture: `SQ8_FP16_SpacesOptimizationTest` already exists on base (PR #970); we extend the three test methods inside it, no rename. + +--- + +## Execution Handoff + +Plan complete and saved to `docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md`. Two execution options: + +**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration. + +**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints. + +Which approach? 
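As a cross-check on the kernels planned above, the two algebraic identities they rely on can be exercised with a scalar sketch. This is a minimal illustration under stated assumptions, not the library's code: `Sq8Vector` and the four functions are hypothetical names, the query is held as FP32 rather than FP16, and the sums that the real layout stores as trailing metadata (`y_sum`, the sums of squares) are recomputed inline.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one SQ8-quantised vector and its FP32 parameters.
struct Sq8Vector {
    std::vector<std::uint8_t> q; // quantised codes q_i
    float min_val;               // x_i ~= min_val + delta * q_i
    float delta;
};

// Reference path: dequantise every element, then take the dot product.
float ip_dequantised(const Sq8Vector &x, const std::vector<float> &y) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i)
        acc += (x.min_val + x.delta * static_cast<float>(x.q[i])) * y[i];
    return acc;
}

// Identity path: IP = min * y_sum + delta * sum(q_i * y_i).
// The hot loop touches only the raw codes; no per-element dequantisation.
float ip_identity(const Sq8Vector &x, const std::vector<float> &y) {
    float y_sum = 0.0f;         // stored as query metadata in the real layout
    float quantized_dot = 0.0f; // the only term a SIMD loop has to accumulate
    for (std::size_t i = 0; i < y.size(); ++i) {
        y_sum += y[i];
        quantized_dot += static_cast<float>(x.q[i]) * y[i];
    }
    return x.min_val * y_sum + x.delta * quantized_dot;
}

// Reference path for squared L2: sum of squared element-wise differences.
float l2sqr_direct(const Sq8Vector &x, const std::vector<float> &y) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const float d = (x.min_val + x.delta * static_cast<float>(x.q[i])) - y[i];
        acc += d * d;
    }
    return acc;
}

// Identity path: ||x - y||^2 = x_sum_squares + y_sum_squares - 2 * IP(x, y).
float l2sqr_identity(const Sq8Vector &x, const std::vector<float> &y) {
    float x_sum_sq = 0.0f; // both sums are precomputed metadata in the real layout
    float y_sum_sq = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const float xi = x.min_val + x.delta * static_cast<float>(x.q[i]);
        x_sum_sq += xi * xi;
        y_sum_sq += y[i] * y[i];
    }
    return x_sum_sq + y_sum_sq - 2.0f * ip_identity(x, y);
}
```

The two paths differ only in floating-point rounding order, which is why the tier-walk tests compare each SIMD kernel against the scalar baseline with a tolerance rather than for exact equality.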
From d076b67b8629ed846fe2a418fba92ae8a68c5bb9 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 17:54:33 +0300 Subject: [PATCH 04/24] =?UTF-8?q?Add=20NEON=5FHP=20SQ8=E2=86=94FP16=20IP?= =?UTF-8?q?=20kernel=20header=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h | 135 ++++++++++++++++++++++++ 1 file changed, 135 insertions(+) create mode 100644 src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h new file mode 100644 index 000000000..b1d26fec5 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h @@ -0,0 +1,135 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include +#include + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: + * + * IP(x, y) = sum(x_i * y_i) + * ~= sum((min + delta * q_i) * y_i) + * = min * y_sum + delta * sum(q_i * y_i) + * + * The hot loop only accumulates sum(q_i * y_i) - no per-element dequantisation. + * FP16 query lanes are widened to FP32 via vcvt_f32_f16 per 16-lane chunk. + */ + +// Helper: 16 lanes per call, four FP32 accumulators (one per quarter). 
+static inline void +SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2, + float32x4_t &sum0, float32x4_t &sum1, + float32x4_t &sum2, float32x4_t &sum3) { + // SQ8 storage: 16 * uint8 -> 4 * float32x4_t + uint8x16_t v1_u8 = vld1q_u8(pVect1); + uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); + uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); + float32x4_t v1_0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_lo))); + float32x4_t v1_1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_lo))); + float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi))); + float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi))); + + // FP16 query: 16 * f16 -> 4 * float32x4_t via vcvt_f32_f16 + const float16_t *q = reinterpret_cast(pVect2); + float16x8_t q_lo = vld1q_f16(q); + float16x8_t q_hi = vld1q_f16(q + 8); + float32x4_t v2_0 = vcvt_f32_f16(vget_low_f16(q_lo)); + float32x4_t v2_1 = vcvt_f32_f16(vget_high_f16(q_lo)); + float32x4_t v2_2 = vcvt_f32_f16(vget_low_f16(q_hi)); + float32x4_t v2_3 = vcvt_f32_f16(vget_high_f16(q_hi)); + + sum0 = vfmaq_f32(sum0, v1_0, v2_0); + sum1 = vfmaq_f32(sum1, v1_1, v2_1); + sum2 = vfmaq_f32(sum2, v1_2, v2_2); + sum3 = vfmaq_f32(sum3, v1_3, v2_3); + + pVect1 += 16; + pVect2 += 16; +} + +// pVect1v = SQ8 storage, pVect2v = FP16 query +template // 0..15 +float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); + + const uint8_t *pVect1 = static_cast(pVect1v); // SQ8 storage + const float16 *pVect2 = static_cast(pVect2v); // FP16 query + + float32x4_t sum0 = vdupq_n_f32(0.0f); + float32x4_t sum1 = vdupq_n_f32(0.0f); + float32x4_t sum2 = vdupq_n_f32(0.0f); + float32x4_t sum3 = vdupq_n_f32(0.0f); + + const size_t num_of_chunks = dimension / 16; + for (size_t i = 0; i < num_of_chunks; i++) { + SQ8_FP16_InnerProductStep_NEON_HP(pVect1, pVect2, sum0, sum1, sum2, sum3); + } + + // Residual 
handling: dim % 16 lanes. + // residual >= 8: one safe 8-lane SQ8 + 8-lane FP16 load (FP16 trailer is wide enough). + // residual < 8: scalar-only - a 4-lane FP16 load would overread y_sum metadata. + constexpr unsigned char r = residual; + if constexpr (r >= 8) { + uint8x8_t v1_u8 = vld1_u8(pVect1); + uint16x8_t v1_u16 = vmovl_u8(v1_u8); + float32x4_t v1_a = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_u16))); + float32x4_t v1_b = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_u16))); + float16x8_t q_h = vld1q_f16(reinterpret_cast(pVect2)); + float32x4_t v2_a = vcvt_f32_f16(vget_low_f16(q_h)); + float32x4_t v2_b = vcvt_f32_f16(vget_high_f16(q_h)); + sum0 = vfmaq_f32(sum0, v1_a, v2_a); + sum1 = vfmaq_f32(sum1, v1_b, v2_b); + pVect1 += 8; + pVect2 += 8; + } + // Lane-by-lane scalar for the final 0..7 (residual % 8) elements. + constexpr unsigned char tail = r & 0x7; + float scalar_dot = 0.0f; + for (unsigned char k = 0; k < tail; ++k) { + scalar_dot += static_cast(pVect1[k]) * vecsim_types::FP16_to_FP32(pVect2[k]); + } + + // Reduce the four NEON accumulators. + float32x4_t sum_lo = vaddq_f32(sum0, sum1); + float32x4_t sum_hi = vaddq_f32(sum2, sum3); + float quantized_dot = vaddvq_f32(vaddq_f32(sum_lo, sum_hi)) + scalar_dot; + + // Metadata loads - use load_unaligned because odd dim leaves trailers unaligned. 
+ const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float min_val = + load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = + load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); + const float y_sum = + load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template +float SQ8_FP16_InnerProductSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + return 1.0f - + SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); +} + +template +float SQ8_FP16_CosineSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { + // Cosine = 1 - IP (vectors are pre-normalised); reuses the IP wrapper. + return SQ8_FP16_InnerProductSIMD16_NEON_HP(pVect1v, pVect2v, dimension); +} From eedde9d90c40044045ffd6ceece2ff72247ebc20 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:01:04 +0300 Subject: [PATCH 05/24] =?UTF-8?q?Add=20NEON=5FHP=20SQ8=E2=86=94FP16=20L2?= =?UTF-8?q?=20kernel=20header=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h | 42 +++++++++++++++++++++++++ 1 file changed, 42 insertions(+) create mode 100644 src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h diff --git a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h new file mode 100644 index 000000000..7bf5db986 --- /dev/null +++ b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h @@ -0,0 +1,42 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). 
+ */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Optimised asymmetric SQ8<->FP16 L2 squared distance using the algebraic identity: + * + * ||x - y||^2 = sum(x_i^2) - 2 * IP(x, y) + sum(y_i^2) + * = x_sum_squares - 2 * IP(x, y) + y_sum_squares + * + * IP is computed by SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP; metadata is FP32. + */ + +template // 0..15 +float SQ8_FP16_L2SqrSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = + SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); + + const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float x_sum_sq = + load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const uint8_t *query_meta_bytes = reinterpret_cast( + static_cast(pVect2v) + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} From 1b435d81f8f7319c415acda2a9b2773b63cbc0c7 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:05:16 +0300 Subject: [PATCH 06/24] =?UTF-8?q?Wire=20NEON=5FHP=20SQ8=E2=86=94FP16=20cho?= =?UTF-8?q?osers=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- src/VecSim/spaces/functions/NEON_HP.cpp | 20 ++++++++++++++++++++ src/VecSim/spaces/functions/NEON_HP.h | 4 ++++ 2 files changed, 24 insertions(+) diff --git a/src/VecSim/spaces/functions/NEON_HP.cpp b/src/VecSim/spaces/functions/NEON_HP.cpp index 2dea94934..20d93a517 100644 --- a/src/VecSim/spaces/functions/NEON_HP.cpp +++ b/src/VecSim/spaces/functions/NEON_HP.cpp @@ -10,6 +10,8 @@ #include "VecSim/spaces/L2/L2_NEON_FP16.h" #include "VecSim/spaces/IP/IP_NEON_FP16.h" +#include 
"VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_NEON_SQ8_FP16.h" namespace spaces { @@ -27,6 +29,24 @@ dist_func_t Choose_FP16_IP_implementation_NEON_HP(size_t dim) { return ret_dist_func; } +dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_NEON_HP); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_NEON_HP); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_NEON_HP); + return ret_dist_func; +} + #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/NEON_HP.h b/src/VecSim/spaces/functions/NEON_HP.h index c65bd6948..889eb0919 100644 --- a/src/VecSim/spaces/functions/NEON_HP.h +++ b/src/VecSim/spaces/functions/NEON_HP.h @@ -16,4 +16,8 @@ dist_func_t Choose_FP16_IP_implementation_NEON_HP(size_t dim); dist_func_t Choose_FP16_L2_implementation_NEON_HP(size_t dim); +dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim); + } // namespace spaces From 33e751d4f41d1be44eb8299921451c45dce312cd Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:08:16 +0300 Subject: [PATCH 07/24] =?UTF-8?q?Dispatch=20SQ8=E2=86=94FP16=20to=20NEON?= =?UTF-8?q?=5FHP=20tier=20on=20AArch64=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- src/VecSim/spaces/IP_space.cpp | 26 ++++++++++++++++++++++++++ src/VecSim/spaces/L2_space.cpp | 13 +++++++++++++ 2 files 
changed, 39 insertions(+) diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index b57971b60..92616f394 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -225,6 +225,19 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, #endif #endif // OPT_F16C #endif // x86_64 +#ifdef CPU_FEATURES_ARCH_AARCH64 + if (dim < 16) { + return ret_dist_func; + } +#ifdef OPT_NEON_HP + if (features.asimdhp) { + // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers + // leave *alignment untouched on ARM tiers. The corresponding tests assert + // 0xFF passthrough on the scalar path and do not assert any non-zero value here. + return Choose_SQ8_FP16_IP_implementation_NEON_HP(dim); + } +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 return ret_dist_func; } @@ -274,6 +287,19 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm #endif #endif // OPT_F16C #endif // x86_64 +#ifdef CPU_FEATURES_ARCH_AARCH64 + if (dim < 16) { + return ret_dist_func; + } +#ifdef OPT_NEON_HP + if (features.asimdhp) { + // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers + // leave *alignment untouched on ARM tiers. The corresponding tests assert + // 0xFF passthrough on the scalar path and do not assert any non-zero value here. 
+ return Choose_SQ8_FP16_Cosine_implementation_NEON_HP(dim); + } +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 return ret_dist_func; } diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index 43020399f..995b4c4d6 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -156,6 +156,19 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, #endif #endif // OPT_F16C #endif // x86_64 +#ifdef CPU_FEATURES_ARCH_AARCH64 + if (dim < 16) { + return ret_dist_func; + } +#ifdef OPT_NEON_HP + if (features.asimdhp) { + // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers + // leave *alignment untouched on ARM tiers. The corresponding tests assert + // 0xFF passthrough on the scalar path and do not assert any non-zero value here. + return Choose_SQ8_FP16_L2_implementation_NEON_HP(dim); + } +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 return ret_dist_func; } From 0d53e1f115863d38184d9ced815e786734e9278b Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:11:42 +0300 Subject: [PATCH 08/24] =?UTF-8?q?Extend=20SQ8=E2=86=94FP16=20tier-walk=20t?= =?UTF-8?q?ests=20with=20NEON=5FHP=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- tests/unit/test_spaces.cpp | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index 474ac5c75..d2c9386ac 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3149,6 +3149,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { #endif #endif // OPT_F16C +#ifdef OPT_NEON_HP + if (optimization.asimdhp) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_NEON_HP(dim)) + << "Unexpected distance function chosen for dim " 
<< dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_HP with dim " << dim; + optimization.asimdhp = 0; + } +#endif + unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, SQ8_FP16_L2Sqr) @@ -3224,6 +3236,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { #endif #endif // OPT_F16C +#ifdef OPT_NEON_HP + if (optimization.asimdhp) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_NEON_HP(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_HP with dim " << dim; + optimization.asimdhp = 0; + } +#endif + unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, SQ8_FP16_InnerProduct) @@ -3299,6 +3323,18 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { #endif #endif // OPT_F16C +#ifdef OPT_NEON_HP + if (optimization.asimdhp) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_NEON_HP(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_HP with dim " << dim; + optimization.asimdhp = 0; + } +#endif + unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); ASSERT_EQ(arch_opt_func, SQ8_FP16_Cosine) From 0089295c571ec1c4a44ec995edff42ca19f068e5 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:16:51 +0300 Subject: [PATCH 09/24] =?UTF-8?q?Add=20SVE=20SQ8=E2=86=94FP16=20IP=20kerne?= =?UTF-8?q?l=20header=20[MOD-14972]?= 
MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h | 133 +++++++++++++++++++++++++ 1 file changed, 133 insertions(+) create mode 100644 src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h new file mode 100644 index 000000000..36a7d18e6 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h @@ -0,0 +1,133 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include +#include + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: + * + * IP(x, y) ~= min * y_sum + delta * sum(q_i * y_i) + * + * Hot loop accumulates sum(q_i * y_i) only; FP16 query lanes are widened to FP32 + * inside each step via svcvt_f32_f16_x. Metadata loads use load_unaligned. + */ + +// Helper: one SVE-vector-width-of-FP32 step. +// chunk = svcntw() - number of FP32 lanes per step. +// pg = svptrue_b32() - predicate for FP32 lanes. +static inline void +SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, size_t &offset, + svfloat32_t &sum, svbool_t pg, size_t chunk) { + // SQ8 -> uint32 (widen on load), then to FP32. + svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); + svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); + + // FP16 query -> FP32. svld1_f16 uses a b16 predicate sized to `chunk` half lanes. 
+ svbool_t pg16 = svwhilelt_b16(uint32_t(0), uint32_t(chunk)); + svfloat16_t q_h = + svld1_f16(pg16, reinterpret_cast(pVect2) + offset); + svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h); + + sum = svmla_f32_x(pg, sum, v1_f, v2_f); + offset += chunk; +} + +// pVect1v = SQ8 storage, pVect2v = FP16 query +template +float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); + + const uint8_t *pVect1 = static_cast(pVect1v); + const float16 *pVect2 = static_cast(pVect2v); + size_t offset = 0; + svbool_t pg = svptrue_b32(); + const size_t chunk = svcntw(); + + svfloat32_t sum0 = svdup_f32(0.0f); + svfloat32_t sum1 = svdup_f32(0.0f); + svfloat32_t sum2 = svdup_f32(0.0f); + svfloat32_t sum3 = svdup_f32(0.0f); + + // Partial chunk for dim % chunk lanes. Use _z form so inactive lanes are zero - + // the final reduction below walks all lanes via svptrue_b32(). + if constexpr (partial_chunk) { + size_t remaining = dimension % chunk; + if (remaining > 0) { + svbool_t pg_partial = + svwhilelt_b32(uint32_t(0), uint32_t(remaining)); + svbool_t pg16_partial = + svwhilelt_b16(uint32_t(0), uint32_t(remaining)); + svuint32_t v1_u32 = svld1ub_u32(pg_partial, pVect1 + offset); + svfloat32_t v1_f = svcvt_f32_u32_z(pg_partial, v1_u32); + svfloat16_t q_h = svld1_f16( + pg16_partial, reinterpret_cast(pVect2) + offset); + svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, q_h); + sum0 = svmla_f32_z(pg_partial, sum0, v1_f, v2_f); + offset += remaining; + } + } + + // Main loop: 4 chunks per iteration via 4 accumulators. + const size_t chunk_size = 4 * chunk; + const size_t number_of_chunks = + (dimension - (partial_chunk ? 
dimension % chunk : 0)) / chunk_size; + for (size_t i = 0; i < number_of_chunks; i++) { + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum1, pg, chunk); + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum3, pg, chunk); + } + + // Additional steps 0..3. + if constexpr (additional_steps > 0) + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); + if constexpr (additional_steps > 1) + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum1, pg, chunk); + if constexpr (additional_steps > 2) + SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); + + svfloat32_t sum = svadd_f32_x(pg, sum0, sum1); + sum = svadd_f32_x(pg, sum, sum2); + sum = svadd_f32_x(pg, sum, sum3); + float quantized_dot = svaddv_f32(pg, sum); + + // Metadata loads - unaligned because odd dim leaves trailers unaligned. + const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float min_val = + load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = + load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + const uint8_t *query_meta_bytes = reinterpret_cast( + static_cast(pVect2v) + dimension); + const float y_sum = + load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template +float SQ8_FP16_InnerProductSIMD_SVE(const void *pVect1v, const void *pVect2v, + size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductSIMD_SVE_IMP( + pVect1v, pVect2v, dimension); +} + +template +float SQ8_FP16_CosineSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD_SVE( + pVect1v, pVect2v, dimension); +} From 98c8babc655a16ed5efb7f085b918b2aca01206f Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:20:41 +0300 Subject: [PATCH 10/24] 
=?UTF-8?q?Add=20SVE=20SQ8=E2=86=94FP16=20L2=20kerne?= =?UTF-8?q?l=20header=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h | 38 ++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) create mode 100644 src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h diff --git a/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h new file mode 100644 index 000000000..3c8e89ca6 --- /dev/null +++ b/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h @@ -0,0 +1,38 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * SVE SQ8<->FP16 L2 squared distance: + * ||x - y||^2 = x_sum_squares - 2 * IP(x, y) + y_sum_squares + * IP is computed by SQ8_FP16_InnerProductSIMD_SVE_IMP; metadata is FP32. 
+ */ + +template +float SQ8_FP16_L2SqrSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = SQ8_FP16_InnerProductSIMD_SVE_IMP( + pVect1v, pVect2v, dimension); + + const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float x_sum_sq = + load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const uint8_t *query_meta_bytes = reinterpret_cast( + static_cast(pVect2v) + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} From ad387e11cf499c849faac30665e8be15dd1ae66a Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:24:14 +0300 Subject: [PATCH 11/24] =?UTF-8?q?Wire=20SVE/SVE2=20SQ8=E2=86=94FP16=20choo?= =?UTF-8?q?sers=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- src/VecSim/spaces/functions/SVE.cpp | 21 +++++++++++++++++++++ src/VecSim/spaces/functions/SVE.h | 4 ++++ src/VecSim/spaces/functions/SVE2.cpp | 20 ++++++++++++++++++++ src/VecSim/spaces/functions/SVE2.h | 4 ++++ 4 files changed, 49 insertions(+) diff --git a/src/VecSim/spaces/functions/SVE.cpp b/src/VecSim/spaces/functions/SVE.cpp index fde853db2..bd197c84c 100644 --- a/src/VecSim/spaces/functions/SVE.cpp +++ b/src/VecSim/spaces/functions/SVE.cpp @@ -25,6 +25,9 @@ #include "VecSim/spaces/IP/IP_SVE_SQ8_FP32.h" #include "VecSim/spaces/L2/L2_SVE_SQ8_FP32.h" +#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" +#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" + #include "VecSim/spaces/IP/IP_SVE_SQ8_SQ8.h" #include "VecSim/spaces/L2/L2_SVE_SQ8_SQ8.h" @@ -119,6 +122,24 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_SVE(size_t dim) { return ret_dist_func; } +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); + 
return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum) // Note: Use svcntb for uint8 elements (not svcntw which is for 32-bit elements) dist_func_t Choose_SQ8_SQ8_IP_implementation_SVE(size_t dim) { diff --git a/src/VecSim/spaces/functions/SVE.h b/src/VecSim/spaces/functions/SVE.h index bd3bc97c3..43b3b22cd 100644 --- a/src/VecSim/spaces/functions/SVE.h +++ b/src/VecSim/spaces/functions/SVE.h @@ -33,6 +33,10 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_SVE(size_t dim); dist_func_t Choose_SQ8_FP32_Cosine_implementation_SVE(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_SVE(size_t dim); +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE(size_t dim); + // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized with precomputed sum) dist_func_t Choose_SQ8_SQ8_IP_implementation_SVE(size_t dim); dist_func_t Choose_SQ8_SQ8_Cosine_implementation_SVE(size_t dim); diff --git a/src/VecSim/spaces/functions/SVE2.cpp b/src/VecSim/spaces/functions/SVE2.cpp index 4215d79cf..4496c07e6 100644 --- a/src/VecSim/spaces/functions/SVE2.cpp +++ b/src/VecSim/spaces/functions/SVE2.cpp @@ -22,6 +22,8 @@ #include "VecSim/spaces/IP/IP_SVE_UINT8.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/IP/IP_SVE_SQ8_FP32.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/L2/L2_SVE_SQ8_FP32.h" // SVE2 implementation is identical to SVE +#include 
"VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE +#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/IP/IP_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/L2/L2_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE @@ -116,6 +118,24 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_SVE2(size_t dim) { return ret_dist_func; } +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim) { + dist_func_t ret_dist_func; + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); + return ret_dist_func; +} + // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized) // Note: Use svcntb for uint8 elements (not svcntw which is for 32-bit elements) dist_func_t Choose_SQ8_SQ8_IP_implementation_SVE2(size_t dim) { diff --git a/src/VecSim/spaces/functions/SVE2.h b/src/VecSim/spaces/functions/SVE2.h index 04078a91e..2c1bfbac3 100644 --- a/src/VecSim/spaces/functions/SVE2.h +++ b/src/VecSim/spaces/functions/SVE2.h @@ -33,6 +33,10 @@ dist_func_t Choose_SQ8_FP32_IP_implementation_SVE2(size_t dim); dist_func_t Choose_SQ8_FP32_Cosine_implementation_SVE2(size_t dim); dist_func_t Choose_SQ8_FP32_L2_implementation_SVE2(size_t dim); +dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim); +dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim); +dist_func_t Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim); + // SQ8-to-SQ8 distance functions (both vectors are uint8 quantized) dist_func_t 
Choose_SQ8_SQ8_IP_implementation_SVE2(size_t dim); dist_func_t Choose_SQ8_SQ8_Cosine_implementation_SVE2(size_t dim); From e8a121c7468e4ba0ba038228e6a7ccf44baa78a3 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:28:13 +0300 Subject: [PATCH 12/24] =?UTF-8?q?Dispatch=20SQ8=E2=86=94FP16=20to=20SVE/SV?= =?UTF-8?q?E2=20tiers=20on=20AArch64=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- src/VecSim/spaces/IP_space.cpp | 20 ++++++++++++++++++++ src/VecSim/spaces/L2_space.cpp | 10 ++++++++++ 2 files changed, 30 insertions(+) diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 92616f394..1930e64a2 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -229,6 +229,16 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, if (dim < 16) { return ret_dist_func; } +#ifdef OPT_SVE2 + if (features.sve2) { + return Choose_SQ8_FP16_IP_implementation_SVE2(dim); + } +#endif +#ifdef OPT_SVE + if (features.sve) { + return Choose_SQ8_FP16_IP_implementation_SVE(dim); + } +#endif #ifdef OPT_NEON_HP if (features.asimdhp) { // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers @@ -291,6 +301,16 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm if (dim < 16) { return ret_dist_func; } +#ifdef OPT_SVE2 + if (features.sve2) { + return Choose_SQ8_FP16_Cosine_implementation_SVE2(dim); + } +#endif +#ifdef OPT_SVE + if (features.sve) { + return Choose_SQ8_FP16_Cosine_implementation_SVE(dim); + } +#endif #ifdef OPT_NEON_HP if (features.asimdhp) { // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index 995b4c4d6..2e18920b3 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -160,6 +160,16 @@ dist_func_t 
L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, if (dim < 16) { return ret_dist_func; } +#ifdef OPT_SVE2 + if (features.sve2) { + return Choose_SQ8_FP16_L2_implementation_SVE2(dim); + } +#endif +#ifdef OPT_SVE + if (features.sve) { + return Choose_SQ8_FP16_L2_implementation_SVE(dim); + } +#endif #ifdef OPT_NEON_HP if (features.asimdhp) { // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers From 9a0b8582c9f2147d8af55aa0c0aad5588a25c20f Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:33:10 +0300 Subject: [PATCH 13/24] =?UTF-8?q?Extend=20SQ8=E2=86=94FP16=20tier-walk=20t?= =?UTF-8?q?ests=20with=20SVE/SVE2=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- tests/unit/test_spaces.cpp | 72 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index d2c9386ac..f7266dce4 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3149,6 +3149,29 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { #endif #endif // OPT_F16C +#ifdef CPU_FEATURES_ARCH_AARCH64 +#ifdef OPT_SVE2 + if (optimization.sve2) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SVE2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE2 with dim " << dim; + optimization.sve2 = 0; + } +#endif +#ifdef OPT_SVE + if (optimization.sve) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SVE(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, 
arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE with dim " << dim; + optimization.sve = 0; + } +#endif #ifdef OPT_NEON_HP if (optimization.asimdhp) { unsigned char alignment = 0; @@ -3160,6 +3183,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { optimization.asimdhp = 0; } #endif +#endif // CPU_FEATURES_ARCH_AARCH64 unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3236,6 +3260,29 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { #endif #endif // OPT_F16C +#ifdef CPU_FEATURES_ARCH_AARCH64 +#ifdef OPT_SVE2 + if (optimization.sve2) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SVE2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE2 with dim " << dim; + optimization.sve2 = 0; + } +#endif +#ifdef OPT_SVE + if (optimization.sve) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SVE(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE with dim " << dim; + optimization.sve = 0; + } +#endif #ifdef OPT_NEON_HP if (optimization.asimdhp) { unsigned char alignment = 0; @@ -3247,6 +3294,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { optimization.asimdhp = 0; } #endif +#endif // CPU_FEATURES_ARCH_AARCH64 unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3323,6 +3371,29 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { #endif #endif // OPT_F16C +#ifdef CPU_FEATURES_ARCH_AARCH64 +#ifdef OPT_SVE2 + if 
(optimization.sve2) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_SVE2(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE2 with dim " << dim; + optimization.sve2 = 0; + } +#endif +#ifdef OPT_SVE + if (optimization.sve) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_SVE(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "SVE with dim " << dim; + optimization.sve = 0; + } +#endif #ifdef OPT_NEON_HP if (optimization.asimdhp) { unsigned char alignment = 0; @@ -3334,6 +3405,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { optimization.asimdhp = 0; } #endif +#endif // CPU_FEATURES_ARCH_AARCH64 unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); From d76325eeebe5e8af92b71f91b46ab20cedb44d28 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Thu, 28 May 2026 18:41:59 +0300 Subject: [PATCH 14/24] =?UTF-8?q?Register=20ARM=20SQ8=E2=86=94FP16=20micro?= =?UTF-8?q?benchmarks=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Claude Sonnet 4.6 --- .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 26 +++++++++++++++++-- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp index ba3030064..9ec022e39 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp @@ -16,8 +16,7 @@ 
using float16 = vecsim_types::float16; /** * SQ8-to-FP16 benchmarks: SQ8 quantized storage with FP16 query. * Registers the naive (scalar) baseline plus per-ISA SIMD variants (x86: AVX-512 / AVX2+FMA / - * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels - * land via MOD-14972. + * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels (NEON_HP / SVE / SVE2) are registered below. */ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture { protected: @@ -85,6 +84,29 @@ INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SSE4, 16, s #endif // OPT_F16C #endif // x86_64 +#ifdef CPU_FEATURES_ARCH_AARCH64 +cpu_features::Aarch64Features arm_opt = cpu_features::GetAarch64Info().features; + +#ifdef OPT_SVE2 +bool sve2_supported = arm_opt.sve2; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); +#endif + +#ifdef OPT_SVE +bool sve_supported = arm_opt.sve; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); +#endif + +#ifdef OPT_NEON_HP +bool neon_hp_supported = arm_opt.asimdhp; +INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, + neon_hp_supported); +#endif +#endif // CPU_FEATURES_ARCH_AARCH64 + // Naive (scalar) baseline — always registered as the comparison anchor. 
INITIALIZE_NAIVE_BM(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, InnerProduct, 16); From 88de731b01303317ef2d2aaa57cc056418b3b933 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 07:47:32 +0000 Subject: [PATCH 15/24] =?UTF-8?q?Add=20missing=20alignment=3D0=20assertion?= =?UTF-8?q?s=20to=20SQ8=E2=86=94FP16=20ARM=20tier-walk=20tests=20[MOD-1497?= =?UTF-8?q?2]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32 sister blocks which assert it. Adds the assertions to lock the contract that ARM tiers leave the caller's alignment value untouched. --- tests/unit/test_spaces.cpp | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index f7266dce4..ce8605565 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3158,6 +3158,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "SVE2 with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment SVE2 with dim " << dim; optimization.sve2 = 0; } #endif @@ -3169,6 +3170,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "SVE with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment SVE with dim " << dim; optimization.sve = 0; } #endif @@ -3180,6 +3182,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "NEON_HP with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment NEON_HP with 
dim " << dim; optimization.asimdhp = 0; } #endif @@ -3269,6 +3272,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "SVE2 with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment SVE2 with dim " << dim; optimization.sve2 = 0; } #endif @@ -3280,6 +3284,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "SVE with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment SVE with dim " << dim; optimization.sve = 0; } #endif @@ -3291,6 +3296,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "NEON_HP with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment NEON_HP with dim " << dim; optimization.asimdhp = 0; } #endif @@ -3380,6 +3386,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "SVE2 with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment SVE2 with dim " << dim; optimization.sve2 = 0; } #endif @@ -3391,6 +3398,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { << "Unexpected distance function chosen for dim " << dim; ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "SVE with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment SVE with dim " << dim; optimization.sve = 0; } #endif @@ -3402,6 +3410,7 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_CosineTest) { << "Unexpected distance function chosen for dim " << dim; 
ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) << "NEON_HP with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment NEON_HP with dim " << dim; optimization.asimdhp = 0; } #endif From c2423913a6981cf796ea02dee176f204a906e31b Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 08:40:11 +0000 Subject: [PATCH 16/24] =?UTF-8?q?Fix=20SVE=20SQ8=E2=86=94FP16=20kernel:=20?= =?UTF-8?q?use=20svzip1=20to=20correct=20FP16=E2=86=92FP32=20widening=20[M?= =?UTF-8?q?OD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e]. The step function loaded chunk consecutive FP16 values into positions 0..chunk-1, then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,... and silently skipped positions 1,3,5,... For chunk=4 (128-bit SVE), only 2 of 4 FP16 values per step were used, producing wrong dot products. Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so FCVT correctly reads v[0],v[1],v[2],... Applied to both the full step helper and the partial-chunk path. Discovered and fixed during ARM host verification (Task 14, MOD-14972). --- src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h index 36a7d18e6..d3213e3c3 100644 --- a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h @@ -35,11 +35,16 @@ SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, size svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); - // FP16 query -> FP32. svld1_f16 uses a b16 predicate sized to `chunk` half lanes. + // FP16 query -> FP32. + // svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] <- FP16[2e]. 
+ // Our chunk FP16 values land at consecutive positions 0..chunk-1 after the load. + // svzip1 interleaves them with zeros → positions 0,2,4,... hold the values, + // so FCVT correctly reads v[0], v[1], v[2], ... into FP32[0..chunk-1]. svbool_t pg16 = svwhilelt_b16(uint32_t(0), uint32_t(chunk)); svfloat16_t q_h = svld1_f16(pg16, reinterpret_cast(pVect2) + offset); - svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h); + svfloat16_t q_h_spread = svzip1_f16(q_h, svdup_f16(0.0f)); + svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h_spread); sum = svmla_f32_x(pg, sum, v1_f, v2_f); offset += chunk; @@ -75,7 +80,9 @@ float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v svfloat32_t v1_f = svcvt_f32_u32_z(pg_partial, v1_u32); svfloat16_t q_h = svld1_f16( pg16_partial, reinterpret_cast(pVect2) + offset); - svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, q_h); + // Same zip1 trick as the full step: spread values to even positions. + svfloat16_t q_h_spread = svzip1_f16(q_h, svdup_f16(0.0f)); + svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, q_h_spread); sum0 = svmla_f32_z(pg_partial, sum0, v1_f, v2_f); offset += remaining; } From f7bb4b1bf63934673bb4a5b9c6987f215dd4ead8 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 12:43:00 +0000 Subject: [PATCH 17/24] =?UTF-8?q?Optimize=20ARM=20SQ8=E2=86=94FP16=20kerne?= =?UTF-8?q?ls=20and=20align=20with=20codebase=20conventions=20[MOD-14972]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to the partial-chunk path, which also drops the now-redundant pg16_partial predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z to match the SQ8_FP32 SVE sister. 
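For illustration only, the lane layout behind the SVE change above can be modelled in portable scalar C++. This is a sketch, not the kernel: it uses no SVE intrinsics, treats FP16 values as raw uint16_t bit patterns, assumes the little-endian halfword order within each 32-bit container, and every name in it (Halfwords, packed_load, widening_load, convert_inputs) is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one SVE register with `lanes` 32-bit
// containers, viewed as 2 * lanes halfwords. FP16 values are kept as raw
// bit patterns; no FP16 arithmetic is performed here.
using Halfwords = std::vector<std::uint16_t>;

// Models a packed FP16 load whose predicate covers the first `lanes`
// halfwords: element i lands in halfword i, everything else is zero.
Halfwords packed_load(const std::vector<std::uint16_t> &src, std::size_t lanes) {
    Halfwords reg(2 * lanes, 0);
    for (std::size_t i = 0; i < lanes && i < src.size(); ++i) {
        reg[i] = src[i];
    }
    return reg;
}

// Models a widening halfword load: element i is zero-extended into 32-bit
// container i, so it occupies halfword 2 * i and halfword 2 * i + 1 is zero.
Halfwords widening_load(const std::vector<std::uint16_t> &src, std::size_t lanes) {
    Halfwords reg(2 * lanes, 0);
    for (std::size_t i = 0; i < lanes && i < src.size(); ++i) {
        reg[2 * i] = src[i];
    }
    return reg;
}

// Models which halfword an FP16 -> FP32 convert consumes for each 32-bit
// lane e: the low halfword of that container, i.e. halfword 2 * e.
std::vector<std::uint16_t> convert_inputs(const Halfwords &reg) {
    std::vector<std::uint16_t> out(reg.size() / 2);
    for (std::size_t e = 0; e < out.size(); ++e) {
        out[e] = reg[2 * e];
    }
    return out;
}
```

In this model, four lanes fed {10, 11, 12, 13} through packed_load expose only elements 0 and 2 to the convert step (the "2 of 4" case described in the earlier fix), while widening_load exposes all four in order.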
NEON residual: replace the single 8-lane block + up-to-7 software-scalar iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12), leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for residuals 4..15 (previously up to 7 software conversions per call). Both IP headers: remove assert()+<cassert> (no sister kernel uses them). Both L2 headers: drop redundant float16.h include and using declarations (arrive transitively through the included IP header). --- src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h | 64 +++++++++++++------------ src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h | 60 ++++++++--------------- src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h | 5 -- src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h | 5 -- 4 files changed, 52 insertions(+), 82 deletions(-) diff --git a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h index b1d26fec5..8b8d37c0f 100644 --- a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h @@ -11,19 +11,14 @@ #include "VecSim/types/sq8.h" #include "VecSim/types/float16.h" #include <arm_neon.h> -#include <cassert> using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; /* - * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: + * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity: + * IP(x, y) ~= min * y_sum + delta * Σ(q_i * y_i) * - * IP(x, y) = sum(x_i * y_i) - * ~= sum((min + delta * q_i) * y_i) - * = min * y_sum + delta * sum(q_i * y_i) - * - * The hot loop only accumulates sum(q_i * y_i) - no per-element dequantisation. * FP16 query lanes are widened to FP32 via vcvt_f32_f16 per 16-lane chunk.
*/ @@ -32,7 +27,6 @@ static inline void SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2, float32x4_t &sum0, float32x4_t &sum1, float32x4_t &sum2, float32x4_t &sum3) { - // SQ8 storage: 16 * uint8 -> 4 * float32x4_t uint8x16_t v1_u8 = vld1q_u8(pVect1); uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); @@ -41,7 +35,6 @@ SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2 float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi))); float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi))); - // FP16 query: 16 * f16 -> 4 * float32x4_t via vcvt_f32_f16 const float16_t *q = reinterpret_cast(pVect2); float16x8_t q_lo = vld1q_f16(q); float16x8_t q_hi = vld1q_f16(q + 8); @@ -59,14 +52,12 @@ SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2 pVect2 += 16; } -// pVect1v = SQ8 storage, pVect2v = FP16 query +// pVect1v = SQ8 storage, pVect2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher). template // 0..15 float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, size_t dimension) { - assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); - - const uint8_t *pVect1 = static_cast(pVect1v); // SQ8 storage - const float16 *pVect2 = static_cast(pVect2v); // FP16 query + const uint8_t *pVect1 = static_cast(pVect1v); + const float16 *pVect2 = static_cast(pVect2v); float32x4_t sum0 = vdupq_n_f32(0.0f); float32x4_t sum1 = vdupq_n_f32(0.0f); @@ -78,36 +69,48 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p SQ8_FP16_InnerProductStep_NEON_HP(pVect1, pVect2, sum0, sum1, sum2, sum3); } - // Residual handling: dim % 16 lanes. - // residual >= 8: one safe 8-lane SQ8 + 8-lane FP16 load (FP16 trailer is wide enough). - // residual < 8: scalar-only - a 4-lane FP16 load would overread y_sum metadata. 
+ // Residual: up to three independent 4-lane sub-steps, leaving at most 3 elements + // for scalar — mirrors the SQ8_FP32 NEON sister pattern. + // vld1_f16 (4 FP16 = 8 bytes) is safe for any residual: FP16 metadata follows + // the lane data so there is always enough headroom. constexpr unsigned char r = residual; - if constexpr (r >= 8) { + if constexpr (r >= 4) { uint8x8_t v1_u8 = vld1_u8(pVect1); - uint16x8_t v1_u16 = vmovl_u8(v1_u8); - float32x4_t v1_a = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_u16))); - float32x4_t v1_b = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_u16))); - float16x8_t q_h = vld1q_f16(reinterpret_cast(pVect2)); - float32x4_t v2_a = vcvt_f32_f16(vget_low_f16(q_h)); - float32x4_t v2_b = vcvt_f32_f16(vget_high_f16(q_h)); + float32x4_t v1_a = vcvtq_f32_u32(vmovl_u16(vget_low_u16(vmovl_u8(v1_u8)))); + float32x4_t v2_a = + vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); sum0 = vfmaq_f32(sum0, v1_a, v2_a); + pVect1 += 4; + pVect2 += 4; + } + if constexpr (r >= 8) { + uint8x8_t v1_u8 = vld1_u8(pVect1); + float32x4_t v1_b = vcvtq_f32_u32(vmovl_u16(vget_low_u16(vmovl_u8(v1_u8)))); + float32x4_t v2_b = + vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); sum1 = vfmaq_f32(sum1, v1_b, v2_b); - pVect1 += 8; - pVect2 += 8; + pVect1 += 4; + pVect2 += 4; + } + if constexpr (r >= 12) { + uint8x8_t v1_u8 = vld1_u8(pVect1); + float32x4_t v1_c = vcvtq_f32_u32(vmovl_u16(vget_low_u16(vmovl_u8(v1_u8)))); + float32x4_t v2_c = + vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); + sum2 = vfmaq_f32(sum2, v1_c, v2_c); + pVect1 += 4; + pVect2 += 4; } - // Lane-by-lane scalar for the final 0..7 (residual % 8) elements. - constexpr unsigned char tail = r & 0x7; + constexpr unsigned char tail = r & 3; float scalar_dot = 0.0f; for (unsigned char k = 0; k < tail; ++k) { scalar_dot += static_cast(pVect1[k]) * vecsim_types::FP16_to_FP32(pVect2[k]); } - // Reduce the four NEON accumulators. 
float32x4_t sum_lo = vaddq_f32(sum0, sum1); float32x4_t sum_hi = vaddq_f32(sum2, sum3); float quantized_dot = vaddvq_f32(vaddq_f32(sum_lo, sum_hi)) + scalar_dot; - // Metadata loads - use load_unaligned because odd dim leaves trailers unaligned. const uint8_t *params_bytes = static_cast(pVect1v) + dimension; const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); @@ -130,6 +133,5 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP(const void *pVect1v, const void *pVect template float SQ8_FP16_CosineSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { - // Cosine = 1 - IP (vectors are pre-normalised); reuses the IP wrapper. return SQ8_FP16_InnerProductSIMD16_NEON_HP(pVect1v, pVect2v, dimension); } diff --git a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h index d3213e3c3..6c7a52529 100644 --- a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h @@ -11,51 +11,36 @@ #include "VecSim/types/sq8.h" #include "VecSim/types/float16.h" #include -#include using sq8 = vecsim_types::sq8; using float16 = vecsim_types::float16; /* - * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: + * Asymmetric SQ8 (storage) <-> FP16 (query) inner product using algebraic identity: + * IP(x, y) ~= min * y_sum + delta * Σ(q_i * y_i) * - * IP(x, y) ~= min * y_sum + delta * sum(q_i * y_i) - * - * Hot loop accumulates sum(q_i * y_i) only; FP16 query lanes are widened to FP32 - * inside each step via svcvt_f32_f16_x. Metadata loads use load_unaligned. + * FP16 query lanes are widened to FP32 per step via svld1uh_u32 + svcvt_f32_f16_x. + * svld1uh_u32 zero-extends each FP16 halfword into a 32-bit lane so that + * svcvt_f32_f16_x reads the correct bits directly without any interleaving. */ // Helper: one SVE-vector-width-of-FP32 step. -// chunk = svcntw() - number of FP32 lanes per step. -// pg = svptrue_b32() - predicate for FP32 lanes. 
static inline void SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, size_t &offset, svfloat32_t &sum, svbool_t pg, size_t chunk) { - // SQ8 -> uint32 (widen on load), then to FP32. svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); - - // FP16 query -> FP32. - // svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] <- FP16[2e]. - // Our chunk FP16 values land at consecutive positions 0..chunk-1 after the load. - // svzip1 interleaves them with zeros → positions 0,2,4,... hold the values, - // so FCVT correctly reads v[0], v[1], v[2], ... into FP32[0..chunk-1]. - svbool_t pg16 = svwhilelt_b16(uint32_t(0), uint32_t(chunk)); - svfloat16_t q_h = - svld1_f16(pg16, reinterpret_cast(pVect2) + offset); - svfloat16_t q_h_spread = svzip1_f16(q_h, svdup_f16(0.0f)); - svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h_spread); - + svuint32_t q_u32 = + svld1uh_u32(pg, reinterpret_cast(pVect2 + offset)); + svfloat32_t v2_f = svcvt_f32_f16_x(pg, svreinterpret_f16_u32(q_u32)); sum = svmla_f32_x(pg, sum, v1_f, v2_f); offset += chunk; } -// pVect1v = SQ8 storage, pVect2v = FP16 query +// pVect1v = SQ8 storage, pVect2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher). template float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v, size_t dimension) { - assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); - const uint8_t *pVect1 = static_cast(pVect1v); const float16 *pVect2 = static_cast(pVect2v); size_t offset = 0; @@ -67,28 +52,23 @@ float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v svfloat32_t sum2 = svdup_f32(0.0f); svfloat32_t sum3 = svdup_f32(0.0f); - // Partial chunk for dim % chunk lanes. Use _z form so inactive lanes are zero - - // the final reduction below walks all lanes via svptrue_b32(). + // Partial chunk for dim % chunk lanes. 
Use _z form so inactive lanes are zero; + // the final reduction walks all lanes via svptrue_b32(). if constexpr (partial_chunk) { size_t remaining = dimension % chunk; if (remaining > 0) { - svbool_t pg_partial = - svwhilelt_b32(uint32_t(0), uint32_t(remaining)); - svbool_t pg16_partial = - svwhilelt_b16(uint32_t(0), uint32_t(remaining)); + svbool_t pg_partial = svwhilelt_b32(uint32_t(0), uint32_t(remaining)); svuint32_t v1_u32 = svld1ub_u32(pg_partial, pVect1 + offset); svfloat32_t v1_f = svcvt_f32_u32_z(pg_partial, v1_u32); - svfloat16_t q_h = svld1_f16( - pg16_partial, reinterpret_cast(pVect2) + offset); - // Same zip1 trick as the full step: spread values to even positions. - svfloat16_t q_h_spread = svzip1_f16(q_h, svdup_f16(0.0f)); - svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, q_h_spread); + svuint32_t q_u32 = svld1uh_u32( + pg_partial, reinterpret_cast(pVect2 + offset)); + svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, svreinterpret_f16_u32(q_u32)); sum0 = svmla_f32_z(pg_partial, sum0, v1_f, v2_f); offset += remaining; } } - // Main loop: 4 chunks per iteration via 4 accumulators. + // Main loop: 4 chunks per iteration, one chunk per accumulator. const size_t chunk_size = 4 * chunk; const size_t number_of_chunks = (dimension - (partial_chunk ? dimension % chunk : 0)) / chunk_size; @@ -99,7 +79,6 @@ float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum3, pg, chunk); } - // Additional steps 0..3. 
if constexpr (additional_steps > 0) SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); if constexpr (additional_steps > 1) @@ -107,12 +86,11 @@ float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v if constexpr (additional_steps > 2) SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); - svfloat32_t sum = svadd_f32_x(pg, sum0, sum1); - sum = svadd_f32_x(pg, sum, sum2); - sum = svadd_f32_x(pg, sum, sum3); + svfloat32_t sum = svadd_f32_z(pg, sum0, sum1); + sum = svadd_f32_z(pg, sum, sum2); + sum = svadd_f32_z(pg, sum, sum3); float quantized_dot = svaddv_f32(pg, sum); - // Metadata loads - unaligned because odd dim leaves trailers unaligned. const uint8_t *params_bytes = static_cast(pVect1v) + dimension; const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); diff --git a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h index 7bf5db986..15cc40f6a 100644 --- a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h @@ -9,11 +9,6 @@ #pragma once #include "VecSim/spaces/space_includes.h" #include "VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" -#include "VecSim/types/sq8.h" -#include "VecSim/types/float16.h" - -using sq8 = vecsim_types::sq8; -using float16 = vecsim_types::float16; /* * Optimised asymmetric SQ8<->FP16 L2 squared distance using the algebraic identity: diff --git a/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h index 3c8e89ca6..e3592c24e 100644 --- a/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h @@ -9,11 +9,6 @@ #pragma once #include "VecSim/spaces/space_includes.h" #include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" -#include "VecSim/types/sq8.h" -#include "VecSim/types/float16.h" - -using sq8 = vecsim_types::sq8; -using float16 = vecsim_types::float16; /* * SVE SQ8<->FP16 L2 squared distance: From 72f9a98ade1016e23be4748e7bfba0c9675aa357 Mon Sep 
17 00:00:00 2001
From: Dor Forer
Date: Sun, 31 May 2026 13:03:53 +0000
Subject: [PATCH 18/24] Apply clang-format [MOD-14972]

---
 src/VecSim/batch_iterator.h                   |  2 +-
 src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h       | 28 ++++++----------
 src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h        | 33 ++++++++-----------
 src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h       | 10 +++---
 src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h        |  7 ++--
 tests/benchmark/bm_vecsim_svs.h               | 14 ++++----
 .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp  |  5 +--
 tests/benchmark/types_ranges.h                | 12 ++++---
 tests/unit/test_allocator.cpp                 |  4 +--
 9 files changed, 53 insertions(+), 62 deletions(-)

diff --git a/src/VecSim/batch_iterator.h b/src/VecSim/batch_iterator.h
index 9e2791130..466072f86 100644
--- a/src/VecSim/batch_iterator.h
+++ b/src/VecSim/batch_iterator.h
@@ -27,7 +27,7 @@ struct VecSimBatchIterator : public VecsimBaseObject {
 explicit VecSimBatchIterator(void *query_vector, void *tctx,
 std::shared_ptr allocator)
 : VecsimBaseObject(allocator), query_vector(query_vector), returned_results_count(0),
- timeoutCtx(tctx) {};
+ timeoutCtx(tctx){};

 virtual inline const void *getQueryBlob() const { return query_vector; }

diff --git a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h
index 8b8d37c0f..a5c2465fc 100644
--- a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h
+++ b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h
@@ -23,10 +23,9 @@ using float16 = vecsim_types::float16;
 */

 // Helper: 16 lanes per call, four FP32 accumulators (one per quarter).
-static inline void -SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2, - float32x4_t &sum0, float32x4_t &sum1, - float32x4_t &sum2, float32x4_t &sum3) { +static inline void SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2, + float32x4_t &sum0, float32x4_t &sum1, + float32x4_t &sum2, float32x4_t &sum3) { uint8x16_t v1_u8 = vld1q_u8(pVect1); uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); @@ -77,8 +76,7 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p if constexpr (r >= 4) { uint8x8_t v1_u8 = vld1_u8(pVect1); float32x4_t v1_a = vcvtq_f32_u32(vmovl_u16(vget_low_u16(vmovl_u8(v1_u8)))); - float32x4_t v2_a = - vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); + float32x4_t v2_a = vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); sum0 = vfmaq_f32(sum0, v1_a, v2_a); pVect1 += 4; pVect2 += 4; @@ -86,8 +84,7 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p if constexpr (r >= 8) { uint8x8_t v1_u8 = vld1_u8(pVect1); float32x4_t v1_b = vcvtq_f32_u32(vmovl_u16(vget_low_u16(vmovl_u8(v1_u8)))); - float32x4_t v2_b = - vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); + float32x4_t v2_b = vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); sum1 = vfmaq_f32(sum1, v1_b, v2_b); pVect1 += 4; pVect2 += 4; @@ -95,8 +92,7 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p if constexpr (r >= 12) { uint8x8_t v1_u8 = vld1_u8(pVect1); float32x4_t v1_c = vcvtq_f32_u32(vmovl_u16(vget_low_u16(vmovl_u8(v1_u8)))); - float32x4_t v2_c = - vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); + float32x4_t v2_c = vcvt_f32_f16(vld1_f16(reinterpret_cast(pVect2))); sum2 = vfmaq_f32(sum2, v1_c, v2_c); pVect1 += 4; pVect2 += 4; @@ -112,14 +108,11 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p float quantized_dot = vaddvq_f32(vaddq_f32(sum_lo, sum_hi)) + 
scalar_dot; const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float min_val = - load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); - const float delta = - load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); const uint8_t *query_meta_bytes = reinterpret_cast(static_cast(pVect2v) + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); return min_val * y_sum + delta * quantized_dot; } @@ -127,8 +120,7 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p template float SQ8_FP16_InnerProductSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { - return 1.0f - - SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); + return 1.0f - SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); } template diff --git a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h index 6c7a52529..1408e0880 100644 --- a/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h @@ -25,13 +25,12 @@ using float16 = vecsim_types::float16; */ // Helper: one SVE-vector-width-of-FP32 step. 
-static inline void -SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, size_t &offset, - svfloat32_t &sum, svbool_t pg, size_t chunk) { +static inline void SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, + size_t &offset, svfloat32_t &sum, svbool_t pg, + size_t chunk) { svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); - svuint32_t q_u32 = - svld1uh_u32(pg, reinterpret_cast(pVect2 + offset)); + svuint32_t q_u32 = svld1uh_u32(pg, reinterpret_cast(pVect2 + offset)); svfloat32_t v2_f = svcvt_f32_f16_x(pg, svreinterpret_f16_u32(q_u32)); sum = svmla_f32_x(pg, sum, v1_f, v2_f); offset += chunk; @@ -60,8 +59,8 @@ float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v svbool_t pg_partial = svwhilelt_b32(uint32_t(0), uint32_t(remaining)); svuint32_t v1_u32 = svld1ub_u32(pg_partial, pVect1 + offset); svfloat32_t v1_f = svcvt_f32_u32_z(pg_partial, v1_u32); - svuint32_t q_u32 = svld1uh_u32( - pg_partial, reinterpret_cast(pVect2 + offset)); + svuint32_t q_u32 = + svld1uh_u32(pg_partial, reinterpret_cast(pVect2 + offset)); svfloat32_t v2_f = svcvt_f32_f16_z(pg_partial, svreinterpret_f16_u32(q_u32)); sum0 = svmla_f32_z(pg_partial, sum0, v1_f, v2_f); offset += remaining; @@ -92,27 +91,23 @@ float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v float quantized_dot = svaddv_f32(pg, sum); const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float min_val = - load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); - const float delta = - load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); - const uint8_t *query_meta_bytes = reinterpret_cast( - static_cast(pVect2v) + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = 
load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); return min_val * y_sum + delta * quantized_dot; } template -float SQ8_FP16_InnerProductSIMD_SVE(const void *pVect1v, const void *pVect2v, - size_t dimension) { +float SQ8_FP16_InnerProductSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { return 1.0f - SQ8_FP16_InnerProductSIMD_SVE_IMP( pVect1v, pVect2v, dimension); } template float SQ8_FP16_CosineSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { - return SQ8_FP16_InnerProductSIMD_SVE( - pVect1v, pVect2v, dimension); + return SQ8_FP16_InnerProductSIMD_SVE(pVect1v, pVect2v, + dimension); } diff --git a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h index 15cc40f6a..70367d7fe 100644 --- a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h @@ -21,15 +21,13 @@ template // 0..15 float SQ8_FP16_L2SqrSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { - const float ip = - SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); + const float ip = SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); - const uint8_t *query_meta_bytes = reinterpret_cast( - static_cast(pVect2v) + dimension); + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); const float y_sum_sq = load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); diff --git a/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h index 
e3592c24e..f70ef493d 100644 --- a/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h @@ -22,10 +22,9 @@ float SQ8_FP16_L2SqrSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t di pVect1v, pVect2v, dimension); const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); - const uint8_t *query_meta_bytes = reinterpret_cast( - static_cast(pVect2v) + dimension); + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); const float y_sum_sq = load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); diff --git a/tests/benchmark/bm_vecsim_svs.h b/tests/benchmark/bm_vecsim_svs.h index 5acb882c0..b92cce5e0 100644 --- a/tests/benchmark/bm_vecsim_svs.h +++ b/tests/benchmark/bm_vecsim_svs.h @@ -466,17 +466,19 @@ void BM_VecSimSVS::RunGC(benchmark::State &st) { #define UNIT_AND_ITERATIONS Unit(benchmark::kMillisecond)->Iterations(2) #if HAVE_SVS_LVQ -#define QUANT_BITS_ARGS {VecSimSvsQuant_8, VecSimSvsQuant_4x8_LeanVec} +#define QUANT_BITS_ARGS \ + { VecSimSvsQuant_8, VecSimSvsQuant_4x8_LeanVec } #define COMPRESSED_TRAINING_THRESHOLD_ARGS \ - {static_cast(BM_VecSimGeneral::block_size), 5000, 10000} + { static_cast(BM_VecSimGeneral::block_size), 5000, 10000 } #define COMPRESSED_ASYNC_TRAINING_THRESHOLD_ARGS \ - {static_cast(BM_VecSimGeneral::block_size), 5000, 10000, 50000} + { static_cast(BM_VecSimGeneral::block_size), 5000, 10000, 50000 } #else -#define QUANT_BITS_ARGS {VecSimSvsQuant_8} +#define QUANT_BITS_ARGS \ + { VecSimSvsQuant_8 } // Using smaller training TH to avoid long test times without LVQ #define COMPRESSED_TRAINING_THRESHOLD_ARGS \ - {static_cast(BM_VecSimGeneral::block_size), 5000} + { static_cast(BM_VecSimGeneral::block_size), 5000 } #define COMPRESSED_ASYNC_TRAINING_THRESHOLD_ARGS \ - 
{static_cast(BM_VecSimGeneral::block_size), 5000} + { static_cast(BM_VecSimGeneral::block_size), 5000 } #endif diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp index 9ec022e39..cc5d040cb 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp @@ -16,7 +16,8 @@ using float16 = vecsim_types::float16; /** * SQ8-to-FP16 benchmarks: SQ8 quantized storage with FP16 query. * Registers the naive (scalar) baseline plus per-ISA SIMD variants (x86: AVX-512 / AVX2+FMA / - * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels (NEON_HP / SVE / SVE2) are registered below. + * AVX2 / SSE4 — gated on the matching OPT_* defines and runtime CPU features). ARM kernels (NEON_HP + * / SVE / SVE2) are registered below. */ class BM_VecSimSpaces_SQ8_FP16 : public benchmark::Fixture { protected: @@ -103,7 +104,7 @@ INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sv bool neon_hp_supported = arm_opt.asimdhp; INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, - neon_hp_supported); + neon_hp_supported); #endif #endif // CPU_FEATURES_ARCH_AARCH64 diff --git a/tests/benchmark/types_ranges.h b/tests/benchmark/types_ranges.h index 43abda8f0..deff4251c 100644 --- a/tests/benchmark/types_ranges.h +++ b/tests/benchmark/types_ranges.h @@ -11,9 +11,11 @@ #include #include "bm_definitions.h" -#define DEFAULT_RANGE_RADII {20, 35, 50} +#define DEFAULT_RANGE_RADII \ + { 20, 35, 50 } -#define DEFAULT_RANGE_EPSILONS {1, 10, 11} +#define DEFAULT_RANGE_EPSILONS \ + { 1, 10, 11 } // This template struct methods returns the default values for radii and epsilons // To specify different values for a certain type, use template specialization @@ -25,7 +27,8 @@ struct 
benchmark_range { // Larger Range query values are required for int8 wikipedia dataset. // Default values give 0 results -#define INT8_RANGE_RADII {50, 65, 80} +#define INT8_RANGE_RADII \ + { 50, 65, 80 } template <> struct benchmark_range { @@ -34,7 +37,8 @@ struct benchmark_range { }; // UINT8 ranges -#define UINT8_RANGE_RADII {4, 5, 7} +#define UINT8_RANGE_RADII \ + { 4, 5, 7 } template <> struct benchmark_range { diff --git a/tests/unit/test_allocator.cpp b/tests/unit/test_allocator.cpp index 6aa4a0d0b..77db41684 100644 --- a/tests/unit/test_allocator.cpp +++ b/tests/unit/test_allocator.cpp @@ -33,7 +33,7 @@ struct ObjectWithSTL : public VecsimBaseObject { public: ObjectWithSTL(std::shared_ptr allocator) - : VecsimBaseObject(allocator), test_vec(allocator) {}; + : VecsimBaseObject(allocator), test_vec(allocator){}; }; struct NestedObject : public VecsimBaseObject { @@ -42,7 +42,7 @@ struct NestedObject : public VecsimBaseObject { public: NestedObject(std::shared_ptr allocator) - : VecsimBaseObject(allocator), stl_object(allocator), simpleObject(allocator) {}; + : VecsimBaseObject(allocator), stl_object(allocator), simpleObject(allocator){}; }; TEST_F(AllocatorTest, test_simple_object) { From d7576c384954708acd94f3ee03dbc0f034cab25e Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 13:20:58 +0000 Subject: [PATCH 19/24] Trim PR churn: remove docs, dispatcher comments, and test verbosity [MOD-14972] MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970 removed its equivalent doc before merge. - Drop 5-line "No alignment write" prose comment from the three AArch64 NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no such comment — the absent alignment write already encodes the intent. 
- Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister). The scalar-fallback assertion it previously duplicated is already covered by the trailing block of SQ8_FP16_SpacesOptimizationTest. --- .../plans/2026-05-28-arm-sq8-fp16-kernels.md | 1195 ----------------- .../specs/2026-05-28-arm-sq8-fp16-design.md | 354 ----- src/VecSim/spaces/IP_space.cpp | 6 - src/VecSim/spaces/L2_space.cpp | 3 - 4 files changed, 1558 deletions(-) delete mode 100644 docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md delete mode 100644 docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md diff --git a/docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md b/docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md deleted file mode 100644 index 2759ba046..000000000 --- a/docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md +++ /dev/null @@ -1,1195 +0,0 @@ -# SQ8↔FP16 ARM SIMD Distance Kernels — Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Add SQ8↔FP16 asymmetric distance kernels (IP, L2, Cosine) for ARM ISA tiers — NEON_HP, SVE, SVE2 — plugged into the existing dispatcher. Mirrors the x86 work delivered in PR #970. - -**Architecture:** Header-only SIMD kernel templates (one per metric × ISA), instantiated via the existing `CHOOSE_IMPLEMENTATION` / `CHOOSE_SVE_IMPLEMENTATION` macros inside ISA-specific TUs (`NEON_HP.cpp`, `SVE.cpp`, `SVE2.cpp`). Wiring lives in `IP_space.cpp` and `L2_space.cpp` under a `#ifdef CPU_FEATURES_ARCH_AARCH64` block that parallels the existing x86 block. L2 reuses the IP `_IMP` template via the algebraic identity `L2² = x_sum_sq + y_sum_sq − 2·IP`. Scalar fallback already on `main` is unchanged and stays as the reference for every tier. 
- -**Tech Stack:** C++20, ARM NEON intrinsics (`arm_neon.h`), ARM SVE/SVE2 intrinsics (`arm_sve.h`), GoogleTest, Google Benchmark, cpu_features. - -**Branch:** `dor-forer-sq8-fp16-arm-kernels-mod-14972` (stacked on PR #970 / `dor-forer-sq8-fp16-x86-kernels-mod-14954`). - -**Build / test loop:** The user runs `make build` (per project memory). After each build cycle confirmed, the assistant runs `make unit_test` / ASan / benchmarks on the appropriate host (ARM hardware or cross-compile/qemu — coordinate with user). Each task ends in a commit; commits are pushed only when explicitly requested. - -**Spec:** [`docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md`](../specs/2026-05-28-arm-sq8-fp16-design.md) - ---- - -## File Structure - -### Files created - -| Path | Responsibility | -|------|----------------| -| `src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h` | NEON IP kernel template (`SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP` + thin wrappers) | -| `src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h` | NEON L2 kernel template (calls NEON IP impl, applies L2 identity) | -| `src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h` | SVE IP kernel template (`SQ8_FP16_InnerProductSIMD_SVE_IMP` + wrappers); also `#include`d from SVE2.cpp | -| `src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h` | SVE L2 kernel template; also `#include`d from SVE2.cpp | - -### Files modified - -| Path | Change | -|------|--------| -| `src/VecSim/spaces/functions/NEON_HP.h` | +3 chooser declarations (IP, L2, Cosine) | -| `src/VecSim/spaces/functions/NEON_HP.cpp` | +#include kernel headers; +3 chooser definitions | -| `src/VecSim/spaces/functions/SVE.h` | +3 chooser declarations | -| `src/VecSim/spaces/functions/SVE.cpp` | +#include kernel headers; +3 chooser definitions | -| `src/VecSim/spaces/functions/SVE2.h` | +3 chooser declarations | -| `src/VecSim/spaces/functions/SVE2.cpp` | +#include SVE kernel headers; +3 chooser definitions (own symbols, templates instantiated under SVE2 compile flags) | -| `src/VecSim/spaces/IP_space.cpp` 
| +#ifdef AArch64 block in `IP_SQ8_FP16_GetDistFunc` and `Cosine_SQ8_FP16_GetDistFunc` (2 dispatcher blocks) | -| `src/VecSim/spaces/L2_space.cpp` | +#ifdef AArch64 block in `L2_SQ8_FP16_GetDistFunc` (1 dispatcher block) | -| `tests/unit/test_spaces.cpp` | retarget `GetDistFuncSQ8FP16Asymmetric` to dim=15; add dim=0 test; extend the three `SQ8_FP16_SpacesOptimizationTest` test bodies with ARM tier walks; extend `SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised` with AArch64 tier reporting | -| `tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp` | +AArch64 `cpu_features` block; +ARM ISA benchmark registrations | - -### Files NOT modified - -`src/VecSim/spaces/CMakeLists.txt` — zero CMake changes. Existing TU flags (`-march=armv8.2-a+fp16fml` for NEON_HP, `-march=armv8-a+sve` for SVE, `-march=armv9-a+sve2` for SVE2) already carry everything the new kernels need. - ---- - -## Task 1: Retarget the scalar-fallback dispatcher test - -**Why first:** Builds and runs on x86 today, has nothing to do with the ARM kernels, and tightens the contract the rest of the plan relies on (the dispatcher returns scalar for `dim < 16`). - -**Files:** -- Modify: `tests/unit/test_spaces.cpp` — locate test named `GetDistFuncSQ8FP16Asymmetric` (added by PR #970; currently asserts `dim=128` returns the scalar fallback) - -- [ ] **Step 1: Locate the existing test** - -Run: -```bash -grep -n 'GetDistFuncSQ8FP16Asymmetric' tests/unit/test_spaces.cpp -``` -Expected: one or more line hits pointing at the `TEST(...)` block. - -- [ ] **Step 2: Modify the test to cover dim=0 and dim=15 instead of dim=128** - -Replace the body of the existing `TEST(..., GetDistFuncSQ8FP16Asymmetric)` so it walks two below-threshold dims and asserts the scalar fallback for each of L2 / IP / Cosine. 
Drop in this exact body (rename the test fixture symbol to match what is already there if it differs): - -```cpp -TEST_F(SpacesTest, GetDistFuncSQ8FP16Asymmetric) { - // SQ8 storage with FP16 query (asymmetric) - should return SQ8_FP16 functions. - // Per-ISA dispatcher walk coverage lives in the SQ8_FP16 SpacesOptimizationTest below. - // - // Walk two below-threshold dims (0 and 15) so the assertions hold regardless of which - // SIMD tiers the host advertises: dim < 16 must always short-circuit to scalar fallback. - // The template-mapping form (spaces::GetDistFunc) and the direct - // *_SQ8_FP16_GetDistFunc form must agree for every dim, and both must match the scalar - // reference at sub-threshold dims. - for (size_t dim : {static_cast(0), static_cast(15)}) { - auto l2_func = spaces::GetDistFunc(VecSimMetric_L2, dim, nullptr); - auto ip_func = spaces::GetDistFunc(VecSimMetric_IP, dim, nullptr); - auto cosine_func = - spaces::GetDistFunc(VecSimMetric_Cosine, dim, nullptr); - - ASSERT_EQ(l2_func, L2_SQ8_FP16_GetDistFunc(dim, nullptr)) - << "Template mapping disagrees with direct dispatcher for L2 at dim=" << dim; - ASSERT_EQ(ip_func, IP_SQ8_FP16_GetDistFunc(dim, nullptr)) - << "Template mapping disagrees with direct dispatcher for IP at dim=" << dim; - ASSERT_EQ(cosine_func, Cosine_SQ8_FP16_GetDistFunc(dim, nullptr)) - << "Template mapping disagrees with direct dispatcher for Cosine at dim=" << dim; - - ASSERT_EQ(l2_func, SQ8_FP16_L2Sqr) - << "dim=" << dim << " must short-circuit to scalar L2 fallback"; - ASSERT_EQ(ip_func, SQ8_FP16_InnerProduct) - << "dim=" << dim << " must short-circuit to scalar IP fallback"; - ASSERT_EQ(cosine_func, SQ8_FP16_Cosine) - << "dim=" << dim << " must short-circuit to scalar Cosine fallback"; - } -} -``` - -- [ ] **Step 3: User builds** - -Ask the user to run `make build` (their normal x86 build is sufficient — this test is host-agnostic). 
- -- [ ] **Step 4: Run the test** - -Run: -```bash -./bin//unit_tests --gtest_filter='SpacesTest.GetDistFuncSQ8FP16Asymmetric' -``` -(Use `find bin -name unit_tests -type f` if the host-triple subdir is unknown.) -Expected: PASS. - -- [ ] **Step 5: Commit** - -```bash -git add tests/unit/test_spaces.cpp -git commit -m "Retarget SQ8↔FP16 scalar-fallback dispatcher test to dim=0/15 [MOD-14972]" -``` - ---- - -## Task 2: NEON IP kernel header - -**Files:** -- Create: `src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h` - -- [ ] **Step 1: Author the kernel file** - -Create exactly this file (modeled on `IP_NEON_SQ8_FP32.h` + the NEON FP16 widening pattern from `IP_NEON_FP16.h`): - -```cpp -/* - * Copyright (c) 2006-Present, Redis Ltd. - * All rights reserved. - * - * Licensed under your choice of the Redis Source Available License 2.0 - * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the - * GNU Affero General Public License v3 (AGPLv3). - */ -#pragma once -#include "VecSim/spaces/space_includes.h" -#include "VecSim/types/sq8.h" -#include "VecSim/types/float16.h" -#include -#include - -using sq8 = vecsim_types::sq8; -using float16 = vecsim_types::float16; - -/* - * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: - * - * IP(x, y) = sum(x_i * y_i) - * ~= sum((min + delta * q_i) * y_i) - * = min * y_sum + delta * sum(q_i * y_i) - * - * The hot loop only accumulates sum(q_i * y_i) - no per-element dequantisation. - * FP16 query lanes are widened to FP32 via vcvt_f32_f16 per 16-lane chunk. - */ - -// Helper: 16 lanes per call, four FP32 accumulators (one per quarter). 
-static inline void -SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, const float16 *&pVect2, - float32x4_t &sum0, float32x4_t &sum1, - float32x4_t &sum2, float32x4_t &sum3) { - // SQ8 storage: 16 * uint8 -> 4 * float32x4_t - uint8x16_t v1_u8 = vld1q_u8(pVect1); - uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); - uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); - float32x4_t v1_0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_lo))); - float32x4_t v1_1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_lo))); - float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi))); - float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi))); - - // FP16 query: 16 * f16 -> 4 * float32x4_t via vcvt_f32_f16 - const float16_t *q = reinterpret_cast(pVect2); - float16x8_t q_lo = vld1q_f16(q); - float16x8_t q_hi = vld1q_f16(q + 8); - float32x4_t v2_0 = vcvt_f32_f16(vget_low_f16(q_lo)); - float32x4_t v2_1 = vcvt_f32_f16(vget_high_f16(q_lo)); - float32x4_t v2_2 = vcvt_f32_f16(vget_low_f16(q_hi)); - float32x4_t v2_3 = vcvt_f32_f16(vget_high_f16(q_hi)); - - sum0 = vfmaq_f32(sum0, v1_0, v2_0); - sum1 = vfmaq_f32(sum1, v1_1, v2_1); - sum2 = vfmaq_f32(sum2, v1_2, v2_2); - sum3 = vfmaq_f32(sum3, v1_3, v2_3); - - pVect1 += 16; - pVect2 += 16; -} - -// pVect1v = SQ8 storage, pVect2v = FP16 query -template // 0..15 -float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, - size_t dimension) { - assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); - - const uint8_t *pVect1 = static_cast(pVect1v); // SQ8 storage - const float16 *pVect2 = static_cast(pVect2v); // FP16 query - - float32x4_t sum0 = vdupq_n_f32(0.0f); - float32x4_t sum1 = vdupq_n_f32(0.0f); - float32x4_t sum2 = vdupq_n_f32(0.0f); - float32x4_t sum3 = vdupq_n_f32(0.0f); - - const size_t num_of_chunks = dimension / 16; - for (size_t i = 0; i < num_of_chunks; i++) { - SQ8_FP16_InnerProductStep_NEON_HP(pVect1, pVect2, sum0, sum1, sum2, sum3); - } - - // Residual 
handling: dim % 16 lanes. - // residual >= 8: one safe 8-lane SQ8 + 8-lane FP16 load (FP16 trailer is wide enough). - // residual < 8: scalar-only - a 4-lane FP16 load would overread y_sum metadata. - constexpr unsigned char r = residual; - if constexpr (r >= 8) { - uint8x8_t v1_u8 = vld1_u8(pVect1); - uint16x8_t v1_u16 = vmovl_u8(v1_u8); - float32x4_t v1_a = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_u16))); - float32x4_t v1_b = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_u16))); - float16x8_t q_h = vld1q_f16(reinterpret_cast(pVect2)); - float32x4_t v2_a = vcvt_f32_f16(vget_low_f16(q_h)); - float32x4_t v2_b = vcvt_f32_f16(vget_high_f16(q_h)); - sum0 = vfmaq_f32(sum0, v1_a, v2_a); - sum1 = vfmaq_f32(sum1, v1_b, v2_b); - pVect1 += 8; - pVect2 += 8; - } - // Lane-by-lane scalar for the final 0..7 (residual % 8) elements. - constexpr unsigned char tail = r & 0x7; - float scalar_dot = 0.0f; - for (unsigned char k = 0; k < tail; ++k) { - scalar_dot += static_cast(pVect1[k]) * vecsim_types::FP16_to_FP32(pVect2[k]); - } - - // Reduce the four NEON accumulators. - float32x4_t sum_lo = vaddq_f32(sum0, sum1); - float32x4_t sum_hi = vaddq_f32(sum2, sum3); - float quantized_dot = vaddvq_f32(vaddq_f32(sum_lo, sum_hi)) + scalar_dot; - - // Metadata loads - use load_unaligned because odd dim leaves trailers unaligned. 
- const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float min_val = - load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); - const float delta = - load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); - const uint8_t *query_meta_bytes = - reinterpret_cast(static_cast(pVect2v) + dimension); - const float y_sum = - load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); - - return min_val * y_sum + delta * quantized_dot; -} - -template -float SQ8_FP16_InnerProductSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, - size_t dimension) { - return 1.0f - - SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); -} - -template -float SQ8_FP16_CosineSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { - // Cosine = 1 - IP (vectors are pre-normalised); reuses the IP wrapper. - return SQ8_FP16_InnerProductSIMD16_NEON_HP(pVect1v, pVect2v, dimension); -} -``` - -- [ ] **Step 2: Header-only smoke (no build yet)** - -Run: -```bash -grep -n 'load_unaligned\|FP16_to_FP32' src/VecSim/spaces/space_includes.h \ - src/VecSim/spaces/IP/IP.cpp src/VecSim/types/float16.h 2>/dev/null -``` -Expected: confirm the global `load_unaligned` is reachable through `space_includes.h` (matches the include path used by `IP_NEON_SQ8_FP32.h`) and `FP16_to_FP32` is reachable through `VecSim/types/float16.h`. If either include is missing, add it. - -- [ ] **Step 3: Commit** - -```bash -git add src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h -git commit -m "Add NEON_HP SQ8↔FP16 IP kernel header [MOD-14972]" -``` - ---- - -## Task 3: NEON L2 kernel header - -**Files:** -- Create: `src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h` - -- [ ] **Step 1: Author the kernel file** - -```cpp -/* - * Copyright (c) 2006-Present, Redis Ltd. - * All rights reserved. 
- * - * Licensed under your choice of the Redis Source Available License 2.0 - * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the - * GNU Affero General Public License v3 (AGPLv3). - */ -#pragma once -#include "VecSim/spaces/space_includes.h" -#include "VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" -#include "VecSim/types/sq8.h" -#include "VecSim/types/float16.h" - -using sq8 = vecsim_types::sq8; -using float16 = vecsim_types::float16; - -/* - * Optimised asymmetric SQ8<->FP16 L2 squared distance using the algebraic identity: - * - * ||x - y||^2 = sum(x_i^2) - 2 * IP(x, y) + sum(y_i^2) - * = x_sum_squares - 2 * IP(x, y) + y_sum_squares - * - * IP is computed by SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP; metadata is FP32. - */ - -template // 0..15 -float SQ8_FP16_L2SqrSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { - const float ip = - SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); - - const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); - - const uint8_t *query_meta_bytes = reinterpret_cast( - static_cast(pVect2v) + dimension); - const float y_sum_sq = - load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); - - return x_sum_sq + y_sum_sq - 2.0f * ip; -} -``` - -- [ ] **Step 2: Commit** - -```bash -git add src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h -git commit -m "Add NEON_HP SQ8↔FP16 L2 kernel header [MOD-14972]" -``` - ---- - -## Task 4: NEON_HP dispatcher TU additions - -**Files:** -- Modify: `src/VecSim/spaces/functions/NEON_HP.h` — add 3 declarations -- Modify: `src/VecSim/spaces/functions/NEON_HP.cpp` — add 3 chooser definitions - -- [ ] **Step 1: Add chooser declarations to NEON_HP.h** - -In `src/VecSim/spaces/functions/NEON_HP.h`, inside `namespace spaces { ... 
}`, append these three declarations alongside the existing `Choose_FP16_*_implementation_NEON_HP`: - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim); -``` - -- [ ] **Step 2: Add chooser definitions to NEON_HP.cpp** - -In `src/VecSim/spaces/functions/NEON_HP.cpp`, add the kernel `#include`s alongside the existing FP16 includes: - -```cpp -#include "VecSim/spaces/IP/IP_NEON_SQ8_FP16.h" -#include "VecSim/spaces/L2/L2_NEON_SQ8_FP16.h" -``` - -Then inside `namespace spaces { ... }` (between `#include "implementation_chooser.h"` and `#include "implementation_chooser_cleanup.h"`), append: - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_NEON_HP); - return ret_dist_func; -} - -dist_func_t Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_NEON_HP); - return ret_dist_func; -} - -dist_func_t Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_NEON_HP); - return ret_dist_func; -} -``` - -- [ ] **Step 3: Commit** - -```bash -git add src/VecSim/spaces/functions/NEON_HP.h src/VecSim/spaces/functions/NEON_HP.cpp -git commit -m "Wire NEON_HP SQ8↔FP16 choosers [MOD-14972]" -``` - ---- - -## Task 5: NEON_HP dispatcher wiring in IP_space.cpp + L2_space.cpp - -**Files:** -- Modify: `src/VecSim/spaces/IP_space.cpp` — `IP_SQ8_FP16_GetDistFunc` + `Cosine_SQ8_FP16_GetDistFunc` -- Modify: `src/VecSim/spaces/L2_space.cpp` — `L2_SQ8_FP16_GetDistFunc` - -Each of those three `_GetDistFunc` functions currently has an `#ifdef CPU_FEATURES_ARCH_X86_64` block with an early `if (dim < 16) return 
ret_dist_func;` guard followed by per-tier dispatch. We append an `#ifdef CPU_FEATURES_ARCH_AARCH64` block with the matching shape. Only NEON_HP is wired in this task; SVE/SVE2 land in a later task. - -- [ ] **Step 1: Confirm the #include for NEON_HP.h is present** - -Run: -```bash -grep -n 'functions/NEON_HP.h' src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp -``` -Expected: both files already `#include "VecSim/spaces/functions/NEON_HP.h"`. If a file is missing it, add the include. - -- [ ] **Step 2: Wire IP_SQ8_FP16_GetDistFunc** - -In `src/VecSim/spaces/IP_space.cpp`, locate `IP_SQ8_FP16_GetDistFunc`. After the closing `#endif // x86_64`, insert a parallel AArch64 block immediately before the trailing `return ret_dist_func;`: - -```cpp -#ifdef CPU_FEATURES_ARCH_AARCH64 - if (dim < 16) { - return ret_dist_func; - } -#ifdef OPT_NEON_HP - if (features.asimdhp) { - // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers - // leave *alignment untouched on ARM tiers. The corresponding tests assert - // 0xFF passthrough on the scalar path and do not assert any non-zero value here. - return Choose_SQ8_FP16_IP_implementation_NEON_HP(dim); - } -#endif -#endif // CPU_FEATURES_ARCH_AARCH64 -``` - -- [ ] **Step 3: Wire Cosine_SQ8_FP16_GetDistFunc** - -In the same file, locate `Cosine_SQ8_FP16_GetDistFunc`. Insert the same block, swapping `Choose_SQ8_FP16_IP_implementation_NEON_HP` for `Choose_SQ8_FP16_Cosine_implementation_NEON_HP`. - -- [ ] **Step 4: Wire L2_SQ8_FP16_GetDistFunc** - -In `src/VecSim/spaces/L2_space.cpp`, locate `L2_SQ8_FP16_GetDistFunc`. Insert the same block, swapping the call for `Choose_SQ8_FP16_L2_implementation_NEON_HP`. - -- [ ] **Step 5: User builds** - -Ask the user to run `make build` — first time the new NEON_HP TU additions compile. 
If they have ARM hardware or a cross-compile target, they should build for it; otherwise the x86 build must at least confirm that the new headers don't break non-ARM compilation (the new headers are only `#include`d from `NEON_HP.cpp`, which is excluded on non-ARM hosts, so x86 builds should be clean). - -- [ ] **Step 6: Commit** - -```bash -git add src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp -git commit -m "Dispatch SQ8↔FP16 to NEON_HP tier on AArch64 [MOD-14972]" -``` - ---- - -## Task 6: Extend `SQ8_FP16_SpacesOptimizationTest` with NEON_HP tier-walk - -**Files:** -- Modify: `tests/unit/test_spaces.cpp`, three test bodies (`SQ8_FP16_L2SqrTest`, `SQ8_FP16_InnerProductTest`, `SQ8_FP16_CosineTest`) - -In each test, append a NEON_HP block after the existing `#ifdef OPT_SSE4` block, as follows. - -- [ ] **Step 1: Add NEON_HP tier to L2 test** - -In `SQ8_FP16_L2SqrTest`, immediately after the closing `#endif` that follows the SSE4 block and before `// Scalar fallback`: - -```cpp -#ifdef OPT_NEON_HP - if (optimization.asimdhp) { - unsigned char alignment = 0; - arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_NEON_HP(dim)) - << "Unexpected distance function chosen for dim " << dim; - ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) - << "NEON_HP with dim " << dim; - optimization.asimdhp = 0; - } -#endif -``` - -- [ ] **Step 2: Add NEON_HP tier to IP test** - -In `SQ8_FP16_InnerProductTest`, append the same block but swap `L2_SQ8_FP16_GetDistFunc` → `IP_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_L2_implementation_NEON_HP` → `Choose_SQ8_FP16_IP_implementation_NEON_HP`. - -- [ ] **Step 3: Add NEON_HP tier to Cosine test** - -In `SQ8_FP16_CosineTest`, append the same block with `Cosine_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_Cosine_implementation_NEON_HP`. 
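The `ASSERT_NEAR(baseline, ...)` comparisons in these tier blocks rest on the two algebraic identities quoted in the kernel headers: the factored inner product `min * y_sum + delta * sum(q_i * y_i)` and the L2 form `x_sum_squares + y_sum_squares - 2 * IP`. The scalar sketch below checks both against a direct computation. It is a simplified illustration, not the library's code: `ToySq8Vector` and every `toy_*` function are hypothetical, the query is held as `float` rather than FP16, and the sums that the real kernels read from trailing metadata are recomputed inline.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: element i of the stored vector is min_val + delta * q[i].
struct ToySq8Vector {
    std::vector<std::uint8_t> q;
    float min_val;
    float delta;
};

// Fixed sample data so the functions below can be exercised without setup.
ToySq8Vector toy_storage() { return {{0, 17, 255, 128, 3}, -1.0f, 0.0078125f}; }
std::vector<float> toy_query() { return {0.5f, -0.25f, 1.0f, 2.0f, -1.5f}; }

// Reference inner product: dequantise each element, then accumulate.
float toy_ip_dequantised(const ToySq8Vector &x, const std::vector<float> &y) {
    assert(x.q.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        acc += (x.min_val + x.delta * static_cast<float>(x.q[i])) * y[i];
    }
    return acc;
}

// Factored inner product: min * y_sum + delta * sum(q_i * y_i).
float toy_ip_factored(const ToySq8Vector &x, const std::vector<float> &y) {
    assert(x.q.size() == y.size());
    float y_sum = 0.0f;
    float quantized_dot = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        y_sum += y[i];
        quantized_dot += static_cast<float>(x.q[i]) * y[i];
    }
    return x.min_val * y_sum + x.delta * quantized_dot;
}

// Reference squared L2 distance over the dequantised elements.
float toy_l2_direct(const ToySq8Vector &x, const std::vector<float> &y) {
    assert(x.q.size() == y.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const float diff = x.min_val + x.delta * static_cast<float>(x.q[i]) - y[i];
        acc += diff * diff;
    }
    return acc;
}

// Squared L2 via x_sum_squares + y_sum_squares - 2 * IP.
float toy_l2_factored(const ToySq8Vector &x, const std::vector<float> &y) {
    assert(x.q.size() == y.size());
    float x_sum_sq = 0.0f;
    float y_sum_sq = 0.0f;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const float xi = x.min_val + x.delta * static_cast<float>(x.q[i]);
        x_sum_sq += xi * xi;
        y_sum_sq += y[i] * y[i];
    }
    return x_sum_sq + y_sum_sq - 2.0f * toy_ip_factored(x, y);
}
```

The factored forms keep the per-element work down to one multiply-accumulate on the raw `uint8` code, which is what the SIMD hot loops exploit; the price is a different floating-point summation order, so results are compared with a tolerance rather than for exact equality.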
- -- [ ] **Step 4: Confirm the include path for the NEON_HP chooser declarations** - -Run: -```bash -grep -n 'functions/NEON_HP.h' tests/unit/test_spaces.cpp -``` -Expected: include present. If not, add `#include "VecSim/spaces/functions/NEON_HP.h"` near the other space-function includes at the top of the file. - -- [ ] **Step 5: User builds (ARM target)** - -Ask the user to run `make build` for an ARM target (hardware or cross-compile). On x86 the new test code is gated by `#ifdef OPT_NEON_HP` and stays inert. - -- [ ] **Step 6: Run NEON_HP tests** - -Once the ARM build is reported clean, run: -```bash -./bin//unit_tests --gtest_filter='SQ8_FP16_*Test*' -``` -Expected: all parametrized cases PASS, including the dims-16..32 and high-dim suites. - -- [ ] **Step 7: Commit** - -```bash -git add tests/unit/test_spaces.cpp -git commit -m "Extend SQ8↔FP16 tier-walk tests with NEON_HP [MOD-14972]" -``` - ---- - -## Task 7: SVE IP kernel header - -**Files:** -- Create: `src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h` - -- [ ] **Step 1: Author the kernel file** - -Modeled on `IP_SVE_SQ8_FP32.h`. The shape: an `InnerProductStep` helper that consumes `chunk = svcntw()` FP32 lanes per call (SQ8 storage loaded under a `b32` predicate that drives uint8→uint32 widening, FP16 query widened to FP32 in the same lanes), then a templated `_IMP` over `<partial_chunk, additional_steps>`. - -```cpp -/* - * Copyright (c) 2006-Present, Redis Ltd. - * All rights reserved. - * - * Licensed under your choice of the Redis Source Available License 2.0 - * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the - * GNU Affero General Public License v3 (AGPLv3). 
- */ -#pragma once -#include "VecSim/spaces/space_includes.h" -#include "VecSim/types/sq8.h" -#include "VecSim/types/float16.h" -#include <arm_sve.h> -#include <cassert> - -using sq8 = vecsim_types::sq8; -using float16 = vecsim_types::float16; - -/* - * Optimised asymmetric SQ8<->FP16 inner product using the algebraic identity: - * - * IP(x, y) ~= min * y_sum + delta * sum(q_i * y_i) - * - * Hot loop accumulates sum(q_i * y_i) only; FP16 query lanes are widened to FP32 - * inside each step via svcvt_f32_f16_x. Metadata loads use load_unaligned. - */ - -// Helper: one SVE-vector-width-of-FP32 step. -// chunk = svcntw() - number of FP32 lanes per step. -// pg = svptrue_b32() - predicate for FP32 lanes. -static inline void -SQ8_FP16_InnerProductStep_SVE(const uint8_t *pVect1, const float16 *pVect2, size_t &offset, - svfloat32_t &sum, svbool_t pg, size_t chunk) { - // SQ8 -> uint32 (widen on load), then to FP32. - svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); - svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); - - // FP16 query -> FP32. svcvt_f32_f16 reads the FP16 value in the low 16 bits of each - // 32-bit lane, so load one FP16 per 32-bit lane with svld1uh_u32 rather than a - // contiguous svld1_f16. - svuint32_t q_bits = - svld1uh_u32(pg, reinterpret_cast<const uint16_t *>(pVect2) + offset); - svfloat32_t v2_f = svcvt_f32_f16_x(pg, svreinterpret_f16_u32(q_bits)); - - sum = svmla_f32_x(pg, sum, v1_f, v2_f); - offset += chunk; -} - -// pVect1v = SQ8 storage, pVect2v = FP16 query -template <bool partial_chunk, unsigned char additional_steps> -float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v, - size_t dimension) { - assert(dimension >= 16 && "kernel precondition: dispatcher must guard dim >= 16"); - - const uint8_t *pVect1 = static_cast<const uint8_t *>(pVect1v); - const float16 *pVect2 = static_cast<const float16 *>(pVect2v); - size_t offset = 0; - svbool_t pg = svptrue_b32(); - const size_t chunk = svcntw(); - - svfloat32_t sum0 = svdup_f32(0.0f); - svfloat32_t sum1 = svdup_f32(0.0f); - svfloat32_t sum2 = svdup_f32(0.0f); - svfloat32_t sum3 = svdup_f32(0.0f); - - // Partial chunk for dim % chunk lanes. 
Use _z form so inactive lanes are zero - - // the final reduction below walks all lanes via svptrue_b32(). - if constexpr (partial_chunk) { - size_t remaining = dimension % chunk; - if (remaining > 0) { - svbool_t pg_partial = - svwhilelt_b32(uint32_t(0), uint32_t(remaining)); - svuint32_t v1_u32 = svld1ub_u32(pg_partial, pVect1 + offset); - svfloat32_t v1_f = svcvt_f32_u32_z(pg_partial, v1_u32); - // One FP16 value per 32-bit lane, as svcvt_f32_f16 expects. - svuint32_t q_bits = svld1uh_u32( - pg_partial, reinterpret_cast<const uint16_t *>(pVect2) + offset); - svfloat32_t v2_f = - svcvt_f32_f16_z(pg_partial, svreinterpret_f16_u32(q_bits)); - sum0 = svmla_f32_z(pg_partial, sum0, v1_f, v2_f); - offset += remaining; - } - } - - // Main loop: 4 chunks per iteration via 4 accumulators. - const size_t chunk_size = 4 * chunk; - const size_t number_of_chunks = - (dimension - (partial_chunk ? dimension % chunk : 0)) / chunk_size; - for (size_t i = 0; i < number_of_chunks; i++) { - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum1, pg, chunk); - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum3, pg, chunk); - } - - // Additional steps 0..3. - if constexpr (additional_steps > 0) - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum0, pg, chunk); - if constexpr (additional_steps > 1) - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum1, pg, chunk); - if constexpr (additional_steps > 2) - SQ8_FP16_InnerProductStep_SVE(pVect1, pVect2, offset, sum2, pg, chunk); - - svfloat32_t sum = svadd_f32_x(pg, sum0, sum1); - sum = svadd_f32_x(pg, sum, sum2); - sum = svadd_f32_x(pg, sum, sum3); - float quantized_dot = svaddv_f32(pg, sum); - - // Metadata loads - unaligned because odd dim leaves trailers unaligned. 
- const uint8_t *params_bytes = static_cast<const uint8_t *>(pVect1v) + dimension; - const float min_val = - load_unaligned<float>(params_bytes + sq8::MIN_VAL * sizeof(float)); - const float delta = - load_unaligned<float>(params_bytes + sq8::DELTA * sizeof(float)); - const uint8_t *query_meta_bytes = reinterpret_cast<const uint8_t *>( - static_cast<const float16 *>(pVect2v) + dimension); - const float y_sum = - load_unaligned<float>(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); - - return min_val * y_sum + delta * quantized_dot; -} - -template <bool partial_chunk, unsigned char additional_steps> -float SQ8_FP16_InnerProductSIMD_SVE(const void *pVect1v, const void *pVect2v, - size_t dimension) { - return 1.0f - SQ8_FP16_InnerProductSIMD_SVE_IMP<partial_chunk, additional_steps>( - pVect1v, pVect2v, dimension); -} - -template <bool partial_chunk, unsigned char additional_steps> -float SQ8_FP16_CosineSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { - return SQ8_FP16_InnerProductSIMD_SVE<partial_chunk, additional_steps>( - pVect1v, pVect2v, dimension); -} -``` - -**Note for the implementer:** `svcvt_f32_f16_x(pg, q_h)` converts the FP16 value held in the low 16 bits of each 32-bit lane, i.e. the even-numbered FP16 elements of `q_h`, not its first `svcntw()` contiguous elements. The query data must therefore reach the conversion with one FP16 value per 32-bit lane, for example through a widening `svld1uh_u32` load reinterpreted with `svreinterpret_f16_u32`, or through `svzip1_f16(q_h, q_h)` after a contiguous load. Verify the FP16->FP32 widening sequence against the actual ARM build output and the unit tests, and adjust as needed. Commit only after the build is clean. Do not blindly copy `IP_SVE_FP16.h`'s pattern - that file accumulates in FP16 and is not a direct widening reference. - -- [ ] **Step 2: Commit** - -```bash -git add src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h -git commit -m "Add SVE SQ8↔FP16 IP kernel header [MOD-14972]" -``` - ---- - -## Task 8: SVE L2 kernel header - -**Files:** -- Create: `src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h` - -- [ ] **Step 1: Author the kernel file** - -```cpp -/* - * Copyright (c) 2006-Present, Redis Ltd. - * All rights reserved. 
- * - * Licensed under your choice of the Redis Source Available License 2.0 - * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the - * GNU Affero General Public License v3 (AGPLv3). - */ -#pragma once -#include "VecSim/spaces/space_includes.h" -#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" -#include "VecSim/types/sq8.h" -#include "VecSim/types/float16.h" - -using sq8 = vecsim_types::sq8; -using float16 = vecsim_types::float16; - -/* - * SVE SQ8<->FP16 L2 squared distance: - * ||x - y||^2 = x_sum_squares - 2 * IP(x, y) + y_sum_squares - * IP is computed by SQ8_FP16_InnerProductSIMD_SVE_IMP; metadata is FP32. - */ - -template -float SQ8_FP16_L2SqrSIMD_SVE(const void *pVect1v, const void *pVect2v, size_t dimension) { - const float ip = SQ8_FP16_InnerProductSIMD_SVE_IMP( - pVect1v, pVect2v, dimension); - - const uint8_t *params_bytes = static_cast(pVect1v) + dimension; - const float x_sum_sq = - load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); - const uint8_t *query_meta_bytes = reinterpret_cast( - static_cast(pVect2v) + dimension); - const float y_sum_sq = - load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); - - return x_sum_sq + y_sum_sq - 2.0f * ip; -} -``` - -- [ ] **Step 2: Commit** - -```bash -git add src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h -git commit -m "Add SVE SQ8↔FP16 L2 kernel header [MOD-14972]" -``` - ---- - -## Task 9: SVE + SVE2 dispatcher TU additions - -**Files:** -- Modify: `src/VecSim/spaces/functions/SVE.h` — +3 declarations -- Modify: `src/VecSim/spaces/functions/SVE.cpp` — +#includes; +3 chooser definitions -- Modify: `src/VecSim/spaces/functions/SVE2.h` — +3 declarations -- Modify: `src/VecSim/spaces/functions/SVE2.cpp` — +#includes; +3 chooser definitions (own symbols, template instantiated under SVE2 flags) - -- [ ] **Step 1: Declarations in SVE.h** - -Inside `namespace spaces { ... 
}`, alongside the existing `Choose_SQ8_FP32_*_SVE` declarations: - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_SVE(size_t dim); -``` - -- [ ] **Step 2: Definitions in SVE.cpp** - -Add includes alongside the existing SQ8_FP32 includes: - -```cpp -#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" -#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" -``` - -Inside `namespace spaces { ... }` (between `implementation_chooser.h` and `implementation_chooser_cleanup.h`), append: - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); - return ret_dist_func; -} - -dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); - return ret_dist_func; -} - -dist_func_t Choose_SQ8_FP16_L2_implementation_SVE(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); - return ret_dist_func; -} -``` - -- [ ] **Step 3: Declarations in SVE2.h** - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim); -dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim); -dist_func_t Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim); -``` - -- [ ] **Step 4: Definitions in SVE2.cpp** - -Add includes alongside the existing SQ8_FP32 includes — note the SVE header is included from SVE2 (SVE2 instantiates the template under SVE2 compile flags): - -```cpp -#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE -``` - -Inside `namespace spaces { ... 
}`, append: - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); - return ret_dist_func; -} - -dist_func_t Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); - return ret_dist_func; -} - -dist_func_t Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim) { - dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); - return ret_dist_func; -} -``` - -- [ ] **Step 5: Commit** - -```bash -git add src/VecSim/spaces/functions/SVE.h src/VecSim/spaces/functions/SVE.cpp \ - src/VecSim/spaces/functions/SVE2.h src/VecSim/spaces/functions/SVE2.cpp -git commit -m "Wire SVE/SVE2 SQ8↔FP16 choosers [MOD-14972]" -``` - ---- - -## Task 10: SVE + SVE2 dispatcher wiring in IP_space.cpp + L2_space.cpp - -The NEON_HP block added in Task 5 lives inside `#ifdef CPU_FEATURES_ARCH_AARCH64`. Extend the same block in all three `_GetDistFunc` functions with SVE2 and SVE tiers — ordered SVE2 → SVE → NEON_HP, matching every other SQ8/FP32 dispatcher in the file. - -**Files:** -- Modify: `src/VecSim/spaces/IP_space.cpp` (two functions) -- Modify: `src/VecSim/spaces/L2_space.cpp` (one function) - -- [ ] **Step 1: Confirm the SVE/SVE2 dispatcher includes are present** - -Run: -```bash -grep -n 'functions/SVE\.h\|functions/SVE2\.h' src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp -``` -Expected: both files already include both headers. If not, add them. 
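The SVE choosers that these dispatchers call pick a `<partial_chunk, additional_steps>` instantiation from `dim` and the runtime vector width. The sketch below shows how a dimension could decompose under the kernel's loop structure (a partial chunk first, then a 4-way unrolled main loop, then up to three extra full steps). It is an assumption-laden illustration, not the expansion of `CHOOSE_SVE_IMPLEMENTATION`: `ToySvePlan`, `toy_sve_plan`, and `toy_lanes_covered` are hypothetical, `chunk` is passed in as a plain number instead of being read from `svcntw()`, and it assumes `additional_steps` equals the full chunks left over after the unrolled loop.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of how one call would walk `dim` elements.
struct ToySvePlan {
    std::size_t partial_lanes;    // dim % chunk, handled first under a partial predicate
    std::size_t main_iterations;  // unrolled loop iterations, 4 full chunks each
    std::size_t additional_steps; // leftover full chunks after the loop, 0..3
};

// chunk is the number of FP32 lanes per SVE vector (4, 8, 16, ... depending on hardware).
ToySvePlan toy_sve_plan(std::size_t dim, std::size_t chunk) {
    assert(chunk > 0);
    ToySvePlan plan;
    plan.partial_lanes = dim % chunk;
    const std::size_t full_chunks = dim / chunk;
    plan.main_iterations = full_chunks / 4;
    plan.additional_steps = full_chunks % 4;
    return plan;
}

// Total elements consumed by the plan; should always equal dim.
std::size_t toy_lanes_covered(const ToySvePlan &plan, std::size_t chunk) {
    return plan.partial_lanes + (4 * plan.main_iterations + plan.additional_steps) * chunk;
}
```

The same `dim` yields different plans on different hardware, which is one reason the SVE tiers cannot offer a fixed alignment hint and why the tests sweep a range of dims instead of a single value.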
- -- [ ] **Step 2: Extend IP_SQ8_FP16_GetDistFunc** - -Inside the AArch64 block of `IP_SQ8_FP16_GetDistFunc`, after the `if (dim < 16) return ret_dist_func;` guard and **before** the existing `#ifdef OPT_NEON_HP`, prepend: - -```cpp -#ifdef OPT_SVE2 - if (features.sve2) { - return Choose_SQ8_FP16_IP_implementation_SVE2(dim); - } -#endif -#ifdef OPT_SVE - if (features.sve) { - return Choose_SQ8_FP16_IP_implementation_SVE(dim); - } -#endif -``` - -(SVE/SVE2 paths don't compute alignment hints — the SVE vector width is runtime-variable, so the SQ8_FP32 sister doesn't set `*alignment` here either. Mirror that.) - -- [ ] **Step 3: Extend Cosine_SQ8_FP16_GetDistFunc** - -Same as Step 2, with `Cosine` in the chooser names. - -- [ ] **Step 4: Extend L2_SQ8_FP16_GetDistFunc** - -Same as Step 2, with `L2` in the chooser names. - -- [ ] **Step 5: User builds (ARM target)** - -Ask user to run `make build` for an ARM target. - -- [ ] **Step 6: Commit** - -```bash -git add src/VecSim/spaces/IP_space.cpp src/VecSim/spaces/L2_space.cpp -git commit -m "Dispatch SQ8↔FP16 to SVE/SVE2 tiers on AArch64 [MOD-14972]" -``` - ---- - -## Task 11: Extend `SQ8_FP16_SpacesOptimizationTest` with SVE2 + SVE tier-walks - -**Files:** -- Modify: `tests/unit/test_spaces.cpp` — the same three test bodies extended in Task 6 - -For each test (L2, IP, Cosine), inside the existing `#ifdef CPU_FEATURES_ARCH_AARCH64` region (which currently holds only NEON_HP from Task 6), **prepend** SVE2 and SVE blocks so the dispatch-precedence order is SVE2 → SVE → NEON_HP. If the existing NEON_HP block is not yet inside an AArch64 outer ifdef, wrap all three together. 
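The precedence order and the "clear the flag, then re-dispatch" pattern used by the tier-walk tests can be sketched without any ARM intrinsics. Everything here is hypothetical (`ToyArmFeatures`, `toy_pick_tier`, `toy_tier_walk`); the real tests compare function pointers returned by the `*_GetDistFunc` dispatchers, while this toy returns tier names so the visiting order is easy to read.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the cpu_features flags the dispatcher consults.
struct ToyArmFeatures {
    bool sve2;
    bool sve;
    bool asimdhp;
};

// Toy dispatcher: the widest advertised tier wins (SVE2, then SVE, then NEON_HP).
std::string toy_pick_tier(const ToyArmFeatures &f) {
    if (f.sve2) return "SVE2";
    if (f.sve) return "SVE";
    if (f.asimdhp) return "NEON_HP";
    return "scalar";
}

// Tier walk: dispatch, record the tier, clear its flag, repeat, finishing on scalar.
std::string toy_tier_walk(ToyArmFeatures f) {
    std::string seen;
    if (f.sve2) {
        seen += toy_pick_tier(f) + ">";
        f.sve2 = false;
    }
    if (f.sve) {
        seen += toy_pick_tier(f) + ">";
        f.sve = false;
    }
    if (f.asimdhp) {
        seen += toy_pick_tier(f) + ">";
        f.asimdhp = false;
    }
    return seen + toy_pick_tier(f);
}
```

The walk only works if the test blocks appear in the same order as the dispatcher's checks: if the NEON_HP block ran while `sve` was still set, the dispatcher would hand back the SVE kernel and the pointer comparison would fail.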
- -- [ ] **Step 1: Wrap and extend the L2 test** - -Replace the NEON_HP-only AArch64 block in `SQ8_FP16_L2SqrTest` with: - -```cpp -#ifdef CPU_FEATURES_ARCH_AARCH64 -#ifdef OPT_SVE2 - if (optimization.sve2) { - unsigned char alignment = 0; - arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SVE2(dim)) - << "Unexpected distance function chosen for dim " << dim; - ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) - << "SVE2 with dim " << dim; - optimization.sve2 = 0; - } -#endif -#ifdef OPT_SVE - if (optimization.sve) { - unsigned char alignment = 0; - arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_SVE(dim)) - << "Unexpected distance function chosen for dim " << dim; - ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) - << "SVE with dim " << dim; - optimization.sve = 0; - } -#endif -#ifdef OPT_NEON_HP - if (optimization.asimdhp) { - unsigned char alignment = 0; - arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_NEON_HP(dim)) - << "Unexpected distance function chosen for dim " << dim; - ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) - << "NEON_HP with dim " << dim; - optimization.asimdhp = 0; - } -#endif -#endif // CPU_FEATURES_ARCH_AARCH64 -``` - -- [ ] **Step 2: Same for IP test** - -Replicate the block in `SQ8_FP16_InnerProductTest` with `IP_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_IP_implementation_`. - -- [ ] **Step 3: Same for Cosine test** - -Replicate with `Cosine_SQ8_FP16_GetDistFunc` and `Choose_SQ8_FP16_Cosine_implementation_`. - -- [ ] **Step 4: User builds** - -ARM target build. 
- -- [ ] **Step 5: Run the optimization tests** - -```bash -./bin//unit_tests --gtest_filter='SQ8_FP16_SpacesOptimizationTest.*' -``` -Expected: all parametrized cases PASS — dims 16..32 + high-dim suite (64..1024) — exercising whichever ARM tiers the host advertises. - -- [ ] **Step 6: Commit** - -```bash -git add tests/unit/test_spaces.cpp -git commit -m "Extend SQ8↔FP16 tier-walk tests with SVE/SVE2 [MOD-14972]" -``` - ---- - -## Task 12: Extend `SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised` with ARM rows - -**Files:** -- Modify: `tests/unit/test_spaces.cpp` — `TEST(SQ8_FP16_SIMD_TierCoverage, ReportTiersExercised)` - -The existing test body has an outer `#ifdef CPU_FEATURES_ARCH_X86_64` block that loops over each x86 tier and logs presence to stderr. Add a sibling `#ifdef CPU_FEATURES_ARCH_AARCH64` block with the same shape. - -- [ ] **Step 1: Append the AArch64 reporting block** - -Locate the trailing `#endif // CPU_FEATURES_ARCH_X86_64` and immediately after, insert: - -```cpp -#ifdef CPU_FEATURES_ARCH_AARCH64 -#ifdef OPT_SVE2 - if (opt.sve2) { - std::cerr << "[SQ8_FP16] SVE2 tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] SVE2 tier NOT exercised on this host\n"; - } -#endif -#ifdef OPT_SVE - if (opt.sve) { - std::cerr << "[SQ8_FP16] SVE tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] SVE tier NOT exercised on this host\n"; - } -#endif -#ifdef OPT_NEON_HP - if (opt.asimdhp) { - std::cerr << "[SQ8_FP16] NEON_HP tier exercised\n"; - any_simd = true; - } else { - std::cerr << "[SQ8_FP16] NEON_HP tier NOT exercised on this host\n"; - } -#endif -#endif // CPU_FEATURES_ARCH_AARCH64 -``` - -(The trailing `if (!any_simd) { GTEST_SKIP() << ...; }` already at the bottom of the existing test handles the all-quiet case across both archs.) 
- -- [ ] **Step 2: Build + run on an ARM host** - -Ask the user to build for ARM, then run: -```bash -./bin//unit_tests --gtest_filter='SQ8_FP16_SIMD_TierCoverage.*' -``` -Expected: stderr shows at least one ARM tier marked "exercised", test PASS. - -- [ ] **Step 3: Commit** - -```bash -git add tests/unit/test_spaces.cpp -git commit -m "Report ARM tiers in SQ8↔FP16 tier-coverage test [MOD-14972]" -``` - ---- - -## Task 13: Microbench AArch64 block - -**Files:** -- Modify: `tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp` - -The existing file already opens `#ifdef CPU_FEATURES_ARCH_X86_64` and pulls `cpu_features::X86Features opt = cpu_features::GetX86Info().features;`. Add the parallel AArch64 block at the end of that `#endif // CPU_FEATURES_ARCH_X86_64`. - -- [ ] **Step 1: Append the AArch64 bench block** - -After the closing `#endif // CPU_FEATURES_ARCH_X86_64` (or after the last x86 `INITIALIZE_BENCHMARKS_SET_*` macro if no such comment exists), insert: - -```cpp -#ifdef CPU_FEATURES_ARCH_AARCH64 -cpu_features::Aarch64Features arm_opt = cpu_features::GetAarch64Info().features; - -#ifdef OPT_SVE2 -bool sve2_supported = arm_opt.sve2; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); -#endif - -#ifdef OPT_SVE -bool sve_supported = arm_opt.sve; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); -#endif - -#ifdef OPT_NEON_HP -bool neon_hp_supported = arm_opt.asimdhp; -INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); -INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, - neon_hp_supported); -#endif -#endif // CPU_FEATURES_ARCH_AARCH64 -``` - -Verify the exact `cpu_features` helper name during build. 
If the toolchain uses `Aarch64Info` vs `Aarch64Features` vs `ArmFeatures`, adjust to match the sister x86 block. - -- [ ] **Step 2: Update the file-header comment** - -The current file-header comment (around the top) ends with `ARM kernels land via MOD-14972.` — change that line to `ARM kernels (NEON_HP / SVE / SVE2) are registered below.` so the doc stays accurate. - -- [ ] **Step 3: User builds (ARM target)** - -- [ ] **Step 4: Run the bench on ARM** - -```bash -./bin//bm_spaces_sq8_fp16 --benchmark_filter='SQ8_FP16_.*(SVE2|SVE|NEON_HP)' -``` -Expected: per-ISA throughput rows for L2, IP, Cosine. If no rows match, list all benchmarks first with `--benchmark_list_tests` to see the exact generated names, then adjust the regex. - -- [ ] **Step 5: Side-by-side compare against SQ8_FP32** - -```bash -./bin//bm_spaces_sq8_fp32 --benchmark_filter='SQ8_FP32_.*(SVE2|SVE|NEON)' -``` -Compare matched-ISA rows manually. Acceptance per Jira: per-ISA throughput data captured. - -- [ ] **Step 6: Commit** - -```bash -git add tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp -git commit -m "Register ARM SQ8↔FP16 microbenchmarks [MOD-14972]" -``` - ---- - -## Task 14: ASan + final pre-PR verification - -- [ ] **Step 1: Full unit-test pass on ARM host (no filter)** - -```bash -./bin//unit_tests -``` -Expected: all tests PASS. - -- [ ] **Step 2: ASan build + run** - -Ask user to run `make build SAN=address` (or the repo's equivalent — verify against `Makefile`). After confirmed: - -```bash -./bin/-asan/unit_tests --gtest_filter='SQ8_FP16_*' -``` -Expected: zero ASan reports; all SQ8_FP16 tests PASS. - -- [ ] **Step 3: x86 sanity build** - -User runs `make build` on x86 (no ARM target). Confirms the new test extensions and dispatcher AArch64 ifdefs stay inert on x86 and the build is clean. - -- [ ] **Step 4: Push branch (ASK USER FIRST)** - -Pushes are user-gated. 
Confirm with the user before running: - -```bash -git push -u origin dor-forer-sq8-fp16-arm-kernels-mod-14972 -``` - -- [ ] **Step 5: Open PR against PR #970 (ASK USER FIRST)** - -PR creation is user-gated. Confirm with the user before running: - -```bash -gh pr create \ - --base dor-forer-sq8-fp16-x86-kernels-mod-14954 \ - --title 'Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]' \ - --body "$(cat <<'EOF' -## Summary - -- Add asymmetric SQ8↔FP16 distance kernels (IP, L2, Cosine) for ARM NEON_HP, SVE, SVE2 tiers -- Wire kernels into the existing dispatcher (`IP_space.cpp`, `L2_space.cpp`) -- Extend `SQ8_FP16_SpacesOptimizationTest` and `SQ8_FP16_SIMD_TierCoverage` with ARM tiers -- Register per-ISA microbenchmarks for cross-arch throughput comparison - -Stacked on PR #970 (MOD-14954 x86 kernels); retarget to `main` once #970 merges. - -Spec: `docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md` - -## Test plan - -- [ ] Unit tests on ARM host pass — `SQ8_FP16_SpacesOptimizationTest` (dims 16..32 + 64..1024), `SQ8_FP16_SIMD_TierCoverage`, `GetDistFuncSQ8FP16Asymmetric` -- [ ] ASan build on ARM host clean across SQ8_FP16 tests -- [ ] x86 build remains clean (new AArch64 dispatcher block + tests stay inert) -- [ ] Microbench output captured for SVE2 / SVE / NEON_HP, compared against matched SQ8_FP32 ARM rows -EOF -)" -``` - -- [ ] **Step 6: Retarget once #970 merges (ASK USER FIRST)** - -When PR #970 lands on `main`, change this PR's base to `main`: - -```bash -gh pr edit --base main -``` - ---- - -## Self-review checklist - -- [x] **Spec coverage:** every requirement in `2026-05-28-arm-sq8-fp16-design.md` is covered: - - Kernel headers (4 new): Tasks 2, 3, 7, 8 - - Wrapper symbols: Tasks 4 (NEON_HP), 9 (SVE/SVE2) - - Dispatcher wiring: Tasks 5 (NEON_HP), 10 (SVE/SVE2) - - Tier-walk tests: Tasks 6 (NEON_HP), 11 (SVE/SVE2) - - TierCoverage report: Task 12 - - Scalar-fallback edge tests (dim=0, dim=15): Task 1 - - Microbench: Task 13 - - ASan + verification: Task 
14 -- [x] **No CMake changes** — confirmed in file structure table. -- [x] **Zero placeholders** — every code block is concrete; ambiguous spots (SVE FP16 widening ACLE) are called out with the fallback strategy spelled in-task. -- [x] **Type/symbol consistency:** - - NEON kernel template names: `SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP` / `…NEON_HP` / `SQ8_FP16_L2SqrSIMD16_NEON_HP` / `SQ8_FP16_CosineSIMD16_NEON_HP` — match across kernel header, NEON_HP chooser, dispatcher call, and test. - - SVE kernel template names: `SQ8_FP16_InnerProductSIMD_SVE_IMP` / `…SVE` / `SQ8_FP16_L2SqrSIMD_SVE` / `SQ8_FP16_CosineSIMD_SVE` — match across kernel header, SVE chooser, SVE2 chooser, dispatcher call, and test. - - Chooser symbol names: `Choose_SQ8_FP16_{IP,L2,Cosine}_implementation_{NEON_HP,SVE,SVE2}` — match across `.h` declarations, `.cpp` definitions, dispatcher calls, tests, and bench. - - Test fixture: `SQ8_FP16_SpacesOptimizationTest` already exists on base (PR #970); we extend the three test methods inside it, no rename. - ---- - -## Execution Handoff - -Plan complete and saved to `docs/superpowers/plans/2026-05-28-arm-sq8-fp16-kernels.md`. Two execution options: - -**1. Subagent-Driven (recommended)** — I dispatch a fresh subagent per task, review between tasks, fast iteration. - -**2. Inline Execution** — Execute tasks in this session using executing-plans, batch execution with checkpoints. - -Which approach? 
diff --git a/docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md b/docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md deleted file mode 100644 index f4188d38b..000000000 --- a/docs/superpowers/specs/2026-05-28-arm-sq8-fp16-design.md +++ /dev/null @@ -1,354 +0,0 @@ -# SQ8↔FP16 ARM SIMD Distance Kernels — Design Spec - -- **Ticket**: [MOD-14972](https://redislabs.atlassian.net/browse/MOD-14972) -- **Branch**: `dor-forer-sq8-fp16-arm-kernels-mod-14972` -- **Base**: `dor-forer-sq8-fp16-x86-kernels-mod-14954` (PR #970) — stacked -- **Sibling**: MOD-14954 / PR #970 delivers x86 SIMD kernels (AVX-512, AVX2, SSE4) for the same operation - -## Goal - -Add SQ8↔FP16 SIMD distance kernels for IP and L2 on the ARM ISA tiers (NEON_HP, SVE, SVE2). FP16 is the query data type; SQ8 is the stored vector representation. Match the contract and structure of the x86 kernels delivered in PR #970 so dispatch tables, metadata layout, and acceptance criteria stay symmetric across architectures. - -The scalar fallback (`SQ8_FP16_InnerProduct`, `SQ8_FP16_L2Sqr`, `SQ8_FP16_Cosine` in `src/VecSim/spaces/IP/IP.cpp` and `src/VecSim/spaces/L2/L2.cpp`) already exists on `main`. This spec does not modify it; it serves as the reference implementation for all platforms. - -## Algebraic identity (shared with x86 PR + SQ8_FP32 sister) - -``` -IP(x, y) ≈ min · y_sum + delta · Σ(q_i · y_i) -L2(x, y) = x_sum_sq + y_sum_sq - 2 · IP(x, y) -``` - -Hot loop accumulates `Σ(q_i · y_i)` only. No per-element dequantization. FP16 query lanes are widened to FP32 per SIMD chunk; everything in the hot loop is FP32. - -## Metadata layout - -``` -SQ8 storage (pVect1): [uint8 × dim] [min_val] [delta] [x_sum] [x_sum_squares] -FP16 query (pVect2): [float16 × dim] [y_sum] [y_sum_squares] -``` - -Both metadata trailers are FP32 scalars. Storage metadata is not 4-byte aligned whenever `dim % 4 != 0`; query metadata is not 4-byte aligned whenever `dim` is odd. 
The blanket rule: every FP32 metadata read uses the global `load_unaligned` helper, matching scalar `_Impl` in `IP.cpp` / `L2.cpp`. `sq8` namespace constants: `MIN_VAL`, `DELTA`, `SUM_QUERY`, `SUM_SQUARES`, `SUM_SQUARES_QUERY`. - -## File layout - -``` -src/VecSim/spaces/IP/ - IP_NEON_SQ8_FP16.h (new) - IP_SVE_SQ8_FP16.h (new) — also #included from SVE2.cpp -src/VecSim/spaces/L2/ - L2_NEON_SQ8_FP16.h (new) - L2_SVE_SQ8_FP16.h (new) — also #included from SVE2.cpp -src/VecSim/spaces/functions/ - NEON_HP.cpp (+ Choose_SQ8_FP16_{IP,L2,Cosine}_implementation_NEON_HP) - NEON_HP.h (+ 3 declarations) - SVE.cpp (+ Choose_SQ8_FP16_*_implementation_SVE) - SVE.h (+ 3 declarations) - SVE2.cpp (+ Choose_SQ8_FP16_*_implementation_SVE2; owns its own chooser symbols; instantiates SVE kernel templates under SVE2 compile flags) - SVE2.h (+ 3 declarations) -src/VecSim/spaces/ - IP_space.cpp (2 dispatcher block edits: IP, Cosine) - L2_space.cpp (1 dispatcher block edit) -``` - -**Zero CMake changes.** Existing TU flags carry exactly what we need: - -| TU | Flags | -|----|-------| -| `NEON_HP.cpp` | `-march=armv8.2-a+fp16fml` (covers fp16 cvt + fma) | -| `SVE.cpp` | `-march=armv8-a+sve` (SVE includes f16↔f32 cvt) | -| `SVE2.cpp` | `-march=armv9-a+sve2` | - -## Dispatcher tier order - -Same precedence as existing SQ8_FP32 ARM dispatch: - -```cpp -#ifdef OPT_SVE2 - if (features.sve2 && dim >= 16) { - return Choose_SQ8_FP16_IP_implementation_SVE2(dim); - } -#endif -#ifdef OPT_SVE - if (features.sve && dim >= 16) { - return Choose_SQ8_FP16_IP_implementation_SVE(dim); - } -#endif -#ifdef OPT_NEON_HP - if (features.asimdhp && dim >= 16) { - return Choose_SQ8_FP16_IP_implementation_NEON_HP(dim); - } -#endif -// dim < 16 or no ARM SIMD → scalar fallback (existing return at function tail) -``` - -The `dim >= 16` guard in the dispatcher is what lets each SIMD kernel hold an internal `assert(dim >= 16)` as a real precondition. Edge cases for `dim < 16` are routed to scalar. 
- -## NEON kernel design - -### Header: `IP_NEON_SQ8_FP16.h` - -Template signature mirrors SQ8_FP32 NEON sister: - -```cpp -template // 0..15 -float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, size_t dimension); -``` - -Hot loop — 16 lanes per iteration, 4 FP32 accumulators: - -```cpp -// SQ8 load: 16 × uint8 → 4 × float32x4_t -uint8x16_t v1_u8 = vld1q_u8(pVect1); -uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); -uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); -float32x4_t v1_0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_lo))); -float32x4_t v1_1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_lo))); -float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi))); -float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi))); - -// FP16 query load: 16 × f16 → 4 × float32x4_t via vcvt_f32_f16 -float16x8_t q_lo = vld1q_f16(pVect2); -float16x8_t q_hi = vld1q_f16(pVect2 + 8); -float32x4_t v2_0 = vcvt_f32_f16(vget_low_f16(q_lo)); -float32x4_t v2_1 = vcvt_f32_f16(vget_high_f16(q_lo)); -float32x4_t v2_2 = vcvt_f32_f16(vget_low_f16(q_hi)); -float32x4_t v2_3 = vcvt_f32_f16(vget_high_f16(q_hi)); - -// 4-accumulator FMA -sum0 = vfmaq_f32(sum0, v1_0, v2_0); -sum1 = vfmaq_f32(sum1, v1_1, v2_1); -sum2 = vfmaq_f32(sum2, v1_2, v2_2); -sum3 = vfmaq_f32(sum3, v1_3, v2_3); -``` - -Residual ladder (`dim % 16`, residual 0..15): - -- **`residual >= 8`**: one 8-lane safe load each side — `vld1_u8` (8 bytes) for SQ8 and `vld1q_f16` (8 × FP16 = 16 bytes, fits before query metadata) for FP16. Convert + FMA. Remaining `residual - 8` lanes handled scalar. -- **`residual < 8`**: full scalar residual loop using `vecsim_types::FP16_to_FP32`. - -Rationale: a 16-byte SQ8 load (`vld1q_u8`) or a 16-byte FP16 load (`vld1q_f16` past the 8-lane boundary) on a residual < 8 would overread past valid query data into metadata — `y_sum` is only 4 bytes for IP and `y_sum_sq` adds 4 more for L2, not enough headroom for an 8-lane FP16 load. 
- -Final reduction: `vaddvq_f32(sum0 + sum1 + sum2 + sum3)`, then return `min_val * y_sum + delta * quantized_dot`. - -`assert(dim >= 16)` at the top. - -### Header: `L2_NEON_SQ8_FP16.h` - -Calls `SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(...)` to compute raw IP, then returns `x_sum_sq + y_sum_sq - 2.0f * ip`. Mirrors `L2_NEON_SQ8_FP32.h` exactly. - -### Wrapper symbols (NEON_HP.cpp) - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim) { - dist_func_t ret; - CHOOSE_IMPLEMENTATION(ret, dim, 16, SQ8_FP16_InnerProductSIMD16_NEON_HP); - return ret; -} -// L2 + Cosine identical shape (Cosine reuses IP wrapper per repo convention) -``` - -## SVE kernel design - -### Header: `IP_SVE_SQ8_FP16.h` - -Template signature mirrors SVE SQ8_FP32 sister: - -```cpp -template -float SQ8_FP16_InnerProductSIMD_SVE_IMP(const void *pVect1v, const void *pVect2v, size_t dimension); -``` - -Inner step (one SVE vector width `svcntw()` lanes of FP32): - -```cpp -svbool_t pg = svptrue_b32(); -// SQ8: zero-extend uint8 → uint32 (predicated b32 load) -svuint32_t v1_u32 = svld1ub_u32(pg, pVect1 + offset); -svfloat32_t v1_f = svcvt_f32_u32_x(pg, v1_u32); -// FP16: load chunk fp16 lanes, widen to fp32 -svbool_t pg16 = svwhilelt_b16(uint32_t(0), uint32_t(chunk)); -svfloat16_t q_h = svld1_f16(pg16, pVect2 + offset); -svfloat32_t v2_f = svcvt_f32_f16_x(pg, q_h); // verify exact ACLE/packing during impl -sum = svmla_f32_x(pg, sum, v1_f, v2_f); -offset += chunk; -``` - -**ACLE caveat**: exact f16→f32 widening intrinsic and lane packing — confirm `svcvt_f32_f16_x(pg, q_h)` compiles cleanly against the loaded `svfloat16_t`. If lane packing needs an unpack/interleave step, verify against `IP_SVE_FP16.h`. - -4 accumulators `sum0..sum3`; main loop processes 4 chunks via 4 `InnerProductStep` calls. `partial_chunk` template branch handles `dim % chunk` via `svwhilelt_b32`. 
- -Inactive-lane discipline on the partial path: the predicated `svld1_f16` / `svld1ub_u32` cover lane *liveness*, but the final reduction with `svaddv_f32(svptrue_b32(), ...)` walks *all* lanes. To keep inactive lanes from contributing garbage, the partial step uses the zeroing form `svmla_f32_z(pg_partial, sum0, v1_f, v2_f)` (matches `IP_SVE_SQ8_FP32.h` partial-chunk pattern). Alternative: reduce only active lanes via `svaddv_f32(pg_partial, sum0)` for the partial-step accumulator, then sum into the main reduction. The `_z` form is the simpler choice and is what the SQ8_FP32 SVE sister already does. - -Predicate widths on the partial path: FP32 math (load/widen/mla) uses a `b32` predicate sized to `remaining` 32-bit lanes (`svwhilelt_b32(0, remaining)`); the FP16 query load needs its own `b16` predicate sized to the same `remaining` half lanes (`svwhilelt_b16(0, remaining)`) since `svld1_f16` is governed by a 16-bit predicate. SQ8 load via `svld1ub_u32` is governed by the `b32` predicate (it widens uint8 → uint32 lanewise). - -Final reduction: `svaddv_f32(svptrue_b32(), sum0 + sum1 + sum2 + sum3)`. - -### Header: `L2_SVE_SQ8_FP16.h` - -Calls `SQ8_FP16_InnerProductSIMD_SVE_IMP(...)` then returns `x_sum_sq + y_sum_sq - 2.0f * ip`. Mirrors `L2_SVE_SQ8_FP32.h`. 
- -### Wrapper symbols - -`SVE.cpp`: - -```cpp -dist_func_t Choose_SQ8_FP16_IP_implementation_SVE(size_t dim) { - dist_func_t ret; - CHOOSE_SVE_IMPLEMENTATION(ret, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); - return ret; -} -// L2 + Cosine identical shape -``` - -`SVE2.cpp`: - -```cpp -#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" - -dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim) { - dist_func_t ret; - CHOOSE_SVE_IMPLEMENTATION(ret, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); - return ret; -} -// L2 + Cosine identical shape -``` - -SVE2 owns its own chooser symbols (does **not** call the SVE chooser); template instantiated under SVE2 compile flags. - -## Tests - -### Class - -Branch base is PR #970. During implementation, verify whether the base branch already exposes `SQ8_FP16_SpacesOptimizationTest` (extend) or only `SQ8_FP16_NoOptimizationSpacesTest` (add the optimization class here mirroring `SQ8_FP32_SpacesOptimizationTest`). - -### Tier-walk pattern - -Per-tier `if (features.)` block; **unset higher flag** after each block so the next tier is exercised on hosts that support multiple ISAs. Do not use `GTEST_SKIP()` here — it would abort the entire walk. - -```cpp -auto expected = SQ8_FP16_InnerProduct; // scalar reference - -#ifdef OPT_SVE2 - if (features.sve2) { - arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &features); - ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_SVE2(dim)) - << "SVE2 dispatch mismatch"; - ASSERT_NEAR(arch_opt_func(v1, v2, dim), expected(v1, v2, dim), 0.01); - features.sve2 = 0; // exercise next tier - } -#endif -#ifdef OPT_SVE - if (features.sve) { /* same shape */ features.sve = 0; } -#endif -#ifdef OPT_NEON_HP - if (features.asimdhp) { /* same shape */ features.asimdhp = 0; } -#endif -// final fallback assertion: IP_SQ8_FP16_GetDistFunc(...) 
== SQ8_FP16_InnerProduct (scalar) -``` - -Three dispatch entry points exercised per tier: `IP_SQ8_FP16_GetDistFunc`, `L2_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`. - -### Scalar-fallback tests - -`GetDistFuncSQ8FP16Asymmetric` — currently asserts `dim=128` returns scalar; that assertion breaks once SIMD dispatch lands. Change to `dim=15` (below the `dim >= 16` SIMD threshold). Add a small `dim=0` (empty) scalar-fallback assertion to cover the Jira "empty" edge case. - -### Dim parameterization - -Base branch already has both parameterized suites against `SQ8_FP16_SpacesOptimizationTest`: -- `SQ8_FP16_SIMD` — `testing::Range(16UL, 33UL)` (dims 16..32; residual + threshold boundaries) -- `SQ8_FP16_SIMD_HighDim` — `64, 128, 256, 512, 1024` (multi-iteration main loop) - -Both suites pick up the ARM tier-walk additions automatically since the test class body is what's extended. No new instantiation needed. - -### Tier coverage report - -`SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised` (test_spaces.cpp) currently reports only x86 tiers. Extend it with ARM tier entries (SVE2 / SVE / NEON_HP) so an ARM-only SIMD host reports its exercised tiers instead of going silent. - -## Microbench - -`tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp` already registers x86 ISA benchmarks. 
Add ARM registrations under `#ifdef OPT_*` guards using the existing `bm_spaces.h` macros: - -```cpp -#ifdef CPU_FEATURES_ARCH_AARCH64 - cpu_features::Aarch64Features opt = cpu_features::GetAarch64Info().features; - bool sve2_supported = opt.sve2; - bool sve_supported = opt.sve; - bool neon_hp_supported = opt.asimdhp; -#ifdef OPT_SVE2 - INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); - INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE2, 16, sve2_supported); -#endif -#ifdef OPT_SVE - INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); - INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, SVE, 16, sve_supported); -#endif -#ifdef OPT_NEON_HP - INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); - INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); -#endif -#endif // CPU_FEATURES_ARCH_AARCH64 -``` - -Verify exact `cpu_features` helper names against the x86 sister block already in `bm_spaces_sq8_fp16.cpp` (e.g. `GetX86Info`). - -`bm_spaces_sq8_fp16` and `bm_spaces_sq8_fp32` are separate executables; the per-ISA throughput comparison requested by Jira is done by running both benches and comparing matched ISA rows. 
- -## Acceptance criteria (Jira MOD-14972 → spec mapping) - -| Jira requirement | Where this spec delivers it | -|------------------|------------------------------| -| Kernels: IP + L2 for NEON | NEON_HP TU hosts kernel headers + chooser symbols | -| Kernels: IP + L2 for SVE | SVE TU hosts kernel headers + chooser symbols | -| Kernels: IP + L2 for SVE2 | SVE2 TU includes SVE headers, instantiates templates under SVE2 flags | -| Scalar fallback (reference for all platforms) | Already present in `IP.cpp` / `L2.cpp`; unchanged | -| FP16 query → FP32 per SIMD chunk | `vcvt_f32_f16` (NEON), `svcvt_f32_f16_x` (SVE) | -| FP32 metadata + correction terms | `load_unaligned` for all FP32 trailer scalars | -| Wire into dispatch table per ISA flag | `IP_space.cpp` (2 blocks), `L2_space.cpp` (1 block), `OPT_SVE2/SVE/NEON_HP` | -| Unit tests vs. scalar reference per ISA | Tier-walk in `SQ8_FP16_SpacesOptimizationTest` | -| Edge cases (empty, dim-alignment boundaries) | `dim=0` + `dim=15` scalar tests; `dim=16..32` SIMD boundary param suite | -| Microbench per ISA throughput vs. SQ8↔FP32 | ARM registrations in `bm_spaces_sq8_fp16.cpp`; matched-ISA comparison vs. 
`bm_spaces_sq8_fp32` | - -## Diff size estimate - -| Area | Files | LoC (rough) | -|------|-------|-------------| -| Kernel headers | 4 new | ~600 | -| Dispatcher TU additions | NEON_HP.cpp/h, SVE.cpp/h, SVE2.cpp/h | ~80 | -| Dispatcher wiring | IP_space.cpp, L2_space.cpp | ~45 | -| Tests | test_spaces.cpp | ~80 | -| Bench | bm_spaces_sq8_fp16.cpp | ~25 | -| CMakeLists.txt | none | 0 | -| **Total** | **~10 files** | **~830** | - -## PR mechanics - -- **Branch**: `dor-forer-sq8-fp16-arm-kernels-mod-14972` -- **Base branch**: `dor-forer-sq8-fp16-x86-kernels-mod-14954` (PR #970) -- **PR target**: opens against PR #970 head; retarget to `main` once #970 merges -- **Commit prefix**: `[MOD-14972]` (matches repo convention) -- **PR title**: `Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]` - -## Verification gates before opening PR - -1. **x86 host build clean** — verifies generic dispatch and tests remain clean; ARM kernels require ARM build or cross-compile, so the kernels themselves are not exercised here. -2. **ARM host build + unit tests** — NEON_HP / SVE / SVE2 paths exercised. Requires coordination with the user for ARM hardware or a cross-compile setup. -3. **ASan clean** on every host that runs unit tests. -4. **Microbench compiles + runs on ARM host.** - -## Out of scope (deferred, separate PRs) - -- Dispatcher-routed edge-case tests (`ZeroQueryTest`, `ConstantStorageTest`, `MixedSignQueryTest`) — they currently bypass the dispatcher and call scalar directly; cross-arch debt, also PR #970 H1. -- Multi-accumulator ILP tuning beyond the 4-accumulator baseline established here. -- Unrelated x86 review-feedback fixes (M1–M4, H1–H2 on x86 files from PR #970 review). This ARM PR will modify some files that PR #970 also touches (dispatchers, test class, bench), but only with ARM-relevant additions — x86 review fixes land in #970. 
- -## Inheritance from PR #970 review findings - -The following lessons from the PR #970 review are baked into this design so they do not need to be re-flagged on ARM kernels: - -- `assert(dim >= 16)` at the top of every kernel template (paired with dispatcher `dim >= 16` guard). -- 4-accumulator ILP in both NEON and SVE hot loops. -- Algebraic-identity formula anchor comment at the top of each kernel header. -- `load_unaligned` for all FP32 metadata reads (matches scalar). -- Dispatcher-routed tier-walk test pattern (no scalar-bypass). -- Per-ISA microbench registration alongside SQ8↔FP32 sister for direct comparison. diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 1930e64a2..9366d3144 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -241,9 +241,6 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, #endif #ifdef OPT_NEON_HP if (features.asimdhp) { - // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers - // leave *alignment untouched on ARM tiers. The corresponding tests assert - // 0xFF passthrough on the scalar path and do not assert any non-zero value here. return Choose_SQ8_FP16_IP_implementation_NEON_HP(dim); } #endif @@ -313,9 +310,6 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm #endif #ifdef OPT_NEON_HP if (features.asimdhp) { - // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers - // leave *alignment untouched on ARM tiers. The corresponding tests assert - // 0xFF passthrough on the scalar path and do not assert any non-zero value here. 
return Choose_SQ8_FP16_Cosine_implementation_NEON_HP(dim); } #endif diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index 2e18920b3..7d65814e0 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -172,9 +172,6 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, #endif #ifdef OPT_NEON_HP if (features.asimdhp) { - // No alignment write: the locked spec and the sister ARM SQ8_FP32 dispatchers - // leave *alignment untouched on ARM tiers. The corresponding tests assert - // 0xFF passthrough on the scalar path and do not assert any non-zero value here. return Choose_SQ8_FP16_L2_implementation_NEON_HP(dim); } #endif From 966e36ad3adea66ce8cafc86b681089e41ae0c33 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Sun, 31 May 2026 13:34:54 +0000 Subject: [PATCH 20/24] Apply clang-format 18.1.8 (matches CI) [MOD-14972] --- src/VecSim/batch_iterator.h | 2 +- tests/benchmark/bm_vecsim_svs.h | 14 ++++++-------- tests/benchmark/types_ranges.h | 12 ++++-------- tests/unit/test_allocator.cpp | 4 ++-- 4 files changed, 13 insertions(+), 19 deletions(-) diff --git a/src/VecSim/batch_iterator.h b/src/VecSim/batch_iterator.h index 466072f86..9e2791130 100644 --- a/src/VecSim/batch_iterator.h +++ b/src/VecSim/batch_iterator.h @@ -27,7 +27,7 @@ struct VecSimBatchIterator : public VecsimBaseObject { explicit VecSimBatchIterator(void *query_vector, void *tctx, std::shared_ptr allocator) : VecsimBaseObject(allocator), query_vector(query_vector), returned_results_count(0), - timeoutCtx(tctx){}; + timeoutCtx(tctx) {}; virtual inline const void *getQueryBlob() const { return query_vector; } diff --git a/tests/benchmark/bm_vecsim_svs.h b/tests/benchmark/bm_vecsim_svs.h index b92cce5e0..5acb882c0 100644 --- a/tests/benchmark/bm_vecsim_svs.h +++ b/tests/benchmark/bm_vecsim_svs.h @@ -466,19 +466,17 @@ void BM_VecSimSVS::RunGC(benchmark::State &st) { #define UNIT_AND_ITERATIONS 
Unit(benchmark::kMillisecond)->Iterations(2) #if HAVE_SVS_LVQ -#define QUANT_BITS_ARGS \ - { VecSimSvsQuant_8, VecSimSvsQuant_4x8_LeanVec } +#define QUANT_BITS_ARGS {VecSimSvsQuant_8, VecSimSvsQuant_4x8_LeanVec} #define COMPRESSED_TRAINING_THRESHOLD_ARGS \ - { static_cast(BM_VecSimGeneral::block_size), 5000, 10000 } + {static_cast(BM_VecSimGeneral::block_size), 5000, 10000} #define COMPRESSED_ASYNC_TRAINING_THRESHOLD_ARGS \ - { static_cast(BM_VecSimGeneral::block_size), 5000, 10000, 50000 } + {static_cast(BM_VecSimGeneral::block_size), 5000, 10000, 50000} #else -#define QUANT_BITS_ARGS \ - { VecSimSvsQuant_8 } +#define QUANT_BITS_ARGS {VecSimSvsQuant_8} // Using smaller training TH to avoid long test times without LVQ #define COMPRESSED_TRAINING_THRESHOLD_ARGS \ - { static_cast(BM_VecSimGeneral::block_size), 5000 } + {static_cast(BM_VecSimGeneral::block_size), 5000} #define COMPRESSED_ASYNC_TRAINING_THRESHOLD_ARGS \ - { static_cast(BM_VecSimGeneral::block_size), 5000 } + {static_cast(BM_VecSimGeneral::block_size), 5000} #endif diff --git a/tests/benchmark/types_ranges.h b/tests/benchmark/types_ranges.h index deff4251c..43abda8f0 100644 --- a/tests/benchmark/types_ranges.h +++ b/tests/benchmark/types_ranges.h @@ -11,11 +11,9 @@ #include #include "bm_definitions.h" -#define DEFAULT_RANGE_RADII \ - { 20, 35, 50 } +#define DEFAULT_RANGE_RADII {20, 35, 50} -#define DEFAULT_RANGE_EPSILONS \ - { 1, 10, 11 } +#define DEFAULT_RANGE_EPSILONS {1, 10, 11} // This template struct methods returns the default values for radii and epsilons // To specify different values for a certain type, use template specialization @@ -27,8 +25,7 @@ struct benchmark_range { // Larger Range query values are required for int8 wikipedia dataset. 
// Default values give 0 results -#define INT8_RANGE_RADII \ - { 50, 65, 80 } +#define INT8_RANGE_RADII {50, 65, 80} template <> struct benchmark_range { @@ -37,8 +34,7 @@ struct benchmark_range { }; // UINT8 ranges -#define UINT8_RANGE_RADII \ - { 4, 5, 7 } +#define UINT8_RANGE_RADII {4, 5, 7} template <> struct benchmark_range { diff --git a/tests/unit/test_allocator.cpp b/tests/unit/test_allocator.cpp index 77db41684..6aa4a0d0b 100644 --- a/tests/unit/test_allocator.cpp +++ b/tests/unit/test_allocator.cpp @@ -33,7 +33,7 @@ struct ObjectWithSTL : public VecsimBaseObject { public: ObjectWithSTL(std::shared_ptr allocator) - : VecsimBaseObject(allocator), test_vec(allocator){}; + : VecsimBaseObject(allocator), test_vec(allocator) {}; }; struct NestedObject : public VecsimBaseObject { @@ -42,7 +42,7 @@ struct NestedObject : public VecsimBaseObject { public: NestedObject(std::shared_ptr allocator) - : VecsimBaseObject(allocator), stl_object(allocator), simpleObject(allocator){}; + : VecsimBaseObject(allocator), stl_object(allocator), simpleObject(allocator) {}; }; TEST_F(AllocatorTest, test_simple_object) { From b47be9460d0e7b4dd6e0189c8c9a0a6e3b5c21f9 Mon Sep 17 00:00:00 2001 From: lerman25 Date: Mon, 1 Jun 2026 20:37:08 +0300 Subject: [PATCH 21/24] bench: register spaces_sq8_fp16 in benchmark setups The bm_spaces_sq8_fp16 executable is built but was never emitted by benchmarks.sh, so no CI label (bm-spaces / benchmarks-all) would run it. Register it in bm-spaces, bm-spaces-sq8-full, benchmarks-all and benchmarks-default, and add a dedicated bm-spaces-sq8-fp16 case. 
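For reference, the sq8-related label handling after this change can be sketched as below. This is a minimal sketch under stated assumptions, not the script itself: the function name `list_sq8_benchmarks` and passing the label as an argument are hypothetical (benchmarks.sh tests `$BM_TYPE` directly), and only the sq8 branches visible in this patch are reproduced.

```shell
#!/bin/sh
# Hypothetical helper mirroring the sq8-related branches of benchmarks.sh.
# Assumption: the label arrives as $1 here; the real script compares $BM_TYPE.
list_sq8_benchmarks() {
    bm_type="$1"
    if [ "$bm_type" = "bm-spaces-sq8-full" ]; then
        # Full sq8 sweep: spaces_sq8_fp16 now sits between the fp32 and sq8 entries.
        echo spaces_sq8_fp32
        echo spaces_sq8_fp16
        echo spaces_sq8_sq8
    elif [ "$bm_type" = "bm-spaces-sq8-fp32" ]; then
        echo spaces_sq8_fp32
    elif [ "$bm_type" = "bm-spaces-sq8-fp16" ]; then
        # Dedicated label added by this change.
        echo spaces_sq8_fp16
    elif [ "$bm_type" = "bm-spaces-sq8-sq8" ]; then
        echo spaces_sq8_sq8
    fi
}

# The dedicated label selects only the new benchmark executable suffix.
list_sq8_benchmarks bm-spaces-sq8-fp16   # prints: spaces_sq8_fp16
```

An unrecognised label prints nothing, which is the same silent fall-through that previously kept spaces_sq8_fp16 out of every CI setup.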
--- tests/benchmark/benchmarks.sh | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/tests/benchmark/benchmarks.sh b/tests/benchmark/benchmarks.sh index 91ba49448..115a4cac9 100755 --- a/tests/benchmark/benchmarks.sh +++ b/tests/benchmark/benchmarks.sh @@ -21,6 +21,7 @@ if [ -z "$BM_TYPE" ] || [ "$BM_TYPE" = "benchmarks-all" ]; then echo spaces_int8 echo spaces_uint8 echo spaces_sq8_fp32 + echo spaces_sq8_fp16 echo spaces_sq8_sq8 elif [ "$BM_TYPE" = "benchmarks-default" ]; then @@ -33,6 +34,7 @@ elif [ "$BM_TYPE" = "benchmarks-default" ]; then echo spaces_int8 echo spaces_uint8 echo spaces_sq8_fp32 + echo spaces_sq8_fp16 echo spaces_sq8_sq8 @@ -106,6 +108,7 @@ elif [ "$BM_TYPE" = "bm-basics-svs-fp32-single" ] ; then echo basics_svs_single_fp32_LVQ8 elif [ "$BM_TYPE" = "bm-spaces-sq8-full" ] ; then echo spaces_sq8_fp32 + echo spaces_sq8_fp16 echo spaces_sq8_sq8 @@ -118,6 +121,7 @@ elif [ "$BM_TYPE" = "bm-spaces" ] ; then echo spaces_int8 echo spaces_uint8 echo spaces_sq8_fp32 + echo spaces_sq8_fp16 echo spaces_sq8_sq8 elif [ "$BM_TYPE" = "bm-spaces-fp32" ] ; then @@ -134,6 +138,8 @@ elif [ "$BM_TYPE" = "bm-spaces-uint8" ] ; then echo spaces_uint8 elif [ "$BM_TYPE" = "bm-spaces-sq8-fp32" ] ; then echo spaces_sq8_fp32 +elif [ "$BM_TYPE" = "bm-spaces-sq8-fp16" ] ; then + echo spaces_sq8_fp16 elif [ "$BM_TYPE" = "bm-spaces-sq8-sq8" ] ; then echo spaces_sq8_sq8 fi From 7ece249c5456fbc1425bc272b2a79efa198141dc Mon Sep 17 00:00:00 2001 From: lerman25 Date: Mon, 1 Jun 2026 20:39:03 +0300 Subject: [PATCH 22/24] perf(arm): optimize SQ8<->FP16 NEON_HP widening and add SVE2 FMLALB/FMLALT kernel NEON_HP: widen SQ8 storage uint8->fp16->fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values. SVE2: dedicated kernel keeping storage+query at 16-bit and using the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). 
Processes svcnth() lanes/step (2x the base-SVE svcntw() granularity) and removes explicit query widening/conversion, roughly halving hot-loop loads and instructions. Wired into SVE2.cpp IP/Cosine/L2 choosers at svcnth granularity. --- src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h | 15 +-- src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h | 124 ++++++++++++++++++++++++ src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h | 32 ++++++ src/VecSim/spaces/functions/SVE2.cpp | 10 +- 4 files changed, 170 insertions(+), 11 deletions(-) create mode 100644 src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h create mode 100644 src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h diff --git a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h index a5c2465fc..f14c2289f 100644 --- a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h @@ -27,12 +27,15 @@ static inline void SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, con float32x4_t &sum0, float32x4_t &sum1, float32x4_t &sum2, float32x4_t &sum3) { uint8x16_t v1_u8 = vld1q_u8(pVect1); - uint16x8_t v1_lo = vmovl_u8(vget_low_u8(v1_u8)); - uint16x8_t v1_hi = vmovl_u8(vget_high_u8(v1_u8)); - float32x4_t v1_0 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_lo))); - float32x4_t v1_1 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_lo))); - float32x4_t v1_2 = vcvtq_f32_u32(vmovl_u16(vget_low_u16(v1_hi))); - float32x4_t v1_3 = vcvtq_f32_u32(vmovl_u16(vget_high_u16(v1_hi))); + // SQ8 values 0..255 are exact in FP16, so widen uint8 -> uint16 -> fp16 -> fp32. + // This drops two integer-widening ops per chunk versus the uint8 -> u16 -> u32 -> f32 + // chain while producing bit-identical FP32 lane values. 
+ float16x8_t v1_h_lo = vcvtq_f16_u16(vmovl_u8(vget_low_u8(v1_u8))); + float16x8_t v1_h_hi = vcvtq_f16_u16(vmovl_u8(vget_high_u8(v1_u8))); + float32x4_t v1_0 = vcvt_f32_f16(vget_low_f16(v1_h_lo)); + float32x4_t v1_1 = vcvt_f32_f16(vget_high_f16(v1_h_lo)); + float32x4_t v1_2 = vcvt_f32_f16(vget_low_f16(v1_h_hi)); + float32x4_t v1_3 = vcvt_f32_f16(vget_high_f16(v1_h_hi)); const float16_t *q = reinterpret_cast(pVect2); float16x8_t q_lo = vld1q_f16(q); diff --git a/src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h new file mode 100644 index 000000000..a36627e80 --- /dev/null +++ b/src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h @@ -0,0 +1,124 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/types/sq8.h" +#include "VecSim/types/float16.h" +#include + +using sq8 = vecsim_types::sq8; +using float16 = vecsim_types::float16; + +/* + * SVE2 asymmetric SQ8 (storage) <-> FP16 (query) inner product using the identity: + * IP(x, y) ~= min * y_sum + delta * Σ(q_i * y_i) + * + * SVE2-only fast path: the storage bytes (0..255, exact in FP16) and the FP16 query + * lanes stay 16-bit, and the FP16->FP32 widening multiply-accumulate is done by the + * FMLALB/FMLALT pair (svmlalb_f32 / svmlalt_f32). Each pair widens the even/odd + * half-precision lanes to single precision and multiplies/accumulates in FP32 WITHOUT + * intermediate rounding, so the per-lane products match the SVE svmla path exactly while + * processing svcnth() lanes per step (twice the base-SVE svcntw() granularity) and halving + * the number of loads and explicit conversions. 
The even/odd accumulator split groups the + * FP32 additions differently than the base SVE kernel, so the reduced result is numerically + * equivalent (well within the test tolerance) rather than bit-identical. + */ + +// Helper: one svcnth()-wide FP16 step feeding an even/odd FP32 accumulator pair. +static inline void SQ8_FP16_InnerProductStep_SVE2(const uint8_t *pVect1, const float16 *pVect2, + size_t &offset, svfloat32_t &sum_even, + svfloat32_t &sum_odd, svbool_t pg16, + size_t chunk) { + svuint16_t v1_u16 = svld1ub_u16(pg16, pVect1 + offset); + svfloat16_t v1_f16 = svcvt_f16_u16_x(pg16, v1_u16); + svfloat16_t q_f16 = svld1_f16(pg16, reinterpret_cast(pVect2 + offset)); + // FMLALB/FMLALT are unpredicated; inactive lanes were zeroed by the loads above so + // their contribution is 0 and walking all lanes is safe. + sum_even = svmlalb_f32(sum_even, v1_f16, q_f16); + sum_odd = svmlalt_f32(sum_odd, v1_f16, q_f16); + offset += chunk; +} + +// pVect1v = SQ8 storage, pVect2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher). +template +float SQ8_FP16_InnerProductSIMD_SVE2_IMP(const void *pVect1v, const void *pVect2v, + size_t dimension) { + const uint8_t *pVect1 = static_cast(pVect1v); + const float16 *pVect2 = static_cast(pVect2v); + size_t offset = 0; + const svbool_t pg16 = svptrue_b16(); + const size_t chunk = svcnth(); + + svfloat32_t sum0e = svdup_f32(0.0f), sum0o = svdup_f32(0.0f); + svfloat32_t sum1e = svdup_f32(0.0f), sum1o = svdup_f32(0.0f); + svfloat32_t sum2e = svdup_f32(0.0f), sum2o = svdup_f32(0.0f); + svfloat32_t sum3e = svdup_f32(0.0f), sum3o = svdup_f32(0.0f); + + // Partial chunk for dim % chunk FP16 lanes. Zeroing loads (_z convert) leave inactive + // lanes at 0 so the unpredicated FMLALB/FMLALT below ignore them. 
+ if constexpr (partial_chunk) { + size_t remaining = dimension % chunk; + if (remaining > 0) { + svbool_t pg_partial = svwhilelt_b16(uint64_t(0), uint64_t(remaining)); + svuint16_t v1_u16 = svld1ub_u16(pg_partial, pVect1 + offset); + svfloat16_t v1_f16 = svcvt_f16_u16_z(pg_partial, v1_u16); + svfloat16_t q_f16 = + svld1_f16(pg_partial, reinterpret_cast(pVect2 + offset)); + sum0e = svmlalb_f32(sum0e, v1_f16, q_f16); + sum0o = svmlalt_f32(sum0o, v1_f16, q_f16); + offset += remaining; + } + } + + // Main loop: 4 steps per iteration, one even/odd accumulator pair per step. + const size_t chunk_size = 4 * chunk; + const size_t number_of_chunks = + (dimension - (partial_chunk ? dimension % chunk : 0)) / chunk_size; + for (size_t i = 0; i < number_of_chunks; i++) { + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum0e, sum0o, pg16, chunk); + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum1e, sum1o, pg16, chunk); + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum2e, sum2o, pg16, chunk); + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum3e, sum3o, pg16, chunk); + } + + if constexpr (additional_steps > 0) + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum0e, sum0o, pg16, chunk); + if constexpr (additional_steps > 1) + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum1e, sum1o, pg16, chunk); + if constexpr (additional_steps > 2) + SQ8_FP16_InnerProductStep_SVE2(pVect1, pVect2, offset, sum2e, sum2o, pg16, chunk); + + const svbool_t pg32 = svptrue_b32(); + svfloat32_t sum = svadd_f32_z(pg32, sum0e, sum0o); + sum = svadd_f32_x(pg32, sum, svadd_f32_x(pg32, sum1e, sum1o)); + sum = svadd_f32_x(pg32, sum, svadd_f32_x(pg32, sum2e, sum2o)); + sum = svadd_f32_x(pg32, sum, svadd_f32_x(pg32, sum3e, sum3o)); + float quantized_dot = svaddv_f32(pg32, sum); + + const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float min_val = load_unaligned(params_bytes + sq8::MIN_VAL * sizeof(float)); + const float delta = 
load_unaligned(params_bytes + sq8::DELTA * sizeof(float)); + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); + const float y_sum = load_unaligned(query_meta_bytes + sq8::SUM_QUERY * sizeof(float)); + + return min_val * y_sum + delta * quantized_dot; +} + +template +float SQ8_FP16_InnerProductSIMD_SVE2(const void *pVect1v, const void *pVect2v, size_t dimension) { + return 1.0f - SQ8_FP16_InnerProductSIMD_SVE2_IMP( + pVect1v, pVect2v, dimension); +} + +template +float SQ8_FP16_CosineSIMD_SVE2(const void *pVect1v, const void *pVect2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD_SVE2(pVect1v, pVect2v, + dimension); +} diff --git a/src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h new file mode 100644 index 000000000..d9451fe2a --- /dev/null +++ b/src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h @@ -0,0 +1,32 @@ +/* + * Copyright (c) 2006-Present, Redis Ltd. + * All rights reserved. + * + * Licensed under your choice of the Redis Source Available License 2.0 + * (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the + * GNU Affero General Public License v3 (AGPLv3). + */ +#pragma once +#include "VecSim/spaces/space_includes.h" +#include "VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h" + +/* + * SVE2 SQ8<->FP16 L2 squared distance: + * ||x - y||^2 = x_sum_squares - 2 * IP(x, y) + y_sum_squares + * IP is computed by SQ8_FP16_InnerProductSIMD_SVE2_IMP; metadata is FP32. 
+ */ + +template +float SQ8_FP16_L2SqrSIMD_SVE2(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = SQ8_FP16_InnerProductSIMD_SVE2_IMP( + pVect1v, pVect2v, dimension); + + const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} diff --git a/src/VecSim/spaces/functions/SVE2.cpp b/src/VecSim/spaces/functions/SVE2.cpp index 4496c07e6..95631f0d2 100644 --- a/src/VecSim/spaces/functions/SVE2.cpp +++ b/src/VecSim/spaces/functions/SVE2.cpp @@ -22,8 +22,8 @@ #include "VecSim/spaces/IP/IP_SVE_UINT8.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/IP/IP_SVE_SQ8_FP32.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/L2/L2_SVE_SQ8_FP32.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/IP/IP_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/L2/L2_SVE_SQ8_FP16.h" // SVE2 implementation is identical to SVE +#include "VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h" // SVE2 fast path: FMLALB/FMLALT widening +#include "VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h" // SVE2 fast path: FMLALB/FMLALT widening #include "VecSim/spaces/IP/IP_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/L2/L2_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE @@ -120,19 +120,19 @@ dist_func_t Choose_SQ8_FP32_L2_implementation_SVE2(size_t dim) { dist_func_t Choose_SQ8_FP16_IP_implementation_SVE2(size_t dim) { dist_func_t ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE, dim, svcntw); + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_InnerProductSIMD_SVE2, dim, svcnth); return ret_dist_func; 
} dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_SVE2(size_t dim) { dist_func_t<float> ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE, dim, svcntw); + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_CosineSIMD_SVE2, dim, svcnth); return ret_dist_func; } dist_func_t<float> Choose_SQ8_FP16_L2_implementation_SVE2(size_t dim) { dist_func_t<float> ret_dist_func; - CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE, dim, svcntw); + CHOOSE_SVE_IMPLEMENTATION(ret_dist_func, SQ8_FP16_L2SqrSIMD_SVE2, dim, svcnth); return ret_dist_func; } From db1e68fbdad69d8536005ff09c350ea0ad349321 Mon Sep 17 00:00:00 2001 From: lerman25 Date: Mon, 1 Jun 2026 20:49:08 +0300 Subject: [PATCH 23/24] style: clang-format SVE2.cpp --- src/VecSim/spaces/functions/SVE2.cpp | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/src/VecSim/spaces/functions/SVE2.cpp b/src/VecSim/spaces/functions/SVE2.cpp index 95631f0d2..9eea81523 100644 --- a/src/VecSim/spaces/functions/SVE2.cpp +++ b/src/VecSim/spaces/functions/SVE2.cpp @@ -16,16 +16,16 @@ #include "VecSim/spaces/IP/IP_SVE_FP64.h" #include "VecSim/spaces/L2/L2_SVE_FP64.h" -#include "VecSim/spaces/L2/L2_SVE_INT8.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/IP/IP_SVE_INT8.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/L2/L2_SVE_UINT8.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/IP/IP_SVE_UINT8.h" //
SVE2 implementation is identical to SVE +#include "VecSim/spaces/IP/IP_SVE_SQ8_FP32.h" // SVE2 implementation is identical to SVE +#include "VecSim/spaces/L2/L2_SVE_SQ8_FP32.h" // SVE2 implementation is identical to SVE #include "VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h" // SVE2 fast path: FMLALB/FMLALT widening #include "VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h" // SVE2 fast path: FMLALB/FMLALT widening -#include "VecSim/spaces/IP/IP_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE -#include "VecSim/spaces/L2/L2_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE +#include "VecSim/spaces/IP/IP_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE +#include "VecSim/spaces/L2/L2_SVE_SQ8_SQ8.h" // SVE2 implementation is identical to SVE namespace spaces { From 147268443e66863cd24d6bd26c5e5d4d4f15d627 Mon Sep 17 00:00:00 2001 From: Dor Forer Date: Tue, 2 Jun 2026 11:09:05 +0000 Subject: [PATCH 24/24] perf(arm): add NEON_FHM FMLAL widening-FMA kernel for SQ8<->FP16 [MOD-14972] Add an asimdfhm-gated NEON_FHM tier for SQ8<->FP16 IP / L2 / Cosine. Instead of widening both operands to FP32 and issuing vfmaq_f32 (the NEON_HP path), it uses vfmlalq_low/high_f16 (FMLAL/FMLAL2) to multiply the FP16 lanes directly into FP32 accumulators, removing all 8 vcvt_f32_f16 per 16 lanes. SQ8 storage is widened uint8->fp16 (exact for 0..255) and the FP16 query is consumed in place. FMLAL widens fp16->fp32 before the multiply, so accuracy matches the scalar baseline. Dispatchers prefer NEON_FHM over NEON_HP when features.asimdfhm is set. The IP core is templated on use_fhm so L2/Cosine and the residual tail are shared. Tier-walk unit tests and microbenchmarks cover the new path. Measured ~1.95x over NEON_HP at high/medium dims on asimdfhm hardware. 
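The shared-core arrangement described above can be sketched in portable C++. This is a minimal illustration only, not the kernel from this patch: StepHP, StepFHM, InnerProductCore, Features, and Choose are hypothetical names, the two steps are scalar stand-ins for the NEON intrinsics, and the query is held as float rather than FP16. It assumes the compile-time flag only swaps the 16-lane main-loop step, while the residual tail and the final reduction are shared, and that the chooser prefers the FHM tier whenever its feature bit is set.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for the NEON_HP step: widen both operands to
// float, then multiply-accumulate 16 lanes into four partial sums.
static inline void StepHP(const uint8_t *&storage, const float *&query,
                          std::array<float, 4> &sums) {
    for (size_t i = 0; i < 16; i++) {
        sums[i / 4] += static_cast<float>(storage[i]) * query[i];
    }
    storage += 16;
    query += 16;
}

// Hypothetical scalar stand-in for the NEON_FHM step. The real kernel issues
// widening FMAs; here only the lane-to-accumulator grouping differs, which is
// enough to make the two instantiations distinct code paths.
static inline void StepFHM(const uint8_t *&storage, const float *&query,
                           std::array<float, 4> &sums) {
    for (size_t i = 0; i < 16; i++) {
        sums[i % 4] += static_cast<float>(storage[i]) * query[i];
    }
    storage += 16;
    query += 16;
}

// Shared core: use_fhm picks the main-loop step at compile time; the residual
// tail and the reduction are identical for both tiers.
template <unsigned char residual, bool use_fhm>
float InnerProductCore(const uint8_t *storage, const float *query, size_t dim) {
    std::array<float, 4> sums{};
    const size_t num_chunks = dim / 16;
    for (size_t i = 0; i < num_chunks; i++) {
        if constexpr (use_fhm) {
            StepFHM(storage, query, sums);
        } else {
            StepHP(storage, query, sums);
        }
    }
    float tail = 0.0f;
    for (size_t i = 0; i < residual; i++) {
        tail += static_cast<float>(storage[i]) * query[i];
    }
    return sums[0] + sums[1] + sums[2] + sums[3] + tail;
}

// Hypothetical feature flags and chooser: FHM is tested before HP.
struct Features {
    bool asimdfhm;
    bool asimdhp;
};
using DistFunc = float (*)(const uint8_t *, const float *, size_t);

template <unsigned char residual>
DistFunc Choose(const Features &features) {
    if (features.asimdfhm) return InnerProductCore<residual, true>;
    if (features.asimdhp) return InnerProductCore<residual, false>;
    return nullptr; // a real dispatcher would fall back to a scalar kernel
}

// Clears the FHM bit to walk down one tier, then checks that both tiers agree
// on small integer-valued inputs (every partial sum is exact in float).
inline bool TiersAgree() {
    constexpr size_t dim = 19; // one full 16-lane chunk plus residual 3
    uint8_t storage[dim];
    float query[dim];
    for (size_t i = 0; i < dim; i++) {
        storage[i] = static_cast<uint8_t>(i * 13);
        query[i] = static_cast<float>(i % 5) - 2.0f;
    }
    DistFunc fhm = Choose<3>({true, true});
    DistFunc hp = Choose<3>({false, true});
    return fhm != hp && fhm(storage, query, dim) == hp(storage, query, dim);
}
```

Clearing asimdfhm and choosing again, as TiersAgree does, mirrors in miniature how the tier-walk unit tests step from NEON_FHM down to NEON_HP.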
Co-Authored-By: Claude Opus 4.8 (1M context) --- src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h | 51 ++++++++++++++++++- src/VecSim/spaces/IP_space.cpp | 6 +++ src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h | 17 +++++++ src/VecSim/spaces/L2_space.cpp | 3 ++ src/VecSim/spaces/functions/NEON_HP.cpp | 19 +++++++ src/VecSim/spaces/functions/NEON_HP.h | 4 ++ .../spaces_benchmarks/bm_spaces_sq8_fp16.cpp | 6 +++ tests/unit/test_spaces.cpp | 30 +++++++++++ 8 files changed, 134 insertions(+), 2 deletions(-) diff --git a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h index f14c2289f..cce6ea21d 100644 --- a/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h +++ b/src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h @@ -54,8 +54,38 @@ static inline void SQ8_FP16_InnerProductStep_NEON_HP(const uint8_t *&pVect1, con pVect2 += 16; } +/* + * FMLAL widening-FMA variant (FEAT_FHM / asimdfhm). Instead of widening both operands to FP32 + * and issuing vfmaq_f32, this multiplies the FP16 lanes directly into FP32 accumulators via + * vfmlalq_low/high_f16, halving the conversion work: the SQ8 storage is widened uint8 -> fp16 + * (exact for 0..255) and the FP16 query is consumed in place, with no fp16 -> fp32 conversions. + */ +static inline void SQ8_FP16_InnerProductStep_NEON_FHM(const uint8_t *&pVect1, + const float16 *&pVect2, float32x4_t &sum0, + float32x4_t &sum1, float32x4_t &sum2, + float32x4_t &sum3) { + uint8x16_t v1_u8 = vld1q_u8(pVect1); + float16x8_t v1_h_lo = vcvtq_f16_u16(vmovl_u8(vget_low_u8(v1_u8))); + float16x8_t v1_h_hi = vcvtq_f16_u16(vmovl_u8(vget_high_u8(v1_u8))); + + const float16_t *q = reinterpret_cast(pVect2); + float16x8_t q_lo = vld1q_f16(q); + float16x8_t q_hi = vld1q_f16(q + 8); + + // low_f16 handles lanes 0..3, high_f16 lanes 4..7 of each 8-lane FP16 register. 
+ sum0 = vfmlalq_low_f16(sum0, v1_h_lo, q_lo); + sum1 = vfmlalq_high_f16(sum1, v1_h_lo, q_lo); + sum2 = vfmlalq_low_f16(sum2, v1_h_hi, q_hi); + sum3 = vfmlalq_high_f16(sum3, v1_h_hi, q_hi); + + pVect1 += 16; + pVect2 += 16; +} + // pVect1v = SQ8 storage, pVect2v = FP16 query. Precondition: dim >= 16 (enforced by dispatcher). -template // 0..15 +// use_fhm selects the FMLAL widening-FMA main loop (requires FEAT_FHM at runtime); the residual +// tail is shared since it only reduces into the same FP32 accumulators. +template // residual 0..15 float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *pVect2v, size_t dimension) { const uint8_t *pVect1 = static_cast(pVect1v); @@ -68,7 +98,11 @@ float SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(const void *pVect1v, const void *p const size_t num_of_chunks = dimension / 16; for (size_t i = 0; i < num_of_chunks; i++) { - SQ8_FP16_InnerProductStep_NEON_HP(pVect1, pVect2, sum0, sum1, sum2, sum3); + if constexpr (use_fhm) { + SQ8_FP16_InnerProductStep_NEON_FHM(pVect1, pVect2, sum0, sum1, sum2, sum3); + } else { + SQ8_FP16_InnerProductStep_NEON_HP(pVect1, pVect2, sum0, sum1, sum2, sum3); + } } // Residual: up to three independent 4-lane sub-steps, leaving at most 3 elements @@ -130,3 +164,16 @@ template float SQ8_FP16_CosineSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, size_t dimension) { return SQ8_FP16_InnerProductSIMD16_NEON_HP(pVect1v, pVect2v, dimension); } + +// FMLAL (FEAT_FHM) variants — identical contract, FMLAL widening-FMA main loop. 
+template +float SQ8_FP16_InnerProductSIMD16_NEON_FHM(const void *pVect1v, const void *pVect2v, + size_t dimension) { + return 1.0f - + SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); +} + +template +float SQ8_FP16_CosineSIMD16_NEON_FHM(const void *pVect1v, const void *pVect2v, size_t dimension) { + return SQ8_FP16_InnerProductSIMD16_NEON_FHM(pVect1v, pVect2v, dimension); +} diff --git a/src/VecSim/spaces/IP_space.cpp b/src/VecSim/spaces/IP_space.cpp index 9366d3144..e13d57326 100644 --- a/src/VecSim/spaces/IP_space.cpp +++ b/src/VecSim/spaces/IP_space.cpp @@ -240,6 +240,9 @@ dist_func_t IP_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, } #endif #ifdef OPT_NEON_HP + if (features.asimdfhm) { + return Choose_SQ8_FP16_IP_implementation_NEON_FHM(dim); + } if (features.asimdhp) { return Choose_SQ8_FP16_IP_implementation_NEON_HP(dim); } @@ -309,6 +312,9 @@ dist_func_t Cosine_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignm } #endif #ifdef OPT_NEON_HP + if (features.asimdfhm) { + return Choose_SQ8_FP16_Cosine_implementation_NEON_FHM(dim); + } if (features.asimdhp) { return Choose_SQ8_FP16_Cosine_implementation_NEON_HP(dim); } diff --git a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h index 70367d7fe..2964c1cee 100644 --- a/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h +++ b/src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h @@ -33,3 +33,20 @@ float SQ8_FP16_L2SqrSIMD16_NEON_HP(const void *pVect1v, const void *pVect2v, siz return x_sum_sq + y_sum_sq - 2.0f * ip; } + +// FMLAL (FEAT_FHM) variant — same identity, FMLAL widening-FMA IP core. 
+template // 0..15 +float SQ8_FP16_L2SqrSIMD16_NEON_FHM(const void *pVect1v, const void *pVect2v, size_t dimension) { + const float ip = + SQ8_FP16_InnerProductSIMD16_NEON_HP_IMP(pVect1v, pVect2v, dimension); + + const uint8_t *params_bytes = static_cast(pVect1v) + dimension; + const float x_sum_sq = load_unaligned(params_bytes + sq8::SUM_SQUARES * sizeof(float)); + + const uint8_t *query_meta_bytes = + reinterpret_cast(static_cast(pVect2v) + dimension); + const float y_sum_sq = + load_unaligned(query_meta_bytes + sq8::SUM_SQUARES_QUERY * sizeof(float)); + + return x_sum_sq + y_sum_sq - 2.0f * ip; +} diff --git a/src/VecSim/spaces/L2_space.cpp b/src/VecSim/spaces/L2_space.cpp index 7d65814e0..53fa2d873 100644 --- a/src/VecSim/spaces/L2_space.cpp +++ b/src/VecSim/spaces/L2_space.cpp @@ -171,6 +171,9 @@ dist_func_t L2_SQ8_FP16_GetDistFunc(size_t dim, unsigned char *alignment, } #endif #ifdef OPT_NEON_HP + if (features.asimdfhm) { + return Choose_SQ8_FP16_L2_implementation_NEON_FHM(dim); + } if (features.asimdhp) { return Choose_SQ8_FP16_L2_implementation_NEON_HP(dim); } diff --git a/src/VecSim/spaces/functions/NEON_HP.cpp b/src/VecSim/spaces/functions/NEON_HP.cpp index 20d93a517..15e40ba82 100644 --- a/src/VecSim/spaces/functions/NEON_HP.cpp +++ b/src/VecSim/spaces/functions/NEON_HP.cpp @@ -47,6 +47,25 @@ dist_func_t Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim) { return ret_dist_func; } +// FMLAL (FEAT_FHM / asimdfhm) variants. 
+dist_func_t<float> Choose_SQ8_FP16_IP_implementation_NEON_FHM(size_t dim) { + dist_func_t<float> ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_InnerProductSIMD16_NEON_FHM); + return ret_dist_func; +} + +dist_func_t<float> Choose_SQ8_FP16_L2_implementation_NEON_FHM(size_t dim) { + dist_func_t<float> ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_L2SqrSIMD16_NEON_FHM); + return ret_dist_func; +} + +dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_NEON_FHM(size_t dim) { + dist_func_t<float> ret_dist_func; + CHOOSE_IMPLEMENTATION(ret_dist_func, dim, 16, SQ8_FP16_CosineSIMD16_NEON_FHM); + return ret_dist_func; +} + #include "implementation_chooser_cleanup.h" } // namespace spaces diff --git a/src/VecSim/spaces/functions/NEON_HP.h b/src/VecSim/spaces/functions/NEON_HP.h index 889eb0919..83579d2b7 100644 --- a/src/VecSim/spaces/functions/NEON_HP.h +++ b/src/VecSim/spaces/functions/NEON_HP.h @@ -20,4 +20,8 @@ dist_func_t<float> Choose_SQ8_FP16_IP_implementation_NEON_HP(size_t dim); dist_func_t<float> Choose_SQ8_FP16_L2_implementation_NEON_HP(size_t dim); dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_NEON_HP(size_t dim); +dist_func_t<float> Choose_SQ8_FP16_IP_implementation_NEON_FHM(size_t dim); +dist_func_t<float> Choose_SQ8_FP16_L2_implementation_NEON_FHM(size_t dim); +dist_func_t<float> Choose_SQ8_FP16_Cosine_implementation_NEON_FHM(size_t dim); + } // namespace spaces diff --git a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp index cc5d040cb..5ab529372 100644 --- a/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp +++ b/tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp @@ -105,6 +105,12 @@ bool neon_hp_supported = arm_opt.asimdhp; INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_HP, 16, neon_hp_supported); + +bool neon_fhm_supported = arm_opt.asimdfhm;
+INITIALIZE_BENCHMARKS_SET_L2_IP(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_FHM, 16, + neon_fhm_supported); +INITIALIZE_BENCHMARKS_SET_Cosine(BM_VecSimSpaces_SQ8_FP16, SQ8_FP16, NEON_FHM, 16, + neon_fhm_supported); #endif #endif // CPU_FEATURES_ARCH_AARCH64 diff --git a/tests/unit/test_spaces.cpp b/tests/unit/test_spaces.cpp index ce8605565..9a16bf30d 100644 --- a/tests/unit/test_spaces.cpp +++ b/tests/unit/test_spaces.cpp @@ -3175,6 +3175,16 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_L2SqrTest) { } #endif #ifdef OPT_NEON_HP + if (optimization.asimdfhm) { + unsigned char alignment = 0; + arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_L2_implementation_NEON_FHM(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_FHM with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment NEON_FHM with dim " << dim; + optimization.asimdfhm = 0; + } if (optimization.asimdhp) { unsigned char alignment = 0; arch_opt_func = L2_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3289,6 +3299,16 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, SQ8_FP16_InnerProductTest) { } #endif #ifdef OPT_NEON_HP + if (optimization.asimdfhm) { + unsigned char alignment = 0; + arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_IP_implementation_NEON_FHM(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_FHM with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment NEON_FHM with dim " << dim; + optimization.asimdfhm = 0; + } if (optimization.asimdhp) { unsigned char alignment = 0; arch_opt_func = IP_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); @@ -3403,6 +3423,16 @@ TEST_P(SQ8_FP16_SpacesOptimizationTest, 
SQ8_FP16_CosineTest) { } #endif #ifdef OPT_NEON_HP + if (optimization.asimdfhm) { + unsigned char alignment = 0; + arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization); + ASSERT_EQ(arch_opt_func, Choose_SQ8_FP16_Cosine_implementation_NEON_FHM(dim)) + << "Unexpected distance function chosen for dim " << dim; + ASSERT_NEAR(baseline, arch_opt_func(v2_compressed.data(), v1_query.data(), dim), 0.01) + << "NEON_FHM with dim " << dim; + ASSERT_EQ(alignment, 0) << "No alignment NEON_FHM with dim " << dim; + optimization.asimdfhm = 0; + } if (optimization.asimdhp) { unsigned char alignment = 0; arch_opt_func = Cosine_SQ8_FP16_GetDistFunc(dim, &alignment, &optimization);