Skip to content

Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973

Merged
dor-forer merged 25 commits into
mainfrom
dor-forer-sq8-fp16-arm-kernels-mod-14972
Jun 3, 2026
Merged

Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973
dor-forer merged 25 commits into
mainfrom
dor-forer-sq8-fp16-arm-kernels-mod-14972

Conversation

@dor-forer
Copy link
Copy Markdown
Collaborator

@dor-forer dor-forer commented May 31, 2026

Describe the changes in the pull request

Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.

The SVE hot loop uses svld1uh_u32 to zero-extend each FP16 halfword into a 32-bit lane, allowing svcvt_f32_f16_x to read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.

Additional improvements (co-authored by @lerman25, originally from PR #975):

  • NEON_HP widening optimization: replaced the u8→u16→u32→f32 path (two vmovl_u16 + vcvtq_f32_u32) with the shorter u8→u16→f16→f32 path (vcvtq_f16_u16 + vcvt_f32_f16), reducing instruction count in the main loop.
  • Dedicated SVE2 kernels (IP_SVE2_SQ8_FP16.h, L2_SVE2_SQ8_FP16.h): use svmlalb_f32/svmlalt_f32 (SVE2-only widening multiply-accumulate) for a tighter inner loop vs the generic SVE path.
  • Benchmark registration: spaces_sq8_fp16 is now wired into all relevant benchmark categories (benchmarks-all, benchmarks-default, bm-spaces-sq8-full, bm-spaces, and the new bm-spaces-sq8-fp16 category).

Which issues this PR fixes

  1. MOD-14972

Main objects this PR modified

  1. src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h — NEON_HP IP kernel (+ widening optimization)
  2. src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h — SVE IP kernel
  3. src/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h — new dedicated SVE2 IP kernel
  4. src/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h — NEON_HP L2 kernel
  5. src/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h — SVE L2 kernel
  6. src/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h — new dedicated SVE2 L2 kernel
  7. src/VecSim/spaces/functions/NEON_HP.{h,cpp}, SVE.{h,cpp}, SVE2.{h,cpp} — chooser symbols
  8. src/VecSim/spaces/IP_space.cpp, L2_space.cpp — AArch64 dispatcher blocks
  9. tests/unit/test_spaces.cpp — tier-walk tests for NEON_HP / SVE / SVE2
  10. tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp — ARM microbench registrations
  11. tests/benchmark/benchmarks.sh — benchmark category registration for spaces_sq8_fp16

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

Medium Risk
Touches hot-path distance math with ISA-specific numerics and feature-gated dispatch; mitigated by unit/benchmark coverage but wrong kernel choice could skew search results.

Overview
Adds ARM SIMD for asymmetric SQ8 storage ↔ FP16 query distances (inner product, L2², cosine), complementing existing x86 paths.

New kernels cover NEON_HP (with optional FHM widening-FMA), SVE (FP16 loads via svld1uh_u32), and a dedicated SVE2 path using FMLALB/FMLALT at svcnth() granularity. L2 reuses the IP cores via the usual sum-of-squares identity. NEON also shortens SQ8 widening (u8→f16→f32) and reshapes the residual tail to match the SQ8↔FP32 NEON pattern.

Runtime dispatch on AArch64 (dim ≥ 16): SVE2 → SVE → asimdfhmasimdhp, wired through choosers in IP_space.cpp / L2_space.cpp and the NEON/SVE/SVE2 function modules.

Tests/benchmarks: unit tests walk each ARM tier against the scalar baseline; spaces_sq8_fp16 is registered in benchmark scripts and ARM variants are added to bm_spaces_sq8_fp16.cpp.

Reviewed by Cursor Bugbot for commit f798f5c. Bugbot is set up for automated code reviews on this repo. Configure here.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 31, 2026

CLA assistant check
All committers have signed the CLA.

@jit-ci
Copy link
Copy Markdown

jit-ci Bot commented May 31, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

@codecov
Copy link
Copy Markdown

codecov Bot commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.09%. Comparing base (daf391f) to head (f798f5c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #973      +/-   ##
==========================================
- Coverage   97.11%   97.09%   -0.03%     
==========================================
  Files         141      141              
  Lines        8110     8110              
==========================================
- Hits         7876     7874       -2     
- Misses        234      236       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

* Add design doc for SQ8↔FP16 SIMD x86 kernels [MOD-14954]

Captures the architecture, file-level plan, CMake F16C gating, and risk
register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the
asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them
into the existing dispatcher tables and SQ8_FP16 unit/benchmark
scaffolding from MOD-15141.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Append -mf16c to AVX2_FMA/AVX2/SSE4 dispatcher sources [MOD-14954]

Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher
translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes
to FP32. The flag is appended only when CXX_F16C is detected; existing
SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so
emitted code for those kernels is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add SQ8_FP16_SpacesOptimizationTest skeleton [MOD-14954]

Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest;
currently asserts only the scalar fallback path. Per-tier SIMD
assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside
the kernel implementations in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add AVX-512 SQ8↔FP16 SIMD distance kernels [MOD-14954]

Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product,
Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk
widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes
via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8
storage and FP16 query metadata reads use load_unaligned to tolerate
odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp
select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI
when features.avx512f && features.avx512bw && features.avx512vl &&
features.avx512vnni; otherwise behaviour is unchanged from MOD-15141.
A parameterised gtest fixture exercises every residual class in
[16, 32] against the scalar baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add AVX2+FMA SQ8↔FP16 SIMD distance kernels [MOD-14954]

8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 +
cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a
256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte
FP16 block, convert, then blend zero across unused lanes — mirroring
the existing F16C FP16 kernel pattern. Dispatcher branch in
{IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new
Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 &&
features.fma3 && features.f16c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add AVX2 (no FMA) SQ8↔FP16 SIMD distance kernels [MOD-14954]

Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps
instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware
without FMA support (uncommon but matches the existing SQ8_FP32
tiering). Dispatcher gate requires features.avx2 && features.f16c
and runs between the AVX2+FMA and SSE4 tiers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add SSE4+F16C SQ8↔FP16 SIMD distance kernels [MOD-14954]

4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps
and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit
FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised
via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing
SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires
features.sse4_1 && features.f16c && features.avx since F16C is
VEX-encoded — matches the existing F16C/FP16 dispatcher gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update SQ8_FP16 dispatcher assertions to walk SIMD tiers [MOD-14954]

The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA /
AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set
(only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts
to compute the expected function for the host's highest supported tier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Register per-ISA SQ8↔FP16 microbenchmarks [MOD-14954]

Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to
bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier
on the corresponding OPT_* defines plus the runtime feature checks
that mirror the dispatcher in IP_space.cpp / L2_space.cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Reformat SQ8↔FP16 SIMD kernels for consistent line breaks

* Address PR review findings for SQ8↔FP16 x86 kernels [MOD-14954]

- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C
  macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since
  F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp
  recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)`
  with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on
  the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out
  the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16
  query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)`
  algebraic-identity comment header that AVX-512 already carried, plus a
  precondition note that callers must enforce dim >= 16 (matches the
  established SQ8_FP32 convention; no runtime assert because sibling
  SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
  ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc
  so the runtime-selected SIMD tier is actually exercised on those inputs,
  not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256,
  512, 1024} so multi-iteration do-while loop bugs would fire (the existing
  [16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a
  single test that emits per-tier coverage to stderr and GTEST_SKIPs when
  no SIMD tier is available, so CI runners without AVX-512 do not silently
  report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
  with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
  ("scalar leaves caller's value untouched") instead of just measuring
  that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment
  on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.

Out of scope (separate ticket): two-accumulator FMA refactor (also affects
SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add multi-accumulator ILP to SQ8↔FP16 x86 kernels [MOD-14954]

Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels
by widening the inner loop to use multiple independent accumulators.
L2 kernels inherit the change through their `…InnerProductImp_…` call.

- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop
  (64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks.
- IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the
  existing 2-step unrolled body now routes each step to an independent
  accumulator. The `residual >= 8` half-chunk feeds the second accumulator
  so the prologue also breaks the dependency chain.
- IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps
  per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate
  between sum_a and sum_b for prologue ILP.

Correctness invariant: residual block consumes exactly `residual` lanes
(0..15) → remaining tail is always a multiple of 16, so the unrolled
loops (multiples of 8 / 16 / 64) terminate exactly. Verified by 131
SQ8_FP16 unit tests + 115 under ASan.

* Drop misleading VNNI suffix from SQ8↔FP16 AVX-512 kernel [MOD-14954]

The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction
— the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from
SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16
(`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded
query, which is a different kernel entirely.

The file/function names are renamed to match what the kernel actually
uses (AVX-512F). The dispatcher .cpp/.h files stay named after the
runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still
registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 /
INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim.

The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.

Also: fix a strict-aliasing-class UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer past one-past-end of
the SQ8 storage object when fewer than 64 lane bytes remain, which is UB
in C++ regardless of dereference. Switched to pointer subtraction
(`static_cast<size_t>(pEnd1 - pVec1) >= 64`).

Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F

Verified: 131 SQ8_FP16 unit tests + 115 under ASan.

* Remove SQ8↔FP16 design doc from PR [MOD-14954]

Design doc was added in ad941b8 for planning; not appropriate as
a long-lived in-repo artifact. Keep externally (Confluence / scratch)
rather than ship with the kernel commit.

* Simplify SQ8↔FP16 tests to match sister conventions [MOD-14954]

Two trims, both restoring pre-existing patterns elsewhere in the file:

1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that
   duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`.
   Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8
   sister tests at lines 540-548 and 551-559.

2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`,
   `MixedSignQueryTest`) were routed through
   `{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD
   dispatch on adversarial inputs. Reverted to direct scalar calls
   (`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape.

   Coverage rationale: the SIMD kernels are branchless on input values
   (verified by grep — no value-dependent `if` in any tier). Every code
   path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s
   random inputs at multiple dims. The edge-case tests verify the
   *algebraic identity* (IP of zero query = 1.0, constant storage matches
   dequant baseline, mixed-sign handling) — scalar correctness on these
   inputs is what was actually being checked, and the SIMD path matches
   scalar via the SpacesOptimizationTest tier walk.

Net: 77 lines removed from the test file, matches sister conventions,
no coverage gap.

* Split SQ8↔FP16 F16C kernels into sibling TUs [MOD-14954]

The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on
F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in
those dispatcher TUs is F16C-clean. The previous arrangement mixed
both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h
files.

Split each tier's F16C-dependent kernels off into a sibling TU:

  functions/SSE4.cpp           → SSE4 + SQ8_FP32 (no F16C)
  functions/SSE4_F16C.cpp      → SQ8_FP16 only (requires -mavx -mf16c)

  functions/AVX2.cpp           → AVX2 + BF16 + SQ8_FP32 (no F16C)
  functions/AVX2_F16C.cpp      → SQ8_FP16 only (requires -mf16c)

  functions/AVX2_FMA.cpp       → SQ8_FP32 (no F16C)
  functions/AVX2_FMA_F16C.cpp  → SQ8_FP16 only (requires -mf16c)

The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses
`_mm512_cvtph_ps`, which is part of AVX-512F and not F16C.

CMake now compiles each sibling TU conditionally on
`_has_full_f16c` and applies the F16C flags only there. Base TUs no
longer carry `-mf16c`, since they no longer reference F16C intrinsics.

Result:
- No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`.
- Compile-time isolation: an F16C intrinsic accidentally added outside
  a `_F16C` sibling will fail to build, not silently miscompile.
- Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`,
  `bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new
  sibling .h includes are unconditional, since declarations alone don't
  link-error and the calls remain guarded.

Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces
suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm
no regression from the dispatcher restructure).

* Move SQ8↔FP16 AVX-512 dispatch to AVX512F tier + flatten F16C guards [MOD-14954]

Two related cleanups in the SQ8↔FP16 dispatch path:

1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
   (`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under
   the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni).
   That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake
   variants, etc.) would fall through to AVX2_FMA even though they can
   run the AVX-512 kernel.

   Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
   definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
   `functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
   sites now gate on `OPT_AVX512F` + `features.avx512f`.

2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16
   tiers (SSE4, AVX2, AVX2+FMA) — every one of them widens FP16 query
   lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted
   into a single outer block around the three ISA branches in
   `IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
   `L2_SQ8_FP16_GetDistFunc`.

Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.

* Clean up whitespace and formatting inconsistencies

Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions.

* Remove obsolete SQ8-to-FP16 dispatch comments

The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files.

* Hoist OPT_F16C guard around lower SIMD tiers in SQ8↔FP16 tests [MOD-14954]

Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a single
OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each test body
(L2/IP/Cosine) and the TierCoverage report now use the same single-guard
shape. Also retargets the TierCoverage AVX-512 check from
OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new
AVX-512F-only gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Drop non-idiomatic SQ8↔FP16 tier-coverage reporter test [MOD-14954]

SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no
other data type has a std::cerr-based coverage reporter. Per-tier
coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which
walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature
flags), and ISA-lane presence is handled by the CI matrix, matching
the convention used by every other type's SpacesOptimizationTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Simplify SQ8↔FP16 kernels and trim PR churn [MOD-14954]

- AVX512F IP: keep the <=3 tail chunks on distinct accumulators
  (sum0/sum1/sum2) instead of serializing into one, preserving ILP when
  the main 64-lane loop runs few or zero times.
- Condense kernel header comments; drop redundant float16.h/alignment.h
  includes (pulled in transitively) and the direct <immintrin.h> include
  (provided via space_includes.h, matching the other AVX512F headers).
- test_spaces: align the SQ8_FP16 scalar-fallback alignment assertion with
  the convention used by the other SpacesOptimizationTest suites.
- Revert unrelated CMake message/quote churn on the base AVX2/SSE4 TUs and
  the stray blank line in AVX512F_BW_VL_VNNI.cpp, leaving only the additive
  F16C build blocks in this PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Document why OPT_F16C differs from the other OPT_* macros [MOD-14954]

Explain at the definition site that OPT_F16C is a cross-cutting capability
gate (not a 1:1 dispatch tier), why it is a compound CXX_F16C/FMA/AVX guard
(F16C is VEX-encoded and needs AVX state), and why the AVX-512 SQ8<->FP16
path stays outside it (_mm512_cvtph_ps is part of AVX512F).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Cover AVX512 three-chunk tail and dim<16 dispatcher guard in SQ8_FP16 tests [MOD-14954]

Codecov flagged 4 uncovered lines on PR #970:
- The AVX512F `remaining >= 48` third tail step in IP_AVX512F_SQ8_FP16.h was
  never executed: the test dims never satisfied (dim / 16) % 4 == 3. Add 48
  (zero main-loop iterations) and 112 (one main-loop iteration) to exercise it.
- The `dim < 16` scalar early-return in the IP/Cosine/L2 SQ8_FP16 dispatchers
  was never taken. Assert the three dispatchers return the scalar funcs at dim 8.

Test-only change. Local release + ASan: SQ8_FP16 137/137, ASan clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dor-forer dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 6f6ef26 to 4ac05ac Compare June 1, 2026 15:01
Base automatically changed from dor-forer-sq8-fp16-x86-kernels-mod-14954 to main June 1, 2026 15:10
dor-forer and others added 19 commits June 1, 2026 15:12
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure
onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing
ARM TU compile flags. Scalar fallback already on main serves as
reference. Bakes in PR #970 review lessons (assert(dim>=16),
4-accumulator ILP, formula anchor, load_unaligned<float> metadata,
dispatcher-routed tier-walk tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md.
Each task ends in a commit; assistant runs tests/ASan/benchmarks after the
user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OD-14972]

The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing
ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32
sister blocks which assert it. Adds the assertions to lock the contract
that ARM tiers leave the caller's alignment value untouched.
…D-14972]

svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e].
The step function loaded chunk consecutive FP16 values into positions 0..chunk-1,
then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,...
and silently skipped positions 1,3,5,...  For chunk=4 (128-bit SVE), only 2 of
4 FP16 values per step were used, producing wrong dot products.

Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so
FCVT correctly reads v[0],v[1],v[2],...  Applied to both the full step helper
and the partial-chunk path.

Discovered and fixed during ARM host verification (Task 14, MOD-14972).
…D-14972]

SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with
svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane
so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to
the partial-chunk path, which also drops the now-redundant pg16_partial
predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z
to match the SQ8_FP32 SVE sister.

NEON residual: replace the single 8-lane block + up-to-7 software-scalar
iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12),
leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister
exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for
residuals 4..15 (previously up to 7 software conversions per call).

Both IP headers: remove assert()+<cassert> (no sister kernel uses them).
Both L2 headers: drop redundant float16.h include and using declarations
(arrive transitively through the included IP header).
…MOD-14972]

- Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970
  removed its equivalent doc before merge.
- Drop 5-line "No alignment write" prose comment from the three AArch64
  NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no
  such comment — the absent alignment write already encodes the intent.
- Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at
  dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister).
  The scalar-fallback assertion it previously duplicated is already covered
  by the trailing block of SQ8_FP16_SpacesOptimizationTest.
@dor-forer dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 4ac05ac to e1647dc Compare June 1, 2026 15:23
@lerman25 lerman25 mentioned this pull request Jun 1, 2026
2 tasks
@lerman25
Copy link
Copy Markdown
Collaborator

lerman25 commented Jun 1, 2026

SQ8↔FP16 ARM kernels: proposed optimizations + benchmark results

I profiled the ARM SQ8↔FP16 distance kernels from this PR and prototyped two optimizations, benchmarked head-to-head against this PR's code on a Graviton-4 (Neoverse V2) arm64 runner via bm-spaces.

To keep the comparison clean, both runs are identical except for the kernel changes, and both register spaces_sq8_fp16 in benchmarks.sh (it was built but never emitted by any CI setup, so the kernels had no benchmark coverage):

Speedup = baseline CPU-time ÷ treatment CPU-time (>1 = faster). 36 dimension points per tier.

Tier Metric Median speedup Range Median ns (base → treat)
SVE2 L2 1.83× 1.02–1.88 93.6 → 51.0
SVE2 IP 1.84× 1.00–1.89 93.6 → 50.9
SVE2 Cosine 1.84× 1.00–1.89 93.6 → 50.8
NEON_HP L2 1.05× 1.02–1.15 103.0 → 97.7
NEON_HP IP 1.05× 1.02–1.09 103.0 → 97.8
NEON_HP Cosine 1.05× 1.02–1.10 103.0 → 97.7
SVE (untouched control) all 1.00× 1.00–1.01 unchanged

What changed

  • SVE2 (~1.83–1.84×, near the theoretical 2×): a dedicated kernel using the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32), keeping storage+query at 16 bits and processing svcnth lanes/step vs the base svcntw — doubling throughput and removing explicit query widening. By bucket: high_dim 1.86×, residual 1.85×, low_dim 1.55–1.58× (per-call overhead dilutes small dims), 512-bit chunks median 1.66× with min ~1.00× at Dimension:16 (single chunk → nothing to amortize, no regression).
  • NEON_HP (steady ~1.05×): widen SQ8 storage uint8 → fp16 → fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values.

Why this is trustworthy

The untouched base SVE tier reads 1.00×, confirming the comparison is clean (identical code → identical perf) — so the gains are attributable to the kernels, not run-to-run noise. No regressions anywhere (min speedup ≥ 1.00 in every tier).

Happy to fold these into this PR if you'd like — the diffs are small and isolated to the IP/L2 SVE2 headers + the NEON_HP step + the SVE2 chooser.

lerman25 added 3 commits June 2, 2026 07:19
The bm_spaces_sq8_fp16 executable is built but was never emitted by
benchmarks.sh, so no CI label (bm-spaces / benchmarks-all) would run it.
Register it in bm-spaces, bm-spaces-sq8-full, benchmarks-all and
benchmarks-default, and add a dedicated bm-spaces-sq8-fp16 case.
…MLALT kernel

NEON_HP: widen SQ8 storage uint8->fp16->fp32 via vcvtq_f16_u16 (values 0..255
are exact in FP16), dropping two integer-widening ops per 16-element chunk with
identical FP32 lane values.

SVE2: dedicated kernel keeping storage+query at 16-bit and using the FMLALB/FMLALT
widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). Processes svcnth()
lanes/step (2x the base-SVE svcntw() granularity) and removes explicit query
widening/conversion, roughly halving hot-loop loads and instructions. Wired into
SVE2.cpp IP/Cosine/L2 choosers at svcnth granularity.
…-14972]

Add an asimdfhm-gated NEON_FHM tier for SQ8<->FP16 IP / L2 / Cosine. Instead
of widening both operands to FP32 and issuing vfmaq_f32 (the NEON_HP path), it
uses vfmlalq_low/high_f16 (FMLAL/FMLAL2) to multiply the FP16 lanes directly
into FP32 accumulators, removing all 8 vcvt_f32_f16 per 16 lanes. SQ8 storage
is widened uint8->fp16 (exact for 0..255) and the FP16 query is consumed in
place. FMLAL widens fp16->fp32 before the multiply, so accuracy matches the
scalar baseline.

Dispatchers prefer NEON_FHM over NEON_HP when features.asimdfhm is set. The IP
core is templated on use_fhm so L2/Cosine and the residual tail are shared.
Tier-walk unit tests and microbenchmarks cover the new path.

Measured ~1.95x over NEON_HP at high/medium dims on asimdfhm hardware.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lerman25
lerman25 previously approved these changes Jun 2, 2026
@dor-forer dor-forer enabled auto-merge June 2, 2026 12:28
@dor-forer dor-forer force-pushed the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch from 3118914 to 1472684 Compare June 3, 2026 11:42
…m-kernels-mod-14972

# Conflicts:
#	src/VecSim/spaces/IP_space.cpp
#	src/VecSim/spaces/L2_space.cpp
#	tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp
#	tests/unit/test_spaces.cpp
@dor-forer dor-forer dismissed lerman25’s stale review June 3, 2026 11:52

The merge-base changed after approval.

@dor-forer dor-forer requested a review from lerman25 June 3, 2026 15:00
@dor-forer dor-forer added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit b87edb2 Jun 3, 2026
23 checks passed
@dor-forer dor-forer deleted the dor-forer-sq8-fp16-arm-kernels-mod-14972 branch June 3, 2026 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants