Add SQ8↔FP16 ARM SIMD distance kernels [MOD-14972]#973
Conversation
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #973 +/- ##
==========================================
- Coverage 97.11% 97.09% -0.03%
==========================================
Files 141 141
Lines 8110 8110
==========================================
- Hits 7876 7874 -2
- Misses 234 236 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
* Add design doc for SQ8↔FP16 SIMD x86 kernels [MOD-14954]
Captures the architecture, file-level plan, CMake F16C gating, and risk
register for adding AVX-512 / AVX2+FMA / AVX2 / SSE4 kernels for the
asymmetric SQ8 (storage) ↔ FP16 (query) distance functions, wiring them
into the existing dispatcher tables and SQ8_FP16 unit/benchmark
scaffolding from MOD-15141.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Append -mf16c to AVX2_FMA/AVX2/SSE4 dispatcher sources [MOD-14954]
Enables _mm{,256}_cvtph_ps in the AVX2+FMA, AVX2, and SSE4 dispatcher
translation units so the upcoming SQ8↔FP16 kernels can widen FP16 lanes
to FP32. The flag is appended only when CXX_F16C is detected; existing
SQ8_FP32 / SQ8_SQ8 / INT8 / UINT8 sources contain no F16C intrinsics so
emitted code for those kernels is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add SQ8_FP16_SpacesOptimizationTest skeleton [MOD-14954]
Parameterised gtest fixture mirroring SQ8_FP32_SpacesOptimizationTest;
currently asserts only the scalar fallback path. Per-tier SIMD
assertion blocks (AVX-512, AVX2+FMA, AVX2, SSE4) are added alongside
the kernel implementations in subsequent commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add AVX-512 SQ8↔FP16 SIMD distance kernels [MOD-14954]
Implements asymmetric SQ8 (storage) ↔ FP16 (query) Inner Product,
Cosine, and L2² kernels for the AVX-512 F+BW+VL+VNNI tier. Each chunk
widens 16 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps and 16 FP16 lanes
via _mm512_cvtph_ps, then fmadds into a 16-lane FP32 accumulator. SQ8
storage and FP16 query metadata reads use load_unaligned to tolerate
odd dimensions. Dispatcher branches in IP_space.cpp / L2_space.cpp
select the new Choose_SQ8_FP16_*_implementation_AVX512F_BW_VL_VNNI
when features.avx512f && features.avx512bw && features.avx512vl &&
features.avx512vnni; otherwise behaviour is unchanged from MOD-15141.
A parameterised gtest fixture exercises every residual class in
[16, 32] against the scalar baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add AVX2+FMA SQ8↔FP16 SIMD distance kernels [MOD-14954]
8-wide AVX2+FMA kernels widen 8 SQ8 lanes via cvtepu8_epi32 +
cvtepi32_ps and 8 FP16 lanes via _mm256_cvtph_ps, then fmadd into a
256-bit FP32 accumulator. Residual (< 8) lanes load the full 16-byte
FP16 block, convert, then blend zero across unused lanes — mirroring
the existing F16C FP16 kernel pattern. Dispatcher branch in
{IP,Cosine,L2}_SQ8_FP16_GetDistFunc selects the new
Choose_SQ8_FP16_*_implementation_AVX2_FMA when features.avx2 &&
features.fma3 && features.f16c.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add AVX2 (no FMA) SQ8↔FP16 SIMD distance kernels [MOD-14954]
Mirrors the AVX2+FMA kernels but uses _mm256_mul_ps + _mm256_add_ps
instead of _mm256_fmadd_ps so it can run on Haswell-era AVX2 hardware
without FMA support (uncommon but matches the existing SQ8_FP32
tiering). Dispatcher gate requires features.avx2 && features.f16c
and runs between the AVX2+FMA and SSE4 tiers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add SSE4+F16C SQ8↔FP16 SIMD distance kernels [MOD-14954]
4-wide SSE4 kernels widen 4 SQ8 lanes via cvtepu8_epi32 + cvtepi32_ps
and 4 FP16 lanes via _mm_cvtph_ps (F16C), then mul+add into a 128-bit
FP32 accumulator (SSE4 has no FMA). Residual % 4 lanes are materialised
via _mm_set_ps + the scalar FP16_to_FP32 helper, mirroring the existing
SSE4 SQ8_FP32 residual pattern. Dispatcher gate requires
features.sse4_1 && features.f16c && features.avx since F16C is
VEX-encoded — matches the existing F16C/FP16 dispatcher gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Update SQ8_FP16 dispatcher assertions to walk SIMD tiers [MOD-14954]
The SQ8_FP16 GetDistFunc dispatcher now returns AVX-512 / AVX2+FMA /
AVX2 / SSE4 SIMD kernels when the corresponding feature flags are set
(only scalar previously). Updates the GetDistFunc_*_SQ8_FP16 asserts
to compute the expected function for the host's highest supported tier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Register per-ISA SQ8↔FP16 microbenchmarks [MOD-14954]
Adds AVX-512 / AVX2+FMA / AVX2 / SSE4 benchmark registrations to
bm_spaces_sq8_fp16.cpp, mirroring the SQ8_FP32 layout. Gates each tier
on the corresponding OPT_* defines plus the runtime feature checks
that mirror the dispatcher in IP_space.cpp / L2_space.cpp.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Reformat SQ8↔FP16 SIMD kernels for consistent line breaks
* Address PR review findings for SQ8↔FP16 x86 kernels [MOD-14954]
- CMake: gate `-mf16c` on CXX_F16C AND CXX_FMA AND CXX_AVX (matches OPT_F16C
macro) and append `-mavx` to the SSE4 dispatcher when adding -mf16c, since
F16C is VEX-encoded and requires AVX state. Mirrors the existing F16C.cpp
recipe and prevents miscompiles on toolchains with F16C but without AVX.
- IP_SSE4_SQ8_FP16.h: replace `*reinterpret_cast<const int32_t *>(pVect1)`
with `load_unaligned<int32_t>(pVect1)` to remove strict-aliasing UB on
the uint8_t SQ8 lane load.
- IP_AVX2{,_FMA}_SQ8_FP16.h: improve the residual-mask comment to spell out
the asymmetric-mask reasoning (SQ8 unmasked is safe because the FP16
query blend forces those FP32 query lanes to 0 → garbage·0=0).
- IP_AVX{512,2,2_FMA,SSE4}_SQ8_FP16.h: add the `IP = min·y_sum + delta·Σ(q·y)`
algebraic-identity comment header that AVX-512 already carried, plus a
precondition note that callers must enforce dim >= 16 (matches the
established SQ8_FP32 convention; no runtime assert because sibling
SQ8_FP32 SIMD kernels also rely on the dispatcher gate).
- test_spaces.cpp: route the SQ8_FP16 edge-case tests (ZeroQuery,
ConstantStorage, MixedSignQuery) through {IP,Cosine,L2}_SQ8_FP16_GetDistFunc
so the runtime-selected SIMD tier is actually exercised on those inputs,
not just the scalar reference.
- test_spaces.cpp: add SQ8_FP16_SIMD_HighDim suite with dims {64, 128, 256,
512, 1024} so multi-iteration do-while loop bugs would fire (the existing
[16, 32] range covers at most two AVX-512 chunk iterations).
- test_spaces.cpp: add SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised — a
single test that emits per-tier coverage to stderr and GTEST_SKIPs when
no SIMD tier is available, so CI runners without AVX-512 do not silently
report zero tier-1 coverage.
- test_spaces.cpp: scalar-fallback `alignment` checks now seed the value
with 0xFF and assert it remains 0xFF, verifying the dispatcher contract
("scalar leaves caller's value untouched") instead of just measuring
that the variable's pre-zeroed init survived.
- test_spaces.cpp: drop the stale MOD-15152/MOD-15153 wiring-TODO comment
on SQ8_FP16_NoOptimizationSpacesTest now that the SIMD tiers are wired.
- bm_spaces_sq8_fp16.cpp: drop the matching stale comment.
Out of scope (separate ticket): two-accumulator FMA refactor (also affects
SQ8_FP32) and the SSE4 residual `_mm_cvtph_ps` perf opportunity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add multi-accumulator ILP to SQ8↔FP16 x86 kernels [MOD-14954]
Break the FMA / mul+add dependency chain in all four SQ8↔FP16 IP kernels
by widening the inner loop to use multiple independent accumulators.
L2 kernels inherit the change through their `…InnerProductImp_…` call.
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h: 1 → 4 accumulators, unroll-4 main loop
(64 lanes/iter) with a 16-lane tail for the 0..3 remaining chunks.
- IP_AVX2_FMA_SQ8_FP16.h, IP_AVX2_SQ8_FP16.h: 1 → 2 accumulators; the
existing 2-step unrolled body now routes each step to an independent
accumulator. The `residual >= 8` half-chunk feeds the second accumulator
so the prologue also breaks the dependency chain.
- IP_SSE4_SQ8_FP16.h: 1 → 2 accumulators; do-while unrolled 1 → 2 steps
per iteration (4 → 8 lanes/iter). Residual-ladder steps alternate
between sum_a and sum_b for prologue ILP.
Correctness invariant: residual block consumes exactly `residual` lanes
(0..15) → remaining tail is always a multiple of 16, so the unrolled
loops (multiples of 8 / 16 / 64) terminate exactly. Verified by 131
SQ8_FP16 unit tests + 115 under ASan.
* Drop misleading VNNI suffix from SQ8↔FP16 AVX-512 kernel [MOD-14954]
The SQ8↔FP16 AVX-512 kernel does not actually issue any VNNI instruction
— the inner loop is FP32 FMA (`_mm512_fmadd_ps`) over lanes widened from
SQ8 (`_mm512_cvtepu8_epi32` + `_mm512_cvtepi32_ps`) and FP16
(`_mm512_cvtph_ps`). Real VNNI use would require an integer-encoded
query, which is a different kernel entirely.
The file/function names are renamed to match what the kernel actually
uses (AVX-512F). The dispatcher .cpp/.h files stay named after the
runtime tier (AVX512F_BW_VL_VNNI) since the SQ8↔FP16 kernel still
registers under that tier alongside the genuinely VNNI-using SQ8↔SQ8 /
INT8 / UINT8 kernels — the gate is a CPU-feature gate, not an ISA claim.
The same misnomer exists for SQ8↔FP32; tracked separately so the rename
there can ship as its own commit.
Also: fix a strict-aliasing-class UB introduced by the AVX-512 unroll-4
loop. `while (pVec1 + 64 <= pEnd1)` forms a pointer past one-past-end of
the SQ8 storage object when fewer than 64 lane bytes remain, which is UB
in C++ regardless of dereference. Switched to pointer subtraction
(`static_cast<size_t>(pEnd1 - pVec1) >= 64`).
Renames:
- IP_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> IP_AVX512F_SQ8_FP16.h
- L2_AVX512F_BW_VL_VNNI_SQ8_FP16.h -> L2_AVX512F_SQ8_FP16.h
- SQ8_FP16_{InnerProduct,Cosine,L2Sqr}SIMD16_AVX512F_BW_VL_VNNI -> _AVX512F
- Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F_BW_VL_VNNI -> _AVX512F
Verified: 131 SQ8_FP16 unit tests + 115 under ASan.
* Remove SQ8↔FP16 design doc from PR [MOD-14954]
Design doc was added in ad941b8 for planning; not appropriate as
a long-lived in-repo artifact. Keep externally (Confluence / scratch)
rather than ship with the kernel commit.
* Simplify SQ8↔FP16 tests to match sister conventions [MOD-14954]
Two trims, both restoring pre-existing patterns elsewhere in the file:
1. `GetDistFuncSQ8FP16Asymmetric` had grown a runtime SIMD-tier walk that
duplicated coverage already provided by `SQ8_FP16_SpacesOptimizationTest`.
Reduced to the bare dispatcher-equality check used by the FP32 / SQ8↔SQ8
sister tests at lines 540-548 and 551-559.
2. The `SQ8_FP16_EdgeCases` tests (`ZeroQueryTest`, `ConstantStorageTest`,
`MixedSignQueryTest`) were routed through
`{IP,Cosine,L2}_SQ8_FP16_GetDistFunc(dim, nullptr)` to force runtime SIMD
dispatch on adversarial inputs. Reverted to direct scalar calls
(`SQ8_FP16_InnerProduct`, etc.) — the original pre-fdc5c1cd shape.
Coverage rationale: the SIMD kernels are branchless on input values
(verified by grep — no value-dependent `if` in any tier). Every code
path is therefore exercised by `SQ8_FP16_SpacesOptimizationTest`'s
random inputs at multiple dims. The edge-case tests verify the
*algebraic identity* (IP of zero query = 1.0, constant storage matches
dequant baseline, mixed-sign handling) — scalar correctness on these
inputs is what was actually being checked, and the SIMD path matches
scalar via the SpacesOptimizationTest tier walk.
Net: 77 lines removed from the test file, matches sister conventions,
no coverage gap.
* Split SQ8↔FP16 F16C kernels into sibling TUs [MOD-14954]
The SQ8↔FP16 kernels in the SSE4, AVX2, and AVX2+FMA tiers depend on
F16C (`_mm_cvtph_ps` / `_mm256_cvtph_ps`), while every other kernel in
those dispatcher TUs is F16C-clean. The previous arrangement mixed
both under `#ifdef OPT_F16C` blocks inside the base dispatcher .cpp/.h
files.
Split each tier's F16C-dependent kernels off into a sibling TU:
functions/SSE4.cpp → SSE4 + SQ8_FP32 (no F16C)
functions/SSE4_F16C.cpp → SQ8_FP16 only (requires -mavx -mf16c)
functions/AVX2.cpp → AVX2 + BF16 + SQ8_FP32 (no F16C)
functions/AVX2_F16C.cpp → SQ8_FP16 only (requires -mf16c)
functions/AVX2_FMA.cpp → SQ8_FP32 (no F16C)
functions/AVX2_FMA_F16C.cpp → SQ8_FP16 only (requires -mf16c)
The AVX-512 tier is unaffected — its SQ8_FP16 kernel uses
`_mm512_cvtph_ps`, which is part of AVX-512F and not F16C.
CMake now compiles each sibling TU conditionally on
`_has_full_f16c` and applies the F16C flags only there. Base TUs no
longer carry `-mf16c`, since they no longer reference F16C intrinsics.
Result:
- No `#ifdef OPT_F16C` directives in `functions/*.cpp` or `functions/*.h`.
- Compile-time isolation: an F16C intrinsic accidentally added outside
a `_F16C` sibling will fail to build, not silently miscompile.
- Caller sites (`IP_space.cpp`, `L2_space.cpp`, `test_spaces.cpp`,
`bm_spaces.h`) still gate the *calls* with `#ifdef OPT_F16C`; the new
sibling .h includes are unconditional, since declarations alone don't
link-error and the calls remain guarded.
Verified: 131 SQ8_FP16 unit tests + 115 ASan + 1166 full test_spaces
suite (covers other ISA tiers SQ8_FP32 / BF16 / INT8 / UINT8 to confirm
no regression from the dispatcher restructure).
* Move SQ8↔FP16 AVX-512 dispatch to AVX512F tier + flatten F16C guards [MOD-14954]
Two related cleanups in the SQ8↔FP16 dispatch path:
1. The AVX-512 SQ8↔FP16 kernel only uses AVX-512F instructions
(`_mm512_cvtph_ps`, `_mm512_fmadd_ps`, etc.) but was registered under
the VNNI tier (`OPT_AVX512_F_BW_VL_VNNI` + check of avx512f/bw/vl/vnni).
That meant CPUs with AVX-512F but no VNNI (Skylake-X, some Cascade Lake
variants, etc.) would fall through to AVX2_FMA even though they can
run the AVX-512 kernel.
Moved the `Choose_SQ8_FP16_{IP,Cosine,L2}_implementation_AVX512F`
definitions from `functions/AVX512F_BW_VL_VNNI.cpp` to
`functions/AVX512F.cpp`, with matching header reshuffle. Dispatch
sites now gate on `OPT_AVX512F` + `features.avx512f`.
2. F16C is a transversal requirement across the non-AVX-512 SQ8↔FP16
tiers (SSE4, AVX2, AVX2+FMA) — every one of them widens FP16 query
lanes via `vcvtph2ps`. Per-tier nested `#ifdef OPT_F16C` was hoisted
into a single outer block around the three ISA branches in
`IP_SQ8_FP16_GetDistFunc`, `Cosine_SQ8_FP16_GetDistFunc`, and
`L2_SQ8_FP16_GetDistFunc`.
Verified: 131 SQ8_FP16 release + 115 ASan + 1166 full test_spaces suite.
* Clean up whitespace and formatting inconsistencies
Remove extraneous blank lines in SSE4 and AVX2_FMA source files, fix indentation in AVX512F SQ8_FP16 function signatures, and reformat benchmark macro invocation to fit line length conventions.
* Remove obsolete SQ8-to-FP16 dispatch comments
The comments referencing SQ8-to-FP16 dispatch location are no longer accurate after the recent refactoring that moved the dispatch logic. Clean up these stale comments from the AVX512F_BW_VL_VNNI files.
* Hoist OPT_F16C guard around lower SIMD tiers in SQ8↔FP16 tests [MOD-14954]
Mirrors the dispatcher layout in IP_space.cpp / L2_space.cpp where a single
OPT_F16C guard wraps the AVX2+FMA, AVX2, and SSE4 branches. Each test body
(L2/IP/Cosine) and the TierCoverage report now use the same single-guard
shape. Also retargets the TierCoverage AVX-512 check from
OPT_AVX512_F_BW_VL_VNNI to OPT_AVX512F, matching the dispatcher's new
AVX-512F-only gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Drop non-idiomatic SQ8↔FP16 tier-coverage reporter test [MOD-14954]
SQ8_FP16_SIMD_TierCoverage.ReportTiersExercised was an outlier — no
other data type has a std::cerr-based coverage reporter. Per-tier
coverage is already provided by SQ8_FP16_SpacesOptimizationTest (which
walks AVX-512 → AVX2+FMA → AVX2 → SSE4 → scalar by clearing feature
flags), and ISA-lane presence is handled by the CI matrix, matching
the convention used by every other type's SpacesOptimizationTest.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Simplify SQ8↔FP16 kernels and trim PR churn [MOD-14954]
- AVX512F IP: keep the <=3 tail chunks on distinct accumulators
(sum0/sum1/sum2) instead of serializing into one, preserving ILP when
the main 64-lane loop runs few or zero times.
- Condense kernel header comments; drop redundant float16.h/alignment.h
includes (pulled in transitively) and the direct <immintrin.h> include
(provided via space_includes.h, matching the other AVX512F headers).
- test_spaces: align the SQ8_FP16 scalar-fallback alignment assertion with
the convention used by the other SpacesOptimizationTest suites.
- Revert unrelated CMake message/quote churn on the base AVX2/SSE4 TUs and
the stray blank line in AVX512F_BW_VL_VNNI.cpp, leaving only the additive
F16C build blocks in this PR.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Document why OPT_F16C differs from the other OPT_* macros [MOD-14954]
Explain at the definition site that OPT_F16C is a cross-cutting capability
gate (not a 1:1 dispatch tier), why it is a compound CXX_F16C/FMA/AVX guard
(F16C is VEX-encoded and needs AVX state), and why the AVX-512 SQ8<->FP16
path stays outside it (_mm512_cvtph_ps is part of AVX512F).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cover AVX512 three-chunk tail and dim<16 dispatcher guard in SQ8_FP16 tests [MOD-14954]
Codecov flagged 4 uncovered lines on PR #970:
- The AVX512F `remaining >= 48` third tail step in IP_AVX512F_SQ8_FP16.h was
never executed: the test dims never satisfied (dim / 16) % 4 == 3. Add 48
(zero main-loop iterations) and 112 (one main-loop iteration) to exercise it.
- The `dim < 16` scalar early-return in the IP/Cosine/L2 SQ8_FP16 dispatchers
was never taken. Assert the three dispatchers return the scalar funcs at dim 8.
Test-only change. Local release + ASan: SQ8_FP16 137/137, ASan clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6f6ef26 to
4ac05ac
Compare
Stacked on PR #970 (MOD-14954 x86 kernels). Mirrors x86 structure onto NEON_HP / SVE / SVE2 tiers. Zero CMake changes; reuses existing ARM TU compile flags. Scalar fallback already on main serves as reference. Bakes in PR #970 review lessons (assert(dim>=16), 4-accumulator ILP, formula anchor, load_unaligned<float> metadata, dispatcher-routed tier-walk tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 bite-sized tasks following the spec at 2026-05-28-arm-sq8-fp16-design.md. Each task ends in a commit; assistant runs tests/ASan/benchmarks after the user confirms each ARM build cycle. Zero CMake changes; PR stacks on #970. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…OD-14972] The 9 ARM tier blocks (L2/IP/Cosine × SVE2/SVE/NEON_HP) were missing ASSERT_EQ(alignment, 0) after each ASSERT_NEAR, unlike the SQ8_FP32 sister blocks which assert it. Adds the assertions to lock the contract that ARM tiers leave the caller's alignment value untouched.
…D-14972] svcvt_f32_f16_x (FCVT) reads even-indexed FP16 elements: FP32[e] ← FP16[2e]. The step function loaded chunk consecutive FP16 values into positions 0..chunk-1, then passed them directly to svcvt_f32_f16_x, which picked positions 0,2,4,... and silently skipped positions 1,3,5,... For chunk=4 (128-bit SVE), only 2 of 4 FP16 values per step were used, producing wrong dot products. Fix: svzip1_f16(q_h, zeros) spreads values to even positions [v0,0,v1,0,...] so FCVT correctly reads v[0],v[1],v[2],... Applied to both the full step helper and the partial-chunk path. Discovered and fixed during ARM host verification (Task 14, MOD-14972).
…D-14972] SVE hot loop: replace svzip1_f16+svdup_f16+svwhilelt_b16 (4 ops) with svld1uh_u32 (1 op) — zero-extends each FP16 halfword into a 32-bit lane so svcvt_f32_f16_x reads the correct bits directly. Same fix applied to the partial-chunk path, which also drops the now-redundant pg16_partial predicate. Accumulator combine changed from svadd_f32_x to svadd_f32_z to match the SQ8_FP32 SVE sister. NEON residual: replace the single 8-lane block + up-to-7 software-scalar iterations with three independent 4-lane sub-steps (r>=4, r>=8, r>=12), leaving at most 3 elements for scalar — mirrors the SQ8_FP32 NEON sister exactly. Eliminates expensive vecsim_types::FP16_to_FP32 calls for residuals 4..15 (previously up to 7 software conversions per call). Both IP headers: remove assert()+<cassert> (no sister kernel uses them). Both L2 headers: drop redundant float16.h include and using declarations (arrive transitively through the included IP header).
…MOD-14972] - Remove docs/superpowers/ design and plan files (~1550 lines); sister PR #970 removed its equivalent doc before merge. - Drop 5-line "No alignment write" prose comment from the three AArch64 NEON_HP dispatcher blocks; the sister SQ8_FP32 ARM dispatchers carry no such comment — the absent alignment write already encodes the intent. - Trim GetDistFuncSQ8FP16Asymmetric to a 7-line template-mapping check at dim=15, matching the shape of GetDistFuncSQ8Asymmetric (SQ8_FP32 sister). The scalar-fallback assertion it previously duplicated is already covered by the trailing block of SQ8_FP16_SpacesOptimizationTest.
4ac05ac to
e1647dc
Compare
SQ8↔FP16 ARM kernels: proposed optimizations + benchmark resultsI profiled the ARM SQ8↔FP16 distance kernels from this PR and prototyped two optimizations, benchmarked head-to-head against this PR's code on a Graviton-4 (Neoverse V2) arm64 runner via To keep the comparison clean, both runs are identical except for the kernel changes, and both register
Speedup = baseline CPU-time ÷ treatment CPU-time (>1 = faster). 36 dimension points per tier.
What changed
Why this is trustworthyThe untouched base SVE tier reads 1.00×, confirming the comparison is clean (identical code → identical perf) — so the gains are attributable to the kernels, not run-to-run noise. No regressions anywhere (min speedup ≥ 1.00 in every tier). Happy to fold these into this PR if you'd like — the diffs are small and isolated to the IP/L2 SVE2 headers + the NEON_HP step + the SVE2 chooser. |
The bm_spaces_sq8_fp16 executable is built but was never emitted by benchmarks.sh, so no CI label (bm-spaces / benchmarks-all) would run it. Register it in bm-spaces, bm-spaces-sq8-full, benchmarks-all and benchmarks-default, and add a dedicated bm-spaces-sq8-fp16 case.
…MLALT kernel NEON_HP: widen SQ8 storage uint8->fp16->fp32 via vcvtq_f16_u16 (values 0..255 are exact in FP16), dropping two integer-widening ops per 16-element chunk with identical FP32 lane values. SVE2: dedicated kernel keeping storage+query at 16-bit and using the FMLALB/FMLALT widening multiply-accumulate pair (svmlalb_f32/svmlalt_f32). Processes svcnth() lanes/step (2x the base-SVE svcntw() granularity) and removes explicit query widening/conversion, roughly halving hot-loop loads and instructions. Wired into SVE2.cpp IP/Cosine/L2 choosers at svcnth granularity.
…-14972] Add an asimdfhm-gated NEON_FHM tier for SQ8<->FP16 IP / L2 / Cosine. Instead of widening both operands to FP32 and issuing vfmaq_f32 (the NEON_HP path), it uses vfmlalq_low/high_f16 (FMLAL/FMLAL2) to multiply the FP16 lanes directly into FP32 accumulators, removing all 8 vcvt_f32_f16 per 16 lanes. SQ8 storage is widened uint8->fp16 (exact for 0..255) and the FP16 query is consumed in place. FMLAL widens fp16->fp32 before the multiply, so accuracy matches the scalar baseline. Dispatchers prefer NEON_FHM over NEON_HP when features.asimdfhm is set. The IP core is templated on use_fhm so L2/Cosine and the residual tail are shared. Tier-walk unit tests and microbenchmarks cover the new path. Measured ~1.95x over NEON_HP at high/medium dims on asimdfhm hardware. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3118914 to
1472684
Compare
…m-kernels-mod-14972 # Conflicts: # src/VecSim/spaces/IP_space.cpp # src/VecSim/spaces/L2_space.cpp # tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp # tests/unit/test_spaces.cpp
The merge-base changed after approval.
Describe the changes in the pull request
Add asymmetric SQ8↔FP16 SIMD distance kernels (IP, L2, Cosine) for ARM tiers: NEON_HP, SVE, SVE2. Stacked on PR #970 (MOD-14954), which delivers the x86 equivalents.
The SVE hot loop uses
svld1uh_u32to zero-extend each FP16 halfword into a 32-bit lane, allowingsvcvt_f32_f16_xto read the correct bits directly. The NEON residual mirrors the SQ8_FP32 NEON sister: three independent 4-lane sub-steps (r>=4/8/12) leaving at most 3 elements for scalar, replacing the previous single 8-lane block + up-to-7 software conversions.Additional improvements (co-authored by @lerman25, originally from PR #975):
u8→u16→u32→f32path (twovmovl_u16+vcvtq_f32_u32) with the shorteru8→u16→f16→f32path (vcvtq_f16_u16+vcvt_f32_f16), reducing instruction count in the main loop.IP_SVE2_SQ8_FP16.h,L2_SVE2_SQ8_FP16.h): usesvmlalb_f32/svmlalt_f32(SVE2-only widening multiply-accumulate) for a tighter inner loop vs the generic SVE path.spaces_sq8_fp16is now wired into all relevant benchmark categories (benchmarks-all,benchmarks-default,bm-spaces-sq8-full,bm-spaces, and the newbm-spaces-sq8-fp16category).Which issues this PR fixes
Main objects this PR modified
src/VecSim/spaces/IP/IP_NEON_SQ8_FP16.h— NEON_HP IP kernel (+ widening optimization)src/VecSim/spaces/IP/IP_SVE_SQ8_FP16.h— SVE IP kernelsrc/VecSim/spaces/IP/IP_SVE2_SQ8_FP16.h— new dedicated SVE2 IP kernelsrc/VecSim/spaces/L2/L2_NEON_SQ8_FP16.h— NEON_HP L2 kernelsrc/VecSim/spaces/L2/L2_SVE_SQ8_FP16.h— SVE L2 kernelsrc/VecSim/spaces/L2/L2_SVE2_SQ8_FP16.h— new dedicated SVE2 L2 kernelsrc/VecSim/spaces/functions/NEON_HP.{h,cpp},SVE.{h,cpp},SVE2.{h,cpp}— chooser symbolssrc/VecSim/spaces/IP_space.cpp,L2_space.cpp— AArch64 dispatcher blockstests/unit/test_spaces.cpp— tier-walk tests for NEON_HP / SVE / SVE2tests/benchmark/spaces_benchmarks/bm_spaces_sq8_fp16.cpp— ARM microbench registrationstests/benchmark/benchmarks.sh— benchmark category registration forspaces_sq8_fp16Mark if applicable
Note
Medium Risk
Touches hot-path distance math with ISA-specific numerics and feature-gated dispatch; mitigated by unit/benchmark coverage but wrong kernel choice could skew search results.
Overview
Adds ARM SIMD for asymmetric SQ8 storage ↔ FP16 query distances (inner product, L2², cosine), complementing existing x86 paths.
New kernels cover NEON_HP (with optional FHM widening-FMA), SVE (FP16 loads via
svld1uh_u32), and a dedicated SVE2 path using FMLALB/FMLALT atsvcnth()granularity. L2 reuses the IP cores via the usual sum-of-squares identity. NEON also shortens SQ8 widening (u8→f16→f32) and reshapes the residual tail to match the SQ8↔FP32 NEON pattern.Runtime dispatch on AArch64 (
dim ≥ 16): SVE2 → SVE →asimdfhm→asimdhp, wired through choosers inIP_space.cpp/L2_space.cppand the NEON/SVE/SVE2 function modules.Tests/benchmarks: unit tests walk each ARM tier against the scalar baseline;
spaces_sq8_fp16is registered in benchmark scripts and ARM variants are added tobm_spaces_sq8_fp16.cpp.Reviewed by Cursor Bugbot for commit f798f5c. Bugbot is set up for automated code reviews on this repo. Configure here.