optimized: add NEON grid_sampler_2d.out and Vectorized<float> sum.IntList_out #4

Merged
jgibson2 merged 4 commits into polycam from jgibson/neon-custom-kernels on Apr 24, 2026

Conversation

@jgibson2 (Collaborator) commented Apr 23, 2026

### Summary

This PR has been restructured. Previously it added NEON kernels via a separate `custom_kernels` static library and `kernels/custom/` source directory. That approach worked but didn't fit the existing `kernels/optimized/` integration pattern. This version ports both kernels to be proper optimized kernels, registered in `optimized.yaml` alongside `opt_add`, `opt_gelu`, etc., and sourced from `kernels/optimized/cpu/`.

Also opened as pytorch/executorch#19119 — this PR can be retired once that lands and is synced into polycam/main.

Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op                                     | Portable | This PR | Speedup |
| -------------------------------------- | -------- | ------- | ------- |
| `grid_sampler_2d.out`                  | 17.3 ms  | 3.4 ms  | 5.1x    |
| `sum.IntList_out` (5 calls, aggregate) | 3.0 ms   | 0.56 ms | 5.4x    |

End-to-end `forward()` p50 on a 9.4 MB polycam depth model: 94 ms → 97 ms (within noise).

### `grid_sampler_2d.out`

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back on store, which avoids fp16 catastrophic cancellation in `(ix_se - ix)`-style weight subtractions. Unsupported modes and non-aarch64 targets delegate to the portable kernel.
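
For concreteness, here is a minimal sketch of the inner-loop shape described above (not the PR's actual code). It assumes fp32 NCHW data with corner offsets, bilinear weights, and zeros padding already resolved by the caller; the real kernel additionally promotes fp16 on load and casts back on store. `gather4` and `bilinear4` are hypothetical names:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Load one corner pixel's value for four consecutive channels.
// In NCHW the channel stride is H*W, so this is four lane loads.
static inline float32x4_t gather4(
    const float* base, int64_t chan_stride, int64_t spatial_offset) {
  float32x4_t v = vdupq_n_f32(0.f);
  v = vld1q_lane_f32(base + 0 * chan_stride + spatial_offset, v, 0);
  v = vld1q_lane_f32(base + 1 * chan_stride + spatial_offset, v, 1);
  v = vld1q_lane_f32(base + 2 * chan_stride + spatial_offset, v, 2);
  v = vld1q_lane_f32(base + 3 * chan_stride + spatial_offset, v, 3);
  return v;
}

// One output pixel, four channels at a time: an FMA chain over the four
// bilinear corners, each scalar weight broadcast across the channel lanes.
static inline void bilinear4(
    const float* in, float* out, int64_t in_cs, int64_t out_cs,
    int64_t off_nw, int64_t off_ne, int64_t off_sw, int64_t off_se,
    float w_nw, float w_ne, float w_sw, float w_se) {
  float32x4_t acc = vmulq_n_f32(gather4(in, in_cs, off_nw), w_nw);
  acc = vfmaq_n_f32(acc, gather4(in, in_cs, off_ne), w_ne);
  acc = vfmaq_n_f32(acc, gather4(in, in_cs, off_sw), w_sw);
  acc = vfmaq_n_f32(acc, gather4(in, in_cs, off_se), w_se);
  // Output is NCHW too, so the four channel results are strided stores.
  vst1q_lane_f32(out + 0 * out_cs, acc, 0);
  vst1q_lane_f32(out + 1 * out_cs, acc, 1);
  vst1q_lane_f32(out + 2 * out_cs, acc, 2);
  vst1q_lane_f32(out + 3 * out_cs, acc, 3);
}
```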

### `sum.IntList_out`

`at::vec::Vectorized`-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD; always accumulates in fp32. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable. ~2x faster than a handwritten-NEON equivalent, because the compiler optimizes the fp16→fp32 gather better than a manual temp-buffer implementation.
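
A sketch of what the contiguous case of that fast path can look like (the include path is an assumption about how the vendored vec headers are exposed; the real kernel also handles the strided case and converts fp16 to fp32 on load):

```cpp
#include <ATen/cpu/vec/vec.h>  // assumed include path for Vectorized<float>
#include <cstdint>

using Vec = at::vec::Vectorized<float>;

// Sum a contiguous run of n floats, accumulating lane-wise in fp32.
float sum_contiguous(const float* data, int64_t n) {
  constexpr int64_t kW = Vec::size();  // lanes per vector (4 on NEON)
  Vec acc(0.f);
  int64_t i = 0;
  for (; i + kW <= n; i += kW) {
    acc = acc + Vec::loadu(data + i);  // unaligned load + lane-wise add
  }
  // Horizontal reduction of the lanes, then a scalar tail.
  float lanes[kW];
  acc.store(lanes);
  float total = 0.f;
  for (int64_t k = 0; k < kW; ++k) total += lanes[k];
  for (; i < n; ++i) total += data[i];
  return total;
}
```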

### Integration

  • Sources added to `OPTIMIZED_KERNELS_SRCS` in `build_variables.bzl` and to `OPTIMIZED_ATEN_OPS` in `op_registration_util.bzl` — single source of truth for both Buck and CMake builds.
  • `optimized.yaml` registers the ops with the standard `opt_*` naming convention.
  • `kernels/optimized/CMakeLists.txt` scopes the `-march=armv8.2-a+fp16` flag to just `op_grid_sampler_2d.cpp` via `set_source_files_properties`, so x86_64 builds are unaffected; the kernel-side gating this pairs with is sketched just below.
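
A minimal, self-contained illustration of that compile-time guard plus runtime fallback pattern (made-up function names, not the PR's code):

```cpp
#include <cstddef>

// Stand-in for the portable reference path: correct for every input.
float reduce_portable(const float* in, size_t n) {
  float s = 0.f;
  for (size_t i = 0; i < n; ++i) s += in[i];
  return s;
}

#if defined(__aarch64__)
#include <arm_neon.h>
// Stand-in for the NEON fast path, only compiled on arm64.
float reduce_neon(const float* in, size_t n) {
  float32x4_t acc = vdupq_n_f32(0.f);
  size_t i = 0;
  for (; i + 4 <= n; i += 4) acc = vaddq_f32(acc, vld1q_f32(in + i));
  float s = vaddvq_f32(acc);  // horizontal add across the 4 lanes
  for (; i < n; ++i) s += in[i];
  return s;
}
#endif

float reduce(const float* in, size_t n, bool inputs_supported) {
#if defined(__aarch64__)
  if (inputs_supported) {
    return reduce_neon(in, n);  // fast path
  }
#endif
  return reduce_portable(in, n);  // unsupported inputs / non-arm64 targets
}
```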

### Included `preset.cmake` fix

Carries a one-char fix (quote `${DESCRIPTION}` in the `set(...CACHE...)` call) that's also in PR #2, so this branch builds independently of #2's merge order. Drops to a no-op once #2 lands.

### Test plan

  • Builds cleanly for Android arm64-v8a and Android x86_64 (via `scripts/build_android_library.sh`).
  • End-to-end Pixel 9 latency on the polycam depth model matches the previous handwritten-NEON integration (within run-to-run noise) while producing more accurate fp16 outputs.
  • The 5.1x / 5.4x op-level speedups are reproducible via `ETDump` analysis.

🤖 Generated with Claude Code

@jgibson2 changed the base branch from main to polycam on April 24, 2026 15:44

@meredithbayne left a comment

Before you merge, could you double-check the rounding for the fp16 and fp32 paths? I'm not sure whether fp16 grid_sampler has divergent weight rounding between the NEON body and the scalar tail (the tail rounds weights after the multiply; the body rounds before).

Comment thread on kernels/custom/neon_grid_sampler_2d.cpp (outdated)

// We only optimize bilinear + zeros — the only mode our model uses.
// Other modes fall through to the scalar path.
ET_KERNEL_CHECK(

Do we need any of the other checks here, like the dim order check?


@jgibson2 (Collaborator, Author) replied:

Added

@jgibson2 (Collaborator, Author) commented:

Pushed two follow-up commits:

  • 995f6e3 — NEON fp16 grid_sampler now accumulates in fp32 internally (fp16 loads/stores, fp32 FMA chain). Fixes catastrophic cancellation on weight computation that was masking real precision divergence from the portable reference. Unmeasurable perf delta; max_abs vs fp32-reference drops ~0.1 → 0.
  • 0c2ba68 — adds a standalone on-device verify binary (verify_custom_kernels, opt-in via -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON) that cross-checks both NEON kernels against an fp32 reference across the shapes the polycam depth model uses. All 12 test cases pass; output prints both absolute and relative-off-zero divergence per case.

Reference for the fp16 tests is portable run on up-cast fp32 inputs, then down-cast to fp16. This keeps the verifier meaningful regardless of whether the portable-fp16 precision fix (pytorch#19117) has landed. Once that fix is merged and synced into the branch, both sides will converge further without a test-harness change.

optimized: add NEON grid_sampler_2d.out and Vectorized<float> sum.IntList_out

Two new optimized CPU kernels registered alongside the existing
optimized_kernels library. Both replace the portable reference kernel
(still available as fallback for unsupported inputs) with a vectorized
implementation that accumulates in fp32, avoiding the fp16 precision
issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes
representative of the model's hot path):

| Op                               | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out              | 17.3 ms  | 3.4 ms  | 5.1x    |
| sum.IntList_out (5 calls, total) | 3.0 ms   | 0.56 ms | 5.4x    |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per
iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32
for weight computation and accumulation, then cast back on store — the
portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise
suffer catastrophic cancellation. Unsupported modes and non-aarch64
targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized<float>-based implementation of the single-dim
reduction fast path (both innermost-contiguous and strided cases).
Cross-architecture SIMD via PyTorch's existing vector abstraction;
accumulates in fp32 regardless of input dtype. Multi-dim reductions,
dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to
  OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of
  truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming
  convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16
  flag to just op_grid_sampler_2d.cpp via set_source_files_properties,
  so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__
  guards and falls through to portable on non-arm64 targets.

Same one-char fix as pytorch#19117 (and our PR #2): the
DESCRIPTION argument to `set(...CACHE TYPE DOCSTRING)` was expanded
unquoted, so multi-word descriptions on STRING options passed via `-D`
spilled their trailing words into subsequent set() args.

This was latent until PR #3 introduced EXECUTORCH_VULKAN_FP16_PRECISION
with a multi-word help string — builds that set it (e.g. via
scripts/build_android_library.sh forwarding the env var) then fail.

Carried here so this branch remains self-contained and buildable
independent of the merge order of PR #2. Drops cleanly after PR #2
lands; git will treat the duplicate line as a no-op.
@jgibson2 force-pushed the jgibson/neon-custom-kernels branch from 0c2ba68 to f4086e3 on April 24, 2026 18:40
@jgibson2 changed the title from "kernels/custom: NEON grid_sampler_2d + sum.IntList_out for arm64" to "optimized: add NEON grid_sampler_2d.out and Vectorized<float> sum.IntList_out" on Apr 24, 2026
@jgibson2 (Collaborator, Author) commented:

Force-pushed with a restructured approach. Previous version used a separate custom_kernels static library under kernels/custom/; this version ports the same kernels to be proper optimized kernels in kernels/optimized/cpu/ — registered in optimized.yaml alongside opt_gelu, opt_add, etc.

Same runtime performance, cleaner integration:

  • kernels/custom/ directory removed
  • custom_kernels static library removed from kernels/optimized/CMakeLists.txt and configurations/CMakeLists.txt
  • build_variables.bzl + op_registration_util.bzl updated so both Buck and CMake pick up the new sources automatically
  • Symbols renamed from custom::native::* to the standard torch::executor::native::opt_* convention
  • The sum implementation swapped from handwritten NEON intrinsics to at::vec::Vectorized<float> — compiler-optimized SIMD that's actually ~2× faster than the handwritten version
  • The on-device verify test (verify_custom_kernels) dropped from this PR — with the optimized kernels in place, the existing kernels/test/op_*_test.cpp gtest infrastructure covers us

Also pushed as pytorch/executorch#19119. Once that lands and syncs into polycam/main, this PR can be closed (or merged and then dropped).

Standalone aarch64 binary that cross-checks opt_grid_sampler_2d_out and
opt_sum_dim_out against an fp32 reference derived from the portable
kernel (portable run on up-cast fp32 inputs, then down-cast to fp16).
Reference is independent of portable's own fp16 path, so the test stays
meaningful regardless of pytorch#19117's merge state.

Pass/fail uses numpy.testing.assert_allclose semantics:
  |a - b| <= abs_tol + rel_tol * |b|
Avoids the "relative error explodes at zero crossings" trap for
mean-zero reductions and bilinear samples near cancellation points.
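
Per-element, that check is just the following (a sketch; the rtol/atol
constants here are placeholders, not the verifier's actual values):

    #include <cmath>

    // numpy.testing.assert_allclose semantics: |a - b| <= atol + rtol*|b|.
    bool allclose_elem(float actual, float expected,
                       float rtol = 1e-3f, float atol = 1e-5f) {
      return std::fabs(actual - expected) <=
             atol + rtol * std::fabs(expected);
    }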

Opt-in via -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON so default builds are
unaffected. Build + run:

  cmake -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON ...
  cmake --build <out> --target verify_optimized_kernels
  adb push <out>/kernels/optimized/verify_optimized_kernels /data/local/tmp/
  adb shell /data/local/tmp/verify_optimized_kernels

Exits 0 on all-pass; reports max_abs / max_rel(far) / near_zero / viol
per test case. 12 test cases across grid_sampler and sum, covering the
shapes the polycam depth model uses plus a few edge cases (odd channel
count, align_corners=1, multi-batch).

The NEON fast path indexes input/grid/out directly assuming contiguous
NCHW default-dim-order layout — no use of .strides() or .dim_order().
If the caller passes anything else (NHWC, transposed, strided, channels-
last), we'd read wrong memory and silently produce garbage output.

Add the same check pattern op_sum.cpp already uses at L150-151:
tensor_is_default_dim_order + tensor_is_contiguous on input, grid, and
out. If any fails, delegate to the portable kernel (which handles
arbitrary strides / dim orders correctly via .strides()).
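
Sketch of the guard (helper names as in tensor_util.h, which the commit
references; the portable delegation call is illustrative):

    // At the top of opt_grid_sampler_2d_out: fast path requires plain
    // contiguous, default-dim-order tensors for input, grid, and out.
    bool fast_path_ok =
        tensor_is_default_dim_order(input) && tensor_is_contiguous(input) &&
        tensor_is_default_dim_order(grid) && tensor_is_contiguous(grid) &&
        tensor_is_default_dim_order(out) && tensor_is_contiguous(out);
    if (!fast_path_ok) {
      // Portable handles arbitrary strides / dim orders via .strides().
      return portable::grid_sampler_2d_out(
          ctx, input, grid, interpolation_mode, padding_mode,
          align_corners, out);
    }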

No perf impact on the hot path — the checks are a handful of scalar
comparisons run once per call, and the common polycam depth model case
is already default-contiguous so the fast path is still taken.
@jgibson2 merged commit ac4fe82 into polycam on Apr 24, 2026