optimized: add NEON grid_sampler_2d.out and Vectorized<float> sum.IntList_out #4

Merged
jgibson2 merged 4 commits into polycam from jgibson/neon-custom-kernels on Apr 24, 2026

Conversation

@jgibson2 (Collaborator) commented Apr 23, 2026

### Summary

This PR has been restructured. Previously it added NEON kernels via a separate `custom_kernels` static library and `kernels/custom/` source directory. That approach worked but didn't fit the existing `kernels/optimized/` integration pattern. This version ports both kernels to be proper optimized kernels, registered in `optimized.yaml` alongside `opt_add`, `opt_gelu`, etc., and sourced from `kernels/optimized/cpu/`.

Also opened as pytorch/executorch#19119 — this PR can be retired once that lands and is synced into polycam/main.

Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op                                     | Portable | This PR | Speedup |
| -------------------------------------- | -------- | ------- | ------- |
| `grid_sampler_2d.out`                  | 17.3 ms  | 3.4 ms  | 5.1x    |
| `sum.IntList_out` (5 calls, aggregate) | 3.0 ms   | 0.56 ms | 5.4x    |

End-to-end `forward()` p50 on a 9.4 MB polycam depth model: 94 ms → 97 ms (within noise).

### `grid_sampler_2d.out`

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back on store, which avoids fp16 catastrophic cancellation in `(ix_se - ix)`-style weight subtractions. Unsupported modes and non-aarch64 targets delegate to the portable kernel.
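
For concreteness, here is a minimal sketch of the inner-loop shape described above (not the PR's actual code). It assumes fp32 NCHW data with corner offsets, bilinear weights, and zeros padding already resolved by the caller; the real kernel additionally promotes fp16 on load and casts back on store. `gather4` and `bilinear4` are hypothetical names:

```cpp
#include <arm_neon.h>
#include <cstdint>

// Load one corner pixel's value for four consecutive channels.
// In NCHW the channel stride is H*W, so this is four lane loads.
static inline float32x4_t gather4(
    const float* base, int64_t chan_stride, int64_t spatial_offset) {
  float32x4_t v = vdupq_n_f32(0.f);
  v = vld1q_lane_f32(base + 0 * chan_stride + spatial_offset, v, 0);
  v = vld1q_lane_f32(base + 1 * chan_stride + spatial_offset, v, 1);
  v = vld1q_lane_f32(base + 2 * chan_stride + spatial_offset, v, 2);
  v = vld1q_lane_f32(base + 3 * chan_stride + spatial_offset, v, 3);
  return v;
}

// One output pixel, four channels at a time: an FMA chain over the four
// bilinear corners, each scalar weight broadcast across the channel lanes.
static inline void bilinear4(
    const float* in, float* out, int64_t in_cs, int64_t out_cs,
    int64_t off_nw, int64_t off_ne, int64_t off_sw, int64_t off_se,
    float w_nw, float w_ne, float w_sw, float w_se) {
  float32x4_t acc = vmulq_n_f32(gather4(in, in_cs, off_nw), w_nw);
  acc = vfmaq_n_f32(acc, gather4(in, in_cs, off_ne), w_ne);
  acc = vfmaq_n_f32(acc, gather4(in, in_cs, off_sw), w_sw);
  acc = vfmaq_n_f32(acc, gather4(in, in_cs, off_se), w_se);
  // Output is NCHW too, so the four channel results are strided stores.
  vst1q_lane_f32(out + 0 * out_cs, acc, 0);
  vst1q_lane_f32(out + 1 * out_cs, acc, 1);
  vst1q_lane_f32(out + 2 * out_cs, acc, 2);
  vst1q_lane_f32(out + 3 * out_cs, acc, 3);
}
```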

### `sum.IntList_out`

`at::vec::Vectorized`-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD; always accumulates in fp32. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable. ~2x faster than a handwritten-NEON equivalent, because the compiler optimizes the fp16→fp32 gather better than a manual temp-buffer implementation.
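
A sketch of what the contiguous case of that fast path can look like (the include path is an assumption about how the vendored vec headers are exposed; the real kernel also handles the strided case and converts fp16 to fp32 on load):

```cpp
#include <ATen/cpu/vec/vec.h>  // assumed include path for Vectorized<float>
#include <cstdint>

using Vec = at::vec::Vectorized<float>;

// Sum a contiguous run of n floats, accumulating lane-wise in fp32.
float sum_contiguous(const float* data, int64_t n) {
  constexpr int64_t kW = Vec::size();  // lanes per vector (4 on NEON)
  Vec acc(0.f);
  int64_t i = 0;
  for (; i + kW <= n; i += kW) {
    acc = acc + Vec::loadu(data + i);  // unaligned load + lane-wise add
  }
  // Horizontal reduction of the lanes, then a scalar tail.
  float lanes[kW];
  acc.store(lanes);
  float total = 0.f;
  for (int64_t k = 0; k < kW; ++k) total += lanes[k];
  for (; i < n; ++i) total += data[i];
  return total;
}
```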

### Integration

  • Sources added to `OPTIMIZED_KERNELS_SRCS` in `build_variables.bzl` and to `OPTIMIZED_ATEN_OPS` in `op_registration_util.bzl` — single source of truth for both Buck and CMake builds.
  • `optimized.yaml` registers the ops with the standard `opt_*` naming convention.
  • `kernels/optimized/CMakeLists.txt` scopes the `-march=armv8.2-a+fp16` flag to just `op_grid_sampler_2d.cpp` via `set_source_files_properties`, so x86_64 builds are unaffected; the kernel-side gating this pairs with is sketched just below.
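
A minimal, self-contained illustration of that compile-time guard plus runtime fallback pattern (made-up function names, not the PR's code):

```cpp
#include <cstddef>

// Stand-in for the portable reference path: correct for every input.
float reduce_portable(const float* in, size_t n) {
  float s = 0.f;
  for (size_t i = 0; i < n; ++i) s += in[i];
  return s;
}

#if defined(__aarch64__)
#include <arm_neon.h>
// Stand-in for the NEON fast path, only compiled on arm64.
float reduce_neon(const float* in, size_t n) {
  float32x4_t acc = vdupq_n_f32(0.f);
  size_t i = 0;
  for (; i + 4 <= n; i += 4) acc = vaddq_f32(acc, vld1q_f32(in + i));
  float s = vaddvq_f32(acc);  // horizontal add across the 4 lanes
  for (; i < n; ++i) s += in[i];
  return s;
}
#endif

float reduce(const float* in, size_t n, bool inputs_supported) {
#if defined(__aarch64__)
  if (inputs_supported) {
    return reduce_neon(in, n);  // fast path
  }
#endif
  return reduce_portable(in, n);  // unsupported inputs / non-arm64 targets
}
```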

### Included `preset.cmake` fix

Carries a one-char fix (quote `${DESCRIPTION}` in the `set(...CACHE...)` call) that's also in PR #2, so this branch builds independently of #2's merge order. Drops to a no-op once #2 lands.

### Test plan

  • Builds cleanly for Android arm64-v8a and Android x86_64 (via `scripts/build_android_library.sh`).
  • End-to-end Pixel 9 latency on the polycam depth model matches the previous handwritten-NEON integration (within run-to-run noise) while producing more accurate fp16 outputs.
  • The 5.1x / 5.4x op-level speedups are reproducible via `ETDump` analysis.

🤖 Generated with Claude Code

@jgibson2 changed the base branch from main to polycam on April 24, 2026 15:44

@meredithbayne left a comment

Before you merge, could you double-check the rounding for the fp16 and fp32 paths? I'm not sure whether fp16 grid_sampler has divergent weight rounding between the NEON body and the scalar tail (the tail rounds weights after the multiply; the body rounds before).

Comment thread on kernels/custom/neon_grid_sampler_2d.cpp (outdated)

// We only optimize bilinear + zeros — the only mode our model uses.
// Other modes fall through to the scalar path.
ET_KERNEL_CHECK(

Do we need any of the other checks here, like the dim order check?


@jgibson2 (Collaborator, Author) replied:

Added

@jgibson2 (Collaborator, Author) commented:

Pushed two follow-up commits:

  • 995f6e3 — NEON fp16 grid_sampler now accumulates in fp32 internally (fp16 loads/stores, fp32 FMA chain). Fixes catastrophic cancellation on weight computation that was masking real precision divergence from the portable reference. Unmeasurable perf delta; max_abs vs fp32-reference drops ~0.1 → 0.
  • 0c2ba68 — adds a standalone on-device verify binary (verify_custom_kernels, opt-in via -DEXECUTORCH_BUILD_CUSTOM_VERIFY=ON) that cross-checks both NEON kernels against an fp32 reference across the shapes the polycam depth model uses. All 12 test cases pass; output prints both absolute and relative-off-zero divergence per case.

Reference for the fp16 tests is portable run on up-cast fp32 inputs, then down-cast to fp16. This keeps the verifier meaningful regardless of whether the portable-fp16 precision fix (pytorch#19117) has landed. Once that fix is merged and synced into the branch, both sides will converge further without a test-harness change.

optimized: add NEON grid_sampler_2d.out and Vectorized<float> sum.IntList_out

Two new optimized CPU kernels registered alongside the existing
optimized_kernels library. Both replace the portable reference kernel
(still available as fallback for unsupported inputs) with a vectorized
implementation that accumulates in fp32, avoiding the fp16 precision
issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes
representative of the model's hot path):

| Op                               | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out              | 17.3 ms  | 3.4 ms  | 5.1x    |
| sum.IntList_out (5 calls, total) | 3.0 ms   | 0.56 ms | 5.4x    |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per
iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32
for weight computation and accumulation, then cast back on store — the
portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise
suffer catastrophic cancellation. Unsupported modes and non-aarch64
targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized<float>-based implementation of the single-dim
reduction fast path (both innermost-contiguous and strided cases).
Cross-architecture SIMD via PyTorch's existing vector abstraction;
accumulates in fp32 regardless of input dtype. Multi-dim reductions,
dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to
  OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of
  truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming
  convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16
  flag to just op_grid_sampler_2d.cpp via set_source_files_properties,
  so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__
  guards and falls through to portable on non-arm64 targets.

Same one-char fix as pytorch#19117 (and our PR #2): the
DESCRIPTION argument to `set(...CACHE TYPE DOCSTRING)` was expanded
unquoted, so multi-word descriptions on STRING options passed via `-D`
spilled their trailing words into subsequent set() args.

This was latent until PR #3 introduced EXECUTORCH_VULKAN_FP16_PRECISION
with a multi-word help string — builds that set it (e.g. via
scripts/build_android_library.sh forwarding the env var) then fail.

Carried here so this branch remains self-contained and buildable
independent of the merge order of PR #2. Drops cleanly after PR #2
lands; git will treat the duplicate line as a no-op.
@jgibson2 force-pushed the jgibson/neon-custom-kernels branch from 0c2ba68 to f4086e3 on April 24, 2026 18:40
@jgibson2 changed the title from "kernels/custom: NEON grid_sampler_2d + sum.IntList_out for arm64" to "optimized: add NEON grid_sampler_2d.out and Vectorized<float> sum.IntList_out" on Apr 24, 2026
@jgibson2 (Collaborator, Author) commented:

Force-pushed with a restructured approach. Previous version used a separate custom_kernels static library under kernels/custom/; this version ports the same kernels to be proper optimized kernels in kernels/optimized/cpu/ — registered in optimized.yaml alongside opt_gelu, opt_add, etc.

Same runtime performance, cleaner integration:

  • kernels/custom/ directory removed
  • custom_kernels static library removed from kernels/optimized/CMakeLists.txt and configurations/CMakeLists.txt
  • build_variables.bzl + op_registration_util.bzl updated so both Buck and CMake pick up the new sources automatically
  • Symbols renamed from custom::native::* to the standard torch::executor::native::opt_* convention
  • The sum implementation swapped from handwritten NEON intrinsics to at::vec::Vectorized<float> — compiler-optimized SIMD that's actually ~2× faster than the handwritten version
  • The on-device verify test (verify_custom_kernels) dropped from this PR — with the optimized kernels in place, the existing kernels/test/op_*_test.cpp gtest infrastructure covers us

Also pushed as pytorch/executorch#19119. Once that lands and syncs into polycam/main, this PR can be closed (or merged and then dropped).

Standalone aarch64 binary that cross-checks opt_grid_sampler_2d_out and
opt_sum_dim_out against an fp32 reference derived from the portable
kernel (portable run on up-cast fp32 inputs, then down-cast to fp16).
Reference is independent of portable's own fp16 path, so the test stays
meaningful regardless of pytorch#19117's merge state.

Pass/fail uses numpy.testing.assert_allclose semantics:
  |a - b| <= abs_tol + rel_tol * |b|
Avoids the "relative error explodes at zero crossings" trap for
mean-zero reductions and bilinear samples near cancellation points.
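
Per-element, that check is just the following (a sketch; the rtol/atol
constants here are placeholders, not the verifier's actual values):

    #include <cmath>

    // numpy.testing.assert_allclose semantics: |a - b| <= atol + rtol*|b|.
    bool allclose_elem(float actual, float expected,
                       float rtol = 1e-3f, float atol = 1e-5f) {
      return std::fabs(actual - expected) <=
             atol + rtol * std::fabs(expected);
    }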

Opt-in via -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON so default builds are
unaffected. Build + run:

  cmake -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON ...
  cmake --build <out> --target verify_optimized_kernels
  adb push <out>/kernels/optimized/verify_optimized_kernels /data/local/tmp/
  adb shell /data/local/tmp/verify_optimized_kernels

Exits 0 on all-pass; reports max_abs / max_rel(far) / near_zero / viol
per test case. 12 test cases across grid_sampler and sum, covering the
shapes the polycam depth model uses plus a few edge cases (odd channel
count, align_corners=1, multi-batch).

The NEON fast path indexes input/grid/out directly assuming contiguous
NCHW default-dim-order layout — no use of .strides() or .dim_order().
If the caller passes anything else (NHWC, transposed, strided, channels-
last), we'd read wrong memory and silently produce garbage output.

Add the same check pattern op_sum.cpp already uses at L150-151:
tensor_is_default_dim_order + tensor_is_contiguous on input, grid, and
out. If any fails, delegate to the portable kernel (which handles
arbitrary strides / dim orders correctly via .strides()).
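
Sketch of the guard (helper names as in tensor_util.h, which the commit
references; the portable delegation call is illustrative):

    // At the top of opt_grid_sampler_2d_out: fast path requires plain
    // contiguous, default-dim-order tensors for input, grid, and out.
    bool fast_path_ok =
        tensor_is_default_dim_order(input) && tensor_is_contiguous(input) &&
        tensor_is_default_dim_order(grid) && tensor_is_contiguous(grid) &&
        tensor_is_default_dim_order(out) && tensor_is_contiguous(out);
    if (!fast_path_ok) {
      // Portable handles arbitrary strides / dim orders via .strides().
      return portable::grid_sampler_2d_out(
          ctx, input, grid, interpolation_mode, padding_mode,
          align_corners, out);
    }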

No perf impact on the hot path — the checks are a handful of scalar
comparisons run once per call, and the common polycam depth model case
is already default-contiguous so the fast path is still taken.
@jgibson2 merged commit ac4fe82 into polycam on Apr 24, 2026