Conversation
meredithbayne left a comment
Before you merge, could you double-check the rounding for the fp16 and fp32 paths? I'm not sure whether the fp16 grid_sampler has divergent weight rounding between the NEON body and the scalar tail (the tail rounds weights after the multiply; the body rounds before).
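For illustration, a hypothetical scalar version of the divergence being asked about, assuming the body rounds the fp16 weight before the multiply while the tail rounds the product afterwards (names are illustrative, not the PR's actual code):

```cpp
// Hypothetical illustration only (aarch64 __fp16): the two rounding
// orders can disagree in the last fp16 ulp for some inputs.
bool orders_diverge(float ix, float ix_se, float v) {
  // "Body" order: round the weight to fp16, multiply in fp32, store fp16.
  __fp16 body = static_cast<__fp16>(
      static_cast<float>(static_cast<__fp16>(ix_se - ix)) * v);
  // "Tail" order: multiply in fp32 with the unrounded weight, store fp16.
  __fp16 tail = static_cast<__fp16>((ix_se - ix) * v);
  return body != tail;  // true for some inputs: last-ulp divergence
}
```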
```cpp
// We only optimize bilinear + zeros — the only mode our model uses.
// Other modes fall through to the scalar path.
ET_KERNEL_CHECK(
```
Do we need any of the other checks here, like the dim order check?
Pushed two follow-up commits:
Reference for the fp16 tests is the portable kernel run on up-cast fp32 inputs, then down-cast to fp16. This keeps the verifier meaningful regardless of whether the portable-fp16 precision fix (pytorch#19117) has landed. Once that fix is merged and synced into the branch, both sides will converge further without a test-harness change.
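A sketch of that reference construction; `portable_ref_fp32` is a hypothetical placeholder for running the portable kernel in fp32, not a real entry point:

```cpp
#include <vector>

// Hypothetical placeholder for running the portable kernel in fp32.
std::vector<float> portable_ref_fp32(const std::vector<float>& in);

// Up-cast inputs to fp32, run the portable reference, down-cast the
// result. The reference never touches portable's fp16 path, so it stays
// valid whether or not pytorch#19117 has landed.
std::vector<__fp16> make_fp16_reference(const std::vector<__fp16>& in) {
  std::vector<float> in_f32(in.begin(), in.end());          // up-cast
  std::vector<float> out_f32 = portable_ref_fp32(in_f32);
  return std::vector<__fp16>(out_f32.begin(), out_f32.end()); // down-cast
}
```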
…List_out

Two new optimized CPU kernels registered alongside the existing optimized_kernels library. Both replace the portable reference kernel (still available as fallback for unsupported inputs) with a vectorized implementation that accumulates in fp32, avoiding the fp16 precision issues noted in pytorch#19117 for grid_sampler_2d bilinear.

Measured end-to-end on a real depth model (Pixel 9, fp16 inputs, shapes representative of the model's hot path):

| Op                               | Portable | This PR | Speedup |
| -------------------------------- | -------- | ------- | ------- |
| grid_sampler_2d.out              | 17.3 ms  | 3.4 ms  | 5.1x    |
| sum.IntList_out (5 calls, total) | 3.0 ms   | 0.56 ms | 5.4x    |

### grid_sampler_2d.out

aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs are promoted to fp32 for weight computation and accumulation, then cast back on store — the portable kernel's fp16 weight subtractions like `(ix_se - ix)` otherwise suffer catastrophic cancellation. Unsupported modes and non-aarch64 targets delegate to the portable kernel.

### sum.IntList_out

at::vec::Vectorized<float>-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD via PyTorch's existing vector abstraction; accumulates in fp32 regardless of input dtype. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable.

### Integration

- Sources added to OPTIMIZED_KERNELS_SRCS in build_variables.bzl and to OPTIMIZED_ATEN_OPS in op_registration_util.bzl. Single source of truth for both Buck and CMake builds.
- optimized.yaml registers the ops with the standard opt_* naming convention used by sibling kernels.
- kernels/optimized/CMakeLists.txt scopes the -march=armv8.2-a+fp16 flag to just op_grid_sampler_2d.cpp via set_source_files_properties, so x86_64 builds are unaffected. The kernel has #ifdef __aarch64__ guards and falls through to portable on non-arm64 targets.
Same one-char fix as pytorch#19117 (and our PR #2): the DESCRIPTION argument to `set(...CACHE TYPE DOCSTRING)` was expanded unquoted, so multi-word descriptions on STRING options passed via `-D` spilled their trailing words into subsequent set() args. This was latent until PR #3 introduced EXECUTORCH_VULKAN_FP16_PRECISION with a multi-word help string — builds that set it (e.g. via scripts/build_android_library.sh forwarding the env var) then fail. Carried here so this branch remains self-contained and buildable independent of the merge order of PR #2. Drops cleanly after PR #2 lands; git will treat the duplicate line as a no-op.
0c2ba68 to f4086e3
Force-pushed with a restructured approach. The previous version used a separate `custom_kernels` static library; this version ports both kernels into the existing `kernels/optimized/` integration pattern. Same runtime performance, cleaner integration; details are in the updated summary below.
Also pushed as pytorch/executorch#19119. Once that lands and syncs into polycam/main, this PR can be closed (or merged and then dropped).
Standalone aarch64 binary that cross-checks opt_grid_sampler_2d_out and opt_sum_dim_out against an fp32 reference derived from the portable kernel (portable run on up-cast fp32 inputs, then down-cast to fp16). The reference is independent of portable's own fp16 path, so the test stays meaningful regardless of pytorch#19117's merge state.

Pass/fail uses numpy.testing.assert_allclose semantics:

    |a - b| <= abs_tol + rel_tol * |b|

This avoids the "relative error explodes at zero crossings" trap for mean-zero reductions and bilinear samples near cancellation points.

Opt-in via -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON so default builds are unaffected. Build + run:

```
cmake -DEXECUTORCH_BUILD_OPTIMIZED_VERIFY=ON ...
cmake --build <out> --target verify_optimized_kernels
adb push <out>/kernels/optimized/verify_optimized_kernels /data/local/tmp/
adb shell /data/local/tmp/verify_optimized_kernels
```

Exits 0 on all-pass; reports max_abs / max_rel(far) / near_zero / viol per test case. 12 test cases across grid_sampler and sum, covering the shapes the polycam depth model uses plus a few edge cases (odd channel count, align_corners=1, multi-batch).
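A minimal sketch of the per-element criterion, with illustrative default tolerances (the verifier's actual values aren't shown here):

```cpp
#include <cmath>

// assert_allclose-style check: |a - b| <= abs_tol + rel_tol * |b|.
// The absolute term keeps the test meaningful near zero crossings, where
// a purely relative criterion would demand unbounded precision.
bool allclose_elem(float actual, float expected,
                   float rel_tol = 1e-3f, float abs_tol = 1e-4f) {
  return std::fabs(actual - expected) <=
         abs_tol + rel_tol * std::fabs(expected);
}
```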
The NEON fast path indexes input/grid/out directly assuming contiguous NCHW default-dim-order layout — no use of .strides() or .dim_order(). If the caller passes anything else (NHWC, transposed, strided, channels-last), we'd read wrong memory and silently produce garbage output.

Add the same check pattern op_sum.cpp already uses at L150-151: tensor_is_default_dim_order + tensor_is_contiguous on input, grid, and out. If any fails, delegate to the portable kernel (which handles arbitrary strides / dim orders correctly via .strides()).

No perf impact on the hot path — the checks are a handful of scalar comparisons run once per call, and the common polycam depth model case is already default-contiguous, so the fast path is still taken.
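A sketch of that guard, built from the two helpers the comment names; the exact portable delegation call is elided since it isn't shown in this thread:

```cpp
// Guard the NEON fast path: it assumes contiguous, default-dim-order NCHW.
// Helpers are the same ones op_sum.cpp uses (executorch tensor_util.h).
const bool fast_path_ok =
    tensor_is_default_dim_order(in) && tensor_is_contiguous(in) &&
    tensor_is_default_dim_order(grid) && tensor_is_contiguous(grid) &&
    tensor_is_default_dim_order(out) && tensor_is_contiguous(out);
if (!fast_path_ok) {
  // Delegate to the portable kernel, which handles arbitrary strides and
  // dim orders via .strides().
  // return portable_grid_sampler_2d_out(...);  // exact call elided
}
```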
Summary
This PR has been restructured. Previously it added NEON kernels via a separate `custom_kernels` static library and `kernels/custom/` source directory. That approach worked but didn't fit the existing `kernels/optimized/` integration pattern. This version ports both kernels to be proper optimized kernels, registered in `optimized.yaml` alongside `opt_add`, `opt_gelu`, etc., and sourced from `kernels/optimized/cpu/`.
Also opened as pytorch/executorch#19119 — this PR can be retired once that lands and is synced into polycam/main.
Measured end-to-end on a real depth model (Pixel 9 / arm64-v8a, fp16 inputs, shapes representative of the model's hot path):

| Op                                 | Portable | This PR | Speedup |
| ---------------------------------- | -------- | ------- | ------- |
| `grid_sampler_2d.out`              | 17.3 ms  | 3.4 ms  | 5.1x    |
| `sum.IntList_out` (5 calls, total) | 3.0 ms   | 0.56 ms | 5.4x    |

End-to-end `forward()` p50 on a 9.4 MB polycam depth model: 94 ms → 97 ms (within noise).
`grid_sampler_2d.out`
aarch64 NEON, bilinear + zeros padding only. Processes 4 channels per iteration with a vectorized FMA chain. fp16 inputs promoted to fp32 for weight computation and accumulation, cast back on store — avoids fp16 catastrophic cancellation on `(ix_se - ix)`-style weight subtractions. Unsupported modes and non-aarch64 targets delegate to the portable kernel.
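As a sketch of what one step of that chain might look like (assumed helper name and corner-pointer layout; the PR's actual indexing, gather, and border handling are elided):

```cpp
#ifdef __aarch64__
#include <arm_neon.h>

// One vectorized bilinear step: widen four fp16 values per corner to
// fp32, FMA-accumulate with fp32 weights, narrow back to fp16 on store.
static inline void bilinear_step_fp16(
    const __fp16* nw, const __fp16* ne, const __fp16* sw, const __fp16* se,
    float w_nw, float w_ne, float w_sw, float w_se, __fp16* out) {
  float32x4_t acc = vmulq_n_f32(vcvt_f32_f16(vld1_f16(nw)), w_nw);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(ne)), w_ne);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(sw)), w_sw);
  acc = vfmaq_n_f32(acc, vcvt_f32_f16(vld1_f16(se)), w_se);
  vst1_f16(out, vcvt_f16_f32(acc));  // cast back to fp16 only on store
}
#endif
```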
`sum.IntList_out`
`at::vec::Vectorized`-based implementation of the single-dim reduction fast path (both innermost-contiguous and strided cases). Cross-architecture SIMD; always accumulates in fp32. Multi-dim reductions, dtype-converting reductions, and complex types delegate to portable. ~2x faster than a handwritten-NEON equivalent, because the compiler optimizes the fp16→fp32 gather better than a manual temp-buffer implementation.
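A minimal sketch of the innermost-contiguous case with that abstraction, simplified to fp32 input (an assumption for brevity; the fp16 path would widen on load, and ExecuTorch's vendored copy of this API may live under a different namespace than ATen's `at::vec`):

```cpp
#include <ATen/cpu/vec/functional.h>
#include <ATen/cpu/vec/vec.h>

// Sum a contiguous run of floats, accumulating in fp32 vector lanes and
// folding the scalar tail at the end.
float sum_contiguous(const float* data, int64_t n) {
  using Vec = at::vec::Vectorized<float>;
  Vec acc(0.0f);
  int64_t i = 0;
  for (; i + Vec::size() <= n; i += Vec::size()) {
    acc = acc + Vec::loadu(data + i);
  }
  // Horizontal-reduce the lanes, then add the remaining tail elements.
  float total = at::vec::vec_reduce_all<float>(
      [](Vec& x, Vec& y) { return x + y; }, acc);
  for (; i < n; ++i) {
    total += data[i];
  }
  return total;
}
```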
Integration

- Sources added to `OPTIMIZED_KERNELS_SRCS` in `build_variables.bzl` and to `OPTIMIZED_ATEN_OPS` in `op_registration_util.bzl`; single source of truth for both Buck and CMake builds.
- `optimized.yaml` registers the ops with the standard `opt_*` naming convention used by sibling kernels.
- `kernels/optimized/CMakeLists.txt` scopes the `-march=armv8.2-a+fp16` flag to just `op_grid_sampler_2d.cpp` via `set_source_files_properties`, so x86_64 builds are unaffected.
Included `preset.cmake` fix
Carries a one-char fix (quote `${DESCRIPTION}` in the `set(...CACHE...)` call) that's also in PR #2, so this branch builds independently of #2's merge order. Drops to a no-op once #2 lands.
Test plan

On-device verification via the opt-in `verify_optimized_kernels` harness described above (12 cases cross-checking both kernels against an fp32 portable-derived reference), plus the end-to-end depth-model benchmark numbers in the summary.
🤖 Generated with Claude Code