Revert "perf: SIMD-accelerated FastBase64 for Scala Native via C FFI" by stephenamar-db · Pull Request #777 · databricks/sjsonnet

stephenamar-db · 2026-04-14T16:10:28Z

Reverts #749

…749)" This reverts commit 1613935.

Motivation:
PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr, asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86 SIMD C code. This PR restores all optimizations while replacing the buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause), which provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64 input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS add an explicit length check for ASCII input, matching go-jsonnet's len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result:
Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38% faster than master on base64 workloads.
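The strict-mode check described in the commit message above can be sketched in a few lines of Java. This is only an illustration of the idea, not the code from the PR: the class name `StrictBase64` and the error message are hypothetical, and the sketch assumes ASCII input, so that `String.length()` equals the byte length that go-jsonnet's `len(str) % 4 != 0` guard measures.

```java
import java.util.Base64;

// Hypothetical helper for illustration; not sjsonnet's actual PlatformBase64.
final class StrictBase64 {
    private StrictBase64() {}

    /**
     * Decodes standard (RFC 4648) base64, rejecting any input whose length
     * is not a multiple of 4. The JDK basic decoder alone would accept an
     * unpadded final group such as "YQ".
     */
    static byte[] decode(String s) {
        // Pre-check with the same shape as go-jsonnet's len(str) % 4 != 0 guard.
        // Assumes ASCII input, where the char count equals the byte count.
        if (s.length() % 4 != 0) {
            throw new IllegalArgumentException(
                "base64 input length " + s.length() + " is not a multiple of 4");
        }
        // Invalid characters are still rejected by the JDK decoder itself.
        return Base64.getDecoder().decode(s);
    }
}
```

With this helper, `StrictBase64.decode("YQ==")` returns the single byte `a`, while `StrictBase64.decode("YQ")` throws, even though `Base64.getDecoder().decode("YQ")` on its own would succeed.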

## Summary

Rollforward of #749 (reverted by #777) with the buggy hand-written C SIMD replaced by the battle-tested [aklomp/base64](https://github.com/aklomp/base64) library (BSD-2-Clause). Also restores the non-SIMD optimizations from #749 (ByteArr, RangeArr subclass, asciiSafe rendering) and adds strict RFC 4648 padding validation aligned with go-jsonnet.

### How the SIMD bug was fixed

PR #749's hand-written C SIMD code had an incorrect x86 implementation (the reason for the revert in #777). Instead of fixing the hand-written code, this PR replaces it entirely with **aklomp/base64**, a well-tested C library that handles SIMD dispatch correctly on all architectures:

- x86_64: SSSE3 / SSE4.1 / SSE4.2 / AVX / AVX2 / AVX-512 (runtime CPU detection)
- AArch64: NEON
- Fallback: optimized generic C implementation

The library is built as a static library via CMake and linked via `nativeLinkingOptions`. No hand-written SIMD code remains.

### Strict mode aligned with go-jsonnet

Switched base64 decoding to **strict RFC 4648 mode**: unpadded input (e.g. `"YQ"` instead of `"YQ=="`) is now rejected on all platforms, matching go-jsonnet behavior:

- **go-jsonnet**: `len(str) % 4 != 0` check before `base64.StdEncoding.DecodeString` (builtins.go:1467)
- **C++ jsonnet**: `std.length(str) % 4 != 0` check in stdlib
- **sjsonnet (before)**: `java.util.Base64` was lenient, accepting unpadded input (a pre-existing behavioral divergence)
- **sjsonnet (after)**: JVM/JS add explicit ASCII-only length validation; Native uses aklomp/base64, which is strict by default

### Changes

1. **PlatformBase64 abstraction**: platform-specific base64 implementations:
   - JVM/JS: `java.util.Base64` + strict padding pre-check
   - Native: aklomp/base64 FFI with JVM-compatible error messages on the error path (zero hot-path overhead)
2. **Val.ByteArr**: compact byte-backed array for `base64DecodeBytes`. Stores `Array[Byte]` directly instead of N `Val.Num` wrappers (80%+ memory savings). Zero-copy `rawBytes` access for re-encoding.
3. **Val.RangeArr subclass**: extracted from flag-based `_isRange` in `Arr` to reduce per-Arr memory footprint. O(1) creation for `std.range`.
4. **Val.Str._asciiSafe + renderAsciiSafeString**: marks strings that need no JSON escaping (e.g. base64 output). Renderer skips SWAR escape scanning, writing bytes directly.
5. **Materializer/ByteRenderer fast paths**: direct byte iteration for ByteArr, skipping per-element type dispatch.
6. **Comprehensive test suite**: 56+ Scala unit tests + 4 Jsonnet golden file tests covering RFC 4648 vectors, SIMD boundary sizes, bidirectional verification, strict padding enforcement, all 256 byte values, and error handling.

## Benchmark Results: Scala Native vs jrsonnet (Rust)

Machine: Apple Silicon (AArch64/NEON), macOS. Tool: `hyperfine --warmup 3 --runs 10 -N`.

Both `master` and `simd-full` binaries built from the same upstream/master base (4123ac3). The only difference is this PR's changes.

### SIMD base64 throughput (large payloads)

Larger payloads isolate base64 codec performance from Jsonnet interpreter overhead. The improvement scales with data size:

| Benchmark | Payload | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|---------|:-----------:|:--------------:|:-------------:|:--------------:|
| base64_heavy | 200KB, 3 strings + 10K bytes | 9.8 | **8.8** | 6.9 | **10% faster** |
| base64_throughput | 150KB × 5 roundtrips | 15.6 | **13.6** | 5.6 | **13% faster** |
| base64_mega | 1MB + 100K byte array | 34.1 | **28.7** | 22.0 | **16% faster** |
| base64_ultra | 4.5MB × 2 roundtrips | 119.9 | **91.3** | 14.0 | **24% faster** |

User CPU time (excluding process overhead) tells the same story:

| Benchmark | master User CPU | simd-full User CPU | Reduction |
|-----------|:-:|:-:|:-:|
| base64_heavy | 4.8 ms | 4.0 ms | **17%** |
| base64_throughput | 10.0 ms | 7.7 ms | **23%** |
| base64_mega | 26.9 ms | 21.5 ms | **20%** |
| base64_ultra | 107.8 ms | 78.2 ms | **27%** |

> **Note**: jrsonnet's advantage on large-payload benchmarks (especially ultra: 14ms vs 91ms) is primarily due to Rust's UTF-8 string representation enabling zero-copy base64, whereas Scala Native requires UTF-16 ↔ UTF-8 conversion at the FFI boundary. This is a fundamental runtime characteristic, not a base64 algorithm difference.

### ByteArr compact storage (DecodeBytes / byte_array)

sjsonnet's `ByteArr` stores decoded bytes as `Array[Byte]` directly (vs N `Val.Num` wrappers), beating jrsonnet (Rust) on byte-oriented operations:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master | simd vs jrsonnet |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|:----------------:|
| std_base64decodebytes | 15.6 | **13.9** | 19.0 | **11% faster** | **1.36x faster** |
| go base64DecodeBytes | 16.0 | **13.5** | 19.4 | **16% faster** | **1.43x faster** |
| std_base64_byte_array | 9.0 | **8.8** | 18.4 | ~neutral | **2.09x faster** |

### Small payload benchmarks (interpreter-dominated)

These benchmarks process ~3KB payloads. Base64 codec time is negligible compared to process startup (~3ms) and Jsonnet parsing/evaluation, so codec improvements don't show here:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|
| std_base64 (encode) | 7.3 | 6.8 | 4.4 | ~neutral |
| std_base64decode | 6.0 | 6.2 | 4.8 | ~neutral |
| go base64 (encode) | 7.0 | 7.6 | 4.7 | ~neutral |
| go base64Decode | 6.8 | 7.3 | 5.3 | ~neutral |

## Test plan

- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 61 tests pass (including 56 Base64Tests with strict padding)
- [x] `./mill 'sjsonnet.js[3.3.7]'.test`: 455 tests pass
- [x] `./mill 'sjsonnet.native[3.3.7]'.test`: 476 tests pass
- [x] `./mill __.checkFormat`: scalafmt passes
- [x] Benchmark regression verified across multiple runs (10 runs per benchmark)
- [x] Local ARM64 (Apple Silicon/NEON) verification: all tests pass
- [x] CI x86_64 verification via GitHub Actions runners

Closes #777
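To make the `Val.ByteArr` idea from the description concrete, here is a rough Java sketch of a byte-backed array that exposes elements as numbers only on access. It is an assumption-laden illustration rather than sjsonnet's implementation: the class name `ByteArrSketch` and all of its methods are hypothetical, and the real `Val.ByteArr` is Scala code integrated with the interpreter's value types.

```java
import java.util.Base64;

// Hypothetical illustration of a compact byte-backed array; not sjsonnet code.
final class ByteArrSketch {
    // One byte per element, instead of one boxed number object per element.
    private final byte[] raw;

    ByteArrSketch(byte[] raw) {
        this.raw = raw;
    }

    /** Decodes standard base64 straight into the backing byte array. */
    static ByteArrSketch fromBase64(String s) {
        return new ByteArrSketch(Base64.getDecoder().decode(s));
    }

    int length() {
        return raw.length;
    }

    /** Element as a Jsonnet-style number: an unsigned value in 0..255. */
    double get(int i) {
        return raw[i] & 0xFF; // mask because Java bytes are signed
    }

    /** Backing bytes without copying, so re-encoding needs no conversion. */
    byte[] rawBytes() {
        return raw;
    }

    /** Re-encodes directly from the backing array. */
    String toBase64() {
        return Base64.getEncoder().encodeToString(raw);
    }
}
```

For example, `ByteArrSketch.fromBase64("/w==")` holds a single byte, `get(0)` yields `255.0`, and `toBase64()` returns `"/w=="` again without materializing per-element number objects.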


Revert "perf: SIMD-accelerated FastBase64 for Scala Native via C FFI (#…

3aa4273

…749)" This reverts commit 1613935.

stephenamar-db merged commit 4123ac3 into master Apr 14, 2026
5 checks passed

stephenamar-db deleted the revert-749-perf/fast-base64-native branch April 14, 2026 16:20

He-Pin mentioned this pull request Apr 14, 2026

perf: SIMD base64 via aklomp/base64 + ByteArr/RangeArr/asciiSafe #778

Merged

7 tasks
