Revert "perf: SIMD-accelerated FastBase64 for Scala Native via C FFI" #777
Merged
stephenamar-db merged 1 commit into master on Apr 14, 2026
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request on Apr 14, 2026
Motivation:
PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr, asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86 SIMD C code. This PR restores all optimizations while replacing the buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause), which provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64 input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS add an explicit length check for ASCII input, matching go-jsonnet's len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result:
Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38% faster than master on base64 workloads.
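The strict-mode check described in the commit message can be illustrated with a minimal Java sketch. It is not the code from the PR: the class `StrictBase64` and the method `decodeStrict` are hypothetical names, and the sketch assumes ASCII input, so the character count equals the byte count. It only shows why a length pre-check is needed in front of `java.util.Base64`, whose basic decoder does not require trailing padding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not sjsonnet's PlatformBase64.
final class StrictBase64 {

    // Reject input whose length is not a multiple of 4 before decoding,
    // mirroring the len(str) % 4 != 0 check attributed to go-jsonnet.
    // Assumption: the input is ASCII, so length() counts encoded bytes.
    static byte[] decodeStrict(String s) {
        if (s.length() % 4 != 0) {
            throw new IllegalArgumentException(
                "base64 input length " + s.length() + " is not a multiple of 4");
        }
        // The basic decoder still rejects illegal characters and bad padding.
        return Base64.getDecoder().decode(s);
    }

    public static void main(String[] args) {
        // Padded input decodes normally.
        byte[] ok = decodeStrict("YQ==");
        System.out.println(new String(ok, StandardCharsets.US_ASCII));

        // The plain JDK decoder accepts the unpadded form.
        System.out.println(Base64.getDecoder().decode("YQ").length);

        // The strict wrapper rejects it.
        try {
            decodeStrict("YQ");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The pre-check is a single integer comparison, so it adds no measurable cost compared with the decode itself.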
stephenamar-db pushed a commit that referenced this pull request on Apr 24, 2026
## Summary

Rollforward of #749 (reverted by #777) with the buggy hand-written C SIMD replaced by the battle-tested [aklomp/base64](https://github.com/aklomp/base64) library (BSD-2-Clause). Also restores the non-SIMD optimizations from #749 (ByteArr, RangeArr subclass, asciiSafe rendering) and adds strict RFC 4648 padding validation aligned with go-jsonnet.

### How the SIMD bug was fixed

PR #749's hand-written C SIMD code had an incorrect x86 implementation (the reason for the revert in #777). Instead of fixing the hand-written code, this PR replaces it entirely with **aklomp/base64**, a well-tested C library that handles SIMD dispatch correctly on all architectures:

- x86_64: SSSE3 / SSE4.1 / SSE4.2 / AVX / AVX2 / AVX-512 (runtime CPU detection)
- AArch64: NEON
- Fallback: optimized generic C implementation

The library is built as a static library via CMake and linked via `nativeLinkingOptions`. No hand-written SIMD code remains.

### Strict mode aligned with go-jsonnet

Switched base64 decoding to **strict RFC 4648 mode**: unpadded input (e.g. `"YQ"` instead of `"YQ=="`) is now rejected on all platforms, matching go-jsonnet behavior:

- **go-jsonnet**: `len(str) % 4 != 0` check before `base64.StdEncoding.DecodeString` (builtins.go:1467)
- **C++ jsonnet**: `std.length(str) % 4 != 0` check in stdlib
- **sjsonnet (before)**: `java.util.Base64` was lenient, accepting unpadded input (a pre-existing behavioral divergence)
- **sjsonnet (after)**: JVM/JS add explicit ASCII-only length validation; Native uses aklomp/base64, which is strict by default

### Changes

1. **PlatformBase64 abstraction**: Platform-specific base64 implementations:
   - JVM/JS: `java.util.Base64` + strict padding pre-check
   - Native: aklomp/base64 FFI with JVM-compatible error messages on the error path (zero hot-path overhead)
2. **Val.ByteArr**: Compact byte-backed array for `base64DecodeBytes`. Stores `Array[Byte]` directly instead of N `Val.Num` wrappers (80%+ memory savings). Zero-copy `rawBytes` access for re-encoding.
3. **Val.RangeArr subclass**: Extracted from flag-based `_isRange` in `Arr` to reduce per-Arr memory footprint. O(1) creation for `std.range`.
4. **Val.Str._asciiSafe + renderAsciiSafeString**: Marks strings that need no JSON escaping (e.g. base64 output). Renderer skips SWAR escape scanning, writing bytes directly.
5. **Materializer/ByteRenderer fast paths**: Direct byte iteration for ByteArr, skipping per-element type dispatch.
6. **Comprehensive test suite**: 56+ Scala unit tests + 4 Jsonnet golden file tests covering RFC 4648 vectors, SIMD boundary sizes, bidirectional verification, strict padding enforcement, all 256 byte values, and error handling.

## Benchmark Results: Scala Native vs jrsonnet (Rust)

Machine: Apple Silicon (AArch64/NEON), macOS. Tool: `hyperfine --warmup 3 --runs 10 -N`.

Both `master` and `simd-full` binaries built from the same upstream/master base (4123ac3). The only difference is this PR's changes.

### SIMD base64 throughput (large payloads)

Larger payloads isolate base64 codec performance from Jsonnet interpreter overhead. The improvement scales with data size:

| Benchmark | Payload | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|---------|:-----------:|:--------------:|:-------------:|:--------------:|
| base64_heavy | 200KB, 3 strings + 10K bytes | 9.8 | **8.8** | 6.9 | **10% faster** |
| base64_throughput | 150KB × 5 roundtrips | 15.6 | **13.6** | 5.6 | **13% faster** |
| base64_mega | 1MB + 100K byte array | 34.1 | **28.7** | 22.0 | **16% faster** |
| base64_ultra | 4.5MB × 2 roundtrips | 119.9 | **91.3** | 14.0 | **24% faster** |

User CPU time (excluding process overhead) tells the same story:

| Benchmark | master User CPU | simd-full User CPU | Reduction |
|-----------|:-:|:-:|:-:|
| base64_heavy | 4.8 ms | 4.0 ms | **17%** |
| base64_throughput | 10.0 ms | 7.7 ms | **23%** |
| base64_mega | 26.9 ms | 21.5 ms | **20%** |
| base64_ultra | 107.8 ms | 78.2 ms | **27%** |

> **Note**: jrsonnet's advantage on large-payload benchmarks (especially ultra: 14ms vs 91ms) is primarily due to Rust's UTF-8 string representation enabling zero-copy base64, whereas Scala Native requires UTF-16 ↔ UTF-8 conversion at the FFI boundary. This is a fundamental runtime characteristic, not a base64 algorithm difference.

### ByteArr compact storage (DecodeBytes / byte_array)

sjsonnet's `ByteArr` stores decoded bytes as `Array[Byte]` directly (vs N `Val.Num` wrappers), beating jrsonnet (Rust) on byte-oriented operations:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master | simd vs jrsonnet |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|:----------------:|
| std_base64decodebytes | 15.6 | **13.9** | 19.0 | **11% faster** | **1.36x faster** |
| go base64DecodeBytes | 16.0 | **13.5** | 19.4 | **16% faster** | **1.43x faster** |
| std_base64_byte_array | 9.0 | **8.8** | 18.4 | ~neutral | **2.09x faster** |

### Small payload benchmarks (interpreter-dominated)

These benchmarks process ~3KB payloads. Base64 codec time is negligible compared to process startup (~3ms) and Jsonnet parsing/evaluation, so codec improvements don't show here:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|
| std_base64 (encode) | 7.3 | 6.8 | 4.4 | ~neutral |
| std_base64decode | 6.0 | 6.2 | 4.8 | ~neutral |
| go base64 (encode) | 7.0 | 7.6 | 4.7 | ~neutral |
| go base64Decode | 6.8 | 7.3 | 5.3 | ~neutral |

## Test plan

- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 61 tests pass (including 56 Base64Tests with strict padding)
- [x] `./mill 'sjsonnet.js[3.3.7]'.test`: 455 tests pass
- [x] `./mill 'sjsonnet.native[3.3.7]'.test`: 476 tests pass
- [x] `./mill __.checkFormat`: scalafmt passes
- [x] Benchmark regression verified across multiple runs (10 runs per benchmark)
- [x] Local ARM64 (Apple Silicon/NEON) verification: all tests pass
- [x] CI x86_64 verification via GitHub Actions runners

Closes #777
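The compact byte-backed storage described for `Val.ByteArr` in the PR description can be sketched in Java as follows. This is an illustration of the idea only, not sjsonnet's implementation: `ByteArrSketch` and its methods are hypothetical names. The decoded bytes are held in one `byte[]`, a numeric element is produced only when one is read, and re-encoding works directly on the backing array.

```java
import java.util.Base64;

// Hypothetical illustration of a byte-backed array; not sjsonnet's Val.ByteArr.
final class ByteArrSketch {
    // One primitive array instead of one boxed number object per element.
    private final byte[] raw;

    ByteArrSketch(byte[] raw) {
        this.raw = raw;
    }

    // Decode once and keep the bytes as they are.
    static ByteArrSketch fromBase64(String s) {
        return new ByteArrSketch(Base64.getDecoder().decode(s));
    }

    int length() {
        return raw.length;
    }

    // Java bytes are signed, so mask to expose the value as 0..255.
    int get(int i) {
        return raw[i] & 0xFF;
    }

    // Re-encode from the backing array without rebuilding it element by element.
    String toBase64() {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        ByteArrSketch arr = fromBase64("AP8=");
        System.out.println(arr.length() + " elements, last = " + arr.get(1));
        System.out.println(arr.toBase64());
    }
}
```

The mask in `get` matters because a decoded byte such as 0xFF would otherwise surface as -1 rather than 255.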
Reverts #749