perf: SIMD-accelerated FastBase64 for Scala Native via C FFI #749

stephenamar-db merged 5 commits into databricks:master
not sure it's worth it.

This needs to be SIMD-based.
He-Pin left a comment
PR #749 Review: SIMD-accelerated base64 for Scala Native
Overall: Major feature, well-architected with platform-specific implementations. The byte-backed Val.Arr is a good general-purpose optimization beyond just base64. Benchmark results are solid - base64DecodeBytes 1.26x faster than jrsonnet, base64_byte_array 1.94x faster.
Concern 1 - _byteData mutability: The rawBytes accessor returns the internal _byteData array directly. If someone modifies the underlying byte array, the cached Val.Num objects from value(i) could become stale. Consider documenting immutability guarantee or returning defensive copy.
Concern 2 - C file complexity: The 1255-line C file (sjsonnet_base64.c) is complex. If there are bugs in the SIMD paths (NEON, SSSE3, AVX2, AVX-512), they could be hard to track down. Recommend comprehensive edge case tests:
- Empty input
- Input lengths 1, 2, 3 (boundary cases for base64 encoding)
- Input with all possible byte values (0x00-0xFF)
- Large input (>64 bytes to trigger SIMD paths)
- Invalid padding detection
Known issue already fixed: AVX-512 avx512dq target feature was missing - resolved in the follow-up commit.
Startup overhead: The benchmark shows base64 encode and base64Decode are still slower than jrsonnet on Scala Native (1.77x and 1.50x). This is attributed to startup overhead (5.5ms vs 3.2ms). Consider investigating the startup cost separately.
I don't see a difference in the PR comment? This seems neutral everywhere. Are there updated benchmarks?

@stephenamar-db The real improvement needs the other PR to be merged first; otherwise the Scala Native startup time will reduce the numbers, because the numbers are really small. Will convert to draft.
Replace java.util.Base64 with a custom FastBase64 implementation that
avoids the overhead of Scala Native's pure-Scala Base64 wrapper.
Key optimizations:
- Direct char-to-char encoding for ASCII strings (no intermediate byte[])
- Pre-computed lookup tables as primitive arrays (char[64] encode, int[256] decode)
- Tight while-loops processing 3->4 (encode) or 4->3 (decode) units
- ISO-8859-1 compatible: chars > 0xFF mapped to 0x3F ('?') matching java.util.Base64 behavior
On JVM this is performance-neutral since java.util.Base64 uses native
intrinsics. On Scala Native, this avoids the Wrapper-object-based,
recursive iterate() implementation in scala-native's java.util.Base64.
All 49 JVM tests pass including base64/base64Decode/base64DecodeBytes.
Motivation: The pure-Scala FastBase64 cannot use SIMD since Scala Native has no built-in SIMD intrinsics (tracked as scala-native#37 since 2016).

Modification:
- Add sjsonnet_base64.c with three SIMD paths:
  * ARM64 NEON: 48→64 encode / 64→48 decode per iteration
  * x86_64 SSSE3: 12→16 encode / 16→12 decode per iteration
  * Scalar fallback for other architectures
- Split FastBase64.scala into platform-specific implementations:
  * src-native: C FFI wrapper calling NEON/SSSE3/scalar C code
  * src-jvm: delegates to java.util.Base64 (C2 intrinsic-optimized)
  * src-js: pure Scala (unchanged from shared version)
- Add base64_stress.jsonnet benchmark

Result: All 420 native tests pass. On Apple Silicon (ARM64 NEON): sjsonnet-native beats jrsonnet on base64_byte_array (1.68x faster), competitive on other base64 benchmarks (1.3-1.9x of jrsonnet).
…ted decode

Motivation: Head-to-head benchmarks against jrsonnet showed sjsonnet-native was 1.3-2x slower on base64 operations. Most overhead was in per-call allocations and double-pass decode (Scala validation + C decode).

Modification:
- Add sjsonnet_base64_decode_validated() to C: single-pass validation + decode with specific error codes (-1 for invalid char, -2 for bad padding)
- Reusable module-level buffers (safe: Scala Native is single-threaded) eliminate per-call array allocations after the first call
- ASCII fast path in encodeString: skip UTF-8 encoding for pure ASCII strings
- Fast String construction: direct char array instead of charset lookup
- decodeToString ASCII fast path: avoid charset decode for ASCII output

Result:
base64 encode: 9.4ms → 7.0ms (25% faster)
base64_stress: 1.31x gap → 1.23x gap vs jrsonnet
All 420 native tests pass.
…string rendering

Motivation: base64DecodeBytes created N Val.Num wrappers per byte. The materializer did per-element type dispatch on byte arrays. base64 encode output was scanned for JSON escape characters despite being guaranteed ASCII-safe. Val.Arr carried inline _isRange/_byteData fields that bloated every regular array instance.

Modification:
- Extract RangeArr and ByteArr as subclasses of Arr (non-final base). Removes _isRange/_rangeFrom/_byteData inline fields from Arr, saving ~13 bytes per regular array instance.
- ByteArr stores Array[Byte] as an immutable val (never cleared after materialization), guaranteeing rawBytes is always non-null for safe multi-use. reversed() materializes first to keep value()/eval() simple.
- Materializer recursive, iterative, and fused ByteRenderer paths detect ByteArr via pattern match and emit visitFloat64 directly from bytes.
- Val.Str._asciiSafe flag + asciiSafe() factory skips SWAR escape scanning and UTF-8 encoding in BaseByteRenderer.renderAsciiSafeString.
- Fix AVX-512 VBMI compile: add avx512dq target for _mm512_inserti64x2.
- Add regression tests for ByteArr and RangeArr correctness (multi-use, reverse, concat, round-trip scenarios).

Result: JVM base64DecodeBytes 10.2% faster. Native base64DecodeBytes 2.13x faster than master, 1.50x faster than jrsonnet. Native base64_byte_array 2.02x faster than jrsonnet.
I want to build #776 on top of this.
@stephenamar-db The performance of base64 has improved now, and the SIMD part will help rendering pipeline performance when SIMD enhancements land later.
I think that there might be multiple correctness issues in sjsonnet_base64.c. I prompted Claude Opus 4.6 (in a claude.ai chat conversation, with code interpreter enabled) to take a look at this PR and after some back-and-forth we uncovered some significant correctness issues.
One "code smell" that prompted me to dig in was the presence of several code comments where it looks like an LLM backed out of one implementation approach in favor of another, e.g.
sjsonnet/sjsonnet/resources/scala-native/sjsonnet_base64.c
Lines 836 to 837 in 52f2b6b
or
sjsonnet/sjsonnet/resources/scala-native/sjsonnet_base64.c
Lines 521 to 522 in 52f2b6b
I prompted Claude to look for security + correctness issues, and to focus on these types of "changed my mind" comments and this flagged several issues. Here's Claude's summary:
Executive Summary
All three x86 SIMD codepaths (SSSE3, AVX2, AVX-512 VBMI) in `sjsonnet_base64.c` produce incorrect output for both encode and decode. The bugs were confirmed by compiling the C source natively on x86_64 with all three instruction sets available, running it through a simulation of the exact Scala Native FFI wrapper logic, and comparing against the scalar baseline and RFC 4648 expected values. The C source contains 13 LLM chain-of-thought comments and 8 dead variables from abandoned approaches that directly correlate with the bug locations.
Test Environment

- CPU: x86_64 with SSSE3, AVX2, and AVX-512 VBMI (native execution, not emulated)
- Compiler: GCC with `-O2 -march=native`
- Method: The C file was compiled directly and called through a harness that replicates the exact Scala `FastBase64.scala` FFI wrapper logic: char-to-byte conversion, C call, byte-to-char conversion. Feature-detection globals were overridden to force each SIMD tier independently.

Finding 1: Data-Corrupting Bugs in All x86 SIMD Paths
Decode: 3-byte group reversal

Each SIMD decode path reverses the byte order within every 3-byte output group. The project's own test assertion demonstrates this:

std.assertEqual(std.base64Decode("SGVsbG8gV29ybGQh"), "Hello World!")

| Path | Output | Correct? |
|------|--------|:--------:|
| Scalar | `Hello World!` | ✅ |
| SSSE3 | `leH olroW!dl` | ❌ |
| AVX2 | `leH olroW!dl` | ❌ |
| AVX-512 | (needs ≥64 chars to trigger) | n/a |

At SIMD-triggering lengths, every 3-byte group is cleanly reversed: `Hel`→`leH`, `Wor`→`roW`, `ld!`→`!dl`. The scalar tail handles any trailing bytes correctly.

Full verification at the AVX-512 decode threshold (64-char input decoding to 48 bytes):

Expected: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuv
AVX-512:  CBAFEDIHGLKJONMRQPUTSXWVaZYdcbgfejihmlkponsrqvut

All 16 three-byte groups reversed.
Encode: corrupted 6-bit index extraction

The encode bug is mechanistically different from the decode bug. The SSSE3/AVX2/AVX-512 encode paths use a reshuffle mask that is incompatible with the Muła multiply constants that follow it. This causes the `mulhi_epu16`/`mullo_epi16` extraction to pull 6-bit indices from the wrong byte positions, producing corrupted base64 characters: not a clean reversal, but a non-trivial scramble:

Input:  "abcdefghijklmnop" (16 bytes)
Scalar: YWJjZGVmZ2hpamtsbW5vcA== ✅
SSSE3:  YmBhZWBkaGBna2BqbW5vcA== ❌

Cross-path verification confirms real corruption: SSSE3-encoded data decoded by the scalar path produces `b\x60ae\x60dh\x60gk\x60jmnop` instead of `abcdefghijklmnop`.

Activation thresholds
Each SIMD tier only activates above a minimum input size. The dispatcher is an if/else chain that selects the highest available tier, so on an AVX-512 machine, only the AVX-512 thresholds matter:

| Path | Encode activates at | Decode activates at |
|------|:-------------------:|:-------------------:|
| SSSE3 | ≥16 input bytes | ≥16 input chars |
| AVX2 | ≥32 input bytes | ≥32 input chars |
| AVX-512 | ≥48 input bytes | ≥64 input chars |

Inputs below the selected tier's threshold fall through to the correct scalar implementation.
Why project tests pass

The project's test inputs are small. The stdlib.jsonnet encode inputs are 12, 11, 10, and 0 bytes, below every SIMD encode threshold. The byte_arr_correctness.jsonnet inputs are 8, 4, and 0 base64 chars, below every decode threshold.

The 16-char decode input `SGVsbG8gV29ybGQh` is the only test that reaches a SIMD threshold: it exactly meets the SSSE3 decode minimum of 16 chars. On a CPU with only SSSE3 (no AVX2/AVX-512), this test would fail. But on the AVX-512 CPUs where the PR was likely tested, the dispatcher selects AVX-512, whose 64-char decode threshold is not met, so execution falls through to scalar and the test passes by accident.

The PR was benchmarked on Apple Silicon M4 Max (ARM64 NEON). The NEON implementation uses hardware interleaved load/store intrinsics (`vld3q_u8`/`vst4q_u8`) that handle byte ordering automatically, a fundamentally different approach from the x86 paths.

Impact
This bug silently corrupts data whenever Scala Native runs on x86 and processes base64 inputs above the SIMD threshold:
- Scala Native on x86 encodes base64 → JVM or any standard decoder reads it
- External/standard base64 is decoded by Scala Native on x86
- Encoded output is compared to or consumed by any non-sjsonnet-x86 system
Finding 2: LLM Chain-of-Thought Comments Map to Bug Locations
The C source contains 13 comments characteristic of LLM chain-of-thought reasoning — mid-function strategy pivots, self-corrections, and abandoned approaches left in place. These correlate directly with the code regions containing the bugs.
Encode: wrong reshuffle mask survived three attempts
The SSSE3 encode function contains three sequential attempts at 6-bit index extraction, with only the third used:
Attempt 1 (lines 310–312): Shift-and-mask approach. The LLM computed `t0` and `t1`, then annotated "t0 has: byte0=(in2>>4)&0x3F=wrong... need different approach". Dead code: `t0` and `t1` are never used.

Attempt 2 (lines 344–364): Range classification using saturating subtract. The LLM built `cmp` and `less26`, then wrote "Hmm, this doesn't work directly. Let me use the standard approach" when it recognized a collision between index ranges. Dead code: `cmp` and `less26` from this block are abandoned.

Attempt 3 (lines 370–406, labeled "Redo"): The final range classification, which works correctly in isolation. But it operates on `indices` produced by the Muła multiply at lines 321–329, which in turn depends on the reshuffle mask at line 279. The reshuffle mask reverses each 3-byte group (`[2,1,0,-1, 5,4,3,-1, ...]`) instead of creating the overlapping byte pairs the multiply constants expect. The classification logic is correct; its input is wrong.

Decode: byte-order error in pack shuffle
The SSSE3 decode has a parallel pattern of abandoned approaches:
Lines 478–510: Three abandoned decode strategies. `hi_nibbles` (nibble-based classification), `offset_lut` (nibble offset table), and `ca` (mullo pack attempt) are all computed and never used. Comments include "wait let me recalculate", "This is getting complex", and "Hmm, the pack is tricky".

Lines 446–448: The surviving `pack_shuf` extracts bytes `[0,1,2]` from each 32-bit lane. After `maddubs` + `madd`, each lane holds a 24-bit value in little-endian order: byte 0 is the LSB (output byte 2), byte 2 is the MSB (output byte 0). The correct pack should extract `[2,1,0]` per lane. This is the direct cause of the 3-byte group reversal.

AVX-512: abandoned constants, same byte-order errors
The AVX-512 encode function contains two dead reshuffle constants (`input_shuf` at line 827, `shuf48` at line 848), each abandoned after comments like "Actually, let me use a cleaner approach" and "Hmm, _mm512_set_epi8 fills from high byte to low byte. Let me fix ordering". The third attempt (`shuf_perm` at line 859) has the same reversal as SSSE3/AVX2.

The AVX-512 decode has a dead `pack_shuf` (line 926) that is never referenced. The actual gather uses `gather_idx` (line 972), which picks `[0,1,2]` per lane instead of `[2,1,0]`: the same byte-order error as SSSE3/AVX2.

Dead variable summary

GCC `-Wunused-variable` confirms 8 dead variables from abandoned LLM approaches:

| Variable | Line | Location |
|----------|:----:|----------|
| `t0` | 310 | SSSE3 encode, first extraction attempt |
| `t1` | 311 | SSSE3 encode, first extraction attempt |
| `hi_nibbles` | 478 | SSSE3 decode, nibble-based classification |
| `offset_lut` | 504 | SSSE3 decode, nibble offset table |
| `ca` | 578 | SSSE3 decode, mullo pack attempt |
| `input_shuf` | 827 | AVX-512 encode, first reshuffle attempt |
| `shuf48` | 848 | AVX-512 encode, second reshuffle attempt |
| `pack_shuf` | 926 | AVX-512 decode, unused shuffle constant |

Root Cause Summary
| Path | Encode bug | Decode bug |
|------|------------|------------|
| SSSE3 | Reshuffle mask `[2,1,0,-1,...]` incompatible with Muła multiply constants `0x0FC0FC00`/`0x04000040`: extracts 6-bit indices from wrong byte positions | `pack_shuf` extracts `[0,1,2]` per lane instead of `[2,1,0]`: reverses each 3-byte output group |
| AVX2 | Same reshuffle mask (duplicated for 256-bit lanes) | Same `pack_shuf` (duplicated for 256-bit lanes) |
| AVX-512 | Same reshuffle via `shuf_perm`, same Muła constants | `gather_idx` picks `[0,1,2]` per lane instead of `[2,1,0]` |
| NEON | Uses `vld3q_u8`/`vst4q_u8` interleaved intrinsics: byte ordering handled by hardware | Uses `vst3q_u8` interleaved store: byte ordering handled by hardware |

The x86 paths all share the same systematic endianness confusion. The NEON path avoids the issue entirely by using ARM's interleaved load/store intrinsics, which abstract away byte ordering within groups.
If we're going to include a bunch of custom C code, we need stronger tests (and probably more careful code review to actually look at what we're merging!).
Note that I'm not an expert in native SIMD programming, but I place moderate trust in the above analysis given that Claude actually compiled and tested the C code (albeit not through the Scala Native FFI interface, but I don't anticipate that to affect the analysis / outcome here).
Let's roll back. @He-Pin, when you roll forward, please include a more thorough testing suite.
Thanks for the details. I think a more proper way to handle this may be to use a dependency instead of this. Will prepare one with an additional build.
Motivation: PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr, asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86 SIMD C code. This PR restores all optimizations while replacing the buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause), which provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64 input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS add an explicit length check for ASCII input, matching go-jsonnet's len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result: Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38% faster than master on base64 workloads.
## Summary

Rollforward of #749 (reverted by #777) with the buggy hand-written C SIMD replaced by the battle-tested [aklomp/base64](https://github.com/aklomp/base64) library (BSD-2-Clause). Also restores the non-SIMD optimizations from #749 (ByteArr, RangeArr subclass, asciiSafe rendering) and adds strict RFC 4648 padding validation aligned with go-jsonnet.

### How the SIMD bug was fixed

PR #749's hand-written C SIMD code had an incorrect x86 implementation (the reason for the revert in #777). Instead of fixing the hand-written code, this PR replaces it entirely with **aklomp/base64**, a well-tested C library that handles SIMD dispatch correctly on all architectures:

- x86_64: SSSE3 / SSE4.1 / SSE4.2 / AVX / AVX2 / AVX-512 (runtime CPU detection)
- AArch64: NEON
- Fallback: optimized generic C implementation

The library is built as a static library via CMake and linked via `nativeLinkingOptions`. No hand-written SIMD code remains.

### Strict mode aligned with go-jsonnet

Switched base64 decoding to **strict RFC 4648 mode**: unpadded input (e.g. `"YQ"` instead of `"YQ=="`) is now rejected on all platforms, matching go-jsonnet behavior:

- **go-jsonnet**: `len(str) % 4 != 0` check before `base64.StdEncoding.DecodeString` (builtins.go:1467)
- **C++ jsonnet**: `std.length(str) % 4 != 0` check in stdlib
- **sjsonnet (before)**: `java.util.Base64` was lenient, accepting unpadded input, a pre-existing behavioral divergence
- **sjsonnet (after)**: JVM/JS add explicit ASCII-only length validation; Native uses aklomp/base64, which is strict by default

### Changes

1. **PlatformBase64 abstraction**: Platform-specific base64 implementations:
   - JVM/JS: `java.util.Base64` + strict padding pre-check
   - Native: aklomp/base64 FFI with JVM-compatible error messages on the error path (zero hot-path overhead)
2. **Val.ByteArr**: Compact byte-backed array for `base64DecodeBytes`. Stores `Array[Byte]` directly instead of N `Val.Num` wrappers (80%+ memory savings). Zero-copy `rawBytes` access for re-encoding.
3. **Val.RangeArr subclass**: Extracted from flag-based `_isRange` in `Arr` to reduce per-Arr memory footprint. O(1) creation for `std.range`.
4. **Val.Str._asciiSafe + renderAsciiSafeString**: Marks strings that need no JSON escaping (e.g. base64 output). The renderer skips SWAR escape scanning, writing bytes directly.
5. **Materializer/ByteRenderer fast paths**: Direct byte iteration for ByteArr, skipping per-element type dispatch.
6. **Comprehensive test suite**: 56+ Scala unit tests + 4 Jsonnet golden file tests covering RFC 4648 vectors, SIMD boundary sizes, bidirectional verification, strict padding enforcement, all 256 byte values, and error handling.

## Benchmark Results: Scala Native vs jrsonnet (Rust)

Machine: Apple Silicon (AArch64/NEON), macOS. Tool: `hyperfine --warmup 3 --runs 10 -N`. Both `master` and `simd-full` binaries built from the same upstream/master base (4123ac3). The only difference is this PR's changes.

### SIMD base64 throughput (large payloads)

Larger payloads isolate base64 codec performance from Jsonnet interpreter overhead. The improvement scales with data size:

| Benchmark | Payload | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|---------|:-----------:|:--------------:|:-------------:|:--------------:|
| base64_heavy | 200KB, 3 strings + 10K bytes | 9.8 | **8.8** | 6.9 | **10% faster** |
| base64_throughput | 150KB × 5 roundtrips | 15.6 | **13.6** | 5.6 | **13% faster** |
| base64_mega | 1MB + 100K byte array | 34.1 | **28.7** | 22.0 | **16% faster** |
| base64_ultra | 4.5MB × 2 roundtrips | 119.9 | **91.3** | 14.0 | **24% faster** |

User CPU time (excluding process overhead) tells the same story:

| Benchmark | master User CPU | simd-full User CPU | Reduction |
|-----------|:---------------:|:------------------:|:---------:|
| base64_heavy | 4.8 ms | 4.0 ms | **17%** |
| base64_throughput | 10.0 ms | 7.7 ms | **23%** |
| base64_mega | 26.9 ms | 21.5 ms | **20%** |
| base64_ultra | 107.8 ms | 78.2 ms | **27%** |

> **Note**: jrsonnet's advantage on large-payload benchmarks (especially ultra: 14ms vs 91ms) is primarily due to Rust's UTF-8 string representation enabling zero-copy base64, whereas Scala Native requires UTF-16 ↔ UTF-8 conversion at the FFI boundary. This is a fundamental runtime characteristic, not a base64 algorithm difference.

### ByteArr compact storage (DecodeBytes / byte_array)

sjsonnet's `ByteArr` stores decoded bytes as `Array[Byte]` directly (vs N `Val.Num` wrappers), beating jrsonnet (Rust) on byte-oriented operations:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master | simd vs jrsonnet |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|:----------------:|
| std_base64decodebytes | 15.6 | **13.9** | 19.0 | **11% faster** | **1.36x faster** |
| go base64DecodeBytes | 16.0 | **13.5** | 19.4 | **16% faster** | **1.43x faster** |
| std_base64_byte_array | 9.0 | **8.8** | 18.4 | ~neutral | **2.09x faster** |

### Small payload benchmarks (interpreter-dominated)

These benchmarks process ~3KB payloads. Base64 codec time is negligible compared to process startup (~3ms) and Jsonnet parsing/evaluation, so codec improvements don't show here:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|
| std_base64 (encode) | 7.3 | 6.8 | 4.4 | ~neutral |
| std_base64decode | 6.0 | 6.2 | 4.8 | ~neutral |
| go base64 (encode) | 7.0 | 7.6 | 4.7 | ~neutral |
| go base64Decode | 6.8 | 7.3 | 5.3 | ~neutral |

## Test plan

- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 61 tests pass (including 56 Base64Tests with strict padding)
- [x] `./mill 'sjsonnet.js[3.3.7]'.test`: 455 tests pass
- [x] `./mill 'sjsonnet.native[3.3.7]'.test`: 476 tests pass
- [x] `./mill __.checkFormat`: scalafmt passes
- [x] Benchmark regression verified across multiple runs (10 runs per benchmark)
- [x] Local ARM64 (Apple Silicon/NEON) verification: all tests pass
- [x] CI x86_64 verification via GitHub Actions runners

Closes #777
Motivation

On Scala Native, `java.util.Base64` is a pure-Scala implementation that uses Wrapper objects, a `@tailrec` recursive `iterate()`, and per-byte pattern matching, making it significantly slower than HotSpot's intrinsic-backed implementation.

Beyond the raw codec, `base64DecodeBytes` was creating `Array[Eval](N)` and filling each slot with `Val.cachedNum`: N allocations for an N-byte decode. The materializer then needed per-element type dispatch to render these arrays. And `base64` encode output (guaranteed ASCII-safe) was still being scanned for JSON escape characters. `Val.Arr` carried inline `_isRange`/`_byteData` fields that bloated every regular array instance (~13 bytes wasted per non-specialized array).
1. Platform-agnostic
FastBase64encoder/decoderENCODE_TABLE(char[64]) andDECODE_TABLE(int[256]) pre-computed lookup tablesencodeString(): ASCII fast path does direct char→char encoding without intermediatebyte[]decodeToString()/decodeToBytes(): Direct string→bytes via lookup tablejava.util.Base64behavior2. C FFI SIMD base64 for Scala Native (
sjsonnet_base64.c)vld3/vst4interleaved load/store +vqtbl4q64-byte lookup for encode;vbslq/vmovl_u8/vmovn_u16for byte↔char widening/narrowingpshufb/vpshufb/vpermi2bsjsonnet_base64_decode_validated(): Single-pass validation + decode with specific error codes3. Native-specific optimizations
encodeString: skip UTF-8 encoding for pure ASCII strings4.
RangeArrandByteArrsubclasses ofVal.ArrVal.Arrchanged fromfinal classto non-finalclass, enabling specializationRangeArr extends Arr: Lazy integer range — keepsrangeFromfield out of regular arrays, saving ~9 bytes per non-range array (merges refactor: extract RangeArr subclass from Arr to reduce memory footprint #772)ByteArr extends Arr: CompactArray[Byte]backing store for 0–255 integer arraysbyteDatais an immutableval— never cleared after materialization, guaranteeingrawBytesis always non-null for safe multi-usereversed()materializes first to keepvalue()/eval()simple and avoid reversed-index bugsrawBytesaccessor enables zero-copy fast paths inbase64encode and materializercase ba: Val.ByteArr =>) instead of null-returningrawByteson base class5. Materializer fast-path for byte arrays
ByteArrvia pattern matchvalue(i)lookup + type dispatch +asDoubleconversionvisitFloat64((bytes(i) & 0xff).toDouble)in a tight loop6. ASCII-safe string rendering
Val.Str._asciiSafeflag marks strings known to contain only printable ASCII (no JSON escaping needed)Val.Str.asciiSafe(pos, s)factory for creating flagged stringsBaseByteRenderer.renderAsciiSafeString()skips SWAR escape scanning and UTF-8 encoding — writes bytes directly from charsbase64encode output is marked as ASCII-safe since base64 alphabet is[A-Za-z0-9+/=]7.
EncodingModuleupdatesbase64DecodeBytes: UsesVal.Arr.fromBytes(pos, decoded)— one allocation instead of Nbase64encode: Pattern matchesByteArrfor zero-copy bypass; output markedasciiSafeBenchmark Results
JMH (JVM, Scala 3.3.7, Apple Silicon M4 Max)

Scala Native (hyperfine -N, 30 runs, Apple Silicon M4 Max)

Compared against jrsonnet 0.5.0-pre98 (built from source, `cargo build --release`).

Compute-heavy benchmarks (`base64DecodeBytes`, `base64_byte_array`): sjsonnet significantly outperforms jrsonnet, 1.50× and 2.02× faster respectively.

Small benchmarks (`base64`, `base64Decode`, `base64_stress`): jrsonnet is faster due to lower startup overhead (~3ms vs ~5ms). The actual base64 computation time is comparable; the gap is dominated by process startup.

Files Changed

- `sjsonnet/src/sjsonnet/Val.scala`: `Arr` made non-final, `RangeArr` + `ByteArr` subclasses, `_asciiSafe` flag, `asciiSafe` factory
- `sjsonnet/src/sjsonnet/Materializer.scala`
- `sjsonnet/src/sjsonnet/ByteRenderer.scala`
- `sjsonnet/src/sjsonnet/BaseByteRenderer.scala`: `renderAsciiSafeString()` for escape-free rendering
- `sjsonnet/src/sjsonnet/stdlib/EncodingModule.scala`: `fromBytes` for DecodeBytes, `ByteArr` match for encode, `asciiSafe` for output
- `sjsonnet/src-js/sjsonnet/stdlib/FastBase64.scala`
- `sjsonnet/src-jvm/sjsonnet/stdlib/FastBase64.scala`: delegates to `java.util.Base64` (unchanged behavior)
- `sjsonnet/src-native/sjsonnet/stdlib/FastBase64.scala`
- `sjsonnet/resources/scala-native/sjsonnet_base64.c`
- `sjsonnet/test/resources/new_test_suite/byte_arr_correctness.jsonnet`
- `sjsonnet/test/resources/new_test_suite/range_arr_correctness.jsonnet`
- `bench/resources/go_suite/base64_stress.jsonnet`

Result