perf: SIMD base64 via aklomp/base64 + ByteArr/RangeArr/asciiSafe by He-Pin · Pull Request #778 · databricks/sjsonnet

He-Pin · 2026-04-14T22:26:16Z

Summary

Rollforward of #749 (reverted by #777) with the buggy hand-written C SIMD replaced by the battle-tested aklomp/base64 library (BSD-2-Clause). Also restores the non-SIMD optimizations from #749 (ByteArr, RangeArr subclass, asciiSafe rendering) and adds strict RFC 4648 padding validation aligned with go-jsonnet.

How the SIMD bug was fixed

PR #749's hand-written C SIMD code had an incorrect x86 implementation, which is why it was reverted in #777. Instead of patching the hand-written code, this PR replaces it entirely with aklomp/base64, a well-tested C library that handles SIMD dispatch correctly on each supported architecture:

x86_64: SSSE3 / SSE4.1 / SSE4.2 / AVX / AVX2 / AVX-512 (runtime CPU detection)
AArch64: NEON
Fallback: optimized generic C implementation

The library is built as a static library via CMake and linked via nativeLinkingOptions. No hand-written SIMD code remains.

Strict mode aligned with go-jsonnet

Switched base64 decoding to strict RFC 4648 mode — unpadded input (e.g. "YQ" instead of "YQ==") is now rejected on all platforms, matching go-jsonnet behavior:

go-jsonnet: len(str) % 4 != 0 check before base64.StdEncoding.DecodeString (builtins.go:1467)
C++ jsonnet: std.length(str) % 4 != 0 check in stdlib
sjsonnet (before): java.util.Base64 was lenient, accepting unpadded input — a pre-existing behavioral divergence
sjsonnet (after): JVM/JS add explicit ASCII-only length validation; Native uses aklomp/base64 which is strict by default
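The JVM/JS pre-check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `StrictBase64Sketch` and `decodeStrict` are hypothetical names, and the sketch applies the length check to every input rather than reproducing the PR's ASCII-only condition.

```scala
import java.util.Base64

object StrictBase64Sketch {
  // Hypothetical helper (name and error message are assumptions).
  // java.util.Base64's basic decoder accepts unpadded input such as "YQ",
  // so the length is checked first, mirroring the len(str) % 4 != 0 check
  // that the description attributes to go-jsonnet.
  def decodeStrict(s: String): Array[Byte] = {
    if (s.length % 4 != 0)
      throw new IllegalArgumentException(
        s"base64 input length ${s.length} is not a multiple of 4")
    Base64.getDecoder.decode(s)
  }
}
```

With this guard, `decodeStrict("YQ==")` yields the single byte for "a", while `decodeStrict("YQ")` throws instead of being silently accepted.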

Changes

PlatformBase64 abstraction — Platform-specific base64 implementations:
- JVM/JS: java.util.Base64 + strict padding pre-check
- Native: aklomp/base64 FFI with JVM-compatible error messages on the error path (zero hot-path overhead)
Val.ByteArr — Compact byte-backed array for base64DecodeBytes. Stores Array[Byte] directly instead of N Val.Num wrappers (80%+ memory savings). Zero-copy rawBytes access for re-encoding.
Val.RangeArr subclass — Extracted from flag-based _isRange in Arr to reduce per-Arr memory footprint. O(1) creation for std.range.
Val.Str._asciiSafe + renderAsciiSafeString — Marks strings that need no JSON escaping (e.g. base64 output). Renderer skips SWAR escape scanning, writing bytes directly.
Materializer/ByteRenderer fast paths — Direct byte iteration for ByteArr, skipping per-element type dispatch.
Comprehensive test suite — 56+ Scala unit tests + 4 Jsonnet golden file tests covering RFC 4648 vectors, SIMD boundary sizes, bidirectional verification, strict padding enforcement, all 256 byte values, and error handling.
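To make the ByteArr and RangeArr ideas from the list above concrete, here is a simplified sketch. The types below are hypothetical stand-ins and do not match sjsonnet's real Val hierarchy; they only show a byte-backed array that widens elements on access and a range whose construction cost does not depend on its length.

```scala
object CompactArrSketch {
  // Hypothetical stand-in for an array value; sjsonnet's real Val.Arr differs.
  sealed trait Arr {
    def length: Int
    def apply(i: Int): Double
  }

  // Byte-backed array: keeps the decoded bytes as-is instead of allocating
  // one number wrapper per element. rawBytes is exposed so that a re-encode
  // step could read the bytes without copying.
  final class ByteArr(val rawBytes: Array[Byte]) extends Arr {
    def length: Int = rawBytes.length
    // Bytes are signed on the JVM; mask to get the 0..255 value.
    def apply(i: Int): Double = (rawBytes(i) & 0xff).toDouble
  }

  // Range array: stores only the start and the size, so creating it is O(1);
  // elements are computed when they are read.
  final class RangeArr(from: Int, size: Int) extends Arr {
    def length: Int = size
    def apply(i: Int): Double = {
      if (i < 0 || i >= size) throw new IndexOutOfBoundsException(i.toString)
      (from + i).toDouble
    }
  }

  // Inclusive range in the style of std.range(from, to); empty when to < from.
  def range(from: Int, to: Int): RangeArr =
    new RangeArr(from, math.max(0, to - from + 1))
}
```

For example, `new CompactArrSketch.ByteArr(Array[Byte](-1, 0, 127))(0)` reads back as 255.0, and `CompactArrSketch.range(1, 3)` has length 3 without allocating three elements.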

Benchmark Results — Scala Native vs jrsonnet (Rust)

Machine: Apple Silicon (AArch64/NEON), macOS. Tool: hyperfine --warmup 3 --runs 10 -N.

Both the master and simd-full binaries were built from the same upstream/master base (4123ac3). The only difference is this PR's changes.

SIMD base64 throughput (large payloads)

Larger payloads isolate base64 codec performance from Jsonnet interpreter overhead. The improvement scales with data size:

Benchmark	Payload	master (ms)	simd-full (ms)	jrsonnet (ms)	simd vs master
base64_heavy	200KB, 3 strings + 10K bytes	9.8	8.8	6.9	10% faster
base64_throughput	150KB × 5 roundtrips	15.6	13.6	5.6	13% faster
base64_mega	1MB + 100K byte array	34.1	28.7	22.0	16% faster
base64_ultra	4.5MB × 2 roundtrips	119.9	91.3	14.0	24% faster
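The "simd vs master" column appears to be the relative reduction in wall time, rounded to a whole percent. A small sketch of that arithmetic, using a hypothetical helper name:

```scala
object SpeedupSketch {
  // Percent reduction in time when going from beforeMs to afterMs,
  // rounded to the nearest whole percent.
  def percentFaster(beforeMs: Double, afterMs: Double): Long =
    math.round((beforeMs - afterMs) / beforeMs * 100.0)
}
```

Applied to the rows above: `percentFaster(9.8, 8.8)` gives 10, `percentFaster(15.6, 13.6)` gives 13, `percentFaster(34.1, 28.7)` gives 16, and `percentFaster(119.9, 91.3)` gives 24.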

User CPU time (excluding process overhead) tells the same story:

Benchmark	master User CPU	simd-full User CPU	Reduction
base64_heavy	4.8 ms	4.0 ms	17%
base64_throughput	10.0 ms	7.7 ms	23%
base64_mega	26.9 ms	21.5 ms	20%
base64_ultra	107.8 ms	78.2 ms	27%

Note: jrsonnet's advantage on large-payload benchmarks (especially ultra: 14ms vs 91ms) is primarily due to Rust's UTF-8 string representation enabling zero-copy base64, whereas Scala Native requires UTF-16 ↔ UTF-8 conversion at the FFI boundary. This is a fundamental runtime characteristic, not a base64 algorithm difference.
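The conversion cost mentioned in the note can be illustrated with plain JDK calls. This is only a sketch of the data flow, assuming a byte-oriented codec behind the boundary; it uses java.util.Base64 as a stand-in and is not the PR's FFI code.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object FfiCopySketch {
  // A String on the JVM or Scala Native is a sequence of UTF-16 code units,
  // while a C base64 routine works on raw bytes. Each call therefore pays
  // for two extra conversions around the codec itself.
  def encodeViaBytes(s: String): String = {
    val utf8 = s.getBytes(StandardCharsets.UTF_8)      // conversion 1: String -> UTF-8 bytes
    val encoded = Base64.getEncoder.encode(utf8)       // the codec sees only bytes
    new String(encoded, StandardCharsets.ISO_8859_1)   // conversion 2: ASCII bytes -> String
  }
}
```

A runtime whose strings are already UTF-8 byte buffers can hand them to the codec without conversion 1, which is the difference the note describes.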

ByteArr compact storage (DecodeBytes / byte_array)

sjsonnet's ByteArr stores decoded bytes as Array[Byte] directly (vs N Val.Num wrappers), beating jrsonnet (Rust) on byte-oriented operations:

Benchmark	master (ms)	simd-full (ms)	jrsonnet (ms)	simd vs master	simd vs jrsonnet
std_base64decodebytes	15.6	13.9	19.0	11% faster	1.36x faster
go base64DecodeBytes	16.0	13.5	19.4	16% faster	1.43x faster
std_base64_byte_array	9.0	8.8	18.4	~neutral	2.09x faster

Small payload benchmarks (interpreter-dominated)

These benchmarks process ~3KB payloads. Base64 codec time is negligible compared to process startup (~3ms) and Jsonnet parsing/evaluation, so codec improvements don't show here:

Benchmark	master (ms)	simd-full (ms)	jrsonnet (ms)	simd vs master
std_base64 (encode)	7.3	6.8	4.4	~neutral
std_base64decode	6.0	6.2	4.8	~neutral
go base64 (encode)	7.0	7.6	4.7	~neutral
go base64Decode	6.8	7.3	5.3	~neutral

Test plan

./mill 'sjsonnet.jvm[3.3.7]'.test — 61 tests pass (including 56 Base64Tests with strict padding)
./mill 'sjsonnet.js[3.3.7]'.test — 455 tests pass
./mill 'sjsonnet.native[3.3.7]'.test — 476 tests pass
./mill __.checkFormat — scalafmt passes
Benchmark regression verified across multiple runs (10 runs per benchmark)
Local ARM64 (Apple Silicon/NEON) verification — all tests pass
CI x86_64 verification via GitHub Actions runners

Closes #777

He-Pin · 2026-04-15T03:04:09Z

@JoshRosen @stephenamar-db This new implementation uses a mature third-party library; please take another look.

He-Pin · 2026-04-18T09:24:13Z

I have updated the PR @stephenamar-db

Motivation: Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated SWAR string rendering and long-to-char conversion code, plus two missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer, delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect, remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans

Result: Net -132 lines. All tests pass across JVM/JS/Native/WASM.
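The two overflow checks named in this commit message can be sketched as below. The function bodies are assumptions written for illustration; only the names parseDigits and totalLen come from the message, and the real StringModule code may look quite different.

```scala
object OverflowSketch {
  // Hypothetical digit parser: returns None for an empty string, a non-digit
  // character, or a value that does not fit in a Long.
  def parseDigits(s: String): Option[Long] = {
    if (s.isEmpty) return None
    var acc = 0L
    var i = 0
    while (i < s.length) {
      val d = s.charAt(i) - '0'
      if (d < 0 || d > 9) return None
      // acc * 10 + d fits in a Long exactly when acc <= (Long.MaxValue - d) / 10.
      if (acc > (Long.MaxValue - d) / 10) return None
      acc = acc * 10 + d
      i += 1
    }
    Some(acc)
  }

  // Hypothetical guard for a pre-sized join buffer: the total is summed as a
  // Long so the comparison against Int.MaxValue cannot itself overflow.
  def joinedLength(parts: Seq[String], sepLen: Int): Int = {
    var totalLen = 0L
    parts.foreach(p => totalLen += p.length)
    if (parts.nonEmpty) totalLen += sepLen.toLong * (parts.length - 1)
    if (totalLen > Int.MaxValue)
      throw new IllegalArgumentException("joined string would exceed Int.MaxValue chars")
    totalLen.toInt
  }
}
```

Here `parseDigits("9223372036854775807")` still succeeds, while one more than that returns None rather than wrapping to a negative number.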

Motivation: PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr, asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86 SIMD C code. This PR restores all optimizations while replacing the buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause) which provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64 input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS add explicit length check for ASCII input, matching go-jsonnet's len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result: Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38% faster than master on base64 workloads.
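The asciiSafe idea restored by this commit can be sketched as follows. This is an illustration under assumptions: `AsciiSafeSketch`, `isAsciiSafe`, and `renderQuoted` are hypothetical names, and the slow path here is a simplified escaper rather than the SWAR scan used by the renderer.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

object AsciiSafeSketch {
  // True when every char is printable ASCII that JSON emits verbatim:
  // no quote, no backslash, no control characters, nothing above 0x7e.
  def isAsciiSafe(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // Renders s as a quoted JSON string. When the caller already knows the
  // string is ASCII-safe (for example, base64 output), each char is written
  // as one byte with no escape scanning.
  def renderQuoted(s: String, asciiSafe: Boolean): Array[Byte] = {
    val out = new ByteArrayOutputStream(s.length + 2)
    out.write('"'.toInt)
    if (asciiSafe) {
      var i = 0
      while (i < s.length) { out.write(s.charAt(i).toInt); i += 1 }
    } else {
      // Simplified slow path: escape quote, backslash, and control chars.
      val sb = new java.lang.StringBuilder
      var i = 0
      while (i < s.length) {
        s.charAt(i) match {
          case '"'           => sb.append("\\\"")
          case '\\'          => sb.append("\\\\")
          case c if c < 0x20 => sb.append(String.format("\\u%04x", Integer.valueOf(c.toInt)))
          case c             => sb.append(c)
        }
        i += 1
      }
      val bytes = sb.toString.getBytes(StandardCharsets.UTF_8)
      out.write(bytes, 0, bytes.length)
    }
    out.write('"'.toInt)
    out.toByteArray
  }
}
```

The flag is worth carrying on the string value because base64 output uses only A-Z, a-z, 0-9, '+', '/', and '=', all of which pass the check, so the scan would never find anything to escape.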

He-Pin commented Apr 15, 2026

Comment thread sjsonnet/src-jvm/sjsonnet/stdlib/PlatformBase64.scala Outdated

stephenamar-db self-requested a review April 17, 2026 20:57

stephenamar-db requested changes Apr 17, 2026

Comment thread .github/workflows/pr-build.yaml Outdated

He-Pin force-pushed the simdbase64-full branch from 8feb874 to dece904 Compare April 18, 2026 06:49

He-Pin requested a review from stephenamar-db April 18, 2026 06:52

He-Pin commented Apr 18, 2026

Comment thread .gitmodules Outdated

He-Pin force-pushed the simdbase64-full branch from dece904 to 267b298 Compare April 18, 2026 07:54

stephenamar-db requested changes Apr 21, 2026

Comment thread .github/workflows/pr-build.yaml Outdated

He-Pin force-pushed the simdbase64-full branch from 267b298 to 0253450 Compare April 21, 2026 18:27

stephenamar-db approved these changes Apr 21, 2026

stephenamar-db reviewed Apr 21, 2026

Comment thread .github/workflows/pr-build.yaml Outdated

He-Pin force-pushed the simdbase64-full branch from 0253450 to f418fcd Compare April 21, 2026 18:37

He-Pin requested a review from stephenamar-db April 21, 2026 19:03

stephenamar-db merged commit 1a84e00 into databricks:master Apr 24, 2026
5 checks passed
