Skip to content

perf: SIMD base64 via aklomp/base64 + ByteArr/RangeArr/asciiSafe#778

Merged
stephenamar-db merged 1 commit intodatabricks:masterfrom
He-Pin:simdbase64-full
Apr 24, 2026
Merged

perf: SIMD base64 via aklomp/base64 + ByteArr/RangeArr/asciiSafe#778
stephenamar-db merged 1 commit intodatabricks:masterfrom
He-Pin:simdbase64-full

Conversation

@He-Pin
Copy link
Copy Markdown
Contributor

@He-Pin He-Pin commented Apr 14, 2026

Summary

Rollforward of #749 (reverted by #777) with the buggy hand-written C SIMD replaced by the battle-tested aklomp/base64 library (BSD-2-Clause). Also restores the non-SIMD optimizations from #749 (ByteArr, RangeArr subclass, asciiSafe rendering) and adds strict RFC 4648 padding validation aligned with go-jsonnet.

How the SIMD bug was fixed

PR #749's hand-written C SIMD code had incorrect x86 implementation (the reason for the revert in #777). Instead of fixing the hand-written code, this PR replaces it entirely with aklomp/base64 — a well-tested C library that handles SIMD dispatch correctly on all architectures:

  • x86_64: SSSE3 / SSE4.1 / SSE4.2 / AVX / AVX2 / AVX-512 (runtime CPU detection)
  • AArch64: NEON
  • Fallback: optimized generic C implementation

The library is built as a static library via CMake and linked via nativeLinkingOptions. No hand-written SIMD code remains.

Strict mode aligned with go-jsonnet

Switched base64 decoding to strict RFC 4648 mode — unpadded input (e.g. "YQ" instead of "YQ==") is now rejected on all platforms, matching go-jsonnet behavior:

  • go-jsonnet: len(str) % 4 != 0 check before base64.StdEncoding.DecodeString (builtins.go:1467)
  • C++ jsonnet: std.length(str) % 4 != 0 check in stdlib
  • sjsonnet (before): java.util.Base64 was lenient, accepting unpadded input — a pre-existing behavioral divergence
  • sjsonnet (after): JVM/JS add explicit ASCII-only length validation; Native uses aklomp/base64 which is strict by default

Changes

  1. PlatformBase64 abstraction — Platform-specific base64 implementations:

    • JVM/JS: java.util.Base64 + strict padding pre-check
    • Native: aklomp/base64 FFI with JVM-compatible error messages on the error path (zero hot-path overhead)
  2. Val.ByteArr — Compact byte-backed array for base64DecodeBytes. Stores Array[Byte] directly instead of N Val.Num wrappers (80%+ memory savings). Zero-copy rawBytes access for re-encoding.

  3. Val.RangeArr subclass — Extracted from flag-based _isRange in Arr to reduce per-Arr memory footprint. O(1) creation for std.range.

  4. Val.Str._asciiSafe + renderAsciiSafeString — Marks strings that need no JSON escaping (e.g. base64 output). Renderer skips SWAR escape scanning, writing bytes directly.

  5. Materializer/ByteRenderer fast paths — Direct byte iteration for ByteArr, skipping per-element type dispatch.

  6. Comprehensive test suite — 56+ Scala unit tests + 4 Jsonnet golden file tests covering RFC 4648 vectors, SIMD boundary sizes, bidirectional verification, strict padding enforcement, all 256 byte values, and error handling.

Benchmark Results — Scala Native vs jrsonnet (Rust)

Machine: Apple Silicon (AArch64/NEON), macOS. Tool: hyperfine --warmup 3 --runs 10 -N.

Both master and simd-full binaries built from the same upstream/master base (4123ac3). The only difference is this PR's changes.

SIMD base64 throughput (large payloads)

Larger payloads isolate base64 codec performance from Jsonnet interpreter overhead. The improvement scales with data size:

Benchmark Payload master (ms) simd-full (ms) jrsonnet (ms) simd vs master
base64_heavy 200KB, 3 strings + 10K bytes 9.8 8.8 6.9 10% faster
base64_throughput 150KB × 5 roundtrips 15.6 13.6 5.6 13% faster
base64_mega 1MB + 100K byte array 34.1 28.7 22.0 16% faster
base64_ultra 4.5MB × 2 roundtrips 119.9 91.3 14.0 24% faster

User CPU time (excluding process overhead) tells the same story:

Benchmark master User CPU simd-full User CPU Reduction
base64_heavy 4.8 ms 4.0 ms 17%
base64_throughput 10.0 ms 7.7 ms 23%
base64_mega 26.9 ms 21.5 ms 20%
base64_ultra 107.8 ms 78.2 ms 27%

Note: jrsonnet's advantage on large-payload benchmarks (especially ultra: 14ms vs 91ms) is primarily due to Rust's UTF-8 string representation enabling zero-copy base64, whereas Scala Native requires UTF-16 ↔ UTF-8 conversion at the FFI boundary. This is a fundamental runtime characteristic, not a base64 algorithm difference.

ByteArr compact storage (DecodeBytes / byte_array)

sjsonnet's ByteArr stores decoded bytes as Array[Byte] directly (vs N Val.Num wrappers), beating jrsonnet (Rust) on byte-oriented operations:

Benchmark master (ms) simd-full (ms) jrsonnet (ms) simd vs master simd vs jrsonnet
std_base64decodebytes 15.6 13.9 19.0 11% faster 1.36x faster
go base64DecodeBytes 16.0 13.5 19.4 16% faster 1.43x faster
std_base64_byte_array 9.0 8.8 18.4 ~neutral 2.09x faster

Small payload benchmarks (interpreter-dominated)

These benchmarks process ~3KB payloads. Base64 codec time is negligible compared to process startup (~3ms) and Jsonnet parsing/evaluation, so codec improvements don't show here:

Benchmark master (ms) simd-full (ms) jrsonnet (ms) simd vs master
std_base64 (encode) 7.3 6.8 4.4 ~neutral
std_base64decode 6.0 6.2 4.8 ~neutral
go base64 (encode) 7.0 7.6 4.7 ~neutral
go base64Decode 6.8 7.3 5.3 ~neutral

Test plan

  • ./mill 'sjsonnet.jvm[3.3.7]'.test — 61 tests pass (including 56 Base64Tests with strict padding)
  • ./mill 'sjsonnet.js[3.3.7]'.test — 455 tests pass
  • ./mill 'sjsonnet.native[3.3.7]'.test — 476 tests pass
  • ./mill __.checkFormat — scalafmt passes
  • Benchmark regression verified across multiple runs (10 runs per benchmark)
  • Local ARM64 (Apple Silicon/NEON) verification — all tests pass
  • CI x86_64 verification via GitHub Actions runners

Closes #777

@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented Apr 15, 2026

@JoshRosen @stephenamar-db This new implementation uses a mature third-party library; please take another look.

He-Pin

This comment was marked as outdated.

Comment thread sjsonnet/src-jvm/sjsonnet/stdlib/PlatformBase64.scala Outdated
He-Pin

This comment was marked as outdated.

@stephenamar-db stephenamar-db self-requested a review April 17, 2026 20:57
Comment thread .github/workflows/pr-build.yaml Outdated
Comment thread .gitmodules Outdated
@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented Apr 18, 2026

I have updated the PR @stephenamar-db

He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 18, 2026
Motivation:
Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated
SWAR string rendering and long-to-char conversion code, plus two
missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer,
  delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect,
  remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans

Result:
Net -132 lines. All tests pass across JVM/JS/Native/WASM.
Comment thread .github/workflows/pr-build.yaml Outdated
Comment thread .github/workflows/pr-build.yaml Outdated
Motivation:
PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr,
asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86
SIMD C code. This PR restores all optimizations while replacing the
buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause) which
  provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime
  CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict
  RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64
  input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS
  add explicit length check for ASCII input, matching go-jsonnet's
  len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result:
Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38%
faster than master on base64 workloads.
@He-Pin He-Pin requested a review from stephenamar-db April 21, 2026 19:03
@stephenamar-db stephenamar-db merged commit 1a84e00 into databricks:master Apr 24, 2026
5 checks passed
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 25, 2026
Motivation:
Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated
SWAR string rendering and long-to-char conversion code, plus two
missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer,
  delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect,
  remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans

Result:
Net -132 lines. All tests pass across JVM/JS/Native/WASM.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants