Revert "perf: SIMD-accelerated FastBase64 for Scala Native via C FFI" #777

Merged
stephenamar-db merged 1 commit into master from revert-749-perf/fast-base64-native
Apr 14, 2026

@stephenamar-db
Collaborator

Reverts #749

stephenamar-db merged commit 4123ac3 into master on Apr 14, 2026
5 checks passed
stephenamar-db deleted the revert-749-perf/fast-base64-native branch on April 14, 2026 16:20
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 14, 2026
Motivation:
PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr,
asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86
SIMD C code. This PR restores all optimizations while replacing the
buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause) which
  provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime
  CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict
  RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64
  input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS
  add explicit length check for ASCII input, matching go-jsonnet's
  len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result:
Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38%
faster than master on base64 workloads.
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 18, 2026
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 18, 2026
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 21, 2026
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 21, 2026
stephenamar-db pushed a commit that referenced this pull request Apr 24, 2026
## Summary

Rollforward of #749 (reverted by #777) with the buggy hand-written C
SIMD replaced by the battle-tested
[aklomp/base64](https://github.com/aklomp/base64) library
(BSD-2-Clause). Also restores the non-SIMD optimizations from #749
(ByteArr, RangeArr subclass, asciiSafe rendering) and adds strict RFC
4648 padding validation aligned with go-jsonnet.

### How the SIMD bug was fixed

PR #749's hand-written C SIMD code had an incorrect x86 implementation (the
reason for the revert in #777). Instead of fixing the hand-written code,
this PR replaces it entirely with **aklomp/base64** — a well-tested C
library that handles SIMD dispatch correctly on all architectures:
- x86_64: SSSE3 / SSE4.1 / SSE4.2 / AVX / AVX2 / AVX-512 (runtime CPU
detection)
- AArch64: NEON
- Fallback: optimized generic C implementation

The library is built as a static library via CMake and linked via
`nativeLinkingOptions`. No hand-written SIMD code remains.

### Strict mode aligned with go-jsonnet

Switched base64 decoding to **strict RFC 4648 mode** — unpadded input
(e.g. `"YQ"` instead of `"YQ=="`) is now rejected on all platforms,
matching go-jsonnet behavior:
- **go-jsonnet**: `len(str) % 4 != 0` check before
`base64.StdEncoding.DecodeString` (builtins.go:1467)
- **C++ jsonnet**: `std.length(str) % 4 != 0` check in stdlib
- **sjsonnet (before)**: `java.util.Base64` was lenient, accepting
unpadded input — a pre-existing behavioral divergence
- **sjsonnet (after)**: JVM/JS add explicit ASCII-only length
validation; Native uses aklomp/base64 which is strict by default
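
The JVM/JS pre-check described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `StrictBase64Sketch` and `decodeStrict` are hypothetical names, the error message is invented, and the sketch applies the length check to every input rather than reproducing the ASCII-only gating the PR mentions.

```scala
import java.util.Base64

object StrictBase64Sketch {
  // Hypothetical helper; not sjsonnet's actual PlatformBase64 API.
  def decodeStrict(input: String): Array[Byte] = {
    // java.util.Base64's basic decoder accepts input whose trailing
    // padding has been omitted, so reject any length that is not a
    // multiple of 4 before handing the string to the JDK.
    if (input.length % 4 != 0)
      throw new IllegalArgumentException(
        s"Invalid base64: length ${input.length} is not a multiple of 4"
      )
    // Alphabet and misplaced-'=' validation is left to the JDK decoder.
    Base64.getDecoder.decode(input)
  }
}
```

With this check in place, `StrictBase64Sketch.decodeStrict("YQ==")` decodes to the single byte for `a`, while `StrictBase64Sketch.decodeStrict("YQ")` throws instead of being silently accepted.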

### Changes

1. **PlatformBase64 abstraction** — Platform-specific base64
implementations:
   - JVM/JS: `java.util.Base64` + strict padding pre-check
   - Native: aklomp/base64 FFI with JVM-compatible error messages on the
     error path (zero hot-path overhead)

2. **Val.ByteArr** — Compact byte-backed array for `base64DecodeBytes`.
Stores `Array[Byte]` directly instead of N `Val.Num` wrappers (80%+
memory savings). Zero-copy `rawBytes` access for re-encoding.

3. **Val.RangeArr subclass** — Extracted from flag-based `_isRange` in
`Arr` to reduce per-Arr memory footprint. O(1) creation for `std.range`.

4. **Val.Str._asciiSafe + renderAsciiSafeString** — Marks strings that
need no JSON escaping (e.g. base64 output). Renderer skips SWAR escape
scanning, writing bytes directly.

5. **Materializer/ByteRenderer fast paths** — Direct byte iteration for
ByteArr, skipping per-element type dispatch.

6. **Comprehensive test suite** — 56+ Scala unit tests + 4 Jsonnet
golden file tests covering RFC 4648 vectors, SIMD boundary sizes,
bidirectional verification, strict padding enforcement, all 256 byte
values, and error handling.
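
To make items 2 and 3 concrete, here is a simplified model of the two array representations. It is only a sketch under assumptions: `NumArr`, `Num`, `BoxedArr`, and the constructor shapes are hypothetical and do not mirror sjsonnet's real `Val` hierarchy. Only the names `ByteArr`, `RangeArr`, and `rawBytes` come from the description above.

```scala
// Hypothetical, simplified model; not sjsonnet's actual Val hierarchy.
sealed trait NumArr {
  def length: Int
  def value(i: Int): Double
}

// Boxed baseline: one wrapper object per element.
final case class Num(value: Double)
final class BoxedArr(elems: Array[Num]) extends NumArr {
  def length: Int = elems.length
  def value(i: Int): Double = elems(i).value
}

// Byte-backed: one byte per element, exposed as an unsigned value 0..255.
// rawBytes can be handed straight back to an encoder without copying.
final class ByteArr(val rawBytes: Array[Byte]) extends NumArr {
  def length: Int = rawBytes.length
  def value(i: Int): Double = (rawBytes(i) & 0xff).toDouble
}

// Range-backed: stores only the inclusive bounds, so construction is O(1)
// regardless of how many elements the range spans.
final class RangeArr(from: Int, to: Int) extends NumArr {
  val length: Int = math.max(0, to - from + 1)
  def value(i: Int): Double = {
    if (i < 0 || i >= length) throw new IndexOutOfBoundsException(i.toString)
    (from + i).toDouble
  }
}
```

In this model, `new RangeArr(0, 999999)` allocates a single small object, and `new ByteArr(bytes)` wraps the decoded array as-is instead of allocating one `Num` per byte.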

## Benchmark Results — Scala Native vs jrsonnet (Rust)

Machine: Apple Silicon (AArch64/NEON), macOS.
Tool: `hyperfine --warmup 3 --runs 10 -N`.

Both `master` and `simd-full` binaries built from the same
upstream/master base (4123ac3). The only difference is this PR's
changes.

### SIMD base64 throughput (large payloads)

Larger payloads isolate base64 codec performance from Jsonnet
interpreter overhead. The improvement scales with data size:

| Benchmark | Payload | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|---------|:-----------:|:--------------:|:-------------:|:--------------:|
| base64_heavy | 200KB, 3 strings + 10K bytes | 9.8 | **8.8** | 6.9 | **10% faster** |
| base64_throughput | 150KB × 5 roundtrips | 15.6 | **13.6** | 5.6 | **13% faster** |
| base64_mega | 1MB + 100K byte array | 34.1 | **28.7** | 22.0 | **16% faster** |
| base64_ultra | 4.5MB × 2 roundtrips | 119.9 | **91.3** | 14.0 | **24% faster** |

User CPU time (excluding process overhead) tells the same story:

| Benchmark | master User CPU | simd-full User CPU | Reduction |
|-----------|:-:|:-:|:-:|
| base64_heavy | 4.8 ms | 4.0 ms | **17%** |
| base64_throughput | 10.0 ms | 7.7 ms | **23%** |
| base64_mega | 26.9 ms | 21.5 ms | **20%** |
| base64_ultra | 107.8 ms | 78.2 ms | **27%** |
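
The "simd vs master" and "Reduction" columns appear consistent with a simple relative reduction in time, rounded to the nearest percent. That reading is an assumption (the PR does not state the formula); `SpeedupSketch` and `percentFaster` are hypothetical names used only for this check.

```scala
object SpeedupSketch {
  // Assumed formula: percentage reduction in time relative to master,
  // rounded to the nearest whole percent.
  def percentFaster(masterMs: Double, candidateMs: Double): Long =
    math.round(100.0 * (masterMs - candidateMs) / masterMs)
}
```

For example, `SpeedupSketch.percentFaster(119.9, 91.3)` gives 24, matching the base64_ultra wall-time row, and `SpeedupSketch.percentFaster(107.8, 78.2)` gives 27, matching its User CPU row.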

> **Note**: jrsonnet's advantage on large-payload benchmarks (especially
ultra: 14ms vs 91ms) is primarily due to Rust's UTF-8 string
representation enabling zero-copy base64, whereas Scala Native requires
UTF-16 ↔ UTF-8 conversion at the FFI boundary. This is a fundamental
runtime characteristic, not a base64 algorithm difference.

### ByteArr compact storage (DecodeBytes / byte_array)

sjsonnet's `ByteArr` stores decoded bytes as `Array[Byte]` directly (vs
N `Val.Num` wrappers), beating jrsonnet (Rust) on byte-oriented
operations:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master | simd vs jrsonnet |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|:----------------:|
| std_base64decodebytes | 15.6 | **13.9** | 19.0 | **11% faster** | **1.36x faster** |
| go base64DecodeBytes | 16.0 | **13.5** | 19.4 | **16% faster** | **1.43x faster** |
| std_base64_byte_array | 9.0 | **8.8** | 18.4 | ~neutral | **2.09x faster** |

### Small payload benchmarks (interpreter-dominated)

These benchmarks process ~3KB payloads. Base64 codec time is negligible
compared to process startup (~3ms) and Jsonnet parsing/evaluation, so
codec improvements don't show here:

| Benchmark | master (ms) | simd-full (ms) | jrsonnet (ms) | simd vs master |
|-----------|:-----------:|:--------------:|:-------------:|:--------------:|
| std_base64 (encode) | 7.3 | 6.8 | 4.4 | ~neutral |
| std_base64decode | 6.0 | 6.2 | 4.8 | ~neutral |
| go base64 (encode) | 7.0 | 7.6 | 4.7 | ~neutral |
| go base64Decode | 6.8 | 7.3 | 5.3 | ~neutral |

## Test plan

- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test` — 61 tests pass (including 56
Base64Tests with strict padding)
- [x] `./mill 'sjsonnet.js[3.3.7]'.test` — 455 tests pass
- [x] `./mill 'sjsonnet.native[3.3.7]'.test` — 476 tests pass
- [x] `./mill __.checkFormat` — scalafmt passes
- [x] Benchmark regression verified across multiple runs (10 runs per
benchmark)
- [x] Local ARM64 (Apple Silicon/NEON) verification — all tests pass
- [x] CI x86_64 verification via GitHub Actions runners

Closes #777
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 25, 2026