perf: comprehensive Scala Native render pipeline optimization by He-Pin · Pull Request #776 · databricks/sjsonnet

He-Pin · 2026-04-13T07:16:41Z

Summary

Multi-phase performance optimization targeting Scala Native's render and stdlib hot paths. On the most reliable benchmark (realistic_2, 100ms+ runtime), sjsonnet flips from 1.62x slower to 1.12x faster than jrsonnet (Rust).

Changes

Fused char materializer — MaterializeJsonRenderer.materializeDirect(Val) bypasses the upickle Visitor interface, walking the Val tree directly with valTag-based switch dispatch. Mirrors the existing ByteRenderer fused path.
Chunked SWAR char rendering — Replaced binary hasEscapeChar → char-by-char fallback with position-based findFirstEscapeCharChar → arraycopy clean segments → inline escape. Applied to both BaseCharRenderer.visitNonNullString and MaterializeJsonRenderer.renderQuotedString.
ASCII fast paths for substr/length — CharSWAR.isAllAscii() check skips O(n) codePointCount/offsetByCodePoints for ASCII-only strings (99% of real Jsonnet).
Pre-sized string join — Two-pass approach: calculate exact output length → allocate char[] → copy with getChars. Eliminates StringBuilder resize overhead for large joins.
Hand-written parseInt/parseOctal/parseHex — Single-pass digit loop via shared parseDigits(s, base). No exception handler setup, no intermediate allocation.
_asciiSafe flag propagation — Parser sets the flag on ASCII string literals; Val.Str.concat propagates it when both children are safe; join propagates through all elements. Enables renderers to skip SWAR escape scanning entirely.
Pre-sized format output — Three-pass char[] assembly for Format.format(): compute formatted values → calculate exact length → copy with getChars. Direct Val dispatch skips Materializer for Str/Num/Bool/Null.
SWAR string comparison — CharSWAR.compareStrings with bulk getChars + tight array loop for JIT auto-vectorization (JVM) and explicit SWAR scanning (Native).
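The chunked rendering idea in the list above (find the next escape position, bulk-copy the clean segment, escape one char inline, repeat) can be sketched as follows. This is a minimal illustration, not the PR's code: `ChunkedEscape`, `findFirstEscape`, `escapeInline` and `renderQuoted` are hypothetical names, and a plain scalar scan stands in for the SWAR scan.

```scala
object ChunkedEscape {
  // Scalar stand-in for the SWAR scan (assumption: the real code scans several chars per step).
  // Returns the index of the first char in [from, until) that needs JSON escaping, or -1.
  def findFirstEscape(cs: Array[Char], from: Int, until: Int): Int = {
    var i = from
    while (i < until) {
      val c = cs(i)
      if (c < 32 || c == '"' || c == '\\') return i
      i += 1
    }
    -1
  }

  private val Hex = "0123456789abcdef"

  // Appends the escape sequence for a single char that findFirstEscape flagged.
  private def escapeInline(c: Char, out: java.lang.StringBuilder): Unit = c match {
    case '"'  => out.append("\\\"")
    case '\\' => out.append("\\\\")
    case '\n' => out.append("\\n")
    case '\r' => out.append("\\r")
    case '\t' => out.append("\\t")
    case '\b' => out.append("\\b")
    case '\f' => out.append("\\f")
    case _    =>
      // Remaining control chars are below 32, so two hex digits are enough.
      out.append("\\u00").append(Hex.charAt((c >> 4) & 0xf)).append(Hex.charAt(c & 0xf))
  }

  def renderQuoted(s: String): String = {
    val cs = s.toCharArray
    val out = new java.lang.StringBuilder(cs.length + 2)
    out.append('"')
    var pos = 0
    while (pos < cs.length) {
      val hit = findFirstEscape(cs, pos, cs.length)
      if (hit < 0) {
        out.append(cs, pos, cs.length - pos) // rest of the string is clean: one bulk copy
        pos = cs.length
      } else {
        out.append(cs, pos, hit - pos)       // bulk copy the clean segment before the hit
        escapeInline(cs(hit), out)
        pos = hit + 1
      }
    }
    out.append('"')
    out.toString
  }
}
```

Compared with a boolean "has any escape" check, a string with a single quote in the middle is copied in two bulk segments instead of falling back to char-by-char output for the whole string.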

Benchmark Results — Scala Native vs jrsonnet (Rust, from source)

Machine: Apple Silicon, macOS. Tool: hyperfine --warmup 5 --min-runs 20 -N.

Reliable benchmarks (>20ms runtime, startup overhead not dominant)

Benchmark	sjsonnet (ms)	jrsonnet (ms)	Result
comparsion_for_primitives	40	230	sjsonnet 5.7x faster
inheritance_recursion	61	127	sjsonnet 2.1x faster
simple_recursive_call	28	52	sjsonnet 1.8x faster
realistic_2	90	100	sjsonnet 1.12x faster

Improvement vs master baseline

Benchmark	Master gap	After PR	Change
realistic_2	1.62x slower	1.12x faster	Flipped to win
std.manifestJsonEx	2.15x slower	~1.4x slower	Major improvement
std.substr	2.03x slower	~1.1x slower	Major improvement
std.parseInt	1.80x slower	~tied	Major improvement
large_string_join	1.81x slower	~1.3x slower	Improved
large_string_template	2.78x slower	~1.9x slower	Improved

Note on sub-10ms benchmarks: Scala Native process startup is ~3-4ms vs Rust ~1ms. For benchmarks with <10ms total runtime, this 2-3ms difference is 40-60% of measured time and not addressable at application level. The remaining gaps in sub-10ms benchmarks (manifestJsonEx, substr, etc.) are primarily startup overhead.

JMH Benchmarks (JVM)

Benchmark	master (ms/op)	PR (ms/op)	Change
large_string_template	1.600	1.102	-31%
gen_big_object	0.927	0.899	-3%
realistic2	48.348	48.373	~neutral

Test plan

./mill 'sjsonnet.jvm[3.3.7]'.test — all test suites pass
./mill 'sjsonnet.js[3.3.7]'.compile — Scala.js compiles
./mill 'sjsonnet.native[3.3.7]'.nativeLink — Native binary builds
./mill __.checkFormat — scalafmt passes
Verified realistic_2 improvement is reproducible across multiple runs
No regressions on benchmarks sjsonnet already wins

He-Pin · 2026-04-14T15:09:45Z

I think the string join can be improved further with an AST rewrite, but I want to do that after this gets merged.

He-Pin · 2026-04-14T15:13:52Z

+      if (b < 32 || b == '"' || b == '\\') return i
+      i += 1
+    }
+    -1


@tanishiking Does Scala.js support SWAR too? IIRC, JS can only do 32-bit.

tanishiking · Apr 14, 2026

I’m not sure what “support SWAR” would mean here, but you can write SWAR-like bit hacks in Scala.js, since JS/Scala.js of course support bitwise operations.

You can find a few interesting examples of that kind of optimization in Scala.js itself, in places like https://github.com/scala-js/scala-js/blob/main/javalib/src/main/scala/java/lang/Integer.scala and https://github.com/scala-js/scala-js/blob/main/javalib/src/main/scala/java/lang/IntegerLong.scala
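To make the 32-bit question concrete, here is a rough sketch of the same escape test done four bytes at a time in a single `Int`, which only needs 32-bit bitwise and wrapping arithmetic. The names (`Swar32`, `wordHasEscape`, `findFirstEscape`) are made up for this example and are not taken from the PR; the bit tricks are the well-known "has a byte less than n" and "has a zero byte" word tests.

```scala
object Swar32 {
  private final val Ones  = 0x01010101 // 0x01 in every byte lane
  private final val Highs = 0x80808080 // 0x80 in every byte lane

  // True if any of the four bytes packed in w is < 0x20, == 0x22 ('"') or == 0x5c ('\\').
  // Bytes >= 0x80 (non-ASCII UTF-8) are never flagged.
  def wordHasEscape(w: Int): Boolean = {
    val lt32  = (w - Ones * 0x20) & ~w & Highs
    val q     = w ^ (Ones * 0x22)          // a lane becomes zero where the byte was '"'
    val quote = (q - Ones) & ~q & Highs
    val b     = w ^ (Ones * 0x5c)          // a lane becomes zero where the byte was '\\'
    val slash = (b - Ones) & ~b & Highs
    (lt32 | quote | slash) != 0
  }

  private def isEscape(x: Byte): Boolean = {
    val c = x & 0xff
    c < 0x20 || c == 0x22 || c == 0x5c
  }

  // Index of the first byte that needs escaping, or -1 if there is none.
  def findFirstEscape(bytes: Array[Byte]): Int = {
    val n = bytes.length
    var i = 0
    while (i + 4 <= n) {
      val w = (bytes(i) & 0xff) | ((bytes(i + 1) & 0xff) << 8) |
        ((bytes(i + 2) & 0xff) << 16) | ((bytes(i + 3) & 0xff) << 24)
      if (wordHasEscape(w)) {
        // Locate the exact byte with a short scalar pass over this word.
        var j = i
        while (j < i + 4) { if (isEscape(bytes(j))) return j; j += 1 }
      }
      i += 4
    }
    while (i < n) { if (isEscape(bytes(i))) return i; i += 1 } // tail of fewer than 4 bytes
    -1
  }
}
```

Whether this beats a plain loop on a JS engine would need measuring; the sketch only shows that the technique does not depend on 64-bit words.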

He-Pin · 2026-04-14T15:28:35Z

Benchmark results summary
Environment: Apple Silicon, macOS | Tool: hyperfine --warmup 5 --min-runs 20 -N | sjsonnet: Scala Native (current branch, including the PR #776 optimizations) | jrsonnet: 0.5.0-pre98 (built from source)

Reliable benchmarks (>20ms runtime, startup overhead not dominant)

Benchmark	sjsonnet (ms)	jrsonnet (ms)	Ratio	Winner
comparsion_for_primitives	37.6	214.5	sjsonnet 5.71x faster	sjsonnet
inheritance_recursion	60.7	120.2	sjsonnet 1.98x faster	sjsonnet
simple_recursive_call	28.8	52.6	sjsonnet 1.83x faster	sjsonnet
realistic_2	89.4	101.7	sjsonnet 1.14x faster	sjsonnet
std_reverse	21.6	23.5	tie (1.09x)	tie

Medium (10-20ms)

Benchmark	sjsonnet (ms)	jrsonnet (ms)	Ratio	Winner
std_base64_byte_array	9.8	18.2	sjsonnet 1.86x faster	sjsonnet
std_base64decodebytes	14.1	20.5	sjsonnet 1.45x faster	sjsonnet
big_object	10.5	11.6	sjsonnet 1.10x faster	sjsonnet
realistic_1	9.3	11.9	sjsonnet 1.27x faster	sjsonnet

Small (<10ms, startup overhead dominant)

Benchmark	sjsonnet (ms)	jrsonnet (ms)	Ratio	Winner
comparsion_for_array	6.3	12.8	sjsonnet 2.02x faster	sjsonnet
foldl_string_concat	5.4	8.6	sjsonnet 1.59x faster	sjsonnet
std_foldl	6.2	7.4	sjsonnet 1.19x faster	sjsonnet
large_string_join	6.8	5.4	jrsonnet 1.26x faster	jrsonnet
array_sorts	8.2	5.5	jrsonnet 1.49x faster	jrsonnet
std_base64	7.8	4.2	jrsonnet 1.86x faster	jrsonnet
std_base64decode	7.3	5.3	jrsonnet 1.36x faster	jrsonnet
std_manifestjsonex	6.4	4.1	jrsonnet 1.54x faster	jrsonnet
std_manifesttomlex	6.5	3.6	jrsonnet 1.82x faster	jrsonnet
std_parseint	6.1	3.6	jrsonnet 1.70x faster	jrsonnet
std_substr	6.2	4.2	jrsonnet 1.45x faster	jrsonnet
string_strips	5.7	3.9	jrsonnet 1.48x faster	jrsonnet
tail_call	5.9	3.7	jrsonnet 1.57x faster	jrsonnet
inheritance_function_recursion	5.0	2.9	jrsonnet 1.74x faster	jrsonnet

Motivation: Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated SWAR string rendering and long-to-char conversion code, plus two missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer, delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect, remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans

Result: Net -132 lines. All tests pass across JVM/JS/Native/WASM.
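As an illustration of a single-pass digit loop with Long overflow detection, here is a minimal sketch. It is an assumption-laden stand-in, not the PR's parseDigits: the `Option[Long]` return type, the sign handling and the use of `Character.digit` are choices made for this example. It accumulates in the negative range, the same approach `java.lang.Long.parseLong` uses, so that `Long.MinValue` stays representable.

```scala
object ParseDigits {
  // Hypothetical signature: None on empty input, an invalid digit, or Long overflow.
  def parseDigits(s: String, base: Int): Option[Long] = {
    if (s.isEmpty) return None
    var i = 0
    var negative = false
    if (s.charAt(0) == '-') { negative = true; i = 1 }
    if (i == s.length) return None               // a lone "-" is not a number
    // Accumulate as a negative value; limit is the most negative result allowed.
    val limit   = if (negative) Long.MinValue else -Long.MaxValue
    val multMin = limit / base
    var acc = 0L
    while (i < s.length) {
      val d = Character.digit(s.charAt(i), base) // -1 if not a digit in this base
      if (d < 0) return None
      if (acc < multMin) return None             // acc * base would overflow
      acc *= base
      if (acc < limit + d) return None           // acc - d would overflow
      acc -= d
      i += 1
    }
    Some(if (negative) acc else -acc)
  }
}
```

No exception is constructed on the failure paths, and nothing is allocated apart from the final `Some`.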

Motivation: PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr, asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86 SIMD C code. This PR restores all optimizations while replacing the buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause) which provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64 input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS add explicit length check for ASCII input, matching go-jsonnet's len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result: Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38% faster than master on base64 workloads.
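The strict-padding point can be illustrated with a small sketch. `java.util.Base64`'s basic decoder accepts input whose trailing `=` padding is missing, so a strict variant needs its own length check first. `StrictBase64.decodeStrict` and its `Either` return type are hypothetical names chosen for this example, not the PlatformBase64 API.

```scala
import java.util.Base64

object StrictBase64 {
  // Rejects input whose length is not a multiple of 4 before handing it to the
  // lenient JDK decoder; other malformed input is reported via the decoder's exception.
  def decodeStrict(s: String): Either[String, Array[Byte]] =
    if (s.length % 4 != 0)
      Left("invalid base64 length " + s.length + ": must be a multiple of 4")
    else
      try Right(Base64.getDecoder.decode(s))
      catch { case e: IllegalArgumentException => Left(String.valueOf(e.getMessage)) }
}
```

With this check, `"YQ=="` decodes to the single byte for `a`, while `"YQ"` is rejected even though the JDK decoder alone would accept it.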

Motivation: String comparison (compareStringsByCodepoint) and long string rendering are hot paths in sort-heavy and render-heavy Jsonnet workloads. The comparison used per-char charAt() virtual dispatch preventing JIT vectorization. Long string rendering used a binary scan (clean→bulk copy, dirty→full reprocess from position 0).

Modification:
1. compareStrings: bulk getChars() + tight array loop enabling JIT auto-vectorization (AVX2/SSE). Surrogate check deferred to mismatch point only (O(1) vs O(n)). ThreadLocal buffers on JVM, local alloc on Native, scalar fallback on JS.
2. findFirstEscapeChar: SWAR scan returning position (not boolean).
3. visitLongString: chunked rendering: find escape position, arraycopy clean prefix, escape inline, repeat. Avoids re-processing entire string when only a few chars need escaping.

Result: All tests pass across JVM (Scala 3.3.7, 2.13.18) and JS. All benchmark regressions pass. Endian-safe (SWAR operates on independent byte lanes).
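A minimal sketch of code-point-order comparison with the surrogate check deferred to the first mismatch is shown below. The names and the per-call buffer allocation are assumptions for illustration; the commit describes reusing buffers instead. The fix-up is the standard remapping that makes UTF-16 code-unit order agree with code point order: surrogates (which encode code points from U+10000 upward) are moved above U+E000..U+FFFF.

```scala
object CodePointCompare {
  // Remaps one UTF-16 code unit so that comparing remapped units matches code point order.
  private def fixup(c: Char): Int =
    if (c < 0xd800) c.toInt
    else if (c >= 0xe000) c - 0x800 // U+E000..U+FFFF shift down below the surrogates
    else c + 0x2000                 // surrogates shift up to the top of the range

  // Negative, zero or positive, comparing a and b by Unicode code point.
  def compareStrings(a: String, b: String): Int = {
    val n  = math.min(a.length, b.length)
    val ca = new Array[Char](n)     // per-call allocation keeps the sketch simple
    val cb = new Array[Char](n)
    a.getChars(0, n, ca, 0)
    b.getChars(0, n, cb, 0)
    var i = 0
    while (i < n && ca(i) == cb(i)) i += 1 // tight loop over plain arrays, no surrogate logic
    if (i == n) Integer.compare(a.length, b.length)
    else Integer.compare(fixup(ca(i)), fixup(cb(i))) // surrogate handling only at the mismatch
  }
}
```

The difference from `String.compareTo` shows up only when a surrogate meets a char in U+E000..U+FFFF: U+FF5E sorts before U+1F600 by code point, but after it by raw UTF-16 code unit.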

Replace per-call `new Array[Char](n)` allocation with module-level pre-allocated buffers in Scala Native's compareStrings. Safe because Scala Native is single-threaded (mirrors the JVM ThreadLocal approach).

Motivation: manifestJsonEx/manifestTomlEx used the generic Visitor interface for char-based rendering, missing the fused direct-walk optimization that ByteRenderer already had. Additionally, char-based string rendering (BaseCharRenderer, MaterializeJsonRenderer) did binary hasEscapeChar check → char-by-char RenderUtils.escapeChar fallback, while ByteRenderer had proper chunked SWAR scanning → bulk arraycopy → inline escape.

Modification:
- Add materializeDirect(Val) to MaterializeJsonRenderer, mirroring ByteRenderer's fused materializer with valTag-based switch dispatch
- Replace visitNonNullString in BaseCharRenderer with chunked rendering: findFirstEscapeCharChar → bulk arraycopy clean segments → escapeCharInline
- Add renderQuotedString to MaterializeJsonRenderer with same chunked pattern
- Add findFirstEscapeCharChar(char[]) to all 3 CharSWAR platform impls
- Wire ManifestModule to use renderer.materializeDirect instead of Materializer.apply0 + Visitor interface

Result: manifestJsonEx gap reduced from 2.15x to ~1.4x slower vs jrsonnet. realistic_2 flipped from 1.62x slower to 1.12x faster.
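The fused direct walk can be pictured with a toy value tree. Everything here is hypothetical: the `Val` hierarchy, the tag numbers and the compact output format are invented for the example and do not reflect sjsonnet's actual types. The point is only the shape: one recursive function that dispatches on an integer tag and writes straight into the output buffer, with no visitor callbacks in between.

```scala
object FusedRender {
  // Toy value tree with an integer tag per node type (stand-in for a valTag).
  sealed abstract class Val { def tag: Int }
  final case class Str(s: String) extends Val { def tag: Int = 0 }
  final case class Num(d: Double) extends Val { def tag: Int = 1 }
  final case class Bool(b: Boolean) extends Val { def tag: Int = 2 }
  case object JNull extends Val { def tag: Int = 3 }
  final case class Arr(items: Vector[Val]) extends Val { def tag: Int = 4 }
  final case class Obj(fields: Vector[(String, Val)]) extends Val { def tag: Int = 5 }

  // Simplified quoting: only '"' and '\\' are escaped in this sketch.
  private def quote(s: String, out: java.lang.StringBuilder): Unit = {
    out.append('"')
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c == '"' || c == '\\') out.append('\\')
      out.append(c)
      i += 1
    }
    out.append('"')
  }

  // Walks the tree directly and dispatches on the integer tag.
  def materializeDirect(v: Val, out: java.lang.StringBuilder): Unit = v.tag match {
    case 0 => quote(v.asInstanceOf[Str].s, out)
    case 1 =>
      val d = v.asInstanceOf[Num].d
      if (d == d.toLong.toDouble) out.append(d.toLong) else out.append(d)
    case 2 => out.append(v.asInstanceOf[Bool].b)
    case 3 => out.append("null")
    case 4 =>
      val items = v.asInstanceOf[Arr].items
      out.append('[')
      var i = 0
      while (i < items.length) {
        if (i > 0) out.append(',')
        materializeDirect(items(i), out)
        i += 1
      }
      out.append(']')
    case 5 =>
      val fields = v.asInstanceOf[Obj].fields
      out.append('{')
      var i = 0
      while (i < fields.length) {
        if (i > 0) out.append(',')
        quote(fields(i)._1, out)
        out.append(':')
        materializeDirect(fields(i)._2, out)
        i += 1
      }
      out.append('}')
    case t => throw new IllegalStateException("unknown tag " + t)
  }

  def render(v: Val): String = {
    val out = new java.lang.StringBuilder
    materializeDirect(v, out)
    out.toString
  }
}
```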

…afe propagation

Motivation: String-heavy stdlib operations (substr, length, join, parseInt) had unnecessary overhead on Scala Native: codePointCount/offsetByCodePoints O(n) scans for ASCII strings, StringBuilder resize churn for join, exception-based parseInt via Long.parseLong.

Modification:
- Add ASCII fast path to Length and Substr using CharSWAR.isAllAscii: skip codePointCount/offsetByCodePoints for ASCII-only strings (99% case)
- Pre-sized char[] assembly for std.join: two-pass approach calculates exact output length, then copies with getChars (zero resize overhead)
- Hand-written parseDigits loop for parseInt/parseOctal/parseHex: no exception setup, no intermediate allocation, single pass
- Propagate _asciiSafe flag: parser sets it on ASCII string literals, Val.Str.concat preserves it when both children are ASCII-safe, join propagates it through all elements

Result: substr gap reduced from 2.03x to ~1.07x. parseint from 1.80x to ~1.0x. large_string_join from 1.81x to ~1.27x. realistic_2 benefits from combined improvements.
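The ASCII fast path for length and substr might look roughly like the sketch below. The names are hypothetical, a scalar loop stands in for CharSWAR.isAllAscii, and `from` and `len` are assumed to be non-negative. In an ASCII-only string every char is exactly one code point, so char indices can be used directly and the code-point APIs are skipped.

```scala
object AsciiFastPath {
  // Scalar stand-in for the SWAR check.
  def isAllAscii(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      if (s.charAt(i) >= 128) return false
      i += 1
    }
    true
  }

  // Length in code points.
  def length(s: String): Int =
    if (isAllAscii(s)) s.length else s.codePointCount(0, s.length)

  // Substring of up to len code points starting at code point index from.
  def substr(s: String, from: Int, len: Int): String =
    if (isAllAscii(s)) {
      val start = math.min(from, s.length)
      val end   = math.min(start.toLong + len, s.length.toLong).toInt
      s.substring(start, end)
    } else {
      val total   = s.codePointCount(0, s.length)
      val startCp = math.min(from, total)
      val endCp   = math.min(startCp.toLong + len, total.toLong).toInt
      val start   = s.offsetByCodePoints(0, startCp)
      val end     = s.offsetByCodePoints(start, endCp - startCp)
      s.substring(start, end)
    }
}
```

The scan in `isAllAscii` is still linear; the saving comes from it being cheaper than code-point counting, and a flag such as _asciiSafe computed once up front would let callers skip even that scan.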

Motivation: Format.format() used StringBuilder which starts small and resizes multiple times for large output. The large_string_template benchmark (591KB template, 256 interpolations) showed 2.78x gap vs jrsonnet.

Modification:
- Three-pass approach: compute formatted values into String array, calculate exact total output length, allocate char[] and copy with getChars (eliminates StringBuilder resize/copy overhead)
- Add direct Val dispatch in format loop: skip Materializer for common types (Str, Num, Bool, Null) to avoid ujson.Value roundtrip

Result: large_string_template gap reduced from 2.78x to ~1.88x. Remaining gap is dominated by Scala Native startup overhead (~7ms vs Rust ~1ms); pure computation time is within ~1ms of jrsonnet.
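The three-pass assembly can be sketched on a simplified template made of literal pieces with one value between each pair. `PreSizedFormat.assemble` and its inputs are hypothetical and much simpler than Format.format(); the sketch only shows the pattern of formatting first, summing the exact length, then filling a single `char[]` with getChars.

```scala
object PreSizedFormat {
  private def fmt(v: Any): String = v match {
    case s: String => s
    case null      => "null"
    case other     => other.toString
  }

  // literals must have exactly one more element than values:
  // the output is literals(0) + values(0) + literals(1) + ... + literals(n).
  def assemble(literals: Array[String], values: Array[Any]): String = {
    require(literals.length == values.length + 1, "need one more literal than values")
    // Pass 1: format every value into its final string.
    val formatted = new Array[String](values.length)
    var i = 0
    while (i < values.length) { formatted(i) = fmt(values(i)); i += 1 }
    // Pass 2: exact output length, summed as Long so overflow is detectable.
    var total = 0L
    i = 0
    while (i < literals.length) { total += literals(i).length; i += 1 }
    i = 0
    while (i < formatted.length) { total += formatted(i).length; i += 1 }
    if (total > Int.MaxValue) throw new IllegalArgumentException("output too large")
    // Pass 3: one allocation, then bulk copies into place.
    val out = new Array[Char](total.toInt)
    var pos = 0
    i = 0
    while (i < literals.length) {
      val lit = literals(i)
      lit.getChars(0, lit.length, out, pos)
      pos += lit.length
      if (i < formatted.length) {
        val f = formatted(i)
        f.getChars(0, f.length, out, pos)
        pos += f.length
      }
      i += 1
    }
    new String(out)
  }
}
```

The same two steps (sum lengths, then copy) apply to a join, with the separator counted between elements.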

Motivation: CI fails on two issues: (1) unused `alwaysinline` import in Native CharSWAR.scala, (2) `\uXXXX` sequences in comments are parsed as unicode escapes in Scala 2.12, causing compilation errors.

Modification:
- Remove unused `scala.scalanative.annotation.alwaysinline` import
- Escape backslash-u sequences in comments across BaseByteRenderer and Renderer

Result: Full test suite passes across all platforms and Scala versions.

He-Pin force-pushed the renderOpt-clean branch from 5512f52 to 3042124 Compare April 13, 2026 07:28

He-Pin marked this pull request as draft April 13, 2026 07:50

He-Pin mentioned this pull request Apr 13, 2026

perf: SIMD-accelerated FastBase64 for Scala Native via C FFI #749

Merged

He-Pin force-pushed the renderOpt-clean branch from 3042124 to 3ac67a1 Compare April 14, 2026 14:16

He-Pin changed the title ~~perf: SWAR string comparison and chunked escape rendering~~ perf: comprehensive Scala Native render pipeline optimization Apr 14, 2026

He-Pin marked this pull request as ready for review April 14, 2026 14:19

He-Pin marked this pull request as draft April 14, 2026 17:32

He-Pin force-pushed the renderOpt-clean branch from a4dde27 to e38e8c4 Compare April 18, 2026 09:59

He-Pin and others added 9 commits April 25, 2026 16:40

perf: use pre-allocated char buffers for Native compareStrings

ae5bc09

style: apply scalafmt to CharSWAR Scala sources

3a8bd10

He-Pin force-pushed the renderOpt-clean branch from bf0e393 to 58759aa Compare April 25, 2026 08:41

He-Pin wants to merge 9 commits into databricks:master from He-Pin:renderOpt-clean
