perf: comprehensive Scala Native render pipeline optimization #776
Draft
He-Pin wants to merge 9 commits into databricks:master
I think the string join can be improved by rewriting the AST, but I want to do that after this gets merged.
He-Pin commented Apr 14, 2026
    if (b < 32 || b == '"' || b == '\\') return i
    i += 1
  }
  -1
@tanishiking Does Scala.js support SWAR too? IIRC, JS bitwise operations only work on 32-bit integers.

I'm not sure what "support SWAR" would mean here, but you can write SWAR-like bit hacks in Scala.js, since JS and Scala.js of course support bitwise operations.
You can find a few interesting examples of that kind of optimization in Scala.js itself, in places like https://github.com/scala-js/scala-js/blob/main/javalib/src/main/scala/java/lang/Integer.scala and https://github.com/scala-js/scala-js/blob/main/javalib/src/main/scala/java/lang/IntegerLong.scala
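To make the question concrete, here is a minimal sketch of a SWAR scan for the same predicate as the scalar loop quoted above (`b < 32 || b == '"' || b == '\\'`), using only 64-bit `Long` bitwise arithmetic. The names `EscapeScan`, `swarFind`, and `scalarFind` are hypothetical, and this is not the PR's `CharSWAR` code. It assumes the bytes are read as little-endian words so that lane 0 is the first byte. On Scala.js, `Long` is emulated on top of 32-bit operations, so whether this beats the scalar loop there would need measuring.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical illustration, not the sjsonnet implementation.
object EscapeScan {
  private final val Ones  = 0x0101010101010101L
  private final val Highs = 0x8080808080808080L

  // Sets bit 7 of every lane that is zero. Borrow propagation can produce
  // false positives only in lanes above a true hit, so the lowest flagged
  // lane is always exact.
  private def zeroLanes(v: Long): Long = (v - Ones) & ~v & Highs

  // Scalar reference: index of the first byte needing a JSON escape, or -1.
  def scalarFind(bytes: Array[Byte], from: Int): Int = {
    var i = from
    while (i < bytes.length) {
      val b = bytes(i) & 0xff
      if (b < 32 || b == '"' || b == '\\') return i
      i += 1
    }
    -1
  }

  // SWAR version: tests 8 bytes per iteration, then finishes the tail scalar.
  def swarFind(bytes: Array[Byte]): Int = {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    var i = 0
    while (i + 8 <= bytes.length) {
      val word = buf.getLong(i)
      // Lanes below 0x20: the subtraction borrows and sets bit 7; `~word`
      // masks out lanes >= 0x80 (non-ASCII UTF-8 bytes).
      val ctrl   = (word - Ones * 0x20) & ~word & Highs
      val quote  = zeroLanes(word ^ (Ones * 0x22)) // lanes equal to '"'
      val bslash = zeroLanes(word ^ (Ones * 0x5c)) // lanes equal to '\'
      val hits   = ctrl | quote | bslash
      if (hits != 0L) return i + (java.lang.Long.numberOfTrailingZeros(hits) >>> 3)
      i += 8
    }
    scalarFind(bytes, i)
  }
}
```

Returning a position rather than a boolean is what lets a caller copy the clean prefix in bulk and resume scanning after the escaped character.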
● Benchmark results summary
Environment: Apple Silicon, macOS | Tool: hyperfine --warmup 5 --min-runs 20 -N | sjsonnet: Scala Native (current branch, including the PR #776 optimizations) | jrsonnet: 0.5.0-pre98 (built from source)
Reliable benchmarks (>20 ms runtime, startup overhead does not dominate)
Benchmark                      │ sjsonnet (ms) │ jrsonnet (ms) │ Ratio                 │ Winner
───────────────────────────────┼───────────────┼───────────────┼───────────────────────┼──────────
comparsion_for_primitives      │ 37.6          │ 214.5         │ sjsonnet 5.71x faster │ sjsonnet
inheritance_recursion          │ 60.7          │ 120.2         │ sjsonnet 1.98x faster │ sjsonnet
simple_recursive_call          │ 28.8          │ 52.6          │ sjsonnet 1.83x faster │ sjsonnet
realistic_2                    │ 89.4          │ 101.7         │ sjsonnet 1.14x faster │ sjsonnet
std_reverse                    │ 21.6          │ 23.5          │ tie (1.09x)           │ tie
Medium (10-20 ms)
Benchmark                      │ sjsonnet (ms) │ jrsonnet (ms) │ Ratio                 │ Winner
───────────────────────────────┼───────────────┼───────────────┼───────────────────────┼──────────
std_base64_byte_array          │ 9.8           │ 18.2          │ sjsonnet 1.86x faster │ sjsonnet
std_base64decodebytes          │ 14.1          │ 20.5          │ sjsonnet 1.45x faster │ sjsonnet
big_object                     │ 10.5          │ 11.6          │ sjsonnet 1.10x faster │ sjsonnet
realistic_1                    │ 9.3           │ 11.9          │ sjsonnet 1.27x faster │ sjsonnet
Small (<10 ms, startup overhead dominates)
Benchmark                      │ sjsonnet (ms) │ jrsonnet (ms) │ Ratio                 │ Winner
───────────────────────────────┼───────────────┼───────────────┼───────────────────────┼──────────
comparsion_for_array           │ 6.3           │ 12.8          │ sjsonnet 2.02x faster │ sjsonnet
foldl_string_concat            │ 5.4           │ 8.6           │ sjsonnet 1.59x faster │ sjsonnet
std_foldl                      │ 6.2           │ 7.4           │ sjsonnet 1.19x faster │ sjsonnet
large_string_join              │ 6.8           │ 5.4           │ jrsonnet 1.26x faster │ jrsonnet
array_sorts                    │ 8.2           │ 5.5           │ jrsonnet 1.49x faster │ jrsonnet
std_base64                     │ 7.8           │ 4.2           │ jrsonnet 1.86x faster │ jrsonnet
std_base64decode               │ 7.3           │ 5.3           │ jrsonnet 1.36x faster │ jrsonnet
std_manifestjsonex             │ 6.4           │ 4.1           │ jrsonnet 1.54x faster │ jrsonnet
std_manifesttomlex             │ 6.5           │ 3.6           │ jrsonnet 1.82x faster │ jrsonnet
std_parseint                   │ 6.1           │ 3.6           │ jrsonnet 1.70x faster │ jrsonnet
std_substr                     │ 6.2           │ 4.2           │ jrsonnet 1.45x faster │ jrsonnet
string_strips                  │ 5.7           │ 3.9           │ jrsonnet 1.48x faster │ jrsonnet
tail_call                      │ 5.9           │ 3.7           │ jrsonnet 1.57x faster │ jrsonnet
inheritance_function_recursion │ 5.0           │ 2.9           │ jrsonnet 1.74x faster │ jrsonnet
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 18, 2026
Motivation: Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated SWAR string rendering and long-to-char conversion code, plus two missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer, delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect, remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans

Result: Net -132 lines. All tests pass across JVM/JS/Native/WASM.
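As an illustration of the overflow detection mentioned for parseDigits, the following is a minimal sketch of a single-pass digit loop that rejects values exceeding `Long.MaxValue` without relying on exceptions. Only the name `parseDigits(s, base)` comes from this PR; the `Option[Long]` return type, the enclosing `ParseDigitsSketch` object, and the accepted digit set (ASCII `0-9`, `a-f`, `A-F`) are assumptions made for the example.

```scala
// Hypothetical illustration, not the StringModule implementation.
object ParseDigitsSketch {
  // Parses a non-empty run of ASCII digits in `base` (2 to 16).
  // Returns None on an invalid digit or when the value overflows Long.
  def parseDigits(s: String, base: Int): Option[Long] = {
    if (s.isEmpty) return None
    val limit = Long.MaxValue / base // largest acc for which acc * base fits
    var acc = 0L
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      val d =
        if (c >= '0' && c <= '9') c - '0'
        else if (c >= 'a' && c <= 'f') c - 'a' + 10
        else if (c >= 'A' && c <= 'F') c - 'A' + 10
        else -1
      if (d < 0 || d >= base) return None
      if (acc > limit) return None                // acc * base would overflow
      acc *= base
      if (acc > Long.MaxValue - d) return None    // acc + d would overflow
      acc += d
      i += 1
    }
    Some(acc)
  }
}
```

Checking before each multiply and add keeps the loop branch-only, with no exception handler on the hot path.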
Motivation: PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr, asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86 SIMD C code. This PR restores all optimizations while replacing the buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause) which provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64 input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS add explicit length check for ASCII input, matching go-jsonnet's len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)

Result: Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38% faster than master on base64 workloads.
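The strict-padding rule described above can be sketched as a thin wrapper over `java.util.Base64`. This is a minimal illustration assuming ASCII input and assuming that the length check is the only extra validation needed; `StrictBase64` is a hypothetical name and is not the PR's `PlatformBase64`.

```scala
import java.util.Base64

// Hypothetical illustration of the "len % 4 != 0" strictness check.
object StrictBase64 {
  def decode(s: String): Array[Byte] = {
    // The basic JDK decoder does not require trailing '=' padding, so
    // unpadded input is rejected explicitly before delegating to it.
    if (s.length % 4 != 0)
      throw new IllegalArgumentException(
        s"base64 input length ${s.length} is not a multiple of 4")
    Base64.getDecoder.decode(s)
  }
}
```

With this wrapper, `StrictBase64.decode("YQ==")` yields the single byte for `a`, while `StrictBase64.decode("YQ")` throws instead of being accepted.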
Motivation: String comparison (compareStringsByCodepoint) and long string rendering are hot paths in sort-heavy and render-heavy Jsonnet workloads. The comparison used per-char charAt() virtual dispatch preventing JIT vectorization. Long string rendering used a binary scan (clean→bulk copy, dirty→full reprocess from position 0).

Modification:
1. compareStrings: bulk getChars() + tight array loop enabling JIT auto-vectorization (AVX2/SSE). Surrogate check deferred to mismatch point only (O(1) vs O(n)). ThreadLocal buffers on JVM, local alloc on Native, scalar fallback on JS.
2. findFirstEscapeChar: SWAR scan returning position (not boolean).
3. visitLongString: chunked rendering: find escape position, arraycopy clean prefix, escape inline, repeat. Avoids re-processing entire string when only a few chars need escaping.

Result: All tests pass across JVM (Scala 3.3.7, 2.13.18) and JS. All benchmark regressions pass. Endian-safe (SWAR operates on independent byte lanes).
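The idea of deferring the surrogate check to the mismatch point can be sketched as follows. This is a minimal illustration, assuming well-formed UTF-16 input and allocating fresh buffers per call (the commit instead reuses buffers); `CodepointCompare` and its fix-up helper are hypothetical names. The fix-up is the commonly used remapping that moves surrogates above the U+E000..U+FFFF range so that char order matches code point order.

```scala
// Hypothetical illustration, not the CharSWAR.compareStrings implementation.
object CodepointCompare {
  // Remap so that surrogates (0xD800..0xDFFF) sort above 0xE000..0xFFFF,
  // which is where their code points (>= 0x10000) belong.
  private def fixUp(c: Int): Int = if (c >= 0xE000) c - 0x800 else c + 0x2000

  def compare(a: String, b: String): Int = {
    val ca = new Array[Char](a.length); a.getChars(0, a.length, ca, 0)
    val cb = new Array[Char](b.length); b.getChars(0, b.length, cb, 0)
    val n = math.min(ca.length, cb.length)
    var i = 0
    // Tight loop over plain arrays: no surrogate logic in the common path.
    while (i < n && ca(i) == cb(i)) i += 1
    if (i == n) return Integer.compare(ca.length, cb.length)
    var x: Int = ca(i)
    var y: Int = cb(i)
    // Only at the mismatch do surrogates matter, and only if both chars
    // are at or above the surrogate range.
    if (x >= 0xD800 && y >= 0xD800) { x = fixUp(x); y = fixUp(y) }
    Integer.compare(x, y)
  }
}
```

For example, U+FF5E sorts after U+1F600 under `String.compareTo` (UTF-16 code unit order) but before it under this comparison (code point order).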
Replace per-call `new Array[Char](n)` allocation with module-level pre-allocated buffers in Scala Native's compareStrings. Safe because Scala Native is single-threaded (mirrors the JVM ThreadLocal approach).
Motivation: manifestJsonEx/manifestTomlEx used the generic Visitor interface for char-based rendering, missing the fused direct-walk optimization that ByteRenderer already had. Additionally, char-based string rendering (BaseCharRenderer, MaterializeJsonRenderer) did binary hasEscapeChar check → char-by-char RenderUtils.escapeChar fallback, while ByteRenderer had proper chunked SWAR scanning → bulk arraycopy → inline escape.

Modification:
- Add materializeDirect(Val) to MaterializeJsonRenderer, mirroring ByteRenderer's fused materializer with valTag-based switch dispatch
- Replace visitNonNullString in BaseCharRenderer with chunked rendering: findFirstEscapeCharChar → bulk arraycopy clean segments → escapeCharInline
- Add renderQuotedString to MaterializeJsonRenderer with same chunked pattern
- Add findFirstEscapeCharChar(char[]) to all 3 CharSWAR platform impls
- Wire ManifestModule to use renderer.materializeDirect instead of Materializer.apply0 + Visitor interface

Result: manifestJsonEx gap reduced from 2.15x to ~1.4x slower vs jrsonnet. realistic_2 flipped from 1.62x slower to 1.12x faster.
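The chunked pattern (find the next escape position, copy the clean segment in bulk, escape one character inline, repeat) can be sketched like this. It is a minimal illustration that uses a scalar scan in place of the SWAR scan and a `java.lang.StringBuilder` in place of the renderer's own buffer; `ChunkedQuote` and its members are hypothetical names, not the PR's `renderQuotedString`.

```scala
// Hypothetical illustration of chunked JSON string quoting.
object ChunkedQuote {
  private val Hex = "0123456789abcdef"

  // Stand-in for a position-returning scan such as findFirstEscapeCharChar.
  private def findFirstEscape(cs: Array[Char], from: Int): Int = {
    var i = from
    while (i < cs.length) {
      val c = cs(i)
      if (c < 32 || c == '"' || c == '\\') return i
      i += 1
    }
    -1
  }

  // Writes the JSON escape for a single character that needs one.
  private def escapeInline(c: Char, out: java.lang.StringBuilder): Unit = {
    out.append('\\')
    c match {
      case '"' | '\\' => out.append(c)
      case '\n'       => out.append('n')
      case '\r'       => out.append('r')
      case '\t'       => out.append('t')
      case '\b'       => out.append('b')
      case '\f'       => out.append('f')
      case _          => // remaining control chars: 4-digit hex escape
        out.append('u').append("00").append(Hex.charAt(c >> 4)).append(Hex.charAt(c & 0xf))
    }
    ()
  }

  def quote(s: String): String = {
    val cs  = s.toCharArray
    val out = new java.lang.StringBuilder(cs.length + 2)
    out.append('"')
    var start = 0
    var pos   = findFirstEscape(cs, start)
    while (pos >= 0) {
      out.append(cs, start, pos - start) // bulk copy of the clean segment
      escapeInline(cs(pos), out)
      start = pos + 1
      pos = findFirstEscape(cs, start)   // resume after the escaped char
    }
    out.append(cs, start, cs.length - start) // trailing clean segment
    out.append('"')
    out.toString
  }
}
```

Because scanning resumes at `pos + 1`, a string with a handful of escapes is never reprocessed from position 0.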
…afe propagation

Motivation: String-heavy stdlib operations (substr, length, join, parseInt) had unnecessary overhead on Scala Native: codePointCount/offsetByCodePoints O(n) scans for ASCII strings, StringBuilder resize churn for join, exception-based parseInt via Long.parseLong.

Modification:
- Add ASCII fast path to Length and Substr using CharSWAR.isAllAscii: skip codePointCount/offsetByCodePoints for ASCII-only strings (99% case)
- Pre-sized char[] assembly for std.join: two-pass approach calculates exact output length, then copies with getChars, with zero resize overhead
- Hand-written parseDigits loop for parseInt/parseOctal/parseHex: no exception setup, no intermediate allocation, single pass
- Propagate _asciiSafe flag: parser sets it on ASCII string literals, Val.Str.concat preserves it when both children are ASCII-safe, join propagates it through all elements

Result: substr gap reduced from 2.03x to ~1.07x. parseint from 1.80x to ~1.0x. large_string_join from 1.81x to ~1.27x. realistic_2 benefits from combined improvements.
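The two-pass join can be sketched as below: the first pass sums the exact output length in a `Long`, the second copies each piece into a pre-sized `char[]` with `getChars`. This is a minimal illustration over plain strings; `PresizedJoin` is a hypothetical name, and the real code operates on Jsonnet values rather than `Array[String]`.

```scala
// Hypothetical illustration of pre-sized join assembly.
object PresizedJoin {
  def join(sep: String, parts: Array[String]): String = {
    if (parts.isEmpty) return ""
    // Pass 1: exact output length, accumulated as Long to detect overflow.
    var total = sep.length.toLong * (parts.length - 1)
    var i = 0
    while (i < parts.length) { total += parts(i).length; i += 1 }
    if (total > Int.MaxValue)
      throw new IllegalArgumentException("joined string would exceed Int.MaxValue chars")
    // Pass 2: copy straight into the final buffer, no resizing.
    val out = new Array[Char](total.toInt)
    var pos = 0
    i = 0
    while (i < parts.length) {
      if (i > 0) { sep.getChars(0, sep.length, out, pos); pos += sep.length }
      val p = parts(i)
      p.getChars(0, p.length, out, pos)
      pos += p.length
      i += 1
    }
    new String(out)
  }
}
```

The `Long` accumulator is what makes the `Int.MaxValue` guard meaningful: summing in `Int` could wrap around before the check runs.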
Motivation: Format.format() used StringBuilder which starts small and resizes multiple times for large output. The large_string_template benchmark (591KB template, 256 interpolations) showed 2.78x gap vs jrsonnet.

Modification:
- Three-pass approach: compute formatted values into String array, calculate exact total output length, allocate char[] and copy with getChars, which eliminates StringBuilder resize/copy overhead
- Add direct Val dispatch in format loop: skip Materializer for common types (Str, Num, Bool, Null) to avoid ujson.Value roundtrip

Result: large_string_template gap reduced from 2.78x to ~1.88x. Remaining gap is dominated by Scala Native startup overhead (~7ms vs Rust ~1ms); pure computation time is within ~1ms of jrsonnet.
Motivation: CI fails on two issues: (1) unused `alwaysinline` import in Native CharSWAR.scala, (2) `\uXXXX` sequences in comments are parsed as unicode escapes in Scala 2.12, causing compilation errors.

Modification:
- Remove unused `scala.scalanative.annotation.alwaysinline` import
- Escape backslash-u sequences in comments across BaseByteRenderer and Renderer

Result: Full test suite passes across all platforms and Scala versions
Summary
Multi-phase performance optimization targeting Scala Native's render and stdlib hot paths. On the most reliable benchmark (realistic_2, 100ms+ runtime), sjsonnet flips from 1.62x slower to 1.12x faster than jrsonnet (Rust).
Changes
- Fused char materializer: `MaterializeJsonRenderer.materializeDirect(Val)` bypasses the upickle Visitor interface, walking the Val tree directly with valTag-based switch dispatch. Mirrors the existing ByteRenderer fused path.
- Chunked SWAR char rendering: Replaced binary `hasEscapeChar` → char-by-char fallback with position-based `findFirstEscapeCharChar` → `arraycopy` clean segments → inline escape. Applied to both `BaseCharRenderer.visitNonNullString` and `MaterializeJsonRenderer.renderQuotedString`.
- ASCII fast paths for substr/length: `CharSWAR.isAllAscii()` check skips O(n) `codePointCount`/`offsetByCodePoints` for ASCII-only strings (99% of real Jsonnet).
- Pre-sized string join: Two-pass approach: calculate exact output length → allocate char[] → copy with `getChars`. Eliminates StringBuilder resize overhead for large joins.
- Hand-written parseInt/parseOctal/parseHex: Single-pass digit loop via shared `parseDigits(s, base)`. No exception handler setup, no intermediate allocation.
- `_asciiSafe` flag propagation: Parser sets the flag on ASCII string literals; `Val.Str.concat` propagates it when both children are safe; `join` propagates through all elements. Enables renderers to skip SWAR escape scanning entirely.
- Pre-sized format output: Three-pass char[] assembly for `Format.format()`: compute formatted values → calculate exact length → copy with `getChars`. Direct Val dispatch skips Materializer for Str/Num/Bool/Null.
- SWAR string comparison: `CharSWAR.compareStrings` with bulk `getChars` + tight array loop for JIT auto-vectorization (JVM) and explicit SWAR scanning (Native).

Benchmark Results: Scala Native vs jrsonnet (Rust, from source)
Machine: Apple Silicon, macOS. Tool: hyperfine --warmup 5 --min-runs 20 -N.
Reliable benchmarks (>20ms runtime, startup overhead not dominant)
Improvement vs master baseline
JMH Benchmarks (JVM)
Test plan
- `./mill 'sjsonnet.jvm[3.3.7]'.test`: all test suites pass
- `./mill 'sjsonnet.js[3.3.7]'.compile`: Scala.js compiles
- `./mill 'sjsonnet.native[3.3.7]'.nativeLink`: Native binary builds
- `./mill __.checkFormat`: scalafmt passes