perf: comprehensive Scala Native render pipeline optimization #776

Draft
He-Pin wants to merge 9 commits into databricks:master from He-Pin:renderOpt-clean

@He-Pin He-Pin commented Apr 13, 2026

Summary

Multi-phase performance optimization targeting Scala Native's render and stdlib hot paths. On the most reliable benchmark (realistic_2, 100ms+ runtime), sjsonnet flips from 1.62x slower to 1.12x faster than jrsonnet (Rust).

Changes

  1. Fused char materializer: MaterializeJsonRenderer.materializeDirect(Val) bypasses the upickle Visitor interface, walking the Val tree directly with valTag-based switch dispatch. Mirrors the existing ByteRenderer fused path.

  2. Chunked SWAR char rendering: Replaced binary hasEscapeChar → char-by-char fallback with position-based findFirstEscapeCharChar → arraycopy clean segments → inline escape. Applied to both BaseCharRenderer.visitNonNullString and MaterializeJsonRenderer.renderQuotedString.

  3. ASCII fast paths for substr/length: A CharSWAR.isAllAscii() check skips the O(n) codePointCount/offsetByCodePoints calls for ASCII-only strings (99% of real Jsonnet).

  4. Pre-sized string join: Two-pass approach: calculate exact output length → allocate char[] → copy with getChars. Eliminates StringBuilder resize overhead for large joins.

  5. Hand-written parseInt/parseOctal/parseHex: Single-pass digit loop via shared parseDigits(s, base). No exception handler setup, no intermediate allocation.

  6. _asciiSafe flag propagation: Parser sets the flag on ASCII string literals; Val.Str.concat propagates it when both children are safe; join propagates through all elements. Enables renderers to skip SWAR escape scanning entirely.

  7. Pre-sized format output: Three-pass char[] assembly for Format.format(): compute formatted values → calculate exact length → copy with getChars. Direct Val dispatch skips Materializer for Str/Num/Bool/Null.

  8. SWAR string comparison: CharSWAR.compareStrings with bulk getChars + tight array loop for JIT auto-vectorization (JVM) and explicit SWAR scanning (Native).
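
To make item 4 concrete, here is a minimal sketch of a two-pass, pre-sized join. It is not the code from this PR: the object name `PreSizedJoin` and the `join(sep, parts)` signature are assumptions made for illustration, and the real implementation works on `Val` elements rather than plain strings.

```scala
object PreSizedJoin {

  /** Joins `parts` with `sep` using a single, exactly sized char buffer. */
  def join(sep: String, parts: Array[String]): String =
    if (parts.isEmpty) ""
    else {
      // Pass 1: compute the exact output length. A Long accumulator makes
      // an Int overflow detectable before anything is allocated.
      var total: Long = sep.length.toLong * (parts.length - 1)
      var i = 0
      while (i < parts.length) {
        total += parts(i).length
        i += 1
      }
      if (total > Int.MaxValue)
        throw new IllegalArgumentException("joined string too large")

      // Pass 2: copy every piece into place with String.getChars.
      val out = new Array[Char](total.toInt)
      var pos = 0
      i = 0
      while (i < parts.length) {
        if (i > 0) {
          sep.getChars(0, sep.length, out, pos)
          pos += sep.length
        }
        val p = parts(i)
        p.getChars(0, p.length, out, pos)
        pos += p.length
        i += 1
      }
      new String(out)
    }
}
```

Because the buffer is sized once, no intermediate resize or copy happens while the pieces are written.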

Benchmark Results — Scala Native vs jrsonnet (Rust, from source)

Machine: Apple Silicon, macOS. Tool: hyperfine --warmup 5 --min-runs 20 -N.

Reliable benchmarks (>20ms runtime, startup overhead not dominant)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Result |
|---|---|---|---|
| comparsion_for_primitives | 40 | 230 | sjsonnet 5.7x faster |
| inheritance_recursion | 61 | 127 | sjsonnet 2.1x faster |
| simple_recursive_call | 28 | 52 | sjsonnet 1.8x faster |
| realistic_2 | 90 | 100 | sjsonnet 1.12x faster |

Improvement vs master baseline

| Benchmark | Master gap | After PR | Change |
|---|---|---|---|
| realistic_2 | 1.62x slower | 1.12x faster | Flipped to win |
| std.manifestJsonEx | 2.15x slower | ~1.4x slower | Major improvement |
| std.substr | 2.03x slower | ~1.1x slower | Major improvement |
| std.parseInt | 1.80x slower | ~tied | Major improvement |
| large_string_join | 1.81x slower | ~1.3x slower | Improved |
| large_string_template | 2.78x slower | ~1.9x slower | Improved |

Note on sub-10ms benchmarks: Scala Native process startup is ~3-4ms vs Rust ~1ms. For benchmarks with <10ms total runtime, this 2-3ms difference is 40-60% of measured time and not addressable at application level. The remaining gaps in sub-10ms benchmarks (manifestJsonEx, substr, etc.) are primarily startup overhead.

JMH Benchmarks (JVM)

| Benchmark | master (ms/op) | PR (ms/op) | Change |
|---|---|---|---|
| large_string_template | 1.600 | 1.102 | -31% |
| gen_big_object | 0.927 | 0.899 | -3% |
| realistic2 | 48.348 | 48.373 | ~neutral |

Test plan

  • ./mill 'sjsonnet.jvm[3.3.7]'.test — all test suites pass
  • ./mill 'sjsonnet.js[3.3.7]'.compile — Scala.js compiles
  • ./mill 'sjsonnet.native[3.3.7]'.nativeLink — Native binary builds
  • ./mill __.checkFormat — scalafmt passes
  • Verified realistic_2 improvement is reproducible across multiple runs
  • No regressions on benchmarks sjsonnet already wins

@He-Pin He-Pin marked this pull request as draft April 13, 2026 07:50
@He-Pin He-Pin changed the title from "perf: SWAR string comparison and chunked escape rendering" to "perf: comprehensive Scala Native render pipeline optimization" Apr 14, 2026
@He-Pin He-Pin marked this pull request as ready for review April 14, 2026 14:19

He-Pin commented Apr 14, 2026

I think the string join can be improved further with an AST rewrite, but I want to do that after this is merged.

if (b < 32 || b == '"' || b == '\\') return i
i += 1
}
-1

@He-Pin He-Pin Apr 14, 2026

@tanishiking Does Scala.js support SWAR too? IIRC, JS can only handle 32-bit integers.

I’m not sure what “support SWAR” would mean here, but you can write SWAR-like bit hacks in Scala.js, since JS/Scala.js of course support bitwise operations.

You can find a few interesting examples of that kind of optimization in Scala.js itself, in places like https://github.com/scala-js/scala-js/blob/main/javalib/src/main/scala/java/lang/Integer.scala and https://github.com/scala-js/scala-js/blob/main/javalib/src/main/scala/java/lang/IntegerLong.scala
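
For illustration, a SWAR-style escape scan can be written with nothing but 32-bit `Int` operations, which is what Scala.js offers natively. The sketch below is not taken from this PR: the object name `Swar32` and its methods are hypothetical, and it processes UTF-8 bytes four at a time using the well-known "has zero byte" and "has byte less than n" bit tricks.

```scala
object Swar32 {
  private final val Ones  = 0x01010101
  private final val Highs = 0x80808080

  /** Non-zero iff at least one byte lane of `w` is zero. */
  private def hasZero(w: Int): Int = (w - Ones) & ~w & Highs

  /** True iff any of the four bytes in `w` is below 0x20, a double quote or a backslash. */
  def hasEscape(w: Int): Boolean = {
    val control   = (w - 0x20202020) & ~w & Highs // some byte < 0x20
    val quote     = hasZero(w ^ 0x22222222)       // some byte == '"'
    val backslash = hasZero(w ^ 0x5c5c5c5c)       // some byte == '\'
    (control | quote | backslash) != 0
  }

  /** Packs four bytes starting at `i` into one Int; lane order does not matter for hasEscape. */
  def pack(b: Array[Byte], i: Int): Int =
    (b(i) & 0xff) | ((b(i + 1) & 0xff) << 8) |
      ((b(i + 2) & 0xff) << 16) | ((b(i + 3) & 0xff) << 24)

  /** Index of the first byte that needs a JSON escape, or -1 if there is none. */
  def findFirstEscape(b: Array[Byte]): Int = {
    var i = 0
    // Skip clean 4-byte words in one test each.
    while (i + 4 <= b.length && !hasEscape(pack(b, i))) i += 4
    // A scalar scan pinpoints the position inside the dirty word or the tail.
    while (i < b.length) {
      val c = b(i) & 0xff
      if (c < 32 || c == '"' || c == '\\') return i
      i += 1
    }
    -1
  }
}
```

Bytes at or above 0x80 (UTF-8 continuation and lead bytes) are never reported, because `~w` masks out lanes whose high bit is set.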


He-Pin commented Apr 14, 2026

Benchmark results summary

Environment: Apple Silicon, macOS | Tool: hyperfine --warmup 5 --min-runs 20 -N
sjsonnet: Scala Native (current branch, including the PR #776 optimizations) | jrsonnet: 0.5.0-pre98 (built from source)

Reliable benchmarks (>20ms runtime, startup overhead not dominant)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
|---|---|---|---|---|
| comparsion_for_primitives | 37.6 | 214.5 | sjsonnet 5.71x faster | sjsonnet |
| inheritance_recursion | 60.7 | 120.2 | sjsonnet 1.98x faster | sjsonnet |
| simple_recursive_call | 28.8 | 52.6 | sjsonnet 1.83x faster | sjsonnet |
| realistic_2 | 89.4 | 101.7 | sjsonnet 1.14x faster | sjsonnet |
| std_reverse | 21.6 | 23.5 | tie (1.09x) | tie |

Medium (10-20ms)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
|---|---|---|---|---|
| std_base64_byte_array | 9.8 | 18.2 | sjsonnet 1.86x faster | sjsonnet |
| std_base64decodebytes | 14.1 | 20.5 | sjsonnet 1.45x faster | sjsonnet |
| big_object | 10.5 | 11.6 | sjsonnet 1.10x faster | sjsonnet |
| realistic_1 | 9.3 | 11.9 | sjsonnet 1.27x faster | sjsonnet |

Small (<10ms, startup overhead dominant)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
|---|---|---|---|---|
| comparsion_for_array | 6.3 | 12.8 | sjsonnet 2.02x faster | sjsonnet |
| foldl_string_concat | 5.4 | 8.6 | sjsonnet 1.59x faster | sjsonnet |
| std_foldl | 6.2 | 7.4 | sjsonnet 1.19x faster | sjsonnet |
| large_string_join | 6.8 | 5.4 | jrsonnet 1.26x faster | jrsonnet |
| array_sorts | 8.2 | 5.5 | jrsonnet 1.49x faster | jrsonnet |
| std_base64 | 7.8 | 4.2 | jrsonnet 1.86x faster | jrsonnet |
| std_base64decode | 7.3 | 5.3 | jrsonnet 1.36x faster | jrsonnet |
| std_manifestjsonex | 6.4 | 4.1 | jrsonnet 1.54x faster | jrsonnet |
| std_manifesttomlex | 6.5 | 3.6 | jrsonnet 1.82x faster | jrsonnet |
| std_parseint | 6.1 | 3.6 | jrsonnet 1.70x faster | jrsonnet |
| std_substr | 6.2 | 4.2 | jrsonnet 1.45x faster | jrsonnet |
| string_strips | 5.7 | 3.9 | jrsonnet 1.48x faster | jrsonnet |
| tail_call | 5.9 | 3.7 | jrsonnet 1.57x faster | jrsonnet |
| inheritance_function_recursion | 5.0 | 2.9 | jrsonnet 1.74x faster | jrsonnet |

@He-Pin He-Pin marked this pull request as draft April 14, 2026 17:32
He-Pin added a commit to He-Pin/sjsonnet that referenced this pull request Apr 18, 2026
Motivation:
Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated
SWAR string rendering and long-to-char conversion code, plus two
missing overflow checks in StringModule.

Modification:
- Extract renderQuotedStringSWAR as protected method in BaseCharRenderer,
  delegate from MaterializeJsonRenderer (removes ~60 lines duplication)
- Make escapeCharInline protected, remove duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto inherited writeLongDirect,
  remove standalone RenderUtils.appendLong (~40 lines)
- Add totalLen > Int.MaxValue guard in Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage _asciiSafe flag in Substr/Join to skip redundant scans
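
The Long overflow detection in parseDigits could look roughly like the sketch below. This is an assumption about the shape of the check, not the committed code: the object name `ParseDigits` is hypothetical, and it returns an `Option` for clarity even though the real loop is described as allocation-free.

```scala
object ParseDigits {

  /** Parses a non-negative number in `base` (2 to 16).
    * Returns None for empty input, an invalid digit, or a value above Long.MaxValue.
    */
  def parseDigits(s: String, base: Int): Option[Long] = {
    if (s.isEmpty) return None
    var acc = 0L
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      // ASCII-only digit decoding, no exception handling involved.
      val d =
        if (c >= '0' && c <= '9') c - '0'
        else if (c >= 'a' && c <= 'f') c - 'a' + 10
        else if (c >= 'A' && c <= 'F') c - 'A' + 10
        else -1
      if (d < 0 || d >= base) return None
      // acc * base + d fits in a Long exactly when acc <= (Long.MaxValue - d) / base.
      if (acc > (Long.MaxValue - d) / base) return None
      acc = acc * base + d
      i += 1
    }
    Some(acc)
  }
}
```

The division-based guard runs before the multiplication, so the accumulator never wraps around silently.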

Result:
Net -132 lines. All tests pass across JVM/JS/Native/WASM.
He-Pin and others added 9 commits April 25, 2026 16:40
Motivation:
PR databricks#749 added SIMD base64 and runtime optimizations (ByteArr, RangeArr,
asciiSafe) but was reverted by databricks#777 due to incorrect hand-written x86
SIMD C code. This PR restores all optimizations while replacing the
buggy SIMD code with the battle-tested aklomp/base64 library.

Modification:
- Replace hand-written C SIMD with aklomp/base64 (BSD-2-Clause) which
  provides correct SIMD dispatch (SSSE3/AVX2/AVX512/NEON64) via runtime
  CPU detection
- Add PlatformBase64 abstraction: JVM/JS use java.util.Base64 with strict
  RFC 4648 padding validation, Native uses aklomp/base64 FFI
- Switch to strict mode aligned with go-jsonnet: reject unpadded base64
  input (e.g. "YQ" without "=="). java.util.Base64 is lenient, so JVM/JS
  add explicit length check for ASCII input, matching go-jsonnet's
  len(str) % 4 != 0 check (builtins.go:1467)
- Restore Val.ByteArr: compact byte-backed array for base64DecodeBytes
- Restore Val.RangeArr subclass from flag-based _isRange
- Restore Val.Str._asciiSafe + renderAsciiSafeString
- Restore Materializer/ByteRenderer fast paths for ByteArr
- Add comprehensive test suite (56+ Scala tests + 4 Jsonnet golden tests)
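
As a sketch of the strict-padding behaviour on JVM/JS described above: `java.util.Base64`'s basic decoder accepts input without trailing `=`, so an explicit length check has to run first. The object name `StrictBase64` and the `Either` return type are assumptions for illustration and do not reflect the actual PlatformBase64 API.

```scala
import java.util.Base64

object StrictBase64 {

  /** Decodes `s`, rejecting input whose length is not a multiple of 4. */
  def decode(s: String): Either[String, Array[Byte]] =
    if (s.length % 4 != 0)
      // java.util.Base64 would accept this, so reject it up front.
      Left(s"invalid base64 length ${s.length}: padding is required")
    else
      try Right(Base64.getDecoder.decode(s))
      catch { case e: IllegalArgumentException => Left(e.getMessage) }
}
```

With this check, `"YQ=="` decodes to the single byte `a`, while the unpadded `"YQ"` is rejected instead of being silently accepted.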

Result:
Beats jrsonnet on DecodeBytes benchmarks (1.47x faster). Overall 15-38%
faster than master on base64 workloads.
Motivation:
String comparison (compareStringsByCodepoint) and long string rendering
are hot paths in sort-heavy and render-heavy Jsonnet workloads. The
comparison used per-char charAt() virtual dispatch preventing JIT
vectorization. Long string rendering used a binary scan (clean→bulk copy,
dirty→full reprocess from position 0).

Modification:
1. compareStrings: bulk getChars() + tight array loop enabling JIT
   auto-vectorization (AVX2/SSE). Surrogate check deferred to mismatch
   point only (O(1) vs O(n)). ThreadLocal buffers on JVM, local alloc
   on Native, scalar fallback on JS.
2. findFirstEscapeChar: SWAR scan returning position (not boolean).
3. visitLongString: chunked rendering — find escape position, arraycopy
   clean prefix, escape inline, repeat. Avoids re-processing entire
   string when only a few chars need escaping.
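
The deferred surrogate check in point 1 can be sketched as follows. This is a simplified illustration, not the PR's CharSWAR code: the object name `CodePointCompare` and the `fixup` helper are hypothetical, buffers are allocated per call instead of being reused, and well-formed UTF-16 input is assumed.

```scala
object CodePointCompare {

  // Remaps a UTF-16 code unit so that surrogates sort above the rest of the
  // BMP, which makes code unit order agree with code point order.
  private def fixup(c: Char): Int =
    if (c < 0xd800) c.toInt
    else if (c <= 0xdfff) c + 0x2000 // surrogates move to the top of the range
    else c - 0x800                   // the remaining BMP chars move below them

  /** Compares two strings by Unicode code point order. */
  def compareStrings(a: String, b: String): Int = {
    val ca = new Array[Char](a.length)
    val cb = new Array[Char](b.length)
    a.getChars(0, a.length, ca, 0) // bulk copy, then a tight array loop
    b.getChars(0, b.length, cb, 0)
    val n = math.min(ca.length, cb.length)
    var i = 0
    while (i < n && ca(i) == cb(i)) i += 1
    if (i == n) ca.length - cb.length
    else fixup(ca(i)) - fixup(cb(i)) // surrogate handling only at the mismatch
  }
}
```

The hot loop contains only an equality test on array elements; the surrogate logic runs once, at the first differing position.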

Result:
All tests pass across JVM (Scala 3.3.7, 2.13.18) and JS. All benchmark
regressions pass. Endian-safe (SWAR operates on independent byte lanes).
Replace per-call `new Array[Char](n)` allocation with module-level
pre-allocated buffers in Scala Native's compareStrings. Safe because
Scala Native is single-threaded (mirrors the JVM ThreadLocal approach).
Motivation:
manifestJsonEx/manifestTomlEx used the generic Visitor interface for
char-based rendering, missing the fused direct-walk optimization that
ByteRenderer already had. Additionally, char-based string rendering
(BaseCharRenderer, MaterializeJsonRenderer) did binary hasEscapeChar
check → char-by-char RenderUtils.escapeChar fallback, while ByteRenderer
had proper chunked SWAR scanning → bulk arraycopy → inline escape.

Modification:
- Add materializeDirect(Val) to MaterializeJsonRenderer, mirroring
  ByteRenderer's fused materializer with valTag-based switch dispatch
- Replace visitNonNullString in BaseCharRenderer with chunked rendering:
  findFirstEscapeCharChar → bulk arraycopy clean segments → escapeCharInline
- Add renderQuotedString to MaterializeJsonRenderer with same chunked pattern
- Add findFirstEscapeCharChar(char[]) to all 3 CharSWAR platform impls
- Wire ManifestModule to use renderer.materializeDirect instead of
  Materializer.apply0 + Visitor interface
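
A minimal sketch of the chunked pattern (find escape position, bulk-copy the clean segment, escape inline, repeat) is shown below. It is illustrative only: the object name `ChunkedEscape` is hypothetical, a scalar loop stands in for the SWAR scan, and a `java.lang.StringBuilder` stands in for the renderer's char buffer and arraycopy.

```scala
object ChunkedEscape {

  /** Index of the first char at or after `from` that needs a JSON escape, or -1. */
  def findFirstEscapeChar(cs: Array[Char], from: Int): Int = {
    var i = from
    while (i < cs.length) {
      val c = cs(i)
      if (c < 32 || c == '"' || c == '\\') return i
      i += 1
    }
    -1
  }

  // Writes the JSON escape sequence for a single char.
  private def escapeInline(c: Char, out: java.lang.StringBuilder): Unit = c match {
    case '"'  => out.append('\\').append('"')
    case '\\' => out.append('\\').append('\\')
    case '\n' => out.append('\\').append('n')
    case '\r' => out.append('\\').append('r')
    case '\t' => out.append('\\').append('t')
    case '\b' => out.append('\\').append('b')
    case '\f' => out.append('\\').append('f')
    case _ =>
      // Other control chars: backslash, 'u', then four hex digits.
      out.append('\\').append('u')
      val hex = Integer.toHexString(c.toInt)
      var pad = 4 - hex.length
      while (pad > 0) { out.append('0'); pad -= 1 }
      out.append(hex)
  }

  /** Renders `s` as a quoted JSON string, copying clean segments in bulk. */
  def renderQuoted(s: String): String = {
    val cs = s.toCharArray
    val out = new java.lang.StringBuilder(cs.length + 2)
    out.append('"')
    var start = 0
    var hit = findFirstEscapeChar(cs, start)
    while (hit >= 0) {
      out.append(cs, start, hit - start) // bulk copy of the clean segment
      escapeInline(cs(hit), out)
      start = hit + 1
      hit = findFirstEscapeChar(cs, start)
    }
    out.append(cs, start, cs.length - start) // clean tail
    out.append('"')
    out.toString
  }
}
```

A string with no escapes costs one scan and one bulk copy; each escape adds one extra segment rather than forcing a char-by-char pass over the whole string.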

Result:
manifestJsonEx gap reduced from 2.15x to ~1.4x slower vs jrsonnet.
realistic_2 flipped from 1.62x slower to 1.12x faster.
…afe propagation

Motivation:
String-heavy stdlib operations (substr, length, join, parseInt) had
unnecessary overhead on Scala Native: codePointCount/offsetByCodePoints
O(n) scans for ASCII strings, StringBuilder resize churn for join,
exception-based parseInt via Long.parseLong.

Modification:
- Add ASCII fast path to Length and Substr using CharSWAR.isAllAscii:
  skip codePointCount/offsetByCodePoints for ASCII-only strings (99% case)
- Pre-sized char[] assembly for std.join: two-pass approach calculates
  exact output length, then copies with getChars — zero resize overhead
- Hand-written parseDigits loop for parseInt/parseOctal/parseHex:
  no exception setup, no intermediate allocation, single pass
- Propagate _asciiSafe flag: parser sets it on ASCII string literals,
  Val.Str.concat preserves it when both children are ASCII-safe,
  join propagates it through all elements
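
The ASCII fast path for Length and Substr can be sketched like this. The names are hypothetical (`AsciiFastPath`, plain `String` arguments instead of `Val.Str`), a scalar loop stands in for CharSWAR.isAllAscii, and non-negative `from` and `len` are assumed.

```scala
object AsciiFastPath {

  /** Scalar stand-in for a SWAR all-ASCII check. */
  def isAllAscii(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      if (s.charAt(i) >= 0x80) return false
      i += 1
    }
    true
  }

  /** Length in code points. */
  def length(s: String): Int =
    if (isAllAscii(s)) s.length // one char per code point
    else s.codePointCount(0, s.length)

  /** Substring addressed in code points, clamped to the end of the string. */
  def substr(s: String, from: Int, len: Int): String =
    if (isAllAscii(s)) {
      // Code point offsets equal char offsets for ASCII-only strings.
      val start = math.min(from, s.length)
      val end = math.min(start.toLong + len, s.length.toLong).toInt
      s.substring(start, end)
    } else {
      val total = s.codePointCount(0, s.length)
      val startCp = math.min(from, total)
      val endCp = math.min(startCp.toLong + len, total.toLong).toInt
      val start = s.offsetByCodePoints(0, startCp)
      val end = s.offsetByCodePoints(start, endCp - startCp)
      s.substring(start, end)
    }
}
```

When a string already carries an ASCII-safe flag, even the `isAllAscii` scan can be skipped, which is the point of propagating `_asciiSafe`.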

Result:
substr gap reduced from 2.03x to ~1.07x. parseint from 1.80x to ~1.0x.
large_string_join from 1.81x to ~1.27x. realistic_2 benefits from
combined improvements.
Motivation:
Format.format() used StringBuilder which starts small and resizes
multiple times for large output. The large_string_template benchmark
(591KB template, 256 interpolations) showed 2.78x gap vs jrsonnet.

Modification:
- Three-pass approach: compute formatted values into String array,
  calculate exact total output length, allocate char[] and copy with
  getChars — eliminates StringBuilder resize/copy overhead
- Add direct Val dispatch in format loop: skip Materializer for
  common types (Str, Num, Bool, Null) to avoid ujson.Value roundtrip

Result:
large_string_template gap reduced from 2.78x to ~1.88x. Remaining gap
is dominated by Scala Native startup overhead (~7ms vs Rust ~1ms);
pure computation time is within ~1ms of jrsonnet.
Motivation:
CI fails on two issues: (1) unused `alwaysinline` import in Native
CharSWAR.scala, (2) `\uXXXX` sequences in comments are parsed as
unicode escapes in Scala 2.12, causing compilation errors.

Modification:
- Remove unused `scala.scalanative.annotation.alwaysinline` import
- Escape backslash-u sequences in comments across BaseByteRenderer
  and Renderer

Result:
Full test suite passes across all platforms and Scala versions