perf: hybrid hot-path evaluator — up to 40% faster dispatch by He-Pin · Pull Request #785 · databricks/sjsonnet

He-Pin · 2026-04-18T18:12:44Z

Summary

Profile all 66 benchmark files across 5 suites to identify ExprTag visit frequencies. Top 7 types cover 96.1% of all visitExpr calls: ValidId (30%), BinaryOp (21%), Val.Literal (18%), Select (13%), Apply1 (5%), ObjExtend (4%), IfElse (4%).
Split NewEvaluator.visitExpr into a hot path (~120 bytecodes, 7 instanceof checks) and a cold path (private visitExprCold using tag + @switch for remaining 30 types).
The hot path fits within JIT FreqInlineSize=325 bytecodes, enabling C2 to inline visitExpr into callers (visitBinaryOp, visitSelect, etc.). The old evaluator's ~700-bytecode method body never gets inlined.
Add --new-evaluator CLI flag for A/B testing.
Add EvaluatorBenchmark (JMH) and ExprTagProfile profiling tool.

JMH Results

Steady-state performance (1 fork, 8 warmup, 10 measurement iterations):

Benchmark	Old (ms)	New (ms)	Delta
bench.01	0.026	0.018	-31%
bench.02	32.58	25.73	-21%
bench.03	9.39	5.64	-40%
gen_big_object	0.928	0.715	-23%
string_render_perf	0.768	0.496	-35%
base64_mega	3.462	3.106	-10%
realistic1	1.850	1.764	-5%
heavy_string_render	34.80	33.09	-5%
realistic2	47.32	47.78	~tied
bench.04, 06, 08, 09	-	-	~tied

Evaluator-heavy benchmarks (bench.01–03, gen_big_object, string_render_perf) show 21–40% improvement. Builtin-dominated benchmarks (bench.04, foldl, comparison) are unaffected — the evaluator dispatch is not their bottleneck.
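
For readers checking the Delta column, the figures are consistent with a plain relative change, (new - old) / old, rounded to a whole percent. The helper below is a small illustrative sketch of that arithmetic; the class and method names are made up here and are not part of the PR.

```java
final class Delta {
    // Relative change from oldMs to newMs, as a rounded whole percent.
    // Negative values mean the new evaluator was faster.
    static long deltaPct(double oldMs, double newMs) {
        return Math.round(100.0 * (newMs - oldMs) / oldMs);
    }

    public static void main(String[] args) {
        // Old/new timings copied from the bench.03 row of the table above.
        System.out.println("bench.03 delta: " + deltaPct(9.39, 5.64) + "%");
    }
}
```

Small differences such as the +1% on realistic2 are reported as "~tied" in the table rather than as a signed delta.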

Why it works

The old evaluator's visitExpr compiles to a ~700-bytecode instanceof chain. This exceeds JIT's FreqInlineSize=325, so C2 never inlines it into callers. Every recursive visitExpr call from within visitBinaryOp, visitSelect, etc. pays full virtual dispatch overhead.
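
The 325 figure is a HotSpot default that can differ by platform or be overridden on the command line, so it may be worth confirming on the JVM actually used for benchmarking. The sketch below is one way to read it at runtime, assuming a HotSpot JVM with C2 available; `InlineBudget` and its methods are hypothetical names, and `fitsInlineBudget` is a deliberate simplification of the real inlining heuristics, which also weigh call frequency, inlining depth, and more.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalInt;
import com.sun.management.HotSpotDiagnosticMXBean;

final class InlineBudget {
    // Reads the running JVM's FreqInlineSize flag, or returns empty when the
    // flag is unavailable (for example on a non-HotSpot JVM).
    static OptionalInt freqInlineSize() {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) return OptionalInt.empty();
            String value = bean.getVMOption("FreqInlineSize").getValue();
            return OptionalInt.of(Integer.parseInt(value));
        } catch (RuntimeException ex) {
            return OptionalInt.empty();
        }
    }

    // Simplified size check only: a hot method is a candidate for inlining
    // when its bytecode size does not exceed the limit.
    static boolean fitsInlineBudget(int bytecodeSize, int limit) {
        return bytecodeSize <= limit;
    }

    public static void main(String[] args) {
        int limit = freqInlineSize().orElse(325); // fall back to the value quoted in the PR
        System.out.println("FreqInlineSize = " + limit);
        System.out.println("~120-bytecode hot path fits: " + fitsInlineBudget(120, limit));
        System.out.println("~700-bytecode chain fits:    " + fitsInlineBudget(700, limit));
    }
}
```

To see what C2 actually decided for a given call site, HotSpot's `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining` output is the more direct evidence.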

The hybrid approach splits into:

Hot path (~120 bytecodes): 7 instanceof checks for 96% of calls — small enough for C2 to inline
Cold path (separate method): tag + @switch tableswitch for the remaining 4% — O(1) dispatch instead of scanning 30+ instanceof checks
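
The split can be sketched roughly as follows. This is a Java miniature rather than the PR's Scala code: the node types, tag numbers, and arithmetic semantics are invented for illustration, and only the overall shape (a short instanceof chain that falls through to a separate switch-on-tag method) mirrors what the description says.

```java
final class HybridEvaluator {
    // Hypothetical miniature AST; not sjsonnet's real Expr hierarchy.
    interface Expr { int tag(); }

    static final class Literal implements Expr {
        final double value;
        Literal(double value) { this.value = value; }
        public int tag() { return 0; }
    }
    static final class BinaryOp implements Expr { // addition only, to keep the sketch short
        final Expr lhs, rhs;
        BinaryOp(Expr lhs, Expr rhs) { this.lhs = lhs; this.rhs = rhs; }
        public int tag() { return 1; }
    }
    static final class IfElse implements Expr { // a non-zero condition counts as true
        final Expr cond, thenExpr, elseExpr;
        IfElse(Expr cond, Expr thenExpr, Expr elseExpr) {
            this.cond = cond; this.thenExpr = thenExpr; this.elseExpr = elseExpr;
        }
        public int tag() { return 2; }
    }
    // Stand-ins for the rarely visited node types.
    static final class Negate implements Expr {
        final Expr inner;
        Negate(Expr inner) { this.inner = inner; }
        public int tag() { return 3; }
    }
    static final class Abs implements Expr {
        final Expr inner;
        Abs(Expr inner) { this.inner = inner; }
        public int tag() { return 4; }
    }

    // Hot path: a few instanceof checks for the most frequent node types,
    // kept small so the JIT can consider inlining it into its callers.
    static double visitExpr(Expr e) {
        if (e instanceof Literal) return ((Literal) e).value;
        if (e instanceof BinaryOp) {
            BinaryOp b = (BinaryOp) e;
            return visitExpr(b.lhs) + visitExpr(b.rhs);
        }
        if (e instanceof IfElse) {
            IfElse i = (IfElse) e;
            return visitExpr(i.cond) != 0.0 ? visitExpr(i.thenExpr) : visitExpr(i.elseExpr);
        }
        return visitExprCold(e);
    }

    // Cold path: a single switch on a dense integer tag, kept out of line so
    // its size does not count against the hot method.
    private static double visitExprCold(Expr e) {
        switch (e.tag()) {
            case 3: return -visitExpr(((Negate) e).inner);
            case 4: return Math.abs(visitExpr(((Abs) e).inner));
            default: throw new IllegalArgumentException("unknown tag " + e.tag());
        }
    }

    public static void main(String[] args) {
        Expr e = new BinaryOp(new Literal(1), new Negate(new Literal(3)));
        System.out.println(visitExpr(e)); // prints -2.0
    }
}
```

The interface call to `tag()` is paid only on the cold path here, which matches the commit message's point that paying it on every visit was what made the pure tag-dispatch evaluator slower.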

ExprTag frequency data (global across all 66 benchmark files)

  Rank  ExprTag                 Count     Pct  Cumulative
  1     ValidId             3,435,607   29.9%       29.9%
  2     BinaryOp            2,455,182   21.4%       51.3%
  3     Val.Literal         2,099,413   18.3%       69.5%
  4     Select              1,464,561   12.7%       82.3%
  5     Apply1                619,927    5.4%       87.7%
  6     ObjExtend             485,621    4.2%       91.9%
  7     IfElse                485,570    4.2%       96.1%
  8     ObjBody.MemberList    250,734    2.2%       98.3%
  9     ApplyBuiltin1         132,666    1.2%       99.4%
  10+   (remaining)            63,212    0.6%      100.0%
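
The 96.1% hot-path coverage can be re-derived from the Count column. The snippet below is only a sanity check on the published numbers, with a hypothetical class name; it is not the ExprTagProfile tool itself.

```java
final class TagCoverage {
    // Visit counts copied from the table above, in rank order
    // (the last entry is the combined "remaining" bucket).
    static final long[] COUNTS = {
        3_435_607L, 2_455_182L, 2_099_413L, 1_464_561L, 619_927L,
        485_621L, 485_570L, 250_734L, 132_666L, 63_212L
    };

    // Percentage of all visits accounted for by the first topN entries.
    static double coveragePct(long[] counts, int topN) {
        long total = 0, head = 0;
        for (int i = 0; i < counts.length; i++) {
            total += counts[i];
            if (i < topN) head += counts[i];
        }
        return 100.0 * head / total;
    }

    public static void main(String[] args) {
        System.out.printf("top 7 cover %.1f%% of visitExpr calls%n", coveragePct(COUNTS, 7));
    }
}
```

The same function applied with other cut-offs shows the trade-off behind choosing seven checks: each further type adds bytecode to the hot path while covering only a few more percent of calls.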

Test plan

./mill 'sjsonnet.jvm[3.3.7]'.test — all JVM tests pass (both old and new evaluator)
./mill __.reformat — scalafmt clean
JMH A/B benchmarks across cpp_suite, go_suite, bug_suite, sjsonnet_suite
ExprTagProfile across all 66 benchmark files

Motivation: The NewEvaluator used a pure tag + @switch (tableswitch) dispatch which suffered from invokeinterface overhead on every visitExpr call (~5-8ns) because Expr is a trait. This made it 0-16% slower than the old instanceof-chain evaluator in JMH benchmarks.

Modification:
- Profile all 66 benchmark files across 5 suites (cpp, go, jrsonnet, sjsonnet, bug) to identify ExprTag frequencies. Top 7 types cover 96.1% of all visitExpr calls: ValidId (30%), BinaryOp (21%), Val.Literal (18%), Select (13%), Apply1 (5%), ObjExtend (4%), IfElse (4%).
- Split NewEvaluator.visitExpr into a hot path (~120 bytes, 7 instanceof checks) and a cold path (private visitExprCold using tag + @switch for remaining types).
- The hot path fits within JIT FreqInlineSize=325 bytecodes, enabling C2 to inline visitExpr into callers (visitBinaryOp, visitSelect, etc.). The old evaluator's ~700-byte method body never gets inlined.
- Add --new-evaluator CLI flag to Config/SjsonnetMainBase for A/B testing via hyperfine.
- Add EvaluatorBenchmark (JMH) covering all suites and ExprTagProfile profiling tool.

Result: JMH steady-state performance (1 fork, 8 warmup, 10 measurement):

Benchmark	Old (ms)	New (ms)	Delta
bench.01	0.026	0.018	-31%
bench.02	32.58	25.73	-21%
bench.03	9.39	5.64	-40%
gen_big_object	0.928	0.715	-23%
string_render	0.768	0.496	-35%
base64_mega	3.462	3.106	-10%
realistic1	1.850	1.764	-5%
heavy_string	34.80	33.09	-5%
realistic2	47.32	47.78	tied
bench.04-09	tied	tied	tied

stephenamar-db

what would it take to just move to the new Evaluator once and for all?
Can you benchmark them?

He-Pin · 2026-04-21T17:39:41Z

I actually think the new one is better now. I will run a benchmark.

stephenamar-db approved these changes Apr 21, 2026

stephenamar-db merged commit b63fb40 into databricks:master Apr 21, 2026
5 checks passed

He-Pin mentioned this pull request Apr 25, 2026

perf: promote hot-path evaluator to default; remove dual-evaluator flag #788

Open

6 tasks

stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/unified-evaluator
