perf: hybrid hot-path evaluator — up to 40% faster dispatch (#785)

Merged: stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/unified-evaluator on Apr 21, 2026

He-Pin (Contributor) commented Apr 18, 2026

Summary

  • Profile all 66 benchmark files across 5 suites to identify ExprTag visit frequencies. Top 7 types cover 96.1% of all visitExpr calls: ValidId (30%), BinaryOp (21%), Val.Literal (18%), Select (13%), Apply1 (5%), ObjExtend (4%), IfElse (4%).
  • Split NewEvaluator.visitExpr into a hot path (~120 bytecodes, 7 instanceof checks) and a cold path (private visitExprCold using tag + @switch for remaining 30 types).
  • The hot path fits within JIT FreqInlineSize=325 bytecodes, enabling C2 to inline visitExpr into callers (visitBinaryOp, visitSelect, etc.). The old evaluator's ~700-bytecode method body never gets inlined.
  • Add --new-evaluator CLI flag for A/B testing.
  • Add EvaluatorBenchmark (JMH) and ExprTagProfile profiling tool.

JMH Results

Steady-state performance (1 fork, 8 warmup, 10 measurement iterations):

  Benchmark             Old (ms)  New (ms)   Delta
  bench.01                 0.026     0.018    -31%
  bench.02                 32.58     25.73    -21%
  bench.03                  9.39      5.64    -40%
  gen_big_object           0.928     0.715    -23%
  string_render_perf       0.768     0.496    -35%
  base64_mega              3.462     3.106    -10%
  realistic1               1.850     1.764     -5%
  heavy_string_render      34.80     33.09     -5%
  realistic2               47.32     47.78   ~tied
  bench.04, 06, 08, 09         -         -   ~tied

Evaluator-heavy benchmarks (bench.01–03, gen_big_object, string_render_perf) show 21–40% improvement. Builtin-dominated benchmarks (bench.04, foldl, comparison) are unaffected — the evaluator dispatch is not their bottleneck.
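
As a reading aid, the Delta column can be reproduced from the Old and New timings. The snippet below is a minimal sketch, not part of the PR; the `Delta` object and `deltaPct` helper are hypothetical names introduced only for illustration.

```scala
object Delta {
  // Relative change of the new timing against the old one, in percent.
  // Negative values mean the new evaluator is faster.
  def deltaPct(oldMs: Double, newMs: Double): Double =
    (newMs - oldMs) / oldMs * 100.0

  // (benchmark, old ms, new ms) copied from the results table.
  val rows: Seq[(String, Double, Double)] = Seq(
    ("bench.01", 0.026, 0.018),
    ("bench.03", 9.39, 5.64),
    ("realistic2", 47.32, 47.78)
  )

  // Delta per benchmark, rounded to whole percent as in the table.
  val rounded: Seq[(String, Long)] =
    rows.map { case (name, o, n) => (name, math.round(deltaPct(o, n))) }
}
```

With this definition, a difference of about one percent in either direction, as for realistic2, is what the table reports as "tied".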

Why it works

The old evaluator's visitExpr compiles to a ~700-bytecode instanceof chain. This exceeds JIT's FreqInlineSize=325, so C2 never inlines it into callers. Every recursive visitExpr call from within visitBinaryOp, visitSelect, etc. pays full virtual dispatch overhead.

The hybrid approach splits into:

  • Hot path (~120 bytecodes): 7 instanceof checks for 96% of calls — small enough for C2 to inline
  • Cold path (separate method): tag + @switch tableswitch for the remaining 4% — O(1) dispatch instead of scanning 30+ instanceof checks
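
The split described above can be sketched on a toy AST. This is an illustration under stated assumptions, not the PR's code: the `Num`, `Add`, `Neg`, `Abs` and `Sqrt` node types and their tag numbers are hypothetical, and only the names `visitExpr` and `visitExprCold` come from the description.

```scala
import scala.annotation.switch

// Hypothetical miniature AST; the real sjsonnet Expr hierarchy is much larger.
sealed trait Expr { def tag: Int }
final case class Num(value: Double) extends Expr { def tag: Int = 0 }
final case class Add(lhs: Expr, rhs: Expr) extends Expr { def tag: Int = 1 }
final case class Neg(e: Expr) extends Expr { def tag: Int = 2 }
final case class Abs(e: Expr) extends Expr { def tag: Int = 3 }
final case class Sqrt(e: Expr) extends Expr { def tag: Int = 4 }

object HybridEval {
  // Hot path: a short instanceof chain for the most frequent node types,
  // kept small so the JIT has a chance to inline it into its callers.
  def visitExpr(e: Expr): Double = e match {
    case n: Num => n.value
    case a: Add => visitExpr(a.lhs) + visitExpr(a.rhs)
    case _      => visitExprCold(e)
  }

  // Cold path: rare node types dispatched through an integer tag.
  // @switch asks the compiler to verify that the match compiles to a
  // tableswitch or lookupswitch rather than a chain of comparisons.
  private def visitExprCold(e: Expr): Double = (e.tag: @switch) match {
    case 2 => -visitExpr(e.asInstanceOf[Neg].e)
    case 3 => math.abs(visitExpr(e.asInstanceOf[Abs].e))
    case 4 => math.sqrt(visitExpr(e.asInstanceOf[Sqrt].e))
    case _ => throw new IllegalArgumentException(s"unknown tag ${e.tag}")
  }
}
```

For example, `HybridEval.visitExpr(Add(Num(1.0), Neg(Num(3.0))))` resolves `Add` and `Num` on the hot path and only falls through to the tag switch for `Neg`.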

ExprTag frequency data (global across all 66 benchmark files)

  Rank  ExprTag              Count        Pct   Cumulative
  1     ValidId          3,435,607     29.9%      29.9%
  2     BinaryOp         2,455,182     21.4%      51.3%
  3     Val.Literal      2,099,413     18.3%      69.5%
  4     Select           1,464,561     12.7%      82.3%
  5     Apply1             619,927      5.4%      87.7%
  6     ObjExtend          485,621      4.2%      91.9%
  7     IfElse             485,570      4.2%      96.1%
  8     ObjBody.MemberList 250,734      2.2%      98.3%
  9     ApplyBuiltin1      132,666      1.2%      99.4%
  10+   (remaining)         63,212      0.6%     100.0%
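
The Pct and Cumulative columns follow directly from the counts. The sketch below recomputes them; the `Coverage` object and `cumulativePct` helper are hypothetical names for illustration, while the counts are copied from the table.

```scala
object Coverage {
  // Visit counts copied from the table, already sorted by rank.
  val counts: Seq[(String, Long)] = Seq(
    "ValidId"            -> 3435607L,
    "BinaryOp"           -> 2455182L,
    "Val.Literal"        -> 2099413L,
    "Select"             -> 1464561L,
    "Apply1"             -> 619927L,
    "ObjExtend"          -> 485621L,
    "IfElse"             -> 485570L,
    "ObjBody.MemberList" -> 250734L,
    "ApplyBuiltin1"      -> 132666L,
    "(remaining)"        -> 63212L
  )

  // Total number of recorded visitExpr calls.
  val total: Long = counts.map(_._2).sum

  // Cumulative share, in percent, covered by the k most frequent tags.
  def cumulativePct(k: Int): Double =
    counts.take(k).map(_._2).sum * 100.0 / total
}
```

`Coverage.cumulativePct(7)` gives the share of calls handled by the seven hot-path types, which is the 96.1% figure quoted in the summary.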

Test plan

  • ./mill 'sjsonnet.jvm[3.3.7]'.test — all JVM tests pass (both old and new evaluator)
  • ./mill __.reformat — scalafmt clean
  • JMH A/B benchmarks across cpp_suite, go_suite, bug_suite, sjsonnet_suite
  • ExprTagProfile across all 66 benchmark files

Motivation:
The NewEvaluator used a pure tag + @switch (tableswitch) dispatch,
which suffered from invokeinterface overhead on every visitExpr call
(~5-8ns) because Expr is a trait. This made it 0-16% slower than the
old instanceof-chain evaluator in JMH benchmarks.

Modification:
- Profile all 66 benchmark files across 5 suites (cpp, go, jrsonnet,
  sjsonnet, bug) to identify ExprTag frequencies. Top 7 types cover
  96.1% of all visitExpr calls: ValidId (30%), BinaryOp (21%),
  Val.Literal (18%), Select (13%), Apply1 (5%), ObjExtend (4%),
  IfElse (4%).
- Split NewEvaluator.visitExpr into a hot path (~120 bytecodes, 7
  instanceof checks) and a cold path (private visitExprCold using
  tag + @switch for remaining types).
- The hot path fits within JIT FreqInlineSize=325 bytecodes, enabling
  C2 to inline visitExpr into callers (visitBinaryOp, visitSelect,
  etc.). The old evaluator's ~700-bytecode method body never gets
  inlined.
- Add --new-evaluator CLI flag to Config/SjsonnetMainBase for A/B
  testing via hyperfine.
- Add EvaluatorBenchmark (JMH) covering all suites and ExprTagProfile
  profiling tool.

Result:
JMH steady-state performance (1 fork, 8 warmup, 10 measurement):

  Benchmark         Old (ms)  New (ms)  Delta
  bench.01           0.026     0.018    -31%
  bench.02          32.58     25.73     -21%
  bench.03           9.39      5.64    -40%
  gen_big_object     0.928     0.715    -23%
  string_render      0.768     0.496    -35%
  base64_mega        3.462     3.106    -10%
  realistic1         1.850     1.764     -5%
  heavy_string       34.80    33.09      -5%
  realistic2        47.32     47.78     tied
  bench.04-09        tied      tied     tied

stephenamar-db (Collaborator) left a comment:

What would it take to just move to the new evaluator once and for all?
Can you benchmark them?

stephenamar-db merged commit b63fb40 into databricks:master on Apr 21, 2026
5 checks passed

He-Pin commented Apr 21, 2026

I actually think the new one is better now. I will run a benchmark.
