Skip to content

Performance (overview): round-3 combined — 1.82× faster automode (all five PRs)#3839

Draft
unp1 wants to merge 25 commits into
mainfrom
perf-round3-meta
Draft

Performance (overview): round-3 combined — 1.82× faster automode (all five PRs)#3839
unp1 wants to merge 25 commits into
mainfrom
perf-round3-meta

Conversation

@unp1

@unp1 unp1 commented Jun 17, 2026

Copy link
Copy Markdown
Member

This is the overview of a performance series that prepares a clearer match → evaluate → apply pipeline before larger optimisations: the compiled matcher (1/5, #3831) plus the four default-on PRs #3835#3838 below. It reports their combined effect.

📖 Developer docs: Performance Optimizations (3.1)

Correction: an earlier version of this PR reported a 1.66× combined speed-up. That figure
did not include the compiled matcher (#3831) — the
first PR of this series — which was an oversight, now corrected. The figures below cover all five
PRs
of the series (matcher + the four pipeline-preparation PRs), all with the compiled matcher on
by default.

Meta: Performance round 3 — combined

This is the umbrella PR for round-3 performance work. It collects the individually-reviewable PRs below; each is self-contained and can be merged on its own. This branch stacks all five member PRs (plus a small test-only commit adding the perfTest measurement group), so the combined figures below can be reproduced directly from a checkout — see Reproducing these numbers. All figures are against main.

Member PRs

# PR What Byte-identical?
1 #3831 — matcher compiled (partially-evaluated) taclet find-matcher yes (differential)
2 #3835 — memory remove 4 allocation hot-spots + release parser DFA cache yes
3 #3836 — checkPrefix O(d²)→O(d) prefix walk via cached transformer-free predicate yes
4 #3837 — cost reuse + age-as-cost-term carry age-free cost across re-expansion (sound, annotation-driven) yes (perfTest/perfValidation)
5 #3838 — assumes-parking operator-indexed parking of assumes-incomplete bases (+ drop order-fragile lenOfSeqSubEQ) no (shifts proof shapes)

All five are active by default — checking out any branch needs no build configuration.

Combined performance

Measured on the 6-problem perfTest real-world set, median of 3 runs, main vs all five combined.

Combined: 1.82× faster automode on the 6-problem set (101.8 s → 55.9 s, median of 3 runs).

Problem main (ms) all five (ms) speedup
symmArray 23346 12105 1.93×
gemplusDecimal/add 12113 8148 1.49×
ArrayList.remove.1 3907 2627 1.49×
SimplifiedLinkedList.remove 31367 17490 1.79×
Saddleback_search 25710 13327 1.93×
coincidence_count/project 5330 2200 2.42×
Total 101773 55897 1.82×

Per-PR contribution (same 6 problems, each alone vs main): memory 1.12×, checkPrefix 1.01×, cost-reuse+age 1.07×, parking 1.44×; the matcher (#3831) is a foundation that adds the remainder on top (its standalone matching speed-up is ~6.5–8.3× but most matching is cached during proof search).

flowchart LR
  M["main"] --> X["+ matcher"] --> A["+ memory"] --> B["+ checkPrefix"] --> C["+ cost"] --> D["+ parking = 1.82x"]
Loading

Reproducing these numbers

This branch stacks all five PRs (compiled matcher on by default) plus a test-only commit adding the
perfTest group. By default the full regression suite still runs (like main); restrict to the
6 perfTest problems with -Dkey.runallproofs.runOnlyOn=perfTest. From a checkout:

rm -rf key.core/build/generated-src/rap
./gradlew :key.core:generateRAPUnitTests -Dkey.runallproofs.runOnlyOn=perfTest  # generate only perfTest
./gradlew :key.core:testRAP --rerun-tasks -PrapForks=1                          # run them serially

Then read the Automode time (ms) column of
key.ui/examples/build/reports/runallproofs/runStatistics.csv (one row per problem; the file
appends, so clear it between runs and take the median of a few runs on an otherwise idle machine).
The baseline numbers are the same procedure on main with only that test-only commit cherry-picked.

Methodology

  • 6 curated real-world problems (the perfTest group): SimplifiedLinkedList.remove, gemplusDecimal/add, Saddleback_search, symmArray, coincidence_count/project, ArrayList.remove.1.
  • runAllProofs perfTest group, serial forks, automode time from proof.getStatistics(), median of 3 runs.
  • Node counts reported alongside time; for byte-identical optimizations node counts are unchanged so the speedup is pure time-per-node.
  • Synthetic families (arith/quant/symEx) reported where they isolate a specific engine.

📖 Conceptual overview in the developer docs: Performance Optimizations (3.1)

Additional information and contact(s)

This PR has been done with AI tooling support.

The contributions within this pull request are licensed under GPLv2 (only) for inclusion in KeY.

unp1 and others added 12 commits June 17, 2026 00:13
The Java program inside a modality was matched by a single monolithic
MatchProgramInstruction delegating to ProgramElement.match. Make the generic
program part (ordinary statements/expressions: class + exact arity + child
recursion, and non-list program schema variables) matchable by the same
instruction VM the rest of the find pattern uses:

- MatchProgramElementInstruction (class + exact arity, generic over
  SyntaxElement) and MatchSubProgramInstruction (runs a sub-program over the
  modality's program via its own cursor).
- The generator converts such programs into a VMInstruction sub-program;
  anything it does not handle falls back to MatchProgramInstruction.
- Seams introduced here: MatchProgram (the match-program abstraction
  implemented by VMProgramInterpreter, and later by the compiled matcher)
  and ProgramChildrenMatcher (for matching a run of program children).

Gated behind -Dkey.matcher.programInstructions (read at matcher construction
so it can be toggled by reloading; default off → unchanged monolithic path);
behaviour-preserving when on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The .. ... pattern of symbolic-execution taclets (ContextStatementBlock) is
matched specially: variable-length prefix descent to the active statement,
inner execution context, active-statement matching, and prefix/suffix
position bookkeeping. Convert phase (3) -- the active statements -- to VM
instructions while keeping the intricate phases (1)(2)(4) in place:

- ContextStatementBlock.match gains a phase-(3) seam taking a
  ProgramChildrenMatcher; the default still uses matchChildren, but a
  supplied matcher (a VM sub-program here) can drive the active-statement
  matching instead.
- MatchContextStatementBlockInstruction wires a context block to that seam.
- VMProgramInterpreter.matchChildrenFrom runs a sub-program over a run of
  source children from a child offset (the active statements).
- The generator emits the context-block instruction for a top-level context
  block, falling back when an active statement is not convertible.

Same -Dkey.matcher.programInstructions gate; behaviour-preserving when on.
CompiledMatchProgram is a second MatchProgram backend that navigates the
term and Java-program structure directly (term.op()/sub(i) and
SyntaxElement.getChild), avoiding the PoolSyntaxElementCursor entirely. It
compiles essentially the whole find-taclet base: ordinary operators and
schema variables, bound variables (quantifiers/substitutions), modalities
with their program (generic programs and context blocks), parametric
function instances and elementary updates; program elements with their own
match (value literals, type refs, loops) and variable-arity children (list
SVs #slist) are reused cursor-free by delegating to their own match.

VMTacletMatcher selects the compiled find-matcher when
-Dkey.matcher.compiled is set (read at construction, so toggling it and
reloading switches matchers; default off → the interpreter, which stays the
source of truth). Behaviour-preserving; ~1.2-1.7x on matcher-bound proving.

CompiledMatchProgramTest checks the compiled matcher against explicit
expectations (propositional, function and bound-variable patterns, success
and failure).
testRAP (generated runAllProofs regression suite, in-process ProveTest per
fork) now runs on up to 10 parallel JVMs (-PrapForks=N), with a configurable
per-fork heap (-PrapHeap), and forwards the compiled-matcher switch
(-Pmatcher.compiled=true / -Dkey.matcher.compiled) to the proof runs so the
regression suite can exercise the compiled matcher.
…RE MERGE]

Development-only verification/measurement, not intended for the PR:
- ProgramMatchDifferentialTest: every find-taclet matched by the interpreter
  oracle vs the converted/compiled matchers over a real-proof term corpus,
  asserting identical results (success/failure + instantiations).
- CompiledMatchProgramBenchmark / ContextMatchBenchmark: isolated
  interpreter-vs-compiled matching-time measurements.
A new language-agnostic module holding the match-plan IR from which both
find-matcher back-ends (the VMProgramInterpreter and the cursor-free compiled
matcher) are derived from a single description, so a construct described once
drives both.

- MatchPlan: the IR node (emitInstructions for the interpreter, compile for the
  compiled matcher); OperatorPlan / SchemaVarPlan cover the term skeleton.
- MatchHead: the operator-specific check (no subterm recursion); GenericOperatorHead
  handles ordinary operators.
- BinderMatcher / ProgramMatchHook: the two cross-language SPIs (bound variables
  and the modality program AST), kept abstract here so other ncore-based provers
  (Rust, Solidity) can reuse the framework.

The module depends only on key.ncore / key.ncore.calculus / key.util (no
Java-DL types); key.core gains a dependency on it.
The single Java-DL dispatch (JavaMatchPlanBuilder.buildPlan) builds a MatchPlan
for a find pattern; both back-ends are then derived from it. It covers the FOL
term skeleton, elementary updates, parametric function instances and modalities,
falling back to the legacy matchers only for term labels.

- Heads ElementaryUpdateHead / ParametricFunctionHead / ModalityHead carry the
  operator-specific interpreter + compiled fragments, lifted verbatim from the
  hand-written generator and compiled matcher (so behaviour is preserved).
- JavaBinderMatcher / JavaProgramMatchHook implement the two SPIs (bound-variable
  binding/renaming; the JavaBlock / ContextStatementBlock program matching).
- CompiledMatchProgram.compiledProgramMatcher is extracted from compileModality
  (now keyed on the JavaBlock) so the compiled matcher and the program hook share
  one program-matching implementation; buildProgramInstruction is made
  package-visible for the hook's interpreter side.
- JavaMatchPlanBuilder also exposes the production facades interpreterProgram /
  compiledProgram (framework-built, with legacy fallback for term labels).
The find and assumes matchers are now built via JavaMatchPlanBuilder
(interpreterProgram / compiledProgram) instead of calling the two hand-written
dispatches directly, making the match-plan IR the single source of truth in
production. The facades fall back to the legacy generator / compiled matcher for
the constructs the framework does not build yet (term labels), so behaviour is
unchanged. The key.matcher.compiled / key.matcher.programInstructions flags keep
their meaning.
…ROP BEFORE MERGE]

Extends the dev-only differential test and micro-benchmark (added in the
"matcher differential test + isolated benchmarks" drop commit) to also exercise
the match-plan framework alongside the hand-written matchers:
- ProgramMatchDifferentialTest builds the plan and verifies its interpreter (with
  program-instruction conversion both off and on, since production reads that
  flag) and its compiled matcher against the legacy oracle (24.8M comparisons).
- CompiledMatchProgramBenchmark times the framework-built matchers next to the
  hand-written ones for both back-ends (the no-overhead check).

Like the commit it extends, this is dropped before merge.
…e source)

With both back-ends now derived from the match-plan framework, the two
hand-written per-construct dispatches are redundant and are removed, leaving the
framework as the single source of truth for find-matching:

- delete CompiledMatchProgram's term-level dispatch (compile / compileStep /
  compileCore / compile{ElementaryUpdate,ParametricFunction,Modality}); the heads
  already carry that logic. Its reused Java-program helpers (compiledProgramMatcher,
  compileProgram, delegateToMatch, compileActiveStatements) move to a small
  JavaProgramCompiler used by the program hook.
- delete SyntaxElementMatchProgramGenerator's createProgram dispatch; only the
  program-instruction conversion helpers (buildProgramInstruction & co), which the
  hook reuses, remain.
- migrate term labels into the framework (JavaMatchPlanBuilder.LabelPlan, reusing
  the matchTermLabelSV instruction) so buildPlan is total; the facades no longer
  fall back -- an unsupported pattern raises a clear error naming the missing head
  (no current taclet hits this; the whole standard base is covered).
- retarget CompiledMatchProgramTest to the framework facade.

Net: ~390 fewer lines of production matcher code, no duplicated dispatch. The
interpreter/compiled engines and the program helpers are unchanged.
…only benchmark

The differential test and the context benchmark depend on the hand-written
matchers as an independent oracle, which no longer exist in this branch. They are
retained on a separate development branch (with the reference interpreter) as the
regression net, and removed here. CompiledMatchProgramBenchmark is retargeted to
compare the framework's interpreter vs its compiled matcher (no oracle).
Selecting the compiled find-matcher previously needed -Dkey.matcher.compiled,
which is awkward to pass through Gradle when trying it out. Expose it as a KeY
feature flag (Settings -> Feature Flags) too:

- VMTacletMatcher.COMPILED_MATCHER_FEATURE ("MATCHER_COMPILED"); the matcher is
  selected when the system property OR the feature flag is set. The property is
  kept for headless / CI (testRAP).
- SettingsManager references the flag so it is registered and shown in the
  feature-settings panel on a fresh start, before any proof is loaded.

The flag is read per taclet at construction, so it applies to newly loaded proofs
(or after reloading the current one); the panel shows a "reload required" notice
(restartRequired = true). No on-the-fly switch of an open proof's matchers.
@unp1 unp1 changed the title Performance round 3 (combined): 1.66× faster automode on real-world proofs Performance (overview): round-3 combined — 1.66× faster automode Jun 17, 2026
@unp1 unp1 self-assigned this Jun 17, 2026
@unp1 unp1 force-pushed the perf-round3-meta branch from 26d4023 to ee2e6cb Compare June 17, 2026 05:00
The cursor-free compiled find-matcher is now selected by default; the legacy interpreter becomes an opt-out via -Dkey.matcher.interpreter or the MATCHER_INTERPRETER feature flag (and remains the automatic fallback for patterns the compiler does not handle). Differential testing established the two back-ends are byte-identical, so this only changes which one runs, not any proof.
@unp1 unp1 changed the title Performance (overview): round-3 combined — 1.66× faster automode Performance (overview): round-3 combined — 1.74× faster automode Jun 17, 2026
unp1 added 11 commits June 17, 2026 08:00
Carry a rule-app container's strategy cost forward across the per-round
re-expansion instead of recomputing it, when the taclet's cost is a pure
function of the app + find subterm (plus the always-refreshed age term and
NonDuplicateApp vetoes). Sound-by-construction, annotation-driven
classification (CostLocal/CostNonLocal, default non-local); generator-aware
so a composite summing over a sequent-scanning generator stays non-local.
A development aid -Dkey.strategy.costReuse.verify recomputes the cost and warns on any
mismatch. Byte-identical on the perfTest/perfValidation
corpus; ~7% automode speedup on cost-bound proofs.
introductionTime cached the not-introduced-yet answer (-1); the symbol may be
introduced by a later rule, after which the real time would be found, so the
frozen -1 made the value depend on whether the symbol was first compared before
or after its introduction -- an access-pattern dependence that makes term
ordering, and hence OneStepSimplifier rewriting, subtly non-deterministic. Only
cache a real introduction time (stable once found).
Replace the per-call new Feature[0] / r.length==0 idiom with a shared
INELIGIBLE constant and identity check. An eligible taclet always carries
at least the top-level NonDuplicateApp veto, so identity is the clearer
'not eligible' test. Pure refactor: byte-identical (symmArray 14601 nodes,
0 verify-mode mismatches).
Age (goal time) was contributed inside each top-level strategy's computeCost
(AgeFeature in ModularJavaDLStrategy's cost/inst sums; getTime() in FIFOStrategy
and SimpleFilteredStrategy). Move it out into a single container-level term,
RuleAppContainer.withAge, added exactly once when a container is built -- so
strategies (and their components) compute only their age-free cost and age is
added once regardless of how strategies are composed. AgeFeature is removed.

This lets cost reuse carry the age-free base forward verbatim: TacletAppContainer
stores the age-free cost and the reuse fast path is just 'base + current age'
with no getTime()-getAge() reconstruction and no age>=0 guard (initial containers
reuse soundly too). As a side effect the container's age field is decoupled from
the cost and is now purely the AssumesInstantiator freshness key.

Behaviour-preserving: byte-identical to the parent on SLL, saddleback, symmArray,
median (verified by A/B against the legacy age-in-features path before it was
removed; isolated timing showed the relocation is performance-neutral, so its
value is code quality plus enabling the simpler, broader cost-reuse path).
Park assumes-incomplete taclet bases that re-expand to nothing (97-99.6% of base
re-expansions) out of the active queue, and wake them by a precise operator index: index
each parked base by the concrete top operator(s) of its \assumes formulas (resolved through
the find-match's SV instantiations); wake exactly the bases whose operator matches a formula
added/modified that round (Goal.fireSequentChanged -> sequentChanged), walking the changed
formula's update-prefix spine (a sound superset, since the assumes matcher strips the update
context). Only effectively-indexable bases are parked; unbound-generic tops stay in the
queue. Insertion-ordered (LinkedHashMap/Set) for determinism; clone() deep-copies the index.

Active by default. Provability-safe on the full real RAP
suite (681 goals) once the redundant order-fragile lenOfSeqSubEQ is dropped from automode
(see follow-up commit). Not byte-identical (reordering shifts proof shapes, all still close);
~40% faster automode on the perfTest goals.
lenOfSeqSubEQ rewrites seqLen(EQ) via an antecedent equation EQ = seqSub(...). It is
redundant for completeness -- the direct lenOfSeqSub suffices (full RAP closes all 681
goals without it). It is also order-fragile: when the negated-goal equation
seqSub(s,0,i)=seqSub(s,0,i+5) is reused as an \assumes to simplify, the simplification
reproduces that same formula, and the duplicate/cycle guard then refuses to re-apply
subSeqEqual -> the goal dead-ends. The original rule order avoided this by luck; any
reordering (e.g. assumes-parking) can expose it. Only the \heuristics is commented out,
so the taclet stays defined and existing proofs that applied it still load/replay.
…Labels

removeIrrelevantLabels rebuilt the whole term tree on every call (stream().map()/filter()
.collect() per node), the single biggest allocator during proof search (~20%), even though
most subterms have no irrelevant label. Replace with an identity-preserving rebuild (plain
loops, lazy sub-array, return the original term when its subtree has no irrelevant label).
Behaviour-preserving (terms are immutable; result is structurally identical).
Objects.hash(first, second) allocates an Object[] on every call; Pair is heavily used as a
hash-map key during proof search. Inline the same hash value without the array.
applyReplacewithHelper allocated a PiTIterator (posInTerm().iterator()) per rewrite-taclet
application and consumed it in the recursive replace(). Thread the PosInTerm + a depth index
instead (same indices/order), avoiding the per-application iterator object.
The KeY and JML ANTLR parsers build a prediction (DFA) cache lazily while parsing, held on the generated parsers' static fields, so it stays resident for the whole JVM -- including the (long) proof search, where it is unused (~17 MB retained on a large proof). It is a pure cache that ANTLR rebuilds transparently on the next parse, so dropping it after a problem/proof has finished loading is correctness-safe. Add ParsingFacade.clearParserCaches() (KeY/JavaDL) and JmlFacade.clearCaches() (JML) and call them from AbstractProblemLoader.load().
…former

RewriteTaclet.checkPrefix walks the whole root-to-position prefix (PIOPathIterator) at every position it is asked about. During taclet-index construction / one-step simplification of a deep term that is O(depth) per position over ~d positions, i.e. O(d^2) -- the dominant cost on deeply nested terms such as chained if-then-else (a JFR profile showed 54% self-time in checkPrefix on a trivial 3-node proof).

For an unrestricted (NONE) rewrite taclet -- the common case -- the only prefix-dependent outcome of that walk is a veto when a Transformer occurs on the path; the update/polarity/modality handling is guarded by a non-NONE restriction and the polarity is discarded. So if the formula provably contains no Transformer anywhere, no prefix can, and the walk can be skipped.

"Formula contains a Transformer" is computed once and cached per term (JTerm.containsTransformerRecursive, mirroring containsJavaBlockRecursive), giving O(1) amortized and dropping the per-position prefix cost from O(depth) to O(1) in the transformer-free case; the O(d^2) on deep terms becomes O(d). Behaviour-preserving: it only short-circuits a provably-equivalent case; restricted taclets and transformer-bearing formulas still take the full walk.
@unp1 unp1 changed the title Performance (overview): round-3 combined — 1.74× faster automode Performance (overview): round-3 combined — 1.82× faster automode (all five PRs) Jun 17, 2026
@unp1 unp1 force-pushed the perf-round3-meta branch from ee2e6cb to 65bf27c Compare June 17, 2026 06:45
…oofs.runOnlyOn=perfTest)

Adds the curated 6-problem perfTest group used for the combined benchmark. By default all runAllProofs groups run (full regression, like main); pass -Dkey.runallproofs.runOnlyOn=perfTest to restrict to the perfTest group, and -PrapForks=1 for clean serial timing.
@unp1 unp1 force-pushed the perf-round3-meta branch from 65bf27c to 2271f36 Compare June 17, 2026 06:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant