
release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm)#96

Open
avrabe wants to merge 4 commits into main from release/v0.6.0-pr-c-dead-stores

Conversation

@avrabe (Contributor) commented May 3, 2026

Summary

Third PR of v0.6.0. Stacks on PR #95 (which stacks on PR #94). When PRs #94 and #95 land, this auto-retargets to main.

A new pass, eliminate_dead_stores, performs per-position dead-store elimination via backward liveness analysis over the structured wasm instruction tree. It is the path-sensitive complement to PR-B (eliminate_dead_locals): it catches dead writes even when the target local IS read elsewhere in the function, by computing live-after sets correctly across all wasm structured control flow.

Pick #3 from the v0.6.0 wasm-opt-gap research agent's plan, taken in full (Option C).

Why "full liveness" not the simpler middle option

The user picked the most ambitious scope. The simpler "branch-aware adjacent-LocalSet" middle option would catch the obvious cases but miss the interesting one: write before an if where both arms overwrite. That requires computing live-before-if = live-in-then ∪ live-in-else. Once you do that properly, you've already done full structured-wasm liveness. So full liveness it is.

Algorithm

Backward walk over the instruction tree, propagating a LiveSet = BTreeSet<u32> at each position. A LocalSet/LocalTee is dead iff its target local is not in the live-after set.

| Construct | Liveness rule |
| --- | --- |
| `Block { body }` | `br N` targeting the block lands at its end ⇒ break-target liveness = live-after-block |
| `If { then, else }` | live-before-if = live-in-then ∪ live-in-else |
| `Loop { body }` | conservative v1: every local read anywhere in body is live throughout body and live just before the loop. Avoids fixpoint iteration; sound but loses precision inside loops. Loop fixpoint is a follow-up (~100 LOC). |
| `Br N` | live ← label_stack[depth - 1 - N] (no fall-through) |
| `BrIf N` | live ∪= label_stack[...] (taken vs fall-through) |
| `BrTable [...]` | live ∪= ⋃ over all targets ∪ default |
| `Return` / `Unreachable` | live ← ∅ (no continuation) |
| `Call` / `CallIndirect` | no effect (don't access caller's locals) |

The label stack mirrors wasm's nesting: outermost first, innermost last. br N counts from innermost-out, so target = stack[len - 1 - N].
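The rules above can be sketched as a recursive backward walk. This is an illustrative toy, not LOOM's actual IR: the `Instr` type, the `walk_back` signature, and the subset of constructs shown are all assumptions, and operand-stack effects are ignored.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified instruction tree (illustrative only).
enum Instr {
    LocalGet(u32),
    LocalSet(u32),
    If(Vec<Instr>, Vec<Instr>),
    Block(Vec<Instr>),
    Br(u32),
}

type LiveSet = BTreeSet<u32>;

/// Backward walk. `live` holds the live-after set at the current
/// position; `labels` mirrors wasm nesting (outermost first).
/// Targets of dead writes are collected into `dead`.
fn walk_back(body: &[Instr], live: &mut LiveSet, labels: &mut Vec<LiveSet>, dead: &mut Vec<u32>) {
    for instr in body.iter().rev() {
        match instr {
            Instr::LocalGet(i) => {
                live.insert(*i);
            }
            Instr::LocalSet(i) => {
                // A write is dead iff its target is not in live-after.
                if !live.contains(i) {
                    dead.push(*i);
                }
                live.remove(i);
            }
            Instr::Block(inner) => {
                // br N targeting this block lands at its END, so the
                // break-target liveness is live-after-block.
                labels.push(live.clone());
                walk_back(inner, live, labels, dead);
                labels.pop();
            }
            Instr::If(then_b, else_b) => {
                // Both arms see the same live-after-if; live-before-if
                // is the UNION of live-in-then and live-in-else.
                labels.push(live.clone());
                let mut live_then = live.clone();
                let mut live_else = live.clone();
                walk_back(then_b, &mut live_then, labels, dead);
                walk_back(else_b, &mut live_else, labels, dead);
                labels.pop();
                *live = &live_then | &live_else;
            }
            Instr::Br(n) => {
                // No fall-through: live <- label_stack[len - 1 - N].
                *live = labels[labels.len() - 1 - *n as usize].clone();
            }
        }
    }
}
```

On the both-arms-overwrite shape (write, then an `if` whose arms both rewrite the local, then a read), this walk marks only the outer write dead: each arm's write is live against the read after the `if`, while the union of arm live-ins no longer contains the local.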

Write-ID tracking (the trickiest engineering bit)

Identifying which LocalSet/LocalTee is dead across two walks (analyze backward, apply forward) needs a stable ID. I use a counter that:

  • starts at total_writes (pre-counted in a forward sweep)
  • DECREMENTS during backward analysis as writes are encountered
  • the decremented value is the write's forward-walk-order ID

The apply phase walks forward, increments a parallel counter, and looks up dead_writes by ID. Because both walks use mirrored deterministic structural recursion, IDs align without tree paths or unsafe pointers.
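A minimal flat-list sketch of the scheme (the `Op` type and function names are hypothetical; the real pass uses mirrored structural recursion over the tree, not a flat list):

```rust
// Toy op type standing in for the instruction tree.
#[derive(Clone, Copy)]
enum Op {
    Set(u32),
    Get(u32),
}

// Forward pre-sweep: count total writes.
fn count_writes(ops: &[Op]) -> u32 {
    ops.iter().filter(|o| matches!(o, Op::Set(_))).count() as u32
}

/// Backward analysis: decrement a counter at each write; the
/// decremented value equals the ID a forward walk would assign.
fn backward_write_ids(ops: &[Op]) -> Vec<u32> {
    let mut next = count_writes(ops);
    let mut ids = Vec::new();
    for op in ops.iter().rev() {
        if let Op::Set(_) = op {
            next -= 1;
            ids.push(next);
        }
    }
    ids.reverse(); // present in forward order for comparison
    ids
}

/// Forward apply phase: an incrementing parallel counter yields the
/// same IDs, so dead-write lookups by ID line up across the two walks.
fn forward_write_ids(ops: &[Op]) -> Vec<u32> {
    let mut next = 0;
    let mut ids = Vec::new();
    for op in ops {
        if let Op::Set(_) = op {
            ids.push(next);
            next += 1;
        }
    }
    ids
}
```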

Trap-effecting instructions (load, store, div)

May trap and end the function early. We compute liveness under the no-trap assumption: writes are removed only if dead on the no-trap continuation. If a trap intervenes, no later instruction observes the local — so removal remains sound. The conservative direction is correct here.

Pipeline order

... → simplify-locals → dead-stores → dead-locals

dead-stores BEFORE dead-locals: a write-only local with dead-only writes becomes a fully-unused local, which dead-locals then drops in the same pipeline run. Synergistic.

Measurement

| Workload | Effect |
| --- | --- |
| gale_in_baseline.wasm (1.9 KB kernel FFI) | unchanged at 804 B (PR-B already exhausts this workload's pattern — pure write-without-read; PR-C's path-sensitive cases don't appear here) |
| calculator.wasm (2.3 MB component) | 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB) with `--passes dead-stores` alone. Output validates. |

The gale-zero-effect and calculator-real-effect together confirm: PR-C's value is on workloads with branchier dead-store patterns. It scales with workload complexity.

Stack-effect rules (same as PR-B)

| Original | Stack effect | Substitute | Why |
| --- | --- | --- | --- |
| `LocalSet idx` | `[T] -> []` | `Drop` | `Drop` is also `[T] -> []` |
| `LocalTee idx` | `[T] -> [T]` | (removed) | Value passes through |

The branch_aware_keeps_partial_use test pins this — if we got it wrong, the validator would reject the module (an uninitialized read is a type error in wasm).

Tests (6 new)

  • overwritten_in_straight_line — two writes, no read between; first is dead.
  • preserves_live_writes — single live write must survive untouched.
  • branch_aware_keeps_partial_use — write before if; if-not-taken path reaches the local.get directly. Write IS LIVE on that path and MUST survive. The hardest case — Drop'ing it would expose an uninitialized read.
  • both_arms_overwrite — write before if where BOTH arms overwrite. live-before-if = ∅, outer write is dead. Tests union-of-arm-deads.
  • return_kills_continuation — write followed by Return. live ← ∅ after Return; write is dead.
  • localtee_dead_removed — dead Tee removed (not Drop'd); stack [T] -> [T] passes through.

All 303 tests pass (258 lib + 28 + 17).

Follow-ups not in this PR

  • Loop fixpoint precision (~100 LOC) — replace conservative loop body approximation with proper fixpoint iteration on the back-edge. Adds dead-store catches inside loops.
  • Const+drop peephole in vacuum (~10 LOC) — clean up i32.const X; drop patterns left by PR-B/PR-C neutralization.

🤖 Generated with Claude Code

avrabe added 4 commits May 3, 2026 14:40
The v0.4.0 audit measured LOOM's CSE growing the gale_ffi
kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced
it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate
expression including 1-2 byte constants. Replacing
`i32.const -22` (2 bytes encoded) with `local.tee N / local.get N`
(2+2 = 4 bytes) plus an additional local declaration (~2 bytes
amortized) is unconditionally a size regression on cheap
constants — and gale is full of them (errno values).

Fix: add a cost gate `Expr::worth_dedup(occurrences)` that
estimates net byte savings before deciding to dedup. Skip when:

  net = (N - 1) * (cost - 2) - 4 ≤ 0

Examples:
  i32.const 42 (cost=2, N=10): savings = -4 → skip
  i32.add cost=5, N=2: savings = -1 → skip
  i32.add cost=5, N=3: savings = +2 → keep
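The gate reduces to a one-liner. This sketch assumes the formula above; the real `Expr::worth_dedup` signature and cost model may differ:

```rust
// Cost gate for CSE (illustrative). `cost` is the expression's
// encoded byte size; each deduplicated occurrence is replaced by a
// 2-byte local.get, and the dedup itself pays ~4 bytes of one-time
// overhead (local.tee plus an amortized local declaration).
fn worth_dedup(cost: i64, occurrences: i64) -> bool {
    let net = (occurrences - 1) * (cost - 2) - 4;
    net > 0
}
```

The three examples above fall out directly: a 2-byte constant can never win regardless of occurrence count, because `cost - 2 = 0` leaves only the -4 overhead.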

Measurement on gale_ffi:
  v0.5.0: code section 811 → 862 bytes (+6.3% regression)
  v0.6.0: code section 811 → 808 bytes (-0.4% net win)

Tests:
  - test_cse_phase4_duplicate_constants_above_cost_threshold:
    LARGE constants (5+ byte LEB128) still get deduplicated.
  - test_cse_phase4_keeps_small_constants: regression test for
    the gale fix. Cheap constants must survive CSE.

Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.

Trace: REQ-3, REQ-14
Two research outputs from v0.6.0 planning, both grounded in real
Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md — eight optimization-relevant patterns
in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets) — 6 near-identical
  matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern)
  — canonical example in sched.rs:404-444 (next_up_smp), four
  more found
- Verus-bounded loops — 24 `decreases` clauses, all of form
  MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask
  axiom ingestion (event.rs lemmas), 2D state-machine matches,
  and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md — top 7 wasm-opt passes ranked by
expected payoff on kernel code, cheapest-first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks) — DONE
   in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~1.3 weeks combined) flip the sign of
the gale +6.3% regression. Pick #1 is shipped in this PR;
picks #2 and #3 are tracked for follow-up PRs in v0.6.0.

Trace: REQ-7
…gale)

New pass that removes locals declared by a function but never read
by any LocalGet anywhere in the function body. Targets the gale
"default-then-override" pattern: rustc/LLVM materializes an EINVAL
default at function entry, then every reachable path overwrites
it before return. The default's local.set becomes pure dead store.

Key property: "zero reads anywhere" is path-INSENSITIVE. Unlike full
liveness (Pick #3), this rule is sound regardless of BrIf/BrTable/
early-Return control flow. So the pass DOES NOT need the
has_dataflow_unsafe_control_flow guard that gates simplify_locals
and coalesce_locals on every kernel-style early-exit function.
v0.5.0's simplify_locals had zero effect on the gale workload by
construction; this pass picks up where it refused to act.

Algorithm:
  1. Recursive read-count scan over the instruction tree.
  2. Dead set = { idx | idx >= param_count && reads(idx) == 0 }.
  3. Neutralize writes:
       LocalSet dead → Drop      (preserves [T] -> [] stack effect)
       LocalTee dead → removed   (Tee's [T] -> [T] passes through)
  4. Pack-down remap: dense indices, reuse remap_instructions.
  5. Z3 translation validation — revert on rejection.
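Step 3's asymmetric handling can be sketched like this (the mini-IR and `neutralize` are illustrative, not LOOM's actual types):

```rust
use std::collections::BTreeSet;

// Hypothetical mini-IR; only the neutralization step is shown.
#[derive(Clone, Debug, PartialEq)]
enum Instr {
    I32Const(i32),
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    Drop,
}

/// Neutralize writes to dead locals, preserving stack effects:
///   LocalSet ([T] -> [])  becomes Drop (also [T] -> [])
///   LocalTee ([T] -> [T]) is removed; the value passes through
fn neutralize(body: Vec<Instr>, dead: &BTreeSet<u32>) -> Vec<Instr> {
    body.into_iter()
        .filter_map(|i| match i {
            Instr::LocalSet(n) if dead.contains(&n) => Some(Instr::Drop),
            Instr::LocalTee(n) if dead.contains(&n) => None, // removed
            other => Some(other),
        })
        .collect()
}
```

On the gale shape this turns `i32.const -22 ; local.set 3` into `i32.const -22 ; drop`, matching the visual confirmation below.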

Stack-effect rationale for the asymmetric LocalSet/LocalTee handling:
  LocalSet idx : [T] -> []     so Drop is the substitute
  LocalTee idx : [T] -> [T]    so removing leaves stack passing through
Confusing these would corrupt the stack — replacing LocalTee with
Drop would consume a value that downstream consumers expected to
remain.

Measurement on gale_ffi:
  baseline:           code section 811 bytes
  v0.5.0 (regress'n): code section 862 bytes (+6.3%)
  v0.6.0 PR-A (CSE):  code section 808 bytes (-0.4%)
  v0.6.0 PR-B (this): code section 804 bytes (-0.86%, vs baseline)

PR-A and PR-B are independent and stack: PR-A fixes the regression,
PR-B exposes a new optimization wasm-opt does that LOOM previously
skipped on early-exit code.

Visual confirmation on gale_bitarray_alloc_validate:
  before: (local i32) ; i32.const -22 ; local.set 3 ; ...
  after:  (no locals) ; i32.const -22 ; drop      ; ...

The leftover `const; drop` is dead code that vacuum could in principle
eliminate, but vacuum runs before this pass. A const+drop peephole
in vacuum is a follow-up (~5 LOC).

Tests (5 new): basic_write_only, preserves_used_locals,
localtee_neutralization, packs_indices, skips_params.

Pick #2 from v0.6.0 wasm-opt-gap research agent's plan (narrowed to
the path-insensitive subset; full liveness is Pick #3).

Trace: REQ-3, REQ-14
…ed wasm (-0.4% on calculator.wasm)

New pass: per-position dead-store elimination via backward liveness
walking the structured wasm instruction tree. Path-sensitive
complement to eliminate_dead_locals (PR-B): catches dead writes
even when the local IS read elsewhere in the function, by computing
live-after sets correctly across all wasm structured control flow.

Pick #3 (Option C) from the v0.6.0 wasm-opt-gap research agent's
plan. Full liveness, no scope reduction.

ALGORITHM

Backward walk over the instruction tree, propagating a LiveSet
(BTreeSet<u32>) at each position. A LocalSet/LocalTee is dead iff
its target local is NOT in the live-after set.

Wasm structured-control-flow handling (the heart of the analysis):

  Block { body }
    br N inside the block targets the END of the block, so
    break-target liveness equals live-after-block. Walk body with
    that as live-after; pop label after.

  Loop { body }
    br N inside the loop targets the START of the body. To avoid
    fixpoint iteration on the back-edge, v1 uses the conservative
    approximation: every local read anywhere in the body is treated
    as live throughout the body and live just before the loop. Sound
    but loses precision INSIDE loops. Loop fixpoint is a follow-up
    (~100 LOC) — but the gale dead-store patterns sit BEFORE loops,
    so this approximation costs nothing on the target workload.

  If { then, else }
    Both arms see the same live-after-if. live-before-if is the
    UNION of live-in-then and live-in-else. The if's label targets
    the END of the if (live-after).

  Br N             live becomes label_stack[depth - 1 - N]
  BrIf N           live ∪= label_stack[...]   (taken vs fall-through)
  BrTable [...]    live ∪= ⋃ over all targets ∪ default
  Return / Unreachable
                   live becomes empty (no continuation)
  Call / CallIndirect
                   no effect on caller's locals

The label_stack mirrors wasm's nesting: outermost first, innermost
last. br N counts from innermost-out, so target = stack[len - 1 - N].

ID assignment for write tracking:
  Forward pre-walk counts total_writes.
  Backward analysis decrements a counter as it encounters writes;
  the decremented value is the write's forward-walk-order ID.
  Apply phase walks forward, increments a parallel counter, looks
  up dead_writes by ID.
This avoids tree paths or unsafe pointers — IDs are stable and
align across the two walks because both use deterministic structural
recursion in mirrored orders.

NEUTRALIZATION (same rules as PR-B's eliminate_dead_locals):
  LocalSet idx : [T] → []     → Drop
  LocalTee idx : [T] → [T]    → removed (value passes through)

TRAP-EFFECTING INSTRUCTIONS
  load/store/div/etc. may trap. We compute liveness under the
  no-trap assumption: writes are removed only if dead on the no-trap
  continuation. If a trap intervenes, no later instruction observes
  the local — so removal remains sound.

PIPELINE ORDER
  ... → simplify-locals → dead-stores → dead-locals
  dead-stores BEFORE dead-locals: a write-only local with dead-only
  writes becomes a fully-unused local, which dead-locals then drops
  in the same pipeline run.

MEASUREMENT

  gale_in_baseline.wasm (1.9 KB, kernel FFI):
    code section unchanged at 804 bytes (PR-B already handles this
    workload's pattern — pure write-without-read; PR-C's
    path-sensitive cases don't appear here).

  calculator.wasm (2.3 MB component):
    --passes dead-stores ALONE: 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB).
    Output validates.
    Confirms PR-C scales with workload complexity.

TESTS (6 new)
  - overwritten_in_straight_line: two writes, no read between;
    first is dead.
  - preserves_live_writes: a single live write must survive
    untouched.
  - branch_aware_keeps_partial_use: write before an if; if-not-taken
    path reaches the local.get directly. Write IS LIVE on that path
    and MUST survive. The hardest case — Drop'ing it would expose
    an uninitialized read.
  - both_arms_overwrite: write before an if where BOTH arms overwrite
    the same local. live-before-if = ∅ for that local, so the outer
    write is dead. Tests the union-of-arm-deads.
  - return_kills_continuation: write followed by Return. Live
    becomes empty after Return; the write is dead.
  - localtee_dead_removed: dead Tee removed (not Drop'd); stack
    [T] -> [T] passes through.

All 258 lib + 28 + 17 = 303 tests pass.

Trace: REQ-3, REQ-14
@avrabe avrabe changed the base branch from release/v0.6.0-pr-b-dead-locals to main May 3, 2026 15:02
@avrabe avrabe closed this May 3, 2026
@avrabe avrabe reopened this May 3, 2026
@avrabe avrabe changed the title release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm, stacks on PR #95) release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm) May 3, 2026
