
release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm)#96

Open
avrabe wants to merge 4 commits into main from release/v0.6.0-pr-c-dead-stores

Conversation

@avrabe (Contributor) commented May 3, 2026

Summary

Third PR of v0.6.0. Stacks on PR #95 (which stacks on PR #94). When PRs #94 and #95 land, this auto-retargets to main.

A new pass, eliminate_dead_stores, performs per-position dead-store elimination via backward liveness analysis over the structured wasm instruction tree. It is the path-sensitive complement to PR-B (eliminate_dead_locals): it catches dead writes even when the target local IS read elsewhere in the function, by computing live-after sets correctly across all wasm structured control flow.

Pick #3 from the v0.6.0 wasm-opt-gap research agent's plan, taken in full (Option C).

Why "full liveness" not the simpler middle option

The user picked the most ambitious scope. The simpler "branch-aware adjacent-LocalSet" middle option would catch the obvious cases but miss the interesting one: write before an if where both arms overwrite. That requires computing live-before-if = live-in-then ∪ live-in-else. Once you do that properly, you've already done full structured-wasm liveness. So full liveness it is.

Algorithm

Backward walk over the instruction tree, propagating a LiveSet = BTreeSet<u32> at each position. A LocalSet/LocalTee is dead iff its target local is not in the live-after set.

| Construct | Liveness rule |
| --- | --- |
| `Block { body }` | `br N` targeting the block lands at its end ⇒ break-target liveness = live-after-block |
| `If { then, else }` | live-before-if = live-in-then ∪ live-in-else |
| `Loop { body }` | conservative v1: every local read anywhere in body is live throughout body and live just before the loop. Avoids fixpoint iteration; sound but loses precision inside loops. Loop fixpoint is a follow-up (~100 LOC). |
| `Br N` | live ← label_stack[depth - 1 - N] (no fall-through) |
| `BrIf N` | live ∪= label_stack[...] (taken vs fall-through) |
| `BrTable [...]` | live ∪= ⋃ over all targets ∪ default |
| `Return` / `Unreachable` | live ← ∅ (no continuation) |
| `Call` / `CallIndirect` | no effect (don't access caller's locals) |

The label stack mirrors wasm's nesting: outermost first, innermost last. br N counts from innermost-out, so target = stack[len - 1 - N].
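The rules above can be sketched as a recursive backward walk. This is an illustrative toy, not LOOM's actual IR: the `Instr` type, the `walk_back` signature, and the subset of constructs shown are all assumptions, and operand-stack effects are ignored.

```rust
use std::collections::BTreeSet;

// Hypothetical, simplified instruction tree (illustrative only).
enum Instr {
    LocalGet(u32),
    LocalSet(u32),
    If(Vec<Instr>, Vec<Instr>),
    Block(Vec<Instr>),
    Br(u32),
}

type LiveSet = BTreeSet<u32>;

/// Backward walk. `live` holds the live-after set at the current
/// position; `labels` mirrors wasm nesting (outermost first).
/// Targets of dead writes are collected into `dead`.
fn walk_back(body: &[Instr], live: &mut LiveSet, labels: &mut Vec<LiveSet>, dead: &mut Vec<u32>) {
    for instr in body.iter().rev() {
        match instr {
            Instr::LocalGet(i) => {
                live.insert(*i);
            }
            Instr::LocalSet(i) => {
                // A write is dead iff its target is not in live-after.
                if !live.contains(i) {
                    dead.push(*i);
                }
                live.remove(i);
            }
            Instr::Block(inner) => {
                // br N targeting this block lands at its END, so the
                // break-target liveness is live-after-block.
                labels.push(live.clone());
                walk_back(inner, live, labels, dead);
                labels.pop();
            }
            Instr::If(then_b, else_b) => {
                // Both arms see the same live-after-if; live-before-if
                // is the UNION of live-in-then and live-in-else.
                labels.push(live.clone());
                let mut live_then = live.clone();
                let mut live_else = live.clone();
                walk_back(then_b, &mut live_then, labels, dead);
                walk_back(else_b, &mut live_else, labels, dead);
                labels.pop();
                *live = &live_then | &live_else;
            }
            Instr::Br(n) => {
                // No fall-through: live <- label_stack[len - 1 - N].
                *live = labels[labels.len() - 1 - *n as usize].clone();
            }
        }
    }
}
```

On the both-arms-overwrite shape (write, then an `if` whose arms both rewrite the local, then a read), this walk marks only the outer write dead: each arm's write is live against the read after the `if`, while the union of arm live-ins no longer contains the local.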

Write-ID tracking (the trickiest engineering bit)

Identifying which LocalSet/LocalTee is dead across two walks (analyze backward, apply forward) needs a stable ID. I use a counter that:

  • starts at total_writes (pre-counted in a forward sweep)
  • DECREMENTS during backward analysis as writes are encountered
  • the decremented value is the write's forward-walk-order ID

The apply phase walks forward, increments a parallel counter, and looks up dead_writes by ID. Because both walks use mirrored deterministic structural recursion, IDs align without tree paths or unsafe pointers.
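A minimal flat-list sketch of the scheme (the `Op` type and function names are hypothetical; the real pass uses mirrored structural recursion over the tree, not a flat list):

```rust
// Toy op type standing in for the instruction tree.
#[derive(Clone, Copy)]
enum Op {
    Set(u32),
    Get(u32),
}

// Forward pre-sweep: count total writes.
fn count_writes(ops: &[Op]) -> u32 {
    ops.iter().filter(|o| matches!(o, Op::Set(_))).count() as u32
}

/// Backward analysis: decrement a counter at each write; the
/// decremented value equals the ID a forward walk would assign.
fn backward_write_ids(ops: &[Op]) -> Vec<u32> {
    let mut next = count_writes(ops);
    let mut ids = Vec::new();
    for op in ops.iter().rev() {
        if let Op::Set(_) = op {
            next -= 1;
            ids.push(next);
        }
    }
    ids.reverse(); // present in forward order for comparison
    ids
}

/// Forward apply phase: an incrementing parallel counter yields the
/// same IDs, so dead-write lookups by ID line up across the two walks.
fn forward_write_ids(ops: &[Op]) -> Vec<u32> {
    let mut next = 0;
    let mut ids = Vec::new();
    for op in ops {
        if let Op::Set(_) = op {
            ids.push(next);
            next += 1;
        }
    }
    ids
}
```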

Trap-effecting instructions (load, store, div)

May trap and end the function early. We compute liveness under the no-trap assumption: writes are removed only if dead on the no-trap continuation. If a trap intervenes, no later instruction observes the local — so removal remains sound. The conservative direction is correct here.

Pipeline order

... → simplify-locals → dead-stores → dead-locals

dead-stores BEFORE dead-locals: a write-only local with dead-only writes becomes a fully-unused local, which dead-locals then drops in the same pipeline run. Synergistic.

Measurement

| Workload | Effect |
| --- | --- |
| gale_in_baseline.wasm (1.9 KB kernel FFI) | unchanged at 804 B (PR-B already exhausts this workload's pattern — pure write-without-read; PR-C's path-sensitive cases don't appear here) |
| calculator.wasm (2.3 MB component) | 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB) with `--passes dead-stores` alone. Output validates. |

The gale-zero-effect and calculator-real-effect together confirm: PR-C's value is on workloads with branchier dead-store patterns. It scales with workload complexity.

Stack-effect rules (same as PR-B)

| Original | Stack effect | Substitute | Why |
| --- | --- | --- | --- |
| `LocalSet idx` | `[T] -> []` | `Drop` | `Drop` is also `[T] -> []` |
| `LocalTee idx` | `[T] -> [T]` | (removed) | Value passes through |

The branch_aware_keeps_partial_use test pins this — if we got it wrong, the validator would reject the module (an uninitialized read is a type error in wasm).

Tests (6 new)

  • overwritten_in_straight_line — two writes, no read between; first is dead.
  • preserves_live_writes — single live write must survive untouched.
  • branch_aware_keeps_partial_use — write before if; if-not-taken path reaches the local.get directly. Write IS LIVE on that path and MUST survive. The hardest case — Drop'ing it would expose an uninitialized read.
  • both_arms_overwrite — write before if where BOTH arms overwrite. live-before-if = ∅, outer write is dead. Tests union-of-arm-deads.
  • return_kills_continuation — write followed by Return. live ← ∅ after Return; write is dead.
  • localtee_dead_removed — dead Tee removed (not Drop'd); stack [T] -> [T] passes through.

All 303 tests pass (258 lib + 28 + 17).

Follow-ups not in this PR

  • Loop fixpoint precision (~100 LOC) — replace conservative loop body approximation with proper fixpoint iteration on the back-edge. Adds dead-store catches inside loops.
  • Const+drop peephole in vacuum (~10 LOC) — clean up i32.const X; drop patterns left by PR-B/PR-C neutralization.

🤖 Generated with Claude Code

avrabe added 4 commits May 3, 2026 14:40
The v0.4.0 audit measured LOOM's CSE growing the gale_ffi
kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced
it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate
expression including 1-2 byte constants. Replacing
`i32.const -22` (2 bytes encoded) with `local.tee N / local.get N`
(2+2 = 4 bytes) plus an additional local declaration (~2 bytes
amortized) is unconditionally a size regression on cheap
constants — and gale is full of them (errno values).

Fix: add a cost gate `Expr::worth_dedup(occurrences)` that
estimates net byte savings before deciding to dedup. Skip when:

  net = (N - 1) * (cost - 2) - 4 ≤ 0

Examples:
  i32.const 42 (cost=2, N=10): savings = -4 → skip
  i32.add cost=5, N=2: savings = -1 → skip
  i32.add cost=5, N=3: savings = +2 → keep
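The gate reduces to a one-liner. This sketch assumes the formula above; the real `Expr::worth_dedup` signature and cost model may differ:

```rust
// Cost gate for CSE (illustrative). `cost` is the expression's
// encoded byte size; each deduplicated occurrence is replaced by a
// 2-byte local.get, and the dedup itself pays ~4 bytes of one-time
// overhead (local.tee plus an amortized local declaration).
fn worth_dedup(cost: i64, occurrences: i64) -> bool {
    let net = (occurrences - 1) * (cost - 2) - 4;
    net > 0
}
```

The three examples above fall out directly: a 2-byte constant can never win regardless of occurrence count, because `cost - 2 = 0` leaves only the -4 overhead.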

Measurement on gale_ffi:
  v0.5.0: code section 811 → 862 bytes (+6.3% regression)
  v0.6.0: code section 811 → 808 bytes (-0.4% net win)

Tests:
  - test_cse_phase4_duplicate_constants_above_cost_threshold:
    LARGE constants (5+ byte LEB128) still get deduplicated.
  - test_cse_phase4_keeps_small_constants: regression test for
    the gale fix. Cheap constants must survive CSE.

Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.

Trace: REQ-3, REQ-14
Two research outputs from v0.6.0 planning, both grounded in real
Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md — eight optimization-relevant patterns
in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets) — 6 near-identical
  matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern)
  — canonical example in sched.rs:404-444 (next_up_smp), four
  more found
- Verus-bounded loops — 24 `decreases` clauses, all of form
  MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask
  axiom ingestion (event.rs lemmas), 2D state-machine matches,
  and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md — top 7 wasm-opt passes ranked by
expected payoff on kernel code, cheapest-first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks) — DONE
   in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~1.3 weeks combined) flip the sign of
the gale +6.3% regression. Pick #1 is shipped in this PR;
picks #2 and #3 are tracked for follow-up PRs in v0.6.0.

Trace: REQ-7
…gale)

New pass that removes locals declared by a function but never read
by any LocalGet anywhere in the function body. Targets the gale
"default-then-override" pattern: rustc/LLVM materializes an EINVAL
default at function entry, then every reachable path overwrites
it before return. The default's local.set becomes pure dead store.

Key property: "zero reads anywhere" is path-INSENSITIVE. Unlike full
liveness (Pick #3), this rule is sound regardless of BrIf/BrTable/
early-Return control flow. So the pass DOES NOT need the
has_dataflow_unsafe_control_flow guard that gates simplify_locals
and coalesce_locals on every kernel-style early-exit function.
v0.5.0's simplify_locals had zero effect on the gale workload by
construction; this pass picks up where it refused to act.

Algorithm:
  1. Recursive read-count scan over the instruction tree.
  2. Dead set = { idx | idx >= param_count && reads(idx) == 0 }.
  3. Neutralize writes:
       LocalSet dead → Drop      (preserves [T] -> [] stack effect)
       LocalTee dead → removed   (Tee's [T] -> [T] passes through)
  4. Pack-down remap: dense indices, reuse remap_instructions.
  5. Z3 translation validation — revert on rejection.
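Step 3's asymmetric handling can be sketched like this (the mini-IR and `neutralize` are illustrative, not LOOM's actual types):

```rust
use std::collections::BTreeSet;

// Hypothetical mini-IR; only the neutralization step is shown.
#[derive(Clone, Debug, PartialEq)]
enum Instr {
    I32Const(i32),
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    Drop,
}

/// Neutralize writes to dead locals, preserving stack effects:
///   LocalSet ([T] -> [])  becomes Drop (also [T] -> [])
///   LocalTee ([T] -> [T]) is removed; the value passes through
fn neutralize(body: Vec<Instr>, dead: &BTreeSet<u32>) -> Vec<Instr> {
    body.into_iter()
        .filter_map(|i| match i {
            Instr::LocalSet(n) if dead.contains(&n) => Some(Instr::Drop),
            Instr::LocalTee(n) if dead.contains(&n) => None, // removed
            other => Some(other),
        })
        .collect()
}
```

On the gale shape this turns `i32.const -22 ; local.set 3` into `i32.const -22 ; drop`, matching the visual confirmation below.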

Stack-effect rationale for the asymmetric LocalSet/LocalTee handling:
  LocalSet idx : [T] -> []     so Drop is the substitute
  LocalTee idx : [T] -> [T]    so removing leaves stack passing through
Confusing these would corrupt the stack — replacing LocalTee with
Drop would consume a value that downstream consumers expected to
remain.

Measurement on gale_ffi:
  baseline:           code section 811 bytes
  v0.5.0 (regress'n): code section 862 bytes (+6.3%)
  v0.6.0 PR-A (CSE):  code section 808 bytes (-0.4%)
  v0.6.0 PR-B (this): code section 804 bytes (-0.86%, vs baseline)

PR-A and PR-B are independent and stack: PR-A fixes the regression,
PR-B exposes a new optimization wasm-opt does that LOOM previously
skipped on early-exit code.

Visual confirmation on gale_bitarray_alloc_validate:
  before: (local i32) ; i32.const -22 ; local.set 3 ; ...
  after:  (no locals) ; i32.const -22 ; drop      ; ...

The leftover `const; drop` is dead code that vacuum could in principle
eliminate, but vacuum runs before this pass. A const+drop peephole
in vacuum is a follow-up (~5 LOC).

Tests (5 new): basic_write_only, preserves_used_locals,
localtee_neutralization, packs_indices, skips_params.

Pick #2 from v0.6.0 wasm-opt-gap research agent's plan (narrowed to
the path-insensitive subset; full liveness is Pick #3).

Trace: REQ-3, REQ-14
…ed wasm (-0.4% on calculator.wasm)

New pass: per-position dead-store elimination via backward liveness
walking the structured wasm instruction tree. Path-sensitive
complement to eliminate_dead_locals (PR-B): catches dead writes
even when the local IS read elsewhere in the function, by computing
live-after sets correctly across all wasm structured control flow.

Pick #3 (Option C) from the v0.6.0 wasm-opt-gap research agent's
plan. Full liveness, no scope reduction.

ALGORITHM

Backward walk over the instruction tree, propagating a LiveSet
(BTreeSet<u32>) at each position. A LocalSet/LocalTee is dead iff
its target local is NOT in the live-after set.

Wasm structured-control-flow handling (the heart of the analysis):

  Block { body }
    br N inside the block targets the END of the block, so
    break-target liveness equals live-after-block. Walk body with
    that as live-after; pop label after.

  Loop { body }
    br N inside the loop targets the START of the body. To avoid
    fixpoint iteration on the back-edge, v1 uses the conservative
    approximation: every local read anywhere in the body is treated
    as live throughout the body and live just before the loop. Sound
    but loses precision INSIDE loops. Loop fixpoint is a follow-up
    (~100 LOC) — but the gale dead-store patterns sit BEFORE loops,
    so this approximation costs nothing on the target workload.

  If { then, else }
    Both arms see the same live-after-if. live-before-if is the
    UNION of live-in-then and live-in-else. The if's label targets
    the END of the if (live-after).

  Br N             live becomes label_stack[depth - 1 - N]
  BrIf N           live ∪= label_stack[...]   (taken vs fall-through)
  BrTable [...]    live ∪= ⋃ over all targets ∪ default
  Return / Unreachable
                   live becomes empty (no continuation)
  Call / CallIndirect
                   no effect on caller's locals

The label_stack mirrors wasm's nesting: outermost first, innermost
last. br N counts from innermost-out, so target = stack[len - 1 - N].

ID assignment for write tracking:
  Forward pre-walk counts total_writes.
  Backward analysis decrements a counter as it encounters writes;
  the decremented value is the write's forward-walk-order ID.
  Apply phase walks forward, increments a parallel counter, looks
  up dead_writes by ID.
This avoids tree paths or unsafe pointers — IDs are stable and
align across the two walks because both use deterministic structural
recursion in mirrored orders.

NEUTRALIZATION (same rules as PR-B's eliminate_dead_locals):
  LocalSet idx : [T] → []     → Drop
  LocalTee idx : [T] → [T]    → removed (value passes through)

TRAP-EFFECTING INSTRUCTIONS
  load/store/div/etc. may trap. We compute liveness under the
  no-trap assumption: writes are removed only if dead on the no-trap
  continuation. If a trap intervenes, no later instruction observes
  the local — so removal remains sound.

PIPELINE ORDER
  ... → simplify-locals → dead-stores → dead-locals
  dead-stores BEFORE dead-locals: a write-only local with dead-only
  writes becomes a fully-unused local, which dead-locals then drops
  in the same pipeline run.

MEASUREMENT

  gale_in_baseline.wasm (1.9 KB, kernel FFI):
    code section unchanged at 804 bytes (PR-B already handles this
    workload's pattern — pure write-without-read; PR-C's
    path-sensitive cases don't appear here).

  calculator.wasm (2.3 MB component):
    --passes dead-stores ALONE: 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB).
    Output validates.
    Confirms PR-C scales with workload complexity.

TESTS (6 new)
  - overwritten_in_straight_line: two writes, no read between;
    first is dead.
  - preserves_live_writes: a single live write must survive
    untouched.
  - branch_aware_keeps_partial_use: write before an if; if-not-taken
    path reaches the local.get directly. Write IS LIVE on that path
    and MUST survive. The hardest case — Drop'ing it would expose
    an uninitialized read.
  - both_arms_overwrite: write before an if where BOTH arms overwrite
    the same local. live-before-if = ∅ for that local, so the outer
    write is dead. Tests the union-of-arm-deads.
  - return_kills_continuation: write followed by Return. Live
    becomes empty after Return; the write is dead.
  - localtee_dead_removed: dead Tee removed (not Drop'd); stack
    [T] -> [T] passes through.

All 258 lib + 28 + 17 = 303 tests pass.

Trace: REQ-3, REQ-14
@avrabe avrabe changed the base branch from release/v0.6.0-pr-b-dead-locals to main May 3, 2026 15:02
@avrabe avrabe closed this May 3, 2026
@avrabe avrabe reopened this May 3, 2026
@avrabe avrabe changed the title release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm, stacks on PR #95) release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm) May 3, 2026
