
release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale) #95

Open
avrabe wants to merge 3 commits into main from release/v0.6.0-pr-b-dead-locals


@avrabe avrabe commented May 3, 2026

Summary

Second PR of v0.6.0. Stacks on PR #94 — base branch is release/v0.6.0-pr-a-cse-cost-threshold, so this PR's diff shows only the new pass. After PR #94 merges, GitHub will auto-retarget this PR to main.

This PR adds a new pass, eliminate_dead_locals, that drops locals declared by a function but never read by any LocalGet anywhere in the function body. It targets the gale "default-then-override" pattern: rustc/LLVM materializes an EINVAL default at function entry, then every reachable path overwrites it before return. The default's local.set becomes a pure dead store.

Pick #2 from the v0.6.0 wasm-opt-gap research agent's plan, narrowed to the path-insensitive subset (full liveness is Pick #3).

Why a NEW pass instead of extending simplify_locals?

The crucial observation: "zero reads anywhere in the function" is a structural property of the instruction tree — sound regardless of BrIf/BrTable/early-Return control flow. So this pass DOES NOT need the has_dataflow_unsafe_control_flow guard that gates simplify_locals (lib.rs:7706) and coalesce_locals (lib.rs:9843) on every kernel-style early-exit function.

v0.5.0's simplify_locals had zero effect on the gale workload by construction. This pass picks up where it refused to act.

Algorithm

  1. Recursive read-count scan over the instruction tree (matches the recursion shape of remap_instructions and eliminate_redundant_sets).
  2. Dead set: { idx | idx >= param_count && reads(idx) == 0 }.
  3. Neutralize writes to dead locals:
    • LocalSet dead → Drop (preserves [T] -> [] stack effect)
    • LocalTee dead → removed (Tee's [T] -> [T] lets the value pass through)
  4. Pack-down remap: surviving locals get dense indices starting at param_count (best LEB128 encoding); reuse existing remap_instructions.
  5. Z3 translation-validation — revert on rejection.
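
The steps above can be sketched on a toy instruction tree. This is a minimal illustration, not LOOM's code: the `Instr` enum, the function names, and the signature are hypothetical stand-ins, all locals are assumed to share one type, the body is assumed to be already validated, and the Z3 translation-validation step is omitted.

```rust
use std::collections::HashMap;

/// Toy instruction tree (hypothetical; not LOOM's real IR).
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    I32Const(i32),
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    Drop,
    BrIf(u32),
    Block(Vec<Instr>),
}

/// Step 1: count LocalGet reads per local, recursing into nested blocks.
fn count_reads(body: &[Instr], reads: &mut [u32]) {
    for ins in body {
        match ins {
            Instr::LocalGet(i) => reads[*i as usize] += 1,
            Instr::Block(inner) => count_reads(inner, reads),
            _ => {}
        }
    }
}

/// Steps 3 and 4: neutralize writes to dead locals, renumber survivors.
/// A local is dead exactly when it has no entry in `remap`.
fn rewrite(body: &[Instr], remap: &HashMap<u32, u32>) -> Vec<Instr> {
    let mut out = Vec::with_capacity(body.len());
    for ins in body {
        match ins {
            // A read implies the local is live, so the entry exists.
            Instr::LocalGet(i) => out.push(Instr::LocalGet(remap[i])),
            Instr::LocalSet(i) => match remap.get(i) {
                Some(&n) => out.push(Instr::LocalSet(n)),
                None => out.push(Instr::Drop), // keeps [T] -> []
            },
            Instr::LocalTee(i) => {
                if let Some(&n) = remap.get(i) {
                    out.push(Instr::LocalTee(n));
                }
                // Dead Tee: emit nothing, the value passes through.
            }
            Instr::Block(inner) => out.push(Instr::Block(rewrite(inner, remap))),
            other => out.push(other.clone()),
        }
    }
    out
}

/// Returns the rewritten body and the number of surviving declared locals.
pub fn eliminate_dead_locals_sketch(
    param_count: u32,
    local_count: u32,
    body: &[Instr],
) -> (Vec<Instr>, u32) {
    let total = param_count + local_count;
    let mut reads = vec![0u32; total as usize];
    count_reads(body, &mut reads);

    // Step 2 (dead set) and step 4 (pack-down): params keep their index,
    // live locals get dense indices starting at param_count.
    let mut remap = HashMap::new();
    let mut next = param_count;
    for idx in 0..total {
        if idx < param_count {
            remap.insert(idx, idx);
        } else if reads[idx as usize] > 0 {
            remap.insert(idx, next);
            next += 1;
        }
    }
    (rewrite(body, &remap), next - param_count)
}
```

On the gale pattern (three parameters, one write-only local at index 3), the sketch turns `i32.const -22; local.set 3; local.get 0` into `i32.const -22; drop; local.get 0` with zero declared locals left. Note that the scan never inspects branch targets, which is the point of the path-insensitive rule.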

Stack-effect rationale (subtle!)

| Original | Stack effect | Substitute | Why |
|---|---|---|---|
| LocalSet idx | [T] -> [] | Drop | Drop is also [T] -> [] |
| LocalTee idx | [T] -> [T] | (removed) | Value passes through unchanged |

Confusing these would corrupt the stack — replacing LocalTee with Drop would consume a value that downstream consumers expected to remain. The asymmetric handling in neutralize_dead_writes is the single most important correctness invariant in this PR.
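
To make the asymmetry concrete, here is a small stack-height checker. It is a sketch under simplifying assumptions: the `Op` enum and both functions are hypothetical, only operand counts are tracked (not types), and it is unrelated to the real neutralize_dead_writes implementation.

```rust
/// Minimal opcode set for the illustration (hypothetical, untyped).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op {
    Const,
    LocalGet,
    LocalSet,
    LocalTee,
    Drop,
}

/// Stack effect as (values popped, values pushed).
pub fn effect(op: Op) -> (u32, u32) {
    match op {
        Op::Const | Op::LocalGet => (0, 1),
        Op::LocalSet | Op::Drop => (1, 0),
        Op::LocalTee => (1, 1),
    }
}

/// Final operand-stack height, or None if any op underflows the stack.
pub fn final_height(ops: &[Op]) -> Option<u32> {
    let mut height = 0u32;
    for &op in ops {
        let (pops, pushes) = effect(op);
        height = height.checked_sub(pops)? + pushes;
    }
    Some(height)
}
```

Take the sequence `Const, LocalTee(dead), LocalSet(live)`, which ends at height 0. Removing the dead Tee gives `Const, LocalSet`, still height 0. Substituting Drop gives `Const, Drop, LocalSet`, where the final LocalSet finds an empty stack and `final_height` returns `None`.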

Measurement on gale_ffi

| Build | Code section | Δ vs baseline |
|---|---|---|
| baseline | 811 B | |
| v0.5.0 (regression) | 862 B | +6.3% |
| v0.6.0 PR-A (CSE) | 808 B | -0.4% |
| v0.6.0 PR-B (this) | 804 B | -0.86% |

PR-A and PR-B are independent and stack: PR-A fixes the regression, PR-B exposes a new optimization wasm-opt does that LOOM previously skipped on early-exit code.

Visual confirmation (gale_bitarray_alloc_validate, func 0)

Before:

(local i32)         ;; local 3 declared
i32.const -22
local.set 3         ;; writes EINVAL — never read
local.get 0
...

After (this PR):

(no locals)
i32.const -22
drop                ;; LocalSet → Drop
local.get 0
...

Tests (5 new)

  • test_eliminate_dead_locals_basic_write_only: the canonical gale pattern; pins the elimination.
  • test_eliminate_dead_locals_preserves_used_locals: locals that ARE read must survive.
  • test_eliminate_dead_locals_localtee_neutralization: a dead Tee is removed (not replaced with Drop), and the encoded result must validate.
  • test_eliminate_dead_locals_packs_indices: when a local in the middle is deleted, the survivors get packed indices and all downstream LocalGet/Set/Tee references are updated.
  • test_eliminate_dead_locals_skips_params: parameters are caller-observable and are never touched.

All 252 existing lib tests pass.

Follow-ups (not in this PR)

  • The leftover i32.const -22; drop pair is dead code that vacuum could in principle eliminate. A const+drop peephole in vacuum (~5 LOC) would close the loop on this transformation. Tracked for follow-up.
  • Pick #3 (RedundantSetElimination liveness): the path-sensitive case. ~600 LOC, ~1.5 weeks. Will further close the gap on functions where SOME paths use a local but a particular write is dominated by another write.
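
For the first follow-up, a const+drop peephole could look roughly like the sketch below. It assumes a toy `Instr` enum (hypothetical, not LOOM's IR) and is not the vacuum change itself, which is out of scope for this PR.

```rust
/// Toy instruction tree (hypothetical; not LOOM's real IR).
#[derive(Debug, Clone, PartialEq)]
pub enum Instr {
    I32Const(i32),
    Drop,
    LocalGet(u32),
    Block(Vec<Instr>),
}

/// Removes `i32.const; drop` pairs, recursing into nested blocks.
/// Using the output vector as a stack lets cascades collapse too,
/// e.g. `const; const; drop; drop` becomes empty.
pub fn peephole_const_drop(body: Vec<Instr>) -> Vec<Instr> {
    let mut out: Vec<Instr> = Vec::with_capacity(body.len());
    for ins in body {
        match ins {
            // A Drop right after a side-effect-free const cancels both.
            Instr::Drop if matches!(out.last(), Some(Instr::I32Const(_))) => {
                out.pop();
            }
            Instr::Block(inner) => out.push(Instr::Block(peephole_const_drop(inner))),
            other => out.push(other),
        }
    }
    out
}
```

Applied to the "After" listing above, this would reduce `i32.const -22; drop; local.get 0` to just `local.get 0`.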

🤖 Generated with Claude Code

avrabe added 3 commits May 3, 2026 14:40
The v0.4.0 audit measured LOOM's CSE growing the gale_ffi
kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced
it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate
expression including 1-2 byte constants. Replacing
`i32.const -22` (2 bytes encoded) with `local.tee N / local.get N`
(2+2 = 4 bytes) plus an additional local declaration (~2 bytes
amortized) is unconditionally a size regression on cheap
constants — and gale is full of them (errno values).

Fix: add a cost gate `Expr::worth_dedup(occurrences)` that
estimates net byte savings before deciding to dedup. Skip when:

  net = (N - 1) * (cost - 2) - 4 ≤ 0

Examples:
  i32.const 42 (cost=2, N=10): savings = -4 → skip
  i32.add cost=5, N=2: savings = -1 → skip
  i32.add cost=5, N=3: savings = +2 → keep
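
The gate can be written out as a couple of lines of Rust. This is a sketch only: the commit names the method `Expr::worth_dedup(occurrences)`, while the free functions and explicit `cost_bytes` parameter here are hypothetical. One plausible reading of the constants is that each of the N - 1 later occurrences shrinks to a 2-byte `local.get`, while the first occurrence pays about 4 bytes for the `local.tee` plus the extra local declaration.

```rust
/// Estimated net byte savings from deduplicating an expression:
/// net = (N - 1) * (cost - 2) - 4
pub fn net_savings(cost_bytes: i64, occurrences: i64) -> i64 {
    (occurrences - 1) * (cost_bytes - 2) - 4
}

/// Dedup only when the estimate is strictly positive (skip when net <= 0).
pub fn worth_dedup(cost_bytes: i64, occurrences: i64) -> bool {
    net_savings(cost_bytes, occurrences) > 0
}
```

With the three examples above: `net_savings(2, 10)` is -4, `net_savings(5, 2)` is -1, and `net_savings(5, 3)` is 2, so only the last case passes the gate.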

Measurement on gale_ffi:
  v0.5.0: code section 811 → 862 bytes (+6.3% regression)
  v0.6.0: code section 811 → 808 bytes (-0.4% net win)

Tests:
  - test_cse_phase4_duplicate_constants_above_cost_threshold:
    LARGE constants (5+ byte LEB128) still get deduplicated.
  - test_cse_phase4_keeps_small_constants: regression test for
    the gale fix. Cheap constants must survive CSE.

Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.

Trace: REQ-3, REQ-14
Two research outputs from v0.6.0 planning, both grounded in real
Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md — eight optimization-relevant patterns
in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets) — 6 near-identical
  matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern)
  — canonical example in sched.rs:404-444 (next_up_smp), four
  more found
- Verus-bounded loops — 24 `decreases` clauses, all of form
  MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask
  axiom ingestion (event.rs lemmas), 2D state-machine matches,
  and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md — top 7 wasm-opt passes ranked by
expected payoff on kernel code, cheapest-first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks) — DONE
   in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~2.3 weeks combined) flip the sign of
the gale +6.3% regression. Pick #1 is shipped in this PR;
picks #2 and #3 are tracked for follow-up PRs in v0.6.0.

Trace: REQ-7
…gale)

New pass that removes locals declared by a function but never read
by any LocalGet anywhere in the function body. Targets the gale
"default-then-override" pattern: rustc/LLVM materializes an EINVAL
default at function entry, then every reachable path overwrites
it before return. The default's local.set becomes pure dead store.

Key property: "zero reads anywhere" is path-INSENSITIVE. Unlike full
liveness (Pick #3), this rule is sound regardless of BrIf/BrTable/
early-Return control flow. So the pass DOES NOT need the
has_dataflow_unsafe_control_flow guard that gates simplify_locals
and coalesce_locals on every kernel-style early-exit function.
v0.5.0's simplify_locals had zero effect on the gale workload by
construction; this pass picks up where it refused to act.

Algorithm:
  1. Recursive read-count scan over the instruction tree.
  2. Dead set = { idx | idx >= param_count && reads(idx) == 0 }.
  3. Neutralize writes:
       LocalSet dead → Drop      (preserves [T] -> [] stack effect)
       LocalTee dead → removed   (Tee's [T] -> [T] passes through)
  4. Pack-down remap: dense indices, reuse remap_instructions.
  5. Z3 translation validation — revert on rejection.
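
Why dense indices in step 4 help: WebAssembly encodes local indices as unsigned LEB128, which carries 7 payload bits per byte. The helper below (a hypothetical name, for illustration only) computes the encoded length of an index.

```rust
/// Number of bytes an unsigned LEB128 encoding of `v` occupies.
pub fn uleb128_len(mut v: u32) -> usize {
    let mut len = 1;
    // Each byte holds 7 bits; keep going while higher bits remain.
    while v >= 0x80 {
        v >>= 7;
        len += 1;
    }
    len
}
```

Indices 0 through 127 fit in one byte and 128 through 16383 need two, so keeping surviving locals packed right after the parameters keeps every `local.get`/`local.set`/`local.tee` immediate as short as possible.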

Stack-effect rationale for the asymmetric LocalSet/LocalTee handling:
  LocalSet idx : [T] -> []     so Drop is the substitute
  LocalTee idx : [T] -> [T]    so removing leaves stack passing through
Confusing these would corrupt the stack — replacing LocalTee with
Drop would consume a value that downstream consumers expected to
remain.

Measurement on gale_ffi:
  baseline:           code section 811 bytes
  v0.5.0 (regress'n): code section 862 bytes (+6.3%)
  v0.6.0 PR-A (CSE):  code section 808 bytes (-0.4%)
  v0.6.0 PR-B (this): code section 804 bytes (-0.86%, vs baseline)

PR-A and PR-B are independent and stack: PR-A fixes the regression,
PR-B exposes a new optimization wasm-opt does that LOOM previously
skipped on early-exit code.

Visual confirmation on gale_bitarray_alloc_validate:
  before: (local i32) ; i32.const -22 ; local.set 3 ; ...
  after:  (no locals) ; i32.const -22 ; drop      ; ...

The leftover `const; drop` is dead code that vacuum could in principle
eliminate, but vacuum runs before this pass. A const+drop peephole
in vacuum is a follow-up (~5 LOC).

Tests (5 new): basic_write_only, preserves_used_locals,
localtee_neutralization, packs_indices, skips_params.

Pick #2 from v0.6.0 wasm-opt-gap research agent's plan (narrowed to
the path-insensitive subset; full liveness is Pick #3).

Trace: REQ-3, REQ-14
@avrabe avrabe changed the base branch from release/v0.6.0-pr-a-cse-cost-threshold to main May 3, 2026 15:02
@avrabe avrabe closed this May 3, 2026
@avrabe avrabe reopened this May 3, 2026
@avrabe avrabe changed the title release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale, stacks on PR #94) release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale) May 3, 2026
