release/v0.6.0 PR-A: CSE cost-aware dedup gate (eliminates gale +6.3% regression) #94

Open
avrabe wants to merge 2 commits into main from release/v0.6.0-pr-a-cse-cost-threshold

@avrabe avrabe commented May 3, 2026

Summary

First PR of v0.6.0. Closes the +6.3% code-size regression that LOOM v0.5.0 produced on the gale kernel-scheduler FFI (Verus-verified Rust). This is pick #1 from the v0.6.0 wasm-opt-gap research agent's plan: a 50 LOC fix, 0.3 weeks of effort, no risk.

Also lands two research docs in docs/research/gale-v0.5.0/: source-pattern analysis and wasm-opt pass-gap analysis. Both feed v0.6.0 planning.

The bug

LOOM's enhanced-CSE deduplicated every duplicate expression, including constants that encode in 1-2 bytes. Replacing i32.const -22 (2 bytes encoded) with local.tee N / local.get N (2+2 = 4 bytes) plus a new local declaration (~2 bytes amortized) is unconditionally a size regression. Gale is full of cheap constants: errno values such as -EINVAL = -22 and -EBUSY = -16, and sentinels such as K_FOREVER = -1.
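
The byte counts above can be checked with a small size calculator. This is a minimal sketch, not LOOM's code: the function names are hypothetical (they are not the signed_leb128_bytes_* helpers mentioned later), and it assumes the standard wasm encoding of one opcode byte followed by a LEB128 immediate.

```rust
/// Bytes needed to encode `value` as signed LEB128 (hypothetical helper).
fn signed_leb128_len(mut value: i64) -> usize {
    let mut len = 0;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7; // arithmetic shift, so negative values converge to -1
        len += 1;
        let sign_bit_set = byte & 0x40 != 0;
        // Stop once the remaining bits are pure sign extension.
        if (value == 0 && !sign_bit_set) || (value == -1 && sign_bit_set) {
            return len;
        }
    }
}

/// Bytes needed to encode `value` as unsigned LEB128 (hypothetical helper).
fn unsigned_leb128_len(mut value: u32) -> usize {
    let mut len = 1;
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

/// Encoded size of `i32.const value`: opcode byte plus signed immediate.
fn i32_const_size(value: i32) -> usize {
    1 + signed_leb128_len(value as i64)
}

/// Encoded size of `local.get idx` or `local.tee idx`: opcode plus index.
fn local_access_size(idx: u32) -> usize {
    1 + unsigned_leb128_len(idx)
}
```

Under these assumptions, `i32_const_size(-22)`, `i32_const_size(-16)` and `i32_const_size(-1)` are each 2 bytes, the same as a single `local_access_size(idx)` for any local index below 128. A constant that small therefore cannot be made cheaper by routing it through a local.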

The fix

A cost gate Expr::worth_dedup(occurrences) that estimates net byte savings before deciding to dedup:

net = (N - 1) * (cost - 2) - 4

where N = number of occurrences, cost = wasm-encoded byte cost of materializing the expression once. Skip when net ≤ 0.
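
One way to read the formula is as a before/after size comparison. The sketch below assumes that local.tee, local.get and the amortized local declaration each cost 2 bytes (true for local indices below 128). The free functions are hypothetical stand-ins for illustration; the real gate is the method Expr::worth_dedup(occurrences), whose body is not shown in this PR description.

```rust
/// Estimated bytes saved by deduplicating an expression that costs `cost`
/// bytes and appears `occurrences` times. Negative means the code grows.
fn net_savings(cost: usize, occurrences: usize) -> i64 {
    let n = occurrences as i64;
    let cost = cost as i64;

    // Before: the expression is materialized at every occurrence.
    let before = n * cost;

    // After: materialize once, local.tee it (2 bytes), local.get it at the
    // other N - 1 sites (2 bytes each), and declare one local (~2 bytes).
    let after = cost + 2 + 2 * (n - 1) + 2;

    // Algebraically equal to (N - 1) * (cost - 2) - 4.
    before - after
}

/// Dedup only when the estimate is a strict win.
fn worth_dedup(cost: usize, occurrences: usize) -> bool {
    occurrences >= 2 && net_savings(cost, occurrences) > 0
}
```

With cost = 2 the per-occurrence saving (cost - 2) is zero, so the fixed 4-byte overhead is never recovered regardless of N.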

Examples (all measured against actual wasm encoding via signed_leb128_bytes_* helpers):

| Expression | Cost | N | Net | Decision |
|---|---|---|---|---|
| i32.const 42 | 2 | 2 | -4 | skip |
| i32.const 42 | 2 | 10 | -4 | skip (cheap) |
| i32.add of LocalGet+Const42 | 5 | 2 | -1 | skip |
| i32.add of LocalGet+Const42 | 5 | 3 | +2 | keep |
| i32.const 0x12345678 | 6 | 3 | +4 | keep |
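
The rows above can be reproduced with a toy expression type. This is a sketch under stated assumptions: the `Expr` enum here is a three-variant stand-in, not LOOM's real `Expr`, and `cost` and `net_savings` are hypothetical method names. Only `worth_dedup(occurrences)` is named in this PR, and its real body may differ.

```rust
/// Toy expression tree for illustration only (not LOOM's `Expr`).
enum Expr {
    I32Const(i32),
    LocalGet(u32),
    I32Add(Box<Expr>, Box<Expr>),
}

/// Bytes needed for a signed LEB128 immediate.
fn signed_leb128_len(mut value: i64) -> usize {
    let mut len = 0;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        len += 1;
        let sign_bit_set = byte & 0x40 != 0;
        if (value == 0 && !sign_bit_set) || (value == -1 && sign_bit_set) {
            return len;
        }
    }
}

/// Bytes needed for an unsigned LEB128 immediate.
fn unsigned_leb128_len(mut value: u32) -> usize {
    let mut len = 1;
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

impl Expr {
    /// Encoded byte cost of materializing the expression once.
    fn cost(&self) -> usize {
        match self {
            Expr::I32Const(v) => 1 + signed_leb128_len(*v as i64),
            Expr::LocalGet(idx) => 1 + unsigned_leb128_len(*idx),
            // Both operands, then the one-byte i32.add opcode.
            Expr::I32Add(lhs, rhs) => lhs.cost() + rhs.cost() + 1,
        }
    }

    /// net = (N - 1) * (cost - 2) - 4, as stated in the PR description.
    fn net_savings(&self, occurrences: usize) -> i64 {
        (occurrences as i64 - 1) * (self.cost() as i64 - 2) - 4
    }

    /// Dedup only when the estimated net saving is strictly positive.
    fn worth_dedup(&self, occurrences: usize) -> bool {
        self.net_savings(occurrences) > 0
    }
}
```

For example, `Expr::I32Add(Box::new(Expr::LocalGet(0)), Box::new(Expr::I32Const(42)))` has cost 2 + 2 + 1 = 5, so it is skipped at N = 2 and deduplicated at N = 3, matching the table.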

Measurement on gale_ffi

| Build | Code section | Δ vs baseline |
|---|---|---|
| baseline (gale_in_baseline.wasm) | 811 B | |
| LOOM v0.5.0 | 862 B | +6.3% |
| LOOM v0.6.0 (this PR) | 808 B | -0.4% |

A 6.7-point swing on real kernel-scheduler code.
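
The percentages follow directly from the byte counts in the table. A small sketch of the arithmetic, with hypothetical helper names:

```rust
/// Percentage change of `size` relative to `baseline`.
fn pct_delta(baseline: u32, size: u32) -> f64 {
    (size as f64 - baseline as f64) / baseline as f64 * 100.0
}

/// Round to one decimal place, as the table does.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}
```

Here `pct_delta(811, 862)` is 51/811, about +6.29%, and `pct_delta(811, 808)` is -3/811, about -0.37%. The difference between the two is about 6.66 percentage points, which rounds to the 6.7-point swing quoted above.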

Tests

  • test_cse_phase4_duplicate_constants_above_cost_threshold — pin that LARGE constants (5+ byte LEB128) still get deduplicated. CSE remains useful where it pays off.
  • test_cse_phase4_keeps_small_constants — regression test for the gale fix. Cheap constants (2 bytes encoded, e.g. i32.const -22) must survive CSE; neither locals nor instruction count grow.
  • All 247 existing lib tests pass.

v0.6.0 plan

The wasm-opt-gap research agent ranked 7 picks. The first three (2.3 weeks combined, per the estimates below) flip the sign of the gale gap entirely:

| Pick | Status | LOC | Wks |
|---|---|---|---|
| 1. Constant-CSE suppression gate | this PR | 50 | 0.3 |
| 2. reorder-locals (slot renumbering) | pending | 250 | 0.5 |
| 3. RedundantSetElimination liveness | pending | 600 | 1.5 |

🤖 Generated with Claude Code

avrabe added 2 commits May 3, 2026 14:40
The v0.4.0 audit measured LOOM's CSE growing the gale_ffi
kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced
it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate
expression including 1-2 byte constants. Replacing
`i32.const -22` (2 bytes encoded) with `local.tee N / local.get N`
(2+2 = 4 bytes) plus an additional local declaration (~2 bytes
amortized) is unconditionally a size regression on cheap
constants — and gale is full of them (errno values).

Fix: add a cost gate `Expr::worth_dedup(occurrences)` that
estimates net byte savings before deciding to dedup. Skip when:

  net = (N - 1) * (cost - 2) - 4 ≤ 0

Examples:
  i32.const 42 (cost=2, N=10): savings = -4 → skip
  i32.add cost=5, N=2: savings = -1 → skip
  i32.add cost=5, N=3: savings = +2 → keep

Measurement on gale_ffi:
  v0.5.0: code section 811 → 862 bytes (+6.3% regression)
  v0.6.0: code section 811 → 808 bytes (-0.4% net win)

Tests:
  - test_cse_phase4_duplicate_constants_above_cost_threshold:
    LARGE constants (5+ byte LEB128) still get deduplicated.
  - test_cse_phase4_keeps_small_constants: regression test for
    the gale fix. Cheap constants must survive CSE.

Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.

Trace: REQ-3, REQ-14

Two research outputs from v0.6.0 planning, both grounded in real
Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md — eight optimization-relevant patterns
in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets) — 6 near-identical
  matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern)
  — canonical example in sched.rs:404-444 (next_up_smp), four
  more found
- Verus-bounded loops — 24 `decreases` clauses, all of form
  MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask
  axiom ingestion (event.rs lemmas), 2D state-machine matches,
  and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md — top 7 wasm-opt passes ranked by
expected payoff on kernel code, cheapest-first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks) — DONE
   in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~2.3 weeks combined) flip the sign of
the gale +6.3% regression. Pick #1 is shipped in this PR;
picks #2 and #3 are tracked for follow-up PRs in v0.6.0.

Trace: REQ-7