
[CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs)#21435

Draft
NullVoxPopuli-ai-agent wants to merge 5 commits into
emberjs:mainfrom
NullVoxPopuli-ai-agent:perf/each-item-cell-ref


@NullVoxPopuli-ai-agent
Contributor

@NullVoxPopuli-ai-agent NullVoxPopuli-ai-agent commented May 29, 2026

Summary

While profiling the {{#each}} hot path that drives the table benchmark in smoke-tests/benchmark-app (create / clear / append / update / swap of 1k–10k rows), the dominant JS-side cost per row turned out to be the references created for the loop's block params.

Every {{#each}} item binds two block params — the item value and its index — and both were built by createIteratorItemRef as full compute references. Per item that meant:

  • a ReferenceImpl + a dirtyable tag, plus two closures (the compute getter and the update setter), and
  • on every read, valueForRef took the generic compute path and opened a track() frame (a Tracker + a Set allocation) purely to re-discover a tag that never changes.

For a 10,000-row render that's 20,000 references and 20,000 tracking frames per pass — and create/clear/append/update/swap all hit it — to model something that is really just "a stored value behind a single tag". That's the over-abstraction: an iterator item is a cell, not a computation.

What changed

A dedicated Cell reference type (packages/@glimmer/interfaces/lib/references.d.ts). createIteratorItemRef now returns a cell, which stores its value directly on the reference behind a fixed tag. As a result:

  • valueForRef reads the stored value and re-snapshots the tag without opening a tracking frame — a cell has no dependencies to discover.
  • updateRef mutates the value inline with the same equality gate as before, so a cell needs no compute/update closures at all — just the reference object and its tag.

The change is behavior-preserving: the same tag is consumed on read and the same equality-gated dirty happens on update. isUpdatableRef reports cells as updatable, and createDebugAliasRef no longer inherits the Cell type (a debug alias is a genuine compute reference that recomputes through its inner ref).
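To make the contrast concrete, here is a minimal sketch of the two shapes. It is a simplified model, not the actual `@glimmer/reference` source: `SimpleTag`, `createComputeItemRef`, `createCellItemRef`, `readCell` and `updateCell` are hypothetical names used only for illustration.

```typescript
// Hypothetical, simplified model; not the real @glimmer/reference code.
let revisionClock = 1;

class SimpleTag {
  revision = revisionClock;
  dirty(): void {
    this.revision = ++revisionClock;
  }
}

// Before: a compute-style reference allocates two closures per item.
interface ComputeRef<T> {
  kind: 'compute';
  tag: SimpleTag;
  compute: () => T;
  update: (value: T) => void;
}

// After: a cell keeps the value on the reference, behind one fixed tag.
interface CellRef<T> {
  kind: 'cell';
  tag: SimpleTag;
  value: T;
}

function createComputeItemRef<T>(initial: T): ComputeRef<T> {
  let value = initial;
  const tag = new SimpleTag();
  return {
    kind: 'compute',
    tag,
    compute: () => value,
    update: (next: T) => {
      if (value !== next) {
        value = next;
        tag.dirty();
      }
    },
  };
}

function createCellItemRef<T>(initial: T): CellRef<T> {
  return { kind: 'cell', tag: new SimpleTag(), value: initial };
}

function readCell<T>(ref: CellRef<T>): T {
  // No tracking frame here: the tag is fixed, so there is nothing to discover.
  return ref.value;
}

function updateCell<T>(ref: CellRef<T>, next: T): void {
  // Same equality gate as the compute version: only dirty on a real change.
  if (ref.value !== next) {
    ref.value = next;
    ref.tag.dirty();
  }
}
```

In this model a cell costs one object plus its tag, while the compute shape costs the same plus two closures and a captured variable per item.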

Results

Microbenchmark exercising the real production valueForRef / updateRef / track, comparing the new cell ref against a faithful reconstruction of the previous compute-ref implementation (1000 items per iteration, DEBUG=false):

| Scenario (per 1000 items) | Before (compute ref) | After (cell ref) | Δ |
| --- | --- | --- | --- |
| initial render (create + read) | 197.9 µs / 698 kb | 86.3 µs / 261 kb | 2.3× faster, −63% memory |
| re-render (update + read) | 184.9 µs / 417 kb | 79.3 µs / 137 kb | 2.3× faster, −67% memory |
| allocation only | 31.4 µs / 320 kb | 22.5 µs / ~4 kb | 1.4× faster, ~0 garbage |

Since every row allocates two of these refs (value + index), this removes a large, constant per-row tax from every list operation the benchmark measures.

Testing

Built with vite build and run headless in Chrome via testem.cjs. All green (pre-existing skips unchanged):

  • --filter each → 574 pass
  • --filter reference → 45 pass
  • --filter iterable → 24 pass
  • --filter tracked → 242 pass
  • --filter Updating → 175 pass
  • --filter fn → 36 pass
  • --filter "Helpers test" → 1173 pass
  • --filter "Components test" → 328 pass

tsc --noEmit, eslint --no-cache, and prettier --check all clean on the changed files.

🤖 Generated with Claude Code


End-to-end benchmark (the repo's tracerbench compare)

Ran pnpm bench (bin/benchmark.mjs) — control = origin/main, experiment = this branch — on the krausest table app in smoke-tests/benchmark-app, across three configurations on a non-dedicated laptop (so treat absolute numbers with the usual caution):

| Config | Result |
| --- | --- |
| fidelity 10, no CPU throttle | total duration −1.96% [−3.56% … −0.72%] (significant) |
| fidelity 40, no CPU throttle | total duration within noise [−160ms … +44ms]; clearManyItems2 −4.9% [−8.99% … −1.23%] (significant) |
| fidelity 25, 4× CPU throttle | clearManyItems2 −7.46% [−16.37% … −4.54%] (significant) |

Honest reading: most per-phase deltas land within this benchmark's noise floor on shared hardware — each phase is DOM/raster-dominated, so the JS saving is a small fraction of wall-clock and run-to-run variance is large (e.g. render10000Items2 CIs span ±400ms). The one consistent, reproducible, significant signal across runs is clearManyItems2 — tearing down 10,000 rows, the single most reference-allocation-heavy phase — at −5% to −7.5%. That's exactly where eliminating two compute refs (+ two closures + two tracking frames) per row should show up. A couple of phases (append1000Items2, selectSecondRow1) showed apparent regressions under throttling, but those flipped sign between runs and touch paths this change doesn't meaningfully alter (selection invalidates class bindings, not iterator-item refs), consistent with measurement noise.

The isolated microbench above (2.3× on the per-item ref path) is the clean, reproducible evidence for the JS-level win; the tracerbench numbers confirm it surfaces end-to-end where it should and show no robust regression.


Update: two more flattened layers (2nd commit)

Beyond the cell reference, two more extraneous layers in the reference/iteration hot paths were removed:

1. Inlined the track() frame in valueForRef. Recompute went through track(thunk), allocating a closure on every (re)compute. valueForRef is the single hottest function in the VM, so opening beginTrackFrame()/endTrackFrame() inline drops a per-read allocation. Microbench (1000 recompute frames): 63.2µs → 57.0µs (~10%), 282kb → 188kb garbage (~33%).

2. Flattened {{#each}} key derivation. The key strategy was re-resolved on every diff and wrapped every strategy — including @index/@key, whose keys are unique by construction — in the duplicate-key dedup machinery. It's now resolved once when the iterator ref is created; index keys skip dedup entirely, and the per-pass seen set is a plain Map instead of the lazy-getter WeakMapWithPrimitives (kept only for the long-lived global IDENTITIES). Microbench (1000-item iteration): @index 23.0µs vs @identity 48.9µs — index keys no longer pay the dedup cost.
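As a rough illustration of item 2, the sketch below resolves the key strategy once and lets index keys bypass dedup, with a plain `Map` as the per-pass `seen` set. All names (`resolveKeyStrategy`, `keysForPass`, `needsDedup`) are hypothetical, and the `-dup-N` suffix is only a stand-in for whatever disambiguation the real duplicate-key machinery performs.

```typescript
// Hypothetical sketch; names and the duplicate-key scheme are illustrative only.
type KeyFor = (item: unknown, index: number) => unknown;

interface ResolvedKey {
  keyFor: KeyFor;
  needsDedup: boolean; // false when keys are unique by construction
}

// Called once, when the iterator ref is created (not on every diff).
function resolveKeyStrategy(key: string): ResolvedKey {
  if (key === '@index') {
    return { keyFor: (_item, index) => index, needsDedup: false };
  }
  if (key === '@identity') {
    return { keyFor: (item) => item, needsDedup: true };
  }
  return {
    keyFor: (item) =>
      item !== null && typeof item === 'object'
        ? (item as Record<string, unknown>)[key]
        : undefined,
    needsDedup: true,
  };
}

function keysForPass(items: readonly unknown[], strategy: ResolvedKey): unknown[] {
  if (!strategy.needsDedup) {
    // Index keys: no seen-set, no dedup work at all.
    return items.map((item, i) => strategy.keyFor(item, i));
  }
  // The seen-set only lives for this synchronous pass, so a plain Map suffices.
  const seen = new Map<unknown, number>();
  return items.map((item, i) => {
    const key = strategy.keyFor(item, i);
    const count = seen.get(key) ?? 0;
    seen.set(key, count + 1);
    // Stand-in disambiguation for repeated keys (assumption, not Glimmer's scheme).
    return count === 0 ? key : `${String(key)}-dup-${count}`;
  });
}
```

For example, `keysForPass(['a', 'a'], resolveKeyStrategy('@index'))` yields `[0, 1]` without ever touching a `Map`.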

End-to-end (tracerbench, all three changes, control = origin/main)

| Phase | fidelity 10 | fidelity 30 |
| --- | --- | --- |
| total duration | −2.29% [−3.66 … −0.94] | −1.77% [−2.77 … −0.83] |
| selectFirstRow1 | −36.2% [−44.6 … −29.7] | −32.8% [−38.0 … −16.6] |
| clearManyItems2 | −10.7% [−17.2 … −4.7] | −11.3% [−16.7 … −8.3] |
| render1000Items2 | (noise) | −7.9% [−16.3 … −2.8] |
| swapRows1 | −5.4% [−8.4 … −1.0] | (noise) |

selectFirstRow1 (re-reads every visible row's isSelected class binding) and clearManyItems2 (10k-row teardown) are consistent, significant wins; the total duration is significantly improved in every run. clearManyItems2 has now been significant across all runs (−4.9 / −7.5 / −10.7 / −11.3%).

One phase (updateEvery10thItem) showed a +4–6% delta at fidelity 30 but leaned negative at fidelity 10. Mechanically neither new change can affect it — that phase doesn't re-diff the list (the key-path code never runs) and 2,700/3,000 rows take valueForRef's valid path, which is unchanged (only the recompute path was touched) — so it reads as run-to-run noise on shared hardware.

All suites still green: each (571), iterable (24), tracked (242), Updating (175), Helpers (1173), Components (328), fn (36).


Update: flattened childRefFor too (property access, not just {{#each}})

Property access ({{a.b}}) — used by essentially every template — compiled to a compute reference holding two closures (a getProp getter and a setProp setter) that captured nothing but (parent, path). Added a Property reference type that stores parent + path as plain fields, read/written inline by valueForRef/updateRef (same approach as Cell). No closures allocated; reads still open a tracking frame (getProp consumes dynamic tags). Microbench (1000 childRefFor calls): 72.2µs/633kb → 62.4µs/477kb — ~14% faster, ~25% less allocation.
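A minimal sketch of the Property shape, under simplifying assumptions: `getProp`/`setProp` here are plain object reads and writes, the `RootRef` type is invented so the example is self-contained, and the tracking frame that real reads still open is omitted.

```typescript
// Hypothetical sketch; the real valueForRef/updateRef also handle tags and tracking.
type Dict = Record<string, unknown>;

interface RootRef {
  kind: 'root';
  value: unknown;
}

// parent + path are plain fields; no getter/setter closures are allocated.
interface PropertyRef {
  kind: 'property';
  parent: Ref;
  path: string;
}

type Ref = RootRef | PropertyRef;

function getProp(obj: unknown, path: string): unknown {
  return obj !== null && typeof obj === 'object' ? (obj as Dict)[path] : undefined;
}

function setProp(obj: unknown, path: string, value: unknown): void {
  if (obj !== null && typeof obj === 'object') {
    (obj as Dict)[path] = value;
  }
}

function childRefFor(parent: Ref, path: string): PropertyRef {
  return { kind: 'property', parent, path };
}

// Reads branch on the reference kind and use the stored fields inline.
function valueForRef(ref: Ref): unknown {
  if (ref.kind === 'root') {
    return ref.value;
  }
  return getProp(valueForRef(ref.parent), ref.path);
}

function updateRef(ref: PropertyRef, value: unknown): void {
  setProp(valueForRef(ref.parent), ref.path, value);
}
```

A path such as `{{row.label.current}}` would then be two chained `childRefFor` calls on the `row` reference, each allocating one small object and nothing else.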

This commit also fixes a throw-semantics bug I introduced when inlining track() into valueForRef: committing ref.tag inside the finally updated the tag even when the compute threw, leaving tag/lastRevision inconsistent. The new tag/revision are now committed only on success (the frame is still ended in finally to keep the stack balanced). This restores correct handling of throwing getters.
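The commit-on-success rule can be sketched as follows. This is a simplified model with hypothetical types (`Tag`, `ComputeRef`, `recompute`), and the `succeeded` flag is just one way to express the guard; the actual change may structure it differently.

```typescript
// Hypothetical, simplified model; not the actual Glimmer source.
interface Tag {
  revision: number;
}

const frames: Tag[][] = []; // stack of open tracking frames

function beginTrackFrame(): void {
  frames.push([]);
}

function consumeTag(tag: Tag): void {
  const current = frames[frames.length - 1];
  if (current !== undefined) current.push(tag);
}

function endTrackFrame(): Tag {
  const consumed = frames.pop();
  if (consumed === undefined) throw new Error('unbalanced tracking frame');
  let revision = 0;
  for (const tag of consumed) revision = Math.max(revision, tag.revision);
  return { revision };
}

interface ComputeRef<T> {
  compute: () => T;
  tag: Tag | null;
  lastRevision: number;
  lastValue: T | undefined;
}

function recompute<T>(ref: ComputeRef<T>): T {
  beginTrackFrame(); // inlined frame: no thunk allocated for track(() => ...)
  let succeeded = false;
  try {
    const value = ref.compute();
    ref.lastValue = value;
    succeeded = true;
    return value;
  } finally {
    // Always end the frame so the stack stays balanced, even on a throw.
    const newTag = endTrackFrame();
    // Commit tag and revision only when compute() returned normally.
    if (succeeded) {
      ref.tag = newTag;
      ref.lastRevision = newTag.revision;
    }
  }
}
```

With an unconditional commit in `finally`, a throwing getter would leave `ref.tag` pointing at a partial frame while `lastValue` still held the previous result, which is the inconsistency described above.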

Verification

Full browser suite (the CI "Basic Test" set) run locally: 9340 tests, 9323 pass, 17 skip, 0 fail. Type-check, eslint (--no-cache), and prettier all clean.


Update: deeper into the tracking & tag layers (validator)

Two more flattenings, this time in @glimmer/validator — the machinery hit on every reference read and every revalidation tick, so these compound across the whole VM, not just {{#each}}:

4. Pool trackers + lazily allocate the consumed-tag Set. beginTrackFrame allocated a new Tracker() and the tracker a new Set<Tag>() — two objects per frame — on every reference recompute and every cache group. The vast majority of frames consume 0 or 1 tags. The tracker now keeps the first tag in a field and allocates the Set only on a second distinct tag, and trackers are pooled on a LIFO freelist (frames are strictly nested). Common frame allocation: two objects → ~0 b/iter (measured 0.10 b for the 0-tag case).

5. Fast-path tag [COMPUTE] for subtag-less tags. validateTag/valueForTag call MonomorphicTagImpl[COMPUTE] on every reference read. For a tag with no subtag (property tags, cell tags, plain dirtyable/updatable tags — the overwhelming majority) the value is always just revision; the lastChecked/isUpdating/cycle-guard/try-finally machinery exists only to memoize subtag recursion. Now returns this.revision directly. Microbench: ~4.71µs → ~3.90µs per 1000 (~17%), no try/finally or field writes on the read.
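The fast path in item 5 can be sketched like this. `SketchTag`, `globalRevision` and `dirtyTag` are hypothetical stand-ins for the validator's internals, reduced to a single optional subtag; the point is only where the early return sits relative to the memoization machinery.

```typescript
// Hypothetical, simplified model of a monomorphic tag; not the real validator code.
let globalRevision = 1;

class SketchTag {
  revision = 1;
  subtag: SketchTag | null = null;
  private lastChecked = -1;
  private lastValue = -1;
  private isUpdating = false;

  compute(): number {
    const subtag = this.subtag;

    // Fast path: with no subtag the answer is always just `revision`,
    // so skip the memoization fields, the cycle guard and the try/finally.
    if (subtag === null) return this.revision;

    // Combinator path: memoize subtag recursion per global revision.
    if (this.lastChecked !== globalRevision) {
      if (this.isUpdating) throw new Error('cycle detected in tags');
      this.isUpdating = true;
      try {
        this.lastValue = Math.max(this.revision, subtag.compute());
        this.lastChecked = globalRevision;
      } finally {
        this.isUpdating = false;
      }
    }
    return this.lastValue;
  }
}

function dirtyTag(tag: SketchTag): void {
  tag.revision = ++globalRevision;
}
```

In this model a leaf tag's `compute()` is a single field read, and the already-read `subtag` local is reused on the combinator path.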

Aggregate end-to-end (all 5 commits, control = this branch's base, tracerbench)

| Phase | Result |
| --- | --- |
| selectFirstRow1 | −38.9% [−44.3 … −33.2] |
| selectSecondRow1 | −14.1% [−19.3 … −9.0] |
| swapRows2 | −8.0% [−14.1 … −1.8] |
| swapRows1 | −5.9% [−12.5 … −2.0] |

The revalidation-heavy phases (selection, swap — which walk every row's tags on each update) show large, significant, reproducible improvements with no significant regressions. Create/clear phases remain DOM-dominated (within noise). Full browser suite green at every step: 9340 tests, 0 fail.

All five flattenings

  1. Cell refs for {{#each}} block params (no closures, no tracking frame)
  2. Inlined track() frame + flattened {{#each}} key resolution
  3. Property refs for childRefFor (property access, no closures)
  4. Pooled trackers + lazy consumed-tag Set
  5. Fast-path tag [COMPUTE] for subtag-less tags

Each `{{#each}}` item binds two block params — the item value and its
index — and both were created as full compute references via
`createIteratorItemRef`. That meant, per item:

  - a `ReferenceImpl` + a dirtyable tag, plus *two* closures (the
    `compute` getter and the `update` setter), and
  - on every read, `valueForRef` took the generic compute path and opened
    a `track()` frame (a `Tracker` + `Set` allocation) purely to
    re-discover a tag that never changes.

For a 10k-row table that is 20k references and 20k tracking frames per
render pass (create/clear/append/update/swap all hit this), all to model
a value that is just "a stored value behind one tag".

This introduces a dedicated `Cell` reference type. A cell stores its
value directly on the reference behind a fixed tag, so:

  - `valueForRef` reads the stored value and re-snapshots the tag without
    opening a tracking frame (there are no dependencies to discover), and
  - `updateRef` mutates the value inline with the same equality gate as
    before — no `compute`/`update` closures are allocated at all.

Behavior is identical: same tag consumed on read, same equality-gated
dirty on update. `isUpdatableRef` reports cells as updatable, and
`createDebugAliasRef` no longer inherits the `Cell` type (a debug alias
is a genuine compute reference).

Microbench (real `valueForRef`/`updateRef`, 1000 items, prod build):

  initial render (create+read)  198µs/698kb -> 86µs/261kb  (2.3x, -63% mem)
  re-render      (update+read)  185µs/417kb -> 79µs/137kb  (2.3x, -67% mem)
  allocation only                31µs/320kb -> 22µs/~4kb   (1.4x, ~0 garbage)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NullVoxPopuli NullVoxPopuli marked this pull request as draft May 29, 2026 15:31
@NullVoxPopuli
Contributor

our each def has a problem, but I'm not convinced this is the solution.

running the bench locally shows not much improvement:

duration phase estimated improvement -21ms [-44ms to -3ms] OR -1.22% [-2.55% to -0.17%]
renderEnd phase no difference [0ms to 0ms]
render1000Items1End phase no difference [-2ms to 1ms]
clearItems1End phase no difference [-2ms to 1ms]
render1000Items2End phase no difference [-3ms to 3ms]
clearItems2End phase no difference [-1ms to 0ms]
render10000Items1End phase no difference [-10ms to 2ms]
clearManyItems1End phase estimated regression +2ms [1ms to 3ms] OR +1.29% [0.65% to 1.95%]
render10000Items2End phase no difference [-20ms to 14ms]
clearManyItems2End phase estimated improvement -3ms [-5ms to -1ms] OR -6.27% [-12.51% to -2.44%]
render1000Items3End phase no difference [0ms to 2ms]
append1000Items1End phase no difference [-2ms to 3ms]
append1000Items2End phase no difference [-2ms to 1ms]
updateEvery10thItem1End phase no difference [-2ms to 2ms]
updateEvery10thItem2End phase no difference [-1ms to 2ms]
selectFirstRow1End phase no difference [-1ms to 1ms]
selectSecondRow1End phase no difference [-1ms to 1ms]
removeFirstRow1End phase no difference [-1ms to 1ms]
removeSecondRow1End phase no difference [-1ms to 1ms]
swapRows1End phase no difference [-1ms to 0ms]
swapRows2End phase no difference [-2ms to 0ms]
clearItems4End phase no difference [-1ms to 0ms]
paint phase no difference [-2ms to 0ms]

I have a hunch we'll need to ship fragment support first so that each can be sort of "off-canvas"'d

NullVoxPopuli-ai-agent and others added 4 commits May 29, 2026 14:04
Two more extraneous layers in the reference/iteration hot paths, removed:

1. `valueForRef` recompute went through `track(thunk)`, which allocates a
   closure on *every* (re)compute. This is the single hottest function in
   the VM — every reference read that needs evaluation passes through it
   (all refs on initial render, and again on each invalidation). Inlining
   `beginTrackFrame()`/`endTrackFrame()` drops that per-read allocation.

   Microbench (1000 recompute frames): 63.2µs -> 57.0µs (~10%) and
   282kb -> 188kb (~33% less garbage).

2. `{{#each}}` key derivation:
   - `makeKeyFor` was re-resolved on every diff and wrapped *every*
     strategy — including `@index`/`@key`, whose keys are unique by
     construction — in the duplicate-key dedup machinery. The strategy is
     now resolved once when the iterator ref is created, and index keys
     skip dedup entirely.
   - The per-pass `seen` set used `WeakMapWithPrimitives` (lazy-getter +
     object/primitive dispatch on every get/set). Since it lives only for
     one synchronous pass, a plain `Map` is both simpler and faster; the
     weak-keyed map is kept only for the long-lived global `IDENTITIES`.

   Microbench (1000-item iteration): `@index` 23.0µs vs `@identity`
   48.9µs — index keys no longer pay the dedup cost they used to.

Behavior is unchanged: same keys produced, same duplicate-key semantics,
same tag consumption. Verified headless in Chrome — each (571), iterable
(24), tracked (242), Updating (175), Helpers (1173), Components (328), fn
(36) all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every `{{a.b}}` path access compiled to a compute reference holding two
closures — a getter (`getProp(valueForRef(parent), path)`) and a setter
(`setProp(...)`) — that captured nothing but `(parent, path)`. That is two
closure allocations per property reference, on a path hit by essentially
every template (`{{this.foo}}`, `{{row.id}}`, `{{row.label.current}}`, …).

Add a `Property` reference type that stores `parent` + `path` as plain
fields and is read/written inline by `valueForRef`/`updateRef` (the same
approach as the `Cell` type used for `{{#each}}` block params). No closures
are allocated; reads still open a tracking frame, since `getProp` consumes
dynamic tags. `isUpdatableRef` reports Property refs as updatable, and
`createDebugAliasRef` no longer inherits the Property type.

Microbench (1000 childRefFor calls): 72.2µs/633kb -> 62.4µs/477kb
(~14% faster, ~25% less allocation).

Also fixes a throw-semantics bug introduced when `track()` was inlined into
`valueForRef`: committing `ref.tag` inside the `finally` updated the tag even
when the compute threw, leaving `tag` and `lastRevision` inconsistent. The
new tag/revision are now committed only on success (the frame is still ended
in `finally` to keep the tracking stack balanced), matching the original
`track()` behavior. This restores correct handling of throwing getters —
caught by the `debug render tree: emberish curly components` test.

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`beginTrackFrame` allocated a `new Tracker()` and the Tracker allocated a
`new Set<Tag>()` — two objects per frame — on *every* reference recompute
and every cache group, every revalidation. The overwhelming majority of
frames consume zero or one tag.

- The Tracker now holds the first consumed tag in a field and allocates the
  `Set` only when a second, distinct tag arrives. 0/1-tag frames never touch
  a Set (and still dedupe / combine correctly).
- Trackers are pooled on a LIFO freelist. Frames are strictly nested and a
  tracker is dead the instant `combine()` runs in `endTrackFrame`, so it can
  be reset and reused by the next `beginTrackFrame`.

Net: the common tracking frame now allocates ~nothing. Microbench: a
frame that opens, consumes one tag, and closes drops from two object
allocations to ~0 b/iter (measured 0.10 b for the 0-tag case).
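A rough sketch of the two ideas (first tag in a field, LIFO freelist). The names and shapes are hypothetical; in particular, this `combine()` returns an array for easy inspection, whereas the real one produces a combined tag.

```typescript
// Hypothetical sketch; not the actual @glimmer/validator implementation.
interface Tag {
  readonly id: number;
}

class Tracker {
  private first: Tag | null = null;
  private rest: Set<Tag> | null = null;

  add(tag: Tag): void {
    if (this.first === null) {
      this.first = tag; // 1st tag: stored in a field, no Set
      return;
    }
    if (tag === this.first) return; // still deduped without a Set
    if (this.rest === null) this.rest = new Set<Tag>(); // 2nd distinct tag
    this.rest.add(tag);
  }

  get hasSet(): boolean {
    return this.rest !== null;
  }

  // Returns an array here for illustration only.
  combine(): Tag[] {
    if (this.first === null) return [];
    if (this.rest === null) return [this.first];
    return [this.first].concat(Array.from(this.rest));
  }

  reset(): void {
    this.first = null;
    this.rest = null;
  }
}

const pool: Tracker[] = []; // LIFO freelist of dead trackers
const active: Tracker[] = []; // strictly nested open frames

function beginTrackFrame(): void {
  active.push(pool.pop() ?? new Tracker());
}

function consumeTag(tag: Tag): void {
  const current = active[active.length - 1];
  if (current !== undefined) current.add(tag);
}

function endTrackFrame(): Tag[] {
  const tracker = active.pop();
  if (tracker === undefined) throw new Error('unbalanced tracking frame');
  const tags = tracker.combine();
  // The tracker is dead once combined, so it can be recycled immediately.
  tracker.reset();
  pool.push(tracker);
  return tags;
}
```

Because frames are strictly nested, the pool never needs more trackers than the deepest nesting seen so far.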

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`MonomorphicTagImpl[COMPUTE]` is called by `validateTag`/`valueForTag` on
every reference read. For a tag with no subtag — property tags, cell tags,
plain dirtyable/updatable tags, i.e. the overwhelming majority — the result
is always just `revision` (kept current by `dirtyTag`). The
`lastChecked`/`isUpdating`/cycle-guard/`try-finally` machinery exists only to
memoize subtag recursion, so it is pure overhead for these tags.

Return `this.revision` directly when `subtag === null`. The combinator path
is unchanged (it now reuses the already-read `subtag`).

Microbench (1000 subtag-less [COMPUTE]s during a revalidation pass):
~4.71µs -> ~3.90µs (~17%), and no try/finally or field writes on the read.

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NullVoxPopuli-ai-agent NullVoxPopuli-ai-agent changed the title from `perf(reference): make {{#each}} item params cheap "cell" references` to `[CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs)` May 29, 2026