[CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs) by NullVoxPopuli-ai-agent · Pull Request #21435 · emberjs/ember.js

NullVoxPopuli-ai-agent · 2026-05-29T15:08:18Z

Summary

While profiling the {{#each}} hot path that drives the table benchmark in smoke-tests/benchmark-app (create / clear / append / update / swap of 1k–10k rows), the dominant JS-side cost per row turned out to be the references created for the loop's block params.

Every {{#each}} item binds two block params — the item value and its index — and both were built by createIteratorItemRef as full compute references. Per item that meant:

- a ReferenceImpl + a dirtyable tag, plus two closures (the compute getter and the update setter), and
- on every read, valueForRef took the generic compute path and opened a track() frame (a Tracker + a Set allocation) purely to re-discover a tag that never changes.

For a 10,000-row render that's 20,000 references and 20,000 tracking frames per pass — and create/clear/append/update/swap all hit it — to model something that is really just "a stored value behind a single tag". That's the over-abstraction: an iterator item is a cell, not a computation.

What changed

A dedicated Cell reference type (packages/@glimmer/interfaces/lib/references.d.ts). createIteratorItemRef now returns a cell, which stores its value directly on the reference behind a fixed tag. As a result:

- valueForRef reads the stored value and re-snapshots the tag without opening a tracking frame, since a cell has no dependencies to discover.
- updateRef mutates the value inline with the same equality gate as before, so a cell needs no compute/update closures at all, just the reference object and its tag.

The change is behavior-preserving: the same tag is consumed on read and the same equality-gated dirty happens on update. isUpdatableRef reports cells as updatable, and createDebugAliasRef no longer inherits the Cell type (a debug alias is a genuine compute reference that recomputes through its inner ref).
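To make the difference concrete, here is a minimal sketch of the two shapes. It is not the code from this PR: the `kind` discriminator, the `framesOpened` counter, and the simplified `Tag` are assumptions made for illustration, and the real `@glimmer/reference` and `@glimmer/validator` types differ.

```typescript
// Simplified stand-ins, NOT the real Glimmer types.
let clock = 1;
interface Tag { revision: number }
const createTag = (): Tag => ({ revision: clock });
const dirtyTag = (tag: Tag): void => { tag.revision = ++clock; };

// Hypothetical counter so the sketch can show how many frames a read opens.
let framesOpened = 0;

// Old shape: a compute reference carrying two closures.
interface ComputeRef<T> {
  kind: 'compute';
  tag: Tag;
  compute: () => T;
  update: (next: T) => void;
}

// New shape: the value lives directly on the reference behind a fixed tag.
interface CellRef<T> { kind: 'cell'; tag: Tag; value: T }

type Ref<T> = ComputeRef<T> | CellRef<T>;

function createComputeItemRef<T>(initial: T): ComputeRef<T> {
  let value = initial;
  const tag = createTag();
  return {
    kind: 'compute',
    tag,
    compute: () => value,
    update: (next) => {
      if (next !== value) { value = next; dirtyTag(tag); }
    },
  };
}

function createCellRef<T>(initial: T): CellRef<T> {
  return { kind: 'cell', tag: createTag(), value: initial };
}

function valueForRef<T>(ref: Ref<T>): T {
  if (ref.kind === 'cell') return ref.value; // no frame, no allocation
  framesOpened++; // stands in for track(): a Tracker plus a Set per read
  const consumed = new Set<Tag>();
  consumed.add(ref.tag);
  return ref.compute();
}

function updateRef<T>(ref: Ref<T>, next: T): void {
  if (ref.kind === 'cell') {
    // Same equality gate as the closure version, applied inline.
    if (ref.value !== next) { ref.value = next; dirtyTag(ref.tag); }
  } else {
    ref.update(next);
  }
}
```

In this sketch a cell read never touches `framesOpened`, while each compute-ref read increments it once, which mirrors the per-read tracking frame described above.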

Results

Microbenchmark exercising the real production valueForRef / updateRef / track, comparing the new cell ref against a faithful reconstruction of the previous compute-ref implementation (1000 items per iteration, DEBUG=false):

| Scenario (per 1000 items) | Before (compute ref) | After (cell ref) | Δ |
| --- | --- | --- | --- |
| initial render (create + read) | 197.9 µs / 698 kb | 86.3 µs / 261 kb | 2.3× faster, −63% memory |
| re-render (update + read) | 184.9 µs / 417 kb | 79.3 µs / 137 kb | 2.3× faster, −67% memory |
| allocation only | 31.4 µs / 320 kb | 22.5 µs / ~4 kb | 1.4× faster, ~0 garbage |

Since every row allocates two of these refs (value + index), this removes a large, constant per-row tax from every list operation the benchmark measures.

Testing

Built with vite build and run headless in Chrome via testem.cjs. All green (pre-existing skips unchanged):

--filter each → 574 pass
--filter reference → 45 pass
--filter iterable → 24 pass
--filter tracked → 242 pass · --filter Updating → 175 pass
--filter fn → 36 pass · --filter "Helpers test" → 1173 pass · --filter "Components test" → 328 pass

tsc --noEmit, eslint --no-cache, and prettier --check all clean on the changed files.

🤖 Generated with Claude Code

End-to-end benchmark (the repo's tracerbench `compare`)

Ran pnpm bench (bin/benchmark.mjs) — control = origin/main, experiment = this branch — on the krausest table app in smoke-tests/benchmark-app, across three configurations on a non-dedicated laptop (so treat absolute numbers with the usual caution):

| Config | Result |
| --- | --- |
| fidelity 10, no CPU throttle | total `duration` −1.96% [−3.56% … −0.72%] (significant) |
| fidelity 40, no CPU throttle | total `duration` within noise [−160ms … +44ms]; `clearManyItems2` −4.9% [−8.99% … −1.23%] (significant) |
| fidelity 25, 4× CPU throttle | `clearManyItems2` −7.46% [−16.37% … −4.54%] (significant) |

Honest reading: most per-phase deltas land within this benchmark's noise floor on shared hardware — each phase is DOM/raster-dominated, so the JS saving is a small fraction of wall-clock and run-to-run variance is large (e.g. render10000Items2 CIs span ±400ms). The one consistent, reproducible, significant signal across runs is clearManyItems2 — tearing down 10,000 rows, the single most reference-allocation-heavy phase — at −5% to −7.5%. That's exactly where eliminating two compute refs (+ two closures + two tracking frames) per row should show up. A couple of phases (append1000Items2, selectSecondRow1) showed apparent regressions under throttling, but those flipped sign between runs and touch paths this change doesn't meaningfully alter (selection invalidates class bindings, not iterator-item refs), consistent with measurement noise.

The isolated microbench above (2.3× on the per-item ref path) is the clean, reproducible evidence for the JS-level win; the tracerbench numbers confirm it surfaces end-to-end where it should and show no robust regression.

Update: two more flattened layers (2nd commit)

Beyond the cell reference, two more extraneous layers in the reference/iteration hot paths were removed:

1. Inlined the track() frame in valueForRef. Recompute went through track(thunk), allocating a closure on every (re)compute. valueForRef is the single hottest function in the VM, so opening beginTrackFrame()/endTrackFrame() inline drops a per-read allocation. Microbench (1000 recompute frames): 63.2µs → 57.0µs (~10%), 282kb → 188kb garbage (~33%).

2. Flattened {{#each}} key derivation. The key strategy was re-resolved on every diff and wrapped every strategy — including @index/@key, whose keys are unique by construction — in the duplicate-key dedup machinery. It's now resolved once when the iterator ref is created; index keys skip dedup entirely, and the per-pass seen set is a plain Map instead of the lazy-getter WeakMapWithPrimitives (kept only for the long-lived global IDENTITIES). Microbench (1000-item iteration): @index 23.0µs vs @identity 48.9µs — index keys no longer pay the dedup cost.
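The key-derivation change in item 2 can be illustrated with a small sketch. The names `resolveKeyStrategy`, `KeyStrategy`, and `KeysForPass` are hypothetical, and the way duplicates are disambiguated here (a fresh wrapper object) is only a placeholder, not how Glimmer produces its unique keys.

```typescript
// Hypothetical names; only '@index' and '@identity' are modelled here.
type KeyStrategy = '@index' | '@identity';
type KeysForPass = (items: readonly unknown[]) => unknown[];

// Resolve the strategy ONCE (for example when the iterator ref is created)
// and reuse the returned function on every diff.
function resolveKeyStrategy(strategy: KeyStrategy): KeysForPass {
  if (strategy === '@index') {
    // Index keys are unique by construction, so no dedup bookkeeping at all.
    return (items) => items.map((_item, index) => index);
  }
  return (items) => {
    // The seen map lives for one synchronous pass, so a plain Map suffices.
    const seen = new Map<unknown, number>();
    return items.map((item) => {
      const occurrences = seen.get(item) ?? 0;
      seen.set(item, occurrences + 1);
      // Placeholder disambiguation: a fresh object for each repeated key.
      return occurrences === 0 ? item : { duplicateOf: item, occurrence: occurrences };
    });
  };
}

// Usage: resolve once, then call per pass.
const keysForIdentity = resolveKeyStrategy('@identity');
const identityKeys = keysForIdentity(['a', 'b', 'a']);
```

The design point is that the branch on strategy happens once, outside the per-item loop, so the `@index` path is a plain map over indices.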

End-to-end (tracerbench, all three changes, control = `origin/main`)

| Phase | fidelity 10 | fidelity 30 |
| --- | --- | --- |
| total `duration` | −2.29% [−3.66 … −0.94] | −1.77% [−2.77 … −0.83] |
| `selectFirstRow1` | −36.2% [−44.6 … −29.7] | −32.8% [−38.0 … −16.6] |
| `clearManyItems2` | −10.7% [−17.2 … −4.7] | −11.3% [−16.7 … −8.3] |
| `render1000Items2` | (noise) | −7.9% [−16.3 … −2.8] |
| `swapRows1` | −5.4% [−8.4 … −1.0] | (noise) |

selectFirstRow1 (re-reads every visible row's isSelected class binding) and clearManyItems2 (10k-row teardown) are consistent, significant wins; the total duration is significantly improved in every run. clearManyItems2 has now been significant across all runs (−4.9 / −7.5 / −10.7 / −11.3%).

One phase (updateEvery10thItem) showed a +4–6% delta at fidelity 30 but leaned negative at fidelity 10. Mechanically neither new change can affect it — that phase doesn't re-diff the list (the key-path code never runs) and 2,700/3,000 rows take valueForRef's valid path, which is unchanged (only the recompute path was touched) — so it reads as run-to-run noise on shared hardware.

All suites still green: each (571), iterable (24), tracked (242), Updating (175), Helpers (1173), Components (328), fn (36).

Update: flattened `childRefFor` too (property access, not just `{{#each}}`)

Property access ({{a.b}}) — used by essentially every template — compiled to a compute reference holding two closures (a getProp getter and a setProp setter) that captured nothing but (parent, path). Added a Property reference type that stores parent + path as plain fields, read/written inline by valueForRef/updateRef (same approach as Cell). No closures allocated; reads still open a tracking frame (getProp consumes dynamic tags). Microbench (1000 childRefFor calls): 72.2µs/633kb → 62.4µs/477kb — ~14% faster, ~25% less allocation.
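As a rough illustration of a field-based property reference, the sketch below stores `parent` and `path` on the reference and reads or writes through them inline. The `RootRef` type and the simplified `getProp`/`setProp` are assumptions for this example; the tracking frame and value caching of the real `valueForRef` are omitted.

```typescript
// Simplified stand-ins; tracking frames and caching are left out on purpose.
type Dict = Record<string, unknown>;

interface RootRef { kind: 'root'; value: unknown }
interface PropertyRef { kind: 'property'; parent: AnyRef; path: string }
type AnyRef = RootRef | PropertyRef;

function getProp(obj: unknown, path: string): unknown {
  return obj !== null && typeof obj === 'object' ? (obj as Dict)[path] : undefined;
}

function setProp(obj: unknown, path: string, value: unknown): void {
  if (obj !== null && typeof obj === 'object') (obj as Dict)[path] = value;
}

// No getter or setter closure is created: the reference is just two fields.
function childRefFor(parent: AnyRef, path: string): PropertyRef {
  return { kind: 'property', parent, path };
}

function valueForRef(ref: AnyRef): unknown {
  return ref.kind === 'root'
    ? ref.value
    : getProp(valueForRef(ref.parent), ref.path);
}

function updateRef(ref: PropertyRef, value: unknown): void {
  setProp(valueForRef(ref.parent), ref.path, value);
}

// Usage: the equivalent of {{row.label.current}} as a chain of property refs.
const root: RootRef = { kind: 'root', value: { row: { label: { current: 'x' } } } };
const currentRef = childRefFor(childRefFor(childRefFor(root, 'row'), 'label'), 'current');
```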

This commit also fixes a throw-semantics bug I introduced when inlining track() into valueForRef: committing ref.tag inside the finally updated the tag even when the compute threw, leaving tag/lastRevision inconsistent. The new tag/revision are now committed only on success (the frame is still ended in finally to keep the stack balanced). This restores correct handling of throwing getters.
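The commit-on-success pattern described here can be sketched as follows. The frame stack, `Tag`, and `recompute` are simplified assumptions rather than the PR's code; the point is only that the frame is always closed in `finally`, while the new tag and revision are stored only when the compute returned normally.

```typescript
// Simplified stand-ins for the tracking frame stack and a compute reference.
interface Tag { revision: number }

interface ComputeRef<T> {
  compute: () => T;
  tag: Tag | null;
  lastRevision: number;
  lastValue: T | undefined;
}

const frames: Tag[][] = [];

function beginTrackFrame(): void {
  frames.push([]);
}

function consumeTag(tag: Tag): void {
  const current = frames[frames.length - 1];
  if (current !== undefined) current.push(tag);
}

function endTrackFrame(): Tag {
  const consumed = frames.pop();
  if (consumed === undefined) throw new Error('unbalanced tracking frame');
  let revision = 0;
  for (const tag of consumed) revision = Math.max(revision, tag.revision);
  return { revision }; // stands in for combining the consumed tags
}

function recompute<T>(ref: ComputeRef<T>): T {
  beginTrackFrame();
  let succeeded = false;
  try {
    const value = ref.compute();
    succeeded = true;
    ref.lastValue = value;
    return value;
  } finally {
    // Always end the frame so the stack stays balanced...
    const tag = endTrackFrame();
    // ...but commit tag and revision only if the compute did not throw.
    if (succeeded) {
      ref.tag = tag;
      ref.lastRevision = tag.revision;
    }
  }
}
```

With this structure a throwing getter leaves `tag` and `lastRevision` exactly as they were before the read, which is the consistency property the fix restores.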

Verification

Full browser suite (the CI "Basic Test" set) run locally: 9340 tests, 9323 pass, 17 skip, 0 fail. Type-check, eslint (--no-cache), and prettier all clean.

Update: deeper into the tracking & tag layers (validator)

Two more flattenings, this time in @glimmer/validator — the machinery hit on every reference read and every revalidation tick, so these compound across the whole VM, not just {{#each}}:

4. Pool trackers + lazily allocate the consumed-tag Set. beginTrackFrame allocated a new Tracker() and the tracker a new Set<Tag>() — two objects per frame — on every reference recompute and every cache group. The vast majority of frames consume 0 or 1 tags. The tracker now keeps the first tag in a field and allocates the Set only on a second distinct tag, and trackers are pooled on a LIFO freelist (frames are strictly nested). Common frame allocation: two objects → ~0 b/iter (measured 0.10 b for the 0-tag case).

5. Fast-path tag [COMPUTE] for subtag-less tags. validateTag/valueForTag call MonomorphicTagImpl[COMPUTE] on every reference read. For a tag with no subtag (property tags, cell tags, plain dirtyable/updatable tags — the overwhelming majority) the value is always just revision; the lastChecked/isUpdating/cycle-guard/try-finally machinery exists only to memoize subtag recursion. Now returns this.revision directly. Microbench: ~4.71µs → ~3.90µs per 1000 (~17%), no try/finally or field writes on the read.
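The tracker change in item 4 can be sketched like this. It is an illustration under simplifying assumptions: `endTrackFrame` returns an array of tags for readability, whereas the real implementation combines them into a single tag, and the `allocated` counter exists only to make pooling observable.

```typescript
// Simplified stand-in for a tag; identity is all this sketch needs.
interface Tag { id: number }

class Tracker {
  private first: Tag | null = null;
  private rest: Set<Tag> | null = null;

  add(tag: Tag): void {
    if (this.first === null) { this.first = tag; return; }
    if (tag === this.first) return; // still deduped without a Set
    if (this.rest === null) this.rest = new Set<Tag>(); // second distinct tag
    this.rest.add(tag);
  }

  tags(): Tag[] {
    if (this.first === null) return [];
    return this.rest === null ? [this.first] : [this.first, ...this.rest];
  }

  reset(): void {
    this.first = null;
    this.rest = null;
  }
}

const pool: Tracker[] = []; // LIFO freelist
const stack: Tracker[] = []; // currently open frames (strictly nested)
let allocated = 0; // hypothetical counter, for illustration only

function beginTrackFrame(): void {
  let tracker = pool.pop();
  if (tracker === undefined) {
    tracker = new Tracker();
    allocated++;
  }
  stack.push(tracker);
}

function consumeTag(tag: Tag): void {
  const current = stack[stack.length - 1];
  if (current !== undefined) current.add(tag);
}

function endTrackFrame(): Tag[] {
  const tracker = stack.pop();
  if (tracker === undefined) throw new Error('unbalanced tracking frame');
  const tags = tracker.tags();
  tracker.reset(); // the tracker is dead now, so it can go back to the pool
  pool.push(tracker);
  return tags;
}
```

Because frames nest strictly, the pool never needs more trackers than the deepest nesting level reached so far.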

Aggregate end-to-end (all 5 commits, control = this branch's base, tracerbench)

| Phase | Result |
| --- | --- |
| `selectFirstRow1` | −38.9% [−44.3 … −33.2] |
| `selectSecondRow1` | −14.1% [−19.3 … −9.0] |
| `swapRows2` | −8.0% [−14.1 … −1.8] |
| `swapRows1` | −5.9% [−12.5 … −2.0] |

The revalidation-heavy phases (selection, swap — which walk every row's tags on each update) show large, significant, reproducible improvements with no significant regressions. Create/clear phases remain DOM-dominated (within noise). Full browser suite green at every step: 9340 tests, 0 fail.

All five flattenings

Cell refs for {{#each}} block params (no closures, no tracking frame)
Inlined track() frame + flattened {{#each}} key resolution
Property refs for childRefFor (property access, no closures)
Pooled trackers + lazy consumed-tag Set
Fast-path tag [COMPUTE] for subtag-less tags
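The last item, the `[COMPUTE]` fast path, can be sketched as follows. `SketchTag` and the module-level `clock` are hypothetical stand-ins for `MonomorphicTagImpl` and the global revision counter, and the combinator branch is reduced to a single subtag.

```typescript
// Hypothetical stand-in for the global revision counter.
let clock = 1;

class SketchTag {
  revision = 1;
  subtag: SketchTag | null = null;
  private lastChecked = 0;
  private lastValue = 0;
  private isUpdating = false;

  dirty(): void {
    this.revision = ++clock;
  }

  compute(): number {
    const { subtag } = this;
    // Fast path: without a subtag the answer is always the own revision,
    // so the memoization and cycle guard below are skipped entirely.
    if (subtag === null) return this.revision;

    if (this.lastChecked !== clock) {
      if (this.isUpdating) throw new Error('cycle detected while computing tag');
      this.isUpdating = true;
      try {
        this.lastValue = Math.max(this.revision, subtag.compute());
        this.lastChecked = clock;
      } finally {
        this.isUpdating = false;
      }
    }
    return this.lastValue;
  }
}
```

The slower branch exists to memoize recursion through subtags, which is why a tag with no subtag can bypass it without changing the result.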

Each `{{#each}}` item binds two block params (the item value and its index), and both were created as full compute references via `createIteratorItemRef`. That meant, per item:

- a `ReferenceImpl` + a dirtyable tag, plus *two* closures (the `compute` getter and the `update` setter), and
- on every read, `valueForRef` took the generic compute path and opened a `track()` frame (a `Tracker` + `Set` allocation) purely to re-discover a tag that never changes.

For a 10k-row table that is 20k references and 20k tracking frames per render pass (create/clear/append/update/swap all hit this), all to model a value that is just "a stored value behind one tag".

This introduces a dedicated `Cell` reference type. A cell stores its value directly on the reference behind a fixed tag, so:

- `valueForRef` reads the stored value and re-snapshots the tag without opening a tracking frame (there are no dependencies to discover), and
- `updateRef` mutates the value inline with the same equality gate as before; no `compute`/`update` closures are allocated at all.

Behavior is identical: same tag consumed on read, same equality-gated dirty on update. `isUpdatableRef` reports cells as updatable, and `createDebugAliasRef` no longer inherits the `Cell` type (a debug alias is a genuine compute reference).

Microbench (real `valueForRef`/`updateRef`, 1000 items, prod build):

- initial render (create+read): 198µs/698kb -> 86µs/261kb (2.3x, -63% mem)
- re-render (update+read): 185µs/417kb -> 79µs/137kb (2.3x, -67% mem)
- allocation only: 31µs/320kb -> 22µs/~4kb (1.4x, ~0 garbage)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NullVoxPopuli · 2026-05-29T15:32:49Z

our each def has a problem, but I'm not convinced this is the solution.

running the bench locally shows not much improvement:

duration phase estimated improvement -21ms [-44ms to -3ms] OR -1.22% [-2.55% to -0.17%]
renderEnd phase no difference [0ms to 0ms]
render1000Items1End phase no difference [-2ms to 1ms]
clearItems1End phase no difference [-2ms to 1ms]
render1000Items2End phase no difference [-3ms to 3ms]
clearItems2End phase no difference [-1ms to 0ms]
render10000Items1End phase no difference [-10ms to 2ms]
clearManyItems1End phase estimated regression +2ms [1ms to 3ms] OR +1.29% [0.65% to 1.95%]
render10000Items2End phase no difference [-20ms to 14ms]
clearManyItems2End phase estimated improvement -3ms [-5ms to -1ms] OR -6.27% [-12.51% to -2.44%]
render1000Items3End phase no difference [0ms to 2ms]
append1000Items1End phase no difference [-2ms to 3ms]
append1000Items2End phase no difference [-2ms to 1ms]
updateEvery10thItem1End phase no difference [-2ms to 2ms]
updateEvery10thItem2End phase no difference [-1ms to 2ms]
selectFirstRow1End phase no difference [-1ms to 1ms]
selectSecondRow1End phase no difference [-1ms to 1ms]
removeFirstRow1End phase no difference [-1ms to 1ms]
removeSecondRow1End phase no difference [-1ms to 1ms]
swapRows1End phase no difference [-1ms to 0ms]
swapRows2End phase no difference [-2ms to 0ms]
clearItems4End phase no difference [-1ms to 0ms]
paint phase no difference [-2ms to 0ms]

I have a hunch we'll need to ship fragment support first so that each can be sort of "off-canvas"'d

Two more extraneous layers in the reference/iteration hot paths, removed:

1. `valueForRef` recompute went through `track(thunk)`, which allocates a closure on *every* (re)compute. This is the single hottest function in the VM: every reference read that needs evaluation passes through it (all refs on initial render, and again on each invalidation). Inlining `beginTrackFrame()`/`endTrackFrame()` drops that per-read allocation. Microbench (1000 recompute frames): 63.2µs -> 57.0µs (~10%) and 282kb -> 188kb (~33% less garbage).

2. `{{#each}}` key derivation:
   - `makeKeyFor` was re-resolved on every diff and wrapped *every* strategy (including `@index`/`@key`, whose keys are unique by construction) in the duplicate-key dedup machinery. The strategy is now resolved once when the iterator ref is created, and index keys skip dedup entirely.
   - The per-pass `seen` set used `WeakMapWithPrimitives` (lazy-getter + object/primitive dispatch on every get/set). Since it lives only for one synchronous pass, a plain `Map` is both simpler and faster; the weak-keyed map is kept only for the long-lived global `IDENTITIES`.

   Microbench (1000-item iteration): `@index` 23.0µs vs `@identity` 48.9µs; index keys no longer pay the dedup cost they used to.

Behavior is unchanged: same keys produced, same duplicate-key semantics, same tag consumption. Verified headless in Chrome: each (571), iterable (24), tracked (242), Updating (175), Helpers (1173), Components (328), fn (36) all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Every `{{a.b}}` path access compiled to a compute reference holding two closures, a getter (`getProp(valueForRef(parent), path)`) and a setter (`setProp(...)`), that captured nothing but `(parent, path)`. That is two closure allocations per property reference, on a path hit by essentially every template (`{{this.foo}}`, `{{row.id}}`, `{{row.label.current}}`, …).

Add a `Property` reference type that stores `parent` + `path` as plain fields and is read/written inline by `valueForRef`/`updateRef` (the same approach as the `Cell` type used for `{{#each}}` block params). No closures are allocated; reads still open a tracking frame, since `getProp` consumes dynamic tags. `isUpdatableRef` reports Property refs as updatable, and `createDebugAliasRef` no longer inherits the Property type.

Microbench (1000 childRefFor calls): 72.2µs/633kb -> 62.4µs/477kb (~14% faster, ~25% less allocation).

Also fixes a throw-semantics bug introduced when `track()` was inlined into `valueForRef`: committing `ref.tag` inside the `finally` updated the tag even when the compute threw, leaving `tag` and `lastRevision` inconsistent. The new tag/revision are now committed only on success (the frame is still ended in `finally` to keep the tracking stack balanced), matching the original `track()` behavior. This restores correct handling of throwing getters, which was caught by the `debug render tree: emberish curly components` test.

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`beginTrackFrame` allocated a `new Tracker()` and the Tracker allocated a `new Set<Tag>()` (two objects per frame) on *every* reference recompute and every cache group, every revalidation. The overwhelming majority of frames consume zero or one tag.

- The Tracker now holds the first consumed tag in a field and allocates the `Set` only when a second, distinct tag arrives. 0/1-tag frames never touch a Set (and still dedupe / combine correctly).
- Trackers are pooled on a LIFO freelist. Frames are strictly nested and a tracker is dead the instant `combine()` runs in `endTrackFrame`, so it can be reset and reused by the next `beginTrackFrame`.

Net: the common tracking frame now allocates ~nothing. Microbench: a frame that opens, consumes one tag, and closes drops from two object allocations to ~0 b/iter (measured 0.10 b for the 0-tag case).

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`MonomorphicTagImpl[COMPUTE]` is called by `validateTag`/`valueForTag` on every reference read. For a tag with no subtag (property tags, cell tags, plain dirtyable/updatable tags, i.e. the overwhelming majority) the result is always just `revision` (kept current by `dirtyTag`). The `lastChecked`/`isUpdating`/cycle-guard/`try-finally` machinery exists only to memoize subtag recursion, so it is pure overhead for these tags.

Return `this.revision` directly when `subtag === null`. The combinator path is unchanged (it now reuses the already-read `subtag`).

Microbench (1000 subtag-less [COMPUTE]s during a revalidation pass): ~4.71µs -> ~3.90µs (~17%), and no try/finally or field writes on the read.

Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

NullVoxPopuli marked this pull request as draft May 29, 2026 15:31

NullVoxPopuli-ai-agent and others added 4 commits May 29, 2026 14:04

NullVoxPopuli-ai-agent changed the title ~~perf(reference): make {{#each}} item params cheap "cell" references~~ [CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs) May 29, 2026

NullVoxPopuli-ai-agent wants to merge 5 commits into emberjs:main from NullVoxPopuli-ai-agent:perf/each-item-cell-ref
