[CLEANUP] Flatten Glimmer reference hot paths (each item cells, inlined track frame, property refs) #21435
Draft
NullVoxPopuli-ai-agent wants to merge 5 commits into
Each `{{#each}}` item binds two block params — the item value and its
index — and both were created as full compute references via
`createIteratorItemRef`. That meant, per item:
- a `ReferenceImpl` + a dirtyable tag, plus *two* closures (the
`compute` getter and the `update` setter), and
- on every read, `valueForRef` took the generic compute path and opened
a `track()` frame (a `Tracker` + `Set` allocation) purely to
re-discover a tag that never changes.
For a 10k-row table that is 20k references and 20k tracking frames per
render pass (create/clear/append/update/swap all hit this), all to model
a value that is just "a stored value behind one tag".
This introduces a dedicated `Cell` reference type. A cell stores its
value directly on the reference behind a fixed tag, so:
- `valueForRef` reads the stored value and re-snapshots the tag without
opening a tracking frame (there are no dependencies to discover), and
- `updateRef` mutates the value inline with the same equality gate as
before — no `compute`/`update` closures are allocated at all.
Behavior is identical: same tag consumed on read, same equality-gated
dirty on update. `isUpdatableRef` reports cells as updatable, and
`createDebugAliasRef` no longer inherits the `Cell` type (a debug alias
is a genuine compute reference).
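As a rough illustration of the idea (a minimal sketch under simplifying assumptions, not the actual `@glimmer/reference` code), a cell can be modelled as a plain object that holds a value next to a fixed tag. The names `SketchTag`, `CellRef`, `createCell`, `readCell`, `updateCell`, `consumedTags`, and `revisionClock` are hypothetical stand-ins introduced only for this example.

```typescript
// Hypothetical, simplified model of a cell reference. All names are
// illustrative assumptions, not Glimmer's real API.
let revisionClock = 1; // stand-in for a global revision counter

interface SketchTag {
  revision: number;
}

interface CellRef<T> {
  tag: SketchTag; // fixed for the lifetime of the cell
  lastValue: T; // stored directly on the reference
  lastRevision: number; // tag snapshot taken on the last read
}

// Tags consumed by reads; stands in for the autotracking consumer.
const consumedTags: SketchTag[] = [];

function createCell<T>(initial: T): CellRef<T> {
  const tag: SketchTag = { revision: revisionClock };
  return { tag, lastValue: initial, lastRevision: tag.revision };
}

// Read: consume the fixed tag and re-snapshot it. No closure is called and
// no tracking frame is opened, because there is nothing to discover.
function readCell<T>(cell: CellRef<T>): T {
  consumedTags.push(cell.tag);
  cell.lastRevision = cell.tag.revision;
  return cell.lastValue;
}

// Update: equality-gated. The tag is dirtied only when the value changes.
function updateCell<T>(cell: CellRef<T>, next: T): boolean {
  if (cell.lastValue === next) return false;
  cell.lastValue = next;
  cell.tag.revision = ++revisionClock;
  return true;
}
```

In this model the only per-item allocations are the reference object and its tag, which is the property the commit is after.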
Microbench (real `valueForRef`/`updateRef`, 1000 items, prod build):
initial render (create+read) 198µs/698kb -> 86µs/261kb (2.3x, -63% mem)
re-render (update+read) 185µs/417kb -> 79µs/137kb (2.3x, -67% mem)
allocation only 31µs/320kb -> 22µs/~4kb (1.4x, ~0 garbage)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
our each def has a problem, but I'm not convinced this is the solution. running the bench locally shows not much improvement. I have a hunch we'll need to ship fragment support first so that each can be sort of "off-canvas"'d
Two more extraneous layers in the reference/iteration hot paths, removed:
1. `valueForRef` recompute went through `track(thunk)`, which allocates a
closure on *every* (re)compute. This is the single hottest function in
the VM — every reference read that needs evaluation passes through it
(all refs on initial render, and again on each invalidation). Inlining
`beginTrackFrame()`/`endTrackFrame()` drops that per-read allocation.
Microbench (1000 recompute frames): 63.2µs -> 57.0µs (~10%) and
282kb -> 188kb (~33% less garbage).
2. `{{#each}}` key derivation:
- `makeKeyFor` was re-resolved on every diff and wrapped *every*
strategy — including `@index`/`@key`, whose keys are unique by
construction — in the duplicate-key dedup machinery. The strategy is
now resolved once when the iterator ref is created, and index keys
skip dedup entirely.
- The per-pass `seen` set used `WeakMapWithPrimitives` (lazy-getter +
object/primitive dispatch on every get/set). Since it lives only for
one synchronous pass, a plain `Map` is both simpler and faster; the
weak-keyed map is kept only for the long-lived global `IDENTITIES`.
Microbench (1000-item iteration): `@index` 23.0µs vs `@identity`
48.9µs — index keys no longer pay the dedup cost they used to.
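The key-derivation change can be pictured roughly as follows. This is a sketch under stated assumptions rather than the real iterator code: `KeyFor`, `KeyStrategy`, `indexKey`, `withDedup`, and `resolveKeyStrategy` are hypothetical names, and the string-suffix disambiguation of repeated keys is a simplification chosen for the example (it could collide with a real key, which a production implementation would have to rule out).

```typescript
// Hypothetical sketch of resolving a key strategy once and skipping
// duplicate-key handling for index keys. Not Glimmer's actual code.
type KeyFor = (item: unknown, index: number) => unknown;

interface KeyStrategy {
  keyFor: KeyFor;
  reset(): void; // called at the start of each synchronous pass
}

// Index keys are unique by construction, so no dedup wrapper is needed.
const indexKey: KeyFor = (_item, index) => String(index);

// Dedup wrapper backed by a plain Map that only lives for one pass.
function withDedup(inner: KeyFor): KeyStrategy {
  const seen = new Map<unknown, number>();
  return {
    keyFor(item, index) {
      const key = inner(item, index);
      const count = seen.get(key) ?? 0;
      seen.set(key, count + 1);
      // First occurrence keeps its key; repeats get a disambiguated one.
      return count === 0 ? key : `${String(key)}:${count}`;
    },
    reset() {
      seen.clear();
    },
  };
}

// Resolved once, when the iterator is created, instead of on every diff.
function resolveKeyStrategy(keyPath: string): KeyStrategy {
  if (keyPath === "@index") {
    return { keyFor: indexKey, reset() {} };
  }
  const byPath: KeyFor = (item) => (item as Record<string, unknown>)[keyPath];
  return withDedup(byPath);
}
```

A caller would hold on to the returned strategy for the lifetime of the iterator and call `reset()` before each pass.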
Behavior is unchanged: same keys produced, same duplicate-key semantics,
same tag consumption. Verified headless in Chrome — each (571), iterable
(24), tracked (242), Updating (175), Helpers (1173), Components (328), fn
(36) all pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every `{{a.b}}` path access compiled to a compute reference holding two
closures — a getter (`getProp(valueForRef(parent), path)`) and a setter
(`setProp(...)`) — that captured nothing but `(parent, path)`. That is two
closure allocations per property reference, on a path hit by essentially
every template (`{{this.foo}}`, `{{row.id}}`, `{{row.label.current}}`, …).
Add a `Property` reference type that stores `parent` + `path` as plain
fields and is read/written inline by `valueForRef`/`updateRef` (the same
approach as the `Cell` type used for `{{#each}}` block params). No closures
are allocated; reads still open a tracking frame, since `getProp` consumes
dynamic tags. `isUpdatableRef` reports Property refs as updatable, and
`createDebugAliasRef` no longer inherits the Property type.
Microbench (1000 childRefFor calls): 72.2µs/633kb -> 62.4µs/477kb
(~14% faster, ~25% less allocation).
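A minimal sketch of the "fields instead of closures" shape, assuming a tagged-union reference model. `ConstRef`, `PropertyRef`, `AnyRef`, `readRef`, and `writeRef` are hypothetical names for illustration; tracking frames and tag consumption are deliberately left out, so this only shows the inline read/write dispatch.

```typescript
// Hypothetical model: a property reference stores its parent and path as
// plain fields and is read/written by one shared function, so no getter or
// setter closure is allocated per reference. Not Glimmer's actual types.
interface ConstRef {
  kind: "const";
  value: unknown;
}

interface PropertyRef {
  kind: "property";
  parentRef: AnyRef;
  path: string;
}

type AnyRef = ConstRef | PropertyRef;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function createPropertyRef(parentRef: AnyRef, path: string): PropertyRef {
  return { kind: "property", parentRef, path };
}

// Inline dispatch on the reference kind, in the spirit of `valueForRef`.
function readRef(ref: AnyRef): unknown {
  if (ref.kind === "const") return ref.value;
  const target = readRef(ref.parentRef);
  return isRecord(target) ? target[ref.path] : undefined;
}

// Inline write, in the spirit of `updateRef`; only property refs are updatable.
function writeRef(ref: PropertyRef, value: unknown): void {
  const target = readRef(ref.parentRef);
  if (isRecord(target)) target[ref.path] = value;
}
```

Chained paths such as `row.label.current` fall out of nesting one `PropertyRef` inside another.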
Also fixes a throw-semantics bug introduced when `track()` was inlined into
`valueForRef`: committing `ref.tag` inside the `finally` updated the tag even
when the compute threw, leaving `tag` and `lastRevision` inconsistent. The
new tag/revision are now committed only on success (the frame is still ended
in `finally` to keep the tracking stack balanced), matching the original
`track()` behavior. This restores correct handling of throwing getters —
caught by the `debug render tree: emberish curly components` test.
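The throw-semantics fix can be sketched like this, assuming a simplified frame stack. `FrameTag`, `frameStack`, `beginFrame`, `endFrame`, `consumeInFrame`, `ComputeRef`, and `recompute` are hypothetical names, and "combining" tags is modelled as taking the maximum revision. The point is only the ordering: the frame is always ended in `finally`, while the new tag and revision are committed only when the compute returned normally.

```typescript
// Hypothetical sketch of an inlined tracking frame with commit-on-success.
// Names and the max-revision "combine" are assumptions for illustration.
interface FrameTag {
  revision: number;
}

const frameStack: FrameTag[][] = [];

function beginFrame(): void {
  frameStack.push([]);
}

function consumeInFrame(tag: FrameTag): void {
  const current = frameStack[frameStack.length - 1];
  if (current !== undefined) current.push(tag);
}

function endFrame(): FrameTag {
  const consumed = frameStack.pop();
  if (consumed === undefined) throw new Error("unbalanced tracking frame");
  let revision = 0;
  for (const tag of consumed) revision = Math.max(revision, tag.revision);
  return { revision };
}

interface ComputeRef<T> {
  compute: () => T;
  tag: FrameTag | null;
  lastRevision: number;
}

function recompute<T>(ref: ComputeRef<T>): T {
  beginFrame();
  let succeeded = false;
  let value: T | undefined;
  try {
    value = ref.compute();
    succeeded = true;
  } finally {
    // Always end the frame so the stack stays balanced, even on throw.
    const newTag = endFrame();
    // Commit the tag and revision only if compute() did not throw.
    if (succeeded) {
      ref.tag = newTag;
      ref.lastRevision = newTag.revision;
    }
  }
  return value as T;
}
```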
Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`beginTrackFrame` allocated a `new Tracker()` and the Tracker allocated a
`new Set<Tag>()` (two objects per frame) on *every* reference recompute and
every cache group, every revalidation. The overwhelming majority of frames
consume zero or one tag.
- The Tracker now holds the first consumed tag in a field and allocates the
`Set` only when a second, distinct tag arrives. 0/1-tag frames never touch
a Set (and still dedupe / combine correctly).
- Trackers are pooled on a LIFO freelist. Frames are strictly nested and a
tracker is dead the instant `combine()` runs in `endTrackFrame`, so it can
be reset and reused by the next `beginTrackFrame`.
Net: the common tracking frame now allocates ~nothing.
Microbench: a frame that opens, consumes one tag, and closes drops from two
object allocations to ~0 b/iter (measured 0.10 b for the 0-tag case).
Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
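The two tracker changes described in this commit could look roughly like the sketch below. It is an illustration under assumptions, not the `@glimmer/validator` source: `PoolTag`, `PooledTracker`, `freeList`, `activeFrames`, `beginPooledFrame`, `consumePooled`, and `endPooledFrame` are hypothetical names, and propagating the combined tag to the parent frame is omitted to keep the example short.

```typescript
// Hypothetical sketch: first tag in a field, lazily allocated Set, and a
// LIFO freelist of trackers. Names are illustrative assumptions.
interface PoolTag {
  id: number;
}

class PooledTracker {
  first: PoolTag | null = null; // covers the common 0/1-tag frame
  rest: Set<PoolTag> | null = null; // allocated only for a second, distinct tag

  add(tag: PoolTag): void {
    if (this.first === null) {
      this.first = tag;
      return;
    }
    if (this.first === tag) return; // dedupe without touching a Set
    if (this.rest === null) this.rest = new Set<PoolTag>();
    this.rest.add(tag);
  }

  // Returns the distinct consumed tags and resets the tracker for reuse.
  drain(): PoolTag[] {
    const tags: PoolTag[] = [];
    if (this.first !== null) tags.push(this.first);
    if (this.rest !== null) this.rest.forEach((tag) => tags.push(tag));
    this.first = null;
    this.rest = null;
    return tags;
  }
}

const freeList: PooledTracker[] = [];
const activeFrames: PooledTracker[] = [];

function beginPooledFrame(): void {
  // Frames are strictly nested, so a LIFO freelist is sufficient.
  activeFrames.push(freeList.pop() ?? new PooledTracker());
}

function consumePooled(tag: PoolTag): void {
  const current = activeFrames[activeFrames.length - 1];
  if (current !== undefined) current.add(tag);
}

function endPooledFrame(): PoolTag[] {
  const tracker = activeFrames.pop();
  if (tracker === undefined) throw new Error("unbalanced tracking frame");
  const tags = tracker.drain();
  freeList.push(tracker); // dead once drained, so the next frame can reuse it
  return tags;
}
```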
`MonomorphicTagImpl[COMPUTE]` is called by `validateTag`/`valueForTag` on
every reference read. For a tag with no subtag (property tags, cell tags,
plain dirtyable/updatable tags, i.e. the overwhelming majority) the result
is always just `revision` (kept current by `dirtyTag`). The
`lastChecked`/`isUpdating`/cycle-guard/`try-finally` machinery exists only
to memoize subtag recursion, so it is pure overhead for these tags.
Return `this.revision` directly when `subtag === null`. The combinator path
is unchanged (it now reuses the already-read `subtag`).
Microbench (1000 subtag-less [COMPUTE]s during a revalidation pass):
~4.71µs -> ~3.90µs (~17%), and no try/finally or field writes on the read.
Full browser suite green: 9340 tests, 9323 pass, 17 skip, 0 fail.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
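A minimal sketch of the fast path, assuming a much-reduced tag model. `SketchMonoTag` and its `compute(clock)` method are hypothetical; the explicit `clock` argument stands in for a global revision counter, and the cycle guard and `try`/`finally` of the slow path are omitted, so this shows only where the early return sits.

```typescript
// Hypothetical sketch of a fast path for tags without a subtag.
// The memoizing slow path is simplified and is not Glimmer's actual code.
class SketchMonoTag {
  revision: number;
  subtag: SketchMonoTag | null = null;
  private lastChecked = -1;
  private lastValue = -1;

  constructor(revision: number) {
    this.revision = revision;
  }

  compute(clock: number): number {
    const subtag = this.subtag;
    // Fast path: with no subtag there is nothing to recurse into or memoize.
    if (subtag === null) return this.revision;

    // Slow path: memoize the combined value once per clock tick, reusing
    // the `subtag` that was already read above.
    if (this.lastChecked !== clock) {
      this.lastChecked = clock;
      this.lastValue = Math.max(this.revision, subtag.compute(clock));
    }
    return this.lastValue;
  }
}
```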
Summary
While profiling the `{{#each}}` hot path that drives the table benchmark in `smoke-tests/benchmark-app` (create / clear / append / update / swap of 1k–10k rows), the dominant JS-side cost per row turned out to be the references created for the loop's block params.
Every `{{#each}}` item binds two block params, the item value and its index, and both were built by `createIteratorItemRef` as full compute references. Per item that meant:
- a `ReferenceImpl` + a dirtyable tag, plus two closures (the `compute` getter and the `update` setter), and
- on every read, `valueForRef` took the generic compute path and opened a `track()` frame (a `Tracker` + a `Set` allocation) purely to re-discover a tag that never changes.
For a 10,000-row render that's 20,000 references and 20,000 tracking frames per pass, and create/clear/append/update/swap all hit it, to model something that is really just "a stored value behind a single tag". That's the over-abstraction: an iterator item is a cell, not a computation.
What changed
A dedicated `Cell` reference type (`packages/@glimmer/interfaces/lib/references.d.ts`). `createIteratorItemRef` now returns a cell, which stores its value directly on the reference behind a fixed tag. As a result:
- `valueForRef` reads the stored value and re-snapshots the tag without opening a tracking frame, since a cell has no dependencies to discover.
- `updateRef` mutates the value inline with the same equality gate as before, so a cell needs no `compute`/`update` closures at all, just the reference object and its tag.
The change is behavior-preserving: the same tag is consumed on read and the same equality-gated dirty happens on update. `isUpdatableRef` reports cells as updatable, and `createDebugAliasRef` no longer inherits the `Cell` type (a debug alias is a genuine compute reference that recomputes through its inner ref).
Results
Microbenchmark exercising the real production `valueForRef`/`updateRef`/`track`, comparing the new cell ref against a faithful reconstruction of the previous compute-ref implementation (1000 items per iteration, `DEBUG=false`):

| Scenario | Before | After | Change |
| --- | --- | --- | --- |
| initial render (create+read) | 198µs / 698kb | 86µs / 261kb | 2.3x, -63% mem |
| re-render (update+read) | 185µs / 417kb | 79µs / 137kb | 2.3x, -67% mem |
| allocation only | 31µs / 320kb | 22µs / ~4kb | 1.4x, ~0 garbage |

Since every row allocates two of these refs (value + index), this removes a large, constant per-row tax from every list operation the benchmark measures.
Testing
Built with `vite build` and run headless in Chrome via `testem.cjs`. All green (pre-existing skips unchanged):
- `--filter each` → 574 pass
- `--filter reference` → 45 pass
- `--filter iterable` → 24 pass
- `--filter tracked` → 242 pass · `--filter Updating` → 175 pass
- `--filter fn` → 36 pass · `--filter "Helpers test"` → 1173 pass · `--filter "Components test"` → 328 pass
`tsc --noEmit`, `eslint --no-cache`, and `prettier --check` all clean on the changed files.
🤖 Generated with Claude Code
End-to-end benchmark (the repo's tracerbench `compare`)
Ran `pnpm bench` (`bin/benchmark.mjs`), with control = `origin/main` and experiment = this branch, on the krausest table app in `smoke-tests/benchmark-app`, across three configurations on a non-dedicated laptop (so treat absolute numbers with the usual caution). Reported results:
- `duration` −1.96% [−3.56% … −0.72%] (significant)
- `duration` within noise [−160ms … +44ms]; `clearManyItems2` −4.9% [−8.99% … −1.23%] (significant)
- `clearManyItems2` −7.46% [−16.37% … −4.54%] (significant)
Honest reading: most per-phase deltas land within this benchmark's noise floor on shared hardware. Each phase is DOM/raster-dominated, so the JS saving is a small fraction of wall-clock and run-to-run variance is large (e.g. `render10000Items2` CIs span ±400ms). The one consistent, reproducible, significant signal across runs is `clearManyItems2` (tearing down 10,000 rows, the single most reference-allocation-heavy phase) at −5% to −7.5%. That's exactly where eliminating two compute refs (+ two closures + two tracking frames) per row should show up. A couple of phases (`append1000Items2`, `selectSecondRow1`) showed apparent regressions under throttling, but those flipped sign between runs and touch paths this change doesn't meaningfully alter (selection invalidates class bindings, not iterator-item refs), consistent with measurement noise.
The isolated microbench above (2.3× on the per-item ref path) is the clean, reproducible evidence for the JS-level win; the tracerbench numbers confirm it surfaces end-to-end where it should and show no robust regression.
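One simple way to read the bracketed intervals quoted in this section is to check whether an interval excludes zero. The sketch below applies that reading to the four intervals reported above. It is an assumption about how to interpret the numbers, not a description of tracerbench's own statistical test, and `PhaseDelta`, `excludesZero`, and `reportedDeltas` are names introduced for the example.

```typescript
// Hypothetical helper: treats a delta as a robust signal when its reported
// interval does not contain zero. tracerbench's own criterion may differ.
interface PhaseDelta {
  phase: string;
  low: number; // lower bound of the reported interval
  high: number; // upper bound of the reported interval
  unit: "%" | "ms";
}

function excludesZero(delta: PhaseDelta): boolean {
  return delta.low > 0 || delta.high < 0;
}

// The four intervals quoted in this section.
const reportedDeltas: PhaseDelta[] = [
  { phase: "duration", low: -3.56, high: -0.72, unit: "%" },
  { phase: "duration", low: -160, high: 44, unit: "ms" },
  { phase: "clearManyItems2", low: -8.99, high: -1.23, unit: "%" },
  { phase: "clearManyItems2", low: -16.37, high: -4.54, unit: "%" },
];

for (const delta of reportedDeltas) {
  const verdict = excludesZero(delta) ? "excludes zero" : "spans zero";
  console.log(`${delta.phase} [${delta.low}${delta.unit} .. ${delta.high}${delta.unit}]: ${verdict}`);
}
```

Under this reading only the `duration` interval measured in milliseconds spans zero, which matches its "within noise" label.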
Update: two more flattened layers (2nd commit)
Beyond the cell reference, two more extraneous layers in the reference/iteration hot paths were removed:
1. Inlined the `track()` frame in `valueForRef`. Recompute went through `track(thunk)`, allocating a closure on every (re)compute. `valueForRef` is the single hottest function in the VM, so opening `beginTrackFrame()`/`endTrackFrame()` inline drops a per-read allocation. Microbench (1000 recompute frames): 63.2µs → 57.0µs (~10%), 282kb → 188kb garbage (~33%).
2. Flattened `{{#each}}` key derivation. The key strategy was re-resolved on every diff and wrapped every strategy (including `@index`/`@key`, whose keys are unique by construction) in the duplicate-key dedup machinery. It's now resolved once when the iterator ref is created; index keys skip dedup entirely, and the per-pass `seen` set is a plain `Map` instead of the lazy-getter `WeakMapWithPrimitives` (kept only for the long-lived global `IDENTITIES`). Microbench (1000-item iteration): `@index` 23.0µs vs `@identity` 48.9µs, so index keys no longer pay the dedup cost.
End-to-end (tracerbench, all three changes, control = `origin/main`)
[Table: per-phase results for `duration`, `selectFirstRow1`, `clearManyItems2`, `render1000Items2`, `swapRows1`]
`selectFirstRow1` (re-reads every visible row's `isSelected` class binding) and `clearManyItems2` (10k-row teardown) are consistent, significant wins; the total `duration` is significantly improved in every run. `clearManyItems2` has now been significant across all runs (−4.9 / −7.5 / −10.7 / −11.3%).
One phase (`updateEvery10thItem`) showed a +4–6% delta at fidelity 30 but leaned negative at fidelity 10. Mechanically neither new change can affect it: that phase doesn't re-diff the list (the key-path code never runs) and 2,700/3,000 rows take `valueForRef`'s valid path, which is unchanged (only the recompute path was touched). It therefore reads as run-to-run noise on shared hardware.
All suites still green: each (571), iterable (24), tracked (242), Updating (175), Helpers (1173), Components (328), fn (36).
Update: flattened `childRefFor` too (property access, not just `{{#each}}`)
Property access (`{{a.b}}`), used by essentially every template, compiled to a compute reference holding two closures (a `getProp` getter and a `setProp` setter) that captured nothing but `(parent, path)`. Added a `Property` reference type that stores `parent` + `path` as plain fields, read/written inline by `valueForRef`/`updateRef` (same approach as `Cell`). No closures allocated; reads still open a tracking frame (`getProp` consumes dynamic tags). Microbench (1000 `childRefFor` calls): 72.2µs/633kb → 62.4µs/477kb, i.e. ~14% faster and ~25% less allocation.
This commit also fixes a throw-semantics bug I introduced when inlining `track()` into `valueForRef`: committing `ref.tag` inside the `finally` updated the tag even when the compute threw, leaving `tag`/`lastRevision` inconsistent. The new tag/revision are now committed only on success (the frame is still ended in `finally` to keep the stack balanced). This restores correct handling of throwing getters.
Verification
Full browser suite (the CI "Basic Test" set) run locally: 9340 tests, 9323 pass, 17 skip, 0 fail. Type-check, eslint (`--no-cache`), and prettier all clean.
Update: deeper into the tracking & tag layers (validator)
Two more flattenings, this time in `@glimmer/validator`. This is the machinery hit on every reference read and every revalidation tick, so these compound across the whole VM, not just `{{#each}}`:
4. Pool trackers + lazily allocate the consumed-tag `Set`. `beginTrackFrame` allocated a `new Tracker()` and the tracker a `new Set<Tag>()` (two objects per frame) on every reference recompute and every cache group. The vast majority of frames consume 0 or 1 tags. The tracker now keeps the first tag in a field and allocates the `Set` only on a second distinct tag, and trackers are pooled on a LIFO freelist (frames are strictly nested). Common frame allocation: two objects → ~0 b/iter (measured 0.10 b for the 0-tag case).
5. Fast-path tag `[COMPUTE]` for subtag-less tags. `validateTag`/`valueForTag` call `MonomorphicTagImpl[COMPUTE]` on every reference read. For a tag with no subtag (property tags, cell tags, plain dirtyable/updatable tags, i.e. the overwhelming majority) the value is always just `revision`; the `lastChecked`/`isUpdating`/cycle-guard/`try-finally` machinery exists only to memoize subtag recursion. Now returns `this.revision` directly. Microbench: ~4.71µs → ~3.90µs per 1000 (~17%), no try/finally or field writes on the read.
Aggregate end-to-end (all 5 commits, control = this branch's base, tracerbench)
[Table: per-phase results for `selectFirstRow1`, `selectSecondRow1`, `swapRows2`, `swapRows1`]
The revalidation-heavy phases (selection and swap, which walk every row's tags on each update) show large, significant, reproducible improvements with no significant regressions. Create/clear phases remain DOM-dominated (within noise). Full browser suite green at every step: 9340 tests, 0 fail.
All five flattenings
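1. `Cell` refs for `{{#each}}` block params (no closures, no tracking frame)
2. Inlined `track()` frame + flattened `{{#each}}` key resolution
3. `Property` refs for `childRefFor` (property access, no closures)
4. Pooled trackers + lazily allocated consumed-tag `Set`
5. Fast-path `[COMPUTE]` for subtag-less tags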