[DataOriented] Fastcache, perf, pruning #705
Open
hughperkins wants to merge 126 commits into
…pre-declaring struct ndarrays ``_predeclare_struct_ndarrays._walk_obj`` only recursed into ``dataclasses.is_dataclass`` children of a dataclass root; for non-dataclass roots (the ``@qd.data_oriented`` case) it didn't recurse at all. That meant an ndarray held by a nested ``@qd.data_oriented`` (or a ``dataclasses.dataclass`` reached through a ``@qd.data_oriented`` attribute, or vice versa) was never registered as a kernel arg, and ``state.inner.x[i] = ...`` raised ``QuadrantsCompilationError`` with "Ndarray ... used in kernel scope but not registered as a kernel parameter". Extend both branches to recurse on either a dataclass instance or an ``is_data_oriented(child)`` value. Pure superset of the prior walk — same shape, just more permissive on which children to descend into. Bug pinned by ``tests/python/test_data_oriented_ndarray.py::test_data_oriented_nested`` and the new nesting / cross-container tests in the same file.
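The widened walk described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeNdarray`, the `data_oriented` marker decorator, `is_container`, and `collect_ndarrays` are hypothetical stand-ins, not the Quadrants implementation, and the sketch has no cycle protection.

```python
import dataclasses


class FakeNdarray:
    """Hypothetical stand-in for a Quadrants ndarray."""


def data_oriented(cls):
    """Hypothetical marker decorator standing in for @qd.data_oriented."""
    cls._data_oriented = True
    return cls


def is_container(obj):
    # Descend into dataclass instances and marked data-oriented instances alike.
    is_dc = dataclasses.is_dataclass(obj) and not isinstance(obj, type)
    return is_dc or getattr(type(obj), "_data_oriented", False)


def collect_ndarrays(root, path=()):
    """Return (attribute path, ndarray) pairs reachable from root."""
    found = []
    for name, child in vars(root).items():
        if isinstance(child, FakeNdarray):
            found.append((path + (name,), child))
        elif is_container(child):
            found.extend(collect_ndarrays(child, path + (name,)))
    return found


@dataclasses.dataclass(frozen=True)
class Inner:
    x: FakeNdarray


@data_oriented
class State:
    def __init__(self):
        self.inner = Inner(FakeNdarray())  # dataclass reached through a data-oriented root
        self.y = FakeNdarray()


paths = [p for p, _ in collect_ndarrays(State())]
```

With the narrower rule (recurse only from a dataclass root), `State` would yield no paths at all, which corresponds to the unregistered-ndarray error quoted in the commit message.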
…rs, not just non-frozen dataclasses ``launch_kernel`` folds the live id(s) of struct-held ndarrays into ``args_hash`` only when the host container is "mutable", and used ``type(args[idx]).__hash__ is None`` as the predicate. Python sets ``__hash__ = None`` for non-frozen dataclasses (the common ``eq=True, frozen=False`` default), so that arm fires correctly for them. But ``@qd.data_oriented`` classes inherit ``object.__hash__``, which is never ``None``, so the guard missed them entirely. Consequence: reassigning ``state.x = other_ndarray`` on the same data_oriented instance left ``args_hash`` unchanged, hit the launch-context cache, and re-launched the kernel against the stale ndarray binding (the old ``x1``). Extend the predicate with an explicit ``is_data_oriented(args[idx])`` arm. The launch-context cache is a perf optimisation so widening its invalidation predicate is safe. Bug pinned by ``tests/python/test_data_oriented_ndarray.py::test_data_oriented_ndarray_reassign_same_shape`` and ``::test_data_oriented_nested_ndarray_reassign``.
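The `__hash__` behaviour behind this bug is standard Python and can be checked without Quadrants. In the sketch below, `PlainContainer` is a hypothetical stand-in for a `@qd.data_oriented` class, assumed (as the commit message states) to inherit `object.__hash__`.

```python
import dataclasses


@dataclasses.dataclass  # eq=True, frozen=False: Python sets __hash__ to None
class MutableDC:
    x: int = 0


@dataclasses.dataclass(frozen=True)  # eq=True, frozen=True: a __hash__ is generated
class FrozenDC:
    x: int = 0


class PlainContainer:
    """Stand-in for a @qd.data_oriented class; inherits object.__hash__."""


def hash_is_none(obj):
    # The original "is this container mutable?" predicate.
    return type(obj).__hash__ is None


results = {
    type(o).__name__: hash_is_none(o)
    for o in (MutableDC(), FrozenDC(), PlainContainer())
}
```

Only `MutableDC` satisfies the predicate, so a mutable plain class slips through unless a separate arm (here, `is_data_oriented`) covers it.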
…esting, deep nesting, mutation through chain, multi-kernel, sub-func
Adds tests 12-17 to the file added in 06b7c6a:
- data_oriented holding (frozen) dataclass that holds ndarray
- dataclass holding data_oriented that holds ndarray (kernel-arg via qd.template())
- 3-level data_oriented nesting
- mutation through 2-level chain (outer.inner.x reassign)
- two kernels sharing the same data_oriented instance
- ndarray access via @qd.func sub-call
The dataclass-of-data_oriented case uses qd.template() rather than typed dataclass kernel arg because the typed-dataclass-arg form goes through ``_transform_kernel_arg`` which does not currently recurse on data_oriented field types; this is tracked as a separate follow-up.
Also tightens the xfail reason on test_data_oriented_ndarray_reassign_different_dtype to call out that the remaining failure is the template-mapper spec-key gap, not the launch-cache gap (latter fixed by the kernel.py change in this PR).
Update compound_types.md to reflect what landed in #561 [Type] Tensor 24 (which added ``_predeclare_struct_ndarrays``) and what's fixed in this PR (the nested + mutation cases). The old "no" cell predated the Tensor 24 infrastructure by ~6 weeks and was already inconsistent with the in-tree error message in ``python/quadrants/lang/impl.py`` which lists "@qd.data_oriented / frozen-dataclass template" as the supported route for ndarrays inside structs. Add an ndarray-member example under the @qd.data_oriented section.
…rray members ``_extract_arg`` returned ``weakref.ref(arg)`` for any ``is_data_oriented(arg)``, which over-shared the compiled kernel when ``state.x`` was reassigned to an ndarray of a different dtype or ndim on the same instance — the second launch re-used the kernel specialised for the original shape and silently corrupted the new-shape buffer. Walk the reachable ``Ndarray`` members (recursively through nested data_oriented and dataclass children) and prepend their ``(path, element_type, ndim, needs_grad, layout)`` descriptors to the spec key. Same memory-leak avoidance — the descriptors are values, no strong reference to the ndarray itself, and the weakref to the container is preserved for the per-instance identity tail. Containers with *no* ndarrays (the genesis field-backend ``@qd.data_oriented`` workload) take the existing short path unchanged — ``_collect_struct_nd_descriptors`` returns an empty list and we return ``weakref.ref(arg)`` as before. So this is a no-op for the existing hot path, and the overhead is paid only by containers that actually hold ndarrays. Pinned by ``test_data_oriented_ndarray_reassign_different_dtype`` (was xfail, now passes), ``::reassign_different_ndim``, ``::nested_ndarray_reassign_different_dtype``, and ``::field_only_no_speckey_change`` (no-regression case).
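A minimal sketch of the spec-key idea, assuming a simplified descriptor of `(path, element_type, ndim)` only (the commit also lists `needs_grad` and `layout`) and a single, non-recursive level of members. `FakeNdarray`, `State`, and `spec_key` are hypothetical names used for illustration.

```python
import weakref


class FakeNdarray:
    """Hypothetical stand-in carrying only the metadata the key needs."""

    def __init__(self, element_type, ndim):
        self.element_type = element_type
        self.ndim = ndim


class State:
    def __init__(self, x):
        self.x = x


def spec_key(container):
    # Descriptors are plain values, so the key holds no strong reference to any ndarray.
    descriptors = tuple(
        ((name,), member.element_type, member.ndim)
        for name, member in vars(container).items()
        if isinstance(member, FakeNdarray)
    )
    identity = weakref.ref(container)
    if not descriptors:
        return identity  # short path: containers without ndarrays keep the old key
    return descriptors + (identity,)


state = State(FakeNdarray("f32", 1))
key_before = spec_key(state)
state.x = FakeNdarray("i32", 2)  # same instance, different dtype and ndim
key_after = spec_key(state)

field_only = State(3)
short_key = spec_key(field_only)
```

The key changes when the member's dtype or ndim changes, while a container with no ndarray members still maps to a bare weak reference.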
- Unmark test_data_oriented_ndarray_reassign_different_dtype as xfail (passes now).
- Add ::reassign_different_ndim to cover the 1D->2D shape change case.
- Add ::nested_ndarray_reassign_different_dtype to confirm the recursive walker reaches a leaf ndarray through a nested @qd.data_oriented chain.
- Add ::field_only_no_speckey_change to pin the no-regression case (data_oriented with only field members still uses the original weakref short-path).
…y member is reassigned
The spec-key fix in dc7997b (``_extract_arg`` descends into ``is_data_oriented(arg)`` to emit ndarray shape descriptors) was being silently bypassed for the same-instance case: ``TemplateMapper.lookup`` has a fast-path ``_mapping_cache_tracker`` keyed only on ``tuple(id(arg) for arg in args)``, which short-circuits ``extract()`` whenever the same instance is passed again. So ``run(state)``-then-``state.x = other``-then-``run(state)`` re-used the cached spec key from the first call and the kernel kept its original compile-time dtype/ndim.
Fold the ids of all ndarrays reachable through any ``is_data_oriented(arg)`` (recursively, via nested data_oriented and dataclass children) into ``args_hash``. Reassigning a member ndarray changes its id, which changes the hash, which forces ``extract()`` and (when warranted) a fresh compilation. No-op for data_oriented containers with no ndarrays. Mirror at this cache layer of the launch-context stale-guard fix from 97afa6d.
Pinned by ``test_data_oriented_ndarray_reassign_different_dtype``, which was failing under just the ``_extract_arg`` change because of this cache layer and now passes.
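The stale fast path can be reproduced in a few lines of plain Python. `ids_only` mirrors the `tuple(id(arg) for arg in args)` key quoted above; `ids_with_members` is a hypothetical, single-level version of the fix (the real change is described as recursive), and `FakeNdarray` / `State` are stand-ins.

```python
class FakeNdarray:
    """Hypothetical stand-in for a Quadrants ndarray."""


class State:
    def __init__(self, x):
        self.x = x


def ids_only(args):
    # Fast-path key as described: identity of the top-level args only.
    return tuple(id(a) for a in args)


def ids_with_members(args):
    # Sketch of the fix: also fold in the ids of ndarray members (one level deep here).
    key = []
    for a in args:
        key.append(id(a))
        key.extend(id(m) for m in vars(a).values() if isinstance(m, FakeNdarray))
    return tuple(key)


first, second = FakeNdarray(), FakeNdarray()  # both stay alive, so their ids are distinct
state = State(first)

shallow_before, deep_before = ids_only([state]), ids_with_members([state])
state.x = second  # rebind the member on the same instance
shallow_after, deep_after = ids_only([state]), ids_with_members([state])
```

The shallow key is unchanged by the rebind, so a cache keyed on it returns the old entry; the member-aware key differs and forces a fresh lookup.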
…lass kernel arg
Annotating a kernel arg as a dataclass whose field type is a ``@qd.data_oriented`` class mixes two
incompatible kernel-arg patterns:
- Typed-dataclass args are flattened into per-leaf kernel args using the field type annotations at
compile time (``_transform_kernel_arg`` recurses on ``field.type``).
- ``@qd.data_oriented`` containers don't carry per-attribute type annotations — their ndarray and
field members are walked at kernel-compile time from the *value* (``vars(self)``) via
``_predeclare_struct_ndarrays``, which only fires for ``qd.template()`` / ``qd.Tensor`` outer
annotations.
Before this commit, the data_oriented field type fell through ``_transform_kernel_arg``'s else
branch and bubbled up a confusing ``Invalid data type`` error from ``cook_dtype``. Now we raise a
``QuadrantsSyntaxError`` naming the offending field and pointing users at the recommended fix
(``s: qd.template()``).
Pinned by ``test_typed_dataclass_with_data_oriented_field_raises_clear_error``.
…ap A The unconditional ``vars(arg).items()`` recursion that the Gap A fix added to both ``_extract_arg`` and ``TemplateMapper.lookup`` was paid once per kernel call per data_oriented arg. For the genesis field-backend, where the ``@qd.data_oriented`` Solver is passed as ``self`` to every kernel and holds dozens of attributes, this cost ~150 FPS/env on anymal_c (B=4096) — measured ~14% regression in paired runs. Cache the attribute paths to ndarrays per class (``type(arg) -> list[tuple[str, ...]]``). First call for a class walks once via ``_build_struct_nd_paths``; subsequent calls do a dict lookup + ``getattr`` chains for the (typically zero or one or two) cached paths. For solvers with no ndarray members (genesis field backend), the cached list is empty and the per-call cost collapses to a single dict lookup. Trades freshness for speed: assumes the *set* of ndarray-holding attribute paths is stable across instances of the same class. Genesis Solver and similar data_oriented containers declare members in ``__init__`` and don't add new ones later, so this is safe. Documented in the docstring for ``_struct_nd_paths_for``. Shared between ``_template_mapper.py`` (id collection for args_hash) and ``_template_mapper_hotpath.py`` (shape descriptors for spec key) — same paths, different payload.
Documents what combinations of `dataclasses.dataclass`, `@qd.data_oriented`, `@qd.struct`, `qd.ndarray`, and `qd.field` work as nested members, after the data_oriented + ndarray fix series.
Three additions:
1. Per-container × per-member-type matrix replacing the previous text-only claim that ``@qd.data_oriented`` could not contain ndarrays.
2. Outer kernel-arg annotation rules: when to use ``qd.template()`` vs a typed-dataclass annotation, including the ``frozen=True`` requirement for a dataclass passed via ``qd.template()`` and the rejection of ``@qd.data_oriented`` field types inside a typed-dataclass kernel arg (matches the error from c9598ad).
3. Reassignment + restrictions: documents that ndarray reassignment with different dtype/ndim is supported (Gap A), and that the ndarray-bearing attribute set on a data_oriented class is assumed stable across instances (path-cache caveat from 93893e5).
Plus three spot tests in ``test_data_oriented_mixed_combos.py`` that empirically pin the more involved matrix claims:
- ``test_data_oriented_with_ndarray_field_and_nested_data_oriented``: single data_oriented holding ndarray + field + nested data_oriented + primitive simultaneously.
- ``test_dataclass_with_data_oriented_via_template``: frozen dataclass holding a data_oriented holding an ndarray, passed via ``qd.template()``.
- ``test_data_oriented_with_dataclass_and_ndarray_sibling``: data_oriented holding both a direct ndarray AND a dataclass-with-ndarray sibling.
All three pass on cluster.
``@qd.struct`` does not exist as an exported symbol — ``dir(qd)`` has only ``Struct``, ``StructField``, and ``dataclass``. The original doc claimed ``@qd.struct`` / ``@qd.dataclass`` as a legacy decorator pair, but only ``@qd.dataclass`` exists. The function-form equivalent ``qd.types.struct(name1=type1, ...)`` produces the same ``StructType``. Replace all ``@qd.struct`` references with ``@qd.dataclass`` (with a parenthetical note pointing to the function-form factory ``qd.types.struct``). No semantic change — the row's "field-only, no ndarrays" classification was already correct; only the name was wrong.
Belt-and-braces tests for the case the user explicitly requires: fastcache should work when a
@qd.data_oriented contains ndarrays (with or without primitives or nested data_oriented
children), and should *correctly fall back* (not error, not silently miscompile) when the
container holds a qd.field.
Pattern adapted from ``test_cache.test_fastcache``: call ``qd_init_same_arch`` twice with the
same ``offline_cache_file_path`` to simulate two processes. Monkeypatch ``launch_kernel`` to
capture ``compiled_kernel_data`` per call: ``None`` on the cold init (compile) and a non-None
``CompiledKernelData`` on the warm init (loaded from disk fastcache).
New tests:
- ``test_data_oriented_ndarray_fastcache_cross_init`` — single ndarray member, second init loads
from disk.
- ``test_data_oriented_nested_ndarray_fastcache_cross_init`` — nested @qd.data_oriented + ndarray
member, second init loads from disk. Exercises the args_hasher recursion.
- ``test_data_oriented_ndarray_fastcache_dtype_key_distinct`` — two different ndarray dtypes on
the same data_oriented produce two distinct cache entries; both load from disk on warm init.
Pins the ``[nd-{dtype}-{ndim}]`` repr in args_hasher.
- ``test_data_oriented_field_disables_fastcache_but_runs`` — data_oriented + qd.field documented
fallback: ``cache_key_generated`` is False, but the kernel still runs correctly.
The pre-existing ``test_data_oriented_ndarray_fastcache_eligible`` (kept) checks the in-process
``cache_key_generated`` flag; these four add cross-init disk-cache verification.
…otguns
The existing fastcache.md mentions @qd.data_oriented in the constraint table and in a one-line note next to the dataclass section, but doesn't give a worked example or spell out the behavioural semantics. This commit adds a focused subsection covering:
- A worked Simulation example: __init__ allocates state once, @qd.kernel(fastcache=True) method consumes it via self.
- Primitive members of @qd.data_oriented are *implicitly templated*: their values are folded into the fastcache key without needing add_value_to_cache_key or qd.static(...). This is the property that lets the cache differentiate between Simulation(n=8) and Simulation(n=64).
- Tensor contents vs reassignment: a per-operation table showing which mutations share the cache entry (element writes, same-dtype/ndim reassignment) and which produce a new entry (dtype or ndim change).
- dataclasses.dataclass nesting works, but has the inverse default for primitives: types only, not values. Spell out the silent-miscompile risk if you put a qd.static-baked value in a dataclass field without FIELD_METADATA_CACHE_VALUE.
- What disables fastcache on a data_oriented arg: any qd.field child anywhere in the tree, with a pointer to the perso_hugh follow-up doc.
Also adds a short "Fastcache interaction" cross-reference in compound_types.md so a reader who lands there is pointed at the fastcache subsection.
No code changes; this is purely user-facing documentation of behaviour that already exists on the hp/data-oriented-ndarray-fix branch (data_oriented + ndarray + fastcache works end-to-end across processes, verified in the investigation doc).
… for compound-type keying
- Main body now covers only: how to enable fastcache + the constraints for enabling it.
- Move all container-specific behaviour (data_oriented primitive value folding, dataclasses.dataclass FIELD_METADATA_CACHE_VALUE opt-in, qd.field disables fastcache) into a single tight "Advanced -> Compound-type cache keying" subsection.
- Drop @qd.data_oriented description from fastcache.md (lives in compound_types.md). Drop qd.static <-> fastcache conflation: the two mechanisms are orthogonal.
- compound_types.md retains a single cross-link to the new fastcache.md#compound-type-cache-keying anchor.
…uous bare 'field'
In fastcache.md and compound_types.md, several places used the bare word 'field' to mean 'attribute of a dataclasses.dataclass / @qd.data_oriented container'. Because qd.field is itself a documented Quadrants type (listed in the same parameter-types table that disables fastcache when it appears), bare 'field' was ambiguous. Standardise on 'member' for compound-type members.
Keep:
- 'qd.field' / 'ScalarField' / 'MatrixField' / 'qd.dataclass' / 'StructType' references unchanged (these are the Quadrants types).
- 'dataclasses.field(...)' unchanged (Python stdlib API).
- 'attribute' only where it means Python attribute-access syntax (`s.foo`) or the `src_ll_cache_observations` Python instance attribute.
Also clean up the purity-constraint closure-list example to drop 'fields' (it was unrelated to the qd.field/dataclass-field distinction and was just listing examples of external state).
…ers ('baked into kernel')
Replace 'folded into the cache key' jargon (which was undefined and
ambiguous: ndarray dtype info is just keyed, whereas data_oriented
primitive children are also Template-style specialised). Mirror the
existing qd.Template row: primitive member values are 'baked into
kernel'. Use 'included in the cache key' for type-only contributions
(ndarray dtype/ndim/layout, dataclass member types).
…are delete + late-reassign-with-different-dtype
Previous wording said 'don't add new members after the first kernel
launch'. Empirical results show this is overly broad: adding new ndarray
attributes on later instances of the same class is safe (each instance
gets its own spec entry via per-instance weakref; the compile-time
walker registers all reachable ndarrays). The actual failure modes are:
(a) Deleting an ndarray attribute that was present on the first
launch -> AttributeError on the next launch (the cached path
still does getattr on the missing attribute).
(b) Reassigning a post-first-walk ndarray attribute (a member that
wasn't on the first instance walked, was added later, and is now
re-assigned) to one with a different dtype/ndim -> not detected
by the id-augmented args_hash invalidation tracker; stale
compiled kernel is silently reused -> bit-reinterpretation of
the new storage.
Verified empirically via ~/ais/deskai9/tmp/check_path_cache_stability.py
on cluster (cases A/B safe; C errors; D safe via per-instance weakref;
E silent miscompile - f32 array reassigned over i32 displays the i32
bit pattern as ~4e-45).
…rnel name
test_data_oriented_ndarray_fastcache_cross_init was asserting on the LAST
launch_kernel call, but state.x.to_numpy() between run(state) and the
assertion launches an internal ndarray_to_ext_arr kernel that is
is_pure=False and so always has compiled_kernel_data=None. The assertion
captured the wrong launch and the test failed even though the actual
fastcache load for the user kernel worked correctly (verified via
src_ll_cache_observations.cache_loaded=True in a debug repro).
Filter the captured list to only the user kernel ('run'). Applied the
same filter to the other two cross-init fastcache tests (which happened
to pass because their assertions came before .to_numpy(), but the filter
makes the pattern robust against future test edits).
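The difference between asserting on the last captured launch and on the last launch of the user kernel can be sketched with a plain list. `Launch` and the captured entries are hypothetical test data; only the kernel names `run` and `ndarray_to_ext_arr` come from the message above.

```python
from collections import namedtuple

Launch = namedtuple("Launch", ["kernel_name", "compiled_kernel_data"])

# Order of launches in the warm init: the user kernel, then the internal
# conversion kernel triggered by .to_numpy().
captured = [
    Launch("run", object()),             # loaded from the disk cache
    Launch("ndarray_to_ext_arr", None),  # not pure, so never cached
]

last_launch = captured[-1]
user_launches = [launch for launch in captured if launch.kernel_name == "run"]
```

Asserting on `last_launch` checks the internal kernel and fails; asserting on `user_launches[-1]` checks the kernel the test is actually about.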
…embers example The intro paragraph says '@qd.data_oriented is designed for classes that define @qd.kernel methods as class members.' The ndarray-members example just below was defining the kernel outside the class (taking s: qd.template()) which contradicted the paragraph and was inconsistent with the qd.field example above it. Move step() inside State as a self-bound @qd.kernel method.
…e-valued (one compile per distinct value) Primitive members (int/float/bool/enum) on a @qd.data_oriented class are read at AST-parse time and baked into the kernel IR. Different instances with different primitive values each trigger a fresh compilation via the per-instance weakref in the spec key. Add a short subsection with an example.
…oss-referencing dataclasses.dataclass
…p dataclass mention
…@qd.data_oriented section
…n under @qd.data_oriented
…field, qd.ndarray, qd.tensor uniformly
hughperkins
commented
Jun 12, 2026
- L130 'value at such a variable' -> 'value of such a variable' (r3403909834).
- L157 Compound-type cache keying intro: 'only descending into members the
kernel actually reads or writes' was hard to parse; replaced with the user's
suggested wording — 'Any members that are not themselves read or written by
the kernel, nor contain members read or written by the kernel, are skipped
during the walk' as a new sentence (r3403932479).
- L174 dataclass FIELD_METADATA_CACHE_VALUE example: dropped the SimConfig
prose ('two SimConfig instances with different num_layers...') which
referred to the example class out of context, replaced with 'Primarily this
means any variable used inside qd.static' (links to static.md). The
SimConfig name is still introduced and self-contained inside the python
code block (r3403946029).
- compound_types.md L139: reverted to 'pruning of the sub-struct's leaf
members that the callee never reads' — Hugh prefers 'pruning' in this
sub-function-passing context, where it's a well-established compiler term
for the optimisation rather than a fastcache-specific concept (r3399513849).
Add a top-level `### stable_members` subsection under `## qd.data_oriented` (first sub-heading, before "Primitive members") recommending `stable_members=True` as standard for the common case where ndarray members are allocated once and never rebound. Includes the microbench showing ~28% / 5 us/call reduction in per-launch Python overhead on a 30-ndarray container, and a clear UB caveat that forward-links to the "Reassigning ndarray members" subsection. That subsection now back-links to `stable_members` and flags that ndarray rebinding is unsupported when the opt-in is set.
duburcqa
reviewed
Jun 12, 2026
Collaborator
Author
Note: this is ready for next round of review.
…asher Move the body of `Kernel._maybe_persist_l1_and_set_l2_key` into a new free function `src_hasher.persist_l1_and_set_l2_key` (alongside the `store_pruning_info` / `compute_narrow_args_hash` / `make_full_cache_key` primitives it composes). The Kernel method becomes a thin one-call delegate that wires kernel attributes in and assigns the returned `fast_checksum` / `cache_key_generated` flag back. Addresses the "Check feature factorization" CI smell on #705: the orchestration was net-new code on the central `Kernel` class that plausibly belongs in `_fast_caching/`. Pure refactor — behaviour preserved.
…120c Three header comments wrapped at ~98-104c that combine cleanly under the project's 120c limit. Addresses the "Check line wrapping" CI smell on #705. Pure prose reflow.
duburcqa
reviewed
Jun 18, 2026
| `stable_members=True` | 13.5 µs/call |
| | **−5 µs/call (−28%)** |

**Trade-off:** with `stable_members=True`, reassigning an ndarray member on an instance is undefined behavior: the previously compiled kernel will be reused even if the new ndarray has a different `dtype`, `ndim`, or layout, silently bit-reinterpreting the new array's storage. Set it only on classes whose ndarray members are allocated once (typically in `__init__`) and never rebound. See [Reassigning ndarray members](#reassigning-ndarray-members) below for the supported alternative.
Contributor
Maybe this behaviour should be enforced via post-init freeze. But I think it is tricky to get right without annoying boilerplate on user-side and/or performance penalty.
Even Python dataclass does not offer a clean way to customise init when frozen.
https://docs.python.org/3/library/dataclasses.html#dataclasses
frozen: If true (the default is False), assigning to fields will generate an exception. This emulates read-only frozen instances. See the [discussion](https://docs.python.org/3/library/dataclasses.html#dataclasses-frozen) below.
If [__setattr__()](https://docs.python.org/3/reference/datamodel.html#object.__setattr__) or [__delattr__()](https://docs.python.org/3/reference/datamodel.html#object.__delattr__) is defined in the class and frozen is true, then [TypeError](https://docs.python.org/3/library/exceptions.html#TypeError) is raised.
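For context on the boilerplate concern, this is the standard-library workaround for computing a member during init on a frozen dataclass. `Buffers` is a hypothetical example class, not part of Quadrants; the `object.__setattr__` call in `__post_init__` is the documented way around the frozen check.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class Buffers:
    n: int
    data: list = dataclasses.field(init=False)

    def __post_init__(self):
        # Normal assignment raises on a frozen dataclass, so derived members
        # have to be set through object.__setattr__.
        object.__setattr__(self, "data", [0] * self.n)


b = Buffers(4)

try:
    b.data = []  # rebinding after construction is rejected
    rebound = True
except dataclasses.FrozenInstanceError:
    rebound = False
```

Every member computed in `__post_init__` needs its own `object.__setattr__` call, which is the kind of user-side boilerplate the comment refers to.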
duburcqa
reviewed
Jun 18, 2026
### Compound-type cache keying

The args hasher walks compound-type kernel parameters recursively. For each leaf member it decides what (if anything) contributes to the cache key. The headline rules:

For `@qd.data_oriented` and `dataclasses.dataclass` kernel parameters, fastcache walks members recursively. Any members that are not themselves read or written by the kernel, nor contain members read or written by the kernel, are skipped during the walk (per the [strict invariants](#two-strict-invariants) above). Member-by-member behavior:
Contributor
This is redundant with:
- Unrecognised types at variables the kernel reads or writes must not be silently dropped or hashed by type-name. If the value of such a variable has a type fastcache doesn't explicitly handle (Pydantic models, UUIDs, third-party tensor wrappers, …), fastcache is disabled for the call with a one-shot `[FASTCACHE][UNKNOWN_TYPE]` warning identifying the offending type plus an `[INVALID_FUNC]` log line confirming the cache is off.
Summary
Extends `@qd.data_oriented` with kernel-pruning-driven fastcache argument hashing, an opt-in `stable_members=True` launch-time perf hint, walker robustness fixes (cycle-safe, MRO-safe), and `@qd.func`-from-kernel support for dataclass-typed args.
See the user-guide doc changes for the user-facing surface:
- `compound_types.md`: `@qd.data_oriented` ndarray members, fastcache section, `stable_members=True`.
- `fastcache.md`: pruning-driven argument hashing rules and the compound-type cache-keying table.