Skip to content

feat: [DSM-142] Add ColdStats aggregate to CanisterStates#10286

Open
alin-at-dfinity wants to merge 4 commits into
alin/canister-states-foundationsfrom
alin/DSM-142-canister-states-cold-stats
Open

feat: [DSM-142] Add ColdStats aggregate to CanisterStates#10286
alin-at-dfinity wants to merge 4 commits into
alin/canister-states-foundationsfrom
alin/DSM-142-canister-states-cold-stats

Conversation

@alin-at-dfinity
Copy link
Copy Markdown
Contributor

Maintains a small ColdStats aggregate over the canisters in the cold pool, updated incrementally on every transition into / out of cold. This lets the "touch every canister" aggregate queries — total_compute_allocation, total_canister_memory_usage, memory_taken, callback_count, guaranteed_response_message_memory_taken, best_effort_message_memory_taken — run in O(|hot|) instead of O(|all canisters|), which is the primary motivation for the hot/cold split on subnets with a long tail of idle canisters.

The aggregates are derived (not persisted) and are reconstructed by CanisterStates::new on checkpoint load. debug_assert_invariants ensures every mutating method keeps them in sync, and the ColdStats struct stays module-private — callers always reach the totals through the public aggregator methods on CanisterStates.

MemoryTaken's fields are bumped from private to pub(crate) so that CanisterStates::memory_taken can construct the struct directly, keeping MemoryTaken in its current home in replicated_state.rs. CanisterStates::memory_taken itself is pub(crate) and will be wired up to ReplicatedState::memory_taken in the next PR.

Maintains a small `ColdStats` aggregate over the canisters in the
`cold` pool, updated incrementally on every transition into / out of
`cold`. This lets the "touch every canister" aggregate queries —
`total_compute_allocation`, `total_canister_memory_usage`,
`memory_taken`, `callback_count`,
`guaranteed_response_message_memory_taken`,
`best_effort_message_memory_taken` — run in `O(|hot|)` instead of
`O(|all canisters|)`, which is the primary motivation for the
hot/cold split on subnets with a long tail of idle canisters.

The aggregates are derived (not persisted) and are reconstructed by
`CanisterStates::new` on checkpoint load. `debug_assert_invariants`
(now also runs an `O(|cold|)` recompute and compares against the live
aggregate) ensures every mutating method keeps them in sync, and the
`ColdStats` struct stays module-private — callers always reach the
totals through the public aggregator methods on `CanisterStates`.

`MemoryTaken`'s fields are bumped from private to `pub(crate)` so that
`CanisterStates::memory_taken` can construct the struct directly,
keeping `MemoryTaken` in its current home in `replicated_state.rs`.
`CanisterStates::memory_taken` itself is `pub(crate)` and will be
wired up to `ReplicatedState::memory_taken` in the next PR; an
`#[allow(dead_code)]` keeps the build warning-free until then.

Aggregator behaviour is exercised by two new tests
(`memory_aggregators_combine_hot_and_cold`,
`callback_count_combines_hot_and_cold`) and the bookkeeping
discipline is exercised by an extended set of `*_updates_cold_stats*`
tests covering `insert`, `remove`, `try_cool*`, `for_each_mut`,
`try_for_each_mut`, and `retain`.

Co-authored-by: Cursor <cursoragent@cursor.com>
@alin-at-dfinity alin-at-dfinity requested a review from a team as a code owner May 22, 2026 08:55
@github-actions github-actions Bot added the feat label May 22, 2026
alin-at-dfinity and others added 3 commits May 22, 2026 09:14
…ry. Rename raw_memory to execution_memory, so it better matches the equivalent MemoryTaken field. Update documentation and tests.
A canister can satisfy `CanisterState::is_cold()` while still holding a
guaranteed-response slot reservation: `is_cold()` only requires empty
input/output *messages* (the pool count) and no unexpired best-effort
callback, both of which are independent of whether the canister has
in-flight guaranteed-response requests. A canister that has pushed a
guaranteed-response request that's already been moved to an outgoing
stream still keeps the input-slot reservation for the eventual response,
which contributes `MAX_RESPONSE_COUNT_BYTES` to its
`guaranteed_response_message_memory_usage()`.

The previous commit dropped this field from `ColdStats` on the
assumption it was always zero. It isn't, and the consequence is that
`guaranteed_response_message_memory_taken()` quietly under-reports
subnet-wide memory: promoting a cold canister with a reservation to
`hot` (e.g. on the next `get_mut`) makes the subnet total jump up out
of nowhere, breaking conservation invariants in downstream code
(stream handler `debug_assert!`s, in particular).

Restore the field and the corresponding `add`/`sub` bookkeeping, fold
it into `guaranteed_response_message_memory_taken`,
`total_canister_memory_usage`, and `memory_taken`, and add a focused
test (`cold_canister_with_guaranteed_response_reservation_is_aggregated`)
exercising the case via `push_output_request` followed by draining the
output queue.

Best-effort message memory remains hot-only: an unexpired best-effort
callback forces the canister into `hot`, and any expired best-effort
callback shows up as a pending input which also forces `hot`.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant