Where
CircuitBreaker persists one KV entry per remote delivery host and only ever removes it on a later successful delivery to that same host. The write path never passes a TTL, and there is no sweep/eviction anywhere.
The write happens in #replace — neither the set branch nor the cas branch passes an expiry, even though both accept one:
// packages/fedify/src/federation/circuit-breaker.ts:423-438
async #replace(remoteHost, oldState, newState): Promise<boolean> {
const key = this.#key(remoteHost); // [...prefix, remoteHost]
if (this.#kv.cas == null) {
if (newState == null) {
await this.#kv.delete(key);
} else {
await this.#kv.set(key, newState); // ← no { ttl }
}
return true;
}
return await this.#kv.cas(key, oldState, newState); // ← no { ttl }
}
KvStore.set and KvStore.cas both take an optional KvStoreSetOptions with a ttl (kv.ts:15-19, :59, :76-81), and backends honor it (e.g. @fedify/sqlite's kv.ts sets expires from options.ttl on both set and cas, and sweeps). The circuit breaker just never supplies one.
Removal only happens on success:
// circuit-breaker.ts:307-325 recordSuccess / recordReachableFailure
if (await this.#replace(remoteHost, oldState, undefined)) { ... } // delete
recordFailure (circuit-breaker.ts:333-378) creates/updates the entry but never deletes it. pruneFailures only trims the timestamp array inside a single host's entry; it never removes entries across hosts. There is no list()-based cleanup and no background job.
Wiring (enabled by default whenever an outbox queue exists):
// middleware.ts:673-686
if (options.circuitBreaker !== false && this.outboxQueue != null) {
this.circuitBreaker = new CircuitBreaker({ kv: options.kv, prefix: this.kvPrefixes.circuitBreaker, ... });
}
The key is the raw host of the target inbox: getRemoteHost(url) => url.host (metrics.ts:1741), called at middleware.ts:1349 / :1483-1491. So the keyspace is indexed by remote-controlled host names.
What happens
- Fedify tries to deliver an activity to
https://dead.example/inbox.
- Delivery fails →
recordFailure("dead.example") → one KV entry at [...circuitBreaker, "dead.example"], written with no TTL.
dead.example never comes back, so recordSuccess/recordReachableFailure for it is never called.
- The entry stays forever. Repeat for every distinct host you ever fail to reach.
Minimal reproduction (drives CircuitBreaker directly against a TTL-honoring in-memory KvStore that records whether a TTL is ever passed):
after 10000 distinct failing hosts:
entries in KV store : 10000
set() calls : 0
cas() calls : 10000 # the atomic (production) path
ttl ever passed? : false # never, on either path
after a second failure round (dead hosts never recover):
entries in KV store : 10000 # monotonic
after recordSuccess() on exactly one host:
entries in KV store : 9999 # only per-host success removes one
(Full script attached below.) The store used here actively expires TTL'd entries, so the growth is purely because no TTL is ever supplied — this is not specific to MemoryKvStore; it affects any backend (Redis, Postgres, SQLite, Deno KV, …).
Why it matters
- Unbounded growth of the shared
KvStore keyspace under the circuitBreaker prefix, proportional to the number of distinct hosts ever failed and not since recovered. For a long-lived instance this only ever climbs: the fediverse steadily accumulates dead/renamed/one-off instances.
- It is remotely inducible.
remoteHost comes from a target inbox URL, i.e. the content of a remote actor document. An actor whose inbox/sharedInbox points at many distinct unreachable hostnames (e.g. random subdomains of a controlled domain) forces a new permanent entry per failed delivery attempt — a slow, cheap amplification against the operator's KV store.
- It is silent: no metric or log signals the growth, and the default failure/recovery windows (
failureWindow 10m, recoveryDelay 30m) suggest these entries are short-lived, which they are not.
This shipped in 2.3.0 and is live in the current release (2.3.1), so it is recent enough that long-running instances haven't accumulated much yet — but they will, monotonically.
Proposal
Give the state a TTL at the write site, since both set and cas already accept one:
const ttl = /* an upper bound after which a stale entry is safe to drop */;
if (newState == null) await this.#kv.delete(key);
else if (this.#kv.cas == null) await this.#kv.set(key, newState, { ttl });
else return await this.#kv.cas(key, oldState, newState, { ttl });
An entry is only meaningful while it can still affect a decision: while open/half-open (bounded by recoveryDelay) or while any failure timestamp is still inside failureWindow. So a TTL on the order of recoveryDelay + failureWindow (with some margin) drops nothing that is still in use — after that window with no updates, the host is treated as closed from scratch, which is the correct default anyway. The exact value is a maintainer call; I did not want to hardcode a policy in a proposal.
Existing entries left by 2.3.0–2.3.1 (migration)
The TTL fix above is forward-only: it stamps an expiry on writes made after the upgrade. Backends only expire records that carry an expires, so entries already written by 2.3.0/2.3.1 have none and will persist. Worse, they do not self-heal: an active host gets a fresh TTL the next time #replace touches it (or is deleted on the next success), but the leaked population is exactly the dead/never-recontacted hosts — those are never touched again, so the fix alone never reclaims them.
So a fix should also reclaim the existing entries. Circuit-breaker state is soft — it's derived from delivery outcomes, and dropping it only resets a host to closed (the safe default), costing at most one extra probe. list() by prefix is a required KvStore method, so a one-shot sweep of the prefix is safe and backend-agnostic:
// prefix defaults to ["_fedify", "circuit"] (middleware.ts:656),
// or ctx.federation.kvPrefixes.circuitBreaker if customized
for await (const { key } of kv.list(circuitBreakerPrefix)) {
await kv.delete(key);
}
Options for where this lives: a one-time sweep on the first startup after upgrade (e.g. gated by a version marker key), or — lighter, no auto-magic — just a documented "run this once after upgrading past the fix" snippet in the release notes. Either is fine; the point is that shipping only the write-site TTL leaves every current deployment's already-leaked entries in place forever.
Costs / risks
- A too-short TTL could drop an
open entry early and let one extra probe through before it re-opens — harmless (that is the recovery path). Choosing >= recoveryDelay + failureWindow avoids even that.
- The
cas == null fallback backend still can't expire atomically-consistently, but it already can't provide CAS at all, so passing ttl to its set is strictly better than today.
Alternative
Lazy cleanup on read instead of TTL: when #get loads an entry that is closed with an empty pruned failure list (or otherwise past its useful window), treat it as absent and delete it. This needs no interface guarantees but only reclaims entries that are read again — dead hosts that are never contacted again would still linger, so it does not fully fix the leak. I'd lean against relying on it alone; it pairs well with the TTL as defense-in-depth. If you'd rather close this, the case for doing so would be "host cardinality is bounded in practice and the entries are tiny" — which holds for a small instance but not against the remote-inducible spray above.
References
Verified against upstream fedify-dev/fedify main and release tag 2.3.1; circuit-breaker.ts is byte-identical in both.
About this issue
Drafted by Shiro (Claude Opus 4.8), an AI assistant working with @nyanrus. I read the paths above and ran the reproduction myself, but if I've misread the removal path or there's an eviction I missed, please tell me — I'd like to correct it.
Reproduction script (deno run --config path/to/fedify/deno.json repro.ts)
import { CircuitBreaker } from "@fedify/fedify/federation";
import type {
KvKey, KvStore, KvStoreListEntry, KvStoreSetOptions,
} from "@fedify/fedify/federation";
// A KvStore that actively honors TTL and records whether a ttl is ever passed.
class ObservingKv implements KvStore {
#map = new Map<string, { value: unknown; expires: number | null }>();
ttlEverPassed = false; setCalls = 0; casCalls = 0;
#enc(k: KvKey) { return JSON.stringify(k); }
#sweep() { const n = Date.now(); for (const [k, v] of this.#map) if (v.expires != null && v.expires <= n) this.#map.delete(k); }
get<T>(k: KvKey): Promise<T | undefined> { this.#sweep(); const e = this.#map.get(this.#enc(k)); return Promise.resolve(e ? e.value as T : undefined); }
set(k: KvKey, value: unknown, o?: KvStoreSetOptions) { this.setCalls++; if (o?.ttl != null) this.ttlEverPassed = true; this.#map.set(this.#enc(k), { value, expires: o?.ttl != null ? Date.now() + o.ttl.total("millisecond") : null }); return Promise.resolve(); }
delete(k: KvKey) { this.#map.delete(this.#enc(k)); return Promise.resolve(); }
cas(k: KvKey, expected: unknown, value: unknown, o?: KvStoreSetOptions) {
this.casCalls++; if (o?.ttl != null) this.ttlEverPassed = true; this.#sweep();
const cur = this.#map.get(this.#enc(k)); const curV = cur ? cur.value : undefined;
if (JSON.stringify(curV) !== JSON.stringify(expected)) return Promise.resolve(false);
if (value === undefined) this.#map.delete(this.#enc(k)); else this.#map.set(this.#enc(k), { value, expires: null });
return Promise.resolve(true);
}
async *list(prefix?: KvKey): AsyncIterable<KvStoreListEntry> { this.#sweep(); for (const [ek, e] of this.#map) { const key = JSON.parse(ek) as KvKey; if (prefix != null && !prefix.every((p, i) => key[i] === p)) continue; yield { key, value: e.value }; } }
size() { this.#sweep(); return this.#map.size; }
}
const kv = new ObservingKv();
const cb = new CircuitBreaker({ kv, prefix: ["_cb"] });
const N = 10_000;
for (let i = 0; i < N; i++) await cb.recordFailure(`host-${i}.example.invalid`);
console.log(`after ${N} distinct failing hosts: entries=${kv.size()} set=${kv.setCalls} cas=${kv.casCalls} ttlEverPassed=${kv.ttlEverPassed}`);
for (let i = 0; i < N; i++) await cb.recordFailure(`host-${i}.example.invalid`);
console.log(`after a second failure round: entries=${kv.size()} (monotonic)`);
await cb.recordSuccess("host-0.example.invalid");
console.log(`after recordSuccess on one host: entries=${kv.size()} (dropped by 1)`);
Where
CircuitBreakerpersists one KV entry per remote delivery host and only ever removes it on a later successful delivery to that same host. The write path never passes a TTL, and there is no sweep/eviction anywhere.The write happens in
#replace— neither thesetbranch nor thecasbranch passes an expiry, even though both accept one:KvStore.setandKvStore.casboth take an optionalKvStoreSetOptionswith attl(kv.ts:15-19,:59,:76-81), and backends honor it (e.g.@fedify/sqlite'skv.tssetsexpiresfromoptions.ttlon bothsetandcas, and sweeps). The circuit breaker just never supplies one.Removal only happens on success:
recordFailure(circuit-breaker.ts:333-378) creates/updates the entry but never deletes it.pruneFailuresonly trims the timestamp array inside a single host's entry; it never removes entries across hosts. There is nolist()-based cleanup and no background job.Wiring (enabled by default whenever an outbox queue exists):
The key is the raw host of the target inbox:
getRemoteHost(url) => url.host(metrics.ts:1741), called atmiddleware.ts:1349/:1483-1491. So the keyspace is indexed by remote-controlled host names.What happens
https://dead.example/inbox.recordFailure("dead.example")→ one KV entry at[...circuitBreaker, "dead.example"], written with no TTL.dead.examplenever comes back, sorecordSuccess/recordReachableFailurefor it is never called.Minimal reproduction (drives
CircuitBreakerdirectly against a TTL-honoring in-memoryKvStorethat records whether a TTL is ever passed):(Full script attached below.) The store used here actively expires TTL'd entries, so the growth is purely because no TTL is ever supplied — this is not specific to
MemoryKvStore; it affects any backend (Redis, Postgres, SQLite, Deno KV, …).Why it matters
KvStorekeyspace under thecircuitBreakerprefix, proportional to the number of distinct hosts ever failed and not since recovered. For a long-lived instance this only ever climbs: the fediverse steadily accumulates dead/renamed/one-off instances.remoteHostcomes from a target inbox URL, i.e. the content of a remote actor document. An actor whoseinbox/sharedInboxpoints at many distinct unreachable hostnames (e.g. random subdomains of a controlled domain) forces a new permanent entry per failed delivery attempt — a slow, cheap amplification against the operator's KV store.failureWindow10m,recoveryDelay30m) suggest these entries are short-lived, which they are not.This shipped in 2.3.0 and is live in the current release (2.3.1), so it is recent enough that long-running instances haven't accumulated much yet — but they will, monotonically.
Proposal
Give the state a TTL at the write site, since both
setandcasalready accept one:An entry is only meaningful while it can still affect a decision: while
open/half-open(bounded byrecoveryDelay) or while any failure timestamp is still insidefailureWindow. So a TTL on the order ofrecoveryDelay + failureWindow(with some margin) drops nothing that is still in use — after that window with no updates, the host is treated asclosedfrom scratch, which is the correct default anyway. The exact value is a maintainer call; I did not want to hardcode a policy in a proposal.Existing entries left by 2.3.0–2.3.1 (migration)
The TTL fix above is forward-only: it stamps an expiry on writes made after the upgrade. Backends only expire records that carry an
expires, so entries already written by 2.3.0/2.3.1 have none and will persist. Worse, they do not self-heal: an active host gets a fresh TTL the next time#replacetouches it (or is deleted on the next success), but the leaked population is exactly the dead/never-recontacted hosts — those are never touched again, so the fix alone never reclaims them.So a fix should also reclaim the existing entries. Circuit-breaker state is soft — it's derived from delivery outcomes, and dropping it only resets a host to
closed(the safe default), costing at most one extra probe.list()by prefix is a requiredKvStoremethod, so a one-shot sweep of the prefix is safe and backend-agnostic:Options for where this lives: a one-time sweep on the first startup after upgrade (e.g. gated by a version marker key), or — lighter, no auto-magic — just a documented "run this once after upgrading past the fix" snippet in the release notes. Either is fine; the point is that shipping only the write-site TTL leaves every current deployment's already-leaked entries in place forever.
Costs / risks
openentry early and let one extra probe through before it re-opens — harmless (that is the recovery path). Choosing>= recoveryDelay + failureWindowavoids even that.cas == nullfallback backend still can't expire atomically-consistently, but it already can't provide CAS at all, so passingttlto itssetis strictly better than today.Alternative
Lazy cleanup on read instead of TTL: when
#getloads an entry that isclosedwith an empty pruned failure list (or otherwise past its useful window), treat it as absent anddeleteit. This needs no interface guarantees but only reclaims entries that are read again — dead hosts that are never contacted again would still linger, so it does not fully fix the leak. I'd lean against relying on it alone; it pairs well with the TTL as defense-in-depth. If you'd rather close this, the case for doing so would be "host cardinality is bounded in practice and the entries are tiny" — which holds for a small instance but not against the remote-inducible spray above.References
circuit-breaker.ts:423-438(#replace),:307-325(recordSuccess),:333-378(recordFailure)kv.ts:15-19,:59,:76-81(ttlonset/cas)middleware.ts:673-686(enablement),:1349,:1483-1491(call sites),:656(defaultcircuitBreakerprefix["_fedify", "circuit"]),metrics.ts:1741(getRemoteHost)Verified against upstream
fedify-dev/fedifymainand release tag2.3.1;circuit-breaker.tsis byte-identical in both.About this issue
Drafted by Shiro (Claude Opus 4.8), an AI assistant working with @nyanrus. I read the paths above and ran the reproduction myself, but if I've misread the removal path or there's an eviction I missed, please tell me — I'd like to correct it.
Reproduction script (
deno run --config path/to/fedify/deno.json repro.ts)