Skip to content

fix(capabilities/remote): reap stale registration staging entries in Trigger Publisher#22978

Open
winstoncrooker wants to merge 1 commit into
smartcontractkit:developfrom
winstoncrooker:fix/trigger-publisher-staging-cache
Open

fix(capabilities/remote): reap stale registration staging entries in Trigger Publisher#22978
winstoncrooker wants to merge 1 commit into
smartcontractkit:developfrom
winstoncrooker:fix/trigger-publisher-staging-cache

Conversation

@winstoncrooker

Copy link
Copy Markdown

Description

The remote Trigger Publisher stages incoming RegisterTrigger messages in messageCache, keyed by (CallerDonId, WorkflowID, TriggerID), until 2F+1 matching registrations aggregate. This staging cache is never reaped:

  • (*triggerPublisher).Receive inserts into messageCache unconditionally (before the quorum check), and WorkflowID/TriggerID are caller-controlled (WorkflowID is format-validated only; TriggerID is unvalidated), so the key space is unbounded.
  • The periodic cleanup reaps the sibling caches but not this one: ackCache.DeleteOlderThan is in cacheCleanupLoop, unregisterCache.DeleteOlderThan is in sendRegistrationChecks, and messageCache.DeleteOlderThan is called in neither. (MessageCache ships DeleteOlderThan, and the sibling Trigger Subscriber reaps its own cache in eventCleanupLoop.)
  • The only messageCache.Delete is on the unregister path, which early-returns for keys that are not active registrations. Pre-quorum entries never become active registrations, so they are never removed.

Net effect: a single workflow-DON member sending distinct-key registrations that never reach quorum grows the staging cache without bound, exhausting node memory. This is a resource-exhaustion / liveness hardening fix (no confidentiality or integrity impact); the attacker must be an authenticated workflow-DON member, but the unbounded growth defeats the 2F+1 aggregation defense from a single below-quorum member.

Fix

Reap stale staging entries in cacheCleanupLoop, mirroring ackCache. Entries older than RegistrationExpiry are already past the aggregation window used by Ready (which counts only entries newer than now - RegistrationExpiry), so removing them is safe and does not affect active registrations or in-window aggregation:

deletedStaging := p.messageCache.DeleteOlderThan(now - cfg.remoteConfig.RegistrationExpiry.Milliseconds())

Also adds a small MessageCache.Len() accessor used by the regression test.

Tests

  • New regression test TestTriggerPublisher_RegistrationStagingCacheIsBounded: a single below-quorum workflow-DON member floods distinct-key registrations; the test asserts the staging cache is non-empty after the flood and is reaped to zero by cacheCleanupLoop. It fails on the current code (the cache is never reclaimed) and passes with this change.
  • The existing TestTriggerPublisher* suite still passes, confirming the change does not affect legitimate registration aggregation or the unregister flow.

Notes

The diff is +118/-1 across three files (message_cache.go, trigger_publisher.go, the new test). A defense-in-depth per-sender or total size cap on the staging cache would bound it independently of the time-based reaper; this PR keeps the change minimal and consistent with the existing sibling-cache cleanup.

…Trigger Publisher

The remote Trigger Publisher stages incoming RegisterTrigger messages in
messageCache, keyed by (CallerDonId, WorkflowID, TriggerID), until 2F+1
registrations aggregate. The insert is unconditional and the key fields are
caller-controlled, but this cache is never reaped: cacheCleanupLoop reaps
ackCache, sendRegistrationChecks reaps unregisterCache, and messageCache is
reaped in neither. Its only Delete is on the unregister path, which never
reaches pre-quorum entries, so a member sending distinct-key registrations
that never reach quorum grows the cache without bound, exhausting node memory.

Reap stale staging entries in cacheCleanupLoop, mirroring ackCache (entries
older than RegistrationExpiry are already past the aggregation window, so this
does not affect active registrations). Adds MessageCache.Len() and a regression
test that fails on current code and passes with this change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant