fix(capabilities/remote): reap stale registration staging entries in Trigger Publisher #22978
Open
winstoncrooker wants to merge 1 commit into
…Trigger Publisher
The remote Trigger Publisher stages incoming RegisterTrigger messages in messageCache, keyed by (CallerDonId, WorkflowID, TriggerID), until 2F+1 registrations aggregate. The insert is unconditional and the key fields are caller-controlled, but this cache is never reaped: cacheCleanupLoop reaps ackCache, sendRegistrationChecks reaps unregisterCache, and messageCache is reaped in neither. Its only Delete is on the unregister path, which never reaches pre-quorum entries, so a member sending distinct-key registrations that never reach quorum grows the cache without bound, exhausting node memory.
Reap stale staging entries in cacheCleanupLoop, mirroring ackCache (entries older than RegistrationExpiry are already past the aggregation window, so this does not affect active registrations). Adds MessageCache.Len() and a regression test that fails on current code and passes with this change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Description
The remote Trigger Publisher stages incoming RegisterTrigger messages in messageCache, keyed by (CallerDonId, WorkflowID, TriggerID), until 2F+1 matching registrations aggregate. This staging cache is never reaped:
(*triggerPublisher).Receive inserts into messageCache unconditionally (before the quorum check), and WorkflowID/TriggerID are caller-controlled (WorkflowID is format-validated only; TriggerID is unvalidated), so the key space is unbounded.
ackCache.DeleteOlderThan is called in cacheCleanupLoop, unregisterCache.DeleteOlderThan is called in sendRegistrationChecks, and messageCache.DeleteOlderThan is called in neither. (MessageCache ships DeleteOlderThan, and the sibling Trigger Subscriber reaps its own cache in eventCleanupLoop.)
The only messageCache.Delete call is on the unregister path, which early-returns for keys that are not active registrations. Pre-quorum entries never become active registrations, so they are never removed.
Net effect: a single workflow-DON member sending distinct-key registrations that never reach quorum grows the staging cache without bound, exhausting node memory. This is a resource-exhaustion / liveness hardening fix (no confidentiality or integrity impact). The attacker must be an authenticated workflow-DON member, but a single below-quorum member is enough to cause the unbounded growth, which bypasses the 2F+1 aggregation defense.
Fix
Reap stale staging entries in cacheCleanupLoop, mirroring ackCache. Entries older than RegistrationExpiry are already past the aggregation window used by Ready (which counts only entries newer than now - RegistrationExpiry), so removing them is safe and does not affect active registrations or in-window aggregation.
Also adds a small MessageCache.Len() accessor used by the regression test.
Tests
TestTriggerPublisher_RegistrationStagingCacheIsBounded: a single below-quorum workflow-DON member floods distinct-key registrations; the test asserts the staging cache is non-empty after the flood and is reaped to zero by cacheCleanupLoop. It fails on the current code (the cache is never reclaimed) and passes with this change.
The TestTriggerPublisher* suite still passes, confirming the change does not affect legitimate registration aggregation or the unregister flow.
Notes
The diff is +118/-1 across three files (message_cache.go, trigger_publisher.go, the new test). A defense-in-depth per-sender or total size cap on the staging cache would bound it independently of the time-based reaper; this PR keeps the change minimal and consistent with the existing sibling-cache cleanup.