Skip to content

fix(eventstream): Server.publish snapshots subs, releases RLock before I/O#2

Merged
TeoSlayer merged 1 commit into
mainfrom
fix-server-publish-rlock-across-io
May 28, 2026
Merged

fix(eventstream): Server.publish snapshots subs, releases RLock before I/O#2
TeoSlayer merged 1 commit into
mainfrom
fix-server-publish-rlock-across-io

Conversation

@TeoSlayer
Copy link
Copy Markdown
Contributor

Summary

  • Server.publish previously held mu.RLock() across every WriteEvent call. A slow subscriber whose pipe / TCP send buffer was full would block WriteEvent while the read lock was held, starving every concurrent addSub/removeSub waiting on mu.Lock(). One wedged peer wedged all subscription mutations.
  • Refactor to match the companion broker.publishWith in service.go (P2-003 pattern): snapshot per-topic + "*" subscribers under a brief RLock, release, then iterate WriteEvent on the snapshot without holding any lock. Per-write errors are logged.
  • Two regression tests added in zz_server_internals_test.go that both FAIL deterministically against the old code and PASS after the fix.

Test plan

  • TestServer_PublishDoesNotBlockOnSlowSubscriber — addSub completes <100ms while a publish is mid-flight against a wedged peer. Verified to FAIL 5/5 against old code, PASS 5/5 with the fix.
  • TestServer_PublishHandlesSlowSubscriberWithoutDeadlock — fast peer still receives event; 3 concurrent publishers + removeSub all complete; no deadlock. Verified to FAIL 3/3 against old code, PASS 3/3 with the fix.
  • go test -race -count=3 -timeout 180s ./... — green.

…e I/O

Server.publish previously held mu.RLock() across every WriteEvent call.
A slow subscriber whose pipe / TCP send buffer was full would block
WriteEvent while the read lock was still held, starving every
concurrent addSub/removeSub waiting on mu.Lock(). One wedged peer
effectively wedged all subscription mutations.

Refactor to match the companion broker.publishWith in service.go
(P2-003): snapshot the per-topic and "*" subscriber lists under a brief
RLock, release the lock, then iterate WriteEvent on the snapshot
without holding any lock. Per-write errors are logged.

Adds two regression tests:

- TestServer_PublishDoesNotBlockOnSlowSubscriber asserts addSub
  returns within 100ms while a publish is in flight against a
  blocked-pipe peer. Verified to FAIL deterministically against the
  old code (addSub blocked >100ms behind slow peer) and PASS after
  the fix.

- TestServer_PublishHandlesSlowSubscriberWithoutDeadlock asserts a
  fast peer still gets the event promptly, multiple concurrent
  publishers don't deadlock, and removeSub completes while a slow
  peer is wedged.

Full suite go test -race -count=3 -timeout 180s ./... passes.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@TeoSlayer TeoSlayer merged commit e1bca94 into main May 28, 2026
2 checks passed
@TeoSlayer TeoSlayer deleted the fix-server-publish-rlock-across-io branch May 28, 2026 01:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants