fix(chat): queue WS sends, replay buffer on initial bind #71

Merged: pufit merged 1 commit into ClickHouse:main from
alex-fedotyev:alex/nerve-chat-reconnect-resync on May 13, 2026

@alex-fedotyev
Contributor

Summary

Two small fixes for the user-visible chat glitch where a question gets no
response and the next reply ends up answering the previous prompt. Five
reliability PRs (#63, #64, #65, #66, #67) each close one underlying
cause; the two remaining gaps live at the client send path and the
gateway initial-bind handshake.

Gap 1: client send silently drops payloads

web/src/api/websocket.ts checked readyState === WebSocket.OPEN and
no-op'd otherwise. The 3-second reconnect window leaves a hole: a
disconnected send() returns nothing to the caller and
chatStore.sendMessage has already optimistically appended the user
message and flipped isStreaming: true. The UI shows "thinking" while
the message never reached the server, the next message succeeds, and
the user reads the second response as if it answered the first prompt.

The fix:

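  • send() returns 'sent' | 'queued' | 'dropped'. Sends made while the
    socket is CONNECTING or a reconnect is scheduled go to a bounded
    _pending queue (5 entries, oldest evicted), and onopen drains the
    queue in arrival order. CLOSING, or CLOSED with no reconnect
    scheduled, returns 'dropped'.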
  • chatStore.sendMessage captures the return value. On 'dropped'
    the optimistic user message is popped, isStreaming is cleared,
    and an inline assistant error explains the failure so the user can
    retry.
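
The send path described above can be sketched as follows. This is a minimal illustration, not the code in web/src/api/websocket.ts: QueuedSender, SocketLike, markReconnectScheduled, and handleOpen are hypothetical names, and only the tri-state return value, the 5-entry bound with oldest-first eviction, and the in-order drain on open come from the description.

```typescript
// Hypothetical sketch: class, interface, and method names are illustrative only.
type SendResult = 'sent' | 'queued' | 'dropped';

// Numeric values of the standard WebSocket readyState constants.
const CONNECTING = 0;
const OPEN = 1;

// Minimal socket surface so the sketch runs without a browser.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

class QueuedSender {
  private socket: SocketLike | null;
  private readonly maxPending: number;
  private pending: string[] = [];
  private reconnectScheduled = false;

  constructor(socket: SocketLike | null, maxPending = 5) {
    this.socket = socket;
    this.maxPending = maxPending;
  }

  send(payload: string): SendResult {
    if (this.socket !== null && this.socket.readyState === OPEN) {
      this.socket.send(payload);
      return 'sent';
    }
    const connecting =
      this.socket !== null && this.socket.readyState === CONNECTING;
    if (connecting || this.reconnectScheduled) {
      // Bounded queue: evict the oldest entry once the limit is reached.
      if (this.pending.length >= this.maxPending) {
        this.pending.shift();
      }
      this.pending.push(payload);
      return 'queued';
    }
    // CLOSING, or CLOSED with no reconnect scheduled.
    return 'dropped';
  }

  // Assumed hook: called when the socket closes and a reconnect timer is armed.
  markReconnectScheduled(): void {
    this.socket = null;
    this.reconnectScheduled = true;
  }

  // Assumed hook: called from the new socket's onopen handler.
  handleOpen(socket: SocketLike): void {
    this.socket = socket;
    this.reconnectScheduled = false;
    const queued = this.pending;
    this.pending = [];
    // Drain in arrival order.
    for (const payload of queued) {
      socket.send(payload);
    }
  }
}
```

Returning a value instead of throwing keeps the caller's happy path unchanged: only 'dropped' needs special handling, and 'queued' can be treated like 'sent' because the payload is delivered once the socket opens.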

Gap 2: gateway initial bind never replayed the broadcaster buffer

nerve/gateway/server.py:286-311 registered the new listener but did
not ship session_status with buffered_events, while the switch_session
handler at lines 363-379 already did. Reload mid-turn (or hit the 3s WS
reconnect after a network blip) and the in-flight stream was lost from
the client's view even though broadcaster._session_buffers[session_id]
had every event.

The fix:

  • Lift the shared session_status construction into
    _send_session_status(websocket, session_id, is_running, session_record).
  • Call it from the initial-bind branch only when
    broadcaster.is_buffering(active_session) is true. Idle sessions
    stay silent.
  • switch_session calls it unconditionally so the client refreshes
    is_running / status on every selection.
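
The gating rule can be modeled in a few lines. The gateway itself is Python; the TypeScript below is only a model of the decision, written in the client's language. BroadcasterLike, buildSessionStatus, onInitialBind, and onSwitchSession are hypothetical names, and the payload carries only the fields the description mentions (is_running, buffered_events) plus an assumed session_id.

```typescript
// Hypothetical model of the gateway decision; not the code in server.py.
interface SessionStatus {
  type: 'session_status';
  session_id: string;
  is_running: boolean;
  buffered_events: unknown[];
}

// Assumed minimal view of the broadcaster.
interface BroadcasterLike {
  isBuffering(sessionId: string): boolean;
  bufferedEvents(sessionId: string): unknown[];
}

// Shared construction, standing in for _send_session_status.
function buildSessionStatus(
  broadcaster: BroadcasterLike,
  sessionId: string,
  isRunning: boolean,
): SessionStatus {
  return {
    type: 'session_status',
    session_id: sessionId,
    is_running: isRunning,
    buffered_events: broadcaster.bufferedEvents(sessionId),
  };
}

// Initial bind: replay only when a turn is being buffered; idle sessions get nothing.
function onInitialBind(
  broadcaster: BroadcasterLike,
  sessionId: string,
  isRunning: boolean,
): SessionStatus | null {
  return broadcaster.isBuffering(sessionId)
    ? buildSessionStatus(broadcaster, sessionId, isRunning)
    : null;
}

// Session switch: always send, so the client refreshes is_running on every selection.
function onSwitchSession(
  broadcaster: BroadcasterLike,
  sessionId: string,
  isRunning: boolean,
): SessionStatus {
  return buildSessionStatus(broadcaster, sessionId, isRunning);
}
```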

The frontend handleSessionStatus already rebuilds streamingBlocks,
panels, todos, and interaction state from msg.buffered_events
(web/src/stores/handlers/sessionHandlers.ts:22-114, last updated by
#69), so this is purely additive at the gateway.

Test plan

  • python -m pytest tests/test_gateway_ws.py -v (9 new tests
    covering the helper output, the initial-bind gate, the
    switch_session regression path, and a buffer-fidelity check).
  • python -m pytest tests/ (444 pass, 2 skip, 2 pre-existing
    failures unrelated: test_bootstrap docker-env detection and
    test_cli_upgrade docker mode).
  • npx tsc --noEmit clean in web/.
  • npm run build clean in web/.
  • Manual: open a chat, start a long-running turn, reload the tab
    mid-stream. The in-flight stream should restore (blocks rebuild via
    the buffer, todos via #69 "fix(web): refresh todos panel from
    buffered events on reconnect", isStreaming: true reflects live state).
  • Manual: dev-tools "Offline" toggle for 5s during a turn. On
    reconnect, the buffer replays.
  • Manual: force a send during the offline window. On reconnect,
    the message flushes ('queued' path) and the agent runs.
  • Manual: force a permanent close (kill the daemon). Type and
    submit. UI shows the dropped-message error ('dropped' path) and
    the optimistic state is reverted.

Out of scope (followups, not blocking)

  • Stale-listener cleanup on swallowed send_json exceptions
    (server.py:298-301).
  • Application-level message_received ack broadcast from the engine
    after sessions.add_message.
  • _session_locks TTL / reclaim on session archive.

Notes

[no-changeset: allow] (no changeset infrastructure in this repo).

…lickHouse#67

User-reported glitch: "I ask for something and there is no response, then
I ask again and it answers to the previous question." Five reliability
PRs (ClickHouse#63 shorthand-schema, ClickHouse#64 synthetic done, ClickHouse#65 stale sdk_session_id,
ClickHouse#66 idle timeout, ClickHouse#67 sticky session) each close one underlying cause.
Two gaps remain that none of those PRs cover.

Gap 1: client-side send silently drops payloads.
web/src/api/websocket.ts checked readyState === OPEN and no-op'd
otherwise. The 3s reconnect window leaves a hole: send() returns nothing to
the caller and chatStore.sendMessage has already optimistically appended the
user message and set isStreaming=true. The user thinks the agent is
thinking but the message never reached the server, so the next reply
lands against a stale prompt.

Track readyState explicitly. CONNECTING or reconnect-scheduled now queues
the payload (bounded to 5 entries; oldest evicted) and flushes from
onopen. CLOSED-without-reconnect and CLOSING return 'dropped' so the
caller can revert. chatStore.sendMessage pops the optimistic user message
on 'dropped' and surfaces an inline assistant error so the user can
retry.
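
The caller-side revert might look like the sketch below. It is written as a pure function over a simplified state and is not the actual chatStore code: ChatState, ChatMessage, the isError flag, the error wording, and the payload shape passed to send are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the 'dropped' handling; types and payload shape are assumed.
type SendResult = 'sent' | 'queued' | 'dropped';

interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
  isError?: boolean;
}

interface ChatState {
  messages: ChatMessage[];
  isStreaming: boolean;
}

function sendMessage(
  state: ChatState,
  text: string,
  send: (payload: string) => SendResult,
): ChatState {
  // Optimistic update: show the user message and the streaming indicator at once.
  const optimistic: ChatState = {
    messages: [...state.messages, { role: 'user', content: text }],
    isStreaming: true,
  };

  const result = send(JSON.stringify({ type: 'message', content: text }));
  if (result !== 'dropped') {
    // 'sent' and 'queued' both keep the optimistic state.
    return optimistic;
  }

  // 'dropped': pop the optimistic message, clear streaming, explain inline.
  return {
    messages: [
      ...optimistic.messages.slice(0, -1),
      {
        role: 'assistant',
        content: 'Message could not be sent: connection closed. Please retry.',
        isError: true,
      },
    ],
    isStreaming: false,
  };
}
```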

Gap 2: gateway initial-bind never replayed the broadcaster buffer.
The switch_session handler already shipped session_status with
buffered_events on session switch, but the initial-connect handshake at
server.py:286-311 didn't. Reload mid-turn (or a transient 3s WS drop)
and the in-flight stream was lost from the client's view even though
the events sat in broadcaster._session_buffers waiting to be replayed.

Lift the duplicated send-status construction into _send_session_status
and call it from both branches. Initial-bind gates on
broadcaster.is_buffering so idle sessions stay silent; switch_session
calls unconditionally so the client refreshes is_running/status on
every selection. The frontend handleSessionStatus already restores
streamingBlocks, panels, todos, and interaction state from the buffer
(handled by ClickHouse#69), so this is purely additive at the gateway.

Tests:
- 9 new asserts in tests/test_gateway_ws.py covering the helper output,
  the initial-bind gate, the switch_session regression path, and a
  load-fidelity check for buffer ordering.
- Full pytest run: 444 pass, 2 skip, 2 pre-existing failures unrelated
  (test_bootstrap docker-env detection and test_cli_upgrade docker
  mode, both noted in notes/repo-conventions/nerve.md).
- web/ tsc --noEmit clean, npm run build clean.

Out of scope (followups, not blocking):
- Stale-listener cleanup on swallowed send_json exceptions
  (server.py:298-301).
- Application-level message_received ack from engine after
  sessions.add_message.
- _session_locks TTL on session archive.
@pufit merged commit 01be5c3 into ClickHouse:main on May 13, 2026