Streamable HTTP: Multiple SSE streams cause infinite reconnect loop #659

@its-mash

Description

Summary

When multiple SSE streams exist for the same session (e.g. from POST response reconnections), LocalSessionWorker::resume() unconditionally replaces self.common.tx, killing the other stream's receiver. Both EventSource connections then reconnect every sse_retry seconds, leapfrogging each other in an infinite loop that floods the server with GET requests.

Affected versions: rmcp 0.14.0, 0.15.0
Severity: Critical — causes infinite reconnect loops with clients like Cursor, and breaks server-to-client notifications over Streamable HTTP

Root Cause

The MCP Streamable HTTP transport sends POST SSE responses with a priming event containing retry: 3000. When the POST stream ends (after delivering the response), the client's EventSource implementation automatically reconnects via GET. This creates multiple competing EventSource connections:

  1. The initial standalone GET stream (primary notification channel)
  2. Reconnecting GETs from completed POST responses (initialize, tools/list, etc.)
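
For reference, a priming event is an ordinary SSE frame: the retry field sets the EventSource reconnect interval, and a blank line terminates the frame. The payload below is illustrative (the id format mirrors the Last-Event-ID values in the logs further down):

retry: 3000
id: 0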

Each reconnecting GET calls resume() which unconditionally replaces self.common.tx:

// Before fix — local.rs resume()
None => {
    let (tx, rx) = tokio::sync::mpsc::channel(self.session_config.channel_capacity);
    self.common.tx = tx;  // ← Unconditionally replaces sender, kills other stream
    // ...
}

Dropping the old sender closes the old receiver, terminating the OTHER EventSource's stream. That stream then reconnects, replacing the sender again. Both leapfrog every sse_retry (3s) indefinitely.
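
This is plain tokio behavior, independent of rmcp. A minimal, self-contained sketch (all names here are illustrative) of how replacing the last Sender closes the Receiver:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Stream 1's channel: stands in for the standalone GET (tx1 = common.tx).
    let (tx1, mut rx1) = mpsc::channel::<&str>(8);

    // A reconnecting GET calls resume(), and `self.common.tx = tx2` drops tx1.
    let (tx2, _rx2) = mpsc::channel::<&str>(8);
    drop(tx1); // the implicit effect of the assignment
    let _common_tx = tx2;

    // With its only sender gone, rx1 yields None: stream 1's SSE loop ends,
    // and its EventSource schedules a reconnect after `retry` milliseconds.
    assert!(rx1.recv().await.is_none());
}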

The Leapfrog Loop

1. Client POST initialize → SSE response with priming (retry: 3000) → stream ends
2. Client GET (standalone) → becomes primary common channel (tx1/rx1)
3. POST EventSource reconnects via GET (3s later) → replaces common.tx → kills rx1
4. GET from step 2 reconnects → replaces common.tx → kills stream from step 3
5. Repeat every 3 seconds indefinitely
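
The loop can be reproduced with nothing but tokio channels. In this self-contained simulation (all names assumed; the 30 ms sleep stands in for the 3000 ms sse_retry, and the timeout only bounds the final round), each "stream" replaces the shared sender on reconnect, which closes the other stream's receiver and forces it to reconnect in turn:

use std::sync::{Arc, Mutex};
use tokio::sync::mpsc::{self, Sender};
use tokio::time::{sleep, timeout, Duration};

async fn sse_stream(name: &'static str, common: Arc<Mutex<Sender<String>>>, reconnects: usize) {
    for _ in 0..reconnects {
        // "GET /sse" -> resume(): unconditionally install a fresh channel.
        // The assignment drops the previous sender, closing the peer's receiver.
        let (tx, mut rx) = mpsc::channel::<String>(8);
        *common.lock().unwrap() = tx;
        // Serve until the other stream's reconnect closes our receiver.
        while let Ok(Some(_)) = timeout(Duration::from_millis(200), rx.recv()).await {}
        println!("{name}: stream closed, reconnecting");
        sleep(Duration::from_millis(30)).await; // stands in for sse_retry (3000 ms)
    }
}

#[tokio::main]
async fn main() {
    let common = Arc::new(Mutex::new(mpsc::channel::<String>(8).0));
    let a = tokio::spawn(sse_stream("standalone GET", common.clone(), 3));
    let b = tokio::spawn(sse_stream("POST reconnect", common.clone(), 3));
    let _ = tokio::join!(a, b);
}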

Server logs confirm the pattern — alternating GET requests every 3 seconds with different Last-Event-ID values:

13:33:51.670  GET Last-Event-ID: 0/2   ← from completed POST response
13:33:54.668  GET Last-Event-ID: 0     ← from killed standalone stream
13:33:57.679  GET Last-Event-ID: 0/2   ← leapfrog
13:34:00.674  GET Last-Event-ID: 0     ← leapfrog
...

Additional Issue: Cache Replay Loop

Even without the leapfrog, resume() called sync() on the common channel to replay cached events. Replaying server-initiated list_changed notifications caused clients to re-process old signals, triggering unnecessary re-fetches every reconnection cycle.

What Happens in Practice

Cursor (infinite loop)

  • Connects via POST initialize + GET standalone
  • POST SSE stream ends → EventSource reconnects via GET
  • Two competing streams leapfrog every 3 seconds
  • Server flooded with GET requests indefinitely
  • Notifications intermittently lost as channels are swapped

VS Code (silent notification loss)

  • Reconnects SSE every ~5 minutes with same session ID
  • Each reconnection replaces the channel sender
  • Previous stream's receiver is orphaned
  • notify_tool_list_changed().await returns Ok(()), a silent failure

Fix: Shadow Channels

PR: #660

Instead of unconditionally replacing the common channel, check if the primary is still active:

  • Primary dead (tx.is_closed()) → Replace it. New stream becomes primary.
  • Primary alive → Create a shadow stream — an idle SSE connection kept alive by SSE keep-alive pings that does NOT receive notifications and does NOT replace the primary channel.

fn resume_or_shadow_common(&mut self) -> Result<StreamableHttpMessageReceiver, SessionError> {
    let (tx, rx) = tokio::sync::mpsc::channel(self.session_config.channel_capacity);
    if self.common.tx.is_closed() {
        // Primary is dead — replace it
        self.common.tx = tx;
    } else {
        // Primary is alive — create shadow (idle, keep-alive only)
        self.shadow_txs.push(tx);
    }
    Ok(StreamableHttpMessageReceiver { http_request_id: None, inner: rx })
}
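
Because shadow streams never receive notifications, the only bookkeeping they add is cleanup: per the Files Changed list below, close_sse_stream() clears the shadow senders. A minimal sketch of one way to prune dead ones (this retain-based body is an illustration, not the PR's exact code):

fn prune_dead_shadows(&mut self) {
    // Drop shadow senders whose receiving SSE streams have gone away.
    self.shadow_txs.retain(|tx| !tx.is_closed());
}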

Why Not 409 Conflict?

The initial approach (matching the TypeScript SDK) was to return 409 Conflict on duplicate standalone streams. However:

  1. The MCP spec states: "The client MAY remain connected to multiple SSE streams simultaneously" — 409 is not spec-compliant
  2. 409 causes Cursor to fail entirely on reconnection (500 errors from unhandled Conflict)
  3. The reconnecting EventSources are legitimate HTTP requests — they need a valid stream back

Shadow channels are the correct approach: keep all connections alive without interference.

Why No Cache Replay on Common Channel?

Common channel notifications (tools/list_changed, resources/list_changed) are idempotent signals. Replaying cached ones causes clients to re-process old events, triggering unnecessary re-fetches or infinite notification loops. Missing one is harmless — the next real event arrives naturally. Request-wise channels still use sync() for proper response replay.
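
As a toy sketch, the replay decision reduces to whether the stream is request-wise (simplified, assumed types; not the PR's code):

type RequestId = u64;

// Request-wise streams replay cached events so a reconnecting client can
// still receive its response; the common channel never replays, because
// list_changed signals are idempotent and stale copies trigger re-fetch loops.
fn should_replay_cache(http_request_id: Option<RequestId>) -> bool {
    http_request_id.is_some()
}

fn main() {
    assert!(should_replay_cache(Some(2)));  // POST response stream: sync()
    assert!(!should_replay_cache(None));    // standalone common stream: no sync()
}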

Changes (5 commits)

Commit    Description
8bd424e   Initial 409 Conflict approach (returned error on duplicate standalone stream)
0d03eb5   Handle resume with completed request-wise channels (fall through to common)
a7bb822   Remove 409 Conflict — allow channel replacement per MCP spec
7cf5406   Skip cache replay (sync) when replacing active streams
a7df58c   Shadow channels — the final fix that prevents the leapfrog loop

Files Changed

  • crates/rmcp/src/transport/streamable_http_server/session/local.rs
    • Added shadow_txs: Vec<Sender<ServerSseMessage>> to LocalSessionWorker
    • New method resume_or_shadow_common() with primary-alive check
    • Updated resume() to use shadow logic for both direct common and request-wise fallback paths
    • Removed sync() calls on common channel resume
    • Updated close_sse_stream() to clear shadow senders
    • Updated create_local_session() to initialize shadow_txs

Test Results

  • Cursor connects and initializes successfully (no 409/500 errors)
  • Cursor does NOT enter infinite GET reconnect loop after connection
  • Feature changes trigger exactly one batch of list_changed notifications
  • Cursor receives and processes notifications correctly (re-fetches tools/resources)
  • No notification replay loop (no repeated ResourceListChanged every 3s)
  • VS Code connects and works correctly (unaffected by changes)
  • cargo check --workspace passes

Environment

  • rmcp 0.15.0 (also affects 0.14.0)
  • StreamableHttpService with stateful_mode: true
  • LocalSessionManager (default session manager)
  • Clients tested: Cursor 2.4.37, VS Code MCP Extension
