bug(runner): agent responses to background task notifications are silently dropped from chat UI #1494

@quay-devel

Summary

When an agent responds to a <task-notification> (background task completion), the response text and any subsequent tool calls are never rendered in the chat UI. The agent does process the notification — it reads output, makes tool calls, generates text — but all of that is invisibly dropped before reaching the frontend.

Reproduction

  1. An Ambient session runs a Bash command with run_in_background: true
  2. The script eventually exits (success or failure)
  3. A <task-notification> is delivered to the agent with the task ID, output file, status, and summary
  4. The agent processes the notification and generates a text response (e.g. "Poll exited with code 3, let me check the comments...")
  5. Bug: That response is NOT rendered in the Ambient chat UI. The user sees nothing.

Root Cause (Two-Layer Failure)

Layer 1 — Runner: stream_between_run_events is implemented but never reachable

components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py:374 implements stream_between_run_events(). This method has the correct logic:

  • Consumes TaskNotificationMessage from _between_run_queue → emits a task:completed CUSTOM event
  • Picks up the agent's subsequent response messages from _between_run_queue as the next non-task message
  • Opens a synthetic RunStartedEvent envelope around the response
  • Pipes it through _stream_claude_sdk → produces TEXT_MESSAGE_* events + MESSAGES_SNAPSHOT
  • Closes with RunFinishedEvent

This method is never called from anywhere. No HTTP endpoint, no gRPC listener task, no test invokes it. The between-run queue fills up with the agent's response and nobody drains it.

Additionally, app.py includes the events router twice (lines 280 and 325), with the second inclusion even commented # Between-run event stream (always registered) — a leftover from a previous incomplete attempt to wire this up.
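The drain logic described in the bullets above can be sketched in a few lines. This is a minimal illustration, not the code in bridge.py: `TaskNotification`, `AgentMessage`, `_SHUTDOWN`, `drain_between_runs`, and the plain-dict events are hypothetical stand-ins for the SDK message types and AG-UI events, and the MESSAGES_SNAPSHOT step is omitted for brevity.

```python
import asyncio
from dataclasses import dataclass


# Hypothetical stand-ins for the SDK message types (assumptions for illustration).
@dataclass
class TaskNotification:
    task_id: str
    status: str


@dataclass
class AgentMessage:
    text: str
    final: bool = False  # True marks the end of the response, like a ResultMessage


_SHUTDOWN = object()  # sentinel used only to end this demo loop


async def drain_between_runs(queue: asyncio.Queue, thread_id: str):
    """Yield event dicts for everything queued between user-initiated runs."""
    run_open = False
    while True:
        msg = await queue.get()
        if msg is _SHUTDOWN:
            break
        if isinstance(msg, TaskNotification):
            # Task completion becomes a CUSTOM event outside any run envelope.
            yield {"type": "CUSTOM", "name": "task:completed",
                   "value": {"task_id": msg.task_id, "status": msg.status}}
            continue
        if not run_open:
            # First non-task message: open a synthetic run around the response.
            yield {"type": "RUN_STARTED", "thread_id": thread_id}
            run_open = True
        yield {"type": "TEXT_MESSAGE_CONTENT", "delta": msg.text}
        if msg.final:
            yield {"type": "RUN_FINISHED", "thread_id": thread_id}
            run_open = False


async def demo():
    q = asyncio.Queue()
    for item in (
        TaskNotification("t1", "failed"),
        AgentMessage("Poll exited with code 3, "),
        AgentMessage("let me check the comments...", final=True),
        _SHUTDOWN,
    ):
        q.put_nowait(item)
    return [e["type"] async for e in drain_between_runs(q, "session-1")]


event_types = asyncio.run(demo())
```

The point of the sketch is that a generator like this produces nothing unless some caller iterates it, which is exactly the missing piece in Layer 1.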

Layer 2 — Backend: the between-run listener connects to a non-existent URL

components/backend/websocket/agui_proxy.go:1175 (listenBetweenRunEvents) is the backend goroutine designed to capture between-run events. It constructs:

eventsURL := strings.TrimSuffix(runnerURL, "/") + "/events"

This produces http://runner:8000/events, with no thread_id path parameter. The runner only registers GET /events/{thread_id} (a required FastAPI path parameter), so the runner returns 404 for every attempt. After 30 retries with exponential backoff the goroutine exits. Failures are logged but never surfaced to users.

Even if the URL were corrected to include a thread_id, GET /events/{thread_id} reads from _active_streams[thread_id], which is populated exclusively during bridge.run() (user-initiated turns). Between-run queue messages are never placed there.
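The URL mismatch can be reproduced without the real services. The sketch below is an approximation under stated assumptions: `EVENTS_ROUTE` is a hypothetical regex standing in for the runner's single registered route, `build_events_url` mirrors the Go expression quoted above, and `status_for` only models route matching, not what the handler would return.

```python
import re
from urllib.parse import urlsplit

# Hypothetical regex equivalent of the only registered route: GET /events/{thread_id}
EVENTS_ROUTE = re.compile(r"^/events/(?P<thread_id>[^/]+)$")


def build_events_url(runner_url: str) -> str:
    # Mirrors the Go code: strings.TrimSuffix(runnerURL, "/") + "/events"
    base = runner_url[:-1] if runner_url.endswith("/") else runner_url
    return base + "/events"


def status_for(url: str) -> int:
    # A path with no matching route yields 404; a match is modelled as 200.
    return 200 if EVENTS_ROUTE.match(urlsplit(url).path) else 404


url = build_events_url("http://runner:8000/")
status_without_thread = status_for(url)
status_with_thread = status_for(url + "/session-1")
```

A match here only means the route exists. As noted above, that handler reads from _active_streams, so a corrected URL alone would still not deliver between-run messages.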

Also: gRPC transport has the same gap

When AMBIENT_GRPC_ENABLED=true, GRPCSessionListener._listen_loop only processes event_type == "user" messages. There is no background task draining worker._between_run_queue and feeding GRPCMessageWriter. Between-run events are equally lost in the gRPC transport path.

Message Flow (Broken)

Background task exits
  → SDK delivers TaskNotificationMessage via receive_messages()
  → _read_messages_forever: active_output_queue is None → goes to _between_run_queue
  → Claude responds to the notification
  → Response messages (StreamEvents, AssistantMessage, ResultMessage) → _between_run_queue

_between_run_queue now holds:
  [TaskNotificationMessage, StreamEvent..., AssistantMessage, ResultMessage]

stream_between_run_events() has correct logic to drain this — but is never called.

Backend listenBetweenRunEvents goroutine:
  → Calls GET http://runner:8000/events  (no thread_id → 404)
  → Retries 30× with backoff, logs failures
  → Gives up and exits

Result: agent response never leaves the runner. Frontend sees nothing. No DB record written.
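The routing decision in the flow above can be modelled as follows. This is a simplified sketch, not the runner's actual reader loop: the `Worker` class and its `route` method are hypothetical, and messages are plain strings named after the SDK types.

```python
import asyncio
from typing import Optional


class Worker:
    """Hypothetical, simplified model of the reader's routing decision."""

    def __init__(self) -> None:
        self.active_output_queue: Optional[asyncio.Queue] = None
        self.between_run_queue: asyncio.Queue = asyncio.Queue()

    def route(self, message: str) -> str:
        # During a user-initiated run, messages go to the run's output queue.
        if self.active_output_queue is not None:
            self.active_output_queue.put_nowait(message)
            return "active"
        # Otherwise they land in the between-run queue.
        self.between_run_queue.put_nowait(message)
        return "between_run"


async def demo():
    worker = Worker()
    worker.active_output_queue = asyncio.Queue()  # a user turn is in progress
    during = worker.route("StreamEvent (user turn)")
    worker.active_output_queue = None             # the turn has finished
    after = [
        worker.route(m)
        for m in ("TaskNotificationMessage", "StreamEvent",
                  "AssistantMessage", "ResultMessage")
    ]
    # Nothing consumes between_run_queue, so its size only grows.
    return during, after, worker.between_run_queue.qsize()


during, after, stranded = asyncio.run(demo())
```

With no consumer attached, the four messages generated after the background task exits remain in the between-run queue indefinitely.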

Proposed Fix

Runner — expose stream_between_run_events via HTTP

Add a GET /events endpoint (no thread_id parameter) to the runner that serves stream_between_run_events:

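# In ambient_runner/endpoints/events.py
# Assumes router, Request, HTTPException, StreamingResponse, and EventEncoder
# are already imported in this module for the existing /events/{thread_id} route.
@router.get("/events")
async def get_between_run_events(request: Request):
    bridge = request.app.state.bridge
    context = bridge.context
    if not context:
        # The bridge has not finished initializing; let the backend retry.
        raise HTTPException(status_code=503, detail="Context not initialized")
    thread_id = context.session_id
    encoder = EventEncoder(accept="text/event-stream")

    async def event_stream():
        # Drain the between-run queue and encode each event as an SSE frame.
        async for event in bridge.stream_between_run_events(thread_id):
            yield encoder.encode(event)

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        # Disable caching and proxy buffering so events are flushed immediately.
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )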

stream_between_run_events runs a while True loop until the worker shuts down, so one persistent SSE connection covers the pod's lifetime. The backend's existing persistStreamedEvent + publishLine will correctly handle the incoming events (CUSTOM task events, TEXT_MESSAGE events, MESSAGES_SNAPSHOT, RUN_FINISHED).

Also remove the duplicate router registration from app.py:325.

gRPC path — wire between-run queue to GRPCMessageWriter

In GRPCSessionListener, after a worker is ready, spawn a background task that:

  1. Calls bridge.stream_between_run_events(session_id)
  2. For each RUN_STARTED: creates a new GRPCMessageWriter instance
  3. For each MESSAGES_SNAPSHOT: feeds it to the writer (accumulates messages)
  4. For each RUN_FINISHED: calls writer._write_message(status="completed") to persist to DB

This ensures between-run agent responses are also persisted in gRPC-mode deployments.
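The four steps above could look roughly like the sketch below. It is an illustration under assumptions: `SketchWriter` is a hypothetical stand-in for GRPCMessageWriter that appends to a list instead of writing to the DB, `pump_between_run_events` is an invented name for the background task, and `fake_stream` replaces bridge.stream_between_run_events(session_id).

```python
import asyncio
from typing import AsyncIterator, Dict, List, Optional


class SketchWriter:
    """Hypothetical stand-in for GRPCMessageWriter; 'persists' into a list."""

    def __init__(self, store: List[Dict]) -> None:
        self.store = store
        self.messages: List[str] = []

    def consume(self, event: Dict) -> None:
        # Each snapshot replaces the accumulated message list.
        self.messages = list(event["messages"])

    def write(self, status: str) -> None:
        self.store.append({"status": status, "messages": self.messages})


async def pump_between_run_events(events: AsyncIterator[Dict],
                                  store: List[Dict]) -> None:
    writer: Optional[SketchWriter] = None
    async for event in events:
        kind = event["type"]
        if kind == "RUN_STARTED":
            writer = SketchWriter(store)        # step 2: new writer per run
        elif kind == "MESSAGES_SNAPSHOT" and writer is not None:
            writer.consume(event)               # step 3: accumulate messages
        elif kind == "RUN_FINISHED" and writer is not None:
            writer.write(status="completed")    # step 4: persist the run
            writer = None
        # Other events (e.g. CUSTOM task events) are ignored in this sketch.


async def fake_stream():
    yield {"type": "CUSTOM", "name": "task:completed"}
    yield {"type": "RUN_STARTED"}
    yield {"type": "MESSAGES_SNAPSHOT", "messages": ["Poll exited with code 3"]}
    yield {"type": "RUN_FINISHED"}


persisted: List[Dict] = []
asyncio.run(pump_between_run_events(fake_stream(), persisted))
```

Creating one writer per RUN_STARTED keeps each between-run response as its own persisted record rather than merging it into the previous user turn.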

Secondary Issues Found During Investigation

  • ensureBetweenRunListener only starts on first user message — no restart if the listener gives up
  • cleanupStaleSessions deletes the tracking key without stopping the goroutine, allowing potential duplicate listeners
  • _stream_claude_sdk is called with frontend_tool_names=set() in between-run context, so HITL tools in between-run responses would not be detected as halting
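The first two secondary issues concern listener lifecycle. The backend is written in Go, so the following asyncio sketch is only an analogy for the intended behaviour: `ListenerRegistry`, `ensure`, and `cleanup` are hypothetical names, not functions in agui_proxy.go. It restarts a listener that has exited and cancels the task before dropping its tracking key.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class ListenerRegistry:
    """Hypothetical registry: at most one live listener per session."""

    def __init__(self, listen: Callable[[str], Awaitable[None]]) -> None:
        self._listen = listen
        self._tasks: Dict[str, asyncio.Task] = {}

    def ensure(self, session_id: str) -> bool:
        """Start a listener unless a live one exists. Returns True if started."""
        task = self._tasks.get(session_id)
        if task is not None and not task.done():
            return False  # still running: do not create a duplicate
        self._tasks[session_id] = asyncio.ensure_future(self._listen(session_id))
        return True

    def cleanup(self, session_id: str) -> None:
        # Stop the listener before forgetting about it.
        task = self._tasks.pop(session_id, None)
        if task is not None:
            task.cancel()


async def demo():
    async def gives_up(session_id: str) -> None:
        return  # simulates a listener that exhausted its retries

    async def long_running(session_id: str) -> None:
        await asyncio.sleep(3600)

    flaky = ListenerRegistry(gives_up)
    started = flaky.ensure("s1")
    await asyncio.sleep(0.01)          # let the listener exit
    restarted = flaky.ensure("s1")     # exited listener is replaced

    steady = ListenerRegistry(long_running)
    steady.ensure("s2")
    duplicate = steady.ensure("s2")    # live listener: no second task
    task = steady._tasks["s2"]
    steady.cleanup("s2")
    await asyncio.sleep(0.01)          # let the cancellation propagate
    return started, restarted, duplicate, task.cancelled()


started, restarted, duplicate, cancelled = asyncio.run(demo())
```

In Go the equivalent would likely pair each goroutine with a context.CancelFunc stored alongside the tracking key, but that design is a suggestion rather than something the investigation confirmed.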

Affected Files

| File | Issue |
| --- | --- |
| components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py:374 | stream_between_run_events correct but never called |
| components/runners/ambient-runner/ambient_runner/endpoints/events.py | Missing GET /events route for between-run stream |
| components/runners/ambient-runner/ambient_runner/app.py:325 | Duplicate events router registration |
| components/backend/websocket/agui_proxy.go:1179 | /events URL missing required thread_id; even correct URL would read wrong queue |
| components/runners/ambient-runner/ambient_runner/bridges/claude/grpc_transport.py | No between-run queue integration in gRPC listener |

Severity

High. Any session using run_in_background: true (including all background tool calls from subagents) produces agent responses that are invisibly dropped. No error is shown to the user. The agent does work but appears silent.
