# LLM Streaming Refactor Plan

## Observed LiteLLM stream event types

LiteLLM emits `ResponsesAPIStreamEvents` values while streaming. The current enum members and their string values are:

- `response.created`
- `response.in_progress`
- `response.completed`
- `response.failed`
- `response.incomplete`
- `response.output_item.added`
- `response.output_item.done`
- `response.output_text.delta`
- `response.output_text.done`
- `response.output_text.annotation.added`
- `response.reasoning_summary_text.delta`
- `response.reasoning_summary_part.added`
- `response.function_call_arguments.delta`
- `response.function_call_arguments.done`
- `response.mcp_call_arguments.delta`
- `response.mcp_call_arguments.done`
- `response.mcp_call.in_progress`
- `response.mcp_call.completed`
- `response.mcp_call.failed`
- `response.mcp_list_tools.in_progress`
- `response.mcp_list_tools.completed`
- `response.mcp_list_tools.failed`
- `response.file_search_call.in_progress`
- `response.file_search_call.searching`
- `response.file_search_call.completed`
- `response.web_search_call.in_progress`
- `response.web_search_call.searching`
- `response.web_search_call.completed`
- `response.refusal.delta`
- `response.refusal.done`
- `error`
- `response.content_part.added`
- `response.content_part.done`

These events conceptually fall into buckets we care about for visualization and higher-level semantics:

| Category | Events | Notes |
| --- | --- | --- |
| **Lifecycle / status** | created, in_progress, completed, failed, incomplete, *_call.* events, output_item.added/done, content_part.added/done, error | inform our UI state but typically not shown inline |
| **Assistant text** | output_text.delta, output_text.done, output_text.annotation.added | forms the "Message" body |
| **Reasoning summary** | reasoning_summary_part.added, reasoning_summary_text.delta | feed into Reasoning blobs |
| **Function / tool arguments** | function_call_arguments.delta/done, mcp_call_arguments.delta/done | update Action sections |
| **Refusal** | refusal.delta/done | render special refusal text |
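
The bucketing lends itself to simple prefix matching on the raw event type string. A minimal sketch (the function name and the exact category literals are illustrative, not an existing API):

```python
def classify_event_type(event_type: str) -> str:
    """Map a raw LiteLLM event type string to one of the buckets above."""
    if event_type.startswith("response.output_text."):
        return "assistant"
    if event_type.startswith("response.reasoning_summary"):
        return "reasoning"
    if event_type.startswith(("response.function_call_arguments.",
                              "response.mcp_call_arguments.")):
        return "function_arguments"
    if event_type.startswith("response.refusal."):
        return "refusal"
    # Everything else (lifecycle, *_call.* status, errors) is status-only.
    return "status"
```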

## Problems to resolve

1. **Streaming display duplicates content and forces line breaks.** Each delta is currently emitted as a separate Rich `print` call with `end=""`; because no `Live` panel is used, the console still injects newlines between calls, so output degrades into `word\nword\n...`.
2. **No per-message aggregation.** All reasoning deltas accumulate into a single global area, so later messages overwrite earlier context. We need separate buffers per "logical container" (assistant message, reasoning summary, function call), each associated with the owning `LLMConvertibleEvent` (e.g., `MessageEvent`, `ActionEvent`).
3. **Naming collision / clarity.** LiteLLM "events" clash with our own domain events. We should introduce a distinct abstraction, e.g. `LLMStreamChunk`, that wraps metadata about channel, indices, and the owning response item.
4. **Persistence & replay.** We still want to persist raw stream parts for clients, but the visualizer should rebuild high-level fragments from these parts when replaying history.

## Proposed model hierarchy

```
LLMStreamChunk (renamed from LLMStreamEvent)
├── part_kind: Literal["assistant", "reasoning", "function_arguments", "refusal", "status", "tool_output"]
├── text_delta: str | None
├── arguments_delta: str | None
├── response_index: int | None
├── item_id: str | None
├── chunk_type: str # raw LiteLLM value
├── is_terminal: bool
├── raw_chunk: Any # original LiteLLM payload retained for logging/replay
└── origin_metadata: dict[str, Any]
```

Keeping the raw LiteLLM payload inside each `LLMStreamChunk` means we do **not** need a separate envelope structure; logging can simply serialize the chunk directly.
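
As a concrete sketch, the tree above translates directly into a dataclass (plus the `conversation_event_id` field that roadmap step 1 adds; illustrative, not final code):

```python
from dataclasses import dataclass, field
from typing import Any, Literal

PartKind = Literal["assistant", "reasoning", "function_arguments",
                   "refusal", "status", "tool_output"]

@dataclass
class LLMStreamChunk:
    """One classified unit of a LiteLLM response stream."""
    part_kind: PartKind
    chunk_type: str                           # raw LiteLLM event type value
    text_delta: str | None = None
    arguments_delta: str | None = None
    response_index: int | None = None
    item_id: str | None = None
    is_terminal: bool = False                 # True for *.done / terminal events
    conversation_event_id: str | None = None  # resolved later (roadmap step 2)
    raw_chunk: Any = None                     # original payload for logging/replay
    origin_metadata: dict[str, Any] = field(default_factory=dict)
```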

## Visualization strategy

1. **Track a hierarchy per conversation event.** When a LiteLLM stream begins we emit a placeholder `MessageEvent` (assistant message) or `ActionEvent` (function call). Each `LLMStreamChunk` should include a `response_id`/`item_id` so we can map to the owning conversation event:
   - `output_text` → existing `MessageEvent` for the assistant response.
   - `reasoning_summary_*` → reasoning area inside `MessageEvent`.
   - `function_call_arguments_*` → arguments area inside `ActionEvent`.
2. **Use `Live` per section.** For each unique `(conversation_event_id, part_kind, item_id)` create a Rich `Live` instance that updates with concatenated text (see the sketch after this list). When the part is terminal, stop the `Live` and leave the final text in place.
3. **Avoid newlines unless emitted by the model.** We’ll join chunks using plain string concatenation and only add newline characters when the delta contains `\n` or when we intentionally insert separators (e.g., between tool JSON arguments).
4. **Segregate sections:**
   - `Reasoning:` header per `MessageEvent`. Each new reasoning item gets its own Live line under that message.
   - `Assistant:` body for natural language output, appended inside the message panel.
   - `Function Arguments:` block under each action panel, streaming JSON incrementally.
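
A minimal sketch of the concatenate-and-update pattern from items 2–3, using Rich's real `Live` and `Text` APIs (the delta iterable is illustrative):

```python
from rich.console import Console
from rich.live import Live
from rich.text import Text

console = Console()

def stream_section(deltas, style: str = "white") -> None:
    """Render a stream of text deltas in place; no newlines are injected."""
    buffer = ""
    with Live(Text("", style=style), console=console) as live:
        for delta in deltas:
            buffer += delta  # plain concatenation, per item 3
            live.update(Text(buffer, style=style))
    # Leaving the context stops the Live; the final text stays on screen.
```

Note that Rich permits only one active `Live` display per console at a time, so per-section instances have to be started and stopped sequentially; the `LiveState` sketch under roadmap step 3 handles that handoff.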

## Implementation roadmap

1. **Data model adjustments**
   - Rename the existing `LLMStreamEvent` class to `LLMStreamChunk` and extend it with richer fields: `part_kind`, `response_index`, `conversation_event_id` (populated later), `raw_chunk`, etc.
   - Create a helper to classify LiteLLM chunks into `LLMStreamChunk` instances (including mapping item IDs to the owning role/time), along the lines of the classifier sketch above.

2. **Conversation state integration**
   - When we enqueue the initial `MessageEvent`/`ActionEvent`, cache a lookup (e.g., `inflight_streams[(response_id, output_index)] = conversation_event_id`).
   - Update the `LocalConversation` token callback wrapper to attach the resolved conversation event ID onto the `LLMStreamChunk` before emitting/persisting, as sketched below.
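
   A sketch of that wiring (names follow the bullets above; the callback shape is an assumption):

   ```python
   # Populated when the placeholder MessageEvent/ActionEvent is enqueued.
   inflight_streams: dict[tuple[str, int], str] = {}

   def register_stream(response_id: str, output_index: int,
                       conversation_event_id: str) -> None:
       inflight_streams[(response_id, output_index)] = conversation_event_id

   def attach_owner(chunk: LLMStreamChunk, response_id: str) -> LLMStreamChunk:
       """Resolve the owning conversation event before emitting/persisting."""
       key = (response_id, chunk.response_index or 0)
       chunk.conversation_event_id = inflight_streams.get(key)
       return chunk
   ```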

3. **Visualizer rewrite**
   - Maintain `self._stream_views[(conversation_event_id, part_kind, item_id)] = LiveState`, where `LiveState` wraps a buffer, a style, and a `Live` instance, as sketched below.
   - On streaming updates: append to the buffer and call `live.update(Text(buffer, style=...))` without printing newlines.
   - On the final chunk: stop the `Live`, render the final static text, and optionally record it in conversation state for playback.
   - Ensure replay (when the visualizer processes stored events) converts stored parts into final text as well.
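
   A sketch of the bookkeeping (`LiveState` and `StreamVisualizer` are hypothetical; `Live`/`Text` are real Rich APIs). Since Rich allows only one active `Live` per console, the previous section is frozen before a new one starts:

   ```python
   from dataclasses import dataclass
   from rich.live import Live
   from rich.text import Text

   @dataclass
   class LiveState:
       buffer: str
       style: str
       live: Live

   class StreamVisualizer:
       def __init__(self) -> None:
           # Keyed by (conversation_event_id, part_kind, item_id).
           self._stream_views: dict[tuple[str, str, str], LiveState] = {}
           self._active: LiveState | None = None

       def on_chunk(self, key: tuple[str, str, str], delta: str,
                    is_terminal: bool, style: str = "white") -> None:
           state = self._stream_views.get(key)
           if state is None:
               if self._active is not None:  # only one Live may run at once
                   self._active.live.stop()
               live = Live(Text("", style=style))
               live.start()
               state = self._stream_views[key] = LiveState("", style, live)
               self._active = state
           state.buffer += delta
           state.live.update(Text(state.buffer, style=state.style))
           if is_terminal:
               state.live.stop()  # final text stays in place for playback
               self._active = None
   ```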

4. **Persistence / tests**
   - Update tests to ensure:
     - Multiple output_text deltas produce contiguous text without duplicates or extra newlines (see the example below).
     - Separate reasoning items create separate entries under one message event.
     - Function call arguments stream into their own block.
   - Add snapshot/log assertions to confirm persisted JSONL remains unchanged for downstream clients.
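
   The first assertion might look like this (pytest-style; the import path is hypothetical and assumes the dataclass sketched above):

   ```python
   from agent.llm_stream import LLMStreamChunk  # hypothetical module path

   def test_output_text_deltas_form_contiguous_text() -> None:
       """Consecutive deltas must concatenate without injected newlines."""
       chunks = [
           LLMStreamChunk(part_kind="assistant",
                          chunk_type="response.output_text.delta",
                          text_delta=part)
           for part in ("Hel", "lo ", "world")
       ]
       buffer = "".join(c.text_delta for c in chunks)
       assert buffer == "Hello world"
       assert "\n" not in buffer
   ```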

5. **Documentation & naming cleanup**
   - Decide on final terminology (`LLMStreamChunk`, `StreamItem`, etc.) and update code comments accordingly.
   - Document the classification table for future maintainers.

## Next actions

- [ ] Refactor the classifier to output `LLMStreamChunk` objects with a clear `part_kind`.
- [ ] Track in-flight conversation events so parts know their owner.
- [ ] Replace print-based visualizer streaming with `Live` blocks per section.
- [ ] Extend unit tests to cover multiple messages, reasoning segments, and tool calls.
- [ ] Manually validate with a long streaming example to confirm smooth in-place updates.