Commit d331abf

Document LLM streaming refactor plan
1 parent d149370 commit d331abf

llm_streaming_refactor_plan.md

# LLM Streaming Refactor Plan

## Observed LiteLLM stream event types

LiteLLM emits `ResponsesAPIStreamEvents` values while streaming. The current enum members and their string payloads are:
- `response.created`
- `response.in_progress`
- `response.completed`
- `response.failed`
- `response.incomplete`
- `response.output_item.added`
- `response.output_item.done`
- `response.output_text.delta`
- `response.output_text.done`
- `response.output_text.annotation.added`
- `response.reasoning_summary_text.delta`
- `response.reasoning_summary_part.added`
- `response.function_call_arguments.delta`
- `response.function_call_arguments.done`
- `response.mcp_call_arguments.delta`
- `response.mcp_call_arguments.done`
- `response.mcp_call.in_progress`
- `response.mcp_call.completed`
- `response.mcp_call.failed`
- `response.mcp_list_tools.in_progress`
- `response.mcp_list_tools.completed`
- `response.mcp_list_tools.failed`
- `response.file_search_call.in_progress`
- `response.file_search_call.searching`
- `response.file_search_call.completed`
- `response.web_search_call.in_progress`
- `response.web_search_call.searching`
- `response.web_search_call.completed`
- `response.refusal.delta`
- `response.refusal.done`
- `error`
- `response.content_part.added`
- `response.content_part.done`

These events conceptually fall into buckets we care about for visualization and higher-level semantics:

| Category | Events | Notes |
| --- | --- | --- |
| **Lifecycle / status** | created, in_progress, completed, failed, incomplete, *_call.* events, output_item.added/done, content_part.added/done, error | inform our UI state but are typically not shown inline |
| **Assistant text** | output_text.delta, output_text.done, output_text.annotation.added | forms the "Message" body |
| **Reasoning summary** | reasoning_summary_part.added, reasoning_summary_text.delta | feed into Reasoning blobs |
| **Function / tool arguments** | function_call_arguments.delta/done, mcp_call_arguments.delta/done | update Action sections |
| **Refusal** | refusal.delta/done | render special refusal text |
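
To make the bucketing concrete, here is a minimal sketch of how a category could be derived from the raw event-type string; `bucket_for` and the prefix tuples are illustrative names, not existing APIs.

```python
# Illustrative only: derive the table's category from a raw LiteLLM event-type string.
ASSISTANT_PREFIXES = ("response.output_text.",)
REASONING_PREFIXES = ("response.reasoning_summary_text.", "response.reasoning_summary_part.")
ARGUMENT_PREFIXES = ("response.function_call_arguments.", "response.mcp_call_arguments.")
REFUSAL_PREFIXES = ("response.refusal.",)


def bucket_for(event_type: str) -> str:
    """Classify an event-type string into one of the categories above."""
    if event_type.startswith(ASSISTANT_PREFIXES):
        return "assistant"
    if event_type.startswith(REASONING_PREFIXES):
        return "reasoning"
    if event_type.startswith(ARGUMENT_PREFIXES):
        return "function_arguments"
    if event_type.startswith(REFUSAL_PREFIXES):
        return "refusal"
    return "status"  # lifecycle, *_call.* progress, output_item, content_part, error, ...
```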

## Problems to resolve

1. **Streaming display duplicates content and forces line breaks.** We currently print each delta as its own Rich print call with `end=""`, but Live panels aren’t used and the console injects newlines between `print` calls, so output becomes `word\nword\n...`.
2. **No per-message aggregation.** All reasoning deltas accumulate into a single global area, so later messages overwrite earlier context. We need separate buffers per "logical container" (assistant message, reasoning summary, function call) associated with the owning `LLMConvertibleEvent` (e.g., `MessageEvent`, `ActionEvent`).
3. **Naming collision / clarity.** LiteLLM "events" clash with our own domain events. We should introduce a distinct abstraction, e.g. `LLMStreamChunk`, that wraps metadata about channel, indices, and owning response item.
4. **Persistence & replay.** We still want to persist raw stream parts for clients, but the visualizer should rebuild high-level fragments from these parts when replaying history.

## Proposed model hierarchy

```
LLMStreamChunk (renamed from LLMStreamEvent)
├── part_kind: Literal["assistant", "reasoning", "function_arguments", "refusal", "status", "tool_output"]
├── text_delta: str | None
├── arguments_delta: str | None
├── response_index: int | None
├── item_id: str | None
├── chunk_type: str              # raw LiteLLM value
├── is_terminal: bool
├── raw_chunk: Any               # original LiteLLM payload retained for logging/replay
└── origin_metadata: dict[str, Any]
```

Keeping the raw LiteLLM payload inside each `LLMStreamChunk` means we do **not** need a separate envelope structure; logging can simply serialize the chunk directly.
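
For concreteness, a minimal sketch of how the renamed class could look as a dataclass; the field names mirror the tree above, while the defaults and the `PartKind` alias are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Literal

# Assumed alias; mirrors the part_kind literal in the hierarchy above.
PartKind = Literal["assistant", "reasoning", "function_arguments", "refusal", "status", "tool_output"]


@dataclass
class LLMStreamChunk:
    """One streamed part of a LiteLLM response, carrying the raw payload for logging/replay."""

    part_kind: PartKind
    chunk_type: str                      # raw LiteLLM event-type value
    text_delta: str | None = None
    arguments_delta: str | None = None
    response_index: int | None = None
    item_id: str | None = None
    is_terminal: bool = False
    raw_chunk: Any = None                # original LiteLLM payload
    origin_metadata: dict[str, Any] = field(default_factory=dict)
```

Serializing for the JSONL log could then be `dataclasses.asdict` with `raw_chunk` handled separately, since the raw payload may not be JSON-serializable.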

## Visualization strategy

1. **Track a hierarchy per conversation event.** When a LiteLLM stream begins we emit a placeholder `MessageEvent` (assistant message) or `ActionEvent` (function call). Each `LLMStreamChunk` should include a `response_id`/`item_id` so we can map to the owning conversation event:
   - `output_text` → existing `MessageEvent` for the assistant response.
   - `reasoning_summary_*` → reasoning area inside `MessageEvent`.
   - `function_call_arguments_*` → arguments area inside `ActionEvent`.
2. **Use `Live` per section.** For each unique `(conversation_event_id, part_kind, item_id)` create a Rich `Live` instance that updates with concatenated text. When the part is terminal, stop the `Live` and leave the final text in place (see the sketch after this list).
3. **Avoid newlines unless emitted by the model.** We’ll join chunks using plain string concatenation and only add newline characters when the delta contains `\n` or when we intentionally insert separators (e.g., between tool JSON arguments).
4. **Segregate sections:**
   - `Reasoning:` header per `MessageEvent`. Each new reasoning item gets its own Live line under that message.
   - `Assistant:` body for natural language output, appended inside the message panel.
   - `Function Arguments:` block under each action panel, streaming JSON incrementally.
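
A minimal sketch of the `Live`-per-section idea from item 2, with one practical caveat: Rich allows only one active `Live` display per console, so this version keeps a single `Live` whose renderable is a `Group` of per-section `Text` buffers keyed by `(conversation_event_id, part_kind, item_id)`. `StreamVisualizer`, `LiveState`, `on_chunk`, and `finish` are illustrative names, not the existing visualizer API.

```python
from dataclasses import dataclass

from rich.console import Console, Group
from rich.live import Live
from rich.text import Text

SectionKey = tuple[str, str, str | None]  # (conversation_event_id, part_kind, item_id)


@dataclass
class LiveState:
    """Accumulated text and display style for one streamed section."""
    buffer: str = ""
    style: str = "default"


class StreamVisualizer:
    """Sketch: append deltas to per-section buffers and redraw them in place."""

    def __init__(self, console: Console | None = None) -> None:
        self._stream_views: dict[SectionKey, LiveState] = {}
        self._live = Live(console=console, refresh_per_second=10)
        self._started = False

    def _render(self) -> Group:
        # One Text per section, in insertion order; no newlines are injected between deltas.
        return Group(*(Text(state.buffer, style=state.style) for state in self._stream_views.values()))

    def on_chunk(self, key: SectionKey, delta: str) -> None:
        if not self._started:
            self._live.start()
            self._started = True
        state = self._stream_views.setdefault(key, LiveState())
        state.buffer += delta              # plain concatenation, per item 3 above
        self._live.update(self._render())

    def finish(self) -> None:
        """Stop the Live display, leaving the final rendered text on screen."""
        if self._started:
            self._live.stop()
            self._started = False
```

If parts never stream concurrently, a literal `Live`-per-section variant (start on the first delta of a section, stop on its terminal chunk) would also match the wording above.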

## Implementation roadmap

1. **Data model adjustments**
   - Rename the existing `LLMStreamEvent` class to `LLMStreamChunk` and extend it with richer fields: `part_kind`, `response_index`, `conversation_event_id` (populated later), `raw_chunk`, etc.
   - Create a helper to classify LiteLLM chunks into `LLMStreamChunk` instances (including mapping item IDs to owning role/time); a classifier sketch follows this item.
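
A sketch of what that classification helper could look like, reusing `bucket_for` and `LLMStreamChunk` from the earlier sketches. The attribute names read off the raw payload (`type`, `delta`, `item_id`, `output_index`) are assumptions about LiteLLM's chunk shape and should be verified against the actual objects.

```python
from typing import Any

# Suffixes that mark a part as finished; assumed, based on the event list above.
TERMINAL_SUFFIXES = (".done", ".completed", ".failed", ".incomplete")


def classify_chunk(raw: Any) -> LLMStreamChunk:
    """Wrap one raw LiteLLM stream payload in an LLMStreamChunk (sketch)."""
    chunk_type: str = getattr(raw, "type", "")
    kind = bucket_for(chunk_type)
    delta = getattr(raw, "delta", None)
    return LLMStreamChunk(
        part_kind=kind,
        chunk_type=chunk_type,
        text_delta=delta if kind in ("assistant", "reasoning", "refusal") else None,
        arguments_delta=delta if kind == "function_arguments" else None,
        response_index=getattr(raw, "output_index", None),
        item_id=getattr(raw, "item_id", None),
        is_terminal=chunk_type.endswith(TERMINAL_SUFFIXES),
        raw_chunk=raw,
    )
```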

2. **Conversation state integration**
   - When we enqueue the initial `MessageEvent`/`ActionEvent`, cache a lookup (e.g., `inflight_streams[(response_id, output_index)] = conversation_event_id`); see the sketch after this item.
   - Update the `LocalConversation` token callback wrapper to attach the resolved conversation event ID onto the `LLMStreamChunk` before emitting/persisting.
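
A sketch of the in-flight lookup and callback wrapper described in this item; `register_stream`, `wrap_token_callback`, and the use of `origin_metadata` to carry the owner ID are assumptions (the plan alternatively proposes a dedicated `conversation_event_id` field on the chunk).

```python
from typing import Callable

# (response_id, output_index) -> conversation_event_id, filled in when the
# placeholder MessageEvent / ActionEvent is enqueued.
inflight_streams: dict[tuple[str, int], str] = {}


def register_stream(response_id: str, output_index: int, conversation_event_id: str) -> None:
    inflight_streams[(response_id, output_index)] = conversation_event_id


def wrap_token_callback(
    emit: Callable[[LLMStreamChunk], None], response_id: str
) -> Callable[[LLMStreamChunk], None]:
    """Attach the owning conversation event ID to each chunk before emitting it."""

    def callback(chunk: LLMStreamChunk) -> None:
        key = (response_id, chunk.response_index or 0)
        chunk.origin_metadata["conversation_event_id"] = inflight_streams.get(key)
        emit(chunk)

    return callback
```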

3. **Visualizer rewrite**
   - Maintain `self._stream_views[(conversation_event_id, part_kind, item_id)] = LiveState` where `LiveState` wraps buffer, style, and a `Live` instance.
   - On streaming updates: update buffer, `live.update(Text(buffer, style=...))` without printing newlines.
   - On final chunk: stop `Live`, render final static text, and optionally record in conversation state for playback.
   - Ensure replay (when visualizer processes stored events) converts stored parts into final text as well.

4. **Persistence / tests**
   - Update tests to ensure:
     - Multiple output_text deltas produce contiguous text without duplicates or extra newlines (see the test sketch after this item).
     - Separate reasoning items create separate entries under one message event.
     - Function call arguments stream into their own block.
   - Add snapshot/log assertions to confirm persisted JSONL remains unchanged for downstream clients.
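
As a starting point for the first bullet, a pytest-style sketch against the `StreamVisualizer` sketch from the visualization section; writing to an in-memory console keeps test output quiet. Names and assertions are illustrative only.

```python
import io

from rich.console import Console


def test_output_text_deltas_are_contiguous() -> None:
    # Stream several deltas for one assistant section and check the aggregate buffer.
    viz = StreamVisualizer(console=Console(file=io.StringIO()))
    key = ("msg-1", "assistant", "item-1")
    for delta in ("Hello", ", ", "world", "!"):
        viz.on_chunk(key, delta)
    viz.finish()
    buffer = viz._stream_views[key].buffer
    assert buffer == "Hello, world!"     # contiguous, no duplicates
    assert "\n" not in buffer            # no injected newlines
```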

5. **Documentation & naming cleanup**
   - Decide on final terminology (`LLMStreamChunk`, `StreamItem`, etc.) and update code comments accordingly.
   - Document the classification table for future maintainers.

## Next actions

- [ ] Refactor the classifier to output `LLMStreamChunk` objects with a clear `part_kind`.
- [ ] Track in-flight conversation events so parts know their owner.
- [ ] Replace print-based visualizer streaming with `Live` blocks per section.
- [ ] Extend unit tests to cover multiple messages, reasoning segments, and tool calls.
- [ ] Manually validate with a long streaming example to confirm smooth in-place updates.