
Add Full Voice Agent Pipeline (ASR → LLM → TTS) with Real-time Streaming #17


Parent issue: #16 — Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing

Motivation

Once we have OpenAI-compatible ASR and TTS endpoints (see #16), the natural next step is to enable full local voice agents.

Users want to talk to their models conversationally — just like ChatGPT Voice or Grok Voice, but fully offline, private, and running on NeuralDrive.

This feature would turn NeuralDrive into a complete local voice AI appliance.

Goal

Add a high-level endpoint that orchestrates the full voice loop:

User speaks → VAD → ASR → LLM → TTS → User hears response

With support for:

  • Natural turn-taking
  • Real-time streaming responses
  • Interruption handling
  • Low latency

Proposed Design

New Primary Endpoint

  • POST /v1/audio/voice-chat (or POST /v1/audio/chat/completions for compatibility)

This endpoint would accept audio input and return streaming audio output, while internally handling the full pipeline.
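For illustration, a client call might look like the sketch below, assuming Caddy fronts the API on localhost, a multipart audio upload, and a `stream` flag. The field names (`file`, `model`, `voice`, `stream`) are placeholders, not a final spec:

```python
# Hypothetical request shape for the proposed endpoint; field names are
# placeholders, not a final contract.
import httpx

with httpx.Client(base_url="http://localhost", timeout=None) as client:
    with open("question.wav", "rb") as f:
        with client.stream(
            "POST",
            "/v1/audio/voice-chat",
            files={"file": ("question.wav", f, "audio/wav")},
            data={"model": "default", "voice": "default", "stream": "true"},
        ) as resp:
            resp.raise_for_status()
            with open("reply.wav", "wb") as out:
                # Write the spoken reply as chunks arrive (streaming mode).
                for chunk in resp.iter_bytes():
                    out.write(chunk)
```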

High-Level Architecture

User Audio
    ↓
VAD (Voice Activity Detection)          ← Silero VAD or WebRTC VAD
    ↓
ASR Service (/v1/audio/transcriptions)  ← faster-whisper
    ↓
LLM Service (/v1/chat/completions)      ← Ollama / oLLM
    ↓
TTS Service (/v1/audio/speech)          ← Kokoro (primary) + Piper (fallback)
    ↓
Streaming Audio Response

Key Components

| Component | Recommended Tech | Purpose | Notes |
|---|---|---|---|
| VAD | Silero VAD (ONNX) | Detect when the user starts/stops speaking | Lightweight, runs on CPU |
| ASR | faster-whisper | Transcribe user speech | Already planned |
| LLM | Ollama / oLLM | Generate response | Existing |
| TTS | Kokoro (primary) + Piper (fallback) | Generate spoken response | Already planned |
| Orchestrator | New FastAPI service or extension of the System API | Manage the full pipeline | Core new component |
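
For the VAD piece, Silero VAD is loadable via torch.hub. A minimal chunk-by-chunk sketch follows; the 16 kHz / 512-sample framing matches Silero's convention, but the file name and defaults here are illustrative:

```python
# Minimal Silero VAD sketch (runs on CPU). Chunking and sample rate follow
# Silero's 16 kHz / 512-sample convention; the input file is illustrative.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
_, _, read_audio, VADIterator, _ = utils

vad = VADIterator(model, sampling_rate=16000)
audio = read_audio("mic_capture.wav", sampling_rate=16000)

for i in range(0, len(audio), 512):
    chunk = audio[i : i + 512]
    if len(chunk) < 512:
        break
    event = vad(chunk, return_seconds=True)
    if event is not None:
        print(event)  # {'start': ...} or {'end': ...} marks a turn boundary

vad.reset_states()  # reset between utterances
```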

Implementation Details

1. New Orchestration Service

  • Create neuraldrive-voice-agent.service
  • Runs on an internal port (e.g. :8003)
  • Handles the full conversation loop with proper state management (conversation history, turn detection, etc.); a single-turn sketch follows below
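
A rough sketch of one turn through that loop, chaining the existing internal services. The ASR/TTS ports and payload details are assumptions (Ollama's default port is 11434):

```python
# Sketch of one voice turn in the orchestrator: ASR -> LLM -> TTS.
# Internal ports and payload shapes are assumptions, not the final design.
import httpx

ASR_URL = "http://localhost:8001/v1/audio/transcriptions"  # faster-whisper (assumed port)
LLM_URL = "http://localhost:11434/v1/chat/completions"     # Ollama default port
TTS_URL = "http://localhost:8002/v1/audio/speech"          # Kokoro/Piper (assumed port)

async def voice_turn(audio_bytes: bytes, history: list[dict]) -> bytes:
    async with httpx.AsyncClient(timeout=None) as client:
        # 1. Transcribe the user's speech.
        asr = await client.post(
            ASR_URL,
            files={"file": ("turn.wav", audio_bytes, "audio/wav")},
            data={"model": "whisper-1"},
        )
        user_text = asr.json()["text"]
        history.append({"role": "user", "content": user_text})

        # 2. Generate the assistant reply with full conversation history.
        llm = await client.post(LLM_URL, json={"model": "default", "messages": history})
        reply = llm.json()["choices"][0]["message"]["content"]
        history.append({"role": "assistant", "content": reply})

        # 3. Synthesize the reply to audio.
        tts = await client.post(
            TTS_URL, json={"model": "kokoro", "voice": "default", "input": reply}
        )
        return tts.content
```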

2. Streaming Support

  • Support both:
    • Non-streaming (wait for full response)
    • Streaming (send partial TTS audio chunks as they're generated), which is critical for low perceived latency; a sketch follows below
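
One plausible way to implement the streaming path is to flush TTS audio at sentence boundaries while LLM tokens are still arriving, so playback starts before the full reply exists. A hedged FastAPI sketch, reusing the assumed ports above (the ASR step is elided):

```python
# Streaming sketch: synthesize and flush audio per completed sentence while
# the LLM is still generating. Ports, model names, and media type are
# assumptions consistent with the single-turn sketch above.
import json
import re

import httpx
from fastapi import FastAPI, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()
LLM_URL = "http://localhost:11434/v1/chat/completions"
TTS_URL = "http://localhost:8002/v1/audio/speech"

async def synthesize(client: httpx.AsyncClient, text: str) -> bytes:
    # One TTS round-trip per sentence via the /v1/audio/speech route.
    r = await client.post(TTS_URL, json={"model": "kokoro", "voice": "default", "input": text})
    return r.content

@app.post("/v1/audio/voice-chat")
async def voice_chat(file: UploadFile):
    user_text = "..."  # result of the ASR call (see the single-turn sketch)

    async def audio_chunks():
        async with httpx.AsyncClient(timeout=None) as client:
            buffer = ""
            payload = {
                "model": "default",
                "stream": True,
                "messages": [{"role": "user", "content": user_text}],
            }
            async with client.stream("POST", LLM_URL, json=payload) as resp:
                async for line in resp.aiter_lines():
                    # OpenAI-style SSE: "data: {json}" per chunk, "data: [DONE]" at end.
                    if not line.startswith("data: ") or line.endswith("[DONE]"):
                        continue
                    delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
                    buffer += delta.get("content") or ""
                    # Flush each completed sentence to TTS immediately.
                    while (m := re.search(r"[.!?]\s", buffer)):
                        sentence, buffer = buffer[: m.end()], buffer[m.end():]
                        yield await synthesize(client, sentence)
            if buffer.strip():
                yield await synthesize(client, buffer)

    return StreamingResponse(audio_chunks(), media_type="audio/wav")
```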

3. Caddy Routing

Add one new route:

```
handle /v1/audio/voice-chat {
    reverse_proxy localhost:8003
}
```

Note that for the streaming path the proxy must not buffer the response; Caddy's flush_interval -1 option on reverse_proxy forces chunks to be flushed to the client immediately.

4. System API Extensions

New endpoints:

  • POST /v1/audio/voice-chat (main endpoint)
  • GET /system/voice-agent/status
  • POST /system/voice-agent/config (adjust VAD sensitivity, TTS voice, temperature, etc.); a possible schema is sketched below
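
The config endpoint could accept a small JSON body. A hypothetical Pydantic schema of the tunables, where the field names and defaults are illustrative rather than a contract:

```python
# Hypothetical config schema for POST /system/voice-agent/config;
# field names and defaults are illustrative, not a final contract.
from pydantic import BaseModel, Field

class VoiceAgentConfig(BaseModel):
    vad_threshold: float = Field(0.5, ge=0.0, le=1.0)  # Silero VAD sensitivity
    vad_min_silence_ms: int = 300                      # silence that ends a turn
    tts_voice: str = "default"                         # Kokoro/Piper voice id
    llm_temperature: float = Field(0.7, ge=0.0, le=2.0)
    streaming: bool = True                             # stream partial audio
```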

5. TUI Enhancements

  • Add a "Voice Agent" mode in the Textual TUI (a minimal layout sketch follows this list)
  • Show live transcription + response
  • Allow quick voice testing from the console
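
A minimal Textual sketch of what that mode could look like; the layout, bindings, and pipeline wiring are illustrative only:

```python
# Illustrative Textual screen for a "Voice Agent" mode: a live transcript
# log plus a push-to-talk binding. Wiring to the pipeline is omitted.
from textual.app import App, ComposeResult
from textual.widgets import Footer, Header, RichLog

class VoiceAgentApp(App):
    BINDINGS = [("space", "push_to_talk", "Talk"), ("q", "quit", "Quit")]

    def compose(self) -> ComposeResult:
        yield Header()
        yield RichLog(highlight=True, markup=True)  # live transcript + replies
        yield Footer()

    def action_push_to_talk(self) -> None:
        log = self.query_one(RichLog)
        log.write("[bold]You:[/bold] (live transcription here)")
        log.write("[bold]Agent:[/bold] (streamed response here)")

if __name__ == "__main__":
    VoiceAgentApp().run()
```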

6. Advanced Features (Future)

  • Interruption detection (user can cut off the AI mid-response); a barge-in sketch follows this list
  • Multi-turn memory with context window management
  • Support for multiple voices / personas
  • Optional wake-word detection (e.g. "Hey NeuralDrive")
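
For the interruption case, one possible shape is to keep VAD running during playback and cancel the playback task as soon as speech is detected. Everything passed in here (the playback coroutine, microphone iterator, and VAD callable) is hypothetical:

```python
# Barge-in sketch: cancel agent playback as soon as the user starts speaking.
# play_coro, mic_chunks, and vad are hypothetical stand-ins for the real
# playback task, microphone frame source, and Silero VADIterator.
import asyncio

async def speak_with_barge_in(play_coro, mic_chunks, vad):
    playback = asyncio.create_task(play_coro)
    async for chunk in mic_chunks:      # live microphone frames
        if vad(chunk) is not None:      # VAD reports a speech 'start' event
            playback.cancel()           # barge-in: stop mid-response
            break
    try:
        await playback
    except asyncio.CancelledError:
        pass                            # interrupted by the user
```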

Phased Implementation Roadmap

| Phase | Feature | Priority |
|---|---|---|
| 1 | Basic non-streaming voice-chat endpoint | High |
| 2 | Real-time streaming TTS responses | High |
| 3 | Natural turn-taking + VAD tuning | Medium |
| 4 | Interruption handling | Medium |
| 5 | Wake word + always-listening mode | Low |

Benefits

  • Transforms NeuralDrive from "text LLM server" into a complete local voice AI device
  • Enables powerful new use cases: voice assistants, accessibility tools, hands-free coding, smart home control, etc.
  • Fully private and offline
  • Leverages the audio foundation from #16 (Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing)
  • Maintains the same simple, secure, OpenAI-compatible interface

Open Questions

  1. Should the voice-chat endpoint be stateful (maintains conversation history) or stateless (client manages history)?
  2. Do we want to support function calling inside the voice pipeline (e.g., "turn on the lights")?
  3. Should we offer different "personalities" or system prompts optimized for voice?
  4. Performance target: What's acceptable end-to-end latency for a good experience? (Target: < 800ms from end of speech to start of response?)
  5. Should this feature be enabled by default or opt-in (due to higher resource usage)?
