Parent issue: #16 — Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing
## Motivation
Once we have OpenAI-compatible ASR and TTS endpoints (see #16), the natural next step is to enable full local voice agents.
Users want to talk to their models conversationally — just like ChatGPT Voice or Grok Voice, but fully offline, private, and running on NeuralDrive.
This feature would turn NeuralDrive into a complete local voice AI appliance.
## Goal

Add a high-level endpoint that orchestrates the full voice loop:

User speaks → VAD → ASR → LLM → TTS → User hears response

With support for:

- Natural turn-taking
- Real-time streaming responses
- Interruption handling
- Low latency
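As a rough sketch of that loop, the turn below wires the four stages together with the service calls stubbed out (the function names and stub behavior are illustrative placeholders, not a proposed API):

```python
def detect_speech(audio: bytes) -> bytes:
    """VAD stage: trim to the voiced segment (stubbed as pass-through)."""
    return audio

def transcribe(audio: bytes) -> str:
    """ASR stage: would call POST /v1/audio/transcriptions (stubbed)."""
    return "hello"

def generate_reply(text: str, history: list) -> str:
    """LLM stage: would call POST /v1/chat/completions (stubbed)."""
    return f"you said: {text}"

def synthesize(text: str) -> bytes:
    """TTS stage: would call POST /v1/audio/speech (stubbed)."""
    return text.encode()

def voice_turn(audio: bytes, history: list) -> bytes:
    """One conversational turn: VAD -> ASR -> LLM -> TTS."""
    speech = detect_speech(audio)
    user_text = transcribe(speech)
    reply = generate_reply(user_text, history)
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)
```

The orchestrator would replace each stub with an HTTP call to the corresponding internal service.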
## Proposed Design

### New Primary Endpoint

`POST /v1/audio/voice-chat` (or `POST /v1/audio/chat/completions` for compatibility)

This endpoint would accept audio input and return streaming audio output, while internally handling the full pipeline.
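A client call might then look like the following. This is only a sketch: the field names, the `voice` parameter, and the host are assumptions modeled on the OpenAI audio API style, not a settled contract.

```python
import json

def build_voice_chat_request(audio_path: str, voice: str = "default",
                             stream: bool = True) -> dict:
    """Assemble keyword arguments for an HTTP client (e.g. requests.post)
    targeting the proposed endpoint. All field names are hypothetical."""
    return {
        "url": "http://localhost/v1/audio/voice-chat",
        "files": {"file": audio_path},
        "data": {"voice": voice, "stream": json.dumps(stream)},
    }
```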
### High-Level Architecture

```
User Audio
    ↓
VAD (Voice Activity Detection) ← Silero VAD or WebRTC VAD
    ↓
ASR Service (/v1/audio/transcriptions) ← faster-whisper
    ↓
LLM Service (/v1/chat/completions) ← Ollama / oLLM
    ↓
TTS Service (/v1/audio/speech) ← Kokoro (primary) + Piper (fallback)
    ↓
Streaming Audio Response
```
### Key Components

| Component | Recommended Tech | Purpose | Notes |
| --- | --- | --- | --- |
| VAD | Silero VAD (ONNX) | Detect when user starts/stops speaking | Lightweight, runs on CPU |
| ASR | faster-whisper | Transcribe user speech | Already planned |
| LLM | Ollama / oLLM | Generate response | Existing |
| TTS | Kokoro (primary) + Piper | Generate spoken response | Already planned |
| Orchestrator | New FastAPI service or extension of System API | Manage the full pipeline | Core new component |
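To make the VAD role concrete, here is a crude energy-based stand-in for Silero VAD over 16-bit mono PCM. It only illustrates where turn detection plugs into the pipeline; the thresholds are arbitrary and the real implementation would use the Silero model.

```python
import math
import struct

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Energy-based VAD over 16-bit little-endian mono PCM.
    A stand-in for Silero VAD, not a substitute for it."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold

def end_of_turn(frames: list, silence_frames: int = 10) -> bool:
    """Declare the user's turn over after `silence_frames`
    consecutive quiet frames."""
    tail = frames[-silence_frames:]
    return len(tail) == silence_frames and not any(is_speech(f) for f in tail)
```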
## Implementation Details

### 1. New Orchestration Service

- Create `neuraldrive-voice-agent.service`
- Runs on an internal port (e.g. `:8003`)
- Handles the full conversation loop with proper state management (conversation history, turn detection, etc.)
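The per-session state could be as simple as the sketch below. The field names and the history-trimming policy are illustrative assumptions, not a design decision.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Per-session state the orchestrator keeps between turns."""
    history: list = field(default_factory=list)
    max_turns: int = 20  # bound memory use on long sessions

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self.history.append({"role": "user", "content": user_text})
        self.history.append({"role": "assistant", "content": assistant_text})
        # Drop the oldest messages once the window is exceeded.
        excess = len(self.history) - 2 * self.max_turns
        if excess > 0:
            del self.history[:excess]
```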
### 2. Streaming Support

Support both:

- Non-streaming (wait for full response)
- Streaming (send partial TTS audio chunks as they're generated) — critical for low perceived latency
### 3. Caddy Routing

Add one new route:
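For example, something like the following Caddyfile fragment (Caddy v2 syntax; the exact matcher is an assumption, the `:8003` port follows the suggestion above):

```caddyfile
# New route in the existing site block: proxy voice-chat
# requests to the orchestration service.
handle /v1/audio/voice-chat* {
	reverse_proxy localhost:8003
}
```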
### 4. System API Extensions

New endpoints:

- `POST /v1/audio/voice-chat` (main endpoint)
- `GET /system/voice-agent/status`
- `POST /system/voice-agent/config` (adjust VAD sensitivity, TTS voice, temperature, etc.)

### 5. TUI Enhancements
### 6. Advanced Features (Future)

## Phased Implementation Roadmap

## Benefits

## Open Questions