
Add Full Voice Agent Pipeline (ASR → LLM → TTS) with Real-time Streaming #17


Parent issue: #16 — Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing

Motivation

Once we have OpenAI-compatible ASR and TTS endpoints (see #16), the natural next step is to enable full local voice agents.

Users want to talk to their models conversationally — just like ChatGPT Voice or Grok Voice, but fully offline, private, and running on NeuralDrive.

This feature would turn NeuralDrive into a complete local voice AI appliance.

Goal

Add a high-level endpoint that orchestrates the full voice loop:

User speaks → VAD → ASR → LLM → TTS → User hears response

With support for:

  • Natural turn-taking
  • Real-time streaming responses
  • Interruption handling
  • Low latency

Proposed Design

New Primary Endpoint

  • POST /v1/audio/voice-chat (or POST /v1/audio/chat/completions for compatibility)

This endpoint would accept audio input and return streaming audio output, while internally handling the full pipeline.
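For illustration, a client call might look like the sketch below, assuming Caddy fronts the API on localhost, a multipart audio upload, and a `stream` flag. The field names (`file`, `model`, `voice`, `stream`) are placeholders, not a final spec:

```python
# Hypothetical request shape for the proposed endpoint; field names are
# placeholders, not a final contract.
import httpx

with httpx.Client(base_url="http://localhost", timeout=None) as client:
    with open("question.wav", "rb") as f:
        with client.stream(
            "POST",
            "/v1/audio/voice-chat",
            files={"file": ("question.wav", f, "audio/wav")},
            data={"model": "default", "voice": "default", "stream": "true"},
        ) as resp:
            resp.raise_for_status()
            with open("reply.wav", "wb") as out:
                # Write the spoken reply as chunks arrive (streaming mode).
                for chunk in resp.iter_bytes():
                    out.write(chunk)
```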

High-Level Architecture

User Audio
    ↓
VAD (Voice Activity Detection)          ← Silero VAD or WebRTC VAD
    ↓
ASR Service (/v1/audio/transcriptions)  ← faster-whisper
    ↓
LLM Service (/v1/chat/completions)      ← Ollama / oLLM
    ↓
TTS Service (/v1/audio/speech)          ← Kokoro (primary) + Piper (fallback)
    ↓
Streaming Audio Response

Key Components

| Component | Recommended Tech | Purpose | Notes |
|---|---|---|---|
| VAD | Silero VAD (ONNX) | Detect when the user starts/stops speaking | Lightweight, runs on CPU |
| ASR | faster-whisper | Transcribe user speech | Already planned |
| LLM | Ollama / oLLM | Generate response | Existing |
| TTS | Kokoro (primary) + Piper (fallback) | Generate spoken response | Already planned |
| Orchestrator | New FastAPI service or extension of the System API | Manage the full pipeline | Core new component |
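
For the VAD piece, Silero VAD is loadable via torch.hub. A minimal chunk-by-chunk sketch follows; the 16 kHz / 512-sample framing matches Silero's convention, but the file name and defaults here are illustrative:

```python
# Minimal Silero VAD sketch (runs on CPU). Chunking and sample rate follow
# Silero's 16 kHz / 512-sample convention; the input file is illustrative.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
_, _, read_audio, VADIterator, _ = utils

vad = VADIterator(model, sampling_rate=16000)
audio = read_audio("mic_capture.wav", sampling_rate=16000)

for i in range(0, len(audio), 512):
    chunk = audio[i : i + 512]
    if len(chunk) < 512:
        break
    event = vad(chunk, return_seconds=True)
    if event is not None:
        print(event)  # {'start': ...} or {'end': ...} marks a turn boundary

vad.reset_states()  # reset between utterances
```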

Implementation Details

1. New Orchestration Service

  • Create neuraldrive-voice-agent.service
  • Runs on an internal port (e.g. :8003)
  • Handles the full conversation loop with proper state management (conversation history, turn detection, etc.); a single-turn sketch follows below
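
A rough sketch of one turn through that loop, chaining the existing internal services. The ASR/TTS ports and payload details are assumptions (Ollama's default port is 11434):

```python
# Sketch of one voice turn in the orchestrator: ASR -> LLM -> TTS.
# Internal ports and payload shapes are assumptions, not the final design.
import httpx

ASR_URL = "http://localhost:8001/v1/audio/transcriptions"  # faster-whisper (assumed port)
LLM_URL = "http://localhost:11434/v1/chat/completions"     # Ollama default port
TTS_URL = "http://localhost:8002/v1/audio/speech"          # Kokoro/Piper (assumed port)

async def voice_turn(audio_bytes: bytes, history: list[dict]) -> bytes:
    async with httpx.AsyncClient(timeout=None) as client:
        # 1. Transcribe the user's speech.
        asr = await client.post(
            ASR_URL,
            files={"file": ("turn.wav", audio_bytes, "audio/wav")},
            data={"model": "whisper-1"},
        )
        user_text = asr.json()["text"]
        history.append({"role": "user", "content": user_text})

        # 2. Generate the assistant reply with full conversation history.
        llm = await client.post(LLM_URL, json={"model": "default", "messages": history})
        reply = llm.json()["choices"][0]["message"]["content"]
        history.append({"role": "assistant", "content": reply})

        # 3. Synthesize the reply to audio.
        tts = await client.post(
            TTS_URL, json={"model": "kokoro", "voice": "default", "input": reply}
        )
        return tts.content
```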

2. Streaming Support

  • Support both:
    • Non-streaming (wait for full response)
    • Streaming (send partial TTS audio chunks as they're generated), which is critical for low perceived latency; a sketch follows below
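
One plausible way to implement the streaming path is to flush TTS audio at sentence boundaries while LLM tokens are still arriving, so playback starts before the full reply exists. A hedged FastAPI sketch, reusing the assumed ports above (the ASR step is elided):

```python
# Streaming sketch: synthesize and flush audio per completed sentence while
# the LLM is still generating. Ports, model names, and media type are
# assumptions consistent with the single-turn sketch above.
import json
import re

import httpx
from fastapi import FastAPI, UploadFile
from fastapi.responses import StreamingResponse

app = FastAPI()
LLM_URL = "http://localhost:11434/v1/chat/completions"
TTS_URL = "http://localhost:8002/v1/audio/speech"

async def synthesize(client: httpx.AsyncClient, text: str) -> bytes:
    # One TTS round-trip per sentence via the /v1/audio/speech route.
    r = await client.post(TTS_URL, json={"model": "kokoro", "voice": "default", "input": text})
    return r.content

@app.post("/v1/audio/voice-chat")
async def voice_chat(file: UploadFile):
    user_text = "..."  # result of the ASR call (see the single-turn sketch)

    async def audio_chunks():
        async with httpx.AsyncClient(timeout=None) as client:
            buffer = ""
            payload = {
                "model": "default",
                "stream": True,
                "messages": [{"role": "user", "content": user_text}],
            }
            async with client.stream("POST", LLM_URL, json=payload) as resp:
                async for line in resp.aiter_lines():
                    # OpenAI-style SSE: "data: {json}" per chunk, "data: [DONE]" at end.
                    if not line.startswith("data: ") or line.endswith("[DONE]"):
                        continue
                    delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
                    buffer += delta.get("content") or ""
                    # Flush each completed sentence to TTS immediately.
                    while (m := re.search(r"[.!?]\s", buffer)):
                        sentence, buffer = buffer[: m.end()], buffer[m.end():]
                        yield await synthesize(client, sentence)
            if buffer.strip():
                yield await synthesize(client, buffer)

    return StreamingResponse(audio_chunks(), media_type="audio/wav")
```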

3. Caddy Routing

Add one new route:

```
handle /v1/audio/voice-chat {
    reverse_proxy localhost:8003
}
```

Note that for the streaming path the proxy must not buffer the response; Caddy's flush_interval -1 option on reverse_proxy forces chunks to be flushed to the client immediately.

4. System API Extensions

New endpoints:

  • POST /v1/audio/voice-chat (main endpoint)
  • GET /system/voice-agent/status
  • POST /system/voice-agent/config (adjust VAD sensitivity, TTS voice, temperature, etc.); a possible schema is sketched below
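
The config endpoint could accept a small JSON body. A hypothetical Pydantic schema of the tunables, where the field names and defaults are illustrative rather than a contract:

```python
# Hypothetical config schema for POST /system/voice-agent/config;
# field names and defaults are illustrative, not a final contract.
from pydantic import BaseModel, Field

class VoiceAgentConfig(BaseModel):
    vad_threshold: float = Field(0.5, ge=0.0, le=1.0)  # Silero VAD sensitivity
    vad_min_silence_ms: int = 300                      # silence that ends a turn
    tts_voice: str = "default"                         # Kokoro/Piper voice id
    llm_temperature: float = Field(0.7, ge=0.0, le=2.0)
    streaming: bool = True                             # stream partial audio
```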

5. TUI Enhancements

  • Add a "Voice Agent" mode in the Textual TUI (a minimal layout sketch follows this list)
  • Show live transcription + response
  • Allow quick voice testing from the console
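
A minimal Textual sketch of what that mode could look like; the layout, bindings, and pipeline wiring are illustrative only:

```python
# Illustrative Textual screen for a "Voice Agent" mode: a live transcript
# log plus a push-to-talk binding. Wiring to the pipeline is omitted.
from textual.app import App, ComposeResult
from textual.widgets import Footer, Header, RichLog

class VoiceAgentApp(App):
    BINDINGS = [("space", "push_to_talk", "Talk"), ("q", "quit", "Quit")]

    def compose(self) -> ComposeResult:
        yield Header()
        yield RichLog(highlight=True, markup=True)  # live transcript + replies
        yield Footer()

    def action_push_to_talk(self) -> None:
        log = self.query_one(RichLog)
        log.write("[bold]You:[/bold] (live transcription here)")
        log.write("[bold]Agent:[/bold] (streamed response here)")

if __name__ == "__main__":
    VoiceAgentApp().run()
```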

6. Advanced Features (Future)

  • Interruption detection (user can cut off the AI mid-response); a barge-in sketch follows this list
  • Multi-turn memory with context window management
  • Support for multiple voices / personas
  • Optional wake-word detection (e.g. "Hey NeuralDrive")
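
For the interruption case, one possible shape is to keep VAD running during playback and cancel the playback task as soon as speech is detected. Everything passed in here (the playback coroutine, microphone iterator, and VAD callable) is hypothetical:

```python
# Barge-in sketch: cancel agent playback as soon as the user starts speaking.
# play_coro, mic_chunks, and vad are hypothetical stand-ins for the real
# playback task, microphone frame source, and Silero VADIterator.
import asyncio

async def speak_with_barge_in(play_coro, mic_chunks, vad):
    playback = asyncio.create_task(play_coro)
    async for chunk in mic_chunks:      # live microphone frames
        if vad(chunk) is not None:      # VAD reports a speech 'start' event
            playback.cancel()           # barge-in: stop mid-response
            break
    try:
        await playback
    except asyncio.CancelledError:
        pass                            # interrupted by the user
```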

Phased Implementation Roadmap

| Phase | Feature | Priority |
|---|---|---|
| 1 | Basic non-streaming voice-chat endpoint | High |
| 2 | Real-time streaming TTS responses | High |
| 3 | Natural turn-taking + VAD tuning | Medium |
| 4 | Interruption handling | Medium |
| 5 | Wake word + always-listening mode | Low |

Benefits

  • Transforms NeuralDrive from "text LLM server" into a complete local voice AI device
  • Enables powerful new use cases: voice assistants, accessibility tools, hands-free coding, smart home control, etc.
  • Fully private and offline
  • Leverages the audio foundation from #16 (Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing)
  • Maintains the same simple, secure, OpenAI-compatible interface

Open Questions

  1. Should the voice-chat endpoint be stateful (maintains conversation history) or stateless (client manages history)?
  2. Do we want to support function calling inside the voice pipeline (e.g., "turn on the lights")?
  3. Should we offer different "personalities" or system prompts optimized for voice?
  4. Performance target: What's acceptable end-to-end latency for a good experience? (Target: < 800ms from end of speech to start of response?)
  5. Should this feature be enabled by default or opt-in (due to higher resource usage)?
