
Add OpenAI-compatible Audio Support (ASR + TTS) via Caddy Routing #16

@eshork

Motivation

NeuralDrive is an excellent portable LLM appliance with a clean architecture: immutable rootfs, Caddy as the central reverse proxy on :8443, OpenAI-compatible API, LRU model management, and a unified System API + TUI.

Users are increasingly asking for voice capabilities (speech-to-text and text-to-speech) to build local voice agents, hands-free interfaces, or full conversational systems. Adding audio support in a way that feels native to the existing design would significantly increase the platform's usefulness.

Goal

Expose standard OpenAI-compatible audio endpoints through the existing :8443 gateway:

  • POST /v1/audio/transcriptions — ASR (Speech-to-Text)
  • POST /v1/audio/speech — TTS (Text-to-Speech)

These endpoints should preserve:

  • Bearer token authentication
  • The same OpenAI-compatible style as text models
  • Integration with the System API and Textual TUI
  • LRU-based model loading/unloading
  • Storage on the persistence partition
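A client call to the proposed speech endpoint could look roughly like the sketch below. It only builds the request and does not send it. The hostname, the token, the model name `kokoro`, and the voice `af_heart` are placeholder assumptions for illustration, not values defined by NeuralDrive.

```python
import json
import urllib.request

# Hypothetical values: replace with the appliance's real address and token.
BASE_URL = "https://neuraldrive.local:8443"
API_TOKEN = "example-token"


def build_speech_request(text, voice="af_heart", model="kokoro"):
    """Build (but do not send) an OpenAI-style POST /v1/audio/speech request."""
    payload = {
        "model": model,           # assumed model identifier
        "input": text,            # text to synthesize
        "voice": voice,           # assumed voice name
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Same Bearer scheme the gateway already uses for text models.
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request("Hello from NeuralDrive")
    print(req.get_method(), req.full_url)
```

The transcription endpoint differs mainly in its body: the OpenAI API expects `multipart/form-data` with a `file` field and a `model` field instead of JSON.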

Proposed Architecture

Caddy (:8443)
├── /v1/chat/completions          → Ollama (existing)
├── /v1/audio/transcriptions      → faster-whisper service (new, internal :8001)
├── /v1/audio/speech              → Kokoro-FastAPI (new, internal :8002)
└── /system/* + /v1/models        → FastAPI System API (extend)

Internal Services (new systemd units):

  • neuraldrive-whisper.service — faster-whisper + thin FastAPI wrapper
  • neuraldrive-kokoro.service — Official ghcr.io/remsky/kokoro-fastapi-* image
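Most of the "thin wrapper" around faster-whisper is response shaping: faster-whisper yields segments with `start`, `end`, and `text`, while OpenAI clients expect a JSON body with a `text` field (or a richer `verbose_json` form). A minimal sketch of that translation, assuming a stand-in `Segment` class and a hypothetical helper name, and emitting only a subset of the verbose fields:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Stand-in for the segment objects faster-whisper yields."""
    start: float
    end: float
    text: str


def to_openai_transcription(segments: Iterable[Segment], language: str,
                            duration: float, verbose: bool = False) -> dict:
    """Convert transcription segments into an OpenAI-style response body.

    Hypothetical helper; covers only a subset of the verbose_json fields.
    """
    items = list(segments)  # segments may be a lazy generator; consume it once
    # Segment texts usually carry their own leading space, so join without one.
    text = "".join(s.text for s in items).strip()
    if not verbose:
        return {"text": text}
    return {
        "task": "transcribe",
        "language": language,
        "duration": duration,
        "text": text,
        "segments": [
            {"id": i, "start": s.start, "end": s.end, "text": s.text}
            for i, s in enumerate(items)
        ],
    }


if __name__ == "__main__":
    demo = [Segment(0.0, 1.2, " Hello"), Segment(1.2, 2.0, " world.")]
    print(to_openai_transcription(demo, language="en", duration=2.0))
```

In a real service this function would sit behind a FastAPI route that accepts the uploaded file and passes it to the model; that part is omitted here.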

Recommended Starting Models

| Type | Model | Reason | OpenAI compatible? |
|------|-------|--------|--------------------|
| ASR | faster-whisper (large-v3 / Turbo) | Best balance of speed & accuracy | Yes (via wrapper) |
| TTS | Kokoro-82M | Highest quality open-source TTS, tiny, runs on CPU | Yes (native) |
| TTS | Piper (optional) | Extremely lightweight & fast | Easy to wrap |

Later phases can add Fish Speech S2, Qwen3 audio models, etc.

Implementation Details

1. Caddyfile Changes (minimal)

:8443 {
    # ... existing routes ...

    handle /v1/audio/transcriptions {
        reverse_proxy localhost:8001
    }

    handle /v1/audio/speech {
        reverse_proxy localhost:8002
    }
}

2. New Systemd Services

  • Run as unprivileged users with proper hardening (matching current services)
  • Store models in /persistence/models/audio/

3. System API Extensions (:3001)

Add these endpoints to the existing FastAPI backend:

  • GET /v1/audio/models
  • POST /v1/audio/models/{name}/load
  • POST /v1/audio/models/{name}/unload
  • GET /system/audio/status (shows loaded models + resource usage)

Reuse the existing LRU eviction logic where possible.
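NeuralDrive's existing eviction code is not shown in this issue, so the following is only a generic sketch of how load/unload for audio models could follow an LRU policy. The class name, the count-based limit (rather than a VRAM budget), and the callback signatures are all assumptions.

```python
from collections import OrderedDict
from typing import Any, Callable


class AudioModelLRU:
    """Keep at most `max_loaded` models resident, evicting the least recently used."""

    def __init__(self, max_loaded: int, unload: Callable[[str, Any], None]):
        if max_loaded < 1:
            raise ValueError("max_loaded must be at least 1")
        self._max = max_loaded
        self._unload = unload
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()  # oldest first

    def get_or_load(self, name: str, loader: Callable[[str], Any]) -> Any:
        """Return the handle for `name`, loading it (and evicting others) if needed."""
        if name in self._loaded:
            self._loaded.move_to_end(name)  # mark as most recently used
            return self._loaded[name]
        while len(self._loaded) >= self._max:
            victim, handle = self._loaded.popitem(last=False)  # oldest entry
            self._unload(victim, handle)
        handle = loader(name)
        self._loaded[name] = handle
        return handle

    def loaded(self) -> list:
        """Names of resident models, least recently used first."""
        return list(self._loaded)


if __name__ == "__main__":
    evicted = []
    cache = AudioModelLRU(2, unload=lambda name, handle: evicted.append(name))
    for model in ("whisper", "kokoro", "whisper", "piper"):
        cache.get_or_load(model, loader=lambda n: f"<{n} handle>")
    print(cache.loaded(), evicted)
```

The proposed `load` and `unload` endpoints would map onto `get_or_load` and an explicit removal method, and `GET /system/audio/status` could report `loaded()`.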

4. TUI Updates

  • Show loaded audio models alongside text models
  • Add quick load/unload commands
  • Display real-time audio inference stats

5. Model Storage

Use the existing persistence partition (/persistence/models/audio/) so models survive reboots and are portable.

Phased Implementation Roadmap

| Phase | Scope | Priority |
|-------|-------|----------|
| 1 | Add Kokoro-FastAPI + faster-whisper behind Caddy | High |
| 2 | Integrate into System API + TUI | High |
| 3 | Add Piper as lightweight alternative | Medium |
| 4 | Support additional models (Fish Speech, etc.) | Medium |
| 5 | Full voice agent pipeline (ASR → LLM → TTS) | Nice-to-have |
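The Phase 5 pipeline is essentially function composition over the three endpoints. A rough sketch, assuming each stage is injected as a callable (in practice, thin clients for `/v1/audio/transcriptions`, `/v1/chat/completions`, and `/v1/audio/speech`); the function name and return shape are hypothetical:

```python
from typing import Callable, Tuple


def voice_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    chat: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Tuple[str, str, bytes]:
    """Run one ASR -> LLM -> TTS turn and return (user_text, reply_text, reply_audio)."""
    user_text = transcribe(audio)
    if not user_text.strip():
        # Nothing intelligible was heard; skip the LLM and TTS calls.
        return "", "", b""
    reply_text = chat(user_text)
    return user_text, reply_text, synthesize(reply_text)


if __name__ == "__main__":
    # Fake stages standing in for the real HTTP clients.
    heard, reply, audio_out = voice_turn(
        b"\x00\x01",
        transcribe=lambda a: "what time is it",
        chat=lambda t: f"You asked: {t}",
        synthesize=lambda t: t.encode("utf-8"),
    )
    print(heard, "|", reply, "|", len(audio_out), "bytes")
```

Keeping the stages injectable means the same chaining logic could live either in a client or behind a server-side endpoint.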

Benefits

  • Keeps the "one appliance, one API" philosophy
  • No breaking changes to existing users
  • Leverages the mature OpenAI ecosystem (tools, SDKs, clients)
  • Enables powerful new use cases (local voice agents, accessibility, etc.)
  • Maintains security model (TLS + Bearer auth)

Questions for Discussion

  1. Should we support streaming TTS responses (stream: true) from day one?
  2. Do we want a higher-level /v1/audio/voice-chat endpoint that chains ASR → LLM → TTS?
  3. Should audio models participate in the same VRAM management pool as text models, or have separate limits?
  4. Any preference between using a thin FastAPI wrapper vs. running the official Kokoro image directly?

Labels: API, audio, enhancement, roadmap
