Motivation
NeuralDrive is an excellent portable LLM appliance with a clean architecture: immutable rootfs, Caddy as the central reverse proxy on :8443, OpenAI-compatible API, LRU model management, and a unified System API + TUI.
Users are increasingly asking for voice capabilities (speech-to-text and text-to-speech) to build local voice agents, hands-free interfaces, or full conversational systems. Adding audio support in a way that feels native to the existing design would significantly increase the platform's usefulness.
Goal
Expose standard OpenAI-compatible audio endpoints through the existing `:8443` gateway (client sketch below):
- `POST /v1/audio/transcriptions` — ASR (Speech-to-Text)
- `POST /v1/audio/speech` — TTS (Text-to-Speech)
All while maintaining:
- Bearer token authentication
- The same OpenAI-compatible style as text models
- Integration with the System API and Textual TUI
- LRU-based model loading/unloading
- Storage on the persistence partition
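Once these routes exist, stock OpenAI SDKs should work against the gateway unchanged. A minimal client sketch, assuming a placeholder hostname, bearer token, and model names (none of these are shipped defaults):

```python
from openai import OpenAI

# All values here are placeholders: the appliance hostname, the bearer
# token, and the model names would come from your actual NeuralDrive setup.
client = OpenAI(base_url="https://neuraldrive.local:8443/v1", api_key="YOUR_BEARER_TOKEN")

# ASR: POST /v1/audio/transcriptions
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-large-v3", file=audio_file)
print(transcript.text)

# TTS: POST /v1/audio/speech
speech = client.audio.speech.create(model="kokoro", voice="af_bella", input="Hello from NeuralDrive.")
with open("hello.mp3", "wb") as out:
    out.write(speech.read())
```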
Proposed Architecture
```
Caddy (:8443)
├── /v1/chat/completions      → Ollama (existing)
├── /v1/audio/transcriptions  → faster-whisper service (new, internal :8001)
├── /v1/audio/speech          → Kokoro-FastAPI (new, internal :8002)
└── /system/* + /v1/models    → FastAPI System API (extend)
```
Internal Services (new systemd units):
- `neuraldrive-whisper.service` — faster-whisper behind a thin FastAPI wrapper (sketched below)
- `neuraldrive-kokoro.service` — the official `ghcr.io/remsky/kokoro-fastapi-*` image
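For concreteness, here is a minimal sketch of the thin wrapper `neuraldrive-whisper.service` could run. The model size, storage path, and response shape are assumptions to be settled during implementation; faster-whisper itself has no HTTP layer, which is why the wrapper exists:

```python
# Thin FastAPI wrapper sketch for faster-whisper (served by uvicorn on :8001).
# The model name and download_root are assumptions, not fixed decisions.
from fastapi import FastAPI, File, Form, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()

# Load once at startup; weights live on the persistence partition proposed below.
whisper = WhisperModel("large-v3", download_root="/persistence/models/audio/whisper")

@app.post("/v1/audio/transcriptions")
async def transcriptions(file: UploadFile = File(...), model: str = Form("whisper-large-v3")):
    # The `model` form field is accepted for OpenAI compatibility but unused here.
    # faster-whisper accepts a file-like object; segments are generated lazily.
    segments, _info = whisper.transcribe(file.file)
    text = "".join(segment.text for segment in segments)
    # Same shape as OpenAI's default JSON response for this endpoint.
    return {"text": text}
```

Because the service binds to localhost only, Caddy remains the sole entry point and the existing Bearer auth applies unchanged.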
Recommended Starting Models
| Type | Model | Reason | OpenAI Compatible? |
|------|-------|--------|--------------------|
| ASR | faster-whisper (large-v3 / Turbo) | Best balance of speed & accuracy | Yes (via wrapper) |
| TTS | Kokoro-82M | Highest quality open-source TTS, tiny, runs on CPU | Yes (native) |
| TTS | Piper (optional) | Extremely lightweight & fast | Easy to wrap |
Later phases can add Fish Speech S2, Qwen3 audio models, etc.
Implementation Details
1. Caddyfile Changes (minimal)
```
:8443 {
    # ... existing routes ...

    handle /v1/audio/transcriptions {
        reverse_proxy localhost:8001
    }

    handle /v1/audio/speech {
        reverse_proxy localhost:8002
    }
}
```
2. New Systemd Services
- Run as unprivileged users with proper hardening, matching the current services (see the unit sketch below)
- Store models in `/persistence/models/audio/`
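A sketch of what one of these units might look like, assuming uvicorn serves the wrapper above; the user name, paths, and exact hardening directives should be copied from the existing NeuralDrive units rather than from here:

```ini
# Hypothetical neuraldrive-whisper.service; every name and path below is a placeholder.
[Unit]
Description=NeuralDrive ASR (faster-whisper)
Wants=network-online.target
After=network-online.target

[Service]
User=neuraldrive-asr
Group=neuraldrive-asr
ExecStart=/opt/neuraldrive/venv/bin/uvicorn whisper_wrapper:app --host 127.0.0.1 --port 8001
Restart=on-failure
# Hardening, to be aligned with the current services:
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/persistence/models/audio

[Install]
WantedBy=multi-user.target
```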
3. System API Extensions (:3001)
Add these endpoints to the existing FastAPI backend:
- `GET /v1/audio/models`
- `POST /v1/audio/models/{name}/load`
- `POST /v1/audio/models/{name}/unload`
- `GET /system/audio/status` (shows loaded models + resource usage)
Reuse the existing LRU eviction logic where possible.
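A rough sketch of these routes, with a toy in-memory LRU standing in for the real eviction logic (which this proposal would reuse, not duplicate):

```python
# System API sketch: the pool below is a stand-in for the existing LRU manager.
from collections import OrderedDict
from fastapi import APIRouter, HTTPException

router = APIRouter()  # app.include_router(router) in the existing System API
KNOWN = {"whisper-large-v3", "kokoro"}  # assumed catalog
MAX_LOADED = 2                          # assumed limit

class LRUPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()  # name -> handle; insertion order = recency

    def load(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)        # refresh recency
        else:
            if len(self.loaded) >= self.capacity:
                self.loaded.popitem(last=False)  # evict least recently used
            self.loaded[name] = object()         # real code would load weights here

    def unload(self, name):
        self.loaded.pop(name, None)              # real code would free memory here

pool = LRUPool(MAX_LOADED)

@router.get("/v1/audio/models")
def list_models():
    return {"data": [{"id": m, "loaded": m in pool.loaded} for m in sorted(KNOWN)]}

@router.post("/v1/audio/models/{name}/load")
def load_model(name: str):
    if name not in KNOWN:
        raise HTTPException(404, f"unknown audio model: {name}")
    pool.load(name)
    return {"status": "loaded", "model": name}

@router.post("/v1/audio/models/{name}/unload")
def unload_model(name: str):
    pool.unload(name)
    return {"status": "unloaded", "model": name}
```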
4. TUI Updates
- Show loaded audio models alongside text models
- Add quick load/unload commands
- Display real-time audio inference stats (see the panel sketch below)
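As a starting point, a hedged Textual sketch of the audio-model panel; the columns and hard-coded rows are illustrative only, and a real implementation would poll `GET /system/audio/status` instead:

```python
# Illustrative Textual panel; in the real TUI this would be a widget inside
# the existing app, fed by GET /system/audio/status rather than static rows.
from textual.app import App, ComposeResult
from textual.widgets import DataTable

class AudioModelsPanel(App):
    def compose(self) -> ComposeResult:
        yield DataTable()

    def on_mount(self) -> None:
        table = self.query_one(DataTable)
        table.add_columns("Model", "Type", "Status", "RTF")  # RTF = real-time factor
        table.add_row("whisper-large-v3", "ASR", "loaded", "0.21")
        table.add_row("kokoro", "TTS", "loaded", "0.05")

if __name__ == "__main__":
    AudioModelsPanel().run()
```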
5. Model Storage
Use the existing persistence partition (/persistence/models/audio/) so models survive reboots and are portable.
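One possible (not prescribed) layout under that directory:

```
/persistence/models/audio/
├── whisper/   # CTranslate2 checkpoints for faster-whisper
├── kokoro/    # Kokoro-82M weights + voices
└── piper/     # optional Piper .onnx voices (Phase 3)
```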
Phased Implementation Roadmap
| Phase | Scope | Priority |
|-------|-------|----------|
| 1 | Add Kokoro-FastAPI + faster-whisper behind Caddy | High |
| 2 | Integrate into System API + TUI | High |
| 3 | Add Piper as lightweight alternative | Medium |
| 4 | Support additional models (Fish Speech, etc.) | Medium |
| 5 | Full voice agent pipeline (ASR → LLM → TTS; sketched below) | Nice-to-have |
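To make Phase 5 concrete: even before any server-side chaining exists, a client can compose the three routes itself. A sketch reusing the placeholder client from the Goal section (model names and voice are assumptions):

```python
# Client-side voice turn built only from the proposed endpoints.
from openai import OpenAI

client = OpenAI(base_url="https://neuraldrive.local:8443/v1", api_key="YOUR_BEARER_TOKEN")

def voice_turn(audio_path: str, out_path: str = "reply.mp3") -> str:
    # 1. ASR: audio in, text out
    with open(audio_path, "rb") as f:
        heard = client.audio.transcriptions.create(model="whisper-large-v3", file=f).text
    # 2. LLM: the existing /v1/chat/completions route
    reply = client.chat.completions.create(
        model="llama3",  # whichever text model is loaded
        messages=[{"role": "user", "content": heard}],
    ).choices[0].message.content
    # 3. TTS: text in, audio out
    speech = client.audio.speech.create(model="kokoro", voice="af_bella", input=reply)
    with open(out_path, "wb") as out:
        out.write(speech.read())
    return reply
```

A built-in `/v1/audio/voice-chat` endpoint (see Questions below) would mainly save this round-tripping.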
Benefits
- Keeps the "one appliance, one API" philosophy
- No breaking changes to existing users
- Leverages the mature OpenAI ecosystem (tools, SDKs, clients)
- Enables powerful new use cases (local voice agents, accessibility, etc.)
- Maintains security model (TLS + Bearer auth)
Questions for Discussion
- Should we support streaming TTS responses (`stream: true`) from day one?
- Do we want a higher-level `/v1/audio/voice-chat` endpoint that chains ASR → LLM → TTS?
- Should audio models participate in the same VRAM management pool as text models, or have separate limits?
- Any preference between using a thin FastAPI wrapper vs. running the official Kokoro image directly?