Tracker for epic #87 item 9 — multi-cartridge wave.
Scope
Four cartridges spanning audio + image + video + transcoding:
Why one issue covers four
These four close BoJ's multi-modal gap. Currently `browser-mcp` is the only sensory cartridge and it's purely visual-via-screenshot. Adding audio (whisper + elevenlabs) and generative image/video (replicate) gives agents the multi-modal surface that's now table stakes for agent work. `ffmpeg-mcp` is the glue that lets the other three compose (extract audio from a video, transcode whisper output, embed Replicate-generated frames into a video).
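To make the "glue" role concrete, here is a minimal sketch of the video-to-text composition, assuming a generic `call_tool(cartridge, tool, params)` dispatcher. The dispatcher and the parameter and result keys (`input`, `output`, `source`, `text`) are hypothetical; only the cartridge and tool names come from this issue.

```python
from typing import Any, Callable, Dict

# Hypothetical dispatcher signature: (cartridge, tool, params) -> result dict.
ToolCall = Callable[[str, str, Dict[str, Any]], Dict[str, Any]]


def transcribe_video(call_tool: ToolCall, video_path: str) -> str:
    """Compose ffmpeg-mcp and whisper-mcp: video file -> audio file -> text."""
    # Step 1: ffmpeg-mcp pulls the audio track out of the video.
    # The "input"/"output" keys are assumed, not taken from a manifest.
    audio = call_tool("ffmpeg-mcp", "extract_audio", {"input": video_path})
    # Step 2: whisper-mcp transcribes the extracted audio.
    # The "source"/"text" keys are likewise assumed.
    result = call_tool("whisper-mcp", "transcribe", {"source": audio["output"]})
    return result["text"]
```

The point of the sketch is that whisper never needs to understand video containers; `ffmpeg-mcp` normalises the input first.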
Surface
Per cartridge:
- `whisper-mcp` → `transcribe` (file/URL → text), `detect_language`, optional `translate` (any → English)
- `elevenlabs-mcp` → `synthesize` (text → audio), `list_voices`, `clone_voice` (premium tier)
- `replicate-mcp` → `run_model` (model_id + inputs → outputs), `list_models`, `get_prediction` (status-poll)
- `ffmpeg-mcp` → `probe` (metadata), `extract_audio`, `transcode`, `extract_frames`, `concat`, `trim`
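Because `get_prediction` is a status-poll, callers will need a loop around it. A rough sketch follows, assuming the result is a dict with a `status` field and that `succeeded`, `failed`, and `canceled` are the terminal states; both the field name and the state names are assumptions to check against the cartridge's actual response shape. `poll_prediction` itself is a hypothetical helper, not part of the planned surface.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; verify against the real prediction payload.
TERMINAL_STATES = {"succeeded", "failed", "canceled"}


def poll_prediction(
    get_prediction: Callable[[str], Dict[str, Any]],
    prediction_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call get_prediction until it reports a terminal state or the timeout passes."""
    deadline = clock() + timeout_s
    while True:
        prediction = get_prediction(prediction_id)
        if prediction["status"] in TERMINAL_STATES:
            return prediction
        if clock() >= deadline:
            raise TimeoutError(f"prediction {prediction_id} still {prediction['status']}")
        sleep(interval_s)
```

`sleep` and `clock` are injectable so the loop can be exercised without real waiting.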
Bridge-level: either four tools (`boj_audio_stt`, `boj_audio_tts`, `boj_media_generate`, `boj_media_transcode`) or one umbrella tool, `boj_multimodal`, with operation routing. Recommendation: the umbrella, because it keeps the tool count manageable and mirrors `boj_search`.
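One way the umbrella's operation routing could look, as a sketch only: the operation names (`stt`, `tts`, `generate`, `transcode`) and the stub handlers are hypothetical, and how `boj_search` actually routes is not shown in this issue.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


def _stub(cartridge: str, tool: str) -> Handler:
    """Placeholder handler that only records where a call would be routed."""
    def handler(params: Dict[str, Any]) -> Dict[str, Any]:
        return {"cartridge": cartridge, "tool": tool, "params": params}
    return handler


# Hypothetical operation names, one per would-be standalone bridge tool.
ROUTES: Dict[str, Handler] = {
    "stt": _stub("whisper-mcp", "transcribe"),
    "tts": _stub("elevenlabs-mcp", "synthesize"),
    "generate": _stub("replicate-mcp", "run_model"),
    "transcode": _stub("ffmpeg-mcp", "transcode"),
}


def boj_multimodal(operation: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a single umbrella call to the handler for the given operation."""
    handler = ROUTES.get(operation)
    if handler is None:
        raise ValueError(
            f"unknown operation {operation!r}; expected one of {sorted(ROUTES)}"
        )
    return handler(params)
```

With this shape, adding a fifth capability means adding one entry to `ROUTES` and leaves the bridge's tool list unchanged.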
Composition with existing cartridges
Implementation plan
Each cartridge as a sibling PR:
- `feat/item-9a-whisper-mcp` — STT first, smallest surface, gives immediate value
- `feat/item-9b-ffmpeg-mcp` — local-only, enables audio/video extraction needed by whisper for video inputs
- `feat/item-9c-elevenlabs-mcp` — TTS, paired with whisper for bidirectional audio
- `feat/item-9d-replicate-mcp` — broadest surface (many model families); land last
Estimate: roughly 1-2 days per cartridge for manifest + bridge surface + offline-menu wiring. Replicate is expected to take the longest because of its model-id taxonomy.
Dependencies / non-dependencies
- Pairs naturally with item 14 (HTTP transport) — multi-modal use cases include browser-based agent web apps where users record audio/video locally and send to BoJ
- Pairs with item 7 (vector DBs) for multi-modal RAG
- `ffmpeg-mcp` is local-only: it requires an FFmpeg binary on the host, so it won't work on Cloudflare Workers per ADR-0013's cartridge-compat matrix. The other three are HTTP-API-based and work everywhere
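The local-only constraint could be enforced at startup with a check like the sketch below. The runtime labels (`host`, `workers`) and the function name are hypothetical, and ADR-0013's actual matrix format is not reproduced here; the only real API used is `shutil.which`, which looks up a binary on `PATH`.

```python
import shutil
from typing import Callable, Dict, List, Optional

# Cartridges that need a host binary, mapped to the binary name they require.
LOCAL_ONLY: Dict[str, str] = {"ffmpeg-mcp": "ffmpeg"}
# Cartridges that only make outbound HTTP calls.
HTTP_BASED: List[str] = ["whisper-mcp", "elevenlabs-mcp", "replicate-mcp"]


def available_cartridges(
    runtime: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the multi-modal cartridges usable on the given runtime."""
    available = list(HTTP_BASED)
    # Only a host runtime can spawn binaries; "workers" never gets local-only cartridges.
    if runtime == "host":
        for cartridge, binary in LOCAL_ONLY.items():
            if which(binary) is not None:
                available.append(cartridge)
    return available
```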
Out of scope
- OpenAI's image / audio APIs — overlaps with Replicate's catalogue; if a user needs OpenAI specifically, route through the existing claude-ai-mcp pattern (a sibling `openai-mcp` is a separate ask, not part of multi-modal core)
- Web Audio API browser-side capture — that's a client-side concern; BoJ receives audio data, doesn't capture it
- Real-time streaming (live TTS / live STT) — v1 is batch; revisit when item 14 (HTTP/SSE transport) lands and SSE streams are available
Exit criteria
Close when all 4 cartridges have manifest + bridge surface + offline-menu entry, mirroring the search-mcp pattern. Backend implementation tracked separately per cartridge.
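The exit criteria amount to a three-item checklist per cartridge. As a sketch, a tracker script might model it like this; the class and function names are hypothetical and nothing here reflects existing tooling.

```python
from dataclasses import dataclass
from typing import Dict, List

REQUIRED_CARTRIDGES = ("whisper-mcp", "ffmpeg-mcp", "elevenlabs-mcp", "replicate-mcp")


@dataclass
class CartridgeStatus:
    manifest: bool = False
    bridge_surface: bool = False
    offline_menu: bool = False

    def done(self) -> bool:
        """A cartridge counts as done once all three deliverables exist."""
        return self.manifest and self.bridge_surface and self.offline_menu


def blocking_cartridges(statuses: Dict[str, CartridgeStatus]) -> List[str]:
    """List the cartridges that still block closing the tracker."""
    return [
        name
        for name in REQUIRED_CARTRIDGES
        if not statuses.get(name, CartridgeStatus()).done()
    ]
```

The tracker can close when `blocking_cartridges` returns an empty list; backend implementation stays out of the check, matching the criteria above.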