Item 9: Multi-modal cartridges wave (epic #87) #101

@hyperpolymath

Tracker for epic #87 item 9 — multi-cartridge wave.

Scope

Four cartridges spanning audio + image + video + transcoding:

  • `whisper-mcp` — Speech-to-text (Whisper API + local whisper.cpp)
  • `elevenlabs-mcp` — Text-to-speech (high-quality, voice cloning, multilingual)
  • `replicate-mcp` — Image + video generation (Stable Diffusion, FLUX, Veo, etc. via Replicate's hosted models)
  • `ffmpeg-mcp` — Audio/video transcoding, extraction, and analysis (local FFmpeg)

Why one issue covers four

These four close BoJ's multi-modal gap. Currently `browser-mcp` is the only sensory cartridge, and it is purely visual (via screenshots). Adding audio (whisper + elevenlabs) and generative image/video (replicate) gives agents the multi-modal surface that is now table stakes for agent work. `ffmpeg-mcp` is the glue that lets the other three compose: extract audio from a video, transcode audio into a format whisper accepts, embed Replicate-generated frames into a video.
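The first composition mentioned above (pulling the audio track out of a video so it can be transcribed) might look roughly like this. It is a sketch only: `build_extract_audio_cmd` and `extract_audio` are hypothetical helper names, not part of `ffmpeg-mcp`, and the 16 kHz mono 16-bit WAV target is an assumption based on the input format whisper.cpp expects. The FFmpeg flags themselves (`-y`, `-i`, `-vn`, `-ac`, `-ar`, `-c:a`) are standard.

```python
import shutil
import subprocess
from pathlib import Path


def build_extract_audio_cmd(video: Path, wav_out: Path, sample_rate: int = 16000) -> list[str]:
    """Build an FFmpeg argv that drops video and writes mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(video),         # input container (mp4, mkv, webm, ...)
        "-vn",                    # discard the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz assumed for whisper.cpp)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(wav_out),
    ]


def extract_audio(video: Path, wav_out: Path) -> Path:
    """Run FFmpeg; raises if the binary is missing or the transcode fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg binary not found on PATH (ffmpeg-mcp is local-only)")
    subprocess.run(build_extract_audio_cmd(video, wav_out), check=True, capture_output=True)
    return wav_out
```

Keeping the argv builder separate from the subprocess call means the command can be inspected or unit-tested on hosts where FFmpeg is not installed.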

Surface

Per cartridge:

  • `whisper-mcp` → `transcribe` (file/URL → text), `detect_language`, optional `translate` (any → English)
  • `elevenlabs-mcp` → `synthesize` (text → audio), `list_voices`, `clone_voice` (premium tier)
  • `replicate-mcp` → `run_model` (model_id + inputs → outputs), `list_models`, `get_prediction` (status-poll)
  • `ffmpeg-mcp` → `probe` (metadata), `extract_audio`, `transcode`, `extract_frames`, `concat`, `trim`
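The `get_prediction` status-poll could be sketched as below. This assumes the prediction object carries a `status` field with the values Replicate documents (`starting`, `processing`, `succeeded`, `failed`, `canceled`). `poll_prediction` and the injected `fetch` callable are hypothetical names, used so the loop can be shown without committing to a particular HTTP client.

```python
import time
from typing import Any, Callable

# Statuses after which a prediction will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def poll_prediction(
    fetch: Callable[[str], dict[str, Any]],
    prediction_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch(prediction_id) until the prediction reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        prediction = fetch(prediction_id)
        if prediction.get("status") in TERMINAL_STATUSES:
            return prediction
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction {prediction_id} still {prediction.get('status')!r}")
        sleep(interval_s)
```

The caller still has to inspect the returned `status`: a `failed` or `canceled` prediction is returned, not raised, so that the bridge can decide how to report it.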

Bridge-level: 4 tools (`boj_audio_stt`, `boj_audio_tts`, `boj_media_generate`, `boj_media_transcode`) or one umbrella `boj_multimodal` with operation-routing. Recommend the umbrella — keeps tool count manageable, mirrors `boj_search`.
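A minimal sketch of the umbrella option, to make the operation-routing concrete. The operation names (`stt`, `tts`, `generate`, `transcode`), the `dispatch_multimodal` function, and the `call_cartridge` callback are illustrative assumptions, not a settled schema; the cartridge and tool names on the right-hand side come from the per-cartridge surface listed above.

```python
from typing import Any, Callable

# operation -> (cartridge, tool). Operation keys are hypothetical;
# cartridge and tool names follow the per-cartridge surface in this issue.
ROUTES: dict[str, tuple[str, str]] = {
    "stt": ("whisper-mcp", "transcribe"),
    "tts": ("elevenlabs-mcp", "synthesize"),
    "generate": ("replicate-mcp", "run_model"),
    "transcode": ("ffmpeg-mcp", "transcode"),
}


def dispatch_multimodal(
    operation: str,
    params: dict[str, Any],
    call_cartridge: Callable[[str, str, dict[str, Any]], Any],
) -> Any:
    """Route one umbrella-tool call to the cartridge tool that handles it."""
    try:
        cartridge, tool = ROUTES[operation]
    except KeyError:
        raise ValueError(
            f"unknown operation {operation!r}; expected one of {sorted(ROUTES)}"
        ) from None
    return call_cartridge(cartridge, tool, params)
```

With this shape, adding a cartridge operation is one new entry in the routing table, and the bridge keeps exposing a single tool.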

Composition with existing cartridges

Implementation plan

Each cartridge as a sibling PR:

  1. `feat/item-9a-whisper-mcp` — STT first, smallest surface, gives immediate value
  2. `feat/item-9b-ffmpeg-mcp` — local-only, enables audio/video extraction needed by whisper for video inputs
  3. `feat/item-9c-elevenlabs-mcp` — TTS, paired with whisper for bidirectional audio
  4. `feat/item-9d-replicate-mcp` — broadest surface (many model families); land last

~1-2 days per cartridge for manifest + bridge surface + offline-menu wiring. Replicate will take the longest because of the model-id taxonomy.

Dependencies / non-dependencies

  • Pairs naturally with item 14 (HTTP transport): multi-modal use cases include browser-based agent web apps where users record audio/video locally and send it to BoJ
  • Pairs with item 7 (vector DBs) for multi-modal RAG
  • `ffmpeg-mcp` is local-only: it requires a host FFmpeg binary, so it won't work on Cloudflare Workers per ADR-0013's cartridge-compat matrix. The other three are HTTP-API-based and work everywhere; the exception is `whisper-mcp`'s local whisper.cpp mode, which also has to run on the host
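The local-only constraint can be expressed as a simple availability filter. This is a sketch under stated assumptions: the actual format of ADR-0013's cartridge-compat matrix is not shown in this issue, so `CartridgeCompat`, `needs_host_binary`, and `cartridges_for` are hypothetical names, and `whisper-mcp` is modelled in its hosted-API mode only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CartridgeCompat:
    name: str
    needs_host_binary: bool  # True if the cartridge shells out to a local binary


# Hypothetical encoding of this wave; not the real ADR-0013 matrix.
WAVE = (
    CartridgeCompat("whisper-mcp", needs_host_binary=False),  # hosted Whisper API mode
    CartridgeCompat("elevenlabs-mcp", needs_host_binary=False),
    CartridgeCompat("replicate-mcp", needs_host_binary=False),
    CartridgeCompat("ffmpeg-mcp", needs_host_binary=True),    # requires host FFmpeg
)


def cartridges_for(host_binaries_available: bool) -> list[str]:
    """Return the cartridges usable on a runtime with or without host binaries."""
    return [
        c.name
        for c in WAVE
        if host_binaries_available or not c.needs_host_binary
    ]
```

Calling `cartridges_for(False)` models a runtime such as Cloudflare Workers, where `ffmpeg-mcp` drops out of the list.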

Out of scope

  • OpenAI's image / audio APIs: these overlap with Replicate's catalogue; if a user needs OpenAI specifically, route through the existing `claude-ai-mcp` pattern (a sibling `openai-mcp` is a separate ask, not part of multi-modal core)
  • Web Audio API browser-side capture: that's a client-side concern; BoJ receives audio data, it doesn't capture it
  • Real-time streaming (live TTS / live STT): v1 is batch; revisit when item 14 (HTTP/SSE transport) lands and SSE streams are available

Exit criteria

Close when all four cartridges have a manifest, a bridge surface, and an offline-menu entry, mirroring the `search-mcp` pattern. Backend implementation is tracked separately per cartridge.
