Item 9: Multi-modal cartridges wave (epic #87) #101

Tracker for epic #87 item 9 — multi-cartridge wave.

## Scope

Four cartridges spanning audio + image + video + transcoding:

- [ ] `whisper-mcp`: Speech-to-text (Whisper API + local whisper.cpp)
- [ ] `elevenlabs-mcp`: Text-to-speech (high-quality, voice cloning, multilingual)
- [ ] `replicate-mcp`: Image + video generation (Stable Diffusion, FLUX, Veo, etc. via Replicate's hosted models)
- [ ] `ffmpeg-mcp`: Audio/video transcoding, extraction, and analysis (local FFmpeg)

## Why one issue covers four

These four close BoJ's multi-modal gap. Currently `browser-mcp` is the only sensory cartridge, and it is purely visual (via screenshots). Adding audio (whisper + elevenlabs) and generative image/video (replicate) gives agents the multi-modal surface that is now table stakes for agent work. `ffmpeg-mcp` is the glue that lets the other three compose: extract audio from a video, transcode audio into a format Whisper accepts, embed Replicate-generated frames into a video.
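
As a rough illustration of the "glue" role, here is a minimal sketch of how an `extract_audio` step might build an FFmpeg command line that turns a video into audio suitable for speech-to-text. The function name and file names are hypothetical; the 16 kHz mono 16-bit WAV target is an assumption based on what whisper.cpp documents as its input format, and the hosted Whisper API may accept other formats.

```python
from pathlib import Path


def build_extract_audio_argv(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg argv that writes a mono 16-bit PCM WAV from any input.

    Hypothetical helper: nothing here is taken from the actual ffmpeg-mcp code.
    """
    if Path(dst).suffix.lower() != ".wav":
        raise ValueError("this sketch only targets WAV output")
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input container (video or audio)
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the requested rate in Hz
        "-c:a", "pcm_s16le",      # encode as 16-bit little-endian PCM
        dst,
    ]


argv = build_extract_audio_argv("talk.mp4", "talk.wav")
print(" ".join(argv))
```

On a host with FFmpeg installed, the resulting list could be passed to `subprocess.run(argv, check=True)`, and the output file handed to `transcribe`.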

## Surface

Per cartridge:

- `whisper-mcp` → `transcribe` (file/URL → text), `detect_language`, optional `translate` (any → English)
- `elevenlabs-mcp` → `synthesize` (text → audio), `list_voices`, `clone_voice` (premium tier)
- `replicate-mcp` → `run_model` (model_id + inputs → outputs), `list_models`, `get_prediction` (status-poll)
- `ffmpeg-mcp` → `probe` (metadata), `extract_audio`, `transcode`, `extract_frames`, `concat`, `trim`
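
The status-poll behind `get_prediction` could look roughly like the sketch below. It assumes a prediction is a dict with a `status` field and that `succeeded`, `failed`, and `canceled` are the terminal states, which matches my understanding of Replicate's predictions API but should be checked against the current docs. `wait_for_prediction`, the `fetch` callable, and the prediction id are hypothetical; `fetch` stands in for whatever HTTP call the cartridge ends up making.

```python
import time
from typing import Callable

# Assumed terminal states; verify against Replicate's API reference.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_prediction(
    fetch: Callable[[str], dict],
    prediction_id: str,
    interval_s: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch(prediction_id) until the prediction reaches a terminal state."""
    for _ in range(max_polls):
        prediction = fetch(prediction_id)
        if prediction["status"] in TERMINAL_STATUSES:
            return prediction
        sleep(interval_s)  # back off before asking again
    raise TimeoutError(f"prediction {prediction_id} still running after {max_polls} polls")


# Fake backend that walks through a typical lifecycle, so the sketch runs offline.
_responses = iter([
    {"status": "starting"},
    {"status": "processing"},
    {"status": "succeeded", "output": ["frame.png"]},
])
result = wait_for_prediction(lambda _id: next(_responses), "pred-123", sleep=lambda _s: None)
print(result["status"])
```

Injecting `fetch` and `sleep` keeps the loop testable without network access; a bounded `max_polls` avoids an agent hanging on a prediction that never finishes.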

Bridge-level: either four tools (`boj_audio_stt`, `boj_audio_tts`, `boj_media_generate`, `boj_media_transcode`) or one umbrella tool, `boj_multimodal`, with operation routing. Recommendation: the umbrella, which keeps the tool count manageable and mirrors `boj_search`.
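
To make the umbrella option concrete, here is a minimal sketch of operation routing, assuming one operation name per would-be standalone tool. The operation names (`stt`, `tts`, `generate`, `transcode`), the `call_cartridge` callback, and the parameter shapes are all hypothetical; the actual `boj_search` routing is not shown in this issue and may differ.

```python
from typing import Any, Callable

# operation -> (cartridge, cartridge tool); names on the left are assumptions.
ROUTES: dict[str, tuple[str, str]] = {
    "stt": ("whisper-mcp", "transcribe"),
    "tts": ("elevenlabs-mcp", "synthesize"),
    "generate": ("replicate-mcp", "run_model"),
    "transcode": ("ffmpeg-mcp", "transcode"),
}


def boj_multimodal(
    operation: str,
    params: dict[str, Any],
    call_cartridge: Callable[[str, str, dict[str, Any]], Any],
) -> Any:
    """Dispatch one umbrella call to the cartridge tool that handles it."""
    try:
        cartridge, tool = ROUTES[operation]
    except KeyError:
        raise ValueError(
            f"unknown operation {operation!r}; expected one of {sorted(ROUTES)}"
        ) from None
    return call_cartridge(cartridge, tool, params)


def fake_call(cartridge: str, tool: str, params: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for the real bridge call; it just echoes what would be invoked.
    return {"cartridge": cartridge, "tool": tool, "params": params}


routed = boj_multimodal("stt", {"source": "talk.wav"}, fake_call)
print(routed)
```

With this shape, adding a fifth cartridge is one new entry in the routing table rather than a new bridge-level tool.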

## Composition with existing cartridges

- Pairs with item 7 (`chromadb-mcp` etc.) for vision-RAG and audio-search
- Pairs with `browser-mcp` (screenshot → describe via Replicate vision model)
- Pairs with the new `boj_search` (PR #99) for video transcript retrieval workflows
- Composes inside future prompt templates: `transcribe-and-summarise`, `generate-image-for`, `extract-key-frames`

## Implementation plan

Each cartridge as a sibling PR:

1. `feat/item-9a-whisper-mcp`: STT first; smallest surface, gives immediate value
2. `feat/item-9b-ffmpeg-mcp`: local-only; enables the audio/video extraction that whisper needs for video inputs
3. `feat/item-9c-elevenlabs-mcp`: TTS; paired with whisper for bidirectional audio
4. `feat/item-9d-replicate-mcp`: broadest surface (many model families); land last

Roughly 1-2 days per cartridge for manifest + bridge surface + offline-menu wiring. Replicate will take the longest because of its model-id taxonomy.

## Dependencies / non-dependencies

- **Pairs naturally with** item 14 (HTTP transport): multi-modal use cases include browser-based agent web apps where users record audio/video locally and send it to BoJ
- **Pairs with** item 7 (vector DBs) for multi-modal RAG
- **`ffmpeg-mcp` is local-only**: it requires a host FFmpeg binary and won't work on Cloudflare Workers per ADR-0013's cartridge-compat matrix. The other three are HTTP-API-based and work everywhere (the local whisper.cpp backend of `whisper-mcp` is presumably the exception)
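
The local-only constraint can be pictured as a simple filter over the four cartridges. This is only a sketch: ADR-0013's actual matrix is not reproduced in this issue, so the `needs_host_binary` flag and the `available_on` helper are hypothetical, and `whisper-mcp` is modelled here by its hosted-API path only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cartridge:
    name: str
    needs_host_binary: bool  # hypothetical flag, not a field from ADR-0013


CARTRIDGES = [
    Cartridge("whisper-mcp", needs_host_binary=False),   # hosted Whisper API path
    Cartridge("elevenlabs-mcp", needs_host_binary=False),
    Cartridge("replicate-mcp", needs_host_binary=False),
    Cartridge("ffmpeg-mcp", needs_host_binary=True),     # shells out to local FFmpeg
]


def available_on(runtime_has_host_binaries: bool) -> list[str]:
    """Return the cartridges usable on a runtime, in declaration order."""
    return [
        c.name
        for c in CARTRIDGES
        if runtime_has_host_binaries or not c.needs_host_binary
    ]


workers_like = available_on(runtime_has_host_binaries=False)
local_host = available_on(runtime_has_host_binaries=True)
print(workers_like)
print(local_host)
```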

## Out of scope

- OpenAI's image / audio APIs: these overlap with Replicate's catalogue; if a user needs OpenAI specifically, route through the existing claude-ai-mcp pattern (a sibling `openai-mcp` is a separate ask, not part of the multi-modal core)
- Web Audio API browser-side capture: that's a client-side concern; BoJ receives audio data, it doesn't capture it
- Real-time streaming (live TTS / live STT): v1 is batch; revisit when item 14 (HTTP/SSE transport) lands and SSE streams are available

## Exit criteria

Close when all 4 cartridges have manifest + bridge surface + offline-menu entry, mirroring the search-mcp pattern. Backend implementation tracked separately per cartridge.
