Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
27014d0
Fix Voxtral macOS packaging for runtime libraries.
seyeong-han Mar 24, 2026
95f1053
Add Flow-like features and Silero VAD wake to Voxtral macOS app
seyeong-han Mar 25, 2026
a64526c
Add Wake settings page to sidebar with toggle shortcut
seyeong-han Mar 25, 2026
548465d
Move all settings into sidebar, remove Settings window
seyeong-han Mar 25, 2026
33ad4da
Skip redundant mic permission prompt on VAD-triggered dictation
seyeong-han Mar 25, 2026
8c4df93
Preload Voxtral model before starting wake listening
seyeong-han Mar 26, 2026
a5e5a3b
Fix runner launch blocked by mic permission after fresh build
seyeong-han Mar 26, 2026
060cc76
Increase wake check window default to 4s
seyeong-han Mar 26, 2026
005ecdc
Accumulate full speech segments before wake phrase check
seyeong-han Mar 26, 2026
9f6354f
Always restart VAD after dictation ends
seyeong-han Mar 26, 2026
b8b9a41
Remove Recent Dictations section and strip trailing wake punctuation
seyeong-han Mar 26, 2026
8bc6e3a
Stop saving dictation sessions to history
seyeong-han Mar 26, 2026
b10ec97
Fix wake keyword placeholder text
seyeong-han Mar 26, 2026
09dd1f5
Add Excitorch -> ExecuTorch default replacement
seyeong-han Mar 26, 2026
2b7fc3c
Add padding to snippet editor sheet
seyeong-han Mar 26, 2026
b43614c
Fix snippet expansion: strip ASR punctuation and reorder pipeline
seyeong-han Mar 26, 2026
85b1a59
Fix editor sheets: wrong title on edit, missing padding
seyeong-han Mar 26, 2026
b62a7cd
Fix edit button showing Add sheet on first click
seyeong-han Mar 26, 2026
f0a0ddd
Allow snippet expansion by direct trigger match without prefix
seyeong-han Mar 26, 2026
d841430
Simplify snippet matching to direct trigger only
seyeong-han Mar 26, 2026
f3acd80
Update README with replacements, snippets, wake, and sidebar docs
seyeong-han Mar 26, 2026
50931f2
Add xqtorch -> ExecuTorch default replacement
seyeong-han Mar 26, 2026
a0f76ad
Update default replacements with common tech brand casing
seyeong-han Mar 26, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
108 changes: 95 additions & 13 deletions voxtral_realtime/macos/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,10 +8,14 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a

- **Live transcription** — real-time token streaming with audio waveform visualization
- **System-wide dictation** — press `Ctrl+Space` in any app to transcribe speech and auto-paste the result
- **"Hey torch" voice wake** — hands-free dictation via Silero VAD speech detection and wake phrase matching
- **Text replacements** — auto-correct names, acronyms, and domain terms after transcription
- **Snippets** — say a trigger phrase (e.g., "email signature") to paste pre-written templates
- **Model preloading** — load the model once, transcribe instantly across sessions
- **Pause / resume** — pause and resume within the same session without losing context
- **Session history** — searchable history with rename, copy, and delete
- **Silence detection** — dictation auto-stops after 2 seconds of silence
- **Session history** — searchable history with pinning, recency grouping, rename, copy, and multi-format export (.txt, .json, .srt)
- **Silence detection** — dictation auto-stops after configurable silence timeout
- **Sidebar navigation** — Home, Replacements, Snippets, Wake, and Settings pages
- **Self-contained DMG** — runner binary, model weights, and runtime libraries all bundled

## Download
Expand Down Expand Up @@ -40,9 +44,39 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a
2. Focus any text field in any app (Notes, Slack, browser, etc.)
3. Press **`Ctrl+Space`** — a floating overlay appears with a waveform
4. Speak — live transcribed text appears in the overlay
5. Press **`Ctrl+Space`** again to stop, or wait for 2 seconds of silence
5. Press **`Ctrl+Space`** again to stop, or wait for silence auto-stop
6. The transcribed text is automatically pasted into the focused text field

### Voice wake ("Hey torch")

1. Open the **Wake** page in the sidebar and enable it (or press `Ctrl+Shift+W`)
2. Say **"Hey torch"** — Silero VAD detects your speech, Voxtral checks for the wake keyword
3. If matched, the dictation panel appears and you can start speaking
4. Dictation auto-pastes when you stop speaking, then wake listening resumes

The wake keyword is configurable in the Wake settings (default: "torch"). The app accumulates the full speech segment before checking, so you can say "hey torch" naturally as one phrase.

### Replacements

Add text replacements in the **Replacements** page. Each entry has a trigger and replacement — when the trigger appears in transcribed text, it's automatically replaced.

Examples:
- `mtia` → `MTIA`
- `executorch` → `ExecuTorch`
- `execute torch` → `ExecuTorch`

Supports case-preserving matching and word boundary options.

### Snippets

Add voice-triggered templates in the **Snippets** page. When your entire dictation matches a snippet trigger, the template content is pasted instead.

Example: say **"email signature"** to paste:
```
Best,
Younghan
```

### Keyboard shortcuts

| Shortcut | Action |
Expand All @@ -53,7 +87,7 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a
| `Cmd+Shift+C` | Copy transcript |
| `Cmd+Shift+U` | Unload model |
| `Ctrl+Space` | Toggle system-wide dictation |
| `Cmd+,` | Settings |
| `Ctrl+Shift+W` | Toggle voice wake on/off |

---

Expand Down Expand Up @@ -129,7 +163,19 @@ The runner binary will be at:
${EXECUTORCH_PATH}/cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner
```

#### 4. Install Python packages
#### 4. Build the Silero VAD stream runner (for voice wake)

```bash
cd ${EXECUTORCH_PATH}
make silero-vad-cpu
```

This builds `silero_vad_stream_runner` at:
```
${EXECUTORCH_PATH}/cmake-out/examples/models/silero_vad/silero_vad_stream_runner
```

#### 5. Install Python packages

```bash
pip install huggingface_hub sounddevice
Expand All @@ -138,7 +184,7 @@ pip install huggingface_hub sounddevice
- `huggingface_hub` — to download model artifacts from HuggingFace
- `sounddevice` — for the CLI mic streaming test script

#### 5. Download model artifacts
#### 6. Download model artifacts

```bash
export LOCAL_FOLDER="$HOME/voxtral_realtime_quant_metal"
Expand All @@ -152,7 +198,13 @@ This downloads three files (~6.2 GB total):

HuggingFace repo: [`mistralai/Voxtral-Mini-4B-Realtime-2602-ExecuTorch`](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-ExecuTorch)

#### 6. Test with CLI (optional)
Download the Silero VAD model:

```bash
hf download younghan-meta/Silero-VAD-ExecuTorch-XNNPACK --local-dir ~/silero_vad_xnnpack
```

#### 7. Test with CLI (optional)

Verify the runner works before building the app:

Expand All @@ -170,7 +222,7 @@ cd ${LOCAL_FOLDER} && chmod +x stream_audio.py
--mic
```

#### 7. Build the app and create DMG
#### 8. Build the app and create DMG

```bash
cd voxtral_realtime/macos
Expand Down Expand Up @@ -198,6 +250,7 @@ VoxtralRealtimeApp
├── TranscriptStore (@Observable, @MainActor)
│ ├── SessionState: idle → loading → transcribing ⇆ paused → idle
│ ├── ModelState: unloaded → loading → ready
│ ├── TextPipeline: replacements → snippets → style (no-op)
│ └── RunnerBridge (actor)
│ ├── Process (voxtral_realtime_runner)
│ │ ├── stdin ← raw 16kHz mono f32le PCM
Expand All @@ -207,9 +260,21 @@ VoxtralRealtimeApp
│ └── AVAudioEngine → format conversion → pipe
├── DictationManager (@Observable, @MainActor)
│ ├── Global hotkey (Carbon RegisterEventHotKey)
│ ├── VadService (actor)
│ │ ├── Process (silero_vad_stream_runner)
│ │ │ ├── stdin ← 16kHz mono f32le PCM
│ │ │ └── stdout → PROB <time> <probability>
│ │ └── AVAudioEngine → speech segment accumulation
│ ├── Wake flow: VAD speech-end → Voxtral phrase check → active
│ ├── DictationPanel (NSPanel, non-activating, floating)
│ └── Paste via CGEvent (Cmd+V to frontmost app)
└── Views (SwiftUI)
├── ReplacementStore / SnippetStore (JSON, Application Support)
└── Views (SwiftUI, sidebar navigation)
├── Home (welcome, transcription, session detail)
├── Replacements (add/edit/delete trigger→replacement)
├── Snippets (add/edit/delete voice-triggered templates)
├── Wake (VAD toggle, keyword, detection tuning, status)
└── Settings (runner paths, model files, silence, shortcuts)
```

## Troubleshooting
Expand All @@ -223,11 +288,13 @@ The app needs **Accessibility** permission to simulate `Cmd+V` in other apps.
3. If already listed, toggle it **off and back on**
4. **Quit and relaunch** the app — macOS caches the trust state at process launch

When running Debug builds from Xcode, each rebuild produces a new binary signature. macOS tracks Accessibility trust per binary identity, so you may need to re-grant permission after rebuilding. To avoid this:
- Remove the old entry from Accessibility settings before re-adding
- Or run the Release build for testing dictation
After every fresh build, reset Accessibility permissions:

```bash
tccutil reset Accessibility org.pytorch.executorch.VoxtralRealtime
```

Even if Accessibility isn't granted, the transcribed text is always copied to the clipboard — you can paste manually with `Cmd+V`.
Then relaunch and re-grant when prompted.

### Model fails to load / runner crashes

Expand All @@ -249,6 +316,21 @@ The app requests microphone access on first use. If denied:
2. Enable `Voxtral Realtime`
3. **Quit and relaunch** the app — macOS caches permission grants per process lifetime

### Microphone producing silence

If the Wake status shows "Disabled" with a silence error, macOS is delivering zero-audio to the app. This can happen after fresh builds or permission changes.

1. Toggle the app's microphone permission **off and on** in System Settings
2. Quit and relaunch the app
3. If running from Cursor's terminal, try the native macOS Terminal instead

### Voice wake not triggering

- Check the **Wake** page — status should show "Listening for speech..."
- Ensure `silero_vad_stream_runner` and `silero_vad.pte` paths are correct
- Try increasing the **Check window** slider (default 4s) — Voxtral needs 1-2s to produce the first token
- Speak clearly and wait for a brief pause after "hey torch" so VAD detects the speech segment end

### Permission prompts don't appear (stale TCC entries)

If you've built or installed the app multiple times (Debug builds, Release builds, DMG installs), macOS may have accumulated multiple permission entries for the same bundle ID. Reset them to get a clean slate:
Expand Down
Loading
Loading