AI-generated voice acting for classic SNES games, live in your browser.
PixelVoice lets you play SNES ROMs in the browser and hear characters speak their dialogue out loud — even though the original games shipped as silent text. It captures video frames from the emulator, sends them to Gemini Flash for real-time dialogue detection, synthesizes character-specific speech via Gradium TTS, and plays it back as you play.
Built for {tech: europe} London AI Hack 2026.
Browser Server
┌──────────────┐ JPEG frames ┌──────────────────┐
│ EmulatorJS │ ──── WS ──────> │ Frame Analyzer │
│ (SNES9x) │ │ (Gemini Flash) │
│ │ dialogue JSON │ │
│ Audio out │ <─── WS ─────── │ Dialog Detection │
│ <speaker> │ └──────────────────┘
│ │ POST /api/tts ┌──────────────────┐
│ TTS Client │ ──── HTTP ─────> │ Gradium TTS API │
│ │ WAV base64 │ (Voice Synth) │
│ │ <─── HTTP ────── │ │
└──────────────┘ └──────────────────┘
- Capture: The browser captures the emulator tab at ~0.33 Hz (1 frame every 3 seconds) via getDisplayMedia, resizes to max 1024x768, and encodes as JPEG.
- Detect: Each frame is sent over WebSocket to the server, which forwards it to Gemini 3.1 Flash Lite with a vision prompt that extracts any on-screen dialogue, the speaking character's name, and emotional context.
- Synthesize: When dialogue is detected, the client calls /api/tts with the character name and text. The server maps the character to a unique Gradium voice profile and returns synthesized WAV audio.
- Play: The browser decodes and plays the WAV via the HTML5 Audio API, giving each character a distinct voice as you play.
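The resize step in Capture can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: `fitWithin`, `Size`, and `CAPTURE_INTERVAL_MS` are hypothetical names, and the sketch assumes the frame is scaled down with its aspect ratio preserved and is never upscaled.

```typescript
// Hypothetical helper (not taken from the repository): compute capture
// dimensions that fit inside a bounding box while preserving aspect ratio.
interface Size {
  width: number;
  height: number;
}

// Assumption: frames are only ever scaled down, never up.
function fitWithin(src: Size, max: Size = { width: 1024, height: 768 }): Size {
  const scale = Math.min(1, max.width / src.width, max.height / src.height);
  return {
    width: Math.max(1, Math.round(src.width * scale)),
    height: Math.max(1, Math.round(src.height * scale)),
  };
}

// One frame every 3 seconds, matching the ~0.33 Hz rate described above.
const CAPTURE_INTERVAL_MS = 3000;
```

Under these assumptions, a 1920x1080 tab capture would be scaled to 1024x576, while a frame already inside the limit is left unchanged.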
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16 (App Router), React 19, TypeScript | Application framework |
| Styling | Tailwind CSS 4 | Dark CRT-phosphor aesthetic |
| Emulation | EmulatorJS (SNES9x core, CDN-hosted) | In-browser SNES emulation |
| Dialogue Detection | Google Gemini 3.1 Flash Lite (@google/genai) | Vision-based dialogue extraction from game frames |
| Voice Synthesis | Gradium TTS API | Character-specific text-to-speech |
| Real-time Transport | WebSockets (ws) | Frame streaming between browser and server |
| Dev Server | Custom Node.js HTTP server (tsx) | WebSocket upgrade handling alongside Next.js |
| Deployment | Cloudflare Workers via OpenNext | Edge deployment |
| Linting/Formatting | Biome | Code quality |
- Node.js >= 20
- pnpm (package manager)
- Gemini API key — Get one from Google AI Studio
- Gradium API key — Sign up at Gradium
- An SNES ROM file (.sfc, .smc, or .zip)
# Clone the repository
git clone https://github.com/zacksheppard/pixelvoice.git
cd pixelvoice
# Install dependencies
pnpm install
# Create environment file
cp .env.example .env.local
# (or create .env.local manually; see below)
Create a .env.local file in the project root:
# Required
GEMINI_API_KEY=your_gemini_api_key_here
GRADIUM_API_KEY=your_gradium_api_key_here
# Optional
GRADIUM_API_BASE=https://eu.api.gradium.ai # Custom Gradium API base URL
GRADIUM_REGION=us # "us" for US region (defaults to EU)
PORT=3000 # Dev server port (default: 3000)
pnpm dev
This starts a custom Node.js server (via tsx watch) that:
- Serves the Next.js application on http://localhost:3000
- Handles WebSocket upgrades at /api/voice for the frame analysis pipeline
Open http://localhost:3000 in your browser.
- Load a ROM: Click "Load a ROM" on the landing page, then upload an .sfc, .smc, or .zip file. The ROM stays entirely in your browser memory (no server upload).
- Play the game: The SNES emulator runs in-browser via EmulatorJS.
- Start voice: Click "Start voice" to begin screen capture. Your browser will prompt you to share the current tab.
- Listen: As dialogue appears on screen, characters will speak their lines aloud with unique AI-generated voices.
Append ?debug=1 to the play page URL for a debug panel with:
- WRAM memory dumps (e.g., Chrono Trigger text table at 0x7E0290)
- Canvas snapshot export
- Real-time pipeline log output
Bidirectional WebSocket for the frame analysis pipeline.
Client sends: Binary JPEG frame data (ArrayBuffer)
Server responds: JSON messages
The server applies backpressure — if the previous frame is still being analyzed, incoming frames are dropped.
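The drop-while-busy behavior described above can be sketched as a small gate around the analysis call. This is an illustration under stated assumptions, not the server's actual code: `createFrameGate`, `submit`, and `droppedCount` are hypothetical names, and the analysis function is assumed to return a promise.

```typescript
// Hypothetical sketch: accept a frame only when no analysis is in flight.
type Analyze<T> = (frame: T) => Promise<unknown>;

function createFrameGate<T>(analyze: Analyze<T>) {
  let busy = false;
  let dropped = 0;

  return {
    // Returns true if the frame was accepted, false if it was dropped.
    submit(frame: T): boolean {
      if (busy) {
        dropped += 1;
        return false;
      }
      busy = true;
      const release = (): void => {
        busy = false;
      };
      // Release the gate whether the analysis succeeds or fails.
      Promise.resolve()
        .then(() => analyze(frame))
        .then(release, release);
      return true;
    },
    get droppedCount(): number {
      return dropped;
    },
  };
}
```

Dropping rather than queueing keeps the analyzer working on recent frames; at one frame every 3 seconds, a queued frame would likely be stale by the time it was processed.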
Synthesizes dialogue audio via Gradium TTS.
Request:
{
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited"
}
Response (200):
{
  "accepted": true,
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited",
  "audio_wav_base64": "UklGRi4AAABXQVZFZm10..."
}
Error responses:
| Status | Cause |
|---|---|
| 400 | Invalid or missing request body |
| 502 | Gradium API returned an error |
| 503 | GRADIUM_API_KEY not configured |
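The 400 case can be illustrated with a small request validator. This is a sketch, not the route's actual implementation: `parseTtsRequest` and `TtsRequest` are hypothetical names, and treating `condition_notes` as optional is an assumption the README does not confirm.

```typescript
// Hypothetical request shape, based on the example request above.
interface TtsRequest {
  character: string;
  dialog: string;
  condition_notes?: string; // assumption: optional
}

// Returns the validated request, or null when the body should be
// rejected (which would map to the 400 row in the table above).
function parseTtsRequest(body: unknown): TtsRequest | null {
  if (typeof body !== "object" || body === null) return null;
  const { character, dialog, condition_notes } = body as Record<string, unknown>;
  if (typeof character !== "string" || character.trim() === "") return null;
  if (typeof dialog !== "string" || dialog.trim() === "") return null;
  if (condition_notes === undefined) return { character, dialog };
  if (typeof condition_notes !== "string") return null;
  return { character, dialog, condition_notes };
}
```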
pixelvoice/
├── server.ts # Custom dev server (Next.js + WebSocket)
├── src/
│ ├── app/
│ │ ├── page.tsx # Landing page
│ │ ├── layout.tsx # Root layout (fonts, metadata)
│ │ ├── globals.css # Tailwind config + theme
│ │ ├── play/
│ │ │ └── page.tsx # Emulator + voice pipeline UI
│ │ ├── tts/
│ │ │ └── page.tsx # Standalone TTS test page
│ │ └── api/
│ │   └── tts/
│ │     └── route.ts # POST /api/tts endpoint
│ ├── components/
│ │ └── snes-player.tsx # EmulatorJS lifecycle wrapper
│ ├── lib/
│ │ ├── frame-capture.ts # Screen capture via getDisplayMedia
│ │ ├── gemini-session.ts # Gemini vision dialogue detection
│ │ ├── voice-socket.ts # WebSocket client manager
│ │ ├── tts-playback.ts # Audio decode + playback utilities
│ │ ├── character-gradium-voices.ts # Character → voice ID mapping
│ │ └── snes-hooks.ts # WRAM reader, canvas snapshot, CT text decoder
│ └── types/
│ ├── emulatorjs.d.ts # EmulatorJS ambient type declarations
│ └── tts.ts # TTS request/response interfaces
├── next.config.ts # Next.js config (server-external packages)
├── wrangler.jsonc # Cloudflare Workers deployment config
├── open-next.config.ts # OpenNext adapter config
├── biome.jsonc # Biome linter/formatter config
└── package.json
The emulator loads via a CDN-hosted script tag. window.EJS_* globals must be set before the loader script is injected — EmulatorJS reads them synchronously on load. The SnesPlayer component manages this lifecycle and cleanup.
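The ordering constraint can be sketched as follows. This is an illustration only: `bootEmulator`, `buildEjsGlobals`, and `EjsConfig` are hypothetical names, and the sketch takes the globals target and the script injector as parameters so the ordering is visible without a DOM. `EJS_player`, `EJS_core`, `EJS_gameUrl`, and `EJS_pathtodata` are EmulatorJS globals; the values passed in are assumptions.

```typescript
// Hypothetical configuration shape for this sketch.
interface EjsConfig {
  playerSelector: string; // CSS selector of the container element
  core: string;           // e.g. "snes" (assumed value)
  gameUrl: string;        // e.g. a blob: URL for the uploaded ROM
  dataPath: string;       // base URL of the EmulatorJS data folder, with trailing slash
}

function buildEjsGlobals(cfg: EjsConfig): Record<string, string> {
  return {
    EJS_player: cfg.playerSelector,
    EJS_core: cfg.core,
    EJS_gameUrl: cfg.gameUrl,
    EJS_pathtodata: cfg.dataPath,
  };
}

function bootEmulator(
  cfg: EjsConfig,
  target: Record<string, unknown>,      // `window` in the browser
  injectScript: (src: string) => void,  // appends a <script> tag in the browser
): void {
  // 1. Set every EJS_* global first.
  const globals = buildEjsGlobals(cfg);
  for (const key of Object.keys(globals)) {
    target[key] = globals[key];
  }
  // 2. Only then inject the loader, which reads the globals when it runs.
  injectScript(cfg.dataPath + "loader.js");
}
```

In a browser, `target` would be `window` and `injectScript` would create a script element and append it to the document; cleanup on unmount is left out of this sketch.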
Each game character maps to a specific Gradium voice profile for consistent identity. Unrecognized NPCs fall back to a pool of 12 backup voices (6 masculine, 6 feminine) selected by Gemini based on the character's apparent appearance.
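A lookup with a fallback pool might look like the sketch below. Everything here is hypothetical: the voice IDs are placeholders rather than real Gradium IDs, and the sketch assumes the detection step supplies a `presentation` hint and a pool slot for unrecognized characters.

```typescript
type Presentation = "masculine" | "feminine";

// Placeholder IDs for illustration; real Gradium voice IDs are not shown in this README.
const CHARACTER_VOICES = new Map<string, string>([
  ["marle", "voice-marle"],
  ["crono", "voice-crono"],
]);

// 12 backup voices: 6 masculine, 6 feminine.
const BACKUP_VOICES: Record<Presentation, string[]> = {
  masculine: ["backup-m-1", "backup-m-2", "backup-m-3", "backup-m-4", "backup-m-5", "backup-m-6"],
  feminine: ["backup-f-1", "backup-f-2", "backup-f-3", "backup-f-4", "backup-f-5", "backup-f-6"],
};

// Assumption: `presentation` and `poolSlot` come from the detection response.
function resolveVoice(character: string, presentation: Presentation, poolSlot = 0): string {
  const known = CHARACTER_VOICES.get(character.trim().toLowerCase());
  if (known !== undefined) return known;
  const pool = BACKUP_VOICES[presentation];
  // Wrap the slot into the pool range, tolerating negative or oversized values.
  const index = ((Math.trunc(poolSlot) % pool.length) + pool.length) % pool.length;
  return pool[index];
}
```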
The client deduplicates dialogue events by character + text hash to avoid re-synthesizing the same line while it remains on screen across multiple capture frames.
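A minimal sketch of this deduplication follows. The hash is an FNV-1a-style function over UTF-16 code units, chosen here only for illustration; the project may use a different hash. `dialogueKey` and `createDeduper` are hypothetical names, and the whitespace and case normalization is an assumption.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (illustrative choice).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Assumption: collapse whitespace and lowercase the character name so that
// small differences between frames map to the same key.
function dialogueKey(character: string, text: string): string {
  const normalized = text.trim().replace(/\s+/g, " ");
  return character.trim().toLowerCase() + ":" + fnv1a(normalized).toString(16);
}

function createDeduper() {
  const seen = new Set<string>();
  return {
    // True the first time a (character, text) pair is seen, false afterwards.
    shouldSpeak(character: string, text: string): boolean {
      const key = dialogueKey(character, text);
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    },
    // Forget everything, e.g. when a new ROM is loaded.
    reset(): void {
      seen.clear();
    },
  };
}
```

As written, the set grows for the whole session, so a line repeated later in the game would stay silent; a real client would probably expire keys once the dialogue leaves the screen.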
Binary JPEG frames are streamed over WebSocket rather than HTTP POST to minimize per-frame overhead. The custom server.ts intercepts WebSocket upgrades since Next.js API routes don't natively support WebSocket connections.
pnpm dev # Start dev server (Next.js + WebSocket)
pnpm build # Build Next.js production bundle
pnpm lint # Run Biome linter
pnpm format # Run Biome formatter (auto-fix)
pnpm deploy # Build with OpenNext + deploy to Cloudflare
pnpm preview # Build with OpenNext + local Cloudflare preview
pnpm cf-typegen # Regenerate Cloudflare environment types
PixelVoice deploys to Cloudflare Workers via the OpenNext adapter:
pnpm deploy
The wrangler.jsonc configuration includes:
- nodejs_compat compatibility flag (required for ws and @google/genai)
- Asset serving from .open-next/assets
- Source map uploads for debugging
MIT