
PixelVoice

AI-generated voice acting for classic SNES games, live in your browser.

PixelVoice lets you play SNES ROMs in the browser and hear characters speak their dialogue out loud, even though the original games shipped their dialogue as silent text. It captures video frames from the emulator, sends them to Gemini Flash for real-time dialogue detection, synthesizes character-specific speech via Gradium TTS, and plays it back as you play.

Built for {tech: europe} London AI Hack 2026.


How It Works

Browser                           Server
┌──────────────┐   JPEG frames    ┌──────────────────┐
│  EmulatorJS  │ ───── WS ──────> │  Frame Analyzer  │
│  (SNES9x)    │                  │  (Gemini Flash)  │
│              │  dialogue JSON   │                  │
│  Audio out   │ <──── WS ─────── │ Dialog Detection │
│  <speaker>   │                  └──────────────────┘
│              │  POST /api/tts   ┌──────────────────┐
│  TTS Client  │ ───── HTTP ────> │  Gradium TTS API │
│              │   WAV base64     │  (Voice Synth)   │
│              │ <──── HTTP ───── │                  │
└──────────────┘                  └──────────────────┘
  1. Capture — The browser captures the emulator tab at ~0.33 Hz (1 frame every 3 seconds) via getDisplayMedia, resizes to max 1024x768, and encodes as JPEG.
  2. Detect — Each frame is sent over WebSocket to the server, which forwards it to Gemini 3.1 Flash Lite with a vision prompt that extracts any on-screen dialogue, the speaking character's name, and emotional context.
  3. Synthesize — When dialogue is detected, the client calls /api/tts with the character name and text. The server maps the character to a unique Gradium voice profile and returns synthesized WAV audio.
  4. Play — The browser decodes and plays the WAV via the HTML5 Audio API, giving each character a distinct voice as you play.
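The capture step above comes down to simple clamp-and-scale math before the JPEG encode. A minimal sketch (the function name and shape are illustrative, not the repo's actual frame-capture.ts exports):

```typescript
// Scale a captured frame down so neither dimension exceeds the cap,
// preserving aspect ratio; frames already within bounds pass through.
interface Size {
  width: number;
  height: number;
}

function fitWithin(src: Size, maxW = 1024, maxH = 768): Size {
  const scale = Math.min(1, maxW / src.width, maxH / src.height);
  return {
    width: Math.round(src.width * scale),
    height: Math.round(src.height * scale),
  };
}

// In the browser, the scaled frame would then be drawn to a canvas and
// exported as JPEG, e.g. canvas.toBlob(cb, "image/jpeg", quality).
```

A 2048x1536 capture scales to exactly 1024x768, while a native 512x448 SNES frame is sent untouched.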

Tech Stack

Layer                Technology                                      Purpose
Frontend             Next.js 16 (App Router), React 19, TypeScript   Application framework
Styling              Tailwind CSS 4                                  Dark CRT-phosphor aesthetic
Emulation            EmulatorJS (SNES9x core, CDN-hosted)            In-browser SNES emulation
Dialogue Detection   Google Gemini 3.1 Flash Lite (@google/genai)    Vision-based dialogue extraction from game frames
Voice Synthesis      Gradium TTS API                                 Character-specific text-to-speech
Real-time Transport  WebSockets (ws)                                 Frame streaming between browser and server
Dev Server           Custom Node.js HTTP server (tsx)                WebSocket upgrade handling alongside Next.js
Deployment           Cloudflare Workers via OpenNext                 Edge deployment
Linting/Formatting   Biome                                           Code quality

Prerequisites

  • Node.js (a recent LTS release) and pnpm
  • A Google Gemini API key
  • A Gradium TTS API key
  • A browser that supports tab capture via getDisplayMedia (e.g. a recent Chromium-based browser)

Setup & Installation

# Clone the repository
git clone https://github.com/zackdotcomputer/pixelvoice.git
cd pixelvoice

# Install dependencies
pnpm install

# Create environment file
cp .env.example .env.local
# (or create .env.local manually — see below)

Environment Variables

Create a .env.local file in the project root:

# Required
GEMINI_API_KEY=your_gemini_api_key_here
GRADIUM_API_KEY=your_gradium_api_key_here

# Optional
GRADIUM_API_BASE=https://eu.api.gradium.ai   # Custom Gradium API base URL
GRADIUM_REGION=us                              # "us" for US region (defaults to EU)
PORT=3000                                      # Dev server port (default: 3000)

Run the Dev Server

pnpm dev

This starts a custom Node.js server (via tsx watch) that:

  • Serves the Next.js application on http://localhost:3000
  • Handles WebSocket upgrades at /api/voice for the frame analysis pipeline

Open http://localhost:3000 in your browser.


Usage

  1. Load a ROM — Click "Load a ROM" on the landing page, then upload an .sfc, .smc, or .zip file. The ROM stays entirely in your browser memory (no server upload).
  2. Play the game — The SNES emulator runs in-browser via EmulatorJS.
  3. Start voice — Click "Start voice" to begin screen capture. Your browser will prompt you to share the current tab.
  4. Listen — As dialogue appears on screen, characters will speak their lines aloud with unique AI-generated voices.

Debug Mode

Append ?debug=1 to the play page URL for a debug panel with:

  • WRAM memory dumps (e.g., Chrono Trigger text table at 0x7E0290)
  • Canvas snapshot export
  • Real-time pipeline log output

API Reference

WebSocket: /api/voice

Bidirectional WebSocket for the frame analysis pipeline.

Client sends: Binary JPEG frame data (ArrayBuffer)

Server responds: JSON messages

// Dialogue detected
{
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited"
}

// No dialogue on screen
{
  "type": "no_dialog"
}

The server applies backpressure — if the previous frame is still being analyzed, incoming frames are dropped.
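On the client side, the two message shapes above form a small discriminated union. A sketch of how a consumer might parse them (type and function names are illustrative; the repo's actual client lives in src/lib/voice-socket.ts):

```typescript
// A detected line of dialogue, as sent by the server.
interface DialogEvent {
  character: string;
  dialog: string;
  condition_notes?: string;
}

// The server sends either a DialogEvent or a no_dialog marker.
type VoiceMessage = DialogEvent | { type: "no_dialog" };

// Returns the dialogue event, or null when the frame had no dialogue.
function parseVoiceMessage(raw: string): DialogEvent | null {
  const msg = JSON.parse(raw) as VoiceMessage;
  if ("type" in msg && msg.type === "no_dialog") return null;
  return msg as DialogEvent;
}
```

A WebSocket `onmessage` handler would call `parseVoiceMessage(event.data)` and, on a non-null result, trigger the /api/tts request described below.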

POST /api/tts

Synthesizes dialogue audio via Gradium TTS.

Request:

{
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited"
}

Response (200):

{
  "accepted": true,
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited",
  "audio_wav_base64": "UklGRi4AAABXQVZFZm10..."
}

Error responses:

Status  Cause
400     Invalid or missing request body
502     Gradium API returned an error
503     GRADIUM_API_KEY not configured
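A minimal client sketch for this endpoint (function names are illustrative): the `audio_wav_base64` field is decoded to raw WAV bytes, which the browser would wrap in a Blob and hand to an HTMLAudioElement for playback.

```typescript
// Decode the base64 WAV payload returned by /api/tts into raw bytes.
function decodeWavBase64(b64: string): Uint8Array {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

// Request synthesis for one line and return the decoded WAV bytes.
async function requestTts(character: string, dialog: string): Promise<Uint8Array> {
  const res = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ character, dialog }),
  });
  if (!res.ok) throw new Error(`TTS failed: ${res.status}`);
  const { audio_wav_base64 } = await res.json();
  return decodeWavBase64(audio_wav_base64);
}
```

Note that `"UklGR..."` in the example response is simply the base64 encoding of a WAV file's `RIFF` header.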

Project Structure

pixelvoice/
├── server.ts                        # Custom dev server (Next.js + WebSocket)
├── src/
│   ├── app/
│   │   ├── page.tsx                 # Landing page
│   │   ├── layout.tsx               # Root layout (fonts, metadata)
│   │   ├── globals.css              # Tailwind config + theme
│   │   ├── play/
│   │   │   └── page.tsx             # Emulator + voice pipeline UI
│   │   ├── tts/
│   │   │   └── page.tsx             # Standalone TTS test page
│   │   └── api/
│   │       └── tts/
│   │           └── route.ts         # POST /api/tts endpoint
│   ├── components/
│   │   └── snes-player.tsx          # EmulatorJS lifecycle wrapper
│   ├── lib/
│   │   ├── frame-capture.ts         # Screen capture via getDisplayMedia
│   │   ├── gemini-session.ts        # Gemini vision dialogue detection
│   │   ├── voice-socket.ts          # WebSocket client manager
│   │   ├── tts-playback.ts          # Audio decode + playback utilities
│   │   ├── character-gradium-voices.ts  # Character → voice ID mapping
│   │   └── snes-hooks.ts            # WRAM reader, canvas snapshot, CT text decoder
│   └── types/
│       ├── emulatorjs.d.ts          # EmulatorJS ambient type declarations
│       └── tts.ts                   # TTS request/response interfaces
├── next.config.ts                   # Next.js config (server-external packages)
├── wrangler.jsonc                   # Cloudflare Workers deployment config
├── open-next.config.ts              # OpenNext adapter config
├── biome.jsonc                      # Biome linter/formatter config
└── package.json

Key Design Decisions

EmulatorJS Integration

The emulator loads via a CDN-hosted script tag. window.EJS_* globals must be set before the loader script is injected — EmulatorJS reads them synchronously on load. The SnesPlayer component manages this lifecycle and cleanup.
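The ordering constraint can be sketched as follows. The helper is hypothetical; `EJS_player`, `EJS_core`, and `EJS_gameUrl` are standard EmulatorJS config globals, though the actual set used by SnesPlayer may differ.

```typescript
// Build the minimal EJS_* global config for a SNES session.
function emulatorConfig(romUrl: string, mount: string) {
  return {
    EJS_player: mount,    // CSS selector for the mount element
    EJS_core: "snes",     // selects the SNES9x core
    EJS_gameUrl: romUrl,  // blob: URL for the in-memory ROM
  };
}

// In the browser, each entry is copied onto window *before* the CDN
// loader <script> is appended, because EmulatorJS reads the globals
// synchronously when the loader executes:
//
//   Object.assign(window, emulatorConfig(romBlobUrl, "#game"));
//   const s = document.createElement("script");
//   s.src = "<CDN loader URL>";
//   document.body.appendChild(s);
```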

Character Voice Mapping

Each game character maps to a specific Gradium voice profile for consistent identity. Unrecognized NPCs fall back to a pool of 12 backup voices (6 masculine, 6 feminine) selected by Gemini based on the character's apparent appearance.
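A sketch of the mapping shape (all voice IDs and pool contents below are invented placeholders; the real table lives in src/lib/character-gradium-voices.ts). In the app, Gemini picks the fallback voice from the character's apparent appearance; for a self-contained illustration, this sketch instead picks deterministically by name hash so an NPC keeps the same voice across lines.

```typescript
// Placeholder known-character table (real IDs are Gradium voice IDs).
const KNOWN_VOICES: Record<string, string> = {
  marle: "gradium-voice-a",
  crono: "gradium-voice-b",
};

const FALLBACK_MASCULINE = ["m1", "m2", "m3", "m4", "m5", "m6"];
const FALLBACK_FEMININE = ["f1", "f2", "f3", "f4", "f5", "f6"];

function voiceFor(character: string, feminine: boolean): string {
  const known = KNOWN_VOICES[character.toLowerCase()];
  if (known) return known;
  // Stable pick from the backup pool: hash the name so repeat
  // appearances of the same NPC reuse the same voice.
  const pool = feminine ? FALLBACK_FEMININE : FALLBACK_MASCULINE;
  let h = 0;
  for (const c of character) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return pool[h % pool.length];
}
```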

Frame Deduplication

The client deduplicates dialogue events by character + text hash to avoid re-synthesizing the same line while it remains on screen across multiple capture frames.
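The dedup logic amounts to a set keyed by character plus text. A minimal sketch (class name is illustrative):

```typescript
// Tracks (character, dialog) pairs already spoken for the current
// on-screen text, so repeated capture frames don't re-synthesize.
class DialogDeduper {
  private seen = new Set<string>();

  // Returns true the first time a pair is seen, false on repeats.
  shouldSpeak(character: string, dialog: string): boolean {
    const key = `${character}\u0000${dialog}`; // NUL-joined composite key
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }

  // Clearing lets the same line be re-voiced if it reappears later.
  reset(): void {
    this.seen.clear();
  }
}
```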

WebSocket over HTTP for Frames

Binary JPEG frames are streamed over WebSocket rather than HTTP POST to minimize per-frame overhead. The custom server.ts intercepts WebSocket upgrades since Next.js API routes don't natively support WebSocket connections.
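The upgrade-interception pattern looks roughly like this (a sketch, not the repo's actual server.ts; the commented-out handoff assumes a `ws` WebSocketServer created with `{ noServer: true }`):

```typescript
import { createServer, type IncomingMessage } from "node:http";
import type { Duplex } from "node:stream";

// Decide whether an HTTP Upgrade request belongs to the voice pipeline.
export function isVoiceUpgrade(url: string | undefined): boolean {
  const { pathname } = new URL(url ?? "/", "http://localhost");
  return pathname === "/api/voice";
}

// Ordinary requests go to the Next.js request handler (omitted here).
const server = createServer();

server.on("upgrade", (req: IncomingMessage, socket: Duplex, head: Buffer) => {
  if (isVoiceUpgrade(req.url)) {
    // wss.handleUpgrade(req, socket, head, (ws) =>
    //   wss.emit("connection", ws, req));
  } else {
    socket.destroy(); // reject upgrades on any other path
  }
});
```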


Development Commands

pnpm dev          # Start dev server (Next.js + WebSocket)
pnpm build        # Build Next.js production bundle
pnpm lint         # Run Biome linter
pnpm format       # Run Biome formatter (auto-fix)
pnpm deploy       # Build with OpenNext + deploy to Cloudflare
pnpm preview      # Build with OpenNext + local Cloudflare preview
pnpm cf-typegen   # Regenerate Cloudflare environment types

Deployment

PixelVoice deploys to Cloudflare Workers via the OpenNext adapter:

pnpm deploy

The wrangler.jsonc configuration includes:

  • nodejs_compat compatibility flag (required for ws and @google/genai)
  • Asset serving from .open-next/assets
  • Image optimization binding
  • Source map uploads for debugging

License

MIT
