
PixelVoice

AI-generated voice acting for classic SNES games, live in your browser.

PixelVoice lets you play SNES ROMs in the browser and hear characters speak their dialogue out loud, even though the original games shipped their dialogue as silent text. It captures video frames from the emulator, sends them to Gemini Flash for real-time dialogue detection, synthesizes character-specific speech via Gradium TTS, and plays it back as you play.

Built for {tech: europe} London AI Hack 2026.


How It Works

Browser                           Server
┌──────────────┐   JPEG frames    ┌──────────────────┐
│  EmulatorJS  │ ───── WS ──────> │  Frame Analyzer  │
│  (SNES9x)    │                  │  (Gemini Flash)  │
│              │  dialogue JSON   │                  │
│  Audio out   │ <──── WS ─────── │ Dialog Detection │
│  <speaker>   │                  └──────────────────┘
│              │  POST /api/tts   ┌──────────────────┐
│  TTS Client  │ ───── HTTP ────> │  Gradium TTS API │
│              │   WAV base64     │  (Voice Synth)   │
│              │ <──── HTTP ───── │                  │
└──────────────┘                  └──────────────────┘
  1. Capture — The browser captures the emulator tab at ~0.33 Hz (1 frame every 3 seconds) via getDisplayMedia, resizes to max 1024x768, and encodes as JPEG.
  2. Detect — Each frame is sent over WebSocket to the server, which forwards it to Gemini 3.1 Flash Lite with a vision prompt that extracts any on-screen dialogue, the speaking character's name, and emotional context.
  3. Synthesize — When dialogue is detected, the client calls /api/tts with the character name and text. The server maps the character to a unique Gradium voice profile and returns synthesized WAV audio.
  4. Play — The browser decodes and plays the WAV via the HTML5 Audio API, giving each character a distinct voice as you play.
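The capture step above comes down to simple clamp-and-scale math before the JPEG encode. A minimal sketch (the function name and shape are illustrative, not the repo's actual frame-capture.ts exports):

```typescript
// Scale a captured frame down so neither dimension exceeds the cap,
// preserving aspect ratio; frames already within bounds pass through.
interface Size {
  width: number;
  height: number;
}

function fitWithin(src: Size, maxW = 1024, maxH = 768): Size {
  const scale = Math.min(1, maxW / src.width, maxH / src.height);
  return {
    width: Math.round(src.width * scale),
    height: Math.round(src.height * scale),
  };
}

// In the browser, the scaled frame would then be drawn to a canvas and
// exported as JPEG, e.g. canvas.toBlob(cb, "image/jpeg", quality).
```

A 2048x1536 capture scales to exactly 1024x768, while a native 512x448 SNES frame is sent untouched.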

Tech Stack

Layer                Technology                                      Purpose
Frontend             Next.js 16 (App Router), React 19, TypeScript   Application framework
Styling              Tailwind CSS 4                                  Dark CRT-phosphor aesthetic
Emulation            EmulatorJS (SNES9x core, CDN-hosted)            In-browser SNES emulation
Dialogue Detection   Google Gemini 3.1 Flash Lite (@google/genai)    Vision-based dialogue extraction from game frames
Voice Synthesis      Gradium TTS API                                 Character-specific text-to-speech
Real-time Transport  WebSockets (ws)                                 Frame streaming between browser and server
Dev Server           Custom Node.js HTTP server (tsx)                WebSocket upgrade handling alongside Next.js
Deployment           Cloudflare Workers via OpenNext                 Edge deployment
Linting/Formatting   Biome                                           Code quality

Prerequisites

  • Node.js (a recent LTS release) and pnpm
  • A Google Gemini API key
  • A Gradium TTS API key
  • A browser that supports tab capture via getDisplayMedia (e.g. a recent Chromium-based browser)

Setup & Installation

# Clone the repository
git clone https://github.com/zackdotcomputer/pixelvoice.git
cd pixelvoice

# Install dependencies
pnpm install

# Create environment file
cp .env.example .env.local
# (or create .env.local manually — see below)

Environment Variables

Create a .env.local file in the project root:

# Required
GEMINI_API_KEY=your_gemini_api_key_here
GRADIUM_API_KEY=your_gradium_api_key_here

# Optional
GRADIUM_API_BASE=https://eu.api.gradium.ai   # Custom Gradium API base URL
GRADIUM_REGION=us                              # "us" for US region (defaults to EU)
PORT=3000                                      # Dev server port (default: 3000)

Run the Dev Server

pnpm dev

This starts a custom Node.js server (via tsx watch) that:

  • Serves the Next.js application on http://localhost:3000
  • Handles WebSocket upgrades at /api/voice for the frame analysis pipeline

Open http://localhost:3000 in your browser.


Usage

  1. Load a ROM — Click "Load a ROM" on the landing page, then upload an .sfc, .smc, or .zip file. The ROM stays entirely in your browser memory (no server upload).
  2. Play the game — The SNES emulator runs in-browser via EmulatorJS.
  3. Start voice — Click "Start voice" to begin screen capture. Your browser will prompt you to share the current tab.
  4. Listen — As dialogue appears on screen, characters will speak their lines aloud with unique AI-generated voices.

Debug Mode

Append ?debug=1 to the play page URL for a debug panel with:

  • WRAM memory dumps (e.g., Chrono Trigger text table at 0x7E0290)
  • Canvas snapshot export
  • Real-time pipeline log output

API Reference

WebSocket: /api/voice

Bidirectional WebSocket for the frame analysis pipeline.

Client sends: Binary JPEG frame data (ArrayBuffer)

Server responds: JSON messages

// Dialogue detected
{
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited"
}

// No dialogue on screen
{
  "type": "no_dialog"
}

The server applies backpressure — if the previous frame is still being analyzed, incoming frames are dropped.
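On the client side, the two message shapes above form a small discriminated union. A sketch of how a consumer might parse them (type and function names are illustrative; the repo's actual client lives in src/lib/voice-socket.ts):

```typescript
// A detected line of dialogue, as sent by the server.
interface DialogEvent {
  character: string;
  dialog: string;
  condition_notes?: string;
}

// The server sends either a DialogEvent or a no_dialog marker.
type VoiceMessage = DialogEvent | { type: "no_dialog" };

// Returns the dialogue event, or null when the frame had no dialogue.
function parseVoiceMessage(raw: string): DialogEvent | null {
  const msg = JSON.parse(raw) as VoiceMessage;
  if ("type" in msg && msg.type === "no_dialog") return null;
  return msg as DialogEvent;
}
```

A WebSocket `onmessage` handler would call `parseVoiceMessage(event.data)` and, on a non-null result, trigger the /api/tts request described below.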

POST /api/tts

Synthesizes dialogue audio via Gradium TTS.

Request:

{
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited"
}

Response (200):

{
  "accepted": true,
  "character": "marle",
  "dialog": "But, you're a princess!",
  "condition_notes": "surprised, excited",
  "audio_wav_base64": "UklGRi4AAABXQVZFZm10..."
}

Error responses:

Status  Cause
400     Invalid or missing request body
502     Gradium API returned an error
503     GRADIUM_API_KEY not configured
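A minimal client sketch for this endpoint (function names are illustrative): the `audio_wav_base64` field is decoded to raw WAV bytes, which the browser would wrap in a Blob and hand to an HTMLAudioElement for playback.

```typescript
// Decode the base64 WAV payload returned by /api/tts into raw bytes.
function decodeWavBase64(b64: string): Uint8Array {
  const bin = atob(b64);
  const bytes = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) bytes[i] = bin.charCodeAt(i);
  return bytes;
}

// Request synthesis for one line and return the decoded WAV bytes.
async function requestTts(character: string, dialog: string): Promise<Uint8Array> {
  const res = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ character, dialog }),
  });
  if (!res.ok) throw new Error(`TTS failed: ${res.status}`);
  const { audio_wav_base64 } = await res.json();
  return decodeWavBase64(audio_wav_base64);
}
```

Note that `"UklGR..."` in the example response is simply the base64 encoding of a WAV file's `RIFF` header.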

Project Structure

pixelvoice/
├── server.ts                        # Custom dev server (Next.js + WebSocket)
├── src/
│   ├── app/
│   │   ├── page.tsx                 # Landing page
│   │   ├── layout.tsx               # Root layout (fonts, metadata)
│   │   ├── globals.css              # Tailwind config + theme
│   │   ├── play/
│   │   │   └── page.tsx             # Emulator + voice pipeline UI
│   │   ├── tts/
│   │   │   └── page.tsx             # Standalone TTS test page
│   │   └── api/
│   │       └── tts/
│   │           └── route.ts         # POST /api/tts endpoint
│   ├── components/
│   │   └── snes-player.tsx          # EmulatorJS lifecycle wrapper
│   ├── lib/
│   │   ├── frame-capture.ts         # Screen capture via getDisplayMedia
│   │   ├── gemini-session.ts        # Gemini vision dialogue detection
│   │   ├── voice-socket.ts          # WebSocket client manager
│   │   ├── tts-playback.ts          # Audio decode + playback utilities
│   │   ├── character-gradium-voices.ts  # Character → voice ID mapping
│   │   └── snes-hooks.ts            # WRAM reader, canvas snapshot, CT text decoder
│   └── types/
│       ├── emulatorjs.d.ts          # EmulatorJS ambient type declarations
│       └── tts.ts                   # TTS request/response interfaces
├── next.config.ts                   # Next.js config (server-external packages)
├── wrangler.jsonc                   # Cloudflare Workers deployment config
├── open-next.config.ts              # OpenNext adapter config
├── biome.jsonc                      # Biome linter/formatter config
└── package.json

Key Design Decisions

EmulatorJS Integration

The emulator loads via a CDN-hosted script tag. window.EJS_* globals must be set before the loader script is injected — EmulatorJS reads them synchronously on load. The SnesPlayer component manages this lifecycle and cleanup.
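The ordering constraint can be sketched as follows. The helper is hypothetical; `EJS_player`, `EJS_core`, and `EJS_gameUrl` are standard EmulatorJS config globals, though the actual set used by SnesPlayer may differ.

```typescript
// Build the minimal EJS_* global config for a SNES session.
function emulatorConfig(romUrl: string, mount: string) {
  return {
    EJS_player: mount,    // CSS selector for the mount element
    EJS_core: "snes",     // selects the SNES9x core
    EJS_gameUrl: romUrl,  // blob: URL for the in-memory ROM
  };
}

// In the browser, each entry is copied onto window *before* the CDN
// loader <script> is appended, because EmulatorJS reads the globals
// synchronously when the loader executes:
//
//   Object.assign(window, emulatorConfig(romBlobUrl, "#game"));
//   const s = document.createElement("script");
//   s.src = "<CDN loader URL>";
//   document.body.appendChild(s);
```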

Character Voice Mapping

Each game character maps to a specific Gradium voice profile for consistent identity. Unrecognized NPCs fall back to a pool of 12 backup voices (6 masculine, 6 feminine) selected by Gemini based on the character's apparent appearance.
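A sketch of the mapping shape (all voice IDs and pool contents below are invented placeholders; the real table lives in src/lib/character-gradium-voices.ts). In the app, Gemini picks the fallback voice from the character's apparent appearance; for a self-contained illustration, this sketch instead picks deterministically by name hash so an NPC keeps the same voice across lines.

```typescript
// Placeholder known-character table (real IDs are Gradium voice IDs).
const KNOWN_VOICES: Record<string, string> = {
  marle: "gradium-voice-a",
  crono: "gradium-voice-b",
};

const FALLBACK_MASCULINE = ["m1", "m2", "m3", "m4", "m5", "m6"];
const FALLBACK_FEMININE = ["f1", "f2", "f3", "f4", "f5", "f6"];

function voiceFor(character: string, feminine: boolean): string {
  const known = KNOWN_VOICES[character.toLowerCase()];
  if (known) return known;
  // Stable pick from the backup pool: hash the name so repeat
  // appearances of the same NPC reuse the same voice.
  const pool = feminine ? FALLBACK_FEMININE : FALLBACK_MASCULINE;
  let h = 0;
  for (const c of character) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return pool[h % pool.length];
}
```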

Frame Deduplication

The client deduplicates dialogue events by character + text hash to avoid re-synthesizing the same line while it remains on screen across multiple capture frames.
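The dedup logic amounts to a set keyed by character plus text. A minimal sketch (class name is illustrative):

```typescript
// Tracks (character, dialog) pairs already spoken for the current
// on-screen text, so repeated capture frames don't re-synthesize.
class DialogDeduper {
  private seen = new Set<string>();

  // Returns true the first time a pair is seen, false on repeats.
  shouldSpeak(character: string, dialog: string): boolean {
    const key = `${character}\u0000${dialog}`; // NUL-joined composite key
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }

  // Clearing lets the same line be re-voiced if it reappears later.
  reset(): void {
    this.seen.clear();
  }
}
```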

WebSocket over HTTP for Frames

Binary JPEG frames are streamed over WebSocket rather than HTTP POST to minimize per-frame overhead. The custom server.ts intercepts WebSocket upgrades since Next.js API routes don't natively support WebSocket connections.
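The upgrade-interception pattern looks roughly like this (a sketch, not the repo's actual server.ts; the commented-out handoff assumes a `ws` WebSocketServer created with `{ noServer: true }`):

```typescript
import { createServer, type IncomingMessage } from "node:http";
import type { Duplex } from "node:stream";

// Decide whether an HTTP Upgrade request belongs to the voice pipeline.
export function isVoiceUpgrade(url: string | undefined): boolean {
  const { pathname } = new URL(url ?? "/", "http://localhost");
  return pathname === "/api/voice";
}

// Ordinary requests go to the Next.js request handler (omitted here).
const server = createServer();

server.on("upgrade", (req: IncomingMessage, socket: Duplex, head: Buffer) => {
  if (isVoiceUpgrade(req.url)) {
    // wss.handleUpgrade(req, socket, head, (ws) =>
    //   wss.emit("connection", ws, req));
  } else {
    socket.destroy(); // reject upgrades on any other path
  }
});
```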


Development Commands

pnpm dev          # Start dev server (Next.js + WebSocket)
pnpm build        # Build Next.js production bundle
pnpm lint         # Run Biome linter
pnpm format       # Run Biome formatter (auto-fix)
pnpm deploy       # Build with OpenNext + deploy to Cloudflare
pnpm preview      # Build with OpenNext + local Cloudflare preview
pnpm cf-typegen   # Regenerate Cloudflare environment types

Deployment

PixelVoice deploys to Cloudflare Workers via the OpenNext adapter:

pnpm deploy

The wrangler.jsonc configuration includes:

  • nodejs_compat compatibility flag (required for ws and @google/genai)
  • Asset serving from .open-next/assets
  • Image optimization binding
  • Source map uploads for debugging

License

MIT
