A visual benchmark comparing how different Large Language Models (LLMs) handle complex coding prompts, particularly for games and interactive UI. This project serves as a "context-window-in-action" gallery.
The project is a static site that aggregates benchmark results from various models (Gemini, Claude, GPT, Grok). Each benchmark is a directory containing:
- `prompt.txt`: The exact prompt given to the models.
- Sub-directories for each model (e.g., `gemini`, `claude`), each containing the generated `index.html`.
- `modelnames.json`: Mapping internal IDs to human-readable names.
```
.
├── create_config.sh      # Script to regenerate the gallery index
├── index.html            # Main gallery UI
├── flappy/               # Benchmark: Flappy Bird clone
│   ├── prompt.txt        # The prompt used
│   ├── gemini/           # Result from Gemini
│   │   └── index.html
│   └── claude/           # Result from Claude
│       └── index.html
└── platformer/           # Benchmark: Platformer game
    └── ...
```
- Side-by-Side Comparison: View model outputs for the same prompt in one interface.
- Dynamic Config Generation: Just drop in a new result folder and run `create_config.sh`.
- Vanilla Implementation: No heavy frameworks, just fast, static HTML/JS.
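To make the "drop a folder, regenerate" idea concrete, here is a hypothetical sketch of the kind of scan a script like `create_config.sh` could perform: treat every directory holding a `prompt.txt` as a benchmark and its subdirectories as model results. The real script's logic and output format may differ; this runs against a throwaway fixture.

```shell
#!/usr/bin/env sh
set -e
# Sketch only: discover benchmarks and model results by directory layout.
cd "$(mktemp -d)"                       # scratch fixture standing in for the repo root
mkdir -p flappy/gemini flappy/claude
: > flappy/prompt.txt

for p in */prompt.txt; do
  bench=${p%/prompt.txt}                # e.g. "flappy"
  for m in "$bench"/*/; do
    m=${m%/}
    echo "$bench: ${m#*/}" >> config.txt   # one line per (benchmark, model) pair
  done
done
cat config.txt
```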
We welcome contributions of new benchmarks or new model results for existing benchmarks!
If you want to add a result for a model (e.g., "DeepSeek") to an existing benchmark (e.g., `flappy`):
- Create a folder named `deepseek` inside `flappy/`.
- Add the generated `index.html` file into `flappy/deepseek/`.
- (Optional) Add the model name to `flappy/modelnames.json`.
- Run `./create_config.sh` to update the site.
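These steps can be sketched as shell commands. The snippet below runs in a scratch directory so it is self-contained; the model name, HTML content, and display name are placeholders, and in a real contribution you would run the equivalent commands from the repository root.

```shell
#!/usr/bin/env sh
set -e
cd "$(mktemp -d)"                                             # scratch stand-in for the repo root
mkdir -p flappy/deepseek                                      # 1. folder for the new model
printf '<!doctype html>\n' > flappy/deepseek/index.html       # 2. the model-generated page (placeholder)
printf '{"deepseek": "DeepSeek"}\n' > flappy/modelnames.json  # 3. optional display name (created fresh here)
# 4. ./create_config.sh    (run from the real repo root to update the site)
```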
- Create a new root folder (e.g., `tetris/`).
- Add a `prompt.txt` with the prompt you used.
- Add folders for each model you tested.
- Run `./create_config.sh`.
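Scaffolding a new benchmark looks much the same. Again this is a sketch in a scratch directory; the benchmark name, model folders, and prompt text are placeholders.

```shell
#!/usr/bin/env sh
set -e
cd "$(mktemp -d)"                               # scratch stand-in for the repo root
mkdir -p tetris/gemini tetris/claude            # one folder per model tested
printf 'Build a Tetris clone in a single HTML file.\n' > tetris/prompt.txt
# then: ./create_config.sh   (from the real repo root)
```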
- Clone the repository.
- To view the site, use any static server, such as `npx http-server` or `python -m http.server`.
- After adding new folders or files, run:

```bash
bash create_config.sh
```
See LICENSE.md for details.