OpenGame: The Agent That Runs the Game to See If It Works

⚡ TL;DR

  • What it solves: LLM-generated game code that compiles but renders a black screen: cross-file inconsistencies, broken Phaser lifecycle wiring, game loops that never tick
  • Why it matters: No other agentic tool closes the loop by actually running the generated game in a sandbox before declaring it done
  • Best for: Developers who want a fully playable HTML5 game prototype from a single prompt, not a code scaffold they still have to debug themselves
  • Main differentiator: Game Skill = Template Skill (proven scaffold first) + Debug Skill (sandbox run, observe, fix, repeat until playable)
  • Use case example: opengame -p "tower defense with cat units and cucumber enemies, kawaii art style" --yolo produces a downloadable browser game in minutes

I asked for a Snake clone. Got 600 lines of TypeScript. The canvas was black.

I pasted the error. Got a refactored version. Still black.

I asked what was wrong. Got an explanation that made complete sense. Still black.

At some point I stopped being surprised and started being annoyed. The code was not broken in any way I could point to. It was broken in the way things are broken when all the pieces are present but none of them are connected.

That failure mode has a name. OpenGame is built specifically to close it.

OpenGame is a CLI tool from CUHK MMLab (released April 2026) that takes a plain-English game description and produces a playable browser game, not code that resembles a game. A game you can open in a tab and actually play. The key difference is that it runs the game. Most agents don’t.

The Problem with Asking LLMs to Build Games

Games are structurally harder than typical software. A web app’s failures are loud: 500 errors, stack traces, broken renders you can see immediately. A game’s failures are silent. Broken scene wiring doesn’t crash; it renders nothing. A misconfigured physics callback doesn’t throw; it just means nothing ever collides.

CUHK MMLab identified three specific failure modes in vanilla coding agents.

Cross-file inconsistencies. Game state is spread across many files. A variable renamed in GameScene.ts breaks a reference three files over. The agent patches one file at a time without seeing the integration layer.
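
In miniature (the file split is illustrative; Phaser's shared registry is one common way state crosses scenes):

import Phaser from "phaser";

// GameScene.ts: a patch renames the shared key...
class GameScene extends Phaser.Scene {
  create() {
    this.registry.set("points", 0); // was "score"
  }
}

// HudScene.ts (a hypothetical second file): ...the HUD still reads the old key.
// registry.get() on a missing key returns undefined: no crash, just a blank HUD.
class HudScene extends Phaser.Scene {
  create() {
    const score = this.registry.get("score"); // undefined, silently
    this.add.text(16, 16, `Score: ${score}`);
  }
}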

Broken scene wiring. Phaser has a strict lifecycle: preload → create → update. Assets must be declared before they’re referenced. Scene transitions pass state through this.scene.start(). General-purpose agents know these APIs exist. They don’t always honor the contract.
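
Here is the contract in miniature; every violation below fails quietly rather than loudly:

import Phaser from "phaser";

class MainScene extends Phaser.Scene {
  preload() {
    // Assets must be declared here, before create() references them.
    this.load.image("player", "assets/player.png");
  }

  create() {
    // A typo in the key ("playr") doesn't throw: Phaser draws a
    // missing-texture placeholder and logs a warning at most.
    this.add.image(400, 300, "player");
    // State crosses scenes through this.scene.start():
    // this.scene.start("GameOverScene", { score: 0 });
  }

  update() {
    // The per-frame tick. If the scene is never registered below,
    // none of this runs and the canvas stays black.
  }
}

new Phaser.Game({
  type: Phaser.AUTO,
  width: 800,
  height: 600,
  scene: [MainScene], // omit this line: black canvas, zero errors
});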

Logical incoherence. Win conditions that never trigger. Collision events that fire on every frame. Physics accumulating floating-point drift until the player clips through floors. Each piece looks plausible. The runtime behavior is not.
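
The collision case is worth sketching, because the broken version looks entirely plausible (illustrative code, not from the repo; assumes arcade physics is enabled in the game config):

import Phaser from "phaser";

class LevelScene extends Phaser.Scene {
  private won = false;

  create() {
    const player = this.physics.add.sprite(100, 300, "player");
    const goal = this.physics.add.staticSprite(700, 300, "goal");

    // Arcade overlap callbacks fire on every frame the bodies touch.
    // Without the guard, "you win" triggers dozens of times per second
    // and the scene transition keeps restarting itself.
    this.physics.add.overlap(player, goal, () => {
      if (this.won) return;
      this.won = true;
      this.scene.start("WinScene");
    });
  }
}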

Standard coding agents patch isolated syntax bugs. The integration layer stays broken.

Game Skill: The Two-Part Core

OpenGame’s answer is the Game Skill. Two parts.

Template Skill. Before any game logic is written, the agent picks an appropriate rendering engine (Phaser by default, canvas or three.js for different scopes) and scaffolds a stable, conventional project structure from a library of proven templates. Games built successfully contribute back to that library. The scaffold is sound before a single mechanic is implemented.
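
Roughly the shape of the idea, as a hypothetical sketch; the actual template library and its selection logic are internal to OpenGame:

type Engine = "phaser" | "canvas" | "three";

interface GameTemplate {
  engine: Engine;
  genre: string;    // e.g. "platformer", "tower-defense"
  files: string[];  // the proven scaffold, laid down before any game logic
  provenBy: number; // how many finished games this template has produced
}

// Hypothetical feedback step: successful builds strengthen their template,
// so future scaffolds start from structures that have already shipped.
function recordSuccess(library: GameTemplate[], used: GameTemplate): void {
  used.provenBy += 1;
  library.sort((a, b) => b.provenBy - a.provenBy);
}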

Debug Skill. After generation, the agent runs the game in a sandboxed browser, catches integration errors and console exceptions, and systematically resolves them until the game is playable. It maintains a living protocol of verified fixes: not just syntax corrections, but integration patterns. Known fix for broken Phaser preload order. Known fix for scene state leaking across transitions.

Two builders. One ships code when it compiles. The other sleeps in the house first to see if the roof leaks. The first is faster. The first’s game canvas is black.

This loop (generate, run, observe, fix) is what makes the difference. The agent stops when the game is actually playable, not when the code resembles playable.
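
A minimal sketch of what the verify step could look like, assuming a Playwright-driven headless browser; the README doesn't document the sandbox internals, so every detail here is illustrative:

import { chromium } from "playwright";

async function verifyPlayable(url: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const problems: string[] = [];

  // Game integration failures surface as console errors, not crashes.
  page.on("console", (msg) => {
    if (msg.type() === "error") problems.push(msg.text());
  });
  page.on("pageerror", (err) => problems.push(err.message));

  await page.goto(url);
  await page.waitForTimeout(3000); // let the game loop tick a few times

  // A game whose scenes never wired up typically never mounts a canvas.
  const hasCanvas = await page.evaluate(
    () => document.querySelector("canvas") !== null
  );
  if (!hasCanvas) problems.push("no canvas mounted: likely a black screen");

  await browser.close();
  return problems; // empty = playable enough to pass; otherwise fix and rerun
}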

Getting It Running

No npm package yet. Source install:

git clone https://github.com/leigest519/OpenGame.git
cd OpenGame
npm install
npm run build
npm link
# opengame is now on your PATH

Authenticate against your preferred LLM provider:

export OPENAI_API_KEY="sk-..."
export OPENAI_MODEL="gpt-4o"              # also works with claude-3-5-sonnet or any OpenAI-compat model
export OPENAI_BASE_URL="https://api.openai.com/v1"   # optional; swap for OpenRouter or local inference

Build a game:

mkdir my-game && cd my-game
opengame -p "Build a tower defense where meme cats defend a golden tuna can against cucumbers and robot vacuums." --yolo
# OpenGame prints a provider-status banner, scaffolds from templates, builds, runs in a sandbox, iterates.
# When it finishes: npm install && npm run dev   (opens at http://localhost:5173)

The --yolo flag grants the agent shell execution permissions. The Debug Skill’s sandbox loop requires it. Without it, the agent can only edit files and the verification step won’t run. For CI or untrusted prompts, set GEMINI_SANDBOX=docker instead to run in a fully isolated Docker container.
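
For a Docker-isolated run, that combination looks like:

export GEMINI_SANDBOX=docker
opengame -p "a snake clone where the food flees from the snake" --yolo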

Prefer interactive mode? Run opengame with no flags. Use --continue to resume the last session, --resume to pick from past sessions.

The demo zips included in the repo are worth running locally to see the output quality before committing to generating something new:

unzip demo_marvel_avengers.zip && cd demo_marvel_avengers
npm install
npm run dev    # opens at http://localhost:5173

What Can It Actually Build?

Six end-to-end demos ship with the repo. These are not toy examples.

Game | Genre | What the prompt specified
Marvel Avengers: Infinity Strike | Side-scrolling platformer | Iron Man/Thor/Hulk; 3 levels; Thanos boss; 90s Capcom pixel art
Harry Potter: Arithmancy Academy | Turn-based card battle | Cast spells by answering trivia; Magic Resonance combos; Gothic Hogwarts art
K.O.F: Celestial Showdown | 2-player quiz fighter | Race to buzz in on physics questions; wrong answers backfire; SNK 16-bit arcade
Hajimi Defense: The Tuna Crisis | Tower defense | Cats defend Golden Tuna Can vs cucumbers and robot vacuums; kawaii art
StarWars: Mandalorian Protocol | Top-down twin-stick shooter | Blaster + Beskar Spear + Jetpack Dash; cover system; sci-fi pixel art
Squid Game: Red Light, Green Light | Survival reflex | Run/freeze mechanic; bodies pile up permanently; gritty 16-bit art

That table is not “generated something that vaguely resembles a platformer.” Each game was built end-to-end from those exact prompts, with themed mechanics, multi-level progression, and styled assets. They have live demos on the project page and downloadable source zips.

A cat tower defense game where meme cats defend a Golden Tuna Can against cucumbers and robot vacuums. I didn’t expect that sentence to describe a real, playable game. But it does.

The Bench

OpenGame-Bench evaluates agents across 150 game prompts on three axes.

Build Health. Does the game compile and run without errors? A black screen counts as a failure.

Visual Usability. Does the game render something meaningful and interactive? Can anyone looking at a screenshot recognize what kind of game it is?

Intent Alignment. Does the finished game match what the prompt described? A VLM judge compares the running game to the original specification, not the source code to the prompt.

The pipeline launches generated games in a headless browser, drives them with scripted interactions, and records scores. Not static analysis of code patterns. Running games.
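
A per-game result presumably reduces to something like this; the shape is my paraphrase of the three axes, not the pipeline's actual schema:

interface BenchResult {
  prompt: string;          // one of the 150 game prompts
  buildHealth: boolean;    // compiled and ran; a black screen counts as failure
  visualUsability: number; // does a screenshot read as a recognizable game?
  intentAlignment: number; // VLM judge: the running game vs. the original spec
}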

The benchmark pipeline hasn’t shipped yet; it’s marked coming-soon in the README. The paper (arXiv 2604.18394) reports state-of-the-art results across all three dimensions against vanilla agents. I’m taking that at face value until the pipeline is public.

Tradeoffs

Headless-only generation. No live preview panel during generation. The agent edits files and runs a sandboxed browser internally; you watch terminal output. Fine for automation, slightly opaque for iteration.

No npm package yet. git clone + npm link. The README notes an npm release is in preparation. Not a barrier, but worth knowing before you start.

The settings directory is named .qwen. Config lives in ~/.qwen/settings.json. This is a backward-compatibility artifact from upstream qwen-code, which is itself a fork of Gemini CLI. It will be renamed .opengame in a future release. Until then, expect teammates to ask why a tool called OpenGame writes its config to ~/.qwen.

Bring your own keys for every modality. Image generation, audio, and video each require separate provider API keys. No defaults ship. OpenGame prints a provider-status banner at startup so you at least know what’s missing before you start.

GameCoder-27B requires self-hosting. The purpose-built 27B model (trained through continual pre-training on game engine APIs, SFT on game-development trajectories, and execution-grounded RL on real playability scores) is technically the strongest option. It requires local deployment. Default falls back to whatever OPENAI_MODEL is set to. GPT-4o and Claude work fine.

OpenGame-Bench not yet released. The evaluation pipeline is described fully in the paper, but the pipeline itself is not yet public. The state-of-the-art claim is currently literature-only.


OpenGame closes the one loop general agents skip: it runs the game to see if it actually works.

Source, demos, and documentation: https://github.com/leigest519/OpenGame

Hoang Yell

A software developer and technical storyteller. I spend my time exploring the most interesting open-source repositories on GitHub and presenting them as accessible stories for everyone.