AI Coding Benchmark1 round
AI coding models, same game brief, same conditions.
AI coding and LLM benchmark for games — Claude, Qwen, GLM, MiniMax, Kimi, and more, tested on identical Three.js game briefs.
Every round puts each major AI coding model through the same Three.js game brief, with identical generated assets and the same agentic workflow. One variable: the coding model. Results are wall-clock time, debug iterations, code volume, and whether the build runs.
Head-to-heads · April 2026 — Coding Tool Benchmark
Head-to-heads.
Head-to-head
claude opus 4 vs claude sonnet 4
Claude Opus 4 and Claude Sonnet 4 produced playable builds from the same Three.js brief with a 5:1 gap in code volume.
Head-to-head
claude opus 4 vs glm 5 1
Claude Opus 4 finished the Three.js brief in 22 min 19 s; GLM-5.1 finished in 48 min 10 s.
Head-to-head
qwen 3 6 plus vs qwen 3 5 35b
Qwen 3.6 Plus passed in 79 min 15 s. Qwen 3.5 35B hit the 60-minute timeout.
Archive
All rounds.
First round published. New rounds ship when a major model releases.
About the benchmark
FAQ
- Every major AI coding model gets the same Three.js game brief, same assets, same agentic workflow. Only the model changes. We measure wall-clock, debug iterations, code volume, and whether the build runs.
Build your own
Build a playable game with Sandscape.
Sandscape takes a text prompt and returns a playable game. The same models in this benchmark run under the hood. No coding required.