April 2026 — Coding Tool Benchmark
AI coding model benchmark — Claude, Qwen, GLM, MiniMax, Kimi, and MiMo compared on a real-world Three.js game build.
Eight coding models built a game from the same Three.js brief, using an identical prompt, identical generated assets, and an identical agentic workflow.
One variable: the coding model.
Prompt, assets, workflow, and strategy advice are identical across runs. Only the model changes.
Combat Arcade Racer
A high-octane street racer that merges responsive arcade-style car mechanics with aggressive power-up systems in dense metropolitan environments. Players navigate claustrophobic city circuits, utilizing tactical abilities to outmaneuver traffic and complete high-risk challenges in a quest for urban dominance.
- Concept & prompt: One game design doc, shared verbatim.
- Generated assets: Same concept art, 3D models, and audio.
- Workflow: build_mid_no_strategy (agentic coder loop).
- Strategy advisor: Pre-baked recommendations injected identically.
- Debug iterations: Bounded loop. Model decides when to stop.
- Timeout: 60 minutes per run. After that, the run fails.
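The bounded debug loop and the 60-minute timeout listed above can be pictured with a minimal sketch. Every name here (`run_bounded`, `RunResult`, the `step` callback, the `bound_reached` status) is hypothetical and not taken from the benchmark pipeline. In particular, the page does not say what happens when the iteration bound is hit, so the sketch simply reports that case as its own status.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    status: str       # "done", "fail" (timeout), or "bound_reached" (assumed label)
    iterations: int   # debug iterations actually executed
    elapsed_s: float  # wall-clock seconds since the run started


def run_bounded(
    step: Callable[[int], bool],
    max_iterations: int,
    timeout_s: float = 60 * 60,
    clock: Callable[[], float] = time.monotonic,
) -> RunResult:
    """Run debug iterations until the model stops, the bound is hit, or time runs out.

    `step(i)` is a hypothetical callback that performs iteration `i` and
    returns True when the model declares the build finished.
    """
    start = clock()
    for i in range(1, max_iterations + 1):
        # The timeout is checked between iterations only; a real harness
        # would likely also interrupt a step that is still running.
        if clock() - start >= timeout_s:
            return RunResult("fail", i - 1, clock() - start)
        if step(i):
            return RunResult("done", i, clock() - start)
    return RunResult("bound_reached", max_iterations, clock() - start)
```

Passing the clock as a parameter is a design choice for the sketch: it lets the timeout path be exercised with a fake clock instead of waiting an hour.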
Wall-clock time to a playable build.
Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end.
Full per-model data.
| # | Model | Provider | Status | Elapsed time | Cost | | | |
|---|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4 | Anthropic | Pass | 22m 19s | — | 1,820 | 11 | 0 |
| 02 | Claude Sonnet 4 | Anthropic | Pass | 36m 47s | — | 9,225 | 14 | 4 |
| 03 | GLM-5.1 | Zhipu AI | Pass | 48m 10s | — | 2,150 | 13 | 5 |
| 04 | MiniMax M2.7 | MiniMax | Pass | 57m 0s | — | 1,980 | 12 | 6 |
| 07 | Qwen 3.5 35B | Alibaba | Fail | 60m 0s | — | 850 | 7 | 8 |
| 08 | Kimi K2.6 | Moonshot | Fail | 60m 0s | — | 620 | 5 | 8 |
| 05 | MiMo V2 Pro | Xiaomi | Pass | 66m 0s | — | 1,740 | 10 | 7 |
| 06 | Qwen 3.6 Plus | Alibaba | Pass | 79m 15s | — | 2,610 | 15 | 4 |
Cost column fills in when pipeline telemetry lands.
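The ordering described in the Figure 1 caption (fastest first, failed runs pinned to the end) can be reproduced from the status and time values in the table. The following is a minimal sketch rather than the page's actual sorting code; `RUNS`, `elapsed_seconds`, and `figure_order` are illustrative names, and the data is copied from the rows above.

```python
import re

# (model, status, elapsed time) copied from the table above.
RUNS = [
    ("Claude Opus 4", "Pass", "22m 19s"),
    ("Claude Sonnet 4", "Pass", "36m 47s"),
    ("GLM-5.1", "Pass", "48m 10s"),
    ("MiniMax M2.7", "Pass", "57m 0s"),
    ("Qwen 3.5 35B", "Fail", "60m 0s"),
    ("Kimi K2.6", "Fail", "60m 0s"),
    ("MiMo V2 Pro", "Pass", "66m 0s"),
    ("Qwen 3.6 Plus", "Pass", "79m 15s"),
]


def elapsed_seconds(text: str) -> int:
    """Convert a string such as '22m 19s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)m (\d+)s", text)
    if match is None:
        raise ValueError(f"unexpected time format: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def figure_order(runs):
    """Sort passing runs fastest-first, then append failed runs.

    The key is a tuple: False (pass) sorts before True (fail), and ties
    within each group are broken by elapsed time. Python's sort is stable,
    so runs with identical keys keep their input order.
    """
    return sorted(runs, key=lambda run: (run[1] != "Pass", elapsed_seconds(run[2])))


for model, status, elapsed in figure_order(RUNS):
    print(f"{model:<16} {status:<5} {elapsed}")
```

Under these assumptions, `figure_order(RUNS)` places MiMo V2 Pro and Qwen 3.6 Plus ahead of the two failed runs, which agrees with the numbering in the table's first column.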
Output per model.
Every model, per-run data.
Head-to-head comparisons.
Build a playable game with Sandscape.
Sandscape takes a text prompt and returns a playable game. The same models in this benchmark run under the hood. No coding required.