A Three.js game needs several systems to work together before it runs: scene graph, camera setup, render loop, input binding, asset loading, game logic, and physics. An error in any one of them can leave a blank canvas or an unplayable build. This makes Three.js a useful runtime test for AI code generation, because static code review misses most of these failure modes.
Methodology
Each model received the same brief: Combat Arcade Racer, a Three.js game with arcade physics, collision-based power-ups, and a multi-system update loop. We supplied the same generated assets across all runs (GLTF 3D models, textures, audio), used the same agentic workflow, and enforced a 60-minute ceiling.
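To make "collision-based power-ups" concrete, here is a minimal sketch of a pickup check using bounding spheres. It is not taken from the brief or from any model's output; the `Sphere`, `PowerUp`, `spheresOverlap`, and `collectPowerUps` names are hypothetical, and a real build might use a different bounding volume or a physics library instead.

```typescript
// Hypothetical types for illustration only; not part of the benchmark brief.
interface Vec3 { x: number; y: number; z: number }
interface Sphere { center: Vec3; radius: number }
interface PowerUp { id: string; bounds: Sphere; collected: boolean }

/** True when two bounding spheres touch or overlap (compares squared distances). */
function spheresOverlap(a: Sphere, b: Sphere): boolean {
  const dx = a.center.x - b.center.x;
  const dy = a.center.y - b.center.y;
  const dz = a.center.z - b.center.z;
  const reach = a.radius + b.radius;
  return dx * dx + dy * dy + dz * dz <= reach * reach;
}

/** Marks every uncollected power-up the car touches and returns the ids picked up this step. */
function collectPowerUps(car: Sphere, powerUps: PowerUp[]): string[] {
  const picked: string[] = [];
  for (const powerUp of powerUps) {
    if (!powerUp.collected && spheresOverlap(car, powerUp.bounds)) {
      powerUp.collected = true; // prevents the same pickup firing again next frame
      picked.push(powerUp.id);
    }
  }
  return picked;
}
```

A check like this would run once per simulation step inside the update loop. Three.js also ships its own `Sphere` class with an `intersectsSphere` method, which could replace the hand-written test.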
We tracked the following Three.js-specific failure modes:
- requestAnimationFrame errors (missing delta timing, missing resize handler, frame accumulation bugs)
- Scene graph errors (orphan meshes, incorrect parenting, camera never added to scene)
- Input binding errors (keyboard events attached to the wrong target)
- Asset loading errors (main-thread blocking, missing error handling, use-before-load)
- Physics and collision API errors (calls to non-existent methods in three or cannon-es)
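The delta-timing and frame-accumulation items in the list above can be illustrated with a fixed-timestep loop. This is a minimal sketch, not the code any model produced: `advanceFrame`, `LoopState`, the 1/60 s step, and the 0.25 s clamp are all assumptions chosen for illustration.

```typescript
// Hypothetical names and constants; chosen for illustration only.
const FIXED_STEP = 1 / 60;     // simulation step, in seconds
const MAX_FRAME_DELTA = 0.25;  // clamp so a backgrounded tab cannot queue seconds of catch-up

interface LoopState {
  lastTimeMs: number;   // timestamp of the previous frame, in milliseconds
  accumulator: number;  // unsimulated time carried between frames, in seconds
}

/**
 * Advances the simulation for one rendered frame.
 * Runs `step` zero or more times with a constant dt and returns the step count.
 */
function advanceFrame(
  state: LoopState,
  nowMs: number,
  step: (dt: number) => void,
): number {
  // Delta timing: derive elapsed seconds from timestamps instead of assuming 60 fps.
  const delta = Math.min((nowMs - state.lastTimeMs) / 1000, MAX_FRAME_DELTA);
  state.lastTimeMs = nowMs;
  state.accumulator += delta;

  let steps = 0;
  while (state.accumulator >= FIXED_STEP) {
    step(FIXED_STEP);
    state.accumulator -= FIXED_STEP; // keep the remainder for the next frame
    steps++;
  }
  return steps;
}
```

In a browser, the timestamp passed to the `requestAnimationFrame` callback would be forwarded as `nowMs`, and the frame would end with a call such as `renderer.render(scene, camera)`. Omitting the clamp or dropping the remainder are two ways a loop of this shape can develop accumulation bugs.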
Results
Three models produced builds that booted, rendered, handled input, and played correctly: Claude Opus 4, Claude Sonnet 4, and GLM-5.1. Opus completed the task with zero debug iterations. Sonnet required four. GLM-5.1 required five.
Per-model breakdown
Claude Opus 4
Claude Opus 4 completed the build in 22 minutes with 1,820 lines and zero debug iterations. Scene, camera, and renderer were wired correctly on the first pass. The requestAnimationFrame loop used delta timing. Keyboard events were bound to the window. Asset loading was awaited, so every asset had finished loading before its first use in a frame.
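The "await before first use" pattern can be sketched as follows. This is an illustration under stated assumptions, not the Opus output: `loadAllAssets` and the `loadOne` callback are hypothetical names, and `loadOne` stands in for whatever loader call a real build uses.

```typescript
// Hypothetical helper; `loadOne` is an assumed stand-in for a real loader call.
type LoadFn<T> = (url: string) => Promise<T>;

/**
 * Loads every URL in parallel and resolves only when all assets are ready.
 * Rejects with a single error listing every URL that failed, so the caller
 * never starts the render loop with a missing asset.
 */
async function loadAllAssets<T>(
  urls: string[],
  loadOne: LoadFn<T>,
): Promise<Map<string, T>> {
  const outcomes = await Promise.all(
    urls.map(async (url) => {
      try {
        return { url, ok: true as const, asset: await loadOne(url) };
      } catch {
        return { url, ok: false as const };
      }
    }),
  );

  const assets = new Map<string, T>();
  const failed: string[] = [];
  for (const outcome of outcomes) {
    if (outcome.ok) assets.set(outcome.url, outcome.asset);
    else failed.push(outcome.url);
  }
  if (failed.length > 0) {
    throw new Error(`Failed to load: ${failed.join(", ")}`);
  }
  return assets;
}
```

With Three.js, `loadOne` could wrap a loader's promise-returning `loadAsync(url)` method. The game would start its render loop only after `await loadAllAssets(...)` resolves, which addresses both the use-before-load and the missing-error-handling failure modes.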
Claude Sonnet 4
Claude Sonnet 4 produced 9,225 lines, approximately five times the Opus output, and required four debug iterations. Each debug pass addressed a bug introduced by the model in earlier iterations. Physics and collision stabilized in the fourth pass. The final build was playable.
GLM-5.1
GLM-5.1 completed the build in 48 minutes with 2,150 lines and five debug iterations. After the debug cycles, the build had correct requestAnimationFrame handling, correct input event binding, and functional collision. GLM-5.1 was the only non-Anthropic model in the round to complete the brief.
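For reference, correct input binding can be as small as the sketch below. It is a hypothetical illustration rather than GLM-5.1's code: `trackKeys`, `KeyEventSource`, and `KeyEventLike` are assumed names, and the event source is passed in so the binding target is explicit.

```typescript
// Hypothetical minimal types so the sketch runs without a DOM.
interface KeyEventLike { code: string }
type KeyListener = (event: KeyEventLike) => void;
interface KeyEventSource {
  addEventListener(type: "keydown" | "keyup", listener: KeyListener): void;
}

/**
 * Subscribes to key events on `source` and returns a live, read-only view
 * of the key codes that are currently held down.
 */
function trackKeys(source: KeyEventSource): ReadonlySet<string> {
  const pressed = new Set<string>();
  source.addEventListener("keydown", (event) => { pressed.add(event.code); });
  source.addEventListener("keyup", (event) => { pressed.delete(event.code); });
  return pressed;
}
```

In a browser build the source would typically be `window`. Binding to an element that never receives keyboard focus is one way the wrong-target failure can appear: the listeners register without error but never fire. The update loop then polls the returned set each step, for example `pressed.has("ArrowUp")`.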
Failures
Qwen 3.5 35B (open-weights) produced correct scene setup, a partial render loop, and early collision stubs. The 60-minute limit expired during debugging, and the commit history shows the model was still making progress at the timeout. The model did not hallucinate APIs.
Kimi K2.6 hit the same 60-minute timeout with a shorter output trail. For both models, a longer time budget or a decomposed task may be enough to produce a passing run.
Why Three.js is included in the benchmark
Leaderboard benchmarks measure whether a model produces code. Three.js measures whether the produced code runs as a system. A broken render loop does not pass at runtime regardless of how clean the source looks. The browser renders the scene or it does not. This is the signal the benchmark tracks.
For the full game development picture -- including wall-clock time, debug efficiency, and which models actually ship a playable build -- see our companion guide: Best AI for game development in 2026.