Three.js · April 2026 — Coding Tool Benchmark

The best AI models for Three.js

Eight LLMs, same Three.js brief — render loop, input handling, physics, and asset loading, all measured under identical conditions.

Three.js benchmark results across eight models, April 2026.

Top pick
Claude Opus 4
22m 19s·1,820 lines·0 debug loops
The test

Three.js requires several systems to be correct at the same time before anything renders: scene graph, camera setup, render loop, input binding, asset loading, game logic, and physics. A single error produces a blank canvas. This makes Three.js a useful runtime test for AI code generation, because static code review cannot detect most of the failure modes.

Methodology

Each model received the same brief: Combat Arcade Racer, a Three.js game with arcade physics, collision-based power-ups, and a multi-system update loop. We supplied the same generated assets across all runs (GLTF 3D models, textures, audio), used the same agentic workflow, and enforced a 60-minute ceiling.

We tracked the following Three.js-specific failure modes:

  • requestAnimationFrame errors (missing delta timing, missing resize handler, frame accumulation bugs)
  • Scene graph errors (orphan meshes, incorrect parenting, camera never added to scene)
  • Input binding errors (keyboard events attached to the wrong target)
  • Asset loading errors (main-thread blocking, missing error handling, use-before-load)
  • Physics and collision API errors (calls to non-existent methods in three or cannon-es)
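
The delta-timing and frame-accumulation failures listed above are commonly avoided with a fixed-timestep accumulator. The following is a minimal sketch under stated assumptions, not code from any benchmarked model: `createStepper`, `FIXED_STEP`, and `MAX_FRAME` are hypothetical names, and the constants are illustrative.

```javascript
// Hypothetical helper: names and constants are illustrative assumptions.
const FIXED_STEP = 1 / 60; // simulation step, in seconds
const MAX_FRAME = 0.25;    // clamp long pauses (for example, a backgrounded tab)

// Returns a function to call once per animation frame with a timestamp in ms.
// `update(dt)` is invoked zero or more times with a constant dt.
function createStepper(update) {
  let last = null;
  let accumulator = 0;

  return function step(nowMs) {
    if (last === null) {
      last = nowMs; // first frame: no elapsed time yet
      return 0;
    }
    // Convert to seconds and clamp so one slow frame cannot trigger
    // an unbounded burst of catch-up updates.
    const delta = Math.min((nowMs - last) / 1000, MAX_FRAME);
    last = nowMs;
    accumulator += delta;

    let steps = 0;
    while (accumulator >= FIXED_STEP) {
      update(FIXED_STEP);
      accumulator -= FIXED_STEP;
      steps++;
    }
    return steps; // number of simulation updates run this frame
  };
}
```

In a browser, `step` would typically be called from a `requestAnimationFrame` callback with the timestamp that callback receives, followed by the render call. Leftover time stays in the accumulator, so game speed does not depend on the display refresh rate.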

Results

The three top-ranked models, Claude Opus 4, Claude Sonnet 4, and GLM-5.1, produced builds that booted, rendered, handled input, and played correctly. Opus completed the task with zero debug iterations; Sonnet required four and GLM-5.1 required five.

Per-model breakdown

Claude Opus 4

Claude Opus 4 completed the build in 22 minutes with 1,820 lines and zero debug iterations. Scene, camera, and renderer were wired correctly on the first pass. The requestAnimationFrame loop used delta timing. Keyboard events were bound to the window. Asset loads were awaited before the assets were used in the first frame.
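
As an illustration of awaiting assets before first use, the sketch below resolves every load before the game starts and reports failures instead of swallowing them. It is an assumption-laden example, not the code Opus produced: `preloadAssets` and the shape of `loaders` are hypothetical. With Three.js, each loader function could wrap a call such as `GLTFLoader`'s `loadAsync(url)`.

```javascript
// Hypothetical sketch: `loaders` maps an asset name to a function that
// returns a Promise for the loaded asset.
async function preloadAssets(loaders) {
  const names = Object.keys(loaders);

  // Start all loads in parallel and wait for every one to settle,
  // so a single failure does not hide the others.
  const results = await Promise.allSettled(names.map((name) => loaders[name]()));

  const assets = {};
  const failed = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      assets[names[i]] = result.value;
    } else {
      failed.push(`${names[i]}: ${result.reason}`);
    }
  });

  // Surface load errors explicitly rather than booting with missing assets.
  if (failed.length > 0) {
    throw new Error(`Asset loading failed: ${failed.join('; ')}`);
  }
  return assets;
}
```

The render loop would only be started after `await preloadAssets(...)` resolves, which rules out use-before-load by construction.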

Claude Sonnet 4

Claude Sonnet 4 produced 9,225 lines, approximately five times the Opus output, and required four debug iterations. Each debug pass addressed a bug introduced by the model in earlier iterations. Physics and collision stabilized in the fourth pass. The final build was playable.

GLM-5.1

GLM-5.1 completed the build in 48 minutes with 2,150 lines and five debug iterations. The model produced correct requestAnimationFrame handling, correct input event binding, and functional collision after the debug cycles. It is the only non-Anthropic model in the top three.
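
For context on what correct input event binding can look like, here is a minimal key-state tracker. It is a sketch under assumptions, not GLM-5.1's output: `createKeyState` is a hypothetical helper. A common form of the wrong-target error is attaching key listeners to a canvas that cannot receive focus; passing `window` as the target avoids that.

```javascript
// Hypothetical helper; in a browser, pass `window` as the target.
function createKeyState(target) {
  const pressed = new Set();

  const onDown = (event) => pressed.add(event.code);
  const onUp = (event) => pressed.delete(event.code);
  const onBlur = () => pressed.clear(); // avoid stuck keys when focus is lost

  target.addEventListener('keydown', onDown);
  target.addEventListener('keyup', onUp);
  target.addEventListener('blur', onBlur);

  return {
    // Poll from the update loop, e.g. isDown('KeyW').
    isDown: (code) => pressed.has(code),
    // Remove the listeners when the game is torn down.
    dispose() {
      target.removeEventListener('keydown', onDown);
      target.removeEventListener('keyup', onUp);
      target.removeEventListener('blur', onBlur);
      pressed.clear();
    },
  };
}
```

Polling `isDown` inside the update loop, rather than moving the vehicle inside the event handler, keeps movement tied to the simulation step instead of the keyboard repeat rate.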

Failures

Qwen 3.5 35B (open-weights) produced correct scene setup, a partial render loop, and early collision stubs. The 60-minute timer expired during debug. Commit history shows continued progress at the timeout. The model did not hallucinate APIs.

Kimi K2.6 hit the same 60-minute timeout with a shorter output trail. In both cases, a longer time budget or task decomposition is likely to produce a passing run.

Why Three.js is included in the benchmark

Leaderboard benchmarks measure whether a model produces code. Three.js measures whether the produced code runs as a system. A broken render loop does not pass at runtime regardless of how clean the source looks. The browser renders the scene or it does not. This is the signal the benchmark tracks.


For the full game development picture, including wall-clock time, debug efficiency, and which models actually ship a playable build, see our companion guide: Best AI for game development in 2026.

Leaderboard

Wall-clock time to a playable build.

Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end. Legend: winner, pass, fail.
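
The ordering described in the caption (fastest first, failed runs pinned to the end) can be expressed as a two-key comparator. This is a small sketch, not the site's charting code: `byLeaderboardOrder` is a hypothetical name, and the durations are the article's reported times converted to seconds.

```javascript
// Durations are the article's reported wall-clock times, in seconds.
const runs = [
  { name: 'Qwen 3.6 Plus', seconds: 4755, pass: true },
  { name: 'Qwen 3.5 35B', seconds: 3600, pass: false },
  { name: 'Claude Sonnet 4', seconds: 2207, pass: true },
  { name: 'Kimi K2.6', seconds: 3600, pass: false },
  { name: 'GLM-5.1', seconds: 2890, pass: true },
  { name: 'Claude Opus 4', seconds: 1339, pass: true },
  { name: 'MiMo V2 Pro', seconds: 3960, pass: true },
  { name: 'MiniMax M2.7', seconds: 3420, pass: true },
];

// Passing runs sort before failing runs; within each group, fastest first.
function byLeaderboardOrder(a, b) {
  if (a.pass !== b.pass) return a.pass ? -1 : 1;
  return a.seconds - b.seconds;
}

// Copy before sorting so the input array is left untouched.
const leaderboard = [...runs].sort(byLeaderboardOrder);
```

Because `Array.prototype.sort` is stable, the two failed runs with identical durations keep their input order.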

Top picks

Top 3 models, ranked.

  1. Claude Opus 4 (Anthropic)

    22m 19s · 1,820 lines · 0 debug loops · PASS

    22 minutes, 1,820 lines, zero debug iterations. Produced a working Three.js scene graph, render loop, input handling, and asset loading on the first pass.

  2. Claude Sonnet 4 (Anthropic)

    36m 47s · 9,225 lines · 4 debug loops · PASS

    9,225 lines, four debug iterations. Stabilized physics and collision after four passes.

  3. GLM-5.1 (Zhipu AI)

    48m 10s · 2,150 lines · 5 debug loops · PASS

    48 minutes, 2,150 lines, five debug iterations. Produced correct requestAnimationFrame, input events, and collision after debug cycles.

How we score

Scoring criteria.

Every model is judged against the same criteria. No synthetic scores. The measures are whether the build runs, wall-clock time, debug iterations, code volume, and cost.

  1. Scene graph correctness

    Whether the model uses the Three.js Scene, Camera, Renderer, and Mesh hierarchy correctly, including parenting and transforms.

    High

  2. Render loop stability

    requestAnimationFrame wiring, delta timing, and resize handling. Errors here cause blank frames or unstable frame rates.

    High

  3. Input handling

    Keyboard, mouse, and pointer lock bindings. Measured by whether input registers at runtime, not by whether the source code looks correct.

    High

  4. Physics/collision

    Correct integration of a physics engine or hand-rolled AABB collision. Errors cause missed power-up triggers and objects passing through walls.

    Medium

  5. Asset loading

    GLTFLoader, TextureLoader, and async handling. Blocking loads or silent errors prevent the game from booting.

    Medium

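The hand-rolled AABB collision mentioned in the physics/collision criterion reduces to an interval-overlap test on each axis. The sketch below is illustrative only: `aabbOverlap`, `boxAround`, and `collectPowerUps` are hypothetical helpers, and the box shape `{ min, max }` is an assumption chosen to mirror how Three.js's `Box3` stores its bounds (`Box3` also offers an `intersectsBox` method for the same kind of test).

```javascript
// Boxes are { min: {x, y, z}, max: {x, y, z} } in world units (assumed shape).
function aabbOverlap(a, b) {
  // Two boxes overlap only if their intervals overlap on every axis.
  return (
    a.min.x <= b.max.x && a.max.x >= b.min.x &&
    a.min.y <= b.max.y && a.max.y >= b.min.y &&
    a.min.z <= b.max.z && a.max.z >= b.min.z
  );
}

// Build a cube-shaped box from a center point and a half-extent.
function boxAround(center, half) {
  return {
    min: { x: center.x - half, y: center.y - half, z: center.z - half },
    max: { x: center.x + half, y: center.y + half, z: center.z + half },
  };
}

// Return ids of uncollected power-ups touching the car, marking them collected
// so each pickup triggers only once.
function collectPowerUps(carBox, powerUps) {
  const hits = [];
  for (const powerUp of powerUps) {
    if (!powerUp.collected && aabbOverlap(carBox, powerUp.box)) {
      powerUp.collected = true;
      hits.push(powerUp.id);
    }
  }
  return hits;
}
```

A test like this runs once per simulation step. At high speeds a fast object can still skip past a thin box between steps, which is one way objects end up passing through walls.
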
Full ranking

Ranked results.

  Rank  Model            Vendor     Result  Duration  Lines  Debug
  1     Claude Opus 4    Anthropic  pass    22m 19s   1,820  0
  2     Claude Sonnet 4  Anthropic  pass    36m 47s   9,225  4
  3     GLM-5.1          Zhipu AI   pass    48m 10s   2,150  5
  4     MiniMax M2.7     MiniMax    pass    57m 0s    1,980  6
  5     MiMo V2 Pro      Xiaomi     pass    66m 0s    1,740  7
  6     Qwen 3.6 Plus    Alibaba    pass    79m 15s   2,610  4
  7     Qwen 3.5 35B     Alibaba    fail    60m 0s    850    8
  8     Kimi K2.6        Moonshot   fail    60m 0s    620    8

FAQ

  • In the April 2026 benchmark, Claude Opus 4 produced a working Three.js build in 22 minutes with 1,820 lines and zero debug iterations. No other model in the round matched that result.
Full data


Every model ran against the same brief, same assets, same agentic workflow. The full round archive has all per-model timings, code volume, debug counts, and playable builds.