Game dev · April 2026 — Coding Tool Benchmark

The best AI models for game development

Eight LLMs, same Three.js game brief, 60-minute ceiling. Ranked by wall-clock time, debug iterations, and whether the build runs.

AI coding models ranked by playable Three.js game output, April 2026.

Top pick
Claude Opus 4
22m 19s · 1,820 lines · 0 debug loops

The test

Eight AI coding models were given the same Three.js game brief, the same generated assets, the same agentic workflow, and a 60-minute ceiling. Six shipped a playable build. Two timed out. None crashed.

The brief was Combat Arcade Racer: a street racer with combat power-ups, urban circuits, and arcade physics. The build requires a working render loop, input handler, collision detection, and power-up logic to produce anything playable.

Results

Six of eight models passed. Claude Opus 4 finished in 22 minutes and 19 seconds with 1,820 lines and zero debug iterations. Claude Sonnet 4 passed in 36 minutes with 9,225 lines. GLM-5.1 passed in 48 minutes with 2,150 lines across 5 debug iterations. The slowest passing run took 79 minutes.

Two models timed out at the 60-minute ceiling. Both were still iterating in the debug loop when the clock stopped.

Methodology

We gave eight AI coding models the same Three.js game brief. Concept art, 3D models, and audio were generated once and reused across all runs. The agentic workflow was held constant. The only variable was the coding model under test.

Each run had a 60-minute wall-clock ceiling. A run counted as a pass if the output ran in a browser, accepted input, and did not crash within 30 seconds.
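The pass rule above can be written as a small predicate. This is a minimal sketch, not the harness used in the round: the `PlayCheck` shape and its field names are assumptions made for illustration, and only the 30-second survival window comes from the rule as stated.

```typescript
// Hypothetical record of one playability check; field names are assumptions.
interface PlayCheck {
  ranInBrowser: boolean;             // the output loaded and ran in a browser
  acceptedInput: boolean;            // the build responded to player input
  secondsBeforeCrash: number | null; // null means no crash was observed
}

// Survival window taken from the pass rule in the text.
const SURVIVAL_SECONDS = 30;

// A run passes only if all three conditions from the rule hold.
function isPass(check: PlayCheck): boolean {
  const survived =
    check.secondsBeforeCrash === null ||
    check.secondsBeforeCrash >= SURVIVAL_SECONDS;
  return check.ranInBrowser && check.acceptedInput && survived;
}
```

A build that boots and takes input but crashes after 12 seconds fails this check, as does one that renders but ignores the keyboard.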

The brief was Combat Arcade Racer: a street racer with combat power-ups, urban circuits, and arcade physics. Passing the brief requires a working render loop, input handler, collision system, and power-up logic on the same build.
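To make those four systems concrete, here is a minimal, framework-free sketch of one simulation step that ties input, movement, collision, and a power-up together. It is not taken from any model's output. Every name and constant in it (`GameState`, `step`, `ACCEL`, `BOOST_SECONDS`, and so on) is an assumption chosen for illustration, and the collision test is a plain axis-aligned box overlap rather than whatever the tested builds used.

```typescript
// All types, names, and tuning constants below are illustrative assumptions.
interface Vec2 { x: number; z: number; }
interface Box { center: Vec2; halfSize: Vec2; }
interface PowerUp { box: Box; taken: boolean; }
interface Input { throttle: number; steer: number; } // each in [-1, 1]
interface GameState {
  car: Box;
  heading: number;      // radians, 0 points along +z
  speed: number;        // units per second
  boostSeconds: number; // remaining boost time
  powerUps: PowerUp[];
}

const ACCEL = 20;
const MAX_SPEED = 40;
const TURN_RATE = 1.5;
const BOOST_FACTOR = 1.5;
const BOOST_SECONDS = 3;

// Collision detection: axis-aligned boxes overlap on both axes.
function overlaps(a: Box, b: Box): boolean {
  return (
    Math.abs(a.center.x - b.center.x) <= a.halfSize.x + b.halfSize.x &&
    Math.abs(a.center.z - b.center.z) <= a.halfSize.z + b.halfSize.z
  );
}

// One simulation step: apply input, move the car, then resolve pickups.
function step(state: GameState, input: Input, dt: number): void {
  const boost = state.boostSeconds > 0 ? BOOST_FACTOR : 1;
  state.speed += input.throttle * ACCEL * boost * dt;
  state.speed = Math.max(0, Math.min(state.speed, MAX_SPEED * boost));
  state.heading += input.steer * TURN_RATE * dt;
  state.car.center.x += Math.sin(state.heading) * state.speed * dt;
  state.car.center.z += Math.cos(state.heading) * state.speed * dt;
  state.boostSeconds = Math.max(0, state.boostSeconds - dt);

  // Power-up logic: driving through an untaken pickup starts a boost.
  for (const p of state.powerUps) {
    if (!p.taken && overlaps(state.car, p.box)) {
      p.taken = true;
      state.boostSeconds = BOOST_SECONDS;
    }
  }
}
```

In a browser build, a function like `step` would typically be called once per frame from a `requestAnimationFrame` callback, followed by the Three.js `renderer.render(scene, camera)` call. Keeping the simulation free of rendering code, as here, also makes it testable without a browser.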

Observations

Wall-clock time among passing models ranged from 22 to 79 minutes. Debug iteration counts among passing models ranged from 0 to 7. High debug counts did not prevent a pass.

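Line counts varied widely across models given the same brief: from 620 to 9,225 lines across all eight runs (roughly 15x), and from 1,740 to 9,225 among passing builds (roughly 5x). The largest passing codebase came from a model that also ran four full debug cycles. The smallest passing build was under 2,000 lines.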
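The ranges quoted in this section can be recomputed from the per-model rows in the full ranking. The sketch below does that under a few assumptions: the `Row` shape and the helper names are hypothetical, and durations are parsed from the `22m 19s` format used on this page.

```typescript
// Row shape and helper names are assumptions; the values are the published results.
interface Row { model: string; pass: boolean; duration: string; lines: number; debug: number; }

const rows: Row[] = [
  { model: "Claude Opus 4", pass: true, duration: "22m 19s", lines: 1820, debug: 0 },
  { model: "Claude Sonnet 4", pass: true, duration: "36m 47s", lines: 9225, debug: 4 },
  { model: "GLM-5.1", pass: true, duration: "48m 10s", lines: 2150, debug: 5 },
  { model: "MiniMax M2.7", pass: true, duration: "57m 0s", lines: 1980, debug: 6 },
  { model: "MiMo V2 Pro", pass: true, duration: "66m 0s", lines: 1740, debug: 7 },
  { model: "Qwen 3.6 Plus", pass: true, duration: "79m 15s", lines: 2610, debug: 4 },
  { model: "Qwen 3.5 35B", pass: false, duration: "60m 0s", lines: 850, debug: 8 },
  { model: "Kimi K2.6", pass: false, duration: "60m 0s", lines: 620, debug: 8 },
];

// Convert "22m 19s" into seconds.
function toSeconds(duration: string): number {
  const m = /^(\d+)m (\d+)s$/.exec(duration);
  if (m === null) throw new Error(`unrecognised duration: ${duration}`);
  return Number(m[1]) * 60 + Number(m[2]);
}

function range(values: number[]): { min: number; max: number } {
  return { min: Math.min(...values), max: Math.max(...values) };
}

const passing = rows.filter((r) => r.pass);
const passingTime = range(passing.map((r) => toSeconds(r.duration)));
const passingDebug = range(passing.map((r) => r.debug));
const passingLines = range(passing.map((r) => r.lines));
const allLines = range(rows.map((r) => r.lines));

// Ratio of largest to smallest codebase, among passing builds and overall.
const passingLineRatio = passingLines.max / passingLines.min;
const allLineRatio = allLines.max / allLines.min;
```

On these rows, passing builds span 1,740 to 9,225 lines, and the full field spans 620 to 9,225.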

Both failing models timed out during active debug iteration rather than crashing. A longer ceiling might have converted one or both to passes.


If you're specifically interested in how these models handle Three.js scene graphs, render loops, and asset loading, see our detailed breakdown: Best AI for Three.js development.

Leaderboard

Wall-clock time to a playable build.

Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end. Legend: winner, pass, fail.
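The ordering described in the caption, fastest first with failed runs pinned to the end, amounts to a two-key sort. The comparator below is a sketch of that ordering; the `Entry` shape and function names are assumptions, not the code behind the chart.

```typescript
// Entry shape and function names are illustrative assumptions.
interface Entry { model: string; pass: boolean; seconds: number; }

// Passing runs sort before failed runs; within each group, shorter time first.
function byLeaderboardOrder(a: Entry, b: Entry): number {
  if (a.pass !== b.pass) return a.pass ? -1 : 1;
  return a.seconds - b.seconds;
}

// Returns a sorted copy and leaves the input array untouched.
function leaderboard(entries: Entry[]): Entry[] {
  return entries.slice().sort(byLeaderboardOrder);
}
```

Sorting on the pass flag first means a failed run never ranks above a pass, even when its recorded time is shorter than the slowest passing run.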

Top picks

Top 3 models, ranked.

  1. Anthropic

    Claude Opus 4

    22m 19s · 1,820 lines · 0 debug loops · PASS

    22 minutes, 1,820 lines, zero debug iterations. Only model in the round to pass on the first build. No manual intervention required.

  2. Anthropic

    Claude Sonnet 4

    36m 47s · 9,225 lines · 4 debug loops · PASS

    36 minutes, 9,225 lines, 4 debug iterations, passed. Highest line count of any passing model. Lower per-token cost than Opus 4.

  3. Zhipu AI

    GLM-5.1

    48m 10s · 2,150 lines · 5 debug loops · PASS

    48 minutes, 2,150 lines, 5 debug iterations, passed. Top non-Anthropic result in the round.

How we score

Scoring criteria.

Every model is judged against the same criteria. No synthetic scores. The measures are whether the build runs, wall-clock time, debug iterations, code volume, and failure mode.

  1. Ships a playable build

    Output must run in a browser, boot, accept input, and survive 30 seconds without crashing.

    Non-negotiable
  2. Wall-clock time

    Elapsed time from prompt submission to a running build.

    High
  3. Debug-loop efficiency

    Number of debug passes required before the build runs. Lower is better.

    High
  4. Code volume

    Total lines of code in the final build. Line counts ranged from 620 to 9,225 across models given the same brief.

    Medium
  5. Failure mode

    Classification of how a run ended: timeout with partial progress, crash, or hallucinated API reference.

    Medium

Full ranking

Ranked results.

  Rank  Model            Vendor     Result  Duration  Lines  Debug
  1     Claude Opus 4    Anthropic  pass    22m 19s   1,820  0
  2     Claude Sonnet 4  Anthropic  pass    36m 47s   9,225  4
  3     GLM-5.1          Zhipu AI   pass    48m 10s   2,150  5
  4     MiniMax M2.7     MiniMax    pass    57m 0s    1,980  6
  5     MiMo V2 Pro      Xiaomi     pass    66m 0s    1,740  7
  6     Qwen 3.6 Plus    Alibaba    pass    79m 15s   2,610  4
  7     Qwen 3.5 35B     Alibaba    fail    60m 0s    850    8
  8     Kimi K2.6        Moonshot   fail    60m 0s    620    8

FAQ

  • In the April 2026 Three.js arcade-racer benchmark, Claude Opus 4 finished in 22 minutes with 1,820 lines and zero debug iterations. Claude Sonnet 4 and GLM-5.1 also produced playable builds.

Full data

3 of 8 shown above. See the full round.

Every model ran against the same brief, same assets, same agentic workflow. The full round archive has all per-model timings, code volume, debug counts, and playable builds.