Which AI is best at writing Three.js code?

In the April 2026 benchmark, Claude Opus 4 produced a working Three.js build in 22 minutes with 1,820 lines and zero debug iterations. No other model in the round matched that result.

Can ChatGPT or Claude write Three.js that runs?

Claude Opus 4 and Claude Sonnet 4 both produced playable Three.js builds in this benchmark. Opus passed on the first iteration; Sonnet required four debug passes. GPT models were not included in this round.

What is the hardest part of Three.js for AI models?

The render loop and input handling produce the most runtime failures. Scene graph and camera setup are usually correct, but requestAnimationFrame delta timing and pointer/keyboard event binding often fail only at runtime.
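One way input binding fails only at runtime: key listeners attached to an element that never receives focus (for example a canvas without a tabindex) are never called, even though the source looks correct. The sketch below tracks key state against whatever target it is given; in a page the intended target would be `window`. `KeyState`, `KeyTarget`, and `KeyEventLike` are hypothetical names for illustration, not Three.js APIs or code from the benchmarked builds.

```typescript
// Minimal structural types so the sketch also runs outside a browser.
// These names are assumptions made for this example.
interface KeyEventLike {
  code: string; // e.g. "ArrowLeft", "KeyW"
}

interface KeyTarget {
  addEventListener(
    type: "keydown" | "keyup",
    listener: (e: KeyEventLike) => void,
  ): void;
}

class KeyState {
  private down = new Set<string>();

  // Pass the object that actually receives key events (in a page: window).
  constructor(target: KeyTarget) {
    target.addEventListener("keydown", (e) => {
      this.down.add(e.code);
    });
    target.addEventListener("keyup", (e) => {
      this.down.delete(e.code);
    });
  }

  isDown(code: string): boolean {
    return this.down.has(code);
  }

  // Returns -1, 0, or 1 for a pair of opposing keys (e.g. steer left/right).
  axis(negative: string, positive: string): number {
    return (this.isDown(positive) ? 1 : 0) - (this.isDown(negative) ? 1 : 0);
  }
}
```

Polling `axis("ArrowLeft", "ArrowRight")` from the update loop, rather than moving objects inside the event handlers, keeps input tied to the same timing as the rest of the game logic.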

Can open-weights LLMs write Three.js?

Partially. Qwen 3.5 35B produced structurally correct Three.js code but timed out at 60 minutes before stabilizing input handling and collision. Longer ceilings or task decomposition may produce a passing run.

Three.js · April 2026 — Coding Tool Benchmark

The best AI models for Three.js

Three.js benchmark results across eight models, April 2026.

Top pick

Claude Opus 4

22m 19s · 1,820 lines · 0 debug loops

The test

Three.js requires several systems to be correct at the same time before anything renders: scene graph, camera setup, render loop, input binding, asset loading, game logic, and physics. A single error produces a blank canvas. This makes Three.js a useful runtime test for AI code generation, because static code review cannot detect most of the failure modes.

Methodology

Each model received the same brief: Combat Arcade Racer, a Three.js game with arcade physics, collision-based power-ups, and a multi-system update loop. We supplied the same generated assets across all runs (GLTF 3D models, textures, audio), used the same agentic workflow, and enforced a 60-minute ceiling.

We tracked the following Three.js-specific failure modes:

requestAnimationFrame errors (missing delta timing, missing resize handler, frame accumulation bugs)
Scene graph errors (orphan meshes, incorrect parenting, camera never added to scene)
Input binding errors (keyboard events attached to the wrong target)
Asset loading errors (main-thread blocking, missing error handling, use-before-load)
Physics and collision API errors (calls to non-existent methods in three or cannon-es)
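The delta-timing and frame-accumulation failures in the list above can be illustrated with a fixed-timestep clock. This is a minimal sketch, not the approach any benchmarked model used: `FixedStepClock`, its default step of 1/60 s, and the 0.25 s clamp are all assumptions chosen for the example.

```typescript
// Hypothetical helper: converts frame timestamps into fixed-size update steps.
interface StepResult {
  steps: number; // fixed updates run during this frame
  alpha: number; // leftover fraction of a step, usable for render interpolation
}

class FixedStepClock {
  private accumulator = 0;
  private lastTimeMs: number | null = null;
  private readonly stepSeconds: number;
  private readonly maxDeltaSeconds: number;

  constructor(stepSeconds = 1 / 60, maxDeltaSeconds = 0.25) {
    this.stepSeconds = stepSeconds;
    this.maxDeltaSeconds = maxDeltaSeconds;
  }

  // Call once per animation frame with the timestamp the callback receives.
  tick(nowMs: number, update: (dt: number) => void): StepResult {
    if (this.lastTimeMs === null) {
      this.lastTimeMs = nowMs; // first frame: nothing has elapsed yet
      return { steps: 0, alpha: 0 };
    }
    // Clamp the delta so a backgrounded tab does not cause a burst of steps.
    const delta = Math.min((nowMs - this.lastTimeMs) / 1000, this.maxDeltaSeconds);
    this.lastTimeMs = nowMs;
    this.accumulator += delta;

    let steps = 0;
    while (this.accumulator >= this.stepSeconds) {
      update(this.stepSeconds);
      this.accumulator -= this.stepSeconds;
      steps++;
    }
    return { steps, alpha: this.accumulator / this.stepSeconds };
  }
}

// Browser usage (assumes renderer, scene, and camera already exist;
// updateGame is a hypothetical game-logic function):
//   const clock = new FixedStepClock();
//   function frame(nowMs: number) {
//     clock.tick(nowMs, (dt) => updateGame(dt));
//     renderer.render(scene, camera);
//     requestAnimationFrame(frame);
//   }
//   requestAnimationFrame(frame);
```

Without the clamp, a tab that was hidden for several seconds would replay every missed step at once, which is one form of the frame accumulation bug listed above.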

Results

The three fastest builds that booted, rendered, handled input, and played correctly came from Claude Opus 4, Claude Sonnet 4, and GLM-5.1. Opus completed the task with zero debug iterations. Sonnet required four. GLM-5.1 required five.

Per-model breakdown

Claude Opus 4

Claude Opus 4 completed the build in 22 minutes with 1,820 lines and zero debug iterations. Scene, camera, and renderer were wired correctly on the first pass. The requestAnimationFrame loop used delta timing. Keyboard events were bound to the window. Asset loads were awaited before any asset was used in the first frame.
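Awaiting assets before the first frame, as described above, can be sketched as follows. This is an illustration under assumptions, not the code Opus produced: `preloadAll` and `Loader` are hypothetical names, and `load` stands in for any promise-returning loader call.

```typescript
// `Loader` is an assumed shape: any function that resolves a URL to an asset.
type Loader<T> = (url: string) => Promise<T>;

// Starts every load at once and resolves only when all have finished.
// Failures are collected and reported instead of being silently dropped.
async function preloadAll<T>(
  urls: string[],
  load: Loader<T>,
): Promise<Map<string, T>> {
  const assets = new Map<string, T>();
  const failed: string[] = [];

  await Promise.all(
    urls.map(async (url) => {
      try {
        assets.set(url, await load(url));
      } catch {
        failed.push(url);
      }
    }),
  );

  if (failed.length > 0) {
    throw new Error(`Failed to load: ${failed.join(", ")}`);
  }
  return assets;
}
```

With Three.js, a loader's `loadAsync(url)` method returns a promise and could be wrapped as the `load` argument; the render loop would then start only after `preloadAll` resolves, which avoids the use-before-load failure mode.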

Claude Sonnet 4

Claude Sonnet 4 produced 9,225 lines, approximately five times the Opus output, and required four debug iterations. Each debug pass addressed a bug introduced by the model in earlier iterations. Physics and collision stabilized in the fourth pass. The final build was playable.

GLM-5.1

GLM-5.1 completed the build in 48 minutes with 2,150 lines and five debug iterations. After the debug cycles, the build had correct requestAnimationFrame handling, correct input event binding, and functional collision. It is the fastest non-Anthropic model in the ranking.

Failures

Qwen 3.5 35B (open-weights) produced correct scene setup, a partial render loop, and early collision stubs. The 60-minute timer expired during debugging, and the commit history shows the model was still making progress when it did. The model did not hallucinate APIs.

Kimi K2.6 hit the same 60-minute timeout with a shorter output trail. In both cases, a longer time budget or task decomposition is likely to produce a passing run.

Why Three.js is included in the benchmark

Leaderboard benchmarks measure whether a model produces code. Three.js measures whether the produced code runs as a system. A broken render loop does not pass at runtime regardless of how clean the source looks. The browser renders the scene or it does not. This is the signal the benchmark tracks.

For the full game development picture, including wall-clock time, debug efficiency, and which models actually ship a playable build, see our companion guide: Best AI for game development in 2026.

Leaderboard

Wall-clock time to a playable build.

Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end. Legend: winner, pass, fail.
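The ordering rule in the caption (fastest first, failed runs pinned to the end) can be written as a two-key comparator. A minimal sketch: `RunResult` and `leaderboardOrder` are hypothetical names, and the sample rows are four entries taken from the ranking table with durations converted to seconds.

```typescript
interface RunResult {
  model: string;
  status: "pass" | "fail";
  durationSeconds: number;
}

// Passing runs come first, ordered by elapsed time; failed runs follow.
function leaderboardOrder(runs: RunResult[]): RunResult[] {
  return [...runs].sort((a, b) => {
    if (a.status !== b.status) {
      return a.status === "pass" ? -1 : 1;
    }
    return a.durationSeconds - b.durationSeconds;
  });
}

// Sample rows from the ranking table (22m 19s = 1339 s, and so on).
const sampleRuns: RunResult[] = [
  { model: "Qwen 3.5 35B", status: "fail", durationSeconds: 3600 },
  { model: "Qwen 3.6 Plus", status: "pass", durationSeconds: 4755 },
  { model: "Claude Opus 4", status: "pass", durationSeconds: 1339 },
  { model: "GLM-5.1", status: "pass", durationSeconds: 2890 },
];
```

Sorting on status before duration is what keeps a 60-minute timeout from ranking ahead of a slower run that passed.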

Top picks

Top 3 models, ranked.

1
Anthropic
Claude Opus 4
22m 19s · 1,820 lines · 0 debug loops · PASS
22 minutes, 1,820 lines, zero debug iterations. Produced a working Three.js scene graph, render loop, input handling, and asset loading on the first pass.
2
Anthropic
Claude Sonnet 4
36m 47s · 9,225 lines · 4 debug loops · PASS
9,225 lines, four debug iterations. Stabilized physics and collision after four passes.
3
Zhipu AI
GLM-5.1
48m 10s · 2,150 lines · 5 debug loops · PASS
48 minutes, 2,150 lines, five debug iterations. Produced correct requestAnimationFrame, input events, and collision after debug cycles.

How we score

Scoring criteria.

Every model is judged against the same criteria. No synthetic scores. The measures are whether the build runs, wall-clock time, debug iterations, code volume, and cost.

01
Scene graph correctness
Whether the model uses Three.js Scene, Camera, Renderer, and Mesh hierarchy correctly, including parenting and transforms.
High
02
Render loop stability
requestAnimationFrame wiring, delta timing, and resize handling. Errors here cause blank frames or unstable frame rates.
High
03
Input handling
Keyboard, mouse, and pointer lock bindings. Measured by whether input registers at runtime, not by whether the source code looks correct.
High
04
Physics/collision
Correct integration of a physics engine or hand-rolled AABB collision. Errors cause missed power-up triggers and objects passing through walls.
Medium
05
Asset loading
GLTFLoader, TextureLoader, and async handling. Blocking loads or silent errors prevent the game from booting.
Medium
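For the hand-rolled AABB collision option named in criterion 04, the overlap test itself is short. The sketch below is an assumption-labeled illustration: `Vec3`, `AABB`, `boxAround`, and `intersects` are hypothetical names defined here, not Three.js or cannon-es APIs.

```typescript
interface Vec3 {
  x: number;
  y: number;
  z: number;
}

// Axis-aligned bounding box described by its minimum and maximum corners.
interface AABB {
  min: Vec3;
  max: Vec3;
}

// Builds a box from a center point and half-extents along each axis.
function boxAround(center: Vec3, halfSize: Vec3): AABB {
  return {
    min: { x: center.x - halfSize.x, y: center.y - halfSize.y, z: center.z - halfSize.z },
    max: { x: center.x + halfSize.x, y: center.y + halfSize.y, z: center.z + halfSize.z },
  };
}

// Two boxes overlap only if their extents overlap on all three axes.
function intersects(a: AABB, b: AABB): boolean {
  return (
    a.min.x <= b.max.x && a.max.x >= b.min.x &&
    a.min.y <= b.max.y && a.max.y >= b.min.y &&
    a.min.z <= b.max.z && a.max.z >= b.min.z
  );
}
```

A power-up trigger would call `intersects` once per update with the car's box and each pickup's box. Three.js also ships `Box3.intersectsBox`, which performs the same kind of test on its own box type.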

Full ranking

Ranked results.

#	Model	Provider	Status	Duration	Lines	Debug loops
1	Claude Opus 4	Anthropic	pass	22m 19s	1,820	0
2	Claude Sonnet 4	Anthropic	pass	36m 47s	9,225	4
3	GLM-5.1	Zhipu AI	pass	48m 10s	2,150	5
4	MiniMax M2.7	MiniMax	pass	57m 0s	1,980	6
5	MiMo V2 Pro	Xiaomi	pass	66m 0s	1,740	7
6	Qwen 3.6 Plus	Alibaba	pass	79m 15s	2,610	4
7	Qwen 3.5 35B	Alibaba	fail	60m 0s	850	8
8	Kimi K2.6	Moonshot	fail	60m 0s	620	8

Direct comparisons

Head-to-head comparisons.

Head-to-head

Claude Opus 4 vs Claude Sonnet 4

Claude Opus 4 and Claude Sonnet 4 produced playable builds from the same Three.js brief with a 5:1 gap in code volume.

Head-to-head

Claude Opus 4 vs GLM-5.1

Claude Opus 4 finished the Three.js brief in 22 min 19 s; GLM-5.1 finished in 48 min 10 s.

Head-to-head

Qwen 3.6 Plus vs Qwen 3.5 35B

Qwen 3.6 Plus passed in 79 min 15 s. Qwen 3.5 35B hit the 60-minute timeout.
