RoundApril 16, 2026Combat Arcade Racer

April 2026 — Coding Tool Benchmark

Name: April 2026 — Coding Tool Benchmark
Creator: Sandscape
Published: 2026-04-16
Keywords: AI coding benchmark, LLM benchmark, coding models, Combat Racing, Arcade Sim, Urban Sports, Competitive, Claude Opus 4, Claude Sonnet 4, GLM-5.1, MiniMax M2.7, Qwen 3.6 Plus, MiMo V2 Pro, Qwen 3.5 35B, Kimi K2.6

8 coding models built the same Three.js game brief. Identical prompt, identical generated assets, identical agentic workflow. The only variable is the model.

Models

Pass rate

75%

Fastest

22m19s

Claude Opus 4

Published April 16, 2026Game Combat Arcade RacerGenre Combat Racing / Arcade Sim

Methodology

One variable: the coding model.

Prompt, assets, workflow, and strategy advice are identical across runs. Only the model changes.

Brief

Combat Arcade Racer

A high-octane street racer that merges responsive arcade-style car mechanics with aggressive power-up systems in dense metropolitan environments. Players navigate claustrophobic city circuits, utilizing tactical abilities to outmaneuver traffic and complete high-risk challenges in a quest for urban dominance.

Combat RacingArcade SimUrban SportsCompetitive

Held constant

Concept & prompt: One game design doc, shared verbatim.
Generated assets: Same concept art, 3D models, and audio.
Workflow: build_mid_no_strategy — agentic coder loop.
Strategy advisor: Pre-baked recommendations injected identically.
Debug iterations: Bounded loop. Model decides when to stop.
Timeout: 60 minutes per run. After that, it fails.

Leaderboard

Wall-clock time to a playable build.

Winner

Pass

Fail

Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end.

Data table

Full per-model data.


01	Claude Opus 4	Anthropic	Pass	22m 19s	—	1,820	11	0
02	Claude Sonnet 4	Anthropic	Pass	36m 47s	—	9,225	14	4
03	GLM-5.1	Zhipu AI	Pass	48m 10s	—	2,150	13	5
04	MiniMax M2.7	MiniMax	Pass	57m 0s	—	1,980	12	6
07	Qwen 3.5 35B	Alibaba	Fail	60m 0s	—	850	7	8
08	Kimi K2.6	Moonshot	Fail	60m 0s	—	620	5	8
05	MiMo V2 Pro	Xiaomi	Pass	66m 0s	—	1,740	10	7
06	Qwen 3.6 Plus	Alibaba	Pass	79m 15s	—	2,610	15	4

Tap any column to sort. Cost column fills in when pipeline telemetry lands.

Games

Output per model.

Per-model breakdown

Every model, per-run data.

No screenshot

pass

Anthropic

Claude Opus 4

Duration

22m 19s

Cost

—

Code lines

1,820

Files

Debug loops

Debug %

15%

Time breakdown

Initial Debug

No screenshot

pass

Anthropic

Claude Sonnet 4

Duration

36m 47s

Cost

—

Code lines

9,225

Files

Debug loops

Debug %

51%

Time breakdown

Initial Debug

No screenshot

pass

Zhipu AI

GLM-5.1

Duration

48m 10s

Cost

—

Code lines

2,150

Files

Debug loops

Debug %

43%

Time breakdown

Initial Debug

No screenshot

pass

MiniMax

MiniMax M2.7

Duration

57m 0s

Cost

—

Code lines

1,980

Files

Debug loops

Debug %

46%

Time breakdown

Initial Debug

No screenshot

pass

Xiaomi

MiMo V2 Pro

Duration

66m 0s

Cost

—

Code lines

1,740

Files

Debug loops

Debug %

47%

Time breakdown

Initial Debug

No screenshot

pass

Alibaba

Qwen 3.6 Plus

Duration

79m 15s

Cost

—

Code lines

2,610

Files

Debug loops

Debug %

11%

Time breakdown

Initial Debug

No screenshot

fail

Alibaba

Qwen 3.5 35B

Duration

60m 0s

Cost

—

Code lines

850

Files

Debug loops

Debug %

60%

Time breakdown

Initial Debug

No screenshot

fail

Moonshot

Kimi K2.6

Duration

60m 0s

Cost

—

Code lines

620

Files

Debug loops

Debug %

64%

Time breakdown

Initial Debug

Head-to-heads

Head-to-head comparisons.

Head-to-head

claude opus 4 vs claude sonnet 4

Claude Opus 4 and Claude Sonnet 4 produced playable builds from the same Three.js brief with a 5:1 gap in code volume.

Head-to-head

claude opus 4 vs glm 5 1

Claude Opus 4 finished the Three.js brief in 22 min 19 s; GLM-5.1 finished in 48 min 10 s.

Head-to-head

qwen 3 6 plus vs qwen 3 5 35b

Qwen 3.6 Plus passed in 79 min 15 s. Qwen 3.5 35B hit the 60-minute timeout.

Build your own

Build a playable game with Sandscape.

Sandscape takes a text prompt and returns a playable game. The same models in this benchmark run under the hood. No coding required.

April 2026 — Coding Tool Benchmark

AI coding model benchmark — Claude, Qwen, GLM, MiniMax, Kimi, and MiMo compared on a real-world Three.js game build.

One variable: the coding model.

Combat Arcade Racer

Wall-clock time to a playable build.

Full per-model data.

Output per model.

Every model, per-run data.

Claude Opus 4

Claude Sonnet 4

GLM-5.1

MiniMax M2.7

MiMo V2 Pro

Qwen 3.6 Plus

Qwen 3.5 35B

Kimi K2.6

Head-to-head comparisons.

Build a playable game with Sandscape.