Question 1

What is the Sandscape AI coding benchmark?

Accepted Answer

Every major AI coding model gets the same Three.js game brief, the same assets, and the same agentic workflow. Only the model changes. We measure wall-clock time, debug iterations, code volume, and whether the build runs.
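
As a rough illustration of how those four measurements could be recorded per model, here is a minimal TypeScript sketch. The `RunResult` shape, its field names, and the `rankRuns` ordering are assumptions made for this example; they are not the benchmark's actual schema or scoring rule.

```typescript
// Hypothetical record for one model's run in a round.
// Field names are illustrative assumptions, not the benchmark's real schema.
interface RunResult {
  model: string;            // placeholder label for the model under test
  wallClockSeconds: number; // total elapsed time for the run
  debugIterations: number;  // number of fix-and-retry cycles
  linesOfCode: number;      // code volume of the final build
  buildRuns: boolean;       // does the game actually run in a browser?
}

// Assumed ordering for illustration only: working builds first,
// then fewer debug iterations, then shorter wall-clock time.
function rankRuns(results: RunResult[]): RunResult[] {
  return [...results].sort((a, b) => {
    if (a.buildRuns !== b.buildRuns) return a.buildRuns ? -1 : 1;
    if (a.debugIterations !== b.debugIterations) {
      return a.debugIterations - b.debugIterations;
    }
    return a.wallClockSeconds - b.wallClockSeconds;
  });
}

// Placeholder data, not real benchmark results.
const sampleRuns: RunResult[] = [
  { model: "model-a", wallClockSeconds: 900, debugIterations: 4, linesOfCode: 1200, buildRuns: true },
  { model: "model-b", wallClockSeconds: 600, debugIterations: 1, linesOfCode: 800, buildRuns: false },
  { model: "model-c", wallClockSeconds: 1100, debugIterations: 2, linesOfCode: 1500, buildRuns: true },
];

const rankedModels = rankRuns(sampleRuns).map((r) => r.model);
```

Because only the model changes between runs, records like these can be compared directly within a round.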

Question 2

How is this different from other LLM coding benchmarks?

Accepted Answer

Most LLM benchmarks are synthetic: they score pass-at-k on isolated problems. Ours tests end-to-end output. The game either runs in a browser or it doesn't, and render loops and input handlers can't be faked.
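
For readers unfamiliar with the term, pass-at-k (often written pass@k) is commonly estimated as the probability that at least one of k sampled solutions passes, given that c of n generated samples passed. The sketch below shows that widely used estimator for contrast only; the function name `passAtK` is our own, and this benchmark does not score models this way.

```typescript
// Commonly used unbiased pass@k estimator:
//   pass@k = 1 - C(n - c, k) / C(n, k)
// computed as a running product for numerical stability.
// n: total samples generated, c: samples that passed, k: sample budget.
function passAtK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("require 1 <= k <= n and 0 <= c <= n");
  }
  // Fewer than k failures exist, so any k samples include a passing one.
  if (n - c < k) return 1;
  let allFail = 1; // probability that all k drawn samples fail
  for (let i = n - c + 1; i <= n; i++) {
    allFail *= 1 - k / i;
  }
  return 1 - allFail;
}
```

A score like this describes isolated problems; it says nothing about whether a complete game loads and responds to input, which is what the rounds here check.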

Question 3

How often do you run a new round?

Accepted Answer

Every few weeks, timed to significant model releases. Each round uses a fresh game brief to prevent training-data leakage and stress different parts of the stack.

Question 4

Which AI coding models do you test?

Accepted Answer

Every serious model with an API or open-weights release at round time. Latest round: Claude Opus 4, Claude Sonnet 4, GLM-5.1, Qwen 3.6 Plus, Qwen 3.5 35B, MiniMax M2.7, MiMo V2 Pro, Kimi K2.6. New models are added as they ship.

Question 5

Can I play the games the models built?

Accepted Answer

Yes. Every round page embeds each model's playable build unedited. Head-to-head pages let you play both builds side by side.

AI coding models, same game brief, same conditions.

April 2026 — Coding Tool Benchmark

Head-to-heads.

All rounds.

By use case.

FAQ

Build a playable game with Sandscape.

An AI coding and LLM benchmark for games: Claude, Qwen, GLM, MiniMax, Kimi, and more, tested on identical Three.js game briefs.
