Head-to-head · April 2026 — Coding Tool Benchmark

Qwen 3.6 Plus vs Qwen 3.5 35Bfor games

Qwen 3.6 Plus vs Qwen 3.5 35B for games — AI coding benchmark on a Three.js game build with timing, debug iterations, and code metrics.

Qwen 3.6 Plus passed in 79 min 15 s. Qwen 3.5 35B hit the 60-minute timeout.

pass·fail·Winner: Qwen 3.6 Plus
Results

Per-model results.

Alibaba

Qwen 3.6 Plus

Winner
Status
PASS
Duration
79m 15s
Code Lines
2,610
Files
15
Debug Loops
4
Cost
Alibaba

Qwen 3.5 35B

Status
FAIL
Duration
60m 00s
Code Lines
850
Files
7
Debug Loops
8
Cost
Playable builds

Per-model playable builds.

Raw output from each model. Same brief, same assets, same 60-minute ceiling. No manual edits applied.

Playable build · publishing soon
Qwen 3.6 Plus build is being packaged for web.

Full build output is captured for every run. Hosted versions go live shortly after each round. See the blog for round write-ups.

Now showing: Qwen 3.6 Plus
Leaderboard

Wall-clock time to a playable build.

Winner
Pass
Fail

Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end.

Analysis

Results

Qwen 3.6 Plus and Qwen 3.5 35B are both from Alibaba's Qwen family. Qwen 3.6 Plus is the commercial API tier. Qwen 3.5 35B is the open-weights variant.

Both models received the same Three.js brief, the same assets, and the same agentic workflow.

  • Qwen 3.6 Plus: passed in 79 min 15 s. 2,610 lines across 15 files. 4 debug iterations.
  • Qwen 3.5 35B: reached the 60-minute ceiling. 850 lines across 7 files. 8 debug iterations, still in the debug loop.

Time breakdown

Qwen 3.6 Plus spent 70.5 minutes in initial development and 8.7 minutes in debug. The initial phase is the longest of any model in this round. Total: 79 min 15 s.

Qwen 3.5 35B spent 24.2 minutes in initial development and 35.8 minutes in debug before the timeout. Debug iterations: 8. Commit history shows active fixes during the debug phase.

Per-model breakdown

Qwen 3.6 Plus front-loaded reasoning into initial development and produced a compiling build with minimal debug work. 4 debug iterations were sufficient to pass.

Qwen 3.5 35B produced a shorter initial build and entered the debug loop earlier. 8 iterations landed within the budget without clearing all failures before the 60-minute ceiling.

Methodology

We gave both models the same Three.js brief, the same input assets, and the same agentic workflow. Wall-clock time was capped at 60 minutes for the open-weights run and allowed to run to completion for the commercial run. Line counts are measured across files in the final commit. Debug iterations count the number of automated fix cycles after the first build attempt.

Takeaways

Qwen 3.6 Plus completed the brief at 79 min 15 s with 2,610 lines. Qwen 3.5 35B did not complete within 60 minutes and produced 850 lines before the ceiling.

For self-hosted Qwen 3.5 35B on a brief of this size, two options apply: decompose the task into smaller steps that fit the model's reasoning horizon, or raise the timeout above 60 minutes. For the commercial Qwen 3.6 Plus API, the brief completes, at a cost of longer wall-clock time than other commercial models in this round.

Verdict

Qwen 3.6 Plus shipped 2,610 lines across 15 files in 79 min 15 s; Qwen 3.5 35B reached the 60-minute ceiling with 850 lines across 7 files.

FAQ

FAQ

  • In the April 2026 benchmark, Qwen 3.5 35B reached the 60-minute ceiling mid-debug with 850 lines across 7 files. The commit trail shows active fixes, so the model was progressing rather than stalled.
Full round

2 of 8 models shown. See the full round.

All models in the round ran the same brief, same assets, same agentic workflow. The archive has full per-model metrics.