Can Qwen 3.5 35B open-weights build a Three.js game in 60 minutes?

In the April 2026 benchmark, Qwen 3.5 35B reached the 60-minute ceiling mid-debug with 850 lines across 7 files. The commit trail shows active fixes, so the model was progressing rather than stalled.

Why did Qwen 3.6 Plus take 70.5 minutes in initial development?

Qwen 3.6 Plus spent 70.5 minutes in initial development and 8.7 minutes in debug. That is the longest initial phase of any model in this round.

Is Qwen 3.5 35B suitable for self-hosted game development?

At a 60-minute timeout on this Three.js brief, Qwen 3.5 35B did not finish. A longer timeout or smaller task decomposition would be required to complete a brief of this size.

Head-to-head · April 2026 — Coding Tool Benchmark

Qwen 3.6 Plus vs Qwen 3.5 35Bfor games

Qwen 3.6 Plus passed in 79 min 15 s. Qwen 3.5 35B hit the 60-minute timeout.

pass·fail·Winner: Qwen 3.6 Plus

Results

Per-model results.

Alibaba

Qwen 3.6 Plus

Winner

Status

PASS

Duration

79m 15s

Code Lines

2,610

Files

15

Debug Loops

4

Cost

—

Alibaba

Qwen 3.5 35B

Status

FAIL

Duration

60m 00s

Code Lines

850

Files

7

Debug Loops

8

Cost

—

Playable builds

Per-model playable builds.

Raw output from each model. Same brief, same assets, same 60-minute ceiling. No manual edits applied.

Playable build · publishing soon

Qwen 3.6 Plus build is being packaged for web.

Full build output is captured for every run. Hosted versions go live shortly after each round. See the blog for round write-ups.

Now showing: Qwen 3.6 Plus

Leaderboard

Wall-clock time to a playable build.

Winner

Pass

Fail

Figure 1. Total elapsed time per model, sorted fastest-first. Failed runs pinned to the end.

Analysis

Results

Qwen 3.6 Plus and Qwen 3.5 35B are both from Alibaba's Qwen family. Qwen 3.6 Plus is the commercial API tier. Qwen 3.5 35B is the open-weights variant.

Both models received the same Three.js brief, the same assets, and the same agentic workflow.

Qwen 3.6 Plus: passed in 79 min 15 s. 2,610 lines across 15 files. 4 debug iterations.
Qwen 3.5 35B: reached the 60-minute ceiling. 850 lines across 7 files. 8 debug iterations, still in the debug loop.

Time breakdown

Qwen 3.6 Plus spent 70.5 minutes in initial development and 8.7 minutes in debug. The initial phase is the longest of any model in this round. Total: 79 min 15 s.

Qwen 3.5 35B spent 24.2 minutes in initial development and 35.8 minutes in debug before the timeout. Debug iterations: 8. Commit history shows active fixes during the debug phase.

Per-model breakdown

Qwen 3.6 Plus front-loaded reasoning into initial development and produced a compiling build with minimal debug work. 4 debug iterations were sufficient to pass.

Qwen 3.5 35B produced a shorter initial build and entered the debug loop earlier. 8 iterations landed within the budget without clearing all failures before the 60-minute ceiling.

Methodology

We gave both models the same Three.js brief, the same input assets, and the same agentic workflow. Wall-clock time was capped at 60 minutes for the open-weights run and allowed to run to completion for the commercial run. Line counts are measured across files in the final commit. Debug iterations count the number of automated fix cycles after the first build attempt.

Takeaways

Qwen 3.6 Plus completed the brief at 79 min 15 s with 2,610 lines. Qwen 3.5 35B did not complete within 60 minutes and produced 850 lines before the ceiling.

For self-hosted Qwen 3.5 35B on a brief of this size, two options apply: decompose the task into smaller steps that fit the model's reasoning horizon, or raise the timeout above 60 minutes. For the commercial Qwen 3.6 Plus API, the brief completes, at a cost of longer wall-clock time than other commercial models in this round.

Verdict

Qwen 3.6 Plus shipped 2,610 lines across 15 files in 79 min 15 s; Qwen 3.5 35B reached the 60-minute ceiling with 850 lines across 7 files.

FAQ

In the April 2026 benchmark, Qwen 3.5 35B reached the 60-minute ceiling mid-debug with 850 lines across 7 files. The commit trail shows active fixes, so the model was progressing rather than stalled.

Other comparisons