May 7, 2026 · 6 min read

Round 4: Qwen 3.5-9B-int4 punches above its weight — when 9B int4 beats 308B IQ3_S on affordance

Same probe, same screen, same prompt as round 3. This time: Intel/Qwen3.5-9B-int4-AutoRound on vLLM. Older generation than the round 1 winner, four times smaller, and aggressively quantized to int4 — yet it explicitly classifies the email-chip dropdown that MiMo V2.5 at IQ3_S (308B params!) and even Qwen 3.6-35B-A3B both missed. The affordance call doesn't scale with size. Visual cue extraction does.

The setup, again

Same test/vision-probe.mjs, same Google sign-in screenshot at test/fixtures/google-signin-password-error.jpg, same 6-section structured caption prompt. This time pointed at vLLM on localhost:8000 serving Intel/Qwen3.5-9B-int4-AutoRound — Intel's int4 quantization of the dense Qwen 3.5-9B-VL via AutoRound. Important framing for what follows:

By any "size + recency + precision = capability" prior, this should land at the bottom of the table next to Gemma 4-E2B. It doesn't.

The surprise: it classified the dropdown

The email chip esokullu@gmail.com with the small chevron is structurally a dropdown — click it, you get an account picker. Across the eight models we've now tested, only three caught this and put it in §3 as a structured input the planner can act on:

Qwen 3.5-9B-int4's §3 entry, verbatim:

3) Inputs:
- Dropdown: Label "esokullu@gmail.com", value "esokullu@gmail.com",
  not focused, not disabled.
- Text field: Label "Enter your password", value "", focused,
  not disabled.
- Checkbox: Label "Show password", unchecked, not disabled.

Same structural framing Nemotron used. And this is a 9B int4 model. The thing the 308B MiMo at IQ3_S missed. The thing the 35B-A3B Qwen 3.6 only flagged parenthetically.
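
For a planner to act on a §3 entry like the one above, the caption text has to become structured data first. The following is a minimal sketch, assuming the caption keeps exactly the bullet shape shown (`- Kind: Label "...", value "...", flags`, possibly wrapped onto a second line). The name `parseInputs` and the returned field names are hypothetical and are not taken from the probe.

```javascript
// Hypothetical helper: turn a "3) Inputs:" caption section into objects.
// Assumes the bullet format quoted above; other models may format differently.
function parseInputs(section) {
  // Re-join wrapped continuation lines onto the bullet they belong to.
  const items = [];
  for (const line of section.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith("- ")) {
      items.push(trimmed.slice(2));
    } else if (items.length > 0 && trimmed !== "") {
      items[items.length - 1] += " " + trimmed;
    }
    // Anything before the first bullet (the "3) Inputs:" heading) is skipped.
  }

  return items.map((item) => {
    const labelMatch = item.match(/Label "([^"]*)"/);
    const valueMatch = item.match(/value "([^"]*)"/);
    return {
      kind: item.slice(0, item.indexOf(":")).trim(),
      label: labelMatch ? labelMatch[1] : null,
      value: valueMatch ? valueMatch[1] : null,
      // "not focused" must not count as focused, so require a preceding comma.
      focused: /(^|,\s*)focused\b/.test(item),
      disabled: /(^|,\s*)disabled\b/.test(item),
    };
  });
}
```

With the entry quoted above, the first object has `kind` set to `"Dropdown"`, which is the signal a planner would key on to open the account picker instead of treating the chip as static text.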

Best guess at why: the affordance call seems to hinge on the vision encoder family and the tile granularity it imposes, not the LLM head's parameter count. Qwen 3.5-9B-VL and Qwen 3.6 share lineage there; whatever in the encoder sees "chevron next to text → combobox" survives the int4 quant on the smaller model. Meanwhile MiMo's encoder produces 5557 prompt tokens for the same image and still doesn't put the chip in §3 — the encoder is doing more work and producing a worse structural read. The capability isn't where you'd expect it from the parameter count alone.

What else it got right

Where it slipped: the red-border visual cue

This is where the small + int4 cost shows up. Qwen 3.5-9B-int4's §4:

4) State signals:
- Error message: "Enter a password" displayed below the password
  input field.
- No loading spinners, toasts, modals, or other overlays.

No mention of the red border on the password field. No mention of the small red exclamation icon. The error message text is captured, but the visual treatment that makes the error visible at a glance — the colored border ring around the input — is missed entirely.

Compare round 1's Qwen 3.6-27B entry, which got border + icon + text. Same model family, three times the parameters, and the visual cues come back. This is a clean data point: visual state extraction does scale with size and quant, even when affordance classification doesn't. The 9B-int4 reads the structure right but misses what the colors mean.

For a browser agent this is recoverable — DOM cross-check confirms the validation state — but it's a real gap. If you were relying on the vision sub-call to flag "this form has an error" without round-tripping through the accessibility tree, the smaller model wouldn't tell you.
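
The DOM cross-check can be sketched as a small reconciliation step. This is an illustration under stated assumptions, not WebBrain's implementation: it assumes the agent can snapshot form fields into plain objects, that the page exposes validation state through `aria-invalid` or the constraint validation API (`checkValidity()`), and the names `crossCheckErrorState`, `ariaInvalid`, and `valid` are hypothetical.

```javascript
// domFields is a hypothetical snapshot the agent collects in the page, e.g.:
//   [...document.querySelectorAll("input")].map((el) => ({
//     name: el.name,
//     ariaInvalid: el.getAttribute("aria-invalid"),
//     valid: el.checkValidity(),
//   }))
// Whether a given page actually sets aria-invalid is an assumption.
function crossCheckErrorState(stateSignalsText, domFields) {
  const domInvalid = domFields.filter(
    (field) => field.ariaInvalid === "true" || field.valid === false
  );

  // What the vision caption's §4 text claims, judged by simple keyword checks.
  const visionSawError = /error/i.test(stateSignalsText);
  const visionSawVisualCue = /\b(red|border|icon)\b/i.test(stateSignalsText);

  return {
    domInvalidFields: domInvalid.map((field) => field.name),
    visionSawError,
    visionSawVisualCue,
    // Either source is enough to treat the form as blocked by an error.
    formHasError: domInvalid.length > 0 || visionSawError,
  };
}
```

Run against the §4 text quoted above, `visionSawError` is true (the message text was captured) while `visionSawVisualCue` is false, which is the gap this section describes.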

The §6 result you can probably guess

Section 6 of the prompt:

If you cannot read something clearly, say so. Do not guess numbers, names, or identifiers.

Qwen 3.5-9B-int4 wrote:

6) Unknowns:
- None.

Joins the club. We're now at five models in the "wrote None even when there was something to flag" bucket (the red border ambiguity is right there to call out — it's used for both focus and error on Material-style inputs and a planner needs to know which). Across seven model variants from five families on three engines at three quant levels, only Qwen 3.6-35B-A3B does §6 honestly. That's no longer a quirk — it's the central finding of the shootout series.
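
Since an empty §6 is the common case, a pipeline may want to detect it and treat it as "no calibration signal" rather than "nothing was unclear". A minimal sketch, assuming the `6) Unknowns:` heading and bullet format quoted above. The function name `unknownsAreEmpty` is hypothetical.

```javascript
// Returns true when §6 is missing, has no bullets, or only says "None".
// Assumes sections are introduced by lines like "6) Unknowns:".
function unknownsAreEmpty(caption) {
  const lines = caption.split("\n");
  const start = lines.findIndex((line) => /^\s*6\)\s*Unknowns:/i.test(line));
  if (start === -1) return true; // section missing entirely

  const bullets = [];
  for (const line of lines.slice(start + 1)) {
    if (/^\s*\d+\)\s/.test(line)) break; // a later numbered section begins
    const match = line.match(/^\s*-\s*(.+)$/);
    if (match) bullets.push(match[1].trim());
  }

  return (
    bullets.length === 0 ||
    bullets.every((bullet) => /^(none|n\/a)\.?$/i.test(bullet))
  );
}
```

A caller could use the result to lower its trust in the other sections, for example by always re-verifying identifiers against the DOM when this returns true.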

The numbers

The full table, updated

| | Gemma 4-E2B | Gemma 4-31B | Qwen 3.6-27B | Qwen 3.6-35B-A3B | Nemotron Omni 30B | MiMo V2.5 IQ3_S | Qwen 3.5-9B-int4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Architecture | Dense ~2B | Dense 31B | Dense 27B | MoE 35B / ~3B active | MoE 30B / ~3B active | MoE 308B (omni) | Dense 9B |
| Engine | llama.cpp | llama.cpp | vLLM int4 | llama.cpp | vLLM NVFP4 | llama.cpp IQ3_S | vLLM int4 AR |
| Latency | 1.5s | 4.6s | 5.9s | 5.3s | 12.0s | 61s / 209t | 10.3s / 252t |
| Prompt tokens (image) | 574 | 574 | 5570 | 4374 | 3636 | 5557 | 3379 |
| Email chip OCR | | | | | | | |
| All 12 visible strings | 5/12 | 12 | 12 | 12 | 11 (email moved to §3) | 12 | 12 |
| Affordance: chip = dropdown | missed | missed | missed | parenthetical | explicit | missed | explicit |
| Red error state | missed | text only | border + icon + text | border + icon + text + flag | border + icon + text | border + icon + text | text only |
| Inferred blocker | no | no | yes | yes | yes | yes | yes |
| Honest "Unknowns" §6 | no | no | no | YES (only model) | no | no | no |
| Multilingual | weak | partial | native | native | English-only | native (untested) | native (untested) |
| VRAM bracket | ~3 GB | ~20 GB | ~16 GB | ~22 GB | ~18 GB | ~110 GB | ~6 GB |

What this rounds out about size, quant, and capability

Four rounds in, the capability axes are starting to separate cleanly:

Verdict for VRAM-budget operators

If you have ≤8 GB of VRAM and need a self-hosted vision sub-call for a browser agent, Qwen 3.5-9B-int4 on vLLM is suddenly the most interesting option on this list. It nails what matters most for navigating browser forms — find the dropdown, OCR the strings, identify the blocker — while sacrificing what matters least: the precise visual cue surfacing, which is recoverable from the DOM. It does not replace Qwen 3.6-35B-A3B for users with the VRAM headroom — the §6 calibration is the difference, and it's the single most important difference on the table for browser-agent reliability — but on a 6-8 GB GPU, this is a much more capable model than its specs suggest.

For a routing policy, the rough shape is now:
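
As a sketch only, the verdict above can be written as a lookup by available VRAM. The thresholds are the approximate figures from the VRAM bracket row of the table (~22 GB and ~6 GB). The function name `pickVisionModel`, the return shape, and the choice to return `null` below ~6 GB are assumptions for illustration, not the author's policy.

```javascript
// Candidates ordered from most to least VRAM-hungry.
// minVramGb values are the approximate "VRAM bracket" figures from the table.
const CANDIDATES = [
  {
    model: "Qwen 3.6-35B-A3B",
    minVramGb: 22,
    note: "only model with an honest §6",
  },
  {
    model: "Qwen 3.5-9B-int4",
    minVramGb: 6,
    note: "pair with a DOM cross-check for error state",
  },
];

// Returns the first candidate that fits, or null if none does.
function pickVisionModel(vramGb) {
  return CANDIDATES.find((candidate) => vramGb >= candidate.minVramGb) ?? null;
}
```

Calling `pickVisionModel(8)` selects the 9B int4 model and `pickVisionModel(24)` selects the 35B-A3B model, matching the two cases the verdict distinguishes.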

What's next

The probe is unchanged from round 3 (and unchanged from rounds 1-2 in everything that affects model behavior — same prompt, same temperature, same max tokens). Three lines:

node test/vision-probe.mjs ./shot.png http://127.0.0.1:8000 qwen3.5-9b
node test/vision-probe.mjs ./shot.png http://127.0.0.1:8080 MiMo-V2.5-IQ3_S
node test/vision-probe.mjs ./shot.png http://127.0.0.1:8080 Qwen3.6-35B-A3B
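
For a rough idea of what a call like these involves, here is a hedged sketch. It assumes the endpoints speak the OpenAI-compatible chat completions API, which both vLLM and the llama.cpp server expose. `buildProbeRequest` and `runProbe` are hypothetical names and this is not the contents of test/vision-probe.mjs. The prompt text, temperature, and max tokens are left to the caller because their values are not stated here.

```javascript
// Build an OpenAI-compatible chat completion body with one image attached
// as a base64 data URL. Sampling options are only set when provided.
function buildProbeRequest({
  model,
  prompt,
  imageBase64,
  mimeType = "image/png",
  temperature,
  maxTokens,
}) {
  const body = {
    model,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: prompt },
          {
            type: "image_url",
            image_url: { url: `data:${mimeType};base64,${imageBase64}` },
          },
        ],
      },
    ],
  };
  if (temperature !== undefined) body.temperature = temperature;
  if (maxTokens !== undefined) body.max_tokens = maxTokens;
  return body;
}

// Read the screenshot, POST it, and return the caption plus token usage.
// Requires a Node version with global fetch (18 or later).
async function runProbe(imagePath, baseUrl, model, prompt) {
  const { readFile } = await import("node:fs/promises");
  const imageBase64 = (await readFile(imagePath)).toString("base64");
  const response = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildProbeRequest({ model, prompt, imageBase64 })),
  });
  if (!response.ok) throw new Error(`probe failed: HTTP ${response.status}`);
  const data = await response.json();
  return { caption: data.choices[0].message.content, usage: data.usage };
}
```

The `usage` object in an OpenAI-compatible response includes `prompt_tokens`, which is one way a figure like the "Prompt tokens (image)" row could be collected.
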

Methodology caveat, again. Still one screenshot, still one fixture, now eight model variants. The reason the same Google sign-in screen keeps doing useful work as we add models is that it surfaces every axis we care about: OCR identifiers, structured affordances, visual state cues, ambiguity. Different pages will surface different weaknesses; if a model passes here and fails on a Stripe dashboard, that's worth knowing before you wire it in.

Written by Emre Sokullu. WebBrain is MIT-licensed and open on GitHub. The probe lives at test/vision-probe.mjs; file an issue if you've benched a model worth adding.