April 29, 2026 · 8 min read · ← All posts

Round 2: Nemotron Omni 30B vs Qwen 3.6 — does cheaper image tokens beat calibrated uncertainty?

Last week we ran a four-way vision-model shootout and Qwen 3.6 35B-A3B walked away with the round. Two new contenders showed up since: NVIDIA's Nemotron Omni 30B-A3B-Reasoning, and the dense Qwen 3.6-27B for an apples-to-MoE comparison within the same model family. Nemotron has two genuine wins over Qwen — and a third axis where it loses badly. Plus a multilingual gotcha that decides the round for most readers without us touching a benchmark.

The setup, again

Same probe (test/vision-probe.mjs in the repo), same prompt (the 6-section structured caption WebBrain's vision sub-call uses), same image (the Google sign-in screen with a focused password field, red error border, and a small email chip with a dropdown affordance). The probe sends the exact system prompt, user message, and parameters our extension's vision sub-call sends, against any OpenAI-compatible endpoint.

If you missed round 1: we tested Gemma 4-E2B, Gemma 4-31B, Qwen 3.5-27B, and Qwen 3.6-35B-A3B. Headline finding was that Qwen 3.6's MoE variant beat the dense 27B on every axis — same VRAM, better quality, lower latency. Round 1 here. This post adds two models on top.

A reasoning-suppression detour, then the numbers

First test on Nemotron: 18.4 seconds. For a 30B-A3B with 195 completion tokens, that's wrong. The probe sets chat_template_kwargs.enable_thinking: false — the Qwen-style gate — but Nemotron uses think: false instead, so our flag was being ignored. The model was reasoning silently before emitting the structured caption.

Adding both keys (enable_thinking: false AND think: false — servers ignore unknown kwargs, so packing both is harmless) brought it down to 12.0 seconds. We also tried /no_think as a system-prompt prefix per NVIDIA's docs; that did essentially nothing on top of the kwarg. The probe in the repo now sends all three keys (enable_thinking, think, thinking for the DeepSeek family) so future reasoning models are more likely to behave without manual surgery.

Once the gate was set right, the comparison stopped being about waiting and started being about caption quality. That's where it got interesting.

The two things Nemotron does better

1. It's 17% cheaper per image than Qwen 3.6-A3B

Same screenshot, prompt-token counts:

Different vision encoder family entirely — Nemotron-3-Nano has a tighter tile budget than Qwen2.5-VL. Doesn't matter if you're running this locally on a 5090 (you're paying in prompt-eval time, not money). Matters a lot if you're pointing the dedicated vision model at a paid endpoint — Nemotron is 17–53% cheaper than the Qwen variants for the exact same input.

2. It actually classifies the email chip as a dropdown

The test image's email chip — esokullu@gmail.com with a small chevron next to it — is visually a dropdown affordance. Click it, you get an account picker. It's not a heading, link, button, tab, or menu item, even though its main contents are text.

Nemotron classified it the right way:

3) Inputs:
- Label: "esokullu@gmail.com", Type: dropdown, Value: "esokullu@gmail.com",
   Focused: false, Disabled: false
- Label: "Enter your password", Type: text input, Value: "",
   Focused: true, Disabled: false

Qwen 3.6-A3B noted the same affordance, but as parenthetical metadata under "visible text" rather than as a structured input:

2) Visible text:
- "esokullu@gmail.com" (with dropdown arrow)

For a planning model that has to decide whether to click({text: "esokullu@gmail.com"}) (Qwen's framing → "click the visible text" → no defined behavior) versus click_ax({ref_id: "..."}) on a combobox role (Nemotron's framing → click-to-toggle a picker), Nemotron's reading is the right one. It encodes the affordance directly in the place the planner looks for actionable elements.

There's a real cost to this, though: by classifying the chip as an input, Nemotron technically violated section 2 of the prompt ("list the EXACT strings on buttons, links, headings, tabs, and menu items"). It omitted the email from the visible-text list because it decided the chip wasn't any of those things. Defensible reading — but if downstream code does a .includes("@") check on §2 to find email addresses, Nemotron would silently miss it. Whether you call this a win depends on whether you trust your prompt or your model.

The one thing Nemotron does worse — and why it matters

Section 6 of our vision prompt asks for an "Unknowns" list — text the model couldn't read clearly, ambiguous states, anything it isn't sure of. The instruction is verbatim:

If you cannot read something clearly, say so. Do not guess numbers, names, or identifiers.

Across every model we've tested, this is the bullet that gets the least respect. Most models write "None" by default even when they had visible reasons to be unsure. Round 1's headline finding was that Qwen 3.6-35B-A3B was the only model that actually used section 6 honestly — it flagged the red border on the password field as ambiguous (could mean focus, could mean error), and noted that interpretation should be DOM-cross-checked.

Nemotron also wrote "Unknowns: None" — even though it had the same red-border ambiguity to flag. So did Qwen 3.6-27B dense.

For a browser agent, this is the bullet that decides everything else. An affordance misclassification is recoverable — the planner has get_accessibility_tree to cross-check whether something really is a combobox. An OCR misread is recoverable too — verify_form reads the DOM, not pixels. But overconfidence is not recoverable. If the vision model commits to "this is a focus indicator, not an error" and the planner takes that as ground truth, you end up acting on a hallucinated state with no signal that anything's wrong. The "Unknowns" section is the planner's escape hatch — without it, model perception becomes ground truth.

So even though Nemotron has cheaper image tokens AND better affordance classification, it loses on the one axis a browser agent values most. Two wins, one loss — but the loss is structural, the wins are nice-to-haves.

The Nemotron downside that decides the round for most readers

Nemotron Omni 30B is English-only.

WebBrain users on Spanish, French, Turkish, Chinese, German, Arabic, Japanese, Korean, Russian — anyone running the agent on pages in their own language — will get unusable captions out of Nemotron. The page text comes back garbled, half-translated, or transliterated. The Qwen 3.x family is multilingual by design and handles non-English page text natively.

For an English-language agent on English-language pages with a paid vision endpoint, Nemotron's 17% token discount is a real argument. For everyone else, it isn't a real option, regardless of how cheap it is per image.

The full table

Gemma 4-E2B Gemma 4-31B Qwen 3.6-27B dense Qwen 3.6-35B-A3B Nemotron Omni 30B-A3B
Architecture Dense, ~2B effective Dense 31B Dense 27B MoE 35B / ~3B active MoE 30B / ~3B active, reasoning
Engine tested llama.cpp llama.cpp vLLM (Intel int4 AutoRound) llama.cpp vLLM (NVFP4)
Latency 1.5s 4.6s 5.9s 5.3s 12.0s (after reasoning fix)
Prompt tokens (image) 574 574 5570 4374 3636
Email chip OCR (esokullullu)
All 12 visible strings 5 of 12 12 12 12 11 in §2 (email moved to §3 as input)
Affordance classification (email chip = dropdown) missed missed missed noted as parenthetical explicit Type: dropdown
Red error state surfaced missed text only border + icon + text border + icon + text + ambiguity flag border + icon + text
Inferred semantic blocker no no yes yes yes
Honest "Unknowns" section no no no YES — only model that did no
Multilingual weak partial native native English-only

Head-to-head: Qwen 3.6-35B-A3B vs Nemotron Omni 30B-A3B

For the two A3B MoEs that are realistically competing for the dedicated-vision-model slot:

Axis Qwen 3.6-35B-A3B Nemotron Omni 30B-A3B
OCR fidelity (email, etc.) tie tie — both correct
State extraction (red border, error, focus) tie tie
Affordance classification (email chip = dropdown?) parenthetical annotation explicit Type: dropdown in §3 ← Nemotron ahead
Calibrated uncertainty (§6 Unknowns) ✓ flagged red-border ambiguity ← Qwen ahead ❌ "None"
Image token cost 4374 3636 (-17%) ← Nemotron ahead
Latency (after reasoning fix) 5.3s ← Qwen ahead 12.0s
Multilingual page text native ← Qwen ahead English-only

Verdict for consumer devices

Qwen 3.6-35B-A3B is still the pick, and round 2 makes the case stronger, not weaker:

Reasoning + visual + faster than the dense 27B + multilingual is a remarkable combination on a single 35B-A3B that fits on consumer hardware. The 5090 we tested on can run this comfortably; so can a 4090 with the right quant. For self-hosted browser-agent vision, this is the model to beat.

Where Nemotron makes sense anyway

If you tick all of these:

Then Nemotron is a real argument. Otherwise, it's an interesting data point — not a default.

Where Gemma 4 lands

Round 1 was already not kind to Gemma. Round 2 doesn't change that — both Gemma 4-E2B and Gemma 4-31B are clearly behind every Qwen and Nemotron variant on the metrics that matter for a browser agent. Gemma's main appeal is the 574 image-token count (massively cheaper than the 3636–5570 range of Qwen / Nemotron), but the OCR keeps mangling identifiers (esokulluesokullullu, emreillu, etc.) and section 6 is always "None". For a browser agent that needs to act on what it reads, that's not a tradeoff — it's a non-starter. Gemma is fine for very high-level page-purpose detection if you don't need precision on names, IDs, or dollar amounts. For an agent doing real form-filling or login-screen handling, neither variant is competitive.

What's next

We're going to keep running this probe against new models as they show up — focusing specifically on what fits on consumer GPUs (8–32 GB VRAM) at usable latency. The gap between Gemma and the rest is wide enough now that we'll probably stop including Gemma in headline tables unless something major changes there.

The other dimension we want to nail down: quantization and inference-engine effects. A few questions we haven't answered cleanly yet:

Those are the next posts. The probe stays where it is — small, reproducible, same prompt every time. If you want to run your own comparisons, it's three lines:

node test/vision-probe.mjs ./shot.png http://127.0.0.1:8080  Qwen3.6-35B-A3B
node test/vision-probe.mjs ./shot.png http://127.0.0.1:8000  nemotron-omni-30b
node test/vision-probe.mjs ./shot.png http://localhost:11434/v1  llava:13b
Methodology caveat. One screenshot is one data point — the same caveat as round 1. Different pages will surface different model weaknesses. The probe is the cheapest possible way to compare models on whatever pages you actually care about; if a candidate model passes on a Google login screen but fails on a Stripe dashboard, that's worth knowing before you wire it in. Run it on your own screens before committing.
Written by Emre Sokullu. WebBrain is MIT-licensed and open on GitHub — file an issue if you've benchmarked a model worth adding.