Engineering notes
Short write-ups on design decisions, failure modes, and benchmarks from building an open-source AI browser agent.
Pruning Gemma 4 26B-A4B for small GPUs: Turkish-first, language-agnostic MoE surgery
Router hooks + expert activation telemetry + surgical long-tail removal + brief LoRA heal. Early run: 128→101 experts/layer, 26B→21B params, ~11 GB at 4-bit GGUF, with solid Turkish fluency and code performance.
Round 4: Qwen 3.5-9B-int4 punches above its weight — when 9B int4 beats 308B IQ3_S on affordance
A 9B int4 Qwen 3.5 on vLLM classifies the email-chip dropdown explicitly — something the 308B MiMo V2.5 at IQ3_S and the larger Qwen 3.6-35B-A3B both missed. Cheapest image tokens after Gemma. The catch: it loses the red-border visual cue and joins the "Unknowns: None" club. Suddenly the most interesting option for ≤8 GB VRAM. Plus an updated routing-policy table by VRAM bracket.
Round 3: Xiaomi MiMo V2.5 enters the vision shootout — and joins the "Unknowns: None" club
The empirical follow-up to last week's MiMo speculation post. Same probe, same Google sign-in screen, same prompt. MiMo at IQ3_S nails OCR and state extraction in the Qwen 3.6 tier — but joins Nemotron and Gemma in writing "Unknowns: None" instead of flagging the red-border ambiguity. Token cost ties the most expensive bucket; latency is in a different class. Plus a probe upgrade so big reasoning models stop tripping the default fetch headers timeout.
Xiaomi MiMo V2.5 Pro vs "V2.5 Flash": should WebBrain add both?
Research notes on Xiaomi's newly released MiMo V2.5 series and why multimodal Pro+Flash-style routing may outperform text-only stacks for browser-agent workloads, even though Qwen 3.6 still leads on value for many pure-text tasks.
Round 2: Nemotron Omni 30B vs Qwen 3.6 — do cheaper image tokens beat calibrated uncertainty?
A second round of vision-model benchmarking for browser agents. NVIDIA's Nemotron Omni 30B-A3B-Reasoning is 17% cheaper per image and classifies inputs better than Qwen 3.6-35B-A3B — but loses on calibrated uncertainty, and is English-only. Plus a head-to-head with the dense Qwen 3.6-27B that explains why MoE is the right architecture for self-hosted vision.
Four vision models, one screenshot: which one is actually worth running locally for a browser agent?
We fed the same Google sign-in page through Gemma 4-E2B, Gemma 4-31B, Qwen 3.5-27B, and Qwen 3.6-35B-A3B using the exact system prompt WebBrain's vision sub-call ships with. The spread on OCR accuracy, latency, and token cost is wider than you'd expect — and one model quietly changed our mind about which architecture to reach for.