May 19, 2026 · 4 min read

Pruning Gemma 4 26B-A4B to run on small GPUs

I’m pruning Google’s Gemma 4 26B-A4B for a Turkish + English deployment. The proof of concept is Turkish-first, but the method is language-agnostic: measure which experts the target workload actually uses, remove the long tail of rarely used ones, then run a short LoRA finetune (the "heal") to recover the quality lost to the cuts.
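As a rough illustration of the measure-and-cut step, here is a minimal sketch rather than the actual script behind this model. It assumes the router's top-k choices have already been logged for a sample of target-language text. The names `select_experts`, `routed`, and `coverage` are hypothetical and chosen only for this example.

```python
from collections import Counter

def select_experts(routed, num_experts, coverage=0.99):
    """Return the sorted ids of the most-used experts that together
    account for at least `coverage` of all routing decisions.

    `routed` is an iterable of per-token tuples of expert ids
    (hypothetical log format: one tuple per token, one id per top-k slot).
    """
    counts = Counter(e for token in routed for e in token)
    total = sum(counts.values())
    # Rank experts from most to least used; ties broken by id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    keep, covered = [], 0
    for e in ranked:
        if total == 0 or covered / total >= coverage:
            break  # everything after this point is the long tail
        keep.append(e)
        covered += counts[e]
    return sorted(keep)

# Toy data: 6 experts, top-2 routing, 6 tokens.
routed = [(0, 1), (0, 2), (1, 0), (0, 1), (2, 0), (1, 5)]
keep = select_experts(routed, num_experts=6, coverage=0.9)
print(keep)  # [0, 1, 2]
```

In a real model this would presumably run per MoE layer, since each layer has its own router and its own usage distribution. The coverage threshold is the knob that trades model size against how much healing is needed afterwards.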

Why this works on MoE models

MoE models develop implicit specialization. In practice, Gemma 4 experts quietly separate across scripts and patterns: CJK characters, Cyrillic, Devanagari, Arabic script, Hangul, and more. Turkish and English are both written in Latin script, so for these workloads a meaningful subset of experts barely activates.
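One way to see why dropping idle experts can be cheap is the toy sketch below. It assumes a generic top-k router that applies a softmax over only the selected logits. Gemma 4's actual router may differ in its details, and `route` and `prune_logits` are hypothetical helpers written for this example. Under that assumption, a token whose top-k experts all survive the cut gets exactly the same experts and the same mixing weights after pruning.

```python
import math

def route(logits, k):
    """Generic top-k routing: pick the k largest logits and
    softmax over just those. Returns {expert_index: weight}."""
    top = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}

def prune_logits(logits, keep):
    """Drop the router outputs of removed experts (same order as `keep`)."""
    return [logits[i] for i in keep]

# Router logits for one token over 6 experts; experts 2, 4, 5 are pruned.
logits = [2.0, 0.5, -1.0, 1.5, -3.0, -2.5]
keep = [0, 1, 3]

full = route(logits, k=2)
pruned = route(prune_logits(logits, keep), k=2)
# Map indices in the pruned model back to the original expert ids.
remapped = {keep[i]: w for i, w in pruned.items()}

assert set(full) == set(remapped) == {0, 3}
assert all(abs(full[i] - remapped[i]) < 1e-12 for i in full)
```

Tokens that did rely on a removed expert get rerouted to their next-best choices, and that mismatch is what the heal step has to absorb.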

Method

Early results

Why prune + heal instead of retrain or plain finetune?

Pretraining from scratch takes months and a large budget, and throws away the value of Gemma’s pretraining. Plain finetuning keeps the full heavyweight model, unused experts included. Prune + heal preserves the valuable base and removes what this deployment does not use.

Why this matters next

Even if VRAM gets cheaper, we will still want several specialized smaller models running side by side (for example a planner, a vision model, and a coder). Pruning fits that need operationally: start from a strong base, keep only what serves the job.

Model link: huggingface.co/esokullu/gemma4-tr-26b-a4b-pruned-gguf

Tags: #LLM #MoE #Pruning