Pruning Gemma 4 26B-A4B to run on small GPUs
I’m pruning Google’s Gemma 4 26B-A4B for a Turkish + English deployment. The proof of concept is Turkish-first, but the method is language-agnostic: measure which experts are actually used, remove the long tail, then do a short LoRA heal to recover from the cuts.
Why this works on MoE models
MoE models develop implicit specialization. In practice, Gemma 4 experts quietly separate across scripts and patterns: CJK characters, Cyrillic, Devanagari, Arabic script, Hangul, and more. For Turkish + English workloads, a meaningful subset of experts barely activates.
Method
- Hook the routers and collect per-expert activation stats on Turkish + code + math + web-heavy data.
- Prune the low-utility long-tail experts layer by layer, so each layer drops its own least-used experts.
- Run a brief LoRA heal on Turkish instruction data so the model adapts to the reduced expert set.
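The first two steps can be sketched in plain Python. This is a minimal illustration on synthetic router logits, not the actual pipeline: the real run hooks Gemma's router modules, which are not named here. The helper names, the random data, and `TOP_K = 8` are assumptions made for the example.

```python
import random
from collections import Counter

def topk_indices(logits, k):
    """Indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def count_activations(token_logits, k):
    """Count how often each expert lands in the top-k across all tokens."""
    counts = Counter()
    for logits in token_logits:
        counts.update(topk_indices(logits, k))
    return counts

def experts_to_keep(counts, num_experts, keep):
    """Keep the `keep` most-used experts; experts never routed to count as 0."""
    ranked = sorted(range(num_experts), key=lambda e: counts[e], reverse=True)
    return sorted(ranked[:keep])

def remap_experts(kept):
    """Map old expert ids to contiguous new ids for the pruned layer."""
    return {old: new for new, old in enumerate(kept)}

# Synthetic stand-in for one layer's router outputs (hypothetical data):
# experts 0..100 get a positive bias, so the tail 101..127 is rarely selected.
random.seed(0)
NUM_EXPERTS, KEEP, TOP_K = 128, 101, 8  # TOP_K is an assumed value
bias = [2.0 if e < KEEP else -2.0 for e in range(NUM_EXPERTS)]
token_logits = [
    [random.gauss(bias[e], 1.0) for e in range(NUM_EXPERTS)]
    for _ in range(500)
]

counts = count_activations(token_logits, TOP_K)
kept = experts_to_keep(counts, NUM_EXPERTS, KEEP)
kept_set = set(kept)
pruned = [e for e in range(NUM_EXPERTS) if e not in kept_set]
old_to_new = remap_experts(kept)

print(f"kept {len(kept)} experts, pruned {len(pruned)}")
```

In a real run, `token_logits` would come from forward hooks on each layer's router over the calibration data, with one counter per layer. The `old_to_new` map is what the router rows and expert weights would be re-indexed by after the cut.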
Early results
- 128 → 101 experts per layer
- 26B → 21B parameters (~19% smaller; expert count is down ~21%)
- 4-bit GGUF size: ~11 GB
- Fits 24 GB GPUs; possible on 12 GB with IQ4_XS
- Turkish fluency + code + general knowledge remain solid in practical checks
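A quick arithmetic cross-check of these figures, as a sketch. It uses only the rounded numbers from the list and treats "~11 GB" as 11 × 10^9 bytes, which is an assumption.

```python
experts_before, experts_after = 128, 101
params_before, params_after = 26e9, 21e9  # rounded figures from the list
gguf_bytes = 11e9                         # "~11 GB", assumed decimal gigabytes

expert_cut = 1 - experts_after / experts_before
param_cut = 1 - params_after / params_before
bits_per_weight = gguf_bytes * 8 / params_after

print(f"experts removed:    {expert_cut:.1%}")       # 21.1%
print(f"parameters removed: {param_cut:.1%}")        # 19.2%
print(f"bits per weight:    {bits_per_weight:.2f}")  # 4.19

# Headroom left for KV cache and runtime buffers on each card size.
for vram_gb in (24, 12):
    headroom_gb = vram_gb - gguf_bytes / 1e9
    print(f"{vram_gb} GB card: {headroom_gb:.0f} GB headroom")
```

The parameter cut comes out smaller than the expert cut, which is expected: attention, embeddings, and routers are untouched by expert pruning. Roughly 4.2 bits per weight is in line with a 4-bit quant, and the 1 GB of headroom on a 12 GB card shows why a tighter quant such as IQ4_XS is needed there.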
Why prune + heal instead of retrain or plain finetune?
Pretraining from scratch takes months and a large budget, and throws away Gemma’s pretraining value. Finetuning alone leaves you with the full heavyweight model. Prune + heal preserves the valuable base and removes what this deployment does not use.
Why this matters next
Even if VRAM gets cheaper, we still want specialized smaller models running side by side (planner, vision, coder). Pruning is the right operational tool: start from a strong base, keep only what serves the job.
Model link: huggingface.co/esokullu/gemma4-tr-26b-a4b-pruned-gguf