fak · the reuse demo

Same model · same tokens · same answers. The headline race is fak vs a tuned warm-cache baseline. That baseline is what a real serving stack already gives you, such as prefix caching in vLLM / SGLang or a provider's prompt caching: each agent's prefix is cached once, and only new tokens are ingested after that. fak adds cross-agent prefix sharing and batched decode on top.
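To make the difference between the two lanes concrete, here is a token-accounting sketch in Go. The `workload` fields and the numbers are hypothetical illustrations, not fak's parameters, and the sketch makes no assumption about what the page's T, C, D and R denote. Decode is left out because both lanes decode the same tokens; batching changes how fast they are decoded, not how many.

```go
package main

import "fmt"

// workload is a hypothetical shape for illustration only: `agents` agents share
// one prefix of prefixLen tokens, and each agent then runs `turns` turns that
// each ingest newLen fresh tokens.
type workload struct {
	agents, turns, prefixLen, newLen int
}

// warmCachePrefill: a per-agent warm cache prefills the shared prefix once per
// agent, then only the new tokens of each turn.
func warmCachePrefill(w workload) int {
	return w.agents * (w.prefixLen + w.turns*w.newLen)
}

// sharedPrefixPrefill: the prefix KV is prefilled once and cloned to every
// agent, so each agent pays only for its per-turn new tokens.
func sharedPrefixPrefill(w workload) int {
	return w.prefixLen + w.agents*w.turns*w.newLen
}

func main() {
	w := workload{agents: 5, turns: 5, prefixLen: 512, newLen: 16}
	fmt.Println("warm cache prefill tokens:", warmCachePrefill(w))       // 5*(512+5*16) = 2960
	fmt.Println("shared-prefix prefill tokens:", sharedPrefixPrefill(w)) // 512 + 5*5*16 = 912
}
```

With a single agent the two counts coincide. In general the gap is (agents − 1) × prefixLen, so under this accounting the saving grows with both the number of agents and the prefix length.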

① Live race: fak vs a tuned warm-cache baseline
SOTA serving baseline · both run live, same model

workload: P=512 T=5 C=5 D=16 R=3225 requests
fak · prefix once · cloned · batched decode
tuned warm cache (SOTA) · per-agent KV · prefix once/agent · incremental
each lane reports tokens prefilled and tokens decoded as the race runs
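A minimal Go sketch of the shape of the fak lane, assuming a deep-copy clone: prefill the prefix once, clone its KV cache per agent, then advance all agents together one decode step at a time. `kvCache`, `fakeProject`, `prefillOnce` and `batchedDecodeStep` are hypothetical names, and the "batched" step is only a loop standing in for a single batched forward call. Production stacks such as vLLM typically share prefix blocks by reference with copy-on-write instead of copying them.

```go
package main

import "fmt"

// kvCache is a toy stand-in for an attention KV cache: one key and one value
// vector per token position. Real caches are per layer and per head.
type kvCache struct {
	keys, values [][]float32
}

// clone deep-copies the cache so each agent can extend it independently
// without touching the shared prefix or its siblings.
func (c *kvCache) clone() *kvCache {
	out := &kvCache{
		keys:   make([][]float32, len(c.keys)),
		values: make([][]float32, len(c.values)),
	}
	for i := range c.keys {
		out.keys[i] = append([]float32(nil), c.keys[i]...)
		out.values[i] = append([]float32(nil), c.values[i]...)
	}
	return out
}

func (c *kvCache) appendToken(k, v []float32) {
	c.keys = append(c.keys, k)
	c.values = append(c.values, v)
}

// fakeProject is a hypothetical placeholder for the model's K/V projection.
func fakeProject(tok int) (k, v []float32) {
	return []float32{float32(tok)}, []float32{float32(-tok)}
}

// prefillOnce runs the shared prefix through the "model" a single time.
func prefillOnce(prefix []int) *kvCache {
	c := &kvCache{}
	for _, tok := range prefix {
		k, v := fakeProject(tok)
		c.appendToken(k, v)
	}
	return c
}

// batchedDecodeStep advances every agent by one token. Here it is a plain
// loop; in a real engine this would be one batched forward pass.
func batchedDecodeStep(agents []*kvCache, next []int) {
	for i, c := range agents {
		k, v := fakeProject(next[i])
		c.appendToken(k, v)
	}
}

func main() {
	prefix := prefillOnce([]int{1, 2, 3}) // prefix prefilled exactly once
	agents := make([]*kvCache, 3)
	for i := range agents {
		agents[i] = prefix.clone() // each agent starts from a private copy
	}
	batchedDecodeStep(agents, []int{10, 20, 30})
	fmt.Println(len(prefix.keys), len(agents[0].keys)) // 3 4
}
```

The tuned warm-cache lane would instead call `prefillOnce` separately for each agent and decode each agent on its own.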

② Reuse curve across the model ladder
fak + tuned warm cache, both live

Same workload, with P reduced to 128 to keep the runs tractable on CPU. As the model grows, the absolute minutes saved over the tuned baseline grow with it, while the ratio holds.
legend: fak (reuse) · tuned warm cache (SOTA)
On each rung, fak and the tuned warm-cache baseline both run live; the ratio shown above each pair is fak vs tuned.
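Why the ratio can hold while the absolute saving grows: under a simplified model where wall time is tokens processed times a per-token cost that rises with model size, the per-token cost cancels in the ratio but multiplies the difference. A Go sketch under that assumption; the token counts and per-token costs are hypothetical, and the model ignores decode batching and attention cost that grows with context length.

```go
package main

import "fmt"

// seconds models wall time as tokens processed times a per-token cost.
// This linear model is an assumption for illustration, not fak's measurement.
func seconds(tokens int, secPerToken float64) float64 {
	return float64(tokens) * secPerToken
}

func main() {
	// Hypothetical token counts for the two lanes on a fixed workload.
	tunedTokens, fakTokens := 3000, 1000
	// Hypothetical per-token costs, from a small rung to a large one.
	for _, spt := range []float64{0.001, 0.01, 0.05} {
		tuned, fak := seconds(tunedTokens, spt), seconds(fakTokens, spt)
		fmt.Printf("sec/token=%.3f  ratio=%.2fx  saved=%.1fs\n", spt, tuned/fak, tuned-fak)
	}
}
```

Because real per-token cost is not constant, a measured ratio would only stay approximately stable across rungs, not exactly.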
fak in-kernel engine · pure-Go Q8 forward pass · tokens are real model output (anchor-quality on the 135M reference, chat-quality on the Qwen2.5 rungs). No network, no API, all on this box.
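fak's exact Q8 format is not described here, so the following Go sketch assumes a common block-wise int8 scheme similar to GGML's Q8_0: blocks of 32 values share one scale equal to the block's max |x| / 127 (GGML stores that scale as fp16; float32 is used here for simplicity). The dot product accumulates integer products inside each block and applies both scales once per block, which keeps the inner loop in integer arithmetic. All names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

const blockSize = 32

// q8Block holds blockSize int8 values that share one scale.
type q8Block struct {
	scale float32
	q     [blockSize]int8
}

// quantizeQ8 splits x into blocks and maps each value to int8 using
// scale = max|x| / 127 for its block.
func quantizeQ8(x []float32) []q8Block {
	if len(x)%blockSize != 0 {
		panic("quantizeQ8: length must be a multiple of blockSize")
	}
	blocks := make([]q8Block, len(x)/blockSize)
	for b := range blocks {
		chunk := x[b*blockSize : (b+1)*blockSize]
		var amax float32
		for _, v := range chunk {
			if a := float32(math.Abs(float64(v))); a > amax {
				amax = a
			}
		}
		scale := amax / 127
		blocks[b].scale = scale
		if scale == 0 {
			continue // all-zero block: q stays zero
		}
		for i, v := range chunk {
			blocks[b].q[i] = int8(math.Round(float64(v / scale)))
		}
	}
	return blocks
}

// dotQ8 sums int8 products in int32 per block, then applies both scales once.
// It assumes a and b have the same number of blocks.
func dotQ8(a, b []q8Block) float32 {
	var sum float32
	for i := range a {
		var acc int32
		for j := 0; j < blockSize; j++ {
			acc += int32(a[i].q[j]) * int32(b[i].q[j])
		}
		sum += float32(acc) * a[i].scale * b[i].scale
	}
	return sum
}

func main() {
	x := make([]float32, 64)
	y := make([]float32, 64)
	var exact float64
	for i := range x {
		x[i] = float32(i%7)*0.3 - 0.9
		y[i] = float32(i%5)*0.25 - 0.5
		exact += float64(x[i]) * float64(y[i])
	}
	approx := dotQ8(quantizeQ8(x), quantizeQ8(y))
	fmt.Printf("exact=%.4f  q8=%.4f\n", exact, approx)
}
```

The q8 result is close to, but generally not equal to, the exact dot product; the rounding error per value is at most half of its block's scale.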