M=32 batched inference for the AMD Strix Halo APU. 24× faster than day 1. Beats FLM Kraken Point. Open source. Zero Python.
The NPU AMD shipped locked. The GPU they already open. 1bit.systems drives both — from one mini PC, with no cloud in the loop.
50 TOPS of INT8 AMD shipped disabled on consumer silicon. Driven directly through XRT — verified at 10 ms/tok (97 tok/s). Beats FLM Kraken Point.
Vulkan 1.3 compute. GLSL shaders → SPIR-V, GGUF native. 37k tok/s throughput, 281 tok/s 1-bit.
Hand-written, dependency-free, benchmarked on the metal. Compile a binary; run it offline.
C++23. M=32 batched decode. OpenMP attention + LM head. 24× vs v3. Beats FLM Kraken Point.
IQ1_S · 385 MB. Vulkan backend. pi-agent patched llama.cpp.
Zig. GLSL compute → SPIR-V. Qwen3.5-9B Q4_K. GGUF native. 7.8× rocBLAS.
Apple MLX fork with IRON XDNA 2 backend. Same framework runs on M1–M5 AND your NPU.
μs-accurate profiling revealed 99% of XRT dispatch time is kernel launch overhead, not compute. 112 dispatches per token × 1346μs each = 157ms.
No wrapper, no scheduler, no Python. Compile, point it at the NPU, and read the latency straight off the metal.
Clone the repo, follow the build guide, and reproduce every number on this page.