Strix Halo · Ryzen AI Max+ 395

50 TOPS INT8. 55.7 TFLOPS measured.

281 tok/s 1-bit. 10 ms/tok NPU.

M=32 batched inference for the AMD Strix Halo APU. 24× faster than day 1. Beats FLM Kraken Point. Open source. Zero Python.

Verified on-device: 97 tok/s · 10 ms/tok · M=32 · OpenMP attn · 2× FLM · 24× speedup · Beats FLM Kraken Point
npu_engine_v12
Effective decode latency: 10 ms/tok (97 tok/s)
Batch step: 6 ms/tok · 24× vs v3

$ OMP_NUM_THREADS=16 ./npu_engine_v12 128
[0] boot=30390 (155ms)
[1] batch=32 tok=103246 178ms (6 ms/tok)
[33] batch=32 tok=75805 183ms (6 ms/tok)
=== 7.3 ms/tok (137 tok/s) ===

XDNA 2 · 32 tiles · C++23 · M=32 · OpenMP · 2× FLM
The Silicon

One APU. Two accelerators AMD half-shipped.

The NPU AMD shipped locked. The GPU was already open. 1bit.systems drives both from one mini PC, with no cloud in the loop.

NPU · XDNA 2 · unlocked
32 compute tiles

50 TOPS of INT8 that AMD shipped disabled on consumer silicon. Driven directly through XRT and verified at 10 ms/tok (97 tok/s). Beats FLM Kraken Point.

GPU · RDNA 3.5 · open
40 compute units

Vulkan 1.3 compute. GLSL shaders → SPIR-V, GGUF native. 37k tok/s throughput, 281 tok/s 1-bit.

The Engines

Five engines. Zero Python. One person.

Hand-written, dependency-free, benchmarked on the metal. Compile a binary; run it offline.

NPU v12 · XDNA 2 · 32 tiles

INT8 M=32 Engine

10 ms/tok (97 tok/s)

C++23. M=32 batched decode. OpenMP attention + LM head. 24× vs v3. Beats FLM Kraken Point.

M=32 · 6 ms/step · XRT direct
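
To illustrate what "OpenMP attention + LM head" with M=32 batching can look like on the CPU side, here is a minimal sketch of a batched LM head. It is not taken from npu_engine_v12: the function name `lm_head_argmax`, the row-major float layout, and the greedy argmax are all assumptions made for the example. The point it shows is that each vocabulary row is loaded once and reused for every hidden state in the batch, and that rows are independent, so they can be split across threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical batched LM head (illustrative, not the engine's code).
// hidden: [batch x dim], weight: [vocab x dim], both row-major floats.
// Returns the greedy (argmax) token id for each row of the batch.
std::vector<int> lm_head_argmax(const std::vector<float>& hidden,
                                const std::vector<float>& weight,
                                std::size_t batch, std::size_t dim,
                                std::size_t vocab) {
    assert(hidden.size() == batch * dim);
    assert(weight.size() == vocab * dim);
    assert(vocab > 0);

    std::vector<float> logits(batch * vocab);

    // Vocabulary rows are independent, so they are divided among threads.
    // Built without -fopenmp, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (long v = 0; v < static_cast<long>(vocab); ++v) {
        const float* w = &weight[static_cast<std::size_t>(v) * dim];
        // The same weight row is reused for every hidden state in the batch.
        for (std::size_t b = 0; b < batch; ++b) {
            const float* h = &hidden[b * dim];
            float acc = 0.0f;
            for (std::size_t d = 0; d < dim; ++d) acc += w[d] * h[d];
            logits[b * vocab + static_cast<std::size_t>(v)] = acc;
        }
    }

    // Greedy pick per batch row; ties keep the lowest token id.
    std::vector<int> next(batch);
    for (std::size_t b = 0; b < batch; ++b) {
        std::size_t best = 0;
        for (std::size_t v = 1; v < vocab; ++v)
            if (logits[b * vocab + v] > logits[b * vocab + best]) best = v;
        next[b] = static_cast<int>(best);
    }
    return next;
}
```

Calling it with batch = 32 would correspond to one M=32 step. A real engine would more likely work on quantized weights and fuse the argmax into the parallel loop; this sketch keeps the two phases separate for readability.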
1-bit · RDNA 3.5 · 40 CUs

Bonsai 1.7B

281 tok/s

IQ1_S · 385 MB. Vulkan backend. pi-agent patched llama.cpp.

385 MB model · Vulkan · Strix Halo
GPU · Vulkan 1.3 · 40 CUs

ZINC Engine

27 µs / decode

Zig. GLSL compute → SPIR-V. Qwen3.5-9B Q4_K. GGUF native. 7.8× rocBLAS.

CUDA · Metal · Vulkan · MSL
MLX · Apple + XDNA 2

MLX NPU Backend

dual silicon

Apple MLX fork with an IRON XDNA 2 backend. The same framework runs on M1–M5 and on your NPU.

Apple Silicon · XDNA 2 · one framework
Measured on-device

The numbers, verified.

Peak throughput: 55.7 TFLOPS · INT8 · XDNA 2 NPU
1-bit decode: 281 tok/s · Bonsai 1.7B · IQ1_S
NPU decode (v12): 10 ms/tok · M=32 · 97 tok/s · 2× FLM
NPU speedup: 24× · v3→v12 · one session
Python deps: 0 · pip-free · offline
Engines: 5 · NPU · 1-bit · GPU · MLX · ZINC
Dispatch Profile

Where the 222 ms went. And how we fixed it.

Microsecond-accurate profiling revealed that 99% of XRT dispatch time is kernel launch overhead, not compute. 112 dispatches per token × 1,346 μs each ≈ 151 ms; the per-GEMM profile measures 156.8 ms.

XRT dispatch: 1,346 μs/call · 99% is launch + wait
Actual GEMM: 0.5–5 μs · overhead ratio: 2000×
CPU attention: 26 μs/layer · <1% of total
LM head (OpenMP): 6 ms · was 67 ms · 11× faster
Fix (M=32 batch): 24× · amortizes NPU compute
vs FLM: 2× · beats Kraken Point
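
The arithmetic behind the profile can be sketched as a small cost model. The constants are the figures quoted above (112 dispatches, 1,346 μs per dispatch); the helper names `step_overhead_ms` and `per_token_ms` are hypothetical, and the model assumes that a batched step issues the same number of dispatches as a single-token step, which the page does not state explicitly.

```cpp
#include <cassert>

// Figures quoted on this page. The cost model around them is an illustrative
// assumption, not code taken from npu_engine_v12.
constexpr double kDispatchesPerStep = 112.0;  // XRT dispatches per decode step
constexpr double kDispatchUs = 1346.0;        // measured cost per dispatch, in us

// Launch overhead of one decode step, in milliseconds.
inline double step_overhead_ms() {
    return kDispatchesPerStep * kDispatchUs / 1000.0;
}

// Per-token share of a step's cost when the step decodes `batch` tokens.
// Assumes the step cost does not grow with the batch size.
inline double per_token_ms(double step_ms, int batch) {
    assert(batch > 0);
    return step_ms / batch;
}
```

Under these assumptions `step_overhead_ms()` comes to roughly 150.8 ms, which a single-token step pays in full, while `per_token_ms(step_overhead_ms(), 32)` is roughly 4.7 ms. Applying the same division to a measured 178 ms batch step, `per_token_ms(178.0, 32)`, gives about 5.6 ms, in line with the "6 ms/tok" the engine prints per batch step.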
Proof

One binary. Run it yourself.

No wrapper, no scheduler, no Python. Compile, point it at the NPU, and read the latency straight off the metal.

strix-halo:~/1bit-systems
# NPU v12: M=32 batch decode, 10 ms/tok effective
$ g++ -std=c++23 -O3 -march=native -fopenmp \
    -o npu_engine_v12 engine/npu/src/npu_engine_v12.cpp \
    engine/npu/build/dequant_q4nx.o -lxrt_coreutil -luuid -lm
$ OMP_NUM_THREADS=16 ./npu_engine_v12 128
[0] boot=30390 (155ms)
[1] batch=32 tok=103246 178ms (6 ms/tok)
[33] batch=32 tok=75805 183ms (6 ms/tok)
=== 7.3 ms/tok (137 tok/s) ===

# Profile: per-GEMM breakdown
$ ./npu_engine_profile
QKV: 1455μs O: 1162μs GU: 1713μs D: 1268μs
NPU GEMM: 156.8ms (70%) LM head: 67ms (30%) CPU: 0.7ms (<1%)

# 1-bit GPU: 281 tok/s, 385 MB model
$ ./bonsai --model Bonsai-1.7B-IQ1_S.gguf --gpu-layers 99
281 tok/s · 385 MB · Vulkan · Strix Halo
All numbers verified on-device — no cloud, no simulator.
v12 engine running at 10 ms/tok effective, 6 ms/tok per batch step.
24× speedup in one session — 244→10 ms/tok. Beats FLM Kraken Point.
1-bit models benchmarked at 281 tok/s on Vulkan.
Full audit trail in docs/journey.md.

Build it on your own Strix Halo.

Clone the repo, follow the build guide, and reproduce every number on this page.