Pure Rust · Production Ready

Blazing-fast LLM Inference in Pure Rust

No PyTorch. No Python runtime. Just fast, portable, production-ready inference.

terminal
$ curl -sSL https://guoqingbao.github.io/xinfer/install.sh | bash
npm install -g xinfer-ai

Features

Built for Speed & Simplicity

Everything you need for production LLM inference — without the Python baggage.

🦀

Zero Dependencies

Pure Rust backend — no PyTorch, no CUDA Python bindings, no Python runtime.

Blazing Fast

Flash Attention, FlashInfer, CUDA Graphs, continuous batching. Up to 175+ tok/s.

🪶

Tiny Footprint

Core scheduling & attention logic in under 5,000 lines of Rust.

🌍

Cross-Platform

CUDA on Linux, Metal on macOS. Same binary, same API everywhere.

🏭

Production Ready

OpenAI & Anthropic APIs, built-in Web UI, MCP tool calling, structured outputs.

🗜️

KV Compression

TurboQuant 2–4 bit KV cache extends context up to 4.3× with minimal quality loss.

Performance

Real-World Benchmarks

Tested on V100, A100, Hopper H800, and RTX 5090.

ModelFormatSizeHardwareSpeedNote
Qwen3-30B-A3BNVFP430B MoERTX 5090 (SM120)0 tok/sHW NVFP4
Gemma4-26B-A4BNVFP426B MoERTX 5090 (SM120)0 tok/sHW NVFP4
Qwen3.6-35B-A3BFP835B MoEH800 (SM90)0 tok/sHW FP8
DeepSeek-R1-Qwen3-8BQ4_K_M8BA100 (SM80)0 tok/sGGUF
Llama-3.1-8BISQ Q4K8BA100 (SM80)0 tok/sSW quant
MiniMax-M2.5NVFP4229B MoEH800 ×2 (SM90)0 tok/sSW NVFP4 (no HW)
Qwen3-30B-A3BNVFP430B MoEV100 (SM70)0 tok/sSW FP4
Qwen3.6-27BFP827B DenseH800 (SM90)0 tok/sHW FP8

* HW = hardware-accelerated. Hopper (SM90) supports HW FP8 but not HW NVFP4; Blackwell (SM120) supports both. NVFP4 on Hopper uses software emulation.

Models

Supported Models & Formats

From 3B to 397B — dense, MoE, and multimodal architectures.

LLaMA 4Qwen3Qwen3.5Qwen3.6 DeepSeek V3DeepSeek R1Gemma 3Gemma 4 GLM4GLM4.7 FlashPhi4Mistral 3 VLMiniMax M2.5

Supported Formats

SafetensorsGGUFFP8NVFP4 MXFP4GPTQAWQISQ

Quick Start

Install & Run

Get up and running in minutes.

bash
curl -sSL https://guoqingbao.github.io/xinfer/install.sh | bash
bash
npm install -g xinfer-ai
bash
# Prerequisites: Rust, CUDA Toolkit (or Metal Xcode CLI)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install -y git build-essential libssl-dev pkg-config

export XINFER_REPO="https://github.com/guoqingbao/xinfer"
# macOS/Metal: replace features with `metal`
# SM70/SM75 (V100): remove `flashinfer` and `cutlass`
cargo install --git $XINFER_REPO xinfer --features cuda,nccl,flashinfer,cutlass
bash
# Build Python wheel from source
pip install maturin maturin[patchelf]

# FlashInfer backend (SM80+)
./build.sh --release --features cuda,nccl,flashinfer,cutlass,python

# macOS Metal
maturin build --release --features metal,python

# Install the wheel
pip install target/wheels/xinfer*.whl --force-reinstall
bash
# SM70/SM75: remove `flashinfer` and `cutlass`
./build_docker.sh "cuda,nccl,flashinfer,cutlass"

Run Examples

bash
# HuggingFace model with Web UI
xinfer --m Qwen/Qwen3.6-27B-FP8 --kvcache-dtype turbo4 --ui-server

# Multi-GPU with local model
xinfer --w /path/to/model --d 0,1 --ui-server

# Python mode
python3 -m xinfer.server --m Qwen/Qwen3.6-27B-FP8 --ui-server

Downloads

Pre-built Packages

Pre-compiled binaries and pip wheels for every GPU architecture.

Binary Downloads

Pre-built Python Wheels (pip)

KV Cache

TurboQuant Compression

Extend context length up to 4.3× with --kvcache-dtype.

Mode (--kvcache-dtype)CompressionQualityGPU
default (BF16)BaselineAll
fp8Near-losslessSM70+ / M1
turbo82.6×High qualitySM70+ / M1
turbo43.7×Best balanceSM70+ / M1
turbo34.7×Max compressionSM70+

Documentation

Learn More

Guides, API references, and integration tutorials.