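openinfer (listed here as "Open-Source Inference Engine") is one of the AI tools featured by AI Skill Hub in this issue. It has an overall score of 8.0, which indicates high overall quality. We strongly recommend adding it to your AI toolkit to improve productivity.
An LLM inference engine built on Rust and CUDA, with an OpenAI-compatible API.
openinfer is an open-source tool written in Rust, focused on CUDA, GPU, and inference workloads. As an open-source project on GitHub, it has active community support and continued releases. The code is fully transparent and auditable, and it can be deployed locally to protect data privacy. It offers a stable, reliable option for both personal use and integration into enterprise workflows.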
```bash
# Option 1: cargo install (recommended)
cargo install openinfer

# Option 2: build from source
git clone https://github.com/openinfer-project/openinfer
cd openinfer
cargo build --release
# The binary is at ./target/release/openinfer
```

```bash
# Show help
openinfer --help

# Basic usage
openinfer [options] <input>

# For detailed usage, see the documentation:
# https://github.com/openinfer-project/openinfer
```

```bash
# openinfer configuration
# Print the available configuration options
openinfer --config-example > config.yml

# Common settings
# output_dir: ./output
# log_level: info
# workers: 4

# Environment variable (overrides the config file)
export OPENINFER_CONFIG="/path/to/config.yml"
```
<p align="center"> <img src="logo.png" width="200" alt="openinfer logo"> </p>
<p align="center"> Pure Rust + CUDA LLM inference engine. No PyTorch. No model framework runtime. </p>
<p align="center"> <a href="https://open-infer.org/"> <img src="https://img.shields.io/badge/Docs%20%26%20Blog-open--infer.org-2ea44f" alt="Docs & Blog at open-infer.org"> </a> <a href="https://join.slack.com/t/openinferhq/shared_invite/zt-41scnc53a-d0McNJDjK2lVqFGoSLUgXA"> <img src="https://img.shields.io/badge/Slack-join%20the%20community-4A154B?logo=slack&logoColor=white" alt="Join the openinfer Slack"> </a> </p>
<p align="center"> <a href="#quickstart">Quickstart</a> · <a href="#supported-models">Models</a> · <a href="#api">API</a> · <a href="#performance">Performance</a> · <a href="#architecture">Architecture</a> · <a href="https://open-infer.org/blog/">Blog</a> </p>
---
openinfer is an LLM inference engine built entirely in Rust and CUDA — no PyTorch, no ONNX, no framework runtime, every kernel and scheduler hand-written.
It serves frontier-scale models, from Qwen3 to the trillion-parameter Kimi-K2, and already holds its own against the best open-source inference frameworks.
Docs, guides, and engineering deep-dives live at open-infer.org. Good starting points are "OpenInfer 0.1.0: Writing a Production-Grade Inference Engine in Rust" and "Co-locating Prefill and Decode on One GPU".
```powershell
uv venv .venv --python 3.12
uv pip install "triton-windows<3.7"
$env:OPENINFER_TRITON_PYTHON = ".venv\Scripts\python.exe"
cargo run --release --features qwen35-4b -- --model-path models/Qwen3.5-4B
```
</details>
- `cuda-12090` cudarc feature does not raise the driver floor
- `qwen35-4b` feature builds (build-time only, no Python at runtime)
- `deepseek-v4` feature builds (build-time only)
- `deepseek-v4` / `kimi-k2` EP paths additionally need NCCL ≥ 2.27 at runtime (`ncclAlltoAll`)

```bash
uv venv && uv pip install triton
export OPENINFER_TRITON_PYTHON=.venv/bin/python
cargo run --release --features qwen35-4b -- --model-path models/Qwen3.5-4B
```

```bash
uv pip install "tilelang==0.1.9"
export OPENINFER_TILELANG_PYTHON=.venv/bin/python
cargo run --release --features deepseek-v4 -- --model-path models/DeepSeek-V4-Flash
```

```bash
export CUDA_HOME=/usr/local/cuda
cargo run --release
```
> **Note**: The server CLI is in `openinfer-server`. Model crates such as `openinfer-qwen3`, `openinfer-qwen35-4b`, and `openinfer-deepseek-v4` contain model logic and diagnostics but are not server entrypoints. Use `cargo run --release` from the workspace root, or `cargo run --release -p openinfer-server -- --model-path <path>`.
```bash
cargo build --release
cargo run --release -p openinfer-server -- --model-path models/Qwen3-4B
```
scripts/setup_dev.sh bootstraps a build environment on any fresh NVIDIA Ubuntu host: apt build deps + protobuf-compiler, uv, the rustup nightly pinned by rust-toolchain.toml, the vendored flashinfer/3rdparty/cccl submodule, then cargo build --release. CUDA is a prerequisite — it detects nvcc and fails loudly rather than installing a toolkit, so boot a CUDA image.
```bash
# Run with the script's defaults
bash scripts/setup_dev.sh

# Run with OPENINFER_CUDA_SM set to 90
OPENINFER_CUDA_SM=90 bash scripts/setup_dev.sh
```
To get the box itself, scripts/prime_devbox.sh provisions the cheapest match on Prime Intellect, has the box git-clone this repo over HTTPS, and runs setup_dev.sh — see the script header for one-time setup.
OpenAI-compatible /v1/completions endpoint.
| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | (required) | Input text |
| `max_tokens` | int | 128 | Maximum tokens to generate |
| `temperature` | float | 0.0 | Sampling temperature (0 = greedy) |
| `top_k` | int | 50 | Top-k sampling |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `stream` | bool | false | Enable SSE streaming |
Sampling and logprob support is model-dependent. Qwen models support the sampling controls above; the initial DeepSeek V4 path accepts greedy requests only and reports unsupported parameters through stop_reason.
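As a rough illustration of the request fields above, the following Python sketch builds a `/v1/completions` body with the documented defaults and posts it using only the standard library. The base URL `http://localhost:8000` is an assumption (the source does not state a host or port), the helper names are hypothetical, and the response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.request

# Defaults copied from the request-field table; `prompt` has no default.
DEFAULTS = {
    "max_tokens": 128,
    "temperature": 0.0,  # 0 = greedy
    "top_k": 50,
    "top_p": 1.0,
    "stream": False,
}


def build_completion_request(prompt: str, **overrides) -> dict:
    """Return a request body; this sketch only knows the fields in the table."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"fields not covered by this sketch: {sorted(unknown)}")
    return {"prompt": prompt, **DEFAULTS, **overrides}


def post_completion(body: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST a non-streaming request. `base_url` is an assumed address."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Build (but do not send) a greedy request capped at 32 tokens.
    body = build_completion_request("Hello", max_tokens=32)
    print(json.dumps(body, indent=2))
```

Keeping the defaults in one dictionary makes it easy to see which sampling controls a request actually changes, which matters for model paths that accept greedy requests only.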
| Model | Architecture | Params | Status |
|---|---|---|---|
| [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) | Full attention (GQA) | 4B | Greedy + sampling, default feature, pure Rust + CUDA build |
| [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | Full attention (GQA) | 8B | Greedy + sampling, default feature, pure Rust + CUDA build |
| [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) | Hybrid (24 linear + 8 full attention) | 4B | Greedy + sampling, feature-gated, --features qwen35-4b (build-time Triton) |
| [DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) | MoE + EP | 15.7B total / 2.4B active | Feature-gated, --features deepseek-v2-lite, 2-GPU path |
| [DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) | MoE + sparse attention (compressor + indexer), MP8 checkpoint | 671B total / 37B active | Initial greedy, feature-gated, 8-GPU MP8 |
| [Kimi-K2-Instruct](https://huggingface.co/moonshotai/Kimi-K2-Instruct) | MLA + MoE + Marlin INT4 | 1T total / 32B active | Feature-gated, --features kimi-k2, 8-GPU EP path |
Model type is auto-detected from config.json — just point --model-path at any supported model directory. Every model line is controlled by a cargo feature; only qwen3 is on by default, so the stock build serves Qwen3 with zero Python. Other lines require rebuilding openinfer-server with the matching --features ... flag before launch.
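The feature-flag rule above can be sketched as a small Python helper that assembles the `cargo run` invocation for a model directory. The feature names are taken from the supported-models table; the helper itself is hypothetical, and multi-GPU model lines may need extra flags or environment setup that this sketch does not cover.

```python
import shlex
from typing import List, Optional

# Cargo feature names as they appear in the supported-models table.
# Qwen3 is served by the default build, so it needs no --features flag.
FEATURES = {"qwen35-4b", "deepseek-v2-lite", "deepseek-v4", "kimi-k2"}


def launch_command(model_path: str, feature: Optional[str] = None) -> List[str]:
    """Build the argv for `cargo run`, adding --features only when requested."""
    cmd = ["cargo", "run", "--release"]
    if feature is not None:
        if feature not in FEATURES:
            raise ValueError(f"unknown feature: {feature!r}")
        cmd += ["--features", feature]
    # Everything after `--` is passed to the server binary, not to cargo.
    return cmd + ["--", "--model-path", model_path]


if __name__ == "__main__":
    print(shlex.join(launch_command("models/Qwen3-4B")))
    print(shlex.join(launch_command("models/Qwen3.5-4B", "qwen35-4b")))
    # cargo run --release -- --model-path models/Qwen3-4B
    # cargo run --release --features qwen35-4b -- --model-path models/Qwen3.5-4B
```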
DeepSeek V4 support is intentionally narrower than the Qwen paths in the initial PR: it requires --features deepseek-v4, uses CUDA devices 0..7, serves greedy requests only, terminates unsupported logprobs and non-greedy sampling requests with an explicit stop_reason, and does not use CUDA Graph yet.
```bash
OPENINFER_TEST_MODEL_PATH=models/Qwen3-4B cargo test --release -p openinfer-qwen3 --test hf_golden_gate
OPENINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test hf_golden_gate
OPENINFER_TEST_MODEL_PATH=models/Qwen3.5-4B cargo test --release -p openinfer-qwen35-4b --features qwen35-4b --test e2e_scheduler
OPENINFER_TEST_MODEL_PATH=models/DeepSeek-V4-Flash cargo test --release -p openinfer-deepseek-v4 --features deepseek-v4 --test mp8_manifest
```
Single RTX 5090 (32 GB), Qwen3.5-4B, BF16, TP1 — openinfer with the Qwen3.5 decode-tuning change, vLLM 0.23.0, both driven by vllm bench serve 0.23.0. Fixed random prompts, 64 measured requests, 2 warmups, text-only serving with prefix cache off on both engines. Full flags and caveats are in the Qwen3.5 benchmark report.
| Workload | Metric | openinfer | vLLM 0.23.0 |
|---|---|---|---|
| 1 input / 256 output | TPOT mean | 6.282 ms | **6.214 ms** |
| 1 input / 512 output | TPOT mean | 6.381 ms | **6.221 ms** |
| 1024 input / 256 output | reported input tokens | 63,459 (992/request) | 65,536 (1,024/request) |
| 1024 input / 256 output | TTFT mean (client-contract) | 55.3 ms | 66.3 ms |
| 1024 input / 256 output | TPOT mean | 7.110 ms | **6.346 ms** |
| 1024 input / 256 output | output tok/s | 137.0 | **151.9** |
| 2048 input / 1 output | reported input tokens | 126,957 (1,984/request) | 131,072 (2,048/request) |
| 2048 input / 1 output | TTFT mean (client-contract) | 97.4 ms | 101.9 ms |
The decode-tuning change improves openinfer's own direct Qwen3.5 decode TPOT by about 2-3%. Against vLLM, prompt-len-1 decode is close, but vLLM still leads the 1024/256 decode and high-concurrency HTTP rows. TTFT rows are fixed-client timings because reported prompt-token totals differ on the longer prompts.
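To make the gaps in the table concrete, the short Python sketch below recomputes the relative TPOT differences and the per-request prompt-token averages from the published numbers. All values are copied from the table and the benchmark setup (64 measured requests); the function name is hypothetical.

```python
MEASURED_REQUESTS = 64  # from the benchmark setup

# TPOT mean in milliseconds: (openinfer, vLLM 0.23.0), copied from the table.
TPOT_MS = {
    "1 input / 256 output": (6.282, 6.214),
    "1 input / 512 output": (6.381, 6.221),
    "1024 input / 256 output": (7.110, 6.346),
}

# Reported input tokens summed over all measured requests, openinfer column.
REPORTED_INPUT_TOKENS = {
    "1024 input / 256 output": 63_459,
    "2048 input / 1 output": 126_957,
}


def relative_gap(a: float, b: float) -> float:
    """Fraction by which `a` exceeds `b`."""
    return (a - b) / b


if __name__ == "__main__":
    for workload, (ours, vllm) in TPOT_MS.items():
        print(f"{workload}: openinfer TPOT is {relative_gap(ours, vllm):.1%} higher")
    # 1 input / 256 output: openinfer TPOT is 1.1% higher
    # 1 input / 512 output: openinfer TPOT is 2.6% higher
    # 1024 input / 256 output: openinfer TPOT is 12.0% higher

    for workload, total in REPORTED_INPUT_TOKENS.items():
        print(f"{workload}: {total / MEASURED_REQUESTS:.1f} prompt tokens per request")
    # 1024 input / 256 output: 991.5 prompt tokens per request
    # 2048 input / 1 output: 1983.7 prompt tokens per request
```

The per-request averages fall about 32 and 64 tokens short of the nominal 1,024 and 2,048 prompt lengths, which is the token-count mismatch behind the caveat on the TTFT rows.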
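High-performance LLM inference engine, OpenAI-compatible.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute a legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out any necessary security assessment.
✅ Apache 2.0: a permissive open-source license that allows commercial use, requires retaining the copyright notice and NOTICE file, and includes a patent grant.
On balance, openinfer performs solidly among AI tools and is of high quality. If you already have a clear use case, you can start using it right away. If you are still evaluating, compare it with similar tools before deciding.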
| Field | Value |
|---|---|
| Original name | openinfer |
| Topics | cuda, gpu, inference, rust |
| GitHub | https://github.com/openinfer-project/openinfer |
| License | Apache-2.0 |
| Language | Rust |
Listed: 2026-07-05 · Updated: 2026-07-05 · License: Apache-2.0 · AI Skill Hub provides no legal endorsement of the accuracy of third-party content.