经 AI Skill Hub 精选评估,imp AI技能包 获评「强烈推荐」。这款AI工具在功能完整性、社区活跃度和易用性方面表现出色,AI 评分 8.0 分,适合有一定技术背景的用户使用。
高性能LLM推理引擎,支持NVIDIA Blackwell GeForce
imp AI技能包 是一款基于 Cuda 开发的开源工具,专注于 cuda、cpp、cuda-graphs 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
高性能LLM推理引擎,支持NVIDIA Blackwell GeForce
imp AI技能包 是一款基于 Cuda 开发的开源工具,专注于 cuda、cpp、cuda-graphs 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 克隆仓库 git clone https://github.com/kekzl/imp cd imp # 查看安装说明 cat README.md # 按 README 完成环境依赖安装后即可使用
# 查看帮助 imp --help # 基本运行 imp [options] <input> # 详细使用说明请查阅文档 # https://github.com/kekzl/imp
# imp 配置说明 # 查看配置选项 imp --config-example > config.yml # 常见配置项 # output_dir: ./output # log_level: info # workers: 4 # 环境变量(覆盖配置文件) export IMP_CONFIG="/path/to/config.yml"
<p align="center"> <img src="docs/logo.svg" alt="imp" width="500"> </p>
<p align="center"> From-scratch CUDA inference engine for the NVIDIA RTX 5090 (<code>sm_120a</code>).<br> The best single-GPU backend for <b>agentic AI</b> — tool calling, long-context loops, reasoning and concurrent sub-agents,<br> on top of the fastest single-user inference on the 5090: faster decode than llama.cpp (+37–72% dense GGUF),<br> at-or-ahead of vLLM on NVFP4, and the only engine running native NVFP4 on consumer Blackwell.<br> ~97k lines, 100% written by <a href="https://claude.ai/claude-code">Claude Code</a>. </p>
<p align="center"> <a href="LICENSE"><img src="https://img.shields.io/github/license/kekzl/imp?style=flat&color=blue" alt="License"></a> <img src="https://img.shields.io/badge/CUDA-13.3-76b900?style=flat&logo=nvidia" alt="CUDA 13.3"> <img src="https://img.shields.io/badge/C++-20-00599C?style=flat&logo=cplusplus" alt="C++20"> <img src="https://img.shields.io/badge/status-experimental-orange?style=flat" alt="Status: experimental"> </p>
---
| **Quantization** | GGUF Q4_K_M/Q5_K_M/Q6_K/Q8_0 + IQ4_NL/IQ4_XS, SafeTensors NVFP4 (prequant), MXFP4. NVFP4 KV cache (--kv-nvfp4) for 4× context compression at decode parity. |
| **LoRA** | PEFT adapter hot-swap (--lora name=path, per-request "lora" field) — runtime low-rank deltas, no weight patching, works with every quant path. |
| **Attention** | Prefill: FP16 cuBLAS below the auto fmha_prefill_threshold (the largest chunk whose S-matrix fits, ~2.5k tokens), then the FMHA family above it — an mma.sync m16n8k32 FP8-E4M3 score kernel and a register-resident FlashAttention-2 kernel (head_dim 128). Decode: paged attention (block_size 16) switching on KV dtype (FP16/FP8/INT8/INT4/NVFP4/MXFP4). Auto-dispatch per phase × dtype × layer — see [docs/attention-dispatch.md](docs/attention-dispatch.md). |
| **Architectures** | Dense transformers, Mixture-of-Experts (top-k grouped GEMM), Multi-head Latent Attention (DeepSeek-V2; materialized + opt-in absorbed latent-KV-cache decode), Gated DeltaNet (fused recurrent scan), Mamba2 (SSM), SigLIP/Gemma-4v vision encoders. |
**sm_120a kernels** | NVFP4 block-scaled mma.sync mxf4nvf4 GEMM/GEMV (CUTLASS v4.5.2), FP8 f8f6f4 attention scores, FA2 block-scaling, packed cvt.e2m1/cvt.e4m3x2 dequant, PDL, Green Contexts. **No** tcgen05/TMEM/wgmma/TMA-WS — those are datacenter Blackwell only. |
| **Server** | OpenAI /v1/chat/completions + /v1/responses (Responses API — Agents SDK / Codex dialect, native SSE events with incremental tool-call argument deltas) + /v1/completions + /v1/embeddings + /tokenize + /detokenize; Anthropic /v1/messages with real per-token SSE streaming (ping keepalives), cache_control prompt caching (prefix-cache pinning + cache_read/cache_creation_input_tokens usage reporting; prefix cache default-on since #538, model-fingerprint-gated on disk so a cache file from a different model is never replayed); Prometheus /metrics with TTFT / inter-token-latency histograms and cancellation counters. Tool/function calling, json_object + json_schema constrained decoding (whole-token validated), reasoning_content separation (DeepSeek format) + think budget. Per-request speculative-decode override ("speculative": true/false) and an opt-in deterministic mode (--set runtime.deterministic=true, ordered MoE reduction). Client-disconnect cancellation reclaims the slot + KV within one scheduler tick. Strict single-model semantics (/v1/models lists only the loaded model; foreign names get a 404, no auto-swap). API-key auth, rate limiting, JSONL request logging. Agent-shaped load harness in tools/agent_bench.py (TTFT/ITL p50/p99 under concurrency, warm-vs-cold cache). |
| **Runtime** | CUDA Graphs (auto per model), imp.conf + CLI config, Jinja2 chat templates with macro support. degen_suite.py is the coherence quality-gate after hot-path changes. |
```bash
make build # → imp:test image make verify-fast # build + tests + perf gate (~90s) ```
Full build options and test commands: docs/usage.md. Contributing: CONTRIBUTING.md.
Everything runs in Docker — no local CUDA toolkit needed. Prebuilt images are on GHCR (built per release for x86-64 + sm_120a):
```bash
| Family | Variants | Quantizations |
|---|---|---|
| Qwen3 / Qwen3-MoE | dense + MoE (Coder-30B-A3B) | Q4_K_M, Q6_K, Q8_0, NVFP4 |
| Qwen3.5 / Qwen3.6 | GDN + attention (+ MoE) | Q4_K_M, Q8_0, NVFP4 |
| Gemma-4 | 26B-A4B MoE, 31B dense | Q4_K_M, Q5_K_M, Q8_0, NVFP4 |
| Gemma-3 | text + vision (SigLIP) | GGUF |
| Phi-4-reasoning-plus | dense, fused projections | NVFP4 |
| gpt-oss-20b | MoE (32 experts, top-4), Harmony | SafeTensors MXFP4 (native) |
| Nemotron-H | Mamba2 + Attention + MoE | NVFP4, GGUF |
| Llama / Mistral / DeepSeek | dense + MoE | GGUF (Q*_K, Q8_0) |
| DeepSeek-V2 (MLA) | Multi-head Latent Attention + MoE | SafeTensors (bf16) |
VRAM, decode tok/s, and per-model notes: docs/supported-models.md.
高性能LLM推理引擎,支持NVIDIA Blackwell GeForce
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ MIT 协议 — 最宽松的开源协议之一,可自由商用、修改、分发,仅需保留版权声明。
AI Skill Hub 点评:imp AI技能包 的核心功能完整,质量优秀。对于AI 技术爱好者来说,这是一个值得纳入个人工具库的选择。建议先在非生产环境试用,再逐步推广。
| 原始名称 | imp |
| 原始描述 | 开源AI工具:High-performance LLM inference engine in C++/CUDA for NVIDIA Blackwell GeForce /。⭐17 · Cuda |
| Topics | cudacppcuda-graphsgated-deltanet |
| GitHub | https://github.com/kekzl/imp |
| License | MIT |
| 语言 | Cuda |
收录时间:2026-05-16 · 更新时间:2026-05-19 · License:MIT · AI Skill Hub 不对第三方内容的准确性作法律背书。