📄 工具详情 ⚙️ 安装教程 📚 使用教程

🛠

AI工具

FlashRT

Q: FlashRT 如何安装和开始使用？

访问 FlashRT 的 GitHub 仓库或官方网站，按照 README 文档中的步骤安装依赖并运行。通常需要 Python 3.8+ 或 Node.js 16+ 基础环境。

Q: FlashRT 是否免费？许可证是什么？

FlashRT 完全免费，采用 Apache-2.0 许可证开源发布，任何人都可以免费使用、修改和分发。

Q: FlashRT 适合哪些用户使用？

FlashRT 主要面向有一定技术基础的用户，包括开发者、数据分析师、AI 工程师等专业人士。

Q: FlashRT 的社区活跃度和项目维护状况如何？

FlashRT 在 GitHub 上已获得 182 个 Star，处于积极发展阶段，社区在持续扩大。

基于 C++ · 开源免费，本地部署，数据完全自主可控

⭐ 182 Stars 🍴 26 Forks 💻 C++ 📄 Apache-2.0 🏷 AI 6.8分

6.8AI 综合评分

installablecudacuda-kernelsgr00tgr00t-n1-6-3bjetsonc++

✦ AI Skill Hub 推荐

FlashRT 是 AI Skill Hub 本期精选AI工具之一。综合评分 6.8 分，整体质量稳定。我们推荐使用将其纳入你的 AI 工具库，帮助提升工作效率。

📚 深度解析

FlashRT 是一款基于 C++ 的开源工具，在 GitHub 上收获 0k+ Star，是installable、cuda、cuda-kernels、gr00t领域中的优质开源项目。开源工具的最大优势在于代码完全透明，你可以审计每一行代码的安全性，也可以根据自身需求进行二次开发和定制。

**为什么要使用开源工具而非商业 SaaS？**
对于个人开发者和有隐私需求的用户，本地部署的开源工具意味着数据不离本机，不受第三方服务商的数据政策约束。同时，开源工具通常没有使用次数限制和月度费用，一次安装即可长期使用，对于高频使用场景的总拥有成本（TCO）远低于订阅制商业工具。

**安装与环境准备**
FlashRT 依赖 C++ 运行环境。建议通过 pyenv（Python）或 nvm（Node.js）管理 C++ 版本，避免全局环境污染。对于新手用户，推荐先创建虚拟环境（python -m venv venv && source venv/bin/activate），再安装依赖，这样即使出现问题也可以随时删除虚拟环境重新开始，不影响系统稳定性。

**社区与维护**
GitHub Issue 和 Discussion 是获取帮助的最快渠道。在提问前建议先检查 Closed Issues（已关闭的问题），大多数常见问题都已有解答。遇到 Bug 时，提供 pip list 的输出、完整错误堆栈和最小可复现示例，能显著提高开发者响应速度。AI Skill Hub 将持续追踪 FlashRT 的版本更新，及时通知重要功能变化。

📋 工具概览

FlashRT 是一款基于 C++ 开发的开源工具，专注于 installable、cuda、cuda-kernels 等核心功能。作为 GitHub 开源项目，它拥有活跃的社区支持和持续的版本迭代，代码完全透明可审计，支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流，都能提供稳定可靠的解决方案。

GitHub Stars

⭐ 182

开发语言

C++

支持平台

Windows / macOS / Linux

维护状态

轻量级项目，按需更新

开源协议

Apache-2.0

AI 综合评分

6.8 分

工具类型

AI工具

Forks

📖 中文文档

以下内容由 AI Skill Hub 根据项目信息自动整理，如需查看完整原始文档请访问底部「原始来源」。

📌 核心特色

开源免费，支持本地部署，数据完全自主可控
活跃的 GitHub 开源社区，持续迭代更新
提供详细文档和使用示例，新手友好
支持自定义配置，灵活适配不同使用环境
可作为基础组件集成进现有技术栈或进行二次开发

🎯 主要使用场景

本地部署运行，保护数据隐私，满足合规要求
自定义集成到现有系统，扩展技术栈能力
作为开源基础组件进行商业化二次开发

以下安装命令基于项目开发语言和类型自动生成，实际以官方 README 为准。

安装命令

# 克隆仓库
git clone https://github.com/LiangSu8899/FlashRT
cd FlashRT

# 查看安装说明
cat README.md

# 按 README 完成环境依赖安装后即可使用

📋 安装步骤说明

访问 GitHub 仓库页面
按照 README 文档完成依赖安装
根据系统环境完成初始化配置
参考官方示例或文档开始使用
遇到问题可在 GitHub Issues 中查找解答

以下用法示例由 AI Skill Hub 整理，涵盖最常见的使用场景。

常用命令 / 代码示例

# 查看帮助
flashrt --help

# 基本运行
flashrt [options] <input>

# 详细使用说明请查阅文档
# https://github.com/LiangSu8899/FlashRT

以下配置示例基于典型使用场景生成，具体参数请参照官方文档调整。

配置示例

# flashrt 配置说明
# 查看配置选项
flashrt --config-example > config.yml

# 常见配置项
# output_dir: ./output
# log_level: info
# workers: 4

# 环境变量（覆盖配置文件）
export FLASHRT_CONFIG="/path/to/config.yml"

📑 README 深度解析真实文档完整度 81/100 查看 GitHub 原文 →

以下内容由系统直接从 GitHub README 解析整理，保留代码块、表格与列表结构。

FlashRT

FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads.

A general kernel library composed into static graphs — no ONNX export, no engine compilation, no per-driver rebuild. Hand-written kernels (norm / activation / fusion / RoPE / FP8 / NVFP4 GEMM / attention) cover standard transformer, DiT, and SigLIP primitives. The composition pattern itself is hardware-agnostic; today the codebase ships with NVIDIA implementations spanning edge to server (Jetson AGX Thor through A100 / RTX 4090 / 5090).

The flagship integration today is VLA control — production frontends for Pi0, Pi0.5, GROOT N1.6, GROOT N1.7, and Pi0-FAST, validated on LIBERO where applicable. The same kernel set also powers the BAGEL world-model image-generation pipeline (research preview) and audio / video generation (4× over PyTorch). FlashRT now also serves single-stream LLM inference — the v1 release ships Qwen3.6-27B (NVFP4) with 256 K context on a single RTX 5090, an OpenAI-compatible HTTP server, and decode throughput of ~100 tok/s typical / 129 tok/s peak (real warm-state range across mixed chat / reasoning / code prompts; see Performance for the breakdown). The pattern is workload-shaped (small-batch realtime), not model-class-shaped.

Existing inference tooling is shaped for different workloads — TensorRT for tactic-search compile to frozen engines, vLLM / SGLang for high-batch LLM serving. FlashRT targets the small-batch realtime cell with hand-tuned kernels and no compile step.

Tested hardware + what's theoretically supported

Verified working on: RTX 5090, RTX 4090, RTX 5060 Ti, RTX 4060 Ti, NVIDIA L40, Jetson AGX Thor, and Jetson AGX Orin.

CMake's ENABLE_FA2 gate accepts any card in SM80 / 86 / 89 / 120 (Ampere through Blackwell consumer). That means A100, A10, RTX 3090, 3080, A5000/A6000, 4090, 4080, 4070, 4060 Ti, 5090 — all should build and run out of the box. "Theoretical" here just means the other cards haven't gone through the regression suite yet; the kernel set and dispatch paths are the same.

Getting Started

Pi0-FAST max-performance mode (for fixed-prompt 24h deployment):

model = flash_rt.load_model( checkpoint="/path/to/pi0_fast_base", config="pi0fast", decode_cuda_graph=True, # capture decode loop as CUDA Graph decode_graph_steps=46, # action tokens per inference (50 total with text prefix) )


#### Qwen3.6-27B NVFP4 (LLM, RTX 5090)

The LLM path uses a dedicated frontend — same kernel binary, separate
generation API since chat completion has a different surface from VLA
control. See [`docs/qwen36_usage.md`](docs/qwen36_usage.md) for the
full parameter reference and [`docs/qwen36_nvfp4.md`](docs/qwen36_nvfp4.md)
for the K-curve / measured throughput / model-dependency notes.

python import os import torch from flash_rt.frontends.torch.qwen36_rtx import Qwen36TorchFrontendRtx

Build & install

This is the hands-on "go from a fresh machine to a green benchmark" section. For a single-page install reference (prerequisites, troubleshooting table, JAX/transformers pin rationale) see docs/INSTALL.md.

Docker and native Linux paths both produce the same two extension modules:

Artifact	Size	What it contains
`flash_rt/flash_rt_kernels.so`	~3 MB	Hand-written memory-bound kernels (norm, activation, fusion, FP8 quant, cuBLASLt wrappers, Thor FMHA). Always built.
`flash_rt/flash_rt_fa2.so`	~135 MB	Vendored Flash-Attention 2 v2.7.4.post1 fwd (fp16 + bf16, SM80/86/89/120). Built only on RTX targets — Thor skips it and uses `fvk.attention_qkv_fp16` (cuBLAS-decomposed) for attention instead.

Crucially — no pip install flash-attn required. The FA2 kernel is vendored at source level and built into flash_rt_fa2.so during cmake/make; at runtime import flash_rt loads both .so files directly, so you never hit the flash-attn wheel's torch × CUDA × driver × glibc compatibility matrix. Setting FVK_RTX_FA2=0 is still supported as a fall-back to pip flash-attn for debugging, but the default path has zero pip-wheel dependency.

Option A — Prebuilt Docker image (fastest, recommended)

The published image already has CUDA 13.0, PyTorch 2.9, the FlashRT kernels prebuilt, and CUTLASS vendored — pull and run, no local compile, no flash-attn wheel hunting:

```bash docker pull ghcr.io/liangsu8899/flashrt:latest docker run --rm --gpus all -it ghcr.io/liangsu8899/flashrt:latest

Option B — Build the Docker image yourself

If you need a different GPU arch, want to pin a specific commit, or prefer to vet the image source:

git clone https://github.com/LiangSu8899/FlashRT.git
cd FlashRT
docker build -t flashrt:dev -f docker/Dockerfile .
docker run --rm --gpus all -it flashrt:dev

Build args (GPU_ARCH, FA2_HDIMS, BASE_IMAGE, CUTLASS_REF) documented in docker/README.md. Cold build on a fresh host is ~25 min (NGC pull + FA2 codegen); warm rebuild ~12 min.

Option C — Native Linux (no Docker)

System requirements:

Component	Minimum	Notes
GPU	SM80+ (A100, 30xx+, Thor, 4090, 5090)
NVIDIA driver	545+ for CUDA 13, 525+ for CUDA 12.4	5090 needs 550+
CUDA Toolkit	12.4+ (Thor/Hopper) or 12.8+ (Blackwell)	CUDA 13 recommended on 5090
Python	3.10 / 3.11 / 3.12	3.12 on the default NGC image
GCC/G++	11+ with C++17
CMake	3.24+

Create an isolated Python environment first. The build step calls python3 -m pybind11 --cmakedir to locate pybind11 headers, so the Python that runs cmake .. MUST be the same interpreter the .so files will be imported from. System-Python + conda-Python mix-ups are the #1 native-install failure mode.

python3.12 -m venv .venv         # 3.10 / 3.11 / 3.12 all supported
source .venv/bin/activate

Minimum pip list (for the torch frontend; everything must be installed before cmake ..):

```bash

2. Build helpers

pip install pybind11 cmake "numpy>=1.24" safetensors

tracked upstream — see docs/INSTALL.md §JAX for rationale.

pip install jax==0.5.3 jax-cuda12-pjrt==0.5.3 jax-cuda12-plugin==0.5.3 ml_dtypes==0.5.3


Then build:

bash git clone https://github.com/LiangSu8899/FlashRT.git cd FlashRT git clone --depth 1 --branch v4.4.2 \ https://github.com/NVIDIA/cutlass.git third_party/cutlass

pip install -e ".[torch]" # or "[jax]" / "[all]"

NOTE: editable mode (-e) is required. The cmake build below drops

compiled .so files into flash_rt/ in the source tree; editable

install makes that directory importable directly. A non-editable

`pip install .` would install a copy BEFORE the .so files exist and

`make install` / `ninja install` step needed.

```

Build timing (one-time)

On a 5090 with CUDA 13 in a warm container, make -j$(nproc):

Target	Time
`flash_rt_kernels` (main kernels)	~2 min
`flash_rt_fa2` (FA2 vendor, default — 12 kernel .cu files × 3 arches)	~4.5 min (267 s)
Full `make -j$(nproc)`	~6.5 min

Subsequent rebuilds of only the hand-written kernels take ~2 min — FA2 is a separate CMake target and is only re-linked, not recompiled, unless the vendored source itself changes.

Slim-build flags (developer iteration speed)

FA2's CUTLASS 3.x templates dominate cold-build cost. The default matrix covers every RTX family card × fp16+bf16 × all 3 hdim buckets, which is right for distribution but overkill when you're iterating on a single 5090/4090 and a single model family. Three opt-in CMake flags trade binary coverage for iteration speed:

Flag	Default	What it does	`fa2` cold build on 5090
—	(none)	12 .cu × sm_80 + sm_120 + PTX fallback	267 s (4.5 min)
`-DFA2_ARCH_NATIVE_ONLY=ON`	OFF	Only emit SASS for the detected GPU; skip sm_80 + PTX passes	110 s (−59%)
`-DFA2_HDIMS="96;256"`	`"96;128;256"`	Drop `head_dim=128` (shipped models don't use it; reserved for future DiT variants)	210 s (−21%)
`-DFA2_DTYPES="fp16"`	`"fp16;bf16"`	Drop bf16 (Pi0 is fp16-only; Pi0.5 / GROOT need bf16)	179 s (−33%)
`-DFA2_ARCH_NATIVE_ONLY=ON -DFA2_HDIMS="96;256" -DFA2_DTYPES="fp16"`	—	All three combined (single-card + pi0-only)	87 s (−67%)

Shipped flash_rt_fa2.so size also shrinks — the all-three-slim build produces 17.8 MB (vs 135 MB default), a 87% reduction in binary size on the FA2 module.

Dropped entries still resolve at the Python layer — calling a stubbed entry (e.g. fa2.fwd_bf16 on a build with FA2_DTYPES="fp16") aborts the process with a clear "rebuild with -DFA2_DTYPES=…" message instead of linker errors or silent wrong output.

ccache (iterative C++ rebuild speedup)

If ccache is on PATH at CMake-config time, it is enabled automatically for both C++ and CUDA compiles. First build is unchanged. Hit rate on the .cpp side (pybind bindings) is high, so repeat edits to csrc/bindings.cpp / csrc/fa2_bindings.cpp get fast rebuilds. CUDA .cu files — nvcc's invocation style makes ccache hit rate unreliable, so treat CUDA speedup as a bonus rather than a guarantee. Tip: set CCACHE_DIR to a host-mounted path so the cache survives container rebuilds.

Install via apt-get install ccache (Ubuntu) or equivalent.

Quick Start

Already built? Run the snippet below. Not yet built? See Build & install first — cmake .. && make -j produces the kernel .so files this snippet imports. About 6 minutes from git clone to first inference.

```python import flash_rt # Python module name; project is FlashRT (see About)

model = flash_rt.load_model( checkpoint="/path/to/pi05_checkpoint", config="pi05", # or "pi0", "groot", "groot_n17", "pi0fast" framework="torch", # or "jax" )

actions = model.predict( images=[base_img, wrist_img], prompt="pick up the red block", )

The NVFP4 ckpt has no MTP head; point this env var at a paired

4. JAX-side (optional — only if you will load Orbax checkpoints).

API snippets

Already built? Jump to API examples below. Not yet built? See Build & install for the full Docker / native Linux flow, then come back.

tokenizer API.

pip install "transformers<4.56" pandas pillow pyarrow

to RtxTorchGroot; on Jetson Thor it resolves to ThorPipelineTorchGroot.

model = flash_rt.load_model( "/path/to/groot_checkpoint", config="groot", embodiment_tag="gr1", # see GROOT embodiment slots below ) ```

Versions are pinned because the Orbax/jaxlib/PJRT plugin ABI is

`import flash_rt` would fail at runtime with a missing-module error.

cmake -B build -S . # auto-detects GPU arch cmake --build build -j$(nproc)

Supported Models

Latency columns below are 2-view, pure CUDA Graph replay (p50, see Measurement protocol). All per-view breakdowns live in the Latency sections further down.

Model	Architecture	Latency (Thor, 2v)	Latency (RTX 5090, 2v)	Source
[Pi0.5](https://github.com/Physical-Intelligence/openpi)	PaliGemma 2B encoder + 300M decoder, 10-step diffusion	44 ms	17.58 ms	Physical Intelligence
[Pi0](https://github.com/Physical-Intelligence/openpi)	Same as Pi0.5, with continuous state input	46 ms	(Thor class w/ SM120 fork)	Physical Intelligence
[GROOT N1.6](https://github.com/NVIDIA/Isaac-GR00T)	Eagle3-VL + Qwen3 1.7B + AlternateVLDiT 32L, 4-step flow matching	45 ms (T=50) / 41 ms (T=16)	13.08 ms (T=50) / 12.53 ms (T=16)	NVIDIA
[Pi0-FAST](https://github.com/Physical-Intelligence/openpi)	Gemma 2B autoregressive, FAST tokenizer	8.1 ms/token, ~431 ms (50 tok)	2.39 ms/token, ~140 ms (50 tok, max-perf)	Physical Intelligence

---

Comparison

Solution	Hardware	Pi0	Pi0.5	GROOT N1.6	Source
Original openpi (JAX, unoptimized)	Jetson Thor	—	714 ms (1.4 Hz)	—	[openpi](https://github.com/Physical-Intelligence/openpi)
PyTorch naive	RTX 4090	—	~200 ms	—	HuggingFace LeRobot
torch.compile	RTX 4090	—	~40 ms	—	HuggingFace LeRobot
Triton-based VLA	RTX 5090	—	26.6 ms (2v)	—	arXiv 2510.26742
NVIDIA VLA-Perf	RTX 4090	31.06 ms (Pi0 3B)	—	—	arXiv 2602.18397
NVIDIA Isaac GR00T (TensorRT)	Jetson Thor	—	91–95 ms (3v)	~95 ms	[Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T)
FlashRT	RTX 5090	21.16 ms (2v)	17.58 ms (2v)	13.08 ms (T=50, 2v)	this work
FlashRT	Jetson Thor	46 ms (2v)	39.78 ms (2v) / 51.51 ms (3v) (NVFP4)	45 ms (T=50, 2v)	this work

On the same Jetson AGX Thor hardware, FlashRT goes from the original openpi JAX baseline (1.4 Hz) to 23 Hz (FP8) / 25 Hz (NVFP4) — a ~16-18× speedup at zero accuracy loss (cosine ≥ 0.9996 vs the production reference).

FlashRT Pi0.5 Thor numbers above are the NVFP4 production preset (use_fp4=True); the FP8 baseline is 44.0 ms 2v / 54.8 ms 3v at the same task success (491/500). See Latency (Thor) for the full sweep.

⚡ 核心功能

开源免费，支持本地部署，数据完全自主可控
活跃的 GitHub 开源社区，持续迭代更新
提供详细文档和使用示例，新手友好
支持自定义配置，灵活适配不同使用环境
可作为基础组件集成进现有技术栈或进行二次开发

👥 适合人群

AI 技术爱好者研究人员和学生开发者和工程师技术创业者

🎯 使用场景

本地部署运行，保护数据隐私，满足合规要求
自定义集成到现有系统，扩展技术栈能力
作为开源基础组件进行商业化二次开发

⚖️ 优点与不足

✅ 优点

+Apache-2.0 协议，可免费商用
+完全开源免费，无授权费用
+本地部署，数据完全自主可控
+开发者社区支持，遇问题可查可问

⚠️ 不足

−安装和初始配置可能需要一定技术基础
−功能完整性通常不如成熟商业产品
−技术支持主要依赖开源社区，响应速度不稳定

⚠️ 使用须知

AI Skill Hub 为第三方内容聚合平台，本页面信息基于公开数据整理，不对工具功能和质量作任何法律背书。

建议在沙箱或测试环境中充分验证后，再部署至生产环境，并做好必要的安全评估。

📄 License 说明

🔗 相关工具推荐

PaddleOCR AI技能包

PaddleOCR AI技能包是可安装AI技能包，PDF处理，GitHub 77.8k Stars。安装后AI可调用专属

andrej-karpathy-skills — Claude 必备 Skill中文文档

A single CLAUDE.md file to improve Claude Code behavior, der

MockingBird — AI 语音合成工具中文文档

🚀Clone a voice in 5 seconds to generate arbitrary speech in

humanizer — Claude 必备 Skill中文文档

Claude Code skill that removes signs of AI-generated writing

❓ 常见问题 FAQ

FlashRT 是什么工具？−

FlashRT 是一款C++开发的AI辅助工具。开源AI工具：FlashRT is a high-performance realtime inference engine for small-batch, latency。⭐182 · C++

FlashRT 如何安装和开始使用？+

FlashRT 是否免费？许可证是什么？+

FlashRT 适合哪些用户使用？+

FlashRT 的社区活跃度和项目维护状况如何？+

安装这个工具需要什么基础？+

安装过程中遇到依赖冲突怎么办？+

工具安装成功但运行报错，该怎么处理？+

💡 AI Skill Hub 点评

经综合评估，FlashRT 在AI工具赛道中表现稳健，质量良好。如果你已有明确的使用需求，可以直接上手体验；如果还在评估阶段，建议对比同类工具后再做决策。

📚 深入学习 FlashRT

查看分步骤安装教程和完整使用指南，快速上手这款工具

⚙️ 安装教程 📚 使用教程

🌐 原始信息

原始名称	`FlashRT`
原始描述	开源AI工具：FlashRT is a high-performance realtime inference engine for small-batch, latency。⭐182 · C++
Topics	`installablecudacuda-kernelsgr00tgr00t-n1-6-3bjetsonc++`
GitHub	https://github.com/LiangSu8899/FlashRT
License	Apache-2.0
语言	C++

🔗 原始来源

🐙 GitHub 仓库 https://github.com/LiangSu8899/FlashRT

收录时间：2026-05-21 · 更新时间：2026-05-22 · License：Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。