🛠
AI工具

FlashRT

基于 C++ · 开源免费,本地部署,数据完全自主可控
⭐ 182 Stars 🍴 26 Forks 💻 C++ 📄 Apache-2.0 🏷 AI 6.8分
6.8AI 综合评分
installablecudacuda-kernelsgr00tgr00t-n1-6-3bjetsonc++
✦ AI Skill Hub 推荐

FlashRT 是 AI Skill Hub 本期精选AI工具之一。综合评分 6.8 分,整体质量稳定。我们推荐使用将其纳入你的 AI 工具库,帮助提升工作效率。

📚 深度解析
FlashRT 是一款基于 C++ 的开源工具,在 GitHub 上收获 0k+ Star,是installable、cuda、cuda-kernels、gr00t领域中的优质开源项目。开源工具的最大优势在于代码完全透明,你可以审计每一行代码的安全性,也可以根据自身需求进行二次开发和定制。

**为什么要使用开源工具而非商业 SaaS?**
对于个人开发者和有隐私需求的用户,本地部署的开源工具意味着数据不离本机,不受第三方服务商的数据政策约束。同时,开源工具通常没有使用次数限制和月度费用,一次安装即可长期使用,对于高频使用场景的总拥有成本(TCO)远低于订阅制商业工具。

**安装与环境准备**
FlashRT 依赖 C++ 运行环境。建议通过 pyenv(Python)或 nvm(Node.js)管理 C++ 版本,避免全局环境污染。对于新手用户,推荐先创建虚拟环境(python -m venv venv && source venv/bin/activate),再安装依赖,这样即使出现问题也可以随时删除虚拟环境重新开始,不影响系统稳定性。

**社区与维护**
GitHub Issue 和 Discussion 是获取帮助的最快渠道。在提问前建议先检查 Closed Issues(已关闭的问题),大多数常见问题都已有解答。遇到 Bug 时,提供 pip list 的输出、完整错误堆栈和最小可复现示例,能显著提高开发者响应速度。AI Skill Hub 将持续追踪 FlashRT 的版本更新,及时通知重要功能变化。
📋 工具概览

FlashRT 是一款基于 C++ 开发的开源工具,专注于 installable、cuda、cuda-kernels 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。

GitHub Stars
⭐ 182
开发语言
C++
支持平台
Windows / macOS / Linux
维护状态
轻量级项目,按需更新
开源协议
Apache-2.0
AI 综合评分
6.8 分
工具类型
AI工具
Forks
26
📖 中文文档
以下内容由 AI Skill Hub 根据项目信息自动整理,如需查看完整原始文档请访问底部「原始来源」。

FlashRT 是一款基于 C++ 开发的开源工具,专注于 installable、cuda、cuda-kernels 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。

📌 核心特色
  • 开源免费,支持本地部署,数据完全自主可控
  • 活跃的 GitHub 开源社区,持续迭代更新
  • 提供详细文档和使用示例,新手友好
  • 支持自定义配置,灵活适配不同使用环境
  • 可作为基础组件集成进现有技术栈或进行二次开发
🎯 主要使用场景
  • 本地部署运行,保护数据隐私,满足合规要求
  • 自定义集成到现有系统,扩展技术栈能力
  • 作为开源基础组件进行商业化二次开发
以下安装命令基于项目开发语言和类型自动生成,实际以官方 README 为准。
安装命令
# 克隆仓库
git clone https://github.com/LiangSu8899/FlashRT
cd FlashRT

# 查看安装说明
cat README.md

# 按 README 完成环境依赖安装后即可使用
📋 安装步骤说明
  1. 访问 GitHub 仓库页面
  2. 按照 README 文档完成依赖安装
  3. 根据系统环境完成初始化配置
  4. 参考官方示例或文档开始使用
  5. 遇到问题可在 GitHub Issues 中查找解答
以下用法示例由 AI Skill Hub 整理,涵盖最常见的使用场景。
常用命令 / 代码示例
# 查看帮助
flashrt --help

# 基本运行
flashrt [options] <input>

# 详细使用说明请查阅文档
# https://github.com/LiangSu8899/FlashRT
以下配置示例基于典型使用场景生成,具体参数请参照官方文档调整。
配置示例
# flashrt 配置说明
# 查看配置选项
flashrt --config-example > config.yml

# 常见配置项
# output_dir: ./output
# log_level: info
# workers: 4

# 环境变量(覆盖配置文件)
export FLASHRT_CONFIG="/path/to/config.yml"
📑 README 深度解析 真实文档 完整度 81/100 查看 GitHub 原文 →
以下内容由系统直接从 GitHub README 解析整理,保留代码块、表格与列表结构。

FlashRT

FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads.

A general kernel library composed into static graphs — no ONNX export, no engine compilation, no per-driver rebuild. Hand-written kernels (norm / activation / fusion / RoPE / FP8 / NVFP4 GEMM / attention) cover standard transformer, DiT, and SigLIP primitives. The composition pattern itself is hardware-agnostic; today the codebase ships with NVIDIA implementations spanning edge to server (Jetson AGX Thor through A100 / RTX 4090 / 5090).

The flagship integration today is VLA control — production frontends for Pi0, Pi0.5, GROOT N1.6, GROOT N1.7, and Pi0-FAST, validated on LIBERO where applicable. The same kernel set also powers the BAGEL world-model image-generation pipeline (research preview) and audio / video generation (4× over PyTorch). FlashRT now also serves single-stream LLM inference — the v1 release ships Qwen3.6-27B (NVFP4) with 256 K context on a single RTX 5090, an OpenAI-compatible HTTP server, and decode throughput of ~100 tok/s typical / 129 tok/s peak (real warm-state range across mixed chat / reasoning / code prompts; see Performance for the breakdown). The pattern is workload-shaped (small-batch realtime), not model-class-shaped.

Existing inference tooling is shaped for different workloads — TensorRT for tactic-search compile to frozen engines, vLLM / SGLang for high-batch LLM serving. FlashRT targets the small-batch realtime cell with hand-tuned kernels and no compile step.

Tested hardware + what's theoretically supported

Verified working on: RTX 5090, RTX 4090, RTX 5060 Ti, RTX 4060 Ti, NVIDIA L40, Jetson AGX Thor, and Jetson AGX Orin.

CMake's ENABLE_FA2 gate accepts any card in SM80 / 86 / 89 / 120 (Ampere through Blackwell consumer). That means A100, A10, RTX 3090, 3080, A5000/A6000, 4090, 4080, 4070, 4060 Ti, 5090 — all should build and run out of the box. "Theoretical" here just means the other cards haven't gone through the regression suite yet; the kernel set and dispatch paths are the same.

Getting Started

Pi0-FAST max-performance mode (for fixed-prompt 24h deployment):

model = flash_rt.load_model( checkpoint="/path/to/pi0_fast_base", config="pi0fast", decode_cuda_graph=True, # capture decode loop as CUDA Graph decode_graph_steps=46, # action tokens per inference (50 total with text prefix) )


#### Qwen3.6-27B NVFP4 (LLM, RTX 5090)

The LLM path uses a dedicated frontend — same kernel binary, separate
generation API since chat completion has a different surface from VLA
control. See [`docs/qwen36_usage.md`](docs/qwen36_usage.md) for the
full parameter reference and [`docs/qwen36_nvfp4.md`](docs/qwen36_nvfp4.md)
for the K-curve / measured throughput / model-dependency notes.
python import os import torch from flash_rt.frontends.torch.qwen36_rtx import Qwen36TorchFrontendRtx

Build & install

This is the hands-on "go from a fresh machine to a green benchmark" section. For a single-page install reference (prerequisites, troubleshooting table, JAX/transformers pin rationale) see docs/INSTALL.md.

Docker and native Linux paths both produce the same two extension modules:

ArtifactSizeWhat it contains
flash_rt/flash_rt_kernels.so~3 MBHand-written memory-bound kernels (norm, activation, fusion, FP8 quant, cuBLASLt wrappers, Thor FMHA). **Always built.**
flash_rt/flash_rt_fa2.so~135 MBVendored Flash-Attention 2 v2.7.4.post1 fwd (fp16 + bf16, SM80/86/89/120). **Built only on RTX targets** — Thor skips it and uses fvk.attention_qkv_fp16 (cuBLAS-decomposed) for attention instead.

Crucially — no pip install flash-attn required. The FA2 kernel is vendored at source level and built into flash_rt_fa2.so during cmake/make; at runtime import flash_rt loads both .so files directly, so you never hit the flash-attn wheel's torch × CUDA × driver × glibc compatibility matrix. Setting FVK_RTX_FA2=0 is still supported as a fall-back to pip flash-attn for debugging, but the default path has zero pip-wheel dependency.

Option B — Build the Docker image yourself

If you need a different GPU arch, want to pin a specific commit, or prefer to vet the image source:

git clone https://github.com/LiangSu8899/FlashRT.git
cd FlashRT
docker build -t flashrt:dev -f docker/Dockerfile .
docker run --rm --gpus all -it flashrt:dev

Build args (GPU_ARCH, FA2_HDIMS, BASE_IMAGE, CUTLASS_REF) documented in docker/README.md. Cold build on a fresh host is ~25 min (NGC pull + FA2 codegen); warm rebuild ~12 min.

Option C — Native Linux (no Docker)

System requirements:

ComponentMinimumNotes
GPUSM80+ (A100, 30xx+, Thor, 4090, 5090)
NVIDIA driver545+ for CUDA 13, 525+ for CUDA 12.45090 needs 550+
CUDA Toolkit12.4+ (Thor/Hopper) or 12.8+ (Blackwell)CUDA 13 recommended on 5090
Python3.10 / 3.11 / 3.123.12 on the default NGC image
GCC/G++11+ with C++17
CMake3.24+

Create an isolated Python environment first. The build step calls python3 -m pybind11 --cmakedir to locate pybind11 headers, so the Python that runs cmake .. MUST be the same interpreter the .so files will be imported from. System-Python + conda-Python mix-ups are the #1 native-install failure mode.

python3.12 -m venv .venv         # 3.10 / 3.11 / 3.12 all supported
source .venv/bin/activate

Minimum pip list (for the torch frontend; everything must be installed before cmake ..):

```bash

2. Build helpers

pip install pybind11 cmake "numpy>=1.24" safetensors

tracked upstream — see docs/INSTALL.md §JAX for rationale.

pip install jax==0.5.3 jax-cuda12-pjrt==0.5.3 jax-cuda12-plugin==0.5.3 ml_dtypes==0.5.3


Then build:
bash git clone https://github.com/LiangSu8899/FlashRT.git cd FlashRT git clone --depth 1 --branch v4.4.2 \ https://github.com/NVIDIA/cutlass.git third_party/cutlass

pip install -e ".[torch]" # or "[jax]" / "[all]"

NOTE: editable mode (-e) is required. The cmake build below drops

compiled .so files into flash_rt/ in the source tree; editable

install makes that directory importable directly. A non-editable

`pip install .` would install a copy BEFORE the .so files exist and

`make install` / `ninja install` step needed.

```

Build timing (one-time)

On a 5090 with CUDA 13 in a warm container, make -j$(nproc):

TargetTime
flash_rt_kernels (main kernels)~2 min
flash_rt_fa2 (FA2 vendor, default — 12 kernel .cu files × 3 arches)**~4.5 min** (267 s)
Full make -j$(nproc)~6.5 min

Subsequent rebuilds of only the hand-written kernels take ~2 min — FA2 is a separate CMake target and is only re-linked, not recompiled, unless the vendored source itself changes.

Slim-build flags (developer iteration speed)

FA2's CUTLASS 3.x templates dominate cold-build cost. The default matrix covers every RTX family card × fp16+bf16 × all 3 hdim buckets, which is right for distribution but overkill when you're iterating on a single 5090/4090 and a single model family. Three opt-in CMake flags trade binary coverage for iteration speed:

FlagDefaultWhat it doesfa2 cold build on 5090
(none)12 .cu × sm_80 + sm_120 + PTX fallback**267 s (4.5 min)**
-DFA2_ARCH_NATIVE_ONLY=ONOFFOnly emit SASS for the detected GPU; skip sm_80 + PTX passes**110 s** (−59%)
-DFA2_HDIMS="96;256""96;128;256"Drop head_dim=128 (shipped models don't use it; reserved for future DiT variants)**210 s** (−21%)
-DFA2_DTYPES="fp16""fp16;bf16"Drop bf16 (Pi0 is fp16-only; Pi0.5 / GROOT need bf16)**179 s** (−33%)
-DFA2_ARCH_NATIVE_ONLY=ON -DFA2_HDIMS="96;256" -DFA2_DTYPES="fp16"All three combined (single-card + pi0-only)**87 s** (−67%)

Shipped flash_rt_fa2.so size also shrinks — the all-three-slim build produces 17.8 MB (vs 135 MB default), a 87% reduction in binary size on the FA2 module.

Dropped entries still resolve at the Python layer — calling a stubbed entry (e.g. fa2.fwd_bf16 on a build with FA2_DTYPES="fp16") aborts the process with a clear "rebuild with -DFA2_DTYPES=…" message instead of linker errors or silent wrong output.

ccache (iterative C++ rebuild speedup)

If ccache is on PATH at CMake-config time, it is enabled automatically for both C++ and CUDA compiles. First build is unchanged. Hit rate on the .cpp side (pybind bindings) is high, so repeat edits to csrc/bindings.cpp / csrc/fa2_bindings.cpp get fast rebuilds. CUDA .cu files — nvcc's invocation style makes ccache hit rate unreliable, so treat CUDA speedup as a bonus rather than a guarantee. Tip: set CCACHE_DIR to a host-mounted path so the cache survives container rebuilds.

Install via apt-get install ccache (Ubuntu) or equivalent.

Quick Start

Already built? Run the snippet below. Not yet built? See Build & install firstcmake .. && make -j produces the kernel .so files this snippet imports. About 6 minutes from git clone to first inference.

```python import flash_rt # Python module name; project is FlashRT (see About)

model = flash_rt.load_model( checkpoint="/path/to/pi05_checkpoint", config="pi05", # or "pi0", "groot", "groot_n17", "pi0fast" framework="torch", # or "jax" )

actions = model.predict( images=[base_img, wrist_img], prompt="pick up the red block", )

The NVFP4 ckpt has no MTP head; point this env var at a paired

4. JAX-side (optional — only if you will load Orbax checkpoints).

API snippets

Already built? Jump to API examples below. Not yet built? See Build & install for the full Docker / native Linux flow, then come back.

tokenizer API.

pip install "transformers<4.56" pandas pillow pyarrow

to RtxTorchGroot; on Jetson Thor it resolves to ThorPipelineTorchGroot.

model = flash_rt.load_model( "/path/to/groot_checkpoint", config="groot", embodiment_tag="gr1", # see GROOT embodiment slots below ) ```

Versions are pinned because the Orbax/jaxlib/PJRT plugin ABI is

`import flash_rt` would fail at runtime with a missing-module error.

cmake -B build -S . # auto-detects GPU arch cmake --build build -j$(nproc)

Supported Models

Latency columns below are 2-view, pure CUDA Graph replay (p50, see Measurement protocol). All per-view breakdowns live in the Latency sections further down.

ModelArchitectureLatency (Thor, 2v)Latency (RTX 5090, 2v)Source
[**Pi0.5**](https://github.com/Physical-Intelligence/openpi)PaliGemma 2B encoder + 300M decoder, 10-step diffusion**44 ms****17.58 ms**Physical Intelligence
[**Pi0**](https://github.com/Physical-Intelligence/openpi)Same as Pi0.5, with continuous state input**46 ms**(Thor class w/ SM120 fork)Physical Intelligence
[**GROOT N1.6**](https://github.com/NVIDIA/Isaac-GR00T)Eagle3-VL + Qwen3 1.7B + AlternateVLDiT 32L, 4-step flow matching**45 ms** (T=50) / **41 ms** (T=16)**13.08 ms** (T=50) / **12.53 ms** (T=16)NVIDIA
[**Pi0-FAST**](https://github.com/Physical-Intelligence/openpi)Gemma 2B autoregressive, FAST tokenizer**8.1 ms/token**, ~431 ms (50 tok)**2.39 ms/token**, ~140 ms (50 tok, max-perf)Physical Intelligence

---

Comparison

SolutionHardwarePi0Pi0.5GROOT N1.6Source
Original openpi (JAX, unoptimized)Jetson Thor**714 ms (1.4 Hz)**[openpi](https://github.com/Physical-Intelligence/openpi)
PyTorch naiveRTX 4090~200 msHuggingFace LeRobot
torch.compileRTX 4090~40 msHuggingFace LeRobot
Triton-based VLARTX 509026.6 ms (2v)arXiv 2510.26742
NVIDIA VLA-PerfRTX 409031.06 ms (Pi0 3B)arXiv 2602.18397
NVIDIA Isaac GR00T (TensorRT)Jetson Thor91–95 ms (3v)~95 ms[Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T)
**FlashRT****RTX 5090****21.16 ms** (2v)**17.58 ms** (2v)**13.08 ms** (T=50, 2v)this work
**FlashRT****Jetson Thor****46 ms** (2v)**39.78 ms** (2v) / **51.51 ms** (3v) (NVFP4)**45 ms** (T=50, 2v)this work

On the same Jetson AGX Thor hardware, FlashRT goes from the original openpi JAX baseline (1.4 Hz) to 23 Hz (FP8) / 25 Hz (NVFP4) — a ~16-18× speedup at zero accuracy loss (cosine ≥ 0.9996 vs the production reference).

FlashRT Pi0.5 Thor numbers above are the NVFP4 production preset (use_fp4=True); the FP8 baseline is 44.0 ms 2v / 54.8 ms 3v at the same task success (491/500). See Latency (Thor) for the full sweep.

<a name="community-benchmarks"></a>

⚡ 核心功能
👥 适合人群
AI 技术爱好者研究人员和学生开发者和工程师技术创业者
🎯 使用场景
  • 本地部署运行,保护数据隐私,满足合规要求
  • 自定义集成到现有系统,扩展技术栈能力
  • 作为开源基础组件进行商业化二次开发
⚖️ 优点与不足
✅ 优点
  • +Apache-2.0 协议,可免费商用
  • +完全开源免费,无授权费用
  • +本地部署,数据完全自主可控
  • +开发者社区支持,遇问题可查可问
⚠️ 不足
  • 安装和初始配置可能需要一定技术基础
  • 功能完整性通常不如成熟商业产品
  • 技术支持主要依赖开源社区,响应速度不稳定
⚠️ 使用须知

AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。

建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。

📄 License 说明

✅ Apache 2.0 — 宽松开源协议,可商用,需保留版权声明和 NOTICE 文件,含专利授权条款。

🔗 相关工具推荐
❓ 常见问题 FAQ
FlashRT 是一款C++开发的AI辅助工具。开源AI工具:FlashRT is a high-performance realtime inference engine for small-batch, latency。⭐182 · C++
💡 AI Skill Hub 点评

经综合评估,FlashRT 在AI工具赛道中表现稳健,质量良好。如果你已有明确的使用需求,可以直接上手体验;如果还在评估阶段,建议对比同类工具后再做决策。

📚 深入学习 FlashRT
查看分步骤安装教程和完整使用指南,快速上手这款工具
🌐 原始信息
原始名称 FlashRT
原始描述 开源AI工具:FlashRT is a high-performance realtime inference engine for small-batch, latency。⭐182 · C++
Topics installablecudacuda-kernelsgr00tgr00t-n1-6-3bjetsonc++
GitHub https://github.com/LiangSu8899/FlashRT
License Apache-2.0
语言 C++
🔗 原始来源
🐙 GitHub 仓库  https://github.com/LiangSu8899/FlashRT

收录时间:2026-05-21 · 更新时间:2026-05-22 · License:Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。