FlashRT 是 AI Skill Hub 本期精选AI工具之一。综合评分 6.8 分,整体质量稳定。我们推荐使用将其纳入你的 AI 工具库,帮助提升工作效率。
FlashRT 是一款基于 C++ 开发的开源工具,专注于 installable、cuda、cuda-kernels 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
FlashRT 是一款基于 C++ 开发的开源工具,专注于 installable、cuda、cuda-kernels 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 克隆仓库 git clone https://github.com/LiangSu8899/FlashRT cd FlashRT # 查看安装说明 cat README.md # 按 README 完成环境依赖安装后即可使用
# 查看帮助 flashrt --help # 基本运行 flashrt [options] <input> # 详细使用说明请查阅文档 # https://github.com/LiangSu8899/FlashRT
# flashrt 配置说明 # 查看配置选项 flashrt --config-example > config.yml # 常见配置项 # output_dir: ./output # log_level: info # workers: 4 # 环境变量(覆盖配置文件) export FLASHRT_CONFIG="/path/to/config.yml"
FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads.
A general kernel library composed into static graphs — no ONNX export, no engine compilation, no per-driver rebuild. Hand-written kernels (norm / activation / fusion / RoPE / FP8 / NVFP4 GEMM / attention) cover standard transformer, DiT, and SigLIP primitives. The composition pattern itself is hardware-agnostic; today the codebase ships with NVIDIA implementations spanning edge to server (Jetson AGX Thor through A100 / RTX 4090 / 5090).
The flagship integration today is VLA control — production frontends for Pi0, Pi0.5, GROOT N1.6, GROOT N1.7, and Pi0-FAST, validated on LIBERO where applicable. The same kernel set also powers the BAGEL world-model image-generation pipeline (research preview) and audio / video generation (4× over PyTorch). FlashRT now also serves single-stream LLM inference — the v1 release ships Qwen3.6-27B (NVFP4) with 256 K context on a single RTX 5090, an OpenAI-compatible HTTP server, and decode throughput of ~100 tok/s typical / 129 tok/s peak (real warm-state range across mixed chat / reasoning / code prompts; see Performance for the breakdown). The pattern is workload-shaped (small-batch realtime), not model-class-shaped.
Existing inference tooling is shaped for different workloads — TensorRT for tactic-search compile to frozen engines, vLLM / SGLang for high-batch LLM serving. FlashRT targets the small-batch realtime cell with hand-tuned kernels and no compile step.
Verified working on: RTX 5090, RTX 4090, RTX 5060 Ti, RTX 4060 Ti, NVIDIA L40, Jetson AGX Thor, and Jetson AGX Orin.
CMake's ENABLE_FA2 gate accepts any card in SM80 / 86 / 89 / 120 (Ampere through Blackwell consumer). That means A100, A10, RTX 3090, 3080, A5000/A6000, 4090, 4080, 4070, 4060 Ti, 5090 — all should build and run out of the box. "Theoretical" here just means the other cards haven't gone through the regression suite yet; the kernel set and dispatch paths are the same.
model = flash_rt.load_model( checkpoint="/path/to/pi0_fast_base", config="pi0fast", decode_cuda_graph=True, # capture decode loop as CUDA Graph decode_graph_steps=46, # action tokens per inference (50 total with text prefix) )
#### Qwen3.6-27B NVFP4 (LLM, RTX 5090)
The LLM path uses a dedicated frontend — same kernel binary, separate
generation API since chat completion has a different surface from VLA
control. See [`docs/qwen36_usage.md`](docs/qwen36_usage.md) for the
full parameter reference and [`docs/qwen36_nvfp4.md`](docs/qwen36_nvfp4.md)
for the K-curve / measured throughput / model-dependency notes.
python import os import torch from flash_rt.frontends.torch.qwen36_rtx import Qwen36TorchFrontendRtx
This is the hands-on "go from a fresh machine to a green benchmark" section. For a single-page install reference (prerequisites, troubleshooting table, JAX/transformers pin rationale) see docs/INSTALL.md.
Docker and native Linux paths both produce the same two extension modules:
| Artifact | Size | What it contains |
|---|---|---|
flash_rt/flash_rt_kernels.so | ~3 MB | Hand-written memory-bound kernels (norm, activation, fusion, FP8 quant, cuBLASLt wrappers, Thor FMHA). **Always built.** |
flash_rt/flash_rt_fa2.so | ~135 MB | Vendored Flash-Attention 2 v2.7.4.post1 fwd (fp16 + bf16, SM80/86/89/120). **Built only on RTX targets** — Thor skips it and uses fvk.attention_qkv_fp16 (cuBLAS-decomposed) for attention instead. |
Crucially — no pip install flash-attn required. The FA2 kernel is vendored at source level and built into flash_rt_fa2.so during cmake/make; at runtime import flash_rt loads both .so files directly, so you never hit the flash-attn wheel's torch × CUDA × driver × glibc compatibility matrix. Setting FVK_RTX_FA2=0 is still supported as a fall-back to pip flash-attn for debugging, but the default path has zero pip-wheel dependency.
The published image already has CUDA 13.0, PyTorch 2.9, the FlashRT kernels prebuilt, and CUTLASS vendored — pull and run, no local compile, no flash-attn wheel hunting:
```bash docker pull ghcr.io/liangsu8899/flashrt:latest docker run --rm --gpus all -it ghcr.io/liangsu8899/flashrt:latest
If you need a different GPU arch, want to pin a specific commit, or prefer to vet the image source:
git clone https://github.com/LiangSu8899/FlashRT.git
cd FlashRT
docker build -t flashrt:dev -f docker/Dockerfile .
docker run --rm --gpus all -it flashrt:dev
Build args (GPU_ARCH, FA2_HDIMS, BASE_IMAGE, CUTLASS_REF) documented in docker/README.md. Cold build on a fresh host is ~25 min (NGC pull + FA2 codegen); warm rebuild ~12 min.
System requirements:
| Component | Minimum | Notes |
|---|---|---|
| GPU | SM80+ (A100, 30xx+, Thor, 4090, 5090) | |
| NVIDIA driver | 545+ for CUDA 13, 525+ for CUDA 12.4 | 5090 needs 550+ |
| CUDA Toolkit | 12.4+ (Thor/Hopper) or 12.8+ (Blackwell) | CUDA 13 recommended on 5090 |
| Python | 3.10 / 3.11 / 3.12 | 3.12 on the default NGC image |
| GCC/G++ | 11+ with C++17 | |
| CMake | 3.24+ |
Create an isolated Python environment first. The build step calls python3 -m pybind11 --cmakedir to locate pybind11 headers, so the Python that runs cmake .. MUST be the same interpreter the .so files will be imported from. System-Python + conda-Python mix-ups are the #1 native-install failure mode.
python3.12 -m venv .venv # 3.10 / 3.11 / 3.12 all supported
source .venv/bin/activate
Minimum pip list (for the torch frontend; everything must be installed before cmake ..):
```bash
pip install pybind11 cmake "numpy>=1.24" safetensors
pip install jax==0.5.3 jax-cuda12-pjrt==0.5.3 jax-cuda12-plugin==0.5.3 ml_dtypes==0.5.3
Then build:
bash git clone https://github.com/LiangSu8899/FlashRT.git cd FlashRT git clone --depth 1 --branch v4.4.2 \ https://github.com/NVIDIA/cutlass.git third_party/cutlass
pip install -e ".[torch]" # or "[jax]" / "[all]"
```
On a 5090 with CUDA 13 in a warm container, make -j$(nproc):
| Target | Time |
|---|---|
flash_rt_kernels (main kernels) | ~2 min |
flash_rt_fa2 (FA2 vendor, default — 12 kernel .cu files × 3 arches) | **~4.5 min** (267 s) |
Full make -j$(nproc) | ~6.5 min |
Subsequent rebuilds of only the hand-written kernels take ~2 min — FA2 is a separate CMake target and is only re-linked, not recompiled, unless the vendored source itself changes.
FA2's CUTLASS 3.x templates dominate cold-build cost. The default matrix covers every RTX family card × fp16+bf16 × all 3 hdim buckets, which is right for distribution but overkill when you're iterating on a single 5090/4090 and a single model family. Three opt-in CMake flags trade binary coverage for iteration speed:
| Flag | Default | What it does | fa2 cold build on 5090 |
|---|---|---|---|
| — | (none) | 12 .cu × sm_80 + sm_120 + PTX fallback | **267 s (4.5 min)** |
-DFA2_ARCH_NATIVE_ONLY=ON | OFF | Only emit SASS for the detected GPU; skip sm_80 + PTX passes | **110 s** (−59%) |
-DFA2_HDIMS="96;256" | "96;128;256" | Drop head_dim=128 (shipped models don't use it; reserved for future DiT variants) | **210 s** (−21%) |
-DFA2_DTYPES="fp16" | "fp16;bf16" | Drop bf16 (Pi0 is fp16-only; Pi0.5 / GROOT need bf16) | **179 s** (−33%) |
-DFA2_ARCH_NATIVE_ONLY=ON -DFA2_HDIMS="96;256" -DFA2_DTYPES="fp16" | — | All three combined (single-card + pi0-only) | **87 s** (−67%) |
Shipped flash_rt_fa2.so size also shrinks — the all-three-slim build produces 17.8 MB (vs 135 MB default), a 87% reduction in binary size on the FA2 module.
Dropped entries still resolve at the Python layer — calling a stubbed entry (e.g. fa2.fwd_bf16 on a build with FA2_DTYPES="fp16") aborts the process with a clear "rebuild with -DFA2_DTYPES=…" message instead of linker errors or silent wrong output.
If ccache is on PATH at CMake-config time, it is enabled automatically for both C++ and CUDA compiles. First build is unchanged. Hit rate on the .cpp side (pybind bindings) is high, so repeat edits to csrc/bindings.cpp / csrc/fa2_bindings.cpp get fast rebuilds. CUDA .cu files — nvcc's invocation style makes ccache hit rate unreliable, so treat CUDA speedup as a bonus rather than a guarantee. Tip: set CCACHE_DIR to a host-mounted path so the cache survives container rebuilds.
Install via apt-get install ccache (Ubuntu) or equivalent.
Already built? Run the snippet below. Not yet built? See Build & install first —cmake .. && make -jproduces the kernel.sofiles this snippet imports. About 6 minutes fromgit cloneto first inference.
```python import flash_rt # Python module name; project is FlashRT (see About)
model = flash_rt.load_model( checkpoint="/path/to/pi05_checkpoint", config="pi05", # or "pi0", "groot", "groot_n17", "pi0fast" framework="torch", # or "jax" )
actions = model.predict( images=[base_img, wrist_img], prompt="pick up the red block", )
Already built? Jump to API examples below. Not yet built? See Build & install for the full Docker / native Linux flow, then come back.
pip install "transformers<4.56" pandas pillow pyarrow
model = flash_rt.load_model( "/path/to/groot_checkpoint", config="groot", embodiment_tag="gr1", # see GROOT embodiment slots below ) ```
cmake -B build -S . # auto-detects GPU arch cmake --build build -j$(nproc)
Latency columns below are 2-view, pure CUDA Graph replay (p50, see Measurement protocol). All per-view breakdowns live in the Latency sections further down.
| Model | Architecture | Latency (Thor, 2v) | Latency (RTX 5090, 2v) | Source |
|---|---|---|---|---|
| [**Pi0.5**](https://github.com/Physical-Intelligence/openpi) | PaliGemma 2B encoder + 300M decoder, 10-step diffusion | **44 ms** | **17.58 ms** | Physical Intelligence |
| [**Pi0**](https://github.com/Physical-Intelligence/openpi) | Same as Pi0.5, with continuous state input | **46 ms** | (Thor class w/ SM120 fork) | Physical Intelligence |
| [**GROOT N1.6**](https://github.com/NVIDIA/Isaac-GR00T) | Eagle3-VL + Qwen3 1.7B + AlternateVLDiT 32L, 4-step flow matching | **45 ms** (T=50) / **41 ms** (T=16) | **13.08 ms** (T=50) / **12.53 ms** (T=16) | NVIDIA |
| [**Pi0-FAST**](https://github.com/Physical-Intelligence/openpi) | Gemma 2B autoregressive, FAST tokenizer | **8.1 ms/token**, ~431 ms (50 tok) | **2.39 ms/token**, ~140 ms (50 tok, max-perf) | Physical Intelligence |
---
| Solution | Hardware | Pi0 | Pi0.5 | GROOT N1.6 | Source |
|---|---|---|---|---|---|
| Original openpi (JAX, unoptimized) | Jetson Thor | — | **714 ms (1.4 Hz)** | — | [openpi](https://github.com/Physical-Intelligence/openpi) |
| PyTorch naive | RTX 4090 | — | ~200 ms | — | HuggingFace LeRobot |
| torch.compile | RTX 4090 | — | ~40 ms | — | HuggingFace LeRobot |
| Triton-based VLA | RTX 5090 | — | 26.6 ms (2v) | — | arXiv 2510.26742 |
| NVIDIA VLA-Perf | RTX 4090 | 31.06 ms (Pi0 3B) | — | — | arXiv 2602.18397 |
| NVIDIA Isaac GR00T (TensorRT) | Jetson Thor | — | 91–95 ms (3v) | ~95 ms | [Isaac GR00T](https://github.com/NVIDIA/Isaac-GR00T) |
| **FlashRT** | **RTX 5090** | **21.16 ms** (2v) | **17.58 ms** (2v) | **13.08 ms** (T=50, 2v) | this work |
| **FlashRT** | **Jetson Thor** | **46 ms** (2v) | **39.78 ms** (2v) / **51.51 ms** (3v) (NVFP4) | **45 ms** (T=50, 2v) | this work |
On the same Jetson AGX Thor hardware, FlashRT goes from the original openpi JAX baseline (1.4 Hz) to 23 Hz (FP8) / 25 Hz (NVFP4) — a ~16-18× speedup at zero accuracy loss (cosine ≥ 0.9996 vs the production reference).
FlashRT Pi0.5 Thor numbers above are the NVFP4 production preset (use_fp4=True); the FP8 baseline is 44.0 ms 2v / 54.8 ms 3v at the same task success (491/500). See Latency (Thor) for the full sweep.
<a name="community-benchmarks"></a>
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ Apache 2.0 — 宽松开源协议,可商用,需保留版权声明和 NOTICE 文件,含专利授权条款。
经综合评估,FlashRT 在AI工具赛道中表现稳健,质量良好。如果你已有明确的使用需求,可以直接上手体验;如果还在评估阶段,建议对比同类工具后再做决策。
| 原始名称 | FlashRT |
| 原始描述 | 开源AI工具:FlashRT is a high-performance realtime inference engine for small-batch, latency。⭐182 · C++ |
| Topics | installablecudacuda-kernelsgr00tgr00t-n1-6-3bjetsonc++ |
| GitHub | https://github.com/LiangSu8899/FlashRT |
| License | Apache-2.0 |
| 语言 | C++ |
收录时间:2026-05-21 · 更新时间:2026-05-22 · License:Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。