经 AI Skill Hub 精选评估,Vortex 获评「推荐使用」。这款AI工具在功能完整性、社区活跃度和易用性方面表现出色,AI 评分 7.5 分,适合有一定技术背景的用户使用。
Vortex 是一款基于 Python 开发的开源工具,专注于 llm、sparse-attention、python 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
Vortex 是一款基于 Python 开发的开源工具,专注于 llm、sparse-attention、python 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 方式一:pip 安装(推荐)
pip install vortex_torch
# 方式二:虚拟环境安装(推荐生产环境)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install vortex_torch
# 方式三:从源码安装(获取最新功能)
git clone https://github.com/Infini-AI-Lab/vortex_torch
cd vortex_torch
pip install -e .
# 验证安装
python -c "import vortex_torch; print('安装成功')"
# 命令行使用
vortex_torch --help
# 基本用法
vortex_torch input_file -o output_file
# Python 代码中调用
import vortex_torch
# 示例
result = vortex_torch.process("input")
print(result)
# vortex_torch 配置文件示例(config.yml) app: name: "vortex_torch" debug: false log_level: "INFO" # 运行时指定配置文件 vortex_torch --config config.yml # 或通过环境变量配置 export VORTEX_TORCH_API_KEY="your-key" export VORTEX_TORCH_OUTPUT_DIR="./output"
<p align="center"> <img alt="Vortex" src="assets/vortex_logo_flat.png" width="55%" /> </p>
<p align="center"> <a href="https://arxiv.org/abs/2606.06453"><img src="https://img.shields.io/badge/Paper-arXiv%3A2606.06453-b31b1b?logo=arxiv&logoColor=white" alt="Paper" /></a> <a href="https://infini-ai-lab.github.io/vortex_torch/docs/"><img src="https://img.shields.io/badge/Docs-Documentation-1f6feb?logo=readthedocs&logoColor=white" alt="Documentation" /></a> <a href="https://infini-ai-lab.github.io/vortex_torch/"><img src="https://img.shields.io/badge/Website-vortex__torch-2ea44f?logo=githubpages&logoColor=white" alt="Website" /></a> </p>
Vortex turns sparse-attention algorithm design into something AI agents can do. Sparse attention is increasingly essential for serving LLMs as generation lengths grow — but deploying and evaluating new sparse-attention algorithms at scale has been highly engineering-intensive, slowing both human researchers and AI agents as they explore the design space.
Vortex couples a Python-embedded frontend over a page-centric tensor abstraction — concise enough to express a broad range of sparse-attention algorithms — with an efficient backend tightly integrated into modern LLM serving stacks (SGLang). A new algorithm goes from idea to deployed-and-benchmarked in minutes, turning its theoretical efficiency into real-world throughput without touching core model code.
This makes Vortex a platform for autonomous algorithm discovery: AI agents generate and refine diverse sparse-attention algorithms with Vortex — the best reaching up to 3.46× higher throughput than full attention while preserving accuracy. Vortex also extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with (up to 4.7× on the MLA-based GLM-4.7-Flash and 1.37× on the 229B-parameter MiniMax-M2.7), and doubles as a research instrument for understanding where the routing signal lives in sparse attention.
<p align="center"> <img src="assets/fig1_workflow.png" alt="A workflow to study sparse attention algorithms with Vortex" width="40%" /> <img src="assets/fig1_results.png" alt="Agent-generated sparse attention on Qwen3-1.7B / AIME" width="52%" /> </p> <p align="center"> <em><b>(a)</b> A workflow to study sparse attention algorithms using Vortex. <b>(b)</b> Agent-generated sparse attention (Qwen3-1.7B, AIME, NVIDIA H200): each point is one algorithm generated or optimized by AI agents with Vortex — the best reaches up to 3.46× the throughput of full attention while preserving accuracy.</em> </p>
---
- Easy Programming Program sparse attention with a PyTorch-like frontend. No worrying about batching, caching & paged attention.
- High Performance Built to work with FlashInfer & CUDA Graph & Radix Attention for efficient LLM inference.
- Agent Native Designed for autonomous algorithm discovery — AI agents generate, benchmark, and refine sparse attention end-to-end, with a Claude Code workspace and OpenHands demo built in.
---
cd third_party/sglang/v0.5.9/sglang pip install -e "python" cd ../../../../
```bash git clone --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
cd vortex_torch pip install -e . ```
---
<p align="center"> <img src="assets/fig16_mm_a.png" alt="mean@16 vs throughput" width="32%" /> <img src="assets/fig16_mm_b.png" alt="pass@4 vs throughput" width="32%" /> <img src="assets/fig16_mm_c.png" alt="pass@8 vs throughput" width="32%" /> </p> <p align="center"> <em>Scaling to a 229B model with tensor parallelism — MiniMax-M2.7 (229B) on AIME26 with 32K-token generation on four NVIDIA B200 GPUs (TP=4): <b>(a)</b> mean@16, <b>(b)</b> pass@4, <b>(c)</b> pass@8 versus end-to-end throughput. Block top-k and Quest sweep the number of attended blocks; the star marks the full-attention operating point.</em> </p>
A working setup is two files:
1. The flow module (this section) — a .py file that defines your sparse-attention algorithm as a vFlow subclass and @registers it under a name. It contains only vortex ops; it never imports sglang. 2. The launch script (next section) — imports sglang + vortex_torch and starts the engine pointing at the flow by its registered name.
VortexConfig is a single dataclass (vortex_torch/engine/sgl/config.py) that holds every vortex sparse-attention hyper-parameter in one place, instead of ~18 loose vortex_* arguments scattered across sglang's ServerArgs. Its presence on the engine is also the on/off switch: pass a VortexConfig and sparsity is enabled; leave it out and the model runs ordinary dense attention.
Every field, with what it controls and an example value:
| Field | Explanation | Example |
|---|---|---|
module_path | Path to the .py file holding your flow. None → vortex searches vortex_torch.flow.algorithms. | "submissions/custom.py" |
module_name | The @register(...) name of the vFlow to load. Must match exactly. | "custom_sparse_attention" |
topk_val | **Static page budget** — the fixed minimum number of pages each sequence keeps, regardless of length. The core accuracy↔throughput knob. | 30 |
topk_ratio | **Dynamic page budget** — a fraction of the sequence's pages; the engine keeps max(static floor, topk_ratio × num_pages). 0.0 disables it (use topk_val only). | 0.0625 |
max_topk_val | Upper bound on the selected-page count, used to size/pick the top-k kernel variant. None → derived from max_seq_lens. | 256 |
layers_skip | Layer indices that **bypass sparse attention and run dense** (e.g. early layers that need global context). None → all layers sparse. | [0, 4, 8, 12] |
block_reserved_bos | Pages at the **start** of the sequence that are always selected (attention sink). Int ≥ 1. | 1 |
block_reserved_eos | Pages at the **end** (most-recent tokens) that are always selected. Int ≥ 1. | 1 |
max_seq_lens | Maximum sequence length to plan buffers for. -1 → use the model default. | 8192 |
block_size | Vortex **page size** (the unit of sparsity). Positive power of 2; smaller = finer granularity, larger = less cache-summary overhead. Defaults to sglang's page_size. | 16 |
workload_chunk_size | Planner granularity — how many blocks are grouped into one indexer workload. Positive power of 2; a throughput-tuning knob. | 32 |
dtype | dtype for **intermediate** indexer tensors. "bfloat16" is the tested default; "float16"/"float32"/"fp8_e4m3"/"fp8_e5m2" are accepted. | "bfloat16" |
compilation_cache_dir | Directory for the JIT-compiled kernel cache. None → next to the compiler module. | "/tmp/vortex_cache" |
schedule_policy | **A CUDA C++ snippet that computes each sequence's page budget** (see below). None → the default budget formula. | None |
attention_backend | Sparse-attention kernel family: "flashinfer" (default) or "trtllm". | "flashinfer" |
impl_backend | Indexer op implementation backend: "triton" (default) or "cuda". | "triton" |
use_tensor_core | Enable tensor-core (bf16 tl.dot) codegen in the triton kernel. Only valid with impl_backend="triton". | False |
To serve vortex sparse attention over HTTP instead of driving the engine in-process, use examples/server_launch.sh. It boots an sglang server with an OpenAI-compatible API on 127.0.0.1:30000:
```bash
高效的稀疏注意力框架,适合深度学习应用
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ Apache 2.0 — 宽松开源协议,可商用,需保留版权声明和 NOTICE 文件,含专利授权条款。
AI Skill Hub 点评:Vortex 的核心功能完整,质量良好。对于AI 技术爱好者来说,这是一个值得纳入个人工具库的选择。建议先在非生产环境试用,再逐步推广。
| 原始名称 | vortex_torch |
| 原始描述 | 开源AI工具:Vortex: A Flexible and Efficient Sparse Attention Framework。⭐53 · Python |
| Topics | llmsparse-attentionpython |
| GitHub | https://github.com/Infini-AI-Lab/vortex_torch |
| License | Apache-2.0 |
| 语言 | Python |
收录时间:2026-06-05 · 更新时间:2026-06-06 · License:Apache-2.0 · AI Skill Hub 不对第三方内容的准确性作法律背书。