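Capability tags

🔄 Workflow 💻 CLI 🔗 REST API 🧬 Embedding 📚 RAG 🖼 Vision 🎙 STT 🖥 Local LLM

🛠 AI tool

TurboQuant

Python-based · Free and open source, deployable locally, with your data fully under your own control

English name: turboquant-pro

⭐ 19 Stars 🍴 1 Fork 💻 Python 📄 MIT 🏷 AI score 8.0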


AI · LLM · Cache compression


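✦ Recommended by AI Skill Hub

AI Skill Hub strongly recommends TurboQuant as a quality AI tool. Its overall AI score of 8.0 indicates solid performance among comparable tools. If you are looking for a dependable AI tool, it is worth a closer look.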

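📚 In-depth analysis

TurboQuant is an open-source Python tool with 19 stars on GitHub, covering AI, LLM, and cache compression. The main advantage of an open-source tool is full code transparency: you can audit every line for security, and you can fork and customize it to your own needs.

**Why use an open-source tool instead of commercial SaaS?**
For individual developers and privacy-conscious users, a locally deployed open-source tool means data never leaves the machine and is not subject to a third-party provider's data policy. Open-source tools also usually have no usage caps or monthly fees: install once and use indefinitely, so for high-frequency use the total cost of ownership (TCO) is far lower than a subscription product.

**Installation and environment setup**
TurboQuant requires a Python runtime. Managing Python versions with pyenv is recommended to avoid polluting the global environment. New users should first create a virtual environment (python -m venv venv && source venv/bin/activate) and then install dependencies; if anything goes wrong, the virtual environment can be deleted and recreated without affecting the system.

**Community and maintenance**
GitHub Issues and Discussions are the fastest way to get help. Before asking, check the closed issues, where most common questions already have answers. When reporting a bug, include the output of pip list, the full stack trace, and a minimal reproducible example; this noticeably speeds up maintainer responses. AI Skill Hub will keep tracking TurboQuant releases and report important feature changes.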

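📋 Tool overview

TurboQuant is an open-source tool written in Python, focused on AI, LLM, and cache compression. As a GitHub open-source project with community support and ongoing releases, its code is fully transparent and auditable, and it supports local deployment to protect data privacy. It can be used by individuals or integrated into enterprise workflows.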

GitHub Stars: ⭐ 19

Language: Python

Supported platforms: Windows / macOS / Linux

Maintenance status: Lightweight project, updated as needed

License: MIT

Overall AI score: 8.0

Tool type: AI tool

Forks: 1

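📖 Documentation

The following was compiled automatically by AI Skill Hub from project information. For the complete original documentation, see "Original source" at the bottom of the page.

📌 Key features

Free and open source, with local deployment and full control over your data
Developed in the open on GitHub with ongoing releases
Detailed documentation and usage examples, suitable for newcomers
Custom configuration to fit different environments
Usable as a base component in an existing stack or for further development

🎯 Main use cases

Run locally to protect data privacy and meet compliance requirements
Integrate into existing systems to extend your stack
Build commercial products on top of it as an open-source base component

The install commands below were generated automatically from the project's language and type; the official README takes precedence.

Installation commands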

# Option 1: install with pip (recommended)
pip install turboquant-pro

# Option 2: install into a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install turboquant-pro

# Option 3: install from source (latest features)
git clone https://github.com/ahb-sjsu/turboquant-pro
cd turboquant-pro
pip install -e .

# Verify the installation
python -c "import turboquant_pro; print('installed OK')"

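📋 Installation steps

Visit the GitHub repository page
Install dependencies as described in the README
Complete the initial configuration for your system
Start from the official examples or documentation
If you run into problems, search GitHub Issues for answers

The usage examples below were compiled by AI Skill Hub and cover the most common scenarios.

Common commands / code examples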

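# Command-line usage
turboquant-pro --help

# Basic usage (generic placeholder; the README only documents the `turboquant-pro autotune` subcommand)
turboquant-pro input_file -o output_file

# Calling from Python
import turboquant_pro

# Example (generic placeholder; the README's documented entry points are classes such as TurboQuantKV)
result = turboquant_pro.process("input")
print(result)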

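The configuration example below was generated for a typical use case; adjust the parameters according to the official documentation.

Configuration example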

# Example turboquant-pro config file (config.yml)
app:
  name: "turboquant-pro"
  debug: false
  log_level: "INFO"

# Specify the config file at run time
turboquant-pro --config config.yml

# Or configure through environment variables
export TURBOQUANT_PRO_API_KEY="your-key"
export TURBOQUANT_PRO_OUTPUT_DIR="./output"

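📑 README deep dive · Documentation completeness 86/100 · Includes workflow diagram

The following was parsed directly from the GitHub README, preserving code blocks, tables, and list structure.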

TurboQuant Pro

PCA-Matryoshka dimension reduction + TurboQuant scalar quantization for embedding compression, LLM KV caches, model weight pruning, pgvector, FAISS, and NATS transport.

Up to 27x embedding compression at 99.8% recall@10 (with 5x oversampling + reranking — all methods benchmarked identically). At ~30x compression turboquant-pro beats the 2024 SOTA (RaBitQ) on recall and ties OPQ — at 1M-vector scale — while building the index 4–20x faster. Learned codebooks reduce quantization error 22%. 397 tests. Multi-modal (text, vision, audio, code). Production observability. Works on consumer GPUs (Volta+) and CPU.

Important: Cosine similarity to the original vector is not a reliable proxy for retrieval quality at high compression. Our own data shows PCA-256+TQ3 has lower cosine (0.963) but higher recall@10 (78.2%) than PCA-384+TQ3 (0.979 cosine, 76.4% recall). Always evaluate on task-relevant retrieval metrics.
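To make the evaluation advice concrete, here is a minimal sketch of measuring recall@10 with and without oversampling plus reranking. It does not use the turboquant-pro API: `crude_quantize` is a hypothetical stand-in for any lossy compressor, and the data is synthetic.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, vectors, k, candidates=None):
    """Indices of the k most similar vectors, optionally restricted to candidates."""
    ids = range(len(vectors)) if candidates is None else candidates
    return sorted(ids, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]

def recall_at_k(queries, corpus, corpus_lossy, k=10, oversample=5):
    """Shortlist k*oversample items on lossy vectors, rerank on originals, score vs exact top-k."""
    hits = 0
    for q in queries:
        truth = set(top_k(q, corpus, k))
        shortlist = top_k(q, corpus_lossy, k * oversample)
        reranked = top_k(q, corpus, k, candidates=shortlist)
        hits += len(truth & set(reranked))
    return hits / (k * len(queries))

def crude_quantize(v, step=0.5):
    # Hypothetical lossy compressor used only for this illustration.
    return [round(x / step) * step for x in v]

random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(16)] for _ in range(300)]
queries = [[random.gauss(0, 1) for _ in range(16)] for _ in range(10)]
lossy = [crude_quantize(v) for v in corpus]

recall_single = recall_at_k(queries, corpus, lossy, k=10, oversample=1)
recall_rerank = recall_at_k(queries, corpus, lossy, k=10, oversample=5)
print(f"recall@10 single: {recall_single:.3f}  with 5x oversampling + rerank: {recall_rerank:.3f}")
```

Reranking can only recover true neighbours that made it into the shortlist, so the shortlist size is the knob that trades latency for recall.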

What's New in v1.0.0

Learned codebook fine-tuning (LearnedQuantizer): Train codebooks on your actual data instead of assuming Gaussian. fit_codebook(embeddings) returns a ready quantizer. Pushes cosine similarity from 0.978 to 0.99+ at the same bit-width.
Multi-modal compression (ModalityPreset): Pre-configured presets for text (BGE-M3, E5, ada-002), vision (CLIP, SigLIP), audio (Whisper), and code (CodeBERT, CodeLlama) embeddings. Per-modality optimal PCA + bit-width recommendations.
Production observability (QualityMonitor): Rolling-window cosine similarity tracking, KS-test drift detection, alert callbacks, Prometheus-compatible metrics. Know when compression quality degrades in production.
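As an illustration of why a codebook fitted to the data can beat a fixed one, the sketch below runs Lloyd's algorithm on a one-dimensional, skewed sample. It is not the `LearnedQuantizer` implementation: `fit_codebook_1d` is a hypothetical helper, and the uniform grid merely stands in for a codebook chosen without looking at the data.

```python
import random

def nearest(codebook, x):
    return min(range(len(codebook)), key=lambda j: (x - codebook[j]) ** 2)

def mse(codebook, data):
    return sum((x - codebook[nearest(codebook, x)]) ** 2 for x in data) / len(data)

def fit_codebook_1d(data, init, iters=20):
    """Lloyd's algorithm: alternate nearest-level assignment and centroid update."""
    codebook = list(init)
    for _ in range(iters):
        buckets = [[] for _ in codebook]
        for x in data:
            buckets[nearest(codebook, x)].append(x)
        # Empty buckets keep their previous level.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook

random.seed(0)
# Skewed (non-Gaussian) values stand in for real embedding coordinates.
data = [random.expovariate(1.0) - 1.0 for _ in range(2000)]

bits = 3
levels = 2 ** bits
fixed = [-2.0 + 4.0 * (j + 0.5) / levels for j in range(levels)]  # uniform grid on [-2, 2]
learned = fit_codebook_1d(data, init=fixed)

mse_fixed = mse(fixed, data)
mse_learned = mse(learned, data)
print(f"fixed grid MSE: {mse_fixed:.4f}  learned codebook MSE: {mse_learned:.4f}")
```

Because each Lloyd iteration never increases distortion, starting from the fixed grid guarantees the learned codebook is at least as good on the training sample.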

Installation

```bash
pip install turboquant-pro
```

```python
# Build any component (cfg is an AutoConfig instance; see "Auto-Config API")
tq = cfg.build_quantizer()       # TurboQuantKV
cache = cfg.build_cache()        # TurboQuantKVCache
rq = cfg.build_rope_quantizer()  # RoPEAwareQuantizer
mgr = cfg.build_manager()        # TurboQuantKVManager (all layers)
```

Quick Start

```python
from turboquant_pro import PCAMatryoshka, TurboQuantKV

# Fit PCA on a sample of embeddings (5-10K vectors is sufficient)
pca = PCAMatryoshka(input_dim=1024, output_dim=384)
result = pca.fit(sample_embeddings)
print(f"Variance explained: {result.total_variance_explained:.1%}")

# Auto-configure from model name: picks optimal K/V bits and RoPE-awareness
tq = TurboQuantKV.from_model("llama-3-8b")  # balanced (K4/V3)
# Alternative preset:
# tq = TurboQuantKV.from_model("gemma-2-27b", target="compression")  # K3/V2

compressed_k = tq.compress(kv_key_tensor, packed=True, kind="key")    # 4-bit keys
compressed_v = tq.compress(kv_val_tensor, packed=True, kind="value")  # 3-bit values
key_approx = tq.decompress(compressed_k)  # cos_sim > 0.995 (keys)
val_approx = tq.decompress(compressed_v)  # cos_sim > 0.978 (values)
```

Or manually:

```python
tq = TurboQuantKV(head_dim=256, n_heads=16, bits=3, use_gpu=False)
compressed = tq.compress(kv_tensor, packed=True)  # 5.1x smaller
reconstructed = tq.decompress(compressed)         # cos_sim > 0.978
```

Auto-Config API

Auto-detect model architecture and select optimal compression:

```python
from turboquant_pro import AutoConfig

# Works from a HuggingFace config dict too
cfg = AutoConfig.from_dict(model.config.to_dict(), target="compression")
```

Target presets:

Target	Config	CosSim (key / value)	Ratio	Use case
`quality`	K4/V4 + RoPE	0.995	3.8x	Maximum accuracy
`balanced`	K4/V3 + RoPE	0.995 / 0.978	4.3x	Recommended default
`compression`	K3/V2 + RoPE	0.978 / 0.941	5.8x	Memory-constrained
`extreme`	K2/V2	0.941	7.1x	Maximum compression
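A quick back-of-the-envelope check of the ratios above, assuming an fp16 (16-bit) baseline and equally sized key and value tensors; neither assumption is stated in the README. The reported ratios come out somewhat below the ideal ones, presumably because of per-vector metadata, which is an inference rather than a documented fact.

```python
def ideal_kv_ratio(key_bits, value_bits, baseline_bits=16):
    """Compression ratio if keys and values are the same size and only code bits are stored."""
    return baseline_bits / ((key_bits + value_bits) / 2)

# (key_bits, value_bits, ratio reported in the preset table)
presets = {
    "quality": (4, 4, 3.8),
    "balanced": (4, 3, 4.3),
    "compression": (3, 2, 5.8),
    "extreme": (2, 2, 7.1),
}

ideal = {name: ideal_kv_ratio(k, v) for name, (k, v, _) in presets.items()}
for name, (k, v, reported) in presets.items():
    print(f"{name:12s} K{k}/V{v}  ideal {ideal[name]:.2f}x  reported {reported}x")
```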

Supported models: LLaMA 3 (8B, 70B), Gemma 2 (9B, 27B), Gemma 4 27B-A4B (262K context MoE), Qwen 2.5 (7B, 72B), Mistral 7B. Any HuggingFace model works via transformers.AutoConfig.

Eigenvalue-Weighted Mixed Precision (v0.9.0)

Theory: After PCA, early dimensions explain most variance. Spending 4 bits on high-eigenvalue dimensions and 2 bits on the tail gives better quality than uniform 3-bit at the same average storage.

How it works: pca.with_weighted_quantizer(avg_bits=3.0) auto-computes the bit schedule from cumulative variance thresholds (top 60% variance -> 4-bit, next 30% -> 3-bit, bottom 10% -> 2-bit). Each segment gets its own quantizer with the appropriate codebook.

Result: At 2.8 avg bits, beats uniform 3-bit (0.962 vs 0.958) in 7% less storage.
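The thresholding step described above can be sketched as follows. This is a simplified illustration, not the library's `with_weighted_quantizer` logic: `bit_schedule` is a hypothetical helper, the eigenvalue spectrum is synthetic, and how the real implementation reconciles the thresholds with the `avg_bits` target is not shown in the README.

```python
def bit_schedule(eigenvalues):
    """Assign bits per PCA dimension from cumulative variance (eigenvalues sorted descending).

    Dimensions covering the first 60% of variance get 4 bits, the next 30% get 3 bits,
    and the remaining tail gets 2 bits.
    """
    total = sum(eigenvalues)
    schedule, covered = [], 0.0
    for ev in eigenvalues:
        if covered < 0.60 * total:
            schedule.append(4)
        elif covered < 0.90 * total:
            schedule.append(3)
        else:
            schedule.append(2)
        covered += ev
    return schedule

# Synthetic, geometrically decaying spectrum for a 384-dimensional PCA output.
eigenvalues = [0.97 ** i for i in range(384)]
schedule = bit_schedule(eigenvalues)
avg_bits = sum(schedule) / len(schedule)
print({b: schedule.count(b) for b in (4, 3, 2)}, f"avg bits: {avg_bits:.2f}")
```

The faster the spectrum decays, the fewer dimensions fall into the 4-bit segment and the lower the resulting average.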

Unified Auto-Config API (v0.9.1)

How it works: TurboQuantKV.from_model("llama-3-8b") reads head_dim, n_kv_heads, rope_theta, and max_position_embeddings from a built-in model registry (or HuggingFace Hub), then selects optimal key_bits, value_bits, and RoPE-aware settings based on a target preset (quality/balanced/compression/extreme).
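The lookup-then-select flow might look roughly like the sketch below. Everything here is hypothetical: `MODEL_REGISTRY`, `PRESETS`, and `resolve_config` are illustrative names, the preset values are copied from the target table, and the registry entry uses the publicly documented Llama 3 8B configuration rather than anything read from turboquant-pro.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVPreset:
    key_bits: int
    value_bits: int
    rope_aware: bool

# Values taken from the target preset table in this README.
PRESETS = {
    "quality": KVPreset(4, 4, True),
    "balanced": KVPreset(4, 3, True),
    "compression": KVPreset(3, 2, True),
    "extreme": KVPreset(2, 2, False),
}

# Hypothetical built-in registry with a single illustrative entry.
MODEL_REGISTRY = {
    "llama-3-8b": {
        "head_dim": 128,
        "n_kv_heads": 8,
        "rope_theta": 500000.0,
        "max_position_embeddings": 8192,
    },
}

def resolve_config(model_name, target="balanced"):
    """Merge architecture facts from the registry with the chosen preset."""
    arch = MODEL_REGISTRY[model_name]
    preset = PRESETS[target]
    return {
        **arch,
        "key_bits": preset.key_bits,
        "value_bits": preset.value_bits,
        "rope_aware": preset.rope_aware,
    }

config = resolve_config("llama-3-8b")
print(config)
```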

```
Config          Ratio    Cosine   Recall   Var%    Time
PCA-128 + TQ2   113.8x   0.9237   78.7%    79.9%   2.2s
PCA-256 + TQ3    41.0x   0.9700   92.0%    92.3%   0.7s
PCA-384 + TQ4    20.9x   0.9906   96.0%    97.3%   0.6s
PCA-512 + TQ4    15.8x   0.9949   96.3%    99.0%   0.6s

Recommendation (min recall >= 95%): PCA-384 + TQ4: 20.9x compression, 96.0% recall@10
```

Integration Options

Feature Reference

A complete guide to every feature in TurboQuant Pro, the theory behind it, and when to use it.

Component map

flowchart TB
    subgraph API["Public API"]
        AC[AutoConfig.from_pretrained]
        TQ[TurboQuantKV]
        PCA[PCAMatryoshka]
        LQ[LearnedQuantizer]
    end
    subgraph Build["Built by AutoConfig"]
        CACHE[TurboQuantKVCache]
        RQ[RoPEAwareQuantizer]
        MGR[TurboQuantKVManager]
    end
    subgraph Index["Retrieval"]
        HNSW[CompressedHNSW]
        CACHE2[L2 Embedding Cache]
    end
    subgraph Ops["Production"]
        QM["QualityMonitor<br/>drift detection"]
        EXP["Cross-framework Export<br/>FAISS / Milvus / Qdrant / Weaviate"]
    end
    AC --> TQ
    AC --> CACHE
    AC --> RQ
    AC --> MGR
    PCA --> TQ
    LQ --> TQ
    TQ --> HNSW
    TQ --> CACHE2
    TQ --> EXP
    TQ --> QM
    classDef api fill:#e3f2fd,stroke:#1565c0;
    classDef build fill:#fff3e0,stroke:#e65100;
    classDef ops fill:#f3e5f5,stroke:#6a1b9a;
    class AC,TQ,PCA,LQ api;
    class CACHE,RQ,MGR build;
    class HNSW,CACHE2,QM,EXP ops;

Additional Components

Streaming KV Cache (v0.3.0): Two-tier L1 hot / L2 cold cache for autoregressive generation.
NATS Transport Codec (v0.3.0): Compressed wire format for NATS JetStream events (392 bytes vs 4096 bytes per embedding).
vLLM Plugin (v0.5.0): TurboQuantKVManager multi-layer cache for vLLM integration.
FAISS Integration (v0.5.0): TurboQuantFAISS wraps FAISS with auto PCA compression.
Rust pgext (v0.5.0): Native PostgreSQL extension with tqvector type and <=> operator.
Autotune CLI (v0.5.0): turboquant-pro autotune finds optimal compression in ~10 seconds.
Model Weight Compression (v0.6.0-v0.7.0): SVD and activation-space PCA for LLM weight pruning.
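The wire-size figure for the NATS codec is consistent with plain bit-packing: 1024 three-bit codes occupy 384 bytes, and the remaining 8 bytes of the 392 are presumably header or metadata (the actual wire layout is not given in the README). A minimal sketch of packing and unpacking such codes, using hypothetical helper names:

```python
import random

def pack_codes(codes, bits):
    """Pack small integers (each < 2**bits) into a compact little-endian byte string."""
    acc = 0
    for i, code in enumerate(codes):
        acc |= code << (i * bits)
    n_bytes = (len(codes) * bits + 7) // 8
    return acc.to_bytes(n_bytes, "little")

def unpack_codes(blob, bits, count):
    """Inverse of pack_codes."""
    acc = int.from_bytes(blob, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]

random.seed(0)
codes = [random.randrange(8) for _ in range(1024)]  # 3-bit codes for a 1024-dim vector
blob = pack_codes(codes, bits=3)
restored = unpack_codes(blob, bits=3, count=len(codes))
print(len(blob), "bytes packed vs", 1024 * 4, "bytes as float32")
```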

---

# Create the full pipeline: PCA-384 + TurboQuant 3-bit
pipeline = pca.with_quantizer(bits=3)  # ~27x compression

FAISS Integration

Wrap FAISS indices with automatic PCA compression:

from turboquant_pro import PCAMatryoshka
from turboquant_pro.faiss_index import TurboQuantFAISS

pca = PCAMatryoshka(input_dim=1024, output_dim=384)
pca.fit(sample_embeddings)

index = TurboQuantFAISS(pca, index_type="ivf", n_lists=100)
index.add(corpus)  # Auto PCA-compressed
distances, ids = index.search(query, k=10)  # Auto PCA-rotated
print(index.stats())  # 2.7x smaller index

Supports Flat, IVF, and HNSW. Save/load indices to disk.

Native PostgreSQL Extension (Rust + CUDA)

The pgext/ directory contains a native PostgreSQL extension written in Rust (pgrx) that adds the tqvector data type directly to PostgreSQL — no Python needed.

-- Compress your entire table in one command
CREATE TABLE embeddings_tq AS
SELECT id, tq_compress(embedding::float4[], 3) AS tqv
FROM embeddings;

-- Search with cosine distance operator
SELECT id, tqv <=> tq_compress(query::float4[], 3) AS dist
FROM embeddings_tq ORDER BY dist LIMIT 10;

-- Check compression
SELECT tq_dim(tqv), tq_bits(tqv), tq_ratio(tqv) FROM embeddings_tq LIMIT 1;
-- 1024, 3, 10.6

Production benchmark (194K BGE-M3 1024-dim vectors on Atlas):

Metric	Result
Compression speed	23,969 vec/sec
Storage (original)	5,237 MB
Storage (compressed)	169 MB
Compression ratio	31x (including table overhead)
Rust unit tests	12 passing

Build and install:

cd pgext
cargo install cargo-pgrx && cargo pgrx init --pg16 $(which pg_config)
cargo pgrx install --release
psql -c "CREATE EXTENSION tqvector;"

Optional GPU acceleration: cargo build --features gpu (requires CUDA 12.0+, cudarc).

See pgext/README.md for full API documentation.

PostgreSQL integration

```python
# tq is presumably a TurboQuantPGVector instance (see Components)
tq.create_compressed_table(conn, "embeddings_compressed")
tq.insert_compressed(conn, "embeddings_compressed", ids, embeddings)
results = tq.search_compressed(conn, "embeddings_compressed", query, top_k=10)
```

Storage savings (1024-dim BGE-M3, 3-bit, no PCA truncation):

TurboQuant 3-bit alone compresses each vector from 4,096 to ~388 bytes (10.5x):

Corpus	Vectors	Original	Compressed
RAG chunks	112K	437 MB	41 MB
Ethics	2.4M	9,375 MB	893 MB
Publications	824K	3,222 MB	307 MB
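The table can be roughly reproduced from the per-vector sizes. In the sketch below the 388-byte figure is taken from the text above, sizes are interpreted as MiB, and the vector counts are the rounded values from the table, so the results are approximate.

```python
DIM = 1024
BITS = 3

raw_bytes = DIM * 4             # float32 storage per vector
packed_bytes = DIM * BITS // 8  # 3-bit codes only
per_vector = 388                # README figure; the extra bytes are presumably metadata

def mib(n_vectors, bytes_per_vector):
    return n_vectors * bytes_per_vector / 2**20

corpora = [("RAG chunks", 112_000), ("Ethics", 2_400_000), ("Publications", 824_000)]
for name, n in corpora:
    print(f"{name:13s} original ~{mib(n, raw_bytes):,.0f} MiB  compressed ~{mib(n, per_vector):,.0f} MiB")

ratio = raw_bytes / per_vector
print(f"per-vector ratio: {ratio:.2f}x")
```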

Components

Class	Purpose
`PCAMatryoshka`	PCA rotation + truncation for dimension reduction
`PCAMatryoshkaPipeline`	Combined PCA + TurboQuant end-to-end pipeline
`TurboQuantKV`	Stateless compress/decompress with optional bit-packing
`TurboQuantKVCache`	Streaming L1/L2 tiered cache for autoregressive inference
`TurboQuantKVManager`	Multi-layer KV cache manager (vLLM plugin)
`TurboQuantFAISS`	FAISS index wrapper with auto PCA compression
`TurboQuantPGVector`	Compress pgvector embeddings for PostgreSQL storage
`TurboQuantNATSCodec`	Encode/decode embeddings for NATS transport
`run_autotune`	Sweep configs and recommend optimal compression
`ModelCompressor`	SVD analysis + low-rank compression of model FFN weights

vLLM KV Cache Plugin

Multi-layer KV cache manager with hot/cold tiering:

```python
from turboquant_pro.vllm_plugin import TurboQuantKVManager

mgr = TurboQuantKVManager(
    n_layers=32, n_kv_heads=8, head_dim=128, bits=3, hot_window=512
)
```
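For a sense of scale, the sketch below estimates KV cache memory for the configuration above. It assumes the hot window is kept at fp16 and everything older is stored at `bits` per value, with no metadata overhead; the README does not spell out the tiering details, so treat this as an idealized estimate. `kv_cache_bytes` is a hypothetical helper, not part of the plugin.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bits=3, hot_window=512, hot_bits=16):
    """Idealized KV cache size: recent tokens at hot_bits, older tokens at bits per value."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # keys + values
    hot_tokens = min(seq_len, hot_window)
    cold_tokens = seq_len - hot_tokens
    total_bits = values_per_token * (hot_tokens * hot_bits + cold_tokens * bits)
    return total_bits // 8

seq_len = 8192
uncompressed = kv_cache_bytes(seq_len, bits=16)
tiered = kv_cache_bytes(seq_len)
print(f"fp16: {uncompressed / 2**20:.0f} MiB  tiered 3-bit: {tiered / 2**20:.0f} MiB  "
      f"ratio: {uncompressed / tiered:.2f}x")
```

The fp16 hot window is a fixed cost, so the effective ratio approaches the cold-tier ratio only as the sequence grows well beyond `hot_window`.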

Benchmarks vs SOTA (real data, all methods reranked identically)

At 32× compression, recall@10 on real LaBSE / multilingual-Gutenberg embeddings (RESULTS_labse_199k.md, RESULTS_gutenberg_1m.md):

method	recall@10 (single)	recall@10 (+rerank)	index build
PQ	0.467	0.827	142 s
IVF-PQ	0.496	0.756	355 s
RaBitQ (2024 SOTA)	0.630	0.962	0.3 s
OPQ	0.780	0.999	632 s
turboquant-pro	0.784	0.9992	31 s

turboquant-pro beats the 2024 binary-quantization SOTA (RaBitQ) on recall at both operating points and ties OPQ, with an index build 4–20× cheaper than the PQ-family baselines (31 s vs 142–632 s; RaBitQ itself builds in 0.3 s). This holds at 1M scale (tq-pro 0.989 +rerank, tying OPQ). Fast search: the AVX2 ADC kernel (turboquant_pro/_adc/) reproduces this recall (0.9995 +rerank) at 3802 qps, 7.9× faster than naive flat-reconstruct and competitive with ScaNN, at 96 bytes and training-free (see docs/DESIGN_fast_adc.md). Full honest evaluation of every feature: COMPREHENSIVE_ANALYSIS.md.