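AI Skill Hub strongly recommends TurboQuant as a high-quality AI tool. Its AI composite score is 8.0, a solid result among comparable tools. If you are looking for a reliable AI tooling solution, it is worth a closer look.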
TurboQuant is an open-source tool written in Python that focuses on AI, LLMs, and cache compression. As an open-source project on GitHub, it has active community support and ongoing releases, its code is fully transparent and auditable, and it can be deployed locally to protect data privacy. It offers a stable, reliable solution for both personal use and integration into enterprise workflows.
# Option 1: install with pip (recommended)
pip install turboquant-pro
# Option 2: install in a virtual environment (recommended for production)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install turboquant-pro
# Option 3: install from source (for the latest features)
git clone https://github.com/ahb-sjsu/turboquant-pro
cd turboquant-pro
pip install -e .
# Verify the installation
python -c "import turboquant_pro; print('Installation successful')"
# Command-line usage
turboquant-pro --help
# Basic usage
turboquant-pro input_file -o output_file
# Calling from Python code
import turboquant_pro
# Example
result = turboquant_pro.process("input")
print(result)
# turboquant-pro configuration file example (config.yml)
app:
  name: "turboquant-pro"
  debug: false
  log_level: "INFO"
# Specify the configuration file at runtime
turboquant-pro --config config.yml
# Or configure via environment variables
export TURBOQUANT_PRO_API_KEY="your-key"
export TURBOQUANT_PRO_OUTPUT_DIR="./output"
PCA-Matryoshka dimension reduction + TurboQuant scalar quantization for embedding compression, LLM KV caches, model weight pruning, pgvector, FAISS, and NATS transport.
Up to 27x embedding compression at 99.8% recall@10 (with 5x oversampling + reranking — all methods benchmarked identically). At ~30x compression turboquant-pro beats the 2024 SOTA (RaBitQ) on recall and ties OPQ — at 1M-vector scale — while building the index 4–20x faster. Learned codebooks reduce quantization error 22%. 397 tests. Multi-modal (text, vision, audio, code). Production observability. Works on consumer GPUs (Volta+) and CPU.
Important: Cosine similarity to the original vector is not a reliable proxy for retrieval quality at high compression. Our own data shows PCA-256+TQ3 has lower cosine (0.963) but higher recall@10 (78.2%) than PCA-384+TQ3 (0.979 cosine, 76.4% recall). Always evaluate on task-relevant retrieval metrics.
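The distinction above can be made concrete with a small, library-independent sketch that measures both quantities side by side: mean cosine similarity between original and reconstructed vectors, and recall@k of the search results. Everything here is illustrative and not part of turboquant-pro: the function names are hypothetical, the data is random, and coarse rounding stands in for a real lossy codec.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, vectors, k):
    """Indices of the k vectors most similar to the query."""
    scores = [cosine(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k recovered by the approximate top-k."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

def evaluate(corpus, reconstructed, queries, k=10):
    """Report reconstruction cosine and retrieval recall for one codec."""
    recalls = [
        recall_at_k(top_k(q, corpus, k), top_k(q, reconstructed, k))
        for q in queries
    ]
    mean_cos = sum(cosine(o, r) for o, r in zip(corpus, reconstructed)) / len(corpus)
    return {"mean_cosine": mean_cos, "recall_at_k": sum(recalls) / len(recalls)}

rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(20)]
# Stand-in for a lossy codec: round each value to the nearest 0.5 (NOT TurboQuant).
reconstructed = [[round(x * 2) / 2 for x in v] for v in corpus]

report = evaluate(corpus, reconstructed, queries, k=10)
print(report)
```

Running the same `evaluate` over several candidate configurations and ranking them by `recall_at_k` rather than `mean_cosine` is the comparison the note recommends.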
- `LearnedQuantizer`: Train codebooks on your actual data instead of assuming Gaussian. `fit_codebook(embeddings)` returns a ready quantizer. Pushes cosine similarity from 0.978 to 0.99+ at the same bit-width.
- `ModalityPreset`: Pre-configured presets for text (BGE-M3, E5, ada-002), vision (CLIP, SigLIP), audio (Whisper), and code (CodeBERT, CodeLlama) embeddings. Per-modality optimal PCA + bit-width recommendations.
- `QualityMonitor`: Rolling-window cosine similarity tracking, KS-test drift detection, alert callbacks, Prometheus-compatible metrics. Know when compression quality degrades in production.

```bash
pip install turboquant-pro
```
```python
# `cfg` is an AutoConfig instance; see the AutoConfig example further down.
tq = cfg.build_quantizer()       # TurboQuantKV
cache = cfg.build_cache()        # TurboQuantKVCache
rq = cfg.build_rope_quantizer()  # RoPEAwareQuantizer
mgr = cfg.build_manager()        # TurboQuantKVManager (all layers)
```

```python
from turboquant_pro import PCAMatryoshka, TurboQuantKV

# Fit PCA on a sample of your embeddings (`sample_embeddings` is user-supplied).
pca = PCAMatryoshka(input_dim=1024, output_dim=384)
result = pca.fit(sample_embeddings)
print(f"Variance explained: {result.total_variance_explained:.1%}")

# Choose one: settings are picked from the model name and an optional target preset.
tq = TurboQuantKV.from_model("llama-3-8b")  # balanced (K4/V3)
tq = TurboQuantKV.from_model("gemma-2-27b", target="compression")  # K3/V2

# Compress and decompress KV tensors (`kv_key_tensor`, `kv_val_tensor` are user-supplied).
compressed_k = tq.compress(kv_key_tensor, packed=True, kind="key")    # 4-bit keys
compressed_v = tq.compress(kv_val_tensor, packed=True, kind="value")  # 3-bit values
key_approx = tq.decompress(compressed_k)  # cos_sim > 0.995 (keys)
val_approx = tq.decompress(compressed_v)  # cos_sim > 0.978 (values)
```

Or manually:

```python
tq = TurboQuantKV(head_dim=256, n_heads=16, bits=3, use_gpu=False)
compressed = tq.compress(kv_tensor, packed=True)  # 5.1x smaller
reconstructed = tq.decompress(compressed)         # cos_sim > 0.978
```
Auto-detect model architecture and select optimal compression:
```python
from turboquant_pro import AutoConfig

# `model` is an already-loaded HuggingFace model (user-supplied).
cfg = AutoConfig.from_dict(model.config.to_dict(), target="compression")
```
Target presets:
| Target | Config | Key CosSim | Ratio | Use case |
|---|---|---|---|---|
| `quality` | K4/V4 + RoPE | 0.995 | 3.8x | Maximum accuracy |
| `balanced` | K4/V3 + RoPE | 0.995 / 0.978 | 4.3x | **Recommended default** |
| `compression` | K3/V2 + RoPE | 0.978 / 0.941 | 5.8x | Memory-constrained |
| `extreme` | K2/V2 | 0.941 | 7.1x | Maximum compression |
Supported models: LLaMA 3 (8B, 70B), Gemma 2 (9B, 27B), Gemma 4 27B-A4B (262K context MoE), Qwen 2.5 (7B, 72B), Mistral 7B. Any HuggingFace model works via transformers.AutoConfig.
Theory: After PCA, early dimensions explain most variance. Spending 4 bits on high-eigenvalue dimensions and 2 bits on the tail gives better quality than uniform 3-bit at the same average storage.
How it works: pca.with_weighted_quantizer(avg_bits=3.0) auto-computes the bit schedule from cumulative variance thresholds (top 60% variance -> 4-bit, next 30% -> 3-bit, bottom 10% -> 2-bit). Each segment gets its own quantizer with the appropriate codebook.
Result: At 2.8 avg bits, beats uniform 3-bit (0.962 vs 0.958) in 7% less storage.
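The threshold rule described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function name `bit_schedule` is hypothetical, the band for each dimension is decided by the variance already covered before it (one possible reading of the rule), and the sketch does not solve for a target `avg_bits` the way `with_weighted_quantizer` is described as doing.

```python
def bit_schedule(eigenvalues, bands=((0.60, 4), (0.90, 3), (1.00, 2))):
    """Assign a bit-width to each PCA dimension from cumulative explained variance.

    eigenvalues: PCA eigenvalues sorted in descending order.
    bands: (cumulative-variance cutoff, bits) pairs in increasing cutoff order.
    """
    total = sum(eigenvalues)
    bits = []
    covered = 0.0  # share of variance explained by the dimensions seen so far
    for ev in eigenvalues:
        for cutoff, width in bands:
            if covered < cutoff:
                bits.append(width)
                break
        else:
            # Floating-point drift past the last cutoff: use the lowest band.
            bits.append(bands[-1][1])
        covered += ev / total
    return bits

# Toy spectrum (made-up eigenvalues that sum to 100).
schedule = bit_schedule([50, 25, 12, 6, 3, 2, 1, 1])
print(schedule)                       # [4, 4, 3, 3, 2, 2, 2, 2]
print(sum(schedule) / len(schedule))  # 2.75 average bits per dimension
```

With a steeply decaying spectrum, only a few leading dimensions receive 4 bits, so the average stays below 3 bits while the dimensions that carry most of the variance are quantized more finely.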
How it works: TurboQuantKV.from_model("llama-3-8b") reads head_dim, n_kv_heads, rope_theta, and max_position_embeddings from a built-in model registry (or HuggingFace Hub), then selects optimal key_bits, value_bits, and RoPE-aware settings based on a target preset (quality/balanced/compression/extreme).
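As a rough mental model of the lookup described above, the sketch below combines a model registry with the target presets. It is an assumption-laden illustration rather than the library's code: `ModelSpec`, `REGISTRY`, `PRESETS`, and `select_config` are hypothetical names, the single registry entry uses the published Llama 3 8B configuration values, and the bit-widths are copied from the target presets table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    head_dim: int
    n_kv_heads: int
    rope_theta: float
    max_position_embeddings: int

# Hypothetical registry with one illustrative entry (Llama 3 8B config values).
REGISTRY = {
    "llama-3-8b": ModelSpec(
        head_dim=128, n_kv_heads=8, rope_theta=500000.0, max_position_embeddings=8192
    ),
}

# target -> (key_bits, value_bits, rope_aware), following the presets table.
PRESETS = {
    "quality": (4, 4, True),
    "balanced": (4, 3, True),
    "compression": (3, 2, True),
    "extreme": (2, 2, False),
}

def select_config(model_name, target="balanced"):
    """Merge architecture facts with the bit-widths of the chosen preset."""
    spec = REGISTRY[model_name]
    key_bits, value_bits, rope_aware = PRESETS[target]
    return {
        "head_dim": spec.head_dim,
        "n_kv_heads": spec.n_kv_heads,
        "rope_theta": spec.rope_theta,
        "max_position_embeddings": spec.max_position_embeddings,
        "key_bits": key_bits,
        "value_bits": value_bits,
        "rope_aware": rope_aware,
    }

print(select_config("llama-3-8b"))
print(select_config("llama-3-8b", target="extreme"))
```

Keeping architecture facts and compression targets in separate tables means a new model only needs a registry entry, and a new target only needs a preset row.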
```
PCA-128 + TQ2   113.8x   0.9237   78.7%   79.9%   2.2s
PCA-256 + TQ3    41.0x   0.9700   92.0%   92.3%   0.7s
PCA-384 + TQ4    20.9x   0.9906   96.0%   97.3%   0.6s
PCA-512 + TQ4    15.8x   0.9949   96.3%   99.0%   0.6s

Recommendation (min recall >= 95%): PCA-384 + TQ4: 20.9x compression, 96.0% recall@10
```
A complete guide to every feature in TurboQuant Pro, the theory behind it, and when to use it.
- `TurboQuantKVManager`: multi-layer cache for vLLM integration.
- `TurboQuantFAISS` wraps FAISS with auto PCA compression.
- `tqvector` type and `<=>` operator.
- `turboquant-pro autotune` finds optimal compression in ~10 seconds.

---
pipeline = pca.with_quantizer(bits=3) # ~27x compression
Wrap FAISS indices with automatic PCA compression:
from turboquant_pro import PCAMatryoshka
from turboquant_pro.faiss_index import TurboQuantFAISS
pca = PCAMatryoshka(input_dim=1024, output_dim=384)
pca.fit(sample_embeddings)
index = TurboQuantFAISS(pca, index_type="ivf", n_lists=100)
index.add(corpus) # Auto PCA-compressed
distances, ids = index.search(query, k=10) # Auto PCA-rotated
print(index.stats()) # 2.7x smaller index
Supports Flat, IVF, and HNSW. Save/load indices to disk.
The pgext/ directory contains a native PostgreSQL extension written in Rust (pgrx) that adds the tqvector data type directly to PostgreSQL — no Python needed.
-- Compress your entire table in one command
CREATE TABLE embeddings_tq AS
SELECT id, tq_compress(embedding::float4[], 3) AS tqv
FROM embeddings;
-- Search with cosine distance operator
SELECT id, tqv <=> tq_compress(query::float4[], 3) AS dist
FROM embeddings_tq ORDER BY dist LIMIT 10;
-- Check compression
SELECT tq_dim(tqv), tq_bits(tqv), tq_ratio(tqv) FROM embeddings_tq LIMIT 1;
-- 1024, 3, 10.6
Production benchmark (194K BGE-M3 1024-dim vectors on Atlas):
| Metric | Result |
|---|---|
| Compression speed | 23,969 vec/sec |
| Storage (original) | 5,237 MB |
| Storage (compressed) | 169 MB |
| Compression ratio | 31x (including table overhead) |
| Rust unit tests | 12 passing |
Build and install:
cd pgext
cargo install cargo-pgrx && cargo pgrx init --pg16 $(which pg_config)
cargo pgrx install --release
psql -c "CREATE EXTENSION tqvector;"
Optional GPU acceleration: cargo build --features gpu (requires CUDA 12.0+, cudarc).
See pgext/README.md for full API documentation.
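```python
# `tq` is assumed to be a TurboQuantPGVector instance and `conn` an open
# PostgreSQL connection; `ids`, `embeddings`, and `query` are user-supplied.
tq.create_compressed_table(conn, "embeddings_compressed")
tq.insert_compressed(conn, "embeddings_compressed", ids, embeddings)
results = tq.search_compressed(conn, "embeddings_compressed", query, top_k=10)
```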
Storage savings (1024-dim BGE-M3, 3-bit, no PCA truncation):
TurboQuant 3-bit alone compresses each vector from 4,096 to ~388 bytes (10.5x):
| Corpus | Vectors | Original | Compressed |
|---|---|---|---|
| RAG chunks | 112K | 437 MB | 41 MB |
| Ethics | 2.4M | 9,375 MB | 893 MB |
| Publications | 824K | 3,222 MB | 307 MB |
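The per-vector figure quoted above (4,096 bytes down to ~388 bytes) follows from simple bit-packing arithmetic, sketched below. The helper names are hypothetical, and the 4-byte per-vector overhead is an assumption inferred from the gap between 388 bytes and the 384 bytes that 1024 packed 3-bit codes occupy; the actual on-disk layout may differ.

```python
def compressed_bytes(dim, bits, overhead_bytes=4):
    """Packed size of one quantized vector: ceil(dim * bits / 8) plus overhead."""
    return (dim * bits + 7) // 8 + overhead_bytes

def compression_ratio(input_dim, bits, kept_dim=None, overhead_bytes=4):
    """Ratio of float32 storage to packed storage, optionally after PCA truncation."""
    kept = input_dim if kept_dim is None else kept_dim
    return (input_dim * 4) / compressed_bytes(kept, bits, overhead_bytes)

print(compressed_bytes(1024, 3))                           # 388
print(round(compression_ratio(1024, 3), 1))                # 10.6
print(round(compression_ratio(1024, 3, kept_dim=384), 1))  # 27.7
```

The last line shows why adding PCA truncation to 384 dimensions moves 3-bit quantization from roughly 10x to the ~27x range quoted elsewhere in this document.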
| Class | Purpose |
|---|---|
| `PCAMatryoshka` | PCA rotation + truncation for dimension reduction |
| `PCAMatryoshkaPipeline` | Combined PCA + TurboQuant end-to-end pipeline |
| `TurboQuantKV` | Stateless compress/decompress with optional bit-packing |
| `TurboQuantKVCache` | Streaming L1/L2 tiered cache for autoregressive inference |
| `TurboQuantKVManager` | Multi-layer KV cache manager (vLLM plugin) |
| `TurboQuantFAISS` | FAISS index wrapper with auto PCA compression |
| `TurboQuantPGVector` | Compress pgvector embeddings for PostgreSQL storage |
| `TurboQuantNATSCodec` | Encode/decode embeddings for NATS transport |
| `run_autotune` | Sweep configs and recommend optimal compression |
| `ModelCompressor` | SVD analysis + low-rank compression of model FFN weights |
Multi-layer KV cache manager with hot/cold tiering:
```python
from turboquant_pro.vllm_plugin import TurboQuantKVManager

mgr = TurboQuantKVManager(
    n_layers=32, n_kv_heads=8, head_dim=128, bits=3, hot_window=512
)
```
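To get a feel for what these constructor arguments imply for memory, here is a back-of-the-envelope estimate. It rests on assumptions that the source does not confirm: the most recent `hot_window` tokens are kept uncompressed in fp16, older tokens are stored at `bits` bits per value, and per-vector metadata is ignored. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bits=3, hot_window=512, hot_bytes_per_value=2):
    """Estimated KV cache size for one sequence under hot/cold tiering."""
    hot_tokens = min(seq_len, hot_window)  # recent tokens, assumed uncompressed
    cold_tokens = seq_len - hot_tokens     # older tokens, assumed quantized
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # keys + values
    hot = hot_tokens * values_per_token * hot_bytes_per_value
    cold = cold_tokens * values_per_token * bits // 8
    return hot + cold

seq_len = 8192
tiered = kv_cache_bytes(seq_len)
# Setting hot_window to the full length models an entirely uncompressed cache.
full_precision = kv_cache_bytes(seq_len, hot_window=seq_len)
print(tiered, full_precision, round(full_precision / tiered, 2))
# 255852544 1073741824 4.2
```

Under these assumptions the hot window becomes a fixed cost, so the overall saving approaches the pure 3-bit ratio only as the sequence grows much longer than `hot_window`.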
At 32× compression, recall@10 on real LaBSE / multilingual-Gutenberg embeddings (RESULTS_labse_199k.md, RESULTS_gutenberg_1m.md):
| method | recall@10 (single) | recall@10 (+rerank) | index build |
|---|---|---|---|
| PQ | 0.467 | 0.827 | 142 s |
| IVF-PQ | 0.496 | 0.756 | 355 s |
| RaBitQ (2024 SOTA) | 0.630 | 0.962 | 0.3 s |
| OPQ | 0.780 | 0.999 | 632 s |
| **turboquant-pro** | **0.784** | **0.9992** | **31 s** |
turboquant-pro beats the 2024 binary-quantization SOTA (RaBitQ) at both operating points and ties OPQ, at 4–20× lower index build cost — and this holds at 1M scale (tq-pro 0.989 +rerank, tying OPQ). Fast search: the AVX2 ADC kernel (turboquant_pro/_adc/) reproduces this recall (0.9995 +rerank) at 3802 qps — 7.9× faster than naive flat-reconstruct and competitive with ScaNN — at 96 bytes, training-free (see docs/DESIGN_fast_adc.md). Full honest evaluation of every feature: COMPREHENSIVE_ANALYSIS.md.
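The "+rerank" column reflects a two-stage search: oversample candidates using the compressed vectors, then rescore only those candidates at full precision. The sketch below shows that pattern on toy data. It is not the library's search path: the function names are hypothetical, sign-only (1-bit) vectors stand in for the compressed representation, and reranking against the original float vectors is an assumption.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, vectors, k):
    """Indices of the k vectors with the highest inner product."""
    return sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))[:k]

def search_with_rerank(query, approx_vectors, exact_vectors, k=10, oversample=5):
    # Stage 1: cheap scoring against the compressed (approximate) vectors.
    n_candidates = min(len(approx_vectors), k * oversample)
    candidates = top_k(query, approx_vectors, n_candidates)
    # Stage 2: rescore only the candidates against full-precision vectors.
    return sorted(candidates, key=lambda i: -dot(query, exact_vectors[i]))[:k]

rng = random.Random(0)
exact = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(500)]
# 1-bit stand-in for a compressed representation (NOT TurboQuant).
approx = [[1.0 if x >= 0 else -1.0 for x in v] for v in exact]
query = [rng.gauss(0, 1) for _ in range(32)]

truth = set(top_k(query, exact, 10))
single = set(top_k(query, approx, 10))
reranked = set(search_with_rerank(query, approx, exact, k=10, oversample=5))
print("recall@10 single:", len(truth & single) / 10)
print("recall@10 +rerank:", len(truth & reranked) / 10)
```

Reranking cannot lower recall relative to the single-stage result, because every true neighbor found in the first stage is still in the candidate set; the cost is `k * oversample` full-precision score computations per query.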
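A high-performance AI tool for optimizing LLMs and vector databases.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
We recommend validating the tool thoroughly in a sandbox or test environment before deploying it to production, and carrying out any necessary security assessment.
✅ MIT License: one of the most permissive open-source licenses. It allows free commercial use, modification, and distribution, provided the copyright notice is retained.
Overall, TurboQuant is a high-quality AI tool that is reasonably competitive among comparable tools. AI Skill Hub will continue to track its updates; consider bookmarking it and adopting it when it suits your own use case.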
| Original name | turboquant-pro |
| Topics | AI, LLM, cache compression |
| GitHub | https://github.com/ahb-sjsu/turboquant-pro |
| License | MIT |
| Language | Python |
Listed: 2026-06-21 · Updated: 2026-06-21 · License: MIT · AI Skill Hub provides no legal endorsement of the accuracy of third-party content.