经 AI Skill Hub 精选评估,TurboQuant-MLX 获评「强烈推荐」。这款AI工具在功能完整性、社区活跃度和易用性方面表现出色,AI 评分 8.0 分,适合有一定技术背景的用户使用。
TurboQuant-MLX 是一款基于 Python 开发的开源工具,专注于 apple-silicon、kv-cache、llm 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
TurboQuant-MLX 是一款基于 Python 开发的开源工具,专注于 apple-silicon、kv-cache、llm 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 方式一:pip 安装(推荐)
pip install turboquant-mlx
# 方式二:虚拟环境安装(推荐生产环境)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install turboquant-mlx
# 方式三:从源码安装(获取最新功能)
git clone https://github.com/manjunathshiva/turboquant-mlx
cd turboquant-mlx
pip install -e .
# 验证安装
python -c "import turboquant_mlx; print('安装成功')"
# 命令行使用
turboquant-mlx --help
# 基本用法
turboquant-mlx input_file -o output_file
# Python 代码中调用
import turboquant_mlx
# 示例
result = turboquant_mlx.process("input")
print(result)
# turboquant-mlx 配置文件示例(config.yml) app: name: "turboquant-mlx" debug: false log_level: "INFO" # 运行时指定配置文件 turboquant-mlx --config config.yml # 或通过环境变量配置 export TURBOQUANT_MLX_API_KEY="your-key" export TURBOQUANT_MLX_OUTPUT_DIR="./output"
Extreme weight and KV cache compression for LLMs on Apple Silicon. MLX implementation of Google's TurboQuant (Zandieh et al., 2025) — Hadamard rotation + Lloyd-Max codebooks applied both to weights (compile time) and the KV cache (run time).
Supports dense models (LLaMA, Qwen, Mistral), Mixture-of-Experts (Qwen-MoE, GPT-OSS, Qwen3.5-MoE, Qwen3.6-35B-A3B, Qwen3-235B-A22B, DeepSeek-V2/V3), and Mamba/attention hybrids (Nemotron-3-Nano-4B, Nemotron-3-Super-120B). Compatible with hybrid attention architectures, attention sinks, sliding-window attention, and linear attention layers.
**With both weight and KV cache compression at 3-bit, GPT-OSS-120B fits its full 131K context window in 50 GB on a 64 GB MacBook — and KV cache compression actually makes generation faster on the 120B (8.7 vs 6.4 tok/s) because the smaller cache cuts memory bandwidth more than dequant costs.**
Expert streaming (v0.4.0) runs MoE models whose weights exceed available RAM by paging only the router-selected experts from disk per token — e.g. the 35B-parameter Qwen3.6-35B-A3B runs on a 16 GB Mac mini in under 4 GB of RAM, with output bit-identical to the fully-resident model. See Qwen3.6-35B-A3B on a 16 GB Mac mini.
Local coding — Qwen3.6-27B, a dense SWE-bench-grade coder, runs fully resident on a 48 GB Mac at 3-bit (~13 GB on disk, ~17.5 GB at runtime) and serves to Cursor / VS Code over an OpenAI-compatible endpoint. See Qwen3.6-27B.
The Metal kernels are JIT-compiled by MLX at first use, so no Xcode / CMake toolchain is required to install the package.
pip install turboquant-mlx-full
The package is published as turboquant-mlx-full on PyPI, but importable as turboquant_mlx (without the -full suffix) — this matches the original project name and the examples in the Medium articles.
import turboquant_mlx
from turboquant_mlx.layers import TurboQuantKVCache, convert_cache_to_turboquant
git clone https://github.com/manjunathshiva/turboquant-mlx.git
cd turboquant-mlx
pip install -e .
For evaluation utilities (perplexity benchmarking), also install the optional dependencies:
pip install "turboquant-mlx-full[eval]"
cache = make_prompt_cache(model)
```python from turboquant_mlx.layers import convert_cache_to_turboquant from mlx_lm.models.cache import make_prompt_cache
python -m turboquant_mlx.generate_vlm \ --model ./diffusiongemma-26B-A4B-it-tq3-g32 \ --prompt "Write a short paragraph about the ocean." --max-tokens 256 ```
python -m turboquant_mlx.convert --help
Options:
--hf-path TEXT HuggingFace model path or local path (required)
--mlx-path TEXT Output directory (default: mlx_model)
--bits {2,3,4} Quantization bit-width (default: 3)
--group-size {32,64,128} Elements per quantization group (default: 64)
--rotation TEXT Rotation method: hadamard, blockwise_hadamard, none
--use-qjl Enable 1-bit QJL residual correction (+1 bit overhead)
--dtype TEXT Model dtype before quantization: float16, bfloat16
turboquant-serve wraps mlx_lm.server and patches its loader so any TurboQuant model (quantization.mode = "turboquant" in config.json) loads through the PolarQuant path. Non-TurboQuant models pass through unchanged, so this is a drop-in replacement for mlx_lm.server.
```bash
turboquant-generate exposes the same controls:
turboquant-generate --model ./model-tq3 --prompt "..." \
--kv-k-bits 8 --kv-v-bits 3 \
--kv-min-tokens 128 \
--kv-group-size 64
| Flag | Purpose |
|---|---|
--kv-bits N | Symmetric K=V=N (legacy v0.1) |
--kv-k-bits / --kv-v-bits | Mixed precision (v0.2 recommended) |
--kv-min-tokens N | Keep the first N cached tokens in fp16 (sink protection) |
--kv-group-size N | Hadamard rotation group size (default 64) |
python -m turboquant_mlx.benchmarks.demo_kv_v02 \ --model ./gpt-oss-120b-tq3 \ --prompt "Why is the sky blue?" \ --max-tokens 1024 --temp 0.7 --top-p 0.9 --repetition-penalty 1.1 ```
高性能LLM计算加速工具,苹果芯片优化
该工具使用 NOASSERTION 协议,商用场景请仔细阅读协议条款,必要时咨询法律意见。
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
📄 NOASSERTION — 请查阅原始协议条款了解具体使用限制。
AI Skill Hub 点评:TurboQuant-MLX 的核心功能完整,质量优秀。对于AI 技术爱好者来说,这是一个值得纳入个人工具库的选择。建议先在非生产环境试用,再逐步推广。
| 原始名称 | turboquant-mlx |
| Topics | apple-siliconkv-cachellmmlxquantization |
| GitHub | https://github.com/manjunathshiva/turboquant-mlx |
| License | NOASSERTION |
| 语言 | Python |
收录时间:2026-06-20 · 更新时间:2026-06-20 · License:NOASSERTION · AI Skill Hub 不对第三方内容的准确性作法律背书。