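LLMKube is one of AI Skill Hub's featured AI tools in this issue. It has an overall rating of 7.5, which indicates generally high quality. We recommend adding it to your AI toolkit to help improve your productivity.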
LLMKube is an open-source AI tool: a Kubernetes operator for local LLM inference that supports multiple LLM engines, including llama.cpp, vLLM, TGI, and mlx-s.
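LLMKube is an open-source tool written in Go. Its GitHub topics include ai, apple-silicon, and autoscaling. Because it is an open-source GitHub project under ongoing development, its code is fully transparent and auditable, and it can be deployed locally to keep data private. It can be used on its own or integrated into enterprise workflows.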
```bash
# Option 1: go install
go install github.com/defilantech/LLMKube@latest

# Option 2: build from source
git clone https://github.com/defilantech/LLMKube
cd LLMKube
go build -o llmkube .

# Option 3: download a prebuilt binary
# Visit the Releases page and download the binary for your platform:
# https://github.com/defilantech/LLMKube/releases
```
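```bash
# Show help
llmkube --help

# Basic usage: llmkube <command>, where <command> is deploy, list, status, delete, catalog, cache, or queue
llmkube deploy phi-4-mini

# For detailed usage, see the documentation:
# https://github.com/defilantech/LLMKube
```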
```bash
# llmkube configuration
# Generate an example config file
llmkube --config-example > config.yml

# Common settings
# output_dir: ./output
# log_level: info
# workers: 4

# Environment variable (overrides the config file)
export LLMKUBE_CONFIG="/path/to/config.yml"
```
# LLMKube
Your models. Your hardware. Your rules.
<p> <a href="https://github.com/defilantech/LLMKube/actions/workflows/test.yml"> <img src="https://github.com/defilantech/LLMKube/actions/workflows/test.yml/badge.svg" alt="Tests"> </a> <a href="https://github.com/defilantech/LLMKube/actions/workflows/helm-chart.yml"> <img src="https://github.com/defilantech/LLMKube/actions/workflows/helm-chart.yml/badge.svg" alt="Helm Chart CI"> </a> <a href="https://goreportcard.com/report/github.com/defilantech/llmkube"> <img src="https://goreportcard.com/badge/github.com/defilantech/llmkube" alt="Go Report Card"> </a> <a href="https://github.com/defilantech/LLMKube/releases"> <img src="https://img.shields.io/github/v/release/defilantech/LLMKube?label=version" alt="Version"> </a> <a href="https://github.com/defilantech/LLMKube/stargazers"> <img src="https://img.shields.io/github/stars/defilantech/LLMKube?style=social" alt="GitHub Stars"> </a> <a href="LICENSE"> <img src="https://img.shields.io/badge/license-Apache%202.0-blue.svg" alt="License"> </a> <img src="https://img.shields.io/github/go-mod/go-version/defilantech/LLMKube" alt="Go Version"> <a href="https://discord.gg/Ktz85RFHDv"> <img src="https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white" alt="Discord"> </a> </p>
<p> <a href="#quick-start">Quick Start</a> • <a href="#composition-modelrouter">ModelRouter</a> • <a href="#the-metal-agent">Metal Agent</a> • <a href="#how-is-this-different">Why LLMKube?</a> • <a href="#performance">Benchmarks</a> • <a href="ROADMAP.md">Roadmap</a> • <a href="https://discord.gg/Ktz85RFHDv">Discord</a> </p>
</div>
---
Inference:
- Kubernetes-native CRDs (Model + InferenceService)
- Multiple runtimes: llama.cpp (GGUF), vLLM (HuggingFace + safetensors), and TGI in-cluster; llama-server, mlx-server, and vllm-swift natively on Apple Silicon
- Automatic model download from HuggingFace, HTTP, or PVC (S3 planned)
- Persistent model cache: download once, deploy instantly (guide)
- OpenAI-compatible `/v1/chat/completions` API
- Multi-replica horizontal scaling with scale subresource support (`kubectl scale`, KEDA)
- License compliance scanning for GGUF models
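The license scanning item implies reading license information out of GGUF files. LLMKube's own scanner is not shown here. As a rough sketch of where such a check can get its input, the stdlib-only Python below walks the GGUF metadata header and returns the `general.license` key defined by the GGUF specification. It assumes a little-endian GGUF v2 or v3 file. `read_gguf_license` and its helpers are hypothetical names, and the function only extracts the value; deciding whether that license is acceptable is a separate policy step.

```python
import struct
from typing import BinaryIO, Optional

# GGUF metadata value types (per the GGUF spec), read as little-endian.
_SCALAR_FORMATS = {
    0: "<B",   # uint8
    1: "<b",   # int8
    2: "<H",   # uint16
    3: "<h",   # int16
    4: "<I",   # uint32
    5: "<i",   # int32
    6: "<f",   # float32
    7: "<?",   # bool (1 byte)
    10: "<Q",  # uint64
    11: "<q",  # int64
    12: "<d",  # float64
}
_STRING = 8
_ARRAY = 9


def _read(f: BinaryIO, fmt: str):
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(f: BinaryIO) -> str:
    # v2+ strings: uint64 byte length followed by UTF-8 bytes (no terminator).
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("truncated GGUF string")
    return data.decode("utf-8")


def _read_value(f: BinaryIO, value_type: int):
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        # Arrays: uint32 element type, uint64 count, then the elements.
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    if value_type in _SCALAR_FORMATS:
        return _read(f, _SCALAR_FORMATS[value_type])
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_gguf_license(f: BinaryIO) -> Optional[str]:
    """Return the general.license metadata value of a GGUF v2/v3 file, or None."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit counts and is not handled here")
    _tensor_count = _read(f, "<Q")
    kv_count = _read(f, "<Q")
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.license":
            return value  # stop early; tensor data is never read
    return None
```

Usage would look like `with open("model.gguf", "rb") as f: print(read_gguf_license(f))`. Only the metadata section at the start of the file is read, so even multi-gigabyte models can be checked quickly.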
Routing & policy (ModelRouter):
- ModelRouter CRD: one OpenAI-compatible endpoint, multiple backends (local InferenceServices + external Anthropic / OpenAI / Bedrock / LiteLLM)
- Policy-aware rules: data classification, task complexity, required capabilities, arbitrary header match
- Fail-closed semantics for regulated data: static (apply-time) + runtime (HTTP 503, no cloud egress)
- Per-rule and per-backend timeout budgets (`spec.rules[].timeout` / `spec.backends[].timeout`)
- Half-open circuit breaker with configurable quarantine window
- Audit log on every request: rule, backend, tier, resolved timeout, outcome
- Streaming SSE passthrough from day one
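The half-open circuit breaker listed for ModelRouter is a standard resilience pattern. The Python below is a generic sketch of that pattern, not LLMKube's implementation. It assumes a backend is quarantined after `failure_threshold` consecutive failures, that the "quarantine window" is how long the breaker stays open, and that exactly one trial request is allowed while half-open. The class, method, and parameter names are hypothetical; the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open -> half-open breaker for a single backend (illustrative sketch)."""

    def __init__(self, failure_threshold: int, quarantine_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.quarantine_seconds = quarantine_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.state == "open":
            # Once the quarantine window has elapsed, let one trial request through.
            if self.clock() - self.opened_at >= self.quarantine_seconds:
                self.state = "half-open"
                return True
            return False
        if self.state == "half-open":
            # A trial request is already in flight; hold everything else back.
            return False
        return True

    def record_success(self) -> None:
        # Any success (including the half-open trial) closes the breaker.
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            self._trip()  # the trial failed: start a fresh quarantine window
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

A router would call `allow_request()` before choosing a backend, skip it (or fall back to another rule) when that returns `False`, and report the outcome with `record_success()` or `record_failure()`.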
GPU:
- NVIDIA CUDA (T4, L4, A100, RTX)
- Apple Silicon Metal via Metal Agent (M1-M4)
- Multi-GPU inference for 13B-70B+ models (guide)
- Automatic layer offloading and tensor splitting
- GPU queue management with priority classes
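For multi-GPU inference, one common way to split a model is by layers, in proportion to each GPU's memory. llama.cpp exposes the same idea through the proportions passed to its `--tensor-split` option. The sketch below shows only that arithmetic, using the largest-remainder method so the per-GPU counts always add up to the layer total. It illustrates the general technique rather than LLMKube's offloading logic, and `split_layers`, the layer count, and the VRAM sizes are hypothetical.

```python
import math
from typing import List


def split_layers(n_layers: int, vram_gib: List[float]) -> List[int]:
    """Assign layer counts to GPUs in proportion to their memory (largest remainder)."""
    if n_layers < 0 or not vram_gib or any(v <= 0 for v in vram_gib):
        raise ValueError("need a non-negative layer count and positive VRAM sizes")
    total = sum(vram_gib)
    exact = [n_layers * v / total for v in vram_gib]
    counts = [math.floor(x) for x in exact]
    # Hand the layers lost to rounding down to the GPUs with the largest fractional parts.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For example, an 80-layer model on two 24 GiB GPUs splits evenly, while a 24 GiB card paired with an 8 GiB card takes three times as many layers as the smaller one.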
Operations:
- Full CLI: `llmkube deploy/list/status/delete/catalog/cache/queue`
- Model catalog with 10+ pre-configured models
- Prometheus metrics + OpenTelemetry tracing
- Grafana dashboards for GPU and inference monitoring
- GPU metrics (utilization, temperature, power, memory)
- SLO alerts (GPU health, service availability)
- Custom CA certificates for corporate environments
- Multi-cloud Terraform (GKE, AKS, EKS)
- Cost optimization (spot instances, auto-scale to zero)
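Auto-scale to zero works together with the scale subresource support mentioned above (`kubectl scale`, KEDA). An autoscaler with an average-value target sizes a workload as roughly `ceil(metric total / per-replica target)`, clamped to a min/max. A minimum of 0 lets it drop to zero replicas when the metric is idle. The sketch below shows only that arithmetic. The metric (queued requests), the target, and the min/max defaults are hypothetical, and real KEDA/HPA behavior adds activation thresholds, tolerances, and cooldown periods that are omitted here.

```python
import math


def desired_replicas(queued_requests: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 4) -> int:
    """Replica count from a queue-length metric; min_replicas=0 permits scale to zero."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queued_requests <= 0:
        # Idle: drop to the floor. With min_replicas=0 the workload scales to zero.
        return min_replicas
    # Average-value target: enough replicas that each handles <= target_per_replica.
    wanted = math.ceil(queued_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

Keeping `min_replicas` at 0 trades a cold start (the model must be loaded again on the first request) for not paying for an idle GPU. This is why a persistent model cache matters for scale-to-zero setups.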
---
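Install the CLI:

```bash
brew install defilantech/tap/llmkube
```

Install the operator with Helm:

```bash
helm repo add llmkube https://defilantech.github.io/LLMKube
helm install llmkube llmkube/llmkube --namespace llmkube-system --create-namespace
```

Deploy a model:

```bash
llmkube deploy phi-4-mini
```

Every deployment exposes an OpenAI-compatible API. Use any OpenAI SDK:

```python
from openai import OpenAI

# Point the OpenAI SDK at the deployment's OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://llama-3b-service:8080/v1",
    api_key="not-needed",  # placeholder: no real key is needed, but the SDK requires a value
)

response = client.chat.completions.create(
    model="llama-3b",
    messages=[{"role": "user", "content": "Explain Kubernetes in one sentence"}],
)
print(response.choices[0].message.content)
```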
Works with LangChain, LlamaIndex, and any OpenAI-compatible client library.
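The OpenAI SDK handles streaming for you: pass `stream=True` to `client.chat.completions.create` and iterate over the returned chunks. To show what the SSE passthrough carries on the wire, here is a stdlib-only sketch that pulls text out of OpenAI-style chat-completion stream lines. It assumes the backend emits the usual `data: {json}` events with text in `choices[0].delta.content`, terminated by `data: [DONE]`, and a single choice (`n=1`). `iter_stream_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat-completion SSE lines (assumes n=1)."""
    for raw in lines:
        line = raw.strip()
        # Only "data:" fields carry chunks; blank separators and ":" comments are skipped.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            return
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content")
        if text:  # the first chunk usually carries only the role, not content
            yield text
```

Fed the decoded lines of a streaming HTTP response, `"".join(iter_stream_text(lines))` reassembles the full reply, and printing each piece as it arrives gives the familiar token-by-token output.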
---
<details> <summary><b>Model won't download</b></summary>

```bash
kubectl describe model <model-name>
kubectl logs <pod-name> -c model-downloader
```

Common causes: the HuggingFace URL requires authentication (use direct download links), insufficient disk space, or a network timeout (downloads are retried automatically).
</details>
<details> <summary><b>Pod OOM crash</b></summary>

```bash
# Rule of thumb: memory ≈ model file size x 1.2
llmkube deploy <model> --memory 8Gi
```

</details>
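The OOM rule of thumb (memory ≈ model file size × 1.2) is easy to turn into a `--memory` value. Here is a minimal sketch, assuming you round up to whole GiB and use the Kubernetes `Gi` suffix. `memory_flag_for` is a hypothetical helper, and the 1.2 factor is only the heuristic stated in the troubleshooting tip, not a guarantee against OOM.

```python
import math

GIB = 1024 ** 3


def memory_flag_for(model_file_bytes: int, factor: float = 1.2) -> str:
    """Suggest a --memory value: model file size x factor, rounded up to whole GiB."""
    if model_file_bytes <= 0:
        raise ValueError("model file size must be positive")
    return f"{math.ceil(model_file_bytes * factor / GIB)}Gi"
```

For a local file, `memory_flag_for(os.path.getsize("model.gguf"))` gives the value to pass as `--memory`. A 6 GiB GGUF file, for instance, comes out at `8Gi`, matching the example command.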
<details> <summary><b>GPU not detected</b></summary>

```bash
kubectl get pods -n gpu-operator-resources
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```

</details>
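If the GPU operator and device-plugin pods are running, the next check is whether nodes actually advertise GPUs. The NVIDIA device plugin exposes them as the `nvidia.com/gpu` resource in each node's `status.allocatable`. A small sketch, assuming you feed it the JSON from `kubectl get nodes -o json`; `gpus_per_node` is a hypothetical helper name.

```python
import json
from typing import Dict


def gpus_per_node(nodes_json: str) -> Dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count from `kubectl get nodes -o json`."""
    nodes = json.loads(nodes_json)
    result: Dict[str, int] = {}
    for node in nodes.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without the device plugin simply have no nvidia.com/gpu key.
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result
```

A node that should have GPUs but reports 0 here usually points at the device plugin or driver on that node rather than at LLMKube itself.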
<details> <summary><b>OpenShift / MicroShift / OKD: use the bundled Helm preset</b></summary>
LLMKube is tested in CI against MicroShift to verify the OpenShift SCC admission path end-to-end on every PR. The repo ships a Helm values preset at charts/llmkube/values-openshift.yaml that disables the operator's default fsGroup so the restricted-v2 SCC can inject an appropriate value from the namespace's allocated supplemental-groups range.
Recommended install:

```bash
helm install llmkube ./charts/llmkube \
  -f charts/llmkube/values-openshift.yaml \
  -n llmkube-system --create-namespace
```
That single command produces an LLMKube deployment whose InferenceService pods are admitted cleanly under restricted-v2. The same values-openshift.yaml works on MicroShift, OKD, OpenShift Container Platform, and any other distribution that runs the SCC admission controller with the standard MustRunAs fsGroup strategy.
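The preset works because of how the MustRunAs fsGroup strategy behaves. Each namespace carries a supplemental-groups range in its `openshift.io/sa.scc.supplemental-groups` annotation (for example `1000650000/10000`). If a pod sets no fsGroup, the SCC assigns the minimum of the first range; an explicit fsGroup is admitted only if it falls inside one of the ranges. A fixed operator-wide default is therefore likely to be rejected, while leaving it unset lets restricted-v2 supply a valid value. The Python below is a simplified model of that check for illustration only, not OpenShift's admission code. The function names are hypothetical, and it assumes the `start/length` and `start-end` range formats.

```python
from typing import List, Optional, Tuple


def parse_group_ranges(annotation: str) -> List[Tuple[int, int]]:
    """Parse comma-separated "start/length" or "start-end" blocks into inclusive ranges."""
    ranges: List[Tuple[int, int]] = []
    for block in annotation.split(","):
        block = block.strip()
        if "/" in block:
            start, length = (int(part) for part in block.split("/", 1))
            ranges.append((start, start + length - 1))
        elif "-" in block:
            start, end = (int(part) for part in block.split("-", 1))
            ranges.append((start, end))
        else:
            raise ValueError(f"unrecognized range block: {block!r}")
    return ranges


def admitted_fs_group(requested: Optional[int], annotation: str) -> int:
    """Simplified MustRunAs: default to the first range's minimum, otherwise validate."""
    ranges = parse_group_ranges(annotation)
    if requested is None:
        # No fsGroup in the pod spec: the SCC fills one in from the namespace range.
        return ranges[0][0]
    if any(low <= requested <= high for low, high in ranges):
        return requested
    raise PermissionError(f"fsGroup {requested} is outside {ranges}")
```

In this model, an unset fsGroup always succeeds, which is exactly what disabling the operator's default achieves.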
Per-InferenceService override (fallback for single-tenant cases).
If you would rather pin fsGroup per workload instead of disabling the default operator-wide:
</details>
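LLMKube is a promising open-source AI tool that supports multiple LLM engines and runs as a Kubernetes operator, but it still needs further development and testing.
AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public sources and does not constitute a legal endorsement of the tool's functionality or quality.
We recommend validating it thoroughly in a sandbox or test environment, and completing the necessary security assessment, before deploying it to production.
✅ Apache 2.0: a permissive open-source license that allows commercial use, requires retaining the copyright notice and NOTICE file, and includes a patent grant.
Overall, LLMKube performs solidly among AI tools and is of good quality. If you already have a clear use case, you can try it directly; if you are still evaluating, compare it with similar tools before deciding.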
| Field | Value |
| --- | --- |
| Original name | LLMKube |
| Original description | Open-source AI tool: Kubernetes operator for local LLM inference with llama.cpp, vLLM, TGI, and mlx-s. ⭐106 · Go |
| Topics | ai, apple-silicon, autoscaling, edge-computing, gguf, go |
| GitHub | https://github.com/defilantech/LLMKube |
| License | Apache-2.0 |
| Language | Go |
Added: 2026-05-23 · Updated: 2026-05-23 · License: Apache-2.0 · AI Skill Hub does not legally vouch for the accuracy of third-party content.