LLMesh 是 AI Skill Hub 本期精选AI工具之一。综合评分 8.0 分,整体质量较高。我们强烈推荐将其纳入你的 AI 工具库,帮助提升工作效率。
LLMesh 是一款基于 Python 开发的开源工具,专注于 AI、LLM、分布式 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
LLMesh 是一款基于 Python 开发的开源工具,专注于 AI、LLM、分布式 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 方式一:pip 安装(推荐)
pip install llmesh
# 方式二:虚拟环境安装(推荐生产环境)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install llmesh
# 方式三:从源码安装(获取最新功能)
git clone https://github.com/qcoda-ai/llmesh
cd llmesh
pip install -e .
# 验证安装
python -c "import llmesh; print('安装成功')"
# 命令行使用
llmesh --help
# 基本用法
llmesh input_file -o output_file
# Python 代码中调用
import llmesh
# 示例
result = llmesh.process("input")
print(result)
# llmesh 配置文件示例(config.yml) app: name: "llmesh" debug: false log_level: "INFO" # 运行时指定配置文件 llmesh --config config.yml # 或通过环境变量配置 export LLMESH_API_KEY="your-key" export LLMESH_OUTPUT_DIR="./output"
LLMesh is a distributed workload orchestration system that routes AI inference tasks to a decentralized network of compute nodes based on hardware fitness and availability. This project establishes the foundation for managing robust, intelligent workloads across multiple environments.

The v0.2 bundle (released 2026-05-29 as internal 0.20.0) lands two headline features alongside a stack of streaming/durability upgrades. Full per-decision detail in CHANGELOG.md.
_run_streaming_mlx() in the agent. Verified end-to-end against osaurus (Apache 2.0 Swift, primary target) on M1 Ultra; mlx-lm.server also works. Set MLX_STREAMING_ENABLED=false to revert. See decisions D059 + D060.StreamBatcher). New since the first release. Three flush triggers (size, time, target PPS) + TPS-driven sliding window that converges to ~8× token aggregation at MLX rates without hurting time-to-first-token. Cuts hub /stream syscall pressure under fast clusters by ~80%. Unified across all three backends (Ollama, vLLM, MLX) — no per-backend streaming divergence. STREAM_BATCH_FIXED=N escape hatch for load testing / debug / conservative production. Per-batch telemetry surfaced agent → hub → dashboard. See decisions D041 (algorithm), D067 (three-backend unification), D068 (telemetry).POST /v1/images/generations + dashboard Image tab. Backend: mflux in-process on Apple Silicon Macs (FLUX-schnell, FLUX-dev). Operator-explicit model install — never auto-downloads weights. Read the BETA + system-requirement advisory in docs/image_gen.md before enabling: 64 GB UMA minimum, do not co-run with other large MLX/LLM workloads (Ollama with a big model loaded, mlx-lm.server, etc.) — co-resident large RSS has triggered a macOS kernel panic on M1 Ultra 64 GB (D083). 128 GB Mac Studio recommended for production. See decisions D064, D071, D073, D083.Other v0.2 wins: Anthropic Messages SSE streaming on /v1/messages (D061), vLLM streaming default ON (D040 + D044), hub state durability for the task queue + node registry (D053 + D058), weighted routing (D054), CSRF on the dashboard (D055), /v1/limits + 256 KB MAX_INPUT_BYTES (D049). Full list: CHANGELOG.md.
If you hit one of these limitations and it's a blocker for your use case, please open an issue — priority shifts based on what the community actually needs.
docker-compose-plugin on Linux).env and server_config.json files (templates below)Hub-side install via Docker Compose. Brings up the FastAPI hub + a Postgres-backed session store. Agents still run on each compute host bare-metal (they need GPU/Ollama/MLX access on the host, so they don't containerise sensibly).
1. Clone the repository and navigate to the project root:
cd llmesh
2. Create and activate a virtual environment: python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
3. Install LLMesh. Pick the install profile that matches your role:
Hub or headless agent (server / Linux deployment):
pip install .
Installs only what the FastAPI hub and the polling agent need. No GUI, no PyInstaller, no macOS-specific packages — safe on Linux servers and inside containers.
Desktop tray client (macOS / Windows developer machine):
pip install '.[desktop]'
Adds pystray + Pillow and the macOS-only pyobjc-* packages (gated by platform markers, so the same command is harmless on Linux).
Building the desktop binary (PyInstaller):
pip install '.[desktop,build]'
Running the test suite:
pip install '.[dev]'
4. Install git hooks — Ledger Law check + gitleaks secrets scan:
bash scripts/install_git_hooks.sh
Pre-commit will refuse to run without gitleaks installed locally. On macOS: brew install gitleaks
See .qcoda/CONVENTIONS.md § Enforcement for the full hook details.
The Hub maintains per-session conversation history so clients do not need to resend the full message history on every turn. History is stored in a local sessions.db SQLite file (created automatically — no setup needed).
How it works: Include an X-Session-ID header in your requests. The Hub returns the assigned session ID in the same response header:
```bash
When Ollama runs a model, it allocates a KV cache sized to the context window at load time. The default context window is 4096 tokens, regardless of what the model was trained on. Models trained on larger contexts (e.g. 32768 tokens for Llama 3) produce a log warning:
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768)
-- the full capacity of the model will not be utilized
This is not a crash — inference still works — but it means the model cannot attend to more than 4096 tokens of input at once. Long prompts, multi-turn conversations, or document-grounded queries will silently truncate if they exceed this limit.
The agent sets num_ctx on every Ollama request via the OLLAMA_NUM_CTX environment variable:
| Env var | Default | Description |
|---|---|---|
OLLAMA_NUM_CTX | 8192 | Context window (tokens) sent to Ollama with every inference call |
Why 8192 and not the full training context?
num_ctx directly controls VRAM allocation — Ollama pre-allocates the full KV cache at load time. Setting it too high can prevent the model from loading at all on machines with limited VRAM:
num_ctx | Approximate KV cache (7B Q4 model) |
|---|---|
| 4096 (Ollama default) | ~0.5 GB |
| **8192 (LLMesh default)** | **~1 GB** |
| 16384 | ~2 GB |
| 32768 | ~4 GB |
8192 covers the vast majority of chat and code tasks. The per-machine override lets high-VRAM nodes serve larger contexts without forcing that requirement on every node in the mesh.
Setting per machine (in the agent's .env):
```bash
Inference events and node snapshots accumulate indefinitely by default. Pruning runs automatically in the hub's cleanup loop. Configure via server_config.json or env vars:
{
"metrics": {
"retention_days_events": 30,
"retention_days_snapshots": 7
}
}
server_config.json key | Env var | Default | Description |
|---|---|---|---|
metrics.retention_days_events | METRICS_RETENTION_DAYS | 30 | Days to retain inference events |
metrics.retention_days_snapshots | SNAPSHOT_RETENTION_DAYS | 7 | Days to retain node snapshots |
GET /version (unauthenticated, D097). Returns {"version": "<APP_VERSION>"} for fast post-deploy verification (curl -s https://mesh.qcoda.com/version) without an API key. /health stays version-less per its CVE-targeting comment; version-string enumeration adds no surface beyond what pip index versions or GitHub Releases already expose./v1/messages (Anthropic endpoint) now forwards tools + tool_choice (D099). Previously the Anthropic path accepted the schema fields but never forwarded them to the agent (text-only response, stop_reason:end_turn). Now forwards both with the same hub-enforced filter/post-validate as the OpenAI path and emits native Anthropic tool_use content blocks (non-streaming + SSE), flipping stop_reason to "tool_use".tools and parses tool_calls/reasoning_content from the upstream response — closing the last silent-drop on non-Ollama nodes. tool_choice stays hub-enforced (not forwarded), matching the Ollama contract.<think>...</think> blocks (MLX) are a separate gap from gpt-oss harmony. Streaming tool_call.arguments arrives as a single synthesized delta (Ollama emits the whole call in one frame), not true per-token incremental deltas. Full breakdown in .qcoda/api.md.Once the Hub is running, it exposes four standard LLM API endpoints (chat, embeddings, Anthropic messages, image generation). Point any compatible SDK at http://localhost:8000 using your API key from server_config.json.
Based on our recent implementation phases, the system is designed around two primary components:
lib/hub/): A FastAPI-based central orchestrator. It manages node registrations, tracks active hardware resources, queues inference requests, and dynamically routes those tasks to the optimal available node. It also provides a web-based dashboard for real-time monitoring of tasks and connected nodes.lib/agent/): A lightweight Python client running on contributor or execution machines. The agent detects locally running inference backends (Ollama, and optionally vLLM/MLX), registers hardware capabilities and available models with the Hub, maintains a heartbeat, and continuously polls for tasks. Upon receiving a task it dispatches to the appropriate local backend and transmits the result back to the Hub.Our project history reflects ongoing architectural evolution, particularly focusing on migrating from a single-owner MVP model to a scalable, multi-tenant SaaS architecture.
Use this when you want SQLite (default), are developing on the hub itself, or are installing the agent on a compute host.
LLMesh 是一个分布式工作负载编排系统,旨在将 AI 推理任务根据硬件适配度和可用性,智能路由到去中心化的计算节点网络中。该项目为在多环境下管理稳健且智能的工作负载奠定了基础,通过高效的任务调度,实现计算资源的优化利用。
在 v0.2 版本中,LLMesh 引入了重大更新,包括支持 MLX 的实时 per-token streaming(默认开启),并针对 M1 Ultra 等硬件进行了端到端验证。此外,系统还进行了大量的流式传输与持久化升级,确保了在不同后端环境下的响应性能与数据可靠性。
运行本项目需要满足以下环境要求:首先需安装 Docker 及 Docker Compose v2(macOS/Windows 使用 Docker Desktop,Linux 使用 docker-compose-plugin);其次,Agent 端需要本地运行 Ollama(或 vLLM / MLX)以提供推理能力;此外,还需要准备好 .env 和 server_config.json 配置文件。若进行源码开发,则需 Python 3.10+ 环境。
推荐使用 Docker Compose 进行 Hub 端安装,它会自动部署基于 FastAPI 的 Hub 以及使用 Postgres 作为后端存储的 Session Store。由于 Agent 需要直接访问宿主机的 GPU、Ollama 或 MLX 资源,建议在宿主机上以 bare-metal 方式运行。若需部署 Hub 或 headless agent(服务器/Linux 环境),可通过 pip install . 进行安装。
系统支持会话记忆功能,Hub 会自动在本地 sessions.db (SQLite) 中维护对话历史,客户端只需在请求中携带 X-Session-ID 头部即可实现上下文关联。此外,针对 Ollama 用户,可以通过 OLLAMA_NUM_CTX 参数调整上下文窗口大小,以避免因默认 4096 tokens 限制而导致的模型性能下降或警告。
LLMesh 提供兼容性极高的 API 接口。最新的版本已支持 Anthropic 风格的 /v1/messages 端点,能够完整转发 tools 和 tool_choice 参数,确保 Agent 能够正确处理工具调用。此外,系统还提供了一个无需身份验证的 /version 接口,方便开发者在部署后快速进行版本校验。
LLMesh 的核心架构由两个主要组件构成:1. Hub (lib/hub/):基于 FastAPI 构建的中央编排器,负责管理节点注册、追踪活跃硬件资源、管理推理请求队列,并根据最优策略将任务动态路由至合适的计算节点。
高性能AI推理代理,易于部署
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ MIT 协议 — 最宽松的开源协议之一,可自由商用、修改、分发,仅需保留版权声明。
经综合评估,LLMesh 在AI工具赛道中表现稳健,质量优秀。如果你已有明确的使用需求,可以直接上手体验;如果还在评估阶段,建议对比同类工具后再做决策。
| 原始名称 | llmesh |
| 原始描述 | 开源AI工具:Distributed inference broker for local LLMs. nginx for AI inference.。⭐10 · Python |
| Topics | AILLM分布式 |
| GitHub | https://github.com/qcoda-ai/llmesh |
| License | MIT |
| 语言 | Python |
收录时间:2026-06-17 · 更新时间:2026-06-17 · License:MIT · AI Skill Hub 不对第三方内容的准确性作法律背书。