能力标签

🤖 Agent 🔄 工作流 🐳 Docker 💻 CLI 🔗 REST API 🧬 Embedding 🧠 Claude ✨ GPT 🖥 本地 LLM

🛠

AI工具

LLMesh

基于 Python · 开源免费，本地部署，数据完全自主可控

英文名：llmesh

⭐ 10 Stars 🍴 1 Forks 💻 Python 📄 MIT 🏷 AI 8.0分

8.0AI 综合评分

AILLM分布式

📺 TG 频道

✦ AI Skill Hub 推荐

LLMesh 是 AI Skill Hub 本期精选AI工具之一。综合评分 8.0 分，整体质量较高。我们强烈推荐将其纳入你的 AI 工具库，帮助提升工作效率。

📚 深度解析

LLMesh 是一款基于 Python 的开源工具，在 GitHub 上收获 0k+ Star，是AI、LLM、分布式领域中的优质开源项目。开源工具的最大优势在于代码完全透明，你可以审计每一行代码的安全性，也可以根据自身需求进行二次开发和定制。

**为什么要使用开源工具而非商业 SaaS？**
对于个人开发者和有隐私需求的用户，本地部署的开源工具意味着数据不离本机，不受第三方服务商的数据政策约束。同时，开源工具通常没有使用次数限制和月度费用，一次安装即可长期使用，对于高频使用场景的总拥有成本（TCO）远低于订阅制商业工具。

**安装与环境准备**
LLMesh 依赖 Python 运行环境。建议通过 pyenv（Python）或 nvm（Node.js）管理 Python 版本，避免全局环境污染。对于新手用户，推荐先创建虚拟环境（python -m venv venv && source venv/bin/activate），再安装依赖，这样即使出现问题也可以随时删除虚拟环境重新开始，不影响系统稳定性。

**社区与维护**
GitHub Issue 和 Discussion 是获取帮助的最快渠道。在提问前建议先检查 Closed Issues（已关闭的问题），大多数常见问题都已有解答。遇到 Bug 时，提供 pip list 的输出、完整错误堆栈和最小可复现示例，能显著提高开发者响应速度。AI Skill Hub 将持续追踪 LLMesh 的版本更新，及时通知重要功能变化。

📋 工具概览

LLMesh 是一款基于 Python 开发的开源工具，专注于 AI、LLM、分布式等核心功能。作为 GitHub 开源项目，它拥有活跃的社区支持和持续的版本迭代，代码完全透明可审计，支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流，都能提供稳定可靠的解决方案。

GitHub Stars

⭐ 10

开发语言

Python

支持平台

Windows / macOS / Linux

维护状态

轻量级项目，按需更新

开源协议

MIT

AI 综合评分

8.0 分

工具类型

AI工具

Forks

📖 中文文档

以下内容由 AI Skill Hub 根据项目信息自动整理，如需查看完整原始文档请访问底部「原始来源」。

📌 核心特色

开源免费，支持本地部署，数据完全自主可控
活跃的 GitHub 开源社区，持续迭代更新
提供详细文档和使用示例，新手友好
支持自定义配置，灵活适配不同使用环境
可作为基础组件集成进现有技术栈或进行二次开发

🎯 主要使用场景

本地部署运行，保护数据隐私，满足合规要求
自定义集成到现有系统，扩展技术栈能力
作为开源基础组件进行商业化二次开发

以下安装命令基于项目开发语言和类型自动生成，实际以官方 README 为准。

安装命令

# 方式一：pip 安装（推荐）
pip install llmesh

# 方式二：虚拟环境安装（推荐生产环境）
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install llmesh

# 方式三：从源码安装（获取最新功能）
git clone https://github.com/qcoda-ai/llmesh
cd llmesh
pip install -e .

# 验证安装
python -c "import llmesh; print('安装成功')"

📋 安装步骤说明

访问 GitHub 仓库页面
按照 README 文档完成依赖安装
根据系统环境完成初始化配置
参考官方示例或文档开始使用
遇到问题可在 GitHub Issues 中查找解答

以下用法示例由 AI Skill Hub 整理，涵盖最常见的使用场景。

常用命令 / 代码示例

# 命令行使用
llmesh --help

# 基本用法
llmesh input_file -o output_file

# Python 代码中调用
import llmesh

# 示例
result = llmesh.process("input")
print(result)

以下配置示例基于典型使用场景生成，具体参数请参照官方文档调整。

配置示例

# llmesh 配置文件示例（config.yml）
app:
  name: "llmesh"
  debug: false
  log_level: "INFO"

# 运行时指定配置文件
llmesh --config config.yml

# 或通过环境变量配置
export LLMESH_API_KEY="your-key"
export LLMESH_OUTPUT_DIR="./output"

📑 README 深度解析真实文档完整度 75/100 查看 GitHub 原文 →

以下内容由系统直接从 GitHub README 解析整理，保留代码块、表格与列表结构。

LLMesh Hub & Agent

LLMesh is a distributed workload orchestration system that routes AI inference tasks to a decentralized network of compute nodes based on hardware fitness and availability. This project establishes the foundation for managing robust, intelligent workloads across multiple environments.

LLMesh demo

What's new in v0.2 (`0.20.0`)

The v0.2 bundle (released 2026-05-29 as internal 0.20.0) lands two headline features alongside a stack of streaming/durability upgrades. Full per-decision detail in CHANGELOG.md.

MLX real per-token streaming, default ON. New _run_streaming_mlx() in the agent. Verified end-to-end against osaurus (Apache 2.0 Swift, primary target) on M1 Ultra; mlx-lm.server also works. Set MLX_STREAMING_ENABLED=false to revert. See decisions D059 + D060.
Adaptive chunked SSE streaming (StreamBatcher). New since the first release. Three flush triggers (size, time, target PPS) + TPS-driven sliding window that converges to ~8× token aggregation at MLX rates without hurting time-to-first-token. Cuts hub /stream syscall pressure under fast clusters by ~80%. Unified across all three backends (Ollama, vLLM, MLX) — no per-backend streaming divergence. STREAM_BATCH_FIXED=N escape hatch for load testing / debug / conservative production. Per-batch telemetry surfaced agent → hub → dashboard. See decisions D041 (algorithm), D067 (three-backend unification), D068 (telemetry).
Image generation v1 — BETA. OpenAI-compatible POST /v1/images/generations + dashboard Image tab. Backend: mflux in-process on Apple Silicon Macs (FLUX-schnell, FLUX-dev). Operator-explicit model install — never auto-downloads weights. Read the BETA + system-requirement advisory in docs/image_gen.md before enabling: 64 GB UMA minimum, do not co-run with other large MLX/LLM workloads (Ollama with a big model loaded, mlx-lm.server, etc.) — co-resident large RSS has triggered a macOS kernel panic on M1 Ultra 64 GB (D083). 128 GB Mac Studio recommended for production. See decisions D064, D071, D073, D083.

Other v0.2 wins: Anthropic Messages SSE streaming on /v1/messages (D061), vLLM streaming default ON (D040 + D044), hub state durability for the task queue + node registry (D053 + D058), weighted routing (D054), CSRF on the dashboard (D055), /v1/limits + 256 KB MAX_INPUT_BYTES (D049). Full list: CHANGELOG.md.

What's explicitly out of scope for v0

Multi-hub clustering / HA
WebSocket or long-poll agent transport
Session history encryption at rest
Dashboard internationalization or theming
Non-English model evaluation

If you hit one of these limitations and it's a blocker for your use case, please open an issue — priority shifts based on what the community actually needs.

Prerequisites

Docker + Docker Compose v2 (Docker Desktop on macOS/Windows, docker-compose-plugin on Linux)
At least one machine running Ollama (or vLLM / MLX) — this is the agent side; not Docker
.env and server_config.json files (templates below)

Prerequisites

Python 3.10+
Ollama running locally (if running an agent node)

Quick Start (Docker — recommended)

Hub-side install via Docker Compose. Brings up the FastAPI hub + a Postgres-backed session store. Agents still run on each compute host bare-metal (they need GPU/Ollama/MLX access on the host, so they don't containerise sensibly).

Installation

1. Clone the repository and navigate to the project root:

   cd llmesh

2. Create and activate a virtual environment:

   python -m venv .venv
   source .venv/bin/activate
   pip install --upgrade pip

3. Install LLMesh. Pick the install profile that matches your role:

Hub or headless agent (server / Linux deployment):

   pip install .

Installs only what the FastAPI hub and the polling agent need. No GUI, no PyInstaller, no macOS-specific packages — safe on Linux servers and inside containers.

Desktop tray client (macOS / Windows developer machine):

   pip install '.[desktop]'

Adds pystray + Pillow and the macOS-only pyobjc-* packages (gated by platform markers, so the same command is harmless on Linux).

Building the desktop binary (PyInstaller):

   pip install '.[desktop,build]'

Running the test suite:

   pip install '.[dev]'

4. Install git hooks — Ledger Law check + gitleaks secrets scan:

   bash scripts/install_git_hooks.sh

Pre-commit will refuse to run without gitleaks installed locally. On macOS:

   brew install gitleaks

See .qcoda/CONVENTIONS.md § Enforcement for the full hook details.

3. Optional: Session Memory & Configuration

The Hub maintains per-session conversation history so clients do not need to resend the full message history on every turn. History is stored in a local sessions.db SQLite file (created automatically — no setup needed).

How it works: Include an X-Session-ID header in your requests. The Hub returns the assigned session ID in the same response header:

```bash

4. Optional: Ollama context window (`OLLAMA_NUM_CTX`)

When Ollama runs a model, it allocates a KV cache sized to the context window at load time. The default context window is 4096 tokens, regardless of what the model was trained on. Models trained on larger contexts (e.g. 32768 tokens for Llama 3) produce a log warning:

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768)
-- the full capacity of the model will not be utilized

This is not a crash — inference still works — but it means the model cannot attend to more than 4096 tokens of input at once. Long prompts, multi-turn conversations, or document-grounded queries will silently truncate if they exceed this limit.

The agent sets num_ctx on every Ollama request via the OLLAMA_NUM_CTX environment variable:

Env var	Default	Description
`OLLAMA_NUM_CTX`	`8192`	Context window (tokens) sent to Ollama with every inference call

Why 8192 and not the full training context?

num_ctx directly controls VRAM allocation — Ollama pre-allocates the full KV cache at load time. Setting it too high can prevent the model from loading at all on machines with limited VRAM:

`num_ctx`	Approximate KV cache (7B Q4 model)
4096 (Ollama default)	~0.5 GB
8192 (LLMesh default)	~1 GB
16384	~2 GB
32768	~4 GB

8192 covers the vast majority of chat and code tasks. The per-machine override lets high-VRAM nodes serve larger contexts without forcing that requirement on every node in the mesh.

Setting per machine (in the agent's .env):

```bash

6. Optional: Metrics retention

Inference events and node snapshots accumulate indefinitely by default. Pruning runs automatically in the hub's cleanup loop. Configure via server_config.json or env vars:

{
    "metrics": {
        "retention_days_events": 30,
        "retention_days_snapshots": 7
    }
}

`server_config.json` key	Env var	Default	Description
`metrics.retention_days_events`	`METRICS_RETENTION_DAYS`	`30`	Days to retain inference events
`metrics.retention_days_snapshots`	`SNAPSHOT_RETENTION_DAYS`	`7`	Days to retain node snapshots

`0.21.1` — Unauthenticated `/version` endpoint

GET /version (unauthenticated, D097). Returns {"version": "<APP_VERSION>"} for fast post-deploy verification (curl -s https://mesh.qcoda.com/version) without an API key. /health stays version-less per its CVE-targeting comment; version-string enumeration adds no surface beyond what pip index versions or GitHub Releases already expose.

`0.21.3` — Multi-backend + Anthropic-endpoint tool coverage

/v1/messages (Anthropic endpoint) now forwards tools + tool_choice (D099). Previously the Anthropic path accepted the schema fields but never forwarded them to the agent (text-only response, stop_reason:end_turn). Now forwards both with the same hub-enforced filter/post-validate as the OpenAI path and emits native Anthropic tool_use content blocks (non-streaming + SSE), flipping stop_reason to "tool_use".
vLLM + MLX backends now forward tools (D099). The agent's vLLM/MLX non-streaming path now sends tools and parses tool_calls/reasoning_content from the upstream response — closing the last silent-drop on non-Ollama nodes. tool_choice stays hub-enforced (not forwarded), matching the Ollama contract.
Still not covered. qwen3-thinking <think>...</think> blocks (MLX) are a separate gap from gpt-oss harmony. Streaming tool_call.arguments arrives as a single synthesized delta (Ollama emits the whole call in one frame), not true per-token incremental deltas. Full breakdown in .qcoda/api.md.

API Endpoints

Once the Hub is running, it exposes four standard LLM API endpoints (chat, embeddings, Anthropic messages, image generation). Point any compatible SDK at http://localhost:8000 using your API key from server_config.json.

Core Project Components

Based on our recent implementation phases, the system is designed around two primary components:

Hub (lib/hub/): A FastAPI-based central orchestrator. It manages node registrations, tracks active hardware resources, queues inference requests, and dynamically routes those tasks to the optimal available node. It also provides a web-based dashboard for real-time monitoring of tasks and connected nodes.
Agent (lib/agent/): A lightweight Python client running on contributor or execution machines. The agent detects locally running inference backends (Ollama, and optionally vLLM/MLX), registers hardware capabilities and available models with the Hub, maintains a heartbeat, and continuously polls for tasks. Upon receiving a task it dispatches to the appropriate local backend and transmits the result back to the Hub.

Our project history reflects ongoing architectural evolution, particularly focusing on migrating from a single-owner MVP model to a scalable, multi-tenant SaaS architecture.