LLMesh 是 AI Skill Hub 本期精选AI工具之一。综合评分 8.0 分,整体质量较高。我们强烈推荐将其纳入你的 AI 工具库,帮助提升工作效率。
LLMesh 是一款基于 Python 开发的开源工具,专注于 AI、LLM、分布式 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
LLMesh 是一款基于 Python 开发的开源工具,专注于 AI、LLM、分布式 等核心功能。作为 GitHub 开源项目,它拥有活跃的社区支持和持续的版本迭代,代码完全透明可审计,支持本地部署以保护数据隐私。无论是个人使用还是集成到企业工作流,都能提供稳定可靠的解决方案。
# 方式一:pip 安装(推荐)
pip install llmesh
# 方式二:虚拟环境安装(推荐生产环境)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install llmesh
# 方式三:从源码安装(获取最新功能)
git clone https://github.com/qcoda-ai/llmesh
cd llmesh
pip install -e .
# 验证安装
python -c "import llmesh; print('安装成功')"
# 命令行使用
llmesh --help
# 基本用法
llmesh input_file -o output_file
# Python 代码中调用
import llmesh
# 示例
result = llmesh.process("input")
print(result)
# llmesh 配置文件示例(config.yml) app: name: "llmesh" debug: false log_level: "INFO" # 运行时指定配置文件 llmesh --config config.yml # 或通过环境变量配置 export LLMESH_API_KEY="your-key" export LLMESH_OUTPUT_DIR="./output"
LLMesh is a distributed workload orchestration system that routes AI inference tasks to a decentralized network of compute nodes based on hardware fitness and availability. This project establishes the foundation for managing robust, intelligent workloads across multiple environments.

The v0.2 bundle (released 2026-05-29 as internal 0.20.0) lands two headline features alongside a stack of streaming/durability upgrades. Full per-decision detail in CHANGELOG.md.
_run_streaming_mlx() in the agent. Verified end-to-end against osaurus (Apache 2.0 Swift, primary target) on M1 Ultra; mlx-lm.server also works. Set MLX_STREAMING_ENABLED=false to revert. See decisions D059 + D060.StreamBatcher). New since the first release. Three flush triggers (size, time, target PPS) + TPS-driven sliding window that converges to ~8× token aggregation at MLX rates without hurting time-to-first-token. Cuts hub /stream syscall pressure under fast clusters by ~80%. Unified across all three backends (Ollama, vLLM, MLX) — no per-backend streaming divergence. STREAM_BATCH_FIXED=N escape hatch for load testing / debug / conservative production. Per-batch telemetry surfaced agent → hub → dashboard. See decisions D041 (algorithm), D067 (three-backend unification), D068 (telemetry).POST /v1/images/generations + dashboard Image tab. Backend: mflux in-process on Apple Silicon Macs (FLUX-schnell, FLUX-dev). Operator-explicit model install — never auto-downloads weights. Read the BETA + system-requirement advisory in docs/image_gen.md before enabling: 64 GB UMA minimum, do not co-run with other large MLX/LLM workloads (Ollama with a big model loaded, mlx-lm.server, etc.) — co-resident large RSS has triggered a macOS kernel panic on M1 Ultra 64 GB (D083). 128 GB Mac Studio recommended for production. See decisions D064, D071, D073, D083.Other v0.2 wins: Anthropic Messages SSE streaming on /v1/messages (D061), vLLM streaming default ON (D040 + D044), hub state durability for the task queue + node registry (D053 + D058), weighted routing (D054), CSRF on the dashboard (D055), /v1/limits + 256 KB MAX_INPUT_BYTES (D049). Full list: CHANGELOG.md.
If you hit one of these limitations and it's a blocker for your use case, please open an issue — priority shifts based on what the community actually needs.
docker-compose-plugin on Linux).env and server_config.json files (templates below)Hub-side install via Docker Compose. Brings up the FastAPI hub + a Postgres-backed session store. Agents still run on each compute host bare-metal (they need GPU/Ollama/MLX access on the host, so they don't containerise sensibly).
1. Clone the repository and navigate to the project root:
cd llmesh
2. Create and activate a virtual environment: python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
3. Install LLMesh. Pick the install profile that matches your role:
Hub or headless agent (server / Linux deployment):
pip install .
Installs only what the FastAPI hub and the polling agent need. No GUI, no PyInstaller, no macOS-specific packages — safe on Linux servers and inside containers.
Desktop tray client (macOS / Windows developer machine):
pip install '.[desktop]'
Adds pystray + Pillow and the macOS-only pyobjc-* packages (gated by platform markers, so the same command is harmless on Linux).
Building the desktop binary (PyInstaller):
pip install '.[desktop,build]'
Running the test suite:
pip install '.[dev]'
4. Install git hooks — Ledger Law check + gitleaks secrets scan:
bash scripts/install_git_hooks.sh
Pre-commit will refuse to run without gitleaks installed locally. On macOS: brew install gitleaks
See .qcoda/CONVENTIONS.md § Enforcement for the full hook details.
The Hub maintains per-session conversation history so clients do not need to resend the full message history on every turn. History is stored in a local sessions.db SQLite file (created automatically — no setup needed).
How it works: Include an X-Session-ID header in your requests. The Hub returns the assigned session ID in the same response header:
```bash
When Ollama runs a model, it allocates a KV cache sized to the context window at load time. The default context window is 4096 tokens, regardless of what the model was trained on. Models trained on larger contexts (e.g. 32768 tokens for Llama 3) produce a log warning:
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768)
-- the full capacity of the model will not be utilized
This is not a crash — inference still works — but it means the model cannot attend to more than 4096 tokens of input at once. Long prompts, multi-turn conversations, or document-grounded queries will silently truncate if they exceed this limit.
The agent sets num_ctx on every Ollama request via the OLLAMA_NUM_CTX environment variable:
| Env var | Default | Description |
|---|---|---|
OLLAMA_NUM_CTX | 8192 | Context window (tokens) sent to Ollama with every inference call |
Why 8192 and not the full training context?
num_ctx directly controls VRAM allocation — Ollama pre-allocates the full KV cache at load time. Setting it too high can prevent the model from loading at all on machines with limited VRAM:
num_ctx | Approximate KV cache (7B Q4 model) |
|---|---|
| 4096 (Ollama default) | ~0.5 GB |
| **8192 (LLMesh default)** | **~1 GB** |
| 16384 | ~2 GB |
| 32768 | ~4 GB |
8192 covers the vast majority of chat and code tasks. The per-machine override lets high-VRAM nodes serve larger contexts without forcing that requirement on every node in the mesh.
Setting per machine (in the agent's .env):
```bash
Inference events and node snapshots accumulate indefinitely by default. Pruning runs automatically in the hub's cleanup loop. Configure via server_config.json or env vars:
{
"metrics": {
"retention_days_events": 30,
"retention_days_snapshots": 7
}
}
server_config.json key | Env var | Default | Description |
|---|---|---|---|
metrics.retention_days_events | METRICS_RETENTION_DAYS | 30 | Days to retain inference events |
metrics.retention_days_snapshots | SNAPSHOT_RETENTION_DAYS | 7 | Days to retain node snapshots |
GET /version (unauthenticated, D097). Returns {"version": "<APP_VERSION>"} for fast post-deploy verification (curl -s https://mesh.qcoda.com/version) without an API key. /health stays version-less per its CVE-targeting comment; version-string enumeration adds no surface beyond what pip index versions or GitHub Releases already expose./v1/messages (Anthropic endpoint) now forwards tools + tool_choice (D099). Previously the Anthropic path accepted the schema fields but never forwarded them to the agent (text-only response, stop_reason:end_turn). Now forwards both with the same hub-enforced filter/post-validate as the OpenAI path and emits native Anthropic tool_use content blocks (non-streaming + SSE), flipping stop_reason to "tool_use".tools and parses tool_calls/reasoning_content from the upstream response — closing the last silent-drop on non-Ollama nodes. tool_choice stays hub-enforced (not forwarded), matching the Ollama contract.<think>...</think> blocks (MLX) are a separate gap from gpt-oss harmony. Streaming tool_call.arguments arrives as a single synthesized delta (Ollama emits the whole call in one frame), not true per-token incremental deltas. Full breakdown in .qcoda/api.md.Once the Hub is running, it exposes four standard LLM API endpoints (chat, embeddings, Anthropic messages, image generation). Point any compatible SDK at http://localhost:8000 using your API key from server_config.json.
Based on our recent implementation phases, the system is designed around two primary components:
lib/hub/): A FastAPI-based central orchestrator. It manages node registrations, tracks active hardware resources, queues inference requests, and dynamically routes those tasks to the optimal available node. It also provides a web-based dashboard for real-time monitoring of tasks and connected nodes.lib/agent/): A lightweight Python client running on contributor or execution machines. The agent detects locally running inference backends (Ollama, and optionally vLLM/MLX), registers hardware capabilities and available models with the Hub, maintains a heartbeat, and continuously polls for tasks. Upon receiving a task it dispatches to the appropriate local backend and transmits the result back to the Hub.Our project history reflects ongoing architectural evolution, particularly focusing on migrating from a single-owner MVP model to a scalable, multi-tenant SaaS architecture.
Use this when you want SQLite (default), are developing on the hub itself, or are installing the agent on a compute host.
高性能AI推理代理,易于部署
AI Skill Hub 为第三方内容聚合平台,本页面信息基于公开数据整理,不对工具功能和质量作任何法律背书。
建议在沙箱或测试环境中充分验证后,再部署至生产环境,并做好必要的安全评估。
✅ MIT 协议 — 最宽松的开源协议之一,可自由商用、修改、分发,仅需保留版权声明。
经综合评估,LLMesh 在AI工具赛道中表现稳健,质量优秀。如果你已有明确的使用需求,可以直接上手体验;如果还在评估阶段,建议对比同类工具后再做决策。
| 原始名称 | llmesh |
| Topics | AILLM分布式 |
| GitHub | https://github.com/qcoda-ai/llmesh |
| License | MIT |
| 语言 | Python |
收录时间:2026-06-17 · 更新时间:2026-06-17 · License:MIT · AI Skill Hub 不对第三方内容的准确性作法律背书。