{% extends "base.html" %} {% block title %}Multi-LLM Networking — Maxim Docs | Distributed Inference{% endblock %} {% block meta_description %}How Maxim distributes LLM inference across peer machines using Cloudflare tunnels, a stdlib-only reverse proxy, admission control, and lane metrics.{% endblock %} {% block meta_keywords %}Maxim networking, distributed inference, peer mesh, Cloudflare tunnel, LeaderProxy, LLM scaling, multi-GPU, agentic robotics, Maxim peer, admission control, lane metrics{% endblock %} {% block meta_author %}Maxim Project{% endblock %} {% block og_site_name %}Maxim{% endblock %} {% block og_type %}article{% endblock %} {% block structured_data %} {% endblock %} {% block content %}
MAXIM
Distributed Inference, Cloudflare Tunnels, and Cross-Machine Coordination
Maxim's agent pipeline calls the LLM multiple times per cycle: perception, memory consolidation, goal reasoning, execution planning, and statistical review. On a single machine with one GPU, these calls compete for the same inference server. When the model is large or the context window is deep, that bottleneck caps the cycle rate.
Distributed inference solves this by letting peer machines contribute their compute. A laptop on the same network, a desktop in another room, or a cloud VM halfway around the world can all send inference requests to the machine running the GPU. The leader handles the model; peers handle everything else.
The networking layer sits between the LLM router and the inference backend. On the leader, a reverse proxy accepts authenticated requests and forwards them to the local llama-cpp-server. On peers, the router is configured to point at the leader instead of localhost.
Both the leader's own agent loop and remote peers are independent HTTP clients of the same llama-cpp-server. The LeaderProxy authenticates and rate-limits peer traffic before it reaches the backend. The leader's own requests go directly to :8100, bypassing the proxy entirely.
Every Maxim instance operates in one of three roles. The role determines how the LLM router resolves inference endpoints and whether the proxy/tunnel subsystems activate.
| Role | Description | Runs |
|---|---|---|
| Leader | GPU machine. Hosts the model, runs the inference server, proxy, and tunnel. | llama-cpp-server + LeaderProxy + cloudflared + agent loop |
| Peer | Client machine. Sends inference requests to the leader over the tunnel. | agent loop only (LLM router points at leader) |
| Solo | Default. Everything local, no networking. Equivalent to a leader with no peers. | llama-cpp-server + agent loop (no proxy, no tunnel) |
At startup, Maxim's unified role detector checks the following signals in priority order; the first match wins:
1. MAXIM_ROLE=leader|peer|solo environment variable.
2. ~/.config/maxim/config.json::role (the canonical persistent setting via maxim config set role <value>).
3. ~/.config/maxim/mesh.yml implies peer.
4. ~/.cloudflared/config.yml OR .yaml (extension widened in 1.0) OR the systemd path /etc/cloudflared/config.{yml,yaml} implies leader. This is promoted above peer.yml as of 1.0 so a stale peer.yml from earlier exploration doesn't silently override a real leader provisioning signal.
5. ~/.config/maxim/peer.yml (legacy, deprecated as of 1.0, retired in 2.0) implies peer.
6. --llm <local-profile> CLI flag with none of the above signals implies solo.

First-startup peer.yml → config.json auto-migration: when config.json is absent AND peer.yml is present AND cloudflared config is absent, the loader writes a minimal config.json from peer.yml fields on first run. peer.yml is never deleted. When cloudflared is present, migration is skipped so a stale peer.yml from a previous peer setup doesn't auto-flip a leader machine to peer.
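The priority order above can be read as a first-match-wins chain. The following is a minimal sketch of that idea, not Maxim's actual detector: the function name `detect_role`, its parameters, and the choice to return `None` when no signal matches are all assumptions made for illustration. File existence is injected as a callable so the ordering can be exercised without touching the real filesystem.

```python
from pathlib import Path

VALID_ROLES = ("leader", "peer", "solo")

# Paths named in the priority list; the tuple itself is illustrative.
CLOUDFLARED_CONFIGS = (
    "~/.cloudflared/config.yml",
    "~/.cloudflared/config.yaml",
    "/etc/cloudflared/config.yml",
    "/etc/cloudflared/config.yaml",
)


def path_exists(path):
    """Default existence check: expand ~ and ask the filesystem."""
    return Path(path).expanduser().exists()


def detect_role(env, config_role, exists=path_exists, has_local_llm_flag=False):
    """Hypothetical first-match-wins role detection.

    env: mapping of environment variables (e.g. os.environ)
    config_role: value of the role key in config.json, or None if unset
    exists: callable that reports whether a path exists
    has_local_llm_flag: True when --llm <local-profile> was passed
    """
    # 1. Explicit environment override (unknown values fall through: assumption).
    if env.get("MAXIM_ROLE") in VALID_ROLES:
        return env["MAXIM_ROLE"]
    # 2. Persistent setting from config.json.
    if config_role in VALID_ROLES:
        return config_role
    # 3. A mesh config implies peer.
    if exists("~/.config/maxim/mesh.yml"):
        return "peer"
    # 4. Any cloudflared config implies leader; checked before legacy peer.yml.
    if any(exists(p) for p in CLOUDFLARED_CONFIGS):
        return "leader"
    # 5. Legacy peer.yml implies peer.
    if exists("~/.config/maxim/peer.yml"):
        return "peer"
    # 6. A local LLM profile with no other signal implies solo.
    if has_local_llm_flag:
        return "solo"
    return None  # no signal matched; the caller decides the default
```

With both a cloudflared config and a stale peer.yml present, this sketch returns `leader`, which is the behavior the 1.0 reordering is meant to guarantee.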
The LeaderProxy is a stdlib-only reverse proxy that listens on port 8099 and forwards authenticated requests to the local llama-cpp-server on port 8100. It uses only Python's http.server and urllib.request — no third-party dependencies.
- Authorization: Bearer <key> header. Keys are managed with maxim tunnel key rotate.
- X-Request-ID header (UUID4) for end-to-end tracing.
- X-Maxim-Proxy: true, X-Maxim-Request-ID, and X-Maxim-Latency-Ms on every response.

The proxy exposes debug endpoints for operational visibility. All require the same Bearer token as inference requests (or localhost access).
| Endpoint | Purpose |
|---|---|
| /v1/debug/status | Proxy uptime, active connections, backend reachability |
| /v1/debug/heartbeat | Lightweight liveness check (200 OK) |
| /v1/debug/metrics | Request counts, latency percentiles, error rates |
| /v1/debug/last-requests | Ring buffer of recent requests (peer ID, latency, status) |
| /v1/debug/vram | Live VRAM usage (nvidia-smi ratio, spillover/warning flags) + projected model footprint. Returns 503 if no GPU. Prerequisite for capacity-aware routing. |
| /v1/debug/version | Maxim version, git hash, Python version |
| /v1/debug/logs | Recent structured log entries from the ring buffer |
| /v1/debug/deps | Installed Python packages and optional extras |
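To make the proxy's authenticate-then-forward behavior concrete, here is a minimal sketch built only on `http.server` and `urllib.request`, as the description of the LeaderProxy suggests. It is not the LeaderProxy source: `is_authorized`, `proxy_headers`, `make_proxy_handler`, the 401 and 502 error bodies, and the 60-second timeout are all assumptions. Only the header names and the Bearer scheme come from the text above.

```python
import hmac
import time
import urllib.error
import urllib.request
import uuid
from http.server import BaseHTTPRequestHandler


def is_authorized(authorization_header, api_keys):
    """Accept only 'Bearer <key>' where <key> matches a known key."""
    scheme, _, token = (authorization_header or "").partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # Constant-time comparison avoids leaking key contents through timing.
    return any(hmac.compare_digest(token.encode(), key.encode()) for key in api_keys)


def proxy_headers(request_id, started, now):
    """Headers attached to every proxy response."""
    return {
        "X-Maxim-Proxy": "true",
        "X-Maxim-Request-ID": request_id,
        "X-Maxim-Latency-Ms": str(int((now - started) * 1000)),
    }


def make_proxy_handler(backend_url, api_keys):
    """Build a handler class bound to one backend URL and one key set."""

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            started = time.monotonic()
            # Reuse the caller's request ID if present, else mint a UUID4.
            request_id = self.headers.get("X-Request-ID") or str(uuid.uuid4())
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            if not is_authorized(self.headers.get("Authorization"), api_keys):
                self._reply(401, b'{"error": "unauthorized"}', request_id, started)
                return
            upstream = urllib.request.Request(
                backend_url + self.path, data=body, method="POST",
                headers={"Content-Type": "application/json", "X-Request-ID": request_id},
            )
            try:
                with urllib.request.urlopen(upstream, timeout=60) as resp:
                    status, payload = resp.status, resp.read()
            except urllib.error.HTTPError as err:   # backend answered with an error
                status, payload = err.code, err.read()
            except OSError:                         # refused, unreachable, timed out
                status, payload = 502, b'{"error": "backend unreachable"}'
            self._reply(status, payload, request_id, started)

        def _reply(self, status, payload, request_id, started):
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            for name, value in proxy_headers(request_id, started, time.monotonic()).items():
                self.send_header(name, value)
            self.end_headers()
            self.wfile.write(payload)

    return ProxyHandler
```

A handler built this way could be served with `ThreadingHTTPServer(("0.0.0.0", 8099), make_proxy_handler("http://127.0.0.1:8100", keys))`. Streaming responses and the debug endpoints are left out of the sketch.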
The inference server can only handle so many concurrent requests before latency degrades or VRAM is exhausted. Admission control prevents overload by rejecting excess traffic early, before it reaches the backend.
The concurrency limit is configured via MAXIM_PROXY_MAX_CONCURRENT (default: 4). When all slots are occupied, new requests receive a 429 Too Many Requests response with an X-Maxim-Queue-Depth header indicating how many requests are waiting.
The per-peer rate limit is configured via MAXIM_PROXY_RATE_LIMIT_RPM (default: 0, disabled). Each peer is tracked by its API key. Exceeding the limit returns a 429 with a Retry-After header.
| Variable | Default | Description |
|---|---|---|
| MAXIM_PROXY_MAX_CONCURRENT | 4 | Maximum simultaneous requests forwarded to backend |
| MAXIM_PROXY_RATE_LIMIT_RPM | 0 (disabled) | Requests per minute allowed per peer API key (0 = unlimited) |
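The two limits above can be combined in one small gate. This is a sketch under stated assumptions rather than the proxy's real code: the class name `AdmissionController`, the 60-second sliding window, and the `(status, headers)` return shape are hypothetical. It also omits the X-Maxim-Queue-Depth header, because the text does not describe how waiting requests are counted.

```python
import math
import threading
import time
from collections import defaultdict, deque


class AdmissionController:
    """Concurrency slots plus an optional per-key requests-per-minute window."""

    def __init__(self, max_concurrent=4, rate_limit_rpm=0, clock=time.monotonic):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._rpm = rate_limit_rpm          # 0 disables rate limiting
        self._clock = clock                 # injectable for testing
        self._lock = threading.Lock()
        self._recent = defaultdict(deque)   # api_key -> admission timestamps

    def _retry_after(self, api_key):
        """Return seconds to wait if the key is over its limit, else None."""
        if self._rpm <= 0:
            return None
        now = self._clock()
        with self._lock:
            window = self._recent[api_key]
            # Drop admissions that are older than the 60-second window.
            while window and now - window[0] >= 60.0:
                window.popleft()
            if len(window) >= self._rpm:
                return max(1, math.ceil(60.0 - (now - window[0])))
            window.append(now)
            return None

    def try_admit(self, api_key):
        """Return (status, headers). On 200 the caller must call release()."""
        if not self._slots.acquire(blocking=False):
            return 429, {}                  # all concurrency slots are busy
        wait_s = self._retry_after(api_key)
        if wait_s is not None:
            self._slots.release()           # give the slot back before rejecting
            return 429, {"Retry-After": str(wait_s)}
        return 200, {}

    def release(self):
        """Free the slot once the backend response has been relayed."""
        self._slots.release()
```

Rejecting before the request is forwarded is the point of the design: a 429 costs the leader almost nothing, while an over-admitted request ties up the backend.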
On the peer side, the LLM router's retry logic handles 429s gracefully: it reads Retry-After, backs off, and retries. From the agent's perspective, the request is simply slower — the pipeline does not crash.
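A peer-side retry loop of the kind described might look like the following. This is an illustrative sketch, not the router's implementation: `call_with_backoff`, the `send` callable returning `(status, headers, body)`, and the exponential fallback when Retry-After is absent are assumptions. Jitter is left out to keep the behavior easy to follow.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    send: hypothetical callable returning (status, headers, body)
    Returns the final (status, body).
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break                                   # out of attempts, give up
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)              # the server told us how long
        else:
            delay = base_delay * (2 ** attempt)     # fall back to exponential backoff
        sleep(min(delay, max_delay))
    return status, body
```

Because the loop returns the last status instead of raising, a caller that exhausts its attempts still sees an ordinary 429 and can decide how to degrade.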
The WorkerPool routes inference requests through capability tiers: large (14B+ GPU), medium (7B CPU/GPU), and small (1.7B CPU). Functions declare which tier they need via a FunctionRouter with fallback chains. Per-tier performance counters track throughput and latency in real time.
| Metric | Description |
|---|---|
| p50 / p99 latency | Median and tail latency per lane, computed over a sliding window |
| Failure rate | Percentage of requests that returned an error or timed out |
| Token throughput | Tokens per second generated, per lane |
The MetricsRegistry is a singleton shared between the agent runtime and the LeaderProxy. Both write to the same counters, giving a unified view of local and remote load. These metrics feed into maxim doctor, which flags lanes with elevated failure rates or latency spikes.
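As a rough illustration of how per-lane figures like these could be derived from a sliding window, consider the sketch below. The class `LaneWindow`, the window size, the nearest-rank percentile method, and the definition of throughput as tokens divided by summed request time are all assumptions; the real MetricsRegistry may compute them differently.

```python
import math
from collections import deque


class LaneWindow:
    """Fixed-size sliding window of recent requests for one lane."""

    def __init__(self, size=256):
        # Each sample is (latency_ms, ok, tokens); old samples fall off the left.
        self._samples = deque(maxlen=size)

    def record(self, latency_ms, ok=True, tokens=0):
        self._samples.append((latency_ms, ok, tokens))

    def percentile(self, p):
        """Nearest-rank percentile of latency, or None if the window is empty."""
        latencies = sorted(sample[0] for sample in self._samples)
        if not latencies:
            return None
        rank = max(1, math.ceil(p / 100 * len(latencies)))
        return latencies[rank - 1]

    def snapshot(self):
        count = len(self._samples)
        failures = sum(1 for _, ok, _ in self._samples if not ok)
        total_ms = sum(latency for latency, _, _ in self._samples)
        tokens = sum(tok for _, _, tok in self._samples)
        return {
            "p50_ms": self.percentile(50),
            "p99_ms": self.percentile(99),
            "fail_pct": round(100 * failures / count, 1) if count else 0.0,
            "tok_per_sec": round(tokens / (total_ms / 1000), 1) if total_ms else 0.0,
        }
```

A bounded `deque` keeps memory constant per lane and makes the statistics reflect recent load rather than the whole process lifetime.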
# View lane metrics from the proxy debug endpoint
curl -s -H "Authorization: Bearer $MAXIM_API_KEY" \
  https://maxim.yourdomain.com/v1/debug/metrics | python -m json.tool
# Example output
{
  "lanes": {
    "large": { "p50_ms": 142, "p99_ms": 890, "fail_pct": 0.3, "tok_per_sec": 48.2 },
    "medium": { "p50_ms": 98, "p99_ms": 520, "fail_pct": 0.0, "tok_per_sec": 31.7 },
    "small": { "p50_ms": 67, "p99_ms": 310, "fail_pct": 0.1, "tok_per_sec": 22.4 }
  },
  "uptime_s": 3847,
  "total_requests": 1294
}
The heartbeat subsystem runs as a daemon thread that samples system vitals every 10 seconds. It provides early warning when hardware resources are constrained or the agent loop has stalled.
# Enable heartbeat logging
export MAXIM_HEARTBEAT=1
# Or enable lane trace (which also enables heartbeat)
export MAXIM_LANE_TRACE=1
# Heartbeat output in the log
[heartbeat] gpu=72% vram=5.1/8.0GB cpu=2.4 ram=61% disk=42GB loop=+0.8s lanes=3/0/1
Stall detection is particularly useful for debugging distributed setups. If the agent loop blocks on a remote inference call that the leader has rate-limited, the heartbeat will flag the idle gap before the user notices the pause.
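The stall-detection idea can be sketched with a daemon thread and a timestamp that the agent loop refreshes. Everything here is hypothetical: the `Heartbeat` class, the `loop_tick` hook, the 30-second stall threshold, and the reading of `loop=+0.8s` as "seconds since the last loop iteration" are assumptions made for illustration. GPU, RAM, and disk sampling are omitted.

```python
import threading
import time


class Heartbeat:
    """Periodically report how long the agent loop has been idle."""

    def __init__(self, interval_s=10.0, stall_after_s=30.0,
                 clock=time.monotonic, emit=print):
        self._interval_s = interval_s
        self._stall_after_s = stall_after_s   # hypothetical threshold
        self._clock = clock
        self._emit = emit
        self._last_tick = clock()
        self._stop = threading.Event()

    def loop_tick(self):
        """Called by the agent loop once per cycle."""
        self._last_tick = self._clock()

    def sample(self):
        """Build and emit one heartbeat line; flag a stall if idle too long."""
        idle_s = self._clock() - self._last_tick
        line = f"[heartbeat] loop=+{idle_s:.1f}s"
        if idle_s > self._stall_after_s:
            line += " STALL"
        self._emit(line)
        return line

    def start(self):
        """Run sample() every interval on a daemon thread."""
        def run():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._interval_s):
                self.sample()
        thread = threading.Thread(target=run, name="heartbeat", daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

Because the sampler runs on its own thread, it keeps reporting while the agent loop is blocked on a slow or rate-limited remote call, which is what makes the idle gap visible.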
Setting up a peer takes three steps: install Maxim on the peer machine, connect to the leader, and verify the link.
# On the peer machine
maxim peer connect https://maxim.yourdomain.com/v1
# This prompts for the API key (generated on the leader with `maxim tunnel key rotate`)
# and writes ~/.config/maxim/peer.yml

# Quick connectivity test (runs from the peer, no full agent runtime needed)
maxim peer test https://maxim.yourdomain.com/v1
# Expected output:
# Connecting to https://maxim.yourdomain.com/v1 ...
# Auth ............ OK (Bearer token accepted)
# Heartbeat ....... OK (proxy alive, 1294 requests served)
# Inference ....... OK (model loaded, 48 tok/s)
# Latency ......... 23ms round-trip

# Enable lane tracing to see which requests go remote
export MAXIM_LANE_TRACE=1
maxim --language-model mistral-7b
# Trace output shows remote routing:
# [lane:large] POST /v1/chat/completions -> remote (23ms RTT, 142ms total)

# ~/.config/maxim/peer.yml
url: https://maxim.yourdomain.com/v1
api_key: mk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Optional fields:
# model: mistral-7b (overrides /v1/models default)
# is_cloud: true (forces cloud-lane gate for public URLs)
Most networking issues fall into a few categories. Start with maxim doctor — it checks tunnel status, proxy reachability, auth validity, key hygiene, inference coherence, and system resources automatically. Use --json for CI integration or --as peer <url> to diagnose from the peer's perspective.
When running Maxim as a peer pointed at a remote leader, doctor auto-detects the role (or use --as peer) and runs connectivity-specific checks instead of tunnel/key setup:
# Full peer diagnostic with retry support
maxim doctor --as peer https://maxim.yourdomain.com/v1
# JSON output for scripts
maxim doctor --json --as peer https://maxim.yourdomain.com/v1
# Quick connectivity test (minimal, no retry)
maxim peer test https://maxim.yourdomain.com/v1
Peer checks cover: DNS resolution, URL reachability, API key validation, auth verification, model availability, and round-trip latency (p50/p95 from 5 probes). Fix hints point at the leader machine, not the peer.
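The round-trip latency figure from a handful of probes can be approximated as below. This is a sketch only: `probe_latency` and its injected `probe` callable are hypothetical, and the nearest-rank p95 is an assumption about how doctor summarizes its samples. With five samples, a nearest-rank p95 is simply the slowest probe.

```python
import math
import statistics
import time


def probe_latency(probe, count=5, clock=time.perf_counter):
    """Time `count` calls to probe() and summarize them in milliseconds.

    probe: hypothetical zero-argument callable that performs one round trip
    """
    samples_ms = []
    for _ in range(count):
        start = clock()
        probe()
        samples_ms.append((clock() - start) * 1000.0)
    samples_ms.sort()
    p95_rank = max(1, math.ceil(0.95 * len(samples_ms)))   # nearest-rank method
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_rank - 1],
    }
```

In practice `probe` could be a function that issues one authenticated request to the leader's heartbeat endpoint with `urllib.request`.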
If inference is unexpectedly hitting a cloud API (Anthropic/OpenAI) instead of the peer leader, check the provider priority in ~/.maxim/config/llm.json. The router picks the highest-priority provider that's available. Ensure the local/peer provider is ranked above cloud providers for the model you're using.
The energy tracker applies a cost multiplier per provider. If the peer endpoint is not registered as a local-class provider, the router may reject requests that exceed the per-cycle cost budget. Fix by setting the provider class to local in llm.json.
Cloudflare's Web Application Firewall can return a 403 Forbidden for automated requests. If maxim peer test shows a 403 with an HTML body mentioning "Just a moment," Bot Fight Mode is interfering. Disable it in the Cloudflare dashboard under Security → Bots, or add a WAF exception rule for the tunnel hostname.
After rotating the Cloudflare tunnel or changing the hostname, DNS propagation can take up to 5 minutes. If maxim peer test times out but the leader's cloudflared shows no incoming connections, flush DNS on the peer machine and retry.
maxim doctor now checks API key age (warns after 90 days), file permissions (fails if world-readable on POSIX), and runs an auth smoke test that verifies the server accepts the real key and rejects bogus ones. If auth smoke reports "server accepts ANY key," your tunnel is bypassing the LeaderProxy — route through port 8099 instead of 8100.
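The age and permission checks can be expressed with `os.stat` alone. The sketch below is an assumption-laden illustration, not doctor's code: `check_key_file`, the use of file modification time as a proxy for key age, and the wording of the findings are hypothetical. Only the 90-day threshold and the world-readable rule come from the text.

```python
import os
import stat
import time

MAX_KEY_AGE_DAYS = 90


def check_key_file(path, now=None):
    """Return a list of (level, message) findings for one key file.

    Key age is approximated by the file's modification time (assumption).
    """
    findings = []
    info = os.stat(path)
    now = time.time() if now is None else now
    age_days = (now - info.st_mtime) / 86400
    if age_days > MAX_KEY_AGE_DAYS:
        findings.append(("warn", f"key is {age_days:.0f} days old; consider rotating"))
    # The permission bit is only meaningful on POSIX systems.
    if os.name == "posix" and info.st_mode & stat.S_IROTH:
        findings.append(("fail", "key file is world-readable"))
    return findings
```

An empty list means both checks passed. A file with mode 0644 produces a `fail` finding on POSIX, and tightening it to 0600 clears it.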
Doctor sends a fixed prompt ("What is 2+2?") and checks for "4" in the response. A wrong answer suggests the model is misconfigured, corrupted, or loaded in the wrong quantization. This catches silent failures where the server responds 200 but produces gibberish.
For deeper diagnostic procedures, see docs/troubleshooting/ in the repository. Use maxim doctor --json to generate machine-readable output for support bundles or CI pipelines.
The networking layer has a clear progression from the current manual-setup model toward automatic discovery and intelligent routing.
Automatic peer discovery on the local network using mDNS/DNS-SD. A Maxim instance broadcasts a _maxim-llm._tcp service record; peers discover it without manual endpoint configuration. Falls back to the current explicit peer.yml approach on networks where mDNS is blocked. Now part of the Agent Mesh plan (Phase 0a).
Smart request routing when multiple inference backends are available (local GPU + cloud API + peer leader). The InferenceRouter selects the backend per-request based on lane metrics (latency, queue depth, failure rate), cost constraints, and model compatibility. Now part of the Agent Mesh plan (Phase 0b).
maxim peer update auto-detects the leader's install mode. Pip-installed leaders upgrade via PyPI (--version 0.3.1 to pin). Git-checkout leaders pull + reinstall (--dev [branch] to force git mode). Installed extras are auto-detected and preserved during pip upgrades. maxim peer restart soft-restarts the leader via os.execv (same PID, clean import cycle).
maxim peer llm <model> swaps the running LLM without restarting the Maxim process. Stops the current llama-cpp-server, starts a new one with the requested model, and health-checks it. The choice persists across restarts. maxim peer llm --status shows the active model, uptimes, and GPU utilization.
Cloud LLMs (Claude, GPT-4o) can be added as fallback engines or dedicated lane providers. --cloud-fallback claude-sonnet adds Claude as a fallback when the self-hosted model fails. --cloud-lane review claude-haiku assigns a cloud model to a specific lane. Cost tracking, redaction gates, and session budgets enforce safety.