observability-mcp

The unified observability gateway for AI agents.

What Grafana did for dashboards, we do for AI agents.

npm: @thotischner/observability-mcp
Docker: ghcr.io/thotischner/observability-mcp

The problem

Every observability vendor ships its own MCP server.

Prometheus MCP

One vendor. PromQL only.

Datadog MCP

Another silo. Different schema.

Grafana MCP

Third process. Third config.

Elastic MCP

Yet another one.

Loki MCP

...

N more

Tomorrow's stack.

Agents that reason across systems juggle N disconnected servers.
There is no unified abstraction layer.

Background

What is MCP?

Model Context Protocol — Anthropic, open spec, 2024.
A standard way for AI agents to talk to external systems.

Why it matters

  • Agents don't ship with knowledge of your infrastructure.
  • Without context, every prompt is a guess.
  • MCP gives them tools they can call.
  • Tool results land in the conversation as new context.

Three primitives

  • Tools — functions the agent can invoke (query metrics, search logs...)
  • Resources — data the agent can read (file contents, configs...)
  • Prompts — reusable templates the user can pick
Background

MCP at a glance

sequenceDiagram
  autonumber
  participant U as User
  participant C as AI Agent (MCP Client)
  participant S as MCP Server
  participant B as Backend
  U->>C: "Why is the API slow?"
  C->>S: tools/list
  S-->>C: [query_metrics, query_logs, ...]
  C->>S: query_metrics(service=api, metric=latency)
  S->>B: PromQL query
  B-->>S: time-series data
  S-->>C: { values, summary }
  C->>U: "p99 spiked 3x at 14:02, checking logs..."

Streamable HTTP transport • JSON-RPC 2.0 • One round-trip per tool call
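
As a rough illustration of what one of those round-trips carries, here is a minimal sketch of the JSON-RPC 2.0 body a client would send for the query_metrics call in the diagram. The tools/call method and the name/arguments params follow the MCP spec; the helper name buildToolCall and the argument values are illustrative assumptions, not part of this project.

```typescript
// JSON-RPC 2.0 envelope for an MCP tool invocation.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: builds the request body for one tool call.
function buildToolCall(
  id: number,
  toolName: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Argument names mirror the diagram; the real tool schema may differ.
const request = buildToolCall(1, "query_metrics", { service: "api", metric: "latency" });
console.log(JSON.stringify(request, null, 2));
```

A client would POST this body to the server's MCP endpoint; discovery works the same way with the method set to tools/list.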

The solution

One MCP. Any backend. Pluggable.

What you connect to

  • Prometheus / Mimir / AMP
  • Loki (self-hosted & managed)
  • Your future backend (one interface)

What you get

  • Single MCP endpoint
  • Cross-signal analysis (z-score, health scoring)
  • Web UI for sources, services, health

10-second start

npx @thotischner/observability-mcp
# open http://localhost:3000

Add Prometheus / Loki via the Web UI or env vars.
Point any MCP client at :3000/mcp.

Architecture

How it fits together

flowchart TB
  Agent["AI Agent (Claude, Ollama, ...)"]
  subgraph MCP ["observability-mcp :3000"]
    direction TB
    Tools["8 MCP Tools"]
    Analysis["Analysis Engine"]
    UI["Web UI"]
  end
  subgraph Connectors ["Pluggable Connectors"]
    direction TB
    Prom["Prometheus / PromQL"]
    Loki["Loki / LogQL"]
    Next["Your Backend"]
  end
  Agent <--> Tools
  Tools --- Analysis
  Tools --- UI
  Tools --> Prom
  Tools --> Loki
  Tools --> Next
  style MCP fill:#1a1a2e,stroke:#58a6ff,color:#fff
  style Connectors fill:#0d1117,stroke:#3fb950,color:#fff
  style Agent fill:#58a6ff,stroke:#58a6ff,color:#000
  style Next fill:#0d1117,stroke:#3fb950,color:#8b949e,stroke-dasharray: 5 5

Surface area

8 tools. One contract.

Tool                Signal    What it does
list_sources        meta      Discover backends & their health
list_services       meta      Discover services across all backends
query_metrics       metrics   Time-series + summary stats + per-instance breakdown
query_logs          logs      Log entries with error counts and top patterns
get_service_health  unified   0–100 score combining metrics & logs
detect_anomalies    unified   Cross-signal anomalies via z-score analysis
get_topology        topology  Merged infrastructure graph (resources + edges) across topology connectors
get_blast_radius    topology  "If this host dies, who else fails?" Pivots on the generic RUNS_ON relation
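
To make the get_blast_radius idea concrete, here is a small sketch of a traversal over RUNS_ON edges. It assumes edges point from the dependent resource to its host and that the relation is followed transitively; the Edge shape and the blastRadius function are hypothetical and not taken from the project's code.

```typescript
// Hypothetical edge shape: `from` RUNS_ON `to` (dependent -> host).
interface Edge {
  from: string;
  to: string;
  relation: string;
}

// Collect every resource that directly or transitively RUNS_ON `failed`.
function blastRadius(edges: Edge[], failed: string): string[] {
  const affected = new Set<string>();
  const queue: string[] = [failed];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const edge of edges) {
      if (edge.relation === "RUNS_ON" && edge.to === current && !affected.has(edge.from)) {
        affected.add(edge.from);
        queue.push(edge.from);
      }
    }
  }
  affected.delete(failed); // a cycle must not report the failed node itself
  return Array.from(affected).sort();
}

const edges: Edge[] = [
  { from: "pod-a", to: "node-1", relation: "RUNS_ON" },
  { from: "api", to: "pod-a", relation: "RUNS_ON" },
  { from: "pod-b", to: "node-2", relation: "RUNS_ON" },
];
console.log(blastRadius(edges, "node-1")); // [ 'api', 'pod-a' ]
```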

Same shape regardless of backend. Adding Datadog or InfluxDB doesn't change the tool surface — only adds another connector.
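
One way to picture "only adds another connector" is a shared interface that the tool layer programs against. The sketch below is an assumption about the general shape, not the project's actual connector interface: MetricsConnector, StaticConnector, and listServicesAcross are hypothetical names, and the methods are synchronous for brevity where a real backend call would be asynchronous.

```typescript
// Hypothetical shapes; the project's real connector interface may differ.
interface MetricPoint {
  timestamp: number;
  value: number;
}

interface MetricsConnector {
  readonly name: string;
  listServices(): string[];
  queryMetric(service: string, metric: string): MetricPoint[];
}

// A toy in-memory backend standing in for "your future backend".
class StaticConnector implements MetricsConnector {
  readonly name: string;
  private readonly data: Record<string, Record<string, MetricPoint[]>>;

  constructor(name: string, data: Record<string, Record<string, MetricPoint[]>>) {
    this.name = name;
    this.data = data;
  }

  listServices(): string[] {
    return Object.keys(this.data);
  }

  queryMetric(service: string, metric: string): MetricPoint[] {
    return this.data[service]?.[metric] ?? [];
  }
}

// The tool layer only sees the interface, so its output shape stays the same
// no matter how many connectors are registered.
function listServicesAcross(connectors: MetricsConnector[]): string[] {
  const all = new Set<string>();
  for (const connector of connectors) {
    for (const service of connector.listServices()) all.add(service);
  }
  return Array.from(all).sort();
}

const prom = new StaticConnector("prometheus", { api: { cpu: [{ timestamp: 0, value: 42.1 }] } });
const custom = new StaticConnector("custom", { payments: {} });
console.log(listServicesAcross([prom, custom])); // [ 'api', 'payments' ]
```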

Why it Just Works

Adaptive resolution.

Real Prometheus deployments don't agree on metric names or label conventions. We probe the backend instead of guessing.

Series discovery

Per metric, an ordered list of candidates. Probe per-service:

cpu:
  process_cpu_seconds_total      # prom-client
  ↓ fallback
  node_cpu_seconds_total{...}    # node_exporter

The selected candidate lands in resolvedSeries.
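
The fallback order can be sketched as a first-match search over the candidate list. In this sketch the probe is abstracted as a synchronous predicate (hasData) so the example stays self-contained; a real probe would be an asynchronous query against the backend. The CANDIDATES table and the resolveSeries function are illustrative assumptions.

```typescript
// Hypothetical candidate lists, in preference order.
const CANDIDATES: Record<string, string[]> = {
  cpu: ["process_cpu_seconds_total", "node_cpu_seconds_total"],
};

// `hasData` stands in for a real per-service probe against the backend.
function resolveSeries(
  metric: string,
  hasData: (series: string) => boolean,
): string | undefined {
  for (const candidate of CANDIDATES[metric] ?? []) {
    if (hasData(candidate)) return candidate; // first match wins
  }
  return undefined; // nothing matched; the caller decides how to report it
}

// A service that only exposes node_exporter metrics falls through to the second candidate.
const nodeExporterOnly = new Set(["node_cpu_seconds_total"]);
console.log(resolveSeries("cpu", (series) => nodeExporterOnly.has(series)));
// node_cpu_seconds_total
```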

Label resolution

The service identifier is matched against the labels that real Prometheus deployments use:

probe order:
  job  →  service  →  app  →  service_name

Configurable via PROMETHEUS_SERVICE_LABELS.
Same idea on Loki for service_name / container / job / ....
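
A minimal sketch of that probe order follows. It assumes PROMETHEUS_SERVICE_LABELS holds a comma-separated list (the format is not stated above) and that the label values seen in the backend are already available as a plain map; parseServiceLabels and resolveServiceLabel are hypothetical names.

```typescript
// Default probe order from the slide above.
const DEFAULT_LABELS = ["job", "service", "app", "service_name"];

// Assumption: the setting is a comma-separated list such as "app,job".
function parseServiceLabels(raw: string | undefined): string[] {
  const labels = (raw ?? "")
    .split(",")
    .map((label) => label.trim())
    .filter((label) => label.length > 0);
  return labels.length > 0 ? labels : DEFAULT_LABELS;
}

// `labelValues` maps a label name to the values observed in the backend.
function resolveServiceLabel(
  service: string,
  probeOrder: string[],
  labelValues: Record<string, string[]>,
): string | undefined {
  return probeOrder.find((label) => (labelValues[label] ?? []).includes(service));
}

// Here "api" is not a job name, so the probe falls through to the app label.
const observed = { job: ["node"], app: ["api", "web"] };
const label = resolveServiceLabel("api", parseServiceLabels(undefined), observed);
console.log(label === undefined ? "no match" : `${label}="api"`); // app="api"
```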

v1.3 highlight

Per-instance breakdown

Multi-target services (dev + prod, k8s replicas, ...) collapse into one number by default.
Pass groupBy to see them split:

query_metrics(service="api", metric="cpu", groupBy="instance")
{
  "metric": "cpu",
  "groupBy": "instance",
  "groups": [
    { "key": "prod-vm-1:9100", "values": [...], "summary": { "current": 42.1 } },
    { "key": "dev-vm-1:9100",  "values": [...], "summary": { "current": 11.8 } }
  ],
  "resolvedSeries": "100 - avg by(instance) (rate(node_cpu_seconds_total{...}[1m])) * 100"
}

Without groupBy, the response includes a hint: "2 distinct instances exist for this service. Pass groupBy="instance" to break it down."
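
The split itself amounts to bucketing samples by one label. The sketch below mirrors the response shape shown above (key, values, summary.current) but is only an illustration: the Sample input type, the "unknown" fallback key, and the assumption that samples arrive in chronological order are mine, not the project's.

```typescript
interface Sample {
  labels: Record<string, string>;
  value: number;
}

interface Group {
  key: string;
  values: number[];
  summary: { current: number };
}

// Bucket samples by the chosen label; "current" is the last value per bucket,
// which assumes the input is ordered oldest to newest.
function groupSamples(samples: Sample[], groupBy: string): Group[] {
  const byKey = new Map<string, number[]>();
  for (const sample of samples) {
    const key = sample.labels[groupBy] ?? "unknown";
    const values = byKey.get(key) ?? [];
    values.push(sample.value);
    byKey.set(key, values);
  }
  const groups: Group[] = [];
  byKey.forEach((values, key) => {
    groups.push({ key, values, summary: { current: values[values.length - 1] as number } });
  });
  return groups;
}

const samples: Sample[] = [
  { labels: { instance: "prod-vm-1:9100" }, value: 40.0 },
  { labels: { instance: "dev-vm-1:9100" }, value: 11.8 },
  { labels: { instance: "prod-vm-1:9100" }, value: 42.1 },
];
console.log(JSON.stringify(groupSamples(samples, "instance"), null, 2));
```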

In action

From "anything wrong?" to root cause.

1. Trigger chaos: curl -X POST :8081/chaos/error-spike
2. Ask Claude: "Are there any anomalies right now?"
3. Claude calls detect_anomalies → finds CPU spike (3.4σ), request rate dropping
4. Claude calls query_logs → finds "internal error during POST /payments (6x)"
5. Claude correlates the signals and explains the incident in plain language.

No PromQL. No LogQL. No dashboards.
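
For readers unfamiliar with the σ figure in step 3, a z-score says how many standard deviations the latest value sits from a baseline mean. The sketch below shows the textbook calculation with made-up numbers; the threshold of 3 and the function names are assumptions, and the project's analysis engine may compute its baseline differently.

```typescript
// z = (latest - mean) / standard deviation of the baseline window.
function zScore(baseline: number[], latest: number): number {
  if (baseline.length === 0) return 0;
  const mean = baseline.reduce((sum, v) => sum + v, 0) / baseline.length;
  const variance =
    baseline.reduce((sum, v) => sum + (v - mean) ** 2, 0) / baseline.length;
  const std = Math.sqrt(variance);
  return std === 0 ? 0 : (latest - mean) / std; // flat baseline: no score
}

// Assumed threshold: flag anything at least 3 standard deviations out.
function isAnomalous(baseline: number[], latest: number, threshold = 3): boolean {
  return Math.abs(zScore(baseline, latest)) >= threshold;
}

// Illustrative CPU readings (percent); mean is 11, standard deviation about 1.22.
const cpuBaseline = [10, 12, 11, 9, 13, 10, 11, 12];
console.log(zScore(cpuBaseline, 15.2).toFixed(1)); // 3.4
console.log(isAnomalous(cpuBaseline, 15.2)); // true
```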
Web UI

Configure visually.

[Screenshot: Web UI dashboard]

Sources · Services · Health · Settings — dark theme, real-time, zero deps.

How to ship it

Three ways to run it.

npm

npx @thotischner/observability-mcp

Local dev, zero install.

Docker (GHCR)

docker run -p 3000:3000 \
  ghcr.io/thotischner/...
  observability-mcp:latest

Multi-arch (amd64 + arm64), native runners.

From source

git clone …
docker-compose up

Full POC: 3 services + chaos.

The npm package is published with signed provenance (SLSA). Multi-arch Docker images are built on native runners, with no QEMU emulation.

Continuous security

Self-driving, hands-off.

Detection

  • Dependabot — weekly grouped PRs (npm, GH Actions, Docker)
  • CodeQL — security-extended on every PR
  • Trivy — Docker + filesystem CVE scan, daily
  • npm audit — fails CI on high-severity
  • OSSF Scorecard — repo posture, weekly

Reaction

  • Auto-merge sweeper — patch/minor PRs ≥ 72 h, all checks green
  • Auto-release — Sunday 23 UTC, patch-bumps + tags
  • Tag → 3 publishes — npm + GHCR + GitHub Release, parallel
  • Major bumps — stay manual, never auto-merged
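
The merge rule in the list above can be written down as a small predicate. This is only a sketch of the stated policy (patch or minor, at least 72 hours old, all checks green), not the project's actual workflow; it assumes plain MAJOR.MINOR.PATCH version strings without pre-release tags, and the DependencyPr shape is hypothetical.

```typescript
type Bump = "major" | "minor" | "patch" | "none";

// Compare two plain MAJOR.MINOR.PATCH strings and name the highest part that changed.
function classifyBump(from: string, to: string): Bump {
  const [fromMajor, fromMinor, fromPatch] = from.split(".").map(Number);
  const [toMajor, toMinor, toPatch] = to.split(".").map(Number);
  if (toMajor !== fromMajor) return "major";
  if (toMinor !== fromMinor) return "minor";
  if (toPatch !== fromPatch) return "patch";
  return "none";
}

// Hypothetical summary of a dependency update PR.
interface DependencyPr {
  from: string;
  to: string;
  ageHours: number;
  checksGreen: boolean;
}

// Patch and minor bumps merge automatically once they are old enough and green;
// major bumps always stay manual.
function canAutoMerge(pr: DependencyPr): boolean {
  const bump = classifyBump(pr.from, pr.to);
  return (bump === "patch" || bump === "minor") && pr.ageHours >= 72 && pr.checksGreen;
}

console.log(canAutoMerge({ from: "1.2.3", to: "1.3.0", ageHours: 96, checksGreen: true })); // true
console.log(canAutoMerge({ from: "1.2.3", to: "2.0.0", ageHours: 96, checksGreen: true })); // false
```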
Six issues filed by users, six closed via auto-pipeline. Time from issue → live release: minutes.

Where we are

v1.3.2 — and counting.

Shipped

  • Prometheus + Loki + Kubernetes (topology) connectors
  • Web UI (6 pages, real-time — incl. Topology graph)
  • prom-client + node_exporter adaptive defaults
  • Per-instance groupBy breakdown
  • Grafana Cloud / Mimir / managed Loki support
  • Auth (Basic, Bearer) + TLS (custom CA, mTLS)
  • Optional Ollama agent for autonomous detection

Roadmap

  • InfluxDB / Tempo / Datadog connectors
  • Cluster-aware breakdowns (k8s namespace, region)
  • Trace queries (OTel)
  • Alert-as-context — pull active alerts into the conversation
  • Streaming responses (SSE) for long queries

Try it now

npx @thotischner/observability-mcp

github.com/ThoTischner/observability-mcp

Star ⭐ helps others discover the project.

2026 · Apache-2.0 License · Built with TypeScript, Node 20, MCP SDK 1.12+