Brainstorm: OTel Network Monitoring — eBPF Event Stream

Problem & Topic

Design center: SIEM-style visibility for clawker agents. Today's monitoring stack is built like a dev-debug observability stack (Grafana + Loki + Prom + Jaeger). Users running clawker actually need a SIEM-grade surface to answer "which of my 5 running agents is compromised?" and "why did this agent burn 10M tokens overnight?" — security + cost forensics over all agent-side telemetry, not just network logs.

Two architectural problems with the existing pipeline:

  1. Attribution is in the wrong layer. Agent identity must be resolved at the source-event layer, but logs come from Envoy/CoreDNS — neither has agent-identity scope. Promtail then tries to bolt agent labels on via relabel rules with no access to Docker labels. The agent: and project: labels in promtail-config.yaml.tmpl are vestigial from this attempt — always blank.
  2. Backing store is wrong for the workload. Cross-source security drill-down (filter network events by agent+verdict AND join with overseer-event firewall changes AND Claude-Code tool-call events, all on shared agent/project dimensions) is search-engine work, not time-series-log work. Two prior attempts at making Loki serve this UX failed.
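
Sketch of the drill-down shape described in problem 2, under stated assumptions. It builds one OpenSearch query body per source, all filtered on the same agent/project dimensions, so the results can be merged into a single timeline client-side. The field names (agent, project, verdict, @timestamp), the index patterns, and the helper names are illustrative assumptions, not decided schema. The term filters also assume those fields are mapped as keyword.

```python
import json

# Hypothetical index patterns, one per source. Index naming is still an open item.
SOURCES = {
    "egress": "clawker-egress-*",
    "overseer": "clawker-overseer-*",
    "claude_code": "clawker-claude-code-*",
}


def drilldown_query(agent, project, verdict=None, since="now-24h"):
    """Build a bool/filter query body on the shared agent/project dimensions."""
    filters = [
        {"term": {"agent": agent}},
        {"term": {"project": project}},
        {"range": {"@timestamp": {"gte": since}}},
    ]
    if verdict is not None:
        # Only network events carry a verdict, so this filter is optional.
        filters.append({"term": {"verdict": verdict}})
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{"@timestamp": "asc"}],
    }


def timeline_requests(agent, project):
    """Return {index_pattern: query_body} for a cross-source drill-down."""
    return {
        SOURCES["egress"]: drilldown_query(agent, project, verdict="DENIED"),
        SOURCES["overseer"]: drilldown_query(agent, project),
        SOURCES["claude_code"]: drilldown_query(agent, project),
    }


if __name__ == "__main__":
    print(json.dumps(timeline_requests("agent-3", "demo"), indent=2))
```

Each body can be sent to its index pattern's _search endpoint. The "join" is then a merge of three result sets sorted by timestamp, which only works if every source stamps the same agent/project values.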

Scope: two coupled workstreams.

  1. New BPF egress event stream — this is one new source feeding the SIEM surface. Schema, BPF mechanics, unification with Envoy/CoreDNS.
  2. Monitor stack replacement — entire stack pivots to OpenTelemetry-universal-ingest + OpenSearch. Receives all existing + new sources (see Ingest Sources Inventory section). New egress stream is the forcing function; the pivot benefits every other source.

Real requirement

Ingest Sources Inventory

The OpenSearch (OS) layer must accept these sources today and accommodate the planned next one. Source-shape diversity is the design constraint that rules out a single polymorphic mega-index.

| Source | Status | Signal type | Examples | SIEM value |
|---|---|---|---|---|
| CP app logs (clawker-cp) | Exists (file) | Structured log records (zerolog) | event=agent_dialer_unavailable, dialer errors, registry mismatches | CP misbehaving ≡ security concern. Operator must see degraded paths. |
| CP overseer events | Exists (in-process bus) | Domain events from internal/controlplane/overseer/ | Firewall rule changes, bypass activation, agent lifecycle, registry events, session connected/disconnected, trust-attestation outcomes | Audit trail. "Who toggled bypass at 3am?" Today these are in-process only. |
| Claude Code telemetry | Exists (already OTLP-emitting) | Logs/events (records) + metrics (separate path) | API requests, API errors, tool decisions, tool results, cost & token metrics | "What did the agent decide to do?" Per-tool-call audit. Cost data alongside security data for correlated drill-down. Metrics → Prom (D23). Logs/events → OS (new). |
| BPF egress events | New (this brainstorm) | L3/L4 verdicts from cgroup hooks | connect4/sendmsg4/recvmsg4 verdicts: ALLOWED/DENIED/BYPASSED, dst_ip/dst_port/l4_proto, domain-hash → domain via P7 reverse map | Every outbound network attempt visible, including bypass-mode (forensic black hole today). |
| Envoy access logs | New (this brainstorm) | L7-over-MITM HTTP/TCP access records | HTTP method/path/response_code, TLS SNI, response_flags, upstream timing; emitted natively as OTLP via envoy.access_loggers.open_telemetry (D31) | "What HTTP API did the agent actually hit?" Path-level forensics for TLS-terminated traffic. |
| CoreDNS query logs | New (this brainstorm) | Per-query DNS records | client_ip, qname, qtype, rcode, answer set; emitted via log plugin stdout → collector filelog receiver (D32) | "What did the agent try to resolve?" DNS-side audit independent of BPF/Envoy. |
| Sys exec call events | Planned (future scope) | eBPF-derived process events | execve / fork / exit, command line + cgroup attribution | "What did the agent actually run inside the container?" Pairs with egress for full agent-behavior picture. NOT a current source-list entry; design only. |

Cross-source invariants:

Current BPF Surface

Seven cgroup programs in bpf/clawker.c: connect4, sendmsg4, recvmsg4, connect6, sendmsg6, recvmsg6, sock_create. Every decision point already invokes metric_inc() — same call sites become ringbuf emission points.

Existing pinned maps

| Map | Key | Value | Used by event-stream change? |
|---|---|---|---|
| container_map | cgroup_id | container_config | presence gate for enforcement (no change) |
| bypass_map | cgroup_id | u8 flag | no change; bypass still counted as ACTION_BYPASS |
| dns_cache | IP | {domain_hash, expire_ts} | reader walks this to build hash→domain reverse map |
| route_map | {domain_hash, dst_port} | {envoy_port} | no change |
| metrics_map | {cgroup_id, hash, port, action} | counter | stays for break-glass ebpf-manager dump |
| events_ringbuf | (none) | egress_event records | NEW: this change adds it |
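
Sketch of what a userspace reader of events_ringbuf might do with one record: decode it, attribute it to an agent by cgroup_id, and resolve the domain hash through the reverse map. The record layout, the verdict numbering, and the agents/domains lookup tables are hypothetical stand-ins, since the egress_event schema is still pending under Schema specifics. Only the IPv4 case is shown.

```python
import ipaddress
import struct

# HYPOTHETICAL layout, little-endian, no padding (24 bytes):
#   u64 cgroup_id | u64 domain_hash | 4 bytes dst_ip (network order)
#   u16 dst_port (assumed host order) | u8 l4_proto | u8 verdict
EVENT = struct.Struct("<QQ4sHBB")

VERDICTS = {0: "ALLOWED", 1: "DENIED", 2: "BYPASSED"}  # assumed numbering
PROTOS = {6: "tcp", 17: "udp"}  # IANA protocol numbers


def decode_event(raw, agents, domains):
    """Decode one raw record and stamp agent identity at the source.

    agents:  cgroup_id -> {"agent": ..., "project": ...} (hypothetical registry)
    domains: domain_hash -> domain name (the hash→domain reverse map)
    """
    cgroup_id, domain_hash, ip, port, proto, verdict = EVENT.unpack(raw)
    identity = agents.get(cgroup_id, {})
    return {
        "agent": identity.get("agent"),
        "project": identity.get("project"),
        "verdict": VERDICTS.get(verdict, "UNKNOWN"),
        "dst_ip": str(ipaddress.IPv4Address(ip)),
        "dst_port": port,
        "l4_proto": PROTOS.get(proto, str(proto)),
        # None when the hash is not (or no longer) in the reverse map.
        "domain": domains.get(domain_hash),
    }


if __name__ == "__main__":
    raw = EVENT.pack(
        4242, 0xFEEDFACE, ipaddress.IPv4Address("93.184.216.34").packed, 443, 6, 1
    )
    agents = {4242: {"agent": "agent-3", "project": "demo"}}
    domains = {0xFEEDFACE: "example.com"}
    print(decode_event(raw, agents, domains))
```

Attribution happens here because the record already carries cgroup_id, which is the key container_map uses. Nothing downstream has to reconstruct agent identity from log labels.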

Confirmed Decisions

Proposals Pending Your Call

Schema specifics

BPF emit policy & flow lifecycle

Operational concerns

Verdict source-mapping

Conclusions & Insights

Gotchas & Risks

Open Items & Questions

Unknowns

Next Steps

  1. OS index / mapping design for the egress stream. Index naming, field types, retention. Will surface as a new proposal block.
  2. otel-collector pipeline config rewrite. Drop otlphttp/loki and otlp/jaeger exporters; add OpenSearch exporter for events/logs. Keep prometheus exporter and metrics pipeline (D23). Preserve existing resource/agent + resource/cp + transform/metrics processors (provenance-stamping invariants are storage-independent).
  3. Envoy and CoreDNS access-log ingest path. How those access logs reach otel-collector — new proposals when walked.
  4. OSD dashboard migration plan. Mechanical port of the existing ~30 panels. Includes OSD Prometheus datasource verification for the PromQL panels (if gaps, adapt panels — not re-add Grafana per D23).
  5. After design lock: produce a Phase-2 implementation plan as a separate exercise. Phasing comes after design is locked, not during.
  6. Separately filed: pre-existing metrics_map silent-drop bug; docs/firewall.mdx chronology fix (D15). Both noted, neither part of this design.