A Nobulex Publication

Agent Reliability Index

A weekly observatory of AI agent behavior change across frontier vendors.
Issue 001 · Charter · Volume 1 · Published 11 May 2026 · Methodology v0.1

Every week, the major frontier AI vendors change the behavior of agents in production. Some changes are announced. Most are not. Enterprise customers running these agents have no cross-customer baseline against which to detect drift, and vendors are structurally disincentivized to disclose regressions. The gap between what is observable and what is disclosed is the structural problem this index addresses.

The Agent Reliability Index is a public observatory. Each week, it publishes a standardized-prompt drift signal across the major frontier models, vendor-by-vendor reliability scorecards on the prior 7 days, notable incidents with structured severity classification, and methodology updates as the analytical model matures.

The index is free to read. The methodology is open and lives in the Nobulex repository. The premium tier — historical archives, machine-readable feeds, drill-down per task class, custom prompt queries — is paid.

This is the charter issue. It establishes the baseline and publishes the methodology in full. Subsequent issues will be denser, with real drift signal and incident catalogs. The methodology is being published before the first issue contains real findings, so that readers, vendors, regulators, and contributors can scrutinize the approach before it is loaded with results.


The Index — Charter Baseline

The headline figure for each issue is the Cross-Vendor Reliability Index (CVRI), computed as a normalized composite across five behavioral dimensions for each tracked vendor.

For the charter issue, the index is set to a baseline of 100 for each tracked vendor. Subsequent issues will report deviations from baseline, week-over-week deltas, and longitudinal trend lines.

Vendor      CVRI (Charter Baseline)   Tracked Endpoints
Anthropic   100                       Claude family, current production
OpenAI      100                       GPT family, o-series reasoning
Google      100                       Gemini family, current production
Microsoft   100                       Copilot, Azure-hosted variants
Meta        100                       Llama production deployments via partners

The methodology for computing CVRI is published in full at docs/AGENT-RELIABILITY-INDEX.md. Readers are invited to challenge weights, reweight components for their own use case, or build alternative composite indices on the underlying data.


What is measured

The index tracks five behavioral dimensions for each frontier agent endpoint. These dimensions were chosen because they are the leading indicators most often cited in post-incident reviews of production agent failures, and because they can be observed from the outside without privileged access.

1 · Output stability under fixed prompts

100 standardized prompts spanning 10 task classes are run weekly with deterministic settings (temperature 0 where supported, fixed seed where supported). Outputs are scored on lexical, semantic, and structural similarity to the prior week's outputs. A statistically significant divergence is flagged as drift; when it coincides with no announced model update, it is classified as silent drift.
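As an illustration of the lexical component only, the week-over-week comparison might be sketched as follows. This is a minimal sketch, not the index's scoring code: the function names, the token-level comparison via Python's standard difflib, and the 0.9 threshold are all assumptions made for the example, and a fixed threshold stands in for the statistical significance test described above.

```python
from difflib import SequenceMatcher


def lexical_similarity(previous: str, current: str) -> float:
    """Return a 0..1 similarity ratio between two outputs, compared token by token."""
    matcher = SequenceMatcher(None, previous.split(), current.split(), autojunk=False)
    return matcher.ratio()


def changed_prompts(prev_week: dict[str, str], this_week: dict[str, str],
                    threshold: float = 0.9) -> list[str]:
    """List prompt IDs whose similarity to the prior week's output is below threshold.

    The threshold is a hypothetical placeholder, not a published parameter.
    """
    shared_ids = prev_week.keys() & this_week.keys()
    return sorted(
        pid for pid in shared_ids
        if lexical_similarity(prev_week[pid], this_week[pid]) < threshold
    )
```

Semantic and structural similarity would need their own scorers (for example, embedding distance or schema comparison); they are left out here because the text above does not specify them.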

2 · Stated confidence calibration

Where the agent provides confidence scores or hedge language, the empirical distribution is tracked. A shift in the mean or variance of stated confidence is a leading indicator that the underlying model has been retuned.
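A shift in stated confidence can be summarized with two numbers. The sketch below assumes numeric confidence scores have already been extracted from the agent's responses; the function name and the choice of a mean difference plus a variance ratio are illustrative assumptions, and hedge-language scoring is not covered.

```python
from statistics import mean, variance


def confidence_shift(baseline: list[float], current: list[float]) -> dict[str, float]:
    """Compare this week's stated-confidence scores with a baseline sample.

    Returns the change in mean and the ratio of sample variances.
    Both inputs need at least two values for the variance to be defined.
    """
    base_var = variance(baseline)
    if base_var == 0:
        raise ValueError("baseline has zero variance; ratio is undefined")
    return {
        "mean_delta": mean(current) - mean(baseline),
        "variance_ratio": variance(current) / base_var,
    }
```

A mean_delta far from 0 or a variance_ratio far from 1 would be the kind of movement the paragraph above treats as a leading indicator; what counts as "far" is a threshold this sketch does not set.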

3 · Refusal and safety-filter rate

The proportion of prompts that result in refusal, partial refusal, or safety-filter intervention. Increases are not inherently problematic, but customers running agents in production rely on a stable refusal rate for downstream workflow design.
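The rate itself is a simple proportion. In the sketch below, the outcome labels ("answered", "refusal", "partial_refusal", "safety_filter") are hypothetical names chosen for the example; how each response is assigned a label is a separate classification step that is not shown.

```python
from collections import Counter

# Hypothetical outcome labels that count toward the refusal and safety-filter rate.
INTERVENTIONS = ("refusal", "partial_refusal", "safety_filter")


def refusal_rate(outcomes: list[str]) -> float:
    """Fraction of prompts that ended in a refusal, partial refusal, or filter hit."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    counts = Counter(outcomes)
    return sum(counts[label] for label in INTERVENTIONS) / len(outcomes)
```

For a weekly run of 100 prompts in which 6 are refused, 3 are partially refused, and 1 is filtered, the function returns 0.1.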

4 · Latency and routing variance

Response latency, time-to-first-token, and total token throughput. Significant changes here typically indicate routing or infrastructure changes that may also affect output quality.

5 · Tool-use reliability

Success rate on a fixed set of standardized tool-use scenarios for agent platforms with tool-calling capabilities. A drop in tool-use reliability is one of the highest-impact regressions for enterprise deployments and is rarely surfaced in vendor release notes.


Drift signal — this week

The charter issue establishes the baseline. There is, by definition, no drift signal yet.

Subsequent issues will populate this section with the week's flagged drift events, classified as announced drift (coinciding with vendor disclosure) or silent drift (no corresponding vendor disclosure). Silent drift is the editorially most significant signal the index produces — it surfaces behavior changes the vendor's customers have no other way to detect.

Silent drift — observable behavior change without corresponding vendor disclosure — is the structural problem the index exists to surface. The category receives the largest editorial weight in every weekly issue.

Notable incidents — this week

The index tracks publicly reported AI agent incidents and classifies them on a four-level severity scale. Source streams include the AI Incident Database, vendor status pages, regulatory filings, and verifiable press coverage. The charter issue does not yet include a catalog. Subsequent issues will. Classification rubric:


Methodology — at a glance

Full methodology lives at docs/AGENT-RELIABILITY-INDEX.md. The CVRI composite is computed as:

CVRI(vendor, week) = 100 - (
  0.30 * |output_stability_z|     +
  0.15 * |confidence_calibration_z| +
  0.20 * |refusal_rate_z|           +
  0.10 * |latency_variance_z|       +
  0.25 * |tool_use_reliability_z|
)

Each _z is the z-score versus the vendor's own 12-week rolling baseline. A prompt-level drift event is flagged at greater than 2σ from the 4-week baseline. A vendor-level drift event requires at least 15% of tracked prompts to show drift in the same week and a CVRI delta of at least 5 points.
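The composite and the two flagging rules can be sketched in a few lines. The weights and thresholds are taken from the text above; the function names, the use of the sample standard deviation, and the reading of "CVRI delta" as an absolute change are assumptions of this sketch rather than statements about the published methodology.

```python
from statistics import mean, stdev

# Weights as given in the CVRI formula; they sum to 1.0.
WEIGHTS = {
    "output_stability": 0.30,
    "confidence_calibration": 0.15,
    "refusal_rate": 0.20,
    "latency_variance": 0.10,
    "tool_use_reliability": 0.25,
}


def z_score(value: float, baseline: list[float]) -> float:
    """z-score of this week's value against a rolling baseline of weekly values."""
    spread = stdev(baseline)
    # Assumption: a perfectly flat baseline yields z = 0 rather than an error.
    return 0.0 if spread == 0 else (value - mean(baseline)) / spread


def cvri(z_scores: dict[str, float]) -> float:
    """CVRI = 100 minus the weighted sum of absolute per-dimension z-scores."""
    return 100 - sum(w * abs(z_scores[dim]) for dim, w in WEIGHTS.items())


def is_prompt_drift(z: float) -> bool:
    """Prompt-level rule: more than 2 sigma from the 4-week baseline."""
    return abs(z) > 2


def is_vendor_drift(prompt_z: list[float], cvri_delta: float) -> bool:
    """Vendor-level rule: at least 15% of prompts drifting and a CVRI delta of 5+.

    Assumption: the delta is compared by magnitude, in either direction.
    """
    share = sum(is_prompt_drift(z) for z in prompt_z) / len(prompt_z)
    return share >= 0.15 and abs(cvri_delta) >= 5
```

With every z-score at 0 the composite equals the charter baseline of 100. Readers who want to reweight the components for their own use case would change only the WEIGHTS table.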


What the index does not do


Editorial policy

Nobulex receives no payment from any AI vendor for inclusion, exclusion, weighting, or scoring. This is a contractual commitment and the structural moat that protects the index against the conflict-of-interest pattern that has weakened other rating agencies.

Vendors may dispute any finding. Disputes are resolved publicly: the vendor submits the dispute with a specific methodological objection; Nobulex publishes the dispute verbatim in the next issue alongside its response. The catalog of vendor disputes is itself a structured precedent stream.

Methodology changes are back-tested over 12 weeks before going live. The back-tested impact on prior CVRI scores is published in the issue immediately preceding the change. The prior methodology is archived and remains accessible. This prevents the failure mode in which a low-scoring vendor lobbies for methodology adjustments that retroactively rehabilitate its score.

Subscribe & contribute

The Agent Reliability Index is published every Monday.

To subscribe, email nobulex.dev@gmail.com with the subject "Agent Reliability Index Subscription."

To submit methodology critique, open an issue at github.com/arian-gogani/nobulex/issues with the label observatory:methodology. Substantive critiques are addressed in subsequent issues with full attribution.

For the full strategic vision behind the observatory, see docs/OBSERVATORY-VISION.md.