Benchmarks

Trust you can measure.

We publish our own numbers and exactly how to reproduce them. We don't post head-to-head tables against other tools — a score only means something relative to the harness that produced it, so cross-tool numbers measured on different harnesses aren't a fair comparison. Run these on your own repo.

1. The Trust Benchmark — the one that matters

The question retrieval benchmarks never ask: can you trust what the memory returns? An agent acting on stale or hallucinated memory is worse than one with none. This is what Kage is built for, and what no other tool measures.

kage benchmark --trust --project .

Metric                            Result
Hallucinated-citation rejection   100%
Stale-memory exclusion            100%
Live grounding rate               99%
Trust score                       100 / 100

Controlled gates run in an isolated sandbox; grounding runs on your real repo. Methodology: docs/TRUST.md.
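
As a rough illustration of how per-gate percentages like the ones above can be derived, here is a minimal sketch. It is not Kage's implementation: the record shape ({ gate, passed }), the gate labels, and the function name gateRates are all hypothetical, and the way the overall trust score is aggregated is not stated here, so the sketch does not attempt it. See docs/TRUST.md for the actual methodology.

```javascript
// Hypothetical record shape: { gate: string, passed: boolean }.
// "passed" means the memory layer did the right thing for that case
// (e.g. refused a planted hallucinated citation).
function gateRates(results) {
  const totals = new Map();
  for (const { gate, passed } of results) {
    const t = totals.get(gate) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (passed) t.passed += 1;
    totals.set(gate, t);
  }
  // Convert counts into a pass rate per gate.
  const rates = {};
  for (const [gate, t] of totals) rates[gate] = t.passed / t.total;
  return rates;
}

// Toy data, invented for illustration only.
const sampleGateResults = [
  { gate: "hallucinated-citation", passed: true },
  { gate: "hallucinated-citation", passed: true },
  { gate: "stale-memory", passed: true },
  { gate: "grounding", passed: true },
  { gate: "grounding", passed: false },
];
console.log(gateRates(sampleGateResults));
```

Keeping one rate per gate, rather than a single blended number, makes it obvious which kind of failure moved the result.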

2. Retrieval — competitive, dependency-free (a sanity check)

We run the recognized long-term-memory retrieval benchmark, LongMemEval-S, as an external sanity check. It's a conversational benchmark — not Kage's core use case (durable repo memory) — but it shows our retrieval is strong with zero dependencies: no vector DB, no embedding model, no API key.

Metric           Kage strict recall
R@5              96.17%
R@10             98.72%
R@20             99.79%
MRR              0.909
Median latency   ~210 ms

node benchmarks/longmemeval-kage-retrieval.mjs --data longmemeval_s_cleaned.json --limit 470 --top-k 20

Read this honestly: these are session-level retrieval-recall scores, not end-to-end QA accuracy. On this dataset a plain BM25 baseline also reaches ~96.6% R@5 — the benchmark is lexically tractable and strong retrievers cluster at 95–97%. The takeaway is "Kage matches strong lexical retrieval with no dependencies," not "Kage is uniquely best at retrieval." Details: benchmarks/LONGMEMEVAL.md.
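
To make "session-level retrieval recall" concrete, here is a minimal sketch of how R@k and MRR are commonly computed. It is not the code in benchmarks/longmemeval-kage-retrieval.mjs. It assumes that "strict" means every evidence session must appear in the top k, which this page does not define, and the field names ranked and gold are hypothetical.

```javascript
// ranked: session ids returned by the retriever, best first.
// gold:   session ids that contain the evidence for the question.

// Assumed "strict" recall: 1 only if ALL gold sessions are in the top k.
function strictRecallAtK(ranked, gold, k) {
  const topK = new Set(ranked.slice(0, k));
  return gold.every((id) => topK.has(id)) ? 1 : 0;
}

// Reciprocal rank of the first gold session, or 0 if none was retrieved.
function reciprocalRank(ranked, gold) {
  const goldSet = new Set(gold);
  const idx = ranked.findIndex((id) => goldSet.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Average both metrics over all questions.
function evaluate(questions, ks = [5, 10, 20]) {
  const n = questions.length;
  const mean = (f) => questions.reduce((sum, q) => sum + f(q), 0) / n;
  const report = { mrr: mean((q) => reciprocalRank(q.ranked, q.gold)) };
  for (const k of ks) {
    report[`R@${k}`] = mean((q) => strictRecallAtK(q.ranked, q.gold, k));
  }
  return report;
}

// Toy data, invented for illustration only.
const toyQuestions = [
  { ranked: ["s3", "s1", "s9"], gold: ["s1"] },       // first hit at rank 2
  { ranked: ["s4", "s5", "s6"], gold: ["s4", "s6"] }, // first hit at rank 1
];
console.log(evaluate(toyQuestions, [1, 2, 3]));
```

Neither metric checks whether the final answer was correct. They only check whether the right sessions were surfaced, which is why these scores should not be read as end-to-end QA accuracy.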

3. Coding-task memory — the category-correct benchmark

Retrieval recall doesn't tell you whether memory makes a coding agent better. Kage ships a SWE-bench Verified memory ablation — a controlled, single-variable experiment measuring whether repo memory improves real GitHub-issue resolution. See benchmark/README.md.
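
A single-variable ablation boils down to running the same tasks twice, once with memory and once without, and comparing resolution rates. The sketch below shows one way to summarize such paired results. The record shape ({ id, baseline, withMemory }), the task ids, and the function name are hypothetical and are not taken from the shipped harness.

```javascript
// Hypothetical per-task record:
//   { id: string, baseline: boolean, withMemory: boolean }
// Each boolean says whether the issue was resolved in that arm.
function summarizeAblation(tasks) {
  const n = tasks.length;
  const count = (pred) => tasks.filter(pred).length;
  const baselineResolved = count((t) => t.baseline);
  const memoryResolved = count((t) => t.withMemory);
  return {
    baselineRate: baselineResolved / n,
    memoryRate: memoryResolved / n,
    // Net change in resolution rate attributable to the memory arm.
    delta: (memoryResolved - baselineResolved) / n,
    // Paired view: tasks that flipped in each direction.
    gainedWithMemory: count((t) => t.withMemory && !t.baseline),
    lostWithMemory: count((t) => !t.withMemory && t.baseline),
  };
}

// Toy data, invented for illustration only.
const toyTasks = [
  { id: "task-1", baseline: false, withMemory: true },
  { id: "task-2", baseline: true, withMemory: true },
  { id: "task-3", baseline: true, withMemory: false },
  { id: "task-4", baseline: false, withMemory: true },
];
console.log(summarizeAblation(toyTasks));
```

Reporting the flipped tasks alongside the net delta shows whether memory also caused regressions, which a single aggregate rate would hide.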

Principle

Lead with trust (ours, uncontested). Treat retrieval as a sanity check, stated precisely. Never compare against a number measured on someone else's harness. Every number here is reproducible with the commands above.