Perseus
Benchmark Report · June 2026

Perseus + Sibyl Memory
Orientation Efficiency

Even with perfect memory retrieval (Sibyl's 350/350), AI agents still waste turns asking “where am I and what am I doing?” Perseus eliminates those discovery calls before the session starts.

Discovery Turns Saved
2.7 turns
Per task, on average. Down from 7.0 to 4.3.
Orientation Reduction
38%
Fewer discovery calls before the first productive action.
Traps Eliminated
8/8
All common orientation traps caught. Zero wasted turns.
“Sibyl gives the agent perfect recall. But recall isn’t the whole problem. The agent still has to ask: ‘where am I, what machine is this, what branch, what conventions apply, what were we working on?’ Those are orientation questions — and they burn turns whether retrieval is perfect or not.”
§ 01

Key Results

Discovery Turns Saved

Avg Sibyl Only
7.0
Average discovery turns per task
Avg Sibyl + Perseus
4.3
Average with pre-loaded context
Reduction
2.7 turns
38% fewer discovery calls
Best Savings
7
Task 1: 8 → 1 calls (7 saved)

15-Task Suite

| # | Task | Sibyl Only | + Perseus | Saved |
| --- | --- | --- | --- | --- |
| 1 | Fix credential redaction: nested JSON tokens | 8 calls | 1 call | 7 |
| 2 | Add health check for a new service endpoint | 5 calls | 0 calls | 5 |
| 3 | Update CI workflow to test Python 3.13 | 5 calls | 2 calls | 3 |
| 4 | Add memory-cleanup skill SKILL.md | 5 calls | 2 calls | 3 |
| 5 | Fix Mneme FTS5 search escaping bug (#318) | 5 calls | 4 calls | 1 |
| 6 | Fix CLI overwrite without warning (#314) | 4 calls | 3 calls | 1 |
| 7 | Update dependency scanner for optional imports | 5 calls | 5 calls | 0 |
| 8 | Implement convention checker for agent behavior | 8 calls | 4 calls | 4 |
| 9 | Refactor memory mesh to deduplicate cross-backend | 8 calls | 7 calls | 1 |
| 10 | Deploy Perseus v1.0.7 to PyPI | 10 calls | 4 calls | 6 |
| 11 | Add Perseus MCP server tool integration | 8 calls | 5 calls | 3 |
| 12 | Build cross-workspace memory search UI | 9 calls | 9 calls | 0 |
| 13 | Implement TTL cache invalidation on config change | 7 calls | 7 calls | 0 |
| 14 | Add multi-tenant support to Sibyl Memory connector | 8 calls | 4 calls | 4 |
| 15 | Performance audit: profile AGENTS.md rendering | 10 calls | 8 calls | 2 |
| | Average | 7.0 | 4.3 | 2.7 |
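The headline figures can be recomputed from the per-task rows. The sketch below copies the call counts from the 15-task table; the names `SUITE` and `summarize` are illustrative and are not part of the benchmark code.

```python
# (task number, Sibyl-only calls, Sibyl + Perseus calls), copied from the 15-task table.
SUITE = [
    (1, 8, 1), (2, 5, 0), (3, 5, 2), (4, 5, 2), (5, 5, 4),
    (6, 4, 3), (7, 5, 5), (8, 8, 4), (9, 8, 7), (10, 10, 4),
    (11, 8, 5), (12, 9, 9), (13, 7, 7), (14, 8, 4), (15, 10, 8),
]


def summarize(suite):
    """Return average calls before/after, average saved, % reduction, and the best task."""
    n = len(suite)
    avg_before = sum(before for _, before, _ in suite) / n
    avg_after = sum(after for _, _, after in suite) / n
    avg_saved = avg_before - avg_after
    # Largest per-task saving as (task number, calls saved).
    best = max(((task, before - after) for task, before, after in suite),
               key=lambda pair: pair[1])
    return {
        "avg_sibyl_only": avg_before,
        "avg_with_perseus": avg_after,
        "avg_saved": avg_saved,
        "reduction_pct": 100 * avg_saved / avg_before,
        "best": best,
    }


stats = summarize(SUITE)
for key, value in stats.items():
    print(key, value)
```

Rounded to one decimal place, the averages match the table's Average row, and the reduction rounds to 38%.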
§ 02

What Discovery Turns Look Like

For task #1 (“Fix credential redaction: nested JSON tokens”), here are the agent’s first 8 discovery calls with Sibyl only:

Turn 1: sibyl_search("redact.py location") → 0 hits, 0 tokens (fails)
Turn 2: sibyl_search("credential redaction") → 6 hits, 1,121 tokens
Turn 3: sibyl_recall("auth", "bsm-cache") → 1 hit, 165 tokens
Turn 4: sibyl_recall("auth", "github-token-extraction") → 1 hit, 172 tokens
Turn 5: sibyl_recall("convention", "fix-root-cause") → 1 hit, 115 tokens
Turn 6: terminal: git branch --show-current → main
Turn 7: sibyl_recall("convention", "perseus-ci-rebuild") → 1 hit, 148 tokens
Turn 8: sibyl_search("redact test coverage") → 3 hits, 530 tokens
Turn 9: [actual work begins]

7 of 8 calls eliminated by Perseus pre-loading. Only test coverage search remains.

With Perseus: the answers to turns 1 through 7 are in AGENTS.md before turn 1. The agent reads that context, runs the one remaining test-coverage search, and starts working.
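To put a number on the Task 1 trace, the sketch below tallies the calls and the tokens they returned. The values are copied from the trace; the `TRACE` structure is only an illustration, and the terminal call is stored as `None` because the trace reports no token count for it.

```python
from collections import Counter

# (turn, tool, tokens returned), copied from the Task 1 trace.
# The trace gives no token count for the terminal call, so it is None here.
TRACE = [
    (1, "sibyl_search", 0),
    (2, "sibyl_search", 1121),
    (3, "sibyl_recall", 165),
    (4, "sibyl_recall", 172),
    (5, "sibyl_recall", 115),
    (6, "terminal", None),
    (7, "sibyl_recall", 148),
    (8, "sibyl_search", 530),
]

# Number of discovery calls per tool.
calls_by_tool = Counter(tool for _, tool, _ in TRACE)

# Tokens returned by the calls that report a count (the Sibyl calls).
sibyl_tokens = sum(tokens for _, _, tokens in TRACE if tokens is not None)

print(dict(calls_by_tool))
print(sibyl_tokens)
```

The Sibyl calls alone return 2,251 tokens before any productive work starts.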

§ 03

Orientation Traps

Sibyl’s V2 benchmark proved that vector systems hallucinate confident neighbors for unknown entities (0/50 trap refusals vs. Sibyl’s 50/50). We introduce a complementary trap class: information the agent should never need to discover at session start.

| Trap Question | Sibyl Only | Sibyl + Perseus |
| --- | --- | --- |
| “What OS is this?” | Wastes turn on uname | Pre-resolved in AGENTS.md |
| “What Python version?” | Wastes turn on python3 --version | Pre-resolved |
| “Is Hermes running?” | Wastes turn on curl health check | Pre-resolved (11ms latency) |
| “What git branch?” | Wastes turn on git branch | Pre-resolved |
| “What skills do I have?” | Wastes turn on skill listing | Pre-resolved (12 skills, filtered) |
| “Who is the user?” | Wastes turn on sibyl_recall | Already in context |
| “What conventions apply?” | Wastes turn on sibyl_search | Already in context |
| “What was the last decision?” | Wastes turn on sibyl_search | Already in context |
| Total traps triggered | 8 / 8 | 0 / 8 |
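The first few traps are facts a script can resolve once, before the session, and write into the context file. The sketch below does this with the Python standard library only. It is not Perseus's implementation: the function names and the output layout are assumptions made for illustration.

```python
import platform
import subprocess


def git_branch(cwd="."):
    """Return the current git branch, or None if git is missing or cwd is not a repo."""
    try:
        result = subprocess.run(
            ["git", "branch", "--show-current"],
            cwd=cwd, capture_output=True, text=True, timeout=5, check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Empty output (for example a detached HEAD) is reported as None as well.
    return result.stdout.strip() or None


def orientation_facts(cwd="."):
    """Collect the facts an agent would otherwise spend a turn discovering."""
    return {
        "os": platform.system(),
        "python": platform.python_version(),
        "git_branch": git_branch(cwd),
    }


def render_block(facts):
    """Format the facts as a plain-text block (hypothetical layout)."""
    lines = ["Environment (resolved before the session):"]
    for key, value in facts.items():
        lines.append(f"- {key}: {value if value is not None else 'unknown'}")
    return "\n".join(lines)


print(render_block(orientation_facts()))
```

Resolving these values up front costs three cheap local calls once, where the agent would otherwise spend three turns.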
§ 04

Where Perseus Wins

| Discovery Category | Sibyl Only | + Perseus | Perseus Source |
| --- | --- | --- | --- |
| Environment (OS, Python, hostname, disk) | 4 turns | 0 | @query directives |
| Git state (branch, log, status) | 2 turns | 0 | @query directives |
| Services health | 1 turn | 0 | @services block |
| Project facts (repo, version, owner) | 2 turns | 0 | Sibyl entities + @read |
| Auth patterns / credentials | 2 turns | 0 | Sibyl entities |
| Conventions / workflow rules | 2 turns | 0 | Sibyl entities |
| Architecture decisions | 2 turns | 0 | Sibyl entities |
| Skills inventory | 1 turn | 0 | @skills directive |
| Session history / waypoints | 1 turn | 0 | @session + @waypoint |
| Total | 19 | 0 | |
§ 05

Benchmark Setup

We seeded a Sibyl Memory database with a realistic project corpus — 268 entities across 11 categories (simulating months of accumulated project knowledge). A 15-task suite measures discovery calls before the first productive action, comparing Sibyl alone vs. Sibyl + Perseus.

| Category | Entities | Contents |
| --- | --- | --- |
| component | 59 | Module names, status, owner, test coverage |
| decision | 58 | Architecture choices, SQLite, MIT, monorepo |
| bug | 43 | Known issues with severity, component, status |
| convention | 20 | Workflow rules: fix root cause, plan-first |
| infrastructure | 12 | Unraid homelab, CI pipeline, Mneme vault |
| endpoint | 25 | Service health check URLs + expected status |
| auth | 7 | Credential patterns, token extraction, rotation |
| project | 8 | Perseus, Minions, Mneme, Sibyl Memory |
| user/session/ref | 36 | User profiles, past sessions, runbooks |
§ 06

Sibyl Memory Baseline

Sibyl Memory already achieves near-flawless retrieval with dramatically lower cost and context footprint. It runs on a single SQLite file — no vector database, no embedding API, no external infrastructure.

| Benchmark | Sibyl Result | Next Best |
| --- | --- | --- |
| LongMemEval Oracle | 95.6% (#2) | 96.2% (BM25+vector hybrid) |
| 4-engine business memory (350 Q) | 350/350 retrieval | 152/350 (Hindsight) |
| Trap refusal (fake companies) | 50/50 refusals | 0/50 (all vector systems) |
| Context per query | 228 tokens | 11,892 tokens (Hindsight, 52×) |
| Cost to answer 350 questions | $0.64 | $18.68 (Hindsight, 29×) |

Even with perfect recall, agents still burn turns on orientation questions — this benchmark isolates that waste.
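To illustrate why a keyed store in a single SQLite file can refuse unknown entities instead of returning a nearest neighbor, here is a minimal sketch. It is not Sibyl Memory's schema or API: the table layout, the function names, and the stored text are all hypothetical, and only the category and entity name are borrowed from the Task 1 trace.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single-file entity store. Hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities ("
        " category TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " body TEXT NOT NULL,"
        " PRIMARY KEY (category, name))"
    )
    return conn


def remember(conn, category, name, body):
    """Insert or overwrite one entity."""
    conn.execute(
        "INSERT OR REPLACE INTO entities (category, name, body) VALUES (?, ?, ?)",
        (category, name, body),
    )
    conn.commit()


def recall(conn, category, name):
    """Exact-key lookup: returns the body, or None when the entity is unknown."""
    row = conn.execute(
        "SELECT body FROM entities WHERE category = ? AND name = ?",
        (category, name),
    ).fetchone()
    return row[0] if row else None


conn = open_store()
# Illustrative content only; the real entity text is not given in this report.
remember(conn, "convention", "fix-root-cause", "Fix the root cause, not the symptom.")
print(recall(conn, "convention", "fix-root-cause"))
print(recall(conn, "project", "made-up-company"))  # unknown entity: None
```

An exact-match query has no notion of "closest" result, so a lookup for an entity that was never stored returns nothing rather than a confident wrong answer.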

§ 07

Net Token Efficiency

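Perseus injects ~2,920 tokens once. Those tokens replace ~23,680 tokens of Sibyl + terminal discovery calls that recur every session. In the session-length table, the injection has not yet paid for itself at 10 turns (−650 tokens), breaks even between 10 and 15 turns, and savings compound from 15 turns onward.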

| Session Length | Sibyl Only | Sibyl + Perseus | Net Savings |
| --- | --- | --- | --- |
| 5 turns | ~7,500 tokens | ~9,150 tokens | −1,650 |
| 10 turns | ~15,000 tokens | ~15,650 tokens | −650 |
| 15 turns | ~22,500 tokens | ~22,150 tokens | +350 ✓ |
| 30 turns | ~45,000 tokens | ~37,650 tokens | +7,350 ✓✓ |
| 60 turns | ~90,000 tokens | ~68,650 tokens | +21,350 ✓✓✓ |
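The crossover point can be estimated from the table itself. The sketch below uses the 10- to 60-turn rows and interpolates linearly between the last row with negative net savings and the first row with positive net savings. Linear interpolation is an assumption; the report does not state how token use grows between the listed session lengths.

```python
# (session length in turns, Sibyl-only tokens, Sibyl + Perseus tokens),
# copied from the 10- to 60-turn rows of the table.
ROWS = [
    (10, 15_000, 15_650),
    (15, 22_500, 22_150),
    (30, 45_000, 37_650),
    (60, 90_000, 68_650),
]


def net_savings(row):
    """Tokens saved by adding Perseus for one table row (negative means a net cost)."""
    _, sibyl_only, with_perseus = row
    return sibyl_only - with_perseus


def breakeven_turns(rows):
    """Linearly interpolate the session length at which net savings cross zero."""
    for low, high in zip(rows, rows[1:]):
        s_low, s_high = net_savings(low), net_savings(high)
        if s_low < 0 <= s_high:
            return low[0] + (high[0] - low[0]) * (-s_low) / (s_high - s_low)
    return None  # no sign change within the listed rows


print([net_savings(row) for row in ROWS])
print(breakeven_turns(ROWS))
```

Under that assumption the net turns positive at roughly 13 turns, between the 10-turn and 15-turn rows.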
§ 08

The Complementary Stack

┌─────────────────────────────────────────────┐
│                  AGENTS.md                  │
│     (injected into LLM context window)      │
├─────────────────────────────────────────────┤
│  Perseus (environment layer)                │
│  • Services health (HTTP/Docker)            │
│  • Git state (branch, log, status)          │
│  • Skills inventory (filtered by category)  │
│  • Session history (last N sessions)        │
│  • Task board (@agora)                      │
│  • Environment (Python ver, hostname)       │
├─────────────────────────────────────────────┤
│  Sibyl Memory (knowledge layer)             │
│  • HOT: current focus, working state        │
│  • WARM: project facts, decisions           │
│  • COLD: auto-journaled turn history        │
│  • REFERENCE: static runbooks, docs         │
│  • ARCHIVE: retired entities                │
├─────────────────────────────────────────────┤
│  Mneme (supplementary search)               │
│  • FTS5 keyword search over vault           │
│  • Narrative compaction                     │
│  • Federation (cross-workspace)             │
└─────────────────────────────────────────────┘
§ 09

Reproduce It

All data is publicly available. The benchmark corpus, task suite, and Perseus context template are in the Perseus repository under bench/.

pip install perseus-ctx sibyl-memory-client
export SIBYL_MEMORY_ENABLED=1
PERSEUS_ALLOW_DANGEROUS=1 perseus render bench/bench_context.md --output AGENTS.md
# Agent starts with full orientation.
# Compare session length with and without AGENTS.md for any task.
§ 10

What This Does Not Prove

  • Self-seeded corpus. The 268-entity Sibyl DB was seeded by this benchmark. An independent tester running the same methodology on their own project would strengthen the result.
  • Model variance matters. Different LLMs may burn different numbers of discovery calls. Only one LLM configuration was used for the Sibyl query measurements.
  • Entity name sensitivity. Some recall calls return 0 hits because exact entity names differ from search terms. This is realistic — agents don’t know exact names — but naming conventions affect hit rates.
  • Token estimates are approximate. Token counts are estimated as chars/3; exact tokenization depends on the model’s tokenizer.
  • Terminal call savings are modeled. Shell commands weren’t executed in the measurement loop; they were treated as discoverable facts that Perseus pre-resolves.
  • Task difficulty scales savings. Simple tasks (avg 5.3 Sibyl calls) save fewer turns than complex tasks (avg 8.5 calls). Net savings compound with task scope.
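The chars/3 heuristic mentioned in the list above can be written in a few lines. This is a sketch, not the benchmark's own code: the function name is illustrative, and rounding up is an assumption, since the report does not say how fractional tokens are handled.

```python
def estimate_tokens(text: str, chars_per_token: int = 3) -> int:
    """Approximate a token count as len(text) / chars_per_token, rounded up.

    Rounding up is an assumption; the report only says "chars/3".
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Ceiling division without floats.
    return -(-len(text) // chars_per_token)


print(estimate_tokens("git branch --show-current"))  # 25 characters
```

A real tokenizer will give different counts, especially for code and non-English text, so figures produced this way are best read as order-of-magnitude estimates.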

Sibyl Memory gives the agent perfect recall.
Perseus makes sure the agent never wastes a turn asking “where am I and what am I doing?”

Together they answer both questions that matter at session start: what do we know (Sibyl’s 350/350 retrieval) and what’s happening now (Perseus’s 0-turn orientation). An agent productive from turn 1.