Even with perfect memory retrieval (Sibyl's 350/350), AI agents still waste turns asking “where am I and what am I doing?” Perseus eliminates those discovery calls before the session starts.
| # | Task | Sibyl Only | + Perseus | Saved |
|---|---|---|---|---|
| 1 | Fix credential redaction: nested JSON tokens | 8 calls | 1 call | 7 |
| 2 | Add health check for a new service endpoint | 5 calls | 0 calls | 5 |
| 3 | Update CI workflow to test Python 3.13 | 5 calls | 2 calls | 3 |
| 4 | Add memory-cleanup skill SKILL.md | 5 calls | 2 calls | 3 |
| 5 | Fix Mneme FTS5 search escaping bug (#318) | 5 calls | 4 calls | 1 |
| 6 | Fix CLI overwrite without warning (#314) | 4 calls | 3 calls | 1 |
| 7 | Update dependency scanner for optional imports | 5 calls | 5 calls | 0 |
| 8 | Implement convention checker for agent behavior | 8 calls | 4 calls | 4 |
| 9 | Refactor memory mesh to deduplicate cross-backend | 8 calls | 7 calls | 1 |
| 10 | Deploy Perseus v1.0.7 to PyPI | 10 calls | 4 calls | 6 |
| 11 | Add Perseus MCP server tool integration | 8 calls | 5 calls | 3 |
| 12 | Build cross-workspace memory search UI | 9 calls | 9 calls | 0 |
| 13 | Implement TTL cache invalidation on config change | 7 calls | 7 calls | 0 |
| 14 | Add multi-tenant support to Sibyl Memory connector | 8 calls | 4 calls | 4 |
| 15 | Performance audit: profile AGENTS.md rendering | 10 calls | 8 calls | 2 |
| | Average | 7.0 | 4.3 | 2.7 |
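The Average row can be re-derived from the fifteen task rows. A minimal Python sketch, with the call counts transcribed by hand from the table above (the variable names are illustrative, not part of the benchmark code):

```python
# Discovery calls per task, transcribed from the table (tasks 1-15, in order).
sibyl_only = [8, 5, 5, 5, 5, 4, 5, 8, 8, 10, 8, 9, 7, 8, 10]
with_perseus = [1, 0, 2, 2, 4, 3, 5, 4, 7, 4, 5, 9, 7, 4, 8]

# Calls saved per task is the row-wise difference.
saved = [before - after for before, after in zip(sibyl_only, with_perseus)]

# Plain sum/len keeps the results as floats, rounded to one decimal like the table.
avg_sibyl = round(sum(sibyl_only) / len(sibyl_only), 1)
avg_perseus = round(sum(with_perseus) / len(with_perseus), 1)
avg_saved = round(sum(saved) / len(saved), 1)

print(avg_sibyl, avg_perseus, avg_saved)
```

The totals behind those averages are 105, 65, and 40 calls across the 15 tasks.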
For task #1 (“Fix credential redaction: nested JSON tokens”), here are the agent’s first 8 discovery calls with Sibyl only:
With Perseus: All of the above is in AGENTS.md before turn 1. Agent reads context and starts working immediately.
Sibyl’s V2 benchmark proved that vector systems hallucinate confident neighbors for unknown entities (0/50 trap refusals vs. Sibyl’s 50/50). We introduce a complementary trap class: information the agent should never need to discover at session start.
| Trap Question | Sibyl Only | Sibyl + Perseus |
|---|---|---|
| “What OS is this?” | Wastes turn on uname | Pre-resolved in AGENTS.md |
| “What Python version?” | Wastes turn on python3 --version | Pre-resolved |
| “Is Hermes running?” | Wastes turn on curl health check | Pre-resolved (11ms latency) |
| “What git branch?” | Wastes turn on git branch | Pre-resolved |
| “What skills do I have?” | Wastes turn on skill listing | Pre-resolved (12 skills, filtered) |
| “Who is the user?” | Wastes turn on sibyl_recall | Already in context |
| “What conventions apply?” | Wastes turn on sibyl_search | Already in context |
| “What was the last decision?” | Wastes turn on sibyl_search | Already in context |
| Total traps triggered | 8 / 8 | 0 / 8 |
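To make "pre-resolved" concrete, here is a minimal Python sketch of answering a few of these trap questions once, before the session, and rendering them as a block of context. This is an illustration under assumptions, not Perseus's actual implementation: the function names `current_git_branch`, `resolve_facts`, and `render_block` and the output layout are hypothetical.

```python
import platform
import subprocess


def current_git_branch() -> str:
    """Return the current git branch, or 'unknown' outside a repo or without git."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    branch = result.stdout.strip()
    return branch if result.returncode == 0 and branch else "unknown"


def resolve_facts() -> dict[str, str]:
    """Collect facts an agent would otherwise spend a turn discovering."""
    return {
        "OS": f"{platform.system()} {platform.release()}",
        "Python": platform.python_version(),
        "Git branch": current_git_branch(),
    }


def render_block(facts: dict[str, str]) -> str:
    """Render the facts as a short markdown list (hypothetical layout)."""
    lines = ["## Environment (pre-resolved)"]
    lines += [f"- {name}: {value}" for name, value in facts.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_block(resolve_facts()))
```

Each fact costs one local call at render time instead of one agent turn per session.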
| Discovery Category | Sibyl Only | + Perseus | Perseus Source |
|---|---|---|---|
| Environment (OS, Python, hostname, disk) | 4 turns | 0 | @query directives |
| Git state (branch, log, status) | 2 turns | 0 | @query directives |
| Services health | 1 turn | 0 | @services block |
| Project facts (repo, version, owner) | 2 turns | 0 | Sibyl entities + @read |
| Auth patterns / credentials | 2 turns | 0 | Sibyl entities |
| Conventions / workflow rules | 2 turns | 0 | Sibyl entities |
| Architecture decisions | 2 turns | 0 | Sibyl entities |
| Skills inventory | 1 turn | 0 | @skills directive |
| Session history / waypoints | 1 turn | 0 | @session + @waypoint |
| Total | 19 | 0 | |
We seeded a Sibyl Memory database with a realistic project corpus — 268 entities across 11 categories (simulating months of accumulated project knowledge). A 15-task suite measures discovery calls before the first productive action, comparing Sibyl alone vs. Sibyl + Perseus.
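As a rough illustration of the metric, "discovery calls before the first productive action" can be read as the length of the leading run of discovery calls in a session transcript. The sketch below assumes each call has already been labeled as discovery or productive; the record format, the `kind` field, and the sample transcript are hypothetical and are not taken from the benchmark's task suite.

```python
from itertools import takewhile


def discovery_calls_before_work(calls: list[dict]) -> int:
    """Count leading discovery calls, stopping at the first productive one."""
    return sum(1 for _ in takewhile(lambda call: call["kind"] == "discovery", calls))


# Hypothetical, hand-labeled transcript for illustration only.
transcript = [
    {"tool": "terminal", "command": "uname -a", "kind": "discovery"},
    {"tool": "terminal", "command": "git branch", "kind": "discovery"},
    {"tool": "sibyl_search", "command": "conventions", "kind": "discovery"},
    {"tool": "editor", "command": "edit redaction module", "kind": "productive"},
    # Discovery after work has begun is not counted by this metric.
    {"tool": "sibyl_recall", "command": "user", "kind": "discovery"},
]

print(discovery_calls_before_work(transcript))
```

Under this reading, a task scored as "0 calls" is one where the first call in the transcript is already productive.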
Sibyl Memory already achieves near-flawless retrieval with dramatically lower cost and context footprint. It runs on a single SQLite file — no vector database, no embedding API, no external infrastructure.
| Benchmark | Sibyl Result | Closest Competitor |
|---|---|---|
| LongMemEval Oracle | 95.6% (#2) | 96.2% (BM25+vector hybrid) |
| 4-engine business memory (350 Q) | 350/350 retrieval | 152/350 (Hindsight) |
| Trap refusal (fake companies) | 50/50 refusals | 0/50 (all vector systems) |
| Context per query | 228 tokens | 11,892 tokens (Hindsight, 52×) |
| Cost to answer 350 questions | $0.64 | $18.68 (Hindsight, 29×) |
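The 52× and 29× multipliers in the last two rows follow directly from the figures in the table. A short Python check (the per-question cost is derived here by dividing by the 350 questions and is not a figure reported in the source):

```python
# Figures from the table above.
sibyl_tokens, hindsight_tokens = 228, 11_892
sibyl_cost, hindsight_cost = 0.64, 18.68
questions = 350

token_ratio = hindsight_tokens / sibyl_tokens   # context footprint multiplier
cost_ratio = hindsight_cost / sibyl_cost        # cost multiplier

print(f"context: {token_ratio:.1f}x, cost: {cost_ratio:.1f}x")
print(f"per question: ${sibyl_cost / questions:.4f} vs ${hindsight_cost / questions:.4f}")
```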
Even with perfect recall, agents still burn turns on orientation questions — this benchmark isolates that waste.
Perseus injects ~2,920 tokens once. Those tokens replace ~23,680 tokens of Sibyl and terminal discovery calls that would otherwise recur every session. Breakeven falls at ~13 turns; savings compound from 15 turns onward.
| Session Length | Sibyl Only | Sibyl + Perseus | Net Savings |
|---|---|---|---|
| 5 turns | ~7,500 tokens | ~9,150 tokens | −1,650 |
| 10 turns | ~15,000 tokens | ~15,650 tokens | −650 |
| 15 turns | ~22,500 tokens | ~22,150 tokens | +350 ✓ |
| 30 turns | ~45,000 tokens | ~37,650 tokens | +7,350 ✓✓ |
| 60 turns | ~90,000 tokens | ~68,650 tokens | +21,350 ✓✓✓ |
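The point where net savings cross zero can be estimated from the Net Savings column by linear interpolation between adjacent rows. A minimal Python sketch, assuming savings change roughly linearly between the measured session lengths (the function name `break_even` is illustrative):

```python
# (session length in turns, net token savings) from the table above.
rows = [(5, -1650), (10, -650), (15, 350), (30, 7350), (60, 21350)]


def break_even(rows):
    """Interpolate the turn count at which net savings first reach zero."""
    for (t0, s0), (t1, s1) in zip(rows, rows[1:]):
        if s0 < 0 <= s1:
            # Straight line between the two rows, solved for savings == 0.
            return t0 + (0 - s0) * (t1 - t0) / (s1 - s0)
    return None  # savings never cross zero in the measured range


print(break_even(rows))  # → 13.25
```

The crossing lies between the 10-turn and 15-turn rows, so sessions shorter than that pay a small net cost for the one-time injection.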
All data is publicly available. The benchmark corpus, task suite, and Perseus context template are in the Perseus repository under bench/.
Sibyl Memory gives the agent perfect recall.
Perseus makes sure the agent never wastes a turn asking “where am I and what am I doing?”
Together they answer both questions that matter at session start: what do we know (Sibyl’s 350/350 retrieval) and what’s happening now (Perseus’s 0-turn orientation). An agent productive from turn 1.