Benchmarks

Zaxy keeps the public benchmark surface intentionally small. Active benchmark evidence is limited to:

Older backend shootouts, partial slices, experimental LongMemEval iterations, LongMemBench adapter artifacts, and debug reports are archived under reports/archive/ and docs/archive/. They are development history, not current public claims.

Headline 500

The current headline LongMemEval-compatible result is:

reports/benchmarks/longmemeval-500-publish-20260607/live-benchmark.md

Frozen run config: reports/benchmarks/longmemeval-500-publish-20260607/run-config.md

This is a Zaxy same-harness checkout diagnostic over the cleaned LongMemEval-compatible workload. It is not an official LongMemEval end-to-end assistant score.

| Metric | Value |
| --- | --- |
| Generated | 2026-06-07T16:20:10Z |
| Workload SHA-256 | 90fb2307195d7e16b963a2b8a30f03b375bd42a45d41aeaa55423029dd84e3fc |
| Events | 5,372 |
| Questions | 500 |
| Sessions | 948 |
| Backend | zaxy-checkout |
| Mean score | 0.956 |
| Answer@5 | 0.910 |
| Recall@1 | 0.960 |
| Recall@5 | 1.000 |
| Recall@10 | 1.000 |
| Identity recall | 0.980 |
| Citation coverage | 1.000 |
| p50 latency | 881.01 ms |
| p95 latency | 1,966.65 ms |
| p99 latency | 2,495.07 ms |
| Approx tokens | 10,192 |

Interpretation: retrieval and citation are at ceiling in this adapted checkout protocol. The remaining reported misses are synthesis-side (45 synthesis_miss cases). For comparison, the same report includes a BM25 baseline with mean score 0.520, Answer@5 0.520, Recall@5 0.770, and citation coverage 1.000.
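For readers less familiar with the retrieval metrics in the table, here is a minimal sketch of how a Recall@k-style number is conventionally computed: the fraction of questions for which at least one gold evidence item appears among the top k retrieved items. This is not Zaxy's scorer. The function name, the data layout, and the toy IDs are assumptions made for illustration, and the report's exact metric definitions may differ.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int) -> float:
    """Fraction of questions with at least one gold item in the top-k results.

    Hypothetical helper: ranked[i] is the ranked retrieval list for question i,
    and gold[i] is the set of evidence IDs that count as a hit.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have one entry per question")
    if not ranked:
        return 0.0
    hits = sum(1 for results, wanted in zip(ranked, gold) if wanted & set(results[:k]))
    return hits / len(ranked)


# Toy data invented for this sketch; it is not taken from the report.
ranked = [["e1", "e7", "e3"], ["e9", "e2", "e4"], ["e5", "e6", "e8"]]
gold = [{"e1"}, {"e2"}, {"e8"}]

r1 = recall_at_k(ranked, gold, k=1)  # only the first question hits at rank 1
r3 = recall_at_k(ranked, gold, k=3)  # every question hits within the top 3
```

Under this reading, a Recall@5 at ceiling paired with a lower Answer@5 indicates that the evidence was retrieved but the answer was still missed, which matches the synthesis-side interpretation above.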

Harvey LAB

The current Harvey LAB external memory-ablation evidence is:

reports/benchmarks/harvey-lab-memory-ablation/publishable-statistics.md

Primary report artifacts:

| Metric | Value |
| --- | --- |
| External suite | Harvey LAB memory retrieval ablation |
| Harvey commit | 29748828133dff83ad2263af353fb035504f8f77 |
| Tasks completed | 10/10 |
| Mean criterion pass rate | 0.788 |
| Delta vs regular/no-memory | +0.184 |
| Delta vs article-best task rows | +0.081 |
| Wins vs article-best task rows | 9/10 |
| Mean total seconds | 138.786 |
| Total tokens | 5,951,174 |
| Memory search calls | 30 |
| Memory read calls | 10 |

Interpretation: Harvey LAB is external downstream work-product evidence. The metric is criterion pass rate, not binary task pass/fail.
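The difference between a criterion pass rate and a binary task pass rate can be sketched as follows. The task names and counts are invented for illustration and are not Harvey LAB data. The sketch also assumes the mean is taken over per-task rates rather than pooled over all criteria, which the text above does not specify.

```python
from statistics import mean

# Hypothetical per-task results: (criteria passed, criteria total).
# These numbers are invented and are not Harvey LAB data.
tasks = {
    "task_a": (9, 10),
    "task_b": (6, 8),
    "task_c": (4, 4),
}

# Criterion pass rate: average the per-task fraction of criteria satisfied.
criterion_rates = {name: passed / total for name, (passed, total) in tasks.items()}
mean_criterion_pass_rate = mean(criterion_rates.values())

# Binary pass/fail: a task counts only if every one of its criteria passed.
binary_pass_rate = mean(
    1.0 if passed == total else 0.0 for passed, total in tasks.values()
)
```

In this toy case the criterion pass rate is well above the binary rate, because partially satisfied tasks still contribute. That is why the two metrics should not be compared directly.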

Zaxy 2.0 RC.1 Benchmark Freeze

The 2.0.0-rc.1 freeze gate validates the current release evidence without changing retrieval, synthesis, or benchmark scoring behavior:

zaxy benchmark-freeze --json

The tracked freeze manifest is reports/benchmarks/2.0.0-rc.1/manifest.json.

The gate is a claim-boundary and artifact-integrity check. It requires the headline 500-question LongMemEval-compatible checkout report, the frozen headline run config, Harvey LAB external-anchor artifacts, and the project-defined RC lanes for StateRecoveryBench, CoordinationBench, PurposeBench, causal, consolidation, procedural, and metacognitive behavior.
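An artifact-integrity check of this kind can be sketched as a presence-plus-digest comparison. The code below is an assumption-laden illustration, not the `zaxy benchmark-freeze` implementation: the `expected` mapping, the function names, and the idea that the manifest stores SHA-256 digests per file are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_artifacts(expected: Dict[str, Optional[str]], root: Path) -> List[str]:
    """Return human-readable problems; an empty list means every check passed.

    `expected` maps a relative path to an expected digest, or to None when only
    presence is checked. This layout is hypothetical, not the manifest schema.
    """
    problems = []
    for relative, want in expected.items():
        target = root / relative
        if not target.is_file():
            problems.append(f"missing: {relative}")
        elif want is not None and sha256_of(target) != want:
            problems.append(f"digest mismatch: {relative}")
    return problems
```

A caller would pass the repository root and a mapping built from whatever the tracked manifest actually records, then treat any non-empty result as a failed check.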

RC.1 evidence is interpreted in three separate buckets:

The active RC.1 project-defined artifacts are:

CoordinationBench has a conservative competitor-claim boundary. The competitor_claim_gate remains blocked for Quarq and Hybi unless a future same-harness run supplies tracked, auditable runner outputs. Release checks use explicit flags such as --require-competitor-claim quarq and --require-competitor-claim hybi only to prove that missing competitor evidence is disclosed and cannot silently become a public claim. The current CoordinationBench artifact is a Zaxy project-defined internal guardrail for accepted parent state, stale-claim rejection, duplicate consolidation, non-authoritative leakage, evidence coverage, purpose feedback, and checkout answerability.

PurposeBench follows the same disclosure posture for purpose-conditioned memory claims. Publicly derived purpose examples that mention systems such as Quarq or Semantic Reach are diagnostic holdouts only; they are not head-to-head benchmark claims and they do not establish competitor performance. The active PurposeBench report proves Zaxy's purpose profiles and evidence policies on tracked internal lanes, while the holdout pack documents source boundaries and claim status.

The RC.1 gate fails closed when required artifacts are missing, when the headline 500 falls below the frozen quality or latency floors, or when a 2.0 internal or project-defined lane is classified as external validation. This is intentionally a release-readiness gate, not a reward function.
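The fail-closed behavior described above can be sketched roughly as follows. Everything here is hypothetical: the `GateInput` fields, the reason strings, and the threshold values are invented for illustration, since the frozen floors and the real gate schema are not stated in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GateInput:
    """Hypothetical inputs for a fail-closed release gate (not Zaxy's schema)."""
    artifacts_present: Dict[str, bool]
    mean_score: Optional[float]
    p95_latency_ms: Optional[float]
    lane_validation: Dict[str, str] = field(default_factory=dict)


def evaluate_gate(data: GateInput, min_score: float, max_p95_ms: float) -> List[str]:
    """Return blocking reasons; the gate passes only when the list is empty."""
    reasons = []
    for name, present in data.artifacts_present.items():
        if not present:
            reasons.append(f"missing artifact: {name}")
    # Fail closed: an absent metric blocks the gate instead of being skipped.
    if data.mean_score is None or data.mean_score < min_score:
        reasons.append("headline quality below floor or unavailable")
    if data.p95_latency_ms is None or data.p95_latency_ms > max_p95_ms:
        reasons.append("headline latency outside limit or unavailable")
    # Internal or project-defined lanes must never be classified as external.
    for lane, label in data.lane_validation.items():
        if label == "external":
            reasons.append(f"lane {lane} is classified as external validation")
    return reasons


# Thresholds below are placeholders, not the frozen RC.1 floors.
ok = GateInput({"headline-report": True, "run-config": True}, 0.956, 1966.65, {"causal": "internal"})
bad = GateInput({"headline-report": True, "run-config": False}, None, 1966.65, {"causal": "external"})
passing = evaluate_gate(ok, min_score=0.90, max_p95_ms=3000.0)
blocking = evaluate_gate(bad, min_score=0.90, max_p95_ms=3000.0)
```

Returning a list of reasons rather than a boolean keeps the gate auditable: a release check can print exactly which boundary was crossed.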

Zaxy 2.0 Alpha Causal And Consolidation Lane

Zaxy 2.0 alpha includes a project-defined internal guardrail lane for causal projection and review-gated consolidation. This lane is not external validation, is not part of the headline LongMemEval-compatible checkout claim, and must not be reported as a public benchmark number unless a future release explicitly publishes a full report with its own claim boundary.

The alpha lane checks behavior that is specific to the causal and consolidation contracts:

Use this lane as an engineering regression guardrail for the alpha causal and consolidation surface. The consolidation guardrail is internal and project-defined: it measures source-event fidelity, review gating, stale rejection, and authority-boundary preservation. Do not combine it with the headline 500 metrics, Harvey LAB evidence, or external-validation language.

Zaxy 2.0 Beta.1 Reasoning-Loop Guardrail

Beta.1 adds an internal guardrail scorer for reasoning-loop memory primitives. This is an engineering contract check, not a public benchmark claim and not a LongMemBench-tailored lane. It does not score final answers or tune retrieval.

The guardrail reports five transparent fields:

Use this lane to catch regressions in observability, phase routing, citation coverage, and authority boundaries for beta.1 primitives. Do not report it as external validation, do not combine it with the headline 500 or Harvey LAB numbers, and do not use it to reward answer phrasing.

Beta.2 extends the internal guardrail to metacognition and procedural planning contracts. The scorer inspects contract fields only; it does not score final answers, expected benchmark labels, or answer phrasing. The beta.2 fields are:

This beta.2 guardrail is an internal release-quality check and readiness signal. It is not external validation and must not be merged into the headline LongMemEval-compatible or Harvey LAB results.

Agent Experience Lanes (Internal)

Zaxy 2.1 Phase 1 adds three deterministic agent-experience lanes in zaxy_benchmarks/agent_experience_lanes.py. They are project-defined internal lanes labeled "validation": "internal" in every report. Run them with:

zaxy agent-experience-lanes --lanes all

All three lanes are deterministic (hash embeddings, embedded projection, fixed seed content, no LLM calls). Do not report them as external validation and do not combine them with the headline 500 or Harvey LAB numbers.
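A downstream check that every lane report carries the internal label might look like the sketch below. Only the "validation": "internal" key and value come from the text above. The assumption that the key sits at the top level of a JSON report, the helper name, and the sample bodies are all hypothetical.

```python
import json
from typing import Iterable, List


def unlabeled_reports(report_texts: Iterable[str]) -> List[int]:
    """Return indexes of reports that do not carry "validation": "internal"."""
    offenders = []
    for index, text in enumerate(report_texts):
        report = json.loads(text)
        # Assumes a top-level key; the real report layout may differ.
        if report.get("validation") != "internal":
            offenders.append(index)
    return offenders


# Invented report bodies for illustration only.
samples = [
    '{"lane": "example-a", "validation": "internal"}',
    '{"lane": "example-b"}',
]
offenders = unlabeled_reports(samples)
```

Treating a missing label the same as a wrong label keeps an unlabeled report from being mistaken for external evidence.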

Cognitive Lanes (Internal)

Zaxy 2.2 adds two deterministic cognitive-memory lanes in zaxy_benchmarks/forgetting_lane.py and zaxy_benchmarks/fok_calibration_lane.py. They produce the mechanism-level evidence behind the 2.3-rc.1 default-flip decisions (cognitive retrieval profile, salience-on) and are labeled "validation": "internal" in every report. Run them with:

zaxy cognitive-lanes --lanes all

Both lanes are deterministic (hash embeddings, embedded projection, fixed word tables and timestamps, no LLM calls): two runs produce identical reports. They are mechanism-level engineering evidence, not external validation, and must not be merged into the headline 500 or Harvey LAB numbers.
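To illustrate why hash-derived embeddings and fixed inputs yield identical reports across runs, here is a minimal sketch. It is not Zaxy's hash embedding provider: the `hash_embedding` and `build_report` functions, the vector dimension, and the report shape are assumptions chosen for the example.

```python
import hashlib
import json
import struct
from typing import List


def hash_embedding(text: str, dim: int = 8) -> List[float]:
    """Map text to a fixed vector derived only from SHA-256 digests."""
    values = []
    for i in range(dim):
        digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
        (raw,) = struct.unpack(">I", digest[:4])  # first 4 bytes as unsigned int
        values.append(raw / 0xFFFFFFFF * 2.0 - 1.0)  # scale to [-1, 1]
    return values


def build_report(texts: List[str]) -> str:
    """Serialize a toy report with sorted keys so the output bytes are stable."""
    payload = {
        "validation": "internal",
        "vectors": {t: hash_embedding(t) for t in texts},
    }
    return json.dumps(payload, sort_keys=True)


# Two independent runs over the same fixed inputs.
first = build_report(["alpha", "beta"])
second = build_report(["alpha", "beta"])
```

Because nothing depends on wall-clock time, random state, or a model call, `first == second` holds, so a byte-level comparison of two reports is enough to detect a regression.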

Graph-Walk and Vector-Scale Lanes (Internal)

Zaxy 2.2/2.3 adds two more deterministic lanes in zaxy_benchmarks/graph_walk_lane.py and zaxy_benchmarks/vector_scale_lane.py, labeled "validation": "internal" in every report. Run them with:

zaxy graph-scale-lanes --lanes all

Both lanes use synthetic corpora and deterministic synthetic vectors (the hash embedding provider, or seeded gaussian vectors for the vector-scale realistic-distribution control). Do not report them as external validation and do not combine them with the headline 500 or Harvey LAB numbers.
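Seeded gaussian vectors of the kind mentioned above can be generated reproducibly with a private random number generator. The sketch below is illustrative only: the function name, the unit-length normalization, and the seed values are assumptions, not details of the vector-scale lane.

```python
import math
import random
from typing import List


def seeded_gaussian_vectors(count: int, dim: int, seed: int) -> List[List[float]]:
    """Generate unit-length gaussian vectors from a private, seeded RNG."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    vectors = []
    for _ in range(count):
        raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # guard against zero norm
        vectors.append([x / norm for x in raw])
    return vectors


a = seeded_gaussian_vectors(count=4, dim=16, seed=7)
b = seeded_gaussian_vectors(count=4, dim=16, seed=7)  # same seed, same vectors
c = seeded_gaussian_vectors(count=4, dim=16, seed=8)  # different seed
```

The same seed reproduces the corpus exactly, while a different seed gives a different but equally synthetic one. Neither says anything about behavior on real embeddings.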

Claim Boundaries

Related docs: testing.md, external-validation.md, and README.md.