Zaxy 2.0 Beta.2 Metacognitive and Procedural Hardening Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Add production-ready metacognitive state tracking and first-class procedural planning diagnostics without turning uncertainty, confidence, or procedures into authoritative facts.

Architecture: Eventloom remains the source of truth. Beta.2 adds strict metacognition event contracts, replay-derived query surfaces, checkout diagnostics, and procedural planning buckets that reuse existing Skill Memory and consolidation projections. All generated metacognitive/procedural outputs are non-authoritative, cited, observable, and excluded from claim-support scoring unless a separate authority path has accepted them.

Tech Stack: Python 3.11+, Eventloom JSONL, existing MemoryFabric, Memory Checkout, Skill Memory, consolidation candidates, Typer CLI, MCP Python SDK, pytest, ruff.

---

Scope Boundary

Included:

Excluded:

File Structure

Create:

Modify:

---

Task 1: Add Metacognition Contracts

Files:

Add tests proving:

from zaxy.metacognition import (
    build_confidence_assessment_event,
    build_conflict_cluster_event,
    build_known_unknown_event,
    build_reverify_request_event,
    summarize_metacognition_events,
)


SOURCE = [{"seq": 7, "hash": "a" * 64}]


def test_known_unknown_event_is_open_cited_and_non_authoritative() -> None:
    event = build_known_unknown_event(
        actor="agent",
        session_id="agent-1",
        question="Which projection backend caused the latency spike?",
        reason="Checkout had conflicting backend evidence.",
        source_events=SOURCE,
        claim_key="projection-latency-cause",
        gap_type="conflicting_evidence",
        reverify_query="latest cited projection latency cause",
    )

    assert event["event_type"] == "metacognition.unknown.recorded"
    assert event["thread"] == "agent-1"
    assert event["payload"]["status"] == "open"
    assert event["payload"]["authority_status"] == "non_authoritative"
    assert event["payload"]["source_events"] == SOURCE


def test_confidence_assessment_event_tracks_append_only_point() -> None:
    event = build_confidence_assessment_event(
        actor="zaxy-reasoning",
        session_id="agent-1",
        claim="Projection stale caused failure",
        confidence=0.42,
        support_count=1,
        conflict_count=2,
        evidence=[
            {"citation": "eventloom://agent-1/events/7#aaaaaaaaaaaa", "stance": "support"},
            {"citation": "eventloom://agent-1/events/8#bbbbbbbbbbbb", "stance": "conflict"},
        ],
        method="deterministic_token_overlap_v1",
        requires_reverify=True,
    )

    assert event["event_type"] == "metacognition.confidence.assessed"
    assert event["payload"]["confidence"] == 0.42
    assert event["payload"]["requires_reverify"] is True
    assert event["payload"]["authority_status"] == "non_authoritative"


def test_conflict_cluster_event_preserves_support_and_conflict_sources() -> None:
    event = build_conflict_cluster_event(
        actor="zaxy-reasoning",
        session_id="agent-1",
        claim_key="projection-latency-cause",
        claim="Projection stale caused failure",
        supporting_source_events=[{"seq": 7, "hash": "a" * 64}],
        conflicting_source_events=[{"seq": 8, "hash": "b" * 64}],
        confidence=0.5,
        reason="Support and conflict evidence both present.",
    )

    assert event["event_type"] == "metacognition.conflict.clustered"
    assert event["payload"]["resolution_status"] == "unresolved"
    assert event["payload"]["authority_status"] == "non_authoritative"


def test_reverify_request_event_is_open_and_cited() -> None:
    event = build_reverify_request_event(
        actor="zaxy-reasoning",
        session_id="agent-1",
        query="Re-check cited projection latency cause",
        reason="Low confidence and conflict count above zero.",
        source_events=SOURCE,
        priority="high",
        claim_key="projection-latency-cause",
    )

    assert event["event_type"] == "metacognition.reverify.requested"
    assert event["payload"]["status"] == "open"
    assert event["payload"]["priority"] == "high"


def test_summarize_metacognition_events_returns_open_counts() -> None:
    events = [
        build_known_unknown_event(
            actor="agent",
            session_id="agent-1",
            question="What changed?",
            reason="No cited answer.",
            source_events=SOURCE,
            claim_key="change",
            gap_type="missing_evidence",
            reverify_query="what changed",
        ),
        build_reverify_request_event(
            actor="agent",
            session_id="agent-1",
            query="what changed",
            reason="missing evidence",
            source_events=SOURCE,
            priority="normal",
            claim_key="change",
        ),
    ]

    summary = summarize_metacognition_events(events)

    assert summary["unknown_count"] == 1
    assert summary["open_unknown_count"] == 1
    assert summary["reverify_needed_count"] == 1

Run:

pytest tests/test_metacognition.py --no-cov -q

Expected: collection fails with an import error because zaxy.metacognition does not exist yet.

Create src/zaxy/metacognition.py with:

Every builder must set authority_status="non_authoritative". Known unknowns and reverify requests must set status="open". Conflict clusters must set resolution_status="unresolved".

Run:

pytest tests/test_metacognition.py --no-cov -q

Expected: all metacognition contract tests pass.

---

Task 2: Add MemoryFabric Metacognition Services

Files:

Add async tests proving:

Use embedded/Eventloom-only tests. Monkeypatch projection where necessary.

Run:

pytest tests/test_metacognition.py tests/test_reasoning_primitives.py -k "metacognition or confidence" --no-cov -q

Expected: fail because MemoryFabric methods and assessment recording are missing.

Add methods:

async def record_known_unknown(
    self,
    question: str,
    *,
    reason: str,
    source_events: list[dict[str, Any]],
    claim_key: str,
    gap_type: str = "missing_evidence",
    reverify_query: str | None = None,
    phase: str = "review",
    session_id: str = "default",
    actor: str = "zaxy-reasoning",
) -> dict[str, Any]: ...

async def list_known_unknowns(
    self,
    *,
    session_id: str = "default",
    status: str = "open",
    limit: int = 10,
) -> dict[str, Any]: ...

async def list_conflict_clusters(
    self,
    *,
    session_id: str = "default",
    unresolved_only: bool = True,
    limit: int = 10,
) -> dict[str, Any]: ...

async def list_confidence_trajectory(
    self,
    claim: str,
    *,
    session_id: str = "default",
    limit: int = 10,
) -> dict[str, Any]: ...

async def list_reverification_needs(
    self,
    query: str | None = None,
    *,
    session_id: str = "default",
    limit: int = 10,
    min_confidence: float = 0.7,
) -> dict[str, Any]: ...

Update get_claim_confidence(...) to accept record_assessment: bool = True. When it is true, append a metacognition.confidence.assessed event. If support and conflict evidence are both present, append metacognition.conflict.clustered. If confidence is below min_confidence or the conflict count is non-zero, append metacognition.reverify.requested.
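The recording rules can be expressed as a small pure function, which keeps them easy to test apart from MemoryFabric. This is a sketch under stated assumptions: follow_up_event_types is a hypothetical helper name, the 0.7 default is borrowed from list_reverification_needs, and returning nothing when record_assessment is false is one possible reading of the rule rather than a confirmed behavior.

```python
def follow_up_event_types(
    *,
    confidence: float,
    support_count: int,
    conflict_count: int,
    min_confidence: float = 0.7,
    record_assessment: bool = True,
) -> list[str]:
    """Return the metacognition event types to append, in append order."""
    if not record_assessment:
        # Assumption: disabling assessment recording suppresses all follow-ups.
        return []
    event_types = ["metacognition.confidence.assessed"]
    if support_count > 0 and conflict_count > 0:
        # Both stances are present, so the disagreement is clustered.
        event_types.append("metacognition.conflict.clustered")
    if confidence < min_confidence or conflict_count > 0:
        # Low confidence or any conflict triggers a reverify request.
        event_types.append("metacognition.reverify.requested")
    return event_types
```

With the values from the confidence test in Task 1 (confidence 0.42, one supporting and two conflicting citations), this returns all three event types.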

Implementation requirements:

Run:

pytest tests/test_metacognition.py tests/test_reasoning_primitives.py -k "metacognition or confidence" --no-cov -q

Expected: all Task 2 tests pass.

---

Task 3: Add Typed Metacognition Projection and Checkout Diagnostics

Files:

Tests must prove:

Run:

pytest tests/test_extract.py tests/test_causal_checkout.py tests/test_checkout.py -k "metacognition or unknown or reverify" --no-cov -q

Expected: fail because typed extraction and diagnostics are missing.

Add extractors in src/zaxy/extract.py:

Use entity types:

Use deterministic entity names from event payload IDs. Use the existing source-event snapshot helpers for backend-safe source_event_refs, source_event_seqs, and source_event_hashes.

Add deterministic diagnostics in src/zaxy/checkout.py:

diagnostics["metacognition"] = {
    "unknown_count": ...,
    "open_unknown_count": ...,
    "conflict_cluster_count": ...,
    "unresolved_conflict_count": ...,
    "low_confidence_count": ...,
    "reverify_needed_count": ...,
    "authority_status": "non_authoritative",
}
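One way to fill these counts deterministically is to derive them from the replayed metacognition events alone. The sketch below assumes each event is a dict shaped like the Task 1 builders' output. metacognition_diagnostics is a hypothetical helper name, and the definitions of "low confidence" (below a 0.7 threshold) and "reverify needed" (requests whose status is still open) are assumptions.

```python
from typing import Any


def metacognition_diagnostics(
    events: list[dict[str, Any]], *, min_confidence: float = 0.7
) -> dict[str, Any]:
    """Count metacognition events by type and state; no clocks, no randomness."""

    def payloads(event_type: str) -> list[dict[str, Any]]:
        return [
            e.get("payload", {}) for e in events if e.get("event_type") == event_type
        ]

    unknowns = payloads("metacognition.unknown.recorded")
    clusters = payloads("metacognition.conflict.clustered")
    assessments = payloads("metacognition.confidence.assessed")
    reverifies = payloads("metacognition.reverify.requested")
    return {
        "unknown_count": len(unknowns),
        "open_unknown_count": sum(p.get("status") == "open" for p in unknowns),
        "conflict_cluster_count": len(clusters),
        "unresolved_conflict_count": sum(
            p.get("resolution_status") == "unresolved" for p in clusters
        ),
        # Assumption: an assessment without a confidence value is not low-confidence.
        "low_confidence_count": sum(
            p.get("confidence", 1.0) < min_confidence for p in assessments
        ),
        "reverify_needed_count": sum(p.get("status") == "open" for p in reverifies),
        "authority_status": "non_authoritative",
    }
```

Because the function depends only on the event list, replaying the same Eventloom log yields the same diagnostics.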

Guidance must include:

Run:

pytest tests/test_extract.py tests/test_causal_checkout.py tests/test_checkout.py -k "metacognition or unknown or reverify" --no-cov -q

Expected: all Task 3 tests pass.

---

Task 4: Harden Procedural Planning Lane

Files:

Tests must prove:

Run:

pytest tests/test_reasoning_primitives.py tests/test_checkout.py tests/test_mcp.py -k "procedure or skill" --no-cov -q

Expected: fail because the procedural planning lane is still flat and memory_skill does not pass all rollback fields.

Update helper logic near _procedure_contexts:

Update retrieve_similar_procedures so that its result keeps procedures as a backward-compatible alias for applicable, and adds applicable, diagnostic, excluded, and procedural_memory.
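The bucketed result shape might be assembled as follows. This is an illustrative sketch only: bucket_procedures is a hypothetical helper, and the classification rule (a status field holding "contradicted" or "deprecated", plus an applies flag) is an assumption standing in for the actual helper logic near _procedure_contexts.

```python
from typing import Any

# Assumption: these lifecycle states remove a procedure from planning.
EXCLUDED_STATUSES = {"contradicted", "deprecated"}


def bucket_procedures(candidates: list[dict[str, Any]]) -> dict[str, Any]:
    """Split candidate procedures into applicable, diagnostic, and excluded."""
    applicable: list[dict[str, Any]] = []
    diagnostic: list[dict[str, Any]] = []
    excluded: list[dict[str, Any]] = []
    for candidate in candidates:
        status = candidate.get("status")
        if status in EXCLUDED_STATUSES:
            # Record why the procedure was dropped so diagnostics can report it.
            excluded.append({**candidate, "excluded_reason": status})
        elif candidate.get("applies", False):
            applicable.append(candidate)
        else:
            # Kept for visibility, but not offered as planning guidance.
            diagnostic.append(candidate)
    return {
        "procedures": applicable,  # backward-compatible alias for applicable
        "applicable": applicable,
        "diagnostic": diagnostic,
        "excluded": excluded,
        "procedural_memory": {"authority_status": "non_authoritative"},
    }
```

Keeping procedures bound to the same list as applicable means existing callers that read procedures see exactly the applicable bucket.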

In src/zaxy/mcp_server.py, extend memory_skill schema and handler to accept:

Pass these fields into the payload for skill.contradicted, skill.deprecated, and other actions when supplied. Preserve existing validation for list fields.
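The pass-through rule can be sketched generically. In the example below, merge_supplied_fields is a hypothetical helper, and the list check (a list of strings) is only a stand-in for whatever validation memory_skill already applies to list fields. The field names are supplied by the caller, so the sketch takes no position on the actual schema.

```python
from typing import Any


def merge_supplied_fields(
    payload: dict[str, Any],
    arguments: dict[str, Any],
    *,
    allowed: tuple[str, ...],
    list_fields: frozenset[str] = frozenset(),
) -> dict[str, Any]:
    """Copy only the allowed fields that were actually supplied into the payload."""
    merged = dict(payload)  # never mutate the caller's payload
    for name in allowed:
        value = arguments.get(name)
        if value is None:
            continue  # field not supplied: leave the payload untouched
        if name in list_fields:
            # Stand-in validation: list fields must be lists of strings.
            if not isinstance(value, list) or not all(
                isinstance(item, str) for item in value
            ):
                raise ValueError(f"{name} must be a list of strings")
        merged[name] = value
    return merged
```

Fields outside the allowed tuple are ignored, so the handler cannot leak arbitrary arguments into skill events.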

Add/extend checkout diagnostics:

diagnostics["procedural_memory"] = {
    "applicable_count": ...,
    "diagnostic_count": ...,
    "excluded_count": ...,
    "rollback_candidate_count": ...,
    "contradiction_count": ...,
    "excluded_reasons": {...},
    "authority_status": "non_authoritative",
}
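A sketch of how these counts might be derived from the three buckets follows. procedural_memory_diagnostics is a hypothetical helper, and the per-procedure fields excluded_reason and rollback_candidate are assumptions introduced for illustration, as is counting rollback candidates across all three buckets.

```python
from collections import Counter
from typing import Any


def procedural_memory_diagnostics(
    applicable: list[dict[str, Any]],
    diagnostic: list[dict[str, Any]],
    excluded: list[dict[str, Any]],
) -> dict[str, Any]:
    """Summarize the procedural planning buckets as deterministic counts."""
    reasons = Counter(item.get("excluded_reason", "unspecified") for item in excluded)
    everything = [*applicable, *diagnostic, *excluded]
    return {
        "applicable_count": len(applicable),
        "diagnostic_count": len(diagnostic),
        "excluded_count": len(excluded),
        "rollback_candidate_count": sum(
            1 for item in everything if item.get("rollback_candidate")
        ),
        "contradiction_count": reasons.get("contradicted", 0),
        # Sorted keys keep the serialized diagnostics stable across runs.
        "excluded_reasons": dict(sorted(reasons.items())),
        "authority_status": "non_authoritative",
    }
```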

Prompt guidance must say applicable procedures are planning guidance, not authoritative facts, and rollback/contradiction candidates should be avoided or explicitly reviewed.

Run:

pytest tests/test_reasoning_primitives.py tests/test_checkout.py tests/test_mcp.py -k "procedure or skill" --no-cov -q

Expected: all Task 4 tests pass.

---

Task 5: Add CLI, MCP, Docs, and Beta.2 Guardrail

Files:

Add CLI tests for:

Add MCP schema/handler/dispatch tests for:

Add guardrail tests proving beta.2 scoring includes:

Run:

pytest tests/test_cli.py tests/test_mcp.py tests/test_reasoning_benchmark.py -k "metacognition or unknown or reverify or trajectory or plan_from_procedures or beta2" --no-cov -q

Expected: fail because public surfaces and guardrail are missing.

CLI command names:

MCP tool names:

All CLI/MCP handlers must instantiate configured MemoryFabric, call the matching core method, close the fabric in finally, and return JSON-compatible dicts.
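The handler pattern can be illustrated with a stand-in fabric. In this sketch, StubFabric and handle_list_known_unknowns are hypothetical names, and an async close() method is an assumption about MemoryFabric. Only list_known_unknowns and its keyword arguments come from Task 2.

```python
import asyncio
from typing import Any, Callable


class StubFabric:
    """Stand-in for a configured MemoryFabric (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.closed = False

    async def list_known_unknowns(
        self, *, session_id: str = "default", status: str = "open", limit: int = 10
    ) -> dict[str, Any]:
        return {
            "session_id": session_id,
            "status": status,
            "limit": limit,
            "unknowns": [],
            "authority_status": "non_authoritative",
        }

    async def close(self) -> None:
        self.closed = True


async def handle_list_known_unknowns(
    arguments: dict[str, Any], fabric_factory: Callable[[], Any] = StubFabric
) -> dict[str, Any]:
    """Instantiate the fabric, call the core method, and always close the fabric."""
    fabric = fabric_factory()
    try:
        return await fabric.list_known_unknowns(
            session_id=arguments.get("session_id", "default"),
            status=arguments.get("status", "open"),
            limit=int(arguments.get("limit", 10)),
        )
    finally:
        # Runs on success and on error, so the fabric is never leaked.
        await fabric.close()
```

A CLI command would wrap the same coroutine with asyncio.run(...) and print the returned dict as JSON.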

Add score_metacognition_guardrail(...) or extend score_reasoning_guardrail(...) with explicit beta.2 fields:

The scorer must inspect contract fields only. It must not score task answers or benchmark labels.
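To make "contract fields only" concrete, the following sketch checks two properties that the Task 1 contracts already establish: every metacognition payload is non-authoritative, and known unknowns and reverify requests carry source citations. check_contract_fields is a hypothetical name, the result layout is an assumption, and the actual beta.2 scoring fields may differ.

```python
from typing import Any

# Event types that the Task 1 tests build with source_events citations.
CITED_EVENT_TYPES = {
    "metacognition.unknown.recorded",
    "metacognition.reverify.requested",
}


def check_contract_fields(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Inspect contract fields only; answers and benchmark labels are never read."""
    violations: list[str] = []
    for index, event in enumerate(events):
        payload = event.get("payload", {})
        if payload.get("authority_status") != "non_authoritative":
            violations.append(f"event {index}: authority_status is not non_authoritative")
        if event.get("event_type") in CITED_EVENT_TYPES and not payload.get(
            "source_events"
        ):
            violations.append(f"event {index}: source_events is missing or empty")
    return {
        "checked": len(events),
        "violations": violations,
        "passed": not violations,
    }
```

Because the function reads only authority_status, event_type, and source_events, changing a task answer or benchmark label cannot change its result.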

Document:

Run:

python scripts/build-site-docs.py
scripts/validate-docs.sh --root .

Run:

pytest tests/test_cli.py tests/test_mcp.py tests/test_reasoning_benchmark.py -k "metacognition or unknown or reverify or trajectory or plan_from_procedures or beta2" --no-cov -q

Expected: all Task 5 tests pass.

---

Final Regression Gate

After all tasks:

pytest \
  tests/test_metacognition.py \
  tests/test_reasoning_primitives.py \
  tests/test_reasoning_benchmark.py \
  tests/test_causal_checkout.py \
  tests/test_checkout.py \
  tests/test_extract.py \
  tests/test_cli.py \
  tests/test_mcp.py \
  -k "metacognition or unknown or reverify or trajectory or procedure or skill or beta2" \
  --no-cov -q

pytest tests/test_checkout.py tests/test_graph.py tests/test_mcp.py tests/test_extract.py --no-cov -q

ruff check \
  src/zaxy/metacognition.py \
  src/zaxy/reasoning_primitives.py \
  src/zaxy/reasoning_benchmark.py \
  src/zaxy/core.py \
  src/zaxy/__main__.py \
  src/zaxy/mcp_server.py \
  src/zaxy/checkout.py \
  src/zaxy/extract.py \
  tests/test_metacognition.py \
  tests/test_reasoning_primitives.py \
  tests/test_reasoning_benchmark.py \
  tests/test_causal_checkout.py \
  tests/test_checkout.py \
  tests/test_cli.py \
  tests/test_mcp.py \
  tests/test_extract.py

scripts/validate-docs.sh --root .

python -m zaxy benchmark-compare \
  reports/benchmarks/longmemeval-500-publish-20260607/live-benchmark.json \
  --backend zaxy-checkout \
  --min-mean-score 0.95 \
  --min-answer-recall-at-5 0.90 \
  --min-recall-at-5 0.99 \
  --min-citation-coverage 1.0 \
  --max-p95-ms 2500 \
  --max-p99-ms 3000

Expected:

Self-Review Notes

Spec coverage:

Known risks: