{% extends "base.html" %} {% block title %}Proof Engine — AI that proves instead of asserts{% endblock %} {% block meta_description %}Open-source AI skill that verifies claims through code and live sources — not by asking the LLM to check itself. Every fact cited, every calculation re-runnable.{% endblock %} {% block body_attrs %} class="landing-v2"{% endblock %} {% block head_extra %} {% endblock %} {% block nav %} {% endblock %} {% block content %}
An open-source skill that makes every factual claim carry its receipts. Numbers get computed in Python. Quotes get fetched from live URLs and matched against the page. A fabricated citation fails the match — and the verdict downgrades so the gap is visible instead of hidden.
LLMs hallucinate facts and hallucinate the checks on those facts. Asking one to verify its own answer runs the same error-prone process twice.
Search-grounded AI finds pages, but still lets the model summarize them. A link says "the evidence is over there." A proof says: here is the evidence, here is the code that checked it, here is what happened when it ran.
Proof Engine routes every claim through a gate the LLM can't fake. Python doesn't hallucinate. A fabricated citation fails to match live page text. A partial match downgrades the verdict so the gap is visible — instead of smoothed away by a confident summary.
The model still does useful work: it drafts code, finds sources, formalizes claims. It just doesn't get to be the verification.
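The quote gate described above can be sketched roughly as follows. This is a minimal illustration, not Proof Engine's actual code: the function names, the three status labels, and the 50% partial-match threshold are all assumptions made for the example, and the page text is passed in as a string so the sketch runs offline.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Fold Unicode forms, whitespace, and case so formatting noise does not block a match."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def check_quote(quote: str, page_text: str, partial_threshold: float = 0.5) -> str:
    """Return 'verified', 'partial', or 'failed' (hypothetical labels).

    'verified': the normalized quote appears verbatim in the page text.
    'partial':  the longest run of consecutive quote words found on the page
                covers at least `partial_threshold` of the quote (assumed rule).
    'failed':   anything else, which is where a fabricated quote ends up.
    """
    quote_norm, page_norm = normalize(quote), normalize(page_text)
    if not quote_norm:
        return "failed"
    if quote_norm in page_norm:
        return "verified"

    quote_words, page_words = quote_norm.split(), page_norm.split()
    matcher = SequenceMatcher(None, quote_words, page_words, autojunk=False)
    longest = matcher.find_longest_match(0, len(quote_words), 0, len(page_words))
    if longest.size / len(quote_words) >= partial_threshold:
        return "partial"
    return "failed"


if __name__ == "__main__":
    # Illustrative page text, not taken from any real source.
    page = "The study found that  heavy users\nscored lower on tests of critical thinking."
    for quote in (
        "heavy users scored lower",
        "heavy users scored lower on every exam",
        "AI makes people smarter",
    ):
        print(quote, "->", check_quote(quote, page))
```

The point of the design is that the status comes from string comparison against the fetched text, so the model that drafted the quote has no say in whether it passes.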
Each step produces an artifact that survives independently of the model that drafted it. You can re-run the whole pipeline from source, offline, in Python.
live, wayback, snapshot), and credibility tier.
sympy for exact math, numpy for quantitative work, every constant version-controlled. Anyone can run python proof.py and see the same result.
PROVED, DISPROVED, PARTIAL, SUPPORTED, or UNDETERMINED. Every verdict ships with a Jupyter notebook, PROV-JSON provenance chain, and an RO-Crate 1.1 archive bundle.
Same claim, same sources, two radically different artifacts. The one on the left is a confident summary. The one on the right is a re-runnable script with a trace.
B1 Gerlich 2025, n=666: negative correlation (r=-0.68), cognitive offloading (quote-verified)
B2 Lee et al. 2025 (Microsoft Research / CHI), n=319 knowledge workers: confidence ↑ → critical effort ↓ (quote-verified)
B3 Harvard Gazette 2025: faculty cross-discipline panel, same concern (quote-verified)
B4 Jose et al. 2025 (Frontiers / PMC): ChatGPT users solved 48% more problems but scored 17% lower on concept tests (quote-verified)
{{ featured_proofs|length }} claims, verdicts redacted. Tap a card to reveal what the pipeline actually found. Every one of them is a re-runnable Python script in the catalog.
Calibrated honesty beats confident vagueness. The engine refuses to gesture at things it can't mechanically check.
Every verdict in the catalog includes these artifacts, versioned, DOI-minted, and downloadable. Run the Python. Open the notebook. Cite the JSON.
python proof.py
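For a sense of what a re-runnable numeric check might look like, here is a hypothetical sketch of a tiny proof.py. The constants, function names, and the two-way PROVED/DISPROVED rule are assumptions for illustration only; it uses the standard-library fractions module for exact arithmetic so it runs without extra dependencies, whereas the catalog's scripts are described as using sympy and numpy.

```python
from fractions import Fraction

# Hypothetical constants. In a real proof these would come from cited,
# quote-verified sources and be kept under version control.
BASELINE_SCORE = Fraction(200)
OBSERVED_SCORE = Fraction(230)
CLAIMED_INCREASE_PCT = Fraction(15)


def percent_increase(old: Fraction, new: Fraction) -> Fraction:
    """Exact percentage change from old to new, with no floating-point rounding."""
    return (new - old) / old * 100


def verdict(claimed: Fraction, computed: Fraction) -> str:
    """Assumed rule: the claim is PROVED only if the recomputed value matches exactly."""
    return "PROVED" if claimed == computed else "DISPROVED"


if __name__ == "__main__":
    computed = percent_increase(BASELINE_SCORE, OBSERVED_SCORE)
    print(f"claimed={CLAIMED_INCREASE_PCT}% computed={computed}%")
    print(verdict(CLAIMED_INCREASE_PCT, computed))
```

Because the inputs are exact rationals and the comparison is plain equality, anyone running the script gets the same verdict from the same constants.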