{% extends "base.html" %} {% block title %}Proof Engine — AI that proves instead of asserts{% endblock %} {% block meta_description %}Open-source AI skill that verifies claims through code and live sources — not by asking the LLM to check itself. Every fact cited, every calculation re-runnable.{% endblock %} {% block body_attrs %} class="landing-v2"{% endblock %} {% block head_extra %} {% endblock %} {% block nav %} {% endblock %} {% block content %}
exhibit A / live

LLMs assert.
Proof Engine proves.

An open-source skill that makes every factual claim carry its receipts. Numbers get computed in Python. Quotes get fetched from live URLs and matched against the page. A fabricated citation fails the match — and the verdict downgrades so the gap is visible instead of hidden.

{{ stats.total }} proofs · {{ stats.tags_count }} domains · {{ stats.total_sources_checked }} sources checked · MIT · auditable · re-runnable
{{ stats.total }}
Proofs
{{ stats.tags_count }}
Domains
{{ stats.total_sources_checked }}
Sources checked
100%
Re-runnable
§01 · premise

The witness cannot corroborate itself.

LLMs hallucinate facts and hallucinate the checks on those facts. Asking one to verify its own answer runs the same error-prone process twice.

Search-grounded AI finds pages, but still lets the model summarize them. A link says "the evidence is over there." A proof says: here is the evidence, here is the code that checked it, here is what happened when it ran.

Proof Engine routes every claim through a gate the LLM can't fake. Python doesn't hallucinate. A fabricated citation fails to match live page text. A partial match downgrades the verdict so the gap is visible — instead of smoothed away by a confident summary.

The LLM is an untrusted author. Every claim it makes passes through a deterministic checkpoint. When it hallucinates, the pipeline breaks visibly instead of hiding the error.

The model still does useful work: it drafts code, finds sources, formalizes claims. It just doesn't get to be the verification.

The circular check
LLM(LLM(claim)) ≠ proof

The proof pipeline
fetch(url) ∘ match(quote) ∘ compute(python) → verdict
— deterministic, replayable, breakable
§02 · how it works

Five steps. No LLM in the verification path.

Each step produces an artifact that survives independently of the model that drafted it. You can re-run the whole pipeline from source, offline, in Python.

01
claim · input
Any factual assertion — a viral myth, a stat, a mathematical identity, a VC pitch-deck number. The LLM decomposes it into sub-claims (SC1, SC2…) and extractable facts (B1, B2, A1…).
# from a session
claim = "0.999… < 1"
decompose(claim) → [SC1: "0.999 repeating equals 1"]
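A sub-claim like SC1 is the kind of statement exact arithmetic can settle. The sketch below is illustrative only, not the engine's own script: it uses the standard-library `fractions` module (a stand-in for sympy) to show that the gap between 1 and the n-digit truncation is exactly 10^-n, and that the geometric-series closed form equals 1. The helper name `partial_sum` is hypothetical.

```python
from fractions import Fraction


def partial_sum(n: int) -> Fraction:
    """Exact value of 0.9 + 0.09 + ... with n nines after the decimal point."""
    return sum((Fraction(9, 10**k) for k in range(1, n + 1)), Fraction(0))


# After n digits the gap to 1 is exactly 10**-n, so it drops below any bound.
for n in (1, 5, 50):
    assert 1 - partial_sum(n) == Fraction(1, 10**n)

# Closed form of a convergent geometric series: a / (1 - r), a = 9/10, r = 1/10.
limit = Fraction(9, 10) / (1 - Fraction(1, 10))
print(limit == 1)  # True
```

The loop is a finite check; the closed form is what carries the claim about the infinite expansion.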
02
fetch sources · network
Academic papers, government data, reference encyclopedias — never the model's memory. Every URL goes in the audit trail with HTTP status, fetch mode (live, wayback, snapshot), and credibility tier.
fetch("planck.esa.int/2018-legacy") → 200 // tier-1 · academic
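One way to picture an audit-trail entry is a small immutable record holding the fields named above. This is a sketch under assumptions: the class name `FetchRecord`, the field names, and the integer encoding of the credibility tier are all hypothetical, and no network call is made here.

```python
from dataclasses import asdict, dataclass

# Fetch modes named in the text; anything else is rejected.
FETCH_MODES = {"live", "wayback", "snapshot"}


@dataclass(frozen=True)
class FetchRecord:
    """Hypothetical audit-trail entry for one fetched source."""
    url: str
    http_status: int
    fetch_mode: str
    credibility_tier: int  # assumption: 1 is the strongest tier

    def __post_init__(self) -> None:
        if self.fetch_mode not in FETCH_MODES:
            raise ValueError(f"unknown fetch mode: {self.fetch_mode!r}")


record = FetchRecord("planck.esa.int/2018-legacy", 200, "live", 1)
print(asdict(record))
```

Freezing the record means an entry cannot be quietly edited after the fetch is logged.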
03
verify quotes · text match
Each quoted sentence must appear on the live page, modulo Unicode/HTML normalization. Partial matches downgrade the verdict. Fabricated citations break the pipeline visibly, with coverage percentages in the audit log.
match(quote="Ω_Λ = 0.6853 ± 0.0074", page) → verified // 100% word coverage
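A minimal sketch of what a normalized quote match could look like, using only the standard library. The function names, the regex-based tag stripping, and the coverage formula (share of quote words found anywhere on the page) are assumptions made for illustration, not the engine's actual matcher. The page text is a made-up example.

```python
import html
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Strip tags, unescape entities, fold Unicode, return lowercase words."""
    text = re.sub(r"<[^>]+>", " ", text)        # crude tag removal
    text = html.unescape(text)                  # &nbsp; and friends
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms
    return re.findall(r"\w+", text.casefold())


def match_quote(quote: str, page: str) -> tuple[str, float]:
    """Return (status, word coverage) for a quote against page text."""
    q, p = normalize(quote), normalize(page)
    if not q:
        return ("not_found", 0.0)
    # Verified only if the quote appears as one contiguous run of words.
    for i in range(len(p) - len(q) + 1):
        if p[i:i + len(q)] == q:
            return ("verified", 1.0)
    vocab = set(p)
    coverage = sum(word in vocab for word in q) / len(q)
    return ("partial" if coverage > 0 else "not_found", coverage)


page_html = "<p>Water boils at 100&nbsp;°C at <b>sea level</b>.</p>"
print(match_quote("Water boils at 100 °C at sea level.", page_html))
print(match_quote("Water boils at 90 °C at sea level.", page_html))
```

The second quote changes one number, so it never matches contiguously and lands at partial coverage instead of passing.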
04
run proof.py · python
Deterministic computation. sympy for exact math, numpy for quantitative work, every constant version-controlled. Anyone can run python proof.py and see the same result.
compare(omega_lambda, threshold=0.68, op=">") → True
assert 0.6853 > 0.68 # holds
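The comparison above can be sketched as a tiny runnable function. This is an assumption about shape, not the engine's real `compare`: values are passed as strings and parsed with `decimal.Decimal` so the check does not depend on binary floating-point rounding, and the operator table is a hypothetical choice.

```python
import operator
from decimal import Decimal

# Supported comparison operators, keyed by their usual symbols.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def compare(value: str, threshold: str, op: str) -> bool:
    """Compare two decimal literals exactly."""
    return OPS[op](Decimal(value), Decimal(threshold))


omega_lambda = "0.6853"  # the value quoted in step 03
print(compare(omega_lambda, threshold="0.68", op=">"))  # True
```

An unknown operator raises `KeyError` rather than returning a guess.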
05
verdict · output
Structured outcome: PROVED, DISPROVED, PARTIAL, SUPPORTED, or UNDETERMINED. Every verdict ships with a Jupyter notebook, PROV-JSON provenance chain, and an RO-Crate 1.1 archive bundle.
verdict: PROVED // Ω_Λ = 0.6853 > 0.68 — confirmed
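How a partial match might pull a verdict down, as a sketch. The five outcome names come from the text; the function `settle` and its rule (any quote below full coverage yields PARTIAL, no verified quotes yields UNDETERMINED) are assumptions for illustration. SUPPORTED is left out of the rule because the text only ties it to search-based facts.

```python
from enum import Enum


class Verdict(Enum):
    PROVED = "PROVED"
    DISPROVED = "DISPROVED"
    PARTIAL = "PARTIAL"
    SUPPORTED = "SUPPORTED"
    UNDETERMINED = "UNDETERMINED"


def settle(check_passed: bool, quote_coverages: list[float]) -> Verdict:
    """Hypothetical rule: full quote coverage is required to settle a claim."""
    if not quote_coverages:
        return Verdict.UNDETERMINED  # nothing verified to stand on
    if min(quote_coverages) < 1.0:
        return Verdict.PARTIAL       # a partial match downgrades the verdict
    return Verdict.PROVED if check_passed else Verdict.DISPROVED


print(settle(True, [1.0, 1.0]).value)    # PROVED
print(settle(True, [1.0, 0.875]).value)  # PARTIAL
```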
§03 · the difference

"Why not just ask the model?"

Same claim, same sources, two radically different artifacts. The one on the left is a confident summary. The one on the right is a re-runnable script with a trace.

typical LLM
prompt: "does using AI tools make humans worse at critical thinking?"
"The claim is too absolute to confirm or deny cleanly. The real picture appears to be that passive, over-reliant use degrades critical thinking, while active, interrogative use can augment it — making the unqualified 'makes humans worse' framing false as a universal statement."
  • hedges into unfalsifiability — "too absolute," "context-dependent"
  • names one 2025 study without a URL, quote, or coverage check
  • invokes "cognitive-offloading theory" without citing a source
  • concludes "false as a universal statement" — but no universal was claimed
  • a second ask re-runs the same mechanism; the hedge persists
✗ plausible-sounding · zero provenance
proof engine
same claim → verdict + audit trail
PROVED — four independent research groups, different institutions and methods, reach the same association: AI tool use correlates with measurable drops in critical-thinking scores.
  • B1 Gerlich 2025, n=666 — negative correlation (r=-0.68), cognitive offloading (quote-verified)
  • B2 Lee et al. 2025 (Microsoft Research / CHI), n=319 knowledge workers — confidence ↑ → critical effort ↓ (quote-verified)
  • B3 Harvard Gazette 2025 — faculty cross-discipline panel, same concern (quote-verified)
  • B4 Jose et al. 2025 (Frontiers / PMC) — ChatGPT users solved 48% more problems but scored 17% lower on concept tests (quote-verified)
  • verdict qualifier in the record: correlation, not proven causation; routine use > high-stakes use
✓ 4 sources verified · consensus threshold met · re-runnable
read the full proof →
{% if featured_proofs %}
§04 · exhibit room

Think you know the answer?

{{ featured_proofs|length }} claims, verdicts redacted. Tap a card to reveal what the pipeline actually found. Every one of them is a re-runnable Python script in the catalog.

tap to reveal · R to reveal all
explore all {{ stats.total }} proofs in the catalog →
{% endif %}
§05 · scope

What it can and can't do.

Calibrated honesty beats confident vagueness. The engine refuses to gesture at things it can't mechanically check.

works well for
  • Factual claims with citable evidence: dates, numbers, quotes, and statistics with verifiable source pages
  • Mathematical assertions: anything Python + sympy can compute deterministically, including symbolic identities
  • Debunking specific claims: "did X really say Y?" · "is statistic Z accurate?" · "does this compound claim decompose?"
  • Compound claims: decomposes "X and Y" into independently verified sub-claims (SC1 ∧ SC2)
doesn't work for
  • Causal claims: "X caused Y" tops out at PARTIAL (facts yes, causal theory weighting no)
  • Broad literature synthesis: "coffee reduces diabetes risk" needs a systematic review, not a proof
  • JS-rendered pages: citation match degrades when the source needs a browser to render
  • Absence of evidence: search-based facts reach SUPPORTED at best; the engine can't prove non-existence
  • Contested definitions: "is a hot dog a sandwich?" depends on definition, not evidence
  • Original theorem proving: computations yes, novel conjectures no
§06 · every proof ships

Eight files. Nothing to take on faith.

Every verdict in the catalog includes these artifacts, versioned, DOI-minted, and downloadable. Run the Python. Open the notebook. Cite the JSON.

proof.py
re-runnable verification script — python proof.py
.py
proof.md
structured report with verdict + sub-claim breakdown
.md
proof_audit.md
citation-by-citation evidence trail, coverage % per quote
.md
proof_narrative.md
plain-language summary for non-technical readers
.md
Jupyter Notebook
interactive re-verification in a browser, cell by cell
.ipynb
W3C PROV-JSON
provenance chain — feed it to downstream fact pipelines
.json
RO-Crate 1.1
archival research-object bundle, DOI-minted
.crate
Citation files
BibTeX, RIS, CFF, Chicago, APA — ready to cite
.bib
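For a sense of what a provenance chain looks like, here is a minimal document using the top-level keys of the W3C PROV-JSON serialization (`prefix`, `entity`, `activity`, `used`, `wasGeneratedBy`). The `ex:` namespace, the identifiers, and the `ex:outcome` attribute are hypothetical; this is a sketch, not the schema the engine emits.

```python
import json

prov_doc = {
    "prefix": {"ex": "https://example.org/proof/"},  # placeholder namespace
    "entity": {
        "ex:source-page": {},                     # the fetched source
        "ex:verdict": {"ex:outcome": "PROVED"},   # the resulting verdict
    },
    "activity": {
        "ex:run-proof": {},                       # one run of proof.py
    },
    # The run used the source page...
    "used": {
        "_:u1": {"prov:activity": "ex:run-proof", "prov:entity": "ex:source-page"},
    },
    # ...and the verdict was generated by that run.
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:verdict", "prov:activity": "ex:run-proof"},
    },
}

print(json.dumps(prov_doc, indent=2, sort_keys=True))
```

A downstream pipeline can walk `wasGeneratedBy` and `used` to trace a verdict back to the sources it rests on.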
§07 · install
Build agents that prove instead of assert.
Drop the skill into Claude or any other agent that supports Skills. Then just say: "use the proof-engine skill to verify …" — it auto-activates when a claim needs checking, and refuses to guess when it can't.
{% endblock %} {% block footer %} {% endblock %} {% block scripts %} {% endblock %}