Launch order matters. Post the X thread first (tweets 2-6 must be replies to tweet 1, so post them once tweet 1 is live). Wait until T+0:35 before adding the HN self-comment.
T + 0:00
1. X thread — post tweet 1, then reply with tweets 2-6
click the first button to open the composer pre-filled. post it. then click tweet 2 and post it as a reply. repeat through tweet 6, then pin the thread.
Today we're publishing the founding manifesto for cognometry — the empirical measurement of cognitive states in machine systems.
Every benchmark scores what the model said. None answer the question a production operator actually needs: was the model refusing, confabulating, retrieving, or reasoning when it wrote that?
Styxx 4.0.2 is the open-source instrument, cross-validated across 8 public hallucination benchmarks — the first detector I'm aware of at this breadth of cross-validation. Three laws, each with a cross-validated number:
• Law I — every computation leaves vitals (AUC 0.998 HaluEval-QA; 5/8 benchmarks above 0.65; 2 published failure modes)
• Law II — vitals are substrate-transferable (cos +0.464 cross-scale refusal direction, ~26σ above chance)
• Law III — vitals are causally actionable (refuse@unsafe 97% → 17% at α=3.0 on Llama-3.2-1B)
One decorator (@trust) runs the cross-validated detector on any LLM call. Zero config. MIT on code, CC-BY on weights.
If you build, audit, or regulate AI systems and the question of cognitive-state measurement at runtime matters to you, the invitation is open.
Manifesto: https://fathom.darkflobi.com/cognometry?ref=li
Paper (DOI): https://doi.org/10.5281/zenodo.19703527
Code: https://github.com/fathom-lab/styxx
PyPI: pip install "styxx[nli]==4.0.2"
T + 0:10
3. Hacker News — submit (no self-comment yet, wait for T+0:35)
the button below opens the HN submit form with the URL + title pre-filled. click Submit. don't add text.
Shipped a hallucination detector today that runs entirely locally. `pip install styxx[nli]` → one decorator wraps any LLM call and scores the response across 9 signals (text, entity, novelty, grounding, NLI contradiction) before returning. Halts on high risk.
Numbers from a 3-seed averaged pooled LR across 8 public benchmarks:
```
HaluEval-QA AUC 0.998
TruthfulQA AUC 0.994
HaluBench-RAGTruth AUC 0.807
HaluBench-PubMedQA AUC 0.719
HaluEval-Dialog AUC 0.676
HaluEval-Summ AUC 0.643
HaluBench-Finance AUC 0.492 (at chance — published)
HaluBench-DROP AUC 0.424 (below chance — published)
```
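As a quick sanity check on the table above, here is a minimal sketch that tallies how many of the eight reported AUCs clear a 0.65 bar and how many fall under 0.5. The numbers are copied from the table; the 0.65 bar and the variable names are illustrative assumptions, not settings from the library.

```python
# AUCs copied from the table above (3-seed averages).
reported_auc = {
    "HaluEval-QA": 0.998,
    "TruthfulQA": 0.994,
    "HaluBench-RAGTruth": 0.807,
    "HaluBench-PubMedQA": 0.719,
    "HaluEval-Dialog": 0.676,
    "HaluEval-Summ": 0.643,
    "HaluBench-Finance": 0.492,
    "HaluBench-DROP": 0.424,
}

THRESHOLD = 0.65  # illustrative bar, not a library setting
CHANCE = 0.5      # AUC of a random scorer

above = [name for name, auc in reported_auc.items() if auc > THRESHOLD]
below_half = [name for name, auc in reported_auc.items() if auc < CHANCE]

print(f"{len(above)} of {len(reported_auc)} above {THRESHOLD}: {above}")
print(f"under {CHANCE}: {below_half}")
```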
Uses MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli (~184M) for the NLI signal. Works on CPU (~400 ms/call) or CUDA (~30 ms/call). No API key, no phone-home, MIT-licensed. Full failure modes documented in the weights module itself.
Repo + reproducers: https://github.com/fathom-lab/styxx
Write-up: https://fathom.darkflobi.com/cognometry?ref=r_llama
T + 0:20
5. Reddit /r/MachineLearning — remember to pick "Research" flair
Paper draft: https://github.com/fathom-lab/styxx/blob/main/papers/cognometry-v0.md
We propose **cognometry** — the empirical measurement of cognitive states (refusal, confabulation, retrieval, reasoning, drift) in LLMs — as a frame distinct from interpretability (what a feature represents) and eval (what the text is). The paper's central empirical claim is narrower: a 9-signal pooled logistic regression fused over text, entity, novelty, grounding, and NLI contradiction signals achieves cross-validated hallucination discrimination across 8 public benchmarks (HaluEval QA/Dialog/Summ, TruthfulQA, HaluBench DROP/PubMedQA/FinanceBench/RAGTruth).
Per-dataset held-out test AUC (3-seed mean):
```
HaluEval-QA 0.998 ± 0.001
TruthfulQA 0.994 ± 0.006
HaluBench-RAGTruth 0.807 ± 0.043
HaluBench-PubMedQA 0.719 ± 0.051
HaluEval-Dialog 0.676 ± 0.037
HaluEval-Summ 0.643 ± 0.060
HaluBench-Finance 0.492 ± 0.026 declared failure mode
HaluBench-DROP 0.424 ± 0.080 declared failure mode
```
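For readers who want to see what a "3-seed mean ± spread" AUC looks like mechanically, here is a minimal sketch. It is not the repo's reproducer: the scores are synthetic Gaussian draws, `synthetic_run` is a hypothetical stand-in for one held-out evaluation, and using the sample standard deviation for the ± term is an assumption.

```python
import random
import statistics

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def synthetic_run(seed, n=150, separation=1.0):
    """Hypothetical stand-in for one held-out evaluation: toy labels and detector scores."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]  # 1 = hallucinated, 0 = faithful
    scores = [rng.gauss(separation * y, 1.0) for y in labels]
    return roc_auc(labels, scores)

aucs = [synthetic_run(seed) for seed in (0, 1, 2)]
mean_auc = statistics.mean(aucs)
std_auc = statistics.stdev(aucs)  # sample standard deviation over seeds (assumed convention)
print(f"AUC {mean_auc:.3f} ± {std_auc:.3f} over {len(aucs)} seeds")
```

An AUC of 0.5 corresponds to a scorer that ranks hallucinated and faithful responses no better than a coin flip, which is the reference point for the two declared failure modes.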
Both failure modes are structural and openly characterized: DROP hallucinations are extractive-span errors (wrong span from right passage — NLI entails them, novelty is blind), and FinanceBench hallucinations are arithmetic errors on verbatim-copied source numbers (also NLI-blind and novelty-blind).
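To make the "novelty is blind" point concrete, here is a toy sketch. The `novelty` function is a simple token-overlap score invented for illustration, and the passage is made up; neither is Styxx's actual signal or benchmark data. It shows why an answer copied from the wrong span looks identical to a correct one under any overlap-style measure.

```python
import re

def tokens(text):
    """Lowercased word and number tokens."""
    return re.findall(r"\d+(?:\.\d+)?|[a-z]+", text.lower())

def novelty(answer, source):
    """Fraction of answer tokens that never appear in the source (0.0 = fully copied)."""
    source_vocab = set(tokens(source))
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    unseen = [t for t in answer_tokens if t not in source_vocab]
    return len(unseen) / len(answer_tokens)

# Made-up passage; question: how many points did the Bears score in the second half?
passage = (
    "The Bears scored 21 points in the first half and 10 in the second. "
    "The Lions scored 14 points in total."
)
correct = "10"
wrong_span = "21"   # right passage, wrong span (DROP-style error)
fabricated = "35"   # appears nowhere in the passage

for label, answer in [("correct", correct), ("wrong span", wrong_span), ("fabricated", fabricated)]:
    print(f"{label:>10}: novelty = {novelty(answer, passage):.2f}")
```

The correct answer and the wrong-span answer both score 0.00, while only the fabricated number scores 1.00, so an overlap-based signal cannot separate the first two.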
Code + full reproducer: https://github.com/fathom-lab/styxx
Drop-in API: `pip install styxx[nli]` + `@trust` decorator.
MIT code, CC-BY-4.0 calibrated weights.
Paper (Zenodo DOI): https://doi.org/10.5281/zenodo.19703527
Happy to take disconfirmations on any of the 8 benchmarks at different random seeds or n.
tl;dr: `pip install styxx[nli]` + `@trust` gets you cross-validated hallucination detection on any LLM call. 5 of 8 benchmarks above AUC 0.65. Two benchmarks at or below chance (0.492 and 0.424), published openly.
Full write-up: https://fathom.darkflobi.com/cognometry?ref=r_dev
What actually matters for devs:
- Shape-preserving: works on OpenAI, Anthropic, LangChain, dicts, raw strings. Auto-detects.
- Sync + async
- Four halt policies (fallback/retry/raise/annotate)
- ~10-30ms CUDA, ~400ms CPU per call
- MIT, no phone-home, no API key
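For anyone unfamiliar with the halt-policy pattern in the list above, here is a minimal sketch of how the four policies could behave. Everything in it is hypothetical: `guard`, `HighRiskError`, `toy_risk`, the threshold, and the return shapes are illustrative names and choices, not the `@trust` API.

```python
import functools

class HighRiskError(RuntimeError):
    """Raised by the 'raise' policy when a response scores at or above the threshold."""

def guard(score_fn, threshold=0.5, policy="annotate", fallback="I don't know.", max_retries=2):
    """Hypothetical decorator: score each response and apply a halt policy on high risk."""
    if policy not in {"fallback", "retry", "raise", "annotate"}:
        raise ValueError(f"unknown policy: {policy}")

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            response = fn(*args, **kwargs)
            risk = score_fn(response)
            if policy == "annotate":
                # Never halts: return the response together with its risk score.
                return {"response": response, "risk": risk}
            if risk < threshold:
                return response
            if policy == "fallback":
                return fallback
            if policy == "raise":
                raise HighRiskError(f"risk {risk:.2f} >= {threshold}")
            # policy == "retry": call again until a response passes or retries run out.
            for _ in range(max_retries):
                response = fn(*args, **kwargs)
                if score_fn(response) < threshold:
                    return response
            return fallback
        return wrapper
    return decorate

def toy_risk(text):
    """Stand-in scorer: flags any response containing the word 'definitely'."""
    return 0.9 if "definitely" in text else 0.1

@guard(toy_risk, policy="fallback")
def answer(question):
    return "It is definitely 42."

print(answer("What is 6 * 7?"))  # prints the fallback string
```

In this sketch `annotate` is the only policy that changes the return shape; a shape-preserving implementation would have to attach the score some other way.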
Where it'll fail you:
- Reading-comp extractive-span errors (DROP) — detector can't see the wrong span.
- Arithmetic errors on numbers copied from source (FinanceBench) — detector can't see the computation.
Both declared in the weights module. Do not deploy for finance/reading-comp without reading §3.4 of the paper first.
Repo: https://github.com/fathom-lab/styxx
T + 0:35
7. HN self-comment — go back to your submission, paste this
find your submission on hn.algolia.com or /newest, click it, scroll to the comment box, paste:
Author here.
Styxx 4.0.2 is the first hallucination detector I'm aware of cross-validated across 8 public benchmarks — HaluEval QA/Dialog/Summarization, TruthfulQA, and four HaluBench subsets (DROP, PubMedQA, FinanceBench, RAGTruth). 3-seed averaged, n=150/dataset, pooled 9-signal logistic regression.
Paper (Zenodo, archived with a DOI): https://doi.org/10.5281/zenodo.19703527
Code: https://github.com/fathom-lab/styxx
Leaderboard: https://fathom.darkflobi.com/cognometry/leaderboard
Colab demo (2 min): https://colab.research.google.com/github/fathom-lab/styxx/blob/main/examples/cognometry_colab.ipynb
Real numbers:
HaluEval-QA AUC 0.998
TruthfulQA AUC 0.994
HaluBench-RAGTruth AUC 0.807 (new — RAG faithfulness)
HaluBench-PubMedQA AUC 0.719 (new — biomedical)
HaluEval-Dialog AUC 0.676
HaluEval-Summarization AUC 0.643
HaluBench-FinanceBench AUC 0.492 (at chance)
HaluBench-DROP AUC 0.424 (below chance)
The two results at or below chance are the part I'd most like HN to react to. They are published as failure modes in the weights module itself, not hidden:
- DROP: reading-comp hallucinations are extractive-span errors — wrong span, right passage. NLI scores that as entailed; novelty signals don't fire. Tried 6 naive heuristic fixes; all null. The null probe is committed alongside the successes.
- FinanceBench: hallucinations are calculation errors on numbers copied verbatim from the source. Novelty + NLI are semantically blind to arithmetic correctness.
Both failure modes are declared in calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes so production callers know where the detector will lie.
pip install styxx[nli] → wrap a function with @trust → get verified output on every call. Zero config: auto-detects context/reference/passage kwargs, auto-enables NLI when installed, adaptive threshold. MIT on code, CC-BY on calibrated weights.
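As a rough illustration of what keyword-argument auto-detection can look like, here is a minimal sketch. The kwarg names come from the paragraph above; `find_reference`, `with_reference`, and the returned dict are hypothetical and say nothing about how `@trust` is actually implemented.

```python
import functools

REFERENCE_KWARGS = ("context", "reference", "passage")  # names taken from the text above

def find_reference(kwargs):
    """Return the name of the first reference-like kwarg that is set, else None."""
    for name in REFERENCE_KWARGS:
        if kwargs.get(name):
            return name
    return None

def with_reference(fn):
    """Hypothetical wrapper: report which kwarg, if any, could ground the response."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        grounded_on = find_reference(kwargs)
        response = fn(*args, **kwargs)
        return {"response": response, "grounded_on": grounded_on}
    return wrapper

@with_reference
def ask(question, passage=None):
    return f"answer to {question!r}"

print(ask("Who won?", passage="The Bears won 31-14."))
print(ask("Who won?"))
```

This sketch only inspects keyword arguments, so a passage passed positionally would go undetected; a fuller version might bind arguments with `inspect.signature` first.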
Happy to get disconfirmations on any of the 8 benchmarks at your favorite random seed.