launch: cognometry + styxx 4.0.2

nothing crosses unseen · click each button in order · text pre-filled where possible
launch order matters. X thread first (tweets 2-6 must be replies to tweet 1 — you'll do this after posting tweet 1). Wait for T+0:35 before the HN self-comment.
T + 0:00

1. X thread — post tweet 1, then reply with tweets 2-6

click the first button to open composer pre-filled. post it. then click tweet 2, post as reply. repeat. pin the thread after tweet 6.
tweet 1 · tweet 2 · tweet 3 · tweet 4 · tweet 5 · tweet 6
T + 0:05

2. LinkedIn — open, paste the box below, publish

LinkedIn doesn't accept URL-encoded pre-fill for organic posts. open LinkedIn, click start a post, paste:
open LinkedIn →
Today we're publishing the founding manifesto for cognometry — the empirical measurement of cognitive states in machine systems.

Every benchmark scores what the model said. None answer the question a production operator actually needs: was the model refusing, confabulating, retrieving, or reasoning when it wrote that?

Styxx 4.0.2 is the open-source instrument, cross-validated across 8 public hallucination benchmarks — the first detector I'm aware of at this breadth of cross-validation. Three laws, each with a cross-validated number:

• Law I — every computation leaves vitals (AUC 0.998 HaluEval-QA; 5/8 benchmarks above 0.65; 2 published failure modes)
• Law II — vitals are substrate-transferable (cos +0.464 cross-scale refusal direction, ~26σ above chance)
• Law III — vitals are causally actionable (refuse@unsafe 97% → 17% at α=3.0 on Llama-3.2-1B)

One decorator (@trust) runs the cross-validated detector on any LLM call. Zero config. MIT on code, CC-BY on weights.

If you build, audit, or regulate AI systems and the question of cognitive-state measurement at runtime matters to you, the invitation is open.

Manifesto: https://fathom.darkflobi.com/cognometry?ref=li
Paper (DOI): https://doi.org/10.5281/zenodo.19703527
Code: https://github.com/fathom-lab/styxx
PyPI: pip install "styxx[nli]==4.0.2"
T + 0:10

3. Hacker News — submit (no self-comment yet, wait for T+0:35)

the button below opens the HN submit form with the URL + title pre-filled. click Submit. don't add text.
HN submit (pre-filled) →
T + 0:15

4. Reddit /r/LocalLLaMA

open pre-filled submit →
paste this body:
Shipped a hallucination detector today that runs entirely locally. `pip install styxx[nli]` → one decorator wraps any LLM call and scores the response across 9 signals (text, entity, novelty, grounding, NLI contradiction) before returning. Halts on high risk.

Numbers from a 3-seed averaged pooled LR across 8 public benchmarks:

```
HaluEval-QA          AUC 0.998
TruthfulQA           AUC 0.994
HaluBench-RAGTruth   AUC 0.807
HaluBench-PubMedQA   AUC 0.719
HaluEval-Dialog      AUC 0.676
HaluEval-Summ        AUC 0.643
HaluBench-Finance    AUC 0.492  (at chance — published)
HaluBench-DROP       AUC 0.424  (below chance — published)
```

Uses MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli (~184M) for the NLI signal. Works on CPU (~400 ms/call) or CUDA (~30 ms/call). No API key, no phone-home, MIT-licensed. Full failure modes documented in the weights module itself.

Repo + reproducers: https://github.com/fathom-lab/styxx
Write-up: https://fathom.darkflobi.com/cognometry?ref=r_llama
T + 0:20

5. Reddit /r/MachineLearning — remember to pick "Research" flair

open pre-filled submit →
Paper draft: https://github.com/fathom-lab/styxx/blob/main/papers/cognometry-v0.md

We propose **cognometry** — the empirical measurement of cognitive states (refusal, confabulation, retrieval, reasoning, drift) in LLMs — as a frame distinct from interpretability (what a feature represents) and eval (what the text is). The paper's central empirical claim is narrower: a 9-signal pooled logistic regression fused over text, entity, novelty, grounding, and NLI contradiction signals achieves cross-validated hallucination discrimination across 8 public benchmarks (HaluEval QA/Dialog/Summ, TruthfulQA, HaluBench DROP/PubMedQA/FinanceBench/RAGTruth).

Per-dataset held-out test AUC (3-seed mean):

```
HaluEval-QA           0.998 ± 0.001
TruthfulQA            0.994 ± 0.006
HaluBench-RAGTruth    0.807 ± 0.043
HaluBench-PubMedQA    0.719 ± 0.051
HaluEval-Dialog       0.676 ± 0.037
HaluEval-Summ         0.643 ± 0.060
HaluBench-Finance     0.492 ± 0.026    declared failure mode
HaluBench-DROP        0.424 ± 0.080    declared failure mode
```

Both failure modes are structural and openly characterized: DROP hallucinations are extractive-span errors (wrong span from right passage — NLI entails them, novelty is blind), and FinanceBench hallucinations are arithmetic errors on verbatim-copied source numbers (also NLI-blind and novelty-blind).
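
To make the DROP case concrete, here is a toy sketch of why a novelty-style signal cannot see a wrong extractive span. The `novelty` function, the passage, and the answers are hypothetical illustrations, not the detector's actual signal or benchmark data.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def novelty(source, answer):
    """Toy signal (assumption): fraction of answer tokens absent from the source."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    source_vocab = set(tokens(source))
    unseen = sum(1 for tok in answer_tokens if tok not in source_vocab)
    return unseen / len(answer_tokens)


# Hypothetical passage; the question is "how many points in the first half?"
passage = "The Bears scored 21 points in the first half and 10 in the second."

right_span = "21 points"   # correct answer, copied from the passage
wrong_span = "10 points"   # wrong answer, also copied from the passage
fabricated = "35 points"   # wrong answer with a number the passage never mentions

if __name__ == "__main__":
    for label, answer in [("right span", right_span),
                          ("wrong span", wrong_span),
                          ("fabricated", fabricated)]:
        print(f"{label:11s} novelty={novelty(passage, answer):.2f}")
```

The right span and the wrong span get the same score because both consist only of tokens present in the passage. Only the fabricated answer moves the signal.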

Code + full reproducer: https://github.com/fathom-lab/styxx
Drop-in API: `pip install styxx[nli]` + `@trust` decorator.
MIT code, CC-BY-4.0 calibrated weights.

Paper (Zenodo DOI): https://doi.org/10.5281/zenodo.19703527

Happy to take disconfirmations on any of the 8 benchmarks at different random seeds or n.
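
For anyone checking the numbers, this is a minimal sketch of what "3-seed mean AUC" means, assuming the rank-based (Mann-Whitney) definition of AUC. `synthetic_split` is a hypothetical stand-in that generates random scores. It is not the repo's reproducer and uses no benchmark data.

```python
import random
from statistics import mean, stdev


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def synthetic_split(seed, n=150, shift=1.0):
    """Hypothetical held-out split: positives score higher by `shift` on average."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    scores = [rng.gauss(shift * y, 1.0) for y in labels]
    return labels, scores


def seed_averaged_auc(seeds=(0, 1, 2), **split_kwargs):
    """Mean and sample standard deviation of AUC across seeds."""
    aucs = [roc_auc(*synthetic_split(seed, **split_kwargs)) for seed in seeds]
    return mean(aucs), stdev(aucs)


if __name__ == "__main__":
    auc_mean, auc_std = seed_averaged_auc()
    print(f"AUC {auc_mean:.3f} ± {auc_std:.3f}")
```

Under this definition, negating every score turns an AUC of a into 1 - a, which is why a value well below 0.5 reads as a systematically inverted signal rather than a merely uninformative one.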
T + 0:25

6. Reddit /r/LLMDevs

open pre-filled submit →
tl;dr: `pip install styxx[nli]` + `@trust` gets you cross-validated hallucination detection on any LLM call. 5 of 8 benchmarks above AUC 0.65. Two benchmarks at or below chance (FinanceBench 0.492, DROP 0.424), published openly.

Full write-up: https://fathom.darkflobi.com/cognometry?ref=r_dev

What actually matters for devs:

- Shape-preserving: works on OpenAI, Anthropic, LangChain, dicts, raw strings. Auto-detects.
- Sync + async
- Four halt policies (fallback/retry/raise/annotate)
- ~10-30ms CUDA, ~400ms CPU per call
- MIT, no phone-home, no API key
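
If the halt-policy idea is unfamiliar, here is a generic sketch of the pattern. It is not the styxx API: `guard`, `score_fn`, `threshold`, `fallback`, `max_retries`, and `HaltedResponse` are hypothetical names chosen for illustration, and the real `@trust` signature may differ.

```python
import functools


class HaltedResponse(RuntimeError):
    """Raised when a response stays above the risk threshold."""


def guard(score_fn, threshold=0.5, policy="annotate", fallback=None, max_retries=2):
    """Hypothetical guard decorator (NOT the styxx API).

    score_fn maps a response to a risk score in [0, 1]; responses at or above
    `threshold` are handled according to `policy`.
    """
    if policy not in {"fallback", "retry", "raise", "annotate"}:
        raise ValueError(f"unknown policy: {policy!r}")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            response = fn(*args, **kwargs)
            risk = score_fn(response)
            if risk < threshold:
                return response  # low risk: pass through unchanged
            if policy == "fallback":
                return fallback
            if policy == "raise":
                raise HaltedResponse(f"risk {risk:.2f} >= {threshold:.2f}")
            if policy == "retry":
                for _ in range(max_retries):
                    response = fn(*args, **kwargs)
                    if score_fn(response) < threshold:
                        return response
                raise HaltedResponse(f"still risky after {max_retries} retries")
            # annotate: return the response together with its risk score
            return {"response": response, "risk": risk, "halted": True}
        return wrapper
    return decorator


def toy_risk(text):
    """Stand-in scorer (assumption): flags responses containing 'unsure'."""
    return 0.9 if "unsure" in text else 0.1


@guard(toy_risk, policy="fallback", fallback="I don't know.")
def answer(question):
    return f"unsure about {question}"


if __name__ == "__main__":
    print(answer("the capital of Atlantis"))
```

In this sketch the annotate policy changes the return shape, which a shape-preserving wrapper would avoid. It is kept simple here to show the four branches side by side.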

Where it'll fail you:

- Reading-comp extractive-span errors (DROP) — detector can't see the wrong span.
- Arithmetic errors on numbers copied from source (FinanceBench) — detector can't see the computation.

Both declared in the weights module. Do not deploy for finance/reading-comp without reading §3.4 of the paper first.

Repo: https://github.com/fathom-lab/styxx
T + 0:35

7. HN self-comment — go back to your submission, paste this

find your submission on hn.algolia.com or /newest, click it, scroll to comment box, paste:
find your HN submission →
Author here.

Styxx 4.0.2 is the first hallucination detector I'm aware of cross-validated across 8 public benchmarks — HaluEval QA/Dialog/Summarization, TruthfulQA, and four HaluBench subsets (DROP, PubMedQA, FinanceBench, RAGTruth). 3-seed averaged, n=150/dataset, pooled 9-signal logistic regression.

Paper (Zenodo, DOI-archived): https://doi.org/10.5281/zenodo.19703527
Code: https://github.com/fathom-lab/styxx
Leaderboard: https://fathom.darkflobi.com/cognometry/leaderboard
Colab demo (2 min): https://colab.research.google.com/github/fathom-lab/styxx/blob/main/examples/cognometry_colab.ipynb

Real numbers:

    HaluEval-QA             AUC 0.998
    TruthfulQA              AUC 0.994
    HaluBench-RAGTruth      AUC 0.807   (new — RAG faithfulness)
    HaluBench-PubMedQA      AUC 0.719   (new — biomedical)
    HaluEval-Dialog         AUC 0.676
    HaluEval-Summarization  AUC 0.643
    HaluBench-FinanceBench  AUC 0.492   (at chance)
    HaluBench-DROP          AUC 0.424   (below chance)

The two results at or below chance are the part I'd most like HN to react to. They are published as failure modes in the weights module itself, not hidden:

- DROP: reading-comp hallucinations are extractive-span errors — wrong span, right passage. NLI scores that as entailed; novelty signals don't fire. Tried 6 naive heuristic fixes; all null. The null probe is committed alongside the successes.
- FinanceBench: hallucinations are calculation errors on numbers copied verbatim from the source. Novelty + NLI are semantically blind to arithmetic correctness.

Both failure modes are declared in calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes so production callers know where the detector will lie.

pip install styxx[nli] → wrap a function with @trust → get verified output on every call. Zero config: auto-detects context/reference/passage kwargs, auto-enables NLI when installed, adaptive threshold. MIT on code, CC-BY on calibrated weights.

Happy to get disconfirmations on any of the 8 benchmarks at your favorite random seed.

watch

metrics tabs: pypi downloads · github stars · find on HN · HN /newest