Cognitive instruments for machine cognition. Open source. Published failure modes.
We measure cognitive states of large language models at runtime — refusal, confabulation, retrieval, reasoning, adversarial drift — from signals already carried on the token stream and residual activations. Three of our tools are public today:
- styxx: one decorator, any LLM call, cross-validated hallucination detection. `pip install styxx[nli]` + `@trust`. Cross-validated across 8 public benchmarks with two declared failure modes published openly in the weights module.
- fathom: SAE-based depth measurement for transformer internals. Fathom constant 1.0212 measured across two open-weight architectures.
- Cognometry manifesto (fathom.darkflobi.com/cognometry): three falsifiable laws for cognometric measurement, each with a cross-validated number.
| Benchmark | AUC |
|---|---|
| HaluEval-QA | 0.998 ± 0.001 |
| TruthfulQA | 0.994 ± 0.006 |
| HaluBench-RAGTruth | 0.807 ± 0.043 |
| HaluBench-PubMedQA | 0.719 ± 0.051 |
| HaluEval-Dialog | 0.676 ± 0.037 |
| HaluEval-Summarization | 0.643 ± 0.060 |
| HaluBench-FinanceBench | 0.492 ± 0.026 — declared failure |
| HaluBench-DROP | 0.424 ± 0.080 — declared failure |
Two of the eight came in below chance. They're declared in `calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes` so production callers know where the detector will lie. That honesty is load-bearing for how we run this lab.
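For readers less familiar with the metric, here is a minimal sketch of what "below chance" means for an AUC. It is plain Python and does not use styxx or reproduce our evaluation code. The `roc_auc` helper and the toy labels and scores are illustrative assumptions, not benchmark data.

```python
from itertools import product


def roc_auc(labels, scores):
    """AUC as a rank statistic: the probability that a randomly chosen
    positive example scores higher than a randomly chosen negative one.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = hallucinated answer, 0 = faithful answer.
labels = [1, 1, 1, 0, 0, 0]

# A detector that mostly ranks hallucinations above faithful answers.
good_scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

# The same detector with its ranking inverted.
bad_scores = [1.0 - s for s in good_scores]

auc_good = roc_auc(labels, good_scores)
auc_bad = roc_auc(labels, bad_scores)

print(f"good detector AUC:     {auc_good:.3f}")
print(f"inverted detector AUC: {auc_bad:.3f}")
```

Random scoring gives an AUC of 0.5. A value below that means the detector tends to rank faithful answers as more suspicious than hallucinated ones on that dataset, which is why the two failing benchmarks are flagged rather than averaged away.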
Open submission: any lab can PR a detector following the Cognometry Detector Interface v0 protocol and have it auto-evaluated against our 8 benchmarks. Live table:
→ fathom.darkflobi.com/cognometry/leaderboard
- Cognometry v0 (Zenodo) — 8-benchmark cross-validated hallucination detection.
- Cognitive Metrology (Zenodo) — logprob-trajectory methodology.
- Site: fathom.darkflobi.com
- Twitter/X: @fathom_lab
- PyPI: styxx
- OSF: osf.io/g2epj · parent project osf.io/wtkzg
- Disconfirmations welcome. If a number is wrong at your favorite seed, open an issue or PR — we cite disconfirmations in the next paper.
- Submit a detector to the cognometry leaderboard (one-file PR, protocol above).
- Extend the benchmark suite. FEVER, FactCC, XSum-Faithful, and PHD-A are on the v4.2 track — PRs welcome.
"nothing crosses unseen"