fathom-lab

Fathom Lab

Cognitive instruments for machine cognition. Open source. Published failure modes.

What we build

We measure cognitive states of large language models at runtime — refusal, confabulation, retrieval, reasoning, adversarial drift — from signals already carried on the token stream and residual activations. Three of our tools are public today:

styxx: hallucination detection for any LLM call through a single decorator. Install with pip install styxx[nli] and wrap the call with @trust. Cross-validated across 8 public benchmarks, with two declared failure modes published openly in the weights module.
fathom: SAE-based depth measurement for transformer internals. Fathom constant 1.0212, measured across two open-weight architectures.
Cognometry manifesto (fathom.darkflobi.com/cognometry): three falsifiable laws for cognometric measurement, each with a cross-validated number.
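
To illustrate the "one decorator, any LLM call" pattern, here is a minimal self-contained sketch. It does not use or reproduce the styxx API: trust_sketch, toy_support_score, the returned dictionary, and the 0.5 threshold are all hypothetical names and values chosen for illustration, and the word-overlap scorer is a toy stand-in for a real NLI-based check.

```python
import functools
from typing import Callable


def toy_support_score(answer: str, context: str) -> float:
    """Fraction of answer words that also appear in the context.

    Toy stand-in for a real support/NLI scorer (assumption, not styxx).
    """
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    context_words = set(context.lower().split())
    return sum(w in context_words for w in answer_words) / len(answer_words)


def trust_sketch(threshold: float = 0.5,
                 scorer: Callable[[str, str], float] = toy_support_score):
    """Hypothetical decorator: wrap an LLM call and attach a support score."""
    def decorate(llm_call: Callable[[str, str], str]):
        @functools.wraps(llm_call)
        def wrapper(prompt: str, context: str) -> dict:
            answer = llm_call(prompt, context)
            score = scorer(answer, context)
            # Flag answers whose support score falls below the threshold.
            return {"answer": answer, "score": score, "flagged": score < threshold}
        return wrapper
    return decorate


@trust_sketch(threshold=0.5)
def fake_llm(prompt: str, context: str) -> str:
    # Canned reply so the sketch runs without any model or network access.
    return "paris is the capital of france"


supported = fake_llm("capital?", "the capital of france is paris")
unsupported = fake_llm("capital?", "berlin is in germany")
```

The decorator leaves the wrapped function's signature alone and only changes the return value, which is what lets the same wrapper sit in front of any call that returns text.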

Current numbers (styxx v4.0.2, 3-seed averaged, n=150/dataset)

Benchmark	AUC	Note
HaluEval-QA	0.998 ± 0.001	
TruthfulQA	0.994 ± 0.006	
HaluBench-RAGTruth	0.807 ± 0.043	
HaluBench-PubMedQA	0.719 ± 0.051	
HaluEval-Dialog	0.676 ± 0.037	
HaluEval-Summarization	0.643 ± 0.060	
HaluBench-FinanceBench	0.492 ± 0.026	declared failure
HaluBench-DROP	0.424 ± 0.080	declared failure

Two of the eight came in below chance (AUC under 0.5): HaluBench-FinanceBench and HaluBench-DROP. They're declared in calibrated_weights_v4.CALIBRATION_NOTES.documented_failure_modes so production callers know where the detector will lie. That honesty is load-bearing for how we run this lab.
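
As a rough illustration of what "below chance" means here, the sketch below copies the mean AUCs from the table above and flags any under 0.5. It also includes a small pairwise AUC function to show that an AUC below 0.5 means the scores rank the two classes in the wrong order more often than not. The names REPORTED_AUC, below_chance, and auc are assumptions made for this example. The sketch does not read calibrated_weights_v4, whose structure is not described here.

```python
from typing import Dict, List, Sequence

# Mean AUCs copied from the table above (styxx v4.0.2, 3-seed averaged).
REPORTED_AUC: Dict[str, float] = {
    "HaluEval-QA": 0.998,
    "TruthfulQA": 0.994,
    "HaluBench-RAGTruth": 0.807,
    "HaluBench-PubMedQA": 0.719,
    "HaluEval-Dialog": 0.676,
    "HaluEval-Summarization": 0.643,
    "HaluBench-FinanceBench": 0.492,
    "HaluBench-DROP": 0.424,
}


def below_chance(aucs: Dict[str, float], chance: float = 0.5) -> List[str]:
    """Return benchmark names whose mean AUC is below the chance level."""
    return sorted(name for name, value in aucs.items() if value < chance)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUC: probability that a random positive outscores a random
    negative, with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


failures = below_chance(REPORTED_AUC)

# Toy scores: flipping every score mirrors the AUC around 0.5.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
forward = auc(scores, labels)
flipped = auc([1.0 - s for s in scores], labels)
```

A caller could use a check like below_chance to refuse or down-weight detector output on a domain that matches a declared failure, although how production code should map its own inputs onto these benchmarks is left open here.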

Cognometry leaderboard

Open submission: any lab can PR a detector following the Cognometry Detector Interface v0 protocol and have it auto-evaluated against our 8 benchmarks. Live table:

→ fathom.darkflobi.com/cognometry/leaderboard

Papers

Cognometry v0 (Zenodo) — 8-benchmark cross-validated hallucination detection.
Cognitive Metrology (Zenodo) — logprob-trajectory methodology.

Also at

Site: fathom.darkflobi.com
Twitter/X: @fathom_lab
PyPI: styxx
OSF: osf.io/g2epj · parent project osf.io/wtkzg

How to contribute

Disconfirmations welcome. If a number is wrong at your favorite seed, open an issue or PR — we cite disconfirmations in the next paper.
Submit a detector to the cognometry leaderboard (one-file PR, protocol above).
Extend the benchmark suite. FEVER, FactCC, XSum-Faithful, and PHD-A are on the v4.2 track — PRs welcome.

"nothing crosses unseen"
