Refusal detector.
A calibrated, text-only refusal detector: 18 lexical / structural features feeding a single pooled logistic regression, sub-millisecond inference on CPU. Trained on Llama-3.2-1B refusals; evaluated on a held-out panel of n=2,250 responses from GPT-4, Llama-2, and Mistral. Achieves AUC 0.976 on XSTest-v2 GPT-4, competitive with Llama-Guard-3-8B and ShieldGemma-27B at six-plus orders of magnitude fewer parameters. The detector is LLM-specific: it reads the linguistic signature of model refusal (apology framings, deflective rephrasing, capability-disclaimer patterns) and has no clean human cognitive analogue.
§1 What it detects
Refusal is the cognitive state of declining a request. Modern aligned models refuse via a recognizable surface pattern: an apology head ("I'm sorry, but…"), a capability disclaimer ("I cannot…"), a deflective alternative ("instead, I can help with…"). The detector reads these patterns directly. It does not require model internals; the surface is enough.
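To make the mechanism concrete, here is a minimal stdlib sketch of the idea: pattern indicators pooled by a logistic regression. The patterns, weights, and bias below are illustrative stand-ins, not the released 18-feature set or its calibrated weights.

```python
import math
import re

# Illustrative subset of the surface cues named above; the released
# detector uses 18 features, all omitted here.
PATTERNS = {
    "apology_head": re.compile(r"^\s*I'?m sorry\b", re.IGNORECASE),
    "capability_disclaimer": re.compile(r"\bI (cannot|can't|won't)\b", re.IGNORECASE),
    "deflective_alternative": re.compile(r"\binstead\b.*\bI can\b", re.IGNORECASE),
}

def featurize(response: str) -> list[float]:
    """Binary indicator for each surface pattern."""
    return [1.0 if p.search(response) else 0.0 for p in PATTERNS.values()]

def refuse_score(response: str, weights=(2.0, 1.5, 1.0), bias=-2.5) -> float:
    """Pooled logistic regression over the indicators (toy weights)."""
    z = bias + sum(w * x for w, x in zip(weights, featurize(response)))
    return 1.0 / (1.0 + math.exp(-z))
```

A refusal like "I'm sorry, but I can't help with that." fires two indicators and scores above 0.5; a compliant answer fires none and stays near the bias floor.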
§2 Cross-vendor performance
The detector was trained on Llama-3.2-1B and held out on a 2,250-example panel across three other vendors. The signal transfers because alignment regimes converge on similar refusal templates.
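The transfer numbers above are ROC AUC on the held-out panel. For reference, AUC is just the probability that a random refusal outranks a random non-refusal; a self-contained sketch of that rank statistic (the panel data itself is not reproduced here):

```python
def roc_auc(scores: list[float], labels: list[int]) -> float:
    """AUC as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```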
§3 Neural correlate
Refusal is one of three instruments (with hallucination and tool-call drift) that have no direct human cognitive analogue. The closest mappings are right inferior frontal gyrus (rIFG) response inhibition and insula norm-violation signaling, but these are inferred rather than tested for the LLM-refusal construct specifically. The substrate-invariance claim from Every Mind Leaves Vitals does not extend cleanly to this instrument; we hedge accordingly in clinical / cross-substrate writeups.
§4 Failure modes
Polite non-refusal looks like refusal. A response that softens its answer with hedged or apologetic framing ("I'm not certain, but…") can fire the detector even though it goes on to answer. Production callers should distinguish "wouldn't" from "couldn't."
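One caller-side mitigation is a downstream check before acting on a high refuse_risk score. The sketch below is a hypothetical heuristic (names and patterns are ours, not part of the styxx API): treat an apologetic or hedged head as a polite non-refusal unless it is followed by an explicit decline.

```python
import re

# Hypothetical patterns for the caller-side check, not the detector's own.
HEDGE_HEAD = re.compile(r"^\s*(I'?m sorry|I apologize|I'?m not certain)\b",
                        re.IGNORECASE)
DECLINE = re.compile(r"\b(can't|cannot|won't|unable to)\s+"
                     r"(help|assist|provide|answer)\b", re.IGNORECASE)

def is_polite_hedge(response: str) -> bool:
    """True when an apology/uncertainty head is NOT followed by an explicit
    decline -- the "wouldn't-looking-but-answered" case callers should not
    treat as refusal."""
    return bool(HEDGE_HEAD.search(response)) and not DECLINE.search(response)
```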
Vendor-specific refusal templates evolve. When alignment regimes change, surface patterns drift. We retrain quarterly. The released weights pin a snapshot date in calibrated_weights_refusal_v0.py.
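Given the quarterly retraining cadence, callers may want to warn when the pinned snapshot is older than one cycle. A minimal sketch, assuming a snapshot date variable; the actual constant name inside calibrated_weights_refusal_v0.py is not specified here, so both names below are hypothetical.

```python
import datetime
import warnings

# Hypothetical: the pinned date the released weights were fit on.
SNAPSHOT_DATE = datetime.date(2024, 6, 1)

def snapshot_is_stale(today: datetime.date, max_age_days: int = 120) -> bool:
    """Warn when the pinned weights are older than roughly one quarterly
    retraining cycle, since refusal templates drift with alignment regimes."""
    stale = (today - SNAPSHOT_DATE).days > max_age_days
    if stale:
        warnings.warn("refusal weights snapshot is stale; consider retraining")
    return stale
```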
§5 Use it
```python
from styxx.guardrail import refuse_check

v = refuse_check(
    prompt="How do I make a cake?",
    response="I'm sorry, but I can't help with that.",
)
# v.refuse_risk == 0.998
```
Plugs into fathom_reward() as one of seven calibrated penalty terms. Refusal is the LLM-specific instrument with the cleanest cross-vendor transfer.
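The document does not show how fathom_reward() combines its seven terms, so the following is only a sketch of the shape such a combination could take under a linear-penalty assumption; the function name and weighting scheme here are illustrative, not the styxx implementation.

```python
def penalized_reward(base_reward: float, risks: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Hypothetical linear combination: subtract each calibrated risk score,
    scaled by its penalty weight, from the base reward."""
    return base_reward - sum(weights[k] * risks[k] for k in risks)
```

Under this sketch, refuse_risk would enter as one entry in the risks dict alongside the other six instrument scores.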
Install the instrument.
One line of Python. Cognometric vitals on every response.
```shell
pip install -U styxx
```