A deterministic, fail-closed verification harness for clinical LLM outputs.
LLM extraction is entering clinical workflows (scribes, prescription digitization,
lab-report parsing) faster than verification infrastructure is being built. Known
failure modes include fabricated medications, silently altered dosages, mg→mcg unit
swaps, and invented frequencies. medfence is the fence between model output and
clinical action: pure-function, zero-dependency, every verdict backed by
machine-checkable evidence.
from medfence import verify, Extraction, SourceDocument
report = verify(
Extraction(artifact_type="prescription", payload=llm_output),
SourceDocument(text=ocr_text, modality="ocr", ocr_confidence=0.91),
)
report.overall # Verdict.PASS | Verdict.FAIL | Verdict.ABSTAIN
report.coverage # fraction of payload fields actually verified
report.to_audit_json() # one hash-chained JSON line per verification

- Deterministic. No I/O, no clock, no model calls inside verify(). Identical (source, extraction, rulepack) → bit-identical report. Verdicts are reproducible by strangers; that is what makes them auditable.
- Fail-closed. ABSTAIN is a first-class verdict, not an error state. "Couldn't check" routes to a human exactly like FAIL; it never silently becomes PASS. Unknown artifact types, missing reference data, low-confidence OCR → ABSTAIN.
- No finding without evidence. Every verdict carries a source span, a reference bundle key, or a rule id. A check that cannot produce evidence must abstain.
And one deliberate refusal: medfence never judges clinical appropriateness.
dose_in_range answers "does 500 mg paracetamol exist as a marketed product?",
never "should this patient take it?" Fidelity and referential validity are
mechanically decidable; clinical judgment is not, and pretending otherwise is how
verification tools become unlicensed medical devices.
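The fail-closed principle can be illustrated with a minimal sketch of worst-of aggregation. This is not medfence's actual implementation: the `aggregate` and `needs_human` helpers, and the severity ordering PASS < ABSTAIN < FAIL, are assumptions made for illustration. Only the three verdict names come from the API shown above.

```python
from enum import IntEnum
from typing import Iterable


class Verdict(IntEnum):
    # Assumed severity order, so that max() picks the worst outcome.
    PASS = 0
    ABSTAIN = 1
    FAIL = 2


def aggregate(verdicts: Iterable[Verdict]) -> Verdict:
    """Worst-of aggregation (hypothetical helper).

    With no check results at all, nothing was verified, so the
    result is ABSTAIN rather than PASS.
    """
    return max(verdicts, default=Verdict.ABSTAIN)


def needs_human(overall: Verdict) -> bool:
    # ABSTAIN routes to a human exactly like FAIL.
    return overall is not Verdict.PASS


overall = aggregate([Verdict.PASS, Verdict.ABSTAIN, Verdict.PASS])
print(overall.name, needs_human(overall))
```

Because the result is a `max` over the inputs, it does not depend on check order, and adding a check can only keep or worsen the overall verdict.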
Family G: span grounding ("no span, no claim")
Every extracted value must align to a span in the source, localized to its own
medication's line, because whole-document matching invites cross-medication collisions
(a 250 mg on someone else's line must not ground your altered strength).
Numbers and units match exact-after-normalization only; fuzzy-matching a dosage
is how a fence approves a hallucination. unit_integrity catches the mg↔mcg
class specifically. Fully ungrounded medication objects fail no_orphans
(the fabricated-drug detector).
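As a rough illustration of "exact-after-normalization" matching scoped to a single medication line, consider the sketch below. The unit table, the regular expression, and the function names `normalize_dose` and `grounded_on_line` are all hypothetical; the real checks in checks/grounding.py may work differently and return verdicts with evidence rather than booleans.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Hypothetical alias table: spelling variants collapse, but mg and mcg
# are never merged into one canonical unit.
UNIT_ALIASES = {"mg": "mg", "mcg": "mcg", "ug": "mcg", "g": "g", "ml": "ml"}

DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mcg|ug|mg|ml|g)\b", re.IGNORECASE)


def normalize_dose(text: str) -> Optional[Tuple[Decimal, str]]:
    """Parse the first dose in text into (exact decimal value, canonical unit)."""
    m = DOSE_RE.search(text)
    if m is None:
        return None
    return Decimal(m.group(1)), UNIT_ALIASES[m.group(2).lower()]


def grounded_on_line(extracted: str, source_line: str) -> bool:
    """True only if a dose on this medication's own line equals the
    extracted dose exactly after normalization (no fuzzy matching)."""
    want = normalize_dose(extracted)
    if want is None:
        return False
    found = {
        (Decimal(value), UNIT_ALIASES[unit.lower()])
        for value, unit in DOSE_RE.findall(source_line)
    }
    return want in found


amoxicillin_line = "Tab Amoxicillin 500 mg TDS"
azithromycin_line = "Tab Azithromycin 250 mg OD"

print(grounded_on_line("500 mg", amoxicillin_line))   # same value, same unit
print(grounded_on_line("250 mg", amoxicillin_line))   # only present on another line
print(grounded_on_line("500 mcg", amoxicillin_line))  # unit swap
```

Using `Decimal` rather than `float` keeps the comparison exact, so "500.0 mg" matches "500 mg" while "500 mcg" and "250 mg" do not.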
Family R: reference validity
Drug names, marketed strengths, dose forms, and frequency tokens are checked
against a versioned, content-hashed Indian drug bundle and a closed grammar of
prescription shorthand (OD, BD, TDS, 1-0-1, SOS, …). Fuzzy lookup is allowed for
retrieval; a weak hit is ABSTAIN, never PASS.
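The two rules above can be sketched as follows, assuming a tiny illustrative token set and drug list. The names `FREQ_TOKENS`, `KNOWN_DRUGS`, `check_frequency`, and `check_drug_name`, the 0.8 cutoff, and the choice to return FAIL for a name with no match at all are assumptions for this example, not medfence's rulepack or API.

```python
import difflib
import re

# Hypothetical subset of the closed shorthand grammar.
FREQ_TOKENS = {"OD", "BD", "TDS", "SOS"}
SLOT_PATTERN = re.compile(r"[0-9]-[0-9]-[0-9]")  # e.g. 1-0-1

# Hypothetical stand-in for the reference bundle.
KNOWN_DRUGS = {"paracetamol", "amoxicillin", "azithromycin"}


def check_frequency(token: str) -> str:
    """A token is valid only if it is in the closed grammar."""
    t = token.strip().upper()
    if t in FREQ_TOKENS or SLOT_PATTERN.fullmatch(t):
        return "PASS"
    return "FAIL"


def check_drug_name(name: str, cutoff: float = 0.8) -> str:
    """Exact hit passes; a fuzzy-only hit abstains; it never passes."""
    key = name.strip().lower()
    if key in KNOWN_DRUGS:
        return "PASS"
    near = difflib.get_close_matches(key, sorted(KNOWN_DRUGS), n=1, cutoff=cutoff)
    # A near miss is a retrieval candidate for a human, not a verification.
    return "ABSTAIN" if near else "FAIL"


for name in ("Paracetamol", "paracetmol", "zorbitrex"):
    print(name, check_drug_name(name))
```

The fuzzy match is used only to decide between ABSTAIN and FAIL; the PASS path requires an exact key in the reference data.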
100 documents: 20 synthetic Indian OPD prescriptions × (1 clean + 4 seeded-error
variants). Reproduce with python scripts/make_golden.py && python scripts/benchmark.py.
| error class | n | FAIL | ABSTAIN | PASS | caught | false PASS |
|---|---|---|---|---|---|---|
| fabricated_drug | 20 | 20 | 0 | 0 | 100% | 0% |
| unit_swap | 20 | 20 | 0 | 0 | 100% | 0% |
| altered_strength | 20 | 20 | 0 | 0 | 100% | 0% |
| invented_frequency | 20 | 20 | 0 | 0 | 100% | 0% |
| clean | 20 | 0 | 0 | 20 | n/a | (100% clean-pass) |
Read this honestly: 100% on a synthetic golden set means the set is easy, not that the fence is finished. The seeded errors are clean single-fault injections on noise-free text. The numbers that matter will come from real, anonymized, OCR-noisy prescriptions. Contributions of anonymized hard cases are the most valuable thing you can send this project.
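For readers who want to see how the "caught" and "false PASS" columns relate to the raw counts, here is a small sketch. It assumes that "caught" counts FAIL and ABSTAIN together, since both route to a human; the helper names `rates` and `gate_exit_code` are hypothetical and are not taken from scripts/benchmark.py.

```python
from typing import Dict, Tuple

# Counts copied from the table above: class -> (n, FAIL, ABSTAIN, PASS).
SEEDED: Dict[str, Tuple[int, int, int, int]] = {
    "fabricated_drug": (20, 20, 0, 0),
    "unit_swap": (20, 20, 0, 0),
    "altered_strength": (20, 20, 0, 0),
    "invented_frequency": (20, 20, 0, 0),
}


def rates(n: int, fail: int, abstain: int, passed: int) -> Tuple[float, float]:
    """Return (caught rate, false-PASS rate) for one seeded-error class."""
    assert fail + abstain + passed == n, "verdict counts must sum to n"
    return (fail + abstain) / n, passed / n


def gate_exit_code(table: Dict[str, Tuple[int, int, int, int]]) -> int:
    """Exit status 1 if any seeded-error document passed, else 0."""
    return 1 if any(rates(*row)[1] > 0 for row in table.values()) else 0


for error_class, row in SEEDED.items():
    caught, false_pass = rates(*row)
    print(f"{error_class}: caught {caught:.0%}, false PASS {false_pass:.0%}")
```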
- Drug-drug interaction checking (v1 candidate, behind an explicit opt-in)
- Patient-contextual dosing (age/weight/renal): the SaMD line; we stay below it
- Auto-correction: medfence flags, it never fixes
- LLM-as-judge fallback: if deterministic checks can't verify it, a human sees it
- STT/word-timestamp evidence (the Evidence type is designed for it; v0 is OCR/text)
- Omission detection: verify() checks payload→source fidelity, not source→payload completeness, so a medication silently dropped by the extractor is not caught. This gap is encoded as a strict-xfail test so it stays visible.
v0 ships refdata-2026.07.0-seed: ~27 common Indian OPD molecules with brand
aliases, forms, and marketed strengths. It is a deliberately small, versioned
placeholder for a proper CDSCO + NLEM + Jan Aushadhi normalization pass. The
bundle is content-hashed and the hash is pinned into every report, so verdicts
remain reproducible as the data grows.
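One common way to content-hash a JSON bundle is to hash a canonical serialization of it. The sketch below assumes SHA-256 over sorted-key compact JSON; the actual algorithm and canonicalization medfence uses are not stated here, and the `bundle_hash` helper and the `refdata_hash` field name are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def bundle_hash(bundle: Dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(
        bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative bundle content, not the real seed data.
bundle = {
    "version": "refdata-2026.07.0-seed",
    "molecules": {"paracetamol": {"strengths_mg": [500, 650]}},
}

# Pinning the hash into a report ties the verdict to the exact data used.
report = {"overall": "PASS", "refdata_hash": bundle_hash(bundle)}
print(report["refdata_hash"][:16])
```

Any change to the bundle, however small, produces a different hash, so a report can always be traced to the reference data that produced it.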
The suite is deliberately heavier than the library; for a verification tool, the tests are the product claim. Beyond unit tests, three layers guard the contract:
- Property-based (tests/test_properties.py, Hypothesis): verify() is total, fail-closed, and non-mutating over arbitrary junk payloads; reports are bit-identical; clean-by-construction cases PASS and any corruption of them never does; mg/mcg canonicalization never merges; aggregation matches an independent worst-of oracle and is order-invariant and monotone.
- Metamorphic (tests/test_metamorphic.py): ~700 guarded corruptions (digit edits, magnitude shifts, unit swaps, real-but-absent drug insertion, in-grammar frequency swaps, …) applied to every clean golden case at test time; none may PASS. Same fault taxonomy as the golden set, one implementation (scripts/corruptions.py), two depths.
- Golden gate (tests/test_golden_gate.py): the benchmark as hard assertions. The build fails unless false-PASS == 0 and clean-pass == 100%, and the committed golden set must match its generator exactly.
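To make the metamorphic layer concrete, here is a minimal sketch of guarded corruption operators applied to one clean dose. It is not the code in scripts/corruptions.py: the operator names, the `Dose` type, and the toy `grounds` check are hypothetical stand-ins. The "guard" here simply drops any corruption that leaves the dose unchanged.

```python
from decimal import Decimal
from typing import Iterator, Tuple

Dose = Tuple[Decimal, str]  # (value, canonical unit)


def unit_swaps(dose: Dose) -> Iterator[Dose]:
    value, unit = dose
    for other in ("mg", "mcg", "g"):
        if other != unit:
            yield value, other


def magnitude_shifts(dose: Dose) -> Iterator[Dose]:
    value, unit = dose
    yield value * 10, unit
    yield value / 10, unit


def digit_edits(dose: Dose) -> Iterator[Dose]:
    value, unit = dose
    digits = str(value)
    for i, ch in enumerate(digits):
        if not ch.isdigit():
            continue
        for repl in "0123456789":
            if repl != ch:
                yield Decimal(digits[:i] + repl + digits[i + 1:]), unit


def corruptions(dose: Dose) -> Iterator[Dose]:
    for op in (unit_swaps, magnitude_shifts, digit_edits):
        for corrupted in op(dose):
            if corrupted != dose:  # guard: skip no-op corruptions
                yield corrupted


def grounds(extracted: Dose, source: Dose) -> bool:
    # Toy stand-in for the real check: exact match only.
    return extracted == source


clean = (Decimal("500"), "mg")
assert grounds(clean, clean)
# Metamorphic relation: no corruption of a clean case may be accepted.
assert not any(grounds(c, clean) for c in corruptions(clean))
```

The relation being tested does not need a labeled dataset: it only asserts that corrupting a case known to be clean can never yield an accepted result.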
Known v0 limitations are encoded as xfail(strict=True) tests (duplicate
medication names, trailing punctuation on frequencies, omission detection):
executable documentation that flips to a build failure the day the limitation
is fixed.
uv sync --group dev
uv run pytest # full suite (~750 tests, <5 s)
uv run pytest -m golden # just the benchmark gate

medfence/
contract.py # types + fail-closed aggregation (the stable core)
normalize.py # deterministic text/unit/number normalization
rulepack.py # bundle loading, thresholds-as-versioned-data
checks/grounding.py # Family G
checks/refdata.py # Family R
verify.py # the single entrypoint
data/refdata_seed.json
scripts/make_golden.py # regenerate the golden set (deterministic, no RNG)
scripts/benchmark.py # the table above (exits 1 on any false PASS)
scripts/corruptions.py # shared fault operators (golden set + metamorphic grid)
tests/ # contract invariants, properties, metamorphic grid, golden gate
tests/schemas/ # JSON Schema for the audit report (the output contract)
docs/adr-001-*.md # the design record
docs/related-work.md # how this differs from LangExtract, Guardrails, etc.
See docs/adr-001-clinical-verification-contract-v0.md for the full contract
spec, options considered, and the trade-off analysis (notably: why ABSTAIN
exists, why coverage is a first-class output, and where the SaMD line is drawn).
For how medfence relates to LangExtract, Guardrails AI, clinical
self-verification, and the rest of the landscape, see docs/related-work.md.
Apache-2.0.
python3 demo.py # one prescription: PASS, FAIL (mcg swap + fabricated drug), ABSTAIN (low OCR)