medfence

A deterministic, fail-closed verification harness for clinical LLM outputs.

LLM extraction is entering clinical workflows (scribes, prescription digitization, lab-report parsing) faster than verification infrastructure is being built. Known failure modes include fabricated medications, silently altered dosages, mg→mcg unit swaps, and invented frequencies. medfence is the fence between model output and clinical action: pure functions, zero dependencies, and every verdict backed by machine-checkable evidence.

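from medfence import verify, Extraction, SourceDocument

# llm_output and ocr_text come from the caller: the model's extracted payload
# and the source text it was extracted from.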
report = verify(
    Extraction(artifact_type="prescription", payload=llm_output),
    SourceDocument(text=ocr_text, modality="ocr", ocr_confidence=0.91),
)
report.overall        # Verdict.PASS | Verdict.FAIL | Verdict.ABSTAIN
report.coverage       # fraction of payload fields actually verified
report.to_audit_json()  # one hash-chained JSON line per verification
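
The record layout behind to_audit_json() is not spelled out here. As a rough illustration of what "hash-chained" can mean, the sketch below links each JSON line to the hash of the previous one. The field names prev_hash and hash, the all-zero genesis value, and the choice of SHA-256 are assumptions for this example, not medfence's actual audit format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators keep the serialized bytes stable.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_line(record: dict, prev_hash: str) -> str:
    """Serialize one audit record as a JSON line that commits to prev_hash."""
    body = dict(record, prev_hash=prev_hash)
    digest = hashlib.sha256(_canonical(body)).hexdigest()
    return json.dumps(dict(body, hash=digest), sort_keys=True, separators=(",", ":"))


def verify_chain(lines: list[str]) -> bool:
    """Recompute every hash and check that each line points at its predecessor."""
    prev = GENESIS
    for line in lines:
        entry = json.loads(line)
        claimed = entry.pop("hash", None)
        if entry.get("prev_hash") != prev:
            return False
        if hashlib.sha256(_canonical(entry)).hexdigest() != claimed:
            return False
        prev = claimed
    return True
```

With this layout, editing any earlier line changes its hash and breaks every later link, which is the property an auditor would rely on.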

The three guarantees

Deterministic. No I/O, no clock, no model calls inside verify(). Identical (source, extraction, rulepack) → bit-identical report. Verdicts are reproducible by strangers; that is what makes them auditable.
Fail-closed. ABSTAIN is a first-class verdict, not an error state. "Couldn't check" routes to a human exactly like FAIL; it never silently becomes PASS. Unknown artifact types, missing reference data, low-confidence OCR → ABSTAIN.
No finding without evidence. Every verdict carries a source span, a reference bundle key, or a rule id. A check that cannot produce evidence must abstain.

And one deliberate refusal: medfence never judges clinical appropriateness. dose_in_range answers "does 500 mg paracetamol exist as a marketed product?", never "should this patient take it?" Fidelity and referential validity are mechanically decidable; clinical judgment is not, and pretending otherwise is how verification tools become unlicensed medical devices.
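
The fail-closed guarantee can be pictured as a worst-of aggregation over per-check verdicts. The sketch below is a stand-in, not the code in contract.py: the Verdict enum is redefined locally, and both the severity order (FAIL above ABSTAIN above PASS) and the "empty input abstains" rule are assumptions made for illustration.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # Local stand-in for medfence's Verdict; the numeric order is an assumption.
    PASS = 0
    ABSTAIN = 1
    FAIL = 2


def aggregate(verdicts: list[Verdict]) -> Verdict:
    """Worst-of aggregation: the overall verdict is the most severe one seen."""
    if not verdicts:
        # Nothing was checked, so nothing was verified: abstain rather than pass.
        return Verdict.ABSTAIN
    return max(verdicts)


def needs_human(overall: Verdict) -> bool:
    """Only PASS proceeds without review; ABSTAIN routes exactly like FAIL."""
    return overall is not Verdict.PASS
```

Because max() ignores input order and a more severe verdict can only raise the result, an aggregation of this shape is order-invariant and monotone by construction.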

Check families (v0)

Family G: span grounding ("no span, no claim")
Every extracted value must align to a span in the source, localized to its own medication's line, because whole-document matching invites cross-medication collisions (a 250 mg on another medication's line must not ground an altered strength on yours). Numbers and units match exact-after-normalization only; fuzzy-matching a dosage is how a fence approves a hallucination. unit_integrity catches the mg↔mcg class specifically. Fully ungrounded medication objects fail no_orphans (the fabricated-drug detector).
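
To make "exact-after-normalization, localized to its own line" concrete, here is a minimal sketch. It is not the code in checks/grounding.py: the regex, the alias table, and the function names are hypothetical, and real prescriptions need far more normalization than this.

```python
import re
from decimal import Decimal

# Hypothetical alias table: spelling variants collapse to one canonical unit,
# but mg and mcg are never mapped onto each other.
UNIT_ALIASES = {"mg": "mg", "mcg": "mcg", "ug": "mcg", "g": "g", "ml": "ml"}

DOSE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([a-z]+)", re.IGNORECASE)


def parse_doses(text: str) -> set[tuple[Decimal, str]]:
    """Return every (value, canonical unit) pair found in text."""
    doses = set()
    for number, unit in DOSE_RE.findall(text):
        canonical = UNIT_ALIASES.get(unit.lower())
        if canonical is not None:
            # Decimal treats 500 and 500.0 as equal without float rounding.
            doses.add((Decimal(number), canonical))
    return doses


def dose_grounded(extracted: str, source_line: str) -> bool:
    """True only if the extracted dose appears, exactly, on the given line."""
    wanted = parse_doses(extracted)
    return bool(wanted) and wanted <= parse_doses(source_line)
```

Given the lines "Tab Amoxicillin 500 mg TDS" and "Tab Paracetamol 250 mg SOS", an extracted "250 mg" for Amoxicillin is not grounded against its own line, although it would be if the two lines were matched as one document. That is the collision the line-level localization is meant to prevent.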

Family R: reference validity
Drug names, marketed strengths, dose forms, and frequency tokens are checked against a versioned, content-hashed Indian drug bundle and a closed grammar of prescription shorthand (OD, BD, TDS, 1-0-1, SOS, …). Fuzzy lookup is allowed for retrieval; a weak hit is ABSTAIN, never PASS.
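
A closed grammar means exact membership: a token is either in the set or it is not. The sketch below covers only the tokens named above; the three-slot digit pattern for forms like 1-0-1 and the case-insensitive comparison are assumptions, and the actual grammar shipped with the rulepack may differ.

```python
import re

# Only the shorthand tokens named in the text; a real rulepack would carry
# a longer, versioned list.
SHORTHAND = frozenset({"OD", "BD", "TDS", "SOS"})

# Slot patterns such as 1-0-1. Three single-digit slots is an assumption.
SLOT_PATTERN = re.compile(r"[0-9](?:-[0-9]){2}")


def in_frequency_grammar(token: str) -> bool:
    """Exact membership only: no fuzzy matching, no guessing at near-misses."""
    candidate = token.strip().upper()
    return candidate in SHORTHAND or SLOT_PATTERN.fullmatch(candidate) is not None
```

Under this sketch an invented token such as "QXD" is simply outside the grammar, and so is "BD." with trailing punctuation, since nothing here tries to repair a near-miss.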

Benchmark

100 documents: 20 synthetic Indian OPD prescriptions × (1 clean + 4 seeded-error variants). Reproduce with python scripts/make_golden.py && python scripts/benchmark.py.

error class	n	FAIL	PASS	caught	false PASS
fabricated_drug	20	20	0	100%	0%
unit_swap	20	20	0	100%	0%
altered_strength	20	20	0	100%	0%
invented_frequency	20	20	0	100%	0%
clean	20	0	20	n/a	(100% clean-pass)

Read this honestly: 100% on a synthetic golden set means the set is easy, not that the fence is finished. The seeded errors are clean single-fault injections on noise-free text. The numbers that matter will come from real, anonymized, OCR-noisy prescriptions. Contributions of anonymized hard cases are the most valuable thing you can send this project.

What v0 deliberately does not do

Drug-drug interaction checking (v1 candidate, behind an explicit opt-in)
Patient-contextual dosing (age/weight/renal): this is the Software as a Medical Device (SaMD) line; we stay below it
Auto-correction: medfence flags, it never fixes
LLM-as-judge fallback: if deterministic checks can't verify it, a human sees it
STT/word-timestamp evidence (the Evidence type is designed for it; v0 is OCR/text)
Omission detection: verify() checks payload→source fidelity, not source→payload completeness, so a medication silently dropped by the extractor is not caught. This gap is encoded as a strict-xfail test so it stays visible.

Reference bundle

v0 ships refdata-2026.07.0-seed: ~27 common Indian OPD molecules with brand aliases, forms, and marketed strengths. It is a deliberately small, versioned placeholder for a proper CDSCO + NLEM + Jan Aushadhi normalization pass. The bundle is content-hashed and the hash is pinned into every report, so verdicts remain reproducible as the data grows.
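
One common way to content-hash a JSON bundle is to hash a canonical serialization, so that key order and whitespace cannot change the result. The sketch below assumes SHA-256 and a "sha256:" prefix; the bundle entries are made-up placeholders, not the contents of data/refdata_seed.json, and the hashing scheme medfence actually uses may differ.

```python
import hashlib
import json


def bundle_hash(bundle: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the bundle."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder entries for illustration only.
seed = {
    "version": "refdata-2026.07.0-seed",
    "molecules": {"example-molecule": {"strengths": [10, 20]}},
}
reordered = {
    "molecules": {"example-molecule": {"strengths": [10, 20]}},
    "version": "refdata-2026.07.0-seed",
}
grown = {
    "version": "refdata-2026.07.0-seed",
    "molecules": {"example-molecule": {"strengths": [10, 20, 40]}},
}

same = bundle_hash(seed) == bundle_hash(reordered)   # key order does not matter
changed = bundle_hash(seed) != bundle_hash(grown)    # any data change does
```

Pinning a hash like this into each report lets a reader tell exactly which reference data a verdict was computed against, even after the bundle has grown.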

Testing & validation

The suite is deliberately heavier than the library; for a verification tool, the tests are the product claim. Beyond unit tests, three layers guard the contract:

Property-based (tests/test_properties.py, Hypothesis): verify() is total, fail-closed, and non-mutating over arbitrary junk payloads; reports are bit-identical; clean-by-construction cases PASS and any corruption of them never does; mg/mcg canonicalization never merges; aggregation matches an independent worst-of oracle and is order-invariant and monotone.
Metamorphic (tests/test_metamorphic.py): ~700 guarded corruptions (digit edits, magnitude shifts, unit swaps, real-but-absent drug insertion, in-grammar frequency swaps, …) applied to every clean golden case at test time; none may PASS. Same fault taxonomy as the golden set, one implementation (scripts/corruptions.py), two depths.
Golden gate (tests/test_golden_gate.py): the benchmark as hard assertions; false-PASS == 0 and clean-pass == 100% fail the build, and the committed golden set must match its generator exactly.
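
The metamorphic layer can be pictured as: take a case that passes, apply every corruption operator, and require that no mutant passes. The sketch below uses a toy string-matching check and three simplified operators. It does not reproduce scripts/corruptions.py or verify(), and every name in it is hypothetical.

```python
import re


def toy_check(dose: str, source_line: str) -> str:
    """Toy grounding check: PASS only if the dose occurs as a whole token run."""
    pattern = r"(?<![\w.])" + re.escape(dose) + r"(?![\w.])"
    return "PASS" if re.search(pattern, source_line) else "FAIL"


def digit_edits(dose: str):
    """Yield every single-digit substitution of the dose string."""
    for i, ch in enumerate(dose):
        if ch.isdigit():
            for d in "0123456789":
                if d != ch:
                    yield dose[:i] + d + dose[i + 1:]


def magnitude_shifts(dose: str):
    """Yield the dose with its number scaled up or down by ten."""
    number, unit = dose.split()
    yield f"{number}0 {unit}"
    if number.endswith("0") and len(number) > 1:
        yield f"{number[:-1]} {unit}"


def unit_swaps(dose: str):
    """Yield the mg<->mcg swap of the dose, if one applies."""
    number, unit = dose.split()
    swapped = {"mg": "mcg", "mcg": "mg"}.get(unit)
    if swapped:
        yield f"{number} {swapped}"


def corruptions(dose: str):
    """Chain all operators; the guard drops mutants identical to the input."""
    for op in (digit_edits, magnitude_shifts, unit_swaps):
        for mutant in op(dose):
            if mutant != dose:
                yield mutant
```

A test built on this would assert that toy_check("500 mg", source) is PASS for a clean source line and that toy_check(m, source) is never PASS for any m in corruptions("500 mg").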

Known v0 limitations are encoded as xfail(strict=True) tests (duplicate medication names, trailing punctuation on frequencies, omission detection): executable documentation that flips to a build failure the day the limitation is fixed.

uv sync --group dev
uv run pytest                      # full suite (~750 tests, <5 s)
uv run pytest -m golden            # just the benchmark gate

Project layout

medfence/
  contract.py        # types + fail-closed aggregation (the stable core)
  normalize.py       # deterministic text/unit/number normalization
  rulepack.py        # bundle loading, thresholds-as-versioned-data
  checks/grounding.py  # Family G
  checks/refdata.py    # Family R
  verify.py          # the single entrypoint
  data/refdata_seed.json
scripts/make_golden.py   # regenerate the golden set (deterministic, no RNG)
scripts/benchmark.py     # the table above (exits 1 on any false PASS)
scripts/corruptions.py   # shared fault operators (golden set + metamorphic grid)
tests/               # contract invariants, properties, metamorphic grid, golden gate
tests/schemas/       # JSON Schema for the audit report (the output contract)
docs/adr-001-*.md    # the design record
docs/related-work.md # how this differs from LangExtract, Guardrails, etc.

Design record & positioning

See docs/adr-001-clinical-verification-contract-v0.md for the full contract spec, options considered, and the trade-off analysis (notably: why ABSTAIN exists, why coverage is a first-class output, and where the SaMD line is drawn). For how medfence relates to LangExtract, Guardrails AI, clinical self-verification, and the rest of the landscape, see docs/related-work.md.

License

Apache-2.0.

Quick demo

python3 demo.py   # one prescription: PASS, FAIL (mcg swap + fabricated drug), ABSTAIN (low OCR)
