Debanitrkl/medfence

medfence

A deterministic, fail-closed verification harness for clinical LLM outputs.

LLM extraction is entering clinical workflows (scribes, prescription digitization, lab-report parsing) faster than verification infrastructure is being built. Known failure modes include fabricated medications, silently altered dosages, mg→mcg unit swaps, and invented frequencies. medfence is the fence between model output and clinical action: pure-function, zero-dependency, every verdict backed by machine-checkable evidence.

from medfence import verify, Extraction, SourceDocument

report = verify(
    Extraction(artifact_type="prescription", payload=llm_output),
    SourceDocument(text=ocr_text, modality="ocr", ocr_confidence=0.91),
)
report.overall        # Verdict.PASS | Verdict.FAIL | Verdict.ABSTAIN
report.coverage       # fraction of payload fields actually verified
report.to_audit_json()  # one hash-chained JSON line per verification
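
The "hash-chained JSON line" idea can be illustrated with a minimal sketch. This is not medfence's audit format: the record fields (`prev`, `report`, `hash`), the `GENESIS` value, and the function names are assumptions made for illustration only.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first record


def _canonical(obj):
    # Sorted keys and fixed separators give one byte sequence per logical value.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def chain_line(report, prev_hash):
    """Serialize one report as a JSON line whose hash covers the previous hash."""
    digest = hashlib.sha256((prev_hash + _canonical(report)).encode("utf-8")).hexdigest()
    line = _canonical({"prev": prev_hash, "report": report, "hash": digest})
    return line, digest


def verify_chain(lines):
    """Recompute every hash in order; any edit or reordering breaks the chain."""
    prev = GENESIS
    for line in lines:
        record = json.loads(line)
        expected = hashlib.sha256((prev + _canonical(record["report"])).encode("utf-8")).hexdigest()
        if record["prev"] != prev or record["hash"] != expected:
            return False
        prev = record["hash"]
    return True
```

Because each line's hash covers its predecessor's hash, altering or reordering any earlier line invalidates every later one, which is what lets a third party check an audit log without trusting its producer.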

The three guarantees

  1. Deterministic. No I/O, no clock, no model calls inside verify(). Identical (source, extraction, rulepack) → bit-identical report. Verdicts are reproducible by strangers; that is what makes them auditable.
  2. Fail-closed. ABSTAIN is a first-class verdict, not an error state. "Couldn't check" routes to a human exactly like FAIL; it never silently becomes PASS. Unknown artifact types, missing reference data, low-confidence OCR → ABSTAIN.
  3. No finding without evidence. Every verdict carries a source span, a reference bundle key, or a rule id. A check that cannot produce evidence must abstain.
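
A fail-closed aggregation can be sketched in a few lines. The `Verdict` class below is a stand-in for illustration, not medfence's own type, and the severity ranking (FAIL worse than ABSTAIN worse than PASS) and the empty-input behavior are assumptions consistent with the guarantees above rather than a copy of the library's code.

```python
from enum import Enum


class Verdict(Enum):
    # Hypothetical severity ranks; a higher value is "worse" for aggregation.
    PASS = 0
    ABSTAIN = 1
    FAIL = 2


def aggregate(verdicts):
    """Worst-of aggregation that fails closed when nothing was checked."""
    verdicts = list(verdicts)
    if not verdicts:
        # No check ran, so there is no evidence for PASS.
        return Verdict.ABSTAIN
    return max(verdicts, key=lambda v: v.value)
```

Taking the maximum makes the result independent of check order, and adding another check can only keep the overall verdict the same or make it worse. A single ABSTAIN among many PASS results therefore still routes the document to a human.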

And one deliberate refusal: medfence never judges clinical appropriateness. dose_in_range answers "does 500 mg paracetamol exist as a marketed product?", never "should this patient take it?" Fidelity and referential validity are mechanically decidable; clinical judgment is not, and pretending otherwise is how verification tools become unlicensed medical devices.

Check families (v0)

Family G: span grounding ("no span, no claim")

Every extracted value must align to a span in the source, localized to its own medication's line, because whole-document matching invites cross-medication collisions (a 250 mg on someone else's line must not ground your altered strength). Numbers and units match exact-after-normalization only; fuzzy-matching a dosage is how a fence approves a hallucination. unit_integrity catches the mg↔mcg class specifically. Fully ungrounded medication objects fail no_orphans (the fabricated-drug detector).
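
To make "exact-after-normalization" and line-local grounding concrete, here is a minimal sketch. The alias table, the regular expression, and the function names are hypothetical; the actual rules in normalize.py and checks/grounding.py may differ.

```python
import re
from decimal import Decimal

# Hypothetical alias table: spelling variants collapse, but mg and mcg never merge.
UNIT_ALIASES = {"mg": "mg", "mcg": "mcg", "µg": "mcg", "μg": "mcg", "ug": "mcg", "g": "g", "ml": "ml"}

# A number not preceded by a word character or dot, then an alphabetic unit.
DOSE_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)\s*([a-zµμ]+)\b", re.IGNORECASE)


def canonical_doses(text):
    """Return the set of (number, unit) pairs found in text, after normalization."""
    doses = set()
    for number, unit in DOSE_RE.findall(text):
        unit = UNIT_ALIASES.get(unit.lower())
        if unit is not None:  # tokens such as "5 days" are not doses
            doses.add((Decimal(number).normalize(), unit))
    return doses


def is_grounded(extracted_strength, source_line):
    """True only if the extracted strength appears exactly on this medication's own line."""
    wanted = canonical_doses(extracted_strength)
    return bool(wanted) and wanted <= canonical_doses(source_line)
```

The caller passes only the line that belongs to the medication being checked. In this sketch, `is_grounded("250 mg", "Tab Paracetamol 500 mg BD")` is false even if another line of the same prescription contains 250 mg, and `is_grounded("500 mcg", "Tab Paracetamol 500 mg BD")` is false because the units are compared exactly.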

Family R: reference validity

Drug names, marketed strengths, dose forms, and frequency tokens are checked against a versioned, content-hashed Indian drug bundle and a closed grammar of prescription shorthand (OD, BD, TDS, 1-0-1, SOS, …). Fuzzy lookup is allowed for retrieval; a weak hit is ABSTAIN, never PASS.
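
The two ideas in this family, a closed frequency grammar and fuzzy retrieval that can never produce PASS, might look roughly like the sketch below. The token set contains only the tokens named above (the real grammar is larger), and the function names, the `cutoff` value, and the three outcome labels are assumptions for illustration.

```python
import difflib
import re

# Only the tokens named in the text; the full grammar is not reproduced here.
FREQUENCY_TOKENS = frozenset({"OD", "BD", "TDS", "SOS"})
# Three-slot pattern such as 1-0-1.
SLOT_PATTERN = re.compile(r"\d-\d-\d")


def frequency_in_grammar(token):
    """Closed-grammar membership: anything not explicitly allowed is rejected."""
    token = token.strip()
    return token in FREQUENCY_TOKENS or SLOT_PATTERN.fullmatch(token) is not None


def lookup_outcome(name, known_names, cutoff=0.8):
    """Classify a drug-name lookup as 'exact', 'weak' (fuzzy only), or 'miss'."""
    key = name.strip().lower()
    if key in known_names:
        return "exact"
    # Fuzzy matching is used for retrieval only; a fuzzy hit is never 'exact'.
    close = difflib.get_close_matches(key, sorted(known_names), n=1, cutoff=cutoff)
    return "weak" if close else "miss"
```

With a hypothetical two-name bundle `{"paracetamol", "amoxicillin"}`, the misspelling "paracetmol" is classified as a weak hit. Under the rule above, a caller would map that outcome to ABSTAIN rather than PASS.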

Benchmark

100 documents: 20 synthetic Indian OPD prescriptions × (1 clean + 4 seeded-error variants). Reproduce with python scripts/make_golden.py && python scripts/benchmark.py.

error class         n   FAIL  ABSTAIN  PASS  caught  false PASS
fabricated_drug     20  20    0        0     100%    0%
unit_swap           20  20    0        0     100%    0%
altered_strength    20  20    0        0     100%    0%
invented_frequency  20  20    0        0     100%    0%
clean               20  0     0        20    n/a (100% clean-pass)

Read this honestly: 100% on a synthetic golden set means the set is easy, not that the fence is finished. The seeded errors are clean single-fault injections on noise-free text. The numbers that matter will come from real, anonymized, OCR-noisy prescriptions. Contributions of anonymized hard cases are the most valuable thing you can send this project.

What v0 deliberately does not do

  • Drug-drug interaction checking (v1 candidate, behind an explicit opt-in)
  • Patient-contextual dosing (age/weight/renal): the SaMD line; we stay below it
  • Auto-correction: medfence flags, it never fixes
  • LLM-as-judge fallback: if deterministic checks can't verify it, a human sees it
  • STT/word-timestamp evidence (the Evidence type is designed for it; v0 is OCR/text)
  • Omission detection: verify() checks payload→source fidelity, not source→payload completeness, so a medication silently dropped by the extractor is not caught. This gap is encoded as a strict-xfail test so it stays visible.

Reference bundle

v0 ships refdata-2026.07.0-seed: ~27 common Indian OPD molecules with brand aliases, forms, and marketed strengths. It is a deliberately small, versioned placeholder for a proper CDSCO + NLEM + Jan Aushadhi normalization pass. The bundle is content-hashed and the hash is pinned into every report, so verdicts remain reproducible as the data grows.
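
Content hashing of a reference bundle can be sketched as follows. The function name and the canonicalization choices (sorted keys, compact separators, SHA-256) are assumptions; the source does not specify how medfence computes its bundle hash.

```python
import hashlib
import json


def bundle_hash(bundle):
    """Hash the bundle's content, not its file bytes, so key order is irrelevant."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two bundles with the same entries in a different key order hash identically, while adding or changing a single strength yields a different hash. A report that pins the hash therefore identifies exactly which reference data produced its verdict.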

Testing & validation

The suite is deliberately heavier than the library; for a verification tool, the tests are the product claim. Beyond unit tests, three layers guard the contract:

  • Property-based (tests/test_properties.py, Hypothesis): verify() is total, fail-closed, and non-mutating over arbitrary junk payloads; reports are bit-identical; clean-by-construction cases PASS and any corruption of them never does; mg/mcg canonicalization never merges; aggregation matches an independent worst-of oracle and is order-invariant and monotone.
  • Metamorphic (tests/test_metamorphic.py): ~700 guarded corruptions (digit edits, magnitude shifts, unit swaps, real-but-absent drug insertion, in-grammar frequency swaps, …) applied to every clean golden case at test time; none may PASS. Same fault taxonomy as the golden set, one implementation (scripts/corruptions.py), two depths.
  • Golden gate (tests/test_golden_gate.py): the benchmark as hard assertions; the build fails unless false-PASS == 0 and clean-pass == 100%, and the committed golden set must match its generator exactly.
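
As a rough illustration of what guarded corruption operators look like, here is a small sketch. It is not the content of scripts/corruptions.py: the two operators, their names, and the "number space unit" input shape are assumptions chosen to keep the example short.

```python
def swap_unit(strength):
    """mg <-> mcg swap; returns None when the operator does not apply."""
    number, _, unit = strength.partition(" ")
    swapped = {"mg": "mcg", "mcg": "mg"}.get(unit)
    return f"{number} {swapped}" if swapped else None


def shift_magnitude(strength):
    """Append a zero to a positive integer dose (500 mg becomes 5000 mg)."""
    number, _, unit = strength.partition(" ")
    if number.isdigit() and int(number) > 0:
        return f"{number}0 {unit}"
    return None


def corruptions(strength):
    """Yield (operator name, corrupted value), guarded so none equals the original."""
    for operator in (swap_unit, shift_magnitude):
        corrupted = operator(strength)
        if corrupted is not None and corrupted != strength:
            yield operator.__name__, corrupted


print(list(corruptions("500 mg")))
# → [('swap_unit', '500 mcg'), ('shift_magnitude', '5000 mg')]
```

A metamorphic test would apply each yielded value to a clean case and assert that the resulting verdict is never PASS. The guard matters because an operator that returns the original value unchanged would make such a test fail for the wrong reason.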

Known v0 limitations are encoded as xfail(strict=True) tests (duplicate medication names, trailing punctuation on frequencies, omission detection): executable documentation that flips to a build failure the day the limitation is fixed.

uv sync --group dev
uv run pytest                      # full suite (~750 tests, <5 s)
uv run pytest -m golden            # just the benchmark gate

Project layout

medfence/
  contract.py        # types + fail-closed aggregation (the stable core)
  normalize.py       # deterministic text/unit/number normalization
  rulepack.py        # bundle loading, thresholds-as-versioned-data
  checks/grounding.py  # Family G
  checks/refdata.py    # Family R
  verify.py          # the single entrypoint
  data/refdata_seed.json
scripts/make_golden.py   # regenerate the golden set (deterministic, no RNG)
scripts/benchmark.py     # the table above (exits 1 on any false PASS)
scripts/corruptions.py   # shared fault operators (golden set + metamorphic grid)
tests/               # contract invariants, properties, metamorphic grid, golden gate
tests/schemas/       # JSON Schema for the audit report (the output contract)
docs/adr-001-*.md    # the design record
docs/related-work.md # how this differs from LangExtract, Guardrails, etc.

Design record & positioning

See docs/adr-001-clinical-verification-contract-v0.md for the full contract spec, options considered, and the trade-off analysis (notably: why ABSTAIN exists, why coverage is a first-class output, and where the SaMD line is drawn). For how medfence relates to LangExtract, Guardrails AI, clinical self-verification, and the rest of the landscape, see docs/related-work.md.

License

Apache-2.0.

Quick demo

python3 demo.py   # one prescription: PASS, FAIL (mcg swap + fabricated drug), ABSTAIN (low OCR)
