
feat(documents): document extraction pipeline -- text, OCR, and LLM-powered structured pre-fill #200

@cpcloud

Overview

Build a document extraction pipeline that auto-fills form fields when a document is uploaded. The pipeline has three independent, gracefully degrading layers -- each adds value on its own, and no layer requires another to function.

Layers:

  • Text extraction (Go library, always available) -- pull text from digital PDFs, plain text, markdown
  • Tesseract OCR (optional CLI tool) -- extract text from scanned PDFs and images when text layer is absent
  • Extraction model (optional local LLM) -- interpret raw text and return structured fields (vendor, amounts, dates, entity links)

This is fully independent of the chat feature. Someone may use extraction without chat, chat without extraction, both, or neither.

Capability matrix

| Extraction model | Tesseract | Text PDFs | Scanned PDFs/images | Plain text |
|---|---|---|---|---|
| yes | yes | Full structured pre-fill | Full structured pre-fill via OCR | Full structured pre-fill |
| yes | no | Full structured pre-fill | Title from filename only; one-time hint to install tesseract | Full structured pre-fill |
| no | yes | Text extracted + stored for future use; title from filename | OCR text extracted + stored; title from filename | Title from filename |
| no | no | Text layer extracted + stored; title from filename | Opaque BLOB; title from filename, MIME, size | Title from filename |

In every case the document is stored, the auto-title works, and the app never errors or blocks on a missing capability.

Design decisions to make

1. PDF text extraction library

Need a Go library to pull text from PDF files. Candidates:

  • ledongthuc/pdf
  • pdfcpu
  • unipdf (note: dual-licensed, needs audit)

Criteria: no cgo, good text extraction quality, maintained, compatible license.
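Whichever library wins the audit, keeping it behind a small interface makes the choice swappable later. A minimal sketch -- the interface and names here are hypothetical, not taken from any of the candidate libraries:

```go
package main

// TextExtractor abstracts over the PDF library choice so the rest of
// the pipeline does not depend on which candidate wins the audit.
type TextExtractor interface {
	// ExtractText returns the document's text layer, or an empty string
	// if the document has none (e.g. a scanned PDF).
	ExtractText(path string) (string, error)
}

// stubExtractor is a placeholder implementation; the real one wraps
// whichever library is selected.
type stubExtractor struct{ text string }

func (s stubExtractor) ExtractText(path string) (string, error) {
	return s.text, nil
}
```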

2. Extracted text storage schema

Recommendation: two columns on Document:

  • ExtractedText string -- plain text for FTS/search and LLM input
  • OCRData []byte -- raw TSV from tesseract, preserved for future use (confidence scores, bounding boxes)

The text column powers search (SQLite FTS) and feeds the extraction model. The TSV blob sits dormant until needed (e.g. highlighting source locations in a future UI).
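As a sketch of the schema, the two new columns might sit on the model like this -- ExtractedText and OCRData are the proposal; the surrounding fields are illustrative placeholders, not taken from the actual model:

```go
package main

import "time"

// Document sketches the proposed storage schema. ExtractedText and
// OCRData are the two new columns; the other fields are illustrative.
type Document struct {
	ID            int64
	Filename      string
	MIME          string
	Blob          []byte    // original file bytes, stored as-is
	ExtractedText string    // plain text: powers SQLite FTS and feeds the LLM
	OCRData       []byte    // raw tesseract TSV, dormant until needed
	CreatedAt     time.Time
}
```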

3. Scanned PDF detection heuristic

Before invoking tesseract, determine whether OCR is actually needed:

  1. Extract text via Go library
  2. If empty/whitespace-only --> scanned, needs OCR
  3. If non-empty --> real text layer, skip OCR

Edge case: PDFs with a bad hidden OCR layer (garbage text). Could add a quality heuristic (non-printable character ratio, word density vs page count) but fine to punt to v2 and trust any non-empty text layer for now.
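The v1 heuristic reduces to a one-liner -- trust any non-empty text layer:

```go
package main

import "strings"

// needsOCR implements the v1 heuristic: a document needs OCR only when
// the extracted text layer is empty or whitespace-only. Any non-empty
// text layer is trusted (bad hidden OCR layers are punted to v2).
func needsOCR(extracted string) bool {
	return strings.TrimSpace(extracted) == ""
}
```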

4. PDF rasterization for OCR

Tesseract needs images, not PDFs. Need pdftoppm (poppler-utils) to rasterize pages first. The availability check becomes: tesseract on PATH and pdftoppm on PATH. Both are nix-packaged and commonly co-installed.

5. Tesseract output format

Recommendation: TSV. Captures word-level confidence scores and coordinates at no extra cost. Plain text is derived by reading the TSV text column. Richer data is preserved for future use (e.g. passing confidence info to the extraction model, or UI source highlighting).
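Deriving the plain text from the TSV is a small parse. A sketch that flattens words into a single line (a real version would also group by tesseract's line/paragraph numbers):

```go
package main

import "strings"

// tsvToText derives plain text from tesseract's TSV output. Word rows
// have 12 tab-separated columns; the text is the last column, and
// non-word rows (page/block/line markers) carry conf == -1.
func tsvToText(tsv string) string {
	var words []string
	for i, line := range strings.Split(tsv, "\n") {
		if i == 0 { // header row
			continue
		}
		cols := strings.Split(line, "\t")
		if len(cols) != 12 || cols[10] == "-1" {
			continue
		}
		if w := strings.TrimSpace(cols[11]); w != "" {
			words = append(words, w)
		}
	}
	return strings.Join(words, " ")
}
```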

6. Extraction timing (sync vs async)

Recommendation: synchronous for v1. User picks file, text extraction + optional OCR + optional LLM runs, form opens pre-filled. Most home documents are 1-3 pages -- a couple seconds of blocking is fine and the code is dramatically simpler than async message-passing in bubbletea.

7. Page limit for OCR

Long documents (e.g. 200-page appliance manuals) will be slow through tesseract. Options:

  • Cap at N pages (10-20) with a status hint
  • No cap but switch to async for docs over a threshold
  • No cap, let it rip

Recommendation: cap at 10-20 pages for v1. Manuals front-load important info (specs, warranty, maintenance schedule) in the first few pages.
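The cap plugs directly into the pdftoppm invocation via its first/last page flags. A sketch -- the `-r 300` DPI setting is an assumption (a common choice for OCR), not part of this proposal:

```go
package main

import "strconv"

// rasterizeArgs builds the pdftoppm argument list for OCR, capping the
// number of pages rasterized. maxPages = 10 matches the v1 recommendation.
func rasterizeArgs(pdfPath, outPrefix string, maxPages int) []string {
	return []string{
		"-png",                       // one PNG per page
		"-r", "300",                  // render at 300 DPI (assumed default)
		"-f", "1",                    // first page
		"-l", strconv.Itoa(maxPages), // last page: the cap
		pdfPath,
		outPrefix,
	}
}
```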

8. Extraction model choice

Recommendation: Qwen2.5 7B. Best-in-class for structured JSON output at this parameter count. Runs on 8GB VRAM. Alternatives: Phi-3.5 Mini (3.8B) for lighter hardware, Llama 3.1 8B as a solid fallback.

Separate from whatever model the user runs for chat. Extraction wants small + fast + schema-rigid. Chat wants larger + better reasoning.

9. LLM prompt strategy: one-pass vs two-pass

Two-pass: classify document type first, then extract with a typed schema. More precise, doubles LLM calls.

One-pass: universal schema with all fields optional, model returns what it finds.

Recommendation: one-pass with a universal schema for v1. Split into two passes only if extraction quality is measurably bad on specific document types.

10. Context passed to extraction model

Include document metadata as a header block before the extracted text:

```
Filename: quarterly_budget-report.pdf
MIME: application/pdf
Size: 245 KiB
Existing entities: [projects: Kitchen Remodel, Bathroom...] [vendors: Garcia Plumbing, ...]

---
[extracted text here]
```

Passing existing entity names lets the model match "Garcia" to "Garcia Plumbing" without hallucinating new entities.

11. Universal extraction schema (strawman)

```json
{
  "document_type": "quote|invoice|receipt|manual|warranty|permit|inspection|contract|other",
  "title_suggestion": "string",
  "summary": "one-line string for table display",
  "vendor_hint": "string or null -- matched against existing vendors",
  "total_cents": "int or null",
  "labor_cents": "int or null",
  "materials_cents": "int or null",
  "date": "YYYY-MM-DD or null",
  "warranty_expiry": "YYYY-MM-DD or null",
  "entity_kind_hint": "project|appliance|vendor|maintenance|quote|service_log or null",
  "entity_name_hint": "string or null -- matched against existing entity names",
  "maintenance_items": [
    {"name": "string", "interval_months": "int"}
  ],
  "notes": "string or null -- anything else worth capturing"
}
```

All fields optional. The model fills what it can. The maintenance_items array handles the "extract maintenance schedule from manual" case.

12. Where LLM-generated fields land

Extraction returns a hints struct. Hints pre-fill form fields. The user sees and can edit every field before submitting. Nothing goes directly into the database without user confirmation. Extraction quality doesn't need to be perfect -- saving 4 out of 5 fields from manual entry is a win.

13. Graceful degradation UX

  • No errors, ever. Missing capabilities just mean fewer pre-filled fields.
  • One-time status bar hint when a scanned doc is uploaded without tesseract: "install tesseract for better document extraction." Never shown again.
  • No hints at all for missing LLM -- the app just works with less automation.
  • When extraction model becomes available later, offer a "re-process documents" action to run extraction on stored text that was never LLM-processed. Retroactive enrichment.

Future possibilities (not in scope for this issue)

  • Semantic search via embeddings stored alongside BLOBs in SQLite
  • Vision model (Qwen2-VL, LLaVA) as alternative to tesseract + extraction model
  • "Compare these quotes" cross-document reasoning
  • Auto-create maintenance items from extracted manual schedules (the schema supports it, the mutation UX needs design)
  • Source highlighting using OCR bounding box data

Labels: documents (Document management features), enhancement (New feature or request), llm (LLM and chat features)
