Overview
Build a document extraction pipeline that auto-fills form fields when uploading documents. The pipeline has three independent, gracefully degrading layers -- each adds value on its own and no layer requires another to function.
Layers:
- Text extraction (Go library, always available) -- pull text from digital PDFs, plain text, markdown
- Tesseract OCR (optional CLI tool) -- extract text from scanned PDFs and images when text layer is absent
- Extraction model (optional local LLM) -- interpret raw text and return structured fields (vendor, amounts, dates, entity links)
This is fully independent of the chat feature. Someone may use extraction without chat, chat without extraction, both, or neither.
Capability matrix
| Extraction model | Tesseract | Text PDFs | Scanned PDFs/images | Plain text |
|---|---|---|---|---|
| yes | yes | Full structured pre-fill | Full structured pre-fill via OCR | Full structured pre-fill |
| yes | no | Full structured pre-fill | Title from filename only; one-time hint to install tesseract | Full structured pre-fill |
| no | yes | Text extracted + stored for future use; title from filename | OCR text extracted + stored; title from filename | Title from filename |
| no | no | Text layer extracted + stored; title from filename | Opaque BLOB; title from filename, MIME, size | Title from filename |
In every case the document is stored, the auto-title works, and the app never errors or blocks on a missing capability.
Design decisions to make
1. PDF text extraction library
Need a Go library to pull text from PDF files. Candidates:
- ledongthuc/pdf
- pdfcpu
- unipdf (note: dual-licensed, needs audit)
Criteria: no cgo, good text extraction quality, maintained, compatible license.
2. Extracted text storage schema
Recommendation: two columns on Document:
- ExtractedText string -- plain text for FTS/search and LLM input
- OCRData []byte -- raw TSV from tesseract, preserved for future use (confidence scores, bounding boxes)
The text column powers search (SQLite FTS) and feeds the extraction model. The TSV blob sits dormant until needed (e.g. highlighting source locations in a future UI).
3. Scanned PDF detection heuristic
Before invoking tesseract, determine whether OCR is actually needed:
- Extract text via Go library
- If empty/whitespace-only --> scanned, needs OCR
- If non-empty --> real text layer, skip OCR
Edge case: PDFs with a bad hidden OCR layer (garbage text). Could add a quality heuristic (non-printable character ratio, word density vs page count) but fine to punt to v2 and trust any non-empty text layer for now.
4. PDF rasterization for OCR
Tesseract needs images, not PDFs. Need pdftoppm (poppler-utils) to rasterize pages first. The availability check becomes: tesseract on PATH and pdftoppm on PATH. Both are nix-packaged and commonly co-installed.
5. Tesseract output format
Recommendation: TSV. Captures word-level confidence scores and coordinates at no extra cost. Plain text is derived by reading the TSV text column. Richer data is preserved for future use (e.g. passing confidence info to the extraction model, or UI source highlighting).
6. Extraction timing (sync vs async)
Recommendation: synchronous for v1. User picks file, text extraction + optional OCR + optional LLM runs, form opens pre-filled. Most home documents are 1-3 pages -- a couple seconds of blocking is fine and the code is dramatically simpler than async message-passing in bubbletea.
7. Page limit for OCR
Long documents (e.g. 200-page appliance manuals) will be slow through tesseract. Options:
- Cap at N pages (10-20) with a status hint
- No cap but switch to async for docs over a threshold
- No cap, let it rip
Recommendation: cap at 10-20 pages for v1. Manuals front-load important info (specs, warranty, maintenance schedule) in the first few pages.
8. Extraction model choice
Recommendation: Qwen2.5 7B. Best-in-class for structured JSON output at this parameter count. Runs on 8GB VRAM. Alternatives: Phi-3.5 Mini (3.8B) for lighter hardware, Llama 3.1 8B as a solid fallback.
Separate from whatever model the user runs for chat. Extraction wants small + fast + schema-rigid. Chat wants larger + better reasoning.
9. LLM prompt strategy: one-pass vs two-pass
Two-pass: classify document type first, then extract with a typed schema. More precise, doubles LLM calls.
One-pass: universal schema with all fields optional, model returns what it finds.
Recommendation: one-pass with a universal schema for v1. Split into two passes only if extraction quality is measurably bad on specific document types.
10. Context passed to extraction model
Include document metadata as a header block before the extracted text:
Filename: quarterly_budget-report.pdf
MIME: application/pdf
Size: 245 KiB
Existing entities: [projects: Kitchen Remodel, Bathroom...] [vendors: Garcia Plumbing, ...]
---
[extracted text here]
Passing existing entity names lets the model match "Garcia" to "Garcia Plumbing" without hallucinating new entities.
11. Universal extraction schema (strawman)
{
"document_type": "quote|invoice|receipt|manual|warranty|permit|inspection|contract|other",
"title_suggestion": "string",
"summary": "one-line string for table display",
"vendor_hint": "string or null -- matched against existing vendors",
"total_cents": "int or null",
"labor_cents": "int or null",
"materials_cents": "int or null",
"date": "YYYY-MM-DD or null",
"warranty_expiry": "YYYY-MM-DD or null",
"entity_kind_hint": "project|appliance|vendor|maintenance|quote|service_log or null",
"entity_name_hint": "string or null -- matched against existing entity names",
"maintenance_items": [
{"name": "string", "interval_months": "int"}
],
"notes": "string or null -- anything else worth capturing"
}
All fields optional. The model fills what it can. The maintenance_items array handles the "extract maintenance schedule from manual" case.
12. Where LLM-generated fields land
Extraction returns a hints struct. Hints pre-fill form fields. The user sees and can edit every field before submitting. Nothing goes directly into the database without user confirmation. Extraction quality doesn't need to be perfect -- saving 4 out of 5 fields from manual entry is a win.
13. Graceful degradation UX
- No errors, ever. Missing capabilities just mean fewer pre-filled fields.
- One-time status bar hint when a scanned doc is uploaded without tesseract: "install tesseract for better document extraction." Never shown again.
- No hints at all for missing LLM -- the app just works with less automation.
- When extraction model becomes available later, offer a "re-process documents" action to run extraction on stored text that was never LLM-processed. Retroactive enrichment.
Future possibilities (not in scope for this issue)
- Semantic search via embeddings stored alongside BLOBs in SQLite
- Vision model (Qwen2-VL, LLaVA) as alternative to tesseract + extraction model
- "Compare these quotes" cross-document reasoning
- Auto-create maintenance items from extracted manual schedules (the schema supports it, the mutation UX needs design)
- Source highlighting using OCR bounding box data