
feat(documents): document extraction pipeline -- text, OCR, and LLM-powered structured pre-fill #200

@cpcloud

Overview

Build a document extraction pipeline that auto-fills form fields when a document is uploaded. The pipeline has three independent, gracefully degrading layers -- each adds value on its own, and no layer requires another to function.

Layers:

  • Text extraction (Go library, always available) -- pull text from digital PDFs, plain text, markdown
  • Tesseract OCR (optional CLI tool) -- extract text from scanned PDFs and images when text layer is absent
  • Extraction model (optional local LLM) -- interpret raw text and return structured fields (vendor, amounts, dates, entity links)

This is fully independent of the chat feature. Someone may use extraction without chat, chat without extraction, both, or neither.

Capability matrix

| Extraction model | Tesseract | Text PDFs | Scanned PDFs/images | Plain text |
|---|---|---|---|---|
| yes | yes | Full structured pre-fill | Full structured pre-fill via OCR | Full structured pre-fill |
| yes | no | Full structured pre-fill | Title from filename only; one-time hint to install tesseract | Full structured pre-fill |
| no | yes | Text extracted + stored for future use; title from filename | OCR text extracted + stored; title from filename | Title from filename |
| no | no | Text layer extracted + stored; title from filename | Opaque BLOB; title from filename, MIME, size | Title from filename |

In every case the document is stored, the auto-title works, and the app never errors or blocks on a missing capability.

Design decisions to make

1. PDF text extraction library

Need a Go library to pull text from PDF files. Candidates:

  • ledongthuc/pdf
  • pdfcpu
  • unipdf (note: dual-licensed, needs audit)

Criteria: no cgo, good text extraction quality, maintained, compatible license.
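Whichever library wins the audit, keeping it behind a small interface makes the choice swappable later. A minimal sketch -- the interface and names here are hypothetical, not taken from any of the candidate libraries:

```go
package main

// TextExtractor abstracts over the PDF library choice so the rest of
// the pipeline does not depend on which candidate wins the audit.
type TextExtractor interface {
	// ExtractText returns the document's text layer, or an empty string
	// if the document has none (e.g. a scanned PDF).
	ExtractText(path string) (string, error)
}

// stubExtractor is a placeholder implementation; the real one wraps
// whichever library is selected.
type stubExtractor struct{ text string }

func (s stubExtractor) ExtractText(path string) (string, error) {
	return s.text, nil
}
```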

2. Extracted text storage schema

Recommendation: two columns on Document:

  • ExtractedText string -- plain text for FTS/search and LLM input
  • OCRData []byte -- raw TSV from tesseract, preserved for future use (confidence scores, bounding boxes)

The text column powers search (SQLite FTS) and feeds the extraction model. The TSV blob sits dormant until needed (e.g. highlighting source locations in a future UI).
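As a sketch of the schema, the two new columns might sit on the model like this -- ExtractedText and OCRData are the proposal; the surrounding fields are illustrative placeholders, not taken from the actual model:

```go
package main

import "time"

// Document sketches the proposed storage schema. ExtractedText and
// OCRData are the two new columns; the other fields are illustrative.
type Document struct {
	ID            int64
	Filename      string
	MIME          string
	Blob          []byte    // original file bytes, stored as-is
	ExtractedText string    // plain text: powers SQLite FTS and feeds the LLM
	OCRData       []byte    // raw tesseract TSV, dormant until needed
	CreatedAt     time.Time
}
```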

3. Scanned PDF detection heuristic

Before invoking tesseract, determine whether OCR is actually needed:

  1. Extract text via Go library
  2. If empty/whitespace-only --> scanned, needs OCR
  3. If non-empty --> real text layer, skip OCR

Edge case: PDFs with a bad hidden OCR layer (garbage text). Could add a quality heuristic (non-printable character ratio, word density vs page count) but fine to punt to v2 and trust any non-empty text layer for now.
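The v1 heuristic reduces to a one-liner -- trust any non-empty text layer:

```go
package main

import "strings"

// needsOCR implements the v1 heuristic: a document needs OCR only when
// the extracted text layer is empty or whitespace-only. Any non-empty
// text layer is trusted (bad hidden OCR layers are punted to v2).
func needsOCR(extracted string) bool {
	return strings.TrimSpace(extracted) == ""
}
```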

4. PDF rasterization for OCR

Tesseract needs images, not PDFs. Need pdftoppm (poppler-utils) to rasterize pages first. The availability check becomes: tesseract on PATH and pdftoppm on PATH. Both are nix-packaged and commonly co-installed.

5. Tesseract output format

Recommendation: TSV. Captures word-level confidence scores and coordinates at no extra cost. Plain text is derived by reading the TSV text column. Richer data is preserved for future use (e.g. passing confidence info to the extraction model, or UI source highlighting).
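Deriving the plain text from the TSV is a small parse. A sketch that flattens words into a single line (a real version would also group by tesseract's line/paragraph numbers):

```go
package main

import "strings"

// tsvToText derives plain text from tesseract's TSV output. Word rows
// have 12 tab-separated columns; the text is the last column, and
// non-word rows (page/block/line markers) carry conf == -1.
func tsvToText(tsv string) string {
	var words []string
	for i, line := range strings.Split(tsv, "\n") {
		if i == 0 { // header row
			continue
		}
		cols := strings.Split(line, "\t")
		if len(cols) != 12 || cols[10] == "-1" {
			continue
		}
		if w := strings.TrimSpace(cols[11]); w != "" {
			words = append(words, w)
		}
	}
	return strings.Join(words, " ")
}
```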

6. Extraction timing (sync vs async)

Recommendation: synchronous for v1. User picks file, text extraction + optional OCR + optional LLM runs, form opens pre-filled. Most home documents are 1-3 pages -- a couple seconds of blocking is fine and the code is dramatically simpler than async message-passing in bubbletea.

7. Page limit for OCR

Long documents (e.g. 200-page appliance manuals) will be slow through tesseract. Options:

  • Cap at N pages (10-20) with a status hint
  • No cap but switch to async for docs over a threshold
  • No cap, let it rip

Recommendation: cap at 10-20 pages for v1. Manuals front-load important info (specs, warranty, maintenance schedule) in the first few pages.
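The cap plugs directly into the pdftoppm invocation via its first/last page flags. A sketch -- the `-r 300` DPI setting is an assumption (a common choice for OCR), not part of this proposal:

```go
package main

import "strconv"

// rasterizeArgs builds the pdftoppm argument list for OCR, capping the
// number of pages rasterized. maxPages = 10 matches the v1 recommendation.
func rasterizeArgs(pdfPath, outPrefix string, maxPages int) []string {
	return []string{
		"-png",                       // one PNG per page
		"-r", "300",                  // render at 300 DPI (assumed default)
		"-f", "1",                    // first page
		"-l", strconv.Itoa(maxPages), // last page: the cap
		pdfPath,
		outPrefix,
	}
}
```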

8. Extraction model choice

Recommendation: Qwen2.5 7B. Best-in-class for structured JSON output at this parameter count. Runs on 8GB VRAM. Alternatives: Phi-3.5 Mini (3.8B) for lighter hardware, Llama 3.1 8B as a solid fallback.

Separate from whatever model the user runs for chat. Extraction wants small + fast + schema-rigid. Chat wants larger + better reasoning.

9. LLM prompt strategy: one-pass vs two-pass

Two-pass: classify document type first, then extract with a typed schema. More precise, doubles LLM calls.

One-pass: universal schema with all fields optional, model returns what it finds.

Recommendation: one-pass with a universal schema for v1. Split into two passes only if extraction quality is measurably bad on specific document types.

10. Context passed to extraction model

Include document metadata as a header block before the extracted text:

```
Filename: quarterly_budget-report.pdf
MIME: application/pdf
Size: 245 KiB
Existing entities: [projects: Kitchen Remodel, Bathroom...] [vendors: Garcia Plumbing, ...]

---
[extracted text here]
```

Passing existing entity names lets the model match "Garcia" to "Garcia Plumbing" without hallucinating new entities.

11. Universal extraction schema (strawman)

```json
{
  "document_type": "quote|invoice|receipt|manual|warranty|permit|inspection|contract|other",
  "title_suggestion": "string",
  "summary": "one-line string for table display",
  "vendor_hint": "string or null -- matched against existing vendors",
  "total_cents": "int or null",
  "labor_cents": "int or null",
  "materials_cents": "int or null",
  "date": "YYYY-MM-DD or null",
  "warranty_expiry": "YYYY-MM-DD or null",
  "entity_kind_hint": "project|appliance|vendor|maintenance|quote|service_log or null",
  "entity_name_hint": "string or null -- matched against existing entity names",
  "maintenance_items": [
    {"name": "string", "interval_months": "int"}
  ],
  "notes": "string or null -- anything else worth capturing"
}
```

All fields optional. The model fills what it can. The maintenance_items array handles the "extract maintenance schedule from manual" case.

12. Where LLM-generated fields land

Extraction returns a hints struct. Hints pre-fill form fields. The user sees and can edit every field before submitting. Nothing goes directly into the database without user confirmation. Extraction quality doesn't need to be perfect -- saving 4 out of 5 fields from manual entry is a win.

13. Graceful degradation UX

  • No errors, ever. Missing capabilities just mean fewer pre-filled fields.
  • One-time status bar hint when a scanned doc is uploaded without tesseract: "install tesseract for better document extraction." Never shown again.
  • No hints at all for missing LLM -- the app just works with less automation.
  • When extraction model becomes available later, offer a "re-process documents" action to run extraction on stored text that was never LLM-processed. Retroactive enrichment.

Future possibilities (not in scope for this issue)

  • Semantic search via embeddings stored alongside BLOBs in SQLite
  • Vision model (Qwen2-VL, LLaVA) as alternative to tesseract + extraction model
  • "Compare these quotes" cross-document reasoning
  • Auto-create maintenance items from extracted manual schedules (the schema supports it, the mutation UX needs design)
  • Source highlighting using OCR bounding box data

Labels: documents (Document management features), enhancement (New feature or request), llm (LLM and chat features)
