fix: layout coordinate scaling and OCR table extraction (#582) by kh3rld · Pull Request #593 · kreuzberg-dev/kreuzberg

kh3rld · 2026-03-27T02:56:34Z

Fixes #582 and relates to #574. Scanned and image-based PDFs were returning 0 tables, and every OCR element was assigned page_number: 1.

This PR addresses three distinct bugs:

Coordinate scaling (ocr.rs): Scaled layout detections (from 640×640) up to the OCR render resolution (e.g., 2480×3508) before running TATR table recognition or generating layout hints. This restores word intersection matching.
Page numbers (ocr.rs): Stamped elem.page_number = page_idx + 1 directly on each OCR element rather than accepting the Tesseract default of 1.
Pipeline pass (mod.rs): Moved layout detection upfront so the paddle-ocr/pipeline execution path doesn't silently skip layout detection entirely.
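A minimal sketch of the first two fixes, for readers who want to see the arithmetic. This is not the code from ocr.rs: `BBox`, `OcrElement`, `scale_bbox`, and `stamp_page_numbers` are hypothetical names, and the sketch assumes the layout detections are a plain per-axis stretch of the page (no letterbox padding), which the PR description does not confirm.

```rust
/// Hypothetical bounding box in pixel coordinates (not Kreuzberg's actual type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Hypothetical OCR element; only the field discussed above is modelled.
#[derive(Debug, Default)]
struct OcrElement {
    page_number: usize,
}

/// Scale a box from the layout detector's coordinate space (`from`, e.g. 640x640)
/// to the OCR render resolution (`to`, e.g. 2480x3508).
/// Assumption: each axis is stretched independently, with no padding offset.
fn scale_bbox(b: BBox, from: (f32, f32), to: (f32, f32)) -> BBox {
    let sx = to.0 / from.0;
    let sy = to.1 / from.1;
    BBox {
        x0: b.x0 * sx,
        y0: b.y0 * sy,
        x1: b.x1 * sx,
        y1: b.y1 * sy,
    }
}

/// Stamp the 1-based page number on every element of one page,
/// instead of keeping a default of 1 for all pages.
fn stamp_page_numbers(elements: &mut [OcrElement], page_idx: usize) {
    for elem in elements.iter_mut() {
        elem.page_number = page_idx + 1;
    }
}
```

With a 640×640 source and a 2480×3508 target, the horizontal factor is 2480 / 640 = 3.875, so an x coordinate of 64 maps to 248. Without this step a detection box stays in the small coordinate space and covers only a corner of the rendered page, which is why it would not intersect the OCR words it is supposed to contain.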


kh3rld added 3 commits March 26, 2026 22:52

fix(ocr): stamp page_number and scale layout bboxes to OCR resolution (#582)

c8c2942

docs(changelog): add entries for #582 layout coordinate scaling and page_number fixes

1ac82b7

fix(pdf): run layout detection upfront for pipeline path (#582)

db96089

kh3rld requested a review from Goldziher March 27, 2026 02:56

kh3rld added this to Kreuzberg Open-Source Kanban Mar 27, 2026

kh3rld added the bug label Mar 27, 2026

github-project-automation bot moved this to Todo in Kreuzberg Open-Source Kanban Mar 27, 2026

kh3rld moved this from Todo to In Progress in Kreuzberg Open-Source Kanban Mar 27, 2026

kh3rld moved this from In Progress to In Review in Kreuzberg Open-Source Kanban Mar 27, 2026

Goldziher merged commit a384074 into main Mar 27, 2026
83 of 91 checks passed

Goldziher deleted the fix/layout-coordinate-scaling branch March 27, 2026 05:39

github-project-automation bot moved this from In Review to Done in Kreuzberg Open-Source Kanban Mar 27, 2026
