fix: layout coordinate scaling and OCR table extraction (#582) #593

Merged
Goldziher merged 3 commits into main from fix/layout-coordinate-scaling
Mar 27, 2026

@kh3rld (Contributor) commented Mar 27, 2026

Fixes #582 and relates to #574. Scanned and image-based PDFs were returning 0 tables, and every OCR element was assigned page_number: 1.

This PR addresses three distinct bugs:

  1. Coordinate scaling (ocr.rs): Scaled layout detections (from 640×640) up to the OCR render resolution (e.g., 2480×3508) before running TATR table recognition or generating layout hints. This restores word intersection matching.
  2. Page numbers (ocr.rs): Stamped elem.page_number = page_idx + 1 directly on each OCR element rather than accepting the Tesseract default of 1.
  3. Pipeline pass (mod.rs): Moved layout detection upfront so the paddle-ocr/pipeline execution path doesn't silently skip layout detection entirely.
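
The first two fixes can be illustrated with a minimal, self-contained sketch. This is not the code from ocr.rs: `BBox`, `OcrElement`, `scale_bbox`, `intersects`, and `stamp_page_numbers` are hypothetical names chosen for illustration, and the sketch assumes the page is stretched to the 640×640 model input with no letterbox padding, so each axis gets its own scale factor.

```rust
/// Axis-aligned box in pixel coordinates (hypothetical type, not from the PR).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x1: f32,
    y1: f32,
    x2: f32,
    y2: f32,
}

/// Hypothetical OCR element carrying a 1-based page number.
#[derive(Debug, Clone, PartialEq)]
struct OcrElement {
    text: String,
    page_number: usize,
}

/// Rescale a box from the layout model's input space to the OCR render space.
/// Assumption: the page was stretched to the model size (no letterbox padding).
fn scale_bbox(b: BBox, from: (f32, f32), to: (f32, f32)) -> BBox {
    let sx = to.0 / from.0;
    let sy = to.1 / from.1;
    BBox { x1: b.x1 * sx, y1: b.y1 * sy, x2: b.x2 * sx, y2: b.y2 * sy }
}

/// True when the two boxes overlap with positive area.
fn intersects(a: BBox, b: BBox) -> bool {
    a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2
}

/// Overwrite the page number on every element of one page
/// (0-based page index in, 1-based page number out).
fn stamp_page_numbers(elements: &mut [OcrElement], page_idx: usize) {
    for elem in elements.iter_mut() {
        elem.page_number = page_idx + 1;
    }
}

fn main() {
    // A table region as reported in the 640x640 layout space.
    let table = BBox { x1: 64.0, y1: 64.0, x2: 576.0, y2: 320.0 };
    // An OCR word box in the 2480x3508 render space.
    let word = BBox { x1: 1200.0, y1: 900.0, x2: 1400.0, y2: 950.0 };

    // Without scaling, the two boxes live in different coordinate spaces.
    println!("unscaled match: {}", intersects(table, word));

    let scaled = scale_bbox(table, (640.0, 640.0), (2480.0, 3508.0));
    println!("scaled match: {}", intersects(scaled, word));

    // Elements arrive with the default page number 1; stamp the real one.
    let mut elems = vec![OcrElement { text: "Total".to_string(), page_number: 1 }];
    stamp_page_numbers(&mut elems, 2);
    println!("page number: {}", elems[0].page_number);
}
```

In this sketch the unscaled table box never overlaps the word box because the two are in different coordinate spaces, which is consistent with the 0-table symptom described above. If the real layout model pads the image to preserve aspect ratio, the padding offset would also need to be removed before scaling.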

@kh3rld kh3rld requested a review from Goldziher March 27, 2026 02:56
@kh3rld kh3rld added the bug Something isn't working label Mar 27, 2026
@kh3rld kh3rld moved this from Todo to In Progress in Kreuzberg Open-Source Kanban Mar 27, 2026
@kh3rld kh3rld moved this from In Progress to In Review in Kreuzberg Open-Source Kanban Mar 27, 2026
@Goldziher Goldziher merged commit a384074 into main Mar 27, 2026
83 of 91 checks passed
@Goldziher Goldziher deleted the fix/layout-coordinate-scaling branch March 27, 2026 05:39
@github-project-automation github-project-automation bot moved this from In Review to Done in Kreuzberg Open-Source Kanban Mar 27, 2026

Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Layout detection (LayoutDetectionConfig) silently returns 0 detections on scanned/image-based PDFs V2

2 participants