Text+Plus, #09: SwineBad Tool Support: Building an Extraction Pipeline for Historical Tables

Every year, Text+ funds cooperation projects with a duration of up to twelve months. In this blog series, funded projects provide insight into their work, tools, and results.
Historical newspapers are not only narrative sources but also structured data archives. Price lists, shipping registers, demographic statistics, and guest lists were often printed in tabular form. Yet despite large-scale digitization efforts, this structured information usually remains locked in scanned page images. The Text+ cooperation project “Tool Support for the Automatic Extraction of Tabular Data from Historical Newspapers” set out to change that. Using the Swinemünder Badeanzeiger (1910–1932) as a case study, we developed and evaluated an open-source pipeline for automatically detecting, extracting, and structuring tabular data from historical newspaper scans.
The Swinemünder Badeanzeiger
The Swinemünder Badeanzeiger is a historical spa gazette published between 1910 and 1932. It regularly listed names, origins, professions, and accommodations of arriving guests in tabular form.
The Digitale Bibliothek Mecklenburg-Vorpommern provides:
- 4,227 scanned issues
- ~6,500 tables
- ~60 entries per table
- approximately 350,000 individual records
From Image to Dataset: The Extraction Pipeline
The goal was not to build a new OCR engine. Instead, the project integrates existing technologies into a modular pipeline that performs four main steps:
- Table segmentation
- Optical Character Recognition (OCR)
- OCR correction
- Data structuring
Each step was evaluated separately and in combination.
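The four steps above can be sketched as a chain of small functions. This is an illustrative outline only, not the project's actual API: all function names, the record fields, and the placeholder implementations are assumptions; in the real pipeline, segmentation is done by Detectron2, OCR by Tesseract, and correction and structuring by an LLM.

```python
# Hypothetical sketch of the four-step pipeline; names and placeholder
# bodies are illustrative, not the project's actual code.
from dataclasses import dataclass

@dataclass
class GuestRecord:
    last_name: str
    first_name: str
    title: str
    profession: str
    residence: str
    accommodation: str
    persons: int

def segment_tables(page_image):
    """Step 1: return cropped table regions (in practice: Detectron2)."""
    return [page_image]  # placeholder: treat the whole page as one table

def run_ocr(table_image):
    """Step 2: OCR the crop (in practice: Tesseract, Fraktur models)."""
    return "Müller, Hans, Kaufmann, Berlin, Villa Seeblick, 2"  # stub text

def correct_ocr(raw_text):
    """Step 3: merge/correct OCR variants (in practice: an LLM)."""
    return raw_text  # placeholder: identity

def structure_rows(text):
    """Step 4: turn corrected text into structured guest records."""
    records = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 6:  # toy format: no separate title column
            records.append(GuestRecord(parts[0], parts[1], "", parts[2],
                                       parts[3], parts[4], int(parts[5])))
    return records

def extract(page_image):
    records = []
    for table in segment_tables(page_image):
        records.extend(structure_rows(correct_ocr(run_ocr(table))))
    return records

print(extract(None)[0].last_name)  # Müller
```

Keeping the steps as separable functions mirrors the project's modular design: each stage can be swapped out and evaluated in isolation.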
Table Segmentation with Detectron2
The first challenge is identifying where tables are located on a page. For this, we used Detectron2 and fine-tuned the pre-trained TableBank_152 model on annotated images of the Swinemünder Badeanzeiger. Because layouts varied significantly over the years, custom annotation was necessary.
An example of the annotated data is shown in the figures below, where green and blue boxes mark the annotated tables. The annotations were stored in JSON files for use in training. The ground-truth dataset comprises 22 images per publication year, 374 annotated images in total.
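Such JSON annotations typically follow the COCO layout that Detectron2 consumes for training. The snippet below shows a minimal single-table entry of that kind; the file name, image dimensions, and box coordinates are invented for illustration, not taken from the project's data.

```python
# Minimal COCO-style annotation for one table region. All concrete
# values here are illustrative placeholders.
import json

annotation = {
    "images": [{"id": 1, "file_name": "badeanzeiger_1910_p1.jpg",
                "width": 2480, "height": 3508}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     # bbox in COCO order: [x, y, width, height]
                     "bbox": [120, 840, 2200, 1600],
                     "area": 2200 * 1600, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "table"}],
}

# Serialize exactly as it would be written to the annotation file.
print(json.dumps(annotation["categories"]))
```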
Fine-tuning dramatically improved detection performance: Average Precision (AP, on a 0–100 scale) rose from 0.696 before fine-tuning to 98.281 after.
| Model | AP | AP50 |
| --- | --- | --- |
| Before fine-tuning | 0.696 | 1.664 |
| After fine-tuning | 98.281 | 99.962 |
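The AP50 column counts a detected table as correct when its overlap with an annotated box, measured as intersection-over-union (IoU), is at least 0.5. A minimal sketch of that underlying check (the box coordinates are made up for illustration):

```python
# IoU check underlying the AP50 metric: a detection counts as a hit
# at AP50 when IoU with a ground-truth box is >= 0.5.
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); returns intersection-over-union."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

pred = (100, 200, 500, 400)   # hypothetical detected table box
truth = (110, 190, 510, 410)  # hypothetical ground-truth box
print(iou(pred, truth) >= 0.5)  # True: counts as a hit at AP50
```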
OCR with Specialized Fraktur Models
After segmentation, the cropped tables are processed using Tesseract OCR. Because the gazette is printed in Fraktur script, three specialized models were used:
- Fraktur
- GT4HistOCR
- frak2021
No single model performed best across all years. Character Error Rate (CER) varied significantly due to scan quality and layout differences.
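CER is the Levenshtein edit distance between the OCR output and the ground truth, divided by the length of the ground truth. A pure-Python sketch (the sample strings are invented; the project computes this over full table transcriptions):

```python
# Character Error Rate (CER): edit distance / reference length.
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    return levenshtein(hypothesis, reference) / len(reference)

# Hypothetical OCR output vs. ground truth for one table cell:
print(round(cer("Mü11er", "Müller"), 3))  # 2 substitutions / 6 chars = 0.333
```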
OCR Correction with a Large Language Model
To improve OCR results, we experimented with Llama 3.1 70B Instruct (4-bit quantized). The LLM received outputs from all three OCR models and attempted to merge them into a corrected version. While correction slightly reduced error rates, the improvement was limited. Moreover, unpredictable behavior and format instability introduced new challenges.
The figure below shows the mean CER per year after correction for different input orders, where 1 denotes frak2021, 2 Fraktur, and 3 GT4HistOCR. Different orders lead to markedly different performance, and no single order is clearly optimal; order 213 was therefore chosen for further processing.
The figure below compares the LLM-corrected output with the individual OCR models. The corrected output's CER almost always falls below that of the best single model, but the gain over the best model is small in most years.
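The merge step feeds the three OCR readings to the LLM in a fixed order. A sketch of how such a prompt might be assembled, using the chosen order 213: the `MODEL_ORDER` mapping follows the numbering above, but the prompt wording and function name are illustrative assumptions, not the project's actual prompt.

```python
# Hypothetical prompt builder for the LLM-based OCR merge. The order
# "213" (Fraktur, frak2021, GT4HistOCR) matches the project's choice;
# the prompt text itself is invented for illustration.
MODEL_ORDER = {"1": "frak2021", "2": "Fraktur", "3": "GT4HistOCR"}

def build_merge_prompt(outputs, order="213"):
    """outputs: dict mapping OCR model name -> its text for one table."""
    parts = ["Merge these OCR readings of the same historical table "
             "into one corrected version. Keep the tabular layout."]
    for key in order:
        name = MODEL_ORDER[key]
        parts.append(f"--- {name} ---\n{outputs[name]}")
    return "\n\n".join(parts)

# Invented single-cell sample showing typical Fraktur confusions:
sample = {
    "frak2021": "Müller, Hans, Kaufmann",
    "Fraktur": "Mü11er, Hans, Kaufmann",
    "GT4HistOCR": "Müller, Hams, Kaufmann",
}
prompt = build_merge_prompt(sample)
print("--- Fraktur ---" in prompt)  # True: Fraktur output comes first
```

Because the LLM's behavior proved order-sensitive, fixing the order in code keeps the experiment reproducible.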
Structuring the Data
The final step transforms raw OCR text into structured fields:
- Last name
- First name
- Title
- Profession
- Residence
- Accommodation
- Number of persons
The LLM was again used to classify entries into these categories. On manually verified ground-truth data, structuring achieved F1 scores between 0.83 and 0.93. The figure below plots the F1 score per year; the right-hand section summarizes all years in a box plot (box: interquartile range, orange line: median, green triangle: mean), giving an average F1 score of approximately 0.88.
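One way to keep an LLM-based structuring step usable despite format instability is to demand JSON with a fixed key set and validate it strictly. The sketch below assumes such a scheme; the key names mirror the seven fields above, but the schema, function, and sample output are illustrative, not the project's actual implementation.

```python
# Sketch of validating the LLM's structuring output against a fixed
# seven-field schema. Key names follow the field list above; the
# schema itself is an illustrative assumption.
import json

FIELDS = ["last_name", "first_name", "title", "profession",
          "residence", "accommodation", "persons"]

def parse_guest(llm_output):
    """Parse one structured record; raise if the LLM drifted off-schema."""
    record = json.loads(llm_output)
    missing = [f for f in FIELDS if f not in record]
    if missing:
        raise ValueError(f"LLM output missing fields: {missing}")
    record["persons"] = int(record["persons"])  # normalize the count
    return record

# Invented example of a well-formed LLM response for one guest:
raw = ('{"last_name": "Müller", "first_name": "Hans", "title": "Dr.", '
       '"profession": "Kaufmann", "residence": "Berlin", '
       '"accommodation": "Villa Seeblick", "persons": "2"}')
guest = parse_guest(raw)
print(guest["persons"])  # 2
```

Strict validation turns the "unpredictable behavior" noted above into explicit errors that can be retried or flagged, rather than silently corrupted records.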
In the full end-to-end evaluation (including OCR errors), the average F1 score reached approximately 0.73. The figure below shows the end-to-end result on the training data: the F1 score varies between 0.62 and 0.82 over the years.
Results and Research Potential
In total, approximately 350,000 structured guest records were extracted. This dataset enables:
- Social network analysis (shared accommodations)
- Longitudinal studies of spa communities
- Research on mobility patterns
- Analysis of social change before and after World War I
The Swinemünde spa registers include figures connected to major cultural and political developments, making the dataset relevant across disciplines.
Integration with OCR-D
Large parts of the segmentation and OCR steps can be integrated into the OCR-D framework. The structuring component currently cannot be fully implemented within OCR-D.
Conclusion
The project demonstrates that automated extraction of tabular data from historical newspapers is feasible with acceptable accuracy. While LLM-based correction and structuring show promise, they also reveal the limits of current large language models in deterministic data workflows.
The developed pipeline is openly available and reusable for similar historical corpora: GitHub Repository SwineBad Tool Support.
The project was carried out at the University of Wismar (project runtime 01/2024 – 12/2024), led by Prof. Dr.-Ing. Frank Krüger, with Dr.-Ing. Steffen Steiner involved. Within Text+, the project is assigned to the task area Infrastructure/Operations.
Featured image: ETH-Bibliothek Zürich, Bildarchiv / Photographer: Metzger, Jack / Com_L16-0320-0002-0002 / CC BY-SA 4.0, http://doi.org/10.3932/ethz-a-000969258.
OpenEdition suggests citing this post as follows:
Steffen Steiner, Frank Krüger (26 March 2026). Text+Plus, #09: SwineBad Tool Support: Building an Extraction Pipeline for Historical Tables. Text+ Blog. Retrieved 2 April 2026 from https://doi.org/10.58079/15yca