
Text+Plus, #09: SwineBad Tool Support: Building an Extraction Pipeline for Historical Tables

Black-and-white photograph: view into the dome of the "Man in the Community Pavilion" at Expo 1967 in Montreal.


Every year, Text+ funds cooperation projects with a duration of up to twelve months. In this blog series, funded projects provide insight into their work, tools, and results.

Historical newspapers are not only narrative sources; they are also structured data archives. Price lists, shipping registers, demographic statistics, and guest lists were often printed in tabular form. Yet despite large-scale digitization efforts, this structured information usually remains locked in scanned page images. The Text+ cooperation project “Tool Support for the Automatic Extraction of Tabular Data from Historical Newspapers” set out to change that. Using the Swinemünder Badeanzeiger (1910–1932) as a case study, we developed and evaluated an open-source pipeline for automatically detecting, extracting, and structuring tabular data from historical newspaper scans.

The Swinemünder Badeanzeiger

The Swinemünder Badeanzeiger is a historical spa gazette published between 1910 and 1932. It regularly listed names, origins, professions, and accommodations of arriving guests in tabular form.

The Digitale Bibliothek Mecklenburg-Vorpommern provides:

  • 4,227 scanned issues
  • ~6,500 tables
  • ~60 entries per table
  • approximately 350,000 individual records

From Image to Dataset: The Extraction Pipeline

The goal was not to build a new OCR engine. Instead, the project integrates existing technologies into a modular pipeline that performs four main steps:

  • Table segmentation
  • Optical Character Recognition (OCR)
  • OCR correction
  • Data structuring

Each step was evaluated separately and in combination.
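The four steps above can be sketched as a simple composition of stages. The following is a minimal illustration of that modularity; the function names and placeholder bodies are hypothetical, not the project's actual API:

```python
def segment_tables(page):
    # Step 1: in the real pipeline, a fine-tuned Detectron2 model
    # detects table regions; here we pretend the whole page is one table.
    return [page]

def run_ocr(tables):
    # Step 2: Tesseract with Fraktur models would run here.
    return [f"OCR({t})" for t in tables]

def correct_ocr(texts):
    # Step 3: LLM-based merging/correction; stand-in transformation.
    return [t.lower() for t in texts]

def structure(texts):
    # Step 4: classify text into fields; stand-in wraps raw text.
    return [{"raw": t} for t in texts]

def pipeline(page):
    """Run all four stages in order on a single page image."""
    tables = segment_tables(page)
    texts = run_ocr(tables)
    corrected = correct_ocr(texts)
    return structure(corrected)
```

Because each stage has a narrow input/output contract, any one of them can be swapped or evaluated in isolation, which is exactly how the project's evaluation proceeded.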

Table Segmentation with Detectron2

The first challenge is identifying where tables are located on a page. For this, we used Detectron2 and fine-tuned the pre-trained TableBank_152 model on annotated images of the Swinemünder Badeanzeiger. Because layouts varied significantly over the years, custom annotation was necessary.

An example of the annotated data is shown in the figures below, where green and blue boxes mark the annotated tables. The annotations were stored in JSON files, enabling their use in training. The ground-truth dataset consists of 22 images per publication year, totaling 374 annotated images.

Frontpage of Swinemünder Bade-Anzeiger no. 3, June 6, 1910
Example of annotation 1910
Page from the Swinemünder Bade-Anzeiger
Example of annotation 1915
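Stored annotations of this kind typically follow the COCO layout that Detectron2 consumes. The sketch below shows what one such record might look like; all field values (file name, dimensions, box coordinates) are invented for illustration:

```python
import json

# One annotated page; bounding boxes are [x, y, width, height] in pixels.
annotation = {
    "images": [{"id": 1,
                "file_name": "badeanzeiger_1910_003_p1.jpg",
                "width": 2480, "height": 3508}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [210, 940, 2060, 1850], "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "table"}],
}

# Round-trip through JSON, as the annotation files would be read in training.
restored = json.loads(json.dumps(annotation))
```

Registering such a JSON file as a dataset is then a one-liner in Detectron2, after which the pre-trained TableBank model can be fine-tuned on it.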

Fine-tuning dramatically improved detection performance: Average Precision (AP) increased from 0.696 before fine-tuning to 98.281 after fine-tuning (on a 0–100 scale).

  Model                  AP        AP50
  Before fine-tuning     0.696     1.664
  After fine-tuning      98.281    99.962
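Both metrics are built on the intersection-over-union (IoU) of predicted and annotated boxes; AP50, for instance, counts a detection as correct when its IoU with a ground-truth box is at least 0.5. A minimal IoU computation, with boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```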


OCR with Specialized Fraktur Models

After segmentation, the cropped tables are processed using Tesseract OCR. Because the gazette is printed in Fraktur script, three specialized models were used:

  • Fraktur
  • GT4HistOCR
  • frak2021

No single model performed best across all years. Character Error Rate (CER) varied significantly due to scan quality and layout differences.
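CER is the edit (Levenshtein) distance between the OCR output and the ground truth, divided by the length of the ground truth. A straightforward implementation of the metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ocr: str, truth: str) -> float:
    """Character Error Rate: edit distance normalized by truth length."""
    return levenshtein(ocr, truth) / max(len(truth), 1)
```

For example, misreading a single umlaut in a ten-character word yields a CER of 0.1; Fraktur-specific confusions of this kind are exactly what the specialized models are trained to avoid.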

Mean Character Error Rate across years for different OCR models.


OCR Correction with a Large Language Model

To improve OCR results, we experimented with Llama 3.1 70B Instruct (4-bit quantized). The LLM received outputs from all three OCR models and attempted to merge them into a corrected version. While correction slightly reduced error rates, the improvement was limited. Moreover, unpredictable behavior and format instability introduced new challenges.
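Such a merge step boils down to presenting all three hypotheses to the model in a fixed order. The sketch below shows how a correction prompt might be assembled; the wording is invented for illustration and does not reproduce the project's actual prompt:

```python
def build_correction_prompt(outputs: dict) -> str:
    """Combine several OCR hypotheses into one instruction for the LLM.

    `outputs` maps model name -> OCR text. Insertion order is preserved,
    which matters because input order affects correction quality.
    """
    parts = ["The following are three OCR readings of the same historical "
             "table row. Merge them into one corrected line.\n"]
    for i, (model, text) in enumerate(outputs.items(), 1):
        parts.append(f"Reading {i} ({model}): {text}")
    parts.append("Corrected line:")
    return "\n".join(parts)
```

Format instability of the kind mentioned above typically shows up here: the model may return extra commentary or reordered fields instead of the single corrected line the prompt asks for, so the response still needs defensive parsing.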

The figure below shows the mean CER for each year after correction, for different input orders. Here, 1 stands for frak2021, 2 for Fraktur, and 3 for GT4HistOCR. Different orders can lead to significant differences in performance, and no single order is optimal across all years; the order 213 (Fraktur, frak2021, GT4HistOCR) was therefore chosen for further processing.

OCR Correction for different orders evaluated on train data

The figure below compares OCR correction with the individual OCR models. The CER after correction is almost always below that of the best single model; however, in most years the gain over the best model is small.

OCR Correction for 213 evaluated on train data


Structuring the Data

The final step transforms raw OCR text into structured fields:

  • Last name
  • First name
  • Title
  • Profession
  • Residence
  • Accommodation
  • Number of persons


The LLM was again used to classify entries into these categories. When structuring manually verified ground-truth data, F1 scores ranged between 0.83 and 0.93. The figure below plots the F1 score across the years, where it fluctuates within that range. The right section shows the distribution of the yearly scores: the box indicates the interquartile range, the orange line the median, and the green triangle the mean. The average F1 score is approximately 0.88.
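To make the target schema concrete, here is a naive rule-based parse of a single register line into those seven fields. This is a toy fallback for illustration, not the LLM-based classifier used in the project, and the example line is invented:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuestRecord:
    last_name: str
    first_name: str
    title: str
    profession: str
    residence: str
    accommodation: str
    persons: int

# Toy pattern for comma-separated lines like:
# "Müller, Hans, Dr., Kaufmann, Berlin, Hotel Preussenhof, 3"
LINE = re.compile(
    r"(?P<last>[^,]+),\s*(?P<first>[^,]+),\s*(?P<title>[^,]*),\s*"
    r"(?P<prof>[^,]+),\s*(?P<res>[^,]+),\s*(?P<acc>[^,]+),\s*(?P<n>\d+)"
)

def parse_line(line: str) -> Optional[GuestRecord]:
    m = LINE.match(line)
    if not m:
        return None
    return GuestRecord(m["last"].strip(), m["first"].strip(),
                       m["title"].strip(), m["prof"].strip(),
                       m["res"].strip(), m["acc"].strip(), int(m["n"]))
```

Real register lines are far messier than this pattern allows (missing fields, OCR noise, abbreviations), which is precisely why the project delegated the classification to an LLM rather than to fixed rules.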

Structuring the ground truth evaluated on train data

In the full end-to-end evaluation (including OCR errors), the average F1 score reached approximately 0.73. The figure below shows the end-to-end result on the training data: the F1 score varies between 0.62 and 0.82 over the years, with an average of approximately 0.73.
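A field-level F1 of this kind can be computed by comparing predicted and ground-truth field values across aligned records. The following is a minimal micro-averaged version, shown to illustrate the metric rather than to reproduce the project's evaluation code:

```python
def field_f1(pred, truth):
    """Micro-averaged F1 over field values of position-aligned records."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        for field, want in t.items():
            got = p.get(field)
            if got == want:
                tp += 1            # field predicted correctly
            else:
                if got is not None:
                    fp += 1        # wrong value predicted
                fn += 1            # correct value missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

The roughly 0.15-point drop from the ground-truth setting (≈0.88) to the end-to-end setting (≈0.73) thus directly quantifies how much OCR noise costs the structuring step.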

Structuring OCR data evaluated on train data

Results and Research Potential

In total, approximately 350,000 structured guest records were extracted. This dataset enables:

  • Social network analysis (shared accommodations)
  • Longitudinal studies of spa communities
  • Research on mobility patterns
  • Analysis of social change before and after World War I

The Swinemünde spa registers include figures connected to major cultural and political developments, making the dataset relevant across disciplines.

Integration with OCR-D

Large parts of the segmentation and OCR steps can be integrated into the OCR-D framework. The structuring component currently cannot be fully implemented within OCR-D.

Conclusion

The project demonstrates that automated extraction of tabular data from historical newspapers is feasible with acceptable accuracy. While LLM-based correction and structuring show promise, they also reveal the limits of current large language models in deterministic data workflows.

The developed pipeline is openly available and reusable for similar historical corpora: GitHub Repository SwineBad Tool Support.

This project was realized at the University of Wismar (project runtime 01/2024 – 12/2024), with Prof. Dr.-Ing. Frank Krüger leading and Dr.-Ing. Steffen Steiner involved. Within Text+, the project is assigned to the task area Infrastructure/Operations.

Featured image: ETH-Bibliothek Zürich, Bildarchiv / Photographer: Metzger, Jack / Com_L16-0320-0002-0002 / CC BY-SA 4.0, http://doi.org/10.3932/ethz-a-000969258.


OpenEdition suggests citing this post as follows:
Steffen Steiner, Frank Krüger (26 March 2026). Text+Plus, #09: SwineBad Tool Support: Building an Extraction Pipeline for Historical Tables. Text+ Blog. Retrieved 2 April 2026 from https://doi.org/10.58079/15yca

