As examples for your own experimentation, we have included several sample documents (about 100 pages each) that can be used with both the profiler and PoCoTo.
Many thanks to Kay Würzner (Grenzboten), Federico Boschetti (Zonaras) and Jasmin Chebib and Haide Friedrich-Salgado (Hobbes) for providing us with ground truth.
Each book project folder contains:
- preprocessed page images (binarized tifs)
- ground truth in gt (incomplete for Hobbes and Zonaras, none for CSEL and Swete)
- OCR output from ABBYY, Tesseract and OCRopus
The script nfc.sh is a simple bash script that lets you normalize all UTF-8 text files to Unicode Normalization Form C (NFC, "normalization form composed"). Be aware that the different Unicode normalization forms behave differently under text transformations, and that text comparisons (e.g. OCR-corrected text against ground truth) are meaningful only if both texts use the same normalization form.
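Why normalization matters for comparisons can be seen in a small Python sketch. This is not the nfc.sh script itself, only an illustration using the standard-library `unicodedata` module; the helper name `nfc` and the sample strings are assumptions made for this example.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return text in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", text)


# The same visible word in two encodings (illustrative sample data):
composed = "W\u00fcrzner"     # "ü" as one precomposed code point
decomposed = "Wu\u0308rzner"  # "u" followed by a combining diaeresis

# A naive comparison treats the two strings as different ...
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 7 8

# ... while comparing after normalizing both sides gives the expected result.
print(nfc(composed) == nfc(decomposed))  # True
```

The same idea applies when comparing OCR output with ground truth: normalize both texts to the same form before computing any differences, otherwise visually identical characters may be counted as errors.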
- Goethe: Die Wahlverwandtschaften vol. 1, 1809
    - Text in Deutsches Textarchiv
- Die Grenzboten, 1841
    - Text in DTAQ (registration required)
- Hobbes: Leviathan, Latin edition, 1668
- Corpus scriptorum ecclesiasticorum latinorum (CSEL) vol. 4, 1875
- Zonaras: Epitome historiarum vol. 3, 1870
- Swete Septuaginta: Old Testament in Greek, vol. 1, 1901