The HistCorp Collection of Historical Corpora and Resources

Eva Pettersson

2018

Abstract

We present the HistCorp collection, a freely available open platform for distributing a wide range of historical corpora, together with other resources and tools for researchers and scholars who study historical texts. The platform contains a monitoring corpus of historical texts from various time periods and genres for 14 European languages. The texts are taken from well-documented historical corpora and distributed in a uniform, standardised format. They are downloadable as plain text and in a tokenised format. Furthermore, a subset of the corpus contains information on the modern spelling variant, and some of the texts are also annotated with part-of-speech tags and syntactic structure. In addition, preconfigured n-gram language models and spelling normalisation tools are provided to support the study of historical languages.

Paul Bennett, Martin Durrell

2011

This paper describes an annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants. The corpus is the first resource of its kind for this variant of German, and represents an ideal test bed for evaluating and adapting existing NLP tools on historical data. We describe the corpus format, annotation levels, and challenges, providing an example of the requirements and needs of smaller humanities-based corpus projects.
