2018
We present the HistCorp collection, a freely available open platform aiming at the distribution of a wide range of historical corpora and other useful resources and tools for researchers and scholars interested in the study of historical texts. The platform contains a monitoring corpus of historical texts from various time periods and genres for 14 European languages. The collection is taken from well-documented historical corpora, and distributed in a uniform, standardised format. The texts are downloadable as plaintext, and in a tokenised format. Furthermore, a subset of the corpus contains information on the modern spelling variant, and some of the texts are also annotated with part-of-speech and syntactic structure. In addition, preconfigured n-gram language models and spelling normalisation tools are provided to allow the study of historical languages.
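As a rough sketch of how a pre-trained n-gram language model can support spelling normalisation, the example below ranks candidate modern forms of a historical spelling with a character trigram model. The vocabulary, candidates, and smoothing scheme are invented for illustration; this is not HistCorp's actual model or tooling.

```python
# Minimal sketch: score candidate normalisations of a historical spelling with a
# character trigram model trained on modern word forms. All data is invented.
import math
from collections import Counter

def char_trigrams(word):
    padded = f"##{word}#"                      # pad so word boundaries are modelled
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words):
    counts = Counter()
    for w in words:
        counts.update(char_trigrams(w))
    return counts

def log_score(word, counts, alpha=1.0):
    # Additive smoothing so unseen trigrams do not zero out the score.
    total = sum(counts.values())
    vocab = len(counts) + 1
    return sum(math.log((counts[t] + alpha) / (total + alpha * vocab))
               for t in char_trigrams(word))

modern_words = ["where", "were", "there", "wherefore"]    # toy "modern" corpus
counts = train(modern_words)

candidates = ["where", "vhere", "whene"]       # hypothetical candidates for "vvhere"
print(max(candidates, key=lambda c: log_score(c, counts)))   # -> "where"
```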
2011
This paper describes an annotated gold standard sample corpus of Early Modern German containing over 50,000 tokens of text manually annotated with POS tags, lemmas, and normalised spelling variants. The corpus is the first resource of its kind for this variant of German, and represents an ideal test bed for evaluating and adapting existing NLP tools on historical data. We describe the corpus format, annotation levels, and challenges, providing an example of the requirements and needs of smaller humanities-based corpus projects.
New Methods in Historical Corpus Linguistics, 2013
Tools for historical corpus research, and a corpus of Latin. We present LatinISE, a Latin corpus for the Sketch Engine. LatinISE consists of Latin works comprising a total of 13 million words, covering the time span from the 2nd century BC to the 21st century AD. LatinISE is provided with rich metadata markup, including author, title, genre, era, date and century, as well as book, section, paragraph and line of verse. We have automatically annotated LatinISE with lemma and part-of-speech information. The annotation enables users to search the corpus by a number of criteria, ranging from lemma, part-of-speech and context to subcorpora defined chronologically or by genre. We also illustrate word sketches, one-page summaries of a word's corpus-based collocational behaviour. Our future plan is to produce word sketches for Latin words by adding richer morphological and syntactic annotation to the corpus.
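As a toy illustration of the kind of query that lemma and part-of-speech annotation plus metadata make possible (the data structure, field names, and example tokens are invented and are not the Sketch Engine's actual query interface):

```python
# Toy sketch: filter an annotated token list by lemma and a chronologically
# defined subcorpus. Invented data, not LatinISE itself.
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    lemma: str
    pos: str
    century: int        # negative values = BC

corpus = [Token("amat", "amo", "VERB", -1),
          Token("amaverunt", "amo", "VERB", 1),
          Token("puella", "puella", "NOUN", -1)]

# All attestations of the lemma "amo" up to the 1st century AD.
hits = [t for t in corpus if t.lemma == "amo" and t.century <= 1]
print([(t.form, t.century) for t in hits])
```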
ICAME Journal, 2015
Corpora of Early Modern English have been collected and released for research for a number of years. With large scale digitisation activities gathering pace in the last decade, much more historical textual data is now available for research on numerous topics including historical linguistics and conceptual history. We summarise previous research which has shown that it is necessary to map historical spelling variants to modern equivalents in order to successfully apply natural language processing and corpus linguistics methods. Manual and semiautomatic methods have been devised to support this normalisation and standardisation process. We argue that it is important to develop a linguistically meaningful rationale to achieve good results from this process. In order to do so, we propose a number of guidelines for normalising corpora and show how these guidelines have been applied in the Corpus of English Dialogues.
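A minimal sketch of the lexicon-plus-rules style of normalisation that such guidelines typically lead to; the manual mappings, rewrite rules, and word list below are invented examples, not the actual Corpus of English Dialogues guidelines.

```python
# Illustrative sketch only: normalise a historical token by (1) manual override,
# (2) checking whether it is already modern, (3) ordered rewrite rules validated
# against a modern lexicon. Rules and data are invented.
import re

MANUAL_MAPPINGS = {"bin": "been", "doe": "do"}      # hypothetical manual overrides

ORDERED_RULES = [                                    # applied in order, illustrative only
    (re.compile(r"^vn"), "un"),                      # vnto   -> unto
    (re.compile(r"vv"), "w"),                        # vvhich -> which
    (re.compile(r"ie$"), "y"),                       # happie -> happy
]

def normalise(token, modern_lexicon):
    if token in MANUAL_MAPPINGS:
        return MANUAL_MAPPINGS[token]
    if token in modern_lexicon:                      # already a modern form
        return token
    candidate = token
    for pattern, repl in ORDERED_RULES:
        candidate = pattern.sub(repl, candidate)
    # Only accept the rewritten form if it is an attested modern word.
    return candidate if candidate in modern_lexicon else token

lexicon = {"unto", "which", "happy", "been", "do"}
print([normalise(t, lexicon) for t in ["vnto", "vvhich", "happie", "bin", "kyng"]])
```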
1997
The aim of this paper is twofold. On the one hand, we intend to give an overview of what has been and is being done in so-called Corpus Linguistics as far as the English language is concerned. On the other, special attention will be paid to the possibilities of using computerised textual corpora when doing historical research. The former goal will comprise a quick overview of the history of English Corpus Linguistics (§1) and a brief account of technical features such as the systems of incorporated annotations (§2), related software (§3), and so on. Updated lists of institutions (§4), collections of corpora (§5) and completed or in-progress projects in this field (§6) will also follow. With regard to the historical dimension, which this paper also intends to cover, section 7 shows a panorama of different products consisting of electronic English texts prior to the present-day standard. More specifically, in section 8 the authors concentrate on the Helsinki Corpus of ...
Language Resources and Evaluation, 2021
The paper describes the process of building the electronic corpus of 17th- and 18th-century Polish texts, a relatively large, balanced, structurally and morphologically annotated resource of the Middle Polish language, available for searching at https://www.korba.edu.pl. The corpus consists of samples extracted from over seven hundred texts written and published between 1601 and 1772, amounting to a total of 13.5 million tokens, which makes it one of the largest historical corpora for a Slavic language.
Proceedings of Corpus Linguistics 2003, 2003
As reported by Wilson and Rayson (1993) and Rayson and Wilson (1996), the UCREL semantic analysis system (USAS) has been designed to undertake the automatic semantic analysis of present-day English (henceforth PresDE) texts. In this paper, we report on the feasibility of (re)training the USAS system to cope with English from earlier periods, specifically the Early Modern English (henceforth EmodE) period. We begin by describing how effectively the existing system tagged a training corpus prior to any modifications. The training corpus consists of newsbooks dating from December 1653 to May 1654, and totals approximately 613,000 words. We then document the various adaptations that we made to the system in an attempt to improve its efficiency, and the results we achieved when we applied the modified system to two newsbook texts and an additional text from the Lampeter Corpus (i.e. a text that was not part of the original training corpus). To conclude, we propose a design for a modified semantic tagger for EmodE texts that contains an 'intelligent' spelling regulariser, that is, a system designed to regularise spellings in their 'correct' context.
International Journal of English Studies, 2011
Historical corpora offer many potentialities for linguistic research. Thus, the present article provides an overview of the major English historical corpora compiled or being compiled both in Spain and abroad. They include different types such as tagged and parsed corpora, and their main features will be outlined. As for the organisation of the article, after the introductory section, the historical corpora created abroad will be presented. Then, those being constructed in Spain (Coruña, Las Palmas, Málaga, Salamanca, Santiago and Sevilla) will be discussed. Some final remarks and the references close the article.
Bilingual parallel corpora are increasingly recognised as solid bases for contrastive linguistics, both from a synchronic and a diachronic perspective. The Historical Luxembourgish Bilingual Database of Public Notices is a diachronic single-genre corpus, comprising French-German parallel texts from the years 1795 to 1920. This paper gives an overview of the text corpus, specifying the features of the genre 'public notices', and explaining the criteria for text selection. Building on that, the paper details the compilation and presentation of the text and image data stored in the corpus. Finally, we describe the technical tools for indexing, searching and managing the text and image data.
Journal of Universal Computer Science, 2012
Medieval manuscripts and other written documents from that period contain valuable information about the people, religion, and politics of the medieval period, making the study of medieval documents a necessary prerequisite to gaining in-depth knowledge of medieval history. Although tool-less study of such documents is possible and has been ongoing for centuries, much subtle information remains locked in such manuscripts unless it is revealed by effective means of computational analysis. Automatic analysis of medieval manuscripts is a non-trivial task, mainly due to non-conforming styles, spelling peculiarities, and the lack of relational structures (hyperlinks) that could be used to answer meaningful queries. Natural Language Processing (NLP) tools and algorithms are used to carry out computational analysis of text data. However, due to the high percentage of spelling variation in medieval manuscripts, NLP tools and algorithms cannot be applied directly for computational analysis. If the spelling variants are mapped to standard dictionary words, then the application of standard NLP tools and algorithms becomes possible. In this paper we describe a web-based software tool, CAMM (Computational Analysis of Medieval Manuscripts), that maps medieval spelling variants to a modern German dictionary. We describe the steps taken to acquire, reformat, and analyze the data and produce putative mappings, as well as the steps taken to evaluate the findings. At the time of writing, CAMM provides access to 11,275 manuscripts organized into 54 collections containing a total of 242,446 distinctly spelled words. CAMM accurately corrects the spelling of 55% of the verifiable words. CAMM is freely available at
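As a rough sketch of the general idea (not CAMM's actual mapping algorithm), a historical spelling can be mapped to its closest modern dictionary entry by string similarity. The dictionary entries and historical forms below are invented for illustration.

```python
# Sketch: map medieval spelling variants to the nearest modern German dictionary
# form using standard-library string similarity. Invented data, illustrative only.
import difflib

modern_dictionary = ["und", "kommen", "stadt", "herr", "jahr"]

def map_to_modern(historical_form, dictionary, cutoff=0.6):
    matches = difflib.get_close_matches(historical_form, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None       # None = no confident mapping found

for form in ["vnd", "chomen", "stat"]:
    print(form, "->", map_to_modern(form, modern_dictionary))
```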
2012
Language technology tools can be very useful for making information concealed in historical documents more easily accessible to historians, linguists and other researchers in the humanities. For many languages, there is, however, a lack of linguistically annotated historical data that could be used for training NLP tools adapted to historical text. One way of avoiding the data sparseness problem in this context is to normalise the input text to a more modern spelling, before applying NLP tools trained on contemporary corpora. In this paper, we explore the impact of a set of hand-crafted normalisation rules on Swedish texts ranging from 1527 to 1812. Normalisation accuracy as well as tagging and parsing performance are evaluated. We show that, even though the rules were generated on the basis of one 17th-century text sample, they are applicable to all texts, regardless of time period and text genre. This clearly indicates that spelling normalisation is a useful strategy for applying contemporary NLP tools to historical text.
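The sketch below illustrates what such a normalise-then-process pipeline can look like: ordered string-replacement rules rewrite historical Swedish spellings towards modern orthography before the text is handed to a tagger or parser trained on contemporary Swedish. The rules and example sentence are invented, not the paper's actual rule set.

```python
# A minimal sketch, assuming invented rules: normalise historical Swedish
# spelling before applying contemporary NLP tools.
import re

RULES = [                        # illustrative rules only, applied in order
    (re.compile(r"hw"), "v"),    # hwad -> vad
    (re.compile(r"dh"), "d"),    # dhen -> den
    (re.compile(r"w"), "v"),     # swar -> svar
    (re.compile(r"ff\b"), "v"),  # aff  -> av
]

def normalise_token(token):
    for pattern, repl in RULES:
        token = pattern.sub(repl, token)
    return token

def normalise_text(text):
    return " ".join(normalise_token(t) for t in text.split())

historical = "hwad dhen mannen aff swar"       # invented example sentence
print(normalise_text(historical))              # -> "vad den mannen av svar"
# The normalised string would then be passed to a contemporary POS tagger/parser.
```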
Corpus linguistics has revolutionised our way of working in historical linguistics. The painstaking job of collecting data and manually analysing them has been made less arduous with the introduction of the machine processing of corpora, which allows for quick and efficient searches. The aim of the present study is two-fold: to show how corpus linguistics has contributed to the ways in which researchers approach the study of the history of English, and to provide an overview of selected corpora available in the field. Setting aside the theoretical debate as to whether corpus linguistics should be considered merely a methodology, a branch of linguistics, or both (Taylor, 2008), it is widely acknowledged that corpus linguistics is of considerable help in any branch of linguistics, be it theoretical or applied. The use of corpora makes it possible to test hypotheses established within a specific linguistic area through the fast and reliable analysis of vast pools of material. As a result, the objective measurement of data is available to scholars, who can thus verify their hypotheses and intuitions, and can quickly amend or qualify their research claims if previous ones are shown to be false. There is, then, a continuous interaction between theory, as expressed in linguistic postulates, concepts and hypotheses, and the application and validation of these theoretical principles through the use of linguistic corpus analysis. The use of corpora is perhaps a more powerful instrument in the field of historical linguistics than in other fields, since the absence of living informants makes judgements based on intuition unreliable, and claims have to be empirically attested using data. These data can be extracted from systematically compiled collections of machine-readable texts, called corpora. However, in considering these undeniably advantageous working tools, some caveats should be borne in mind, as will be discussed in what follows.
2016
Historical text constitutes a rich source of information for historians and other researchers in humanities. Many texts are however not available in an electronic format, and even if they are, ther ...
Challenging the Myth of Monolingual Corpora, 2017
Digital Scholarship in the Humanities, 2014
We describe, evaluate, and improve the automatic annotation of diachronic corpora at the levels of word-class, lemma, chunks, and dependency syntax. As corpora we use the ARCHER corpus (texts from 1600 to 2000) and the ZEN corpus (texts from 1660 to 1800). Performance on Modern English is considerably lower than on Present Day English (PDE). We present several methods that improve performance. First we use the spelling normalization tool VARD to map spelling variants to their PDE equivalents, which improves tagging. We investigate the tagging changes that are due to the normalization and observe improvements, deteriorations, and missing mappings. We then implement an optimized version, using VARD rules and preprocessing steps to improve normalization. We evaluate the improvement on parsing performance, comparing original text, standard VARD, and our optimized version. Over 90% of the normalization changes lead to improved parsing, and 17.3% of all 422 manually annotated sentences get a net improved parse. As a next step, we adapt the parser's grammar and add a semantic expectation model and a model for prepositional phrase (PP) attachment interaction to the parser. These extensions improve parser performance, marginally on PDE and more considerably on earlier texts (2-5% on PP-attachment relations, e.g. from 63.6% to 68.4% and from 70% to 72.9% on 17th-century texts). Finally, we briefly outline linguistic applications and give two examples, gerundials and auxiliary verbs in the ZEN corpus, showing that despite high noise levels linguistic signals clearly emerge, opening new possibilities for large-scale research of gradient phenomena in language change.
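The following toy sketch shows the kind of bookkeeping behind a "net improved parse" figure: for each manually checked sentence, count parse relations that improved or deteriorated after normalisation and report the share with a net gain. The judgement data is invented; it is not the paper's evaluation set.

```python
# Toy sketch of summarising per-sentence parse judgements after normalisation.
from dataclasses import dataclass

@dataclass
class SentenceJudgement:
    improved: int        # parse relations that got better after normalisation
    deteriorated: int    # parse relations that got worse

judgements = [SentenceJudgement(2, 0), SentenceJudgement(0, 0),
              SentenceJudgement(1, 1), SentenceJudgement(3, 1)]

net_improved = sum(1 for j in judgements if j.improved > j.deteriorated)
print(f"{net_improved}/{len(judgements)} sentences "
      f"({100 * net_improved / len(judgements):.1f}%) show a net improved parse")
```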
2015
The focus of this keynote is the larger picture of how the different disciplines of classics and the archaeologies change when they are taken from their analogue past onto the digital level, and what that change looks like when the focus is on the often misunderstood role of computational linguists inside the data ecosystem of classics and the archaeologies.
Literary and Linguistic Computing, 2007
In this paper, we describe the approaches taken by two teams of researchers to the identification of spelling variants. Each team is working on a different language (English and German), but both are using historical texts from much the same time period (17th-19th century). The approaches differ in a number of other respects; for example, we can draw a distinction between two types of context rules: in the German system, context rules operate at the level of individual letters and represent constraints on candidate letter replacements or n-graphs; in the English system, contextual rules operate at the level of words and provide clues to detect real-word spelling variants, i.e. 'then' used instead of 'than'. However, we noticed an overlap between the types of issues that we need to address for both English and German, and also a similarity between the letter replacement patterns found in the two languages.
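The sketch below contrasts the two rule types in miniature: letter-level replacements generate candidate modern forms that are filtered against a lexicon, while a word-level rule uses surrounding words to catch a real-word variant. The replacement table, lexicon, and context words are invented; they are not the actual rules of either system.

```python
# Illustrative sketch of letter-level vs word-level variant detection rules.
from itertools import product

# Letter-level rules (German-style): each historical letter or n-graph maps to
# candidate modern replacements. Invented table.
LETTER_REPLACEMENTS = {"th": ["th", "t"], "y": ["y", "i"], "v": ["v", "u"]}

def candidate_modern_forms(historical, lexicon):
    segments, i = [], 0
    while i < len(historical):
        if historical[i:i + 2] in LETTER_REPLACEMENTS:
            segments.append(LETTER_REPLACEMENTS[historical[i:i + 2]])
            i += 2
        elif historical[i] in LETTER_REPLACEMENTS:
            segments.append(LETTER_REPLACEMENTS[historical[i]])
            i += 1
        else:
            segments.append([historical[i]])
            i += 1
    # Keep only combinations that are attested modern words.
    return {"".join(p) for p in product(*segments)} & lexicon

# Word-level rule (English-style): a real-word variant such as 'then' for 'than'
# is only detectable from the surrounding words (invented toy rule).
def realword_variant(previous_word, word):
    if word == "then" and previous_word in {"better", "worse", "more", "less"}:
        return "than"
    return word

print(candidate_modern_forms("tyme", {"time", "tide", "team"}))   # -> {'time'}
print(realword_variant("better", "then"))                          # -> 'than'
```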
Language Resources and Evaluation, 2013
The impact-es diachronic corpus of historical Spanish compiles over one hundred books (containing approximately 8 million words) in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an open license (Creative Commons by-nc-sa) in order to permit their intensive exploitation in linguistic research. Approximately 7% of the words in the corpus (a selection aimed at enhancing the coverage of the most frequent word forms) have been annotated with their lemma, part of speech, and modern equivalent. This paper describes the annotation criteria followed and the standards, based on the Text Encoding Initiative recommendations, used to represent the texts in digital form. As an illustration of the possible synergies between diachronic textual resources and linguistic research, we describe the application of statistical machine translation techniques to infer probabilistic context-sensitive rules for the automatic modernisation of spelling. Automatic modernisation with this type of statistical method leads to very low character error rates when the output is compared with the supervised modern version of the text.
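As a small sketch of how such a character error rate (CER) could be computed, the example below takes the Levenshtein distance between an automatic modernisation and the supervised modern reference and divides it by the reference length. The example strings are invented, not taken from the impact-es corpus.

```python
# Sketch: character error rate between an automatic modernisation and a
# manually modernised reference. Invented example strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

hyp = "dijo que habia benido"      # automatic modernisation (toy)
ref = "dijo que había venido"      # supervised modern version (toy)
print(f"CER = {cer(hyp, ref):.3f}")
```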
The aim of this paper is to offer a description of the Corpus of Historical English Texts (CHET), one of the several sub-corpora within the Coruña Corpus of English Scientific Writing (CC). The compilation principles behind it as well as the sociolinguistic variables considered in the process of text selection will be explained.