Digital Humanities Workshop
The Iliad and the Odyssey are products of a collective effort involving numerous authors, each contributing unknown portions of text, and it still cannot be determined whether a single individual (or a distinct group of poets) contributed larger chunks of such additional verses, or even whole Books. In this work, we employed character-level statistical language modeling to analyse computationally the authorship of the Homeric text and to study the linguistic proximity and divergence between the books of the Iliad and the Odyssey. We show that some pairs of books are much closer than others and that some books are linguistically far from the rest. Furthermore, we investigated the linguistic association between the Homeric poems and four Homeric hymns, showing that "To Aphrodite" is linguistically close to, and "To Hermes" linguistically far from, both the Iliad and the Odyssey. In a final experiment, we show that statistical language models can classify excerpts as Iliad or Odyssey about as well as the average human expert. • Applied computing → Digital libraries and archives; • Computing methodologies → Information extraction; Language resources.
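As a rough illustration of how a character-level statistical language model can measure linguistic proximity, the sketch below trains a character n-gram model on one text and scores another text in bits per character (lower means closer). The abstract does not state the model order or smoothing scheme, so the trigram order, the add-one smoothing, and the function names `train_char_model` and `bits_per_char` are all assumptions made for this example, not the authors' setup.

```python
import math
from collections import Counter


def train_char_model(text, order=3):
    """Count character n-grams and their contexts (assumed order: 3)."""
    padded = " " * (order - 1) + text
    positions = range(len(padded) - order + 1)
    ngrams = Counter(padded[i:i + order] for i in positions)
    contexts = Counter(padded[i:i + order - 1] for i in positions)
    return {"order": order, "ngrams": ngrams,
            "contexts": contexts, "vocab": set(text)}


def bits_per_char(model, text):
    """Average negative log2 probability per character of a non-empty text.

    Uses add-one smoothing (an assumption for this sketch); one extra
    vocabulary slot is reserved for characters never seen in training.
    """
    order = model["order"]
    v = len(model["vocab"]) + 1
    padded = " " * (order - 1) + text
    total = 0.0
    for i in range(len(text)):
        gram = padded[i:i + order]
        ctx = gram[:-1]
        p = (model["ngrams"][gram] + 1) / (model["contexts"][ctx] + v)
        total -= math.log2(p)
    return total / len(text)


# Toy usage: a model trained on one "book" scores two excerpts.
model = train_char_model("ab" * 20)
close_score = bits_per_char(model, "abababab")
far_score = bits_per_char(model, "bbaabbaa")
```

In this toy setting the excerpt that follows the training pattern receives the lower score; comparing such scores across models trained on different books is one simple way to build a proximity table.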
International Journal of Digital Humanities
Natural language modeling is used to predict or generate the next word or character of modern languages. Furthermore, statistical character-based language models have been found useful in authorship attribution analyses by studying the linguistic proximity of excerpts unknown to the model. In prior work, we modeled Homeric language and provided empirical findings regarding the authorship nature of the 48 Iliad and Odyssey books. Following this line of work, and considering the current philological views and trends, we break down the two poems further into smaller portions. By employing language modeling we identify outlying passages, indicating reduced linguistic affinity with the main body of the two works and, by extension, potentially different authorship. Our results show that some of the passages isolated as outliers by the language models were also identified as such by human researchers. We further test our methodology and models on texts of similar language and genre created...
This is an informal handout for a very informal talk presented at the UCLA PIES graduate seminar in Winter 2017. It explores applying quantitative methods of authorship analysis to the Homeric Question. In particular, we employ k-means and hierarchical clustering techniques to discover likely groupings of the traditional book divisions of the Iliad and Odyssey, employing feature sets based on word bigrams and character trigrams. While preliminary, we suggest that results point to a multi-event model for the textualization of the Homeric poems.
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024). Association for Computational Linguistics., 2024
Past research has statistically modelled the language of the Homeric poems, assessing the degree of surprisal for each verse through diverse metrics and resulting in the HoLM resource. In this study we utilise the HoLM resource to explore cross-poem affinity at the verse level, looking at Iliadic verses and passages that are less surprising to the Odyssean model than to the Iliadic one, and vice versa. Using the same tool, we investigate verses that evoke greater surprise when assessed by a local model trained solely on their source book than by a global model trained on the entire source poem. Looking more closely at the distribution of such verses across the Homeric poems, we employ machine learning text classification to further analyse cross-poem affinity quantitatively in selected books.
Human IT: Journal for Information Technology Studies as a Human Science, 2018
This project addresses a two-millennium-old mystery surrounding the authorship of ancient Latin war memoirs attributed to Caesar, using Distributional Semantics, a modern computational method for detecting patterns in written text. The Civil War has been confirmed to be Caesar’s work, as have the first seven of the eight chapters of the Gallic War, the eighth being by Hirtius. The authorship of the African, Alexandrine, and Spanish Wars, though attributed to Caesar, is still under debate. Methods of distributional semantics derive representations of words from their distribution across a large amount of text, such that words that occur in similar contexts have similar representations. These representations can then be combined to model larger units of text, such as chapters and whole books. SemanticVectors software was used to calculate the similarity between chapters or books after dimension reduction using Random Indexing. The results show that the Gallic War’s eighth chapter is signif...
2007
Large real-world data sets have been investigated in the context of Authorship Attribution of real-world documents. N-gram measures can be used to accurately assign authorship for long documents such as novels. A number of 5 (authors) × 5 (movies) arrays of movie reviews were acquired from the Internet Movie Database. Both n-gram and naive Bayes classifiers were used to classify along both the authorship and topic (movie) axes. Both approaches yielded similar results, and authorship was detected as accurately as, or more accurately than, topic. Part-of-speech tagging and function-word lists were used to investigate the influence of structure on classification tasks on documents with meaning removed but grammatical structure intact.
Language Resources and Evaluation, 2001
The most important approaches to computer-assisted authorship attribution are exclusively based on lexical measures that either represent the vocabulary richness of the author or simply comprise frequencies of occurrence of common words. In this paper we present a fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Instead we adapt a set of style markers to the analysis of the text performed by an already existing natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The latter represent the way in which the text has been analyzed. The presented experiments on a Modern Greek newspaper corpus show that the proposed set of style markers is able to distinguish reliably the authors of a randomly-chosen group and performs better than a lexically-based approach. However, the combination of these two approaches provides the most accurate solution (i.e., 87% accuracy). Moreover, we describe experiments on various sizes of the training data as well as tests dealing with the significance of the proposed set of style markers.
Studia Metrica et Poetica, 2018
This article describes pilot experiments performed as one part of a long-term project examining the possibilities for using versification analysis to determine the authorship of poetic texts. Since we are addressing this article to both stylometry experts and experts in the study of verse, we first introduce in detail the common classifiers used in contemporary stylometry (Burrows’ Delta, Argamon’s Quadratic Delta, Smith-Aldridge’s Cosine Delta, and the Support Vector Machine) and explain how they work via graphic examples. We then provide an evaluation of these classifiers’ performance when used with the versification features found in Czech, German, Spanish, and English poetry. We conclude that versification is a reasonable stylometric marker, the strength of which is comparable to the other markers traditionally used in stylometry (such as the frequencies of the most frequent words and the frequencies of the most frequent character n-grams).
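For readers unfamiliar with Burrows' Delta, named in the abstract above, here is a minimal sketch of the classic formulation: take the most frequent features in the corpus, z-score each feature's relative frequency across documents, and report the mean absolute difference of z-scores for each document pair. The tokens could equally be words or versification labels. The function names, the use of the population standard deviation, and the tiny feature count are assumptions chosen for illustration, not details taken from the article.

```python
import statistics
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary item in one document."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]


def burrows_delta(docs, n_features=3):
    """docs maps a name to a token list; returns Delta for each pair."""
    total = Counter()
    for tokens in docs.values():
        total.update(tokens)
    vocab = [w for w, _ in total.most_common(n_features)]
    freqs = {name: relative_freqs(toks, vocab) for name, toks in docs.items()}

    # z-score every feature across the whole corpus
    z = {name: [] for name in docs}
    for j in range(len(vocab)):
        column = [freqs[name][j] for name in docs]
        mean = statistics.mean(column)
        std = statistics.pstdev(column)  # population std: an assumption here
        for name in docs:
            z[name].append((freqs[name][j] - mean) / std if std else 0.0)

    names = list(docs)
    return {
        (a, b): statistics.mean([abs(x - y) for x, y in zip(z[a], z[b])])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy usage with three very short "documents".
docs = {
    "A": "the of the and the of".split(),
    "B": "the of the and the and".split(),
    "C": "and and and of and the".split(),
}
deltas = burrows_delta(docs)
```

A smaller Delta indicates more similar usage profiles; in attribution work the candidate with the smallest Delta to the disputed text is usually taken as the most likely author.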
The aim of this study is to explore authorship attribution methods in Greek tweets. We have developed the first Modern Greek Twitter corpus (GTC), consisting of 12,973 tweets crawled from 10 popular Greek users. We used this corpus to study the effectiveness of a specific document representation called Author's Multilevel N-gram Profile (AMNP) and the impact of different methods of training data construction on the task of authorship attribution. To address these research questions, we used GTC to create 4 different datasets containing tweets merged into texts of different sizes (100, 75, 50 and 25 words). Results were evaluated using authorship attribution accuracy both in 10-fold cross-validation and on an external test set compiled from actual tweets. The AMNP representation achieved significantly better accuracies than single feature groups across all text sizes.
Applied Network Science
In this work, we analyze in detail the topology of the written language network, using co-occurrence of words to recognize authorship. The Latin texts studied here are excerpts from the Historia Augusta, a collection of biographies of Roman emperors extending from Hadrian, who started to reign in 117 CE, to Carus and his sons Numerian and Carinus, that is, up to the years 284–285 CE. According to the manuscript tradition, the biographies are attributed to six different authors. Scholarship since the late 19th century has instead been arguing for single authorship. The aim of this paper is to verify this hypothesis.
The aim of this study is to perform authorship attribution and author gender identification on a corpus of blogs written in Modern Greek. More specifically, the corpus contains 20 bloggers equally divided by gender (10 males and 10 females), with 50 blog posts from each author (1,000 posts in total and an overall size of 406,460 words). From this corpus we calculated a number of standard stylometric variables (e.g. word length statistics and various vocabulary "richness" indices) and the 300 most frequent word and character n-grams (character and word unigrams, bigrams, trigrams). Support Vector Machines (SVM) were trained on this data, and the accuracy of author gender prediction in a 10-fold cross-validation experiment reached 82.6%, a result that is comparable to current state-of-the-art author profiling systems. Authorship attribution accuracy reached 85.4%, an equally satisfying result given the large number of candidate authors (n=20).
Ciceroniana on line, 2019
«Ciceroniana on line» III, 1, 2019, 15-48
RAIJA VAINIO, REIMA VÄLIMÄKI, ALEKSI VESANTO, ANNI HELLA, FILIP GINTER, MARJO KAARTINEN, TEEMU IMMONEN [1]
RECONSIDERING AUTHORSHIP IN THE CICERONIAN CORPUS THROUGH COMPUTATIONAL AUTHORSHIP ATTRIBUTION
The authorship of some texts related to Cicero or traditionally attributed to him has puzzled scholars for centuries. The most famous of these texts is Rhetorica ad C. Herennium, whose removal from the Ciceronian corpus was proposed as early as the fifteenth century. The other two (minor) texts are Commentariolum petitionis, usually attributed to Marcus Cicero's younger brother Quintus, and most recently De optimo genere oratorum. Sir Ronald Syme stated on the authenticity of old texts: «In every age the principal criteria of authenticity are the stylistic and the historical. They do not always bring certainty, for we do not know enough about either style or history. If a different approach can be devised, or a subsidiary method, so much the better». In recent years, digital methods have offered promising results for the reattribution of classical texts. M. Kestemont, J.A. Stover and others have worked with some ancient Latin texts, but although a computational analysis by R. Forsyth, D. Holmes, and E. Tse confirmed the consensus that Consolatio Ciceronis is indeed a sixteenth-century forgery, until now these methods have had only a limited impact on the Ciceronian corpus itself. We attempt to take on this task with today's highly advanced computational methods and the use of high performance computing (CSC supercomputer, Kajaani, Finland). After a brief account on...
[1] The main responsibility of the historical background and interpreting the results (chapters 1-3 and 5) lies on Raija Vainio, and that of the methods and respective parts in the results (ch. 4) on Aleksi Vesanto and Filip Ginter. All the text has been commented upon, revised and supplemented by Reima Välimäki, Anni Hella, Marjo Kaartinen and Teemu Immonen. Reima Välimäki was responsible for making the computational methods understandable for humanists.
Open Linguistics, 2016
We are investigating methods by which data from dependency syntax treebanks of ancient Greek can be applied to questions of authorship in ancient Greek historiography. Syntax words (sWords) were constructed from the Ancient Greek Dependency Treebank by tracing the shortest path from each leaf node to the root for each sentence tree. This paper presents the results of a preliminary test of the usefulness of the sWord as a stylometric discriminator. The sWord data was subjected to clustering analysis. The resultant groupings were in accord with traditional classifications. The use of sWords also allows a more fine-grained heuristic exploration of difficult questions of text reuse. A comparison of relative frequencies of sWords in the directly transmitted Polybius book 1 and the excerpted books 9–10 indicates that the measurements of the two texts are generally very close, but when frequencies do vary, the differences are surprisingly large. These differences reveal that a certain synta...
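The leaf-to-root construction described above can be sketched as follows. The abstract does not say which information is recorded along each path, so this example assumes an sWord is the sequence of dependency relation labels joined by hyphens. The `(token_id, head_id, relation)` tuple layout, the `swords` function name, and the sample labels are illustrative assumptions, and the input is assumed to be a well-formed tree.

```python
def swords(sentence):
    """Build one sWord per leaf of a dependency tree.

    sentence: list of (token_id, head_id, relation); head_id 0 marks the
    root's parent. Assumes a well-formed tree with no cycles.
    """
    head = {tid: h for tid, h, _ in sentence}
    rel = {tid: r for tid, _, r in sentence}
    # A leaf is any token that never appears as another token's head.
    heads_used = set(head.values())
    leaves = [tid for tid in head if tid not in heads_used]

    result = []
    for leaf in leaves:
        path, node = [], leaf
        while node != 0:  # climb until we pass the root
            path.append(rel[node])
            node = head[node]
        result.append("-".join(path))
    return result


# Toy sentence "the poet sings wrath" with illustrative relation labels.
example = [
    (1, 2, "ATR"),   # the   -> poet
    (2, 3, "SBJ"),   # poet  -> sings
    (3, 0, "PRED"),  # sings is the root
    (4, 3, "OBJ"),   # wrath -> sings
]
paths = swords(example)
```

Counting how often each resulting string occurs per text gives a frequency table that can be fed to a clustering method of the kind the abstract mentions.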
Journal on Computing and Cultural Heritage
We present and make available MedLatinEpi and MedLatinLit , two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatinEpi and MedLatinLit consist of 294 and 30 curated texts, respectively, labelled by author; MedLatinEpi texts are of epistolary nature, while MedLatinLit texts consist of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification. Along with the datasets, we provide experimental results, obtained on these datasets, for the authorship verification task, i.e., the task of predicting whether a text of unknown authorship was written by a candidate author. We also make available the source code of the authorship verification system we have used, thus allowing our experiments to be reproduced, and to be used as baselines, by other researchers. We also...
Όψεις της Σωματοκειμενικής Γλωσσολογίας [Aspects of Corpus Linguistics], 2018
Research on author identification is directly linked to the linguistic idiosyncrasies of each individual as reflected in their written output. Together these form a unique authorial profile which, on the basis of quantitative analysis of a wide range of stylometric variables, distinguishes that individual from every other possible author of a text. This article presents the results of an experimental study based on the hypothesis that variable-length multi-word sequences can serve as a reliable method for selecting characteristic markers for correct authorship attribution in blog texts by five different authors.
ArXiv, 2020
We present and make available MedLatin1 and MedLatin2, two datasets of medieval Latin texts to be used in research on computational authorship analysis. MedLatin1 and MedLatin2 consist of 294 and 30 curated texts, respectively, labelled by author, with MedLatin1 texts being of an epistolary nature and MedLatin2 texts consisting of literary comments and treatises about various subjects. As such, these two datasets lend themselves to supporting research in authorship analysis tasks, such as authorship attribution, authorship verification, or same-author verification.
Exact methods in the study of language and text: …, 2007
Proceedings of the Sixth Italian Conference on Computational Linguistics, 2019
If a road map had to be drawn for Computational Criticism and subsequent Artificial Literature, it would certainly have included Shakespearean plays. Demonstration of these structures through text analysis can be seen as both a naive effort and a scientific view of the characteristics of the texts. In this study, the textual analysis of Shakespeare's plays was carried out for this purpose. Methodologically, we consecutively use Latent Dirichlet Allocation (LDA) and Singular Value Decomposition (SVD) in order to extract topics and then reduce the topic distribution over documents to a two-dimensional space. The first question asks whether there is a genre called Romance between Comedy and Tragedy plays. The second question is whether, if each character's speech is taken as a text, the dramatic relationship between characters can be revealed. We find relationships between genres that are also verified by literary theory, and the main characters follow the antagonisms within the play as the length of speech increases. Although the results of the classification of the side characters in the plays are not always what one would have expected based on a reading of the plays, there are observations on dramatic fiction that are also verified by literary theory. Tragedies and revenge dramas have different character groupings.
PLoS ONE, 2014
In this paper we analyse the word frequency profiles of a set of works from the Shakespearean era to uncover patterns of relationship between them, highlighting the connections within authorial canons. We used a text corpus comprising 256 plays and poems from the 16th and 17th centuries, with 17 works of uncertain authorship. Our clustering approach is based on the Jensen-Shannon divergence and a graph partitioning algorithm, and our results show that authors' characteristic styles are very powerful factors in explaining the variation of word use, frequently transcending cross-cutting factors like the differences between tragedy and comedy, early and late works, and plays and poems. Our method also provides an empirical guide to the authorship of plays and poems where this is unknown or disputed.
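To make the divergence measure named in the abstract above concrete, here is a minimal sketch of the Jensen-Shannon divergence between two word frequency profiles. It covers only the pairwise distance, not the graph partitioning step, and the whitespace tokenisation and the function names `word_profile` and `jensen_shannon` are assumptions for illustration rather than the authors' pipeline.

```python
import math
from collections import Counter


def word_profile(text):
    """Relative word frequencies of a non-empty text (whitespace tokens)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl(p, m):
    """Kullback-Leibler divergence in bits; m must cover every word in p."""
    return sum(pv * math.log2(pv / m[w]) for w, pv in p.items() if pv > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits, between 0 and 1 inclusive."""
    words = set(p) | set(q)
    # m is the midpoint distribution of p and q
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy usage on two short invented strings.
a = word_profile("to be or not to be")
b = word_profile("to do or not to do")
distance = jensen_shannon(a, b)
```

Because the measure is symmetric and bounded, the pairwise values for a whole corpus can be arranged in a distance matrix and passed to whatever clustering or partitioning method is preferred.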
2017
The ‘Homeric Question’ today is no longer a question about Homer as a person, but a question of the genesis and history of the early Greek epic texts. It is still unresolved. Three mutually incompatible Homer theories (Analysis, Neoanalysis and Oral Poetry Theory) compete; all three are based only on manual selections of the text, and this fact seems to be part of the problem. In this paper, we report on the development of a toolbox providing methods of computational linguistics intended to improve our capacity to examine texts in their entirety.