2023, TAPA
This paper argues that machine learning (ML) has a role to play in the future of philology, understood here as a discipline concerned with preserving and elucidating the global archive of premodern texts. We offer one initial case study in order to outline broader possibilities for the field. The argument is in four parts. First, we offer a brief introduction to the history of classical philology, focusing on the development of three technologies: writing, printing, and digitizing. We evaluate their impact and emphasize some elements of continuity in philological practice. Second, we describe Logion, an ML model we are currently developing to support various philological tasks, such as making conjectures to fill lacunae, identifying scribal errors, and proposing emendations. In part three, we present some of the results achieved to date in editing the work of the Byzantine author Michael Psellos. Finally, we build on the specific study presented (part three), as well as our more general considerations on philology (part one) and ML (part two), in order to shed light on current challenges and future opportunities for the global archive of premodern texts.
Computational Linguistics 49.3, 2023
Co-authored with Thea Sommerschield, Yannis Assael, and Ioannis Pavlopoulos (lead authors), Vanessa Stefanak, Andrew Senior, Chris Dyer, Jonathan Prag, Ion Androutsopoulos, and Nando de Freitas.
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the field of humanities, similarly to how microscopes and telescopes have contributed to the realm of science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, highlighting promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.
arXiv (Cornell University), 2023
This paper presents machine-learning methods to address various problems in Greek philology. After training a BERT model on the largest premodern Greek dataset used for this purpose to date, we identify and correct previously undetected errors made by scribes in the process of textual transmission, in what is, to our knowledge, the first successful identification of such errors via machine learning. Additionally, we demonstrate the model's capacity to fill gaps caused by material deterioration of premodern manuscripts and compare the model's performance to that of a domain expert. We find that best performance is achieved when the domain expert is provided with model suggestions for inspiration. With such human-computer collaborations in mind, we explore the model's interpretability and find that certain attention heads appear to encode select grammatical features of premodern Greek.
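The lacuna-filling task this abstract describes can be framed as masked-token prediction. The sketch below is a schematic of that framing only: a toy unigram scorer stands in for the trained BERT model, and the candidate words and counts are hypothetical examples, not taken from the paper.

```python
# Schematic of gap-filling as masked-token ranking. In the actual approach, a
# BERT masked-LM head scores candidates in context; here a toy unigram scorer
# (hypothetical counts) stands in for the trained model.
from math import log

# Hypothetical word frequencies standing in for a trained language model.
TOY_COUNTS = {"λόγος": 50, "θεός": 30, "ἄνθρωπος": 20}

def score_candidate(candidate, counts=TOY_COUNTS):
    """Stand-in for the model's masked-LM score of a candidate restoration."""
    total = sum(counts.values())
    return log(counts.get(candidate, 1) / total)

def rank_restorations(text_with_gap, candidates, gap_marker="[...]"):
    """Rank candidate words for a single-word lacuna, best first."""
    left, right = text_with_gap.split(gap_marker)
    scored = [(c, score_candidate(c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_restorations("ἐν ἀρχῇ ἦν ὁ [...]", ["θεός", "λόγος", "ἄνθρωπος"])
print(ranked[0][0])  # prints the highest-scoring candidate: λόγος
```

In the human-computer workflow the abstract reports, such a ranked list would be shown to the domain expert as suggestions rather than applied automatically.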
De Gruyter eBooks, 2019
Cataloging and Citing Greek and Latin Authors and Works illustrates not only how Classicists have built upon larger standards and data models such as the Functional Requirements for Bibliographic Records (FRBR, allowing us to represent different versions of a text) and the Text Encoding Initiative (TEI) Guidelines for XML encoding of source texts (representing the logical structure of sources) but also highlights some major contributions from Classics. Alison Babeu, Digital Librarian at Perseus, describes a new form of catalog for Greek and Latin works that exploits the FRBR data model to represent the many versions of our sources, including translations. Christopher Blackwell and Neel Smith built on FRBR to develop the Canonical Text Services (CTS) data model as part of the CITE Architecture. CTS provides an explicit framework within which we can address any substring in any version of a text, allowing us to create annotations that can be maintained for years and even for generations. This addresses, at least within the limited space of textual data, a problem that has plagued hypertext systems since the 1970s and that still afflicts the World Wide Web. Those who read these papers years from now will surely find that many of the URLs in the citations no longer function, but all of the CTS citations should be usable, whether we remain with this data model or replace it with something more expressive. Computer Scientists Jochen Tiepmar and Gerhard Heyer show how they were able to develop a CTS server that could scale to more than a billion words, thus establishing the practical nature of the CTS protocol. If there were a Nobel Prize for Classics, my nominations would go to Blackwell and Smith for CITE/CTS and to Bruce Robertson, whose paper on Optical Character Recognition opens the section on Data Entry, Collection, and Analysis for Classical Philology.
Robertson has worked for a decade, with funding and without, on the absolutely essential problem of converting images of print Greek into machine-readable text. In this effort, he has mastered a wide range of techniques drawn from areas such as human-computer interaction, statistical analysis, and machine learning. We can now acquire billions of words of Ancient Greek from printed sources, and not just from multiple editions of individual works (allowing us not only to trace the development of our texts over time but also to identify quotations of Greek texts in articles and books, thus allowing us to see which passages are studied by different scholarly communities at different times). He has enabled fundamental new work on Greek. Meanwhile, the papers by Tauber, Burns, and Coffee are on representing characters, on a pipeline for textual analysis of Classical languages, and on a system that detects where one text alludes to, without extensively quoting, another text. At its base, philology depends upon the editions which provide information about our source texts, including variant readings, a proposed reconstruction of the original, and the reasoning behind decisions made in analyzing the text.
2020
This first note introduces the need to flesh out a robust interdisciplinary method to analyse fragmentary manuscript corpora in general, and the Judaean Desert Scrolls and Cairo Genizah manuscripts in particular.
Journal of Data Mining and Digital Humanities, 2017
The production of digital critical editions of texts using TEI is now a widely-adopted procedure within digital humanities. The work described in this paper extends this approach to the publication of gnomologia (anthologies of wise sayings), which formed a widespread literary genre in many cultures of the medieval Mediterranean. These texts are challenging because they were rarely copied straightforwardly; rather, sayings were selected, reorganised, modified or re-attributed between manuscripts, resulting in a highly interconnected corpus for which a standard approach to digital publication is insufficient. Focusing on Greek and Arabic collections, we address this challenge using semantic web techniques to create an ecosystem of texts, relationships and annotations, and consider a new model (organic, collaborative, interconnected, and open-ended) of what constitutes an edition. This semantic web-based approach allows scholars to add their own materials and annotations to the network of information and to explore the conceptual networks that arise from these interconnected sayings.
Hipogrifo. Revista de literatura y cultura del Siglo de Oro, 2023
This is a translation of the article "La Inteligencia Artificial al rescate del Siglo de Oro: transcripción y modernización automática de mil trescientos impresos y manuscritos teatrales". https://www.revistahipogrifo.com/index.php/hipogrifo/article/view/1262 https://doi.org/10.13035/H.2023.11.01.08 Cuéllar, Álvaro. (2023). «La Inteligencia Artificial al rescate del Siglo de Oro. Transcripción y modernización automática de mil trescientos impresos y manuscritos teatrales», Hipogrifo. Revista de literatura y cultura del Siglo de Oro, vol. 11, núm. 1, pp. 101-115, https://doi.org/10.13035/H.2023.11.01.08. A high percentage of theatrical prints and manuscripts from the Spanish Golden Age have never been transcribed in analog or, of course, digital format. It is therefore impossible to use these documents to carry out searches of interest or to apply the valuable computational analyses (stylometry, topic modelling, sentiment analysis, etc.) developed in recent years. Thanks to Artificial Intelligence (Transkribus) and HTR (Handwritten Text Recognition) techniques, I have trained three models, already public for the research community, capable of transcribing and orthographically modernizing these documents automatically with a high degree of precision: around 97% accuracy on prints and 91% on manuscripts. Through these models I have been able to process some 1,300 theatrical plays contained in prints and manuscripts from numerous libraries, archives, and other digitized sources.
The resulting transcripts are now part of the ETSO project, of the TEXORO search engine and, in addition to being an advanced starting point for careful editing of the texts, they themselves have sufficient quality to be subjected to stylometric analysis, which is yielding authorship attributions of interest.
This collection gathers the essays by eight scholars from disparate areas of textual criticism, addressing a general main topic, that is philology and digital humanities, and dealing with old and new-Lachmannian approaches, anti-Lachmannian responses, treatments of varia lectio, stemmatology, qualitative and quantitative methods of textual inquiry, and the establishment of standards for digital scholarly editions. The investigated data sets comprise canonical ancient traditions (Paolo Monella), Byzantine scriptural Greek (Barbara Crostini), the dawn of vernacular literacy, with Old Saxon (Marina Buzzoni), the still variable poetic and narrative corpora from the 12th century onwards (Thomas Bein, Anna Cappellotto), French and German epics (Luca Cadioli, Adele Cipolla), the reassessment of neo-Lachmannian procedures to Old French vernacular traditions, as that of the Bédierian Lai de l’ombre (Paolo Trovato). Every author dealt with a given issue from his or her own field of study, searching for and testing the performance of specific digital solutions. All of them touched on and suggested answers to often quite old but still sensitive critical issues.
2020
This paper aims to present the added value of collaboration between philologists and data scientists in research on medieval digitized manuscripts. Both the great potential and the challenges of such a collaboration will be addressed. The following case study originates from research conducted in the Collaborative Research Center "Episteme in Motion. Transfer from the Ancient World to the Early Modern Period", which is located at the Freie Universität Berlin and funded by the German Research Foundation (DFG). One of the goals of this collaboration is to advance research questions in which the data basis is complex or too complex for traditional research methods. The case study presented in this paper will deal with knowledge transfer and text transmission in manuscripts of Aristotle's ancient Greek treatises on logic, the so-called Organon, and will focus on the manuscripts of his work de interpretatione (On Interpretation) and on commentaries and...
Large-scale synthetic research in ancient history is often hindered by the incompatibility of taxonomies used by different digital datasets. Using the example of enriching the Latin Inscriptions from the Roman Empire dataset (LIRE), we demonstrate that machine-learning classification models can bridge the gap between two distinct classification systems and make comparative study possible. We report on the training, testing and application of a machine learning classification model using inscription categories from the Epigraphic Database Heidelberg (EDH) to label inscriptions from the Epigraphic Database Clauss-Slaby (EDCS). The model is trained on a labeled set of records included in both sources (N = 46,171). Several different classification algorithms and parametrizations are explored. The final model is based on the Extremely Randomized Trees algorithm (ET) and employs 10,055 features, based on several attributes. The final model classifies two thirds of a test dataset with 98% accuracy and 85% of it with 95% accuracy. After model selection and evaluation, we apply the model on inscriptions covered exclusively by EDCS (N = 83,482) in an attempt to adopt one consistent system of classification for all records within the LIRE dataset.
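The label-transfer workflow described above can be sketched as follows: train on records present in both databases, then classify records covered only by EDCS. This is a schematic only; a trivial bag-of-words nearest-centroid classifier stands in for the paper's Extremely Randomized Trees model, and the example records and category names are hypothetical, not real EDH/EDCS data.

```python
# Schematic of cross-database label transfer: fit on the overlap set (records
# labeled in EDH), then predict EDH-style categories for EDCS-only records.
# A nearest-centroid bag-of-words classifier stands in for the ET model.
from collections import Counter

def featurize(text):
    """Toy bag-of-words features (the paper uses 10,055 features)."""
    return Counter(text.lower().split())

def train_centroids(records):
    """records: (inscription_text, EDH_category) pairs from the overlap set."""
    centroids = {}
    for text, label in records:
        centroids.setdefault(label, Counter()).update(featurize(text))
    return centroids

def classify(text, centroids):
    """Assign the category whose centroid shares the most tokens with the record."""
    feats = featurize(text)
    return max(centroids,
               key=lambda label: sum(min(feats[w], centroids[label][w])
                                     for w in feats))

# Hypothetical overlap set, not real database records.
overlap_set = [
    ("dis manibus sacrum", "epitaph"),
    ("hic iacet in pace", "epitaph"),
    ("imperator caesar fecit", "building/dedicatory"),
]
centroids = train_centroids(overlap_set)
print(classify("dis manibus", centroids))  # prints: epitaph
```

The key design point the abstract makes is independent of the classifier choice: a single supervised model fit on the jointly covered records lets one taxonomy be projected consistently onto the other dataset.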