Computational Linguistics
44 pages
Computational Linguistics 49.3, 2023
Co-authored with Thea Sommerschield, Yannis Assael, and Ioannis Pavlopoulos (lead authors), Vanessa Stefanak, Andrew Senior, Chris Dyer, Jonathan Prag, Ion Androutsopoulos, and Nando de Freitas.
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the field of humanities, similarly to how microscopes and telescopes have contributed to the realm of science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, highlighting promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.
Association for Computational Linguistics, 2024
Edited by: John Pavlopoulos, Thea Sommerschield, Yannis Assael et al.
TAPA, 2023
This paper argues that machine learning (ML) has a role to play in the future of philology, understood here as a discipline concerned with preserving and elucidating the global archive of premodern texts. We offer one initial case study in order to outline broader possibilities for the field. The argument is in four parts. First, we offer a brief introduction to the history of classical philology, focusing on the development of three technologies: writing, printing, and digitizing. We evaluate their impact and emphasize some elements of continuity in philological practice. Second, we describe Logion, an ML model we are currently developing to support various philological tasks, such as making conjectures to fill lacunae, identifying scribal errors, and proposing emendations. In part three, we present some of the results achieved to date in editing the work of the Byzantine author Michael Psellos. Finally, we build on the specific study presented (part three), as well as our more general considerations on philology (part one) and ML (part two), in order to shed light on current challenges and future opportunities for the global archive of premodern texts.
Large-scale synthetic research in ancient history is often hindered by the incompatibility of taxonomies used by different digital datasets. Using the example of enriching the Latin Inscriptions from the Roman Empire dataset (LIRE), we demonstrate that machine-learning classification models can bridge the gap between two distinct classification systems and make comparative study possible. We report on the training, testing, and application of a machine-learning classification model using inscription categories from the Epigraphic Database Heidelberg (EDH) to label inscriptions from the Epigraphic Database Claus-Slaby (EDCS). The model is trained on a labeled set of records included in both sources (N = 46,171). Several different classification algorithms and parametrizations are explored. The final model is based on the Extremely Randomized Trees (ET) algorithm and employs 10,055 features, based on several attributes. The final model classifies two thirds of a test dataset with 98% accuracy and 85% of it with 95% accuracy. After model selection and evaluation, we apply the model to inscriptions covered exclusively by EDCS (N = 83,482) in an attempt to adopt one consistent system of classification for all records within the LIRE dataset.
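The figures "two thirds with 98% accuracy" and "85% with 95% accuracy" describe a trade-off between coverage and accuracy. The abstract does not say how those subsets were chosen; the sketch below assumes, purely for illustration, that records are ranked by the classifier's confidence in its predicted label and that accuracy is measured over the most confident fraction. The function name `accuracy_at_coverage` and the toy data are hypothetical and not taken from the paper.

```python
def accuracy_at_coverage(predictions, coverage):
    """Accuracy over the most confident `coverage` fraction of predictions.

    predictions: list of (confidence, is_correct) pairs, one per test record.
    coverage: fraction in (0, 1] of records to keep, most confident first.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Rank records from most to least confident (assumed selection rule).
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = max(1, round(len(ranked) * coverage))
    # Share of correct labels among the records that were kept.
    return sum(1 for _, ok in ranked[:kept] if ok) / kept


# Toy data invented for illustration; these are not results from the paper.
toy = [(0.99, True), (0.95, True), (0.90, True), (0.80, True),
       (0.60, False), (0.55, True)]
print(accuracy_at_coverage(toy, 2 / 3))  # accuracy on the top 4 of 6 records
print(accuracy_at_coverage(toy, 1.0))    # accuracy on all 6 records
```

Under this assumption, lowering the coverage discards the least certain predictions, which is one plausible way a model could reach higher accuracy on a smaller share of the test set.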
arXiv (Cornell University), 2023
This paper presents machine-learning methods to address various problems in Greek philology. After training a BERT model on the largest premodern Greek dataset used for this purpose to date, we identify and correct previously undetected errors made by scribes in the process of textual transmission, in what is, to our knowledge, the first successful identification of such errors via machine learning. Additionally, we demonstrate the model's capacity to fill gaps caused by material deterioration of premodern manuscripts and compare the model's performance to that of a domain expert. We find that best performance is achieved when the domain expert is provided with model suggestions for inspiration. With such human-computer collaborations in mind, we explore the model's interpretability and find that certain attention heads appear to encode select grammatical features of premodern Greek.
2021
Information, 2023
This paper analyzes the relationships among eight ancient scripts from between Greece and India. We used convolutional neural networks combined with support vector machines to give a numerical rating of the similarity between pairs of signs (one sign from each of two different scripts). Two scripts that had a one-to-one matching of their signs were determined to be related. The result of the analysis is the finding of the following three groups, which are listed in chronological order: (1) Sumerian pictograms, the Indus Valley script, and the proto-Elamite script; (2) Cretan hieroglyphs and Linear B; and (3) the Phoenician, Greek, and Brahmi alphabets. Based on their geographic locations and times of appearance, Group (1) may originate from Mesopotamia in the early Bronze Age, Group (2) may originate from Europe in the middle Bronze Age, and Group (3) may originate from the Sinai Peninsula in the late Bronze Age.
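The abstract does not spell out how a "one-to-one matching" is derived from the pairwise similarity ratings. As a minimal sketch under stated assumptions, the code below takes a precomputed similarity matrix (standing in for the CNN and SVM ratings, which are not reproduced here) and pairs two signs only when each is the other's best match and the score clears a threshold. The mutual-best-match rule, the threshold value, and all names are hypothetical choices for illustration, not the authors' method.

```python
def mutual_best_matches(sim, threshold=0.5):
    """Return (i, j) pairs where sign i and sign j are each other's best match.

    sim[i][j] is a similarity score between sign i of script A and
    sign j of script B (assumed to be precomputed elsewhere).
    """
    n_a, n_b = len(sim), len(sim[0])
    # Best partner in script B for every sign of script A.
    best_for_a = [max(range(n_b), key=lambda j: sim[i][j]) for i in range(n_a)]
    # Best partner in script A for every sign of script B.
    best_for_b = [max(range(n_a), key=lambda i: sim[i][j]) for j in range(n_b)]
    return [(i, j) for i, j in enumerate(best_for_a)
            if best_for_b[j] == i and sim[i][j] >= threshold]


def is_one_to_one(sim, threshold=0.5):
    """True when every sign in both scripts takes part in exactly one pair."""
    pairs = mutual_best_matches(sim, threshold)
    return len(pairs) == len(sim) == len(sim[0])


# Toy similarity matrices invented for illustration.
related = [[0.9, 0.2, 0.1],
           [0.3, 0.8, 0.2],
           [0.1, 0.3, 0.7]]
unrelated = [[0.9, 0.2],
             [0.8, 0.1]]
print(mutual_best_matches(related))  # [(0, 0), (1, 1), (2, 2)]
print(is_one_to_one(related))        # True
print(is_one_to_one(unrelated))      # False
```

In the second matrix both signs of script A prefer the same sign of script B, so only one pair survives and the scripts would not count as matched under this rule.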
2020
This first note introduces the need to flesh out a robust interdisciplinary method to analyse fragmentary manuscript corpora in general and the Judaean Desert Scrolls and Cairo Genizah manuscripts in particular.
2022
The overall aim of the project is to develop a method to integrate these fields, using cutting-edge machine learning techniques, with the goal of getting a much more complete picture of the development of the text and language of the Hebrew Bible than is currently possible.
WordPress, 2022
We know now that these tablets, described by one excavator as “little masterpieces of controlled realism,” are indigenous to the Indian subcontinent; researchers believe they were probably used to close documents and mark packages of goods, which is why they are referred to as seals. In part because of how the symbols in the inscriptions jostle each other at one end, almost as if the inscriber had run out of space, researchers have concluded that the inscriptions are meant to be read right to left. But we still don’t know what they actually say.
[Image: A stone stamp-seal found at Harappa in the Indus Valley, modern-day Pakistan’s Punjab and Sindh provinces. Credit: The Trustees of the British Museum]
This isn’t from a lack of trying. Scholars often point out that the Indus script, as the collection of some 4,000 excavated inscriptions, comprising between 400 and roughly 700 unique symbols, is known, might be one of the most deciphered scripts in history. More than a hundred attempts have been published since the 1920s. One theory links it to the Rongorongo script of Easter Island, also still undeciphered; another, offered by a German tantric guru claiming to have achieved his solution through meditation, links it to the cuneiform script used to write the Sumerian language.
PLoS ONE, 2022
Journal of Data Mining and Digital Humanities, 2017
Studia UBB Digitalia, 65/1, 2020, p. 39-54
Studia Universitatis Babeș-Bolyai Digitalia
Journal of Chinese Writing System, 2024
ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 2022
Digital Humanities Workshop
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2017
Aslib Proceedings, 2006
Advances in Archaeological Practice, 2021
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024
Artificial Intelligence, Machine Learning, and Deep Learning in Archaeology. The British School at Rome - The European Space Agency, 2019