Papers by Giuseppe G. A. Celano
Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), 2023
In the present article, five neural network models for predicting the number of elliptical nodes in Ancient Greek sentences are compared. The models are trained on dependency treebank data, where elliptical nodes are introduced if and only if they govern nodes that would otherwise become orphans. As the exact word forms of elliptical nodes often cannot be identified (and therefore annotated) in Ancient Greek, the task is modeled as a multiclass classification task, where each sentence is associated with zero, one, two, or more than two elliptical nodes. The study shows that pretrained BERT token embeddings yield the best performance. A model, the first of its kind, is made available for further research.
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, 80–85, 2022
This paper presents the transformer model built to participate in the SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes. It consists of an encoder-decoder architecture with a multi-head attention mechanism. Its output is concatenated with the one-hot encoding of the language label of an input character sequence to predict a target character sequence. The results show that the transformer only partially outperforms the rule-based baseline system.
The Prague Bulletin of Mathematical Linguistics, 2014
Many studies try to determine whether Ancient Greek is an OV or VO language. None of them, however, follows a method that is entirely clear. This paper presents the first attempt to quantify the number of verbs governing preverbal or postverbal accusative object nouns or pronouns in single or coordinate independent clauses in Homer’s Iliad and Odyssey, Herodotus’ Histories, and the New Testament, providing results that are fully verifiable and reproducible. I show that, with respect to the OV vs. VO parameter, there is great variation across the texts, which suggests a change over time from OV order in Homer to VO order in the New Testament. The figures for Herodotus’ Greek show an almost exact balance between OV order and VO order.
The 3rd Workshop on Research in Computational Typology and Multilingual NLP, 2021
This paper describes the model built for the SIGTYP 2021 Shared Task aimed at identifying 18 typologically different languages from speech recordings. Mel-frequency cepstral coefficients derived from audio files are transformed into spectrograms, which are then fed into a ResNet-50-based CNN architecture. The final model achieved validation and test accuracies of 0.73 and 0.53, respectively.
Studi e Saggi Linguistici, 2020
The present article discusses some challenges posed by lemmatization and PoS tagging of Latin, with reference to the ongoing work to revise the Latin Dependency Treebank. Current options available for lemmatization and morphological analysis of Latin are reviewed and discussed. The pipeline to annotate the morphological layer of the Latin Dependency Treebank is shown to consist of three main steps: (i) tokenization/sentence splitting, which is performed via a documented rule-based algorithm, (ii) prepopulation by means of COMBO, a state-of-the-art joint lemmatizer, PoS tagger, and parser trained on the data of the Latin Dependency Treebank 2.1, and (iii) manual error correction informed by the attempt to identify and document lemmatization and morphology annotation rules.

1st Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2020), 2020
The paper presents the system used in the EvaLatin shared task to POS tag and lemmatize Latin. It consists of two components. A gradient boosting machine (LightGBM) is used for POS tagging, mainly fed with pre-computed word embeddings of a window of seven contiguous tokens (the token at hand plus the three preceding and the three following ones) per target feature value. Word embeddings are trained on the texts of the Perseus Digital Library, Patrologia Latina, and Biblioteca Digitale di Testi Tardo Antichi, which together comprise a large number of texts of different genres from the Classical Age to Late Antiquity. Word forms plus the predicted POS labels are fed to a Seq2Seq algorithm implemented in Keras to predict lemmas. The final shared-task accuracies measured for Classical Latin texts are in line with state-of-the-art POS taggers (∼96%) and lemmatizers (∼95%).
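The seven-token window mentioned in the abstract can be sketched as follows. This is only an illustration of the windowing step: the function name `context_window`, the `<pad>` symbol used at sentence edges, and the sample sentence are assumptions, and the embedding lookup and the LightGBM classifier of the actual system are omitted.

```python
PAD = "<pad>"  # hypothetical padding symbol for positions outside the sentence


def context_window(tokens, index, size=3):
    """Return the token at `index` with `size` neighbours on each side.

    With size=3 the window always contains seven items; positions that
    fall outside the sentence are filled with PAD.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index outside the sentence")
    padded = [PAD] * size + list(tokens) + [PAD] * size
    # After padding, the target token sits at position index + size,
    # so the window starts at `index` and spans 2 * size + 1 items.
    return padded[index : index + 2 * size + 1]


sentence = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres"]
first = context_window(sentence, 0)   # window centred on "Gallia"
middle = context_window(sentence, 3)  # window centred on "divisa"
```

In a full pipeline each token in the window would be replaced by its pre-computed embedding vector, and the seven vectors would be concatenated into one feature row per target token.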
Digital Classical Philology. Ancient Greek and Latin in the Digital Revolution, 2019
The article aims to be an introduction to the dependency treebanks currently available for Ancient Greek and Latin, i.e., the Ancient Greek and Latin Dependency Treebank (AGLDT), the Index Thomisticus Treebank (IT-TB), the PROIEL Treebank, and the SEMATIA Treebank. Their pipelines for creation of morphosyntactic annotations are presented so as to highlight major commonalities and differences. All treebanks share the same basic underlying formalism, whereby syntactic words are connected to each other to form labeled directed acyclic graphs, and their annotation schemes, although different, are comparable to a very large extent.

Standoff Annotation for the Ancient Greek and Latin Dependency Treebank, 2019
This contribution presents the work in progress to convert the Ancient Greek and Latin Dependency Treebank (AGLDT) into standoff annotation using PAULA XML. With an increasing number of annotations of any kind, it becomes more and more urgent that annotations related to the same texts be added as standoff annotation. Standoff annotation consists in adding any kind of annotation in separate documents, which are ultimately linked to a main text, the so-called "base text", which is meant to be unchangeable. References occur via a graph-based system of IDs, which allows an annotation layer (contained in a separate file) to be linked to another annotation layer (contained in another separate file). All the annotations/files create a labeled directed acyclic graph, whose root is represented by the base text. Standoff annotation enables easy interoperability and extension, in that single annotation layers can reference other layers of annotation independently, thus overcoming the problem of conflicting hierarchies. Moreover, standoff annotation also allows different annotations of the same kind to be added to the same text (e.g., two different interpretations of the POS tag for a given token). In the present contribution I show how the annotations of the AGLDT can become standoff using PAULA XML, which is an open access format following the LAF principles. More precisely, I present the case study of Caesar's De Bello Civili. I detail the PAULA XML files created for its tokenization and sentence splitting, which are required before morphosyntactic annotation can be added.
Glottometrics is a scientific journal for the quantitative research on language and text published at irregular intervals (2-3 times a year). Contributions in German or English should be sent to one of the editors in a common word-processing format (preferably Word). Glottometrics can be downloaded from the Internet (open access) or ordered on CD-ROM (PDF format) or as a printed version. Editors: J. Mačutek (Univ. Bratislava, Slovakia), A. Mehler (Univ. Frankfurt, Germany), M. Místecký (Univ. Ostrava, Czech Republic), G. Wimmer (Univ. Bratislava, Slovakia), P. Zörnig (Univ. Brasilia).
This paper presents preliminary corpus-based evidence from Russian for an "aspectual coding asymmetry". The main research question is: Can different lengths of aspectual verb forms be predicted? We assume that each verb has a default aspectual value and that this value can be estimated based on frequency, which according to Zipf (1936) is negatively correlated with length. Our study provides evidence that the aspectual default value is a better predictor of the lengths of verb forms in Russian than frequency. In addition, we observed a positive but weaker impact of information content (IC; Cohen Priva, 2008; Piantadosi et al., 2011), estimated from the verbs' syntactic dependents. A final result is the tendency for the impact of frequency to level out as IC increases.
In this article we report the results for five POS taggers, i.e., the Mate tagger, the Hunpos tagger, RFTagger, the OpenNLP tagger, and the NLTK Unigram tagger, tested on the data of the Ancient Greek Dependency Treebank. This is done in order to find the most efficient POS tagger to use for pre-annotation of new treebank data. A corrected 1-run 10-fold cross-validation t-test shows that the Mate tagger outperforms all the other taggers, with an accuracy of 88%.
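As a rough illustration of what a corrected cross-validation t-test computes, the sketch below assumes the test meant is the commonly used Nadeau-Bengio variance correction for resampled comparisons; the abstract does not spell out the formula, so this is an assumption. The function name `corrected_resampled_t` and the sample numbers are hypothetical and are not the paper's data.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for per-fold accuracy differences.

    diffs   : accuracy difference between two taggers on each fold
    n_train : number of training instances per fold
    n_test  : number of test instances per fold
    Returns (t, degrees_of_freedom). The term n_test / n_train inflates
    the variance to account for overlapping training sets across folds.
    """
    k = len(diffs)
    if k < 2:
        raise ValueError("at least two folds are required")
    var = variance(diffs)  # sample variance (divides by k - 1)
    if var == 0:
        raise ValueError("zero variance: the statistic is undefined")
    t = mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * var)
    return t, k - 1


# Hypothetical per-fold differences for one run of 10-fold cross-validation,
# where each fold trains on 9/10 and tests on 1/10 of the data.
fold_diffs = [0.02, 0.04] * 5
t_stat, df = corrected_resampled_t(fold_diffs, n_train=9000, n_test=1000)
```

The statistic would then be compared against a t distribution with `df` degrees of freedom. For the same differences, the uncorrected paired t statistic would be larger, which is why the correction is the more conservative choice.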
This is a web service which allows automatic comparison of two annotations directly from the (free) Arethusa annotation environment. It is possible to automatically compare differences between two sentences annotated for morphology, syntax, and semantics. This is an interactive framework to build high-quality annotations.
The comparison shows both percentage agreement and Cohen's Kappa.
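The two agreement measures named above can be sketched in a few lines. This is a generic illustration, not the web service's own code: the function names and the sample labels are hypothetical, and each annotation is assumed to be reduced to one label per token.

```python
from collections import Counter


def percentage_agreement(a, b):
    """Share of items on which the two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percentage_agreement(a, b)
    n = len(a)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently according to their own frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical POS labels assigned by two annotators to four tokens.
annotator_1 = ["noun", "verb", "noun", "adj"]
annotator_2 = ["noun", "verb", "adj", "adj"]
agreement = percentage_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four tokens, so percentage agreement is 0.75, while kappa is lower (7/11) because part of that agreement is expected by chance.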
This is the HTML file containing the guidelines for the annotation of the morphology, syntax, and semantics of Ancient Greek used for treebanking at the Humboldt Chair of Digital Humanities in Leipzig (download the file and open it in a browser). I also authored the algorithm that has been incorporated in the Arethusa annotation environment (see the link on GitHub).
the Ancient Greek Dependency Treebank. It consists of a hierarchical tagset
which implements H. W. Smyth’s A Greek Grammar for Colleges, partly revised
to meet algorithmic adequacy. Intercoder agreement values are then reported
for two annotators who treebanked a
pilot corpus containing 417 sentences (6486 tokens).