Papers by Kristín Bjarnadóttir
The topic of this presentation is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe the conversion process, the methods used to deliver a fully automated UD corpus, and the complications involved. An Icelandic constituency treebank is converted to a UD corpus, and the converter is extended to convert a Faroese constituency treebank. The result is an open-source conversion tool, published under an Apache 2.0 license, applicable to a Penn-style treebank for conversion to a UD corpus, along with two new UD corpora, an Icelandic one and a Faroese one. Both are included in version 2.7 of UD.

Zenodo (CERN European Organization for Nuclear Research), Sep 5, 2017
This collection of papers on phrasal compounds is part of a bigger project whose aims are twofold: First, it seeks to broaden the typological perspective by providing data for as many different languages as possible to gain a better understanding of the phenomenon itself. Second, based on these data, which clearly show interaction between syntax and morphology, it aims to discuss theoretical models which deal with this kind of interaction in different ways. For example, models like Generative Grammar assume components of grammar and a clear-cut distinction between the lexicon (often including morphology) and grammar, which mostly stands for the computational system (syntax). Other models, like construction grammar, do not assume such components and are instead based on a lexicon including constructs. A comparison of these models then makes it possible to assess their explanatory power. The field of morphology and syntax started to acknowledge the existence of phrasal compounds predominantly in the context of Lexicalist theories, because a number of authors realised that they are not easy to handle in models of linguistic theory which demarcate the lexicon (morphology) from syntax. Commenting on Carola Trips & Jaklin Kornfilt. Further insights into phrasal compounding. In Carola Trips & Jaklin Kornfilt (eds.), Further investigations into the nature of phrasal compounding, 1-11. Berlin: Language Science Press.
Terms of use for this article. This article is covered by the Copyright Act, and it may be quoted from. However, the following conditions must be met: the quotation must be in accordance with "good practice"; quotations may be made only "to the extent required by the purpose"; the author of the text must be credited and the source stated, cf. the bibliographic information above. Searchability. The articles in the older volumes of Nordiske studier i leksikografi (1-5) have been scanned and OCR-processed. OCR stands for 'optical character recognition' and can convert an image to text by recognizing characters, which makes the text searchable. However, errors can occur in the character recognition, and when searching for names, for example, one should be prepared for the search not to be 100% reliable.

Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), 2020
In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be used to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.
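The idea of splitting an OOV word until the largest known morphological head is found can be illustrated with a minimal Python sketch. It is not the published model: `TOY_SPLITS` is a hypothetical lookup table standing in for the BiLSTM splitter, `largest_known_head` is an illustrative name, and the sketch assumes that the head is always the right-hand part of a binary split.

```python
# Toy stand-in for the splitting model: maps a word form to (modifier, head).
# The entries are illustrative assumptions, not output of the published model.
TOY_SPLITS = {
    "skólabókasafn": ("skóla", "bókasafn"),
    "bókasafn": ("bóka", "safn"),
}


def largest_known_head(word, split, vocabulary):
    """Follow the head side of each binary split until a known word is reached.

    Assumes the head is the right-hand part of every split (an assumption of
    this sketch). Returns None if the word cannot be split any further.
    """
    while word not in vocabulary:
        parts = split(word)
        if parts is None:
            return None
        _modifier, word = parts
    return word


# With a tagger vocabulary that knows "bókasafn", one split is enough.
print(largest_known_head("skólabókasafn", TOY_SPLITS.get, {"bókasafn", "safn"}))
# With a smaller vocabulary, splitting continues down to "safn".
print(largest_known_head("skólabókasafn", TOY_SPLITS.get, {"safn"}))
```

Passing the splitter in as a function keeps the loop independent of how splits are produced, so a trained model could replace the toy table without changing the head search.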

International Conference on Language Resources and Evaluation, 2014
Compounding is extremely productive in Icelandic and multi-word compounds are common. The likelihood of finding previously unseen compounds in texts is thus very high, which makes out-of-vocabulary words a problem in the use of NLP tools. The tool described in this paper splits Icelandic compounds and shows their binary constituent structure. The probability of a constituent in an unknown (or unanalysed) compound forming a combined constituent with either of its neighbours is estimated, with the use of data on the constituent structure of over 240 thousand compounds from the Database of Modern Icelandic Inflection, and word frequencies from Íslenskur orðasjóður, a corpus of approx. 550 million words. Thus, the structure of an unknown compound is derived by comparison with compounds with partially the same constituents and similar structure in the training data. The granularity of the split returned by the decompounder is important in tasks such as semantic analysis or machine translation, where a flat (non-structured) sequence of constituents is insufficient.
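The difference between a flat sequence of constituents and a binary constituent structure can be sketched in Python as follows. This is only an illustration under stated assumptions: the scores in `TOY_SCORES` are invented, whereas the tool estimates its probabilities from the compound data and word frequencies described above, and the greedy merging strategy and the names `surface` and `bracket` are assumptions of this sketch rather than the paper's algorithm.

```python
# Invented association scores for adjacent constituents (illustrative only).
TOY_SCORES = {
    ("borð", "stofu"): 0.9,
    ("stofu", "borð"): 0.2,
}


def surface(node):
    """Return the surface string of a leaf (str) or a binary node (pair)."""
    if isinstance(node, str):
        return node
    return surface(node[0]) + surface(node[1])


def bracket(parts, pair_score):
    """Greedily merge the best-scoring adjacent pair until one tree remains."""
    nodes = list(parts)
    while len(nodes) > 1:
        best = max(
            range(len(nodes) - 1),
            key=lambda i: pair_score(surface(nodes[i]), surface(nodes[i + 1])),
        )
        # Replace the two neighbours with a single combined constituent.
        nodes[best:best + 2] = [(nodes[best], nodes[best + 1])]
    return nodes[0]


def toy_score(left, right):
    """Look up a score, defaulting to 0.0 for unseen pairs."""
    return TOY_SCORES.get((left, right), 0.0)


# borðstofuborð 'dining table': the flat split borð + stofu + borð
# becomes the binary structure ((borð, stofu), borð).
print(bracket(["borð", "stofu", "borð"], toy_score))
```

The nested tuples make the grouping explicit, which is the information that a flat list of constituents does not carry.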

Since 2016, the Tour de CLARIN initiative has been periodically highlighting prominent user involvement activities in the CLARIN network in order to increase the visibility of its members, reveal the richness of the CLARIN landscape, and display the full range of activities that show what CLARIN has to offer to researchers, teachers, students, professionals and the general public interested in using and processing language data in various forms. In 2019, we expanded the initiative to also feature the work of CLARIN Knowledge Centres, which offer knowledge and expertise in specific areas to researchers, educators and developers alike. Initially conceived as a series of blog posts published on the CLARIN website, Tour de CLARIN soon proved to be one of our flagship outreach initiatives, which has been released in the form of two printed volumes. This third volume of Tour de CLARIN is organized into two parts. In Part 1, we present the six countries which have been featured sin...
The article deals with the problems involved in lemmatizing compounds in a bilingual dictionary in which Icelandic is the source language. Because individual words can show varying word forms as the first element of compounds, lemma selection will not exclusively reflect semantic lexicalization. It must also be taken into account that lexicalization is in many cases restricted to a particular form variant. This situation is further complicated by the fact that compounds showing a productive word-formation pattern may contain polysemous constituents, or that the constituents may stand in an ambiguous relation to one another.

In Icelandic, as in many other languages, phrasal compounds are an interface phenomenon of the different components of grammar. The rules of syntax seem to be preserved in the phrasal component of Icelandic compounds, as they show full internal case assignment and agreement. Phrasal compounds in Icelandic can be divided into two distinct groups. The first group contains common words which are part of the core vocabulary irrespective of genre, and these are not stylistically marked in any way. Examples of these structures can be found in texts from the 13th century onwards. The second group contains more complex compounds, mainly found in informal writing, as in blogs, and in speech. These seem to be 20th century phenomena. Phrasal compounds of both types are relatively rare in Icelandic, but other types of compounding are extremely productive. Traditionally, Icelandic compounds are divided into two groups, i.e., compounds containing stems and compounds containing inflected word form...

The topic of this paper is The Database of Icelandic Morphology (DIM), a multipurpose linguistic resource, created for use in language technology, as a reference for the general public in Iceland, and for use in research on the Icelandic language. DIM contains inflectional paradigms and analysis of word formation, with a vocabulary of approx. 285,000 lemmas. DIM is based on The Database of Modern Icelandic Inflection, which has been in use since 2004.

This collection of papers on phrasal compounding is part of a bigger project whose aims are twofold: First, it seeks to broaden the typological perspective by providing data for as many different languages as possible to gain a better understanding of the phenomenon itself. Second, based on these data, which clearly show interaction between syntax and morphology, it aims to discuss theoretical models which deal with this kind of interaction in different ways. Models like Generative Grammar assume components of grammar and a clear-cut distinction between the lexicon (often including morphology) and grammar. Other models, like construction grammar, do not assume such components and are instead based on a lexicon including constructs. A comparison of these models on the basis of this phenomenon on the morphology-syntax interface makes it possible to assess their descriptive and explanatory power.
Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for correctly tagged text, Nefnir obtains an accuracy of 99.55%, and for text tagged with a PoS tagger, the accuracy obtained is 96.88%.
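As a rough illustration of what suffix substitution on tagged text can look like, here is a minimal Python sketch. It does not reproduce Nefnir's rule format or tagset: the `RULES` table, the tag names, and the longest-suffix-first strategy are all assumptions made for the example.

```python
# Hypothetical rule table: (PoS tag, word-form suffix) -> lemma suffix.
# Tags and entries are illustrative assumptions, not rules taken from Nefnir.
RULES = {
    ("noun-dat-pl", "um"): "ur",   # hestum  -> hestur
    ("noun-gen-sg", "s"): "ur",    # hests   -> hestur
    ("verb-past", "aði"): "a",     # talaði  -> tala
}


def lemmatize(form, tag, rules=RULES):
    """Apply the longest matching suffix rule for this tag.

    Falls back to returning the word form unchanged when no rule matches.
    """
    for start in range(len(form)):  # start=0 tries the longest suffix first
        suffix = form[start:]
        if (tag, suffix) in rules:
            return form[:start] + rules[(tag, suffix)]
    return form


print(lemmatize("hestum", "noun-dat-pl"))
print(lemmatize("talaði", "verb-past"))
```

Keying the rules on the tag as well as the suffix is what lets the same ending map to different lemmas in different morphological contexts, which is why the quality of the tagging affects lemmatization accuracy.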

Orð og tunga
Compounding is extremely productive in Icelandic and multi-word compounds are common. The likelihood of finding previously unseen compounds in texts is thus very high, which makes out-of-vocabulary words a problem in the use of NLP tools. Kvistur, the decompounder described in this paper, splits Icelandic compounds and shows their binary constituent structure. The probability of a constituent in an unknown (or unanalysed) compound forming a combined constituent with either of its neighbours is estimated, with the use of data on the constituent structure of over 240 thousand compounds from the Database of Modern Icelandic Inflection (Kristín Bjarnadóttir 2012), and word frequencies from Íslenskur orðasjóður, a corpus of approx. 550 million words. Thus, the structure of an unknown compound is derived by comparison with compounds with partially the same constituents and similar structure in the training data. The granularity of the split returned by the decompounder is important in t...

In this paper, we describe the development of a morphosyntactically tagged corpus of Icelandic, the MÍM corpus. The corpus consists of about 25 million tokens of contemporary Icelandic texts collected from varied sources during the years 2006–2010. The corpus is intended for use in Language Technology projects and for linguistic research. We briefly describe other Icelandic corpora and how they differ from the MÍM corpus. We describe the text selection and collection for MÍM, both for written and spoken text, and how metadata was created. Furthermore, we discuss copyright issues and how permission clearance was obtained for texts from different sources. Text cleaning and annotation phases are also described. The corpus is available for search through a web interface and for download in TEI-conformant XML format. Examples are given of the use of the corpus and some spin-offs of the corpus project are described. We believe that the care with which we secured copyright clearance for...
We give an overview of Icelandic language technology since its inception ten years ago and briefly describe its main achievements. Then we outline the research program of the Icelandic Language Technology community for the next few years, which is being implemented thanks to a large grant which has just been allotted to the program by the Icelandic Research Fund. Finally, we discuss the need for Nordic cooperation within Language Technology and put forward some concrete proposals for enhanced cooperation.