Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The present collection consists of about 16.51 million pages, mainly in Finnish and Swedish. Of these, about 7.64 million pages are freely available on the web site https://digi.kansalliskirjasto.fi/etusivu. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the period from 1771 to 1929.
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages
This paper introduces work in progress on implementing a free full-text semantic tagger for Finnish, FiST. The tagger is based on a semantic lexicon of Finnish with 46,226 lexemes that was published in 2016. The basis of the semantic lexicon was developed in the early 2000s in the EU-funded project Benedict (Löfberg et al., 2005). Löfberg (2017) describes the compilation of the lexicon and evaluates a proprietary version of the Finnish Semantic Tagger, the FST. The FST and its lexicon were developed using the English Semantic Tagger (the EST) of Lancaster University as a model. The EST was developed at the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University as part of the UCREL Semantic Analysis System (USAS) framework. The semantic lexicon of the USAS framework is based on the modified and enriched categories of the Longman Lexicon of Contemporary English (McArthur, 1981). We have implemented a basic working version of a new full-text semantic tagger for Finnish based on freely available components. The implementation uses Omorfi and FinnPos for morphological analysis of Finnish words. After the morphological recognition phase, words from the 46K semantic lexicon are matched against the morphologically unambiguous base forms. In our comprehensive tests, the lexical tagging coverage of the current implementation is around 82–90% across different text types. The present version still needs some enhancements: at least processing of the semantic ambiguity of words and analysis of compounds, and perhaps also treatment of multiword expressions. A semantically annotated ground truth collection should also be established for evaluation of the tagger.
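The lexicon-matching step and the coverage figure described above can be illustrated with a minimal sketch. It assumes that morphological analysis (done with Omorfi and FinnPos in the paper) has already produced one base form per token. The toy lexicon, its placeholder tag names and the function names `tag_lemmas` and `lexical_coverage` are hypothetical and are not taken from FiST.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical toy lexicon: base form -> semantic tags. The tag strings are
# placeholders for illustration, not entries from the real 46K lexicon.
TOY_LEXICON: Dict[str, List[str]] = {
    "kirjasto": ["TAG_A"],
    "lehti": ["TAG_B", "TAG_C"],  # more than one tag = semantic ambiguity
    "lukea": ["TAG_D"],
}


def tag_lemmas(
    lemmas: Iterable[str], lexicon: Dict[str, List[str]]
) -> List[Tuple[str, List[str]]]:
    """Pair each base form with its lexicon tags, or an empty list if unknown."""
    return [(lemma, lexicon.get(lemma.lower(), [])) for lemma in lemmas]


def lexical_coverage(tagged: List[Tuple[str, List[str]]]) -> float:
    """Share of tokens that received at least one semantic tag."""
    if not tagged:
        return 0.0
    hits = sum(1 for _, tags in tagged if tags)
    return hits / len(tagged)


# Assumed input: base forms already produced by an upstream morphological analyser.
lemmas = ["kirjasto", "lukea", "lehti", "digitoida"]
tagged = tag_lemmas(lemmas, TOY_LEXICON)
print(tagged)
print(f"coverage: {lexical_coverage(tagged):.2f}")
```

In this sketch an ambiguous word simply keeps all of its candidate tags, which corresponds to the unresolved semantic ambiguity that the abstract lists as future work.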
This article reviews the research use of the open data of the National Library of Finland's digitized newspaper and journal collections. In 2017 a data package covering the years 1771–1910 was published from the collections, and a little over a year of experience has so far been gained from its research use. The review also touches on the online use of the material in research. In addition, we present the software interfaces through which the material can be accessed.
The digital collections of the National Library of Finland (NLF) contain over 10 million pages of historical newspapers, journals and some technical ephemera. The material ranges from the earliest Finnish newspapers of 1771 to the present day. The material up to 1910 can be viewed in the public web service, whereas anything later is available at the six legal deposit libraries in Finland. A recent user study found that different types of research use are among the key uses of the collection. The National Library of Finland has received several requests to provide the content of the digital collections as one offline bundle that includes all the needed content. For this purpose we introduced a new format, which contains three different information sets: the full metadata of a publication page, the actual page content as ALTO XML, and the raw text content. We consider these formats the most useful to provide as raw data for researchers. In this paper we describe how the export format was created, how other parties have packaged the same data and what the benefits of the current approach are. We also briefly discuss the word-level quality of the content and show a real research scenario for the data.
Communications in Computer and Information Science, 2016
There has been huge interest in the digitization of both handwritten and printed historical material in the last 10–15 years, and most probably this interest will only increase in the ongoing Digital Humanities era. As a result, many digital historical document collections are available, and there will be more of them in the future. The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 [1,2,3]; the collection, Digi, can be reached at http://digi.kansalliskirjasto.fi/. This collection contains approximately 1.95 million pages in Finnish and Swedish, the Finnish part being about 837 million words [4]. In the output of the Optical Character Recognition (OCR) process, errors are common, especially when the texts are printed in the Gothic (Fraktur, blackletter) typeface. The errors lower the usability of the corpus both for human users and for more elaborate text mining applications. Automatic spell checking and correction of the data is also difficult due to the historical spelling variants and the low OCR quality of the material. This paper discusses the overall situation of the intended post-correction of the Digi content and the evaluation of the correction. We present results of our post-correction trials and discuss some aspects of evaluation methodology. These are the first reported evaluation results of post-correction of the data, and the experiences will be used in planning the post-correction of the whole material.
In this paper we introduce evaluation results of cross-language information retrieval for two small languages, Finnish and Swedish. Our approach is based on machine translation of topics and use of the Frequent Case Generation method for managing query term variation in translated topics. Retrieval results of more standard approaches to managing query term variation, such as stemming and lemmatization of translated topics, are also shown.
One of the problems of text retrieval, especially in full-text retrieval, has always been the variation of word forms occurring in search terms and text documents. Languages behave differently in this respect: in English, for example, words vary little in form, whereas in Finnish the morphology of words is rich. This in turn affects search results if nothing is done about word form variation at the retrieval and indexing stages. Traditional means of managing word form variation have been truncation of search terms, stemming and lemmatization. All of these methods have been used successfully in various text retrieval systems, and stemming and lemmatization in particular have become standard methods in retrieval systems. Many other methods also exist, and comprehensive overviews of the different methods are given by, for example, Kettunen (2009) and McNamee, Nicholas and Mayfield (2009).
This paper presents a new method for managing the morphological variation of keywords. The method is called FCG, Frequent Case Generation. It is based on the skewed distributions of word forms in natural languages and is suitable for languages that have either a fair amount of morphological variation or are morphologically very rich. The proposed method has so far been evaluated with four languages, Finnish, Swedish, German and Russian, which show varying degrees of morphological complexity.
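The idea behind FCG can be sketched roughly as follows: because a few case forms account for most occurrences of a word, a keyword is expanded only into those frequent forms before querying an inflected index. The sketch below is an illustration under strong assumptions, not the published implementation. The case counts are invented, the suffix table is a naive concatenation that only works for words without stem alternation (such as "talo"), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical case frequencies, standing in for counts taken from a corpus.
CASE_COUNTS = Counter({
    "nominative": 400,
    "genitive": 300,
    "partitive": 150,
    "inessive": 80,
    "elative": 40,
    "adessive": 30,
})

# Naive suffixes: valid only for back-vowel words without stem alternation.
NAIVE_SUFFIXES: Dict[str, str] = {
    "nominative": "",
    "genitive": "n",
    "partitive": "a",
    "inessive": "ssa",
    "elative": "sta",
    "adessive": "lla",
}


def frequent_cases(case_counts: Counter, coverage: float) -> List[str]:
    """Pick the most frequent cases until their share reaches `coverage`."""
    total = sum(case_counts.values())
    if total == 0:
        return []
    chosen: List[str] = []
    covered = 0
    for case, count in case_counts.most_common():
        if covered / total >= coverage:
            break
        chosen.append(case)
        covered += count
    return chosen


def generate_forms(keyword: str, cases: List[str], suffixes: Dict[str, str]) -> List[str]:
    """Expand a keyword into its inflected forms for the selected cases."""
    return [keyword + suffixes[case] for case in cases]


cases = frequent_cases(CASE_COUNTS, coverage=0.8)
print(generate_forms("talo", cases, NAIVE_SUFFIXES))
```

A real generator would need proper morphological generation (vowel harmony, consonant gradation, plural forms) in place of the suffix table; the coverage threshold is the parameter that trades query length against recall.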
The paper introduces the evaluation results of Cross-Language Information Retrieval (CLIR) for three target languages, Finnish, German and Swedish, using English as the source language. Our CLIR approach is based on machine translation of topics and use of the Frequent Case Generation (FCG) method for managing query term variation in translated topics and retrieval in inflected indexes. Retrieval results of more standard approaches to managing query term variation, such as stemming and lemmatization of translated topics, are also shown. The results show that when machine translation of queries is combined with FCG, results can at best be very promising. The best Machine Translation (MT) programs seem to translate standard laboratory-type Information Retrieval (IR) topics quite well, at least from the point of view of query performance. In a few cases the translated queries perform as well as or slightly better than the monolingual baseline. Many times differences to monolingua...
This paper introduces a new algorithmic stemmer, Regstems, for a highly inflectional language, Finnish. The stemmer is shown to perform competitively with a standard stemmer, the Snowball Finnish stemmer, on the long and short queries of the CLEF 2003 Finnish collection. We also discuss requirements for stemmer creation in general and compare the stemmers using lexical test data. For string-level similarity comparison of the stemmers' output we use a novel method, Normalized Compression Distance, a general similarity metric.
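Normalized Compression Distance is commonly defined as NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the compressed length of its argument. The following is a minimal sketch of that formula; the choice of zlib as the compressor and the sample strings are assumptions made for illustration, and the abstract does not say which compressor was used in the paper.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two non-empty strings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical outputs of two stemmers on the same word list, plus unrelated text.
stems_a = "talo talo talo kirjasto kirjasto lehti lehti " * 20
stems_b = "talo talo talo kirjast kirjast leh leh " * 20
unrelated = "query translation metrics compare machine output " * 20

print(ncd(stems_a, stems_a))
print(ncd(stems_a, stems_b))
print(ncd(stems_a, unrelated))
```

Lower values indicate more similar strings. With a real compressor the distance of a string to itself is small but not exactly zero, so the values are best compared against each other and not read as absolute scores.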
This paper discusses the different methods that have been used for managing word form variation over the history of textual information retrieval (IR). These techniques have been characterized in many ways during that history. We pinpoint the most meaningful features of the approaches and make comparisons that have practical value. In the discussion we characterize the methods from several perspectives and offer the reader an overall practical guide for choosing between them.
This paper describes the use of MT metrics in choosing the best candidates for MT-based query translation resources in Cross-Language Information Retrieval. Our metric is METEOR. The language pair of our evaluation is English → German, because METEOR does not offer very many language pairs for comparison. English → German also has many MT programs available that can be used in
Papers by Kimmo Kettunen