1983
A parser based on this model is being implemented as a component of a larger system, namely a natural language database interface. There it will follow a morphological analysis component (see Jäppinen et al. 1983); hence, throughout the present paper it is assumed that all relevant morphological and lexical information is computationally available for all words in a sentence. Even though we have a database application in mind, sentence analysis will be based on general linguistic knowledge. All application-dependent inferences are left to subsequent modules, which are not discussed here.
2013
We describe the methods and resources used to build FinnTreeBank-3, a 76.4 million token corpus of Finnish with automatically produced morphological and dependency syntax analyses. Starting from a definition of the target dependency scheme, we show how existing resources are transformed to conform to this definition and subsequently used to develop a parsing pipeline capable of processing a large-scale corpus. An independent formal evaluation demonstrates high accuracy of both morphological and syntactic annotation layers. The parsed corpus is freely available within the FIN-CLARIN infrastructure project.
2017
According to Holmberg et al. (1993), the finite sentence of Finnish is a structure with 2–6 functional heads. In this article, the theory is developed further and the functional heads are reanalyzed. The functional categories are divided into two groups: (i) lexical categories: Neg, Aux, V, and C; (ii) morphological categories: AgrS, T, and Ptc. These categories are in separate tiers, and the tiers are linked to each other. Both lexical and morphological categories are hierarchically organized, and the linking between the tiers follows these hierarchies. The result of the reanalysis is a system that involves neither movement nor a complicated constituent structure of functional categories, even though the desired properties of the previous analysis remain.
Computational Linguistics, 2012
Parsing is a key task in natural language processing. It involves predicting, for each natural language sentence, an abstract representation of the grammatical entities in the sentence and the relations between these entities. This representation provides an interface to compositional semantics and to the notions of “who did what to whom.” The last two decades have seen great advances in parsing English, leading to major leaps also in the performance of applications that use parsers as part of their backbone, such as systems ...
Proc. CILC 2011-III Congreso Internacional de …, 2011
We outline the design and creation of syntactically and morphologically annotated corpora of Finnish for use by the research community. We motivate a definitional, systematic "grammar definition corpus" as a first step in a three-year annotation effort to help create higher-quality, better-documented, extensive parsebanks at a later stage. The syntactic representation, consisting of a dependency structure and a basic set of dependency functions, is outlined with examples. Reference is made to double-blind annotation experiments to measure the applicability of the new grammar definition corpus methodology.
Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages, 2021
There are a lot of tools and resources available for processing Finnish. In this paper, we survey recent papers focusing on Finnish NLP related to many different subcategories of NLP, such as parsing, generation, semantics and speech. NLP research is conducted in many different research groups in Finland, and it is frequently the case that NLP tools and models resulting from academic research are made available for others to use on platforms such as GitHub.
International Journal on Natural Language Computing, 2012
This paper addresses the syntactic analysis of phrases in Romanian, an important process in natural language processing. We suggest a real-time solution based on the idea of using words or groups of words that indicate grammatical category, together with specific endings of some parts of the sentence. Our idea builds on characteristics of the Romanian language, where some prepositions, adverbs or specific endings can provide a lot of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can conduct a dialogue in natural language with assertive and interrogative sentences about a "story" (a set of sentences describing some events from real life).
Procedia - Social and Behavioral Sciences, 2015
This article studies syntactic ngrams, i.e. small subtrees of dependency syntax analyses, as keystructures reflecting syntactic characteristics of corpora. While traditional keywords correspond to statistically more or less frequent words of a corpus and are often informative about the corpus topic and style, the unlexicalized syntactic ngrams applied in this study extend the level of description beyond individual words to sequences of syntactic elements. The article analyzes the utility of these sequences in corpus description and gives first results on the structural characteristics they reflect in the studied texts, including Finnish literature, Internet forum discussions from the major Finnish social networking website, and Internet discussions following the news and editorials of the major Finnish newspaper's website. The syntactic ngrams are produced with the freely available Finnish Dependency Parser and Ngram Builder, and the keystructures are analyzed with a linear classifier. The results suggest that syntactic ngrams illustrate both topical features, such as names and Internet URLs discussed in the corpora, and structural characteristics, such as subject-verb combinations, negations and informal sentence structures, thus both generalizing the information given by traditional keywords from individual words to concepts and providing new knowledge about typical constructions not reached by lexemes.
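The notion of an unlexicalized syntactic ngram can be illustrated with a minimal sketch. It assumes a toy parse format (token index, POS tag, head index, relation) and covers only head-dependent pairs; the tag names, the `syntactic_bigrams` function, and the data are hypothetical and are not taken from the Finnish Dependency Parser or the Ngram Builder mentioned in the abstract.

```python
from collections import Counter

# Toy dependency parse: (index, POS tag, head index, dependency relation).
# Head index 0 marks the root. Tags and relations are illustrative assumptions.
SENTENCE = [
    (1, "NOUN", 2, "nsubj"),
    (2, "VERB", 0, "root"),
    (3, "ADJ", 4, "amod"),
    (4, "NOUN", 2, "obj"),
]


def syntactic_bigrams(tokens):
    """Return unlexicalized pairs as (head POS, relation, dependent POS)."""
    pos_by_index = {idx: pos for idx, pos, _, _ in tokens}
    bigrams = []
    for idx, pos, head, rel in tokens:
        if head == 0:
            continue  # the root has no head, so it forms no bigram
        bigrams.append((pos_by_index[head], rel, pos))
    return bigrams


# Counting these structures over a corpus gives the frequencies that a
# keyness measure or a classifier could then compare across text types.
counts = Counter(syntactic_bigrams(SENTENCE))
```

Because word forms are dropped and only tags and relations are kept, the same structure is counted regardless of which lexemes fill it.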
Pedersen, B. S., Nešpore, G., Skadiņa, I. (eds.) …, 2011
In this paper we present an open-source implementation of a Finnish morphological parser. We briefly evaluate it against contemporary criticism of monolithic and unmaintainable finite-state language descriptions. We use it to demonstrate a way of writing a finite-state language description that serves a varying set of projects that typically need a morphological analyser, such as POS tagging, morphological analysis, hyphenation, spell checking and correction, rule-based machine translation and syntactic analysis. The language description is written using available open-source methods for building finite-state descriptions, coupled with an autotools-style build system, which is the de facto standard in open-source projects.
Proceedings of the 12th conference on …, 1988
Language Resources and Evaluation, 2015
This paper describes FinnPos, an open-source morphological tagging and lemmatization toolkit for Finnish. The morphological tagging model is based on the averaged structured perceptron classifier. Given training data, new taggers are estimated in a computationally efficient manner using a combination of beam search and model cascade. The lemmatization is performed employing a combination of a rule-based morphological analyzer, OMorFi, and a data-driven lemmatization model. The toolkit is readily applicable for tagging and lemmatization of running text with models learned from the recently published Finnish Turku Dependency Treebank and FinnTreeBank. Empirical evaluation on these corpora shows that FinnPos performs favorably compared to reference systems in terms of tagging and lemmatization accuracy. In addition, we demonstrate that our system is highly competitive with regard to computational efficiency of learning new models and assigning analyses to novel sentences.
1983
Our model is intended for the analysis of Finnish word forms. The system performs all meaningful morphotactic segmentations for a given surface word form, transforms alternated stems into the basic form (sg nominative for nominals and 1st infinitive for verbs), and matches the stems against lexical entries in order to find the meaningful words. The present version of the system does not analyze compound word forms into their constituents, nor does it analyze derivational word forms. We are building a new version which will have some of these characteristics. Otherwise the model is complete; it has been fully implemented, and tests indicate its correctness so far to lie in the neighborhood of 99.5% (Jäppinen et al., 1983).
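The three steps named in the abstract (segmenting the word form, mapping an alternated stem to its basic form, and matching against a lexicon) can be sketched as follows. This is only a toy illustration under stated assumptions: the suffix table, the stem table, and the `analyze` function are hypothetical, cover a handful of forms, and do not reproduce the authors' system.

```python
# Hypothetical ending table: surface ending -> morphological reading.
SUFFIXES = {
    "": "sg nominative",
    "n": "sg genitive",
    "ssa": "sg inessive",
    "ssä": "sg inessive",
}

# Hypothetical stem table: surface stem -> basic (lexicon) form.
# "käde" is the alternated stem of "käsi" ('hand').
STEM_TO_BASE = {
    "talo": "talo",
    "käsi": "käsi",
    "käde": "käsi",
}


def analyze(word):
    """Try every stem/ending split and keep those licensed by both tables."""
    analyses = []
    for split in range(len(word), 0, -1):
        stem, ending = word[:split], word[split:]
        if ending in SUFFIXES and stem in STEM_TO_BASE:
            analyses.append((STEM_TO_BASE[stem], SUFFIXES[ending]))
    return analyses
```

For example, `analyze("kädessä")` splits the form into `käde` + `ssä` and returns the basic form `käsi` with the reading `sg inessive`; a form with no licensed split yields an empty list.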
ling.helsinki.fi
2. Background. As Finnish is a morphologically rich language, its nouns, verbs and adjectives can theoretically have thousands of different inflected word forms. The figures generally given for the number of inflected forms that one can possibly construct in Finnish are just over 1,850 for a noun, just below 6,000 for an adjective, and approximately 20,000 for a verb. Though such inflectional paradigms can be viewed as both full and their individual members as equal according to some earlier views (by H. Seiler, P. H. Matthews, A. ...
European Journal of Cognitive Psychology, 2001
The study examined how morphologically complex clause constructions were processed during the reading of Finnish. Readers' eye fixation patterns were recorded when they read two alternative versions of the same linguistic construction, a morphologically complex converb construction and its less complex subclause counterpart. The complexity of the converb construction is apparent in the construction being marked by less perceivable bound morphemes, which make the clause subject and predicate morphologically more complex and more dense in information. Experiment 1 showed that more complex converb constructions produced longer gaze durations than the length- and frequency-matched subclause constructions. Experiment 2 showed that the complexity effect is reversed when the more complex clause form was clearly more common in the language than its less complex counterpart. It is concluded that both structural complexity and structural frequency influence the ease with which linguistic expressions are processed during reading.
2002
Traditionally, the analysis of word structure (morphology) is divided into two basic fields: inflection and derivation. The morphological structure of each word may include elements such as a prefix, suffix, infix, or even a separate root, and these elements can modify the meaning of the basic root or stem of the word. If the resulting word is only a paradigmatic application of its base form, this variation of the word is called inflection; but if the resulting word is an entirely different word or a compound, which is formed of two or more roots, it is called derivation. While derivation is a word-creating process, inflection constitutes different forms of any word. The model developed in this study, which analyses the morphology of Turkish verbs, can recognize all of the inflectional categories. The computational tool consists of a Java applet that can run on every machine, and a database that has been extracted from the Turkish Dictionary published by the Turkish Language Society. The database includes both the verb roots and derived verbs. We utilize Koskenniemi's two-level system to develop the morphological model. The input verb, which precedes the suffixes, is analyzed as an invariant root by querying the database, and the following suffix particles may indicate voice (causative, reciprocal, reflexive, passive), modality (necessitive, abilitative, conditional), negation, tense-aspect mood and person/number.
Nordic Journal of Linguistics, 1992
SWETWOL is implemented in the framework of Koskenniemi's (1983) two-level model. It contains a 48,000-item lexicon and a full inflectional description. Special attention was paid to the design of a computational analysis of productive Swedish compounds. Recall (coverage) and precision of SWETWOL meet high standards. SWETWOL has been extensively tested on various types of texts.
Information Technology and Control, 2007
The problem of automating the syntactic analysis of Lithuanian simple sentences is investigated. The features of the Lithuanian language that raise specific requirements for the syntactic analysis of Lithuanian sentences (rich inflection and free word order in a sentence) are highlighted. The formalized procedure of syntactic analysis is based on the Backus and Naur formalism. The possibilities for extending the boundaries of the procedure for syntactic analysis are shown. The article presents an algorithm and software for the syntactic analysis of Lithuanian simple sentences. The accuracy of the system was evaluated.
The paper argues for two points in relation to Turkish NLP: (i) we are better off developing and using research methodologies and tools that are not language-specific, although the models built with these methods and tools must certainly exploit language-specific thinking or technology. One way to do this is to collect distributional data at the level of morphemes. (ii) We need to incorporate semantics into the picture somehow; otherwise what we do is form recognition, or contextually deprived (or dissituated) form production. The last point raises problems from the world's morphologies (and from Turkish morphology in particular) for the current state of the art in NLP, where morphological processing is usually separated from syntactic processing for practical reasons. There is no semantic motivation to separate morphological processing of compositional meaning from syntactic processing of meaning. In fact, semantic aspects indicate that we should integrate them. I will mention some attempts at the problem and suggest some lines of research.
2009
This paper introduces our work on adapting a rule-based parser of spoken Estonian to the morphologically unambiguous part of the corpus of dialects. A Constraint Grammar-based parser was used for shallow syntactic analysis of Estonian dialects. The recall of the grammar was 96-97% and the precision 87-89%.