The paper presents both conceptual and technical issues related to the construction of an HPSG test-suite for Polish. The test-suite consists of sentences of written Polish, both grammatical and ungrammatical. Each sentence is annotated with a list of linguistic phenomena it illustrates. Additionally, grammatical sentences are encoded in HPSG-style AVM structures. We also describe the technical organization of the database, as well as possible operations on it.
Recognizing the lack of information about existing language resources as one of the major factors hindering the development of the field, FLaReNet has undertaken a number of actions to survey existing resources, inform about them, and enhance their visibility.
According to the traditional view, which is probably the least widespread today, non-masculine-personal forms of numerals, such as pięć in (1), occur in the nominative, whereas masculine-personal forms, e.g. pięciu in (2), occur in the genitive or the 'genitive-accusative'. Variants of this approach, referred to below as the nominative-genitive hypothesis, can be found in, among others, Doroszewski and Wieczorkiewicz 1959, Klemensiewicz 1968, Bartnicka and Satkiewicz 1990, Mieczkowska 1994, and Rappaport 2003.
This paper presents the procedure of the syntactic annotation of the National Corpus of Polish. Syntactic annotation consists here of shallow parsing and manual post-editing of the results by annotators. The description concentrates on the delimitation of syntactic words and groups, as well as on problems encountered during the annotation process.
This article compares and evaluates common statistics used to filter hypotheses in the task of automatic valence extraction. A broader range of statistics is compared than is usually found in the literature, including Binomial Miscue Probability, Likelihood Ratio, t Test, and various simpler statistics. All experiments are performed on morphosyntactically annotated but very noisy Polish data.
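As an illustration of the kind of filtering statistic named in the abstract above, the following is a minimal sketch of a binomial tail test, assuming that Binomial Miscue Probability denotes the probability of seeing at least m occurrences of a verb with a frame in n occurrences of the verb purely through extraction errors. The function names, the error rate `p_error`, and the threshold `alpha` are hypothetical and are not taken from the article.

```python
from math import comb


def binomial_miscue_probability(n: int, m: int, p_error: float) -> float:
    """Probability of observing m or more (verb, frame) co-occurrences
    in n verb occurrences if each one arises only by error with
    probability p_error (upper tail of a binomial distribution)."""
    return sum(
        comb(n, k) * p_error**k * (1 - p_error) ** (n - k)
        for k in range(m, n + 1)
    )


def keep_frame(n: int, m: int, p_error: float = 0.05, alpha: float = 0.05) -> bool:
    """Accept the frame hypothesis when the observations are unlikely
    to be explained by noise alone. p_error and alpha are assumed values."""
    return binomial_miscue_probability(n, m, p_error) < alpha


if __name__ == "__main__":
    # Hypothetical counts: a verb seen 100 times, 20 times with the frame.
    print(keep_frame(100, 20))
    # The same verb seen only 5 times with another frame.
    print(keep_frame(100, 5))
```

With noisy data the choice of `p_error` matters a great deal: a value that is too low lets spurious frames through, while a value that is too high discards rare but genuine ones.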
This paper presents a formalism for simultaneous rule-based morphosyntactic tagging and partial parsing. A prototype implementation of the formalism exists, while a more efficient implementation is being developed with the aim of annotating the IPI PAN Corpus of Polish.
We propose a novel machine learning approach to the task of identifying definitions in Polish documents. Specifics of the problem domain and characteristics of the available dataset have been taken into consideration by carefully choosing and adapting a classification method to highly imbalanced and noisy data. We evaluate the performance of a Random Forest-based classifier in extracting definitional sentences from natural language text and give a comparison with previous work.
This report presents a proposal for a set of morphosyntactic tags for annotating a corpus of Polish texts. The tagset is based exclusively on morphological and syntactic criteria; the notion of grammatical class, usually identified with the semantic notion of lexeme, is also defined in these terms.
The aim of this paper is to present the design of a partial syntactic annotation of the IPI PAN Corpus of Polish [13] and the corresponding extension of the corpus search engine Poliqarp [14, 7] developed at the Institute of Computer Science PAS and currently employed in Polish and Portuguese corpora projects.
There exists a well-established and almost unanimously adopted measure of tagger performance, namely, accuracy. Although it is perfectly adequate for small tagsets and typical approaches to disambiguation, we show that it is deficient when applied to rich morphological tagsets and propose various extensions designed to better correlate with the real usefulness of the tagger.
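To make the contrast in the abstract above concrete, the sketch below computes plain token-level accuracy next to a simple partial-credit score over colon-separated positional tags. The partial-credit measure is only an assumed illustration of why exact-match accuracy can be too coarse for rich tagsets; it is not claimed to be one of the extensions proposed in the paper, and the function names and sample tags are hypothetical.

```python
from itertools import zip_longest


def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def tag_overlap(gold_tag: str, predicted_tag: str, sep: str = ":") -> float:
    """Fraction of tag positions (class, number, case, ...) that agree.
    Positions missing from the shorter tag count as mismatches."""
    pairs = list(zip_longest(gold_tag.split(sep), predicted_tag.split(sep)))
    return sum(g == p for g, p in pairs) / len(pairs)


def mean_tag_overlap(gold: list[str], predicted: list[str]) -> float:
    """Average per-token partial credit (an assumed, illustrative measure)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(tag_overlap(g, p) for g, p in zip(gold, predicted)) / len(gold)


if __name__ == "__main__":
    # Hypothetical tags: the tagger gets the case of the first token wrong.
    gold = ["subst:sg:nom:m1", "adj:sg:nom:m1"]
    pred = ["subst:sg:acc:m1", "adj:sg:nom:m1"]
    print(accuracy(gold, pred))          # exact match treats the first tag as fully wrong
    print(mean_tag_overlap(gold, pred))  # partial credit for the three correct positions
```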
It has been shown that English existential there contexts like There are some/two/many/no/*most/*all students in the class only allow determiners expressing a certain well-defined subclass of generalised quantifiers (Keenan 1987, 2003). In this paper we show that the Polish distributive preposition po imposes a stronger constraint, namely, requires that its argument express a cardinal quantifier.
This paper addresses the problem of converting part-of-speech (or, more generally, morphosyntactic) annotations within a single language. Conversion between tagsets is a difficult task and, typically, it is either expensive (when performed manually) or inaccurate (lossy automatic conversion or re-tagging with classical taggers). A statistical method of annotation conversion is proposed here which achieves high accuracy, provided the source annotation is of high quality.
It is surprising that the adjunct vs. complement dichotomy, one of the most conspicuous in linguistics, is at the same time also one of the least understood. In spite of this fact, linguistic theories (both transformational, e.g., Government and Binding (GB), and constraint-based, e.g., Head-driven Phrase Structure Grammar (HPSG)) do not hesitate to postulate clear-cut syntactic differences between the two classes of dependents.
This document contains excerpts from the publication The IPI PAN Corpus: Preliminary version. (This version was modified in March 2006 in order to take into account changes in the 2nd edition of the IPI PAN Corpus and in April 2010 to take into account changes in the National Corpus of Polish.)
This paper presents preliminary results of an effort aiming at the creation of a morphological dictionary of Polish, PoliMorf, available under a very liberal BSD-style license. The dictionary is the result of a merger of two existing resources, SGJP and Morfologik, and was prepared within the CESAR/META-NET initiative. The work completed so far includes re-licensing of the two dictionaries and filling the new resource with morphological data semi-automatically unified from both sources.
The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland website containing an exhaustive collection of Polish LRTs.
