Adam Przepiórkowski

Institute of Computer Science, Polish Academy of Sciences, Linguistic Engineering Group, Faculty Member

Papers by Adam Przepiórkowski

On the computational usability of valence dictionaries for Polish

English summary: We consider undirected connected graphs as a natural model for computer networks. For this model, a distributed election algorithm is presented, its correctness is proved and its complexity is discussed. Polish summary: Valence dictionaries, i.e. dictionaries that give information about the possible arguments of words, primarily verbs, play an important role in natural language processing (Eng.

An HPSG-annotated test suite for Polish

The paper presents both conceptual and technical issues related to the construction of an HPSG test suite for Polish. The test suite consists of sentences of written Polish, both grammatical and ungrammatical. Each sentence is annotated with a list of the linguistic phenomena it illustrates. Additionally, grammatical sentences are encoded in HPSG-style AVM structures. We also describe the technical organization of the database, as well as the operations it supports.

Word Sense Disambiguation in National Corpus of Polish

The FLaReNet Databook

Recognizing the lack of information about existing language resources as one of the major factors...

O wartości przypadka podmiotów liczebnikowych (On the case value of numeral subjects)

According to the traditional view, which is probably the least widespread today, non-masculine-personal forms of numerals, such as pięć in (1), occur in the nominative, whereas masculine-personal forms, e.g. pięciu in (2), occur in the genitive or the 'genitive-accusative'. Variants of this approach, referred to below as the nominative-genitive hypothesis, can be found in, among others, Doroszewski and Wieczorkiewicz 1959, Klemensiewicz 1968, Bartnicka and Satkiewicz 1990, Mieczkowska 1994 and Rappaport 2003.

The design of syntactic annotation levels in the National Corpus of Polish

This paper presents the procedure of the syntactic annotation of the National Corpus of Polish. Syntactic annotation consists here of shallow parsing and manual post-editing of the results by annotators. The description concentrates on the delimitation of syntactic words and groups, as well as on problems encountered during the annotation process.

Automatic extraction of Polish verb subcategorization: An evaluation of common statistics

This article compares and evaluates common statistics used in the process of filtering the hypotheses within the task of automatic valence extraction. A broader range of statistics is compared than the ones usually found in the literature, including Binomial Miscue Probability, Likelihood Ratio, t Test, and various simpler statistics. All experiments are performed on the basis of morphosyntactically annotated but very noisy Polish data.

A preliminary formalism for simultaneous rule-based tagging and partial parsing

This paper presents a formalism for simultaneous rule-based morphosyntactic tagging and ...

Definition extraction with balanced random forests

We propose a novel machine learning approach to the task of identifying definitions in Polish documents. Specifics of the problem domain and characteristics of the available dataset have been taken into consideration by carefully choosing and adapting a classification method to highly imbalanced and noisy data. We evaluate the performance of a Random Forest-based classifier in extracting definitional sentences from natural language text and give a comparison with previous work.

Projekt anotacji morfosyntaktycznej korpusu języka polskiego (A design for the morphosyntactic annotation of a corpus of Polish)

This report proposes a set of morphosyntactic tags for annotating a corpus of Polish texts. The tagset is based exclusively on morphological and syntactic criteria; the notion of grammatical class, usually identified with the semantic notion of lexeme, is also defined in these terms.

Construction of an HPSG TreeBank for Polish

This article presents the conceptual and technical aspects of building an annotated corpus for Polish. The corpus contains sentences of written Polish represented as AVM structures within the HPSG formalism. In addition, each sentence is annotated with the types of linguistic phenomena it illustrates. Because the corpus is also used as a basis for testing grammars (a test suite), both well-formed and ill-formed sentences are included.

TEI P5 as a Text Encoding Standard for Multilevel Corpus Annotation

The need for text encoding standards for language resources (LRs) is widely acknowledged: within ...

On heads and coordination in a partial treebank

The aim of this paper is to present the design of a partial syntactic annotation of the IPI PAN Corpus of Polish [13] and the corresponding extension of the corpus search engine Poliqarp [14, 7] developed at the Institute of Computer Science PAS and currently employed in Polish and Portuguese corpora projects.

Towards the adequate evaluation of morphosyntactic taggers

There exists a well-established and almost unanimously adopted measure of tagger performance, namely, accuracy. Although it is perfectly adequate for small tagsets and typical approaches to disambiguation, we show that it is deficient when applied to rich morphological tagsets and propose various extensions designed to better correlate with the real usefulness of the tagger.

Generalised quantifier restrictions on the arguments of the Polish distributive preposition PO

It has been shown that English existential there contexts like There are some/two/many/no/*most/*all students in the class only allow determiners expressing a certain well-defined subclass of generalised quantifiers (Keenan 1987, 2003). In this paper we show that the Polish distributive preposition po imposes a stronger constraint, namely, it requires that its argument express a cardinal quantifier.

Tagset conversion with decision trees

This paper addresses the problem of converting part-of-speech (or, more generally, morphosyntactic) annotations within a single language. Conversion between tagsets is a difficult task and, typically, it is either expensive (when performed manually) or inaccurate (lossy automatic conversion or re-tagging with classical taggers). A statistical method of annotation conversion is proposed here which achieves high accuracy, provided the source annotation is of high quality.

Verbal Proforms and the Complement–Adjunct Distinction in Polish

It is surprising that the adjunct vs. complement dichotomy, one of the most conspicuous in linguistics, is at the same time also one of the least understood. In spite of this fact, linguistic theories (both transformational, e.g., Government and Binding (GB), and constraint-based, e.g., Head-driven Phrase Structure Grammar (HPSG)) do not hesitate to postulate clear-cut syntactic differences between the two classes of dependents.

Ściągawka do Narodowego Korpusu Języka Polskiego. The National Corpus of Polish Cheatsheet

This document contains excerpts from the publication The IPI PAN Corpus: Preliminary version. (Thi...

PoliMorf: a (not so) new open morphological dictionary for Polish

This paper presents preliminary results of an effort aiming at the creation of a morphological dictionary of Polish, PoliMorf, available under a very liberal BSD-style license. The dictionary is the result of a merger of two existing resources, SGJP and Morfologik, and was prepared within the CESAR/META-NET initiative. The work completed so far includes re-licensing of the two dictionaries and filling the new resource with the morphological data semi-automatically unified from both sources.

Towards a comprehensive open repository of Polish language resources

The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland website containing an exhaustive collection of Polish LRTs.
