2004
In this paper two statistical methods for extracting collocations from text corpora written in Modern Greek are described: the mean and variance method and a method based on the χ² test. The mean and variance method calculates distances ("offsets") between words in a corpus and looks for specific patterns of distance. The χ² test is combined with the formulation of a null hypothesis H₀ for a sample of occurrences, and we check whether there are associations between the words. The χ² test does not assume that the words in the corpus have normally distributed probabilities and hence seems to be more flexible. The two methods extract interesting collocations that are useful in various applications, e.g. computational lexicography, language generation and machine translation.
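The mean and variance method described above can be sketched in a few lines: collect the signed offsets at which one word occurs near another, then flag pairs whose offsets cluster tightly (low spread) at a consistent mean distance. This is a minimal illustration under stated assumptions, not the authors' implementation; the toy token list and the window size are invented for the example.

```python
from statistics import mean, stdev

def offset_stats(tokens, word_a, word_b, window=3):
    """Collect signed offsets of word_b relative to word_a within
    +/- `window` positions, and return (count, mean, std deviation).
    A low standard deviation suggests a fixed collocational pattern."""
    offsets = []
    for i, tok in enumerate(tokens):
        if tok != word_a:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == word_b:
                offsets.append(j - i)
    if len(offsets) < 2:
        return len(offsets), None, None
    return len(offsets), mean(offsets), stdev(offsets)

# Toy corpus: "on" always follows "knock" at offset +1, so the
# spread of offsets is zero -- a strong collocation signal.
tokens = "knock on the door , then knock on the table".split()
n, m, s = offset_stats(tokens, "knock", "on")
```

With real corpora one would of course iterate over all candidate pairs and filter by frequency before computing the spread.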
Natural Language Understanding and Cognitive Science, 2004
In this paper we describe and apply two statistical methods for extracting collocations from text corpora written in Modern Greek. The first one is the mean and variance method, which calculates "offsets" (distances) between words in a corpus and looks for patterns of distances with low spread. The second method is based on the χ² test. Such an approach seems to be more flexible because it does not assume normally distributed probabilities of the words in the corpus. The two techniques produce interesting collocations that are useful in various applications, e.g. computational lexicography, language generation and machine translation.
2017
Identifying multiword expressions (MWEs) in a sentence and performing the syntactic analysis of the sentence are interrelated processes. In our approach, priority is given to parsing alternatives involving collocations; collocational information thus helps the parser through the maze of alternatives, with the aim of substantially improving the performance of both tasks (collocation identification and parsing) and of a subsequent task (automatic annotation). In this paper we present our system and the procedure we followed to carry out the automatic annotation of Greek verbal multiword expressions (VMWEs) in running texts.
The Egyptian Journal of Language Engineering, 2015
This paper aims to explain why we would use statistical methods for Ancient Greek linguistics. Such methods typically use a large machine-readable corpus in order to discover general principles of linguistic behaviour, genre differences, etc. The paper also sets out to prove some hypotheses, or to identify linguistic phenomena such as morphological, syntactic, and semantic phenomena. The paper addresses the following points:
- What kinds of linguistic data can statistical methods handle?
- What are the advantages and disadvantages of statistical linguistics?
- What is the nature of the assumptions they require of the analyst?
- What is the strategy for studying linguistic phenomena?
Conference of the European Chapter of the Association for Computational Linguistics, 1985
ExLing 2019. Proceedings of the 10th international conference of experimental linguistics. 25–27 September 2019. Lisbon, Portugal, 2019
It seems that certain linguistic situations are ideally suited to the usage-based approach, and the Greek case is one of them. This is a corpus-based study, based on the analysis and processing of a large variety of Greek texts from everyday spoken interactions. As our starting point, we argue for the panchronic character of the Greek lexicon, its extreme conservatism, and the man-made character of the formation of the Greek literary standard. Some practical issues, such as the choice between monotonic and polytonic orthography and the tagging of lexemes to obtain more data, are also addressed.
ICT in the Analysis, Teaching and Learning …, 2006
Studies in Corpus Linguistics, 2012
Multiword expressions (MWEs) are words that co-occur so often that they are perceived as a linguistic unit. Since MWEs pervade natural language, their identification is pertinent for a range of tasks within lexicography, terminology and language technology. We apply various statistical association measures (AMs) to word sequences from the Norwegian Newspaper Corpus (NNC) in order to rank two-and three-word sequences (bigrams and trigrams) in terms of their tendency to co-occur. The results show that some statistical measures favour relatively frequent MWEs (e.g. i motsetning til 'as opposed to'), whereas other measures favour relatively low-frequent units, which typically comprise loan words (de facto), technical terms (notarius publicus) and phrasal anglicisms (practical jokes; cf. Andersen this volume). On this basis we evaluate the relevance of each of these measures for lexicography, terminology and language technology purposes.
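One widely used association measure of the kind evaluated above is pointwise mutual information (PMI), which compares a bigram's observed probability with what independence would predict: PMI(a, b) = log₂(P(a, b) / (P(a)·P(b))). The sketch below is illustrative only (the toy sentence is invented, not from the NNC), but it shows PMI's well-known preference for low-frequency fixed units such as loan words:

```python
import math
from collections import Counter

def pmi_bigrams(tokens):
    """Score each adjacent word pair with pointwise mutual information.
    High PMI means the pair co-occurs far more than chance predicts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (a, b), f in bigrams.items():
        p_ab = f / n_bi                      # observed bigram probability
        p_a = unigrams[a] / n_uni            # marginal probabilities
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# "de" and "facto" occur only together, so their PMI exceeds that of
# pairs whose members also appear in other contexts.
tokens = "de facto standards differ from other standards while de facto norms differ too".split()
scores = pmi_bigrams(tokens)
```

In practice AMs are computed over contingency-table counts from millions of tokens, with frequency cutoffs to tame PMI's low-frequency bias.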
Multiword Expressions / Πολυλεκτικές εκφράσεις, 2020
This paper discusses theoretical and empirical aspects of neological multiword compounds, a subcategory of multiword expressions, as a contribution to the further development of the system of neologism extraction of Νεοδημία, an ongoing research programme conducted at the Research Centre for Scientific Terms and Neologisms to accomplish the tasks of (semi)automated detection and linguistic analysis of Greek Neologisms and Terminology. Under the epistemological paradigm of Natural Theory, as well as a corpus-linguistic perspective, using data from the Monitor Corpus of Neologisms developed at the Centre, we demonstrate how the parameters of multiword compoundhood and neologicity can be operationalized and further refined both in theory and practice. A pilot set of multiword expressions, selected on theoretical grounds, served as a basis for qualitative and quantitative comparisons. The Adjective+Noun structures were subjected to a corpus-linguistic analysis, using a statistical, collocate-node approach. Although there was a cline of associations, some collocates were much more salient, "standing out" in relation to all the others in the same network. With the help of relative frequency ratios, we interpreted the existence of these highly salient collocations as empirical evidence for compoundhood, confirming the theoretical hypotheses upon which they were selected. Moreover, the process of neologism consolidation was captured with the use of time-lined dispersion statistics, as well as the exploration of contextual variability. All factors were proven decisive in a definition of multiword lexical neology, suggesting improvements to our system of identifying, recording and monitoring neologisms in the Greek Press.
Computing Research Repository, 1996
The usefulness of a statistical approach suggested by Church and Hanks (1989) is evaluated for the extraction of verb-noun (V-N) collocations from German text corpora. Some motivations for the extraction of V-N collocations from corpora are given, and a couple of differences concerning the German language are mentioned that have implications for the applicability of extraction methods developed for English. We present precision and recall results for V-N collocations with support verbs and discuss the consequences for further work on the extraction of collocations from German corpora. Depending on the goal to be achieved, emphasis can be put on a high recall for lexicographic purposes or on high precision for automatic lexical acquisition, in each case leading to a decrease of the corresponding other variable. Low recall can still be acceptable if very large corpora (i.e. 50–100 million words) are available or if corpora for special domains are used in addition to the data found in machine-readable (collocation) dictionaries.
Lecture Notes in Computer Science, vol. 10415 (Text, Speech, and Dialogue - 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings) / K. Ekstein, V. Matousek (Eds.). Springer International Publishing AG, 2017
The paper deals with collocation extraction from corpus data. A number of formulae have been created to integrate the different factors that determine the association between collocation components. Experiments are described whose objective was to study collocation extraction based on statistical association measures. The work focuses on bigram collocations. The data obtained on measure precision allow us to establish, to some degree, that some measures are more precise than others. No measure is ideal, which is why various options for their integration are desirable and useful. We propose a number of parameters that allow collocates to be ranked in a combined list, namely an average rank, a normalized rank and an optimized rank.
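The simplest of the combined-list parameters mentioned above, the average rank, can be sketched as follows: each association measure produces its own ordering of candidates, and a candidate's combined score is the mean of its positions across those orderings. This is an illustrative sketch, not the paper's exact procedure; the measure names and candidate phrases are invented for the example.

```python
from collections import defaultdict

def combined_ranks(rankings):
    """Given several rankings (lists of candidates, best first),
    average each candidate's positions and return one merged list,
    best (lowest average rank) first."""
    positions = defaultdict(list)
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            positions[item].append(pos)
    avg = {item: sum(p) / len(p) for item, p in positions.items()}
    return sorted(avg, key=avg.get)

# Two hypothetical measures disagree; averaging their ranks
# promotes the candidate both place reasonably high.
by_pmi     = ["notarius publicus", "de facto", "practical joke"]
by_t_score = ["de facto", "practical joke", "notarius publicus"]
merged = combined_ranks([by_pmi, by_t_score])
```

Normalized and optimized ranks refine this idea by rescaling positions to list length and by weighting the more reliable measures, respectively.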
2005
Modern Greek is one of the least quantitatively studied modern European languages and the goal of this paper is to fill this relative void. We use the Hellenic National Corpus (HNC), which is a growing corpus that currently includes 33 million words. The corpus and all the tools used in our work were developed by the Institute for Language and Speech Processing (ILSP). In this paper we focus on three main areas: the lists of the 1000 most common words and lemmas, word length and letter frequency. We also make some comparisons with earlier work, in which we had used the previous 13 million word edition of the HNC.
2008
The notion of collocation is quite ambiguous. A concise survey of different approaches to it (British contextualism, lexicographical approach, approach of the “Meaning-Text” theory) is proposed in the paper. The paper discusses the results of retrieving collocations from a corpus of Russian texts. The data obtained is compared to the data given for set expressions in modern Russian dictionaries. The paper also explores the role of statistical measures for extracting collocations in Russian, and the issue of their applicability to the Russian language.
1992
This paper focuses on the extraction of collocations from the Collins-Robert English-French, French-English dictionary. The extraction programme, based on the WordCruncher Text Retrieval software package, is illustrated by a study of the combinatory properties of the word PRICE. The co-occurrence knowledge extracted from the dictionary is then compared with similar data retrieved from a statistically processed corpus. The two techniques are assessed and shown to be complementary and mutually enriching.
Proc. of the LREC …, 2010
This paper describes a series of machine translation experiments with the English-Romanian language pair. The experiments were intended to test the hypothesis that syntactically motivated long translation examples, added to a baseline 3-gram statistically extracted phrase table, improve translation performance in terms of the BLEU score. Extensive tests with four different scenarios were performed: 1) simply concatenating the "extra" translation examples to the baseline phrase table; 2) computing and taking into account perplexities for the POS-strings associated with the translation examples; 3) taking into account the number of words in each member of a translation example; 4) filtering the "extra" translation examples using a score that assesses the correctness of their lexical alignment. Different combinations of the four scenarios were also tested. The paper also presents a method for extracting syntactically motivated translation examples using the dependency linkage of both the source and target sentence. To decompose the source/target sentence into fragments, we identified two types of dependency link structures, super-links and chains, and used these structures to set the translation example borders.
2008
The paper discusses the results of an experiment in collocation extraction in a corpus of Russian texts. The data obtained is compared to the data given for set expressions in modern Russian dictionaries in order to analyze from the standpoint of traditional lexicography what kind of phrases can be received by such an approach. The paper also explores the role of statistical measures for extracting collocations in Russian.
Research in Corpus Linguistics, 2019
Pragmantax II. Zum aktuellen Stand der Linguistik und ihren Teildisziplinen. The Present State of Linguistics and its Sub-Disciplines. Frankfurt a.M.: Peter Lang, 2014. S. 333-344.
The paper describes the notion of collocability and collocations, the statistical background for collocation extraction, and experiments in applying statistical tools to extract collocations from Russian texts.
Atti del XII Congresso Internazionale di Lessicografia, Torino, 6-9 settembre 2006, Vol. 1, ISBN 88-7694-918-6, pp. 377-381, 2006
This paper deals with the development of an electronic database of new words in Greek, which constitutes a part of the research project "Pythagoras". Our aim is (a) to create an electronic inventory of all the new Greek words that are used in non-scientific Greek-language magazines, (b) to offer precise information about the semantic frame, the morphological features and the special register of all the lexical units included in the database, (c) to make useful correlations between those units, as far as their form and meaning are concerned, and (d) to produce a lexicographic resource that can be used for future linguistic descriptions and/or analyses of the Greek language. In the present paper our corpora and methods are presented, as well as a small sample of the lexical units to be included in the database.