2009, … Linguistics and Intelligent …
The C-value/NC-value algorithm, a hybrid approach to automatic term recognition, was originally developed to extract multiword term candidates from specialised documents written in English. Here, we present three main modifications to this algorithm that affect how the obtained output is refined. The first modification aims to maximise the number of real terms in the list of candidates with a new approach to the stop-list application process. The second modification adapts the C-value calculation formula in order to consider single-word terms. The third modification changes how the term candidates are grouped, exploiting a lemmatised version of the input corpus. Additionally, the size of the candidate's context window is made variable. We also show the linguistic modifications necessary to apply this algorithm to the recognition of term candidates in Spanish.
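As a rough illustration of the second modification described above, here is a minimal Python sketch of a C-value that also scores single-word candidates, assuming the common log2(|a| + 1) length weighting so that length-1 candidates receive a nonzero weight; the exact formula and constants used by the authors may differ.

```python
import math
from collections import defaultdict

def c_value(candidates):
    """Rank term candidates, including single-word ones, by an adapted C-value.

    `candidates` maps a tuple of tokens to its corpus frequency.
    log2(len + 1) replaces the original log2(len) -- an assumed
    variant, not necessarily the authors' exact adaptation.
    """
    # For every candidate, collect the longer candidates that contain it.
    nested_in = defaultdict(list)
    for a in candidates:
        for b in candidates:
            if len(b) > len(a) and any(b[i:i + len(a)] == a
                                       for i in range(len(b) - len(a) + 1)):
                nested_in[a].append(b)

    scores = {}
    for a, freq in candidates.items():
        length_weight = math.log2(len(a) + 1)
        containers = nested_in[a]
        if not containers:
            scores[a] = length_weight * freq
        else:
            # Discount occurrences explained by the containing candidates.
            scores[a] = length_weight * (
                freq - sum(candidates[b] for b in containers) / len(containers))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. c_value({("term",): 7, ("term", "extraction"): 4})
```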
2000
Technical terms (henceforth called terms) are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms. The second part, NC-value, gives: 1) a method for the extraction of term context words (words that tend to appear with terms), 2) the incorporation of information from term context words into the extraction of terms.

2 The C-value Approach. This section presents the C-value approach to multiword ATR. C-value is a domain-independent method for multi-word ATR which aims to improve the extraction of nested terms. The method takes as input an SL corpus and produces a list of candidate multi-word terms. These are ordered by their termhood, which we also call C-value. The output list is evaluated by a domain expert. Since the candidate terms are ranked according to their termhood, the domain expert can scan the lists starting from the top, and go as far down the list as time/money allow. The C-value approach combines linguistic and statistical information, emphasis being placed on the statistical part.
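For reference, the C-value itself is computed as follows (reconstructed here from the published method; |a| is the candidate's length in words, f(a) its corpus frequency, T_a the set of longer extracted candidates that contain a, and P(T_a) their number):

$$
\mathrm{C\text{-}value}(a)=
\begin{cases}
\log_2\lvert a\rvert \cdot f(a) & \text{if } a \text{ is not nested,}\\[4pt]
\log_2\lvert a\rvert\,\Bigl(f(a)-\dfrac{1}{P(T_a)}\displaystyle\sum_{b\in T_a} f(b)\Bigr) & \text{otherwise.}
\end{cases}
$$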
In this paper we account for the main characteristics and performance of a number of recently developed term extraction systems. The analysed tools represent the main strategies followed by researchers in this area. All systems are analysed and compared against a set of technically relevant characteristics. (In Bourigault, D.; Jacquemin, C.; L'Homme, M.-C. (2001) Recent Advances in Computational Terminology.) In order to give a broader view of TE, we use both extractor and detector to refer to the same notion; however, we are aware of the fact that some scholars attribute different meanings to these words.
1997
In this paper we present an approach for the extraction of multi-word terms from special language corpora. The new element is the incorporation of context information into the evaluation of candidate terms. This information is embedded into the C-value method in the form of statistical weights.

1 Introduction. Automatic term recognition (ATR) is the extraction of technical terms from special language corpora with the use of computers. Its applications include specialised dictionary construction and maintenance, human and machine translation, indexing in books and digital libraries, hypertext linking, text categorization, etc. ATR also gives the potential to work with large amounts of real data that it would not be possible to handle manually. We should note that by ATR we mean neither dictionary string matching nor term interpretation (which deals with the relations between terms and concepts). When ATR is concerned with single-word term extraction, domain-dependent linguistic information...
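The resulting NC-value measure, as published, combines the two sources of information with fixed weights:

$$
\mathrm{NC\text{-}value}(a)=0.8\cdot\mathrm{C\text{-}value}(a)+0.2\sum_{b\in C_a} f_a(b)\,\mathrm{weight}(b),
\qquad \mathrm{weight}(b)=\frac{t(b)}{n},
$$

where C_a is the set of distinct context words of candidate a, f_a(b) is the frequency of b as a context word of a, t(b) is the number of candidate terms b appears with, and n is the total number of candidate terms considered.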
Terminology, 1996
Following the growing interest in "corpus-based" approaches to computational linguistics, a number of studies have recently appeared on the topic of automatic term recognition or extraction. Because a successful term-recognition method has to be based on proper insights into the nature of terms, studies of automatic term recognition not only contribute to the applications of computational linguistics but also to the theoretical foundation of terminology. Many studies on automatic term recognition treat interesting aspects of terms, but most of them are not well founded and described. This paper tries to give an overview of the principles and methods of automatic term recognition. For that purpose, two major trends are examined, i.e., studies in automatic recognition of significant elements for indexing mainly carried out in information-retrieval circles and current research in automatic term recognition in the field of computational linguistics.
2013
In this paper a method and a flexible tool for performing monolingual term extraction are presented, based on the use of syntactic analysis where information on parts-of-speech, syntactic functions and surface syntax tags can be utilised. The standard approaches to evaluating term extraction, namely manual evaluation of the top n term candidates or comparison to a gold standard consisting of a list of terms from a specific domain, have their advantages, but in this paper we try to realise a proposal by Bernier-Colborne (2012) where extracted terms are compared to a gold standard consisting of a test corpus in which terms have been annotated in context. Apart from applying this evaluation to different configurations of the tool, practical experiences from using the tool in real-world situations are described.
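A minimal sketch of this style of evaluation, assuming gold and extracted term occurrences are both represented as character-offset spans and scored by exact match (partial-match schemes are also possible; the representation is our assumption, not the paper's):

```python
def evaluate(extracted, gold):
    """Precision/recall/F1 of extracted term occurrences against a
    test corpus annotated with gold-standard term spans.  Both
    arguments are sets of (start, end) offset tuples."""
    tp = len(extracted & gold)                       # exact-span matches
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```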
Computational Linguistics and …, 2010
This paper deals with the application of natural language processing techniques to the field of information retrieval. To be precise, we propose the application of morphological families for single-term conflation in order to reduce the linguistic variety of indexed documents written in Spanish. A system for the automatic generation of morphological families by means of Productive Derivational Morphology is discussed. The main characteristics of this system are the use of a minimum of linguistic resources, a low computational cost, and independence with respect to the indexing engine.
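A toy illustration of single-term conflation via morphological families; the suffix list and stripping heuristic below are our own placeholders (accents omitted for simplicity), not the Productive Derivational Morphology system described in the paper:

```python
# Hypothetical Spanish derivational suffixes for the sketch only.
SUFFIXES = ["amiento", "imiento", "acion", "icion", "ador", "edor",
            "able", "ible", "oso", "osa", "ar", "er", "ir"]

def family_key(word):
    """Map a word form to a crude morphological-family key by
    stripping the longest matching derivational suffix."""
    word = word.lower()
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# e.g. family_key("indexacion") == family_key("indexar") == "index",
# so both forms are conflated to one indexing unit.
```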
Advanced Linguistics, 2019
Nowadays the processes of translation are becoming more unified, and translators depend not only on their knowledge and sense of language, but also on various software tools that facilitate the process of translation. The following article is devoted to one branch of such software, systems of automatic term extraction, which are an essential part of developing lexicographic resources for the translation of texts that include a variety of terms. Consequently, the necessity arose to choose among the variety of different programs, and the results of this research, i.e. the comparison of the functions of different programs, are described in our article. Several criteria by which the quality of term extraction can be measured have been compared: the speed of extraction, the "purity" of the output list of terms, whether the extracted lexical material meets the requirements placed on terms, the nature of the irrelevant items produced by automatic extraction systems and the factors influencing this, etc. The advantages and disadvantages of cloud and desktop services have been investigated and compared. It was noted that the main difficulty is that programs are still not able to distinguish between word forms; thus the texts that undergo the extraction process require auxiliary procedures such as POS-tagging, lemmatization and tokenization. Another obstacle was the inability of certain programs to distinguish between compound terms and simple word combinations. The key points of the research may be used in translation studies courses, in research devoted to "smart" or electronic lexicography, and by translators in general, who may use these term extraction systems during the process of translation for the purpose of forming or unifying the required glossary.
1999
Methods for multi-word term extraction generally involve statistical and/or linguistic techniques, but the linguistic information used is mainly in the form of simple syntactic information. Our approach makes use of semantic information from UMLS [12], a domain-specific thesaurus, and of a rich syntactic and semantic representation. By incorporating new contextual weights based on these, and in particular a similarity measure we have developed, we improve on the NC-value method of term recognition [6], which produces a ranking of candidate terms. Our results show that by adding deeper forms of information to this value, we can make more of a distinction between terms and non-terms, thereby improving the ranking, and we can also perform some disambiguation of these terms.
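A hedged sketch of the general idea of enriching the NC-value context weight with semantic similarity; the `similarity` function stands in for the UMLS-derived measure and the way it is blended in is illustrative, not the paper's actual formulation:

```python
def context_weight(context_word, cooccurring_terms, similarity, n_terms):
    """Assumed variant of the NC-value context weight: the plain t(b)/n
    ratio is scaled up by the average thesaurus-based similarity between
    the context word and the candidate terms it appears with.
    `similarity` is hypothetical and assumed to return a value in [0, 1]."""
    if not cooccurring_terms:
        return 0.0
    base = len(cooccurring_terms) / n_terms          # original t(b)/n weight
    avg_sim = sum(similarity(context_word, t)
                  for t in cooccurring_terms) / len(cooccurring_terms)
    return base * (1.0 + avg_sim)                    # boost by semantic evidence
```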
Journal of Natural Language Processing, 1999
In this paper we present a domain-independent method for the automatic extraction of multi-word (technical) terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms. Nested terms are those which also exist as substrings of other terms. The second part, NC-value, gives two things: 1) a method for the extraction of term context words (words that tend to appear with terms), 2) the incorporation of information from term context words to the extraction of terms. We apply the method to a medical corpus and compare the results with those produced by frequency of occurrence also applied on the same corpus. Frequency of occurrence was chosen for the comparison since it is the most commonly used statistical method for automatic term extraction to date. We show that using C-value we improve the extraction of nested multi-word terms, while using context information (NC-value) we improve the extraction of multi-word terms in general. In the evaluation sections, we give directions for the further improvement of the method.
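A compact sketch of the NC-value combination step, using the 0.8/0.2 weighting reported in the paper; the input data structures are our own:

```python
def nc_value(c_values, context_freqs, context_weights):
    """Combine C-value with context information (0.8/0.2 weighting
    as in the published method).

    c_values        : {term: C-value score}
    context_freqs   : {term: {context_word: co-occurrence frequency}}
    context_weights : {context_word: t(b)/n weight}
    """
    scores = {}
    for term, cval in c_values.items():
        ctx = sum(freq * context_weights.get(word, 0.0)
                  for word, freq in context_freqs.get(term, {}).items())
        scores[term] = 0.8 * cval + 0.2 * ctx
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```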
2008
Automatic Term Recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. From the large number of methodologies available in the literature, only a few are able to handle both single and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well on the Genia corpus (a standard life science corpus). This indicates that the choice and design of corpus have a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion in certain domains. As a result, algorithms that ignore single-word terms may cause problems for tasks built on top of ATR. Effective ATR systems also need to take into account both unstructured text and structured aspects, which means information extraction techniques need to be integrated into the term recognition process.
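One simple way to realise such a voting mechanism is to sum normalised ranks across the individual algorithms' output lists; this sketch illustrates the idea, though the paper's exact scheme may differ:

```python
def vote(rankings):
    """Combine ranked candidate lists from several ATR algorithms by
    summing normalised ranks.  `rankings` is a list of lists of
    candidates, best first; candidates absent from a list implicitly
    contribute 0 for that algorithm."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for rank, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0.0) + (n - rank) / n
    return sorted(scores, key=scores.get, reverse=True)

# vote([["cell line", "gene"], ["gene", "protein"]]) -> gene first
```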
2015
The large amount of textual information digitally available today gives rise to the need for effective means of indexing, searching and retrieving this information. Keywords are used to describe briefly and precisely the contents of a textual document. In this paper we present an algorithm for keyword extraction from documents written in Spanish. This algorithm combines autoencoders, which are adequate for highly unbalanced classification problems, with the discriminative power of conventional binary classifiers. In order to improve its performance on larger and more diverse datasets, our algorithm trains several models of each kind through bagging.
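A heavily simplified sketch of the hybrid idea using scikit-learn; the architecture, feature dimensionality, and the way the two scores are blended are all assumptions of this illustration, not the paper's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

def train_bagged_scorers(X, y, n_models=5, seed=0):
    """Train (autoencoder, classifier) pairs on bootstrap resamples.
    The autoencoder is fit to reconstruct positive (keyword) examples
    only, so low reconstruction error suggests a keyword-like
    candidate; the logistic model adds a discriminative score."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample
        Xb, yb = X[idx], y[idx]
        ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000)
        ae.fit(Xb[yb == 1], Xb[yb == 1])                  # reconstruct positives
        clf = LogisticRegression(max_iter=1000).fit(Xb, yb)
        models.append((ae, clf))
    return models

def keyword_scores(models, X):
    """Average a crude blend of the two scores over the bag."""
    total = np.zeros(len(X))
    for ae, clf in models:
        recon_err = np.mean((ae.predict(X) - X) ** 2, axis=1)
        total += clf.predict_proba(X)[:, 1] - recon_err
    return total / len(models)
```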
2016
This chapter focuses on computational approaches to the automatic extraction of terms from domain specific corpora. The different subtasks of Automatic Term Extraction are presented in detail, including corpus compilation, unithood, termhood and variant detection, and system evaluation.
Terminology, 2007
Term extraction may be defined as a text mining activity whose main purpose is to obtain all the terms included in a text of a given domain. Since the eighties, and mainly due to rapid scientific advances as well as the evolution of communication systems, there has been a growing interest in obtaining the terms found in written documents. A number of techniques and strategies have been proposed for satisfying this requirement. At present it seems that term extraction has reached a maturity stage. Nevertheless, many of the proposed systems fail to present their results in a qualitatively sound way: almost every system evaluates its abilities in an ad hoc manner, if at all. Often, the authors do not explain their evaluation methodology; therefore comparisons between different implementations are difficult to draw. In this paper, we review the state of the art of term extraction system evaluation in the framework of natural language systems evaluation. The main approaches are presented...
2015
This paper tests two different strategies for medical term extraction in an Arabic medical corpus. The experiments and the corpus are developed within the framework of the Multimedica project, funded by the Spanish Ministry of Science and Innovation and aiming at developing multilingual resources and tools for processing newswire texts in the health domain. The first experiment uses a fixed list of medical terms; the second uses a list of Arabic equivalents of a very limited set of common Latin prefixes and suffixes used in medical terms. Results show that using equivalents of Latin suffixes and prefixes outperforms the fixed list. The paper starts with an introduction, followed by a description of the state of the art in the field of Arabic medical Language Resources (LRs). The third section describes the corpus and its characteristics. The fourth and fifth sections explain the lists used and the results of the experiments carried out on a sub-corpus for evaluation. The last section...
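The affix experiment can be pictured with the following sketch; the paper's lists are Arabic strings, so the Latin forms below are illustrative stand-ins, not the actual resources used:

```python
# Stand-in affix lists for illustration only.
PREFIXES = ("cardio", "neuro", "hemo", "gastro", "dermato")
SUFFIXES = ("itis", "osis", "ectomy", "ology", "emia")

def is_candidate(token):
    """Propose a token as a medical term candidate if it starts or
    ends with one of the medical affixes."""
    t = token.lower()
    return t.startswith(PREFIXES) or t.endswith(SUFFIXES)

candidates = [w for w in ["cardiology", "table", "gastritis"] if is_candidate(w)]
# -> ["cardiology", "gastritis"]
```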
2010
In this paper a machine learning approach is applied to Automatic Term Recognition (ATR). Similar approaches have been successfully used in Automatic Keyword Extraction (AKE). Using a dataset consisting of Swedish patent texts and validated terms belonging to these texts, unigrams and bigrams are extracted and annotated with linguistic and statistical feature values. Experiments using a varying ratio between positive and negative examples in the training data are conducted on the annotated n-grams. The results indicate that a machine learning approach is viable for ATR. Furthermore, a machine learning approach for bilingual ATR is discussed. Preliminary analysis, however, indicates that some modifications have to be made to apply the monolingual machine learning approach in a bilingual context.
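A small sketch of the ratio-varying setup: per-n-gram feature vectors are assumed to be precomputed (frequency, TF-IDF, POS-pattern flags, and the like; the concrete feature set is our assumption), and negatives are subsampled to a chosen positive:negative ratio before training any off-the-shelf classifier:

```python
import random

def subsample(vectors, labels, neg_per_pos, seed=0):
    """Build a training set with a fixed positive:negative ratio, as in
    the ratio-varying experiments described.  `vectors` are per-n-gram
    feature vectors; `labels` mark validated terms (1) vs others (0)."""
    rng = random.Random(seed)
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    rng.shuffle(neg)
    neg = neg[: len(pos) * neg_per_pos]      # keep neg_per_pos negatives per positive
    X = pos + neg
    y = [1] * len(pos) + [0] * len(neg)
    return X, y
```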
2006
We present the work done by the Elhuyar Foundation in the field of bilingual terminology extraction. The aim of this work is to develop techniques for the automatic extraction of pairs of equivalent terms from Spanish-Basque translation memories, and to implement those techniques in a prototype. Our approach is based on a previous monolingual extraction of term candidates in each language, the creation of candidate bigrams from both segments of the same translation unit, and, finally, the selection of the most likely pair of candidates, based mainly on statistical information (association measures) and cognates. In the first step, we use linguistic techniques for the extraction of term candidates. The result of our work is ELexBI, a prototype tool that can extract equivalent terms from Spanish-Basque translation memories. This work aims to be a contribution to corpus-based bilingual lexicography and terminology in Basque.
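The candidate-pairing step might look like this minimal sketch, with the Dice coefficient as the association measure and difflib string similarity as a crude cognate cue; the actual measures and weighting in ELexBI may differ:

```python
from difflib import SequenceMatcher

def dice(cooc, freq_src, freq_tgt):
    """Dice association between a source and a target candidate,
    counted over translation units: 2 * cooc / (f_src + f_tgt)."""
    return 2 * cooc / (freq_src + freq_tgt)

def pair_score(src, tgt, cooc, freq_src, freq_tgt, alpha=0.7):
    """Blend association strength with a cognate (string-similarity)
    cue.  The weight `alpha` and the use of SequenceMatcher are
    assumptions of this sketch."""
    cognate = SequenceMatcher(None, src, tgt).ratio()
    return alpha * dice(cooc, freq_src, freq_tgt) + (1 - alpha) * cognate
```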
2016
In this paper we present a rule-based method for multi-word term extraction that relies on extensive lexical resources in the form of electronic dictionaries and finite-state transducers for modelling various syntactic structures of multi-word terms. The same technology is used for lemmatization of extracted multi-word terms, which is unavoidable for highly inflected languages in order to pass extracted data to evaluators and subsequently to terminological e-dictionaries and databases. The approach is illustrated on a corpus of Serbian texts from the mining domain containing more than 600,000 simple word forms. Extracted and lemmatized multi-word terms are filtered in order to reject falsely offered lemmas and then ranked by introducing measures that combine linguistic and statistical information (C-Value, T-Score, LLR, and Keyness). Mean average precision for retrieval of MWU forms ranges from 0.789 to 0.804, while mean average precision of lemma production ranges from 0.956 to 0.9...
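Of the measures mentioned, keyness can be computed as a log-likelihood comparison of the candidate's frequency in the domain corpus against a reference corpus; the sketch below follows a standard formulation, not necessarily the paper's exact implementation:

```python
import math

def keyness(freq_study, freq_ref, size_study, size_ref):
    """Log-likelihood keyness of a candidate: observed frequencies in
    the study (domain) and reference corpora against their expected
    values under a shared rate."""
    expected_study = size_study * (freq_study + freq_ref) / (size_study + size_ref)
    expected_ref = size_ref * (freq_study + freq_ref) / (size_study + size_ref)
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll
```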
Investigación Bibliotecológica: Archivonomía, Bibliotecología e Información, 2015
Linguistic phenomena associated with the analysis of document content and employed for the purpose of organization and retrieval are well-visited objects of study in the field of library and information science. Language often acts as a gatekeeper, admitting or excluding people from gaining access to knowledge. As such, the terms used in the scientific and technical language of research need to be kept up and their behavior within the domain examined. Documental content analysis of scientific texts provides knowledge of specialized lexicons and their specific applications, while differentiating them from common use in order to establish indexing languages. Thus, as proposed herein, the application of lexicographic techniques to documental content analysis of non-specialized language yields the components needed to describe and extract lexical units of the specialized language.
NTCIR, 1999
We participated in the term recognition task, one of the subtasks covered by the NTCIR tmrec group. In this paper, we present the system used in this task and evaluate its term recognition results. We believe that terms are words that characterize the field's data and have the following three features: (1) they frequently appear in the target field's corpus; (2) they are not common terms in the target field; (3) they less frequently appear in other fields' corpora. Our system uses different fields' corpora and recognizes words with these features as terms. We extracted a term list by using two kinds of field corpora, the NACSIS Academic Conference Database and the MAINICHI newspaper database. We then analyzed the difference between our term list and the Manual-Candidates made by the NTCIR tmrec group. In this paper, we clarify what should be considered when recognizing terms. Furthermore, through comparative experiments based on the Manual-Candidates, we verify the importance of the indices used to extract a term list.
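The three features suggest a score that rewards high relative frequency in the target field's corpus and penalises presence in other fields' corpora; the ratio-of-relative-frequencies form below is our assumption, not the system's actual index:

```python
import math

def field_score(freq_field, field_size, freq_other, other_size):
    """Score a candidate by how much more often it occurs (relatively)
    in the target field's corpus than in the other fields' corpora."""
    rel_field = freq_field / field_size
    rel_other = (freq_other + 1) / other_size     # +1 smoothing for unseen words
    return rel_field * math.log(rel_field / rel_other)
```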