1999, NTCIR
We participated in the term recognition task, one of the subtasks covered by the NTCIR tmrec group. In this paper, we present the system used in this task and evaluate its term recognition results. We believe that terms are words that characterize a field's data and have the following three features: (1) they frequently appear in the target field's corpus; (2) they are not common terms in the target field; (3) they appear less frequently in other fields' corpora. Our system uses different field corpora and recognizes words with these features as terms. We extracted a term list by using two kinds of field corpora, the NACSIS Academic Conference Database and the MAINICHI newspaper database. We then analyzed the difference between our term list and the Manual-Candidates made by the NTCIR tmrec group. In this paper, we clarify what should be considered when recognizing terms. Furthermore, through comparative experiments based on Manual-Candidates, we verify the importance of the indices used to extract a term list.
2008
Automatic Term recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. From a large number of methodologies available in the literature only a few are able to handle both single and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well using the Genia corpus (a standard life science corpus). This indicates that choice and design of corpus has a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion in certain domains. As a result, algorithms that ignore single-word terms may cause problems to tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects and this means information extraction techniques need to be integrated into the term recognition process.
Terminology, 1996
Following the growing interest in "corpus-based" approaches to computational linguistics, a number of studies have recently appeared on the topic of automatic term recognition or extraction. Because a successful term-recognition method has to be based on proper insights into the nature of terms, studies of automatic term recognition not only contribute to the applications of computational linguistics but also to the theoretical foundation of terminology. Many studies on automatic term recognition treat interesting aspects of terms, but most of them are not well founded and described. This paper tries to give an overview of the principles and methods of automatic term recognition. For that purpose, two major trends are examined, i.e., studies in automatic recognition of significant elements for indexing mainly carried out in information-retrieval circles and current research in automatic term recognition in the field of computational linguistics.
Terminology, 2007
Term extraction may be defined as a text mining activity whose main purpose is to obtain all the terms included in a text of a given domain. Since the eighties, and mainly due to rapid scientific advances as well as the evolution of communication systems, there has been a growing interest in obtaining the terms found in written documents. A number of techniques and strategies have been proposed for satisfying this requirement. At present it seems that term extraction has reached a maturity stage. Nevertheless, many of the systems proposed fail to present their results qualitatively: almost every system evaluates its abilities in an ad hoc manner, if it evaluates them at all. Often, the authors do not explain their evaluation methodology; therefore comparisons between different implementations are difficult to draw. In this paper, we review the state of the art of term extraction system evaluation in the framework of natural language systems evaluation. The main approaches are presented...
2010
In the paper we present a method that allows an extraction of single-word terms for a specific domain. At the next stage these terms can be used as candidates for multi-word term extraction. The proposed method is based on comparison with general reference corpus using log-likelihood similarity. We also perform clustering of the extracted terms using k-means algorithm and cosine similarity measure. We made experiments using texts of the domain of computer science. The obtained term list is analyzed in detail.
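The comparison with a general reference corpus via log-likelihood that this abstract describes can be sketched roughly as follows. This is a minimal illustration of the Dunning/Rayson-style log-likelihood statistic for single-word term candidates, not the authors' implementation; the toy corpora and tokenisation are assumptions for demonstration only.

```python
import math
from collections import Counter

def log_likelihood(domain_counts, ref_counts):
    """Score each domain word by how much more frequent it is in the
    domain corpus than in the general reference corpus (log-likelihood).
    Higher scores suggest stronger single-word term candidates."""
    n1, n2 = sum(domain_counts.values()), sum(ref_counts.values())
    scores = {}
    for word, a in domain_counts.items():
        b = ref_counts.get(word, 0)
        # expected counts under the null hypothesis of equal relative frequency
        e1 = n1 * (a + b) / (n1 + n2)
        e2 = n2 * (a + b) / (n1 + n2)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[word] = ll
    return scores

# Toy corpora (assumed, whitespace-tokenised): a tiny "domain" text vs. general text.
domain = Counter("compiler parser compiler token grammar parser compiler".split())
general = Counter("the cat sat on the mat and the dog ran".split())
ranked = sorted(log_likelihood(domain, general).items(), key=lambda kv: -kv[1])
```

Words absent from the reference corpus but frequent in the domain corpus rise to the top of `ranked`, which matches the intuition behind using a contrastive reference corpus.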
Frontiers in Research Metrics and Analytics, 2018
The Termolator is an open-source high-performing terminology extraction system, available on Github. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks that contain out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes. The distributional component ranks such term chunks according to several metrics including: (a) a set of metrics that favors term chunks that are relatively more frequent in a "foreground" corpus about a single topic than they are in a "background" or multi-topic corpus; (b) a well-formedness score based on linguistic features; and (c) a relevance score which measures how often terms appear in articles and patents in a Yahoo web search. We analyse the contributions made by each of these components and show that all modules contribute to the system's performance, both in terms of the number and quality of terms identified. This paper expands upon previous publications about this research and includes descriptions of some of the improvements made since its initial release. This study also includes a comparison with another terminology extraction system available on-line, Termostat (Drouin, 2003). We found that the systems get comparable results when applied to small amounts of data: about 50% precision for a single foreground file (Einstein's Theory of Relativity). However, when running the system with 500 patent files as foreground, Termolator performed significantly better than Termostat. For 500 refrigeration patents, Termolator got 70% precision vs. Termostat's 52%. For 500 semiconductor patents, Termolator got 79% precision vs. Termostat's 51%.
… Linguistics and Intelligent …, 2009
The C-value/NC-value algorithm, a hybrid approach to automatic term recognition, was originally developed to extract multiword term candidates from specialised documents written in English. Here, we present three main modifications to this algorithm that affect how the obtained output is refined. The first modification aims to maximise the number of real terms in the list of candidates with a new approach for the stop-list application process. The second modification adapts the C-value calculation formula in order to consider single-word terms. The third modification changes how the term candidates are grouped, exploiting a lemmatised version of the input corpus. Additionally, the size of the candidate's context window is made variable. We also show the linguistic modifications necessary to apply this algorithm to the recognition of term candidates in Spanish.
2000
Technical terms (henceforth called terms) are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms. The second part, NC-value, gives: 1) a method for the extraction of term context words (words that tend to appear with terms); 2) the incorporation of information from term context words into the extraction of terms.

2 The C-value Approach

This section presents the C-value approach to multi-word ATR. C-value is a domain-independent method for multi-word ATR which aims to improve the extraction of nested terms. The method takes as input an SL corpus and produces a list of candidate multi-word terms, ordered by their termhood, which we also call C-value. The output list is evaluated by a domain expert. Since the candidate terms are ranked according to their termhood, the domain expert can scan the list starting from the top and go as far down it as time/money allow. The C-value approach combines linguistic and statistical information, emphasis being placed on the statistical part.
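The C-value termhood described above can be sketched as follows. This is a simplified reading of the published formula (length weight times frequency, discounted by the average frequency of longer candidates that contain the term nested inside them); the candidate dictionary is toy data, not from the paper.

```python
import math

def c_value(candidates):
    """candidates: dict mapping a multi-word term (tuple of words) to its
    corpus frequency. Returns a C-value termhood score per candidate.
    Single-word candidates get log2(1) = 0 weight, as in the original
    multi-word-only formulation."""
    scores = {}
    for term, freq in candidates.items():
        k = len(term)
        # longer candidates that contain this term as a contiguous nested subsequence
        containers = [f for t, f in candidates.items()
                      if len(t) > k and any(t[i:i + k] == term
                                            for i in range(len(t) - k + 1))]
        length_weight = math.log2(k)
        if containers:
            # discount by the average frequency of the containing terms
            scores[term] = length_weight * (freq - sum(containers) / len(containers))
        else:
            scores[term] = length_weight * freq
    return scores

# Toy candidate list (assumed frequencies): "real time" is nested in "real time clock".
scores = c_value({("real", "time"): 5,
                  ("real", "time", "clock"): 2,
                  ("soft", "margin"): 3})
```

Here the nested candidate ("real", "time") is penalised for the occurrences it owes to its container, which is the behaviour the abstract highlights.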
… The Future of …, 2009
Monolingual and multilingual terminology and collocation bases, covering a specific domain, used independently or integrated with other resources, have become a valuable electronic resource. Building of such resources could be assisted by automatic term extraction tools, combining statistical and linguistic approaches. In this paper, the research on term extraction from monolingual corpus is presented. The corpus consists of publicly accessible English legislative documents. In the paper, results of two hybrid approaches are compared: extraction using the TermeX tool and an automatic statistical extraction procedure followed by linguistic filtering through the open source linguistic engineering tool. The results have been elaborated through statistical measures of precision, recall, and F-measure.
2016
In the paper, we address the problem of recognition of non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on identification of multi-word phrases that are general terms and discourse function expressions. We tested several methods based on domain corpora comparison and a method based on contexts of phrases identified in a large corpus of general language. We compared the results of the methods to manual annotation. The results show that the task is quite hard as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method with the precision about 0.75 at the half of the tested list was the context based method using a modified contextual diversity coefficient.
Proceedings of the ninth conference on …, 1999
A novel technique for automatic thesaurus construction is proposed. It is based on the complementary use of two tools: (1) a Term Extraction tool that acquires term candidates from tagged corpora through a shallow grammar of noun phrases, and (2) a Term Clustering tool that groups syntactic variants (insertions). Experiments performed on corpora in three technical domains yield clusters of term candidates with precision rates between 93% and 98%.
1997
In this paper we present an approach for the extraction of multi-word terms from special language corpora. The new element is the incorporation of context information into the evaluation of candidate terms. This information is embedded in the C-value method in the form of statistical weights.

1 Introduction

Automatic term recognition (ATR) is the extraction of technical terms from special language corpora with the use of computers. Its applications include specialised dictionary construction and maintenance, human and machine translation, indexing in books and digital libraries, hypertext linking, text categorization, etc. ATR also offers the potential to work with large amounts of real data that could not be handled manually. We should note that by ATR we mean neither dictionary string matching nor term interpretation (which deals with the relations between terms and concepts). When ATR is concerned with single-word term extraction, domain-dependent linguistic information...
English for Specific Purposes World, 2014
Getting to know the terms in a specialised text certainly contributes to the understanding of the text itself. Their identification becomes essential precisely because of that reason and, owing to the large size of specialised corpora nowadays, the use of automatic term recognition (ATR) methods is fundamental when trying to extract the most characteristic terms in a given domain. However, these methods are not 100% effective and they must be validated before resorting to them so that the precision levels achieved are high enough for specialists to draw reliable conclusions on this type of vocabulary. This article presents the assessment of four different ATR methods on two specialised corpora of legal and telecommunication English. The methods selected, TF-IDF (term frequency-inverse document frequency), C-value (Frantzi and Ananiadou, 1999), TermoStat (Drouin, 2003) and Terminus 2.0 (Nazar and Cabré, 2012), were evaluated in terms of precision. The aim of this evaluation is to compare the results obtained in all cases and to conclude whether there exists a certain degree of domain-dependence as regards each of these methods.
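Of the four methods evaluated above, the TF-IDF baseline is the simplest to make concrete. The following is a minimal textbook sketch (raw term frequency times log inverse document frequency) over toy token lists; it is an illustration of the weighting scheme, not any of the evaluated systems.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict per document mapping
    each word to its tf-idf weight: (count / doc length) * log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

# Toy corpus (assumed): "a" occurs in both documents, "b" and "c" in one each.
weights = tf_idf([["a", "b"], ["a", "c"]])
```

A word occurring in every document gets weight zero, which is why plain TF-IDF tends to demote domain-wide common vocabulary rather than genuine terms shared across all texts.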
Information Retrieval Journal, 2016
We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different than what they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback-Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
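The pointwise KL divergence for informativeness and phraseness singled out above (in the style of Tomokiyo and Hurst) can be sketched as follows. All probabilities here are toy assumptions passed in by the caller; a real system would estimate them from foreground and background corpora.

```python
import math

def pkl_score(phrase, fg_phrase_p, fg_unigram_p, bg_phrase_p):
    """Pointwise KL divergence score for a candidate phrase.
    phraseness: contribution of the phrase's probability relative to the
    product of its unigram probabilities (does it cohere as a unit?).
    informativeness: contribution relative to its background-corpus
    probability (is it characteristic of the foreground domain?)."""
    indep = 1.0
    for w in phrase:
        indep *= fg_unigram_p[w]     # independence (bag-of-unigrams) estimate
    phraseness = fg_phrase_p * math.log(fg_phrase_p / indep)
    informativeness = fg_phrase_p * math.log(fg_phrase_p / bg_phrase_p)
    return phraseness + informativeness

# Assumed toy probabilities: a cohesive domain phrase vs. a generic collocation.
score_term = pkl_score(("term", "extraction"), 0.01,
                       {"term": 0.02, "extraction": 0.02}, 0.0001)
score_generic = pkl_score(("of", "the"), 0.02,
                          {"of": 0.05, "the": 0.06}, 0.02)
```

Because the score sums the two pointwise divergences, a phrase must be both cohesive and domain-specific to rank highly; the generic collocation loses on informativeness even though it is cohesive.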
In this paper we account for the main characteristics and performance of a number of recently developed term extraction systems. The analysed tools represent the main strategies followed by researchers in this area. All systems are analysed and compared against a set of technically relevant characteristics. In order to give a broader view of TE we use both extractor and detector to refer to the same notion; however, we are aware of the fact that some scholars attribute different meanings to these words. (In Bourigault, D.; Jacquemin, C.; L'Homme, M.-C. (2001) Recent Advances in Computational Terminology.)
2016
This chapter focuses on computational approaches to the automatic extraction of terms from domain specific corpora. The different subtasks of Automatic Term Extraction are presented in detail, including corpus compilation, unithood, termhood and variant detection, and system evaluation.
International Journal of English Studies, 2011
This paper argues in favor of a statistical approach to terminology extraction, general to all languages but with language specific parameters. In contrast to many application-oriented terminology studies, which are focused on a particular language and domain, this paper adopts some general principles of the statistical properties of terms and a method to obtain the corresponding language specific parameters. This method is used for the automatic identification of terminology and is quantitatively evaluated in an empirical study of English medical terms. The proposal is theoretically and computationally simple and disregards resources such as linguistic or ontological knowledge. The algorithm learns to identify terms during a training phase where it is shown examples of both terminological and non-terminological units. With these examples, the algorithm creates a model of the terminology that accounts for the frequency of lexical, morphological and syntactic elements of the terms in relation to the non-terminological vocabulary. The model is then used for the later identification of new terminology in previously unseen text. The comparative evaluation shows that performance is significantly higher than other well-known systems.
Terminology, 2007
Term extraction may be defined as a text mining activity whose main purpose is to obtain all the terms included in a text of a given domain. Since the eighties, and mainly due to the rapid scientific advances as well as the evolution of the communication systems, there has been a growing interest in obtaining the terms found in written documents. During this time, a number of techniques and strategies have been proposed for satisfying this requirement. At present it seems that term extraction has reached a maturity stage.
Natural Language Processing, 2001
Traditional methods of multi-word term extraction have used hybrid methods combining linguistic and statistical information. The linguistic part of these applications is often underexploited and consists of very shallow knowledge in the form of a simple syntactic filter. In most cases no interpretation of terms is undertaken and recognition does not involve distinguishing between different senses of terms, although ambiguity can be a serious problem for applications such as ontology building and machine translation. The approach described uses both statistical and linguistic information combining syntax and semantics to identify, rank and disambiguate terms. We describe a new thesaurus-based similarity measure which uses semantic information to calculate the importance of different parts of the context in relation to the term. Results show that making use of semantic information is beneficial for both theoretical and practical aspects of terminology.
Terminology management is a key component of many natural language processing activities such as machine translation (Langlais and Carl, 2004), text summarization and text indexation. With the rapid development of science and technology continuously increasing the number of technical terms, terminology management is certain to become of the utmost importance in more and more content-based applications.
Advanced Linguistics, 2019
Nowadays translation processes are becoming more unified, and translators rely not only on their knowledge and sense of language but also on various software tools that facilitate translation. This article is devoted to one branch of such software, automatic term extraction systems, which are an essential part of developing lexicographic resources for the translation of term-rich texts. The need to choose among the variety of available programs therefore arises, and the results of this research, i.e. a comparison of the functions of different programs, are described in this article. Several criteria by which the quality of term extraction can be measured have been compared, e.g. the speed of extraction, the "purity" of the output term list, whether the extracted lexical material meets the requirements placed on terms, and the proportion of irrelevant items produced by the extraction systems together with the factors influencing it. The advantages and disadvantages of cloud and desktop services have also been investigated and compared. The main difficulty noted is that programs are still unable to distinguish between word forms, so texts undergoing extraction require auxiliary procedures such as POS-tagging, lemmatization and tokenization. Another obstacle is the inability of certain programs to distinguish between compound terms and simple word combinations. The key points of the research may be used in translation studies courses, in research devoted to "smart" or electronic lexicography, and by translators in general, who may use these term extraction systems to build or unify the required glossary during translation.