1999, NTCIR
We participated in the term recognition task, one of the subtasks covered by the NTCIR tmrec group. In this paper, we present the system used in this task and evaluate its term recognition results. We believe that terms are words that characterize a field's data and have the following three features: (1) they frequently appear in the target field's corpus; (2) they are not common terms in the target field; (3) they appear less frequently in other fields' corpora. Our system uses different field corpora and recognizes words with these features as terms. We extracted a term list by using two kinds of field corpora, the NACSIS Academic Conference Database and the MAINICHI newspaper database. We then analyzed the difference between our term list and the Manual-Candidates made by the NTCIR tmrec group. In this paper, we clarify what should be considered when recognizing terms. Furthermore, through comparative experiments based on Manual-Candidates, we verify the importance of the indices used to extract a term list.
2008
Automatic Term recognition (ATR) is a fundamental processing step preceding more complex tasks such as semantic search and ontology learning. From a large number of methodologies available in the literature only a few are able to handle both single and multi-word terms. In this paper we present a comparison of five such algorithms and propose a combined approach using a voting mechanism. We evaluated the six approaches using two different corpora and show how the voting algorithm performs best on one corpus (a collection of texts from Wikipedia) and less well using the Genia corpus (a standard life science corpus). This indicates that choice and design of corpus has a major impact on the evaluation of term recognition algorithms. Our experiments also showed that single-word terms can be equally important and occupy a fairly large proportion in certain domains. As a result, algorithms that ignore single-word terms may cause problems to tasks built on top of ATR. Effective ATR systems also need to take into account both the unstructured text and the structured aspects and this means information extraction techniques need to be integrated into the term recognition process.
Terminology, 1996
Following the growing interest in "corpus-based" approaches to computational linguistics, a number of studies have recently appeared on the topic of automatic term recognition or extraction. Because a successful term-recognition method has to be based on proper insights into the nature of terms, studies of automatic term recognition not only contribute to the applications of computational linguistics but also to the theoretical foundation of terminology. Many studies on automatic term recognition treat interesting aspects of terms, but most of them are not well founded and described. This paper tries to give an overview of the principles and methods of automatic term recognition. For that purpose, two major trends are examined, i.e., studies in automatic recognition of significant elements for indexing mainly carried out in information-retrieval circles and current research in automatic term recognition in the field of computational linguistics.
Terminology, 2007
Term extraction may be defined as a text mining activity whose main purpose is to obtain all the terms included in a text of a given domain. Since the eighties, and mainly due to rapid scientific advances as well as the evolution of communication systems, there has been a growing interest in obtaining the terms found in written documents. A number of techniques and strategies have been proposed for satisfying this requirement. At present it seems that term extraction has reached a maturity stage. Nevertheless, many of the systems proposed fail to present their results qualitatively: almost every system evaluates its abilities in an ad hoc manner, if it evaluates them at all. Often, the authors do not explain their evaluation methodology; therefore comparisons between different implementations are difficult to draw. In this paper, we review the state of the art of term extraction system evaluation in the framework of natural language systems evaluation. The main approaches are presented...
2010
In the paper we present a method that allows an extraction of single-word terms for a specific domain. At the next stage these terms can be used as candidates for multi-word term extraction. The proposed method is based on comparison with general reference corpus using log-likelihood similarity. We also perform clustering of the extracted terms using k-means algorithm and cosine similarity measure. We made experiments using texts of the domain of computer science. The obtained term list is analyzed in detail.
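The comparison with a general reference corpus via log-likelihood that this abstract describes can be sketched roughly as follows. This is a minimal illustration of the Dunning/Rayson-style log-likelihood statistic for single-word term candidates, not the authors' implementation; the toy corpora and tokenisation are assumptions for demonstration only.

```python
import math
from collections import Counter

def log_likelihood(domain_counts, ref_counts):
    """Score each domain word by how much more frequent it is in the
    domain corpus than in the general reference corpus (log-likelihood).
    Higher scores suggest stronger single-word term candidates."""
    n1, n2 = sum(domain_counts.values()), sum(ref_counts.values())
    scores = {}
    for word, a in domain_counts.items():
        b = ref_counts.get(word, 0)
        # expected counts under the null hypothesis of equal relative frequency
        e1 = n1 * (a + b) / (n1 + n2)
        e2 = n2 * (a + b) / (n1 + n2)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0.0))
        scores[word] = ll
    return scores

# Toy corpora (assumed, whitespace-tokenised): a tiny "domain" text vs. general text.
domain = Counter("compiler parser compiler token grammar parser compiler".split())
general = Counter("the cat sat on the mat and the dog ran".split())
ranked = sorted(log_likelihood(domain, general).items(), key=lambda kv: -kv[1])
```

Words absent from the reference corpus but frequent in the domain corpus rise to the top of `ranked`, which matches the intuition behind using a contrastive reference corpus.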
Frontiers in Research Metrics and Analytics, 2018
The Termolator is an open-source high-performing terminology extraction system, available on Github. The Termolator combines several different approaches to get superior coverage and precision. The in-line term component identifies potential instances of terminology using a chunking procedure, similar to noun group chunking, but favoring chunks that contain out-of-vocabulary words, nominalizations, technical adjectives, and other specialized word classes. The distributional component ranks such term chunks according to several metrics including: (a) a set of metrics that favors term chunks that are relatively more frequent in a "foreground" corpus about a single topic than they are in a "background" or multi-topic corpus; (b) a well-formedness score based on linguistic features; and (c) a relevance score which measures how often terms appear in articles and patents in a Yahoo web search. We analyse the contributions made by each of these components and show that all modules contribute to the system's performance, both in terms of the number and quality of terms identified. This paper expands upon previous publications about this research and includes descriptions of some of the improvements made since its initial release. This study also includes a comparison with another terminology extraction system available on-line, Termostat (Drouin, 2003). We found that the systems get comparable results when applied to small amounts of data: about 50% precision for a single foreground file (Einstein's Theory of Relativity). However, when running the system with 500 patent files as foreground, Termolator performed significantly better than Termostat. For 500 refrigeration patents, Termolator got 70% precision vs. Termostat's 52%. For 500 semiconductor patents, Termolator got 79% precision vs. Termostat's 51%.
… Linguistics and Intelligent …, 2009
The C-value/NC-value algorithm, a hybrid approach to automatic term recognition, was originally developed to extract multiword term candidates from specialised documents written in English. Here, we present three main modifications to this algorithm that affect how the obtained output is refined. The first modification aims to maximise the number of real terms in the list of candidates with a new approach for the stop-list application process. The second modification adapts the C-value calculation formula in order to consider single-word terms. The third modification changes how the term candidates are grouped, exploiting a lemmatised version of the input corpus. Additionally, the size of the candidate's context window is made variable. We also show the linguistic modifications necessary to apply this algorithm to the recognition of term candidates in Spanish.
2000
Technical terms (henceforth called terms) are important elements for digital libraries. In this paper we present a domain-independent method for the automatic extraction of multi-word terms from machine-readable special language corpora. The method (C-value/NC-value) combines linguistic and statistical information. The first part, C-value, enhances the common statistical measure of frequency of occurrence for term extraction, making it sensitive to a particular type of multi-word terms, the nested terms. The second part, NC-value, gives: 1) a method for the extraction of term context words (words that tend to appear with terms); 2) the incorporation of information from term context words into the extraction of terms.

2 The C-value Approach

This section presents the C-value approach to multi-word ATR. C-value is a domain-independent method for multi-word ATR which aims to improve the extraction of nested terms. The method takes as input an SL corpus and produces a list of candidate multi-word terms, ordered by their termhood, which we also call C-value. The output list is evaluated by a domain expert. Since the candidate terms are ranked according to their termhood, the domain expert can scan the list starting from the top and go as far down it as time/money allow. The C-value approach combines linguistic and statistical information, emphasis being placed on the statistical part.
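The C-value termhood described above can be sketched as follows. This is a simplified reading of the published formula (length weight times frequency, discounted by the average frequency of longer candidates that contain the term nested inside them); the candidate dictionary is toy data, not from the paper.

```python
import math

def c_value(candidates):
    """candidates: dict mapping a multi-word term (tuple of words) to its
    corpus frequency. Returns a C-value termhood score per candidate.
    Single-word candidates get log2(1) = 0 weight, as in the original
    multi-word-only formulation."""
    scores = {}
    for term, freq in candidates.items():
        k = len(term)
        # longer candidates that contain this term as a contiguous nested subsequence
        containers = [f for t, f in candidates.items()
                      if len(t) > k and any(t[i:i + k] == term
                                            for i in range(len(t) - k + 1))]
        length_weight = math.log2(k)
        if containers:
            # discount by the average frequency of the containing terms
            scores[term] = length_weight * (freq - sum(containers) / len(containers))
        else:
            scores[term] = length_weight * freq
    return scores

# Toy candidate list (assumed frequencies): "real time" is nested in "real time clock".
scores = c_value({("real", "time"): 5,
                  ("real", "time", "clock"): 2,
                  ("soft", "margin"): 3})
```

Here the nested candidate ("real", "time") is penalised for the occurrences it owes to its container, which is the behaviour the abstract highlights.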
… The Future of …, 2009
Monolingual and multilingual terminology and collocation bases, covering a specific domain, used independently or integrated with other resources, have become a valuable electronic resource. Building of such resources could be assisted by automatic term extraction tools, combining statistical and linguistic approaches. In this paper, the research on term extraction from monolingual corpus is presented. The corpus consists of publicly accessible English legislative documents. In the paper, results of two hybrid approaches are compared: extraction using the TermeX tool and an automatic statistical extraction procedure followed by linguistic filtering through the open source linguistic engineering tool. The results have been elaborated through statistical measures of precision, recall, and F-measure.
2016
In the paper, we address the problem of recognition of non-domain phrases in terminology lists obtained with an automatic term extraction tool. We focus on identification of multi-word phrases that are general terms and discourse function expressions. We tested several methods based on domain corpora comparison and a method based on contexts of phrases identified in a large corpus of general language. We compared the results of the methods to manual annotation. The results show that the task is quite hard as the inter-annotator agreement is low. Several tested methods achieved similar overall results, although the phrase ordering varied between methods. The most successful method with the precision about 0.75 at the half of the tested list was the context based method using a modified contextual diversity coefficient.
Proceedings of the ninth conference on …, 1999
A novel technique for automatic thesaurus construction is proposed. It is based on the complementary use of two tools: (1) a Term Extraction tool that acquires term candidates from tagged corpora through a shallow grammar of noun phrases, and (2) a Term Clustering tool that groups syntactic variants (insertions). Experiments performed on corpora in three technical domains yield clusters of term candidates with precision rates between 93% and 98%.
1997
In this paper we present an approach for the extraction of multi-word terms from special language corpora. The new element is the incorporation of context information into the evaluation of candidate terms. This information is embedded in the C-value method in the form of statistical weights.

1 Introduction

Automatic term recognition (ATR) is the extraction of technical terms from special language corpora with the use of computers. Its applications include specialised dictionary construction and maintenance, human and machine translation, indexing in books and digital libraries, hypertext linking, text categorization, etc. ATR also offers the potential to work with large amounts of real data that could not be handled manually. We should note that by ATR we mean neither dictionary string matching nor term interpretation (which deals with the relations between terms and concepts). When ATR is concerned with single-word term extraction, domain-dependent linguistic information...
English for Specific Purposes World, 2014
Getting to know the terms in a specialised text certainly contributes to the understanding of the text itself. Their identification becomes essential precisely because of that reason and, owing to the large size of specialised corpora nowadays, the use of automatic term recognition (ATR) methods is fundamental when trying to extract the most characteristic terms in a given domain. However, these methods are not 100% effective and they must be validated before resorting to them so that the precision levels achieved are high enough for specialists to draw reliable conclusions on this type of vocabulary. This article presents the assessment of four different ATR methods on two specialised corpora of legal and telecommunication English. The methods selected, TF-IDF (term frequency-inverse document frequency), C-value (Frantzi and Ananiadou, 1999), TermoStat (Drouin, 2003) and Terminus 2.0 (Nazar and Cabré, 2012), were evaluated in terms of precision. The aim of this evaluation is to compare the results obtained in all cases and to conclude whether there exists a certain degree of domain-dependence as regards each of these methods.
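Of the four methods evaluated above, the TF-IDF baseline is the simplest to make concrete. The following is a minimal textbook sketch (raw term frequency times log inverse document frequency) over toy token lists; it is an illustration of the weighting scheme, not any of the evaluated systems.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one dict per document mapping
    each word to its tf-idf weight: (count / doc length) * log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (c / len(doc)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

# Toy corpus (assumed): "a" occurs in both documents, "b" and "c" in one each.
weights = tf_idf([["a", "b"], ["a", "c"]])
```

A word occurring in every document gets weight zero, which is why plain TF-IDF tends to demote domain-wide common vocabulary rather than genuine terms shared across all texts.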
Information Retrieval Journal, 2016
We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different than what they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback-Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
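The pointwise KL divergence for informativeness and phraseness singled out above (in the style of Tomokiyo and Hurst) can be sketched as follows. All probabilities here are toy assumptions passed in by the caller; a real system would estimate them from foreground and background corpora.

```python
import math

def pkl_score(phrase, fg_phrase_p, fg_unigram_p, bg_phrase_p):
    """Pointwise KL divergence score for a candidate phrase.
    phraseness: contribution of the phrase's probability relative to the
    product of its unigram probabilities (does it cohere as a unit?).
    informativeness: contribution relative to its background-corpus
    probability (is it characteristic of the foreground domain?)."""
    indep = 1.0
    for w in phrase:
        indep *= fg_unigram_p[w]     # independence (bag-of-unigrams) estimate
    phraseness = fg_phrase_p * math.log(fg_phrase_p / indep)
    informativeness = fg_phrase_p * math.log(fg_phrase_p / bg_phrase_p)
    return phraseness + informativeness

# Assumed toy probabilities: a cohesive domain phrase vs. a generic collocation.
score_term = pkl_score(("term", "extraction"), 0.01,
                       {"term": 0.02, "extraction": 0.02}, 0.0001)
score_generic = pkl_score(("of", "the"), 0.02,
                          {"of": 0.05, "the": 0.06}, 0.02)
```

Because the score sums the two pointwise divergences, a phrase must be both cohesive and domain-specific to rank highly; the generic collocation loses on informativeness even though it is cohesive.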
In this paper we account for the main characteristics and performance of a number of recently developed term extraction systems. The analysed tools represent the main strategies followed by researchers in this area. All systems are analysed and compared against a set of technically relevant characteristics. In order to give a broader view of TE we use both extractor and detector to refer to the same notion; however, we are aware of the fact that some scholars attribute different meanings to these words. (In Bourigault, D.; Jacquemin, C.; L'Homme, M.-C. (2001) Recent Advances in Computational Terminology.)
2016
This chapter focuses on computational approaches to the automatic extraction of terms from domain specific corpora. The different subtasks of Automatic Term Extraction are presented in detail, including corpus compilation, unithood, termhood and variant detection, and system evaluation.
International Journal of English Studies, 2011
This paper argues in favor of a statistical approach to terminology extraction, general to all languages but with language specific parameters. In contrast to many application-oriented terminology studies, which are focused on a particular language and domain, this paper adopts some general principles of the statistical properties of terms and a method to obtain the corresponding language specific parameters. This method is used for the automatic identification of terminology and is quantitatively evaluated in an empirical study of English medical terms. The proposal is theoretically and computationally simple and disregards resources such as linguistic or ontological knowledge. The algorithm learns to identify terms during a training phase where it is shown examples of both terminological and non-terminological units. With these examples, the algorithm creates a model of the terminology that accounts for the frequency of lexical, morphological and syntactic elements of the terms in relation to the non-terminological vocabulary. The model is then used for the later identification of new terminology in previously unseen text. The comparative evaluation shows that performance is significantly higher than other well-known systems.
Terminology, 2007
Term extraction may be defined as a text mining activity whose main purpose is to obtain all the terms included in a text of a given domain. Since the eighties, and mainly due to the rapid scientific advances as well as the evolution of the communication systems, there has been a growing interest in obtaining the terms found in written documents. During this time, a number of techniques and strategies have been proposed for satisfying this requirement. At present it seems that term extraction has reached a maturity stage.
Natural Language Processing, 2001
Traditional methods of multi-word term extraction have used hybrid methods combining linguistic and statistical information. The linguistic part of these applications is often underexploited and consists of very shallow knowledge in the form of a simple syntactic filter. In most cases no interpretation of terms is undertaken and recognition does not involve distinguishing between different senses of terms, although ambiguity can be a serious problem for applications such as ontology building and machine translation. The approach described uses both statistical and linguistic information combining syntax and semantics to identify, rank and disambiguate terms. We describe a new thesaurus-based similarity measure which uses semantic information to calculate the importance of different parts of the context in relation to the term. Results show that making use of semantic information is beneficial for both theoretical and practical aspects of terminology.
Terminology management is a key component of many natural language processing activities such as machine translation (Langlais and Carl, 2004), text summarization and text indexation. With the rapid development of science and technology continuously increasing the number of technical terms, terminology management is certain to become of the utmost importance in more and more content-based applications.
Advanced Linguistics, 2019
Nowadays translation processes are becoming more unified, and translators rely not only on their knowledge and sense of language but also on various software tools that facilitate translation. This article is devoted to one branch of such software, automatic term extraction systems, which are an essential part of developing lexicographic resources for the translation of term-rich texts. The need to choose among the variety of available programs therefore arises, and the results of this research, i.e. a comparison of the functions of different programs, are described in this article. Several criteria by which the quality of term extraction can be measured have been compared, e.g. the speed of extraction, the "purity" of the output term list, whether the extracted lexical material meets the requirements placed on terms, and the proportion of irrelevant items produced by the extraction systems together with the factors influencing it. The advantages and disadvantages of cloud and desktop services have also been investigated and compared. The main difficulty noted is that programs are still unable to distinguish between word forms, so texts undergoing extraction require auxiliary procedures such as POS-tagging, lemmatization and tokenization. Another obstacle is the inability of certain programs to distinguish between compound terms and simple word combinations. The key points of the research may be used in translation studies courses, in research devoted to "smart" or electronic lexicography, and by translators in general, who may use these term extraction systems to build or unify the required glossary during translation.