2004
We introduce a new feature selection method for text categorization. Our MMR-based method strives to reduce redundancy between features while maintaining information gain when selecting features for text categorization. Empirical results show that MMR-based feature selection is more effective than Koller and Sahami's method, a greedy feature selection method, and than conventional information gain, which is commonly used for feature selection in text categorization. Moreover, with MMR-based feature selection, conventional machine learning algorithms sometimes outperform SVM, which is known to give the best classification accuracy.
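The abstract gives no code; as an illustration only, the greedy MMR-style selection loop it describes could be sketched as follows, where `relevance` stands in for per-feature information gain and `redundancy` for a pairwise feature-similarity measure (all names and the trade-off parameter `lam` are illustrative assumptions, not the authors' implementation):

```python
def mmr_select(features, relevance, redundancy, k, lam=0.7):
    """Greedily pick k features, trading off relevance against
    redundancy with the already-selected set (MMR-style)."""
    selected = []
    candidates = set(features)
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda f: lam * relevance[f]
            - (1 - lam) * max((redundancy(f, s) for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=0.5`, a highly relevant but redundant feature can lose out to a less relevant, non-redundant one, which is exactly the behavior the abstract motivates.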
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2007
ABSTRACT Feature selection is an important task in text categorization, where irrelevant or noisy features are usually present and cause a loss in classifier performance. Feature selection in text categorization has usually been performed with a filtering approach that selects the features with the highest scores according to certain measures, drawn from the information retrieval, information theory and machine learning fields. Wrapper approaches are known to perform better at feature selection than filters, but they are time-consuming and sometimes infeasible, especially in text domains. However, a wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function can overcome these difficulties. The wrapper presented in this paper satisfies these properties. Since exploring a reduced number of subsets could yield less promising subsets, a hybrid approach that combines the wrapper method with some scoring measures allows more promising feature subsets to be explored. A comparison among several scoring measures, the wrapper method and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
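As a rough illustration of the filter-seeded wrapper idea (not the authors' actual algorithm), a forward-selection wrapper that only walks a filter-ranked candidate list, keeping a feature whenever the fast evaluation function improves, could be sketched as:

```python
def hybrid_forward_selection(ranked_features, evaluate, max_size):
    """Forward wrapper over a filter-ranked candidate list.

    ranked_features: features already ordered by a scoring measure (the filter).
    evaluate: fast classifier-based score of a candidate subset (the wrapper part).
    """
    subset, best_score = [], float("-inf")
    for f in ranked_features:
        trial = subset + [f]
        score = evaluate(trial)
        if score > best_score:          # keep f only if it helps
            subset, best_score = trial, score
        if len(subset) >= max_size:
            break
    return subset, best_score
```

Because the wrapper never looks beyond the filter's ranking, it explores at most `len(ranked_features)` subsets instead of an exponential search space, which is the efficiency property the abstract emphasizes.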
Expert Systems With Applications, 2007
With the development of the web, large numbers of documents are available on the Internet, and digital libraries, news sources and internal company data keep growing. Automatic text categorization is therefore increasingly important for dealing with massive data, but its major problem is the high dimensionality of the feature space. Although many text feature selection methods exist, we present another one to improve the performance of text categorization. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space. A new Gini index measure function is constructed and adapted to text categorization. Experimental results show that our improved Gini index performs better than other feature selection methods.
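One commonly cited text-adapted form of the Gini index scores a term t as the sum over classes of P(t|c)² · P(c|t)²; the paper's exact measure may differ, but under that assumption a minimal sketch looks like this (documents represented as sets of terms):

```python
def gini_text(docs, labels, term):
    """Gini-index-style score for one term:
    sum over classes c of P(term|c)^2 * P(c|term)^2
    (one common text-adapted form; an assumption here)."""
    term_total = sum(1 for d in docs if term in d)
    score = 0.0
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        with_term = sum(1 for d in class_docs if term in d)
        p_t_given_c = with_term / len(class_docs)
        p_c_given_t = with_term / term_total if term_total else 0.0
        score += p_t_given_c ** 2 * p_c_given_t ** 2
    return score
```

A term concentrated in one class scores near 1, while a term scattered evenly over the classes scores much lower, so ranking terms by this score and keeping the top ones reduces the dimensionality as described.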
International Journal of Mathematical Archive, 2013
With the rapid spread of the Internet and the increase in on-line information, the technology for automatically classifying huge amounts of diverse text information has come to play a very important role in today's world. In the 1990s, the performance of computers improved sharply and it became possible to handle large quantities of text data. This led to the machine learning approach, in which classifiers are created automatically from text data labeled with category labels. This approach provides excellent accuracy, reduces labor, and ensures conservative use of resources. In this communication, we discuss the important role that feature selection plays in text categorization (Yiming Yang and Jan O. Pedersen, 1997). We also deliberate on automatic feature selection methods commonly applied in text categorization, such as document frequency thresholding (DF), information gain (IG), mutual information (MI) and pointwise mutual information (PMI).
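Of the measures listed, pointwise mutual information has the simplest closed form, PMI(t, c) = log(P(t, c) / (P(t) P(c))); estimated from document counts it can be sketched as follows (parameter names are illustrative):

```python
import math

def pmi(n_tc, n_t, n_c, n):
    """Pointwise mutual information between term t and category c,
    estimated from document counts:
    n_tc: docs of category c containing t, n_t: docs containing t,
    n_c: docs of category c, n: total docs."""
    if n_tc == 0:
        return float("-inf")  # term never co-occurs with the category
    p_tc = n_tc / n
    p_t = n_t / n
    p_c = n_c / n
    return math.log(p_tc / (p_t * p_c))
```

Positive PMI means the term occurs in the category more often than independence would predict; DF thresholding, IG and MI differ mainly in how they aggregate such counts.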
2020
Feature selection methods select a small subset of relevant features from the original feature space by eliminating redundant or irrelevant features. In the process they also reduce the dimensionality of the feature space and improve the efficiency of data mining algorithms. In this paper, sixteen state-of-the-art feature selection methods, evaluated in prior work on different benchmark datasets for text categorization, are studied and their performance is summarized. Past research reveals that the performance of feature selection methods is dataset specific. In the present work, further experiments are carried out with these state-of-the-art methods over a unifying framework of benchmark datasets to evaluate and compare their performance on the same standards. The efficiency of the methods is evaluated on the basis of their performance with k-means clustering and KNN classification. The experiments reveal that the unsupervised feature selection method of Mu...
Lecture Notes in Computer Science, 2004
Text categorization problems usually contain a lot of noisy and irrelevant information. In this paper we propose applying some measures taken from the machine learning field to feature selection. The classifier used is Support Vector Machines. Experiments over two different corpora show that some of the new measures perform better than the traditional information theory measures.
Artificial Intelligence Research, 2016
An extensive empirical evaluation of classifiers and feature selection methods for text categorization is presented. More than 500 models were trained and tested using different combinations of corpora, term weighting schemes, numbers of features, feature selection methods and classifiers. The performance measures used were the micro-averaged F measure and classifier training time. The experiments used five benchmark corpora, three term weighting schemes, three feature selection methods and four classifiers. Results indicated only a slight performance improvement when using all features over only 20% of the features selected with Information Gain and Chi-Square; more importantly, this improvement was not statistically significant. Support Vector Machine with a linear kernel reigned supreme for text categorization, producing the highest F measures and low training times even in the presence of high class skew. We found a statistically significant difference between the performance of Support Vector Machine and the other classifiers on text categorization problems.
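For reference, the chi-square statistic used for term selection in studies like this is computed per term and category from a 2×2 contingency table of document counts; a minimal sketch:

```python
def chi_square(a, b, c, d):
    """Chi-square term-category statistic from 2x2 document counts:
    a: in category, contains term    b: not in category, contains term
    c: in category, lacks term       d: not in category, lacks term
    chi2 = n * (a*d - c*b)^2 / ((a+c)*(b+d)*(a+b)*(c+d))"""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0
```

A term independent of the category scores 0; a term perfectly aligned with it scores n, so keeping the top 20% of terms by this score is the kind of filtering the experiments compare against using all features.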
International Journal of Science and Research (IJSR)
The rapid growth of the World Wide Web has led to an explosive growth of information. As most information is stored as text, text mining has gained paramount importance. With the high availability of information from diverse sources, automatic categorization of documents has become a vital method for managing and organizing vast amounts of information and for knowledge discovery. Text classification is the task of assigning predefined categories to documents; its major challenges are classifier accuracy and the high dimensionality of the feature space. These problems can be addressed with feature selection (FS), the process of identifying a subset of the most useful features from the original set of features. FS aims at making text document classifiers more efficient and accurate, reducing computation time, improving prediction performance, and giving a better understanding of the data. This paper surveys text classification, several approaches to it, feature selection methods, and applications of text classification.
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09, 2009
In text categorization, feature selection (FS) is a strategy that aims at making text classifiers more efficient and accurate. However, when dealing with a new task, it is still difficult to quickly select a suitable method from the many FS methods provided by previous studies. In this paper, we propose a theoretical framework of FS methods based on two basic measurements: frequency measurement and ratio measurement. Six popular FS methods are then discussed in detail under this framework. Moreover, guided by our theoretical analysis, we propose a novel method called weighted frequency and odds (WFO) that combines the two measurements with trained weights. Experimental results on data sets from both topic-based and sentiment classification tasks show that the new method is robust across different tasks and numbers of selected features.
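As described, WFO combines a frequency term with an odds(-ratio) term under a trained weight; one plausible reading of that combination (an assumption here, not necessarily the paper's exact formula) is the weighted product P(t|c)^λ · log(P(t|c)/P(t|¬c))^(1−λ) when the term favors the category, and 0 otherwise:

```python
import math

def wfo(p_t_pos, p_t_neg, lam=0.5):
    """WFO-style score: frequency term P(t|c) and odds term
    log(P(t|c)/P(t|~c)) combined with weight lam (assumed form)."""
    if p_t_pos <= p_t_neg:
        return 0.0  # term does not favor the category
    return (p_t_pos ** lam) * (math.log(p_t_pos / p_t_neg) ** (1 - lam))
```

At `lam=1` the score reduces to pure frequency, at `lam=0` to a pure odds measure, which matches the abstract's claim that the two basic measurements are special cases of the combined method.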
2011
In this paper, we compare several aspects of automatic text categorization, including document representation, feature selection, three classifiers, and their application to text collections in two languages. Regarding the computational representation of documents, we compare the traditional bag of words with four alternative representations: bag of multiwords and bag of word prefixes of N characters (for N = 4, 5 and 6). Concerning feature selection, we compare the well-known Information Gain and Chi-Square metrics with a new one based on third-moment statistics, which enhances rare terms. As to the classifiers, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on the Mahalanobis distance. Finally, the study is language independent and was applied over two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
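The bag-of-word-prefixes representation mentioned above simply truncates each token to its first N characters before counting, which conflates inflected forms of the same stem; a minimal sketch:

```python
from collections import Counter

def bag_of_prefixes(tokens, n=4):
    """Bag-of-word-prefixes: count each token by its first n characters."""
    return Counter(token[:n] for token in tokens)
```

For example, "category" and "categories" collapse to the same feature "cate" at n = 4, which is one way such a representation reduces vocabulary size in morphologically rich languages like Portuguese.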
International Journal of Advanced Computer Science and Applications, 2017
Feature selection, which aims to determine and select the distinctive terms that best represent a document, is one of the most important steps of classification. With feature selection, the dimension of document vectors is reduced and consequently the duration of the process is shortened. In this study, feature selection methods were examined in terms of dimension reduction rates, classification success rates, and the relation between dimension reduction and classification success. As classifiers, kNN (k-Nearest Neighbors) and SVM (Support Vector Machines) were used. Five standard (Odds Ratio (OR), Mutual Information (MI), Information Gain (IG), Chi-Square (CHI) and Document Frequency (DF)), two combined (Union of Feature Selections (UFS) and Correlation of Union of Feature Selections (CUFS)) and one new (Sum of Term Frequency (STF)) feature selection methods were tested. The application was performed by selecting 100 to 1000 terms (in increments of 100 terms) from each class. kNN produced much better results than SVM. STF was found to be the most successful feature selection method considering the average values in both datasets. CUFS, a combined model, was found to reduce the dimension the most; accordingly, CUFS classified the documents more successfully with fewer terms and in a shorter time than many of the standard methods.
International Journal of Enterprise Network Management, 2017
Knowledge and Information Systems, 2005
Lecture Notes in Computer Science, 2007
International Journal of Machine Learning and Cybernetics, 2015
Information Processing & Management, 2006
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007
IEEE Transactions on Knowledge and Data Engineering, 2005
international journal for research in applied science and engineering technology ijraset, 2020
2017 Computing Conference, 2017
International Journal of Computer Applications, 2015
International Journal of Computer Applications, 2013