2007, Expert Systems With Applications
With the growth of the web, enormous numbers of documents are available on the Internet, and digital libraries, news sources and internal company data continue to expand. Automatic text categorization has therefore become increasingly important for dealing with massive data. The major problem of text categorization, however, is the high dimensionality of the feature space. Many methods for text feature selection already exist; to improve categorization performance, we present another. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space, constructing a new Gini index measure function fitted to text categorization. Experimental results show that our improved Gini index performs better than other feature selection methods.
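To make the idea concrete, here is a minimal sketch of one Gini-index-style term score that appears in the text categorization literature, Gini(t) = sum_i P(t|c_i)^2 * P(c_i|t)^2. The paper's exact measure function may differ, so treat the formula and the selection cutoff as assumptions.

```python
import numpy as np

def gini_scores(X, y):
    """Gini-index style term score: Gini(t) = sum_i P(t|c_i)^2 * P(c_i|t)^2.

    X : (n_docs, n_terms) binary numpy document-term matrix
    y : (n_docs,) numpy array of integer class labels
    """
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    term_df = X.sum(axis=0)                      # docs containing each term
    for c in classes:
        Xc = X[y == c]
        p_t_given_c = Xc.sum(axis=0) / max(len(Xc), 1)         # P(t|c)
        p_c_given_t = Xc.sum(axis=0) / np.maximum(term_df, 1)  # P(c|t)
        scores += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms."""
    return np.argsort(scores)[::-1][:k]
```

Terms whose occurrences concentrate in a single class receive high scores, which is exactly the behaviour a class-discriminating feature selector should reward.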
2008 International Conference on Natural Language Processing and Knowledge Engineering, 2008
With the development of the web, large numbers of documents are put on the Internet, and more and more digital libraries, news sources and internal company data become available. Automatic text categorization is therefore increasingly important for dealing with massive data. However, text preprocessing is still the bottleneck of text categorization based on the Vector Space Model (VSM): its result directly affects the performance and precision of categorization, and feature selection and feature weighting are its major obstacles. In this paper we focus on feature weighting. We present a novel feature weighting algorithm, TF-Gini, that improves categorization performance significantly. Experimental results verify the effectiveness of this algorithm.
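A plausible reading of such a scheme is that the IDF factor of TF-IDF is replaced by a corpus-level Gini score per term. The sketch below illustrates that reading only; the paper's exact formulation is not reproduced, and the gini_scores helper is the one sketched above.

```python
import numpy as np

def tf_gini_weights(X_tf, gini):
    """Weight each (doc, term) cell by raw term frequency times the term's
    corpus-level Gini score, in place of TF-IDF.

    X_tf : (n_docs, n_terms) raw term-frequency matrix
    gini : (n_terms,) per-term Gini scores (e.g. from gini_scores above)
    """
    W = X_tf * gini[np.newaxis, :]            # tf(t, d) * Gini(t)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)       # L2-normalise each document
```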
International Journal of Mathematical Archive, 2013
With the rapid spread of the Internet and the increase in on-line information, the technology for automatically classifying huge amounts of diverse text information plays a very important role in today's world. In the 1990s, the performance of computers improved sharply and it became possible to handle large quantities of text data. This led to the machine learning approach, which creates classifiers automatically from text data labelled with categories. This approach provides excellent accuracy, reduces labor, and ensures conservative use of resources. In this communication, we discuss the important role feature selection plays in text categorization (Yiming Yang, Jan O. Pedersen, 1997). We also deliberate on automatic feature selection methods commonly applied in text categorization, such as document frequency thresholding (DF), Information Gain (IG), Mutual Information (MI) and Pointwise Mutual Information (PMI).
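As a reference point, the sketch below computes these classical measures from per-term 2x2 document counts (A: docs in class c containing term t; B: docs outside c containing t; C: docs in c without t; D: docs outside c without t). The formulas are the standard ones from Yang & Pedersen (1997); the smoothing constant eps is an illustrative choice, not from the paper.

```python
import numpy as np

def contingency(X, y, c):
    """Per-term 2x2 document counts against class c (X is a binary
    document-term matrix, y the vector of class labels)."""
    y = np.asarray(y)
    in_c = (y == c)
    A = X[in_c].sum(axis=0).astype(float)      # docs in c with t
    B = X[~in_c].sum(axis=0).astype(float)     # docs outside c with t
    C = in_c.sum() - A                         # docs in c without t
    D = (~in_c).sum() - B                      # docs outside c without t
    return A, B, C, D

def document_frequency(X):
    """DF(t): number of documents containing t; threshold it to select."""
    return X.sum(axis=0)

def pmi(X, y, c, eps=1e-12):
    """PMI(t, c) = log P(t, c) / (P(t) P(c))."""
    A, B, C, D = contingency(X, y, c)
    N = A + B + C + D
    return np.log((A * N + eps) / ((A + B) * (A + C) + eps))

def information_gain(X, y, eps=1e-12):
    """IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t)."""
    classes = np.unique(y)
    N = float(len(y))
    p_c = np.array([(np.asarray(y) == c).sum() for c in classes]) / N
    H_C = -(p_c * np.log(p_c + eps)).sum()
    p_t = X.sum(axis=0) / N
    H_C_t = np.zeros(X.shape[1])
    H_C_not = np.zeros(X.shape[1])
    for c in classes:
        A, B, C_, D = contingency(X, y, c)
        p_c_t = A / np.maximum(A + B, 1.0)      # P(c|t)
        p_c_nt = C_ / np.maximum(C_ + D, 1.0)   # P(c|~t)
        H_C_t -= p_c_t * np.log(p_c_t + eps)
        H_C_not -= p_c_nt * np.log(p_c_nt + eps)
    return H_C - p_t * H_C_t - (1 - p_t) * H_C_not
```

The class-averaged or max-combined MI variants used in that line of work can be derived from pmi by weighting or maximising over classes c.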
International Journal of Enterprise Network Management, 2017
The curse of dimensionality has made dimension reduction an essential step in text categorisation, and feature selection is one approach to it. In this paper an analysis of feature selection measures for text categorisation is performed. Under the unsupervised approach, document frequency is considered; under the supervised approach, chi-square, odds ratio, mutual information and information gain are considered, since these are widely used and effective measures. The analysis uses the 20 newsgroups dataset, which contains closely related as well as highly unrelated categories. Selected categories of the dataset are organised into three groups: overlapping (highly related) classes, non-overlapping (highly unrelated) classes, and a combination of overlapping and non-overlapping classes. Feature selection and subsequent classification are applied to the three groups separately and classification performance is studied for each measure. The most noticeable behaviour was that of the odds ratio measure: it performed well for the non-overlapping and overlapping groups considered separately, but poorly for the group containing both overlapping and non-overlapping categories. The remaining measures behaved consistently across all three groups. Classification was achieved using a support vector machine classifier, and the performance comparisons of the different measures on the different groups are presented in terms of micro-F1 and macro-F1.
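For completeness, the two supervised measures not sketched earlier can be written over the same 2x2 counts, reusing the contingency helper from the sketch above. These are the textbook definitions; the eps guard is an illustrative smoothing choice.

```python
import numpy as np

def chi_square(X, y, c):
    """chi2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    A, B, C, D = contingency(X, y, c)   # helper from the earlier sketch
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / np.maximum(den, 1e-12)

def odds_ratio(X, y, c, eps=1e-12):
    """OR(t, c) = log [P(t|c)(1 - P(t|~c))] / [(1 - P(t|c)) P(t|~c)]
    = log (A * D) / (C * B)."""
    A, B, C, D = contingency(X, y, c)
    return np.log((A * D + eps) / (C * B + eps))
```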
2010
Text document categorization involves a large number of features, and this high dimensionality is troublesome: it can degrade classification performance. Feature selection is therefore considered one of the crucial steps in text document categorization. Selecting the best features to represent documents reduces the dimensionality of the feature space and hence increases performance. Many approaches have been implemented by various researchers to overcome this problem. This paper proposes a novel hybrid approach for feature selection in text document categorization based on Ant Colony Optimization (ACO) and Information Gain (IG), and also reviews state-of-the-art algorithms by several other researchers.
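A compressed sketch of how such an ACO + IG hybrid might work: IG scores act as the heuristic desirability, ants sample feature subsets probabilistically, and pheromone is reinforced on the best subset found. All parameters and the update rule here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def aco_ig_select(X, y, ig, n_select, evaluate, n_ants=10, n_iter=20,
                  alpha=1.0, beta=2.0, rho=0.1, rng=None):
    """Illustrative ACO feature selection guided by information gain.

    ig       : per-term IG scores (e.g. from the information_gain sketch above)
    evaluate : callable(subset) -> classification score in [0, 1] to maximise
    """
    rng = rng or np.random.default_rng(0)
    n_terms = len(ig)
    tau = np.ones(n_terms)                      # pheromone trail per feature
    eta = (ig + 1e-12) / (ig.max() + 1e-12)     # normalised heuristic
    best_subset, best_score = None, -np.inf
    for _ in range(n_iter):
        for _ant in range(n_ants):
            p = (tau ** alpha) * (eta ** beta)
            p = p / p.sum()
            subset = rng.choice(n_terms, size=n_select, replace=False, p=p)
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
        tau *= (1 - rho)                        # pheromone evaporation
        tau[best_subset] += rho * best_score    # reinforce the best subset
    return best_subset, best_score
```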
Text categorization is the task of automatically sorting a set of documents into categories from a predefined set, i.e., assigning predefined categories to free-text documents. In this paper we propose a two-stage feature selection method for text categorization using information gain, principal component analysis and a genetic algorithm. In the first stage, every term in the documents is ranked according to its importance for classification using information gain (IG). In the second stage, a genetic algorithm (GA) and principal component analysis (PCA), as feature selection and feature extraction methods, are applied individually to the terms ranked in decreasing order of importance, and dimension reduction is carried out. Thus, terms of low importance are ignored throughout categorization, and feature selection and extraction are applied only to the most important terms, reducing the computational time and complexity of categorization. To analyse the dimension reduction of the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on a selected data set.
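The PCA branch of this two-stage pipeline is easy to sketch: rank terms by IG, keep the top k, then project the reduced matrix. The GA branch is omitted here for brevity, the cutoffs are illustrative, and information_gain is the helper sketched earlier.

```python
import numpy as np
from sklearn.decomposition import PCA

def two_stage_reduce(X, y, k_ranked=2000, n_components=100):
    """Stage 1: rank terms by information gain and keep the top k_ranked.
    Stage 2: project the reduced (dense) matrix with PCA."""
    ig = information_gain(X, y)                 # helper from the earlier sketch
    top = np.argsort(ig)[::-1][:k_ranked]
    X_reduced = X[:, top]
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_reduced), top, pca
```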
The 4th Workshop on …, 2010
A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method, within class popularity, which approaches feature selection through the Gini coefficient of inequality (a commonly used measure of income inequality). The proposed measure explores the relative distribution of a feature among different classes. From extensive experiments with four text classifiers over three datasets of different levels of heterogeneity, we observe that the proposed measure outperforms the mutual information, information gain and chi-square statistics with average improvements of approximately 28.5%, 19% and 9.2% respectively.
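The paper's exact "within class popularity" score is not reproduced here; the sketch below only shows the classical Gini coefficient of inequality applied to a term's relative frequency across classes, which is the underlying idea the abstract describes.

```python
import numpy as np

def gini_coefficient(x):
    """Classical Gini coefficient of inequality of a non-negative vector:
    G = sum_ij |x_i - x_j| / (2 n^2 mean(x))."""
    x = np.asarray(x, dtype=float)
    if x.sum() == 0:
        return 0.0
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return diffs / (2 * len(x) ** 2 * x.mean())

def term_class_inequality(X, y):
    """Gini coefficient of each term's P(t|c) across classes: terms
    concentrated in few classes score high (unequal distribution),
    terms spread evenly score low."""
    classes = np.unique(y)
    dist = np.stack([X[y == c].sum(axis=0) / max((y == c).sum(), 1)
                     for c in classes])
    return np.array([gini_coefficient(dist[:, j])
                     for j in range(X.shape[1])])
```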
2011
In this paper, we compare several aspects of automatic text categorization, including document representation, feature selection and three classifiers, applied to text collections in two languages. Regarding the computational representation of documents, we compare the traditional bag-of-words representation with four alternatives: bag of multiwords and bags of word prefixes of N characters (for N = 4, 5 and 6). Concerning feature selection, we compare the well-known Information Gain and Chi-Square metrics with a new one based on third-moment statistics, which enhances rare terms. As to the classifiers, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on the Mahalanobis distance. The study is language independent and was applied to two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
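The word-prefix representation is simple enough to show directly: each token is truncated to its first N characters before counting, a cheap, language-independent alternative to stemming. The example tokens are illustrative.

```python
from collections import Counter

def bag_of_prefixes(tokens, n=5):
    """Bag-of-word-prefixes: count tokens truncated to their first
    n characters instead of full words."""
    return Counter(tok[:n] for tok in tokens)

# e.g. bag_of_prefixes(["categorization", "categories"], n=5)
# -> Counter({'categ': 2}): both inflected forms collapse to one feature
```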
International Journal of Science and Research (IJSR)
The rapid growth of the World Wide Web has led to an explosive growth of information. As most information is stored as text, text mining has gained paramount importance. Given the high availability of information from diverse sources, automatic categorization of documents has become a vital method for managing and organizing vast amounts of information and for knowledge discovery. Text classification is the task of assigning predefined categories to documents; its major challenges are classifier accuracy and the high dimensionality of the feature space. These problems can be addressed using feature selection (FS), the process of identifying a subset of the most useful features from the original set. Feature selection aims at making text document classifiers more efficient and accurate, reducing computation time, improving prediction performance and giving a better understanding of the data. This paper surveys text classification, several approaches to it, feature selection methods, and applications of text classification.
Due to the huge volume of text documents available on the Internet, it is increasingly necessary to manage them effectively and help users retrieve what they want. Document categorization can organize documents into domain-specific classes and so facilitate information retrieval. In general, document categorization systems are composed of three kinds of models: one for weighting terms, a second for selecting feature terms and a third for categorizing documents accordingly; in practice, a document categorization system is essentially one of the possible combinations of these models. Based on the observation of model coherence in document categorization systems, this paper proposes two feature selection approaches, CBA and IBA. Empirical results with k-Nearest Neighbors and naïve Bayes classifiers on the Reuters-21578 corpus show that CBA and IBA are comparable to the χ² (chi-square) feature selection model.
2020
Feature selection methods select a small subset of relevant features from the original feature space by eliminating redundant or irrelevant features. In the process they reduce the dimensionality of the feature space and improve the efficiency of data mining algorithms. In this paper, sixteen state-of-the-art feature selection methods are studied on different benchmark datasets for text categorization, and their performance is summarized. Past research reveals that the performance of feature selection methods is dataset-specific. In the present work, further experiments are carried out with these state-of-the-art methods over a unifying framework of benchmark datasets, to evaluate and compare their performance on the same standards. The efficiency of the methods is evaluated through their performance with k-means clustering and KNN classification. The experiments reveal that the unsupervised feature selection method of Mu...
International Journal of Machine Learning and Cybernetics, 2015
Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Each corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection thus focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection technique is proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and then ranks all the terms of the corpus accordingly. Subsequently the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed technique is compared with that of nine other term selection methods on several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2007
Feature selection is an important task within text categorization, where irrelevant or noisy features are usually present, causing a loss in classifier performance. Feature selection in text categorization has usually been performed with a filtering approach: selecting the features with the highest scores according to certain measures drawn from the information retrieval, information theory and machine learning fields. Wrapper approaches are known to perform better in feature selection than filtering approaches, although they are time-consuming and sometimes infeasible, especially in text domains. However, a wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function can overcome these difficulties; the wrapper presented in this paper satisfies these properties. Since exploring a reduced number of subsets could yield less promising subsets, a hybrid approach that combines the wrapper method with some scoring measures allows more promising feature subsets to be explored. A comparison among the scoring measures, the wrapper method and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
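A sketch of the general filter-plus-wrapper idea, under assumptions: a scoring measure pre-ranks candidates, and a cheap wrapper evaluates only a growing prefix of that ranking with a fast Naive Bayes evaluation function. The paper's actual search strategy and evaluation function may differ; pool_size and step are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def hybrid_wrapper(X, y, scores, pool_size=200, step=20):
    """Wrapper restricted to the pool of top-ranked features (by any
    scoring measure), evaluated with fast cross-validated Naive Bayes."""
    pool = np.argsort(scores)[::-1][:pool_size]
    selected, best = pool[:step], -np.inf
    for i in range(0, pool_size, step):
        candidate = pool[: i + step]            # grow the subset stepwise
        acc = cross_val_score(MultinomialNB(), X[:, candidate], y, cv=3).mean()
        if acc > best:
            selected, best = candidate, acc
    return selected, best
```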
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09, 2009
In text categorization, feature selection (FS) is a strategy that aims at making text classifiers more efficient and accurate. However, when dealing with a new task, it is still difficult to quickly select a suitable method from the many FS methods provided by previous studies. In this paper, we propose a theoretical framework of FS methods based on two basic measurements: frequency measurement and ratio measurement. Six popular FS methods are then discussed in detail under this framework. Moreover, guided by our theoretical analysis, we propose a novel method called weighed frequency and odds (WFO) that combines the two measurements with trained weights. Experimental results on data sets from both topic-based and sentiment classification tasks show that the new method is robust across different tasks and numbers of selected features.
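The sketch below is a reconstruction from the abstract's description only: a frequency measurement P(t|c) and an odds-style ratio measurement combined under a weight lambda. Treat the exact expression as an assumption rather than the paper's definition.

```python
import numpy as np

def wfo(X, y, c, lam=0.5, eps=1e-12):
    """WFO-style score, reconstructed: P(t|c)^lam combined with
    max(0, log P(t|c)/P(t|~c))^(1-lam)."""
    y = np.asarray(y)
    in_c = (y == c)
    p_t_c = X[in_c].sum(axis=0) / max(in_c.sum(), 1)        # P(t|c)
    p_t_not = X[~in_c].sum(axis=0) / max((~in_c).sum(), 1)  # P(t|~c)
    ratio = np.log((p_t_c + eps) / (p_t_not + eps))
    ratio = np.maximum(ratio, 0.0)   # only terms over-represented in c count
    return (p_t_c ** lam) * (ratio ** (1 - lam))
```

With lam near 1 the score behaves like a pure frequency measure; with lam near 0 it behaves like an odds-ratio measure, which is the trade-off the trained weight is meant to tune.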
IEEE Transactions on Knowledge and Data Engineering, 2005
Text Categorization, which consists of automatically assigning documents to a set of categories, usually involves the management of a huge number of features. Most of them are irrelevant and others introduce noise which could mislead the classifiers. Thus, feature reduction is often performed in order to increase the efficiency and effectiveness of the classification. In this paper, we propose to select relevant features by means of a family of linear filtering measures which are simpler than the usual measures applied for this purpose. We carry out experiments over two different corpora and find that the proposed measures perform better than the existing ones.
Text classification is a well-studied problem in many application domains and research areas, so there is a need for effective and efficient text classification algorithms. Many algorithms have been presented by different researchers for successful and accurate text classification, each specific to certain applications or research domains; some are based on data mining, others on machine learning. The main aim of this paper is to summarize the different types of algorithms presented for text classification. We present the key components of a text classification system, which will help researchers understand existing techniques: first an overview of why feature reduction is needed and of different techniques for feature selection, then the key components of a text classification system, and finally a discussion of the different text classification algorithms.
One of the problems faced by document categorization is that the terms present in the collection of example documents are numerous. From the point of view of coherence between the models used in document categorization, we analyse the frameworks of the k-NN and NB categorization models and the feature selection problem. Two feature selection algorithms, CBA and IBA, are proposed. Empirical results with k-NN and NB classifiers show that coherence between the models in a categorization system can benefit performance.
Lecture Notes in Computer Science, 2004
Text categorization problems usually contain a great deal of noisy and irrelevant information. In this paper we propose applying some measures taken from the machine learning environment to feature selection, using Support Vector Machines as the classifier. Experiments over two different corpora show that some of the new measures perform better than the traditional information theory measures.
Text classification and feature selection play an important role in correctly assigning documents to categories, given the explosive growth of textual information in electronic documents and on the World Wide Web. A present challenge in text mining is to select the important or relevant features from the large number of features in a data set. The aim of this paper is to improve feature selection for text document classification in machine learning, where a training set is generated for testing the documents. This can be achieved by selecting and weighting important terms in the text documents to improve both classification accuracy and performance.
2004
We introduce a new method of feature selection for text categorization. Our MMR-based feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results show that MMR-based feature selection is more effective than Koller and Sahami's method, one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, MMR-based feature selection sometimes allows conventional machine learning algorithms to improve over SVM, which is known to give the best classification accuracy.
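The MMR recipe is a greedy trade-off: pick the next feature maximising lambda * relevance(t) - (1 - lambda) * max redundancy(t, selected). The sketch below uses IG-style scores for relevance and, as an illustrative stand-in, absolute Pearson correlation between feature columns for redundancy; the paper's redundancy measure may differ.

```python
import numpy as np

def mmr_select(X, relevance, k, lam=0.7):
    """MMR-style greedy feature selection: balance each term's relevance
    against its redundancy with already-selected terms (here measured,
    illustratively, by absolute Pearson correlation)."""
    relevance = np.asarray(relevance, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre columns once
    norms = np.linalg.norm(Xc, axis=0) + 1e-12
    selected = [int(np.argmax(relevance))]     # seed with most relevant term
    while len(selected) < k:
        red = np.abs(Xc.T @ Xc[:, selected]) / (
            norms[:, None] * norms[selected][None, :])
        mmr = lam * relevance - (1 - lam) * red.max(axis=1)
        mmr[selected] = -np.inf                # never re-pick a term
        selected.append(int(np.argmax(mmr)))
    return np.array(selected)
```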
Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), 2007
The improvement of text categorization by statistical methods can proceed in two main directions: feature selection and the evaluation of characteristic weights. In this paper, we propose an enhanced text categorization method based on a modified mutual information algorithm and an evaluation algorithm for characteristic weights, improving both aspects. The proposed method is applied to the benchmark test set Reuters-21578 Top10 to examine its effectiveness. Numerical results show that the precision, recall and F1 value of the proposed method are all superior to those of existing conventional methods.