2007, Expert Systems With Applications
With the growth of the web, enormous numbers of documents are available on the Internet, and digital libraries, news sources and internal company data continue to expand. Automatic text categorization has therefore become increasingly important for dealing with massive data. The major problem of text categorization, however, is the high dimensionality of the feature space. Many methods for text feature selection already exist; to improve categorization performance, we present another. Our study is based on Gini index theory, and we design a novel Gini index algorithm to reduce the high dimensionality of the feature space, constructing a new Gini index measure function fitted to text categorization. Experimental results show that our improved Gini index performs better than other feature selection methods.
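To make the idea concrete, here is a minimal sketch of one Gini-index-style term score that appears in the text categorization literature, Gini(t) = sum_i P(t|c_i)^2 * P(c_i|t)^2. The paper's exact measure function may differ, so treat the formula and the selection cutoff as assumptions.

```python
import numpy as np

def gini_scores(X, y):
    """Gini-index style term score: Gini(t) = sum_i P(t|c_i)^2 * P(c_i|t)^2.

    X : (n_docs, n_terms) binary numpy document-term matrix
    y : (n_docs,) numpy array of integer class labels
    """
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    term_df = X.sum(axis=0)                      # docs containing each term
    for c in classes:
        Xc = X[y == c]
        p_t_given_c = Xc.sum(axis=0) / max(len(Xc), 1)         # P(t|c)
        p_c_given_t = Xc.sum(axis=0) / np.maximum(term_df, 1)  # P(c|t)
        scores += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return scores

def select_top_k(scores, k):
    """Keep the k highest-scoring terms."""
    return np.argsort(scores)[::-1][:k]
```

Terms whose occurrences concentrate in a single class receive high scores, which is exactly the behaviour a class-discriminating feature selector should reward.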
2008 International Conference on Natural Language Processing and Knowledge Engineering, 2008
With the development of the web, large numbers of documents are put on the Internet, and more and more digital libraries, news sources and internal company data become available. Automatic text categorization is therefore increasingly important for dealing with massive data. However, text preprocessing is still the bottleneck of text categorization based on the Vector Space Model (VSM): its result directly affects the performance and precision of categorization, and feature selection and feature weighting are its major obstacles. In this paper we focus on feature weighting. We present a novel feature weighting algorithm, TF-Gini, that improves categorization performance significantly. Experimental results verify the effectiveness of this algorithm.
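A plausible reading of such a scheme is that the IDF factor of TF-IDF is replaced by a corpus-level Gini score per term. The sketch below illustrates that reading only; the paper's exact formulation is not reproduced, and the gini_scores helper is the one sketched above.

```python
import numpy as np

def tf_gini_weights(X_tf, gini):
    """Weight each (doc, term) cell by raw term frequency times the term's
    corpus-level Gini score, in place of TF-IDF.

    X_tf : (n_docs, n_terms) raw term-frequency matrix
    gini : (n_terms,) per-term Gini scores (e.g. from gini_scores above)
    """
    W = X_tf * gini[np.newaxis, :]            # tf(t, d) * Gini(t)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)       # L2-normalise each document
```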
International Journal of Mathematical Archive, 2013
With the rapid spread of the Internet and the increase in on-line information, the technology for automatically classifying huge amounts of diverse text information plays a very important role in today's world. In the 1990s, the performance of computers improved sharply and it became possible to handle large quantities of text data. This led to the machine learning approach, which creates classifiers automatically from text data labelled with categories. This approach provides excellent accuracy, reduces labor, and ensures conservative use of resources. In this communication, we discuss the important role feature selection plays in text categorization (Yiming Yang, Jan O. Pedersen, 1997). We also deliberate on automatic feature selection methods commonly applied in text categorization, such as document frequency thresholding (DF), Information Gain (IG), Mutual Information (MI) and Pointwise Mutual Information (PMI).
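As a reference point, the sketch below computes these classical measures from per-term 2x2 document counts (A: docs in class c containing term t; B: docs outside c containing t; C: docs in c without t; D: docs outside c without t). The formulas are the standard ones from Yang & Pedersen (1997); the smoothing constant eps is an illustrative choice, not from the paper.

```python
import numpy as np

def contingency(X, y, c):
    """Per-term 2x2 document counts against class c (X is a binary
    document-term matrix, y the vector of class labels)."""
    y = np.asarray(y)
    in_c = (y == c)
    A = X[in_c].sum(axis=0).astype(float)      # docs in c with t
    B = X[~in_c].sum(axis=0).astype(float)     # docs outside c with t
    C = in_c.sum() - A                         # docs in c without t
    D = (~in_c).sum() - B                      # docs outside c without t
    return A, B, C, D

def document_frequency(X):
    """DF(t): number of documents containing t; threshold it to select."""
    return X.sum(axis=0)

def pmi(X, y, c, eps=1e-12):
    """PMI(t, c) = log P(t, c) / (P(t) P(c))."""
    A, B, C, D = contingency(X, y, c)
    N = A + B + C + D
    return np.log((A * N + eps) / ((A + B) * (A + C) + eps))

def information_gain(X, y, eps=1e-12):
    """IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t)."""
    classes = np.unique(y)
    N = float(len(y))
    p_c = np.array([(np.asarray(y) == c).sum() for c in classes]) / N
    H_C = -(p_c * np.log(p_c + eps)).sum()
    p_t = X.sum(axis=0) / N
    H_C_t = np.zeros(X.shape[1])
    H_C_not = np.zeros(X.shape[1])
    for c in classes:
        A, B, C_, D = contingency(X, y, c)
        p_c_t = A / np.maximum(A + B, 1.0)      # P(c|t)
        p_c_nt = C_ / np.maximum(C_ + D, 1.0)   # P(c|~t)
        H_C_t -= p_c_t * np.log(p_c_t + eps)
        H_C_not -= p_c_nt * np.log(p_c_nt + eps)
    return H_C - p_t * H_C_t - (1 - p_t) * H_C_not
```

The class-averaged or max-combined MI variants used in that line of work can be derived from pmi by weighting or maximising over classes c.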
International Journal of Enterprise Network Management, 2017
The curse of dimensionality has made dimension reduction an essential step in text categorisation, and feature selection is one approach to it. In this paper an analysis of feature selection measures for text categorisation is performed. Under the unsupervised approach, document frequency is considered; under the supervised approach, chi-square, odds ratio, mutual information and information gain are considered, since these are widely used and effective measures. The analysis uses the 20 newsgroups dataset, which contains closely related as well as highly unrelated categories. Selected categories of the dataset are organised into three groups: overlapping (highly related) classes, non-overlapping (highly unrelated) classes, and a combination of overlapping and non-overlapping classes. Feature selection and subsequent classification are applied to the three groups separately and classification performance is studied for each measure. The most noticeable behaviour was that of the odds ratio measure: it performed well for the non-overlapping and overlapping groups considered separately, but poorly for the group containing both overlapping and non-overlapping categories. The remaining measures behaved consistently across all three groups. Classification was achieved using a support vector machine classifier, and the performance comparisons of the different measures on the different groups are presented in terms of micro-F1 and macro-F1.
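For completeness, the two supervised measures not sketched earlier can be written over the same 2x2 counts, reusing the contingency helper from the sketch above. These are the textbook definitions; the eps guard is an illustrative smoothing choice.

```python
import numpy as np

def chi_square(X, y, c):
    """chi2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    A, B, C, D = contingency(X, y, c)   # helper from the earlier sketch
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / np.maximum(den, 1e-12)

def odds_ratio(X, y, c, eps=1e-12):
    """OR(t, c) = log [P(t|c)(1 - P(t|~c))] / [(1 - P(t|c)) P(t|~c)]
    = log (A * D) / (C * B)."""
    A, B, C, D = contingency(X, y, c)
    return np.log((A * D + eps) / (C * B + eps))
```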
2010
Text document categorization involves a large number of features, and this high dimensionality is troublesome: it can degrade classification performance. Feature selection is therefore considered one of the crucial steps in text document categorization. Selecting the best features to represent documents reduces the dimensionality of the feature space and hence increases performance. Many approaches have been implemented by various researchers to overcome this problem. This paper proposes a novel hybrid approach for feature selection in text document categorization based on Ant Colony Optimization (ACO) and Information Gain (IG), and also reviews state-of-the-art algorithms by several other researchers.
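A compressed sketch of how such an ACO + IG hybrid might work: IG scores act as the heuristic desirability, ants sample feature subsets probabilistically, and pheromone is reinforced on the best subset found. All parameters and the update rule here are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def aco_ig_select(X, y, ig, n_select, evaluate, n_ants=10, n_iter=20,
                  alpha=1.0, beta=2.0, rho=0.1, rng=None):
    """Illustrative ACO feature selection guided by information gain.

    ig       : per-term IG scores (e.g. from the information_gain sketch above)
    evaluate : callable(subset) -> classification score in [0, 1] to maximise
    """
    rng = rng or np.random.default_rng(0)
    n_terms = len(ig)
    tau = np.ones(n_terms)                      # pheromone trail per feature
    eta = (ig + 1e-12) / (ig.max() + 1e-12)     # normalised heuristic
    best_subset, best_score = None, -np.inf
    for _ in range(n_iter):
        for _ant in range(n_ants):
            p = (tau ** alpha) * (eta ** beta)
            p = p / p.sum()
            subset = rng.choice(n_terms, size=n_select, replace=False, p=p)
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
        tau *= (1 - rho)                        # pheromone evaporation
        tau[best_subset] += rho * best_score    # reinforce the best subset
    return best_subset, best_score
```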
Text categorization is the task of automatically sorting a set of documents into categories from a predefined set, i.e., assigning predefined categories to free-text documents. In this paper we propose a two-stage feature selection method for text categorization using information gain, principal component analysis and a genetic algorithm. In the first stage, every term in the documents is ranked according to its importance for classification using information gain (IG). In the second stage, a genetic algorithm (GA) and principal component analysis (PCA), as feature selection and feature extraction methods, are applied individually to the terms ranked in decreasing order of importance, and dimension reduction is carried out. Thus, terms of low importance are ignored throughout categorization, and feature selection and extraction are applied only to the most important terms, reducing the computational time and complexity of categorization. To analyse the dimension reduction of the proposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithms on a selected data set.
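The PCA branch of this two-stage pipeline is easy to sketch: rank terms by IG, keep the top k, then project the reduced matrix. The GA branch is omitted here for brevity, the cutoffs are illustrative, and information_gain is the helper sketched earlier.

```python
import numpy as np
from sklearn.decomposition import PCA

def two_stage_reduce(X, y, k_ranked=2000, n_components=100):
    """Stage 1: rank terms by information gain and keep the top k_ranked.
    Stage 2: project the reduced (dense) matrix with PCA."""
    ig = information_gain(X, y)                 # helper from the earlier sketch
    top = np.argsort(ig)[::-1][:k_ranked]
    X_reduced = X[:, top]
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_reduced), top, pca
```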
The 4th Workshop on …, 2010
A number of feature selection mechanisms have been explored in text categorization, among which mutual information, information gain and chi-square are considered most effective. In this paper, we study another method, within class popularity, which approaches feature selection through the Gini coefficient of inequality (a commonly used measure of income inequality). The proposed measure explores the relative distribution of a feature among different classes. From extensive experiments with four text classifiers over three datasets of different levels of heterogeneity, we observe that the proposed measure outperforms the mutual information, information gain and chi-square statistics with average improvements of approximately 28.5%, 19% and 9.2% respectively.
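The paper's exact "within class popularity" score is not reproduced here; the sketch below only shows the classical Gini coefficient of inequality applied to a term's relative frequency across classes, which is the underlying idea the abstract describes.

```python
import numpy as np

def gini_coefficient(x):
    """Classical Gini coefficient of inequality of a non-negative vector:
    G = sum_ij |x_i - x_j| / (2 n^2 mean(x))."""
    x = np.asarray(x, dtype=float)
    if x.sum() == 0:
        return 0.0
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return diffs / (2 * len(x) ** 2 * x.mean())

def term_class_inequality(X, y):
    """Gini coefficient of each term's P(t|c) across classes: terms
    concentrated in few classes score high (unequal distribution),
    terms spread evenly score low."""
    classes = np.unique(y)
    dist = np.stack([X[y == c].sum(axis=0) / max((y == c).sum(), 1)
                     for c in classes])
    return np.array([gini_coefficient(dist[:, j])
                     for j in range(X.shape[1])])
```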
2011
In this paper, we compare several aspects of automatic text categorization, including document representation, feature selection and three classifiers, applied to text collections in two languages. Regarding the computational representation of documents, we compare the traditional bag-of-words representation with four alternatives: bag of multiwords and bags of word prefixes of N characters (for N = 4, 5 and 6). Concerning feature selection, we compare the well-known Information Gain and Chi-Square metrics with a new one based on third-moment statistics, which enhances rare terms. As to the classifiers, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on the Mahalanobis distance. The study is language independent and was applied to two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
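The word-prefix representation is simple enough to show directly: each token is truncated to its first N characters before counting, a cheap, language-independent alternative to stemming. The example tokens are illustrative.

```python
from collections import Counter

def bag_of_prefixes(tokens, n=5):
    """Bag-of-word-prefixes: count tokens truncated to their first
    n characters instead of full words."""
    return Counter(tok[:n] for tok in tokens)

# e.g. bag_of_prefixes(["categorization", "categories"], n=5)
# -> Counter({'categ': 2}): both inflected forms collapse to one feature
```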
International Journal of Science and Research (IJSR)
The rapid growth of the World Wide Web has led to an explosive growth of information. As most information is stored as text, text mining has gained paramount importance. Given the high availability of information from diverse sources, automatic categorization of documents has become a vital method for managing and organizing vast amounts of information and for knowledge discovery. Text classification is the task of assigning predefined categories to documents; its major challenges are classifier accuracy and the high dimensionality of the feature space. These problems can be addressed using feature selection (FS), the process of identifying a subset of the most useful features from the original set. Feature selection aims at making text document classifiers more efficient and accurate, reducing computation time, improving prediction performance and giving a better understanding of the data. This paper surveys text classification, several approaches to it, feature selection methods, and applications of text classification.
Due to the huge volume of text documents available on the Internet, it is increasingly necessary to manage them effectively and help users retrieve what they want. Document categorization can organize documents into domain-specific classes and so facilitate information retrieval. In general, document categorization systems are composed of three kinds of models: one for weighting terms, a second for selecting feature terms and a third for categorizing documents accordingly; in practice, a document categorization system is essentially one of the possible combinations of these models. Based on the observation of model coherence in document categorization systems, this paper proposes two feature selection approaches, CBA and IBA. Empirical results with k-Nearest Neighbors and naïve Bayes classifiers on the Reuters-21578 corpus show that CBA and IBA are comparable to the χ² (chi-square) feature selection model.
2020
Feature selection methods select a small subset of relevant features from the original feature space by eliminating redundant or irrelevant features. In the process they reduce the dimensionality of the feature space and improve the efficiency of data mining algorithms. In this paper, sixteen state-of-the-art feature selection methods are studied on different benchmark datasets for text categorization, and their performance is summarized. Past research reveals that the performance of feature selection methods is dataset-specific. In the present work, further experiments are carried out with these state-of-the-art methods over a unifying framework of benchmark datasets, to evaluate and compare their performance on the same standards. The efficiency of the methods is evaluated through their performance with k-means clustering and KNN classification. The experiments reveal that the unsupervised feature selection method of Mu...
International Journal of Machine Learning and Cybernetics, 2015
Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Each corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection thus focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection technique is proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and then ranks all the terms of the corpus accordingly. Subsequently the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed technique is compared with that of nine other term selection methods on several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2007
Feature selection is an important task within text categorization, where irrelevant or noisy features are usually present, causing a loss in classifier performance. Feature selection in text categorization has usually been performed with a filtering approach: selecting the features with the highest scores according to certain measures drawn from the information retrieval, information theory and machine learning fields. Wrapper approaches are known to perform better in feature selection than filtering approaches, although they are time-consuming and sometimes infeasible, especially in text domains. However, a wrapper that explores a reduced number of feature subsets and uses a fast method as its evaluation function can overcome these difficulties; the wrapper presented in this paper satisfies these properties. Since exploring a reduced number of subsets could yield less promising subsets, a hybrid approach that combines the wrapper method with some scoring measures allows more promising feature subsets to be explored. A comparison among the scoring measures, the wrapper method and the hybrid approach is performed. The results reveal that the hybrid approach outperforms both the wrapper approach and the scoring measures, particularly for corpora whose features are less scattered over the categories.
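A sketch of the general filter-plus-wrapper idea, under assumptions: a scoring measure pre-ranks candidates, and a cheap wrapper evaluates only a growing prefix of that ranking with a fast Naive Bayes evaluation function. The paper's actual search strategy and evaluation function may differ; pool_size and step are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

def hybrid_wrapper(X, y, scores, pool_size=200, step=20):
    """Wrapper restricted to the pool of top-ranked features (by any
    scoring measure), evaluated with fast cross-validated Naive Bayes."""
    pool = np.argsort(scores)[::-1][:pool_size]
    selected, best = pool[:step], -np.inf
    for i in range(0, pool_size, step):
        candidate = pool[: i + step]            # grow the subset stepwise
        acc = cross_val_score(MultinomialNB(), X[:, candidate], y, cv=3).mean()
        if acc > best:
            selected, best = candidate, acc
    return selected, best
```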
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP '09, 2009
In text categorization, feature selection (FS) is a strategy that aims at making text classifiers more efficient and accurate. However, when dealing with a new task, it is still difficult to quickly select a suitable method from the many FS methods provided by previous studies. In this paper, we propose a theoretical framework of FS methods based on two basic measurements: frequency measurement and ratio measurement. Six popular FS methods are then discussed in detail under this framework. Moreover, guided by our theoretical analysis, we propose a novel method called weighed frequency and odds (WFO) that combines the two measurements with trained weights. Experimental results on data sets from both topic-based and sentiment classification tasks show that the new method is robust across different tasks and numbers of selected features.
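The sketch below is a reconstruction from the abstract's description only: a frequency measurement P(t|c) and an odds-style ratio measurement combined under a weight lambda. Treat the exact expression as an assumption rather than the paper's definition.

```python
import numpy as np

def wfo(X, y, c, lam=0.5, eps=1e-12):
    """WFO-style score, reconstructed: P(t|c)^lam combined with
    max(0, log P(t|c)/P(t|~c))^(1-lam)."""
    y = np.asarray(y)
    in_c = (y == c)
    p_t_c = X[in_c].sum(axis=0) / max(in_c.sum(), 1)        # P(t|c)
    p_t_not = X[~in_c].sum(axis=0) / max((~in_c).sum(), 1)  # P(t|~c)
    ratio = np.log((p_t_c + eps) / (p_t_not + eps))
    ratio = np.maximum(ratio, 0.0)   # only terms over-represented in c count
    return (p_t_c ** lam) * (ratio ** (1 - lam))
```

With lam near 1 the score behaves like a pure frequency measure; with lam near 0 it behaves like an odds-ratio measure, which is the trade-off the trained weight is meant to tune.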
IEEE Transactions on Knowledge and Data Engineering, 2005
Text Categorization, which consists of automatically assigning documents to a set of categories, usually involves the management of a huge number of features. Most of them are irrelevant and others introduce noise which could mislead the classifiers. Thus, feature reduction is often performed in order to increase the efficiency and effectiveness of the classification. In this paper, we propose to select relevant features by means of a family of linear filtering measures which are simpler than the usual measures applied for this purpose. We carry out experiments over two different corpora and find that the proposed measures perform better than the existing ones.
Text classification is a well-studied problem in many application domains and research areas, so there is a need for effective and efficient text classification algorithms. Many algorithms have been presented by different researchers for successful and accurate text classification, each specific to certain applications or research domains; some are based on data mining, others on machine learning. The main aim of this paper is to summarize the different types of algorithms presented for text classification. We present the key components of a text classification system, which will help researchers understand existing techniques: first an overview of why feature reduction is needed and of different techniques for feature selection, then the key components of a text classification system, and finally a discussion of the different text classification algorithms.
One of the problems faced by document categorization is that the terms present in the collection of example documents are numerous. From the point of view of coherence between the models used in document categorization, we analyse the frameworks of the k-NN and NB categorization models and the feature selection problem. Two feature selection algorithms, CBA and IBA, are proposed. Empirical results with k-NN and NB classifiers show that coherence between the models in a categorization system can benefit performance.
Lecture Notes in Computer Science, 2004
Text categorization problems usually contain a great deal of noisy and irrelevant information. In this paper we propose applying some measures taken from the machine learning environment to feature selection, using Support Vector Machines as the classifier. Experiments over two different corpora show that some of the new measures perform better than the traditional information theory measures.
Text classification and feature selection play an important role in correctly assigning documents to categories, given the explosive growth of textual information in electronic documents and on the World Wide Web. A present challenge in text mining is to select the important or relevant features from the large number of features in a data set. The aim of this paper is to improve feature selection for text document classification in machine learning, where a training set is generated for testing the documents. This can be achieved by selecting and weighting important terms in the text documents to improve both classification accuracy and performance.
2004
We introduce a new method of feature selection for text categorization. Our MMR-based feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results show that MMR-based feature selection is more effective than Koller and Sahami's method, one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, MMR-based feature selection sometimes allows conventional machine learning algorithms to improve over SVM, which is known to give the best classification accuracy.
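The MMR recipe is a greedy trade-off: pick the next feature maximising lambda * relevance(t) - (1 - lambda) * max redundancy(t, selected). The sketch below uses IG-style scores for relevance and, as an illustrative stand-in, absolute Pearson correlation between feature columns for redundancy; the paper's redundancy measure may differ.

```python
import numpy as np

def mmr_select(X, relevance, k, lam=0.7):
    """MMR-style greedy feature selection: balance each term's relevance
    against its redundancy with already-selected terms (here measured,
    illustratively, by absolute Pearson correlation)."""
    relevance = np.asarray(relevance, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre columns once
    norms = np.linalg.norm(Xc, axis=0) + 1e-12
    selected = [int(np.argmax(relevance))]     # seed with most relevant term
    while len(selected) < k:
        red = np.abs(Xc.T @ Xc[:, selected]) / (
            norms[:, None] * norms[selected][None, :])
        mmr = lam * relevance - (1 - lam) * red.max(axis=1)
        mmr[selected] = -np.inf                # never re-pick a term
        selected.append(int(np.argmax(mmr)))
    return np.array(selected)
```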
Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), 2007
The improvement of text categorization by statistical methods can proceed in two main directions: feature selection and the evaluation of characteristic weights. In this paper, we propose an enhanced text categorization method based on a modified mutual information algorithm and an evaluation algorithm for characteristic weights, improving both aspects. The proposed method is applied to the benchmark test set Reuters-21578 Top10 to examine its effectiveness. Numerical results show that the precision, recall and F1 value of the proposed method are all superior to those of existing conventional methods.