2001
Document representation using the bag-of-words approach may require reducing the dimensionality of the representation before various statistical classification methods can be used effectively. Latent Semantic Indexing (LSI) is one such method, based on an eigendecomposition of the covariance of the document-term matrix. Another frequently used approach is to select a small number of the most important features out of the whole set according to some relevant criterion. This paper points out that LSI concentrates on representation while ignoring discrimination. Furthermore, selection methods fail to produce a feature set that jointly optimizes class discrimination. As a remedy, we suggest supervised linear discriminative transforms, and report good classification results applying these to the Reuters-21578 database.
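As a rough illustration of the contrast drawn here (a minimal sketch only; the toy corpus, labels and library calls below are assumptions, not the paper's experimental setup), an unsupervised LSI projection can be obtained from a truncated SVD of the TF-IDF document-term matrix, whereas a supervised linear discriminative transform uses the class labels directly:

# Sketch: unsupervised LSI vs. a supervised linear discriminant transform.
# Toy corpus and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

docs = ["stock market falls", "shares rally on earnings",
        "rain expected tomorrow", "sunny skies this weekend"]
labels = [0, 0, 1, 1]                                   # finance vs. weather

X = TfidfVectorizer().fit_transform(docs)               # bag-of-words / TF-IDF matrix
lsi = TruncatedSVD(n_components=2).fit_transform(X)     # LSI: ignores the labels
lda = LinearDiscriminantAnalysis(n_components=1)        # discriminative: uses the labels
lda_proj = lda.fit_transform(X.toarray(), labels)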
Formal Pattern Analysis & Applications, 2004
The bag-of-words approach to text document representation typically results in vectors of the order of 5000 to 20000 components as the representation of documents. In order to make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular such methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevant criterion.
Hybrid Information Systems, A. Abraham and …, 2002
Linear Discriminant (LD) techniques are typically used in pattern recognition tasks when there are many (n >> 10^4) datapoints in low-dimensional (d < 10^2) space. In this paper we argue on theoretical grounds that LD is in fact more appropriate when training data is sparse, and the dimension of the space is extremely high. To support this conclusion we present experimental results on a medical text classification problem of great practical importance, autocoding of adverse event reports. We trained and tested LD-based systems for a variety of classification schemes widely used in the clinical drug trial process (COSTART, WHOART, HARTS, and MedDRA) and obtained significant reduction in the rate of misclassification compared both to generic Bayesian machine-learning techniques and to the current generation of domain-specific autocoders based on string matching.
Automatic text classification (ATC) is the task of automatically assigning a set of documents to appropriate categories (or classes, or topics). One common feature generation technique extracts absolute word frequencies from textual documents for use as feature vectors in machine learning. A limitation of this technique is its dependency on text length, which leads to lower classification rates. Another problem in ATC is the high-dimensional feature space. We present a performance evaluation of feature transformation techniques and a regularized linear discriminant function (RLD) in automatic text classification. Moreover, we provide an experimental evaluation of Principal Component Analysis (PCA) for reducing the high dimensionality. The feature transformation techniques used considerably improved classification accuracy, and RLD outperformed all other classifiers considered. Experimental results also showed effective dimension reduction.
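As a hedged sketch of this kind of pipeline (the toy documents, the choice of relative term frequency as the length-normalizing transformation, and the shrinkage-based discriminant used as a stand-in for the paper's RLD are all assumptions):

# Sketch: length-normalized term frequencies, PCA for dimension reduction,
# then a shrinkage-regularized linear discriminant classifier (a stand-in for RLD).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

docs = ["buy cheap stocks now", "stocks and bonds rise",
        "heavy rain and storms", "storms expected tonight"]
y = [0, 0, 1, 1]

counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
rel_freq = counts / counts.sum(axis=1, keepdims=True)    # removes text-length dependency

reduced = PCA(n_components=2).fit_transform(rel_freq)    # dimensionality reduction
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
clf.fit(reduced, y)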
2008
Text categorization is an important research area and has been receiving much attention due to the growth of on-line information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much of the previous work has focused on binary document classification problems. Support vector machines (SVMs) excel at binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification.
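A common practical workaround, shown here only as a minimal sketch with an invented toy corpus (not the approach proposed in the paper), is to decompose the multi-class problem into several binary SVMs:

# Sketch: multi-class text categorization via a one-vs-rest decomposition
# into binary SVMs (toy data for illustration only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

docs = ["oil prices climb", "central bank raises rates",
        "team wins final", "player signs new contract",
        "new vaccine approved", "hospital expands clinic"]
y = ["economy", "economy", "sports", "sports", "health", "health"]

vec = TfidfVectorizer().fit(docs)
clf = OneVsRestClassifier(LinearSVC()).fit(vec.transform(docs), y)  # one binary SVM per class
print(clf.predict(vec.transform(["rates climb again"])))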
2017 International Conference on Engineering & MIS (ICEMIS), 2017
The linear discriminant analysis (LDA) is a dimensionality reduction technique that is widely used in pattern recognition applications. LDA aims at generating effective feature vectors by reducing the dimensions of the original data (e.g. the bag-of-words textual representation) into a lower-dimensional space. Hence, LDA is a convenient method for text classification, which is characterized by very high-dimensional feature vectors. In this paper, we empirically investigate two LDA-based methods for Arabic text classification. The first method is based on computing the generalized eigenvectors of the ratio of the between-class to within-class scatters; the second method uses linear classification functions that assume equal population covariance matrices (i.e. a pooled sample covariance matrix). We used a textual data collection that contains 1,750 documents belonging to five categories. The testing set contains 250 documents belonging to five categories (50 documents per category). The experiment...
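A minimal sketch of the first of the two methods (discriminant directions as generalized eigenvectors of the between-class versus within-class scatter matrices); the random matrix below merely stands in for the Arabic document vectors, and the small ridge term is an assumption for numerical stability:

# Sketch: discriminant directions as generalized eigenvectors of the
# between-class vs. within-class scatter matrices. Synthetic stand-in data.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))          # 50 documents, 10 features (after preprocessing)
y = rng.integers(0, 5, size=50)        # 5 categories

mean = X.mean(axis=0)
Sw = np.zeros((10, 10)); Sb = np.zeros((10, 10))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)                      # within-class scatter
    Sb += len(Xc) * np.outer(mc - mean, mc - mean)     # between-class scatter

# Solve Sb v = lambda Sw v; keep the top (n_classes - 1) directions.
eigvals, eigvecs = eigh(Sb, Sw + 1e-6 * np.eye(10))    # small ridge for stability
W = eigvecs[:, np.argsort(eigvals)[::-1][:4]]
X_lda = X @ W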
2011
In this paper, we compare several aspects of automatic text categorization, including document representation, feature selection, three classifiers, and their application to two language text collections. Regarding the computational representation of documents, we compare the traditional bag-of-words representation with 4 alternative representations: bag of multiwords and bag of word prefixes with N characters (for N = 4, 5 and 6). Concerning feature selection, we compare the well-known feature selection metrics Information Gain and Chi-Square with a new one based on third-moment statistics, which enhances rare terms. As to the classifiers, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on the Mahalanobis distance. Finally, the study is language independent and was applied to two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
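To make the bag-of-word-prefixes representation and Chi-Square selection concrete (a sketch with an invented corpus; the third-moment metric proposed in the paper is not reproduced here):

# Sketch: a bag-of-word-prefixes representation (here N = 5) combined with
# Chi-Square feature selection. Toy corpus for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def prefixes(text, n=5):
    return [w[:n] for w in text.lower().split()]

docs = ["government announces elections", "parliament debates election law",
        "champions league final tonight", "striker scores winning goal"]
y = [0, 0, 1, 1]

vec = CountVectorizer(analyzer=prefixes)                 # prefix tokens instead of words
X = vec.fit_transform(docs)
X_sel = SelectKBest(chi2, k=5).fit_transform(X, y)       # keep the 5 highest-scoring prefixes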
2006
We address the problem of building fast and effective text classification tools. We describe a "representatives methodology" related to feature extraction and illustrate its performance using as vehicles a centroid-based method and a method based on clustered LSI, which were recently proposed as useful tools for low-rank matrix approximation and cost-effective alternatives to LSI. The methodology is very flexible, providing the means for accelerating existing algorithms.
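A minimal sketch of a centroid-based classifier of the kind used as a vehicle here (invented toy data; the representatives methodology and clustered LSI themselves are not reproduced):

# Sketch of a centroid-based text classifier: each class is represented by the
# centroid of its training vectors, and a document is assigned to the nearest
# centroid by cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["interest rates rise", "bank cuts rates",
         "team wins the cup", "coach praises players"]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(train).toarray()

centroids = {c: X[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
             for c in set(labels)}

def classify(doc):
    v = vec.transform([doc]).toarray()[0]
    sims = {c: v @ m / (np.linalg.norm(v) * np.linalg.norm(m) + 1e-12)
            for c, m in centroids.items()}
    return max(sims, key=sims.get)

print(classify("rates fall again"))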
2011
The curse of dimensionality is a well-recognized problem in the field of document filtering. In particular, this concerns methods where vector space models are used to describe the document-concept space. When performing content classification across a variety of topics, the number of different concepts (dimensions) rapidly explodes, and as a result many techniques are rendered inapplicable. Furthermore, the extent of information represented by each of the concepts may vary significantly. In this paper, we present a dimensionality reduction approach which approximates the user's preferences in the form of a value function and leads to a quick and efficient filtering procedure. The proposed system requires the user to provide preference information in the form of a training set in order to generate a search rule. Each document in the training set is profiled into a vector of concepts. The document profiling is accomplished by utilizing Wikipedia articles to define the semantic information contained in words, which allows them to be perceived as concepts. Once the set of concepts contained in the training set is known, a modified Wilks' lambda approach is used for dimensionality reduction, ensuring minimal loss of semantic information.
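The paper's modified Wilks' lambda is not reproduced here; the following is only a sketch of the textbook per-dimension Wilks' lambda (within-class to total sum of squares), where small values indicate discriminative concepts, on synthetic stand-in data:

# Sketch: per-concept Wilks' lambda = within-class SS / total SS for each concept.
# Small lambda = discriminative concept. Synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))          # 60 documents profiled over 8 concepts
y = rng.integers(0, 2, size=60)       # relevant / not relevant

grand_mean = X.mean(axis=0)
total_ss = ((X - grand_mean) ** 2).sum(axis=0)
within_ss = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                for c in np.unique(y))

wilks_lambda = within_ss / total_ss
keep = np.argsort(wilks_lambda)[:3]   # retain the 3 most discriminative concepts
X_reduced = X[:, keep]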
IEICE Transactions on Information and Systems, 2008
Feature transformation in automatic text classification (ATC) can lead to better classification performance, and dimensionality reduction is also important in ATC. Hence, feature transformation and dimensionality reduction are performed together to obtain lower computational costs with improved classification performance. However, feature transformation and dimension reduction techniques have conventionally been considered in isolation; in such cases classification performance can be lower than when they are integrated. Therefore, we propose an integrated feature analysis (IFA) approach which improves classification performance at lower dimensionality. Moreover, we propose a multiple feature integration (MFI) technique which also improves classification effectiveness.
Key words: text classification/categorization, feature transformation, dimension reduction, principal component analysis, canonical discriminant analysis, integrated feature analysis, multiple feature integration.
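As a hedged sketch of the kind of integration described (PCA followed by a canonical discriminant transform in one pipeline; synthetic data stands in for the document-term vectors, and this is not the paper's exact IFA/MFI procedure):

# Sketch: feature transformation (PCA) and a discriminant transform applied as one
# integrated pipeline rather than in isolation. Synthetic stand-in data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
X = rng.poisson(1.0, size=(120, 300)).astype(float)       # 120 documents, 300 terms
y = rng.integers(0, 4, size=120)                          # 4 categories

pipe = Pipeline([
    ("pca", PCA(n_components=20)),                        # first-stage dimension reduction
    ("cda", LinearDiscriminantAnalysis(n_components=3)),  # canonical discriminant step
])
X_low = pipe.fit_transform(X, y)                          # 120 x 3 integrated features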
Journal of Machine Learning Research, 2005
Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced.
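One way to handle documents that may belong to multiple classes, sketched here with invented data (the paper's dimension reduction methods and exact decision functions are not reproduced), is to threshold per-class decision values:

# Sketch: multi-label assignment via per-class decision functions. A document is
# assigned to every class whose decision value is positive.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["oil prices and opec meeting", "grain exports rise",
        "oil and grain markets react", "new football season starts"]
tags = [{"oil"}, {"grain"}, {"oil", "grain"}, {"sport"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
scores = clf.decision_function(vec.transform(["oil and grain prices"]))
predicted = [mlb.classes_[i] for i in range(len(mlb.classes_)) if scores[0, i] > 0]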
International Journal of Computer Applications
As the world is moving towards globalization, digitization of text has been escalating a lot and the need to organize, categorize and classify text has become obligatory. Disorganization or little categorization and sorting of text may result in dawdling response time of information retrieval. There has been the 'curse of dimensionality' (as termed by Bellman)[1] problem, namely the inherent sparsity of high dimensional spaces. Thus, the search for a possible presence of some unspecified structure in such a high dimensional space can be difficult. This is the task of feature reduction methods. They obtain the most relevant information from the original data and represent the information in a lower dimensionality space. In this paper, all the applied methods on feature extraction on text categorization from the traditional bag-of-words model approach to the unconventional neural networks are discussed.
Communications in Computer and Information Science, 2016
To meet the speed and effectiveness requirements of document classification in Web 2.0, the most direct strategy is to reduce the dimension of the document representation without much information loss. Topic models and neural network language models are two main strategies for representing documents in a low-dimensional space. To compare the effectiveness of bag-of-words, topic model and neural network language model representations for document classification, TF*IDF, latent Dirichlet allocation (LDA) and the Paragraph Vector model are selected. Based on the vectors generated by these three methods, support vector machine classifiers are developed respectively. The performances of these three methods on English and Chinese document collections are evaluated. The experimental results show that TF*IDF outperforms LDA and Paragraph Vector, but its high-dimensional vectors take up much time and memory. Furthermore, through cross validation, the results reveal that stop-word elimination and the size of the training samples significantly affect the performances of LDA and Paragraph Vector, and Paragraph Vector shows the potential to outperform the other two methods. Finally, suggestions concerning stop-word elimination and training data size for LDA and Paragraph Vector are provided.
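For the Paragraph Vector representation specifically, a minimal sketch using gensim's Doc2Vec to produce low-dimensional document vectors for an SVM (toy corpus; the hyperparameters below are arbitrary assumptions, not the paper's settings):

# Sketch: Paragraph Vector (gensim Doc2Vec) document embeddings feeding an SVM.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

docs = ["stocks rally after earnings report", "bank lowers interest rates",
        "local team wins championship game", "injured striker misses the final"]
y = [0, 0, 1, 1]

corpus = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=50)   # low-dimensional vectors

X = [model.infer_vector(d.split()) for d in docs]
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([model.infer_vector("rates and earnings".split())]))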
Procedia Technology, 2012
The practical reality of high dimensionality imposes limitations in many pattern recognition areas such as text classification, data mining, information retrieval and face recognition. Unsupervised PCA pays no attention to the class labels of the available training data. LDA is not stable due to the small-sample-size problem, but it yields the best directions when each class has a Gaussian density with a common covariance matrix; it can fail if the class densities are more general or if the class separability captured by the between-class scatter matrix is inadequate. The Maximum Margin Criterion (MMC), with its lower computational cost, is more efficient than LDA for calculating the discriminant vectors because it avoids computing the inverse of the within-class scatter matrix. However, traditional MMC disregards the discriminative information within the local structure of the samples, and its performance depends on the choice of a coefficient. In this paper we characterize the locality of data points by computing distances among them while taking the supervised (label) knowledge into account. We compute the total scatter matrix in a Laplacian-graph-embedded space, and the resulting statistically uncorrelated discriminant vectors reduce redundancy among the extracted features, with no constant to be chosen. Our experiments with the Reuters dataset suggest this algorithm is more efficient than LDA and MMC, and it shows similar or sometimes better results than other locality-based algorithms such as LPP and LSDA.
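For context, a minimal sketch of the plain Maximum Margin Criterion (discriminant vectors as leading eigenvectors of Sb - Sw, so no within-class inverse is needed); the Laplacian-graph-embedded variant proposed here is not reproduced, and the synthetic data is only a stand-in:

# Sketch: plain MMC discriminant vectors = leading eigenvectors of (Sb - Sw).
# No inverse of the within-class scatter is required. Synthetic stand-in data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 12))
y = rng.integers(0, 3, size=80)

mean = X.mean(axis=0)
Sw = np.zeros((12, 12)); Sb = np.zeros((12, 12))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - mean, mc - mean)

eigvals, eigvecs = np.linalg.eigh(Sb - Sw)          # symmetric matrix, so eigh is safe
W = eigvecs[:, np.argsort(eigvals)[::-1][:2]]       # top 2 discriminant directions
X_mmc = X @ W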
Proceedings of The Third …, 2003
The attribute-value representation of documents used in Text Mining provides a natural framework for classifying or clustering documents based on their content. Supervised learning algorithms can be applied whenever the documents have labels preassigned ...
Procedia Engineering, 2011
When dealing with high-dimensional, large-scale multi-class textual data, traditional feature selection methods commonly ignore the semantic relations between words. In order to solve this problem, we introduce category information into the existing LDA-model-based feature selection algorithm and construct an SVM multi-class classifier on the latent topic-text matrix. Experimental results show that this method can improve classification accuracy while effectively reducing the dimensionality; the F1, Macro-F1, and Micro-F1 values are all improved.
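A minimal sketch of the general pattern (topic-model features feeding a multi-class SVM) using scikit-learn's LatentDirichletAllocation; the toy corpus is invented and the paper's category-informed LDA variant is not reproduced:

# Sketch: latent Dirichlet allocation topic features feeding an SVM classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

docs = ["parliament passes budget law", "minister announces tax reform",
        "striker scores twice in derby", "coach happy with the win",
        "new drug trial shows promise", "clinic reports fewer infections"]
y = [0, 0, 1, 1, 2, 2]

counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(counts)
clf = LinearSVC().fit(topics, y)        # multi-class SVM on the topic-document matrix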
KSII Transactions on Internet and Information Systems, 2019
The performance of text classification is highly related to the feature selection method. Usually, two tasks are performed when a feature selection method is applied to construct a feature set: 1) assign a score to each feature and 2) select the top-N features. The selection of the top-N features in existing filter-based feature selection methods is biased by their discriminative power and by the empirical process followed to determine the value of N. In order to improve text classification performance by presenting a more illustrative feature set, we present an approach based on a potent representation learning technique, namely the DBN (Deep Belief Network). This algorithm learns a semantic representation of documents and uses the resulting feature vectors for their formulation. The number of nodes, the number of iterations, and the number of hidden layers are the main parameters of the DBN, which can be tuned to improve the classifier's performance. The results of experiments indicate the effectiveness of the proposed method in increasing classification performance and aiding developers to make effective decisions in certain domains.
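Scikit-learn has no full DBN, so the following is only a loose sketch of the idea (a single RBM learning a latent representation of binarized document vectors, then a classifier on top); the data and parameters are invented and this is not the paper's architecture:

# Loose sketch of DBN-style representation learning: an RBM learns hidden features
# from binarized document vectors, and a classifier is trained on those features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

docs = ["market opens higher", "shares and bonds gain",
        "late goal decides match", "fans celebrate the title"]
y = [0, 0, 1, 1]

X = (CountVectorizer().fit_transform(docs).toarray() > 0).astype(float)  # binary features

pipe = Pipeline([
    ("rbm", BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)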
J. Univers. Comput. Sci., 2008
Feature selection methods are often applied in the context of document classification. They are particularly important for processing large data sets that may contain millions of documents and are typically represented by a large number, possibly tens of thousands, of features. Processing large data sets thus raises the issue of computational resources, and we often have to find the right trade-off between the size of the feature set and the amount of training data that we can take into account. Furthermore, depending on the selected classification technique, different feature selection methods require different optimization approaches, raising the issue of compatibility between the two. We demonstrate an effective classifier training and feature selection method that is suitable for large data collections. We explore feature selection based on the weights obtained from linear classifiers themselves, trained on a subset of the training documents. While most feature weighting schemes scor...
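A minimal sketch of selecting features by the magnitude of weights from a linear classifier (toy data; in practice the selecting classifier would be trained on a subset of the training documents, and the paper's exact weighting scheme is not reproduced):

# Sketch: use the weights of a linear classifier to keep only the highest-weighted
# features, then retrain on the reduced feature set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

docs = ["rates rise again", "bank profits fall", "cup final tonight",
        "club signs keeper", "storm hits the coast", "flood warning issued"]
y = [0, 0, 1, 1, 2, 2]

X = TfidfVectorizer().fit_transform(docs)
selector = SelectFromModel(LinearSVC(C=1.0), max_features=5).fit(X, y)
X_small = selector.transform(X)                        # reduced feature set
clf = LinearSVC().fit(X_small, y)                      # retrain on the selected features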