2003, The Journal of Machine Learning Research
…
We address the problem of categorising documents using kernel-based methods such as Support Vector Machines. Since the work of Joachims (1998), there has been ample experimental evidence that SVMs using standard word frequencies as features yield state-of-the-art performance on a number of benchmark problems. Recently, Lodhi et al. (2002) proposed the use of string kernels, a novel way of computing document similarity based on matching non-consecutive subsequences of characters. In this article, we propose the use of this technique with sequences of words rather than characters. This approach has several advantages: in particular, it is more efficient computationally, and it ties in closely with standard linguistic pre-processing techniques. We present some extensions to sequence kernels dealing with symbol-dependent and match-dependent decay factors, and present empirical evaluations of these extensions on the Reuters-21578 dataset.
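As an illustration of the technique this abstract describes, here is a minimal, unoptimised Python sketch of a gap-weighted subsequence kernel over word tokens, following the recursion of Lodhi et al. (2002) with a single fixed decay factor `lam` (the symbol-dependent and match-dependent extensions mentioned above are omitted). Names are illustrative, not the authors' code.

```python
from functools import lru_cache

def word_sequence_kernel(s, t, p, lam=0.5):
    """Gap-weighted subsequence kernel of order p over token lists s, t.

    Direct (unoptimised) implementation of the recursion in
    Lodhi et al. (2002), with words as symbols instead of characters.
    """
    s, t = tuple(s), tuple(t)

    @lru_cache(maxsize=None)
    def k_prime(i, n, m):
        # Auxiliary kernel K'_i over the prefixes s[:n], t[:m].
        if i == 0:
            return 1.0
        if min(n, m) < i:
            return 0.0
        x = s[n - 1]
        total = lam * k_prime(i, n - 1, m)
        for j in range(1, m + 1):
            if t[j - 1] == x:
                # Gap penalty lam^(m - j + 2) counts the tail of t.
                total += k_prime(i - 1, n - 1, j - 1) * lam ** (m - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(n, m):
        # Main kernel K_p over the prefixes s[:n], t[:m].
        if min(n, m) < p:
            return 0.0
        x = s[n - 1]
        total = k(n - 1, m)
        for j in range(1, m + 1):
            if t[j - 1] == x:
                total += k_prime(p - 1, n - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))
```

In practice the kernel is usually normalised, K(s,t)/sqrt(K(s,s)K(t,t)), so that longer documents do not dominate the similarity.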
The Journal of …, 2002
We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how, despite this fact, the inner product can be evaluated efficiently by a dynamic programming technique.
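To make the exponentially large feature space concrete, the following brute-force sketch enumerates every length-k subsequence explicitly, weighting each occurrence by the decay factor raised to the full length it spans in the text. It computes the same inner product as the dynamic-programming routine sketched above (and can be used to sanity-check it), but is only feasible for very short strings.

```python
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, k, lam):
    """Explicit feature map: phi_u(s) = sum over occurrences of u
    at positions i_1 < ... < i_k of lam**(i_k - i_1 + 1)."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = tuple(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def brute_force_kernel(s, t, k, lam=0.5):
    """Inner product of the explicit feature vectors; exponential in k."""
    phi_s = subsequence_features(s, k, lam)
    phi_t = subsequence_features(t, k, lam)
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)
```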
Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, 2000
We propose to solve a text categorization task using a new metric between documents, based on a priori semantic knowledge about words. This metric can be incorporated into the definition of radial basis kernels of Support Vector Machines or directly used in a K-nearest neighbors algorithm. Both SVM and KNN are tested and compared on the 20newsgroups database. Support Vector Machines provide the best accuracy on test data.
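A minimal sketch of the idea this abstract describes: a semantic document metric plugged into a radial basis kernel. The word-similarity matrix `S` below is a placeholder for whatever a priori semantic knowledge is available; it is assumed positive semi-definite so the resulting kernel is valid, and with `S` equal to the identity this reduces to the ordinary Euclidean RBF kernel on term vectors.

```python
import numpy as np

def semantic_rbf(x1, x2, S, sigma=1.0):
    """exp(-d^2 / (2 sigma^2)) with d^2 = (x1 - x2)^T S (x1 - x2).

    x1, x2: term-frequency vectors of two documents.
    S:      PSD term-similarity matrix encoding semantic knowledge.
    """
    diff = x1 - x2
    d2 = float(diff @ S @ diff)        # semantic squared distance
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

The same distance can be used directly in a K-nearest-neighbours classifier, which is the comparison the abstract reports.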
This work presents kernel functions that can be used in conjunction with the Support Vector Machine (SVM) learning algorithm to solve the automatic text classification task. Initially, the Vector Space Model for text processing is presented. According to this model, texts are seen as vectors in a high-dimensional space; then extensions and alternative models are derived, and some preprocessing procedures are discussed. The SVM learning algorithm, widely employed for text classification, is outlined: its decision procedure is obtained as the solution of an optimization problem. The "kernel trick", which allows the algorithm to be applied to non-linearly separable cases, is presented, as well as some kernel functions that are currently used in text applications. Finally, some text classification experiments employing the SVM classifier are conducted in order to illustrate some text preprocessing techniques and the presented kernel functions.
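A compact illustration of the pipeline this abstract outlines, using scikit-learn: documents are mapped to high-dimensional TF-IDF vectors and classified with an SVM under several standard kernels. The corpus and labels below are toy placeholders, not data from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny placeholder corpus; in practice this would be e.g. Reuters-21578.
docs = ["wheat and grain exports rose", "corn and grain prices fell",
        "the central bank cut interest rates", "bank lending rates rose"]
labels = [0, 0, 1, 1]  # 0 = commodities, 1 = finance (toy classes)

for kernel in ("linear", "poly", "rbf"):
    clf = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
    clf.fit(docs, labels)
    print(kernel, clf.predict(["grain exports increased"]))
```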
Engineering Applications of Artificial Intelligence, 2015
Text categorization plays a crucial role on both academic and commercial platforms due to the growing demand for automatic organization of documents. Kernel-based classification algorithms such as Support Vector Machines (SVM) have become highly popular in the task of text mining, mainly due to their relatively high classification accuracy across several application domains and their ability to handle the high-dimensional and sparse data that are the prohibitive characteristics of textual data representation. Recently, there has been increased interest in the exploitation of background knowledge, such as ontologies and corpus-based statistical knowledge, in text categorization. It has been shown that, by replacing standard kernel functions such as the linear kernel with customized kernel functions that take advantage of this background knowledge, it is possible to increase the performance of SVM in the text classification domain. Based on this, we propose a novel semantic smoothing kernel for SVM. The suggested approach is based on a meaning measure, which calculates the meaningfulness of terms in the context of classes. The document vectors are smoothed based on these meaning values of the terms in the context of classes. Since we make efficient use of the class information in the smoothing process, it can be considered a supervised smoothing kernel. The meaning measure is based on the Helmholtz principle from Gestalt theory and has previously been applied to several text mining applications such as document summarization and feature extraction. However, to the best of our knowledge, ours is the first study to use the meaning measure in a supervised setting to build a semantic kernel for SVM. We evaluated the proposed approach by conducting a large number of experiments on well-known textual datasets and present results with respect to different experimental conditions. We compare our results with those of traditional kernels used in SVM, such as the linear kernel, as well as with several corpus-based semantic kernels. Our results show that the proposed approach outperforms the other kernels in classification performance.
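The paper's meaning measure comes from the Helmholtz principle; the sketch below treats it as a given input and only shows the smoothing construction one might assume from the description: a hypothetical terms-by-classes matrix `M` of meaning values induces a term-proximity matrix P = M Mᵀ, and documents are compared through it. Since P is positive semi-definite by construction, the result is a valid kernel.

```python
import numpy as np

def smoothing_kernel(X, M):
    """Gram matrix K = X P X^T with P = M M^T (PSD by construction).

    X: n-by-V array, rows are document term vectors.
    M: V-by-C array of per-class meaning values for each term
       (hypothetical input standing in for the paper's measure).
    """
    P = M @ M.T        # term-by-term semantic proximity from class meanings
    return X @ P @ X.T
```

The resulting matrix can be fed to an SVM as a precomputed kernel, e.g. `SVC(kernel="precomputed").fit(smoothing_kernel(X, M), y)`.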
2006
The traditional bag-of-words model and the more recent word-sequence kernel are two well-known techniques in the field of text categorization. The bag-of-words representation neglects word order, which can result in lower classification accuracy for some types of documents. The word-sequence kernel takes word order into account but does not capture all of the word-frequency information. A weighted kernel model that combines these two models was proposed by the authors [1]. This paper focuses on the optimization of the weighting parameters, which are functions of word frequency. Experiments have been conducted on the Reuters dataset and show that the new weighted kernel achieves better classification accuracy.
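A minimal sketch of the combination idea, assuming a constant weight `alpha` (the paper optimises frequency-dependent weighting parameters; a fixed `alpha` is the simplest special case). It reuses the `word_sequence_kernel` sketch given earlier in this collection; any convex combination of positive semi-definite kernels is again a valid kernel.

```python
import numpy as np

def combined_kernel(X_bow, docs, p=2, lam=0.5, alpha=0.5):
    """K = alpha * K_bow + (1 - alpha) * K_seq over the same n documents.

    X_bow: dense n-by-V array of word-frequency vectors.
    docs:  list of n token lists, fed to word_sequence_kernel above.
    """
    n = len(docs)
    K_seq = np.array([[word_sequence_kernel(docs[i], docs[j], p, lam)
                       for j in range(n)] for i in range(n)])
    K_bow = X_bow @ X_bow.T          # plain linear kernel on frequencies
    return alpha * K_bow + (1.0 - alpha) * K_seq
```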
Knowledge-Based Systems, 2008
One of the main themes supporting text mining is text representation, i.e., looking for the appropriate terms to transform documents into numerical vectors. Recently, much effort has been invested in this topic to enrich text representation using the vector space model (VSM) and thereby improve the performance of text mining techniques such as classification and clustering. The main concern of this paper is to investigate the effectiveness of using multi-words for text representation on classification performance. Firstly, a practical method is proposed to extract multi-words from documents based on their syntactic structure. Secondly, two strategies, general concept representation and subtopic representation, are presented for representing documents using the extracted multi-words. In particular, a dynamic k-mismatch is proposed to determine the presence of a long multi-word that is a subtopic of the content of a document. Finally, we carried out a series of experiments on classifying the Reuters-21578 documents using these multi-word representations. We used the performance of the representation with individual words as the baseline, which has the largest feature-set dimension of any representation without linguistic preprocessing. Moreover, the linear kernel and the non-linear polynomial kernel in support vector machines (SVM) are examined comparatively to investigate the effect of kernel type on classification performance, and index terms with low information gain (IG) are removed from the feature set at different percentages to observe the robustness of each classification method. Our experiments demonstrate that, among the multi-word representations, the subtopic representation outperforms the general concept representation, and that the linear kernel outperforms the non-linear kernel of SVM in classifying the Reuters data. The effect of applying different representation strategies is greater than the effect of applying different SVM kernels on classification performance. Furthermore, the representation using individual words outperforms any representation using multi-words. This is consistent with most opinions concerning the role of linguistic preprocessing of document features when using SVM for classification.
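As a rough illustration of the k-mismatch idea (not the paper's exact dynamic procedure), the following sketch counts a long multi-word as present in a document if some window of the document matches it with at most k token mismatches.

```python
def present_with_k_mismatch(multiword, doc_tokens, k):
    """True if some contiguous window of doc_tokens matches the
    multi-word (a token sequence) with at most k token mismatches."""
    m = len(multiword)
    for start in range(len(doc_tokens) - m + 1):
        window = doc_tokens[start:start + m]
        mismatches = sum(a != b for a, b in zip(multiword, window))
        if mismatches <= k:
            return True
    return False
```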
Lecture Notes in Computer Science, 2014
We propose a semantic kernel for Support Vector Machines (SVM) that takes advantage of higher-order relations between words and between documents. The conventional approach in text categorization systems is to represent documents as a "Bag of Words" (BOW), in which the relations between the words and their positions are lost. Additionally, traditional machine learning algorithms assume that instances, in our case documents, are independent and identically distributed. This approach simplifies the underlying models, but it ignores the semantic connections between words as well as the semantic relations between documents that stem from those words. In this study, we improve the semantic knowledge capture capability of a previous work [1], called the χ-Sim algorithm, and use this method in the SVM as a semantic kernel. The proposed approach is evaluated on different benchmark textual datasets. Experimental results show that classification performance improves over the well-known traditional kernels used in the SVM, such as the linear kernel (one of the state-of-the-art approaches for text classification), the polynomial kernel and the Radial Basis Function (RBF) kernel.
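A heavily simplified sketch of the higher-order intuition behind χ-Sim: document similarity and word similarity are computed jointly and reinforce each other over iterations. The real algorithm's normalisation and damping details are omitted here; the crude rescaling below is only for numerical stability. Both matrices stay positive semi-definite under this recursion, so `SR` can be passed to an SVM as a precomputed kernel.

```python
import numpy as np

def co_similarity(A, iters=4):
    """Joint document/term similarities from a non-zero, non-negative
    documents-by-terms count matrix A (simplified χ-Sim-style recursion)."""
    n_docs, n_terms = A.shape
    SR = np.eye(n_docs)    # document-document similarity
    SC = np.eye(n_terms)   # term-term similarity
    for _ in range(iters):
        SR = A @ SC @ A.T          # documents similar via similar terms
        SR = SR / SR.max()         # crude rescaling for stability
        SC = A.T @ SR @ A          # terms similar via similar documents
        SC = SC / SC.max()
    return SR, SC
```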
2012
This work concerns a comparison of SVM kernel methods in text categorization tasks. In particular, I define a kernel function that estimates the similarity between two objects from their compressed lengths; compression algorithms can detect arbitrarily long dependencies within text strings. Vectorizing text loses information during feature extraction and is highly sensitive to the language of the text, whereas compression-based methods are language independent and require no text preprocessing. Moreover, the accuracy computed on the Web-KB, 20ng and Reuters-21578 datasets is, in some cases, greater than that of Gaussian, linear and polynomial kernels. The method's limitations are the computational time complexity of building the Gram matrix and very poor performance on non-textual datasets.
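A minimal sketch of a compression-based similarity of the kind described, using the normalised compression distance (NCD), with zlib standing in for whichever compressor the study actually used, and an exponential map to turn the distance into a similarity. Such Gram matrices are not guaranteed positive semi-definite, and their pairwise construction cost is the limitation noted above.

```python
import zlib
import numpy as np

def c(x: bytes) -> int:
    """Compressed length of x at maximum compression level."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def ncd_gram(docs, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * NCD(d_i, d_j)); O(n^2) pairs."""
    bs = [d.encode("utf-8") for d in docs]
    n = len(bs)
    return np.array([[np.exp(-gamma * ncd(bs[i], bs[j]))
                      for j in range(n)] for i in range(n)])
```

The matrix can then be used with `SVC(kernel="precomputed")`, with the usual caveats about indefinite kernels.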
2017
The popularity of the internet is increasing day by day, which makes it difficult for end-users to find the desired pages on the web in a short time. Text classification, a branch of machine learning, can shed light on this problem. State-of-the-art classifiers such as Support Vector Machines (SVM) have become very popular in the domain of text classification. This paper studies the effect of SVM with different kernel methods and proposes a modified cosine-distance-based log kernel (log-cosine) for text classification, which is proven to be conditionally positive definite (CPD). Its classification performance is compared with other CPD and positive definite (PD) kernels. A novel feature selection technique is also proposed which improves the effectiveness of the classification and captures the essential terms in the corpus without deteriorating the outcome of the construction process. From the experimental results, it is observed that CPD kernels provide better results for text classification when used with SVMs than PD kernels, and that the performance of the proposed log-cosine kernel is better than that of existing kernel methods.
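The exact kernel proposed in the paper is not reproduced here; the sketch below only illustrates the general construction one might assume from the description: the classical log kernel −log(1 + d^β), which is conditionally positive definite, with the Euclidean distance replaced by a cosine distance on term vectors. CPD kernels can be used with SVMs because the dual problem only evaluates the kernel on coefficient vectors that sum to zero.

```python
import numpy as np

def log_cosine_kernel(x, y, beta=1.0):
    """-log(1 + cos_dist^beta) on term vectors x, y (illustrative form).

    cos_dist = 1 - cosine similarity lies in [0, 1] for non-negative
    term-frequency vectors.
    """
    cos_sim = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cos_dist = 1.0 - cos_sim
    return -np.log(1.0 + cos_dist ** beta)
```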
System Sciences, 2003. …, 2003
Text categorization is the process of sorting text documents into one or more predefined categories or classes of similar documents. Differences in the results of such categorization arise from the feature set chosen to base the association of a given document with a given category. Advocates of text categorization recognize that the sorting of text documents into categories of like