2005, Proceedings of the Ninth Conference on Computational Natural Language Learning - CoNLL '05
…
In this paper we propose and evaluate a technique to perform semi-supervised learning for text categorization.
Principles of Data Mining and Knowledge Discovery, 2000
Supervised learning algorithms usually require large amounts of training data to learn reasonably accurate classifiers. Yet, for many text classification tasks, providing labeled training documents is expensive, while unlabeled documents are readily available in large quantities. Learning from both labeled and unlabeled documents in a semi-supervised framework is a promising approach to reducing the need for labeled training documents. This paper compares three commonly applied text classifiers in the light of semi-supervised learning: a linear support vector machine, a similarity-based tf-idf classifier, and a Naïve Bayes classifier. Results on real-world text datasets show that these learners can benefit substantially from using a large amount of unlabeled documents in addition to some labeled documents.
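As a concrete illustration of the semi-supervised setup compared above, the following sketch wraps one of the three learners (Naïve Bayes) in a generic self-training loop using scikit-learn. The toy documents and the SelfTrainingClassifier wrapper are illustrative stand-ins under stated assumptions, not the paper's exact procedure.

```python
# Hedged sketch: a Naive Bayes text classifier inside a generic self-training
# loop. scikit-learn marks unlabeled samples with -1; high-confidence
# predictions on them are folded back into the training set.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

labeled_docs = ["cheap meds online", "meeting agenda attached"]   # toy data
labels = [1, 0]                                                   # 1 = spam
unlabeled_docs = ["free offer inside", "see you at the meeting"]

vec = TfidfVectorizer()
X = vec.fit_transform(labeled_docs + unlabeled_docs)
y = np.array(labels + [-1] * len(unlabeled_docs))  # -1 marks unlabeled

clf = SelfTrainingClassifier(MultinomialNB(), threshold=0.8)
clf.fit(X, y)
print(clf.predict(vec.transform(["free meds offer"])))
```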
Artificial Intelligence Review
A huge amount of data is generated daily, leading to big-data challenges. One of them relates to text mining, especially text classification. To perform this task we usually need a large set of labeled data, which can be expensive, time-consuming, or difficult to obtain. In this scenario, semi-supervised learning (SSL), the branch of machine learning concerned with using both labeled and unlabeled data, has expanded in volume and scope. Since no recent survey overviews how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieved 1794 works from the last 5 years from IEEE Xplore, ACM Digital Library, Science Direct, and Springer; 157 articles were then selected for inclusion in this review. We present the application domains, datasets, and languages employed in the works, as well as the text representations and machine learning algorithms. We also summarize and organize the works following a recent taxonomy of SSL, and analyze the percentage of labeled data used, the evaluation metrics, and the obtained results. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.
Information Retrieval, 2009
Most current methods for automatic text categorization are based on supervised learning techniques and therefore face the problem of requiring a great number of training instances to construct an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization that considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier. This method, although language-independent, is most pertinent for scenarios where large sets of labeled resources do not exist; that could be the case, for instance, for several application domains in non-English languages such as Spanish. The experimental evaluation of the method was carried out on three different tasks and in two different languages. The achieved results demonstrate the applicability and usefulness of the proposed method.
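A minimal self-training loop in the spirit of the method described above might look as follows. The Web-extraction and enrichment steps are abstracted away (the unlabeled pool is simply passed in), and LogisticRegression is an arbitrary stand-in for the paper's base classifier.

```python
# Hedged sketch of plain self-training: each round, unlabeled documents whose
# predicted class probability clears a threshold are pseudo-labeled and moved
# into the training set. Not the paper's enriched variant.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(labeled, y, unlabeled, rounds=5, threshold=0.9):
    vec = TfidfVectorizer().fit(labeled + unlabeled)
    X, y = vec.transform(labeled), np.asarray(y)
    pool = list(unlabeled)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X, y)
        if not pool:
            break
        proba = clf.predict_proba(vec.transform(pool))
        conf, pred = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= threshold          # accept only confident pseudo-labels
        if not keep.any():
            break
        X = vstack([X, vec.transform([d for d, k in zip(pool, keep) if k])])
        y = np.concatenate([y, pred[keep]])
        pool = [d for d, k in zip(pool, keep) if not k]
    return clf, vec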
2012
Text categorization automatically assigns a document to its underlying topics. Documents are typically represented as bags of words, and machine-learning-based approaches have been shown to provide effective and scalable solutions by learning from examples. However, a limiting factor in the application of these approaches is the large number of examples required to train a classifier working on large taxonomies of classes. This paper presents a method to integrate prior knowledge that is typically available on the learning task into a text classifier based on kernel machines. The presented solution deals with any prior knowledge represented as first-order logic (FOL) and, thanks to the generality of this formulation, can be used to express relations among the input patterns, known semantic relationships among the output categories, and input-output rules. The kernel machine mathematical apparatus is reused to cast the learning problem into a primal optimization of a function composed of the loss on the supervised examples, the regularization term, and a penalty term derived from converting the knowledge into a set of continuous constraints. Experimental results on the popular CORA dataset show that the proposed approach outperforms both SVMs and state-of-the-art semi-supervised techniques on the multi-label text classification problem.
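The penalty construction can be illustrated on a single implication rule. In the numpy sketch below (a simplification, not the paper's formulation), the FOL rule A(x) ⇒ B(x) is relaxed into the continuous penalty max(0, f_A(x) − f_B(x)) on unlabeled points and minimized together with a regularized hinge loss by subgradient descent; all data and weights are synthetic.

```python
# Illustrative sketch: two linear scoring functions f_A, f_B trained with
# hinge loss + L2, plus a penalty on unlabeled points where f_A > f_B,
# i.e. where the relaxed rule A(x) => B(x) is violated.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
X_lab = rng.normal(size=(n, d))                         # labeled examples
y_A = np.sign(X_lab[:, 0] + 1e-9)                       # toy +/-1 targets
y_B = np.sign(X_lab[:, 0] + 0.2 * X_lab[:, 1] + 1e-9)   # A mostly implies B
X_un = rng.normal(size=(200, d))                        # unlabeled points

w_A, w_B = np.zeros(d), np.zeros(d)
lr, lam, mu = 0.05, 0.01, 0.1   # step size, L2 weight, rule-penalty weight
for _ in range(300):
    for w, y in ((w_A, y_A), (w_B, y_B)):               # supervised part
        viol = y * (X_lab @ w) < 1                      # hinge subgradient
        w -= lr * (lam * w - (y[viol, None] * X_lab[viol]).sum(0) / n)
    # continuous relaxation of A(x) => B(x): penalize f_A(x) > f_B(x)
    active = X_un @ (w_A - w_B) > 0
    g = X_un[active].sum(0) / len(X_un)
    w_A -= lr * mu * g
    w_B += lr * mu * g
print("rule violations left:", int((X_un @ (w_A - w_B) > 0).sum()))
```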
Knowledge Based Systems, 2016
Vector space models (VSMs) are commonly used in language processing to represent certain aspects of natural language semantics. The semantics of VSMs comes from the distributional hypothesis, which states that words occurring in similar contexts usually have similar meanings. In our previous work, we proposed novel semantic smoothing kernels based on class-specific transformations. These kernels use class-term matrices, which can be considered a new type of VSM. By using the class as the context, these methods can extract class-specific semantics by making use of word distributions both in documents and in different classes. In this study, we adapt two of these semantic classification approaches to build a novel, high-performance semi-supervised text classification algorithm. These approaches include Helmholtz-principle-based calculation of term meanings in the context of classes for the initial classification, and a supervised-term-weighting-based semantic kernel with Support Vector Machines (SVMs) for the final classification model. The approach used in the first phase is especially good at learning from very small datasets, while the approach in the second phase is specifically good at eliminating noise in relatively large and noisy training sets when building a classification model. Overall, as a semantic semi-supervised learning algorithm, our approach can effectively utilize an abundant source of unlabeled instances to improve classification accuracy significantly, especially when the amount of labeled instances is limited.
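The class-term-matrix idea can be sketched as follows. Here S holds plain per-class term frequencies, whereas the paper derives term meanings from the Helmholtz principle and uses supervised term weighting, so this is only a structural illustration of the kernel, not the authors' method.

```python
# Simplified class-based semantic kernel: documents are projected through a
# class-term matrix S, and the resulting Gram matrix is fed to an SVM via
# kernel='precomputed'.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

docs = ["stock market prices fall", "team wins the final match",
        "shares rally on earnings", "player scores in the derby"]
y = np.array([0, 1, 0, 1])

X = CountVectorizer().fit_transform(docs).toarray().astype(float)
# class-term matrix: row c = summed term counts of documents in class c
S = np.vstack([X[y == c].sum(axis=0) for c in np.unique(y)])
P = X @ S.T                      # documents represented in class space
K = P @ P.T                      # semantic kernel (positive semi-definite)

svm = SVC(kernel="precomputed").fit(K, y)
print(svm.predict(K))            # on training data, for illustration only
```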
Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, 2000
We propose to solve a text categorization task using a new metric between documents based on a priori semantic knowledge about words. This metric can be incorporated into the definition of radial basis kernels for Support Vector Machines or used directly in a k-nearest-neighbors algorithm. Both SVM and KNN are tested and compared on the 20 Newsgroups dataset. Support Vector Machines provide the best accuracy on test data.
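A toy version of such a semantically informed metric is sketched below: a positive semi-definite word-similarity matrix M (here built from random stand-in embeddings rather than real prior knowledge) induces document inner products and hence a distance usable inside a radial-basis kernel.

```python
# Hedged sketch: semantic Gram matrix X M X^T over term-count vectors, turned
# into squared distances and then into a Gaussian kernel for a precomputed SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_docs, n_words = 6, 10
X = rng.integers(0, 3, size=(n_docs, n_words)).astype(float)  # term counts
y = np.array([0, 0, 0, 1, 1, 1])

E = rng.normal(size=(n_words, 4))     # stand-in word embeddings (assumption)
M = E @ E.T                           # PSD word-similarity matrix
G = X @ M @ X.T                       # semantic Gram matrix
d2 = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G   # squared distances
K = np.exp(-0.01 * d2)                # radial-basis kernel on the new metric

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K))
```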
2006
The traditional bag-of-words model and the more recent word-sequence kernel are two well-known techniques in the field of text categorization. The bag-of-words representation neglects word order, which can result in lower classification accuracy for some types of documents. The word-sequence kernel takes word order into account, but does not include all of the word-frequency information. A weighted kernel model that combines these two models was proposed by the authors [1]. This paper focuses on the optimization of the weighting parameters, which are functions of word frequency. Experiments conducted on the Reuters database show that the new weighted kernel achieves better classification accuracy.
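The combination itself can be illustrated with a simple convex mixture of Gram matrices. In this sketch a word-bigram kernel stands in for the full word-sequence kernel, and a single scalar lambda replaces the paper's frequency-dependent weighting functions.

```python
# Hedged sketch: convex combination of a bag-of-words kernel and a crude
# order-sensitive (word-bigram) kernel; a mixture of PSD kernels is PSD.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

docs = ["the cat sat on the mat", "the dog barked at the cat",
        "stock prices fell sharply", "markets fell as prices dropped"]
y = np.array([0, 0, 1, 1])

def gram(ngram_range):
    F = CountVectorizer(ngram_range=ngram_range).fit_transform(docs)
    return (F @ F.T).toarray().astype(float)

K_bow, K_seq = gram((1, 1)), gram((2, 2))
lam = 0.6                                   # illustrative weighting parameter
K = lam * K_bow + (1 - lam) * K_seq         # weighted kernel combination

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K))
```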
International Journal of Data Mining & Knowledge Management Process, 2012
Automatic text categorization (ATC) is a prominent research area within information retrieval. This paper discusses a classification model for ATC in the multi-label domain. We propose a new multi-label text classification model for assigning a more relevant set of categories to each input text document. Our model is greatly influenced by graph-based frameworks and semi-supervised learning. We demonstrate the effectiveness of our model on the Enron, Slashdot, Bibtex, and RCV1 datasets. Our experimental results indicate that the use of semi-supervised learning in multi-label text classification (MLTC) greatly improves the decision-making capability of the classifier.
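As a rough illustration of graph-based semi-supervised multi-label classification (not the paper's exact model), the sketch below runs one scikit-learn LabelSpreading model per label in a binary-relevance decomposition; the documents and label matrix are toy assumptions.

```python
# Hedged sketch: graph-based SSL for multi-label text, one label-propagation
# model per label column; -1 marks unlabeled entries.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

docs = ["oil prices surge", "election results announced",
        "energy markets react to vote", "parliament debates oil tax"]
# label matrix: rows = docs, cols = (economy, politics); -1 = unlabeled
Y = np.array([[1, 0], [0, 1], [-1, -1], [-1, -1]])

X = TfidfVectorizer().fit_transform(docs).toarray()
pred = np.zeros_like(Y)
for j in range(Y.shape[1]):                 # binary relevance per label
    model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, Y[:, j])
    pred[:, j] = model.transduction_        # labels inferred over the graph
print(pred)
```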
A generic system for text categorization is presented which uses a representative text corpus to adapt its processing steps: feature extraction, dimension reduction, and classification. Feature extraction automatically learns features from the corpus by reducing actual word forms using statistical information from the corpus and general linguistic knowledge. The dimension of the feature vector is then reduced by a linear transformation that keeps the essential information. The classification principle is a minimum-least-squares approach based on polynomials. The described system can be readily adapted to new domains or new languages, and in application it is reliable, fast, and operates completely automatically. It is shown that the text categorizer works successfully both on text generated by document image analysis (DIA) and on ground-truth data.
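The minimum-least-squares polynomial classifier can be sketched in a few lines: expand the (already dimension-reduced) feature vectors polynomially, then solve for a linear map onto one-hot class targets in the least-squares sense. The data below is synthetic and the degree-2 expansion is an assumption.

```python
# Hedged sketch of a least-squares polynomial classifier.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                    # reduced feature vectors
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)  # toy 2-class target

def poly_expand(X):
    # degree-2 polynomial terms: bias, x_i, and pairwise products x_i * x_j
    feats = [np.ones(len(X)), *X.T]
    feats += [X[:, i] * X[:, j] for i in range(X.shape[1])
              for j in range(i, X.shape[1])]
    return np.column_stack(feats)

P = poly_expand(X)
T = np.eye(2)[y]                                # one-hot class targets
W, *_ = np.linalg.lstsq(P, T, rcond=None)       # minimum-least-squares fit
pred = (P @ W).argmax(axis=1)
print("train accuracy:", (pred == y).mean())
```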
Neural Information Processing, 2018
For many text classification tasks, the lack of labeled data in a target domain poses a major problem. Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as native language identification and automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification.
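One simple way to make a string kernel transductive, in the spirit of (but much simpler than) the two-step algorithm above, is to build and normalize the kernel jointly over source and target documents, so the representation adapts to the test set without touching its labels. The character 4-gram presence kernel and toy polarity data below are assumptions for illustration.

```python
# Hedged sketch: a character n-gram intersection kernel computed jointly over
# train and test documents, normalized, then used by a precomputed-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def char_ngrams(doc, n=4):
    return {doc[i:i + n] for i in range(len(doc) - n + 1)}

train = ["the movie was wonderful", "a dull and boring film"]
y_train = np.array([1, 0])
test = ["wonderful acting, great film", "boring plot, dull pacing"]

sets = [char_ngrams(d) for d in train + test]
K = np.array([[len(a & b) for b in sets] for a in sets], dtype=float)
K /= np.sqrt(np.outer(np.diag(K), np.diag(K)))   # kernel normalization

n = len(train)
clf = SVC(kernel="precomputed").fit(K[:n, :n], y_train)
print(clf.predict(K[n:, :n]))                    # rows: test vs. train docs
```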