In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents; it then trains a new classifier using the labels for all the documents, and iterates to convergence. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 33%.
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.
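The train-label-retrain loop described above can be sketched in a few lines. The following is a minimal illustration using scikit-learn's multinomial naive Bayes with a hard-label E-step (the paper uses probabilistic, fractional labels); the tiny spam/ham dataset, class encoding, and iteration cap are all invented for demonstration.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative data: a handful of labeled documents plus an unlabeled pool.
labeled_docs = ["cheap meds online now", "meeting at noon today",
                "win a free prize now", "lunch tomorrow at noon"]
labels = np.array([1, 0, 1, 0])          # 1 = spam, 0 = ham (made up)
unlabeled_docs = ["free prize inside", "see you at lunch",
                  "cheap meds free now"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled_docs)
X_unl = vec.transform(unlabeled_docs)

# Step 1: train an initial classifier on the labeled documents alone.
clf = MultinomialNB().fit(X_lab, labels)

# Steps 2-3: label the unlabeled pool, retrain on all documents,
# and iterate until the assigned labels stop changing.
prev = None
for _ in range(20):
    guesses = clf.predict(X_unl)
    if prev is not None and np.array_equal(guesses, prev):
        break                            # labels stable: converged
    prev = guesses
    clf = MultinomialNB().fit(vstack([X_lab, X_unl]),
                              np.concatenate([labels, guesses]))
```

With soft labels one would instead weight each unlabeled document by its class posterior (e.g. via `sample_weight`), which is closer to true EM than the hard self-training variant shown here.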
Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10, 2010
Rule-based systems for processing text data encode the knowledge of a human expert into a rule base, making decisions based on interactions between the input data and the rules. Similarly, supervised learning systems can learn patterns present in a given dataset to make decisions on similar and related data. The performance of both classes of models depends largely on the training examples from which they were built. Even though trained models may fit the training data well, the accuracy they yield on new test data can be considerably different. Computing the accuracy of learned models on new unlabeled datasets is a challenging problem: it requires costly labeling, which is still likely to cover only a subset of the new data given the large dataset sizes involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks, document classification and postal address segmentation, using both supervised machine learning methods and human-generated rule-based models.
As the number of online documents increases, so does the demand for document classification to aid their analysis and management. Text is cheap, but information, in the form of knowing what classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation-maximization technique for document classification and to show how accuracy improves under a semi-supervised approach. The expectation-maximization algorithm is applied in both supervised and semi-supervised settings, and the semi-supervised approach is found to be more accurate and effective. Its main advantage is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and then probabilistically classifies the unlabeled documents. The car dataset used for evaluation was collected from the UCI repository, with some modifications of our own.
Principles of Data Mining and Knowledge Discovery, 2000
Supervised learning algorithms usually require large amounts of training data to learn reasonably accurate classifiers. Yet, for many text classification tasks, providing labeled training documents is expensive, while unlabeled documents are readily available in large quantities. Learning from both labeled and unlabeled documents in a semi-supervised framework is a promising approach to reduce the need for labeled training documents. This paper compares three commonly applied text classifiers in the light of semi-supervised learning, namely a linear support vector machine, a similarity-based tf-idf classifier, and a Naïve Bayes classifier. Results on real-world text datasets show that these learners may substantially benefit from using a large amount of unlabeled documents in addition to some labeled documents.
2004
Traditionally, text classifiers are built from labeled training examples. Labeling is usually done manually by human experts (or the users), which is a labor-intensive and time-consuming process. In the past few years, researchers investigated various forms of semi-supervised learning to reduce the burden of manual labeling. In this paper, we propose a different approach. Instead of labeling a set of documents, the proposed method labels a set of representative words for each class. It then uses these words to extract a set of documents for each class from a set of unlabeled documents to form the initial training set. The EM algorithm is then applied to build the classifier. The key issue of the approach is how to obtain a set of representative words for each class. One way is to ask the user to provide them, which is difficult because the user can usually give only a few words (insufficient for accurate learning). We propose a method to solve this problem. It combines clustering and feature selection, and can effectively rank the words in the unlabeled set according to their importance. The user then selects/labels some words from the ranked list for each class. This process requires less effort than providing words unaided or manually labeling documents. Our results show that the new method is highly effective and promising.
Computers in Human Behavior, 2014
Supervised text classifiers need to learn from many labeled examples to achieve high accuracy. However, in a real context, sufficient labeled examples are not always available because human labeling is enormously time-consuming. For this reason, there has been recent interest in methods capable of achieving high accuracy when the size of the training set is small.
… The 30th European …, 2008
This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its labeling errors. These probabilities are then taken into account in the estimation of the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [9], which is based on the use of fake labels, while maintaining its simplicity and ability to solve multiclass problems. In addition, it gives valuable information about the most uncertain and difficult classes to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach over two other semi-supervised algorithms applied to these text classification problems.
Lecture Notes in Computer Science, 2002
A key difficulty in applying machine learning classification algorithms to many applications is that they require a lot of hand-labeled examples. Labeling large amounts of data is a costly process which in many cases is prohibitive. In this paper we show how the use of a small number of labeled data together with a large number of unlabeled data can create high-accuracy classifiers. Our approach does not rely on any parametric assumptions about the data, as is usually the case with the generative methods widely used in semi-supervised learning. We propose new discriminant algorithms handling both labeled and unlabeled data for training classification models, and we analyze their performance on different information access problems ranging from text span classification for text summarization to e-mail spam detection and text classification.
We describe a method for improving the classification of short text strings using a combination of labeled training data plus a secondary corpus of unlabeled but related longer documents. We show that such unlabeled background knowledge can greatly decrease error rates, particularly if the number of examples or the size of the strings in the training set is small.
Communications in Computer and Information Science, 2013
It is well known that supervised text classification methods need to learn from many labeled examples to achieve high accuracy. However, in a real context, sufficient labeled examples are not always available. In this paper we demonstrate that a way to obtain high accuracy, when the number of labeled examples is low, is to consider structured features instead of a list of weighted words as observed features. The proposed feature vector considers a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be automatically constructed from a set of documents through the probabilistic Topic Model.