Learning to Classify Text from Labeled and Unlabeled Documents

Abstract

In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents; it then trains a new classifier using the labels for all the documents, and iterates to convergence. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 33%.
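
The following is a minimal sketch of the EM-with-naive-Bayes procedure the abstract describes, not the authors' implementation. It assumes bag-of-words count matrices as input, uses Laplace smoothing, and runs a fixed number of EM iterations in place of an explicit convergence test; all function and variable names are illustrative.

```python
import numpy as np

def train_nb(X, P, alpha=1.0):
    """Estimate naive Bayes parameters from a count matrix X (docs x words)
    and class-membership probabilities P (docs x classes)."""
    # Class priors: expected fraction of documents in each class (smoothed).
    priors = (P.sum(axis=0) + alpha) / (P.shape[0] + alpha * P.shape[1])
    # Per-class word probabilities from expected word counts (smoothed).
    counts = P.T @ X  # classes x words
    word_probs = (counts + alpha) / (
        counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]
    )
    return np.log(priors), np.log(word_probs)

def posterior(X, log_priors, log_word_probs):
    """P(class | document) for each document, computed in log space."""
    log_joint = X @ log_word_probs.T + log_priors  # docs x classes
    log_joint -= log_joint.max(axis=1, keepdims=True)
    probs = np.exp(log_joint)
    return probs / probs.sum(axis=1, keepdims=True)

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, n_classes, n_iter=20):
    """Learn from labeled and unlabeled documents via EM + naive Bayes."""
    # One-hot labels for the labeled pool stay fixed across iterations.
    P_labeled = np.eye(n_classes)[y_labeled]
    # Step 1: train an initial classifier on the labeled documents alone.
    log_priors, log_word_probs = train_nb(X_labeled, P_labeled)
    X_all = np.vstack([X_labeled, X_unlabeled])
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled documents.
        P_unlabeled = posterior(X_unlabeled, log_priors, log_word_probs)
        # M-step: retrain on all documents using the current (soft) labels.
        P_all = np.vstack([P_labeled, P_unlabeled])
        log_priors, log_word_probs = train_nb(X_all, P_all)
    return log_priors, log_word_probs
```

A caller would build `X_labeled` and `X_unlabeled` as document-by-word count matrices over a shared vocabulary, then classify a new document by evaluating `posterior` with the returned parameters and taking the argmax over classes.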