In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents; it then trains a new classifier using the labels for all the documents, and iterates to convergence. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 33%.
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available.
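The train-label-retrain loop described above can be sketched in a few lines. The following is a minimal illustration using scikit-learn's multinomial naive Bayes with a hard-label E-step (the paper uses probabilistic, fractional labels); the tiny spam/ham dataset, class encoding, and iteration cap are all invented for demonstration.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative data: a handful of labeled documents plus an unlabeled pool.
labeled_docs = ["cheap meds online now", "meeting at noon today",
                "win a free prize now", "lunch tomorrow at noon"]
labels = np.array([1, 0, 1, 0])          # 1 = spam, 0 = ham (made up)
unlabeled_docs = ["free prize inside", "see you at lunch",
                  "cheap meds free now"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled_docs)
X_unl = vec.transform(unlabeled_docs)

# Step 1: train an initial classifier on the labeled documents alone.
clf = MultinomialNB().fit(X_lab, labels)

# Steps 2-3: label the unlabeled pool, retrain on all documents,
# and iterate until the assigned labels stop changing.
prev = None
for _ in range(20):
    guesses = clf.predict(X_unl)
    if prev is not None and np.array_equal(guesses, prev):
        break                            # labels stable: converged
    prev = guesses
    clf = MultinomialNB().fit(vstack([X_lab, X_unl]),
                              np.concatenate([labels, guesses]))
```

With soft labels one would instead weight each unlabeled document by its class posterior (e.g. via `sample_weight`), which is closer to true EM than the hard self-training variant shown here.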
Proceedings of the 19th ACM international conference on Information and knowledge management - CIKM '10, 2010
Rule-based systems for processing text data encode the knowledge of a human expert into a rule base, making decisions based on interactions between the input data and the rules. Similarly, supervised learning systems can learn patterns present in a given dataset to make decisions on similar and related data. The performance of both classes of models depends largely on the training examples from which they were built. Even though trained models may fit the training data well, the accuracy they yield on new test data can be considerably different. Computing the accuracy of learned models on new unlabeled datasets is a challenging problem: it requires costly labeling, which is still likely to cover only a subset of the new data given the large dataset sizes involved. In this paper, we present a method to estimate the accuracy of a given model on a new dataset without manually labeling the data. We verify our method on large datasets for two shallow text processing tasks, document classification and postal address segmentation, using both supervised machine learning methods and human-generated rule-based models.
As the number of online documents increases, so does the demand for document classification to aid their analysis and management. Text is cheap, but information, in the form of knowing what classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation-maximization technique for document classification and to show how accuracy improves under a semi-supervised approach. The expectation-maximization algorithm is applied in both supervised and semi-supervised settings, and the semi-supervised approach is found to be more accurate and effective. Its main advantage is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and then probabilistically classifies the unlabeled documents. The car dataset used for evaluation was collected from the UCI repository, with some modifications of our own.
Principles of Data Mining and Knowledge Discovery, 2000
Supervised learning algorithms usually require large amounts of training data to learn reasonably accurate classifiers. Yet, for many text classification tasks, providing labeled training documents is expensive, while unlabeled documents are readily available in large quantities. Learning from both labeled and unlabeled documents in a semi-supervised framework is a promising approach to reduce the need for labeled training documents. This paper compares three commonly applied text classifiers in the light of semi-supervised learning, namely a linear support vector machine, a similarity-based tf-idf classifier, and a Naïve Bayes classifier. Results on real-world text datasets show that these learners may substantially benefit from using a large amount of unlabeled documents in addition to some labeled documents.
2004
Traditionally, text classifiers are built from labeled training examples. Labeling is usually done manually by human experts (or the users), which is a labor-intensive and time-consuming process. In the past few years, researchers investigated various forms of semi-supervised learning to reduce the burden of manual labeling. In this paper, we propose a different approach. Instead of labeling a set of documents, the proposed method labels a set of representative words for each class. It then uses these words to extract a set of documents for each class from a set of unlabeled documents to form the initial training set. The EM algorithm is then applied to build the classifier. The key issue of the approach is how to obtain a set of representative words for each class. One way is to ask the user to provide them, which is difficult because the user can usually give only a few words (insufficient for accurate learning). We propose a method to solve this problem. It combines clustering and feature selection, and can effectively rank the words in the unlabeled set according to their importance. The user then selects/labels some words from the ranked list for each class. This process requires less effort than providing words unaided or manually labeling documents. Our results show that the new method is highly effective and promising.
Computers in Human Behavior, 2014
Supervised text classifiers need to learn from many labeled examples to achieve high accuracy. However, in a real context, sufficient labeled examples are not always available because human labeling is enormously time-consuming. For this reason, there has been recent interest in methods capable of achieving high accuracy when the size of the training set is small.
… The 30th European …, 2008
This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its labeling errors. These probabilities are then taken into account in the estimation of the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [9], which is based on the use of fake labels, while maintaining its simplicity and ability to solve multiclass problems. In addition, it gives valuable information about the most uncertain and difficult classes to label. We perform experiments on the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach over two other semi-supervised algorithms applied to these text classification problems.
Lecture Notes in Computer Science, 2002
A key difficulty in applying machine learning classification algorithms to many applications is that they require a lot of hand-labeled examples. Labeling large amounts of data is a costly process which in many cases is prohibitive. In this paper we show how the use of a small number of labeled data together with a large number of unlabeled data can create high-accuracy classifiers. Our approach does not rely on any parametric assumptions about the data, as is usually the case with the generative methods widely used in semi-supervised learning. We propose new discriminant algorithms handling both labeled and unlabeled data for training classification models, and we analyze their performance on different information access problems ranging from text span classification for text summarization to e-mail spam detection and text classification.
We describe a method for improving the classification of short text strings using a combination of labeled training data plus a secondary corpus of unlabeled but related longer documents. We show that such unlabeled background knowledge can greatly decrease error rates, particularly if the number of examples or the size of the strings in the training set is small.
Communications in Computer and Information Science, 2013
It is well known that supervised text classification methods need to learn from many labeled examples to achieve high accuracy. However, in a real context, sufficient labeled examples are not always available. In this paper we demonstrate that a way to obtain high accuracy, when the number of labeled examples is low, is to consider structured features instead of a list of weighted words as observed features. The proposed feature vector considers a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be automatically constructed from a set of documents through the probabilistic Topic Model.