Learning Classification with Both Labeled and Unlabeled Data

Vittaut, Jean-Noël; Amini, Massih-Reza; Gallinari, Patrick

Learning Classification with Both Labeled and Unlabeled Data

Massih-reza Amini

2002, Lecture Notes in Computer Science

visibility

…

description

12 pages

link

1 file

Sign up for access to the world's latest research

checkGet notified about relevant papers

checkSave papers to use in your research

checkJoin the discussion with peers

checkTrack your impact

Abstract

A key difficulty for applying machine learning classification algorithms for many applications is that they require a lot of handlabeled examples. Labeling large amount of data is a costly process which in many cases is prohibitive. In this paper we show how the use of a small number of labeled data together with a large number of unlabeled data can create high-accuracy classifiers. Our approach does not rely on any parametric assumptions about the data as it is usually the case with generative methods widely used in semi-supervised learning. We propose new discriminant algorithms handling both labeled and unlabeled data for training classification models and we analyze their performances on different information access problems ranging from text span classification for text summarization to e-mail spam detection and text classification.

Figures (3)

Fig. 1. Web Pages classification (a) and e-mail Spam detection (5) - Performance of two clas- sifiers with respect to the ratio x of labeled data in the training set. The classifier is a logistic unit trained with labeled data in a supervised scheme (dashed bottom curves) and using the semi-supervised discriminant-CEM algorithm (solid top curves) For comparison, we have also performed tests with the logistic classifier trained in a supervised way using the same x% labeled data in the training set.

Fig. 2. Average precision of 3 trainable summarizers with respect to the ratio of labeled sen- tences in the training set for the cmp_lg collection For text summarization, we have also compared in figure 2, discriminant-CEM to the generative-CEM algorithm presented in section 3.2. For the latter, we assume that the conditional density functions {f;,};-1,.... are normal distributions.

Table 1. Comparison between kupiec et al.’s summarizer system and discriminant and genera- tive CEM algorithms for the cmp_lg collection. All classifiers are trained in a fully supervised way The two CEM classifiers allow approximately 10% increase both in average preci- sion and in accuracy over Kupiec et al’s system. Another interesting result is that both discriminant and generative CEM trained on semi-supervised learning scheme (using 10% of labeled sentences together with 90% of unlabeled sentences in the training set) gave similar performances to the Kupiec et al.’s summarizer system fully supervised.

Carlos Manuel Hernandez Gonzalez

Principles of Data Mining and Knowledge Discovery, 2000

Supervised learning algorithms usually require large amounts of training data to learn reasonably accurate classifiers. Yet, for many text classification tasks, providing labeled training documents is expensive, while unlabeled documents are readily available in large quantities. Learning from both, labeled and unlabeled documents, in a semisupervised framework is a promising approach to reduce the need for labeled training documents. This paper compares three commonly applied text classifiers in the light of semi-supervised learning, namely a linear support vector machine, a similarity-based tfidf and a Naïve Bayes classifier. Results on a real-world text datasets show that these learners may substantially benefit from using a large amount of unlabeled documents in addition to some labeled documents.

Log In

Learning Classification with Both Labeled and Unlabeled Data

Sign up for access to the world's latest research

Abstract

Figures (3)

Related papers

Related papers

Related topics