Text classification from positive and unlabeled documents

R. Gilleron

2003, Proceedings of the twelfth international conference on Information and knowledge management - CIKM '03

Abstract

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems can be more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study this problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence), for TC-WON tasks. Our analyses show that when the positive training data are not too under-sampled, SVMC significantly outperforms other methods, because SVMC exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the boundary of SVMC starts overfitting at some point and ends up generating very poor results. This is because, when there are too few positive training examples, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.

Figures (4)

Figure 3: Example of the MC algorithm in the one dimensional feature space.

The MC algorithm is described in Figure 1. To illustrate the MC process, consider the example data distribution in a one-dimensional feature space in Figure 2, in which U is composed of eight data clusters; the fifth one from the left is positive and the rest are negative. If U were completely labeled, a margin-maximization algorithm such as SVM trained on U would generate the optimal boundary (ba, b,), where bg maximizes the near and farther points of the gap (gn and gy) between the positive and negative clusters. Unfortunately, under the common scenarios of SCC, the only labeled data are assumed to be the dark center within the positive cluster, and the rest are unlabeled.
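The one-dimensional setting above can be mimicked with a small toy script. This is only a sketch under stated assumptions, not the algorithm of Figure 1 (which is not reproduced here): a midpoint rule stands in for SVM margin maximization, a fixed distance threshold stands in for the initial step that finds strong negatives, and three clusters are used instead of eight. The names `cluster`, `mapping_convergence_1d`, and `strong_negative_distance` are hypothetical.

```python
def cluster(start, stop, step=0.1):
    """Evenly spaced 1-D points in [start, stop] (toy data, assumed)."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


def mapping_convergence_1d(positives, unlabeled, strong_negative_distance):
    """Toy 1-D mapping-convergence loop; returns (lower, upper, iterations)."""
    p_min, p_max = min(positives), max(positives)
    center = (p_min + p_max) / 2.0

    # Assumed stand-in for the initial weak classifier: unlabeled points
    # far from the labeled positives are taken as strong negatives.
    negatives = [x for x in unlabeled
                 if abs(x - center) > strong_negative_distance]
    remaining = [x for x in unlabeled
                 if abs(x - center) <= strong_negative_distance]

    iterations = 0
    while True:
        iterations += 1
        # Stand-in for margin maximization in 1-D: place each boundary at
        # the midpoint between the labeled positives and the nearest
        # current negative on that side.
        left = [x for x in negatives if x < p_min]
        right = [x for x in negatives if x > p_max]
        lower = (max(left) + p_min) / 2.0 if left else float("-inf")
        upper = (min(right) + p_max) / 2.0 if right else float("inf")

        # Unlabeled points on the negative side join the negative set.
        newly_negative = [x for x in remaining if x < lower or x > upper]
        if not newly_negative:
            return lower, upper, iterations  # boundary has converged
        negatives.extend(newly_negative)
        remaining = [x for x in remaining if lower <= x <= upper]


# Negative clusters on [0, 3] and [7, 10]; positive cluster on [4, 6].
unlabeled = cluster(0.0, 3.0) + cluster(4.0, 6.0) + cluster(7.0, 10.0)

# Labeled positives covering the center of the positive cluster.
well_sampled = mapping_convergence_1d(cluster(4.8, 5.2), unlabeled, 3.0)

# A single off-center labeled positive (under-sampled case).
under_sampled = mapping_convergence_1d([4.5], unlabeled, 3.0)

print("well sampled :", well_sampled)
print("under sampled:", under_sampled)
```

In this toy setup, the well-sampled run stops with both boundaries inside the empty gaps around the positive cluster, while the under-sampled run keeps iterating until its upper boundary has cut into the positive cluster. That is the over-iteration behavior the abstract describes, reproduced here only for this artificial data.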

Figure 4: Intermediate results of SVMC. P is big dots, and U is all dots (small and big).

Figure 5: SVMC performance convergence on the “earn” class

István T. Nagy

Text, Speech and Dialogue, 2011

In this paper we present a slightly modified machine learning approach for text classification that works exclusively from positive and unlabeled samples. Our method ensures that the positive class is not underrepresented during the iterative training process, and it can achieve a 30% better F-value when the number of positive examples is low.
