2003, Proceedings of the twelfth international conference on Information and knowledge management - CIKM '03
Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems can be more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study the problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence), for TC-WON tasks. Our analyses show that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods because it exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the boundary of SVMC starts overfitting at some point and ends up generating very poor results. This is because, when the positive training examples are too few, the boundary over-iterates, trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.
Text, Speech and Dialogue, 2011
In this paper we present a slightly modified machine learning approach for text classification that works exclusively from positive and unlabeled samples. Our method ensures that the positive class is not underrepresented during the iterative training process, and it can achieve a 30% better F-value when the number of positive examples is low.
Third IEEE International Conference on Data Mining
This paper studies the problem of building text classifiers using positive and unlabeled examples. The key feature of this problem is that there is no negative example for learning. Recently, a few techniques for solving this problem were proposed in the literature. These techniques are based on the same idea, which builds a classifier in two steps. Each existing technique uses a different method for each step. In this paper, we first introduce some new methods for the two steps, and perform a comprehensive evaluation of all possible combinations of methods of the two steps. We then propose a more principled approach to solving the problem based on a biased formulation of SVM, and show experimentally that it is more accurate than the existing techniques.
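The abstract gives no formula for the biased SVM. One common way to write such a formulation, offered here as a sketch under stated assumptions rather than as the paper's exact objective, is to treat every unlabeled example as a negative and to penalize errors on the two groups with separate costs. The symbols $P$ (positive set), $U$ (unlabeled set), $C_{+}$, $C_{-}$ and $\xi_i$ are notation assumed for this illustration.

```latex
\min_{w,\,b,\,\xi}\quad \tfrac{1}{2}\,w^{\top}w
  \;+\; C_{+}\sum_{i \in P}\xi_i
  \;+\; C_{-}\sum_{i \in U}\xi_i
\qquad \text{s.t.}\quad
  y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,
\qquad y_i = +1 \ (i \in P),\quad y_i = -1 \ (i \in U).
```

Under this reading, choosing $C_{+}$ larger than $C_{-}$ expresses that the labeled positives are trusted, while the "negative" labels on unlabeled examples may be wrong.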
International Journal of Computer Applications Technology and Research, 2014
Positive-unlabeled (PU) learning is a learning problem that is addressed with semi-supervised methods. In a PU learning problem, the aim is to build an accurate binary classifier without the need to collect negative examples for training. The two-step approach is a solution to the PU learning problem that consists of two steps: (1) identifying a set of reliable negative documents; (2) building a classifier iteratively. In this paper we evaluate five combinations of techniques for the two-step strategy. We found that using the Rocchio method in step 1 and the Expectation-Maximization method in step 2 is the most effective combination in our experiments.
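Step 1 of the two-step strategy can be sketched with a Rocchio-style rule. This is a minimal illustration and not the paper's implementation: documents are assumed to be term-weight dictionaries, the function names are hypothetical, and the weights `alpha=16` and `beta=4` are assumed values that the abstract does not state. An unlabeled document is kept as a reliable negative when it is closer (by cosine similarity) to the negative prototype than to the positive one.

```python
import math

def normalize(vec):
    """Scale a sparse term-weight vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, v in vec.items():
            total[t] = total.get(t, 0.0) + v / len(vectors)
    return total

def combine(a, b, alpha, beta):
    """Rocchio prototype: alpha * a - beta * b."""
    return {t: alpha * a.get(t, 0.0) - beta * b.get(t, 0.0) for t in set(a) | set(b)}

def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def rocchio_reliable_negatives(positives, unlabeled, alpha=16.0, beta=4.0):
    """Step 1: unlabeled documents that look more like U than like P."""
    p_mean = mean_vector([normalize(d) for d in positives])
    u_mean = mean_vector([normalize(d) for d in unlabeled])
    pos_proto = combine(p_mean, u_mean, alpha, beta)
    neg_proto = combine(u_mean, p_mean, alpha, beta)
    return [d for d in unlabeled if cosine(d, neg_proto) > cosine(d, pos_proto)]

# Toy data (invented for illustration): term -> term frequency.
positives = [{"svm": 2, "kernel": 1}, {"svm": 1, "margin": 1}]
unlabeled = [{"svm": 1, "kernel": 1}, {"football": 2, "goal": 1}, {"goal": 1, "league": 1}]
reliable_negatives = rocchio_reliable_negatives(positives, unlabeled)
print(reliable_negatives)
```

Step 2 would then train a classifier iteratively on the positives and these reliable negatives, for example with Expectation-Maximization as the abstract reports.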
International Journal of Computer Applications Technology and Research, 2014
Feature selection is important when processing data in domains such as text because such data can be of very high dimension. Because there are no labeled negative data for training in positive-unlabeled (PU) learning problems, we need unsupervised feature selection methods that do not use the class information in the training documents when selecting features for the classifier. Few feature selection methods are available for use in document classification with PU learning. In this paper we evaluate four unsupervised methods: collection frequency (CF), document frequency (DF), collection frequency-inverse document frequency (CF-IDF) and term frequency-document frequency (TF-DF). We found DF to be the most effective in our experiments.
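Document-frequency selection needs no class labels, which is why it suits the PU setting. The sketch below assumes that features are ranked by descending document frequency and that the top `k` are kept; the abstract does not specify the ranking direction or any thresholds, and the function name is hypothetical.

```python
from collections import Counter

def select_by_document_frequency(docs, k):
    """Return the k terms that occur in the most documents (ties broken alphabetically)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Toy tokenized documents (invented for illustration).
docs = [
    ["svm", "text", "margin"],
    ["text", "unlabeled", "text"],
    ["svm", "text", "positive"],
]
top_terms = select_by_document_frequency(docs, 2)
print(top_terms)  # ['text', 'svm']
```

Note that "text" appears twice in the second document but still contributes only one count, which is the difference between document frequency and collection frequency.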
Anais do XX Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2023)
Because data generation now surpasses human evaluation capacity, manually labeling data for training machine learning models is becoming increasingly impractical. This article focuses on analyzing techniques to address the challenges of Positive Unlabeled Learning (PUL). To this end, we propose structural adaptations to the Non-Negative Matrix Factorization (NMF) algorithm, specifically tailored for PU data (NMFPUL). We compare NMFPUL with state-of-the-art techniques to identify improvements in the performance of textual data classification. Our study reveals that NMFPUL consistently outperforms most baseline algorithms across diverse document collections, especially when only a limited number of labeled documents is available.
IEEE Transactions on Knowledge and Data Engineering, 2004
Web page classification is one of the essential techniques for Web mining because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing such as collecting positive and negative training examples. For instance, in order to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of nonhomepages (negative examples). In particular, collecting negative training examples requires arduous work and caution to avoid bias. This paper presents a framework, called Positive Example Based Learning (PEBL), for Web page classification which eliminates the need for manually collecting negative training examples in preprocessing. The PEBL framework applies an algorithm, called Mapping-Convergence (M-C), to achieve high classification accuracy (with positive and unlabeled data) as high as that of a traditional SVM (with positive and negative data). M-C runs in two stages: the mapping stage and convergence stage. In the mapping stage, the algorithm uses a weak classifier that draws an initial approximation of "strong" negative data. Based on the initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) which maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs, and it is almost as accurate as the traditional SVMs.
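The two stages of M-C can be sketched as a loop. This is an illustration under several assumptions, not the authors' implementation: documents are token lists, the mapping stage uses a simple rule (a token is a "positive feature" if it occurs in a larger fraction of positive than of unlabeled documents), and the internal classifier is a stand-in linear scorer rather than the margin-maximizing SVM the abstract mentions. All function names are hypothetical.

```python
def token_rates(docs):
    """Fraction of documents in which each token occurs."""
    rates = {}
    for doc in docs:
        for t in set(doc):
            rates[t] = rates.get(t, 0.0) + 1.0 / len(docs)
    return rates

def mapping_stage(positives, unlabeled):
    """Weak classifier: unlabeled docs with no positive feature are strong negatives."""
    p_rate, u_rate = token_rates(positives), token_rates(unlabeled)
    pos_features = {t for t, r in p_rate.items() if r > u_rate.get(t, 0.0)}
    strong_neg = [d for d in unlabeled if not pos_features & set(d)]
    rest = [d for d in unlabeled if pos_features & set(d)]
    return strong_neg, rest

def train_linear(positives, negatives):
    """Stand-in internal classifier (NOT an SVM): per-token rate difference."""
    p_rate, n_rate = token_rates(positives), token_rates(negatives)
    return {t: p_rate.get(t, 0.0) - n_rate.get(t, 0.0) for t in set(p_rate) | set(n_rate)}

def predict_positive(weights, doc):
    return sum(weights.get(t, 0.0) for t in set(doc)) > 0

def mapping_convergence(positives, unlabeled):
    negatives, remaining = mapping_stage(positives, unlabeled)
    while True:
        weights = train_linear(positives, negatives)
        new_neg = [d for d in remaining if not predict_positive(weights, d)]
        if not new_neg:  # converged: nothing left that the classifier calls negative
            return weights, negatives
        negatives = negatives + new_neg
        remaining = [d for d in remaining if predict_positive(weights, d)]

# Toy data (invented for illustration).
positives = [["home", "welcome", "me"], ["home", "cv", "me"]]
unlabeled = [
    ["home", "welcome", "photos"],
    ["shop", "price", "cart"],
    ["shop", "price", "me"],
    ["news", "price", "cart"],
]
weights, negatives = mapping_convergence(positives, unlabeled)
print(len(negatives), predict_positive(weights, ["home", "cv"]))
```

In this toy run the mapping stage finds two strong negatives, and the convergence stage adds the "shop, price, me" document in its first iteration before stopping.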
Automatic text classification is one of the most important tools in Information Retrieval. As traditional methods for text classification cannot find the best feature set, a genetic algorithm (GA) is applied to feature selection because it can find the globally optimal solution. This paper presents a novel GA-based text classifier built from positive and unlabeled documents. First, we identify reliable negative documents with an improved 1-DNF algorithm. Second, we build a set of classifiers by iteratively applying the SVM algorithm to training example sets. Third, instead of choosing one of the classifiers as the final classifier, we discuss a GA-based approach that evaluates the weighted vote of all classifiers generated in the iteration steps to construct the final classifier. The GA evolving process can discover the best combination of the weights. The experimental results on the Reuters data set show that the performance is exciting.
Statistica Sinica, 2022
In semi-supervised learning, a training sample is comprised of both labeled and unlabeled instances from each class under consideration. In practice, an important yet challenging issue is the detection of novel classes that may be absent from the training sample. In this article, we focus on a binary situation in which labeled instances come from the positive class whereas unlabeled instances come from both classes. In particular, we propose a semi-supervised large margin classifier to learn the negative (novel) class based on pseudo-data iteratively generated using an estimated model. Numerically, we employ an efficient algorithm to implement the proposed method with the hinge-loss and ψ-loss functions. Theoretically, we derive a learning theory for the new classifier to quantify the misclassification error. Finally, numerical analysis demonstrates that the proposed method compares favorably on simulated examples and is highly competitive against its competitors on benchmark examples.
Pattern Recognition Letters, 2014
We consider the problem of learning a binary classifier from a training set of positive and unlabeled examples, both in the inductive and in the transductive setting. This problem, often referred to as PU learning, differs from the standard supervised classification problem by the lack of negative examples in the training set. It corresponds to a ubiquitous situation in many applications such as information retrieval or gene ranking, when we have identified a set of data of interest sharing a particular property, and we wish to automatically retrieve additional data sharing the same property among a large and easily available pool of unlabeled data. We propose a conceptually simple method, akin to bagging, to approach both inductive and transductive PU learning problems, by converting them into a series of supervised binary classification problems discriminating the known positive examples from random subsamples of the unlabeled set. We empirically demonstrate the relevance of the method on simulated and real data, where it performs at least as well as existing methods while being faster.
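The bagging idea can be sketched as follows: repeatedly draw a random subsample of the unlabeled set, treat it as the negative class, train a base classifier against the known positives, and average the resulting scores. The base learner here is a nearest-centroid scorer chosen only to keep the example self-contained; the abstract does not say which base classifier the authors use, and the function names, `rounds` and `sample_size` are assumptions.

```python
import random

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bagging_pu_scores(positives, unlabeled, queries, rounds=50, sample_size=None, seed=0):
    """Average, over random subsamples of U, of a P-versus-subsample score for each query."""
    rng = random.Random(seed)
    k = min(sample_size or len(positives), len(unlabeled))
    pos_c = centroid(positives)
    totals = [0.0] * len(queries)
    for _ in range(rounds):
        sample = rng.sample(unlabeled, k)   # random subsample treated as negatives
        neg_c = centroid(sample)
        for i, q in enumerate(queries):
            # Higher score: farther from the pseudo-negative centroid than from the positive one.
            totals[i] += sq_dist(q, neg_c) - sq_dist(q, pos_c)
    return [t / rounds for t in totals]

# Toy 2-D data (invented for illustration); U hides one positive-like point.
positives = [[2.0, 2.0], [2.2, 1.8], [1.8, 2.2]]
unlabeled = [[2.1, 2.0], [-2.0, -2.0], [-2.2, -1.8], [-1.8, -2.1], [-2.1, -2.2]]
scores = bagging_pu_scores(positives, unlabeled, queries=[[2.0, 2.0], [-2.0, -2.0]])
print(scores)
```

Passing the unlabeled points themselves as `queries` gives a transductive ranking of the pool, while passing new points corresponds to the inductive use.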
Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '00, 2000
In this paper, we show how a simple feedforward neural network can be trained to filter documents when only positive information is available, and that this method seems to be superior to more standard methods, such as tf-idf retrieval based on an "average vector". A novel experimental finding is that retrieval is enhanced substantially in this context by carrying out a certain kind of uniform transformation ("Hadamard") of the information prior to the training of the network.
Machine Learning
Learning from positive and unlabeled data or PU learning is the setting where a learner only has access to positive examples and unlabeled data. The assumption is that the unlabeled data can contain both positive and negative examples. This setting has attracted increasing interest within the machine learning literature as this type of data naturally arises in applications such as medical diagnosis and knowledge base completion. This article provides a survey of the current state of the art in PU learning. It proposes seven key research questions that commonly arise in this field and provides a broad overview of how the field has tried to address them.
2005
We propose a simple probabilistic approach to learning from positive and unlabeled examples, and show experimentally that it can approximate or outperform other state-of-the-art approaches to this problem in spite of its simplicity. By employing a linear-time learning algorithm such as PrTFIDF, our approach can be highly efficient and scalable.
In many important text classification problems, acquiring class labels for training documents is costly, while gathering large quantities of unlabeled data is cheap. This paper shows that the accuracy of text classifiers trained with a small number of labeled documents can be improved by augmenting this small training set with a large pool of unlabeled documents. We present a theoretical argument showing that, under common assumptions, unlabeled data contain information about the target function. We then introduce an algorithm for learning from labeled and unlabeled text based on the combination of Expectation-Maximization with a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents; it then trains a new classifier using the labels for all the documents, and iterates to convergence. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 33%.
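The loop described in this abstract can be sketched with a small multinomial naive Bayes model. This is a simplified illustration, not the authors' code: it assumes two classes, Laplace smoothing, and a fixed number of EM iterations instead of a convergence test, and every function name is hypothetical.

```python
import math
from collections import defaultdict

def train_nb(docs, weights, vocab, n_classes=2):
    """M-step: estimate log priors and log word probabilities from (possibly soft) labels."""
    priors = [1.0] * n_classes                       # Laplace smoothing on class priors
    counts = [defaultdict(float) for _ in range(n_classes)]
    totals = [0.0] * n_classes
    for tokens, w in zip(docs, weights):
        for c in range(n_classes):
            priors[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]                 # fractional word counts
                totals[c] += w[c]
    z = sum(priors)
    log_prior = [math.log(p / z) for p in priors]
    log_word = [
        {t: math.log((counts[c][t] + 1.0) / (totals[c] + len(vocab))) for t in vocab}
        for c in range(n_classes)
    ]
    return log_prior, log_word

def posterior(tokens, model):
    """E-step for one document: P(class | document) under the current model."""
    log_prior, log_word = model
    scores = [lp + sum(lw[t] for t in tokens if t in lw) for lp, lw in zip(log_prior, log_word)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def em_naive_bayes(labeled, labels, unlabeled, iterations=10):
    vocab = {t for d in labeled + unlabeled for t in d}
    hard = [[1.0 if c == y else 0.0 for c in range(2)] for y in labels]
    model = train_nb(labeled, hard, vocab)           # initial classifier: labeled docs only
    for _ in range(iterations):
        soft = [posterior(d, model) for d in unlabeled]            # probabilistic labels
        model = train_nb(labeled + unlabeled, hard + soft, vocab)  # retrain on all docs
    return model

# Toy data (invented for illustration): class 0 = sports, class 1 = politics.
labeled = [["goal", "match"], ["election", "vote"]]
labels = [0, 1]
unlabeled = [["goal", "team"], ["team", "match"], ["vote", "party"], ["party", "election"]]
model = em_naive_bayes(labeled, labels, unlabeled)
print(posterior(["team"], model), posterior(["party"], model))
```

The words "team" and "party" never occur in a labeled document, so any preference the final model shows for them comes from the unlabeled pool.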
Principles of Data Mining and Knowledge Discovery, 2000
Supervised learning algorithms usually require large amounts of training data to learn reasonably accurate classifiers. Yet, for many text classification tasks, providing labeled training documents is expensive, while unlabeled documents are readily available in large quantities. Learning from both labeled and unlabeled documents in a semi-supervised framework is a promising approach to reduce the need for labeled training documents. This paper compares three commonly applied text classifiers in the light of semi-supervised learning, namely a linear support vector machine, a similarity-based tfidf and a Naïve Bayes classifier. Results on real-world text datasets show that these learners may substantially benefit from using a large amount of unlabeled documents in addition to some labeled documents.
Positive and unlabelled learning (PU learning) has been investigated to deal with the situation where only the positive examples and the unlabelled examples are available. Most of the previous works focus on identifying some negative examples from the unlabelled data, so that the supervised learning methods can be applied to build a classifier. However, for the remaining unlabelled data, which can not be explicitly identified as positive or negative (we call them ambiguous examples), they either exclude them from the training phase or simply enforce them to either class. Consequently, their performance may be constrained. This paper proposes a novel approach, called similarity-based PU learning (SPUL) method, by associating the ambiguous examples with two similarity weights, which indicate the similarity of an ambiguous example towards the positive class and the negative class, respectively. The local similarity-based and global similarity-based mechanisms are proposed to generate t...
2008
The problem of PU Learning, i.e., learning classifiers with positive and unlabelled examples (but not negative examples), is very important in information retrieval and data mining. We address this problem through a novel approach: reducing it to the problem of learning classifiers for some meaningful multivariate performance measures. In particular, we show how a powerful machine learning algorithm, Support Vector Machine, can be adapted to solve this problem. The effectiveness and efficiency of the proposed approach have been confirmed by our experiments on three real-world datasets.
2018
For many interesting tasks, such as medical diagnosis and web page classification, a learner only has access to some positively labeled examples and many unlabeled examples. Learning from this type of data requires making assumptions about the true distribution of the classes and/or the mechanism that was used to select the positive examples to be labeled. The commonly made assumptions, separability of the classes and positive examples being selected completely at random, are very strong. This paper proposes a weaker assumption that assumes the positive examples to be selected at random, conditioned on some of the attributes. To learn under this assumption, an EM method is proposed. Experiments show that our method is not only very capable of learning under this assumption, but it also outperforms the state of the art for learning under the selected completely at random assumption.
Suppose we have a large collection of documents most of which are unlabeled. Suppose further that we have a small subset of these documents which represent a particular class of documents we are interested in, i.e. these are labeled as positive examples. We may have reason to believe that there are more of these positive class documents in our large unlabeled collection. What data mining techniques could help us find these unlabeled positive examples? Here we examine machine learning strategies designed to solve this problem. We find that a proper choice of machine learning method as well as training strategies can give substantial improvement in retrieving, from the large collection, data enriched with positive examples. We illustrate the principles with a real example consisting of multiword UMLS phrases among a much larger collection of phrases from Medline.
Proceedings of the 2004 SIAM International Conference on Data Mining, 2004
Automatic classification of documents is an important area of research with many applications in the fields of document searching, forensics and others. Methods to perform classification of text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be an easy (or even possible) task. Consider, for instance, a set of documents that is returned as a result of a query. If we want to separate the documents that are truly relevant to the query from those that are not, it is unlikely that we will have at hand labelled documents to train classification models to perform this task. In this paper we focus on the classification of an unlabelled set of documents into two classes: relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines), and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that has a large percentage of relevant ones (i.e., a high "purity"). This sample can be used to train models to classify the entire set of documents. We try several methods of classification to separate the documents, including two-class SVM, for which we develop a heuristic to identify a small sample of negative examples. We show, via experimentation, that our method is capable of accurately classifying a set of documents into relevant and irrelevant classes.
2019
In recent years, the softmax model and its fast approximations have become the de-facto loss functions for deep neural networks when dealing with multi-class prediction. This loss has been extended to language modeling and recommendation, two fields that fall into the framework of learning from Positive and Unlabeled data. In this paper, we stress the different drawbacks of the current family of softmax losses and sampling schemes when applied in a Positive and Unlabeled learning setup. We propose both a Relaxed Softmax loss (RS) and a new negative sampling scheme based on Boltzmann formulation. We show that the new training objective is better suited for the tasks of density estimation, item similarity and next-event prediction by driving uplifts in performance on textual and recommendation datasets against classical softmax.