A Sequential Algorithm for Training Text Classifiers
1 Introduction
Text classification is the automated grouping of textual or partially textual entities. Document retrieval, categorization, routing, filtering, and clustering, as well as natural language processing tasks such as tagging, word sense disambiguation, and some aspects of understanding, can be formulated as text classification. As the amount of online text increases, the demand for text classification to aid the analysis and management of text is increasing.
One advantage of formulating text processing tasks as classification is that methods from statistics and machine learning can be used to form text classifiers automatically. While using machine learning does require manually annotating training data with class labels, this annotation takes less skill and expense than, for instance, building classification rules by hand [1].
There is often more text available than can be economically labeled, so a subset or sample of the data must be chosen to label. Random sampling [3] will usually not be effective. If only 1 in 1000 texts are class members (not atypical), and only 500 texts can be labeled, then a random sample will usually contain 500 negative examples and no positive ones. This will not support training a classifier to distinguish positive from negative examples.
Relevance feedback [4] does a kind of nonrandom sampling. In effect, users are asked to label those texts that the current classifier considers most likely to be class members. This approach, which we might call relevance sampling, is a reasonable strategy in a text retrieval context, where the user is more interested in seeing relevant texts than in the effectiveness of the final classifier produced. Relevance feedback has also been proposed for finding examples of unusual word senses [5]. However, relevance feedback has many problems as an approach to sampling. It works more poorly as the classifier improves, and is susceptible to selecting redundant examples.
Relevance sampling is a sequential approach to sampling, since the labeling of earlier examples influences the selection of later ones [6]. This paper describes an alternative sequential approach, uncertainty sampling, motivated by results in computational learning theory. Uncertainty sampling is an iterative process of manual labeling of examples, classifier fitting from those examples, and use of the classifier to select new examples whose class membership is unclear. We show how a probabilistic classifier can be trained using uncertainty sampling. A test of this method on a text categorization task showed reductions of up to 500-fold in the number of examples that must be labeled to produce a classifier with a given effectiveness.
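To make the shape of this procedure concrete, the following is a minimal sketch of the loop in Python. The helper callables (fit_classifier, estimate_prob, ask_teacher) and the batch size are placeholders of our own, not part of the paper; only the overall structure of labeling, fitting, and selecting the least certain examples reflects the description above.

```python
# Minimal sketch of the uncertainty sampling loop described above.
# The callables passed in stand in for whatever learning method is used;
# they are assumptions, not the paper's implementation.

def uncertainty_sampling(unlabeled, ask_teacher, fit_classifier,
                         estimate_prob, initial_labeled,
                         batch_size=4, iterations=100):
    labeled = list(initial_labeled)          # (example, label) pairs
    pool = list(unlabeled)
    for _ in range(iterations):
        classifier = fit_classifier(labeled)
        # Pick the examples whose estimated P(C|w) is closest to 0.5,
        # i.e. those the current classifier is least certain about.
        pool.sort(key=lambda x: abs(estimate_prob(classifier, x) - 0.5))
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend((x, ask_teacher(x)) for x in batch)
    return fit_classifier(labeled)
```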
Recently, several algorithms for learning via queries have been proposed that filter existing examples rather than creating artificial ones [9, 10, 11]. These algorithms ask a teacher to label only those examples whose class membership is sufficiently "uncertain". Several definitions of uncertainty have been used, but all are based on estimating how likely a classifier trained on previously labeled data would be to produce the correct class label for a given unlabeled example. Looking at this as a sampling method rather than a querying one, we call this approach uncertainty sampling.
Seung, Opper, and Sompolinsky [11] present a theoretical analysis of "query by committee" (QBC), an algorithm that, for each unlabeled example, draws two classifiers randomly from the version space, i.e. the set of all classifiers consistent with the labeled training data [12]. An infinite stream of unlabeled data is assumed, from which QBC asks the teacher for class labels only on those examples for which the two chosen classifiers disagree.
Freund, Seung, Shamir, and Tishby extend the QBC result to a wide range of classifier forms [13]. They prove that, under certain assumptions, the number of queries made after examining m random examples will be logarithmic in m, while generalization error will decrease almost as quickly as it would if queries were made on all examples. More precisely, generalization error decreases as O(1/m). Therefore, in terms of the number of queries, generalization error decreases exponentially fast.
This is a provocative result, since it implies that the effect of training on labeled data can be had for the cost of obtaining unlabeled data and labeling only a logarithmic fraction of it. However, the QBC assumptions include that the data is noise free, that a perfect deterministic classifier exists, and that it is possible to draw classifiers randomly from the version space, all of which are problematic for real world tasks. The effectiveness of QBC and related methods on real world tasks remains to be determined.
A heuristic alternative to QBC's random drawing of classifiers from the version space is to let the learning algorithm do what it always does: pick a single classifier from the version space. If the classifier can not only make classification decisions but also estimate their certainty, the certainty estimate can be used to select examples.
A single classifier approach to uncertainty sampling has several theoretical failings, including underestimation of true uncertainty, and biases caused by nonrepresentative classifiers [9, 10]. On the other hand, experiments using a single classifier to make arbitrary queries [14] or select subsets of labeled data [8, 15] have shown substantial speedups in learning. Relevance sampling, which has proven quite effective for text retrieval, also uses a single classifier.
P(C \mid \mathbf{w}) = \frac{1}{1 + \exp\left( \log\frac{1 - P(C)}{P(C)} + \sum_{i=1}^{d} \log\frac{P(w_i \mid \bar{C})}{P(w_i \mid C)} \right)}    (4)
Equation 4 is rarely used directly in text classification, probably because its estimates of P(C|w) are systematically inaccurate. One reason for this inaccuracy is that the independence assumptions made in producing Equation 3 are always incorrect when the w_i's are words or other features defined from natural language. Another problem is that P(C) is typically small and thus hard to estimate, a problem which is compounded when the training set is not a random sample.
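As a small concrete illustration of Equation 4, the sketch below computes P(C|w) from an assumed class prior and assumed per-word log likelihood ratios; the numeric values are invented for the example and are not from the paper.

```python
import math

def bayes_posterior(prior, log_likelihood_ratios):
    """Equation 4: P(C|w) from the prior P(C) and the per-feature
    log likelihood ratios log(P(w_i|~C)/P(w_i|C)) of the words present."""
    z = math.log((1.0 - prior) / prior) + sum(log_likelihood_ratios)
    return 1.0 / (1.0 + math.exp(z))

# Illustrative values only: a rare class and three words, two of which
# favor class membership (negative log ratio) and one that does not.
print(bayes_posterior(0.001, [-4.0, -3.5, 0.8]))
```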
Logistic regression [22] provides a partial solution to these problems. It is a general technique for combining multiple predictor values to estimate a posterior probability. The form of the estimate is:

P(C \mid \mathbf{x}) = \frac{\exp(a + b_1 x_1 + \cdots + b_m x_m)}{1 + \exp(a + b_1 x_1 + \cdots + b_m x_m)}    (5)
A number of approaches to using logistic regression in text classification have been proposed [23, 24, 25]. The similarity between Equation 4 and Equation 5 prompted us to try a particularly simple approach, where the log likelihood ratio from the Bayesian independence formulation is used as the single predictor variable:
P(C \mid \mathbf{w}) = \frac{\exp\left(a + b \sum_{i=1}^{d} \log\frac{P(w_i \mid C)}{P(w_i \mid \bar{C})}\right)}{1 + \exp\left(a + b \sum_{i=1}^{d} \log\frac{P(w_i \mid C)}{P(w_i \mid \bar{C})}\right)}    (6)
Intuitively, we could hope that the logistic parameter a would substitute for the hard-to-estimate prior log odds in Equation 4, while b would serve to dampen extreme log likelihood ratios resulting from independence violations. We have in fact found this simple formulation to work well for text categorization, though we have not compared it with the more complex formulations suggested by other authors. Note that our approach would probably not be appropriate if documents were of widely varying lengths.
[3] We are careful to distinguish an example from the corresponding pattern w_e, since different examples may have the same feature values w_1, ..., w_d.
4.2 Training the Classifier

The first step in using Equation 6 is estimating the values P(w_i|C)/P(w_i|\bar{C}). We used the following estimator:

\frac{P(w_i \mid C)}{P(w_i \mid \bar{C})} \approx \frac{\left(c_{pi} + (N_p + 0.5)/(N_p + N_n + 1)\right) \big/ \left(N_p + d\,(N_p + 0.5)/(N_p + N_n + 1)\right)}{\left(c_{ni} + (N_n + 0.5)/(N_p + N_n + 1)\right) \big/ \left(N_n + d\,(N_n + 0.5)/(N_p + N_n + 1)\right)}    (7)
Here N_p and N_n are the numbers of tokens in the positive and negative training sets, respectively, c_pi and c_ni are correspondingly the number of instances of w_i in the positive and negative training sets, and d is the number of features. This is an ad hoc estimator, loosely justified by its analogy to the expected likelihood estimator [26]. The above estimator attempts to avoid extreme estimates of the log likelihood ratio when N_p and N_n are of very different sizes, for instance before our sampling procedure starts finding positive examples.
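A sketch of this estimator in Python follows. The counts would come from tokenized training documents; the variable names mirror the notation above rather than any code from the paper.

```python
def likelihood_ratio(c_pi, c_ni, n_pos, n_neg, d):
    """Equation 7: smoothed estimate of P(w_i|C) / P(w_i|~C).

    c_pi, c_ni -- occurrences of word w_i in positive / negative training text
    n_pos, n_neg -- total tokens in the positive / negative training sets
    d -- number of features
    """
    smooth_p = (n_pos + 0.5) / (n_pos + n_neg + 1.0)
    smooth_n = (n_neg + 0.5) / (n_pos + n_neg + 1.0)
    p_w_pos = (c_pi + smooth_p) / (n_pos + d * smooth_p)
    p_w_neg = (c_ni + smooth_n) / (n_neg + d * smooth_n)
    return p_w_pos / p_w_neg
```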
In text classification there typically is a huge set of potential w_i's, for instance all the types (distinct words) in a collection of documents. Using feature selection to reduce this set (or, equivalently, to lock all but a few values at 0) can improve effectiveness [19]. As a feature quality measure we used:

(c_{pi} + c_{ni}) \log \frac{P(w_i \mid C)}{P(w_i \mid \bar{C})}    (8)

We selected features in order of this value until a specified fraction (0.7 in the experiments reported here) of the total score of all training examples was reached. This was done separately for features with positive and negative log likelihood ratios.
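The selection step might look roughly like the sketch below. The 0.7 cutoff follows the text, but the data structures and the reading of the cutoff as a fraction of the summed quality scores are assumptions on our part, not the paper's code.

```python
import math

def select_features(counts, ratios, fraction=0.7):
    """Sketch of the feature selection step based on Equation 8.

    counts -- dict: word -> c_pi + c_ni, its total training-set count
    ratios -- dict: word -> estimated P(w|C)/P(w|~C) from Equation 7
    Features with positive and negative log likelihood ratios are handled
    separately, as in the text; treating the cutoff as a fraction of the
    summed quality scores is our interpretation.
    """
    selected = []
    for sign in (+1, -1):
        group = [(counts[w] * math.log(ratios[w]), w)
                 for w in ratios
                 if sign * math.log(ratios[w]) > 0]
        group.sort(key=lambda t: abs(t[0]), reverse=True)
        total = sum(abs(score) for score, _ in group)
        accumulated = 0.0
        for score, w in group:
            if total and accumulated / total >= fraction:
                break
            selected.append(w)
            accumulated += abs(score)
    return selected
```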
After feature selection is performed, the log likelihood values are used to compute:

\sum_{i=1}^{d} \log \frac{P(w_i \mid C)}{P(w_i \mid \bar{C})}    (9)

for each training example. Logistic regression is then used to find the values a and b which give the best fit of this value to the probability of class membership.
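One minimal way to carry out this step today, assuming scikit-learn is available (the paper predates it and specifies no software), is to compute the Equation 9 score for each training document and fit a one-variable logistic regression; the near-zero regularization is our choice, intended to approximate a plain maximum likelihood fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_a_b(scores, labels):
    """Fit the logistic parameters a and b of Equation 6.

    scores -- Equation 9 value (sum of log likelihood ratios over the
              selected features present) for each training document
    labels -- 1 for class members, 0 for non-members
    """
    x = np.asarray(scores, dtype=float).reshape(-1, 1)
    y = np.asarray(labels)
    model = LogisticRegression(C=1e6)   # weak regularization: near plain MLE
    model.fit(x, y)
    return model.intercept_[0], model.coef_[0][0]

def estimate_prob(a, b, score):
    """Equation 6: P(C|w) from the fitted parameters and a document's score."""
    return 1.0 / (1.0 + np.exp(-(a + b * score)))
```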
4.3 Uncertainty Sampling with the Probabilistic Classifier

Uncertainty sampling is simple given a classifier that estimates P(C|w). On each iteration, the current version of the classifier can be applied to each example, and those examples with estimated P(C|w) values closest to 0.5 selected, since 0.5 corresponds to the classifier being most uncertain of the class label.

We adopted the slightly more complex method of scoring all examples, and then choosing the b/2 examples closest to 0.5 and above it, and the b/2 examples closest to 0.5 and below it. This method guarantees that no more than half the examples selected on an iteration are exact duplicates (unless all examples score above 0.5 or all score below 0.5). In addition, there is some evidence that training on pairs of examples on opposite sides of a decision boundary is useful [14].
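The selection rule could be sketched as follows; the probability estimates are assumed to come from a classifier like the one above, and b here denotes the number of examples selected per iteration (the subsample size), not the logistic parameter b.

```python
def select_uncertain(examples, probs, b):
    """Pick b/2 examples closest to 0.5 from above and b/2 from below."""
    scored = list(zip(probs, examples))
    above = [(p, x) for p, x in scored if p >= 0.5]
    below = [(p, x) for p, x in scored if p < 0.5]
    above.sort(key=lambda t: t[0])                 # closest to 0.5 first
    below.sort(key=lambda t: t[0], reverse=True)   # closest to 0.5 first
    k = b // 2
    return [x for _, x in above[:k]] + [x for _, x in below[:k]]
```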
4.4 Classification with the Probabilistic Classifier

An advantage of using a classifier which provides accurate estimates of P(C|w) is that, under certain assumptions, decision theory gives an optimal rule for deciding whether an example should be assigned to class C ([27], p. 15). Let l_{ij} be the penalty or loss incurred for deciding class i when the true class is j. (We let i, j = 1 for C and i, j = 2 for \bar{C}.) We should then assign the example to class C exactly when:

l_{21} P(C \mid \mathbf{w}) + l_{22} \left(1 - P(C \mid \mathbf{w})\right) > l_{11} P(C \mid \mathbf{w}) + l_{12} \left(1 - P(C \mid \mathbf{w})\right)    (10)

For instance, if we desire minimum error rate (both types of incorrect decisions are equally bad), the appropriate losses would be l_{12} = l_{21} = 1 and l_{11} = l_{22} = 0.
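In code, the rule amounts to comparing the expected loss of each decision; a minimal sketch with the losses as parameters:

```python
def assign_to_class(p_c, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Equation 10: assign to class C when the expected loss of deciding
    ~C exceeds the expected loss of deciding C.  The default losses give
    the minimum-error-rate rule, i.e. a threshold of 0.5 on P(C|w)."""
    loss_decide_not_c = l21 * p_c + l22 * (1.0 - p_c)
    loss_decide_c = l11 * p_c + l12 * (1.0 - p_c)
    return loss_decide_not_c > loss_decide_c
```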
5 Experiment
We conducted an experiment to see if uncertainty sampling would reduce the amount of labeled data necessary to train a classifier, in comparison with random sampling and relevance sampling. The training method and probabilistic classifier of Section 4 were used. Classifiers were trained to perform a text categorization task on news story titles.
Category      Training            Test
              Number   Freq.      Number   Freq.
tickertalk    208      0.0007     40       0.0008
boxoffice     314      0.0010     42       0.0008
bonds         470      0.0015     60       0.0012
nielsens      511      0.0016     87       0.0017
burma         510      0.0016     93       0.0018
dukakis       642      0.0020     107      0.0021
ireland       780      0.0024     117      0.0023
quayle        786      0.0025     133      0.0026
budget        1176     0.0037     197      0.0038
hostages      1560     0.0049     228      0.0044

Table 1. The 10 categories used in our experiments, with number of occurrences and frequency of occurrence on the training and test sets.
Effectiveness was measured in terms of recall (R) and precision (P), combined using the E measure [30]:

E = 1 - \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}    (13)

The parameter \beta ranges between 0 and infinity and controls the relative weight given to recall and precision. A \beta of 1 corresponds to equal weighting of recall and precision. To get a single measure of effectiveness where higher values correspond to better effectiveness, and where recall and precision are of equal importance, we define F_{\beta=1} = 1 - E_{\beta=1}.
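For reference, F_{\beta=1} reduces to the familiar harmonic mean of precision and recall, as the small sketch below (ours, not the paper's) makes explicit:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = 1 - E_beta; for beta = 1 this is 2PR / (P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)

print(f_beta(0.5, 0.25))   # 0.333...
```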
6 Results
Table 2 shows for each category the mean F_{\beta=1} for classifiers formed by uncertainty sampling, as well as those formed by relevance sampling and on the full training set. We also show effectiveness using just the 3 starting examples plus 7 randomly selected examples. This gives a sense of the quality of the initial classifier. For all categories except tickertalk, an uncertainty sample of 999 texts resulted in a classifier substantially more effective than the initial classifier or one formed from a relevance sample of 999 texts. The classifier was usually of similar or better effectiveness than one trained on all 319,463 texts. Classifiers trained on a random sample of 1000 texts in most cases had very low effectiveness.

Figure 2 plots effectiveness against sample size for uncertainty sampling, relevance sampling, and random sampling. Results for 9 categories are presented, omitting tickertalk, on which no strategy worked well.
              3 + 996 uncer.    3 + 7 rand.       3 + 996 rel.      3 + 319,460 full
Category      mean    SD        mean    SD        mean    SD        mean    SD
tickertalk    .033    (.031)    .018    (.023)    .023    (.039)    .047    (.001)
boxoffice     .700    (.041)    .222    (.172)    .481    (.053)    .647    (.023)
bonds         .636    (.034)    .146    (.134)    .541    (.069)    .509    (.020)
nielsens      .801    (.016)    .291    (.218)    .567    (.132)    .741    (.022)
burma         .653    (.035)    .032    (.033)    .201    (.057)    .464    (.023)
dukakis       .136    (.046)    .101    (.075)    .035    (.021)    .163    (.015)
ireland       .416    (.041)    .050    (.033)    .170    (.038)    .288    (.030)
quayle        .386    (.040)    .081    (.064)    .140    (.072)    .493    (.009)
budget        .290    (.039)    .058    (.046)    .141    (.029)    .235    (.005)
hostages      .477    (.021)    .068    (.042)    .177    (.039)    .498    (.003)

Table 2. Mean and standard deviation of F_{\beta=1} for training on the initial 3 examples combined with each of 996 uncertainty-selected examples, 7 random examples, 996 relevance-selected examples, or the 319,460 remaining examples. Means are over 10 runs for uncertainty and relevance sampling, and over 20 runs for random and full sampling.
7 Discussion
As Figure 2 shows, classifier effectiveness generally increases with sample size under all sampling methods, but faster with the two sequential methods. Of the sequential methods, uncertainty sampling substantially outperforms relevance sampling. The results hold across categories with widely varying absolute levels of effectiveness.[4]

The superiority of uncertainty sampling over relevance sampling is particularly notable since the low frequency of the categories used limited the danger that relevance sampling would drown in positive examples. Indeed, the difference between uncertainty sampling and relevance sampling is lower for the less frequent categories. However, even here uncertainty sampling is better, both in its higher mean and in its lower standard deviation.
In most cases, effectiveness levels reached by random sampling only with 100,000 or more training examples are reached by uncertainty sampling with fewer than 1000 examples, while training on 1000 randomly selected examples gives greatly inferior results. In some cases, reaching a given level of effectiveness requires 500 times as many randomly selected examples as examples selected by uncertainty sampling.

Some caution is required in comparing the results for small uncertainty samples with those for large random samples. For 6 of 10 categories, the mean F_{\beta=1} for a classifier trained on an uncertainty sample of 999 examples actually exceeds that from training on the full training set of 319,463 examples. This means that some aspect of our classifier training is not making effective use of large training sets. Feature selection is the most likely villain. Our method produced several thousand features when applied to the full training set, and previous work suggests this is too many [19].
The graphs of average effectiveness hide some variation from run to run. Several of the standard deviations shown in Table 2 amount to 10% or more of the mean effectiveness, meaning that the quality of the final classifiers is somewhat unpredictable.

One source of this variation was our initial subsamples of 3 positive examples. When a subsample was badly unrepresentative of the category, the initial classifier was ineffective, and there was a considerable delay before uncertainty sampling started to find additional positive examples. Even initial classifiers with the same raw effectiveness could lead to different parts of the space of examples being searched first.
Besides influencing early classifier formation, the starting subsamples had an impact on effectiveness through our making their words required features. This can be seen in the standard deviations for the Full column of Table 2, where, since all classifiers were trained on the same set of examples, the only difference among runs for a category is in the required features provided by the initial examples.

A second source of variation was fluctuations in the quality of successive classifiers. The uncertainty sampling process is inherently exploratory, with deficiencies in the classifier produced on one iteration leading to the selection of compensating examples on later iterations.
8 Future Work
Our results demonstrate that effective text classifiers can be created by obtaining large amounts of unlabeled data and labeling only a small fraction of it. While we tested uncertainty sampling on a text categorization task, it can equally well be applied to any classification task.
[4] The variations in absolute levels of effectiveness are to be expected: some categories are simply harder than others, and some of our category definitions were particularly noisy.
Text retrieval is an obvious application, though the tradeoff between retrieving the texts most useful to the user and the texts from which the system will learn the most must be considered [31]. The tradeoff is less of an issue in filtering, routing, and information dissemination applications, since the cost of judging nonrelevant examples can be amortized over a longer period of operation.
Uncertainty sampling should also benefit classification-based approaches to natural language processing tasks. A number of projects have annotated or are annotating large corpora to support the training of statistical methods for these tasks. Our results suggest that gathering huge unlabeled corpora, and using uncertainty sampling to annotate a small subset for each task, may be cheaper and equally effective. Domains besides text processing where large data sets are available can also benefit.
Many questions remain to be answered about uncertainty sampling. The most important practical ones have to do with how the teacher knows when to stop, and how to form the final classifier for use once labeling stops. Estimates of classifier effectiveness would enable the teacher to track progress, but new methods will be needed to produce such an estimate from a sample which is not random. Effectiveness estimates would also aid in selecting a classifier to use from the classifiers formed on the last few iterations. Alternatively, methods for stabilizing the fluctuations from iteration to iteration might be pursued.
A variety of extensions and improvements to uncertainty sampling can be explored. We need to determine the relationship between subsample size and effectiveness, since larger subsamples require less computation. It also seems likely that subsample size can be increased if redundancy within subsamples is decreased. Other efficiency improvements include using a less accurate but more efficiently trained classifier during sampling [32], and picking the first examples satisfying a threshold on uncertainty rather than the most uncertain examples. Simultaneous training of classifiers for multiple classes is also of interest.
As currently formulated, uncertainty sampling requires that the underlying training algorithm produce reasonable classifiers even from very small training sets. This meant we had to tinker with feature selection and parameter estimation to avoid producing pathological behavior on small samples. We are currently exploring variations on uncertainty sampling which would be more robust with respect to problems in classifier training.
9 Summary
Text is cheap, but information, in the form of knowing what classes a text belongs to, is expensive. Automatic classification of text can provide this information at low cost, but the classifiers themselves must be built with expensive human effort, or trained from texts which have themselves been manually classified. We have demonstrated that uncertainty sampling can sharply reduce the amount of text which must be manually labeled to create an effective classifier. Uncertainty sampling has potential applications in a variety of text processing tasks, as well as in other domains where large amounts of unclassified data are available.
10 Acknowledgments
We thank Jason Catlett, William Cohen, Eileen Fitzpatrick, Yoav Freund, Trevor Hastie, Robert Schapire, and Sebastian Seung for advice and useful comments on this work, and Ken Church for making available his text processing tools and for his help with them.
References
1. P. J. Hayes. Intelligent high-volume text processing using shallow, domain-specific techniques. In Paul S. Jacobs, editor, Text-Based Intelligent Systems: Current Research in Text Analysis, Information Extraction, and Retrieval, pages 227–241. Lawrence Erlbaum, Hillsdale, NJ, 1992.
2. P. Biebricher, N. Fuhr, G. Lustig, M. Schwantner, and G. Knorz. The automatic indexing system AIR/PHYS: from research to application. In Proc. SIGIR-88, pages 333–342, 1988.
3. W. G. Cochran. Sampling Techniques. John Wiley & Sons, New York, 3rd edition, 1977.
4. G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288–297, 1990.
5. W. A. Gale, K. W. Church, and D. Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439, 1993.
6. B. K. Ghosh. A brief history of sequential analysis. In B. K. Ghosh and P. K. Sen, editors, Handbook of Sequential Analysis, chapter 1, pages 1–19. Marcel Dekker, New York, 1991.
7. D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.
8. M. Plutowski and H. White. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2):305–318, March 1993.
9. D. Cohn, L. Atlas, and R. Ladner. Improving generalization with self-directed learning, 1992. To appear in Machine Learning.
10. D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4:720–736, 1992.
11. H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 287–294, 1992.
12. T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203–226, 1982.
13. Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems 5, San Mateo, CA, 1992. Morgan Kaufmann.
14. J. Hwang, J. J. Choi, S. Oh, and R. J. Marks II. Query-based learning applied to partially trained multilayer perceptrons. IEEE Transactions on Neural Networks, 2(1):131–136, January 1991.
15. D. T. Davis and J. Hwang. Attentional focus training by boundary region data selection. In International Joint Conference on Neural Networks, pages I-676 to I-681, Baltimore, MD, June 7–11, 1992.
16. P. E. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516, May 1968.
17. P. E. Utgoff. Improved training via incremental learning. In Sixth International Workshop on Machine Learning, pages 362–365, 1989.
18. N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55–72, 1989.
19. D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Proc. SIGIR-92, pages 37–50, 1992.
20. M. E. Maron. Automatic indexing: An experimental inquiry. Journal of the Association for Computing Machinery, 8:404–417, 1961.
21. W. S. Cooper. Some inconsistencies and misnomers in probabilistic information retrieval. In Proc. SIGIR-91, pages 57–61, 1991.
22. P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, 2nd edition, 1989.
23. W. S. Cooper, F. C. Gey, and D. P. Dabney. Probabilistic retrieval based on staged logistic regression. In Proc. SIGIR-92, pages 198–210, 1992.
24. N. Fuhr and U. Pfeifer. Combining model-oriented and description-oriented approaches for probabilistic indexing. In Proc. SIGIR-91, pages 46–56, 1991.
25. S. Robertson and J. Bovey. Statistical problems in the application of probabilistic models to information retrieval. Report 5739, British Library, London, 1982.
26. W. A. Gale and K. W. Church. Poor estimates of context are worse than none. In Speech and Natural Language Workshop, pages 283–287, San Mateo, CA, June 1990. DARPA, Morgan Kaufmann.
27. R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience, New York, 1973.
28. N. Goldstein, editor. The Associated Press Stylebook and Libel Manual. Addison-Wesley, Reading, MA, 1992.
29. W. B. Croft and D. J. Harper. Using probabilistic models of document retrieval without relevance feedback. Journal of Documentation, 35(4):285–295, 1979.
30. C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
31. A. Bookstein. Information retrieval: A sequential learning process. Journal of the American Society for Information Science, 34:331–342, September 1983.
32. D. D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, 1994. To appear.
[Figure 2 appears here: nine panels plotting F (vertical axis) against the number of training examples, from 10 to 100,000 on a logarithmic scale (horizontal axis), for the categories hostages (.0049), nielsens (.0016), ireland (.0024), dukakis (.0020), budget (.0037), bonds (.0015), boxoffice (.0010), quayle (.0025), and burma (.0016).]

Figure 2. Mean F_{\beta=1} values for text classifiers trained on uncertainty (solid line), relevance (dotted line), and random (dashed line) samples of titles from the AP corpus. Means are over 10 runs for uncertainty and relevance sampling, and over 20 runs for random sampling. Results shown for 9 categories. The frequency of each category on the training set is shown in parentheses.