US 9,104,972 B1

(54) CLASSIFYING DOCUMENTS USING MULTIPLE CLASSIFIERS

(71) Applicant: Google Inc., Mountain View, CA (US)

(72) Inventors: Dmitry Korolev, Unteriberg (CH); Hartmut Maennel, Zurich (CH)

(73) Assignee: Google Inc., Mountain View, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 14/223,632

(22) Filed: Mar. 24, 2014

Related U.S. Application Data

(63) Continuation of application No. 12/404,089, filed on Mar. 13, 2009, now Pat. No. 8,713,007.

(51) Int. Cl.
    G06F 17/30 (2006.01)
    G06N 5/04 (2006.01)
    G06N 99/00 (2010.01)

(52) U.S. Cl.
    CPC: G06N 5/048 (2013.01); G06N 99/005 (2013.01)

(58) Field of Classification Search
    CPC: G06F 17/30477
    USPC: 707/705, 708, 713, 729, 731
    See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

    6,137,911 A *    10/2000  Zhilyaev             382/225
    7,117,149 B1 *   10/2006  Zakarauskas          704/233
    7,155,668 B2 *   12/2006  Holland et al.       715/256
    7,899,816 B2      3/2011  Kolo et al.
    8,495,002 B2 *    7/2013  Nelken et al.        706/62
    2004/0249796 A1  12/2004  Azzam
    2005/0027664 A1   2/2005  Johnson et al.       706/12
    2005/0065919 A1 * 3/2005  Gotoh et al.         707/3
    2005/0216443 A1 * 9/2005  Morton et al.        707/3
    2006/0004748 A1   1/2006  Ramarathnam et al.   707/6
    2006/0085767 A1 * 4/2006  Hinckley et al.      715/863
    2006/0123000 A1   6/2006  Baxter et al.
    2006/0184521 A1   8/2006  Ponte
    2006/0218134 A1 * 9/2006  Simske et al.        707/4
    2007/0038625 A1   2/2007  Yang-Stephens et al.
    2007/0112756 A1   5/2007  Wen et al.
    2007/0150472 A1   6/2007  Cao et al.
    2007/0198447 A1 * 8/2007  Tilford et al.       706/20
    2008/0059448 A1   3/2008  Chang et al.
    2009/0116756 A1   5/2009  Neogi et al.

OTHER PUBLICATIONS

    Angelov, S., Harb, B., Kannan, S., and Wang, L., "Weighted Isotonic Regression under the L1 Norm," in: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, FL, 2006, pp. 783-791, http://portal.acm.org/citation.cfm?id=1109643.
    Cormen, T., Leiserson, C., Rivest, R., and Stein, C., "Introduction to Algorithms," Chapter 26 (2nd ed., MIT Press and McGraw-Hill, 2001), 58 pages.
    Goldberg, A., and Rao, S., "Length Functions for Flow Computations," Technical Report #97-055, NEC Research Institute, Inc., Aug. 1997, 19 pages.
    Maxwell, W. and Muckstadt, J., "Establishing Consistent and Realistic Reorder Intervals in Production-Distribution Systems," Operations Research, vol. 33, No. 6, Nov.-Dec. 1985, pp. 1316-1341.
    van Leeuwen, J., "Graph Algorithms, 1.4. Transitive reduction and transitive closure," in: Handbook of Theoretical Computer Science, vol. A: Algorithms and Complexity, Elsevier / MIT Press, 1994, 10 pages.
    Zadrozny, B. and Elkan, C., "Transforming Classifier Scores into Accurate Multiclass Probability Estimates," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002, pp. 694-699.
    * cited by examiner

Primary Examiner: Noosha Arjomandi
(74) Attorney, Agent, or Firm: Fish & Richardson P.C.

(57) ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying resources using scores from multiple classifiers. In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of identifying a collection of documents to classify; receiving a plurality of classifiers for scoring a document with respect to a specified property; for each document in the collection, applying each of the plurality of classifiers, each classifier generating a score associated with a likelihood that the document has the specified property; combining the scores from each classifier, including applying a multiple classifier model that uses monotonic regression to combine the plurality of classifiers; and classifying the document as having the specified property based on the combined score.

20 Claims, 6 Drawing Sheets
[Drawing Sheets 1-6 (FIGS. 1-6) are not reproduced; only the following labels are recoverable from the extraction:
FIG. 1 (Sheet 1): flowchart of method 100 ("Receive a document" 102; "Identify multiple classifiers for a particular document property").
FIG. 2 (Sheet 2).
FIG. 3 (Sheet 3): flowchart of method 300 ("Identify group of documents and associated classifier scores" 302; "Linearize probability distribution for documents" 304; "Iterate to satisfy constraints on training documents" 308).
FIG. 4 (Sheet 4): flowchart of method 400 ("Receive search query from a client" 402).
FIG. 5 (Sheet 5): block diagram of a client device 504 (RAM 506, processor 508) submitting a query 510 and receiving search results 528 from a search system 514 containing a search engine 530 with an indexing engine 520 and a ranking engine.
FIG. 6 (Sheet 6).]
CLASSIFYING DOCUMENTS USING MULTIPLE CLASSIFIERS

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of, and claims priority to, pending U.S. patent application Ser. No. 12/404,089, filed on Mar. 13, 2009, entitled "Classifying Documents Using Multiple Classifiers," the entire contents of which are herein incorporated by reference.

BACKGROUND

This specification relates to classifying documents using scores from multiple classifiers.

Documents (e.g., Web pages or Web sites) can be classified according to one or more document properties. These classified documents can then be treated differently, for example, by a search engine or other information retrieval techniques. For example, a document property can be content of a special topic of interest, either because the topic is particularly desirable (e.g., financial sites would like to show detailed information about companies' business performance) or because the topic is undesirable (e.g., pornographic content ("porn") or depictions of violence may be undesired in particular circumstances). Undesired documents can be filtered out from search results while desirable documents can be shown with a preference over documents having uncertain or different topics.

Documents can be classified according to different techniques. For example, human raters can be used to manually classify documents as having a specified property. While highly accurate, this is very time consuming for large numbers of documents (e.g., a collection of Web documents). Alternatively, automatic classifiers can flag documents as likely having the particular property. Typically, the classifiers examine the documents for particular types of content, for example, images or text. However, conventional automatic classifiers often do not provide a likelihood that a document has the specified property with a confidence level sufficient to allow automatic actions. In particular, if there are classification systems on both the level of Web pages and the level of Web sites, an action on the site level would affect all pages of that site, so an action on the site level has to have a very high confidence. If the Web site as a whole cannot be classified with high confidence, it may be preferable to classify the individual pages based on their individual content. In general this is more difficult because there is less information upon which to base the classification.

SUMMARY

This specification describes technologies relating to classifying documents using multiple classifiers.

In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of selecting a collection of documents to classify; selecting multiple classifiers for scoring a document with respect to a specified property; for each document in the collection, applying each of the multiple classifiers, each classifier generating a score associated with a likelihood that the document has the specified property; combining the scores from each classifier, including applying a multiple classifier model that uses monotonic regression to combine the multiple classifiers; and classifying the document as having the specified property based on the combined score. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.

These and other embodiments can optionally include one or more of the following features. The method further includes generating a list of documents including the collection classified as having the specified property. The method further includes receiving a search query; identifying resources responsive to the search query; generating initial search results identifying the resources responsive to the search query; filtering the initial search results based on resources corresponding to entries in the list to produce filtered search results; and presenting filtered search results in response to the received search query. Filtering includes removing search results corresponding to entries in the list. Filtering includes removing search results that do not match entries in the list.

The method further includes generating the probability model including: identifying a group of documents to be classified with respect to the specified property; calculating scores for each document of the group of documents using the multiple classifiers; identifying a training group of documents from the group of documents; determining whether each document in the training group of documents has the specified property; and generating the multiple classifier model using the training group of documents, the generating including calculating a monotonic regression from the maximum likelihood estimate.

Identifying the training group of documents further includes creating a partition of a value set for each classifier into smaller intervals; assigning each document to a bucket based on the intervals in which the classifier outputs lie; iterating bucket assignments to satisfy one or more constraints on the group of training documents; and selecting the group of training documents according to bucket.

Each document of the training group of documents is rated by a human rater with respect to the specified property. Generating the multiple classifier model further includes using specific classifier scores from the training group of documents to calculate the monotonic regression that maximizes the likelihood for multiple classifier scores identifying a particular outcome probability that a given document has the specified property. The multiple classifier model uses a transitive reduction of monotonicity constraints. The method further includes assigning a probability to the combined score for the document, where classifying the document includes comparing the probability with a threshold value and, when the score is greater than the threshold, classifying the document as having the specified property. Combining the scores includes using the individual classifier scores from a training group of documents to identify n training documents having scores monotonically below the document and using the scores of those n training documents to calculate the combined score.
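The combining feature described in the preceding paragraph can be illustrated with a short, hypothetical sketch in Python (none of the names below come from the patent, and the data are invented). It finds the training documents whose classifier scores all lie at or below the scores of a new document and uses the fraction of those documents rated as having the property as a rough combined score; the patent's model instead calculates a monotonic regression over these dominance relations, as described in the detailed description below.

    # Hypothetical sketch: combine classifier scores using the training
    # documents whose score vectors lie monotonically below a new document.

    def dominated_by(train_scores, new_scores):
        """Indices of training documents whose scores are all <= new_scores."""
        return [i for i, s in enumerate(train_scores)
                if all(s_j <= n_j for s_j, n_j in zip(s, new_scores))]

    def combined_score(train_scores, train_labels, new_scores):
        """Fraction of the monotonically dominated training documents that have the property."""
        below = dominated_by(train_scores, new_scores)
        if not below:
            return 0.0  # no training evidence below the document: treat the property as unlikely
        return sum(train_labels[i] for i in below) / len(below)

    # Invented example: two classifiers (d = 2); label 1 = has the property.
    train_scores = [(1, 0), (2, 1), (5, 2), (8, 3)]
    train_labels = [0, 0, 1, 1]
    print(combined_score(train_scores, train_labels, (6, 2)))  # 1/3: one of the three dominated documents is positive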
Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Documents are classified with a high confidence. While a seed set of documents is classified by human raters to train the classification system, once the system is built, documents can be classified without evaluation by human raters. The classification process gives each document a probability that the document has the desired property. Thus, users of this classification scheme can perform different actions on classified documents based on the level of confidence they need. The result can be provided as a probability such that the result has an intuitive meaning and it is easy for users to specify a desired threshold level. Providing the result in the form of a probability also eliminates output calibration processes.

New individual classifiers can be added and previously used classifiers can be disabled without affecting the scale of the output. The classification is based on scores (numbers) output by different classifiers, but these scores do not need to be comparable to each other or normalized in a certain way. Instead, the only required information is whether, for a given classifier, one document is more likely to have the property than another document. The classification is computationally feasible even for large seed sets (e.g., a model based on 100,000 seed documents can be computed on a single PC in minutes). Combining classifiers with monotonic regression can increase precision over other multiple classifier techniques.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart showing an example method for classifying documents using scores from multiple classifiers.
FIG. 2 is a flowchart showing an example method for modeling combined classifiers to provide an output probability result.
FIG. 3 is a flowchart showing an example method of selecting a group of training documents.
FIG. 4 is a flowchart showing an example method of providing search results.
FIG. 5 shows a block diagram of an example search system.
FIG. 6 is a schematic diagram of an example system.
Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1 is a flowchart showing an example method 100 for classifying documents using scores from multiple classifiers. For convenience, the method 100 will be described with respect to a system that performs the method 100.

The system receives 102 a document. In particular, the system can receive the document from a collection of documents to be classified. For example, the document can be a Web document from a collection of unclassified Web documents. Alternatively, in some implementations, the document includes all of the resources for a particular Web site (e.g., individual Web pages, images, and other multimedia content of the Web site).

The system can classify each document as having one or more properties. For clarity, the method 100 will be described with respect to classifying a single document relative to a specific property. For example, the property can be whether or not the document includes financial information or whether or not the document includes porn content. If the property is porn, the document is classified as either being porn or not porn (e.g., a Boolean classification is made for the document as either being porn or not being porn).

Other document properties can be the subject of classification. For example, documents can be classified for different document topics including sports, research topics, commercial sites, and celebrity gossip.

The system identifies multiple classifiers 104 for the specified document property. In particular, if the specific property being classified is porn, the system identifies multiple porn classifiers. Different classifiers can be used to determine a score indicating whether a document has the specified property based on an analysis of a particular type of content of the document. For example, a porn classifier can include text classifiers and image classifiers. The text classifier examines document text content to determine a likelihood that the document is porn while the image classifier examines document image content to determine a likelihood that the document is porn. In some implementations, the score is a probability while in other implementations the score is a numerical value (e.g., a number of words identified by the classifier). In some implementations, the system first identifies the types of content within the document and then identifies the corresponding classifiers. For example, the system can determine whether the document contains text before identifying a text classifier as one of the multiple classifiers for the document.

There are numerous ways to construct a classifier for a given type of content. For example, a text classifier can assign numbers to key words and add up all those numbers that occur in a given document. These numbers are positive for words that are likely to occur in the texts of documents associated with the property for classification. In some implementations, there are also negative values for words that are unlikely to occur in texts having the document property. The text classifier then sums all the numbers that correspond to words in the text of the document to be classified. If the sum is greater than a specified threshold, the text classifier indicates that the document is likely to have the property being classified. In general, the higher the number, the more likely it is that the document text is associated with the property being classified. Text classifiers can be applied to different parts of a document; for example, for a Web page, different text classifiers can be applied to the title of the Web page, the actual text content of the Web page, and the URL of the Web page.
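A minimal sketch of such a keyword-weight text classifier is shown below in Python. The word list, weights, and threshold are invented for illustration only; in practice they would be chosen or learned for the specific property being classified.

    # Hypothetical keyword-weight text classifier: sum the weights of known
    # words in the document text and compare the sum to a threshold.

    FINANCE_WEIGHTS = {                # invented example weights
        "dividend": 3.0, "equity": 2.0, "earnings": 2.5,
        "recipe": -2.0,                # negative weight: word unlikely in finance texts
    }

    def text_classifier_score(text, weights=FINANCE_WEIGHTS):
        """Sum the weights of the known words that occur in the text."""
        return sum(weights.get(word, 0.0) for word in text.lower().split())

    def text_classifier_flag(text, threshold=3.0):
        """Indicate that the text likely has the property if the score exceeds the threshold."""
        return text_classifier_score(text) > threshold

    print(text_classifier_score("Quarterly earnings and dividend growth"))  # 5.5
    print(text_classifier_flag("Quarterly earnings and dividend growth"))   # True

The raw sum plays the role of the classifier score discussed below; it is not yet a probability, which is why the scores of different classifiers are later combined through the multiple classifier model rather than compared directly.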
The system applies 106 each of the multiple classifiers to the document. Applying each classifier to the document (e.g., an image classifier and a text classifier) provides a particular result ("score") for each classifier indicating a likelihood as to whether, taken alone, the document is likely to have the specified property. For example, for a porn classifier, each result indicates the likelihood that the document is porn.

The system uses 108 a multiple classifier model to identify a combined score, which can be interpreted as a probability, for the document based on the individual classifier results. The multiple classifier model is described in greater detail below with respect to FIG. 2. The multiple classifier model uses the scores for training documents to identify the score for an unknown input document. Identifying the combined score using the multiple classifier model for the document is also described in greater detail below.

The system classifies 110 the document based on the combined score. In particular, a threshold value can be specified. When the combined score is at or above the threshold value (indicating a specified level of confidence), the document is automatically classified as having the specified property (e.g., porn). When the combined score is below the specified threshold, the document is automatically classified as not having the specified property (e.g., not porn). For example, a threshold of 50% probability can be set for classifying documents as having the specified property.

Alternatively, in some implementations, both a high threshold and a low threshold are used. Documents with combined scores at or above the high threshold (e.g., 98%) are automatically classified as having the specified property. Documents with a combined score at or below the low threshold (e.g., 10%) are automatically classified as not having the specified property. However, documents with a combined score between the low threshold and the high threshold are sent to one or more human raters for classification.

If the document is classified as having the property, the system adds 112 the document to a list. In some implementations, the list is a blacklist. When the particular document property being classified is a property used to filter information retrieval in a restricted mode (e.g., a safe search), the document can be added to the blacklist so that information retrieval results (e.g., Web search results) can be filtered. For example, documents classified as porn can be added to the blacklist so that a user search under a safe search mode can have search results filtered according to the blacklist such that no porn documents are identified in the search results. The blacklist can identify all resources associated with the document (e.g., web pages and images from a particular web site corresponding to the document). The blacklist is stored for future use (e.g., in filtering search results).

The system can add the document to a type of list other than a blacklist. In some implementations, the system adds the documents to a list for presentation. For example, the document property being classified can be a particular topic. Documents associated with the topic can be added to the list. Identifiers for one or more of these documents can be presented to a user (e.g., as links to the corresponding documents). For example, the property being classified can be "finance" and one or more of the listed finance documents can be identified in response to a particular user request.

The system uses 114 the list in information retrieval. For example, as described above, the blacklist can be used to filter search results to eliminate search results identifying resources associated with a blacklisted document. An example of searching is described below with respect to FIG. 4.

FIG. 2 is a flowchart showing an example method 200 for modeling combined classifiers to provide an output score for a document. For convenience, the method 200 will be described with respect to a system that performs the method 200.

The system identifies 202 a group of documents to be classified. The group of documents can include a collection of Web sites to be classified as having the specified property. For example, the group of documents can be Web sites to be classified as porn or not porn.

The system calculates 204 scores for documents using multiple classifiers. The system uses multiple classifiers that indicate whether a document is likely to have the specified property. The classifiers can examine different content of the documents, for example, text or image content of each document. Each classifier generates a score for the document that indicates the likelihood that the document has the property. Generally, a higher score indicates a higher likelihood that the document has the property.

The system selects 206 a subgroup of documents of the group of documents as a group of training documents. For example, the subgroup of documents can be a group of documents with known classifier scores and known determinations as to whether the respective documents of the subgroup have the specified property of interest (e.g., as a result of human raters determining whether each document of the group of training documents has the specified property of interest). In some implementations, the subgroup of documents includes documents that have been rated by human raters to determine whether or not the document has the specified property.

In some implementations, each training document is selected such that the group of training documents satisfies particular parameters. FIG. 3 is a flowchart showing an example method 300 of selecting a group of training documents. For convenience, the method 300 will be described with respect to a system that performs the method 300.

The system identifies 302 a collection of documents to be classified and associated classifier scores. The system generates 304 a linearized probability distribution for the identified documents based on their classifier scores. The scores for each document can be normalized before the probability distribution is linearized. The system assigns 306 each document to a particular bucket based on the linearized probability distribution. In some implementations, the number of buckets is chosen such that each bucket only contains a small number of documents (e.g., less than 10). The system iterates 308 over the number of buckets and the number of selected training documents until particular parameters are satisfied, for example, until a specified number N of documents are chosen for the training group and the ratio of documents having the property of interest is close to 50% (including manual rating of a sample of the documents). The system selects 310 training documents from each bucket after the parameters have been satisfied. For example, an equal number of documents can be selected from each bucket to reach a total of N documents. A more detailed description of the process for selecting the group of training documents is provided below.
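The bucket-based selection just described can be sketched as follows in Python; the names and data are hypothetical. The sketch approximates the linearized probability distribution by ranking documents on a normalized average of their classifier scores, and it omits the iteration over bucket counts and the manual-rating check that roughly half of the selected documents have the property.

    # Hypothetical sketch of bucket-based training-document selection.
    import random

    def normalized_average(score_vectors):
        """Normalize each classifier's scores to [0, 1] and average them per document."""
        d = len(score_vectors[0])
        columns = list(zip(*score_vectors))
        lows = [min(c) for c in columns]
        spans = [max(c) - min(c) or 1.0 for c in columns]
        return [sum((s[j] - lows[j]) / spans[j] for j in range(d)) / d
                for s in score_vectors]

    def select_training_group(score_vectors, n_wanted, n_buckets, seed=0):
        """Rank documents by normalized score, split the ranking into buckets,
        and draw an equal number of documents from each bucket."""
        rng = random.Random(seed)
        avg = normalized_average(score_vectors)
        order = sorted(range(len(score_vectors)), key=lambda i: avg[i])
        per_bucket = max(1, n_wanted // n_buckets)
        bucket_size = max(1, len(order) // n_buckets)
        chosen = []
        for b in range(n_buckets):
            bucket = order[b * bucket_size:(b + 1) * bucket_size]
            chosen.extend(rng.sample(bucket, min(per_bucket, len(bucket))))
        return chosen[:n_wanted]

    rng = random.Random(1)
    scores = [(rng.randint(0, 20), rng.random()) for _ in range(200)]
    print(len(select_training_group(scores, n_wanted=40, n_buckets=20)))  # 40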
As shown in FIG. 2, the system generates 208 a multiple classifier model for combining classifiers to generate combined results for the training group documents corresponding to the known results. Generating the multiple classifier model includes determining a maximum likelihood probability function for the set of classifiers. The multiple classifier model uses the specific scores from the training set to calculate a monotonic regression that maximizes the likelihood for multiple classifier scores identifying a particular outcome probability that a given document has the specified property. Generating the multiple classifier model is described in greater detail below.

FIG. 4 is a flowchart showing an example method 400 of providing filtered search results using a list based on an identified property of interest. For convenience, the method 400 will be described with respect to a system (e.g., a search system) that performs the method 400.

The system receives 402 a search query. For example, the system can receive the search query from a user of a client device. The user can input the search query to a search interface displayed on a client browser. The system provides the query to a search engine, either locally or remotely, that identifies resources responsive to the search query.

The system receives 404 search results responsive to the search query. For example, the system can receive a set of search results from the search engine. The set of search results includes identifiers for the resources responsive to the search query. In some implementations, the search results further include a link to each identified resource and snippets of content from the respective resource.

The system determines 406 whether a restricted search is being performed. For example, a restricted search based on a specific document property or a restricted "safe search" mode for limiting results based on the property of interest can be selected by a user through the search interface. The safe search mode can filter search results that reference resources identified as having a particular property (e.g., resources associated with documents classified as porn).

When the system determines that a restricted search is not being performed, the system provides 408 the search results to the client for presentation (e.g., as an ordered set of results by the client browser).

When the system determines that a restricted search is being performed, the system filters 410 the search results using a list. The list identifies documents, and therefore their associated resources, used for the restricted search. For example, the list can identify resources associated with documents having a finance property that should be identified for a user in response to a search restricted to finance documents. In another example, if the restricted search is one that should limit results (e.g., porn), the list is used to identify resources that should not be identified in search results (e.g., a blacklist).

For example, the list can identify documents classified as porn. Any resources associated with the list documents can be identified from the list document. The system compares the resources of the search results with the list documents and removes any search results identifying resources associated with the list. For example, if a search result identifies a resource corresponding to an image associated with a Web site identified as porn, the image resource is filtered. The system provides 412 the filtered search results to the client for presentation.

FIG. 5 illustrates an example search system 514 for providing search results relevant to submitted queries as can be implemented in an internet, an intranet, or another client and server environment. The search system 514 is an example of an information retrieval system in which the systems, components, and techniques described below can be implemented.

A user 502 can interact with the search system 514 through a client device 504. For example, the client 504 can be a computer coupled to the search system 514 through a local area network (LAN) or wide area network (WAN), e.g., the Internet. In some implementations, the search system 514 and the client device 504 can be one machine. For example, a user can install a desktop search application on the client device 504. The client device 504 will generally include a random access memory (RAM) 506 and a processor 508.

A user 502 can submit a query 510 to a search engine 530 within a search system 514. When the user 502 submits a query 510, the query 510 is transmitted through a network to the search system 514. The search system 514 can be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network. The search system 514 includes an index database 522 and a search engine 530. The search system 514 responds to the query 510 by generating search results 528, which are transmitted through the network to the client device 504 in a form that can be presented to the user 502 (e.g., a search results web page to be displayed in a web browser running on the client device 504).

When the query 510 is received by the search engine 530, the search engine 530 identifies resources that match the query 510. The search engine 530 may also identify a particular "snippet" or section of each resource that is relevant to the query. The search engine 530 will generally include an indexing engine 520 that indexes resources (e.g., web pages, images, or news articles on the Internet) found in a corpus (e.g., a collection or repository of content), an index database 522 that stores the index information, and a ranking engine 552 (or other software) to rank the resources that match the query 510. The indexing and ranking of the resources can be performed using conventional techniques. The search engine 530 can transmit the search results 528 through the network to the client device 504, for example, for presentation to the user 502.

The search system 514 may also maintain one or more user search histories based on the queries it receives from a user. Generally speaking, a user search history stores a sequence of queries received from a user. User search histories may also include additional information such as which results were selected after a search was performed and how long each selected result was viewed.

In some implementations, the search system 514 includes one or more lists 540. The lists 540 each identify resources, in particular Web pages or Web sites, that have been classified as having a specified property (e.g., porn, finance, sports). Under specified circumstances, for example, when a user has indicated that a restricted search is to be performed, the resources identified by the search engine 530 are compared with the resources identified by the list 540 corresponding to the restricted search. In particular, the search results 528 can be filtered to remove results identifying resources that are not associated with the list 540 or, alternatively, that are associated with the list 540.

Modeling Maximum Likelihood Probability Function for Multiple Classifiers

For a collection of documents that potentially have a specified property there is a set of d classifiers that each provide a likelihood that a particular document has the property. Each classifier (e.g., a text classifier or an image classifier) provides an output score (a result value represented by a real number). The higher the score, the more likely the document has the specified property.

The classifiers can be applied to a group of documents in the collection to determine scores for those documents, and human raters can determine which documents really have the specified property. Using this input of rated documents, an estimate is then made for the probability function that combines multiple classifiers to provide a combined score for a new input document based on its classifier scores.

Mathematically, for a given probability space X there are d classifier functions S_i: X → R and a function Y: X → {0, 1}. Thus, each classifier function S_i provides an output score that is a real number, and the special property that documents can have is modeled as a Boolean result {0, 1} that specifies that the document x either has the specified value ("Y(x) = 1") or does not have the specified value ("Y(x) = 0").

The conditional joint probabilities p(s_1, ..., s_d) = P(Y = 1 | S_1 = s_1, ..., S_d = s_d) are monotonically increasing in each of the d parameters (classifiers 1, ..., d). Each s_i represents a real number (the output of a classifier), for example, a word count in a word classifier function S_i used to calculate a score for the classifier. Therefore, the likelihood that the document is classified as having the property is defined as a function of each individual classifier score. These probabilities are assumed to be monotonically increasing, i.e., the higher the score for each classifier, the more likely that the document has the property being evaluated by the classifier (e.g., for a porn classifier, the higher the classifier score, the greater the likelihood that the document is porn).

The classifiers are applied to a group of N documents. For each document there is a set of parameters corresponding to the classifier functions applied to each document of the N documents, s_1(i), ..., s_d(i) for i = 1, ..., N. Thus, for a first document (i = 1) a parameter from each individual classifier (e.g., s_1(1), s_2(1)) is calculated. For example, a parameter s_1(1) can be a count from a text classifier and a parameter s_2(1) can be a score from an image classifier for the same document. These parameters are used to determine a classifier score for the document (e.g., s_1(1) and s_2(1)).

For the group of N documents that are human evaluated, a_i = 1 if document i has the special property, and a_i = 0 if document i does not have the special property. Thus, for each human evaluated document, there are d real values s_1(i), ..., s_d(i) and one value a_i that is either 0 or 1. Using the experimental data for the group of N documents, a maximum likelihood estimation for the conditional probabilities p(s_1, ..., s_d) is derived.

The function p represents an estimate for the probability that a document has the property, given as only information the scores of the d classifiers. Estimating p given the information on a_1, ..., a_N can be done using a maximum likelihood calculation to identify the p that gives the highest probability to the ratings a_1, ..., a_N that were actually observed (e.g., as previously evaluated by human raters).

If we assume we know the function p, the probability of getting the output with all the observed values of a_i is given by:

    P(a_1, ..., a_N | p) = ∏_{i=1}^{N} p(s_1(i), ..., s_d(i))^{a_i} · (1 − p(s_1(i), ..., s_d(i)))^{1 − a_i}.
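For concreteness, the quantities in this formula can be written down directly in code. The sketch below (Python, with invented scores and ratings) stores the classifier scores s_j(i) and ratings a_i for a small training group, derives the monotone-dominance pairs used later in this section, and evaluates the negative logarithm of P(a_1, ..., a_N | p) for candidate values x_i standing for p(s_1(i), ..., s_d(i)); that negative log form is the quantity minimized below.

    import math

    # Invented classifier scores s_j(i) for N = 4 documents and d = 2 classifiers,
    # and invented human ratings a_i (1 = has the property).
    scores = [(2, 1), (10, 3), (4, 0), (7, 2)]
    labels = [0, 0, 1, 1]

    def neg_log_likelihood(x, labels):
        """-log P(a_1, ..., a_N | p), with x_i standing for p(s_1(i), ..., s_d(i))."""
        return -sum(a * math.log(x_i) + (1 - a) * math.log(1 - x_i)
                    for x_i, a in zip(x, labels))

    # Monotone-dominance pairs (i, j): every score of document i is <= that of document j.
    edges = [(i, j) for i in range(len(scores)) for j in range(len(scores))
             if i != j and all(si <= sj for si, sj in zip(scores[i], scores[j]))]

    print(edges)                                     # [(0, 1), (0, 3), (2, 1), (2, 3), (3, 1)]
    print(neg_log_likelihood([0.1, 0.5, 0.6, 0.7], labels))  # likelihood of one candidate assignment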
Maximum likelihood estimation is a statistical method that can be used to fit a model to the experimental data. To apply this method, we have to find the monotonic function p of d variables that maximizes this probability. This condition only determines the values p(s_1(i), ..., s_d(i)) that appear in that formula. However, by the monotonicity constraint, for each other value s_1, ..., s_d at least an interval in which p(s_1, ..., s_d) must lie can be calculated. If the training set includes at least a specified number of documents, these intervals are small and hence in practice will allow the probability p(s_1, ..., s_d) to be calculated for all new documents with reasonable precision.

The maximum likelihood estimate for the function p can be calculated. Mathematically, maximizing the likelihood is equivalent to minimizing the (−log) of the likelihood, which can be written as:

    −log P(a_1, ..., a_N | p) = −∑_{i=1}^{N} [a_i · log(p(s_1(i), ..., s_d(i))) + (1 − a_i) · log(1 − p(s_1(i), ..., s_d(i)))].

Writing x_i for the unknown value p(s_1(i), ..., s_d(i)), this sum splits into per-document terms f_i(x_i) = −(a_i · log(x_i) + (1 − a_i) · log(1 − x_i)). These particular functions f_i have their minimum at a_i. Without the monotonicity constraint, minimizing

    ∑_i f_i(x_i)

would be trivial: choose each x_i to be its corresponding a_i. However, the monotonicity constraints need to be considered, namely that x_i ≤ x_j for certain pairs i, j.

To define the monotonicity constraint, let i ≤ j if s_1(i) ≤ s_1(j) ∧ ... ∧ s_d(i) ≤ s_d(j), where ∧ is the logical conjunction, such that the inequalities must be true for each classifier. V is defined as {1, 2, ..., N} and E is a set of pairs (i, j) such that ≤ is the transitive reflexive closure of E. In other words, (V, E) is a directed graph and i ≤ j if and only if i = j or there is a path from i to j in E. Thus, each entry in V is a vertex of the directed graph and represents, e.g., a document from the experiment group. An edge can be drawn from i to j between vertices if the classifier scores indicate that document j is at least as likely to have the specified property as document i.

As an example, assume the classifiers identify the property of financial information in documents, for example in Web pages. The system can use two classifiers that assign to each Web page two integers: s_1 = the number of technical financial words in the text and s_2 = the number of links to known sites with financial information. Additionally, there is an assumption that the higher the value of s_1, the more likely the Web page is of interest (i.e., more likely to have the specified property of financial information). And similarly, the higher the value of s_2, the more likely the Web page is of interest.

For example, if a Web page 1 has 2 financial terms and 1 link to financial pages, it will be of interest with a probability p(2, 1) = x_1. If a Web page 2 has 10 financial terms and 3 links to financial sites, such a Web page will be of interest with a probability p(10, 3) = x_2. Since Web page 2 contains both more financial terms and more links to financial sites than Web page 1, the monotonicity assumption gives p(10, 3) ≥ p(2, 1), or, stated in terms of the "financial probabilities" for the pages, x_2 ≥ x_1. In the directed graph an edge (i, j) is drawn from Web page 1 to Web page 2 since we know that Web page 2 is at least as likely to be interesting (i.e., have the specified property) as Web page 1.

The monotonicity requirement that x_i ≤ x_j for i ≤ j is equivalent to the requirement that x_i ≤ x_j for (i, j) ∈ E. As a result, the minimizing problem can be restated for a directed graph (V, E) where for each i ∈ V there is a label a_i ∈ {0, 1}. A vector x with components x_i ∈ [0, 1] for each i ∈ V is identified that minimizes

    ∑_{i ∈ V} f_i(x_i)
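For a small training group this constrained minimization can be solved with a general-purpose solver, as in the Python sketch below. It uses scipy's SLSQP method and the invented labels and edges from the earlier sketch; it is only illustrative, since a system handling large seed sets would use specialized monotonic (isotonic) regression algorithms rather than a generic optimizer.

    import numpy as np
    from scipy.optimize import minimize

    labels = np.array([0, 0, 1, 1])                    # a_i (invented)
    edges = [(0, 1), (0, 3), (2, 1), (2, 3), (3, 1)]   # (i, j) requires x_i <= x_j

    def objective(x):
        # sum over i of f_i(x_i) = -(a_i * log(x_i) + (1 - a_i) * log(1 - x_i))
        return -np.sum(labels * np.log(x) + (1 - labels) * np.log(1 - x))

    constraints = [{"type": "ineq", "fun": lambda x, i=i, j=j: x[j] - x[i]}
                   for i, j in edges]
    eps = 1e-6                                         # keep the logarithms finite at the bounds
    bounds = [(eps, 1 - eps)] * len(labels)
    x0 = np.full(len(labels), 0.5)

    result = minimize(objective, x0, method="SLSQP", bounds=bounds,
                      constraints=constraints)
    print(np.round(result.x, 3))   # approximately [0.0, 0.667, 0.667, 0.667]

Where the monotonicity constraints bind (here, documents 2 and 3 are rated positive but their estimates may not exceed the estimate for document 1, which is rated negative), the fitted probabilities are pooled, which is the kind of pooling the monotonic regression produces.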