
(12) United States Patent
Korolev et al.

(10) Patent No.: US 9,104,972 B1
(45) Date of Patent: Aug. 11, 2015

(54) CLASSIFYING DOCUMENTS USING MULTIPLE CLASSIFIERS

(71) Applicant: Google Inc., Mountain View, CA (US)

(72) Inventors: Dmitry Korolev, Unteriberg (CH); Hartmut Maennel, Zurich (CH)

(73) Assignee: Google Inc., Mountain View, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 14/223,632

(22) Filed: Mar. 24, 2014

Related U.S. Application Data

(63) Continuation of application No. 12/404,089, filed on Mar. 13, 2009, now Pat. No. 8,713,007.

(51) Int. Cl.
    G06F 17/30 (2006.01)
    G06N 5/04 (2006.01)
    G06N 99/00 (2010.01)

(52) U.S. Cl.
    CPC: G06N 5/048 (2013.01); G06N 99/005 (2013.01)

(58) Field of Classification Search
    CPC: G06F 17/30477
    USPC: 707/705, 708, 713, 729, 731
    See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

    6,137,911 A * 10/2000 Zhilyaev ................. 382/225
    7,117,149 B1 * 10/2006 Zakarauskas ............. 704/233
    7,155,668 B2 * 12/2006 Holland et al. .......... 715/256
    7,899,816 B2 3/2011 Kolo et al.
    8,495,002 B2 * 7/2013 Nelken et al. ........... 706/62
    2004/0249796 A1 12/2004 Azzam
    2005/0027664 A1 2/2005 Johnson et al. .......... 706/12
    2005/0065919 A1 * 3/2005 Gotoh et al. ............ 707/3
    2005/0216443 A1 * 9/2005 Morton et al. ........... 707/3
    2006/0004748 A1 1/2006 Ramarathnam et al. ...... 707/6
    2006/0085767 A1 * 4/2006 Hinckley et al. ......... 715/863
    2006/0123000 A1 6/2006 Baxter et al.
    2006/0184521 A1 8/2006 Ponte
    2006/0218134 A1 * 9/2006 Simske et al. ........... 707/4
    2007/0038625 A1 2/2007 Yang-Stephens et al.

OTHER PUBLICATIONS

    Angelov, S., Harb, B., Kannan, S., and Wang, L., "Weighted Isotonic Regression under the L1 Norm," in: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, FL, 2006, pp. 783-791, http://portal.acm.org/citation.cfm?id=1109643.

Primary Examiner: Noosha Arjomandi
(74) Attorney, Agent, or Firm: Fish & Richardson P.C.

(57) ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying resources using scores from multiple classifiers. In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of identifying a collection of documents to classify; receiving a plurality of classifiers for scoring a document with respect to a specified property; for each document in the collection, applying each of the plurality of classifiers, each classifier generating a score associated with a likelihood that the document has the specified property; combining the scores from each classifier, including applying a multiple classifier model that uses monotonic regression to combine the plurality of classifiers; and classifying the document as having the specified property based on the combined score.

20 Claims, 6 Drawing Sheets

US 9,104,972 B1
Page 2

(56) References Cited (continued)

U.S. PATENT DOCUMENTS

    2007/0112756 A1 5/2007 Wen et al.
    2007/0150472 A1 6/2007 Cao et al.
    2007/0198447 A1 * 8/2007 Tilford et al. .......... 706/20
    2008/0059448 A1 3/2008 Chang et al.
    2009/0116756 A1 5/2009 Neogi et al.

OTHER PUBLICATIONS

    Cormen, T., Leiserson, C., Rivest, R., and Stein, C., "Introduction to Algorithms," Chapter 26 (2nd ed., MIT Press and McGraw-Hill, 2001), 58 pages.
    Goldberg, A., and Rao, S., "Length Functions for Flow Computations," Technical Report #97-055, NEC Research Institute, Inc., Aug. 1997, 19 pages.
    Maxwell, W., and Muckstadt, J., "Establishing Consistent and Realistic Reorder Intervals in Production-Distribution Systems," Operations Research, vol. 33, no. 6, Nov.-Dec. 1985, pp. 1316-1341.
    van Leeuwen, J., "Graph Algorithms, 1.4. Transitive Reduction and Transitive Closure," in: Handbook of Theoretical Computer Science, vol. A: Algorithms and Complexity, Elsevier/MIT Press, 1994, 10 pages.
    Zadrozny, B., and Elkan, C., "Transforming Classifier Scores into Accurate Multiclass Probability Estimates," Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002, pp. 694-699.

* cited by examiner
[FIG. 1 (Sheet 1 of 6), flowchart of example method 100: Receive a document (102); Identify multiple classifiers for a particular document property (104); Apply classifiers to document (106); Use multi-classifier model to identify a combined score for the document (108); Classify document based on score (110); If document is classified as having document property, add document to a list (112); Use list for information retrieval (114).]
[FIG. 2 (Sheet 2 of 6), flowchart of example method 200: Identify group of documents to be classified with respect to a specific property (202); Calculate scores for documents using multiple classifiers (204); Identify subset of group satisfying group parameters as training documents (206); Generate model for group of documents using training documents (208).]
[FIG. 3 (Sheet 3 of 6), flowchart of example method 300: Identify group of documents and associated classifier scores (302); Linearize probability distribution for documents (304); Bucket documents based on linear probability distribution (306); Iterate to satisfy constraints on training documents (308); Select training documents according to bucket (310).]
[FIG. 4 (Sheet 4 of 6), flowchart of example method 400: Receive search query from a client (402); Receive search results (404); Restricted search? (406); if not, provide search results to client (408); if so, filter results using list (410) and provide filtered search results to client (412).]
[FIG. 5 (Sheet 5 of 6), block diagram of an example search system: a client device 504 (with RAM 506 and processor 508) used by a user 502 exchanges a query 510 and search results 528 with a search system 514, which includes a search engine 530 (with an indexing engine 520 and a ranking engine) and an index database 522.]
[FIG. 6 (Sheet 6 of 6): schematic diagram of an example system.]
CLASSIFYING DOCUMENTS USING MULTIPLE CLASSIFIERS

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of, and claims priority to, pending U.S. patent application Ser. No. 12/404,089, filed on Mar. 13, 2009, entitled "Classifying Documents Using Multiple Classifiers," the entire contents of which are herein incorporated by reference.

BACKGROUND

This specification relates to classifying documents using scores from multiple classifiers.

Documents (e.g., Web pages or Web sites) can be classified according to one or more document properties. These classified documents can then be treated differently, for example, by a search engine or other information retrieval techniques. For example, a document property can be content of a special topic of interest, either because the topic is particularly desirable (e.g., financial sites would like to show detailed information about companies' business performance) or because the topic is undesirable (e.g., pornographic content ("porn") or depictions of violence may be undesired in particular circumstances). Undesired documents can be filtered out from search results while desirable documents can be shown with a preference over documents having uncertain or different topics.

Documents can be classified according to different techniques. For example, human raters can be used to manually classify documents as having a specified property. While highly accurate, this is very time consuming for large numbers of documents (e.g., a collection of Web documents). Alternatively, automatic classifiers can flag documents as likely having the particular property. Typically, the classifiers examine the documents for particular types of content, for example, images or text. However, conventional automatic classifiers often do not provide a likelihood that a document has the specified property with a confidence level sufficient to allow automatic actions. In particular, if there are classification systems on both the level of Web pages and Web sites, an action on the site level would affect all pages of that site, so an action on the site level has to have a very high confidence. If the Web site as a whole cannot be classified with high confidence, it may be preferable to classify the individual pages based on their individual content. In general this is more difficult because there is less information upon which to base the classification.

SUMMARY

This specification describes technologies relating to classifying documents using multiple classifiers.

In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of selecting a collection of documents to classify; selecting multiple classifiers for scoring a document with respect to a specified property; for each document in the collection, applying each of the multiple classifiers, each classifier generating a score associated with a likelihood that the document has the specified property; combining the scores from each classifier, including applying a multiple classifier model that uses monotonic regression to combine the multiple classifiers; and classifying the document as having the specified property based on the combined score. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
These and other embodiments can optionally include one or more of the following features. The method further includes generating a list of documents including the collection classified as having the specified property. The method further includes receiving a search query; identifying resources responsive to the search query; generating initial search results identifying the resources responsive to the search query; filtering the initial search results based on resources corresponding to entries in the list to produce filtered search results; and presenting filtered search results in response to the received search query. Filtering includes removing search results corresponding to entries in the list. Filtering includes removing search results that do not match entries in the list.

The method further includes generating the probability model including: identifying a group of documents to be classified with respect to the specified property; calculating scores for each document of the group of documents using the multiple classifiers; identifying a training group of documents from the group of documents; determining whether each document in the training group of documents has the specified property; and generating the multiple classifier model using the training group of documents, the generating including calculating a monotonic regression from the maximum likelihood estimate.

Identifying the training group of documents further includes creating a partition of a value set for each classifier into smaller intervals; assigning each document to a bucket based on intervals in which the classifier outputs lie; iterating bucket assignments to satisfy one or more constraints on the group of training documents; and selecting the group of training documents according to bucket.

Each document of the training group of documents is rated by a human rater with respect to the specified property. Generating the multiple classifier model further includes using specific classifier scores from the training group of documents to calculate the monotonic regression that maximizes the likelihood for multiple classifier scores identifying a particular outcome probability that a given document has the specified property. The multiple classifier model uses a transitive reduction of monotonicity constraints. The method further includes assigning a probability to the combined score for the document, where classifying the document includes comparing the probability with a threshold value and, when the score is greater than the threshold, classifying the document as having the specified property. Combining the scores includes using the individual classifier scores from a training group of documents to identify n training documents having scores monotonically below the document and using the scores of those n training documents to calculate the combined score.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Documents are classified with a high confidence. While a seed set of documents is classified by human raters to train the classification system, once the system is built, documents can be classified without evaluation by human raters. The classification process gives each document a probability that the document has the desired property. Thus, users of this classification scheme can perform different actions on classified documents based on the level of confidence they need. The result can be provided as a probability such that the result has an intuitive meaning and it is easy for users to specify a desired threshold level. Providing the result in the form of a probability also eliminates output calibration processes.

New individual classifiers can be added and previously used classifiers can be disabled without affecting the scale of the output. The classification is based on scores (numbers) output by different classifiers, but these scores do not need to be comparable to each other or normalized in a certain way. Instead, the only required information is that, for a given classifier, one document is more likely to have the property than another document. The classification is computationally feasible even for large seed sets (e.g., a model based on 100,000 seed documents can be computed on a single PC in minutes). Combining classifiers with monotonic regression can increase precision over other multiple classifier techniques.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart showing an example method for classifying documents using scores from multiple classifiers.
FIG. 2 is a flowchart showing an example method for modeling combined classifiers to provide an output probability result.
FIG. 3 is a flowchart showing an example method of selecting a group of training documents.
FIG. 4 is a flowchart showing an example method of providing search results.
FIG. 5 shows a block diagram of an example search system.
FIG. 6 is a schematic diagram of an example system.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1 is a flowchart showing an example method 100 for classifying documents using scores from multiple classifiers. For convenience, the method 100 will be described with respect to a system that performs the method 100.

The system receives 102 a document. In particular, the system can receive the document from a collection of documents to be classified. For example, the document can be a Web document from a collection of unclassified Web documents. Alternatively, in some implementations, the document includes all of the resources for a particular Web site (e.g., individual Web pages, images, and other multimedia content of the Web site).

The system can classify each document as having one or more properties. For clarity, the method 100 will be described with respect to classifying a single document relative to a specific property. For example, the property can be whether or not the document includes financial information or whether or not the document includes porn content. If the property is porn, the document is classified as either being porn or not porn (e.g., a Boolean classification is made for the document as either being porn or not being porn).

Other document properties can be the subject of classification. For example, documents can be classified for different document topics including sports, research topics, commercial sites, and celebrity gossip.

The system identifies multiple classifiers 104 for the specified document property. In particular, if the specific property being classified is porn, the system identifies multiple porn classifiers. Different classifiers can be used to determine a score indicating whether a document has the specified property based on an analysis of a particular type of content of the document. For example, a porn classifier can include text classifiers and image classifiers. The text classifier examines document text content to determine a likelihood that the document is porn while the image classifier examines document image content to determine a likelihood that the document is porn. In some implementations, the score is a probability while in other implementations the score is a numerical value (e.g., a number of words identified by the classifier). In some implementations, the system first identifies the types of content within the document and then identifies the corresponding classifiers. For example, the system can determine whether the document contains text before identifying a text classifier as one of the multiple classifiers for the document.

There are numerous ways to construct a classifier for a given type of content. For example, a text classifier can assign numbers to key words and add up all those numbers that occur in a given document. These numbers are positive for words that are likely to occur in the texts of documents associated with the property for classification. In some implementations, there are also negative values for words that are unlikely to occur in texts having the document property. The text classifier then sums all the numbers that correspond to words in the text of the document to be classified. If the sum is greater than a specified threshold, the text classifier indicates that the document is likely to have the property being classified. In general, the higher the number, the more likely it is that the document text is associated with the property being classified. Text classifiers can be applied to different parts of a document; for example, for a Web page, different text classifiers can be applied to the title of the Web page, the actual text content of the Web page, and the URL of the Web page.
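A minimal sketch of such a keyword-sum text classifier follows. The word weights, the threshold, and the whitespace tokenization are invented for the example; a real classifier would curate or learn its weights:

    FINANCE_WEIGHTS = {            # hypothetical keyword weights
        "dividend": 3.0, "equity": 2.5, "portfolio": 2.0,
        "recipe": -2.0,            # negative value: unlikely in financial text
    }

    def text_score(text, weights=FINANCE_WEIGHTS):
        # Sum the weights of all known key words occurring in the text;
        # a higher sum means the text is more likely to have the property.
        return sum(weights.get(word, 0.0) for word in text.lower().split())

    def likely_has_property(text, threshold=4.0):
        return text_score(text) > threshold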
The system applies 106 each of the multiple classifiers to the document. Applying each classifier to the document (e.g., an image classifier and a text classifier) provides a particular result ("score") for each classifier indicating a likelihood as to whether, taken alone, the document is likely to have the specified property. For example, for a porn classifier, each result indicates the likelihood that the document is porn.

The system uses 108 a multiple classifier model to identify a combined score, which can be interpreted as a probability, for the document based on the individual classifier results. The multiple classifier model is described in greater detail below with respect to FIG. 2. The multiple classifier model uses the scores for training documents to identify the score for an unknown input document. Identifying the combined score using the multiple classifier model for the document is also described in greater detail below.

The system classifies 110 the document based on the combined score. In particular, a threshold value can be specified. When the combined score is at or above the threshold value (indicating a specified level of confidence), the document is automatically classified as having the specified property (e.g., porn). When the combined score is below the specified threshold, the document is automatically classified as not having the specified property (e.g., not porn). For example, a threshold of 50% probability can be set for classifying documents as having the specified property.

Alternatively, in some implementations, both a high threshold and a low threshold are used. Documents with combined scores at or above the high threshold (e.g., 98%) are automatically classified as having the specified property. Documents with a combined score at or below the low threshold (e.g., 10%) are automatically classified as not having the specified property. However, documents with a combined score between the low threshold and the high threshold are sent to one or more human raters for classification.
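The two-threshold decision can be expressed directly; the 98% and 10% values below mirror the examples above and are otherwise arbitrary:

    def classify(combined_score, high=0.98, low=0.10):
        # Automatic classification only at the confident extremes; documents
        # in between are routed to human raters, as described above.
        if combined_score >= high:
            return "has property"
        if combined_score <= low:
            return "does not have property"
        return "send to human raters"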
If the document is classified as having the property, the system adds 112 the document to a list. In some implementations, the list is a blacklist. When the particular document property being classified is a property used to filter information retrieval in a restricted mode (e.g., a safe search), the document can be added to the blacklist so that information retrieval results (e.g., Web search results) can be filtered. For example, documents classified as porn can be added to the blacklist so that a user search under a safe search mode can have search results filtered according to the blacklist such that no porn documents are identified in the search results. The blacklist can identify all resources associated with the document (e.g., web pages and images from a particular web site corresponding to the document). The blacklist is stored for future use (e.g., in filtering search results).

The system can add the document to a type of list other than a blacklist. In some implementations, the system adds the documents to a list for presentation. For example, the document property being classified can be a particular topic. Documents associated with the topic can be added to the list. Identifiers for one or more of these documents can be presented to a user (e.g., as links to the corresponding documents). For example, the property being classified can be "finance" and one or more of the listed finance documents can be identified in response to a particular user request.

The system uses 114 the list in information retrieval. For example, as described above, the blacklist can be used to filter search results to eliminate search results identifying resources associated with a blacklisted document. An example of searching is described below with respect to FIG. 4.

FIG. 2 is a flowchart showing an example method 200 for modeling combined classifiers to provide an output score for a document. For convenience, the method 200 will be described with respect to a system that performs the method 200.

The system identifies 202 a group of documents to be classified. The group of documents can include a collection of Web sites to be classified as having the specified property. For example, the group of documents can be Web sites to be classified as porn or not porn.

The system calculates 204 scores for documents using multiple classifiers. The system uses multiple classifiers that indicate whether a document is likely to have the specified property. The classifiers can examine different content of the documents, for example, text or image content of each document. Each classifier generates a score for the document that indicates the likelihood that the document has the property. Generally, a higher score indicates a higher likelihood that the document has the property.
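The scoring step produces one row of scores per document, as in the following illustrative sketch (the classifier callables are hypothetical):

    def score_documents(documents, classifiers):
        # Row i holds (s_1(i), ..., s_d(i)): one score per classifier, where
        # a higher score indicates a higher likelihood of the property.
        return [[classify(doc) for classify in classifiers] for doc in documents]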
The system selects 206 a subgroup of documents of the group of documents as a group of training documents. For example, the subgroup of documents can be a group of documents with known classifier scores and known determinations as to whether the respective documents of the subgroup have the specified property of interest (e.g., as a result of human raters determining whether each document of the group of training documents has the specified property of interest). In some implementations, the subgroup of documents includes documents that have been rated by human raters to determine whether or not the document has the specified property.

In some implementations, each training document is selected such that the group of training documents satisfies particular parameters. FIG. 3 is a flowchart showing an example method 300 of selecting a group of training documents. For convenience, the method 300 will be described with respect to a system that performs the method 300.

The system identifies 302 a collection of documents to be classified and associated classifier scores. The system generates 304 a linearized probability distribution for the identified documents based on their classifier scores. The scores for each document can be normalized before the probability distribution is linearized. The system assigns 306 each document to a particular bucket based on the linearized probability distribution. In some implementations, the number of buckets is chosen such that each bucket only contains a small number of documents (e.g., less than 10). The system iterates 308 a number of buckets and a number of selected training documents until particular parameters are satisfied, for example, such that a specified number of N documents are chosen for the training group and that the ratio of documents having the property of interest is close to 50% (including manual rating of a sample of the documents). The system selects 310 training documents from each bucket after the parameters have been satisfied. For example, an equal number of documents can be selected from each bucket to reach a total of N documents. A more detailed description of the process for selecting the group of training documents is provided below.

As shown in FIG. 2, the system generates 208 a multiple classifier model for combining classifiers to generate combined results for the training group documents corresponding to the known results. Generating the multiple classifier model includes determining a maximum likelihood probability function for the set of classifiers. The multiple classifier model uses the specific scores from the training set to calculate a monotonic regression that maximizes the likelihood for multiple classifier scores identifying a particular outcome probability that a given document has the specified property. Generating the multiple classifier model is described in greater detail below.

FIG. 4 is a flowchart showing an example method 400 of providing filtered search results using a list based on an identified property of interest. For convenience, the method 400 will be described with respect to a system (e.g., a search system) that performs the method 400.

The system receives 402 a search query. For example, the system can receive the search query from a user of a client device. The user can input the search query to a search interface displayed on a client browser. The system provides the query to a search engine, either locally or remotely, that identifies resources responsive to the search query.

The system receives 404 search results responsive to the search query. For example, the system can receive a set of search results from the search engine. The set of search results includes identifiers for the resources responsive to the search query. In some implementations, the search results further include a link to each identified resource and snippets of content from the respective resource.

The system determines 406 whether a restricted search is being performed. For example, a restricted search based on a specific document property or a restricted "safe search" mode for limiting results based on the property of interest can be selected by a user through the search interface. The safe search mode can filter search results that reference resources identified as having a particular property (e.g., resources associated with documents classified as porn).

When the system determines that a restricted search is not being performed, the system provides 408 the search results to the client for presentation (e.g., as an ordered set of results by the client browser).

When the system determines that a restricted search is being performed, the system filters 410 the search results using a list. The list identifies documents, and therefore their associated resources, used for the restricted search. For example, the list can identify resources associated with documents having a finance property that should be identified for a user in response to a search restricted to finance documents. In another example, if the restricted search is one that should limit results (e.g., porn), the list is used to identify resources that should not be identified in search results (e.g., a blacklist).

For example, the list can identify documents classified as porn. Any resources associated with the list documents can be identified from the list document. The system compares the resources of the search results with the list documents and removes any search results identifying resources associated with the list. For example, if a search result identifies a resource corresponding to an image associated with a Web site identified as porn, the image resource is filtered. The system provides 412 the filtered search results to the client for presentation.
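A sketch of this filtering step, under hypothetical assumptions: each search result carries a resource identifier, site_of maps a result's resource to its owning document or site, and the blacklist is a set of such documents:

    def filter_results(results, blacklist, site_of):
        # Drop any result whose resource belongs to a blacklisted document,
        # e.g., an image hosted on a Web site classified as porn.
        return [r for r in results if site_of(r) not in blacklist]

    def respond(results, restricted, blacklist, site_of):
        if not restricted:
            return results                                        # step 408
        return filter_results(results, blacklist, site_of)        # steps 410-412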
FIG. 5 illustrates an example search system 514 for providing search results relevant to submitted queries as can be implemented in an internet, an intranet, or another client and server environment. The search system 514 is an example of an information retrieval system in which the systems, components, and techniques described below can be implemented.

A user 502 can interact with the search system 514 through a client device 504. For example, the client 504 can be a computer coupled to the search system 514 through a local area network (LAN) or wide area network (WAN), e.g., the Internet. In some implementations, the search system 514 and the client device 504 can be one machine. For example, a user can install a desktop search application on the client device 504. The client device 504 will generally include a random access memory (RAM) 506 and a processor 508.

A user 502 can submit a query 510 to a search engine 530 within the search system 514. When the user 502 submits a query 510, the query 510 is transmitted through a network to the search system 514. The search system 514 can be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network. The search system 514 includes an index database 522 and a search engine 530. The search system 514 responds to the query 510 by generating search results 528, which are transmitted through the network to the client device 504 in a form that can be presented to the user 502 (e.g., a search results web page to be displayed in a web browser running on the client device 504).

When the query 510 is received by the search engine 530, the search engine 530 identifies resources that match the query 510. The search engine 530 may also identify a particular "snippet" or section of each resource that is relevant to the query. The search engine 530 will generally include an indexing engine 520 that indexes resources (e.g., web pages, images, or news articles on the Internet) found in a corpus (e.g., a collection or repository of content), an index database 522 that stores the index information, and a ranking engine 552 (or other software) to rank the resources that match the query 510. The indexing and ranking of the resources can be performed using conventional techniques. The search engine 530 can transmit the search results 528 through the network to the client device 504, for example, for presentation to the user 502.

The search system 514 may also maintain one or more user search histories based on the queries it receives from a user. Generally speaking, a user search history stores a sequence of queries received from a user. User search histories may also include additional information such as which results were selected after a search was performed and how long each selected result was viewed.

In some implementations, the search system 514 includes one or more lists 540. The lists 540 each identify resources, in particular Web pages or Web sites, that have been classified as having a specified property (e.g., porn, finance, sports). Under specified circumstances, for example, when a user has indicated that a restricted search is to be performed, the resources identified by the search engine 530 are compared with the resources identified by the list 540 corresponding to the restricted search. In particular, the search results 528 can be filtered to remove results identifying resources that are not associated with the list 540 or, alternatively, that are associated with the list 540.

Modeling a Maximum Likelihood Probability Function for Multiple Classifiers

For a collection of documents that potentially have a specified property there is a set of d classifiers that each provide a likelihood that a particular document has the property. Each classifier (e.g., a text classifier or an image classifier) provides an output score (a result value represented by a real number). The higher the score, the more likely the document has the specified property.

The classifiers can be applied to a group of documents in the collection to determine scores for those documents, and human raters can determine which documents really have the specified property. Using this input of rated documents, an estimate is then made for the probability function that combines multiple classifiers to provide a combined score for a new input document based on its classifier scores.

Mathematically, for a given probability space X there are d classifier functions S_i: X -> R and a function Y: X -> {0, 1}. Thus, each classifier function S_i provides an output score that is a real number, and the special property that documents can have is modeled as a Boolean result {0, 1} that specifies that the document x either has the specified value ("Y(x) = 1") or does not have the specified value ("Y(x) = 0").

The conditional joint probabilities p(s_1, ..., s_d) = P(Y = 1 | S_1 = s_1, ..., S_d = s_d) are monotonically increasing in each of the d parameters (classifiers 1, ..., d). Each s_i represents a real number (the output of a classifier), for example, a word count in a word classifier function S_i used to calculate a score for the classifier. Therefore, the likelihood that the document is classified as having the property is defined as a function of each individual classifier score. These probabilities are assumed to be monotonically increasing, i.e., the higher the score for each classifier, the more likely that the document has the property being evaluated by the classifier (e.g., for a porn classifier, the higher the classifier score, the greater the likelihood that the document is porn).

The classifiers are applied to a group of N documents. For each document there is a set of parameters corresponding to the classifier functions applied to each document of the N documents, s_1(i), ..., s_d(i) for i = 1, ..., N. Thus, for a first document (i = 1) a parameter from each individual classifier (e.g., s_1(1), s_2(1)) is calculated. Thus, a parameter s_1(1) can be a count from a text classifier and a parameter s_2(1) can be a score from an image classifier for the same document. These parameters are used to determine a classifier score for the document (e.g., s_1(1) and s_2(1)).

For the group of N documents that are human evaluated, a_i = 1 if document i has the special property, and a_i = 0 if the document i does not have the special property. Thus, for each human evaluated document, there are d real values s_k(i) and one value a_i that is either 0 or 1. Using the experimental data for the group of N documents, a maximum likelihood estimation for the conditional probabilities p(s_1, ..., s_d) is derived.

The function p represents an estimate for a probability that a document has the property, given as only information the scores of the d classifiers. Estimating p given the information on a_1, ..., a_N can be calculated using a maximum likelihood calculation to identify the p that gives the highest probability to observe the ratings a_1, ..., a_N that were actually observed (e.g., as previously evaluated by human raters).

If we assume we know the function p, the probability of getting the output with all the observed values of a_i is given by:

    P(a_1, ..., a_N | p) = PROD_{i=1..N} p(s_1(i), ..., s_d(i))^{a_i} * (1 - p(s_1(i), ..., s_d(i)))^{1 - a_i}
Maximum likelihood estimation is a statistical method that can be used to fit a model to the experimental data. To apply this method, we have to find the monotonic function p of d variables that maximizes this probability. This condition only determines the values p(s_1(i), ..., s_d(i)) that appear in that formula. However, by the monotonicity constraint, for each other value s_1, ..., s_d at least an interval in which p(s_1, ..., s_d) must lie can be calculated. If the training set includes at least a specified number of documents, these intervals are small and hence in practice will allow the probability p(s_1, ..., s_d) to be calculated for all new documents with reasonable precision.

The maximum likelihood estimate for the function p can be calculated. Mathematically, maximizing the likelihood is equivalent to minimizing the (-log) of the likelihood, which can be written as:

    -log P(a_1, ..., a_N | p) = -SUM_{i=1..N} [ a_i log(p(s_1(i), ..., s_d(i))) + (1 - a_i) log(1 - p(s_1(i), ..., s_d(i))) ]

To minimize the log of the likelihood, the values of p(s_1(i), ..., s_d(i)) need to be determined. For each document i these can be defined as x_i = p(s_1(i), ..., s_d(i)). Additionally, f_i(x) can be defined as -a_i log(x) - (1 - a_i) log(1 - x), so that each of these functions is either -log(x) or -log(1 - x). Thus, the (-log) likelihood can be rewritten as:

    -log P(a_1, ..., a_N | x_1, ..., x_N) = SUM_{i=1..N} [ -a_i log(x_i) - (1 - a_i) log(1 - x_i) ] = SUM_{i=1..N} f_i(x_i)

These particular functions f_i have their minimum at a_i. Without the monotonicity constraint, minimizing

    SUM_{i=1..N} f_i(x_i)

would be trivial: choose each x_i to be its corresponding a_i. However, the monotonicity constraints need to be considered, namely that x_i <= x_j for certain pairs i, j.

To define the monotonicity constraint, let i <= j if s_1(i) <= s_1(j) AND ... AND s_d(i) <= s_d(j), such that the inequalities must be true for each classifier s_k. V is defined as {1, 2, ..., N} and E is a set of pairs (i, j) such that <= is the transitive reflexive closure of E. In other words, (V, E) is a directed graph and i <= j if and only if i = j or there is a path from i to j in E. Thus, each entry in V is a vertex of the directed graph and represents, e.g., a document from the experiment group. An edge can be drawn from i to j between vertices if the classifier scores indicate that document j is at least as likely to have the specified property as document i.
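Constructing these edges is a direct componentwise comparison, as the following sketch shows (quadratic in the number of documents; the transitive reduction discussed below reduces the edge count):

    def dominance_edges(score_rows):
        # (i, j) is an edge when s_k(i) <= s_k(j) for every classifier k,
        # i.e., document j is at least as likely to have the property as i.
        n = len(score_rows)
        return [(i, j) for i in range(n) for j in range(n)
                if i != j and all(s_i <= s_j for s_i, s_j
                                  in zip(score_rows[i], score_rows[j]))]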
As an example, assume the classifiers identify the property of financial information in documents, for example in Web pages. The system can use two classifiers that assign to each Web page two integers: s_1 = the number of technical financial words in the text and s_2 = the number of links to known sites with financial information. Additionally, there is an assumption that the higher the value of s_1, the more likely the Web page is of interest (i.e., more likely to have the specified property of financial information). And similarly, the higher the value of s_2, the more likely the Web page is of interest.

For example, if a Web page 1 has 2 financial terms and 1 link to financial pages, it will be of interest with a probability p(2, 1) = x_1. If a Web page 2 has 10 financial terms and 3 links to financial sites, such a Web page will be of interest with a probability p(10, 3) = x_2. Since Web page 2 contains both more financial terms and more links to financial sites than Web page 1, the monotonicity assumption gives p(10, 3) >= p(2, 1), or, stated in terms of the "financial probabilities" for the pages: x_2 >= x_1. In the directed graph an edge (i, j) is drawn from Web page 1 to Web page 2 since we know that Web page 2 is at least as likely to be interesting (i.e., have the specified property) as Web page 1.

The monotonicity requirement that establishes that x_i <= x_j for i <= j is equivalent to the requirement that x_i <= x_j for (i, j) in E. As a result, the minimizing problem can be restated for a directed graph (V, E) where for each i in V there is a label a_i in {0, 1}. A vector x with components x_i in [0, 1] for each i in V is identified that minimizes

    SUM_{i in V} f_i(x_i)

under constraints x_i <= x_j for (i, j) in E. Since each function f_i is convex and the constraints are linear, this can be treated as a convex optimization problem. Additionally, the problem has a unique solution because the derivatives of the f_i's are strictly increasing.

A general convex optimization technique can be used to find the minimum. However, the function f is separable and the constraints involve only one or two coordinates. Given this, the minimum can be found by using a more efficient method described in William L. Maxwell and John A. Muckstadt, Establishing Consistent and Realistic Reorder Intervals in Production-Distribution Systems, Operations Research, vol. 33, no. 6, pp. 1316-1341 (1985) for solving special convex optimization problems.

In particular, the approach of Maxwell and Muckstadt reduces the optimization problem to solving a maximum flow problem. Network flows and the maximum flow problem are described, for example, in Chapter 26 of Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein: Introduction to Algorithms (2nd ed., MIT Press and McGraw-Hill 2001). The resulting maximum flow problem can be solved efficiently since the corresponding network can be rescaled to involve only integers below |V| and the system can perform a preprocessing step to reduce the number of edges from O(|V|^2) to an expected number of O(|V| log(|V|)^{d-1}). The preprocessing to reduce the number of edges is described in greater detail below.

Using a max flow algorithm suitable for these conditions, for example, as described by Andrew V. Goldberg and Satish Rao, Length Functions for Flow Computations, Technical Report #97-055, NEC Research Institute, Inc., August 1997, provides an upper bound of O(M^{3/2} log(N^2/M) log(N)), which for N = |V| and M = O(|V| log(|V|)^{d-1}) gives roughly O(N^{3/2} polylog(N)) instead of the larger polynomial bound used in Maxwell and Muckstadt. This speed up obtained by the preprocessing step makes it feasible to run the algorithm on large data sets; e.g., the monotonic regression of 100,000 points can be calculated in minutes on one PC.
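The patented procedure (next) handles this general partial order via minimum cuts. For intuition only, the special case d = 1, where the documents are totally ordered by a single score, is solved by the classic pool-adjacent-violators algorithm. This one-dimensional analogue is not the claimed method, but it yields the same kind of block solution, with each block receiving the empirical probability a/(a + b):

    def pool_adjacent_violators(labels):
        # labels: 0/1 ratings of documents sorted by their single score.
        # Returns a non-decreasing sequence of probabilities, one per input.
        blocks = []                       # each block: [sum of labels, count]
        for label in labels:
            blocks.append([float(label), 1])
            # Merge while the last block's mean is below its predecessor's.
            while len(blocks) > 1 and (blocks[-2][0] * blocks[-1][1]
                                       > blocks[-1][0] * blocks[-2][1]):
                s, c = blocks.pop()
                blocks[-1][0] += s
                blocks[-1][1] += c
        out = []
        for s, c in blocks:
            out.extend([s / c] * c)       # empirical probability of the block
        return out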
The process for minimizing

    SUM_{i in V} f_i(x_i)

with respect to the monotonicity constraints given by the directed graph (V, E), with vertices each labeled as either 0 or 1, to determine the optimal probability x_i for each vertex document i (as derived from Maxwell and Muckstadt) can be written in pseudo code as follows:

    Set active_sets <- {V} and result <- {}.
    while active_sets is not empty do
        Choose I from active_sets and remove it.
        Try to split I into a "left" part L and a "right" part R.
        if it can be split then
            Add L and R to active_sets.
        else
            Add I to result.
        end if
    end while
    for I in result do
        a <- number of points in I with label 1.
        b <- number of points in I with label 0.
        Set the output probability to a/(a + b) for each point in I.
    end for

In this process, the way that I is selected from the active sets does not matter (alternatively, the sets can be assessed in parallel). Each I has some number of points from V, e.g., some number of represented documents. The main part to be specified above is the splitting of a set of points (I). This is given by a minimal cut in the following network:

Let I, a subset of V, consist of a elements with label 1 and b elements with label 0. Then the network has as nodes the elements of I and two additional nodes s and t. The network also has arcs including an arc of capacity b from s to each of the a points with label 1, an arc of capacity a from each of the b points with label 0 to t, and an arc of infinite capacity from i to j for each pair (i, j) in the intersection of E with I x I.

If the max flow has capacity ab, a corresponding min cut separates s from all other points or t from all other points, so it does not give a non-trivial decomposition of I. As a result, the set cannot be split. Otherwise a min cut gives a decomposition I union {s, t} = (L union {t}) union (R union {s}) that determines the sets L and R used above.

As shown in the pseudocode above, when no more I's selected from V can be split, the result for documents in each I is set to the empirical probability that a document of I has label 1, i.e., to the number of points in I with label 1 divided by the number of all points in I.

Optimization from Preprocessing

Instead of using as E the full set of pairs (i, j) with i <= j, the same result can be achieved for any other E' such that the full set is the transitive closure of E'. Let us denote by E in the following the full set of pairs (i, j) with i <= j. So, depending on how the min cut is performed, it can be faster to first compute a "transitive reduction" of (V, E). In order to define the transitive reduction of (V, E), let i, j from V be equivalent if (i, j) and (j, i) are both in E. This means they have the same vector of scores. Then the transitive reduction of E is the set of all (i, j) in E such that there exists no k in V with (i, k) and (k, j) in E and k not equivalent to i or j. (See J. van Leeuwen: Graph Algorithms, 1.4. Transitive reduction and transitive closure, in: Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity, Elsevier/MIT Press 1994.) This means the point i is either the same as the point j or it is below the point j, and there is no point in between those points. A fast way to compute the transitive reduction of E is as follows:

Order V in lexicographical order of the score d-tuples, remove duplicates, and let t(i) be the index of i in this order. This means:

    t(i) = t(j) if and only if i and j are equivalent, and
    t(i) <= t(j) if (i, j) is in E (but in general we will also have t(i) <= t(j) for other (i, j)).

As pseudo code, the technique to compute the transitive reduction is as follows:

    for each i in V:
        Set result[i] <- {}.
        go through all j with t(j) > t(i) in the order given by t:
            if (i, j) in E then
                for each k in result[i]:
                    if (k, j) in E then
                        continue with next j
                    end if
                next k
                add j to result[i]
            end if
        next j
        for each j != i with t(j) = t(i):
            add j to result[i]
        next j
        Now result[i] contains exactly the j such that (i, j) is in the transitive reduction of E.
    next i
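A direct Python transcription of this procedure, illustrative only (duplicate score tuples are assumed already removed, so no two documents are equivalent):

    def transitive_reduction(scores):
        # scores: list of d-tuples of classifier scores, duplicates removed.
        # Returns reduced[i] = all j with (i, j) in the transitive reduction.
        def in_E(i, j):
            # (i, j) in E exactly when s_k(i) <= s_k(j) for every classifier k.
            return all(a <= b for a, b in zip(scores[i], scores[j]))

        n = len(scores)
        order = sorted(range(n), key=lambda i: scores[i])   # t(i) = rank
        rank = {i: r for r, i in enumerate(order)}
        reduced = [[] for _ in range(n)]
        for i in range(n):
            for j in order:                  # in the order given by t
                if rank[j] <= rank[i] or not in_E(i, j):
                    continue
                # j is redundant if a kept successor k already reaches it.
                if any(in_E(k, j) for k in reduced[i]):
                    continue
                reduced[i].append(j)
        return reduced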
Obtaining the Initial Set of Documents for Constructing the Model

In some implementations, a group of labeled documents to perform the initial model construction is selected from a larger group of documents (the "universe") to be classified. A document is considered labeled if it is known whether it has or has not the desired property P. For example, if each document is a Web site, the universe is a collection of Web sites and some of these Web sites are provided to human raters who determine whether they have the property P. These training documents are then used to build the model. However, extra attention should be paid to the selection of the training documents. Drawing the training documents randomly from the universe will generally not result in a useful group of training documents R due to the following properties of the universe:

Property A.
In most cases, only a small fraction of documents will have the property P. For example, the ratio of financial sites or porn sites to all Web sites is rather small. A good group of training documents should contain a significant number of documents both having and not having the property P. Thus, biased sampling should be used instead of uniform sampling while selecting the documents that form the group of training documents. Generally, a set containing 50% of documents with property P would be good to learn the distinction between P and not P. If we are specifically interested in identifying a set of documents which we are very confident to have property P, then it makes sense to use even more of the documents which are likely to have property P.

Property B.
If the classifiers are good at predicting some property P, the outputs of individual classifiers are highly correlated, since they should predict the same property. This means that, for a random document from the universe, if its score from some individual classifier is high compared to the scores this individual classifier gives to the other documents in the universe, then the other individual classifiers are likely to score this document higher than the other documents. The method of sampling the documents from the universe to form the group of training documents should take the above statement into account and draw documents with various scores in order to achieve better coverage of the score space.

To overcome the issue of the documents in the universe having these properties, a bootstrapping process is used to identify a group of training documents (R), such that using them as a set of documents to build a model from will produce a model that covers the feature space described by the individual classifiers. Additionally, the bootstrapping technique satisfies three constraints:

Constraint 1.
Substantially half of the documents in R should have property P. (To compensate for Property A.)

Constraint 2.
R should contain documents with various combinations of scores from individual classifiers. (To compensate for Property B.)

Constraint 3.
The number of documents in R should be substantially N (a given integer number of documents which is determined, e.g., by the human rater capacity or the cost of having N documents rated).

Given the properties and constraints, the bootstrapping process is formed as follows:

Let S be all documents in the universe. The bootstrapping process chooses the set R as a subset of S that fits the constraints described above.

In some implementations, there are pre-requirements to running the bootstrapping process. In particular, one pre-requirement is that the scores from all classifiers are pre-computed for all documents in S. Another pre-requirement is that there exists some estimate of what ratio of documents from S has the property P, referred to as Q below (for example, coming from previous experience with manually labeling a uniformly sampled subset of the universe). Additionally, another pre-requirement is that N is specified beforehand.

The following procedure is applied to choose R:

Set K to the smallest integer such that K^d >= N. For each classifier, the system determines a small value a_i such that documents with score s_i < a_i are unlikely to have the property P, and a high value b_i such that documents with score s_i > b_i are unlikely not to have property P, and a middle score m_i such that the ratio of documents with score greater than m_i is Q (since Q is the ratio of documents with property P, this means there are as many documents with score greater than m_i that do not have property P as there are documents with score less than m_i that have property P). Divide the intervals [a_i, b_i] into K smaller intervals such that:

    If K is even: m_i is the highest point of the lower K/2 intervals and the lowest point of the higher K/2 intervals; and
    If K is odd: m_i is in the middle interval.

This can be done by dividing the intervals [a_i, m_i] and [m_i, b_i] into K/2 equal pieces; whether that is appropriate depends on the classifier and its scores. A different scenario using quantiles instead is described below.

Assume a fixed scheme that gives, for a given K and a set of m_i, partitions of the intervals [a_i, b_i] into K smaller intervals I_1^(i), I_2^(i), ..., I_K^(i). The system takes all products of the smaller intervals and identifies a document in each of these buckets, i.e., for each sequence (i_1, i_2, ..., i_d) with i_k from {1, 2, ..., K} the system finds a document such that s_1 is in I_{i_1}^(1), s_2 is in I_{i_2}^(2), ..., s_d is in I_{i_d}^(d). This results in a collection of at most K^d documents. In general, many of these buckets can be empty since, e.g., there may be no document that classifier 1 considers almost certain to have property P (s_1 is large), but classifier 2 considers highly likely not to have property P (s_2 is low).

If the number of documents is significantly lower than N, the system can repeat this procedure with higher K. More generally, the number of documents found with this method grows monotonically with K. To this monotonically increasing function the system can apply binary search to find a value of K such that the number of documents found is approximately N. This fulfils constraint 3, and by construction satisfies constraint 2.
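The bucket construction and the binary search over K can be sketched as follows. Here make_boundaries(K) is a hypothetical callable returning, for each classifier, the K - 1 interior cut points of [a_i, b_i] chosen as described above:

    import bisect

    def nonempty_buckets(score_rows, boundaries):
        # Assign each document to the bucket (i_1, ..., i_d) of the intervals
        # containing its scores; keep one representative per non-empty bucket.
        buckets = {}
        for row in score_rows:
            key = tuple(bisect.bisect_left(boundaries[k], s)
                        for k, s in enumerate(row))
            buckets.setdefault(key, row)
        return buckets

    def choose_K(score_rows, make_boundaries, N, K_max=64):
        # The count of non-empty buckets grows monotonically with K, so
        # binary search for the smallest K yielding at least N documents.
        lo, hi = 1, K_max
        while lo < hi:
            K = (lo + hi) // 2
            if len(nonempty_buckets(score_rows, make_boundaries(K))) < N:
                lo = K + 1
            else:
                hi = K
        return lo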
While often constraint 1 will also be fulfilled because of the choice of m_i, the system can adjust the m_i, like K, if necessary. In particular, increasing one or several m_i is expected not to change the number of documents substantially, but should increase the number of documents with property P. If the system is configured to increase all m_i at the same time (e.g., by the same relative amount), the system can again appeal to monotonicity and binary search to find values m_i such that roughly the same number of documents have property P or don't have property P. While an exact determination of the fraction of documents in R with property P would require human rater input, the system can, in practice, often just take a small sample to make sure that constraint 1 is substantially fulfilled.

The above way to construct the smaller score intervals given K and the m_i assumes that dividing a score interval into K equal pieces is a meaningful operation; this may not always be the case, as sometimes only the order relation between scores is meaningful. Another option is to use quantiles instead of scores; they remain invariant under monotonic transformation of the scores (but on the other hand they depend on the set of all documents considered). The quantile corresponding to m_i is F = 1 - Q. Let c_i be such that a_i corresponds to the quantile F/c_i; then from a_i <= m_i we know that c_i > 1, and we can consider the quantile intervals whose endpoints are derived from F and c_i, stepping from the quantile of a_i up to F and correspondingly above F, instead of the intervals given above.
US 9,104,972 B1
15 16
Another variant is using fewer buckets (lower K), but summing up ap, for j=-k,..., -1, 1,..., k with coefficients
allowing a certain number of documents into each bucket. a-0 and divide by the sum of the a. The above is the special
Additionally, manual fine-tuning can be introduced, different case k=1, a -a=1. Distributing the total weight over more
values of K and F can be used for different classifiers. While coefficients will generate a smoother function p, which can
manually looking at the sites, raters can have an idea about increase the accuracy of the estimate. However, such a
what scores are well covered and where improvements can be scheme means that there is an additional parameter to tune.
made to the classification. For example, raters can manually One possible choice for these coefficients would be a exp
raise some K’s by a factor of 1.5 . . . 2 to make sure all the (°/w) for some parameter w. If the estimate should be con
dimensions are covered (e.g., to fulfill Constraint 2). servative, to make Sure a high confidence threshold is satisfied
Missing Scores

Some documents do not have content applicable to all of the multiple classifiers being applied. For example, an image-based classifier would not provide a score on a document that does not include images. In some implementations, the system assigns a default score to classifiers that do not provide an output score. For example, if a system to detect financial sites uses as one signal an image classifier that determines the percentage or number of images that are charts, this classifier could output 0 for documents with no images.
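In code, such a default-score policy is a one-line fallback. The classifier names and values below are invented for the example:

```python
# Scores produced for a page with no images: the image classifier is absent.
scores = {"text_classifier": 0.8}  # hypothetical outputs of other classifiers
# Fall back to a default of 0 when the image classifier gave no score,
# as in the chart-classifier example above.
chart_score = scores.get("chart_image_classifier", 0.0)
```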
Alternatively, in some other implementations, the system generates a modified probabilistic model that omits the classifier at issue from the probability function combining the classifiers. For n classifiers that could not output a classification for some documents (and that cannot use a default value), 2^n - 1 models are built, one for each non-empty subset of these classifiers. To build such a model for a subset S of classifiers, the system uses that subset of the human-rated data that has classifier outputs for all classifiers in S. So while those human-rated documents that have scores for all classifiers would be used for all 2^n - 1 models, for models for a set S the system also uses those documents that do not have scores for all classifiers, but do have scores for all classifiers in S. When classifying a new document using these models, the system determines the set of classifiers that give an output for this document and applies the model corresponding to this set.
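A sketch of that bookkeeping follows. For simplicity it assumes every classifier may fail to produce a score (the text builds subsets only over the n unreliable classifiers), and it assumes train_model stands in for whatever single-model training routine is in use (e.g., the monotonic regression described earlier) and returns a callable model; all names are illustrative.

```python
from itertools import combinations

def train_subset_models(rated_docs, classifier_names, train_model):
    """Build one model per non-empty subset of classifiers.
    rated_docs is a list of (scores, label) pairs, where scores maps
    classifier name -> score and may omit classifiers that produced
    no output for that document."""
    models = {}
    for r in range(1, len(classifier_names) + 1):
        for subset in combinations(sorted(classifier_names), r):
            # Use every rated document that has scores for all the
            # classifiers in this subset, even if it lacks scores for
            # classifiers outside the subset.
            training = [(scores, label) for scores, label in rated_docs
                        if all(name in scores for name in subset)]
            models[frozenset(subset)] = train_model(training, subset)
    return models

def classify(models, scores):
    """Apply the model matching exactly the set of classifiers that
    produced an output (assumes at least one classifier did)."""
    return models[frozenset(scores)](scores)
```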
Applying the Multiple Classifier Model to Determine a Combined Score for a Document

Given the multiple classifier model generated from a set of training documents with scores s_1(i), ..., s_d(i) and the computed probabilities p(i) that a document with these scores has the specified property, there are several ways to determine a probability for a new document with scores s_1, ..., s_d. The techniques use the assumed monotonicity of the probability in all scores s_1, ..., s_d.

Let V_>= be the set of all training documents i that have scores s_k(i) >= s_k for k = 1, 2, ..., d. Arrange the computed probabilities p(i) for i in V_>= in ascending order, and call these p_1, p_2, .... Then by monotonicity the probability p(s_1, ..., s_d) should be lower than p_1. If V_>= is empty, set p_1 = p_2 = ... = 1. More generally, p_i can be set to 1 for i > |V_>=|.
Similarly, let V_<= be the set of all training documents i which have all scores s_k(i) <= s_k for k = 1, 2, ..., d. Arrange the computed probabilities p(i) for i in V_<= in descending order, and call these p_-1, p_-2, .... Then by monotonicity the probability p(s_1, ..., s_d) should be greater than p_-1. If V_<= is empty, set p_-1 = p_-2 = ... = 0. More generally, p_-i can be set to 0 for i > |V_<=|.

This gives, for each new document with given scores, an interval [p_-1, p_1] in which the probability should lie (by construction of the probabilities, p_-1 <= p_1). If there are a specified number of documents in the training set, this interval will be small and already give the sought probability with enough precision.
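The bound computation translates almost directly into code. The sketch below assumes each training document is a (score vector, probability) pair; it is an illustration, not the implementation:

```python
def probability_interval(new_scores, training):
    """Bound p(new_scores) using monotonicity.  training is a list of
    (scores, p) pairs, p being the probability the multiple classifier
    model computed for that combination of scores."""
    d = len(new_scores)
    # V>=: training documents whose scores all dominate the new scores.
    upper = sorted(p for s, p in training
                   if all(s[k] >= new_scores[k] for k in range(d)))
    # V<=: training documents dominated by the new scores.
    lower = sorted((p for s, p in training
                    if all(s[k] <= new_scores[k] for k in range(d))),
                   reverse=True)
    p_1 = upper[0] if upper else 1.0    # smallest probability above
    p_m1 = lower[0] if lower else 0.0   # largest probability below
    return p_m1, p_1

def combined_score(new_scores, training):
    """One possibility from the text: the midpoint of the interval."""
    lo, hi = probability_interval(new_scores, training)
    return (lo + hi) / 2.0
```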
To obtain one number instead of an interval to assign to a new document, there are several possibilities. One technique is to simply use the mean (p_-1 + p_1)/2. More generally, the system can take a linear combination of the probabilities by summing up a_j p_j for j = -k, ..., -1, 1, ..., k with coefficients a_j >= 0 and dividing by the sum of the a_j. The above is the special case k = 1, a_-1 = a_1 = 1. Distributing the total weight over more coefficients will generate a smoother function p, which can increase the accuracy of the estimate. However, such a scheme means that there is an additional parameter to tune. One possible choice for these coefficients would be a_j proportional to exp(-j^2/w) for some parameter w. If the estimate should be conservative, to make sure a high confidence threshold is satisfied in the classification if the computed probability is close to 1, p_-k can be used for some constant k, e.g., k = 20, as below. This is explained in more detail below with respect to detecting documents having a specified property with high confidence.

Detecting Documents Having a Specified Property with High Confidence

As an example, suppose there are two classifiers. The first classifier has scores in the range 0, ..., 5 and the second classifier has scores in the range 0, ..., 9. Table 1 shows the first classifier scores on the y-axis and the second classifier scores on the x-axis. Points in the table represent documents, where a value "P" indicates the document has the property and "N" indicates that the document does not have the property. Additionally, Table 1 includes a document L of the training set and a document X as a new point to be classified.

TABLE 1

    5 |  P  N  P  P  P  P  P
    4 |  N  P  N  P  P  N  P  P
    3 |  P  N  N  N  P  N  P
    2 |  N  N  N  N  N              X
    1 |  N  N  N  P  N  N  N  N
    0 |                          N  L
      +------------------------------
         0  1  2  3  4  5  6  7  8  9

First suppose L is a point labeled "P". Since all points above it also have the property, it will get the combined probability 1.0. (The terms "above" and "below" refer to the monotonicity requirement: for example, point A is lower than point B if they are not the same and if, for all the individual classifiers that provide both A and B with scores, the score for A is not greater than the score for B.) Since X is above L, it will also get the combined probability 1.0. Now suppose L is labeled "N". Almost all of the points below it are "N", so it will get a very low combined probability (e.g., about 0.07). The same is actually true for all points below X, so the value of X should be greater than 0.07 and at most 1.0. However, the actual value that should be assigned to X is unclear. So although X gets a combined probability of 1.0 when L is labeled "P", it is not that certain that X actually has the property (L could be some rater error or other random fluctuation).

A modification can be applied such that the combined probability of the actual point X is not used for determining whether X has the property P. Instead, the n-th highest combined probability among all points that are "below" X (i.e., have all coordinates less than or equal to the coordinates of X) can be used to determine the combined probability. The value of n can vary (e.g., n = 20). Additionally, to determine the value of n for particular purposes, a cross-validation measurement can be used to find n. This n-th score is compared with the threshold. This threshold has less physical meaning, as it cannot be directly translated into a probability. However, in practice, the object is to identify a "top n most suspicious documents", where boosting the documents where there is more confidence (based on lower results for other documents) improves the final results.
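The conservative n-th-highest variant might look like the following sketch. Here combined_probability stands in for whatever per-point combined score the model assigns, n = 20 echoes the example value above, and the zero fallback for sparsely supported points is an assumption of this sketch:

```python
def conservative_score(x_scores, training_points, combined_probability, n=20):
    """Score X by the n-th highest combined probability among all
    training points 'below' X (every coordinate <= X's coordinates),
    instead of by X's own combined probability."""
    d = len(x_scores)
    below = [p for p in training_points
             if all(p[k] <= x_scores[k] for k in range(d))]
    probs = sorted((combined_probability(p) for p in below), reverse=True)
    if len(probs) < n:
        return 0.0  # too little support below X for a confident call
    return probs[n - 1]
```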
Other Applications of Using a Multiple Classifier Model to Identify Documents Having a Particular Topic

Search:
When users search for documents, the system can show predominantly documents that match a certain topic if the system can determine, from circumstances of the search, that the user is likely looking for documents of a particular topic, e.g., if the query is issued from a page with a certain topic (e.g., a financial site, a site on sports, etc.). Additionally, the system can filter documents that are adult oriented if the user activated a filter for such material.

Advertising:
When matching advertising campaigns to Web sites to present the advertisement, the system can suggest Web sites with a topic that matches the advertising campaign, e.g., the system suggests advertising on sports related Web sites for a manufacturer of sports equipment. Alternatively, the system can filter Web sites related to offending topics (pornography, graphic depictions of violence).
Main Document Language as "Topic":
Although the language of a document is not typically considered a "topic", the detection of documents of a certain language has similarities to the detection of documents about a certain topic. In particular, the detection of the main language of a document can also be a problem that multiple classifiers are applied to. For example, the main language can be detected by combining the signals of different classifiers, including: the text of the main content (this can be empty or too short to identify a language; the page could just contain a picture with a name, or pictures and text could be combined in one image; there could be parts in different languages, and it is not easy to determine programmatically which part is the main content), the text of the title (can be ambiguous if the title is short), the text of other documents on the same site (could be ambiguous for multi-language sites), and text in images obtained by OCR (if there are images and OCR is successful in extracting text).

Similarly to topics, users often will only be interested in documents in a particular language or set of languages. These languages can either be given explicitly by the user or be inferred from the interface language or the language of the query or the country from which the query was issued.
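To make the parallel concrete, the per-signal scores for one document might look like the following. The signal names and values are invented for illustration; they would be combined by the same multiple classifier model described above, with absent signals handled as in the "Missing Scores" section:

```python
# Hypothetical signal scores for "main language is German" on one page;
# each value is that signal's confidence.  Signals that produced no
# output are simply absent from the dict.
signals = {
    "main_content_text": 0.92,  # language ID over the extracted body text
    "title_text": 0.40,         # short title, weak evidence
    "same_site_text": 0.85,     # language of other documents on the site
    # "ocr_text" omitted: the page has no images to run OCR on
}
```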
FIG. 6 illustrates an example architecture of a system 600. The system architecture 600 is capable of performing operations for classifying documents using multiple classifiers. The architecture 600 includes one or more processors 602 (e.g., IBM PowerPC, Intel Pentium 4, etc.), one or more display devices 604 (e.g., CRT, LCD), graphics processing units 606 (e.g., NVIDIA GeForce, etc.), a network interface 608 (e.g., Ethernet, FireWire, USB, etc.), input devices 610 (e.g., keyboard, mouse, etc.), and one or more computer-readable mediums 612. These components exchange communications and data using one or more buses 614 (e.g., EISA, PCI, PCI Express, etc.).

The term "computer-readable medium" refers to any medium that participates in providing instructions to a processor 602 for execution. The computer-readable medium 612 further includes an operating system 616 (e.g., Mac OS®, Windows®, Linux, etc.), a network communication module 618, a multiple classifier 620, and other applications 624.
The operating system 616 can be multi-user, multiprocessing, multitasking, multithreading, real-time and the like. The operating system 616 performs basic tasks, including but not limited to: recognizing input from input devices 610; sending output to display devices 604; keeping track of files and directories on computer-readable mediums 612 (e.g., memory or a storage device); controlling peripheral devices (e.g., disk drives, printers, etc.); and managing traffic on one or more buses 614. The network communications module 618 includes various components for establishing and maintaining network connections (e.g., software for implementing communication protocols, such as TCP/IP, HTTP, Ethernet, etc.).

The multiple classifier 620 provides various software components for performing the various functions for combining classifier scores into a single classifier score, as described with respect to FIGS. 1-4.

Embodiments of the subject matter and the operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them.

The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or combinations of them. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, e.g., a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic, magneto-optical disks, or optical disks). However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few. Devices suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data (e.g., an HTML page) to a client device (e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device). Data generated at the client device (e.g., a result of the user interaction) can be received from the client device at the server.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

What is claimed is:
1. A computer-implemented method comprising:
computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold a_j wherein documents having a score less than a_j are unlikely to have the property P, and wherein each classifier has a respective upper threshold b_j wherein documents having a score greater than b_j are likely to have the property P;
determining, for each respective classifier, a plurality of intervals between a_j and b_j for the classifier;
determining, for each document in the collection of documents, a combination of intervals I_1 to I_D according to which interval of the plurality of intervals each respective score for the document belongs;
determining, for each combination of intervals I_1 to I_D, whether any documents in the collection of documents have a corresponding combination of intervals I_1' to I_D';
selecting no more than M documents for each combination of intervals I_1' to I_D' for which at least one document in the collection has the corresponding combination of intervals; and
training a multiple classifier model for the D distinct classifiers using each selected document.
2. The method of claim 1, wherein determining, for each respective classifier, a plurality of intervals between a_j and b_j for the classifier comprises determining K intervals for each distinct classifier.
3. The method of claim 2, wherein selecting no more than M documents for each combination of intervals comprises selecting no more than N total documents, wherein N is an upper bound on a number of documents selected.
4. The method of claim 3, further comprising:
determining K such that K is the smallest integer that satisfies K^d >= N.
5. The method of claim 4, further comprising:
determining, for each classifier, a middle score m_j, wherein determining a plurality of intervals between a_j and b_j for each classifier comprises determining K/2 intervals between a_j and m_j and K/2 intervals between m_j and b_j.
6. The method of claim 5, wherein determining, for each classifier, the middle score m_j comprises determining the middle score m_j such that a ratio of documents having a score greater than m_j is a ratio of documents having the property P.
7. The method of claim 1, wherein substantially half of the selected documents have the property P.
8. The method of claim 1, further comprising:
providing each selected document to one or more human raters; and
receiving, from the one or more human raters, an indication of whether or not each document has the property P.
9. The method of claim 1, wherein training the multiple classifier model comprises training a monotonic regression model.
10. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold a_j wherein documents having a score less than a_j are unlikely to have the property P, and wherein each classifier has a respective upper threshold b_j wherein documents having a score greater than b_j are likely to have the property P;
determining, for each respective classifier, a plurality of intervals between a_j and b_j for the classifier;
determining, for each document in the collection of documents, a combination of intervals I_1 to I_D according to which interval of the plurality of intervals each respective score for the document belongs;
determining, for each combination of intervals I_1 to I_D, whether any documents in the collection of documents have a corresponding combination of intervals I_1' to I_D';
selecting no more than M documents for each combination of intervals I_1' to I_D' for which at least one document in the collection has the corresponding combination of intervals; and
training a multiple classifier model for the D distinct classifiers using each selected document.
11. The system of claim 10, wherein determining, for each respective classifier, a plurality of intervals between a_j and b_j for the classifier comprises determining K intervals for each distinct classifier.
12. The system of claim 11, wherein selecting no more than M documents for each combination of intervals comprises selecting no more than N total documents, wherein N is an upper bound on a number of documents selected.
13. The system of claim 12, wherein the operations further comprise:
determining K such that K is the smallest integer that satisfies K^d >= N.
14. The system of claim 13, wherein the operations further comprise:
determining, for each classifier, a middle score m_j, wherein determining a plurality of intervals between a_j and b_j for each classifier comprises determining K/2 intervals between a_j and m_j and K/2 intervals between m_j and b_j.
15. The system of claim 14, wherein determining, for each classifier, the middle score m_j comprises determining the middle score m_j such that a ratio of documents having a score greater than m_j is a ratio of documents having the property P.
16. The system of claim 10, wherein substantially half of the selected documents have the property P.
17. The system of claim 10, wherein the operations further comprise:
providing each selected document to one or more human raters; and
receiving, from the one or more human raters, an indication of whether or not each document has the property P.
18. The system of claim 10, wherein training the multiple classifier model comprises training a monotonic regression model.
19. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold a_j wherein documents having a score less than a_j are unlikely to have the property P, and wherein each classifier has a respective upper threshold b_j wherein documents having a score greater than b_j are likely to have the property P;
determining, for each respective classifier, a plurality of intervals between a_j and b_j for the classifier;
determining, for each document in the collection of documents, a combination of intervals I_1 to I_D according to which interval of the plurality of intervals each respective score for the document belongs;
determining, for each combination of intervals I_1 to I_D, whether any documents in the collection of documents have a corresponding combination of intervals I_1' to I_D';
selecting no more than M documents for each combination of intervals I_1' to I_D' for which at least one document in the collection has the corresponding combination of intervals; and
training a multiple classifier model for the D distinct classifiers using each selected document.
20. The computer program product of claim 19, wherein determining, for each respective classifier, a plurality of intervals between a_j and b_j for the classifier comprises determining K intervals for each distinct classifier.

* * * * *
