Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai and Sean Massung

Text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (and are thus relatively easy for computers to handle), text has less explicit structure, requiring computer processing to understand the content encoded in text. The current technology of natural language processing has not yet reached a point where a computer can precisely understand natural language text, but a wide range of statistical and heuristic approaches to the management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed, and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems, which include search engines and recommender systems; they assist users in finding from a large collection of text data the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they assist users in analyzing patterns in text data to extract and discover useful actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users.
ACM Books
Editor in Chief
M. Tamer Özsu, University of Waterloo
ACM Books is a new series of high-quality books for the computer science community,
published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books
publications are widely distributed in both print and digital formats through booksellers
and to libraries (and library consortia) and individual ACM members via the ACM Digital
Library platform.
ChengXiang Zhai
University of Illinois at Urbana–Champaign
Sean Massung
University of Illinois at Urbana–Champaign
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means—electronic, mechanical, photocopy,
recording, or any other except for brief quotations in printed reviews—without the prior
permission of the publisher.
Designations used by companies to distinguish their products are often claimed as
trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware
of a claim, the product names appear in initial capital or all capital letters. Readers, however,
should contact the appropriate companies for more complete information regarding
trademarks and registration.
First Edition
10 9 8 7 6 5 4 3 2 1
To Mei and Alex
To Kai
Contents
Preface xv
Acknowledgments xviii
Chapter 1 Introduction 3
1.1 Functions of Text Information Systems 7
1.2 Conceptual Framework for Text Information Systems 10
1.3 Organization of the Book 13
1.4 How to Use this Book 15
Bibliographic Notes and Further Reading 18
Chapter 2 Background 21
2.1 Basics of Probability and Statistics 21
2.2 Information Theory 31
2.3 Machine Learning 34
Bibliographic Notes and Further Reading 36
Exercises 37
Chapter 4 META: A Unified Toolkit for Text Data Management and Analysis 57
4.1 Design Philosophy 58
4.2 Setting up META 59
4.3 Architecture 60
4.4 Tokenization with META 61
4.5 Related Toolkits 64
Exercises 65
Chapter 20 Toward A Unified System for Text Management and Analysis 445
20.1 Text Analysis Operators 448
20.2 System Architecture 452
20.3 META as a Unified System 453
References 477
Index 489
Authors’ Biographies 509
Preface
directly expressed in text data. For example, it is now the norm for people to tap into
opinionated text data such as product reviews, forum discussions, and social media
text to obtain opinions. Once again, due to the overwhelming amount of informa-
tion, people need intelligent software tools to help discover relevant knowledge for
optimizing decisions or helping them complete their tasks more efficiently. While
the technology for supporting text mining is not yet as mature as search engines
for supporting text access, significant progress has been made in this area in re-
cent years, and specialized text mining tools have now been widely used in many
application domains. The subtitle of this book suggests that we cover two major
topics, information retrieval and text mining. These two topics roughly correspond
to the techniques needed to build the two types of application systems discussed
above (i.e., search engines and text analytics systems), although the separation of
the two is mostly artificial and only meant to help provide a high-level structure for
the book, and a sophisticated application system likely would use many techniques
from both topic areas.
In contrast to structured data, which conform to well-defined schemas and are
thus relatively easy for computers to handle, text has less explicit structure so
the development of intelligent software tools discussed above requires computer
processing to understand the content encoded in text. The current technology of
natural language processing has not yet reached a point to enable a computer to pre-
cisely understand natural language text (a main reason why humans often should
be involved in the loop), but a wide range of statistical and heuristic approaches
to management and analysis of text data have been developed over the past few
decades. They are usually very robust and can be applied to analyze and manage
text data in any natural language, and about any topic. This book intends to provide
a systematic introduction to many of these approaches, with an emphasis on cov-
ering the most useful knowledge and skills required to build a variety of practically
useful text information systems.
This book is primarily based on the materials that the authors have used for
teaching a course on the topic of text data management and analysis (i.e., CS410
Text Information Systems) at the University of Illinois at Urbana–Champaign, as
well as the two Massive Open Online Courses (MOOCs) on “Text Retrieval and
Search Engines” and “Text Mining and Analytics” taught by the first author on
Coursera in 2015. Most of the materials in the book directly match those of these
two MOOCs, with a similar structure of topics. As such, the book can be used as
a main reference book for either of these two MOOCs.
Information Retrieval (IR) is a relatively mature field and there is no shortage
of good textbooks on IR; for example, the most recent ones include Modern
Information Retrieval: The Concepts and Technology behind Search by Baeza-Yates
are expected to have basic knowledge about computer science, particularly data
structures and programming languages and be comfortable with some basic con-
cepts in probability and statistics such as conditional probability and parameter
estimation. Readers who do not have this background may still be able to follow
the basic ideas of most of the algorithms discussed in the book; they can also ac-
quire the needed background by carefully studying Chapter 2 of the book and, if
necessary, reading some of the references mentioned in the Bibliographical Notes
section of that chapter to have a solid understanding of all the major concepts men-
tioned therein. META can be used by anyone to easily experiment with algorithms
and build applications, but modifying it or extending it would require at least some
basic knowledge of C++ programming.
The book can be used as a textbook for an upper-level undergraduate course on
information retrieval and text mining or a reference book for a graduate course to
cover practical aspects of information retrieval and text mining. It should also be
useful to practitioners in industry to help them acquire a wide range of practical
techniques for managing and analyzing text data that they can use immediately to
build various interesting real-world applications.
Acknowledgments
This book is the result of many people’s help. First and foremost, we want to express
our sincere thanks to Edward A. Fox for his invitation to write this book for the
ACM Book Series in the area of Information Retrieval and Digital Libraries, of
which he is the Area Editor. We are also grateful to Tamer Özsu, Editor-in-Chief of
ACM Books, for his support and useful comments on the book proposal. Without
their encouragement and support this book would not have been possible. Next,
we are deeply indebted to Edward A. Fox, Donna Harman, Bing Liu, and Jimmy
Lin for thoroughly reviewing the initial draft of the book and providing very useful
feedback and constructive suggestions. While we were not able to fully implement
all their suggestions, all their reviews were extremely helpful and led to significant
improvement of the quality of the book in many ways; naturally, any remaining
errors in the book are solely the responsibility of the authors.
Throughout the process of writing the book, we received strong support and
great help from Diane Cerra, Executive Editor at Morgan & Claypool Publishers,
whose regular reminders and always timely support kept us from taking "forever"
to finish the book; for this, we are truly
grateful to her. In addition, we would like to thank Sara Kreisman for copyediting
and Paul C. Anagnostopoulos and his production team at Windfall Software (Ted
Laux, Laurel Muller, MaryEllen Oliver, and Jacqui Scarlott) for their great help with
indexing, illustrations, art proofreading, and composition, which ensured a fast
and smooth production of the book.
The content of the book and our understanding of the topics covered in the book
have benefited from many discussions and interactions with a large number of
people in both the research community and industry. Due to space limitations,
we can only mention some of them here (and have to apologize to many whose
names are not mentioned): James Allan, Charu Aggarwal, Ricardo Baeza-Yates,
Nicholas J. Belkin, Andrei Broder, Jamie Callan, Jaime Carbonell, Kevin C. Chang,
Yi Chang, Charlie Clarke, Fabio Crestani, W. Bruce Croft, Maarten de Rijke, Arjen
de Vries, Daniel Diermeier, AnHai Doan, Susan Dumais, David A. Evans, Edward A.
Fox, Ophir Frieder, Norbert Fuhr, Evgeniy Gabrilovich, C. Lee Giles, David Gross-
man, Jiawei Han, Donna Harman, Marti Hearst, Jimmy Huang, Rong Jin, Thorsten
Joachims, Paul Kantor, David Karger, Diane Kelly, Ravi Kumar, Oren Kurland, John
Lafferty, Victor Lavrenko, Lillian Lee, David Lewis, Jimmy Lin, Bing Liu, Wei-Ying
Ma, Christopher Manning, Gary Marchionini, Andrew McCallum, Alistair Moffat,
Jian-Yun Nie, Douglas Oard, Dragomir R. Radev, Prabhakar Raghavan, Stephen
Robertson, Roni Rosenfeld, Dan Roth, Mark Sanderson, Bruce Schatz, Fabrizio Se-
bastiani, Amit Singhal, Keith van Rijsbergen, Luo Si, Noah Smith, Padhraic Smyth,
Andrew Tomkins, Ellen Voorhees, Yiming Yang, Yi Zhang, and Justin Zobel. We
want to thank all of them for their indirect contributions to this book. Some ma-
terials in the book, especially those in Chapter 19, are based on the research work
done by many Ph.D. graduates of the Text Information Management and Analysis
(TIMAN) group at the University of Illinois at Urbana–Champaign, under the super-
vision of the first author. We are grateful to all of them, including Tao Tao, Hui Fang,
Xuehua Shen, Azadeh Shakery, Jing Jiang, Qiaozhu Mei, Xuanhui Wang, Bin Tan,
Xu Ling, Younhee Ko, Alexander Kotov, Yue Lu, Maryam Karimzadehgan, Yuanhua
Lv, Duo Zhang, V. G. Vinod Vydiswaran, Hyun Duk Kim, Kavita Ganesan, Parikshit
Sondhi, Huizhong Duan, Yanen Li, Hongning Wang, Mingjie Qian, and Dae Hoon
Park. The authors’ own work included in the book has been supported by multiple
funding sources, including NSF, NIH, NASA, IARPA, Air Force, ONR, DHS, Alfred P.
Sloan Foundation, and many companies including Microsoft, Google, IBM, Yahoo!,
LinkedIn, Intel, HP, and TCL. We are thankful to all of them.
The two Massive Open Online Courses (MOOCs) offered by the first author for
the University of Illinois at Urbana–Champaign (UIUC) in 2015 on Coursera (i.e.,
Text Retrieval and Search Engines and Text Mining and Analytics) provided a direct
basis for this book in the sense that many parts of the book are based primarily
on the transcribed notes of the lectures in these two MOOCs. We thus would like
to thank all the people who have helped with these two MOOCs, especially TAs
Hussein Hazimeh and Alex Morales, and UIUC instruction support staff Jason
Mock, Shannon Bicknell, Katie Woodruff, and Edward Noel Dignan, and the Head
of the Computer Science Department, Rob Rutenbar, whose encouragement, support,
and help are all essential for these two MOOCs to happen. The first author also
wants to thank UIUC for allowing him to use the sabbatical leave in Fall 2015 to
work on this book. Special thanks are due to Chase Geigle, co-founder of META. In
addition to all the above, the second author would like to thank Chase Geigle, Jason
Cho, and Urvashi Khandelwal (among many others) for insightful discussion and
encouragement.
Finally, we would like to thank all our family members, particularly our wives,
Mei and Kai, for their love and support. The first author wants to further thank
his brother Chengxing for the constant intellectual stimulation in their regular
research discussions and his parents for cultivating his passion for learning and
sharing knowledge with others.
ChengXiang Zhai
Sean Massung
June 2016
PART I
OVERVIEW AND BACKGROUND

Chapter 1
Introduction
In the last two decades, we have experienced an explosive growth of online infor-
mation. According to a study done at the University of California, Berkeley back in 2003:
"… the world produces between 1 and 2 exabytes (10^18 bytes) of unique infor-
mation per year, which is roughly 250 megabytes for every man, woman, and child
on earth. Printed documents of all kinds comprise only .03% of the total.” [Lyman
et al. 2003]
A large amount of online information is textual information (i.e., in natural lan-
guage text). For example, according to the Berkeley study cited above: “Newspapers
represent 25 terabytes annually, magazines represent 10 terabytes . . . office docu-
ments represent 195 terabytes. It is estimated that 610 billion emails are sent each
year representing 11,000 terabytes.” Of course, there are also blog articles, forum
posts, tweets, scientific literature, government documents, etc. Roe [2012] updates
the email count from 610 billion emails in 2003 to 107 trillion emails sent in 2010.
According to a recent IDC report [Gantz & Reinsel 2012], from 2005 to 2020,
the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 ex-
abytes, or 40 trillion gigabytes.
While, in general, all kinds of online information are useful, textual information
plays an especially important role and is arguably the most useful kind of informa-
tion for the following reasons.
Text (natural language) is the most natural way of encoding human knowledge.
As a result, most human knowledge is encoded in the form of text data. For
example, scientific knowledge almost exclusively exists in scientific literature,
while technical manuals contain detailed explanations of how to operate
devices.
Text is by far the most common type of information encountered by people.
Indeed, most of the information a person produces and consumes daily is in
text form.
Text is the most expressive form of information in the sense that it can be
used to describe other media such as video or images. Indeed, image search
engines such as those supported by Google and Bing often rely on matching
companion text of images to retrieve “matching” images to a user’s keyword
query.
The explosive growth of online text information has created a strong demand
for intelligent software tools to provide the following two related services to help
people manage and exploit big text data.
Text Retrieval. The growth of text data makes it impossible for people to con-
sume the data in a timely manner. Since text data encode much of our accu-
mulated knowledge, they generally cannot be discarded, leading to, e.g., the
accumulation of a large amount of literature data which is now beyond any
individual’s capacity to even skim over. The rapid growth of online text infor-
mation also means that no one can possibly digest all the new information
created on a daily basis. Thus, there is an urgent need for developing intel-
ligent text retrieval systems to help people get access to the needed relevant
information quickly and accurately, leading to the recent growth of the web
search industry. Indeed, web search engines like Google and Bing are now an
essential part of our daily life, serving millions of queries daily. In general,
search engines are useful anywhere there is a relatively large amount of text
data (e.g., desktop search, enterprise search or literature search in a specific
domain such as PubMed).
Text Mining. Due to the fact that text data are produced by humans for commu-
nication purposes, they are generally rich in semantic content and often con-
tain valuable knowledge, information, opinions, and preferences of people.
As such, they offer great opportunity for discovering various kinds of knowl-
edge useful for many applications, especially knowledge about human opin-
ions and preferences, which is often directly expressed in text data. For exam-
ple, it is now the norm for people to tap into opinionated text data such as
product reviews, forum discussions, and social media text to obtain opinions
about topics interesting to them and optimize various decision-making tasks
such as purchasing a product or choosing a service. Once again, due to the
overwhelming amount of information, people need intelligent software tools
to help discover relevant knowledge to optimize decisions or help them com-
plete their tasks more efficiently. While the technology for supporting text
mining is not yet as mature as search engines for supporting text access, sig-
nificant progress has been made in this area in recent years, and specialized
text mining tools have now been widely used in many application domains.
Figure 1.1 Text retrieval and text mining are two main techniques for analyzing big text data.
Once we obtain a small set of most relevant text data, we would need to further
analyze the text data to help users digest the content and knowledge in the text
data. This is the text mining step where the goal is to further discover knowledge
and patterns from text data so as to support a user’s task. Furthermore, due to the
need for assessing trustworthiness of any discovered knowledge, users generally
have a need to go back to the original raw text data to obtain appropriate context
for interpreting the discovered knowledge and verify the trustworthiness of the
knowledge, hence a search engine system, which is primarily useful for text access,
also has to be available in any text-based decision-support system for supporting
knowledge provenance. The two steps are thus conceptually interleaved, and a
full-fledged intelligent text information system must integrate both in a unified
framework.
It is worth pointing out that, put in the context of "big data," text data is very dif-
ferent from other kinds of data because it is generally produced directly by humans
and often also meant to be consumed by humans as well. In contrast, other data
tend to be machine-generated data (e.g., data collected by using all kinds of physi-
cal sensors). Since humans can understand text data far better than computers can,
involvement of humans in the process of mining and analyzing text data is abso-
lutely crucial (much more necessary than in other big data applications), and how
to optimally divide the work between humans and machines so as to optimize the
collaboration between humans and machines and maximize their “combined in-
telligence” with minimum human effort is a general challenge in all applications of
text data management and analysis. The two steps discussed above can be regarded
as two different ways for a text information system to assist humans: information
retrieval systems assist users in finding from a large collection of text data the most
relevant text data that are actually needed for solving a specific application prob-
lem, thus effectively turning big raw text data into much smaller relevant text data
that can be more easily processed by humans, while text mining application sys-
tems can assist users in analyzing patterns in text data to extract and discover useful
actionable knowledge directly useful for task completion or decision making, thus
providing more direct task support for users.
With this view, we partition the techniques covered in the book into two parts to
match the two steps shown in Figure 1.1, which are then followed by one chapter to
discuss how all the techniques may be integrated in a unified text information sys-
tem. The book attempts to provide a complete coverage of all the major concepts,
techniques, and ideas in information retrieval and text data mining from a prac-
tical viewpoint. It includes many hands-on exercises designed with a companion
software toolkit META to help readers learn how to apply techniques of information
retrieval and text mining to real-world text data and learn how to experiment with
and improve some of the algorithms for interesting application tasks. This book
can be used as a textbook for computer science undergraduates and graduates, li-
brary and information scientists, or as a reference book for practitioners working
on relevant application problems in analyzing and managing text data.
1.1 Functions of Text Information Systems

Information Access. This capability gives a user access to the useful informa-
tion when the user needs it. With this capability, a TIS can connect the right
information with the right user at the right time. For example, a search en-
gine enables a user to access text information through querying, whereas a
recommender system can push relevant information to a user as new informa-
tion items become available. Since the main purpose of Information Access
is to connect a user with relevant information, a TIS offering this capability
Figure 1.2 Information access, knowledge acquisition, and text organization are three major
capabilities of a text information system with text organization playing a supporting
role for information access and knowledge acquisition. Knowledge acquisition is also
often referred to as text mining.
generally only does minimum analysis of text data sufficient for matching
relevant information with a user’s information need, and the original infor-
mation items (e.g., web pages) are often delivered to the user in their original
form, though summaries of the delivered items are often provided. From the
perspective of text analysis, a user would generally need to read the informa-
tion items to further digest and exploit the delivered information.
Information access can be further classified into two modes: pull and push. In
the pull mode, the user takes initiative to “pull” the useful information out from
the system; in this case, the system plays a passive role and waits for a user to
make a request, to which the system would then respond with relevant information.
This mode of information access is often very useful when a user has an ad hoc
information need, i.e., a temporary information need (e.g., an immediate need for
opinions about a product). For example, a search engine like Google generally
serves a user in pull mode. In the push mode, the system takes initiative to “push”
(recommend) to the user an information item that the system believes is useful to
the user. The push mode often works well when the user has a relatively stable
information need (e.g., hobby of a person); in such a case, a system can know
“in advance” a user’s preferences and interests, making it feasible to recommend
information to a user without having the user take the initiative. We cover both
modes of information access in this book.
The pull mode further consists of two complementary ways for a user to obtain
relevant information: querying and browsing. In the case of querying, the user
specifies the information need with a (keyword) query, and the system would take
the query as input and return documents that are estimated to be relevant to the
query. In the case of browsing, the user simply navigates along structures that
link information items together and progressively reaches relevant information.
Since querying can also be regarded as a way to navigate, in one step, into a set
of relevant documents, it’s clear that browsing and querying can be interleaved
naturally. Indeed, a user of a web search engine often interleaves querying and
browsing.
Knowledge acquisition from text data is often achieved through the process of
text mining, which can be defined as mining text data to discover useful knowl-
edge. Both the data mining community and the natural language processing
(NLP) community have developed methods for text mining, although the two
communities tend to adopt slightly different perspectives on the problem. From
a data mining perspective, we may view text mining as mining a special kind
of data, i.e., text. Following the general goals of data mining, the goal of text
mining would naturally be regarded as discovering and extracting interesting pat-
terns in text data, which can include latent topics, topical trends, or outliers.
From an NLP perspective, text mining can be regarded as partially understanding
natural language text, converting it into some form of knowledge representation,
and making limited inferences based on the extracted knowledge. Thus a
key task is to perform information extraction, which often aims to identify and ex-
tract mentions of various entities (e.g., people, organization, and location) and
their relations (e.g., who met with whom). In practice, of course, any text mining
application would likely involve both pattern discovery (i.e., the data mining
view) and information extraction (i.e., the NLP view), with information extraction
serving to enrich the semantic representation of text, which enables pattern
(Figure: conceptual framework of a text information system, in which text analysis supports information access (search, filtering), information organization (clustering), and knowledge acquisition (extraction, summarization, visualization), serving retrieval applications and mining applications.)
also possible, which may be based on recognized entities and relations or other
techniques for more in-depth understanding of text.
With content analysis as the basis, there are multiple components in a TIS that
are useful for users in different ways. The following are some commonly seen
functions for managing and analyzing text information.
Search. Take a user’s query and return relevant documents. The search com-
ponent in a TIS is generally called a search engine. Web search engines are
among the most useful search engines that enable users to effectively and
efficiently deal with a huge amount of text data.
Categorization. Classify a text object into one or several of the predefined cat-
egories where the categories can vary depending on applications. The cat-
egorization component in a TIS can annotate text objects with all kinds of
meaningful categories, thus enriching the representation of text data, which
further enables more effective and deeper text analysis. The categories can
also be used for organizing text data and facilitating text access. Subject cate-
gorizers that classify a text article into one or multiple subject categories and
sentiment taggers that classify a sentence into positive, negative, or neutral in
sentiment polarity are both specific examples of a text categorization system.
Topic Analysis. Take a set of documents and extract and analyze topics in them.
Topics directly facilitate digestion of text data by users and support browsing
of text data. When combined with the companion non-textual data such as
time, location, authors, and other meta data, topic analysis can generate
many interesting patterns such as temporal trends of topics, spatiotemporal
distributions of topics, and topic profiles of authors.
Clustering. Discover groups of similar text objects (e.g., terms, sentences, doc-
uments, . . . ). The clustering component of a TIS plays an important role in
helping users explore an information space. It uses empirical data to create
meaningful structures that can be useful for browsing text objects and ob-
taining a quick understanding of a large text data set. It is also useful for
discovering outliers by identifying the items that do not form natural clusters
with other items.
This list also serves as an outline of the major topics to be covered later in
this book. Specifically, search and filtering are covered first in Part II about text
data access, whereas categorization, clustering, topic analysis, and summarization
are covered later in Part III about text data analysis. Information extraction is not
covered in this book since we want to focus on general approaches that can be
readily applied to text data in any natural language, but information extraction
often requires language-specific techniques. Visualization is also not covered due
to the intended focus on algorithms in this book. However, it must be stressed that
both information extraction and visualization are very important topics relevant
to text data analysis and management. Readers interested in these techniques can
find some useful references in the Bibliographic Notes at the end of this chapter.
Part I. Overview and Background. This part consists of the first four chapters
and provides an overview of the book and background knowledge, including
basic concepts needed for understanding the content of the book that some
readers may not be familiar with, and an introduction to the META toolkit
used for exercises in the book. This part also gives a brief overview of natu-
ral language processing techniques needed for understanding text data and
obtaining informative representation of text needed in all text data analysis
applications.
Part II. Text Data Access. This part consists of Chapters 5–11, covering the ma-
jor techniques for supporting text data access. This part provides a systematic
discussion of the basic information retrieval techniques, including the for-
mulation of retrieval tasks as a problem of ranking documents for a query
(Chapter 5), retrieval models that form the foundation of the design of rank-
ing functions in a search engine (Chapter 6), feedback techniques (Chapter 7),
implementation of retrieval systems (Chapter 8), and evaluation of retrieval
systems (Chapter 9). It then covers web search engines, the most important
application of information retrieval so far (Chapter 10), where techniques for
analyzing links in text data for improving ranking of text objects are intro-
duced and application of supervised machine learning to combine multiple
features for ranking is briefly discussed. The last chapter in this part (Chap-
ter 11) covers recommender systems which provide a “push” mode of informa-
tion access, as opposed to the “pull” mode of information access supported
by a typical search engine (i.e., querying by users).
Part III. Text Data Analysis. This part consists of Chapters 12–19, covering a
variety of techniques for analyzing text data to facilitate user digestion of text
data and discover useful topical or other semantic patterns in text data. Chap-
ter 12 gives an overview of text analysis from the perspective of data mining,
where we may view text data as data generated by humans as “subjective sen-
sors” of the world; this view allows us to look at the text analysis problem in the
more general context of data analysis and mining in general, and facilitates
the discussion of joint analysis of text and non-text data. This is followed by
multiple chapters covering a number of the most useful general techniques
for analyzing text data without or with only minimum human effort. Specif-
ically, Chapter 13 discusses techniques for discovering two fundamental se-
mantic relations between lexical units in text data, i.e., paradigmatic relations
and syntagmatic relations, which can be regarded as an example of discov-
ering knowledge about the natural language used to generate the text data
(i.e., linguistic knowledge). Chapter 14 and Chapter 15 cover, respectively, two
closely related techniques to generate and associate meaningful structures
or annotations with otherwise unorganized text data, i.e., text clustering and
text categorization. Chapter 16 discusses text summarization useful for facil-
itating human digestion of text information. Chapter 17 provides a detailed
discussion of an important family of probabilistic approaches to discovery
and analysis of topical patterns in text data (i.e., topic models). Chapter 18
discusses techniques for analyzing sentiment and opinions expressed in text
data, which are key to discovery of knowledge about preferences, opinions,
and behavior of people based on analyzing the text data produced by them.
Finally, Chapter 19 discusses joint analysis of text and non-text data, which is
often needed in many applications since it is in general beneficial to use as
much data as possible for gaining knowledge and intelligence through (big)
data analysis.
Part IV. Unified Text Management and Analysis System. This last part consists
of Chapter 20 where we attempt to discuss how all the techniques discussed
in this book can be conceptually integrated in an operator-based unified
framework, and thus potentially implemented in a general unified system
for text management and analysis that can be useful for supporting a wide
range of different applications. This part also serves as a roadmap for further
extension of META to provide effective and general high-level support for
various applications and provides guidance on how META may be integrated
with many other related existing toolkits, including particularly search engine
systems, database systems, natural language processing toolkits, machine
learning toolkits, and data mining toolkits.
Due to our attempt to treat all the topics from a practical perspective, most
of the discussions of the concepts and techniques in the book are informal and
intuitive. To satisfy the needs of some readers that might be interested in deeper
understanding of some topics, the book also includes an appendix with notes to
provide a more detailed and rigorous explanation of a few important topics.
such a tradeoff, we have chosen to emphasize the coverage of the basic concepts
and practical techniques of text data mining at the cost of not being able to cover
many advanced techniques in detail, and provide some references at the end of
many chapters to help readers learn more about those advanced techniques if
they wish to. Our hope is that with the foundation received from reading this
book, you will be able to learn about more advanced techniques by yourself or via
another resource. We have also chosen to cover more general techniques for text
management and analysis and favor techniques that can be applicable to any text in
any natural language. Most techniques we discuss can be implemented without any
human effort or only requiring minimal human effort; this is in contrast to some
more detailed analysis of text data, particularly using natural language processing
techniques. Such “deep analysis” techniques are obviously very important and are
indeed necessary for some applications where we would like to go in-depth to
understand text in detail. However, at this point, these techniques are often not
scalable and they tend to require a large amount of human effort. In practice, it
would be beneficial to combine both kinds of techniques.
We envision three main (and potentially overlapping) categories of readers.
the entire book as an Introduction to Text Data Mining, while skipping some
chapters in Part II that are more specific to search engine implementation and
applications specific to the Web. Another choice would be using all parts as a
supplemental graduate textbook, where there is still some emphasis on prac-
tical programming knowledge that can be combined with reading referenced
papers in each chapter. Exercises for graduate students could be to implement
in META some of the methods they read about in the references.
The exercises at the end of each chapter give students experience working
with a powerful—yet easily understandable—text retrieval and mining toolkit
in addition to written questions. In a programming-focused class, using the
META exercises is strongly encouraged. Programming assignments can be cre-
ated from selecting a subset of exercises in each chapter. Due to the modular
nature of the toolkit, additional programming experiments may be created by
extending the existing system or implementing other well-known algorithms
that do not come with META by default. Finally, students may use compo-
nents of META they learned through the exercises to complete a larger final
programming project. Using different corpora with the toolkit can yield dif-
ferent project challenges, e.g., review summary vs. sentiment analysis.
Practitioners. Most readers in industry will likely use this book as a
reference, although we also hope that it may serve as some inspiration in
your own work. As with the student user suggestion, we think you would get
the most out of this book by first reading the initial three chapters. Then, you may
choose a chapter relevant to your current interests and delve deeper or refresh
your knowledge.
Since many applications in META can be used simply via config files, we
expect it to be a quick way to get a handle on your dataset and provide some
baseline results without any programming required.
The exercises at the end of each chapter can be thought of as default
implementations for a particular task at hand. You may choose to include
META in your work since it uses a permissive free software license. In fact, it is
dual-licensed under MIT and University of Illinois/NCSA licenses. Of course,
we still encourage and invite you to share any non-proprietary modifications,
extensions, and improvements to META for the benefit of all the readers.
No matter what your goal, we hope that you find this book useful and educa-
tional. We also appreciate your comments and suggestions for improvement of the
book. Thanks for reading!
of some key techniques important for text mining, notably the information extrac-
tion (IE) techniques which are essential for text mining. We decided not to cover IE
because the IE techniques tend to be language-specific and require non-trivial man-
ual work by humans. Another reason is that many IE techniques rely on supervised
machine learning approaches, which are well covered in many existing machine
learning books (see, e.g., Bishop 2006, Mitchell 1997). Readers who are interested
in knowing more about IE can start with the survey book [Sarawagi 2008] and the
review article [Jiang 2012].
From an application perspective, another important topic missing in this book
is information visualization, which is due to our focus on the coverage of models
and algorithms. However, since every application system must have a user-friendly
interface to allow users to optimally interact with a system, those readers who are
interested in developing text data application systems will surely find it useful to
learn more about user interface design. An excellent reference to start with is Hearst
[2009], which also has a detailed coverage of information visualization.
Finally, due to our emphasis on breadth, the book does not cover any compo-
nent algorithm in depth. To know more about some of the topics, readers can
further read books in natural language processing (e.g., Jurafsky and Martin 2009,
Manning and Schütze 1999), advanced books on IR (e.g., Baeza-Yates and Ribeiro-
Neto [2011]), and books on machine learning (e.g., Bishop [2006]). You may find
more specific recommendations of readings relevant to a particular topic in the
Bibliographic Notes at the end of each chapter that covers the corresponding topic.
Chapter 2
Background
This chapter contains background information that is necessary to know in order
to get the most out of the rest of this book; readers who are already familiar with
these basic concepts may safely skip the entire chapter or some of the sections.
We first focus on some basic probability and statistics concepts required for most
algorithms and models in this book. Next, we continue our mathematical back-
ground with an overview of some concepts in information theory that are often
used in many text mining applications. The last section introduces the basic idea
and problem setup of machine learning, particularly supervised machine learning,
which is useful for classification, categorization, or text-based prediction in the text
domain. In general, machine learning is very useful for many information retrieval
and data mining tasks.
2.1 Basics of Probability and Statistics

Consider, for example, a six-sided die whose faces are colored red, orange, yellow, blue, green, and purple; the sample space is Ω = {red, orange, yellow, blue, green, purple}, and a fair die can be modeled with the uniform distribution

θ = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6),

where the first index corresponds to p(red) = 1/6, the second index corresponds to p(orange) = 1/6, and so on. But what if we had an unfair die? We could use a different probability distribution θ to model events concerning it:

θ = (1/3, 1/3, 1/12, 1/12, 1/12, 1/12).
In this case, red and orange are assumed to be rolled more often than the
other colors. Be careful to note the difference between the sample space Ω and
the defined probability model θ used to quantify its uncertainty. In our text mining
tasks, we usually try to estimate θ given some knowledge about Ω. The different
methods to estimate θ will determine how accurate or useful the probabilistic
model is.
Consider the following notation:
x ∼ θ.
We read this as x is drawn from theta, or the random variable x is drawn from the
probability distribution θ. The random variable x takes on each value from Ω with
a certain probability defined by θ. For example, if we had x ∼ θ, then there is a 2/3
chance that x is either red or orange.
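The sampling notation can be made concrete with a few lines of code. The sketch below (a minimal illustration in Python, rather than the C++-based META toolkit used for the book's exercises) draws many samples x ∼ θ from the unfair color distribution above and checks empirically that red or orange comes up about 2/3 of the time:

import random

# Sample space (colors of the die faces) and the unfair distribution theta.
omega = ["red", "orange", "yellow", "blue", "green", "purple"]
theta = [1/3, 1/3, 1/12, 1/12, 1/12, 1/12]

# Draw x ~ theta many times and estimate p(x is red or orange).
n = 100_000
samples = random.choices(omega, weights=theta, k=n)
estimate = sum(1 for x in samples if x in ("red", "orange")) / n

print(f"empirical p(red or orange) = {estimate:.3f}  (theory: {2/3:.3f})")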
In our text application tasks, we usually take Ω to be V, the vocabulary of some text
corpus (for example, the set of all distinct words appearing in the corpus),
and we could model the text data with a probability distribution θ . Thus, if we have
some word w we can write p(w | θ) (read as the probability of w given θ). If w is the
word data, we might have p(w = data | θ) = 0.003 or equivalently pθ (w = data) =
0.003.
In our examples, we have only considered discrete probability distributions.
That is, our models only assign probabilities for a finite (discrete) set of outcomes.
In reality, there are also continuous probability distributions, where there are an
infinite number of “events” that are not countable. For example, the normal (or
Gaussian) distribution is a continuous probability distribution that assigns real-
valued probabilities according to some parameters. We will discuss continuous
distributions as necessary in later chapters. For now, though, it’s sufficient to un-
derstand that we can use a discrete probability distribution to model the probability
of observing a single word in a vocabulary V .
A probability distribution pθ over Ω must satisfy the following properties. Each probability is between zero and one:

0 ≤ pθ(ω ∈ Ω) ≤ 1. (2.1)

An event not in Ω has probability zero, and the probability of any event occurring from Ω is one:

pθ(ω ∉ Ω) = 0, (2.2)

pθ(ω ∈ Ω) = 1. (2.3)
Let’s also assume that each color is represented by a particular shape. This makes
our die look like
24 Chapter 2 Background
{ , , , , , }
where the colors of the shapes are, red, orange, yellow, blue, green, and purple,
respectively.
We can now create another distribution for the shape θS . Let each index in θS
represent p(square), p(circle), p(triangle), respectively. That gives
θS = (1/3, 1/2, 1/6).
Then we can let xC ∼ θC represent the color random variable and let xS ∼ θS repre-
sent the shape random variable. We now have two variables to work with.
A joint probability measures the likelihood that two events occur simultane-
ously. For example, what is the probability that xC = red and xS = circle? Since there
are no red circles, this has probability zero. How about p(xC = green, xS = circle)?
This notation signifies the joint probability of the two random variables. In this
case, the joint probability is 1/6 because there is only one green circle.
Consider a modified die
where we changed the color of the blue circle (the fourth element in the set) to
green. Thus, we now have two green circles instead of one green and one blue.
What would p(xC = green, xS = circle) be? Since two out of the six elements satisfy
both these criteria, the answer is 2/6 = 1/3. As another example, if we had a 12-sided
fair die with 5 green circles and 7 other combinations of shape and color, then
p(xC = green, xS = circle) = 5/12.
A conditional probability measures the likelihood that one event occurs given
that another event has already occurred. Let’s use the original die with six unique
colors. Say we know that a square was rolled. With that information, what is the
probability that the color is red? How about purple? We can write this as p(xC =
red | xS = square). Since we know there are two squares, of which one is red, p(red |
square) = 1/2.
We can write the conditional probabilities for two random variables X and Y
based on their joint probability with the following equation:
p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y). (2.4)
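As a small illustration of joint and conditional probabilities, the following sketch enumerates one shape assignment for the six-colored die that is consistent with the examples above (red is a square, blue and green are circles, with two squares, three circles, and one triangle in total); the shapes assigned to orange, yellow, and purple are arbitrary choices made only for this illustration. It then computes p(xC = green, xS = circle) and p(xC = red | xS = square) by counting faces:

# One hypothetical die consistent with the text's examples: two squares (one of
# them red), three circles (blue and green among them), and one triangle.
# The shapes of orange, yellow, and purple are assumptions for illustration.
die = [("red", "square"), ("orange", "circle"), ("yellow", "square"),
       ("blue", "circle"), ("green", "circle"), ("purple", "triangle")]

def p(predicate):
    """Probability of an event on a fair roll = fraction of faces satisfying it."""
    return sum(1 for face in die if predicate(face)) / len(die)

# Joint probability p(color = green, shape = circle)
joint = p(lambda f: f == ("green", "circle"))

# Conditional probability p(color = red | shape = square) = p(red, square) / p(square)
cond = p(lambda f: f == ("red", "square")) / p(lambda f: f[1] == "square")

print(joint)  # 1/6 ~ 0.167
print(cond)   # 1/2 = 0.5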
of observing this sequence is simply the product of observing each event, i.e.,
θ × (1 − θ) × θ × θ × (1 − θ) = θ^3(1 − θ)^2, with no adjustment for different orders
of observing three heads and two tails.
The more commonly used multinomial distribution in text analysis, which mod-
els the probability of seeing a word in a particular scenario (e.g., in a document), is
very similar to this Bernoulli distribution, just with more than two outcomes.
D = {h, t , h, h, t}.
Now we would like to figure out what θ is based on the observed data. Using
maximum likelihood estimation (MLE), we choose the θ that has the highest like-
lihood given our data, i.e., choose the θ such that the probability of observed data
is maximized.
To compute the MLE, we would first write down the likelihood function, i.e.,
p(D | θ), which is θ^3(1 − θ)^2 as we explained earlier. The problem is thus reduced
to finding the θ that maximizes the function f(θ) = θ^3(1 − θ)^2. Equivalently, we can
attempt to maximize the log-likelihood: log f (θ ) = 3 log θ + 2 log(1 − θ), since log-
arithm transformation preserves the order of values. Using knowledge of calculus,
we know that a necessary condition for a function to achieve a maximum value at a
θ value is that the derivative at the same θ value is zero. Thus, we just need to solve
the following equation:
d log f(θ)/dθ = 3/θ − 2/(1 − θ) = 0,

which gives θ = 3/5. More generally, if the observed data contain H heads and T tails, the same reasoning gives the MLE

θ = H / (H + T).
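As a quick numerical sanity check of this derivation (a sketch of ours, not one of the book's exercises), we can evaluate the log-likelihood on a grid of θ values and confirm that it peaks at the count-based estimate H/(H + T):

import math

data = ["h", "t", "h", "h", "t"]
H = data.count("h")
T = data.count("t")

def log_likelihood(theta):
    # log p(D | theta) = H log(theta) + T log(1 - theta)
    return H * math.log(theta) + T * math.log(1 - theta)

# Grid search over theta in (0, 1) as a numerical check of the calculus argument.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)

print(theta_mle)        # ~0.6
print(H / (H + T))      # 0.6, the closed-form MLE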
The notation arg max represents the argument (i.e., θ in this case) that makes
the likelihood function (i.e., p(D | θ)) reach its maximum. Thus, the value of an
arg max expression stays the same if we perform any monotonic transformation of
the function inside arg max. This is why we could use the logarithm transformation
in the example above, which made it easier to compute the derivative.
The solution to MLE shown above should be intuitive: the θ that maximizes our
data likelihood is just the ratio of heads. It is a general characteristic of the MLE
that the estimated probability is the normalized counts of the corresponding events
denoted by the probability. As an example, the MLE of a multinomial distribution
(which will be further discussed in detail later in the book) gives each possible
outcome a probability proportional to the observed counts of the outcome. Note
that a consequence of this is that all unobserved outcomes would have a zero
probability according to MLE. This is often not reasonable especially when the data
sample is small, a problem that motivates Bayesian parameter estimation which we
discuss below.
p(D) = ∫_θ p(θ) p(D | θ) dθ (2.9)
The last one is called the marginal likelihood because the integration “marginal-
izes out” (removes) the parameter θ from the equation. Since the likelihood of the
data remains constant, observing the constraint that p(θ | D) must sum to one over
all possible values of θ, we usually just say

p(θ | D) ∝ p(θ) p(D | θ). (2.10)
That is, the posterior is proportional to the prior times the likelihood.
The posterior distribution of the parameter θ fully characterizes the uncertainty
of the parameter value and can be used to infer any quantity that depends on the
parameter value, including computing a point estimate of the parameter (i.e., a
single value of the parameter). There are multiple ways to compute a point estimate
based on a posterior distribution. One possibility is to compute the mean of the
posterior distribution, which is given by the weighted sum of probabilities and the
parameter values. For a discrete distribution,
E[X] = Σ_x x p(x). (2.11)
Here it is easy to see that the MAP estimate would deviate from the MLE with
consideration of maximizing the probability of the parameter according to our
prior belief encoded as p(θ). It is through the use of an appropriate prior that we
can address the overfitting problem of the MLE, since our prior can strongly prefer an
estimate in which neither heads nor tails has a zero probability.
For a continuation and more in-depth discussion of this material, consult Ap-
pendix A.
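The following sketch illustrates the point about zero probabilities, assuming a symmetric Beta(α, α) prior (a common choice for a coin parameter, used here purely for illustration): with data containing no tails, the MLE assigns tails probability zero, while the MAP estimate with pseudo-counts does not:

data = ["h", "h", "h", "h"]          # observed data: no tails at all
H = data.count("h")
T = data.count("t")

# Maximum likelihood estimate: relative frequency.
theta_mle = H / (H + T)

# MAP estimate under an assumed symmetric Beta(alpha, alpha) prior:
# equivalent to adding (alpha - 1) pseudo-counts to heads and to tails.
alpha = 2.0
theta_map = (H + alpha - 1) / (H + T + 2 * (alpha - 1))

print(theta_mle)   # 1.0 -> tails gets probability 0, which overfits
print(theta_map)   # 5/6 -> tails keeps a small nonzero probability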
The first step has already been addressed. In our example, we wish to capture the
probabilities of individual words occurring in our corpus. In the second step, we
need to figure out actually how to set the probabilities for each word. One obvious
way would be to calculate the probability of each individual word in the corpus
itself. That is, the count of a unique word wi divided by the total number of words
in the corpus could be the value of p(wi | θ). This can be shown to be the solution of
the MLE of the model. More sophisticated models and their parameter estimation
will be discussed later in the book. Finally, once we have θ defined, what can we
actually do with it? One use case would be analyzing the probability of a specific
subset of words in the corpus, and another could be observing unseen data and
calculating the probability of seeing the words in the new text. It is often possible
to design the model such that the model parameters would encode the knowledge
we hope to discover from text data. In such a case, the estimated model parameters
can be directly used as the output (result) of text mining.
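For a unigram word model, the MLE described above is just normalized word counts. The sketch below estimates p(w | θ) from a made-up toy corpus; any tokenized text could be substituted:

from collections import Counter

# A toy corpus; any tokenized text would do.
corpus = "text data are valuable because text data encode human knowledge".split()

counts = Counter(corpus)
total = sum(counts.values())

# MLE of the unigram model: p(w | theta) = count(w) / total number of words.
theta = {w: c / total for w, c in counts.items()}

print(theta["text"])               # 2/10 = 0.2
print(theta["data"])               # 2/10 = 0.2
print(theta.get("computer", 0.0))  # unseen word -> probability 0 under the MLE

Note that a word never seen in the corpus gets probability zero under this estimate, which is exactly the issue that motivated the discussion of priors above.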
Please keep in mind that probabilistic models are a general tool and don’t only
have to be used for text analysis—that’s just our main application!
2.2 Information Theory

Information theory deals with uncertainty and the transfer or storage of quantified
information in the form of bits. It is applied in many fields, such as electrical engi-
neering, computer science, mathematics, physics, and linguistics. A few concepts
from information theory are very useful in text data management and analysis,
which we introduce here briefly. The most important concept of information theory
is entropy, which is a building block for many other measures.
The problem can be formally defined as the quantified uncertainty in predicting
the value of a random variable. In the common example of a coin, the two values
would be 1 or 0 (depicting heads or tails) and the random variable representing
these outcomes is X. In other words,
X = 1 if heads, and X = 0 if tails.
The more random this random variable is, the more difficult the prediction of heads
or tails will be. How does one quantitatively measure the randomness of a random
variable like X? This is precisely what entropy does.
Roughly, the entropy of a random variable X, H (X), is a measure of expected
number of bits needed to represent the outcome of an event x ∼ X. If the outcome
is known (completely certain), we don’t need to represent any information and
H (X) = 0. If the outcome is unknown, we would like to represent the outcome in
bits as efficiently as possible. That means using fewer bits for common occurrences
and more bits when the event is less likely. Entropy gives us the expected number of bits:

H(X) = − Σ_x p(X = x) log2 p(X = x).

In the cases where we have log2 0, we generally just define the corresponding term to be 0 since log2 0
is undefined. We will get different H (X) for different random variables X.
The exact theory and reasoning behind this formula are beyond the scope of this
book, but it suffices to say that H (X) = 0 means there is no randomness, H (X) = 1
means there is complete randomness in that all events are equally likely. Thus, the
amount of randomness varies from 0 to 1. For our coin example where the sample
space is two events (heads or tails), the entropy function looks like

H(X) = −p(X = 1) log2 p(X = 1) − p(X = 0) log2 p(X = 0).

For a fair coin, we would have p(X = 1) = p(X = 0) = 1/2. To calculate H(X), we'd have the calculation

H(X) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1,

whereas for a completely biased coin with p(X = 1) = 1, p(X = 0) = 0 we would have

H(X) = −1 · log2 1 − 0 · log2 0 = 0.
For this example, we had only two possible outcomes (i.e., a binary random
variable). As we can see from the formula, this idea of entropy easily generalizes
to random variables with more than two outcomes; in those cases, the sum is over
more than two elements.
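The entropy formula is easy to implement for any discrete distribution; the sketch below (ours) reproduces the fair-coin and biased-coin values discussed above, using the convention that a term with probability zero contributes nothing:

import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 * log2 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))          # completely biased coin: 0.0 bits
print(entropy([1/3, 1/3, 1/3]))     # three equally likely outcomes: ~1.585 bits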
If we plot H (X) for our coin example against the probability of heads p(X = 1),
we receive a plot like the one shown in Figure 2.1. At the two ends of the x-axis,
the probability of X = 1 is either very small or very large. In both these cases, the
entropy function has a low value because the outcome is not very random. The most
random is when p(X = 1) = 1/2. In that case, H(X) = 1, the maximum value. Since
the two probabilities are symmetric, we get a symmetric inverted U -shape as the
plot of H (X) as p(X = 1) varies.
It’s a good exercise to consider when a particular random variable (not just the
coin example) has a maximum or minimal value. In particular, let’s think about
some special cases. For example, we might have a random variable Y that always
takes a value of 1. Or, there’s a random variable Z that is equally likely to take a
value of 1, 2, or 3. In these cases, H (Y ) < H (Z) since the outcome of Y is much
Figure 2.1 The entropy H(X) of the coin plotted against P(X = 1).
easier to predict than the outcome of Z. This is precisely what entropy captures.
You can calculate H (Y ) and H (Z) to confirm this answer.
For our applications, it may be useful to consider the entropy of a word w in
some context. Here, high-entropy words would be harder to predict. Let W be the
random variable that denotes whether a word occurs in a document in our corpus.
Say W = 1 if the word occurs and W = 0 otherwise. How do you think H(W_the)
compares to H(W_computer)? The entropy of the word the is close to zero since it
occurs everywhere. It's not surprising to see this word in a document, thus it is easy
to predict that W_the = 1. This case is just like the biased coin that always lands one
way. The word computer, on the other hand, is a less common word and is harder
to predict whether it occurs or not, so the entropy will be higher.
When we attempt to quantify uncertainties of conditional probabilities, we can
also define conditional entropy H (X | Y ), which indicates the expected uncertainty
of X given that we observe Y , where the expectation is taken under the distribution
of all possible values of Y . Intuitively, if X is completely determined by Y , then
H (X | Y ) = 0 since once we know Y , there would be no uncertainty in X, whereas if
X and Y are independent, then H (X | Y ) would be the same as the original entropy
of X, i.e., H (X | Y ) = H (X) since knowing Y does not help at all in resolving the
uncertainty of X.
A closely related quantity is the mutual information I(X; Y) = H(X) − H(X | Y), which measures how much knowing Y reduces the uncertainty about X.
It is easy to see that I (X; Y ) tends to be large if X and Y are correlated, whereas
I (X; Y ) would be small if X and Y are not so related; indeed, in the extreme case
when X and Y are completely independent, there would be no reduction of entropy,
and thus H (X) = H (X | Y ), and I (X; Y ) = 0. However, if X is completely determined
by Y , then H (X | Y ) = 0, thus I (X; Y ) = H (X). Intuitively, mutual information can
measure the correlation of two random variables. Clearly as a correlation measure
on X and Y , mutual information is symmetric.
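To make these relationships concrete, the following Python sketch (our own illustration, with a made-up joint distribution) computes H(X), H(X | Y), and I(X; Y) = H(X) − H(X | Y):

import math
from collections import defaultdict

def H(dist):
    # Entropy of a distribution given as a dict of probabilities.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A hypothetical joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions p(x) and p(y).
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# Conditional entropy H(X | Y) = sum over y of p(y) * H(X | Y = y).
HX_given_Y = 0.0
for y, p_y in py.items():
    cond = {x: joint[(x, y)] / p_y for x in px if joint.get((x, y), 0) > 0}
    HX_given_Y += p_y * H(cond)

I_XY = H(px) - HX_given_Y   # mutual information
print(H(px), HX_given_Y, I_XY)   # about 1.0, 0.72, 0.28

Because X and Y are correlated in this made-up table, knowing Y removes part of the uncertainty about X, so I(X; Y) is positive; for an independent joint distribution it would be 0.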
Applications of these basic concepts, including entropy, conditional entropy,
and mutual information will be further discussed later in this book.
2.3 Machine Learning
Machine learning is a very important technique for solving many problems and has very broad applications. In text data management and analysis, it also has many
uses. Any in-depth treatment of this topic would clearly be beyond the scope of
this book, but here we introduce some basic concepts in machine learning that are
needed to better understand the content later in the book.
Machine learning techniques can often be classified into two types: supervised
learning and unsupervised learning. In supervised learning, a computer would
learn how to compute a function ŷ = f (x) based on a set of examples of the input
value x (called training data) and the corresponding expected output value y. It is
called “supervised” because typically the y values must be provided by humans for
each x, and thus the computer receives a form of supervision from the humans.
Once the function is learned, the computer would be able to take unseen values of
x and compute the function f (x).
When y takes a value from a finite set of values, which can be called labels, a
function f (.) can serve as a classifier in that it can be used to map an instance x
to the “right” label (or multiple correct labels when appropriate). Thus, the prob-
lem can be called a classification problem. The simplest case of the classification
problem is when we have just two labels (known as binary classification). When y
takes a real value, the problem is often called a regression problem. Both forms
of the problem can also be called prediction when our goal is mainly to infer the
unknown y for a given x; the term “prediction” is especially meaningful when y is
some property of a future event.
In text-based applications, both forms may occur, although the classification
problem is far more common, in which case the problem is also called text catego-
rization or text classification. We dedicate a chapter to this topic later in the book
(Chapter 15). The regression problem may occur when we use text data to predict
another non-text variable such as sentiment rating or stock prices; both cases are
also discussed later.
In classification as well as regression, the (input) data instance x is often repre-
sented as a feature vector where each feature provides a potential clue about which
y value is most likely the value of f (x). What the computer learns from the training
data is an optimal way to combine these features with weights on them to indi-
cate their importance and their influence on the final function value y. “Optimal”
here simply means that the prediction error on the training data is minimized, i.e.,
the predicted ŷ values are maximally consistent with the true y values in the train-
ing data.
More formally, let our collection of objects be X such that xi ∈ X is a feature
vector that represents object i. A feature is an attribute of an object that describes
it in some way. For example, if the objects are news articles, one feature could be
whether the word good occurred in the article. All these different features are part
of a document’s feature vector, which is used to represent the document. In our
cases, the feature vector will usually have to do with the words that appear in the
document.
We also have Y, which is the set of possible labels for each object. Thus, yi may
be sports in our news article classification setup and yj could be politics.
A classifier is a function f (.) that takes a feature vector as input and outputs a
predicted label ŷ ∈ Y. Thus, we could have f (xi ) = sports, meaning ŷ = sports. If the
true y is also sports, the classifier was correct in its prediction.
Notice how we can only evaluate a classification algorithm if we know the true
labels of the data. In fact, we will have to use the true labels in order to learn a good
function f (.) to take unseen feature vectors and classify them. For this reason, when
studying machine learning algorithms, we often split our corpus X into two parts:
training data and testing data. The training portion is used to build the classifier,
and the testing portion is used to evaluate the performance (e.g., determine how
many correct labels were predicted). In applications, the training data are generally all the labeled examples that we can generate, and the test cases are the data points to which we would like to apply our machine learning program.
But what does the function f (.) actually do? Consider a very simple example
that determines whether a news article has positive or negative sentiment, i.e.,
Y = {positive, negative}:
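A minimal sketch of what such a classifier might look like (our own illustration; the feature vector is assumed to be a dictionary of word counts, and the rule about the word good is the hypothetical knowledge learned from training data):

def f(x):
    # x is a feature vector for an article, here a dict of word counts.
    # Hypothetical rule learned from training data: positive articles
    # contain the word "good" more than once.
    if x.get("good", 0) > 1:
        return "positive"
    return "negative"

article = {"good": 3, "movie": 1, "plot": 2}
print(f(article))   # -> "positive"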
Of course, this example is overly simplified, but it does demonstrate the basic
idea of a classifier: it takes a feature vector as input and outputs a class label. Based
on the training data, the classifier may have determined that positive sentiment ar-
ticles contain the term good more than once; therefore, this knowledge is encoded
in the function. In Chapter 15, we will investigate some specific algorithms for cre-
ating the function f (.) based on the training data. Other topics such as feedback
for information retrieval (Chapter 7) and sentiment analysis (Chapter 18) make use
of classifiers, or resemble them. For this reason, it’s good to know what machine
learning is and what kinds of problems it can solve.
In contrast to supervised learning, in unsupervised learning we only have the
data instances X without knowing Y. In such a case, obviously we cannot really know
how to compute y based on an x. However, we may still learn latent properties or
structures of X. Since there is no human effort involved, such an approach is called
unsupervised. For example, the computer can learn that some data instances are
very similar, and the whole dataset can be represented by three major clusters of
data instances such that in each cluster, the data instances are all very similar.
This is essentially the clustering technique that we will discuss in Chapter 14.
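As a toy sketch of this clustering idea (our own illustration, not META's implementation, with made-up two-dimensional points standing in for document representations):

import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, k, iters=10):
    # A minimal k-means: returns a cluster index for each point.
    centroids = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return labels

points = [(0, 0), (0, 1), (5, 5), (6, 5), (10, 0), (10, 1)]
print(kmeans(points, k=3))   # e.g., [0, 0, 1, 1, 2, 2] up to relabeling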
Another form of unsupervised learning is to design probabilistic models to model
the data (called “generative models”) where we can embed interesting parameters
that denote knowledge that we would like to discover from the data. By fitting the
model to our data, we can estimate the parameter values that can best explain the
data, and treat the obtained parameter values as the knowledge discovered from
the data. Applications of such an approach in analyzing latent topics in text are
discussed in detail in Chapter 17.
Exercises
2.1. What can you say about p(X | Y ) if we know X and Y are independent random
variables? Prove it.
2.3. Use Bayes’ rule to solve the following problem. One third of the time, Milo
takes the bus to work and the other times he takes the train. The bus is less reliable,
so he gets to work on time only 50% of the time. If taking the train, he is on time 90%
of the time. Given that he was on time on a particular day, what is the probability
that Milo took the bus?
2.5. Consider the game outlined in the previous question. Imagine that two
aces were drawn, leaving 50 cards remaining. What is the expected value of the
next draw?
2.7. In the information theory section, we compared the entropy of the word the
to that of the word unicorn. In general, what types of words have a high entropy and
what types of words have a low entropy? As an example, consider a corpus of ten
documents where the occurs in all documents, unicorn appears in five documents,
and Mercury appears in one document. What would be the entropy value of each?
2.8. Brainstorm some different features that may be good for the sentiment clas-
sification task outlined in this chapter. What are the strengths and weaknesses of
such features?
2.9. Consider the following scenario. You are writing facial recognition software
that determines whether there is a face in a given image. You have a collection of
100,000 images with the correct answer and need to determine if there are faces
in new, unseen images.
Answer the following questions.
(a) Is this supervised learning or unsupervised learning?
(b) What are the labels or values we are predicting?
(c) Is this binary classification or multiclass classification? (Or neither?)
(d) Is this a regression problem?
(e) What are the features that could be used?
2.10. Consider the following scenario. You are writing essay grading software that
takes in a student essay and produces a score from 0–100%. To design this system,
you are given essays from the past year which have been graded by humans. Your
task is to use the system with the current year’s essays as input.
Answer the same questions as in Exercise 2.9.
2.11. Consider the following scenario. You are writing a tool that determines
whether a given web page is one of
personal home page,
links to a personal home page, or
neither of the above.
To help you in your task, you are given 5,000,000 pages that are already labeled.
Answer the same questions as in Exercise 2.9.
3
Text Data Understanding
In this chapter, we introduce basic concepts in text data understanding through
natural language processing (NLP). NLP is concerned with developing computa-
tional techniques to enable a computer to understand the meaning of natural
language text. NLP is a foundation of text information systems because how ef-
fective a TIS is in helping users access and analyze text data is largely determined
by how well the system can understand the content of text data. Content analysis
is thus logically the first step in text data analysis and management.
While a human can instantly understand a sentence in their native language,
it is quite challenging for a computer to make sense of one. In general, this may
involve the following tasks.
Lexical analysis. The purpose of lexical analysis is to figure out what the basic
meaningful units in a language are (e.g., words in English) and determine
the meaning of each word. In English, it is rather easy to determine the
boundaries of words since they are separated by spaces, but it is non-trivial to
find word boundaries in some other languages such as Chinese where there
is no clear delimiter to separate words.
Syntactic analysis. The purpose of syntactic analysis is to determine how words
are related with each other in a sentence, thus revealing the syntactic structure
of a sentence.
Semantic analysis. The purpose of semantic analysis is to determine the mean-
ing of a sentence. This typically involves the computation of meaning of a
whole sentence or a larger unit based on the meanings of words and their
syntactic structure.
Pragmatic analysis. The purpose of pragmatic analysis is to determine meaning
in context, e.g., to infer the speech acts of language. Natural language is
used by humans to communicate with each other. A deeper understanding of natural language in communication thus requires general knowledge and reasoning, which is closely tied to general artificial intelligence. In this sense, NLP is "AI complete," i.e., as difficult as any other hard problem in artificial intelligence.
The following are a few examples of specific challenges in natural language
understanding.
Word-level ambiguity. A word may have multiple syntactic categories and mul-
tiple senses. For example, design can be a noun or a verb (ambiguous POS);
root has multiple meanings even as a noun (ambiguous sense).
Syntactic ambiguity. A phrase or a sentence may have multiple syntactic struc-
tures. For example, natural language processing can have two different inter-
pretations: “processing of natural language” vs. “natural processing of lan-
guage” (ambiguous modification). Another example: A man saw a boy with
a telescope has two distinct syntactic structures, leading to a different result
regarding who had the telescope (ambiguous prepositional phrase (PP) at-
tachment).
Anaphora resolution. What exactly a pronoun refers to may be unclear. For
example, in John persuaded Bill to buy a TV for himself , does himself refer to
John or Bill?
Many modern approaches instead focus on empirical uses of language, and rely solely on labeled training data created by humans and the application of machine learning techniques.
While linguistic knowledge is always useful, today, the most advanced natural
language processing techniques tend to rely on heavy use of statistical machine
learning techniques with linguistic knowledge only playing a somewhat secondary
role. These statistical NLP techniques are successful for some of the NLP tasks. Part
of speech tagging is a relatively easy task, and state-of-the-art POS taggers may have
a very high accuracy (above 97% on news data). Parsing is more difficult, though
partial parsing can probably be done with reasonably high accuracy (e.g., above
90% for recognizing noun phrases)1.
However, full structure parsing remains very difficult, mainly because of ambi-
guities. Semantic analysis is even more difficult, only successful for some aspects
of analysis, notably information extraction (recognizing named entities such as
names of people and organization, and relations between entities such as who
works in which organization), word sense disambiguation (distinguishing different
senses of a word in different contexts of usage), and sentiment analysis (recogniz-
ing positive opinions about a product in a product review). Inferences and speech
act analysis are generally only feasible in very limited domains.
In summary, only “shallow” analysis of natural language processing can be done
for arbitrary text and in a robust manner; “deep” analysis tends not to scale up well
or be robust enough for analyzing unrestricted text. In many cases, a significant
amount of training data (created by human labeling) must be available in order to
achieve reasonable accuracy.
1. These performance numbers were based on a specific data set, so they may not generalize well
even within the same domain.
Figure 3.2 Text information system tasks and their dependency on NLP: classification and retrieval depend on NLP the least (they are "easier" and admit more workarounds), followed by summarization/extraction/topic mining, translation/dialogue, and question answering.
For example, retrieval and classification can be viewed as relatively easy tasks as compared with a more difficult task such as machine translation, where deep understanding of natural language is clearly required.
Figure 3.2 shows a number of TIS tasks that require somewhat different levels
of NLP. At one end of the spectrum, tasks such as retrieval and classification are
relatively easy, and in most of the cases, they don’t require deep NLP; indeed,
looking at the keywords mentioned in text is often sufficient to determine whether a
document is relevant to a query or about a certain topic. At the other end, however,
tasks such as machine translation and question answering would require much
more precise understanding; for example, a wrong parse of a sentence generally
would lead to a wrong translation unless the target language has a similar ambiguity
structure, and similarly, a wrong understanding of the question would lead to
wrong answers.
When it comes to a specific application task, it is often possible to bypass the
difficulty in accurately understanding natural language and go directly to solve the
application problem. A well-known example is the Eliza system,2 which is supposed
to play the role of a therapist and carry on a dialogue with a human user [Weizenbaum
1966]. The following is a sample dialogue.
2. [Link]
On the surface, the dialogue appears to be quite natural, and indeed, such a
dialogue might be useful to engage a depressed patient in a conversation. However,
the system does not really understand the language, and solely relies on heuristic
rules like the following to keep the dialogue going:
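A minimal sketch of an Eliza-style rule (our own illustration; the patterns and response templates here are hypothetical, not the book's exact rules):

import re

# Each rule pairs a keyword pattern with a response template that reuses
# part of the user's own sentence.
rules = [
    (re.compile(r"I remember (.*)", re.I), "Do you often think of {0}?"),
    (re.compile(r"I feel (.*)", re.I), "Why do you feel {0}?"),
]

def respond(sentence):
    for pattern, template in rules:
        match = pattern.search(sentence)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."   # default reply when no rule matches

print(respond("I feel sad about my work"))   # -> "Why do you feel sad about my work?"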
Such rules enable the system to directly perform the task, i.e., making a conver-
sation, without necessarily trying to understand the real meaning of words and
determining the meaning of the entire sentence.
Such a pattern-based way of solving a problem has turned out to be quite pow-
erful. Indeed, modern machine learning approaches to natural language under-
standing are essentially based on this and in many ways are similar to the Eliza
system, but with two important differences. The first is that the rules in a machine
learning system would not be exact or strict; instead, they tend to be stochastic, and
the probabilities of choosing which rule would be empirically set based on a train-
ing data set where the expected behavior of a function to be computed is known.
Second, instead of having humans supply rules, the "soft" rules may be learned automatically from the training data with only minimal help from users who can,
e.g., specify the elements in a rule.
Even difficult tasks like machine translation can be done by such statistical
approaches. The most useful NLP techniques for building a TIS are statistical ap-
proaches which tend to be much more robust than symbolic approaches. Statistical
language models are especially useful because they can quantify the uncertainties
associated with the use of natural language in a principled way.
3.3 Text Representation
Techniques from NLP allow us to design many different types of informative features for text objects. Let's take a look at the example sentence A dog is chasing a boy
on the playground in Figure 3.3. We can represent this sentence in many different
ways. First, we can always represent such a sentence as a string of characters. This
is true for every language. This is perhaps the most general way of representing text
since we can always use this approach to represent any text data. Unfortunately,
the downside to this representation is that it does not allow us to perform semantic
analysis, which is often needed for many applications of text mining. We’re not
even recognizing words, which are the basic units of meaning for any language. (Of
course, there are some situations where characters are useful, but that is not the
general case.)
The next version of text representation is performing word segmentation to
obtain a sequence of words. In the example sentence, we get features like dog and
chasing. With this level of representation, we suddenly have much more freedom.
By identifying words, we can, for example, easily discover the most frequent words
in this document or the whole collection. These words can then be used to form
topics. Therefore, representing text data as a sequence of words opens up a lot of
interesting analysis possibilities.
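For instance, once the text is segmented into words, finding the most frequent words takes only a few lines; here is a rough sketch (our own illustration, assuming a naive tokenizer that lowercases the text and splits on non-letter characters, which works reasonably for English but not for languages such as Chinese):

import re
from collections import Counter

sentence = "A dog is chasing a boy on the playground"

# Naive word segmentation: lowercase and keep runs of letters.
words = re.findall(r"[a-z]+", sentence.lower())

print(Counter(words).most_common(3))   # [('a', 2), ('dog', 1), ('is', 1)]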
However, this level of representation is slightly less general than a string of char-
acters. In some languages, such as Chinese, it’s actually not that easy to identify all
the word boundaries since in such a language text is a sequence of characters with
no spaces in between words. To solve this problem, we have to rely on some special
techniques to identify words and perform more advanced segmentation that isn’t
only based on whitespace (which isn’t always 100% accurate). So, the sequence of
words representation is not as robust as the string of characters representation. In
English, it’s very easy to obtain this level of representation so we can use this all
the time.
If we go further in natural language processing, we can add part-of-speech (POS)
tags to the words. This allows us to count, for example, the most frequent nouns; or,
we could determine what kind of nouns are associated with what kind of verbs. This
opens up more interesting opportunities for further analysis. Note in Figure 3.3
that we use a plus sign on the additional features because by representing text as
a sequence of part of speech tags, we don’t necessarily replace the original word
sequence. Instead, we add this as an additional way of representing text data.
Representing text as both words and POS tags enriches the representation of
text data, enabling a deeper, more principled analysis. If we go further, then we’ll
be parsing the sentence to obtain a syntactic structure. Again, this further opens up
more interesting analysis of, for example, the writing styles or grammatical error
correction.
If we go further still into semantic analysis, then we might be able to recognize
dog as an animal. We also can recognize boy as a person, and playground as a
location and analyze their relations. One deduction could be that the dog was
chasing the boy, and the boy is on the playground. This will add more entities and
relations, through entity-relation recognition. Now, we can count the most frequent
person that appears in this whole collection of news articles. Or, whenever you see
a mention of this person you also tend to see mentions of another person or object.
These types of repeated patterns can potentially make very good features.
Figure 3.3 Levels of representation for the example sentence: deeper NLP brings the representation closer to knowledge, but requires more human effort and yields a less accurate representation.
Such a high-level representation is even less robust than the sequence of words
or POS tags. It’s not always easy to identify all the entities with the right types and
we might make mistakes. Relations are even harder to find; again, we might make
mistakes. The level of representation is less robust, yet it’s very useful. If we move
further to a logic representation, then we have predicates and inference rules. With
inference rules we can infer interesting derived facts from the text. As one would
imagine, we can’t do that all the time for all kinds of sentences since it may take
significant computation time or a large amount of training data.
Finally, speech acts would add yet another level of representation of the intent
of this sentence. In this example, it might be a request. Knowing that would allow
us to analyze even more interesting things about the observer or the author of this
sentence. What’s the intention of saying that? What scenarios or what kinds of
actions will occur?
Figure 3.3 shows that if we move downwards, we generally see more sophisti-
cated NLP techniques. Unfortunately, such techniques would require more human
effort as well, and they are generally less robust since they attempt to solve a much
more difficult problem. If we analyze our text at levels that represent deeper analy-
sis of language, then we have to tolerate potential errors. That also means it’s still
necessary to combine such deep analysis with shallow analysis based on (for exam-
ple) sequences of words. On the right side, there is an arrow that points down to
indicate that as we go down, our representation of text is closer to the knowledge
representation in our mind. That’s the purpose of text mining!
Clearly, there is a tradeoff here between doing deeper analysis that might have
errors but would give us direct knowledge that can be extracted from text and doing
shallow analysis that is more robust but wouldn't give us the necessary deeper
representation of knowledge.
Text data are generated by humans and are meant to be consumed by humans.
As a result, in text data analysis and text mining, humans play a very important role.
They are always in the loop, meaning that we should optimize for a collaboration
between humans and computers. In that sense, it’s okay that computers may not
be able to have a completely accurate representation of text data. Patterns that
are extracted from text data can be interpreted by humans, and then humans can
guide the computers to do more accurate analysis by annotating more data, guiding
machine learning programs to make them work more effectively.
Different text representations tend to enable different analyses, as shown in Figure 3.4. In particular, we can gradually add deeper and deeper analysis results to represent text data, which opens up more interesting representation
opportunities and analysis capabilities. The table summarizes what we have just
seen; the first column shows the type of text representation while the second
visualizes the generality of such a representation. By generality, we mean whether
we can do this kind of representation accurately for all the text data (very general) or
only some of them (not very general). The third column shows the enabled analysis
techniques and the final column shows some examples of applications that can be
achieved with a particular level of representation.
As a sequence of characters, text can only be processed by string processing
algorithms. They are very robust and general. In a compression application, we
don’t need to know word boundaries (although knowing word boundaries might
actually help). Sequences of words (as opposed to characters) offer a very important
level of representation; it’s quite general and relatively robust, indicating that it
supports many analysis techniques such as word relation analysis, topic analysis,
and sentiment analysis. As you may expect, many applications can be enabled by
these kinds of analysis. For example, thesaurus discovery has to do with discovering
related words, and topic- and opinion-related applications can also be based on
word-level representation. People might be interested in knowing major topics
covered in the collection of text, where a topic is represented as a distribution over
words.
Moving down, we’ll see we can gradually add additional representations. By
adding syntactic structures, we can enable syntactic graph analysis; we can use
graph mining algorithms to analyze these syntactic graphs. For example, stylistic analysis could be supported by features derived from such syntactic structures.

3.4 Statistical Language Models
A statistical language model specifies a probability distribution over text, which allows us to answer questions such as the following.
. Given that we see John and feels, how likely will we see happy as opposed to
habit as the next word? Answering this question can help speech recognition
as happy and habit have very similar acoustic signals, but a language model
can easily suggest that John feels happy is far more likely than John feels habit.
. Given that we observe baseball three times and game once in a news article,
how likely is it about the topic “sports”? This will obviously directly help text
categorization and information retrieval tasks.
. Given that a user is interested in sports news, how likely would it be for the
user to use baseball in a query? This is directly related to information retrieval.
p(w | θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, . . . , food 0.00001, . . .
p(w | θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, . . . , text 0.00001, . . .
Figure 3.5 Two examples of unigram language models, representing two different topics.
It is easy to show that the ML estimate of a unigram language model gives each
word a probability equal to its relative frequency in D. That is,
p(w | θ̂) = c(w, D) / |D|,    (3.3)
where c(w, D) is the count of word w in D and |D| is the length of D, or total number
of words in D.
Such an estimate is optimal in the sense that it would maximize the probability
of the observed data, but whether it is really optimal for an application is still
questionable. For example, if our goal is to estimate the language model in the mind
of an author of a research article, and we use the maximum likelihood estimator
to estimate the model based only on the abstract of a paper, then it is clearly non-
optimal since the estimated model would assign zero probability to any unseen
words in the abstract, which would make the whole article have a zero probability
unless it only uses words in the abstract. Note that, in general, the maximum
likelihood estimate would assign zero probability to any unseen token or event in
the observed data; this is so because assigning a non-zero probability to such a
token or event would take away probability mass that could have been assigned
to an observed word (since all probabilities must sum to 1), thus reducing the
likelihood of the observed data. We will discuss various techniques for improving
the maximum likelihood estimator later by using techniques called smoothing.
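A minimal sketch of the maximum likelihood estimate in Equation (3.3) (our own illustration, using a naive whitespace tokenizer):

from collections import Counter

def mle_unigram(document):
    # Maximum likelihood unigram model: p(w) = c(w, D) / |D|.
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

theta = mle_unigram("text mining is the mining of useful knowledge from text")
print(theta["text"])           # 2/10 = 0.2
print(theta["mining"])         # 2/10 = 0.2
print(theta.get("food", 0.0))  # unseen word: zero probability

The zero probability assigned to the unseen word food is exactly the problem that smoothing, discussed later, is designed to address.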
Although extremely simple, a unigram language model is already very useful
for text analysis. For example, Figure 3.6 shows three different unigram language
models estimated on three different text data samples, i.e., a general English text
database, a computer science research article database, and a text mining research
paper. In general, the words with the highest probabilities in all the three models
are those functional words in English because such words are frequently used
in any text. After going further down on the list of words, one would see more
content-carrying and topical words. Such content words would differ dramatically
depending on the data to be used for the estimation, and thus can be used to
discriminate the topics in different text samples.
Unigram language models can also be used to perform semantic analysis of word
relations. For example, we can use them to find what words are semantically asso-
ciated with a word like computer. The main idea for doing this is to see what other
words tend to co-occur with the word computer. Specifically, we can first obtain a
sample of documents (or sentences) where computer is mentioned. We can then
estimate a language model based on this sample to obtain p(w | computer). This
model tells us which words occur frequently in the context of “computer.” However,
the most frequent words according to this model would likely be functional words
in English or words that are simply common in the data, but have no strong asso-
ciation with computer. To filter out such common words, we need a model for such
words which can then tell us what words should be filtered. It is easy to see that the
general English language model (i.e., a background language model) would serve the purpose well.

Figure 3.6 Three different unigram language models estimated from three different text samples: (B) general background English text, (C) computer science research papers, and (D) a text mining paper.

So we can use the background language model to normalize the
model p(w | computer) and obtain a probability ratio for each word. Words with
high ratio values can then be assumed to be semantically associated with computer
since they tend to occur frequently in its context, but not frequently in general. This
is illustrated in Figure 3.7.
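This normalization idea can be sketched as follows (our own illustration; the tiny document samples and the smoothing constant are placeholders, and a real system would estimate both models from much larger samples):

from collections import Counter

def unigram_counts(docs):
    counts = Counter(w for d in docs for w in d.lower().split())
    return counts, sum(counts.values())

def association_scores(context_docs, background_docs, eps=1e-6):
    # Score each word w by p(w | context) / p(w | background).
    ctx_counts, ctx_total = unigram_counts(context_docs)
    bg_counts, bg_total = unigram_counts(background_docs)
    scores = {}
    for w, c in ctx_counts.items():
        p_ctx = c / ctx_total
        p_bg = bg_counts.get(w, 0) / bg_total + eps   # avoid division by zero
        scores[w] = p_ctx / p_bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

context = ["the computer runs software", "software and hardware of the computer"]
background = ["the cat sat on the mat", "the software of the mind", "a computer"]
print(association_scores(context, background)[:3])

Words that are frequent in the context sample but rare in the background sample (such as hardware here) float to the top, while common function words such as the are pushed down.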
More applications of language models in text information systems will be fur-
ther discussed as their specific applications appear in later chapters. For example,
we can represent both documents and queries as being generated from some lan-
guage model. Given this background, however, the reader should have sufficient
information to understand the future chapters dealing with this powerful statisti-
cal tool.
Figure 3.7 Using topic language models and a background language model to find semantically
related words.
Exercises
3.1. In what way is NLP related to text mining?
3.2. Does poor NLP performance mean poor retrieval performance? Explain.
3.3. Given a collection of documents for a specific topic, how can we use maximum
likelihood estimation to create a topic unigram language model?
3.4. How might the size of a document collection affect the quality of a language
model?
3.5. Why might maximum likelihood estimation not be the best guess of parame-
ters for a topic language model?
3.6. Suppose we appended a duplicate copy of the topic collection to itself and
re-estimated a maximum likelihood language model. Would θ change?
3.7. A unigram language model as defined in this chapter can take a sequence of
words as input and output its probability. Explain how this calculation has strong
independence assumptions.
3.8. Given a unigram language model θ estimated from this book, and two doc-
uments d1 = information retrieval and d2 = retrieval information, then is
p(d1 | θ) > p(d2 | θ )? Explain.
3.9. An n-gram language model records sequences of n words. How does the num-
ber of possible parameters change if we decided to use a 2-gram (bigram) language
model instead of a unigram language model? How about a 3-gram (trigram) model?
Give your answer in terms of V , the unigram vocabulary size.
3.11. Sort the words by their probabilities from the previous exercise. If you used
a different text file, how would your sorted list be different? How would it be the
same?
4
META: A Unified Toolkit for Text Data Management and Analysis
This chapter introduces the accompanying software META, a free and open-source
toolkit that can be used to analyze text data. Throughout this book, we give hands-
on exercises with META to practice concepts and explore different text mining
algorithms.
Most of the algorithms and methods discussed in this book can be found in
some form in the META toolkit. If META doesn’t include a technique discussed in
this book, it’s likely that a chapter exercise is to implement this feature yourself! De-
spite being a powerful toolkit, META’s simplicity makes feature additions relatively
straightforward, usually through extending a short class hierarchy.
Configuration files are an integral part of META’s forward-facing infrastructure.
They are designed such that exploratory analysis usually requires no programming
effort from the user. By default, META is packaged with various executables that can
be used to solve a particular task. For example, for a classification experiment the
user would run the following command in their terminal1:
./classify [Link]
This is standard procedure for using the default executables; they take only one
configuration file parameter. The configuration file format is explained in detail
later in this chapter, but essentially it allows the user to select a dataset, a way
1. Running the default classification experiment requires a dataset to operate on. The 20news-
groups dataset is specified in the default META config file and can be downloaded here: https://
[Link]/data/[Link]. Place it in the meta/data/ directory.
to tokenize the dataset, and a particular classification algorithm to run (for this
example).
If more advanced functionality is desired, programming in C++ is required to
make calls to META’s API (applications programming interface). Both configuration
file and API usage are documented on META’s website, [Link] as
well as in this chapter. Additionally, a forum for META exists ([Link]
[Link]), containing discussion surrounding the toolkit. It includes user support
topics, community-written documentation, and developer discussions.
The sections that follow delve into a little more detail about particular aspects
of META so the reader will be comfortable working with it in future chapters.
Design Philosophy
4.1 META’s goal is to improve upon and complement the current body of open source
machine learning and information retrieval software. The existing environment of
this open source software tends to be quite fragmented.
There is rarely a single location for a wide variety of algorithms; a good example
of this is the LIBLINEAR [Fan et al. 2008] software package for SVMs. While this
is the most cited of the open source implementations of linear SVMs, it focuses
solely on kernel-less methods. If presented with a nonlinear classification problem,
one would be forced to find a different software package that supports kernels
(such as the same authors’ LIBSVM package [Chang and Lin 2011]). This places an
undue burden on the researchers and students—not only are they required to have
a detailed understanding of the research problem at hand, but they are now forced
to understand this fragmented nature of the open-source software community,
find the appropriate tools in this mishmash of implementations, and compile and
configure the appropriate tool.
Even when this is all done, there is the problem of data formatting—it is unlikely
that the tools have standardized upon a single input format, so a certain amount of
data preprocessing is now required. This all detracts from the actual task at hand,
which has a marked impact on the speed of discovery and education.
META addresses these issues. In particular, it provides a unifying framework
for text indexing and analysis methods, allowing users to quickly run controlled
experiments. It modularizes the feature generation, instance representation, data
storage formats, and algorithm implementations; this allows for researchers and
students to make seamless transitions along any of these dimensions with minimal
effort.
4.2 Setting up META
All future sections in this book will assume the reader has META downloaded and
installed. Here, we’ll show how to set up META.
META has both a website with tutorials and an online repository on GitHub.
To actually download the toolkit, users will check it out with the version control
software git in their command line terminal after installing any necessary prereq-
uisites.
The META website contains instructions for downloading and setting up the
software for a particular system configuration. At the time of writing this book, both
Linux and Mac OS are supported. Visit [Link]
and follow the instructions for the desired platform. We will assume the reader has
performed the steps listed in the setup guide and has a working version of META for
all exercises and demonstrations in this book.
There are two steps that are not mentioned in the setup guide. The first is to
make sure the reader has the version of META that was current when this book was
published. To ensure that the commands and examples sync up with the software
the reader has downloaded, we will ensure that META is checked out with version
2.2.0. Run the following command inside the meta/ directory:
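Assuming the release is tagged v2.2.0, matching the tag on the META releases page, the command would be:

git checkout v2.2.0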
The second step is to make sure that any necessary model files are also down-
loaded. These are available on the META releases page: [Link]
toolkit/meta/releases/tag/v2.2.0. By default, the model files should be placed in the
meta/build/ directory, but you can place them anywhere as long as the paths in
the config file are updated.
Once these steps are complete, the reader should be able to complete any exer-
cise or run any demo. If any additional files or information are needed, it will be
provided in the accompanying section.
4.3 Architecture
All processed data in META is stored in an index. There are two index types:
forward_index and inverted_index. The former is keyed by document IDs, and
the latter is keyed by term IDs.
forward_index is used for applications such as topic modeling and most clas-
sification tasks.
inverted_index is used to create search engines, or do classification with k-
nearest-neighbor or similar algorithms.
Since each META application takes an index as input, all processed data is
interchangeable between all the components. This also gives a great advantage to
classification: META supports out-of-core classification by default! If a dataset is
small enough (like most other toolkits assume), a cache such as no_evict_cache can be used to keep it all in memory without sacrificing any speed. (Index usage
is explained in much more detail in the search engine exercises.)
There are four corpus input formats.
libsvm_corpus. If only being used for classification, META can also take
LIBSVM-formatted input to create a forward_index. There are many ma-
chine learning datasets available in this format on the LIBSVM site.2
Filters come next, and can be chained together. They define ways that text can
be modified or transformed. Here are some examples of filters.
length_filter. this filter accepts tokens that are within a certain length and
rejects those that are not.
icu_filter. applies an ICU (International Components for Unicode)3 translit-
eration to each token in the sequence. For example, an accented character like ï is instead written as i.
2. [Link]
3. [Link] note that different versions of ICU will tokenize text in slightly
different ways!
list_filter. this filter either accepts or rejects tokens based on a list. For
example, one could use a stop word list and reject stop words.
porter2_stemmer. this filter transforms each token according to the Porter2
English Stemmer rules.4
Analyzers operate on the output from the filter chain and produce token counts
from documents. Here are some examples of analyzers.
META defines a sane default filter chain that users are encouraged to use for
general text analysis in the absence of any specific requirements. To use it, one
should specify the following in the configuration file:
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-chain"
This configures the text analysis process to consider unigrams of words gener-
ated by running each document through the default filter chain. This filter chain
should work well for most languages, as all of its operations (including but not lim-
ited to tokenization and sentence boundary detection) are defined in terms of the
Unicode standard wherever possible.
4. [Link]

To consider both unigrams and bigrams, the configuration file should look like the following:
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-chain"
[[analyzers]]
method = "ngram-word"
ngram = 2
filter = "default-chain"
Each [[analyzers]] block defines a single analyzer and its corresponding fil-
ter chain: as many can be used as desired—the tokens generated by each analyzer
specified will be counted and placed in a single sparse vector of counts. This is
useful for combining multiple different kinds of features together into your doc-
ument representation. For example, the following configuration would combine
unigram words, bigram part-of-speech tags, tree skeleton features, and subtree
features.
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-chain"
[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "path/to/crf/model"
[[analyzers]]
method = "tree"
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
features = ["skel", "subtree"]
tagger = "path/to/greedy-tagger/model"
parser = "path/to/sr-parser/model"
If an application requires specific text analysis operations, one can specify di-
rectly what the filter chain should look like by modifying the configuration file.
Instead of filter being a string parameter as above, we will change filter to look very
much like the [[analyzers]] blocks: each analyzer will have a series of [[analyzers.filter]] blocks, each of which defines a step in the filter chain. All filter
chains must start with a tokenizer. Here is an example filter chain for unigram
words like the one at the beginning of this section:
[[analyzers]]
method = "ngram-word"
ngram = 1
[[analyzers.filter]]
type = "icu-tokenizer"
[[analyzers.filter]]
type = "lowercase"
[[analyzers.filter]]
type = "length"
min = 2
max = 35
META provides many different classes to support building filter chains. Please
look at the API documentation5 for more information. In particular, the analyz-
ers::tokenizers namespace and the analyzers::filters namespace should
give a good idea of the capabilities. The static public attribute id for a given class
is the string needed for the “type” in the configuration file.
4.5 Related Toolkits
Existing toolkits supporting text management and analysis tend to fall into two
categories. The first is search engine toolkits, which are especially suitable for
building a search engine application, but tend to have limited support for text
analysis/mining functions. Examples include the following.
Lucene. [Link]
Terrier. [Link]
Indri/Lemur. [Link]
The second is text mining or general data mining and machine learning toolkits,
which tend to selectively support some text analysis functions, but generally do not
support search capability. Examples include the following.
Weka. [Link]
LIBSVM. [Link]
Stanford NLP. [Link]
Illinois NLP Curator. [Link]
Scikit Learn. [Link]
NLTK. [Link]
5. Visit [Link]
Exercises
In its simplest form, text data could be a single document in .txt format. This
exercise will get you familiar with various techniques that are used to analyze
text. We’ll use the novel A Tale of Two Cities by Charles Dickens as example text.
The book is called [Link], and is located at [Link]
textdatabook/[Link]. You can also use any of your own plaintext files that
have multiple English sentences.
Like all future exercises, we will assume that the reader followed the META setup
guide and successfully compiled the executables. In this exercise, we’ll only be
using the profile program. Running ./profile from inside the build/ directory
will print out the following usage information:
If running ./profile prints out this information, then everything has been
set up correctly. We’ll look into what each of these options mean in the following
exercises.
4.1. Stop Word Removal. Consider the following words: I, the, of, my, it, to, from.
If it was known that a document contained these words, would there be any idea
what the document was about? Probably not. These types of words are called stop
words. Specifically, they are very high frequency words that do not contain content
information. They are used because they’re grammatically required, such as when
connecting sentences.
Since these words do not contain any topical information, they are often removed
as a preprocessing step in text analysis. Not only are these (usually) useless words
ignored, but having less data can mean that algorithms run faster!
Now, use the profile program to remove stop words from the document two-
[Link]. Can you still get an idea of what the book is about without these words
present?
4.2. Stemming. Stemming is the process of reducing a word to a base form.
This is especially useful for search engines. If a user wants to find books about
running, documents containing the word run or runs would not match. If we apply
a stemming algorithm to a word, it is more likely that other forms of the word will
match it in an information retrieval task.
The most popular stemming algorithm is the Porter2 English Stemmer, devel-
oped by Martin Porter. It is a slightly improved version from the original Porter
Stemmer from 1980. Some examples are:
META uses the Porter2 stemmer by default. You can read more about the Porter2
stemmer here: [Link] An
online demo of the stemmer is also available if you’d like to play around with it:
[Link]
Now that you have an idea of what stemming is, run the stemmer on A Tale of
Two Cities.
Like stop word removal, stemming tries to keep the basic meaning behind the
original text. Can you still make sense of it after it’s stemmed?
Note that POS tagging the book may take up to one minute to complete. Does
it look like META’s POS tagger is accurate? Can you find any mistakes? When
replacing the words with their tags, is it possible to determine what the original
sentence was? Experiment with the book or any other text file.
The leaves of the parse tree are the words in the sentence, and the preterminals (the direct parents of the leaves) are part-of-speech tags.
Some common features from a parse tree are production rules such as S → NP VP, tree depth, and structural tree features. Syntactic categories (node labels)
alone can also be used.
The following command runs the parser on each sentence in the input file:
Like POS-tagging, the parsing may also take a minute or two to complete.
2-grams (bigrams):
3-grams (trigrams):
This will give the output file [Link] for the option --freq-unigram, and so on.
What makes the output reasonably clear? Think back to stop words and stem-
ming. Removing stop words gets rid of the noisy high-frequency words that don’t
give any information about the content of the document. Stemming will aggre-
gate inflected words into a single count. This means the partial vector {run : 4,
running : 2, runs : 3} would instead be represented as {run : 9}. Not only does this
make it easier for humans to interpret the frequency analysis, but it can improve
text mining algorithms, too!
4.6. Zipf’s Law. In English, the top four most frequent words are about 10–15%
of all word occurrences. The top 50 words are 35–40% of word occurrences. In fact,
there is a similar trend in any human language. Think back to the stop words. These
are the most frequent words, and make up a majority of text. At the same time, many
words may only appear once in a given document.
We can plot the rank of a word on the x axis, and the frequency count on the y
axis. Such a graph can give us an idea of the word distribution in a given document
or collection. In Figure 4.2, we counted unigram words from another Dickens book,
Oliver Twist. The plot on the left is a normal x ∼ y plot and the one on the right is
a log x ∼ log y plot.
Figure 4.2 Word frequency versus word frequency rank for the unigram words of Oliver Twist: a linear-scale plot (left) and a log-log plot (right).
Zipf’s law describes the shape of these plots. What do you think Zipf’s law states?
The shape of these plots allows us to apply certain techniques to take advantage of
the word distribution in natural language.
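To examine this distribution yourself, here is a rough sketch (our own illustration; book.txt is just a placeholder name for any plain-text file):

import re
from collections import Counter

with open("book.txt", encoding="utf-8") as f:   # placeholder file name
    words = re.findall(r"[a-z']+", f.read().lower())

freqs = sorted(Counter(words).values(), reverse=True)   # frequency by rank
top = sum(freqs[:50]) / sum(freqs)
print(f"top 50 words account for {top:.0%} of all word occurrences")

# Printing (rank, frequency) pairs; plotted on log-log axes, these points
# should fall roughly on a straight line if the text follows Zipf's law.
for rank, count in enumerate(freqs[:10], start=1):
    print(rank, count)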
PART II
In pull mode, the user initiates the access process to find the relevant text data,
typically by using a search engine. This mode of text access is essential when a user
has an ad hoc information need, i.e., a temporary information need that might
disappear once the need is satisfied. In such a case, the user can use a query to
find relevant information with a search engine. For example, a user may have a
need to buy a product and thus be interested in retrieving all the relevant opinions
about candidate products; after the user has purchased the product, the user would
generally no longer need such information. Another example is that during the
process of analyzing social media data to understand opinions about an emerging
event, the analyst may also decide to explore information about a particular entity
related to the event (e.g., a person), which can also trigger a search activity.
While querying is the most common way of accessing text data in the pull mode,
browsing is another complementary way of accessing text data in the pull mode, and
can be very useful when a user does not know how to formulate an effective query,
or finds it inconvenient to enter a keyword query (e.g., through a smartphone), or
simply wants to explore a topic with no fixed goal. Indeed, when searching the Web,
users tend to mix querying and browsing (e.g., while traversing through hyperlinks).
In general, we may regard querying and browsing as two complementary ways
of finding relevant information in the information space. Their relation can be
understood by making an analogy between information seeking and sightseeing
in a physical world. When a tourist knows the exact address of an attraction, the
tourist can simply take a taxi directly to the attraction; this is similar to when a
user knows exactly what he or she is looking for and can formulate a query with the
“right keywords,” which would bring to the user relevant pages directly. However,
if a tourist doesn’t know the exact address of an attraction, the tourist may want to
take a taxi to an approximate location and then walk around to find the attraction.
Similarly, if a user does not have a good knowledge about the target pages, he or
she can also use an approximate query to reach some related pages and then browse
into truly relevant information. Thus, when querying does not work well, browsing
can be very useful.
In the push mode, the system initiates the process to recommend a set of
relevant information items to the user. This mode of information access is generally
more useful to satisfy a long-standing information need of a user or analyst. For
example, a researcher’s research interests can be regarded as relatively stable over
time. In comparison, the information stream (i.e., published research articles)
is dynamic. In such a scenario, although a user can regularly search for relevant
literature information with queries, it is more desirable for a recommender (also
called filtering) system to monitor the dynamic information stream and “push” any
relevant articles to the user based on the matching of the articles with the user’s
interests (e.g., in the form of an email). In some long-term analytics applications,
it would also be desirable to use the push mode to monitor any relevant text data
(such as relevant social media) about a topic related to the application.
Another scenario of push mode is producer-initiated recommendation, which
can be more appropriately called selective dissemination of information (SDI). In
such a scenario, the producer of information has an interest in disseminating
the information among relevant users, and would push an information item to
such users. Advertising of product information on search result pages is such an
example. The recommendation can be delivered through email notifications or
shown on a search engine result page.
There are broadly two kinds of information needs: short-term need and long-
term need. Short-term needs are most often associated with pull mode, and long-
term needs are most often associated with push mode. A short-term information
need is temporary and usually satisfied through search or navigation in the informa-
tion space, whereas a long-term information need can be better satisfied through
filtering or recommendation where the system would take the initiative to push
the relevant information to a user. Ad hoc retrieval is extremely important because
ad hoc information needs show up far more frequently than long-term informa-
tion needs. The techniques effective for ad hoc retrieval can usually be re-used for
filtering and recommendation as well. Also, in the case of long-term information
needs, it is possible to collect user feedback, which can be exploited. In this sense,
Figure 5.2 Sample interface of browsing with a topic map where browsing and querying are
naturally integrated.
Figure 5.2 shows how a topic map can be integrated into a search engine interface to enable a user to browse the information space flexibly. With this
interface, a user can do any of the following at any moment.
Querying (long-range jump). When a user submits a new query through the
search box the search results from a search engine will be shown in the right
pane. At the same time, the relevant part of a topic map is also shown on the
left pane to facilitate browsing should the user want to.
Navigating on the map (short-range walk). The left pane in our interface is to
let a user navigate on the map. When a user clicks on a map node, this pane
will be refreshed and a local view with the clicked node as the current focus
will be displayed. In the local view, we show the parents, the children, and
the horizontal neighbors of the current node in focus (labelled as “center” in
the interface). A user can thus zoom into a child node, zoom out to a parent
node, or navigate into a horizontal neighbor node. The number attached to
a node is a score for the node that we use for ranking the nodes. Such a map
enables the user to “walk” in the information space to browse into relevant
documents without needing to reformulate queries.
Viewing a topic region. The user may double-click on a topic node on the map
to view the documents covered in the topic region. The search result pane
would be updated with new results corresponding to the documents in the
selected topic region. From a user’s perspective, the result pane always shows
the documents in the current region that the user is focused on (either search
results of the query or the documents corresponding to a current node on the
map when browsing).
Viewing a document. Within the result pane, a user can select any document
to view as in a standard search interface.
In Figure 5.3, we further show an example trace of browsing in which the user
started with a query dining table, zoomed into asian dining table, zoomed out back to
dining table, browsed horizontally first to dining chair and then to dining furniture,
and finally zoomed out to the general topic furniture where the user would have
many options to explore different kinds of furniture. If this user feels that a “long-
jump” is needed, he or she can use a new query to achieve it. Since the map can be
hidden and only brought to display when the user needs it, such an interface is a very
natural extension of the current search interface from a user’s perspective. Thus,
we can see how one text access system can combine multiple modes of information
access to suit a user’s current needs.
Figure 5.3 A sample trace of browsing showing how a user can navigate in the information space
without querying.
5.3 Text Retrieval
The most important tool for supporting text data access is a search engine, which
is why web search engines are used by many people on a daily basis. Search engines
directly provide support for querying and can be easily extended to provide recom-
mendation or browsing. Moreover, the techniques used to implement an effective
search engine are often also useful for implementation of a recommender system
as well as many text analysis functions. We thus devote a large portion of this book
to discussing search engine techniques.
In this section, we discuss the problem of text retrieval (TR), which is solved
by developing a search engine system. We specify the differences between un-
structured TR and structured database retrieval. We then make an argument for
document ranking as opposed to document selection. This provides a basis for us
to discuss in the next chapter how to rank documents for a query.
From a user’s perspective, the problem of TR is to use a query to find relevant
documents in a collection of text documents. This is a frequently needed task
because users often have temporary ad hoc information needs for various tasks,
and would like to find the relevant information immediately. The system to support
TR is a text retrieval system, or a search engine.
Although TR is sometimes used interchangeably with the more general term “in-
formation retrieval” (IR), the latter also includes retrieval of other types of informa-
tion such as images or videos. It is worth noting, though, that retrieval techniques
for other non-textual data are less mature and, as a result, retrieval of other types
of information tends to rely on using text retrieval techniques to match a keyword
query with the companion text data associated with a non-textual data element. For example, the
current image search engines on the Web are essentially TR systems where each
image is represented by a text document consisting of any associated text data with
the image (e.g., title, caption, or simply textual context of the image such as the
news article content where an image is included).
In industry, the problem of TR is generally referred to as the search problem,
and the techniques for text retrieval are often called search technology or search
engine technology.
The task of TR can be easy or hard, depending on specific queries and specific
collections. For example, during a web search, finding homepages is generally easy,
but finding out people’s opinions about some topic (e.g., U.S. foreign policy) would
be much harder. There are several reasons why TR is difficult:
. a query is usually quite short and incomplete (no formal language like SQL);
. the information need may be difficult to describe precisely, especially when
the user isn’t familiar with the topic, and
. precise understanding of the document content is difficult. In general, since
what counts as the correct answer is subjective, even when human experts
judge the relevance of documents, they may disagree with each other.
Due to the lack of clear semantic structures and difficulty in natural language
understanding, it is often challenging to accurately retrieve relevant information to
a user’s query. Indeed, even though the current web search engines may appear to
be sufficient sometimes, it may still be difficult for a user to quickly locate and har-
vest all the relevant information for a task. In general, the current search engines
work very well for navigational queries and simple, popular informational queries,
but in the case where a user has a complex information need such as analyzing
opinions about products to buy, or researching medical information about some
symptoms, they often work poorly. Moreover, the current search engines generally
provide little or no support to help users digest and exploit the retrieved informa-
tion. As a result, even if a search engine can retrieve the most relevant information,
a user would still have to sift through a long list of documents and read them in
detail to fully digest the knowledge buried in text data in order to perform their task
at hand. The techniques discussed later in this book can be exploited to help users
digest the found information quickly or directly analyze a large amount of text data
to reveal useful and actionable knowledge that can be used to optimize decision
making or help a user finish a task.
1. Although common parlance refers to text as unstructured with a meaningful contrast with
relational database structuring, it employs a narrow sense of “structure.” For example, from a
linguistics perspective, grammar provides well-defined structure. To study this matter further,
see the 5S (societies, scenarios, spaces, structures, and streams) works by Fox et al. [2012].
5.4 Text Retrieval vs. Database Retrieval
When looking for literature relevant to a research problem, for example, the user is unlikely
to be able to clearly and completely specify which documents should be returned.
Finally, the expected results in the two applications are also different. In data-
base search, we can retrieve very specific data elements (e.g., specific columns); in
TR, we are generally only able to retrieve a set of relevant documents. With passages
or fields identified in a text document, a search engine can also retrieve passages,
but it is generally difficult to retrieve specific entities or attribute values as we can in
a database. This difference is not as essential as the difference in the vague specifi-
cation of what exactly is the “correct” answer to a query, but is a direct consequence
of the vague information need in TR.
Due to these differences, the challenges in building a useful database and a
useful search engine are also somewhat different. In databases, since what items
should be returned is clearly specified, there is no challenge in determining which
data elements satisfy the user’s query and thus should be returned; a major re-
maining challenge is how to find the answers as quickly as possible especially
when there are many queries being issued at the same time. While the efficiency
challenge also exists in a search engine, a more important challenge there is to
first figure out which documents should be returned for a query before worrying
about how to return the answers quickly. In database applications—at least tradi-
tional database applications—it is also very important to maintain the integrity
of the data; that is, to ensure no inconsistency occurs due to power failure. In
TR, modeling a user’s information need and search tasks is important, again due
to the difficulty for a user to clearly specify information needs and the difficulty
in NLP.
Since what counts as the best answer to a query depends on the user, in TR,
the user is actually part of our input (together with the query, and document
set). Thus, there is no mathematical way to prove that one answer is better than
another or prove one method is better than another. Instead, we always have to
rely on empirical evaluation using some test collections and users. In contrast, in
database research, since the main issue is efficiency, one can prove that one algorithm
is better than another by analyzing its computational complexity or by running a
simulation study. Note, however, that when running a simulation study (to determine
which algorithm is faster), we also face the same problem as in text retrieval—the
simulation may not accurately reflect the real applications. Thus, an algorithm
shown to be faster with simulation may not be actually faster for a particular
application. Similarly, a retrieval algorithm shown to be more effective with a test
collection may turn out to be less effective for a particular application or even
another test collection. How to reliably evaluate retrieval algorithms is itself a
challenging research topic.
82 Chapter 5 Overview of Text Data Access
Because of these differences, the two fields have traditionally been studied in differ-
ent communities with a different application basis. Databases have had widespread
applications in virtually every domain with a well-established strong industry. The
IR community that studies text retrieval has been an interdisciplinary community
involving library and information science and computer science, but had not had
a strong industry base until the Web was born in the early 1990s. Since then, the
search engine industry has dominated, and as more and more online information
is available, the search engine technologies (which include TR and other technical
components such as machine learning and natural language processing) will con-
tinue to grow. Soon, search technologies may be as widely used as database
technologies. Furthermore, because of the inherent similarity between database
search and TR, because both efficiency and effectiveness (accuracy) are important,
and because most online data has text fields as well as some kind of structure, the
two fields are now moving closer and closer to each other, leading to some common
fundamental questions such as: “What should be the right query language?”; “How
can we rank items accurately?”; “How do we find answers quickly?”; and “How do
we support interactive search?”
Perhaps the most important conclusion from this comparison is that the prob-
lem of text retrieval is an empirically defined problem. This means that which method
works better cannot be answered by pure analytical reasoning or mathematical
proofs. Instead, it has to be evaluated empirically with users, which makes evaluating
the effectiveness of a search engine a significant challenge. This is also the reason
why a significant amount of effort has been spent in research of TR evaluation since
it was initially studied in the 1960s. The evaluation methodology of TR remains an
important open research topic today; we discuss it in detail in Chapter 9.
A query is often typed in by a user of a search engine system, and users generally do not
want to make much effort to type in many words. However, this is not always the
case. For example, in a Twitter search, each document is a tweet which is very short,
and a user may also cut and paste a text segment from an existing document as a
query, which can be very long. Our text collection C = {d1 , . . . , dM } is a set of text
documents. In general, we may assume that there exists a subset of documents in
the collection, i.e., R(q) ⊂ C, which are relevant to the user’s query q; that is, they
are relevant documents or documents useful to the user who typed in the query.
Naturally, this relevant set depends on the query q. However, which documents are
relevant is generally unknown; the user’s query is only a “hint” at which documents
should be in the set R(q). Furthermore, different users may use the same query
to intend to retrieve somewhat different sets of relevant documents (e.g., in an
extreme case, a query word may be ambiguous). This means that it is unrealistic
to expect a computer to return exactly the set R(q), unlike the case in database
search where this is feasible. Thus, the best a computer can do is to return an
approximation of R(q), which we will denote by R′(q).
Now, how can a computer compute R′(q)? At a high level, there are two alternative
strategies: document selection vs. document ranking. In document selection,
we will implement a binary classifier to classify a document as either relevant or
non-relevant with respect to a particular query. That is, we will design a binary
classification function, or an indicator function, f(q, d) ∈ {0, 1}. If f(q, d) = 1, d would
be assumed to be relevant, whereas if f(q, d) = 0, it would be non-relevant. Thus,
R′(q) = {d | f(q, d) = 1, d ∈ C}. Using such a strategy, the system must estimate the
absolute relevance, i.e., whether a document is relevant or not.
An alternative strategy is to rank documents and let the user decide a cutoff. That
is, we will implement a ranking function f(q, d) ∈ ℝ and rank all the documents in
descending order of this ranking function. A user would then browse the ranked
list and stop whenever they consider it appropriate. In this case, the set R′(q) is
actually defined partly by the system and partly by the user, since the user would
implicitly choose a score threshold θ based on the rank position where he or she
stopped. In this case, R′(q) = {d | f(q, d) ≥ θ}. Using this strategy, the system only
needs to estimate the relative relevance of documents: which documents are more
likely relevant.
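To make the two strategies concrete, here is a minimal sketch in Python (illustrative only; the scoring function score(q, d) and the document collection are assumed to be given and are not defined in this chapter):

    def select_documents(query, collection, score, threshold):
        # Document selection: an absolute, binary decision per document,
        # i.e., approximate R'(q) = {d | f(q, d) = 1} by thresholding a score.
        return [d for d in collection if score(query, d) >= threshold]

    def rank_documents(query, collection, score):
        # Document ranking: sort all documents by descending score and let the
        # user decide where to stop reading (implicitly choosing theta).
        return sorted(collection, key=lambda d: score(query, d), reverse=True)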
Since estimation of relative relevance is intuitively easier than that of absolute
relevance, we can expect it to be easier to implement the ranking strategy. Indeed,
ranking is generally preferred to document selection for multiple reasons.
First, due to the difficulty for a user to prescribe the exact criteria for selecting
relevant documents, the binary classifier is unlikely to be accurate.
So the problem is the following. We have a query that has a sequence of words,
and a document that’s also a sequence of words, and we hope to define the func-
tion f(q, d) that can compute a score based on the query and the document. The main
challenge is designing a good ranking function that can rank all the relevant docu-
ments on top of all the non-relevant ones. Now clearly this means our function must
be able to measure the likelihood that a document d is relevant to a query q. That
also means we have to have some way to define relevance. In particular, in order to
implement the program to do that, we have to have a computational definition of
relevance, and we achieve this goal by designing a retrieval model which gives us a
formalization of relevance. We introduce retrieval models in Chapter 6.
A classic reference is the information retrieval book by van Rijsbergen [1979]. Hearst [2009] has a systematic
discussion of user interfaces of a search system, which is relevant to the design of
interfaces for any information system in general; in particular, many visualization
techniques that can facilitate browsing and querying are discussed in the book. Ex-
ploratory search is a particular type of search tasks that often requires multimodal
information access including both querying and browsing. It was covered in a spe-
cial issue of Communications of ACM [White et al. 2006], and White and Roth [2009].
The probability ranking principle [Robertson 1997] is generally regarded as the the-
oretical foundation for framing the retrieval problem as a ranking problem. More
historical work related to this, as well as a set of important research papers in IR up
to 1997, can be found in Readings in Information Retrieval [Sparck Jones and Willett
1997]. A brief survey of IR history can be found in Sanderson and Croft [2012].
Exercises
5.1. When might browsing be preferable to querying?
5.2. Given search engine user logs, how could you distinguish between browsing
behavior and querying behavior?
5.3. Often, push and pull modes are combined in a single system. Give an example
of such an application.
5.4. Imagine you have search engine session logs from users that you know are
browsing for information. How can you use these logs to enhance search results of
future users with ad hoc information needs?
5.5. In Chapter 11, we will discuss recommender systems. These are systems in
push mode that deliver information to users. What are some specific applications
of recommender systems? Can you name some services available to you that fit into
this access mode?
5.6. How could a recommender system (push mode) be coupled with a search
engine (pull mode)? Can these two services mutually enhance one another?
5.7. Design a text information system used to explore musical artists. For example,
you can search for an artist’s name directly. The results are displayed as a graph,
with edges to similar artists (as measured by some similarity algorithm). Use TIS
access mode vocabulary to describe this system and any enhancements you could
make to satisfy different information needs.
5.8. In the same way as the previous question, categorize “Google knowledge
graph” ([Link]).
5.9. In the same way as the previous two questions, categorize “citation alerts.”
These are alerts that are based on previous search history in an academic search
engine. When new papers are found that are potentially interesting to the user
based on their browsing history, an alert is created.
5.10. One assumption of the probability ranking principle is that each document’s
usefulness to the user is independent of the usefulness of other documents in the
index. What is a scenario where this assumption does not hold?
6 Retrieval Models
In this chapter, we introduce the two main information retrieval models: vector
space and query likelihood, which are among the most effective and practically
useful retrieval models. We begin with a brief overview of retrieval models in general
and then discuss the two basic models, i.e., the vector space model and the query
likelihood model.
6.1 Overview
Over many decades, researchers have designed various different kinds of retrieval
models which fall into different categories (see Zhai [2008] for a detailed review).
First, one family of the models are based on the similarity idea. Basically, we assume
that if a document is more similar to the query than another document is, we would
say the first document is more relevant than the second one. So in this case, the
ranking function is defined as the similarity between the query and the document.
One well-known example of this case is the vector space model [Salton et al. 1975],
which we will cover more in detail later in this chapter.
The second set of models are called probabilistic retrieval models [Lafferty and
Zhai 2003]. In this family of models, we follow a very different strategy. We assume
that queries and documents are all observations from random variables, and we
assume there is a binary random variable called R (with a value of either 1 or 0) to
indicate whether a document is relevant to a query. We then define the score of a
document with respect to a query as the probability that this random variable R is
equal to 1 given a particular document and query. There are different cases of such a
general idea. One is the classic probabilistic model, which dates back to work done
in the 1960s and 1970s [Maron and Kuhns 1960, Robertson and Sparck Jones 1976],
another is the language modeling approach [Ponte and Croft 1998], and yet another
is the divergence-from-randomness model [Amati and Van Rijsbergen 2002]. We
will cover a particular language modeling approach called query likelihood retrieval
model in detail later in this chapter. One of the most effective retrieval models
derived from the classic probabilistic retrieval framework is BM25 [Robertson and
Zaragoza 2009], but since the retrieval function of BM25 is so similar to a vector
space retrieval model, we have chosen to cover it as a variant of the vector space
model.
The third kind of model is probabilistic inference [Turtle and Croft 1990]. Here
the idea is to associate uncertainty to inference rules. We can then quantify the
probability that we can show that the query follows from the document. This family
of models is theoretically appealing, but in practice it is often reduced to
models essentially similar to the vector space model or a regular probabilistic retrieval
model.
Finally, there is also a family of models that use axiomatic thinking [Fang et al.
2011]. The idea is to define a set of constraints that we hope a good retrieval function
satisfies. In this case the problem is to find a good ranking function that can satisfy
all the desired constraints. Interestingly, although all these models are based on
different thinking, in the end the retrieval functions tend to be very similar and
involve similar variables. The axiomatic retrieval framework has proven effective
for diagnosing deficiencies of a retrieval model and developing improved retrieval
models accordingly (e.g., BM25+ [Lv and Zhai 2011]).
Although many models have been proposed, very few have survived extensive ex-
perimentation and proven to be effective and robust. In this book, we have chosen to
cover four specific models (i.e., BM25, pivoted length normalization, query likeli-
hood with JM smoothing, and query likelihood with Dirichlet prior smoothing) that
are among the very few most effective and robust models.1
1. PL2 is another very effective model that the readers should also know of [Amati and Van Rijs-
bergen 2002].
6.2 Common Form of a Retrieval Function
Figure 6.1 Illustration of common ideas for scoring with a bag-of-words representation.
We can see there are three different components, each corresponding to how
well the document matches each of the query words. Inside of these functions, we
see a number of heuristics. For example, one factor that affects the function g is how
many times the word presidential occurs in each document. This is called a term
frequency (TF). We might also denote this as c(presidential, d). In general, if the
word occurs more frequently in the document, the value of this function would be
larger. Another factor is the document length. In general, if a term occurs in a long
document many times, it is not as significant as if it occurred the same number of
times in a short document (since any term is expected to occur more frequently
in a long document). Finally, there is a factor called document frequency. This
counts how many documents in the entire collection contain presidential at least
once. We call this the document frequency, or DF, of presidential. DF attempts
to characterize the popularity of the term in the collection. In general, matching a
rare term in the collection contributes more to the overall score than matching a
common term. TF, DF, and document length capture some of the main ideas used
in pretty much all state-of-the-art retrieval models. In some other models we might
also use a probability to characterize this information.
A natural question is: Which model works the best? It turns out that many
models work equally well, so here we list the four major models that are generally
regarded as state-of-the-art, i.e., the four models introduced in the overview: BM25,
pivoted length normalization, query likelihood with JM smoothing, and query
likelihood with Dirichlet prior smoothing.
Figure 6.2 Illustration of documents plotted in vector space. (Courtesy of Marti Hearst)
In this vector space, d2 is closest to the query q, and therefore d2 will be ranked above the others. This is the main idea of the vector
space model.
To be more precise, the VS model is a framework. In this framework, we make
some assumptions. One assumption is that we represent each document and query
by a term vector. Here, a term can be any basic concept such as a word or a
phrase, or even n-grams of characters or any other feature representation. Each
term is assumed to define one dimension. Therefore, since we have |V | terms in
our vocabulary, we define a |V |-dimensional space. A query vector would consist
of a number of elements corresponding to the weights of different terms. Each
document vector is similar; it has a number of elements, and the value of each
element indicates the weight of the corresponding term. The relevance in
this case is measured based on the similarity between the two vectors. Therefore,
our retrieval function is also defined as the similarity between the query vector and
document vector.
Now, if you were asked to write a program to implement this approach for a
search engine, you would realize that this explanation is far from complete. We
haven't specified many details, so it's impossible to actually write the
program to implement this. That's why this is called the vector space retrieval
framework. It has to be refined in order to actually suggest a particular function
that can be implemented on a computer. First, it did not say how to define or select
the basic concepts (terms). We clearly assume the concepts are orthogonal, otherwise
there will be redundancy. For example, if two synonyms are somehow distinguished
as two different concepts, they would be defined in two different dimensions,
causing a redundancy or overemphasis of matching this concept (since it would be
as if you matched two dimensions when you actually matched only one semantic
concept). Second, it did not say how to place documents and queries in this vector
space. We saw some examples of query and document vectors, but where exactly
should the vector for a particular document point to? This is equivalent to how to
define the term weights. This is a very important question because the term weight
in the query vector indicates the importance of a term; depending on how you assign
the weight, you might prefer some terms to be matched over others. Similarly,
term weight in the document is also very meaningful—it indicates how well the
term characterizes the document. If many nonrelevant documents are returned by
a search engine using this model, then the chosen terms and weights must not
represent the documents accurately. Finally, how to define the similarity measure is
also unclear. These questions must be addressed before we can have an operational
function that we can actually implement using a programming language. Solving
these problems is the main topic of the next section.
This is only one of the many different ways of computing the similarity. So, we’ve
defined the dimensions, the vector space, and the similarity function; we finally
have the simplest instantiation of the vector space model! It’s based on the bit
vector representation, dot product similarity, and bag of words instantiation. Now
we can finally implement this ranking function using a programming language and
then rank documents in our corpus given a particular query.
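As a minimal sketch (not the book's code; it assumes simple whitespace tokenization and lowercasing), the bit-vector model with dot-product similarity reduces to counting the distinct query words that appear in a document:

    def bit_vector_score(query, document):
        # Dot product of 0/1 vectors = number of distinct query words present in the document.
        query_words = set(query.lower().split())
        doc_words = set(document.lower().split())
        return len(query_words & doc_words)

    # Rank a toy corpus for the query used in the running example.
    docs = ["news about organic food campaign",
            "news of presidential campaign"]
    query = "news about presidential campaign"
    ranked = sorted(docs, key=lambda d: bit_vector_score(query, d), reverse=True)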
We’ve gone through the process of modeling the retrieval problem using a vector
space model. Then, we made assumptions about how we place vectors in the vector
space and how we define the similarity. In the end, we’ve got a specific retrieval
function shown in Figure 6.3. The next step is to think about whether this individual
function actually makes sense. Can we expect this function will actually perform
well? It’s worth thinking about the value that we are calculating; in the end, we’ve
got a number, but what does this number mean? Please take a few minutes to think
about that before proceeding to the next section.
Figure 6.5 Computation of the bit vector retrieval model on a sample query and corpus.
Intuitively, d3 and d4 should be ranked above the others since these two really cover the query well. They match news, presidential,
and campaign, so they should be ranked on top. The other three, d1, d2, and d5, are
non-relevant.
Let’s see if our vector space model could do the same or could do something
close to our ideal ranking. First, think about how we actually use this model to
score documents. In Figure 6.5, we show two documents, d1 and d3, and we have
the query here also. In the vector space model, we want to first compute the vectors
for these documents and the query. The query has four words, so for these four
words, there would be a one and for the rest there will be zeros. Document d1 has
two ones, news and about, while the rest of the dimensions are zeros. Now that
we have the two vectors, we can compute the similarity with the dot product by
multiplying the corresponding elements of the two vectors. Each pair of corresponding
elements contributes a product to the sum. We actually don't
have to care about the zeros in each vector, since any product involving a zero will be zero.
So, when we take a sum over all these pairs, we’re just counting how many pairs of
ones there are. In this case, we have seen two, so the result will be two. That means
this number is the value of this scoring function; it’s simply the count of how many
unique query terms are matched in the document. This is how we interpret the
score. Now we can also take a look at d3. In this case, you can see the result is
three because d3 matched the three distinct query words news, presidential, and
campaign, whereas d1 only matched two. Based on this, d3 is ranked above d1. That
looks pretty good. However, if we examine this model in detail, we will find some
problems.
Figure 6.6 Ranking of example documents using the simple vector space model.
In Figure 6.6, we show all the scores for these five documents. The bit vector
scoring function counts the number of unique query terms matched in each docu-
ment. If a document matches more unique query terms, then the document will be
assumed to be more relevant; that seems to make sense. The only problem is that
there are three documents, d2, d3, and d4, that are tied with a score of three. Upon
closer inspection, it seems that d4 should be right above d3 since d3 only mentioned
presidential once while d4 mentioned it many more times. Another problem is that
d2 and d3 also have the same score since for d2, news, about, and campaign were
matched. In d3, it matched news, presidential, and campaign. Intuitively, d3 is more
relevant and should be scored higher than d2. Matching presidential is more impor-
tant than matching about even though about and presidential are both in the query.
But this model doesn’t do that, and that means we have to solve these problems.
To summarize, we talked about how to instantiate a vector space model. We need
to do three things: define the dimensions (the terms or concepts), decide how to place
documents and queries as vectors in this space (i.e., assign the term weights), and
choose a similarity measure between the two vectors.
Based on this idea, we discussed a very simple way to instantiate the vector
space model. Indeed, it’s probably the simplest vector space model that we can
derive. We used each word to define a dimension, with a zero-one bit vector to
represent a document or a query. In this case, we only care about word presence or
absence, ignoring the frequency. For a similarity measure, we used the dot product
and showed that this scoring function scores a document based on the number of
distinct query words matched in it. We also showed that such a simple vector space
model still doesn’t work well, and we need to improve it. This is the topic for the
next section.
It’s worth thinking at this point about why we have these issues. If we look back at
the assumptions we made while instantiating the VS model, we will realize that the
problem is really coming from some of those assumptions. In particular, it has to
do with how we place the vectors in the vector space. Naturally, in order to fix these
problems, we have to revisit those assumptions. A natural thought is to consider
multiple occurrences of a term in a document as opposed to binary representation;
we should consider the TF instead of just the absence or presence. In order to
consider the difference between a document where a query term occurred multiple
times and one where the query term occurred just once, we have to consider the
term frequency—the count of a term in a document. The simplest way to express
the TF of a word w in a document d is simply its raw count in d, which we denote by c(w, d).
With the bit vector, we only captured the presence or absence of a term, ignoring
the actual number of times that a term occurred. Let’s add the count information
back: we will represent a document by a vector with c(w, d) as each dimension's weight.
That is, the elements of both the query vector and the document vector will not be
zeroes and ones, but instead they will be the counts of a word in the query or the
document, as illustrated in Figure 6.7.
Figure 6.8 Frequency vector representation rewards multiple occurrences of a query term.
Now, let’s see what the formula would look like if we change this representation.
The formula looks identical since we are still using the dot product similarity. The
difference is inside of the sum since xi and yi are now different—they’re now the
counts of words in the query and the document. Because of the change in document
representation, the new score has a different interpretation. We can see whether
this would fix the problems of the bit vector VS model.
Look at the three documents again in Figure 6.8. The query vector is the same
because all these words occurred exactly once in the query. The same goes for d2
and d3 since none of these words has been repeated. As a result, the score is also
the same for both these documents. But, d4 would be different; here, presidential
occurred twice. Thus, the corresponding dimension would be weighted as two
instead of one, and the score for d4 is higher. This means, by using TF, we can
now rank d4 above d2 and d3 as we had hoped to.
Unfortunately, d2 and d3 still have identical scores. We would like to give more
credit for matching presidential than matching about. How can we solve this prob-
lem in a general way? Is there any way to determine which words should be treated
as more important and which can be essentially ignored? About doesn't carry
that much content, so we should be able to ignore it. We call such a word a stop
word. They are generally very frequent and they occur everywhere such that match-
ing it doesn’t have any significance. Can we come up with any statistical approaches
to somehow distinguish a content word like presidential from a stop word like
about? One difference is that a word like about occurs everywhere. If you count
the occurrences of the word in the whole collection of M documents,
then we would see that about has a much higher count than presidential. This idea
suggests that we could somehow use the global statistics of terms or some other
information to try to decrease the weight of the about dimension in the vector rep-
resentation of d2. At the same time, we hope to somehow increase the weight of
presidential in the d3 vector. If we can do that, then we can expect that d2 will get an
overall score of less than three, while d3 will get a score of about three. That way,
we’ll be able to rank d3 on top of d2.
This particular idea is called the inverse document frequency (IDF). It is a very
important signal used in modern retrieval functions. The document frequency
is the count of documents that contain a particular term. Here, we say inverse
document frequency because we actually want to reward a word that doesn’t occur
in many documents. The way to incorporate this into our vector is to modify the
frequency count by multiplying it by the IDF of the corresponding word, as shown
in Figure 6.9.
We can now penalize common words which generally have a low IDF and reward
informative words that have a higher IDF. IDF can be defined as
IDF(w) = log[(M + 1)/df(w)],   (6.2)
where M is the total number of documents in the collection and df(.) counts the
document frequency (the total number of documents containing w).
Let’s compare the terms campaign and about. Intuitively, about should have a
lower IDF score than campaign since about is a less informative word. For clarity,
let’s assume M = 10, 000, df(about) = 5000, df(campaign) = 1166, and we use a
base two logarithm. Then,
IDF(about) = log[10,001/df(about)] = log[10,001/5000] ≈ 1.0
Figure 6.9 Representation of a document vector and a query vector with TF-IDF weighting: q = (x1, . . . , xN), d = (y1, . . . , yN), where yi = c(wi, d) · IDF(wi).
Figure 6.10 Illustration of the IDF function IDF(w) = log[(M + 1)/k] as the document frequency k varies from 1 to M; the maximum value is log(M + 1).
and

IDF(campaign) = log[10,001/df(campaign)] = log[10,001/1166] ≈ 3.1.
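These numbers are easy to verify (a small sketch using the hypothetical counts above and a base-2 logarithm):

    import math

    def idf(doc_freq, num_docs):
        # IDF(w) = log[(M + 1) / df(w)], computed here with a base-2 logarithm.
        return math.log2((num_docs + 1) / doc_freq)

    M = 10000
    print(idf(5000, M))   # IDF(about)    ~ 1.0
    print(idf(1166, M))   # IDF(campaign) ~ 3.1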
Let k represent df(w); if you plot the IDF function by varying k, then you will see
a curve like the one illustrated in Figure 6.10. In general, you can see it would give
a higher value for a low df , indicating a rare word. You can also see the maximum
value of this function is log(M + 1). The specific function is not as important as the
heuristic it captures: penalizing popular terms. Whether there is a better form of
the IDF function is an open research question. With the evaluation skills you will
learn in Chapter 9, you can test your different instantiations.
If we use a linear function like the diagonal line (as shown in the figure), it may
not be as reasonable as the IDF function we just defined. In the standard IDF,
we have a dropping off point where we say “these terms are essentially not very
useful.” This makes sense when the term occurs so frequently that it’s unlikely
to differentiate two documents’ relevance (since the term is so common). But, if
you look at the linear representation, there is no dropping off point. Intuitively,
we want to focus more on the discrimination of low df words rather than these
common words. Of course, which one works better still has to be validated by
running experiments on a data set.
Let’s look at the two documents again in Figure 6.11. Without IDF weighting,
we just had bit vectors. With IDF weighting, we now can adjust the TF (term
frequency) weight by multiplying it by the IDF weight. With this scheme, the weight
of about is adjusted using its IDF value, which is smaller than the IDF
value of presidential. Thus, the IDF will distinguish these two words based on how
informative they are. Including the IDF weighting causes d3 to be ranked above
d2, since it matched a rare (informative) word, whereas d2 matched a common
(uninformative) word. This shows that the idea of weighting can solve our second
problem. How effective is this model in general when we use this TF-IDF weighting?
Well, let’s take a look at all the documents that we have seen before.
In Figure 6.12, we show all the five documents that we have seen before and their
new scores using TF-IDF weighting. We see the scores for the first four documents
seem to be quite reasonable. But again, we also see a new problem since d5 did not
even have a very high score with our simplest vector space model, but now d5 has
the highest score. This is actually a common phenomenon when designing retrieval
functions; when you try to fix one problem, you tend to introduce other problems!
That’s why it’s very tricky to design an effective ranking function, and why finding
a “best” ranking function is an open research question. In the next few sections,
we’ll continue to discuss some additional ideas to further improve this model and
try to fix this problem.
6.3.4 TF Transformation
In the previous section, we derived a TF-IDF weighting formula using the vector
space model and showed that this model actually works pretty well for the examples
shown in the figures—except for d5, which has a very high score. This document is
intuitively non-relevant, so its position is not desirable. Now, we’re going to discuss
how to use a TF transformation to solve this problem. Before we discuss the details,
let’s take a look at the formula for the TF-IDF weighting and ranking function we
previously derived. It is shown in Figure 6.13.
If you look at the formula carefully, you will see it involves a sum over all the
matched query terms:

f(q, d) = Σ_{i=1}^{N} x_i y_i = Σ_{w ∈ q∩d} c(w, q) · c(w, d) · log[(M + 1)/df(w)].

Inside the sum, each matched query term has a particular weight; this weight is the
TF-IDF weight. It has an IDF component with two variables: the total number of
documents in the collection, M, and the document frequency, df(w), which is the
number of documents that contain w. The other variables involved in the formula
are the count of the query term w in the query and the count of w in the document,
represented as c(w, q) and c(w, d), respectively.
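A sketch of this TF-IDF ranking function (illustrative only, not the book's implementation; it assumes a pre-tokenized query and document, a dictionary doc_freq of document frequencies, and a base-2 logarithm):

    import math
    from collections import Counter

    def tfidf_score(query_tokens, doc_tokens, doc_freq, num_docs):
        # f(q, d) = sum over w in both q and d of c(w, q) * c(w, d) * log[(M + 1)/df(w)]
        q_counts, d_counts = Counter(query_tokens), Counter(doc_tokens)
        score = 0.0
        for w in q_counts.keys() & d_counts.keys():
            idf = math.log2((num_docs + 1) / doc_freq[w])
            score += q_counts[w] * d_counts[w] * idf
        return score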
Looking at d5 again, it’s not hard to realize that the reason why it has received
a high score is because it has a very high count of the term campaign. Its count in
d5 is four, which is much higher than the other documents, and has contributed to
the high score of this document. Intuitively, in order to lower the score for this
document, we need to somehow restrict the contribution of matching this term
in the document. Essentially, we shouldn’t reward multiple occurrences so gener-
ously. The first occurrence of a term says a lot about matching of this term because
it goes from a zero count to a count of one, and that increase is very informative.
Once we see a word in the document, it’s very likely that the document is talking
about this word. If we see an extra occurrence on top of the first occurrence, that
is, going from one to two, then we can say the second occurrence confirms
that it's not an accidental mention of the word. But imagine we have seen, let's
say, 50 occurrences of the word in the document. Then, adding one extra occur-
rence is not going to bring new evidence about the term because we are already
sure that this document is about this word. Thus, we should restrict the contribu-
tion of a high-count term. That is exactly the idea of TF transformation, illustrated
in Figure 6.14.
This transformation function is going to turn the raw count of a word into a
TF weight for the word in the document. On the x-axis is the raw count, and on
the y-axis is the TF weight. In the previous ranking functions, we actually have
implicitly used some kind of transformation. For example, in the zero-one bit vector
representation, we actually used the binary transformation function as shown here.
If the count is zero then it has zero weight. Otherwise it would have a weight of
one. Then, we considered term count as a TF weight, which is a linear function.
We just saw that this is not desirable. With a logarithm, we can have a sublinear
transformation such as y = log(1 + x), which looks like the red lines in the figure.
This damps the influence of a very high count while retaining the influence of a
small count. We might even want to bend the curve more by
applying a logarithm twice. Researchers have tried all these methods and they are
indeed working better than the linear transformation, but so far what works the best
seems to be this special transformation called BM25 TF, illustrated in Figure 6.15,
where BM stands for best matching.
In this transformation, there is a parameter k which controls the upper bound of
this function. It’s easy to see this function has a upper bound, because if you look
x
at the x+k as being multiplied by (k + 1), the fraction will never exceed one, since
the numerator is always less than the denominator. Thus, it’s upper-bounded by
(k + 1). This is also the difference between the BM25 TF function and the logarithm
transformation, which doesn’t have an upper bound. Furthermore, one interesting
property of this function is that as we vary k, we can actually simulate different
transformation functions including the two extremes that are shown in the figure.
When k = 0, we have the zero-one bit transformation. If we set k to a very large number,
on the other hand, it’s going to look more like the linear transformation function.
In this sense, this transformation is very flexible since it allows us to control the
shape of the TF curve quite easily. It also has the nice property of a simple upper
bound. This upper bound is useful to control the influence of a particular term. For
example, we can prevent a spammer from just increasing the count of one term to
spam all queries that might match this term. In other words, this upper bound
Figure 6.15 The BM25 TF transformation y = (k + 1)x/(x + k) as a function of the raw count x = c(w, d); the curve is upper-bounded by k + 1 and reduces to the 0/1 bit transformation when k = 0.
ensures that all terms will be counted when we aggregate the weights to compute
a score.
To summarize, we need to capture some sublinearity in the TF function. This
ensures that we represent the intuition of diminishing return from high term
counts. It also avoids a dominance by one single term over all others. The BM25
TF formula we discussed has an upper bound while being robust and effective.
If we plug this function into our TF-IDF vector space model, then we would end
up having a ranking function with a BM25 TF component. This is very close to a
state-of-the-art ranking function called BM25. We’ll see the entire BM25 formula
soon.
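A few numbers make the effect of k concrete (a quick sketch, not from the book):

    def bm25_tf(count, k=1.2):
        # y = (k + 1) * x / (x + k): sublinear in the raw count x and upper-bounded by k + 1.
        return (k + 1) * count / (count + k)

    for x in [0, 1, 2, 5, 50]:
        print(x, round(bm25_tf(x), 3))   # rises quickly, then flattens toward k + 1 = 2.2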
nothing to do with the mention of presidential at the end. In general, if you think
about long documents, they would have a higher chance to match any query since
they contain more words. In fact, if you generate a long document by randomly
sampling words from the distribution of all words, then eventually you probably
will match any query! In this sense, we should penalize long documents because
they naturally have a better chance to match any query. This is our idea of document
length normalization. On the one hand, we want to penalize a long document, but
on the other hand, we also don't want to over-penalize it. The reason is that a
document may be long for different reasons: in one case, the document may
be longer because it uses more words. For example, think about a research paper;
it would use more words than the corresponding abstract. This is the case
where we probably should penalize the matching of a long document such as a full
paper. When we compare matching words in such a long document with matching
words in the short abstract, the long paper generally has a higher chance of
matching query words. Therefore, we should penalize the long documents.
However, there is another case when the document is long—that is when the
document simply has more content. Consider a case of a long document, where we
simply concatenated abstracts of different papers. In such a case, we don’t want to
penalize this long document. That’s why we need to be careful about using the right
degree of length penalization, and an understanding of the discourse structure of
documents is needed for optimal document length normalization.
A method that has worked well is called pivoted length normalization, illustrated
in Figure 6.17 and described originally in Singhal et al. [1996].
Figure 6.17 Illustration of the pivoted length normalizer as a function of the document length |d|: with b > 0, documents longer than the average document length (avdl) are penalized and shorter documents are rewarded; b = 0 means no normalization.
Here, the idea is to
use the average document length as a pivot, or reference point. That means we will
assume that for the average length documents, the score is about right (a normalizer
would be one). If a document is longer than the average document length, then
there will be some penalization. If it’s shorter than the average document length,
there’s even some reward. The x-axis represents the length of a document. On the
y-axis we show the normalizer, i.e., the pivoted length normalization. The formula
for the normalizer is an interpolation of one and the normalized document lengths,
controlled by a parameter b. When we first divide the length of the document by
the average document length, this not only gives us some sense of how this
document compares with the average document length, but also gives us the
benefit of not worrying about the unit of length. This normalizer has an interesting
property; first, we see that if we set the parameter b to zero, then the normalizer
value would be one, indicating no length normalization at all. If we set b to a
nonzero value, then the value would be higher for documents that are longer
than the average document length, whereas the value of the normalizer will be
smaller for shorter documents. In this sense we see there’s a penalization for
long documents and a reward for short documents. The degree of penalization is
controlled by b. By adjusting b (which varies from zero to one), we can control the
degree of length normalization. If we plug this length normalization factor into the
Pivoted length normalization:
f(q, d) = Σ_{w ∈ q∩d} c(w, q) · [ln(1 + ln(1 + c(w, d))) / (1 − b + b · |d|/avdl)] · log[(M + 1)/df(w)]

Okapi BM25:
f(q, d) = Σ_{w ∈ q∩d} c(w, q) · [(k + 1) · c(w, d) / (c(w, d) + k(1 − b + b · |d|/avdl))] · log[(M + 1)/df(w)],
where b ∈ [0, 1] and k ∈ [0, +∞).

Figure 6.18 State-of-the-art vector space models: pivoted length normalization and Okapi BM25.
vector space model ranking functions that we have already examined, we will end
up with state-of-the-art retrieval models, some of which are shown in Figure 6.18.
Let’s take a look at each of them. The first one is called pivoted length normaliza-
tion. We see that it’s basically the TF-IDF weighting model that we have discussed.
The IDF component appears in the last term. There is also a query TF component,
and in the middle there is normalized TF. For this, we have the double logarithm
as we discussed before; this is to achieve a sublinear transformation. We also put a
document length normalizer in the denominator of the TF formula, which causes
a penalty for long documents, since the larger the denominator is, the smaller the
TF weight is. The document length normalization is controlled by the parameter b.
The next formula is called Okapi BM25, or just BM25. It’s similar to the pivoted
length normalization formula in that it has an IDF component and a query TF
component. In the middle, the normalization is a little bit different; we have a
sublinear transformation with an upper bound. There is a length normalization
factor here as well. It achieves a similar effect as discussed before, since we put
the normalizer in the denominator. Thus, again, if a document is longer, the term
weight will be smaller.
We have now reached one of the best-known retrieval functions by thinking
logically about how to represent a document and by slowly tweaking formulas and
considering our initial assumptions.
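For concreteness, here is a sketch of a BM25 scoring function following the formula in Figure 6.18 (illustrative only, not the book's implementation; it assumes pre-tokenized text, a doc_freq dictionary, and typical parameter values such as k = 1.2 and b = 0.75):

    import math
    from collections import Counter

    def bm25_score(query_tokens, doc_tokens, doc_freq, num_docs, avg_doc_len, k=1.2, b=0.75):
        q_counts, d_counts = Counter(query_tokens), Counter(doc_tokens)
        doc_len = len(doc_tokens)
        score = 0.0
        for w in q_counts.keys() & d_counts.keys():
            idf = math.log2((num_docs + 1) / doc_freq[w])      # IDF component
            norm = k * (1 - b + b * doc_len / avg_doc_len)     # length-normalized k
            tf = (k + 1) * d_counts[w] / (d_counts[w] + norm)  # bounded TF component
            score += q_counts[w] * tf * idf
        return score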
Stemmed words (words that have been transformed into a basic root form) are a
viable option, so that all forms of the same word are treated as one and can be matched
as one term. We also need to perform stop word removal; this removes very
common words that don't carry much content, such as the, a, or of. We could use
phrases or even latent semantic analysis, which characterizes documents by which
cluster words belong to. We can also use smaller units, like character n-grams,
which are sequences of n characters, as dimensions. In practice, researchers have
found that the bag-of-words representation with phrases (or “bag-of-phrases”) is
the most effective representation. It’s also efficient so this is still by far the most
popular document representation method and it’s used in all the major search
engines.
Sometimes we need to employ language-specific and domain-specific represen-
tation. This is actually very important as we might have variations of the terms that
prevent us from matching them with each other even though they mean the same
thing. Take Chinese, for example. We first need to segment text to obtain word
boundaries because it’s originally just a sequence of characters. A word might cor-
respond to one character or two characters or even three characters. It’s easier in
English when we have a space to separate the words, but in some other languages we
may need to do some natural language processing to determine word boundaries.
There is also the possibility of improving the similarity function. So far, we've used
the dot product, but there are other measures. We could compute the cosine of the
angle between two vectors, or we could use a Euclidean distance measure. The dot
product still seems the best, and one reason is that it's very general; in
fact, if you consider the possibility of weighting terms in
different ways, the cosine measure can be regarded as the dot product of two normalized
vectors. That means we first normalize each vector and then take the dot
product; that would be equivalent to the cosine measure.
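A small illustration of this equivalence (a sketch; vectors are plain Python lists and assumed to be nonzero):

    import math

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    def cosine(x, y):
        # Cosine similarity = dot product of the two length-normalized vectors.
        nx = [a / math.sqrt(dot(x, x)) for a in x]
        ny = [b / math.sqrt(dot(y, y)) for b in y]
        return dot(nx, ny)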
We mentioned that BM25 seems to be one of the most effective formulas—but
there has also been further development in improving BM25, although none of
these works has changed BM25 fundamentally. In one line of work, people
have derived BM25-F. Here, F stands for field, and this is BM25 for documents
with structure. For example, you might consider the title field, the abstract field,
the body of the research article, or even anchor text (on web pages). These can all be
combined with an appropriate weight on different fields to help improve scoring for
each document. Essentially, this formulation applies BM25 on each field, and then
combines the scores, but keeps global (i.e., across all fields) frequency counts. This
has the advantage of avoiding over-counting the first occurrence of the term. Recall
that in the sublinear transformation of TF, the first occurrence is very important
and contributes a large weight. If we do that for all the fields, then the same term
might have gained a large advantage in every field. When we just combine counts
on each separate field, the extra occurrences will not be counted as fresh first
occurrences. This method has worked very well for scoring structured documents.
More details can be found in Robertson et al. [2004].
Another line of extension is called BM25+. Here, researchers have addressed the
problem of over-penalization of long documents by BM25. To address this problem,
the fix is actually quite simple. We can simply add a small constant to the TF
normalization formula. But what’s interesting is that we can analytically prove that
by doing such a small modification, we will fix the problem of over-penalization
of long documents by the original BM25. Thus, the new formula called BM25+ is
empirically and analytically shown to be better than BM25 [Lv and Zhai 2011].
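As a rough sketch of this fix (assuming a BM25-style TF component like the one sketched earlier and a small constant delta, e.g., 1.0; see Lv and Zhai [2011] for the exact BM25+ formula):

    def bm25_plus_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75, delta=1.0):
        # Add a small constant delta to the normalized TF component so that matching
        # a query word in a very long document still earns some minimum credit.
        if count == 0:
            return 0.0
        tf = (k + 1) * count / (count + k * (1 - b + b * doc_len / avg_doc_len))
        return tf + delta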
6.3.7 Summary
In vector space retrieval models, we use similarity as a notion of relevance, assum-
ing that the relevance of a document with respect to a query is correlated with the
similarity between the query and the document. Naturally, that implies that the
query and document must be represented in the same way, and in this case, we
represent them as vectors in a high dimensional vector space. The dimensions are
defined by words, concepts, or terms. We generally need to use multiple heuris-
tics to design a ranking function; we gave some examples which show the need for
several heuristics, which include:
. TF (term frequency) weighting and sublinear transformation;
. IDF (inverse document frequency) weighting; and
. document length normalization.
These three are the most important heuristics to ensure such a general ranking
function works well for all kinds of tasks. Finally, BM25 and pivoted length nor-
malization seem to be the most effective VS formulas. While there has been some
work done in improving these two powerful measures, their main idea remains
the same. In the next section, we will discuss an alternative approach to the vector
space representation.
Figure 6.19 An example of logged relevance judgments (or clicks), given as (query, document, relevance) tuples such as (q1, d1, 1), (q1, d2, 1), (q1, d3, 0), (q1, d4, 0), (q1, d5, 1), . . . , (q2, d3, 1), (q3, d1, 1), (q4, d2, 1), (q4, d3, 0); the same query-document pair may appear many times with different relevance values. The scoring function is defined as f(q, d) = p(R = 1 | d, q) = count(q, d, R = 1)/count(q, d).
To estimate this probability, we treat q and d as a pair, look up all the rows of the
table containing this pair, count how many times we have also seen a one in the
third column, and compute the ratio:

p(R = 1 | d, q) = count(R = 1, d, q) / count(d, q).   (6.3)

For example, for query q1,
it’s going to rank d2 above all the other documents because in all the cases, given
q1 and d2, R = 1.
With volumes of clickthrough data, a search engine can learn to improve its
results. This is a simple example that shows that with even a small number of
entries, we can already estimate some probabilities. These probabilities would give
us some sense about which document might be more useful to a user for this
query. Of course, the problem is that we don’t observe all the queries and all of
the documents and all the relevance values; there will be many unseen documents.
In general, we can only collect data from the documents that we have shown to the
users. In fact, there are even more unseen queries because you cannot predict what
queries will be typed in by users. Obviously, this approach won’t work if we apply
it to unseen queries or unseen documents. Nevertheless, this shows the basic idea
of the probabilistic retrieval model.
What do we do in such a case when we have a lot of unseen documents and
unseen queries? The solution is that we have to approximate in some way. In the
particular case called the query likelihood retrieval model, we just approximate this
by another conditional probability, p(q | d , R = 1) [Lafferty and Zhai 2003].
We assume that the user likes the document because we have seen that the
user clicked on this document, and we are interested in all these cases when a
user liked this particular document and want to see what kind of queries they
have used. Note that we have made an interesting assumption here: we assume
that a user formulates the query based on an imaginary relevant document. If you
just look at this as a conditional probability, it’s not obvious we are making this
assumption. We have to somehow be able to estimate this conditional probability
without relying on the big table from Figure 6.19. Otherwise, we would have similar
problems as before. By making this assumption, we have some way to bypass the
big table.
Let’s look at how this new model works for our example. We ask the following
question: which of these documents is most likely the imaginary relevant docu-
ment in the user’s mind when the user formulates this query? We quantify this
probability as a conditional probability of observing this query if a particular doc-
ument is in fact the imaginary relevant document in the user’s mind. We compute
all these query likelihood probabilities—that is, the likelihood of the query given
each document. Once we have these values, we can then rank these documents.
To summarize, the general idea of modeling relevance in the probabilistic re-
trieval model is to assume that we introduce a binary random variable R and let
the scoring function be defined based on the conditional probability p(R = 1 | d , q).
We also talked about approximating this by using query likelihood. This means we
have a ranking function that’s based on a probability of a query given the document.
[Figure: illustration of computing p(q = “presidential campaign” | d) for the document
d = “… news of presidential campaign … presidential candidate …”, where the user is
imagined to pick the query words from d.]
This probability should be interpreted as the probability that a user who likes doc-
ument d would pose query q. Now the question, of course, is how do we compute
this conditional probability? We will discuss this in detail in the next section.
p(q = “presidential campaign” | d) = c(“presidential”, d)/|d| × c(“campaign”, d)/|d|
Figure 6.21 Computing the probability of a query given a document using the query likelihood
formulation.
We’ve made the assumption that each query word is independent and that
each word is obtained from the imagined ideal document satisfying the user’s
information need. Let’s see how this works exactly. Since we are computing a query
likelihood, then the total probability is the probability of this particular query,
which is a sequence of words. Since we make the assumption that each word
is generated independently, the probability of the query is just a product of the
probability of each query word, where the probability of each word is just the relative
frequency of the word in the document. For example, the probability of presidential
given the document would be just the count of presidential in the document divided
by the total number of words in the document (i.e., the document length). We now
have an actual formula for retrieval that we can use to rank documents.
Let’s take a look at some example documents from Figure 6.21. Suppose now
the query is presidential campaign. To score these documents, we just count how
many times we have seen presidential and how many times we have seen campaign.
We’ve seen presidential two times in d4, so that’s 2/|d4|. We also multiply by 1/|d4| for the
probability of campaign. Similarly, we can calculate probabilities for the other two
documents d3 and d2. If we assume d3 and d4 have about the same length, then it
looks like we will rank d4 above d3, which is above d2. As we would expect, it looks
like this formulation captures the TF heuristic from the vector space models.
However, if we try a different query like this one, presidential campaign update,
then we might see a problem. Consider the word update: none of the documents
contain this word. According to our assumption that a user would pick a word from
a document to generate a query, the probability of obtaining a word like update
would be zero. Clearly, this causes a problem because it would cause all these
documents to have zero probability of generating this query.
While it’s fine to have a zero probability for d2 which is not relevant, it’s not okay
to have zero probability for d3 and d4 because now we no longer can distinguish
them. In fact, we can’t even distinguish them from d2. Clearly, that’s not desirable.
When one has such a result, we should think about what has caused this problem,
examining what assumptions have been made as we derive this ranking function.
We have made an assumption that every query word must be drawn from the
document in the user’s mind—in order to fix this, we have to assume that the user
could have drawn a word not necessarily from the document. So let’s consider an
improved model.
Instead of drawing a word from the document, let’s imagine that the user would
actually draw a word from a document language model as depicted in Figure 6.22.
Here, we assume that this document is generated by using this unigram language
model, which doesn’t necessarily assign zero probability for the word update. In
fact, we assume this model does not assign zero probability for any word. If we’re
thinking this way, then the generative process is a bit different: the user has this
model (distribution of words) in mind instead of a particular ideal document,
although the model still has to be estimated based on the documents in our corpus.
The user can generate the query using a similar process. They may pick a word
such as presidential and another word such as campaign. The difference is that now
we can pick a word like update even though it doesn’t occur in the document. This
Figure 6.22 Computing the probability of a query given a document using a document language
model.
Figure 6.23 Scoring a query on two documents based on their language models. For a document
d2 about food nutrition, the language model p(w|d2) might give food 0.25, nutrition 0.1,
healthy 0.05, diet 0.02, . . . , and the query is scored as
p(“data mining alg”|d2) = p(“data”|d2) × p(“mining”|d2) × p(“alg”|d2).
would fix our problem with zero probabilities and it’s also reasonable because we’re
now thinking of what the user is looking for in a more general way, via a unigram
language model instead of a single fixed document.
In Figure 6.23, we show two possible language models based on documents d1
and d2, and a query data mining algorithms. By making an independence assump-
tion, we could have p(q | d) as a product of the probability of each query word in
each document’s language model. We score these two documents and then rank
them based on the probabilities we calculate.
Let’s formally state our scoring process for query likelihood. A query q contains
the words
q = w1, w2, . . . , wn
such that |q| = n. The scoring or ranking function is then the probability that we
observe q given that a user is thinking of a particular document d. This is the prod-
uct of probabilities of all individual words, which is based on the independence
assumption mentioned before:

p(q | d) = p(w1 | d) × p(w2 | d) × . . . × p(wn | d).
In practice, we score the document for this query by using a logarithm of the
query likelihood:
score(q, d) = log p(q | d) = Σ_{i=1}^{n} log p(wi | d) = Σ_{w∈V} c(w, q) log p(w | d). (6.5)
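As a minimal sketch, Equation 6.5 with the maximum likelihood estimate p(w | d) = c(w, d)/|d| can be written as follows; the toy document is illustrative, and the sketch deliberately exposes the zero-probability problem discussed next.

```python
import math
from collections import Counter

def log_query_likelihood_mle(query_words, doc_words):
    """score(q, d) = sum over query words of log p(w | d), with p(w | d) = c(w, d)/|d|."""
    counts = Counter(doc_words)
    doc_len = len(doc_words)
    score = 0.0
    for w in query_words:
        p = counts[w] / doc_len
        if p == 0.0:
            return float("-inf")   # one unseen query word zeroes out the whole query
        score += math.log(p)
    return score

d4 = "news of presidential campaign presidential candidate".split()
print(log_query_likelihood_mle("presidential campaign".split(), d4))
print(log_query_likelihood_mle("presidential campaign update".split(), d4))  # -inf
```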
To smooth, we must take away some probability mass from the seen words, because we need
some extra probability mass for the unseen words—otherwise, the probabilities won’t sum to one.
To make this transformation and to improve the MLE, we will assign nonzero
probabilities to words that are not observed in the data. This is called smoothing,
and smoothing has to do with improving the estimate by including the probabilities
of unseen words. Considering this factor, a smoothed language model would be
a more accurate representation of the actual document. Imagine you have seen
the abstract of a research paper; or, imagine a document is just an abstract. If
we assume words that don’t appear in the abstract have a probability of zero, that
means sampling a word outside the abstract is impossible. Imagine the user who
is interested in the topic of this abstract; the user might actually choose a word
that is not in the abstract to use as query. In other words, if we had asked this
author to write more, the author would have written the full text of the article, which
contains words that don’t appear in the abstract. So, smoothing the language model
is an attempt to recover the model for the whole article. Of course, we don’t
usually have knowledge about the words not observed in the abstract, so that’s why
smoothing is actually a tricky problem.
The key question here is what probability should be assigned to those unseen
words. As one would imagine, there are many different approaches to solve this
issue. One idea that’s very useful for retrieval is to let the probability of an unseen
word be proportional to its probability as given by a reference language model. That
means if you don’t observe the word in the corpus, we’re going to assume that its
probability is governed by another reference language model that we construct. It
will tell us which unseen words have a higher probability than other unseen words.
In the case of retrieval, a natural choice would be to take the collection LM as the
reference LM. That is to say if you don’t observe a word in the document, we’re going
to assume that the probability of this word would be proportional to the probability
of the word in the whole collection.
More formally, we’ll be estimating the probability of a word given a document
as follows:
p(w | d) = pseen(w | d),       if w is seen in d
         = αd · p(w | C),      otherwise.                                   (6.6)
If the word is seen in the document, then the probability would be a discounted
MLE estimate pseen. Otherwise, if the word is not seen in the document, we’ll let the
probability be proportional to the probability of the word in the collection p(w | C),
with the coefficient αd controlling the amount of probability mass that we assign
Figure 6.24 Substituting smoothed probabilities into the query likelihood retrieval formula.
to unseen words. Regardless of whether the word w is seen in the document or not,
all these probabilities must sum to one, so αd is constrained.
Now that we have this smoothing formula, we can plug it into our query likeli-
hood ranking function, illustrated in Figure 6.24. In this formula, we have a sum
over all the query words, written in the form of a sum over the corpus vocabulary.
Although we sum over words in the vocabulary, in effect we are just taking a sum of
query words since each word is weighted by its frequency in the query. Such a way to
write this sum is convenient in some transformations. In our smoothing method,
we’re assuming the words that are not observed in the document have a somewhat
different form of probability. Using this form we can decompose this sum into two
parts: one over all the query words that are matched in the document and the other
over all the words that are not matched. These unmatched words have a different
form of probability because of our assumption about smoothing.
We can then rewrite the second sum (of query words not matched in d) as a
difference between the scores of all words in the vocabulary minus all the query
words matched in d. This is actually quite useful, since part of the sum over all w ∈ V
can now be written as |q| log αd . Additionally, the sum of query words matched in d
log p(q|d) = Σ_{w∈d, w∈q} c(w, q) [ log ( pseen(w|d) / (αd p(w|C)) ) ] + n log αd + Σ_{i=1}^{n} log p(wi|C)
(the ratio inside the logarithm provides the TF weighting, and αd provides the document length
normalization)
Figure 6.25 The query likelihood retrieval formula captures the three heuristics from the vector
space models.
is in terms of words that we observe in the query. Just like in the vector space model,
we are now able to take a sum of terms in the intersection of the query vector and
the document vector.
If we look at this rewriting further as shown in Figure 6.25, we can see how
it actually would give us two benefits. The first benefit is that it helps us better
understand the ranking function. In particular, we’re going to show that from this
formula we can see the connection of smoothing using a collection language model
with weighting heuristics similar to TF-IDF weighting and length normalization.
The second benefit is that it also allows us to compute the query likelihood more
efficiently, since we only need to consider terms matched in the query.
We see that the main part of the formula is a sum over the matching query
terms. This is much better than if we take the sum over all the words. After we
smooth the document using the collection language model, we would have nonzero
probabilities for all the words w ∈ V . This new form of the formula is much easier
to compute. It’s also interesting to note that the last term is independent of the
document being scored, so it can be ignored for ranking. Ignoring this term won’t
affect the order of the documents since it would just be the same value added onto
each document’s final score.
Inside the sum, we also see that each matched query term would contribute
a weight. This weight looks like TF-IDF weighting from the vector space models.
First, we can already see it has a frequency of the word in the query, just like in the
vector space model. When we take the dot product, the word frequency in the query
appears in the sum as a vector element from the query vector. The corresponding
term from the document vector encodes a weight that has an effect similar to TF-
IDF weighting. pseen is related to the term frequency in the sense that if a word
occurs very frequently in the document, then the seen probability will tend to be
larger. This term is really doing something like TF weighting. In the denominator,
we achieve the IDF effect through p(w | C), or the popularity of the term in the
collection. Because it’s in the denominator, a larger collection probability actually
makes the weight of the entire term smaller. This means a popular term carries
a smaller weight—this is precisely what IDF weighting is doing! Only now, we
have a different form of TF and IDF. Remember, IDF has a logarithm of document
frequency, but here we have something different. Intuitively, however, it achieves
a similar effect to the VS interpretation.
We also have something related to the length normalization. In particular, αd
might be related to document length. It encodes how much probability mass we
want to give to unseen words, or how much smoothing we are allowed to do.
Intuitively, if a document is long then we need to do less smoothing because we
can assume that it is large enough that we have probably observed all of the words
that the author could have written. If the document is short, the number of unseen
words is expected to be large, and we need to do more smoothing in this case. Thus,
αd tends to be smaller for long documents, which penalizes them. The variable αd actually
occurs in two places in the formula, so its overall effect is not necessarily to penalize long
documents; but as we will see later when we consider specific smoothing methods, αd does
end up penalizing long documents in a specific way.
This formulation is quite convenient since it means we don’t have to think about
the specific way of doing smoothing. We just need to assume that if we smooth with
the collection language model, then we would have a formula that looks like TF-
IDF weighting and document length normalization. It’s also interesting that we
have a very fixed form of the ranking function. Note that we have not heuristically
put a logarithm here, but have used a logarithm of query likelihood for scoring
and turned the product into a sum of logarithms of probabilities. If we only want
to heuristically implement TF-IDF weighting, we don’t necessarily have to have
a logarithm. Imagine if we drop this logarithm; we would still have TF and IDF
weighting. But, what’s nice with probabilistic modeling is that we are automatically
given a logarithm function which achieves sublinear scaling of our term “weights.”
In summary, a nice property of probabilistic models is that by following some
assumptions and probabilistic rules, we’ll get a formula by derivation. If we heuris-
tically design the formula, we may not necessarily end up having such a specific
form. Additionally, we talked about the need for smoothing a document language
model. Otherwise, it would give zero probability for unseen words in the document,
which is not good for scoring a query with an unseen word. It’s also necessary to
improve the accuracy of estimating the model representing the topic of this docu-
ment. The general idea of smoothing in retrieval is to use the collection language
model to give us some clue about which unseen word would have a higher proba-
bility. That is, the probability of the unseen word is assumed to be proportional to
its probability in the entire collection. With this assumption, we’ve shown that we
can derive a general ranking formula for query likelihood retrieval models that au-
tomatically contains the vector space heuristics of TF-IDF weighting and document
length normalization.
We also saw that through some rewriting, the scoring of such a ranking function
is primarily based on a sum of weights on matched query terms, also just like in
the vector space model. The actual ranking function is given to us automatically by
the probabilistic derivation and assumptions we have made, unlike in the vector
space model where we have to heuristically think about the forms of each function.
However, we still need to address the question: how exactly should we smooth
a document language model? How exactly should we use the reference language
model based on the collection to adjust the probability of the MLE of seen terms?
This is the topic of the next section.
We can see it’s a sum of all the matched query terms, and inside the sum it’s a
count of terms in the query with some weight for the term in the document. We saw
in the previous section how TF and IDF are captured in this sum. We also mentioned
how the second term αd can be used for document length normalization. If we
wanted to implement this function using a programming language, we’d still need
to figure out a few variables; in particular, we’re going to need to know how to
estimate the probability of a word and how to set αd . In order to answer these
questions, we have to think about specific smoothing methods, where we define
pseen and αd .
We’re going to talk about two different smoothing methods. The first is a lin-
ear interpolation with a fixed mixing coefficient. This is also called Jelinek-Mercer
smoothing. The idea is actually quite simple. Figure 6.26 shows how we estimate
the document language model by using MLE. That gives us word counts normal-
ized by the total number of words in the document. The idea of using this method
is to maximize the probability of the observed text. As a result, if a word like network
is not observed in the text, it’s going to get zero probability. The idea of smoothing
is to rely on the collection reference model, where this word is not going to have a zero
probability.
In Figure 6.26 we also consider the word network, which does not appear in d.
In this case, the MLE estimate is zero, and its smoothed probability is 0 + λ · p(w | C) =
λ · 0.001. You can see now that αd in this smoothing method is just λ
p(w|d) = (1 − λ) · c(w, d)/|d| + λ · p(w|C),      λ ∈ [0, 1]

Document d (total # words = 100): text 10, mining 5, association 3, database 3, algorithm 2,
. . . , query 1, efficient 1 (network 0).
Maximum likelihood unigram LM p(w|θ): text 10/100, mining 5/100, association 3/100,
database 3/100, . . . , query 1/100, network 0/100.
Collection LM p(w|C): the 0.1, a 0.08, . . . , computer 0.02, database 0.01, . . . , text 0.001,
network 0.001, mining 0.0009, . . .
p(“text”|d) = (1 − λ) · 10/100 + λ · 0.001        p(“network”|d) = λ · 0.001
Figure 6.26 Smoothing the query likelihood retrieval function with linear interpolation: Jelinek-
Mercer smoothing.
because that’s the coefficient in front of the probability of the word given by the
collection language model.
The second smoothing method we will discuss is called Dirichlet prior smooth-
ing, or Bayesian smoothing. Again, we face the problem of zero probability for words
like network. Just like Jelinek-Mercer smoothing, we’ll use the collection language
model, but in this case we’re going to combine it with the MLE esimate in a some-
what different way. The formula first can be seen as an interpolation of the MLE
probability and the collection language model as before. Instead, however, αd is
not simply a fixed λ, but a dynamic coefficient which takes μ > 0 as a parameter.
Based on Figure 6.27, we can see if we set μ to a constant, the effect is that a
long document would actually get a smaller coefficient here. Thus, a long docu-
ment would have less smoothing as we would expect, so this seems to make more
sense than fixed-coefficient smoothing. The two coefficients |d|/(|d| + μ) and μ/(|d| + μ) would
still sum to one, giving us a valid probability model. This smoothing can be un-
derstood as a dynamic coefficient interpolation. Another way to understand this
formula—which is even easier to remember—is to rewrite this smoothing method
in this form:
p(w | d) = (c(w, d) + μ · p(w | C)) / (|d| + μ). (6.8)
Figure 6.27 Smoothing the query likelihood retrieval function with linear interpolation: Dirichlet
prior smoothing.
Here, we can easily see what change we have made to the MLE. In this form, we see that we
add a count of μ · p(w | C) to every word, which is proportional to the probability of w in the
entire corpus. We pretend every word w has μ · p(w | C) additional pseudocounts. Since we add
this extra probability mass in the numerator, we have to re-normalize in order to have a valid
probability distribution. Since Σ_{w∈V} p(w | C) = 1, we can add μ to the denominator, which
equals the total number of pseudocounts added across all words in the numerator.
Let’s also take a look at this specific example again. For the word text, we will have
ten counts that we actually observe but we also added some pseudocounts which
are proportional to the probability of text in the entire corpus. Say we set μ = 3000,
meaning we will add 3000 extra word counts into our smoothed model. We want
some portion of the 3000 counts to be allocated to text; since p(text | C) = 0.001,
we’ll assign 0.001 . 3000 counts to that word. The same goes for the word network;
for d, we observe zero counts, but also add μ . p(network | C) extra pseudocounts
for our smoothed probability.
In Dirichlet prior smoothing, αd will actually depend on the current document
being scored, since |d| is used in the smoothed probability. In the Jelinek-Mercer
linear interpolation, αd = λ, which is a constant. For Dirichlet prior, we have αd = μ/(|d| + μ),
which is the interpolation coefficient applied to the collection language
model. For a slightly more detailed derivation of these variables, the reader may
consult Appendix A.
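As a minimal sketch, the two smoothed estimates of p(w | d) can be written as follows; the parameter values and the example numbers (taken from Figure 6.26) are only illustrative.

```python
def p_jelinek_mercer(count_w_d, doc_len, p_w_C, lam=0.1):
    """Jelinek-Mercer: p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|C); alpha_d = lambda."""
    return (1 - lam) * count_w_d / doc_len + lam * p_w_C

def p_dirichlet(count_w_d, doc_len, p_w_C, mu=2000):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu); alpha_d = mu / (|d| + mu)."""
    return (count_w_d + mu * p_w_C) / (doc_len + mu)

# Example from Figure 6.26: c("text", d) = 10, |d| = 100, p("text"|C) = 0.001
print(p_jelinek_mercer(10, 100, 0.001, lam=0.1))   # (1 - 0.1) * 0.1 + 0.1 * 0.001
print(p_dirichlet(0, 100, 0.001, mu=3000))         # an unseen word still gets some mass
```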
Now that we have defined pseen and αd for both smoothing methods, let’s plug
these variables in the original smoothed query likelihood retrieval function. Let’s
start with Jelinek-Mercer smoothing:
pseen(w | d) / (αd · p(w | C)) = [(1 − λ) · pMLE(w | d) + λ · p(w | C)] / (λ · p(w | C))
                              = 1 + ((1 − λ)/λ) · c(w, d) / (|d| · p(w | C)). (6.9)
Then, plugging this into the entire query likelihood retrieval formula, we get
scoreJM(q, d) = Σ_{w∈q,d} c(w, q) · log [ 1 + ((1 − λ)/λ) · c(w, d) / (|d| · p(w | C)) ]. (6.10)
We ignore the |q| log αd additive term (derived in the previous section) since
αd = λ does not depend on the current document being scored. We’ll end up having
a ranking function that is strikingly similar to a vector space model since it is a
sum over all the matched query terms. The value of the logarithm term is non-
negative. We see very clearly the TF weighting in the numerator, which is scaled
sublinearly. We also see the IDF-like weighting, which is the p(w | C) term in the
denominator; the more frequent the term is in the entire collection, the more
discounted the numerator will be. Finally, we can see the |d| in the denominator is a
form of document length normalization, since as |d| grows, the overall term weight
would decrease, suggesting that the impact of αd in this case is clearly to penalize
a long document. The second fraction can also be considered as the ratio of two
probabilities; if the ratio is greater than one, it means the probability of w in d is
greater than appearing by chance in the background. If the ratio is less than one,
the chance of seeing w in d is actually less likely than observing it in the collection.
What’s also important to note is that we received this weighting function auto-
matically by making various assumptions, whereas in the vector space model, we
had to go through those heuristic design choices in order to get this. These are the
advantages of using this kind of probabilistic reasoning where we have made ex-
plicit assumptions. We know precisely why we have a logarithm here, and precisely
why we have these probabilities. We have a formula that makes sense and does
TF-IDF weighting and document length normalization.
Let’s look at the complete function for Dirichlet prior smoothing now. We know what pseen is
and we know that αd = μ/(|d| + μ); therefore,

pseen(w | d) / (αd · p(w | C)) = [(c(w, d) + μ · p(w | C)) / (|d| + μ)] / [(μ · p(w | C)) / (|d| + μ)]
                              = 1 + c(w, d) / (μ · p(w | C)). (6.12)
The form of the function looks very similar to the Jelinek-Mercer scoring func-
tion. We compute a ratio that is sublinearly scaled by a non-negative logarithm.
Both TF and IDF are computed in almost the exact same way. The difference here
is that Dirichlet prior smoothing can capture document length normalization dif-
ferently than Jelinek-Mercer smoothing. Here, we have retained the |q| log αd term
since αd depends on the document, namely |d|. If |d| is large, then less extra mass
is added onto the final score; if |d| is small, more extra mass is added to the score,
effectively rewarding a short document.
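Putting the two derivations together, here is a minimal sketch of both ranking functions in their "sum over matched query terms" form; it assumes we already have query term counts, per-document term counts, document lengths, and collection probabilities, and (as in the text) it drops any term that does not depend on the document being ranked.

```python
import math

def score_jelinek_mercer(query_counts, doc_counts, doc_len, p_C, lam=0.1):
    """Jelinek-Mercer query likelihood, as in Eq. (6.10); the document-independent
    |q| * log(lambda) term is dropped because it does not affect ranking."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf > 0:   # only matched query terms contribute
            score += qtf * math.log(1 + (1 - lam) / lam * tf / (doc_len * p_C[w]))
    return score

def score_dirichlet(query_counts, doc_counts, doc_len, p_C, mu=2000):
    """Dirichlet-prior query likelihood; |q| * log(alpha_d) is kept because
    alpha_d = mu / (|d| + mu) depends on the document length."""
    q_len = sum(query_counts.values())
    score = q_len * math.log(mu / (doc_len + mu))
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf > 0:
            score += qtf * math.log(1 + tf / (mu * p_C[w]))
    return score
```

Here query_counts and doc_counts are assumed to be {word: count} dictionaries and p_C maps each word to its collection probability; these representations are illustrative, not prescribed by the text.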
To summarize this section, we’ve talked about two smoothing methods: Jelinek-
Mercer, which is doing the fixed coefficient linear interpolation, and Dirichlet prior,
which adds pseudo counts proportional to the probability of the current word in
the background collection. In most cases we can see, by using these smoothing
methods, we will be able to reach a retrieval function where the assumptions are
clearly articulated, making them less heuristic than some of the vector space mod-
els. Even though we didn’t explicitly set out to define the popular VS heuristics, in
the end we naturally arrived at TF-IDF weighting and document length normaliza-
tion, perhaps justifying their inclusion in the VS models. Each of these functions
also has a smoothing parameter (λ or μ) with an intuitive meaning. Still, we need to
set these smoothing parameters or estimate them in some way. Overall, this shows
that by using a probabilistic model, we follow very different strategies than the vec-
tor space model. Yet in the end, we end up with retrieval functions that look very
similar to the vector space model. Some advantages here are having assumptions
clearly stated and a final form dictated by a probabilistic model.
This section also concludes our discussion of the query likelihood probabilistic
retrieval models. Let’s recall the assumptions we have made in order to derive the functions
that we have seen:
. relevance is modeled with a binary variable R and approximated by the query likelihood p(q | d);
. query words are generated independently;
. each word is drawn from a (smoothed) unigram document language model rather than from the
document itself; and
. the probability of a word unseen in the document is proportional to its probability in the
collection language model.
If we make these four assumptions, then we have no choice but to take the
form of the retrieval function that we have seen earlier. Fortunately, the function
has a nice property in that it implements TF-IDF weighting and document length
normalization. In practice, these functions also work very well. In that sense, these
functions are less heuristic compared with the vector space model.
Exercises
6.1. Here’s a query and document vector. What is the score for the given document
using dot product similarity?
d = {1, 0, 0, 0, 1, 4} q = {2, 1, 0, 1, 1, 1}
6.2. In what kinds of queries do we probably not care about query term frequency?
6.3. Let d be a document in a corpus. Suppose we add another copy of d to collec-
tion. How does this affect the IDF of all words in the corpus?
6.4. Given a fixed vocabulary size, the length of a document is the same as the
length of the vector used to represent it. True or false? Why?
6.5. Consider Euclidean distance as our similarity measure for text documents:
d(q, d) = √( Σ_{i=1}^{|V|} (qi − di)² ).
What does this measure capture compared to the cosine measure discussed in this
chapter? Would you prefer one over the other?
6.7. Which of the following ways is best to reduce the size of the vocabulary in a
large corpus?
Remove top 10 words
Remove words that occur 10 times or fewer
6.8. Why might using raw term frequency counts with dot product similarity not
give the best possible ranking?
6.9. How can you apply the VS model to a domain other than text documents? For
example, how do you find similar movies in IMDB or similar music to a specific
song? Hint: first define your concept space; what is your “term” vector?
where:
k > 0 is some parameter;
c(w, C) and c(w, d) are the count of the current word in the collection and
current document, respectively;
N is the total number of documents;
df(w) is the number of documents that the current word w appears in;
n is the document length of the current document;
navg is the average document length of the corpus.
7
Feedback
In this chapter, we will discuss feedback in a TR system. Feedback takes the results
of a user’s actions or previous search results to improve retrieval results. This is
illustrated in Figure 7.1. As shown, feedback is often implemented as updates to
a query, which alters the list of returned documents. We can see the user would
type in a query and then the query would be sent to a standard search engine,
which returns a ranked list of results (we discussed this in depth in Chapter 6).
These search results would be shown to the user. The user can make judgements
about whether each returned document is useful or not. For example, the user may
say one document is good or one document is not very useful. Each decision on a
document is called a relevance judgment. This overall process is a type of relevance
feedback, because we’ve got some feedback information from the user based on the
judgements of the search results.
As one would expect, this can be very useful to the retrieval system since we
should be able to learn what exactly is interesting to a particular user or users.
The feedback module would then take these judgements as input and also use
the document collection to try to improve future rankings. As mentioned, it would
typically involve updating the query so the system can now rank the results more
accurately for the user; this is the main idea behind relevance feedback.
These types of relevance judgements are reliable, but the users generally don’t
want to make extra effort unless they have to. There is another form of feedback
called pseudo relevance feedback, or blind feedback. In this case, we don’t have to
involve users since we simply assume that the top k ranked documents are relevant.
Let’s say we assume the top k = 10 documents are relevant. Then, we will use these
documents to learn and to improve the query. But how could this help if the top-
ranked documents are random? In fact, the top documents are actually similar to
relevant documents, even if they are not relevant. Otherwise, how would they have
appeared high in the ranked list? So, it’s possible to learn some related terms to
the query from this set anyway regardless whether the user says that a document is
relevant or not.
[Figure 7.1: a query is sent to the retrieval engine, which returns a ranked list of scored
results (d1 3.5, d2 2.4, . . . , dk 0.5, . . .); judgments on these results feed back into an
updated query.]
You may recall that we talked about using language models to analyze word
associations by learning related words to the word computer (see Chapter 3). First,
we used computer to retrieve all the documents that contain that word. That is,
imagine the query is computer. Then, the results will be those documents that
contain computer. We take the top k results that match computer well and we
estimate term probabilities (by counting them) in this set for our topic language
model. Lastly, we use the background language model to choose the terms that
are frequent in this retrieved set but not frequent in the whole collection. If we
contrast these two ideas, what we can find is that we’ll learn some related terms to
computer. These related words can then be added to the original query to expand
the query, which helps find documents that don’t necessarily match computer, but
match other words like program and software that may not have been in the original
query.
Unfortunately, pseudo relevance feedback is not completely reliable; we have
to arbitrarily set a cutoff and hope that the ranking function is good enough to
get at least some useful documents. There is also another feedback method called
implicit feedback. In this case, we still involve users, but we don’t have to explicitly
ask them to make judgements. Instead, we are going to observe how the users
interact with the search results by observing their clickthroughs. If a user clicked
on one document and skipped another, this gives a clue about whether a document
is useful or not. We can even assume that we’re going to use only the snippet here
in a document that is displayed on the search engine results page (the text that’s
actually seen by the user). We can assume this displayed text is probably relevant
or interesting to the user since they clicked on it. This is the idea behind implicit
feedback and we can again use this information to update the query. This is a very
important technique used in modern search engines—think about how Google and
Bing can collect user activity to improve their search results.
To summarize, we talked about three types of feedback. In relevance feedback,
we use explicit relevance judgements, which require some user effort, but this
method is the most reliable. We talked about pseudo feedback, where we simply
assumed the top k document are relevant without involving the user at all. In
this case, we can actually do this automatically for each query before showing the
user the final results page. Lastly, we mentioned implicit feedback where we use
clickthrough data. While this method does involve users, the user doesn’t have to
make explicit effort to make judgements on the results.
Next, we will discuss how to apply feedback techniques to both the vector space
and query likelihood retrieval models. The following sections do not distinguish how the
feedback documents are obtained, since regardless of how they are obtained, they are handled
in the same way by each of the two feedback methods.
Figure 7.2 Illustration of Rocchio Feedback: adjusting weights in the query vector to move it closer
to a cluster of relevant documents. (The figure shows the original query q moving to qm, toward
the centroid of the relevant documents, marked +, and away from the centroid of the non-relevant
documents, marked −.)
Not all the top-ranked documents will be positive, and this is the motivation behind
feedback in the first place. Our goal is to move the query vector to some position to
improve the retrieval accuracy, shifting the dotted circle of similarity. By looking at
this diagram, we see that we should move the query vector so that the dotted circle
encompasses more + documents than − documents. This is the basic idea behind
Rocchio feedback.
Geometrically, we’re talking about moving a vector closer to some vectors and
away from other vectors. Algebraically, it means we have the following formula
(using the arrow vector notation for clarity):
qm = α · q + (β/|Dr|) Σ_{dj∈Dr} dj − (γ/|Dn|) Σ_{dj∈Dn} dj , (7.1)
where q is the original query vector that is transformed into qm, the modified (i.e.,
expanded) vector. Dr is the set of relevant feedback documents and Dn is the set of
non-relevant feedback documents. Additionally, we have the parameters α, β , and
γ which are weights that control the amount of movement of the original vector.
In terms of movement, we see that the terms in the original query are boosted by a
factor of α, and terms from positive documents are boosted by a factor of β, while
terms from negative documents are shrunk by a factor of γ .
Another interpretation of the second term (the sum over positive documents)
is the centroid vector of relevant feedback documents while the third term is the
centroid vector of the negative feedback documents. In this sense, we shift the
original query towards the relevant centroid and away from the negative centroid.
In each of these sums, averaging a single dimension over the feedback documents gives that
dimension’s weight in the corresponding centroid vector.
After we have performed these operations, we will get a new query vector which
can be used again to score documents in the index. This new query vector will then
reflect the move of the original query vector toward the relevant centroid vector and
away from the non-relevant centroid vector.
Let’s take a look at a detailed example depicted below. Imagine we have a small
vocabulary,
and a query
q = {1, 1, 1, 1, 0, 0}.
Recall from Chapter 6 that our vocabulary V is a fixed-length term vector. It’s not
necessary to know what type of weighting scheme this search engine is using, since
in Rocchio feedback, we will only be adding and subtracting term weights from the
query vector.
Say we are given five feedback documents whose term vectors are denoted as
relevant with a + prefix. The negative feedback documents are prefixed with −.
For Rocchio feedback, we first compute the centroid of the positive and nega-
tive feedback documents. The centroid of the positive documents would have the
average of each dimension, and the case is the same for the negative centroid:
+ Cr = { (1.5+1.5)/2, 0.0, (3.0+4.0)/2, (2.0+2.0)/2, 0.0, 0.0 } = { 1.5, 0.0, 3.5, 2.0, 0.0, 0.0 }
− Cn = { (1.5+1.5+1.5)/3, (0.1+0.1+0.0)/3, 0.0, (0.0+2.0+6.0)/3, (0.0+2.0+2.0)/3, 0.0 }
     = { 1.5, 0.067, 0.0, 2.67, 1.33, 0.0 }
Now that we have the two centroids, we modify the original query to create the
expanded query qm:
qm = α · q + β · Cr − γ · Cn
   = {α + 1.5β − 1.5γ , α − 0.067γ , α + 3.5β , α + 2β − 2.67γ , −1.33γ , 0}.
We have the parameter α controlling the original query term weight, which all
happened to be one. We have β to control the influence of the relevant centroid
vector Cr . Finally, we have γ , which is the non-relevant centroid Cn weight. Shifting
the original query vector q by these amounts yields our modified query qm. We
rerun the search with this new query. Due to the movement of the query vector,
we should match the relevant documents much better, since we moved q closer to
them and away from the non-relevant documents—this is precisely what we want
from feedback.
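A minimal sketch of Equation 7.1 on dense term vectors is shown below; the feedback vectors are illustrative (chosen only so that their centroids match the worked example above), and the parameter values are arbitrary.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Modified query: q_m = alpha*q + (beta/|Dr|) * sum(Dr) - (gamma/|Dn|) * sum(Dn)."""
    dims = range(len(query))
    centroid_r = [sum(d[i] for d in relevant) / len(relevant) for i in dims]
    centroid_n = [sum(d[i] for d in non_relevant) / len(non_relevant) for i in dims]
    return [alpha * query[i] + beta * centroid_r[i] - gamma * centroid_n[i] for i in dims]

q = [1, 1, 1, 1, 0, 0]
relevant = [[1.5, 0.0, 3.0, 2.0, 0.0, 0.0],
            [1.5, 0.0, 4.0, 2.0, 0.0, 0.0]]
non_relevant = [[1.5, 0.1, 0.0, 0.0, 0.0, 0.0],
                [1.5, 0.1, 0.0, 2.0, 2.0, 0.0],
                [1.5, 0.0, 0.0, 6.0, 2.0, 0.0]]
print(rocchio(q, relevant, non_relevant))   # the expanded query vector q_m
```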
If we apply this method in practice we will see one potential problem: we would
be performing a somewhat large computation to calculate the centroids and modify
all the weights in the new query. Therefore, we often truncate this vector and
only retain the terms which contain the highest weights, considering only a small
number of words. This is for efficiency. Additionally, negative examples or non-
relevant examples tend not to be very useful, especially compared with positive
examples. One reason is because negative documents distract the query in all
directions, so taking the average doesn’t really tell us where exactly it should be
moving to. On the other hand, positive documents tend to be clustered together
and they are often in a consistent direction with respect to the query. Because of
this effect, we sometimes don’t use the negative examples or set the parameter γ
to be small.
It’s also important to avoid over-fitting, which means we have to keep relatively
high weight α on the original query terms. We don’t want to overly trust a small
sample of documents and completely reformulate the query without regard to
its original meaning. Those original terms are typed in by the user because the
user decided that those terms were important! Thus, we bias the modified vector
towards the original query direction. This is especially true for pseudo relevance
feedback, since the feedback documents are less trustworthy. Despite these issues,
the Rocchio method is usually robust and effective, making it a very popular method
for feedback.
Query likelihood:   f(q, d) = Σ_{w∈d, w∈q} c(w, q) [ log ( pseen(w|d) / (αd p(w|C)) ) ] + n log αd
KL-divergence (cross entropy):   f(q, d) = Σ_{w∈d, p(w|θQ)>0} p(w|θ̂Q) [ log ( pseen(w|d) / (αd p(w|C)) ) ] + log αd
Query LM:   p(w|θ̂Q) = c(w, Q) / |Q|
Figure 7.3 The KL-divergence retrieval model changes the way we represent the query. This
enables feedback information to be incorporated into the query more easily.
[Figure 7.4: KL-divergence retrieval with feedback. A document language model θD and a query
language model θQ are estimated, and documents are scored by D(θQ′ || θD); the query model θQ′
ranges from θQ′ = θQ (no feedback) to θQ′ = θF (full feedback).]
So, the two formulas look almost identical except that in the generalized formula
we have a probability of a word given by a query language model. Still, we add all
the words that are in the document and have non-zero probability for the query
language model. Again, this becomes a generalization of summing over all the
matching query words. We can recover the original query likelihood formula by
simply setting the query language model to be the relative frequency of a word in
the query, which eliminates the query length term n = |q| which is a constant.
Figure 7.4 shows that we first estimate a document language model, then we
estimate a query language model and we compute the KL-divergence, often denoted
by D(.||.). We compute a language model from the documents containing the
query terms called the feedback language model θF . This feedback language model
is similar to the positive centroid Cr in Rocchio feedback. This model can be
combined with the original query language model using a linear interpolation,
which produces an updated model, again just like Rocchio.
We have a parameter α ∈ [0, 1] that controls the strength of the feedback docu-
ments. If α = 0, there is no feedback; if α = 1, we receive full feedback and ignore
the original query. Of course, these extremes are generally not desirable. The main
question is how to compute this θF .
Now, we’ll discuss one of the approaches to estimate θF . This approach is based
on a generative model shown in Figure 7.5. Let’s say we are observing the posi-
tive documents, which are collected by users’ judgements, the top k documents
from a search, clickthrough logs, or some other means. One approach to estimate
a language model over these documents is to assume these documents are gen-
[Figure 7.5: generative mixture model for the feedback documents. With probability λ a word w
is generated from the background model p(w|C); with probability 1 − λ it is generated from the
topic model p(w|θ).]
erated from some ideal feedback language model as we did before; this entails
normalizing all the frequency counts from all the feedback documents. But is this
distribution good for feedback? What would the top-ranked words in θF be?
As depicted in the language model on the right in Figure 7.6, the high-scoring
words are actually common words like the. This isn’t very good for feedback, be-
cause we will be adding many such words to our query when we interpolate with
our original query language model. Clearly, we need to get rid of these stop words.
In fact, we have already seen one way to do that, by using a background language
model while learning word associations in Chapter 2. Instead, we’re going to talk
about another approach which is more principled. What we can do is to assume
that those unwanted words are from the background language model. If we use a
maximum likelihood estimate, a single model would have been forced to assign
high probabilities to a word like the because it occurs so frequently. In order to re-
duce its probability in this model, we have to have another model to explain such a
common word. It is appropriate to use the background language model to achieve
this goal because this model will assign high probabilities to these common words.
We assume the machine that generated these words would work as follows.
Imagine we flip a coin to decide what distribution to use (topic words or background
words). With the probability of λ ∈ [0, 1] the coin shows up as heads and then
we’re going to use the background language model. Once we know we will use the
background LM, we can then sample a word from that model. Alternatively, with
probability 1 − λ, we decide to use an unknown topic model to generate a word. This
is a mixture model because there are two distributions that are mixed together, and
we actually don’t know when each distribution is used. We can treat this feedback
mixture model as a single distribution in that we can still ask it to generate words,
and it will still give us a word in a random way (according to the underlying models).
Which word will show up depends on both the topic distribution and background
distribution. In addition, it would also depend on the mixing parameter λ; if λ is
high, it’s going to prefer the background distribution. Conversely, if λ is very small,
we’re going to use only our topic words.
Once we’re thinking this way, we can do exactly the same as what we did before
by using MLE to adjust this model and set the parameters to best explain the data.
The difference, however, is that we are not asking the unknown topic model alone
to explain all the words; rather, we’re going to ask the whole mixture model to
explain the data. As a result, it doesn’t have to assign high probabilities to words like
the, which is exactly what we want. It would then assign high probabilities to other
words that are common in the topic distribution but not having high probability
in the background distribution. As a result, this topic model must assign high
probabilities to the words common in the feedback documents yet not common
across the whole collection.
Mathematically, we have to compute the log likelihood of the feedback docu-
ments F with another parameter λ, which denotes noise in the feedback
documents. We assume it will be fixed to some value. Assuming it’s fixed, then
we only have word probabilities θ as parameters, just like in the simplest unigram
language model. This gives us the following log likelihood to maximize when estimating the
feedback language model:

log p(F | θ) = Σ_{d∈F} Σ_{w∈V} c(w, d) log [ (1 − λ) p(w | θ) + λ p(w | C) ],

and θF is the topic model θ that maximizes it.
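One standard way to maximize this likelihood is the Expectation-Maximization (EM) algorithm. The sketch below is a minimal version under the assumptions in the text (a fixed λ and unigram models); the function names, the tiny floor on background probabilities, and the number of iterations are illustrative choices, not part of the model itself.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_background, lam=0.5, iters=30):
    """EM for the two-component mixture: each word occurrence in the feedback
    documents F comes from the background model p(w|C) with probability lam,
    and from the unknown topic model p(w|theta_F) with probability 1 - lam."""
    counts = Counter()
    for doc in feedback_docs:          # each doc is a list of tokens
        counts.update(doc)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}      # uniform initialization
    for _ in range(iters):
        # E-step: probability that an occurrence of w was generated by the topic model
        z_topic = {}
        for w in vocab:
            topic = (1 - lam) * theta[w]
            background = lam * p_background.get(w, 1e-10)   # tiny floor for safety
            z_topic[w] = topic / (topic + background)
        # M-step: re-estimate theta_F from the counts attributed to the topic model
        weighted = {w: counts[w] * z_topic[w] for w in vocab}
        total = sum(weighted.values())
        theta = {w: weighted[w] / total for w in vocab}
    return theta

# theta_F is then interpolated with the original query model:
#   p(w | theta_Q') = (1 - alpha) * p(w | theta_Q) + alpha * p(w | theta_F)
```

Because common words like the are largely explained by the background component, the estimated θF concentrates its mass on words that are frequent in the feedback documents but not in the collection, which is exactly the behavior described above.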
Exercises
7.1. How should you set the Rocchio parameters α, β , and γ depending on what
type of feedback you are using? That is, should the parameters be set differently if you are
using explicit relevance feedback, pseudo relevance feedback, or implicit feedback?
7.2. Imagine you are in charge of a large search-engine company. What other
strategies could you devise to get relevance judgments from users?
7.3. Say one of your new strategies is to measure the amount of time t a user spends
on each search result document. How can you incorporate this t for each document
into a feedback measure for a particular query?
7.5. Implement mixture model feedback for language models in META. Use which-
ever method is most convenient to estimate θF . Or, compare different estimation
methods for θF .
7.6. After implementing Rocchio pseudo feedback, index a dataset with relevance
judgements. Plot MAP (see Chapter 9) across different values of k. Do you see any
trends?
7.7. After implementing mixture model feedback, index a dataset with relevance
judgements. Plot MAP (see Chapter 9) across different values of the mixing param-
eter α. Do you see any trends?
7.8. Design a heuristic to automatically determine the best k for pseudo feedback
on a query-by-query basis. You could look at the query itself, the number of match-
ing documents, or the distribution of ranking scores in the original results. Test
your heuristic by doing experiments.
7.9. Design a heuristic to automatically determine the best α for mixture model
feedback on a query-by-query basis. You could look at the query itself, the number
of matching documents, or the distribution of ranking scores in the original results.
Test your heuristic by doing experiments.
7.13. In a real search system, storing modified query vectors for all observed
queries will take up a large amount of space. How could you optimize the amount
of space required? What kind of solutions provide a tradeoff between space and
query time? How about an online system that benefits the majority of users or the
majority of queries?
8
Search Engine Implementation
This chapter focuses on how to implement an information retrieval (IR) system or
a search engine. In general, an IR system consists of four components.
For the first three items, there are fairly standard techniques that are essentially
used in all current search engines. The techniques for implementing feedback,
however, highly depend on the learning approaches and applications. Despite this,
we did discuss some common methods for feedback in the previous chapter.
We will additionally investigate two additional optimizations. These are not
required to ensure the correctness of an information retrieval system, but they will
enable such a system to be much more efficient in both speed and disk usage.
The following sections in this chapter discuss each of the above components
in turn.
8.1 Tokenizer
Document tokenization is the first step in any text mining task. This determines
how we represent a document. We saw in the previous chapter that we often rep-
resent documents as document vectors, where each index corresponds to a single
word. The value stored in the index is then a raw count of the number of occurrences
of that word in a particular document.
When running information retrieval scoring functions on these vectors, we usu-
ally prefer some alternate representation of term count, such as smoothed term
count, or TF-IDF weighting. In real search engine systems, we often leave the term
scoring up to the index scorer module. Thus, in tokenization we will simply use
the raw count of features (words), since the raw count can be used by the scorer to
calculate some weighted term representation. Additionally, calculating something
like TF-IDF is more involved than a simple scanning of a single document (since we
need to calculate IDF). Furthermore, we’d like our scorer to be able to use different
1. As we will discuss, the string terms themselves are almost always represented as term IDs, and
most of the processing on “words” is done on integer IDs instead of strings for efficiency.
scoring functions as necessary; storing only TF-IDF weight would then require us
to always use TF-IDF weighting.
Therefore, a tokenizer’s job is to segment the document into countable features
or tokens. A document is then represented by how many and what kind of tokens
appear in it. The raw counts of these tokens are used by the scorer to formulate the
retrieval scoring functions that we discussed in the previous chapter.
The most basic tokenizer we will consider is a whitespace tokenizer. This tok-
enizer simply delimits words by their whitespace. Thus,
could result in
A slightly more advanced unigram words tokenizer could first lowercase the
sentence and split the words based on punctuation. There is a special case here
where the period after Mr. is not split (since it forms a unique word):
{1, 1, 1, 1, 1, 2, 1, 1}.
Of course, a real document vector would be much larger and much sparser—that
is, most of the dimensions will have a count of zero.
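The sketch below shows one possible unigram-words tokenizer of the kind described above: lowercase the text, split on whitespace and punctuation, keep a few known abbreviations (such as "mr.") intact, and count the tokens into a sparse vector. The abbreviation list and the regular expression are simplifying assumptions, not the book's exact rules.

```python
import re
from collections import Counter

ABBREVIATIONS = {"mr.", "mrs.", "dr."}   # illustrative special cases kept as single tokens

def tokenize(text):
    """Lowercase, then split into word tokens, keeping known abbreviations whole."""
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(raw)
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", raw))   # strip punctuation
    return tokens

def to_sparse_vector(text):
    """Map a document to {term: raw count}; weighting is left to the index scorer."""
    return Counter(tokenize(text))

print(to_sparse_vector("Mr. Smith visited the white house near the river."))
```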
This process is also called feature generation. It defines the building blocks
of our document objects and gives us meaningful ways to compare them. Once
we define how to conceptualize documents, we can index them, cluster them,
and classify them, among many other text mining tasks. As mentioned in the
Introduction, tokenization is perhaps the most critical component of our indexer,
since all downstream operations depend on its output.
8.2 Indexer
Modern search engines are designed to be able to index data that is much larger
than the amount of system memory. For example, a Wikipedia database dump is
about 40 GB of uncompressed text. At the time of writing this book, this is much
larger than the amount of memory in common personal systems, although it is
quite a common dataset for computer science researchers. TREC research datasets
may even be as large as several terabytes. This doesn’t even take into account real-
world production systems such as Google that index the entire Web.
This requires us to design indexing systems that only load portions of the raw
corpus in memory at one time. Furthermore, when running queries on our indexed
files, we want to ensure that we can return the necessary term statistics fast enough
to ensure a usable search engine. Scanning over every document in the corpus to
match terms in the query will not be sufficient, even for relatively small corpora.
An inverted index is the main data structure used in a search engine. It allows
for quick lookup of documents that contain any given term. The relevant data
structures include (1) the lexicon (a lookup table of term-specific information, such
as document frequency and where in the postings file to access the per-document
term counts) and (2) the postings file (mapping from any term integer ID to a list
of document IDs and frequency information of the term in those documents).
Term ID: 56
Document frequency: 78
Total number of occurrences: 443
Offset into postings file: 8923754
Of course, the actual lexicon would just store 56 → {78, 443, 8923754}. Since
the tokenizer assigned term IDs sequentially, we could represent the lexicon as
a large array indexed by term ID. Each element in the large array would store tuples
of (document frequency, total count, offset) information. If we seek to position
8,923,754 in the large postings file, we could see something like
Term ID: 56
Doc ID: 4, count: 1, position 56
Doc ID: 7, count: 9, position 4, position 89, position...
Doc ID: 24, count: 19, position 1, position 67, position...
which is the counts and position information for the 78 documents that term ID 56
appears in. Notice how the doc IDs (and positions) are stored in increasing order;
this is a fact we will take advantage of when compressing the postings file. Also
make note of the large difference in size of the lexicon and postings file. For each
entry in the lexicon, we know we will only store three values per term. In the postings
file, we store at least three values (doc ID, count, positions) for each document that
the term appears in. If the term appears in all documents, we’d have a list of the
length of the number of documents in the corpus. This is true for all unique terms.
For this reason, we often assume that the lexicon can fit into main memory and the
postings file resides on disk, and is seeked into based on pointers from the lexicon.
Indexing is the process of creating these data structures based on a set of tok-
enized documents. A popular approach for indexing is the following sorting-based
approach.
Figure 8.2 shows how documents produce terms originally in document ID order.
The terms from multiple documents are then sorted by term ID in small postings
chunks that fit in memory before they are flushed to the disk.
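An in-memory sketch of the resulting data structures is given below; a production indexer would write sorted postings chunks to disk and merge them, and would store byte offsets in the lexicon, but the layout of the lexicon and postings is the same. The function and variable names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of token lists; the list position is the document ID.
    Returns (term_ids, lexicon, postings) where
      term_ids[token]   = integer term ID, assigned sequentially on first appearance
      lexicon[term_id]  = (document frequency, total number of occurrences)
      postings[term_id] = [(doc_id, count), ...] with doc IDs in increasing order."""
    term_ids = {}
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        counts = defaultdict(int)
        for tok in tokens:
            term_id = term_ids.setdefault(tok, len(term_ids))
            counts[term_id] += 1
        for term_id, count in counts.items():
            postings[term_id].append((doc_id, count))   # doc IDs arrive in increasing order
    lexicon = {tid: (len(plist), sum(c for _, c in plist))
               for tid, plist in postings.items()}
    return term_ids, lexicon, postings
```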
A forward index may be created in a very similar way to the inverted index. Instead
of mapping terms to documents, a forward index maps documents to a list of
terms that occur in them. This type of setup is useful when doing other operations
aside from search. For example, clustering or classification would need to access
an entire document’s content at once. Using an inverted index to do this is not
efficient at all, since we’d have to scan the entire postings file to find all the terms
that occur in a specific document. Thus, we have the forward index structure that
records a term vector for each document ID.
In the next section, we’ll see how using the inverted term-to-document mapping
can greatly decrease query scoring time. There are other efficiency aspects that are
relevant to the forward index as well, such as compression and caching.
8.3 Scorer
Now that we have our inverted index, how can we use it to efficiently score queries?
Imagine for a moment that we don’t have an inverted index; we only have the
forward index, which maps document IDs to a list of terms that occur in them. To
score a query vector, we’d need to iterate through every single entry (i.e., document)
in the forward index and run a scoring function on each (document, query) pair.
Most likely, many documents do not contain any of the query terms (especially
if stop word removal is performed), which means that their ranking score will be
zero. Why should we bother scoring these documents anyway? This is exactly how
we can benefit from an inverted index: we score only documents that match at least
one query term—that is, only documents that can have a nonzero score. We assume
(and verify in practice) that scoring only the documents containing query terms
requires much less scoring computation. This leads us to our first scoring algorithm
using the inverted index: look up the postings list of each query term, accumulate
scores for the documents that appear in those lists, sort the scored documents, and
return (usually) the top k. Again, we save time in this sorting operation by sorting
only the documents that contained a query term, as opposed to sorting every single
document in the index even if its score is zero.
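One possible realization of this idea is a term-at-a-time accumulator, sketched below; the in-memory index type, the placeholder score_term function, and all names here are illustrative assumptions rather than the way META's rankers are actually organized.

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Posting { uint64_t doc_id; uint64_t count; };

// Hypothetical per-term contribution of the ranking function (e.g., BM25 or
// query likelihood in a real system); here just the raw term frequency.
double score_term(uint64_t /*term_id*/, const Posting& p) {
    return static_cast<double>(p.count);
}

// Score only documents that appear in some query term's postings list, then
// sort the accumulated scores and return the top k (doc ID, score) pairs.
std::vector<std::pair<uint64_t, double>>
score_query(const std::vector<uint64_t>& query_terms,
            const std::unordered_map<uint64_t, std::vector<Posting>>& index,
            size_t k) {
    std::unordered_map<uint64_t, double> acc;  // doc_id -> partial score
    for (uint64_t t : query_terms) {
        auto it = index.find(t);
        if (it == index.end()) continue;
        for (const Posting& p : it->second) acc[p.doc_id] += score_term(t, p);
    }
    std::vector<std::pair<uint64_t, double>> results(acc.begin(), acc.end());
    std::sort(results.begin(), results.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    if (results.size() > k) results.resize(k);
    return results;
}

Only the documents that ever received a partial score appear in the accumulator, which is exactly the saving described above.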
When the scoring work is distributed (for example, across partitions of the index),
the partial results are then merged together. This type of algorithm design—distributing
the work and then merging the results—is a very common paradigm called MapReduce.
We will discuss its generality and many other applications in future chapters.
8.4 Feedback Implementation
Chapter 7 discussed feedback in a standard information retrieval system. We saw
two implementations of feedback: the vector space Rocchio feedback and the query
likelihood mixture model for feedback.
Both can be implemented with the inverted index and document metadata we’ve
described in the previous sections. For Rocchio feedback, we can use the forward
index to obtain the vectors of both the query and feedback documents, running
the Rocchio algorithm on that set of vectors. The mixture model feedback method
requires a language model to be learned over the feedback documents; again, this
can be achieved efficiently by using the term counts from the forward index. The
only other information needed is the corpus background probabilities for each
term, which can be stored in the term lexicon.
With this information, it is now possible to create an online (or “in-memory”)
pseudo-feedback method. Recall that pseudo-feedback looks at the top k returned
documents from search and assumes they are relevant. The following process could
be used to enable online feedback.
. Run the user’s original query.
. Use the top k documents and the forward index to either modify the query vec-
tor (Rocchio) or estimate a language model and interpolate with the feedback
model (query likelihood).
. Rerun the search with the modified query and return the new results to the
user.
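To illustrate the Rocchio branch of this process, here is a minimal sketch that blends the original query vector with the centroid of the top-k document vectors taken from the forward index. The sparse-vector representation and the alpha/beta weights are assumptions for illustration, not the toolkit's implementation.

#include <cstdint>
#include <unordered_map>
#include <vector>

using SparseVector = std::unordered_map<uint64_t, double>;  // term_id -> weight

// Rocchio-style pseudo-feedback: combine the original query vector with the
// centroid of the top-k (assumed relevant) document vectors.
SparseVector rocchio_feedback(const SparseVector& query,
                              const std::vector<SparseVector>& top_docs,
                              double alpha, double beta) {
    SparseVector modified;
    for (const auto& [term, weight] : query) modified[term] += alpha * weight;
    if (!top_docs.empty()) {
        for (const SparseVector& doc : top_docs)
            for (const auto& [term, weight] : doc)
                modified[term] += beta * weight / top_docs.size();
    }
    return modified;
}
// The modified query is then rerun against the inverted index, as in step 3 above.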
There are both advantages and disadvantages to this simple feedback model.
For one, it requires very little memory and disk storage to implement since each
modified query is “forgotten” as soon as the new results are returned. Thus, we
don’t need to create any additional storage structures for the index.
The downside is that all the processing is done at query time, which could
be quite computationally expensive, especially when using a search engine with
many users. The completely opposite tradeoff is to store every modified query in a
database, and look up its expanded form, running the search function only once. Of
course, this is infeasible as the number of queries would quickly make the database
explode in size, not to mention that adding more documents to the index would
invalidate the stored query vectors (since the new documents might also match the
original query).
In practice, we can compromise between these two extremes, e.g., by storing only the
expansions of very frequent queries, or by using query similarity to look up a similar
query that has already been saved. The caching techniques discussed in a later section
also apply to feedback methods; consider how to adapt them from caching terms to
caching expanded queries.
Of course, this only touches on the pseudo-feedback method. There is also
clickthrough data, which can be stored in a database, and relevance judgements,
which can be stored the same way. Once we know which documents we’d like to
include in the chosen feedback method, all implementations are the same since
they deal with a list of feedback documents.
8.5 Compression
Another technical component in a retrieval system is integer compression, which
is applied to compress the very large postings file. A compressed index is not only
smaller, but also faster when it’s loaded into main memory. The general idea of
compressing integers (and compression in general) is to exploit the non-uniform
distribution of values. Intuitively, we will assign a short code to values that are
frequent at the price of using longer codes for rare values. The optimal compression
rate is related to the entropy of the random variable taking the values that we
consider—skewed distributions would have lower entropy and are thus easier to
compress.
It is important that all of our compression methods support random-access decoding;
that is, we would like to seek to a particular position in the postings file and start
decompressing without having to decompress all the preceding data.
Because inverted index entries are stored sequentially, we may exploit this fact
to compress document IDs (and position information) based on their gaps. The
document IDs would otherwise be distributed relatively uniformly, but the distri-
bution of their gaps would be skewed since when a term is frequent, its inverted list
would have many document IDs leading to many small gaps. Consider the following
example of a list of document IDs:
{23, 25, 34, 35, 39, 43, 49, 51, 57, 59, . . .}.
Instead of storing these exact numbers, we can store the offsets between them; this
creates many smaller numbers, which are easier to compress since they take fewer bits
to represent:
{23, 2, 9, 1, 4, 4, 6, 2, 6, 2, . . .}.
To get the actual document ID values, simply add the offset to the previous value.
So the first ID is 23 and the second is 23 + 2 = 25. The third is 25 + 9 = 34, and
so on.
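A minimal sketch of gap encoding and decoding for a sorted list of document IDs (illustrative only):

#include <cstdint>
#include <vector>

// Convert sorted doc IDs {23, 25, 34, ...} into gaps {23, 2, 9, ...}.
std::vector<uint64_t> to_gaps(const std::vector<uint64_t>& doc_ids) {
    std::vector<uint64_t> gaps;
    uint64_t prev = 0;
    for (uint64_t id : doc_ids) {
        gaps.push_back(id - prev);
        prev = id;
    }
    return gaps;
}

// Recover the doc IDs by keeping a running sum of the gaps.
std::vector<uint64_t> from_gaps(const std::vector<uint64_t>& gaps) {
    std::vector<uint64_t> doc_ids;
    uint64_t sum = 0;
    for (uint64_t g : gaps) {
        sum += g;
        doc_ids.push_back(sum);
    }
    return doc_ids;
}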
In this section, we will discuss several types of compression, which may or may not
operate on gap-encoded values: unary encoding, γ- and δ-encoding, and block
compression schemes such as vByte. We begin with unary encoding, which represents
a positive integer n as n − 1 zeros followed by a single one:
1→1
2 → 01
3 → 001
4 → 0001
5 → 00001
19 → 0000000000000000001
Note that we can’t encode the number zero—this is true of most other methods
as well. An example of a unary-encoded sequence is
000100100010000000101000100001 = 4, 3, 4, 8, 2, 4, 5.
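A sketch of unary encoding and decoding as shown above; for readability the bits are kept in a string rather than packed into bytes.

#include <cstdint>
#include <string>
#include <vector>

// Encode each n >= 1 as (n - 1) zeros followed by a single one bit.
std::string unary_encode(const std::vector<uint64_t>& values) {
    std::string bits;
    for (uint64_t n : values)
        bits += std::string(n - 1, '0') + '1';
    return bits;
}

// Decode by counting bits until the terminating one bit of each number.
std::vector<uint64_t> unary_decode(const std::string& bits) {
    std::vector<uint64_t> values;
    uint64_t run = 0;
    for (char b : bits) {
        ++run;
        if (b == '1') { values.push_back(run); run = 0; }
    }
    return values;
}
// unary_decode("000100100010000000101000100001") yields {4, 3, 4, 8, 2, 4, 5}.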
A drawback of these bitwise schemes is the number of operations required to encode
and decode. Every single bit needs to be read in order to read
one integer. Block compression attempts to alleviate this issue by reading bytes at
a time instead of bits. In block compression schemes, only one bitwise operation
per byte is usually required as opposed to at least one operation per bit in the pre-
vious three schemes (e.g., count how many bit operations are required to δ-encode
the integer 47).
Block compression seeks to reduce the number of CPU instructions required
in decoding at the expense of using more storage. The two block compression
methods we will investigate deal mainly with bytes instead of bits.
vByte stands for variable byte encoding. It uses the lower seven bits of a byte to
store a binary number and the most significant bit as a flag. The flag signals whether
the decoder should keep reading the binary number. The parentheses below are
added for emphasis.
1 → (0)0000001
2 → (0)0000010
19 → (0)0010011
47 → (0)0101111
127 → (0)1111111
128 → (1)0000000(0)0000001
194 → (1)1000010(0)0000001
678 → (1)0100110(0)0000101
20, 000 → (1)0100000(1)0011100(0)0000001
The decoder works by keeping a sum (which starts at zero) and adding each byte's
payload into the sum as it is processed. Notice how the bytes are "chained" together
backwards: for every "link" we follow, we left-shift the payload (the lower seven bits)
by 7 × k, where k is the number of bytes read so far. Therefore, the sum used to decode
the integer 20,000 is
32 + (28 × 2^7) + (1 × 2^14) = 32 + 3,584 + 16,384 = 20,000.
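A sketch of vByte encoding and decoding following the convention in the examples above: the payload occupies the low seven bits, the least significant group comes first, and the high bit is set on every byte except the last byte of a number. This is an illustration, not META's implementation.

#include <cstdint>
#include <vector>

// Encode n as a sequence of bytes, least significant 7 bits first. The high
// bit is 1 if more bytes follow and 0 on the final byte of the number.
void vbyte_encode(uint64_t n, std::vector<uint8_t>& out) {
    while (n >= 128) {
        out.push_back(static_cast<uint8_t>(0x80 | (n & 0x7F)));
        n >>= 7;
    }
    out.push_back(static_cast<uint8_t>(n));  // high bit 0 terminates the number
}

// Decode one integer starting at position pos; pos is advanced past it.
uint64_t vbyte_decode(const std::vector<uint8_t>& in, size_t& pos) {
    uint64_t sum = 0;
    unsigned shift = 0;
    while (true) {
        uint8_t byte = in[pos++];
        sum |= static_cast<uint64_t>(byte & 0x7F) << shift;  // payload shifted by 7 * k
        if ((byte & 0x80) == 0) return sum;  // flag bit 0: last byte of this number
        shift += 7;
    }
}

For example, encoding 20,000 produces the three bytes (1)0100000, (1)0011100, (0)0000001, and decoding sums 32 + (28 << 7) + (1 << 14) = 20,000.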
The second block compression method operates on chunks of document IDs. Suppose a
chunk contains the sorted IDs {45, 47, 51, 59, 63, 64, 70, 72}. We transform this block
by subtracting the minimum value in the list (45) from each element:
{0, 2, 6, 14, 18, 19, 25, 27}.
Up to this point, this is similar to the gap encoding discussed previously. However,
instead of encoding each value with a bitwise compression such as γ - or δ-encoding,
we will simply use binary with the smallest number of bits possible. Since the
maximum number in this chunk is 27, we will store each of the integers in 5 bits:
{00000, 00010, 00110, 01110, 10010, 10011, 11001, 11011}, min = 45, bits = 5.
45, 5, 0000000010001100111010010100111100111011,
where 45 and 5 could be stored however is convenient (e.g., fixed binary length).
This method also reduces the number of bitwise operations, since we read chunks
of 5 bits and add them to the base value 45 to recreate the sequence. Another
nice side effect is that we know the minimum value and maximum possible value
stored in this chunk; therefore, if we are looking for a particular integer (say for a
document ID), we know we can skip this chunk and not decompress it if ID < 45
or ID > 45 + 2^5.
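A sketch of this chunk scheme: subtract the minimum, compute the number of bits needed for the largest offset, and pack every offset in that many bits. The names and the std::vector<bool> bit buffer are illustrative; a real implementation would pack the bits into bytes on disk.

#include <algorithm>
#include <cstdint>
#include <vector>

struct PackedChunk {
    uint64_t min_value;       // e.g., 45
    unsigned bits_per_value;  // e.g., 5
    std::vector<bool> bits;   // packed offsets, bits_per_value bits each
};

// Pack one non-empty chunk of sorted document IDs.
PackedChunk pack_chunk(const std::vector<uint64_t>& ids) {
    PackedChunk chunk;
    chunk.min_value = *std::min_element(ids.begin(), ids.end());
    uint64_t max_offset = *std::max_element(ids.begin(), ids.end()) - chunk.min_value;
    chunk.bits_per_value = 1;
    while ((1ULL << chunk.bits_per_value) <= max_offset) ++chunk.bits_per_value;
    for (uint64_t id : ids) {
        uint64_t offset = id - chunk.min_value;
        for (int b = chunk.bits_per_value - 1; b >= 0; --b)
            chunk.bits.push_back((offset >> b) & 1);  // most significant bit first
    }
    return chunk;
}

// While searching for a document ID, a whole chunk can be skipped without
// decompression if the ID lies outside [min_value, min_value + 2^bits_per_value - 1].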
8.6 Caching
While we designed our inverted index structure to be very efficient, we still have
the issue of disk latency. For this reason, it is very common for a real-world search
engine to also employ some sort of caching structure stored in memory for postings
data objects. In this section, we’ll overview two different caching strategies.
The basic idea of a cache is to make frequently accessed objects fast to acquire.
To accomplish this, we attempt to keep the frequently accessed objects in memory
so we don’t need to seek to the disk. In our case, the objects we access are postings
lists. Due to Zipf’s law [Zipf 1949], we expect that a relatively small number of
postings lists (keyed by term ID) will be accessed the majority of the time. But how
do we know which terms to store? Even if we knew which terms are most queried,
how do we set a cutoff for memory consumption?
The two cache designs we describe address both these issues. That is: (1) the
most-frequently used items are fast to access, and (2) we can set a maximum size
for our cache so it doesn’t get too large.
The first design is the LRU (least recently used) cache. We want searching the cache
to be fast, so we use a hash table to store term IDs
as keys and postings lists as values. This enables O(1) amortized insert, find, and
delete operations. The interesting part of the LRU cache is how to determine the
access order of the objects. To do this, we link together the values as a doubly linked
list. Once an element is inserted or accessed, it is moved to the head (front) of the
list in O(1) time. When we need to remove the LRU item, we look at the element at
the tail (end) of the linked list and delete it, also in constant time.
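A minimal sketch of such an LRU cache, using std::unordered_map for constant-time lookup and std::list as the doubly linked recency list. It is deliberately simplified (for example, it is not thread safe) and is not META's code.

#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// LRU cache mapping term IDs to postings-list objects.
class LRUCache {
    using Key = uint64_t;
    using Value = std::vector<uint8_t>;       // e.g., a compressed postings list
    size_t capacity_;
    std::list<std::pair<Key, Value>> items_;  // front = most recently used
    std::unordered_map<Key, std::list<std::pair<Key, Value>>::iterator> table_;

public:
    explicit LRUCache(size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    const Value* find(Key key) {
        auto it = table_.find(key);
        if (it == table_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // move to front, O(1)
        return &it->second->second;
    }

    void insert(Key key, Value value) {
        auto it = table_.find(key);
        if (it != table_.end()) {               // already cached: refresh it
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        items_.emplace_front(key, std::move(value));
        table_[key] = items_.begin();
        if (items_.size() > capacity_) {        // evict the least recently used item
            table_.erase(items_.back().first);
            items_.pop_back();
        }
    }
};

Using std::list::splice keeps both the "move to front" and the tail eviction constant time, matching the description above.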
The second design is the DBLRU cache, which maintains two hash tables: a primary
map and a secondary map. This cache has a rough hierarchy of usage: the primary map
contains elements that are more frequently accessed than the secondary map, and when
the cache fills, the secondary table is emptied to free memory. While the temporal
accuracy of the DBLRU cache
is not as precise as the LRU cache, it is a much simpler setup, which translates to
faster access times. As usual, there is a tradeoff between the speed and accuracy of
these two caches.
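One way the two-map design could look is sketched below. The assumptions here, namely that newly touched items are written into the primary map and that a full primary map replaces the secondary map (dropping the old secondary), are an illustration of the idea; META's dblru_cache may differ in its details.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// A two-map cache in the spirit of the DBLRU design described above.
class TwoMapCache {
    using Key = uint64_t;
    using Value = std::vector<uint8_t>;
    size_t max_primary_size_;
    std::unordered_map<Key, Value> primary_, secondary_;

public:
    explicit TwoMapCache(size_t max_primary_size) : max_primary_size_(max_primary_size) {}

    // Look in the primary map first, then fall back to the secondary map.
    const Value* find(Key key) const {
        auto it = primary_.find(key);
        if (it != primary_.end()) return &it->second;
        auto it2 = secondary_.find(key);
        return it2 != secondary_.end() ? &it2->second : nullptr;
    }

    // New (or re-accessed) items are written into the primary map. When the
    // primary map grows past the limit, the old secondary map is dropped and
    // the primary map becomes the new secondary map.
    void insert(Key key, Value value) {
        primary_[key] = std::move(value);
        if (primary_.size() > max_primary_size_) {
            secondary_ = std::move(primary_);  // frees the old secondary map
            primary_.clear();
        }
    }
};

There is no per-item bookkeeping at all, which is why this design is simpler and faster than a true LRU cache at the cost of coarser recency information.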
Exercises
8.1. Why is using an inverted index to score documents preferred over a more naive
solution?
8.2. For the following values, explain whether we can efficiently get the value from
a default (standard) inverted index postings file. By efficiently, we mean with one
lookup—not scanning the entire index.
8.3. For values in the previous question that could not be extracted efficiently
from the postings file, either explain what auxiliary structures need to be searched
or why it isn’t possible to retrieve the value efficiently.
8.4. Imagine that each document is tagged with some sentiment score (either
positive or negative). Outline a method on how you could score documents if you
only wanted to return documents of a certain sentiment on a per-query basis.
8.5. Compress the following sequence of integers using each of the compression
methods discussed:
8.6. According to Zipf’s law, which of the following strategies is more effective for
reducing the size of an inverted index? (1) Remove all words that appear 10 times
or less or (2) remove the top 10 most frequent words.
8.7. What type of integer compression is supported in META? Write one of the
implementations discussed in this chapter that does not exist in the toolkit.
8.8. Which type of scoring (document- vs. term-at-a-time) does META use? Hint:
see the file meta/src/index/ranker/[Link].
8.9. Implement the other type of query scoring and compare its runtime to the
existing method in META.
8.10. META has an implementation of a DBLRU cache. Experiment with the cache
size parameter and plot the retrieval speed against different cache sizes while
holding any other variables constant.
8.11. META does not currently have an implementation of an LRU cache as dis-
cussed in this chapter. Can you write one and compare its performance to the
DBLRU cache in terms of memory usage and speed?
9 Search Engine Evaluation
This chapter focuses on the evaluation of text retrieval (TR) systems. In the previous
chapter, we talked about a number of different TR methods and ranking functions,
but how do we know which one works the best? In order to answer this question,
we have to compare them, and that means we’ll have to evaluate these retrieval
methods. This is the main focus of this chapter. We start out with the methodology
behind evaluation. Then, we compare the retrieval of sets with the retrieval of
ranked lists as well as judgements with multiple levels of relevance. We end with
practical issues in evaluation, followed by exercises.
9.1 Introduction
Let’s think about why we have to do evaluation. There are two main reasons;
the first is that we have to use evaluation to figure out which retrieval method
works the best. This is very important for advancing our knowledge, otherwise we
wouldn’t know whether a new idea works better than an old idea. Previously in this
book (Chapter 6), we discussed the problem of text retrieval and compared it with
database retrieval; unlike a database query, which has a well-defined correct answer,
the relevance of a document in text retrieval is ultimately judged by users. Search
engine evaluation must therefore rely on users, which makes it a very challenging
problem: we must determine how we can get users involved and still draw a fair
comparison of different methods.
The second reason to perform evaluation is to assess the actual utility of an
overall TR system (as opposed to specific methods). Imagine you’re building your
own applications; you would be interested in knowing how well your search engine
works for your users. In this case, measures must reflect the utility to the actual
users in the real application as opposed to measures on each individual retrieval
result. Typically, this has been done via user studies—where human users inter-
act with the corpus via the system. In the case of comparing different methods, the
measures we use all need to be correlated with the utility to the users. The measures
only need to be good enough to determine which method works better. This is usu-
ally done by using a test collection, which is a main idea that we’ll be talking about
168 Chapter 9 Search Engine Evaluation
in this chapter. This has been very important for comparing different algorithms
and for improving search engine systems in general. There are three main criteria for
evaluating a TR system.
. Effectiveness or accuracy. How accurate are the search results? Here we are
measuring a system's capability of ranking relevant documents above non-relevant ones.
. Efficiency. How quickly can a user get the results, and how large are the computing
resources needed to answer a query? Here we need to measure the space and time
overhead of the system.
. Usability. How useful is the system for real user tasks? Here, interfaces and many
other things are also important, and we typically have to do user studies.
In this book, we’re going to talk mainly about the effectiveness and accu-
racy measures because the efficiency and usability dimensions are not unique
to search engines (they are needed for evaluating other software systems). There
is also very good coverage of such material in other books, so we suggest the
reader consult Harman [2011] for further reading in this area. Additional readings
are Sanderson [2010] and Kelly [2009], which cover user studies and A-B testing
(concepts that are discussed later in this chapter).
The Cranfield evaluation methodology relies on a test collection: a document
collection, a set of sample queries, and a set of relevance judgments. These are
judgments of which documents should be returned for which
queries. Ideally, they have to be made by users who formulated the queries because
those are the people who know exactly what the documents (search results) would
be used for. Finally, we have to have measures to quantify how well a system’s result
matches the ideal ranked list that would be constructed based on users’ relevance
judgements.
This methodology is very useful for evaluating retrieval algorithms because the
test set can be reused many times. It also provides a fair comparison for all the
methods, since the evaluation is exactly the same for each one. That is, we have the
same criteria, same corpus, and same relevance judgements to compare the differ-
ent algorithms. This allows us to compare a new algorithm with an old algorithm
that was invented many years ago by using the same approach.
In Figure 9.1, we illustrate how the Cranfield evaluation methodology works. As
mentioned, we need a set of queries that are shown here. We have Q1 , Q2 , and so
on. We also need the document collection, D1 , D2 , . . ., and on the far right side of
the figure, we have the relevance judgments which are plus or minus annotations
on each document specifying whether it is relevant (plus) or not relevant (minus).
Essentially, these are binary judgments of documents with respect to a specific
query since there are only two levels of relevance. For example, D1 and D2 are
judged as being relevant to Q1. D3 is judged as non-relevant with respect to Q1.
These judgments are created by users, ideally the ones who formulated the queries. Once we
have these judgements, we can compare two or more systems. Each query is run on
each system, and we investigate the documents that each system returns.
Let’s say the query is Q1. In the figure we have RA as ranked results from system A
and RB as the ranked results from system B. Thus, RA is system A’s approximation
of relevant documents and RB is system B’s approximation. Let’s take a look at
these results—which is better? As a user, which one would you like? There are
some differences and there are some documents that are returned by both systems.
But if you look at the results you will feel that maybe system A is better in the
sense that we don’t have that many documents returned, and among the three
documents returned two of them are relevant. That’s good; system A is precise. On
the other hand, we can also say system B is better because it returned more relevant
documents; it returned three instead of two. So which one is better and how do we
quantify this? This question highly depends on a user’s task, and it depends on the
individual users as well! For some users, maybe system A is better if the user is not
interested in getting all the relevant documents so he or she doesn’t have to read
too much. On the other hand, imagine a user might need to have as many relevant
documents as possible, for example, in writing a literature survey. You might be
in the second category, and then you might find that system B is better. In either
case, we’ll have to also define measures that would quantify the information need
of a user. We will need to define multiple measures because users have different
perspectives when looking at the results.
We have only seen three relevant documents there, but we can imagine there are
other documents judged for this query. Intuitively we thought that system A is better
because it did not have much noise. In particular we have seen, out of three results,
two are relevant. On the other hand, in system B we have five results and only three
of them are relevant. Based on this, it looks like system A is more accurate. This can
be captured by the measure of precision, where we simply compute to what extent
all the retrieval results are relevant. 100% precision would mean all the retrieved
documents are relevant. Thus in this case, system A has a precision of 2/3 ≈ 0.67.
System B has a precision of 3/5 = 0.60. This shows that system A is better according
to precision.
But we also mentioned that system B might be preferred by some other users who
wish to retrieve as many relevant documents as possible. So, in that case we have
to compare the number of total relevant documents to the number that is actually
retrieved. This is captured in another measure called recall, which measures the
completeness of coverage of relevant documents in your retrieval result. We assume
that there are ten relevant documents in the collection. Here we’ve got two of them
from system A, so the recall is 2/10 = 0.20. System B has 3/10 = 0.30. Therefore,
system B is better according to recall.
These two measures are the very basic foundation for evaluating search engines.
They are very important because they are also widely used in many other evaluation
problems. For example, if you look at the applications of machine learning you
tend to see precision and recall numbers being reported for all kinds of tasks.
Now, let's define these two measures more precisely and see how they are used to
evaluate a set of retrieved documents; that is, we treat the retrieval result as an
approximation of the set of relevant documents. We can distinguish the results in
four cases, depending on the situation of a document, as shown in Figure 9.2.
A document is either retrieved or not retrieved since we’re talking about the set of
results. The document can be also relevant or not relevant, depending on whether
the user thinks this is a useful document. We now have counts of documents in each
of the four categories. We can have a represent the number of documents that are
retrieved and relevant, b for documents that are not retrieved but relevant, c for
documents that are retrieved but not relevant, and d for documents that are both
not retrieved and not relevant. With this table, we have defined precision as the ratio
of the relevant retrieved documents a to the total number of retrieved documents
a + c: precision = a/(a + c). In this case, the denominator is all the retrieved
documents. Recall is defined by dividing a by the sum of a and b, where a + b is the
total number of relevant documents: recall = a/(a + b). Precision and recall are both
focused on a, the number of retrieved relevant documents. The two measures differ
based on the denominator of the formula.
Figure 9.2 summarizes the four cases and the two measures:

               Retrieved   Not retrieved
Relevant           a             b
Not relevant       c             d

Precision = a / (a + c)        Recall = a / (a + b)

Ideal results have precision = recall = 1.0; in reality, high recall tends to be
associated with low precision.
So what would be an ideal result? If precision and recall are both 1.0, that
means all the results that we returned are relevant, and we didn’t miss any relevant
documents; not a single non-relevant document was returned. In reality, however,
high recall tends to be associated with low precision; as you go down the list to
try to get as many relevant documents as possible, you tend to include many non-
relevant documents, which decreases the precision. We often are interested in the
precision up to ten documents for web search. This means we look at the top ten
results and see how many documents among them are actually relevant. This is a
very meaningful measure, because it tells us how many relevant documents a user
can expect to see on the first page of search results (where there are typically ten
results).
In the next section, we’ll see how to combine precision and recall into one score
that captures both measures.
F_β = 1 / [ (β²/(β² + 1)) · (1/R) + (1/(β² + 1)) · (1/P) ] = ((β² + 1) · P · R) / (β² · P + R)
F_1 = (2 · P · R) / (P + R)
where P = precision, R = recall, and β is a parameter (often set to 1).
It is easy to see that increasing either precision or recall (while holding the other
fixed) makes the F1 measure higher. But what's interesting is how the tradeoff between
precision and recall is
captured in the F1 score.
In order to understand the formulation, we can first ask the natural question:
Why not combine them using a simple arithmetic mean? That would likely be the
most natural way of combining them. Why is this not as good as F1, i.e., what’s
the problem with an arithmetic mean?
The arithmetic mean tends to be dominated by large values. That means if you
have a very high P or a very high R, then you really don’t care about whether the
other value is low since the whole sum would be high. This is not the desirable
effect because one can easily have a perfect recall by returning all the documents!
Then we have a perfect recall and a low precision. This will still give a relatively
high average. Such search results are clearly not very useful for users even though
the average using this formula would be relatively high. In contrast, the F1 score will
reward a case where precision and recall are roughly similar. So, it would penalize
a case with an extremely high result for only one of them. This means F1 encodes
a different tradeoff between them than a simple arithmetic mean. This example
shows a very important methodology: when you try to solve a problem, you might
naturally think of one solution (e.g., the arithmetic mean), but it’s important not
to settle on this solution; rather, think whether there are other ways to approach it.
Once you have multiple ideas, it’s important to analyze their differences and then
think about which one makes more sense in a real scenario.
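As a quick numerical check of this reasoning (the degenerate case uses made-up numbers): a system that simply returns every document achieves R = 1.0 with, say, P = 0.01. The arithmetic mean is (1.0 + 0.01)/2 ≈ 0.51, which looks respectable, but F1 = 2 × 0.01 × 1.0 / (0.01 + 1.0) ≈ 0.02, which correctly reflects that such a result is nearly useless. For the earlier example, system A (P = 2/3, R = 0.2) has F1 ≈ 0.31 while system B (P = 0.6, R = 0.3) has F1 = 0.4.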
To summarize, we talked about precision, which addresses the question: are
the retrieval results all relevant? We also talked about recall, which addresses the
question: have all the relevant documents been retrieved? These two are the two
basic measures in information retrieval evaluation. They are used for many other
tasks as well. We talked about F measure as a way to combine precision and recall.
We also talked about the tradeoff between precision and recall, and it turns out to
depend on the users’ search tasks and preferences.
As a concrete example, consider a ranked list of ten documents, where + marks a
relevant document and – a non-relevant one, and where the collection is assumed to
contain ten relevant documents in total:

Doc    Relevant?   Precision   Recall
D1     +           1/1         1/10
D2     +           2/2         2/10
D3     –           2/3         2/10
D4     –
D5     +           3/5         3/10
D6     –
D7     –
D8     +           4/8         4/10
D9     –
D10    –           ?           10/10

(The accompanying figure plots precision, on a 0.0 to 1.0 axis, against recall, on a
0.1 to 1.0 axis, for this ranked list.)
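A small sketch of how the precision and recall columns of such a table could be computed at each rank, given binary relevance marks for the ranked list and the total number of relevant documents in the collection (ten in this running example); the function and variable names are illustrative.

#include <cstdio>
#include <vector>

// Print precision and recall at each rank of a ranked list, given binary
// relevance marks and the total number of relevant documents in the collection.
void precision_recall_at_ranks(const std::vector<bool>& relevant, int total_relevant) {
    int hits = 0;
    for (size_t k = 0; k < relevant.size(); ++k) {
        if (relevant[k]) ++hits;
        double precision = static_cast<double>(hits) / (k + 1);
        double recall = static_cast<double>(hits) / total_relevant;
        std::printf("rank %zu: precision = %.2f, recall = %.2f\n", k + 1, precision, recall);
    }
}

int main() {
    // The ranked list from the example above: D1..D10, with '+' marked as true.
    std::vector<bool> relevant = {true, true, false, false, true, false, false, true, false, false};
    precision_recall_at_ranks(relevant, /*total_relevant=*/10);
}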