Text Data Management and Analysis
A Practical Introduction to Information Retrieval and Text Mining
ChengXiang Zhai and Sean Massung

Text data are especially valuable for discovering knowledge about human opinions and preferences, in addition to many other kinds of knowledge that we encode in text. In contrast to structured data, which conform to well-defined schemas (and are thus relatively easy for computers to handle), text has less explicit structure, requiring computer processing to understand the content encoded in text. The current technology of natural language processing has not yet reached a point where a computer can precisely understand natural language text, but a wide range of statistical and heuristic approaches to the management and analysis of text data have been developed over the past few decades. They are usually very robust and can be applied to analyze and manage text data in any natural language, and about any topic.

This book provides a systematic introduction to many of these approaches, with an emphasis on covering the most useful knowledge and skills required to build a variety of practically useful text information systems. Because humans can understand natural languages far better than computers can, effective involvement of humans in a text information system is generally needed, and text information systems often serve as intelligent assistants for humans. Depending on how a text information system collaborates with humans, we distinguish two kinds of text information systems. The first is information retrieval systems, which include search engines and recommender systems; they assist users in finding from a large collection of text data the most relevant text data that are actually needed for solving a specific application problem, thus effectively turning big raw text data into much smaller relevant text data that can be more easily processed by humans. The second is text mining application systems; they assist users in analyzing patterns in text data to extract and discover useful actionable knowledge directly useful for task completion or decision making, thus providing more direct task support for users.
ACM Books
Editor in Chief
M. Tamer Özsu, University of Waterloo
ACM Books is a new series of high-quality books for the computer science community,
published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books
publications are widely distributed in both print and digital formats through booksellers
and to libraries (and library consortia) and individual ACM members via the ACM Digital
Library platform.
ChengXiang Zhai
University of Illinois at Urbana–Champaign
Sean Massung
University of Illinois at Urbana–Champaign
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means—electronic, mechanical, photocopy,
recording, or any other except for brief quotations in printed reviews—without the prior
permission of the publisher.
Designations used by companies to distinguish their products are often claimed as
trademarks or registered trademarks. In all instances in which Morgan & Claypool is aware
of a claim, the product names appear in initial capital or all capital letters. Readers, however,
should contact the appropriate companies for more complete information regarding
trademarks and registration.
First Edition
10 9 8 7 6 5 4 3 2 1
To Mei and Alex
To Kai
Contents
Preface xv
Acknowledgments xviii
Chapter 1 Introduction 3
1.1 Functions of Text Information Systems 7
1.2 Conceptual Framework for Text Information Systems 10
1.3 Organization of the Book 13
1.4 How to Use this Book 15
Bibliographic Notes and Further Reading 18
Chapter 2 Background 21
2.1 Basics of Probability and Statistics 21
2.2 Information Theory 31
2.3 Machine Learning 34
Bibliographic Notes and Further Reading 36
Exercises 37
Chapter 4 META: A Unified Toolkit for Text Data Management and Analysis 57
4.1 Design Philosophy 58
4.2 Setting up META 59
4.3 Architecture 60
4.4 Tokenization with META 61
4.5 Related Toolkits 64
Exercises 65
Chapter 20 Toward A Unified System for Text Management and Analysis 445
20.1 Text Analysis Operators 448
20.2 System Architecture 452
20.3 META as a Unified System 453
References 477
Index 489
Authors’ Biographies 509
Preface
directly expressed in text data. For example, it is now the norm for people to tap into
opinionated text data such as product reviews, forum discussions, and social media
text to obtain opinions. Once again, due to the overwhelming amount of informa-
tion, people need intelligent software tools to help discover relevant knowledge for
optimizing decisions or helping them complete their tasks more efficiently. While
the technology for supporting text mining is not yet as mature as search engines
for supporting text access, significant progress has been made in this area in re-
cent years, and specialized text mining tools have now been widely used in many
application domains. The subtitle of this book suggests that we cover two major
topics, information retrieval and text mining. These two topics roughly correspond
to the techniques needed to build the two types of application systems discussed
above (i.e., search engines and text analytics systems), although the separation of
the two is mostly artificial and only meant to help provide a high-level structure for
the book, and a sophisticated application system likely would use many techniques
from both topic areas.
In contrast to structured data, which conform to well-defined schemas and are
thus relatively easy for computers to handle, text has less explicit structure so
the development of intelligent software tools discussed above requires computer
processing to understand the content encoded in text. The current technology of
natural language processing has not yet reached a point to enable a computer to pre-
cisely understand natural language text (a main reason why humans often should
be involved in the loop), but a wide range of statistical and heuristic approaches
to management and analysis of text data have been developed over the past few
decades. They are usually very robust and can be applied to analyze and manage
text data in any natural language, and about any topic. This book intends to provide
a systematic introduction to many of these approaches, with an emphasis on cov-
ering the most useful knowledge and skills required to build a variety of practically
useful text information systems.
This book is primarily based on the materials that the authors have used for
teaching a course on the topic of text data management and analysis (i.e., CS410
Text Information Systems) at the University of Illinois at Urbana–Champaign, as
well as the two Massive Open Online Courses (MOOCs) on “Text Retrieval and
Search Engines” and “Text Mining and Analytics” taught by the first author on
Coursera in 2015. Most of the materials in the book directly match those of these
two MOOCs, with a similar structure of topics. As such, the book can be used as
a main reference book for either of these two MOOCs.
Information Retrieval (IR) is a relatively mature field and there is no shortage
of good textbooks on IR; for example, the most recent ones include Modern
Information Retrieval: The Concepts and Technology behind Search by Baeza-Yates
are expected to have basic knowledge about computer science, particularly data
structures and programming languages and be comfortable with some basic con-
cepts in probability and statistics such as conditional probability and parameter
estimation. Readers who do not have this background may still be able to follow
the basic ideas of most of the algorithms discussed in the book; they can also ac-
quire the needed background by carefully studying Chapter 2 of the book and, if
necessary, reading some of the references mentioned in the Bibliographical Notes
section of that chapter to have a solid understanding of all the major concepts men-
tioned therein. META can be used by anyone to easily experiment with algorithms
and build applications, but modifying it or extending it would require at least some
basic knowledge of C++ programming.
The book can be used as a textbook for an upper-level undergraduate course on
information retrieval and text mining or a reference book for a graduate course to
cover practical aspects of information retrieval and text mining. It should also be
useful to practitioners in industry to help them acquire a wide range of practical
techniques for managing and analyzing text data that they can use immediately to
build various interesting real-world applications.
Acknowledgments
This book is the result of many people’s help. First and foremost, we want to express
our sincere thanks to Edward A. Fox for his invitation to write this book for the
ACM Book Series in the area of Information Retrieval and Digital Libraries, of
which he is the Area Editor. We are also grateful to Tamer Özsu, Editor-in-Chief of
ACM Books, for his support and useful comments on the book proposal. Without
their encouragement and support this book would not have been possible. Next,
we are deeply indebted to Edward A. Fox, Donna Harman, Bing Liu, and Jimmy
Lin for thoroughly reviewing the initial draft of the book and providing very useful
feedback and constructive suggestions. While we were not able to fully implement
all their suggestions, all their reviews were extremely helpful and led to significant
improvement of the quality of the book in many ways; naturally, any remaining
errors in the book are solely the responsibility of the authors.
Throughout the process of writing the book, we received strong support and
great help from Diane Cerra, Executive Editor at Morgan & Claypool Publishers,
whose regular reminders and always timely support kept us from taking "forever"
to finish the book; for this, we are truly
grateful to her. In addition, we would like to thank Sara Kreisman for copyediting
and Paul C. Anagnostopoulos and his production team at Windfall Software (Ted
Laux, Laurel Muller, MaryEllen Oliver, and Jacqui Scarlott) for their great help with
indexing, illustrations, art proofreading, and composition, which ensured a fast
and smooth production of the book.
The content of the book and our understanding of the topics covered in the book
have benefited from many discussions and interactions with a large number of
people in both the research community and industry. Due to space limitations,
we can only mention some of them here (and have to apologize to many whose
names are not mentioned): James Allan, Charu Aggarwal, Ricardo Baeza-Yates,
Nicholas J. Belkin, Andrei Broder, Jamie Callan, Jaime Carbonell, Kevin C. Chang,
Yi Chang, Charlie Clarke, Fabio Crestani, W. Bruce Croft, Maarten de Rijke, Arjen
de Vries, Daniel Diermeier, AnHai Doan, Susan Dumais, David A. Evans, Edward A.
Fox, Ophir Frieder, Norbert Fuhr, Evgeniy Gabrilovich, C. Lee Giles, David Gross-
man, Jiawei Han, Donna Harman, Marti Hearst, Jimmy Huang, Rong Jin, Thorsten
Joachims, Paul Kantor, David Karger, Diane Kelly, Ravi Kumar, Oren Kurland, John
Lafferty, Victor Lavrenko, Lillian Lee, David Lewis, Jimmy Lin, Bing Liu, Wei-Ying
Ma, Christopher Manning, Gary Marchionini, Andrew McCallum, Alistair Moffat,
Jian-Yun Nie, Douglas Oard, Dragomir R. Radev, Prabhakar Raghavan, Stephen
Robertson, Roni Rosenfeld, Dan Roth, Mark Sanderson, Bruce Schatz, Fabrizio Se-
bastiani, Amit Singhal, Keith van Rijsbergen, Luo Si, Noah Smith, Padhraic Smyth,
Andrew Tomkins, Ellen Voorhees, Yiming Yang, Yi Zhang, and Justin Zobel. We
want to thank all of them for their indirect contributions to this book. Some ma-
terials in the book, especially those in Chapter 19, are based on the research work
done by many Ph.D. graduates of the Text Information Management and Analysis
(TIMAN) group at the University of Illinois at Urbana–Champaign, under the super-
vision of the first author. We are grateful to all of them, including Tao Tao, Hui Fang,
Xuehua Shen, Azadeh Shakery, Jing Jiang, Qiaozhu Mei, Xuanhui Wang, Bin Tan,
Xu Ling, Younhee Ko, Alexander Kotov, Yue Lu, Maryam Karimzadehgan, Yuanhua
Lv, Duo Zhang, V. G. Vinod Vydiswaran, Hyun Duk Kim, Kavita Ganesan, Parikshit
Sondhi, Huizhong Duan, Yanen Li, Hongning Wang, Mingjie Qian, and Dae Hoon
Park. The authors’ own work included in the book has been supported by multiple
funding sources, including NSF, NIH, NASA, IARPA, Air Force, ONR, DHS, Alfred P.
Sloan Foundation, and many companies including Microsoft, Google, IBM, Yahoo!,
LinkedIn, Intel, HP, and TCL. We are thankful to all of them.
The two Massive Open Online Courses (MOOCs) offered by the first author for
the University of Illinois at Urbana–Champaign (UIUC) in 2015 on Coursera (i.e.,
Text Retrieval and Search Engines and Text Mining and Analytics) provided a direct
basis for this book in the sense that many parts of the book are based primarily
on the transcribed notes of the lectures in these two MOOCs. We thus would like
to thank all the people who have helped with these two MOOCs, especially TAs
Hussein Hazimeh and Alex Morales, and UIUC instruction support staff Jason
Mock, Shannon Bicknell, Katie Woodruff, and Edward Noel Dignan, and the Head
of the Computer Science Department, Rob Rutenbar, whose encouragement, support,
and help are all essential for these two MOOCs to happen. The first author also
wants to thank UIUC for allowing him to use the sabbatical leave in Fall 2015 to
work on this book. Special thanks are due to Chase Geigle, co-founder of META. In
addition to all the above, the second author would like to thank Chase Geigle, Jason
Cho, and Urvashi Khandelwal (among many others) for insightful discussion and
encouragement.
Finally, we would like to thank all our family members, particularly our wives,
Mei and Kai, for their love and support. The first author wants to further thank
his brother Chengxing for the constant intellectual stimulation in their regular
research discussions and his parents for cultivating his passion for learning and
sharing knowledge with others.
ChengXiang Zhai
Sean Massung
June 2016
PART I
OVERVIEW AND BACKGROUND

Chapter 1
Introduction
In the last two decades, we have experienced an explosive growth of online infor-
mation. According to a study done at the University of California, Berkeley back in 2003:
"… the world produces between 1 and 2 exabytes (10^18 bytes) of unique infor-
mation per year, which is roughly 250 megabytes for every man, woman, and child
on earth. Printed documents of all kinds comprise only .03% of the total.” [Lyman
et al. 2003]
A large amount of online information is textual information (i.e., in natural lan-
guage text). For example, according to the Berkeley study cited above: “Newspapers
represent 25 terabytes annually, magazines represent 10 terabytes . . . office docu-
ments represent 195 terabytes. It is estimated that 610 billion emails are sent each
year representing 11,000 terabytes.” Of course, there are also blog articles, forum
posts, tweets, scientific literature, government documents, etc. Roe [2012] updates
the email count from 610 billion emails in 2003 to 107 trillion emails sent in 2010.
According to a recent IDC report [Gantz & Reinsel 2012], from 2005 to 2020,
the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 ex-
abytes, or 40 trillion gigabytes.
While, in general, all kinds of online information are useful, textual information
plays an especially important role and is arguably the most useful kind of informa-
tion for the following reasons.
Text (natural language) is the most natural way of encoding human knowledge.
As a result, most human knowledge is encoded in the form of text data. For
example, scientific knowledge almost exclusively exists in scientific literature,
while technical manuals contain detailed explanations of how to operate
devices.
Text is by far the most common type of information encountered by people.
Indeed, most of the information a person produces and consumes daily is in
text form.
Text is the most expressive form of information in the sense that it can be
used to describe other media such as video or images. Indeed, image search
engines such as those supported by Google and Bing often rely on matching
companion text of images to retrieve “matching” images to a user’s keyword
query.
The explosive growth of online text information has created a strong demand
for intelligent software tools to provide the following two related services to help
people manage and exploit big text data.
Text Retrieval. The growth of text data makes it impossible for people to con-
sume the data in a timely manner. Since text data encode much of our accu-
mulated knowledge, they generally cannot be discarded, leading to, e.g., the
accumulation of a large amount of literature data which is now beyond any
individual’s capacity to even skim over. The rapid growth of online text infor-
mation also means that no one can possibly digest all the new information
created on a daily basis. Thus, there is an urgent need for developing intel-
ligent text retrieval systems to help people get access to the needed relevant
information quickly and accurately, leading to the recent growth of the web
search industry. Indeed, web search engines like Google and Bing are now an
essential part of our daily life, serving millions of queries daily. In general,
search engines are useful anywhere there is a relatively large amount of text
data (e.g., desktop search, enterprise search or literature search in a specific
domain such as PubMed).
Text Mining. Due to the fact that text data are produced by humans for commu-
nication purposes, they are generally rich in semantic content and often con-
tain valuable knowledge, information, opinions, and preferences of people.
As such, they offer great opportunity for discovering various kinds of knowl-
edge useful for many applications, especially knowledge about human opin-
ions and preferences, which is often directly expressed in text data. For exam-
ple, it is now the norm for people to tap into opinionated text data such as
product reviews, forum discussions, and social media text to obtain opinions
about topics interesting to them and optimize various decision-making tasks
such as purchasing a product or choosing a service. Once again, due to the
overwhelming amount of information, people need intelligent software tools
to help discover relevant knowledge to optimize decisions or help them com-
plete their tasks more efficiently. While the technology for supporting text
mining is not yet as mature as search engines for supporting text access, sig-
nificant progress has been made in this area in recent years, and specialized
text mining tools have now been widely used in many application domains.
Figure 1.1 Text retrieval and text mining are two main techniques for analyzing big text data.
Once we obtain a small set of most relevant text data, we would need to further
analyze the text data to help users digest the content and knowledge in the text
data. This is the text mining step where the goal is to further discover knowledge
and patterns from text data so as to support a user’s task. Furthermore, due to the
need for assessing trustworthiness of any discovered knowledge, users generally
have a need to go back to the original raw text data to obtain appropriate context
for interpreting the discovered knowledge and verify the trustworthiness of the
knowledge, hence a search engine system, which is primarily useful for text access,
also has to be available in any text-based decision-support system for supporting
knowledge provenance. The two steps are thus conceptually interleaved, and a
full-fledged intelligent text information system must integrate both in a unified
framework.
It is worth pointing out that, put in the context of "big data," text data is very dif-
ferent from other kinds of data because it is generally produced directly by humans
and often also meant to be consumed by humans as well. In contrast, other data
tend to be machine-generated data (e.g., data collected by using all kinds of physi-
cal sensors). Since humans can understand text data far better than computers can,
involvement of humans in the process of mining and analyzing text data is abso-
lutely crucial (much more necessary than in other big data applications), and how
to optimally divide the work between humans and machines so as to optimize the
collaboration between humans and machines and maximize their “combined in-
telligence” with minimum human effort is a general challenge in all applications of
text data management and analysis. The two steps discussed above can be regarded
as two different ways for a text information system to assist humans: information
retrieval systems assist users in finding from a large collection of text data the most
relevant text data that are actually needed for solving a specific application prob-
lem, thus effectively turning big raw text data into much smaller relevant text data
that can be more easily processed by humans, while text mining application sys-
tems can assist users in analyzing patterns in text data to extract and discover useful
actionable knowledge directly useful for task completion or decision making, thus
providing more direct task support for users.
With this view, we partition the techniques covered in the book into two parts to
match the two steps shown in Figure 1.1, which are then followed by one chapter to
discuss how all the techniques may be integrated in a unified text information sys-
tem. The book attempts to provide a complete coverage of all the major concepts,
techniques, and ideas in information retrieval and text data mining from a prac-
tical viewpoint. It includes many hands-on exercises designed with a companion
software toolkit META to help readers learn how to apply techniques of information
retrieval and text mining to real-world text data and learn how to experiment with
and improve some of the algorithms for interesting application tasks. This book
can be used as a textbook for computer science undergraduates and graduates, li-
brary and information scientists, or as a reference book for practitioners working
on relevant application problems in analyzing and managing text data.
1.1 Functions of Text Information Systems

Information Access. This capability gives a user access to the useful informa-
tion when the user needs it. With this capability, a TIS can connect the right
information with the right user at the right time. For example, a search en-
gine enables a user to access text information through querying, whereas a
recommender system can push relevant information to a user as new informa-
tion items become available. Since the main purpose of Information Access
is to connect a user with relevant information, a TIS offering this capability
Figure 1.2 Information access, knowledge acquisition, and text organization are three major
capabilities of a text information system with text organization playing a supporting
role for information access and knowledge acquisition. Knowledge acquisition is also
often referred to as text mining.
generally only does minimum analysis of text data sufficient for matching
relevant information with a user’s information need, and the original infor-
mation items (e.g., web pages) are often delivered to the user in their original
form, though summaries of the delivered items are often provided. From the
perspective of text analysis, a user would generally need to read the informa-
tion items to further digest and exploit the delivered information.
Information access can be further classified into two modes: pull and push. In
the pull mode, the user takes initiative to “pull” the useful information out from
the system; in this case, the system plays a passive role and waits for a user to
make a request, to which the system would then respond with relevant information.
This mode of information access is often very useful when a user has an ad hoc
information need, i.e., a temporary information need (e.g., an immediate need for
opinions about a product). For example, a search engine like Google generally
serves a user in pull mode. In the push mode, the system takes initiative to “push”
(recommend) to the user an information item that the system believes is useful to
the user. The push mode often works well when the user has a relatively stable
information need (e.g., hobby of a person); in such a case, a system can know
“in advance” a user’s preferences and interests, making it feasible to recommend
information to a user without having the user take the initiative. We cover both
modes of information access in this book.
The pull mode further consists of two complementary ways for a user to obtain
relevant information: querying and browsing. In the case of querying, the user
specifies the information need with a (keyword) query, and the system would take
the query as input and return documents that are estimated to be relevant to the
query. In the case of browsing, the user simply navigates along structures that
link information items together and progressively reaches relevant information.
Since querying can also be regarded as a way to navigate, in one step, into a set
of relevant documents, it’s clear that browsing and querying can be interleaved
naturally. Indeed, a user of a web search engine often interleaves querying and
browsing.
Knowledge acquisition from text data is often achieved through the process of
text mining, which can be defined as mining text data to discover useful knowl-
edge. Both the data mining community and the natural language processing
(NLP) community have developed methods for text mining, although the two
communities tend to adopt slightly different perspectives on the problem. From
a data mining perspective, we may view text mining as mining a special kind
of data, i.e., text. Following the general goals of data mining, the goal of text
mining would naturally be regarded as discovering and extracting interesting pat-
terns in text data, which can include latent topics, topical trends, or outliers.
From an NLP perspective, text mining can be regarded as partially understanding
natural language text, converting it into some form of knowledge representation,
and making limited inferences based on the extracted knowledge. Thus a
key task is to perform information extraction, which often aims to identify and ex-
tract mentions of various entities (e.g., people, organization, and location) and
their relations (e.g., who met with whom). In practice, of course, any text mining
application would likely involve both pattern discovery (i.e., the data mining
view) and information extraction (i.e., the NLP view), with information extraction
serving to enrich the semantic representation of text, which enables pattern
(Figure: conceptual framework of a text information system, in which text analysis supports information access (search, filtering), information organization (clustering), and knowledge acquisition (extraction, summarization, visualization), serving retrieval applications and mining applications.)
also possible, which may be based on recognized entities and relations or other
techniques for more in-depth understanding of text.
With content analysis as the basis, there are multiple components in a TIS that
are useful for users in different ways. The following are some commonly seen
functions for managing and analyzing text information.
Search. Take a user’s query and return relevant documents. The search com-
ponent in a TIS is generally called a search engine. Web search engines are
among the most useful search engines that enable users to effectively and
efficiently deal with a huge amount of text data.
Categorization. Classify a text object into one or several of the predefined cat-
egories where the categories can vary depending on applications. The cat-
egorization component in a TIS can annotate text objects with all kinds of
meaningful categories, thus enriching the representation of text data, which
further enables more effective and deeper text analysis. The categories can
also be used for organizing text data and facilitating text access. Subject cate-
gorizers that classify a text article into one or multiple subject categories and
sentiment taggers that classify a sentence into positive, negative, or neutral in
sentiment polarity are both specific examples of a text categorization system.
Topic Analysis. Take a set of documents and extract and analyze topics in them.
Topics directly facilitate digestion of text data by users and support browsing
of text data. When combined with the companion non-textual data such as
time, location, authors, and other meta data, topic analysis can generate
many interesting patterns such as temporal trends of topics, spatiotemporal
distributions of topics, and topic profiles of authors.
Clustering. Discover groups of similar text objects (e.g., terms, sentences, doc-
uments, . . . ). The clustering component of a TIS plays an important role in
helping users explore an information space. It uses empirical data to create
meaningful structures that can be useful for browsing text objects and ob-
taining a quick understanding of a large text data set. It is also useful for
discovering outliers by identifying the items that do not form natural clusters
with other items.
This list also serves as an outline of the major topics to be covered later in
this book. Specifically, search and filtering are covered first in Part II about text
data access, whereas categorization, clustering, topic analysis, and summarization
are covered later in Part III about text data analysis. Information extraction is not
covered in this book since we want to focus on general approaches that can be
readily applied to text data in any natural language, but information extraction
often requires language-specific techniques. Visualization is also not covered due
to the intended focus on algorithms in this book. However, it must be stressed that
both information extraction and visualization are very important topics relevant
to text data analysis and management. Readers interested in these techniques can
find some useful references in the Bibliographic Notes at the end of this chapter.
Part I. Overview and Background. This part consists of the first four chapters
and provides an overview of the book and background knowledge, including
basic concepts needed for understanding the content of the book that some
readers may not be familiar with, and an introduction to the META toolkit
used for exercises in the book. This part also gives a brief overview of natu-
ral language processing techniques needed for understanding text data and
obtaining informative representation of text needed in all text data analysis
applications.
Part II. Text Data Access. This part consists of Chapters 5–11, covering the ma-
jor techniques for supporting text data access. This part provides a systematic
discussion of the basic information retrieval techniques, including the for-
mulation of retrieval tasks as a problem of ranking documents for a query
(Chapter 5), retrieval models that form the foundation of the design of rank-
ing functions in a search engine (Chapter 6), feedback techniques (Chapter 7),
implementation of retrieval systems (Chapter 8), and evaluation of retrieval
systems (Chapter 9). It then covers web search engines, the most important
application of information retrieval so far (Chapter 10), where techniques for
analyzing links in text data for improving ranking of text objects are intro-
duced and application of supervised machine learning to combine multiple
features for ranking is briefly discussed. The last chapter in this part (Chap-
ter 11) covers recommender systems which provide a “push” mode of informa-
tion access, as opposed to the “pull” mode of information access supported
by a typical search engine (i.e., querying by users).
Part III. Text Data Analysis. This part consists of Chapters 12–19, covering a
variety of techniques for analyzing text data to facilitate user digestion of text
data and discover useful topical or other semantic patterns in text data. Chap-
ter 12 gives an overview of text analysis from the perspective of data mining,
where we may view text data as data generated by humans as “subjective sen-
sors” of the world; this view allows us to look at the text analysis problem in the
more general context of data analysis and mining in general, and facilitates
the discussion of joint analysis of text and non-text data. This is followed by
multiple chapters covering a number of the most useful general techniques
for analyzing text data without or with only minimum human effort. Specif-
ically, Chapter 13 discusses techniques for discovering two fundamental se-
mantic relations between lexical units in text data, i.e., paradigmatic relations
and syntagmatic relations, which can be regarded as an example of discov-
ering knowledge about the natural language used to generate the text data
(i.e., linguistic knowledge). Chapter 14 and Chapter 15 cover, respectively, two
closely related techniques to generate and associate meaningful structures
or annotations with otherwise unorganized text data, i.e., text clustering and
text categorization. Chapter 16 discusses text summarization useful for facil-
itating human digestion of text information. Chapter 17 provides a detailed
discussion of an important family of probabilistic approaches to discovery
and analysis of topical patterns in text data (i.e., topic models). Chapter 18
discusses techniques for analyzing sentiment and opinions expressed in text
data, which are key to discovery of knowledge about preferences, opinions,
and behavior of people based on analyzing the text data produced by them.
Finally, Chapter 19 discusses joint analysis of text and non-text data, which is
often needed in many applications since it is in general beneficial to use as
much data as possible for gaining knowledge and intelligence through (big)
data analysis.
Part IV. Unified Text Management and Analysis System. This last part consists
of Chapter 20 where we attempt to discuss how all the techniques discussed
in this book can be conceptually integrated in an operator-based unified
framework, and thus potentially implemented in a general unified system
for text management and analysis that can be useful for supporting a wide
range of different applications. This part also serves as a roadmap for further
extension of META to provide effective and general high-level support for
various applications and provides guidance on how META may be integrated
with many other related existing toolkits, including particularly search engine
systems, database systems, natural language processing toolkits, machine
learning toolkits, and data mining toolkits.
Due to our attempt to treat all the topics from a practical perspective, most
of the discussions of the concepts and techniques in the book are informal and
intuitive. To satisfy the needs of some readers that might be interested in deeper
understanding of some topics, the book also includes an appendix with notes to
provide a more detailed and rigorous explanation of a few important topics.
such a tradeoff, we have chosen to emphasize the coverage of the basic concepts
and practical techniques of text data mining at the cost of not being able to cover
many advanced techniques in detail, and provide some references at the end of
many chapters to help readers learn more about those advanced techniques if
they wish to. Our hope is that with the foundation received from reading this
book, you will be able to learn about more advanced techniques by yourself or via
another resource. We have also chosen to cover more general techniques for text
management and analysis and favor techniques that can be applicable to any text in
any natural language. Most techniques we discuss can be implemented without any
human effort or only requiring minimal human effort; this is in contrast to some
more detailed analysis of text data, particularly using natural language processing
techniques. Such “deep analysis” techniques are obviously very important and are
indeed necessary for some applications where we would like to go in-depth to
understand text in detail. However, at this point, these techniques are often not
scalable and they tend to require a large amount of human effort. In practice, it
would be beneficial to combine both kinds of techniques.
We envision three main (and potentially overlapping) categories of readers.
the entire book as an Introduction to Text Data Mining, while skipping some
chapters in Part II that are more specific to search engine implementation and
applications specific to the Web. Another choice would be using all parts as a
supplemental graduate textbook, where there is still some emphasis on prac-
tical programming knowledge that can be combined with reading referenced
papers in each chapter. Exercises for graduate students could be to implement
in META some of the methods they read about in the references.
The exercises at the end of each chapter give students experience working
with a powerful—yet easily understandable—text retrieval and mining toolkit
in addition to written questions. In a programming-focused class, using the
META exercises is strongly encouraged. Programming assignments can be cre-
ated from selecting a subset of exercises in each chapter. Due to the modular
nature of the toolkit, additional programming experiments may be created by
extending the existing system or implementing other well-known algorithms
that do not come with META by default. Finally, students may use compo-
nents of META they learned through the exercises to complete a larger final
programming project. Using different corpora with the toolkit can yield dif-
ferent project challenges, e.g., review summary vs. sentiment analysis.
Practitioners. Most readers in industry will likely use this book as a
reference, although we also hope that it may serve as some inspiration in
your own work. As with the student user suggestion, we think you would get
the most out of this book by first reading the initial three chapters. Then, you may
choose a chapter relevant to your current interests and delve deeper or refresh
your knowledge.
Since many applications in META can be used simply via config files, we
expect it to be a quick way to get a handle on your dataset and provide some
baseline results without any programming required.
The exercises at the end of each chapter can be thought of as default
implementations for a particular task at hand. You may choose to include
META in your work since it uses a permissive free software license. In fact, it is
dual-licensed under MIT and University of Illinois/NCSA licenses. Of course,
we still encourage and invite you to share any non-proprietary modifications,
extensions, and improvements to META for the benefit of all the readers.
No matter what your goal, we hope that you find this book useful and educa-
tional. We also appreciate your comments and suggestions for improvement of the
book. Thanks for reading!
of some key techniques important for text mining, notably the information extrac-
tion (IE) techniques which are essential for text mining. We decided not to cover IE
because the IE techniques tend to be language-specific and require non-trivial man-
ual work by humans. Another reason is that many IE techniques rely on supervised
machine learning approaches, which are well covered in many existing machine
learning books (see, e.g., Bishop 2006, Mitchell 1997). Readers who are interested
in knowing more about IE can start with the survey book [Sarawagi 2008] and the
review article [Jiang 2012].
From an application perspective, another important topic missing in this book
is information visualization, which is due to our focus on the coverage of models
and algorithms. However, since every application system must have a user-friendly
interface to allow users to optimally interact with a system, those readers who are
interested in developing text data application systems will surely find it useful to
learn more about user interface design. An excellent reference to start with is Hearst
[2009], which also has a detailed coverage of information visualization.
Finally, due to our emphasis on breadth, the book does not cover any compo-
nent algorithm in depth. To know more about some of the topics, readers can
further read books in natural language processing (e.g., Jurafsky and Martin 2009,
Manning and Schütze 1999), advanced books on IR (e.g., Baeza-Yates and Ribeiro-
Neto [2011]), and books on machine learning (e.g., Bishop [2006]). You may find
more specific recommendations of readings relevant to a particular topic in the
Bibliographic Notes at the end of each chapter that covers the corresponding topic.
Chapter 2
Background
This chapter contains background information that is necessary to know in order
to get the most out of the rest of this book; readers who are already familiar with
these basic concepts may safely skip the entire chapter or some of the sections.
We first focus on some basic probability and statistics concepts required for most
algorithms and models in this book. Next, we continue our mathematical back-
ground with an overview of some concepts in information theory that are often
used in many text mining applications. The last section introduces the basic idea
and problem setup of machine learning, particularly supervised machine learning,
which is useful for classification, categorization, or text-based prediction in the text
domain. In general, machine learning is very useful for many information retrieval
and data mining tasks.
2.1 Basics of Probability and Statistics

Consider, for example, a six-sided die whose faces are colored red, orange, yellow, blue, green, and purple; the sample space is Ω = {red, orange, yellow, blue, green, purple}, and a fair die can be modeled with the uniform distribution

θ = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6),

where the first index corresponds to p(red) = 1/6, the second index corresponds to p(orange) = 1/6, and so on. But what if we had an unfair die? We could use a different probability distribution θ to model events concerning it:

θ = (1/3, 1/3, 1/12, 1/12, 1/12, 1/12).
In this case, red and orange are assumed to be rolled more often than the
other colors. Be careful to note the difference between the sample space Ω and
the defined probability model θ used to quantify its uncertainty. In our text mining
tasks, we usually try to estimate θ given some knowledge about Ω. The different
methods to estimate θ will determine how accurate or useful the probabilistic
model is.
Consider the following notation:
x ∼ θ.
We read this as x is drawn from theta, or the random variable x is drawn from the
probability distribution θ. The random variable x takes on each value from Ω with
a certain probability defined by θ. For example, if we had x ∼ θ, then there is a 2/3
chance that x is either red or orange.
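The sampling notation can be made concrete with a few lines of code. The sketch below (a minimal illustration in Python, rather than the C++-based META toolkit used for the book's exercises) draws many samples x ∼ θ from the unfair color distribution above and checks empirically that red or orange comes up about 2/3 of the time:

import random

# Sample space (colors of the die faces) and the unfair distribution theta.
omega = ["red", "orange", "yellow", "blue", "green", "purple"]
theta = [1/3, 1/3, 1/12, 1/12, 1/12, 1/12]

# Draw x ~ theta many times and estimate p(x is red or orange).
n = 100_000
samples = random.choices(omega, weights=theta, k=n)
estimate = sum(1 for x in samples if x in ("red", "orange")) / n

print(f"empirical p(red or orange) = {estimate:.3f}  (theory: {2/3:.3f})")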
In our text application tasks, we usually take Ω to be V, the vocabulary of some text
corpus (for example, the set of all distinct words appearing in the corpus),
and we could model the text data with a probability distribution θ . Thus, if we have
some word w we can write p(w | θ) (read as the probability of w given θ). If w is the
word data, we might have p(w = data | θ) = 0.003 or equivalently pθ (w = data) =
0.003.
In our examples, we have only considered discrete probability distributions.
That is, our models only assign probabilities for a finite (discrete) set of outcomes.
In reality, there are also continuous probability distributions, where there are an
infinite number of “events” that are not countable. For example, the normal (or
Gaussian) distribution is a continuous probability distribution that assigns real-
valued probabilities according to some parameters. We will discuss continuous
distributions as necessary in later chapters. For now, though, it’s sufficient to un-
derstand that we can use a discrete probability distribution to model the probability
of observing a single word in a vocabulary V .
A probability distribution pθ over Ω must satisfy the following properties. Each probability is between zero and one:

0 ≤ pθ(ω ∈ Ω) ≤ 1. (2.1)

An event not in Ω has probability zero, and the probability of any event occurring from Ω is one:

pθ(ω ∉ Ω) = 0, (2.2)

pθ(ω ∈ Ω) = 1. (2.3)
Let’s also assume that each color is represented by a particular shape. This makes
our die look like
24 Chapter 2 Background
{ , , , , , }
where the colors of the shapes are, red, orange, yellow, blue, green, and purple,
respectively.
We can now create another distribution for the shape θS . Let each index in θS
represent p(square), p(circle), p(triangle), respectively. That gives
θS = (1/3, 1/2, 1/6).
Then we can let xC ∼ θC represent the color random variable and let xS ∼ θS repre-
sent the shape random variable. We now have two variables to work with.
A joint probability measures the likelihood that two events occur simultane-
ously. For example, what is the probability that xC = red and xS = circle? Since there
are no red circles, this has probability zero. How about p(xC = green, xS = circle)?
This notation signifies the joint probability of the two random variables. In this
case, the joint probability is 1/6 because there is only one green circle.
Consider a modified die
where we changed the color of the blue circle (the fourth element in the set) to
green. Thus, we now have two green circles instead of one green and one blue.
What would p(xC = green, xS = circle) be? Since two out of the six elements satisfy
both these criteria, the answer is 2/6 = 1/3. As another example, if we had a 12-sided
fair die with 5 green circles and 7 other combinations of shape and color, then
p(xC = green, xS = circle) = 5/12.
A conditional probability measures the likelihood that one event occurs given
that another event has already occurred. Let’s use the original die with six unique
colors. Say we know that a square was rolled. With that information, what is the
probability that the color is red? How about purple? We can write this as p(xC =
red | xS = square). Since we know there are two squares, of which one is red, p(red |
square) = 1/2.
We can write the conditional probabilities for two random variables X and Y
based on their joint probability with the following equation:
p(X = x | Y = y) = p(X = x, Y = y) / p(Y = y). (2.4)
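As a small illustration of joint and conditional probabilities, the following sketch enumerates one shape assignment for the six-colored die that is consistent with the examples above (red is a square, blue and green are circles, with two squares, three circles, and one triangle in total); the shapes assigned to orange, yellow, and purple are arbitrary choices made only for this illustration. It then computes p(xC = green, xS = circle) and p(xC = red | xS = square) by counting faces:

# One hypothetical die consistent with the text's examples: two squares (one of
# them red), three circles (blue and green among them), and one triangle.
# The shapes of orange, yellow, and purple are assumptions for illustration.
die = [("red", "square"), ("orange", "circle"), ("yellow", "square"),
       ("blue", "circle"), ("green", "circle"), ("purple", "triangle")]

def p(predicate):
    """Probability of an event on a fair roll = fraction of faces satisfying it."""
    return sum(1 for face in die if predicate(face)) / len(die)

# Joint probability p(color = green, shape = circle)
joint = p(lambda f: f == ("green", "circle"))

# Conditional probability p(color = red | shape = square) = p(red, square) / p(square)
cond = p(lambda f: f == ("red", "square")) / p(lambda f: f[1] == "square")

print(joint)  # 1/6 ~ 0.167
print(cond)   # 1/2 = 0.5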
of observing this sequence is simply the product of observing each event, i.e.,
θ × (1 − θ) × θ × θ × (1 − θ) = θ^3(1 − θ)^2, with no adjustment for different orders
of observing three heads and two tails.
The more commonly used multinomial distribution in text analysis, which mod-
els the probability of seeing a word in a particular scenario (e.g., in a document), is
very similar to this Bernoulli distribution, just with more than two outcomes.
D = {h, t , h, h, t}.
Now we would like to figure out what θ is based on the observed data. Using
maximum likelihood estimation (MLE), we choose the θ that has the highest like-
lihood given our data, i.e., choose the θ such that the probability of observed data
is maximized.
To compute the MLE, we would first write down the likelihood function, i.e.,
p(D | θ), which is θ^3(1 − θ)^2 as we explained earlier. The problem is thus reduced
to finding the θ that maximizes the function f(θ) = θ^3(1 − θ)^2. Equivalently, we can
attempt to maximize the log-likelihood: log f (θ ) = 3 log θ + 2 log(1 − θ), since log-
arithm transformation preserves the order of values. Using knowledge of calculus,
we know that a necessary condition for a function to achieve a maximum value at a
θ value is that the derivative at the same θ value is zero. Thus, we just need to solve
the following equation:
d log f(θ)/dθ = 3/θ − 2/(1 − θ) = 0,

which gives θ = 3/5. More generally, if the observed data contain H heads and T tails, the same reasoning gives the MLE

θ = H / (H + T).
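As a quick numerical sanity check of this derivation (a sketch of ours, not one of the book's exercises), we can evaluate the log-likelihood on a grid of θ values and confirm that it peaks at the count-based estimate H/(H + T):

import math

data = ["h", "t", "h", "h", "t"]
H = data.count("h")
T = data.count("t")

def log_likelihood(theta):
    # log p(D | theta) = H log(theta) + T log(1 - theta)
    return H * math.log(theta) + T * math.log(1 - theta)

# Grid search over theta in (0, 1) as a numerical check of the calculus argument.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)

print(theta_mle)        # ~0.6
print(H / (H + T))      # 0.6, the closed-form MLE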
The notation arg max represents the argument (i.e., θ in this case) that makes
the likelihood function (i.e., p(D | θ)) reach its maximum. Thus, the value of an
arg max expression stays the same if we perform any monotonic transformation of
the function inside arg max. This is why we could use the logarithm transformation
in the example above, which made it easier to compute the derivative.
The solution to MLE shown above should be intuitive: the θ that maximizes our
data likelihood is just the ratio of heads. It is a general characteristic of the MLE
that the estimated probability is the normalized counts of the corresponding events
denoted by the probability. As an example, the MLE of a multinomial distribution
(which will be further discussed in detail later in the book) gives each possible
outcome a probability proportional to the observed counts of the outcome. Note
that a consequence of this is that all unobserved outcomes would have a zero
probability according to MLE. This is often not reasonable especially when the data
sample is small, a problem that motivates Bayesian parameter estimation which we
discuss below.
p(D) = ∫_θ p(θ) p(D | θ) dθ (2.9)
The last one is called the marginal likelihood because the integration “marginal-
izes out” (removes) the parameter θ from the equation. Since the likelihood of the
data remains constant, observing the constraint that p(θ | D) must sum to one over
all possible values of θ, we usually just say

p(θ | D) ∝ p(θ) p(D | θ). (2.10)
That is, the posterior is proportional to the prior times the likelihood.
The posterior distribution of the parameter θ fully characterizes the uncertainty
of the parameter value and can be used to infer any quantity that depends on the
parameter value, including computing a point estimate of the parameter (i.e., a
single value of the parameter). There are multiple ways to compute a point estimate
based on a posterior distribution. One possibility is to compute the mean of the
posterior distribution, which is given by the weighted sum of probabilities and the
parameter values. For a discrete distribution,
E[X] = Σ_x x p(x). (2.11)
Here it is easy to see that the MAP estimate would deviate from the MLE with
consideration of maximizing the probability of the parameter according to our
prior belief encoded as p(θ). It is through the use of an appropriate prior that we
can address the overfitting problem of the MLE, since our prior can strongly prefer an
estimate in which neither heads nor tails has a zero probability.
For a continuation and more in-depth discussion of this material, consult Ap-
pendix A.
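The following sketch illustrates the point about zero probabilities, assuming a symmetric Beta(α, α) prior (a common choice for a coin parameter, used here purely for illustration): with data containing no tails, the MLE assigns tails probability zero, while the MAP estimate with pseudo-counts does not:

data = ["h", "h", "h", "h"]          # observed data: no tails at all
H = data.count("h")
T = data.count("t")

# Maximum likelihood estimate: relative frequency.
theta_mle = H / (H + T)

# MAP estimate under an assumed symmetric Beta(alpha, alpha) prior:
# equivalent to adding (alpha - 1) pseudo-counts to heads and to tails.
alpha = 2.0
theta_map = (H + alpha - 1) / (H + T + 2 * (alpha - 1))

print(theta_mle)   # 1.0 -> tails gets probability 0, which overfits
print(theta_map)   # 5/6 -> tails keeps a small nonzero probability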
The first step has already been addressed. In our example, we wish to capture the
probabilities of individual words occurring in our corpus. In the second step, we
need to figure out actually how to set the probabilities for each word. One obvious
way would be to calculate the probability of each individual word in the corpus
itself. That is, the count of a unique word wi divided by the total number of words
in the corpus could be the value of p(wi | θ). This can be shown to be the solution of
the MLE of the model. More sophisticated models and their parameter estimation
will be discussed later in the book. Finally, once we have θ defined, what can we
actually do with it? One use case would be analyzing the probability of a specific
subset of words in the corpus, and another could be observing unseen data and
calculating the probability of seeing the words in the new text. It is often possible
to design the model such that the model parameters would encode the knowledge
we hope to discover from text data. In such a case, the estimated model parameters
can be directly used as the output (result) of text mining.
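For a unigram word model, the MLE described above is just normalized word counts. The sketch below estimates p(w | θ) from a made-up toy corpus; any tokenized text could be substituted:

from collections import Counter

# A toy corpus; any tokenized text would do.
corpus = "text data are valuable because text data encode human knowledge".split()

counts = Counter(corpus)
total = sum(counts.values())

# MLE of the unigram model: p(w | theta) = count(w) / total number of words.
theta = {w: c / total for w, c in counts.items()}

print(theta["text"])               # 2/10 = 0.2
print(theta["data"])               # 2/10 = 0.2
print(theta.get("computer", 0.0))  # unseen word -> probability 0 under the MLE

Note that a word never seen in the corpus gets probability zero under this estimate, which is exactly the issue that motivated the discussion of priors above.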
Please keep in mind that probabilistic models are a general tool and don’t only
have to be used for text analysis—that’s just our main application!
2.2 Information Theory

Information theory deals with uncertainty and the transfer or storage of quantified
information in the form of bits. It is applied in many fields, such as electrical engi-
neering, computer science, mathematics, physics, and linguistics. A few concepts
from information theory are very useful in text data management and analysis,
which we introduce here briefly. The most important concept of information theory
is entropy, which is a building block for many other measures.
The problem can be formally defined as the quantified uncertainty in predicting
the value of a random variable. In the common example of a coin, the two values
would be 1 or 0 (depicting heads or tails) and the random variable representing
these outcomes is X. In other words,
X = 1 if heads, and X = 0 if tails.
The more random this random variable is, the more difficult the prediction of heads
or tails will be. How does one quantitatively measure the randomness of a random
variable like X? This is precisely what entropy does.
Roughly, the entropy of a random variable X, H (X), is a measure of expected
number of bits needed to represent the outcome of an event x ∼ X. If the outcome
is known (completely certain), we don’t need to represent any information and
H (X) = 0. If the outcome is unknown, we would like to represent the outcome in
bits as efficiently as possible. That means using fewer bits for common occurrences
and more bits when the event is less likely. Entropy gives us the expected number of bits:

H(X) = − Σ_x p(X = x) log2 p(X = x).

In the cases where we have log2 0, we generally just define the corresponding term to be 0 since log2 0
is undefined. We will get different H (X) for different random variables X.
The exact theory and reasoning behind this formula are beyond the scope of this
book, but it suffices to say that H (X) = 0 means there is no randomness, H (X) = 1
means there is complete randomness in that all events are equally likely. Thus, the
amount of randomness varies from 0 to 1. For our coin example where the sample
space is two events (heads or tails), the entropy function looks like

H(X) = −p(X = 1) log2 p(X = 1) − p(X = 0) log2 p(X = 0).

For a fair coin, we would have p(X = 1) = p(X = 0) = 1/2. To calculate H(X), we'd have the calculation

H(X) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1,

whereas for a completely biased coin with p(X = 1) = 1, p(X = 0) = 0 we would have

H(X) = −1 · log2 1 − 0 · log2 0 = 0.
For this example, we had only two possible outcomes (i.e., a binary random
variable). As we can see from the formula, this idea of entropy easily generalizes
to random variables with more than two outcomes; in those cases, the sum is over
more than two elements.
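The entropy formula is easy to implement for any discrete distribution; the sketch below (ours) reproduces the fair-coin and biased-coin values discussed above, using the convention that a term with probability zero contributes nothing:

import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), with the convention 0 * log2 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))          # completely biased coin: 0.0 bits
print(entropy([1/3, 1/3, 1/3]))     # three equally likely outcomes: ~1.585 bits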
If we plot H (X) for our coin example against the probability of heads p(X = 1),
we receive a plot like the one shown in Figure 2.1. At the two ends of the x-axis,
the probability of X = 1 is either very small or very large. In both these cases, the
entropy function has a low value because the outcome is not very random. The most
random is when p(X = 1) = 1/2. In that case, H(X) = 1, the maximum value. Since
the two probabilities are symmetric, we get a symmetric inverted U -shape as the
plot of H (X) as p(X = 1) varies.
It’s a good exercise to consider when a particular random variable (not just the
coin example) has a maximum or minimal value. In particular, let’s think about
some special cases. For example, we might have a random variable Y that always
takes a value of 1. Or, there’s a random variable Z that is equally likely to take a
value of 1, 2, or 3. In these cases, H (Y ) < H (Z) since the outcome of Y is much
Figure 2.1 The entropy H(X) of the coin plotted against P(X = 1).
easier to predict than the outcome of Z. This is precisely what entropy captures.
You can calculate H (Y ) and H (Z) to confirm this answer.
For our applications, it may be useful to consider the entropy of a word w in
some context. Here, high-entropy words would be harder to predict. Let W be the
random variable that denotes whether a word occurs in a document in our corpus.
Say W = 1 if the word occurs and W = 0 otherwise. How do you think H(W_the)
compares to H(W_computer)? The entropy of the word the is close to zero since it
occurs everywhere. It's not surprising to see this word in a document, thus it is easy
to predict that W_the = 1. This case is just like the biased coin that always lands one
way. The word computer, on the other hand, is a less common word and is harder
to predict whether it occurs or not, so the entropy will be higher.
When we attempt to quantify uncertainties of conditional probabilities, we can
also define conditional entropy H (X | Y ), which indicates the expected uncertainty
of X given that we observe Y , where the expectation is taken under the distribution
of all possible values of Y . Intuitively, if X is completely determined by Y , then
H (X | Y ) = 0 since once we know Y , there would be no uncertainty in X, whereas if
X and Y are independent, then H (X | Y ) would be the same as the original entropy
of X, i.e., H (X | Y ) = H (X) since knowing Y does not help at all in resolving the
uncertainty of X.
A closely related quantity is the mutual information I(X; Y) = H(X) − H(X | Y), which measures how much knowing Y reduces the uncertainty about X.
It is easy to see that I (X; Y ) tends to be large if X and Y are correlated, whereas
I (X; Y ) would be small if X and Y are not so related; indeed, in the extreme case
when X and Y are completely independent, there would be no reduction of entropy,
and thus H (X) = H (X | Y ), and I (X; Y ) = 0. However, if X is completely determined
by Y , then H (X | Y ) = 0, thus I (X; Y ) = H (X). Intuitively, mutual information can
measure the correlation of two random variables. Clearly as a correlation measure
on X and Y , mutual information is symmetric.
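To make these relationships concrete, the following Python sketch (our own illustration, with a made-up joint distribution) computes H(X), H(X | Y), and I(X; Y) = H(X) − H(X | Y):

import math
from collections import defaultdict

def H(dist):
    # Entropy of a distribution given as a dict of probabilities.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A hypothetical joint distribution p(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions p(x) and p(y).
px, py = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    px[x] += p
    py[y] += p

# Conditional entropy H(X | Y) = sum over y of p(y) * H(X | Y = y).
HX_given_Y = 0.0
for y, p_y in py.items():
    cond = {x: joint[(x, y)] / p_y for x in px if joint.get((x, y), 0) > 0}
    HX_given_Y += p_y * H(cond)

I_XY = H(px) - HX_given_Y   # mutual information
print(H(px), HX_given_Y, I_XY)   # about 1.0, 0.72, 0.28

Because X and Y are correlated in this made-up table, knowing Y removes part of the uncertainty about X, so I(X; Y) is positive; for an independent joint distribution it would be 0.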
Applications of these basic concepts, including entropy, conditional entropy,
and mutual information will be further discussed later in this book.
2.3 Machine Learning
Machine learning is a very important technique for solving many problems and has very broad applications. In text data management and analysis, it also has many
uses. Any in-depth treatment of this topic would clearly be beyond the scope of
this book, but here we introduce some basic concepts in machine learning that are
needed to better understand the content later in the book.
Machine learning techniques can often be classified into two types: supervised
learning and unsupervised learning. In supervised learning, a computer would
learn how to compute a function ŷ = f (x) based on a set of examples of the input
value x (called training data) and the corresponding expected output value y. It is
called “supervised” because typically the y values must be provided by humans for
each x, and thus the computer receives a form of supervision from the humans.
Once the function is learned, the computer would be able to take unseen values of
x and compute the function f (x).
When y takes a value from a finite set of values, which can be called labels, a
function f (.) can serve as a classifier in that it can be used to map an instance x
to the “right” label (or multiple correct labels when appropriate). Thus, the prob-
lem can be called a classification problem. The simplest case of the classification
problem is when we have just two labels (known as binary classification). When y
takes a real value, the problem is often called a regression problem. Both forms
of the problem can also be called prediction when our goal is mainly to infer the
unknown y for a given x; the term “prediction” is especially meaningful when y is
some property of a future event.
In text-based applications, both forms may occur, although the classification
problem is far more common, in which case the problem is also called text catego-
rization or text classification. We dedicate a chapter to this topic later in the book
(Chapter 15). The regression problem may occur when we use text data to predict
another non-text variable such as sentiment rating or stock prices; both cases are
also discussed later.
In classification as well as regression, the (input) data instance x is often repre-
sented as a feature vector where each feature provides a potential clue about which
y value is most likely the value of f (x). What the computer learns from the training
data is an optimal way to combine these features with weights on them to indi-
cate their importance and their influence on the final function value y. “Optimal”
here simply means that the prediction error on the training data is minimized, i.e.,
the predicted ŷ values are maximally consistent with the true y values in the train-
ing data.
More formally, let our collection of objects be X such that xi ∈ X is a feature
vector that represents object i. A feature is an attribute of an object that describes
it in some way. For example, if the objects are news articles, one feature could be
whether the word good occurred in the article. All these different features are part
of a document’s feature vector, which is used to represent the document. In our
cases, the feature vector will usually have to do with the words that appear in the
document.
We also have Y, which is the set of possible labels for each object. Thus, yi may
be sports in our news article classification setup and yj could be politics.
A classifier is a function f (.) that takes a feature vector as input and outputs a
predicted label ŷ ∈ Y. Thus, we could have f (xi ) = sports, meaning ŷ = sports. If the
true y is also sports, the classifier was correct in its prediction.
Notice how we can only evaluate a classification algorithm if we know the true
labels of the data. In fact, we will have to use the true labels in order to learn a good
function f (.) to take unseen feature vectors and classify them. For this reason, when
studying machine learning algorithms, we often split our corpus X into two parts:
training data and testing data. The training portion is used to build the classifier,
and the testing portion is used to evaluate the performance (e.g., determine how
many correct labels were predicted). In applications, the training data are generally all the labeled examples that we can generate, and the test cases are the data points to which we would like to apply our machine learning program.
But what does the function f (.) actually do? Consider a very simple example
that determines whether a news article has positive or negative sentiment, i.e.,
Y = {positive, negative}:
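A minimal sketch of what such a classifier might look like (our own illustration; the feature vector is assumed to be a dictionary of word counts, and the rule about the word good is the hypothetical knowledge learned from training data):

def f(x):
    # x is a feature vector for an article, here a dict of word counts.
    # Hypothetical rule learned from training data: positive articles
    # contain the word "good" more than once.
    if x.get("good", 0) > 1:
        return "positive"
    return "negative"

article = {"good": 3, "movie": 1, "plot": 2}
print(f(article))   # -> "positive"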
Of course, this example is overly simplified, but it does demonstrate the basic
idea of a classifier: it takes a feature vector as input and outputs a class label. Based
on the training data, the classifier may have determined that positive sentiment ar-
ticles contain the term good more than once; therefore, this knowledge is encoded
in the function. In Chapter 15, we will investigate some specific algorithms for cre-
ating the function f (.) based on the training data. Other topics such as feedback
for information retrieval (Chapter 7) and sentiment analysis (Chapter 18) make use
of classifiers, or resemble them. For this reason, it’s good to know what machine
learning is and what kinds of problems it can solve.
In contrast to supervised learning, in unsupervised learning we only have the
data instances X without knowing Y. In such a case, obviously we cannot really know
how to compute y based on an x. However, we may still learn latent properties or
structures of X. Since there is no human effort involved, such an approach is called
unsupervised. For example, the computer can learn that some data instances are
very similar, and the whole dataset can be represented by three major clusters of
data instances such that in each cluster, the data instances are all very similar.
This is essentially the clustering technique that we will discuss in Chapter 14.
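As a toy sketch of this clustering idea (our own illustration, not META's implementation, with made-up two-dimensional points standing in for document representations):

import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, k, iters=10):
    # A minimal k-means: returns a cluster index for each point.
    centroids = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return labels

points = [(0, 0), (0, 1), (5, 5), (6, 5), (10, 0), (10, 1)]
print(kmeans(points, k=3))   # e.g., [0, 0, 1, 1, 2, 2] up to relabeling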
Another form of unsupervised learning is to design probabilistic models to model
the data (called “generative models”) where we can embed interesting parameters
that denote knowledge that we would like to discover from the data. By fitting the
model to our data, we can estimate the parameter values that can best explain the
data, and treat the obtained parameter values as the knowledge discovered from
the data. Applications of such an approach in analyzing latent topics in text are
discussed in detail in Chapter 17.
Exercises
2.1. What can you say about p(X | Y ) if we know X and Y are independent random
variables? Prove it.
2.3. Use Bayes’ rule to solve the following problem. One third of the time, Milo
takes the bus to work and the other times he takes the train. The bus is less reliable,
so he gets to work on time only 50% of the time. If taking the train, he is on time 90%
of the time. Given that he was on time on a particular day, what is the probability
that Milo took the bus?
2.5. Consider the game outlined in the previous question. Imagine that two
aces were drawn, leaving 50 cards remaining. What is the expected value of the
next draw?
2.7. In the information theory section, we compared the entropy of the word the
to that of the word unicorn. In general, what types of words have a high entropy and
what types of words have a low entropy? As an example, consider a corpus of ten
documents where the occurs in all documents, unicorn appears in five documents,
and Mercury appears in one document. What would be the entropy value of each?
2.8. Brainstorm some different features that may be good for the sentiment clas-
sification task outlined in this chapter. What are the strengths and weaknesses of
such features?
2.9. Consider the following scenario. You are writing facial recognition software
that determines whether there is a face in a given image. You have a collection of
100,000 images with the correct answer and need to determine if there are faces
in new, unseen images.
Answer the following questions.
(a) Is this supervised learning or unsupervised learning?
(b) What are the labels or values we are predicting?
(c) Is this binary classification or multiclass classification? (Or neither?)
(d) Is this a regression problem?
(e) What are the features that could be used?
2.10. Consider the following scenario. You are writing essay grading software that
takes in a student essay and produces a score from 0–100%. To design this system,
you are given essays from the past year which have been graded by humans. Your
task is to use the system with the current year’s essays as input.
Answer the same questions as in Exercise 2.9.
2.11. Consider the following scenario. You are writing a tool that determines
whether a given web page is one of
personal home page,
links to a personal home page, or
neither of the above.
To help you in your task, you are given 5,000,000 pages that are already labeled.
Answer the same questions as in Exercise 2.9.
3
Text Data Understanding
In this chapter, we introduce basic concepts in text data understanding through
natural language processing (NLP). NLP is concerned with developing computa-
tional techniques to enable a computer to understand the meaning of natural
language text. NLP is a foundation of text information systems because how ef-
fective a TIS is in helping users access and analyze text data is largely determined
by how well the system can understand the content of text data. Content analysis
is thus logically the first step in text data analysis and management.
While a human can instantly understand a sentence in their native language,
it is quite challenging for a computer to make sense of one. In general, this may
involve the following tasks.
Lexical analysis. The purpose of lexical analysis is to figure out what the basic
meaningful units in a language are (e.g., words in English) and determine
the meaning of each word. In English, it is rather easy to determine the
boundaries of words since they are separated by spaces, but it is non-trivial to
find word boundaries in some other languages such as Chinese where there
is no clear delimiter to separate words.
Syntactic analysis. The purpose of syntactic analysis is to determine how words
are related with each other in a sentence, thus revealing the syntactic structure
of a sentence.
Semantic analysis. The purpose of semantic analysis is to determine the mean-
ing of a sentence. This typically involves the computation of meaning of a
whole sentence or a larger unit based on the meanings of words and their
syntactic structure.
Pragmatic analysis. The purpose of pragmatic analysis is to determine meaning
in context, e.g., to infer the speech acts of language. Natural language is
used by humans to communicate with each other. A deeper understanding of natural language in communication thus requires general knowledge and reasoning, which is closely tied to general artificial intelligence. In this sense, NLP is "AI complete," i.e., as difficult as any other hard problem in artificial intelligence.
The following are a few examples of specific challenges in natural language
understanding.
Word-level ambiguity. A word may have multiple syntactic categories and mul-
tiple senses. For example, design can be a noun or a verb (ambiguous POS);
root has multiple meanings even as a noun (ambiguous sense).
Syntactic ambiguity. A phrase or a sentence may have multiple syntactic struc-
tures. For example, natural language processing can have two different inter-
pretations: “processing of natural language” vs. “natural processing of lan-
guage” (ambiguous modification). Another example: A man saw a boy with
a telescope has two distinct syntactic structures, leading to a different result
regarding who had the telescope (ambiguous prepositional phrase (PP) at-
tachment).
Anaphora resolution. What exactly a pronoun refers to may be unclear. For
example, in John persuaded Bill to buy a TV for himself , does himself refer to
John or Bill?
Many modern approaches instead focus on empirical uses of language, and rely solely on labeled training data created by humans and the application of machine learning techniques.
While linguistic knowledge is always useful, today, the most advanced natural
language processing techniques tend to rely on heavy use of statistical machine
learning techniques with linguistic knowledge only playing a somewhat secondary
role. These statistical NLP techniques are successful for some of the NLP tasks. Part
of speech tagging is a relatively easy task, and state-of-the-art POS taggers may have
a very high accuracy (above 97% on news data). Parsing is more difficult, though
partial parsing can probably be done with reasonably high accuracy (e.g., above
90% for recognizing noun phrases)1.
However, full structure parsing remains very difficult, mainly because of ambi-
guities. Semantic analysis is even more difficult, only successful for some aspects
of analysis, notably information extraction (recognizing named entities such as
names of people and organization, and relations between entities such as who
works in which organization), word sense disambiguation (distinguishing different
senses of a word in different contexts of usage), and sentiment analysis (recogniz-
ing positive opinions about a product in a product review). Inferences and speech
act analysis are generally only feasible in very limited domains.
In summary, only “shallow” analysis of natural language processing can be done
for arbitrary text and in a robust manner; “deep” analysis tends not to scale up well
or be robust enough for analyzing unrestricted text. In many cases, a significant
amount of training data (created by human labeling) must be available in order to
achieve reasonable accuracy.
1. These performance numbers were based on a specific data set, so they may not generalize well
even within the same domain.
Figure 3.2 Text information system tasks and their dependency on NLP: classification and retrieval depend on NLP the least (they are "easier" and admit more workarounds), followed by summarization/extraction/topic mining, translation/dialogue, and question answering.
For example, retrieval and classification can be viewed as relatively easy tasks as compared with a more difficult task such as machine translation, where deep understanding of natural language is clearly required.
Figure 3.2 shows a number of TIS tasks that require somewhat different levels
of NLP. At one end of the spectrum, tasks such as retrieval and classification are
relatively easy, and in most of the cases, they don’t require deep NLP; indeed,
looking at the keywords mentioned in text is often sufficient to determine whether a
document is relevant to a query or about a certain topic. At the other end, however,
tasks such as machine translation and question answering would require much
more precise understanding; for example, a wrong parse of a sentence generally
would lead to a wrong translation unless the target language has a similar ambiguity
structure, and similarly, a wrong understanding of the question would lead to
wrong answers.
When it comes to a specific application task, it is often possible to bypass the
difficulty in accurately understanding natural language and go directly to solve the
application problem. A well-known example is the Eliza system,2 which is supposed
to play the role of a therapist and carry on a dialogue with a human user [Weizenbaum
1966]. The following is a sample dialogue.
2. [Link]
On the surface, the dialogue appears to be quite natural, and indeed, such a
dialogue might be useful to engage a depressed patient in a conversation. However,
the system does not really understand the language, and solely relies on heuristic
rules like the following to keep the dialogue going:
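A minimal sketch of an Eliza-style rule (our own illustration; the patterns and response templates here are hypothetical, not the book's exact rules):

import re

# Each rule pairs a keyword pattern with a response template that reuses
# part of the user's own sentence.
rules = [
    (re.compile(r"I remember (.*)", re.I), "Do you often think of {0}?"),
    (re.compile(r"I feel (.*)", re.I), "Why do you feel {0}?"),
]

def respond(sentence):
    for pattern, template in rules:
        match = pattern.search(sentence)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."   # default reply when no rule matches

print(respond("I feel sad about my work"))   # -> "Why do you feel sad about my work?"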
Such rules enable the system to directly perform the task, i.e., making a conver-
sation, without necessarily trying to understand the real meaning of words and
determining the meaning of the entire sentence.
Such a pattern-based way of solving a problem has turned out to be quite pow-
erful. Indeed, modern machine learning approaches to natural language under-
standing are essentially based on this and in many ways are similar to the Eliza
system, but with two important differences. The first is that the rules in a machine
learning system would not be exact or strict; instead, they tend to be stochastic, and
the probabilities of choosing which rule would be empirically set based on a train-
ing data set where the expected behavior of a function to be computed is known.
Second, instead of having humans supply rules, the "soft" rules may be learned automatically from the training data with only minimal help from users who can,
e.g., specify the elements in a rule.
Even difficult tasks like machine translation can be done by such statistical
approaches. The most useful NLP techniques for building a TIS are statistical ap-
proaches which tend to be much more robust than symbolic approaches. Statistical
language models are especially useful because they can quantify the uncertainties
associated with the use of natural language in a principled way.
3.3 Text Representation
Techniques from NLP allow us to design many different types of informative features for text objects. Let's take a look at the example sentence A dog is chasing a boy
on the playground in Figure 3.3. We can represent this sentence in many different
ways. First, we can always represent such a sentence as a string of characters. This
is true for every language. This is perhaps the most general way of representing text
since we can always use this approach to represent any text data. Unfortunately,
the downside to this representation is that it does not allow us to perform semantic
analysis, which is often needed for many applications of text mining. We’re not
even recognizing words, which are the basic units of meaning for any language. (Of
course, there are some situations where characters are useful, but that is not the
general case.)
The next version of text representation is performing word segmentation to
obtain a sequence of words. In the example sentence, we get features like dog and
chasing. With this level of representation, we suddenly have much more freedom.
By identifying words, we can, for example, easily discover the most frequent words
in this document or the whole collection. These words can then be used to form
topics. Therefore, representing text data as a sequence of words opens up a lot of
interesting analysis possibilities.
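For instance, once the text is segmented into words, finding the most frequent words takes only a few lines; here is a rough sketch (our own illustration, assuming a naive tokenizer that lowercases the text and splits on non-letter characters, which works reasonably for English but not for languages such as Chinese):

import re
from collections import Counter

sentence = "A dog is chasing a boy on the playground"

# Naive word segmentation: lowercase and keep runs of letters.
words = re.findall(r"[a-z]+", sentence.lower())

print(Counter(words).most_common(3))   # [('a', 2), ('dog', 1), ('is', 1)]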
However, this level of representation is slightly less general than a string of char-
acters. In some languages, such as Chinese, it’s actually not that easy to identify all
the word boundaries since in such a language text is a sequence of characters with
no spaces in between words. To solve this problem, we have to rely on some special
techniques to identify words and perform more advanced segmentation that isn’t
only based on whitespace (which isn’t always 100% accurate). So, the sequence of
words representation is not as robust as the string of characters representation. In
English, it’s very easy to obtain this level of representation so we can use this all
the time.
If we go further in natural language processing, we can add part-of-speech (POS)
tags to the words. This allows us to count, for example, the most frequent nouns; or,
we could determine what kind of nouns are associated with what kind of verbs. This
opens up more interesting opportunities for further analysis. Note in Figure 3.3
that we use a plus sign on the additional features because by representing text as
a sequence of part of speech tags, we don’t necessarily replace the original word
sequence. Instead, we add this as an additional way of representing text data.
Representing text as both words and POS tags enriches the representation of
text data, enabling a deeper, more principled analysis. If we go further, then we’ll
be parsing the sentence to obtain a syntactic structure. Again, this further opens up
more interesting analysis of, for example, the writing styles or grammatical error
correction.
If we go further still into semantic analysis, then we might be able to recognize
dog as an animal. We also can recognize boy as a person, and playground as a
location and analyze their relations. One deduction could be that the dog was
chasing the boy, and the boy is on the playground. This will add more entities and
relations, through entity-relation recognition. Now, we can count the most frequent
person that appears in this whole collection of news articles. Or, whenever you see
a mention of this person you also tend to see mentions of another person or object.
These types of repeated patterns can potentially make very good features.
Figure 3.3 Levels of representation for the example sentence: deeper NLP brings the representation closer to knowledge, but requires more human effort and yields a less accurate representation.
Such a high-level representation is even less robust than the sequence of words
or POS tags. It’s not always easy to identify all the entities with the right types and
we might make mistakes. Relations are even harder to find; again, we might make
mistakes. The level of representation is less robust, yet it’s very useful. If we move
further to a logic representation, then we have predicates and inference rules. With
inference rules we can infer interesting derived facts from the text. As one would
imagine, we can’t do that all the time for all kinds of sentences since it may take
significant computation time or a large amount of training data.
Finally, speech acts would add yet another level of representation of the intent
of this sentence. In this example, it might be a request. Knowing that would allow
us to analyze even more interesting things about the observer or the author of this
sentence. What’s the intention of saying that? What scenarios or what kinds of
actions will occur?
Figure 3.3 shows that if we move downwards, we generally see more sophisti-
cated NLP techniques. Unfortunately, such techniques would require more human
effort as well, and they are generally less robust since they attempt to solve a much
more difficult problem. If we analyze our text at levels that represent deeper analy-
sis of language, then we have to tolerate potential errors. That also means it’s still
necessary to combine such deep analysis with shallow analysis based on (for exam-
ple) sequences of words. On the right side, there is an arrow that points down to
indicate that as we go down, our representation of text is closer to the knowledge
representation in our mind. That’s the purpose of text mining!
Clearly, there is a tradeoff here between doing deeper analysis that might have
errors but would give us direct knowledge that can be extracted from text and doing
shallow analysis that is more robust but wouldn't give us the necessary deeper
representation of knowledge.
Text data are generated by humans and are meant to be consumed by humans.
As a result, in text data analysis and text mining, humans play a very important role.
They are always in the loop, meaning that we should optimize for a collaboration
between humans and computers. In that sense, it’s okay that computers may not
be able to have a completely accurate representation of text data. Patterns that
are extracted from text data can be interpreted by humans, and then humans can
guide the computers to do more accurate analysis by annotating more data, guiding
machine learning programs to make them work more effectively.
Different text representations tend to enable different analyses, as shown in Figure 3.4. In particular, we can gradually add deeper and deeper analysis results to represent text data, which opens up more interesting representation
opportunities and analysis capabilities. The table summarizes what we have just
seen; the first column shows the type of text representation while the second
visualizes the generality of such a representation. By generality, we mean whether
we can do this kind of representation accurately for all the text data (very general) or
only some of them (not very general). The third column shows the enabled analysis
techniques and the final column shows some examples of applications that can be
achieved with a particular level of representation.
As a sequence of characters, text can only be processed by string processing
algorithms. They are very robust and general. In a compression application, we
don’t need to know word boundaries (although knowing word boundaries might
actually help). Sequences of words (as opposed to characters) offer a very important
level of representation; it’s quite general and relatively robust, indicating that it
supports many analysis techniques such as word relation analysis, topic analysis,
and sentiment analysis. As you may expect, many applications can be enabled by
these kinds of analysis. For example, thesaurus discovery has to do with discovering
related words, and topic- and opinion-related applications can also be based on
word-level representation. People might be interested in knowing major topics
covered in the collection of text, where a topic is represented as a distribution over
words.
Moving down, we’ll see we can gradually add additional representations. By
adding syntactic structures, we can enable syntactic graph analysis; we can use
graph mining algorithms to analyze these syntactic graphs. For example, stylistic analysis could be supported by features derived from such syntactic structures.

3.4 Statistical Language Models
A statistical language model specifies a probability distribution over text, which allows us to answer questions such as the following.
. Given that we see John and feels, how likely will we see happy as opposed to
habit as the next word? Answering this question can help speech recognition
as happy and habit have very similar acoustic signals, but a language model
can easily suggest that John feels happy is far more likely than John feels habit.
. Given that we observe baseball three times and game once in a news article,
how likely is it about the topic “sports”? This will obviously directly help text
categorization and information retrieval tasks.
. Given that a user is interested in sports news, how likely would it be for the
user to use baseball in a query? This is directly related to information retrieval.
p(w | θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, . . . , food 0.00001, . . .
p(w | θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, . . . , text 0.00001, . . .
Figure 3.5 Two examples of unigram language models, representing two different topics.
It is easy to show that the ML estimate of a unigram language model gives each
word a probability equal to its relative frequency in D. That is,
p(w | θ̂) = c(w, D) / |D|,    (3.3)
where c(w, D) is the count of word w in D and |D| is the length of D, or total number
of words in D.
Such an estimate is optimal in the sense that it would maximize the probability
of the observed data, but whether it is really optimal for an application is still
questionable. For example, if our goal is to estimate the language model in the mind
of an author of a research article, and we use the maximum likelihood estimator
to estimate the model based only on the abstract of a paper, then it is clearly non-
optimal since the estimated model would assign zero probability to any unseen
words in the abstract, which would make the whole article have a zero probability
unless it only uses words in the abstract. Note that, in general, the maximum
likelihood estimate would assign zero probability to any unseen token or event in
the observed data; this is so because assigning a non-zero probability to such a
token or event would take away probability mass that could have been assigned
to an observed word (since all probabilities must sum to 1), thus reducing the
likelihood of the observed data. We will discuss various techniques for improving
the maximum likelihood estimator later by using techniques called smoothing.
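A minimal sketch of the maximum likelihood estimate in Equation (3.3) (our own illustration, using a naive whitespace tokenizer):

from collections import Counter

def mle_unigram(document):
    # Maximum likelihood unigram model: p(w) = c(w, D) / |D|.
    words = document.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

theta = mle_unigram("text mining is the mining of useful knowledge from text")
print(theta["text"])           # 2/10 = 0.2
print(theta["mining"])         # 2/10 = 0.2
print(theta.get("food", 0.0))  # unseen word: zero probability

The zero probability assigned to the unseen word food is exactly the problem that smoothing, discussed later, is designed to address.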
Although extremely simple, a unigram language model is already very useful
for text analysis. For example, Figure 3.6 shows three different unigram language
models estimated on three different text data samples, i.e., a general English text
database, a computer science research article database, and a text mining research
paper. In general, the words with the highest probabilities in all the three models
are those functional words in English because such words are frequently used
in any text. After going further down on the list of words, one would see more
content-carrying and topical words. Such content words would differ dramatically
depending on the data to be used for the estimation, and thus can be used to
discriminate the topics in different text samples.
Unigram language models can also be used to perform semantic analysis of word
relations. For example, we can use them to find what words are semantically asso-
ciated with a word like computer. The main idea for doing this is to see what other
words tend to co-occur with the word computer. Specifically, we can first obtain a
sample of documents (or sentences) where computer is mentioned. We can then
estimate a language model based on this sample to obtain p(w | computer). This
model tells us which words occur frequently in the context of “computer.” However,
the most frequent words according to this model would likely be functional words
in English or words that are simply common in the data, but have no strong asso-
ciation with computer. To filter out such common words, we need a model for such
words which can then tell us what words should be filtered. It is easy to see that the
general English language model (i.e., a background language model) would serve the purpose well.

Figure 3.6 Three different unigram language models estimated from three different text samples: (B) general background English text, (C) computer science research papers, and (D) a text mining paper.

So we can use the background language model to normalize the
model p(w | computer) and obtain a probability ratio for each word. Words with
high ratio values can then be assumed to be semantically associated with computer
since they tend to occur frequently in its context, but not frequently in general. This
is illustrated in Figure 3.7.
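This normalization idea can be sketched as follows (our own illustration; the tiny document samples and the smoothing constant are placeholders, and a real system would estimate both models from much larger samples):

from collections import Counter

def unigram_counts(docs):
    counts = Counter(w for d in docs for w in d.lower().split())
    return counts, sum(counts.values())

def association_scores(context_docs, background_docs, eps=1e-6):
    # Score each word w by p(w | context) / p(w | background).
    ctx_counts, ctx_total = unigram_counts(context_docs)
    bg_counts, bg_total = unigram_counts(background_docs)
    scores = {}
    for w, c in ctx_counts.items():
        p_ctx = c / ctx_total
        p_bg = bg_counts.get(w, 0) / bg_total + eps   # avoid division by zero
        scores[w] = p_ctx / p_bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

context = ["the computer runs software", "software and hardware of the computer"]
background = ["the cat sat on the mat", "the software of the mind", "a computer"]
print(association_scores(context, background)[:3])

Words that are frequent in the context sample but rare in the background sample (such as hardware here) float to the top, while common function words such as the are pushed down.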
More applications of language models in text information systems will be fur-
ther discussed as their specific applications appear in later chapters. For example,
we can represent both documents and queries as being generated from some lan-
guage model. Given this background, however, the reader should have sufficient
information to understand the future chapters dealing with this powerful statisti-
cal tool.
Figure 3.7 Using topic language models and a background language model to find semantically
related words.
Exercises
3.1. In what way is NLP related to text mining?
3.2. Does poor NLP performance mean poor retrieval performance? Explain.
3.3. Given a collection of documents for a specific topic, how can we use maximum
likelihood estimation to create a topic unigram language model?
3.4. How might the size of a document collection affect the quality of a language
model?
3.5. Why might maximum likelihood estimation not be the best guess of parame-
ters for a topic language model?
3.6. Suppose we appended a duplicate copy of the topic collection to itself and
re-estimated a maximum likelihood language model. Would θ change?
3.7. A unigram language model as defined in this chapter can take a sequence of
words as input and output its probability. Explain how this calculation has strong
independence assumptions.
3.8. Given a unigram language model θ estimated from this book, and two doc-
uments d1 = information retrieval and d2 = retrieval information, then is
p(d1 | θ) > p(d2 | θ )? Explain.
3.9. An n-gram language model records sequences of n words. How does the num-
ber of possible parameters change if we decided to use a 2-gram (bigram) language
model instead of a unigram language model? How about a 3-gram (trigram) model?
Give your answer in terms of V , the unigram vocabulary size.
3.11. Sort the words by their probabilities from the previous exercise. If you used
a different text file, how would your sorted list be different? How would it be the
same?
4
META: A Unified Toolkit for Text Data Management and Analysis
This chapter introduces the accompanying software META, a free and open-source
toolkit that can be used to analyze text data. Throughout this book, we give hands-
on exercises with META to practice concepts and explore different text mining
algorithms.
Most of the algorithms and methods discussed in this book can be found in
some form in the META toolkit. If META doesn’t include a technique discussed in
this book, it’s likely that a chapter exercise is to implement this feature yourself! De-
spite being a powerful toolkit, META’s simplicity makes feature additions relatively
straightforward, usually through extending a short class hierarchy.
Configuration files are an integral part of META’s forward-facing infrastructure.
They are designed such that exploratory analysis usually requires no programming
effort from the user. By default, META is packaged with various executables that can
be used to solve a particular task. For example, for a classification experiment the
user would run the following command in their terminal1:
./classify [Link]
This is standard procedure for using the default executables; they take only one
configuration file parameter. The configuration file format is explained in detail
later in this chapter, but essentially it allows the user to select a dataset, a way
1. Running the default classification experiment requires a dataset to operate on. The 20news-
groups dataset is specified in the default META config file and can be downloaded here: https://
[Link]/data/[Link]. Place it in the meta/data/ directory.
to tokenize the dataset, and a particular classification algorithm to run (for this
example).
If more advanced functionality is desired, programming in C++ is required to
make calls to META’s API (applications programming interface). Both configuration
file and API usage are documented on META’s website, [Link] as
well as in this chapter. Additionally, a forum for META exists ([Link]
[Link]), containing discussion surrounding the toolkit. It includes user support
topics, community-written documentation, and developer discussions.
The sections that follow delve into a little more detail about particular aspects
of META so the reader will be comfortable working with it in future chapters.
Design Philosophy
4.1 META’s goal is to improve upon and complement the current body of open source
machine learning and information retrieval software. The existing environment of
this open source software tends to be quite fragmented.
There is rarely a single location for a wide variety of algorithms; a good example
of this is the LIBLINEAR [Fan et al. 2008] software package for SVMs. While this
is the most cited of the open source implementations of linear SVMs, it focuses
solely on kernel-less methods. If presented with a nonlinear classification problem,
one would be forced to find a different software package that supports kernels
(such as the same authors’ LIBSVM package [Chang and Lin 2011]). This places an
undue burden on the researchers and students—not only are they required to have
a detailed understanding of the research problem at hand, but they are now forced
to understand this fragmented nature of the open-source software community,
find the appropriate tools in this mishmash of implementations, and compile and
configure the appropriate tool.
Even when this is all done, there is the problem of data formatting—it is unlikely
that the tools have standardized upon a single input format, so a certain amount of
data preprocessing is now required. This all detracts from the actual task at hand,
which has a marked impact on the speed of discovery and education.
META addresses these issues. In particular, it provides a unifying framework
for text indexing and analysis methods, allowing users to quickly run controlled
experiments. It modularizes the feature generation, instance representation, data
storage formats, and algorithm implementations; this allows for researchers and
students to make seamless transitions along any of these dimensions with minimal
effort.
4.2 Setting up META
All future sections in this book will assume the reader has META downloaded and
installed. Here, we’ll show how to set up META.
META has both a website with tutorials and an online repository on GitHub.
To actually download the toolkit, users will check it out with the version control
software git in their command line terminal after installing any necessary prereq-
uisites.
The META website contains instructions for downloading and setting up the
software for a particular system configuration. At the time of writing this book, both
Linux and Mac OS are supported. Visit [Link]
and follow the instructions for the desired platform. We will assume the reader has
performed the steps listed in the setup guide and has a working version of META for
all exercises and demonstrations in this book.
There are two steps that are not mentioned in the setup guide. The first is to
make sure the reader has the version of META that was current when this book was
published. To ensure that the commands and examples sync up with the software
the reader has downloaded, we will ensure that META is checked out with version
2.2.0. Run the following command inside the meta/ directory:
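Assuming the release is tagged v2.2.0, matching the tag on the META releases page, the command would be:

git checkout v2.2.0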
The second step is to make sure that any necessary model files are also down-
loaded. These are available on the META releases page: [Link]
toolkit/meta/releases/tag/v2.2.0. By default, the model files should be placed in the
meta/build/ directory, but you can place them anywhere as long as the paths in
the config file are updated.
Once these steps are complete, the reader should be able to complete any exer-
cise or run any demo. If any additional files or information are needed, it will be
provided in the accompanying section.
4.3 Architecture
All processed data in META is stored in an index. There are two index types:
forward_index and inverted_index. The former is keyed by document IDs, and
the latter is keyed by term IDs.
forward_index is used for applications such as topic modeling and most clas-
sification tasks.
inverted_index is used to create search engines, or do classification with k-
nearest-neighbor or similar algorithms.
Since each META application takes an index as input, all processed data is
interchangeable between all the components. This also gives a great advantage to
classification: META supports out-of-core classification by default! If a dataset is
small enough (like most other toolkits assume), a cache such as no_evict_cache can be used to keep it all in memory without sacrificing any speed. (Index usage
is explained in much more detail in the search engine exercises.)
There are four corpus input formats.
libsvm_corpus. If only being used for classification, META can also take
LIBSVM-formatted input to create a forward_index. There are many ma-
chine learning datasets available in this format on the LIBSVM site.2
Filters come next, and can be chained together. They define ways that text can
be modified or transformed. Here are some examples of filters.
length_filter. this filter accepts tokens that are within a certain length and
rejects those that are not.
icu_filter. applies an ICU (International Components for Unicode)3 translit-
eration to each token in the sequence. For example, an accented character like ï is instead written as i.
2. [Link]
3. [Link] note that different versions of ICU will tokenize text in slightly
different ways!
list_filter. this filter either accepts or rejects tokens based on a list. For
example, one could use a stop word list and reject stop words.
porter2_stemmer. this filter transforms each token according to the Porter2
English Stemmer rules.4
Analyzers operate on the output from the filter chain and produce token counts
from documents. Here are some examples of analyzers.
META defines a sane default filter chain that users are encouraged to use for
general text analysis in the absence of any specific requirements. To use it, one
should specify the following in the configuration file:
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-chain"
This configures the text analysis process to consider unigrams of words gener-
ated by running each document through the default filter chain. This filter chain
should work well for most languages, as all of its operations (including but not lim-
ited to tokenization and sentence boundary detection) are defined in terms of the
Unicode standard wherever possible.
4. [Link]

To consider both unigrams and bigrams, the configuration file should look like the following:
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-chain"
[[analyzers]]
method = "ngram-word"
ngram = 2
filter = "default-chain"
Each [[analyzers]] block defines a single analyzer and its corresponding fil-
ter chain: as many can be used as desired—the tokens generated by each analyzer
specified will be counted and placed in a single sparse vector of counts. This is
useful for combining multiple different kinds of features together into your doc-
ument representation. For example, the following configuration would combine
unigram words, bigram part-of-speech tags, tree skeleton features, and subtree
features.
[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-chain"
[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "path/to/crf/model"
[[analyzers]]
method = "tree"
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
features = ["skel", "subtree"]
tagger = "path/to/greedy-tagger/model"
parser = "path/to/sr-parser/model"
If an application requires specific text analysis operations, one can specify di-
rectly what the filter chain should look like by modifying the configuration file.
Instead of filter being a string parameter as above, we will change filter to look very
much like the [[analyzers]] blocks: each analyzer will have a series of [[analyzers.filter]] blocks, each of which defines a step in the filter chain. All filter
chains must start with a tokenizer. Here is an example filter chain for unigram
words like the one at the beginning of this section:
[[analyzers]]
method = "ngram-word"
ngram = 1
[[analyzers.filter]]
type = "icu-tokenizer"
[[analyzers.filter]]
type = "lowercase"
[[analyzers.filter]]
type = "length"
min = 2
max = 35
META provides many different classes to support building filter chains. Please
look at the API documentation5 for more information. In particular, the analyz-
ers::tokenizers namespace and the analyzers::filters namespace should
give a good idea of the capabilities. The static public attribute id for a given class
is the string needed for the “type” in the configuration file.
4.5 Related Toolkits
Existing toolkits supporting text management and analysis tend to fall into two
categories. The first is search engine toolkits, which are especially suitable for
building a search engine application, but tend to have limited support for text
analysis/mining functions. Examples include the following.
Lucene. [Link]
Terrier. [Link]
Indri/Lemur. [Link]
The second is text mining or general data mining and machine learning toolkits,
which tend to selectively support some text analysis functions, but generally do not
support search capability. Examples include the following.
Weka. [Link]
LIBSVM. [Link]
Stanford NLP. [Link]
Illinois NLP Curator. [Link]
Scikit Learn. [Link]
NLTK. [Link]
5. Visit [Link]
Exercises
In its simplest form, text data could be a single document in .txt format. This
exercise will get you familiar with various techniques that are used to analyze
text. We’ll use the novel A Tale of Two Cities by Charles Dickens as example text.
The book is called [Link], and is located at [Link]
textdatabook/[Link]. You can also use any of your own plaintext files that
have multiple English sentences.
Like all future exercises, we will assume that the reader followed the META setup
guide and successfully compiled the executables. In this exercise, we’ll only be
using the profile program. Running ./profile from inside the build/ directory
will print out the following usage information:
If running ./profile prints out this information, then everything has been
set up correctly. We’ll look into what each of these options mean in the following
exercises.
4.1. Stop Word Removal. Consider the following words: I, the, of, my, it, to, from.
If it was known that a document contained these words, would there be any idea
what the document was about? Probably not. These types of words are called stop
words. Specifically, they are very high frequency words that do not contain content
information. They are used because they’re grammatically required, such as when
connecting sentences.
Since these words do not contain any topical information, they are often removed
as a preprocessing step in text analysis. Not only are these (usually) useless words
ignored, but having less data can mean that algorithms run faster!
Now, use the profile program to remove stop words from the document two-
[Link]. Can you still get an idea of what the book is about without these words
present?
4.2. Stemming. Stemming is the process of reducing a word to a base form.
This is especially useful for search engines. If a user wants to find books about
running, documents containing the word run or runs would not match. If we apply
a stemming algorithm to a word, it is more likely that other forms of the word will
match it in an information retrieval task.
The most popular stemming algorithm is the Porter2 English Stemmer, devel-
oped by Martin Porter. It is a slightly improved version from the original Porter
Stemmer from 1980. Some examples are:
META uses the Porter2 stemmer by default. You can read more about the Porter2
stemmer here: [Link] An
online demo of the stemmer is also available if you’d like to play around with it:
[Link]
Now that you have an idea of what stemming is, run the stemmer on A Tale of
Two Cities.
Like stop word removal, stemming tries to keep the basic meaning behind the
original text. Can you still make sense of it after it’s stemmed?
Note that POS tagging the book may take up to one minute to complete. Does
it look like META’s POS tagger is accurate? Can you find any mistakes? When
replacing the words with their tags, is it possible to determine what the original
sentence was? Experiment with the book or any other text file.
The leaves of the parse tree are the words in the sentence, and the preterminals (the direct parents of the leaves) are part-of-speech tags.
Some common features from a parse tree are production rules such as S → NP VP, tree depth, and structural tree features. Syntactic categories (node labels)
alone can also be used.
The following command runs the parser on each sentence in the input file:
Like POS-tagging, the parsing may also take a minute or two to complete.
2-grams (bigrams):
3-grams (trigrams):
This will give the output file [Link] for the option --freq-unigram, and so on.
What makes the output reasonably clear? Think back to stop words and stem-
ming. Removing stop words gets rid of the noisy high-frequency words that don’t
give any information about the content of the document. Stemming will aggre-
gate inflected words into a single count. This means the partial vector {run : 4,
running : 2, runs : 3} would instead be represented as {run : 9}. Not only does this
make it easier for humans to interpret the frequency analysis, but it can improve
text mining algorithms, too!
4.6. Zipf’s Law. In English, the top four most frequent words are about 10–15%
of all word occurrences. The top 50 words are 35–40% of word occurrences. In fact,
there is a similar trend in any human language. Think back to the stop words. These
are the most frequent words, and make up a majority of text. At the same time, many
words may only appear once in a given document.
We can plot the rank of a word on the x axis, and the frequency count on the y
axis. Such a graph can give us an idea of the word distribution in a given document
or collection. In Figure 4.2, we counted unigram words from another Dickens book,
Oliver Twist. The plot on the left is a normal x ∼ y plot and the one on the right is
a log x ∼ log y plot.
Figure 4.2 Word frequency versus word frequency rank for the unigram words of Oliver Twist: a linear-scale plot (left) and a log-log plot (right).
Zipf’s law describes the shape of these plots. What do you think Zipf’s law states?
The shape of these plots allows us to apply certain techniques to take advantage of
the word distribution in natural language.
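To examine this distribution yourself, here is a rough sketch (our own illustration; book.txt is just a placeholder name for any plain-text file):

import re
from collections import Counter

with open("book.txt", encoding="utf-8") as f:   # placeholder file name
    words = re.findall(r"[a-z']+", f.read().lower())

freqs = sorted(Counter(words).values(), reverse=True)   # frequency by rank
top = sum(freqs[:50]) / sum(freqs)
print(f"top 50 words account for {top:.0%} of all word occurrences")

# Printing (rank, frequency) pairs; plotted on log-log axes, these points
# should fall roughly on a straight line if the text follows Zipf's law.
for rank, count in enumerate(freqs[:10], start=1):
    print(rank, count)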
PART II
In pull mode, the user initiates the access process to find the relevant text data,
typically by using a search engine. This mode of text access is essential when a user
has an ad hoc information need, i.e., a temporary information need that might
disappear once the need is satisfied. In such a case, the user can use a query to
find relevant information with a search engine. For example, a user may have a
need to buy a product and thus be interested in retrieving all the relevant opinions
about candidate products; after the user has purchased the product, the user would
generally no longer need such information. Another example is that during the
process of analyzing social media data to understand opinions about an emerging
event, the analyst may also decide to explore information about a particular entity
related to the event (e.g., a person), which can also trigger a search activity.
While querying is the most common way of accessing text data in the pull mode,
browsing is another complementary way of accessing text data in the pull mode, and
can be very useful when a user does not know how to formulate an effective query,
or finds it inconvenient to enter a keyword query (e.g., through a smartphone), or
simply wants to explore a topic with no fixed goal. Indeed, when searching the Web,
users tend to mix querying and browsing (e.g., while traversing through hyperlinks).
In general, we may regard querying and browsing as two complementary ways
of finding relevant information in the information space. Their relation can be
understood by making an analogy between information seeking and sightseeing
in a physical world. When a tourist knows the exact address of an attraction, the
tourist can simply take a taxi directly to the attraction; this is similar to when a
user knows exactly what he or she is looking for and can formulate a query with the
“right keywords,” which would bring to the user relevant pages directly. However,
if a tourist doesn’t know the exact address of an attraction, the tourist may want to
take a taxi to an approximate location and then walk around to find the attraction.
Similarly, if a user does not have a good knowledge about the target pages, he or
she can also use an approximate query to reach some related pages and then browse
into truly relevant information. Thus, when querying does not work well, browsing
can be very useful.
In the push mode, the system initiates the process to recommend a set of
relevant information items to the user. This mode of information access is generally
more useful to satisfy a long-standing information need of a user or analyst. For
example, a researcher’s research interests can be regarded as relatively stable over
time. In comparison, the information stream (i.e., published research articles)
is dynamic. In such a scenario, although a user can regularly search for relevant
literature information with queries, it is more desirable for a recommender (also
called filtering) system to monitor the dynamic information stream and “push” any
relevant articles to the user based on the matching of the articles with the user’s
interests (e.g., in the form of an email). In some long-term analytics applications,
it would also be desirable to use the push mode to monitor any relevant text data
(such as relevant social media) about a topic related to the application.
Another scenario of push mode is producer-initiated recommendation, which
can be more appropriately called selective dissemination of information (SDI). In
such a scenario, the producer of information has an interest in disseminating
the information among relevant users, and would push an information item to
such users. Advertising of product information on search result pages is such an
example. The recommendation can be delivered through email notifications or
shown on a search engine result page.
There are broadly two kinds of information needs: short-term need and long-
term need. Short-term needs are most often associated with pull mode, and long-
term needs are most often associated with push mode. A short-term information
need is temporary and usually satisfied through search or navigation in the informa-
tion space, whereas a long-term information need can be better satisfied through
filtering or recommendation where the system would take the initiative to push
the relevant information to a user. Ad hoc retrieval is extremely important because
ad hoc information needs show up far more frequently than long-term informa-
tion needs. The techniques effective for ad hoc retrieval can usually be re-used for
filtering and recommendation as well. Also, in the case of long-term information
needs, it is possible to collect user feedback, which can be exploited. In this sense,
Figure 5.2 Sample interface of browsing with a topic map where browsing and querying are
naturally integrated.
Figure 5.2 shows how a topic map can be integrated into a search engine interface to enable a user to browse the information space flexibly. With this
interface, a user can do any of the following at any moment.
Querying (long-range jump). When a user submits a new query through the
search box the search results from a search engine will be shown in the right
pane. At the same time, the relevant part of a topic map is also shown on the
left pane to facilitate browsing should the user want to.
Navigating on the map (short-range walk). The left pane in our interface is to
let a user navigate on the map. When a user clicks on a map node, this pane
will be refreshed and a local view with the clicked node as the current focus
will be displayed. In the local view, we show the parents, the children, and
the horizontal neighbors of the current node in focus (labelled as “center” in
the interface). A user can thus zoom into a child node, zoom out to a parent
node, or navigate into a horizontal neighbor node. The number attached to
a node is a score for the node that we use for ranking the nodes. Such a map
enables the user to “walk” in the information space to browse into relevant
documents without needing to reformulate queries.
Viewing a topic region. The user may double-click on a topic node on the map
to view the documents covered in the topic region. The search result pane
would be updated with new results corresponding to the documents in the
selected topic region. From a user’s perspective, the result pane always shows
the documents in the current region that the user is focused on (either search
results of the query or the documents corresponding to a current node on the
map when browsing).
Viewing a document. Within the result pane, a user can select any document
to view as in a standard search interface.
In Figure 5.3, we further show an example trace of browsing in which the user
started with a query dining table, zoomed into asian dining table, zoomed out back to
dining table, browsed horizontally first to dining chair and then to dining furniture,
and finally zoomed out to the general topic furniture where the user would have
many options to explore different kinds of furniture. If this user feels that a “long-
jump” is needed, he or she can use a new query to achieve it. Since the map can be
hidden and only brought to display when the user needs it, such an interface is a very
natural extension of the current search interface from a user’s perspective. Thus,
we can see how one text access system can combine multiple modes of information
access to suit a user’s current needs.
Figure 5.3 A sample trace of browsing showing how a user can navigate in the information space
without querying.
5.3 Text Retrieval
The most important tool for supporting text data access is a search engine, which
is why web search engines are used by many people on a daily basis. Search engines
directly provide support for querying and can be easily extended to provide recom-
mendation or browsing. Moreover, the techniques used to implement an effective
search engine are often also useful for implementation of a recommender system
as well as many text analysis functions. We thus devote a large portion of this book
to discussing search engine techniques.
In this section, we discuss the problem of text retrieval (TR), which is solved
by developing a search engine system. We specify the differences between un-
structured TR and structured database retrieval. We then make an argument for
document ranking as opposed to document selection. This provides a basis for us
to discuss in the next chapter how to rank documents for a query.
From a user’s perspective, the problem of TR is to use a query to find relevant
documents in a collection of text documents. This is a frequently needed task
because users often have temporary ad hoc information needs for various tasks,
and would like to find the relevant information immediately. The system to support
TR is a text retrieval system, or a search engine.
Although TR is sometimes used interchangeably with the more general term “in-
formation retrieval” (IR), the latter also includes retrieval of other types of informa-
tion such as images or videos. It is worth noting, though, that retrieval techniques
for other non-textual data are less mature and, as a result, retrieval of other types
of information tends to rely on using text retrieval techniques to match a keyword
query with the companion text data associated with a non-textual data element. For example, the
current image search engines on the Web are essentially TR systems where each
image is represented by a text document consisting of any associated text data with
the image (e.g., title, caption, or simply textual context of the image such as the
news article content where an image is included).
In industry, the problem of TR is generally referred to as the search problem,
and the techniques for text retrieval are often called search technology or search
engine technology.
The task of TR can be easy or hard, depending on specific queries and specific
collections. For example, during a web search, finding homepages is generally easy,
but finding out people’s opinions about some topic (e.g., U.S. foreign policy) would
be much harder. There are several reasons why TR is difficult:
. a query is usually quite short and incomplete (no formal language like SQL);
. the information need may be difficult to describe precisely, especially when
the user isn’t familiar with the topic, and
. precise understanding of the document content is difficult. In general, since
what counts as the correct answer is subjective, even when human experts
judge the relevance of documents, they may disagree with each other.
Due to the lack of clear semantic structures and difficulty in natural language
understanding, it is often challenging to accurately retrieve relevant information to
a user’s query. Indeed, even though the current web search engines may appear to
be sufficient sometimes, it may still be difficult for a user to quickly locate and har-
vest all the relevant information for a task. In general, the current search engines
work very well for navigational queries and simple, popular informational queries,
but in the case where a user has a complex information need such as analyzing
opinions about products to buy, or researching medical information about some
symptoms, they often work poorly. Moreover, the current search engines generally
provide little or no support to help users digest and exploit the retrieved informa-
tion. As a result, even if a search engine can retrieve the most relevant information,
a user would still have to sift through a long list of documents and read them in
detail to fully digest the knowledge buried in text data in order to perform their task
at hand. The techniques discussed later in this book can be exploited to help users
digest the found information quickly or directly analyze a large amount of text data
to reveal useful and actionable knowledge that can be used to optimize decision
making or help a user finish a task.
1. Although common parlance refers to text as unstructured with a meaningful contrast with
relational database structuring, it employs a narrow sense of “structure.” For example, from a
linguistics perspective, grammar provides well-defined structure. To study this matter further,
see the 5S (societies, scenarios, spaces, structures, and streams) works by Fox et al. [2012].
5.4 Text Retrieval vs. Database Retrieval
When looking for literature relevant to a research problem, for example, the user is unlikely
to be able to clearly and completely specify which documents should be returned.
Finally, the expected results in the two applications are also different. In data-
base search, we can retrieve very specific data elements (e.g., specific columns); in
TR, we are generally only able to retrieve a set of relevant documents. With passages
or fields identified in a text document, a search engine can also retrieve passages,
but it is generally difficult to retrieve specific entities or attribute values as we can in
a database. This difference is not as essential as the difference in the vague specifi-
cation of what exactly is the “correct” answer to a query, but is a direct consequence
of the vague information need in TR.
Due to these differences, the challenges in building a useful database and a
useful search engine are also somewhat different. In databases, since what items
should be returned is clearly specified, there is no challenge in determining which
data elements satisfy the user’s query and thus should be returned; a major re-
maining challenge is how to find the answers as quickly as possible especially
when there are many queries being issued at the same time. While the efficiency
challenge also exists in a search engine, a more important challenge there is to
first figure out which documents should be returned for a query before worrying
about how to return the answers quickly. In database applications—at least tradi-
tional database applications—it is also very important to maintain the integrity
of the data; that is, to ensure no inconsistency occurs due to power failure. In
TR, modeling a user’s information need and search tasks is important, again due
to the difficulty for a user to clearly specify information needs and the difficulty
in NLP.
Since what counts as the best answer to a query depends on the user, in TR,
the user is actually part of our input (together with the query, and document
set). Thus, there is no mathematical way to prove that one answer is better than
another or prove one method is better than another. Instead, we always have to
rely on empirical evaluation using some test collections and users. In contrast, in
database research, since the main issue is efficiency, one can prove that one algorithm
is better than another by analyzing its computational complexity or by running a
simulation study. Note, however, that when running a simulation study (to determine
which algorithm is faster), we also face the same problem as in text retrieval—the
simulation may not accurately reflect the real applications. Thus, an algorithm
shown to be faster with simulation may not be actually faster for a particular
application. Similarly, a retrieval algorithm shown to be more effective with a test
collection may turn out to be less effective for a particular application or even
another test collection. How to reliably evaluate retrieval algorithms is itself a
challenging research topic.
82 Chapter 5 Overview of Text Data Access
Because of these differences, the two fields have traditionally been studied in differ-
ent communities with a different application basis. Databases have had widespread
applications in virtually every domain with a well-established strong industry. The
IR community that studies text retrieval has been an interdisciplinary community
involving library and information science and computer science, but had not had
a strong industry base until the Web was born in the early 1990s. Since then, the
search engine industry has dominated, and as more and more online information
is available, the search engine technologies (which include TR and other technical
components such as machine learning and natural language processing) will con-
tinue to grow. Soon, search technologies may be as widely used as database
technologies. Furthermore, because of the inherent similarity between database
search and TR, because both efficiency and effectiveness (accuracy) are important,
and because most online data has text fields as well as some kind of structure, the
two fields are now moving closer and closer to each other, leading to some common
fundamental questions such as: “What should be the right query language?”; “How
can we rank items accurately?”; “How do we find answers quickly?”; and “How do
we support interactive search?”
Perhaps the most important conclusion from this comparison is that the prob-
lem of text retrieval is an empirically defined problem. This means that which method
works better cannot be answered by pure analytical reasoning or mathematical
proofs. Instead, it has to be evaluated empirically with users, which makes evaluating
the effectiveness of a search engine a significant challenge. This is also the reason
why a significant amount of effort has been spent in research of TR evaluation since
it was initially studied in the 1960s. The evaluation methodology of TR remains an
important open research topic today; we discuss it in detail in Chapter 9.
A query is often typed in by a user of a search engine system, and users generally do not
want to make much effort to type in many words. However, this is not always the
case. For example, in a Twitter search, each document is a tweet which is very short,
and a user may also cut and paste a text segment from an existing document as a
query, which can be very long. Our text collection C = {d1 , . . . , dM } is a set of text
documents. In general, we may assume that there exists a subset of documents in
the collection, i.e., R(q) ⊂ C, which are relevant to the user’s query q; that is, they
are relevant documents or documents useful to the user who typed in the query.
Naturally, this relevant set depends on the query q. However, which documents are
relevant is generally unknown; the user’s query is only a “hint” at which documents
should be in the set R(q). Furthermore, different users may use the same query
to intend to retrieve somewhat different sets of relevant documents (e.g., in an
extreme case, a query word may be ambiguous). This means that it is unrealistic
to expect a computer to return exactly the set R(q), unlike the case in database
search where this is feasible. Thus, the best a computer can do is to return an
approximation of R(q), which we will denote by R′(q).
Now, how can a computer compute R′(q)? At a high level, there are two alternative
strategies: document selection vs. document ranking. In document selection,
we will implement a binary classifier to classify a document as either relevant or
non-relevant with respect to a particular query. That is, we will design a binary
classification function, or an indicator function, f(q, d) ∈ {0, 1}. If f(q, d) = 1, d would
be assumed to be relevant, whereas if f(q, d) = 0, it would be non-relevant. Thus,
R′(q) = {d | f(q, d) = 1, d ∈ C}. Using such a strategy, the system must estimate the
absolute relevance, i.e., whether a document is relevant or not.
An alternative strategy is to rank documents and let the user decide a cutoff. That
is, we will implement a ranking function f(q, d) ∈ ℝ and rank all the documents in
descending order of this ranking function. A user would then browse the ranked
list and stop whenever they consider it appropriate. In this case, the set R′(q) is
actually defined partly by the system and partly by the user, since the user would
implicitly choose a score threshold θ based on the rank position where he or she
stopped. In this case, R′(q) = {d | f(q, d) ≥ θ}. Using this strategy, the system only
needs to estimate the relative relevance of documents: which documents are more
likely relevant.
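To make the two strategies concrete, here is a minimal sketch in Python (illustrative only; the scoring function score(q, d) and the document collection are assumed to be given and are not defined in this chapter):

    def select_documents(query, collection, score, threshold):
        # Document selection: an absolute, binary decision per document,
        # i.e., approximate R'(q) = {d | f(q, d) = 1} by thresholding a score.
        return [d for d in collection if score(query, d) >= threshold]

    def rank_documents(query, collection, score):
        # Document ranking: sort all documents by descending score and let the
        # user decide where to stop reading (implicitly choosing theta).
        return sorted(collection, key=lambda d: score(query, d), reverse=True)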
Since estimation of relative relevance is intuitively easier than that of absolute
relevance, we can expect it to be easier to implement the ranking strategy. Indeed,
ranking is generally preferred to document selection for multiple reasons.
First, due to the difficulty for a user to prescribe the exact criteria for selecting
relevant documents, the binary classifier is unlikely to be accurate.
So the problem is the following. We have a query that has a sequence of words,
and a document that’s also a sequence of words, and we hope to define the func-
tion f(q, d) that can compute a score based on the query and the document. The main
challenge is designing a good ranking function that can rank all the relevant docu-
ments on top of all the non-relevant ones. Now clearly this means our function must
be able to measure the likelihood that a document d is relevant to a query q. That
also means we have to have some way to define relevance. In particular, in order to
implement the program to do that, we have to have a computational definition of
relevance, and we achieve this goal by designing a retrieval model which gives us a
formalization of relevance. We introduce retrieval models in Chapter 6.
A classic reference is the information retrieval book by van Rijsbergen [1979]. Hearst [2009] has a systematic
discussion of user interfaces of a search system, which is relevant to the design of
interfaces for any information system in general; in particular, many visualization
techniques that can facilitate browsing and querying are discussed in the book. Ex-
ploratory search is a particular type of search tasks that often requires multimodal
information access including both querying and browsing. It was covered in a spe-
cial issue of Communications of ACM [White et al. 2006], and White and Roth [2009].
The probability ranking principle [Robertson 1997] is generally regarded as the the-
oretical foundation for framing the retrieval problem as a ranking problem. More
historical work related to this, as well as a set of important research papers in IR up
to 1997, can be found in Readings in Information Retrieval [Sparck Jones and Willett
1997]. A brief survey of IR history can be found in Sanderson and Croft [2012].
Exercises
5.1. When might browsing be preferable to querying?
5.2. Given search engine user logs, how could you distinguish between browsing
behavior and querying behavior?
5.3. Often, push and pull modes are combined in a single system. Give an example
of such an application.
5.4. Imagine you have search engine session logs from users that you know are
browsing for information. How can you use these logs to enhance search results of
future users with ad hoc information needs?
5.5. In Chapter 11, we will discuss recommender systems. These are systems in
push mode that deliver information to users. What are some specific applications
of recommender systems? Can you name some services available to you that fit into
this access mode?
5.6. How could a recommender system (push mode) be coupled with a search
engine (pull mode)? Can these two services mutually enhance one another?
5.7. Design a text information system used to explore musical artists. For example,
you can search for an artist’s name directly. The results are displayed as a graph,
with edges to similar artists (as measured by some similarity algorithm). Use TIS
access mode vocabulary to describe this system and any enhancements you could
make to satisfy different information needs.
5.8. In the same way as the previous question, categorize “Google knowledge
graph” ([Link]).
5.9. In the same way as the previous two questions, categorize “citation alerts.”
These are alerts that are based on previous search history in an academic search
engine. When new papers are found that are potentially interesting to the user
based on their browsing history, an alert is created.
5.10. One assumption of the probability ranking principle is that each document’s
usefulness to the user is independent of the usefulness of other documents in the
index. What is a scenario where this assumption does not hold?
6 Retrieval Models
In this chapter, we introduce the two main information retrieval models: vector
space and query likelihood, which are among the most effective and practically
useful retrieval models. We begin with a brief overview of retrieval models in general
and then discuss the two basic models, i.e., the vector space model and the query
likelihood model.
6.1 Overview
Over many decades, researchers have designed various different kinds of retrieval
models which fall into different categories (see Zhai [2008] for a detailed review).
First, one family of the models are based on the similarity idea. Basically, we assume
that if a document is more similar to the query than another document is, we would
say the first document is more relevant than the second one. So in this case, the
ranking function is defined as the similarity between the query and the document.
One well-known example of this case is the vector space model [Salton et al. 1975],
which we will cover more in detail later in this chapter.
The second set of models are called probabilistic retrieval models [Lafferty and
Zhai 2003]. In this family of models, we follow a very different strategy. We assume
that queries and documents are all observations from random variables, and we
assume there is a binary random variable called R (with a value of either 1 or 0) to
indicate whether a document is relevant to a query. We then define the score of a
document with respect to a query as the probability that this random variable R is
equal to 1 given a particular document and query. There are different cases of such a
general idea. One is the classic probabilistic model, which dates back to work done
in the 1960s and 1970s [Maron and Kuhns 1960, Robertson and Sparck Jones 1976],
another is the language modeling approach [Ponte and Croft 1998], and yet another
is the divergence-from-randomness model [Amati and Van Rijsbergen 2002]. We
will cover a particular language modeling approach called query likelihood retrieval
model in detail later in this chapter. One of the most effective retrieval models
derived from the classic probabilistic retrieval framework is BM25 [Robertson and
Zaragoza 2009], but since the retrieval function of BM25 is so similar to a vector
space retrieval model, we have chosen to cover it as a variant of the vector space
model.
The third kind of model is probabilistic inference [Turtle and Croft 1990]. Here
the idea is to associate uncertainty to inference rules. We can then quantify the
probability that we can show that the query follows from the document. This family
of models is theoretically appealing, but in practice it is often reduced to
models essentially similar to the vector space model or a regular probabilistic retrieval
model.
Finally, there is also a family of models that use axiomatic thinking [Fang et al.
2011]. The idea is to define a set of constraints that we hope a good retrieval function
satisfies. In this case the problem is to find a good ranking function that can satisfy
all the desired constraints. Interestingly, although all these models are based on
different thinking, in the end the retrieval functions tend to be very similar and
involve similar variables. The axiomatic retrieval framework has proven effective
for diagnosing deficiencies of a retrieval model and developing improved retrieval
models accordingly (e.g., BM25+ [Lv and Zhai 2011]).
Although many models have been proposed, very few have survived extensive ex-
perimentation and proven to be effective and robust. In this book, we have chosen to
cover four specific models (i.e., BM25, pivoted length normalization, query likeli-
hood with JM smoothing, and query likelihood with Dirichlet prior smoothing) that
are among the very few most effective and robust models.1
1. PL2 is another very effective model that the readers should also know of [Amati and Van Rijs-
bergen 2002].
6.2 Common Form of a Retrieval Function
Figure 6.1 Illustration of common ideas for scoring with a bag-of-words representation.
We can see there are three different components, each corresponding to how
well the document matches each of the query words. Inside of these functions, we
see a number of heuristics. For example, one factor that affects the function g is how
many times the word presidential occurs in each document. This is called a term
frequency (TF). We might also denote this as c(presidential, d). In general, if the
word occurs more frequently in the document, the value of this function would be
larger. Another factor is the document length. In general, if a term occurs in a long
document many times, it is not as significant as if it occurred the same number of
times in a short document (since any term is expected to occur more frequently
in a long document). Finally, there is a factor called document frequency. This
counts how many documents in the entire collection contain presidential at least
once. We call this the document frequency, or DF, of presidential. DF attempts
to characterize the popularity of the term in the collection. In general, matching a
rare term in the collection contributes more to the overall score than matching a
common term. TF, DF, and document length capture some of the main ideas used
in pretty much all state-of-the-art retrieval models. In some other models we might
also use a probability to characterize this information.
A natural question is: Which model works the best? It turns out that many
models work equally well, so here we list the four major models that are generally
regarded as state-of-the-art, i.e., the four models introduced in the overview: BM25,
pivoted length normalization, query likelihood with JM smoothing, and query
likelihood with Dirichlet prior smoothing.
Figure 6.2 Illustration of documents plotted in vector space. (Courtesy of Marti Hearst)
In this vector space, d2 is closest to the query q, and therefore d2 will be ranked above the others. This is the main idea of the vector
space model.
To be more precise, the VS model is a framework. In this framework, we make
some assumptions. One assumption is that we represent each document and query
by a term vector. Here, a term can be any basic concept such as a word or a
phrase, or even n-grams of characters or any other feature representation. Each
term is assumed to define one dimension. Therefore, since we have |V | terms in
our vocabulary, we define a |V |-dimensional space. A query vector would consist
of a number of elements corresponding to the weights of different terms. Each
document vector is similar; it has a number of elements, and the value of each
element indicates the weight of the corresponding term. The relevance in
this case is measured based on the similarity between the two vectors. Therefore,
our retrieval function is also defined as the similarity between the query vector and
document vector.
Now, if you were asked to write a program to implement this approach for a
search engine, you would realize that this explanation is far from complete. We
haven't specified many details, so it's impossible to actually write the
program to implement this. That's why this is called the vector space retrieval
framework. It has to be refined in order to actually suggest a particular function
that can be implemented on a computer. First, it did not say how to define or select
the basic concepts (terms). We clearly assume the concepts are orthogonal, otherwise
there will be redundancy. For example, if two synonyms are somehow distinguished
as two different concepts, they would be defined in two different dimensions,
causing a redundancy or overemphasis of matching this concept (since it would be
as if you matched two dimensions when you actually matched only one semantic
concept). Second, it did not say how to place documents and queries in this vector
space. We saw some examples of query and document vectors, but where exactly
should the vector for a particular document point to? This is equivalent to how to
define the term weights. This is a very important question because the term weight
in the query vector indicates the importance of a term; depending on how you assign
the weight, you might prefer some terms to be matched over others. Similarly,
term weight in the document is also very meaningful—it indicates how well the
term characterizes the document. If many nonrelevant documents are returned by
a search engine using this model, then the chosen terms and weights must not
represent the documents accurately. Finally, how to define the similarity measure is
also unclear. These questions must be addressed before we can have an operational
function that we can actually implement using a programming language. Solving
these problems is the main topic of the next section.
This is only one of the many different ways of computing the similarity. So, we’ve
defined the dimensions, the vector space, and the similarity function; we finally
have the simplest instantiation of the vector space model! It’s based on the bit
vector representation, dot product similarity, and bag of words instantiation. Now
we can finally implement this ranking function using a programming language and
then rank documents in our corpus given a particular query.
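As a minimal sketch (not the book's code; it assumes simple whitespace tokenization and lowercasing), the bit-vector model with dot-product similarity reduces to counting the distinct query words that appear in a document:

    def bit_vector_score(query, document):
        # Dot product of 0/1 vectors = number of distinct query words present in the document.
        query_words = set(query.lower().split())
        doc_words = set(document.lower().split())
        return len(query_words & doc_words)

    # Rank a toy corpus for the query used in the running example.
    docs = ["news about organic food campaign",
            "news of presidential campaign"]
    query = "news about presidential campaign"
    ranked = sorted(docs, key=lambda d: bit_vector_score(query, d), reverse=True)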
We’ve gone through the process of modeling the retrieval problem using a vector
space model. Then, we made assumptions about how we place vectors in the vector
space and how we define the similarity. In the end, we’ve got a specific retrieval
function shown in Figure 6.3. The next step is to think about whether this individual
function actually makes sense. Can we expect this function will actually perform
well? It’s worth thinking about the value that we are calculating; in the end, we’ve
got a number, but what does this number mean? Please take a few minutes to think
about that before proceeding to the next section.
Figure 6.5 Computation of the bit vector retrieval model on a sample query and corpus.
Intuitively, d3 and d4 should be ranked above the others since these two really cover the query well. They match news, presidential,
and campaign, so they should be ranked on top. The other three, d1, d2, and d5, are
non-relevant.
Let’s see if our vector space model could do the same or could do something
close to our ideal ranking. First, think about how we actually use this model to
score documents. In Figure 6.5, we show two documents, d1 and d3, and we have
the query here also. In the vector space model, we want to first compute the vectors
for these documents and the query. The query has four words, so for these four
words, there would be a one and for the rest there will be zeros. Document d1 has
two ones, news and about, while the rest of the dimensions are zeros. Now that
we have the two vectors, we can compute the similarity with the dot product by
multiplying the corresponding elements of the two vectors. Each pair of corresponding
elements contributes a product to the sum. We actually don't
have to care about the zeros in each vector, since any product involving a zero will be zero.
So, when we take a sum over all these pairs, we’re just counting how many pairs of
ones there are. In this case, we have seen two, so the result will be two. That means
this number is the value of this scoring function; it’s simply the count of how many
unique query terms are matched in the document. This is how we interpret the
score. Now we can also take a look at d3. In this case, you can see the result is
three because d3 matched the three distinct query words news, presidential, and
campaign, whereas d1 only matched two. Based on this, d3 is ranked above d1. That
looks pretty good. However, if we examine this model in detail, we will find some
problems.
Figure 6.6 Ranking of example documents using the simple vector space model.
In Figure 6.6, we show all the scores for these five documents. The bit vector
scoring function counts the number of unique query terms matched in each docu-
ment. If a document matches more unique query terms, then the document will be
assumed to be more relevant; that seems to make sense. The only problem is that
there are three documents, d2, d3, and d4, that are tied with a score of three. Upon
closer inspection, it seems that d4 should be right above d3 since d3 only mentioned
presidential once while d4 mentioned it many more times. Another problem is that
d2 and d3 also have the same score since for d2, news, about, and campaign were
matched. In d3, it matched news, presidential, and campaign. Intuitively, d3 is more
relevant and should be scored higher than d2. Matching presidential is more impor-
tant than matching about even though about and presidential are both in the query.
But this model doesn’t do that, and that means we have to solve these problems.
To summarize, we talked about how to instantiate a vector space model. We need
to do three things: define the dimensions (the terms or concepts), decide how to place
documents and queries as vectors in this space (i.e., assign the term weights), and
choose a similarity measure between the two vectors.
Based on this idea, we discussed a very simple way to instantiate the vector
space model. Indeed, it’s probably the simplest vector space model that we can
derive. We used each word to define a dimension, with a zero-one bit vector to
represent a document or a query. In this case, we only care about word presence or
absence, ignoring the frequency. For a similarity measure, we used the dot product
and showed that this scoring function scores a document based on the number of
distinct query words matched in it. We also showed that such a simple vector space
model still doesn’t work well, and we need to improve it. This is the topic for the
next section.
It’s worth thinking at this point about why we have these issues. If we look back at
the assumptions we made while instantiating the VS model, we will realize that the
problem is really coming from some of those assumptions. In particular, it has to
do with how we place the vectors in the vector space. Naturally, in order to fix these
problems, we have to revisit those assumptions. A natural thought is to consider
multiple occurrences of a term in a document as opposed to binary representation;
we should consider the TF instead of just the absence or presence. In order to
consider the difference between a document where a query term occurred multiple
times and one where the query term occurred just once, we have to consider the
term frequency—the count of a term in a document. The simplest way to express
the TF of a word w in a document d is simply its raw count in d, which we denote by c(w, d).
With the bit vector, we only captured the presence or absence of a term, ignoring
the actual number of times that a term occurred. Let’s add the count information
back: we will represent a document by a vector with c(w, d) as each dimension's weight.
That is, the elements of both the query vector and the document vector will not be
zeroes and ones, but instead they will be the counts of a word in the query or the
document, as illustrated in Figure 6.7.
Figure 6.8 Frequency vector representation rewards multiple occurrences of a query term.
Now, let’s see what the formula would look like if we change this representation.
The formula looks identical since we are still using the dot product similarity. The
difference is inside of the sum since xi and yi are now different—they’re now the
counts of words in the query and the document. Because of the change in document
representation, the new score has a different interpretation. We can see whether
this would fix the problems of the bit vector VS model.
Look at the three documents again in Figure 6.8. The query vector is the same
because all these words occurred exactly once in the query. The same goes for d2
and d3 since none of these words has been repeated. As a result, the score is also
the same for both these documents. But, d4 would be different; here, presidential
occurred twice. Thus, the corresponding dimension would be weighted as two
instead of one, and the score for d4 is higher. This means, by using TF, we can
now rank d4 above d2 and d3 as we had hoped to.
Unfortunately, d2 and d3 still have identical scores. We would like to give more
credit for matching presidential than matching about. How can we solve this prob-
lem in a general way? Is there any way to determine which words should be treated
as more important and which can be essentially ignored? About doesn't carry
that much content, so we should be able to ignore it. We call such a word a stop
word. They are generally very frequent and they occur everywhere such that match-
ing it doesn’t have any significance. Can we come up with any statistical approaches
to somehow distinguish a content word like presidential from a stop word like
about? One difference is that a word like about occurs everywhere. If you count
the occurrences of the word in the whole collection of M documents,
then we would see that about has a much higher count than presidential. This idea
suggests that we could somehow use the global statistics of terms or some other
information to try to decrease the weight of the about dimension in the vector rep-
resentation of d2. At the same time, we hope to somehow increase the weight of
presidential in the d3 vector. If we can do that, then we can expect that d2 will get an
overall score of less than three, while d3 will get a score of about three. That way,
we’ll be able to rank d3 on top of d2.
This particular idea is called the inverse document frequency (IDF). It is a very
important signal used in modern retrieval functions. The document frequency
is the count of documents that contain a particular term. Here, we say inverse
document frequency because we actually want to reward a word that doesn’t occur
in many documents. The way to incorporate this into our vector is to modify the
frequency count by multiplying it by the IDF of the corresponding word, as shown
in Figure 6.9.
We can now penalize common words which generally have a low IDF and reward
informative words that have a higher IDF. IDF can be defined as
IDF(w) = log[(M + 1)/df(w)],   (6.2)
where M is the total number of documents in the collection and df(.) counts the
document frequency (the total number of documents containing w).
Let’s compare the terms campaign and about. Intuitively, about should have a
lower IDF score than campaign since about is a less informative word. For clarity,
let’s assume M = 10, 000, df(about) = 5000, df(campaign) = 1166, and we use a
base two logarithm. Then,
IDF(about) = log[10,001/df(about)] = log[10,001/5000] ≈ 1.0
Figure 6.9 Representation of a document vector and a query vector with TF-IDF weighting: q = (x1, . . . , xN), d = (y1, . . . , yN), where yi = c(wi, d) · IDF(wi).
Figure 6.10 Illustration of the IDF function IDF(w) = log[(M + 1)/k] as the document frequency k varies from 1 to M; the maximum value is log(M + 1).
and

IDF(campaign) = log[10,001/df(campaign)] = log[10,001/1166] ≈ 3.1.
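These numbers are easy to verify (a small sketch using the hypothetical counts above and a base-2 logarithm):

    import math

    def idf(doc_freq, num_docs):
        # IDF(w) = log[(M + 1) / df(w)], computed here with a base-2 logarithm.
        return math.log2((num_docs + 1) / doc_freq)

    M = 10000
    print(idf(5000, M))   # IDF(about)    ~ 1.0
    print(idf(1166, M))   # IDF(campaign) ~ 3.1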
Let k represent df(w); if you plot the IDF function by varying k, then you will see
a curve like the one illustrated in Figure 6.10. In general, you can see it would give
a higher value for a low df , indicating a rare word. You can also see the maximum
value of this function is log(M + 1). The specific function is not as important as the
heuristic it captures: penalizing popular terms. Whether there is a better form of
the IDF function is an open research question. With the evaluation skills you will
learn in Chapter 9, you can test your different instantiations.
If we use a linear function like the diagonal line (as shown in the figure), it may
not be as reasonable as the IDF function we just defined. In the standard IDF,
we have a dropping off point where we say “these terms are essentially not very
useful.” This makes sense when the term occurs so frequently that it’s unlikely
to differentiate two documents’ relevance (since the term is so common). But, if
you look at the linear representation, there is no dropping off point. Intuitively,
we want to focus more on the discrimination of low df words rather than these
common words. Of course, which one works better still has to be validated by
running experiments on a data set.
Let’s look at the two documents again in Figure 6.11. Without IDF weighting,
we just had bit vectors. With IDF weighting, we now can adjust the TF (term
frequency) weight by multiplying it by the IDF weight. With this scheme, the weight
of about is adjusted using its IDF value, which is smaller than the IDF
value of presidential. Thus, the IDF will distinguish these two words based on how
informative they are. Including the IDF weighting causes d3 to be ranked above
d2, since it matched a rare (informative) word, whereas d2 matched a common
(uninformative) word. This shows that the idea of weighting can solve our second
problem. How effective is this model in general when we use this TF-IDF weighting?
Well, let’s take a look at all the documents that we have seen before.
In Figure 6.12, we show all the five documents that we have seen before and their
new scores using TF-IDF weighting. We see the scores for the first four documents
seem to be quite reasonable. But again, we also see a new problem since d5 did not
even have a very high score with our simplest vector space model, but now d5 has
the highest score. This is actually a common phenomenon when designing retrieval
functions; when you try to fix one problem, you tend to introduce other problems!
That’s why it’s very tricky to design an effective ranking function, and why finding
a “best” ranking function is an open research question. In the next few sections,
we’ll continue to discuss some additional ideas to further improve this model and
try to fix this problem.
6.3.4 TF Transformation
In the previous section, we derived a TF-IDF weighting formula using the vector
space model and showed that this model actually works pretty well for the examples
shown in the figures—except for d5, which has a very high score. This document is
intuitively non-relevant, so its position is not desirable. Now, we’re going to discuss
how to use a TF transformation to solve this problem. Before we discuss the details,
let’s take a look at the formula for the TF-IDF weighting and ranking function we
previously derived. It is shown in Figure 6.13.
If you look at the formula carefully, you will see it involves a sum over all the
matched query terms:

f(q, d) = Σ_{i=1}^{N} x_i y_i = Σ_{w ∈ q∩d} c(w, q) · c(w, d) · log[(M + 1)/df(w)].

Inside the sum, each matched query term has a particular weight; this weight is the
TF-IDF weight. It has an IDF component with two variables: the total number of
documents in the collection, M, and the document frequency, df(w), which is the
number of documents that contain w. The other variables involved in the formula
are the count of the query term w in the query and the count of w in the document,
represented as c(w, q) and c(w, d), respectively.
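A sketch of this TF-IDF ranking function (illustrative only, not the book's implementation; it assumes a pre-tokenized query and document, a dictionary doc_freq of document frequencies, and a base-2 logarithm):

    import math
    from collections import Counter

    def tfidf_score(query_tokens, doc_tokens, doc_freq, num_docs):
        # f(q, d) = sum over w in both q and d of c(w, q) * c(w, d) * log[(M + 1)/df(w)]
        q_counts, d_counts = Counter(query_tokens), Counter(doc_tokens)
        score = 0.0
        for w in q_counts.keys() & d_counts.keys():
            idf = math.log2((num_docs + 1) / doc_freq[w])
            score += q_counts[w] * d_counts[w] * idf
        return score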
Looking at d5 again, it’s not hard to realize that the reason why it has received
a high score is because it has a very high count of the term campaign. Its count in
d5 is four, which is much higher than the other documents, and has contributed to
the high score of this document. Intuitively, in order to lower the score for this
document, we need to somehow restrict the contribution of matching this term
in the document. Essentially, we shouldn’t reward multiple occurrences so gener-
ously. The first occurrence of a term says a lot about matching of this term because
it goes from a zero count to a count of one, and that increase is very informative.
Once we see a word in the document, it’s very likely that the document is talking
about this word. If we see an extra occurrence on top of the first occurrence, that
is, going from one to two, then we can say the second occurrence confirms
that it's not an accidental mention of the word. But imagine we have seen, let's
say, 50 occurrences of the word in the document. Then, adding one extra occur-
rence is not going to bring new evidence about the term because we are already
sure that this document is about this word. Thus, we should restrict the contribu-
tion of a high-count term. That is exactly the idea of TF transformation, illustrated
in Figure 6.14.
This transformation function is going to turn the raw count of a word into a
TF weight for the word in the document. On the x-axis is the raw count, and on
the y-axis is the TF weight. In the previous ranking functions, we actually have
implicitly used some kind of transformation. For example, in the zero-one bit vector
representation, we actually used the binary transformation function as shown here.
If the count is zero then it has zero weight. Otherwise it would have a weight of
one. Then, we considered term count as a TF weight, which is a linear function.
We just saw that this is not desirable. With a logarithm, we can have a sublinear
transformation such as y = log(1 + x), which looks like the red lines in the figure.
This damps the influence of a very high count while retaining the influence of a
small count. We might even want to bend the curve more by
applying a logarithm twice. Researchers have tried all these methods and they are
indeed working better than the linear transformation, but so far what works the best
seems to be this special transformation called BM25 TF, illustrated in Figure 6.15,
where BM stands for best matching.
In this transformation, there is a parameter k which controls the upper bound of
this function. It’s easy to see this function has a upper bound, because if you look
x
at the x+k as being multiplied by (k + 1), the fraction will never exceed one, since
the numerator is always less than the denominator. Thus, it’s upper-bounded by
(k + 1). This is also the difference between the BM25 TF function and the logarithm
transformation, which doesn’t have an upper bound. Furthermore, one interesting
property of this function is that as we vary k, we can actually simulate different
transformation functions including the two extremes that are shown in the figure.
When k = 0, we have the zero-one bit transformation. If we set k to a very large number,
on the other hand, it’s going to look more like the linear transformation function.
In this sense, this transformation is very flexible since it allows us to control the
shape of the TF curve quite easily. It also has the nice property of a simple upper
bound. This upper bound is useful to control the influence of a particular term. For
example, we can prevent a spammer from just increasing the count of one term to
spam all queries that might match this term. In other words, this upper bound
Figure 6.15 The BM25 TF transformation y = (k + 1)x/(x + k) as a function of the raw count x = c(w, d); the curve is upper-bounded by k + 1 and reduces to the 0/1 bit transformation when k = 0.
ensures that all terms will be counted when we aggregate the weights to compute
a score.
To summarize, we need to capture some sublinearity in the TF function. This
ensures that we represent the intuition of diminishing return from high term
counts. It also avoids a dominance by one single term over all others. The BM25
TF formula we discussed has an upper bound while being robust and effective.
If we plug this function into our TF-IDF vector space model, then we would end
up having a ranking function with a BM25 TF component. This is very close to a
state-of-the-art ranking function called BM25. We’ll see the entire BM25 formula
soon.
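A few numbers make the effect of k concrete (a quick sketch, not from the book):

    def bm25_tf(count, k=1.2):
        # y = (k + 1) * x / (x + k): sublinear in the raw count x and upper-bounded by k + 1.
        return (k + 1) * count / (count + k)

    for x in [0, 1, 2, 5, 50]:
        print(x, round(bm25_tf(x), 3))   # rises quickly, then flattens toward k + 1 = 2.2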
nothing to do with the mention of presidential at the end. In general, if you think
about long documents, they would have a higher chance to match any query since
they contain more words. In fact, if you generate a long document by randomly
sampling words from the distribution of all words, then eventually you probably
will match any query! In this sense, we should penalize long documents because
they naturally have a better chance to match any query. This is our idea of document
length normalization. On the one hand, we want to penalize a long document, but
on the other hand, we also don't want to over-penalize it. The reason is that a
document may be long for different reasons: in one case, the document may
be longer because it uses more words. For example, think about a research paper;
it would use more words than the corresponding abstract. This is the case
where we probably should penalize the matching of a long document such as a full
paper. When we compare matching words in such a long document with matching
words in the short abstract, the long paper generally has a higher chance of
matching query words. Therefore, we should penalize the long documents.
However, there is another case when the document is long—that is when the
document simply has more content. Consider a case of a long document, where we
simply concatenated abstracts of different papers. In such a case, we don’t want to
penalize this long document. That’s why we need to be careful about using the right
degree of length penalization, and an understanding of the discourse structure of
documents is needed for optimal document length normalization.
A method that has worked well is called pivoted length normalization, illustrated
in Figure 6.17 and described originally in Singhal et al. [1996].
Figure 6.17 Illustration of the pivoted length normalizer as a function of the document length |d|: with b > 0, documents longer than the average document length (avdl) are penalized and shorter documents are rewarded; b = 0 means no normalization.
Here, the idea is to
use the average document length as a pivot, or reference point. That means we will
assume that for the average length documents, the score is about right (a normalizer
would be one). If a document is longer than the average document length, then
there will be some penalization. If it’s shorter than the average document length,
there’s even some reward. The x-axis represents the length of a document. On the
y-axis we show the normalizer, i.e., the pivoted length normalization. The formula
for the normalizer is an interpolation of one and the normalized document lengths,
controlled by a parameter b. When we first divide the length of the document by
the average document length, this not only gives us some sense of how this
document compares with the average document length, but also gives us the
benefit of not worrying about the unit of length. This normalizer has an interesting
property; first, we see that if we set the parameter b to zero, then the normalizer
value would be one, indicating no length normalization at all. If we set b to a
nonzero value, then the value would be higher for documents that are longer
than the average document length, whereas the value of the normalizer will be
smaller for shorter documents. In this sense we see there’s a penalization for
long documents and a reward for short documents. The degree of penalization is
controlled by b. By adjusting b (which varies from zero to one), we can control the
degree of length normalization. If we plug this length normalization factor into the
Pivoted length normalization:
f(q, d) = Σ_{w ∈ q∩d} c(w, q) · [ln(1 + ln(1 + c(w, d))) / (1 − b + b · |d|/avdl)] · log[(M + 1)/df(w)]

Okapi BM25:
f(q, d) = Σ_{w ∈ q∩d} c(w, q) · [(k + 1) · c(w, d) / (c(w, d) + k(1 − b + b · |d|/avdl))] · log[(M + 1)/df(w)],
where b ∈ [0, 1] and k ∈ [0, +∞).

Figure 6.18 State-of-the-art vector space models: pivoted length normalization and Okapi BM25.
vector space model ranking functions that we have already examined, we will end
up with state-of-the-art retrieval models, some of which are shown in Figure 6.18.
Let’s take a look at each of them. The first one is called pivoted length normaliza-
tion. We see that it’s basically the TF-IDF weighting model that we have discussed.
The IDF component appears in the last term. There is also a query TF component,
and in the middle there is normalized TF. For this, we have the double logarithm
as we discussed before; this is to achieve a sublinear transformation. We also put a
document length normalizer in the denominator of the TF formula, which causes
a penalty for long documents, since the larger the denominator is, the smaller the
TF weight is. The document length normalization is controlled by the parameter b.
The next formula is called Okapi BM25, or just BM25. It’s similar to the pivoted
length normalization formula in that it has an IDF component and a query TF
component. In the middle, the normalization is a little bit different; we have a
sublinear transformation with an upper bound. There is a length normalization
factor here as well. It achieves a similar effect as discussed before, since we put
the normalizer in the denominator. Thus, again, if a document is longer, the term
weight will be smaller.
We have now reached one of the best-known retrieval functions by thinking
logically about how to represent a document and by slowly tweaking formulas and
considering our initial assumptions.
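For concreteness, here is a sketch of a BM25 scoring function following the formula in Figure 6.18 (illustrative only, not the book's implementation; it assumes pre-tokenized text, a doc_freq dictionary, and typical parameter values such as k = 1.2 and b = 0.75):

    import math
    from collections import Counter

    def bm25_score(query_tokens, doc_tokens, doc_freq, num_docs, avg_doc_len, k=1.2, b=0.75):
        q_counts, d_counts = Counter(query_tokens), Counter(doc_tokens)
        doc_len = len(doc_tokens)
        score = 0.0
        for w in q_counts.keys() & d_counts.keys():
            idf = math.log2((num_docs + 1) / doc_freq[w])      # IDF component
            norm = k * (1 - b + b * doc_len / avg_doc_len)     # length-normalized k
            tf = (k + 1) * d_counts[w] / (d_counts[w] + norm)  # bounded TF component
            score += q_counts[w] * tf * idf
        return score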
Stemmed words (words that have been transformed into a basic root form) are a
viable option, so that all forms of the same word are treated as one and can be matched
as one term. We also need to perform stop word removal; this removes very
common words that don't carry much content, such as the, a, or of. We could use
phrases or even latent semantic analysis, which characterizes documents by which
cluster words belong to. We can also use smaller units, like character n-grams,
which are sequences of n characters, as dimensions. In practice, researchers have
found that the bag-of-words representation with phrases (or “bag-of-phrases”) is
the most effective representation. It’s also efficient so this is still by far the most
popular document representation method and it’s used in all the major search
engines.
Sometimes we need to employ language-specific and domain-specific represen-
tation. This is actually very important as we might have variations of the terms that
prevent us from matching them with each other even though they mean the same
thing. Take Chinese, for example. We first need to segment text to obtain word
boundaries because it’s originally just a sequence of characters. A word might cor-
respond to one character or two characters or even three characters. It’s easier in
English when we have a space to separate the words, but in some other languages we
may need to do some natural language processing to determine word boundaries.
There is also the possibility of improving the similarity function. So far, we've used
the dot product, but there are other measures. We could compute the cosine of the
angle between two vectors, or we could use a Euclidean distance measure. The dot
product still seems the best, and one reason is that it's very general; in
fact, if you consider the possibility of weighting terms in
different ways, the cosine measure can be regarded as the dot product of two normalized
vectors. That means we first normalize each vector and then take the dot
product; that would be equivalent to the cosine measure.
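A small illustration of this equivalence (a sketch; vectors are plain Python lists and assumed to be nonzero):

    import math

    def dot(x, y):
        return sum(a * b for a, b in zip(x, y))

    def cosine(x, y):
        # Cosine similarity = dot product of the two length-normalized vectors.
        nx = [a / math.sqrt(dot(x, x)) for a in x]
        ny = [b / math.sqrt(dot(y, y)) for b in y]
        return dot(nx, ny)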
We mentioned that BM25 seems to be one of the most effective formulas—but
there has also been further development in improving BM25, although none of
these works has changed BM25 fundamentally. In one line of work, people
have derived BM25-F. Here, F stands for field, and this is BM25 for documents
with structure. For example, you might consider the title field, the abstract field,
the body of the research article, or even anchor text (on web pages). These can all be
combined with an appropriate weight on different fields to help improve scoring for
each document. Essentially, this formulation applies BM25 on each field, and then
combines the scores, but keeps global (i.e., across all fields) frequency counts. This
has the advantage of avoiding over-counting the first occurrence of the term. Recall
that in the sublinear transformation of TF, the first occurrence is very important
and contributes a large weight. If we do that for all the fields, then the same term
might have gained a large advantage in every field. When we just combine counts
on each separate field, the extra occurrences will not be counted as fresh first
occurrences. This method has worked very well for scoring structured documents.
More details can be found in Robertson et al. [2004].
Another line of extension is called BM25+. Here, researchers have addressed the
problem of over-penalization of long documents by BM25. To address this problem,
the fix is actually quite simple. We can simply add a small constant to the TF
normalization formula. But what’s interesting is that we can analytically prove that
by doing such a small modification, we will fix the problem of over-penalization
of long documents by the original BM25. Thus, the new formula called BM25+ is
empirically and analytically shown to be better than BM25 [Lv and Zhai 2011].
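As a rough sketch of this fix (assuming a BM25-style TF component like the one sketched earlier and a small constant delta, e.g., 1.0; see Lv and Zhai [2011] for the exact BM25+ formula):

    def bm25_plus_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75, delta=1.0):
        # Add a small constant delta to the normalized TF component so that matching
        # a query word in a very long document still earns some minimum credit.
        if count == 0:
            return 0.0
        tf = (k + 1) * count / (count + k * (1 - b + b * doc_len / avg_doc_len))
        return tf + delta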
6.3.7 Summary
In vector space retrieval models, we use similarity as a notion of relevance, assum-
ing that the relevance of a document with respect to a query is correlated with the
similarity between the query and the document. Naturally, that implies that the
query and document must be represented in the same way, and in this case, we
represent them as vectors in a high dimensional vector space. The dimensions are
defined by words, concepts, or terms. We generally need to use multiple heuris-
tics to design a ranking function; we gave some examples which show the need for
several heuristics, which include:
. TF (term frequency) weighting and sublinear transformation;
. IDF (inverse document frequency) weighting; and
. document length normalization.
These three are the most important heuristics to ensure such a general ranking
function works well for all kinds of tasks. Finally, BM25 and pivoted length nor-
malization seem to be the most effective VS formulas. While there has been some
work done in improving these two powerful measures, their main idea remains
the same. In the next section, we will discuss an alternative approach to the vector
space representation.
Figure 6.19 An example of logged relevance judgments (or clicks), given as (query, document, relevance) tuples such as (q1, d1, 1), (q1, d2, 1), (q1, d3, 0), (q1, d4, 0), (q1, d5, 1), . . . , (q2, d3, 1), (q3, d1, 1), (q4, d2, 1), (q4, d3, 0); the same query-document pair may appear many times with different relevance values. The scoring function is defined as f(q, d) = p(R = 1 | d, q) = count(q, d, R = 1)/count(q, d).
To estimate this probability, we treat q and d as a pair, look up all the rows of the
table containing this pair, count how many times we have also seen a one in the
third column, and compute the ratio:

p(R = 1 | d, q) = count(R = 1, d, q) / count(d, q).   (6.3)

For example, for query q1,
it’s going to rank d2 above all the other documents because in all the cases, given
q1 and d2, R = 1.
With volumes of clickthrough data, a search engine can learn to improve its
results. This is a simple example that shows that with even a small number of
entries, we can already estimate some probabilities. These probabilities would give
us some sense about which document might be more useful to a user for this
query. Of course, the problem is that we don’t observe all the queries and all of
the documents and all the relevance values; there will be many unseen documents.
In general, we can only collect data from the documents that we have shown to the
users. In fact, there are even more unseen queries because you cannot predict what
queries will be typed in by users. Obviously, this approach won’t work if we apply
it to unseen queries or unseen documents. Nevertheless, this shows the basic idea
of the probabilistic retrieval model.
What do we do in such a case when we have a lot of unseen documents and
unseen queries? The solution is that we have to approximate in some way. In the
particular case called the query likelihood retrieval model, we just approximate this
by another conditional probability, p(q | d , R = 1) [Lafferty and Zhai 2003].
We assume that the user likes the document because we have seen that the
user clicked on this document, and we are interested in all these cases when a
user liked this particular document and want to see what kind of queries they
have used. Note that we have made an interesting assumption here: we assume
that a user formulates the query based on an imaginary relevant document. If you
just look at this as a conditional probability, it’s not obvious we are making this
assumption. We have to somehow be able to estimate this conditional probability
without relying on the big table from Figure 6.19. Otherwise, we would have similar
problems as before. By making this assumption, we have some way to bypass the
big table.
Let’s look at how this new model works for our example. We ask the following
question: which of these documents is most likely the imaginary relevant docu-
ment in the user’s mind when the user formulates this query? We quantify this
probability as a conditional probability of observing this query if a particular doc-
ument is in fact the imaginary relevant document in the user’s mind. We compute
all these query likelihood probabilities—that is, the likelihood of the query given
each document. Once we have these values, we can then rank these documents.
To summarize, the general idea of modeling relevance in the probabilistic re-
trieval model is to assume that we introduce a binary random variable R and let
the scoring function be defined based on the conditional probability p(R = 1 | d , q).
We also talked about approximating this by using query likelihood. This means we
have a ranking function that’s based on a probability of a query given the document.
[Figure: illustration of computing p(q = “presidential campaign” | d) for the document
d = “… news of presidential campaign … presidential candidate …”, where the user is
imagined to pick the query words from d.]
This probability should be interpreted as the probability that a user who likes doc-
ument d would pose query q. Now the question, of course, is how do we compute
this conditional probability? We will discuss this in detail in the next section.
p(q = “presidential campaign” | d) = c(“presidential”, d)/|d| × c(“campaign”, d)/|d|
Figure 6.21 Computing the probability of a query given a document using the query likelihood
formulation.
We’ve made the assumption that each query word is independent and that
each word is obtained from the imagined ideal document satisfying the user’s
information need. Let’s see how this works exactly. Since we are computing a query
likelihood, then the total probability is the probability of this particular query,
which is a sequence of words. Since we make the assumption that each word
is generated independently, the probability of the query is just a product of the
probability of each query word, where the probability of each word is just the relative
frequency of the word in the document. For example, the probability of presidential
given the document would be just the count of presidential in the document divided
by the total number of words in the document (i.e., the document length). We now
have an actual formula for retrieval that we can use to rank documents.
Let’s take a look at some example documents from Figure 6.21. Suppose now
the query is presidential campaign. To score these documents, we just count how
many times we have seen presidential and how many times we have seen campaign.
We’ve seen presidential two times in d4, so that’s 2/|d4|. We also multiply by 1/|d4| for the
probability of campaign. Similarly, we can calculate probabilities for the other two
documents d3 and d2. If we assume d3 and d4 have about the same length, then it
looks like we will rank d4 above d3, which is above d2. As we would expect, it looks
like this formulation captures the TF heuristic from the vector space models.
However, if we try a different query like this one, presidential campaign update,
then we might see a problem. Consider the word update: none of the documents
contain this word. According to our assumption that a user would pick a word from
a document to generate a query, the probability of obtaining a word like update
would be zero. Clearly, this causes a problem because it would cause all these
documents to have zero probability of generating this query.
While it’s fine to have a zero probability for d2 which is not relevant, it’s not okay
to have zero probability for d3 and d4 because now we no longer can distinguish
them. In fact, we can’t even distinguish them from d2. Clearly, that’s not desirable.
When one has such a result, we should think about what has caused this problem,
examining what assumptions have been made as we derive this ranking function.
We have made an assumption that every query word must be drawn from the
document in the user’s mind—in order to fix this, we have to assume that the user
could have drawn a word not necessarily from the document. So let’s consider an
improved model.
Instead of drawing a word from the document, let’s imagine that the user would
actually draw a word from a document language model as depicted in Figure 6.22.
Here, we assume that this document is generated by using this unigram language
model, which doesn’t necessarily assign zero probability for the word update. In
fact, we assume this model does not assign zero probability for any word. If we’re
thinking this way, then the generative process is a bit different: the user has this
model (distribution of words) in mind instead of a particular ideal document,
although the model still has to be estimated based on the documents in our corpus.
The user can generate the query using a similar process. They may pick a word
such as presidential and another word such as campaign. The difference is that now
we can pick a word like update even though it doesn’t occur in the document. This
Figure 6.22 Computing the probability of a query given a document using a document language
model.
Figure 6.23 Scoring a query on two documents based on their language models. For a document
d2 about food nutrition, the language model p(w|d2) might give food 0.25, nutrition 0.1,
healthy 0.05, diet 0.02, . . . , and the query is scored as
p(“data mining alg”|d2) = p(“data”|d2) × p(“mining”|d2) × p(“alg”|d2).
would fix our problem with zero probabilities and it’s also reasonable because we’re
now thinking of what the user is looking for in a more general way, via a unigram
language model instead of a single fixed document.
In Figure 6.23, we show two possible language models based on documents d1
and d2, and a query data mining algorithms. By making an independence assump-
tion, we could have p(q | d) as a product of the probability of each query word in
each document’s language model. We score these two documents and then rank
them based on the probabilities we calculate.
Let’s formally state our scoring process for query likelihood. A query q contains
the words
q = w1, w2, . . . , wn
such that |q| = n. The scoring or ranking function is then the probability that we
observe q given that a user is thinking of a particular document d. This is the prod-
uct of probabilities of all individual words, which is based on the independence
assumption mentioned before:

p(q | d) = p(w1 | d) × p(w2 | d) × . . . × p(wn | d).
In practice, we score the document for this query by using a logarithm of the
query likelihood:
score(q, d) = log p(q | d) = Σ_{i=1}^{n} log p(wi | d) = Σ_{w∈V} c(w, q) log p(w | d). (6.5)
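As a minimal sketch, Equation 6.5 with the maximum likelihood estimate p(w | d) = c(w, d)/|d| can be written as follows; the toy document is illustrative, and the sketch deliberately exposes the zero-probability problem discussed next.

```python
import math
from collections import Counter

def log_query_likelihood_mle(query_words, doc_words):
    """score(q, d) = sum over query words of log p(w | d), with p(w | d) = c(w, d)/|d|."""
    counts = Counter(doc_words)
    doc_len = len(doc_words)
    score = 0.0
    for w in query_words:
        p = counts[w] / doc_len
        if p == 0.0:
            return float("-inf")   # one unseen query word zeroes out the whole query
        score += math.log(p)
    return score

d4 = "news of presidential campaign presidential candidate".split()
print(log_query_likelihood_mle("presidential campaign".split(), d4))
print(log_query_likelihood_mle("presidential campaign update".split(), d4))  # -inf
```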
To smooth, we must take away some probability mass from the seen words, because we need
some extra probability mass for the unseen words—otherwise, the probabilities won’t sum to one.
To make this transformation and to improve the MLE, we will assign nonzero
probabilities to words that are not observed in the data. This is called smoothing,
and smoothing has to do with improving the estimate by including the probabilities
of unseen words. Considering this factor, a smoothed language model would be
a more accurate representation of the actual document. Imagine you have seen
the abstract of a research paper; or, imagine a document is just an abstract. If
we assume words that don’t appear in the abstract have a probability of zero, that
means sampling a word outside the abstract is impossible. Imagine the user who
is interested in the topic of this abstract; the user might actually choose a word
that is not in the abstract to use as query. In other words, if we had asked this
author to write more, the author would have written the full text of the article, which
contains words that don’t appear in the abstract. So, smoothing the language model
is an attempt to recover the model for the whole article. Of course, we don’t
usually have knowledge about the words not observed in the abstract, so that’s why
smoothing is actually a tricky problem.
The key question here is what probability should be assigned to those unseen
words. As one would imagine, there are many different approaches to solve this
issue. One idea that’s very useful for retrieval is to let the probability of an unseen
word be proportional to its probability as given by a reference language model. That
means if you don’t observe the word in the corpus, we’re going to assume that its
probability is governed by another reference language model that we construct. It
will tell us which unseen words have a higher probability than other unseen words.
In the case of retrieval, a natural choice would be to take the collection LM as the
reference LM. That is to say if you don’t observe a word in the document, we’re going
to assume that the probability of this word would be proportional to the probability
of the word in the whole collection.
More formally, we’ll be estimating the probability of a word given a document
as follows:
p(w | d) = pseen(w | d),       if w is seen in d
         = αd · p(w | C),      otherwise.                                   (6.6)
If the word is seen in the document, then the probability would be a discounted
MLE estimate pseen. Otherwise, if the word is not seen in the document, we’ll let the
probability be proportional to the probability of the word in the collection p(w | C),
with the coefficient αd controlling the amount of probability mass that we assign
Figure 6.24 Substituting smoothed probabilities into the query likelihood retrieval formula.
to unseen words. Regardless of whether the word w is seen in the document or not,
all these probabilities must sum to one, so αd is constrained.
Now that we have this smoothing formula, we can plug it into our query likeli-
hood ranking function, illustrated in Figure 6.24. In this formula, we have a sum
over all the query words, written in the form of a sum over the corpus vocabulary.
Although we sum over words in the vocabulary, in effect we are just taking a sum of
query words since each word is weighted by its frequency in the query. Such a way to
write this sum is convenient in some transformations. In our smoothing method,
we’re assuming the words that are not observed in the document have a somewhat
different form of probability. Using this form we can decompose this sum into two
parts: one over all the query words that are matched in the document and the other
over all the words that are not matched. These unmatched words have a different
form of probability because of our assumption about smoothing.
We can then rewrite the second sum (of query words not matched in d) as a
difference between the scores of all words in the vocabulary minus all the query
words matched in d. This is actually quite useful, since part of the sum over all w ∈ V
can now be written as |q| log αd . Additionally, the sum of query words matched in d
log p(q|d) = Σ_{w∈d, w∈q} c(w, q) [ log ( pseen(w|d) / (αd p(w|C)) ) ] + n log αd + Σ_{i=1}^{n} log p(wi|C)
(the ratio inside the logarithm provides the TF weighting, and αd provides the document length
normalization)
Figure 6.25 The query likelihood retrieval formula captures the three heuristics from the vector
space models.
is in terms of words that we observe in the query. Just like in the vector space model,
we are now able to take a sum of terms in the intersection of the query vector and
the document vector.
If we look at this rewriting further as shown in Figure 6.25, we can see how
it actually would give us two benefits. The first benefit is that it helps us better
understand the ranking function. In particular, we’re going to show that from this
formula we can see the connection of smoothing using a collection language model
with weighting heuristics similar to TF-IDF weighting and length normalization.
The second benefit is that it also allows us to compute the query likelihood more
efficiently, since we only need to consider terms matched in the query.
We see that the main part of the formula is a sum over the matching query
terms. This is much better than if we take the sum over all the words. After we
smooth the document using the collection language model, we would have nonzero
probabilities for all the words w ∈ V . This new form of the formula is much easier
to compute. It’s also interesting to note that the last term is independent of the
document being scored, so it can be ignored for ranking. Ignoring this term won’t
affect the order of the documents since it would just be the same value added onto
each document’s final score.
Inside the sum, we also see that each matched query term would contribute
a weight. This weight looks like TF-IDF weighting from the vector space models.
First, we can already see it has a frequency of the word in the query, just like in the
vector space model. When we take the dot product, the word frequency in the query
appears in the sum as a vector element from the query vector. The corresponding
term from the document vector encodes a weight that has an effect similar to TF-
IDF weighting. pseen is related to the term frequency in the sense that if a word
occurs very frequently in the document, then the seen probability will tend to be
larger. This term is really doing something like TF weighting. In the denominator,
we achieve the IDF effect through p(w | C), or the popularity of the term in the
collection. Because it’s in the denominator, a larger collection probability actually
makes the weight of the entire term smaller. This means a popular term carries
a smaller weight—this is precisely what IDF weighting is doing! Only now, we
have a different form of TF and IDF. Remember, IDF has a logarithm of document
frequency, but here we have something different. Intuitively, however, it achieves
a similar effect to the VS interpretation.
We also have something related to the length normalization. In particular, αd
might be related to document length. It encodes how much probability mass we
want to give to unseen words, or how much smoothing we are allowed to do.
Intuitively, if a document is long then we need to do less smoothing because we
can assume that it is large enough that we have probably observed all of the words
that the author could have written. If the document is short, the number of unseen
words is expected to be large, and we need to do more smoothing in this case. Thus,
αd tends to be smaller for long documents, which penalizes them. The variable αd actually
occurs in two places in the formula, so its overall effect is not necessarily to penalize long
documents; but as we will see later when we consider specific smoothing methods, αd does
end up penalizing long documents in a specific way.
This formulation is quite convenient since it means we don’t have to think about
the specific way of doing smoothing. We just need to assume that if we smooth with
the collection language model, then we would have a formula that looks like TF-
IDF weighting and document length normalization. It’s also interesting that we
have a very fixed form of the ranking function. Note that we have not heuristically
put a logarithm here, but have used a logarithm of query likelihood for scoring
and turned the product into a sum of logarithms of probabilities. If we only want
to heuristically implement TF-IDF weighting, we don’t necessarily have to have
a logarithm. Imagine if we drop this logarithm; we would still have TF and IDF
weighting. But, what’s nice with probabilistic modeling is that we are automatically
given a logarithm function which achieves sublinear scaling of our term “weights.”
In summary, a nice property of probabilistic models is that by following some
assumptions and probabilistic rules, we’ll get a formula by derivation. If we heuris-
tically design the formula, we may not necessarily end up having such a specific
form. Additionally, we talked about the need for smoothing a document language
model. Otherwise, it would give zero probability for unseen words in the document,
which is not good for scoring a query with an unseen word. It’s also necessary to
improve the accuracy of estimating the model representing the topic of this docu-
ment. The general idea of smoothing in retrieval is to use the collection language
model to give us some clue about which unseen word would have a higher proba-
bility. That is, the probability of the unseen word is assumed to be proportional to
its probability in the entire collection. With this assumption, we’ve shown that we
can derive a general ranking formula for query likelihood retrieval models that au-
tomatically contains the vector space heuristics of TF-IDF weighting and document
length normalization.
We also saw that through some rewriting, the scoring of such a ranking function
is primarily based on a sum of weights on matched query terms, also just like in
the vector space model. The actual ranking function is given to us automatically by
the probabilistic derivation and assumptions we have made, unlike in the vector
space model where we have to heuristically think about the forms of each function.
However, we still need to address the question: how exactly should we smooth
a document language model? How exactly should we use the reference language
model based on the collection to adjust the probability of the MLE of seen terms?
This is the topic of the next section.
We can see it’s a sum of all the matched query terms, and inside the sum it’s a
count of terms in the query with some weight for the term in the document. We saw
in the previous section how TF and IDF are captured in this sum. We also mentioned
how the second term αd can be used for document length normalization. If we
wanted to implement this function using a programming language, we’d still need
to figure out a few variables; in particular, we’re going to need to know how to
estimate the probability of a word and how to set αd . In order to answer these
questions, we have to think about specific smoothing methods, where we define
pseen and αd .
We’re going to talk about two different smoothing methods. The first is a lin-
ear interpolation with a fixed mixing coefficient. This is also called Jelinek-Mercer
smoothing. The idea is actually quite simple. Figure 6.26 shows how we estimate
the document language model by using MLE. That gives us word counts normal-
ized by the total number of words in the document. The idea of using this method
is to maximize the probability of the observed text. As a result, if a word like network
is not observed in the text, it’s going to get zero probability. The idea of smoothing
is to rely on the collection reference model, where this word is not going to have a zero
probability.
In Figure 6.26 we also consider the word network, which does not appear in d.
In this case, the MLE estimate is zero, and its smoothed probability is 0 + λ · p(w | C) =
λ · 0.001. You can see now that αd in this smoothing method is just λ
p(w|d) = (1 − λ) · c(w, d)/|d| + λ · p(w|C),      λ ∈ [0, 1]

Document d (total # words = 100): text 10, mining 5, association 3, database 3, algorithm 2,
. . . , query 1, efficient 1 (network 0).
Maximum likelihood unigram LM p(w|θ): text 10/100, mining 5/100, association 3/100,
database 3/100, . . . , query 1/100, network 0/100.
Collection LM p(w|C): the 0.1, a 0.08, . . . , computer 0.02, database 0.01, . . . , text 0.001,
network 0.001, mining 0.0009, . . .
p(“text”|d) = (1 − λ) · 10/100 + λ · 0.001        p(“network”|d) = λ · 0.001
Figure 6.26 Smoothing the query likelihood retrieval function with linear interpolation: Jelinek-
Mercer smoothing.
because that’s the coefficient in front of the probability of the word given by the
collection language model.
The second smoothing method we will discuss is called Dirichlet prior smooth-
ing, or Bayesian smoothing. Again, we face the problem of zero probability for words
like network. Just like Jelinek-Mercer smoothing, we’ll use the collection language
model, but in this case we’re going to combine it with the MLE esimate in a some-
what different way. The formula first can be seen as an interpolation of the MLE
probability and the collection language model as before. Instead, however, αd is
not simply a fixed λ, but a dynamic coefficient which takes μ > 0 as a parameter.
Based on Figure 6.27, we can see if we set μ to a constant, the effect is that a
long document would actually get a smaller coefficient here. Thus, a long docu-
ment would have less smoothing as we would expect, so this seems to make more
sense than fixed-coefficient smoothing. The two coefficients |d|/(|d| + μ) and μ/(|d| + μ) would
still sum to one, giving us a valid probability model. This smoothing can be un-
derstood as a dynamic coefficient interpolation. Another way to understand this
formula—which is even easier to remember—is to rewrite this smoothing method
in this form:
p(w | d) = (c(w, d) + μ · p(w | C)) / (|d| + μ). (6.8)
Figure 6.27 Smoothing the query likelihood retrieval function with linear interpolation: Dirichlet
prior smoothing.
Here, we can easily see what change we have made to the MLE. In this form, we see that we
add a count of μ · p(w | C) to every word, which is proportional to the probability of w in the
entire corpus. We pretend every word w has μ · p(w | C) additional pseudocounts. Since we add
this extra probability mass in the numerator, we have to re-normalize in order to have a valid
probability distribution. Since Σ_{w∈V} p(w | C) = 1, we can add μ to the denominator, which
equals the total number of pseudocounts added across all words in the numerator.
Let’s also take a look at this specific example again. For the word text, we will have
ten counts that we actually observe but we also added some pseudocounts which
are proportional to the probability of text in the entire corpus. Say we set μ = 3000,
meaning we will add 3000 extra word counts into our smoothed model. We want
some portion of the 3000 counts to be allocated to text; since p(text | C) = 0.001,
we’ll assign 0.001 . 3000 counts to that word. The same goes for the word network;
for d, we observe zero counts, but also add μ . p(network | C) extra pseudocounts
for our smoothed probability.
In Dirichlet prior smoothing, αd will actually depend on the current document
being scored, since |d| is used in the smoothed probability. In the Jelinek-Mercer
linear interpolation, αd = λ, which is a constant. For Dirichlet prior, we have αd = μ/(|d| + μ),
which is the interpolation coefficient applied to the collection language
model. For a slightly more detailed derivation of these variables, the reader may
consult Appendix A.
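As a minimal sketch, the two smoothed estimates of p(w | d) can be written as follows; the parameter values and the example numbers (taken from Figure 6.26) are only illustrative.

```python
def p_jelinek_mercer(count_w_d, doc_len, p_w_C, lam=0.1):
    """Jelinek-Mercer: p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|C); alpha_d = lambda."""
    return (1 - lam) * count_w_d / doc_len + lam * p_w_C

def p_dirichlet(count_w_d, doc_len, p_w_C, mu=2000):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu); alpha_d = mu / (|d| + mu)."""
    return (count_w_d + mu * p_w_C) / (doc_len + mu)

# Example from Figure 6.26: c("text", d) = 10, |d| = 100, p("text"|C) = 0.001
print(p_jelinek_mercer(10, 100, 0.001, lam=0.1))   # (1 - 0.1) * 0.1 + 0.1 * 0.001
print(p_dirichlet(0, 100, 0.001, mu=3000))         # an unseen word still gets some mass
```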
Now that we have defined pseen and αd for both smoothing methods, let’s plug
these variables in the original smoothed query likelihood retrieval function. Let’s
start with Jelinek-Mercer smoothing:
pseen(w | d) / (αd · p(w | C)) = [(1 − λ) · pMLE(w | d) + λ · p(w | C)] / (λ · p(w | C))
                              = 1 + ((1 − λ)/λ) · c(w, d) / (|d| · p(w | C)). (6.9)
Then, plugging this into the entire query likelihood retrieval formula, we get
scoreJM(q, d) = Σ_{w∈q,d} c(w, q) · log [ 1 + ((1 − λ)/λ) · c(w, d) / (|d| · p(w | C)) ]. (6.10)
We ignore the |q| log αd additive term (derived in the previous section) since
αd = λ does not depend on the current document being scored. We’ll end up having
a ranking function that is strikingly similar to a vector space model since it is a
sum over all the matched query terms. The value of the logarithm term is non-
negative. We see very clearly the TF weighting in the numerator, which is scaled
sublinearly. We also see the IDF-like weighting, which is the p(w | C) term in the
denominator; the more frequent the term is in the entire collection, the more
discounted the numerator will be. Finally, we can see the |d| in the denominator is a
form of document length normalization, since as |d| grows, the overall term weight
would decrease, suggesting that the impact of αd in this case is clearly to penalize
a long document. The second fraction can also be considered as the ratio of two
probabilities; if the ratio is greater than one, it means the probability of w in d is
greater than appearing by chance in the background. If the ratio is less than one,
the chance of seeing w in d is actually less likely than observing it in the collection.
What’s also important to note is that we received this weighting function auto-
matically by making various assumptions, whereas in the vector space model, we
had to go through those heuristic design choices in order to get this. These are the
advantages of using this kind of probabilistic reasoning where we have made ex-
plicit assumptions. We know precisely why we have a logarithm here, and precisely
why we have these probabilities. We have a formula that makes sense and does
TF-IDF weighting and document length normalization.
Let’s look at the complete function for Dirichlet prior smoothing now. We know what pseen is
and we know that αd = μ/(|d| + μ); therefore,

pseen(w | d) / (αd · p(w | C)) = [(c(w, d) + μ · p(w | C)) / (|d| + μ)] / [(μ · p(w | C)) / (|d| + μ)]
                              = 1 + c(w, d) / (μ · p(w | C)). (6.12)
The form of the function looks very similar to the Jelinek-Mercer scoring func-
tion. We compute a ratio that is sublinearly scaled by a non-negative logarithm.
Both TF and IDF are computed in almost the exact same way. The difference here
is that Dirichlet prior smoothing can capture document length normalization dif-
ferently than Jelinek-Mercer smoothing. Here, we have retained the |q| log αd term
since αd depends on the document, namely |d|. If |d| is large, then less extra mass
is added onto the final score; if |d| is small, more extra mass is added to the score,
effectively rewarding a short document.
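Putting the two derivations together, here is a minimal sketch of both ranking functions in their "sum over matched query terms" form; it assumes we already have query term counts, per-document term counts, document lengths, and collection probabilities, and (as in the text) it drops any term that does not depend on the document being ranked.

```python
import math

def score_jelinek_mercer(query_counts, doc_counts, doc_len, p_C, lam=0.1):
    """Jelinek-Mercer query likelihood, as in Eq. (6.10); the document-independent
    |q| * log(lambda) term is dropped because it does not affect ranking."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf > 0:   # only matched query terms contribute
            score += qtf * math.log(1 + (1 - lam) / lam * tf / (doc_len * p_C[w]))
    return score

def score_dirichlet(query_counts, doc_counts, doc_len, p_C, mu=2000):
    """Dirichlet-prior query likelihood; |q| * log(alpha_d) is kept because
    alpha_d = mu / (|d| + mu) depends on the document length."""
    q_len = sum(query_counts.values())
    score = q_len * math.log(mu / (doc_len + mu))
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf > 0:
            score += qtf * math.log(1 + tf / (mu * p_C[w]))
    return score
```

Here query_counts and doc_counts are assumed to be {word: count} dictionaries and p_C maps each word to its collection probability; these representations are illustrative, not prescribed by the text.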
To summarize this section, we’ve talked about two smoothing methods: Jelinek-
Mercer, which is doing the fixed coefficient linear interpolation, and Dirichlet prior,
which adds pseudo counts proportional to the probability of the current word in
the background collection. In most cases we can see, by using these smoothing
methods, we will be able to reach a retrieval function where the assumptions are
clearly articulated, making them less heuristic than some of the vector space mod-
els. Even though we didn’t explicitly set out to define the popular VS heuristics, in
the end we naturally arrived at TF-IDF weighting and document length normaliza-
tion, perhaps justifying their inclusion in the VS models. Each of these functions
also has a smoothing parameter (λ or μ) with an intuitive meaning. Still, we need to
set these smoothing parameters or estimate them in some way. Overall, this shows
that by using a probabilistic model, we follow very different strategies than the vec-
tor space model. Yet in the end, we end up with retrieval functions that look very
similar to the vector space model. Some advantages here are having assumptions
clearly stated and a final form dictated by a probabilistic model.
This section also concludes our discussion of the query likelihood probabilistic
retrieval models. Let’s recall the assumptions we have made in order to derive the functions
that we have seen:
. relevance is modeled with a binary variable R and approximated by the query likelihood p(q | d);
. query words are generated independently;
. each word is drawn from a (smoothed) unigram document language model rather than from the
document itself; and
. the probability of a word unseen in the document is proportional to its probability in the
collection language model.
If we make these four assumptions, then we have no choice but to take the
form of the retrieval function that we have seen earlier. Fortunately, the function
has a nice property in that it implements TF-IDF weighting and document length
normalization. In practice, these functions also work very well. In that sense, these
functions are less heuristic compared with the vector space model.
Exercises
6.1. Here’s a query and document vector. What is the score for the given document
using dot product similarity?
d = {1, 0, 0, 0, 1, 4} q = {2, 1, 0, 1, 1, 1}
6.2. In what kinds of queries do we probably not care about query term frequency?
6.3. Let d be a document in a corpus. Suppose we add another copy of d to collec-
tion. How does this affect the IDF of all words in the corpus?
6.4. Given a fixed vocabulary size, the length of a document is the same as the
length of the vector used to represent it. True or false? Why?
6.5. Consider Euclidean distance as our similarity measure for text documents:
d(q, d) = √( Σ_{i=1}^{|V|} (qi − di)² ).
What does this measure capture compared to the cosine measure discussed in this
chapter? Would you prefer one over the other?
6.7. Which of the following ways is best to reduce the size of the vocabulary in a
large corpus?
Remove top 10 words
Remove words that occur 10 times or fewer
6.8. Why might using raw term frequency counts with dot product similarity not
give the best possible ranking?
6.9. How can you apply the VS model to a domain other than text documents? For
example, how do you find similar movies in IMDB or similar music to a specific
song? Hint: first define your concept space; what is your “term” vector?
where:
k > 0 is some parameter;
c(w, C) and c(w, d) are the count of the current word in the collection and
current document, respectively;
N is the total number of documents;
df(w) is the number of documents that the current word w appears in;
n is the document length of the current document;
navg is the average document length of the corpus.
7
Feedback
In this chapter, we will discuss feedback in a TR system. Feedback takes the results
of a user’s actions or previous search results to improve retrieval results. This is
illustrated in Figure 7.1. As shown, feedback is often implemented as updates to
a query, which alters the list of returned documents. We can see the user would
type in a query and then the query would be sent to a standard search engine,
which returns a ranked list of results (we discussed this in depth in Chapter 6).
These search results would be shown to the user. The user can make judgements
about whether each returned document is useful or not. For example, the user may
say one document is good or one document is not very useful. Each decision on a
document is called a relevance judgment. This overall process is a type of relevance
feedback, because we’ve got some feedback information from the user based on the
judgements of the search results.
As one would expect, this can be very useful to the retrieval system since we
should be able to learn what exactly is interesting to a particular user or users.
The feedback module would then take these judgements as input and also use
the document collection to try to improve future rankings. As mentioned, it would
typically involve updating the query so the system can now rank the results more
accurately for the user; this is the main idea behind relevance feedback.
These types of relevance judgements are reliable, but the users generally don’t
want to make extra effort unless they have to. There is another form of feedback
called pseudo relevance feedback, or blind feedback. In this case, we don’t have to
involve users since we simply assume that the top k ranked documents are relevant.
Let’s say we assume the top k = 10 documents are relevant. Then, we will use these
documents to learn and to improve the query. But how could this help if the top-
ranked documents are random? In fact, the top documents are actually similar to
relevant documents, even if they are not relevant. Otherwise, how would they have
appeared high in the ranked list? So, it’s possible to learn some related terms to
the query from this set anyway regardless whether the user says that a document is
relevant or not.
[Figure 7.1: a query is sent to the retrieval engine, which returns a ranked list of scored
results (d1 3.5, d2 2.4, . . . , dk 0.5, . . .); judgments on these results feed back into an
updated query.]
You may recall that we talked about using language models to analyze word
associations by learning related words to the word computer (see Chapter 3). First,
we used computer to retrieve all the documents that contain that word. That is,
imagine the query is computer. Then, the results will be those documents that
contain computer. We take the top k results that match computer well and we
estimate term probabilities (by counting them) in this set for our topic language
model. Lastly, we use the background language model to choose the terms that
are frequent in this retrieved set but not frequent in the whole collection. If we
contrast these two ideas, what we can find is that we’ll learn some related terms to
computer. These related words can then be added to the original query to expand
the query, which helps find documents that don’t necessarily match computer, but
match other words like program and software that may not have been in the original
query.
Unfortunately, pseudo relevance feedback is not completely reliable; we have
to arbitrarily set a cutoff and hope that the ranking function is good enough to
get at least some useful documents. There is also another feedback method called
implicit feedback. In this case, we still involve users, but we don’t have to explicitly
ask them to make judgements. Instead, we are going to observe how the users
interact with the search results by observing their clickthroughs. If a user clicked
on one document and skipped another, this gives a clue about whether a document
is useful or not. We can even assume that we’re going to use only the snippet here
in a document that is displayed on the search engine results page (the text that’s
actually seen by the user). We can assume this displayed text is probably relevant
or interesting to the user since they clicked on it. This is the idea behind implicit
feedback and we can again use this information to update the query. This is a very
important technique used in modern search engines—think about how Google and
Bing can collect user activity to improve their search results.
To summarize, we talked about three types of feedback. In relevance feedback,
we use explicit relevance judgements, which require some user effort, but this
method is the most reliable. We talked about pseudo feedback, where we simply
assumed the top k document are relevant without involving the user at all. In
this case, we can actually do this automatically for each query before showing the
user the final results page. Lastly, we mentioned implicit feedback where we use
clickthrough data. While this method does involve users, the user doesn’t have to
make explicit effort to make judgements on the results.
Next, we will discuss how to apply feedback techniques to both the vector space
and query likelihood retrieval models. The following sections do not distinguish how the
feedback documents are obtained, since regardless of how they are obtained, they are handled
in the same way by each of the two feedback methods.
Figure 7.2 Illustration of Rocchio Feedback: adjusting weights in the query vector to move it closer
to a cluster of relevant documents. (The figure shows the original query q moving to qm, toward
the centroid of the relevant documents, marked +, and away from the centroid of the non-relevant
documents, marked −.)
Not all the top-ranked documents will be positive, and this is the motivation behind
feedback in the first place. Our goal is to move the query vector to some position to
improve the retrieval accuracy, shifting the dotted circle of similarity. By looking at
this diagram, we see that we should move the query vector so that the dotted circle
encompasses more + documents than − documents. This is the basic idea behind
Rocchio feedback.
Geometrically, we’re talking about moving a vector closer to some vectors and
away from other vectors. Algebraically, it means we have the following formula
(using the arrow vector notation for clarity):
qm = α · q + (β/|Dr|) Σ_{dj∈Dr} dj − (γ/|Dn|) Σ_{dj∈Dn} dj , (7.1)
where q is the original query vector that is transformed into qm, the modified (i.e.,
expanded) vector. Dr is the set of relevant feedback documents and Dn is the set of
non-relevant feedback documents. Additionally, we have the parameters α, β , and
γ which are weights that control the amount of movement of the original vector.
In terms of movement, we see that the terms in the original query are boosted by a
factor of α, and terms from positive documents are boosted by a factor of β, while
terms from negative documents are shrunk by a factor of γ .
Another interpretation of the second term (the sum over positive documents)
is the centroid vector of relevant feedback documents while the third term is the
centroid vector of the negative feedback documents. In this sense, we shift the
original query towards the relevant centroid and away from the negative centroid.
In each of these sums, averaging a single dimension over the feedback documents gives that
dimension’s weight in the corresponding centroid vector.
After we have performed these operations, we will get a new query vector which
can be used again to score documents in the index. This new query vector will then
reflect the move of the original query vector toward the relevant centroid vector and
away from the non-relevant centroid vector.
Let’s take a look at a detailed example depicted below. Imagine we have a small
vocabulary,
and a query
q = {1, 1, 1, 1, 0, 0}.
Recall from Chapter 6 that our vocabulary V is a fixed-length term vector. It’s not
necessary to know what type of weighting scheme this search engine is using, since
in Rocchio feedback, we will only be adding and subtracting term weights from the
query vector.
Say we are given five feedback documents whose term vectors are denoted as
relevant with a + prefix. The negative feedback documents are prefixed with −.
For Rocchio feedback, we first compute the centroid of the positive and nega-
tive feedback documents. The centroid of the positive documents would have the
average of each dimension, and the case is the same for the negative centroid:
+ Cr = { (1.5+1.5)/2, 0.0, (3.0+4.0)/2, (2.0+2.0)/2, 0.0, 0.0 } = { 1.5, 0.0, 3.5, 2.0, 0.0, 0.0 }
− Cn = { (1.5+1.5+1.5)/3, (0.1+0.1+0.0)/3, 0.0, (0.0+2.0+6.0)/3, (0.0+2.0+2.0)/3, 0.0 }
     = { 1.5, 0.067, 0.0, 2.67, 1.33, 0.0 }
Now that we have the two centroids, we modify the original query to create the
expanded query qm:
qm = α · q + β · Cr − γ · Cn
   = {α + 1.5β − 1.5γ , α − 0.067γ , α + 3.5β , α + 2β − 2.67γ , −1.33γ , 0}.
We have the parameter α controlling the original query term weight, which all
happened to be one. We have β to control the influence of the relevant centroid
vector Cr . Finally, we have γ , which is the non-relevant centroid Cn weight. Shifting
the original query vector q by these amounts yields our modified query qm. We
rerun the search with this new query. Due to the movement of the query vector,
we should match the relevant documents much better, since we moved q closer to
them and away from the non-relevant documents—this is precisely what we want
from feedback.
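A minimal sketch of Equation 7.1 on dense term vectors is shown below; the feedback vectors are illustrative (chosen only so that their centroids match the worked example above), and the parameter values are arbitrary.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Modified query: q_m = alpha*q + (beta/|Dr|) * sum(Dr) - (gamma/|Dn|) * sum(Dn)."""
    dims = range(len(query))
    centroid_r = [sum(d[i] for d in relevant) / len(relevant) for i in dims]
    centroid_n = [sum(d[i] for d in non_relevant) / len(non_relevant) for i in dims]
    return [alpha * query[i] + beta * centroid_r[i] - gamma * centroid_n[i] for i in dims]

q = [1, 1, 1, 1, 0, 0]
relevant = [[1.5, 0.0, 3.0, 2.0, 0.0, 0.0],
            [1.5, 0.0, 4.0, 2.0, 0.0, 0.0]]
non_relevant = [[1.5, 0.1, 0.0, 0.0, 0.0, 0.0],
                [1.5, 0.1, 0.0, 2.0, 2.0, 0.0],
                [1.5, 0.0, 0.0, 6.0, 2.0, 0.0]]
print(rocchio(q, relevant, non_relevant))   # the expanded query vector q_m
```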
If we apply this method in practice we will see one potential problem: we would
be performing a somewhat large computation to calculate the centroids and modify
all the weights in the new query. Therefore, we often truncate this vector and
only retain the terms which contain the highest weights, considering only a small
number of words. This is for efficiency. Additionally, negative examples or non-
relevant examples tend not to be very useful, especially compared with positive
examples. One reason is because negative documents distract the query in all
directions, so taking the average doesn’t really tell us where exactly it should be
moving to. On the other hand, positive documents tend to be clustered together
and they are often in a consistent direction with respect to the query. Because of
this effect, we sometimes don’t use the negative examples or set the parameter γ
to be small.
It’s also important to avoid over-fitting, which means we have to keep relatively
high weight α on the original query terms. We don’t want to overly trust a small
sample of documents and completely reformulate the query without regard to
its original meaning. Those original terms are typed in by the user because the
user decided that those terms were important! Thus, we bias the modified vector
towards the original query direction. This is especially true for pseudo relevance
feedback, since the feedback documents are less trustworthy. Despite these issues,
the Rocchio method is usually robust and effective, making it a very popular method
for feedback.
Query likelihood:   f(q, d) = Σ_{w∈d, w∈q} c(w, q) [ log ( pseen(w|d) / (αd p(w|C)) ) ] + n log αd
KL-divergence (cross entropy):   f(q, d) = Σ_{w∈d, p(w|θQ)>0} p(w|θ̂Q) [ log ( pseen(w|d) / (αd p(w|C)) ) ] + log αd
Query LM:   p(w|θ̂Q) = c(w, Q) / |Q|
Figure 7.3 The KL-divergence retrieval model changes the way we represent the query. This
enables feedback information to be incorporated into the query more easily.
[Figure 7.4: KL-divergence retrieval with feedback. A document language model θD and a query
language model θQ are estimated, and documents are scored by D(θQ′ || θD); the query model θQ′
ranges from θQ′ = θQ (no feedback) to θQ′ = θF (full feedback).]
So, the two formulas look almost identical except that in the generalized formula
we have a probability of a word given by a query language model. Still, we add all
the words that are in the document and have non-zero probability for the query
language model. Again, this becomes a generalization of summing over all the
matching query words. We can recover the original query likelihood formula by
simply setting the query language model to be the relative frequency of a word in
the query, which eliminates the query length term n = |q| which is a constant.
Figure 7.4 shows that we first estimate a document language model, then we
estimate a query language model and we compute the KL-divergence, often denoted
by D(.||.). We compute a language model from the documents containing the
query terms called the feedback language model θF . This feedback language model
is similar to the positive centroid Cr in Rocchio feedback. This model can be
combined with the original query language model using a linear interpolation,
which produces an updated model, again just like Rocchio.
We have a parameter α ∈ [0, 1] that controls the strength of the feedback docu-
ments. If α = 0, there is no feedback; if α = 1, we receive full feedback and ignore
the original query. Of course, these extremes are generally not desirable. The main
question is how to compute this θF .
Now, we’ll discuss one of the approaches to estimate θF . This approach is based
on a generative model shown in Figure 7.5. Let’s say we are observing the posi-
tive documents, which are collected by users’ judgements, the top k documents
from a search, clickthrough logs, or some other means. One approach to estimate
a language model over these documents is to assume these documents are gen-
[Figure 7.5: generative mixture model for the feedback documents. With probability λ a word w
is generated from the background model p(w|C); with probability 1 − λ it is generated from the
topic model p(w|θ).]
erated from some ideal feedback language model as we did before; this entails
normalizing all the frequency counts from all the feedback documents. But is this
distribution good for feedback? What would the top-ranked words in θF be?
As depicted in the language model on the right in Figure 7.6, the high-scoring
words are actually common words like the. This isn’t very good for feedback, be-
cause we will be adding many such words to our query when we interpolate with
our original query language model. Clearly, we need to get rid of these stop words.
In fact, we have already seen one way to do that, by using a background language
model while learning word associations in Chapter 2. Instead, we’re going to talk
about another approach which is more principled. What we can do is to assume
that those unwanted words are from the background language model. If we use a
maximum likelihood estimate, a single model would have been forced to assign
high probabilities to a word like the because it occurs so frequently. In order to re-
duce its probability in this model, we have to have another model to explain such a
common word. It is appropriate to use the background language model to achieve
this goal because this model will assign high probabilities to these common words.
We assume the machine that generated these words would work as follows.
Imagine we flip a coin to decide what distribution to use (topic words or background
words). With the probability of λ ∈ [0, 1] the coin shows up as heads and then
we’re going to use the background language model. Once we know we will use the
background LM, we can then sample a word from that model. Alternatively, with
probability 1 − λ, we decide to use an unknown topic model to generate a word. This
is a mixture model because there are two distributions that are mixed together, and
we actually don’t know when each distribution is used. We can treat this feedback
mixture model as a single distribution in that we can still ask it to generate words,
and it will still give us a word in a random way (according to the underlying models).
Which word will show up depends on both the topic distribution and background
distribution. In addition, it would also depend on the mixing parameter λ; if λ is
high, it’s going to prefer the background distribution. Conversely, if λ is very small,
we’re going to use only our topic words.
Once we’re thinking this way, we can do exactly the same as what we did before
by using MLE to adjust this model and set the parameters to best explain the data.
The difference, however, is that we are not asking the unknown topic model alone
to explain all the words; rather, we’re going to ask the whole mixture model to
explain the data. As a result, it doesn’t have to assign high probabilities to words like
the, which is exactly what we want. It would then assign high probabilities to other
words that are common in the topic distribution but not having high probability
in the background distribution. As a result, this topic model must assign high
probabilities to the words common in the feedback documents yet not common
across the whole collection.
Mathematically, we have to compute the log likelihood of the feedback docu-
ments F with another parameter λ, which denotes noise in the feedback
documents. We assume it will be fixed to some value. Assuming it’s fixed, then
we only have word probabilities θ as parameters, just like in the simplest unigram
language model. This gives us the following log likelihood to maximize when estimating the
feedback language model:

log p(F | θ) = Σ_{d∈F} Σ_{w∈V} c(w, d) log [ (1 − λ) p(w | θ) + λ p(w | C) ],

and θF is the topic model θ that maximizes it.
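One standard way to maximize this likelihood is the Expectation-Maximization (EM) algorithm. The sketch below is a minimal version under the assumptions in the text (a fixed λ and unigram models); the function names, the tiny floor on background probabilities, and the number of iterations are illustrative choices, not part of the model itself.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_background, lam=0.5, iters=30):
    """EM for the two-component mixture: each word occurrence in the feedback
    documents F comes from the background model p(w|C) with probability lam,
    and from the unknown topic model p(w|theta_F) with probability 1 - lam."""
    counts = Counter()
    for doc in feedback_docs:          # each doc is a list of tokens
        counts.update(doc)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}      # uniform initialization
    for _ in range(iters):
        # E-step: probability that an occurrence of w was generated by the topic model
        z_topic = {}
        for w in vocab:
            topic = (1 - lam) * theta[w]
            background = lam * p_background.get(w, 1e-10)   # tiny floor for safety
            z_topic[w] = topic / (topic + background)
        # M-step: re-estimate theta_F from the counts attributed to the topic model
        weighted = {w: counts[w] * z_topic[w] for w in vocab}
        total = sum(weighted.values())
        theta = {w: weighted[w] / total for w in vocab}
    return theta

# theta_F is then interpolated with the original query model:
#   p(w | theta_Q') = (1 - alpha) * p(w | theta_Q) + alpha * p(w | theta_F)
```

Because common words like the are largely explained by the background component, the estimated θF concentrates its mass on words that are frequent in the feedback documents but not in the collection, which is exactly the behavior described above.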
Exercises
7.1. How should you set the Rocchio parameters α, β , and γ depending on what
type of feedback you are using? That is, should the parameters be set differently if you are
using explicit relevance feedback, pseudo relevance feedback, or implicit feedback?
7.2. Imagine you are in charge of a large search-engine company. What other
strategies could you devise to get relevance judgments from users?
7.3. Say one of your new strategies is to measure the amount of time t a user spends
on each search result document. How can you incorporate this t for each document
into a feedback measure for a particular query?
7.5. Implement mixture model feedback for language models in META. Use which-
ever method is most convenient to estimate θF . Or, compare different estimation
methods for θF .
7.6. After implementing Rocchio pseudo feedback, index a dataset with relevance
judgements. Plot MAP (see Chapter 9) across different values of k. Do you see any
trends?
7.7. After implementing mixture model feedback, index a dataset with relevance
judgements. Plot MAP (see Chapter 9) across different values of the mixing param-
eter α. Do you see any trends?
7.8. Design a heuristic to automatically determine the best k for pseudo feedback
on a query-by-query basis. You could look at the query itself, the number of match-
ing documents, or the distribution of ranking scores in the original results. Test
your heuristic by doing experiments.
7.9. Design a heuristic to automatically determine the best α for mixture model
feedback on a query-by-query basis. You could look at the query itself, the number
of matching documents, or the distribution of ranking scores in the original results.
Test your heuristic by doing experiments.
7.13. In a real search system, storing modified query vectors for all observed
queries will take up a large amount of space. How could you optimize the amount
of space required? What kind of solutions provide a tradeoff between space and
query time? How about an online system that benefits the majority of users or the
majority of queries?
8
Search Engine Implementation
This chapter focuses on how to implement an information retrieval (IR) system or
a search engine. In general, an IR system consists of four components.
For the first three items, there are fairly standard techniques that are essentially
used in all current search engines. The techniques for implementing feedback,
however, highly depend on the learning approaches and applications. Despite this,
we did discuss some common methods for feedback in the previous chapter.
We will additionally investigate two additional optimizations. These are not
required to ensure the correctness of an information retrieval system, but they will
enable such a system to be much more efficient in both speed and disk usage.
The following sections in this chapter discuss each of the above components
in turn.
8.1 Tokenizer
Document tokenization is the first step in any text mining task. This determines
how we represent a document. We saw in the previous chapter that we often rep-
resent documents as document vectors, where each index corresponds to a single
word. The value stored in the index is then a raw count of the number of occurrences
of that word in a particular document.
When running information retrieval scoring functions on these vectors, we usu-
ally prefer some alternate representation of term count, such as smoothed term
count, or TF-IDF weighting. In real search engine systems, we often leave the term
scoring up to the index scorer module. Thus, in tokenization we will simply use
the raw count of features (words), since the raw count can be used by the scorer to
calculate some weighted term representation. Additionally, calculating something
like TF-IDF is more involved than a simple scanning of a single document (since we
need to calculate IDF). Furthermore, we’d like our scorer to be able to use different
1. As we will discuss, the string terms themselves are almost always represented as term IDs, and
most of the processing on “words” is done on integer IDs instead of strings for efficiency.
scoring functions as necessary; storing only TF-IDF weight would then require us
to always use TF-IDF weighting.
Therefore, a tokenizer’s job is to segment the document into countable features
or tokens. A document is then represented by how many and what kind of tokens
appear in it. The raw counts of these tokens are used by the scorer to formulate the
retrieval scoring functions that we discussed in the previous chapter.
The most basic tokenizer we will consider is a whitespace tokenizer. This tok-
enizer simply delimits words by their whitespace. Thus,
could result in
A slightly more advanced unigram words tokenizer could first lowercase the
sentence and split the words based on punctuation. There is a special case here
where the period after Mr. is not split (since it forms a unique word):
{1, 1, 1, 1, 1, 2, 1, 1}.
Of course, a real document vector would be much larger and much sparser—that
is, most of the dimensions will have a count of zero.
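The sketch below shows one possible unigram-words tokenizer of the kind described above: lowercase the text, split on whitespace and punctuation, keep a few known abbreviations (such as "mr.") intact, and count the tokens into a sparse vector. The abbreviation list and the regular expression are simplifying assumptions, not the book's exact rules.

```python
import re
from collections import Counter

ABBREVIATIONS = {"mr.", "mrs.", "dr."}   # illustrative special cases kept as single tokens

def tokenize(text):
    """Lowercase, then split into word tokens, keeping known abbreviations whole."""
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(raw)
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", raw))   # strip punctuation
    return tokens

def to_sparse_vector(text):
    """Map a document to {term: raw count}; weighting is left to the index scorer."""
    return Counter(tokenize(text))

print(to_sparse_vector("Mr. Smith visited the white house near the river."))
```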
This process is also called feature generation. It defines the building blocks
of our document objects and gives us meaningful ways to compare them. Once
we define how to conceptualize documents, we can index them, cluster them,
and classify them, among many other text mining tasks. As mentioned in the
Introduction, tokenization is perhaps the most critical component of our indexer,
since all downstream operations depend on its output.
8.2 Indexer
Modern search engines are designed to be able to index data that is much larger
than the amount of system memory. For example, a Wikipedia database dump is
about 40 GB of uncompressed text. At the time of writing this book, this is much
larger than the amount of memory in common personal systems, although it is
quite a common dataset for computer science researchers. TREC research datasets
may even be as large as several terabytes. This doesn’t even take into account real-
world production systems such as Google that index the entire Web.
This requires us to design indexing systems that only load portions of the raw
corpus in memory at one time. Furthermore, when running queries on our indexed
files, we want to ensure that we can return the necessary term statistics fast enough
to ensure a usable search engine. Scanning over every document in the corpus to
match terms in the query will not be sufficient, even for relatively small corpora.
An inverted index is the main data structure used in a search engine. It allows
for quick lookup of documents that contain any given term. The relevant data
structures include (1) the lexicon (a lookup table of term-specific information, such
as document frequency and where in the postings file to access the per-document
term counts) and (2) the postings file (mapping from any term integer ID to a list
of document IDs and frequency information of the term in those documents).
Term ID: 56
Document frequency: 78
Total number of occurrences: 443
Offset into postings file: 8923754
Of course, the actual lexicon would just store 56 → {78, 443, 8923754}. Since
the tokenizer assigned term IDs sequentially, we could represent the lexicon as
a large array indexed by term ID. Each element in the large array would store tuples
of (document frequency, total count, offset) information. If we seek to position
8,923,754 in the large postings file, we could see something like
Term ID: 56
Doc ID: 4, count: 1, position 56
Doc ID: 7, count: 9, position 4, position 89, position...
Doc ID: 24, count: 19, position 1, position 67, position...
which is the counts and position information for the 78 documents that term ID 56
appears in. Notice how the doc IDs (and positions) are stored in increasing order;
this is a fact we will take advantage of when compressing the postings file. Also
make note of the large difference in size of the lexicon and postings file. For each
entry in the lexicon, we know we will only store three values per term. In the postings
file, we store at least three values (doc ID, count, positions) for each document that
the term appears in. If the term appears in all documents, we’d have a list of the
length of the number of documents in the corpus. This is true for all unique terms.
For this reason, we often assume that the lexicon can fit into main memory and the
postings file resides on disk, and is seeked into based on pointers from the lexicon.
Indexing is the process of creating these data structures based on a set of tok-
enized documents. A popular approach for indexing is the following sorting-based
approach.
Figure 8.2 shows how documents produce terms originally in document ID order.
The terms from multiple documents are then sorted by term ID in small postings
chunks that fit in memory before they are flushed to the disk.
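An in-memory sketch of the resulting data structures is given below; a production indexer would write sorted postings chunks to disk and merge them, and would store byte offsets in the lexicon, but the layout of the lexicon and postings is the same. The function and variable names are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of token lists; the list position is the document ID.
    Returns (term_ids, lexicon, postings) where
      term_ids[token]   = integer term ID, assigned sequentially on first appearance
      lexicon[term_id]  = (document frequency, total number of occurrences)
      postings[term_id] = [(doc_id, count), ...] with doc IDs in increasing order."""
    term_ids = {}
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        counts = defaultdict(int)
        for tok in tokens:
            term_id = term_ids.setdefault(tok, len(term_ids))
            counts[term_id] += 1
        for term_id, count in counts.items():
            postings[term_id].append((doc_id, count))   # doc IDs arrive in increasing order
    lexicon = {tid: (len(plist), sum(c for _, c in plist))
               for tid, plist in postings.items()}
    return term_ids, lexicon, postings
```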
A forward index may be created in a very similar way to the inverted index. Instead
of mapping terms to documents, a forward index maps documents to a list of
terms that occur in them. This type of setup is useful when doing other operations
aside from search. For example, clustering or classification would need to access
an entire document’s content at once. Using an inverted index to do this is not
efficient at all, since we’d have to scan the entire postings file to find all the terms
that occur in a specific document. Thus, we have the forward index structure that
records a term vector for each document ID.
In the next section, we’ll see how using the inverted term-to-document mapping
can greatly decrease query scoring time. There are other efficiency aspects that are
relevant to the forward index as well, such as compression and caching.
8.3 Scorer
Now that we have our inverted index, how can we use it to efficiently score queries?
Imagine for a moment that we don’t have an inverted index; we only have the
forward index, which maps document IDs to a list of terms that occur in them. To
score a query vector, we’d need to iterate through every single entry (i.e., document)
in the forward index and run a scoring function on each (document, query) pair.
Most likely, many documents do not contain any of the query terms (especially
if stop word removal is performed), which means that their ranking score will be
zero. Why should we bother scoring these documents anyway? This is exactly how
we can benefit from an inverted index: we score only documents that match at least
one query term—that is, only documents that can have a nonzero score. We assume
(and verify in practice) that scoring only the documents containing query terms
requires much less scoring computation. This leads us to our first scoring algorithm
using the inverted index: look up the postings list of each query term, accumulate
scores for the documents that appear in those lists, sort the scored documents, and
return (usually) the top k. Again, we save time in this sorting operation by sorting
only the documents that contained a query term, as opposed to sorting every single
document in the index even if its score is zero.
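One possible realization of this idea is a term-at-a-time accumulator, sketched below; the in-memory index type, the placeholder score_term function, and all names here are illustrative assumptions rather than the way META's rankers are actually organized.

#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Posting { uint64_t doc_id; uint64_t count; };

// Hypothetical per-term contribution of the ranking function (e.g., BM25 or
// query likelihood in a real system); here just the raw term frequency.
double score_term(uint64_t /*term_id*/, const Posting& p) {
    return static_cast<double>(p.count);
}

// Score only documents that appear in some query term's postings list, then
// sort the accumulated scores and return the top k (doc ID, score) pairs.
std::vector<std::pair<uint64_t, double>>
score_query(const std::vector<uint64_t>& query_terms,
            const std::unordered_map<uint64_t, std::vector<Posting>>& index,
            size_t k) {
    std::unordered_map<uint64_t, double> acc;  // doc_id -> partial score
    for (uint64_t t : query_terms) {
        auto it = index.find(t);
        if (it == index.end()) continue;
        for (const Posting& p : it->second) acc[p.doc_id] += score_term(t, p);
    }
    std::vector<std::pair<uint64_t, double>> results(acc.begin(), acc.end());
    std::sort(results.begin(), results.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    if (results.size() > k) results.resize(k);
    return results;
}

Only the documents that ever received a partial score appear in the accumulator, which is exactly the saving described above.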
When the scoring work is distributed (for example, across partitions of the index),
the partial results are then merged together. This type of algorithm design—distributing
the work and then merging the results—is a very common paradigm called MapReduce.
We will discuss its generality and many other applications in future chapters.
8.4 Feedback Implementation
Chapter 7 discussed feedback in a standard information retrieval system. We saw
two implementations of feedback: the vector space Rocchio feedback and the query
likelihood mixture model for feedback.
Both can be implemented with the inverted index and document metadata we’ve
described in the previous sections. For Rocchio feedback, we can use the forward
index to obtain the vectors of both the query and feedback documents, running
the Rocchio algorithm on that set of vectors. The mixture model feedback method
requires a language model to be learned over the feedback documents; again, this
can be achieved efficiently by using the term counts from the forward index. The
only other information needed is the corpus background probabilities for each
term, which can be stored in the term lexicon.
With this information, it is now possible to create an online (or “in-memory”)
pseudo-feedback method. Recall that pseudo-feedback looks at the top k returned
documents from search and assumes they are relevant. The following process could
be used to enable online feedback.
. Run the user’s original query.
. Use the top k documents and the forward index to either modify the query vec-
tor (Rocchio) or estimate a language model and interpolate with the feedback
model (query likelihood).
. Rerun the search with the modified query and return the new results to the
user.
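To illustrate the Rocchio branch of this process, here is a minimal sketch that blends the original query vector with the centroid of the top-k document vectors taken from the forward index. The sparse-vector representation and the alpha/beta weights are assumptions for illustration, not the toolkit's implementation.

#include <cstdint>
#include <unordered_map>
#include <vector>

using SparseVector = std::unordered_map<uint64_t, double>;  // term_id -> weight

// Rocchio-style pseudo-feedback: combine the original query vector with the
// centroid of the top-k (assumed relevant) document vectors.
SparseVector rocchio_feedback(const SparseVector& query,
                              const std::vector<SparseVector>& top_docs,
                              double alpha, double beta) {
    SparseVector modified;
    for (const auto& [term, weight] : query) modified[term] += alpha * weight;
    if (!top_docs.empty()) {
        for (const SparseVector& doc : top_docs)
            for (const auto& [term, weight] : doc)
                modified[term] += beta * weight / top_docs.size();
    }
    return modified;
}
// The modified query is then rerun against the inverted index, as in step 3 above.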
There are both advantages and disadvantages to this simple feedback model.
For one, it requires very little memory and disk storage to implement since each
modified query is “forgotten” as soon as the new results are returned. Thus, we
don’t need to create any additional storage structures for the index.
The downside is that all the processing is done at query time, which could
be quite computationally expensive, especially when using a search engine with
many users. The completely opposite tradeoff is to store every modified query in a
database, and look up its expanded form, running the search function only once. Of
course, this is infeasible as the number of queries would quickly make the database
explode in size, not to mention that adding more documents to the index would
invalidate the stored query vectors (since the new documents might also match the
original query).
In practice, we can compromise between these two extremes, e.g., by storing only the
expansions of very frequent queries, or by using query similarity to look up a similar
query that has already been saved. The caching techniques discussed in a later section
also apply to feedback methods; consider how to adapt them from caching terms to
caching expanded queries.
Of course, this only touches on the pseudo-feedback method. There is also
clickthrough data, which can be stored in a database, and relevance judgements,
which can be stored the same way. Once we know which documents we’d like to
include in the chosen feedback method, all implementations are the same since
they deal with a list of feedback documents.
8.5 Compression
Another technical component in a retrieval system is integer compression, which
is applied to compress the very large postings file. A compressed index is not only
smaller, but also faster when it’s loaded into main memory. The general idea of
compressing integers (and compression in general) is to exploit the non-uniform
distribution of values. Intuitively, we will assign a short code to values that are
frequent at the price of using longer codes for rare values. The optimal compression
rate is related to the entropy of the random variable taking the values that we
consider—skewed distributions would have lower entropy and are thus easier to
compress.
It is important that all of our compression methods support random-access decoding;
that is, we would like to seek to a particular position in the postings file and start
decompressing without having to decompress all the preceding data.
Because inverted index entries are stored sequentially, we may exploit this fact
to compress document IDs (and position information) based on their gaps. The
document IDs would otherwise be distributed relatively uniformly, but the distri-
bution of their gaps would be skewed since when a term is frequent, its inverted list
would have many document IDs leading to many small gaps. Consider the following
example of a list of document IDs:
{23, 25, 34, 35, 39, 43, 49, 51, 57, 59, . . .}.
Instead of storing these exact numbers, we can store the offsets between them; this
creates many smaller numbers, which are easier to compress since they take fewer bits
to represent:
{23, 2, 9, 1, 4, 4, 6, 2, 6, 2, . . .}.
To get the actual document ID values, simply add the offset to the previous value.
So the first ID is 23 and the second is 23 + 2 = 25. The third is 25 + 9 = 34, and
so on.
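A minimal sketch of gap encoding and decoding for a sorted list of document IDs (illustrative only):

#include <cstdint>
#include <vector>

// Convert sorted doc IDs {23, 25, 34, ...} into gaps {23, 2, 9, ...}.
std::vector<uint64_t> to_gaps(const std::vector<uint64_t>& doc_ids) {
    std::vector<uint64_t> gaps;
    uint64_t prev = 0;
    for (uint64_t id : doc_ids) {
        gaps.push_back(id - prev);
        prev = id;
    }
    return gaps;
}

// Recover the doc IDs by keeping a running sum of the gaps.
std::vector<uint64_t> from_gaps(const std::vector<uint64_t>& gaps) {
    std::vector<uint64_t> doc_ids;
    uint64_t sum = 0;
    for (uint64_t g : gaps) {
        sum += g;
        doc_ids.push_back(sum);
    }
    return doc_ids;
}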
In this section, we will discuss several types of compression, which may or may not
operate on gap-encoded values: unary encoding, γ- and δ-encoding, and block
compression schemes such as vByte. We begin with unary encoding, which represents
a positive integer n as n − 1 zeros followed by a single one:
1→1
2 → 01
3 → 001
4 → 0001
5 → 00001
19 → 0000000000000000001
Note that we can’t encode the number zero—this is true of most other methods
as well. An example of a unary-encoded sequence is
000100100010000000101000100001 = 4, 3, 4, 8, 2, 4, 5.
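A sketch of unary encoding and decoding as shown above; for readability the bits are kept in a string rather than packed into bytes.

#include <cstdint>
#include <string>
#include <vector>

// Encode each n >= 1 as (n - 1) zeros followed by a single one bit.
std::string unary_encode(const std::vector<uint64_t>& values) {
    std::string bits;
    for (uint64_t n : values)
        bits += std::string(n - 1, '0') + '1';
    return bits;
}

// Decode by counting bits until the terminating one bit of each number.
std::vector<uint64_t> unary_decode(const std::string& bits) {
    std::vector<uint64_t> values;
    uint64_t run = 0;
    for (char b : bits) {
        ++run;
        if (b == '1') { values.push_back(run); run = 0; }
    }
    return values;
}
// unary_decode("000100100010000000101000100001") yields {4, 3, 4, 8, 2, 4, 5}.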
A drawback of these bitwise schemes is the number of operations required to encode
and decode. Every single bit needs to be read in order to read
one integer. Block compression attempts to alleviate this issue by reading bytes at
a time instead of bits. In block compression schemes, only one bitwise operation
per byte is usually required as opposed to at least one operation per bit in the pre-
vious three schemes (e.g., count how many bit operations are required to δ-encode
the integer 47).
Block compression seeks to reduce the number of CPU instructions required
in decoding at the expense of using more storage. The two block compression
methods we will investigate deal mainly with bytes instead of bits.
vByte stands for variable byte encoding. It uses the lower seven bits of a byte to
store a binary number and the most significant bit as a flag. The flag signals whether
the decoder should keep reading the binary number. The parentheses below are
added for emphasis.
1 → (0)0000001
2 → (0)0000010
19 → (0)0010011
47 → (0)0101111
127 → (0)1111111
128 → (1)0000000(0)0000001
194 → (1)1000010(0)0000001
678 → (1)0100110(0)0000101
20, 000 → (1)0100000(1)0011100(0)0000001
The decoder works by keeping a sum (which starts at zero) and adding each byte's
payload into the sum as it is processed. Notice how the bytes are "chained" together
backwards: for every "link" we follow, we left-shift the payload (the lower seven bits)
by 7 × k, where k is the number of bytes read so far. Therefore, the sum used to decode
the integer 20,000 is
32 + (28 × 2^7) + (1 × 2^14) = 32 + 3,584 + 16,384 = 20,000.
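A sketch of vByte encoding and decoding following the convention in the examples above: the payload occupies the low seven bits, the least significant group comes first, and the high bit is set on every byte except the last byte of a number. This is an illustration, not META's implementation.

#include <cstdint>
#include <vector>

// Encode n as a sequence of bytes, least significant 7 bits first. The high
// bit is 1 if more bytes follow and 0 on the final byte of the number.
void vbyte_encode(uint64_t n, std::vector<uint8_t>& out) {
    while (n >= 128) {
        out.push_back(static_cast<uint8_t>(0x80 | (n & 0x7F)));
        n >>= 7;
    }
    out.push_back(static_cast<uint8_t>(n));  // high bit 0 terminates the number
}

// Decode one integer starting at position pos; pos is advanced past it.
uint64_t vbyte_decode(const std::vector<uint8_t>& in, size_t& pos) {
    uint64_t sum = 0;
    unsigned shift = 0;
    while (true) {
        uint8_t byte = in[pos++];
        sum |= static_cast<uint64_t>(byte & 0x7F) << shift;  // payload shifted by 7 * k
        if ((byte & 0x80) == 0) return sum;  // flag bit 0: last byte of this number
        shift += 7;
    }
}

For example, encoding 20,000 produces the three bytes (1)0100000, (1)0011100, (0)0000001, and decoding sums 32 + (28 << 7) + (1 << 14) = 20,000.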
The second block compression method operates on chunks of document IDs. Suppose a
chunk contains the sorted IDs {45, 47, 51, 59, 63, 64, 70, 72}. We transform this block
by subtracting the minimum value in the list (45) from each element:
{0, 2, 6, 14, 18, 19, 25, 27}.
Up to this point, this is similar to the gap encoding discussed previously. However,
instead of encoding each value with a bitwise compression such as γ - or δ-encoding,
we will simply use binary with the smallest number of bits possible. Since the
maximum number in this chunk is 27, we will store each of the integers in 5 bits:
{00000, 00010, 00110, 01110, 10010, 10011, 11001, 11011}, min = 45, bits = 5.
45, 5, 0000000010001100111010010100111100111011,
where 45 and 5 could be stored however is convenient (e.g., fixed binary length).
This method also reduces the number of bitwise operations, since we read chunks
of 5 bits and add them to the base value 45 to recreate the sequence. Another
nice side effect is that we know the minimum value and maximum possible value
stored in this chunk; therefore, if we are looking for a particular integer (say for a
document ID), we know we can skip this chunk and not decompress it if ID < 45
or ID > 45 + 2^5.
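A sketch of this chunk scheme: subtract the minimum, compute the number of bits needed for the largest offset, and pack every offset in that many bits. The names and the std::vector<bool> bit buffer are illustrative; a real implementation would pack the bits into bytes on disk.

#include <algorithm>
#include <cstdint>
#include <vector>

struct PackedChunk {
    uint64_t min_value;       // e.g., 45
    unsigned bits_per_value;  // e.g., 5
    std::vector<bool> bits;   // packed offsets, bits_per_value bits each
};

// Pack one non-empty chunk of sorted document IDs.
PackedChunk pack_chunk(const std::vector<uint64_t>& ids) {
    PackedChunk chunk;
    chunk.min_value = *std::min_element(ids.begin(), ids.end());
    uint64_t max_offset = *std::max_element(ids.begin(), ids.end()) - chunk.min_value;
    chunk.bits_per_value = 1;
    while ((1ULL << chunk.bits_per_value) <= max_offset) ++chunk.bits_per_value;
    for (uint64_t id : ids) {
        uint64_t offset = id - chunk.min_value;
        for (int b = chunk.bits_per_value - 1; b >= 0; --b)
            chunk.bits.push_back((offset >> b) & 1);  // most significant bit first
    }
    return chunk;
}

// While searching for a document ID, a whole chunk can be skipped without
// decompression if the ID lies outside [min_value, min_value + 2^bits_per_value - 1].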
8.6 Caching
While we designed our inverted index structure to be very efficient, we still have
the issue of disk latency. For this reason, it is very common for a real-world search
engine to also employ some sort of caching structure stored in memory for postings
data objects. In this section, we’ll overview two different caching strategies.
The basic idea of a cache is to make frequently accessed objects fast to acquire.
To accomplish this, we attempt to keep the frequently accessed objects in memory
so we don’t need to seek to the disk. In our case, the objects we access are postings
lists. Due to Zipf’s law [Zipf 1949], we expect that a relatively small number of
postings lists (keyed by term ID) will be accessed the majority of the time. But how
do we know which terms to store? Even if we knew which terms are most queried,
how do we set a cutoff for memory consumption?
The two cache designs we describe address both these issues. That is: (1) the
most-frequently used items are fast to access, and (2) we can set a maximum size
for our cache so it doesn’t get too large.
The first design is the LRU (least recently used) cache. We want searching the cache
to be fast, so we use a hash table to store term IDs
as keys and postings lists as values. This enables O(1) amortized insert, find, and
delete operations. The interesting part of the LRU cache is how to determine the
access order of the objects. To do this, we link together the values as a doubly linked
list. Once an element is inserted or accessed, it is moved to the head (front) of the
list in O(1) time. When we need to remove the LRU item, we look at the element at
the tail (end) of the linked list and delete it, also in constant time.
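A minimal sketch of such an LRU cache, using std::unordered_map for constant-time lookup and std::list as the doubly linked recency list. It is deliberately simplified (for example, it is not thread safe) and is not META's code.

#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// LRU cache mapping term IDs to postings-list objects.
class LRUCache {
    using Key = uint64_t;
    using Value = std::vector<uint8_t>;       // e.g., a compressed postings list
    size_t capacity_;
    std::list<std::pair<Key, Value>> items_;  // front = most recently used
    std::unordered_map<Key, std::list<std::pair<Key, Value>>::iterator> table_;

public:
    explicit LRUCache(size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    const Value* find(Key key) {
        auto it = table_.find(key);
        if (it == table_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // move to front, O(1)
        return &it->second->second;
    }

    void insert(Key key, Value value) {
        auto it = table_.find(key);
        if (it != table_.end()) {               // already cached: refresh it
            it->second->second = std::move(value);
            items_.splice(items_.begin(), items_, it->second);
            return;
        }
        items_.emplace_front(key, std::move(value));
        table_[key] = items_.begin();
        if (items_.size() > capacity_) {        // evict the least recently used item
            table_.erase(items_.back().first);
            items_.pop_back();
        }
    }
};

Using std::list::splice keeps both the "move to front" and the tail eviction constant time, matching the description above.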
The second design is the DBLRU cache, which maintains two hash tables: a primary
map and a secondary map. This cache has a rough hierarchy of usage: the primary map
contains elements that are more frequently accessed than the secondary map, and when
the cache fills, the secondary table is emptied to free memory. While the temporal
accuracy of the DBLRU cache
is not as precise as the LRU cache, it is a much simpler setup, which translates to
faster access times. As usual, there is a tradeoff between the speed and accuracy of
these two caches.
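One way the two-map design could look is sketched below. The assumptions here, namely that newly touched items are written into the primary map and that a full primary map replaces the secondary map (dropping the old secondary), are an illustration of the idea; META's dblru_cache may differ in its details.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// A two-map cache in the spirit of the DBLRU design described above.
class TwoMapCache {
    using Key = uint64_t;
    using Value = std::vector<uint8_t>;
    size_t max_primary_size_;
    std::unordered_map<Key, Value> primary_, secondary_;

public:
    explicit TwoMapCache(size_t max_primary_size) : max_primary_size_(max_primary_size) {}

    // Look in the primary map first, then fall back to the secondary map.
    const Value* find(Key key) const {
        auto it = primary_.find(key);
        if (it != primary_.end()) return &it->second;
        auto it2 = secondary_.find(key);
        return it2 != secondary_.end() ? &it2->second : nullptr;
    }

    // New (or re-accessed) items are written into the primary map. When the
    // primary map grows past the limit, the old secondary map is dropped and
    // the primary map becomes the new secondary map.
    void insert(Key key, Value value) {
        primary_[key] = std::move(value);
        if (primary_.size() > max_primary_size_) {
            secondary_ = std::move(primary_);  // frees the old secondary map
            primary_.clear();
        }
    }
};

There is no per-item bookkeeping at all, which is why this design is simpler and faster than a true LRU cache at the cost of coarser recency information.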
Exercises
8.1. Why is using an inverted index to score documents preferred over a more naive
solution?
8.2. For the following values, explain whether we can efficiently get the value from
a default (standard) inverted index postings file. By efficiently, we mean with one
lookup—not scanning the entire index.
8.3. For values in the previous question that could not be extracted efficiently
from the postings file, either explain what auxiliary structures need to be searched
or why it isn’t possible to retrieve the value efficiently.
8.4. Imagine that each document is tagged with some sentiment score (either
positive or negative). Outline a method on how you could score documents if you
only wanted to return documents of a certain sentiment on a per-query basis.
8.5. Compress the following sequence of integers using each of the compression
methods discussed:
8.6. According to Zipf’s law, which of the following strategies is more effective for
reducing the size of an inverted index? (1) Remove all words that appear 10 times
or less or (2) remove the top 10 most frequent words.
8.7. What type of integer compression is supported in META? Write one of the
implementations discussed in this chapter that does not exist in the toolkit.
8.8. Which type of scoring (document- vs. term-at-a-time) does META use? Hint:
see the file meta/src/index/ranker/[Link].
8.9. Implement the other type of query scoring and compare its runtime to the
existing method in META.
8.10. META has an implementation of a DBLRU cache. Experiment with the cache
size parameter and plot the retrieval speed against different cache sizes while
holding any other variables constant.
8.11. META does not currently have an implementation of an LRU cache as dis-
cussed in this chapter. Can you write one and compare its performance to the
DBLRU cache in terms of memory usage and speed?
9 Search Engine Evaluation
This chapter focuses on the evaluation of text retrieval (TR) systems. In the previous
chapter, we talked about a number of different TR methods and ranking functions,
but how do we know which one works the best? In order to answer this question,
we have to compare them, and that means we’ll have to evaluate these retrieval
methods. This is the main focus of this chapter. We start out with the methodology
behind evaluation. Then, we compare the retrieval of sets with the retrieval of
ranked lists as well as judgements with multiple levels of relevance. We end with
practical issues in evaluation, followed by exercises.
9.1 Introduction
Let’s think about why we have to do evaluation. There are two main reasons;
the first is that we have to use evaluation to figure out which retrieval method
works the best. This is very important for advancing our knowledge, otherwise we
wouldn’t know whether a new idea works better than an old idea. Previously in this
book (Chapter 6), we discussed the problem of text retrieval and compared it with
database retrieval; unlike a database query, which has a well-defined correct answer,
the relevance of a document in text retrieval is ultimately judged by users. Search
engine evaluation must therefore rely on users, which makes it a very challenging
problem: we must determine how we can get users involved and still draw a fair
comparison of different methods.
The second reason to perform evaluation is to assess the actual utility of an
overall TR system (as opposed to specific methods). Imagine you’re building your
own applications; you would be interested in knowing how well your search engine
works for your users. In this case, measures must reflect the utility to the actual
users in the real application as opposed to measures on each individual retrieval
result. Typically, this has been done via user studies—where human users inter-
act with the corpus via the system. In the case of comparing different methods, the
measures we use all need to be correlated with the utility to the users. The measures
only need to be good enough to determine which method works better. This is usu-
ally done by using a test collection, which is a main idea that we’ll be talking about
168 Chapter 9 Search Engine Evaluation
in this chapter. This has been very important for comparing different algorithms
and for improving search engine systems in general. There are three main criteria for
evaluating a TR system.
. Effectiveness or accuracy. How accurate are the search results? Here we are
measuring a system's capability of ranking relevant documents above non-relevant ones.
. Efficiency. How quickly can a user get the results, and how large are the computing
resources needed to answer a query? Here we need to measure the space and time
overhead of the system.
. Usability. How useful is the system for real user tasks? Here, interfaces and many
other things are also important, and we typically have to do user studies.
In this book, we’re going to talk mainly about the effectiveness and accu-
racy measures because the efficiency and usability dimensions are not unique
to search engines (they are needed for evaluating other software systems). There
is also very good coverage of such material in other books, so we suggest the
reader consult Harman [2011] for further reading in this area. Additional readings
are Sanderson [2010] and Kelly [2009], which cover user studies and A-B testing
(concepts that are discussed later in this chapter).
The Cranfield evaluation methodology relies on a test collection: a document
collection, a set of sample queries, and a set of relevance judgments. These are
judgments of which documents should be returned for which
queries. Ideally, they have to be made by users who formulated the queries because
those are the people who know exactly what the documents (search results) would
be used for. Finally, we have to have measures to quantify how well a system’s result
matches the ideal ranked list that would be constructed based on users’ relevance
judgements.
This methodology is very useful for evaluating retrieval algorithms because the
test set can be reused many times. It also provides a fair comparison for all the
methods, since the evaluation is exactly the same for each one. That is, we have the
same criteria, same corpus, and same relevance judgements to compare the differ-
ent algorithms. This allows us to compare a new algorithm with an old algorithm
that was invented many years ago by using the same approach.
In Figure 9.1, we illustrate how the Cranfield evaluation methodology works. As
mentioned, we need a set of queries that are shown here. We have Q1 , Q2 , and so
on. We also need the document collection, D1 , D2 , . . ., and on the far right side of
the figure, we have the relevance judgments which are plus or minus annotations
on each document specifying whether it is relevant (plus) or not relevant (minus).
Essentially, these are binary judgments of documents with respect to a specific
query since there are only two levels of relevance. For example, D1 and D2 are
judged as being relevant to Q1. D3 is judged as non-relevant with respect to Q1.
These judgments are created by users, ideally the ones who formulated the queries. Once we
have these judgements, we can compare two or more systems. Each query is run on
each system, and we investigate the documents that each system returns.
Let’s say the query is Q1. In the figure we have RA as ranked results from system A
and RB as the ranked results from system B. Thus, RA is system A’s approximation
of relevant documents and RB is system B’s approximation. Let’s take a look at
these results—which is better? As a user, which one would you like? There are
some differences and there are some documents that are returned by both systems.
But if you look at the results you will feel that maybe system A is better in the
sense that we don’t have that many documents returned, and among the three
documents returned two of them are relevant. That’s good; system A is precise. On
the other hand, we can also say system B is better because it returned more relevant
documents; it returned three instead of two. So which one is better and how do we
quantify this? This question highly depends on a user’s task, and it depends on the
individual users as well! For some users, maybe system A is better if the user is not
interested in getting all the relevant documents so he or she doesn’t have to read
too much. On the other hand, imagine a user might need to have as many relevant
documents as possible, for example, in writing a literature survey. You might be
in the second category, and then you might find that system B is better. In either
case, we’ll have to also define measures that would quantify the information need
of a user. We will need to define multiple measures because users have different
perspectives when looking at the results.
We have only seen three relevant documents there, but we can imagine there are
other documents judged for this query. Intuitively we thought that system A is better
because it did not have much noise. In particular we have seen, out of three results,
two are relevant. On the other hand, in system B we have five results and only three
of them are relevant. Based on this, it looks like system A is more accurate. This can
be captured by the measure of precision, where we simply compute to what extent
all the retrieval results are relevant. 100% precision would mean all the retrieved
documents are relevant. Thus in this case, system A has a precision of 2/3 ≈ 0.67.
System B has a precision of 3/5 = 0.60. This shows that system A is better according
to precision.
But we also mentioned that system B might be preferred by some other users who
wish to retrieve as many relevant documents as possible. So, in that case we have
to compare the number of total relevant documents to the number that is actually
retrieved. This is captured in another measure called recall, which measures the
completeness of coverage of relevant documents in your retrieval result. We assume
that there are ten relevant documents in the collection. Here we’ve got two of them
from system A, so the recall is 2/10 = 0.20. System B has 3/10 = 0.30. Therefore,
system B is better according to recall.
These two measures are the very basic foundation for evaluating search engines.
They are very important because they are also widely used in many other evaluation
problems. For example, if you look at the applications of machine learning you
tend to see precision and recall numbers being reported for all kinds of tasks.
Now, let's define these two measures more precisely and see how they are used to
evaluate a set of retrieved documents; that is, we treat the retrieval result as an
approximation of the set of relevant documents. We can distinguish the results in
four cases, depending on the situation of a document, as shown in Figure 9.2.
A document is either retrieved or not retrieved since we’re talking about the set of
results. The document can be also relevant or not relevant, depending on whether
the user thinks this is a useful document. We now have counts of documents in each
of the four categories. We can have a represent the number of documents that are
retrieved and relevant, b for documents that are not retrieved but relevant, c for
documents that are retrieved but not relevant, and d for documents that are both
not retrieved and not relevant. With this table, we have defined precision as the ratio
of the relevant retrieved documents a to the total number of retrieved documents
a + c: precision = a/(a + c). In this case, the denominator is all the retrieved
documents. Recall is defined by dividing a by the sum of a and b, where a + b is the
total number of relevant documents: recall = a/(a + b). Precision and recall are both
focused on a, the number of retrieved relevant documents. The two measures differ
based on the denominator of the formula.
Figure 9.2 summarizes the four cases and the two measures:

               Retrieved   Not retrieved
Relevant           a             b
Not relevant       c             d

Precision = a / (a + c)        Recall = a / (a + b)

Ideal results have precision = recall = 1.0; in reality, high recall tends to be
associated with low precision.
So what would be an ideal result? If precision and recall are both 1.0, that
means all the results that we returned are relevant, and we didn’t miss any relevant
documents; not a single non-relevant document was returned. In reality, however,
high recall tends to be associated with low precision; as you go down the list to
try to get as many relevant documents as possible, you tend to include many non-
relevant documents, which decreases the precision. We often are interested in the
precision up to ten documents for web search. This means we look at the top ten
results and see how many documents among them are actually relevant. This is a
very meaningful measure, because it tells us how many relevant documents a user
can expect to see on the first page of search results (where there are typically ten
results).
In the next section, we’ll see how to combine precision and recall into one score
that captures both measures.
F_β = 1 / [ (β²/(β² + 1)) · (1/R) + (1/(β² + 1)) · (1/P) ] = ((β² + 1) · P · R) / (β² · P + R)
F_1 = (2 · P · R) / (P + R)
where P = precision, R = recall, and β is a parameter (often set to 1).
It is easy to see that increasing either precision or recall (while holding the other
fixed) makes the F1 measure higher. But what's interesting is how the tradeoff between
precision and recall is
captured in the F1 score.
In order to understand the formulation, we can first ask the natural question:
Why not combine them using a simple arithmetic mean? That would likely be the
most natural way of combining them. Why is this not as good as F1, i.e., what’s
the problem with an arithmetic mean?
The arithmetic mean tends to be dominated by large values. That means if you
have a very high P or a very high R, then you really don’t care about whether the
other value is low since the whole sum would be high. This is not the desirable
effect because one can easily have a perfect recall by returning all the documents!
Then we have a perfect recall and a low precision. This will still give a relatively
high average. Such search results are clearly not very useful for users even though
the average using this formula would be relatively high. In contrast, the F1 score will
reward a case where precision and recall are roughly similar. So, it would penalize
a case with an extremely high result for only one of them. This means F1 encodes
a different tradeoff between them than a simple arithmetic mean. This example
shows a very important methodology: when you try to solve a problem, you might
naturally think of one solution (e.g., the arithmetic mean), but it’s important not
to settle on this solution; rather, think whether there are other ways to approach it.
Once you have multiple ideas, it’s important to analyze their differences and then
think about which one makes more sense in a real scenario.
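As a quick numerical check of this reasoning (the degenerate case uses made-up numbers): a system that simply returns every document achieves R = 1.0 with, say, P = 0.01. The arithmetic mean is (1.0 + 0.01)/2 ≈ 0.51, which looks respectable, but F1 = 2 × 0.01 × 1.0 / (0.01 + 1.0) ≈ 0.02, which correctly reflects that such a result is nearly useless. For the earlier example, system A (P = 2/3, R = 0.2) has F1 ≈ 0.31 while system B (P = 0.6, R = 0.3) has F1 = 0.4.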
To summarize, we talked about precision, which addresses the question: are
the retrieval results all relevant? We also talked about recall, which addresses the
question: have all the relevant documents been retrieved? These two are the two
basic measures in information retrieval evaluation. They are used for many other
tasks as well. We talked about F measure as a way to combine precision and recall.
We also talked about the tradeoff between precision and recall, and it turns out to
depend on the users’ search tasks and preferences.
As a concrete example, consider a ranked list of ten documents, where + marks a
relevant document and – a non-relevant one, and where the collection is assumed to
contain ten relevant documents in total:

Doc    Relevant?   Precision   Recall
D1     +           1/1         1/10
D2     +           2/2         2/10
D3     –           2/3         2/10
D4     –
D5     +           3/5         3/10
D6     –
D7     –
D8     +           4/8         4/10
D9     –
D10    –           ?           10/10

(The accompanying figure plots precision, on a 0.0 to 1.0 axis, against recall, on a
0.1 to 1.0 axis, for this ranked list.)
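A small sketch of how the precision and recall columns of such a table could be computed at each rank, given binary relevance marks for the ranked list and the total number of relevant documents in the collection (ten in this running example); the function and variable names are illustrative.

#include <cstdio>
#include <vector>

// Print precision and recall at each rank of a ranked list, given binary
// relevance marks and the total number of relevant documents in the collection.
void precision_recall_at_ranks(const std::vector<bool>& relevant, int total_relevant) {
    int hits = 0;
    for (size_t k = 0; k < relevant.size(); ++k) {
        if (relevant[k]) ++hits;
        double precision = static_cast<double>(hits) / (k + 1);
        double recall = static_cast<double>(hits) / total_relevant;
        std::printf("rank %zu: precision = %.2f, recall = %.2f\n", k + 1, precision, recall);
    }
}

int main() {
    // The ranked list from the example above: D1..D10, with '+' marked as true.
    std::vector<bool> relevant = {true, true, false, false, true, false, false, true, false, false};
    precision_recall_at_ranks(relevant, /*total_relevant=*/10);
}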