Introduction to
Information Storage and Retrieval
Introduction to
Information Storage and Retrieval
Chapter One:
Introduction to ISR
Information Retrieval Systems?
Document (Web page)
retrieval in response to a
query
Quite effective (at some
things)
Commercially successful
(some of them)
But what goes on behind
the scenes?
How do they work? Web search
What happens beyond
systems
• Lycos, Excite,
the Web?
Yahoo, Google, Live,
Northern Light,
Teoma, HotBot,
Examples of IR systems
Conventional (library catalog): by keyword, title, author,
etc.
E.g.: You are probably familiar with www.library.unt.edu
AAU library catalog
Text-based (Google, Yahoo, Lexis-Nexis,, FAST): Search
by keywords. Limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe): Search by visual
appearance (shapes, colors,… ).
Question answering systems (AskJeeves, Answerbus):
Search in (restricted) natural language
Other:
Cross language information retrieval (uses multiple languages),
Music retrieval
Information Retrieval
Information retrieval (IR) is the process of
finding material (usually documents) of an
unstructured nature (usually text) that satisfies
an information need of the user from within
large collections (usually stored on computers).
Information is organized into (a large number
of) documents
Large collections of documents from various sources:
news articles,
research papers,
books,
digital libraries,
Web pages, etc.
Example: Web Search Engines like Google claim to index
over 1.5 billion pages
General Goal of Information
Retrieval
To help users find useful/relevant
information based on their information
needs (with a minimum effort) despite
The challenge is:
Increasing complexity of Information
Changing needs of user
Provide immediate random access to the
document collection
Retrieval systems, such as Google, Yahoo,
are developed with this aim
Information Retrieval vs. Data
Retrieval
Emphasis of IR is on the retrieval of information, rather than on the
retrieval of data
Data retrieval
Consists mainly of determining which documents contain a set of
keywords in the user query (which is not enough to satisfy the user
information need)
Aims at retrieving all objects that satisfy well defined semantics
a single erroneous object among a thousand retrieved objects implies
failure
Information retrieval
Is concerned with retrieving information about a subject or topic than
retrieving data which satisfies a given query
semantics is frequently loose: the retrieved objects might be inaccurate
small errors are tolerated
Information Retrieval vs. Data
Retrieval
Example of data retrieval system is a relational database
Data Retrieval Info Retrieval
Data organization Structured Unstructured
Fields Clear Semantics No fields (other
(ID, Name, age,…) than text)
Query Language Artificial (defined, Free text (“natural
SQL) language”), Boolean
Matching Exact (results are Partial match, best match
always “correct”)
Query specification Complete Incomplete
Items wanted Matching Relevant
Accuracy 100% >50%
Error response Sensitive Insensitive
Why is IR so hard?
Information retrieval problem: locating
relevant documents based on user input, such
as keywords or example documents
The real problem boils down to matching the
language of the query to the language of the
document
Simply matching on words is a very weak approach
One word can have different semantic meanings. Consider:
Take
“take a place at the table”
“take money to the bank”
“take a picture”
More Problems with IR
You can’t even tell what part of speech a word has:
“I saw her duck”
A query that searches for “pictures of a duck” will find
documents that contains:
“I saw her duck away from the ball falling from the sky”
Proper Nouns often use regular old nouns
Consider a document with “a man named Abraham
owned a Lincoln”
A word matching query for “Abraham Lincoln” may well
find the above document
Basic Concepts in Information Retrieval:
(i) User Task and
(ii) Logical View of documents
The User Task:
user task – retrieval and browsing
Retrieval
DB
Browsing
USER
• Retrieval The User Task
• It is the process of retrieving information
whereby the main objective is clearly
defined from the onset of searching
process
• The user of a retrieval system has to
translate his information need into a
query in the language provided by the
system
• In this context (i.e. by specifying a set of
words), the user searches for useful
information executing a retrieval task
• English Language Statement :
I want a book by J. K Rowling titled The Chamber of Secrets
• Browsing The User Task
• It is the process of retrieving information,
whereby the main objective is not
clearly defined from the beginning and
whose purpose might change during the
interaction with the system
• E.g. User might search for documents about ‘car
racing’ . Meanwhile he might find interesting
documents about ‘car manufacturers’. While
reading about car manufacturers in Addis, he might
turn his attention to a document providing
‘direction to Addis’, and from this to documents
which cover ‘Tourism in Ethiopia’
• In this context, user is said to be browsing in
the collection and not searching, since a
user may has an interest of glancing around
Logical View of Documents
Documents in a collection are frequently represented by a
set of index terms or keywords
Such keywords are mostly extracted directly from the text of
the document
These representative keywords provide a logical view of the
document
Docs Tokenizatio stop words stemming Indexing
n
Full Index terms
text
Document representation viewed as a continuum, in which
logical view of documents might shift from full text to index
terms
Logical view of documents
If full text :
Each word in the text is a keyword
Most complex form….why?
Expensive….why?
If full text is too large, the set of
representative keywords can be reduced
through transformation process called text
operation
Itreduce the complexity of the document
representation and allow moving the logical
view from that of a full text to a set of index
terms
Structure of an IR System
An Information Retrieval System serves as a bridge
between the world of authors and the world of
readers/users,
That is, writers present a set of ideas in a document using
a set of concepts. Then Users seek the IR system for
relevant documents that satisfy their information need.
User Black box Documents
The black box is the information retrieval system.
To be effective in its attempt to satisfy information need of
users, the IR system must some how ‘interpret’ the
contents of documents in a collection and rank them
according to their degree of relevance to the user
query.
Thus the notion of relevance is at the centre of IR
The primary goal of an IR system is to retrieve all the
documents which are relevant to a user query while
retrieving as few non-relevant documents as
possible
Typical IR Task
Given:
A collection of textual natural-language
documents
A user query in the form of a textual string
Process:
A ranked set of documents that are assumed
to be relevant to the user query
Measure of Effectiveness:
Number of relevant docs from the retrieved
collection
Number of relevant docs retrieved from the
whole collection
Measure of Accuracy
Typical IR System Architecture
Document
corpus
Query IR
String System
Ranked
Documents
1. Doc1
2. Doc2
3. Doc3
.
Web Search System (e.g.: Google)
Web crawler
Web Spider
Document
corpus
Query IR
String System
Ranked
Documents
1. Page1
2. Page2
3. Page3
.
What is Information Retrieval ?
Information retrieval a broad area of computer
science focusing on easy access of information, as
defined in Baeze-Yates & Riberio-Neto (2011, p1)
“Information retrieval deals with representation,
storage, organization of, and access to
information items, such as documents, Web pages,
online catalogs, structured and semi-structured
records, multimedia objects.The organization and
access of information items should provide the user
with easy access to the information in which he is
interested”
The definition incorporates all important features of
a good information retrieval system
Representation
Storage
Organization
Access
The focus is on the user information need
Overview of the Retrieval process
Overview of the Retrieval process (2)
The Retrieval Process
It is necessary to define the text collection
before any of the retrieval processes are
initiated
This is usually done by the manager of the
database and includes specifying the following
The documents to be used
The operations to be performed on the text
The text model to be used (the text structure and
what elements can be retrieved)
The text operations transform the original
documents and the information needs and
generate a logical view of each document
The Retrieval Process
Once the logical view of the documents is defined,
the database module builds an index of the text
An index is a critical data structure
It allows fast searching over large volumes of data
Reduces the vocabulary of the collection
Different index structures might be used , but the
most popular one is the inverted file (more on
this later)
Given the document database is indexed, the
retrieval process can be initiated
The Retrieval Process …
The user first specifies a user need which is then parsed and
transformed by the same text operation applied to the text
Next the query operations is applied before the actual query,
which provides a system representation for the user need, is
generated
The query is then processed (compared) to obtain the retrieved
documents
Before the retrieved documents are presented to the user, the retrieved
documents are ranked according to the likelihood of relevance
The user then examines the set of ranked documents in the search
for useful information. Two choices for the user:
(i) reformulate query, run on entire collection or (ii) reformulate query, run
on result set
At this point, he might pinpoint a subset of the documents seen as
definitely of interest and initiate a user feedback cycle
In such a cycle, the system uses the documents selected by the
user to change the query formulation.
Hopefully, this modified query is a better representation of the real
user need
Detail view of the Retrieval Process
User Text
Interface
User
need
Text
Text Operations
logical view Logical view
DB
User Query Language manager
feedback Indexing Module
& Operations
Query Inverted file
Searching Index
Retrieved docs Text
Database
Ranked docs
Ranking
Issues in IR
Text representation
what makes a “good” representation?
how is a representation generated from text?
what are retrievable objects and how are they organized?
Information needs representation
what is an appropriate query language?
how can interactive query formulation and refinement be
supported?
Comparing representations
to identify relevant documents
What weighting scheme and similarity measure to be
used?
what is a “good” model of retrieval?
Evaluating effectiveness of retrieval
what are good metrics?
what constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is:
In improving performance effectiveness of the
system
Effectiveness of the system is measured in terms of:
precision,
recall, …
Stemming, stop words, weighting schemes, matching
algorithms
In improving performance efficiency
The concern here is storage space usage, access time,
searching time, data transfer time …
There is space – time tradeoffs !!
Use Compression techniques, data/file structures, etc.
Subsystems of an IR system
The two subsystems of an IR system:
Indexing: is an offline process of organizing
documents using keywords extracted from the collection
Searching: is an online process of finding relevant
documents in the index list as per users query
Indexing and searching: are unavoidably connected
you cannot search that was not first indexed in some manner
indexing of documents or objects is done in order to be
searchable
there are many ways to do indexing
to index one needs an indexing language
there are many indexing languages
even taking every word in a document is an indexing
language
Knowing searching is knowing indexing
Indexing Subsystem
documents
Documents Assign document identifier
text document
Tokenize
IDs
tokens
Stop list
non-stoplist Stemming & Normalize
tokens
stemmed Term weighting
terms
terms with
weights Index
Searching Subsystem
query parse query
query tokens
ranked
Stop list non-stoplist
document
tokens
set
ranking
Stemming & Normalize
relevant stemmed terms
document set
Similarity Query Term weighting
Measure terms
Index terms
Index
End of Chapter One