Overview of Information Retrieval Systems

The document provides an introduction to information retrieval systems and key concepts. It defines information retrieval as searching for relevant documents from large collections to satisfy user needs. It discusses how IR systems represent, store, organize and provide access to information through indexing and keywords. Examples of different IR systems are provided, including conventional library catalogs, text-based web search engines, and multimedia and question answering systems. The goals and challenges of IR are also summarized.

Uploaded by

Ebisa Chemeda


Introduction to Information Storage and Retrieval

Chapter One
Overview of Information Retrieval

10/16/2017 Introduction to Information Retrieval
IR and IR Systems
❖ Information retrieval (IR) is the process of searching a large, unstructured corpus for relevant documents that satisfy a user's information need.
❖ According to Baeza-Yates & Ribeiro-Neto, information retrieval deals with the representation, storage, organization of, and access to information items.
❖ The organization of and access to information items should give users easy access to the information they are interested in.
❖ The definition incorporates all the important features of a good information retrieval system:
➢ Representation
➢ Storage
➢ Organization
➢ Access
Examples of IR systems
❖ Conventional (library catalog): search by keyword, title, author, etc.
❖ Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
❖ Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
❖ Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
General Goal of Information Retrieval
❖ To help users find useful information based on their information needs (with minimum effort) despite:
➢ the increasing complexity of information
➢ the changing needs of users
❖ To provide immediate random access to the document collection.
➢ Retrieval systems such as Google and Yahoo are developed with this aim.

Information Retrieval vs. Data Retrieval
❖ The emphasis of IR is on the retrieval of information rather than on the retrieval of data.
Data retrieval
➢ Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user's information need)
➢ Aims at retrieving all objects that satisfy well-defined semantics
➢ A single erroneous object among a thousand retrieved objects implies failure
Information retrieval
➢ Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
➢ Semantics are frequently loose: the retrieved objects might be inaccurate
➢ Small errors are tolerated
An example of a data retrieval system is a relational database.
Information Retrieval vs. Data Retrieval
                      Data Retrieval                        Information Retrieval
Data organization     Structured                            Unstructured
Fields                Clear semantics (ID, name, age, …)    No fields (other than text)
Query language        Artificial (defined, SQL)             Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)  Partial match, best match
Query specification   Complete                              Incomplete
Items wanted          Matching                              Relevant
Accuracy              100%                                  < 50%
Error response        Sensitive                             Insensitive
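The contrast between exact matching and best matching can be sketched in a few lines of Python. This is only an illustration: the sample records, the sample documents, and the function names `data_retrieval` and `information_retrieval` are assumptions made for this example, and word overlap stands in for whatever scoring a real IR system would use.

```python
# Hypothetical sample data, invented for this illustration.
records = [
    {"id": 1, "name": "Abebe", "age": 30},
    {"id": 2, "name": "Sara", "age": 25},
]
documents = {
    "d1": "information retrieval systems rank documents",
    "d2": "relational databases store structured data",
    "d3": "web search engines retrieve relevant documents",
}

def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row for row in rows if row[field] == value]

def information_retrieval(docs, query):
    """Best match: score each document by the number of query words it shares."""
    query_terms = set(query.lower().split())
    scores = {
        doc_id: len(query_terms & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    # Highest score first; ties are broken by document ID.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]

exact = data_retrieval(records, "age", 25)
ranked = information_retrieval(documents, "retrieve relevant documents")
```

Here `exact` holds only the one record whose age is exactly 25, while `ranked` also returns a partially matching document with a lower score.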
Why is IR so hard?
❖ A traditional information retrieval (IR) system attempts to find relevant documents in response to a user's request.
❖ The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.
➢ The real problem boils down to matching the language of the query to the language of the document.
➢ Simply matching on words is a very brittle (inelastic) approach: one word can have several different meanings. Consider the word “take”:
✓ “take a place at the table”
✓ “take money to the bank”
✓ “take a picture”
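A small sketch shows why literal word matching is brittle. The fourth sentence and the helper name `keyword_match` are assumptions added for illustration; the first three sentences come from the slide.

```python
sentences = [
    "take a place at the table",
    "take money to the bank",
    "take a picture",
    "she took a photograph",  # hypothetical extra sentence, same idea as "take a picture"
]

def keyword_match(query_word, texts):
    """Return the texts that contain the query word as a whole token."""
    return [text for text in texts if query_word in text.split()]

# All three senses of "take" are returned together, with no way to tell them apart.
take_hits = keyword_match("take", sentences)

# A sentence with the same meaning but different words is missed entirely.
picture_hits = keyword_match("picture", sentences)
```

The matcher cannot separate the three meanings of "take", and it cannot connect "picture" with "photograph", which is the language mismatch the slide describes.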

Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of Documents
The User Task:
Two user tasks: retrieval and browsing
[Figure: the USER interacts with the DB through two tasks, Retrieval and Browsing]

User Task: Retrieval
❖ Retrieval is the process of retrieving information whereby the main objective is clearly defined from the onset of the search.
❖ The user of a retrieval system has to translate an information need into a query in the language provided by the system.
❖ In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
❖ English language statement:
I want a book by J. K. Rowling titled The Chamber of Secrets

User Task: Browsing
❖ Browsing is the process of retrieving information whereby the main objective is not clearly defined from the beginning and may change during the interaction with the system.
❖ E.g. a user might search for documents about ‘car racing’ and meanwhile find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, the user might turn to a document providing ‘directions to Addis’, and from this to documents that cover ‘Tourism in Ethiopia’.
❖ In this context, the user is said to be browsing the collection rather than searching, since the user may simply be interested in glancing around.

Logical View of Documents
Documents in a collection are frequently represented by a set of index terms or keywords.
Such keywords are mostly extracted directly from the text of the document.
These representative keywords provide a logical view of the document.

[Figure: Docs → Tokenization → Stop words → Stemming → Indexing; the representation moves from full text to index terms]

Document representation is viewed as a continuum, in which the logical view of documents might shift from full text to index terms.
Logical view of documents
❖ If full text is used:
➢ Each word in the text is a keyword
➢ Most complex form
➢ Expensive
❖ If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations.
➢ Text operations reduce the complexity of the document representation and allow the logical view to move from full text to a set of index terms.
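The text operations named on these slides (tokenization, stop-word removal, stemming) can be sketched as a small pipeline. Everything here is a simplification under stated assumptions: the stop list is a tiny illustrative set, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's, not the method the course prescribes.

```python
import re

# Illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Crude suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Reduce full text to a sorted list of distinct index terms."""
    return sorted({stem(token) for token in remove_stop_words(tokenize(text))})

terms = index_terms("The users are searching the indexed documents")
```

For this sentence the seven words of full text shrink to four index terms: `document`, `index`, `search`, `user`.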

Structure of an IR System
❖ An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
➢ That is, writers present a set of ideas in a document using a set of concepts; users then query the IR system for relevant documents that satisfy their information need.
[Figure: User ↔ Black box ↔ Documents]
❖ The black box is the information retrieval system.
❖ To be effective in satisfying users' information needs, the IR system must ‘interpret’ the contents of the documents in a collection and rank them according to their degree of relevance to the user query.
❖ Thus the notion of relevance is at the center of IR.
❖ The primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
Typical IR Task
❖ Given:
➢ A corpus of textual natural-language documents.
➢ A user query in the form of a textual string.
❖ Find:
➢ A ranked set of documents that are relevant to the query.

Typical IR System Architecture
[Figure: a query string and the document corpus are input to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …)]

Overview of the Retrieval process


The Retrieval Process
❖ It is necessary to define the text database before any of the retrieval processes are initiated.
❖ This is usually done by the manager of the database and includes specifying the following:
➢ The documents to be used
➢ The operations to be performed on the text
➢ The text model to be used (the text structure and what elements can be retrieved)
❖ The text operations transform the original documents and the information needs and generate a logical view of them.
❖ Once the logical view of the documents is defined, the database manager (using the DB Manager Module) builds an index of the text.
❖ An index is a critical data structure because it allows fast searching over large volumes of data.
The Retrieval Process …
❖ Different index structures might be used, but the most popular one is the inverted file.
❖ The resources (time and storage space) spent on defining the text database and building the index are amortized by querying the retrieval system many times.
❖ Once the document database is indexed, the retrieval process can be initiated.
❖ The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text.
❖ Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated.
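An inverted file maps each term to the documents that contain it, so a query only touches the postings of its own terms instead of scanning every document. The sketch below is a minimal in-memory version under simple assumptions: whitespace tokenization, no text operations, and a Boolean AND query. The function names and sample documents are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all_terms(index, query):
    """Return the IDs of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical three-document collection.
docs = {
    1: "information retrieval and indexing",
    2: "data retrieval in relational databases",
    3: "indexing large text collections",
}
index = build_inverted_index(docs)
hits = search_all_terms(index, "retrieval indexing")
```

The index is built once and reused for every query, which is the amortization the slide refers to.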

The Retrieval Process …
❖ The query is then processed to obtain the retrieved documents.
✓ Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance.
❖ The user then examines the set of ranked documents in search of useful information. The user has two choices:
✓ Reformulate the query and run it on the entire collection, or
✓ Reformulate the query and run it on the result set
❖ At this point, the user might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle.
✓ In such a cycle, the system uses the documents selected by the user to change the query formulation.
✓ Hopefully, this modified query is a better representation of the real user need.
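One simple way to picture a feedback cycle is query expansion: terms that occur often in the documents the user selected are appended to the query. The slides do not say which feedback method is used, so the sketch below is an assumption chosen for brevity (plain term-frequency expansion, not a specific published algorithm), and `expand_query` is a hypothetical helper.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, extra_terms=2):
    """Append the most frequent new terms found in user-selected documents."""
    counts = Counter(
        term
        for text in selected_docs
        for term in text.lower().split()
        if term not in query_terms
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]

# The user searched for "car" and marked two documents as interesting.
selected = ["car racing championship", "formula racing car"]
new_query = expand_query(["car"], selected, extra_terms=1)
```

The expanded query now carries "racing", a term the user never typed but which the selected documents suggest is part of the real need.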
Detail view of the Retrieval Process
[Figure: components shown are User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Ranking, Index (inverted file) and Text Database; labelled flows are text, user need, logical view, query, user feedback, retrieved docs and ranked docs]
Issues that arise in IR
❖ Text representation
➢ What makes a “good” representation?
➢ How is a representation generated from text?
➢ What are retrievable objects and how are they organized?
❖ Information needs representation
➢ What is an appropriate query language?
➢ How can interactive query formulation and refinement be supported?
❖ Comparing representations (to identify relevant documents)
➢ What weighting scheme and similarity measure are to be used?
➢ What is a “good” model of retrieval?
❖ Evaluating effectiveness of retrieval
➢ What are good metrics?
➢ What constitutes a good experimental test bed?

Focus in IR System Design
Our focus during IR system design is:
❖ Improving the effectiveness of the system
➢ Effectiveness of the system is measured in terms of precision, recall, …
➢ Techniques: stemming, stop words, weighting schemes, matching algorithms
❖ Improving the efficiency of the system
➢ The concern here is storage space usage, access time, searching time, data transfer time, …
➢ Space-time tradeoffs are a concern
➢ Use compression techniques, data/file structures, etc.
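Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with a hypothetical function name and made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents were returned; three documents are actually relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Two of the four retrieved documents are relevant, so precision is 2/4; two of the three relevant documents were found, so recall is 2/3.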

Subsystems of an IR system
❖ The two subsystems of an IR system:
➢ Searching: an online process of finding relevant documents in the index according to the user's query
➢ Indexing: an offline process of organizing documents using keywords extracted from the collection
❖ Indexing and searching are unavoidably connected:
➢ You cannot search what was not first indexed in some manner or other
➢ Documents or objects are indexed in order to make them searchable
➢ To index, one needs an indexing language
➢ Knowing searching is knowing indexing
Indexing Subsystem
[Figure: documents → Assign document identifier → text and document IDs → Tokenize → tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → terms with weights → Index]
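The last two stages of the indexing pipeline (term weighting and index construction) can be sketched as follows. The slides do not name a weighting scheme, so tf-idf with a natural logarithm is an assumption made here, and the input is assumed to be term lists that have already been tokenized, filtered and stemmed.

```python
import math
from collections import Counter, defaultdict

def build_weighted_index(docs):
    """Return {term: {doc_id: tf-idf weight}} from per-document term lists."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)

# Hypothetical documents after tokenization, stop-word removal and stemming.
docs = {
    "d1": ["index", "term", "index"],
    "d2": ["query", "term"],
}
weighted_index = build_weighted_index(docs)
```

A term that occurs in every document ("term" here) gets weight zero under this scheme, while a term concentrated in one document ("index" in d1) gets a high weight.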

Searching Subsystem
[Figure: query → Parse query → query tokens → Stop list → non-stoplist tokens → Stemming & Normalize → stemmed terms → Term weighting → query terms → Similarity Measure (against index terms from the Index) → relevant document set → Ranking → ranked document set]
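The similarity and ranking stages can be sketched with cosine similarity between a weighted query vector and weighted document vectors. Cosine similarity is one common choice and is an assumption here, since the slides do not name the measure; the vectors and weights below are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs with a non-zero score, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    matches = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical term weights for three documents and one query.
doc_vecs = {
    "d1": {"index": 1.0, "term": 1.0},
    "d2": {"query": 1.0},
    "d3": {"index": 1.0},
}
ranking = rank({"index": 1.0}, doc_vecs)
```

Document d3 ranks above d1 because all of its weight lies on the query term, and d2 is dropped because it shares no terms with the query.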
