Information retrieval (IR):
traditional model
1. Why? Rationale for the
module. Definition of IR
2. System & user components
3. Exact match & best match
searches
4. Strengths & weaknesses
Tefko Saracevic
1. Why? Rationale for
the module.
Definition of IR
includes problems
addressed in IR
Why?
Every online database, every
search engine, everything that is
searched online is based in some
way or another on principles
developed in IR
IR is at the heart of searching used in
systems such as DIALOG, LexisNexis
& others
Understanding the basics of IR is a
prerequisite for understanding how
searching of online systems works.
You are asking:
What basic elements and
processes are involved in IR?
What are the conceptual bases
for searching?
How are these applied in
practice?
IR:
- original definition
Information retrieval embraces the
intellectual aspects of the
description of information and its
specification for search, and also
whatever systems, techniques, or
machines are employed to carry
out the operation.
Calvin Mooers, 1951
IR:
Objective & problems
Provide the users with effective
access to & interaction with
information resources.
Problems addressed:
1. How to organize information
intellectually?
2. How to specify search &
interaction intellectually?
3. What systems & techniques to
use for those processes?
Where do you fit?
With what problems do you deal?
2. System & user
components
Traditional IR model
presented
IR models
Model depicts, represents what is
involved
a choice of features, processes, things
for consideration
Several IR models used over time
traditional: oldest, most used, shows
basic elements involved
treated in this module
interactive: more realistic, favored now,
shows also interactions involved
treated in next module (module 5)
Each has strengths, weaknesses
Description of
traditional IR model
It has two streams of activities
one is the systems side with processes
performed by the system
other is the user side with processes
performed by users & intermediaries (you)
these two sides led to system orientation &
user orientation
in system side automatic processing is done;
in user side human processing is done
They meet at the matching process
where the query is fed into the system and
system looks for documents that match the
query
Also feedback is involved so that things
change based on results
e.g. query is modified & new matching done
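Traditional IR model
(diagram: two streams of activities that meet at matching)
System: Acquisition (documents, objects) -> Representation (indexing, ...) -> File organization (indexed documents) -> Matching
User: Problem (information need) -> Representation (question) -> Query (search formulation) -> Matching
Matching (searching) -> Retrieved objects
Feedback: from results back to earlier steps, e.g. the query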
Acquisition
(system)
Content: What is in files, resources
in DIALOG first part of blue sheets: File
Description, Subject Coverage
Selection of documents & other
objects from various sources
in blue sheets: Sources
Mostly text based documents
full texts, titles, abstracts ...
but also other objects:
data, statistics, images, maps, trade marks,
sounds ...
Importance:
Determines contents: what
is in it
Key to file, resource
selection !!!
Representation
of documents, objects
(system)
Indexing, many ways:
free text terms (even in full texts)
controlled vocabulary - thesaurus
manual & automatic techniques
Abstracting; summarizing
Bibliographic description:
author, title, sources, date
metadata
Classifying, clustering
Organizing in fields & limits
in DIALOG: Basic Index, Additional Index,
Limits
Basic to what is available
for searching & displaying
File organization
(system)
Sequential
record (document) by record
Inverted
term by term; list of records under
each term
Combination: indexes inverted,
documents sequential
When only citations are retrieved,
need for document files
Large file approaches
for efficient retrieval by computers
Enables searching & interplay
between types of files
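The difference between sequential and inverted files can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three records and the function names are invented here and do not come from DIALOG or any other system.

```python
from collections import defaultdict

# Hypothetical three-record file; the texts are invented for illustration.
records = {
    1: "digital library services",
    2: "history of information retrieval",
    3: "digital information systems",
}

def sequential_search(term):
    """Sequential file: look at the file record (document) by record."""
    return [rid for rid, text in records.items() if term in text.split()]

def build_inverted_file(recs):
    """Inverted file: term by term, with the list of records under each term."""
    inverted = defaultdict(list)
    for rid, text in recs.items():
        for term in sorted(set(text.split())):
            inverted[term].append(rid)
    return dict(inverted)

inverted_file = build_inverted_file(records)

print(sequential_search("digital"))  # scans every record
print(inverted_file["digital"])      # single lookup under the term
```

Both calls return the same records. The inverted file gets there without reading every record, which is why it suits large files.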
Problem
(user)
Related to user's task, situation
vary in specificity, clarity
Produces information need
ultimate criterion for effectiveness of
retrieval
how well was the need met?
Inf. need for the same problem may
change, evolve, shift during the IR
process - adjustment in searching
often more than one search for same
problem over time
you will experience this in your term project
Critical for examination
in interview
Representation - question
(user & possibly system)
Non-mediated: end user alone
Mediated: intermediary + user
interviews; human-human interaction
Question analysis
selection, elaboration of terms
various tools may be used
thesaurus, classification schemes,
dictionaries, textbooks, catalogs
Focus toward
deriving search terms & logic
selection of files, resources
Subject to feedback changes
Critical roles of intermediary - you
Determines search specification
- a dynamic process
Query - search statement
(user & system)
Translation into systems requirements &
limits
start of human-computer interaction
query is the thing that goes into the computer
Selection of files, resources
Search strategy - selection of:
search terms & logic
possible fields, delimiters
controlled & uncontrolled vocabulary
variations in effectiveness tactics
Reiterations from feedback
several feedback types: relevance feedback,
magnitude feedback ...
query expansion & modification
What & how of actual searching
Clarifying difference
Question is what user asks and what
you may then have elaborated
Query is what is asked of computer to
match what is put in
Question is transformed into query
Question:
I am interested in major historical
developments in the area of information
retrieval.
Query
history information retrieval (in Google)
history AND information(w)retrieval (in
DIALOG) (plus you have to select which
file(s) to search)
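The step from question to query can be sketched as a small translation function. This is a toy under stated assumptions: the function names are invented, and the code reproduces only the two query strings shown above, not the full syntax of Google or DIALOG.

```python
# Concepts picked out of the question during question analysis (by a person).
concepts = ["history", "information retrieval"]

def to_web_query(concepts):
    """Web-style query: the terms simply listed one after another."""
    return " ".join(concepts)

def to_dialog_style_query(concepts):
    """DIALOG-style query: words of a phrase joined with (w), concepts with AND."""
    return " AND ".join("(w)".join(c.split()) for c in concepts)

print(to_web_query(concepts))
print(to_dialog_style_query(concepts))
```

The question analysis (choosing the concepts) stays with the user or intermediary. Only the final translation into the system's requirements is mechanical.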
Matching - searching
(user & system)
Process of matching, comparing
search: what documents in the file
match the query as stated?
Various search algorithms:
exact match - Boolean
still available in most, if not all systems
best match - ranking by relevance
increasingly used e.g. on the web
hybrids incorporating both
e.g. Target, Rank in DIALOG
Each has strengths, weaknesses
no perfect method exists
and probably never will
Involves many types of search
interactions & formulations
Retrieved documents
(from system to user)
Various order of output:
Last In First Out (LIFO); sorted
ranked by relevance
ranked by other characteristics
Various forms of output
In DIALOG: Output options
When citations only: possible
links to document delivery
Base for relevance, utility
evaluation by users
Relevance feedback
What a user (or you) sees, gets,
judges can be specified
3. Exact match & best
match searches
Getting to that Boolean and
similar stuff: the nitty-gritty
of matching,
which actually affects how
you formulate the query
Exact match Boolean search
You retrieve exactly what you ask
for in the query:
all documents that have the term(s)
with logical connection(s), and
possible other restrictions (e.g. to be
in titles) as stated in the query
exactly: nothing less, nothing more
Based on matching following rules
of Boolean algebra, or algebra of
sets
new algebra
presented by circles in Venn
diagrams
Boolean algebra
Operates on sets
e.g. set of documents
Has four operations (like in algebra):
1. A: retrieve set A
I want documents that have the term library
2. A AND B: retrieve set that has A and B
often called intersection & labeled A ∩ B
I want documents that have both terms library
and digital someplace within
3. A OR B: retrieve set that has either A or B
often called union and labeled A ∪ B
I want documents that have either term library
or term digital someplace within
4. A NOT B: retrieve set A but not B
often called negation and labeled A - B
I want documents that have term library but if
they also have term digital I do not want those
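The four operations map directly onto Python's built-in sets. The record numbers below are invented for illustration; each set stands for the documents that contain one term.

```python
# Hypothetical sets of record numbers, one set per term.
library = {1, 2, 4}   # documents that have the term library
digital = {2, 3, 4}   # documents that have the term digital

a       = library             # 1. A
a_and_b = library & digital   # 2. A AND B (intersection)
a_or_b  = library | digital   # 3. A OR B  (union)
a_not_b = library - digital   # 4. A NOT B (set A without anything also in B)

print(sorted(a_and_b))  # [2, 4]
print(sorted(a_or_b))   # [1, 2, 3, 4]
print(sorted(a_not_b))  # [1]
```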
Potential problems
But beware:
digital AND library will retrieve documents
that have digital library (together as a
phrase) but also documents that have
digital in the first paragraph and library in
the third section, 5 pages later, and it
does not deal with digital libraries at all
thus in Google you will ask for "digital
library" (in quotes) and in DIALOG for
digital(w)library to retrieve the exact
phrase digital library
digital NOT library will retrieve documents
that have digital and suppress those that
along with digital also have library, but
sometimes those suppressed may very
well be relevant. Thus, NOT is also
known as the dangerous operator
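The difference between AND and an exact phrase can be made concrete with a small sketch. The two documents and the helper names are invented for illustration; real systems use stored word positions rather than rescanning the text.

```python
# Two hypothetical documents: only the first is about digital libraries.
docs = {
    1: "the digital library opened today",
    2: "digital cameras are sold next to the public library",
}

def has_all_terms(text, terms):
    """AND: every term occurs someplace within the document."""
    words = text.split()
    return all(term in words for term in terms)

def has_phrase(text, first, second):
    """Phrase: the second word immediately follows the first."""
    words = text.split()
    return any(a == first and b == second for a, b in zip(words, words[1:]))

and_hits = [d for d, t in docs.items() if has_all_terms(t, ["digital", "library"])]
phrase_hits = [d for d, t in docs.items() if has_phrase(t, "digital", "library")]

print(and_hits)     # [1, 2]
print(phrase_hits)  # [1]
```

Document 2 satisfies digital AND library although it does not deal with digital libraries at all. Only the phrase test excludes it.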
Boolean algebra depicted
in Venn diagrams
Four basic operations:
e.g. A = digital, B = libraries
(Venn diagram: two overlapping circles A and B; area 1 = A only, area 2 = overlap of A and B, area 3 = B only)
A alone. All documents that have A. Shade 1 & 2.
digital
A AND B. Shade 2.
digital AND libraries
A OR B. Shade 1, 2, 3.
digital OR libraries
A NOT B. Shade 1.
digital NOT libraries
Venn diagrams cont.
Complex statements allowed, e.g.
(Venn diagram: three overlapping circles A, B and C with numbered areas)
(A OR B) AND C
Shade 4, 5, 6
(digital OR libraries) AND
Rutgers
(A OR B) NOT C
Shade what?
(digital OR libraries) NOT
Rutgers
Venn diagrams cont.
Complex statements can be
made
as in ordinary algebra e.g. (2+3)x4
As in ordinary algebra: watch for
parentheses:
2+(3 x 4)
is not the same as
(2+3)x4
(A AND B) OR C
is not the same as
A AND (B OR C)
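A quick check with Python sets shows why the parentheses matter. The sets A, B and C below hold invented record numbers chosen so that the two statements give different results.

```python
# Ordinary algebra first: the grouping changes the result.
print(2 + (3 * 4))   # 14
print((2 + 3) * 4)   # 20

# Hypothetical sets of record numbers for three search terms.
A = {1, 2}
B = {2, 3}
C = {3, 4}

left = (A & B) | C    # (A AND B) OR C
right = A & (B | C)   # A AND (B OR C)

print(sorted(left))   # [2, 3, 4]
print(sorted(right))  # [2]
```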
Best match searching
Output is ranked
it is NOT presented as a Boolean set but in
some rank order
You retrieve documents ranked by how
similar (close) they are to a query (as
calculated by the system)
similarity assumed as relevance
ranked from highest to lowest relevance to the
query
mind you, as considered by the system
you change the query, system changes rank
thus, documents as answers are presented
from those that are most likely relevant
downwards to less & less likely relevant
can be cut at any desired number - e.g. first 10
Best match (cont.)
Best match process deals with
PROBABILITY:
compares the set of query terms with the
sets of terms in documents
calculates a similarity between query &
each document based on common terms &/or
other aspects
sorts the documents in order of similarity
assumes that the higher ranked documents
have a higher probability of being relevant
allows for cut-off at a chosen number
BIG issue: What representation &
similarity measures are better?
"better" is determined by a number of criteria,
e.g. relevance, speed
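The best match steps above (compare terms, calculate similarity, sort, cut off) can be sketched as follows. The similarity measure used here, the share of query terms found in a document, is an assumption chosen for simplicity and is not the measure of any particular system. The documents are invented as well.

```python
def similarity(query, text):
    """Assumed measure: share of query terms that also occur in the document."""
    q = set(query.lower().split())
    d = set(text.lower().split())
    return len(q & d) / len(q) if q else 0.0

def best_match(query, docs, cutoff=10):
    """Score every document, sort by similarity, cut off at a chosen number."""
    scored = [(similarity(query, text), doc_id) for doc_id, text in docs.items()]
    scored = [(s, d) for s, d in scored if s > 0]   # drop documents sharing no term
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest similarity first
    return [(doc_id, s) for s, doc_id in scored[:cutoff]]

# Hypothetical documents.
docs = {
    "d1": "history of information retrieval systems",
    "d2": "information about library hours",
    "d3": "cooking recipes",
}

ranked = best_match("history information retrieval", docs, cutoff=2)
print(ranked)  # d1 first (all query terms present), then d2
```

Unlike the Boolean case, d2 is still retrieved although it has only one of the three query terms. It is simply ranked lower.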
Best match (cont.)
Variety of algorithms (formulas) used
to determine similarity
using statistical &/or linguistic properties
e.g. if digital appears a lot in a given
document relative to its size, that document
will be ranked higher when the query is digital
many proposed & tested in IR research
many developed by commercial
organizations
Google also uses calculations as to number
of links to/from a document
many algorithms are now proprietary
system ranking and your ranking may not
necessarily be in agreement
Web outputs are mostly ranked
But DIALOG allows ranking as well,
with special commands
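The statistical example above (a term appearing a lot relative to document size) can be sketched as a relative frequency score. This is a deliberately simple stand-in under stated assumptions. The documents are invented, and actual systems use more elaborate, often proprietary, formulas.

```python
def relative_frequency(term, text):
    """Occurrences of the term divided by the number of words in the document."""
    words = text.lower().split()
    return words.count(term) / len(words) if words else 0.0

# Hypothetical documents of different sizes.
docs = {
    "short": "digital digital archive",
    "long": "digital tools for a very large archive of printed maps",
}

query = "digital"
ranked = sorted(docs, key=lambda d: relative_frequency(query, docs[d]), reverse=True)

for doc_id in ranked:
    print(doc_id, round(relative_frequency(query, docs[doc_id]), 2))
```

The short document ranks first because digital makes up two of its three words, against one of ten in the long one.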
4. Strengths &
weaknesses
Boolean vs. best match
Boolean
allows for logic
provides all that has been matched
BUT
has no particular order of output
treats all retrievals equally - from the most to least relevant ones
often requires examination of large outputs
Best match
allows for free terminology
provides for a ranked output
provides for cut-off - any size output
BUT
does not include logic
ranking method (algorithm) not transparent
whose relevance?
where to cut off?
Strengths of traditional
IR model
Lists major components in both
system & user branches
Suggests:
What to explain to users about
system, if needed
What to ask of users for more
effective searching (problem ...)
Selection of component(s) for
concentration
mostly toward ever better representation
Provides a framework for
evaluation of (static) aspects
Weaknesses
Does not address or account for
interaction & judgment of results
by users
identifies interaction with search only
interaction is a much richer process
Many types of & variables in
interaction not reflected
Feedback has many types &
functions - also not shown
Evaluation thus one-sided
IR is a highly interactive process
- thus additional model(s) needed
Interactive models
Explored in next module
Module 5