Chapter 2 Modeling
Introduction
Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun).
The advantages of using index terms:
Simplicity.
The semantics of the documents and of the user information need can be naturally expressed through sets of index terms.
Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).
A taxonomy of information retrieval models (by user task):
Retrieval (Ad hoc, Filtering):
  Classic models: Boolean, Vector, Probabilistic.
  Set theoretic models: Fuzzy, Extended Boolean.
  Algebraic models: Generalized Vector, Latent Semantic Indexing, Neural Networks.
  Probabilistic models: Inference Network, Belief Network.
  Structured models: Non-overlapping Lists, Proximal Nodes.
Browsing: Flat, Structure Guided, Hypertext.
            Index Terms      Full Text        Full Text + Structure
Retrieval   Classic          Classic          Structured
            Set Theoretic    Set Theoretic
            Algebraic        Algebraic
            Probabilistic    Probabilistic
Browsing    Flat             Flat             Structure Guided
                             Hypertext        Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
Retrieval: Ad hoc and Filtering
Ad hoc: the documents in the collection remain relatively static while new queries are submitted to the system.
Filtering: the queries remain relatively static while new documents come into the system.
Filtering
Typically, the filtering task simply indicates to the user the documents which might be of interest to him.
Routing: rank the filtered documents and show this ranking to the user.
User profiles can be constructed in two ways: the user describes his interests explicitly through keywords, or the system learns the profile from documents the user indicates as relevant.
Models
An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where:
D: a set composed of logical views (or representations) for the documents in the collection.
Q: a set composed of logical views (or representations) for the user information needs (queries).
F: a framework for modeling document representations, queries, and their relationships.
R(qi, dj): a ranking function which defines an ordering among the documents with regard to the query qi.
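To make the quadruple concrete, here is a minimal Python sketch; the class name IRModel and the vector-based document and query types are illustrative assumptions, not part of the original formulation, and F is left implicit in how D, Q, and R are defined.

```python
from dataclasses import dataclass
from typing import Callable, List

Document = List[float]  # a logical view d_j (e.g., a term-weight vector)
Query = List[float]     # a logical view q_i of a user information need

@dataclass
class IRModel:
    D: List[Document]                      # document representations
    Q: List[Query]                         # query representations
    R: Callable[[Query, Document], float]  # ranking function R(q_i, d_j)

    def rank(self, q: Query) -> List[int]:
        """Return document indices ordered by decreasing R(q, d_j)."""
        scores = [self.R(q, d) for d in self.D]
        return sorted(range(len(self.D)), key=lambda j: scores[j], reverse=True)

# Example: a trivial dot-product ranking function over 2-term vectors.
model = IRModel(D=[[1.0, 0.0], [0.5, 0.5]], Q=[],
                R=lambda q, d: sum(x * y for x, y in zip(q, d)))
print(model.rank([0.0, 1.0]))  # -> [1, 0]
```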
Classic information retrieval models
Basic concepts: each document is described by a set of representative keywords called index terms.
Distinct index terms have varying relevance to a document; this is captured by assigning a numerical weight to each index term of a document.
Define:
ki: a generic index term.
K: the set of all index terms {k1, ..., kt}.
wi,j: a weight associated with index term ki of a document dj.
gi: a function that returns the weight associated with ki in any t-dimensional vector (gi(dj) = wi,j).
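Under this notation a document can simply be stored as its t-dimensional weight vector. The sketch below is illustrative (the variable names are my own, and the weights are taken from the worked example later in this chapter); it shows K, a document dj, and the function gi:

```python
# Index terms K = {k_1, ..., k_t} for a toy collection (t = 6).
K = ["mount", "everest", "earth", "mountain", "kalsubai", "fuji"]

# A document d_j represented by its t-dimensional vector of weights w_{i,j};
# 0.0 means the corresponding index term is absent from the document.
d_j = [0.176, 0.4, 0.2, 0.0, 0.0, 0.0]

def g(i: int, d: list) -> float:
    """g_i(d_j) = w_{i,j}: weight of index term k_i in document d_j (i is 1-based)."""
    return d[i - 1]

print(g(1, d_j))  # weight of k_1 ("mount") in d_j -> 0.176
```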
Boolean model
The Boolean model is a simple retrieval model based on set theory and Boolean algebra.
The retrieval strategy is based on a binary decision criterion (a document is predicted to be either relevant or non-relevant), without any notion of a grading scale; in reality the Boolean model is much more a data retrieval model than an information retrieval model.
Boolean expressions have precise semantics, but it is not simple to translate an information need into a Boolean expression.
Queries can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, DNF).
Boolean Model
The Boolean model considers that index terms are either present or absent in a document.
The index term weights are therefore binary: wi,j is either 1 or 0.
A query q is composed of index terms linked by three connectives: not, and, or.
Boolean Model
A query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (DNF).
E.g., q = ka ∧ (kb ∨ ¬kc)
In disjunctive normal form, qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each of the components is a binary weighted vector associated with the tuple (ka, kb, kc).
These binary weighted vectors are called the conjunctive components (qcc) of qdnf.
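As an illustration, the following sketch (a hypothetical implementation, not code from the text) retrieves a document exactly when its binary (ka, kb, kc) vector matches one of the conjunctive components of qdnf:

```python
# Conjunctive components of q_dnf for q = k_a AND (k_b OR NOT k_c),
# each tuple giving the binary weights of (k_a, k_b, k_c).
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def similarity(doc: tuple) -> int:
    """Boolean-model similarity: 1 if the document's binary term vector
    equals some conjunctive component of the query, otherwise 0."""
    return 1 if doc in q_dnf else 0

print(similarity((1, 1, 0)))  # 1 -> retrieved (k_a and k_b present, k_c absent)
print(similarity((0, 1, 1)))  # 0 -> not retrieved (k_a absent)
```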
Vector model
The vector model assigns non-binary weights to index terms in queries and in documents.
These weights are used to compute the degree of similarity between each document and the query.
The resulting ranking is more precise than that of the Boolean model.
Problem
We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem).
Intra-cluster: one needs to determine which features better describe the objects in the set A.
Inter-cluster: one needs to determine which features better distinguish the objects in the set A from the remaining objects in the collection C.
tf: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. This term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (intra-document characterization).
idf: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This factor is usually referred to as the inverse document frequency, or idf factor.

Vector Model
Let N be the total number of documents in the collection.
Let ni be the number of documents in which the index term ki appears.
Let freqi,j be the raw frequency of term ki in the document dj (the number of times the term ki is mentioned in the text of the document dj).
Then the normalized frequency fi,j of term ki in document dj is given by fi,j = freqi,j / maxl freql,j, where the maximum is computed over all terms mentioned in the text of dj.
The inverse document frequency (idf) for ki is given by idfi = log(N / ni).
The weight of term ki in document dj is then wi,j = fi,j × idfi (the tf-idf weighting scheme).
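A minimal Python sketch of these formulas (the function names are my own, and a base-10 logarithm is assumed since it reproduces the idf values used in the example below):

```python
import math

def norm_tf(freq_ij: float, max_freq_j: float) -> float:
    """Normalized frequency f_{i,j} = freq_{i,j} / max_l freq_{l,j}."""
    return freq_ij / max_freq_j if max_freq_j else 0.0

def idf(N: int, n_i: int) -> float:
    """Inverse document frequency idf_i = log10(N / n_i)."""
    return math.log10(N / n_i)

def weight(freq_ij: float, max_freq_j: float, N: int, n_i: int) -> float:
    """tf-idf weight w_{i,j} = f_{i,j} * idf_i."""
    return norm_tf(freq_ij, max_freq_j) * idf(N, n_i)

# "Mount" occurs twice in d_1 (max raw frequency in d_1 is 2)
# and appears in 2 of the 3 documents of the collection:
print(round(weight(2, 2, 3, 2), 3))  # -> 0.176
```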
Text Collection
d1: Mount Everest is Earth's highest mountain above sea level, located in the sub-range of the Himalaya. Mount Everest attracts many climbers, some of them highly experienced mountaineers.
d2: Kalsubai is a mountain in the Western Ghats, located in the Indian state. The mountain range lies within the Kalsubai Harishchandragad Wildlife Sanctuary.
d3: Mount Fuji is a very distinctive feature of the geography of Japan. The mountain stands about 100 km southwest of Tokyo.
Term:  Mount  Everest  Earth  Mountain  Kalsubai  Fuji
n_i:   2      1        1      3         1         1
Raw term frequencies:

freq_i,j   Mount  Everest  Earth  Mountain  Kalsubai  Fuji  Max freq
freq_i,1   2      2        1      1         0         0     2
freq_i,2   0      0        0      2         2         0     2
freq_i,3   1      0        0      1         0         1     1
Normalized frequencies f_i,j = freq_i,j / max_l freq_l,j:

f_i,j   Mount  Everest  Earth  Mountain  Kalsubai  Fuji
f_i,1   1      1        0.5    0.5       0         0
f_i,2   0      0        0      1         1         0
f_i,3   1      0        0      1         0         1
Vector Model

Term:   Mount  Everest  Earth  Mountain  Kalsubai  Fuji
n_i:    2      1        1      3         1         1
idf_i:  0.176  0.4      0.4    0         0.4       0.4
Term weights w_i,j:

w_i,j   Mount  Everest  Earth  Mountain  Kalsubai  Fuji
w_i,1   0.176  0.4      0.2    0         0         0
w_i,2   0      0        0      0         0.4       0
w_i,3   0.176  0        0      0         0         0.4
Query q: Mount Kalsubai

freq_i,q   Mount  Everest  Earth  Mountain  Kalsubai  Fuji  Max freq
           1      0        0      0         1         0     1

f_i,q      1      0        0      0         1         0
w_i,q   Mount  Everest  Earth  Mountain  Kalsubai  Fuji
        0.176  0        0      0         0.4       0
Vector Model: Similarity and Ranking
The degree of similarity between a document dj and the query q is taken as the cosine of the angle between their weight vectors:
sim(dj, q) = (dj · q) / (|dj| × |q|)
For the example collection and query above, the ranking of the documents is d2, d3, d1.
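The sketch below (a hypothetical helper, not code from the text) computes these cosine similarities directly from the w_i,j and w_i,q tables above and reproduces the ranking d2, d3, d1:

```python
import math

# Term-weight vectors taken from the tables above
# (term order: Mount, Everest, Earth, Mountain, Kalsubai, Fuji).
docs = {
    "d1": [0.176, 0.4, 0.2, 0.0, 0.0, 0.0],
    "d2": [0.0,   0.0, 0.0, 0.0, 0.4, 0.0],
    "d3": [0.176, 0.0, 0.0, 0.0, 0.0, 0.4],
}
q = [0.176, 0.0, 0.0, 0.0, 0.4, 0.0]

def cosine(d, q):
    """sim(d_j, q) = (d_j . q) / (|d_j| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

ranking = sorted(docs, key=lambda name: cosine(docs[name], q), reverse=True)
for name in ranking:
    print(name, round(cosine(docs[name], q), 3))
# Output: d2 0.915, d3 0.162, d1 0.147 -> ranking d2, d3, d1
```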
Advantages of the Vector Model
Its term-weighting scheme improves retrieval performance.
Its partial matching strategy allows retrieval of documents that approximate the query conditions.
Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
Disadvantage of the Vector Model
Index terms are assumed to be mutually independent; the model does not account for index term dependencies.