0% found this document useful (0 votes)
9 views34 pages

Unit2 Bool Vect Example ST

Chapter 2 discusses various information retrieval models, emphasizing the use of index terms for document indexing and retrieval. It outlines classic models such as Boolean and Vector models, detailing their mechanisms and advantages in ranking documents based on user queries. The chapter also highlights the challenges of translating information needs into Boolean expressions and the limitations of the Vector model regarding index term independence.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views34 pages

Unit2 Bool Vect Example ST

Chapter 2 discusses various information retrieval models, emphasizing the use of index terms for document indexing and retrieval. It outlines classic models such as Boolean and Vector models, detailing their mechanisms and advantages in ranking documents based on user queries. The chapter also highlights the challenges of translating information needs into Boolean expressions and the limitations of the Vector model regarding index term independence.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter 2 Modeling

Introduction
.
 Traditional information retrieval systems
usually adopt index terms to index and
retrieve documents.
 An index term is a keyword(or group of
related
words) which has some meaning of its own
(usually a noun).
The advantage of using
index terms
Simple
The semantic of the documents and of
the user information need can be
naturally expressed through sets of
index terms.

 Ranking algorithms are at the core of information


retrieval systems(predicting which documents are
relevant and which are not).
A taxonomy of information
retrieval models
Classic Models Set Theoretic
Boolean Fuzzy
Vector Extended Boolean
U Probabilistic
S Retrieval:
E Ad hoc
R Algebraic
Filtering Structured Models Generalized Vector
T Lat. Semantic Index
A Non-overlapping lists
Neural Networks
S Browsing Proximal Nodes
K

Browsing Probabilistic
Flat Inference Network
Structured Guided Belief Network
Hypertext
Index Terms Full Text Full Text+
Structure
Retrieval Classic Classic Structured
Set Set
Theoretic Theoretic
Algebraic Algebraic
Probabilistic Probabilisti
c
Browsing Flat Flat Structure
Hypertext Guided
Hypertext

Figure 2.2 Retrieval models most frequently associated with distinct


combinations of a document logical view and a user task.
Retrieval : Ad hoc and Filtering
Ad hoc : The documents in the
collection remain relatively static while
new queries are submtted to the
system.

Filtering : The queries remain


relatively static while new documents
come into the system
Filtering
Typically, the filtering task simply
indicates to the user the
documents which might be of
interest to him.

Routing : Rank the filtering


documents and show this ranking
to the user.

Constructing user profiles in two


ways.
models
D : A set composed of logical
views(or representation) for the
documents in the collection.
Q : A set composed of logical
views(or representation) for the user
information needs(queries).
F : A framework for modeling
document representations, queries,
and their relationships.
R(qi, dj) : A ranking function which
defines an ordering among the
documents with regard to the query.
Classic information retrieval
model
Basic concepts : Each document is
described by a set of
representative keywords called
index terms.

Assign a numerical weights to


distinct relevance between index
terms.
Define
ki : A generic index term
K : The set of all index terms {k1,…,kt}
wi,j : A weight associated with index term
ki of a document dj

gi : A function returns the weight


associated
with ki in any t-dimensoinal
vector( gi(dj)=wi,j )
Boolean model
The Boolean Model is a simple retrieval
model based on set theory and Boolean
algebra.
Based on a binary decision criterion
without any notion of a grading scale.
Boolean expressions have precise
semantics. It is not simple to translate an
information need into a Boolean
expression.
Can be represented as a disjunction of
conjunction vectors(in disjunctive normal
form-DNF).
Model

Retrieval strategy is based on a


binary decision without any notion
of a grading scale .
Boolean model in reality is much
more data retrival model.
Not easy to translate an
information need into a Boolean
expression
Boolean Model
The Boolean model considers that
index terms are present or absent
in a document.
The index term weights are
present or absent(1 or 0) in a
document.
A query q is composed of index
terms linked by three connectives.
not, and ,or.
Boolean Model
A query is essentially a
conventional Boolean expression
which can be represented as
disjunction of conjunctive
vectors(DNF)
Eg. q=Ka ˄ (Kb ˅ ̚ Kc )
Boolean Model
qdnf = (1 1 1) ˅ (1 1 0) ˅ (1 0 0) where each of the

components is a binary weighted vector associated with the


Ka, Kb , Kc)
tuple (

These binary weighted vectors are called the conjunctive

components( qcc) of qdnf.


Boolean Model
Vector model

Assign non-binary weights to index


terms in queries and in documents.

Compute the similarity between


documents and query.

More precise than Boolean model.


Problem
We think of the documents as a
collection C of objects and think of the
user query as a specification of a set A
of objects. In this scenario, the IR
problem can be reduced to the
problem of determine which
documents are in the set A and which
ones are not(i.e., the IR problem can
be viewed as a clustering problem).
Intra-cluster : One needs to
determine what are the features
which better describe the objects in
the set A.

Inter-cluster : One needs to


determine what are the features
which better distinguish the objects
in the set A.
tf : intra-clustering similarity is
quantified by measuring the raw
frequency of a term ki inside a
document dj, such term frequency is
usually referred to as the tf factor and
provides one measure of how well that
term describes the document contents.
(intra-document characterization)

idf : inter-clustering dissimilarity is


quantified by measuring the inverse of
the frequency of a term ki among the
documents in the collection.This
model
Let n be the total number of
documents
Let ni be the number of documents
in which the index term Ki
appears.
Let freqi,j be the raw frequency of
term Ki in the document dj . (The
number of times the term ki is
mentioned in the text of the
document dj) .
Then the normalized frequency fi,j
Vecors
Vectors
Text Collection
Mount Everest is Earth’s highest mountain above sea
level, located in the sub-range of the Himalaya. Mount
Everest attracts many climbers, some of them highly
experienced mountaineers.

Kalsubai is a mountain in the western Ghats,located in


the Indian State. The mountain range lies within the
Kalsubai Harishchandraged Wildlife Sanctuary.

Mount Fuji is a very distinctive feature of the


geography of Japan. The mountain stands about 100
km.
Mount Everest Earth Mountai Kalsuba Fuji
n i
ni 2 1 1 3 1 1
Freq
Freq Mount Everest Eart Mountain Kalsuba Fuji Maxi
i,j h i
Freq i,1 2 2 1 1 0 0 2

Freq i,2 0 0 0 2 2 0 2

Freq i,3 1 0 0 1 0 1 1

Normalized Frequency
Freq i,j Mount Everest Earth Mountain Kalsubai Fuji
Freq i,1 1 1 0.5 0.5 0 0
Freq i,2 0 0 0 1 1 0

Freq i,3 1 0 0 1 0 1
Vector Model
Mount Everest Earth Mounta Kalsub Fuji
in ai
ni 2 1 1 3 1 1

Mount Everest Earth Mounta Kalsub Fuji


in ai
idfi 0.176 0.4 0.4 0 0.4 0.4
Normalized Frequency
Freq i,j Mount Everest Earth Mountain Kalsubai Fuji
Freq i,1 1 1 0.5 0.5 0 0
Freq i,2 0 0 0 1 1 0

Freq i,3 1 0 0 1 0 1

Mount Everest Earth Mounta Kalsub Fuji


in ai
idfi 0.176 0.4 0.4 0 0.4 0.4

w i,j Mount Everest Earth Mountain Kalsubai Fuji


Wi,1 0.176 0.4 0.2 0 0 0
W i,2 0 0 0 0 0.4 0

W i,3 0.176 0 0 0 0 0.4


Q - Mount Kalsubai
Freq Mount Everest Eart Mountain Kalsuba Fuji Maxi
i,j h i
Freq i,q 1 0 0 0 1 0 1

Norm 1 0 0 0 1 0
Freq i,q

Mount Everest Earth Mounta Kalsub Fuji


in ai
idfi 0.176 0.4 0.4 0 0.4 0.4

Mount Everest Earth Mounta Kalsub Fuji


in ai
Wi,q 0.176 0 0 0 0.4 0
Vector Model Similarity
Similarity and Ranking

Ranking of Documents will be d2,d3,d1


Model
Its term-weighting scheme
improves retrieval performance

Its partial matching strategy allows


retrieval of documents that
approximate the query conditions

Its cosine ranking formula sorts the


documents according to their
degree of similarity to the query
Disadvantage of Vector
Model
Index terms are assumed to be
mutually independent. It doesn’t
account for index term
dependencies.

You might also like