MODULE I
IR Models
CHAPTER 2
Syllabus
Modeling: Taxonomy of Information Retrieval Models, Retrieval: Formal Characteristics of IR models, Classic Information Retrieval, Alternative Set Theoretic models, Probabilistic Models, Structured text retrieval Models, models for Browsing;
Self-learning Topics: Terrier
2.1 INTRODUCTION
What do you mean by information retrieval models?
Information retrieval's (IR) objective is to give users the documents they
need to satisfy their informational needs.
We use the term "document" in a broad sense to refer to both textual and
non-textual information, including multimedia items.
Index terms are typically used by traditional information retrieval
systems to index and retrieve documents. An index term is a keyword (or
a combination of related terms), usually a noun, that has a distinct
meaning of its own.
The semantics of the documents and of the user information need can be
naturally expressed through sets of index terms.
This method is simple to implement, but the retrieved documents are often
irrelevant because a lot of semantics is lost when we replace a document's
text with a set of words.
The main problem in information retrieval is judging which documents are
relevant and which are non-relevant.
Information Retrieval System (MU-Sem.7-1T) (IRModels) Pg.no.(2-2)
Information retrieval systems use ranking algorithms to determine which
documents are relevant and which are not.
The predictions of what is relevant and what is not are based on the
adopted IR model.
2.2 A TAXONOMY OF INFORMATION RETRIEVAL MODELS
GQ. What are the three classic models in information retrieval system?
Explain the taxonomy of information retrieval with a classification
diagram.
The three classic models in information retrieval:
(1) Boolean: Documents and queries are represented as sets of index terms
in the Boolean model. As a result, we describe the model as set theoretic.
(2) Vector: Documents and queries are represented as vectors in a
t-dimensional space in the vector model. As a result, we describe the
model as algebraic.
(3) Probabilistic: The framework for modeling document and query
representations in the probabilistic model is based on probability theory.
As a result, we refer to the model as probabilistic, as its name suggests.
For each sort of traditional model (i.e., set-theoretic, algebraic, and
probabilistic), alternative modeling paradigms have been put out over the
years.
We make a distinction between the fuzzy and extended Boolean models
when it comes to alternative set-theoretic models.
We differentiate the generalized vector, latent semantic indexing, and
neural network models as alternative algebraic models.
We distinguish between the inference network and belief network
models when referring to alternative probabilistic models.
We distinguish between the non-overlapping lists model and the
proximal nodes model for structured text retrieval.
A taxonomy of these information retrieval models is shown in Fig. 2.2.1.
(New Syll. wefacademic year 22-23) (M7-87) Tech-Neo Publications
[Figure: taxonomy tree organized by user task]
Retrieval (Ad hoc, Filtering)
  Classic Models: Boolean, Vector, Probabilistic
  Set Theoretic: Fuzzy, Extended Boolean
  Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  Probabilistic: Inference Network, Belief Network
  Structured Models: Non-overlapping lists, Proximal Nodes
Browsing
  Flat, Structure Guided, Hypertext
Fig. 2.2.1: A taxonomy of Information Retrieval Models
As discussed in Chapter 1, the logical view of the documents (whole text,
collection of index words, etc.), the IR model (Boolean, vector,
probabilistic, etc.), and the user tasks (retrieval, browsing) are orthogonal
features of a retrieval system.
Thus, even though some models are better suited for one user task than
another, the same IR model can be used with various document
logical views to carry out various user tasks, as shown in Fig. 2.2.2.
Logical view of documents (columns) against user task (rows):
User task | Index terms | Full text | Full text + Structure
Retrieval | Classic, set theoretic, algebraic, probabilistic | Classic, set theoretic, algebraic, probabilistic | Structured
Browsing | Flat | Flat, hypertext | Structure guided, hypertext
Fig. 2.2.2: Retrieval models most frequently associated with distinct
combinations of a document logical view and a user task
2.3 RETRIEVAL: AD HOC AND FILTERING
GQ. Define Ad hoc retrieval and Filtering.
Ad hoc retrieval: In a traditional information retrieval system, the
collection of documents remains largely static while new queries are
submitted to the system.
Filtering: The queries remain relatively static while new documents enter
(and leave) the system. In filtering, a user profile is created according
to the user's preferences.
The incoming documents are then compared to this profile in an effort to
identify any that might be of interest to this specific user.
This method can be used, for instance, to select news articles from
among the many that are broadcast each day.
Ranking of the filtered documents is not provided.
A set of keywords is used to create the user profile.
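The filtering task described above can be sketched in a few lines. This is only an illustration, assuming the profile is a plain set of keywords and that a document is passed on when it shares at least one keyword with the profile; the function name `filter_stream` and the sample profile and documents are hypothetical.

```python
def filter_stream(profile, incoming_docs):
    """Return the incoming documents that share at least one keyword with the profile.

    No ranking is produced: the result keeps the arrival order of the stream.
    """
    selected = []
    for doc in incoming_docs:
        words = set(doc.lower().split())
        if words & profile:  # non-empty intersection means a keyword matched
            selected.append(doc)
    return selected


# Hypothetical user profile and incoming news stream.
profile = {"cricket", "election"}
stream = [
    "Election results announced today",
    "New smartphone released",
    "Cricket world cup schedule",
]
matches = filter_stream(profile, stream)
```

Here the profile stays fixed while the stream changes, which is the reverse of ad hoc retrieval, where the collection stays fixed and the queries change.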
2.4 A FORMAL CHARACTERIZATION OF IR MODELS
GQ. Illustrate formal characterization of IR Model.
The formal characterization of IR Model is as follows:
Definition: An information retrieval model is a quadruple
$[D, Q, F, R(q_i, d_j)]$ where
(1) $D$ is a set composed of logical views (or representations) for the
documents in the collection.
(2) $Q$ is a set composed of logical views (or representations) for the user
information needs. Such representations are called queries.
(3) $F$ is a framework for modeling document representations, queries, and
their relationships.
(4) $R(q_i, d_j)$ is a ranking function which associates a real number with a
query $q_i \in Q$ and a document representation $d_j \in D$. Such ranking
defines an ordering among the documents with regard to the query $q_i$.
2.5 CLASSIC INFORMATION RETRIEVAL
GQ. Explain Classic Information Retrieval.
In this section, we briefly present the three classic models in information
retrieval, namely the Boolean, the vector, and the probabilistic models.
Basic Concepts
Each document is described by a group of representative keywords
called index terms.
An index term is simply a word from the document whose semantics
helps in recalling the document's main themes.
Index terms are used to index and summarise the contents of the
document. Nouns are preferred as index terms.
When used to describe the contents of a document, various index terms
have differing degrees of importance.
Each index term in a document is given a numerical weight in order to
represent this effect.
Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \geq 0$ be a weight
associated with the pair $(k_i, d_j)$. This weight quantifies the importance of
the index term for describing the document's semantic contents.
Definition: Let $t$ be the number of index terms in the system and $k_i$ be a
generic index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight
$w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an
index term which does not appear in the document text, $w_{i,j} = 0$. With the
document $d_j$ is associated an index term vector $\vec{d}_j$ represented by
$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function that returns
the weight associated with the index term $k_i$ in any t-dimensional vector
(i.e., $g_i(\vec{d}_j) = w_{i,j}$).
2.5.1 Boolean Model
GQ. What is the basis for the Boolean model?
GQ. What are the advantages and disadvantages of the Boolean model?
The Boolean model is a simple retrieval model based on set theory and
Boolean algebra.
D5 = {K4, K5, K6, K7, K8}
D6 = {K1, K2, K3, K4}
Query: $K1 \wedge (K2 \vee \neg K3)$, e.g. documents containing
K1 and (K2 or (not K3))
Answer:
{D1, D2, D4, D6} $\cap$ ({D1, D2, D3, D6} $\cup$ {D3, D5}) = {D1, D2, D6}
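The set operations in the answer can be checked with a short sketch. It assumes only the three document sets that appear in the answer line (documents containing K1, documents containing K2, and documents not containing K3); the variable names are illustrative.

```python
# Document sets read off the answer above.
has_k1 = {"D1", "D2", "D4", "D6"}    # documents containing K1
has_k2 = {"D1", "D2", "D3", "D6"}    # documents containing K2
lacks_k3 = {"D3", "D5"}              # documents NOT containing K3

# K1 AND (K2 OR (NOT K3)): AND is set intersection, OR is set union.
answer = has_k1 & (has_k2 | lacks_k3)
print(sorted(answer))  # ['D1', 'D2', 'D6']
```

Every document is either in the answer set or not, which is why the Boolean model cannot express partial matches.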
Definition: For the Boolean model, the index term weight variables are all
binary, i.e., $w_{i,j} \in \{0, 1\}$. A query $q$ is a conventional Boolean expression.
Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let $\vec{q}_{cc}$ be
any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$
to the query $q$ is defined as
$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$
If $sim(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is
relevant to the query $q$ (it might not be). Otherwise, the prediction is that the
document is not relevant.
Advantages of the Boolean Model
(1) It is the simplest model and is based on sets.
(2) Easy to understand and implement.
(3) It only retrieves exact matches.
(4) It gives the user a sense of control over the system.
(5) Boolean retrieval was adopted by many commercial bibliographic
systems.
(6) Boolean queries are akin to database queries.
Disadvantages of the Boolean Model
(1) The model's similarity function is Boolean. Hence, there would be no
partial matches. This can be annoying for the users.
(2) Information need has to be translated into Boolean expressions which
most users find awkward.
(3) In this model, the Boolean operator usage has much more influence than
a critical word.
(4) The Boolean queries formulated by the users are most often too
simplistic.
(5) As a result, the Boolean model frequently returns either too few or too
many documents in response to the user query.
(6) The query language is expressive, but it is complicated too.
(7) No ranking for retrieved documents (absence of grading scale).
(8) It is not possible to assign a degree of relevance.
2.5.2 Vector Model
GQ. Define the Vector Model with relevant mathematical equations.
GQ What are the assumptions of vector space model?
GQ. What are the Parameters in calculating a weight for a document
term or query term?
GQ. How can you calculate tf and idf in the vector model?
The vector model suggests a framework that allows for partial matching
because it acknowledges that using binary weights is too restrictive.
It assigns non-binary weights to index terms in queries and documents.
The degree of similarity between each document stored in the system
and the user query is calculated using these term weights.
The vector model considers documents that match the query terms only
partially by ordering the retrieved documents in decreasing order of this
degree of similarity.
In comparison to the Boolean model, the ranked document answer set is
significantly more precise (in the sense that it better satisfies the user's
information need).
Definition: For the vector model, the weight $w_{i,j}$ associated with a pair
$(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are
also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$,
where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as
$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where $t$ is the total number of index terms in
the system. As before, the vector for a document $d_j$ is represented by
$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
GQ. What is cosine similarity?
GQ. Define term frequency.
GQ. Define inverse document frequency.
The vector model proposes to evaluate the degree of similarity of the
document $d_j$ with regard to the query $q$ as the correlation between the
vectors $\vec{d}_j$ and $\vec{q}$.
For instance, this correlation can be quantified by the cosine of the angle
between these two vectors as shown in Fig. 2.5.1. That is,
$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vectors. The
factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the
documents) because it is the same for all documents. The factor $|\vec{d}_j|$
provides a normalization in the space of the documents.
Fig. 2.5.1: The cosine of $\theta$ is adopted as $sim(d_j, q)$.
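The cosine formula translates directly into code. The following is a minimal sketch, assuming documents and queries are given as equal-length lists of term weights; the function name `cosine_similarity`, the sample weights, and the choice to return 0 for an all-zero vector are illustrative assumptions.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)


# Hypothetical weights over t = 3 index terms.
doc = [1.0, 2.0, 0.0]
query = [1.0, 0.0, 0.0]
score = cosine_similarity(doc, query)  # 1 / sqrt(5), roughly 0.447
```

The document matches only one of its two weighted terms, so it receives a score strictly between 0 and 1, which is the partial matching the Boolean model lacks.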
The vector model measures intra-cluster similarity by the raw frequency
of a term $k_i$ within a document $d_j$.
Such term frequency is usually referred to as the tf factor and provides
one measure of how well that term describes the document contents (i.e.,
intra-document characterization).
Inter-cluster dissimilarity is measured by the inverse of the frequency of
a term $k_i$ among the documents in the collection. This factor is
known as the inverse document frequency or the idf factor.
Definition: Let $N$ be the total number of documents in the system and $n_i$ be
the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be
the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times
the term $k_i$ is mentioned in the text of the document $d_j$). Then, the
normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by
$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
where the maximum is computed over all terms which are mentioned in the text
of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$
then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$,
be given by
$$idf_i = \log \frac{N}{n_i}$$
Weights are given by
$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$$
Such term weighting schemes are called tf-idf schemes.
The Vector Model Example
Let us consider that a collection includes 10,000 documents.
The term A appears 20 times in a particular document.
The maximum number of appearances of any term in this document is 50.
The term A appears in 2,000 of the collection documents.
$f_{i,j} = freq_{i,j} / \max_l freq_{l,j} = 20/50 = 0.4$
$idf_i = \log_2(N/n_i) = \log_2(10{,}000/2{,}000) = \log_2(5) = 2.32$
$w_{i,j} = f_{i,j} \times \log_2(N/n_i) = 0.4 \times 2.32 = 0.93$
(The logarithm is taken to base 2 in this example.)
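The arithmetic of this example can be reproduced with a short sketch. It assumes a base-2 logarithm, since that is the base that yields the value 2.32 used above; the function name `tf_idf` and its parameter names are illustrative.

```python
import math


def tf_idf(freq, max_freq, n_docs, docs_with_term):
    """Return (tf, idf, weight) with tf = freq / max_freq and idf = log2(N / n_i)."""
    tf = freq / max_freq
    idf = math.log2(n_docs / docs_with_term)  # base 2 reproduces the 2.32 in the example
    return tf, idf, tf * idf


tf, idf, weight = tf_idf(freq=20, max_freq=50, n_docs=10_000, docs_with_term=2_000)
print(round(tf, 2), round(idf, 2), round(weight, 2))  # 0.4 2.32 0.93
```

A different logarithm base rescales every idf value by the same constant, so it changes the weights but not the relative ordering of terms.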
GQ. What are the advantages and disadvantages of the Vector Model?
Advantages of Vector Space Model
(1) Its term-weighting scheme improves the quality of the answer set and
retrieval performance.
(2) Its partial matching strategy allows retrieval of documents that
approximate the query conditions.
(3) Its cosine ranking formula sorts the documents according to their degree
of similarity to the query.
Disadvantage of Vector Space Model
(1) The index terms are assumed to be mutually independent.
2.5.3 Probabilistic Model
GQ. What are the fundamental assumptions for probabilistic principle?
GQ. Write the advantages and disadvantages of probabilistic model.
The probabilistic model is an effort to frame the information retrieval
problem within a probabilistic framework.
The probabilistic model tries to estimate the probability that the user will
find the document $d_j$ relevant, using the ratio
P($d_j$ relevant to $q$) / P($d_j$ non-relevant to $q$)
It is used to derive the ranking functions with which search engines,
including web search engines, rank matching documents according to their
relevance to a given search query.
This model is used to calculate the probability that a document $d_j$ will
be relevant to a given query $q$.
The model makes the assumption that this probability of relevance depends
on the query and the document representations.
Given a query $q$, there exists a subset $R$ of the documents which are
relevant to $q$, but membership of $R$ is uncertain.
Users start with information needs, which they translate into query
representations. Similarly, there are documents, which are converted into
document representations. Given only a query, an IR system has an
uncertain understanding of the information need.
So IR is an uncertain process because of:
(1) the translation of an information need into a query,
(2) the conversion of documents into index terms,
(3) the mismatch between query terms and index terms.
Probability theory provides a principled foundation for such reasoning
under uncertainty. This model estimates how likely it is that a document is
relevant to an information need.
Documents can be relevant or non-relevant; we can estimate the
probability of a term $t$ appearing in a relevant document, $P(t \mid R = 1)$.
Probabilistic methods are one of the oldest but also one of the currently
hottest topics in Information Retrieval.
For the probabilistic model:
GQ. How can you find the similarity between document and query in
probabilistic principle using Bayes' rule?
All index term weights are binary, i.e., $w_{i,j} \in \{0, 1\}$, $w_{i,q} \in \{0, 1\}$.
Let $R$ be the set of documents known to be relevant to query $q$.
Let $\bar{R}$ be the complement of $R$.
Let $P(R \mid \vec{d}_j)$ be the probability that the document $d_j$ is relevant to the
query $q$.
Let $P(\bar{R} \mid \vec{d}_j)$ be the probability that the document $d_j$ is non-relevant to the
query $q$.
The similarity $sim(d_j, q)$ of the document $d_j$ to the query $q$ is defined as
the ratio
$$sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}$$
Using Bayes' rule,
$$sim(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})}$$
$P(\vec{d}_j \mid R)$ stands for the probability of randomly selecting the document $d_j$
from the set $R$ of relevant documents.
$P(R)$ stands for the probability that a document randomly selected from
the entire collection is relevant.
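The Bayes' rule form of the ratio can be illustrated numerically. This is only a sketch with made-up probability estimates: the function name `relevance_odds`, the two documents A and B, and every number below are hypothetical, and the complement probability is taken as 1 minus P(R).

```python
def relevance_odds(p_d_given_r, p_r, p_d_given_nr):
    """sim(d, q) = P(d|R) * P(R) / (P(d|not R) * P(not R)), with P(not R) = 1 - P(R)."""
    p_nr = 1.0 - p_r
    return (p_d_given_r * p_r) / (p_d_given_nr * p_nr)


# Hypothetical estimates for two documents and one query.
p_r = 0.1
odds_a = relevance_odds(p_d_given_r=0.02, p_r=p_r, p_d_given_nr=0.001)
odds_b = relevance_odds(p_d_given_r=0.01, p_r=p_r, p_d_given_nr=0.004)

# Rank documents in decreasing order of the ratio.
ranking = sorted([("A", odds_a), ("B", odds_b)], key=lambda pair: pair[1], reverse=True)
```

Because P(R) and its complement are the same for every document, they scale all ratios equally and do not change the ordering; only the two conditional probabilities matter for ranking.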
Advantage of Probabilistic Model
(1) Documents are ranked in decreasing order of probability of relevance.
Disadvantages of Probabilistic Model
(1) Need to guess initial estimates for $P(k_i \mid R)$.
2.6 ALTERNATIVE SET THEORETIC MODELS
GQ. Discuss alternative set theoretic models.
In this section, we discuss two alternative set theoretic models, namely
the fuzzy set model and the extended Boolean model.
2.6.1 Fuzzy Set Model
GQ. Explain fuzzy set model.
GQ. Write basics of fuzzy set theory.
When documents and queries are represented by sets of keywords, the
resulting descriptions are only loosely related to the actual semantic
contents of the corresponding documents and queries.
As a result, the match between a document and the query terms is only
approximate (or vague).
This can be represented mathematically by assuming that each query
term defines a fuzzy set and that each document has a degree of
membership (usually smaller than 1) in this set.
This interpretation provides the foundation for many models of IR based
on fuzzy theory.
Basics of Fuzzy Set Theory
Fuzzy set theory is an extension of classical set theory.
Elements have a varying degree of membership. A logic based on only two
truth values, True and False, is sometimes insufficient for describing
human reasoning.
Fuzzy Logic uses the whole interval between 0 (false) and 1 (true) to
describe human reasoning.
A Fuzzy Set is any set that allows its members to have different degrees
of membership, given by a membership function taking values in the
interval [0, 1].
Fuzzy Logic is derived from fuzzy set theory.
Many degrees of membership (between 0 and 1) are allowed.
Thus a membership function $\mu_A(x)$ is associated with a fuzzy set $A$
such that the function maps every element of the universe of discourse $X$
to the interval [0, 1].
The mapping is written as: $\mu_A(x): X \rightarrow [0, 1]$.
Fuzzy Logic is capable of handling inherently imprecise (vague or
inexact or rough or inaccurate) concepts.
A fuzzy set is defined as follows: If $X$ is a universe of discourse and $x$ is
a particular element of $X$, then a fuzzy set $A$ defined on $X$ can be
written as a collection of ordered pairs $A = \{(x, \mu_A(x)),\ x \in X\}$.
GQ. Define membership function.
GQ. Explain fuzzy information retrieval.
Example
Let X = {g1, g2, g3, g4, g5} be the reference set of students.
Let A be the fuzzy set of "smart" students, where "smart" is a fuzzy
term.
A = {(g1, 0.4), (g2, 0.5), (g3, 1), (g4, 0.9), (g5, 0.8)}
Here A indicates that the smartness of g1 is 0.4, and so on.
Membership Function: The membership function fully defines the
fuzzy set. A membership function provides a measure of the degree of
similarity of an element to a fuzzy set.
Fuzzy Information Retrieval
The main idea is to supplement the query's index terms with related
terms (obtained from a thesaurus) so that the user query can retrieve
more relevant documents.
A thesaurus can be created by building a term-term correlation matrix
(referred to as a keyword connection matrix) whose rows and columns are
associated with the index terms in the document collection. In this
matrix $C$, a normalized correlation factor $c_{i,l}$ between two terms $k_i$ and $k_l$
can be defined by
$$c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}}$$
where $n_i$ is the number of documents which contain the term $k_i$, $n_l$ is the
number of documents which contain the term $k_l$, and $n_{i,l}$ is the
number of documents which contain both terms.
Each index term $k_i$ is associated with a fuzzy set, and in this fuzzy set a
document $d_j$ has a degree of membership $\mu_{i,j}$ computed as
$$\mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l})$$
which computes an algebraic sum over all terms in the document $d_j$.
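The correlation factor and the degree of membership can be computed with a small sketch. It assumes each document is represented as a set of index terms; the function names `correlation` and `membership` and the three-document collection are hypothetical.

```python
from math import prod


def correlation(docs, term_i, term_l):
    """c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l}) over a list of term sets."""
    n_i = sum(term_i in d for d in docs)
    n_l = sum(term_l in d for d in docs)
    n_il = sum(term_i in d and term_l in d for d in docs)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0


def membership(docs, term_i, doc):
    """mu_{i,j} = 1 - product over terms k_l in the document of (1 - c_{i,l})."""
    return 1.0 - prod(1.0 - correlation(docs, term_i, k_l) for k_l in doc)


# Hypothetical collection: each document is a set of index terms.
docs = [{"a", "b"}, {"a"}, {"b", "c"}]
mu = membership(docs, "a", docs[2])  # document {"b", "c"} does not contain "a"
```

The third document does not contain the term "a", yet its membership in the fuzzy set of "a" is 1/3 because it contains "b", which co-occurs with "a" elsewhere in the collection. A document that contains the term itself gets membership 1, since the correlation of a term with itself is 1.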
2.6.2 Extended Boolean Model
GQ. Discuss extended Boolean model.
In the Boolean model, there is no provision for term weighting and no
ranking of the answer set is generated.
As a result, the size of the output might be too large or too small.
An alternative strategy is to add the capabilities of term
weighting and partial matching to the Boolean model. With this method,
it is possible to combine vector model properties with Boolean query
formulations.
The extended Boolean model was introduced in 1983 by Salton, Fox,
and Wu.
2.7 STRUCTURED TEXT RETRIEVAL MODELS
GQ. Explain Structured text retrieval models.
Think about a user who has a strong visual memory. A user of this type
might remember that the particular document in which he is
interested has a page where the phrase "Nuclear Blast" occurs in italics
in the text around a Figure whose label contains the word "earth".
This query may be phrased as ['Nuclear Blast' and 'earth'] in a traditional
information retrieval approach, which would return all documents containing
both strings. But it is clear that this user did not want as many
documents as this answer provides.
In this scenario, the user wants to make his query more precise by using a
richer expression, like
same-page (near ('Nuclear Blast', Figure (label ('earth'))))
which conveys the details in his visual recollection.
Structured text retrieval models are retrieval models that
incorporate information on both the text content and the document
structure.
A structured text retrieval system looks for all the documents that match
the search criteria exactly, which is why the retrieval task is not
associated with any notion of relevance.
The current models for structured text retrieval are therefore data
retrieval models rather than information retrieval models.
However, a retrieval system could also search for documents that match the
query conditions only partially.
The position in the text of a string of words that matches the user query is
referred to as the "term match point."
e.g. user query: ['information retrieval system']
If this appears at 3 positions in document $d_j$, then there are 3 match points.
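Match points can be located with a simple scan over the words of a document. The sketch below assumes positions are counted in words from 0 and that matching ignores case; the function name `match_points` and the sample text are hypothetical.

```python
def match_points(text, query):
    """Return the word offsets where the query phrase occurs in the text."""
    words = text.lower().split()
    phrase = query.lower().split()
    n = len(phrase)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase]


# Hypothetical document text.
doc = ("An information retrieval system indexes text. "
       "Every information retrieval system ranks results.")
points = match_points(doc, "information retrieval system")
```

In this sample the phrase occurs twice, so the document has two match points, at word offsets 1 and 7.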
2.7.1 Model based on Non-overlapping lists
GQ. Explain non-overlapping lists with the help of an example.
Each document's whole text is divided into a list of non-overlapping text
sections.
Multiple lists are generated as there are various ways to break a text into
non-overlapping sections. For example
(1) A List for chapters
(2) A List for sections
(3) A List for subsections
These lists are kept as separate and distinct data structures.
A single inverted file is built with each structural element to allow
searching for both index terms and text regions. Fig. 2.7.1 shows an
example of different lists.
[Figure: four lists over the same text]
L1: chapter
L2: section
L3: subsections
L4: subsubsections
Fig. 2.7.1: Structure of text documents through different indexing lists
Implementation
A single inverted file is built, in which each structural component stands
as an entry in the index.
Each entry has a list of text regions as a list of occurrences.
Such a list could be easily merged with the traditional inverted file.
Example types of queries
Select a region that contains a given word.
Select a region A which does not contain any other region B.
Select a region not contained within any other region.
2.7.2 Model Based on Proximal Nodes
GQ. Discuss model based on proximal nodes.
This model was proposed by Navarro and Baeza-Yates.
The basic idea is to define a strict hierarchical index over the text. This
enriches the previous model, which uses flat lists.
It allows the definition of independent hierarchical (non-flat) indexing
structures over the same text of the document.
Every indexing structure is made up of nodes, which are chapters,
sections, paragraphs, pages, and lines.
Each node is associated with a text region.
If a user query refers to different hierarchies, the compiled answer is formed
by nodes which all come from only one of them.
This type of model allows us to formulate more complex queries than the
model based on non-overlapping lists.
Only nearby (proximal) nodes are looked at, for faster query processing.
Fig. 2.7.2 shows a hierarchical indexing structure of four levels and an
inverted list for the word 'Everest'.
[Figure: four-level hierarchy with an inverted list]
L1: chapter
L2: section
L3: subsections
L4: subsubsections
Inverted list entries for 'Everest': 10  256  48,304
Fig. 2.7.2: Hierarchical indexing structure
Features
One node might be contained within another node.
But two nodes of the same hierarchy cannot overlap.
The inverted list for words complements the hierarchical index.
Query language in regular expression
(1) Searches for strings
(2) Reference to structural components by name
(3) Combination of these
An example query: [(*section) with ("Everest")]
It searches for the sections, the subsections, and the sub-subsections that
contain the word "Everest".
The model is a compromise between expressiveness and efficiency.
2.8 MODELS FOR BROWSING
Sometimes the user is interested in spending some time exploring the
document collection, looking for interesting references, instead of
searching with a specific query.
Users have goals to pursue in both cases.
But the goal of a searching task is clearer in the user's mind than the
goal of a browsing task.
Types of Browsing
GQ. What are different types of browsing?
(1) Flat Browsing
Documents are represented as dots in a (two-dimensional) plane or as
elements in a (single-dimension) list.
The user then glances here and there looking for information within the
documents visited.
The user looks for correlations among neighboring documents or for
keywords.
These keywords could be added to the original query for query
expansion; this process is called relevance feedback. It helps in the
retrieval of more relevant documents.
Users can also explore a single document in a flat manner (like a web
page).
Drawback
On a given page, the user may have no indication of the context in which
the page appears. For example, if a user opens a book on a random page,
he might not know in which chapter that page is.
(2) Structure Guided Browsing
Documents are organized in a structure such as a directory to help users in
browsing.
Directories are hierarchies of classes that group documents covering
related topics.
These hierarchies of classes have been used to classify document
collections. E.g., "Yahoo!" provides a hierarchical directory.
The user performs a structure guided type of browsing.
The same idea can be applied to a single document:
(a) Chapter level, section level, etc.
(b) The last level is the text itself (flat).
(c) A good UI is needed for keeping track of the context in a focused
manner,
e.g. Adobe Acrobat PDF files.
Additional facilities are provided when searching, such as
(a) A history map to identify classes recently visited.
(b) Display of occurrences (of terms) by showing the structures in a
global context, in addition to the text positions.
(3) The Hypertext Model
The fundamental concept related to the task of writing is the notion of
sequencing.
A sequenced organizational structure lies underneath most written
text.
The reader should not expect to fully understand the message conveyed
by the writer by randomly reading pieces of text here and there.
Sometimes, we cannot capture the information even through sequential
reading of the whole text.
For example, a book about "the history of the wars" is organized
chronologically, but the user might be interested in wars fought by a
particular army or country. In any such case the user will have a tough time
finding the information he is interested in, because the contents are
organized sequentially.
In these situations, one possible solution is to rewrite the book, but
rewriting the book for every such need is not practical.
Another solution is to define a new structure to organize the contents,
which can be achieved through the design of hypertext.
Hypertext
A high-level interactive navigational structure that allows users to browse
text non-sequentially.
It consists of nodes (text regions) correlated by directed links in a graph
structure.
A node could be a chapter in a book, a section in an article, or a
web page.
Links are attached to specific strings inside the nodes.
The process of navigating the hypertext can be understood as a traversal
of a directed graph.
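Navigation as graph traversal can be sketched as follows. This is an illustration only, assuming the hypertext is stored as a dictionary that maps each node to the nodes it links to; the function name `reachable`, the breadth-first strategy, and the three-node hypertext are hypothetical choices.

```python
from collections import deque


def reachable(links, start):
    """Breadth-first traversal: nodes reachable from `start` by following links."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


# Hypothetical hypertext: node -> nodes it links to.
links = {
    "home": ["chapter1", "chapter2"],
    "chapter1": ["chapter2"],
    "chapter2": ["home"],
}
order = reachable(links, "home")
```

The list of visited nodes doubles as a simple history of where the user has been, which is the kind of information a hypertext map presents.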
Hypertexts provide the basis for HTML (Hypertext Markup Language)
and HTTP (Hypertext Transfer Protocol).
Drawbacks of Hypertext
(1) Lost in hyperspace: the user will lose track of the organizational
structure of the hypertext when it is large. A hypertext map that shows
where the user is at all times (graphical user interface design) can help.
(2) The user is restricted to the intended flow of information previously
conceived by the hypertext designer. The design should take into account
the needs of potential users, so analyzing the requirements before starting
implementation of the hypertext is required.
(3) During the hypertext navigation, the user might find it difficult to
orient himself. Guiding tools can help in navigation (hypertext map).
Short Questions and Answers
Q.1 What do you mean by information retrieval models?
Ans.
A retrieval model can be a description of either the computational
process or the human process of retrieval: The process of choosing
documents for retrieval; the process by which information needs are first
articulated and then refined.
Q. 2 What is cosine similarity?
Ans.
This metric is frequently used to determine the similarity between two
documents, or between a document and a query. It is the cosine of the
angle between their term-weight vectors, so it depends on the terms the
two vectors have in common and is normalized by the lengths of the vectors.
Q.3 What are the characteristics of relevance feedback?
Ans.
(1) It shields the user from the details of the query reformulation process.
(2) It breaks down the whole searching task into a sequence of small steps
which are easier to grasp.
(3) It provides a controlled process designed to emphasize some terms and
de-emphasize others.
Q.4 What are the assumptions of vector space model?
Ans.
Assumptions of the vector space model:
(1) The degree of matching can be used to rank-order documents.
(2) This rank-ordering corresponds to how well a document satisfies a
user's information need.
Q.5 What are the disadvantages of Boolean model?
Ans.
(1) It is not simple to translate an information need into a Boolean
expression.
(2) Exact matching may lead to retrieval of too many documents.
(3) The retrieved documents are not ranked
(4) The model does not use term weights
Q.6 Define term frequency.
Ans.
Term frequency: Frequency of occurrence of a query keyword in a document.
Q.7 What are the three classic models in information retrieval system?
Ans.
(1) Boolean model
(2) Vector Space model
(3) Probabilistic model
Q.8 What is the basis for Boolean model?
Ans.
Simple model based on set theory and Boolean algebra
(1) Documents are sets of terms.
(2) Queries are specified as Boolean expressions on terms.
Q.9 What are the disadvantages of Boolean model?
Ans.
(1) Exact matching may retrieve too few or too many documents.
(2) Difficult to rank output; some documents are more important than
others.
(3) Hard to translate a query into a Boolean expression.
(4) All terms are equally weighted.
(5) More like data retrieval than information retrieval.
(6) No notion of partial matching.
Q.10 What are the fundamental assumptions for probabilistic principle?
Ans.
$q$ - user query, $d_j$ - document in the collection.
The model assumes that relevance depends on the query and the document
representations only.
$R$ - ideal answer set, relevant to the query.
$\bar{R}$ - complement of the ideal answer set, non-relevant to the query.
The similarity to the query is computed as the probabilistic ranking ratio
Ratio = P($d_j$ relevant to $q$) / P($d_j$ non-relevant to $q$)
This ranking minimizes the probability of an erroneous judgment.
Q.11 Write the advantages and disadvantages of probabilistic model.
Ans.
Advantages
(1) Documents are ranked in decreasing order of their probability of relevance.
Disadvantages
(1) Need to guess the initial separation of doc's into relevant and non-
relevant sets.
(2) All weights are binary
(3) The adoption of the independence assumption for index terms
(4) Need to guess initial estimates for $P(k_i \mid R)$
(5) Method does not take into account tf and idf factors
Q.12 Why might classic IR lead to poor retrieval?
Ans.
(1) The user information need is more related to concepts and ideas than to
index terms, but classic IR relies on index terms.
(2) Unrelated documents might be included in the answer set.
(3) Relevant documents that do not contain at least one index term are not
retrieved.
(4) Reasoning: retrieval based on index terms is vague and noisy.
Chapter Ends...