Chapter 4 - Part II

Chapter Four of the document discusses the Vector Space Model (VSM) in Information Retrieval (IR) systems, highlighting its development, key concepts, and issues related to graphical representation and similarity measures. The VSM allows for partial matching by using non-binary weights for terms, facilitating better ranking of documents based on their similarity to queries. The chapter also covers various similarity measures, including inner product and cosine similarity, and provides examples of how to compute these measures for document retrieval.

Models of Modern IR Systems
Chapter Four - Part II

1
Topics

 Overview
 Boolean/Logical Model
 Vector Space Model (VSM)

2
Vector Space Model (VSM)

The most widely used retrieval model

3
Topics
 Overview and Issues of vector space model
 Graphical representation of documents and queries in
vector space model
 Similarity in vector space model
 Similarity measures: inner product, cosine, Dice,
Jaccard, overlap
 Measuring similarity between documents and a query
 Ranking in vector space model
 Advantages and disadvantages of vector space model
 Exercise/Assignments

4
Vector Space Model
 Developed within the context of the SMART project
(an experimental IR system by G. Salton and his students, since
1961), out of the recognition that binary weights are too
limiting.
 It proposes a framework in which partial matching is possible.
 This is accomplished by assigning non-binary weights to index
terms in queries and documents, which allow partial matches
to be taken into account.
 These term weights are used to compute a degree of
similarity between the query and each document stored in the
system; the resulting ranked set of documents provides better
matching.
5
Cont…
 Key or essential ideas:
 Indexing terms are regarded as the coordinates
of a multidimensional information space
 Everything (documents, queries) is a vector in a
t-dimensional space

6
Document Collection Representation

An entry in the matrix corresponds to the “weight” of a term
in the document; zero means the term has no significance in
the document, or simply does not occur in it.

 wi,j indicates the importance of term i (ki) in document j (dj):

        T1    T2   ...   Tt
  D1    w11   w21  ...   wt1
  D2    w12   w22  ...   wt2
  :     :     :          :
  Dn    w1n   w2n  ...   wtn
7
Issues in vector space model
 In Graphic Representation
 How to convert objects (i.e., documents and queries) into vectors
 Terms (axis)
 Easy to determine (lexical analysis, etc)
 One dimension per term
 Documents (point) – how to represent documents?
 As the sum of their term vectors
E.g., D1 = 2t1 + 3t2 + 4t3 = <2, 3, 4>
 Queries (point) – how to represent queries?
 The same way documents are represented
 We regard queries as short documents
 How to select magnitude along a dimension (term weighting)
 How to compare objects in vector space (cosine similarity)

8
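As a minimal sketch of the conversion step above (Python, with illustrative term names, not code from the slides), a document can be mapped to a vector by counting each vocabulary term:

```python
from collections import Counter

def to_vector(tokens, vocabulary):
    # One dimension per vocabulary term; the coordinate along each
    # dimension is the term's raw frequency in the document.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocab = ["t1", "t2", "t3"]
d1 = to_vector("t1 t1 t2 t2 t2 t3 t3 t3 t3".split(), vocab)
print(d1)  # [2, 3, 4], i.e. D1 = 2t1 + 3t2 + 4t3
```

A query is vectorized the same way, treating it as a short document.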
Graphic Representation
A document is a point in this space.
More terms mean more dimensions: the space is multidimensional,
and its size depends on the number of terms.
Example:
 D1 = 2T1 + 3T2 + 5T3
 D2 = 3T1 + 7T2 + T3
 Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q plotted as points in the
three-dimensional space with axes T1, T2 and T3]

• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity?

9
Example systems using VSM
 SMART, developed by Salton and his students
at Cornell starting in the 1960s
 Most Web search engines

10
Cont…
 The vector matching operation is based on the
cosine formula: the cosine of the angle between two
vectors is used to compute their similarity.
 More terms mean more dimensions.
 The space is multidimensional, and its size
depends on the number of terms.
 We can think of the space as a library, where
each point is the location of a document in that library.

11
Cont…
 We can use the vector space model
 For term weighting
 Recall the term discrimination value (TDV)
 For ranking
 As a matching function

 Thus
 What do you think are the basic steps in implementing
retrieval through the vector space model?

12
Steps
 Identify the unique terms in a collection
 To convert objects into vectors, use the unique
index terms
 Represent documents in vector form
 Weights can be binary or real-valued
 To select the magnitude along a dimension, use term
weighting
 Represent queries in vector form
 The coefficients (vector elements, term weights)
represent term presence, importance, or “aboutness”
 Find the similarity between the two
 To compare objects (queries to documents) in vector
space, use similarity measures
 Rank based on the result
13
Cont…
 Some common choices in doing this are
 Binary: 1 = term present, 0 = term not
present in the document
 tf: the frequency of the term in the document
 tf · idf
 Signal–noise

14
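The tf·idf choice, for instance, can be sketched in a few lines (Python; this assumes idf = log10(N/df), the variant whose values appear in the worked example in this chapter):

```python
import math

def idf(df, n_docs):
    # Inverse document frequency: a term occurring in every
    # document gets idf = log10(1) = 0.
    return math.log10(n_docs / df)

def tf_idf(tf, df, n_docs):
    # Composite weight: raw term frequency times idf.
    return tf * idf(df, n_docs)

# A term appearing once in 1 of 3 documents vs. once in 2 of 3:
print(round(tf_idf(1, 1, 3), 3))  # 0.477
print(round(tf_idf(1, 2, 3), 3))  # 0.176
```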
Cont…
 The index terms in the query are also weighted
 The query vector vec(q) is defined as
vec(q) = (w1q, w2q, ..., wtq)
 The vector of a document is represented as
vec(dj) = (w1j, w2j, ..., wtj)

 Thus, in this space, queries and documents are
represented as weighted vectors.
 A document dj and a query q are both represented as
t-dimensional vectors.

15
Cont…
 Thus, instead of attempting only to predict
 whether a document is relevant or not,
 the VSM ranks the documents according to their
degree of similarity to the query.
 A document is retrieved even if it matches the query
terms only partially

 To compute rankings, we need to specify how index term
weights are obtained.

16
Vector space similarity: Common measures

17
Inner Product
 Similarity between the vectors for document dj and query q
can be computed as the vector inner product:

    sim(dj, q) = dj • q = Σ (i = 1 to n) wij · wiq

 where wij is the weight of term i in document j, and wiq is
the weight of term i in the query q

 For binary vectors, the inner product is the number of


matched query terms in the document (size of intersection).
 For weighted term vectors, it is the sum of the products of the
weights of the matched terms.
18
Inner Product -- Examples
 Given the following term-document matrix, using inner
product which document is more relevant for the query Q?

Retrieval Database Architecture


D1 2 3 5
D2 3 7 1
Q 1 0 2

 sim(D1 , Q) = 2*1 + 3*0 + 5*2 = 12


 sim(D2 , Q) = 3*1 + 7*0 + 1*2 = 5

19
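The inner-product computation above can be checked with a few lines of Python (a sketch, not code from the slides):

```python
def inner_product(doc, query):
    # Sum of products of matching term weights.
    return sum(w_d * w_q for w_d, w_q in zip(doc, query))

# Term order: retrieval, database, architecture
d1, d2, q = [2, 3, 5], [3, 7, 1], [1, 0, 2]
print(inner_product(d1, q))  # 12 -> D1 is more relevant
print(inner_product(d2, q))  # 5
```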
Cosine similarity
 Measures the similarity between a document dj and a query q,
captured by the cosine of the angle between them:

    sim(dj, q) = (dj • q) / (|dj| · |q|)
               = Σ (i = 1 to n) wi,j · wi,q
                 / ( sqrt(Σ (i = 1 to n) wi,j²) · sqrt(Σ (i = 1 to n) wi,q²) )

 The denominator involves the lengths of the vectors
 So the cosine measure is also known as the normalized
inner product

    Length |dj| = sqrt(Σ (i = 1 to n) wi,j²)
20
Example : Computing Cosine Similarity

• Say we have a query vector Q = (0.4, 0.8) and a
document D1 = (0.2, 0.7). Compute their similarity using
the cosine measure:

    sim(Q, D1) = ((0.4 * 0.2) + (0.8 * 0.7))
               / ( sqrt(0.4² + 0.8²) * sqrt(0.2² + 0.7²) )
               = 0.64 / sqrt(0.424)
               = 0.64 / 0.651
               ≈ 0.98
21
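The same computation in Python (a sketch; `math.sqrt` gives the vector lengths in the denominator):

```python
import math

def cosine(a, b):
    # Normalized inner product: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

q, d1 = [0.4, 0.8], [0.2, 0.7]
print(round(cosine(q, d1), 2))  # 0.98
```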
Vector Space Similarity: Cosine
Coefficient (correlation) Example

22
Simulating Retrieval
Example/ Exercise

23
Example (VSM for document clustering)

 Given three documents D1, D2 and D3 with the corresponding
TF-IDF weights, which documents are most similar under the
three similarity measures?

Terms      D1     D2     D3
affection  0.996  0.993  0.847
jealous    0.087  0.120  0.466
gossip     0.017  0.000  0.254

24
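A sketch of the cosine part of this exercise (Python, not a full solution; the inner product and the other measures follow the same pattern):

```python
import math
from itertools import combinations

def cosine(a, b):
    # Normalized inner product of two weight vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# TF-IDF weights from the table, ordered: affection, jealous, gossip
docs = {"D1": [0.996, 0.087, 0.017],
        "D2": [0.993, 0.120, 0.000],
        "D3": [0.847, 0.466, 0.254]}

for (na, va), (nb, vb) in combinations(docs.items(), 2):
    print(na, nb, round(cosine(va, vb), 3))
# D1 and D2 come out as the most similar pair
```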
VSM for retrieval: Example

• Suppose the user issues the query Q = “gold silver truck”. The
document collection consists of three documents with the
following content:

D1: “Shipment of gold damaged in a fire”


D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”

25
Vector-Space Model: Example

 How do you think the query should be submitted?
 Show the retrieval results in ranked order
1. Assume that full-text terms are used during indexing,
without removing common terms or stop words, and
without stemming
2. Also compare your results with and without normalized
term frequency

26
Vector-Space Model: Example
(Here IDF = log10(N/DF), with N = 3 documents; W = TF*IDF.)

Terms      TF(Q) TF(D1) TF(D2) TF(D3)   DF   IDF     W(Q)   W(D1)  W(D2)  W(D3)
a            0     1      1      1       3   0       0      0      0      0
arrived      0     0      1      1       2   0.176   0      0      0.176  0.176
damaged      0     1      0      0       1   0.477   0      0.477  0      0
delivery     0     0      1      0       1   0.477   0      0      0.477  0
fire         0     1      0      0       1   0.477   0      0.477  0      0
gold         1     1      0      1       2   0.176   0.176  0.176  0      0.176
in           0     1      1      1       3   0       0      0      0      0
of           0     1      1      1       3   0       0      0      0      0
silver       1     0      2      0       1   0.477   0.477  0      0.954  0
shipment     0     1      0      1       2   0.176   0      0.176  0      0.176
truck        1     0      1      1       2   0.176   0.176  0      0.176  0.176
27
Vector-Space Model

Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
28
Vector-Space Model: Example

 Compute similarity using the cosine measure, Sim(q, d1)

• First, for each document and the query, compute the vector
lengths (zero terms ignored):

    |d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.5170) = 0.719
    |d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
    |d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.1240) = 0.352
    |q|  = sqrt(0.176² + 0.477² + 0.176²)          = sqrt(0.2896) = 0.538

29
Vector-Space Model: Example

• Next, compute the dot products (zero products ignored):

    Q • d1 = 0.176*0.176                 = 0.0310
    Q • d2 = 0.954*0.477 + 0.176*0.176   = 0.4862
    Q • d3 = 0.176*0.176 + 0.176*0.176   = 0.0620


30
Cont…
Now, compute similarity score
Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801
Sim(q,d2) = (0.4862 ) / (0.538*1.095)= 0.8246
Sim(q,d3) = (0.0620) / (0.538*0.352)= 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801

31
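The whole worked example can be reproduced end to end with a short Python script (a sketch; the scores differ from the slides only in the last decimal place because no intermediate rounding is done):

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Vocabulary, document frequencies, and idf = log10(N/df)
tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({t for toks in tokenized.values() for t in toks})
n_docs = len(docs)
idf = {t: math.log10(n_docs / sum(t in toks for toks in tokenized.values()))
       for t in vocab}

def weight_vector(tokens):
    # Raw tf * idf for every vocabulary term.
    tf = Counter(tokens)
    return [tf[t] * idf[t] for t in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

q_vec = weight_vector(query.split())
scores = {name: cosine(weight_vector(toks), q_vec)
          for name, toks in tokenized.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 4))
# Ranking: D2, then D3, then D1
```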
Cont…

• Exercise: using normalized TF, rank the documents with the
cosine similarity measure

• Hint: normalize the TF of term i in document j by the maximum
frequency of any term k in document j

32
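As a hint for this exercise, the normalization step alone can be sketched as follows (Python; the term counts are illustrative, not the full solution):

```python
def normalized_tf(tf_counts):
    # tf_norm(i, j) = freq(i, j) / max_k freq(k, j)
    max_tf = max(tf_counts.values())
    return {term: freq / max_tf for term, freq in tf_counts.items()}

# e.g. two of the term counts of D2:
print(normalized_tf({"silver": 2, "delivery": 1}))
# {'silver': 1.0, 'delivery': 0.5}
```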
The Vector Space Model: Advantages

 Ability to incorporate term weights; any type of term
weighting can be used
 Its term-weighting scheme improves the quality of the answer
set (the retrieval performance): good retrieval quality
 Its partial-matching strategy allows retrieval of documents that
approximate the query conditions
 Its cosine ranking formula sorts documents by their
degree of similarity to the query
 Fast, and a popular model nowadays
 Can measure similarity between almost anything: documents
and queries, documents and documents, queries and queries
33
The Vector Space Model: Disadvantages

 Lack of justification for some vector operations
 E.g., the choice of similarity function (various
similarity measures can be defined)
 E.g., the choice of term weights (various term-
weighting techniques can be defined)
 The need for several query terms to achieve a
discriminating ranking (easily handled in the Boolean model
by ANDing two terms)
 For a large text collection, the term-by-document matrix is
huge and requires a lot of storage space

34
Difference between Boolean and VS model

 The underlying set of terms is the same for both models.
 Differences between the two models arise when the
individual term representations and the methods of determining
the similarity between a document and the query are
considered.
 In the Boolean retrieval model, the similarity between a
document and a query (or between any two documents) is
based on the presence or absence of terms in both the
query and the document.
 In the VS model, similarity and retrieval use more sophisticated
similarity calculations.

35
Exercise
case on VSM

36
Simulate a retrieval of documents in
response to the query given

 Doc01- artificial, intelligence, knowledge, base, artificial


 Doc02- robot, artificial, technology, robot
 Doc03- robot, technology, robot, technology
 Doc04- technology, knowledge
 Query- artificial, intelligence

37
Cont…

 Use VSM
 Use- tf*idf (composite with a raw
frequency)
 Use- Cosine Similarity measure
 Rank in decreasing order of relevance and
comment on the result
 Show the necessary steps

38
More points related to IR models

 Types of search
 High-recall search: all relevant documents
 High-precision search: only the top relevant documents
 Brief search: a few relevant documents, as opposed to all
relevant ones

 How do you think we can satisfy the above
requirements?

39
Query Reformulation
 The information need often differs from what is expressed
(formulated) in the query
 This follows from the very nature of the information need
 Thus, reformulating the search statement is necessary
 Query reformulation is basically based on relevance
feedback
 Relevance feedback is the idea of allowing users to
provide feedback on the relevance of one or more
retrieved documents, so that it can be used to reformulate
the query

40
Cont…
 Reformulating the search statement through relevance
feedback
 Manual
 Automatic
 Semi-automatic
 Relevance feedback
 How? Specifically, in the automatic case:
 Pseudo-feedback

41
How to evaluate Models?

 We need to investigate what procedures they follow and what
techniques they use:
 Do they use binary or non-binary weighting to measure the
importance of terms in documents?
 Do they use similarity measurements?
 Do they apply partial matching?
 Do they perform exact matching or best matching for
document retrieval?
 Any ranking mechanism?

42
On project – SE component development
 What you will implement
 At least two modules or functions; more functions are better
 If you want to do the retrieval/searching subsystem
 You may assume that you have a database of index terms
linked with some documents in another folder
 Submission: a document containing
 Introduction
 Flowchart representation of the major process
 Description of how it works, assumptions, etc.
 Screenshots of some interfaces and the source code
 Submission date for both types of projects: August 22, 2022
 There will be a face-to-face presentation
43
Good Job

44
