Here is a detailed, student-friendly explanation of the Vector Space Model (VSM), suitable for teaching engineering students: how to introduce the concept, explain it step by step, and engage students with relatable real-world examples.

🎓 Teaching Guide: Vector Space Model (VSM)

1️⃣ INTRODUCTION: Why Do We Need VSM?

Start with a Real-Life Analogy:

"When you search something on Google like ‘best laptop for coding’, how does it know which
web pages to show you first?"

Let them guess.

Then say:

"Google represents both your query and every document (web page) as vectors—like arrows in
space—and then finds which ones are pointing in a similar direction. That’s the idea of the
Vector Space Model."

2️⃣ CONCEPT: What Is the Vector Space Model?

Definition:
The Vector Space Model (VSM) is a mathematical model that represents text documents
and queries as vectors in a multi-dimensional space. Each unique word is treated as a
dimension.

Think of it as converting words to numbers so a computer can calculate how similar they are.

3️⃣ STEP-BY-STEP EXPLANATION

Let’s go through the full process in detail.

📘 Step 1: Create a Document Collection (Corpus)


Example:

D1: "cat sat on the mat"


D2: "dog sat on the log"
D3: "cat chased dog"

🔑 Goal: Convert these into numbers (vectors)

📕 Step 2: Build the Vocabulary

Extract unique terms from all documents:

Vocabulary: [cat, sat, on, the, mat, dog, log, chased]

Each word is a dimension in our vector space.

So we have an 8-dimensional space.

📊 Step 3: Represent Each Document as a Vector

Now, for each document, count how many times each term appears.

Term D1 D2 D3
cat 1 0 1
sat 1 1 0
on 1 1 0
the 1 1 0
mat 1 0 0
dog 0 1 1
log 0 1 0
chased 0 0 1

So:

 D1 → [1,1,1,1,1,0,0,0]
 D2 → [0,1,1,1,0,1,1,0]
 D3 → [1,0,0,0,0,1,0,1]
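Steps 1–3 can be sketched in a few lines of pure Python (the document names and helper names here are just for illustration):

```python
# Steps 1-3 as a sketch: corpus -> vocabulary -> term-frequency vectors.
docs = {
    "D1": "cat sat on the mat",
    "D2": "dog sat on the log",
    "D3": "cat chased dog",
}

# Build the vocabulary in first-seen order so it matches the table above.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)

def tf_vector(text, vocab):
    """Count how many times each vocabulary term appears in the text."""
    words = text.split()
    return [words.count(term) for term in vocab]

vectors = {name: tf_vector(text, vocab) for name, text in docs.items()}
print(vocab)          # ['cat', 'sat', 'on', 'the', 'mat', 'dog', 'log', 'chased']
print(vectors["D1"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Running this reproduces exactly the three vectors listed above.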

🤖 Step 4: Compare Vectors Using Cosine Similarity


Cosine similarity measures how similar two vectors are based on the angle between them.

Cosine Similarity = (A · B) / (||A|| × ||B||)

Why cosine?

 It handles documents of different lengths
 It focuses on direction, not magnitude
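A minimal cosine similarity helper, assuming the vectors are plain Python lists of term counts:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = [1, 1, 1, 1, 1, 0, 0, 0]  # "cat sat on the mat"
d2 = [0, 1, 1, 1, 0, 1, 1, 0]  # "dog sat on the log"
print(round(cosine_similarity(d1, d2), 3))  # 0.6 -- they share "sat on the"
```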

💡 Step 5: Use It for Information Retrieval

Say a user types a query:

Query: "cat sat"

Convert it to a vector:
["cat", "sat", "on", "the", "mat", "dog", "log", "chased"] → [1,1,0,0,0,0,0,0]

Now compute the cosine similarity between this query vector and each document vector.

The document with the highest similarity score is considered the most relevant.
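The whole retrieval step can be sketched as follows, reusing the count vectors from Step 3 (the helper simply implements the cosine formula above):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vocab = ["cat", "sat", "on", "the", "mat", "dog", "log", "chased"]
doc_vectors = {
    "D1": [1, 1, 1, 1, 1, 0, 0, 0],
    "D2": [0, 1, 1, 1, 0, 1, 1, 0],
    "D3": [1, 0, 0, 0, 0, 1, 0, 1],
}

query = "cat sat"
q_vec = [query.split().count(term) for term in vocab]  # [1, 1, 0, 0, 0, 0, 0, 0]

# Rank documents by similarity to the query, highest first.
ranking = sorted(doc_vectors,
                 key=lambda d: cosine_similarity(q_vec, doc_vectors[d]),
                 reverse=True)
print(ranking)  # ['D1', 'D3', 'D2'] -- D1 contains both query terms
```

Note that with raw counts D1 wins because it matches both "cat" and "sat"; TF-IDF weighting (below) refines these scores further.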

4️⃣ Application Areas of VSM

Area Use Case
Search Engines Ranking web pages based on your query
Email Spam Detection Comparing incoming mail with spam vectors
Document Clustering Grouping similar research papers or news
Recommendation Systems Finding similar reviews or products

5️⃣ Pros and Cons

✅ Advantages

 Simple and effective


 Good with TF-IDF weighting
 Fast similarity computation

❌ Limitations

 Ignores word order


 Doesn’t capture meaning or semantics
 Doesn’t handle synonyms well

For example, "car" and "automobile" are treated as completely different!

🧠 Bonus: TF-IDF Weighting (Term Frequency - Inverse Document Frequency)

Instead of just counting how many times a word appears (TF), we also reduce the weight of
common words like "the", "is", "on".

This improves relevance!

TF-IDF = TF × log(N / DF)

Where:

 TF = term frequency in the document


 DF = number of documents containing the term
 N = total number of documents

6️⃣ Visual Illustration

Imagine a 3D space for 3 terms: "cat", "sat", "dog"

Document Vector (x:cat, y:sat, z:dog)


D1 (1, 1, 0)
D2 (0, 1, 1)
Query (1, 1, 0)

Query vector is pointing in the same direction as D1 → most relevant!

7️⃣ Classroom Activity

✅ Task: Give students 3 documents and a query. Ask them to:

1. Build vocabulary
2. Create vectors
3. Compute cosine similarity
4. Rank documents by relevance
Then discuss real-world relevance like how Amazon recommends products or how Netflix finds
similar movies.

PowerPoint Lesson: Vector Space Model (VSM) for Engineering Students

Slide 1: Title Slide

Title: Vector Space Model (VSM) in Text Mining


Subtitle: Understanding Document Similarity with Vectors
Presented by: [Your Name]

Slide 2: Why Vector Space Model?

Prompt:
"How does Google decide which pages to show when you search?"
Answer:
It represents text (documents and queries) as vectors and finds similarity.

Slide 3: What is VSM?

 A method to represent text documents and queries as vectors


 Each unique word is a dimension
 Similarity is measured mathematically (cosine similarity)

Slide 4: Real-World Analogy

 Think of vectors as arrows


 Documents that point in similar directions are more alike
 Just like arrows pointing the same way

Slide 5: Sample Document Set


Documents:

 D1: "cat sat on the mat"


 D2: "dog sat on the log"
 D3: "cat chased dog"

Slide 6: Vocabulary Creation

Vocabulary:
[cat, sat, on, the, mat, dog, log, chased]

 Each term is a dimension


 Total: 8-dimensional space

Slide 7: Term Frequency Vectors

Term D1 D2 D3
cat 1 0 1
sat 1 1 0
on 1 1 0
the 1 1 0
mat 1 0 0
dog 0 1 1
log 0 1 0
chased 0 0 1

Slide 8: Vector Representation

 D1 → [1,1,1,1,1,0,0,0]
 D2 → [0,1,1,1,0,1,1,0]
 D3 → [1,0,0,0,0,1,0,1]

Slide 9: Cosine Similarity

Formula:
Cosine Similarity = (A · B) / (||A|| ||B||)

 Measures angle between two vectors


 Closer angle = more similar

Slide 10: Query Example

Query: "cat sat"


Vector: [1,1,0,0,0,0,0,0]

Compare this query with D1, D2, D3 using cosine similarity.

Slide 11: Real-World Uses

 Search engines (Google, Bing)


 Spam filters
 Document classification
 Product and content recommendation

Slide 12: Advantages of VSM

 Easy to implement
 Efficient for similarity comparisons
 Improves with TF-IDF weighting

Slide 13: Limitations of VSM

 Ignores word order and context


 High-dimensional for large corpora
 Can’t detect synonyms or meaning (semantic gaps)

Slide 14: Bonus - TF-IDF

Formula:
TF-IDF = TF × log(N / DF)

 Reduces weight for common words like "the"


 Highlights unique words
Slide 15: Classroom Activity

Task:

 Provide 3 mini-documents and a query


 Ask students to:
1. Build vocabulary
2. Create term frequency vectors
3. Compute cosine similarity
4. Rank documents

Slide 16: Summary

 VSM represents documents as vectors


 Similarity helps retrieve relevant info
 Foundation for many NLP applications

Slide 17: Questions & Discussion

"How can we improve upon VSM in modern systems?"


(Hint: Semantic models like Word2Vec, BERT)

Slide 18: Thank You!

Contact: [Your Email / Institute]


Let’s keep exploring AI & NLP!

Let's walk through how to use cosine similarity to compare documents or text queries, in a way that's easy for engineering students to follow.
✅ What is Cosine Similarity?
Cosine Similarity measures the angle between two vectors. The smaller the angle, the more
similar the vectors.

Cosine Similarity = (A⃗ · B⃗) / (||A⃗|| × ||B⃗||)

Where:

 A⃗ · B⃗ is the dot product of the vectors
 ||A⃗|| is the magnitude (length) of vector A
 ||B⃗|| is the magnitude (length) of vector B



✏️ Step-by-Step Example (Manual Calculation)


Let’s say you have:

Term Query (Q) D1 D2


cat 1 1 0
sat 1 1 1
on 0 1 1
mat 0 1 0

Vectors:

 Query (Q) = [1, 1, 0, 0]


 Document 1 (D1) = [1, 1, 1, 1]
 Document 2 (D2) = [0, 1, 1, 0]

🔸 Step 1: Dot Product

Q · D1 = (1×1) + (1×1) + (0×1) + (0×1) = 2
Q · D2 = (1×0) + (1×1) + (0×1) + (0×0) = 1

🔸 Step 2: Magnitudes
||Q|| = √(1² + 1² + 0² + 0²) = √2
||D1|| = √(1² + 1² + 1² + 1²) = √4 = 2
||D2|| = √(0² + 1² + 1² + 0²) = √2

🔸 Step 3: Cosine Similarity

CosSim(Q, D1) = 2 / (√2 × 2) = 2 / (2√2) = 1/√2 ≈ 0.707
CosSim(Q, D2) = 1 / (√2 × √2) = 1/2 = 0.5

✅ Result:

 Document 1 is more similar to the query than Document 2.


 So D1 will be ranked higher than D2 for this query.
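The manual calculation above can be checked with a few lines of Python:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

q  = [1, 1, 0, 0]  # "cat sat"
d1 = [1, 1, 1, 1]
d2 = [0, 1, 1, 0]

print(round(cosine_similarity(q, d1), 3))  # 0.707
print(round(cosine_similarity(q, d2), 3))  # 0.5
```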

🧠 Summary for Students


Concept Meaning
Dot Product Measures overlap (matching values)
Magnitude Length of the vector (sqrt of squares)
Cosine Similarity Cosine of the angle between the vectors
Higher Cosine Value More similarity

🧪 Try It Yourself (Activity)


Ask students to:

1. Choose 3 small documents and 1 query


2. Create vectors
3. Calculate cosine similarity
4. Rank documents based on similarity score

Here are the cosine similarity scores between the query "cat sat" and each document:

Document Similarity Score
Document 1 0.516
Document 3 0.317
Document 2 0.250

✅ Interpretation:

 Document 1 ("cat sat on the mat") is the most similar to the query.
 Documents 3 and 2 follow, with decreasing similarity.
 These values are computed using TF-IDF weighting, which reduces the impact of
common words.

