Here is a detailed, student-friendly explanation of the Vector Space Model (VSM), suitable for teaching engineering students: how to introduce the concept, explain it step by step, and engage students with relatable real-world examples.

🎓 Teaching Guide: Vector Space Model (VSM)

1️⃣ INTRODUCTION: Why Do We Need VSM?

Start with a Real-Life Analogy:

"When you search something on Google like ‘best laptop for coding’, how does it know which
web pages to show you first?"

Let them guess.

Then say:

"Google represents both your query and every document (web page) as vectors—like arrows in
space—and then finds which ones are pointing in a similar direction. That’s the idea of the
Vector Space Model."

2️⃣ CONCEPT: What Is the Vector Space Model?

Definition:
The Vector Space Model (VSM) is a mathematical model that represents text documents
and queries as vectors in a multi-dimensional space. Each unique word is treated as a
dimension.

Think of it as converting words to numbers so a computer can calculate how similar they are.

3️⃣ STEP-BY-STEP EXPLANATION

Let’s go through the full process in detail.

📘 Step 1: Create a Document Collection (Corpus)


Example:

D1: "cat sat on the mat"


D2: "dog sat on the log"
D3: "cat chased dog"

🔑 Goal: Convert these into numbers (vectors)

📕 Step 2: Build the Vocabulary

Extract unique terms from all documents:

Vocabulary: [cat, sat, on, the, mat, dog, log, chased]

Each word is a dimension in our vector space.

So we have an 8-dimensional space.

📊 Step 3: Represent Each Document as a Vector

Now, for each document, count how many times each term appears.

Term D1 D2 D3
cat 1 0 1
sat 1 1 0
on 1 1 0
the 1 1 0
mat 1 0 0
dog 0 1 1
log 0 1 0
chased 0 0 1

So:

 D1 → [1,1,1,1,1,0,0,0]
 D2 → [0,1,1,1,0,1,1,0]
 D3 → [1,0,0,0,0,1,0,1]
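Steps 1–3 can be sketched in a few lines of pure Python (the document names and helper names here are just for illustration):

```python
# Steps 1-3 as a sketch: corpus -> vocabulary -> term-frequency vectors.
docs = {
    "D1": "cat sat on the mat",
    "D2": "dog sat on the log",
    "D3": "cat chased dog",
}

# Build the vocabulary in first-seen order so it matches the table above.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)

def tf_vector(text, vocab):
    """Count how many times each vocabulary term appears in the text."""
    words = text.split()
    return [words.count(term) for term in vocab]

vectors = {name: tf_vector(text, vocab) for name, text in docs.items()}
print(vocab)          # ['cat', 'sat', 'on', 'the', 'mat', 'dog', 'log', 'chased']
print(vectors["D1"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Running this reproduces exactly the three vectors listed above.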

🤖 Step 4: Compare Vectors Using Cosine Similarity


Cosine similarity measures how similar two vectors are based on the angle between them.

Cosine Similarity = (A · B) / (||A|| × ||B||)

Why cosine?

 It handles documents of different lengths
 It focuses on direction, not magnitude
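A minimal cosine similarity helper, assuming the vectors are plain Python lists of term counts:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = [1, 1, 1, 1, 1, 0, 0, 0]  # "cat sat on the mat"
d2 = [0, 1, 1, 1, 0, 1, 1, 0]  # "dog sat on the log"
print(round(cosine_similarity(d1, d2), 3))  # 0.6 -- they share "sat on the"
```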

💡 Step 5: Use It for Information Retrieval

Say a user types a query:

Query: "cat sat"

Convert it to a vector:
["cat", "sat", "on", "the", "mat", "dog", "log", "chased"] → [1,1,0,0,0,0,0,0]

Now compute the cosine similarity between this query vector and each document vector.

The document with the highest similarity score is considered the most relevant.
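The whole retrieval step can be sketched as follows, reusing the count vectors from Step 3 (the helper simply implements the cosine formula above):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

vocab = ["cat", "sat", "on", "the", "mat", "dog", "log", "chased"]
doc_vectors = {
    "D1": [1, 1, 1, 1, 1, 0, 0, 0],
    "D2": [0, 1, 1, 1, 0, 1, 1, 0],
    "D3": [1, 0, 0, 0, 0, 1, 0, 1],
}

query = "cat sat"
q_vec = [query.split().count(term) for term in vocab]  # [1, 1, 0, 0, 0, 0, 0, 0]

# Rank documents by similarity to the query, highest first.
ranking = sorted(doc_vectors,
                 key=lambda d: cosine_similarity(q_vec, doc_vectors[d]),
                 reverse=True)
print(ranking)  # ['D1', 'D3', 'D2'] -- D1 contains both query terms
```

Note that with raw counts D1 wins because it matches both "cat" and "sat"; TF-IDF weighting (below) refines these scores further.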

4️⃣ Application Areas of VSM

Area Use Case
Search Engines Ranking web pages based on your query
Email Spam Detection Comparing incoming mail with spam vectors
Document Clustering Grouping similar research papers or news
Recommendation Systems Finding similar reviews or products

5️⃣ Pros and Cons

✅ Advantages

 Simple and effective


 Good with TF-IDF weighting
 Fast similarity computation

❌ Limitations

 Ignores word order


 Doesn’t capture meaning or semantics
 Doesn’t handle synonyms well

For example, "car" and "automobile" are treated as completely different!

🧠 Bonus: TF-IDF Weighting (Term Frequency - Inverse Document Frequency)

Instead of just counting how many times a word appears (TF), we also reduce the weight of
common words like "the", "is", "on".

This improves relevance!

TF-IDF = TF × log(N / DF)

Where:

 TF = term frequency in the document


 DF = number of documents containing the term
 N = total number of documents

6️⃣ Visual Illustration

Imagine a 3D space for 3 terms: "cat", "sat", "dog"

Document Vector (x:cat, y:sat, z:dog)


D1 (1, 1, 0)
D2 (0, 1, 1)
Query (1, 1, 0)

Query vector is pointing in the same direction as D1 → most relevant!

7️⃣ Classroom Activity

✅ Task: Give students 3 documents and a query. Ask them to:

1. Build vocabulary
2. Create vectors
3. Compute cosine similarity
4. Rank documents by relevance
Then discuss real-world relevance like how Amazon recommends products or how Netflix finds
similar movies.

PowerPoint Lesson: Vector Space Model (VSM) for Engineering Students

Slide 1: Title Slide

Title: Vector Space Model (VSM) in Text Mining


Subtitle: Understanding Document Similarity with Vectors
Presented by: [Your Name]

Slide 2: Why Vector Space Model?

Prompt:
"How does Google decide which pages to show when you search?"
Answer:
It represents text (documents and queries) as vectors and finds similarity.

Slide 3: What is VSM?

 A method to represent text documents and queries as vectors


 Each unique word is a dimension
 Similarity is measured mathematically (cosine similarity)

Slide 4: Real-World Analogy

 Think of vectors as arrows


 Documents that point in similar directions are more alike
 Just like arrows pointing the same way

Slide 5: Sample Document Set


Documents:

 D1: "cat sat on the mat"


 D2: "dog sat on the log"
 D3: "cat chased dog"

Slide 6: Vocabulary Creation

Vocabulary:
[cat, sat, on, the, mat, dog, log, chased]

 Each term is a dimension


 Total: 8-dimensional space

Slide 7: Term Frequency Vectors

Term D1 D2 D3
cat 1 0 1
sat 1 1 0
on 1 1 0
the 1 1 0
mat 1 0 0
dog 0 1 1
log 0 1 0
chased 0 0 1

Slide 8: Vector Representation

 D1 → [1,1,1,1,1,0,0,0]
 D2 → [0,1,1,1,0,1,1,0]
 D3 → [1,0,0,0,0,1,0,1]

Slide 9: Cosine Similarity

Formula:
Cosine Similarity = (A · B) / (||A|| ||B||)

 Measures angle between two vectors


 Closer angle = more similar

Slide 10: Query Example

Query: "cat sat"


Vector: [1,1,0,0,0,0,0,0]

Compare this query with D1, D2, D3 using cosine similarity.

Slide 11: Real-World Uses

 Search engines (Google, Bing)


 Spam filters
 Document classification
 Product and content recommendation

Slide 12: Advantages of VSM

 Easy to implement
 Efficient for similarity comparisons
 Improves with TF-IDF weighting

Slide 13: Limitations of VSM

 Ignores word order and context


 High-dimensional for large corpora
 Can’t detect synonyms or meaning (semantic gaps)

Slide 14: Bonus - TF-IDF

Formula:
TF-IDF = TF × log(N / DF)

 Reduces weight for common words like "the"


 Highlights unique words
Slide 15: Classroom Activity

Task:

 Provide 3 mini-documents and a query


 Ask students to:
1. Build vocabulary
2. Create term frequency vectors
3. Compute cosine similarity
4. Rank documents

Slide 16: Summary

 VSM represents documents as vectors


 Similarity helps retrieve relevant info
 Foundation for many NLP applications

Slide 17: Questions & Discussion

"How can we improve upon VSM in modern systems?"


(Hint: Semantic models like Word2Vec, BERT)

Slide 18: Thank You!

Contact: [Your Email / Institute]


Let’s keep exploring AI & NLP!

Let's walk through how to use cosine similarity to compare documents or text queries, in a way that's easy for engineering students to follow.
✅ What is Cosine Similarity?
Cosine Similarity measures the angle between two vectors. The smaller the angle, the more
similar the vectors.

Cosine Similarity = (A⃗ · B⃗) / (||A⃗|| × ||B⃗||)

Where:

 A⃗ · B⃗ is the dot product of the vectors
 ||A⃗|| is the magnitude (length) of vector A
 ||B⃗|| is the magnitude (length) of vector B



✏️ Step-by-Step Example (Manual Calculation)


Let’s say you have:

Term Query (Q) D1 D2


cat 1 1 0
sat 1 1 1
on 0 1 1
mat 0 1 0

Vectors:

 Query (Q) = [1, 1, 0, 0]


 Document 1 (D1) = [1, 1, 1, 1]
 Document 2 (D2) = [0, 1, 1, 0]

🔸 Step 1: Dot Product

Q · D1 = (1×1) + (1×1) + (0×1) + (0×1) = 2
Q · D2 = (1×0) + (1×1) + (0×1) + (0×0) = 1

🔸 Step 2: Magnitudes
||Q|| = √(1² + 1² + 0² + 0²) = √2
||D1|| = √(1² + 1² + 1² + 1²) = √4 = 2
||D2|| = √(0² + 1² + 1² + 0²) = √2

🔸 Step 3: Cosine Similarity

CosSim(Q, D1) = 2 / (√2 × 2) = 2 / (2√2) = 1/√2 ≈ 0.707
CosSim(Q, D2) = 1 / (√2 × √2) = 1/2 = 0.5

✅ Result:

 Document 1 is more similar to the query than Document 2.


 So D1 will be ranked higher than D2 for this query.
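The manual calculation above can be checked with a few lines of Python:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

q  = [1, 1, 0, 0]  # "cat sat"
d1 = [1, 1, 1, 1]
d2 = [0, 1, 1, 0]

print(round(cosine_similarity(q, d1), 3))  # 0.707
print(round(cosine_similarity(q, d2), 3))  # 0.5
```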

🧠 Summary for Students


Concept Meaning
Dot Product Measures overlap (matching values)
Magnitude Length of the vector (sqrt of squares)
Cosine Similarity Cosine of the angle between the vectors
Higher Cosine Value More similarity

🧪 Try It Yourself (Activity)


Ask students to:

1. Choose 3 small documents and 1 query


2. Create vectors
3. Calculate cosine similarity
4. Rank documents based on similarity score

Here are the cosine similarity scores between the query "cat sat" and each document:

Document Similarity Score
Document 1 0.516
Document 3 0.317
Document 2 0.250

✅ Interpretation:

 Document 1 ("cat sat on the mat") is the most similar to the query.
 Documents 3 and 2 follow, with decreasing similarity.
 These values are computed using TF-IDF weighting, which reduces the impact of
common words.

