Understanding Term Weighting Methods

1. Documents and queries are represented as vectors of terms where each term is assigned a weight. Common weighting schemes include binary, term frequency (TF), inverse document frequency (IDF), and TF-IDF.
2. TF measures how frequently a term appears in a document while IDF measures how rare a term is across documents, with rarer terms given higher weight.
3. TF-IDF is the most commonly used weighting scheme as it favors terms that appear frequently in a document but rarely across documents, making the terms more discriminative. The weight is the product of TF and IDF.

Uploaded by

abraham getu


Terms
Terms are usually stems. Terms can also be phrases,
such as “Computer Science”, “World Wide Web”, etc.
Documents and queries are represented as vectors or
“bags of words” (BOW).
Each vector holds a place for every term in the collection:
position 1 corresponds to term 1, position 2 to term 2, and
position n to term n.

$D_i = (w_{d_i1}, w_{d_i2}, \ldots, w_{d_in})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qn})$
$w = 0$ if a term is absent.

Documents are represented by binary weights or
non-binary weighted vectors of terms.
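As a minimal sketch of the vector representation described above, the following maps a tokenized text onto a fixed vocabulary. The vocabulary, the sample texts, and the helper name `to_vector` are hypothetical, and raw counts stand in for the weights.

```python
from collections import Counter

def to_vector(tokens, vocabulary):
    """Map a token list to a count vector aligned with `vocabulary`."""
    counts = Counter(tokens)
    # Position k of the vector holds the weight of vocabulary[k];
    # a term that is absent from the text gets weight 0.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and texts, for illustration only.
vocabulary = ["computer", "science", "web"]
doc = "computer science is the science of computing".split()
query = "web science".split()

doc_vector = to_vector(doc, vocabulary)      # [1, 2, 0]
query_vector = to_vector(query, vocabulary)  # [0, 1, 1]
print(doc_vector, query_vector)
```

Words outside the vocabulary ("is", "the", "of", "computing") are simply ignored, which is why the document and the query end up as vectors of the same length.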
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of
a term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn

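A term-document matrix of this shape could be built as in the sketch below. The three tokenized documents and the function name `term_document_matrix` are hypothetical, and raw counts are used as the "weight" of each entry.

```python
def term_document_matrix(docs, terms):
    """One row per document; entry [j][i] is the count of terms[i] in docs[j]."""
    return [[doc.count(term) for term in terms] for doc in docs]

# Hypothetical tokenized documents, for illustration only.
docs = [
    ["cow", "brown", "cow"],
    ["brown", "fox"],
    ["fox"],
]
terms = ["brown", "cow", "fox"]

matrix = term_document_matrix(docs, terms)
for name, row in zip(["D1", "D2", "D3"], matrix):
    print(name, row)
```

A zero entry means the term does not occur in that document, matching the convention stated above.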
Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector.
• The binary formula gives every word that appears in a document equal relevance.
• It can be useful when frequency is not important.

docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1

• Binary weights formula:
$w_{ij} = 1$ if $freq_{ij} > 0$
$w_{ij} = 0$ if $freq_{ij} = 0$
Why use term weighting?
Binary weights are too limiting:
terms are either present or absent, so
documents cannot be ordered according to their level of
relevance for a given query.

Non-binary weights make it possible to model partial matching.

Partial matching allows retrieval of docs that approximate
the query.
• Term weighting improves the quality of the answer set.
Term weighting enables ranking of retrieved documents,
so that the best-matching documents are ordered at the top
because they are more relevant than the others.
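The limitation can be seen in a small sketch that scores three documents against a query with an inner product. The rows for D3, D5 and D10 are taken from the deck's binary and term-frequency tables; the query (terms t2 and t3) and the helper name `score` are assumptions made for illustration.

```python
# Rows for D3, D5 and D10 over terms (t1, t2, t3).
binary = {"D3": [0, 1, 1], "D5": [1, 1, 1], "D10": [0, 1, 1]}
counts = {"D3": [0, 4, 7], "D5": [1, 6, 3], "D10": [0, 3, 5]}
query = [0, 1, 1]  # hypothetical query containing t2 and t3

def score(doc_vector, query_vector):
    """Inner product of a document vector and a query vector."""
    return sum(d * q for d, q in zip(doc_vector, query_vector))

binary_scores = {name: score(vec, query) for name, vec in binary.items()}
count_scores = {name: score(vec, query) for name, vec in counts.items()}

# With binary weights every document ties; with counts they can be ordered.
ranking = sorted(count_scores, key=count_scores.get, reverse=True)
print(binary_scores)
print(count_scores, ranking)
```

All three documents score 2 under binary weights, while the count-based scores (11, 9 and 8) give a usable ordering.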
Term Weighting: Term Frequency (TF)
TF (term frequency): count the number
of times a term occurs in a document.
$f_{ij}$ = frequency of term $i$ in document $j$
The more times a term t occurs in document d,
the more likely it is that t is relevant to the document,
i.e. more indicative of the topic.
• If used alone, it favors common words and long
  documents.
• It gives too much credit to words that appear
  more frequently.

docs  t1  t2  t3
D1    2   0   3
D2    1   0   0
D3    0   4   7
D4    3   0   0
D5    1   6   3
D6    3   5   0
D7    0   8   0
D8    0   10  0
D9    0   0   1
D10   0   3   5
D11   4   0   1

We may want to normalize term frequency (tf) by the
largest frequency of any term in the same document, so that
values are comparable across the entire corpus:
$tf_{ij} = f_{ij} / \max_k\{f_{kj}\}$
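A possible reading of this normalization is sketched below. It assumes the maximum is taken over the terms of the same document, which is what the deck's later worked example ("a maximum term frequency of 3") suggests; the function name `normalized_tf` is hypothetical.

```python
def normalized_tf(counts):
    """Divide each raw count by the largest count in the same document."""
    largest = max(counts)
    if largest == 0:
        # A document containing none of the terms keeps all-zero weights.
        return [0.0 for _ in counts]
    return [c / largest for c in counts]

d3 = [0, 4, 7]    # row D3 of the term-frequency table
d8 = [0, 10, 0]   # row D8 of the term-frequency table

print(normalized_tf(d3))
print(normalized_tf(d8))
```

After normalization the most frequent term of every document has tf = 1, so a document such as D8 no longer dominates merely because it repeats one term ten times.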
Document Normalization
• Long documents have an unfair advantage:
  • They use a lot of terms,
    so they get more matches than short documents.
  • They use the same words repeatedly,
    so they have much higher term frequencies.
• Normalization seeks to remove these effects:
  • It is related to the maximum term frequency,
  • but is also sensitive to the number of terms.
• If we don’t normalize, short documents may not be
  recognized as relevant.
Problems with term frequency
We need a mechanism for attenuating the effect of terms
that occur too often in the collection to be meaningful for
determining relevance.
• Scale down the weight of terms with a high collection
  frequency:
  reduce the tf weight of a term by a factor that grows with its
  collection frequency.
• More commonly, document frequency is used for this purpose:
  the number of documents in the collection that contain the term.

• The example shows that collection frequency and document
  frequency behave differently.
Document Frequency
• It is defined as the number of documents in the collection
  that contain a term.

DF = document frequency

• Count the frequency over the whole collection
  of documents.
• The less frequently a term appears in the whole collection,
  the more discriminating it is.

$df_i$ = document frequency of term $i$
       = number of documents containing term $i$
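To make the distinction between document frequency and collection frequency concrete, the sketch below computes both for the three terms of the deck's term-frequency table. The helper names are hypothetical.

```python
# Term-frequency table from the deck: document -> counts of (t1, t2, t3).
tf_table = {
    "D1": [2, 0, 3], "D2": [1, 0, 0], "D3": [0, 4, 7], "D4": [3, 0, 0],
    "D5": [1, 6, 3], "D6": [3, 5, 0], "D7": [0, 8, 0], "D8": [0, 10, 0],
    "D9": [0, 0, 1], "D10": [0, 3, 5], "D11": [4, 0, 1],
}

def document_frequency(table, term_index):
    """Number of documents in which the term occurs at least once."""
    return sum(1 for row in table.values() if row[term_index] > 0)

def collection_frequency(table, term_index):
    """Total number of occurrences of the term in the whole collection."""
    return sum(row[term_index] for row in table.values())

df = [document_frequency(tf_table, i) for i in range(3)]
cf = [collection_frequency(tf_table, i) for i in range(3)]
print(df, cf)
```

In this table all three terms occur in 6 of the 11 documents, yet their collection frequencies are 14, 36 and 20, so the two statistics do not move together.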
Inverse Document Frequency (IDF)
IDF measures the rarity of a term in the collection. It
is a measure of the general importance of the term.
• It inverts the document frequency.
• It diminishes the weight of terms that occur very frequently
  in the collection and increases the weight of
  terms that occur rarely.
  • It gives full weight to terms that occur in one document
    only.
  • It gives the lowest weight to terms that occur in all documents.
• Terms that appear in many different documents are less
  indicative of the overall topic.
$idf_i$ = inverse document frequency of term $i$
        $= \log_2(N / df_i)$ (N: total number of documents)
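The two boundary cases mentioned above follow directly from the formula, assuming every indexed term occurs in at least one document ($df_i \ge 1$):

```latex
1 \le df_i \le N
\;\Longrightarrow\;
0 = \log_2\frac{N}{N}
\;\le\; idf_i = \log_2\frac{N}{df_i}
\;\le\; \log_2\frac{N}{1} = \log_2 N
```

So a term found in a single document receives the maximum value $\log_2 N$, and a term found in every document receives 0.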
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and the document
  frequency of each word, compute the IDF of each word.

Word    N      DF     IDF
the     1000   1000   0
some    1000   100    3.322
car     1000   10     6.644
merge   1000   1      9.966

• IDF provides high values for rare words and low values
  for common words.
• IDF is an indication of a term’s discrimination power.
• The log is used to dampen the effect relative to tf.
• Exercise: what is the difference between document frequency and corpus
  frequency?
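The IDF column of the table can be reproduced with a few lines of Python. This is only a sketch: the function name `idf` is an assumption, and values are rounded to three decimals to match the slide.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-2 logarithm."""
    return math.log2(n_docs / doc_freq)

# Document frequencies from the table, for a collection of 1000 documents.
doc_freqs = {"the": 1000, "some": 100, "car": 10, "merge": 1}
idf_values = {word: round(idf(1000, df), 3) for word, df in doc_freqs.items()}

for word, value in idf_values.items():
    print(word, value)
```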
TF*IDF Weighting
The most widely used term-weighting scheme is tf*idf
weighting:
$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \cdot \log_2(N / df_i)$

A term occurring frequently in the document but rarely
in the rest of the collection is given a high weight.
The tf*idf value for a term is always greater than or
equal to zero.

Experimentally, tf*idf has been found to work well.

It is often used in the vector space model together with
cosine similarity to determine the similarity between two
documents.
TF*IDF weighting
When does TF*IDF register a high weight? When a
term t occurs many times within a small number of
documents.
• The highest tf*idf for a term indicates a high term
  frequency (in the given document) and a low document frequency
  (in the whole collection of documents);
  the weights hence tend to filter out common terms,
  thus lending high discriminating power to those documents.
• A lower TF*IDF is registered when the term occurs fewer
  times in a document, or occurs in many documents,
  thus offering a less pronounced relevance signal.
• The lowest TF*IDF is registered when the term occurs in
  virtually all documents.
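The high, lower and lowest cases can be checked numerically with a small sketch. The function name `tf_idf` and the particular tf, N and df values are assumptions chosen for illustration.

```python
import math

def tf_idf(tf, n_docs, doc_freq):
    """Weight of a term: tf multiplied by log2(N / df)."""
    return tf * math.log2(n_docs / doc_freq)

N = 1000  # hypothetical collection size

high = tf_idf(tf=1.0, n_docs=N, doc_freq=10)     # frequent in the doc, rare in the collection
lower = tf_idf(tf=0.2, n_docs=N, doc_freq=10)    # same rarity, but fewer occurrences
common = tf_idf(tf=1.0, n_docs=N, doc_freq=500)  # frequent in the doc, but in half the collection
lowest = tf_idf(tf=1.0, n_docs=N, doc_freq=N)    # occurs in every document

print(high, lower, common, lowest)
```

Because df never exceeds N, the logarithm is never negative, which is why the weight cannot drop below zero.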
Computing TF-IDF: An Example
Assume a collection contains 10,000 documents and statistical
analysis shows that the document frequencies (DF) of three terms
are A(50), B(1300), C(250). The term frequencies (TF) of
these terms are A(3), B(2), C(1), with a maximum term
frequency of 3. Compute TF*IDF for each term.
A: tf = 3/3 = 1.0; idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3 = 0.667; idf = log2(10000/1300) = 2.943; tf*idf = 1.962
C: tf = 1/3 = 0.333; idf = log2(10000/250) = 5.322; tf*idf = 1.774
The query vector is typically treated as a document and is also tf*idf
weighted.
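The three results can be re-derived with the short script below, a sketch that uses only the numbers given in the example. The variable names are hypothetical.

```python
import math

N = 10_000      # documents in the collection
max_freq = 3    # maximum term frequency in the document

# term -> (raw term frequency, document frequency), as given in the example
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}

weights = {}
for name, (freq, doc_freq) in terms.items():
    tf = freq / max_freq               # normalized term frequency
    idf = math.log2(N / doc_freq)      # inverse document frequency
    weights[name] = round(tf * idf, 3)

print(weights)
```

Term A gets by far the highest weight because it is both the most frequent term in the document and the rarest of the three in the collection.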
Another Example
Consider a document containing 100 words in which the
word cow appears 3 times. Now, assume we have 10 million
documents and cow appears in one thousand of these.

The term frequency (TF) for cow:

3/100 = 0.03

The inverse document frequency is

log2(10,000,000 / 1,000) = 13.288

The TF*IDF score is the product of these quantities:

0.03 * 13.288 = 0.3986

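This example normalizes the term count by document length rather than by the maximum term frequency. A quick check of the arithmetic, as a sketch with hypothetical variable names:

```python
import math

tf = 3 / 100                           # cow occurs 3 times in a 100-word document
idf = math.log2(10_000_000 / 1_000)    # log2(10,000), roughly 13.288
score = tf * idf                       # roughly 0.3986

print(round(tf, 2), round(idf, 3), round(score, 4))
```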
Concluding remarks
Suppose that, from a set of English documents, we wish to determine which
ones are the most relevant to the query "the brown cow."
A simple way to start is to eliminate documents that do not contain
all three words "the," "brown," and "cow," but this still leaves many
documents.
To further distinguish them, we might count the number of times each
term occurs in each document and sum the counts;
the number of times a term occurs in a document is called its TF. However,
because the term "the" is so common, this tends to incorrectly emphasize
documents that happen to use the word "the" more, without giving
enough weight to the more meaningful terms "brown" and "cow".
The term "the" is therefore not a good keyword for distinguishing relevant
from non-relevant documents, whereas terms like "brown" and "cow" that occur
rarely are good keywords for distinguishing relevant documents from
non-relevant ones.

Concluding remarks
Hence IDF is incorporated, which diminishes the
weight of terms that occur very frequently in the
collection and increases the weight of terms that occur
rarely.
This leads to the use of TF*IDF as a better weighting technique.
On top of that, we apply similarity measures to calculate
the distance between document i and query j.
• There are a number of similarity measures; the most
  common are
  Euclidean distance, inner (dot) product, cosine similarity,
  Dice similarity, Jaccard similarity, etc.
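As one example of these measures, cosine similarity between a weighted document vector and a weighted query vector could be computed as sketched below. The two vectors and the function name are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something the deck specifies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an empty vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

doc = [0.0, 0.5, 1.0]    # hypothetical tf*idf weights of a document
query = [0.0, 1.0, 1.0]  # hypothetical tf*idf weights of a query

print(round(cosine_similarity(doc, query), 4))
```

Because the score is divided by both vector lengths, a long document does not gain an advantage simply by having larger weights.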
