1
Terms
Terms are usually stems. Terms can be also phrases,
such as “Computer Science”, “World Wide Web”, etc.
Documents and queries are represented as vectors or
“bags of words” (BOW).
Each vector holds a place for every term in the collection.
Position 1 corresponds to term 1, position 2 to term 2, po-
sition n to term n.
D
i
wdi1
,wdi2
,...,
wdin
Q
wq
1,w,...,
W=0
q2
if w
a term is absent
qn
Documents are represented by binary weights or
Non-binary weighted vectors of terms.
2
Document Collection
A collection of n documents can be represented in the
vector space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of
a term in the document; zero means the term has no
significance in the document or it simply doesn’t exist
in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
3
Binary Weights
• Only the presence (1) or ab- docs t1 t2 t3
sence (0) of a term is in- D1 1 0 1
D2 1 0 0
cluded in the vector D3 0 1 1
• Binary formula gives every D4 1 0 0
D5 1 1 1
word that appears in a docu- D6 1 1 0
ment equal relevance. D7 0 1 0
D8 0 1 0
• It can be useful when fre- D9 0 0 1
quency is not important. D10 0 1 1
D11 1 0 1
• Binary Weights Formula:
1iffreq
ij0
freq
ij
0iffreq
ij0
Why use term weighting?
Binary weights are too limiting.
terms are either present or absent.
Not allow to order documents according to their level of
relevance for a given query
Non-binary weights allow to model partial matching .
Partial matching allows retrieval of docs that approxi-
mate the query.
• Term-weighting improves quality of answer set.
Term weighting enables ranking of retrieved documents;
such that best matching documents are ordered at the top
as they are more relevant than others.
5
Term Weighting: Term Frequency (TF)
TF (term frequency) - Count the number
of times term occurs in document.
docs t1 t2 t3
fij = frequency of term i in document j
D1 2 0 3
The more times a term t occurs in docu- D2 1 0 0
ment d the more likely it is that t is rele- D3 0 4 7
vant to the document, i.e. more indicative D4 3 0 0
of the topic.. D5 1 6 3
If used alone, it favors common words and long D6 3 5 0
documents.
D7 0 8 0
It gives too much credit to words that appears
more frequently.
D8 0 10 0
D9 0 0 1
May want to normalize term frequency (tf)
D10 0 3 5
across the entire corpus:
D11 4 0 1
tfij = fij / max{fij}
Document Normalization
Long documents have an unfair advantage:
They use a lot of terms
So they get more matches than short documents
And they use the same words repeatedly
So they have much higher term frequencies
Normalization seeks to remove these effects:
Related somehow to maximum term frequency.
But also sensitive to the number of terms.
If we don’t normalize short documents may not be
recognized as relevant.
7
Problems with term frequency
Need a mechanism for attenuating the effect of terms
that occur too often in the collection to be meaningful for
relevance/meaning determination
Scale down the term weight of terms with high collection
frequency
Reduce the tf weight of a term by a factor that grows with the
collection frequency
More common for this purpose is document frequency
how many documents in the collection contain the term
• The example shows that collection
frequency and document fre-
quency behaves differently 8
Document Frequency
It is defined to be the number of documents in the col-
lection that contain a term
DF = document frequency
Count the frequency considering the whole collec-
tion of documents.
Less frequently a term appears in the whole collec-
tion, the more discriminating it is.
df i = document frequency of term i
= number of documents containing term i
9
Inverse Document Frequency (IDF)
IDF measures rarity of the term in collection. The IDF
is a measure of the general importance of the term
Inverts the document frequency.
It diminishes the weight of terms that occur very fre-
quently in the collection and increases the weight of
terms that occur rarely.
Gives full weight to terms that occur in one document
only.
Gives lowest weight to terms that occur in all docu-
ments.
Terms that appear in many different documents are less in-
dicative of overall topic.
idfi = inverse document frequency of term i,
= log2 (N/ df i) (N: total number of docu-
ments)
10
Inverse Document Frequency
• E.g.: given a collection of 1000 documents and document
frequency, compute IDF for each word?
Word N DF IDF
the 1000 1000 0
some 1000 100 3.322
car 1000 10 6.644
merge 1000 1 9.966
• IDF provides high values for rare words and low values
for common words.
• IDF is an indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
• Make the difference between Document frequency vs. corpus
frequency ?
11
TF*IDF Weighting
The most used term-weighting is tf*idf weighting
scheme:
wij = tfij idfi = tfij * log2 (N/ dfi)
A term occurring frequently in the document but rarely
in the rest of the collection is given high weight.
The tf*idf value for a term will always be greater than or
equal to zero.
Experimentally, tf*idf has been found to work well.
It is often used in the vector space model together with co-
sine similarity to determine the similarity between two doc-
uments.
12
TF*IDF weighting
When does TF*IDF registers a high weight? when a
term t occurs many times within a small number of
documents
Highest tf*idf for a term shows a term has a high term fre-
quency (in the given document) and a low document frequency
(in the whole collection of documents);
the weights hence tend to filter out common terms.
Thus lending high discriminating power to those documents
Lower TF*IDF is registered when the term occurs fewer
times in a document, or occurs in many documents
Thus offering a less pronounced relevance signal
Lowest TF*IDF is registered when the term occurs in virtu-
ally all documents
Computing TF-IDF: An Example
Assume collection contains 10,000 documents and statistical
analysis shows that document frequencies (DF) of three terms
are: A(50), B(1300), C(250). And also term frequencies (TF) of
these terms are: A(3), B(2), C(1) with a maximum term fre-
quency of 3. Compute TF*IDF for each term?
A: tf = 3/3=1.0idf = log2(10000/50) = 7.644; tf*idf = 7.644
B: tf = 2/3=0.667 idf = log2(10000/1300) = 2.943; tf*idf =
1.962
C: tf = 1/3=0.33 idf = log2(10000/250) = 5.322; tf*idf =
1.774
Query vector is typically treated as a document and also tf*idf
weighted.
14
More Example
Consider a document containing 100 words where in the
word cow appears 3 times. Now, assume we have 10 million
documents and cow appears in one thousand of these.
The term frequency (TF) for cow :
3/100 = 0.03
The inverse document frequency is
log2(10,000,000 / 1,000) = 13.228
The TF*IDF score is the product of these frequencies: 0.03 *
13.228 = 0.39684
15
Concluding remarks
Suppose from a set of English documents, we wish to determine which
once are the most relevant to the query "the brown cow."
A simple way to start out is by eliminating documents that do not con-
tain all three words "the," "brown," and "cow," but this still leaves many
documents.
To further distinguish them, we might count the number of times each
term occurs in each document and sum them all together;
the number of times a term occurs in a document is called its TF. However,
because the term "the" is so common, this will tend to incorrectly emphasize
documents which happen to use the word "the" more, without giving
enough weight to the more meaningful terms "brown" and "cow".
Also the term "the" is not a good keyword to distinguish relevant and non-
relevant documents and terms like "brown" and "cow" that occur rarely are
good keywords to distinguish relevant documents from the non-relevant
once.
16
Concluding remarks
Hence IDF is incorporated which diminishes the
weight of terms that occur very frequently in the col-
lection and increases the weight of terms that occur
rarely.
This leads to use TF*IDF as a better weighting technique
On top of that we apply similarity measures to calcu-
late the distance between document i and query j.
• There are a number of similarity measures; the most com-
mon similarity measures are
Euclidean distance , Inner or Dot product, Cosine simi-
larity, Dice similarity, Jaccard similarity, etc.