TF-IDF (Term Frequency-Inverse Document Frequency)
Consider the following corpus of four documents:
Document 1: "Data science is transforming the world."
Document 2: "Machine learning is a subset of data science."
Document 3: "Deep learning and AI are advancing rapidly."
Document 4: "AI and machine learning are reshaping industries."
a. Step-by-step, calculate the TF-IDF (Term Frequency-Inverse Document Frequency) for the
given corpus and identify the word(s) with the highest value.
b. Construct a document vector table based on the TF-IDF values for the given corpus.
Answer:
Step 1: Create the Term Frequency (TF) Table
The formula for TF is:
TF(t, d) = (Number of times term t appears in document d) / (Total number of words in document d)
Let's list out all the unique words in the corpus:
Word
Data
Science
Is
Transforming
The
World
Machine
Learning
A
Subset
Of
Deep
And
AI
Are
Advancing
Rapidly
Reshaping
Industries
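The word list above can be reproduced with a short Python sketch. The `tokenize` helper is a hypothetical name introduced here for illustration; it assumes that words are lowercased and that punctuation is dropped, which is how the list treats "Data" and "data" as one word.

```python
import re

# The four documents of the worked example.
corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]

def tokenize(text):
    # Assumption: lowercase the text and keep only alphabetic runs,
    # so "Data" and "data" are the same word and "." is discarded.
    return re.findall(r"[a-z]+", text.lower())

# Collect the unique words in order of first appearance.
vocabulary = []
for doc in corpus:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

print(len(vocabulary))
print(vocabulary)
```

Under these assumptions the vocabulary has 19 entries, in the same order as the list.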
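Now, we count word occurrences and calculate term frequencies. Words are compared case-insensitively (so "Data" and "data" are the same term), and punctuation is ignored.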
TF Calculation for Each Document
• Document 1: "Data science is transforming the world."
o Total words: 6
o TF values:
▪ TF(Data) = 1/6=0.1667
▪ TF(Science) = 1/6=0.1667
▪ TF(Is) = 1/6=0.1667
▪ TF(Transforming) = 1/6=0.1667
▪ TF(The) = 1/6=0.1667
▪ TF(World) = 1/6=0.1667
• Document 2: "Machine learning is a subset of data science."
o Total words: 8
o TF values:
▪ TF(Machine) = 1/8=0.1250
▪ TF(Learning) = 1/8=0.1250
▪ TF(Is) = 1/8=0.1250
▪ TF(A) = 1/8=0.1250
▪ TF(Subset) = 1/8=0.1250
▪ TF(Of) = 1/8=0.1250
▪ TF(Data) = 1/8=0.1250
▪ TF(Science) = 1/8=0.1250
• Document 3: "Deep learning and AI are advancing rapidly."
o Total words: 7
o TF values:
▪ TF(Deep) = 1/7=0.1429
▪ TF(Learning) = 1/7=0.1429
▪ TF(And) = 1/7=0.1429
▪ TF(AI) = 1/7=0.1429
▪ TF(Are) = 1/7=0.1429
▪ TF(Advancing) = 1/7=0.1429
▪ TF(Rapidly) = 1/7=0.1429
• Document 4: "AI and machine learning are reshaping industries."
o Total words: 7
o TF values:
▪ TF(AI) = 1/7=0.1429
▪ TF(And) = 1/7=0.1429
▪ TF(Machine) = 1/7=0.1429
▪ TF(Learning) = 1/7=0.1429
▪ TF(Are) = 1/7=0.1429
▪ TF(Reshaping) = 1/7=0.1429
▪ TF(Industries) = 1/7=0.1429
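As a cross-check on the hand counts, the per-document term frequencies can be computed with a minimal Python sketch. The helper names `tokenize` and `term_frequencies` are illustrative assumptions, as is the tokenization rule (lowercase, alphabetic words only).

```python
import re
from collections import Counter

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]

def tokenize(text):
    # Assumption: lowercase, alphabetic words only, punctuation ignored.
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(text):
    # TF(t, d) = occurrences of t in d / total number of words in d.
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

for i, doc in enumerate(corpus, start=1):
    print(f"Document {i}: {len(tokenize(doc))} words")
    for word, value in term_frequencies(doc).items():
        print(f"  TF({word}) = {value:.4f}")
```

Because every word occurs once in its document, each TF is simply 1 divided by the document length, so longer documents give each word a smaller TF.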
Step 2: Compute Inverse Document Frequency (IDF)
The formula for IDF is:
IDF(t) = log(N / DF(t))
where:
• N=4 (Total number of documents)
• DF(t) = Number of documents that contain the term t.
Let's calculate the IDF of each word (log is the natural logarithm, so log(4/2) = 0.693):
Word DF (Number of Docs) IDF = log(4/DF)
Data 2 log(4/2) = 0.693
Science 2 log(4/2) = 0.693
Is 2 log(4/2) = 0.693
Transforming 1 log(4/1) = 1.386
The 1 log(4/1) = 1.386
World 1 log(4/1) = 1.386
Machine 2 log(4/2) = 0.693
Learning 3 log(4/3) = 0.288
A 1 log(4/1) = 1.386
Subset 1 log(4/1) = 1.386
Of 1 log(4/1) = 1.386
Deep 1 log(4/1) = 1.386
And 2 log(4/2) = 0.693
AI 2 log(4/2) = 0.693
Are 2 log(4/2) = 0.693
Advancing 1 log(4/1) = 1.386
Rapidly 1 log(4/1) = 1.386
Reshaping 1 log(4/1) = 1.386
Industries 1 log(4/1) = 1.386
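The DF and IDF columns can be checked with a small Python sketch. It assumes the natural logarithm (consistent with log(4/2) = 0.693 in the table) and uses a hypothetical `tokenize` helper that lowercases the text and ignores punctuation.

```python
import math
import re

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]

def tokenize(text):
    # Assumption: lowercase, alphabetic words only, punctuation ignored.
    return re.findall(r"[a-z]+", text.lower())

N = len(corpus)

# DF(t): number of documents that contain t. Using a set per document
# ensures a word is counted at most once per document.
df = {}
for doc in corpus:
    for word in set(tokenize(doc)):
        df[word] = df.get(word, 0) + 1

# IDF(t) = ln(N / DF(t)); math.log is the natural logarithm.
idf = {word: math.log(N / count) for word, count in df.items()}

for word in sorted(idf, key=idf.get):
    print(f"{word:<13} DF={df[word]}  IDF={idf[word]:.3f}")
```

Sorting by IDF puts "learning" first, since it is the only word found in three documents and is therefore the least distinctive.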
Step 3: Compute TF-IDF
TF-IDF(t, d) = TF(t, d) × IDF(t)
Now we compute the values. The word with the highest TF-IDF will have the highest product
of TF and IDF.
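The highest TF-IDF values belong to words that appear in only one document (IDF = 1.386). Among these words, the score depends on the TF, which is largest in the shortest document:
• Document 1 (6 words): TF-IDF = 0.1667 × 1.386 = 0.231
• Documents 3 and 4 (7 words each): TF-IDF = 0.1429 × 1.386 = 0.198
• Document 2 (8 words): TF-IDF = 0.1250 × 1.386 = 0.173
The words with the highest TF-IDF score (0.231) are therefore the words that occur only in Document 1:
• Transforming
• The
• World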
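The search for the highest TF-IDF score can be sketched in Python as follows. This is an illustrative check rather than a reference implementation: the `tokenize` helper and the `scores` dictionary keyed by (word, document number) are assumptions made for this example, and the logarithm is taken as natural.

```python
import math
import re
from collections import Counter

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]

def tokenize(text):
    # Assumption: lowercase, alphabetic words only, punctuation ignored.
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(doc) for doc in corpus]
N = len(docs)

# DF(t): count each word once per document it appears in.
df = Counter(word for tokens in docs for word in set(tokens))

# TF-IDF(t, d) = TF(t, d) * ln(N / DF(t)) for every word in every document.
scores = {}
for i, tokens in enumerate(docs, start=1):
    for word, count in Counter(tokens).items():
        scores[(word, i)] = (count / len(tokens)) * math.log(N / df[word])

best = max(scores.values())
top = sorted(key for key, value in scores.items() if math.isclose(value, best))
print(round(best, 3), top)
```

With this tokenization the maximum is 0.231, reached only by "the", "transforming", and "world" in Document 1, because that document is the shortest and these words occur nowhere else.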
Step 4: Construct Document Vector Table
We construct a matrix where each row represents a word in the corpus and each column represents a
document, filled with TF-IDF values. Each column is therefore the TF-IDF vector of one document.
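Word D1 D2 D3 D4
Data 0.116 0.087 0 0
Science 0.116 0.087 0 0
Is 0.116 0.087 0 0
Transforming 0.231 0 0 0
The 0.231 0 0 0
World 0.231 0 0 0
Machine 0 0.087 0 0.099
Learning 0 0.036 0.041 0.041
A 0 0.173 0 0
Subset 0 0.173 0 0
Of 0 0.173 0 0
Deep 0 0 0.198 0
And 0 0 0.099 0.099
AI 0 0 0.099 0.099
Are 0 0 0.099 0.099
Advancing 0 0 0.198 0
Rapidly 0 0 0.198 0
Reshaping 0 0 0 0.198
Industries 0 0 0 0.198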
Thus, Transforming, The, and World have the highest TF-IDF (0.231).
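The document vector table can also be generated programmatically. The sketch below is one possible way to do it in plain Python, assuming the natural logarithm and a hypothetical `tokenize` helper (lowercase, punctuation ignored); `tfidf_vector` is likewise a name introduced only for this example.

```python
import math
import re
from collections import Counter

corpus = [
    "Data science is transforming the world.",
    "Machine learning is a subset of data science.",
    "Deep learning and AI are advancing rapidly.",
    "AI and machine learning are reshaping industries.",
]

def tokenize(text):
    # Assumption: lowercase, alphabetic words only, punctuation ignored.
    return re.findall(r"[a-z]+", text.lower())

docs = [tokenize(doc) for doc in corpus]
N = len(docs)

# Vocabulary in order of first appearance; dict.fromkeys removes duplicates.
vocabulary = list(dict.fromkeys(word for tokens in docs for word in tokens))
df = Counter(word for tokens in docs for word in set(tokens))

def tfidf_vector(tokens):
    # One TF-IDF entry per vocabulary word; absent words get 0.
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * math.log(N / df[w]) for w in vocabulary]

vectors = [tfidf_vector(tokens) for tokens in docs]

# Print with words as rows and documents as columns.
print(f"{'Word':<13}" + "".join(f"{'D' + str(i):>7}" for i in range(1, N + 1)))
for j, word in enumerate(vocabulary):
    print(f"{word:<13}" + "".join(f"{vec[j]:>7.3f}" for vec in vectors))
```

Each entry of `vectors` is a 19-dimensional document vector, so documents can be compared directly, for example with cosine similarity. Library implementations may give different numbers: scikit-learn's `TfidfVectorizer`, for instance, uses a smoothed IDF and normalizes each vector by default.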
Questions for Practice:
Consider a small corpus consisting of three text documents:
Text Doc 1: "The cat sat on the mat."
Text Doc 2: "The dog chased the cat."
Text Doc 3: "The cat and the dog played together."
Calculate the TF-IDF value of each word in each document.