Module 03 – Project
Sentiment Analysis
for IMDB Review Dataset
Nguyen Quoc Thai
Objectives
Text Classification
❖ Introduction
❖ Token-level Text Classification
❖ Document-level Text Classification
❖ Sentiment Analysis
❖ IMDB Dataset

Text Preprocessing
❖ Duplicate Handling
❖ Text Cleaning
❖ EDA
❖ Tokenization

Text Representation
❖ Numeric Representation
❖ One-hot Encoding
❖ Bag-of-Words (BoW)
❖ TF-IDF

Classification
❖ Decision Tree Classifier
❖ Random Forest Classifier
❖ Evaluation
❖ Inference
Outline
SECTION 1
Text Classification

SECTION 2
Text Preprocessing

SECTION 3
Text Representation

SECTION 4
Classification
Text Classification
! Classification Problem
➢ Classify input variables to identify discrete output variables (labels, categories)
➢ Example: "Will it be hot or cold tomorrow?" asks for a discrete label, while "What will the temperature be tomorrow?" asks for a continuous value
[Figure: a classifier is trained on labeled training data (features + labels) and then predicts labels for unseen test data]
Text Classification
! Classification Problem
❖ Input
➢ A fixed set of classes C = {c1, c2, …, cN}
➢ A training set of M hand-labeled documents: (d1, c1), …, (dM, cM)
➢ A document d
❖ Output
➢ A learned classifier d => c ∈ C

Training data (Iris example)
Petal_Length   Petal_Width   Label
1              0.2           0
1.3            0.6           0
0.9            0.7           0
1.7            0.5           1
1.8            0.9           1
1.2            1.3           1

Test data
Petal_Length   Petal_Width   Label
1.2            0.2           ?
Text Classification
! Classification Problem
From numeric features to text documents:

Numeric features (Iris)
Petal_Length   Petal_Width   Label
1              0.2           0
1.3            0.6           0
0.9            0.7           0
1.7            0.5           1
1.8            0.9           1
1.2            1.3           1

Text documents (IMDB reviews)
Review: A wonderful little production. <br/> The filming technique is very unassuming- very ole-time-B…    => Label 0
Review: A rating of “1” does not begin to express how dull, depressing and relentlessly bad this movie is. => Label 0
Text Classification
! Token-level Classification
❖ Sequence Labeling: Word Segmentation, Part-of-Speech Tagging (POS), Named Entity Recognition (NER)
❖ Named Entity Recognition (NER) example: "I have a flight to New York at 5 pm"
   => "New York" (STATE OR PROVINCE), "5 pm" (TIME)
❖ Part-of-Speech Tagging (POS): each token is labeled with its grammatical category
Text Classification
! Document-level Classification
❖ Sentiment Analysis: assign a sentiment label to a whole document
[Figure: text classification into topics (Technology / Sports / Business) alongside sentiment analysis]
Text Classification
! IMDB Review Dataset
Text Cleaning: raw reviews => (cleaned review, label) pairs
A wonderful little production. <br/> The filming technique is very unassuming- very ole-time-B…    => Label 0
A rating of “1” does not begin to express how dull, depressing and relentlessly bad this movie is. => Label 0

Text Representation: each cleaned review => a numeric feature vector, e.g.

x = [[1, 1.4, 0.2],
     [1, 1.5, 0.2],
     [1, 3.0, 1.1],
     [1, 4.1, 1.3]]
Text Classification
! IMDB Review Dataset
Dataset => Preprocessing => Exploratory Data Analysis (EDA) => Representation => Modeling => Evaluation => Inference
Outline
SECTION 1
Text Classification
SECTION 2
Text Preprocessing
SECTION 3
Text Representation
SECTION 4
Classification
Text Preprocessing
! Text Preprocessing
❑ Removal of URLs and HTML tags
❑ Text Standardizing
❑ Lowercasing
❑ Number and Punctuation Handling
❑ Removal of Stop Words
❑ Removal of Rare Words
❑ Handling Emojis and Emoticons
❑ Spelling Correction
❑ Stemming
❑ Lemmatization
❑ Tokenization: Sentence, Word, Character, Subwords
Text Preprocessing
! Removal of URLs and HTML Tags
➢ Extract text based on the structure of an HTML document
➢ URLs: image links, reference links,…
➢ HTML tags: <p>…</p>, <div>…</div>,…

Before: @82476 <p> We'd like to help Sam, which number is caling you? </p> Please DM us more info so we can advise further. https://t.co/5pyLDJBC6r
After:  @82476 We'd like to help Sam, which number is caling you? Please DM us more info so we can advise further.
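A minimal sketch of this step with Python's re module; the regex patterns are illustrative assumptions (an HTML parser such as BeautifulSoup could be used for the tags instead):

import re

def remove_urls_html(text: str) -> str:
    """Strip URLs and HTML tags from a raw review (illustrative regex patterns)."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # drop http(s)/www URLs
    text = re.sub(r"<[^>]+>", " ", text)                # drop HTML tags like <p>, </p>, <br/>
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

raw = ("@82476 <p> We'd like to help Sam, which number is caling you? </p> "
       "Please DM us more info so we can advise further. https://t.co/5pyLDJBC6r")
print(remove_urls_html(raw))
# @82476 We'd like to help Sam, which number is caling you? Please DM us more info so we can advise further.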
Text Preprocessing
! Text Standardizing
➢ Lowercasing: use the lower() function in Python
➢ Expand short forms and abbreviations that represent the same meaning: DM => direct message, info => information
➢ Contractions: I'm => I am, isn't => is not, can't => can not,…

Before: @82476 We'd like to help Sam, which number is caling you? Please DM us more info so we can advise further.
After:  @82476 we would like to help sam, which number is caling you? please direct message us more information so we can advise further.
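A minimal sketch of lowercasing plus a small replacement dictionary; the map below is a toy assumption (a real pipeline would use a much larger dictionary or a dedicated contractions library):

REPLACEMENTS = {
    "we'd": "we would", "i'm": "i am", "can't": "can not", "isn't": "is not",
    "dm": "direct message", "info": "information",
}

def standardize(text: str) -> str:
    """Lowercase the text and expand known short forms, word by word."""
    words = [REPLACEMENTS.get(w.strip(".,!?"), w) for w in text.lower().split()]
    return " ".join(words)

print(standardize("@82476 We'd like to help Sam, which number is caling you? "
                  "Please DM us more info so we can advise further."))
# @82476 we would like to help sam, which number is caling you? please direct message us more information so we can advise further.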
Text Preprocessing
! Number and Punctuation Handling
➢ Removal: typical for Text Classification
➢ Keep as tokens: Machine Translation, POS Tagging, Named Entity Recognition

            Removal    As Token
Sam.        Sam        Sam .
You?        You        You ?
Further.    Further    Further .

Removal:  We would like to help Sam which number is caling you Please direct message us more information so we can advise further
As Token: @ 82476 We would like to help Sam , which number is caling you ? Please direct message us more information so we can advise further .
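A minimal sketch of both options using simple regex and string rules (library tokenizers such as nltk handle the "as token" case more robustly):

import re
import string

def remove_punct_and_numbers(text: str) -> str:
    """Drop digits and punctuation entirely (typical for text classification)."""
    text = re.sub(r"\d+", "", text)
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def punct_as_tokens(text: str) -> str:
    """Keep punctuation but separate it into its own tokens."""
    return re.sub(r"([^\w\s])", r" \1 ", text).strip()

print(remove_punct_and_numbers("Will it be hot tomorrow?"))   # Will it be hot tomorrow
print(punct_as_tokens("Will it be hot tomorrow?"))            # Will it be hot tomorrow ?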
Text Preprocessing
! Stop / Rare Words Handling
➢ Focus on the important keywords
➢ Stop words: common words that carry little or no meaning compared to keywords
   English: a, an, that, for,…
   Vietnamese: à, ừ, vậy, thế,…
➢ Rare words: words that appear only a few times in the corpus
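A minimal sketch using nltk's English stop-word list plus a frequency threshold for rare words; the min_count value is an illustrative assumption:

from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_and_rare(docs, min_count=2):
    """Drop stop words, then drop tokens seen fewer than min_count times in the corpus."""
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS] for doc in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
print(remove_stop_and_rare(docs))
# [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats'], ['man', 'eats']]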
Text Preprocessing
! Emoji and Emoticons Handling
➢ Emojis: …
➢ Emoticons: :-) :-( :-))) :-)
➢ For some tasks: convert emojis and emoticons to words
➢ Example: :-) => happy, :-( => sad,…

Before: @82476 🤔 We would like to help Sam, which number is caling you? Please direct message us more information so we can advise further.
After:  @82476 thinking_face We would like to help Sam, which number is caling you? Please direct message us more information so we can advise further.
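A minimal sketch using the third-party emoji package for emojis and a small hand-made map for emoticons; the emoticon dictionary is an illustrative assumption:

import re

import emoji  # third-party: pip install emoji

EMOTICONS = {":-)": "happy", ":)": "happy", ":-(": "sad", ":(": "sad"}  # toy map (assumption)

def convert_emoji_emoticons(text: str) -> str:
    """Replace emojis with their names (e.g. 🤔 -> thinking_face) and emoticons with words."""
    text = emoji.demojize(text, delimiters=(" ", " "))
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return re.sub(r"\s+", " ", text).strip()

print(convert_emoji_emoticons("Great movie :-) 🤔"))
# Great movie happy thinking_face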
Text Preprocessing
! Stemming and Lemmatization
➢ Goal: convert words to the same root
   am, is, are => be
   dinner, dinners => dinner
➢ Lemmatization: determining that words have the same root despite their surface differences
Text Preprocessing
! Stemming and Lemmatization
Morphological Parsing
➢ Morphology: the small meaningful units that make up words
   ◦ Stems: the core meaning-bearing units
   ◦ Affixes: parts that adhere to stems, often with grammatical functions
➢ Morphological parsers:
   Highest => High + est
   Higher  => High + er
Text Preprocessing
! Stemming and Lemmatization
Stemming
➢ Stemming – a simple form of lemmatization
   A naïve version of morphological analysis
   Chops off word-final affixes with simple rules:
   …ational => …ate   (relational => relate)
   …sses => …ss       (grasses => grass)
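A minimal sketch comparing nltk's PorterStemmer and WordNetLemmatizer on the examples above; the one-time wordnet download is required for the lemmatizer:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("relational"), stemmer.stem("grasses"), stemmer.stem("dinners"))
# relat grass dinner
print(lemmatizer.lemmatize("dinners", pos="n"), lemmatizer.lemmatize("are", pos="v"))
# dinner be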
Text Preprocessing
! Tokenization
➢ Split a paragraph or document into sentences
➢ Use RegEx or a library: nltk, gensim,… => nltk.sent_tokenize()

Input Text: Tokenization is one of the first steps in any NLP pipeline. Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.

Sentence Tokenization:
  Tokenization is one of the first steps in any NLP pipeline.
  Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.
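A minimal sketch of sentence tokenization with nltk (the tokenizer model must be downloaded once; newer nltk versions may need "punkt_tab" instead of "punkt"):

import nltk

nltk.download("punkt", quiet=True)  # sentence/word tokenizer models

text = ("Tokenization is one of the first steps in any NLP pipeline. "
        "Tokenization is nothing but splitting the raw text into small chunks "
        "of words or sentences, called tokens.")

for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Tokenization is one of the first steps in any NLP pipeline.
# Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.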
Text Preprocessing
! Tokenization
➢ Split a sentence into words (tokens)
➢ Use RegEx or a library: nltk, gensim,… => nltk.word_tokenize()

Input Text: Tokenization is one of the first steps in any NLP pipeline.

Word Tokenization:
  Tokenization | is | one | of | the | first | steps | in | any | NLP | pipeline | .
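A minimal sketch of word tokenization with nltk; note that punctuation becomes its own token:

import nltk

nltk.download("punkt", quiet=True)  # tokenizer model ("punkt_tab" on newer nltk versions)

sentence = "Tokenization is one of the first steps in any NLP pipeline."
print(nltk.word_tokenize(sentence))
# ['Tokenization', 'is', 'one', 'of', 'the', 'first', 'steps', 'in', 'any', 'NLP', 'pipeline', '.']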
Text Preprocessing
! Exploratory Data Analysis (EDA)
➢ Frequencies of sentiment labels
Text Preprocessing
! Exploratory Data Analysis (EDA)
➢ Word lengths
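A minimal sketch of both EDA plots with pandas and matplotlib, assuming a DataFrame with 'review' and 'sentiment' columns (the column names and toy data are assumptions):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "review": ["A wonderful little production.", "Relentlessly bad movie.", "Loved it!"],
    "sentiment": ["positive", "negative", "positive"],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Frequencies of sentiment labels
df["sentiment"].value_counts().plot(kind="bar", ax=axes[0], title="Label frequencies")

# Review lengths measured in words
df["review"].str.split().str.len().plot(kind="hist", ax=axes[1], title="Review length (words)")

plt.tight_layout()
plt.show()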
Outline
SECTION 1
Text Classification

SECTION 2
Text Preprocessing

SECTION 3
Text Representation

SECTION 4
Classification
Text Representation
! Numeric Representation
❖ Token-Level: each token gets its own representation
   "I go to school" --Tokenization--> [I, go, to, school]
❖ Document-Level: the whole document gets one representation
   "I go to school"
Text Representation
! One-hot Encoding
❖ Token-Level
❖ Each token is represented by a V-dimensional binary vector of 0s and 1s
   - All 0s except at one index, index = word id (wid)
   - At this index, put 1

Preprocessing + Tokenization:
  Dog bites man.  => [dog, bites, man]
  Man bites dog.  => [man, bites, dog]
  Dog eats meat.  => [dog, eats, meat]
  Man eats food.  => [man, eats, food]

Build Vocabulary:
  IDX   Token
  0     bites
  1     dog
  2     eats
  3     food
  4     man
  5     meat
Text Representation
! One-hot Encoding
❖ Token-Level
❖ Each token is represented by a V-dimensional binary vector of 0s and 1s
   - All 0s except at the token's index (index = wid), where we put 1

Dog bites man. => [dog, bites, man]

Vocabulary: IDX 0 bites, 1 dog, 2 eats, 3 food, 4 man, 5 meat

         bites   dog   eats   food   man   meat
dog      0       1     0      0      0     0
bites    1       0     0      0      0     0
man      0       0     0      0      1     0

=> [[0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0]]
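A minimal sketch of one-hot encoding with plain Python lists, using the vocabulary above:

vocab = {"bites": 0, "dog": 1, "eats": 2, "food": 3, "man": 4, "meat": 5}

def one_hot(token: str, vocab: dict) -> list:
    """V-dimensional binary vector: all zeros except a 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

print([one_hot(t, vocab) for t in ["dog", "bites", "man"]])
# [[0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]]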
Text Representation
! Bag-of-Words (BoW)
❖ Document-Level: consider text as a bag (collection) of words
❖ Each document is represented by a V-dimensional vector
   Each entry is the number of occurrences of the corresponding word in the document

Vocabulary: IDX 0 bites, 1 dog, 2 eats, 3 food, 4 man, 5 meat

Counter:
                    bites   dog   eats   food   man   meat
[dog, bites, man]   1       1     0      0      1     0
Text Representation
! Bag-of-Words (BoW)
❖ Document-Level: consider text as a bag (collection) of words
❖ Each document is represented by a V-dimensional count vector

                    bites   dog   eats   food   man   meat
[dog, bites, man]   1       1     0      0      1     0
[man, bites, dog]   1       1     0      0      1     0
[dog, eats, meat]   0       1     1      0      0     1
[man, eats, food]   0       0     1      1      1     0
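A minimal sketch of BoW with scikit-learn's CountVectorizer, which reproduces the table above (its alphabetically sorted vocabulary happens to match the slide's index order):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse matrix of shape (4 documents, 6 terms)

print(vectorizer.get_feature_names_out())   # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(bow.toarray())
# [[1 1 0 0 1 0]
#  [1 1 0 0 1 0]
#  [0 1 1 0 0 1]
#  [0 0 1 1 1 0]]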
Text Representation
! TF-IDF
tf(t, d) = count(t, d)

➢ Some ways to reduce the raw frequency:
   • Use log space and add 1:
     tf(t, d) = log(count(t, d) + 1)
   • Divide the number of occurrences by the length of the document:
     tf(t, d) = count(t, d) / len(d)
Text Representation
! TF-IDF
tf(t, d) = count(t, d) / len(d)

Example:
                        bites   dog   eats   food   man   meat
[dog, bites, man]  D1   1/3     1/3   0      0      1/3   0
[man, bites, dog]  D2   1/3     1/3   0      0      1/3   0
[dog, eats, meat]  D3   0       1/3   1/3    0      0     1/3
[man, eats, food]  D4   0       0     1/3    1/3    1/3   0
Text Representation
! TF-IDF
idf(t) = N / df(t)

➢ Measures the importance of a word across the corpus
   N: the total number of documents in the corpus
   df(t): the number of documents containing the term t
➢ Using log space:
   idf(t) = log(N / df(t))
   idf(t) = log(N / df(t)) + 1
   idf(t) = log((N + 1) / (df(t) + 1)) + 1
Text Representation
! TF-IDF
idf(t) = ln((N + 1) / (df(t) + 1)) + 1

Example (N = 4):
[dog, bites, man]
[man, bites, dog]
[dog, eats, meat]
[man, eats, food]

       bites   dog     eats    food    man     meat
idf    1.511   1.223   1.511   1.916   1.223   1.916
Text Representation
! TF-IDF
w(t, d) = tf(t, d) × idf(t)

➢ The weighted value w(t, d) for word t in document d
➢ IDF weighs down terms that are very common across the corpus and boosts rare terms
➢ The TF-IDF vector for a document is simply the TF-IDF score of each term in that document
Text Representation
! TF-IDF
Example documents: [dog, bites, man], [man, bites, dog], [dog, eats, meat], [man, eats, food]

idf:
       bites   dog     eats    food    man     meat
       1.511   1.223   1.511   1.916   1.223   1.916

tf:
       bites   dog     eats    food    man     meat
D1     1/3     1/3     0       0       1/3     0
D2     1/3     1/3     0       0       1/3     0
D3     0       1/3     1/3     0       0       1/3
D4     0       0       1/3     1/3     1/3     0

tf-idf = tf × idf:
       bites   dog     eats    food    man     meat
D1     0.504   0.408   0       0       0.408   0
D2     0.504   0.408   0       0       0.408   0
D3     0       0.408   0.504   0       0       0.639
D4     0       0       0.504   0.639   0.408   0
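A minimal sketch that reproduces the table above with numpy, using the slide's conventions tf = count/len(d) and idf = ln((N+1)/(df+1)) + 1. (scikit-learn's TfidfVectorizer uses raw counts for tf and L2-normalizes each row by default, so its numbers differ slightly.)

import numpy as np

docs = [["dog", "bites", "man"], ["man", "bites", "dog"],
        ["dog", "eats", "meat"], ["man", "eats", "food"]]
vocab = ["bites", "dog", "eats", "food", "man", "meat"]

N = len(docs)
tf = np.array([[doc.count(t) / len(doc) for t in vocab] for doc in docs])  # tf = count / len(d)
df = np.array([sum(t in doc for doc in docs) for t in vocab])              # document frequency
idf = np.log((N + 1) / (df + 1)) + 1                                       # smoothed idf

print(np.round(idf, 3))       # [1.511 1.223 1.511 1.916 1.223 1.916]
print(np.round(tf * idf, 3))  # first row (D1): [0.504 0.408 0.    0.    0.408 0.   ]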
Outline
SECTION 1
Text Classification
SECTION 2
Text Preprocessing
SECTION 3
Text Representation
SECTION 4
Classification
Classification
! Decision Tree & Random Forest
[Figure: a Decision Tree trained on the dataset vs. a Random Forest built from the same dataset]
Classification
! Decision Tree & Random Forest
[Figure: example Decision Tree and Random Forest models]
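A minimal sketch of training both classifiers on TF-IDF features with scikit-learn; the toy texts/labels stand in for the cleaned IMDB reviews and are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

texts = ["a wonderful little production", "dull depressing and relentlessly bad",
         "loved every minute of it", "a complete waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=42)

for name, clf in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                  ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))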
Classification
! AdaBoost & Gradient Boosting
[Figure: AdaBoost vs. Gradient Boosting]
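A minimal sketch of the two boosting classifiers in scikit-learn on the same kind of TF-IDF features; the toy data is again an assumption:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["a wonderful little production", "dull depressing and relentlessly bad",
         "loved every minute of it", "a complete waste of time"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts).toarray()

for clf in (AdaBoostClassifier(n_estimators=50), GradientBoostingClassifier(n_estimators=50)):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))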
Classification
! XGBoost
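A minimal sketch with the third-party xgboost package (pip install xgboost) through its scikit-learn-style XGBClassifier; the toy data is an assumption:

from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier  # third-party: pip install xgboost

texts = ["a wonderful little production", "dull depressing and relentlessly bad",
         "loved every minute of it", "a complete waste of time"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts).toarray()

model = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
model.fit(X, labels)
print(model.predict(X))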
Summary
Text Classification
❖ Introduction
❖ Token-level Text Classification
❖ Document-level Text Classification
❖ Sentiment Analysis
❖ IMDB Dataset

Text Preprocessing
❖ Duplicate Handling
❖ Text Cleaning
❖ EDA
❖ Tokenization

Text Representation
❖ Numeric Representation
❖ One-hot Encoding
❖ Bag-of-Words (BoW)
❖ TF-IDF

Classification
❖ Decision Tree Classifier
❖ Random Forest Classifier
❖ Evaluation
❖ Inference
Thanks!
Any questions?