Module 03 – Project
Sentiment Analysis
for IMDB Review Dataset
Nguyen Quoc Thai
Objectives
Text Classification
❖ Introduction
❖ Token-level Text Classification
❖ Document-level Text Classification
❖ Sentiment Analysis
❖ IMDB Dataset

Text Preprocessing
❖ Duplicate Handling
❖ Text Cleaning
❖ EDA
❖ Tokenization

Text Representation
❖ Numeric Representation
❖ One-hot Encoding
❖ Bag-of-Words (BoW)
❖ TF-IDF

Classification
❖ Decision Tree Classifier
❖ Random Forest Classifier
❖ Evaluation
❖ Inference
Outline
SECTION 1
Text Classification

SECTION 2
Text Preprocessing

SECTION 3
Text Representation

SECTION 4
Classification
Text Classification
! Classification Problem
➢ Classify input variables to identify discrete output variables (labels, categories)
➢ Example: "Will it be hot or cold tomorrow?" asks for a discrete label, while "What will the temperature be tomorrow?" asks for a continuous value
[Figure: a classifier is trained on labeled training data (features + labels) and then predicts labels for unseen test data]
Text Classification
! Classification Problem
❖ Input
➢ A fixed set of classes C = {c1, c2, …, cN}
➢ A training set of M hand-labeled documents: (d1, c1), …, (dM, cM)
➢ A document d
❖ Output
➢ A learned classifier d => c ∈ C

Training data (Iris example)
Petal_Length   Petal_Width   Label
1              0.2           0
1.3            0.6           0
0.9            0.7           0
1.7            0.5           1
1.8            0.9           1
1.2            1.3           1

Test data
Petal_Length   Petal_Width   Label
1.2            0.2           ?
Text Classification
! Classification Problem
From numeric features to text documents:

Numeric features (Iris)
Petal_Length   Petal_Width   Label
1              0.2           0
1.3            0.6           0
0.9            0.7           0
1.7            0.5           1
1.8            0.9           1
1.2            1.3           1

Text documents (IMDB reviews)
Review: A wonderful little production. <br/> The filming technique is very unassuming- very ole-time-B…    => Label 0
Review: A rating of “1” does not begin to express how dull, depressing and relentlessly bad this movie is. => Label 0
Text Classification
! Token-level Classification
❖ Sequence Labeling: Word Segmentation, Part-of-Speech Tagging (POS), Named Entity Recognition (NER)
❖ Named Entity Recognition (NER) example: "I have a flight to New York at 5 pm"
   => "New York" (STATE OR PROVINCE), "5 pm" (TIME)
❖ Part-of-Speech Tagging (POS): each token is labeled with its grammatical category
Text Classification
! Document-level Classification
❖ Sentiment Analysis: assign a sentiment label to a whole document
[Figure: text classification into topics (Technology / Sports / Business) alongside sentiment analysis]
Text Classification
! IMDB Review Dataset
Text Cleaning: raw reviews => (cleaned review, label) pairs
A wonderful little production. <br/> The filming technique is very unassuming- very ole-time-B…    => Label 0
A rating of “1” does not begin to express how dull, depressing and relentlessly bad this movie is. => Label 0

Text Representation: each cleaned review => a numeric feature vector, e.g.

x = [[1, 1.4, 0.2],
     [1, 1.5, 0.2],
     [1, 3.0, 1.1],
     [1, 4.1, 1.3]]
Text Classification
! IMDB Review Dataset
Dataset => Preprocessing => Exploratory Data Analysis (EDA) => Representation => Modeling => Evaluation => Inference
Outline
SECTION 1
Text Classification
SECTION 2
Text Preprocessing
SECTION 3
Text Representation
SECTION 4
Classification
Text Preprocessing
! Text Preprocessing
❑ Removal of URLs and HTML tags
❑ Text Standardizing
❑ Lowercasing
❑ Number and Punctuation Handling
❑ Removal of Stop Words
❑ Removal of Rare Words
❑ Handling Emojis and Emoticons
❑ Spelling Correction
❑ Stemming
❑ Lemmatization
❑ Tokenization: Sentence, Word, Character, Subwords
Text Preprocessing
! Removal of URLs and HTML Tags
➢ Extract text based on the structure of an HTML document
➢ URLs: image links, reference links,…
➢ HTML tags: <p>…</p>, <div>…</div>,…

Before: @82476 <p> We'd like to help Sam, which number is caling you? </p> Please DM us more info so we can advise further. https://t.co/5pyLDJBC6r
After:  @82476 We'd like to help Sam, which number is caling you? Please DM us more info so we can advise further.
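A minimal sketch of this step with Python's re module; the regex patterns are illustrative assumptions (an HTML parser such as BeautifulSoup could be used for the tags instead):

import re

def remove_urls_html(text: str) -> str:
    """Strip URLs and HTML tags from a raw review (illustrative regex patterns)."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # drop http(s)/www URLs
    text = re.sub(r"<[^>]+>", " ", text)                # drop HTML tags like <p>, </p>, <br/>
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

raw = ("@82476 <p> We'd like to help Sam, which number is caling you? </p> "
       "Please DM us more info so we can advise further. https://t.co/5pyLDJBC6r")
print(remove_urls_html(raw))
# @82476 We'd like to help Sam, which number is caling you? Please DM us more info so we can advise further.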
Text Preprocessing
! Text Standardizing
➢ Lowercasing: use the lower() function in Python
➢ Expand short forms and abbreviations that represent the same meaning: DM => direct message, info => information
➢ Contractions: I'm => I am, isn't => is not, can't => can not,…

Before: @82476 We'd like to help Sam, which number is caling you? Please DM us more info so we can advise further.
After:  @82476 we would like to help sam, which number is caling you? please direct message us more information so we can advise further.
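A minimal sketch of lowercasing plus a small replacement dictionary; the map below is a toy assumption (a real pipeline would use a much larger dictionary or a dedicated contractions library):

REPLACEMENTS = {
    "we'd": "we would", "i'm": "i am", "can't": "can not", "isn't": "is not",
    "dm": "direct message", "info": "information",
}

def standardize(text: str) -> str:
    """Lowercase the text and expand known short forms, word by word."""
    words = [REPLACEMENTS.get(w.strip(".,!?"), w) for w in text.lower().split()]
    return " ".join(words)

print(standardize("@82476 We'd like to help Sam, which number is caling you? "
                  "Please DM us more info so we can advise further."))
# @82476 we would like to help sam, which number is caling you? please direct message us more information so we can advise further.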
Text Preprocessing
! Number and Punctuation Handling
➢ Removal: typical for Text Classification
➢ Keep as tokens: Machine Translation, POS Tagging, Named Entity Recognition

            Removal    As Token
Sam.        Sam        Sam .
You?        You        You ?
Further.    Further    Further .

Removal:  We would like to help Sam which number is caling you Please direct message us more information so we can advise further
As Token: @ 82476 We would like to help Sam , which number is caling you ? Please direct message us more information so we can advise further .
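A minimal sketch of both options using simple regex and string rules (library tokenizers such as nltk handle the "as token" case more robustly):

import re
import string

def remove_punct_and_numbers(text: str) -> str:
    """Drop digits and punctuation entirely (typical for text classification)."""
    text = re.sub(r"\d+", "", text)
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def punct_as_tokens(text: str) -> str:
    """Keep punctuation but separate it into its own tokens."""
    return re.sub(r"([^\w\s])", r" \1 ", text).strip()

print(remove_punct_and_numbers("Will it be hot tomorrow?"))   # Will it be hot tomorrow
print(punct_as_tokens("Will it be hot tomorrow?"))            # Will it be hot tomorrow ?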
Text Preprocessing
! Stop / Rare Words Handling
➢ Focus on the important keywords
➢ Stop words: common words that carry little or no meaning compared to keywords
   English: a, an, that, for,…
   Vietnamese: à, ừ, vậy, thế,…
➢ Rare words: words that appear only a few times in the corpus
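A minimal sketch using nltk's English stop-word list plus a frequency threshold for rare words; the min_count value is an illustrative assumption:

from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_and_rare(docs, min_count=2):
    """Drop stop words, then drop tokens seen fewer than min_count times in the corpus."""
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS] for doc in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
print(remove_stop_and_rare(docs))
# [['dog', 'bites', 'man'], ['man', 'bites', 'dog'], ['dog', 'eats'], ['man', 'eats']]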
Text Preprocessing
! Emoji and Emoticons Handling
➢ Emojis: …
➢ Emoticons: :-) :-( :-))) :-)
➢ For some tasks: convert emojis and emoticons to words
➢ Example: :-) => happy, :-( => sad,…

Before: @82476 🤔 We would like to help Sam, which number is caling you? Please direct message us more information so we can advise further.
After:  @82476 thinking_face We would like to help Sam, which number is caling you? Please direct message us more information so we can advise further.
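A minimal sketch using the third-party emoji package for emojis and a small hand-made map for emoticons; the emoticon dictionary is an illustrative assumption:

import re

import emoji  # third-party: pip install emoji

EMOTICONS = {":-)": "happy", ":)": "happy", ":-(": "sad", ":(": "sad"}  # toy map (assumption)

def convert_emoji_emoticons(text: str) -> str:
    """Replace emojis with their names (e.g. 🤔 -> thinking_face) and emoticons with words."""
    text = emoji.demojize(text, delimiters=(" ", " "))
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return re.sub(r"\s+", " ", text).strip()

print(convert_emoji_emoticons("Great movie :-) 🤔"))
# Great movie happy thinking_face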
Text Preprocessing
! Stemming and Lemmatization
➢ Goal: convert words to the same root
   am, is, are => be
   dinner, dinners => dinner
➢ Lemmatization: determining that words have the same root despite their surface differences
Text Preprocessing
! Stemming and Lemmatization
Morphological Parsing
➢ Morphology: the small meaningful units that make up words
   ◦ Stems: the core meaning-bearing units
   ◦ Affixes: parts that adhere to stems, often with grammatical functions
➢ Morphological parsers:
   Highest => High + est
   Higher  => High + er
Text Preprocessing
! Stemming and Lemmatization
Stemming
➢ Stemming – a simple form of lemmatization
   A naïve version of morphological analysis
   Chops off word-final affixes with simple rules:
   …ational => …ate   (relational => relate)
   …sses => …ss       (grasses => grass)
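A minimal sketch comparing nltk's PorterStemmer and WordNetLemmatizer on the examples above; the one-time wordnet download is required for the lemmatizer:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("relational"), stemmer.stem("grasses"), stemmer.stem("dinners"))
# relat grass dinner
print(lemmatizer.lemmatize("dinners", pos="n"), lemmatizer.lemmatize("are", pos="v"))
# dinner be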
Text Preprocessing
! Tokenization
➢ Split a paragraph or document into sentences
➢ Use RegEx or a library: nltk, gensim,… => nltk.sent_tokenize()

Input Text: Tokenization is one of the first steps in any NLP pipeline. Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.

Sentence Tokenization:
  Tokenization is one of the first steps in any NLP pipeline.
  Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.
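A minimal sketch of sentence tokenization with nltk (the tokenizer model must be downloaded once; newer nltk versions may need "punkt_tab" instead of "punkt"):

import nltk

nltk.download("punkt", quiet=True)  # sentence/word tokenizer models

text = ("Tokenization is one of the first steps in any NLP pipeline. "
        "Tokenization is nothing but splitting the raw text into small chunks "
        "of words or sentences, called tokens.")

for sentence in nltk.sent_tokenize(text):
    print(sentence)
# Tokenization is one of the first steps in any NLP pipeline.
# Tokenization is nothing but splitting the raw text into small chunks of words or sentences, called tokens.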
Text Preprocessing
! Tokenization
➢ Split a sentence into words (tokens)
➢ Use RegEx or a library: nltk, gensim,… => nltk.word_tokenize()

Input Text: Tokenization is one of the first steps in any NLP pipeline.

Word Tokenization:
  Tokenization | is | one | of | the | first | steps | in | any | NLP | pipeline | .
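A minimal sketch of word tokenization with nltk; note that punctuation becomes its own token:

import nltk

nltk.download("punkt", quiet=True)  # tokenizer model ("punkt_tab" on newer nltk versions)

sentence = "Tokenization is one of the first steps in any NLP pipeline."
print(nltk.word_tokenize(sentence))
# ['Tokenization', 'is', 'one', 'of', 'the', 'first', 'steps', 'in', 'any', 'NLP', 'pipeline', '.']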
Text Preprocessing
! Exploratory Data Analysis (EDA)
➢ Frequencies of sentiment labels
Text Preprocessing
! Exploratory Data Analysis (EDA)
➢ Word lengths
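A minimal sketch of both EDA plots with pandas and matplotlib, assuming a DataFrame with 'review' and 'sentiment' columns (the column names and toy data are assumptions):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "review": ["A wonderful little production.", "Relentlessly bad movie.", "Loved it!"],
    "sentiment": ["positive", "negative", "positive"],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Frequencies of sentiment labels
df["sentiment"].value_counts().plot(kind="bar", ax=axes[0], title="Label frequencies")

# Review lengths measured in words
df["review"].str.split().str.len().plot(kind="hist", ax=axes[1], title="Review length (words)")

plt.tight_layout()
plt.show()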
Outline
SECTION 1
Text Classification

SECTION 2
Text Preprocessing

SECTION 3
Text Representation

SECTION 4
Classification
Text Representation
! Numeric Representation
❖ Token-Level: each token gets its own representation
   "I go to school" --Tokenization--> [I, go, to, school]
❖ Document-Level: the whole document gets one representation
   "I go to school"
Text Representation
! One-hot Encoding
❖ Token-Level
❖ Each token is represented by a V-dimensional binary vector of 0s and 1s
   - All 0s except at one index, index = word id (wid)
   - At this index, put 1

Preprocessing + Tokenization:
  Dog bites man.  => [dog, bites, man]
  Man bites dog.  => [man, bites, dog]
  Dog eats meat.  => [dog, eats, meat]
  Man eats food.  => [man, eats, food]

Build Vocabulary:
  IDX   Token
  0     bites
  1     dog
  2     eats
  3     food
  4     man
  5     meat
Text Representation
! One-hot Encoding
❖ Token-Level
❖ Each token is represented by a V-dimensional binary vector of 0s and 1s
   - All 0s except at the token's index (index = wid), where we put 1

Dog bites man. => [dog, bites, man]

Vocabulary: IDX 0 bites, 1 dog, 2 eats, 3 food, 4 man, 5 meat

         bites   dog   eats   food   man   meat
dog      0       1     0      0      0     0
bites    1       0     0      0      0     0
man      0       0     0      0      1     0

=> [[0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0]]
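A minimal sketch of one-hot encoding with plain Python lists, using the vocabulary above:

vocab = {"bites": 0, "dog": 1, "eats": 2, "food": 3, "man": 4, "meat": 5}

def one_hot(token: str, vocab: dict) -> list:
    """V-dimensional binary vector: all zeros except a 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

print([one_hot(t, vocab) for t in ["dog", "bites", "man"]])
# [[0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]]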
Text Representation
! Bag-of-Words (BoW)
❖ Document-Level: consider text as a bag (collection) of words
❖ Each document is represented by a V-dimensional vector
   Each entry is the number of occurrences of the corresponding word in the document

Vocabulary: IDX 0 bites, 1 dog, 2 eats, 3 food, 4 man, 5 meat

Counter:
                    bites   dog   eats   food   man   meat
[dog, bites, man]   1       1     0      0      1     0
Text Representation
! Bag-of-Words (BoW)
❖ Document-Level: consider text as a bag (collection) of words
❖ Each document is represented by a V-dimensional count vector

                    bites   dog   eats   food   man   meat
[dog, bites, man]   1       1     0      0      1     0
[man, bites, dog]   1       1     0      0      1     0
[dog, eats, meat]   0       1     1      0      0     1
[man, eats, food]   0       0     1      1      1     0
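A minimal sketch of BoW with scikit-learn's CountVectorizer, which reproduces the table above (its alphabetically sorted vocabulary happens to match the slide's index order):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse matrix of shape (4 documents, 6 terms)

print(vectorizer.get_feature_names_out())   # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(bow.toarray())
# [[1 1 0 0 1 0]
#  [1 1 0 0 1 0]
#  [0 1 1 0 0 1]
#  [0 0 1 1 1 0]]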
Text Representation
! TF-IDF
tf(t, d) = count(t, d)

➢ Some ways to reduce the raw frequency:
   • Use log space and add 1:
     tf(t, d) = log(count(t, d) + 1)
   • Divide the number of occurrences by the length of the document:
     tf(t, d) = count(t, d) / len(d)
Text Representation
! TF-IDF
tf(t, d) = count(t, d) / len(d)

Example:
                        bites   dog   eats   food   man   meat
[dog, bites, man]  D1   1/3     1/3   0      0      1/3   0
[man, bites, dog]  D2   1/3     1/3   0      0      1/3   0
[dog, eats, meat]  D3   0       1/3   1/3    0      0     1/3
[man, eats, food]  D4   0       0     1/3    1/3    1/3   0
Text Representation
! TF-IDF
idf(t) = N / df(t)

➢ Measures the importance of a word across the corpus
   N: the total number of documents in the corpus
   df(t): the number of documents containing the term t
➢ Using log space:
   idf(t) = log(N / df(t))
   idf(t) = log(N / df(t)) + 1
   idf(t) = log((N + 1) / (df(t) + 1)) + 1
Text Representation
! TF-IDF
idf(t) = ln((N + 1) / (df(t) + 1)) + 1

Example (N = 4):
[dog, bites, man]
[man, bites, dog]
[dog, eats, meat]
[man, eats, food]

       bites   dog     eats    food    man     meat
idf    1.511   1.223   1.511   1.916   1.223   1.916
Text Representation
! TF-IDF
w(t, d) = tf(t, d) × idf(t)

➢ The weighted value w(t, d) for word t in document d
➢ IDF weighs down terms that are very common across the corpus and boosts rare terms
➢ The TF-IDF vector for a document is simply the TF-IDF score of each term in that document
Text Representation
! TF-IDF
Example documents: [dog, bites, man], [man, bites, dog], [dog, eats, meat], [man, eats, food]

idf:
       bites   dog     eats    food    man     meat
       1.511   1.223   1.511   1.916   1.223   1.916

tf:
       bites   dog     eats    food    man     meat
D1     1/3     1/3     0       0       1/3     0
D2     1/3     1/3     0       0       1/3     0
D3     0       1/3     1/3     0       0       1/3
D4     0       0       1/3     1/3     1/3     0

tf-idf = tf × idf:
       bites   dog     eats    food    man     meat
D1     0.504   0.408   0       0       0.408   0
D2     0.504   0.408   0       0       0.408   0
D3     0       0.408   0.504   0       0       0.639
D4     0       0       0.504   0.639   0.408   0
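A minimal sketch that reproduces the table above with numpy, using the slide's conventions tf = count/len(d) and idf = ln((N+1)/(df+1)) + 1. (scikit-learn's TfidfVectorizer uses raw counts for tf and L2-normalizes each row by default, so its numbers differ slightly.)

import numpy as np

docs = [["dog", "bites", "man"], ["man", "bites", "dog"],
        ["dog", "eats", "meat"], ["man", "eats", "food"]]
vocab = ["bites", "dog", "eats", "food", "man", "meat"]

N = len(docs)
tf = np.array([[doc.count(t) / len(doc) for t in vocab] for doc in docs])  # tf = count / len(d)
df = np.array([sum(t in doc for doc in docs) for t in vocab])              # document frequency
idf = np.log((N + 1) / (df + 1)) + 1                                       # smoothed idf

print(np.round(idf, 3))       # [1.511 1.223 1.511 1.916 1.223 1.916]
print(np.round(tf * idf, 3))  # first row (D1): [0.504 0.408 0.    0.    0.408 0.   ]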
Outline
SECTION 1
Text Classification
SECTION 2
Text Preprocessing
SECTION 3
Text Representation
SECTION 4
Classification
Classification
! Decision Tree & Random Forest
[Figure: a Decision Tree trained on the dataset vs. a Random Forest built from the same dataset]
Classification
! Decision Tree & Random Forest
[Figure: example Decision Tree and Random Forest models]
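A minimal sketch of training both classifiers on TF-IDF features with scikit-learn; the toy texts/labels stand in for the cleaned IMDB reviews and are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

texts = ["a wonderful little production", "dull depressing and relentlessly bad",
         "loved every minute of it", "a complete waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy data)

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.5, random_state=42)

for name, clf in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                  ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))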
Classification
! AdaBoost & Gradient Boosting
[Figure: AdaBoost vs. Gradient Boosting]
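A minimal sketch of the two boosting classifiers in scikit-learn on the same kind of TF-IDF features; the toy data is again an assumption:

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["a wonderful little production", "dull depressing and relentlessly bad",
         "loved every minute of it", "a complete waste of time"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts).toarray()

for clf in (AdaBoostClassifier(n_estimators=50), GradientBoostingClassifier(n_estimators=50)):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))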
Classification
! XGBoost
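A minimal sketch with the third-party xgboost package (pip install xgboost) through its scikit-learn-style XGBClassifier; the toy data is an assumption:

from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier  # third-party: pip install xgboost

texts = ["a wonderful little production", "dull depressing and relentlessly bad",
         "loved every minute of it", "a complete waste of time"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts).toarray()

model = XGBClassifier(n_estimators=100, max_depth=4, learning_rate=0.1)
model.fit(X, labels)
print(model.predict(X))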
Summary
Text Classification
❖ Introduction
❖ Token-level Text Classification
❖ Document-level Text Classification
❖ Sentiment Analysis
❖ IMDB Dataset

Text Preprocessing
❖ Duplicate Handling
❖ Text Cleaning
❖ EDA
❖ Tokenization

Text Representation
❖ Numeric Representation
❖ One-hot Encoding
❖ Bag-of-Words (BoW)
❖ TF-IDF

Classification
❖ Decision Tree Classifier
❖ Random Forest Classifier
❖ Evaluation
❖ Inference
Thanks!
Any questions?