Text Analytics
sunny1637@[Link]
BX9T5ZHNQF
Machine Learning
Proprietary content. ©Great Learning. All Rights Reserved.
This file is meant for personal use by sunny1637@[Link] only.
Unauthorized use or distribution prohibited.
Sharing or publishing the contents in part or full is liable for legal action.
Agenda
• Introduction to Text Analytics
• Text Analytics and Applications
• Unstructured vs. Structured Data, Cleaning
• Bag of words, Word Frequencies
• Hierarchical Clustering
• Sentiment Analysis
• Word Embeddings
• Ensemble Methods
Text Analytics
• The process of drawing meaning out of written
communication.
• Used to understand online reviews, tweets, call center agent
notes, survey results, and other types of written feedback
that capture insight into your customers.
• Spam detection
• Translation
• Search and crawl
• Sentiment analysis
• Entity modeling to support fact-based decision making
• Text summarization
Text: Unstructured Data
• Structured data: Data is organized into a pre-defined
structure, such as a database table with rows and columns.
• Unstructured data: Data does not have a pre-defined
structure. Think of a collection of emails, a set of satellite
images, or the entire text of speeches from the British
Parliament since 1803.
Modeling/representing text
• Bag of words: Documents are represented simply by the words
they contain and their frequencies. Grammar and word order
are disregarded.
  • Example: Bayesian spam filter
• Semantic: Applying natural language rules to obtain a formal
representation of the meaning of the text.
  • Example: Named entity identification
Bag of words
• Corpus:
  • A: John likes to play soccer
  • B: John is reading a book

     John  likes  soccer  play  book  reading  a   is  to
A    1     1      1       1     0     0        0   0   1
B    1     0      0       0     1     1        1   1   0
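The table above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the names `corpus`, `bag_of_words`, `vocabulary` and `vectors` are illustrative and not taken from the slides, and the columns come out in sorted order rather than in the order shown on the slide.

```python
from collections import Counter

# The two-document corpus from the slide.
corpus = {
    "A": "John likes to play soccer",
    "B": "John is reading a book",
}

def bag_of_words(text):
    """Count how often each word occurs; word order is discarded."""
    return Counter(text.split())

# Vocabulary: every distinct word in the corpus, sorted for a stable column order.
vocabulary = sorted({word for text in corpus.values() for word in text.split()})

# One count vector per document, aligned with the vocabulary.
vectors = {}
for name, text in corpus.items():
    counts = bag_of_words(text)
    vectors[name] = [counts[word] for word in vocabulary]

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

A `Counter` returns 0 for words it has not seen, which is what fills the empty cells of the table.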
n-gram model
• The bag-of-words model is an orderless document representation: only
the counts of words matter.
• An n-gram model retains some word order by counting sequences of n
consecutive words instead, for example consecutive pairs (2-grams).
• A: John likes to play soccer
• B: John is reading a book
• 2-gram (bigram):

     John likes  likes to  play soccer  to play  John is  is reading  reading a  a book
A    1           1         1            1        0        0           0          0
B    0           0         0            0        1        1           1          1
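As a rough illustration of the bigram table, the sketch below slides a window of n words over each document and counts the resulting tuples. The helper name `ngrams` and the variable `bigram_counts` are assumptions made for this example, not part of the slides.

```python
from collections import Counter

def ngrams(text, n=2):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

corpus = {
    "A": "John likes to play soccer",
    "B": "John is reading a book",
}

# Count the bigrams of each document separately.
bigram_counts = {name: Counter(ngrams(text, 2)) for name, text in corpus.items()}

for name, counts in bigram_counts.items():
    print(name, dict(counts))
```

Setting `n=1` gives back the plain bag-of-words counts, so the same helper covers both representations.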
Cleaning text
• Stop words: Common words that add little value or context,
e.g. ‘the’, ‘an’, ‘in’. These are usually removed.
• Stemming: Reducing words to their stem, e.g. ‘Chopping’ and
‘Chopped’ are both replaced with ‘Chop’.
• Lower case conversion
• Remove punctuation
• Strip extra whitespace
• Remove numbers
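The cleaning steps above can be chained into a single function. The sketch below is illustrative only: `STOP_WORDS` is a deliberately tiny hand-written list, and `crude_stem` is a toy suffix stripper standing in for a real stemming algorithm (it will mis-stem many words). Both names, and `clean_text`, are assumptions made for this example.

```python
import re
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "an", "in", "a", "is", "to", "and", "of"}

def crude_stem(word):
    """Strip 'ing'/'ed' and undouble a trailing consonant. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[:-len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # 'chopp' -> 'chop'
            return stem
    return word

def clean_text(text):
    text = text.lower()                                               # lower case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove ASCII punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    words = text.split()                                              # split() also drops extra whitespace
    words = [crude_stem(w) for w in words if w not in STOP_WORDS]     # stop words, then stemming
    return " ".join(words)

print(clean_text("The chef   chopped 3 onions, chopping quickly in the kitchen!"))
```

The order of the steps matters: lower-casing comes first so that ‘The’ matches the stop-word ‘the’, and stop words are removed before stemming so the list can hold ordinary word forms.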
Term-Document Matrix (TDM)

          Doc 1   Doc 2   …   Doc N
Term 1
Term 2
…
Term M

Document-Term Matrix (DTM)

          Term 1  Term 2  …   Term M
Doc 1
Doc 2
…
Doc N
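The two layouts hold the same counts; one is the transpose of the other. A minimal sketch, reusing the two example sentences from the bag-of-words slide (the variable names `docs`, `terms`, `dtm` and `tdm` are illustrative):

```python
from collections import Counter

docs = ["John likes to play soccer", "John is reading a book"]

# Column order for the DTM: all distinct lower-cased terms, sorted.
terms = sorted({word for doc in docs for word in doc.lower().split()})

# DTM: one row per document, one column per term.
dtm = []
for doc in docs:
    counts = Counter(doc.lower().split())
    dtm.append([counts[term] for term in terms])

# TDM: the transpose, one row per term and one column per document.
tdm = [list(row) for row in zip(*dtm)]

for term, row in zip(terms, tdm):
    print(f"{term:<8}", row)
```

With N documents and M distinct terms, `dtm` has N rows of length M and `tdm` has M rows of length N.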
• Each document is represented by a vector in the term-document
matrix (a column of the TDM, or equivalently a row of the DTM)
• This numeric representation lends itself to a number of ML techniques
• For example, these document vectors can be clustered to
identify similar documents
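Clustering needs some measure of how close two document vectors are. The slides do not name one, so the sketch below assumes cosine similarity, a common choice for count vectors. The document vectors are made-up counts over a hypothetical four-term vocabulary, and `most_similar_pair` is an illustrative helper rather than a full clustering algorithm.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical count vectors over the vocabulary [soccer, play, book, reading].
doc_vectors = {
    "doc1": [2, 1, 0, 0],
    "doc2": [1, 1, 0, 0],
    "doc3": [0, 0, 1, 2],
}

def most_similar_pair(vectors):
    """Return (name_a, name_b, similarity) for the closest pair of documents."""
    best = max(
        combinations(vectors, 2),
        key=lambda pair: cosine_similarity(vectors[pair[0]], vectors[pair[1]]),
    )
    return best[0], best[1], cosine_similarity(vectors[best[0]], vectors[best[1]])

print(most_similar_pair(doc_vectors))
```

Merging the closest pair and repeating is the basic step of agglomerative (hierarchical) clustering, which builds clusters of similar documents bottom-up.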