
AIS 421

Intelligent Decision Support Systems


Data Mining

Text Mining
Outline
1. What is Text Mining?
2. Text Preprocessing
3. Feature Creation
4. Feature Selection
5. Pattern Discovery
Motivation for Text Mining
Approximately 90% of the world’s data is held in unstructured or
semi-structured formats; only about 10% is structured data.

Examples of unstructured or semi-structured data:
− web pages
− emails
− customer complaint letters
− corporate documents
− scientific papers
− books in digital libraries
Text Mining
The extraction of implicit, previously unknown,
and potentially useful information from large
amounts of textual resources.

Text mining sits at the intersection of several fields:
data mining, information retrieval, statistics,
machine learning, and computational linguistics & NLP.
Some Text Mining Applications
1. Classification of news stories
2. Email and news filtering / SPAM detection
3. Sentiment analysis
4. Clustering of documents or web pages
5. Search term auto-completion
6. Information extraction
Sentiment Analysis
− The goal of sentiment analysis is to determine the polarity of a
given text at the document, sentence, or feature/aspect level
− Polarity values
• positive, neutral, negative
• Likert scale (1 to 10)

− Application examples
• Document level: analysis of tweets about politicians
• Feature/aspect level: analysis of product reviews
Search Log Mining
− Analysis of search queries issued by large user communities
− Applications
1. Search term auto-completion using association analysis
2. Query topic detection using classification
Information Extraction
− Information extraction is the task of
automatically extracting structured
information from unstructured or
semi-structured documents.
− Subtasks
1. Named Entity Recognition
and Disambiguation
• “The parliament in Berlin has decided …”
• Which parliament? Which Berlin?
2. Relationship Extraction
• PERSON works for ORGANIZATION
• PERSON located in LOCATION
3. Fact Extraction
• CITY has population NUMBER
• COMPANY has turnover NUMBER [Unit]
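As an illustration of the named entity recognition subtask, here is a minimal sketch using spaCy, which is not mentioned in these slides (it assumes the en_core_web_sm model has been installed); NLTK or other NER tools could be used instead.

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The parliament in Berlin has decided to raise the budget.")

# Print each recognized entity with its predicted type (e.g. GPE for Berlin)
for ent in doc.ents:
    print(ent.text, ent.label_)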
Search versus Discovery

                    Search/Query (goal-oriented)    Discovery
Structured Data     Query Processing                Data Mining
Text                Information Retrieval           Text Mining
The Text Mining Process
1. Text Preprocessing
• syntactic and/or semantic analysis

2. Feature Generation
• bag of words, word embeddings

3. Feature Selection
• reduce the large number of features

4. Data Mining
• clustering
• classification
• association analysis
2. Text Preprocessing
1. Tokenization
2. Stopword Removal
3. Stemming
4. POS Tagging
Syntactic and Linguistic Text Preprocessing
− Simple Syntactic Processing
• Text Cleanup (remove punctuation and HTML tags)
• Tokenization (break text into single words)
− Advanced Linguistic Processing
• Word Sense Disambiguation
• determine which sense of a word is being used
• normalize synonyms (United States, USA, US)
• normalize pronouns (he, she, it)
• Part Of Speech (POS) Tagging
• parse sentences according to grammar
• determine function of each term
• e.g. John (noun) gave (verb) the (det) ball (noun)
Stopword Removal
− Many of the most frequently used words in English are likely to be useless for
text mining
− These words are called Stopwords
• examples: the, of, and, to, an, is, that, …
• a typical English stopword list contains about 400 to 500 such words
• for an application, an additional domain-specific stopword list may be
constructed

− Why should we remove stopwords?


• Reduce data set size
• stopwords account for 20-30% of total word count
• Improve effectiveness of text mining methods
• stopwords may confuse the mining algorithm
More Examples of Stopwords
Stemming
− Techniques to find the stem of a word

• words: User, users, used, using ➔ Stem: use


• words: Engineering, engineered ➔ Stem: engineer

− Usefulness for Text Mining

• improve effectiveness of text mining methods
• matching of similar words
• reduce term vector size
• combining words with the same stem may reduce
the term vector size by as much as 40-50%
Some Basic Stemming Rules
− remove endings
• if a word ends with a consonant other than s,
followed by an s, then delete s.
• if a word ends in es, drop the s.
• if a word ends in ing, delete the ing unless the
remaining word consists of only one letter or of “th”
• if a word ends with ed, preceded by a consonant,
delete the ed unless this leaves only a single letter
• …...

− transform words
• if a word ends with “ies” but not “eies” or “aies”,
then replace the “ies” with “y”
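A rough sketch of two of the rules above in plain Python (simple_stem is a hypothetical helper, not a complete stemming algorithm):

def simple_stem(word):
    """Toy stemmer sketch implementing two of the rules above."""
    w = word.lower()
    # transform rule: "ies" -> "y" (unless preceded by "e" or "a")
    if w.endswith("ies") and not (w.endswith("eies") or w.endswith("aies")):
        return w[:-3] + "y"
    # removal rule: drop a trailing "s" if it follows a consonant other than "s"
    if len(w) > 2 and w.endswith("s") and w[-2] not in "aeious":
        return w[:-1]
    return w

print(simple_stem("ponies"))   # pony
print(simple_stem("cats"))     # cat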
Text Preprocessing in Python
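A minimal sketch of the preprocessing steps above using NLTK; it assumes the punkt tokenizer and the stopword list have been downloaded via nltk.download("punkt") and nltk.download("stopwords").

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The engineers were engineering new tools for the users!"

# 1. Text cleanup + tokenization
tokens = word_tokenize(text.lower())
tokens = [t for t in tokens if t not in string.punctuation]

# 2. Stopword removal
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 3. Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)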
3. Feature Generation
1. Bag-of-Words
2. Word Embeddings
Bag-of-Words: The Term-Document Matrix
Bag-of-Words: Feature Generation
− Document is treated as a bag of words (or terms)
• each word/term becomes a feature
• order of words/terms is ignored

− Each document is represented by a vector


− Different techniques for vector creation:
1. Binary Term Occurrence: Boolean attributes describe whether
or not a term appears in the document
2. Term Occurrence: Number of occurrences of a term in the document
(problematic if documents have different length)
3. Term Frequency: Attributes represent the frequency with which
a term appears in the document (number of occurrences /
number of words in document)
4. TF-IDF: see next slide
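Before moving to TF-IDF, here is a brief sketch of techniques 1–3 using scikit-learn (scikit-learn is not mentioned in the slides; any term-document matrix implementation would do):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Saturn is the gas planet with rings.",
        "Jupiter is the largest gas planet."]

# 1. Binary Term Occurrence
bto = CountVectorizer(binary=True).fit_transform(docs)

# 2. Term Occurrence (raw counts)
counts = CountVectorizer().fit_transform(docs)

# 3. Term Frequency: normalize counts by the number of words per document
tf = counts.toarray() / counts.toarray().sum(axis=1, keepdims=True)
print(tf)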
The TF-IDF Term Weighting Scheme
− The TF-IDF weight (term frequency–inverse document
frequency) is used to evaluate how important a word is to a
corpus of documents.
• TF: Term Frequency (see last slide)
• IDF: Inverse Document Frequency
idf_i = log(N / df_i)
where N is the total number of docs in the corpus and
df_i is the number of docs in which term t_i appears
• TF-IDF weight of term t_i in document d_j: tf_ij * idf_i

− Gives more weight to rare words.

− Gives less weight to common words
(domain-specific stopwords).
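A small worked sketch of the weighting scheme described above (TF = occurrences / document length, IDF = log(N / df)); the base-10 logarithm is used here, other bases are also common. Library implementations such as scikit-learn's TfidfVectorizer use slightly different (smoothed) variants.

import math

def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """TF-IDF weight as defined above: tf * log(N / df)."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / doc_freq)
    return tf * idf

# a word occurring once in a 7-word document and appearing in 2 of 3 docs
print(tf_idf(1, 7, 3, 2))   # ~0.025, a rare-ish word gets some weight
# a word appearing in all 3 documents -> IDF = 0 -> weight 0
print(tf_idf(1, 7, 3, 3))   # 0.0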
Word Embeddings
− Embeddings represent words not as a single index in a sparse
one-hot word vector, but as a dense vector of real numbers
(distributed representation),
e.g. 50 to 300 numbers
− Embeddings are chosen in a way that
semantically related words (e.g. dog, puppy)
end up at similar locations in the vector space
• thus, embeddings can deal better with synonyms
and related terms than bag-of-words vectors

− Embeddings are calculated based on the


assumption that similar words appear in
similar contexts (distributional similarity)
• Skip-gram approach used by Word2Vec:
predict context words for each word using
a neural net
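A minimal sketch of training skip-gram embeddings with Gensim's Word2Vec (sg=1 selects the skip-gram approach mentioned above); the tiny corpus and parameter values are only illustrative.

from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (in practice, use preprocessed text)
sentences = [["the", "dog", "barks"],
             ["the", "puppy", "barks"],
             ["the", "cat", "meows"]]

# sg=1 -> skip-gram: predict context words from the centre word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["dog"][:5])             # first 5 dimensions of the embedding
print(model.wv.most_similar("dog"))    # nearest neighbours in the vector space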
Embedding Methods and Pretrained Models
− Well-known embedding methods
• Word2Vec (Google)
• GloVe (Stanford NLP Group)
• fastText (Facebook AI Research)
• BERT (Google)

− Pretrained embeddings can be downloaded
• GloVe: trained on Common Crawl, Wikipedia, and Tweets
• fastText: embeddings for 294 languages
• Example: the GloVe-50 embedding of the word “the”, pretrained on
Common Crawl, is a vector of 50 real numbers:
(0.418, 0.24968, -0.41242, 0.1217, 0.34527, ...)

− Using Embeddings
− Python: Gensim offers a Word2Vec implementation
− RapidMiner: Word2Vec extension on the marketplace
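A sketch of loading pretrained GloVe vectors through Gensim's downloader API; the model name glove-wiki-gigaword-50 is one of the pretrained bundles Gensim offers (50-dimensional GloVe vectors trained on Wikipedia and Gigaword).

import gensim.downloader as api

# Downloads (once) and loads the pretrained vectors as KeyedVectors
glove = api.load("glove-wiki-gigaword-50")

print(glove["the"][:5])             # first dimensions of the embedding of "the"
print(glove.most_similar("dog"))    # semantically related words, e.g. "puppy"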
4. Feature Selection
− Not all features help!
− Learners might have difficulty with high-dimensional data
Filter Tokens by POS Tags
− POS tagging may be helpful for feature selection
− Sometimes you want to focus on certain classes of words:
– Adjectives (JJ.) for sentiment analysis
– e.g. good, bad, great
– Nouns (N.) for text clustering
– e.g. red and blue cars are similar,
red and blue trousers are similar

– RapidMiner supports
– the PENN tag system for English
– the STTS tag system for German
– filtering conditions are expressed as regular expressions

– Python: the NLTK library supports the PENN tag system for English
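A minimal sketch of POS-based token filtering with NLTK (assumes the punkt and averaged_perceptron_tagger resources have been downloaded), keeping only adjectives as suggested for sentiment analysis:

import nltk

tokens = nltk.word_tokenize("The new phone has a great camera but terrible battery life")
tagged = nltk.pos_tag(tokens)          # PENN tags, e.g. ('great', 'JJ')

# Keep only adjectives (PENN tags starting with "JJ")
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
print(adjectives)                      # e.g. ['new', 'great', 'terrible']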


5. Pattern Discovery
Methods:
1. Cluster Analysis
2. Classification
3. Association Analysis
5.1 Document Clustering
Goal
− Given a set of documents and a similarity measure among documents
find clusters such that
• documents in one cluster are more similar to one another
• documents in separate clusters are less similar to one another
− using some clustering algorithm
Applications
− Topical clustering of news stories
− Email message thread identification
− Grouping of document versions

Question
− Which similarity measures are a good choice for comparing document
vectors?
Jaccard Coefficient
− The Jaccard coefficient is a popular similarity measure for vectors
consisting of asymmetric binary attributes

sim(x_i, x_j) = M_11 / (M_01 + M_10 + M_11)

i.e. the number of 1-1 matches divided by the number of attributes
that are non-zero in at least one of the two vectors

− used together with binary term occurrence vector


• 1 represents occurrence of specific word
• 0 represents absence of specific word
Example: Jaccard Coefficient
− Example document set
d1 = “Saturn is the gas planet with rings.”
d2 = “Jupiter is the largest gas planet.”
d3 = “Saturn is the Roman god of sowing.”

− Documents as binary term occurrence vectors


Saturn is the gas planet with rings Jupiter largest Roman god of sowing

d1 1 1 1 1 1 1 1 0 0 0 0 0 0
d2 0 1 1 1 1 0 0 1 1 0 0 0 0
d3 1 1 1 0 0 0 0 0 0 1 1 1 1

− Jaccard similarities between the documents


• sim(d1,d2) = 0.44
• sim(d1,d3) = 0.27
• sim(d2,d3) = 0.18
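The similarities above can be checked with a few lines of Python operating directly on token sets; for binary occurrence vectors the Jaccard coefficient equals the set-based intersection over union.

def jaccard(a, b):
    """Jaccard coefficient of two token sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

d1 = set("saturn is the gas planet with rings".split())
d2 = set("jupiter is the largest gas planet".split())
d3 = set("saturn is the roman god of sowing".split())

print(round(jaccard(d1, d2), 2))   # 0.44
print(round(jaccard(d1, d3), 2))   # 0.27
print(round(jaccard(d2, d3), 2))   # 0.18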
Cosine Similarity
− Popular similarity measure for comparing weighted document
vectors such as term-frequency or TF-IDF vectors

cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||)

where • indicates the vector dot product and
||d|| = sqrt( Σ_{i=1..N} d_i^2 ) is the length of vector d

− Example
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

cos( d1, d2 ) = 0.3150
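The same computation in a few lines of Python (NumPy assumed):

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # 0.315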


Example: Cosine Similarity and TF-IDF
− A commonly used combination for text clustering
− Each document is represented by vectors of TF-IDF weights
− Sample document set:
• “Saturn is the gas planet with rings.”
• “Jupiter is the largest gas planet.”
• “Saturn is the Roman god of sowing.”

− First document as a TF-IDF vector (showing the entries for
Saturn, is, the, ..., Jupiter, largest, Roman, ...):

(1/7 * log(3/2), 1/7 * log(3/3), 1/7 * log(3/3), ..., 0, 0, 0, ...)


Example: Cosine Similarity and TF-IDF
− Sample document set
d1 = “Saturn is the gas planet with rings.”
d2 = “Jupiter is the largest gas planet.”
d3 = “Saturn is the Roman god of sowing.”

− Documents as TF-IDF vectors


Saturn is the gas planet with rings Jupiter largest Roman god of sowing

d1 0.03 0 0 0.03 0.03 0.07 0.07 0 0 0 0 0 0


d2 0 0 0 0.03 0.03 0 0 0.08 0.08 0 0 0 0
d3 0.03 0 0 0 0 0 0 0 0 0.07 0.07 0.07 0.07

− Cosine similarities between the documents


− cos(d1,d2) = 0.13
− cos(d1,d3) = 0.05
− cos(d2,d3) = 0.00
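The combination can also be obtained with scikit-learn's TfidfVectorizer and cosine_similarity; note that scikit-learn uses a smoothed IDF and L2 normalization by default, so the resulting values differ somewhat from the hand-computed table above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Saturn is the gas planet with rings.",
        "Jupiter is the largest gas planet.",
        "Saturn is the Roman god of sowing."]

tfidf = TfidfVectorizer().fit_transform(docs)   # one TF-IDF vector per document
print(cosine_similarity(tfidf).round(2))        # 3x3 pairwise similarity matrix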
Embedding-based Similarity
1. Translate documents into embedding vectors
• using for example doc2vec or
• average embeddings of all words in the document

2. Calculate the similarity of the document embedding vectors
• cosine similarity
• Word Mover's Distance
• neural nets (RNNs, LSTMs)
http://bionlp-www.utu.fi/wv_demo/
− Libraries for calculating embeddings
− Python: gensim offers doc2vec and word2vec implementations
− RapidMiner: word2vec extension on marketplace
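A sketch of the "average the embeddings of all words" variant from step 1, reusing pretrained GloVe vectors loaded via Gensim as in the earlier example; doc2vec would be the more principled alternative.

import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

def doc_embedding(tokens):
    """Average the embeddings of all in-vocabulary words in the document."""
    vectors = [glove[t] for t in tokens if t in glove]
    return np.mean(vectors, axis=0)

d1 = doc_embedding("saturn is the gas planet with rings".split())
d2 = doc_embedding("jupiter is the largest gas planet".split())

cos = d1.dot(d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)   # embedding-based similarity of the two documents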
5.2 Document Classification
− Given: A collection of labeled documents (training set)
− Find: A model for the class as a function of the values of the features
− Goal: Previously unseen documents should be assigned a class as
accurately as possible
− Applications
• topical classification of news stories or web pages
• SPAM detection
• sentiment analysis

− Classification methods commonly used for text


1. naive bayes
2. support vector machines (SVMs)
3. recurrent neural networks (RNNs), e.g. long short-term memory (LSTMs)
4. but KNN or random forests may also work
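A compact sketch of a text classifier combining TF-IDF features with a linear SVM via scikit-learn; the tiny labeled set of <text; label> pairs is purely illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set of <text, label> pairs
texts  = ["great phone, love the camera", "terrible battery, very disappointed",
          "excellent value for money", "awful screen, do not buy"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["the camera is excellent"]))   # likely ['positive']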
Example Application: Sentiment Analysis

− Given: A text
− Goal: Assign a class of sentiment to the text
• e.g., positive, neutral, negative
• e.g., sad, happy, angry, surprised

− Can be implemented as supervised classification task


• requires training data
• i.e., pairs like <text; sentiment>
Example Application: Sentiment Analysis
− Labeling data for sentiment analysis
• is expensive, like every data labeling task

− Reviews from the Web may be used as labeled data

− There exist various large corpora of reviews for public download


• Amazon Product Data by Julian McAuley: 142 million reviews from Amazon
• WebDataCommons: 70 million reviews from 50,000 websites that use RDFa
or Microdata markup
Preprocessing for Sentiment Analysis
− Recap – we started our processing with: Simple Syntactic Analysis
• text cleanup (remove punctuation, HTML tags, …)
• normalize case
• …

− However, reasonable features for sentiment analysis might include


• punctuation “!”, “?”, “?!”
• smileys encoded using punctuation: e.g. ;-) :-(
• use of visual markup, where available (red color, bold face, ...)
• amount of capitalization (“SCREAMING”)

− Practical Approach
• Replace smileys or visual markup with sentiment words in preprocessing
• ☺ ➔ great, COOL ➔ cool cool
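The replacement step could be a simple set of regex substitutions applied before tokenization; the mapping below is hypothetical and only mirrors the spirit of the examples on this slide.

import re

# Hypothetical smiley-to-sentiment-word mapping, applied before tokenization
replacements = {
    r":-\)|;-\)": " great ",
    r":-\(":      " bad ",
}

def replace_smileys(text):
    for pattern, word in replacements.items():
        text = re.sub(pattern, word, text)
    return text

print(replace_smileys("Arrived on time :-) but the box was damaged :-("))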
Summary
− Main challenge in text mining: Preprocessing and vectorization
• in order to be able to apply well known Data Mining algorithms

− There are lots of alternative techniques


• thus you need to experiment in order to find out which work well
for your use case
• focus has shifted from bag-of-words approaches to embeddings

− Text mining can be tricky, but OK-ish results are easily achieved
Questions?
