BDACh 09 L02 Text Mining Process Phases

The document outlines the phases of the text mining process, including preprocessing, feature generation, feature selection, data mining techniques, and result analysis. Key concepts such as bag of words, TF-IDF, and the use of vectors and matrices are discussed, along with supervised and unsupervised learning methods. The overall aim is to extract meaningful information from text data efficiently.

Uploaded by

Ranjini Ranju


Lesson 2

Text Mining Process Phases

“Big Data Analytics”, Ch.09 L02: Text, Web, ...Social Network Analytics, 2019
Raj Kamal and Preeti Saxena, © McGraw-Hill Higher Edu. India
Text Document Components
• Syntactically, a text document consists of characters that form words, which can be further combined into phrases or sentences

Text Mining Steps
• Recognizing, extracting and using the information present in words
• Besides searching for words, mining also involves searching for semantic patterns

Text Mining Process
• Consists of a process pipeline that executes in several phases
• Mining uses iterative and interactive processes
• Processing in a pipeline makes text mining efficient and lets it mine new information

Figure 9.2 Five phases in a process pipeline

Phase 1: Preprocessing

• Clean-up
• Tokenization
• Part of Speech (POS) tagging
• Word sense disambiguation
• Parsing
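
The first two steps above, clean-up and tokenization, can be sketched with the Python standard library. This is only an illustration under stated assumptions: the function name `clean_and_tokenize`, the tag-stripping rule and the token pattern are hypothetical choices, not the method of the book. POS tagging, word sense disambiguation and parsing need linguistic resources and are not shown.

```python
import re

def clean_and_tokenize(text):
    """Strip HTML-like tags, lowercase the text, and split it into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)   # clean-up: drop markup (assumed rule)
    text = text.lower()                    # clean-up: normalize case
    # Tokenization: runs of letters/digits, optionally with an apostrophe part.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)

tokens = clean_and_tokenize("<p>The food wasn't tasty, but the service was good!</p>")
print(tokens)
```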

Phase 2: Feature Generation
1. Bag of words: the order of words is not that important for certain applications. A text document is represented by the words it contains and their occurrences, and the occurrence (frequency) of each word is used as a feature
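
A minimal sketch of the bag-of-words idea, assuming the document is already tokenized. The helper name `bag_of_words` is hypothetical. The point is that word order is discarded and only the counts remain.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each word to the number of times it occurs in the document."""
    return Counter(tokens)

bow = bag_of_words("good food and good service".split())
print(bow["good"])  # 2

# Reordering the words gives the same representation.
same = bag_of_words("service and food good good".split())
print(bow == same)  # True
```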

Feature Generation
2. Stemming: identifies a word by its root
• Reduces the word to its most basic element (impure → pure)
• Normalizes or unifies variations of the same concept
• Removes plurals, normalizes verb tenses and removes affixes
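
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any published algorithm. The suffix list and the minimum stem length are assumptions chosen for the example, and prefixes (as in impure → pure) are not handled.

```python
# Hypothetical, deliberately tiny rule set; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["walks", "walked", "walking", "walk"]]
print(stems)  # ['walk', 'walk', 'walk', 'walk']
```

All four variations collapse to one feature, which is the normalization effect described above.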

Feature Generation
3. Removing stop-words from the feature space: stop-words are unlikely to help text mining, so the search program tries to ignore them
• Ignores a, at, for, it, in, are, as, such, so, ….
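
A short sketch of stop-word removal. The set below contains only the words listed on the slide; a practical list would be longer, and the helper name `remove_stop_words` is hypothetical.

```python
# Only the stop-words named on the slide; real lists are much longer.
STOP_WORDS = {"a", "at", "for", "it", "in", "are", "as", "such", "so"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words("it is a tasty dish for dinner".split())
print(kept)  # ['is', 'tasty', 'dish', 'dinner']
```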

Vector Space Model (VSM)
• An algebraic model for representing text documents as vectors of identifiers, word frequencies or terms in the document index
• Term frequency-inverse document frequency (TF-IDF) evaluates how important a word is in a document

Weight of a Word

• TF-IDF Weight
• Higher weights may be assigned to keywords and titles
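
The slides do not state the TF-IDF formula, so the sketch below assumes one common formulation: term frequency is the count divided by document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants (smoothing, different logarithm bases) exist. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (assumed formula)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [["tasty", "food"], ["tasty", "drink"], ["crime", "report"]]
w = tf_idf(docs)
# "food" occurs in one document, "tasty" in two, so "food" gets the higher weight.
print(w[0]["food"] > w[0]["tasty"])  # True
```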

Use of Vectors and Matrices
• Represent a collection of web documents as vectors
• Represent the collection by a matrix of shape |D| × F, where |D| is the cardinality of the document space (total number of documents) and F is the number of features. F represents the vocabulary size.
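
As an illustration of the |D| × F shape, the following sketch builds a document-term count matrix as a list of rows. The function name `doc_term_matrix` and the alphabetical ordering of the vocabulary are assumptions made for the example.

```python
def doc_term_matrix(docs):
    """Build a |D| x F matrix of term counts from tokenized documents."""
    vocab = sorted({t for doc in docs for t in doc})   # F features
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]          # |D| rows
    for i, doc in enumerate(docs):
        for t in doc:
            matrix[i][index[t]] += 1
    return vocab, matrix

docs = [["tasty", "food"], ["good", "food", "good"]]
vocab, matrix = doc_term_matrix(docs)
print(vocab)   # ['food', 'good', 'tasty']
print(matrix)  # [[1, 0, 1], [1, 2, 0]]
```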
Example 9.2
• Shows that matrices representing term frequencies tend to be very sparse (with the majority of terms zeroed)
• Such a matrix is therefore commonly represented as a sparse matrix
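
One simple way to store such a matrix is a dictionary keyed by (row, column) that holds only the non-zero entries. This is a sketch of the idea, not the representation used in Example 9.2. The helper names `to_sparse` and `sparsity` and the sample matrix are hypothetical.

```python
def to_sparse(matrix):
    """Keep only non-zero entries, keyed by (row, column)."""
    return {(i, j): v
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0}

def sparsity(matrix):
    """Fraction of entries that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / total

dense = [[1, 0, 0, 0], [0, 0, 2, 0], [0, 0, 0, 0]]
sparse = to_sparse(dense)
print(sparse)  # {(0, 0): 1, (1, 2): 2}
```

Here 10 of the 12 entries are zero, so the dictionary stores two values instead of twelve.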

Phase 3: Features Selection

• Process that selects a subset of features by rejecting irrelevant and/or redundant features (variables, predictors or dimensions) according to defined criteria

1. Feature Selection
1. Dimensionality reduction: feature selection is one of the methods of division and, therefore, of dimension reduction. The basic objective is to eliminate irrelevant and redundant data. Redundant features are those that provide no extra information

Feature Selection
• Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are dimension reduction methods
• The discrimination ability of a feature measures its relevancy. Correlation helps in finding the redundancy of a feature.
2. Feature Selection
2. N-gram evaluation: finding sequences of N consecutive words of interest and extracting them
• For example, a 2-gram is a two-word sequence, [“tasty food”, “Good one”]. A 3-gram is a three-word sequence, [“Crime Investigation Department”].
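
N-gram extraction can be sketched as a sliding window over the token list. The function name `ngrams` and the sample sentence are hypothetical. Choosing which n-grams are "of interest" is a separate step that is not shown.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "crime investigation department served tasty food".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[-1])   # tasty food
print(trigrams[0])   # crime investigation department
```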

Feature Selection
• Two features are redundant to each other if their values correlate with each other.
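
Redundancy between two features can be checked with the Pearson correlation coefficient. The slides do not name a specific correlation measure, so Pearson is an assumption here, and the feature columns `f1`, `f2`, `f3` are made-up values for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (sx * sy)

f1 = [1, 2, 3, 4]   # hypothetical feature values over four documents
f2 = [2, 4, 6, 8]   # exactly twice f1, so it adds no extra information
f3 = [1, 0, 1, 0]
print(round(pearson(f1, f2), 6))  # 1.0
print(round(pearson(f1, f3), 3))  # -0.447
```

A correlation near 1 (or -1) suggests that one of the two features could be dropped.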
3. Feature Selection
3. Noise detection and evaluation of outliers: these methods identify unusual or suspicious items, events or observations in the data set
• This step helps in cleaning the data of irrelevant words/information
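
The slides do not name a particular outlier method. As one simple possibility, the sketch below flags values whose z-score exceeds a threshold. The threshold of 2.0, the function name `z_score_outliers` and the document lengths are all assumptions for illustration.

```python
import statistics

def z_score_outliers(values, threshold=2.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []                      # all values identical: nothing stands out
    return [i for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]

# Hypothetical document lengths in words; the last one is suspiciously long.
doc_lengths = [120, 130, 125, 118, 122, 127, 2400]
print(z_score_outliers(doc_lengths))  # [6]
```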

Phase 4: Data Mining Techniques
• Unsupervised learning (for example,
clustering)
• (i) The class labels (categories) of
training data are unknown
• (ii) Establish the existence of groups or
clusters in the data

Clustering

• Good clustering methods produce clusters with high intra-cluster similarity and low inter-cluster similarity
• Examples of uses: blogs, patterns and trends
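
The intra-cluster versus inter-cluster criterion can be checked numerically. The sketch below assumes cosine similarity as the measure (the slides do not specify one) and uses two hand-made clusters of term-count vectors. It evaluates a given grouping and is not itself a clustering algorithm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_similarity(group_a, group_b):
    """Average cosine similarity over all pairs of distinct vectors."""
    pairs = [(u, v) for u in group_a for v in group_b if u is not v]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical clusters of document vectors over a four-word vocabulary.
cluster_1 = [[3, 1, 0, 0], [2, 1, 0, 0]]
cluster_2 = [[0, 0, 1, 2], [0, 1, 1, 3]]
intra = mean_similarity(cluster_1, cluster_1)
inter = mean_similarity(cluster_1, cluster_2)
print(intra > inter)  # True
```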

Supervised learning (for example,
classification)
• (i) The training data is labeled to indicate the class
• (ii) New data is classified based on the training set
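
The two points above can be illustrated with a very small classifier. This slide does not name an algorithm, so a 1-nearest-neighbour rule based on word overlap is swapped in purely as an example. The training documents, labels and function names are hypothetical.

```python
from collections import Counter

def overlap(a, b):
    """Number of shared word occurrences between two token lists."""
    return sum((Counter(a) & Counter(b)).values())

def classify(tokens, training_set):
    """Return the label of the most similar labeled training document."""
    _, best_label = max(training_set,
                        key=lambda item: overlap(tokens, item[0]))
    return best_label

# (i) Labeled training data: each document carries its class.
training_set = [
    ("tasty food good service".split(), "positive"),
    ("cold food slow service".split(), "negative"),
]
# (ii) New data is classified based on the training set.
label = classify("good tasty dinner".split(), training_set)
print(label)  # positive
```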

Identifying evolutionary patterns in
temporal text streams
• Useful in a wide range of applications, such as summarizing events in news articles and extracting research trends from the scientific literature

Phase 5: Analysing results
(i) Evaluate the outcome of the complete process.
(ii) Interpretation of result: if acceptable, the results obtained can be used as an input for the next set of sequences. Otherwise, the result can be discarded, and one tries to understand what failed in the process and why.

Phase 5: Analysing results
(iii) Visualization – Prepare visuals from
data, and build a prototype.
(iv) Use the results for further
improvement in activities at the
enterprise, industry or institution.

Summary
We learnt:
• Text Mining Process Pipeline
• Preprocessing Phase
• Feature Generation Phase: Bag of Words, TF-IDF, weights of the words and terms
• VSM: Use of vectors and matrices

Summary
We learnt:
• Feature Selection Phase
• Data Mining Phase
• Supervised and unsupervised methods
• Analyzing the Results Phase

End of Lesson 2 on
Text Mining Process Phases

