Prof. V.V. Subrahmanyam
School of Computer and Information Sciences
Indira Gandhi National Open University (IGNOU)
New Delhi
Date: 22nd Aug, 2024  Time: 4:00 PM to 4:30 PM
Text Mining
Text mining, also known
as text data mining, is the
process of transforming
unstructured text into
a structured format to
identify meaningful
patterns and new insights.
Text Preprocessing - Introduction
Text data derived from natural language is
unstructured and noisy.
So text preprocessing is a critical step to
transform messy, unstructured text data
into a form that can be effectively used to
train machine learning models, leading to
better results and insights.
Text Preprocessing
Text preprocessing refers to a series of
techniques used to clean, transform and
prepare raw textual data into a format that
is suitable for natural language processing
(NLP) or Text Mining or Machine Learning
(ML) tasks.
Goal of Text Preprocessing
The goal of text preprocessing is to
enhance the quality and usability of
the text data for subsequent analysis
or modeling.
Common Text Preprocessing / Cleaning Steps
Lower Casing
Removal of Punctuations
Removal of Stopwords
Removal of Frequent words
Removal of Rare words
Stemming
Lemmatization
Removal of emojis
Removal of emoticons
Conversion of emoticons to words
Conversion of emojis to words
Removal of URLs
Removal of HTML tags
Chat words conversion
Spelling correction
Lower Casing
Lower casing is a common text preprocessing
technique. The idea is to convert the input text
into the same casing format so that, for example, 'text',
'Text' and 'TEXT' are treated the same way.
This is especially helpful for text featurization
techniques like frequency counts and TF-IDF, as it
combines identical words, thereby reducing
duplication and giving correct counts / TF-IDF values.
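As a minimal sketch in Python, lower casing is a one-line operation with `str.lower()`:

```python
# Convert all text to lower case so 'Text', 'TEXT' and 'text' match
texts = ["Text mining", "TEXT Mining", "text MINING"]
lowered = [t.lower() for t in texts]
print(lowered)  # all three become "text mining"
```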
Removal of Punctuations
This is again a text standardization process
that will help to treat 'hurray' and 'hurray!'
in the same way.
We also need to carefully choose the list of
punctuations to exclude depending on the
use case.
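A simple sketch using Python's `string.punctuation` with `str.translate`; the `keep` parameter is an illustrative way to exclude chosen punctuation marks from removal, as the use case demands:

```python
import string

def remove_punctuation(text, keep=""):
    # Drop every punctuation character except those listed in `keep`
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))

print(remove_punctuation("hurray!"))                # -> hurray
print(remove_punctuation("well-known!", keep="-"))  # -> well-known
```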
Removal of Stopwords
Stopwords are commonly occurring words in a language
like 'the', 'a' and so on.
They can be removed from the text most of the time,
as they don't provide valuable information for
downstream analysis.
In cases like Part of Speech (POS) tagging, we should
not remove them, as they provide very valuable
information about the POS.
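A minimal sketch with a hand-made stopword set (illustrative only; NLTK ships a much fuller stopword list per language):

```python
# A tiny illustrative stopword set; NLTK's stopwords corpus provides a full list
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and"}

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(remove_stopwords("The cat is in the garden"))  # -> "cat garden"
```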
Removal of Frequent Words
In the previous preprocessing step, we removed
stopwords based on language information. But
if we have a domain-specific corpus, we
might also have some frequent words that are of
little importance to us.
So this step is to remove the frequent words in the
given corpus. If we use something like tfidf, this is
automatically taken care of.
Some frequent words from a domain-specific
corpus:
I, us, DM, Help, We, Hi,
Please, Get, Thanks, etc.
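A sketch of corpus-specific frequent-word removal using `collections.Counter`; the toy corpus and the cut-off of one most-common word are illustrative choices:

```python
from collections import Counter

corpus = ["please send dm please", "hi please help", "thanks please"]
counts = Counter(w for doc in corpus for w in doc.split())
# Treat the N most common words in this corpus as "frequent" (N is a choice)
frequent = {w for w, _ in counts.most_common(1)}
cleaned = [" ".join(w for w in doc.split() if w not in frequent) for doc in corpus]
print(frequent, cleaned)  # {'please'} is the most frequent word here
```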
Removal of Rare Words
This is very similar to the previous
preprocessing step, but here we remove the
rare words from the corpus.
We can combine all the list of words
(stopwords, frequent words and rare
words) and create a single list to remove
them at once.
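A sketch that finds rare words (here, words occurring only once, an illustrative threshold) and merges them with a stopword set into a single removal list, as described above:

```python
from collections import Counter

corpus = "the cat sat the cat ran a dog barked"
counts = Counter(corpus.split())
rare = {w for w, c in counts.items() if c == 1}   # words occurring only once
stop_and_rare = {"the", "a"} | rare               # one combined removal set
cleaned = " ".join(w for w in corpus.split() if w not in stop_and_rare)
print(cleaned)  # -> "cat cat"
```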
Stemming
Stemming is the process of reducing inflected or derived
words to their word stem, base or root form.
For example, if there are two words in the
corpus, walks and walking, then stemming will strip
the suffix to make them walk.
But say, in another example, we have the two
words console and consoling; the stemmer will remove
the suffix and make them consol, which isn't a proper
English word.
Contd…
There are several types of stemming algorithms
available, and one of the most famous is the Porter
stemmer, which is widely used.
The Porter stemmer is for the English language. If we
are working with other languages, we can use the
Snowball stemmer.
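Assuming NLTK is installed, the Porter and Snowball stemmers can be used like this:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
# Inflected forms reduce to a common stem; note 'consol' is not a real word
print([porter.stem(w) for w in ["walks", "walking", "console", "consoling"]])

# Snowball stemmers cover several other languages, e.g. German
german = SnowballStemmer("german")
print(german.stem("laufen"))
```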
Stemming Example
We can see that words like private and propose have
their e at the end chopped off due to stemming. This is
not intended.
What can we do for that? We can use Lemmatization
in such cases.
Lemmatization
Lemmatization is similar to stemming in reducing
inflected words to their word stem, but differs in
that it makes sure the root word (also called the
lemma) belongs to the language.
Examples: Propose, Private
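A toy sketch using a hand-made lemma table, just to show that lemmatization returns real words; in practice NLTK's WordNetLemmatizer (after downloading the WordNet corpus) performs this kind of lookup:

```python
# Minimal hand-made lemma table, for illustration only
LEMMAS = {"proposing": "propose", "walks": "walk", "better": "good"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["proposing", "better", "private"]])
# every output is a proper English word, unlike stemming's 'consol'
```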
Illustration of Lemmatization and Stemming
Removal of Emojis
With more and more usage of social media
platforms, there has been an explosion in the use
of emojis in our day-to-day lives as well.
We might need to remove these emojis for
some of our textual analyses.
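A regex-based sketch; the Unicode ranges shown are an illustrative subset, not a complete emoji inventory:

```python
import re

# Common emoji code-point ranges (an illustrative subset, not exhaustive)
EMOJI_PATTERN = re.compile(
    "[\U0001F600-\U0001F64F"   # emoticon faces
    "\U0001F300-\U0001F5FF"    # symbols & pictographs
    "\U0001F680-\U0001F6FF"    # transport & map symbols
    "\u2600-\u26FF]+"          # miscellaneous symbols
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("Great job \U0001F600!"))  # -> "Great job !"
```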
Removal of Emoticons
There is a minor difference between emojis and
emoticons.
An emoticon is built from keyboard characters that,
when put together in a certain way, represent a facial
expression; an emoji is an actual image.
:-) is an emoticon
😀 is an emoji
Conversion of Emoticon to Words
In the previous step, we removed the emoticons.
But in use cases like sentiment analysis, the
emoticons carry valuable information, so
removing them might not be a good solution. What
can we do in such cases?
One way is to convert the emoticons to word format so
that they can be used in downstream modeling
processes.
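One possible sketch, using a small hand-made emoticon dictionary (both the entries and the mapped words are illustrative):

```python
# A tiny illustrative emoticon dictionary (real lists are much larger)
EMOTICONS = {":-)": "happy_face", ":-(": "sad_face", ":-D": "big_grin"}

def convert_emoticons(text):
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, meaning)
    return text

print(convert_emoticons("great movie :-)"))  # -> "great movie happy_face"
```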
Conversion of Emoji to Words
Now let us do the same for Emojis as well.
We may make use of a dictionary to convert the emojis
to corresponding words.
Again this conversion might be better than emoji
removal for certain use cases. Please use the one that is
suitable for the use case.
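A sketch with a tiny assumed emoji-to-word dictionary; a real mapping would cover far more code points:

```python
# A tiny illustrative emoji-to-word dictionary (assumed mapping)
EMOJI_WORDS = {"\U0001F600": "grinning_face", "\U0001F622": "crying_face"}

def convert_emojis(text):
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, word)
    return text

print(convert_emojis("so fun \U0001F600"))  # -> "so fun grinning_face"
```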
Removal of URLs
Next preprocessing step is to remove any URLs present
in the data.
For example, if we are analyzing X (Twitter) data,
there is a good chance that a tweet will have
some URL in it. We might need to remove these
URLs for our further analysis.
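A sketch using a simple regular expression; it catches common http(s) and www-style links, not every possible URL form:

```python
import re

def remove_urls(text):
    # Strip http(s):// and www. style links (a simple pattern, not RFC-complete)
    return re.sub(r"https?://\S+|www\.\S+", "", text)

print(remove_urls("check this out https://example.com/post now"))
```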
Removal of HTML Tags
Another common preprocessing technique that
comes in handy in many places is the removal of
HTML tags.
This is especially useful if we scrape data from
different websites; we might end up having HTML
strings as part of our text.
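A minimal regex-based sketch; for messy real-world markup, a proper HTML parser is more robust:

```python
import re

def remove_html_tags(text):
    # Strip anything between angle brackets; fine for simple, well-formed HTML
    return re.sub(r"<[^>]+>", "", text)

print(remove_html_tags("<p>Hello <b>world</b></p>"))  # -> "Hello world"
```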
Chat Words Conversion
This is an important text preprocessing step if we are
dealing with chat data.
People do use a lot of abbreviated words in chat and
so it might be helpful to expand those words for our
analysis purposes.
Examples
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
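A sketch that expands chat words using a dictionary built from the examples above (only a few entries shown):

```python
# Chat-word dictionary drawn from the examples above (abridged)
CHAT_WORDS = {
    "AFAIK": "As Far As I Know",
    "ASAP": "As Soon As Possible",
    "BBL": "Be Back Later",
}

def expand_chat_words(text):
    # Replace each known abbreviation; leave other words untouched
    return " ".join(CHAT_WORDS.get(w.upper(), w) for w in text.split())

print(expand_chat_words("reply ASAP please"))
```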
Spelling Correction
Another important text preprocessing step is
spelling correction.
Typos are common in text data and we might want to
correct those spelling mistakes before we do our
analysis.
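A toy sketch using fuzzy matching from Python's standard `difflib` module against an assumed vocabulary; dedicated libraries (e.g. TextBlob) offer proper spelling correction:

```python
import difflib

# Toy vocabulary; a real system would use a full dictionary
VOCAB = ["spelling", "correction", "analysis", "common"]

def correct(word):
    # Return the closest vocabulary word, or the word itself if none is close
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct("speling"))  # -> "spelling"
print(correct("python"))   # no close match, left as-is
```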
Tokenization
Tokenization is the process of breaking up
text into separate tokens, which can be
individual words, phrases, or whole
sentences.
In some cases, punctuation and special
characters (symbols like %, &, $) are
discarded in the process.
Contd…
A few common operations that require tokenization
include:
Finding how many words or sentences appear in text
Determining how many times a specific word or
phrase exists
Accounting for which terms are likely to co-occur
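A regex-based sketch of word and sentence tokenization that supports the counting operations listed above; NLTK's tokenizers handle punctuation and edge cases more carefully:

```python
import re

text = "Text mining is fun. Text mining finds patterns."
# Word tokens: runs of letters (punctuation is discarded)
words = re.findall(r"[A-Za-z]+", text)
# Sentence tokens: split on end-of-sentence punctuation
sentences = [s for s in re.split(r"[.!?]\s*", text) if s]
print(len(words), len(sentences), words.count("mining"))  # 8 words, 2 sentences
```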
Parts of Speech (POS) Tagging
This is one of the more advanced text preprocessing
techniques.
This step augments the input text with additional
information about the sentence’s grammatical structure.
Each word is, therefore, assigned to one of the predefined
categories, such as noun, verb, adjective, etc.
This step is also sometimes referred to as grammatical
tagging.
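A toy lookup-based tagger, purely for illustration (the tag set and dictionary are assumptions); in practice NLTK's `pos_tag`, backed by a trained model, assigns these categories:

```python
# Hand-made word-to-tag table; real taggers use context, not just lookup
TAGS = {"the": "DET", "dog": "NOUN", "runs": "VERB", "fast": "ADV"}

def tag(sentence):
    # Tag each word, marking unknown words as UNK
    return [(w, TAGS.get(w.lower(), "UNK")) for w in sentence.split()]

print(tag("The dog runs fast"))
```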
Term Frequency
Term frequency tells you how often a term occurs in
a document.
Terms can be either individual words or phrases
containing multiple words.
Since documents differ in length, it’s possible that a
term would appear more times in longer documents
than shorter ones.
Contd…
Thus, you can calculate term frequency by dividing the
number of times the term appears, by the total
number of terms in the document, as a way of
normalization.
Term Frequency = [Number of times the term appears
in the document] / [Total number of terms in the
document]
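The formula above translates directly into Python:

```python
def term_frequency(document, term):
    # Number of occurrences of the term divided by total terms in the document
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

doc = "the cat sat on the mat"
print(term_frequency(doc, "the"))  # 2 occurrences / 6 terms
```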
While Working with Python….
We will be using the NLTK (Natural Language Toolkit)
# import the necessary libraries
import nltk
import string
import re
To Remove Punctuation
To remove white space
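The code for these two steps is not shown on the slides; a sketch of both, using the `string` and `re` modules imported above, might look like:

```python
import re
import string

text = "  Hello,   world!!  "
# Remove punctuation
no_punct = text.translate(str.maketrans("", "", string.punctuation))
# Collapse runs of white space and trim the ends
clean = re.sub(r"\s+", " ", no_punct).strip()
print(clean)  # -> "Hello world"
```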
THANK YOU
Email:
[email protected]