Prof. V.V. Subrahmanyam
School of Computer and Information Sciences
Indira Gandhi National Open University (IGNOU)
New Delhi
Date: 22nd Aug, 2024  Time: 4:00 PM to 4:30 PM
Text Mining
Text mining, also known
as text data mining, is the
process of transforming
unstructured text into
a structured format to
identify meaningful
patterns and new insights.
Text Preprocessing - Introduction
Text data derived from natural language is
unstructured and noisy.
So text preprocessing is a critical step to
transform messy, unstructured text data
into a form that can be effectively used to
train machine learning models, leading to
better results and insights.
Text Preprocessing
Text preprocessing refers to a series of
techniques used to clean, transform and
prepare raw textual data into a format that
is suitable for natural language processing
(NLP) or Text Mining or Machine Learning
(ML) tasks.
Goal of Text Preprocessing
The goal of text preprocessing is to
enhance the quality and usability of
the text data for subsequent analysis
or modeling.
Common Text Preprocessing / Cleaning Steps
Lower Casing
Removal of Punctuations
Removal of Stopwords
Removal of Frequent words
Removal of Rare words
Stemming
Lemmatization
Removal of emojis
Removal of emoticons
Conversion of emoticons to words
Conversion of emojis to words
Removal of URLs
Removal of HTML tags
Chat words conversion
Spelling correction
Lower Casing
Lower casing is a common text preprocessing
technique. The idea is to convert the input text
into the same casing format so that, for example, 'text',
'Text' and 'TEXT' are treated the same way.
This is especially helpful for text featurization
techniques like frequency counts and TF-IDF, as it
combines identical words, thereby reducing
duplication and giving correct counts / TF-IDF values.
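As a minimal sketch in Python, lower casing is a one-line operation with `str.lower()`:

```python
# Convert all text to lower case so 'Text', 'TEXT' and 'text' match
texts = ["Text mining", "TEXT Mining", "text MINING"]
lowered = [t.lower() for t in texts]
print(lowered)  # all three become "text mining"
```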
Removal of Punctuations
This is again a text standardization process
that will help to treat 'hurray' and 'hurray!'
in the same way.
We also need to carefully choose the list of
punctuations to exclude depending on the
use case.
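A simple sketch using Python's `string.punctuation` with `str.translate`; the `keep` parameter is an illustrative way to exclude chosen punctuation marks from removal, as the use case demands:

```python
import string

def remove_punctuation(text, keep=""):
    # Drop every punctuation character except those listed in `keep`
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_remove))

print(remove_punctuation("hurray!"))                # -> hurray
print(remove_punctuation("well-known!", keep="-"))  # -> well-known
```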
Removal of Stopwords
Stopwords are commonly occurring words in a language
like 'the', 'a' and so on.
They can be removed from the text most of the time,
as they don't provide valuable information for
downstream analysis.
In cases like Part of Speech (POS) tagging, we should
not remove them, as they provide very valuable
information about the POS.
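A minimal sketch with a hand-made stopword set (illustrative only; NLTK ships a much fuller stopword list per language):

```python
# A tiny illustrative stopword set; NLTK's stopwords corpus provides a full list
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and"}

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(remove_stopwords("The cat is in the garden"))  # -> "cat garden"
```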
Removal of Frequent Words
In the previous preprocessing step, we removed
stopwords based on language information. But
if we have a domain-specific corpus, we
might also have some frequent words that are of
little importance to us.
So this step is to remove the frequent words in the
given corpus. If we use something like tfidf, this is
automatically taken care of.
Some frequent words from a domain-specific
corpus:
I, us, DM, Help, We, Hi,
Please, Get, Thanks, etc.
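A sketch of corpus-specific frequent-word removal using `collections.Counter`; the toy corpus and the cut-off of one most-common word are illustrative choices:

```python
from collections import Counter

corpus = ["please send dm please", "hi please help", "thanks please"]
counts = Counter(w for doc in corpus for w in doc.split())
# Treat the N most common words in this corpus as "frequent" (N is a choice)
frequent = {w for w, _ in counts.most_common(1)}
cleaned = [" ".join(w for w in doc.split() if w not in frequent) for doc in corpus]
print(frequent, cleaned)  # {'please'} is the most frequent word here
```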
Removal of Rare Words
This is very similar to the previous
preprocessing step, but here we remove the
rare words from the corpus.
We can combine all the list of words
(stopwords, frequent words and rare
words) and create a single list to remove
them at once.
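A sketch that finds rare words (here, words occurring only once, an illustrative threshold) and merges them with a stopword set into a single removal list, as described above:

```python
from collections import Counter

corpus = "the cat sat the cat ran a dog barked"
counts = Counter(corpus.split())
rare = {w for w, c in counts.items() if c == 1}   # words occurring only once
stop_and_rare = {"the", "a"} | rare               # one combined removal set
cleaned = " ".join(w for w in corpus.split() if w not in stop_and_rare)
print(cleaned)  # -> "cat cat"
```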
Stemming
Stemming is the process of reducing inflected or derived
words to their word stem, base or root form.
For example, if there are two words in the
corpus, walks and walking, then stemming will strip
the suffix to make them walk.
But say, in another example, we have the two
words console and consoling; the stemmer will remove
the suffix and make them consol, which isn't a proper
English word.
Contd…
There are several types of stemming algorithms
available, and one of the most famous is the Porter
stemmer, which is widely used.
The Porter stemmer is for the English language. If we
are working with other languages, we can use the
Snowball stemmer.
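Assuming NLTK is installed, the Porter and Snowball stemmers can be used like this:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
# Inflected forms reduce to a common stem; note 'consol' is not a real word
print([porter.stem(w) for w in ["walks", "walking", "console", "consoling"]])

# Snowball stemmers cover several other languages, e.g. German
german = SnowballStemmer("german")
print(german.stem("laufen"))
```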
Stemming Example
We can see that words like private and propose have
their e at the end chopped off due to stemming. This is
not intended.
What can we do for that? We can use Lemmatization
in such cases.
Lemmatization
Lemmatization is similar to stemming in reducing
inflected words to their word stem, but differs in
that it makes sure the root word (also called the
lemma) belongs to the language.
Examples: Propose, Private
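A toy sketch using a hand-made lemma table, just to show that lemmatization returns real words; in practice NLTK's WordNetLemmatizer (after downloading the WordNet corpus) performs this kind of lookup:

```python
# Minimal hand-made lemma table, for illustration only
LEMMAS = {"proposing": "propose", "walks": "walk", "better": "good"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["proposing", "better", "private"]])
# every output is a proper English word, unlike stemming's 'consol'
```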
Illustration of Lemmatization and Stemming
Removal of Emojis
With more and more usage of social media
platforms, there has been an explosion in the use
of emojis in our day-to-day lives as well.
We might need to remove these emojis for
some of our textual analyses.
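A regex-based sketch; the Unicode ranges shown are an illustrative subset, not a complete emoji inventory:

```python
import re

# Common emoji code-point ranges (an illustrative subset, not exhaustive)
EMOJI_PATTERN = re.compile(
    "[\U0001F600-\U0001F64F"   # emoticon faces
    "\U0001F300-\U0001F5FF"    # symbols & pictographs
    "\U0001F680-\U0001F6FF"    # transport & map symbols
    "\u2600-\u26FF]+"          # miscellaneous symbols
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("Great job \U0001F600!"))  # -> "Great job !"
```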
Removal of Emoticons
There is a minor difference between emojis and
emoticons.
An emoticon is built from keyboard characters that,
when put together in a certain way, represent a facial
expression; an emoji is an actual image.
:-) is an emoticon
😀 is an emoji
Conversion of Emoticon to Words
In the previous step, we removed the emoticons.
But in use cases like sentiment analysis, the
emoticons carry valuable information, so
removing them might not be a good solution. What
can we do in such cases?
One way is to convert the emoticons to word format so
that they can be used in downstream modeling
processes.
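One possible sketch, using a small hand-made emoticon dictionary (both the entries and the mapped words are illustrative):

```python
# A tiny illustrative emoticon dictionary (real lists are much larger)
EMOTICONS = {":-)": "happy_face", ":-(": "sad_face", ":-D": "big_grin"}

def convert_emoticons(text):
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, meaning)
    return text

print(convert_emoticons("great movie :-)"))  # -> "great movie happy_face"
```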
Conversion of Emoji to Words
Now let us do the same for Emojis as well.
We may make use of a dictionary to convert the emojis
to corresponding words.
Again this conversion might be better than emoji
removal for certain use cases. Please use the one that is
suitable for the use case.
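A sketch with a tiny assumed emoji-to-word dictionary; a real mapping would cover far more code points:

```python
# A tiny illustrative emoji-to-word dictionary (assumed mapping)
EMOJI_WORDS = {"\U0001F600": "grinning_face", "\U0001F622": "crying_face"}

def convert_emojis(text):
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, word)
    return text

print(convert_emojis("so fun \U0001F600"))  # -> "so fun grinning_face"
```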
Removal of URLs
Next preprocessing step is to remove any URLs present
in the data.
For example, if we are analyzing X (Twitter) data,
there is a good chance that a tweet will have
some URL in it. We might need to remove these
URLs for our further analysis.
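A sketch using a simple regular expression; it catches common http(s) and www-style links, not every possible URL form:

```python
import re

def remove_urls(text):
    # Strip http(s):// and www. style links (a simple pattern, not RFC-complete)
    return re.sub(r"https?://\S+|www\.\S+", "", text)

print(remove_urls("check this out https://example.com/post now"))
```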
Removal of HTML Tags
Another common preprocessing technique that
comes in handy in many places is the removal of
HTML tags.
This is especially useful if we scrape data from
different websites; we might end up having HTML
strings as part of our text.
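A minimal regex-based sketch; for messy real-world markup, a proper HTML parser is more robust:

```python
import re

def remove_html_tags(text):
    # Strip anything between angle brackets; fine for simple, well-formed HTML
    return re.sub(r"<[^>]+>", "", text)

print(remove_html_tags("<p>Hello <b>world</b></p>"))  # -> "Hello world"
```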
Chat Words Conversion
This is an important text preprocessing step if we are
dealing with chat data.
People do use a lot of abbreviated words in chat and
so it might be helpful to expand those words for our
analysis purposes.
Examples
AFAIK=As Far As I Know
AFK=Away From Keyboard
ASAP=As Soon As Possible
ATK=At The Keyboard
ATM=At The Moment
A3=Anytime, Anywhere, Anyplace
BAK=Back At Keyboard
BBL=Be Back Later
BBS=Be Back Soon
BFN=Bye For Now
B4N=Bye For Now
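A sketch that expands chat words using a dictionary built from the examples above (only a few entries shown):

```python
# Chat-word dictionary drawn from the examples above (abridged)
CHAT_WORDS = {
    "AFAIK": "As Far As I Know",
    "ASAP": "As Soon As Possible",
    "BBL": "Be Back Later",
}

def expand_chat_words(text):
    # Replace each known abbreviation; leave other words untouched
    return " ".join(CHAT_WORDS.get(w.upper(), w) for w in text.split())

print(expand_chat_words("reply ASAP please"))
```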
Spelling Correction
Another important text preprocessing step is
spelling correction.
Typos are common in text data and we might want to
correct those spelling mistakes before we do our
analysis.
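A toy sketch using fuzzy matching from Python's standard `difflib` module against an assumed vocabulary; dedicated libraries (e.g. TextBlob) offer proper spelling correction:

```python
import difflib

# Toy vocabulary; a real system would use a full dictionary
VOCAB = ["spelling", "correction", "analysis", "common"]

def correct(word):
    # Return the closest vocabulary word, or the word itself if none is close
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct("speling"))  # -> "spelling"
print(correct("python"))   # no close match, left as-is
```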
Tokenization
Tokenization is the process of breaking up
text into separate tokens, which can be
individual words, phrases, or whole
sentences.
In some cases, punctuation and special
characters (symbols like %, &, $) are
discarded in the process.
Contd…
A few common operations that require tokenization
include:
Finding how many words or sentences appear in text
Determining how many times a specific word or
phrase exists
Accounting for which terms are likely to co-occur
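A regex-based sketch of word and sentence tokenization that supports the counting operations listed above; NLTK's tokenizers handle punctuation and edge cases more carefully:

```python
import re

text = "Text mining is fun. Text mining finds patterns."
# Word tokens: runs of letters (punctuation is discarded)
words = re.findall(r"[A-Za-z]+", text)
# Sentence tokens: split on end-of-sentence punctuation
sentences = [s for s in re.split(r"[.!?]\s*", text) if s]
print(len(words), len(sentences), words.count("mining"))  # 8 words, 2 sentences
```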
Parts of Speech (POS) Tagging
This is one of the more advanced text preprocessing
techniques.
This step augments the input text with additional
information about the sentence’s grammatical structure.
Each word is, therefore, assigned to one of the predefined
categories, such as noun, verb, adjective, etc.
This step is also sometimes referred to as grammatical
tagging.
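A toy lookup-based tagger, purely for illustration (the tag set and dictionary are assumptions); in practice NLTK's `pos_tag`, backed by a trained model, assigns these categories:

```python
# Hand-made word-to-tag table; real taggers use context, not just lookup
TAGS = {"the": "DET", "dog": "NOUN", "runs": "VERB", "fast": "ADV"}

def tag(sentence):
    # Tag each word, marking unknown words as UNK
    return [(w, TAGS.get(w.lower(), "UNK")) for w in sentence.split()]

print(tag("The dog runs fast"))
```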
Term Frequency
Term frequency tells you how often a term occurs in
a document.
Terms can be either individual words or phrases
containing multiple words.
Since documents differ in length, it’s possible that a
term would appear more times in longer documents
than shorter ones.
Contd…
Thus, you can calculate term frequency by dividing the
number of times the term appears, by the total
number of terms in the document, as a way of
normalization.
Term Frequency = [Number of times the term appears
in the document] / [Total number of terms in the
document]
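The formula above translates directly into Python:

```python
def term_frequency(document, term):
    # Number of occurrences of the term divided by total terms in the document
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

doc = "the cat sat on the mat"
print(term_frequency(doc, "the"))  # 2 occurrences / 6 terms
```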
While Working with Python….
We will be using the NLTK (Natural Language Toolkit)
# import the necessary libraries
import nltk
import string
import re
To Remove Punctuation
To remove white space
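The code for these two steps is not shown on the slides; a sketch of both, using the `string` and `re` modules imported above, might look like:

```python
import re
import string

text = "  Hello,   world!!  "
# Remove punctuation
no_punct = text.translate(str.maketrans("", "", string.punctuation))
# Collapse runs of white space and trim the ends
clean = re.sub(r"\s+", " ", no_punct).strip()
print(clean)  # -> "Hello world"
```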
THANK YOU
Email:
[email protected]