Introduction to NLP
What is NLP?
● Natural Language Processing (NLP) is a field of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human language.
● With the help of NLP, we can communicate with computers using a natural language
such as English.
Advantages of NLP
● Today NLP is booming thanks to advances in access to data and in computational power.
● This is helping practitioners achieve meaningful results in areas like
○ Healthcare
○ Media
○ Finance
Problems with Text Data
● Today, huge volumes of data are generated through conversations, declarations and tweets, and most of this data is unstructured.
● Unstructured data does not fit into a row-and-column structure, which makes it difficult to analyze and manipulate.
Why should we learn NLP?
● With the help of NLP, machines can detect figures of speech such as irony and even perform sentiment analysis.
● We cannot always have data in numeric form, so to deal with textual data we use NLP, which takes raw language as input and derives meaningful insights from it.
Applications of NLP
● NLP enables the recognition and prediction of diseases based on electronic health
records and a patient’s own speech.
● Organizations can determine what customers are saying about a product by identifying and extracting information.
○ This sentiment analysis can tell a lot about customers’ choices and their decision drivers.
Applications of NLP
● Big companies like Google filter and classify emails with NLP by analyzing text in
emails and stopping spam emails before they enter your inbox.
● Amazon’s Alexa and Apple’s Siri are examples of intelligent voice-driven interfaces that use NLP to respond to vocal prompts and carry out a wide range of everyday tasks.
Applications of NLP
● NLP is used in both the search and selection phases of talent recruitment by identifying the skills of potential hires.
● NLP also powers search autocorrect and autocomplete.
Steps to solve NLP problems
● Gather Data
○ Gather textual data from emails, posts or tweets.
● Clean Data
○ A clean dataset allows the model to learn meaningful features and not overfit on
irrelevant noise.
■ Remove all irrelevant characters.
■ Tokenize the text by splitting it into individual words.
Steps to solve NLP problems
● Clean Data
■ Convert all characters to lowercase, and map misspelled or alternatively spelled words to a single representation.
■ Reduce words such as “am”, “are” and “is” to a common form.
● Finding good representation
○ Convert the textual data into a numeric form which algorithms can understand and derive insights from.
Steps to solve NLP problems
● Classification
○ Split the data into training and testing data.
○ Fit a classification model on the training data and check how well the model generalizes to unseen data using the testing dataset.
● Inspection
○ Understand the errors made by the model using a confusion matrix, as in the sketch below.
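A minimal sketch of the classification and inspection steps using scikit-learn. The toy messages, labels and the choice of Logistic Regression are illustrative assumptions, not part of the original material.

# Represent texts as numbers, fit a model, and inspect its errors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

texts = ["free money now", "meeting at noon", "win a big prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # hypothetical labels: 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(texts)   # good representation (bag of words)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)                  # fit on the training data
predictions = model.predict(X_test)          # predict on unseen data

print(confusion_matrix(y_test, predictions)) # inspect the errors made by the model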
What is Text Processing?
● Text processing means the analysis, manipulation and generation of text.
● It is an automated process which analyzes data to obtain structured information.
● It includes extracting smaller pieces of information from text data and assigning tags depending on their context.
Techniques to analyze text data
● Statistical Methods
○ We use statistical methods such as frequency distributions and TF-IDF to process and analyze text.
● Text Classification
○ Text classification assigns text to predefined groups based on its content. Popular tasks include:- topic analysis, sentiment analysis, intent detection and language classification.
Techniques to analyze text data
● Text Extraction
○ Text extraction is a text processing technique that identifies and obtains valuable pieces of data present within the text.
○ This method helps us to detect and extract the relevant words or expressions
from text.
Popular libraries used for NLP
● spaCy
○ spaCy is an open-source library which excels at large-scale information extraction tasks.
○ Its major features are:-
■ Part-of-speech tagging and tokenization
■ Dependency parsing
■ Sentence segmentation
■ Named entity recognition
■ Methods for cleaning and normalizing text
Popular libraries used for NLP
● NLTK (Natural Language ToolKit)
○ Its goal is to make learning and working with computational linguistics easier by offering features such as classification, stemming, tagging, parsing, semantic reasoning and wrappers for industrial-strength NLP libraries.
● Gensim
○ It is a library for Topic Modelling and similarity retrieval.
○ It excels at two things: processing large volumes of text and information retrieval.
Popular libraries used for NLP
● TextBlob
○ TextBlob is used for processing text-based data and offers a simple API for common NLP tasks such as:-
■ Part-of-speech tagging
■ Sentiment Analysis
■ Classification
■ Tokenization
■ N-grams
■ Parsing and spelling correction
What is Feature Engineering?
● Feature engineering is the process of creating new
features from the existing ones and removing the
unimportant features.
● Feature engineering is an art and a skill.
● It requires us to have creativity and domain
knowledge.
Why do we need Feature Engineering?
● Good features present in the data influence the results of the predictive model in a very positive way.
Why is it necessary to Clean Data?
● Data cleaning is a very crucial step in NLP because without cleaning, the dataset is just a jumble of words which the computer cannot understand.
● Textual data is unstructured and noisy and can contain:-
■ Typos, bad grammar, slang, URLs
■ Stopwords, expressions, punctuation etc.
Steps to clean Textual Data
● Remove punctuations and numbers
● Perform tokenization
● Remove special and accented characters
● Remove Stopwords
● Perform Stemming and Lemmatization
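The steps above can be chained together. Below is a sketch using NLTK, with an invented example sentence; the exact regexes and the use of PorterStemmer are illustrative choices.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download required data (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")

text = "Check out https://example.com!! It's AMAZING... 100% worth it :)"

text = text.lower()                        # convert to lowercase
text = re.sub(r"https?://\S+", "", text)   # strip URLs
text = re.sub(r"[^a-z\s]", "", text)       # remove punctuation, numbers, special characters
tokens = word_tokenize(text)               # tokenization

stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]  # remove stopwords

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # stemming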
What is Tokenization?
● Tokenization is simple, yet it is the building block of Natural Language Processing.
● It is a way of separating a piece of text into smaller units called tokens.
● The tokens can be:-
■ Words, Characters, Sentences
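A quick illustration with NLTK (assuming its tokenizer data has been downloaded); the sample sentence is invented.

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. Tokenization splits text into smaller units."
print(word_tokenize(text))  # word tokens: ['NLP', 'is', 'fun', '.', ...]
print(sent_tokenize(text))  # sentence tokens: ['NLP is fun.', 'Tokenization splits ...']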
Stopwords
● Stopwords are words which do not add much value to a sentence.
● They are removed from the vocabulary to reduce noise as well as the dimension of the
feature set.
● Examples of English stopwords are:-
■ “the”, “and”, “myself”, “this”, “into”, “here” etc.
■ and many more words that carry little meaning on their own.
Stemming
● Stemming is a process of removing a part of a word, or reducing a word to its stem or
root word.
● Example:-
■ We have three words: “Ask”, “asking” and “asked”
■ Stemming converts all three words into the root word “ask”.
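The same example with NLTK’s PorterStemmer (one common stemmer among several):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ask", "asking", "asked"]:
    print(word, "->", stemmer.stem(word))  # all three reduce to "ask"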
Lemmatization
● Lemmatization reduces the word to its dictionary form. The root word in
lemmatization is called “lemma”.
● We have two words: “good” and “better”.
○ Lemmatization reduces both to the same root word “good”.
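The same example with NLTK’s WordNetLemmatizer; note that the part-of-speech hint "a" (adjective) is needed for “better” to map to “good”.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # dictionary data used by the lemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # good
print(lemmatizer.lemmatize("good", pos="a"))    # good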
Difference between Stemming & Lemmatization
● Algorithms used in the stemming process don’t know the meaning behind the words; they simply chop off word endings.
● Algorithms used in the lemmatization process refer to a dictionary to understand the meaning of a word before reducing it.
Let’s Take an Example
● We have three words:- “play”, “playing” and “player”.
○ According to a stemmer, all three words have the same root word “play”.
○ According to a lemmatizer, “play” and “playing” have the same root, while “player” is a word with a different meaning.
Feature Extraction for NLP
What is Feature Extraction?
● Feature extraction means extracting and producing feature representations that are appropriate for the NLP task.
● Features that can be extracted from the text are:-
■ Number of words, characters, stopwords etc.
■ Length of the text.
■ Number of punctuation marks.
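A sketch of computing such count-based features in plain Python; the sample sentence and the stopword source are illustrative (and assume NLTK’s stopwords data is downloaded).

import string
from nltk.corpus import stopwords

text = "NLP turns raw text into useful, structured insights!"
words = text.split()
stop_words = set(stopwords.words("english"))

features = {
    "num_words": len(words),
    "text_length": len(text),  # number of characters
    "num_stopwords": sum(w.lower() in stop_words for w in words),
    "num_punctuation": sum(ch in string.punctuation for ch in text),
}
print(features)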
Feature Extraction Techniques
● Major feature extraction techniques for NLP are:-
■ Bag of words representation
■ TF-IDF
■ N-gram analysis
Bag of Words
● Bag of words is a way of extracting features from text for use in modelling.
● The bag of words approach is very simple and flexible and can be used in a number of
ways for extracting features from the documents.
Bag of words representation
● Bag of words works in the following way:-
○ It records which distinct words occur in the text.
○ All the distinct words form the columns of a matrix, and the values are 0 or 1 based on the absence or presence of the word in the text.
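A binary bag-of-words sketch with scikit-learn, using invented toy documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
vectorizer = CountVectorizer(binary=True)   # 1/0 for presence/absence
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # distinct words = columns
print(matrix.toarray())                     # rows = documents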
Introduction to TF-IDF
● TF-IDF is short for Term Frequency-Inverse Document Frequency.
● It is designed to reflect how important a word is to a document in a collection or corpus.
● The TF-IDF value increases proportionally to the number of times a word appears in a document, and is offset by how many documents in the corpus contain the word.
TF-IDF Score
● The TF-IDF value is calculated by multiplying two metrics:-
■ Term frequency: how many times a word appears in a document.
■ Inverse document frequency of the word across the set of documents.
● Inverse Document Frequency
○ Measures how common (or rare) a word is across the entire document set:-
■ IDF = log(Total no. of documents / no. of documents containing the word)
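A sketch with scikit-learn’s TfidfVectorizer on invented documents; note that scikit-learn uses a smoothed variant of the IDF formula above, so the exact numbers differ slightly.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the match was a great match",
        "we will win the election",
        "the game was a close game"]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(scores.toarray().round(2))  # words rare across documents score higher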
Why TF-IDF?
● Information Retrieval
○ TF-IDF was invented for document search and is used to deliver results that are
most relevant to what we are searching for.
● Keyword extraction
○ TF-IDF is also useful for keyword extraction. The highest scoring words for a
document are the most relevant keywords.
N-grams
● An n-gram is a sequence of n words.
● For Example: I love reading books.
○ 1-gram or unigram will be:- “I”, “love”, “reading”, “books”.
○ 2-gram or bigram will be:- “I love”, “love reading”, “reading books”.
○ 3-gram or trigram will be:- “I love reading”, “love reading books”.
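Generating these n-grams with NLTK:

from nltk.util import ngrams

tokens = "I love reading books".split()
for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
# 1 ['I', 'love', 'reading', 'books']
# 2 ['I love', 'love reading', 'reading books']
# 3 ['I love reading', 'love reading books']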
Why N-grams?
● N-grams of texts are extensively used in text mining and NLP tasks such as auto-completion of sentences and automatic spell checking.
● Example:-
○ Using a 3-gram analysis, a bot will understand the difference between “What’s
the temperature” and “Set the temperature” which is not possible using 1-gram
or 2-grams.
What is Text Classification?
● Text classification, or text categorization, is the process of analyzing natural language text and then labelling it with a predefined set of labels or tags.
● Text classifiers have proven to be a great alternative for structuring textual data in a fast, cost-effective and scalable way.
● It allows us to easily get insights from data and automate business processes.
Examples
● Classifying emails as spam or not spam.
● Sentiment analysis:- Understanding if the text has positive, negative or neutral
sentiment.
● Language detection:- Detecting the language of a given text.
● Classifying content into categories to easily search and navigate within a website or
application.
Applications of Text Classification
● Tagging content or products using categories as a way to improve browsing or to
identify related content on the website.
● As marketing is becoming more targeted every day, automated classification of users into cohorts can make a marketer’s life simple.
● Text classification of content on a website helps Google crawl the website easily, which helps in SEO.
● Email providers use text classification to differentiate between legitimate and spam
mails.
Text Classification using ML
● Machine learning helps in text classification by learning to classify based on past observations.
● By using pre-labelled examples as training data, ML algorithms learn the associations between pieces of text and their labels.
Models for Text Classification
● Naive Bayes Family of Algorithms
● Support Vector Machines (SVM)
● Deep learning
Conditional Probability
● Conditional probability is the probability of an event occurring given that a previous event has occurred.
● We have a bag of 5 balls: 2 of one colour and 3 of another.
Example
● On the first draw, the probabilities of the two colours are 2/5 and 3/5.
● If a ball of the second colour is drawn and not replaced, the probabilities on the next draw become 2/4 and 2/4.
● The probability of one event thus depends on the probability of the previous event.
Bayes Theorem
● P(A|B) = P(A and B) / P(B)  =>  P(A and B) = P(A|B) P(B)
● P(B|A) = P(B and A) / P(A)  =>  P(B and A) = P(B|A) P(A)
● Since P(A and B) = P(B and A), equating the two expressions gives Bayes’ Theorem:
○ P(A|B) = [P(B|A) P(A)] / P(B)
Naive Bayes Classifier
● Naive Bayes is a family of probabilistic algorithms which use Bayes’ Theorem to predict the tag of a text.
● Being probabilistic means the algorithm calculates the probability of each tag for a given text and outputs the tag with the highest probability.
Example
● Suppose we track 4 major words across the categories “Sport” and “Not Sport”.
○ The 4 major words are: match, game, win and election.
Word       Sport count   P(word|Sport)   Not Sport count   P(word|Not Sport)
Match      6             6/15            1                 1/15
Game       5             5/15            2                 2/15
Win        3             3/15            5                 5/15
Election   1             1/15            7                 7/15
Example
● New message comes which has win election in it.
● The probability of a message in “sport” or “not sport” category is ½.
● P(sport/win election) = 0.006
● P(not sport/win election) = 0.07
● New message “win election” is in the “Not sport” category.
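The same idea in code, as a sketch with scikit-learn’s MultinomialNB; the training messages below are invented stand-ins for the word counts in the table above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["a great match", "an exciting game", "a close win for the team",
            "the election results", "candidates win the election", "election news"]
labels = ["sport", "sport", "sport", "not sport", "not sport", "not sport"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["win election"]))  # expected: ['not sport']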
Support Vector Machines(SVM)
● SVM is a powerful machine learning algorithm for text classification.
● SVM separates the two classes with a line (more generally, a hyperplane).
● The optimal line is the one with the largest margin between the classes.
● It works for both linearly and non-linearly separable data.
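A sketch of an SVM text classifier using scikit-learn’s LinearSVC on TF-IDF features; the toy texts and labels are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great match today", "our team will win", "what a game",
         "vote in the election", "the election campaign", "poll results are out"]
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["who won the game"]))  # expected: ['sport']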
More Things to Try!
● We can try some more classification models such as Logistic Regression and deep learning.
● Instead of the TF-IDF vectorizer, we can use bag of words to convert texts into numbers and check how the model accuracy changes.
○ With bag of words we again have two choices: a binary bag of words or a frequency bag of words.