UNIT-6 : NATURAL LANGUAGE PROCESSING (NLP)
Introduction to Natural Language Processing
● Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers
to understand and process human languages.
● It is concerned with the interactions between computers and human (natural) languages, in
particular how to program computers to process and analyze large amounts of natural language
data.
A usual interaction between machines and humans using Natural Language Processing could go as follows:
• Humans talk to the computer
• The computer captures the audio
• There is an audio to text conversion
• The text data is processed
• The processed data is converted back to audio
• The computer plays the audio file and responds to humans
Applications of Natural Language Processing
1. Chatbots
● Chatbots are a form of artificial intelligence that is programmed to
interact with humans in such a way that they sound like humans
themselves.
● Chatbots are created using Natural Language Processing and
Machine Learning, which means that they understand the
complexities of the English language and find the actual meaning of
the sentence and they also learn from their conversations with
humans and become better with time.
● Chatbots work in two simple steps. First, they identify the meaning
of the question asked and collect all the data from the user that may
be required to answer the question. Then they answer the question
appropriately.
2. Autocomplete in Search Engines
● Have you noticed that search engines tend to guess what you are typing and automatically complete your sentences?
● These suggestions are provided by autocomplete, which uses Natural Language Processing to guess what you want to ask.
● They use Natural Language Processing to make sense of these
words and how they are interconnected to form different
sentences.
3. Voice Assistants (Virtual Assistants)
● Voice assistants like Siri, Alexa and Google Assistant use Natural Language Processing to make calls, place reminders, schedule meetings, set alarms, surf the internet, etc.
● These voice assistants have made life much easier.
● They use a complex combination of speech recognition,
natural language understanding, and natural language
processing to understand what humans are saying and then act
on it.
4. Language Translator
● Google Translate is a great tool to convert text from one
language to another.
● Google Translate and other translation tools use sequence-to-sequence modelling, a technique in Natural Language Processing.
● It allows the algorithm to convert a sequence of words from one language to another, which is translation.
5. Grammar Checkers
● Grammar and spell checkers are very important tools when writing professional reports. They not only correct grammar and check spelling but also suggest better synonyms and improve the overall readability of your content.
● They utilize natural language processing to provide the best
possible piece of writing!
● The NLP algorithm is trained on millions of sentences to
understand the correct format. That is why it can suggest the
correct verb tense, a better synonym, or a clearer sentence
structure than what you have written.
● Some of the most popular grammar checkers that use NLP
include Grammarly, WhiteSmoke, ProWritingAid, etc.
6. Sentiment Analysis
● The goal of sentiment analysis is to identify sentiment
among several posts or even in the same post where
emotion is not always explicitly expressed.
● Companies use Natural Language Processing
applications, such as sentiment analysis, to identify
opinions and sentiment online to help them understand
what customers think about their products and services.
(To find out the emotions of their target audience, to
understand product reviews)
● Not just private companies; even governments use sentiment analysis to gauge popular opinion.
7. Automatic Summarization:
● Automatic summarization is relevant not only for
summarizing the meaning of documents and
information, but also to understand the emotional
meanings within the information, such as in collecting
data from social media.
● Automatic summarization is relevant when used to
provide an overview of a news item or blog post, while
avoiding redundancy from multiple sources and
maximizing the diversity of content obtained.
8. Email Classification and Filtering
● Email services use natural language processing to
identify the contents of each Email with text
classification so that it can be put in the correct
section.
● In more advanced cases, some companies also use
specialty anti-virus software with natural language
processing to scan the emails and see if there are any
patterns and phrases that may indicate a phishing
attempt on the employees.
9. Text classification
● Text classification makes it possible to assign
predefined categories to a document and organize it to
help you find the information you need or simplify
some activities.
● For example, an application of text categorization is
spam filtering in email.
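As an illustration, here is a minimal spam-filtering sketch in Python, assuming scikit-learn is installed; the tiny training set and its labels are made up purely for illustration, and a real filter would need far more data.

# Spam filtering with a bag-of-words model and Naive Bayes (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win a free prize now",                # spam
    "limited offer claim your reward",     # spam
    "meeting scheduled for monday",        # not spam
    "please review the attached report",   # not spam
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()             # turns text into word-count features
X_train = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()               # simple probabilistic text classifier
classifier.fit(X_train, train_labels)

# Classify a new, unseen email.
X_new = vectorizer.transform(["claim your free reward now"])
print(classifier.predict(X_new))           # expected: ['spam']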
Chatbots
One of the most common applications of Natural Language Processing is a chatbot. There are a lot of
chatbots available.
There are two types of chatbots:
● Simple Chatbot (Script bots)
● Smart Chatbots (AI based Smart bots)
Script Bot vs Smart-Bot:
● Script bots are easy to make; smart-bots are flexible and powerful.
● Script bots work around a script programmed into them; smart-bots work on bigger databases and other resources directly.
● Script bots are mostly free and easy to integrate into a messaging platform; smart-bots learn with more data.
● Script bots need little or no language processing skill; smart-bots require coding to be taken on board.
● Script bots have limited functionality; smart-bots have wide functionality.
● Examples of script bots: the bots deployed in the customer care section of various companies. Examples of smart-bots: Google Assistant, Alexa, Cortana, Siri, etc.
Human Language VS Computer Language
Human Language
● Our brain keeps on processing the sounds it hears around it and tries to make sense of them all the time.
● Example: In the classroom, as the teacher delivers the session, our brain continuously processes everything and stores it someplace.
● Also, while this is happening, if your friend whispers something, the focus of your brain automatically shifts from the teacher’s speech to your friend’s conversation. So now the brain is processing both sounds but prioritizing the one in which our interest lies.
● After processing the signal, the brain gains an understanding of its meaning. If it is clear, the signal gets stored. Otherwise, the listener asks the speaker for clarity. This is how humans process human languages.
Computer Language
● Computers understand the language of numbers. Everything that is sent to the machine has to be converted to numbers.
● Binary code is a system by which numbers, letters and other information are represented using only two symbols, or binary digits.
● To a computer, binary is a code of 1s and 0s arranged in ways that the computer can read, understand and act upon.
● The communications made by machines are very basic and simple.
Data Processing
Arrangement of the words and meaning
● There are rules in human language. There are nouns, verbs, adverbs, adjectives.
● A word can be a noun at one time and an adjective some other time. There are rules to provide
structure to a language.
● This is an issue related to the syntax of the language.
● Syntax refers to the grammatical structure of a sentence.
● When the structure is present, we can start interpreting the message.
Text Normalization technique used in NLP
Since human languages are complex, we first need to simplify them to make understanding possible.
● Text Normalization helps in cleaning up the textual data in such a way that it comes down to a
level where its complexity is lower than the actual data.
● It is a process to reduce the variations in text’s word forms to a common form
when the variation means the same thing. Text normalization simplifies the text for
further processing.
● We will be working on text from multiple documents; the whole textual data from all the documents together is known as the corpus.
Sentence Segmentation
● Under sentence segmentation, the whole corpus is divided into sentences.
● Each sentence is taken as a separate piece of data, so the whole corpus gets reduced to sentences.
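A minimal sentence segmentation sketch, assuming the NLTK library is installed and its tokenizer data has been downloaded; the example corpus is the one used later in this unit.

# Sentence segmentation with NLTK (assumed installed).
import nltk
nltk.download("punkt", quiet=True)   # tokenizer data (newer NLTK versions may also need "punkt_tab")

corpus = ("Aman and Anil are stressed. Aman went to a therapist. "
          "Anil went to download a health chatbot.")
sentences = nltk.sent_tokenize(corpus)
print(sentences)
# ['Aman and Anil are stressed.', 'Aman went to a therapist.', 'Anil went to download a health chatbot.']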
Tokenisation
● After segmenting the sentences, each sentence is then further divided into tokens.
● A token is any word, number or special character occurring in a sentence.
● Under tokenisation, every word, number and special character is considered separately and each
of them is now a separate token.
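A minimal tokenisation sketch, again assuming NLTK is installed:

# Tokenisation with NLTK's word tokenizer (assumed installed).
import nltk
nltk.download("punkt", quiet=True)   # tokenizer data (newer NLTK versions may also need "punkt_tab")

sentence = "Aman went to a therapist!"
tokens = nltk.word_tokenize(sentence)
print(tokens)   # ['Aman', 'went', 'to', 'a', 'therapist', '!']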
Removing Stopwords, Special Characters and Numbers
● Stopwords are the words which occur very frequently in the corpus but do not add any value to
it.
● Some examples of stopwords are: a, an, and, the, is, to, of, in, this, etc.
● These words occur the most in any given corpus but talk very little or nothing about the context or
the meaning of it.
● Hence, to make it easier for the computer to focus on meaningful terms, these words are
removed.
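A minimal stopword-removal sketch using NLTK's built-in English stopword list (assumed installed); any comparable stopword list would work.

# Stopword removal with NLTK's English stopword list (assumed installed).
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["aman", "and", "anil", "are", "stressed"]

# Keep only the tokens that are not in the stopword list.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['aman', 'anil', 'stressed']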
Converting text to a common case
● After stopword removal, we convert the whole text into the same case, preferably lower case.
● This ensures that the machine does not treat the same words as different just because they appear in different cases (for example, 'Healing' and 'healing').
Stemming
In this step, the remaining words are reduced to their root words.
In other words, stemming is the process in which the affixes of words are removed and the words are
converted to their base form.
Note :
● In stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful.
● For example, 'healed', 'healing' and 'healer' are all reduced to 'heal', but 'studies' is reduced to 'studi' after affix removal, which is not a meaningful word.
● Stemming does not take into account if the stemmed word is meaningful or not. It just removes
the affixes hence it is faster.
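A minimal stemming sketch using NLTK's Porter stemmer, one common choice; other stemmers may give slightly different outputs.

# Stemming with NLTK's Porter stemmer (assumed installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "studies"]:
    print(word, "->", stemmer.stem(word))
# healed  -> heal
# healing -> heal
# studies -> studi   (not a meaningful word, as noted above)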
Lemmatization
● Stemming and lemmatization are alternative processes to each other, as both play the same role: removal of affixes.
● But the difference between both of them is that in lemmatization, the word we get after affix
removal (also known as lemma) is a meaningful one.
● Lemmatization makes sure that the lemma is a meaningful word, and hence it takes longer to execute than stemming.
The difference between stemming and lemmatization can be summarized by this example: stemming reduces 'studies' to 'studi', whereas lemmatization reduces it to the meaningful lemma 'study'.
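A minimal comparison sketch, assuming NLTK and its WordNet data are available:

# Stemming vs lemmatization on the same word (NLTK assumed installed).
import nltk
nltk.download("wordnet", quiet=True)   # lemma dictionary (some NLTK versions may also need "omw-1.4")
from nltk.stem import PorterStemmer, WordNetLemmatizer

word = "studies"
print("stem :", PorterStemmer().stem(word))           # studi -> fast but not a real word
print("lemma:", WordNetLemmatizer().lemmatize(word))  # study -> slower but meaningful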
With this we have normalized our text to tokens which are the simplest form of words present in the
corpus. Now it is time to convert the tokens into numbers.
For this, we would use the Bag of Words algorithm.
Bag of Words (BOW)
● Bag of Words is a Natural Language Processing model which helps in extracting features out
of the text which can be helpful in machine learning algorithms.
● In a bag of words, we get the occurrences of each word and construct the vocabulary for the
corpus.
The bag of words algorithm works as follows. Assume that the normalized corpus, obtained after going through all the steps of text processing, is fed into the algorithm.
● The algorithm returns the unique words in the corpus and their occurrences.
● In other words, it produces a list of the words appearing in the corpus, and the number against each word shows how many times that word has occurred in the text body.
Thus, we can say that the bag of words gives us two things :
1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each word has occurred in the whole corpus).
Calling this algorithm a “bag” of words symbolizes that the sequence of sentences or tokens does not matter in this case; all we need are the unique words and their frequencies.
Here is the step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the word from
the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Here are three documents having one sentence each. After text normalisation, the text becomes:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopwords removal step. It is because we have very little
data and since the frequency of all the words is almost the same, no word can be said to have lesser
value than the other.
Step 2: Create Dictionary
Go through all the documents and create a dictionary, i.e., list all the unique words occurring across the three documents:
Dictionary: [aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot]
Note that even though some words are repeated in different documents, they are all written just once as
while creating the dictionary, we create the list of unique words.
Step 3: Create document vector
● In this step, the vocabulary is written in the top row. Now, for each word in the document, if it
matches with the vocabulary, put a 1 under it.
● If the same word appears again, increment the previous value by 1. And if the word does not
occur in that document, put a 0 under it.
Since the first document contains the words aman, and, anil, are and stressed, all these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents. The same exercise has to be done for all the documents. Hence, the table becomes:

         aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Doc 1:    1     1    1     1      1       0    0   0      0         0         0       0
Doc 2:    1     0    0     0      0       1    1   1      1         0         0       0
Doc 3:    0     0    1     0      0       1    1   1      0         1         1       1
In this table, the header row contains the vocabulary of the corpus and three rows correspond to three
different documents.
● Let's take a look at this table and analyze the positioning of 0s and 1s in it.
● Finally, this gives us the document vector table for our corpus. However, these raw counts still do not tell us how valuable each word is for the corpus.
● This leads us to the final step of our algorithm: TFIDF. Before that, the sketch below shows how the document vector table can be reproduced programmatically.
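A minimal sketch, assuming scikit-learn is installed; note that CountVectorizer sorts the vocabulary alphabetically and by default drops one-letter words such as "a", so the token pattern is widened here.

# Bag of words for the three example documents (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Widen the token pattern so one-letter words like "a" are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectors = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())   # vocabulary, sorted alphabetically
print(vectors.toarray())                    # one row of word counts per document
# ['a' 'aman' 'and' 'anil' 'are' 'chatbot' 'download' 'health' 'stressed' 'therapist' 'to' 'went']
# [[0 1 1 1 1 0 0 0 1 0 0 0]
#  [1 1 0 0 0 0 0 0 0 1 1 1]
#  [1 0 0 1 0 1 1 1 0 0 1 1]]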
TFIDF: Term Frequency & Inverse Document Frequency
● TFIDF helps in identifying the value for each word.
Term Frequency
Term frequency is the frequency of a word in one document. It can easily be read from the document vector table, as that table records the frequency of each word of the vocabulary in each document.
Here, you can see that the frequency of each word for each document has been recorded in the table.
These numbers are nothing but the Term Frequencies!
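A minimal term-frequency sketch using only the Python standard library:

# Term frequency: how often each word occurs in one document.
from collections import Counter

document = ["aman", "went", "to", "a", "therapist"]
term_frequency = Counter(document)
print(term_frequency["aman"])   # 1
print(term_frequency["anil"])   # 0 (the word does not occur in this document)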
Inverse Document Frequency
● Now, let us look at the other half of TFIDF, which is Inverse Document Frequency.
● For this, let us first understand what document frequency means.
Document Frequency is the number of documents in which the word occurs irrespective of how many
times it has occurred in those documents.
The document frequency for the example vocabulary would be:

aman: 2, and: 1, anil: 2, are: 1, stressed: 1, went: 2, to: 2, a: 2, therapist: 1, download: 1, health: 1, chatbot: 1

Here, you can see that the document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2, as they have occurred in two documents. The rest occur in just one document, hence their document frequency is 1.
Now, let's discuss inverse document frequency: we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:

aman: 3/2, and: 3/1, anil: 3/2, are: 3/1, stressed: 3/1, went: 3/2, to: 3/2, a: 3/2, therapist: 3/1, download: 3/1, health: 3/1, chatbot: 3/1
Finally, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log( IDF(W) )
Here, log is to the base of 10.
Now, let’s multiply the TF values by the IDF values.
Note that the TF values are specific to each document while the IDF values apply to the whole corpus.
Hence, we apply the IDF values to each row of the document vector table.
Here, you can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, we get:

         aman   and    anil   are    stressed  went   to     a      therapist  download  health  chatbot
Doc 1:   0.176  0.477  0.176  0.477  0.477     0      0      0      0          0         0       0
Doc 2:   0.176  0      0      0      0         0.176  0.176  0.176  0.477      0         0       0
Doc 3:   0      0      0.176  0      0         0.176  0.176  0.176  0          0.477     0.477   0.477
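The table above can be reproduced with a short script. This is a minimal sketch of the formula used in this unit, TFIDF(W) = TF(W) * log10(N / DF(W)), using only the Python standard library (library implementations such as scikit-learn's TfidfVectorizer use a slightly different formula).

# TFIDF by hand: TFIDF(W) = TF(W) * log10(total documents / document frequency of W).
import math
from collections import Counter

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

N = len(documents)
vocabulary = sorted({word for doc in documents for word in doc})

# Document frequency: number of documents in which each word occurs.
df = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}

for i, doc in enumerate(documents, start=1):
    tf = Counter(doc)
    tfidf = {w: round(tf[w] * math.log10(N / df[w]), 3) for w in vocabulary}
    print("Document", i, ":", tfidf)
# e.g. in Document 1: 'aman' -> 1 * log10(3/2) ≈ 0.176, 'and' -> 1 * log10(3/1) ≈ 0.477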
● Finally, the words have been converted to numbers. These numbers are the values of each word for each document.
● Here, you can see that since we have a small amount of data, words like ‘are’ and ‘and’ also have a high value.
● But as a word occurs in more and more documents, its value decreases.
For example:
Total Number of documents: 10
Number of documents in which ‘and’ occurs: 10
Therefore, IDF(and) = 10/10 = 1
Which means: log(1) = 0.
Hence, the value of ‘and’ becomes 0.
On the other hand, the number of documents in which ‘pollution’ occurs is 3.
IDF(pollution) = 10/3 = 3.3333…
Which means: log(3.3333) ≈ 0.523,
which shows that the word ‘pollution’ has considerable value in the corpus.
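The arithmetic above can be checked with Python's base-10 logarithm:

# Quick check of the worked example with Python's base-10 logarithm.
import math

print(math.log10(10 / 10))   # 0.0     -> 'and' carries no value
print(math.log10(10 / 3))    # ≈ 0.523 -> 'pollution' has considerable value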
Summarizing the concept, we can say that :
1. Words that occur in all the documents with high term frequencies have the least values and are
considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less
document frequency which shows that the word is important for one document but is not a common word
for all documents.
3. These values help the computer understand which words are to be considered while processing the natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain.
Some of its applications are:
● Document classification: it helps in classifying the type and genre of a document.
● Topic modelling: it helps in identifying the topics a document covers.
● Information retrieval systems: it helps in ranking documents against a search query.
● Stop-word filtering: it helps in removing unnecessary words from the text body.
Prepared by :
Hitesh Pujari