UNIT-6 : NATURAL LANGUAGE PROCESSING (NLP)
Introduction to Natural Language Processing
● Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers
to understand and process human languages.
● It is concerned with the interactions between computers and human (natural) languages, in
particular how to program computers to process and analyze large amounts of natural language
data.
A usual interaction between machines and humans using Natural Language Processing could go as follows:
• Humans talk to the computer
• The computer captures the audio
• There is an audio to text conversion
• The text data is processed
• The processed data is converted back to audio
• The computer plays the audio file and responds to humans
Applications of Natural Language Processing
1. Chatbots
● Chatbots are a form of artificial intelligence that is programmed to
interact with humans in such a way that they sound like humans
themselves.
● Chatbots are created using Natural Language Processing and
Machine Learning, which means that they understand the
complexities of the English language and find the actual meaning of
the sentence and they also learn from their conversations with
humans and become better with time.
● Chatbots work in two simple steps. First, they identify the meaning
of the question asked and collect all the data from the user that may
be required to answer the question. Then they answer the question
appropriately.
2. Autocomplete in Search Engines
● Have you noticed that search engines tend to guess what you are typing and automatically complete your sentences?
● These suggestions are provided by autocomplete, which uses Natural Language Processing to guess what you want to ask.
● They use Natural Language Processing to make sense of these
words and how they are interconnected to form different
sentences.
3. Voice Assistants (Virtual Assistants)
● Voice assistants like Siri, Alexa and Google Assistant use Natural Language Processing to make calls, place reminders, schedule meetings, set alarms, surf the internet, etc.
● These voice assistants have made life much easier.
● They use a complex combination of speech recognition,
natural language understanding, and natural language
processing to understand what humans are saying and then act
on it.
4. Language Translator
● Google Translate is a great tool to convert text from one
language to another.
● Google Translate and other translation tools use sequence-to-sequence modelling, a technique in Natural Language Processing.
● It allows the algorithm to convert a sequence of words from one language to another, which is translation.
5. Grammar Checkers
● Grammar and spell checkers are very important tools when writing professional reports. They not only correct grammar and check spelling but also suggest better synonyms and improve the overall readability of your content.
● They utilize natural language processing to provide the best
possible piece of writing!
● The NLP algorithm is trained on millions of sentences to
understand the correct format. That is why it can suggest the
correct verb tense, a better synonym, or a clearer sentence
structure than what you have written.
● Some of the most popular grammar checkers that use NLP
include Grammarly, WhiteSmoke, ProWritingAid, etc.
6. Sentiment Analysis
● The goal of sentiment analysis is to identify sentiment
among several posts or even in the same post where
emotion is not always explicitly expressed.
● Companies use Natural Language Processing
applications, such as sentiment analysis, to identify
opinions and sentiment online to help them understand
what customers think about their products and services.
(To find out the emotions of their target audience, to
understand product reviews)
● Not just private companies; even governments use sentiment analysis to gauge popular opinion.
7. Automatic Summarization:
● Automatic summarization is relevant not only for
summarizing the meaning of documents and
information, but also to understand the emotional
meanings within the information, such as in collecting
data from social media.
● Automatic summarization is relevant when used to
provide an overview of a news item or blog post, while
avoiding redundancy from multiple sources and
maximizing the diversity of content obtained.
8. Email Classification and Filtering
● Email services use natural language processing to
identify the contents of each Email with text
classification so that it can be put in the correct
section.
● In more advanced cases, some companies also use
specialty anti-virus software with natural language
processing to scan the emails and see if there are any
patterns and phrases that may indicate a phishing
attempt on the employees.
9. Text classification
● Text classification makes it possible to assign
predefined categories to a document and organize it to
help you find the information you need or simplify
some activities.
● For example, an application of text categorization is
spam filtering in email.
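As an illustration, here is a minimal spam-filtering sketch in Python, assuming scikit-learn is installed; the tiny training set and its labels are made up purely for illustration, and a real filter would need far more data.

# Spam filtering with a bag-of-words model and Naive Bayes (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "win a free prize now",                # spam
    "limited offer claim your reward",     # spam
    "meeting scheduled for monday",        # not spam
    "please review the attached report",   # not spam
]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()             # turns text into word-count features
X_train = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()               # simple probabilistic text classifier
classifier.fit(X_train, train_labels)

# Classify a new, unseen email.
X_new = vectorizer.transform(["claim your free reward now"])
print(classifier.predict(X_new))           # expected: ['spam']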
Chatbots
One of the most common applications of Natural Language Processing is a chatbot. There are a lot of
chatbots available.
There are two types of chatbots:
● Simple Chatbot (Script bots)
● Smart Chatbots (AI based Smart bots)
Script Bot vs Smart-Bot:
● Script bots are easy to make; smart-bots are flexible and powerful.
● Script bots work around a script programmed into them; smart-bots work on bigger databases and other resources directly.
● Script bots are mostly free and easy to integrate into a messaging platform; smart-bots learn with more data.
● Script bots need little or no language processing skill; smart-bots require coding to be taken on board.
● Script bots have limited functionality; smart-bots have wide functionality.
● Examples of script bots: the bots deployed in the customer care section of various companies. Examples of smart-bots: Google Assistant, Alexa, Cortana, Siri, etc.
Human Language VS Computer Language
Human Language
● Our brain keeps on processing the sounds it hears around it and tries to make sense of them all the time.
● Example: In the classroom, as the teacher delivers the session, our brain continuously processes everything and stores it someplace.
● Also, while this is happening, if your friend whispers something, the focus of your brain automatically shifts from the teacher’s speech to your friend’s conversation. So now the brain is processing both sounds but prioritizing the one in which our interest lies.
● After processing the signal, the brain gains an understanding of its meaning. If it is clear, the signal gets stored. Otherwise, the listener asks the speaker for clarity. This is how humans process human languages.
Computer Language
● Computers understand the language of numbers. Everything that is sent to the machine has to be converted to numbers.
● Binary code is a system by which numbers, letters and other information are represented using only two symbols, or binary digits.
● To a computer, binary is a code of 1s and 0s arranged in ways that the computer can read, understand and act upon.
● The communications made by machines are very basic and simple.
Data Processing
Arrangement of the words and meaning
● There are rules in human language. There are nouns, verbs, adverbs, adjectives.
● A word can be a noun at one time and an adjective some other time. There are rules to provide
structure to a language.
● This is an issue related to the syntax of the language.
● Syntax refers to the grammatical structure of a sentence.
● When the structure is present, we can start interpreting the message.
Text Normalization technique used in NLP
Since human languages are complex, we first need to simplify them to make understanding possible.
● Text Normalization helps in cleaning up the textual data in such a way that it comes down to a
level where its complexity is lower than the actual data.
● It is a process to reduce the variations in text’s word forms to a common form
when the variation means the same thing. Text normalization simplifies the text for
further processing.
● We will be working on text from multiple documents; the whole textual data from all the documents together is known as the corpus.
Sentence Segmentation
● Under sentence segmentation, the whole corpus is divided into sentences.
● Each sentence is taken as a separate piece of data, so the whole corpus gets reduced to sentences.
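A minimal sentence segmentation sketch, assuming the NLTK library is installed and its tokenizer data has been downloaded; the example corpus is the one used later in this unit.

# Sentence segmentation with NLTK (assumed installed).
import nltk
nltk.download("punkt", quiet=True)   # tokenizer data (newer NLTK versions may also need "punkt_tab")

corpus = ("Aman and Anil are stressed. Aman went to a therapist. "
          "Anil went to download a health chatbot.")
sentences = nltk.sent_tokenize(corpus)
print(sentences)
# ['Aman and Anil are stressed.', 'Aman went to a therapist.', 'Anil went to download a health chatbot.']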
Tokenisation
● After segmenting the sentences, each sentence is then further divided into tokens.
● A token is any word, number or special character occurring in a sentence.
● Under tokenisation, every word, number and special character is considered separately and each
of them is now a separate token.
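A minimal tokenisation sketch, again assuming NLTK is installed:

# Tokenisation with NLTK's word tokenizer (assumed installed).
import nltk
nltk.download("punkt", quiet=True)   # tokenizer data (newer NLTK versions may also need "punkt_tab")

sentence = "Aman went to a therapist!"
tokens = nltk.word_tokenize(sentence)
print(tokens)   # ['Aman', 'went', 'to', 'a', 'therapist', '!']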
Removing Stopwords, Special Characters and Numbers
● Stopwords are the words which occur very frequently in the corpus but do not add any value to
it.
● Some examples of stopwords are: a, an, and, the, is, to, of, in, this, etc.
● These words occur the most in any given corpus but talk very little or nothing about the context or
the meaning of it.
● Hence, to make it easier for the computer to focus on meaningful terms, these words are
removed.
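A minimal stopword-removal sketch using NLTK's built-in English stopword list (assumed installed); any comparable stopword list would work.

# Stopword removal with NLTK's English stopword list (assumed installed).
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["aman", "and", "anil", "are", "stressed"]

# Keep only the tokens that are not in the stopword list.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)   # ['aman', 'anil', 'stressed']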
Converting text to a common case
● After stopword removal, we convert the whole text into the same case, preferably lower case.
● This ensures that the machine does not treat the same words as different just because they appear in different cases (for example, 'Healing' and 'healing').
Stemming
In this step, the remaining words are reduced to their root words.
In other words, stemming is the process in which the affixes of words are removed and the words are
converted to their base form.
Note :
● In stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful.
● For example, 'healed', 'healing' and 'healer' are all reduced to 'heal', but 'studies' is reduced to 'studi' after affix removal, which is not a meaningful word.
● Stemming does not take into account if the stemmed word is meaningful or not. It just removes
the affixes hence it is faster.
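A minimal stemming sketch using NLTK's Porter stemmer, one common choice; other stemmers may give slightly different outputs.

# Stemming with NLTK's Porter stemmer (assumed installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["healed", "healing", "studies"]:
    print(word, "->", stemmer.stem(word))
# healed  -> heal
# healing -> heal
# studies -> studi   (not a meaningful word, as noted above)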
Lemmatization
● Stemming and lemmatization are alternative processes to each other, as both play the same role: removal of affixes.
● But the difference between both of them is that in lemmatization, the word we get after affix
removal (also known as lemma) is a meaningful one.
● Lemmatization makes sure that the lemma is a meaningful word, and hence it takes longer to execute than stemming.
The difference between stemming and lemmatization can be summarized by this example: stemming reduces 'studies' to 'studi', whereas lemmatization reduces it to the meaningful lemma 'study'.
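A minimal comparison sketch, assuming NLTK and its WordNet data are available:

# Stemming vs lemmatization on the same word (NLTK assumed installed).
import nltk
nltk.download("wordnet", quiet=True)   # lemma dictionary (some NLTK versions may also need "omw-1.4")
from nltk.stem import PorterStemmer, WordNetLemmatizer

word = "studies"
print("stem :", PorterStemmer().stem(word))           # studi -> fast but not a real word
print("lemma:", WordNetLemmatizer().lemmatize(word))  # study -> slower but meaningful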
With this we have normalized our text to tokens which are the simplest form of words present in the
corpus. Now it is time to convert the tokens into numbers.
For this, we would use the Bag of Words algorithm.
Bag of Words (BOW)
● Bag of Words is a Natural Language Processing model which helps in extracting features out
of the text which can be helpful in machine learning algorithms.
● In a bag of words, we get the occurrences of each word and construct the vocabulary for the
corpus.
The bag of words algorithm works as follows. Assume that the normalized corpus, obtained after going through all the steps of text processing, is fed into the algorithm.
● The algorithm returns the unique words in the corpus and their occurrences.
● In other words, it produces a list of the words appearing in the corpus, and the number against each word shows how many times that word has occurred in the text body.
Thus, we can say that the bag of words gives us two things :
1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each word has occurred in the whole corpus).
Calling this algorithm a “bag” of words symbolizes that the sequence of sentences or tokens does not matter in this case; all we need are the unique words and their frequencies.
Here is the step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the word from
the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:
Step 1: Collecting data and pre-processing it.
Document 1: Aman and Anil are stressed
Document 2: Aman went to a therapist
Document 3: Anil went to download a health chatbot
Here are three documents having one sentence each. After text normalisation, the text becomes:
Document 1: [aman, and, anil, are, stressed]
Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]
Note that no tokens have been removed in the stopwords removal step. It is because we have very little
data and since the frequency of all the words is almost the same, no word can be said to have lesser
value than the other.
Step 2: Create Dictionary
Go through all the documents and create a dictionary, i.e., list all the unique words occurring across the three documents:
Dictionary: [aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot]
Note that even though some words are repeated in different documents, they are all written just once as
while creating the dictionary, we create the list of unique words.
Step 3: Create document vector
● In this step, the vocabulary is written in the top row. Now, for each word in the document, if it
matches with the vocabulary, put a 1 under it.
● If the same word appears again, increment the previous value by 1. And if the word does not
occur in that document, put a 0 under it.
Since the first document contains the words aman, and, anil, are and stressed, all these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents. The same exercise has to be done for all the documents. Hence, the table becomes:

         aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Doc 1:    1     1    1     1      1       0    0   0      0         0         0       0
Doc 2:    1     0    0     0      0       1    1   1      1         0         0       0
Doc 3:    0     0    1     0      0       1    1   1      0         1         1       1
In this table, the header row contains the vocabulary of the corpus and three rows correspond to three
different documents.
● Let's take a look at this table and analyze the positioning of 0s and 1s in it.
● Finally, this gives us the document vector table for our corpus. However, these raw counts still do not tell us how valuable each word is for the corpus.
● This leads us to the final step of our algorithm: TFIDF. Before that, the sketch below shows how the document vector table can be reproduced programmatically.
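A minimal sketch, assuming scikit-learn is installed; note that CountVectorizer sorts the vocabulary alphabetically and by default drops one-letter words such as "a", so the token pattern is widened here.

# Bag of words for the three example documents (scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Widen the token pattern so one-letter words like "a" are kept.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectors = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())   # vocabulary, sorted alphabetically
print(vectors.toarray())                    # one row of word counts per document
# ['a' 'aman' 'and' 'anil' 'are' 'chatbot' 'download' 'health' 'stressed' 'therapist' 'to' 'went']
# [[0 1 1 1 1 0 0 0 1 0 0 0]
#  [1 1 0 0 0 0 0 0 0 1 1 1]
#  [1 0 0 1 0 1 1 1 0 0 1 1]]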
TFIDF: Term Frequency & Inverse Document Frequency
● TFIDF helps in identifying the value for each word.
Term Frequency
Term frequency is the frequency of a word in one document. It can easily be read from the document vector table, as that table records the frequency of each word of the vocabulary in each document.
Here, you can see that the frequency of each word for each document has been recorded in the table.
These numbers are nothing but the Term Frequencies!
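A minimal term-frequency sketch using only the Python standard library:

# Term frequency: how often each word occurs in one document.
from collections import Counter

document = ["aman", "went", "to", "a", "therapist"]
term_frequency = Counter(document)
print(term_frequency["aman"])   # 1
print(term_frequency["anil"])   # 0 (the word does not occur in this document)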
Inverse Document Frequency
● Now, let us look at the other half of TFIDF, which is Inverse Document Frequency.
● For this, let us first understand what document frequency means.
Document Frequency is the number of documents in which the word occurs irrespective of how many
times it has occurred in those documents.
The document frequency for the example vocabulary would be:

aman: 2, and: 1, anil: 2, are: 1, stressed: 1, went: 2, to: 2, a: 2, therapist: 1, download: 1, health: 1, chatbot: 1

Here, you can see that the document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2, as they have occurred in two documents. The rest occur in just one document, hence their document frequency is 1.
Now, let's discuss inverse document frequency: we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:

aman: 3/2, and: 3/1, anil: 3/2, are: 3/1, stressed: 3/1, went: 3/2, to: 3/2, a: 3/2, therapist: 3/1, download: 3/1, health: 3/1, chatbot: 3/1
Finally, the formula of TFIDF for any word W becomes:
TFIDF(W) = TF(W) * log( IDF(W) )
Here, log is to the base of 10.
Now, let’s multiply the TF values by the IDF values.
Note that the TF values are specific to each document while the IDF values apply to the whole corpus.
Hence, we apply the IDF values to each row of the document vector table.
Here, you can see that the IDF value for ‘aman’ is the same in each row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, we get:

         aman   and    anil   are    stressed  went   to     a      therapist  download  health  chatbot
Doc 1:   0.176  0.477  0.176  0.477  0.477     0      0      0      0          0         0       0
Doc 2:   0.176  0      0      0      0         0.176  0.176  0.176  0.477      0         0       0
Doc 3:   0      0      0.176  0      0         0.176  0.176  0.176  0          0.477     0.477   0.477
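The table above can be reproduced with a short script. This is a minimal sketch of the formula used in this unit, TFIDF(W) = TF(W) * log10(N / DF(W)), using only the Python standard library (library implementations such as scikit-learn's TfidfVectorizer use a slightly different formula).

# TFIDF by hand: TFIDF(W) = TF(W) * log10(total documents / document frequency of W).
import math
from collections import Counter

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

N = len(documents)
vocabulary = sorted({word for doc in documents for word in doc})

# Document frequency: number of documents in which each word occurs.
df = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}

for i, doc in enumerate(documents, start=1):
    tf = Counter(doc)
    tfidf = {w: round(tf[w] * math.log10(N / df[w]), 3) for w in vocabulary}
    print("Document", i, ":", tfidf)
# e.g. in Document 1: 'aman' -> 1 * log10(3/2) ≈ 0.176, 'and' -> 1 * log10(3/1) ≈ 0.477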
● Finally, the words have been converted to numbers. These numbers are the values of each word for each document.
● Here, you can see that since we have a small amount of data, words like ‘are’ and ‘and’ also have a high value.
● But as a word occurs in more and more documents, its value decreases.
For example:
Total Number of documents: 10
Number of documents in which ‘and’ occurs: 10
Therefore, IDF(and) = 10/10 = 1
Which means: log(1) = 0.
Hence, the value of ‘and’ becomes 0.
On the other hand, the number of documents in which ‘pollution’ occurs is 3.
IDF(pollution) = 10/3 = 3.3333…
Which means: log(3.3333) ≈ 0.523,
which shows that the word ‘pollution’ has considerable value in the corpus.
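The arithmetic above can be checked with Python's base-10 logarithm:

# Quick check of the worked example with Python's base-10 logarithm.
import math

print(math.log10(10 / 10))   # 0.0     -> 'and' carries no value
print(math.log10(10 / 3))    # ≈ 0.523 -> 'pollution' has considerable value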
Summarizing the concept, we can say that :
1. Words that occur in all the documents with high term frequencies have the least values and are
considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less
document frequency which shows that the word is important for one document but is not a common word
for all documents.
3. These values help the computer understand which words are to be considered while processing the natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain.
Some of its applications are:
● Document classification: it helps in classifying the type and genre of a document.
● Topic modelling: it helps in identifying the topics a document covers.
● Information retrieval systems: it helps in ranking documents against a search query.
● Stop-word filtering: it helps in removing unnecessary words from the text body.
Prepared by :
Hitesh Pujari