
12 Concepts in Natural Language Processing


Learning Objectives
After studying this chapter, students will be able to:
• Understand the key components of NLP
• Learn more about the text normalization technique
• Learn more about the Bag-of-Words NLP model

KEY COMPONENTS OF NLP


Natural Language Understanding (NLU)
NLU involves analyzing and understanding the meaning behind sentences. It enables software to find
similar meanings in different sentences or to process words that have different meanings. This is a subset
of NLP that focuses on enabling computers to understand the meaning and intent behind human language.
It involves tasks like:
• Text Analysis: Breaking down text into its components (words, sentences, etc.).
• Semantic Analysis: Understanding the meaning of words and phrases in context.
• Sentiment Analysis: Determining the emotional tone of the text.
• Intent Recognition: Identifying the user’s goal or purpose in a given utterance.
Natural Language Generation (NLG)
NLG is the task of generating human-like text from structured data. It is essential for applications like
chatbots and virtual assistants, where the system needs to respond to user queries in a natural and coherent
manner. This focuses on enabling computers to generate human-like text. This includes tasks like:
• Text Summarization: Creating concise summaries of longer pieces of text.
• Machine Translation: Translating text from one language to another.
• Chatbot Responses: Generating human-like conversational responses.
• Storytelling: Creating fictional narratives.
Data Processing
To enable machines to understand and generate natural languages, Natural Language Processing (NLP) starts by converting human language into numerical data. The initial step in this process is Text Normalization.
TEXT NORMALIZATION
Text normalization is a crucial preprocessing step in Natural Language Processing (NLP) that transforms raw
text into a standardized format. This process is essential for ensuring consistency and improving the accuracy
of subsequent NLP tasks. It involves cleaning and simplifying textual data to reduce its complexity. This process transforms the text into a more manageable form, making it easier for the machine to handle.
Why is Text Normalization Important?
• Improved Accuracy: Consistent text representation leads to more accurate results in NLP tasks like
sentiment analysis, topic modeling, and machine translation.
• Reduced Data Sparsity: By reducing variations in word forms, text normalization helps to increase the
frequency of words, improving the performance of statistical models.
• Faster Processing: Simplified text is easier and faster for NLP algorithms to process.
• Better Data Quality: By removing noise and inconsistencies, text normalization enhances the overall
quality of the data.
Arrangement of the words and meaning
Human language follows rules. There are nouns, verbs, adverbs and adjectives, and a word can be a noun at one time and an adjective at another. These rules give structure to a language; this is the issue related to the syntax of the language. Syntax refers to the grammatical structure of a sentence. Once the structure is present, we can start interpreting the message. We also want the computer to do this. One way is to use part-of-speech tagging, which allows the computer to identify the different parts of speech.
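For instance, here is a minimal sketch of part-of-speech tagging using NLTK (an illustrative choice; the chapter names the technique but not a specific library, and the tags shown in comments are indicative):

```python
# A minimal sketch of part-of-speech tagging with NLTK
# (assumes NLTK is installed and its tokenizer/tagger resources are downloaded).
import nltk

# One-time downloads, uncomment on first run:
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

sentence = "The red car zoomed past his nose"
tokens = nltk.word_tokenize(sentence)   # split the sentence into words
tagged = nltk.pos_tag(tokens)           # attach a part-of-speech tag to each word
print(tagged)
# Roughly: [('The', 'DT'), ('red', 'JJ'), ('car', 'NN'), ('zoomed', 'VBD'),
#           ('past', 'IN'), ('his', 'PRP$'), ('nose', 'NN')]
```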
Besides the matter of arrangement, there’s also meaning behind the language we use. Human
communication is complex. There are multiple characteristics of the human language that might be easy
for a human to understand but extremely difficult for a computer to understand.
Analogy with programming language:
Different syntax, same semantics: 2+3 = 3+2
Here the way these statements are written is different, but their meaning is the same, that is 5.
Different semantics, same syntax: 2/3 (Python 2.7) ≠ 2/3 (Python 3)
Here the statements have the same syntax but their meanings are different. In Python 2.7, this statement performs integer division and results in 0, while in Python 3 it performs true division and gives an output of approximately 0.667.
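You can check this behaviour yourself in Python 3 (a minimal illustration; // reproduces the old integer-division result):

```python
# In Python 3, "/" is true division, while "//" is floor (integer) division,
# which is what "/" did for integers in Python 2.7.
print(2 / 3)    # 0.6666666666666666  (true division, Python 3)
print(2 // 3)   # 0                   (floor division, same result as 2/3 in Python 2.7)
```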
Think of some other examples of different syntax and same semantics and vice-versa.

Multiple Meanings of a word


Let’s consider these three sentences:
His face turned red after he found out that he took the wrong bag
What does this mean? Is he feeling ashamed because he took another person’s bag instead of his? Is he
feeling angry because he did not manage to steal the bag that he has been targeting?
The red car zoomed past his nose
Probably talking about the color of the car
His face turns red after consuming the medicine
Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?
Here we can see that context is important. We understand a sentence almost intuitively, depending on our history of using the language and the memories that have been built around it. In all three sentences, the word red has been used in three different ways, and the context of each statement changes its meaning completely. Thus, in natural language, it is important to understand that a word can have multiple meanings, and the meaning that applies is decided by the context of the statement.
Think of some other words which can have multiple meanings and use them in sentences.

PERFECT SYNTAX, NO MEANING


Sometimes, a statement can have a perfectly correct syntax but not mean anything. For example, take a look at this statement:
Chickens feed extravagantly while the moon drinks tea.
This statement is grammatically correct, but does it make any sense? In human language, a perfect balance of syntax and semantics is important for proper understanding.
Think of some other sentences having correct syntax and incorrect semantics.

Humans interact with each other very easily. For us, the natural languages that we use are so convenient
that we speak them easily and understand them well too. But for computers, our languages are very
complex. As you have already gone through some of the complications in human languages above, now
it is time to see how Natural Language Processing makes it possible for the machines to understand and
speak in the Natural Languages just like humans.
Since we all know that the language of computers is numerical, the very first step that comes to mind is to convert our language into numbers. This conversion takes a few steps, and the first of them is Text Normalisation. Since human languages are complex, we first need to simplify them so that the machine can make sense of them. Text Normalisation helps in cleaning up the textual data so that its complexity is lower than that of the raw data. Let us go through Text Normalisation in detail.

TEXT NORMALISATION
In Text Normalisation, we undergo several steps to normalise the text to a lower level. We will be working on a collection of written text from multiple documents; the term used for the whole textual data from all the documents taken together is corpus.
Let us take a look at the steps:
Sentence Segmentation
Under sentence segmentation (also called sentence boundary detection), the whole corpus is divided into sentences. Each sentence is treated as a separate piece of data, so the whole corpus gets reduced to a set of sentences.
Example:
Original Text: Today around 80% of total data is available in the raw form. Big Data comes from
information stored in big organizations as well as enterprises. Examples include information of employees,
company purchase, sale records, business transactions, the previous record of organizations, social media
etc. Though humans use language, which is ambiguous and unstructured to be interpreted by computers,
with the help of NLP, this huge unstructured data can be harnessed for evolving patterns inside data to
better know the information contained in data.
Segmented Sentences:
Today around 80% of total data is available in the raw form.
Big Data comes from information stored in big organizations as well as enterprises.
Examples include information of employees, company purchase, sale records, business transactions, the
previous record of organizations, social media etc.
Though humans use language, which is ambiguous and unstructured to be interpreted by computers, with
the help of NLP, this huge unstructured data can be harnessed for evolving patterns inside data to better
know the information contained in data.
There are various libraries, including some of the most popular ones like NLTK, spaCy and Stanford CoreNLP, that provide excellent, easy-to-use functions for sentence segmentation.
Let’s take a look at how these libraries segment the text “I am Batman. I live in Gotham.”

These libraries work as they are supposed to: near perfectly on well-formatted text, but they can fail on text with bad punctuation, wrong capitalisation, etc.
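For example, a minimal sketch of sentence segmentation with NLTK's sent_tokenize (assuming NLTK and its 'punkt' resource are installed; spaCy and Stanford CoreNLP offer similar functions):

```python
# A minimal sketch of sentence segmentation using NLTK
# (assumes NLTK is installed and the 'punkt' resource has been downloaded).
import nltk

# nltk.download("punkt")   # one-time download

text = "I am Batman. I live in Gotham."
sentences = nltk.sent_tokenize(text)   # split the text into sentences
print(sentences)
# Expected output: ['I am Batman.', 'I live in Gotham.']
```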
Tokenisation
Word tokenization (also called word segmentation) is the problem of dividing a string of written language
into its component words. In English and many other languages using some form of Latin alphabet, space
is a good approximation of a word divider. However, splitting only on spaces does not always give the desired results, as some English compound nouns are written variably and sometimes contain a space.
After segmenting the sentences, each sentence is further divided into tokens. A token is any word, number or special character occurring in a sentence. Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token.
In most cases, we use a library to do this; for example, we can use the nltk.word_tokenize function.
Example:
Original Sentence: ‘I am Batman. I live in Gotham.’
Segmented Sentence:
I am Batman
I live in Gotham
Word Tokenisation:
[‘I’, ‘am’, ‘Batman’]
[‘I’, ‘live’, ‘in’, ‘Gotham’]
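Here is a minimal sketch of this step using the nltk.word_tokenize function mentioned above (assuming NLTK and its 'punkt' resource are installed); note that punctuation marks also become separate tokens:

```python
# A minimal sketch of word tokenisation with nltk.word_tokenize
# (assumes NLTK is installed and the 'punkt' resource has been downloaded).
import nltk

sentences = ["I am Batman.", "I live in Gotham."]
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)   # split into word and punctuation tokens
    print(tokens)
# Expected output:
# ['I', 'am', 'Batman', '.']
# ['I', 'live', 'in', 'Gotham', '.']
```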
Removing Stopwords, Special Characters and Numbers
In this step, the tokens which are not necessary are removed from the token list. What can be the possible
words which we might not require?
Stop words are words which are filtered out before or after processing of text. When applying machine
learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.
Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no
single universal list of stopwords. The list of the stop words can change depending on your application.
Stopwords are the words which occur very frequently in the corpus but do not add any value to it. Humans
use grammar to make their sentences meaningful for the other person to understand. But grammatical
words do not add any essence to the information which is to be transmitted through the statement hence
they come under stopwords. Some examples of stopwords are:
a, an, the, such, as, it, in, if, or, and, is, are, for, to, into, on, there
These words occur the most in any given corpus but talk very little or nothing about the context or the
meaning of it. Hence, to make it easier for the computer to focus on meaningful terms, these words are
removed. Along with these words, a lot of times our corpus might have special characters and/or numbers.
Now it depends on the type of corpus that we are working on whether we should keep them in it or not.
For example, if you are working on a document containing email IDs, then you might not want to remove
the special characters and numbers whereas in some other textual data if these characters do not make
sense, then you can remove them along with the stopwords.
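As an illustration, here is a minimal sketch of stopword, number and special-character removal using NLTK's English stopword list (an illustrative choice; remember that the right stopword list depends on your application):

```python
# A minimal sketch of removing stopwords, special characters and numbers
# (assumes NLTK is installed and the 'stopwords' resource has been downloaded).
from nltk.corpus import stopwords

# nltk.download("stopwords")   # one-time download

tokens = ["I", "live", "in", "Gotham", "and", "I", "like", "it", "!", "42"]
stop_words = set(stopwords.words("english"))   # common English stopwords ('i', 'in', 'and', 'it', ...)

cleaned = [
    t for t in tokens
    if t.lower() not in stop_words   # drop stopwords (case-insensitively)
    and t.isalpha()                  # drop numbers and special characters
]
print(cleaned)
# Expected output (roughly): ['live', 'Gotham', 'like']
```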
Converting text to a common case
After the stopwords removal, we convert the whole text into the same case, preferably lowercase. This ensures that the case-sensitivity of the machine does not treat the same words as different just because of different cases. For example:
Python, python, PYTHON, PYthon, PYThon, etc. are all converted into lowercase ‘python’.
In this example, all written forms of the word ‘Python’ are converted to lowercase ‘python’ and are treated the same by the machine.
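In Python this step is a one-liner, for example:

```python
# Converting tokens to a common (lower) case so that the machine treats
# 'Python', 'PYTHON' and 'python' as the same word.
tokens = ["Python", "python", "PYTHON", "PYthon", "PYThon"]
lowered = [t.lower() for t in tokens]
print(lowered)   # ['python', 'python', 'python', 'python', 'python']
```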
Stemming
In this step, the remaining words are reduced to their root words. In other words, stemming is the process
in which the affixes of words are removed and the words are converted to their base form.
Word Affixes Stem
Sweetened ed sweeten
Sweetening ing sweeten
Sweetener er sweeten
Tries es tri
Trying ing try

Note that in stemming, the stemmed words (the words we get after removing the affixes) might not be meaningful. Here in this example, as you can see, sweetened, sweetening and sweetener were all reduced to sweeten, but tries was reduced to tri after the affix removal, which is not a meaningful word. Stemming does not take into account whether the stemmed word is meaningful or not; it just removes the affixes, and hence it is faster.
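A minimal sketch using NLTK's Porter stemmer (an illustrative choice; the exact stems can differ slightly from the table above depending on which stemmer is used):

```python
# A minimal sketch of stemming with NLTK's Porter stemmer (assumes NLTK is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["sweetened", "sweetening", "sweetener", "tries", "trying"]
for w in words:
    # The stem may not be a meaningful word, e.g. 'tri'
    print(w, "->", stemmer.stem(w))
# Roughly: sweetened -> sweeten, sweetening -> sweeten, sweetener -> sweeten,
#          tries -> tri, trying -> tri
```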
Lemmatization
Stemming and lemmatization are alternative processes to each other, as the role of both is the same: removal of affixes. The difference between them is that in lemmatization, the word we get after affix removal (known as the lemma) is always a meaningful one. Lemmatization makes sure that the lemma is a word with meaning, and hence it takes a longer time to execute than stemming.
Word Affixes Lemma
Sweetened ed sweeten
Sweetening ing sweeten
Sweetener er sweeten
Tries es try
Trying ing try

Now with lemmatization, the word ‘tries’ after removal of the affix ‘es’ becomes ‘try’ , a meaningful word,
instead of ‘tri’ in stemming.
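A minimal sketch using NLTK's WordNet lemmatizer (an illustrative choice; note that passing the part of speech, here 'v' for verb, helps it return the correct lemma):

```python
# A minimal sketch of lemmatization with NLTK's WordNet lemmatizer
# (assumes NLTK is installed and the 'wordnet' resource has been downloaded).
from nltk.stem import WordNetLemmatizer

# nltk.download("wordnet")   # one-time download

lemmatizer = WordNetLemmatizer()
words = ["sweetened", "sweetening", "tries", "trying"]
for w in words:
    # pos="v" asks the lemmatizer to treat each word as a verb
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))
# Roughly: sweetened -> sweeten, sweetening -> sweeten, tries -> try, trying -> try
```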
With this, we have normalised our text to tokens, which are the simplest form of words present in the corpus. Now it is time to convert the tokens into numbers. For this, we will use the Bag of Words algorithm.

BAG OF WORDS
Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called feature extraction.
The bag-of-words model is a popular and simple feature extraction technique used when we work with
text. It describes the occurrence of each word within a document.
To use this model, we need to:
• Design a vocabulary of known words (also called tokens)
• Choose a measure of the presence of known words
Any information about the order or structure of words is discarded; that is why it is called a bag of words. The model only captures whether a known word occurs in a document, not where it occurs in the document.
The intuition is that similar documents have similar contents, and from the content alone we can learn something about the meaning of the document.
Bag of Words is a Natural Language Processing model which helps in extracting features out of the
text which can be helpful in machine learning algorithms. In bag of words, we get the occurrences of
each word and construct the vocabulary for the corpus.
The image gives a brief overview of how bag of words works. Let us assume that the text on the left in the image is the normalised corpus we got after going through all the steps of text processing. When we put this text into the bag of words algorithm, the algorithm returns the unique words of the corpus and their occurrences in it. On the right, it shows a list of words appearing in the corpus, and the number next to each word shows how many times that word has occurred in the text body. Thus, we can say that bag of words gives us two things:
1. A vocabulary of words for the corpus
2. The frequency of these words (number of times it has occurred in the whole corpus).
Here calling this algorithm “bag” of words symbolises that the sequence of sentences or tokens does not
matter in this case as all we need are the unique words and their frequency in it.
Here is the step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all the unique words occurring in the corpus. (Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many times the word from
the unique list of words has occurred.
4. Create document vectors for all the documents.
Let us go through all the steps with an example:

Step 1: Collecting data and preprocessing it.


Document 1: I like this movie
Document 2: I hate this movie
Document 3: This movie is awesome, I love it.
Here are three documents having one sentence each. After text normalisation, the text becomes:
Document 1: [I, like, this, movie]
Document 2: [I, hate, this, movie]
Document 3: [This, movie, is, awesome, I, love, it]
Note that no tokens have been removed in the stopwords removal step. It is because we have very little
data and since the frequency of all the words is almost the same, no word can be said to have lesser value
than the other.
Step 2: Create Dictionary
Go through all the documents and create a dictionary, i.e., list down all the unique words occurring in the three documents:

Dictionary:
I, like, this, movie, hate, is, awesome, love, it
Note that even though some words are repeated in different documents, they are written just once, since while creating the dictionary we list only the unique words.

Step 3: Create document vector


In this step, the vocabulary is written in the top row. Now, for each word in the document, if it matches
with the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1.
And if the word does not occur in that document, put a 0 under it.
I like This movie hate is awesome love it
1 1 1 1 0 0 0 0 0

Step 4: Repeat for all documents


The same exercise has to be done for all the documents. Hence, the table becomes:
I like This movie hate is awesome love it
1 1 1 1 0 0 0 0 0
1 0 1 1 1 0 0 0 0
1 0 1 1 0 1 1 1 1

In this table, the header row contains the vocabulary of the corpus and three rows correspond to three
different documents. Take a look at this table and analyse the positioning of 0s and 1s in it.
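The whole procedure can be written as a short sketch in plain Python (the helper names below are illustrative, not from any standard library; the tokens are lowercased in line with the common-case step):

```python
# A minimal pure-Python sketch of the bag-of-words steps described above,
# using the three normalised documents from the worked example.
docs = [
    ["i", "like", "this", "movie"],
    ["i", "hate", "this", "movie"],
    ["this", "movie", "is", "awesome", "i", "love", "it"],
]

def build_vocabulary(documents):
    """Step 2: list every unique word in the corpus, in first-seen order."""
    vocab = []
    for doc in documents:
        for token in doc:
            if token not in vocab:
                vocab.append(token)
    return vocab

def document_vector(doc, vocab):
    """Step 3: count how often each vocabulary word occurs in one document."""
    return [doc.count(word) for word in vocab]

vocab = build_vocabulary(docs)
vectors = [document_vector(doc, vocab) for doc in docs]

print(vocab)     # ['i', 'like', 'this', 'movie', 'hate', 'is', 'awesome', 'love', 'it']
for v in vectors:
    print(v)     # [1, 1, 1, 1, 0, 0, 0, 0, 0] etc., matching the table above
```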
Finally, this gives us the document vector table for our corpus. However, these raw counts alone do not tell us how valuable each word is. This leads us to the final step of our algorithm: TF-IDF.

TFIDF: TERM FREQUENCY & INVERSE DOCUMENT FREQUENCY


One problem with scoring word frequency is that the most frequent words in the document start to have
the highest scores. These frequent words may not contain as much “informational gain” to the model
compared with some rarer and domain-specific words. One approach to fix that problem is to penalize
words that are frequent across all the documents. This approach is called TF-IDF.
Suppose you have a book. Which characters or words do you think would occur the most in it?

Bag of words algorithm gives us the frequency of words in each document we have in our corpus. It gives
us an idea that if the word is occurring more in a document, its value is more for that document.
And, this, is, the, etc. are the words which occur the most in almost all the documents. But these words
do not talk about the corpus at all. Though they are important for humans as they make the statements
understandable to us, for the machine they are a complete waste as they do not provide us with any
information regarding the corpus. Hence, these are termed as stopwords and are mostly removed at the
pre-processing stage only.
DID YOU KNOW ?
Google Duplex is a new project from Google that is currently live in the
majority of the US. It allows certain users to make a restaurant reservation
by phone. However, instead of the user speaking directly to the restaurant
employee, Google Duplex, with the help of Google Assistant, speaks for
the user. It does this with an AI-based, but human-sounding, voice. Google
Duplex’s voice even puts in words like “um” and pauses to make it sound more like a real human.
Duplex lets AI mimic a human voice to make appointments and book tables through phone calls. Google’s voice-
calling “Duplex” - which lets Artificial Intelligence (AI) mimic a human voice to make appointments and book tables
through phone calls - may soon enter call centres assisting humans with customer queries.
‘Duplex’ is designed to operate in very specific use cases, and currently it is focused on testing with restaurant
reservations, hair salon booking and holiday tours with a limited set of trusted testers. It uses Google DeepMind’s
new “WaveNet” audio-generation technique and other advances in Natural Language Processing (NLP) to replicate
human speech patterns.

To summarize
• Stop words have the highest occurrence in the documents but carry negligible value.
• Frequent words have adequate occurrence in the corpus. These words mostly talk about the document’s subject.
• Rare or valuable words have the least occurrences in the documents, but they add the most value to the corpus. Hence, when we look at the text, we take frequent and rare words into consideration.
TF-IDF, short for term frequency-inverse document frequency is a statistical measure used to evaluate
the importance of a word to a document in a collection or corpus. The TF-IDF scoring value increases
proportionally to the number of times a word appears in the document, but it is offset by the number of
documents in the corpus that contain the word.
TFIDF helps in identifying the value for each word. Let us understand each term one by one.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily be found from
the document vector table as in that table we mention the frequency of each word of the vocabulary in
each document.
I like this movie hate is awesome love it
1 1 1 1 0 0 0 0 0
1 0 1 1 1 0 0 0 0
1 0 1 1 0 1 1 1 1

Here, you can see that the frequency of each word for each document has been recorded in the table. These
numbers are nothing but the Term Frequencies!
Inverse Document Frequency
Now, let us look at the other half of TFIDF, which is Inverse Document Frequency. For this, let us first understand what document frequency means. Document frequency is the number of documents in which the word occurs, irrespective of how many times it has occurred in those documents. The document frequency for the example vocabulary would be:
I like this movie Hate is awesome love it
3 1 3 3 1 1 1 1 1
For inverse document frequency, we put the document frequency in the denominator and the total number of documents in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:
I like this movie Hate is awesome love it
3/3 3/1 3/3 3/3 3/1 3/1 3/1 3/1 3/1

Finally, the formula of TFIDF for any word W becomes:


TFIDF(W) = TF(W) * log( IDF(W) )
Here, log is to the base 10. Now, let us multiply each TF value by the log of the corresponding IDF value. Note that the TF values are for each document, while the IDF values are for the whole corpus. Hence, we apply the IDF value of a word to that word’s column in every row of the document vector table.
I like this movie hate is awesome love it
1*log(1) 1*log(3) 1*log(1) 1*log(1) 0*log(3) 0*log(3) 0*log(3) 0*log(3) 0*log(3)
1*log(1) 0*log(3) 1*log(1) 1*log(1) 1*log(3) 0*log(3) 0*log(3) 0*log(3) 0*log(3)
1*log(1) 0*log(3) 1*log(1) 1*log(1) 0*log(3) 1*log(3) 1*log(3) 1*log(3) 1*log(3)

Here, you can see that the IDF value for a given word is the same in every row, and a similar pattern is followed for all the words of the vocabulary. After calculating all the values, we get:
I like this movie hate is awesome love it
0 0.477 0 0 0 0 0 0 0
0 0 0 0 0.477 0 0 0 0
0 0 0 0 0 0.477 0.477 0.477 0.477
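The same calculation can be scripted as a minimal sketch following the chapter's formula TFIDF(W) = TF(W) * log10(IDF(W)), where IDF(W) is the ratio of total documents to documents containing W (the variable names are illustrative):

```python
# A minimal sketch of TF-IDF using the chapter's formula:
# TFIDF(W) = TF(W) * log10(total documents / documents containing W)
import math

vocab = ["i", "like", "this", "movie", "hate", "is", "awesome", "love", "it"]
# Term frequencies per document (the document vector table built earlier)
tf = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 1, 1, 1],
]

n_docs = len(tf)
# Document frequency: in how many documents does each word occur?
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# TF-IDF value of each word in each document, rounded to 3 decimal places
tfidf = [
    [round(row[j] * math.log10(n_docs / df[j]), 3) for j in range(len(vocab))]
    for row in tf
]

for row in tfidf:
    print(row)
# Expected output:
# [0.0, 0.477, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.477, 0.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.477, 0.477, 0.477, 0.477]
```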

Finally, the words have been converted to numbers. These numbers are the TF-IDF values of each word for each document. Notice that words which occur in every document, such as ‘I’, ‘this’ and ‘movie’, get a value of 0, whereas words that occur in only one document get a higher value. In general, the more documents a word occurs in, the smaller its TF-IDF value becomes.
That is, for example:
Total Number of documents: 10
Number of documents in which ‘and’ occurs: 10
Therefore, IDF(and) = 10/10 = 1
Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.
On the other hand, number of documents in which ‘movie’ occurs: 3
IDF(movie) = 10/3 = 3.3333...
Which means: log(3.3333) = 0.522; which shows that the word ‘movie’ has considerable value in the corpus.
Summarising the concept, we can say that:
1. Words that occur in all the documents with high term frequencies have the least values and are considered to be the stopwords.
2. For a word to have high TFIDF value, the word needs to have a high term frequency but less document
frequency which shows that the word is important for one document but is not a common word for all
documents.
3. These values help the computer understand which words are to be considered while processing the
natural language. The higher the value, the more important the word is for a given corpus.
Applications of TFIDF
TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:
• Document Classification: Helps in classifying the type and genre of a document.
• Topic Modelling: Helps in predicting the topic for a corpus.
• Information Retrieval System: Helps in extracting the important information out of a corpus.
• Stop word filtering: Helps in removing the unnecessary words out of a text body.

RECAP
• Natural Language Processing, abbreviated as NLP, is a branch of artificial intelligence that deals with the
interaction between computers and humans using the natural language.
• In Text Normalisation, we undergo several steps to normalise the text to a lower level. The steps include:
• Sentence segmentation: the whole corpus is divided into sentences. Each sentence is treated as a separate piece of data, so the whole corpus gets reduced to sentences.
• Word tokenization is the problem of dividing a string of written language into its component words.
• Removing Stopwords, Special Characters and Numbers: The tokens which are not necessary are removed
from the token list.
• Converting text to a common case: After the stopwords removal, we convert the whole text into the same case, preferably lowercase.
• Stemming: The remaining words are reduced to their root words.
• Lemmatization makes sure that the lemma is a meaningful word when the words are reduced to their root forms.
• Bag of Words helps in extracting features out of the text which can be helpful in machine learning
algorithms.
• TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection
or corpus. The TF-IDF scoring value increases proportionally to the number of times a word appears in
the document, but it is offset by the number of documents in the corpus that contain the word.

KEY TERMS
• NLU involves analyzing and understanding the meaning behind sentences.
• NLG is the task of generating human-like text from structured data.
• Corpus is the term used for the whole textual data from all the documents altogether.
• Word tokenization (also called word segmentation) is the problem of dividing a string of written language
into its component words.
• Stemming is the process in which the affixes of words are removed and the words are converted to their
base form.
• Lemmatization is the process in which the affixes of words are removed and the words are converted
to their base form in a meaningful way.
• Stop words are words which are filtered out before or after processing of text.
• Bag of Words is a Natural Language Processing model which helps in extracting features out of the text
which can be helpful in machine learning algorithms.
• TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate
the importance of a word to a document in a collection or corpus.

EXERCISES
A. Multiple Choice Questions
1. _____________ is the problem of dividing a string of written language into its component words.
(a) Lemmatization (b) Stemming
(c) Tokenization (d) Sentiment Analysis
2. _____________ are the words which occur very frequently in the corpus but do not add any value to it.
(a) StopWords (b) FreqWords (c) NoValueWords (d) NoWords
3. ____________ is the process in which the affixes of words are removed and the words are converted to their
base form.
(a) Stemming (b) Lemmatization
(c) Tokenization (d) Language Analysis
4. ____________ takes a longer time to execute than stemming.
(a) Tokenization (b) Language Analysis
(c) Sentiment Analysis (d) Lemmatization
5. In _____________, we get the occurrences of each word and construct the vocabulary for the corpus.
(a) Tokenization (b) Bag of Words
(c) Lemmatization (d) Automatic Summarization
6. ____________is a statistical measure used to evaluate the importance of a word to a document in a collection
or corpus.
(a) Lemmatization (b) Stemming (c) TF-IDF (d) Bag of Words
7. Which of the following is NOT a benefit of Text Normalization?
(a) Improved accuracy (b) Faster processing
(c) Better data quality (d) Increased data complexity
8. What does the Inverse Document Frequency (IDF) component of TF-IDF aim to achieve?
(a) Increase the score of words that appear frequently in a single document
(b) Penalize words that are frequent across all documents
(c) Measure the total number of words in a corpus
(d) Identify the most common words in a language
9. Which of the following is the term used to describe the complete set of text data from multiple documents?
(a) Token (b) Vocabulary
(c) Document vector (d) Corpus
10. Which of the following tasks is associated with Natural Language Generation (NLG)?
(a) Intent Recognition (b) Semantic Analysis
(c) Chatbot Responses (d) Text Analysis
11. Which of the following is a reason why stop words are often removed during text pre-processing?
(a) They are difficult for computers to understand.
(b) They are not grammatically correct.
(c) They occur frequently but add little meaning to the text.
(d) They are always irrelevant to the task at hand.
12. Why is converting text to a common case (e.g., lowercase) important in Text Normalization?
(a) To improve the readability of the text for humans.
(b) To reduce the file size of the text data.
(c) To ensure that the same words are treated equally regardless of their original case.
(d) To make it easier to identify the parts of speech.
13. In the Bag-of-Words model, what does the document vector represent?
(a) The order of words in a document.
(b) The grammatical relationships between words.
(c) The frequency of each word in the vocabulary within a specific document.
(d) The sentiment expressed in the document.
14. Which of the following is a disadvantage of the Bag-of-Words model?
(a) It is computationally expensive to implement.
(b) It requires a large amount of training data.
(c) It disregards the order and structure of words in a document.
(d) It is not effective for document classification tasks.
15. Which statement best describes the relationship between Term Frequency (TF) and Inverse Document Frequency
(IDF)?
(a) TF and IDF are independent measures that are not related.
(b) TF measures the frequency of a word in a document, while IDF penalizes words that are frequent across all
documents.
(c) TF is used for stemming, while IDF is used for lemmatization.
(d) TF is a measure of sentiment, while IDF is a measure of topic relevance.

B. Answer the following questions


1. Differentiate between
(a) Stemming and Lemmatization
(b) Term frequency and Inverse Document Frequency
2. “A sentence can be grammatically correct but still lack meaning.” Discuss this statement using examples. Why is
it important to consider both syntax and semantics in NLP?
3. The Bag-of-Words model is described as a way to represent text numerically. Why is it necessary to convert text
into numbers for computers to work with it?

C. Competency based Questions Life Skills & Values

1. Through a step-by-step process, calculate TF-IDF for the given corpus and mention the word(s) having the highest
value.

Document 1: We are going to Mumbai.

Document 2: Mumbai is a famous place.

Document 3: We are going to a famous place.

Document 4: I am famous in Mumbai.
2. Through a step-by-step process calculate TF-IDF for the given corpus
Document 1: Johny, Johny, Yes Papa.
Document 2: Eating Sugar? No Papa.
Document 3: Telling lies? No Papa.
Document 4: Open your mouth, Ha! Ha! Ha!
3. (a) Write down the steps to implement a bag-of-words algorithm.
(b) What will be the output of the word “studies” if we do the following:
(i) Lemmatization (ii) Stemming
(c) How many tokens are there in the sentence given below?
Traffic Jams have become a common part of our lives nowadays. Living in an urban area means you have to
face traffic each and every time you get out on the road. Mostly, school students opt for buses to go to school.
(d) Identify any 2 stopwords in the given sentence:
Pollution is the introduction of contaminants into the natural environment that cause adverse change.
The three types of pollution are air pollution, water pollution and land pollution.
(e) Write any two applications of TF-IDF.
4. Normalize the given text and comment on the vocabulary before and after the normalization:
(i) Raj and Vijay are best friends.
(ii) They play together with other friends.
(iii) Raj likes to play football but Vijay prefers to play online games.
(iv) Raj wants to be a footballer.
(v) Vijay wants to become an online gamer.
5. Before the meaning of a sentence can be determined, the meanings of its constituent parts must be established.
This requires a knowledge of the structure of the sentence, the meanings of individual words and how the words
modify each other. The process of determining the syntactical structure of a sentence is known as parsing.
Parsing is the process of analyzing a sentence by taking it apart word by word and determining its structure
from its constituent parts and subparts. The structure of a sentence can be represented with a syntactic tree or
a list. The parsing process is basically the inverse of the sentence generation process since it involves finding a
grammatical sentence structure from an input string. When given an input string, the lexical parts or terms (root
words) must first be identified by type, and then the role they play in a sentence must be determined. These
parts can then be combined successively into larger units until a complete tree structure has been completed.
(i) ___________ is the process of analyzing a sentence by taking it apart word by word.
(a) Parsing (b) Analysis (c) Knowledge (d) None of these
(ii) The process of determining the ______ structure of a sentence is called parsing.
(a) Semantical (b) Syntactical (c) Analytical (d) Any of these
(iii) The parsing process is basically the inverse of the sentence generation process.
(a) True (b) False
(iv) The structure of a sentence can be represented with a ___________.
(a) Terms (b) Lexical parts (c) Syntactic tree (d) Root words
(v) Which of the following is the type of data used by NLP applications?
(a) Images (b) Numerical data
(c) Graphical data (d) Text and Speech
(vi) The_________ approach was designed to judge whether a machine could or could not display artificial
intelligence.
(a) Boolean Algebra (b) Turing Test (c) Logarithm (d) Algorithm
6. You want to build a simple spam filter for your email inbox.
• What are some common keywords or phrases that might indicate a spam email?

• How can the Bag-of-Words model help you identify these keywords and flag potential spam messages?

• What are the challenges of using only the Bag-of-Words model for spam detection?

7. You’re building a simple chatbot for a pizza place. Customers can text their order. How can the Bag-of-Words
model help categorize customer messages?
• Example:

• Message 1: “I want a large pepperoni pizza with extra cheese.”
• Message 2: “Can I order a medium veggie pizza for delivery?”
• Message 3: “What are your current deals?”
• How would you use the Bag-of-Words model to represent these messages?
• What keywords would be important for categories like “order pizza,” “delivery,” and “inquiries”?

• How would you use this information to route customer messages to the appropriate department (e.g., order taking, delivery, customer service)?

D. Activity Zone Experiential Learning

1. Here is a corpus for you to challenge yourself with the given tasks. Use the knowledge you have gained in the
chapter and try completing the whole exercise by yourself.

The Corpus

Document 1: We can use health chatbots for treating stress.

Document 2: We can use NLP to create chatbots and we will be making health chatbots now!

Document 3: Health ChatBots cannot replace human counsellors now.
Accomplish the following challenges on the basis of the corpus given above. You can use the tools available online
for these challenges. Link for each tool is given below:
1. Sentence Segmentation: https://tinyurl.com/y36hd92n
2. Tokenisation: https://text-processing.com/demo/tokenize/
3. Stopword removal: https://demos.datasciencedojo.com/demo/stopwords/
4. Lowercase conversion: https://caseconverter.com/
5. Stemming: http://textanalysisonline.com/nltk-porter-stemmer
6. Lemmatisation: http://textanalysisonline.com/spacy-word-lemmatize
7. Bag of Words: Create a document vector table for all documents.
8. Generate TFIDF values for all the words.
9. Find the words having the highest value and the least value.
2. Amazon Lex is a fully managed service for building conversational interfaces into any application using voice and
text. It provides deep learning functionalities like:
• Automatic Speech Recognition (ASR) for converting speech to text

• Natural Language Understanding (NLU) to recognize the intent of the text

Amazon Lex is a flexible chatbot framework with NLU and machine learning capabilities. With Amazon Lex, you
can build everything from simple bots for messaging apps to complex bots for enterprise environments. Amazon
Lex enables you to quickly & easily build chatbots with highly engaging user experiences and lifelike conversational
interactions.
Visit the following site for details on how to build a bot using Amazon Lex https://aws.amazon.com/lex/
E. GROUP DISCUSSION Communication

Conduct a debate in class. Divide the class into three groups and give them one topic each out of the following.
• The Rise of AI-Powered Language Models: A Blessing or a Curse?

• The Ethical Implications of AI-Powered Language Translation

• The Role of NLP in Combating Misinformation and Fake News

F. Knowledge Hub Subject Enrichment

https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
https://medium.com/@abhishekjainindore24/tf-idf-in-nlp-term-frequency-inverse-document-frequency-e05b65932f1d

G. Experiential Learning
https://youtu.be/isuRxhLQSXU
https://youtu.be/zLMEnNbdh4Q
