Natural Language Processing - Notes - Unit 2
2. Predictions can be made easily in the same way. Users benefit because the word
suggested to complete a statement is the one that has appeared most often in that
context and has therefore been assigned the highest probability. E.g. given the
sentence, “Please submit your ______.”, the next word is far more likely to be
‘assignment’ than ‘house’.
3. Spelling mistakes creep in from time to time, often because of the typing speed
of the person. A feature like autocorrect can be based on how often one word occurs
after another. E.g. if a user types “drink mlik” due to some error while typing, the
machine will correct it to ‘milk’ and in turn fix the spelling. This is done because the
word ‘milk’ has a much higher chance of coming after the word ‘drink’.
Now that we understand the uses of an N-gram model, let us move forward and look
at it more formally. An N-gram model predicts the next word from the previous (N-1)
words. A few terms are introduced below which describe the type of model we will
be working with; they tell us how much preceding context is used when picking the
next word.
Bi-Gram Model – Here N = 2, so we look back at (N-1) = (2-1) = 1 previous word to
predict the next word.
Tri-Gram Model – Here N = 3, so we look back at (N-1) = (3-1) = 2 previous words to
predict the next word.
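In symbols (a sketch in the usual textbook chain-rule shorthand, where w1 … wn is the word sequence), these approximations are:
Bigram: P(wn | w1 … w(n-1)) ≈ P(wn | w(n-1))
Trigram: P(wn | w1 … w(n-1)) ≈ P(wn | w(n-2) w(n-1))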
To see how the model works, we need to assign a probability to each word that can
follow a given context. Depending on the task, the user can choose a bigram model,
a trigram model, or an even higher-order model for more complex strings. To
illustrate this, we need a ‘corpus’, i.e. a collection of sentences. Let us look at a few
small sentences to illustrate this property.
1. He went to New Delhi.
2. New Delhi has nice weather.
3. It is raining in the new city.
Let us pick one of the models mentioned above, say the bigram model. The probability
of a word given the previous word is: [number of times the previous word and the
word we want to predict appear together, in that order] / [number of times the
previous word appears anywhere in the entire set of lines].
Let us find the probability that the word ‘Delhi’ comes after the word ‘New’. For this,
we take the number of times ‘New Delhi’ appears divided by the number of times
‘New’ appears in the 3 lines above.
P(‘Delhi’ | ‘New’) = (number of times ‘New Delhi’ appears) / (number of times ‘New’ appears)
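As a rough sketch, this count-based estimate can be computed with a few lines of Python. The tokenization here simply lower-cases each sentence and strips the full stop, which is an assumption on our part, not something specified in the notes.

from collections import Counter

corpus = [
    "He went to New Delhi.",
    "New Delhi has nice weather.",
    "It is raining in the new city.",
]

# Very simple tokenization: lower-case and drop the trailing full stop.
tokens = [s.lower().rstrip(".").split() for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))

# Maximum-likelihood bigram estimate: count('new delhi') / count('new')
p = bigrams[("new", "delhi")] / unigrams["new"]
print(p)  # 2 occurrences of 'new delhi' out of 3 occurrences of 'new' -> 0.666...

Under this lower-cased tokenization, ‘new’ also appears in the third sentence, which is why the denominator is 3 rather than 2.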
Bigram models work well across many different types of corpora. With limited data,
a bigram model may even work better than a trigram model for more complex strings.
However, when a lot of data is available, it becomes feasible to work with 4-gram and
5-gram models. One also needs to look at the context on both the right and the left of
a word before making any adjustments and assumptions about it. Now we will look
at the different unsmoothed N-gram models and their evaluation.
2.3. Smoothing
A lot of times, while testing our model, we come across words or N-grams that were
never seen in the training data, or that are perfectly meaningful but occurred in the
dataset only once. Because of this, the model assigns them zero or near-zero
probability. If this happens, the probability of the whole test sentence collapses to
zero and our results become meaningless. The process of taking some probability
mass away from frequent events and assigning it to unseen events is known as
smoothing or discounting. A few ways of doing this smoothing are given below.
Laplace Smoothing
Before we normalize the counts into probabilities, we add 1 to every N-gram count.
All the counts go up by one: e.g. 2 becomes 3, 3 becomes 4, and so on. This is the
simplest way of smoothing. It may not be of great use in modern N-gram models, but
it does a great job of getting one familiar with concepts that are key to learning other
types of smoothing.
Let us get an idea of this smoothing with the help of a unigram model.
Where, w = a word in the vocabulary
c = count of the word w in the corpus
N = total number of words/tokens (after tokenization)
Therefore, the unsmoothed estimate is P(w) = c / N
Now, as explained above, Laplace smoothing increases each word count by 1. After
making this increment, an adjustment also has to be made to the denominator,
because the extra 1 has been added once for each of the W distinct words in the
vocabulary.
w = a word in the vocabulary
c = count of the word w in the corpus
N = total number of words/tokens (after tokenization)
W = number of distinct words (vocabulary size)
Therefore, P(Laplace)(w) = (c + 1) / (N + W)
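A minimal sketch of these two estimates in Python, using the same toy corpus as before (lower-cased, punctuation dropped). Strictly speaking, W should also account for any unseen words we want to allow for; this sketch simply uses the observed vocabulary.

from collections import Counter

tokens = ("he went to new delhi "
          "new delhi has nice weather "
          "it is raining in the new city").split()

counts = Counter(tokens)   # c = count of each word
N = len(tokens)            # N = total number of tokens
W = len(counts)            # W = number of distinct words (vocabulary size)

def p_mle(word):
    # Unsmoothed estimate: c / N (zero for unseen words)
    return counts[word] / N

def p_laplace(word):
    # Laplace estimate: (c + 1) / (N + W)
    return (counts[word] + 1) / (N + W)

print(p_mle("delhi"), p_laplace("delhi"))    # seen word
print(p_mle("mumbai"), p_laplace("mumbai"))  # unseen word: 0.0 vs a small non-zero value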
2.4. Interpolation and Back off – Word Classes, Part-of-Speech Tagging, Rule-based,
Stochastic, and Transformation-based Tagging
We learned about smoothing, which takes care of the problem of assigning zero
probability to unseen words. However, there is another issue to take care of, which is
discussed below. Sometimes a prediction that would normally involve a trigram can
instead be made with a bigram, and likewise a bigram can be replaced by a unigram.
In other words, using less context is a good way of letting the model generalize. There
are two ways of doing this: interpolation and back off.
In simple linear interpolation, we mix the unigram, bigram, and trigram estimates,
each weighted by a constant λ. There is also a slightly more complex version in which
each λ is set according to the context in which it is used. The assumption made here
is that when the trigram counts are reliable, the trigram estimate should be trusted
more than the bigram and unigram estimates. Hence, we get a new modified equation
as shown below.
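A sketch of the standard textbook forms of these equations (not taken verbatim from these notes), with fixed weights λ1 + λ2 + λ3 = 1:
P(wn | w(n-2) w(n-1)) = λ1 P(wn) + λ2 P(wn | w(n-1)) + λ3 P(wn | w(n-2) w(n-1))
In the context-conditioned version, each λ depends on the preceding words:
P(wn | w(n-2) w(n-1)) = λ1(w(n-2) w(n-1)) P(wn) + λ2(w(n-2) w(n-1)) P(wn | w(n-1)) + λ3(w(n-2) w(n-1)) P(wn | w(n-2) w(n-1))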
2. In the case of a non-zero higher-order count, interpolation still uses the data
given by the lower-order n-gram models; it always mixes all the orders together.
3. In the case of a non-zero higher-order count, back off does not use the data given
by the lower-order n-gram models; it only falls back on them when the higher-order
count is zero (a small sketch contrasting the two is given after this list).
4. One can create an interpolated version of a back off algorithm, and likewise a
back off version of an interpolated algorithm.
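A minimal sketch contrasting the two strategies for a bigram/unigram pair, assuming `unigrams` and `bigrams` are the Counter objects built in the earlier bigram sketch. The weights lam = 0.7 and alpha = 0.4 are arbitrary placeholders, not tuned values.

def p_unigram(w, unigrams):
    # Maximum-likelihood unigram estimate.
    return unigrams[w] / sum(unigrams.values())

def p_bigram(w, prev, bigrams, unigrams):
    # Maximum-likelihood bigram estimate (0 if the history was never seen).
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(w, prev, bigrams, unigrams, lam=0.7):
    # Interpolation: always mix the bigram and unigram estimates,
    # even when the bigram count is non-zero.
    return lam * p_bigram(w, prev, bigrams, unigrams) + (1 - lam) * p_unigram(w, unigrams)

def p_backoff(w, prev, bigrams, unigrams, alpha=0.4):
    # Back off: trust the bigram estimate whenever its count is non-zero,
    # and only otherwise fall back to a crudely discounted unigram estimate.
    # Note: this simple scheme does not give a properly normalized distribution.
    if bigrams[(prev, w)] > 0:
        return p_bigram(w, prev, bigrams, unigrams)
    return alpha * p_unigram(w, unigrams)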
1.13.3 Word Classes
The topic of word classes is fairly simple to understand in the English language.
Having a good knowledge of the different word classes helps us understand the
morphology and other lexical analyses happening in language processing. We can
start with the simple examples of word classes given below and then move forward.
1. Open Classes (basic word classes)
• Noun – N
• Verb – V
• Adjective – ADJ
• Adverb – ADV
2. Closed Classes
• Determiners – the, an, a
• Pronouns – he, she, it, they, etc.
• Prepositions - over, from, with, under, around, etc.
1.13.4 Part of Speech (POS) Tagging
The different word classes above give us a clear idea of the English words we are
dealing with. Each part of speech is assigned a short tag, for example NN for a
singular noun, VB for a base-form verb, and JJ for an adjective in the widely used
Penn Treebank tagset. In POS tagging, each word is tagged with its word class, and
the tagged tokens are then used during the analysis process.
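A hedged sketch of POS tagging in practice, using NLTK's default English tagger. This assumes the nltk package and its tokenizer/tagger data are already installed; the exact data resource names can differ between NLTK versions.

import nltk

sentence = "Please submit your assignment before Friday."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# The output is a list of (word, tag) pairs, e.g. ('assignment', 'NN').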
Properties of Stochastic Tagging
• This type of tagging is based on statistical inferences made in order to tag
the words that are part of the corpus.
• A training set needs to be given to the model so that it gets an idea of the
instances that have occurred in the past.
• A probability is also assigned to words that do not appear in the training
corpus.
• The training and testing corpora are different from each other.
• In its simplest form, it assigns each word the tag that appeared most
frequently for that word in the training corpus (a sketch of this baseline is
given after this list).
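A minimal sketch of that most-frequent-tag baseline, assuming the training data is available as a list of (word, tag) pairs. The tiny data set below is made up purely for illustration.

from collections import Counter, defaultdict

# Hypothetical toy training data: (word, tag) pairs.
train = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
         ("the", "DT"), ("run", "NN"), ("was", "VBD"), ("fun", "JJ")]

# Count how often each tag appears for each word.
tag_counts = defaultdict(Counter)
for word, tag in train:
    tag_counts[word][tag] += 1

# Overall most common tag, used for words never seen in training.
default_tag = Counter(tag for _, tag in train).most_common(1)[0][0]

def tag_word(word):
    # Pick the tag seen most often with this word, else the default tag.
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default_tag

print([(w, tag_word(w)) for w in "the dog was fun".split()])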
2.5. Issues in PoS Tagging – Hidden Markov and Maximum Entropy Models
1.14.1 Hidden Markov Model (HMM)
Fig. 2: The hidden events during the transformation, as shown in the diagram.
In an HMM tagger, the tags form a sequence of hidden states, and each hidden state
(tag) generates an output, namely the word, which is what is actually observable to
the user.
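As a sketch of the standard formulation (textbook notation, not taken verbatim from these notes): the tagger chooses the tag sequence t1 … tn that maximizes
P(t1 … tn | w1 … wn) ∝ Π from i=1 to n of P(wi | ti) · P(ti | t(i-1))
where P(wi | ti) is the emission probability of a word given its tag and P(ti | t(i-1)) is the transition probability between consecutive tags.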
1.14.2 Maximum Entropy Markov Model (MEMM)
This model is an alternative to the Hidden Markov Model described above. Instead of
modelling transitions and emissions separately, the two are replaced by a single
function that we can define. The model directly gives the probability of the current
state s, given the previous state s’ as well as the current observation o, i.e. P(s | s’, o).
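A sketch of the usual maximum-entropy (log-linear) form of this function, where the feature functions fi and weights wi are whatever we choose to define, and Z(o, s’) is a normalizing constant:
P(s | s’, o) = (1 / Z(o, s’)) · exp( Σ over i of wi · fi(o, s’, s) )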
The main difference to get familiar with is that in an HMM the observation depends
only on the current state, whereas in a MEMM the prediction of the current state can
be based on the previous state as well as the current observation. This can lead to
better accuracy during the tagging process. We can also see the difference in the
diagram given below.
Disadvantages of MEMM’s
1. MEMM models suffer from the label bias problem. Because the probabilities of
all the transitions leaving a state must sum to one, once a state has been reached the
following observation can only redistribute probability among the transitions leaving
that state; states with few outgoing transitions therefore effectively ignore the
observation.
2. A classic example uses the words “rib” and “rob”. When the input is the partially
observed string “r_b”, there are two paths through the state machine, one starting
with the transition 0–1 (for “rib”) and one with 0–4 (for “rob”). Because of the label
bias problem both paths end up equally likely, so the model cannot settle on the one
word it should predict.
Both HMM and MEMM models are used for tagging; each has its pros and cons, and
the choice depends on the model that needs to be adopted.
Perplexity
One of the methods used to evaluate language models correctly is ‘perplexity’. Raw
probabilities are not convenient as an evaluation metric, so a variant of them known
as perplexity (PP) is used instead. The perplexity of an NLP model on a test set is the
inverse of the probability the model assigns to that set, normalized by the number of
words.
Let us say that the test set is W = w1 w2 w3 w4 … wN.
The computation is done by expanding the probability with the help of the chain rule,
followed by the assumption that the model is a 2-gram, or bigram, model as described
above. The result is the perplexity, which gives us a clear basis for evaluating a model.
These are the different formulas as given below:
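Based on the standard definition (a sketch in the usual textbook notation, consistent with the worked example further below):
PP(W) = P(w1 w2 … wN) ^ (-1/N) = N-th root of [ 1 / P(w1 w2 … wN) ]
Expanding with the chain rule:
PP(W) = [ Π from i=1 to N of 1 / P(wi | w1 … w(i-1)) ] ^ (1/N)
And with the bigram assumption:
PP(W) = [ Π from i=1 to N of 1 / P(wi | w(i-1)) ] ^ (1/N)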
The final conclusion is that, because perplexity is an inverse probability, the higher
the conditional probability of W, the lower the perplexity (PP).
We can also understand perplexity as the ‘weighted average branching factor’ of a
language. This factor tells us how many words can follow any given word. Let us take
a simple example: a language whose vocabulary is the numbers 1 to 20, i.e. the set
S = {1, 2, 3, 4, …, 20}, each occurring with equal probability P = 1/20. If we treat this
as a language and measure its perplexity, then, since perplexity is an inverse
probability, the answer comes out inverted, as shown below.
We know,
Perplexity = PP(S) = P(s1 s2 s3 … sN) ^ (-1/N)
= [ (1/20)^N ] ^ (-1/N) = (1/20) ^ (-1) = 20
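A quick numerical check of this result in Python (the sequence length N = 50 below is arbitrary):

# Perplexity of a uniform 20-symbol language: should come out to (almost exactly) 20.
N = 50                       # arbitrary test-sequence length
p_sequence = (1 / 20) ** N   # probability of any particular length-N sequence
perplexity = p_sequence ** (-1 / N)
print(perplexity)            # approximately 20, up to floating-point error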