Natural Language Processing

Unit – 2 – Word Level Analysis

2.1 Unsmoothed N-grams


Word Level Analysis
An N-gram is simply a contiguous sequence of N units (usually words) taken from a string of text. It tells us how many units we group together while reading the data that has to be processed. For example:
• New Delhi – a 2-gram (two words)
• Phir Hera Pheri – a 3-gram (three words)
• He stood up quickly – a 4-gram (four words)
It is fairly obvious that we have heard the first two examples much more frequently than the last one, "He stood up quickly." This is an instance of an N-gram that is observed far less often. If we assign a probability to the occurrence of each N-gram and use it to predict the next word in a string, we benefit in three main ways:
1. Many frequent 2-word N-grams can be treated together as a single unit, which makes the text easier to read and process.

2. Predictions become much easier for the same reason. Users benefit because the word suggested to complete a statement is the one that has appeared most often in that position, thanks to the probability assigned to it. E.g. if we have the sentence "Please submit your ______.", the next word is far more likely to be 'assignment' than 'house'.
3. Spellings often need to be corrected, since fast typing introduces errors. A feature like autocorrect can be based on how often a word occurs after a given one. E.g. if someone types "drink mlik" by mistake, the machine corrects it to 'milk' and fixes the spelling, because 'milk' has a much higher chance of appearing after the word 'drink'.
Now that we understand the uses of an N-gram model, let us move on to a more formal understanding of it. An N-gram model predicts the next word from the occurrences of the previous (N-1) words. To understand this better, a few terms are introduced below which describe the type of model we will be working with and how much context it uses to pick the next word.
Bi-Gram Model – Each N-gram contains 2 words (N=2), so we look back at (N-1) = (2-1) = 1 previous word to predict the next word.
Tri-Gram Model – Each N-gram contains 3 words (N=3), so we look back at (N-1) = (3-1) = 2 previous words to predict the next word.
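To make this concrete, here is a minimal sketch (not part of the original notes) of extracting bigrams and trigrams from a tokenized sentence; the function name and example sentence are only illustrative.

```python
# A minimal sketch of extracting n-grams from a tokenized sentence.

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "He stood up quickly".split()
print(ngrams(tokens, 2))  # bigrams:  [('He', 'stood'), ('stood', 'up'), ('up', 'quickly')]
print(ngrams(tokens, 3))  # trigrams: [('He', 'stood', 'up'), ('stood', 'up', 'quickly')]
```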
To see how the model works further, we need to assign a probability to each word that can follow a given context. Depending on the sentence, the user can choose the model, which could be a bigram, a trigram, or an even higher order for more complex strings. To illustrate this we need a 'corpus', i.e. a collection of sentences. Let us look at a few small sentences to illustrate the idea.
1. He went to New Delhi.
2. New Delhi has nice weather.
3. It is raining in the new city.
Let us assume one of the models mentioned above. What if we take a bigram model? The bigram probability is then
[number of times the previous word appears immediately before the word we want to predict] / [number of times the previous word appears anywhere in the entire set of lines].
Let us find the probability that the word 'Delhi' comes after the word 'New'. For this, we take the number of times 'New Delhi' appears divided by the number of times 'New' appears in the 3 lines above:
P(Delhi | New) = (number of times 'New Delhi' appears) / (number of times 'New' appears)
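As a small illustration, here is a minimal Python sketch of this bigram estimate on the three example sentences above, assuming a very simple whitespace tokenization with the text lower-cased:

```python
# A minimal sketch of the bigram estimate P(Delhi | New) = count(New Delhi) / count(New).
from collections import Counter

corpus = [
    "He went to New Delhi .",
    "New Delhi has nice weather .",
    "It is raining in the new city .",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

p = bigrams[("new", "delhi")] / unigrams["new"]
print(p)  # 2/3, since 'new' appears three times and is followed by 'delhi' twice
```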
Bigram models work well with many different kinds of corpora. A bigram model may even work better than a trigram model when the training data is limited, because its counts are less sparse. However, when we have a lot of data it becomes much more feasible to work with 4-gram and 5-gram models. One also needs to analyse the context on both the right and the left of a position before making any adjustments or assumptions about it. Now we will look at how unsmoothed N-gram models are evaluated.

2.2. Evaluating N-Grams


When we have a model like the N-gram model, we must know how effective it is in practice as well as in hypothetical cases. If the application that uses the model improves over time and across new data after a change, we can be confident that the model's performance is improving. To analyse the model in this way, we conduct something known as 'extrinsic evaluation'.
Extrinsic evaluation is an end-to-end method: it measures the model from its input all the way to the final application output. We simply run the whole system twice, once with each slight tweak of the model, to get a complete picture of the analysis being conducted by the model. Once both runs are done, we compare them. However, the worst part about a test like this is the cost: running an NLP system twice just for testing can cost a lot.
Because of this cost and other factors, the previous form of evaluation is accurate but not very practical in the long run, so another method was devised. It simply fixes a certain metric that stays constant across all models. Once a model is trained, it can be compared with others on this metric even without a full application built around it. Such testing of a model is known as 'intrinsic evaluation'.
Now, to use this 'intrinsic evaluation' we need to train on the corpus that we have, and hence we need something known as a 'test set'. Just as with many other tests in the world of statistics, evaluating an N-gram model relies heavily on held-out data from a corpus. So we split the corpus into a 'training set' and a 'test set' whose data is unseen by the system. This lets us check how the model handles data with features not present in the training data but which might appear in future datasets.
Once we tune our model with the help of the sets given above, we notice that the algorithm becomes tuned explicitly to the features present in these sets. This makes it perform poorly on new features that could appear in future datasets. This is why people have come up with a 'development test set' or 'devset' to solve this problem. A common split used by people who work with the data is 80% for the training data, 10% for the test data, and the remaining 10% for the development set.
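A minimal sketch of such an 80/10/10 split (the sentence list and the shuffling step here are placeholders, not from the notes):

```python
# A minimal sketch of splitting a corpus into training (80%), development (10%)
# and test (10%) sets.
import random

sentences = [f"sentence {i}" for i in range(100)]  # placeholder corpus
random.shuffle(sentences)                          # shuffle before splitting

n = len(sentences)
train = sentences[: int(0.8 * n)]
dev = sentences[int(0.8 * n): int(0.9 * n)]
test = sentences[int(0.9 * n):]
print(len(train), len(dev), len(test))  # 80 10 10
```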

2.3. Smoothing

A lot of the time, while testing our model, we come across words that are unknown to us, or words that are perfectly meaningful but occurred rarely or not at all in the training data. Because of this, the algorithm might assign them a zero probability. If that happens, any sentence containing them gets a zero probability as well, making our result set null and void because of the negligible importance given to those words. The process that assigns some probability mass to these unseen or rare events is known as smoothing or discounting. A few ways of doing this smoothing are given below.

Laplace Smoothing
Before we normalize the counts into probabilities, we add 1 to every count (for a bigram model, to every bigram count). All counts are thereby raised by one, e.g. a count of 2 becomes 3, a count of 3 becomes 4, and a count of 0 becomes 1. This is the simplest way of smoothing; it is not of great use in modern N-gram models, but it does a great job of getting one familiar with concepts that are key to learning other types of smoothing.
Let us get an idea of this smoothing with the help of a unigram model,
where, w = a word
c = the (unnormalized) count of w in the corpus
N = total number of word tokens (after tokenization)

Therefore, P(w) = c / N
Now, as explained above, Laplace smoothing increases each word count by 1. After making this increment, we also need to adjust the denominator to account for the extra count added to each of the W word types in the vocabulary.
w = a word
c = the count of w in the corpus
N = total number of word tokens (after tokenization)
W = number of distinct word types in the vocabulary

Therefore, P(Laplace)(w) = (c + 1) / (N + W)
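A minimal sketch of this add-one formula on a toy token list (the corpus here is only an example):

```python
# A minimal sketch of unigram Laplace (add-one) smoothing: P_Laplace(w) = (c + 1) / (N + W).
from collections import Counter

tokens = "he went to new delhi new delhi has nice weather".split()
counts = Counter(tokens)
N = len(tokens)     # total number of tokens (10)
W = len(counts)     # vocabulary size, i.e. distinct word types (8)

def p_laplace(word):
    return (counts[word] + 1) / (N + W)

print(p_laplace("delhi"))    # seen word:   (2 + 1) / (10 + 8)
print(p_laplace("mumbai"))   # unseen word still gets non-zero probability: 1 / (10 + 8)
```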

2.4. Interpolation and Back off – Word Classes, Part-of-Speech Tagging, Rule-based,
Stochastic, and Transformation-based Tagging

We learned about smoothing, which gets rid of the problem of assigning no probability to unseen words. However, there are other, related problems to take care of, which are discussed below. Some cases that a trigram model cannot estimate reliably can instead be handled with a bigram model, and some bigram cases can be handled with a unigram model.
Using less context in this way is a good way to let the model generalize. There are two different ways in which we can do this, described next: interpolation and back off.

1.13.1 Interpolation (Jelinek-Mercer Smoothing)


Interpolation, which is also a form of smoothing, is sometimes used in place of back off. The process of interpolation mixes the probability estimates of all the available n-gram estimators. This is done by deliberately weighting the estimates from the 3-gram, 2-gram, and unigram models.
In simple interpolation, the different n-gram orders mentioned above are combined. Mixing all the models gives us the probability P(wn | wn−2 wn−1) as a weighted sum of the individual estimates. If λ denotes the weight coefficients, we get the following expression:

P^(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ3 P(wn), where λ1 + λ2 + λ3 = 1

There is also a slightly more complex version in which each λ is set according to the context in which it is used. The assumption made here is that where the trigram counts are reliable they should be trusted more than the bigram and unigram counts, so the λ weights become functions of the preceding words.
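A minimal sketch of simple linear interpolation, assuming fixed illustrative λ weights and toy probability tables (in practice the λ values are tuned on held-out data):

```python
# A minimal sketch of linear interpolation of unigram, bigram and trigram estimates.

def interpolated_prob(w, w1, w2, p_uni, p_bi, p_tri, lambdas=(0.6, 0.3, 0.1)):
    """P_hat(w | w2 w1) = l1*P(w | w2 w1) + l2*P(w | w1) + l3*P(w)."""
    l1, l2, l3 = lambdas
    return (l1 * p_tri.get((w2, w1, w), 0.0)
            + l2 * p_bi.get((w1, w), 0.0)
            + l3 * p_uni.get(w, 0.0))

# Toy probability tables (invented for illustration only).
p_uni = {"delhi": 0.05}
p_bi = {("new", "delhi"): 0.66}
p_tri = {("to", "new", "delhi"): 1.0}
print(interpolated_prob("delhi", "new", "to", p_uni, p_bi, p_tri))
# 0.6*1.0 + 0.3*0.66 + 0.1*0.05 = 0.803
```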

1.13.2 Back off (Katz Smoothing)


1. Back off is the process where the user first uses the trigram as the source of evidence for the NLP algorithm. If there is not enough evidence for it, the user can switch to a bigram model, and thereafter to a unigram model. In short, the user "backs off" from the higher-order model whenever there is zero evidence for it after discounting. This allows the lower-order n-gram models to come into the picture.
2. In the algorithm, if we notice that the count of the desired n-gram in the corpus is zero, we allow the model to be reduced to one that uses (N-1)-grams. This process is simply continued until the model has enough evidence to give an appropriate estimate. The higher-order n-grams have to save some probability mass, through discounting, so that it can be handed down.
3. Notice that if we simply added the lower-order probabilities on top of the undiscounted higher-order counts, the total probability assigned to the words in the string would come out greater than 1! This is why Katz back off relies on discounting, carried out through a discounted probability P* (P-star). We also need a function α (alpha) to distribute the left-over probability mass fairly among the lower-order estimates. This algorithm is mainly used together with the Good-Turing algorithm for smoothing; collectively this is known as the Good-Turing back off method, and it provides an estimate for n-grams with zero counts in the corpus.
4. P(BO), the back off probability, is given (for the trigram case) as:

P(BO)(wn | wn−2 wn−1) = P*(wn | wn−2 wn−1), if C(wn−2 wn−1 wn) > 0
                      = α(wn−2 wn−1) × P(BO)(wn | wn−1), otherwise
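The sketch below illustrates only the back off control flow (fall to a lower-order estimate when the higher-order count is zero); it deliberately omits the P* discounting and the α weights of full Katz back off, and the corpus and counts are toy examples:

```python
# A minimal, simplified sketch of the back off idea: use the trigram estimate if its
# count is non-zero, otherwise back off to the bigram, then to a smoothed unigram.
# Full Katz back off would additionally discount (P*) and reweight (alpha).
from collections import Counter

tokens = "he went to new delhi new delhi has nice weather".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def backoff_prob(w, w1, w2):
    """Rough estimate of P(w | w2 w1) with naive back off."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return bi[(w1, w)] / uni[w1]
    return (uni[w] + 1) / (sum(uni.values()) + len(uni))  # add-one unigram floor

print(backoff_prob("delhi", "new", "to"))        # trigram 'to new delhi' is observed
print(backoff_prob("delhi", "weather", "nice"))  # no evidence: falls back to the smoothed unigram
```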

Comparison between back off and interpolation


1. Both models (interpolation and back off) involve higher- and lower-order n-gram models to compute the final result during smoothing.

2. Even in the case of a non-zero higher-order count, the process of interpolation still uses the data given by the lower n-gram models.
3. In the case of a non-zero higher-order count, the process of back off does not use the data given by the lower n-gram models; it falls back to them only when the count is zero.

4. One can create an interpolated version of a back off algorithm, and the same goes for creating a back off version of an interpolated algorithm.
1.13.3 Word Classes
When we look at the topic of word classes, we find that the English language gives us a very simple picture of them. A good knowledge of the different word classes helps us understand the morphology and other lexical analyses happening in language processing. We can start with the simple examples of word classes given below and then move forward.
1. Basic Word Classes
• Noun – N
• Verb – V
• Adjective – Adj
• Adverb – Adv
2. Closed Classes
• Determiners – the, an, a
• Pronouns – he, she, it, they, etc.
• Prepositions - over, from, with, under, around, etc.

1.13.4 Part of Speech (POS) Tagging


When we talk about the different word classes above, we get a clear idea of the English word classes we are dealing with here. Below is a table of the tags given to the different parts of speech, along with examples of each, to get you familiar with them. In POS tagging, each word is tagged with its word class, and the tag is then used as a token during the analysis process.

WORD CLASS TAG EXAMPLE


Singular Noun NN House, school
Plural Noun NNS Boys, girls
Singular Proper Noun NNP APPLE, OPPO
Plural Proper Noun NNPS Anands, Geetas
Base Verb VB Work, play
Verb past tense VBD Worked, played
Verb Gerund VBG Working, Playing
Verb Past Participle VBN Driven
Adjective JJ Beautiful, long
Comparative Adjective JJR Bigger, Longer
Superlative Adjective JJS Deadliest
Adverb RB Easily, beautifully
Preposition IN Under, around, near
Conjunction CC And, or, neither, nor
Determiner DT The, a, an
Foreign Word FW Mea Culpa
Existential EX (is) there
Personal pronoun PRP You, He
Possessive pronoun PRP$ Yours, Theirs
List / Item Marker LS 3,4, three, four
Modal MD Must, might
Cardinal Number CD Nine, ten
Interjection UH Ouch!, Alas!
Dollar Sign $ $
Pound Sign # #
Left Quote “ ‘, “
Right Quote ” ’, ”
Left Parentheses ( (, [, {, <
Right Parentheses ) ), ], }, >
Comma , ,
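As an illustration (assuming the NLTK library with its default English tokenizer and tagger models is available; exact output and resource names may vary by version), these Penn Treebank tags can be produced automatically:

```python
# A minimal sketch of producing Penn Treebank POS tags with NLTK.
import nltk

# Resource names may differ slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("He went to New Delhi quickly.")
print(nltk.pos_tag(tokens))
# e.g. [('He', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('New', 'NNP'),
#       ('Delhi', 'NNP'), ('quickly', 'RB'), ('.', '.')]
```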

1.13.5 Rule-Based Tagging


This is one of the earliest known methods of POS tagging. Rule-based taggers simply use the word-class tags that are listed for each word in the lexicon or the corpus. Sometimes a word has more than one possible tag; to deal with this ambiguity, a set of hand-written rules has been made. Rule-based tagging is a process that goes through 2 stages, given below.
1. Stage One – In this stage, the language processing uses the dictionary (lexicon) that has been included to give each word its possible POS tags, which can be used in the tagging process later.
2. Stage Two – A list of rules is then used to finalize every word, i.e. to select and assign it one particular POS tag.
Rules for Rule-Based tagging
• The taggers used are knowledge-driven.
• The rules involved in finalizing the POS tags are all made manually.
• The information is coded in the form of rules.
• There are a limited number of rules.
• Smoothing and language modelling are defined explicitly within the rules of the tagger.

1.13.6 Stochastic Tagging


There is another family of models made to complete the process of tagging, known as stochastic POS tagging. Any model that decides tags using frequencies or probabilities, i.e. statistics, can be classified as a stochastic model and used for this purpose; the name comes from the model's heavy use of statistics to get the desired results.
Word Frequency Approach
In the word frequency approach, the tagger disambiguates words mainly based on the probability with which a word occurs with a particular tag. The tag assigned to a word most often in the training set is the one used for that word in the main tagging process. Many taggers follow different approaches, and this is one of the most common ones.
The N-Gram approach is very similar, because it also works with probabilities assigned over the model. As explained above, the best tag for a word depends on the tags assigned to the previous words in the sequence.

Properties
• The following type of tagging is based on all the statistical inferences that
have been made to tag words that are part of the corpus.
• A training set needs to be assigned to the model to get an idea of the
instances that have occurred in the past.
• A probability would also be assigned to the words that may not appear
in the training corpus.
• The training and testing corpora are different from each other.
• It simply uses the tags which appear most frequently during the training
process using a training corpus set.
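A minimal sketch of the word frequency approach described above, using a tiny invented tagged training set (a real tagger would be trained on a full tagged corpus):

```python
# A minimal sketch of the word frequency approach: remember the most frequent tag
# for each word in a tagged training set, then assign that tag to new occurrences.
from collections import Counter, defaultdict

training = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
            ("the", "DT"), ("run", "VBP"), ("fast", "RB"),
            ("a", "DT"), ("run", "NN"),
            ("they", "PRP"), ("run", "VBP")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Return the tag seen most often with this word in training (default for unseen words)."""
    return tag_counts[word].most_common(1)[0][0] if word in tag_counts else default

print(most_frequent_tag("run"))   # 'VBP', since 'run' was tagged VBP twice and NN once
print(most_frequent_tag("cats"))  # unseen word falls back to the default tag 'NN'
```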

1.13.7 Transformation Based Tagging


This type of tagging is also known by the name 'Brill tagging'. This form of tagging does not use probabilities or statistical inference to assign its final tags. Instead, the algorithm uses a set of transformation rules to place POS tags on the corpus: it transforms the text from one tagging state to another through the application of pre-defined rules.
The model draws heavily on the previous two approaches (rule-based and stochastic) to arrive at the rules laid down by the algorithm for the tagging process. There is a fixed protocol followed by this tagging process, as illustrated in the sketch after this list:
1. Start with a solution – The program starts with a solution that is familiar to it, based on past problems, and then proceeds in cycles.
2. Choose the most appropriate transformation – The most useful transformation is automatically adopted.
3. Apply it – The chosen transformation is then applied to the current solution of the problem.
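A minimal sketch of applying one Brill-style transformation rule (the rule and the initial tagging are a classic illustrative example, not the full Brill learner):

```python
# A minimal sketch of one Brill-style transformation: start from an initial tagging,
# then rewrite tags in place wherever the rule's context matches.

def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag whenever the previous token's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

# Initial (most-frequent-tag) tagging of "to race": 'race' is wrongly tagged as a noun.
initial = [("to", "TO"), ("race", "NN")]
# Classic transformation: change NN to VB when the previous tag is TO.
print(apply_rule(initial, "NN", "VB", "TO"))   # [('to', 'TO'), ('race', 'VB')]
```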

2.5. Issues in PoS Tagging – Hidden Markov and Maximum Entropy Models

1.14.1 Hidden Markov Model


A Hidden Markov Model (HMM) is a stochastic tagging model in which the underlying process that produces the tags is hidden. The result is just as accurate, if not more so, but the process by which it is produced cannot be seen. Only the output of each step of the sequence can be seen and comprehended.
To understand this a little better, we need to understand the term Markov chain. This is simply a model that makes inferences about sequences from the probabilities of moving from one state to the next across the corpus. To make a prediction, the Markov chain relies only on the current position (state) of the tag or word in the string. The Hidden Markov Model allows the user to work with both visible events (the input words) and hidden events (the POS tags).
Let us take an example to understand the HMM process better. Say we have to build an HMM for a problem where coins are tossed in a certain sequence. To get a better understanding we can break this down into 2 kinds of events.
1. Visible Events
These are the events that we know are happening and can physically or virtually picture. In this case, it is the tossing of the coins: we are sure coins are being tossed, each showing either 'heads' or 'tails' as the outcome.
2. Hidden Events
These are the factors that are completely hidden from the user. In this case, they could be the number of coins used in the problem, or the order in which the coins are tossed if they are different or weighted. One could draw many different models to describe this process. One form of the model is given below.

Fig. 2: The hidden events during the transformation, as shown in the diagram.

Each hidden state in such a model carries a tag, and the sequence of hidden states generates the outputs. In other words, the tags live in the hidden states, and each state emits an output that is observable to the user.
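To make the hidden/visible distinction concrete, here is a minimal sketch of an HMM tagger with Viterbi decoding; the two tags and all probabilities are invented purely for illustration:

```python
# A minimal sketch of an HMM for tagging: hidden states are tags, observations are
# words, and Viterbi decoding recovers the most likely hidden tag sequence.
states = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.8, "VB": 0.2}}
emit_p = {"NN": {"dogs": 0.5, "run": 0.1}, "VB": {"dogs": 0.1, "run": 0.6}}

def viterbi(obs):
    # V[t][s] = probability of the best tag sequence ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s].get(obs[t], 1e-6), p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state to recover the full tag sequence.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["dogs", "run"]))  # expected: ['NN', 'VB']
```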
1.14.2 Maximum Entropy Markov Model (MEMM)
This model is used as an alternative to the Markov model above. In this model, the separate distributions of the sequence are replaced by a single function that we can define. The model tells us the probability of the predicted state s given the previous state s' as well as the current observation o, written P(s | s', o).
The main difference to keep in mind is that an HMM's transitions depend only on the previous state, whereas an MEMM conditions its prediction on the previous state as well as the current observation. This can give better accuracy over the entire tagging process. The difference can also be seen by comparing the transition diagrams of the two models.
Disadvantages of MEMM’s
MEMM models have a problem due to the label bias problem. Once the state has been
selected, the observation which is going to succeed the current one will select all of the
transitions that might be leaving the state.

2. There are a few examples in words such as “rib” and “rob” which when placed
in the algorithm in their raw form have a little different path as compared to the other
words. When we have “r_b” there is a path that goes through 0-1 and 0-4 to get the
probability mass function. Both the paths would be equally likely not allowing the
model to select one fixed word that could be predicted by the model correctly.
HMM, and MEMM models are used to tag and have their pros and cons which can be
decided based on the model that needs to be adapted.

Perplexity
One of the methods used to make sure that models are evaluated correctly is 'perplexity'. We cannot use raw probabilities as a metric when evaluating language models, so a variant of them, known as perplexity (PP), is used instead. The perplexity of an NLP model on a test set is the inverse probability of that set, normalized by the number of words.
Let us say we have a test set W = w1 w2 w3 w4 … wN.
The probability is expanded with the help of the chain rule, followed by the assumption that the model is a 2-gram or bigram model as given above. The result is the perplexity, which gives us a clear measure when evaluating a model.

These are the formulas:

PP(W) = P(w1 w2 … wN) ^ (-1/N)
      = [ 1 / P(w1 w2 … wN) ] ^ (1/N)
For a bigram model: PP(W) = [ Π (i = 1 to N) 1 / P(wi | wi−1) ] ^ (1/N)

The final result we can conclude is that, because perplexity is an inverse probability, the higher the conditional probability of W, the lower the perplexity (PP).
We can also try to understand the concept of perplexity with the help of a factor called the 'weighted average branching factor'. This factor simply tells the user how many words, on average, can follow a given word. Let us take a simple example: a language whose symbols are the first 20 numbers, so the set S = {1, 2, 3, 4, …, 20}, where each number occurs with probability P = 1/20. If we treat this as a language, we notice something interesting about the measure of its perplexity. As mentioned before, perplexity is an inverse (normalized) probability, which is precisely why the answer comes out inverted, as shown below.
We know,
Perplexity = PP(S) = P(s1 s2 s3 … sN) ^ (-1/N)
= [ (1/20)^N ] ^ (-1/N) = (1/20) ^ (-1) = 20
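A minimal sketch of this calculation (the sequence length N is arbitrary, since it cancels out):

```python
# A minimal sketch of the uniform-language perplexity example above: with 20 equally
# likely symbols, each with probability 1/20, the perplexity comes out to 20.
import math

N = 40                    # length of an arbitrary test sequence over the 20 symbols
p_each = 1 / 20           # probability of every symbol under the uniform model
log_prob = N * math.log(p_each)          # log P(s1 ... sN)
perplexity = math.exp(-log_prob / N)     # PP = P(s1 ... sN) ** (-1/N)
print(perplexity)         # ~20 (exactly 20 up to floating point rounding)
```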
