Probabilistic Language Modeling
Nakul Dave
Assistant Professor
Computer Engineering Department
Vishwakarma Government Engineering College - Ahmedabad
Outline
Language Modeling - What it is
Why Language Modeling - Use cases
N-gram Language Models
Chain Rule
Challenges
Language Model Evaluation
Use cases
How do they do it?
Figure: Word prediction in a sequence
Figure: Word Prediction
Downstream Applications - Language Modeling
Sentence Completion
Machine Translation
OCR and Handwriting Recognition
Speech Recognition
Use Cases - Language Modeling
Gmail Smart Compose
Spelling Correction
Machine Translation
Gmail Smart Compose
Predicts the completion of a sentence as you type
Real-time suggestions in Gmail that assist users in writing emails
Figure: Gmail Smart Compose
Source of image: https://encrypted-tbn0.gstatic.com/
Sequence Prediction
Figure: Google’s Sequence Prediction
Spelling Correction
Figure: Microsoft Word spelling correction
Context-Sensitive Spelling Correction
Natural Language Proposing
Natural Language Processing
An Example
P(Natural Language Proposing) < P(Natural Language Processing)
P(Proposing | Natural, Language) < P(Processing | Natural, Language)
Applications - Probabilistic Language Modeling
Speech Recognition
P(ice cream) > P(I scream)
Figure: Homophones
Source of image: http://yourdictionary.com/examples-of-homophones.html
Machine Translation
Which phrase is more probable in the target language?
P(high winds) > P(large winds)
Figure: Google MT - Gujarati to English
Introduction
Language Modeling
Definition
A language model is a stochastic process for computing the probability of a
sentence or sequence of words:
P(W) = P(w_1, w_2, w_3, ..., w_n)
Probabilistic approach: estimate how likely a sentence is to occur, given its
sequence of words
In the probabilistic view, a language model assigns a probability P(W) to
every possible word sequence W.
Language Modeling
John Rupert Firth - An English linguist
“You shall know a word by the company it keeps.”
— Firth, J.R.
Completion prediction
Please turn off your cell.... phone
Your program does not... run
I call you... back
Thank you for your kind... help
Language Modeling
Figure: Language Model
Probabilistic Language Modeling
Function 1: Compute the probability of a sequence of words
P(W) = P(w_1, w_2, w_3, ..., w_n)
Function 2: Compute the probability of an upcoming word
P(w_4 | w_1, w_2, w_3)
A model that computes either of these is called a language model.
Types of Language Models
Unigram Language Model [the word does not depend on any previous word]
Bigram Language Model [the word depends on the previous word]
Trigram Language Model [the word depends on the previous two words]
N-gram Language Model [the word depends on the previous N-1 words]
(an extraction sketch for these n-grams follows below)
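To make these definitions concrete, here is a minimal sketch (hypothetical helper name, plain whitespace tokenization assumed) that extracts unigrams, bigrams, and trigrams from a sentence:

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all n-grams (as tuples) in the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is important".split()  # naive tokenization

print(ngrams(tokens, 1))  # unigrams: ('natural',), ('language',), ...
print(ngrams(tokens, 2))  # bigrams:  ('natural', 'language'), ...
print(ngrams(tokens, 3))  # trigrams: ('natural', 'language', 'processing'), ...
```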
N-gram Language model
Figure: N-gram Language Modeling
Source of image: https://encrypted-tbn0.gstatic.com
Word Count in N-gram Language Models
There are N tokens in the collection.
Unigram Language Model
P(w_i) = Count(w_i) / N
Bigram Language Model
P(w_i | w_{i-1}) = Count(w_{i-1} w_i) / Count(w_{i-1})
Trigram Language Model
P(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2} w_{i-1} w_i) / Count(w_{i-2} w_{i-1})
N-gram Language Model
P(w_i | w_{i-(N-1)}, ..., w_{i-1}) = Count(w_{i-(N-1)} ... w_{i-1} w_i) / Count(w_{i-(N-1)} ... w_{i-1})
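As a sanity check on these formulas, the following minimal sketch (toy in-memory corpus, whitespace tokenization, sentence boundaries ignored for simplicity) turns unigram and bigram counts into the relative-frequency estimates above:

```python
from collections import Counter

corpus = [
    "natural language processing is important",
    "natural language models predict words",
]
tokens = [w for line in corpus for w in line.split()]

N = len(tokens)
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # note: spans sentence boundaries

def p_unigram(w):
    # P(w_i) = Count(w_i) / N
    return unigram_counts[w] / N

def p_bigram(w, prev):
    # P(w_i | w_{i-1}) = Count(w_{i-1} w_i) / Count(w_{i-1})
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_unigram("natural"))             # 2 / 10 = 0.2
print(p_bigram("language", "natural"))  # 2 / 2  = 1.0
```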
Language Modeling
Computing P(W)
How do we compute the joint probability P(natural, language, processing, is, important)?
We rely on the chain rule of probability.
Chain Rule
Conditional Probability
The probability that B occurs given that A has already occurred:
P(B | A) = P(A, B) / P(A)
Multiple Variables
The probability of seeing word D given that A, B, and C have already been
seen in some sentence:
P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
Generalized form of the Chain Rule
P(w_1, w_2, w_3, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ... P(w_n | w_1, ..., w_{n-1})
Language Modeling
An Example
P (Natural Language Processing is important)
Computing Probability Values
Product of probability values using the chain rule:
P(w_1 w_2 ... w_n) = ∏_i P(w_i | w_1 w_2 ... w_{i-1})
P(Natural Language Processing is important)
= P(Natural) × P(Language | Natural) × P(Processing | Natural Language)
× P(is | Natural Language Processing) × P(important | Natural Language Processing is)
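The decomposition itself is mechanical. The sketch below (a hypothetical helper, not tied to any particular model) simply enumerates the conditional factors whose product gives P(W):

```python
def chain_rule_factors(tokens):
    """List the chain-rule factors P(w_i | w_1 ... w_{i-1}) as readable strings."""
    factors = []
    for i, w in enumerate(tokens):
        history = " ".join(tokens[:i])
        factors.append(f"P({w} | {history})" if history else f"P({w})")
    return factors

for f in chain_rule_factors("Natural Language Processing is important".split()):
    print(f)
# P(Natural)
# P(Language | Natural)
# P(Processing | Natural Language)
# P(is | Natural Language Processing)
# P(important | Natural Language Processing is)
```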
Language Modeling
Example
A language model also computes the probability of the next word in the sequence:
P(w_4 | w_1, w_2, w_3)
Google Ngram Example
Figure: Google Ngram - How are *
Google Ngram Viewer
Google Ngram Example
Figure: Google Ngram - Natural Language *
Google Ngram Example
Figure: Google Ngram - Natural * Processing
Maximum Likelihood Estimation (MLE)
MLE
Counts are converted to probabilities by choosing the parameter values that
make the observed data "most probable".
P(w_i | w_{i-1}) = Count(w_{i-1} w_i) / Count(w_{i-1})
or, writing C for Count,
P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
An example
P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
Corpus:
<s> I am here </s>
<s> who am I </s>
<s> I would like to know </s>
Estimating bigrams:
P(I | <s>) = 2/3
P(</s> | here) = 1
P(would | I) = 1/3
P(here | am) = 1/2
P(know | like) = 0
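These values are easy to verify programmatically. A minimal sketch, treating <s> and </s> as ordinary tokens (an assumption about how the corpus is tokenized):

```python
from collections import Counter

corpus = [
    "<s> I am here </s>",
    "<s> who am I </s>",
    "<s> I would like to know </s>",
]
sents = [sent.split() for sent in corpus]

unigrams = Counter(w for sent in sents for w in sent)
bigrams = Counter(pair for sent in sents for pair in zip(sent, sent[1:]))

def p_mle(w, prev):
    # P(w | prev) = C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))      # 2/3
print(p_mle("</s>", "here"))  # 1.0
print(p_mle("would", "I"))    # 1/3
print(p_mle("here", "am"))    # 0.5
print(p_mle("know", "like"))  # 0.0
```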
Bigram counts from 9222 Restaurant Sentences
i want to eat chinese food lunch spend
i 5 827 0 9 0 0 0 2
want 2 0 608 1 6 6 5 1
to 2 0 4 686 2 0 6 211
eat 0 0 2 0 16 2 42 0
chinese 1 0 0 0 0 82 1 0
food 15 0 15 0 1 4 0 0
lunch 2 0 0 0 1 0 0 0
spend 1 0 1 0 0 0 0 0
Table: Bigrams Counts for Restaurant Corpus
Normalized by Unigram - Counts to Probabilities
Unigram Counts
i want to eat chinese food lunch spend
2533 927 2417 746 158 1093 341 278
Bigram Probabilities
i want to eat chinese food lunch spend
i 0.002 0.33 0 0.0036 0 0 0 0.00079
want 0.0022 0 0.66 0.0011 0.0065 0.0065 0.0054 0.0011
to 0.00083 0 0.0017 0.28 0.00083 0 0.0025 0.087
eat 0 0 0.0027 0 0.021 0.0027 0.056 0
chinese 0.0063 0 0 0 0 0.52 0.0063 0
food 0.014 0 0.014 0 0.00092 0.0037 0 0
lunch 0.0059 0 0 0 0 0.0029 0 0
spend 0.0036 0 0.0036 0 0 0 0 0
Computing Sentence Probabilities
P(<s> I want english food </s>)
= P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
= 0.000031
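The product can be checked directly from bigram probabilities. P(I | <s>), P(want | I), and P(english | want) appear in the tables above; P(food | english) and P(</s> | food) are not shown on this slide, so the values below (0.5 and 0.68) are assumptions taken from the same restaurant-corpus example as commonly reported, and the result should be read as approximate:

```python
# Bigram probabilities; the last two values are assumed (not in the table above).
probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,    # assumed
    ("food", "</s>"): 0.68,      # assumed
}

sentence = ["<s>", "I", "want", "english", "food", "</s>"]

p = 1.0
for prev, w in zip(sentence, sentence[1:]):
    p *= probs[(prev, w)]

print(f"P(sentence) ≈ {p:.6f}")  # ≈ 0.000031
```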
What knowledge does an n-gram model represent?
P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | < s > ) = .25
Challenges
An n-gram model may still be missing some word sequences, no matter how
large the training corpus is
The probability computation may therefore yield zero for some sequences of words
The trained model treats a sequence as impossible if it does not occur in
the training set
An Example
Test data: I want to spend money.
P(<s> I want to spend money </s>)
= P(I | <s>) × P(want | I) × P(to | want) × P(spend | to) ×
P(money | spend) × P(</s> | money)
Probability of the sentence
= 0.25 × 0.33 × 0.66 × 0.087 × 0 × 0 = 0
The bigrams "spend money" and "money </s>" never occur in the training
corpus, so their MLE probabilities are zero and the whole product collapses to zero.
Laplace Smoothing
Laplace Smoothing (Additive Smoothing)
Laplace smoothing is introduced to solve the zero-probability problem.
Pretend we saw each word (n-gram) one more time than we actually did
Just add one to all the counts!
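Concretely, add-one smoothing replaces the MLE estimate with P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V), where V is the vocabulary size. A minimal sketch, reusing the toy corpus from the earlier example (variable names are illustrative):

```python
from collections import Counter

corpus = [
    "<s> I am here </s>",
    "<s> who am I </s>",
    "<s> I would like to know </s>",
]
sents = [s.split() for s in corpus]

unigrams = Counter(w for sent in sents for w in sent)
bigrams = Counter(p for sent in sents for p in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size

def p_laplace(w, prev):
    # Add-one smoothing: every bigram gets one extra pseudo-count.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("know", "like"))  # no longer zero
print(p_laplace("I", "<s>"))      # (2 + 1) / (3 + V)
```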
Laplace-smoothed bigram counts
i want to eat chinese food lunch spend
i 6 828 1 10 1 1 1 3
want 3 1 609 2 7 7 6 2
to 3 1 5 687 3 1 7 212
eat 1 1 3 1 17 3 43 1
chinese 2 1 1 1 1 83 2 1
food 16 1 16 1 2 5 1 1
lunch 3 1 1 1 2 1 1 1
spend 2 1 2 1 1 1 1 1
Table: Laplace smoothed bigram counts
Smoothing techniques
Laplacian smoothing (Add-1 smoothing)
Add-K smoothing
Backoff and Interpolation (see the interpolation sketch after this list)
Kneser-Ney Smoothing
1 Absolute discounting
2 Kneser-Ney discounting
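Of these, linear interpolation is the simplest to sketch: mix the unigram, bigram, and trigram estimates with weights that sum to one. The estimator functions and lambda weights below are placeholders (in practice the weights are tuned on held-out data):

```python
# Toy placeholder estimators (a real system would compute these from corpus counts).
p_uni = lambda w: 0.05
p_bi = lambda w, prev: 0.20
p_tri = lambda w, prev2, prev1: 0.00   # unseen trigram

def p_interpolated(w, prev1, prev2, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram, and trigram estimates.
    The lambda weights sum to 1 and are normally tuned on held-out data."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)

print(p_interpolated("food", "want", "I"))  # 0.1*0.05 + 0.3*0.20 + 0.6*0.0 = 0.065
```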
Choosing Vocabulary
Open vs. Closed Vocabulary
Criteria for building a closed vocabulary:
1 Minimum word frequency (e.g., 2)
2 Upper bound on the number of words in the vocabulary (e.g., |V| = 10000)
Out-Of-Vocabulary (OOV) Words
Words may appear at test time that did not appear while building the model
These words are "Out-Of-Vocabulary (OOV)" words
They drive the overall sentence or phrase probability to zero
Replace any word that is not in the vocabulary with a special token <UNK>
Estimate the probability of <UNK> like that of any other word in the corpus
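A minimal sketch of both ideas together, using the frequency threshold and size cap from the previous slide as illustrative (not fixed) criteria:

```python
from collections import Counter

train = "I want to eat chinese food . I want to spend less .".split()
min_freq, max_vocab = 2, 10000

counts = Counter(train)
vocab = {w for w, c in counts.most_common(max_vocab) if c >= min_freq}

def map_oov(tokens, vocab):
    # Replace anything outside the closed vocabulary with <UNK>.
    return [w if w in vocab else "<UNK>" for w in tokens]

test = "I want to eat english food .".split()
print(map_oov(test, vocab))
# ['I', 'want', 'to', '<UNK>', '<UNK>', '<UNK>', '.']
```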
Language Model Evaluation
Language Model Evaluation
Does the model predict words and sequences well?
A good model assigns higher probability to frequently observed words
(sentences) than to rarely seen ones
Training and Testing Set
The model is trained on a training set, where its parameters are learned
The model is tested on held-out test data to check its prediction accuracy
Extrinsic Evaluation
Two models, A and B, are built on some dataset
Use each model in some NLP application, such as statistical machine
translation or spelling correction
Measure the accuracy of A and B
Compare the accuracy of A and B
Intrinsic Evaluation - Perplexity
Perplexity measures how well a probability model predicts a word or a
sequence of words.
Intuitively, perplexity can be understood as a measure of uncertainty.
For a bigram model: PP(W) = (∏_{i=1}^{N} 1 / P(w_i | w_{i-1}))^{1/N}   (1)
For a trigram model: PP(W) = (∏_{i=1}^{N} 1 / P(w_i | w_{i-2}, w_{i-1}))^{1/N}   (2)
A lower perplexity indicates a better language model.
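A minimal sketch of bigram perplexity, computed in log space to avoid underflow (the probability table here is a made-up toy, not the restaurant corpus):

```python
import math

# Toy bigram probabilities (illustrative values only).
probs = {
    ("<s>", "I"): 0.5, ("I", "want"): 0.4,
    ("want", "food"): 0.3, ("food", "</s>"): 0.6,
}

def bigram_perplexity(sentence, probs):
    """PP(W) = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the N bigrams."""
    pairs = list(zip(sentence, sentence[1:]))
    log_prob = sum(math.log(probs[pair]) for pair in pairs)
    return math.exp(-log_prob / len(pairs))

print(bigram_perplexity(["<s>", "I", "want", "food", "</s>"], probs))
```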
Summary
Language models predict words or sequences based on probabilities estimated
from the training data
Applications: spelling correction, speech recognition, smart email compose,
machine translation
Neural-network-based language models generally perform better than n-gram models
Thank You All.......