
Probabilistic Language Modeling

Nakul Dave
Assistant Professor
Computer Engineering Department
Vishwakarma Government Engineering College - Ahmedabad



Outline

Language Modeling - What it is
Why Language Modeling - Use cases
N-gram Language Models
Chain Rule
Challenges
Language Model Evaluation


Use cases

How do they do it?

Figure: Word Prediction in the sequence
Figure: Word Prediction


Downstream Applications - Language Modeling

Sentence Completion
Machine Translation
OCR and Handwriting Recognition
Speech Recognition


Use Cases - Language Modeling

Gmail Smart Compose
Spelling Correction
Machine Translation


Gmail Smart Compose

Predicts how the user's sentence will be completed
Provides real-time suggestions in Gmail that assist users in writing mails

Figure: Gmail Smart Compose¹
¹ Source of image: https://encrypted-tbn0.gstatic.com/
Sequence Prediction

Figure: Google's Sequence Prediction


Spelling Correction

Figure: Microsoft Word spelling correction


Context-Sensitive Spelling Correction

"Natural Language Proposing" vs. "Natural Language Processing"

An Example
P(Natural Language Proposing) < P(Natural Language Processing)
P(Proposing | Natural, Language) < P(Processing | Natural, Language)


Applications - Probabilistic Language Modeling

Speech Recognition
P(Ice cream) > P(I scream)

Figure: Homophones²
² Source of image: http://yourdictionary.com/examples-of-homophones.html
Machine Translation

Which phrase is more probable in the target language?
P(high winds) > P(large winds)

Figure: Google MT - Gujarati to English


Introduction

Language Modeling

Definition
A language model is a stochastic process for computing the probability of a sentence or sequence of words:

P(W) = P(w1, w2, w3, ..., wn)

Probabilistic approach: determine how probable a sentence is, given its sequence of words.
In the probabilistic view, the language model assigns a probability P(W) to every possible word sequence W.


John Rupert Firth, an English linguist:
"You shall know a word by the company it keeps."
— Firth, J.R.


Completion prediction

Please turn off your cell... phone
Your program does not... run
I call you... back
Thank you for your kind... help


Figure: Language Model


Probabilistic Language Modeling

Function 1: Compute the probability of a sequence of words:
P(W) = P(w1, w2, w3, ..., wn)

Function 2: Compute the probability of an upcoming word:
P(w4 | w1, w2, w3)

A model that computes either of these is called a language model.
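A rough sketch of this two-function view in Python; the class and method names are hypothetical illustrations, not taken from the slides or from any particular library.

```python
from typing import Sequence


class LanguageModel:
    """Hypothetical interface: anything providing these two functions
    qualifies as a language model in the sense above."""

    def sequence_probability(self, words: Sequence[str]) -> float:
        # Function 1: P(w1, w2, ..., wn)
        raise NotImplementedError

    def next_word_probability(self, word: str, history: Sequence[str]) -> float:
        # Function 2: P(wn | w1, ..., wn-1)
        raise NotImplementedError
```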


Types of Language Models

Unigram Language Model [the word does not depend on any previous words]
Bigram Language Model [the word depends on the previous word]
Trigram Language Model [the word depends on the previous two words]
N-gram Language Model [the word depends on the previous N-1 words]


N-gram Language Model

Figure: N-gram Language Modeling³
³ Source of image: https://encrypted-tbn0.gstatic.com

Word Counts in N-gram Language Models

There are N tokens in the collection.

Unigram Language Model
P(Wi) = Count(Wi) / N

Bigram Language Model
P(Wi | Wi−1) = Count(Wi−1 Wi) / Count(Wi−1)

Trigram Language Model
P(Wi | Wi−2, Wi−1) = Count(Wi−2 Wi−1 Wi) / Count(Wi−2 Wi−1)

N-gram Language Model
P(Wi | Wi−(N−1), ..., Wi−1) = Count(Wi−(N−1) ... Wi−1 Wi) / Count(Wi−(N−1) ... Wi−1)
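A minimal Python sketch of how such counts become probability estimates; the toy corpus and variable names below are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

# Illustrative toy corpus (any whitespace-tokenized text would do).
corpus = ("natural language processing is important "
          "language models assign probability to word sequences")
tokens = corpus.split()
N = len(tokens)  # total number of tokens in the collection

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_unigram(w):
    # P(Wi) = Count(Wi) / N
    return unigram_counts[w] / N

def p_bigram(w, prev):
    # P(Wi | Wi-1) = Count(Wi-1 Wi) / Count(Wi-1)
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_unigram("language"))            # 2/12
print(p_bigram("language", "natural"))  # 1/1
```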


Language Modeling

Computing P(W)

How do we compute the joint probability
P(natural, language, processing, is, important)?

It relies on the chain rule of probability.


Chain Rule

Conditional Probability
The probability that B occurs given that A has already occurred:
P(B | A) = P(A, B) / P(A)

Multiple Variables
The probability of seeing word D given that A, B and C have already been seen in a sentence:
P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)

Generalized form of the Chain Rule
P(w1, w2, w3, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn−1)
An Example
P(Natural Language Processing is important)


Computing Probability Values

Product of conditional probabilities via the chain rule:
P(w1 w2 ... wn) = ∏i P(wi | w1 w2 ... wi−1)

P(Natural Language Processing is important)
= P(Natural) × P(Language | Natural) × P(Processing | Natural Language)
  × P(is | Natural Language Processing) × P(important | Natural Language Processing is)
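The same chain-rule product, sketched as a small Python helper; `cond_prob` is an assumed callback that supplies P(word | history), not something defined on the slides.

```python
def sentence_probability(words, cond_prob):
    # Chain rule: P(w1..wn) = product over i of P(w_i | w_1..w_{i-1}).
    # cond_prob(word, history) must return P(word | history); here it is
    # a placeholder for any estimator (n-gram table, neural LM, ...).
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])
    return prob

# Usage with a dummy estimator that gives every word probability 0.1:
sentence = "Natural Language Processing is important".split()
print(sentence_probability(sentence, lambda w, h: 0.1))  # 0.1 ** 5 = 1e-05
```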
Example
A language model also computes the probability of the next word in a sequence:

P(w4 | w1, w2, w3)


Google Ngram Example

Figure: Google Ngram - How are *
(Google Ngram Viewer)


Figure: Google Ngram - Natural Language *


Figure: Google Ngram - Natural * Processing


Maximum Likelihood Estimation (MLE)

MLE
Counts are normalized into probabilities; the resulting estimate makes the observed data "most probable".

P(Wi | Wi−1) = Count(Wi−1 Wi) / Count(Wi−1)

or, in shorter notation,

P(Wi | Wi−1) = C(Wi−1 Wi) / C(Wi−1)


An example

Corpus:
<s> I am here </s>
<s> who am I </s>
<s> I would like to know </s>

P(Wi | Wi−1) = C(Wi−1 Wi) / C(Wi−1)

Estimating bigrams:
P(I | <s>) = 2/3
P(</s> | here) = 1
P(would | I) = 1/3
P(here | am) = 1/2
P(know | like) = 0
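These estimates can be reproduced with a few lines of Python; a small sketch assuming the three sentences above, tokenized on whitespace with the <s> and </s> markers kept as tokens.

```python
from collections import Counter

sentences = [
    "<s> I am here </s>",
    "<s> who am I </s>",
    "<s> I would like to know </s>",
]
token_lists = [s.split() for s in sentences]

unigrams = Counter(w for sent in token_lists for w in sent)
bigrams = Counter(pair for sent in token_lists for pair in zip(sent, sent[1:]))

def p(w, prev):
    # MLE bigram estimate: C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p("I", "<s>"))      # 2/3
print(p("</s>", "here"))  # 1/1 = 1.0
print(p("would", "I"))    # 1/3
print(p("here", "am"))    # 1/2
print(p("know", "like"))  # 0/1 = 0.0
```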




Bigram counts from 9222 Restaurant Sentences

i want to eat chinese food lunch spend


i 5 827 0 9 0 0 0 2
want 2 0 608 1 6 6 5 1
to 2 0 4 686 2 0 6 211
eat 0 0 2 0 16 2 42 0
chinese 1 0 0 0 0 82 1 0
food 15 0 15 0 1 4 0 0
lunch 2 0 0 0 1 0 0 0
spend 1 0 1 0 0 0 0 0
Table: Bigram counts for the restaurant corpus



Normalizing by Unigram Counts - from Counts to Probabilities


Unigram Counts
i want to eat chinese food lunch spend
2533 927 2417 746 158 1093 341 278

Bigram Probabilities

i want to eat chinese food lunch spend


i 0.002 0.33 0 0.0036 0 0 0 0.00079
want 0.0022 0 0.66 0.0011 0.0065 0.0065 0.0054 0.0011
to 0.00083 0 0.0017 0.28 0.00083 0 0.0025 0.087
eat 0 0 0.0027 0 0.021 0.0027 0.056 0
chinese 0.0063 0 0 0 0 0.52 0.0063 0
food 0.014 0 0.014 0 0.00092 0.0037 0 0
lunch 0.0059 0 0 0 0 0.0029 0 0
spend 0.0036 0 0.0036 0 0 0 0 0



Computing Sentence Probabilities

P(<s> I want english food </s>)
= P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
= 0.000031
What knowledge does an n-gram model capture?

P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25


Challenges

The n-gram model may still be missing some words, no matter how large the training corpus is
Probability computation may then yield zero for some word sequences
The trained model treats a sequence as impossible if it never appeared in the training set


An Example

Test data: I want to spend money.

P(<s> I want to spend money </s>)
= P(I | <s>) × P(want | I) × P(to | want) × P(spend | to) × P(money | spend) × P(</s> | money)

Probability of the sentence
= 0.25 × 0.33 × 0.66 × 0.087 × 0 × 0 = 0
Laplace Smoothing

Laplace Smoothing (Additive Smoothing)

Laplace smoothing is introduced to solve the zero-probability problem
Pretend we saw each word (N-gram) one more time than we actually did
Just add one to all the counts!
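A hedged sketch of the add-one estimate, assuming `bigrams` and `unigrams` are Counter objects built as in the earlier sketch; V denotes the vocabulary size.

```python
def p_laplace(w, prev, bigrams, unigrams):
    # Add-one (Laplace) smoothed bigram probability:
    #   P(w | prev) = (C(prev w) + 1) / (C(prev) + V)
    # where V is the number of distinct word types in the vocabulary.
    V = len(unigrams)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# With the three-sentence corpus above (V = 10 word types):
# p_laplace("know", "like", bigrams, unigrams) -> (0 + 1) / (1 + 10) ≈ 0.091
```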


Laplace smoothed bigram counts

i want to eat chinese food lunch spend


i 6 828 1 10 1 1 1 3
want 3 1 609 2 7 7 6 2
to 3 1 5 687 3 1 7 212
eat 1 1 3 1 17 3 43 1
chinese 2 1 1 1 1 83 2 1
food 16 1 16 1 2 5 1 1
lunch 3 1 1 1 2 1 1 1
spend 2 1 2 1 1 1 1 1
Table: Laplace smoothed bigram counts



Smoothing techniques

Laplacian smoothing (add-1 smoothing)
Add-K smoothing
Backoff and interpolation
Kneser-Ney smoothing
  1. Absolute discounting
  2. Kneser-Ney discounting


Choosing a Vocabulary

Open vs. closed vocabulary
Criteria for building a closed vocabulary:
  1. Minimum word frequency (e.g., 2)
  2. Upper bound on the number of words in the vocabulary (e.g., |V| = 10000)


Out-Of-Vocabulary (OOV) Words

Words may appear at test time that did not appear while building the model
These words are 'Out-Of-Vocabulary (OOV)' words
They drive the overall sentence or phrase probability to zero
Replace any word that is not in the vocabulary with a special token <UNK>
Estimate the probability of <UNK> like that of any other word in the corpus
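One common way to realize this, sketched as a rough Python example; the frequency threshold and the sample sentences are illustrative assumptions.

```python
from collections import Counter

def build_vocab(train_tokens, min_count=2):
    # Keep words that occur at least min_count times; everything rarer
    # will be mapped to the special <UNK> token.
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_oov(tokens, vocab):
    # Map out-of-vocabulary words to <UNK> before counting n-grams
    # or scoring test sentences.
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "I want to spend money I want to eat".split()
vocab = build_vocab(train, min_count=2)          # {'I', 'want', 'to'}
print(replace_oov("I want to spend rupees".split(), vocab))
# ['I', 'want', 'to', '<UNK>', '<UNK>']
```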


Language Model Evaluation

Does the model predict good symbols/words/sequences?
A good model assigns higher probability to frequently observed words (sentences) than to rarely seen ones

Training and Test Sets
The model is trained on a training set, where its parameters are learned
The model is tested on held-out (test) data to check its prediction accuracy


Extrinsic Evaluation

Two models, A and B, are built on some dataset
Use each model in some NLP application, such as statistical machine translation or spelling correction
Measure the accuracy of A and of B
Compare the accuracy of A and B
Intrinsic Evaluation - Perplexity

Perplexity measures how well a probability model predicts a word or a sequence of words
Intuitively, perplexity can be understood as a measure of uncertainty

For a bigram model:  PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | wi−1) )^{1/N}    (1)

For a trigram model: PP(W) = ( ∏_{i=1}^{N} 1 / P(wi | wi−2, wi−1) )^{1/N}    (2)

A lower perplexity indicates a better language model.
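A minimal sketch of bigram perplexity computed in log space (to avoid numerical underflow on long sequences); `p(w, prev)` is an assumed conditional-probability function, ideally smoothed so it never returns zero.

```python
import math

def bigram_perplexity(tokens, p):
    # PP(W) = ( prod_i 1 / P(w_i | w_{i-1}) )^(1/N), computed in log space.
    # `tokens` is the test sequence including an initial <s>;
    # N is the number of predicted tokens.
    N = len(tokens) - 1
    log_prob = 0.0
    for prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(p(w, prev))
    return math.exp(-log_prob / N)

# Lower perplexity means the model is less "surprised" by the test data.
```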


Summary

Word- or sequence-prediction models based on probabilities estimated from the training data
Applications: spelling correction, speech recognition, smart email compose, machine translation
Neural-network-based language models perform better


Thank You All.......