Natural Language Processing

07: Generative Language Models

1. N-Gram and Probabilities
2. N-Gram Language Model
3. Perplexity
4. Out of Vocabulary Words
5. Smoothing & Interpolation
Dr. Imran Ihsan
Ph.D. in Knowledge Engineering
Associate Professor


07-01
N-Gram and Probabilities

Language Model
Create a Language Model (LM) from a text corpus to:
Estimate the probability of a word sequence
Estimate the probability of a word following a sequence of words

Apply this to autocomplete a sentence with the most likely suggestions

[Diagram] Text Corpus ➔ Language Model: “Lyn is eating…” ➔ suggestions: “chocolate”, “eggs”, “toast”


Applications
Speech Recognition
P(I saw a van) > P(eyes awe a an)

Spelling Correction
“He entered the ship to buy some groceries” – “ship” is a valid dictionary word
P(entered the shop to buy) > P(entered the ship to buy)

Augmented Communication
Predict the most likely word from a menu for people unable to physically talk or sign (Newell et al., 1998)



N-Gram
An N-Gram is a sequence of N words.

Corpus: I am happy because I am learning

Unigrams: { I , am , happy , because , learning }

Bigrams: { I am , am happy , happy because , … }   (“I happy” is not a bigram: the two words are not consecutive)

Trigrams: { I am happy , am happy because , … }
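To make the definition concrete, here is a minimal Python sketch (not part of the slides) that extracts N-Grams from the example corpus; the helper name ngrams is ours.

def ngrams(tokens, n):
    """Return all n-word sequences occurring in tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "I am happy because I am learning".split()

print(ngrams(corpus, 1))  # unigrams: ('I',), ('am',), ('happy',), ...
print(ngrams(corpus, 2))  # bigrams:  ('I', 'am'), ('am', 'happy'), ...
print(ngrams(corpus, 3))  # trigrams: ('I', 'am', 'happy'), ...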


Sequence Notation
Corpus: This is great … teacher drinks tea.   (m = 500)
        w_1  w_2  w_3      w_498  w_499  w_500

w_1^m = w_1 w_2 ⋯ w_m

w_1^3 = w_1 w_2 w_3

w_{m-2}^m = w_{m-2} w_{m-1} w_m


Unigram Probability
Corpus: I am happy because I am learning

Size of the corpus: m = 7

P(I) = 2/7        P(happy) = 1/7

Probability of a unigram:  P(w) = C(w) / m


Bigram Probability
Corpus: I am happy because I am learning

P(am | I) = C(I am) / C(I) = 2/2 = 1

P(happy | I) = C(I happy) / C(I) = 0/2 = 0

P(learning | am) = C(am learning) / C(am) = 1/2 = 0.5

Probability of a bigram:  P(y | x) = C(x y) / Σ_w C(x w) = C(x y) / C(x)
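A minimal Python sketch of these bigram estimates (not part of the slides; the helper name bigram_prob is ours), reproducing the three values above:

from collections import Counter

corpus = "I am happy because I am learning".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(y, x):
    # P(y | x) = C(x y) / C(x)
    return bigram_counts[(x, y)] / unigram_counts[x]

print(bigram_prob("am", "I"))         # 2/2 = 1.0
print(bigram_prob("happy", "I"))      # 0/2 = 0.0
print(bigram_prob("learning", "am"))  # 1/2 = 0.5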


Trigram Probability
Corpus: I am happy because I am learning

P(happy | I am) = C(I am happy) / C(I am) = 1/2 = 0.5

Probability of a trigram:  P(w_3 | w_1^2) = C(w_1^2 w_3) / C(w_1^2)

C(w_1^2 w_3) = C(w_1 w_2 w_3) = C(w_1^3)


N-Gram Probability

P(w_N | w_1^{N-1}) = C(w_1^{N-1} w_N) / C(w_1^{N-1})

C(w_1^{N-1} w_N) = C(w_1^N)


Probability of a Sequence
Given a sentence, what is its probability?

P(the teacher drinks tea) = ?

Conditional Probability and Chain Rule

P(B | A) = P(A, B) / P(A)   ➔   P(A, B) = P(A) P(B | A)

P(A, B, C, D) = P(A) · P(B | A) · P(C | A, B) · P(D | A, B, C)

P(the teacher drinks tea) =
P(the) · P(teacher | the) · P(drinks | the teacher) · P(tea | the teacher drinks)

Sentence Not in Corpus

Problem: a corpus almost never contains the exact sentence we’re interested in, or even its longer subsequences.

Input: the teacher drinks tea

P(the teacher drinks tea) =
P(the) · P(teacher | the) · P(drinks | the teacher) · P(tea | the teacher drinks)

P(tea | the teacher drinks) = C(the teacher drinks tea) / C(the teacher drinks)   (both counts are likely 0)


Approximation of Sequence Probability

Input: the teacher drinks tea

Approximate each conditional probability with a bigram:
P(tea | the teacher drinks) ≈ P(tea | drinks)

P(the teacher drinks tea) =
P(the) · P(teacher | the) · P(drinks | the teacher) · P(tea | the teacher drinks)
≈ P(the) · P(teacher | the) · P(drinks | teacher) · P(tea | drinks)


Approximation of Sequence Probability

Markov Assumption: only the last N words matter

Bigram:  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})

N-Gram:  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

Entire sentence modeled with bigrams:

P(w_1^n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-1})

P(w_1^n) ≈ P(w_1) · P(w_2 | w_1) · ⋯ · P(w_n | w_{n-1})


Start of Sentence Token <s>

the teacher drinks tea

P(the teacher drinks tea) ≈
P(the) · P(teacher | the) · P(drinks | teacher) · P(tea | drinks)

<s> the teacher drinks tea

P(<s> the teacher drinks tea) ≈
P(the | <s>) · P(teacher | the) · P(drinks | teacher) · P(tea | drinks)


Start of Sentence Token <s> for N-Grams

Trigram

P(the teacher drinks tea) ≈
P(the) · P(teacher | the) · P(drinks | the teacher) · P(tea | teacher drinks)

the teacher drinks tea ➔ <s> <s> the teacher drinks tea

P(w_1^n) ≈ P(w_1 | <s> <s>) · P(w_2 | <s> w_1) · ⋯ · P(w_n | w_{n-2} w_{n-1})

N-Gram Model: add N − 1 start tokens <s>


End of Sentence Token </s> – Motivation

P(y | x) = C(x y) / Σ_w C(x w) = C(x y) / C(x)

Corpus:
<s> Lyn drinks chocolate
<s> John drinks

Σ_w C(drinks w) = 1

C(drinks) = 2

Without an end token the two counts disagree, so Σ_w C(x w) ≠ C(x) and the bigram probabilities following “drinks” do not sum to 1.


End of Sentence Token </s> – Motivation

Corpus:
<s> yes no
<s> yes yes
<s> no no

All sentences of length 2: <s> yes yes, <s> yes no, <s> no no, <s> no yes

P(<s> yes yes) = P(yes | <s>) · P(yes | yes)
               = [C(<s> yes) / Σ_w C(<s> w)] · [C(yes yes) / Σ_w C(yes w)]
               = 2/3 · 1/2 = 1/3

P(<s> yes yes) = 1/3      P(<s> yes no) = 1/3
P(<s> no no) = 1/3        P(<s> no yes) = 0

Σ_{2-word sentences} P(⋯) = 1

End of Sentence Token </s> – Motivation

Corpus:
<s> yes no
<s> yes yes
<s> no no

All sentences of length 3: <s> yes yes yes, <s> yes yes no, …, <s> no no no

P(<s> yes yes yes) = ⋯     P(<s> yes yes no) = ⋯     ⋯     P(<s> no no no) = ⋯

Σ_{3-word sentences} P(⋯) = 1


End of Sentence Token </s> – Motivation

Corpus:
<s> yes no
<s> yes yes
<s> no no

Σ_{2-word sentences} P(⋯) + Σ_{3-word sentences} P(⋯) + ⋯ = 1

A proper language model should satisfy this constraint over sentences of all lengths, but without </s> the probabilities already sum to 1 within every fixed length.


End of Sentence Token </s> – Solution

Bigram
<s> the teacher drinks tea ➔ <s> the teacher drinks tea </s>

P(the | <s>) · P(teacher | the) · P(drinks | teacher) · P(tea | drinks) · P(</s> | tea)

Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>

Σ_w C(drinks w) = 2      C(drinks) = 2

Trigram
<s> the teacher drinks tea ➔ <s> <s> the teacher drinks tea </s>

N-Gram ➔ just one </s>
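A minimal Python sketch of this padding convention (the helper name pad_sentence is ours), adding N − 1 start tokens and a single end token:

def pad_sentence(tokens, n):
    # Add N-1 start tokens <s> and one end token </s>.
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

sentence = "the teacher drinks tea".split()
print(pad_sentence(sentence, 2))  # ['<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']
print(pad_sentence(sentence, 3))  # ['<s>', '<s>', 'the', 'teacher', 'drinks', 'tea', '</s>']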

Example – Bigram
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>

P(<s> Lyn drinks chocolate </s>) = 2/3 · 1/2 · 1/2 · 2/2 = 1/6

P(John | <s>) = 1/3        P(</s> | tea) = 1/1

P(chocolate | eats) = 1/1  P(Lyn | <s>) = ? = 2/3
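The same example can be checked in code. A minimal Python sketch (not from the slides) that estimates the bigram probabilities from the three sentences and reproduces P = 1/6:

from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

bigram_counts = Counter(b for sent in corpus for b in zip(sent, sent[1:]))
# Count a word as a context only when something follows it, so C(x) = Σ_w C(x w).
context_counts = Counter(w for sent in corpus for w in sent[:-1])

def p(word, prev):
    return bigram_counts[(prev, word)] / context_counts[prev]

sentence = "<s> Lyn drinks chocolate </s>".split()
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= p(word, prev)
print(prob)  # 2/3 * 1/2 * 1/2 * 2/2 = 1/6 ≈ 0.1667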


07-02
N-Gram Language Model

Generative Language Model

Count Matrix ➔ Probability Matrix ➔ Language Model ➔ Generative Language Model


Count Matrix
Rows: unique corpus (N−1)-Grams
Columns: unique corpus words

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})

Bigram Count Matrix

Corpus: <s> I study I learn </s>

         <s>   </s>   I    study  learn
<s>       0     0     1      0      0
</s>      0     0     0      0      0
I         0     0     0      1      1
study     0     0     1      0      0    ← the “study I” bigram
learn     0     1     0      0      0


Probability Matrix
Divide each cell by its row sum:

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})

sum(row) = Σ_{w∈V} C(w_{n-N+1}^{n-1}, w) = C(w_{n-N+1}^{n-1})

Corpus: <s> I study I learn </s>

Count Matrix (Bigram)
         <s>   </s>   I    study  learn   Sum
<s>       0     0     1      0      0      1
</s>      0     0     0      0      0      0
I         0     0     0      1      1      2
study     0     0     1      0      0      1
learn     0     1     0      0      0      1

Probability Matrix
         <s>   </s>   I    study  learn
<s>       0     0     1      0      0
</s>      0     0     0      0      0
I         0     0     0     0.5    0.5
study     0     0     1      0      0
learn     0     1     0      0      0


Language Model
Probability Matrix ➔ Language Model
Sentence Probability
Next Word Prediction

Probability Matrix
         <s>   </s>   I    study  learn
<s>       0     0     1      0      0
</s>      0     0     0      0      0
I         0     0     0     0.5    0.5
study     0     0     1      0      0
learn     0     1     0      0      0

Sentence Probability: <s> I learn </s>
P(sentence) = P(I | <s>) · P(learn | I) · P(</s> | learn) = 1 · 0.5 · 1 = 0.5
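A minimal Python sketch (not from the slides) that builds the count matrix and probability matrix for this toy corpus and reproduces the sentence probability 0.5:

from collections import Counter, defaultdict

tokens = "<s> I study I learn </s>".split()
vocab = sorted(set(tokens))

# Count matrix: rows are the previous word, columns are the next word.
counts = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    counts[prev][word] += 1

# Probability matrix: divide each cell by its row sum (rows with sum 0 stay 0).
prob = {}
for prev in vocab:
    row_sum = sum(counts[prev].values())
    prob[prev] = {w: counts[prev][w] / row_sum if row_sum else 0.0 for w in vocab}

sentence = "<s> I learn </s>".split()
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= prob[prev][word]
print(p)  # P(I|<s>) * P(learn|I) * P(</s>|learn) = 1 * 0.5 * 1 = 0.5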


Log Probability
All probabilities in the calculation are ≤ 1, and multiplying many of them risks numerical underflow.

P(w_1^n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-1})

Logarithm properties reminder:

log(a · b) = log(a) + log(b)

log P(w_1^n) ≈ Σ_{i=1}^{n} log P(w_i | w_{i-1})
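A minimal Python sketch of the log-space trick (the probability values are taken from the toy example above and are only illustrative):

import math

# Sum log-probabilities instead of multiplying probabilities, to avoid underflow.
bigram_probs = [1.0, 0.5, 1.0]   # e.g. P(I|<s>), P(learn|I), P(</s>|learn)

log_p = sum(math.log(p) for p in bigram_probs)
print(log_p)            # log(0.5) ≈ -0.693
print(math.exp(log_p))  # 0.5, recoverable here only because this toy product is not tiny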


Generative Language Model

Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>

Example generation choices:
1. (<s>, Lyn) or (<s>, John)?
2. (Lyn, eats) or (Lyn, drinks)?
3. (drinks, tea) or (drinks, chocolate)?
4. (tea, </s>) – always

Algorithm (a sampling sketch follows below):
1. Choose a sentence start
2. Choose the next bigram starting with the previous word
3. Continue until </s> is picked
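A minimal Python sketch of this sampling algorithm over the corpus above (not from the slides; random.choices does the count-weighted pick):

import random
from collections import Counter, defaultdict

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

# For each word, count which words follow it.
next_counts = defaultdict(Counter)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        next_counts[prev][word] += 1

def generate(max_len=20):
    word, out = "<s>", []
    while word != "</s>" and len(out) < max_len:
        candidates = list(next_counts[word].keys())
        weights = list(next_counts[word].values())
        word = random.choices(candidates, weights=weights)[0]  # sample the next word
        out.append(word)
    return " ".join(out)

print(generate())  # e.g. "Lyn drinks tea </s>"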


Language Model Evaluation

[Diagram] Split the corpus into Train / Validation / Test; evaluate the language model with Perplexity.


Test Data

Split the corpus into Train / Validation / Test and evaluate on the test data.

For smaller corpora:          For large corpora (typical for text):
80% Train                     98% Train
10% Validation                 1% Validation
10% Test                       1% Test


Test Data – Split Method

[Diagram] The corpus can be split into Training / Validation / Test either as continuous text segments or as randomly selected short sequences.


07-03
Perplexity

Perplexity

PP(W) = P(s_1, s_2, ⋯, s_m)^{-1/m}

W → test set containing m sentences s
s_i → i-th sentence in the test set, each ending with </s>
m → number of all words in the entire test set W, including </s> but not <s>

e.g., m = 100
P(W) = 0.9       → PP(W) = 0.9^{-1/100} = 1.00105416
P(W) = 10^{-250} → PP(W) = (10^{-250})^{-1/100} ≈ 316

Smaller perplexity = better model

Character-level models have lower PP than word-based models


Perplexity for Bigram Models

PP(W) = ( ∏_{i=1}^{m} ∏_{j=1}^{|s_i|} 1 / P(w_j^{(i)} | w_{j-1}^{(i)}) )^{1/m}

w_j^{(i)} → j-th word in the i-th sentence

Concatenate all sentences in W:

PP(W) = ( ∏_{i=1}^{m} 1 / P(w_i | w_{i-1}) )^{1/m}

log PP(W) = −(1/m) Σ_{i=1}^{m} log_2 P(w_i | w_{i-1})

w_i → i-th word in the test set
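A minimal Python sketch of this computation (not from the slides). It uses natural logarithms, which give the same perplexity as base 2 as long as log and exp match; the uniform-model check at the end is our own test case:

import math

def perplexity(sentences, bigram_prob):
    # PP(W) = exp(-(1/m) * Σ log P(w_i | w_{i-1})), with m counting </s> but not <s>.
    log_sum, m = 0.0, 0
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log(bigram_prob(word, prev))
            m += 1
    return math.exp(-log_sum / m)

# Sanity check: a uniform model over a 10-word vocabulary has perplexity 10.
print(perplexity([["a", "b", "c"]], lambda w, prev: 1 / 10))  # ≈ 10.0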


Example – Wall Street Journal

WSJ Corpus
Training: 38 million words
Test: 1.5 million words

Perplexity:
Unigram: 962
Bigram: 170
Trigram: 109

From Speech and Language Processing by Dan Jurafsky et al.


Example – Approximating Shakespeare

N = 884,647 tokens
V = 29,066

From Speech and Language Processing by Dan Jurafsky et al.


07-04
Out of Vocabulary Words

Out of Vocabulary Words

Closed vs. Open Vocabularies
Unknown word = Out of Vocabulary (OOV) word
Special tag <UNK> in the corpus and in the input

Using <UNK> in the Corpus
Create a vocabulary V
Replace any word in the corpus that is not in V with <UNK>
Estimate the probabilities with <UNK> as with any other word


Example
Original corpus:                    With <UNK>, min frequency f = 2:
<s> Lyn drinks chocolate </s>       <s> Lyn drinks chocolate </s>
<s> John drinks tea </s>            <s> <UNK> drinks <UNK> </s>
<s> Lyn eats chocolate </s>         <s> Lyn <UNK> chocolate </s>

Vocabulary: Lyn, drinks, chocolate

Input query: <s> Adam drinks chocolate </s>  ➔  <s> <UNK> drinks chocolate </s>
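A minimal Python sketch of this replacement procedure (not from the slides; min_freq and apply_unk are our names):

from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

min_freq = 2
word_counts = Counter(w for sent in corpus for w in sent if w not in ("<s>", "</s>"))
vocab = {w for w, c in word_counts.items() if c >= min_freq}   # {'Lyn', 'drinks', 'chocolate'}

def apply_unk(sent):
    return [w if w in vocab or w in ("<s>", "</s>") else "<UNK>" for w in sent]

print([apply_unk(s) for s in corpus])
print(apply_unk("<s> Adam drinks chocolate </s>".split()))
# ['<s>', '<UNK>', 'drinks', 'chocolate', '</s>']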


How to Create Vocabulary V

Criteria
Min word frequency f
Max |V|: include words by decreasing frequency

Use <UNK> sparingly

Perplexity – only compare LMs with the same vocabulary V


Missing N-Gram in Training Corpus

Problem:
An N-Gram made of known words might still be missing from the training corpus.

e.g., “John” and “eats” appear in the corpus, but the bigram “John eats” does not.

Its zero count cannot be used for probability estimation:

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1}, w_n) / C(w_{n-N+1}^{n-1})     (the numerator can be 0)


07-05
Smoothing & Interpolation

Smoothing
Add-one Smoothing (Laplacian Smoothing)

P(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + 1) / Σ_{w∈V}(C(w_{n-1}, w) + 1) = (C(w_{n-1}, w_n) + 1) / (C(w_{n-1}) + |V|)

Add-K Smoothing

P(w_n | w_{n-1}) = (C(w_{n-1}, w_n) + k) / Σ_{w∈V}(C(w_{n-1}, w) + k) = (C(w_{n-1}, w_n) + k) / (C(w_{n-1}) + k · |V|)

Advanced Methods: Kneser-Ney Smoothing, Good-Turing Smoothing
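A minimal Python sketch of add-k smoothing on the earlier toy corpus (k = 1 gives add-one); the function and variable names are ours:

from collections import Counter

tokens = "<s> I study I learn </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
V = len(set(tokens))   # vocabulary size, here 5

def smoothed_bigram_prob(word, prev, k=1.0):
    # Add-k smoothing; k = 1 is add-one (Laplacian) smoothing.
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

# The unseen bigram "study learn" now gets a small non-zero probability.
print(smoothed_bigram_prob("learn", "study"))  # (0 + 1) / (1 + 1*5) = 1/6 ≈ 0.167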


Backoff
If the N-Gram is missing ➔ use the (N−1)-Gram, …
Probability discounting, e.g., Katz backoff
“Stupid” backoff

Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>

P(chocolate | John drinks) = ?  ➔  with “stupid” backoff: 0.4 · P(chocolate | drinks)
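A minimal Python sketch of “stupid” backoff with the factor 0.4 from the slide (otherwise not from the slides; note that stupid backoff returns scores that are not guaranteed to normalize to 1):

from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]
uni = Counter(w for s in corpus for w in s)
bi = Counter(b for s in corpus for b in zip(s, s[1:]))
tri = Counter(t for s in corpus for t in zip(s, s[1:], s[2:]))
total = sum(uni.values())

def stupid_backoff(word, w1, w2, alpha=0.4):
    # Back off from trigram to bigram to unigram, scaling by alpha at each step.
    if tri[(w1, w2, word)] > 0:
        return tri[(w1, w2, word)] / bi[(w1, w2)]
    if bi[(w2, word)] > 0:
        return alpha * bi[(w2, word)] / uni[w2]
    return alpha * alpha * uni[word] / total

print(stupid_backoff("chocolate", "John", "drinks"))  # 0.4 * C(drinks chocolate)/C(drinks) = 0.4 * 1/2 = 0.2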


Interpolation
P̂(chocolate | John drinks) = 0.7 · P(chocolate | John drinks)
                              + 0.2 · P(chocolate | drinks) + 0.1 · P(chocolate)

P̂(w_n | w_{n-2} w_{n-1}) = λ_1 · P(w_n | w_{n-2} w_{n-1})
                            + λ_2 · P(w_n | w_{n-1}) + λ_3 · P(w_n)

Σ_i λ_i = 1
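A minimal Python sketch of linear interpolation with the example weights; the component probabilities passed in below are illustrative placeholders, not values computed from a corpus:

def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.7, 0.2, 0.1)):
    # Weighted combination of trigram, bigram and unigram estimates; weights sum to 1.
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Illustrative placeholder values only:
print(interpolated_prob(p_tri=0.0, p_bi=0.5, p_uni=0.1))  # 0.7*0 + 0.2*0.5 + 0.1*0.1 = 0.11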


Summary
N-Gram and Probabilities
Approximate Sentence Probability from N-Grams
Build Language Model from Corpus
Fix Missing Information:
  Out of Vocabulary Words with <UNK>
  Missing N-Grams in Corpus with Smoothing, Backoff and Interpolation
Evaluate Language Model with Perplexity
Coding Project
