NLP07 Generative Language Models
07-01
N-Gram and Probabilities
Language Model
Create a Language Model (LM) from a text corpus to:
Estimate probability of word sequences
Estimate probability of a word following a sequence of words
Text Corpus ➔ Language Model: given “Lyn is eating…”, predict likely next words such as “chocolate”, “eggs”, “toast”.
Applications
Speech Recognition
P(I saw a van) > P(eyes awe of an)
Spelling Correction
“He entered the ship to buy some groceries” – “ship” is a valid dictionary word, so a simple spell checker will not flag it
P(entered the shop to buy) > P(entered the ship to buy)
Augmented Communication
Predict the most likely next word from a menu for people unable to physically
talk or sign (Newell et al., 1998)
N-Gram
An N-Gram is a sequence of N words.
Sequence Notation
Corpus: This is great … teacher drinks tea.   ($m = 500$ words: $w_1\, w_2\, w_3 \cdots w_{498}\, w_{499}\, w_{500}$)
$w_1^m = w_1\, w_2 \cdots w_m$
$w_1^3 = w_1\, w_2\, w_3$
$w_{m-2}^m = w_{m-2}\, w_{m-1}\, w_m$
Unigram Probability
Corpus: I am happy because I am learning
$P(\text{I}) = \frac{2}{7} \qquad P(\text{happy}) = \frac{1}{7}$
Probability of a unigram: $P(w) = \frac{C(w)}{m}$
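A minimal Python sketch of this count-based unigram estimate; the corpus is the one above, the function name is illustrative:

```python
from collections import Counter

corpus = "I am happy because I am learning".split()

# Unigram probability: P(w) = C(w) / m, where m is the corpus size.
unigram_counts = Counter(corpus)
m = len(corpus)

def unigram_prob(word):
    return unigram_counts[word] / m

print(unigram_prob("I"))      # 2/7 ≈ 0.286
print(unigram_prob("happy"))  # 1/7 ≈ 0.143
```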
Bigram Probability
Corpus: I am happy because I am learning
$P(\text{am} \mid \text{I}) = \frac{C(\text{I am})}{C(\text{I})} = \frac{2}{2} = 1$
$P(\text{happy} \mid \text{I}) = \frac{C(\text{I happy})}{C(\text{I})} = \frac{0}{2} = 0$
$P(\text{learning} \mid \text{am}) = \frac{C(\text{am learning})}{C(\text{am})} = \frac{1}{2} = 0.5$
Probability of a bigram: $P(y \mid x) = \frac{C(x\, y)}{\sum_{w} C(x\, w)} = \frac{C(x\, y)}{C(x)}$
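The same idea for bigrams, as a sketch; counting prefix occurrences gives the denominator $\sum_{w} C(x\, w)$:

```python
from collections import Counter

corpus = "I am happy because I am learning".split()

# P(y | x) = C(x y) / sum_w C(x w): count bigrams and bigram prefixes.
bigram_counts = Counter(zip(corpus, corpus[1:]))
prefix_counts = Counter(corpus[:-1])

def bigram_prob(x, y):
    return bigram_counts[(x, y)] / prefix_counts[x]

print(bigram_prob("I", "am"))         # 2/2 = 1.0
print(bigram_prob("am", "learning"))  # 1/2 = 0.5
```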
Trigram Probability
Corpus: I am happy because I am learning
$P(\text{happy} \mid \text{I am}) = \frac{C(\text{I am happy})}{C(\text{I am})} = \frac{1}{2} = 0.5$
Probability of a trigram: $P(w_3 \mid w_1^2) = \frac{C(w_1^2\, w_3)}{C(w_1^2)}$
$C(w_1^2\, w_3) = C(w_1\, w_2\, w_3) = C(w_1^3)$
N-Gram Probability
$P(w_N \mid w_1^{N-1}) = \frac{C(w_1^{N-1}\, w_N)}{C(w_1^{N-1})}$
$C(w_1^{N-1}\, w_N) = C(w_1^N)$
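A generic version of the conditional N-gram estimate, sketched with tuple keys; the function names are illustrative:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    """P(word | context) = C(context word) / C(context)."""
    n = len(context)
    numerator = ngram_counts(tokens, n + 1)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n)[tuple(context)]
    return numerator / denominator if denominator else 0.0

tokens = "I am happy because I am learning".split()
print(ngram_prob(tokens, ("I", "am"), "happy"))  # 1/2 = 0.5
```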
Probability of a Sequence
Given a sentence, what is its probability?
$P(B \mid A) = \frac{P(A, B)}{P(A)} \;\Rightarrow\; P(A, B) = P(A)\, P(B \mid A)$
$P(A, B, C, D) = P(A) \cdot P(B \mid A) \cdot P(C \mid A, B) \cdot P(D \mid A, B, C)$
Start of sentence token <s> (trigram model): the teacher drinks tea ➔ <s> <s> the teacher drinks tea
Why an end of sentence token </s> is needed:
Corpus (no end token):
<s> Lyn drinks chocolate
<s> John drinks
$\sum_{w} C(\text{drinks } w) = 1$ but $C(\text{drinks}) = 2$, so the bigram probabilities $P(w \mid \text{drinks})$ do not sum to 1.
Without </s>, the probabilities of all sentences of a fixed length sum to 1 on their own; e.g., for sentences over the words yes and no:
$P(\text{<s> yes yes}) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3} \qquad P(\text{<s> yes no}) = \frac{1}{3}$
$P(\text{<s> no no}) = \frac{1}{3} \qquad P(\text{<s> no yes}) = 0$
$\sum_{\text{2-word sentences}} P(\cdots) = 1$
Likewise, the probabilities of all three-word sentences also sum to 1: $\sum_{\text{3-word sentences}} P(\cdots) = 1$, so summed over all lengths the total exceeds 1.
With an end of sentence token </s>, the probabilities of all sentences of all lengths sum to 1:
$\sum_{\text{2-word}} P(\cdots) + \sum_{\text{3-word}} P(\cdots) + \cdots = 1$
Bigram probability of the full sentence <s> the teacher drinks tea </s>:
$P(\text{the} \mid \text{<s>}) \cdot P(\text{teacher} \mid \text{the}) \cdot P(\text{drinks} \mid \text{teacher}) \cdot P(\text{tea} \mid \text{drinks}) \cdot P(\text{</s>} \mid \text{tea})$
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
$\sum_{w} C(\text{drinks } w) = 2 = C(\text{drinks})$
Trigram: <s> the teacher drinks tea ➔ <s> <s> the teacher drinks tea </s>
An N-gram model prepends N−1 start tokens <s> but appends just one </s>.
Example – Bigram
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
$P(\text{<s> Lyn drinks chocolate </s>}) = P(\text{Lyn} \mid \text{<s>}) \cdot P(\text{drinks} \mid \text{Lyn}) \cdot P(\text{chocolate} \mid \text{drinks}) \cdot P(\text{</s>} \mid \text{chocolate}) = \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{2}{2} = \frac{1}{6}$
Individual bigram probabilities:
$P(\text{John} \mid \text{<s>}) = \frac{1}{3} \qquad P(\text{</s>} \mid \text{tea}) = \frac{1}{1}$
$P(\text{chocolate} \mid \text{eats}) = \frac{1}{1} \qquad P(\text{Lyn} \mid \text{<s>}) = \frac{2}{3}$
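A sketch that reproduces this example end to end, including the <s> and </s> tokens; the corpus is the one above, the helper names are assumptions:

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

# Collect bigram and prefix counts over all sentences.
bigrams = Counter()
prefixes = Counter()
for sent in corpus:
    for x, y in zip(sent, sent[1:]):
        bigrams[(x, y)] += 1
        prefixes[x] += 1

def sentence_prob(sentence):
    """P(sentence) as a product of bigram probabilities, including <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for x, y in zip(tokens, tokens[1:]):
        p *= bigrams[(x, y)] / prefixes[x]
    return p

print(sentence_prob("Lyn drinks chocolate"))  # 2/3 * 1/2 * 1/2 * 1 = 1/6 ≈ 0.167
```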
07-02
N-Gram Language Model
Count Matrix ➔ Probability Matrix ➔ Language Model ➔ Generative Language Model
Count Matrix
Rows: Unique corpus (N-1)-Grams
Columns: Unique corpus words
Each cell stores the count $C(w_{n-N+1}^{n-1}\, w_n)$, the numerator of
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
Probability Matrix
Divide each cell by its row sum:
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
$\text{row sum} = \sum_{w \in V} C(w_{n-N+1}^{n-1}, w) = C(w_{n-N+1}^{n-1})$
Example corpus: <s> I study I learn </s>
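A sketch of building the count matrix and probability matrix for this example corpus, using nested dictionaries instead of an explicit matrix; the variable names are illustrative:

```python
from collections import Counter, defaultdict

tokens = "<s> I study I learn </s>".split()
vocab = sorted(set(tokens))

# Bigram count matrix: rows are the preceding words, columns are the corpus words.
count_matrix = defaultdict(Counter)
for x, y in zip(tokens, tokens[1:]):
    count_matrix[x][y] += 1

# Probability matrix: divide every cell by its row sum.
prob_matrix = {
    row: {w: counts[w] / sum(counts.values()) for w in vocab}
    for row, counts in count_matrix.items()
}

print(prob_matrix["I"]["study"])  # C(I study) / C(I) = 1/2 = 0.5
print(prob_matrix["I"]["learn"])  # 1/2 = 0.5
```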
Language Model
Probability Matrix ➔ Language Model
Sentence Probability
Next Word Prediction
Log Probability
$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
All probabilities in the calculation are $\le 1$, and multiplying many of them risks numerical underflow.
Use log probabilities instead: $\log P(w_1^n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$
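A small illustration of why logs are used; the probability values are made up for the demonstration:

```python
import math

# Illustrative per-bigram probabilities for a long sequence (assumed values).
probs = [0.1] * 400

product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0 — the product underflows

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                           # ≈ -921.0 — still representable
```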
Train / Validate / Test Split
Corpus ➔ Training data, Validation data, Test data
Train on the training data; evaluate on the test data with perplexity.
07-03
Perplexity
$PP(W) = P(s_1, s_2, \cdots, s_m)^{-\frac{1}{m}}$
e.g., $m = 100$:
$P(W) = 0.9 \;\rightarrow\; PP(W) = 0.9^{-\frac{1}{100}} = 1.00105416$
$P(W) = 10^{-250} \;\rightarrow\; PP(W) = (10^{-250})^{-\frac{1}{100}} \approx 316$
$w_j^{(i)}$ ➔ the $j$-th word in the $i$-th sentence
$PP(W) = \sqrt[m]{\prod_{i=1}^{m} \frac{1}{P(w_i \mid w_{i-1})}} \qquad \log PP(W) = -\frac{1}{m} \sum_{i=1}^{m} \log_2 P(w_i \mid w_{i-1})$
$w_i$ ➔ the $i$-th word in the test set (all test sentences concatenated, $m$ words in total)
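A sketch of the log-space perplexity formula; the per-word probabilities below are assumed values, not taken from any corpus in the slides:

```python
import math

def perplexity(word_probs):
    """PP(W) = 2 ** ( -(1/m) * sum(log2 P(w_i | w_{i-1})) )."""
    m = len(word_probs)
    log_pp = -sum(math.log2(p) for p in word_probs) / m
    return 2 ** log_pp

# Assumed per-word bigram probabilities for a tiny test set.
print(perplexity([0.2, 0.5, 0.1, 0.25]))  # ≈ 4.47
```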
WSJ Corpus
Training: 38 million words
Test: 1.5 million words
Perplexity: Unigram 962, Trigram 109 (from Speech and Language Processing by Dan Jurafsky et al.)
N = 884,647 tokens, V = 29,066
07-04
Out of Vocabulary Words
Example: minimum word frequency $f = 2$
Original corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Words with frequency below $f$ replaced by <UNK>:
<s> Lyn drinks chocolate </s>
<s> <UNK> drinks <UNK> </s>
<s> Lyn <UNK> chocolate </s>
Criteria for building the closed vocabulary:
Minimum word frequency $f$
Maximum $|V|$: include words by frequency
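A sketch of the minimum-frequency criterion applied to this corpus; variable names are illustrative:

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

min_freq = 2
word_counts = Counter(w for sent in corpus for w in sent)
vocab = {w for w, c in word_counts.items() if c >= min_freq}

# Replace out-of-vocabulary words with <UNK>.
processed = [[w if w in vocab else "<UNK>" for w in sent] for sent in corpus]
for sent in processed:
    print(" ".join(sent))
# <s> Lyn drinks chocolate </s>
# <s> <UNK> drinks <UNK> </s>
# <s> Lyn <UNK> chocolate </s>
```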
Even so, $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}, w_n)}{C(w_{n-N+1}^{n-1})}$ can still be 0 for N-grams never seen in training ➔ motivates smoothing.
07-05
Smoothing & Interpolation
Smoothing
Add-one Smoothing (Laplacian Smoothing)
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{\sum_{w \in V} (C(w_{n-1}, w) + 1)} = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + |V|}$
Add-K Smoothing
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{\sum_{w \in V} (C(w_{n-1}, w) + k)} = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + k|V|}$
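A sketch of add-k smoothing for bigrams, reusing the small corpus from the probability-matrix example; the helper name is an assumption:

```python
from collections import Counter

def addk_bigram_prob(x, y, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k smoothed bigram probability: (C(x y) + k) / (C(x) + k * |V|)."""
    return (bigram_counts[(x, y)] + k) / (unigram_counts[x] + k * vocab_size)

tokens = "<s> I study I learn </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens[:-1])
V = len(set(tokens))

print(addk_bigram_prob("I", "study", bigram_counts, unigram_counts, V))  # (1+1)/(2+5) ≈ 0.286
print(addk_bigram_prob("I", "<s>", bigram_counts, unigram_counts, V))    # (0+1)/(2+5) ≈ 0.143
```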
Backoff
If 𝑁-Gram is missing ➔ Use (𝑁 − 1)-Gram, …
Probability Discounting e.g., Katz Backoff
“Stupid” Backoff
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
$P(\text{chocolate} \mid \text{John drinks}) = ?$ — the trigram "John drinks chocolate" never occurs, so with "stupid" backoff:
$\approx 0.4 \cdot P(\text{chocolate} \mid \text{drinks}) = 0.4 \cdot \frac{1}{2} = 0.2$
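A sketch of "stupid" backoff with the constant 0.4 from the slide, on the same three-sentence corpus; function and variable names are assumptions:

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

# Unigram, bigram, and trigram counts over all sentences.
counts = {n: Counter() for n in (1, 2, 3)}
for sent in corpus:
    for n in (1, 2, 3):
        counts[n].update(tuple(sent[i:i + n]) for i in range(len(sent) - n + 1))

def stupid_backoff(context, word, alpha=0.4):
    """P(word | context); back off to a shorter context with weight alpha when the count is 0."""
    numerator = counts[len(context) + 1][tuple(context) + (word,)]
    if numerator > 0:
        return numerator / counts[len(context)][tuple(context)]
    if len(context) > 1:
        return alpha * stupid_backoff(context[1:], word, alpha)
    total_words = sum(counts[1].values())
    return alpha * counts[1][(word,)] / total_words  # last resort: scaled unigram frequency

print(stupid_backoff(("John", "drinks"), "chocolate"))  # 0.4 * P(chocolate | drinks) = 0.2
```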
Interpolation
$\hat{P}(\text{chocolate} \mid \text{John drinks}) = 0.7 \cdot P(\text{chocolate} \mid \text{John drinks}) + 0.2 \cdot P(\text{chocolate} \mid \text{drinks}) + 0.1 \cdot P(\text{chocolate})$
$\sum_i \lambda_i = 1$
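A sketch of linear interpolation with the $\lambda$ weights above; the component probability values passed in are illustrative placeholders, not computed from a corpus:

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.7, 0.2, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates; lambdas sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    l3, l2, l1 = lambdas
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# e.g. P(chocolate | John drinks) with assumed component estimates:
print(interpolated_prob(p_trigram=0.0, p_bigram=0.5, p_unigram=0.25))  # 0.7*0 + 0.2*0.5 + 0.1*0.25 = 0.125
```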
Summary
N-Gram and Probabilities
N-Gram Language Model
Perplexity
Out of Vocabulary Words
Smoothing & Interpolation
Coding Project