NLP07 Generative Language Models
07-01
N-Gram and Probabilities
Language Model
Create a Language Model (LM) from a text corpus to:
Estimate probability of word sequences
Estimate probability of a word following a sequence of words
Text Corpus ➔ Language Model: given “Lyn is eating…”, predict likely next words such as “chocolate”, “eggs”, “toast”.
Applications
Speech Recognition
P(I saw a van) > P(eyes awe of an)
Spelling Correction
“He entered the ship to buy some groceries” – “ship” is a valid dictionary word, so a simple spell checker will not flag it
P(entered the shop to buy) > P(entered the ship to buy)
Augmented Communication
Predict the most likely next word from a menu for people unable to physically
talk or sign (Newell et al., 1998)
N-Gram
An N-Gram is a sequence of N words.
Sequence Notation
Corpus: This is great … teacher drinks tea.   ($m = 500$ words: $w_1\, w_2\, w_3 \cdots w_{498}\, w_{499}\, w_{500}$)
$w_1^m = w_1\, w_2 \cdots w_m$
$w_1^3 = w_1\, w_2\, w_3$
$w_{m-2}^m = w_{m-2}\, w_{m-1}\, w_m$
Unigram Probability
Corpus: I am happy because I am learning
$P(\text{I}) = \frac{2}{7} \qquad P(\text{happy}) = \frac{1}{7}$
Probability of a unigram: $P(w) = \frac{C(w)}{m}$
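A minimal Python sketch of this count-based unigram estimate; the corpus is the one above, the function name is illustrative:

```python
from collections import Counter

corpus = "I am happy because I am learning".split()

# Unigram probability: P(w) = C(w) / m, where m is the corpus size.
unigram_counts = Counter(corpus)
m = len(corpus)

def unigram_prob(word):
    return unigram_counts[word] / m

print(unigram_prob("I"))      # 2/7 ≈ 0.286
print(unigram_prob("happy"))  # 1/7 ≈ 0.143
```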
Bigram Probability
Corpus: I am happy because I am learning
$P(\text{am} \mid \text{I}) = \frac{C(\text{I am})}{C(\text{I})} = \frac{2}{2} = 1$
$P(\text{happy} \mid \text{I}) = \frac{C(\text{I happy})}{C(\text{I})} = \frac{0}{2} = 0$
$P(\text{learning} \mid \text{am}) = \frac{C(\text{am learning})}{C(\text{am})} = \frac{1}{2} = 0.5$
Probability of a bigram: $P(y \mid x) = \frac{C(x\, y)}{\sum_{w} C(x\, w)} = \frac{C(x\, y)}{C(x)}$
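The same idea for bigrams, as a sketch; counting prefix occurrences gives the denominator $\sum_{w} C(x\, w)$:

```python
from collections import Counter

corpus = "I am happy because I am learning".split()

# P(y | x) = C(x y) / sum_w C(x w): count bigrams and bigram prefixes.
bigram_counts = Counter(zip(corpus, corpus[1:]))
prefix_counts = Counter(corpus[:-1])

def bigram_prob(x, y):
    return bigram_counts[(x, y)] / prefix_counts[x]

print(bigram_prob("I", "am"))         # 2/2 = 1.0
print(bigram_prob("am", "learning"))  # 1/2 = 0.5
```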
Trigram Probability
Corpus: I am happy because I am learning
$P(\text{happy} \mid \text{I am}) = \frac{C(\text{I am happy})}{C(\text{I am})} = \frac{1}{2} = 0.5$
Probability of a trigram: $P(w_3 \mid w_1^2) = \frac{C(w_1^2\, w_3)}{C(w_1^2)}$
$C(w_1^2\, w_3) = C(w_1\, w_2\, w_3) = C(w_1^3)$
N-Gram Probability
$P(w_N \mid w_1^{N-1}) = \frac{C(w_1^{N-1}\, w_N)}{C(w_1^{N-1})}$
$C(w_1^{N-1}\, w_N) = C(w_1^N)$
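A generic version of the conditional N-gram estimate, sketched with tuple keys; the function names are illustrative:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_prob(tokens, context, word):
    """P(word | context) = C(context word) / C(context)."""
    n = len(context)
    numerator = ngram_counts(tokens, n + 1)[tuple(context) + (word,)]
    denominator = ngram_counts(tokens, n)[tuple(context)]
    return numerator / denominator if denominator else 0.0

tokens = "I am happy because I am learning".split()
print(ngram_prob(tokens, ("I", "am"), "happy"))  # 1/2 = 0.5
```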
Probability of a Sequence
Given a sentence, what is its probability?
$P(B \mid A) = \frac{P(A, B)}{P(A)} \;\Rightarrow\; P(A, B) = P(A)\, P(B \mid A)$
$P(A, B, C, D) = P(A) \cdot P(B \mid A) \cdot P(C \mid A, B) \cdot P(D \mid A, B, C)$
Start of sentence token <s> (trigram model): the teacher drinks tea ➔ <s> <s> the teacher drinks tea
Why an end of sentence token </s> is needed:
Corpus (no end token):
<s> Lyn drinks chocolate
<s> John drinks
$\sum_{w} C(\text{drinks } w) = 1$ but $C(\text{drinks}) = 2$, so the bigram probabilities $P(w \mid \text{drinks})$ do not sum to 1.
Without </s>, the probabilities of all sentences of a fixed length sum to 1 on their own; e.g., for sentences over the words yes and no:
$P(\text{<s> yes yes}) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3} \qquad P(\text{<s> yes no}) = \frac{1}{3}$
$P(\text{<s> no no}) = \frac{1}{3} \qquad P(\text{<s> no yes}) = 0$
$\sum_{\text{2-word sentences}} P(\cdots) = 1$
Likewise, the probabilities of all three-word sentences also sum to 1: $\sum_{\text{3-word sentences}} P(\cdots) = 1$, so summed over all lengths the total exceeds 1.
With an end of sentence token </s>, the probabilities of all sentences of all lengths sum to 1:
$\sum_{\text{2-word}} P(\cdots) + \sum_{\text{3-word}} P(\cdots) + \cdots = 1$
Bigram probability of the full sentence <s> the teacher drinks tea </s>:
$P(\text{the} \mid \text{<s>}) \cdot P(\text{teacher} \mid \text{the}) \cdot P(\text{drinks} \mid \text{teacher}) \cdot P(\text{tea} \mid \text{drinks}) \cdot P(\text{</s>} \mid \text{tea})$
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
$\sum_{w} C(\text{drinks } w) = 2 = C(\text{drinks})$
Trigram: <s> the teacher drinks tea ➔ <s> <s> the teacher drinks tea </s>
An N-gram model prepends N−1 start tokens <s> but appends just one </s>.
Example – Bigram
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
$P(\text{<s> Lyn drinks chocolate </s>}) = P(\text{Lyn} \mid \text{<s>}) \cdot P(\text{drinks} \mid \text{Lyn}) \cdot P(\text{chocolate} \mid \text{drinks}) \cdot P(\text{</s>} \mid \text{chocolate}) = \frac{2}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{2}{2} = \frac{1}{6}$
Individual bigram probabilities:
$P(\text{John} \mid \text{<s>}) = \frac{1}{3} \qquad P(\text{</s>} \mid \text{tea}) = \frac{1}{1}$
$P(\text{chocolate} \mid \text{eats}) = \frac{1}{1} \qquad P(\text{Lyn} \mid \text{<s>}) = \frac{2}{3}$
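A sketch that reproduces this example end to end, including the <s> and </s> tokens; the corpus is the one above, the helper names are assumptions:

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

# Collect bigram and prefix counts over all sentences.
bigrams = Counter()
prefixes = Counter()
for sent in corpus:
    for x, y in zip(sent, sent[1:]):
        bigrams[(x, y)] += 1
        prefixes[x] += 1

def sentence_prob(sentence):
    """P(sentence) as a product of bigram probabilities, including <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for x, y in zip(tokens, tokens[1:]):
        p *= bigrams[(x, y)] / prefixes[x]
    return p

print(sentence_prob("Lyn drinks chocolate"))  # 2/3 * 1/2 * 1/2 * 1 = 1/6 ≈ 0.167
```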
07-02
N-Gram Language Model
Count Matrix ➔ Probability Matrix ➔ Language Model ➔ Generative Language Model
Count Matrix
Rows: Unique corpus (N-1)-Grams
Columns: Unique corpus words
Each cell stores the count $C(w_{n-N+1}^{n-1}\, w_n)$, the numerator of
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
Probability Matrix
Divide each cell by its row sum:
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
$\text{row sum} = \sum_{w \in V} C(w_{n-N+1}^{n-1}, w) = C(w_{n-N+1}^{n-1})$
Example corpus: <s> I study I learn </s>
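A sketch of building the count matrix and probability matrix for this example corpus, using nested dictionaries instead of an explicit matrix; the variable names are illustrative:

```python
from collections import Counter, defaultdict

tokens = "<s> I study I learn </s>".split()
vocab = sorted(set(tokens))

# Bigram count matrix: rows are the preceding words, columns are the corpus words.
count_matrix = defaultdict(Counter)
for x, y in zip(tokens, tokens[1:]):
    count_matrix[x][y] += 1

# Probability matrix: divide every cell by its row sum.
prob_matrix = {
    row: {w: counts[w] / sum(counts.values()) for w in vocab}
    for row, counts in count_matrix.items()
}

print(prob_matrix["I"]["study"])  # C(I study) / C(I) = 1/2 = 0.5
print(prob_matrix["I"]["learn"])  # 1/2 = 0.5
```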
Language Model
Probability Matrix ➔ Language Model
Sentence Probability
Next Word Prediction
Log Probability
$P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$
All probabilities in the calculation are $\le 1$, and multiplying many of them risks numerical underflow.
Use log probabilities instead: $\log P(w_1^n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$
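A small illustration of why logs are used; the probability values are made up for the demonstration:

```python
import math

# Illustrative per-bigram probabilities for a long sequence (assumed values).
probs = [0.1] * 400

product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0 — the product underflows

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                           # ≈ -921.0 — still representable
```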
Train / Validate / Test Split
Corpus ➔ Training data, Validation data, Test data
Train on the training data; evaluate on the test data with perplexity.
07-03
Perplexity
$PP(W) = P(s_1, s_2, \cdots, s_m)^{-\frac{1}{m}}$
e.g., $m = 100$:
$P(W) = 0.9 \;\rightarrow\; PP(W) = 0.9^{-\frac{1}{100}} = 1.00105416$
$P(W) = 10^{-250} \;\rightarrow\; PP(W) = (10^{-250})^{-\frac{1}{100}} \approx 316$
$w_j^{(i)}$ ➔ the $j$-th word in the $i$-th sentence
$PP(W) = \sqrt[m]{\prod_{i=1}^{m} \frac{1}{P(w_i \mid w_{i-1})}} \qquad \log PP(W) = -\frac{1}{m} \sum_{i=1}^{m} \log_2 P(w_i \mid w_{i-1})$
$w_i$ ➔ the $i$-th word in the test set (all test sentences concatenated, $m$ words in total)
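A sketch of the log-space perplexity formula; the per-word probabilities below are assumed values, not taken from any corpus in the slides:

```python
import math

def perplexity(word_probs):
    """PP(W) = 2 ** ( -(1/m) * sum(log2 P(w_i | w_{i-1})) )."""
    m = len(word_probs)
    log_pp = -sum(math.log2(p) for p in word_probs) / m
    return 2 ** log_pp

# Assumed per-word bigram probabilities for a tiny test set.
print(perplexity([0.2, 0.5, 0.1, 0.25]))  # ≈ 4.47
```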
WSJ Corpus
Training: 38 million words
Test: 1.5 million words
Perplexity: Unigram 962, Trigram 109 (from Speech and Language Processing by Dan Jurafsky et al.)
N = 884,647 tokens, V = 29,066
07-04
Out of Vocabulary Words
Example: minimum word frequency $f = 2$
Original corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Words with frequency below $f$ replaced by <UNK>:
<s> Lyn drinks chocolate </s>
<s> <UNK> drinks <UNK> </s>
<s> Lyn <UNK> chocolate </s>
Criteria for building the closed vocabulary:
Minimum word frequency $f$
Maximum $|V|$: include words by frequency
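A sketch of the minimum-frequency criterion applied to this corpus; variable names are illustrative:

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

min_freq = 2
word_counts = Counter(w for sent in corpus for w in sent)
vocab = {w for w, c in word_counts.items() if c >= min_freq}

# Replace out-of-vocabulary words with <UNK>.
processed = [[w if w in vocab else "<UNK>" for w in sent] for sent in corpus]
for sent in processed:
    print(" ".join(sent))
# <s> Lyn drinks chocolate </s>
# <s> <UNK> drinks <UNK> </s>
# <s> Lyn <UNK> chocolate </s>
```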
Even so, $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}, w_n)}{C(w_{n-N+1}^{n-1})}$ can still be 0 for N-grams never seen in training ➔ motivates smoothing.
07-05
Smoothing & Interpolation
Smoothing
Add-one Smoothing (Laplacian Smoothing)
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + 1}{\sum_{w \in V} (C(w_{n-1}, w) + 1)} = \frac{C(w_{n-1}, w_n) + 1}{C(w_{n-1}) + |V|}$
Add-K Smoothing
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}, w_n) + k}{\sum_{w \in V} (C(w_{n-1}, w) + k)} = \frac{C(w_{n-1}, w_n) + k}{C(w_{n-1}) + k|V|}$
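A sketch of add-k smoothing for bigrams, reusing the small corpus from the probability-matrix example; the helper name is an assumption:

```python
from collections import Counter

def addk_bigram_prob(x, y, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k smoothed bigram probability: (C(x y) + k) / (C(x) + k * |V|)."""
    return (bigram_counts[(x, y)] + k) / (unigram_counts[x] + k * vocab_size)

tokens = "<s> I study I learn </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens[:-1])
V = len(set(tokens))

print(addk_bigram_prob("I", "study", bigram_counts, unigram_counts, V))  # (1+1)/(2+5) ≈ 0.286
print(addk_bigram_prob("I", "<s>", bigram_counts, unigram_counts, V))    # (0+1)/(2+5) ≈ 0.143
```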
Backoff
If 𝑁-Gram is missing ➔ Use (𝑁 − 1)-Gram, …
Probability Discounting e.g., Katz Backoff
“Stupid” Backoff
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
$P(\text{chocolate} \mid \text{John drinks}) = ?$ — the trigram "John drinks chocolate" never occurs, so with "stupid" backoff:
$\approx 0.4 \cdot P(\text{chocolate} \mid \text{drinks}) = 0.4 \cdot \frac{1}{2} = 0.2$
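A sketch of "stupid" backoff with the constant 0.4 from the slide, on the same three-sentence corpus; function and variable names are assumptions:

```python
from collections import Counter

corpus = [
    "<s> Lyn drinks chocolate </s>".split(),
    "<s> John drinks tea </s>".split(),
    "<s> Lyn eats chocolate </s>".split(),
]

# Unigram, bigram, and trigram counts over all sentences.
counts = {n: Counter() for n in (1, 2, 3)}
for sent in corpus:
    for n in (1, 2, 3):
        counts[n].update(tuple(sent[i:i + n]) for i in range(len(sent) - n + 1))

def stupid_backoff(context, word, alpha=0.4):
    """P(word | context); back off to a shorter context with weight alpha when the count is 0."""
    numerator = counts[len(context) + 1][tuple(context) + (word,)]
    if numerator > 0:
        return numerator / counts[len(context)][tuple(context)]
    if len(context) > 1:
        return alpha * stupid_backoff(context[1:], word, alpha)
    total_words = sum(counts[1].values())
    return alpha * counts[1][(word,)] / total_words  # last resort: scaled unigram frequency

print(stupid_backoff(("John", "drinks"), "chocolate"))  # 0.4 * P(chocolate | drinks) = 0.2
```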
Interpolation
$\hat{P}(\text{chocolate} \mid \text{John drinks}) = 0.7 \cdot P(\text{chocolate} \mid \text{John drinks}) + 0.2 \cdot P(\text{chocolate} \mid \text{drinks}) + 0.1 \cdot P(\text{chocolate})$
$\sum_i \lambda_i = 1$
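A sketch of linear interpolation with the $\lambda$ weights above; the component probability values passed in are illustrative placeholders, not computed from a corpus:

```python
def interpolated_prob(p_trigram, p_bigram, p_unigram, lambdas=(0.7, 0.2, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates; lambdas sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    l3, l2, l1 = lambdas
    return l3 * p_trigram + l2 * p_bigram + l1 * p_unigram

# e.g. P(chocolate | John drinks) with assumed component estimates:
print(interpolated_prob(p_trigram=0.0, p_bigram=0.5, p_unigram=0.25))  # 0.7*0 + 0.2*0.5 + 0.1*0.25 = 0.125
```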
Summary
N-Gram and Probabilities
N-Gram Language Model
Perplexity
Out of Vocabulary Words
Smoothing & Interpolation
Coding Project