TM340 Natural Language Processing
Language Modeling
Based on slides by Dan Jurafsky and Chris Manning
Agenda
➢ Introduction to N-grams
➢ Estimating N-gram Probabilities
➢ Evaluation and Perplexity
➢ Sampling and Generalization
➢ Smoothing: Add-one (Laplace) smoothing
➢ Interpolation, Backoff, and Web-Scale LMs
2
Introduction to N-grams
3
Probabilistic Language Models
➢ Goal: assign a probability to a sentence
• Machine Translation:
-P(high winds tonite) > P(large winds tonite)
• Spell Correction
-The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
-P(I saw a van) >> P(eyes awe of an)
• + Summarization, question-answering, etc.!!
4
Probabilistic Language Modeling
➢ Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
➢ Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
➢ A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
➢ Better: the grammar, but language model or LM is standard
5
How to compute P(W)
➢ Given the corpus:
I am happy because I am learning
➢ The unigram probability of the word "I" is given by:
$P(I) = \frac{C(I)}{\text{total number of words}} = \frac{2}{7}$
➢ And the unigram probability of the word "happy" is given by:
$P(happy) = \frac{C(happy)}{\text{total number of words}} = \frac{1}{7}$
6
How to compute P(W)
➢ Given the corpus:
I am happy because I am learning
➢ The probability of the word "am" following "I" is given by:
$P(am \mid I) = \frac{C(I\ am)}{C(I)} = \frac{2}{2} = 1$
➢ The probability of the bigram "I happy" is given by:
$P(happy \mid I) = \frac{C(I\ happy)}{C(I)} = \frac{0}{2} = 0$
➢ The probability of the bigram "am learning" is given by:
$P(learning \mid am) = \frac{C(am\ learning)}{C(am)} = \frac{1}{2}$
7
How to compute P(W)
➢ How to compute this joint probability:
P(its, water, is, so, transparent, that)
➢ Intuition: let’s rely on the Chain Rule of Probability
8
The Chain Rule
➢ Recall the definition of conditional probabilities
p(B|A) = P(A,B)/P(A) ➔ Rewriting: P(A,B) = P(A)P(B|A)
➢ With more variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
➢ The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
9
The Chain Rule
➢ P(“its water is so transparent”) =
P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
➢ Formally, the probability of a sentence or sequence of
words can be calculated using:
$P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \dots w_{i-1})$
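As a quick illustration (not from the original slides), here is a minimal Python sketch of this chain-rule decomposition; cond_prob is a hypothetical function returning P(word | history):

```python
def sentence_probability(words, cond_prob):
    """Chain rule: P(W) = product over i of P(w_i | w_1 ... w_{i-1})."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, tuple(words[:i]))  # P(w_i | w_1 ... w_{i-1})
    return prob
```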
10
How to estimate these probabilities?
➢ Could we just count and divide?
P(the | its water is so transparent that) =
Count(its water is so transparent that the)
Count(its water is so transparent that)
➢ No! Too many possible sentences!
➢ We’ll never see enough data for estimating these
➢ Instead, we can use “Markov Assumption” to estimate this
probability
11
Markov Assumption
➢ Simplifying assumption:
P(the | its water is so transparent that) » P(the | that)
➢ Or maybe
P(the | its water is so transparent that) » P(the | transparent that)
12
Markov Assumption
➢ Formally, the probability of a sentence or sequence of words
can be estimated using:
$P(w_1 w_2 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k} \dots w_{i-1})$
➢ In other words, we approximate each component using:
$P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-k} \dots w_{i-1})$
13
Simplest case: Unigram model
$P(w_1 w_2 \dots w_n) \approx \prod_{i=1}^{n} P(w_i)$
➢ The estimated probability of a sentence is the product of the probability of each word
(Unigrams).
➢ Example of some automatically generated sentences from a unigram model:
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter,
in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
14
Bigram model
$P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})$
➢ In Bigram model, the condition is based on the previous word.
➢ Example of some automatically generated sentences from a bigram
model:
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler,
house, said, mr., gurria, mexico, 's, motion, control, proposal, without,
permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
15
N-gram models
➢ We can extend the previous model to trigrams, 4-grams, and 5-grams
➢ In general, this is an insufficient model of language:
• because language has long-distance dependencies:
“The computer which I had just put into the machine room on the fifth floor crashed.”
➢ But we can often get away with N-gram models
16
Estimating N-gram Probabilities
17
Estimating bigram probabilities
➢ The Maximum Likelihood Estimate:
$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
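A minimal Python sketch of this MLE computation, assuming the corpus is already tokenized (and padded with <s> and </s> markers):

```python
from collections import Counter

def bigram_mle(tokens):
    """MLE bigram estimates: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {(prev, w): c / unigram_counts[prev]
            for (prev, w), c in bigram_counts.items()}
```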
18
Estimating bigram probabilities
➢ Example corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
➢ Apply $P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$ to estimate the bigram probabilities.
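Running the sketch from the previous slide on this toy corpus reproduces the standard estimates (joining the sentences into one token stream is a simplification; it only adds harmless cross-sentence bigrams):

```python
corpus = ["<s> I am Sam </s>",
          "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = " ".join(corpus).split()
probs = bigram_mle(tokens)
print(probs[("<s>", "I")])      # 2/3: c(<s> I) = 2, c(<s>) = 3
print(probs[("I", "am")])       # 2/3: c(I am) = 2, c(I) = 3
print(probs[("Sam", "</s>")])   # 1/2: c(Sam </s>) = 1, c(Sam) = 2
```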
19
Example: Berkeley Restaurant Project sentences
‐ can you tell me about any good cantonese restaurants close by
‐ mid priced thai food is what i’m looking for
‐ tell me about chez panisse
‐ can you give me a listing of the kinds of food that are available
‐ i’m looking for a good place to eat breakfast
‐ when is caffe venezia open during the day
20
Example: Berkeley Restaurant Project sentences
➢ Raw bigram counts (Out of 9222 sentences)
21
Example: Berkeley Restaurant Project sentences
➢ Normalize by unigrams:
➢ Raw bigram probabilities
22
Example: Berkeley Restaurant Project sentences
➢ Bigram estimates of sentence probabilities:
P(<s> I want english food </s>) = P(I | <s>)
× P(want | I)
× P(english | want)
× P(food | english)
× P(</s> | food)
= 0.000031
23
Practical Issues
➢ Some bigram probability estimates from the Berkeley Restaurant corpus:
P(english | want) = 0.0011
P(chinese | want) = 0.0065
P(to | want) = 0.66
P(eat | to) = 0.28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = 0.25
24
Practical Issues
➢ We do everything in log space
• Avoid underflow
• Adding is faster than multiplying
$\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$
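A minimal illustration in Python (the probability values are arbitrary):

```python
import math

probs = [0.25, 0.33, 0.66, 0.28]             # illustrative bigram probabilities
log_prob = sum(math.log(p) for p in probs)   # add log-probs instead of multiplying
print(log_prob)                              # ~ -4.18
print(math.exp(log_prob))                    # ~ 0.0152, the raw probability if needed
```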
25
Some useful language modeling tools
➢ Toolkits:
• SRILM: http://www.speech.sri.com/projects/srilm/
• KenLM: https://kheafield.com/code/kenlm/
➢ Google N-Gram Release:
• https://research.google/blog/all-our-n-gram-are-belong-to-you/
• Google Book N-grams:
-https://books.google.com/ngrams/
26
Evaluation and Perplexity
27
Evaluating Language Models
➢ Language models, like all NLP tools, need to be evaluated.
➢ We will cover evaluation of language models using training sets, test sets,
and the perplexity metric.
➢ These evaluation concepts apply to both n-gram and large language models.
28
How to evaluate language models
Extrinsic (in-vivo) Evaluation
➢ The best way to evaluate any NLP component, like an N-gram language
model, is extrinsic (in-vivo) evaluation: testing it in a live or real
application.
➢ To compare models A and B
1. Put each model in a real task:
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B:
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B
29
How to evaluate language models
Intrinsic (in-vitro) evaluation
➢ Extrinsic evaluation not always possible
• Expensive, time-consuming
• Doesn't always generalize to other applications
➢ Intrinsic (in-vitro) evaluation: perplexity
• Directly measures language model performance at predicting words.
• Doesn't necessarily correspond with real application performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-gram models
30
How to evaluate language models
Training sets and Test sets
➢ We train parameters of our model on a training set.
➢ We test the model’s performance on data we haven’t seen.
• A test set is an unseen dataset; different from training set.
• Intuition: we want to measure generalization to unseen data
• An evaluation metric (like perplexity) tells us how well our model
does on the test set.
31
How to evaluate language models
Training sets and Test sets
➢ If we're building an LM for a specific task
• The test set should reflect the task we want to use the
model for.
➢ If we're building a general-purpose model
• We'll need lots of different kinds of training data
• We don't want the training set or the test set to be just from
one domain or author or language
32
How to evaluate language models
Dev sets
➢ If we test on the test set many times we might implicitly tune to
its characteristics
• Noticing which changes make the model better.
➢ So, we run on the test set only once, or a few times
➢ That means we need a third dataset:
• A development test set, or devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
33
How to evaluate language models
➢ Intuition: A good LM prefers "real" sentences
• Assigns higher probability to "real" or "frequently observed"
sentences
• Assigns lower probability to "word salad" or "rarely
observed" sentences
34
How to evaluate language models
Predicting upcoming words
➢ The Shannon Game: How well can we predict the next word?
• Once upon a ____   (time 0.9, dream 0.03, midnight 0.02, …, and 1e-100)
• That is a picture of a ____
• For breakfast I ate my usual ____
➢ Unigrams are terrible at this game.
➢ A good LM is one that assigns a higher probability to the next word that actually occurs
35
How to evaluate language models
Predicting upcoming words
➢ We said: a good LM is one that assigns a higher probability to the
next word that actually occurs.
➢ Let's generalize to all the words!
• The best LM assigns high probability to the entire test set.
➢ When comparing two LMs, A and B
➢ We compute P_A(test set) and P_B(test set)
• The better language model will give a higher probability to the test set
than the other language model.
36
How to evaluate language models
Use perplexity instead of raw probability
➢ Probability depends on size of test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
➢ Perplexity is the inverse probability of the test set,
normalized by the number of words
$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$
37
How to evaluate language models
Use perplexity instead of raw probability
$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$
➢ Perplexity is the inverse probability of the test set, normalized by the number of
words
(The inverse comes from the original definition of perplexity from cross-entropy rate in information theory)
➢ Probability range is [0,1], perplexity range is [1,∞]
➢ Minimizing perplexity is the same as maximizing probability
38
How to evaluate language models
Perplexity for N-grams
$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$
➢ Expanding with the chain rule:
$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$
➢ For bigrams:
$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
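A minimal Python sketch of bigram perplexity over a test token sequence; bigram_prob is assumed to be a dict of (smoothed, non-zero) bigram probabilities such as the one built earlier:

```python
import math

def bigram_perplexity(tokens, bigram_prob):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), computed in log space."""
    n = len(tokens) - 1                              # number of predicted words
    log_sum = 0.0
    for prev, w in zip(tokens, tokens[1:]):
        log_sum += math.log(bigram_prob[(prev, w)])  # requires non-zero probabilities
    return math.exp(-log_sum / n)
```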
39
How to evaluate language models
perplexity as the weighted average branching factor
➢ Perplexity is also the weighted average branching factor of a language.
➢ Branching factor: number of possible next words that can follow any
word
➢ Example: Deterministic language L = {red, blue, green}
• Branching factor = 3 (any word can be followed by red, blue, green)
• Now assume LM A where each word follows any other word with equal
probability ⅓
• Given a test set T = "red red red red blue"
$PP_A(T) = P_A(\text{red red red red blue})^{-1/5} = ((1/3)^5)^{-1/5} = (1/3)^{-1} = 3$
40
How to evaluate language models
perplexity as the weighted average branching factor
➢ But now suppose red was very likely in training set, such
that for LM B:
➢ P(red) = 0.8, P(green) = 0.1, P(blue) = 0.1
➢ We would expect the probability to be higher, and
hence the perplexity to be smaller:
$PP_B(T) = P_B(\text{red red red red blue})^{-1/5} = (0.8^4 \times 0.1)^{-1/5} \approx (0.04)^{-1/5} \approx 1.9$
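Both numbers can be checked with a quick Python calculation (not from the original slides):

```python
p_a = (1/3) ** 5          # LM A: every word has probability 1/3
p_b = 0.8 ** 4 * 0.1      # LM B: red red red red blue
print(p_a ** (-1/5))      # 3.0
print(p_b ** (-1/5))      # ~1.89 (the slide rounds 0.8^4 * 0.1 = 0.04096 to 0.04, giving ~1.9)
```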
41
Sampling and Generalization
42
Sampling and Generalization
➢ Sampling helps visualize the type of knowledge a
language model embodies.
• Sampling from a distribution: Choosing random points
based on their likelihood.
• Sampling from a language model: Generating sentences
based on their probabilities as defined by the model.
• High-probability sentences are generated more often; low-
probability sentences are generated less often.
43
The Shannon Visualization Method
➢ Claude Shannon's 1948 Examples:
• Unigram model: Produces random, incoherent words.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT
NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO
FURNISHES THE LINE MESSAGE HAD BE THESE.
• Bigram model: Generates more coherent sequences, though
still random.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE
CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE
LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN
UNEXPECTED
44
The Shannon Visualization Method
➢ How did Shannon sample those words in 1948?
➢ Here is how he described the process, for letters rather than words:
"Open a book at random and select a letter at random on the page. This letter is
recorded. The book is then opened to another page and one reads until this letter
is encountered. The succeeding letter is then recorded. Turning to another page
this second letter is searched for and the succeeding letter recorded, etc."
45
Sampling a word from a distribution
➢ Unigram case: Visualize words in the English language covering the probability space from 0 to 1.
➢ Each word occupies an interval proportional to its frequency.
➢ Random selection: Choose a random value between 0 and 1.
• Identify the word whose interval contains the chosen value.
• Example: Random number 0.12 corresponds to the word "to".
➢ Repeat the process until the end-of-sentence token is generated.
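A minimal Python sketch of this interval-based sampling; the unigram probabilities below are made-up numbers, chosen so that a draw of 0.12 falls in the interval for "to":

```python
import random

def sample_word(distribution):
    """Sample a word by walking cumulative probability intervals over [0, 1)."""
    r = random.random()
    cumulative = 0.0
    for word, p in distribution.items():
        cumulative += p
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding at the upper edge

unigram = {"the": 0.07, "to": 0.06, "of": 0.05, "a": 0.04, "<other>": 0.78}
print(sample_word(unigram))
```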
46
Visualizing Bigrams
➢ We can extend this idea to n-grams.
➢ Example for the bigram case:
• Choose a random bigram (<s>, w) according to its probability P(w | <s>)
• Then choose a random bigram (w, x) according to its probability P(x | w)
• And so on until we choose </s>
• Then string the words together:
<s> I
    I want
      want to
         to eat
            eat Chinese
               Chinese food
                  food </s>
I want to eat Chinese food
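A minimal Python sketch of this bigram sampling procedure, assuming bigram_prob maps (previous word, word) pairs to probabilities, as in the earlier MLE sketch:

```python
import random

def sample_sentence(bigram_prob, max_len=20):
    """Shannon-style generation: sample (prev, w) bigrams until </s> is drawn."""
    sentence, prev = [], "<s>"
    for _ in range(max_len):
        candidates = {w: p for (h, w), p in bigram_prob.items() if h == prev}
        if not candidates:
            break
        words, weights = list(candidates), list(candidates.values())
        prev = random.choices(words, weights=weights, k=1)[0]
        if prev == "</s>":
            break
        sentence.append(prev)
    return " ".join(sentence)
```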
➢ There are other, more sophisticated sampling algorithms, like Temperature
sampling, Top-k sampling, Top-p sampling.
• They are mainly used to avoid generating rare words.
47
Approximating Shakespeare
➢ This figure shows some random sentences generated from unigram,
bigram, trigram, and 4-gram models trained on the works of William
Shakespeare. The longer the context, the more coherent the sentences.
48
Approximating Shakespeare
➢ Shakespeare works as corpus:
• N=884,647 tokens, V=29,066
➢ Shakespeare produced 300,000 bigram types out of V² = 844 million
possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• That sparsity is even worse for 4-grams, explaining why our sampling generated
actual Shakespeare
49
The Wall Street Journal is not Shakespeare
➢ Exploring Grammar Dependence: Analyze how a grammar depends on its training set.
➢ Example: An n-gram grammar trained on the Wall Street Journal (WSJ).
➢ Contrast: Compare with a corpus like Shakespeare.
• Expectation: Both are in English, so some overlap might be expected.
• Reality: There is no overlap, even in small phrases.
• Conclusion: Statistical models are ineffective when training and test sets are vastly different (e.g.,
Shakespeare vs. WSJ).
50
Choosing training data
➢ If you need a task-specific LM, use a training corpus that matches the
genre of your task.
• For a medical language model, use a lot of medical documents as the
corpus.
• For a question-answering system, use a corpus of questions.
➢ Ensure the training data includes the appropriate dialect and variety of
speaker/authors.
• Example: African American English (AAE) features in social media, like the
word finna (marking immediate future tense) e.g. "My phone finna die"
51
N-gram Limitations and Generalization - Overfitting
➢ N-grams only work well for word prediction if the test
corpus looks like the training corpus
• But even when we try to pick a good training corpus, the
test set will surprise us!
• We need to train robust models that generalize!
52
N-gram Limitations and Generalization - Zeros
➢ One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set
➢ Example:
Training set:
… ate lunch
… ate dinner
… ate a
… ate the
Test set
… ate lunch
… ate breakfast
P(“breakfast” | ate) = 0
53
N-gram Limitations and Generalization - Zeros
➢ Bigrams with zero probability:
• Will hurt our performance for texts where those words
appear!
• And mean that we will assign 0 probability to the test set!
➢ And hence we cannot compute perplexity (can’t divide
by 0)!
54
Smoothing: Add-one (Laplace)
smoothing
55
The intuition of smoothing
➢ When we have sparse statistics:
P(w | denied the)
3 allegations
2 reports
1 claims
1 request
7 total
➢ Steal probability mass to generalize better
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
56
Add-one estimation (Laplace smoothing)
➢ Also called Laplace smoothing
➢ Pretend we saw each word one more time than we did
➢ Just add one to all the counts!
➢ Maximum Likelihood Estimate:
$P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$
➢ Add-1 Estimate:
$P_{Add\text{-}1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$
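A minimal Python sketch of the Add-1 estimate over a tokenized corpus, taking V as the number of word types seen in training:

```python
from collections import Counter

def add_one_bigram(tokens):
    """Add-1: P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigram_counts)                 # V = number of word types
    def prob(prev, w):
        return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + vocab_size)
    return prob
```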
57
Maximum Likelihood Estimates
➢ The maximum likelihood estimate
• of some parameter of a model M from a training set T
• maximizes the likelihood of the training set T given the model M
➢ Suppose the word “bagel” occurs 400 times in a corpus of a million words
➢ What is the probability that a random word from some other text will be
“bagel”?
➢ The MLE estimate is 400/1,000,000 = .0004
➢ This may be a bad estimate for some other corpus
• But it is the estimate that makes it most likely that “bagel” will occur 400 times in a
million-word corpus.
58
Laplace smoothed bigram counts
➢ Considering Berkeley Restaurant Corpus:
59
Laplace-smoothed bigrams
60
Reconstituted counts
61
Comparing reconstituted counts with raw counts
➢ C(want to) went from 608 to 238 ➔ P(to|want) from 0.66 to 0.26!
➢ Discount d = c*/c ➔ d for "chinese food" = 0.10! (a 10× reduction)
62
Add-one estimation (Laplace smoothing)
➢ So, add-1 isn’t used for N-grams:
• We’ll see better methods
➢ But add-1 is used to smooth other NLP models:
• For text classification
• In domains where the number of zeros isn’t so huge.
63
Interpolation, Backoff,
and Web-Scale LMs
64
Backoff and Interpolation
➢ Sometimes it helps to use less context
• Condition on less context for contexts you haven’t learned much about
➢ Backoff:
• If the N-gram we need has zero counts, we can approximate it by backing off to the (N-1)-gram.
If the (N-1)-gram has zero counts, we can approximate it by backing off to the (N-2)-gram, and
so on. We continue backing off until we reach something that has non-zero counts.
- e.g. use trigram if you have good evidence, otherwise bigram, otherwise unigram
➢ Interpolation:
• Mix unigram, bigram, trigram
• Interpolation works better
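A minimal sketch of the linear interpolation just described; p_uni, p_bi, and p_tri are assumed to be dicts of n-gram probabilities, and the lambda weights are illustrative (in practice they sum to 1 and are tuned on a held-out dev set):

```python
def interpolated_trigram(w, prev2, prev1, p_uni, p_bi, p_tri,
                         lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram, and trigram estimates."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l3 * p_tri.get((prev2, prev1, w), 0.0))
```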
65
Unknown words
Open versus closed vocabulary tasks
➢ If we know all the words in advance
• Vocabulary V is fixed
• Closed vocabulary task
➢ Often, we don’t know this
• Out Of Vocabulary = OOV words
• Open vocabulary task
➢ Instead: create an unknown word token <UNK>
• Training of <UNK> probabilities
- Create a fixed lexicon L of size V
- At text normalization phase, any training word not in L changed to <UNK>
- Now we train its probabilities like a normal word
• At decoding time
- For text input: use the <UNK> probabilities for any word not seen in training
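A minimal sketch of the <UNK> normalization step, using a toy lexicon L:

```python
def replace_oov(tokens, lexicon):
    """Map any token outside the fixed lexicon L to the <UNK> token."""
    return [t if t in lexicon else "<UNK>" for t in tokens]

lexicon = {"<s>", "i", "want", "to", "eat", "chinese", "food", "</s>"}  # toy lexicon L
print(replace_oov("<s> i want to eat thai food </s>".split(), lexicon))
# ['<s>', 'i', 'want', 'to', 'eat', '<UNK>', 'food', '</s>']
```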
66
Web-scale N-grams
How to deal with, e.g., Google N-gram corpus?
➢ Pruning
• Only store N-grams with count > threshold.
• Entropy-based pruning
➢ Efficiency
• Efficient data structures like tries
• Store words as indexes, not strings
-Use Huffman coding to fit large numbers of words into two bytes
• Quantize probabilities (4-8 bits instead of 8-byte float)
67
Smoothing for Web-scale N-grams
➢ A simple yet efficient approach is the Stupid Backoff by Brants et
al. (2007), where the α parameter was heuristically set to 0.4.
➢ Originally, the authors thought that such a simple approach could not
possibly work well, but it turned out to be very effective and
inexpensive.
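A minimal sketch of the Stupid Backoff score for a trigram model; tri, bi, and uni are assumed to be raw count tables (keyed by 3-tuples, 2-tuples, and single words) and total is the corpus size:

```python
def stupid_backoff(w, prev2, prev1, tri, bi, uni, total, alpha=0.4):
    """Stupid Backoff score S(w | prev2 prev1); scores are not true probabilities."""
    if tri.get((prev2, prev1, w), 0) > 0:
        return tri[(prev2, prev1, w)] / bi[(prev2, prev1)]
    if bi.get((prev1, w), 0) > 0:
        return alpha * bi[(prev1, w)] / uni[prev1]
    return alpha * alpha * uni.get(w, 0) / total
```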
68
N-gram Smoothing Summary
➢ Add-1 (Laplace) smoothing is good for text categorization,
but not for language modeling.
➢ We can use Backoff, but Interpolation is better.
➢ There are a couple of advanced smoothing techniques,
such as Good-Turing smoothing or Kneser-Ney smoothing.
➢ For very large N-grams like the Web:
• Stupid backoff
69
Thank You