Language Modeling: Introduction to N-grams
Probabilistic Language Models
Today’s goal: assign a probability to a sentence
Why?
◦ Machine Translation:
◦ P(high winds tonite) > P(large winds tonite)
◦ Spell Correction
◦ The office is about fifteen minuets from my house
◦ P(about fifteen minutes from) > P(about fifteen minuets from)
◦ Speech Recognition
◦ P(I saw a van) >> P(eyes awe of an)
◦ + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of
words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
Better: the grammar! But language model or LM is standard.
Models that assign probabilities to sequences of words are called
language models or LMs.
How to compute P(W)
How to compute this joint probability:
◦ P(its, water, is, so, transparent, that)
Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
Recall the definition of conditional probabilities
p(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A)P(B|A)
More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
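As a sketch, the decomposition can be written generically; cond_prob below stands for any hypothetical function returning P(word | history):

```python
def chain_rule_prob(words, cond_prob):
    """P(w1,...,wn) = product over i of P(wi | w1,...,wi-1).

    cond_prob is a hypothetical callable mapping (word, history) to a
    conditional probability; the decomposition itself is exact and makes
    no independence assumptions.
    """
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])
    return prob
```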
How to estimate these
probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these
An n-gram is a sequence of n words
a 2-gram (which we’ll call bigram) is a two-word sequence
of words like “please turn”, “turn your”, or ”your
homework”
a 3-gram (a trigram) is a three-word sequence of words
like “please turn your”, or “turn your homework”.
Markov Assumption
Simplifying assumption (Andrei Markov): condition only on the previous word:
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe on the previous two words:
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
In other words, we approximate each component in the product:
P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-k,…,wi-1)
Simplest case: Unigram model
P(w1,w2,…,wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model
fifth, an, of, futures, the, an, incorporated, a,
a, the, inflation, most, dollars, quarter, in, is,
mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Bigram model
Condition on the previous word:
P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-1)
Some automatically generated sentences from a bigram model:
texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
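The sentences above were generated from bigram models trained on large corpora; as a rough sketch of the sampling idea, the toy code below builds bigram continuation lists from a tiny made-up corpus (unrelated to the slide’s data) and samples one word at a time, conditioning only on the previous word:

```python
import random
from collections import defaultdict

# Tiny made-up corpus, purely illustrative (not the data behind the slide's examples)
sentences = [
    "<s> the cat sat on the mat </s>",
    "<s> the dog sat on the rug </s>",
    "<s> the cat ran </s>",
]

continuations = defaultdict(list)
for sentence in sentences:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        continuations[prev].append(nxt)

def generate_bigram_sentence(max_len=20):
    """Sample words left to right, each conditioned only on the previous word."""
    word, output = "<s>", []
    for _ in range(max_len):
        word = random.choice(continuations[word])  # proportional to bigram counts
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

print(generate_bigram_sentence())
```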
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of language
◦ because language has long-distance dependencies:
“The computer which I had just put into the machine room
on the fifth floor crashed.”
But we can often get away with N-gram models
Language Modeling: Estimating N-gram Probabilities
Estimating bigram probabilities
The Maximum Likelihood Estimate:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
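A minimal counting sketch of the maximum likelihood estimate on this mini-corpus (variable and function names are just illustrative):

```python
from collections import Counter

# The three-sentence corpus from the slide, with sentence-boundary markers
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def mle_bigram_prob(w_prev, w):
    """MLE estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(mle_bigram_prob("<s>", "I"))     # 2/3
print(mle_bigram_prob("I", "am"))      # 2/3
print(mle_bigram_prob("Sam", "</s>"))  # 1/2
```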
More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize each bigram count by the unigram count of the first word:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
Practical Issues
We do everything in log space
◦ Avoid underflow
◦ (also adding is faster than multiplying)
◦ log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
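For example, P(<s> I want english food </s>) ≈ .000031 from the earlier slide can be accumulated as a sum of logs rather than a product of probabilities. The individual bigram values below are the BeRP estimates the slides draw on, quoted here only for illustration:

```python
import math

# Bigram probabilities for "<s> I want english food </s>" (BeRP estimates, illustrative)
bigram_probs = [
    0.25,    # P(I | <s>)
    0.33,    # P(want | I)
    0.0011,  # P(english | want)
    0.5,     # P(food | english)
    0.68,    # P(</s> | food)
]

log_prob = sum(math.log(p) for p in bigram_probs)  # add logs instead of multiplying
print(log_prob)            # about -10.4 (natural log)
print(math.exp(log_prob))  # about 3.1e-05, matching the .000031 on the slide
```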
Language Modeling Toolkits
SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
Google Books N-grams
◦ http://ngrams.googlelabs.com/
Language Modeling: Evaluating Language Models
Extrinsic Evaluation
The best way to evaluate the performance of a language
model is to embed it in an application and measure how
much the application improves. Such end-to-end evaluation
is called extrinsic evaluation.
It is the only way to know if a particular improvement in a
component is really going to help the task at hand.
Thus, for speech recognition, we can compare the
performance of two language models by running the speech
recognizer twice, once with each language model, and
seeing which gives the more accurate transcription.
Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
◦ Put each model in a task
◦ spelling corrector, speech recognizer, MT system
◦ Run the task, get an accuracy for A and for B
◦ How many misspelled words corrected properly
◦ How many words translated correctly
◦ Compare accuracy for A and B
Intrinsic evaluation
Unfortunately, running big NLP systems end-to-end is often very
expensive.
Instead, it would be nice to have a metric that can be used to
quickly evaluate potential improvements in a language model.
An intrinsic evaluation metric is one that measures the quality of a model independent of any application.
Intrinsic Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
◦ Assign higher probability to “real” or “frequently observed” sentences
◦ Than “ungrammatical” or “rarely observed” sentences?
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
◦ A test set is an unseen dataset that is different from our training set,
totally unused.
◦ An evaluation metric tells us how well our model does on the test set.
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
Extrinsic evaluation
◦ Time-consuming; can take days or weeks
So
◦ Sometimes use intrinsic evaluation: perplexity
◦ Bad approximation
◦ unless the test data looks just like the training data
◦ So generally only useful in pilot experiments
◦ But is helpful to think about.
Intuition of Perplexity
The Shannon Game:
◦ How well can we predict the next word?
◦ I always order pizza with cheese and ____
◦ The 33rd President of the US was ____
◦ I saw a ____
◦ Candidate continuations for the first blank, with probabilities: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
◦ Unigrams are terrible at this game. (Why?)
A better model of a text
◦ is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
◦ Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1,w2,…,wN)^(-1/N)
Applying the chain rule:
PP(W) = ( ∏ i=1..N 1 / P(wi | w1,…,wi-1) )^(1/N)
For bigrams:
PP(W) = ( ∏ i=1..N 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability
Perplexity as branching factor
Let’s suppose a sentence consisting of random digits
What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
PP(W) = ((1/10)^N)^(-1/N) = 10
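A quick sketch confirming the branching-factor intuition: under a model that assigns P = 1/10 to every digit, the perplexity of any digit string comes out to 10.

```python
def perplexity(word_probs):
    """Perplexity = (product of 1/P(wi | history))^(1/N) over the test words."""
    n = len(word_probs)
    inv_prod = 1.0
    for p in word_probs:
        inv_prod *= 1.0 / p
    return inv_prod ** (1.0 / n)

digit_sentence = "3 5 0 9 1 4 2"                   # any string of random digits
probs = [1.0 / 10] * len(digit_sentence.split())   # model: P = 1/10 per digit
print(perplexity(probs))                           # 10.0 (up to floating-point rounding)
```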
Lower perplexity = better model
Training 38 million words, test 1.5 million words, WSJ
N-gram order:  Unigram  Bigram  Trigram
Perplexity:    962      170     109
Language Modeling: Add-one (Laplace) Smoothing
Laplace Smoothing (Add-one Smoothing)
The simplest way to do smoothing is to add one to all the n-gram
counts, before we normalize them into probabilities.
All the counts that used to be zero will now have a count of 1, the
counts of 1 will be 2, and so on.
This algorithm is called Laplace smoothing
Laplace smoothing applied to unigram probabilities
MLE estimate: the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:
P(wi) = ci / N
Add-one estimate: Laplace smoothing merely adds one to each count (hence its alternate name, add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations:
P_Laplace(wi) = (ci + 1) / (N + V)
Berkeley Restaurant Corpus:
Laplace smoothed bigram counts
Laplace-smoothed bigrams
For add-one smoothed bigram probabilities, we augment the unigram count in the denominator by the number of word types in the vocabulary V:
P_Laplace(wn | wn-1) = (C(wn-1, wn) + 1) / (C(wn-1) + V)
Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are now small but non-zero.
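A minimal sketch of the add-one bigram formula above as a function over precomputed counts (the count dictionaries and vocabulary size are assumed inputs, in the style of the earlier counting sketch):

```python
def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed P(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    numerator = bigram_counts.get((w_prev, w), 0) + 1
    denominator = unigram_counts.get(w_prev, 0) + vocab_size
    return numerator / denominator
```

With this, a bigram that never occurred in training no longer gets probability zero.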
Reconstituted counts
It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts:
c*(wn-1, wn) = (C(wn-1, wn) + 1) × C(wn-1) / (C(wn-1) + V)
Compare with raw bigram
counts
Add-1 estimation is a blunt
instrument
So add-1 isn’t used for N-grams:
◦ We’ll see better methods
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.
Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the
probability mass from the seen to the unseen events.
Instead of adding 1 to each count, we add a fractional count k (0.5? 0.05? 0.01?):
P_Add-k(wn | wn-1) = (C(wn-1, wn) + k) / (C(wn-1) + kV)
This algorithm is therefore called add-k smoothing.
Add-k smoothing requires that we have a method for choosing k; this can
be done, for example, by optimizing on a devset. Although add-k is useful
for some tasks (including text classification), it turns out that it still doesn’t
work well for language modeling, generating counts with poor variances
and often inappropriate discounts
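As a sketch, this is the same counting machinery with a fractional pseudo-count; k, the count dictionaries, and the vocabulary size are assumed inputs:

```python
def addk_bigram_prob(w_prev, w, k, bigram_counts, unigram_counts, vocab_size):
    """Add-k smoothed P(w | w_prev) = (C(w_prev, w) + k) / (C(w_prev) + k*V)."""
    return ((bigram_counts.get((w_prev, w), 0) + k)
            / (unigram_counts.get(w_prev, 0) + k * vocab_size))
```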
Language Modeling: Backoff and Interpolation
Backoff and Interpolation
Sometimes it helps to use less context
◦ Condition on less context for contexts you haven’t learned much about
Backoff:
◦ use trigram if you have good evidence,
◦ otherwise bigram, otherwise unigram
◦ In other words, we only “back off” to a lower-order n-gram if we have zero evidence for
a higher-order n-gram.
Interpolation:
◦ we always mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts
◦ i.e., mix unigram, bigram, and trigram estimates
Interpolation works better
Linear Interpolation
Simple interpolation:
P̂(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1),  with λ1 + λ2 + λ3 = 1
Lambdas conditional on context:
P̂(wn | wn-2, wn-1) = λ1(wn-2, wn-1) P(wn) + λ2(wn-2, wn-1) P(wn | wn-1) + λ3(wn-2, wn-1) P(wn | wn-2, wn-1)
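A minimal sketch of simple (context-independent) interpolation; the probability dictionaries and lambda values below are illustrative assumptions, not trained estimates:

```python
def interpolated_trigram_prob(w1, w2, w3, unigram_p, bigram_p, trigram_p,
                              lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram MLE estimates with fixed weights.

    unigram_p maps w -> P(w), bigram_p maps (w2, w3) -> P(w3 | w2),
    trigram_p maps (w1, w2, w3) -> P(w3 | w1, w2). The lambdas must sum to 1;
    the defaults here are illustrative, not tuned values.
    """
    l_uni, l_bi, l_tri = lambdas
    return (l_uni * unigram_p.get(w3, 0.0)
            + l_bi * bigram_p.get((w2, w3), 0.0)
            + l_tri * trigram_p.get((w1, w2, w3), 0.0))
```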
How to set the lambdas?
Use a held-out corpus
Training Data | Held-Out Data | Test Data
Choose λs to maximize the probability of held-out data:
◦ Fix the N-gram probabilities (on the training data)
◦ Then search for λs that give largest probability to held-out set:
There are various ways to find this optimal set of λs.
One way is to use the EM algorithm, an iterative learning algorithm
that converges on locally optimal λs
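The slide mentions EM; a cruder but self-contained alternative is a brute-force grid search over λ triples that sum to 1, keeping whichever gives the held-out data the highest log probability. This is plainly a swap for EM, and all names below are illustrative assumptions:

```python
import itertools
import math

def heldout_logprob(lambdas, heldout_trigrams, unigram_p, bigram_p, trigram_p):
    """Log probability of held-out trigrams under linear interpolation."""
    l_uni, l_bi, l_tri = lambdas
    total = 0.0
    for w1, w2, w3 in heldout_trigrams:
        p = (l_uni * unigram_p.get(w3, 0.0)
             + l_bi * bigram_p.get((w2, w3), 0.0)
             + l_tri * trigram_p.get((w1, w2, w3), 0.0))
        total += math.log(p) if p > 0 else float("-inf")
    return total

def grid_search_lambdas(heldout_trigrams, unigram_p, bigram_p, trigram_p, step=0.1):
    """Try lambda triples summing to 1; the n-gram probabilities stay fixed."""
    best, best_lp = None, float("-inf")
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    for l_uni, l_bi in itertools.product(grid, repeat=2):
        l_tri = 1.0 - l_uni - l_bi
        if l_tri < 0:
            continue
        lp = heldout_logprob((l_uni, l_bi, l_tri),
                             heldout_trigrams, unigram_p, bigram_p, trigram_p)
        if lp > best_lp:
            best, best_lp = (l_uni, l_bi, l_tri), lp
    return best
```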
Katz backoff
Just as with add-one smoothing, if the higher-order n-grams aren’t
discounted and we just used the undiscounted MLE probability, then
as soon as we replaced an n-gram which has zero probability with a
lower-order n-gram, we would be adding probability mass, and the
total probability assigned to all possible strings by the language
model would be greater than 1!
In addition to this explicit discount factor, we’ll need a function α to
distribute this probability mass to the lower order n-grams.
This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted probability P* if we've seen this n-gram before (i.e., if we have non-zero counts); otherwise we back off to the lower-order n-gram. For bigrams:
P_BO(wn | wn-1) = P*(wn | wn-1)      if C(wn-1, wn) > 0
P_BO(wn | wn-1) = α(wn-1) · P(wn)    otherwise
Katz backoff is often combined with a smoothing
method called Good-Turing.
The combined Good-Turing backoff algorithm
involves quite detailed computation for estimating
the Good-Turing smoothing and the P* and α
values.
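A structural sketch of the bigram case only, assuming the discounted probabilities P* and the backoff weights α have already been computed (e.g. with Good-Turing discounting), which is exactly the detailed part described above and not shown here:

```python
def katz_bigram_prob(w_prev, w, p_star, alpha, unigram_p):
    """Katz backoff for bigrams, given precomputed tables (structural sketch).

    p_star:    dict (w_prev, w) -> discounted probability P*(w | w_prev) for seen bigrams
    alpha:     dict w_prev -> weight distributing the held-out mass to unseen continuations
    unigram_p: dict w -> unigram probability P(w)
    """
    if (w_prev, w) in p_star:   # non-zero count: use the discounted estimate
        return p_star[(w_prev, w)]
    return alpha.get(w_prev, 1.0) * unigram_p.get(w, 0.0)  # otherwise back off to the unigram
```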