Language Modeling: Introduction to N-grams
Probabilistic Language Models
Today’s goal: assign a probability to a sentence
Why?
◦ Machine Translation:
◦ P(high winds tonite) > P(large winds tonite)
◦ Spell Correction
◦ The office is about fifteen minuets from my house
◦ P(about fifteen minutes from) > P(about fifteen minuets from)
◦ Speech Recognition
◦ P(I saw a van) >> P(eyes awe of an)
◦ + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of
words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
Better: the grammar! But language model or LM is standard.
Models that assign probabilities to sequences of words are called
language models or LMs.
How to compute P(W)
How to compute this joint probability:
◦ P(its, water, is, so, transparent, that)
Intuition: let’s rely on the Chain Rule of Probability
Reminder: The Chain Rule
Recall the definition of conditional probabilities
p(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A)P(B|A)
More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
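As a sketch, the decomposition can be written generically; cond_prob below stands for any hypothetical function returning P(word | history):

```python
def chain_rule_prob(words, cond_prob):
    """P(w1,...,wn) = product over i of P(wi | w1,...,wi-1).

    cond_prob is a hypothetical callable mapping (word, history) to a
    conditional probability; the decomposition itself is exact and makes
    no independence assumptions.
    """
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])
    return prob
```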
How to estimate these
probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these
An n-gram is a sequence of n words
a 2-gram (which we’ll call bigram) is a two-word sequence
of words like “please turn”, “turn your”, or ”your
homework”
a 3-gram (a trigram) is a three-word sequence of words
like “please turn your”, or “turn your homework”.
Markov Assumption
Simplifying assumption (Andrei Markov): condition only on the previous word:
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe on the previous two words:
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
In other words, we approximate each component in the product:
P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-k,…,wi-1)
Simplest case: Unigram model
P(w1,w2,…,wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model
fifth, an, of, futures, the, an, incorporated, a,
a, the, inflation, most, dollars, quarter, in, is,
mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Bigram model
Condition on the previous word:
P(wi | w1,w2,…,wi-1) ≈ P(wi | wi-1)
Some automatically generated sentences from a bigram model:
texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
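The sentences above were generated from bigram models trained on large corpora; as a rough sketch of the sampling idea, the toy code below builds bigram continuation lists from a tiny made-up corpus (unrelated to the slide’s data) and samples one word at a time, conditioning only on the previous word:

```python
import random
from collections import defaultdict

# Tiny made-up corpus, purely illustrative (not the data behind the slide's examples)
sentences = [
    "<s> the cat sat on the mat </s>",
    "<s> the dog sat on the rug </s>",
    "<s> the cat ran </s>",
]

continuations = defaultdict(list)
for sentence in sentences:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        continuations[prev].append(nxt)

def generate_bigram_sentence(max_len=20):
    """Sample words left to right, each conditioned only on the previous word."""
    word, output = "<s>", []
    for _ in range(max_len):
        word = random.choice(continuations[word])  # proportional to bigram counts
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)

print(generate_bigram_sentence())
```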
N-gram models
We can extend to trigrams, 4-grams, 5-grams
In general this is an insufficient model of language
◦ because language has long-distance dependencies:
“The computer which I had just put into the machine room
on the fifth floor crashed.”
But we can often get away with N-gram models
Language Modeling: Estimating N-gram Probabilities
Estimating bigram probabilities
The Maximum Likelihood Estimate:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
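A minimal counting sketch of the maximum likelihood estimate on this mini-corpus (variable and function names are just illustrative):

```python
from collections import Counter

# The three-sentence corpus from the slide, with sentence-boundary markers
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def mle_bigram_prob(w_prev, w):
    """MLE estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(mle_bigram_prob("<s>", "I"))     # 2/3
print(mle_bigram_prob("I", "am"))      # 2/3
print(mle_bigram_prob("Sam", "</s>"))  # 1/2
```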
More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize each bigram count by the unigram count of the first word:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
Practical Issues
We do everything in log space
◦ Avoid underflow
◦ (also adding is faster than multiplying)
◦ log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
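For example, P(<s> I want english food </s>) ≈ .000031 from the earlier slide can be accumulated as a sum of logs rather than a product of probabilities. The individual bigram values below are the BeRP estimates the slides draw on, quoted here only for illustration:

```python
import math

# Bigram probabilities for "<s> I want english food </s>" (BeRP estimates, illustrative)
bigram_probs = [
    0.25,    # P(I | <s>)
    0.33,    # P(want | I)
    0.0011,  # P(english | want)
    0.5,     # P(food | english)
    0.68,    # P(</s> | food)
]

log_prob = sum(math.log(p) for p in bigram_probs)  # add logs instead of multiplying
print(log_prob)            # about -10.4 (natural log)
print(math.exp(log_prob))  # about 3.1e-05, matching the .000031 on the slide
```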
Language Modeling Toolkits
SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
Google Books N-grams
◦ http://ngrams.googlelabs.com/
Language Modeling: Evaluating Language Models
Extrinsic Evaluation
The best way to evaluate the performance of a language
model is to embed it in an application and measure how
much the application improves. Such end-to-end evaluation
is called extrinsic evaluation.
It is the only way to know if a particular improvement in a
component is really going to help the task at hand.
Thus, for speech recognition, we can compare the
performance of two language models by running the speech
recognizer twice, once with each language model, and
seeing which gives the more accurate transcription.
Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
◦ Put each model in a task
◦ spelling corrector, speech recognizer, MT system
◦ Run the task, get an accuracy for A and for B
◦ How many misspelled words corrected properly
◦ How many words translated correctly
◦ Compare accuracy for A and B
Intrinsic evaluation
Unfortunately, running big NLP systems end-to-end is often very
expensive.
Instead, it would be nice to have a metric that can be used to
quickly evaluate potential improvements in a language model.
An intrinsic evaluation metric is one that measures the quality of a model independent of any application.
Intrinsic Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
◦ Assign higher probability to “real” or “frequently observed” sentences
◦ Than “ungrammatical” or “rarely observed” sentences?
We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
◦ A test set is an unseen dataset that is different from our training set,
totally unused.
◦ An evaluation metric tells us how well our model does on the test set.
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
Extrinsic evaluation
◦ Time-consuming; can take days or weeks
So
◦ Sometimes use intrinsic evaluation: perplexity
◦ Bad approximation
◦ unless the test data looks just like the training data
◦ So generally only useful in pilot experiments
◦ But is helpful to think about.
Intuition of Perplexity
The Shannon Game:
◦ How well can we predict the next word?
◦ I always order pizza with cheese and ____
◦ The 33rd President of the US was ____
◦ I saw a ____
◦ Candidate continuations for the first blank, with probabilities: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
◦ Unigrams are terrible at this game. (Why?)
A better model of a text
◦ is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
◦ Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1,w2,…,wN)^(-1/N)
Applying the chain rule:
PP(W) = ( ∏ i=1..N 1 / P(wi | w1,…,wi-1) )^(1/N)
For bigrams:
PP(W) = ( ∏ i=1..N 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability
Perplexity as branching factor
Let’s suppose a sentence consisting of random digits
What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
PP(W) = ((1/10)^N)^(-1/N) = 10
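A quick sketch confirming the branching-factor intuition: under a model that assigns P = 1/10 to every digit, the perplexity of any digit string comes out to 10.

```python
def perplexity(word_probs):
    """Perplexity = (product of 1/P(wi | history))^(1/N) over the test words."""
    n = len(word_probs)
    inv_prod = 1.0
    for p in word_probs:
        inv_prod *= 1.0 / p
    return inv_prod ** (1.0 / n)

digit_sentence = "3 5 0 9 1 4 2"                   # any string of random digits
probs = [1.0 / 10] * len(digit_sentence.split())   # model: P = 1/10 per digit
print(perplexity(probs))                           # 10.0 (up to floating-point rounding)
```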
Lower perplexity = better model
Training 38 million words, test 1.5 million words, WSJ
N-gram order:  Unigram  Bigram  Trigram
Perplexity:    962      170     109
Language Modeling: Add-one (Laplace) Smoothing
Laplace Smoothing (Add-one Smoothing)
The simplest way to do smoothing is to add one to all the n-gram
counts, before we normalize them into probabilities.
All the counts that used to be zero will now have a count of 1, the
counts of 1 will be 2, and so on.
This algorithm is called Laplace smoothing
Laplace smoothing applied to unigram probabilities
MLE estimate: the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:
P(wi) = ci / N
Add-one estimate: Laplace smoothing merely adds one to each count (hence its alternate name, add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations:
P_Laplace(wi) = (ci + 1) / (N + V)
Berkeley Restaurant Corpus:
Laplace smoothed bigram counts
Laplace-smoothed bigrams
For add-one smoothed bigram probabilities, we augment the unigram count in the denominator by the number of word types in the vocabulary V:
P_Laplace(wn | wn-1) = (C(wn-1, wn) + 1) / (C(wn-1) + V)
Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are now small but non-zero.
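A minimal sketch of the add-one bigram formula above as a function over precomputed counts (the count dictionaries and vocabulary size are assumed inputs, in the style of the earlier counting sketch):

```python
def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed P(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    numerator = bigram_counts.get((w_prev, w), 0) + 1
    denominator = unigram_counts.get(w_prev, 0) + vocab_size
    return numerator / denominator
```

With this, a bigram that never occurred in training no longer gets probability zero.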
Reconstituted counts
It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts:
c*(wn-1, wn) = (C(wn-1, wn) + 1) × C(wn-1) / (C(wn-1) + V)
Compare with raw bigram
counts
Add-1 estimation is a blunt
instrument
So add-1 isn’t used for N-grams:
◦ We’ll see better methods
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.
Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the
probability mass from the seen to the unseen events.
Instead of adding 1 to each count, we add a fractional count k (0.5? 0.05? 0.01?):
P_Add-k(wn | wn-1) = (C(wn-1, wn) + k) / (C(wn-1) + kV)
This algorithm is therefore called add-k smoothing.
Add-k smoothing requires that we have a method for choosing k; this can
be done, for example, by optimizing on a devset. Although add-k is useful
for some tasks (including text classification), it turns out that it still doesn’t
work well for language modeling, generating counts with poor variances
and often inappropriate discounts
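As a sketch, this is the same counting machinery with a fractional pseudo-count; k, the count dictionaries, and the vocabulary size are assumed inputs:

```python
def addk_bigram_prob(w_prev, w, k, bigram_counts, unigram_counts, vocab_size):
    """Add-k smoothed P(w | w_prev) = (C(w_prev, w) + k) / (C(w_prev) + k*V)."""
    return ((bigram_counts.get((w_prev, w), 0) + k)
            / (unigram_counts.get(w_prev, 0) + k * vocab_size))
```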
Language Modeling: Backoff and Interpolation
Backoff and Interpolation
Sometimes it helps to use less context
◦ Condition on less context for contexts you haven’t learned much about
Backoff:
◦ use trigram if you have good evidence,
◦ otherwise bigram, otherwise unigram
◦ In other words, we only “back off” to a lower-order n-gram if we have zero evidence for
a higher-order n-gram.
Interpolation:
◦ we always mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts
◦ i.e., mix unigram, bigram, and trigram estimates
Interpolation works better
Linear Interpolation
Simple interpolation:
P̂(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1),  with λ1 + λ2 + λ3 = 1
Lambdas conditional on context:
P̂(wn | wn-2, wn-1) = λ1(wn-2, wn-1) P(wn) + λ2(wn-2, wn-1) P(wn | wn-1) + λ3(wn-2, wn-1) P(wn | wn-2, wn-1)
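A minimal sketch of simple (context-independent) interpolation; the probability dictionaries and lambda values below are illustrative assumptions, not trained estimates:

```python
def interpolated_trigram_prob(w1, w2, w3, unigram_p, bigram_p, trigram_p,
                              lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram MLE estimates with fixed weights.

    unigram_p maps w -> P(w), bigram_p maps (w2, w3) -> P(w3 | w2),
    trigram_p maps (w1, w2, w3) -> P(w3 | w1, w2). The lambdas must sum to 1;
    the defaults here are illustrative, not tuned values.
    """
    l_uni, l_bi, l_tri = lambdas
    return (l_uni * unigram_p.get(w3, 0.0)
            + l_bi * bigram_p.get((w2, w3), 0.0)
            + l_tri * trigram_p.get((w1, w2, w3), 0.0))
```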
How to set the lambdas?
Use a held-out corpus
Training Data | Held-Out Data | Test Data
Choose λs to maximize the probability of held-out data:
◦ Fix the N-gram probabilities (on the training data)
◦ Then search for λs that give largest probability to held-out set:
There are various ways to find this optimal set of λs.
One way is to use the EM algorithm, an iterative learning algorithm
that converges on locally optimal λs
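The slide mentions EM; a cruder but self-contained alternative is a brute-force grid search over λ triples that sum to 1, keeping whichever gives the held-out data the highest log probability. This is plainly a swap for EM, and all names below are illustrative assumptions:

```python
import itertools
import math

def heldout_logprob(lambdas, heldout_trigrams, unigram_p, bigram_p, trigram_p):
    """Log probability of held-out trigrams under linear interpolation."""
    l_uni, l_bi, l_tri = lambdas
    total = 0.0
    for w1, w2, w3 in heldout_trigrams:
        p = (l_uni * unigram_p.get(w3, 0.0)
             + l_bi * bigram_p.get((w2, w3), 0.0)
             + l_tri * trigram_p.get((w1, w2, w3), 0.0))
        total += math.log(p) if p > 0 else float("-inf")
    return total

def grid_search_lambdas(heldout_trigrams, unigram_p, bigram_p, trigram_p, step=0.1):
    """Try lambda triples summing to 1; the n-gram probabilities stay fixed."""
    best, best_lp = None, float("-inf")
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    for l_uni, l_bi in itertools.product(grid, repeat=2):
        l_tri = 1.0 - l_uni - l_bi
        if l_tri < 0:
            continue
        lp = heldout_logprob((l_uni, l_bi, l_tri),
                             heldout_trigrams, unigram_p, bigram_p, trigram_p)
        if lp > best_lp:
            best, best_lp = (l_uni, l_bi, l_tri), lp
    return best
```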
Katz backoff
Just as with add-one smoothing, if the higher-order n-grams aren’t
discounted and we just used the undiscounted MLE probability, then
as soon as we replaced an n-gram which has zero probability with a
lower-order n-gram, we would be adding probability mass, and the
total probability assigned to all possible strings by the language
model would be greater than 1!
In addition to this explicit discount factor, we’ll need a function α to
distribute this probability mass to the lower order n-grams.
This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted probability P* if we've seen this n-gram before (i.e., if we have non-zero counts); otherwise we back off to the lower-order n-gram. For bigrams:
P_BO(wn | wn-1) = P*(wn | wn-1)      if C(wn-1, wn) > 0
P_BO(wn | wn-1) = α(wn-1) · P(wn)    otherwise
Katz backoff is often combined with a smoothing
method called Good-Turing.
The combined Good-Turing backoff algorithm
involves quite detailed computation for estimating
the Good-Turing smoothing and the P* and α
values.
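A structural sketch of the bigram case only, assuming the discounted probabilities P* and the backoff weights α have already been computed (e.g. with Good-Turing discounting), which is exactly the detailed part described above and not shown here:

```python
def katz_bigram_prob(w_prev, w, p_star, alpha, unigram_p):
    """Katz backoff for bigrams, given precomputed tables (structural sketch).

    p_star:    dict (w_prev, w) -> discounted probability P*(w | w_prev) for seen bigrams
    alpha:     dict w_prev -> weight distributing the held-out mass to unseen continuations
    unigram_p: dict w -> unigram probability P(w)
    """
    if (w_prev, w) in p_star:   # non-zero count: use the discounted estimate
        return p_star[(w_prev, w)]
    return alpha.get(w_prev, 1.0) * unigram_p.get(w, 0.0)  # otherwise back off to the unigram
```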