2.1 Chap NLP Ngrams

Uploaded by

vishalmishra0427

Natural Language Processing:

N-Gram Language Models

Language Models
• Formal grammars (e.g. regular, context-free)
give a hard “binary” model of the
legal sentences in a language.
• For NLP, a probabilistic model of a
language, one that gives the probability that a
string is a member of the language, is more
useful.
• To specify a correct probability distribution,
the probabilities of all sentences in a
language must sum to 1.
Uses of Language Models
• Speech recognition
– “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
• OCR & Handwriting recognition
– More probable sentences are more likely correct readings.
• Machine translation
– More likely sentences are probably better translations.
• Generation
– More likely sentences are probably better NL generations.
• Context sensitive spelling correction
– “Their are problems wit this sentence.”
Completion Prediction
• A language model also supports predicting
the completion of a sentence.
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what
you are typing and give choices on how to
complete it.
N-Gram Models
• Estimate probability of each word given prior context.
– P(phone | Please turn off your cell)
• Number of parameters required grows exponentially with
the number of words of prior context.
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• The Markov assumption is the presumption that the future
behavior of a dynamical system depends only on its recent
history. In particular, in a kth-order Markov model, the
next state depends only on the k most recent states.
An N-gram model is therefore an (N−1)th-order Markov
model.
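As a small illustration of how much prior context each model order uses, the sketch below extracts the N-grams of the example sentence. The helper name `ngrams` and the padding convention (N−1 `<s>` symbols and one `</s>`) are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return every n-gram of the sentence as a tuple.

    The sentence is padded with n-1 <s> symbols and a single </s>,
    so that the first real word also has n-1 words of context.
    """
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


sentence = "please turn off your cell phone".split()
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

# The N-gram that ends in "phone" shows the context each model conditions on.
print(bigrams[5])   # the bigram ending in "phone"
print(trigrams[5])  # the trigram ending in "phone"
```

The last element of each tuple is the predicted word and the preceding elements are its N−1 words of context.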
N-Gram Model Formulas
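• Word sequences
$w_1^n = w_1 \ldots w_n$
• Chain rule of probability
$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
• Bigram approximation
$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
• N-gram approximation
$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$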
Estimating Probabilities
• N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.
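Bigram:
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
N-gram:
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
where C(·) is the number of times a word sequence occurs in the training text.
• To have a consistent probabilistic model, append a
unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.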
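The relative-frequency estimate for bigrams can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `train_bigram_mle` and the three-sentence corpus are made up for the example, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def train_bigram_mle(sentences):
    """Return p(word, prev) = C(prev word) / C(prev), estimated from raw text."""
    context_counts = Counter()  # C(prev): how often prev is followed by something
    bigram_counts = Counter()   # C(prev word)
    for sentence in sentences:
        # Add the start and end symbols and treat them as ordinary words.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
p = train_bigram_mle(corpus)
print(p("cat", "the"))   # C(the cat) / C(the) = 2 / 3
print(p("</s>", "sat"))  # every "sat" ends its sentence
```

Because `</s>` is counted as a word, the estimates for each seen context sum to 1 over the vocabulary plus `</s>`.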
Generative Model & MLE
• An N-gram model can be seen as a probabilistic
automaton for generating sentences.
    Initialize sentence with N−1 <s> symbols
    Until </s> is generated do:
        Stochastically pick the next word based on the conditional
        probability of each word given the previous N−1 words.
• Relative frequency estimates can be proven to be
maximum likelihood estimates (MLE) since they
maximize the probability that the model M will
generate the training corpus T.
$\hat{\theta} = \arg\max_{\theta} P(T \mid M(\theta))$
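The generation loop above can be sketched for the bigram case (N = 2, so one `<s>` symbol). The function names, the tiny corpus, and the `max_len` safety cap are assumptions added for this example.

```python
import random
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Map each context word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts


def generate(counts, rng, max_len=20):
    """Sample words until </s> is drawn (max_len is only a safety cap)."""
    prev, sentence = "<s>", []
    for _ in range(max_len):
        words = list(counts[prev])
        weights = [counts[prev][w] for w in words]
        # Sampling in proportion to counts is sampling from the MLE bigram distribution.
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "</s>":
            break
        sentence.append(word)
        prev = word
    return sentence


# Hypothetical toy corpus, for illustration only.
corpus = ["i want chinese food", "i want english food", "i like food"]
counts = train_bigram_counts(corpus)
sample = generate(counts, random.Random(0))
print(" ".join(sample))
```

Every adjacent word pair in a generated sentence was seen in training, which is why an unsmoothed model can only recombine observed bigrams.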
Example from Textbook
• P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want)
P(food | english) P(</s> | food)
= .25 x .33 x .0011 x .5 x .68 = .000031
• P(<s> i want chinese food </s>)
= P(i | <s>) P(want | i) P(chinese | want)
P(food | chinese) P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019
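The two products above can be checked with a short script. The probabilities are the ones quoted in the example. Summing log-probabilities instead of multiplying is a common precaution against underflow on longer sentences, and the names `bigram_p` and `sentence_logprob` are chosen only for this sketch.

```python
import math

# Bigram probabilities quoted in the example above.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("want", "chinese"): 0.0065,
    ("english", "food"): 0.5,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}


def sentence_logprob(words):
    """Sum of log P(word | previous word) over the padded sentence."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(math.log(bigram_p[(prev, w)]) for prev, w in zip(tokens, tokens[1:]))


p_english = math.exp(sentence_logprob("i want english food".split()))
p_chinese = math.exp(sentence_logprob("i want chinese food".split()))
print(f"{p_english:.6f}")  # about .000031
print(f"{p_chinese:.5f}")  # about .00019
```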
Train and Test Corpora
• A language model must be trained on a large
corpus of text to estimate good parameter values.
• Model can be evaluated based on its ability to
predict a high probability for a disjoint (held-out)
test corpus (testing on the training corpus would
give an optimistically biased estimate).
• Ideally, the training (and test) corpus should be
representative of the actual application data.
• May need to adapt a general model to a small
amount of new (in-domain) data by adding the small
corpus, highly weighted, to the original training data.
Unknown Words
• How to handle words in the test corpus that
did not occur in the training data, i.e. out of
vocabulary (OOV) words?
• Train a model that includes an explicit
symbol for an unknown word (<UNK>).
– Choose a vocabulary in advance and replace
other words in the training corpus with
<UNK>.
– Replace the first occurrence of each word in the
training data with <UNK>.
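Both strategies for introducing `<UNK>` into the training data can be sketched as simple token rewrites. The function names and sample sentences are illustrative assumptions.

```python
def replace_oov(tokens, vocabulary):
    """Strategy 1: keep words from a vocabulary chosen in advance, map the rest to <UNK>."""
    return [t if t in vocabulary else "<UNK>" for t in tokens]


def unk_first_occurrences(tokens):
    """Strategy 2: replace the first occurrence of every word type with <UNK>."""
    seen, out = set(), []
    for t in tokens:
        if t in seen:
            out.append(t)
        else:
            seen.add(t)
            out.append("<UNK>")
    return out


fixed = replace_oov("the cat sat on the mat".split(), {"the", "cat", "sat"})
first = unk_first_occurrences("a b a c b a".split())
print(fixed)
print(first)
```

At test time, any word outside the training vocabulary is mapped to `<UNK>` in the same way before the model is queried.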
Evaluation of Language Models
• Ideally, evaluate use of model in end application
(extrinsic, in vivo)
– Realistic
– Expensive
• Evaluate on ability to model test corpus
(intrinsic).
– Less realistic
– Cheaper
• Verify at least once that intrinsic evaluation
correlates with an extrinsic one.
Perplexity
• Measure of how well a model “fits” the test data.
• Uses the probability that the model assigns to the
test corpus.
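• Normalizes for the number of words in the test
corpus and takes the inverse.
$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$
where N here is the number of words in the test corpus W.
• Measures the weighted average branching factor
in predicting the next word (lower is better).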
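A minimal sketch of the perplexity computation, assuming the per-word conditional probabilities assigned by the model to the test corpus are already available as a list. The function name `perplexity` is chosen for this example.

```python
import math


def perplexity(word_probs):
    """Inverse probability of the test corpus, normalized by its length.

    word_probs holds the model's conditional probability for each test word.
    """
    if any(p == 0.0 for p in word_probs):
        return math.inf  # one zero-probability word makes the whole corpus impossible
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that always spreads its mass uniformly over 10 words has branching factor 10.
uniform = perplexity([0.1] * 5)
print(uniform)
```

Working in log space avoids underflow when the test corpus has many words.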
Sample Perplexity Evaluation
• Models trained on 38 million words from
the Wall Street Journal (WSJ) using a
19,979 word vocabulary.
• Evaluate on a disjoint set of 1.5 million
WSJ words.

             Unigram   Bigram   Trigram
Perplexity   962       170      109
Smoothing
• Since there are a combinatorial number of possible
word sequences, many rare (but not impossible)
combinations never occur in training, so MLE
incorrectly assigns zero to many parameters (a.k.a.
sparse data).
• If a new combination occurs during testing, it is
given a probability of zero and the entire sequence
gets a probability of zero (i.e. infinite perplexity).
• In practice, parameters are smoothed (a.k.a.
regularized) to reassign some probability mass to
unseen events.
– Adding probability mass to unseen events requires
removing it from seen ones (discounting) in order to
maintain a joint distribution that sums to 1.
Laplace (Add-One) Smoothing
• “Hallucinate” additional training data in which each
possible N-gram occurs exactly once and adjust
estimates accordingly.
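Bigram:
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
N-gram:
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n) + 1}{C(w_{n-N+1}^{n-1}) + V}$
where V is the vocabulary size, i.e. the number of distinct
words that can follow a context (the same V for every N,
since the estimate is normalized over the next word).
• Tends to reassign too much mass to unseen events,
so can be adjusted to add 0<δ<1 instead of 1 (normalized
by δV instead of V).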
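A sketch of add-δ smoothing for a bigram model, where δ = 1 gives Laplace smoothing. The function name `train_add_delta` and the two-sentence corpus are assumptions for illustration. Here V counts the words that can be predicted, including `</s>` but not `<s>`.

```python
from collections import Counter


def train_add_delta(sentences, delta=1.0):
    """Return (prob, vocab) where prob(word, prev) is the add-delta bigram estimate."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])  # predictable words: everything except <s>
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    v = len(vocab)

    def prob(word, prev):
        # (C(prev word) + delta) / (C(prev) + delta * V)
        return (bigram_counts[(prev, word)] + delta) / (context_counts[prev] + delta * v)

    return prob, vocab


# Hypothetical toy corpus, for illustration only.
p_laplace, vocab = train_add_delta(["the cat sat", "the dog sat"])
print(p_laplace("cat", "the"))  # seen bigram: (1 + 1) / (2 + 5)
print(p_laplace("sat", "the"))  # unseen bigram: (0 + 1) / (2 + 5)
```

In this toy corpus the unseen bigrams after "the" receive 3/7 of the mass in total, which illustrates why a δ below 1 is often preferred.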
Advanced Smoothing
• Many advanced techniques have been
developed to improve smoothing for
language models.
– Good-Turing
– Interpolation
– Backoff
– Kneser-Ney
– Class-based (cluster) N-grams
Model Combination
• As N increases, the power (expressiveness)
of an N-gram model increases, but the
ability to estimate accurate parameters from
sparse data decreases (i.e. the smoothing
problem gets worse).
• A general approach is to combine the results
of multiple N-gram models of increasing
complexity (i.e. increasing N).
Interpolation
• Linearly combine estimates of N-gram
models of increasing order.
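Interpolated Trigram Model:
$\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)$
Where:
$\sum_i \lambda_i = 1$
• Learn proper values for λi by training to
(approximately) maximize the likelihood of
an independent development (a.k.a. tuning)
corpus.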
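A sketch of linear interpolation of trigram, bigram and unigram MLE estimates. The weights are fixed by hand here purely for illustration; in practice they would be tuned on a development corpus as described above. All names and the toy corpus are assumptions.

```python
from collections import Counter


def train_interpolated_trigram(sentences, lam_tri=0.6, lam_bi=0.3, lam_uni=0.1):
    """Return prob(w, w2, w1): interpolated estimate of P(w | w2 w1)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    bi_ctx, tri_ctx = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            w2, w1, w = tokens[i - 2], tokens[i - 1], tokens[i]
            uni[w] += 1
            bi[(w1, w)] += 1
            bi_ctx[w1] += 1
            tri[(w2, w1, w)] += 1
            tri_ctx[(w2, w1)] += 1
    total = sum(uni.values())

    def ratio(num, den):
        return num / den if den else 0.0  # an unseen context contributes nothing

    def prob(w, w2, w1):
        return (lam_tri * ratio(tri[(w2, w1, w)], tri_ctx[(w2, w1)])
                + lam_bi * ratio(bi[(w1, w)], bi_ctx[w1])
                + lam_uni * ratio(uni[w], total))

    return prob


# Hypothetical toy corpus, for illustration only.
p_interp = train_interpolated_trigram(["i want chinese food", "i want english food"])
print(p_interp("chinese", "i", "want"))  # seen trigram
print(p_interp("food", "i", "want"))     # unseen trigram, still non-zero via the unigram term
```

For a context seen in training, the three component distributions each sum to 1, so the weighted mixture does as well.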
Backoff
• Only use lower-order model when data for higher-
order model is unavailable (i.e. count is zero).
• Recursively back-off to weaker models until data
is available.

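$P_{bo}(w_n \mid w_{n-N+1}^{n-1}) = \begin{cases} P^*(w_n \mid w_{n-N+1}^{n-1}) & \text{if } C(w_{n-N+1}^{n}) > 0 \\ \alpha(w_{n-N+1}^{n-1}) \, P_{bo}(w_n \mid w_{n-N+2}^{n-1}) & \text{otherwise} \end{cases}$
Where P* is a discounted probability estimate to reserve
mass for unseen events and α’s are back-off weights
(see text for details).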
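A sketch of bigram-to-unigram backoff. It uses absolute discounting as one simple choice for P*, which is an assumption of this example and not necessarily the discounting scheme the text refers to. The back-off weight α is computed so that each context's distribution still sums to 1. All names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_backoff_bigram(sentences, discount=0.5):
    """Return prob(word, prev) that backs off from bigram to unigram estimates."""
    uni = Counter()
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            uni[word] += 1
            followers[prev][word] += 1
    total = sum(uni.values())

    def p_uni(word):
        return uni[word] / total

    def prob(word, prev):
        seen = followers.get(prev, Counter())
        context_total = sum(seen.values())
        if context_total == 0:
            return p_uni(word)  # unknown context: only the unigram model has data
        if seen[word] > 0:
            # P*: discounted estimate, used when the bigram count is non-zero.
            return (seen[word] - discount) / context_total
        if p_uni(word) == 0.0:
            return 0.0  # word outside the training vocabulary
        # Mass removed from the seen bigrams of this context ...
        reserved = discount * len(seen) / context_total
        # ... is spread over unseen words in proportion to their unigram probability.
        unseen_mass = 1.0 - sum(p_uni(w) for w in seen)
        alpha = reserved / unseen_mass
        return alpha * p_uni(word)

    return prob


# Hypothetical toy corpus, for illustration only.
p_bo = train_backoff_bigram(["i want chinese food", "i want english food"])
print(p_bo("chinese", "want"))  # seen bigram: (1 - 0.5) / 2
print(p_bo("food", "want"))     # unseen bigram: alpha(want) * P(food)
```

Unlike interpolation, the lower-order model is consulted only when the bigram count is zero.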
A Problem for N-Grams:
Long Distance Dependencies
• Many times local context does not provide the
most useful predictive clues, which instead are
provided by long-distance dependencies.
– Syntactic dependencies
• “The man next to the large oak tree near the grocery store on
the corner is tall.”
• “The men next to the large oak tree near the grocery store on
the corner are tall.”
– Semantic dependencies
• “The bird next to the large oak tree near the grocery store on
the corner flies rapidly.”
• “The man next to the large oak tree near the grocery store on
the corner talks rapidly.”
• More complex models of language are needed to
handle such dependencies.
Summary
• Language models assign a probability that a
sentence is a legal string in a language.
• They are useful as a component of many NLP
systems, such as ASR, OCR, and MT.
• Simple N-gram models are easy to train on
unsupervised corpora and can provide useful
estimates of sentence likelihood.
• MLE gives inaccurate parameters for models
trained on sparse data.
• Smoothing techniques adjust parameter estimates
to account for unseen (but not impossible) events.
Estimate Bigram probability

Training corpus:
<s> I am Henry </s>
<s> I like college </s>
<s> Do Henry like college </s>
<s> Henry I am </s>
<s> Do I like Henry </s>
<s> Do I like college </s>
<s> I do like henry </s>

Word counts (ignoring case):
Word      Count
<s>       7
</s>      7
I         6
am        2
henry     5
like      5
college   3
do        4
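The exercise can be worked through with a short script that counts words and bigrams in the seven training sentences and then applies the relative-frequency estimate. Lowercasing the text is an assumption made here so that "Henry"/"henry" and "Do"/"do" are counted as the same word.

```python
from collections import Counter

corpus = [
    "I am Henry",
    "I like college",
    "Do Henry like college",
    "Henry I am",
    "Do I like Henry",
    "Do I like college",
    "I do like henry",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(word, prev):
    """P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


for word in ["<s>", "</s>", "i", "am", "henry", "like", "college", "do"]:
    print(word, unigram_counts[word])

print(bigram_prob("i", "<s>"))         # 3 of the 7 sentences start with "I"
print(bigram_prob("like", "i"))        # "I like" occurs 3 times, "I" 6 times
print(bigram_prob("college", "like"))  # "like college" occurs 3 times, "like" 5 times
```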
N-Gram model
Bi-Gram example
perplexity example
Instead of tokenizing on whitespace,
the above data-centric approach
uses chars that most frequently
