Natural Language Processing
AC3110E
Chapter 6: N-gram Language Models
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP
• Symbolic NLP (1950s – early 1990s)
• based on collections of complex, hand-written rules
• the computer emulates NLP tasks by applying these rules to the data it encounters
• hand-coding of a set of rules for manipulating symbols, coupled with dictionary lookup
• Statistical / probabilistic NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• increase in computational power + enormous amounts of available data
• Neural NLP (2010s–present)
• deep neural network (many hidden layers) machine learning methods
• word embedding algorithms: a large corpus of text produces a vector space, with word embeddings capturing semantic properties of words
• achieve state-of-the-art results on many natural language tasks
Introduction
• Which is correct?
• Output of an ASR system
• “Their are two terms” vs. “There are two terms”?
• “Here you are!” vs. “Hear you are!”?
• Output of an MT system
• “Strong wind tonight” vs. “Big wind tonight”?
• Output of a handwriting recognition system
• HOW to assign a probability to a sentence?
• HOW to predict upcoming words?
Introduction
• Predict how frequently a phrase occurs in the natural use of a language
• Useful for NLP tasks such as spelling correction, grammatical error correction, etc.
Probabilistic Language Model
• Language Models: assign probabilities to sequences of words
• P(S) = P(w1, w2, w3, ..., wn)
• To predict the next word
• Conditional probability: P(wn|w1, w2, w3, ..., wn-1)
• Target
• In recognition: P(“recognize speech”) ? P(“wreck a nice beach”)
• P(“recognize speech”) > P(“wreck a nice beach”)
• In text generation: P(“three houses”) ? P(“three house”)
• P(“three houses”) > P(“three house”)
• In spelling correction: P(“my cat eats fish”) ? P(“my xat eats fish”)
• P(“my cat eats fish”) > P(“my xat eats fish”)
• In machine translation: P(“strong wind”) ? P(“big wind”)
• P(“strong wind”) > P(“big wind”)
• etc.
• => Needs to be learned from a high-quality training corpus
• containing perfectly acceptable word sequences/sentences in a language
Probabilistic Language Model
• How to calculate the probabilities from a training corpus?
• P(“we wanted to know”) ≈ Count(“we wanted to know”) / Count(all 4-word sequences)
• P(“know”|“we wanted to”) ≈ Count(“we wanted to know”) / Count(“we wanted to”)
+ Need a large “enough” corpus
+ “estimation” only, never see enough data
• Chain rule of probability:
P(S) = P(w1, w2, w3, ..., wn) = P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n-1)
=> How to model P(wk|w1:k-1)?
6.1 n-gram Language models
• The simplest model: n-gram LM
• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.
"The big brown fox jumped over the fence"
Unigrams: "The", "big", "brown", "fox", "jumped", "over", "the", "fence"
Bigram: "The big", "big brown", "brown fox", "fox jumped", "jumped over", "over
the", "the fence"
Trigram: "The big brown", "big brown fox", "brown fox jumped", "fox jumped over",
"jumped over the", "over the fence"
…
6-gram (Hexagram): "The big brown fox jumped over", "big brown fox jumped over
the", "brown fox jumped over the fence"
• Longer n-grams suffer from data sparsity (low counts)
6.1 n-gram Language models
• The simplest model: n-gram LM
• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.
• Longer n-grams suffer from data sparsity (low counts)
• Markov assumption:
• Approximation: the probability of a word depends only on the few previous words
• The n-gram model looks n−1 words into the past
• P(“know”|“we wanted to”)
• unigram: = P(“know”)
• bigram: = P(“know”|“to”)
• trigram: = P(“know”|“wanted to”)
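For example, under the bigram assumption the chain-rule product collapses to
P(we wanted to know) ≈ P(we) × P(wanted|we) × P(to|wanted) × P(know|to),
and under the trigram assumption to P(we) × P(wanted|we) × P(to|we wanted) × P(know|wanted to).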
6.1 n-gram Language models
• Estimate the n-gram probabilities
• Maximum likelihood estimation (MLE): get counts from a corpus and normalize
• Unigram probability: P(wi) = C(wi) / N (N = total number of word tokens)
• Bigram probability: P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
• Example: Bigram probabilities for the corpus below (computed in the sketch that follows)
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I|<s>) = ? P(Sam|<s>) = ? P(am|I) = ? P(</s>|Sam) = ? P(Sam|am) = ? P(do|I) = ?
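A minimal Python sketch of these MLE bigram estimates on the toy corpus above; the helper name bigram_prob is illustrative, not from the slides.

```python
from collections import Counter

# Toy corpus from the slide, already segmented with <s> and </s> markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

for prev, word in [("<s>", "I"), ("<s>", "Sam"), ("I", "am"),
                   ("Sam", "</s>"), ("am", "Sam"), ("I", "do")]:
    print(f"P({word}|{prev}) = {bigram_prob(prev, word):.3f}")
# Expected: 2/3, 1/3, 2/3, 1/2, 1/2, 1/3
```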
6.1 n-gram Language models
• Estimate the n-gram probabilities
• Maximum likelihood estimation (MLE)
• General n-gram case from relative frequency: P(wn|wn-N+1:n-1) = C(wn-N+1:n-1 wn) / C(wn-N+1:n-1)
6.1 n-gram Language models
• Chain rule of probability: P(S) = P(w1, w2, w3, ..., wn) = P(w1:n)
• Log probabilities:
• Avoid numerical underflow when multiplying many small probabilities
• Adding log probabilities is faster than multiplying probabilities (see the sketch below)
p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
• In practice it’s more common to use trigram models, or 4-gram or even 5-
gram models, when there is sufficient training data
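A small illustration of why log probabilities are used in practice; the probability values are arbitrary, chosen only to show the underflow effect.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while summing log probabilities stays well-behaved.
probs = [1e-5] * 100          # 100 n-gram probabilities of 1e-5 each

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0 (underflow: 1e-500 is below float64 range)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                 # ≈ -1151.29, still usable for comparing models
```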
Building n-gram Language models
• Preprocess the text:
• word normalization, sentence segmentation, etc.
• depends on the tasks
• Build model’s vocabulary
• All the words recognized by the language model
• All other words are considered “unknown”: <UNK>, <OOV>
• Include words whose frequency is greater than a certain threshold
• Add <UNK>, [<s>], </s>
• Types: the number of distinct words in a corpus
• Tokens: the total number of running words
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
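A minimal sketch of counting types/tokens and building a vocabulary with <UNK> on the toy corpus above; the frequency threshold here is an arbitrary illustrative value.

```python
from collections import Counter

# Toy corpus from the slide.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

tokens = [w for sent in corpus for w in sent.split()]
counts = Counter(tokens)

print("tokens:", len(tokens))          # total number of running words (incl. <s>/</s>)
print("types:", len(counts))           # number of distinct words

# Keep only words above a frequency threshold; map the rest to <UNK>.
threshold = 2    # hypothetical value, chosen per task / corpus size
vocab = {w for w, c in counts.items() if c >= threshold} | {"<UNK>", "<s>", "</s>"}
mapped = [w if w in vocab else "<UNK>" for w in tokens]
print(mapped)
```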
Building n-gram Language models
• Example
• Berkeley Restaurant Project
• Counts for unigrams: C(w)
• Counts for bigrams: C(wn-1 wn) [bigram count table in the slides]
• Probability of bigrams: P(wn|wn-1) = C(wn-1 wn) / C(wn-1) [bigram probability table in the slides]
https://github.com/wooters/berp-trans
Jurafsky, D., et al. 1994. The Berkeley restaurant project. ICSLP.
Building n-gram Language models
• Example
• Berkeley Restaurant Project
• Estimation of the bigram probability of the statement “<s> I want chinese food </s>”
• Given P(I|<s>) = 0.25, P(</s>|food) = 0.68, and the other bigram probabilities from the tables:
P(<s> I want chinese food </s>)
= P(I|<s>) × P(want|I) × P(chinese|want) × P(food|chinese) × P(</s>|food)
= 0.25 × 0.33 × 0.0065 × 0.52 × 0.68
= 0.000189618
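A small sketch reproducing this computation in Python, using only the bigram probabilities given on the slide, in both linear and log space.

```python
import math

# Bigram probabilities for the example sentence, as given on the slide
# (estimated from the Berkeley Restaurant Project counts).
bigram_probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "chinese"): 0.0065,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

sentence = "<s> I want chinese food </s>".split()

prob = 1.0
log_prob = 0.0
for prev, word in zip(sentence, sentence[1:]):
    p = bigram_probs[(prev, word)]
    prob *= p
    log_prob += math.log(p)

print(prob)                 # ≈ 0.000189618
print(math.exp(log_prob))   # same value, computed via log probabilities
```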
6.2. Evaluating Language Models
• How good is a model?
• How much does the application embedding the model improve?
• How does the model perform on unseen data?
• Does the language model prefer good sentences over bad ones?
• Does it assign a higher probability to “grammatical” or “frequently observed” sentences
• than to “ungrammatical” or “rarely observed” ones?
• Does it accurately predict unseen data?
• Does it assign a higher probability to unseen (test) data?
6.2. Evaluating Language Models
• Extrinsic evaluation - end-to-end evaluation
• Good evaluation to compare models A and B
• Implement each model in a task.
• spelling correction, speech recognition, translation system
• Execute the task and measure the accuracy of A and of B
• How many misspelled words were corrected correctly?
• How many words were translated correctly?
• Compare the accuracy of A and B
• Unfortunately:
• a very time-consuming assessment
6.2. Evaluating Language Models
• Intrinsic evaluation
• Using a measure: perplexity
• Measures the quality of a model independent of any application
• A training set
• for training language model parameters
• A development set
• for tuning the model parameters to fit the characteristics of the development set
• A test set
• data truly not seen during model training
• Perplexity indicates how well the model fits the test set
• assign a higher probability to the test data
6.2. Evaluating Language Models
• Perplexity Intuition
• A better model is one that assigns a higher probability to the word that actually
occurs
• The perplexity (PP) of a language model on a test set W = w1, w2, w3, ..., wN:
PP(W) = P(w1, w2, ..., wN)^(-1/N) = the N-th root of 1 / P(w1, w2, ..., wN)
The higher the probability of the word sequence,
the lower the perplexity => the model predicts the test data better
• Perplexity is the inverse probability of the test set, normalized by the number of words
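A minimal sketch of this perplexity computation; the cond_prob interface and the uniform toy model below are illustrative assumptions, not part of the slide.

```python
import math

def perplexity(test_tokens, cond_prob):
    """Perplexity of a token sequence under a model giving P(w | previous word).

    `cond_prob(prev, word)` is any callable returning the model's conditional
    probability; here we only assume a bigram-style interface.
    """
    N = len(test_tokens) - 1                      # number of predicted tokens
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(cond_prob(prev, word))
    return math.exp(-log_prob / N)

# Hypothetical toy model: a uniform bigram distribution over a 10-word vocabulary.
uniform = lambda prev, word: 1.0 / 10
print(perplexity("<s> I want chinese food </s>".split(), uniform))  # ≈ 10.0
```

With a uniform model over 10 words the perplexity is exactly 10, which matches the common reading of perplexity as an average branching factor.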
6.2. Evaluating Language Models
• Goodman (2001) A bit of progress in language modelling
• Vocabulary size 50,000
• Trigram model: perplexity = 74
• Bigram model: perplexity = 137
• Unigram model: perplexity = 995
• Jurafsky and Martin (2009)
• Wall Street Journal
• 38 million words, Vocabulary size 19,979
• Trigram model: perplexity = 109
• Bigram model: perplexity = 170
• Unigram model: perplexity = 962
• The perplexity of two language models is only comparable if they use
identical vocabularies
• The perplexity is typically between 40 and 400
6.2. Evaluating Language Models
• Perplexity
• A measure independent of any specific application
• But it is difficult to know whether a gain in perplexity translates into a real gain in practice
• A bad approximation unless the test data resembles the training data
• Generally useful in pilot experiments, before moving on to extrinsic
evaluation
6.3. Smoothing algorithms
• Training data
• n-gram model is dependent on the training corpus
• be sure to use a training corpus that has a similar genre/domain to the task
• Problem of sparsity
• Unknown Words
• Out-of-vocabulary (OOV) words can appear in an open-vocabulary system
• convert OOV words to the <UNK> token (requires a vocabulary fixed in advance)
• or replace rare words in the training data by <UNK> based on their frequency (< threshold)
• If perfectly acceptable word sequences are missing from the training corpus => “zero-probability n-grams”
• but they could appear in the test set!
• For example
• In the works of Shakespeare:
• N = 884,647 tokens, V = 29,066 types => number of possible bigrams = V² ≈ 844 million
• Shakespeare produced about 300,000 different bigram types
• Thus, 99.96% of the possible bigrams have never been seen
6.3. Smoothing algorithms
• Problem:
• Some of these zeros are really zeros... but some of them are only rare events
• What if “students opened their ____” never occurred in the data? => zero-frequency n-grams => smoothing algorithms
• What if “students opened their” never occurred in the data? => division by zero => backoff solutions
Backoff solutions
• Backoff
• “back off” to a lower-order n-gram if zero evidence for a higher-order n-gram
P(wn|wn−2wn−1) → P(wn|wn−1) → P(wn) → ...
• Katz backoff
• Good-Turing backoff
• Interpolation
• weight and combine the trigram, bigram, and unigram estimates (see the sketch below), e.g.
P_interp(wn|wn−2wn−1) = λ1 P(wn|wn−2wn−1) + λ2 P(wn|wn−1) + λ3 P(wn), with λ1 + λ2 + λ3 = 1
• Simple interpolation: fixed λ weights
• Conditional interpolation: λ weights conditioned on the context
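A minimal sketch of simple interpolation with fixed weights; the λ values, the double <s> padding, and the helper name interp_prob are illustrative choices (in practice the λ are tuned on a held-out development set).

```python
from collections import Counter

# Illustrative lambda weights (trigram, bigram, unigram), summing to 1.
LAMBDAS = (0.5, 0.3, 0.2)

# Toy corpus, padded with two <s> markers so trigrams have full left context.
corpus = [
    "<s> <s> I am Sam </s>",
    "<s> <s> Sam I am </s>",
    "<s> <s> I do not like green eggs and ham </s>",
]

uni, bi, tri = Counter(), Counter(), Counter()
total = 0
for sent in corpus:
    t = sent.split()
    total += len(t)
    uni.update(t)
    bi.update(zip(t, t[1:]))
    tri.update(zip(t, t[1:], t[2:]))

def interp_prob(w2, w1, w):
    """lambda1*P(w|w2 w1) + lambda2*P(w|w1) + lambda3*P(w), using MLE estimates."""
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_uni = uni[w] / total
    l1, l2, l3 = LAMBDAS
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(interp_prob("<s>", "I", "am"))   # nonzero even when the trigram is unseen
```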
Smoothing algorithms
• Smoothing (discounting)
• add probability mass to unseen events by removing probability mass from previously seen events, so that the distribution still sums to 1
• Type of smoothing
• Laplace (add-one) smoothing
• Add-k smoothing
• Kneser-Ney smoothing
• Stupid backoff
• etc.
Laplace Smoothing
• Add 1 to all the n-gram counts, before normalizing into probabilities
• Unigram probabilities: P_Laplace(wi) = (C(wi) + 1) / (N + V) (N = number of tokens, V = vocabulary size)
• Bigram probabilities: P_Laplace(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
• etc.
• Makes a very big change to the counts!
• too much probability mass is moved to the zeros
• tends to reassign too much mass to events not seen in training
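A minimal sketch of add-one smoothed bigram probabilities on the same toy corpus; comparing with the unsmoothed MLE estimate shows how much probability mass is moved.

```python
from collections import Counter

# Toy corpus reused from the earlier slides.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

uni, bi = Counter(), Counter()
for sent in corpus:
    t = sent.split()
    uni.update(t)
    bi.update(zip(t, t[1:]))

V = len(uni)   # vocabulary size (here, every observed word type)

def p_laplace(prev, word):
    """(C(prev word) + 1) / (C(prev) + V): never zero, even for unseen bigrams."""
    return (bi[(prev, word)] + 1) / (uni[prev] + V)

print(p_laplace("I", "am"))       # seen bigram: 3/15 = 0.2, vs. 2/3 without smoothing
print(p_laplace("green", "Sam"))  # unseen bigram: now gets a small nonzero probability
```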
Add-k smoothing
• Instead of adding 1 to each count, add a fractional count k (0.5? 0.05? 0.01?): P_Add-k(wn|wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)
• How to choose k? Does it give appropriate discounts?
Kneser-Ney smoothing
• One of the most commonly used and best performing n-gram smoothing
methods
• Takes into account the number of different contexts a word w has appeared in
• Hypothesis: words that have appeared in more contexts in the past are more likely to appear in some new context as well
• uses a “continuation probability” based on the number of different contexts word w has appeared in
I can’t see without my reading ________
• “Angeles” (from “Los Angeles”) and “glasses” can both be very frequent words
• But “Angeles” occurs in essentially one context (after “Los”), so it will have a low continuation probability
• The word “glasses” appears after a much wider range of contexts => the better choice
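A minimal sketch of the continuation-count idea only (not the full Kneser-Ney formula), on a small hypothetical corpus invented for illustration.

```python
from collections import defaultdict

# P_CONTINUATION(w) is proportional to the number of distinct words that
# precede w, normalized by the total number of distinct bigram types.
corpus = ("<s> I live in Los Angeles </s> <s> I lost my glasses </s> "
          "<s> he broke his glasses </s> <s> cheap glasses </s>").split()

preceding = defaultdict(set)            # word -> set of distinct left contexts
for prev, word in zip(corpus, corpus[1:]):
    preceding[word].add(prev)

num_bigram_types = sum(len(s) for s in preceding.values())

def p_continuation(word):
    return len(preceding[word]) / num_bigram_types

print(p_continuation("Angeles"))   # low: only ever follows "Los"
print(p_continuation("glasses"))   # higher: follows several different words
```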
• Google Books Ngram Viewer
• https://books.google.com/ngrams/
Why should we care about Language Modeling?
• Language Modeling is a subcomponent of many NLP tasks
• Especially those involving generating text or estimating the probability of
text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Machine translation
• Summarization
• Dialogue
• etc.
• Everything else in NLP has now been rebuilt upon Language Modeling:
• GPT-3 is an LM!
Language Modeling Toolkits
• SRILM
• http://www.speech.sri.com/projects/srilm/
• IRSTLM
• https://sourceforge.net/projects/irstlm/
• KenLM
• https://kheafield.com/code/kenlm/
• End of Chapter 6