Natural Language Processing
AC3110E
Chapter 6: N-gram Language Models
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP
• Symbolic NLP (1950s – early 1990s)
• based on collections of complex, hand-written rules
• the computer emulates NLP tasks by applying these rules to the data it encounters
• hand-coding of a set of rules for manipulating symbols, coupled with dictionary lookup
• Statistical / probabilistic NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• increase in computational power + enormous amounts of available data
• Neural NLP (2010s–present)
• deep neural network (many hidden layers) machine learning methods
• word embedding algorithms: a large corpus of text produces a vector space, with word embeddings capturing semantic properties of words
• achieve state-of-the-art results on many natural language tasks
Introduction
• Which is correct?
• Output of an ASR system
• “Their are two terms” vs. “There are two terms”?
• “Here you are!” vs. “Hear you are!”?
• Output of an MT system
• “Strong wind tonight” vs. “Big wind tonight”?
• Output of a handwriting recognition system
• HOW to assign a probability to a sentence?
• HOW to predict upcoming words?
Introduction
• Predict how frequently a phrase occurs in the natural use of a language
• Useful for NLP tasks such as spelling correction, grammatical error correction, etc.
Probabilistic Language Model
• Language Models: assign probabilities to sequences of words
• P(S) = P(w1, w2, w3, ..., wn)
• To predict the next word
• Conditional probability: P(wn|w1, w2, w3, ..., wn-1)
• Target
• In recognition: P(“recognize speech”) ? P(“wreck a nice beach”)
• P(“recognize speech”) > P(“wreck a nice beach”)
• In text generation: P(“three houses”) ? P(“three house”)
• P(“three houses”) > P(“three house”)
• In spelling correction: P(“my cat eats fish”) ? P(“my xat eats fish”)
• P(“my cat eats fish”) > P(“my xat eats fish”)
• In machine translation: P(“strong wind”) ? P(“big wind”)
• P(“strong wind”) > P(“big wind”)
• etc.
• => Needs to be learned from a high-quality training corpus
• containing perfectly acceptable word sequences/sentences in a language
Probabilistic Language Model
• How to calculate the probabilities from a training corpus?
• P(“we wanted to know”) ≈ Count(“we wanted to know”) / Count(all 4-word sequences)
• P(“know”|“we wanted to”) ≈ Count(“we wanted to know”) / Count(“we wanted to”)
+ Need a large “enough” corpus
+ “estimation” only, never see enough data
• Chain rule of probability:
P(S) = P(w1, w2, w3, ..., wn) = P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n-1)
=> How to model P(wk|w1:k-1)?
6.1 n-gram Language models
• The simplest model: n-gram LM
• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.
"The big brown fox jumped over the fence"
Unigrams: "The", "big", "brown", "fox", "jumped", "over", "the", "fence"
Bigram: "The big", "big brown", "brown fox", "fox jumped", "jumped over", "over
the", "the fence"
Trigram: "The big brown", "big brown fox", "brown fox jumped", "fox jumped over",
"jumped over the", "over the fence"
…
6-gram (Hexagram): "The big brown fox jumped over", "big brown fox jumped over
the", "brown fox jumped over the fence"
• Longer n-grams suffer from data sparsity (low counts)
6.1 n-gram Language models
• The simplest model: n-gram LM
• n-gram: a sequence of n words
• 2-gram (bigram), 3-gram (trigram), etc.
• Longer n-grams suffer from data sparsity (low counts)
• Markov assumption:
• Approximation: the probability of a word depends only on the few previous words
• The n-gram model looks n−1 words into the past
• P(“know”|“we wanted to”)
• unigram: = P(“know”)
• bigram: = P(“know”|“to”)
• trigram: = P(“know”|“wanted to”)
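For example, under the bigram assumption the chain-rule product collapses to
P(we wanted to know) ≈ P(we) × P(wanted|we) × P(to|wanted) × P(know|to),
and under the trigram assumption to P(we) × P(wanted|we) × P(to|we wanted) × P(know|wanted to).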
6.1 n-gram Language models
• Estimate the n-gram probabilities
• Maximum likelihood estimation (MLE): get counts from a corpus and normalize
• Unigram probability: P(wi) = C(wi) / N (N = total number of word tokens)
• Bigram probability: P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
• Example: Bigram probabilities for the corpus below (computed in the sketch that follows)
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I|<s>) = ? P(Sam|<s>) = ? P(am|I) = ? P(</s>|Sam) = ? P(Sam|am) = ? P(do|I) = ?
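A minimal Python sketch of these MLE bigram estimates on the toy corpus above; the helper name bigram_prob is illustrative, not from the slides.

```python
from collections import Counter

# Toy corpus from the slide, already segmented with <s> and </s> markers.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

for prev, word in [("<s>", "I"), ("<s>", "Sam"), ("I", "am"),
                   ("Sam", "</s>"), ("am", "Sam"), ("I", "do")]:
    print(f"P({word}|{prev}) = {bigram_prob(prev, word):.3f}")
# Expected: 2/3, 1/3, 2/3, 1/2, 1/2, 1/3
```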
6.1 n-gram Language models
• Estimate the n-gram probabilities
• Maximum likelihood estimation (MLE)
• General n-gram case from relative frequency: P(wn|wn-N+1:n-1) = C(wn-N+1:n-1 wn) / C(wn-N+1:n-1)
6.1 n-gram Language models
• Chain rule of probability: P(S) = P(w1, w2, w3, ..., wn) = P(w1:n)
• Log probabilities:
• Avoid numerical underflow when multiplying many small probabilities
• Adding log probabilities is faster than multiplying probabilities (see the sketch below)
p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
• In practice it’s more common to use trigram models, or 4-gram or even 5-
gram models, when there is sufficient training data
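A small illustration of why log probabilities are used in practice; the probability values are arbitrary, chosen only to show the underflow effect.

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while summing log probabilities stays well-behaved.
probs = [1e-5] * 100          # 100 n-gram probabilities of 1e-5 each

product = 1.0
for p in probs:
    product *= p
print(product)                 # 0.0 (underflow: 1e-500 is below float64 range)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                 # ≈ -1151.29, still usable for comparing models
```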
Building n-gram Language models
• Preprocess the text:
• word normalization, sentence segmentation, etc.
• depends on the tasks
• Build model’s vocabulary
• All the words recognized by the language model
• All other words are considered “unknown”: <UNK>, <OOV>
• Include words whose frequency is greater than a certain threshold
• Add <UNK>, [<s>], </s>
• Types: the number of distinct words in a corpus
• Tokens: the total number of running words
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
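A minimal sketch of counting types/tokens and building a vocabulary with <UNK> on the toy corpus above; the frequency threshold here is an arbitrary illustrative value.

```python
from collections import Counter

# Toy corpus from the slide.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

tokens = [w for sent in corpus for w in sent.split()]
counts = Counter(tokens)

print("tokens:", len(tokens))          # total number of running words (incl. <s>/</s>)
print("types:", len(counts))           # number of distinct words

# Keep only words above a frequency threshold; map the rest to <UNK>.
threshold = 2    # hypothetical value, chosen per task / corpus size
vocab = {w for w, c in counts.items() if c >= threshold} | {"<UNK>", "<s>", "</s>"}
mapped = [w if w in vocab else "<UNK>" for w in tokens]
print(mapped)
```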
Building n-gram Language models
• Example
• Berkeley Restaurant Project
• Counts for unigrams: C(w)
• Counts for bigrams: C(wn-1 wn) [bigram count table in the slides]
• Probability of bigrams: P(wn|wn-1) = C(wn-1 wn) / C(wn-1) [bigram probability table in the slides]
https://github.com/wooters/berp-trans
Jurafsky, D., et al. 1994. The Berkeley restaurant project. ICSLP.
Building n-gram Language models
• Example
• Berkeley Restaurant Project
• Estimation of the bigram probability of the statement “<s> I want chinese food </s>”
• Given P(I|<s>) = 0.25, P(</s>|food) = 0.68, and the other bigram probabilities from the tables:
P(<s> I want chinese food </s>)
= P(I|<s>) × P(want|I) × P(chinese|want) × P(food|chinese) × P(</s>|food)
= 0.25 × 0.33 × 0.0065 × 0.52 × 0.68
= 0.000189618
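A small sketch reproducing this computation in Python, using only the bigram probabilities given on the slide, in both linear and log space.

```python
import math

# Bigram probabilities for the example sentence, as given on the slide
# (estimated from the Berkeley Restaurant Project counts).
bigram_probs = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.33,
    ("want", "chinese"): 0.0065,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

sentence = "<s> I want chinese food </s>".split()

prob = 1.0
log_prob = 0.0
for prev, word in zip(sentence, sentence[1:]):
    p = bigram_probs[(prev, word)]
    prob *= p
    log_prob += math.log(p)

print(prob)                 # ≈ 0.000189618
print(math.exp(log_prob))   # same value, computed via log probabilities
```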
6.2. Evaluating Language Models
• How good is a model?
• How much does the application embedding the model improve?
• How does the model perform on unseen data?
• Does the language model prefer good sentences over bad ones?
• Does it assign a higher probability to “grammatical” or “frequently observed” sentences
• than to “ungrammatical” or “rarely observed” ones?
• Does it accurately predict unseen data?
• Does it assign a higher probability to unseen (test) data?
6.2. Evaluating Language Models
• Extrinsic evaluation - end-to-end evaluation
• Good evaluation to compare models A and B
• Implement each model in a task.
• spelling correction, speech recognition, translation system
• Execute the task and measure the accuracy of A and of B
• How many misspelled words were corrected correctly?
• How many words were translated correctly?
• Compare the accuracy of A and B
• Unfortunately:
• a very time-consuming assessment
6.2. Evaluating Language Models
• Intrinsic evaluation
• Using a measure: perplexity
• Measures the quality of a model independent of any application
• A training set
• for training language model parameters
• A development set
• for tuning the model parameters to fit the characteristics of the development set
• A test set
• data truly not seen during model training
• Perplexity indicates how well the model fits the test set
• assign a higher probability to the test data
6.2. Evaluating Language Models
• Perplexity Intuition
• A better model is one that assigns a higher probability to the word that actually
occurs
• The perplexity (PP) of a language model on a test set W = w1, w2, w3, ..., wN:
PP(W) = P(w1, w2, ..., wN)^(-1/N) = the N-th root of 1 / P(w1, w2, ..., wN)
The higher the probability of the word sequence,
the lower the perplexity => the model predicts the test data better
• Perplexity is the inverse probability of the test set, normalized by the number of words
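A minimal sketch of this perplexity computation; the cond_prob interface and the uniform toy model below are illustrative assumptions, not part of the slide.

```python
import math

def perplexity(test_tokens, cond_prob):
    """Perplexity of a token sequence under a model giving P(w | previous word).

    `cond_prob(prev, word)` is any callable returning the model's conditional
    probability; here we only assume a bigram-style interface.
    """
    N = len(test_tokens) - 1                      # number of predicted tokens
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(cond_prob(prev, word))
    return math.exp(-log_prob / N)

# Hypothetical toy model: a uniform bigram distribution over a 10-word vocabulary.
uniform = lambda prev, word: 1.0 / 10
print(perplexity("<s> I want chinese food </s>".split(), uniform))  # ≈ 10.0
```

With a uniform model over 10 words the perplexity is exactly 10, which matches the common reading of perplexity as an average branching factor.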
6.2. Evaluating Language Models
• Goodman (2001) A bit of progress in language modelling
• Vocabulary size 50,000
• Trigram model: perplexity = 74
• Bigram model: perplexity = 137
• Unigram model: perplexity = 995
• Jurafsky and Martin (2009)
• Wall Street Journal
• 38 million words, Vocabulary size 19,979
• Trigram model: perplexity = 109
• Bigram model: perplexity = 170
• Unigram model: perplexity = 962
• The perplexity of two language models is only comparable if they use
identical vocabularies
• The perplexity is typically between 40 and 400
6.2. Evaluating Language Models
• Perplexity
• A measure independent of any specific application
• But it is difficult to know whether a gain in perplexity translates into a real gain in practice
• A bad approximation unless the test data resembles the training data
• Generally useful in pilot experiments, before moving on to extrinsic
evaluation
6.3. Smoothing algorithms
• Training data
• n-gram model is dependent on the training corpus
• be sure to use a training corpus that has a similar genre/domain to the task
• Problem of sparsity
• Unknown Words
• Out-of-vocabulary (OOV) words can appear in an open-vocabulary system
• convert OOV words to the <UNK> token (requires a vocabulary fixed in advance)
• or replace rare words in the training data by <UNK> based on their frequency (< threshold)
• If perfectly acceptable word sequences are missing from the training corpus => “zero-probability n-grams”
• but they could appear in the test set!
• For example
• In the works of Shakespeare:
• N = 884,647 tokens, V = 29,066 types => number of possible bigrams = V² ≈ 844 million
• Shakespeare produced about 300,000 different bigram types
• Thus, 99.96% of the possible bigrams have never been seen
6.3. Smoothing algorithms
• Problem:
• Some of these zeros are really zeros... but some of them are only rare events
• What if “students opened their ____” never occurred in the data? => zero-frequency n-grams => smoothing algorithms
• What if “students opened their” never occurred in the data? => division by zero => backoff solutions
Backoff solutions
• Backoff
• “back off” to a lower-order n-gram if zero evidence for a higher-order n-gram
P(wn|wn−2wn−1) → P(wn|wn−1) → P(wn) → ...
• Katz backoff
• Good-Turing backoff
• Interpolation
• weight and combine the trigram, bigram, and unigram estimates (see the sketch below), e.g.
P_interp(wn|wn−2wn−1) = λ1 P(wn|wn−2wn−1) + λ2 P(wn|wn−1) + λ3 P(wn), with λ1 + λ2 + λ3 = 1
• Simple interpolation: fixed λ weights
• Conditional interpolation: λ weights conditioned on the context
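A minimal sketch of simple interpolation with fixed weights; the λ values, the double <s> padding, and the helper name interp_prob are illustrative choices (in practice the λ are tuned on a held-out development set).

```python
from collections import Counter

# Illustrative lambda weights (trigram, bigram, unigram), summing to 1.
LAMBDAS = (0.5, 0.3, 0.2)

# Toy corpus, padded with two <s> markers so trigrams have full left context.
corpus = [
    "<s> <s> I am Sam </s>",
    "<s> <s> Sam I am </s>",
    "<s> <s> I do not like green eggs and ham </s>",
]

uni, bi, tri = Counter(), Counter(), Counter()
total = 0
for sent in corpus:
    t = sent.split()
    total += len(t)
    uni.update(t)
    bi.update(zip(t, t[1:]))
    tri.update(zip(t, t[1:], t[2:]))

def interp_prob(w2, w1, w):
    """lambda1*P(w|w2 w1) + lambda2*P(w|w1) + lambda3*P(w), using MLE estimates."""
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_uni = uni[w] / total
    l1, l2, l3 = LAMBDAS
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(interp_prob("<s>", "I", "am"))   # nonzero even when the trigram is unseen
```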
Smoothing algorithms
• Smoothing (discounting)
• add probability mass to unseen events by removing probability mass from previously seen events, so that the distribution still sums to 1
• Type of smoothing
• Laplace (add-one) smoothing
• Add-k smoothing
• Kneser-Ney smoothing
• Stupid backoff
• etc.
Laplace Smoothing
• Add 1 to all the n-gram counts, before normalizing into probabilities
• Unigram probabilities: P_Laplace(wi) = (C(wi) + 1) / (N + V) (N = number of tokens, V = vocabulary size)
• Bigram probabilities: P_Laplace(wn|wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
• etc.
• Makes a very big change to the counts!
• too much probability mass is moved to the zeros
• tends to reassign too much mass to events not seen in training
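A minimal sketch of add-one smoothed bigram probabilities on the same toy corpus; comparing with the unsmoothed MLE estimate shows how much probability mass is moved.

```python
from collections import Counter

# Toy corpus reused from the earlier slides.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

uni, bi = Counter(), Counter()
for sent in corpus:
    t = sent.split()
    uni.update(t)
    bi.update(zip(t, t[1:]))

V = len(uni)   # vocabulary size (here, every observed word type)

def p_laplace(prev, word):
    """(C(prev word) + 1) / (C(prev) + V): never zero, even for unseen bigrams."""
    return (bi[(prev, word)] + 1) / (uni[prev] + V)

print(p_laplace("I", "am"))       # seen bigram: 3/15 = 0.2, vs. 2/3 without smoothing
print(p_laplace("green", "Sam"))  # unseen bigram: now gets a small nonzero probability
```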
Add-k smoothing
• Instead of adding 1 to each count, add a fractional count k (0.5? 0.05? 0.01?): P_Add-k(wn|wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)
• How to choose k? Does it give appropriate discounts?
Kneser-Ney smoothing
• One of the most commonly used and best performing n-gram smoothing
methods
• Takes into account the number of different contexts a word w has appeared in
• Hypothesis: words that have appeared in more contexts in the past are more likely to appear in some new context as well
• uses a “continuation probability” based on the number of different contexts word w has appeared in
I can’t see without my reading ________
• “Angeles” (from “Los Angeles”) and “glasses” can both be very frequent words
• But “Angeles” occurs in essentially one context (after “Los”), so it will have a low continuation probability
• The word “glasses” appears after a much wider range of contexts => the better choice
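A minimal sketch of the continuation-count idea only (not the full Kneser-Ney formula), on a small hypothetical corpus invented for illustration.

```python
from collections import defaultdict

# P_CONTINUATION(w) is proportional to the number of distinct words that
# precede w, normalized by the total number of distinct bigram types.
corpus = ("<s> I live in Los Angeles </s> <s> I lost my glasses </s> "
          "<s> he broke his glasses </s> <s> cheap glasses </s>").split()

preceding = defaultdict(set)            # word -> set of distinct left contexts
for prev, word in zip(corpus, corpus[1:]):
    preceding[word].add(prev)

num_bigram_types = sum(len(s) for s in preceding.values())

def p_continuation(word):
    return len(preceding[word]) / num_bigram_types

print(p_continuation("Angeles"))   # low: only ever follows "Los"
print(p_continuation("glasses"))   # higher: follows several different words
```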
• Google Books Ngram Viewer
• https://books.google.com/ngrams/
Why should we care about Language Modeling?
• Language Modeling is a subcomponent of many NLP tasks
• Especially those involving generating text or estimating the probability of
text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Machine translation
• Summarization
• Dialogue
• etc.
• Everything else in NLP has now been rebuilt upon Language Modeling:
• GPT-3 is an LM!
Language Modeling Toolkits
• SRILM
• http://www.speech.sri.com/projects/srilm/
• IRSTLM
• https://sourceforge.net/projects/irstlm/
• KenLM
• https://kheafield.com/code/kenlm/
• End of Chapter 6