Natural Language Processing - Language Modelling

Module 3: Language Modelling

The role of language models. N-gram models. Estimating parameters and smoothing. Evaluating language models.
Introduction to Language Models

• What is a Language Model?


• A statistical model that predicts the likelihood of a sequence of words in a language.
• Assigns probabilities to sequences of words or predicts the next word given a
context.
• Why Language Models?
• Core component of NLP tasks: speech recognition, machine translation, text
generation, spell correction, etc.
• Captures the structure and patterns of natural language.
• Example:
• Given "The cat is", a language model predicts the next word (e.g., "on" or "sleeping").
• P("The cat is on the mat") = P(The) × P(cat|The) × P(is|The cat) × ...
Role of Language Models

• Key Roles:
• Text Generation: Generate coherent text (e.g., chatbots, story generation).
• Speech Recognition: Disambiguate similar-sounding words (e.g., "recognize speech"
vs. "wreck a nice beach").
• Machine Translation: Ensure grammatically correct translations.
• Spell Correction: Suggest correct words based on context (e.g., "I have a kat" → "cat").
• Sentiment Analysis: Understand contextual meaning for better classification.
• Applications:
• Virtual assistants (e.g., Siri, Alexa).
• Autocomplete in search engines.
• Predictive keyboards on smartphones.
Types of Language Models

• Statistical Language Models:


• Based on probability distributions over word sequences.
• Example: N-gram models.
• Neural Language Models:
• Use neural networks (e.g., RNNs, Transformers like BERT, GPT).
• Capture long-range dependencies better than N-grams.
• Focus:
• N-gram models as a foundational approach to language modelling.
Introduction to Probability
Theory
•What is Probability Theory?
•A branch of mathematics that quantifies uncertainty and predicts the likelihood of events.
•Provides the foundation for statistical language models, such as N-grams, by assigning probabilities to
word sequences.
•Relevance to NLP:
•Language models rely on probability to predict the next word, rank possible sentences, or disambiguate
meanings.
•Example: In speech recognition, probability helps choose between "recognize speech" and "wreck a nice
beach."
•Key Components:
•Sample space, events, probability distributions, conditional probability, and independence.
Core Concepts of Probability
Theory
Sample Space (Ω):
•The set of all possible outcomes.
•Example: For a single word in a language model, Ω = {all words in vocabulary, e.g., "the," "cat," "is," ...}.
Event:
•A subset of the sample space.
•Example: Event A = {the next word is "cat"}.
Probability (P):
•A function assigning a value between 0 and 1 to an event.
•P(A) = Probability that event A occurs.
•Example: P("cat" | "The") = probability of "cat" following "The."
Probability Axioms:
1.P(A) ≥ 0 for any event A.
2.P(Ω) = 1 (the sample space has probability 1).
3. For mutually exclusive events A₁, A₂, ..., P(A₁ ∪ A₂ ∪ ...) = P(A₁) + P(A₂) + ...
Types of Probability
Marginal Probability:
•Probability of an event occurring independently.
•Example: P("the") = frequency of "the" / total words in a corpus.
•Corpus: "The cat is on the mat." P("the") = 2/6 = 0.333.
Joint Probability:
•Probability of multiple events occurring together.
•Example: P("the," "cat") = probability of the bigram "the cat."
•P("the," "cat") = Count("the cat") / total bigrams.
Conditional Probability:
•Probability of an event given another event has occurred.
•Formula: P(A|B) = P(A ∩ B) / P(B), where P(B) > 0.
•Example: P("cat" | "The") = Count("The cat") / Count("The").
•Corpus: "The cat is on the mat. The cat sleeps."
•P("cat" | "The") = 2/2 = 1.0.
Example Sentence and Corpus

•Sentence for Analysis: "The cat sleeps."


•Corpus: A small text corpus to compute probabilities:
"The cat is on the mat. The cat sleeps. The dog runs."
•Tokenized Corpus: ["The", "cat", "is", "on", "the", "mat", ".", "The", "cat", "sleeps", ".", "The", "dog",
"runs", "."]
•Vocabulary: {The, cat, is, on, the, mat, ., sleeps, dog, runs}
•Total Words: 15
•Assumption: We’ll use a bigram model for joint and conditional probabilities to keep calculations
manageable and relevant to N-gram models in NLP.
Probability Calculations

Marginal Probability
• Definition: The probability of a single event occurring, regardless of other events.
• Formula: P(wᵢ) = Count(wᵢ) / Total words.
• Example Event: Probability of the word "cat" appearing in the corpus.
• Calculation:
• Count("cat") = 2 (appears twice in the corpus).
• Total words = 15.
• P("cat") = 2 / 15 ≈ 0.1333.
• Other Marginal Probabilities :
• P("The") = Count("The") / Total = 3 / 15 = 0.2.
• P("sleeps") = Count("sleeps") / Total = 1 / 15 ≈ 0.0667.
Probability Calculations

Joint Probability
• Definition: The probability of multiple events occurring together.
• Formula (for bigrams): P(wᵢ, wᵢ₊₁) = Count(wᵢ, wᵢ₊₁) / Total bigrams.
• Example Event: Probability of the bigram "The cat" in the sentence.
• Calculation:
• Total bigrams = Total words - 1 = 15 - 1 = 14.
• Count("The cat") = 2 (appears in "The cat is" and "The cat sleeps").
• P("The", "cat") = 2 / 14 ≈ 0.1429.
• Another Example:
• Bigram "cat sleeps":
• Count("cat sleeps") = 1.
• P("cat", "sleeps") = 1 / 14 ≈ 0.0714.
Probability Calculations

Conditional Probability
• Definition: The probability of an event given another event has occurred.
• Formula: P(wᵢ₊₁ | wᵢ) = P(wᵢ, wᵢ₊₁) / P(wᵢ) = Count(wᵢ, wᵢ₊₁) / Count(wᵢ).
• Example Event: Probability of "cat" given "The" (i.e., P("cat" | "The")).
• Calculation:
• Count("The cat") = 2.
• Count("The") = 3.
• P("cat" | "The") = 2 / 3 ≈ 0.6667.
• Another Example:
• P("sleeps" | "cat"):
• Count("cat sleeps") = 1.
• Count("cat") = 2.
• P("sleeps" | "cat") = 1 / 2 = 0.5.
Summary of Calculated
Probabilities
Marginal Probabilities:
•P("The") = 0.2
•P("cat") = 0.1333
•P("sleeps") = 0.0667
Joint Probabilities:
•P("The", "cat") = 0.1429
•P("cat", "sleeps") = 0.0714
Conditional Probabilities:
•P("cat" | "The") = 0.6667
•P("sleeps" | "cat") = 0.5
Marginal Probability

• Definition: Probability of a single word occurring.


• Example Sentence: "The cat sleeps."
• Calculations:
• P("The") = Count("The") / Total = 3 / 15 = 0.2
• P("cat") = 2 / 15 ≈ 0.1333
• P("sleeps") = 1 / 15 ≈ 0.0667
Joint Probability

• Definition: Probability of two words occurring together (bigram).


• Example Sentence: "The cat sleeps."
• Calculations:
• P("The", "cat") = Count("The cat") / Total bigrams = 2 / 14 ≈ 0.1429
• P("cat", "sleeps") = 1 / 14 ≈ 0.0714
Conditional Probability

• Definition: Probability of a word given the previous word.


• Example Sentence: "The cat sleeps."
• Calculations:
• P("cat" | "The") = Count("The cat") / Count("The") = 2 / 3 ≈ 0.6667
• P("sleeps" | "cat") = Count("cat sleeps") / Count("cat") = 1 / 2 = 0.5
Interpretation

• Marginal Probability:
• Shows how common individual words are in the corpus.
• "The" is most frequent (P = 0.2), "sleeps" is least (P = 0.0667).
• Joint Probability:
• Indicates likelihood of word pairs.
• "The cat" is more likely (P = 0.1429) than "cat sleeps" (P = 0.0714).
• Conditional Probability:
• Critical for N-gram models, predicts next word given context.
• High P("cat" | "The") = 0.6667 suggests "cat" often follows "The."
• Relevance to NLP:
• These probabilities drive language models to predict word sequences, as in
autocomplete or speech recognition.
Bayes’ Theorem in NLP

Overview
• Bayes’ Theorem is a fundamental concept in probability theory that relates conditional probabilities,
enabling us to update probabilities based on new evidence.
• In NLP, it’s widely used in tasks like text classification (e.g., spam detection), spell correction, and word
sense disambiguation, where we need to infer the most likely outcome given observed data.
Bayes' Theorem Formula:
P(A|B) = P(B|A) · P(A) / P(B)
• P(A|B): Posterior probability – the probability of event A given evidence B.
• P(B|A): Likelihood – the probability of observing B given A.
• P(A): Prior probability – the initial probability of A before observing B.
• P(B): Evidence – the probability of B, often computed as a normalizing constant.
• In NLP, these terms are interpreted as:
• A: A hypothesis, e.g., a word, class label (spam/not spam), or part-of-speech tag.
• B: Observed data, e.g., a word sequence, email text, or context.
Relevance to NLP

• Bayes’ Theorem allows NLP systems to:


• Classify text (e.g., spam detection by calculating P(spam|email text)).
• Correct spelling errors (e.g., P("cat"|"kat") vs. P("hat"|"kat")).
• Disambiguate word senses (e.g., P("bank" = river bank | context) vs. P("bank"
= financial institution | context)).
• It’s particularly useful in Naive Bayes classifiers, which assume
feature independence to simplify probability calculations in text
classification.
Example 1: Spell Correction

Scenario: A user types "kat" in a sentence, and we want to determine whether the intended
word was "cat" or "hat" using Bayes’ Theorem.
Corpus:
• A small text corpus: "The cat is on the mat. The cat sleeps. The hat is red."
• Tokenized: ["The", "cat", "is", "on", "the", "mat", ".", "The", "cat", "sleeps", ".", "The", "hat",
"is", "red", "."]
• Total words: 16
• Vocabulary: {The, cat, is, on, the, mat, ., sleeps, hat, red}
Step-by-Step Calculation:
• Goal: Compute P("cat"|"kat") and P("hat"|"kat") to determine the most likely correction.
Bayes’ Theorem:
• P(word∣"kat")=P("kat"∣word)⋅P(word)/P("kat")
Example 1 – Spell Correction

• Scenario: User types "kat." Was the intended word "cat" or "hat"?
Corpus: "The cat is on the mat. The cat sleeps. The hat is red."
• Calculations:
• Prior:
• P("cat") = 2 / 16 = 0.125
• P("hat") = 1 / 16 = 0.0625
• Likelihood:
• P("kat"|"cat") = 0.1 (typo probability).
• P("kat"|"hat") = 0.08.
• Posterior (proportional):
• P("cat"|"kat") ∝ 0.1 × 0.125 = 0.0125
• P("hat"|"kat") ∝ 0.08 × 0.0625 = 0.005
• Conclusion: "cat" is more likely (0.0125 > 0.005).
Example 1: Spell Correction

word: Either "cat" or "hat."


•P("kat"|word): Likelihood of typing "kat" given the intended word (typo probability).
•P(word): Prior probability of the word in the corpus.
•P("kat"): Evidence, a normalizing constant (same for both words, so we can compare proportionally).
•Prior Probabilities (P(word)):
•P("cat") = Count("cat") / Total words = 2 / 16 = 0.125
•P("hat") = Count("hat") / Total words = 1 / 16 = 0.0625
•Likelihood (P("kat"|word)):
•Assume a simple typo model: Probability of typing "kat" instead of "cat" or "hat" depends on edit distance.
•Edit distance from "cat" to "kat" = 1 (substitute "c" with "k").
•Edit distance from "hat" to "kat" = 1 (substitute "h" with "k").
•Assume P("kat"|"cat") = 0.1 (10% chance of typing "kat" when intending "cat").
•Assume P("kat"|"hat") = 0.08 (8% chance, slightly lower due to less frequent typos for "h").
Example 1: Spell Correction

• Evidence (P("kat")):
• Since we’re comparing probabilities, we can use the proportional form:
P(word∣"kat")∝P("kat"∣word)⋅P(word)
• We don’t need to compute P("kat") explicitly for ranking.
• Posterior Probabilities:
• For "cat": P("cat"∣"kat")∝P("kat"∣"cat")⋅P("cat")=0.1*0.125=0.0125
• For "hat": P("hat"∣"kat")∝P("kat"∣"hat")⋅P("hat")=0.08*0.0625=0.005
• Conclusion:
• P("cat"|"kat") > P("hat"|"kat") (0.0125 > 0.005).
• The model suggests "cat" is more likely than "hat" as the intended word.
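As a quick illustration of this proportional ranking, here is a small Python sketch; the typo likelihoods (0.1 and 0.08) are the assumed values from the slides, not a real error model.

# Illustrative sketch of the Bayes spell-correction ranking above.
priors = {"cat": 2 / 16, "hat": 1 / 16}    # P(word) from the toy corpus
likelihood = {"cat": 0.10, "hat": 0.08}    # assumed P("kat" | word)

scores = {w: likelihood[w] * priors[w] for w in priors}  # proportional posteriors
best = max(scores, key=scores.get)
print(scores)   # {'cat': 0.0125, 'hat': 0.005}
print(best)     # 'cat'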
Example 2 – Spam Detection

• Scenario: Classify an email with the word "win" as spam or not spam.
• Corpus:
• Spam: "Win a prize now", "Win big money"
• Not spam: "Meet me now", "Work is done"
• Calculations:
• Prior:
• P(spam) = 0.5, P(not spam) = 0.5
• Likelihood (Laplace smoothing applied to the unseen not-spam case):
• P("win"|spam) = 2 / 6 = 0.3333
• P("win"|not spam) = 1 / 17 ≈ 0.0588
• Evidence:
• P("win") = (0.3333 × 0.5) + (0.0588 × 0.5) ≈ 0.1961
• Posterior:
• P(spam|"win") = (0.3333 × 0.5) / 0.1961 ≈ 0.8501
• P(not spam|"win") = (0.0588 × 0.5) / 0.1961 ≈ 0.1499
• Conclusion: Email is likely spam (0.8501 > 0.1499).
Example 2: Text Classification (Spam Detection)

Scenario: Classify an email as "spam" or "not spam" based on the word "win" appearing in it,
using a Naive Bayes approach.
• Corpus:
• Training data:
• Spam emails: "Win a prize now", "Win big money"
• Not spam emails: "Meet me now", "Work is done"
• Vocabulary: {win, a, prize, now, big, money, meet, me, work, is, done}
• Total words:
• Spam: 6 words
• Not spam: 6 words
• Total emails: 4 (2 spam, 2 not spam)
• Step-by-Step Calculation:
• Goal: Compute P(spam|"win") and P(not spam|"win").
Example 2: Text Classification (Spam Detection)

• Bayes’ Theorem:
•P(class∣"win")=P("win"∣class)⋅P(class)/P("win")
class: Either "spam" or "not spam."
•P("win"|class): Likelihood of "win" appearing in a class.
•P(class): Prior probability of the class.
•P("win"): Evidence (normalizing constant).
Prior Probabilities (P(class)):
•P(spam) = Number of spam emails / Total emails = 2 / 4 = 0.5
•P(not spam) = 2 / 4 = 0.5
•Likelihood (P("win"|class)):
•For spam:
•Count("win" in spam) = 2 (appears in both spam emails).
•Total words in spam = 6.
•P("win"|spam) = 2 / 6 = 0.3333
Example 2: Text Classification (Spam Detection)

•For not spam:


•Count("win" in not spam) = 0.
•Apply Laplace smoothing (add 1 to the numerator and the vocabulary size V = 11 to the denominator):
P("win"|not spam) = (0 + 1) / (6 + 11) = 1/17 ≈ 0.0588
•Evidence (P("win")):
•Using the law of total probability:
P("win") = P("win"|spam)·P(spam) + P("win"|not spam)·P(not spam)
P("win") = (0.3333 × 0.5) + (0.0588 × 0.5) = 0.1667 + 0.0294 = 0.1961
Bayes’ Theorem in N-gram
Models
• Connection to N-grams:
• Bayes’ Theorem can enhance N-gram models by incorporating prior knowledge.
• Example: In a trigram model, use Bayes' Theorem to estimate P(w₃|w₁,w₂)
when data is sparse: P(w₃|w₁,w₂) ∝ P(w₁,w₂|w₃) · P(w₃)
• Rarely used directly in N-grams due to computational complexity; Naive Bayes
is more common in classification tasks.
• Naive Bayes in NLP:
• Assumes word independence for simplicity.
• Example: In spam detection, P(email|spam) = ∏ P(wordᵢ|spam).
• Simplifies calculations but may miss dependencies (unlike N-grams, which
capture local dependencies).
Challenges and Considerations

• Challenges:
• Sparsity: Zero probabilities (e.g., P("win"|not spam) = 0 without smoothing).
• Solution: Apply Laplace or Kneser-Ney smoothing.
• Assumption of Independence: Naive Bayes assumes words are independent,
which isn’t true in language.
• Solution: Use N-grams or neural models for context.
• Estimating Priors and Likelihoods: Requires a representative corpus.
• Solution: Use large, domain-specific datasets.
• Example Impact:
• In spell correction, inaccurate P("kat"|word) estimates (typo probabilities) can
skew results.
• In spam detection, small corpora lead to unreliable P(word|class).
Independence and Chain Rule

Independence:
•Events A and B are independent if P(A ∩ B) = P(A) × P(B).
•Example: In a unigram model, words are assumed independent (simplification, not realistic).
•P("the cat") = P("the") × P("cat").
•Chain Rule:
•Expresses joint probability as a product of conditional probabilities.
•Formula: P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ... × P(wₙ|w₁,...,wₙ₋₁).
•Example: For "The cat is":
•P("The cat is") = P("The") × P("cat" | "The") × P("is" | "The cat").
•Corpus: "The cat is on the mat."
•P("The") = 2/6, P("cat" | "The") = 1/2, P("is" | "The cat") = 1/1.
•P("The cat is") = (2/6) × (1/2) × (1/1) = 0.167.
Probability in N-gram Models

•N-gram Approximation:
•To make computation feasible, assume the probability of a word depends only on the previous n-1
words.
•Bigram (n=2): P(wₖ | w₁, ..., wₖ₋₁) ≈ P(wₖ | wₖ₋₁).
•Trigram (n=3): P(wₖ | w₁, ..., wₖ₋₁) ≈ P(wₖ | wₖ₋₂, wₖ₋₁).
•Example:
•Corpus: "The cat is on the mat. The cat sleeps."
•Bigram model:
•P("is" | "cat") = Count("cat is") / Count("cat") = 1/2 = 0.5.
•Trigram model:
•P("is" | "The cat") = Count("The cat is") / Count("The cat") = 1/1 = 1.0.
•Why Approximate?
•Reduces computational complexity.
•Handles data sparsity by limiting context.
Probability in N-gram Models

•Context: We'll use a text classification example (e.g., spam vs. non-spam emails) where trigrams and
4-grams are features for applying Bayes' Theorem.
•Dataset: A small, hypothetical set of text data (emails) labeled as spam or non-spam.
•Bayes' Theorem: P(A∣B)=P(B∣A)⋅P(A)/P(B) where:
•A : Class (e.g., Spam or Non-Spam).
•B : Feature (trigram or 4-gram).
•P(A∣B) Posterior probability of class given the feature.
•P(B∣A): Likelihood of feature given the class.
•P(A): Prior probability of the class.
•P(B): Evidence (normalizing constant).
Example Setup
Dataset:
• Spam Email 1: "Win cash now"
• Spam Email 2: "Free money prize now"
• Non-Spam Email 1: "Meeting tomorrow morning"
• Non-Spam Email 2: "Project update next week"
• Classes:
• Spam (S)
• Non-Spam (NS)
• Prior Probabilities:
• Total emails: 4 (2 Spam, 2 Non-Spam)
• P(S)=2/4=0.5
• P(NS)=2/4=0.5
Introduction to N-Grams and
Bayes' Theorem
• Understanding Trigrams, 4-Grams, and Bayes' Theorem
• N-Grams: Sequences of n words used as features in text analysis.
• Trigram: 3-word sequence (e.g., "win cash now").
• 4-Gram: 4-word sequence (e.g., "free money prize now").
• Bayes' Theorem: P(A∣B)=P(B∣A)⋅P(A)/P(B)
• Used to classify text (e.g., spam vs. non-spam) based on n-gram
probabilities.
• Example: Classify a new email using trigrams and 4-grams.
Trigram Example with Bayes' Theorem

• Title: Trigram Example – Classifying "Win Cash Now"


• Trigram: "win cash now" (from Spam Email 1).
• Goal: Calculate P(S∣"win cash now") and P(NS∣"win cash now")
• Step 1: Likelihood P(B∣A)
• Spam Corpus: "Win cash now", "Free money prize now" (2 emails, 2 trigrams).
• Trigram "win cash now" appears 1 time in Spam.
• Total trigrams in Spam: 2.
• P("win cash now"∣S)=1/2=0.5
• Non-Spam Corpus: "Meeting tomorrow morning", "Project update next week" (2
emails, 2 trigrams).
• Trigram "win cash now" appears 0 times in Non-Spam.
• Use Laplace smoothing (add 1 to numerator, add 2 to denominator to account for
vocabulary size, assume 2 unique trigrams for simplicity):
• P("win cash now"∣NS)= =1/4=0.25
Trigram Example with Bayes'
Theorem
• Step 2: Prior P(A)
P(S)=0.5, P(NS)=0.5
• Step 3: Evidence P(B)
• P("win cash now")=P("win cash now"∣S)⋅P(S)+P("win cash now"∣NS)⋅P(NS)
=(0.5*0.5)+(0.25*0.5)=0.25+0.125=0.375
• Step 4: Posterior P(A|B)
• P(S∣"win cash now")=P("win cash now"∣S)⋅P(S)/P("win cash now")
• =0.5*0.50/0.375=0.25/0.375≈0.667
• P(NS∣"win cash now")=P("win cash now"∣NS)⋅P(NS)/P("win cash now")
• =0.25*0.5/0.375=0.125/0.375≈0.333
• Conclusion: The trigram "win cash now" suggests the email is Spam
(P(S|"win cash now") ≈ 0.667).
4-Gram Example with Bayes' Theorem

• 4-Gram Example – Classifying "Free Money Prize Now"


• 4-Gram: "free money prize now" (from Spam Email 2).
• Goal: Calculate P(S∣"free money prize now") and
P(NS∣"free money prize now")
• Step 1: Likelihood P(B∣A)
• Spam Corpus: Only "Free money prize now" contains a 4-gram (1 email, 1 4-gram).
• 4-Gram "free money prize now" appears 1 time in Spam.
• Total 4-grams in Spam: 1.
• P("free money prize now"∣S)=1/1=1.0 (adjust with smoothing if needed).
• With Laplace smoothing (assume 2 unique 4-grams for vocabulary):
• P("free money prize now"∣S)=≈0.667
4-Gram Example with Bayes' Theorem

• Non-Spam Corpus: No 4-grams in Non-Spam (both emails have <4 words).


P("free money prize now"∣NS)==0.5
• Step 2: Prior P(A)
P(S)=0.5, P(NS)=0.5
• Step 3: Evidence P(B)
• P("free money prize now")=P("free money prize now"∣S)⋅P(S)
+P("free money prize now"∣NS)⋅P(NS) =(0.667*0.5)+(0.5*0.5)=0.3335+0.25=0.5835
• Step 4: Posterior P(A∣B)
• P(S|"free money prize now") = (0.667 × 0.5) / 0.5835 ≈ 0.571
• P(NS|"free money prize now") = (0.5 × 0.5) / 0.5835 ≈ 0.429
• Conclusion: The 4-gram "free money prize now" suggests the email is Spam
(P(S|"free money prize now") ≈ 0.571).
Summary and Visualization

• Summary of Trigram and 4-Gram Classification


• Trigram ("win cash now"):
• P(S∣"win cash now")≈0.667
• P(NS∣"win cash now")≈0.333 Classified as: Spam
• 4-Gram ("free money prize now"):
• P(S∣"free money prize now")≈0.571
• P(NS∣"free money prize now")≈0.429
• Classified as: Spam
Probability Comparison
What is MLE?

•MLE (Maximum Likelihood Estimation) calculates probabilities by counting how often something appears in
data.
•In NLP, it helps classify text, like deciding if an email is Spam or Not Spam.
•Simple method: Divide the count of an event by the total count.
•Example: If "free" appears 5 times in 20 words, probability = 5 ÷ 20 = 0.25.
MLE for Text Classification
• Goal: Find out if a phrase suggests an email is Spam or Not Spam.
• Use MLE to estimate probabilities based on counts.
• Formula (Bayes with MLE estimates): P(class | phrase) = (P(phrase | class) × P(class)) ÷ P(phrase), with each term estimated from counts.
• Task: Check if "free money prize now" indicates Spam or Not Spam.
Estimating Probabilities with
MLE
Maximum Likelihood Estimation (MLE):
•Estimates probabilities based on observed frequencies.
•Formula: P(wₖ | wₖ₋ₙ₊₁, ..., wₖ₋₁) = Count(wₖ₋ₙ₊₁, ..., wₖ) / Count(wₖ₋ₙ₊₁, ..., wₖ₋₁).
•Example:
•Corpus: "I like to eat. I like to drink."
•Bigram: P("to" | "like") = Count("like to") / Count("like") = 2/2 = 1.0.
•Issue: If "like to" never appears, P = 0 (sparsity problem).
•Solution: Smoothing (e.g., Laplace, Kneser-Ney).
Estimating Probabilities with MLE using Bayes’ Theorem

•Bayes’ theorem helps find the probability of a class (e.g., Spam) given text.
•Formula: Probability of class given text = (Likelihood × Prior) ÷ Evidence
Terms:
•Likelihood: How often text appears in class (from MLE).
•Prior: How common the class is (from MLE).
•Evidence: How common the text is overall.
•Task: Classify email with "free money prize now" as Spam or Not Spam.
Sample Data
• Total emails: 12.
• Spam emails: 6 (4 have "free money prize now").
• Not Spam emails: 6 (3 have "free money prize now").
• 4-gram: "free money prize now" (4-word sequence).
• Use MLE and Bayes’ theorem to calculate probabilities.
Estimating Probabilities with MLE using Bayes’ Theorem

Step 1 - Prior Probabilities (MLE)

•Prior: Probability of each class.


•Spam probability = Spam emails ÷ Total emails = 6 ÷ 12 = 0.5.
•Not Spam probability = Not Spam emails ÷ Total emails = 6 ÷ 12 = 0.5.
Step 2 - Likelihoods (MLE)
• Likelihood: Probability of 4-gram in each class.
• In Spam: Spam emails with 4-gram ÷ Total Spam emails = 4 ÷ 6 ≈ 0.667.
• In Not Spam: Not Spam emails with 4-gram ÷ Total Not Spam emails = 3 ÷ 6 = 0.5.
Step 3 - Evidence
• Evidence: Probability of "free money prize now" across all emails.
• Emails with 4-gram: 4 (Spam) + 3 (Not Spam) = 7.
• Probability of 4-gram = Emails with 4-gram ÷ Total emails = 7 ÷ 12 ≈ 0.5835.
Estimating Probabilities with MLE

Step 4 - Bayes’ Theorem


• Bayes’ formula: Probability = (Likelihood × Prior) ÷ Evidence.
• Spam probability given 4-gram: (0.667 × 0.5) ÷ 0.5835 = 0.3335 ÷ 0.5835 ≈ 0.571.
• Not Spam probability given 4-gram: (0.5 × 0.5) ÷ 0.5835 = 0.25 ÷ 0.5835 ≈ 0.429.
Results:
• Spam probability ≈ 0.571.
• Not Spam probability ≈ 0.429.
• Decision: "free money prize now" suggests the email is Spam (0.571 > 0.429).
• Visual idea: Bar chart with Spam (0.571, red) and Not Spam (0.429, blue).
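A tiny sketch reproducing the document-level MLE and Bayes arithmetic above from the assumed email counts (illustrative, not from the slides).

# Illustrative sketch of the document-level MLE + Bayes calculation above.
total, spam, not_spam = 12, 6, 6
spam_with_phrase, not_spam_with_phrase = 4, 3

prior_s, prior_ns = spam / total, not_spam / total   # 0.5, 0.5
lik_s = spam_with_phrase / spam                      # ≈ 0.667
lik_ns = not_spam_with_phrase / not_spam             # 0.5
evidence = lik_s * prior_s + lik_ns * prior_ns       # ≈ 0.583
print(lik_s * prior_s / evidence)   # ≈ 0.571 (Spam)
print(lik_ns * prior_ns / evidence) # ≈ 0.429 (Not Spam)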
Limitations of MLE

• Zero Probabilities: Unseen phrases (e.g., "win cash now") get 0


probability.
• Fix: Add small counts (smoothing).
• Small Data: Less reliable with few examples.
• Overfitting: May focus too much on training data.
Applications in NLP

• Spam Detection: Classify emails based on phrases.


• Autocomplete: Predict next word in a sentence.
• Sentiment Analysis: Detect positive/negative text.
• Translation: Estimate word sequence probabilities.
• MLE uses counts to estimate probabilities.
• Combined with Bayes’ rule, it classifies text (e.g., Spam vs. Not Spam).
• Simple but needs smoothing for unseen data.
• MLE is a powerful, easy tool for NLP tasks.
Smoothing Techniques
• Why Smoothing?
• To handle unseen N-grams (zero probabilities).
• Ensures non-zero probabilities for all possible N-grams.
• Common Smoothing Methods:
• Add-One (Laplace) Smoothing:
• Add 1 to all counts to avoid zero probabilities.
• Formula: P(wₖ | wₖ₋ₙ₊₁, ..., wₖ₋₁) = (Count(wₖ₋ₙ₊₁, ..., wₖ) + 1) / (Count(wₖ₋ₙ₊₁, ..., wₖ₋₁) + V), where V is the vocabulary size.
• Example: If Count("cat jumps") = 0, add 1 to numerator and V to denominator.
• Add-k Smoothing:
• Add a smaller constant k (e.g., 0.01) instead of 1.
• Kneser-Ney Smoothing:
• Adjusts probabilities based on word frequency and context diversity.
• More sophisticated, widely used in practice.
• Example:
• Without smoothing: P("jumps|cat") = 0 (if not in corpus).
• With Laplace: P("jumps|cat") = (0 + 1) / (Count("cat") + V) > 0.
Add-One (Laplace) Smoothing

• Add-One smoothing, also called Laplace smoothing, addresses zero probabilities by


adding 1 to the count of every possible n-gram, even those not seen in the training
data.
• To normalize the probabilities, the denominator is increased by the vocabulary
size (V), which is the total number of unique words. This ensures that every
possible word sequence has a non-zero probability, but it can overestimate
probabilities for rare n-grams.
Formula:
P(wₖ | wₖ₋ₙ₊₁, ..., wₖ₋₁) = (Count(wₖ₋ₙ₊₁, ..., wₖ) + 1) / (Count(wₖ₋ₙ₊₁, ..., wₖ₋₁) + V)
• Count(wₖ₋ₙ₊₁, ..., wₖ): Number of times the full n-gram appears.
• Count(wₖ₋ₙ₊₁, ..., wₖ₋₁): Number of times the context (previous words) appears.
• V: Vocabulary size (total unique words).
Example
• Imagine a small corpus with the sentences:
• "The cat jumps."
• "The dog jumps."
Vocabulary: {The, cat, dog, jumps} (V = 4).
We want to calculate the bigram probability P(jumps | cat), i.e., the chance
"jumps" follows "cat."
• Without smoothing:
• Count("cat jumps") = 1 (from "The cat jumps").
• Count("cat") = 1 (appears once).
• P(jumps | cat) = Count("cat jumps") / Count("cat") = 1 / 1 = 1.0.
• But what about P(dog | cat)? Count("cat dog") = 0, so P(dog | cat) = 0 / 1 = 0. This is
problematic because it says "cat dog" is impossible, even though it might occur in new data.
Example
• With Add-One Smoothing:
• Add 1 to every bigram count, and add V = 4 to the denominator.
• For P(jumps | cat):
• Count("cat jumps") + 1 = 1 + 1 = 2.
• Count("cat") + V = 1 + 4 = 5.
• P(jumps | cat) = 2 / 5 = 0.4.
• For P(dog | cat):
• Count("cat dog") + 1 = 0 + 1 = 1.
• Count("cat") + V = 1 + 4 = 5.
• P(dog | cat) = 1 / 5 = 0.2.
• Add-One smoothing gives a non-zero probability (0.2) to "cat dog," even though it
wasn't in the corpus. However, it assigns the same probability to all unseen bigrams,
which can be unrealistic since some unseen combinations are more likely than
others.
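A small add-k smoothed bigram estimator (k = 1 gives Add-One/Laplace); it reproduces the 0.4 and 0.2 values above, and setting k = 0.01 reproduces the Add-k values on the next slides. This is an illustrative sketch, not code from the slides.

# Illustrative add-k smoothed bigram estimator (k = 1 gives Add-One/Laplace).
from collections import Counter

tokens = "the cat jumps the dog jumps".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
V = len(uni)                       # vocabulary size = 4

def smoothed(w, prev, k=1.0):
    return (bi[(prev, w)] + k) / (uni[prev] + k * V)

print(smoothed("jumps", "cat"))          # (1 + 1) / (1 + 4) = 0.4
print(smoothed("dog", "cat"))            # (0 + 1) / (1 + 4) = 0.2
print(smoothed("dog", "cat", k=0.01))    # ≈ 0.0096 (Add-k, see next slides)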
Add-k Smoothing

• Add-k smoothing is a refinement of Add-One smoothing. Instead of adding 1 to


every n-gram count, it adds a smaller constant k (e.g., 0.01 or 0.05), which
reduces the probability mass given to unseen n-grams.
• This makes it less aggressive than Add-One smoothing and often more realistic,
as it doesn't overestimate rare events as much. The denominator is still increased
by k × V to normalize the probabilities.
Formula:
P(wₖ | wₖ₋ₙ₊₁, ..., wₖ₋₁) = (Count(wₖ₋ₙ₊₁, ..., wₖ) + k) / (Count(wₖ₋ₙ₊₁, ..., wₖ₋₁) + k × V)
• Example:
Using the same corpus: "The cat jumps." and "The dog jumps."
Vocabulary: {The, cat, dog, jumps} (V = 4).
Let’s set k = 0.01 and calculate P(jumps | cat) and P(dog | cat).
Add-k Smoothing

• For P(jumps | cat):


• Count("cat jumps") + k = 1 + 0.01 = 1.01.
• Count("cat") + k × V = 1 + 0.01 × 4 = 1 + 0.04 = 1.04.
• P(jumps | cat) = 1.01 / 1.04 ≈ 0.971.
• For P(dog | cat):
• Count("cat dog") + k = 0 + 0.01 = 0.01.
• Count("cat") + k × V = 1 + 0.04 = 1.04.
• P(dog | cat) = 0.01 / 1.04 ≈ 0.0096.
• Add-k smoothing assigns a much smaller probability (0.0096) to the unseen
"cat dog" compared to Add-One (0.2). This is more realistic since "cat dog" is
unseen, but it still avoids zero probability. The choice of k is critical and often
tuned based on the dataset.
Kneser-Ney Smoothing

• Kneser-Ney smoothing is a more sophisticated technique widely used in practice. It adjusts


probabilities based on context diversity rather than just raw counts.
• It discounts the counts of observed n-grams and redistributes the probability mass to
unseen n-grams, prioritizing words that appear in diverse contexts (i.e., with many
different preceding words).
• This makes it better at modeling real-world language patterns, as it accounts for how likely
a word is to appear in new contexts.
Key Idea:
• Instead of just adding a constant, Kneser-Ney reduces (discounts) the counts of observed n-
grams by a fixed amount (e.g., 0.75).
• The saved probability mass is distributed to unseen n-grams based on their "continuation
probability" (how often a word follows different contexts).
• It uses lower-order n-grams (e.g., unigrams) to inform higher-order n-gram probabilities.
Example
• Using the same corpus: "The cat jumps." and "The dog jumps."
Vocabulary: {The, cat, dog, jumps} (V = 4).
We want P(jumps | cat) and P(dog | cat). For simplicity, assume a discount d = 0.75 (a
common choice).
• Step 1: Discount observed bigrams:
• Count("cat jumps") = 1, but we subtract d = 0.75, so adjusted count = 1 - 0.75 = 0.25.
• Count("cat") = 1 (total occurrences of "cat").
• The discounted probability for P(jumps | cat) = 0.25 / 1 = 0.25.
• Step 2: Redistribute probability to unseen bigrams:
• For unseen bigrams like "cat dog," Kneser-Ney uses the "continuation probability" of the word "dog."
• Continuation count for "dog": How many unique bigrams end with "dog"? In the corpus, we have "The
dog" (1 unique context).
• Compare to "jumps," which appears in "cat jumps" and "dog jumps" (2 unique contexts).
• "Jumps" is more likely to appear after new words because it has higher context diversity (2 vs. 1).
Example

• The remaining probability mass (1 - 0.25 = 0.75) is distributed to unseen


bigrams, weighted by continuation probabilities.
• Since "jumps" has 2 contexts and "dog" has 1, "dog" gets a smaller share, say
P(dog | cat) ≈ 0.1 (exact values depend on the full model).
• Kneser-Ney gives "jumps" a higher probability than "dog" after "cat"
because "jumps" appears in more diverse contexts.
• This makes it more likely to follow new words like "cat" in unseen
data. Unlike Add-One or Add-k, it doesn't treat all unseen bigrams
equally, making it more accurate for real-world language modeling.
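For concreteness, here is a sketch of a standard interpolated Kneser-Ney bigram estimator with discount d = 0.75; the exact numbers differ from the rough ≈0.1 value quoted above, but the ordering (higher probability for words that appear in more diverse contexts) is the same.

# Illustrative interpolated Kneser-Ney bigram estimator (discount d = 0.75).
from collections import Counter

tokens = "the cat jumps the dog jumps".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
d = 0.75

def kneser_ney(w, prev):
    # Discounted bigram term.
    discounted = max(bi[(prev, w)] - d, 0) / uni[prev]
    # Back-off weight: probability mass removed from bigrams starting with `prev`.
    follow_types = len({b for b in bi if b[0] == prev})
    lam = d * follow_types / uni[prev]
    # Continuation probability: share of distinct bigram types that end in `w`.
    cont = len({b for b in bi if b[1] == w}) / len(bi)
    return discounted + lam * cont

print(round(kneser_ney("jumps", "cat"), 3))  # higher: "jumps" follows 2 distinct words
print(round(kneser_ney("dog", "cat"), 3))    # lower: "dog" follows only 1 distinct word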
Summary of Differences

• Add-One (Laplace): Adds 1 to all counts, simple but overestimates probabilities


for unseen n-grams. Example: P(dog | cat) = 0.2 (high for an unseen bigram).
• Add-k: Adds a smaller k (e.g., 0.01), more conservative for unseen n-grams.
Example: P(dog | cat) ≈ 0.0096 (more realistic).
• Kneser-Ney: Uses context diversity and discounts observed counts, giving more
probability to words that appear in varied contexts. Example: P(dog | cat) ≈ 0.1,
but prioritizes words like "jumps" with more contexts.
• Why Kneser-Ney is Preferred:
It’s more sophisticated and better reflects real language patterns by considering
how words are used across different contexts, making it the go-to choice for
modern NLP tasks like machine translation or speech recognition.
Advanced Smoothing
Techniques
• Backoff Models:
• If an N-gram is unseen, "back off" to a lower-order N-gram (e.g., trigram →
bigram → unigram).
• Example: If P(wₖ|wₖ₋₂, wₖ₋₁) is zero, use P(wₖ|wₖ₋₁).
• Interpolation:
• Combine probabilities of different N-gram orders.
• Formula: P_interp(wₖ | wₖ₋₂, wₖ₋₁) = λ₁P(wₖ | wₖ₋₂, wₖ₋₁) + λ₂P(wₖ | wₖ₋₁) + λ₃P(wₖ),
where λ₁ + λ₂ + λ₃ = 1.
• Benefits:
• Robust to sparse data.
• Balances context-specific and general probabilities.
Backoff Models
• Backoff models address unseen n-grams by "backing off" to lower-order n-grams when the higher-
order n-gram has a zero count in the training data.
• For example, if a trigram (three-word sequence) isn’t observed, the model falls back to a bigram
(two-word sequence), and if the bigram is also unseen, it falls back to a unigram (single word).
• To prevent overestimating probabilities, backoff models often apply a discount to higher-order n-
gram counts and use a scaling factor (called a backoff weight) to ensure probabilities sum to 1.
• This approach is simpler than some other smoothing methods but can be less flexible since it only
uses the highest available n-gram order.
Key Idea:
• Check if the n-gram exists in the corpus.
• If its count is zero, drop to a lower-order n-gram (e.g., trigram → bigram → unigram).
• Adjust probabilities with a discount or backoff weight to maintain valid probabilities.
Example

• Consider a small corpus:


• "The cat jumps."
• "The dog jumps."
Vocabulary: {The, cat, dog, jumps} (V = 4).
We want to calculate the trigram probability P(jumps | The cat), i.e., the chance
"jumps" follows "The cat."
• Without smoothing:
• Count("The cat jumps") = 1 (appears once).
• Count("The cat") = 1 (appears once).
• P(jumps | The cat) = 1 / 1 = 1.0.
• Now, for an unseen trigram like P(dog | The cat):
• Count("The cat dog") = 0.
• P(dog | The cat) = 0 / 1 = 0 (problematic, as it suggests "The cat dog" is impossible).
Example

• With Backoff:
• Check the trigram "The cat dog." Since Count("The cat dog") = 0, back off to the bigram P(dog | cat).
• Check the bigram "cat dog." Count("cat dog") = 0, so back off to the unigram P(dog).
• Unigram P(dog) = Count("dog") / Total words = 1 / 6 (1 occurrence of "dog" in 6 total words: The, cat,
jumps, The, dog, jumps).
• P(dog) = 1 / 6 ≈ 0.167.
• To ensure probabilities sum to 1, a backoff weight (say, α = 0.4) is applied to scale the lower-order
probability. So, P(dog | The cat) ≈ 0.4 × 0.167 ≈ 0.067.
• For P(jumps | The cat), since the trigram exists (Count("The cat jumps") = 1), use the discounted
trigram probability (e.g., discount d = 0.5):
• Adjusted count = 1 - 0.5 = 0.5.
• P(jumps | The cat) = 0.5 / 1 = 0.5.
• Backoff assigns a non-zero probability (≈0.067) to the unseen trigram "The cat dog" by using
the unigram P(dog). It only relies on the highest-order n-gram available, which can
sometimes ignore useful information from other n-gram orders.
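A simplified back-off sketch using the fixed weight α = 0.4 assumed above; for brevity it does not discount seen trigrams (a full Katz back-off model would), so the seen-trigram value comes out as 1.0 rather than the discounted 0.5.

# Simplified back-off sketch with a fixed weight alpha = 0.4, as assumed above.
from collections import Counter

tokens = "the cat jumps the dog jumps".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
alpha, N = 0.4, len(tokens)

def backoff(w, w1, w2):
    if tri[(w1, w2, w)] > 0:
        return tri[(w1, w2, w)] / bi[(w1, w2)]   # use the trigram if seen
    if bi[(w2, w)] > 0:
        return alpha * bi[(w2, w)] / uni[w2]     # else back off to the bigram
    return alpha * uni[w] / N                    # else back off to the unigram

print(round(backoff("jumps", "the", "cat"), 3))  # 1.0 (seen trigram, undiscounted)
print(round(backoff("dog", "the", "cat"), 3))    # 0.4 * (1/6) ≈ 0.067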
Interpolation

• Interpolation smoothing combines probabilities from all n-gram orders (e.g., trigram, bigram,
unigram) by assigning each a weight (λ) that sums to 1.
• Instead of choosing one n-gram order like backoff, interpolation mixes them, ensuring that
even unseen higher-order n-grams get some probability from lower-order models.
• This makes it more robust, as it considers all available context, weighted by how reliable each
n-gram order is (higher-order n-grams get more weight if they’re observed, but lower-order n-
grams contribute for unseen cases).
Formula:
P_interp(wₖ | wₖ₋₂, wₖ₋₁) = λ₁P(wₖ | wₖ₋₂, wₖ₋₁) + λ₂P(wₖ | wₖ₋₁) + λ₃P(wₖ), where λ₁ + λ₂ + λ₃ = 1
• P(wₖ | wₖ₋₂, wₖ₋₁): Trigram probability.
• P(wₖ | wₖ₋₁): Bigram probability.
• P(wₖ): Unigram probability.
• λ₁, λ₂, λ₃: Weights (e.g., 0.5, 0.3, 0.2) tuned to prioritize higher-order n-grams when they exist.
Example

• Using the same corpus: "The cat jumps." and "The dog jumps."
Vocabulary: {The, cat, dog, jumps} (V = 4).
We want P(dog | The cat) for the unseen trigram "The cat dog." Assume weights
λ₁ = 0.5 (trigram), λ₂ = 0.3 (bigram), λ₃ = 0.2 (unigram).
• Step 1: Calculate probabilities for each n-gram order:
• Trigram: P(dog | The cat) = Count("The cat dog") / Count("The cat") = 0 / 1 = 0.
• Bigram: P(dog | cat) = Count("cat dog") / Count("cat") = 0 / 1 = 0.
• Unigram: P(dog) = Count("dog") / Total words = 1 / 6 ≈ 0.167.
• Step 2: Apply interpolation:
• P(dog | The cat) = λ₁ × P_trigram + λ₂ × P_bigram + λ₃ × P_unigram
• = 0.5 × 0 + 0.3 × 0 + 0.2 × 0.167
• = 0 + 0 + 0.0334 ≈ 0.0334.
Example

• For a seen trigram, P(jumps | The cat):


• Trigram: P(jumps | The cat) = 1 / 1 = 1.0 (or discounted, e.g., 0.5 with d = 0.5).
• Bigram: P(jumps | cat) = 1 / 1 = 1.0 (or discounted, e.g., 0.5).
• Unigram: P(jumps) = 2 / 6 ≈ 0.333.
• P(jumps | The cat) = 0.5 × 0.5 + 0.3 × 0.5 + 0.2 × 0.333
• = 0.25 + 0.15 + 0.0666 ≈ 0.4666.
• Interpolation gives a non-zero probability (0.0334) to "The cat dog" by
incorporating the unigram P(dog), even though the trigram and bigram are
unseen.
• For seen n-grams like "The cat jumps," it combines all orders, giving more
weight to the trigram, resulting in a higher probability (0.4666). This makes
interpolation smoother and more flexible than backoff.
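A sketch of the interpolation formula with the weights used above; it uses undiscounted MLE estimates, so the seen-trigram value is higher than the discounted 0.4666 computed above, while the unseen case matches.

# Illustrative linear interpolation of trigram, bigram, and unigram MLE estimates.
from collections import Counter

tokens = "the cat jumps the dog jumps".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)
l1, l2, l3 = 0.5, 0.3, 0.2

def interp(w, w1, w2):
    p3 = tri[(w1, w2, w)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w] / N
    return l1 * p3 + l2 * p2 + l3 * p1

print(round(interp("dog", "the", "cat"), 4))    # 0.2 * (1/6) ≈ 0.0333
print(round(interp("jumps", "the", "cat"), 4))  # 0.5*1 + 0.3*1 + 0.2*(2/6) ≈ 0.8667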
Summary of Differences

• Backoff Models: Use the highest-order n-gram available, falling back to lower orders if the count is
zero. Example: P(dog | The cat) ≈ 0.067 (from unigram only).
• Interpolation: Combines all n-gram orders with weights, ensuring contributions from trigram,
bigram, and unigram. Example: P(dog | The cat) ≈ 0.0334 (from unigram, weighted).
Comparison to Other Smoothing Techniques:
• Unlike Add-One or Add-k, which add a constant to all counts, backoff and interpolation leverage
lower-order n-grams, making them more context-aware.
• Compared to Kneser-Ney, interpolation is simpler but less sophisticated, as Kneser-Ney uses context
diversity for better probability distribution. Backoff is closer to Kneser-Ney in spirit but only uses one
n-gram order at a time, while Kneser-Ney often interpolates with continuation probabilities.
• Interpolation is widely used because it’s robust and balances all n-gram orders, while backoff is
simpler but may miss some contextual information.
• Why Interpolation is Preferred:
Interpolation is often favored in practice (sometimes combined with Kneser-Ney) because it uses all
available n-gram information, making it more reliable for unseen data while still valuing observed
higher-order n-grams.
Evaluating Language Models

• Evaluation Metrics:
• Perplexity:
• Measures how well a model predicts a test set.
• Formula: PP = 2^(-1/N ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁)).
• Lower perplexity = better model.
• Example: A model with perplexity 100 is better than one with 200.
• Log-Likelihood:
• Sum of log probabilities of the test set.
• Higher log-likelihood = better fit.
• Extrinsic Evaluation:
• Measure performance in downstream tasks (e.g., accuracy in speech recognition).
• Challenges:
• Perplexity doesn’t always correlate with task performance.
• Requires a representative test set.
Evaluating Language Models

• In NLP, evaluation metrics like Perplexity and Log-Likelihood are used


to assess how well a language model predicts or fits a test dataset.
• These metrics help quantify the model’s ability to assign probabilities
to word sequences, which is crucial for tasks like text generation,
machine translation, or speech recognition.
Perplexity

• Perplexity measures how well a language model predicts a test set by evaluating how "surprised" the
model is by the data.
• It’s based on the average probability the model assigns to each word in the test set, given its context
(previous words).
• A lower perplexity indicates the model is better at predicting the next word, meaning it assigns higher
probabilities to the actual words in the test set.
• Perplexity is often interpreted as the effective number of choices the model considers for each word
—lower values mean the model is more confident and accurate.
Formula:
PP = 2^(-1/N ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁))
• N: Number of words in the test set.
• P(wᵢ | w₁, ..., wᵢ₋₁): Probability of word wᵢ given its context (from the model).
• ∑ log₂ P: Sum of log base-2 probabilities of all words in the test set.
• Lower perplexity means the model is less "perplexed" by the data.
Example

• Consider a tiny test set: "The cat jumps." (3 words, N = 3).


We have two language models: Model A (uses Kneser-Ney smoothing) and Model B (uses
Add-One smoothing). We calculate perplexity for both models on this test set using bigram
probabilities.
• Model A (Kneser-Ney):
Suppose the model assigns these probabilities (trained on a corpus like "The cat jumps. The
dog jumps."):
• P(The | <start>) = 0.8 (probability of "The" starting a sentence).</start>
• P(cat | The) = 0.5 (probability of "cat" following "The").
• P(jumps | cat) = 0.6 (probability of "jumps" following "cat").
• Log probabilities (base-2):
• log₂(0.8) ≈ -0.322, log₂(0.5) = -1, log₂(0.6) ≈ -0.737.
• Sum = -0.322 - 1 - 0.737 = -2.059.
• Average log probability = -2.059 / 3 ≈ -0.686.
• Perplexity = 2^(-(-0.686)) = 2^0.686 ≈ 1.61.
Example

• Model B (Add-One):
Add-One smoothing assigns more probability to unseen bigrams, so probabilities are lower for
seen bigrams: P(The | <start>) = 0.6, P(cat | The) = 0.3, P(jumps | cat) = 0.4 (due to adding 1
to all counts and V = 4 to denominators).
• Log probabilities:
• log₂(0.6) ≈ -0.737, log₂(0.3) ≈ -1.737, log₂(0.4) ≈ -1.322.
• Sum = -0.737 - 1.737 - 1.322 = -3.796.
• Average log probability = -3.796 / 3 ≈ -1.265.
• Perplexity = 2^(-(-1.265)) = 2^1.265 ≈ 2.40.
• Model A has a perplexity of 1.61, meaning it’s less "surprised" by the test set than Model B
(perplexity 2.40).
• Lower perplexity (1.61 vs. 2.40) indicates Model A (Kneser-Ney) is better at predicting "The cat
jumps." This makes sense, as Kneser-Ney is more sophisticated than Add-One, assigning more
accurate probabilities to seen bigrams.
Log-Likelihood

• Log-Likelihood measures how well a language model fits a test set by summing the log
probabilities of all words in the test set, given their context.
• A higher log-likelihood means the model assigns higher probabilities to the actual words
in the test set, indicating a better fit.
• Unlike perplexity, which is an exponentiated average, log-likelihood is a direct sum of log
probabilities, so it’s not normalized by the number of words. It’s often used to compare
models on the same dataset, where a higher value indicates better performance.
Formula:
Log-Likelihood = ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁)
• P(wᵢ | w₁, ..., wᵢ₋₁): Probability of each word given its context.
• ∑: Sum over all words in the test set.
• Higher values indicate a better model.
Example
• Using the same test set: "The cat jumps." (3 words).
We compute log-likelihood for Model A (Kneser-Ney) and Model B (Add-One) using the same
bigram probabilities as above.
• Model A (Kneser-Ney):
• Probabilities: P(The | <start>) = 0.8, P(cat | The) = 0.5, P(jumps | cat) = 0.6.</start>
• Log probabilities:
• log₂(0.8) ≈ -0.322, log₂(0.5) = -1, log₂(0.6) ≈ -0.737.
• Log-Likelihood = -0.322 - 1 - 0.737 = -2.059.
• Model B (Add-One):
• Probabilities: P(The | <start>) = 0.6, P(cat | The) = 0.3, P(jumps | cat) = 0.4.</start>
• Log probabilities:
• log₂(0.6) ≈ -0.737, log₂(0.3) ≈ -1.737, log₂(0.4) ≈ -1.322.
• Log-Likelihood = -0.737 - 1.737 - 1.322 = -3.796.
• Model A has a log-likelihood of -2.059, which is higher (less negative) than Model B’s -3.796. This
means Model A assigns higher probabilities to the test set, indicating a better fit. The higher log-
likelihood aligns with Model A’s lower perplexity, confirming Kneser-Ney outperforms Add-One.
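Both metrics can be computed directly from the per-word probabilities assumed above for Models A and B; the sketch below reproduces those numbers.

# Illustrative computation of log-likelihood and perplexity from the
# per-word probabilities assumed above for Model A and Model B.
import math

model_a = [0.8, 0.5, 0.6]   # P(The|<start>), P(cat|The), P(jumps|cat)
model_b = [0.6, 0.3, 0.4]

def log_likelihood(probs):
    return sum(math.log2(p) for p in probs)

def perplexity(probs):
    return 2 ** (-log_likelihood(probs) / len(probs))

print(round(log_likelihood(model_a), 3), round(perplexity(model_a), 2))  # -2.059, 1.61
print(round(log_likelihood(model_b), 3), round(perplexity(model_b), 2))  # -3.796, 2.40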
Summary of Differences

• Perplexity: Measures prediction quality via an exponentiated average of log


probabilities. Lower values are better. Example: Model A (1.61) is better than
Model B (2.40) for "The cat jumps."
• Log-Likelihood: Sums log probabilities to measure model fit. Higher (less
negative) values are better. Example: Model A (-2.059) is better than Model B (-
3.796).
Connection to Smoothing Techniques:
• Smoothing techniques like Kneser-Ney, Add-One, Add-k, Backoff, and
Interpolation directly affect these metrics by adjusting probabilities for unseen n-
grams.
• For instance, Kneser-Ney’s context-aware smoothing (as in Model A) assigns
higher probabilities to likely words, reducing perplexity and increasing log-
likelihood compared to Add-One (Model B), which overestimates unseen n-
grams.
• Perplexity is more commonly reported because it’s normalized by the
number of words, making it easier to compare across datasets of
different sizes. Log-Likelihood is useful for direct model comparison
on the same dataset but is sensitive to test set size.
• Why These Metrics Matter:
Both metrics help evaluate how well a language model generalizes to
new data. Lower perplexity and higher log-likelihood indicate a model
that better captures the patterns in the test set, which is critical for
applications like text prediction or machine translation. Kneser-Ney, as
shown, often outperforms simpler methods like Add-One due to its
smarter probability distribution, reflected in better metric scores.
Example: Perplexity Calculation

• Corpus: "The cat is on the mat."


• Test Sentence: "The cat is on."
• Bigram Model (simplified):
• Assume P(The) = 0.2, P(cat|The) = 0.5, P(is|cat) = 0.4, P(on|is) = 0.3.
• Log-likelihood = log₂(0.2) + log₂(0.5) + log₂(0.4) + log₂(0.3).
• Perplexity = 2^(-1/4 × Log-likelihood).
• Interpretation:
• Lower perplexity indicates the model is less "surprised" by the test data.
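Completing the arithmetic with the four assumed probabilities gives a concrete value (a sketch, not from the slides):

# Completing the arithmetic for the four assumed probabilities above.
import math

probs = [0.2, 0.5, 0.4, 0.3]
ll = sum(math.log2(p) for p in probs)   # ≈ -6.381
pp = 2 ** (-ll / len(probs))            # ≈ 3.02
print(round(ll, 3), round(pp, 2))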
Challenges in Language
Modelling
• Data Sparsity:
• Large vocabularies lead to unseen N-grams.
• Solution: Smoothing and backoff.
• Context Limitation:
• N-grams capture only short-term dependencies (e.g., 2-3 words).
• Solution: Neural models (e.g., Transformers).
• Computational Complexity:
• Higher N-grams require more memory and computation.
• Solution: Efficient data structures (e.g., tries).
• Domain Adaptation:
• Models trained on one domain (e.g., news) may perform poorly on another (e.g., social media).
• Solution: Domain-specific training or transfer learning.
N-grams vs. Neural Models

• N-grams:
• Pros: Simple, interpretable, computationally efficient.
• Cons: Limited context, data sparsity.
• Neural Models:
• Pros: Capture long-range dependencies, handle large contexts.
• Cons: Computationally expensive, less interpretable.
• Example:
• N-gram: P("rain" | "It is") = fixed probability.
• Neural (e.g., GPT): Considers entire sentence context, more accurate but
slower.
Practical Example: Building an
N-gram Model
• Steps:
• Collect a corpus (e.g., news articles).
• Tokenize into words.
• Count N-grams (e.g., bigrams, trigrams).
• Estimate probabilities using MLE.
• Apply smoothing (e.g., Kneser-Ney).
• Evaluate on a test set using perplexity.
• Tools:
• Python libraries: NLTK, spaCy, scikit-learn.
• Example code (simplified):
Practical Example: Building an
N-gram Model
(contd..)
from nltk import bigrams, FreqDist

corpus = ["The cat is on the mat", "The cat sleeps"]
tokens = [word for sent in corpus for word in sent.split()]  # simple whitespace tokenization
bigram_counts = FreqDist(bigrams(tokens))
# MLE estimate: P("cat" | "The") = Count("The cat") / Count(bigrams starting with "The")
P_cat_the = bigram_counts[("The", "cat")] / sum(1 for w1, w2 in bigrams(tokens) if w1 == "The")
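As a follow-up, a hedged sketch using NLTK's nltk.lm module (available in recent NLTK releases) to fit a Laplace-smoothed bigram model on the same corpus:

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

sents = [["The", "cat", "is", "on", "the", "mat"], ["The", "cat", "sleeps"]]
train, vocab = padded_everygram_pipeline(2, sents)   # bigram training data + vocabulary
model = Laplace(2)
model.fit(train, vocab)
print(model.score("cat", ["The"]))                   # smoothed P(cat | The)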
Probability and Smoothing

•Why Smoothing?
•Unseen N-grams result in zero probabilities, breaking the model.
•Smoothing redistributes probability mass to unseen events.
•Laplace Smoothing Example:
•Corpus: "The cat is on the mat."
•Vocabulary size (V) = 6 (unique words: The, cat, is, on, mat, .).
•Unseen bigram: "cat jumps."
•Without smoothing: P("jumps" | "cat") = 0.
•With Laplace: P("jumps" | "cat") = (0 + 1) / (Count("cat") + V) = 1 / (1 + 6) ≈ 0.143.
•Impact:
•Ensures all possible N-grams have non-zero probability.
•Improves robustness for language modeling.
Evaluating Models with
Probability
Perplexity:
•Measures how well a probability model predicts a test set.
•Formula: PP = 2^(-1/N ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁)).
•Relies on conditional probabilities from the model.
•Example:
•Test sentence: "The cat is."
•Bigram model probabilities: P("The") = 0.2, P("cat" | "The") = 0.5, P("is" | "cat") = 0.4.
•Log-likelihood = log₂(0.2) + log₂(0.5) + log₂(0.4) ≈ -2.322 - 1 - 1.322 = -4.644.
•Perplexity = 2^(-1/3 × -4.644) = 2^(1.548) ≈ 2.92.
•Interpretation: Lower perplexity indicates better predictive power.
Practical Example: Probability in
Action
Task: Build a bigram model to predict the next word.
•Corpus: "The cat is on the mat. The cat sleeps."
•Steps:
[Link]: ["The", "cat", "is", "on", "the", "mat", ".", "The", "cat", "sleeps", "."].
[Link] bigrams: e.g., ("The", "cat") = 2, ("cat", "is") = 1.
[Link] probabilities:
•P("cat" | "The") = 2/2 = MOE = 1.0.
•P("is" | "cat") = 1/2 = 0.5.
[Link] Laplace smoothing for unseen bigrams.
[Link] next word after "The": Likely "cat" (P = 1.0).
•Evaluation:
•Compute perplexity on a test set to assess model quality.
Challenges in Applying
Probability

•Sparsity:
•Many N-grams have zero counts in small corpora.
•Solution: Smoothing or larger corpora.
•Context Limitation:
•N-grams rely on short contexts, limiting probability accuracy.
•Example: P("rain" | "It is") vs. neural models considering full context.
•Overfitting:
•MLE may overfit to training data, giving unreliable probabilities.
•Solution: Smoothing or regularization techniques.
•Computational Cost:
•Calculating probabilities for large N or vocabularies is expensive.
•Solution: Efficient data structures or neural models.
Probability Theory and Modern
NLP

•Beyond N-grams:
•Neural language models (e.g., Transformers) use probability distributions over entire contexts.
•Example: BERT predicts masked words using contextual probabilities.
•Softmax Layer:
•Neural models output a probability distribution over the vocabulary for the next word.
•Example: P("cat" | "The") is one of many probabilities in a softmax output.
•Advantages:
•Neural models assign probabilities based on deeper linguistic patterns.
•Example: GPT predicts coherent sentences by modeling P(wₖ | w₁, ..., w ₖ₋₁) over long contexts.
Probability in Language
Modelling

•Key Points:
•Probability theory underpins statistical language models.
•N-grams use conditional probabilities for prediction but face sparsity issues.
•Smoothing adjusts probabilities for unseen events.
•Evaluation metrics like perplexity rely on probability estimates.
•Example Recap:
•Bigram model: P("cat" | "The") = 1.0 in a small corpus, adjusted via smoothing.
•Perplexity measures how well probabilities match test data.
•Future Directions:
•Combine probability-based N-grams with neural models for hybrid systems.
•Explore Bayesian approaches for uncertainty-aware language models.
N-gram Model
• For n = 3 (trigrams): P(wₙ | w₁, ..., wₙ₋₁) ≈ P(wₙ | wₙ₋₂, wₙ₋₁)
• To create the model, use a training text and record the pairs and triples of words that appear in the text, and how many times:
P(wᵢ | wᵢ₋₂, wᵢ₋₁) = C(wᵢ₋₂, wᵢ₋₁, wᵢ) / C(wᵢ₋₂, wᵢ₋₁)
• Example: P(submarine | the, yellow) = C(the, yellow, submarine) / C(the, yellow)
• Relative frequency: the observed frequency of a particular sequence divided by the observed frequency of its prefix.
Noisy Channel Model

• Message W → encoder → input X → channel p(y|x) → output Y → decoder → W* (an attempt to reconstruct the message from the channel output).
• In language processing, the problem reduces to decoding: finding the most likely input given the observed output.
• Real language X (correct text) → noisy channel X → Y (introduces errors) → observed language Y (text with errors). We want to retrieve X from Y.
• Real language X (correct text) → noisy channel X → Y (space removal) → observed language Y (text without spaces). We want to retrieve X from Y.
• Real language X (text) → noisy channel X → Y (text-to-speech generator) → observed language Y (speech). We want to retrieve X from Y.
• Real language X (source language) → noisy channel X → Y (translation) → observed language Y (target language). We want to retrieve X from Y.
Example: ASR (Automatic Speech Recognizer)

• Acoustic chain X₁ ... X_T → ASR → word chain w₁ ... w_N, using a language model and an acoustic model.
Example: Machine Translation

• Combines a target-language model and a translation model.

Markov Models of Order n − 1
• An n-gram model conditions on the previous n − 1 words: P(wᵢ | hᵢ) = P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
• 0-gram
• 1-gram: P(wᵢ | hᵢ) = P(wᵢ)
• 2-gram: P(wᵢ | hᵢ) = P(wᵢ | wᵢ₋₁)
• 3-gram: P(wᵢ | hᵢ) = P(wᵢ | wᵢ₋₂, wᵢ₋₁)
• MLE estimates:
• 1-gram model: P_MLE(wᵢ) = C(wᵢ) / N, where N is the total number of tokens
• 2-gram model: P_MLE(wᵢ | wᵢ₋₁) = C(wᵢ₋₁ wᵢ) / C(wᵢ₋₁)
• 3-gram model: P_MLE(wᵢ | wᵢ₋₂, wᵢ₋₁) = C(wᵢ₋₂ wᵢ₋₁ wᵢ) / C(wᵢ₋₂ wᵢ₋₁)
• (Figure: true probability distribution.)
• The seen cases are overestimated; the unseen ones have a null probability.
• Smoothing: save a part of the probability mass from the seen cases and assign it to the unseen ones.
N-gram Probabilities: Predicting the Next Word
Evaluating language models
• Intuitively, a trigram model captures more context than
a bigram model, so it should be a "better" model.
• That is, it should more accurately predict the
probabilities of sentences
• But how can we measure this?
