Natural Language Processing - Language Modelling
The role of language models. N-gram models. Estimating parameters
and smoothing. Evaluating language models.
Introduction to Language Models
• Definition: A language model assigns probabilities to sequences of words, which lets it predict the next word from the preceding context.
• Key Roles:
• Text Generation: Generate coherent text (e.g., chatbots, story generation).
• Speech Recognition: Disambiguate similar-sounding words (e.g., "recognize speech"
vs. "wreck a nice beach").
• Machine Translation: Ensure grammatically correct translations.
• Spell Correction: Suggest correct words based on context (e.g., "I have a kat" → "cat").
• Sentiment Analysis: Understand contextual meaning for better classification.
• Applications:
• Virtual assistants (e.g., Siri, Alexa).
• Autocomplete in search engines.
• Predictive keyboards on smartphones.
Types of Language Models
Marginal Probability
• Definition: The probability of a single event occurring, regardless of other events.
• Formula: P(wᵢ) = Count(wᵢ) / Total words.
• Example Event: Probability of the word "cat" appearing in the corpus.
• Calculation:
• Count("cat") = 2 (appears twice in the corpus).
• Total words = 15.
• P("cat") = 2 / 15 ≈ 0.1333.
• Other Marginal Probabilities:
• P("The") = Count("The") / Total = 3 / 15 = 0.2.
• P("sleeps") = Count("sleeps") / Total = 1 / 15 ≈ 0.0667.
Probability Calculations
Joint Probability
• Definition: The probability of multiple events occurring together.
• Formula (for bigrams): P(wᵢ, wᵢ₊₁) = Count(wᵢ, wᵢ₊₁) / Total bigrams.
• Example Event: Probability of the bigram "The cat" in the sentence.
• Calculation:
• Total bigrams = Total words - 1 = 15 - 1 = 14.
• Count("The cat") = 2 (appears in "The cat is" and "The cat sleeps").
• P("The", "cat") = 2 / 14 ≈ 0.1429.
• Another Example:
• Bigram "cat sleeps":
• Count("cat sleeps") = 1.
• P("cat", "sleeps") = 1 / 14 ≈ 0.0714.
Probability Calculations
Conditional Probability
• Definition: The probability of an event given another event has occurred.
• Formula: P(wᵢ₊₁ | wᵢ) = P(wᵢ, wᵢ₊₁) / P(wᵢ) = Count(wᵢ, wᵢ₊₁) / Count(wᵢ).
• Example Event: Probability of "cat" given "The" (i.e., P("cat" | "The")).
• Calculation:
• Count("The cat") = 2.
• Count("The") = 3.
• P("cat" | "The") = 2 / 3 ≈ 0.6667.
• Another Example:
• P("sleeps" | "cat"):
• Count("cat sleeps") = 1.
• Count("cat") = 2.
• P("sleeps" | "cat") = 1 / 2 = 0.5.
Summary of Calculated
Probabilities
Marginal Probabilities:
•P("The") = 0.2
•P("cat") = 0.1333
•P("sleeps") = 0.0667
Joint Probabilities:
•P("The", "cat") = 0.1429
•P("cat", "sleeps") = 0.0714
Conditional Probabilities:
•P("cat" | "The") = 0.6667
•P("sleeps" | "cat") = 0.5
Interpreting the Probabilities
• Marginal Probability:
• Shows how common individual words are in the corpus.
• "The" is most frequent (P = 0.2), "sleeps" is least (P = 0.0667).
• Joint Probability:
• Indicates likelihood of word pairs.
• "The cat" is more likely (P = 0.1429) than "cat sleeps" (P = 0.0714).
• Conditional Probability:
• Critical for N-gram models, predicts next word given context.
• High P("cat" | "The") = 0.6667 suggests "cat" often follows "The."
• Relevance to NLP:
• These probabilities drive language models to predict word sequences, as in
autocomplete or speech recognition.
Bayes’ Theorem in NLP
Overview
• Bayes’ Theorem is a fundamental concept in probability theory that relates conditional probabilities,
enabling us to update probabilities based on new evidence.
• In NLP, it’s widely used in tasks like text classification (e.g., spam detection), spell correction, and word
sense disambiguation, where we need to infer the most likely outcome given observed data.
Bayes’ Theorem Formula
P(A | B) = P(B | A) · P(A) / P(B)
• P(A|B): Posterior probability – the probability of event A given evidence B.
• P(B|A): Likelihood – the probability of observing B given A.
• P(A): Prior probability – the initial probability of A before observing B.
• P(B): Evidence – the probability of B, often computed as a normalizing constant.
• In NLP, these terms are interpreted as:
• A: A hypothesis, e.g., a word, class label (spam/not spam), or part-of-speech tag.
• B: Observed data, e.g., a word sequence, email text, or context.
Relevance to NLP
Scenario: A user types "kat" in a sentence, and we want to determine whether the intended
word was "cat" or "hat" using Bayes’ Theorem.
Corpus:
• A small text corpus: "The cat is on the mat. The cat sleeps. The hat is red."
• Tokenized: ["The", "cat", "is", "on", "the", "mat", ".", "The", "cat", "sleeps", ".", "The", "hat",
"is", "red", "."]
• Total words: 16
• Vocabulary: {The, cat, is, on, the, mat, ., sleeps, hat, red}
Step-by-Step Calculation:
• Goal: Compute P("cat"|"kat") and P("hat"|"kat") to determine the most likely correction.
Bayes’ Theorem:
• P(word∣"kat")=P("kat"∣word)⋅P(word)/P("kat")
Example 1 – Spell Correction
• Scenario: User types "kat." Was the intended word "cat" or "hat"?
Corpus: "The cat is on the mat. The cat sleeps. The hat is red."
• Calculations:
• Prior:
• P("cat") = 2 / 16 = 0.125
• P("hat") = 1 / 16 = 0.0625
• Likelihood:
• P("kat"|"cat") = 0.1 (typo probability).
• P("kat"|"hat") = 0.08.
• Posterior (proportional):
• P("cat"|"kat") ∝ 0.1 × 0.125 = 0.0125
• P("hat"|"kat") ∝ 0.08 × 0.0625 = 0.005
• Conclusion: "cat" is more likely (0.0125 > 0.005).
Example 1: Spell Correction
• Evidence (P("kat")):
• Since we’re comparing probabilities, we can use the proportional form:
P(word∣"kat")∝P("kat"∣word)⋅P(word)
• We don’t need to compute P("kat") explicitly for ranking.
• Posterior Probabilities:
• For "cat": P("cat"∣"kat")∝P("kat"∣"cat")⋅P("cat")=0.1*0.125=0.0125
• For "hat": P("hat"∣"kat")∝P("kat"∣"hat")⋅P("hat")=0.08*0.0625=0.005
• Conclusion:
• P("cat"|"kat") > P("hat"|"kat") (0.0125 > 0.005).
• The model suggests "cat" is more likely than "hat" as the intended word.
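A minimal sketch of this ranking: the priors come from the 16-token corpus above, and the typo-model likelihoods P("kat" | word) are the assumed values from the slide, not estimated from data.

priors = {"cat": 2 / 16, "hat": 1 / 16}       # P(word) from the corpus
likelihoods = {"cat": 0.10, "hat": 0.08}      # assumed typo model P("kat" | word)

# Proportional posterior: P(word | "kat") ∝ P("kat" | word) · P(word)
scores = {w: likelihoods[w] * priors[w] for w in priors}
print(scores)                          # {'cat': 0.0125, 'hat': 0.005}
print(max(scores, key=scores.get))     # 'cat' is the suggested correction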
Example 2 – Spam Detection
• Scenario: Classify an email with the word "win" as spam or not spam.
• Corpus:
• Spam: "Win a prize now", "Win big money"
• Not spam: "Meet me now", "Work is done"
• Calculations:
• Prior:
• P(spam) = 0.5, P(not spam) = 0.5
• Likelihood (Laplace smoothing applied where the raw count is zero):
• P("win"|spam) = 2 / 6 ≈ 0.3333
• P("win"|not spam) = (0 + 1) / (6 + 11) = 1 / 17 ≈ 0.0588
• Evidence:
• P("win") = (0.3333 × 0.5) + (0.0588 × 0.5) ≈ 0.1961
• Posterior:
• P(spam|"win") = (0.3333 × 0.5) / 0.1961 ≈ 0.8501
• P(not spam|"win") = (0.0588 × 0.5) / 0.1961 ≈ 0.1499
• Conclusion: Email is likely spam (0.8501 > 0.1499).
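A minimal sketch reproducing the numbers above; the likelihood values are taken as given on the slide rather than re-derived from counts.

p_spam, p_not_spam = 0.5, 0.5
p_win_given_spam = 2 / 6          # ≈ 0.3333
p_win_given_not_spam = 1 / 17     # ≈ 0.0588 (smoothed, since "win" is unseen in not spam)

# Evidence P("win") as the normalizing constant
evidence = p_win_given_spam * p_spam + p_win_given_not_spam * p_not_spam

posterior_spam = p_win_given_spam * p_spam / evidence
posterior_not_spam = p_win_given_not_spam * p_not_spam / evidence
print(round(posterior_spam, 4), round(posterior_not_spam, 4))   # ≈ 0.85, 0.15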
Example 2: Text Classification (Spam Detection)
Scenario: Classify an email as "spam" or "not spam" based on the word "win" appearing in it,
using a Naive Bayes approach.
• Corpus:
• Training data:
• Spam emails: "Win a prize now", "Win big money"
• Not spam emails: "Meet me now", "Work is done"
• Vocabulary: {win, a, prize, now, big, money, meet, me, work, is, done}
• Total words:
• Spam: 6 words
• Not spam: 6 words
• Total emails: 4 (2 spam, 2 not spam)
• Step-by-Step Calculation:
• Goal: Compute P(spam|"win") and P(not spam|"win").
Example 2: Text Classification (Spam Detection)
• Bayes’ Theorem:
•P(class∣"win")=P("win"∣class)⋅P(class)/P("win")
class: Either "spam" or "not spam."
•P("win"|class): Likelihood of "win" appearing in a class.
•P(class): Prior probability of the class.
•P("win"): Evidence (normalizing constant).
Prior Probabilities (P(class)):
•P(spam) = Number of spam emails / Total emails = 2 / 4 = 0.5
•P(not spam) = 2 / 4 = 0.5
•Likelihood (P("win"|class)):
•For spam:
•Count("win" in spam) = 2 (appears in both spam emails).
•Total words in spam = 6.
•P("win"|spam) = 2 / 6 = 0.3333
Example 2: Text Classification (Spam Detection)
• Challenges:
• Sparsity: Zero probabilities (e.g., P("win"|not spam) = 0 without smoothing).
• Solution: Apply Laplace or Kneser-Ney smoothing.
• Assumption of Independence: Naive Bayes assumes words are independent,
which isn’t true in language.
• Solution: Use N-grams or neural models for context.
• Estimating Priors and Likelihoods: Requires a representative corpus.
• Solution: Use large, domain-specific datasets.
• Example Impact:
• In spell correction, inaccurate P("kat"|word) estimates (typo probabilities) can
skew results.
• In spam detection, small corpora lead to unreliable P(word|class).
Independence and Chain Rule
Independence:
•Events A and B are independent if P(A ∩ B) = P(A) × P(B).
•Example: In a unigram model, words are assumed independent (simplification, not realistic).
•P("the cat") = P("the") × P("cat").
•Chain Rule:
•Expresses joint probability as a product of conditional probabilities.
•Formula: P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ... × P(wₙ|w₁,...,wₙ₋₁).
•Example: For "The cat is":
•P("The cat is") = P("The") × P("cat" | "The") × P("is" | "The cat").
•Corpus: "The cat is on the mat."
•P("The") = 2/6, P("cat" | "The") = 1/2, P("is" | "The cat") = 1/1.
•P("The cat is") = (2/6) × (1/2) × (1/1) = 0.167.
Probability in N-gram Models
•N-gram Approximation:
•To make computation feasible, assume the probability of a word depends only on the previous n-1
words.
•Bigram (n=2): P(wₖ | w₁, ..., wₖ₋₁) ≈ P(wₖ | wₖ₋₁).
•Trigram (n=3): P(wₖ | w₁, ..., wₖ₋₁) ≈ P(wₖ | wₖ₋₂, wₖ₋₁).
•Example:
•Corpus: "The cat is on the mat. The cat sleeps."
•Bigram model:
•P("is" | "cat") = Count("cat is") / Count("cat") = 1/2 = 0.5.
•Trigram model:
•P("is" | "The cat") = Count("The cat is") / Count("The cat") = 1/1 = 1.0.
•Why Approximate?
•Reduces computational complexity.
•Handles data sparsity by limiting context.
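A minimal sketch of the bigram and trigram approximations on the lower-cased corpus "the cat is on the mat . the cat sleeps ." (the trigram estimate is 1/2 because "The cat" occurs twice in this corpus).

from collections import Counter

tokens = "the cat is on the mat . the cat sleeps .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

# Bigram: P(is | cat) = Count("cat is") / Count("cat")
print(bigrams[("cat", "is")] / unigrams["cat"])                        # 1/2 = 0.5

# Trigram: P(is | the cat) = Count("the cat is") / Count("the cat")
print(trigrams[("the", "cat", "is")] / bigrams[("the", "cat")])        # 1/2 = 0.5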
Probability in N-gram Models
•Context: We'll use a text classification example (e.g., spam vs. non-spam emails) where trigrams and
4-grams are features for applying Bayes' Theorem.
•Dataset: A small, hypothetical set of text data (emails) labeled as spam or non-spam.
•Bayes' Theorem: P(A | B) = P(B | A) ⋅ P(A) / P(B), where:
•A: Class (e.g., Spam or Non-Spam).
•B: Feature (trigram or 4-gram).
•P(A | B): Posterior probability of the class given the feature.
•P(B∣A): Likelihood of feature given the class.
•P(A): Prior probability of the class.
•P(B): Evidence (normalizing constant).
Example Setup
Dataset:
• Spam Email 1: "Win cash now"
• Spam Email 2: "Free money prize now"
• Non-Spam Email 1: "Meeting tomorrow morning"
• Non-Spam Email 2: "Project update next week"
• Classes:
• Spam (S)
• Non-Spam (NS)
• Prior Probabilities:
• Total emails: 4 (2 Spam, 2 Non-Spam)
• P(S)=2/4=0.5
• P(NS)=2/4=0.5
Introduction to N-Grams and
Bayes' Theorem
• Understanding Trigrams, 4-Grams, and Bayes' Theorem
• N-Grams: Sequences of n words used as features in text analysis.
• Trigram: 3-word sequence (e.g., "win cash now").
• 4-Gram: 4-word sequence (e.g., "free money prize now").
• Bayes' Theorem: P(A | B) = P(B | A) ⋅ P(A) / P(B)
• Used to classify text (e.g., spam vs. non-spam) based on n-gram
probabilities.
• Example: Classify a new email using trigrams and 4-grams.
Trigram Example with Bayes' Theorem
•MLE (Maximum Likelihood Estimation) calculates probabilities by counting how often something appears in
data.
•In NLP, it helps classify text, like deciding if an email is Spam or Not Spam.
•Simple method: Divide the count of an event by the total count.
•Example: If "free" appears 5 times in 20 words, probability = 5 ÷ 20 = 0.25.
MLE for Text Classification
• Goal: Find out if a phrase suggests an email is Spam or Not Spam.
• Use MLE to estimate probabilities based on counts.
• Formula (Bayes): P(class | phrase) = P(phrase | class) × P(class) ÷ P(phrase).
• Task: Check if "free money prize now" indicates Spam or Not Spam.
Estimating Probabilities with
MLE
Maximum Likelihood Estimation (MLE):
•Estimates probabilities based on observed frequencies.
•Formula: P(wₖ | wₖ₋ₙ₊₁, ..., wₖ₋₁) = Count(wₖ₋ₙ₊₁, ..., wₖ) / Count(wₖ₋ₙ₊₁, ..., wₖ₋₁).
•Example:
•Corpus: "I like to eat. I like to drink."
•Bigram: P("to" | "like") = Count("like to") / Count("like") = 2/2 = 1.0.
•Issue: If "like to" never appears, P = 0 (sparsity problem).
•Solution: Smoothing (e.g., Laplace, Kneser-Ney).
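A minimal sketch of bigram MLE on the corpus "I like to eat. I like to drink.", including the zero-probability (sparsity) case mentioned above; the unseen word "sing" is a hypothetical example.

from collections import Counter

tokens = "i like to eat . i like to drink .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(w2, w1):
    # Bigram MLE: Count(w1 w2) / Count(w1); unseen bigrams get probability 0
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_mle("to", "like"))    # 2/2 = 1.0
print(p_mle("sing", "like"))  # 0.0 -- the sparsity problem that smoothing addresses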
Estimating Probabilities with MLE using Bayes’ Theorem
•Bayes’ theorem helps find the probability of a class (e.g., Spam) given text.
•Formula: Probability of class given text = (Likelihood × Prior) ÷ Evidence
Terms:
•Likelihood: How often text appears in class (from MLE).
•Prior: How common the class is (from MLE).
•Evidence: How common the text is overall.
•Task: Classify email with "free money prize now" as Spam or Not Spam.
Sample Data
• Total emails: 12.
• Spam emails: 6 (4 have "free money prize now").
• Not Spam emails: 6 (3 have "free money prize now").
• 4-gram: "free money prize now" (4-word sequence).
• Use MLE and Bayes’ theorem to calculate probabilities.
Smoothing with Backoff
• Corpus: "The cat jumps." and "The dog jumps." (6 words: The, cat, jumps, The, dog, jumps).
• With Backoff:
• Check the trigram "The cat dog." Since Count("The cat dog") = 0, back off to the bigram P(dog | cat).
• Check the bigram "cat dog." Count("cat dog") = 0, so back off to the unigram P(dog).
• Unigram P(dog) = Count("dog") / Total words = 1 / 6 (1 occurrence of "dog" in 6 total words: The, cat,
jumps, The, dog, jumps).
• P(dog) = 1 / 6 ≈ 0.167.
• To ensure probabilities sum to 1, a backoff weight (say, α = 0.4) is applied to scale the lower-order
probability. So, P(dog | The cat) ≈ 0.4 × 0.167 ≈ 0.067.
• For P(jumps | The cat), since the trigram exists (Count("The cat jumps") = 1), use the discounted
trigram probability (e.g., discount d = 0.5):
• Adjusted count = 1 - 0.5 = 0.5.
• P(jumps | The cat) = 0.5 / 1 = 0.5.
• Backoff assigns a non-zero probability (≈0.067) to the unseen trigram "The cat dog" by using
the unigram P(dog). It only relies on the highest-order n-gram available, which can
sometimes ignore useful information from other n-gram orders.
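A minimal sketch of the backoff scheme described above, using the slide's fixed backoff weight α = 0.4 and discount d = 0.5; in a full Katz backoff implementation α would be computed per context so that probabilities sum to 1.

from collections import Counter

tokens = "the cat jumps the dog jumps".split()   # "The cat jumps. The dog jumps."
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)
ALPHA, D = 0.4, 0.5

def p_backoff(w3, w1, w2):
    # Use the discounted trigram if seen, otherwise back off to bigram, then unigram
    if trigrams[(w1, w2, w3)] > 0:
        return (trigrams[(w1, w2, w3)] - D) / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return ALPHA * bigrams[(w2, w3)] / unigrams[w2]
    return ALPHA * unigrams[w3] / N

print(round(p_backoff("dog", "the", "cat"), 3))    # ≈ 0.067 (falls back to the unigram)
print(round(p_backoff("jumps", "the", "cat"), 3))  # 0.5 (discounted trigram)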
Interpolation
• Interpolation smoothing combines probabilities from all n-gram orders (e.g., trigram, bigram,
unigram) by assigning each a weight (λ) that sums to 1.
• Instead of choosing one n-gram order like backoff, interpolation mixes them, ensuring that
even unseen higher-order n-grams get some probability from lower-order models.
• This makes it more robust, as it considers all available context, weighted by how reliable each
n-gram order is (higher-order n-grams get more weight if they’re observed, but lower-order n-
grams contribute for unseen cases).
Formula:
P̂(wₖ | wₖ₋₂, wₖ₋₁) = λ₁P(wₖ | wₖ₋₂, wₖ₋₁) + λ₂P(wₖ | wₖ₋₁) + λ₃P(wₖ), where λ₁ + λ₂ + λ₃ = 1 and P̂ is the interpolated estimate.
• P(wₖ | wₖ₋₂, wₖ₋₁): Trigram probability.
• P(wₖ | wₖ₋₁): Bigram probability.
• P(wₖ): Unigram probability.
• λ₁, λ₂, λ₃: Weights (e.g., 0.5, 0.3, 0.2) tuned to prioritize higher-order n-grams when they exist.
Example
• Using the same corpus: "The cat jumps." and "The dog jumps."
Vocabulary: {The, cat, dog, jumps} (V = 4).
We want P(dog | The cat) for the unseen trigram "The cat dog." Assume weights
λ₁ = 0.5 (trigram), λ₂ = 0.3 (bigram), λ₃ = 0.2 (unigram).
• Step 1: Calculate probabilities for each n-gram order:
• Trigram: P(dog | The cat) = Count("The cat dog") / Count("The cat") = 0 / 1 = 0.
• Bigram: P(dog | cat) = Count("cat dog") / Count("cat") = 0 / 1 = 0.
• Unigram: P(dog) = Count("dog") / Total words = 1 / 6 ≈ 0.167.
• Step 2: Apply interpolation:
• P(dog | The cat) = λ₁ × P_trigram + λ₂ × P_bigram + λ₃ × P_unigram
• = 0.5 × 0 + 0.3 × 0 + 0.2 × 0.167
• = 0 + 0 + 0.0334 ≈ 0.0334.
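A minimal sketch of the interpolation calculation with the slide's weights λ₁ = 0.5, λ₂ = 0.3, λ₃ = 0.2.

from collections import Counter

tokens = "the cat jumps the dog jumps".split()   # same toy corpus as the backoff example
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)
L1, L2, L3 = 0.5, 0.3, 0.2                       # λ1 + λ2 + λ3 = 1

def p_interpolated(w3, w1, w2):
    # Weighted mix of trigram, bigram and unigram MLE estimates
    p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p_uni = unigrams[w3] / N
    return L1 * p_tri + L2 * p_bi + L3 * p_uni

print(round(p_interpolated("dog", "the", "cat"), 4))   # ≈ 0.0333 (only the unigram term is non-zero)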
Backoff vs. Interpolation
• Backoff Models: Use the highest-order n-gram available, falling back to lower orders if the count is
zero. Example: P(dog | The cat) ≈ 0.067 (from unigram only).
• Interpolation: Combines all n-gram orders with weights, ensuring contributions from trigram,
bigram, and unigram. Example: P(dog | The cat) ≈ 0.0334 (from unigram, weighted).
Comparison to Other Smoothing Techniques:
• Unlike Add-One or Add-k, which add a constant to all counts, backoff and interpolation leverage
lower-order n-grams, making them more context-aware.
• Compared to Kneser-Ney, interpolation is simpler but less sophisticated, as Kneser-Ney uses context
diversity for better probability distribution. Backoff is closer to Kneser-Ney in spirit but only uses one
n-gram order at a time, while Kneser-Ney often interpolates with continuation probabilities.
• Interpolation is widely used because it’s robust and balances all n-gram orders, while backoff is
simpler but may miss some contextual information.
• Why Interpolation is Preferred:
Interpolation is often favored in practice (sometimes combined with Kneser-Ney) because it uses all
available n-gram information, making it more reliable for unseen data while still valuing observed
higher-order n-grams.
Evaluating Language Models
• Evaluation Metrics:
• Perplexity:
• Measures how well a model predicts a test set.
• Formula: PP = 2^(-1/N ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁)).
• Lower perplexity = better model.
• Example: A model with perplexity 100 is better than one with 200.
• Log-Likelihood:
• Sum of log probabilities of the test set.
• Higher log-likelihood = better fit.
• Extrinsic Evaluation:
• Measure performance in downstream tasks (e.g., accuracy in speech recognition).
• Challenges:
• Perplexity doesn’t always correlate with task performance.
• Requires a representative test set.
Evaluating Language Models
• Perplexity measures how well a language model predicts a test set by evaluating how "surprised" the
model is by the data.
• It’s based on the average probability the model assigns to each word in the test set, given its context
(previous words).
• A lower perplexity indicates the model is better at predicting the next word, meaning it assigns higher
probabilities to the actual words in the test set.
• Perplexity is often interpreted as the effective number of choices the model considers for each word
—lower values mean the model is more confident and accurate.
Formula:
PP = 2^(-1/N ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁))
• N: Number of words in the test set.
• P(wᵢ | w₁, ..., wᵢ₋₁): Probability of word wᵢ given its context (from the model).
• ∑ log₂ P: Sum of log base-2 probabilities of all words in the test set.
• Lower perplexity means the model is less "perplexed" by the data.
Example
• Test set: "The cat jumps." (3 words), scored with bigram probabilities from two models.
• Model A (Kneser-Ney):
• Probabilities: P(The | <start>) = 0.8, P(cat | The) = 0.5, P(jumps | cat) = 0.6.
• Log probabilities: log₂(0.8) ≈ -0.322, log₂(0.5) = -1, log₂(0.6) ≈ -0.737; sum = -2.059.
• Average log probability = -2.059 / 3 ≈ -0.686; Perplexity = 2^0.686 ≈ 1.61.
• Model B (Add-One):
• Add-One smoothing assigns more probability to unseen bigrams, so probabilities are lower for seen bigrams: P(The | <start>) = 0.6, P(cat | The) = 0.3, P(jumps | cat) = 0.4 (due to adding 1 to all counts and V = 4 to the denominators).
• Log probabilities:
• log₂(0.6) ≈ -0.737, log₂(0.3) ≈ -1.737, log₂(0.4) ≈ -1.322.
• Sum = -0.737 - 1.737 - 1.322 = -3.796.
• Average log probability = -3.796 / 3 ≈ -1.265.
• Perplexity = 2^(-(-1.265)) = 2^1.265 ≈ 2.40.
• Model A has a perplexity of 1.61, meaning it’s less "surprised" by the test set than Model B
(perplexity 2.40).
• Lower perplexity (1.61 vs. 2.40) indicates Model A (Kneser-Ney) is better at predicting "The cat
jumps." This makes sense, as Kneser-Ney is more sophisticated than Add-One, assigning more
accurate probabilities to seen bigrams.
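A minimal sketch of the perplexity computation from the per-word probabilities used above.

import math

def perplexity(probs):
    # PP = 2^(-1/N · Σ log2 P(wi | context)) over the N word probabilities
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

model_a = [0.8, 0.5, 0.6]   # Kneser-Ney bigram probabilities for "The cat jumps."
model_b = [0.6, 0.3, 0.4]   # Add-One bigram probabilities

print(round(perplexity(model_a), 2))   # ≈ 1.61
print(round(perplexity(model_b), 2))   # ≈ 2.40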
Log-Likelihood
• Log-Likelihood measures how well a language model fits a test set by summing the log
probabilities of all words in the test set, given their context.
• A higher log-likelihood means the model assigns higher probabilities to the actual words
in the test set, indicating a better fit.
• Unlike perplexity, which is an exponentiated average, log-likelihood is a direct sum of log
probabilities, so it’s not normalized by the number of words. It’s often used to compare
models on the same dataset, where a higher value indicates better performance.
Formula:
Log-Likelihood = ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁)
• P(wᵢ | w₁, ..., wᵢ₋₁): Probability of each word given its context.
• ∑: Sum over all words in the test set.
• Higher values indicate a better model.
Example
• Using the same test set: "The cat jumps." (3 words).
We compute log-likelihood for Model A (Kneser-Ney) and Model B (Add-One) using the same
bigram probabilities as above.
• Model A (Kneser-Ney):
• Probabilities: P(The | <start>) = 0.8, P(cat | The) = 0.5, P(jumps | cat) = 0.6.
• Log probabilities:
• log₂(0.8) ≈ -0.322, log₂(0.5) = -1, log₂(0.6) ≈ -0.737.
• Log-Likelihood = -0.322 - 1 - 0.737 = -2.059.
• Model B (Add-One):
• Probabilities: P(The | <start>) = 0.6, P(cat | The) = 0.3, P(jumps | cat) = 0.4.
• Log probabilities:
• log₂(0.6) ≈ -0.737, log₂(0.3) ≈ -1.737, log₂(0.4) ≈ -1.322.
• Log-Likelihood = -0.737 - 1.737 - 1.322 = -3.796.
• Model A has a log-likelihood of -2.059, which is higher (less negative) than Model B’s -3.796. This
means Model A assigns higher probabilities to the test set, indicating a better fit. The higher log-
likelihood aligns with Model A’s lower perplexity, confirming Kneser-Ney outperforms Add-One.
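A minimal sketch of the log-likelihood comparison, reusing the same per-word probabilities.

import math

def log_likelihood(probs):
    # Sum of log2 probabilities of the test words (not normalised by length)
    return sum(math.log2(p) for p in probs)

print(round(log_likelihood([0.8, 0.5, 0.6]), 3))   # Model A: ≈ -2.059
print(round(log_likelihood([0.6, 0.3, 0.4]), 3))   # Model B: ≈ -3.796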
Summary of Differences
• N-grams:
• Pros: Simple, interpretable, computationally efficient.
• Cons: Limited context, data sparsity.
• Neural Models:
• Pros: Capture long-range dependencies, handle large contexts.
• Cons: Computationally expensive, less interpretable.
• Example:
• N-gram: P("rain" | "It is") = fixed probability.
• Neural (e.g., GPT): Considers entire sentence context, more accurate but
slower.
Practical Example: Building an
N-gram Model
• Steps:
• Collect a corpus (e.g., news articles).
• Tokenize into words.
• Count N-grams (e.g., bigrams, trigrams).
• Estimate probabilities using MLE.
• Apply smoothing (e.g., Kneser-Ney).
• Evaluate on a test set using perplexity.
• Tools:
• Python libraries: NLTK, spaCy, scikit-learn.
• Example code (simplified):
Practical Example: Building an
N-gram Model
(contd..)
from nltk import bigrams, FreqDist

corpus = ["The cat is on the mat", "The cat sleeps"]
tokens = [word for sent in corpus for word in sent.split()]   # simple whitespace tokenization
bigram_counts = FreqDist(bigrams(tokens))

# P("cat" | "The") = Count("The cat") / Count("The"), case-sensitive here
P_cat_the = bigram_counts[("The", "cat")] / sum(1 for w1, w2 in bigrams(tokens) if w1 == "The")
print(P_cat_the)   # 1.0 for this small corpus
Probability and Smoothing
•Why Smoothing?
•Unseen N-grams result in zero probabilities, breaking the model.
•Smoothing redistributes probability mass to unseen events.
•Laplace Smoothing Example:
•Corpus: "The cat is on the mat."
•Vocabulary size (V) = 6 (unique words: The, cat, is, on, mat, .).
•Unseen bigram: "cat jumps."
•Without smoothing: P("jumps" | "cat") = 0.
•With Laplace: P("jumps" | "cat") = (0 + 1) / (Count("cat") + V) = 1 / (1 + 6) ≈ 0.143.
•Impact:
•Ensures all possible N-grams have non-zero probability.
•Improves robustness for language modeling.
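A minimal sketch of the add-one calculation above (case-folded tokens, V = 6).

from collections import Counter

tokens = "the cat is on the mat .".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(set(tokens))   # 6 unique tokens, as on the slide

def p_laplace(w2, w1):
    # P(w2 | w1) = (Count(w1 w2) + 1) / (Count(w1) + V)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(round(p_laplace("jumps", "cat"), 3))   # (0 + 1) / (1 + 6) ≈ 0.143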
Evaluating Models with
Probability
Perplexity:
•Measures how well a probability model predicts a test set.
•Formula: PP = 2^(-1/N ∑ log₂ P(wᵢ | w₁, ..., wᵢ₋₁)).
•Relies on conditional probabilities from the model.
•Example:
•Test sentence: "The cat is."
•Bigram model probabilities: P("The") = 0.2, P("cat" | "The") = 0.5, P("is" | "cat") = 0.4.
•Log-likelihood = log₂(0.2) + log₂(0.5) + log₂(0.4) ≈ -2.322 - 1 - 1.322 = -4.644.
•Perplexity = 2^(-1/3 × -4.644) = 2^(1.548) ≈ 2.92.
•Interpretation: Lower perplexity indicates better predictive power.
Practical Example: Probability in
Action
Task: Build a bigram model to predict the next word.
•Corpus: "The cat is on the mat. The cat sleeps."
•Steps:
[Link]: ["The", "cat", "is", "on", "the", "mat", ".", "The", "cat", "sleeps", "."].
[Link] bigrams: e.g., ("The", "cat") = 2, ("cat", "is") = 1.
[Link] probabilities:
•P("cat" | "The") = 2/2 = MOE = 1.0.
•P("is" | "cat") = 1/2 = 0.5.
[Link] Laplace smoothing for unseen bigrams.
[Link] next word after "The": Likely "cat" (P = 1.0).
•Evaluation:
•Compute perplexity on a test set to assess model quality.
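A minimal sketch of the prediction step with an unsmoothed bigram model over this corpus.

from collections import Counter

tokens = ["The", "cat", "is", "on", "the", "mat", ".", "The", "cat", "sleeps", "."]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def predict_next(w1):
    # Most probable next word after w1 under the bigram MLE model
    candidates = {w2: c / unigrams[w1] for (a, w2), c in bigrams.items() if a == w1}
    return max(candidates.items(), key=lambda item: item[1])

print(predict_next("The"))   # ('cat', 1.0)
print(predict_next("cat"))   # ('is', 0.5) or ('sleeps', 0.5), which tie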
Challenges in Applying
Probability
•Sparsity:
•Many N-grams have zero counts in small corpora.
•Solution: Smoothing or larger corpora.
•Context Limitation:
•N-grams rely on short contexts, limiting probability accuracy.
•Example: P("rain" | "It is") vs. neural models considering full context.
•Overfitting:
•MLE may overfit to training data, giving unreliable probabilities.
•Solution: Smoothing or regularization techniques.
•Computational Cost:
•Calculating probabilities for large N or vocabularies is expensive.
•Solution: Efficient data structures or neural models.
Probability Theory and Modern
NLP
•Beyond N-grams:
•Neural language models (e.g., Transformers) use probability distributions over entire contexts.
•Example: BERT predicts masked words using contextual probabilities.
•Softmax Layer:
•Neural models output a probability distribution over the vocabulary for the next word.
•Example: P("cat" | "The") is one of many probabilities in a softmax output.
•Advantages:
•Neural models assign probabilities based on deeper linguistic patterns.
•Example: GPT predicts coherent sentences by modeling P(wₖ | w₁, ..., wₖ₋₁) over long contexts.
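A minimal sketch of a softmax layer producing a next-word distribution over a toy vocabulary; the scores (logits) are made-up values for illustration, not the output of any real model.

import math

vocab = ["cat", "dog", "mat", "sleeps"]
logits = [2.0, 0.5, 0.1, -1.0]             # hypothetical scores for the context "The"

exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]          # softmax: probabilities sum to 1

for word, p in zip(vocab, probs):
    print(f"P({word!r} | 'The') = {p:.3f}")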
Probability in Language
Modelling
•Key Points:
•Probability theory underpins statistical language models.
•N-grams use conditional probabilities for prediction but face sparsity issues.
•Smoothing adjusts probabilities for unseen events.
•Evaluation metrics like perplexity rely on probability estimates.
•Example Recap:
•Bigram model: P("cat" | "The") = 1.0 in a small corpus, adjusted via smoothing.
•Perplexity measures how well probabilities match test data.
•Future Directions:
•Combine probability-based N-grams with neural models for hybrid systems.
•Explore Bayesian approaches for uncertainty-aware language models.
N-gram Model
• For n = 3 (trigrams): P(wₙ | w₁, ..., wₙ₋₁) ≈ P(wₙ | wₙ₋₂, wₙ₋₁)
The Noisy Channel Model
• A real language X passes through a noisy channel and is observed as language Y; we want to retrieve X from Y.
• Error correction: X = correct text, the channel introduces errors, Y = text with errors.
• Space restoration: X = correct text, the channel removes spaces, Y = text without spaces.
• Speech: X = text, the channel is a text-to-speech generator, Y = speech.
• Translation: X = source language, the channel is translation, Y = target language.
Example: ASR (Automatic Speech Recognition)
• The acoustic chain X1 ... XT is decoded into the word chain w1 ... wN by combining a language model with an acoustic model.
Example: Machine Translation
Markov Models
• General n-gram: P(wᵢ | hᵢ) = P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
• 0-gram
• 1-gram: P(wᵢ | hᵢ) = P(wᵢ)
• 2-gram: P(wᵢ | hᵢ) = P(wᵢ | wᵢ₋₁)
• 3-gram: P(wᵢ | hᵢ) = P(wᵢ | wᵢ₋₂, wᵢ₋₁)
MLE Estimates for N-gram Models
• 1-gram model: P_MLE(w) = C(w) / N, where N is the total number of tokens
• 2-gram model: P_MLE(wᵢ | wᵢ₋₁) = C(wᵢ₋₁ wᵢ) / C(wᵢ₋₁)
• 3-gram model: P_MLE(wᵢ | wᵢ₋₂, wᵢ₋₁) = C(wᵢ₋₂ wᵢ₋₁ wᵢ) / C(wᵢ₋₂ wᵢ₋₁)
Smoothing
• Relative to the true probability distribution, the seen cases are overestimated while the unseen ones have a null probability.
• Smoothing saves a part of the probability mass from the seen cases and assigns it to the unseen ones.
N-gram Probabilities: Predicting the Next Word
Evaluating language models
• Intuitively, a trigram model captures more context than
a bigram model, so should be a “better” model.
• That is, it should more accurately predict the
probabilities of sentences
• But how can we measure this?