Statistical Methods for Ambiguity Resolution
Statistical methods rely on probability distributions, corpora, and machine learning models to
disambiguate by choosing the most likely interpretation.
1. N-gram Models
Predict the likelihood of a word based on the previous (n−1) words.
Helps in resolving lexical and syntactic ambiguity.
Example:
o P("bank account") > P("bank river") → likely meaning: financial bank
2. Part-of-Speech (POS) Tagging with Hidden Markov Models (HMM)
Tags words with their parts of speech based on probabilities.
Resolves ambiguity in word classes.
e.g., "record" as a noun or verb.
HMM chooses the sequence of tags with highest probability.
3. Word Sense Disambiguation (WSD) using Bayesian Methods
Uses Bayes' theorem to find the most probable sense of a word given the context.
P(sense | context) = P(context | sense) × P(sense) / P(context)
Naïve Bayes is often used for practical WSD (see the sketch after this overview).
4. Maximum Entropy Models (Logistic Regression)
Uses features like surrounding words, POS tags, and syntactic roles.
Selects the sense or parse with the highest probability.
Unlike Naïve Bayes, it makes fewer independence assumptions.
5. Conditional Random Fields (CRF)
Widely used in sequence labeling tasks like POS tagging, Named Entity Recognition.
Can incorporate many contextual features for disambiguation.
6. Statistical Parsing (Probabilistic Context-Free Grammars - PCFGs)
Assigns probabilities to grammar rules.
Resolves syntactic ambiguity by selecting the parse tree with the highest probability.
Example:
Sentence: “I saw her duck.”
Possible meanings:
1. "duck" = noun (the bird)
2. "duck" = verb (she dodged)
Using a statistical POS tagger trained on real-world sentences, we might find that:
P(“duck” as noun | “her duck”) > P(“duck” as verb | “her duck”)
→ So the system infers “duck” as a noun.
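Before looking at each of these methods in more detail, here is a minimal Naïve Bayes sketch of the Bayesian WSD idea from point 3. All priors and likelihoods below are invented toy values, not corpus estimates:

context = ["sat", "on", "the", "river"]   # context words around "bank"

# P(sense): hypothetical priors
prior = {"FINANCE": 0.6, "RIVER": 0.4}

# P(word | sense): hypothetical likelihoods for a few context words
likelihood = {
    "FINANCE": {"sat": 0.01, "on": 0.05, "the": 0.20, "river": 0.001},
    "RIVER":   {"sat": 0.03, "on": 0.08, "the": 0.20, "river": 0.050},
}

def naive_bayes_score(sense):
    """P(sense) x product of P(word | sense) over the context (Naive Bayes)."""
    score = prior[sense]
    for word in context:
        score *= likelihood[sense].get(word, 1e-4)   # small default for unseen words
    return score

scores = {sense: naive_bayes_score(sense) for sense in prior}
print(scores)
print("Chosen sense:", max(scores, key=scores.get))   # RIVER wins with these toy numbers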
An N-gram model is a probabilistic language model used in Natural Language Processing (NLP) to
predict the next item (usually a word) in a sequence based on the previous (N−1) items. It’s a
fundamental statistical technique for language modeling, text generation, speech recognition, and
ambiguity resolution.
What is an N-Gram?
An N-gram is a sequence of N items (typically words) from a given text or speech.
| N | Name | Example (from sentence “I love data mining”) |
|---|---------|-----------------------------------------------|
| 1 | Unigram | I, love, data, mining |
| 2 | Bigram | I love, love data, data mining |
| 3 | Trigram | I love data, love data mining |
| 4 | 4-gram | I love data mining |
How N-Gram Models Work
The probability of a word depends only on the previous (N−1) words.
For a sequence of words, the chain rule gives:
P(w₁, w₂, ..., wₙ) = P(w₁) × P(w₂|w₁) × P(w₃|w₁,w₂) × ...
A bigram model (N = 2) approximates this as:
P(w₁, w₂, ..., wₙ) ≈ P(w₁) × P(w₂|w₁) × P(w₃|w₂) × ...
This simplifies computation by assuming the Markov property (only the most recent history matters).
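A minimal sketch of this estimate in code; the corpus below is a tiny made-up word list, purely to show the mechanics of counting:

from collections import Counter

# Tiny made-up corpus, just to show the mechanics of bigram estimation.
corpus = "i love data mining . i love data science . students love data".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("love", "data"))    # 1.0  (every "love" is followed by "data")
print(bigram_prob("data", "mining"))  # 0.333... (1 of 3 occurrences of "data")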
Why Use N-Gram Models?
Efficient: Reduces complexity by focusing only on a fixed number of previous words.
Effective: Captures common language patterns.
Useful for: Spell-checking, predictive typing, machine translation, and disambiguation.
Limitations of N-Gram Models
Sparsity: Many word combinations never appear in training data.
Context Loss: Long-range dependencies are ignored.
Data Hungry: Requires a large corpus for accuracy.
Smoothing Techniques
To overcome sparsity, smoothing adjusts probabilities of unseen N-grams:
Laplace Smoothing (Add-1)
Good-Turing Estimation
Backoff and Interpolation
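A small sketch of add-1 (Laplace) smoothing applied to the bigram estimate from the previous sketch (V is the vocabulary size):

def laplace_bigram_prob(w1, w2, bigrams, unigrams, vocab_size):
    """Add-1 smoothed estimate: (count(w1 w2) + 1) / (count(w1) + V).
    Unseen bigrams get a small non-zero probability instead of zero."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

# Reusing the counts from the previous sketch, an unseen bigram such as
# ("love", "mining") now gets a small probability instead of 0:
# laplace_bigram_prob("love", "mining", bigrams, unigrams, len(unigrams))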
Applications of N-Grams
Speech Recognition
Spell Correction
Predictive Text Input
Machine Translation
POS Tagging
Example: Lexical Ambiguity Resolution using Bigram Model
Sentence:
"He sat on the bank."
Ambiguity:
bank could mean:
1. A financial institution
2. A riverbank
Bigram Probability Analysis:
From a large language corpus (like Google Ngram or Brown Corpus), we find bigram frequencies:
| Phrase | Frequency | Estimated Probability |
|----------------|-----------|------------------------|
| "on the bank" | 5000 | High |
| "at the bank" | 10000 | High |
| "bank account" | 12000 | High (finance-related) |
| "river bank" | 3000 | Moderate |
We also look at contextual word sequences, such as:
“sat on” – typically used for physical surfaces
“on the bank” – more commonly used in nature contexts
We then compute:
P(bank = riverbank | "sat on the") > P(bank = financial institution | "sat on the")
So, the model prefers “riverbank”, because:
“sat on the bank” is more frequent in nature-related texts
“sat at the bank” would be expected for financial meaning
Resolution:
Interpreted Meaning = He sat on the riverbank
Because “sat on the bank” is more statistically likely in a corpus to refer to a physical location than to
a financial institution.
Behind the Scenes:
Using the n-gram estimate (here conditioning on the two previous words):
P("bank" | "on the") = frequency("on the bank") / frequency("on the")
Let's walk through a detailed explanation of Part-of-Speech (POS) tagging with a suitable example, including how ambiguity is resolved using context and how an HMM helps pick the correct tags.
Part-of-Speech (POS) Tagging – Explained
POS tagging is the process of assigning word categories like noun, verb, adjective, etc., to each word
in a sentence.
Why POS Tagging Needs Ambiguity Resolution
Some words can have multiple parts of speech depending on context.
Example Sentence:
“Can you can a can as a canner can can a can?”
This sentence contains the word “can” used with different meanings and POS tags:
| Word | Possible Tags |
|--------|------------------------|
| can | modal verb, noun, verb |
| canner | noun |
| you | pronoun |
| a | article |
| as | conjunction |
Correct POS Tagging (Using Context):
“Can/MODAL you/PRON can/VERB a/DET can/NOUN as/CONJ a/DET canner/NOUN
can/MODAL can/VERB a/DET can/NOUN?”
Breakdown:
| Word | POS Tag | Explanation |
|--------|---------|------------------------------------|
| Can | MODAL | Helping verb: Can you...? |
| you | PRON | Subject pronoun |
| can | VERB | Main verb: to can (preserve) |
| a | DET | Article |
| can | NOUN | Object noun: a can (container) |
| as | CONJ | Comparison word |
| a | DET | Article |
| canner | NOUN | A person who cans food |
| can | MODAL | Helping verb: the canner can... |
| can | VERB | Main verb: to can |
| a | DET | Article |
| can | NOUN | Noun again (a can) |
Ambiguity Without Context:
If we saw just the word "can", we wouldn't know if it's:
A modal verb (Can you swim?)
A noun (Open the can.)
A verb (She can the tomatoes.)
How HMM Helps:
Using a trained Hidden Markov Model, we calculate:
Transition Probability:
P(current tag | previous tag)
E.g., P(VERB | MODAL) = high (like “can go”)
Emission Probability:
P(word | tag)
E.g., P("can" | VERB) vs. P("can" | NOUN)
The HMM chooses the sequence of POS tags that gives the highest total probability.
Let's walk through a detailed example of POS tagging using a Hidden Markov Model (HMM), showing how it resolves ambiguity using transition and emission probabilities.
POS Tagging with Hidden Markov Model – Step-by-Step Example
Goal:
Tag the sentence:
“He can fish.”
❗ Ambiguity:
"can" could be a modal verb or a main verb
"fish" could be a noun (the animal) or a verb (to catch fish)
Possible Interpretations:
| Interpretation | Meaning |
|-------------------------------|------------------------------------|
| He/PRON can/MODAL fish/VERB | He is able to catch fish (correct) |
| He/PRON can/NOUN fish/VERB | Makes little sense |
| He/PRON can/NOUN fish/NOUN | Grammatically odd |
Using HMM to Resolve Ambiguity
Hidden Markov Model uses two types of probabilities:
1. Emission Probability
P(word | tag): How likely is a word to appear as a certain POS?
| Word | POS | P(word | tag) (estimated) |
|--------|-----------|----------------------------|
| he | PRON | 0.9 |
| can | MODAL | 0.7 |
| can | NOUN | 0.2 |
| fish | NOUN | 0.5 |
| fish | VERB | 0.4 |
2. Transition Probability
P(current_tag | previous_tag): Likelihood of one tag following another.
| Previous → Current | P(tag₂ | tag₁) |
|--------------------|----------------|
| START → PRON | 0.8 |
| PRON → MODAL | 0.7 |
| PRON → NOUN | 0.2 |
| MODAL → VERB | 0.9 |
| VERB → NOUN | 0.3 |
| MODAL → NOUN | 0.1 |
| NOUN → VERB | 0.4 |
| NOUN → NOUN | 0.3 |
Calculate Probabilities for Tag Sequences
➤ Option 1 (Correct):
“He/PRON can/MODAL fish/VERB”
P = P(PRON∣START) ⋅ P(he∣PRON) ⋅ P(MODAL∣PRON) ⋅ P(can∣MODAL) ⋅ P(VERB∣MODAL) ⋅ P(fish∣VERB)
P = 0.8 ⋅ 0.9 ⋅ 0.7 ⋅ 0.7 ⋅ 0.9 ⋅ 0.4 ≈ 0.1270
➤ Option 2 (Less likely):
“He/PRON can/NOUN fish/VERB”
P = 0.8 ⋅ 0.9 ⋅ 0.2 ⋅ 0.2 ⋅ 0.4 ⋅ 0.4 ≈ 0.0046
➤ Option 3 (Nonsensical):
“He/PRON can/NOUN fish/NOUN”
P = 0.8 ⋅ 0.9 ⋅ 0.2 ⋅ 0.2 ⋅ 0.3 ⋅ 0.5 ≈ 0.0043
Most Probable Tagging (Highest Score):
He/PRON can/MODAL fish/VERB
Interpretation:
“He is able to catch fish.”
How HMM Helps:
Uses contextual patterns in the form of transition probabilities.
Considers word-tag probabilities (emissions).
Viterbi algorithm efficiently computes the best path (tag sequence).
Why It’s Effective for Ambiguity
The word “can” has multiple meanings.
The model chooses “MODAL” for “can” because it follows a pronoun and precedes a verb,
both highly probable transitions.
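The same result can be checked with a minimal Viterbi sketch over the toy tables above. Transitions or emissions not listed are treated as 0.0, and the NOUN → VERB / NOUN → NOUN values are the assumed ones used in Options 2 and 3:

tags = ["PRON", "MODAL", "NOUN", "VERB"]

# Emission probabilities P(word | tag) from the table above
emission = {
    ("he", "PRON"): 0.9,
    ("can", "MODAL"): 0.7, ("can", "NOUN"): 0.2,
    ("fish", "NOUN"): 0.5, ("fish", "VERB"): 0.4,
}

# Transition probabilities P(tag2 | tag1) from the table above;
# NOUN -> VERB and NOUN -> NOUN are the assumed values used in Options 2 and 3.
transition = {
    ("START", "PRON"): 0.8,
    ("PRON", "MODAL"): 0.7, ("PRON", "NOUN"): 0.2,
    ("MODAL", "VERB"): 0.9, ("MODAL", "NOUN"): 0.1,
    ("VERB", "NOUN"): 0.3,
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.3,
}

def viterbi(words):
    """Return (probability, tag sequence) of the most probable path."""
    best = {"START": (1.0, [])}          # best[tag] = (path probability, path)
    for word in words:
        new_best = {}
        for tag in tags:
            e = emission.get((word, tag), 0.0)
            if e == 0.0:
                continue                  # this tag cannot emit this word
            # extend every surviving path with this tag, keep the best one
            candidates = [(p * transition.get((prev, tag), 0.0) * e, path + [tag])
                          for prev, (p, path) in best.items()]
            new_best[tag] = max(candidates)
        best = new_best
    return max(best.values())

prob, path = viterbi(["he", "can", "fish"])
print(path, round(prob, 4))   # ['PRON', 'MODAL', 'VERB'] 0.127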
Let's walk through a clear example that demonstrates Maximum Entropy Modeling (Logistic Regression) in NLP using a Part-of-Speech (POS) tagging task.
Example: POS Tagging Using Maximum Entropy (Logistic Regression)
✍️Sentence to Tag:
“Book that flight”
Ambiguity:
“Book” can be:
o a Noun (I read a book)
o a Verb (Book a flight)
Goal:
Determine the correct POS tag for “Book” using contextual features and a trained Maximum Entropy
classifier.
Step 1: Define Feature Set for Each Word
Let’s extract features for the first word “Book”:
| Feature | Value |
|------------------------------------------|------------------------------|
| Word itself | “Book” |
| Is first word in sentence? | Yes |
| Is capitalized? | Yes |
| Following word | “that” |
| Part-of-speech of next word | Determiner / Pronoun (varies) |
| Word shape (title case, lowercase, etc.) | Title |
These features form the input x to the MaxEnt model.
Step 2: Maximum Entropy Model Prediction
The model evaluates P(tag | features) for each candidate tag using a weighted combination of the features:
P(tag | features) = exp( Σᵢ λᵢ · fᵢ(features, tag) ) / Z(features)
where each fᵢ is a feature, λᵢ its learned weight, and Z(features) a normalizing constant.
Assume the trained model computes the probabilities:
P(Book = VERB | features) = 0.85
P(Book = NOUN | features) = 0.15
Since the verb tag has the highest probability, the model selects:
“Book” → VERB
Tagged Sentence Output:
“Book/VERB that/DET flight/NOUN”
Why the Classifier Chose VERB:
“Book” is capitalized and sentence-initial → a neutral clue (any first word is capitalized).
The next word is “that”, a determiner introducing the noun phrase “that flight” → suggests a verb taking an object.
The overall sentence pattern resembles a command (imperative).
Python Code (Using NLTK’s MaxEnt Classifier)
Here’s a simplified version using NLTK's built-in classifier:
import nltk
from nltk.classify import MaxentClassifier

# Training data (simplified): list of (feature dict, label) tuples
train_set = [
    ({"word": "Book", "next_word": "that", "capitalized": True}, "VERB"),
    ({"word": "book", "next_word": "is", "capitalized": False}, "NOUN"),
    ({"word": "flight", "capitalized": False}, "NOUN"),
]

# Train the classifier
classifier = MaxentClassifier.train(train_set, algorithm='iis', trace=0, max_iter=10)

# Test features for the word "Book"
test_features = {"word": "Book", "next_word": "that", "capitalized": True}

# Predict the POS tag
predicted_tag = classifier.classify(test_features)
print("Predicted Tag for 'Book':", predicted_tag)
🧾 Output:
Predicted Tag for 'Book': VERB
Summary
Maximum Entropy (Logistic Regression) chooses the most likely tag based on features, not just
word frequency.
It works well when context and features are important (like capitalization, nearby words,
position).
Unlike Naïve Bayes, it doesn’t assume independence between features.
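Because a MaxEnt classifier is mathematically a multinomial logistic regression, the same toy example can also be sketched with scikit-learn (assuming scikit-learn is installed; with only three training examples the output is purely illustrative):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Same toy feature dictionaries as in the NLTK example above
train_features = [
    {"word": "Book", "next_word": "that", "capitalized": True},
    {"word": "book", "next_word": "is", "capitalized": False},
    {"word": "flight", "capitalized": False},
]
train_labels = ["VERB", "NOUN", "NOUN"]

# DictVectorizer turns the feature dicts into numeric vectors;
# LogisticRegression plays the role of the maximum-entropy classifier.
model = make_pipeline(DictVectorizer(sparse=False), LogisticRegression(max_iter=1000))
model.fit(train_features, train_labels)

test = {"word": "Book", "next_word": "that", "capitalized": True}
print(model.predict([test])[0])                      # expected: VERB (toy data)
print(dict(zip(model.classes_, model.predict_proba([test])[0])))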
Let's walk through a complete, step-by-step example of how a Conditional Random Field (CRF) is used in Named Entity Recognition (NER), a common NLP task, to resolve ambiguity.
Example: Named Entity Recognition (NER) with CRF
Sentence:
"Steve Jobs founded Apple in California."
Goal:
Identify named entities in the sentence, such as:
Steve Jobs → PERSON
Apple → ORGANIZATION
California → LOCATION
We want to assign a label (NER tag) to each word.
1. Tokenized Sentence:
["Steve", "Jobs", "founded", "Apple", "in", "California", "."]
2. Tags to Predict:
| Word | NER Tag |
|------------|----------------------|
| Steve | B-PER (Begin Person) |
| Jobs | I-PER (Inside Person) |
| founded | O (Outside) |
| Apple | B-ORG |
| in | O |
| California | B-LOC |
| . | O |
3. Feature Extraction per Word
Let’s extract features for each word — CRFs learn patterns between features and labels, and between
labels themselves.
🔹 Features for “Apple”:
| Feature | Value |
|---------------------|------------------------------|
| Word | “Apple” |
| Is Capitalized? | Yes |
| Prefix (2 letters) | “Ap” |
| Suffix (3 letters) | “ple” |
| Previous Word | “founded” |
| Next Word | “in” |
| Previous Tag | O (learned during training) |
These features help the CRF recognize “Apple” as a likely organization.
4. How CRF Resolves Ambiguity
Let’s say a model needs to tag "Apple" in different contexts:
“Apple is tasty.” → “Apple” = B-FRUIT (or O)
“Apple released a new iPhone.” → “Apple” = B-ORG
CRF looks at the surrounding words and tags:
If the previous word is “founded”, it learns that the next word is likely an organization.
If the previous word is “an”, and the next word is “tree”, it may tag “Apple” as a fruit.
Prediction:
“founded” → followed by capitalized noun = likely organization
So, CRF tags "Apple" as → B-ORG
5. CRF Learns These Patterns:
| Pattern (Learned from Training Data) | Likely Tag |
|----------------------------------------------------|------------|
| Capitalized word after “founded” | ORG |
| Two consecutive capitalized words at the beginning | PER |
| Word “in” followed by capitalized word | LOC |
Final Tagged Output:
| Word | Predicted Tag |
|------------|---------------|
| Steve | B-PER |
| Jobs | I-PER |
| founded | O |
| Apple | B-ORG |
| in | O |
| California | B-LOC |
| . | O |
How CRF Helps Resolve Ambiguity:
CRF considers word-level features AND neighboring tags.
It learns that:
o “Steve Jobs” is a name → PERSON
o “Apple” after “founded” → ORGANIZATION
o “California” after “in” → LOCATION
Unlike HMM, CRF can use rich contextual and lexical features without assuming
independence.
Summary:
CRF is ideal for tasks like NER and POS tagging where context matters.
It uses features and label sequences to make smart, consistent decisions.
CRFs are accurate and flexible, often used in real-world NLP systems.
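A minimal sketch of this pipeline using the third-party sklearn-crfsuite package (assumed installed via pip install sklearn-crfsuite). With a single training sentence it only illustrates the mechanics of feature extraction, training, and prediction, not a realistic model:

import sklearn_crfsuite   # third-party: pip install sklearn-crfsuite

sentence = ["Steve", "Jobs", "founded", "Apple", "in", "California", "."]
labels   = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "O"]

def word_features(sent, i):
    """Feature dictionary for token i, mirroring the feature table above."""
    word = sent[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "prefix2": word[:2],
        "suffix3": word[-3:],
        "prev_word": sent[i - 1].lower() if i > 0 else "<START>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<END>",
    }

# CRF training data: one list of feature dicts and one list of labels per sentence
X_train = [[word_features(sentence, i) for i in range(len(sentence))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

# Predicting on the training sentence only shows the call; a real model
# needs many annotated sentences.
print(crf.predict(X_train)[0])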
Statistical Parsing with Probabilistic Context-Free Grammars (PCFGs)
What Is Statistical Parsing?
Statistical parsing is the process of analyzing the grammatical structure of a sentence using
probability-based models, to choose the most likely parse tree among several possibilities.
A Probabilistic Context-Free Grammar (PCFG) is an extension of a regular Context-Free
Grammar (CFG) that assigns a probability to each production rule.
1. Context-Free Grammar (CFG) – Recap
A CFG consists of:
Terminals (actual words like “dog”, “runs”)
Non-terminals (syntactic categories like NP, VP)
Production Rules, e.g.:
S → NP VP
NP → Det Noun
VP → Verb NP
2. What Is a PCFG?
In a PCFG, each rule has a probability, e.g.:
S → NP VP [1.0]
NP → Det Noun [0.5]
NP → ProperNoun [0.5]
VP → Verb NP [0.8]
VP → Verb [0.2]
The probabilities are learned from a treebank (a parsed corpus).
For each non-terminal, the sum of rule probabilities must be 1.0.
3. Example Sentence:
“She sees a dog”
Let’s define a basic PCFG:
📜 Grammar:
S → NP VP [1.0]
NP → Pronoun [0.4]
NP → Det Noun [0.6]
VP → Verb NP [1.0]
Det → “a” [1.0]
Noun → “dog” [1.0]
Pronoun → “She” [1.0]
Verb → “sees” [1.0]
4. All Possible Parse Trees
Only one parse tree is possible here:
S
├── NP → Pronoun → She
└── VP
    ├── Verb → sees
    └── NP
        ├── Det → a
        └── Noun → dog
🔢 Probability of Parse Tree:
Multiply the probabilities of the applied rules:
P(S → NP VP) = 1.0
P(NP → Pronoun) = 0.4
P(Pronoun → She) = 1.0
P(VP → Verb NP) = 1.0
P(Verb → sees) = 1.0
P(NP → Det Noun) = 0.6
P(Det → a) = 1.0
P(Noun → dog) = 1.0
Total = 1.0 × 0.4 × 1.0 × 1.0 × 1.0 × 0.6 × 1.0 × 1.0 = 0.24
5. Why Use PCFGs for Ambiguity Resolution?
Consider a more ambiguous sentence:
“I saw the man with the telescope.”
Possible parses:
1. I saw [the man with the telescope].
2. I saw [the man] [with the telescope].
PCFG will score each parse using the rule probabilities and pick the most probable one, effectively
resolving syntactic ambiguity.
6. Python Example with NLTK
import nltk
from nltk import PCFG
from nltk.parse import ViterbiParser

# Define a PCFG (the same grammar as the worked example above)
pcfg_grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | Pronoun [0.4]
    VP -> V NP [1.0]
    Det -> 'a' [1.0]
    N -> 'dog' [1.0]
    V -> 'sees' [1.0]
    Pronoun -> 'She' [1.0]
""")

# Create the Viterbi parser (returns the most probable parse)
parser = ViterbiParser(pcfg_grammar)

# Parse a sentence
sentence = ['She', 'sees', 'a', 'dog']
for tree in parser.parse(sentence):
    print(tree)
    print("Probability:", tree.prob())
Output:
(S
(NP (Pronoun She))
(VP (V sees) (NP (Det a) (N dog))))
Probability: 0.24
Advantages of PCFGs
Automatically chooses the most likely parse using learned probabilities.
Can resolve ambiguity better than basic CFGs.
Trainable from corpora like Penn Treebank.
Python Example: Resolving PP-Attachment Ambiguity (“I saw the man with the telescope”)
import nltk
from nltk import PCFG
from nltk.parse import ViterbiParser

# Define the grammar (rule probabilities for each non-terminal must sum to 1.0)
pcfg_grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    VP -> V NP [0.5] | VP PP [0.5]
    NP -> Pronoun [0.3] | Det N [0.4] | NP PP [0.3]
    PP -> P NP [1.0]
    Det -> 'the' [1.0]
    N -> 'man' [0.5] | 'telescope' [0.5]
    P -> 'with' [1.0]
    V -> 'saw' [1.0]
    Pronoun -> 'I' [1.0]
""")

# Parser: prints the single most probable parse and its probability
parser = ViterbiParser(pcfg_grammar)

sentence = ['I', 'saw', 'the', 'man', 'with', 'the', 'telescope']
for tree in parser.parse(sentence):
    print(tree)
    print("Probability:", tree.prob())