N-Gram Model in NLP
Dr Vivek K Verma
Introduction to N-Gram Models
An N-Gram model is a probabilistic language model used in Natural Language
Processing (NLP) to predict the next word in a sequence based on the previous
N − 1 words. The N-Gram model is based on the Markov assumption, which
simplifies the computation by assuming that the probability of a word depends
only on the previous N − 1 words, rather than the entire sequence.
The model is called an N-Gram because it breaks down a sequence of words
into contiguous sequences of N words.
Why N-Gram Models?
• Efficient: The N-Gram model simplifies language modeling by considering only local context.
• Scalable: It can be applied to large corpora and to various tasks such as speech recognition, machine translation, and text prediction.
• Flexible: The choice of N determines the level of context captured. For example, a 1-Gram (Unigram) only considers individual words, while a 2-Gram (Bigram) considers word pairs.
N-Gram Probability Model
The probability of a word sequence W = w1 , w2 , . . . , wn can be computed using
the chain rule of probability:
P(W) = P(w1) · P(w2|w1) · P(w3|w1, w2) · · · P(wn|w1, w2, . . . , wn−1)
However, this becomes computationally expensive for large sequences. The
N-Gram model simplifies this by considering only the previous N − 1 words:
P(W) ≈ ∏_{i=1}^{n} P(wi | wi−(N−1), . . . , wi−1)
For example, in a Bigram model (2-Gram), we have:
P(W) = P(w1) · P(w2|w1) · P(w3|w2) · · · P(wn|wn−1)
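In practice, these conditional probabilities are usually estimated from counts in a training corpus. The following Python sketch shows one way the maximum-likelihood estimate P(wi|wi−1) = count(wi−1, wi)/count(wi−1) could be computed for a Bigram model; the tiny corpus below is an illustrative assumption, not data from the text.

# Minimal sketch: maximum-likelihood Bigram probabilities from counts.
# The corpus is an illustrative assumption.
from collections import Counter

corpus = [["I", "love", "NLP"], ["I", "love", "python"], ["I", "like", "NLP"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    """MLE estimate of P(word | prev); 0.0 if the context was never seen."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("I", "love"))    # count(I, love)/count(I) = 2/3
print(bigram_prob("love", "NLP"))  # count(love, NLP)/count(love) = 1/2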
Types of N-Gram Models
• Unigram Model (1-Gram): Each word is treated as independent of all other words in the sequence.
P(W) = P(w1) · P(w2) · P(w3) · · · P(wn)
• Bigram Model (2-Gram): The probability of a word depends on the
previous word.
P(W) = P(w1) · P(w2|w1) · P(w3|w2) · · · P(wn|wn−1)
• Trigram Model (3-Gram): The probability of a word depends on the
two preceding words.
P(W) = P(w1) · P(w2|w1) · P(w3|w1, w2) · · · P(wn|wn−2, wn−1)
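To make the different orders concrete, here is a small Python sketch (an illustration, not tied to any particular corpus) that extracts the unigrams, bigrams, and trigrams of a tokenized sentence.

# Sketch: extracting N-grams of different orders from a tokenized sentence.
def ngrams(tokens, n):
    """Return all contiguous n-word tuples, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "learning", "NLP"]
print(ngrams(tokens, 1))  # unigrams: ('I',), ('love',), ('learning',), ('NLP',)
print(ngrams(tokens, 2))  # bigrams:  ('I', 'love'), ('love', 'learning'), ('learning', 'NLP')
print(ngrams(tokens, 3))  # trigrams: ('I', 'love', 'learning'), ('love', 'learning', 'NLP')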
Example: Bigram Model
Let’s walk through an example using a Bigram model (2-Gram) to calculate the
probability of a given sentence.
Consider the sentence: “I love NLP.”
We want to calculate the probability of this sentence using the Bigram model.
Step-by-Step Calculation
1. Break the sentence into word pairs:
“I love NLP” ⇒ (I, love), (love, NLP)
2. Calculate the probability of each word pair: Using a trained Bigram model (from a corpus), let’s assume the following probabilities:
P(love|I) = 0.4, P(NLP|love) = 0.3
The probability P(love|I) represents how often “love” follows “I” in the corpus. The probability P(NLP|love) represents how often “NLP” follows “love” in the corpus.
3. Compute the sentence probability:
P(“I love NLP”) = P(I) · P(love|I) · P(NLP|love)
Assuming P(I) = 0.1 (the unigram probability of “I”):
P(“I love NLP”) = 0.1 · 0.4 · 0.3 = 0.012
Therefore, the probability of the sentence “I love NLP” under this Bigram
model is 0.012.
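The same calculation can be reproduced in a few lines of Python; the probability values below are the assumed numbers from the example above, not estimates from a real corpus.

# Reproducing the Bigram calculation with the assumed probabilities.
p_I = 0.1               # P(I), the unigram probability of the first word
p_love_given_I = 0.4    # P(love | I)
p_NLP_given_love = 0.3  # P(NLP | love)

p_sentence = p_I * p_love_given_I * p_NLP_given_love
print(p_sentence)  # ≈ 0.012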
Applications of N-Gram Models
N-Gram models are widely used in several NLP applications, including:
• Text Prediction: Predicting the next word in a sequence based on the previous words.
• Speech Recognition: Recognizing spoken words from phoneme sequences.
• Machine Translation: Translating text from one language to another using N-Gram probabilities.
Example: Trigram Model
To better understand the Trigram Model, let’s walk through an example where
we calculate the probability of a sentence based on the two preceding words.
Consider the sentence: “I love learning NLP.”
We will calculate the probability of this sentence using a Trigram model.
Step-by-Step Calculation
1. Break the sentence into word triplets:
“I love learning NLP” ⇒ (I, love, learning), (love, learning, NLP)
2. Calculate the probability of each word triplet: Using a trained Trigram model, let’s assume the following probabilities:
P(learning|I, love) = 0.25, P(NLP|love, learning) = 0.4
3. Calculate the unigram and bigram probabilities needed for the first two words. Assume:
P(I) = 0.1, P(love|I) = 0.3
4. Compute the sentence probability using the Trigram model:
P(“I love learning NLP”) = P(I) · P(love|I) · P(learning|I, love) · P(NLP|love, learning)
Substituting the values:
P(“I love learning NLP”) = 0.1 · 0.3 · 0.25 · 0.4 = 0.003
Therefore, the probability of the sentence “I love learning NLP” under
this Trigram model is 0.003.
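As with the Bigram case, this calculation can be reproduced with a short Python sketch; the probability table holds the assumed values from the example, not corpus estimates.

# Reproducing the Trigram calculation with the assumed probabilities.
p = {
    ("I",): 0.1,                       # P(I)
    ("I", "love"): 0.3,                # P(love | I)
    ("I", "love", "learning"): 0.25,   # P(learning | I, love)
    ("love", "learning", "NLP"): 0.4,  # P(NLP | love, learning)
}

sentence = ["I", "love", "learning", "NLP"]
prob = 1.0
for i, word in enumerate(sentence):
    context = tuple(sentence[max(0, i - 2):i])  # up to two preceding words
    prob *= p[context + (word,)]
print(prob)  # 0.1 * 0.3 * 0.25 * 0.4 ≈ 0.003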
Comparison with Bigram Model
The Trigram model provides more context than the Bigram model by considering an additional preceding word, which helps capture more linguistic structure. For instance, phrases like “I love learning” may be common and hence carry different probabilities than “I love” followed by other words.
Advantages of Trigram Models
Trigram models can capture more nuances of language by considering a larger
context, which helps in applications where phrase structure and specific word
sequences are important, such as:
• Text Prediction: Better prediction accuracy due to more context.
• Machine Translation: Captures common three-word phrases that improve translation quality.
• Speech Recognition: Recognizes context within phrases, improving accuracy.
Trigram models, by considering two preceding words, offer a richer context compared to Bigram models. This allows for improved predictions in tasks requiring greater understanding of phrase structures and contextual relationships between words.
The N-Gram model is a simple yet powerful probabilistic model used in
NLP for a variety of tasks. By considering the local context of words, N-Gram
models can capture linguistic patterns and are widely used in applications such
as speech recognition and machine translation. However, the choice of N and the handling of unseen N-grams through smoothing techniques are crucial for effective performance.
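As one illustration of such a smoothing technique, the following Python sketch shows add-one (Laplace) smoothing for Bigram probabilities, which assigns a small nonzero probability to word pairs never seen in training; the tiny corpus and vocabulary are illustrative assumptions.

# Minimal sketch of add-one (Laplace) smoothing for Bigram probabilities.
# The corpus is an illustrative assumption; vocabulary size here is 4.
from collections import Counter

corpus = [["I", "love", "NLP"], ["I", "like", "NLP"]]
vocab = {w for sent in corpus for w in sent}
unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def smoothed_bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing: never zero, even for unseen pairs."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

print(smoothed_bigram_prob("I", "love"))     # seen pair:   (1 + 1) / (2 + 4)
print(smoothed_bigram_prob("love", "like"))  # unseen pair: (0 + 1) / (1 + 4)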