Assignment on N-gram Language Models
Packages
```python
from collections import Counter
import random
from wordcloud import WordCloud
import matplotlib.pyplot as plt
```
Dataset
```python
dataset_path = "GPAC.txt"
with open(dataset_path, "r", encoding="utf8") as file:
    corpus_text = file.read()
```
Create n-grams for n=1, 2, 3, 4.
• We will use the first 50,000,000 characters of the corpus text to create the n-grams for demonstration.
```python
def create_ngrams(text, n, stop_words={'(', ')', '።', '፥', '፡', '፣', '፤'}):
    # Use only the first 50,000,000 characters for demonstration
    words = text[:50000000].split()
    # Drop punctuation and stop tokens before building the n-grams
    words = [word for word in words if word not in stop_words]
    # Return the n-grams as tuples of n consecutive words
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
```
N = 1 (Unigrams)
```python
unigrams = create_ngrams(corpus_text, 1)
print("N-grams for n=1: ")
print(unigrams[:5])
for i in range(5):
    print(unigrams[i][0])
```
N = 2 (Bigrams)
```python
bigrams = create_ngrams(corpus_text, 2)
print("N-grams for n=2: ")
for i in range(5):
    print(bigrams[i])
```
N = 3 (Trigrams)
```python
trigrams = create_ngrams(corpus_text, 3)
print("N-grams for n=3: ")
for i in range(5):
    print(trigrams[i])
```
N = 4 (Quadgrams)
```python
quadgrams = create_ngrams(corpus_text, 4)
print("N-grams for n=4: ")
for i in range(5):
    print(quadgrams[i])
```
Probabilities of n-grams and the top 10 most likely n-grams for all n
```python
# Precalculate the count of each n-gram
unigram_counts = Counter(unigrams)
bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)
quadgram_counts = Counter(quadgrams)
```
Unigram Probabilities
```python
# Calculate probabilities
unigram_probabilities = calculate_unigram_probabilities(unigrams)
top_unigrams = dict(sorted(unigram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10])
```
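• The helper `calculate_unigram_probabilities` is not shown in this excerpt; a minimal sketch of what it likely computes, assuming simple relative frequencies (the count of each unigram divided by the total number of tokens):
```python
def calculate_unigram_probabilities(unigrams):
    # Relative frequency: count(w) / total number of tokens
    counts = Counter(unigrams)
    total = len(unigrams)
    return {unigram: count / total for unigram, count in counts.items()}
```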
Bigram Probabilities
```python
# Calculate probabilities
bigram_probabilities = calculate_bigram_probabilities(bigrams)
top_bigrams = dict(sorted(bigram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10])
```
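• Likewise, `calculate_bigram_probabilities` is not shown; a sketch assuming the usual conditional definition P(w2 | w1) = count(w1, w2) / count(w1). The trigram and quadgram helpers below would follow the same pattern, dividing each n-gram count by the count of its (n-1)-word prefix:
```python
def calculate_bigram_probabilities(bigrams):
    # Conditional probability: count(w1, w2) / count(w1)
    bigram_counts = Counter(bigrams)
    first_word_counts = Counter(bigram[0] for bigram in bigrams)
    return {bigram: count / first_word_counts[bigram[0]]
            for bigram, count in bigram_counts.items()}
```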
Trigram Probabilities
```python
# Calculate probabilities
trigram_probabilities = calculate_trigram_probabilities(trigrams)
top_trigrams = dict(sorted(trigram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10])
```
Quadgram Probabilities
```python
# Calculate probabilities
quadgram_probabilities = calculate_quadgram_probabilities(quadgrams)
top_quadgrams = dict(sorted(quadgram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10])
```
Remove common stop words, recompute the bigram and trigram frequencies, and find the top 10 n-grams for n=1,2,3,4
```python
# stop_words_text is assumed to hold a whitespace-separated list of common Amharic stop words
common_stop_words = set(stop_words_text.split())
filtered_unigrams = create_ngrams(corpus_text, 1, common_stop_words)
filtered_bigrams = create_ngrams(corpus_text, 2, common_stop_words)
filtered_trigrams = create_ngrams(corpus_text, 3, common_stop_words)
filtered_quadgrams = create_ngrams(corpus_text, 4, common_stop_words)
```
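• The frequency recomputation itself is truncated in this excerpt; a minimal sketch of finding the top 10 filtered n-grams by count (variable names here are illustrative):
```python
filtered_bigram_counts = Counter(filtered_bigrams)
# Counter.most_common returns the 10 highest-frequency bigrams
print(filtered_bigram_counts.most_common(10))
```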
Create word clouds for unigrams, bigrams, and trigrams before and after stop word removal
```python
def plot_word_cloud(ngrams):
    ngram_counts = Counter(ngrams)
    # Join each n-gram tuple into a single phrase keyed to its frequency
    ngram_dict = {" ".join(ngram_key): count for ngram_key, count in ngram_counts.items()}
    # Create word cloud (a font_path to an Ethiopic-capable font may be needed for Amharic text)
    wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(ngram_dict)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```
Word clouds for unigrams, bigrams, and trigrams before common word removal
```python
plot_word_cloud(unigrams)
plot_word_cloud(bigrams)
plot_word_cloud(trigrams)
```
Word clouds for unigrams, bigrams, and trigrams after common word removal
```python
plot_word_cloud(filtered_unigrams)
plot_word_cloud(filtered_bigrams)
plot_word_cloud(filtered_trigrams)
```
Let's take a random sentence and calculate its probability: "ኢትዮጵያ ታሪካዊ ሀገር ናት"
• Let's calculate the probability of the sentence using different n-gram models: Unigram, Bigram, Trigram, and Quadgram.
Unigram Estimation
• Finding the probability of the sentence using
Unigram Estimation
```python
def unigram_probability_estimation(sentence):
    # Find probability using the unigrams
    sentence_ngrams = create_ngrams(sentence, 1)
    probability = 1.0
    for ngram in sentence_ngrams:
        # Unseen n-grams get probability 0 (no smoothing)
        probability *= unigram_probabilities.get(ngram, 0)
    return probability
```
Bigram Estimation
• Finding the probability of the sentence using
Bigram Estimation
```python
def bigram_probability_estimation(sentence):
    # Find probability using the bigrams
    sentence_ngrams = create_ngrams(sentence, 2)
    probability = 1.0
    for ngram in sentence_ngrams:
        probability *= bigram_probabilities.get(ngram, 0)
    return probability
```
Trigram Estimation
• Finding the probability of the sentence using
Trigram Estimation
```python
def trigram_probability_estimation(sentence):
    # Find probability using the trigrams
    sentence_ngrams = create_ngrams(sentence, 3)
    probability = 1.0
    for ngram in sentence_ngrams:
        probability *= trigram_probabilities.get(ngram, 0)
    return probability
```
Quadgram Estimation
• Finding the probability of the sentence using
Quadgram Estimation
```python
def quadgram_probability_estimation(sentence):
    # Find probability using the quadgrams
    sentence_ngrams = create_ngrams(sentence, 4)
    probability = 1.0
    for ngram in sentence_ngrams:
        probability *= quadgram_probabilities.get(ngram, 0)
    return probability
```
Finding the probability of the sentence using the Chain Rule
```python
def chain_rule_probability_estimation(sentence):
    sentence = sentence.split()
    # Chain rule for the four-word example sentence:
    # P(w1) * P(w2|w1) * P(w3|w1,w2) * P(w4|w1,w2,w3)
    sentence_probability = 1.0
    sentence_probability *= unigram_counts[tuple(sentence[:1])] / len(unigrams)
    sentence_probability *= bigram_counts[tuple(sentence[:2])] / unigram_counts[tuple(sentence[:1])]
    sentence_probability *= trigram_counts[tuple(sentence[:3])] / bigram_counts[tuple(sentence[:2])]
    sentence_probability *= quadgram_counts[tuple(sentence[:4])] / trigram_counts[tuple(sentence[:3])]
    return sentence_probability
```
Generating random sentences
using n-grams to see what happens
as n increases
```python
def generate_random_sentence_for_unigrams(seed_word, ngram_probabilities, n, reps=10):
    sentence = [*seed_word]
    choices = list(ngram_probabilities.keys())
    for _ in range(reps):
        # random.choice samples uniformly over the observed n-grams
        next_word = random.choice(choices)
        sentence.append(" ".join(next_word))
    return " ".join(sentence)
```
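• The uniform `random.choice` above ignores both the learned probabilities and the preceding context. A hedged sketch of a context-aware variant (the function name and structure are illustrative, not from the original notebook) that samples the next word from the n-grams whose first n-1 words match the end of the sentence, weighted by their probabilities:
```python
def generate_random_sentence(seed_words, ngram_probabilities, n, reps=10):
    sentence = list(seed_words)
    for _ in range(reps):
        # Condition on the last n-1 generated words (empty context for unigrams)
        context = tuple(sentence[-(n - 1):]) if n > 1 else ()
        candidates = [(ngram, prob) for ngram, prob in ngram_probabilities.items()
                      if ngram[:n - 1] == context]
        if not candidates:
            break  # no continuation observed for this context
        ngrams, weights = zip(*candidates)
        next_ngram = random.choices(ngrams, weights=weights, k=1)[0]
        sentence.append(next_ngram[-1])
    return " ".join(sentence)
```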
Explanation
• As the value of n increases, the model takes
into account a greater amount of context
when generating text.
• 1. **Enhanced Contextual Relevance:** A higher n leads to sentences that are more contextually appropriate and coherent, as the model considers a longer sequence of previous words to predict the next one.
Evaluating these Language Models
Using Intrinsic Evaluation Method
```python
import math

def calculate_probability(sentence, n, probability_function):
    splitted_sentence = sentence.split()
    sentence_ngrams = [tuple(splitted_sentence[i:i+n])
                       for i in range(len(splitted_sentence) - n + 1)]
    probability = 1.0
    for ngram in sentence_ngrams:
        probability *= probability_function(ngram)
    return probability
```
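• The usual intrinsic metric built on these probabilities is perplexity. A minimal sketch (the helper name and structure are assumptions, not from the original notebook), using log probabilities to avoid numeric underflow:
```python
def calculate_perplexity(sentence, n, probability_function):
    splitted_sentence = sentence.split()
    sentence_ngrams = [tuple(splitted_sentence[i:i+n])
                       for i in range(len(splitted_sentence) - n + 1)]
    log_probability = 0.0
    for ngram in sentence_ngrams:
        p = probability_function(ngram)
        if p == 0:
            return float("inf")  # unseen n-gram and no smoothing
        log_probability += math.log(p)
    # Perplexity = exp(-(1/N) * sum of log probabilities)
    return math.exp(-log_probability / len(sentence_ngrams))
```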
Evaluating these Language Models
Using Extrinsic Evaluation Method
• We chose sentence completion as the task for evaluating these language models.
• We can reuse the sentence-generation functions created earlier, this time generating the next word for a given initial sentence.
Next Word Prediction
```python
def generate_next_words(seed_word, n):
    if n == 1:
        ngram_probabilities = unigram_probabilities
    elif n == 2:
        ngram_probabilities = bigram_probabilities
    elif n == 3:
        ngram_probabilities = trigram_probabilities
    else:
        ngram_probabilities = quadgram_probabilities
    # Generate a single next word with the selected model
    # (the remaining branches and this call reconstruct the truncated original)
    return generate_random_sentence_for_unigrams(seed_word, ngram_probabilities, n, reps=1)
```
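• A hypothetical usage, assuming the seed is passed as a sequence of words:
```python
# Predict the next word after "ኢትዮጵያ" using the bigram model
print(generate_next_words(["ኢትዮጵያ"], 2))
```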