
Assignment on N-gram Language Models
Packages

```python
from collections import Counter
import random
from wordcloud import WordCloud
import matplotlib.pyplot as plt
```
Dataset

```python
dataset_path = "GPAC.txt"
with open(dataset_path, "r", encoding="utf8") as file:
    corpus_text = file.read()
```

Create n-grams for n=1, 2, 3, 4.

• We will use the first 50,000,000 characters of the corpus text to create the n-grams for demonstration.

```python
def create_ngrams(text, n, stop_words={'(', ')', '።', '፥', '፡', '፣', '፤'}):
    # Use the first 50,000,000 characters of the corpus for demonstration
    words = text[:50000000].split()
    # Reconstructed continuation: drop stop tokens, then slide a window of
    # length n over the token list to form n-gram tuples
    words = [w for w in words if w not in stop_words]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```
N = 1 (Unigrams)

```python
unigrams = create_ngrams(corpus_text, 1)

print("N-grams for n=1: ")
print(unigrams[:5])
for i in range(5):
    print(unigrams[i][0])
```
N = 2 (Bigrams)

```python
bigrams = create_ngrams(corpus_text, 2)

print("N-grams for n=2: ")
for i in range(5):
    print(bigrams[i])
```
N = 3 (Trigrams)

```python
trigrams = create_ngrams(corpus_text, 3)

print("N-grams for n=3: ")
for i in range(5):
    print(trigrams[i])
```
N = 4 (Quadgrams)

```python
quadgrams = create_ngrams(corpus_text, 4)

print("N-grams for n=4: ")
for i in range(5):
    print(quadgrams[i])
```
Probabilities of n-grams and the top 10 most likely n-grams for all n.

```python
# Precalculate the counts of the n-grams
unigram_counts = Counter(unigrams)
bigram_counts = Counter(bigrams)
trigram_counts = Counter(trigrams)
quadgram_counts = Counter(quadgrams)
```
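• The slides call calculate_unigram_probabilities, calculate_bigram_probabilities, calculate_trigram_probabilities, and calculate_quadgram_probabilities without showing their definitions. A minimal sketch, assuming maximum-likelihood estimates (the shared helper name calculate_conditional_probabilities is ours):

```python
def calculate_unigram_probabilities(unigrams):
    # P(w) = count(w) / total number of tokens
    counts = Counter(unigrams)
    total = len(unigrams)
    return {gram: c / total for gram, c in counts.items()}

def calculate_conditional_probabilities(ngrams):
    # P(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1})
    counts = Counter(ngrams)
    context_counts = Counter(gram[:-1] for gram in ngrams)
    return {gram: c / context_counts[gram[:-1]] for gram, c in counts.items()}

# Assumption: the bigram, trigram, and quadgram helpers follow the same pattern
calculate_bigram_probabilities = calculate_conditional_probabilities
calculate_trigram_probabilities = calculate_conditional_probabilities
calculate_quadgram_probabilities = calculate_conditional_probabilities
```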
Unigram Probabilities

```python
# Calculate probabilities
unigram_probabilities = calculate_unigram_probabilities(unigrams)

top_unigrams = dict(sorted(unigram_probabilities.items(),
                           key=lambda x: x[1], reverse=True)[:10])
```
Bigram Probabilities

```python
# Calculate probabilities
bigram_probabilities = calculate_bigram_probabilities(bigrams)

top_bigrams = dict(sorted(bigram_probabilities.items(),
                          key=lambda x: x[1], reverse=True)[:10])
```
Trigram Probabilities

```python
# Calculate probabilities
trigram_probabilities = calculate_trigram_probabilities(trigrams)

top_trigrams = dict(sorted(trigram_probabilities.items(),
                           key=lambda x: x[1], reverse=True)[:10])
```
Quadgram Probabilities

```python
# Calculate probabilities
quadgram_probabilities = calculate_quadgram_probabilities(quadgrams)

top_quadgrams = dict(sorted(quadgram_probabilities.items(),
                            key=lambda x: x[1], reverse=True)[:10])
```
Remove common stop words, recompute the bigram and trigram frequencies, and find the top 10 n-grams for n=1,2,3,4.
```python
# stop_words_text is loaded elsewhere in the slides (not shown here)
common_stop_words = set(stop_words_text.split())

filtered_unigrams = create_ngrams(corpus_text, 1, common_stop_words)
filtered_bigrams = create_ngrams(corpus_text, 2, common_stop_words)
filtered_trigrams = create_ngrams(corpus_text, 3, common_stop_words)
```
Create word clouds for unigrams, bigrams and trigrams before and after stop word removal
```python
def plot_word_cloud(ngrams):
    ngram_counts = Counter(ngrams)
    ngram_dict = {" ".join(gram): count for gram, count in ngram_counts.items()}

    # Create the word cloud from the frequencies and display it
    # (reconstructed continuation; a font_path supporting Ethiopic script
    # may be needed for Amharic text)
    wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(ngram_dict)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```
Word cloud for unigram, bigram and trigram before common word removal

```python
plot_word_cloud(unigrams)
plot_word_cloud(bigrams)
plot_word_cloud(trigrams)
```
Word cloud for unigram, bigram and trigram after common word removal

```python
plot_word_cloud(filtered_unigrams)
plot_word_cloud(filtered_bigrams)
plot_word_cloud(filtered_trigrams)
```
Let's take a random sentence and calculate its probability: "ኢትዮጵያ ታሪካዊ ሀገር ናት" ("Ethiopia is a historic country").

• Let's calculate the probability of the sentence using different n-gram models: Unigram, Bigram, Trigram, and Quadgram.
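• Under an n-gram model, the full history of a word is approximated by its last n-1 predecessors, with each factor estimated from the counts computed earlier:

$$P(w_i \mid w_1, \dots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \;=\; \frac{\mathrm{count}(w_{i-n+1}, \dots, w_i)}{\mathrm{count}(w_{i-n+1}, \dots, w_{i-1})}$$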
Unigram Estimation

• Finding the probability of the sentence using unigram estimation:

```python
def unigram_probability_estimation(sentence):
    # Find probability using the unigrams
    sentence_ngrams = create_ngrams(sentence, 1)
    # Reconstructed continuation: multiply the unigram probabilities
    probability = 1.0
    for gram in sentence_ngrams:
        probability *= unigram_probabilities.get(gram, 0.0)
    return probability
```
Bigram Estimation

• Finding the probability of the sentence using bigram estimation:

```python
def bigram_probability_estimation(sentence):
    # Find probability using the bigrams
    sentence_ngrams = create_ngrams(sentence, 2)
    # Reconstructed continuation, analogous to the unigram version
    probability = 1.0
    for gram in sentence_ngrams:
        probability *= bigram_probabilities.get(gram, 0.0)
    return probability
```
Trigram Estimation

• Finding the probability of the sentence using trigram estimation:

```python
def trigram_probability_estimation(sentence):
    # Find probability using the trigrams
    sentence_ngrams = create_ngrams(sentence, 3)
    # Reconstructed continuation, analogous to the unigram version
    probability = 1.0
    for gram in sentence_ngrams:
        probability *= trigram_probabilities.get(gram, 0.0)
    return probability
```
Quadgram Estimation

• Finding the probability of the sentence using quadgram estimation:

```python
def quadgram_probability_estimation(sentence):
    # Find probability using the quadgrams
    sentence_ngrams = create_ngrams(sentence, 4)
    # Reconstructed continuation, analogous to the unigram version
    probability = 1.0
    for gram in sentence_ngrams:
        probability *= quadgram_probabilities.get(gram, 0.0)
    return probability
```
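• A hypothetical usage, comparing the four estimates for the example sentence (the printed values depend on the corpus):

```python
sentence = "ኢትዮጵያ ታሪካዊ ሀገር ናት"
for estimator in (unigram_probability_estimation,
                  bigram_probability_estimation,
                  trigram_probability_estimation,
                  quadgram_probability_estimation):
    # Each estimator multiplies the probabilities of the sentence's n-grams
    print(estimator.__name__, estimator(sentence))
```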
Finding the probability of the sentence using the Chain Rule
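• The chain rule factors a sentence probability into a product of conditional probabilities:

$$P(w_1, w_2, \dots, w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \dots, w_{i-1})$$

• An n-gram model approximates each factor by conditioning on at most the previous n-1 words.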

```python
def chain_rule_probability_estimation(sentence):
    words = sentence.split()
    # Reconstructed continuation: apply the chain rule, using the widest
    # n-gram model available at each position (quadgram from the 4th word on)
    sentence_probability = 1.0
    sentence_probability *= unigram_probabilities.get((words[0],), 0.0)
    if len(words) > 1:
        sentence_probability *= bigram_probabilities.get(tuple(words[:2]), 0.0)
    if len(words) > 2:
        sentence_probability *= trigram_probabilities.get(tuple(words[:3]), 0.0)
    for i in range(3, len(words)):
        sentence_probability *= quadgram_probabilities.get(tuple(words[i - 3:i + 1]), 0.0)
    return sentence_probability
```
Generating random sentences using n-grams to see what happens as n increases
```python
def generate_random_sentence_for_unigrams(seed_word, ngram_probabilities, n, reps=10):
    sentence = [*seed_word]
    choices = list(ngram_probabilities.keys())
    for _ in range(reps):
        # Sample uniformly over the observed n-grams
        next_word = random.choice(choices)
        # Reconstructed continuation: append the sampled n-gram's words
        sentence.append(" ".join(next_word))
    return " ".join(sentence)
```
Explanation
• As the value of n increases, the model takes into account a greater amount of context when generating text.

• 1. **Enhanced Contextual Relevance:** A higher n leads to sentences that are more contextually appropriate and coherent, as the model considers a longer sequence of previous words to predict the next one.
Evaluating these Language Models Using an Intrinsic Evaluation Method

```python
import math

def calculate_probability(sentence, n, probability_function):
    splitted_sentence = sentence.split()
    sentence_ngrams = [tuple(splitted_sentence[i:i + n])
                       for i in range(len(splitted_sentence) - n + 1)]
    # Reconstructed continuation: sum log-probabilities to avoid numeric
    # underflow (unseen n-grams would need smoothing before taking the log)
    log_probability = sum(math.log(probability_function(gram))
                          for gram in sentence_ngrams)
    return math.exp(log_probability)
```
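• Perplexity is the standard intrinsic evaluation metric; a minimal sketch built on calculate_probability above (the helper name perplexity is ours):

```python
def perplexity(sentence, n, probability_function):
    # Perplexity = P(sentence)^(-1/N), where N is the number of n-grams scored
    num_ngrams = max(len(sentence.split()) - n + 1, 1)
    p = calculate_probability(sentence, n, probability_function)
    return p ** (-1.0 / num_ngrams)
```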
Evaluating these Language Models Using an Extrinsic Evaluation Method

• We chose sentence completion as the task for evaluating these language models.

• We can reuse the generation functions created earlier, but instead of producing a whole random sentence, we generate only the next word for a given initial sentence.
Next Word Prediction

```python
def generate_next_words(seed_word, n):
    if n == 1:
        ngram_probabilities = unigram_probabilities
    elif n == 2:
        ngram_probabilities = bigram_probabilities
    # Reconstructed continuation for the remaining models
    elif n == 3:
        ngram_probabilities = trigram_probabilities
    else:
        ngram_probabilities = quadgram_probabilities
    # Reuse the earlier generator, appending a single word to the seed
    return generate_random_sentence_for_unigrams(seed_word, ngram_probabilities, n, reps=1)
```
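• A hypothetical usage, completing a one-word seed with the bigram model (output varies between runs because the sampling is random):

```python
print(generate_next_words(("ኢትዮጵያ",), 2))
```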
