Natural Language Processing

[IPCC LAB]

Soumya M S
Asst. Professor
Dept of AIML
GEC, Challakere
Natural Language Processing (BAI601)

Sl. No.  Experiments
1  Write a Python program for the following preprocessing of text in NLP:
   ● Tokenization ● Filtration ● Script Validation ● Stop Word Removal ● Stemming
2  Demonstrate N-gram modeling to analyze and establish the probability distribution across sentences, and explore the use of unigrams, bigrams, and trigrams in diverse English sentences to illustrate the impact of varying n-gram orders on the calculated probabilities.
3  Investigate the Minimum Edit Distance (MED) algorithm and its application in string comparison. The goal is to understand how the algorithm efficiently computes the minimum number of edit operations required to transform one string into another.
   ● Test the algorithm on strings with different types of variations (e.g., typos, substitutions, insertions, deletions)
   ● Evaluate its adaptability to different types of input variations
4  Write a program to implement top-down and bottom-up parsers using an appropriate context-free grammar.
5  Given the following short movie reviews, each labeled with a genre, either comedy or action:
   ● fun, couple, love, love (comedy)
   ● fast, furious, shoot (action)
   ● couple, fly, fast, fun, fun (comedy)
   ● furious, shoot, shoot, fun (action)
   ● fly, fast, shoot, love (action)
   and a new document D: fast, couple, shoot, fly. Compute the most likely class for D. Assume a Naive Bayes classifier and use add-1 smoothing for the likelihoods.
6  Demonstrate the following using an appropriate programming tool which illustrates the use of information retrieval in NLP:
   ● Study the various corpora (Brown, Inaugural, Reuters, UDHR) with methods such as fileids, raw, words, sents, categories
   ● Create and use your own corpora (plaintext, categorical)
   ● Study conditional frequency distributions
   ● Study tagged corpora with methods such as tagged_sents, tagged_words
   ● Write a program to find the most frequent noun tags
   ● Map words to properties using Python dictionaries
   ● Study the rule-based tagger and the Unigram tagger
   ● Find different words in a given plain text without any spaces by comparing this text with a given corpus of words. Also find the score of the words.
7  Write a Python program to find synonyms and antonyms of the word "active" using WordNet.
8  Implement the machine translation application of NLP, where a machine translation model needs to be trained for a language with limited parallel corpora. Investigate and incorporate techniques to improve performance in low-resource scenarios.

Natural Language Toolkit (NLTK) is a leading platform for building Python programs to work with human
language data (Natural Language Processing).
It is accompanied by a book that explains the underlying concepts behind the language processing tasks
supported by the toolkit.
NLTK is intended to support research and teaching in NLP or closely related areas, including empirical
linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.
For installation instructions on your local machine, please refer to: http://www.nltk.org/install.html and http://www.nltk.org/data.html
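A quick way to verify the installation (a minimal sketch; the exact resource names depend on the NLTK version, and newer releases may ask for 'punkt_tab', which the programs below download):

import nltk
nltk.download('punkt')          # tokenizer models used by word_tokenize
from nltk.tokenize import word_tokenize
print(word_tokenize("NLTK is installed and working."))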

Program 1.
Write a Python program for the following preprocessing of text in NLP:
● Tokenization
● Filtration
● Script Validation
● Stop Word Removal
● Stemming

import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Download necessary NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')
def preprocess_text(text):
    # Step 1: Tokenization
    tokens = word_tokenize(text)
    print("Tokens:", tokens)
    # Step 2: Filtration (remove special characters, numbers, etc.)
    filtered_tokens = [word for word in tokens if re.match(r'^[a-zA-Z]+$', word)]
    print("Filtered Tokens:", filtered_tokens)
    # Step 3: Script Validation (ensure all tokens are in English script)
    # Assuming the text is already in English, no further action is needed.
    # If not, you can use a language detection library like `langdetect`.
    # Step 4: Stop Word Removal
    stop_words = set(stopwords.words('english'))
    tokens_without_stopwords = [word for word in filtered_tokens if word.lower() not in stop_words]
    print("Tokens without Stopwords:", tokens_without_stopwords)
    # Step 5: Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in tokens_without_stopwords]

    print("Stemmed Tokens:", stemmed_tokens)
    return stemmed_tokens
# Example Usage
text = "This is an example text! It includes different words, numbers like 123, and punctuation."
processed_text = preprocess_text(text)
print("Processed Tokens:", processed_text)
Output:
Tokens: ['This', 'is', 'an', 'example', 'text', '!', 'It', 'includes', 'different', 'words', ',', 'numbers', 'like',
'123', ',', 'and', 'punctuation', '.']
Filtered Tokens: ['This', 'is', 'an', 'example', 'text', 'It', 'includes', 'different', 'words', 'numbers',
'like', 'and', 'punctuation']
Tokens without Stopwords: ['example', 'text', 'includes', 'different', 'words', 'numbers', 'like',
'punctuation']
Stemmed Tokens: ['exampl', 'text', 'includ', 'differ', 'word', 'number', 'like', 'punctuat']
Processed Tokens: ['exampl', 'text', 'includ', 'differ', 'word', 'number', 'like', 'punctuat']
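Step 3 (script validation) is a no-op in the program above. A minimal sketch of an explicit check, assuming "valid" means tokens written only in basic Latin (ASCII) letters, could look like this (the helper is an illustrative addition, not part of the original program):

# Hypothetical script-validation helper: keep only tokens made of basic Latin letters
def validate_script(tokens):
    latin_tokens = [t for t in tokens if t.isascii() and t.isalpha()]
    rejected = [t for t in tokens if not (t.isascii() and t.isalpha())]
    return latin_tokens, rejected

latin, rejected = validate_script(["example", "texte", "पाठ", "текст"])
print("Latin-script tokens:", latin)   # ['example', 'texte']
print("Rejected tokens:", rejected)    # ['पाठ', 'текст']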

Program 2:
Demonstrate N-gram modeling to analyze and establish the probability distribution across sentences, and explore the use of unigrams, bigrams, and trigrams in diverse English sentences to illustrate the impact of varying n-gram orders on the calculated probabilities.
Program:
import nltk
from nltk.util import ngrams
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
# Download necessary NLTK resources
nltk.download('punkt_tab')
# Sample sentences
sentences = [
"The quick brown fox jumps over the lazy dog.",
"A quick brown fox jumps over the lazy dog.",
"The lazy dog is jumped over by the quick brown fox."
]
# Function to generate N-grams and calculate probabilities
def ngram_probability(sentences, n):
    # Tokenize sentences and generate N-grams
    tokens = []
    for sentence in sentences:
        tokens.extend(word_tokenize(sentence.lower()))
    # Generate N-grams
    n_grams = list(ngrams(tokens, n))
    # Calculate frequency distribution
    freq_dist = FreqDist(n_grams)
    # Calculate probabilities
    total_ngrams = len(n_grams)
    probabilities = {gram: count / total_ngrams for gram, count in freq_dist.items()}

    return probabilities
# Unigrams (n=1)
unigram_probs = ngram_probability(sentences, 1)
print("Unigram Probabilities:")
for gram, prob in unigram_probs.items():
    print(f"{gram}: {prob:.4f}")

# Bigrams (n=2)
bigram_probs = ngram_probability(sentences, 2)
print("\nBigram Probabilities:")
for gram, prob in bigram_probs.items():
    print(f"{gram}: {prob:.4f}")

# Trigrams (n=3)
trigram_probs = ngram_probability(sentences, 3)
print("\nTrigram Probabilities:")
for gram, prob in trigram_probs.items():
    print(f"{gram}: {prob:.4f}")

Output:
Unigram Probabilities:
('the',): 0.1562
('quick',): 0.0938
('brown',): 0.0938
('fox',): 0.0938
('jumps',): 0.0625
('over',): 0.0938
('lazy',): 0.0938
('dog',): 0.0938
('.',): 0.0938
('a',): 0.0312
('is',): 0.0312

('jumped',): 0.0312
('by',): 0.0312

Bigram Probabilities:
('the', 'quick'): 0.0645
('quick', 'brown'): 0.0968
('brown', 'fox'): 0.0968
('fox', 'jumps'): 0.0645
('jumps', 'over'): 0.0645
('over', 'the'): 0.0645
('the', 'lazy'): 0.0968
('lazy', 'dog'): 0.0968
('dog', '.'): 0.0645
('.', 'a'): 0.0323
('a', 'quick'): 0.0323
('.', 'the'): 0.0323
('dog', 'is'): 0.0323
('is', 'jumped'): 0.0323
('jumped', 'over'): 0.0323
('over', 'by'): 0.0323
('by', 'the'): 0.0323
('fox', '.'): 0.0323

Trigram Probabilities:
('the', 'quick', 'brown'): 0.0667
('quick', 'brown', 'fox'): 0.1000
('brown', 'fox', 'jumps'): 0.0667
('fox', 'jumps', 'over'): 0.0667
('jumps', 'over', 'the'): 0.0667
('over', 'the', 'lazy'): 0.0667
('the', 'lazy', 'dog'): 0.1000

('lazy', 'dog', '.'): 0.0667
('dog', '.', 'a'): 0.0333
('.', 'a', 'quick'): 0.0333
('a', 'quick', 'brown'): 0.0333
('dog', '.', 'the'): 0.0333
('.', 'the', 'lazy'): 0.0333
('lazy', 'dog', 'is'): 0.0333
('dog', 'is', 'jumped'): 0.0333
('is', 'jumped', 'over'): 0.0333
('jumped', 'over', 'by'): 0.0333
('over', 'by', 'the'): 0.0333
('by', 'the', 'quick'): 0.0333
('brown', 'fox', '.'): 0.0333
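The probabilities above are relative frequencies of each n-gram over the pooled corpus. To score a whole sentence, a bigram language model instead multiplies conditional probabilities P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1}). A minimal sketch that reuses the sentences list from the program above (unsmoothed, so any unseen bigram gives probability 0):

def sentence_bigram_probability(sentence, corpus_sentences):
    # Build unigram and bigram counts from the corpus
    tokens = []
    for s in corpus_sentences:
        tokens.extend(word_tokenize(s.lower()))
    unigram_counts = FreqDist(tokens)
    bigram_counts = FreqDist(ngrams(tokens, 2))
    # Multiply P(w_i | w_{i-1}) over the sentence
    words = word_tokenize(sentence.lower())
    prob = 1.0
    for w1, w2 in ngrams(words, 2):
        if bigram_counts[(w1, w2)] == 0:
            return 0.0  # unseen bigram; smoothing would be needed in practice
        prob *= bigram_counts[(w1, w2)] / unigram_counts[w1]
    return prob

print(sentence_bigram_probability("the quick brown fox jumps over the lazy dog .", sentences))

Comparing the unigram, bigram, and trigram tables above also shows why the order matters: higher-order n-grams are sparser, so more of them occur only once and their estimated probabilities become less reliable without smoothing.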

Program 3:
Investigate the Minimum Edit Distance (MED) algorithm and its application in string comparison. The goal is to understand how the algorithm efficiently computes the minimum number of edit operations required to transform one string into another.
● Test the algorithm on strings with different types of variations (e.g., typos, substitutions, insertions, deletions)
● Evaluate its adaptability to different types of input variations

def min_edit_distance(str1, str2):
    m = len(str1)
    n = len(str2)

    # Create a DP table to store results of subproblems
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize the base cases
    for i in range(m + 1):
        dp[i][0] = i  # Deletion cost
    for j in range(n + 1):
        dp[0][j] = j  # Insertion cost

    # Fill the DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # No operation needed
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # Deletion
                    dp[i][j - 1],      # Insertion
                    dp[i - 1][j - 1]   # Substitution
                )

    # The final result is in dp[m][n]

    return dp[m][n]

# Test cases
test_cases = [
("kitten", "sitting"), # Substitutions and insertions
("intention", "execution"), # Substitutions and deletions
("flaw", "lawn"), # Substitutions
("apple", "aple"), # Deletion
("book", "books"), # Insertion
("abc", "def"), # All substitutions
("", "abc"), # All insertions
("abc", "") # All deletions
]

# Evaluate MED for each test case
for str1, str2 in test_cases:
    distance = min_edit_distance(str1, str2)
    print(f"MED between '{str1}' and '{str2}': {distance}")

Output:
MED between 'kitten' and 'sitting': 3
MED between 'intention' and 'execution': 5
MED between 'flaw' and 'lawn': 2
MED between 'apple' and 'aple': 1
MED between 'book' and 'books': 1
MED between 'abc' and 'def': 3
MED between '' and 'abc': 3
MED between 'abc' and '': 3
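To see which operations account for the distance (useful when judging how the algorithm adapts to typos, insertions, and deletions), the DP table can be walked backwards. A minimal sketch that rebuilds the same table as min_edit_distance and recovers one optimal edit sequence (this helper is an addition, not part of the original program):

def min_edit_operations(str1, str2):
    # Rebuild the DP table exactly as in min_edit_distance
    m, n = len(str1), len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    # Walk back from dp[m][n] to dp[0][0], recording one optimal operation per step
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and str1[i - 1] == str2[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match, no edit
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(f"substitute '{str1[i - 1]}' -> '{str2[j - 1]}'")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(f"delete '{str1[i - 1]}'")
            i -= 1
        else:
            ops.append(f"insert '{str2[j - 1]}'")
            j -= 1
    return list(reversed(ops))

print(min_edit_operations("kitten", "sitting"))
# e.g. ["substitute 'k' -> 's'", "substitute 'e' -> 'i'", "insert 'g'"]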

Program 4:
Write a program to implement top-down and bottom-up parsers using an appropriate context-free grammar.
import nltk
from nltk import CFG
# Define a simple Context-Free Grammar (CFG)
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N | N
VP -> V NP | V
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'chased' | 'barked'
""")
# Create Top-Down (Recursive Descent) and Bottom-Up (Chart) parsers
top_down_parser = nltk.RecursiveDescentParser(grammar)
bottom_up_parser = nltk.ChartParser(grammar)
# Input sentence
sentence = "the cat chased a dog".split()
# Top-Down Parsing
print("Top-Down Parsing Results:")
for tree in top_down_parser.parse(sentence):
    print(tree)

# Bottom-Up Parsing
print("\nBottom-Up Parsing Results:")
for tree in bottom_up_parser.parse(sentence):
    print(tree)
Output:
Top-Down Parsing Results:
(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det a) (N dog))))

Bottom-Up Parsing Results:
(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det a) (N dog))))
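For a more readable view of the structure, the nltk.Tree objects returned by either parser can also be drawn as ASCII art; a small optional addition (not part of the original program):

# Optional: draw the parse tree instead of printing bracketed notation
for tree in bottom_up_parser.parse(sentence):
    tree.pretty_print()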

Program 5:
Given the following short movie reviews, each labeled with a genre, either comedy or action:
● fun, couple, love, love (comedy)
● fast, furious, shoot (action)
● couple, fly, fast, fun, fun (comedy)
● furious, shoot, shoot, fun (action)
● fly, fast, shoot, love (action)
and a new document D: fast, couple, shoot, fly. Compute the most likely class for D.
Assume a Naive Bayes classifier and use add-1 smoothing for the likelihoods.

Program:
import nltk
from collections import defaultdict
from nltk.tokenize import word_tokenize
# Define dataset: Labeled movie reviews
documents = [
(["fun", "couple", "love", "love"], "Comedy"),
(["fast", "furious", "shoot"], "Action"),
(["couple", "fly", "fast", "fun", "fun"], "Comedy"),
(["furious", "shoot", "shoot", "fun"], "Action"),
(["fly", "fast", "shoot", "love"], "Action"),
]
# The new document to classify
D = ["fast", "couple", "shoot", "fly"]
# Count class occurrences
class_counts = defaultdict(int)
word_counts = defaultdict(lambda: defaultdict(int))
vocabulary = set()
# Process dataset
for words, label in documents:
    class_counts[label] += 1
    for word in words:
        word_counts[label][word] += 1
        vocabulary.add(word)

# Compute total words per class
total_words = {label: sum(word_counts[label].values()) for label in class_counts}
V = len(vocabulary) # Vocabulary size

# Compute prior probabilities
total_docs = sum(class_counts.values())
priors = {label: class_counts[label] / total_docs for label in class_counts}

# Function to compute likelihood with Add-1 (Laplace) smoothing
def compute_likelihood(word, label):
    return (word_counts[label][word] + 1) / (total_words[label] + V)

# Compute posterior probabilities for D
posterior_probs = {}
for label in class_counts:
    posterior_probs[label] = priors[label]  # Start with prior probability
    for word in D:
        posterior_probs[label] *= compute_likelihood(word, label)  # Multiply likelihoods

# Determine the most likely class
predicted_class = max(posterior_probs, key=posterior_probs.get)

# Output results
print("Posterior Probabilities:")
for label, prob in posterior_probs.items():
    print(f"P({label} | D) = {prob:.6f}")
print(f"\nThe new document D is classified as: *{predicted_class}*")
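As a hand check of what the program should print: the training set gives P(Comedy) = 2/5 and P(Action) = 3/5, with 9 word tokens in the Comedy class, 11 in the Action class, and a vocabulary of V = 7. With add-1 smoothing, P(D | Comedy) = (2/16)(3/16)(1/16)(2/16) and P(D | Action) = (3/18)(1/18)(5/18)(2/18), so the unnormalized posterior scores are approximately 0.000073 for Comedy and 0.000171 for Action, and D should be classified as Action.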

Program 6
Demonstrate the following using an appropriate programming tool which illustrates the use of information retrieval in NLP:
● Study the various corpora (Brown, Inaugural, Reuters, UDHR) with methods such as fileids, raw, words, sents, categories
● Create and use your own corpora (plaintext, categorical)
● Study conditional frequency distributions
● Study tagged corpora with methods like tagged_sents, tagged_words
● Write a program to find the most frequent noun tags
● Map words to properties using Python dictionaries
● Study the rule-based tagger and the Unigram tagger
● Find different words in a given plain text without any spaces by comparing this text with a given corpus of words. Also find the score of the words.

from nltk import download
from nltk.corpus import brown, inaugural, reuters, udhr, wordnet
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag, DefaultTagger, UnigramTagger
from nltk.probability import ConditionalFreqDist
from collections import Counter

# Download necessary corpora
download('brown')
download('inaugural')
download('reuters')
download('udhr')
download('punkt')
download('averaged_perceptron_tagger')
download('wordnet')
# 1. STUDY VARIOUS CORPORA
print("\n--- Studying Standard Corpora ---\n")
print("Brown Corpus Categories:", brown.categories())
print("Brown Corpus Sample Words:", brown.words(categories='news')[:20])
print("\nInaugural Speech Words:", inaugural.words(fileids='2009-Obama.txt')[:20])
print("\nReuters Corpus Categories:", reuters.categories()[:5])
print("Reuters Corpus Sample Words:", reuters.words(categories='trade')[-20:])
print("\nUDHR Available Languages:", udhr.fileids()[:5])
print("UDHR English Sample:", udhr.words(fileids='English-Latin1')[:20])


# 2. CREATE & USE CUSTOM CORPUS
print("\n--- Creating and Using Custom Corpus ---\n")
custom_text = "Natural Language Processing is amazing. NLP helps human language."
custom_tokens = word_tokenize(custom_text)
print("Tokenized Custom Corpus:", custom_tokens)

# 3. CONDITIONAL FREQUENCY DISTRIBUTION
print("\n--- Conditional Frequency Distribution ---\n")
cfd = ConditionalFreqDist((genre, word)
                          for genre in brown.categories()
                          for word in brown.words(categories=genre))
print("Most common words in 'news':", cfd['news'].most_common(10))

# 4. STUDY TAGGED CORPORA
print("\n--- Studying Tagged Corpora ---\n")
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_tagged_words = brown.tagged_words(categories='news')
print("Tagged Sentence Sample:", brown_tagged_sents[0])
print("Tagged Words Sample:", brown_tagged_words[:10])

# 5. FIND MOST FREQUENT NOUN TAGS
print("\n--- Most Frequent Noun Tags ---\n")
noun_tags = [tag for (word, tag) in brown_tagged_words if tag.startswith('NN')]
freq_nouns = Counter(noun_tags)
print("Most Frequent Noun Tags:", freq_nouns.most_common(5))

# 6. RULE-BASED AND UNIGRAM TAGGER
print("\n--- Rule-Based and Unigram Tagger ---\n")
default_tagger = DefaultTagger('NN')
print("Default Tagger Example:", default_tagger.tag(custom_tokens))

unigram_tagger = UnigramTagger(brown_tagged_sents[:5000])
print("Unigram Tagger Example:", unigram_tagger.tag(custom_tokens))

# 7. SPLITTING TEXT WITHOUT SPACES
print("\n--- Finding Words in a Plain Text Without Spaces ---\n")
corpus_words = set(brown.words())
text_without_spaces = "thecatinthehat"

def split_text(text, corpus):
    possible_words = [text[i:j] for i in range(len(text))
                      for j in range(i + 1, len(text) + 1)
                      if text[i:j] in corpus]
    return possible_words

valid_words = split_text(text_without_spaces, corpus_words)
print("Identified Words:", valid_words)
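Two checklist items, mapping words to properties with Python dictionaries and scoring the recovered words, are not covered by the code above. A minimal sketch of both follows; the property values and the frequency-based score are illustrative assumptions:

# 8. MAP WORDS TO PROPERTIES USING PYTHON DICTIONARIES
word_properties = {
    "language": {"pos": "noun", "length": 8, "topic": "linguistics"},
    "processing": {"pos": "noun", "length": 10, "topic": "computation"},
}
for word, props in word_properties.items():
    print(word, "->", props)

# 9. SCORE THE WORDS FOUND IN THE TEXT WITHOUT SPACES (Brown corpus frequency as the score)
from nltk.probability import FreqDist
brown_freq = FreqDist(w.lower() for w in brown.words())
word_scores = {w: brown_freq[w.lower()] for w in set(valid_words)}
print("Word Scores:", word_scores)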

Program 7
Write a Python program to find synonyms and antonyms of the word "active" using WordNet

import nltk
from nltk.corpus import wordnet
# Download WordNet
nltk.download('wordnet')
# Function to find synonyms and antonyms
def get_synonyms_antonyms(word):
    synonyms = set()
    antonyms = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())
    return synonyms, antonyms
#Get synonyms and antonyms
synonyms, antonyms = get_synonyms_antonyms("active")
#Results
print("Synonyms of 'active':", synonyms)
print("Antonyms of 'active':", antonyms)
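WordNet also stores a gloss for every synset; a small optional extension (not part of the original program) prints the sense definitions of "active":

# Optional: show the definition of the first few senses of "active"
for synset in wordnet.synsets("active")[:3]:
    print(synset.name(), "-", synset.definition())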


Program 8
Implement the machine translation application of NLP, where a machine translation model needs to be trained for a language with limited parallel corpora. Investigate and incorporate techniques to improve performance in low-resource scenarios.
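The manual does not include reference code for this experiment. As one possible starting point rather than a prescribed solution, a common low-resource strategy is transfer learning: fine-tune a pretrained multilingual translation model on the small parallel corpus. The sketch below assumes the Hugging Face transformers and datasets libraries are installed; the model name, the parallel.tsv file, its src/tgt column names, and all hyperparameters are illustrative assumptions.

# Transfer-learning sketch for low-resource machine translation (see assumptions above)
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

model_name = "Helsinki-NLP/opus-mt-en-hi"          # illustrative pretrained MT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Small parallel corpus: tab-separated pairs with 'src' and 'tgt' columns (hypothetical file)
dataset = load_dataset("csv", data_files="parallel.tsv", delimiter="\t")

def preprocess(batch):
    # Tokenize source sentences as inputs and target sentences as labels
    model_inputs = tokenizer(batch["src"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["tgt"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset["train"].map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="mt_low_resource",
    per_device_train_batch_size=8,
    num_train_epochs=5,                            # small corpus: few epochs to limit overfitting
    learning_rate=2e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

Other techniques commonly investigated in low-resource settings include back-translation of monolingual target-language text to create synthetic parallel data, subword tokenization (BPE or SentencePiece) to handle rare words, and transfer from a related high-resource language pair.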
