AL3501 - NATURAL LANGUAGE PROCESSING
PRACTICAL EXERCISES:
1. Word Analysis
2. Word Generation
3. Morphology
4. N-Grams
5. N-Grams Smoothing
6. POS Tagging: Hidden Markov Model
7. POS Tagging: Viterbi Decoding
8. Building POS Tagger
9. Chunking
10. Building Chunker
COURSE OUTCOMES:
At the end of this course, the students will be able to:
CO1: tag a given text with basic language features.
CO2: implement a rule-based system to tackle the morphology/syntax of a language.
CO3: design a tag set to be used for statistical processing for real-time applications.
CO4: compare and contrast the use of different statistical approaches for different types of NLP
applications.
CO5: use tools to process natural language and design innovative NLP applications.
EXP NO: 1 WORD ANALYSIS
AIM:
The aim of this program is to perform basic word analysis using Natural Language Processing (NLP)
techniques.
PROGRAM:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Sample text for word analysis
text = "The quick brown fox jumps over the lazy dog."

# Step 1: Tokenization - Splitting the text into individual words
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Step 2: Part-of-Speech (POS) Tagging - Assigning POS tags to each token
pos_tags = pos_tag(tokens)
print("\nPOS Tags:", pos_tags)

# Step 3: Stemming - Reducing words to their root form
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]
print("\nStems:", stems)

# Step 4: Lemmatization - Reducing words to their base or dictionary form
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in tokens]  # pos='v' treats every token as a verb
print("\nLemmas:", lemmas)
OUTPUT:
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the',
'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Stems: ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
Lemmas: ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.']
RESULTS:
Running the program on the sample sentence "The quick brown fox jumps over the lazy dog."
produces the tokens, POS tags, stems, and lemmas shown above. Note that stemming can yield
non-words such as 'lazi', while lemmatization returns dictionary forms.
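Forcing pos='v' for every token is a simplification; a common refinement is to map each word's Penn Treebank tag to the matching WordNet POS before lemmatizing. The sketch below illustrates the idea (the helper name get_wordnet_pos is our own, not part of the exercise):

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag prefix to the corresponding WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
lemmas = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tag(tokens)]
print("POS-aware Lemmas:", lemmas)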
EXP NO: 2 WORD GENERATION
AIM:
The aim of this program is to generate new words by randomly sampling characters, as a first step
toward character-level n-gram language models.
PROGRAM:
import random

# Alphabet of candidate characters
corpus = "abcdefghijklmnopqrstuvwxyz"

# Function to generate a new word by sampling characters uniformly at random
def generate_word(length):
    word = "".join(random.choice(corpus) for _ in range(length))
    return word

# Generate a word of length 6
new_word = generate_word(6)
print("Generated Word:", new_word)
OUTPUT:
Generated Word: tnwaey
RESULTS:
This simple word generation program creates new "words" by sampling characters uniformly at
random; it does not yet condition each character on its predecessors. A character-level n-gram
model, sketched below, adds that conditioning.
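A minimal sketch of the conditional version follows; the small training word list is our own invented example, chosen only to illustrate the idea. Each character is sampled from those observed to follow the previous character (a character-bigram model):

import random
from collections import defaultdict

# Hypothetical training words (our own example data, not from the exercise)
training_words = ["banana", "bandana", "cabana", "canada"]

# Record which characters follow each character; "^" marks the word start
successors = defaultdict(list)
for w in training_words:
    chars = "^" + w
    for a, b in zip(chars, chars[1:]):
        successors[a].append(b)

def generate_word(length):
    # Sample each character conditioned on the one before it
    ch = random.choice(successors["^"])
    word = ch
    for _ in range(length - 1):
        if ch not in successors:  # no observed successor: stop early
            break
        ch = random.choice(successors[ch])
        word += ch
    return word

print("Generated Word:", generate_word(6))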
EXP NO: 3 MORPHOLOGY
AIM:
The aim of this program is to demonstrate morphological analysis in Natural Language Processing
(NLP). Specifically, the program will perform stemming and lemmatization, two common techniques in
morphology, to analyze the structure of words and reduce them to their base forms.
PROGRAM:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
nltk.download('wordnet')

# Sample list of words for morphological analysis
words = ["running", "jumps", "easily", "fairly", "happier"]

# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization
print(f"{'Word':<10} {'Stem':<10} {'Lemma':<10}")
for word in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos='v')  # 'v' treats each word as a verb
    print(f"{word:<10} {stem:<10} {lemma:<10}")
OUTPUT:
Word       Stem       Lemma
running    run        run
jumps      jump       jump
easily     easili     easily
fairly     fairli     fairly
happier    happier    happier
RESULT:
Thus this simple program illustrates the basic concepts of morphology in NLP, focusing on stemming
and lemmatization.
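Note that 'happier' survives both tools unchanged: the stemmer has no rule for it, and the lemmatizer is told it is a verb. Supplying the adjective POS should recover the base form, as this one-line check (our own addition, not part of the exercise) illustrates:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# With the adjective POS, WordNet's exception list maps the comparative to its base form
print(lemmatizer.lemmatize("happier", pos='a'))  # expected: 'happy'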
EXP NO: 4 N-GRAMS
AIM:
The aim of this program is to demonstrate the concept of N-grams in Natural Language Processing
(NLP). The program will generate and display unigrams, bigrams, and trigrams from a given text. N-grams
are sequences of n words used to analyze and predict language patterns.
PROGRAM:
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize

# Download tokenizer data
nltk.download('punkt')
# Sample text
text = "Natural Language Processing is fascinating."
# Tokenize the text into words
tokens = word_tokenize(text)
# Generate Unigrams (n=1)
unigrams = list(ngrams(tokens, 1))
print("Unigrams:", unigrams)
# Generate Bigrams (n=2)
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)
# Generate Trigrams (n=3)
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)
OUTPUT:
Unigrams: [('Natural',), ('Language',), ('Processing',), ('is',), ('fascinating',), ('.',)]
Bigrams: [('Natural', 'Language'), ('Language', 'Processing'), ('Processing', 'is'), ('is', 'fascinating'),
('fascinating', '.')]
Trigrams: [('Natural', 'Language', 'Processing'), ('Language', 'Processing', 'is'), ('Processing', 'is', 'fascinating'),
('is', 'fascinating', '.')]
RESULTS:
This program provides a basic understanding of N-grams in NLP. Unigrams capture individual
words, bigrams represent pairs of words, and trigrams capture triplets of words. N-grams are fundamental in
language modeling, text prediction, and many other NLP tasks.
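To move from listing N-grams to modeling with them, the usual next step is to count how often each one occurs. A brief sketch (our own extension, using a made-up two-sentence corpus):

from collections import Counter
from nltk import ngrams
from nltk.tokenize import word_tokenize

corpus = "natural language processing is fascinating . language processing is fun ."
tokens = word_tokenize(corpus)

# Count every bigram; frequent bigrams indicate common word transitions
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))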
EXP NO: 5 N-GRAM SMOOTHING
AIM:
The aim of this program is to demonstrate the concept of N-grams smoothing in Natural Language
Processing (NLP). Smoothing is used to handle the issue of zero probabilities in language models by
assigning small probabilities to unseen N-grams. This program applies Laplace (Add-One) Smoothing to
bigram probability estimation.
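For reference, the smoothed probability computed below is, in LaTeX notation,

    P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}

where C(w_{i-1}, w_i) is the bigram count, C(w_{i-1}) is the unigram count, and V is the vocabulary size. Adding 1 to every numerator (and V to every denominator) guarantees that no bigram receives zero probability.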
PROGRAM:
from collections import Counter

# Sample text
text = "I love natural language processing. I love learning about NLP."

# Tokenize the text into words (lowercase and drop the periods)
tokens = text.lower().replace('.', '').split()

# Generate bigrams
bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

# Count bigrams and unigrams
bigram_counts = Counter(bigrams)
unigram_counts = Counter(tokens)

# Vocabulary size
vocab_size = len(unigram_counts)

# Function to calculate bigram probability with Laplace smoothing
def bigram_probability(bigram):
    unigram = bigram[0]
    return (bigram_counts[bigram] + 1) / (unigram_counts[unigram] + vocab_size)

# Example bigrams and their smoothed probabilities (the last one is unseen)
example_bigrams = [('i', 'love'), ('love', 'natural'), ('about', 'nlp'), ('nlp', 'is')]

# Print results
print(f"{'Bigram':<20} {'Probability':<10}")
for bigram in example_bigrams:
    prob = bigram_probability(bigram)
    print(f"{str(bigram):<20} {prob:.4f}")
OUTPUT:
Bigram               Probability
('i', 'love')        0.3000
('love', 'natural')  0.2000
('about', 'nlp')     0.2222
('nlp', 'is')        0.1111
RESULTS:
This program demonstrates how Laplace smoothing is applied to calculate bigram probabilities in a
simple text: even the unseen bigram ('nlp', 'is') receives a small non-zero probability.
EXP NO: 6 POS TAGGING: HIDDEN MARKOV MODEL
AIM:
The aim of this program is to implement Part-of-Speech (POS) tagging using a Hidden Markov Model
(HMM) in Natural Language Processing (NLP).
PROGRAM:
import nltk
from nltk.tag import hmm
from nltk.corpus import treebank

# Ensure that the necessary NLTK data is downloaded
nltk.download('treebank')
nltk.download('universal_tagset')

# Load the Treebank corpus and use a simplified tagset
train_data = treebank.tagged_sents(tagset='universal')

# Train a Hidden Markov Model POS tagger
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_data)

# Test sentence
test_sentence = "I love natural language processing".split()

# Perform POS tagging
tagged_sentence = hmm_tagger.tag(test_sentence)
print("Test Sentence:", test_sentence)
print("Tagged Sentence:", tagged_sentence)
OUTPUT:
Test Sentence: ['I', 'love', 'natural', 'language', 'processing']
Tagged Sentence: [('I', 'PRON'), ('love', 'VERB'), ('natural', 'ADJ'), ('language', 'NOUN'), ('processing',
'NOUN')]
RESULT:
This program demonstrates how a Hidden Markov Model can be used for POS tagging in NLP. The
model is trained on a dataset of labeled sentences and is capable of predicting POS tags for new sentences
based on statistical probabilities, showing the effectiveness of HMMs in sequence labeling tasks.
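To measure how well the trained model generalizes, the corpus can be split into training and test portions. A minimal sketch (the 90/10 split is our own choice, not part of the exercise):

from nltk.tag import hmm
from nltk.corpus import treebank

data = treebank.tagged_sents(tagset='universal')
split = int(len(data) * 0.9)  # hold out the last 10% for testing

tagger = hmm.HiddenMarkovModelTrainer().train_supervised(data[:split])
# evaluate() is renamed accuracy() in newer NLTK releases
print("Accuracy:", tagger.evaluate(data[split:]))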
EXP NO: 7 POS TAGGING WITH VITERBI DECODING
AIM: The aim of this program is to demonstrate Part-of-Speech (POS) tagging using Viterbi decoding in a
Hidden Markov Model (HMM).
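Viterbi decoding finds the single most probable tag sequence under the HMM. Writing a_{s's} for the tag-transition probability and b_s(o_t) for the probability that tag s emits word o_t, the algorithm fills a trellis of path scores by the recurrence, in LaTeX notation,

    v_t(s) = \max_{s'} \, v_{t-1}(s') \, a_{s's} \, b_s(o_t)

and then follows back-pointers from the best final state to recover the full tag sequence. NLTK's tag() method performs this decoding internally.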
PROGRAM:
import nltk
from nltk.tag import hmm
from nltk.corpus import treebank

# Ensure that the necessary NLTK data is downloaded
nltk.download('treebank')
nltk.download('universal_tagset')

# Load the Treebank corpus and use a simplified tagset
train_data = treebank.tagged_sents(tagset='universal')

# Train a Hidden Markov Model POS tagger
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_data)
# Test sentence
test_sentence = "I enjoy learning about natural language processing".split()
# Perform POS tagging using Viterbi Decoding
tagged_sentence = hmm_tagger.tag(test_sentence)
print("Test Sentence:", test_sentence)
print("Tagged Sentence:", tagged_sentence)
OUTPUT:
Test Sentence: ['I', 'enjoy', 'learning', 'about', 'natural', 'language', 'processing']
Tagged Sentence: [('I', 'PRON'), ('enjoy', 'VERB'), ('learning', 'VERB'), ('about', 'ADP'), ('natural', 'ADJ'),
('language', 'NOUN'), ('processing', 'NOUN')]
RESULTS:
This program demonstrates the use of Viterbi Decoding within an HMM for POS tagging in NLP.
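NLTK's HMM tagger also exposes the Viterbi-decoded state sequence directly through best_path(). Appended to the program above, this should print the same tags without the paired tokens:

# best_path() returns the Viterbi-decoded tag sequence for the token list
viterbi_tags = hmm_tagger.best_path(test_sentence)
print("Viterbi Tags:", viterbi_tags)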
EXP NO: 8 BUILDING POS TAGGER
AIM: To build a simple Part-of-Speech (POS) tagger using the Natural Language Toolkit (NLTK) library in
Python. The tagger will assign POS tags to each word in a given sentence.
PROGRAM:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank
from nltk.tag import UnigramTagger

# Download required resources
nltk.download('punkt')
nltk.download('treebank')
# Load training data
train_sents = treebank.tagged_sents()
# Initialize and train the POS tagger
tagger = UnigramTagger(train_sents)
# Define a sentence for tagging
sentence = "The quick brown fox jumps over the lazy dog."
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Tag the tokens
tags = tagger.tag(tokens)
# Display the results
print("Sentence:", sentence)
print("Tokens:", tokens)
print("POS Tags:", tags)
OUTPUT:
Sentence: The quick brown fox jumps over the lazy dog.
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the',
'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
RESULT:
Thus each token is paired with its corresponding POS tag.
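A unigram tagger returns None for any word absent from its training data. A common remedy is to back off to a default tag; the sketch below (the 'NN' default and the nonsense test word are our own choices) shows the idea:

from nltk.corpus import treebank
from nltk.tag import UnigramTagger, DefaultTagger

train_sents = treebank.tagged_sents()
# Words unseen in training fall back to 'NN' instead of None
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(tagger.tag("The zyzzyva jumps".split()))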
EXP NO: 9 CHUNKING
AIM:
The aim of chunking in Natural Language Processing (NLP) is to divide a sentence into syntactically
correlated parts, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). This helps in
better understanding the structure of the sentence and is often used in information extraction and syntactic
analysis.
ALGORITHM:
1. Tokenization: Split the input sentence into tokens (words).
2. POS Tagging: Assign Part-of-Speech (POS) tags to each token.
3. Chunking: Identify and extract phrases from the tagged sentence based on predefined patterns using
regular expressions.
PROGRAM:
import nltk
from nltk.chunk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged_tokens = pos_tag(tokens)
# Define chunking pattern (e.g., NP = Noun Phrase)
chunk_pattern = """
NP: {<DT>?<JJ>*<NN.*>} # Noun Phrases
VP: {<VB.*>} # Verb Phrases
PP: {<IN><NP>} # Prepositional Phrases
"""
# Create a chunk parser
chunk_parser = RegexpParser(chunk_pattern)
# Parse the sentence
chunked = chunk_parser.parse(tagged_tokens)
# Display the chunked output in a tree-viewer window (requires a GUI)
chunked.draw()
# Output the chunked structure
print(chunked)
OUTPUT:
(S
  (NP The/DT quick/JJ brown/JJ fox/NN)
  (VP jumps/VBZ)
  (PP over/IN (NP the/DT lazy/JJ dog/NN)))
RESULT:
Noun Phrase (NP): "The quick brown fox"
Verb Phrase (VP): "jumps"
Prepositional Phrase (PP): "over the lazy dog"
This chunking process helps to identify meaningful syntactic units in the sentence and is widely used in
applications like information extraction, question answering, and machine translation.
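The chunk tree can also be consumed programmatically rather than printed. Appended to the program above, this short loop (our own addition) extracts just the noun phrases:

# Walk the parse tree and print the words inside each NP chunk
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'NP'):
    print("NP:", " ".join(word for word, tag in subtree.leaves()))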
EXP NO: 10 BUILDING CHUNKER
AIM:
The aim of building a Chunker in NLP is to identify and group words in a sentence into predefined chunks
such as noun phrases, verb phrases, etc., to analyze sentence structure and extract meaningful information. A
chunker uses Part-of-Speech (PoS) tagging information to create these chunks.
ALGORITHM:
1. Tokenize the sentence: Break the sentence into individual words.
2. PoS tagging: Assign Part-of-Speech tags (such as Noun, Verb, Adjective, etc.) to each word in the
sentence.
3. Define Chunking Rules: Specify patterns to group words based on their PoS tags (for example, NP
for Noun Phrase, VP for Verb Phrase).
4. Apply Chunking Rules: Use regular expressions or a ChunkParser to apply the rules and create the
chunks.
5. Output the Chunks: Output the chunks formed in the sentence.
PROGRAM:
import nltk
from nltk.chunk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Define a simple sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Step 1: Tokenize the sentence
tokens = word_tokenize(sentence)
# Step 2: PoS tagging
tagged = pos_tag(tokens)
# Step 3: Define chunking rules
chunk_grammar = """
NP: {<DT>?<JJ>*<NN>} # Noun Phrase
VP: {<VB.*>} # Verb Phrase
"""
# Step 4: Apply the chunking rules
chunk_parser = RegexpParser(chunk_grammar)
chunked = chunk_parser.parse(tagged)
# Step 5: Output the chunks
print(chunked)
OUTPUT:
(S
  (NP The/DT quick/JJ brown/JJ fox/NN)
  (VP jumps/VBZ)
  over/IN
  (NP the/DT lazy/JJ dog/NN))
RESULT:
The sentence "The quick brown fox jumps over the lazy dog" is chunked as:
Noun Phrase (NP): "The quick brown fox"
Verb Phrase (VP): "jumps"
Noun Phrase (NP): "the lazy dog"
The word "over" is left unchunked because the grammar defines no PP rule. The program identifies noun
phrases (NP) and verb phrases (VP) based on the chunking rules defined using regular expressions.
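Hand-written chunk grammars are usually judged against a gold-standard corpus. A sketch (our own extension, not part of the exercise) that scores the NP rule on the CoNLL-2000 test set:

import nltk
from nltk.chunk import RegexpParser
from nltk.corpus import conll2000

nltk.download('conll2000')
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

cp = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
# evaluate() reports IOB accuracy, precision, recall, and F-measure
print(cp.evaluate(test_sents))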