NLP Lab Experiments
EXPERIMENT - 1
WORD ANALYSIS
AIM:
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further decompose it into smaller parts. On the other hand, the word 'cats' is complex, because it is made up of two parts: the root 'cat' and the plural suffix '-s'.
THEORY:
Analysis of a word into its root and affix(es) is called morphological analysis of the word. It is mandatory to identify the root of a word for any natural language processing task. A root word can have various forms. For example, the word 'play' in English has the following forms: 'play', 'plays', 'played' and 'playing'. Hindi has a larger number of forms for the word 'खेल' (khela), which is equivalent to 'play'. The forms of 'खेल' (khela) are the following:
For the Telugu root ఆడడం (Adadam), the forms are the following:
Thus, we understand that morphological richness varies from one language to another. Indian languages are generally morphologically rich languages, and therefore morphological analysis of words becomes a very significant task for Indian languages.
Types of Morphology
1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For example,
'played' is an inflection of the root word 'play'. Here, both 'played' and 'play' are verbs.
2. Derivational morphology
Deals with word forms of a root, where there is a change in the lexical category. For
example, the word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness' is
a derived noun form of the adjective 'happy'.
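As an illustration (not part of the lab procedure), the difference can be seen with NLTK's lemmatizer and stemmer: a lemmatizer undoes inflection ('played' becomes 'play') but keeps derived words such as 'happiness' intact, while a crude stemmer strips the derivational suffix as well. A minimal sketch, assuming NLTK and its WordNet data are installed:

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Inflection: 'played' and 'cats' reduce to their roots without changing category
print(lemmatizer.lemmatize('played', pos='v'))   # play
print(lemmatizer.lemmatize('cats', pos='n'))     # cat

# Derivation: 'happiness' (noun) is derived from 'happy' (adjective);
# the lemmatizer leaves it intact, while the stemmer crudely strips the suffix
print(lemmatizer.lemmatize('happiness', pos='n'))  # happiness
print(stemmer.stem('happiness'))                   # happi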
Morphological Features:
All words will have their lexical category attested during morphological analysis. A noun or pronoun can take suffixes of the following features: gender, number, person and case. For example, the morphological analysis of a few words is given below:
A verb can take suffixes of the following features: tense, aspect, modality, gender, number and person.
'rt' stands for root. 'cat' stands for lexical category; its value can be noun, verb, adjective, pronoun, adverb or preposition. 'gen' stands for gender; its value can be masculine or feminine. 'num' stands for number; its value can be singular (sg) or plural (pl). 'per' stands for person; its value can be 1, 2 or 3.
The value of tense can be present, past or future; this feature is applicable for verbs. The value of aspect can be perfect (pft), continuous (cont) or habitual (hab); this feature is likewise applicable for verbs.
'case' can be direct or oblique; this feature is applicable for nouns. A case is an oblique case when a postposition occurs after the noun. If no postposition can occur after the noun, then the case is a direct case. This is applicable for Hindi but not for English, since English does not have postpositions. Some of the postpositions in Hindi are: का(kaa), की(kii), के(ke), को(ko), में(meM).
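As an illustration of this notation, a morphological analysis can be represented as a simple feature dictionary. The entries below are hypothetical examples following the conventions above, not output of the simulator:

# Hypothetical analyses following the rt/cat/gen/num/per/case notation above
analyses = {
    'cats':   {'rt': 'cat',  'cat': 'noun', 'num': 'pl', 'per': '3', 'case': 'direct'},
    'played': {'rt': 'play', 'cat': 'verb', 'tense': 'past'},
    'लड़कों':  {'rt': 'लड़का', 'cat': 'noun', 'gen': 'masculine', 'num': 'pl', 'per': '3', 'case': 'oblique'},
}

for word, features in analyses.items():
    print(word, features)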
OBJECTIVE:
The objective of the experiment is to learn about morphological features of a word by
analysing it.
PROCEDURE:
OUTPUT: Correct features are marked with a tick and incorrect features with a cross.
SIMULATION:
EXAMPLE PROGRAM:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag, FreqDist

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

text = ("Natural Language Processing (NLP) is a subfield of linguistics, computer science, "
        "and artificial intelligence concerned with the interactions between computers and "
        "human language.")

# Tokenize the text into words
tokens = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]

# POS tagging and frequency distribution of the remaining tokens
pos_tags = pos_tag(filtered_tokens)
fdist = FreqDist(filtered_tokens)

# Print results
print("POS tags:", pos_tags)
print("Most frequent tokens:", fdist.most_common(5))
RESULT
EXPERIMENT - 2
WORD GENERATION
AIM:
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further decompose it into smaller parts. On the other hand, the word 'cats' is complex, because it is made up of two parts: the root 'cat' and the plural suffix '-s'.
THEORY:
Given the root and suffix information, a word can be generated. For example,
Generation is essentially the reverse of analysis. Analysis may involve non-determinism, since more than one analysis can be possible for the same word form.
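A minimal sketch of generation as table lookup, assuming a hypothetical suffix table that maps a root and a feature bundle to a suffix:

# Hypothetical suffix table: (root, features) -> suffix
suffix_table = {
    ('play', ('3', 'sg', 'present')): 's',
    ('play', ('past',)):              'ed',
    ('play', ('cont',)):              'ing',
}

def generate(root, features):
    # Look up the suffix for the given root and feature bundle and attach it to the root
    suffix = suffix_table.get((root, features), '')
    return root + suffix

print(generate('play', ('past',)))               # played
print(generate('play', ('3', 'sg', 'present')))  # plays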
OBJECTIVE:
The objective of the experiment is to generate word forms from root and suffix information.
PROCEDURE:
OUTPUT: Drop-downs for selecting the root and other features will appear.
STEP 3: After selecting all the features, select the word form corresponding to the features selected above.
STEP 4: Click the Check button to see whether the right word was selected or not.
SIMULATION:
EXAMPLE PROGRAM:
import nltk
import random

def generate_word(corpus, length=5):
    # Build a frequency distribution over the letters of the corpus words
    letters = [ch for word in corpus for ch in word]
    freq_dist = nltk.FreqDist(letters)
    # Sample letters weighted by their frequency to form a new word
    new_word = ''.join(random.choices(list(freq_dist), weights=list(freq_dist.values()), k=length))
    return new_word

corpus = ['play', 'plays', 'played', 'playing']
for _ in range(5):
    print(generate_word(corpus))
RESULT
EXPERIMENT - 3
MORPHOLOGY
AIM:
Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e., morphemes. A morpheme is the smallest meaningful linguistic unit.
For Example:
Words can be analysed morphologically if we know all variants of a given root word. We can
use an 'Add-Delete' table for this analysis.
THEORY:
Morph Analyser
Definition
Morphemes are considered the smallest meaningful units of language. A morpheme can either be a root word (play) or an affix (-ed). The combination of these morphemes is called a morphological process. So, the word "played" is made of two morphemes, "play" and "-ed".
Finding all the parts of a word (its morphemes) and thereby describing the properties of the word is called "morphological analysis". For example, "played" carries the information verb "play" + past tense, so the given word is the past-tense form of the verb "play".
Analysis of a word:
2. Delete आ(aa)
3. Output बच्च(bachch)
Words in the same paradigm class behave similarly. For example, लड़क(ladak) is in the same paradigm class as बच्च(bachch), so लड़का(ladkaa) behaves in the same way as बच्चा(bachchaa), as they share the same paradigm class.
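A minimal sketch of how an Add-Delete table might be represented in code; the paradigm entries below are simplified assumptions for illustration (note that the Hindi strings use the vowel sign, e.g. 'ा', rather than the independent vowel letter):

# Add-delete table: for each root, a list of (delete, add, features) rules
add_delete_table = {
    'बच्चा': [
        ('ा', 'े',  {'num': 'pl', 'case': 'direct'}),   # बच्चे
        ('ा', 'ों', {'num': 'pl', 'case': 'oblique'}),  # बच्चों
    ],
    'play': [
        ('', 'ed',  {'tense': 'past'}),                 # played
        ('', 'ing', {'aspect': 'cont'}),                # playing
    ],
}

def generate_forms(root):
    # Apply each delete/add rule to the root to produce an inflected form
    forms = []
    for delete, add, features in add_delete_table[root]:
        stem = root[:-len(delete)] if delete and root.endswith(delete) else root
        forms.append((stem + add, features))
    return forms

for root in add_delete_table:
    print(root, generate_forms(root))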
OBJECTIVE:
The Objective of the experiment is understanding the morphology of a word by the use of
Add-Delete table.
PROCEDURE:
STEP 1: Select a word root.
SIMULATION:
EXAMPLE PROGRAM:
class MorphologicalAnalyzer:
    def __init__(self):
        self.root_words = {}

    def add_root_word(self, root, variants):
        # Store a root word together with its suffix variants
        self.root_words[root] = variants

    def morphological_analysis(self, word):
        # Report the first root whose suffix variant matches the end of the word
        for root, variants in self.root_words.items():
            for morpheme in variants:
                if word.endswith(morpheme.lstrip('-')):
                    print(f"{word}: root = {root}, suffix = {morpheme}")
                    return
        print(f"{word}: no analysis found")

# Example usage
analyzer = MorphologicalAnalyzer()

# Adding root words and their morphological variants (suffixes)
analyzer.add_root_word("बच्चा", ["ों"])
analyzer.add_root_word("play", ["-ed"])

word1 = "बच्चों"
word2 = "played"
analyzer.morphological_analysis(word1)
analyzer.morphological_analysis(word2)
RESULT
EXPERIMENT - 4
N-Grams
AIM:
The probability of a sentence can be calculated from the probabilities of the sequence of words occurring in it. We can use the Markov assumption, that the probability of a word in a sentence depends only on the word occurring just before it. Such a model is called a first-order Markov model or the bigram model.
Here, Wn refers to the word token corresponding to the nth word in a sequence.
THEORY:
A combination of words forms a sentence. However, such a formation is meaningful only when the words are arranged in a proper order; an arbitrary arrangement may produce an unacceptable sentence.
One easy way to handle such unacceptable sentences is to assign probabilities to strings of words, i.e., how likely the sentence is.
Probability of a sentence
If we consider each word occurring in its correct location as an independent event, the probability of the sentence is P(w(1), w(2), ..., w(n-1), w(n)).
Using the chain rule: P(w(1), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) * ... * P(w(n) | w(1)w(2)...w(n-1))
Bigrams
We can avoid this very long calculation by approximating that the probability of a given word depends only on its previous word. This assumption is called the Markov assumption, and such a model is called a Markov model; with a single word of history it is the bigram model. Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order Markov model.
A bigram table for a given corpus can be generated and used as a lookup table for
calculating probability of sentences.
Eg: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos).
Bigram Table:
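A minimal sketch that builds such a bigram table for the corpus above, assuming (eos) is treated as an ordinary token and tokens are lowercased:

from collections import defaultdict

corpus = "(eos) You book a flight (eos) I read a book (eos) You read (eos)"
tokens = corpus.lower().split()

# Count bigrams and the unigram (history) frequencies
bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
for w1, w2 in zip(tokens[:-1], tokens[1:]):
    bigram_counts[(w1, w2)] += 1
    unigram_counts[w1] += 1

# P(w2 | w1) = C(w1 w2) / C(w1)
for (w1, w2), count in sorted(bigram_counts.items()):
    print(f"P({w2} | {w1}) = {count}/{unigram_counts[w1]} = {count / unigram_counts[w1]:.2f}")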
OBJECTIVE:
The objective of this experiment is to learn to calculate bigrams from a given corpus and
calculate probability of a sentence.
PROCEDURE:
STEP 1: Select a corpus and click on Generate bigram table
STEP 2: Fill up the table that is generated and hit Submit
STEP 3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step
2.
STEP 4: If correct (green), click on take a quiz and fill the correct answer
SIMULATION:
EXAMPLE PROGRAM:
from collections import defaultdict

class BigramModel:
    def __init__(self, training_corpus):
        self.bigram_counts = defaultdict(lambda: defaultdict(int))
        self.unigram_counts = defaultdict(int)
        tokens = training_corpus.lower().split()
        for w1, w2 in zip(tokens[:-1], tokens[1:]):
            self.bigram_counts[w1][w2] += 1
            self.unigram_counts[w1] += 1

    def calculate_probabilities(self):
        # P(w2 | w1) = C(w1 w2) / C(w1)
        bigram_probabilities = defaultdict(lambda: defaultdict(float))
        for w1 in self.bigram_counts:
            for w2 in self.bigram_counts[w1]:
                bigram_probabilities[w1][w2] = self.bigram_counts[w1][w2] / self.unigram_counts[w1]
        return bigram_probabilities

    def sentence_probability(self, sentence):
        # Multiply the bigram probabilities of consecutive word pairs in the sentence
        probabilities = self.calculate_probabilities()
        tokens = sentence.lower().split()
        probability = 1.0
        for w1, w2 in zip(tokens[:-1], tokens[1:]):
            probability *= probabilities[w1][w2]
        return probability

# Example usage
training_corpus = "This is a sample sentence. This sentence is for demonstration purposes."
test_sentence = "This is a sample sentence for demonstration."
bigram_model = BigramModel(training_corpus)
probability = bigram_model.sentence_probability(test_sentence)
print(f"The probability of the sentence is: {probability}")
RESULT
EXPERIMENT - 5
N-GRAM SMOOTHING
AIM:
One major problem with standard N-gram models is that they must be trained from some corpus, and because any particular training corpus is finite, some perfectly acceptable N-grams are bound to be missing from it. We can see that the bigram matrix for any given training corpus is sparse: there are a large number of zero-probability bigrams that should really have some non-zero probability. Such models tend to underestimate the probability of strings that happen not to have occurred in their training corpus. There are techniques that can be used to assign a non-zero probability to these 'zero probability bigrams'. This task of re-evaluating some of the zero-probability and low-probability N-grams, and assigning them non-zero values, is called smoothing.
THEORY:
The standard N-gram models are trained from some corpus. The finiteness of the training corpus leads to the absence of some perfectly acceptable N-grams. This results in sparse bigram matrices. Such models tend to underestimate the probability of strings that do not occur in their training corpus. There are techniques that can be used to assign a non-zero probability to these 'zero probability bigrams'. This task of re-evaluating some of the zero-probability and low-probability N-grams, and assigning them non-zero values, is called smoothing. Some of these techniques are: Add-One Smoothing, Witten-Bell Discounting and Good-Turing Discounting.
Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before normalizing them into probabilities.
Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram probability can be computed
by dividing the count of the word by the total number of word tokens N.
Application on bigrams
Normal bigram probabilities are computed by normalizing each row of counts by the
unigram count:
P(wn|wn-1) = C(wn-1wn)/C(wn-1)
For add-one smoothed bigram counts we need to augment the unigram count by the
number of total word types in the vocabulary V:
p*(wn|wn-1) = ( C(wn-1wn)+1 )/( C(wn-1)+V )
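For example, taking the corpus from the previous experiment, '(eos) You book a flight (eos) I read a book (eos) You read (eos)', and counting (eos) as a vocabulary item, we have C(you read) = 1, C(you) = 2 and V = 7, so the smoothed estimate is p*(read|you) = (1+1)/(2+7) = 2/9 ≈ 0.22, whereas the unsmoothed estimate is 1/2 = 0.5.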
PROCEDURE:
STEP 1: Select a corpus.
STEP 2: Apply add-one smoothing and calculate the bigram probabilities using the given bigram counts, N and V. Fill the table and hit Submit.
STEP 3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step
2.
SIMULATION:
EXAMPLE PROGRAM:
from collections import defaultdict

class BigramModel:
    def __init__(self, corpus):
        self.bigram_counts = defaultdict(int)
        self.unigram_counts = defaultdict(int)
        self.vocab_size = 0
        self.train(corpus)

    def train(self, corpus):
        vocab = set()
        for sentence in corpus:
            tokens = sentence.split()
            vocab.update(tokens)
            for w1, w2 in zip(tokens[:-1], tokens[1:]):
                self.bigram_counts[(w1, w2)] += 1
                self.unigram_counts[w1] += 1
        self.vocab_size = len(vocab)

    def smoothed_probability(self, w1, w2):
        # Add-one smoothing: p*(w2|w1) = (C(w1 w2) + 1) / (C(w1) + V)
        return (self.bigram_counts[(w1, w2)] + 1) / (self.unigram_counts[w1] + self.vocab_size)

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug"
    ]
    model = BigramModel(corpus)
    print(model.smoothed_probability("the", "cat"))
RESULT
EXPERIMENT - 6
POS TAGGING - HIDDEN MARKOV MODEL
AIM:
Part-of-speech (POS) tagging assigns a lexical category to each word in a sentence. For example, the word "park" can have two different lexical categories (noun or verb) depending on the context.
THEORY:
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. In a regular Markov model (Ref: http://en.wikipedia.org/wiki/Markov_model), the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but the output, which depends on the state, is visible.
A Hidden Markov Model has two important components:
1) Transition Probabilities: the probability of moving from one state to another. Transition probabilities A = { aij = P(qj at t+1 | qi at t) }, where qi and qj are states.
2) Emission Probabilities: the output probabilities for an observation from a state. Emission probabilities B = { bi,k = bi(ok) = P(ok | qi) }, where ok is an observation. Informally, B is the probability that the output is ok given that the current state is qi.
For POS tagging, it is assumed that POS tags are generated by a random process, and each process randomly generates a word. Hence, the transition matrix denotes the transition probability from one POS to another, and the emission matrix denotes the probability that a given word can have a particular POS. The words act as the observations. Some of the basic assumptions are: the probability of a tag depends only on the previous tag (first-order Markov assumption), and the probability of a word depends only on its own tag.
Calculating the Emission Probability Matrix
Count the number of times a specific word occurs with a specific POS tag in the corpus, for example the number of times "cut" occurs tagged as verb. Repeat the same for every word-tag combination and fill the emission matrix.
Calculating the Transition Probability Matrix
Count the number of times a specific tag comes after other POS tags in the corpus, for example how often a noun follows a determiner. Repeat the same for every tag pair and fill the transition matrix.
(Example of a tagged corpus fragment: EOS/eos They/pronoun cut/verb the/determiner paper/noun in/preposition ...)
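A minimal sketch of these counts in code, using a tiny hypothetical hand-tagged corpus (not the simulator's corpus):

from collections import defaultdict

# A tiny hypothetical hand-tagged corpus; each sentence is a list of (word, tag) pairs
tagged_corpus = [
    [('they', 'pronoun'), ('cut', 'verb'), ('the', 'determiner'), ('paper', 'noun')],
    [('they', 'pronoun'), ('read', 'verb'), ('a', 'determiner'), ('book', 'noun')],
]

emission_counts = defaultdict(lambda: defaultdict(int))
transition_counts = defaultdict(lambda: defaultdict(int))
history_counts = defaultdict(int)

for sentence in tagged_corpus:
    tags = ['eos'] + [tag for _, tag in sentence] + ['eos']
    for word, tag in sentence:
        emission_counts[tag][word] += 1
    for t1, t2 in zip(tags[:-1], tags[1:]):
        transition_counts[t1][t2] += 1
        history_counts[t1] += 1

# Emission probability P(word | tag)
for tag in emission_counts:
    total = sum(emission_counts[tag].values())
    for word, count in emission_counts[tag].items():
        print(f"P({word} | {tag}) = {count / total:.2f}")

# Transition probability P(tag2 | tag1)
for t1 in transition_counts:
    for t2, count in transition_counts[t1].items():
        print(f"P({t2} | {t1}) = {count / history_counts[t1]:.2f}")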
OBJECTIVE:
The objective of the experiment is to calculate the emission and transition matrices, which will be helpful for tagging parts of speech using a Hidden Markov Model.
PROCEDURE:
STEP1: Select the corpus.
STEP2: For the given corpus fill the emission and transition matrix. Answers are rounded to
2 decimal digits.
RESULT
EXPERIMENT - 7
POS TAGGING - VITERBI DECODING
AIM:
In the previous experiment you calculated the transition and emission matrices; in this experiment they will be used to find the POS tag sequence for a given sentence. Given the emission and transition matrices, various algorithms can be applied to find the POS tags for the words. Some of the possible algorithms are the backward algorithm, the forward algorithm and the Viterbi algorithm. In this experiment, you will become familiar with Viterbi decoding.
THEORY:
Viterbi decoding is based on dynamic programming. The algorithm takes the emission and transition matrices as input. The emission matrix gives the probabilities of the POS tags for a given word, and the transition matrix gives the probability of transition from one POS tag to another. The algorithm observes the sequence of words and returns the state sequence of POS tags along with its probability.
Here "s" denotes words and "t" denotes tags. "a" is the transition matrix and "b" is the emission matrix.
Using the above algorithm, we fill the Viterbi table column by column.
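A minimal sketch of Viterbi decoding; the transition and emission matrices below are illustrative assumptions for a toy two-tag example, not values from the simulator:

def viterbi(observations, states, start_p, trans_p, emit_p):
    # V[t][s] holds the highest probability of any tag path that ends in state s at step t
    V = [{}]
    path = {}
    for s in states:
        V[0][s] = start_p[s] * emit_p[s].get(observations[0], 0.0)
        path[s] = [s]
    for t in range(1, len(observations)):
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(observations[t], 0.0), p)
                for p in states)
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    prob, best_last = max((V[-1][s], s) for s in states)
    return prob, path[best_last]

# Hypothetical matrices for a toy two-tag example
states = ['noun', 'verb']
start_p = {'noun': 0.6, 'verb': 0.4}
trans_p = {'noun': {'noun': 0.3, 'verb': 0.7}, 'verb': {'noun': 0.6, 'verb': 0.4}}
emit_p = {'noun': {'they': 0.3, 'book': 0.4, 'flights': 0.3},
          'verb': {'they': 0.0, 'book': 0.6, 'flights': 0.4}}
print(viterbi(['they', 'book', 'flights'], states, start_p, trans_p, emit_p))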
OBJECTIVE:
The objective of this experiment is to find POS tags of words in a sentence using Viterbi
decoding.
PROCEDURE:
STEP 1: Select the corpus.
STEP 2: Fill each column with the probability of the possible POS tags given the word (i.e. form the Viterbi matrix by filling a column for each observation). Submitted answers are rounded to 3 digits after the decimal point and then checked.
STEP 4: Repeat steps 2 and 3 until all words of the sentence are covered.
STEP 5: Finally, check the POS tag for each word obtained from backtracking.
SIMULATION:
RESULT
EXPERIMENT - 8
PART OF SPEECH TAGGING
AIM:
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical
tagging or word-category disambiguation, is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech, based on both its definition, as well
as its context i.e. relationship with adjacent and related words in a phrase, sentence, or
paragraph. A simplified form of this is identification of words as nouns, verbs, adjectives,
adverbs, etc. Once performed by hand, POS tagging is now done in the context of
computational linguistics, using algorithms which associate discrete terms, as well as hidden
parts of speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into
two distinctive groups: rule-based and stochastic.
THEORY:
In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to
disambiguate parts of speech. HMMs involve counting cases, and making a table of the
probabilities of certain sequences. For example, once you've seen an article such as 'the',
perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun
than a verb or a modal. The same method can of course be used to benefit from knowledge
about following words.
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples
or even larger sequences. So, for example, if you've just seen an article and a verb, the next
item may be very likely a preposition, article, or noun, but much less likely another verb.
When several ambiguous words occur together, the possibilities multiply. However, it is
easy to enumerate every combination and to assign a relative probability to each one, by
multiplying together the probabilities of each choice in turn.
It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural
language parsing, that merely assigning the most common tag to each known word and the
tag "proper noun" to all unknowns, will approach 90% accuracy because many words are
unambiguous.
HMMs underlie the functioning of stochastic taggers and are used in various algorithms.
Accuracies for one such algorithm (TnT) on various training data are shown here.
If only one neighbour is considered as context, the model is called a bigram; similarly, two neighbours as context give a trigram. In this experiment, the size of the training corpus and the context are varied to see their importance.
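A minimal sketch of such a comparison using NLTK's n-gram taggers on the Brown corpus; the category and training sizes are arbitrary assumptions, and the unigram/bigram taggers merely stand in for the lab's algorithms (on older NLTK versions, accuracy() is named evaluate()):

import nltk
from nltk.corpus import brown

nltk.download('brown')

tagged_sents = brown.tagged_sents(categories='news')
test_sents = tagged_sents[3000:3500]

# Vary the training-set size and the amount of context (unigram vs bigram)
for train_size in (500, 1000, 3000):
    train_sents = tagged_sents[:train_size]
    unigram = nltk.UnigramTagger(train_sents)
    bigram = nltk.BigramTagger(train_sents, backoff=unigram)
    print(train_size,
          round(unigram.accuracy(test_sents), 3),
          round(bigram.accuracy(test_sents), 3))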
OBJECTIVE:
The objective of the experiment is to know the importance of context and size of training
corpus in learning Parts of Speech.
PROCEDURE:
STEP 1: Select the language.
OUTPUT: Drop-downs to select the size of corpus, algorithm and features will appear.
SIMULATION:
EXAMPLE PROGRAM:
import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')  # requires nltk.download('brown')
pos_tagger = nltk.UnigramTagger(brown_tagged_sents)
RESULT
EXPERIMENT - 9
CHUNKING
AIM:
Chunking of text involves dividing a text into syntactically correlated word groups. For example, the sentence 'He ate an apple.' can be divided as follows:
Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit. This can be formally expressed by using IOB prefixes.
THEORY:
Chunk Types
The chunk types are based on the syntactic category of the chunk's head. Besides the head, a chunk also contains modifiers (like determiners, adjectives and postpositions in NPs).
NP Noun Chunks
Noun chunks will be given the tag NP and include non-recursive noun phrases, together with the postposition for Indian languages but not the preposition for English. Determiners, adjectives and other modifiers will be part of the noun chunk.
Eg:
Verb Chunks
The verb chunks are marked as VP for English; however, they are of several types for Indian languages. A verb group will include the main verb and its auxiliaries, if any.
For English:
The types of verb chunks and their tags are described below.
3. VGNN Gerunds
A verb chunk having a gerund will be annotated as VGNN.
JJP/ADJP Adjectival Chunks
An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This chunk will consist of all adjectival chunks, including predicative adjectives.
Eg:
वह लड़की है (vah ladkii hai)
is (ripe/JJ)ADJP
Note: Adjectives appearing before a noun will be grouped together within the noun chunk.
Eg:
PP Prepositional Chunk
This chunk type is present only for English and not for Indian languages. It consists of only the preposition and not the NP argument.
Eg:
(with/IN)PP a pen
IOB prefixes
Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit. This can be formally expressed by using IOB prefixes: B-CHUNK
for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an
example of the file format:
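A minimal sketch that produces IOB-tagged output with NLTK; the regular-expression grammar below is a simplifying assumption for noun chunks only:

import nltk
from nltk.chunk import tree2conlltags

nltk.download('averaged_perceptron_tagger')

sentence = "He ate an apple ."
tagged = nltk.pos_tag(sentence.split())

# A simple regular-expression grammar: a noun chunk is an optional determiner,
# any number of adjectives, and one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Convert the chunk tree to (word, POS, IOB) triples
for word, pos, iob in tree2conlltags(tree):
    print(word, pos, iob)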
OBJECTIVE:
The objective of this experiment is to understand the concept of chunking and get familiar
with the basic chunk tagset.
PROCEDURE:
STEP3: Select the corresponding chunk-tag for each word in the sentence and click
the Submit button.
OUTPUT1: The submitted answer will be checked.
EXAMPLE PROGRAM:
def chunk_numbers(numbers, size):
    # Split a list of numbers into consecutive chunks of the given size
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def chunk_words(words, size):
    # Split a list of words into consecutive chunks of the given size
    return [words[i:i + size] for i in range(0, len(words), size)]

if __name__ == '__main__':
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    words = ['apple', 'banana', 'cherry', 'date', 'fig', 'grape', 'kiwi', 'lemon', 'mango', 'orange']
    chunked_numbers = chunk_numbers(numbers, 4)
    chunked_words = chunk_words(words, 3)
    print("Chunked numbers:")
    for chunk in chunked_numbers:
        print(chunk)
    print("\nChunked words:")
    for chunk in chunked_words:
        print(chunk)
RESULT
EXPERIMENT - 10
BUILDING CHUNKER
AIM:
Chunking is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.) which are correlated. These are non-overlapping regions of text. Usually, each chunk contains a head, with the possible addition of some function words and modifiers either before or after, depending on the language. Chunks are non-recursive in nature, i.e. a chunk cannot contain another chunk of the same category.
1. Noun Group
2. Verb Group
For example, the sentence 'He reckons the current account deficit will narrow to only 1.8
billion in September.' can be divided as follows:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only 1.8
billion ] [PP in ] [NP September ]
Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit.
THEORY:
In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to
disambiguate parts of speech. HMMs involve counting cases, and making a table of the
probabilities of certain sequences. For example, once you've seen an article such as 'the',
perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun
than a verb or a modal. The same method can of course be used to benefit from knowledge
about following words.
More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples
or even larger sequences. So, for example, if you've just seen an article and a verb, the next
item may be very likely a preposition, article, or noun, but much less likely another verb.
When several ambiguous words occur together, the possibilities multiply. However, it is
easy to enumerate every combination and to assign a relative probability to each one, by
multiplying together the probabilities of each choice in turn.
It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural
language parsing, that merely assigning the most common tag to each known word and the
tag "proper noun" to all unknowns, will approach 90% accuracy because many words are
unambiguous.
HMMs underlie the functioning of stochastic taggers and are used in various algorithms.
Accuracies for one such algorithm (TnT) on various training data are shown here.
Conditional random fields (CRFs) are a class of statistical modelling methods often applied in machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighbouring" samples, a CRF can take context into account. Since it can consider context, a CRF can be used in natural language processing; parts-of-speech tagging is therefore also possible, predicting the POS using the lexical items as the context.
In this experiment both algorithms (HMM and CRF) are used for training and testing. As the size of the training corpus increases, it is observed that accuracy increases. The choice of features also plays an important role in the quality of the output: we can see that using the part of speech as a feature performs better than using only the lexical item as the feature. Therefore, it is important to select proper features when training a model in order to obtain better accuracy.
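A minimal sketch of training and evaluating a simple chunker on NLTK's CoNLL-2000 chunking data; a unigram tagger over POS tags is used here only to illustrate the training/evaluation setup, not the lab's HMM or CRF algorithms:

import nltk
from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags, conlltags2tree

nltk.download('conll2000')

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Learn the most likely IOB chunk tag for each POS tag
        train_data = [[(pos, iob) for word, pos, iob in tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, tagged_sent):
        pos_tags = [pos for word, pos in tagged_sent]
        iob_tags = [iob for pos, iob in self.tagger.tag(pos_tags)]
        conlltags = [(word, pos, iob) for (word, pos), iob in zip(tagged_sent, iob_tags)]
        return conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
chunker = UnigramChunker(train_sents)
print(chunker.evaluate(test_sents))  # prints IOB accuracy, precision, recall and F-measure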
OBJECTIVE:
The objective of the experiment is to understand the importance of selecting proper features for training a model, and of the size of the training corpus, in learning how to do chunking.
PROCEDURE:
STEP1: Select the language.
OUTPUT: Drop down to select size of corpus, algorithm and features will appear.
STEP4: Select feature "only lexicon", "only POS", "lexicon and POS".
RESULT