
NATURAL LANGUAGE PROCESSING LAB MANUAL

(R20 REGULATION)
III B.TECH CSE (AIML) – II SEM

NAME OF THE STUDENT

ROLL.NO

YEAR

BRANCH
INDEX
S.No.    Date    Name of the Program    Page No.    Remarks
EXPERIMENT - 1

WORD ANALYSIS
AIM:

A word can be simple or complex. For example, the word 'cat' is simple because one cannot
further decompose it into smaller parts. On the other hand, the word 'cats' is complex,
because it is made up of two parts: the root 'cat' and the plural suffix '-s'.

THEORY:
Analysis of a word into root and affix(es) is called morphological analysis of a word. It is
mandatory to identify the root of a word for any natural language processing task. A root word
can have various forms. For example, the word 'play' in English has the following forms:
'play', 'plays', 'played' and 'playing'. Hindi has a larger number of forms for the word 'खेल'
(khela), which is equivalent to 'play'. The forms of 'खेल' (khela) are the following:

खेल (khela), खेला (khelaa), खेली (khelii), खेलूंगा (kheluungaa), खेलूंगी (kheluungii),
खेलेगा (khelegaa), खेलेगी (khelegii), खेलते (khelate), खेलती (khelatii), खेलने (khelane),
खेलकर (khelakar).

For the Telugu root ఆడడం (Adadam), the forms are the following:

Adutaanu, AdutunnAnu, Adenu, Ademu, AdevA, AdutAru, Adutunnaru, AdadAniki, Adesariki,


AdanA, Adinxi, Adutunxi, AdinxA, AdeserA, Adestunnaru, ...

Thus, we understand that morphological richness varies from one language to another. Indian
languages are generally morphologically rich, and therefore morphological analysis of words
becomes a very significant task for Indian languages.

Types of Morphology

Morphology is of two types,

1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For example,
'played' is an inflection of the root word 'play'. Here, both 'played' and 'play' are verbs.
2. Derivational morphology
Deals with word forms of a root, where there is a change in the lexical category. For
example, the word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness' is
a derived noun form of the adjective 'happy'.

Morphological Features:

All words will have their lexical category attested during morphological analysis.
A noun or pronoun can take suffixes of the following features: gender, number, person, case.
For example, morphological analysis of a few words is given below:

A verb can take suffixes of the following features: tense, aspect, modality, gender, number,
person

'rt' stands for root. 'cat' stands for lexical category. The value of the lexical category can be
noun, verb, adjective, pronoun, adverb or preposition. 'gen' stands for gender; its value can be
masculine or feminine. 'num' stands for number; its value can be singular (sg) or plural (pl).
'per' stands for person; its value can be 1, 2 or 3.

The value of tense can be present, past or future; this feature is applicable for verbs. The
value of aspect can be perfect (pft), continuous (cont) or habitual (hab); this feature is also
applicable only for verbs.

'case' can be direct or oblique; this feature is applicable for nouns. A case is oblique when a
postposition occurs after the noun. If no postposition can occur after the noun, then the case
is direct. This is applicable for Hindi but not for English, as English does not have
postpositions. Some of the postpositions in Hindi are: का (kaa), की (kii), के (ke), को (ko),
में (meM).
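As a small illustration of this feature scheme, the hand-written sketch below represents the morphological analysis of a few words as Python dictionaries using the 'rt', 'cat', 'gen', 'num', 'per', 'tense' and 'case' labels described above; the entries are indicative only, not the output of an actual analyser.

# A minimal sketch: hand-written analyses using the feature labels above.
# The values are illustrative, not produced by a morphological analyser.
analyses = {
    "cats":   {"rt": "cat",  "cat": "noun", "num": "pl", "per": 3, "case": "direct"},
    "played": {"rt": "play", "cat": "verb", "tense": "past"},
    "खेलेगा": {"rt": "खेल", "cat": "verb", "gen": "masculine", "num": "sg", "per": 3, "tense": "future"},
}

for word, features in analyses.items():
    print(word, "->", ", ".join(f"{key}={value}" for key, value in features.items()))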

OBJECTIVE:
The objective of the experiment is to learn about morphological features of a word by
analysing it.

PROCEDURE:

STEP 1: Select the language.

OUTPUT: Drop down for selecting words will appear.

STEP 2: Select the word.

OUTPUT: Drop down for selecting features will appear.

STEP 3: Select the features.

STEP 4: Click "Check" button to check your answer.

OUTPUT: Correct features are marked with a tick and wrong features with a cross.

SIMULATION:
EXAMPLE PROGRAM:

import nltk

from nltk.tokenize import word_tokenize

from nltk.probability import FreqDist

from nltk.corpus import stopwords

from nltk.tag import pos_tag

# Download NLTK data if not already downloaded

nltk.download('punkt')

nltk.download('averaged_perceptron_tagger')

nltk.download('stopwords')

# Sample text for analysis

text = "Natural Language Processing (NLP) is a subfield of linguistics, computer science, and
artificial intelligence concerned with the interactions between computers and human
language."

# Tokenize the text

tokens = word_tokenize(text)

# Remove stopwords

stop_words = set(stopwords.words('english'))

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Perform part-of-speech tagging

pos_tags = pos_tag(filtered_tokens)

# Frequency distribution of words

fdist = FreqDist(filtered_tokens)
# Print results

print("Tokenized Text:", tokens)

print("Filtered Tokens (without stopwords):", filtered_tokens)

print("Part-of-Speech Tags:", pos_tags)

print("Most Common Words:", fdist.most_common(5))

RESULT
EXPERIMENT - 2
WORD GENERATION

AIM:

A word can be simple or complex. For example, the word 'cat' is simple because one cannot
further decompose it into smaller parts. On the other hand, the word 'cats' is complex,
because it is made up of two parts: the root 'cat' and the plural suffix '-s'.

THEORY:

Given the root and suffix information, a word can be generated. For example,

 Morphological analysis and generation: Inverse processes.

 Analysis may involve non-determinism, since more than one analysis is possible.

 Generation is a deterministic process. If a language allows spelling variation, then to
that extent generation would also involve non-determinism (see the sketch below).
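A minimal generation sketch is given below, assuming a hand-written lookup table that maps a root and a feature bundle to its suffix; a real generator uses full paradigm tables, but the idea of deterministic lookup-and-attach is the same.

# A minimal word-generation sketch: given root and feature information,
# look up the suffix and attach it. The table is a hand-written illustration.
paradigm = {
    ("play", "verb", "past"):        "ed",
    ("play", "verb", "progressive"): "ing",
    ("play", "verb", "3sg_present"): "s",
}

def generate(root, category, features):
    return root + paradigm[(root, category, features)]

print(generate("play", "verb", "past"))         # played
print(generate("play", "verb", "progressive"))  # playing
print(generate("play", "verb", "3sg_present"))  # plays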

OBJECTIVE:
The objective of the experiment is to generate word forms from root and suffix information.
PROCEDURE:

STEP 1: Select the language.

OUTPUT: Drop downs for selecting root and other features will appear.

STEP 2: Select the root and other features.

STEP 3: After selecting all the features, select the word corresponding to the features
selected above.

STEP 4: Click the Check button to see whether the right word is selected.

OUTPUT: The output tells whether the selected word is right or wrong.

SIMULATION:
EXAMPLE PROGRAM:

import nltk

import random

# Sample corpus of words
corpus = ["apple", "banana", "orange", "grape", "pineapple", "kiwi", "strawberry", "blueberry"]

# Function to generate new words based on the letter frequency distribution
def generate_word(corpus):
    letters = [letter for word in corpus for letter in word]
    freq_dist = nltk.FreqDist(letters)
    word_length = random.randint(3, 8)  # Random word length between 3 and 8 characters
    new_word = ''.join(random.choices(list(freq_dist.keys()),
                                      weights=list(freq_dist.values()),
                                      k=word_length))
    return new_word

# Generate 5 new words
for _ in range(5):
    print(generate_word(corpus))

RESULT
EXPERIMENT - 3

MORPHOLOGY

AIM:
Morphology is the study of the way words are built up from smaller meaning-bearing units,
i.e. morphemes. A morpheme is the smallest meaningful linguistic unit.

For Example:

 बच्चों (bachchoM) consists of two morphemes: बच्चा (bachchaa) carries the information
of the root word, the noun "बच्चा" (bachchaa), and ओं (oM) carries the information of
plural number and oblique case.
 'played' has two morphemes, 'play' and '-ed', carrying the information of the verb "play"
and past tense, so the given word is the past-tense form of the verb "play".

Words can be analysed morphologically if we know all variants of a given root word. We can
use an 'Add-Delete' table for this analysis.

THEORY:

Morph Analyser

Definition

Morphemes are the smallest meaningful units of language. A morpheme can either be a root
word (play) or an affix (-ed). The combination of these morphemes is called a morphological
process. So, the word "played" is made out of two morphemes, "play" and "-ed". Finding all
the parts (morphemes) of a word and thereby describing the properties of the word is called
"morphological analysis". For example, "played" carries the information verb "play" and
"past tense", so the given word is the past-tense form of the verb "play".

Analysis of a word:

बच्चों (bachchoM) = बच्चा (bachchaa) (root) + ओं (oM) (suffix) (ओं = 3rd person, plural, oblique)

A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be
classified according to shared inflectional categories (for example: number, case, etc.) and
arranged into tables.

Paradigm for बच्चा

Algorithm to get बच्चों (bachchoM) from बच्चा (bachchaa)

1. Take the root बच्चा = बच्च (bachch) + आ (aa)

2. Delete आ (aa)

3. Output: बच्च (bachch)

4. Add ओं (oM) to the output

5. Return बच्चों (bachchoM)

Therefore, आ is deleted and ओं is added to get बच्चों.

Add-Delete table for बच्चा


Paradigm Class

Words in the same paradigm class behave similarly. For example, लड़का is in the same
paradigm class as बच्चा, so लड़का would behave similarly to बच्चा, as they share the same
paradigm class (see the sketch below).
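A minimal sketch of this behaviour is given below: the delete-आ/add-ओं rule worked out for बच्चा above is applied unchanged to लड़का, since both roots share the same paradigm class. The rule is written in plain Python; the delete part uses the dependent vowel sign (matra) ा.

# Apply the same Add-Delete rule to every root of the paradigm class.
def apply_rule(root, delete, add):
    stem = root[:-len(delete)] if delete and root.endswith(delete) else root
    return stem + add

rule_plural_oblique = ("ा", "ों")   # delete ा, add ों

for root in ["बच्चा", "लड़का"]:
    print(root, "->", apply_rule(root, *rule_plural_oblique))
# बच्चा -> बच्चों
# लड़का -> लड़कों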

OBJECTIVE:
The objective of the experiment is to understand the morphology of a word by the use of an
Add-Delete table.

PROCEDURE:
STEP 1: Select a word root.

STEP 2: Fill the add-delete table and submit.

STEP 3: If wrong, see the correct answer or repeat STEP1.

SIMULATION:
EXAMPLE PROGRAM:

class MorphologicalAnalyzer:

    def __init__(self):
        self.root_words = {}

    def add_root_word(self, root, variants):
        self.root_words[root] = variants

    def morphological_analysis(self, word):
        for root, morphemes in self.root_words.items():
            for morpheme in morphemes:
                if morpheme in word:
                    # Display the root word and morpheme found
                    print(f"Root Word: {root}, Morpheme: {morpheme}")
                    break

# Example usage
analyzer = MorphologicalAnalyzer()

# Adding root words and their morphological variants
# (suffixes are stored exactly as they appear inside the inflected word,
# e.g. 'ed' rather than '-ed', so the substring check matches)
analyzer.add_root_word("बच्चा", ["ों"])
analyzer.add_root_word("play", ["ed"])

# Performing morphological analysis
word1 = "बच्चों"
word2 = "played"

print(f"Morphological Analysis of '{word1}':")
analyzer.morphological_analysis(word1)

print(f"\nMorphological Analysis of '{word2}':")
analyzer.morphological_analysis(word2)

RESULT
EXPERIMENT - 4

N-Grams

AIM:
The probability of a sentence can be calculated as the probability of the sequence of words
occurring in it. We can use the Markov assumption, that the probability of a word in a sentence
depends only on the word occurring just before it. Such a model is called a first-order Markov
model or the bigram model.

Here, Wn refers to the word token corresponding to the nth word in a sequence.

THEORY:
A combination of words forms a sentence. However, such a formation is meaningful only
when the words are arranged in some order.

Eg: Sit I car in the

Such a sentence is not grammatically acceptable. However, some perfectly grammatical
sentences can be nonsensical too!

Eg: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to strings of
words, i.e., how likely the sentence is.

Probability of a sentence

If we consider each word occurring in its correct location as an independent event, the
probability of the sentence is P(w(1), w(2), ..., w(n-1), w(n)).

Using the chain rule:
P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) * ... * P(w(n) | w(1)w(2)...w(n-1))

Bigrams

We can avoid this very long calculation by approximating that the probability of a given word
depends only on its previous word. This assumption is called the Markov assumption, and such
a model is called a Markov model; with one word of history it is the bigram model. Bigrams can
be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order
Markov model.

Therefore, P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) * P(w(2)|w(1)) * P(w(3)|w(2)) * ... * P(w(n)|w(n-1))


We use (eos) tag to mark the beginning and end of a sentence.

A bigram table for a given corpus can be generated and used as a lookup table for
calculating probability of sentences.

Eg: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos).

Bigram Table:

P((eos) you read a book (eos))
= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)
= 0.33 * 0.5 * 0.5 * 0.5 * 0.5
= 0.020625
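The calculation above can be reproduced with a few lines of Python. The sketch below builds the bigram counts from the toy corpus and scores the sentence; it assumes each sentence is bounded by its own (eos) tokens and that unigram counts include those boundary tokens, which is what gives P(you|eos) = 2/6 ≈ 0.33.

# A minimal sketch of a bigram lookup table built from the toy corpus above.
from collections import defaultdict

sentences = [
    "(eos) You book a flight (eos)",
    "(eos) I read a book (eos)",
    "(eos) You read (eos)",
]

bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)

for sent in sentences:
    tokens = sent.lower().split()
    for token in tokens:
        unigram_counts[token] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[w1][w2] += 1

def bigram_prob(w2, w1):
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return bigram_counts[w1][w2] / unigram_counts[w1]

test = "(eos) you read a book (eos)".split()
probability = 1.0
for w1, w2 in zip(test, test[1:]):
    probability *= bigram_prob(w2, w1)

print(probability)   # approximately 0.0208, i.e. 2/6 * 0.5 * 0.5 * 0.5 * 0.5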

OBJECTIVE:
The objective of this experiment is to learn to calculate bigrams from a given corpus and
calculate probability of a sentence.

PROCEDURE:
STEP 1: Select a corpus and click on Generate bigram table
STEP 2: Fill up the table that is generated and hit Submit
STEP 3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step
2.

STEP 4: If correct (green), click on take a quiz and fill the correct answer
SIMULATION:

EXAMPLE PROGRAM:

from collections import defaultdict

import nltk
from nltk import bigrams, word_tokenize

nltk.download('punkt')  # needed by word_tokenize

class BigramModel:
    def __init__(self, training_corpus):
        self.bigram_counts = defaultdict(lambda: defaultdict(int))
        self.unigram_counts = defaultdict(int)

        # Tokenize the training corpus into words (lowercased for case-insensitivity)
        tokens = word_tokenize(training_corpus.lower())

        # Create bigrams from the tokenized corpus
        bigram_model = list(bigrams(tokens))

        # Calculate the counts of each bigram and unigram
        for bigram in bigram_model:
            self.bigram_counts[bigram[0]][bigram[1]] += 1
            self.unigram_counts[bigram[0]] += 1

    def calculate_probabilities(self):
        bigram_probabilities = defaultdict(lambda: defaultdict(float))

        # Calculate the probabilities with Laplace (add-one) smoothing
        for word, next_word_counts in self.bigram_counts.items():
            # V is approximated here by the number of distinct first words
            total_count = self.unigram_counts[word] + len(self.bigram_counts)
            for next_word, count in next_word_counts.items():
                bigram_probabilities[word][next_word] = (count + 1) / total_count

        return bigram_probabilities

    def sentence_probability(self, sentence):
        sentence_tokens = word_tokenize(sentence.lower())
        sentence_bigrams = list(bigrams(sentence_tokens))

        probabilities = self.calculate_probabilities()
        probability = 1.0

        # Calculate the overall probability of the sentence
        for bigram in sentence_bigrams:
            word, next_word = bigram
            probability *= probabilities[word][next_word]

        return probability

# Example usage
training_corpus = "This is a sample sentence. This sentence is for demonstration purposes."
test_sentence = "This is a sample sentence for demonstration."

bigram_model = BigramModel(training_corpus)
probability = bigram_model.sentence_probability(test_sentence)
print(f"The probability of the sentence is: {probability}")
RESULT

EXPERIMENT –5

N-GRAM SMOOTHING

AIM: One major problem with standard N-gram models is that they must be trained from some
corpus, and because any particular training corpus is finite, some perfectly acceptable N-grams
are bound to be missing from it. We can see that the bigram matrix for any given training
corpus is sparse: there are a large number of zero-probability bigrams that should really have
some non-zero probability. Such models tend to underestimate the probability of strings that
happen not to have occurred in their training corpus.

There are some techniques that can be used for assigning a non-zero probability to these
'zero probability bigrams'. This task of re-evaluating some of the zero-probability and
low-probability N-grams, and assigning them non-zero values, is called smoothing.

THEORY: The standard N-gram models are trained from some corpus. The finiteness of the
training corpus leads to the absence of some perfectly acceptable N-grams. This results in
sparse bigram matrices. Such models tend to underestimate the probability of strings that do
not occur in their training corpus.

There are some techniques that can be used for assigning a non-zero probability to these
'zero probability bigrams'. This task of re-evaluating some of the zero-probability and
low-probability N-grams, and assigning them non-zero values, is called smoothing. Some of
the techniques are: Add-One Smoothing, Witten-Bell Discounting and Good-Turing Discounting.

Add-One Smoothing

In add-one smoothing, we add one to all the bigram counts before normalizing them into
probabilities.

Application on unigrams

The unsmoothed maximum likelihood estimate of the unigram probability can be computed
by dividing the count of the word by the total number of word tokens N.
Application on bigrams

Normal bigram probabilities are computed by normalizing each row of counts by the
unigram count:
P(wn|wn-1) = C(wn-1wn)/C(wn-1)

For add-one smoothed bigram counts we need to augment the unigram count by the
number of total word types in the vocabulary V:
p*(wn|wn-1) = ( C(wn-1wn)+1 )/( C(wn-1)+V )
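A small numeric sketch of this formula is shown below; the bigram and unigram counts are hand-written, illustrative values rather than counts from any particular corpus.

# A minimal sketch of add-one smoothing:
# p*(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
bigram_counts = {("you", "read"): 1, ("you", "book"): 1}
unigram_counts = {"you": 2, "read": 2, "book": 2, "a": 2, "flight": 1, "i": 1, "(eos)": 6}
V = len(unigram_counts)   # vocabulary size = number of word types (7 here)

def add_one_prob(prev_word, word):
    c_bigram = bigram_counts.get((prev_word, word), 0)
    return (c_bigram + 1) / (unigram_counts[prev_word] + V)

# A seen bigram and an unseen ("zero probability") bigram both get non-zero mass:
print(add_one_prob("you", "read"))    # (1 + 1) / (2 + 7) = 0.222...
print(add_one_prob("you", "flight"))  # (0 + 1) / (2 + 7) = 0.111...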

OBJECTIVE: The objective of this experiment is to learn how to apply add-one smoothing on a
sparse bigram table.

PROCEDURE:
STEP 1: Select a corpus.

STEP 2: Apply add-one smoothing and calculate the bigram probabilities using the given bigram
counts, N and V. Fill the table and hit Submit.
STEP 3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step
2.
SIMULATION:
EXAMPLE PROGRAM:
from collections import defaultdict

class BigramModel:
    def __init__(self, corpus):
        self.bigram_counts = defaultdict(int)
        self.unigram_counts = defaultdict(int)
        self.vocab_size = 0
        self.train(corpus)

    def train(self, corpus):
        for sentence in corpus:
            sentence = sentence.split()
            for i in range(len(sentence) - 1):
                self.bigram_counts[(sentence[i], sentence[i+1])] += 1
                self.unigram_counts[sentence[i]] += 1
        self.vocab_size = len(self.unigram_counts)

    def prob(self, word1, word2):
        count_bigram = self.bigram_counts[(word1, word2)]
        count_unigram = self.unigram_counts[word1]
        # Apply Add-One smoothing
        smoothed_prob = (count_bigram + 1) / (count_unigram + self.vocab_size)
        return smoothed_prob

    def sentence_prob(self, sentence):
        sentence = sentence.split()
        prob = 1.0
        for i in range(len(sentence) - 1):
            prob *= self.prob(sentence[i], sentence[i+1])
        return prob

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug"
    ]

    model = BigramModel(corpus)

    test_sentence = "the cat sat on the rug"
    print("Probability of the test sentence:", model.sentence_prob(test_sentence))

RESULT
EXPERIMENT - 6

POS TAGGING - Hidden Markov Model

AIM: POS tagging, or part-of-speech tagging, is the procedure of assigning a grammatical
category like noun, verb, adjective, etc. to a word. In this process both the lexical
information and the context play an important role, as the same lexical form can behave
differently in a different context.

For example the word "Park" can have two different lexical categories based on the context.

 The boy is playing in the park. ('Park' is Noun)


 Park the car. ('Park' is Verb)

THEORY:
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states. In a regular
Markov model (Ref: http://en.wikipedia.org/wiki/Markov_model), the state is directly visible to the
observer, and therefore the state transition probabilities are the only parameters. In a hidden
Markov model, the state is not directly visible, but the output, which depends on the state,
is visible.
A Hidden Markov Model has two important components:

1) Transition Probabilities: the one-step transition probability is the probability of
transitioning from one state to another in a single step.

2) Emission Probabilities: the output probabilities for an observation from a state. Emission
probabilities B = { bi,k = bi(ok) = P(ok | qi) }, where ok is an observation. Informally, B is the
probability that the output is ok given that the current state is qi.

For POS tagging, it is assumed that POS tags are generated by a random process, and each
process randomly generates a word. Hence, the transition matrix denotes the transition
probability from one POS tag to another, and the emission matrix denotes the probability that
a given word can have a particular POS tag. Words act as the observations. Some of the basic
assumptions are:

1. First-order (bigram) Markov assumptions:

a. Limited Horizon: a tag depends only on the previous tag
P(t(i+1) = tk | t(1) = tj1, ..., t(i) = tji) = P(t(i+1) = tk | t(i) = tji)

b. Time invariance: no change over time
P(t(i+1) = tk | t(i) = tj) = P(t(2) = tk | t(1) = tj) = P(tj -> tk)

2. Output probabilities:
- The probability of getting word wk for tag tj, P(wk | tj), is independent of other tags or
words!
Calculating the Probabilities

Consider the given toy corpus:

EOS/eos They/pronoun cut/verb the/determiner paper/noun EOS/eos He/pronoun asked/verb
for/preposition his/pronoun cut/noun EOS/eos Put/verb the/determiner paper/noun
in/preposition ...

Calculating the Emission Probability Matrix

Count the no. of times a specific word occurs with a specific POS tag in the corpus.
Here, say for "cut":

count(cut, verb) = 1
count(cut, noun) = 2
count(cut, determiner) = 0
and so on, zero for the other tags too.

Now, calculating the probability:
count(cut) = total count of cut = 3

Probability to be filled in the matrix cell at the intersection of cut and verb:
P(cut/verb) = count(cut, verb) / count(cut) = 1/3 = 0.33

Similarly, probability to be filled in the cell at the intersection of cut and determiner:
P(cut/determiner) = count(cut, determiner) / count(cut) = 0/3 = 0

Repeat the same for all the word-tag combinations and fill the matrix.

Calculating Transition Probability Matrix

Count the no. of times a specific tag comes after other POS tags in the corpus.
Here, say for "determiner":

count(verb, determiner) = 2
count(preposition, determiner) = 1
count(determiner, determiner) = 0
count(noun, determiner) = 0
count(eos, determiner) = 0
and so on, zero for the other tags too.

Now, calculating the probability:
count(determiner) = total count of tag 'determiner' = 3

Probability to be filled in the cell at the intersection of determiner (in the column) and
verb (in the row):
P(determiner/verb) = count(verb, determiner) / count(determiner) = 2/3 = 0.66

Similarly, probability to be filled in the cell at the intersection of determiner (in the
column) and noun (in the row):
P(determiner/noun) = count(noun, determiner) / count(determiner) = 0/3 = 0

Repeat the same for all the tags.
Note: EOS/eos is a special marker which represents End Of Sentence.
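A minimal sketch of these counts in plain Python is given below. Only a two-sentence fragment of the corpus above is used, so the numbers differ from the worked example, but the cell formulas follow the same convention: emission cells are normalised by the word count and transition cells by the count of the column tag.

from collections import defaultdict

# A two-sentence fragment of the hand-tagged toy corpus above.
tagged_corpus = [
    [("they", "pronoun"), ("cut", "verb"), ("the", "determiner"), ("paper", "noun")],
    [("he", "pronoun"), ("asked", "verb"), ("for", "preposition"),
     ("his", "pronoun"), ("cut", "noun")],
]

word_tag = defaultdict(lambda: defaultdict(int))   # count(word, tag)
word_total = defaultdict(int)                      # count(word)
next_tag = defaultdict(lambda: defaultdict(int))   # count(prev_tag, tag)
tag_total = defaultdict(int)                       # count(tag)

for sentence in tagged_corpus:
    prev = "eos"                                   # EOS marks the sentence boundary
    for word, tag in sentence:
        word_tag[word][tag] += 1
        word_total[word] += 1
        next_tag[prev][tag] += 1
        tag_total[tag] += 1
        prev = tag
    next_tag[prev]["eos"] += 1

# Emission cell, as defined above: P(cut/verb) = count(cut, verb) / count(cut)
print("P(cut/verb) =", word_tag["cut"]["verb"] / word_total["cut"])

# Transition cell, as defined above:
# P(determiner/verb) = count(verb, determiner) / count(determiner)
print("P(determiner/verb) =", next_tag["verb"]["determiner"] / tag_total["determiner"])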

OBJECTIVE:
The objective of the experiment is to calculate emission and transition matrix which will be
helpful for tagging Parts of Speech using Hidden Markov Model.

PROCEDURE:
STEP1: Select the corpus.

STEP2: For the given corpus fill the emission and transition matrix. Answers are rounded to
2 decimal digits.

STEP3: Press Check to check your answer.


Wrong answers are indicated by the red cell.
SIMULATION:

RESULT
EXPERIMENT - 7

POS TAGGING - Viterbi Decoding

AIM:
In the previous experiment you calculated the transition and emission matrices; in this
experiment they will be used to find the POS tag sequence for a given sentence. Once we have
the emission and transition matrices, various algorithms can be applied to find the POS tags
for the words. Some of the possible algorithms are: the backward algorithm, the forward
algorithm and the Viterbi algorithm. In this experiment, you will get familiar with Viterbi
decoding.

THEORY:
Viterbi decoding is based on dynamic programming. The algorithm takes the emission and
transition matrices as input. The emission matrix gives us information about the probabilities
of a POS tag for a given word, and the transition matrix gives the probability of transition
from one POS tag to another POS tag. It observes the sequence of words and returns the state
sequence of POS tags along with its probability.

Here "s" denotes words and "t" denotes tags. "a" is the transition matrix and "b" is the
emission matrix.

Using the above algorithm, we have to fill the Viterbi table column by column.
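A minimal, generic Viterbi decoding sketch is given below. The tag set and the a (transition) and b (emission) dictionaries are hand-written toy values, not the lab's matrices; a[t1][t2] is the probability of moving from tag t1 to tag t2 and b[t][w] is the probability of emitting word w from tag t. The function fills the Viterbi table column by column and backtracks to recover the best tag sequence.

def viterbi(words, tags, a, b, start_tag="eos"):
    V = [{}]      # V[i][t] = best probability of a tag path ending in tag t at word i
    back = [{}]   # backpointers for recovering the best path

    # Initialisation: first column of the Viterbi table
    for t in tags:
        V[0][t] = a[start_tag].get(t, 0.0) * b[t].get(words[0], 0.0)
        back[0][t] = None

    # Fill the table column by column
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev, best_score = None, 0.0
            for prev in tags:
                score = V[i - 1][prev] * a[prev].get(t, 0.0) * b[t].get(words[i], 0.0)
                if score > best_score:
                    best_prev, best_score = prev, score
            V[i][t] = best_score
            back[i][t] = best_prev

    # Backtrack from the best tag in the last column
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.insert(0, back[i][path[0]])
    return path, V[-1][last]

# Hand-written toy matrices, only to show the call shape
tags = ["pronoun", "verb", "determiner", "noun"]
a = {"eos": {"pronoun": 0.6, "verb": 0.4},
     "pronoun": {"verb": 1.0},
     "verb": {"determiner": 0.7, "noun": 0.3},
     "determiner": {"noun": 1.0},
     "noun": {}}
b = {"pronoun": {"they": 0.5}, "verb": {"cut": 0.4},
     "determiner": {"the": 0.9}, "noun": {"paper": 0.5, "cut": 0.2}}

print(viterbi(["they", "cut", "the", "paper"], tags, a, b))
# (['pronoun', 'verb', 'determiner', 'noun'], approximately 0.0378)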
OBJECTIVE:
The objective of this experiment is to find POS tags of words in a sentence using Viterbi
decoding.

PROCEDURE:
STEP1:Select the corpus.

OUTPUT: Emission and Transmission matrix will appear.

STEP2: Fill the column with the probability of the possible POS tags given the word (i.e. form
the Viterbi matrix by filling the column for each observation). Submitted answers are rounded
off to 3 digits after the decimal and are then checked.

STEP3: Check the column.

Wrong answers are indicated by a red background in a cell.

If the answers are right, then go back to STEP2 for the next column.

STEP4: Repeat steps 2 and 3 until all words of the sentence are covered.

STEP5: At last, check the POS tag for each word obtained from backtracking.
SIMULATION:

RESULT
EXPERIMENT –8

BUILDING POS TAGGER

AIM:
In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical
tagging or word-category disambiguation, is the process of marking up a word in a text
(corpus) as corresponding to a particular part of speech, based on both its definition and its
context, i.e. its relationship with adjacent and related words in a phrase, sentence, or
paragraph. A simplified form of this is the identification of words as nouns, verbs, adjectives,
adverbs, etc. Once performed by hand, POS tagging is now done in the context of
computational linguistics, using algorithms which associate discrete terms, as well as hidden
parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two distinctive
groups: rule-based and stochastic.
THEORY:

Hidden Markov Model

In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to
disambiguate parts of speech. HMMs involve counting cases, and making a table of the
probabilities of certain sequences. For example, once you've seen an article such as 'the',
perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun
than a verb or a modal. The same method can of course be used to benefit from knowledge
about following words.

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples
or even larger sequences. So, for example, if you've just seen an article and a verb, the next
item may be very likely a preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is
easy to enumerate every combination and to assign a relative probability to each one, by
multiplying together the probabilities of each choice in turn.

It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural
language parsing, that merely assigning the most common tag to each known word and the
tag "proper noun" to all unknowns, will approach 90% accuracy because many words are
unambiguous.

HMMs underlie the functioning of stochastic taggers and are used in various algorithms.
Accuracies for one such algorithm (TnT) on various training data are shown here.

Conditional Random Field


Conditional random fields (CRFs) are a class of statistical modelling methods often applied in
machine learning, where they are used for structured prediction. Whereas an ordinary
classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF
can take context into account. Since it can consider context, a CRF can be used in natural
language processing, and hence parts-of-speech tagging is also possible. It predicts the POS
using the lexical items as the context.

If only one neighbour is considered as context, the feature is called a bigram; similarly, two
neighbours as context is called a trigram. In this experiment, the size of the training corpus
and the context are varied to see their importance (see the feature sketch below).
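A minimal sketch of what such context features can look like is given below: hand-rolled feature dictionaries with illustrative feature names and no CRF library. It only shows how the "bigram" and "trigram" settings change what each token sees; fitting an actual CRF on such dictionaries is left to a CRF toolkit.

# A minimal sketch of bigram vs trigram context features for a token.
def token_features(words, i, context=2):
    features = {"word": words[i].lower(), "suffix2": words[i][-2:].lower()}
    if context >= 2:                      # bigram: one neighbouring word
        features["prev_word"] = words[i - 1].lower() if i > 0 else "<s>"
    if context >= 3:                      # trigram: two neighbouring words
        features["prev2_word"] = words[i - 2].lower() if i > 1 else "<s>"
    return features

sentence = ["Park", "the", "car"]
for i in range(len(sentence)):
    print(sentence[i], token_features(sentence, i, context=3))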

OBJECTIVE:
The objective of the experiment is to know the importance of context and size of training
corpus in learning Parts of Speech.

PROCEDURE:
STEP1: Select the language.

OUTPUT: Drop down to select size of corpus, algorithm and features will appear.

STEP2: Select corpus size.

STEP3: Select algorithm "CRF" or "HMM".

STEP4: Select feature "bigram" or "trigram".

OUTPUT: The corresponding accuracy will be shown.
SIMULATION:
EXAMPLE PROGRAM:
import nltk

# Load the NLTK Brown corpus
nltk.download('brown')
brown_tagged_sents = nltk.corpus.brown.tagged_sents()

# Define a function to train the POS tagger
def train_pos_tagger(tagged_sents):
    tagger = nltk.UnigramTagger(tagged_sents)
    return tagger

# Train the POS tagger
pos_tagger = train_pos_tagger(brown_tagged_sents)

# Test the POS tagger
def test_pos_tagger(tagger, sentence):
    tagged_words = tagger.tag(sentence)
    print(tagged_words)

test_sentence = ['The', 'cat', 'is', 'on', 'the', 'mat']
test_pos_tagger(pos_tagger, test_sentence)

RESULT
EXPERIMENT – 9

CHUNKING

AIM:
Chunking of text involves dividing a text into syntactically correlated words. For example,
the sentence 'He ate an apple.' can be divided as follows:

Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit. This can be formally expressed by using IOB prefixes.

THEORY:

Chunking of text involves dividing a text into syntactically correlated words.

Eg: He ate an apple to satiate his hunger.
[NP He ] [VP ate] [NP an apple] [VP to satiate] [NP his hunger]

Eg: दरवाज़ा खुल गया
[NP दरवाज़ा] [VP खुल गया]

Chunk Types

The chunk types are based on the syntactic category part. Besides the head a chunk also
contains modifiers (like determiners, adjectives, postpositions in NPs).

The basic types of chunks in English are:


The basic Chunk Tag Set for Indian Languages

NP Noun Chunks

Noun chunks will be given the tag NP and include non-recursive noun phrases, along with the
postposition for Indian languages and the preposition for English. Determiners, adjectives and
other modifiers will be part of the noun chunk.

Eg:

(इस/DEM किताब/NN में/PSP)NP 'this' 'book' 'in'

((in/IN the/DT big/ADJ room/NN))NP

Verb Chunks

The verb chunks are marked as VP for English; however, they would be of several types for
Indian languages. A verb group will include the main verb and its auxiliaries, if any.

For English:

I (will/MD be/VB loved/VBD)VP

The types of verb chunks and their tags are described below.

1. VGF Finite Verb Chunk


The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any
verb group which is finite will be tagged as VGF. For example,

Eg: मैंने घर पर खाना (खाया/VM)VGF 'I-erg' 'home' 'at' 'meal' 'ate'

2. VGNF Non-finite Verb Chunk


A non-finite verb chunk will be tagged as VGNF.
Eg: सेब (खाता/VM हुआ/VAUX)VGNF लड़का जा रहा है 'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'is'

3. VGNN Gerunds
A verb chunk having a gerund will be annotated as VGNN.

Eg: शराब (पीना/VM)VGNN सेहत के लिए हानिकारक है 'liquor' 'drinking' 'health' 'for' 'harmful' 'is'

JJP/ADJP Adjectival Chunk

An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This
chunk will consist of all adjectival chunks including the predicative adjectives.

Eg:

वह लड़की (सुन्दर/JJ)JJP है

The fruit is (ripe/JJ)ADJP

Note: Adjectives appearing before a noun will be grouped together within the noun chunk.

RBP/ADVP Adverb Chunk

This chunk will include all pure adverbial phrases.

Eg:

वह (धीरे-धीरे/RB)RBP चल रहा था 'he' 'slowly' 'walk' 'PROG' 'was'

He walks (slowly/ADV)ADVP

PP Prepositional Chunk

This chunk type is present for only English and not for Indian languages. It consists of only
the preposition and not the NP argument.

Eg:

(with/IN)PP a pen

IOB prefixes

Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit. This can be formally expressed by using IOB prefixes: B-CHUNK
for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an
example of the file format:
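For instance, for the sentence 'The boy ate an apple', a CoNLL-style chunk file lists one token per line as word, POS tag and IOB chunk tag. The sketch below is a minimal illustration using NLTK's RegexpParser with a hand-written grammar and a hand-tagged sentence; the grammar is illustrative, not the lab's chunker.

import nltk
from nltk.chunk import tree2conlltags

sentence = [("The", "DT"), ("boy", "NN"), ("ate", "VBD"), ("an", "DT"), ("apple", "NN")]

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>}   # noun chunk: optional determiner, adjectives, noun
  VP: {<VB.*>+}            # verb chunk: one or more verb forms
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(sentence)

# One token per line: word, POS tag, IOB chunk tag
for word, pos, iob in tree2conlltags(tree):
    print(word, pos, iob)
# The DT B-NP
# boy NN I-NP
# ate VBD B-VP
# an DT B-NP
# apple NN I-NP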
OBJECTIVE:
The objective of this experiment is to understand the concept of chunking and get familiar
with the basic chunk tagset.

PROCEDURE:

STEP1: Select a language

STEP2: Select a sentence

STEP3: Select the corresponding chunk-tag for each word in the sentence and click
the Submit button.
OUTPUT1: The submitted answer will be checked.

Click on Get Answer button for the correct answer.


SIMULATION:
EXAMPLE PROGRAM:
from itertools import zip_longest

def chunk_numbers(numbers, chunk_size):
    args = [iter(numbers)] * chunk_size
    return [tuple(filter(None, chunk)) for chunk in zip_longest(*args, fillvalue=None)]

def chunk_words(words, chunk_size):
    args = [iter(words)] * chunk_size
    return [' '.join(filter(None, chunk)) for chunk in zip_longest(*args, fillvalue='')]

if __name__ == '__main__':
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    words = ['apple', 'banana', 'cherry', 'date', 'fig', 'grape', 'kiwi', 'lemon', 'mango', 'orange']

    chunked_numbers = chunk_numbers(numbers, 4)
    chunked_words = chunk_words(words, 3)

    print("Chunked numbers:")
    for chunk in chunked_numbers:
        print(chunk)

    print("\nChunked words:")
    for chunk in chunked_words:
        print(chunk)

RESULT
EXPERIMENT – 10

BUILDING CHUNKER

AIM:
Chunking is an analysis of a sentence which identifies the constituents (noun groups, verbs,
verb groups, etc.) which are correlated. These are non-overlapping regions of text. Usually,
each chunk contains a head, with the possible addition of some function words and
modifiers either before or after, depending on the language. Chunks are non-recursive in
nature, i.e. a chunk cannot contain another chunk of the same category. Some of the possible
groups are:

1. Noun Group

2. Verb Group

For example, the sentence 'He reckons the current account deficit will narrow to only 1.8
billion in September.' can be divided as follows:

[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only 1.8
billion ] [PP in ] [NP September ]

Each chunk has an open boundary and close boundary that delimit the word groups as a
minimal non-recursive unit.

THEORY:

Hidden Markov Model

In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to
disambiguate parts of speech. HMMs involve counting cases, and making a table of the
probabilities of certain sequences. For example, once you've seen an article such as 'the',
perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun
than a verb or a modal. The same method can of course be used to benefit from knowledge
about following words.

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples
or even larger sequences. So, for example, if you've just seen an article and a verb, the next
item may be very likely a preposition, article, or noun, but much less likely another verb.
When several ambiguous words occur together, the possibilities multiply. However, it is
easy to enumerate every combination and to assign a relative probability to each one, by
multiplying together the probabilities of each choice in turn.

It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural
language parsing, that merely assigning the most common tag to each known word and the
tag "proper noun" to all unknowns, will approach 90% accuracy because many words are
unambiguous.

HMMs underlie the functioning of stochastic taggers and are used in various algorithms.
Accuracies for one such algorithm (TnT) on various training data are shown here.

Conditional Random Field

Conditional random fields (CRFs) are a class of statistical modelling methods often applied in
machine learning, where they are used for structured prediction. Whereas an ordinary
classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF
can take context into account. Since it can consider context, a CRF can be used in natural
language processing, and hence parts-of-speech tagging is also possible. It predicts the POS
using the lexical items as the context.

In this experiment both algorithms are used for training and testing. As the size of the
training corpus increases, it is observed that accuracy increases. Features also play an
important role in getting better output: in this experiment we can see that parts of speech as
a feature performs better than the lexicon alone. Therefore, it is important to select proper
features for training a model to achieve better accuracy (see the sketch below).
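As a rough, runnable illustration of "only POS" as the feature, the sketch below builds a simple NP chunker in the classic NLTK style: a unigram tagger is trained over (POS tag, chunk tag) pairs from the CoNLL-2000 corpus and then evaluated. This is a minimal sketch, not the lab's CRF/HMM setup; the evaluate call prints a ChunkScore with precision, recall and F-measure.

import nltk
from nltk.corpus import conll2000

nltk.download('conll2000')

class UnigramChunker(nltk.ChunkParserI):
    """Chunker that predicts a chunk tag from the POS tag alone."""
    def __init__(self, train_sents):
        train_data = [[(pos, chunktag) for word, pos, chunktag in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])

chunker = UnigramChunker(train_sents)
print(chunker.evaluate(test_sents))   # ChunkScore: precision, recall, F-measure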

OBJECTIVE:

The objective of the experiment is to know the importance of selecting proper features for
training a model and size of training corpus in learning how to do chunking.

PROCEDURE:
STEP1: Select the language.

OUTPUT: Drop down to select size of corpus, algorithm and features will appear.

STEP2: Select corpus size.

STEP3: Select algorithm "CRF" or "HMM".

STEP4: Select feature "only lexicon", "only POS", "lexicon and POS".

OUTPUT: The corresponding accuracy will be shown.


SIMULATION:

RESULT
