NLP Practical Manual
Note: If you don't want to download the latest version, you can visit the
download tab and see all releases.
Step 7) Open a command prompt window and run the following commands:
C:\Users\Beena Kapadia>pip install --upgrade pip
C:\Users\Beena Kapadia> pip install --user -U nltk
C:\Users\Beena Kapadia>pip install --user -U numpy
C:\Users\Beena Kapadia>python
>>> import nltk
>>>
# text to speech
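# (the program itself is not included in this copy; this sketch assumes the gTTS and
# playsound packages, and the file name welcomeNLP.mp3 matches the output described below)
from gtts import gTTS
from playsound import playsound
mytext = "Welcome to Natural Language Processing"  # sample text (assumed)
speech = gTTS(text=mytext, lang='en')              # convert the text to speech
speech.save("welcomeNLP.mp3")                      # save the generated audio file
playsound("welcomeNLP.mp3")                        # play the audio file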
Output:
The welcomeNLP.mp3 audio file is created and then played with the playsound() method while the program runs.
Note: the input file "male.wav" must be stored in the current folder before running the program.
import speech_recognition as sr
filename = "male.wav"
Input:
male.wav (any wav file)
Output:
Practical No. 2:
a. Study of various corpora – Brown, Inaugural, Reuters, udhr – with various methods like fileids, raw, words, sents, categories.
b. Create and use your own corpora (plaintext, categorical)
c. Study Conditional frequency distributions
d. Study of tagged corpora with methods like tagged_sents, tagged_words.
e. Write a program to find the most frequent noun tags.
f. Map Words to Properties Using Python Dictionaries
g. Study DefaultTagger, Regular expression tagger, UnigramTagger
h. Find different words from a given plain text without any space by comparing
this text with a given corpus of words. Also find the score of words.
import nltk
from nltk.corpus import brown
print ('File ids of brown corpus\n',brown.fileids())
'''Pick out the first of these files — ca01 — and load its words under a short name:'''
ca01 = brown.words('ca01')
#categories or files
print ('\n\nCategories or file in brown corpus:\n')
print (brown.categories())
'''display other information about each text, by looping over all the values of fileid
corresponding to the brown file identifiers listed earlier and then computing statistics
for each text.'''
print ('\n\nStatistics for each text:\n')
print('AvgWordLen\tAvgSentenceLen\tno.ofTimesEachWordAppearsOnAvg\t\tFileName')
for fileid in brown.fileids():
    num_chars = len(brown.raw(fileid))
    num_words = len(brown.words(fileid))
    num_sents = len(brown.sents(fileid))
    num_vocab = len(set([w.lower() for w in brown.words(fileid)]))
    # per-file averages
    print(round(num_chars/num_words), '\t\t\t', round(num_words/num_sents),
          '\t\t\t', round(num_words/num_vocab), '\t\t', fileid)
output:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'D:/2020/NLP/Practical/uni'
filelist = PlaintextCorpusReader(corpus_root, '.*')
print ('\n File list: \n')
print (filelist.fileids())
print (filelist.root)
'''display other information about each text, by looping over all the values of fileid
corresponding to the filelist file identifiers listed earlier and then computing statistics
for each text.'''
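# (the statistics loop itself does not appear in this copy; this sketch mirrors the
# Brown-corpus statistics above, computed over the custom plaintext corpus)
print('\n\nStatistics for each text:\n')
print('AvgWordLen\tAvgSentenceLen\tno.ofTimesEachWordAppearsOnAvg\t\tFileName')
for fileid in filelist.fileids():
    num_chars = len(filelist.raw(fileid))
    num_words = len(filelist.words(fileid))
    num_sents = len(filelist.sents(fileid))
    num_vocab = len(set([w.lower() for w in filelist.words(fileid)]))
    print(round(num_chars/num_words), '\t\t\t', round(num_words/num_sents),
          '\t\t\t', round(num_words/num_vocab), '\t\t', fileid)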
output:
# (setup assumed from the surrounding prints: (genre, word) pairs for two Brown genres)
import nltk
from nltk.corpus import brown, udhr
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:])
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions())
print(cfd['news'])
print(cfd['romance'])
print(list(cfd['romance']))
# a second conditional frequency distribution: word lengths per language in the udhr corpus
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in ['English', 'German_Deutsch']
    for word in udhr.words(lang + '-Latin1'))
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)
output:
Source code:
import nltk
from nltk import tokenize
nltk.download('punkt')
nltk.download('words')
# sample text and sentence list (assumed; the original input is not shown in this copy)
text = "Hello! My name is Beena Kapadia. Today you'll be learning NLTK."
sents = tokenize.sent_tokenize(text)
# word tokenization
print("\nword tokenization\n===================\n")
for index in range(len(sents)):
    words = tokenize.word_tokenize(sents[index])
    print(words)
output:
import nltk
from collections import defaultdict
# build the list of noun words from a POS-tagged sample sentence
# (the sample text is assumed; the original input is not shown in this copy)
tagged = nltk.pos_tag(nltk.word_tokenize("John saw the big dog and the small dog near the park"))
addNounWords = [word for word, tag in tagged if tag.startswith('NN')]
print(addNounWords)
temp = defaultdict(int)
# memoizing count
for sub in addNounWords:
    for wrd in sub.split():
        temp[wrd] += 1
# the noun with the highest count
res = max(temp, key=temp.get)
# printing result
print("Word with maximum frequency : " + str(res))
output:
output:
i) DefaultTagger
code:
import nltk
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
exptagger = DefaultTagger('NN')
testsentences = treebank.tagged_sents()[1000:]
print(exptagger.evaluate(testsentences))
output
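ii) Regular expression tagger
code:
The code for this step is not included in this copy; the sketch below is one possible version, using NLTK's RegexpTagger with a few illustrative suffix patterns (the patterns themselves are assumptions):
import nltk
from nltk.tag import RegexpTagger
from nltk.corpus import treebank
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else defaults to a noun
]
exptagger = RegexpTagger(patterns)
testsentences = treebank.tagged_sents()[1000:]
print(exptagger.evaluate(testsentences))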
iii) UnigramTagger
code:
# Loading Libraries
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
# Training data: the first 1000 tagged sentences of the Treebank sample (split assumed)
train_sents = treebank.tagged_sents()[:1000]
# Initializing
tagger = UnigramTagger(train_sents)
# Tagging an untagged sentence with the trained tagger
print(tagger.tag(treebank.sents()[0]))
h. Find different words from a given plain text without any space by comparing
this text with a given corpus of words. Also find the score of words.
Question:
Initialize the hash-tag test data or URL test data and convert it to plain text without any spaces. Read a text file of words and compare the plain-text data with the words in that file to find the different words present in the plain text. Also find out how many words could be found. (For example, text = "#whatismyname" or text = www.whatismyname.com. Convert it to plain text without spaces: whatismyname. Read the word list from words.txt, then compare the plain text with the words in the file to find the words it contains and the count of words that could be found.)
Source code:
from __future__ import with_statement #with statement for reading file
import re # Regular expression
print("MENU")
print("-----------")
print(" 1 . Hash tag segmentation ")
print(" 2 . URL segmentation ")
print("enter the input choice for performing word segmentation")
choice = int(input())
if choice == 1:
    text = "#whatismyname"  # hash tag test data to segment
    print("input with HashTag", text)
    pattern = re.compile(r"[^\w']")
    a = pattern.sub('', text)
elif choice == 2:
    text = "www.whatismyname.com"  # URL test data to segment
    print("input with URL", text)
    a = re.split(r'\s|(?<!\d)[,.](?!\d)', text)
    splitwords = ["www", "com", "in"]  # drop the tokens contained in this list
    a = "".join([each for each in a if each not in splitwords])
else:
    print("wrong choice...try again")
print(a)
# read the corpus of words (Words.txt, shown under Input below; the file name is assumed)
with open("words.txt") as f:
    words = f.read().split()
testword = []
for each in a:
    testword.append(each)  # characters of the plain text
test_lenth = len(testword)  # length of the test data
def Seg(a, lenth):
    ans = []
    for k in range(0, lenth + 1):  # check the prefix of a, character by character, against the corpus
        if a[0:k] in words:
            print(a[0:k], "-appears in the corpus")
            ans.append(a[0:k])
            break
    if ans != []:
        g = max(ans, key=len)
        return g
    return 0  # no prefix matched the corpus (the caller checks for 0)
N = 37  # total number of words in the corpus file (Words.txt)
M = 0
C = 0
test_tot_itr = 0   # characters of the test data consumed so far
answer = []        # words found in the plain text
while test_tot_itr < test_lenth:
    ans_words = Seg(a, test_lenth)
    if ans_words != 0:
        test_itr = len(ans_words)
        answer.append(ans_words)
        a = a[test_itr:test_lenth]
        test_tot_itr += test_itr
# Calculating Score
C = len(answer)
score = C * N / N  # calculate the score
print("Score", score)
Input:
Words.txt
--------------
check back
domain social
big media
rocks 30
name seconds
cheap earth
being this
human is
current insane
rates it
ought time
to what
go is
down my
apple name
domains let
honesty us
hour go
follow
Output:
Source code:
'''WordNet provides synsets, which are collections of synonym words, also called “lemmas”'''
import nltk
from nltk.corpus import wordnet
print(wordnet.synsets("computer"))
#examples
print("Examples:", wordnet.synset("computer.n.01").examples())
#get Antonyms
print(wordnet.lemma('buy.v.01.buy').antonyms())
output:
# Hyponyms are more specific concepts of a word
# the list of hyponym words of "computer"
syn = wordnet.synset('computer.n.01')
print(syn.hyponyms())
# lowest common hypernym of two synsets (the 'car' and 'vehicle' synsets are assumed)
car = wordnet.synset('car.n.01')
vehicle = wordnet.synset('vehicle.n.01')
print(car.lowest_common_hypernyms(vehicle))
Output:
c. Write a program using python to find synonym and antonym of word "active"
using Wordnet.
Source code:
from nltk.corpus import wordnet
print( wordnet.synsets("active"))
print(wordnet.lemma('active.a.01.active').antonyms())
Output:
import nltk
from nltk.corpus import wordnet
syn1 = wordnet.synsets('football')
syn2 = wordnet.synsets('soccer')
# A word may have multiple synsets, so we need to compare each synset of word1 with each synset of word2
for s1 in syn1:
    for s2 in syn2:
        print("Path similarity of: ")
        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
        print(" is", s1.path_similarity(s2))
        print()
output:
e. Handling stopwords:
i) Using NLTK: Adding or Removing Stop Words in NLTK's Default Stop Word List
code:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
text = "Yashesh likes to play football, however he is not too fond of tennis."
all_stopwords = stopwords.words('english')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
# add a word to NLTK's default stop word list ('play' is an example word)
all_stopwords.append('play')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
# remove a word from NLTK's default stop word list ('not' is an example word)
all_stopwords.remove('not')
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
output
ii) Using Gensim Adding and Removing Stop Words in Default Gensim Stop
Words List
code:
#pip install gensim
import gensim
from gensim.parsing.preprocessing import remove_stopwords
text = "Yashesh likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)
print(filtered_sentence)
all_stopwords = gensim.parsing.preprocessing.STOPWORDS
print(all_stopwords)
'''The following script adds "likes" and "play" to the list of stop words in Gensim:'''
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import word_tokenize
all_stopwords_gensim = STOPWORDS.union(set(['likes', 'play']))
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
print(tokens_without_sw)
Output:
'''The following script removes the word "not" from the set of stop words in Gensim:'''
all_stopwords_gensim = STOPWORDS
sw_list = {"not"}
all_stopwords_gensim = STOPWORDS.difference(sw_list)
text = "Yashesh likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords_gensim]
print(tokens_without_sw)
output
Note: if the gensim installation fails with "Microsoft Visual C++ 14.0 is required", get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
iii) Using Spacy Adding and Removing Stop Words in Default Spacy Stop Words
List
code:
#pip install spacy
#python -m spacy download en_core_web_sm
#python -m spacy download en
import spacy
import nltk
from nltk.tokenize import word_tokenize
sp = spacy.load('en_core_web_sm')
text = "Yashesh likes to play football, however he is not too fond of tennis."
all_stopwords = sp.Defaults.stop_words  # spaCy's default stop word list
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if not word in all_stopwords]
print(tokens_without_sw)
output:
4. Text Tokenization
a. Tokenization using Python’s split() function
code:
text = """ This tool is an a beta stage. Alexa developers can use Get Metrics API to
seamlessly analyse metric. It also supports custom skill model, prebuilt Flash Briefing
model, and the Smart Home Skill API. You can use this tool for creation of monitors,
alarms, and dashboards that spotlight changes. The release of these three tools will
enable developers to create visual rich skills for Alexa devices with screens. Amazon
describes these tools as the collection of tech and tools for creating visually rich and
interactive voice experiences. """
data = text.split('.')
for i in data:
    print(i)
output:
code:
import nltk
# import RegexpTokenizer() method from nltk
from nltk.tokenize import RegexpTokenizer
# create a whitespace tokenizer (pattern assumed) and tokenize a sample sentence
tk = RegexpTokenizer(r'\s+', gaps=True)
tokens = tk.tokenize("I like to play football.")
print(tokens)
output:
code:
import nltk
from nltk.tokenize import word_tokenize
# tokenize a sample sentence (text assumed; the original input is not shown)
print(word_tokenize("I like to play football."))
output:
code:
import spacy
# tokenize a sample sentence (text assumed) with a blank English pipeline
nlp = spacy.blank("en")
print([token.text for token in nlp("I like to play football.")])
output:
code:
#pip install keras
#pip install tensorflow
import keras
from keras.preprocessing.text import text_to_word_sequence
# tokenize a sample sentence (text assumed) into a word sequence
print(text_to_word_sequence("I like to play football."))
output:
code:
#pip install gensim
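# (the tokenization code itself is not included in this copy; this minimal sketch uses
# gensim's utility tokenizer on an assumed sample sentence)
from gensim.utils import tokenize
text = "Alexa developers can use Get Metrics API to seamlessly analyse metrics."
tokens = list(tokenize(text))
print(tokens)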
output:
Note: if the gensim installation fails with "Microsoft Visual C++ 14.0 is required", get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
output
['▁प्राकृतिक', '▁भाषा', '▁सीखना', '▁बहुि', '▁तिलचस्प', '▁है ', '।']
print(output)
Output:
['मैं आजकल बहुि खुश हूं ', 'मैं आज अत्यतिक खुश हूं ', 'मैं अभी बहुि खुश हूं ', 'मैं वितमान बहुि
खुश हूं ', 'मैं वित मान बहुि खु श हूं ']
Output:
gujarati
# imports and sample text (assumed; the original setup lines are not shown in this copy)
import nltk
from nltk import tokenize, tag, chunk
text = "Hello! My name is Beena Kapadia. Today you'll be learning NLTK."
# sentence tokenization
sents = tokenize.sent_tokenize(text)
print("\nsentence tokenization\n===================\n")
print(sents)
# word tokenization
print("\nword tokenization\n===================\n")
for index in range(len(sents)):
    words = tokenize.word_tokenize(sents[index])
    print(words)
# POS Tagging
# note: `words` still holds the tokens of the last sentence, so the same sentence is
# tagged on every pass (which matches the repeated sentence in the recorded output below)
tagged_words = []
for index in range(len(sents)):
    tagged_words.append(tag.pos_tag(words))
print("\nPOS Tagging\n===========\n", tagged_words)
# chunking
tree = []
for index in range(len(sents)):
    tree.append(chunk.ne_chunk(tagged_words[index]))
print("\nchunking\n========\n")
print(tree)
Output:
sentence tokenization
===================
['Hello!', 'My name is Beena Kapadia.', "Today you'll be learning NLTK."]
word tokenization
===================
['Hello', '!']
['My', 'name', 'is', 'Beena', 'Kapadia', '.']
['Today', 'you', "'ll", 'be', 'learning', 'NLTK', '.']
POS Tagging
===========
[[('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('learning', 'VBG'), ('NLTK',
'NNP'), ('.', '.')], [('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('learning',
'VBG'), ('NLTK', 'NNP'), ('.', '.')], [('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be',
'VB'), ('learning', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]]
chunking
========
[Tree('S', [('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('learning', 'VBG'),
Tree('ORGANIZATION', [('NLTK', 'NNP')]), ('.', '.')]), Tree('S', [('Today', 'NN'), ('you',
'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('learning', 'VBG'), Tree('ORGANIZATION', [('NLTK',
'NNP')]), ('.', '.')]), Tree('S', [('Today', 'NN'), ('you', 'PRP'), ("'ll", 'MD'), ('be', 'VB'),
('learning', 'VBG'), Tree('ORGANIZATION', [('NLTK', 'NNP')]), ('.', '.')])]
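The setup for the spaCy example below is not shown in this copy; the input text is inferred from the noun-phrase and verb output that follows, so treat it as a reconstruction:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("When Sebastian Thrun started working on self-driving cars at Google in 2007, "
          "few people outside of the company took him seriously. \"I can tell you very "
          "senior CEOs of major American car companies would shake my hand and turn away "
          "because I wasn't worth talking to,\" said Thrun, in an interview with Recode "
          "earlier this week.")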
# Analyse syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
Output:
Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the
company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my
hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'be', 'talk', 'say']
import nltk
nltk.download('treebank')
from nltk.corpus import treebank_chunk
treebank_chunk.tagged_sents()[0]
treebank_chunk.chunked_sents()[0]
treebank_chunk.chunked_sents()[0].draw()
Output:
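The grammar1 grammar and the all_tokens token list used below are not defined anywhere in this copy; as an assumed stand-in, the sketch below uses the small context-free grammar from the NLTK book and a sentence it covers:
import nltk
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
all_tokens = "Mary saw Bob".split()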
parser = nltk.ChartParser(grammar1)
for tree in parser.parse(all_tokens):
    print(tree)
    tree.draw()
output:
b) Accept the input string with Regular expression of Finite Automaton: 101+.
Source code:
def FA(s):
    # if the length is less than 3, the string can't be accepted, so end the process
    if len(s) < 3:
        return "Rejected"
    # the first three characters must be '1', '0', '1'; check them by index
    if s[0] == '1':
        if s[1] == '0':
            if s[2] == '1':
                # after index 2 only '1' may appear, so reject if any other character is found
                for i in range(3, len(s)):
                    if s[i] != '1':
                        return "Rejected"
                return "Accepted"  # all nested ifs were true
            return "Rejected"  # else of 3rd if
        return "Rejected"  # else of 2nd if
    return "Rejected"  # else of 1st if

inputs = ['1', '10101', '101', '10111', '01010', '100', '', '10111101', '1011111']
for i in inputs:
    print(FA(i))
Output:
Rejected
Rejected
Accepted
Accepted
Rejected
Rejected
Rejected
Rejected
Accepted
Code:
# PorterStemmer
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
print(word_stemmer.stem('writing'))
Output:
#LancasterStemmer
import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
print(Lanc_stemmer.stem('writing'))
Output:
#RegexpStemmer
import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
print(Reg_stemmer.stem('writing'))
output
#SnowballStemmer
import nltk
from nltk.stem import SnowballStemmer
english_stemmer = SnowballStemmer('english')
print(english_stemmer.stem ('writing'))
output
#WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print("word :\tlemma")
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
Output:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
# load the SMS spam data set (spam.csv, see Input below); the column names v1 (label)
# and v2 (message) and the 80/20 split below are assumed from the standard data set layout
sms_data = pd.read_csv('spam.csv', encoding='latin-1')
stemming = PorterStemmer()
corpus = []
for i in range(0, len(sms_data)):
    s1 = re.sub('[^a-zA-Z]', repl=' ', string=sms_data['v2'][i])
    s1 = s1.lower()
    s1 = s1.split()
    s1 = [stemming.stem(word) for word in s1 if word not in set(stopwords.words('english'))]
    s1 = ' '.join(s1)
    corpus.append(s1)
countvectorizer = CountVectorizer()
x = countvectorizer.fit_transform(corpus).toarray()
print(x)
y = sms_data['v1'].values
print(y)
# split the data, train a multinomial Naive Bayes classifier and evaluate it
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
multinomialnb = MultinomialNB()
multinomialnb.fit(x_train, y_train)
y_pred = multinomialnb.predict(x_test)
print(y_pred)
print(classification_report(y_test, y_pred))
print("accuracy_score: ", accuracy_score(y_test, y_pred))
input:
spam.csv file from github
output:
code
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I like to play football. I hated it in my childhood though")
print(sen.text)
print(sen[7].pos_)
print(sen[7].tag_)
print(spacy.explain(sen[7].tag_))
for word in sen:
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{8}} {spacy.explain(word.tag_)}')
# count how many tokens carry each coarse POS tag
num_pos = sen.count_by(spacy.attrs.POS)
print(num_pos)
output:
To view the dependency tree, type the following address in your browser:
http://127.0.0.1:5000/. You will see the following dependency tree:
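The code that launches the visualiser is not included in this copy; a minimal sketch using spaCy's built-in displaCy server (which listens on port 5000 by default, matching the address above) would be:
from spacy import displacy
# serve a dependency-tree visualisation of the sentence parsed above
displacy.serve(sen, style='dep')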
code:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
# train a Punkt sentence tokenizer on one speech and apply it to another
# (the two state_union files used here are assumed; any pair of files works)
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
# tokenize:
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        for i in tokenized[:2]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
process_content()
output:
b. Statistical parsing:
i. Usage of Give and Gave in the Penn Treebank sample
Source code:
#probabilitistic parser
#Usage of Give and Gave in the Penn Treebank sample
import nltk
import nltk.parse.viterbi
import nltk.parse.pchart
def give(t):
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
        and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
        and ('give' in t[0].leaves() or 'gave' in t[0].leaves())
def sent(t):
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')
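# (the driver that scans the Treebank sample for matching subtrees is missing from this
# copy; the sketch below follows the NLTK book's version of this example)
def print_node(t, width):
    output = "%s %s: %s / %s: %s" % (sent(t[0]), t[1].label(), sent(t[1]),
                                     t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)
for tree in nltk.corpus.treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)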
Output:
import nltk
from nltk import PCFG
grammar = PCFG.fromstring('''
NP -> NNS [0.5] | JJ NNS [0.3] | NP CC NP [0.2]
NNS -> "men" [0.1] | "women" [0.2] | "children" [0.3] | NNS CC NNS [0.4]
JJ -> "old" [0.4] | "young" [0.6]
CC -> "and" [0.9] | "or" [0.1]
''')
print(grammar)
# tokenised input (assumed; any noun phrase covered by the grammar works)
token = "old men and women".split()
viterbi_parser = nltk.ViterbiParser(grammar)
obj = viterbi_parser.parse(token)
print("Output: ")
for x in obj:
    print(x)
Output:
c. Malt parsing:
Parse a sentence and draw a tree using malt parsing.
Note: 1) Java should be installed.
2) The maltparser-1.7.2 zip file should be copied to the C:\Users\Beena Kapadia\AppData\Local\Programs\Python\Python39 folder and extracted in the same folder.
3) The engmalt.linear-1.7.mco file should be copied to the C:\Users\Beena Kapadia\AppData\Local\Programs\Python\Python39 folder.
Source code:
# copy maltparser-1.7.2 (unzipped) and engmalt.linear-1.7.mco to the
# C:\Users\Beena Kapadia\AppData\Local\Programs\Python\Python39 folder
# java should be installed
# environment variables should be set:
# MALT_PARSER = C:\Users\Beena Kapadia\AppData\Local\Programs\Python\Python39\maltparser-1.7.2
# MALT_MODEL = C:\Users\Beena Kapadia\AppData\Local\Programs\Python\Python39\engmalt.linear-1.7.mco
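# (the parsing code itself is not reproduced in this copy; this sketch shows the usual
# nltk.parse.malt invocation, with the sentence inferred from the output shown below)
from nltk.parse import malt
mp = malt.MaltParser('maltparser-1.7.2', 'engmalt.linear-1.7.mco')
t = mp.parse_one('I saw a bird from my window.'.split()).tree()
print(t)
t.draw()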
Output:
(saw I (bird a (from (window. my))))
#convert
#Reliance supermarket
#Reliance hypermarket
#Reliance
#Reliance
#Reliance downtown
#Relianc market
#Mumbai
#Mumbai Hyper
#Mumbai dxb
#mumbai airport
#k.m trading
#KM Trading
#KM trade
#K.M. Trading
#KM.Trading
#into
#Reliance
#Reliance
#Reliance
#Reliance
#Reliance
#Reliance
#Mumbai
#Mumbai
#Mumbai
#Mumbai
#KM Trading
#KM Trading
#KM Trading
#KM Trading
#KM Trading
import numpy as np
import re
import textdistance # pip install textdistance
# we will need scikit-learn>=0.21
import sklearn #pip install sklearn
from sklearn.cluster import AgglomerativeClustering
texts = [
    'Reliance supermarket', 'Reliance hypermarket', 'Reliance', 'Reliance',
    'Reliance downtown', 'Relianc market',
    'Mumbai', 'Mumbai Hyper', 'Mumbai dxb', 'mumbai airport',
    'k.m trading', 'KM Trading', 'KM trade', 'K.M. Trading', 'KM.Trading'
]
def normalize(text):
    """ Keep only lower-cased text and numbers """
    return re.sub('[^a-z0-9]+', ' ', text.lower())
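# (group_texts is not defined anywhere in this copy; the sketch below is one possible
# implementation — the distance measure, the 0.4 threshold and the rule for picking a
# cluster representative are assumptions, not the original author's code)
def group_texts(texts, threshold=0.4):
    """Cluster near-duplicate texts and map every text to a representative of its cluster."""
    normalized = [normalize(t) for t in texts]
    # pairwise normalized Levenshtein distances between the cleaned texts
    distances = np.array([[textdistance.levenshtein.normalized_distance(a, b)
                           for b in normalized] for a in normalized])
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=threshold,
        affinity='precomputed',   # scikit-learn >= 1.2 renames this parameter to `metric`
        linkage='average',
    ).fit(distances)
    # represent each cluster by its shortest member (e.g. 'Reliance', 'Mumbai')
    representatives = {}
    for label, text in zip(clustering.labels_, texts):
        if label not in representatives or len(text) < len(representatives[label]):
            representatives[label] = text
    return [representatives[label] for label in clustering.labels_]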
print(group_texts(texts))
Output:
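The helper get_first_sense used below is not included in this copy; a minimal sketch, assuming it simply returns the first (most frequent) WordNet synset of a word, optionally restricted to a part of speech:
from nltk.corpus import wordnet
def get_first_sense(word, pos=None):
    # synsets are ordered by estimated frequency, so the first one is the most common sense
    if pos:
        return wordnet.synsets(word, pos)[0]
    return wordnet.synsets(word)[0]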
best_synset = get_first_sense('bank')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'n')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
best_synset = get_first_sense('set', 'v')
print('%s: %s' % (best_synset.name(), best_synset.definition()))
Output: