NATURAL LANGUAGE PROCESSING LAB MANUAL
B.TECH (III YEAR – I SEM)
(2025-26)
DEPARTMENT OF AIML
(Autonomous Institution – UGC, Govt. of India)
Recognized under 2(f) and 12(B) of UGC ACT 1956
(Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – 'A' Grade - ISO 9001:2015 Certified)
Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, India
Lab Objectives:
1. Be able to discuss the current and likely future performance of several NLP applications.
2. Be able to describe briefly a fundamental technique for processing language for several subtasks, such as morphological processing.
3. Implement parsing, word sense disambiguation, etc.
4. Understand how these techniques draw on and relate to other areas of computer science.
5. Understand the basic principles of designing and running an NLP experiment.
Lab Outcomes:
Upon successful completion of this course, the students will be able to:
1. Implement LSI and NER.
2. Implement the TF-IDF method and N-gram models.
3. Develop a part-of-speech tagger.
4. Classify text based on part-of-speech tags.
5. Implement several NLP applications.
R-20
Introduction about the lab
System configurations are as follows:
Hardware/Software installed: Intel® CORE™ i3 processor, 4 GB RAM / Google Colab or Jupyter Notebook or PyCharm or Visual Studio Code.
Systems are provided for students in a 1:1 ratio.
Systems are assigned numbers, and the same system is allotted to each student whenever they do the lab.
All systems are configured with Linux, which is open source, and students can use different programming environments through package installation.
Guidelines to students
A. Standard operating procedure
a) Explanation on today's experiment by the concerned faculty using a PPT covering the following aspects:
1) Name of the experiment
2) Aim
3) Software/Hardware requirements
4) Writing the NLP programs by the students in Python
Writing of the experiment in the Observation Book
The students will write today's experiment in the Observation Book as per the following format:
a) Name of the experiment
b) Aim
c) Writing the program
d) Viva-Voce questions and answers
e) Errors observed (if any) during compilation/execution
Signature of the Faculty
B. Guidelines to Students in the Lab
Students are required to carry their lab observation book and record book, with completed experiments, while entering the lab.
Students must use the equipment with care. Any damage caused makes the student punishable.
Students are not allowed to use their cell phones / pen drives / CDs in the lab.
Students need to maintain the proper dress code along with their ID card.
Students are supposed to occupy the computers allotted to them and are not supposed to talk or make noise in the lab.
After completion of each experiment, students need to update their observation notes, and the same is to be updated in the record.
Lab records need to be submitted after completion of each experiment and corrected by the concerned lab faculty.
If a student is absent for any lab, they need to complete the same experiment in their free time before attending the next lab.
Steps to perform experiments in the lab by the student
Step 1: Students have to write the date and aim for that experiment in the observation book.
Step 2: Students have to listen to and understand the experiment explained by the faculty and note down the important points in the observation book.
Step 3: Students need to write the procedure/algorithm in the observation book.
Step 4: Analyze and develop/implement the logic of the program on the respective platform.
Step 5: After approval of the logic of the experiment by the faculty, the experiment has to be executed on the system.
Step 6: After successful execution, the results are to be shown to the faculty and noted in the observation book.
Step 7: Students need to attend the Viva-Voce on that experiment and write the same in the observation book.
Step 8: Update the completed experiment in the record and submit it to the concerned faculty in-charge.
Instructions to maintain the record
Before the start of the first lab, students have to buy the record and bring it to the lab.
Regularly (weekly) update the record after completion of each experiment and get it corrected by the concerned lab in-charge for continuous evaluation. In case the record is lost, inform the faculty in-charge the same day, get a new record within 2 days, and submit it for correction by the faculty.
If the record is not submitted in time or is not written properly, the evaluation marks (5M) will be deducted.
Awarding the marks for day-to-day evaluation
Total marks for day-to-day evaluation are 15 marks as per Autonomous (JNTUH). These 15 marks are distributed as:
Regularity             3 Marks
Program written        3 Marks
Execution & Result     3 Marks
Viva-Voce              3 Marks
Dress Code             3 Marks
Allocation of marks for Lab Internal
Total marks for lab internal are 30 marks as per Autonomous (JNTUH). These 30 marks are distributed as:
Average of day-to-day evaluation marks    15 Marks
Lab Mid exam                              10 Marks
Viva & Observation                         5 Marks
Allocation of marks for Lab External
Total marks for the lab external are 70 marks as per Autonomous (JNTUH). These 70 external lab marks are distributed as:
Program written                 30 Marks
Program execution and result    20 Marks
Viva-Voce                       10 Marks
Record                          10 Marks
C. General laboratory instructions
1. Students are advised to come to the laboratory at least 5 minutes before the starting time; those who come after 5 minutes will not be allowed into the lab.
2. Plan your task properly well before the commencement; come prepared to the lab with the synopsis / program / experiment details.
3. Students should enter the laboratory with:
a. Laboratory observation notes with all the details (problem statement, aim, algorithm, procedure, program, expected output, etc.) filled in for the lab session.
b. Laboratory record updated up to the last session's experiments, and other materials (if any) needed in the lab.
c. Proper dress code and identity card.
4. Sign in the laboratory login register, write the TIME-IN, and occupy the computer system allotted to you by the faculty.
5. Execute your task in the laboratory, record the results/output in the lab observation notebook, and get it certified by the concerned faculty.
6. All students should be polite and cooperative with the laboratory staff, and must maintain discipline and decency in the laboratory.
7. Computer labs are established with sophisticated, high-end branded systems, which should be utilized properly.
8. Students / faculty must keep their mobile phones in SWITCHED OFF mode during the lab sessions. Misuse of the equipment or misbehavior with the staff and systems will attract severe punishment.
9. Students must take the permission of the faculty in case of any urgency to go out; anybody found loitering outside the lab / class without permission during working hours will be treated seriously and punished appropriately.
10. Students should LOG OFF / SHUT DOWN the computer system before leaving the lab after completing the task (experiment) in all aspects. They must ensure the system / seat is left in proper condition.
Head of the Department                    Principal
INDEX

S.No.  Week      Program Name
1      Week 1    a. Write a Python program to perform tokenization by word and sentence using nltk.
                 b. Write a Python program to eliminate stop words using nltk.
                 c. Write a Python program to perform stemming using nltk.
2      Week 2    a. Write a Python program to perform Parts of Speech tagging using nltk.
                 b. Write a Python program to perform lemmatization using nltk.
3      Week 3    a. Write a Python program for chunking using nltk.
                 b. Write a Python program to perform Named Entity Recognition using nltk.
4      Week 4    a. Write a Python program to find Term Frequency and Inverse Document Frequency (TF-IDF).
                 b. Write a Python program for CYK parsing (Cocke-Younger-Kasami parsing) or chart parsing.
5      Week 5    a. Write a Python program to find all unigrams, bigrams and trigrams present in the given corpus.
                 b. Write a Python program to find the probability of the given statement "This is my cat" by taking an example corpus into consideration.
6      Week 6    Use the Stanford Named Entity Recognizer to extract entities from the documents. Use it programmatically and output, for each document, which named entities it contains and of which type.
7      Week 7    Choose any corpus available on the internet freely. For the corpus, for each document, count how many times each stop word occurs and find out which are the most frequently occurring stop words. Further, calculate the term frequency and inverse document frequency. The motivation behind this is to find out how important a document is to a given query. E.g., if the query is "The brown crow", "the" is less important; "brown" and "crow" are relatively more important. Since "the" is a more common word, its tf will be high; hence we multiply it by idf, which reflects how common it is, to reduce its weight.
8      Week 8    Write the Python code to perform sentiment analysis using NLP.
9      Week 9    Write the Python code to develop a spam filter using NLP.
10     Week 10   Write the Python code to detect fake news using NLP.
WEEK-1                                                        Date:
Aim: a) Write a Python program to perform tokenization by word and sentence using nltk.
Program for sentence tokenization:
import nltk
nltk.download('punkt')  # Download the necessary tokenization models
from nltk.tokenize import sent_tokenize

def tokenize_sentences(text):
    sentences = sent_tokenize(text)
    return sentences

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum."

# Tokenize sentences
sentences = tokenize_sentences(text)

# Print tokenized sentences
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}")
Output:
Program for word tokenization:
import nltk
nltk.download('punkt')  # Download the necessary tokenization models
from nltk.tokenize import word_tokenize

def tokenize_words(text):
    words = word_tokenize(text)
    return words

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenize words
words = tokenize_words(text)

# Print tokenized words
print(words)
Output:
b. Write a Python program to eliminate stop words using nltk.
# Stop words
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK stop words and tokenizer models
nltk.download('stopwords')
nltk.download('punkt')

def remove_stopwords(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Get English stop words
    english_stopwords = set(stopwords.words('english'))
    # Remove stop words from the tokenized words
    filtered_words = [word for word in words if word.lower() not in english_stopwords]
    # Join the filtered words back into a single string
    # (note the space separator; joining with '' would run the words together)
    filtered_text = ' '.join(filtered_words)
    return filtered_text

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Remove stop words
filtered_text = remove_stopwords(text)

# Print filtered text
print(filtered_text)
Output:
c. Write a Python program to perform stemming using nltk.
# Stemming
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer models
nltk.download('punkt')

def stem_text(text):
    # Initialize the Porter Stemmer
    porter_stemmer = PorterStemmer()
    # Tokenize the text into words
    words = word_tokenize(text)
    # Apply stemming to each word
    stemmed_words = [porter_stemmer.stem(word) for word in words]
    # Join the stemmed words back into a single string (space-separated)
    stemmed_text = ' '.join(stemmed_words)
    return stemmed_text

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Perform stemming
stemmed_text = stem_text(text)

# Print stemmed text
print(stemmed_text)
Output:
Signature of the Faculty
EXERCISE:
1. Write a Python program to perform tokenization by word and sentence using Stanza.
2. Write a Python program for word tokenization and sentence segmentation using spaCy.
3. Write a Python program to find all the stop words in the given corpus using spaCy.
Signature of the Faculty
WEEK-2                                                        Date:
a. Write a Python program to perform Parts of Speech tagging using nltk.
# Parts of Speech tagging
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK tokenizer and POS tagging models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Perform POS tagging
    tagged_words = nltk.pos_tag(words)
    return tagged_words

# Example text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Perform POS tagging
tagged_text = pos_tagging(text)

# Print POS tagged text
print(tagged_text)
Output:
b. Write a Python program to perform lemmatization using nltk.
# Lemmatization
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    # Join with spaces; joining with '' would run the words together
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return lemmatized_text

text = "The cats are chasing mice and playing in the garden"
lemmatized_text = lemmatize_text(text)
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)
Output:
Signature of the Faculty
EXERCISE:
1. Study and use the Stanford Part of Speech tagger on a suitable, freely available corpus. The corpus should be of decent size. (Use spaCy and Stanza.)
2. Write a Python program for lemmatization using spaCy and Stanza.
WEEK-3                                                        Date:
a. Write a Python program for chunking using nltk.
# Chunking
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, RegexpParser

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def chunk_sentence(sentence):
    words = word_tokenize(sentence)
    tagged_words = pos_tag(words)
    # Define grammar for chunking
    grammar = r"""
      NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
      PP: {<IN><NP>}               # Chunk prepositions followed by NP
      VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
      CLAUSE: {<NP><VP>}           # Chunk NP, VP pairs
    """
    parser = RegexpParser(grammar)
    chunked_sentence = parser.parse(tagged_words)
    return chunked_sentence

sentence = "The quick brown fox jumps over the lazy dog"
chunked_sentence = chunk_sentence(sentence)
print(chunked_sentence)
Output:
b. Write a Python program to perform Named Entity Recognition using nltk.
# Named Entity Recognition
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

def ner(text):
    words = word_tokenize(text)
    tagged_words = pos_tag(words)
    named_entities = ne_chunk(tagged_words)
    return named_entities

text = "Apple is a company based in California, United States. Steve Jobs was one of its founders."
named_entities = ner(text)
print(named_entities)
Output:
Signature of the Faculty
EXERCISE:
1. Write a Python program for chinking using nltk.
2. Use the Stanford Named Entity Recognizer to extract entities from the documents. Use it programmatically and output, for each document, which named entities it contains and of which type.
Signature of the Faculty
WEEK-4                                                        Date:
a. Write a Python program to find Term Frequency and Inverse Document Frequency (TF-IDF).
# tf-idf
import nltk
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Tokenize and preprocess the documents
def preprocess_text(doc):
    # Tokenize the document into words
    tokens = nltk.word_tokenize(doc)
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]
    # Convert words to lowercase
    tokens = [word.lower() for word in tokens]
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Join the tokens back into a single string (space-separated)
    preprocessed_doc = ' '.join(tokens)
    return preprocessed_doc

# Preprocess all documents
preprocessed_documents = [preprocess_text(doc) for doc in documents]

# Compute TF-IDF scores using scikit-learn
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)

# Print TF-IDF matrix
print(tfidf_matrix.toarray())
Output:
b. Write a Python program for CYK parsing (Cocke-Younger-Kasami parsing) or chart parsing.
import nltk

grammar = nltk.CFG.fromstring("""
S -> V NP
V -> 'describe' | 'present'
NP -> PRP N
PRP -> 'your'
N -> 'work'
""")

parser = nltk.ChartParser(grammar)
sent = 'describe your work'.split()
print(list(parser.parse(sent)))
Output:
Signature of the Faculty
EXERCISE:
1. Write a Python program for CYK parsing by defining your own grammar.
Signature of the Faculty
B.Tech – CSE (Computational Intelligence)
WEEK-5                                                        Date:
a. Write a Python program to find all unigrams, bigrams and trigrams present in the given corpus.
import nltk
nltk.download('punkt')
from nltk.util import ngrams

samplText = 'this is a very good book to study'
for i in range(1, 4):
    NGRAMS = ngrams(sequence=nltk.word_tokenize(samplText), n=i)
    for grams in NGRAMS:
        print(grams)
Malla Reddy College of Engineering and Technology (MRCET Campus)
b. Write a Python program to find the probability of the given statement "This is my cat" by taking an example corpus into consideration.
Example corpus:
'This is a dog',
'This is a cat',
'I love my cat',
'This is my name'

def readData():
    data = ['This is a dog', 'This is a cat', 'I love my cat', 'This is my name']
    dat = []
    for i in range(len(data)):
        for word in data[i].split():
            dat.append(word)
    print(dat)
    return dat

def createBigram(data):
    listOfBigrams = []
    bigramCounts = {}
    unigramCounts = {}
    for i in range(len(data) - 1):
        # Only count a bigram when the next word is lower-case, so bigrams
        # do not cross sentence boundaries in this corpus
        if data[i + 1].islower():
            listOfBigrams.append((data[i], data[i + 1]))
            if (data[i], data[i + 1]) in bigramCounts:
                bigramCounts[(data[i], data[i + 1])] += 1
            else:
                bigramCounts[(data[i], data[i + 1])] = 1
        if data[i] in unigramCounts:
            unigramCounts[data[i]] += 1
        else:
            unigramCounts[data[i]] = 1
    return listOfBigrams, unigramCounts, bigramCounts

def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):
    # P(w2 | w1) = count(w1, w2) / count(w1)
    listOfProb = {}
    for bigram in listOfBigrams:
        word1 = bigram[0]
        listOfProb[bigram] = bigramCounts.get(bigram) / unigramCounts.get(word1)
    return listOfProb

if __name__ == '__main__':
    data = readData()
    listOfBigrams, unigramCounts, bigramCounts = createBigram(data)
    print("\nAll the possible bigrams are")
    print(listOfBigrams)
    print("\nBigrams along with their frequency")
    print(bigramCounts)
    print("\nUnigrams along with their frequency")
    print(unigramCounts)
    bigramProb = calcBigramProb(listOfBigrams, unigramCounts, bigramCounts)
    print("\nBigrams along with their probability")
    print(bigramProb)

    inputList = "This is my cat"
    splt = inputList.split()
    outputProb1 = 1
    bilist = []
    for i in range(len(splt) - 1):
        bilist.append((splt[i], splt[i + 1]))
    print("\nThe bigrams in the given sentence are")
    print(bilist)
    for i in range(len(bilist)):
        if bilist[i] in bigramProb:
            outputProb1 *= bigramProb[bilist[i]]
        else:
            outputProb1 *= 0
    print('\nProbability of sentence "This is my cat" = ' + str(outputProb1))
Signature of the Faculty
WEEK-6                                                        Date:
Use the Stanford Named Entity Recognizer to extract entities from the documents. Use it programmatically and output, for each document, which named entities it contains and of which type.
Signature of the Faculty
WEEK-7                                                        Date:
Choose any corpus available on the internet freely. For the corpus, for each document, count how many times each stop word occurs and find out which are the most frequently occurring stop words. Further, calculate the term frequency and inverse document frequency. The motivation behind this is to find out how important a document is to a given query. E.g., if the query is "The brown crow", "the" is less important; "brown" and "crow" are relatively more important. Since "the" is a more common word, its tf will be high; hence we multiply it by idf, which reflects how common a word is, to reduce its weight.
Signature of the Faculty
WEEK-8                                                        Date:
a. Write the Python code to perform sentiment analysis using NLP.
Signature of the Faculty
WEEK-9                                                        Date:
1. Write the Python code to develop a spam filter using NLP.
Signature of the Faculty
WEEK-10                                                       Date:
1. Write the Python code to detect fake news using NLP.
Signature of the Faculty