NLP Preprocessing Lab Manual
Step 1: Tokenization
Library: nltk
Function: word_tokenize()
from nltk import word_tokenize, download
download('punkt')
def get_tokens(sentence):
    return word_tokenize(sentence)
print(get_tokens("I am reading NLP Fundamentals."))
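Note that word_tokenize() also splits punctuation into separate tokens, which later steps rely on:
# Punctuation such as ',' and '!' becomes a token of its own
print(get_tokens("Hello, world!"))  # ['Hello', ',', 'world', '!']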
Step 2: Parts-of-Speech (PoS) Tagging
Library: nltk
Function: pos_tag()
from nltk import pos_tag
download('averaged_perceptron_tagger')  # tagger model required by pos_tag()
words = get_tokens("I am reading NLP Fundamentals.")
print(pos_tag(words))
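If a tag such as VBG or NNP is unfamiliar, NLTK can describe it. A quick lookup sketch (this assumes the optional 'tagsets' resource):
import nltk
download('tagsets')  # documentation for the Penn Treebank tag set
nltk.help.upenn_tagset('VBG')  # prints what the VBG tag means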
Step 3: Stop Word Removal
Library: nltk.corpus
Function: stopwords.words('english')
from nltk.corpus import stopwords
download('stopwords')
stop_words = stopwords.words('english')
sentence = "I am learning Python. It is one of the most popular programming languages."
tokens = word_tokenize(sentence)
def remove_stop_words(tokens, stop_words):
    return [w for w in tokens if w.lower() not in stop_words]
print(remove_stop_words(tokens, stop_words))
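Because stopwords.words('english') returns a list, each membership test scans it linearly; on long texts, converting it to a set is a cheap speed-up:
stop_set = set(stop_words)  # O(1) membership checks instead of a list scan
print(remove_stop_words(tokens, stop_set))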
Step 4: Text Normalization
Library: Python built-in string methods
Function: str.replace()
def normalize(text):
    return text.replace("US", "United States")\
               .replace("UK", "United Kingdom")\
               .replace("-18", "-2018")
print(normalize("I visited the US from the UK on 22-10-18"))
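Chained str.replace() calls work for a fixed list of substitutions, but patterns such as two-digit years are handled more robustly with a regular expression. A minimal sketch (normalize_year is a hypothetical helper that assumes dd-mm-yy dates):
import re
def normalize_year(text):
    # Expand a trailing two-digit year, e.g. 22-10-18 -> 22-10-2018
    return re.sub(r'\b(\d{2}-\d{2})-(\d{2})\b', r'\1-20\2', text)
print(normalize_year("I visited the US from the UK on 22-10-18"))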
Step 5: Spelling Correction
Library: autocorrect
Class/Function: Speller(lang='en')
from autocorrect import Speller
spell = Speller(lang='en')
print(spell("Natureal"))
tokens = word_tokenize("Ntural Luanguage Processin deals with insightes")
print([spell(w) for w in tokens])
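To rebuild a corrected sentence rather than a token list, join the corrected words back together (a simple sketch that ignores spacing around punctuation):
print(" ".join(spell(w) for w in tokens))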
Step 6: Stemming
Library: nltk.stem
Class/Function: PorterStemmer().stem()
from nltk import stem
porter = stem.PorterStemmer()
print(porter.stem("production"))  # product
print(porter.stem("coming"))      # come
print(porter.stem("firing"))      # fire
Step 7: Lemmatization
Library: nltk.stem.wordnet
Class/Function: WordNetLemmatizer().lemmatize()
download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("products"))  # product
print(lemmatizer.lemmatize("coming"))    # coming (treated as a noun by default)
Step 8: Named Entity Recognition (NER)
Library: nltk
Functions: word_tokenize(), pos_tag(), ne_chunk()
from nltk import ne_chunk
download('maxent_ne_chunker')
download('words')
sentence = "We are reading a book published by Packt which is based out of Birmingham."
def get_ner(text):
    # Keep only the chunked subtrees (the named entities); plain
    # (word, tag) tuples have no label and are filtered out
    return [a for a in ne_chunk(pos_tag(word_tokenize(text)), binary=True)
            if hasattr(a, 'label')]
print(get_ner(sentence))
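get_ner() returns raw Tree objects. A small helper (a sketch reusing the same NLTK calls) flattens each chunk into a plain string, which also recovers multi-word entities:
def get_entity_strings(text):
    # Each NE subtree's leaves are (word, tag) pairs; join the words
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in ne_chunk(pos_tag(word_tokenize(text)), binary=True)
            if hasattr(subtree, 'label')]
print(get_entity_strings(sentence))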
Step 9: Word Sense Disambiguation
Library: nltk.wsd
Function: lesk()
import nltk
nltk.download('wordnet')
from nltk.wsd import lesk
from nltk import word_tokenize
sentence1 = "Keep your savings in the bank"
sentence2 = "It's so risky to drive over the banks of the road"
def get_synset(sentence, word):
    return lesk(word_tokenize(sentence), word)
print(get_synset(sentence1,'bank')) # Synset('savings_bank.n.02')
print(get_synset(sentence2,'bank')) # Synset('bank.v.07')
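To check which sense Lesk picked, every synset carries a dictionary gloss:
sense = get_synset(sentence1, 'bank')
if sense is not None:  # lesk() returns None when it cannot assign a sense
    print(sense.definition())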
Step 10: Sentence Boundary Detection
Library: nltk.tokenize
Function: sent_tokenize()
import nltk
from nltk.tokenize import sent_tokenize
def get_sentences(text):
    return sent_tokenize(text)
print(get_sentences("We are reading a book. Do you know who is the publisher? It is Packt. Packt is based out of Birmingham."))
print(get_sentences("Mr. Donald John Trump is the current president of the USA. Before joining politics, he was a businessman."))
Exercise
Now, try the following tasks on your own:
1. Tokenize this sentence: "Artificial Intelligence is transforming the world of technology."
2. Perform PoS tagging on the tokens.
3. Remove stop words from the same sentence.
4. Normalize this text: "He visited the USA in 19-09-20 and later moved to UK."
5. Correct spellings in this sentence:
"Ths is an exmple of splling corection."
6. Stem the following words: ["running", "flies", "production", "battling"]
7. Lemmatize the same words and compare results with stemming.
8. Run NER on this sentence: "Google opened a new office in London near the River Thames."