NLP Preprocessing Lecture 3

NLP Preprocessing Lab Manual

Step 1: Tokenization
Library: nltk
Function: word_tokenize()

from nltk import word_tokenize, download


# Tokenizer models; recent NLTK releases may ask for download('punkt_tab') instead
download('punkt')

def get_tokens(sentence):
    return word_tokenize(sentence)

print(get_tokens("I am reading NLP Fundamentals."))
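
To get a feel for what a word tokenizer does, here is a minimal sketch that uses only the standard library. It is not the algorithm behind word_tokenize(); the helper name simple_tokenize and the regular expression are assumptions made for illustration, and they ignore cases such as contractions that NLTK handles.

```python
import re

def simple_tokenize(sentence):
    # Hypothetical helper: runs of word characters become tokens,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_tokenize("I am reading NLP Fundamentals."))
```

For this sentence the final period is split off as a separate token, which is also how word_tokenize() treats sentence-final punctuation.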

Step 2: Parts-of-Speech (PoS) Tagging


Library: nltk
Function: pos_tag()

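from nltk import pos_tag, download

# Model used by pos_tag(); recent NLTK releases may ask for
# 'averaged_perceptron_tagger_eng' instead
download('averaged_perceptron_tagger')

# get_tokens() is defined in Step 1
words = get_tokens("I am reading NLP Fundamentals")

print(pos_tag(words))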

Step 3: Stop Word Removal


Library: nltk.corpus
Function: stopwords.words('english')

from nltk.corpus import stopwords


download('stopwords')

stop_words = stopwords.words('english')
sentence = "I am learning Python. It is one of the most popular programming languages."
tokens = word_tokenize(sentence)

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase because the NLTK stop-word list is lowercase
    return [w for w in tokens if w.lower() not in stop_words]

print(remove_stop_words(tokens, stop_words))
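
The filter above keeps punctuation tokens such as "." because they are not stop words. As a rough sketch, the variant below also drops tokens that contain no letters or digits and uses a set for fast membership tests. The names SAMPLE_STOP_WORDS, sample_tokens and remove_stop_words_and_punct are hypothetical, and the small hand-written set stands in for the NLTK list so the snippet runs without any downloads.

```python
# Small hand-written set for illustration only (an assumption, not the NLTK list)
SAMPLE_STOP_WORDS = {"i", "am", "it", "is", "of", "the", "most"}

def remove_stop_words_and_punct(tokens, stop_words):
    # Keep a token only if it is not a stop word and
    # contains at least one alphanumeric character.
    return [w for w in tokens
            if w.lower() not in stop_words and any(c.isalnum() for c in w)]

sample_tokens = ["I", "am", "learning", "Python", ".", "It", "is", "one", "of",
                 "the", "most", "popular", "programming", "languages", "."]
print(remove_stop_words_and_punct(sample_tokens, SAMPLE_STOP_WORDS))
```

In practice the same function can be called with set(stopwords.words('english')) in place of the sample set.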

Step 4: Text Normalization


Library: Python built-in string methods
Function: str.replace()

def normalize(text):
    # Chained replacements: expand the abbreviations, then the two-digit year
    return (text.replace("US", "United States")
                .replace("UK", "United Kingdom")
                .replace("-18", "-2018"))

print(normalize("I visited the US from the UK on 22-10-18"))
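
Plain str.replace() matches substrings, so "US" would also be rewritten inside "USA". One possible way around this is a regular expression with word boundaries, sketched below. The ABBREVIATIONS mapping and the function name normalize_with_regex are hypothetical, and the date rule assumes every two-digit year in a dd-mm-yy date belongs to the 2000s.

```python
import re

# Hypothetical mapping for illustration; extend as needed
ABBREVIATIONS = {"US": "United States", "UK": "United Kingdom"}

def normalize_with_regex(text):
    # \b word boundaries keep "US" from matching inside "USA"
    pattern = r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
    text = re.sub(pattern, lambda m: ABBREVIATIONS[m.group(1)], text)
    # Assumption: a two-digit year in dd-mm-yy belongs to the 2000s
    return re.sub(r"\b(\d{2}-\d{2})-(\d{2})\b", r"\1-20\2", text)

print(normalize_with_regex("I visited the US from the UK on 22-10-18"))
```

Because the year pattern requires exactly two trailing digits, a date that already has a four-digit year is left unchanged.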

Step 5: Spelling Correction


Library: autocorrect
Class/Function: Speller(lang='en')

from autocorrect import Speller


spell = Speller(lang='en')

print(spell("Natureal"))

tokens = word_tokenize("Ntural Luanguage Processin deals with insightes")


print([spell(w) for w in tokens])

Step 6: Stemming
Library: nltk.stem
Class/Function: PorterStemmer().stem()

from nltk import stem


porter = stem.PorterStemmer()

print(porter.stem("production"))
print(porter.stem("coming"))
print(porter.stem("firing"))

Step 7: Lemmatization
Library: nltk.stem.wordnet
Class/Function: WordNetLemmatizer().lemmatize()

download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("products"))
# Words are treated as nouns by default; pass pos='v' to lemmatize "coming" as a verb
print(lemmatizer.lemmatize("coming"))

Step 8: Named Entity Recognition (NER)


Library: nltk
Functions: word_tokenize(), pos_tag(), ne_chunk()

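from nltk import ne_chunk
from nltk.tree import Tree

download('maxent_ne_chunker')
download('words')

sentence = "We are reading a book published by Packt which is based out of Birmingham."

def get_ner(text):
    # Named entities come back as Tree nodes; other tokens are plain (word, tag) tuples.
    # Checking for Tree also keeps multi-word entities.
    tree = ne_chunk(pos_tag(word_tokenize(text)), binary=True)
    return [a for a in tree if isinstance(a, Tree)]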

print(get_ner(sentence))

Step 9: Word Sense Disambiguation


Library: nltk.wsd
Function: lesk()

import nltk
nltk.download('wordnet')
from nltk.wsd import lesk
from nltk import word_tokenize

sentence1 = "Keep your savings in the bank"


sentence2 = "It's so risky to drive over the banks of the road"

def get_synset(sentence, word):
    # lesk() picks the WordNet sense whose definition overlaps most with the context words
    return lesk(word_tokenize(sentence), word)

print(get_synset(sentence1,'bank')) # Synset('savings_bank.n.02')
print(get_synset(sentence2,'bank')) # Synset('bank.v.07')
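
The idea behind the Lesk approach can be sketched without WordNet: choose the sense whose gloss shares the most words with the sentence. The code below is a simplified illustration only. The TOY_SENSES glosses are invented for this example and are not WordNet definitions, and the function name simplified_lesk is hypothetical.

```python
# Invented glosses for illustration; real definitions come from WordNet
TOY_SENSES = {
    "bank (financial)": "an institution that accepts deposits and keeps savings and money",
    "bank (slope)": "sloping land beside a river or road",
}

def simplified_lesk(context_tokens, senses):
    # Score each sense by how many gloss words also occur in the context
    context = {t.lower() for t in context_tokens}
    best_sense, best_overlap = None, -1
    for name, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:  # on a tie, the earlier sense wins
            best_sense, best_overlap = name, overlap
    return best_sense

print(simplified_lesk("Keep your savings in the bank".split(), TOY_SENSES))
print(simplified_lesk("It's so risky to drive over the banks of the road".split(), TOY_SENSES))
```

With such short glosses a single shared word ("savings" or "road") decides the result, which hints at why overlap-based disambiguation can be fragile on real text.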

Step 10: Sentence Boundary Detection


Library: nltk.tokenize
Function: sent_tokenize()

import nltk
from nltk.tokenize import sent_tokenize

def get_sentences(text):
    return sent_tokenize(text)

print(get_sentences("We are reading a book. Do you know who is the publisher? It is Packt. "
                    "Packt is based out of Birmingham."))

print(get_sentences("Mr. Donald John Trump is the current president of the USA. Before "
                    "joining politics, he was a businessman."))
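
To see why a trained sentence tokenizer is useful, compare it with a naive rule that splits after every ".", "?" or "!" followed by whitespace. The sketch below uses only the standard library. The function name naive_sentence_split is hypothetical, and the rule is deliberately simplistic rather than a description of how sent_tokenize() works.

```python
import re

def naive_sentence_split(text):
    # Split on whitespace that follows '.', '?' or '!'
    return re.split(r"(?<=[.?!])\s+", text)

text = ("Mr. Donald John Trump is the current president of the USA. Before "
        "joining politics, he was a businessman.")
for part in naive_sentence_split(text):
    print(part)
```

The naive rule breaks the text after "Mr.", producing three pieces instead of two sentences. Abbreviations like this are the kind of case a sentence tokenizer has to handle.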
-------------- ------------ - -------------------- ----------------- ------------------

Exercise

Now, try the following tasks on your own:

1. Tokenize this sentence:
   "Artificial Intelligence is transforming the world of technology."
2. Perform PoS tagging on the tokens.
3. Remove stop words from the same sentence.
4. Normalize this text:
   "He visited the USA in 19-09-20 and later moved to UK."
5. Correct spellings in this sentence:
   "Ths is an exmple of splling corection."
6. Stem the following words: ["running", "flies", "production", "battling"]
7. Lemmatize the same words and compare results with stemming.
8. Run NER on this sentence:
   "Google opened a new office in London near the River Thames."
