NLP Preprocessing Lab Manual
Step 1: Tokenization
Library: nltk
Function: word_tokenize()
from nltk import word_tokenize, download
download('punkt')
def get_tokens(sentence):
    return word_tokenize(sentence)
print(get_tokens("I am reading NLP Fundamentals."))
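Note that word_tokenize() also splits punctuation into separate tokens, which later steps rely on:
# Punctuation such as ',' and '!' becomes a token of its own
print(get_tokens("Hello, world!"))  # ['Hello', ',', 'world', '!']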
Step 2: Parts-of-Speech (PoS) Tagging
Library: nltk
Function: pos_tag()
from nltk import pos_tag
download('averaged_perceptron_tagger')  # tagger model required by pos_tag()
words = get_tokens("I am reading NLP Fundamentals.")
print(pos_tag(words))
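If a tag such as VBG or NNP is unfamiliar, NLTK can describe it. A quick lookup sketch (this assumes the optional 'tagsets' resource):
import nltk
download('tagsets')  # documentation for the Penn Treebank tag set
nltk.help.upenn_tagset('VBG')  # prints what the VBG tag means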
Step 3: Stop Word Removal
Library: nltk.corpus
Function: stopwords.words('english')
from nltk.corpus import stopwords
download('stopwords')
stop_words = stopwords.words('english')
sentence = "I am learning Python. It is one of the most popular programming languages."
tokens = word_tokenize(sentence)
def remove_stop_words(tokens, stop_words):
    return [w for w in tokens if w.lower() not in stop_words]
print(remove_stop_words(tokens, stop_words))
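Because stopwords.words('english') returns a list, each membership test scans it linearly; on long texts, converting it to a set is a cheap speed-up:
stop_set = set(stop_words)  # O(1) membership checks instead of a list scan
print(remove_stop_words(tokens, stop_set))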
Step 4: Text Normalization
Library: Python built-in string methods
Function: str.replace()
def normalize(text):
    return text.replace("US", "United States")\
               .replace("UK", "United Kingdom")\
               .replace("-18", "-2018")
print(normalize("I visited the US from the UK on 22-10-18"))
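Chained str.replace() calls work for a fixed list of substitutions, but patterns such as two-digit years are handled more robustly with a regular expression. A minimal sketch (normalize_year is a hypothetical helper that assumes dd-mm-yy dates):
import re
def normalize_year(text):
    # Expand a trailing two-digit year, e.g. 22-10-18 -> 22-10-2018
    return re.sub(r'\b(\d{2}-\d{2})-(\d{2})\b', r'\1-20\2', text)
print(normalize_year("I visited the US from the UK on 22-10-18"))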
Step 5: Spelling Correction
Library: autocorrect
Class/Function: Speller(lang='en')
from autocorrect import Speller
spell = Speller(lang='en')
print(spell("Natureal"))
tokens = word_tokenize("Ntural Luanguage Processin deals with insightes")
print([spell(w) for w in tokens])
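To rebuild a corrected sentence rather than a token list, join the corrected words back together (a simple sketch that ignores spacing around punctuation):
print(" ".join(spell(w) for w in tokens))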
Step 6: Stemming
Library: nltk.stem
Class/Function: PorterStemmer().stem()
from nltk import stem
porter = stem.PorterStemmer()
print(porter.stem("production"))  # product
print(porter.stem("coming"))      # come
print(porter.stem("firing"))      # fire
Step 7: Lemmatization
Library: nltk.stem.wordnet
Class/Function: WordNetLemmatizer().lemmatize()
download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("products"))  # product
print(lemmatizer.lemmatize("coming"))    # coming (treated as a noun by default)
Step 8: Named Entity Recognition (NER)
Library: nltk
Functions: word_tokenize(), pos_tag(), ne_chunk()
from nltk import ne_chunk
download('maxent_ne_chunker')
download('words')
sentence = "We are reading a book published by Packt which is based out of Birmingham."
def get_ner(text):
    # Keep only the chunked subtrees (the named entities); plain
    # (word, tag) tuples have no label and are filtered out
    return [a for a in ne_chunk(pos_tag(word_tokenize(text)), binary=True)
            if hasattr(a, 'label')]
print(get_ner(sentence))
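get_ner() returns raw Tree objects. A small helper (a sketch reusing the same NLTK calls) flattens each chunk into a plain string, which also recovers multi-word entities:
def get_entity_strings(text):
    # Each NE subtree's leaves are (word, tag) pairs; join the words
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in ne_chunk(pos_tag(word_tokenize(text)), binary=True)
            if hasattr(subtree, 'label')]
print(get_entity_strings(sentence))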
Step 9: Word Sense Disambiguation
Library: nltk.wsd
Function: lesk()
import nltk
nltk.download('wordnet')
from nltk.wsd import lesk
from nltk import word_tokenize
sentence1 = "Keep your savings in the bank"
sentence2 = "It's so risky to drive over the banks of the road"
def get_synset(sentence, word):
    return lesk(word_tokenize(sentence), word)
print(get_synset(sentence1,'bank')) # Synset('savings_bank.n.02')
print(get_synset(sentence2,'bank')) # Synset('bank.v.07')
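To check which sense Lesk picked, every synset carries a dictionary gloss:
sense = get_synset(sentence1, 'bank')
if sense is not None:  # lesk() returns None when it cannot assign a sense
    print(sense.definition())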
Step 10: Sentence Boundary Detection
Library: nltk.tokenize
Function: sent_tokenize()
import nltk
from nltk.tokenize import sent_tokenize
def get_sentences(text):
    return sent_tokenize(text)
print(get_sentences("We are reading a book. Do you know who is the publisher? It is Packt. Packt is based out of Birmingham."))
print(get_sentences("Mr. Donald John Trump is the current president of the USA. Before joining politics, he was a businessman."))
Exercise
Now, try the following tasks on your own:
1. Tokenize this sentence: "Artificial Intelligence is transforming the world of technology."
2. Perform PoS tagging on the tokens.
3. Remove stop words from the same sentence.
4. Normalize this text: "He visited the USA in 19-09-20 and later moved to UK."
5. Correct spellings in this sentence:
"Ths is an exmple of splling corection."
6. Stem the following words: ["running", "flies", "production", "battling"]
7. Lemmatize the same words and compare results with stemming.
8. Run NER on this sentence: "Google opened a new office in London near the River Thames."