CT107-3-3-TXSA - Text Analytics and Sentiment Analysis: Text Normalisation
Lab 4: Text Normalization: Stemming and Lemmatization
This session covers:
- Different types of stemmers & stemming: Porter | Lancaster | Snowball
- Different types of lemmatizers & lemmatizing: WordNet lemmatizer | TextBlob lemmatizer
- Lemmatization with & without POS tags
Learning Outcome:
Apply text mining and natural language processing methodologies to textual data.
QUICK REVIEW
Mapping from foxes to fox is called stemming. Morphological parsing or stemming applies to many affixes other than plurals; for example, we might need to take any English verb form ending in -ing (going, talking, congratulating) and parse it into its verbal stem plus the -ing morpheme.
The Porter algorithm is a simple and efficient way to do stemming by stripping off affixes. It is not as accurate as a transducer model that includes a lexicon, but it may be preferable for applications such as information retrieval, in which the exact morphological structure is not needed.
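To make "stripping off affixes" concrete, here is a minimal sketch (an addition to this handout, not required lab code) of a naive rule-based stemmer built from a single regular expression. Comparing its output with the Porter stemmer in the practices below shows why bare suffix rules without a lexicon quickly go wrong:

import re

def naive_stem(word):
    # Strip one common suffix; there is no lexicon check, so the
    # result can be wrong (e.g. "going" -> "go", but "ring" -> "r").
    return re.sub(r'(ing|ed|es|s)$', '', word)

for w in ["going", "talking", "congratulating", "foxes", "ring"]:
    print(w, "->", naive_stem(w))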
PRACTICES
4.1. Stemmers
Use the following simple code to implement the Porter stemmer.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
example_words1 = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
example_words2 = ["List", "listed", "lists", "listing", "listings"]

# Print the stem of each example word
for w in example_words1:
    print(ps.stem(w))
for w in example_words2:
    print(ps.stem(w))
You can combine this algorithm with tokenization in order to stem the words in a sentence, as below:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
new_text = """It is very important to be pythonly while you are pythoning
with python. All pythoners have pythoned poorly at least once."""
words = word_tokenize(new_text)
print([ps.stem(w) for w in words])
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer you should use one of these in
preference to crafting your own using regular expressions, since these handle a wide range of irregular cases. The Porter
and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles
the word lying (mapping it to lie), while the Lancaster stemmer does not.
from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
porter = PorterStemmer()
lancaster = LancasterStemmer()
print([porter.stem(t) for t in tokens])
print("\n").
print([lancaster.stem(t) for t in tokens])
Compare and discuss the results of the two stemmers (Porter and Lancaster), noting any differences you observe.
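To make the comparison easier, the following small sketch (an addition, not part of the original lab) prints the Porter and Lancaster stems side by side for each token:

from nltk.stem import PorterStemmer, LancasterStemmer
from nltk.tokenize import word_tokenize

raw = "strange women lying in ponds distributing swords"
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Print each token with both stems in aligned columns
print("{0:15}{1:15}{2:15}".format("TOKEN", "PORTER", "LANCASTER"))
for t in word_tokenize(raw):
    print("{0:15}{1:15}{2:15}".format(t, porter.stem(t), lancaster.stem(t)))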
4.2. Lemmatization using WordNet lemmatizer
The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. This additional checking
process makes the lemmatizer slower than the above stemmers.
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("\nproduced :", lemmatizer.lemmatize("produced", pos="v"))

ps = PorterStemmer()
print("\nStem of the word produced :", ps.stem("produced"))

print("\nbetter :", lemmatizer.lemmatize("better", pos="a"))
print("\nwomen :", lemmatizer.lemmatize("women", pos="n"))
Notice that it doesn't handle lying, but it converts women to woman.
import nltk
from nltk.tokenize import word_tokenize

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)

wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])
print()
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
print()

example_words = ["List", "listed", "lists", "listing", "listings"]
print([wnl.lemmatize(w) for w in example_words])
print()

for w in example_words:
    print("{0:20}{1:20}".format(w, wnl.lemmatize(w, pos="v")))
The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and need a list of valid lemmas (or lexicon headwords). Without part-of-speech information, however, the resulting lemmas are not always the ones found in the dictionary. To obtain the exact lemma as per the dictionary, the POS tag should be included in the call:
for t in tokens:
    print("{0:20}{1:20}".format(t, wnl.lemmatize(t, pos="v")))
4.3. Lemmatization using TextBlob
Words can be lemmatized by calling the lemmatize method on TextBlob word objects:
from textblob import TextBlob

sentence = TextBlob("""DENNIS: Listen, strange women lying in ponds distributing
swords is no basis for a system of government. Supreme executive power derives
from a mandate from the masses, not from some farcical aquatic ceremony.""")
tokens = sentence.words
print(tokens)
print()

# WordList.lemmatize() returns a new WordList of lemmas
print(tokens.lemmatize())

# Each Word object also has its own lemmatize method, which accepts a POS tag
for t in tokens:
    print("{0:20}{1:20}".format(t, t.lemmatize("v")))
4.4. Stemming & Lemmatization
Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word in the language.
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
#file = open ("D:/APU/TXSA-CT107-3-3/TUTORIAL/sample01.txt")
#raw = file.read()
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government. Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
words = raw.lower()
print(words)
print()
tokens = word_tokenize(words)
print("Tokens")
print(tokens)
print()
print("Lemmas")
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t, pos = "v") for t in tokens])
print()
print("Porter Stemming")
ps = PorterStemmer()
print ([ps.stem(t) for t in tokens])
print()
print("Lancaster Stemming")
ls = LancasterStemmer()
print ([ls.stem(t) for t in tokens])
print()
print("Snowball Stemming")
sn = nltk.SnowballStemmer("english")
print([sn.stem(t) for t in tokens])
NOTE:
Stemming and lemmatization both generate a root form of inflected words, but a stem might not be an actual word, whereas a lemma is. Lemmatization looks each candidate up in the WordNet corpus to produce a valid lemma, which is what makes it slower than stemming.
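To see the speed difference for yourself, here is a minimal timing sketch (an addition, using the standard-library timeit module; absolute numbers will vary by machine):

import timeit

setup = """
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
wnl = WordNetLemmatizer()
words = ["women", "lying", "ponds", "distributing", "swords", "derives"]
"""

# Time 1,000 passes over the word list with each normaliser
print("Porter :", timeit.timeit("[ps.stem(w) for w in words]", setup=setup, number=1000))
print("WordNet:", timeit.timeit("[wnl.lemmatize(w) for w in words]", setup=setup, number=1000))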
4.5. Stemmers --> Snowball Stemmer
import nltk
print(nltk.SnowballStemmer.languages)
print(len(nltk.SnowballStemmer.languages))
print()
text = "This is achieved in practice during stemming, a text preprocessing
operation."
tokens = nltk.tokenize.word_tokenize(text)
print()
stemmer = nltk.SnowballStemmer('english')
print([stemmer.stem(t) for t in tokens])
print()
text2 = "Ceci est réalisé en pratique lors du stemming, une opération de
prétraitement de texte."
tokens2 = nltk.tokenize.word_tokenize(text2)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens2])
4.6. Snowball Stemmer --> for other space-delimited languages
from textblob import TextBlob
import nltk
en_blob = TextBlob(u"This is achieved in practice during stemming, a text preprocessing operation.")
# NOTE: detect_language() and translate() rely on an online translation
# service and have been removed from recent TextBlob releases, so these
# two calls may fail depending on the installed version.
print(en_blob.detect_language())
fr_blob = en_blob.translate(from_lang="en", to='fr')
print(fr_blob)
tokens = fr_blob.words
print(tokens)
print()
stemmer = nltk.SnowballStemmer('french')
print([stemmer.stem(t) for t in tokens])
Revision Quiz:
https://quizlet.com/512734559/test?answerTermSides=2&promptTermSides=6&questionCount=7&questionTypes=14&showImages=true
Take-home task:
Perform lemmatization including the POS tags ADJECTIVE, NOUN, VERB, and ADVERB. Write suitable Python code to produce the proper output.