Natural Language Processing
Computer Science Department
Faculty of Sciences
University of Blida 1
Chapter 2 : Basic Text Processing
Preprocessing: Basic Text Processing
1. Sentence 2.Word Tokenization. 3. Word Normalization.
Segmentation.
4. STOP WORDS 5. P.O.S 6. NAMED 7. STEMMING
REMOVAL ENTITY AND
Agenda
Basic Text Processing (2)
• Stemming
• Lemmatization
• Practical Insights
18
Basic Text
Processing
7.Stemming and Lemmatization
Morphology
• Morphemes:
• The small meaningful units that make up
words
• Stems: The core meaning-bearing units
• Affixes: Bits and pieces that adhere to
stems
• Often with grammatical functions
• (ex. Prefixes like un-ال, and Suffixes like
–ly) ياء النسب
Stemming & Lemmatization Multiple stemmers for English
= Cut word down to base Language available in NLTK
formUses rough heuristics to reduce words to
Stemming: PorterStemmer, LancasterStemmer,
basic form called stem SnowballStemmer
Lemmatization: Uses vocabulary and
morphological analysis, reduce words to basic WordNetLemmatizer
form called root
Makes the meaning of run, runs, running, ran all
the same
Cuts down on complexity by reducing the
number of unique words
7.1 Lemmatization
• Reduce inflections or variant forms to base form
• am, are, is 🡪 be
• car, cars, car's, cars’ 🡪 car
• the boy's cars are different colors
🡪 the boy car be different color
• Lemmatization: has to find correct
dictionary headword form
• Frequently used in Machine translation
• Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer
‘want’
7.2 Stemming
• Stemming is the procedure that standardizes and generally truncates all words
with the same root to a common base form called a stem irrespective of their inflections.
• For example: amusing and amusement have the same stem as amus (using light stemmers,
Affix- Removal).
• Stemming ran on tokens to normalize them (put
words into its base form).
• The big problem with stemming is that sometimes it produces the stem word which may
not have any meaning.
• Using full/heavy stemmer may give end results more like lemmatization: for example,
is/are/were/am are normalized to be () .
Stemming
Reduce terms to their stems and Stemming is crude chopping of affixes
used
in information retrieval language dependent
e.g., automate(s), automatic, automation all reduced to
automat.
for example compressed
and compression are both for exampl compress and
accepted as compress ar both accept
equivalent to compress. as equival to compress
Stemming Example in Arabic Language
• https://d3i71xaburhd42.cloudfront.net/c1c603f8be203133c560185e447d80331adbced2/2-Table1-
1.png
Light Stemming Example in Arabic
https://www.researchgate.net/profile/Ahmed-Taha-26/publication/277813512/figure/tbl2/AS:667694206578689@1536202112053/Examples-of-light-
Arabic-words.png
stems-of-
Porter English
Stemmer
The most common English
stemmer
Porter Stemmer Algorithm for English
Language
http://snowball.tartarus.org/algorithms/porter/stemmer.ht
ml
Porter Stemmer Algorithm for English
Language
Porter Stemmer Algorithm for English Language (Step 1)
Porter Stemmer Algorithm for English Language (Step 2)
Porter Stemmer Algorithm for English
Language (Step 3)
Porter Stemmer Algorithm for English
Language (Step 4)
Porter Stemmer Algorithm for English
Language (Step 5)
PORTER STEMMER
5
PHASES
Porter STEMMER
Porter Stemmer Algorithm Consists of 5 Sets of rules that
should be applied sequentially and in order.
Before getting into the rules there are 2 definitions we need
to know:
• VOWEL : A, E, I, O, U, and Y.
• CONSONANT : Any Other Letter.
𝑉𝐶 𝑉
So, all words will be in this form :𝐶
𝑚
C : any number of CONSONANT
letters V : any number of
VOWEL letters
(VC)^m : any number (m) of a VOWEL followed by
CONSONANT
Porter STEMMER Rule Form
Rules in this algorithm are represented as :
(Condition) S1 -> S2
This form means that if The condition becomes true
then we Replace suffix S1 with suffix S2
e.g. (m > 1) EMENT -> NULL
So, for the word “Replacement”
it has ep, ac, em, en which represents the form (VC) so “ m = 4 “
then m > 1. so, its suffix S1 is EMENT then we will replace this S1
with S2 from the rule
so, it will become 🡪 Replac
Porter SEMMER Rule From cont.
Conditions may include the following
symbols:
• m : the number of (VC)
• *S : the words that ends with S
• *v* : the words that contains VOWEL Letters
• *d : the words that ends with double CONSONANT letters
• *o : the words that end with (CVC) where the second C is not “W” ,
“X”, or “Y”
The condition may also include expression with
AND, OR, And NOT
E.g. (*v* and m > 2) S1 -> S2.
Step 1a (no conditions)
• SSES -> SS
e.g., caresses -> caress
• IES -> I
e.g., ponies -> poni
• SS -> SS
e.g., Caress -> caress
• S -> ɛ
e.g., cars -> car
Step 1b
• (m<1) EED -> EE
e.g., agreed -> agree
• (*v*) ED -> ɛ
e.g., plastered -> plaster
• (*v*) ING -> ɛ
e.g., motoring -> motor
The following rules applies if and only if second or third rules are
applied
• AT -> ATE
• e.g., conflat(ed) -> conflate
• BL -> BLE
• e.g., trouble(ing) -> trouble
• (*d and not(*L or *S or *Z)) -> remove on of the double letters
• e.g., hopp(ing) -> hop
• e.g., fall(ing) -> fall (condition is false)
• (m=1 and *o) -> E
• e.g., fill(ing) -> file
• e.g., fail -> fail (condition is false)
Step 1c
• (*v*) Y -> I
• e.g., happy -> happi
• e.g., sky -> sky (condition
is false)
Step 2
• (m>0) ATIONAL -> ATE
• e.g., Relational -> Relate
• (m>0) IZATION -> IZE
• e.g., generalization -> generalize
• (m>0) BILITI -> BLE
• e.g., sensibiliti -> sensible
• (m>0) IZER -> IZE
• e.g., digitizer -> digitize
• (m>0) ABLI -> ABLE
• e.g., conformabli -> conformable
• (m>0) IVENESS -> IVE
• e.g., forgiveness -> forgive And there are
more
• (m>0) FULNESS -> FUL
• e.g., hopefulness-> hopeful
Step 3
• (m>0) ICATE -> IC
• e.g., triplicate -> triplic
• (m>0) FUL -> ɛ
• e.g., hopeful -> hope
• (m>0) NESS -> ɛ
• e.g., goodness -> good
• (m>0) ICITI -> IC
• e.g., replicate -> replic
• (m>0) ALIZE -> AL And there are
more
• e.g., generalize ->
general
Step 4
• (m>1) ANCE -> ɛ
• e.g., allowance -> allow
• (m>1) ENT -> ɛ
• e.g., dependent -> depend
• (m>1) IVE -> ɛ
• e.g., effective -> effect
• (m>1) IC -> ɛ
• e.g., gyroscopic -> gyroscop
• (m>1) OUS -> ɛ
• e.g., homologous ->
homolog
• (m>1 and (*S or *T)) ION -
>ɛ And there are
more
• e.g., adoption-> adopt
Step 5
• Step 5a :
• (m>1) E -> ɛ
e.g., probate ->
probat e.g., rate ->
rate
• (m=1 and not *o) E -> ɛ
e.g., cease -> ceas
• Step 5b :
• (m>1 and not *L and *d) E -> single letter (remove one of the
double letters)
e.g., controll -> control
e.g., roll -> roll
Porter STEMMER
After Applying all these rules, you will get the stem of the given
word.
Sometimes the stemmer give wrong stems so its not 100% accurate
but it work perfectly at the most cases.
Dealing with complex morphology is sometimes necessary
• Some languages require complex morpheme segmentation
• Turkish
• Uygarlastiramadiklarimizdanmissinizcasina
• `(behaving) as if you are among those whom we could not civilize’
• Uygar `civilized’ + las `become’
+ tir `cause’ + ama `not able’
+ dik `past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Preprocessing Stemming
Full Stemming and Light Stemming
⮚ A full stemmer works on finding the Root of the word. i.e.
Returning it to its origin.
⮚ Example:
1. Playing 🡪 Play
"”عمل” => "العامالن.2
⮚ A light stemmer works on only removing the Suffixes and
Prefixes attached to the word, thus finding its Stem.
⮚ Example:
1.Playing 🡪 Play
""عامل" => "العامالن.2
Arabic Stemmers
Preprocessing: 7.Stemming
⮚ An Affix Removal Stemming Algorithm has been developed and published in
2008 and compared to a former light stemmer presented by Aitao Chen and
Fredric Gey in the Text REtrieval Conference in 2002 (TREC 2002).
⮚ The developed Algorithm has proven to give better results by 3.34% using
manual testing.
⮚ It can be considered a hybrid technique between light and full
Stemmers as follows:
1. Removing Affixes; like Light Stemmers.
2. Returning some word variants to their roots; like plurals changing them
to singulars; like Full Stemmers.
Preprocessing: 7.Stemming
Aitao and Fred Arabic Stemming Algorithm
1. For words of at least five characters, remove the first
three characters if they are one of the Prefixes.
2. For words of at least four characters, remove the first two characters
if they are one of the Prefixes.
3. For words of at least four characters beginning with
Prefix و, remove the initial letter و.
4. For words of at least four characters beginning with Prefixes بor
ل, remove them if the resultant word is in the Arabic document
collection.
5. For words of at least four characters, recursively strip the two-character
Suffixes in the order of presentation.
6. For words of at least three characters, recursively strip the one-character
Preprocessing: 7.Stemming
The Hybrid Affix Removal Stemming Algorithm
⮚ Text Preprocessing
1. Read the document /query text.
2. Create an Indices Vector.
3. Normalize text (Part 1): remove shaddah and tanween.
4. Remove Stop Words in their different forms (Prefix + Suffix, Stop Word, Prefix + Stop Word, Stop Word + Suffix, Prefix
+ Stop Word + Suffix )
5. Normalize text (Part 2) : unification of letters’ forms
⮚ Stemming the Indices Vector
6. Suffix Removal; including returning some plural forms to singular forms.
7. Prefix Removal.
Preprocessing: 7.Stemming
The Hybrid Affix Removal Stemming Algorithm
(Cont.)
6. Suffix Removal:
For Words ending with a suffix & of at least 3-characters without the suffix,
1. Replace the suffix with " "هonly if it is “ & ”اتthe word isn’t starting with " "الOr its
length>5 letters, otherwise Remove “”ات. If so Stop Searching for more suffixes.
2. Replace the suffix by " “يif it is “”اه.
3. Stop Searching for Suffixes if the word starts with “ ”الAnd having any suffix except " "هand "“ي
or its length without the Suffix is less than or equal 5 characters.
4. Otherwise, Remove the suffix and Stop Searching for Suffixes.
7. Prefix Removal
I. Phase 1:
8. For words starting with “ ”ا اRemove “”ا.
9. For words of at least 5-characters, starting with “”وال, Remove “ & ”والStop Searching.
10. For words of more than 4-characters, starting with“”لل, Remove “ &”للStop Searching.
11. For words of at least 4-characters, starting with “”ال, Then Remove “”ال, Mark it as Non-
Stop Word & Stop Searching for Prefixes.
Preprocessing: 7.Stemming
The Hybrid Affix Removal Stemming Algorithm
(Cont.)
II. Phase 2:
For Words of at least 3-characters, starting with a Prefix, Remove Prefix.
⮚ Recheck for Stop Words and Remove them.
Shaimaa Arafat, Sally Saad. “An Affix Removal Stemming Algorithm for Arabic Language”. International Journal on Intelligent
Computing
Stemming Results Sample Document
Stemming Results Normalization and Stop
Words Removal
Stemming Results Normalization and Stemm
Stemming Results Comparison of the two
Stemmers
Preprocessing: 7.2 Lemmatization
• Lemmatization is similar to stemming in terms of function.
• Lemmatization functions produce lemmas which are dictionary-
based words which, unlike stems, are not truncated or ambiguous.
• The main difference between Stemming and lemmatization
is that it produces the root word, which has a meaning.
• For example: amusing and amusement have the same
lemma as amuse.
Lemmatization in Arabic Language
https://d3i71xaburhd42.cloudfront.net/
de1900b776827e036baea69769b09668fd
f9d166/1-TableI-1.png
Stemming Vs Lemmatization
• Stemming and Lemmatization both generate the foundation sort of the inflected
words and therefore the only difference is that stem may not be an actual word
whereas, lemma is an actual language word. Stemming follows an algorithm
with steps to perform on the words which makes it faster.
• Lemmatization provides better results by performing an analysis that
depends on the word's part-of-speech and producing real, dictionary words. As a
result, lemmatization is harder to implement and slower compared to stemming.
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs. Lemmatization
https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message- attachments/o/inbox
%2F4010658%2Ff0e0abfea00346cbdd6b6753a2cb1cb5%2F1_OTjdJlYF5vRIzpBfOw75KA.png?generation=1600866066577474&alt=media 5
8
Questions
• Given the following text:
• “Mr. Ahmed Mahmoud went to the store in Alger. The store had offers and he
bought some stuff for his family.”
• Apply the NLP pipeline on the given text as follows:
a) Word Tokenization.
b) Word Count.
c) Stop Words Removal.
d) Named Entity Recognition.
e) Lemmatization.
• Fill in the following table (next page) with each token and its properties.
Token Count Stop Word? Named Entity Tag Lemma
Solution Mr 1 Mr or Mister
Ahmed 1 PER
Mahmoud 1 PER
went 1 go
to 1
the 2
store 2
in 1
Alger 1 LOC (or GPE)
Had 1 have
Offers 1 offer
And 1
He 1 can be considered stop word or not
Bought 1 buy
Some 1
Stuff 1
For 1
His 1 can be considered stop word or not
family 1
Practical Insights
Code: Stemming
Input:
from nltk. stem. lancaster import LancasterStemmer
stemmer = LancasterStemmer()
# Try some stems
print('drive: {} ’ . format( stemmer. stem(' driv e ' ) ) )
print('drives: { } ' . format( stemmer. stem(' drives')))
print('driver: { } ' . f o r m a t ( s t e mm er. s t em ( ' dr i v er ' ) ) )
print('drivers: { } ' . format( stemmer. stem(' driv ers' ) ) )
print('driven: { } ' . format( stemmer. stem(' driven')))
Output:
Code: Stemming
# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "Programmers program with
programming
languages"
words = word_tokenize(sentence)
for w in words:
print(w, " : ", ps.stem(w))
Output :
Progra
Programmers:: program
m with program
programming::with
program
Languages :
languag
https://
www.geeksforgeeks.org/python-stemming-words-with-nltk/