0% found this document useful (0 votes)

23 views58 pages

Lecture 3 - Basic Text Processing

The document discusses basic text processing techniques in Natural Language Processing, focusing on stemming and lemmatization. It explains the differences between the two methods, with stemming being a crude approach that reduces words to their base forms, while lemmatization uses dictionary-based analysis to produce meaningful root words. Additionally, it covers the application of these techniques in various languages, including Arabic, and presents algorithms for effective stemming.

Uploaded by

nadjibdjellali9

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views58 pages

Lecture 3 - Basic Text Processing

Uploaded by

nadjibdjellali9

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 58

Natural Language Processing

Computer Science Department

Faculty of Sciences
University of Blida 1

Chapter 2 : Basic Text Processing

Preprocessing: Basic Text Processing

1. Sentence 2.Word Tokenization. 3. Word Normalization.

Segmentation.

4. STOP WORDS 5. P.O.S 6. NAMED 7. STEMMING

REMOVAL ENTITY AND
Agenda

Basic Text Processing (2)

• Stemming
• Lemmatization
• Practical Insights
18

Basic Text
Processing
7.Stemming and Lemmatization
Morphology

• Morphemes:
• The small meaningful units that make up
words
• Stems: The core meaning-bearing units
• Affixes: Bits and pieces that adhere to
stems
• Often with grammatical functions
• (ex. Prefixes like un-‫ال‬, and Suffixes like
–ly‫) ياء النسب‬
Stemming & Lemmatization Multiple stemmers for English
= Cut word down to base Language available in NLTK
formUses rough heuristics to reduce words to
Stemming: PorterStemmer, LancasterStemmer,
basic form called stem SnowballStemmer
Lemmatization: Uses vocabulary and
morphological analysis, reduce words to basic WordNetLemmatizer
form called root
Makes the meaning of run, runs, running, ran all
the same
Cuts down on complexity by reducing the
number of unique words
7.1 Lemmatization
• Reduce inflections or variant forms to base form
• am, are, is 🡪 be
• car, cars, car's, cars’ 🡪 car
• the boy's cars are different colors
🡪 the boy car be different color
• Lemmatization: has to find correct
dictionary headword form
• Frequently used in Machine translation
• Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer
‘want’
7.2 Stemming
• Stemming is the procedure that standardizes and generally truncates all words
with the same root to a common base form called a stem irrespective of their inflections.
• For example: amusing and amusement have the same stem as amus (using light stemmers,
Affix- Removal).
• Stemming ran on tokens to normalize them (put
words into its base form).

• The big problem with stemming is that sometimes it produces the stem word which may
not have any meaning.
• Using full/heavy stemmer may give end results more like lemmatization: for example,
is/are/were/am are normalized to be () .
Stemming

Reduce terms to their stems and Stemming is crude chopping of affixes

used
in information retrieval language dependent
e.g., automate(s), automatic, automation all reduced to
automat.

for example compressed

and compression are both for exampl compress and
accepted as compress ar both accept
equivalent to compress. as equival to compress
Stemming Example in Arabic Language

• https://d3i71xaburhd42.cloudfront.net/c1c603f8be203133c560185e447d80331adbced2/2-Table1-
1.png
Light Stemming Example in Arabic

https://www.researchgate.net/profile/Ahmed-Taha-26/publication/277813512/figure/tbl2/AS:667694206578689@1536202112053/Examples-of-light-
Arabic-words.png
stems-of-
Porter English
Stemmer
The most common English
stemmer
Porter Stemmer Algorithm for English
Language

http://snowball.tartarus.org/algorithms/porter/stemmer.ht
ml
Porter Stemmer Algorithm for English
Language
Porter Stemmer Algorithm for English Language (Step 1)
Porter Stemmer Algorithm for English Language (Step 2)
Porter Stemmer Algorithm for English
Language (Step 3)
Porter Stemmer Algorithm for English
Language (Step 4)
Porter Stemmer Algorithm for English
Language (Step 5)
PORTER STEMMER

5
PHASES
Porter STEMMER
Porter Stemmer Algorithm Consists of 5 Sets of rules that
should be applied sequentially and in order.

Before getting into the rules there are 2 definitions we need

to know:
• VOWEL : A, E, I, O, U, and Y.
• CONSONANT : Any Other Letter.

𝑉𝐶 𝑉
So, all words will be in this form :𝐶
𝑚

C : any number of CONSONANT

letters V : any number of
VOWEL letters
(VC)^m : any number (m) of a VOWEL followed by
CONSONANT
Porter STEMMER Rule Form

Rules in this algorithm are represented as :

(Condition) S1 -> S2
This form means that if The condition becomes true
then we Replace suffix S1 with suffix S2
e.g. (m > 1) EMENT -> NULL
So, for the word “Replacement”
it has ep, ac, em, en which represents the form (VC) so “ m = 4 “
then m > 1. so, its suffix S1 is EMENT then we will replace this S1
with S2 from the rule
so, it will become 🡪 Replac
Porter SEMMER Rule From cont.

Conditions may include the following

symbols:
• m : the number of (VC)
• *S : the words that ends with S
• *v* : the words that contains VOWEL Letters
• *d : the words that ends with double CONSONANT letters
• *o : the words that end with (CVC) where the second C is not “W” ,
“X”, or “Y”
The condition may also include expression with
AND, OR, And NOT
E.g. (*v* and m > 2) S1 -> S2.
Step 1a (no conditions)
• SSES -> SS
e.g., caresses -> caress
• IES -> I
e.g., ponies -> poni
• SS -> SS
e.g., Caress -> caress
• S -> ɛ
e.g., cars -> car
Step 1b
• (m<1) EED -> EE
e.g., agreed -> agree
• (*v*) ED -> ɛ
e.g., plastered -> plaster
• (*v*) ING -> ɛ
e.g., motoring -> motor
The following rules applies if and only if second or third rules are
applied
• AT -> ATE
• e.g., conflat(ed) -> conflate
• BL -> BLE
• e.g., trouble(ing) -> trouble
• (*d and not(*L or *S or *Z)) -> remove on of the double letters
• e.g., hopp(ing) -> hop
• e.g., fall(ing) -> fall (condition is false)
• (m=1 and *o) -> E
• e.g., fill(ing) -> file
• e.g., fail -> fail (condition is false)
Step 1c

• (*v*) Y -> I
• e.g., happy -> happi
• e.g., sky -> sky (condition
is false)
Step 2
• (m>0) ATIONAL -> ATE
• e.g., Relational -> Relate
• (m>0) IZATION -> IZE
• e.g., generalization -> generalize
• (m>0) BILITI -> BLE
• e.g., sensibiliti -> sensible
• (m>0) IZER -> IZE
• e.g., digitizer -> digitize
• (m>0) ABLI -> ABLE
• e.g., conformabli -> conformable
• (m>0) IVENESS -> IVE
• e.g., forgiveness -> forgive And there are
more
• (m>0) FULNESS -> FUL
• e.g., hopefulness-> hopeful
Step 3

• (m>0) ICATE -> IC

• e.g., triplicate -> triplic
• (m>0) FUL -> ɛ
• e.g., hopeful -> hope
• (m>0) NESS -> ɛ
• e.g., goodness -> good
• (m>0) ICITI -> IC
• e.g., replicate -> replic
• (m>0) ALIZE -> AL And there are
more
• e.g., generalize ->
general
Step 4

• (m>1) ANCE -> ɛ

• e.g., allowance -> allow
• (m>1) ENT -> ɛ
• e.g., dependent -> depend
• (m>1) IVE -> ɛ
• e.g., effective -> effect
• (m>1) IC -> ɛ
• e.g., gyroscopic -> gyroscop
• (m>1) OUS -> ɛ
• e.g., homologous ->
homolog
• (m>1 and (*S or *T)) ION -
>ɛ And there are
more

• e.g., adoption-> adopt

Step 5

• Step 5a :
• (m>1) E -> ɛ
e.g., probate ->
probat e.g., rate ->
rate
• (m=1 and not *o) E -> ɛ
e.g., cease -> ceas
• Step 5b :
• (m>1 and not *L and *d) E -> single letter (remove one of the
double letters)
e.g., controll -> control
e.g., roll -> roll
Porter STEMMER

After Applying all these rules, you will get the stem of the given
word.
Sometimes the stemmer give wrong stems so its not 100% accurate
but it work perfectly at the most cases.
Dealing with complex morphology is sometimes necessary

• Some languages require complex morpheme segmentation

• Turkish
• Uygarlastiramadiklarimizdanmissinizcasina
• `(behaving) as if you are among those whom we could not civilize’
• Uygar `civilized’ + las `become’
+ tir `cause’ + ama `not able’
+ dik `past’ + lar ‘plural’
+ imiz ‘p1pl’ + dan ‘abl’
+ mis ‘past’ + siniz ‘2pl’ + casina ‘as if’
Preprocessing Stemming

Full Stemming and Light Stemming

⮚ A full stemmer works on finding the Root of the word. i.e.
Returning it to its origin.
⮚ Example:
1. Playing 🡪 Play
"‫”عمل” => "العامالن‬.2
⮚ A light stemmer works on only removing the Suffixes and
Prefixes attached to the word, thus finding its Stem.
⮚ Example:
1.Playing 🡪 Play
"‫"عامل" => "العامالن‬.2
Arabic Stemmers
Preprocessing: 7.Stemming

⮚ An Affix Removal Stemming Algorithm has been developed and published in

2008 and compared to a former light stemmer presented by Aitao Chen and
Fredric Gey in the Text REtrieval Conference in 2002 (TREC 2002).

⮚ The developed Algorithm has proven to give better results by 3.34% using
manual testing.

⮚ It can be considered a hybrid technique between light and full

Stemmers as follows:
1. Removing Affixes; like Light Stemmers.
2. Returning some word variants to their roots; like plurals changing them
to singulars; like Full Stemmers.
Preprocessing: 7.Stemming

Aitao and Fred Arabic Stemming Algorithm

1. For words of at least five characters, remove the first
three characters if they are one of the Prefixes.
2. For words of at least four characters, remove the first two characters
if they are one of the Prefixes.
3. For words of at least four characters beginning with
Prefix ‫و‬, remove the initial letter ‫و‬.
4. For words of at least four characters beginning with Prefixes ‫ ب‬or
‫ ل‬, remove them if the resultant word is in the Arabic document
collection.
5. For words of at least four characters, recursively strip the two-character
Suffixes in the order of presentation.
6. For words of at least three characters, recursively strip the one-character
Preprocessing: 7.Stemming
The Hybrid Affix Removal Stemming Algorithm
⮚ Text Preprocessing
1. Read the document /query text.
2. Create an Indices Vector.
3. Normalize text (Part 1): remove shaddah and tanween.
4. Remove Stop Words in their different forms (Prefix + Suffix, Stop Word, Prefix + Stop Word, Stop Word + Suffix, Prefix
+ Stop Word + Suffix )

5. Normalize text (Part 2) : unification of letters’ forms

⮚ Stemming the Indices Vector

6. Suffix Removal; including returning some plural forms to singular forms.
7. Prefix Removal.
Preprocessing: 7.Stemming
The Hybrid Affix Removal Stemming Algorithm
(Cont.)
6. Suffix Removal:
For Words ending with a suffix & of at least 3-characters without the suffix,
1. Replace the suffix with "‫ "ه‬only if it is “‫ & ”ات‬the word isn’t starting with "‫ "ال‬Or its
length>5 letters, otherwise Remove “‫”ات‬. If so Stop Searching for more suffixes.
2. Replace the suffix by "‫ “ي‬if it is “‫”اه‬.
3. Stop Searching for Suffixes if the word starts with “‫ ”ال‬And having any suffix except "‫ "ه‬and "‫“ي‬
or its length without the Suffix is less than or equal 5 characters.
4. Otherwise, Remove the suffix and Stop Searching for Suffixes.
7. Prefix Removal
I. Phase 1:
8. For words starting with “‫ ”ا ا‬Remove “‫”ا‬.
9. For words of at least 5-characters, starting with “‫”وال‬, Remove “‫ & ”وال‬Stop Searching.
10. For words of more than 4-characters, starting with“‫”لل‬, Remove “‫ &”لل‬Stop Searching.
11. For words of at least 4-characters, starting with “‫”ال‬, Then Remove “‫”ال‬, Mark it as Non-
Stop Word & Stop Searching for Prefixes.
Preprocessing: 7.Stemming

The Hybrid Affix Removal Stemming Algorithm

(Cont.)
II. Phase 2:
For Words of at least 3-characters, starting with a Prefix, Remove Prefix.
⮚ Recheck for Stop Words and Remove them.
Shaimaa Arafat, Sally Saad. “An Affix Removal Stemming Algorithm for Arabic Language”. International Journal on Intelligent
Computing
Stemming Results Sample Document
Stemming Results Normalization and Stop
Words Removal
Stemming Results Normalization and Stemm
Stemming Results Comparison of the two
Stemmers
Preprocessing: 7.2 Lemmatization

• Lemmatization is similar to stemming in terms of function.

• Lemmatization functions produce lemmas which are dictionary-
based words which, unlike stems, are not truncated or ambiguous.

• The main difference between Stemming and lemmatization

is that it produces the root word, which has a meaning.

• For example: amusing and amusement have the same

lemma as amuse.
Lemmatization in Arabic Language

https://d3i71xaburhd42.cloudfront.net/
de1900b776827e036baea69769b09668fd
f9d166/1-TableI-1.png
Stemming Vs Lemmatization

• Stemming and Lemmatization both generate the foundation sort of the inflected
words and therefore the only difference is that stem may not be an actual word
whereas, lemma is an actual language word. Stemming follows an algorithm
with steps to perform on the words which makes it faster.
• Lemmatization provides better results by performing an analysis that
depends on the word's part-of-speech and producing real, dictionary words. As a
result, lemmatization is harder to implement and slower compared to stemming.
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs Lemmatization
Stemming Vs. Lemmatization
https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message- attachments/o/inbox
%2F4010658%2Ff0e0abfea00346cbdd6b6753a2cb1cb5%2F1_OTjdJlYF5vRIzpBfOw75KA.png?generation=1600866066577474&alt=media 5
8
Questions

• Given the following text:

• “Mr. Ahmed Mahmoud went to the store in Alger. The store had offers and he
bought some stuff for his family.”
• Apply the NLP pipeline on the given text as follows:
a) Word Tokenization.
b) Word Count.
c) Stop Words Removal.
d) Named Entity Recognition.
e) Lemmatization.
• Fill in the following table (next page) with each token and its properties.
Token Count Stop Word? Named Entity Tag Lemma

Solution Mr 1 Mr or Mister

Ahmed 1 PER
Mahmoud 1 PER
went 1 go
to 1
the 2
store 2
in 1
Alger 1 LOC (or GPE)
Had 1 have
Offers 1 offer
And 1
He 1 can be considered stop word or not

Bought 1 buy
Some 1
Stuff 1
For 1
His 1 can be considered stop word or not

family 1
Practical Insights
Code: Stemming
Input:
from nltk. stem. lancaster import LancasterStemmer
stemmer = LancasterStemmer()

# Try some stems

print('drive: {} ’ . format( stemmer. stem(' driv e ' ) ) )
print('drives: { } ' . format( stemmer. stem(' drives')))
print('driver: { } ' . f o r m a t ( s t e mm er. s t em ( ' dr i v er ' ) ) )
print('drivers: { } ' . format( stemmer. stem(' driv ers' ) ) )
print('driven: { } ' . format( stemmer. stem(' driven')))
Output:
Code: Stemming
# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

sentence = "Programmers program with

programming
languages"
words = word_tokenize(sentence)

for w in words:
print(w, " : ", ps.stem(w))

Output :
Progra
Programmers:: program
m with program
programming::with
program
Languages :
languag
https://
www.geeksforgeeks.org/python-stemming-words-with-nltk/

Text Processing for IR Systems
No ratings yet
Text Processing for IR Systems
43 pages
NLP Chapter 2
No ratings yet
NLP Chapter 2
103 pages
NLP Chap2
No ratings yet
NLP Chap2
126 pages
04 Word Normalization and Stemming 11-47
No ratings yet
04 Word Normalization and Stemming 11-47
5 pages
Week 2
No ratings yet
Week 2
9 pages
Introduction - Types of Stemming Algorithms
No ratings yet
Introduction - Types of Stemming Algorithms
28 pages
Stemming vs Lemmatization in NLP
No ratings yet
Stemming vs Lemmatization in NLP
12 pages
Chapter 2 Part II
No ratings yet
Chapter 2 Part II
75 pages
Morphology
No ratings yet
Morphology
44 pages
NLP Chapter 2
No ratings yet
NLP Chapter 2
10 pages
Module 2 Complete
No ratings yet
Module 2 Complete
134 pages
Basics of Text Processing
No ratings yet
Basics of Text Processing
28 pages
Overview of the Porter Stemmer Algorithm
No ratings yet
Overview of the Porter Stemmer Algorithm
3 pages
8-Morphology Part3
No ratings yet
8-Morphology Part3
27 pages
Overview of the Porter Stemmer
No ratings yet
Overview of the Porter Stemmer
12 pages
Understanding Stemming Algorithms
No ratings yet
Understanding Stemming Algorithms
23 pages
NLP Manual
No ratings yet
NLP Manual
9 pages
Module 3 - Part 1
No ratings yet
Module 3 - Part 1
54 pages
Text Preprocessing
No ratings yet
Text Preprocessing
39 pages
Chap 2
No ratings yet
Chap 2
70 pages
Rule Based Urdu Stemmer: Rohit Kansal Vishal Goyal G. S. Lehal
No ratings yet
Rule Based Urdu Stemmer: Rohit Kansal Vishal Goyal G. S. Lehal
10 pages
Wordlevel Analysis - Chap2
No ratings yet
Wordlevel Analysis - Chap2
97 pages
2 - Text Operation - 1
No ratings yet
2 - Text Operation - 1
28 pages
IR Group Assignment
No ratings yet
IR Group Assignment
5 pages
NLP Notes 1&2
No ratings yet
NLP Notes 1&2
3 pages
Text Normalization in NLP
No ratings yet
Text Normalization in NLP
29 pages
Lecture 2 IR System Components
No ratings yet
Lecture 2 IR System Components
10 pages
Porter Stemming Algorithm Guide
No ratings yet
Porter Stemming Algorithm Guide
6 pages
Word Level Analysis NLP Mod 2
No ratings yet
Word Level Analysis NLP Mod 2
18 pages
NLP Morphology for Linguists
No ratings yet
NLP Morphology for Linguists
8 pages
NLP Text Preprocessing Techniques
No ratings yet
NLP Text Preprocessing Techniques
59 pages
NLP 2
No ratings yet
NLP 2
13 pages
Lab - Manual - IR - BE AI&DS CL II
No ratings yet
Lab - Manual - IR - BE AI&DS CL II
38 pages
3.word Level Analysis-Tokenization Stemming
No ratings yet
3.word Level Analysis-Tokenization Stemming
8 pages
NLP - Notes
No ratings yet
NLP - Notes
17 pages
II - 2 Unit
No ratings yet
II - 2 Unit
73 pages
NLP 2 Chapter 2
No ratings yet
NLP 2 Chapter 2
6 pages
2 Text Operations
No ratings yet
2 Text Operations
32 pages
Unit Iii Data Structure
No ratings yet
Unit Iii Data Structure
43 pages
Natural Language Computing
No ratings yet
Natural Language Computing
20 pages
An Accuracy-Enhanced Light Stemmer For Arabic Text
No ratings yet
An Accuracy-Enhanced Light Stemmer For Arabic Text
22 pages
Lecture-3 (Words - Transducers)
No ratings yet
Lecture-3 (Words - Transducers)
61 pages
Irs Ii
No ratings yet
Irs Ii
39 pages
Morphology FST
No ratings yet
Morphology FST
47 pages
Regular Expression
No ratings yet
Regular Expression
29 pages
Porter Stemmer
No ratings yet
Porter Stemmer
14 pages
Unit 2 Data - Structures
No ratings yet
Unit 2 Data - Structures
84 pages
NLP Experiments No-1
No ratings yet
NLP Experiments No-1
7 pages
Session 1
No ratings yet
Session 1
33 pages
3-More On Indexing & Text Operations
No ratings yet
3-More On Indexing & Text Operations
27 pages
NLP Lecture 6 Week 3
No ratings yet
NLP Lecture 6 Week 3
9 pages
NLP - Sem
No ratings yet
NLP - Sem
31 pages
Unit 3 - Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
No ratings yet
Unit 3 - Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval
8 pages
Words & Transducers
No ratings yet
Words & Transducers
7 pages
Lexical and Morphological Analysis in NLP
No ratings yet
Lexical and Morphological Analysis in NLP
9 pages
NLB Final Lab Manual
No ratings yet
NLB Final Lab Manual
23 pages
4 PorterStemmer
No ratings yet
4 PorterStemmer
23 pages
Module 2
No ratings yet
Module 2
78 pages
Lecture 6 - Word2Vec and Text Classification
No ratings yet
Lecture 6 - Word2Vec and Text Classification
66 pages
Lecture 1 - Introduction To NLP
No ratings yet
Lecture 1 - Introduction To NLP
79 pages
The Art of War
No ratings yet
The Art of War
93 pages
Wangmz Beautiful Soup
No ratings yet
Wangmz Beautiful Soup
3 pages
1 DW Series - Regression-Exercise 1 (Linear Regression)
No ratings yet
1 DW Series - Regression-Exercise 1 (Linear Regression)
2 pages
Chap 2 Linear Regression - Part2
No ratings yet
Chap 2 Linear Regression - Part2
16 pages
BBC Learning Englihs (Can, Could, Be Able To)
No ratings yet
BBC Learning Englihs (Can, Could, Be Able To)
5 pages
Unit 0 1: Sudd V Sudd V
No ratings yet
Unit 0 1: Sudd V Sudd V
1 page
Year 7 NAPLAN Style Language Conventions Practice Test 3 - Print
No ratings yet
Year 7 NAPLAN Style Language Conventions Practice Test 3 - Print
10 pages
English Grammar Exercises
No ratings yet
English Grammar Exercises
2 pages
Active and Passive Voice
No ratings yet
Active and Passive Voice
10 pages
Academic IELTS Task 1 Useful Vocabulary For Graphs and Diagrams
No ratings yet
Academic IELTS Task 1 Useful Vocabulary For Graphs and Diagrams
4 pages
Class 13 Modal Passive
No ratings yet
Class 13 Modal Passive
4 pages
Soal Cerdas Cermat Bahasa Inggris Tingkat Masmasmk-1
100% (4)
Soal Cerdas Cermat Bahasa Inggris Tingkat Masmasmk-1
14 pages
Engleski-Sva Teorija
No ratings yet
Engleski-Sva Teorija
11 pages
Tense and Effect Worksheet
No ratings yet
Tense and Effect Worksheet
2 pages
Old English Grammar: Nouns & Adjectives
No ratings yet
Old English Grammar: Nouns & Adjectives
28 pages
Energize1 G&V U3 STANDARD
No ratings yet
Energize1 G&V U3 STANDARD
2 pages
Adverbs of Manner Rules & Examples
No ratings yet
Adverbs of Manner Rules & Examples
5 pages
Advanced Grammar Lesson Grade10 Updated
No ratings yet
Advanced Grammar Lesson Grade10 Updated
18 pages
Active & Passive Voice
No ratings yet
Active & Passive Voice
32 pages
Sanskrit Level 3 Week 0 2546 PDF
No ratings yet
Sanskrit Level 3 Week 0 2546 PDF
4 pages
Giao Trinh Listening Level 3 G I L P
No ratings yet
Giao Trinh Listening Level 3 G I L P
90 pages
English Vocabulary and Research Activities
No ratings yet
English Vocabulary and Research Activities
9 pages
Meaning of Bubba - Google Search
No ratings yet
Meaning of Bubba - Google Search
1 page
Unis Can ENGLEZA 5
100% (1)
Unis Can ENGLEZA 5
124 pages
Pau Inglés-19
No ratings yet
Pau Inglés-19
1 page
IELTS Keywords
No ratings yet
IELTS Keywords
11 pages
Grade 7 English Analytical Plan 2023
100% (1)
Grade 7 English Analytical Plan 2023
5 pages
Past Continuous Information
No ratings yet
Past Continuous Information
7 pages
Child Development Chart: Preschool Milestones: by Mayo Clinic Staff
No ratings yet
Child Development Chart: Preschool Milestones: by Mayo Clinic Staff
2 pages
Language Skills Workshop Activities
No ratings yet
Language Skills Workshop Activities
6 pages
Irregular Verbs - Basic List in Groups
No ratings yet
Irregular Verbs - Basic List in Groups
2 pages
Graph and Chart Trend Descriptions
No ratings yet
Graph and Chart Trend Descriptions
17 pages
Action vs Linking Verbs Worksheet
100% (2)
Action vs Linking Verbs Worksheet
2 pages
Descriptive Grammar Test - Grammar & Syntax
No ratings yet
Descriptive Grammar Test - Grammar & Syntax
5 pages

Lecture 3 - Basic Text Processing

Uploaded by

Lecture 3 - Basic Text Processing

Uploaded by

Natural Language Processing

Computer Science Department

Chapter 2 : Basic Text Processing

1. Sentence 2.Word Tokenization. 3. Word Normalization.

4. STOP WORDS 5. P.O.S 6. NAMED 7. STEMMING

Basic Text Processing (2)

Reduce terms to their stems and Stemming is crude chopping of affixes

for example compressed

Before getting into the rules there are 2 definitions we need

C : any number of CONSONANT

Rules in this algorithm are represented as :

Conditions may include the following

• (m>0) ICATE -> IC

• (m>1) ANCE -> ɛ

• e.g., adoption-> adopt

• Some languages require complex morpheme segmentation

Full Stemming and Light Stemming

⮚ An Affix Removal Stemming Algorithm has been developed and published in

⮚ It can be considered a hybrid technique between light and full

Aitao and Fred Arabic Stemming Algorithm

5. Normalize text (Part 2) : unification of letters’ forms

⮚ Stemming the Indices Vector

The Hybrid Affix Removal Stemming Algorithm

• Lemmatization is similar to stemming in terms of function.

• The main difference between Stemming and lemmatization

• For example: amusing and amusement have the same

• Given the following text:

# Try some stems

sentence = "Programmers program with

You might also like