Week 2

The document explains the concept of stemming in Natural Language Processing (NLP) and details the Porter Stemming Algorithm, which reduces words to their root forms by removing common suffixes. It outlines the algorithm's steps, advantages, and disadvantages, along with examples of how various words are stemmed. Additionally, a Python program demonstrates the implementation of the Porter Stemmer using the NLTK library.

Uploaded by

Rida kaunain

WEEK -2

AIM: Write a Python program that applies the Porter stemming algorithm to a list of words

What is Stemming?
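Stemming is the process of reducing words to a common root or base form (the stem). The idea is to
strip affixes, most often suffixes, so that different forms of a word can be analyzed as a single term. Ideally:
• "Running" → "Run"
• "Happiness" → "Happy"
• "Studies" → "Study"
A rule-based stemmer does not always return a dictionary word: the Porter stemmer, for example,
produces "happi" and "studi" for the last two.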
Why Stemming?
In Natural Language Processing (NLP), stemming is crucial because it helps in grouping similar
words under a single root word. This makes it easier for algorithms to analyze text data.
Key Reasons for Stemming:
1. Simplification: It simplifies words into their base form, reducing the complexity of analyzing
variations of a word.
2. Efficiency: It reduces the number of unique words in a text, making analysis faster and easier.
3. Accuracy in Analysis: It helps in focusing on the core meaning rather than the different
forms of a word.
When Do We Go for Stemming?
We use stemming in cases where:
• We need to analyze text data that involves multiple forms of the same word.
• We want to improve search results (e.g., find "running" or "runs" when searching for "run").
• We are working on tasks like text classification, sentiment analysis, or information retrieval.
Why Does NLP Need Stemming?
NLP needs stemming because:
• Languages have inflectional variations: A single word can take many forms based on tense,
person, or number (e.g., "run", "running", "runs"). Stemming reduces these variations to a
common stem, making them easier for machines to match and analyze.
• Consistency in Analysis: Without stemming, different forms of the same word would be
treated as completely different words, which fragments counts and weakens matching.
Real-Life Example:
Imagine you're analyzing customer reviews of a product. If you don't apply stemming, the words
"love", "loved", "loving" would all be treated as separate, and your analysis could be skewed. With
stemming, they would all be reduced to "love", making your analysis more accurate.
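The effect on the review counts can be sketched in a few lines. This is only an illustration: `count_by_stem`, `demo_stem`, and the sample tokens are hypothetical names introduced here, and `demo_stem` is a tiny lookup table standing in for a real stemmer such as NLTK's `PorterStemmer().stem`.

```python
from collections import Counter
from typing import Callable, Iterable


def count_by_stem(tokens: Iterable[str], stem: Callable[[str], str]) -> Counter:
    """Count tokens after mapping each one to its stem."""
    return Counter(stem(token.lower()) for token in tokens)


# Stand-in stemmer for the demo only (an assumption, not the Porter rules).
DEMO_STEMS = {"love": "love", "loved": "love", "loving": "love"}


def demo_stem(word: str) -> str:
    return DEMO_STEMS.get(word, word)


review_tokens = ["I", "love", "it", "loved", "the", "design", "loving", "it"]

raw_counts = Counter(token.lower() for token in review_tokens)
stem_counts = count_by_stem(review_tokens, demo_stem)

print(raw_counts["love"], raw_counts["loved"], raw_counts["loving"])  # three separate counts
print(stem_counts["love"])  # one combined count
```

Passing the stemmer in as a function keeps the counting logic independent of whichever stemmer is used.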
Porter Stemming Algorithm
The Porter Stemming Algorithm is a widely used stemming algorithm that reduces words to their
root form by removing common morphological and inflectional endings.
It was developed by Martin Porter in 1980, and it works through a set of rules that follow the
structure of words based on suffixes.
This process is designed to be simple, efficient, and effective for English text.
The algorithm works by applying a series of transformation rules to words in a fixed order.
Whether a rule fires depends on the classification of letters as consonants or vowels, on the
measure of the remaining stem, and on special conditions attached to specific suffixes.
Key Concepts in the Porter Stemmer Algorithm
To understand the rules applied by the Porter Stemming algorithm, it’s essential to understand the
following concepts:
1. Vowels and Consonants
The algorithm classifies letters in a word into two categories:
• Vowels (V): the letters a, e, i, o, u, plus y when it follows a consonant (as in "happy").
• Consonants (C): all other letters, including y at the start of a word or after a vowel (as in "toy").
In the algorithm, the distinction between vowels and consonants helps in identifying how to remove
suffixes from words. Words are processed by identifying the structure of the characters in them.
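A minimal sketch of this classification, including the context-dependent treatment of "y". The function names `is_consonant` and `cv_pattern` are illustrative choices, not part of any library.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] acts as a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word: str) -> str:
    """Map each letter to 'C' or 'V'."""
    word = word.lower()
    return "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))


for w in ["running", "happy", "toy"]:
    print(w, cv_pattern(w))
```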
2. M (The Measure)
The letter M refers to the "measure" of a word or stem. Writing C for a run of one or more
consonants and V for a run of one or more vowels, every word has the form [C](VC)^M[V], so M is
the number of vowel-consonant (VC) sequences. It is a rough proxy for the number of syllables. In
the rules, M is computed on the stem that would remain after the suffix is removed.
For example:
• In the word "running", the measure is 2 because the pattern is
r - u - nn - i - ng (C-V-C-V-C), which contains two VC sequences.
• In "cats", the measure is 1, since c - a - ts follows C-V-C, which contains one VC sequence.
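The measure can be computed by counting every place where a run of vowels is followed by a consonant. The sketch below assumes lowercase alphabetic input; `measure` and `is_consonant` are illustrative names.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """'y' counts as a consonant only at the start of a word or after a vowel."""
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-consonant (VC) sequences, i.e. M in [C](VC)^M[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m


for w in ["tree", "cats", "running"]:
    print(w, measure(w))
```

In the stemming rules this function is applied to the candidate stem, not to the whole word, so `measure("bett")` is what decides whether "-er" may be removed from "better".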
3. The "Step" Mechanism
The Porter algorithm processes words in stages (called "steps") and applies rules based on the
measure (M) and the suffix of the word. These steps allow the algorithm to systematically reduce the
word.

Detailed Steps of the Porter Stemming Algorithm

The Porter Stemming algorithm follows several steps, each consisting of rules that are triggered based
on conditions such as the word's suffix and its measure (M).
The process is carried out as follows:
Step 1: Suffix Removal
This step handles plural and past-tense/participle endings such as "-s", "-es", "-ed", and "-ing". Here
are the key rules:
1. Step 1A (Plural Suffixes):
o "sses" becomes "ss", "ies" becomes "i", a final "ss" is kept, and any other final "s" is
removed. No measure condition applies.
• Example: "caresses" → "caress"
• Example: "studies" → "studi"
2. Step 1B ("ed" and "ing" Suffixes):
o If a word ends in "ed" or "ing" and the remaining stem contains a vowel, remove the
suffix. (A final "eed" becomes "ee" when the stem has measure greater than 0.)
o After a removal the stem is tidied up: "at", "bl", and "iz" get an "e" back, a doubled final
consonant (other than l, s, or z) is reduced to one, and a short consonant-vowel-consonant
stem with measure 1 gets an "e" back.
• Example: "running" → "runn" → "run"
• Example: "baking" → "bak" → "bake"
3. Step 1C (Final "y"):
o A final "y" becomes "i" if the stem contains a vowel.
• Example: "happy" → "happi"
Suffixes such as "ly", "ment", "ful", and "ness" are not handled here; the ones that Porter covers are
dealt with in later steps.
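The three parts of Step 1 can be sketched as follows. This follows the rules as published in Porter's 1980 paper and assumes lowercase alphabetic input; library implementations (including NLTK's default mode) deviate in a few places, so results can differ for some words. All function names here are illustrative.

```python
VOWELS = "aeiou"


def is_consonant(w, i):
    if w[i] in VOWELS:
        return False
    if w[i] == "y":
        return i == 0 or not is_consonant(w, i - 1)
    return True


def measure(stem):
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(w):
    return len(w) >= 2 and w[-1] == w[-2] and is_consonant(w, len(w) - 1)


def ends_cvc(w):
    """Consonant-vowel-consonant ending whose last letter is not w, x or y."""
    n = len(w)
    return (n >= 3 and is_consonant(w, n - 3) and not is_consonant(w, n - 2)
            and is_consonant(w, n - 1) and w[-1] not in "wxy")


def step1a(w):
    if w.endswith("sses") or w.endswith("ies"):
        return w[:-2]          # sses -> ss, ies -> i
    if w.endswith("ss"):
        return w               # ss is kept
    if w.endswith("s"):
        return w[:-1]          # plain plural s is removed
    return w


def step1b(w):
    if w.endswith("eed"):
        return w[:-1] if measure(w[:-3]) > 0 else w
    for suffix in ("ed", "ing"):
        if w.endswith(suffix):
            stem = w[: -len(suffix)]
            if not has_vowel(stem):
                return w
            # Tidy up the stem after a successful removal.
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"
            if ends_double_consonant(stem) and stem[-1] not in "lsz":
                return stem[:-1]
            if measure(stem) == 1 and ends_cvc(stem):
                return stem + "e"
            return stem
    return w


def step1c(w):
    if w.endswith("y") and has_vowel(w[:-1]):
        return w[:-1] + "i"
    return w


def step1(word):
    return step1c(step1b(step1a(word.lower())))


for w in ["running", "baking", "studies", "easily"]:
    print(w, "->", step1(w))
```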
Step 2: Suffix Modifications
This step maps longer derivational suffixes such as "-ational", "-ization", and "-fulness" onto
shorter ones.
• A suffix is rewritten only if the stem before it has measure greater than 0.
o Example: "relational" → "relate"

o Example: "organization" → "organize" (later steps reduce this to "organ")

o Note: "applying" is not affected by this step; Step 1 turns it into "apply" and then "appli".
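Step 2 lends itself to a table-driven sketch: find the longest matching suffix, then rewrite it only if the remaining stem has measure greater than 0. The rule table below is the Step 2 list from Porter's 1980 paper; the names `STEP2_RULES` and `step2` are illustrative, and the input is assumed to be lowercase.

```python
VOWELS = "aeiou"


def is_consonant(w, i):
    if w[i] in VOWELS:
        return False
    if w[i] == "y":
        return i == 0 or not is_consonant(w, i - 1)
    return True


def measure(stem):
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


STEP2_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def step2(word):
    # Try longer suffixes first so that "ational" wins over "tional".
    for suffix in sorted(STEP2_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + STEP2_RULES[suffix]
            return word  # longest match found but condition failed: stop
    return word


for w in ["relational", "organization", "rational"]:
    print(w, "->", step2(w))
```

"rational" is left alone because the stem "r" has measure 0, which is exactly the kind of case the measure condition is meant to protect.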

Step 3: Further Refinement

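This step removes or shortens a further set of suffixes when the stem has measure greater than 0.
• Common transformations include removing "-ness" and "-ful" and shortening "-icate", "-alize",
and "-ical". Suffixes such as "-ment", "-ence", "-er", and "-ion" (after "s" or "t") are removed in a
later step (Step 4), and only when the stem has measure greater than 1. A final step (Step 5)
tidies up a trailing "e" and a doubled "l".
o Example: "hopeful" → "hope" ("hopeless" is left unchanged, because Porter has no rule for "-less")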
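The difference between the "measure > 0" condition of Step 3 and the stricter "measure > 1" condition of Step 4 can be sketched with one shared helper. The rule lists follow Porter's 1980 paper; `apply_longest`, `step3`, and `step4` are illustrative names, and the input is assumed to be lowercase.

```python
VOWELS = "aeiou"


def is_consonant(w, i):
    if w[i] in VOWELS:
        return False
    if w[i] == "y":
        return i == 0 or not is_consonant(w, i - 1)
    return True


def measure(stem):
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


STEP3_RULES = [("icate", "ic"), ("ative", ""), ("alize", "al"),
               ("iciti", "ic"), ("ical", "ic"), ("ful", ""), ("ness", "")]

STEP4_SUFFIXES = ["al", "ance", "ence", "er", "ic", "able", "ible", "ant",
                  "ement", "ment", "ent", "ion", "ou", "ism", "ate", "iti",
                  "ous", "ive", "ize"]
STEP4_RULES = [(s, "") for s in STEP4_SUFFIXES]


def apply_longest(word, rules, min_m):
    """Rewrite the longest matching suffix if the stem's measure exceeds min_m."""
    for suffix, repl in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # "ion" is only removed after "s" or "t".
            ion_ok = suffix != "ion" or stem.endswith(("s", "t"))
            if measure(stem) > min_m and ion_ok:
                return stem + repl
            return word
    return word


def step3(word):
    return apply_longest(word, STEP3_RULES, 0)   # requires m > 0


def step4(word):
    return apply_longest(word, STEP4_RULES, 1)   # requires m > 1


print(step3("hopeful"), step3("globalize"))
print(step4("connection"), step4("better"))
```

The stricter threshold in Step 4 is why "better" survives: its stem "bett" has measure 1.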

Examples of Porter Stemmer Rules

1. Example 1: Running → Run
o The word "running" ends in "ing". According to Step 1B, the suffix "ing" is removed
because the remaining stem "runn" contains a vowel.
o The doubled final consonant is then reduced, giving "run", which is the base/root form of the word.

2. Example 2: Better → Better

o "Better" is an irregular comparative of "good", but the algorithm only looks at spelling.
The "-er" rule in Step 4 requires the stem to have measure greater than 1, and "bett" has
measure 1, so no rule fires.
o The word remains as "better".

3. Example 3: Studies → Studi

o "Studies" ends in "ies". In Step 1A, "ies" is replaced by "i"; no measure condition is
involved.
o The result is "studi", which is the stem even though it is not a dictionary word.

Why Does the Algorithm Work?

The Porter Stemmer is built based on the premise that the suffixes at the end of English words carry a
lot of information about word categories (verbs, nouns, adjectives). By stripping away common
suffixes, the algorithm helps reduce the complexity of analyzing words and makes it easier to focus on
the word's core meaning.
Advantages of the Porter Stemmer
1. Efficiency: It is relatively fast and can process large volumes of text quickly.
2. Simplicity: The algorithm is simple and straightforward, making it easy to implement.
3. Wide Applicability: It works well for most English words and is widely used in information
retrieval and text mining.
Disadvantages of the Porter Stemmer
1. Over-Stemming: The algorithm might reduce words too aggressively, producing stems that
are not actual words or merging words with different meanings, which could be problematic in
some contexts.
o Example: "universal", "university", and "universe" all become "univers". By contrast,
"caresses" → "caress" and "fishing" → "fish" are benign reductions.
2. Irregular Forms: The algorithm doesn't handle irregular forms (like "better" or "good")
correctly because it follows predefined rules rather than understanding the meaning behind
the word.
3. Loss of Meaning: The over-simplification of words could lead to loss of nuance and subtle
meaning, especially in highly specialized domains.
Example 1: Running → Run
1. Step 1B: The word "running" ends with "ing", which is a common suffix for verbs. The
algorithm checks that the remaining stem contains a vowel and removes "ing".
o The stem "runn" contains the vowel "u", so the suffix "ing" is removed.

o The stem ends in a doubled consonant, so "runn" is reduced to "run".

Result: "running" → run

Example 2: Happily → Happili

1. Step 1C: The word "happily" ends with "y". Porter has no rule that removes "ly" on its
own.
o The stem "happil" contains a vowel, so the final "y" is replaced by "i".

o No later rule matches "happili", so it is returned as is.

Result: "happily" → happili

Example 3: Better → Better

1. The word "better" is an irregular comparative adjective, but the Porter Stemmer only applies
spelling rules and has no knowledge of irregular forms.
2. The only candidate rule removes "er" in Step 4, which requires the stem to have a measure > 1.
The stem "bett" has a measure of 1, so the word stays the same.
Result: "better" → better

Example 4: Studies → Studi

1. Step 1A: The word "studies" ends with "ies". In Step 1A, "ies" is replaced by "i" without
any measure check.
o The suffix "ies" becomes "i", resulting in "studi".

o No later rule applies to "studi".

Result: "studies" → studi

Example 5: Connection → Connect
1. Step 4: The word "connection" ends with "ion", a common suffix used in forming nouns. The
suffix is removed only if the stem ends in "s" or "t" and has a measure > 1.
o The stem "connect" ends in "t" and has a measure of 2 (c - o - nn - e - ct).
o The suffix "ion" is removed, resulting in "connect".

Result: "connection" → connect

Example 6: Playing → Play

1. Step 1B: The word "playing" ends with the suffix "ing".
o The remaining stem "play" contains a vowel.

o The suffix "ing" is removed, resulting in "play".

Result: "playing" → play

Example 7: Caring → Care

1. Step 1B: The word "caring" ends with the suffix "ing".
o The remaining stem "car" contains a vowel, so "ing" is removed.

o "car" is a short consonant-vowel-consonant stem with measure 1, so an "e" is added back, resulting in "care".

Result: "caring" → care

Example 8: Easily → Easili

1. Step 1C: The word "easily" ends with "y". Porter has no rule that removes "ly" on its own.
o The stem "easil" contains a vowel.

o The final "y" is replaced by "i", resulting in "easili".

Result: "easily" → easili

Example 9: Globalization → Global

1. Steps 2 and 3: The word "globalization" ends with "ization", which is a common suffix used to form
nouns.
o Step 2 rewrites "ization" as "ize" because the stem "global" has a measure > 0, giving
"globalize".
o Step 3 rewrites "alize" as "al", resulting in "global". Step 4 does not remove "al" because
"glob" has a measure of only 1.

Result: "globalization" → global

Example 10: Fishes → Fish
1. Step 1A: The word "fishes" ends with "s".
o Step 1A removes the final "s", giving "fishe".

o Step 5 then removes the final "e", resulting in "fish".

Result: "fishes" → fish

Example 11: Relational → Relat

1. Step 2: The word "relational" ends with "ational", which is a suffix used to form adjectives.
o The stem "rel" has a measure > 0, so "ational" is rewritten as "ate", giving "relate".
o Step 5 then removes the final "e" because the stem "relat" has a measure > 1, resulting in
"relat".

Result: "relational" → relat

Example 12: Happiness → Happi

1. Step 3: The word "happiness" ends with "ness".
o The stem "happi" has a measure > 0 (h - a - pp - i gives one VC sequence).
o The suffix "ness" is removed, resulting in "happi".

Result: "happiness" → happi

PROGRAM:

from nltk.stem import PorterStemmer

# PorterStemmer is purely rule-based, so no NLTK data download is needed

# Initialize the PorterStemmer object
porter_stemmer = PorterStemmer()

# Sample words
words = ["running", "runs", "runner", "easily", "happily", "better", "studies"]

# Stem each word in the list
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Output the original and stemmed words
print("Original Words: ", words)
print("Stemmed Words: ", stemmed_words)

Flow of the for Loop:

The list comprehension in the program is equivalent to the following loop:
1. Initialize stemmed_words = []
2. Start for word in words:
o Iteration 1: word = "running"
• stemmed_word = porter_stemmer.stem("running")
• stemmed_words.append("run")
o Iteration 2: word = "runs"
• stemmed_word = porter_stemmer.stem("runs")
• stemmed_words.append("run")
o Iteration 3: word = "runner"
• stemmed_word = porter_stemmer.stem("runner")
• stemmed_words.append("runner")
o Iteration 4: word = "easily"
• stemmed_word = porter_stemmer.stem("easily")
• stemmed_words.append("easili")
o Iteration 5: word = "happily"
• stemmed_word = porter_stemmer.stem("happily")
• stemmed_words.append("happili")
o Iteration 6: word = "better"
• stemmed_word = porter_stemmer.stem("better")
• stemmed_words.append("better")
o Iteration 7: word = "studies"
• stemmed_word = porter_stemmer.stem("studies")
• stemmed_words.append("studi")
3. End Loop
4. Final List stemmed_words = ["run", "run", "runner", "easili", "happili", "better", "studi"]
