
Week 2

The document explains the concept of stemming in Natural Language Processing (NLP) and details the Porter Stemming Algorithm, which reduces words to their root forms by removing common suffixes. It outlines the algorithm's steps, advantages, and disadvantages, along with examples of how various words are stemmed. Additionally, a Python program demonstrates the implementation of the Porter Stemmer using the NLTK library.

Uploaded by

Rida kaunain

WEEK -2

AIM: Write a Python program to implement Porter stemmer algorithm for stemming

What is Stemming?
Stemming is the process of reducing words to a root or base form by stripping affixes (mostly
suffixes), so that related forms can be analyzed as a single term. The stem does not have to be a
dictionary word. For instance, the Porter stemmer produces:
 "Running" → "run"
 "Happiness" → "happi"
 "Studies" → "studi"
Why Stemming?
In Natural Language Processing (NLP), stemming is crucial because it helps in grouping similar
words under a single root word. This makes it easier for algorithms to analyze text data.
Key Reasons for Stemming:
1. Simplification: It simplifies words into their base form, reducing the complexity of analyzing
variations of a word.
2. Efficiency: It reduces the number of unique words in a text, making analysis faster and easier.
3. Accuracy in Analysis: It helps in focusing on the core meaning rather than the different
forms of a word.
When Do We Go for Stemming?
We use stemming in cases where:
 We need to analyze text data that involves multiple forms of the same word.
 We want to improve search results (i.e., find "running" or "runs" when searching for "run").
 We are working on tasks like text classification, sentiment analysis, or information retrieval.
Why Does NLP Need Stemming?
NLP needs stemming because:
 Languages have inflectional variations: A single word can take many forms based on tense,
person, or number (e.g., "run", "running", "runs"). Stemming reduces these variations to the
root word, making it easier for machines to analyze and understand.
 Consistency in Analysis: Without stemming, different forms of the same word would be
treated as completely different words, leading to inefficiency.
Real-Life Example:
Imagine you're analyzing customer reviews of a product. If you don't apply stemming, the words
"love", "loved", "loving" would all be treated as separate, and your analysis could be skewed. With
stemming, they would all be reduced to "love", making your analysis more accurate.
Porter Stemming Algorithm
The Porter Stemming Algorithm is a widely used stemming algorithm that reduces words to their
root form by removing common morphological and inflectional endings.
It was developed by Martin Porter in 1980, and it works through a set of rules that follow the
structure of words based on suffixes.
This process is designed to be simple, efficient, and effective for English text.
The algorithm applies a fixed sequence of suffix-rewriting rules to each word. Whether a rule fires
depends on the suffix itself and on the consonant/vowel structure of the stem that would remain.
Key Concepts in the Porter Stemmer Algorithm
To understand the rules applied by the Porter Stemming algorithm, it’s essential to understand the
following concepts:
1. Vowels and Consonants
The algorithm classifies letters in a word into two categories:
 Vowels (V): The letters a, e, i, o, u, plus y when it follows a consonant (as in "happy").
 Consonants (C): All other letters, including y at the start of a word or after a vowel (as in
"play").
In the algorithm, the distinction between vowels and consonants helps in identifying how to remove
suffixes from words. Words are processed by identifying the structure of the characters in them.
2. M (The Measure)
The letter M refers to the "measure" of a word or stem. After each run of consonants is collapsed to
C and each run of vowels to V, every word has the form [C](VC)...(VC)[V], and the measure is the
number of vowel-consonant (VC) sequences in it. It is a rough proxy for the number of syllables.
For example:
 In the word "running", the measure is 2 because the pattern is:
r - u - nn - i - ng (C-V-C-V-C), which contains two VC sequences.
 In "cats", the measure is 1, since c - a - ts follows C-V-C, i.e., only one VC sequence.
 In "tree", the measure is 0, since tr - ee follows C-V, which contains no VC sequence.
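The measure can be computed in a few lines of plain Python. This is a minimal sketch based on the definition above, not NLTK's internal code; the names is_consonant and measure are illustrative choices.

```python
VOWELS = set("aeiou")

def is_consonant(word: str, i: int) -> bool:
    """a, e, i, o, u are vowels; y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y at the start of a word, or after a vowel, counts as a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Count the VC sequences after collapsing runs of consonants and vowels."""
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind  # runs such as "nn" collapse into a single C
    return pattern.count("VC")

for w in ["tree", "cats", "running", "troubles"]:
    print(w, measure(w))
```

Because the collapsed pattern strictly alternates between C and V, counting the substring "VC" gives the measure directly.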
3. The "Step" Mechanism
The Porter algorithm processes words in stages (called "steps") and applies rules based on the
measure (M) and the suffix of the word. These steps allow the algorithm to systematically reduce the
word.

Detailed Steps of the Porter Stemming Algorithm


The Porter Stemming algorithm follows several steps, each consisting of rules that are triggered based
on conditions such as the word's suffix and its measure (M).
The process is carried out as follows:
Step 1: Plurals and Verb Endings
This step handles plural endings and the verb endings "-ed" and "-ing". Here are the key rules:
1. Step 1A (Plurals):
o "sses" becomes "ss", "ies" becomes "i", a final "ss" is kept, and any other final "s" is
removed. No measure condition applies here.
 Example: "caresses" → "caress"
 Example: "studies" → "studi"
2. Step 1B ("eed", "ed" and "ing" Suffixes):
o "eed" becomes "ee" if the stem before it has a measure greater than 0. "ed" and "ing" are
removed if the remaining stem contains a vowel. The stem is then tidied up: a final "at", "bl"
or "iz" gains an "e", a doubled final consonant (other than l, s or z) is reduced to one, and a
stem of measure 1 ending consonant-vowel-consonant gains an "e".
 Example: "running" → "runn" → "run"
 Example: "baking" → "bak" → "bake"
3. Step 1C (Final "y"):
o A final "y" becomes "i" if the stem before it contains a vowel.
 Example: "happy" → "happi"
Suffixes such as "ful", "ment" and "ness" are not handled in Step 1; later steps deal with them.
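The plural and "-ed"/"-ing" handling can be sketched in plain Python. This is a rough illustration of the rules as published by Porter (plural rules first, then "eed", "ed" and "ing"), not NLTK's implementation, which adds a few extensions. All function names here are illustrative.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":  # y is a consonant at word start or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))

def ends_cvc(stem):
    # consonant-vowel-consonant, where the last consonant is not w, x or y
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step_1a(word):
    # Plural rules: sses -> ss, ies -> i, ss -> ss, s -> (removed)
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

def step_1b(word):
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not contains_vowel(stem):
                return word  # e.g. "sing" keeps its "ing"
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"
            if ends_double_consonant(stem) and stem[-1] not in "lsz":
                return stem[:-1]
            if measure(stem) == 1 and ends_cvc(stem):
                return stem + "e"
            return stem
    return word

for w in ["caresses", "studies", "running", "baking", "caring", "sing"]:
    print(w, "->", step_1b(step_1a(w)))
```

The tidy-up branches inside step_1b are what turn "runn" into "run" and "bak" into "bake" after the suffix has been stripped.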
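Step 2: Suffix Modifications
This step rewrites longer derivational suffixes such as "-ational", "-ization" and "-fulness" into
shorter ones.
 The algorithm applies a rule only if the stem before the suffix has a measure greater than 0.
o Example: "relational" → "relate"
o Example: "organization" → "organize" (a later step reduces this to "organ")
o Note: "applying" is not a Step 2 case. Step 1 removes "ing" and turns the final "y" into "i",
giving "appli".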
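The measure-guarded rewriting in Step 2 can be sketched as a lookup table. The suffix pairs below follow Porter's published Step 2 list; the names STEP2_RULES and step_2 are illustrative, and the input is assumed to have already passed through Step 1 (so a final "y" is already "i").

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":  # y is a consonant at word start or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")

STEP2_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}

def step_2(word):
    # Only the longest matching suffix is considered; if its condition
    # fails, the word passes through this step unchanged.
    for suffix in sorted(STEP2_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + STEP2_RULES[suffix]
            return word
    return word

for w in ["relational", "conditional", "rational", "globalization"]:
    print(w, "->", step_2(w))
```

"rational" is left alone because the stem before "ational" is just "r", whose measure is 0.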

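Step 3: Further Refinement
This step handles a further group of suffixes, again requiring a measure greater than 0.
 Common transformations include removing "-ness", "-ful" and "-ative", and shortening "-icate",
"-iciti", "-ical" and "-alize".
o Example: "hopeful" → "hope"
o Example: "goodness" → "good"
Step 4: Residual Suffixes
This step deletes remaining suffixes such as "-al", "-ance", "-er", "-ment", "-ion" and "-ize", but
only when the stem that is left has a measure greater than 1.
o Example: "connection" → "connect"
Step 5: Final Cleanup
This step removes a final "e" from most longer stems and reduces a final "ll" to "l".
o Example: "relate" → "relat"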

Examples of Porter Stemmer Rules


1. Example 1: Running → Run
o The word "running" ends in "ing". According to Step 1B, the suffix "ing" is removed
because the remaining stem "runn" contains a vowel.
o The doubled final consonant "nn" is then reduced to "n", giving "run".

2. Example 2: Better → Better
o The word "better" is an irregular comparative of "good". A rule-based stemmer has no way
to relate the two words.
o The only candidate rule is the Step 4 deletion of "er", which requires the stem "bett" to
have a measure greater than 1. Its measure is 1, so the word remains as "better".

3. Example 3: Studies → Studi
o "Studies" ends in "ies". Step 1A replaces "ies" with "i" without checking the measure.
o The result is "studi", which is the stem even though it is not a dictionary word.
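One way to see why "better" survives while a word such as "connection" loses its ending is to sketch the deletion rules that require a measure greater than 1. The suffix list follows Porter's published Step 4; step_4 and the helper names are illustrative, and NLTK's own code is organized differently.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":  # y is a consonant at word start or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern += kind
    return pattern.count("VC")

STEP4_SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
    "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
]

def step_4(word):
    # Longest matching suffix wins; it is deleted only if the stem has m > 1.
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # "ion" is only removed after "s" or "t"
            if suffix == "ion" and not stem.endswith(("s", "t")):
                return word
            return stem if measure(stem) > 1 else word
    return word

for w in ["better", "runner", "connection", "organize"]:
    print(w, "->", step_4(w), "| stem measure checked against 1")
```

"bett" and "runn" both have a measure of 1, so "er" stays; "connect" and "organ" have a measure of 2, so "ion" and "ize" are deleted.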

Why Does the Algorithm Work?


The Porter Stemmer is built based on the premise that the suffixes at the end of English words carry a
lot of information about word categories (verbs, nouns, adjectives). By stripping away common
suffixes, the algorithm helps reduce the complexity of analyzing words and makes it easier to focus on
the word's core meaning.
Advantages of the Porter Stemmer
1. Efficiency: It is relatively fast and can process large volumes of text quickly.
2. Simplicity: The algorithm is simple and straightforward, making it easy to implement.
3. Wide Applicability: It works well for most English words and is widely used in information
retrieval and text mining.
Disadvantages of the Porter Stemmer
1. Over-Stemming: The algorithm might reduce words too aggressively, merging words with
different meanings or producing stems that are not actual words, which could be problematic
in some contexts.
o Example: "university" and "universe" both become "univers", and "studies" becomes
"studi".
2. Irregular Forms: The algorithm doesn't handle irregular forms (like "better" or "good")
correctly because it follows predefined rules rather than understanding the meaning behind
the word.
3. Loss of Meaning: The over-simplification of words could lead to loss of nuance and subtle
meaning, especially in highly specialized domains.
Example 1: Running → Run
1. Step 1B: The word "running" ends with "ing", which is a common suffix for verbs. The
algorithm checks that the remaining stem contains a vowel and removes "ing".
o The stem "runn" contains the vowel "u", so "ing" is removed.
o The stem ends in the doubled consonant "nn", which is reduced to "n", resulting in the
stem "run".
Result: "running" → run

Example 2: Happily → Happili

1. Step 1C: The word "happily" ends with "y" preceded by the consonant "l".
o The part before the "y", "happil", contains a vowel, so "y" is replaced with "i".
o No later rule matches the ending "ili", so the "l" is kept and the result is "happili".
Result: "happily" → happili

Example 3: Better → Better


1. The word "better" is an irregular comparative adjective. The Porter Stemmer has no rule that
relates it to "good".
2. The only matching suffix is "er" in Step 4, which is removed only when the stem has a measure
> 1. The stem "bett" has a measure of 1 (b - e - tt), so the word stays the same.
Result: "better" → better

Example 4: Studies → Studi


1. Step 1A: The word "studies" ends with "ies". Step 1A replaces "ies" with "i"; no measure
check is involved.
o The leading part "stud" is left unchanged.
o Replacing "ies" with "i" results in "studi".
Result: "studies" → studi


Example 5: Connection → Connect
1. Step 4: The word "connection" ends with "ion", a common suffix used in forming nouns. Step 4
removes "ion" when the stem ends in "s" or "t" and has a measure > 1.
o The stem "connect" ends in "t" and has a measure of 2 (c - o - nn - e - ct).
o The suffix "ion" is removed, resulting in "connect".
Result: "connection" → connect

Example 6: Playing → Play


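1. Step 1B: The word "playing" ends with the suffix "ing".
o The remaining stem "play" contains the vowel "a", so "ing" is removed, resulting in "play".
o NLTK's default PorterStemmer leaves the final "y" alone because it follows a vowel. The
rules as originally published would go on to apply Step 1C and return "plai".
Result: "playing" → play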

Example 7: Caring → Care


1. Step 1B: The word "caring" ends with the suffix "ing".
o The remaining stem "car" contains the vowel "a", so "ing" is removed.
o "car" has a measure of 1 and ends consonant-vowel-consonant, so an "e" is added,
resulting in "care".
Result: "caring" → care

Example 8: Easily → Easili


1. Step 1C: The word "easily" ends with "y" preceded by the consonant "l".
o The part before the "y", "easil", contains a vowel, so "y" is replaced with "i".
o No later rule matches the ending "ili", so the result is "easili".
Result: "easily" → easili

Example 9: Globalization → Global


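1. Step 2: The word "globalization" ends with "ization", which is a common suffix used to form
nouns. Step 2 rewrites "ization" to "ize" if the stem has a measure > 0.
o The stem "global" has a measure of 2 (gl - o - b - a - l), so the word becomes
"globalize".
2. Step 3: "globalize" ends with "alize", which is rewritten to "al" because the stem "glob" has a
measure > 0. The result is "global".
o Step 4 does not remove the final "al", because "glob" has a measure of only 1.
Result: "globalization" → global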


Example 10: Fishes → Fish
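1. Step 1A: The word "fishes" ends with "s" (but not "sses", "ies" or "ss"), so the final "s" is
removed, giving "fishe".
2. Step 5: The final "e" is removed because the stem "fish" has a measure of 1 (f - i - sh) and
does not end in a consonant-vowel-consonant pattern.
o The result is "fish".
Result: "fishes" → fish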

Example 11: Relational → Relat


1. Step 2: The word "relational" ends with "ational", which is rewritten to "ate" if the stem has a
measure > 0.
o The stem "rel" has a measure of 1 (r - e - l), so the word becomes "relate".
2. Step 5: The final "e" is removed because the stem "relat" has a measure of 2, which is > 1.
o The result is "relat".
Result: "relational" → relat

Example 12: Happiness → Happi


1. Step 3: The word "happiness" ends with "ness". Step 3 removes "ness" if the stem has a
measure > 0.
o The stem "happi" has a measure of 1 (h - a - pp - i).
o The suffix "ness" is removed, resulting in "happi".
Result: "happiness" → happi


PROGRAM:

from nltk.stem import PorterStemmer

# PorterStemmer is purely rule-based, so no NLTK data download is needed
# (the 'punkt' package is only required for tokenization)

# Initialize the PorterStemmer object
porter_stemmer = PorterStemmer()

# Sample words
words = ["running", "runs", "runner", "easily", "happily", "better", "studies"]

# Stem each word in the list
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Output the original and stemmed words
print("Original Words: ", words)
print("Stemmed Words: ", stemmed_words)

Flow of the for Loop:

The list comprehension in the program is equivalent to the following loop over the same words:
1. Initialize stemmed_words = []
2. Start for word in words:
o Iteration 1: word = "running"
 stemmed_word = porter_stemmer.stem("running")
 stemmed_words.append("run")
o Iteration 2: word = "runs"
 stemmed_word = porter_stemmer.stem("runs")
 stemmed_words.append("run")
o Iteration 3: word = "runner"
 stemmed_word = porter_stemmer.stem("runner")
 stemmed_words.append("runner")
o Iteration 4: word = "easily"
 stemmed_word = porter_stemmer.stem("easily")
 stemmed_words.append("easili")
o Iteration 5: word = "happily"
 stemmed_word = porter_stemmer.stem("happily")
 stemmed_words.append("happili")
o Iteration 6: word = "better"
 stemmed_word = porter_stemmer.stem("better")
 stemmed_words.append("better")
o Iteration 7: word = "studies"
 stemmed_word = porter_stemmer.stem("studies")
 stemmed_words.append("studi")
3. End Loop
4. Final List stemmed_words = ["run", "run", "runner", "easili", "happili", "better", "studi"]
