ML Lab

The project report details the development of a spam email classification system using machine learning, specifically employing a Naive Bayes classifier and TF-IDF for feature extraction. The model achieved over 95% accuracy in distinguishing between spam and legitimate emails, demonstrating the effectiveness of natural language processing techniques. Future enhancements include integrating deep learning models and real-time spam filtering capabilities to improve classification accuracy and adaptability.

Uploaded by

Pavan

A Project report on

“Spam Email Classification”


submitted in partial fulfillment of the Academic requirements for the award of the
degree of
Bachelor of Technology
Submitted by
MD JAHANGEER - (22H51A05A7)
[Link] - (22H51A05B2)
S. MANASWINI - (22H51A05B9)

UNDER THE COURSE


MACHINE LEARNING LABORATORY

CMR COLLEGE OF ENGINEERING & TECHNOLOGY
(Autonomous)
(NAAC Accredited with ‘A+’ Grade & NBA Accredited)
(Approved by AICTE, Permanently Affiliated to JNTU Hyderabad)
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD-501401
2024-25

CERTIFICATE

This is to certify that the Micro Project entitled “Spam Email Classification Using
Machine Learning” is being submitted by

MD JAHANGEER 22H51A05A7
[Link] 22H51A05B2
[Link] 22H51A05B9

in partial fulfillment of the requirements for completion of the “MACHINE
LEARNING LABORATORY” of III B.Tech II Semester, and is a record of bonafide
work carried out under guidance and supervision.

Signature of Faculty Signature of HOD

ACKNOWLEDGEMENT

We are obliged and grateful to CMRCET for its cooperation in all
respects during the course.

We would like to thank the Principal of CMRCET, [Link] kumar, for
his support in the course of this project work.

We would like to thank the head of the CSE department, Dr. S. Siva Skandha, and
our subject faculty, Mr [Link], for their support in the course of this project work.

Finally, we thank all our faculty members and the Lab Programmer for their valuable
support.

We owe all our success to our beloved parents, whose vision, love and
inspiration have made us reach out for these glories.

MD JAHANGEER 22H51A05A7
[Link] 22H51A05B2
[Link] 22H51A05B9

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Proposed Solution
4. Source Code
5. Results and Discussion
6. Conclusion
7. Future Enhancement
8. References

ABSTRACT
The increasing volume of spam emails poses serious challenges in terms of productivity loss, security
threats, and user inconvenience. To address this, our project aims to develop an efficient and accurate spam
classification system using machine learning techniques. We utilized the SMS Spam Collection
dataset, which includes labeled examples of both spam and non-spam (ham) messages.

The project involves key stages such as text preprocessing, tokenization, and feature extraction using
TF-IDF (Term Frequency-Inverse Document Frequency). We implemented a Naive Bayes classifier, known
for its effectiveness in text classification tasks, and trained it on the preprocessed message data. The model
was evaluated using accuracy score, confusion matrix, and classification report, achieving an accuracy of
over 95%.

This project demonstrates how natural language processing (NLP) combined with supervised learning can
be effectively used to classify and filter spam messages, thereby enhancing communication security and
efficiency.

CHAPTER 1

INTRODUCTION

Electronic mail, or email, is one of the most widely used tools for both personal and professional
communication. However, the rise of spam emails (unsolicited, irrelevant, or inappropriate
messages) has become a significant nuisance and threat. Spam emails can lead to wasted time,
productivity loss, exposure to scams, and even security breaches through malicious links or
attachments. With the increasing volume and sophistication of these unwanted messages,
traditional rule-based filters have become inadequate. These systems often fail to adapt to new
patterns in spam messages and can incorrectly classify important emails as spam (false positives).

To overcome these limitations, this project proposes a machine learning-based approach to spam
detection. By leveraging natural language processing (NLP) techniques and training a classifier
using labeled message data, the system can automatically learn to identify spam based on the content
and structure of messages. This not only enhances the accuracy of spam detection but also ensures
adaptability to evolving spam tactics, making email communication safer and more efficient.

CHAPTER 2

PROPOSED SOLUTION

To effectively classify emails into spam or non-spam (ham), this project proposes the use of a supervised
machine learning approach using a Naive Bayes classifier. The first step involves cleaning and
preprocessing the message content through natural language processing (NLP) techniques such as lowercasing,
removal of punctuation and stopwords, and stemming. This transforms the raw text into a uniform format
suitable for analysis. The preprocessed data is then converted into numerical features using the TF-IDF
(Term Frequency-Inverse Document Frequency) vectorization technique, which helps capture the
importance of words within the message content.

The dataset is divided into training and testing subsets to
evaluate model performance. A Multinomial Naive Bayes classifier, known for its effectiveness in text
classification tasks, is trained on this data. The model is evaluated using metrics such as accuracy,
confusion matrix, and classification report. This solution is lightweight, efficient, and demonstrates high
accuracy in identifying spam messages, making it a practical choice for real-world deployment in email
filtering systems.
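
To make the TF-IDF step more concrete, the following is a minimal pure-Python sketch of the weighting, assuming raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization, which are the defaults of scikit-learn's TfidfVectorizer. The function name tfidf and the toy messages are hypothetical and are not part of the project code; tokenization here is a plain whitespace split, which is simpler than what the library does.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Toy, already-preprocessed messages (hypothetical, not taken from the dataset).
docs = ["free prize claim", "meet lunch today", "claim free free ticket"]
vocab, rows = tfidf(docs)
```

In the first toy message, "prize" occurs in only one document while "free" occurs in two, so "prize" receives the larger weight even though both appear once in that message. This is the sense in which TF-IDF captures the importance of a word.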

CHAPTER 3

SOURCE CODE

# Import necessary libraries
import pandas as pd
import numpy as np
import string
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Download the NLTK stopword list (needed once per environment)
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

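# Step 1: Load the dataset (local file)
# Replace '[Link]' with the path to a local copy of the SMS Spam Collection CSV
# (label in column 'v1', message text in column 'v2').
df = pd.read_csv('[Link]', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']

# Step 2: Convert labels to numeric (ham -> 0, spam -> 1)
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})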

# Step 3: Text preprocessing
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase, strip punctuation, drop stopwords, and stem the remaining words
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    words = text.split()
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['clean_text'] = df['text'].apply(preprocess_text)

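# Step 4: Feature extraction using TF-IDF
# Note: the vectorizer is fitted on the full dataset before the split in Step 5,
# so the IDF statistics include the test messages; fitting on the training
# split only would give a stricter evaluation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['label_num']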
# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 7: Make predictions and evaluate
y_pred = model.predict(X_test)

# Display results
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham', 'Spam'],
            yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
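
As a rough illustration of what the classifier in Step 6 does internally, the sketch below implements multinomial Naive Bayes with Laplace smoothing in plain Python. It is a simplified stand-in and not the scikit-learn implementation: it works on raw word counts instead of TF-IDF weights, and the names train_nb, predict_nb, and the toy messages are hypothetical. The default alpha=1.0 matches the default smoothing value of MultinomialNB.

```python
import math
from collections import Counter, defaultdict

def train_nb(messages, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on whitespace-tokenised messages."""
    vocab = {tok for msg in messages for tok in msg.split()}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for msg, label in zip(messages, labels):
        token_counts[label].update(msg.split())

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            # log P(class)
            "log_prior": math.log(n_docs / len(labels)),
            # Laplace-smoothed log P(token | class)
            "log_lik": {
                t: math.log((token_counts[label][t] + alpha) / denom)
                for t in vocab
            },
        }
    return model

def predict_nb(model, message):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in message.split():
            if tok in params["log_lik"]:
                score += params["log_lik"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical, not taken from the dataset).
train_msgs = ["free prize claim now", "win free ticket",
              "lunch at noon", "see you at home"]
train_labels = ["spam", "spam", "ham", "ham"]
nb_model = train_nb(train_msgs, train_labels)
```

Calling predict_nb(nb_model, "claim free prize") selects "spam" because those words were seen only in the spam examples, while predict_nb(nb_model, "lunch at home") selects "ham". Working in log space avoids numerical underflow when many small probabilities are multiplied.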

CHAPTER 4

RESULTS AND DISCUSSION

Figure 5.1
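
The evaluation step reports accuracy, a confusion matrix, and a classification report. As a sketch of how these relate, the function below derives accuracy, precision, recall, and F1 for the spam class from the four cells of a 2×2 confusion matrix. The function name summarize is hypothetical, and the counts are placeholders chosen for illustration, not the results obtained in this project. With labels 0 (ham) and 1 (spam), scikit-learn's confusion_matrix lays the cells out as [[TN, FP], [FN, TP]].

```python
def summarize(tn, fp, fn, tp):
    """Compute spam-class metrics from the cells of a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: of the messages flagged as spam, how many really were spam.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam messages, how many were caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Placeholder counts for illustration only (not the project's results).
metrics = summarize(tn=950, fp=5, fn=25, tp=120)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because ham messages usually far outnumber spam, accuracy alone can look high even when a noticeable share of spam is missed, which is why precision and recall for the spam class are worth reading alongside it.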

CHAPTER 5

CONCLUSION

The Spam Email Classification project successfully demonstrates how machine learning,
combined with natural language processing techniques, can be used to automate the
detection of unwanted or harmful messages. By using the Naive Bayes algorithm
and TF-IDF vectorization, the model achieved an accuracy of over 95% in distinguishing spam
from legitimate (ham) messages. The preprocessing steps played a crucial role in
enhancing the model’s performance by cleaning and standardizing the input text data.

The evaluation metrics, including accuracy and confusion matrix, confirmed the model's
reliability and effectiveness. This project highlights the practical applications of machine
learning in enhancing email security and reducing manual effort in spam filtering. With
further improvements, such as incorporating deep learning models or additional features,
this system can be scaled for real-time spam detection in larger and more dynamic
environments.

CHAPTER 6

FUTURE ENHANCEMENT

The current spam email classification model provides a solid foundation, but there are
numerous opportunities for future improvement and expansion. One major enhancement
involves adopting deep learning techniques such as Recurrent Neural Networks (RNNs),
Long Short-Term Memory (LSTM) networks, or Transformer-based models like BERT.
These models excel in understanding context and sequential data, which can
significantly increase classification accuracy and reduce false positives.

Another enhancement could be real-time spam filtering integration into email clients or
messaging systems, allowing instant detection and handling of spam messages.
Additionally, introducing continuous learning mechanisms through user feedback can
enable the model to evolve and adapt to new spam patterns over time.

To improve the model's versatility, expanding the dataset with a wider variety of spam
messages, including multimedia content and multilingual text, can make the classifier
more robust across different use cases. Adding advanced NLP techniques such as named
entity recognition, topic modeling, or emotion analysis can also provide deeper insights
into spam content.

Furthermore, ensemble learning methods that combine predictions from multiple models
can enhance reliability and reduce overfitting. Implementing cloud-based APIs or
deploying the model as a microservice would support scalability and integration into
larger systems or commercial platforms.
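
One simple way to combine predictions from multiple models is hard majority voting, sketched below in plain Python. The function majority_vote and the example predictions are hypothetical; the sketch assumes each model has already produced one label per message, and that an odd number of models is used so that ties cannot occur for two classes.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    predictions: list of lists, one inner list of labels per model,
    all of the same length (one label per message).
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all models must predict the same number of messages")
    voted = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted

# Hypothetical outputs of three classifiers on three messages (1 = spam, 0 = ham).
combined = majority_vote([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])
```

Here each message receives the label that at least two of the three models agree on, so a single model's mistake on a message is outvoted by the other two.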

REFERENCES

WEBSITES:

• Naive Bayes Classifier Tutorial: with Python Scikit-learn | DataCamp
• Kaggle-SMS-Spam-Collection-Dataset-/[Link] at master · mohitgupta-1O1/Kaggle-SMS-Spam-Collection-Dataset-
