A Project report on
“Spam Email Classification”
submitted in partial fulfillment of the Academic requirements for the award of the
degree of
Bachelor of Technology
Submitted by
MD JAHANGEER - (22H51A05A7)
[Link] - (22H51A05B2)
S. MANASWINI - (22H51A05B9)
UNDER THE COURSE
MACHINE LEARNING LABORATORY
CMR COLLEGE OF ENGINEERING &
TECHNOLOGY
(Autonomous)
(NAAC Accredited with ‘A+’ Grade & NBA Accredited)
(Approved by AICTE, Permanently Affiliated to JNTU Hyderabad)
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD-501401
2024-25
CERTIFICATE
This is to certify that the Micro Project entitled “Spam Email Classification Using
Machine Learning” is being submitted by
MD JAHANGEER 22H51A05A7
[Link] 22H51A05B2
S. MANASWINI 22H51A05B9
in partial fulfillment of the requirement for completion of the “MACHINE
LEARNING LABORATORY” of III-B. Tech II-Semester, and is a record of bonafide
work carried out under guidance and supervision.
Signature of Faculty Signature of HOD
ACKNOWLEDGEMENT
We are obliged and grateful to CMRCET for its cooperation in all
respects during the course.
We would like to thank the Principal of CMRCET, [Link] kumar, for
his support in the course of this project work.
We would like to thank the Head of the CSE Department, Dr. S. Siva Skandha, and
our subject faculty, Mr [Link], for their support in the course of this project work.
Finally, we thank all our faculty members and the Lab Programmer for their valuable
support.
We owe all our success to our beloved parents, whose vision, love and
inspiration have made us reach out for these glories.
MD JAHANGEER 22H51A05A7
[Link] 22H51A05B2
S. MANASWINI 22H51A05B9
TABLE OF CONTENTS
[Link]. CONTENTS [Link]
1. Abstract 5
2. Introduction 6
3. Proposed Solution 7
4. Source code 8-11
5. Results and Discussion 12
6. Conclusion 13
7. Future Enhancement 14
8. References 15
ABSTRACT
The increasing volume of spam emails poses serious challenges in terms of productivity loss, security
threats, and user inconvenience. To address this, our project aims to develop an efficient and accurate spam
email classification system using machine learning techniques. We utilized the SMS Spam Collection
dataset, which includes labeled examples of both spam and non-spam (ham) messages.
The project involves key stages such as text preprocessing, tokenization, and feature extraction using
TF-IDF (Term Frequency-Inverse Document Frequency). We implemented a Naive Bayes classifier, known
for its effectiveness in text classification tasks, and trained it on the preprocessed messages. The model
was evaluated using accuracy score, confusion matrix, and classification report, achieving an accuracy of
over 95%.
This project demonstrates how natural language processing (NLP) combined with supervised learning can
be effectively used to classify and filter spam messages, thereby enhancing communication security and
efficiency.
CHAPTER 1
INTRODUCTION
Electronic mail, or email, is one of the most widely used tools for both personal and professional
communication. However, the rise of spam emails—unsolicited, irrelevant, or inappropriate
messages—has become a significant nuisance and threat. Spam emails can lead to wasted time,
productivity loss, exposure to scams, and even security breaches through malicious links or
attachments. With the increasing volume and sophistication of these unwanted messages,
traditional rule-based filters have become inadequate. These systems often fail to adapt to new
patterns in spam messages and can incorrectly classify important emails as spam (false positives).
To overcome these limitations, this project proposes a machine learning-based approach to spam
detection. By leveraging natural language processing (NLP) techniques and training a classifier
using labeled email data, the system can automatically learn to identify spam based on the content
and structure of messages. This not only enhances the accuracy of spam detection but also ensures
adaptability to evolving spam tactics, making email communication safer and more efficient.
CHAPTER 2
PROPOSED SOLUTION
To effectively classify emails into spam or non-spam (ham), this project proposes the use of a supervised
machine learning approach using a Naive Bayes classifier. The first step involves cleaning and
preprocessing the email content through natural language processing (NLP) techniques such as lowercasing,
removal of punctuation and stopwords, and stemming. This transforms the raw text into a uniform format
suitable for analysis. The preprocessed data is then converted into numerical features using the TF-IDF
(Term Frequency–Inverse Document Frequency) vectorization technique, which helps capture the
importance of words within the email content. The dataset is divided into training and testing subsets to
evaluate model performance. A Multinomial Naive Bayes classifier, known for its effectiveness in text
classification tasks, is trained on this data. The model is evaluated using metrics such as accuracy,
confusion matrix, and classification report. This solution is lightweight, efficient, and demonstrates high
accuracy in identifying spam messages, making it a practical choice for real-world deployment in email
filtering systems.
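The TF-IDF and Naive Bayes steps described above can be sketched on a toy corpus. The four messages, their labels, and all variable names below are made up for illustration and are not taken from the project's dataset. The sketch also shows why TF-IDF captures word importance: a word that appears in fewer messages receives a larger inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy messages (assumption: not from the project's dataset)
toy_messages = [
    "win a free prize now",
    "free cash offer claim now",
    "are we meeting for lunch today",
    "please send the lecture notes",
]
toy_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Convert the text into a sparse matrix of TF-IDF weights
toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_messages)

# "free" occurs in two messages, "prize" in only one,
# so "prize" gets the larger IDF value
idf_free = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["free"]]
idf_prize = toy_vectorizer.idf_[toy_vectorizer.vocabulary_["prize"]]

# Train Multinomial Naive Bayes on the TF-IDF features
toy_model = MultinomialNB()
toy_model.fit(toy_features, toy_labels)

# Classify an unseen message with the already-fitted vectorizer
new_features = toy_vectorizer.transform(["claim your free prize"])
toy_prediction = toy_model.predict(new_features)[0]
print("IDF(free):", idf_free, "IDF(prize):", idf_prize)
print("Predicted label:", toy_prediction)
```

Note that new text is passed through `transform`, not `fit_transform`, so it is mapped onto the vocabulary learned during training.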
CHAPTER 3
SOURCE CODE
# Import necessary libraries
import pandas as pd
import numpy as np
import string
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Download stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# Step 1: Load the dataset (local file)
df = pd.read_csv('[Link]', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']
# Step 2: Convert labels to numeric
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
# Step 3: Text preprocessing
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    # Lowercase, strip punctuation, drop stopwords, and stem each remaining word
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])
    words = text.split()
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)
df['clean_text'] = df['text'].apply(preprocess_text)
# Step 4: Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['label_num']
# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 6: Train the Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)
# Step 7: Make predictions and evaluate
y_pred = model.predict(X_test)
# Display results
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\n Classification Report:\n", classification_report(y_test, y_pred))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham', 'Spam'],
            yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
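One caveat about the listing above: the TF-IDF vectorizer is fit on the whole dataset before the train/test split, so the test messages influence the vocabulary and IDF weights. A possible variant, sketched here on hypothetical toy data standing in for `df['clean_text']` and `df['label_num']`, splits first and wraps both steps in a scikit-learn `Pipeline` so that the vectorizer only sees the training portion. This is an illustrative alternative, not the code used to produce the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for df['clean_text'] and df['label_num']
texts = [
    "win free prize now",
    "free cash offer claim today",
    "urgent claim your reward",
    "cheap loans apply now",
    "see you at lunch",
    "please send the notes",
    "meeting moved to monday",
    "call me when you arrive",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first, keeping the class balance in both parts
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains Naive Bayes
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
spam_pipeline.fit(texts_train, labels_train)

# At prediction time the pipeline applies the fitted vectorizer automatically
test_predictions = spam_pipeline.predict(texts_test)
print(test_predictions)
```

A side benefit of this design is that the fitted pipeline accepts raw strings directly, so the vectorizer and classifier cannot get out of sync.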
CHAPTER 4
RESULTS AND DISCUSSION
Figure 5.1
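The source code in Chapter 3 ends by plotting a confusion matrix. As a reading aid, the sketch below shows how the four cells of such a matrix relate to precision and recall for the spam class. The label vectors are hypothetical and chosen only for illustration; they are not the project's results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only (0 = ham, 1 = spam)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: of the messages flagged as spam, how many really were spam
spam_precision = tp / (tp + fp)
# Recall: of the real spam messages, how many were caught
spam_recall = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Spam precision:", spam_precision)
print("Spam recall:", spam_recall)
```

For a spam filter, false positives (ham flagged as spam) are usually the more costly error, which is why precision is worth reporting alongside accuracy.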
CHAPTER 5
CONCLUSION
The Spam Email Classification project successfully demonstrates how machine learning,
combined with natural language processing techniques, can be used to automate the
detection of unwanted or harmful email messages. By using the Naive Bayes algorithm
and TF-IDF vectorization, the model achieved high accuracy in distinguishing spam
from legitimate (ham) messages. The preprocessing steps played a crucial role in
enhancing the model’s performance by cleaning and standardizing the input text data.
The evaluation metrics, including accuracy and confusion matrix, confirmed the model's
reliability and effectiveness. This project highlights the practical applications of machine
learning in enhancing email security and reducing manual efforts in spam filtering. With
further improvements, such as incorporating deep learning models or additional features,
this system can be scaled for real-time spam detection in larger and more dynamic
environments.
CHAPTER 6
FUTURE ENHANCEMENT
The current spam email classification model provides a solid foundation, but there are
numerous opportunities for future improvement and expansion. One major enhancement
involves adopting deep learning techniques such as Recurrent Neural Networks (RNNs),
Long Short-Term Memory (LSTM) networks, or Transformer-based models like BERT.
These models excel in understanding context and sequential data, which can
significantly increase classification accuracy and reduce false positives.
Another enhancement could be real-time spam filtering integration into email clients or
messaging systems, allowing instant detection and handling of spam messages.
Additionally, introducing continuous learning mechanisms through user feedback can
enable the model to evolve and adapt to new spam patterns over time.
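One way the feedback loop described above might look is sketched below. It swaps the TF-IDF vectorizer for scikit-learn's stateless `HashingVectorizer`, because a fitted TF-IDF vocabulary cannot be extended incrementally, and it uses `MultinomialNB.partial_fit` to update the model one message at a time. The helper `record_feedback` and all messages are hypothetical.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no vocabulary to refit when new words appear.
# alternate_sign=False keeps the features non-negative, as Naive Bayes requires.
hasher = HashingVectorizer(n_features=2**12, alternate_sign=False)
online_model = MultinomialNB()

# Hypothetical initial training batch (1 = spam, 0 = ham)
initial_texts = ["win free prize now", "lunch at noon today"]
initial_labels = [1, 0]
# The first partial_fit call must declare every class the model will ever see
online_model.partial_fit(
    hasher.transform(initial_texts), initial_labels, classes=[0, 1]
)

def record_feedback(text, label):
    """Hypothetical helper: update the model with one user-corrected message."""
    online_model.partial_fit(hasher.transform([text]), [label])

# Users report two messages that follow a new spam pattern
record_feedback("urgent verify your account password", 1)
record_feedback("verify your account to claim reward", 1)

feedback_prediction = online_model.predict(
    hasher.transform(["verify your account"])
)[0]
print("Prediction after feedback:", feedback_prediction)
```

The trade-off is that hashed features cannot be mapped back to words, so this variant gives up some interpretability in exchange for incremental updates.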
To improve the model's versatility, expanding the dataset with a wider variety of spam
messages, including multimedia content and multilingual text, can make the classifier
more robust across different use cases. Adding advanced NLP techniques such as named
entity recognition, topic modeling, or emotion analysis can also provide deeper insights
into spam content.
Furthermore, ensemble learning methods that combine predictions from multiple models
can enhance reliability and reduce overfitting. Implementing cloud-based APIs or
deploying the model as a microservice would support scalability and integration into
larger systems or commercial platforms.
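As a rough illustration of the ensemble idea, the sketch below combines the Naive Bayes classifier with a logistic regression model through scikit-learn's `VotingClassifier`, averaging their predicted probabilities (soft voting). The choice of logistic regression as the second model and the toy messages are assumptions made for this example only.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = ham)
demo_texts = [
    "win a free prize now",
    "free cash offer claim now",
    "claim your free reward today",
    "are we meeting for lunch today",
    "please send the lecture notes",
    "call me when you get home",
]
demo_labels = [1, 1, 1, 0, 0, 0]

# Soft voting averages the class probabilities of both models
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    ),
)
ensemble.fit(demo_texts, demo_labels)

ensemble_prediction = ensemble.predict(["free prize claim now"])[0]
print("Ensemble prediction:", ensemble_prediction)
```

Whether an ensemble actually improves on a single Naive Bayes model would need to be checked on the held-out test set.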
REFERENCES
WEBSITES:
• Naive Bayes Classifier Tutorial: with Python Scikit-learn | DataCamp
• Kaggle-SMS-Spam-Collection-Dataset-/[Link] at master · mohitgupta-1O1/Kaggle-SMS-Spam-Collection-Dataset-