ML Lab

The project report details the development of a spam email classification system using machine learning, specifically employing a Naive Bayes classifier and TF-IDF for feature extraction. The model achieved over 95% accuracy in distinguishing between spam and legitimate emails, demonstrating the effectiveness of natural language processing techniques. Future enhancements include integrating deep learning models and real-time spam filtering capabilities to improve classification accuracy and adaptability.

Uploaded by

Pavan

A Project report on

“Spam Email Classification”

submitted in partial fulfillment of the Academic requirements for the award of the
degree of
Bachelor of Technology
Submitted by
MD JAHANGEER - (22H51A05A7)
[Link] - (22H51A05B2)
S. MANASWINI - (22H51A05B9)

UNDER THE COURSE

MACHINE LEARNING LABORATORY

CMR COLLEGE OF ENGINEERING &

TECHNOLOGY
(Autonomous)
(NAAC Accredited with ‘A+’ Grade & NBA Accredited)
(Approved by AICTE, Permanently Affiliated to JNTU Hyderabad)
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD-501401
2024-25

CMR COLLEGE OF ENGINEERING & TECHNOLOGY
(AUTONOMOUS)
(NAAC Accredited with ‘A+’ Grade & NBA Accredited)
(Approved by AICTE, Permanently Affiliated to JNTU Hyderabad)
KANDLAKOYA, MEDCHAL ROAD, HYDERABAD-501401
2024-25

CERTIFICATE

This is to certify that the Micro Project entitled “Spam Email Classification Using Machine Learning” is being submitted by

MD JAHANGEER 22H51A05A7
[Link] 22H51A05B2
[Link] 22H51A05B9

in partial fulfillment of the requirements for completion of the “MACHINE LEARNING LABORATORY” of III B. Tech II Semester, and is a record of bonafide work carried out under guidance and supervision.

Signature of Faculty Signature of HOD

ACKNOWLEDGEMENT

We are grateful to CMRCET for its cooperation in all respects during the course.

We would like to thank the Principal of CMRCET, [Link] kumar, for his support in the course of this project work.

We would like to thank the Head of the CSE Department, Dr. S. Siva Skandha, and our subject faculty, Mr [Link], for their support in the course of this project work.

Finally, we thank all our faculty members and the Lab Programmer for their valuable support.

We owe all our success to our beloved parents, whose vision, love and inspiration have made us reach out for these glories.

MD JAHANGEER 22H51A05A7
[Link] 22H51A05B2
[Link] 22H51A05B9

ABSTRACT
The increasing volume of spam emails poses serious challenges in terms of productivity loss, security threats, and user inconvenience. To address this, our project aims to develop an efficient and accurate spam email classification system using machine learning techniques. We utilized the SMS Spam Collection dataset, which includes labeled examples of both spam and non-spam (ham) messages.

The project involves key stages such as text preprocessing, tokenization, and feature extraction using TF-IDF (Term Frequency-Inverse Document Frequency). We implemented a Naive Bayes classifier, known for its effectiveness in text classification tasks, and trained it on the preprocessed message text. The model was evaluated using accuracy score, confusion matrix, and classification report, achieving an accuracy of over 95%.

This project demonstrates how natural language processing (NLP) combined with supervised learning can be effectively used to classify and filter spam messages, thereby enhancing communication security and efficiency.

CHAPTER 1

INTRODUCTION

Electronic mail, or email, is one of the most widely used tools for both personal and professional communication. However, the rise of spam emails (unsolicited, irrelevant, or inappropriate messages) has become a significant nuisance and threat. Spam emails can lead to wasted time, productivity loss, exposure to scams, and even security breaches through malicious links or attachments. With the increasing volume and sophistication of these unwanted messages, traditional rule-based filters have become inadequate. These systems often fail to adapt to new patterns in spam messages and can incorrectly classify important emails as spam (false positives).

To overcome these limitations, this project proposes a machine learning-based approach to spam detection. By leveraging natural language processing (NLP) techniques and training a classifier using labeled email data, the system can automatically learn to identify spam based on the content and structure of messages. This not only enhances the accuracy of spam detection but also ensures adaptability to evolving spam tactics, making email communication safer and more efficient.

CHAPTER 2

PROPOSED SOLUTION

To classify emails as spam or non-spam (ham), this project proposes a supervised machine learning approach using a Naive Bayes classifier. The first step is to clean and preprocess the email content with natural language processing (NLP) techniques: lowercasing, removal of punctuation and stopwords, and stemming. This transforms the raw text into a uniform format suitable for analysis. The preprocessed data is then converted into numerical features using the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique, which weights each word by how often it occurs in a message and how rare it is across all messages, so that distinctive words count for more than common ones. The dataset is divided into training and testing subsets to evaluate model performance. A Multinomial Naive Bayes classifier, known for its effectiveness in text classification tasks, is trained on this data. The model is evaluated using metrics such as accuracy, confusion matrix, and classification report. This solution is lightweight and efficient and achieves high accuracy in identifying spam messages, making it a practical choice for real-world deployment in email filtering systems.
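To make the TF-IDF step concrete, the following is a minimal pure-Python sketch on a hypothetical three-message corpus. It assumes raw term counts, the smoothed inverse document frequency idf(t) = ln((1 + N) / (1 + df(t))) + 1, and L2 normalization, which match the default settings of scikit-learn's TfidfVectorizer. The corpus and the function name `tfidf` are illustrative assumptions and are not part of the project code.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)  # raw term counts within this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

# Hypothetical, already-preprocessed messages (not taken from the dataset).
corpus = ["free prize claim free", "meet lunch today", "claim free lunch"]
vectors = tfidf(corpus)
for doc, vec in zip(corpus, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first message, "free" receives the largest weight because it occurs twice, while "prize" outweighs "claim" because it appears in fewer messages overall.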

CHAPTER 3

SOURCE CODE

# Import necessary libraries

import string

import matplotlib.pyplot as plt
import nltk
import pandas as pd
import seaborn as sns

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Download the NLTK stopword list (only needed once per environment)
nltk.download('stopwords')

# Step 1: Load the dataset (local file)

df = pd.read_csv('[Link]', encoding='latin-1')[['v1', 'v2']]
df.columns = ['label', 'text']

# Step 2: Convert labels to numeric

df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

# Step 3: Text preprocessing

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, strip punctuation, drop stopwords, and stem each word."""
    text = text.lower()
    text = ''.join(char for char in text if char not in string.punctuation)
    words = text.split()
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['clean_text'] = df['text'].apply(preprocess_text)

# Step 4: Feature extraction using TF-IDF

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['clean_text'])
y = df['label_num']
# Step 5: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train the Naive Bayes classifier

model = MultinomialNB()
model.fit(X_train, y_train)

# Step 7: Make predictions and evaluate

y_pred = model.predict(X_test)

# Display results
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\n Classification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
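
For readers curious about what a Multinomial Naive Bayes classifier computes internally, the following is a minimal from-scratch sketch, not the scikit-learn implementation used above. It assumes raw word counts instead of TF-IDF weights, Laplace smoothing with alpha = 1, and a hypothetical four-message training set; the names `train_nb` and `predict_nb` are illustrative only.

```python
import math
from collections import Counter

def train_nb(messages, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenised messages."""
    vocab = {w for m in messages for w in m.split()}
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    for message, label in zip(messages, labels):
        word_counts[label].update(message.split())
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for c in classes:
        # Class prior: fraction of training messages with this label.
        model["log_prior"][c] = math.log(labels.count(c) / len(labels))
        # Smoothed denominator: class word total plus alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_likelihood"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model

def predict_nb(model, message):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c, log_prior in model["log_prior"].items():
        # Words that never appeared in training are ignored.
        scores[c] = log_prior + sum(
            model["log_likelihood"][c][w]
            for w in message.split()
            if w in model["vocab"]
        )
    return max(scores, key=scores.get)

# Hypothetical training data: 1 = spam, 0 = ham.
train_messages = ["win free prize now", "free cash offer",
                  "see you at lunch", "call me after class"]
train_labels = [1, 1, 0, 0]
nb_model = train_nb(train_messages, train_labels)
print(predict_nb(nb_model, "free prize offer"))
print(predict_nb(nb_model, "lunch after class"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a single unseen word from forcing a class probability to zero.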

CHAPTER 4

RESULTS AND DISCUSSION

Figure 5.1
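
Accuracy alone can be misleading when ham messages greatly outnumber spam, so it helps to read precision and recall for the spam class from the confusion matrix as well. The sketch below shows how these figures are derived from the four confusion-matrix counts. The label lists are hypothetical and are not the project's results; the function names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the positive (spam) class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1  # ham wrongly flagged as spam
        elif actual == positive:
            fn += 1      # spam that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn

def spam_metrics(y_true, y_pred):
    """Accuracy plus precision, recall, and F1 for the spam class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels for ten messages: 1 = spam, 0 = ham.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
demo_metrics = spam_metrics(y_true_demo, y_pred_demo)
print(demo_metrics)
```

For a spam filter, a false positive (a legitimate message sent to the spam folder) is usually the costlier error, which is why precision on the spam class deserves attention alongside overall accuracy.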

CHAPTER 5

CONCLUSION

The Spam Email Classification project successfully demonstrates how machine learning, combined with natural language processing techniques, can be used to automate the detection of unwanted or harmful email messages. By using the Naive Bayes algorithm and TF-IDF vectorization, the model achieved high accuracy in distinguishing spam from legitimate (ham) messages. The preprocessing steps played a crucial role in enhancing the model’s performance by cleaning and standardizing the input text data. The evaluation metrics, including accuracy and confusion matrix, confirmed the model's reliability and effectiveness. This project highlights the practical applications of machine learning in enhancing email security and reducing manual efforts in spam filtering. With further improvements, such as incorporating deep learning models or additional features, this system can be scaled for real-time spam detection in larger and more dynamic environments.

CHAPTER 6

FUTURE ENHANCEMENT

The current spam email classification model provides a solid foundation, but there are
numerous opportunities for future improvement and expansion. One major enhancement
involves adopting deep learning techniques such as Recurrent Neural Networks (RNNs),
Long Short-Term Memory (LSTM) networks, or Transformer-based models like BERT.
These models excel in understanding context and sequential data, which can
significantly increase classification accuracy and reduce false positives.

Another enhancement could be real-time spam filtering integration into email clients or
messaging systems, allowing instant detection and handling of spam messages.
Additionally, introducing continuous learning mechanisms through user feedback can
enable the model to evolve and adapt to new spam patterns over time.

To improve the model's versatility, expanding the dataset with a wider variety of spam
messages, including multimedia content and multilingual text, can make the classifier
more robust across different use cases. Adding advanced NLP techniques such as named
entity recognition, topic modeling, or emotion analysis can also provide deeper insights
into spam content.

Furthermore, ensemble learning methods that combine predictions from multiple models
can enhance reliability and reduce overfitting. Implementing cloud-based APIs or
deploying the model as a microservice would support scalability and integration into
larger systems or commercial platforms.
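
As an illustration of the ensemble idea, the simplest way to combine several classifiers is hard majority voting. The sketch below assumes three hypothetical models that have each already produced a 0/1 prediction per message; the function name and the prediction lists are illustrative and do not come from the project.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per message by majority."""
    combined = []
    # zip(*predictions) groups the votes that all models cast for one message.
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three models for three messages (1 = spam).
model_a = [1, 0, 1]
model_b = [1, 1, 0]
model_c = [0, 0, 1]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Using an odd number of models avoids ties for a two-class problem such as spam versus ham.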

