
FICE Project Report Spam

This document provides a project synopsis for classifying emails using a machine learning algorithm. The team aims to detect spam emails using a Naive Bayes classifier by preprocessing a dataset, splitting it into training and testing sets, and evaluating the model's performance. The best performing model is the Multinomial Naive Bayes classifier, which achieves a precision score of 1 and accuracy of 97% on the test data. The project seeks to optimize spam detection by comparing multiple classifiers.

Uploaded by

Anubhav Yadav


Intel College Excellence Program

Project Synopsis

“E-mail fraud detection using Classification Algorithm.”

Team member’s detail

S.No. Participant Name Mobile No. Email ID
1 Anubhav Yadav 8791711910 [email protected]
2 Arushi Rajdev 6350389112
3 Vishal Gupta 9580499094
Faculty(college) mentor detail
S.No. Mentor Name Mobile No. Email ID
1. Mr. Sanjay Kumar Sonker 9718154099 [email protected].in
College/University Name
Galgotias University

School of Computing Science and Engineering,

Galgotias University

Project Proposal Page 1 of 14

BACKGROUND

Electronic mail (e-mail) has become a part of everyday life. It has reduced
communication difficulties for organizations and individuals alike and is
now a standard means of professional communication. Email development
started in the 1960s, but initially users could only send email to other
users of the same computer. Some systems also supported a form of instant
messaging, in which the sender and recipient needed to be online at the same
time. Nowadays email is exploited by fraudsters for spam and phishing
attacks that flood recipients with unnecessary and unsolicited messages, so
it has become important to identify such emails.


PROBLEM IDENTIFICATION

Email is a way of exchanging messages between people using electronic
devices. It is both fast and one of the cheapest modes of communication,
but it is plagued by spam. Spam is unsolicited and unwanted email from a
stranger, sent in bulk to large mailing lists and usually commercial in
nature. Spam emails are intended to harm the recipient, whether financially
or in some other way, and spamming is also increasing day by day in other
communication channels. A study estimated that over 70% of today's business
emails are spam. This disruption also consumes communication bandwidth,
email server resources, and user time.

Spam emails have the following features. They are sent to undisclosed
recipients to advertise services, products, or offensive material. Their
aim is to deceive innocent people into giving up personal data, which is
then abused. The majority of spam emails do not offer an unsubscribe
option.


PROPOSED SOLUTION

Our project aims to detect spam emails using a machine learning classifier,
the Naïve Bayes classifier, which scans the user's email and marks unwanted
messages. The work proceeds as follows. First, we preprocess the given
dataset, using word-based feature selection to pick the words the classifier
will use. Next, we visualize the data and split the dataset into a training
set and a testing set. We then train the Naïve Bayes classification model
and analyse the results with a confusion matrix to visualize the performance
of the model.

Naive Bayes Classifier :- It is a machine learning algorithm in which
individual words play a key role. If an incoming email contains words that
appear frequently in spam but not in ham, then it is probably spam. Naive
Bayes has become one of the most popular methods in email filtering
software.
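
The word-frequency idea behind Naive Bayes can be illustrated with a small hand-written sketch. This is not the project's code: the toy messages, the function names `train_naive_bayes` and `predict`, and the use of add-one (Laplace) smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels):
    """Count how often each word occurs in ham and in spam messages."""
    word_counts = {"ham": Counter(), "spam": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["ham"]) | set(word_counts["spam"])
    return word_counts, class_counts, vocabulary

def predict(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log-probability score."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (hypothetical, not from the project's dataset).
train_messages = [
    "win a free prize now",
    "free mobile offer call now",
    "are we meeting for lunch today",
    "please call me when you are home",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_messages, train_labels)
print(predict("free prize call now", *model))
print(predict("lunch at home today", *model))
```

Words such as "free" and "prize" occur only in the toy spam messages, so a message containing them scores higher for the spam class, which is the behaviour the paragraph above describes.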

HARDWARE & SOFTWARE REQUIREMENTS

Hardware requirements:

1. Laptop
2. Internet connection


Software requirements:

1. Anaconda distribution
2. Jupyter Notebook
3. PyCharm IDE
4. Streamlit

Python Libraries:-

• pandas
• numpy
• nltk
• wordcloud
• sklearn (scikit-learn)
• matplotlib
• seaborn

BLOCK DIAGRAM & DESCRIPTION

This project is mainly divided into five steps:-

First of all, we take the dataset from Kaggle and load it in a Jupyter
notebook using pandas.
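
A minimal sketch of the loading step is shown here. The in-memory CSV stands in for the downloaded Kaggle file so that the example is self-contained; the column names `label` and `text` and the sample rows are assumptions, not the dataset's actual contents.

```python
import io
import pandas as pd

# Stand-in for the downloaded Kaggle file. In a notebook this would be a
# call such as pd.read_csv("spam.csv"), where the file name is hypothetical.
csv_data = io.StringIO(
    "label,text\n"
    "ham,Are we still meeting for lunch today?\n"
    "spam,Free entry! Call now to claim your prize\n"
    "ham,Please call me when you get home\n"
)

df = pd.read_csv(csv_data)
print(df.shape)   # number of rows and columns
print(df.head())  # first few rows for a quick visual check
```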


Step – 1:- Data Cleaning

In this step, we clean the given dataset: we drop the unwanted columns that
contain null values and rename the remaining columns according to our needs.
We use the Sklearn library to encode the category of each email as a number:
0 means ham and 1 means spam. Then we check for missing values and duplicate
rows and remove any that we find.
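
The cleaning step might look roughly like the sketch below. The raw column names (`v1`, `v2`, `unused`), the new names (`target`, `text`), and the toy rows are assumptions for illustration; `LabelEncoder` is used as one plausible way to do the Sklearn encoding the text mentions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical raw column names and one all-null column.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["See you at lunch", "Free prize, call now",
           "See you at lunch", "Call me later"],
    "unused": [None, None, None, None],
})

# Drop the column that holds only null values and rename the rest.
clean = raw.drop(columns=["unused"]).rename(columns={"v1": "target", "v2": "text"})

# LabelEncoder numbers the classes in sorted order: ham -> 0, spam -> 1.
encoder = LabelEncoder()
clean["target"] = encoder.fit_transform(clean["target"])

# Remove rows with missing values, then exact duplicate rows.
clean = clean.dropna().drop_duplicates(keep="first").reset_index(drop=True)
print(clean)
```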

Step – 2:- EDA (exploratory data analysis)

In this step we visualize the basic structure of the data to understand the
dataset better. We use the matplotlib library to draw a pie chart of the
class distribution. It shows that 87.37 percent of our data is ham mail and
12.63 percent is spam.
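
The class shares and the pie chart can be produced along the following lines. The toy `target` series below is an assumption and does not reproduce the percentages reported above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Toy encoded labels (hypothetical): 0 = ham, 1 = spam.
target = pd.Series([0, 0, 0, 1, 0, 0, 0, 1])

# Count each class and convert the counts to percentages.
counts = target.value_counts().sort_index()
shares = (counts / counts.sum() * 100).round(2)
print(shares)

# Pie chart of the class distribution with percentage labels.
fig, ax = plt.subplots()
ax.pie(counts, labels=["ham", "spam"], autopct="%0.2f")
```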


Here we have used the nltk library (the Natural Language Toolkit) to count
the words and sentences in each email. We store the counts in new columns
for the number of characters, words, and sentences. We have used a lambda
function to apply the tokenization to each row.
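
A sketch of the three count columns follows. To keep it runnable without downloading nltk data, simple regular expressions stand in for nltk's word and sentence tokenizers; that substitution, the column names, and the sample texts are assumptions.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["Free prize! Call now.", "See you at lunch"]})

# Character count: length of the raw string.
df["num_characters"] = df["text"].apply(len)

# Word count: regex stand-in for nltk.word_tokenize (assumption).
df["num_words"] = df["text"].apply(lambda t: len(re.findall(r"\w+", t)))

# Sentence count: split after ., ! or ? as a stand-in for nltk.sent_tokenize.
df["num_sentences"] = df["text"].apply(
    lambda t: len([s for s in re.split(r"(?<=[.!?])\s+", t.strip()) if s])
)

print(df)
```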


Next, we have plotted histograms to visualize the data, using the seaborn
library. The first image shows the histogram for characters: the maximum
count is around 530 for ham emails and around 190 for spam.

In the histogram for words, the maximum count is around 600 for ham emails
and around 100 for spam.


Step – 3:- Data Preprocessing

In this step we do the following:-

• converting into lower case
We convert every character to lower case.

• tokenization
We split the text into tokens (words and sentences).

• removing stop words and punctuation
We remove stop words such as
'i','me','my','myself','we','our','ours','ourselves','you',"you're","you've",
"you'll","you'd",'your','yours','yourself','yourselves'
etc., as well as punctuation.

• removing special characters
We remove special characters such as @#$%^&*! etc.

• stemming
We reduce each word to its root form. If the mail contains the word
“loving”, it is converted to the root word “love”.

To do all this we create a function ‘modify_email_text’ which uses the nltk
library. First, we convert the email text to lower case and tokenize it.
Then we create a list Z and append to it every token that is neither a stop
word nor punctuation. Finally, we clear the list and append the stemmed form
of each remaining word to it.
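
The steps above can be sketched as a single function. This is an approximation rather than the project's actual ‘modify_email_text’: it uses nltk's `PorterStemmer`, but replaces nltk's tokenizer and full stop-word list with a regular expression and a short hand-written set (both assumptions) so that no nltk data download is needed.

```python
import re
from nltk.stem.porter import PorterStemmer

# Short illustrative stop-word set (assumption); nltk ships a much longer list.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "you", "your",
              "am", "is", "are", "the", "a", "this", "to"}

stemmer = PorterStemmer()

def modify_email_text(text):
    # Lower-case, then keep only runs of letters and digits. This tokenizes
    # the text and drops punctuation and special characters in one pass.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Reduce every remaining token to its stem.
    return " ".join(stemmer.stem(t) for t in tokens)

print(modify_email_text("I am LOVING this free offer!!! Call now @ 12345"))
```

Returning a single space-joined string keeps the output in a form that a text vectorizer can consume directly.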


After doing all this we get the output as modified_text.

Then we create a word cloud from it. A word cloud is a picture that shows
which words are used most often, with more frequent words drawn larger. The
picture below shows the word cloud for the spam emails. We see that ‘free’,
‘text’, ‘mobile’, and ‘call’ are among the most used words in spam emails.


Step – 4:- Model Training using Naïve Bayes Classifier

In this step, we use the CountVectorizer and TfidfVectorizer classes of the
Sklearn library to convert the textual data into numerical form. After
converting the data we split it into training and test sets using the
train_test_split function of sklearn. We use 80 percent of the whole
dataset for training and 20 percent for testing.
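
A minimal sketch of vectorizing and splitting is given here. The toy texts, the choice of `TfidfVectorizer` over `CountVectorizer`, the `random_state`, and the value passed as `test_size` (0.2, holding out 20 percent for testing) are all assumptions; adjust them to match the split actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy preprocessed messages (hypothetical) with 0 = ham, 1 = spam.
texts = [
    "free prize call now", "win free mobile", "claim free offer",
    "urgent call win cash", "free text offer",
    "lunch today", "see you home", "meet later office",
    "call me tonight", "thanks for help",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Turn each message into a vector of TF-IDF weights, one column per word.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# Hold out 20 percent of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=2
)
print(X_train.shape, X_test.shape)
```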


We then import GaussianNB, MultinomialNB, and BernoulliNB from
sklearn.naive_bayes and create one instance of each under a short name:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

The line gnb.fit(X_train,y_train) trains the model on the training data, and
y_pred1 =gnb.predict(X_test) predicts labels for the test data from what the
model has learned. We then print the accuracy score, confusion matrix, and
precision score for each model. In our case, MultinomialNB is the best
performing model, as it gives a precision score of 1 and an accuracy of
0.97.
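
The comparison of the three Naive Bayes variants can be sketched as a loop. The toy data, the split parameters, and the `results` dictionary are assumptions for illustration, so the scores printed here will not match the figures reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy preprocessed messages (hypothetical) with 0 = ham, 1 = spam.
texts = [
    "free prize call now", "win free mobile", "claim free offer",
    "urgent call win cash", "free text offer", "win prize text now",
    "lunch today", "see you home", "meet later office",
    "call me tonight", "thanks for help", "home late today",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# GaussianNB needs a dense array, so convert the sparse matrix.
X = CountVectorizer().fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=2, stratify=labels
)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)     # learn from the training data
    y_pred = model.predict(X_test)  # predict labels for unseen messages
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Precision: share of messages flagged as spam that really are spam.
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
    print(name, results[name]["accuracy"], results[name]["precision"])
    print(results[name]["confusion"])
```

For a spam filter, precision is worth watching alongside accuracy, because a false positive means a legitimate email is flagged as spam.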


FUTURE SCOPE

In this project, we have used only one family of classifier (Naive Bayes),
but in the future we want to use four or five different classifiers and
compare their accuracy and precision against each other.

By doing this we will know which model works best in the given situation,
and the deployment of the model can be optimized.


CONCLUSION

Email spam is one of the most pressing and difficult internet problems in
today's world of communication and technology. It is almost impossible to
think of email without considering the issue of spam. Spammers misuse this
communication channel and thus affect organizations and many email users.

The machine learning model used by Google is now so advanced that it can
detect and filter out spam and phishing emails with almost 99.9 percent
accuracy. This means that only about one in a thousand such messages manages
to escape its email spam filter.

In this project, we explored the data, visualized various aspects of it,
analysed it, and built a machine learning model that can be used to help
prevent fraud or spam emails.

REFERENCES

1. https://en.wikipedia.org/wiki/Email
2. M. Siponen and C. Stucke, "Effective anti-spam strategies in companies", An international
study, In Proceedings of HICSS 2006, vol. 6, 200
3. https://www.youtube.com/watch?v=rHesaMUqTjE
