FICE Project Report Spam

This document provides a project synopsis for classifying emails using a machine learning algorithm. The team aims to detect spam emails using a Naive Bayes classifier by preprocessing a dataset, splitting it into training and testing sets, and evaluating the model's performance. The best performing model is the Multinomial Naive Bayes classifier, which achieves a precision score of 1 and accuracy of 97% on the test data. The project seeks to optimize spam detection by comparing multiple classifiers.

Uploaded by

Anubhav Yadav
Copyright
© © All Rights Reserved
Intel College Excellence Program

Project Synopsis

“E-mail fraud detection using Classification Algorithm.”

Team members' details

S.No.  Participant Name  Mobile No.  Email ID
1      Anubhav Yadav     8791711910  [email protected]
2      Arushi Rajdev     6350389112
3      Vishal Gupta      9580499094

Faculty (college) mentor details

S.No.  Mentor Name              Mobile No.  Email ID
1      Mr. Sanjay Kumar Sonker  9718154099  [email protected].in

College/University Name
Galgotias University

School of Computing Science and Engineering,


Galgotias University

Project Proposal Page 1 of 14


BACKGROUND

Electronic mail (email) has become a part of everyday life. It has reduced communication difficulties for organizations and individuals alike, and it is now a standard channel for business communication. Email development started in the 1960s, but initially users could only send email to other users of the same computer. Some systems also supported a form of instant messaging, in which the sender and recipient needed to be online at the same time. Nowadays email is exploited by fraudsters, who carry out spam and phishing attacks by sending unsolicited and unwanted emails. It has therefore become important to detect such emails.

PROBLEM IDENTIFICATION

Email is a way of exchanging messages between people using electronic devices. It is both a fast and a cheap mode of communication. However, email is plagued by spam. Spam is unsolicited and unwanted email from a stranger, usually commercial in nature, that is sent in bulk to large mailing lists. Spam emails are intended to harm the recipient, financially or otherwise. Spamming is also increasing day by day in other communication channels. One study estimated that over 70% of today's business emails are spam. Spam also wastes communication bandwidth, email server resources, and user time. Spam emails have the following features: they are sent to undisclosed recipients to advertise services, products, or offensive material; they aim to deceive innocent people by obtaining their personal data and abusing it; and the majority of them do not offer an unsubscribe option.

PROPOSED SOLUTION

Our project aims to detect spam emails using a machine learning classifier called the Naïve Bayes classifier. It scans a user's emails and filters them to mark any unwanted email. The main steps are as follows. First, we preprocess the given dataset, using word-based feature selection to obtain the words that the classifier will use, and visualize the data. Next, we split the dataset into a training set and a testing set. We then apply the Naïve Bayes classification model, and finally we analyze the results with the help of a confusion matrix to visualize the model's performance.

Naive Bayes Classifier :- It is a machine learning algorithm in which word frequencies play the key role. If an incoming email contains words that appear frequently in spam but not in ham, then it is probably spam. The Naive Bayes classifier has become one of the most popular methods used in email filtering software.
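
The word-frequency intuition behind Naive Bayes can be made concrete with a small sketch. This is not the project's code: it is a minimal from-scratch illustration of multinomial Naive Bayes scoring with add-one (Laplace) smoothing. The toy messages and the names `TRAIN`, `train`, and `predict` are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data (not from the project's dataset).
TRAIN = [
    ("spam", "free prize call now"),
    ("spam", "free mobile offer call"),
    ("ham", "are we meeting for lunch"),
    ("ham", "call me when you are free"),
]

def train(samples):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, text in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    """Return the class with the highest log-posterior score."""
    total_msgs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: fraction of training messages in this class.
        score = math.log(class_counts[label] / total_msgs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("free prize offer", *model))   # spam
print(predict("meeting for lunch", *model))  # ham
```

Words such as "free" and "prize" occur more often in the toy spam messages than in the ham messages, so a message made of those words receives a higher spam score.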

HARDWARE & SOFTWARE REQUIREMENTS

Hardware requirements:

1. Laptop
2. Internet connection

Software requirements:

1. Anaconda distribution
2. Jupyter Notebook
3. PyCharm IDE
4. Streamlit

Python Libraries:-

• pandas
• numpy
• nltk
• wordcloud
• scikit-learn (sklearn)
• matplotlib
• seaborn

BLOCK DIAGRAM & DESCRIPTION

This project is divided into five main steps:-

First, we take the dataset from Kaggle and load it into a Jupyter Notebook using pandas. The remaining four steps are described below.

Step – 1:- Data Cleaning

In this step, we clean the dataset: we rename the columns to suit our needs and drop the unwanted columns that contain null values. We use the scikit-learn library to encode the category of each email as a number: 0 means ham and 1 means spam. We then check for missing and duplicate values and remove any that we find.
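
The cleaning step could be sketched as follows. This is an illustration under assumptions, not the project's notebook: the small DataFrame stands in for the Kaggle file, and the column names `v1`, `v2`, `Unnamed: 2`, `target`, and `text` are hypothetical. scikit-learn's `LabelEncoder` is one way to do the encoding; the report does not say which encoder was used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Kaggle CSV; column names are assumptions.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham", "spam"],
    "v2": ["See you at lunch", "WIN a free prize now", "See you at lunch",
           "Call me later", "Free mobile offer, call now"],
    "Unnamed: 2": [None, None, None, None, None],
})

# Drop columns that contain only null values, then rename the rest.
df = raw.dropna(axis=1, how="all")
df = df.rename(columns={"v1": "target", "v2": "text"})

# Encode the label: LabelEncoder sorts classes, so ham -> 0 and spam -> 1.
encoder = LabelEncoder()
df["target"] = encoder.fit_transform(df["target"])

# Remove rows with missing values and exact duplicate rows.
df = df.dropna().drop_duplicates(keep="first").reset_index(drop=True)

print(df)
print("missing:", int(df.isnull().sum().sum()),
      "duplicates:", int(df.duplicated().sum()))
```

In this toy data the first and third rows are identical, so one of them is removed and four rows remain.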

Step – 2:- EDA (exploratory data analysis)

In this step, we visualize the basic structure of the data to understand the dataset better. We use the matplotlib library to draw a pie chart of the class distribution: 87.37 percent of the emails are ham and 12.63 percent are spam.
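
The percentages shown in such a pie chart can be computed directly from the encoded labels. The sketch below uses a made-up label column, so its numbers differ from the project's 87.37 and 12.63 percent; the name `target` is an assumption.

```python
import pandas as pd

# Hypothetical encoded labels (0 = ham, 1 = spam), not the project data.
target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="target")

# Share of each class as a percentage of all emails.
share = target.value_counts(normalize=True).mul(100).round(2)
print(share.rename({0: "ham", 1: "spam"}))
```

These values are what a pie chart (for example one drawn with `matplotlib.pyplot.pie`) would display as slice sizes.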

Here we use the nltk library (the Natural Language Toolkit) to count the words and sentences in each email. We store the character, word, and sentence counts in new columns, using lambda functions to apply the tokenizers to each email.
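
A minimal sketch of these count columns is shown below. To keep it runnable without downloading NLTK data, simple regular-expression splitters stand in for `nltk.word_tokenize` and `nltk.sent_tokenize`; that substitution, the helper names `words` and `sentences`, and the column names are assumptions made for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["Call me later. I am busy!", "Free prize"]})

# Assumption: regex splitters stand in for the nltk tokenizers.
def words(text):
    # Words and individual punctuation marks count as tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def sentences(text):
    # Split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

df["num_characters"] = df["text"].apply(len)
df["num_words"] = df["text"].apply(lambda t: len(words(t)))
df["num_sentences"] = df["text"].apply(lambda t: len(sentences(t)))
print(df)
```

The first message yields 25 characters, 8 tokens, and 2 sentences; the second yields 10 characters, 2 tokens, and 1 sentence.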

Next, we plot histograms of these counts using the seaborn library. In the first histogram, for characters, the maximum count is around 530 for ham emails and around 190 for spam.

In the histogram for words, the maximum count is around 600 for ham emails and around 100 for spam.

Step – 3:- Data Preprocessing

In this step, we do the following:-

• converting into lower case
Here, we convert every character to lower case.

• tokenization
Here, we split the text into individual tokens (words).

• removing stop words and punctuation
We remove stop words such as
'i','me','my','myself','we','our','ours','ourselves','you',"you're","you've",
"you'll","you'd",'your','yours','yourself','yourselves'
etc., as well as punctuation.

• removing special characters
Here, we remove special characters such as @#$%^&*! etc.

• stemming
Here, we reduce each word to its root form. If the mail contains the word “loving”, it is converted to the root word “love”.

To do all of this, we create a function ‘modify_email_text’ which uses the nltk library. First, we convert the email text to lower case and tokenize it. Then we create a list Z and append to it every token that is neither a stop word nor punctuation. Finally, we rebuild the list so that it holds the stem of each remaining word.
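
The preprocessing steps could be sketched as below. This is one possible shape for `modify_email_text`, not the project's actual function. `PorterStemmer` is a real nltk class, but the report does not say which stemmer was used. The short hard-coded stop-word set and the regex tokenizer are assumptions that stand in for nltk's stop-word corpus and tokenizer, so that the sketch runs without downloading NLTK data.

```python
import re
from nltk.stem import PorterStemmer

# Assumption: a short hard-coded list replaces nltk.corpus.stopwords.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves",
              "you", "your", "yours", "yourself", "yourselves", "a", "the"}
stemmer = PorterStemmer()

def modify_email_text(text):
    # Steps 1 and 2: lower-case, then split into words and symbols.
    tokens = re.findall(r"[a-z0-9]+|[^\w\s]", text.lower())
    kept = []
    for token in tokens:
        # Steps 3 and 4: drop stop words, punctuation, special characters.
        if token in STOP_WORDS or not token.isalnum():
            continue
        # Step 5: reduce the remaining word to its stem.
        kept.append(stemmer.stem(token))
    return " ".join(kept)

print(modify_email_text("I am LOVING you!!! Call me @ 5"))
```

With this sketch, "LOVING" is lower-cased and stemmed to "love", while "I", "you", "me", the exclamation marks, and "@" are dropped.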

After all of this, we obtain the output as modified_text.

We then create a word cloud from it. A word cloud is a picture that shows which words are used most often. The word cloud for the spam emails shows that ‘free’, ‘text’, ‘mobile’, and ‘call’ are among the most used words in spam emails.

Step – 4:- Model Training using Naïve Bayes Classifier

In this step, we use CountVectorizer and TfidfVectorizer from the scikit-learn library to convert the textual data into numerical form. After converting the data, we split it into training and test sets using scikit-learn's train_test_split function. We use 80 percent of the whole dataset for training and 20 percent for testing.
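
A minimal sketch of vectorizing and splitting follows. It assumes the conventional reading of an 80/20 split, with the larger share used for training. The toy messages, the labels, and the `random_state` value are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed messages and labels (0 = ham, 1 = spam).
texts = ["free prize call", "see lunch", "free mobil offer", "call later",
         "win cash prize", "meet tomorrow", "claim free text", "home soon",
         "urgent call mobil", "thank help"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Turn each message into a row of TF-IDF weights (one column per word).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# Hold out 20 percent of the rows for testing; random_state is an
# arbitrary value that only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=2)

print(X.shape, X_train.shape, X_test.shape)
```

The ten toy messages contain 19 distinct words, so the matrix has 10 rows and 19 columns, of which 8 rows go to training and 2 to testing.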

We then import GaussianNB, MultinomialNB, and BernoulliNB from sklearn.naive_bayes to compare them on the data. We create one instance of each under a short name:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

The line gnb.fit(X_train,y_train) gives the model the training data to learn from, and y_pred1 = gnb.predict(X_test) uses what the model has learned to predict labels for the test data. We then print the accuracy score, confusion matrix, and precision score for each model. In our case, MultinomialNB is the best-performing model: it gives a precision score of 1 and an accuracy of 0.97.
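
The comparison of the three classifiers could be sketched as follows. The toy corpus, the use of `CountVectorizer` here, the `stratify` and `random_state` arguments, and the `results` dictionary are assumptions for illustration. On such a tiny dataset the scores are not meaningful and will not match the 0.97 accuracy reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical preprocessed messages and labels (0 = ham, 1 = spam).
texts = ["free prize call", "see lunch", "free mobil offer", "call later",
         "win cash prize", "meet tomorrow", "claim free text", "home soon",
         "urgent call mobil", "thank help"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# GaussianNB needs a dense array, hence toarray().
X = CountVectorizer().fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=2, stratify=labels)

models = {"GaussianNB": GaussianNB(),
          "MultinomialNB": MultinomialNB(),
          "BernoulliNB": BernoulliNB()}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)        # learn from the training data
    y_pred = model.predict(X_test)     # predict labels for unseen data
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
    print(name, results[name]["accuracy"], results[name]["precision"])
    print(results[name]["confusion"])
```

Precision is a useful criterion for a spam filter because a precision of 1 means that no ham email in the test set was wrongly flagged as spam.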

FUTURE SCOPE

In this project, we have used only one family of classifier, Naive Bayes. In the future, we want to compare four or five different classifiers against each other on accuracy and precision.

This will tell us which model works best in a given situation, so that the deployed model can be chosen accordingly.

CONCLUSION

Email spam is one of the most pressing and difficult internet problems in today's world of communication and technology. It is almost impossible to think of email without considering the issue of spam. Spammers misuse this communication channel and thus affect organizations and many email users.

The machine learning model used by Google is now so advanced that it can detect and filter out spam and phishing emails with almost 99.9 percent accuracy. This means that only about one in a thousand messages manages to escape its spam filter.

In this project, we explored the data, visualized various aspects of it, analyzed it, and built a machine learning model that can be used to help detect fraudulent or spam emails.
