Intel College Excellence Program
Project Synopsis
“E-mail fraud detection using Classification Algorithm.”
Team members' details
S.No.  Participant Name  Mobile No.   Email ID
1      Anubhav Yadav     8791711910   [email protected]
2      Arushi Rajdev     6350389112
3      Vishal Gupta      9580499094
Faculty (college) mentor details
S.No.  Mentor Name               Mobile No.   Email ID
1      Mr. Sanjay Kumar Sonker   9718154099   [email protected].in
College/University Name
Galgotias University
School of Computing Science and Engineering,
Galgotias University
Project Proposal Page 1 of 14
BACKGROUND
Electronic mail, known as email, has become a part of everyday life. It has
reduced communication difficulties for organizations and individuals alike
and is now an industry standard for communication. Email development
started in the 1960s, but initially users could only send email to other
users of the same computer. Some systems also supported a form of instant
messaging, in which the sender and recipient needed to be online at the
same time. Nowadays email is exploited by fraudsters, who carry out spam
and phishing attacks by sending unsolicited and unwanted emails. It has
therefore become important to detect such emails.
PROBLEM IDENTIFICATION
Email is a means of exchanging messages between people using electronic
devices. It is both fast and among the cheapest modes of communication.
However, email is plagued by spam. Spam is unsolicited, unwanted email
from a stranger that is sent in bulk to large mailing lists, usually with a
commercial purpose. Spam emails often aim to carry out malicious activity
that harms the recipient, financially or otherwise. Spamming is also
increasing day by day in other communication channels. One study
estimated that over 70% of today's business emails are spam. This
disruption also costs communication bandwidth, email server capacity,
and user time. Spam emails share the following features: they are sent to
undisclosed recipients to advertise services, products, or offensive
material; they aim to deceive innocent people by obtaining their personal
data and abusing it; and the majority of them do not offer an unsubscribe
option.
PROPOSED SOLUTION
Our project aims to detect spam emails using a machine learning classifier
called the Naïve Bayes classifier. It scans and filters the user's email to
mark or detect any unwanted email. The main steps are as follows. First,
we preprocess the given dataset, using word-based feature selection to
obtain the words that the classifier will use. Next, we visualize the data
and split the dataset into a training set and a testing set. We then apply
the Naïve Bayes classification model, and finally we analyze the results
with the help of a confusion matrix to visualize the performance of the
model.
Naïve Bayes Classifier :- It is a machine learning algorithm in which words
play a key role. If an incoming email contains words that appear frequently
in spam but not in ham, then the email is probably spam. Naïve Bayes
classification has become one of the most popular methods used in email
filtering software.
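The word-frequency idea behind the classifier can be illustrated with a small sketch. It is not the project's code: the four training messages are made up, and the add-one (Laplace) smoothing and the function names train, log_score, and classify are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Tiny hand-made corpus, purely illustrative (not the project's dataset).
TRAINING_MESSAGES = [
    ("spam", "free prize call now"),
    ("spam", "free mobile offer call"),
    ("ham", "are we meeting for lunch"),
    ("ham", "call me when you are free"),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, text in messages:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts


def log_score(text, label, word_counts, class_counts):
    """log P(label) plus the sum of log P(word | label), with add-one smoothing."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.split():
        count = word_counts[label][word] + 1  # add-one (Laplace) smoothing
        score += math.log(count / (total_words + len(vocabulary)))
    return score


def classify(text, word_counts, class_counts):
    """Return the label whose log score is higher."""
    return max(
        ("spam", "ham"),
        key=lambda label: log_score(text, label, word_counts, class_counts),
    )


word_counts, class_counts = train(TRAINING_MESSAGES)
print(classify("free prize offer", word_counts, class_counts))
print(classify("are we meeting", word_counts, class_counts))
```

Words such as "free" and "prize" occur more often in the spam examples, so a message built from them scores higher under the spam class.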
HARDWARE & SOFTWARE REQUIREMENTS
Hardware requirements:
1. Laptop
2. Internet connection
Software requirements:
1. Anaconda distribution
2. Jupyter Notebook
3. PyCharm IDE
4. Streamlit
Python libraries:
• pandas
• numpy
• nltk
• wordcloud
• scikit-learn (sklearn)
• matplotlib
• seaborn
BLOCK DIAGRAM & DESCRIPTION
This project is mainly divided into five steps:-
First, we take the dataset from Kaggle and load it into a Jupyter notebook
using pandas.
Step – 1:- Data Cleaning
In this step, we clean the dataset: we rename the columns according to our
needs and drop the unwanted columns that contain null values. We use the
sklearn library to encode the category of each email as ham or spam: 0
means ham and 1 means spam. We then check for missing values and
duplicate rows and remove any that we find.
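A minimal sketch of the loading and cleaning step is given here. The raw column names v1 and v2, the empty third column, and the new names target and text are assumptions about the file layout; an in-memory CSV stands in for the Kaggle file so that the example runs on its own.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the Kaggle CSV; the column names v1/v2 and the empty
# third column are assumptions about the raw file's layout.
RAW_CSV = io.StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,See you at lunch,\n"
    "spam,Free entry call now,\n"
    "ham,See you at lunch,\n"
    "spam,Win a free mobile,\n"
)

df = pd.read_csv(RAW_CSV)

# Drop columns that contain only null values, then rename the rest.
df = df.dropna(axis=1, how="all")
df = df.rename(columns={"v1": "target", "v2": "text"})

# Encode the category. LabelEncoder sorts the labels, so ham -> 0, spam -> 1.
encoder = LabelEncoder()
df["target"] = encoder.fit_transform(df["target"])

# Remove rows with missing values and duplicate rows.
df = df.dropna().drop_duplicates(keep="first").reset_index(drop=True)

print(df)
```

With the real dataset, pd.read_csv would be given the path of the downloaded file instead of RAW_CSV.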
Step – 2:- EDA (exploratory data analysis)
In this step, we visualize the basic structure of the data to understand the
dataset better. We use the matplotlib library to represent the data as a pie
chart. It shows that 87.37 percent of our data is ham and 12.63 percent is
spam.
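The class shares behind such a pie chart could be computed as in the sketch below. The eight labels are made up, so the percentages differ from the figures reported for the real dataset, and the helper plot_class_pie is a hypothetical name.

```python
import pandas as pd

# Hypothetical encoded labels (0 = ham, 1 = spam); not the real dataset.
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="target")

# Share of each class as a percentage, rounded to two decimals.
percentages = (labels.value_counts(normalize=True) * 100).round(2)
print(percentages)


def plot_class_pie(percentages):
    """Draw the class shares as a pie chart (requires matplotlib)."""
    # Imported here so that the numbers above can be computed without matplotlib.
    import matplotlib.pyplot as plt

    names = {0: "ham", 1: "spam"}
    plt.pie(
        percentages.values,
        labels=[names[i] for i in percentages.index],
        autopct="%0.2f",
    )
    plt.show()
```

Calling plot_class_pie(percentages) in a notebook would draw the chart.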
Here we have used the nltk library (Natural Language Toolkit) to count the
characters, words, and sentences in each email. We store these counts in
three new columns, one each for the number of characters, words, and
sentences. We have used a lambda function to apply the tokenization.
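The counting step might look like the following sketch. The column names num_characters, num_words, and num_sentences are hypothetical, since the proposal does not state them. To keep the example runnable without downloading nltk data, it uses nltk's regex-based wordpunct_tokenize for words and a simple regular expression for sentences; nltk.word_tokenize and nltk.sent_tokenize are likely alternatives, but they need the punkt data package.

```python
import re

import pandas as pd
from nltk.tokenize import wordpunct_tokenize

df = pd.DataFrame({"text": ["Free entry! Call now.", "See you at lunch"]})

# Column names below are hypothetical; the source does not state them.
df["num_characters"] = df["text"].apply(len)
df["num_words"] = df["text"].apply(lambda text: len(wordpunct_tokenize(text)))

# Rough sentence split on '.', '!' or '?' followed by whitespace. This is a
# stand-in for nltk.sent_tokenize, which needs the 'punkt' data package.
df["num_sentences"] = df["text"].apply(
    lambda text: len([s for s in re.split(r"(?<=[.!?])\s+", text) if s])
)

print(df)
```

Note that wordpunct_tokenize counts punctuation marks as separate tokens, so the word counts include them.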
Next, we have plotted histograms for data visualization, using the seaborn
library. The first image shows the histogram for characters: the maximum
count is around 530 for ham emails and around 190 for spam emails.
In the histogram for words, the maximum count is around 600 for ham
emails and around 100 for spam emails.
Step – 3:- Data Preprocessing
In this step we do the following:-
• converting into lower case
In this, we convert every character to lower case.
• Tokenization
In this, we split the text into tokens such as words and sentences.
• removing stop words and punctuation
We remove punctuation and stop-words such as
'i','me','my','myself','we','our','ours','ourselves','you',"you're","you've",
"you'll","you'd",'your','yours','yourself','yourselves'
etc.
• removing special characters
In this, we remove special characters such as @#$%^&*! etc.
• stemming
In this, we reduce each word to its root form. If the mail contains the
word “loving”, it is converted to the root word “love”.
To do all this, we create a function ‘modify_email_text’ which uses the
nltk library. First, we convert the email text to lower case and tokenize it.
Then we create a list Z and append to it every token that is not a stop-word
or punctuation. Finally, we set those tokens aside, clear the list, and
append the stemmed form of each token to it.
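A possible shape for such a function is sketched below. It is not the project's implementation: the stop-word set is a small illustrative subset (the project would presumably use nltk.corpus.stopwords, which needs a one-time download), wordpunct_tokenize is used because it needs no extra data, and a list comprehension replaces the intermediate list described above.

```python
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Small illustrative stop-word set (an assumption); the project would
# presumably use nltk.corpus.stopwords.words("english") instead.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "you", "your", "the", "a", "is", "to"}

stemmer = PorterStemmer()


def modify_email_text(text):
    """Lower-case, tokenize, drop stop words and non-alphanumeric tokens, then stem."""
    tokens = wordpunct_tokenize(text.lower())
    # Keeping alphanumeric tokens only removes punctuation and special characters.
    kept = [t for t in tokens if t.isalnum() and t not in STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)


modified_text = modify_email_text("I am LOVING the free offer!!! Call me @ 12345")
print(modified_text)
```

Applied to a whole column, the function could be passed to the apply method of a pandas Series.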
After doing all this, we get the output as modified_text.
Then we create a word cloud for it. A word cloud shows in picture form
which words are used most often: the more frequent a word, the larger it
appears. The word cloud for the spam emails shows that ‘free’, ‘text’,
‘mobile’, and ‘call’ are among the most used words in spam emails.
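The frequencies that a word cloud displays can also be inspected directly. The sketch below counts words with collections.Counter from the standard library; the three spam texts are made up and stand in for the preprocessed spam messages.

```python
from collections import Counter

# Hypothetical preprocessed spam texts, standing in for the real modified_text values.
spam_texts = ["free call mobil", "free text call", "call claim prize free"]

# Count every word across all spam texts.
word_counts = Counter(word for text in spam_texts for word in text.split())

# The most frequent words are the ones a word cloud would draw largest.
print(word_counts.most_common(2))
```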
Step – 4:- Model Training using Naïve Bayes Classifier
In this step, we use CountVectorizer and TfidfVectorizer from the sklearn
library to convert the textual data into numerical form. After converting
the data, we split it into training and test sets using the train_test_split
function of sklearn. We have given 80 percent of the whole dataset to
training and 20 percent to testing.
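The vectorization and split could be sketched as follows. The ten texts and labels are invented, TfidfVectorizer is shown (CountVectorizer has the same fit_transform interface), and test_size=0.2 with random_state=2 are assumed settings that hold out 20 percent of the rows for testing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed texts and labels (0 = ham, 1 = spam).
texts = [
    "free call mobil", "win free prize", "claim prize call", "free text offer",
    "urgent call claim", "see lunch today", "meet tomorrow offic",
    "thank help project", "come home late", "send report tonight",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Convert each text into a row of TF-IDF weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# test_size=0.2 keeps 80 percent for training and 20 percent for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=2
)

print(X_train.shape, X_test.shape)
```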
We then import GaussianNB, MultinomialNB, and BernoulliNB from
sklearn.naive_bayes to test the data. We assigned each model a short name:
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()
The line gnb.fit(X_train,y_train) gives the model the training data to
learn from, and y_pred1 = gnb.predict(X_test) makes it predict labels for
the test data based on what it has learned. We then print the accuracy
score, confusion matrix, and precision score for each model. In our case,
MultinomialNB is the best performing model, as it gives a precision score
of 1 and an accuracy of 0.97.
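The comparison of the three models could be organised as in this sketch. The data is a made-up toy set, so the printed scores will not match the results reported above; the stratify and zero_division arguments and the results dictionary are assumptions added to keep the small example well behaved.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical preprocessed texts and labels (0 = ham, 1 = spam).
texts = [
    "free call mobil", "win free prize", "claim prize call", "free text offer",
    "urgent call claim", "see lunch today", "meet tomorrow offic",
    "thank help project", "come home late", "send report tonight",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# GaussianNB needs a dense array, hence toarray().
X = TfidfVectorizer().fit_transform(texts).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=2, stratify=labels
)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # learn from the training data
    y_pred = model.predict(X_test)   # predict labels for the unseen test data
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
    print(name, results[name])
```

Precision matters here because a ham email wrongly marked as spam is costly for the user, which is one reason to report it alongside accuracy.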
FUTURE SCOPE
In this project, we have used only one type of classifier, Naïve Bayes, but
in the future we want to use four or five different classifiers and compare
their accuracy and precision against each other.
By doing this, we will know which model works best in the given situation,
and the deployment of the model can be optimized.
CONCLUSION
Email spam is one of the most pressing and difficult internet problems in
today's world of communication and technology. It is almost impossible to
think of email without considering the issue of spam. Spammers who
generate spam mail misuse this communication channel and thus affect
organizations and many email users.
The machine learning model used by Google is now so advanced that it can
detect and filter out spam and phishing emails with almost 99.9 percent
accuracy. This means that only about one in a thousand messages manages
to escape its email spam filter.
In this project, we explored the data, visualized various aspects of it,
analyzed it, and built a machine learning model that can be used to help
detect fraudulent or spam emails.
REFERENCES
1. https://en.wikipedia.org/wiki/Email
2. M. Siponen and C. Stucke, "Effective anti-spam strategies in companies: An international
study", in Proceedings of HICSS 2006, vol. 6, 200
3. https://www.youtube.com/watch?v=rHesaMUqTjE