Feature Selection for Phishing Website
Classification
Asst. Prof. Mallesh Hatti1, Vanamala Shivani2, A. B. Pravallika3, Muthyam Vyshnavi4
ABSTRACT
Phishing is an attempt to obtain confidential information about a user or an organization. The attacker impersonates a
credible webpage so that users expose personal data such as usernames, passwords and credit card information. It has cost the
online community and various stakeholders hundreds of millions of dollars. Phishing therefore needs to be detected and
predicted, and machine learning classification is a promising approach. However, it may take several phases to
identify and collect effective features from the dataset before the selected classifier can be trained to identify phishing
sites correctly. This paper presents the performance of phishing webpage detection with two machine learning
techniques, XGBoost and Logistic Regression, and one deep learning technique, the LSTM algorithm. The better-performing
of the two machine learning algorithms is then optimized further. The experimental results show
that the optimized XGBoost achieves the highest performance among all the techniques.
Keywords: Phishing, Web threat, XGBoost, Logistic Regression, LSTM, Machine learning, Deep learning

*Corresponding author: Asst. Prof. Mallesh Hatti
Address: Sridevi Women’s Engineering College, Department of Electronics and Communication Engineering,
Vattinagulapally, Gandipet, R.R. Dist-500075, India
Email: [email protected]

I.INTRODUCTION

Technology has become an integral part of life in the twenty-first century. The internet is one of
these technologies; it grows rapidly every year and plays an important role in individuals’ lives.
It has become a valuable and convenient mechanism for supporting public transactions such as e-banking
and e-commerce. As a result, users have come to trust it and find it convenient to provide their private
information online. Thieves have started to target this information, which has become a major security
problem. Phishing websites are considered to be one of these problems.

Phishers use social engineering: they try to manipulate users into giving up their personal information
by exploiting human vulnerabilities rather than software vulnerabilities. Statistics show that the number
of phishing attacks keeps increasing, which presents a security risk to user information, according to the
Anti-Phishing Working Group (APWG) and to the phishing attacks recorded by Kaspersky Lab, which stated that
the figure increased by 47.48% across all of the phishing attacks detected during 2016.

Recently, several studies have tried to solve the phishing problem. Some researchers compared the URL
with existing blacklists of malicious websites that they had been compiling, while others took the
opposite approach and compared the URL with a whitelist of legitimate websites. The latter approach uses
heuristics, matching a site against a signature database of known attacks to decide whether it is a
phishing website. Additionally, measuring website traffic using Alexa is another way that researchers
have implemented to detect phishing websites. Moreover, other
researchers have used machine learning techniques. Machine learning is a field of computer science and
a branch of artificial intelligence (AI) in which systems learn from data to perform tasks in an
intelligent way. It has two main types of learning: supervised learning and unsupervised learning.
Supervised learning trains a model on a set of measured features of the data together with a target
label for each example; once the model is trained, it can predict the target label for unseen data.
Unsupervised learning, on the other hand, finds structure in the data without being given any target
label during training.

II.LITERATURE SURVEY

A. CANTINA: A Content-based Approach to Detecting Phishing Web Sites

Phishing is a significant problem involving fraudulent email and web sites that trick unsuspecting users
into revealing private information. Zhang et al. [3] present the design, implementation, and evaluation
of CANTINA, a novel, content-based approach to detecting phishing web sites, based on the TF-IDF
information retrieval algorithm. They also discuss the design and evaluation of several heuristics
developed to reduce false positives. Their experiments show that CANTINA is good at detecting phishing
sites, correctly labelling approximately 95% of phishing sites.

B. Techniques for Detecting Zero Day Phishing Websites

Phishing is a means of obtaining confidential information through fraudulent web sites that appear
to be legitimate. There are many phishing detection techniques available, but current practices leave
much to be desired. A central problem is that web browsers rely on a blacklist of known phishing sites,
but some phishing sites have a lifespan as short as a few hours. The web browser needs a faster
recognition system to identify zero day phishing sites, which are new phishing sites that have not yet
been discovered.
This research improves upon techniques used by popular anti-phishing software and introduces a new
method of detecting fraudulent web pages using cascading style sheets (CSS). Current phishing
detection techniques are examined, and a new detection method is implemented and evaluated
against hundreds of known phishing sites.

C. PhishShield: A Desktop Application to Detect Phishing Webpages through Heuristic Approach

Phishing is website forgery with the intention of tracking and stealing the sensitive information of
online users. The attacker fools the user with social engineering techniques delivered through SMS,
voice, email, websites and malware.
The authors of this work implemented a desktop application called PhishShield, which concentrates on the
URL and website content of the phishing page. PhishShield takes a URL as input and outputs the status of
the URL as a phishing or legitimate website. The heuristics used to detect phishing are footer links with
null values, zero links in the body of the HTML, copyright content, title content and website identity.
PhishShield is able to detect zero hour phishing attacks, which blacklists are unable to detect, and it
is faster than the visual-based assessment techniques used in detecting phishing. The accuracy obtained
for PhishShield is 96.57%, and it covers a wide range of phishing web sites, resulting in lower false
negative and false positive rates.

III.METHODOLOGY

The framework in Figure 1 represents the module description of the analysis.

Figure 1: Block Diagram

A. Dataset

URLs of benign websites were collected from www.alexa.com, and URLs of phishing websites
were collected from www.phishtank.com. The data set consists of a total of 25,469 URLs:
12,058 benign URLs and 13,411 phishing URLs. Benign URLs are labelled “B” and phishing URLs are
labelled “M”.
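As a minimal sketch of how the labelled data set described in the Dataset subsection might be prepared for a classifier, the following Python snippet maps the “B” and “M” labels to 0 and 1 and derives a few lexical features from each URL. The function names, the feature choices (URL length, dot count, digit count, IP-address host, presence of “@”) and the two sample URLs are illustrative assumptions; they are not the features used in this study.

```python
import re
from urllib.parse import urlparse

# Label mapping taken from the paper: "B" = benign, "M" = phishing.
LABEL_TO_INT = {"B": 0, "M": 1}

# Matches a dotted-quad host such as 192.168.10.5 (hypothetical heuristic).
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_features(url):
    """Return simple lexical features for one URL (hypothetical feature set)."""
    # urlparse only fills in the host when a scheme or a leading '//' is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_ip_host": int(bool(IPV4_HOST.match(host))),
        "has_at_symbol": int("@" in url),
    }


def encode_dataset(rows):
    """Turn (url, label) pairs into a feature list and an integer target list."""
    features = [extract_features(url) for url, _ in rows]
    targets = [LABEL_TO_INT[label] for _, label in rows]
    return features, targets


# Made-up sample rows, used only to exercise the functions.
sample_rows = [
    ("https://www.example.com/index.html", "B"),
    ("http://192.168.10.5/login/secure@update", "M"),
]
features, targets = encode_dataset(sample_rows)
```

With the counts reported above, the two classes are close to balanced: 13,411 of the 25,469 URLs, roughly 53%, are phishing.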
B. Data Preprocessing

Data preprocessing consists of cleansing, instance selection, feature extraction, normalization,
transformation, etc. The result of data preprocessing is the final training dataset. Data
preprocessing may affect how the results of the final processing are interpreted. Data cleaning
is the step in which missing data are filled in, noise is smoothed, outliers are recognized or removed,
and inconsistencies are resolved. Data integration is the step in which several databases or data sets
are combined. Data transformation is the step in which aggregation and normalization are applied to
scale the data. Data reduction yields a representation of the dataset that is much smaller in size
but helps to produce the same analytical outcome.

C. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a data analysis technique that offers several methods, most of them
graphical, as shown in Figure 3. It maximizes insight into a data set, uncovers hidden structure,
extracts essential parameters, locates outliers and anomalies, and tests underlying assumptions.

D. Train-test split

The dataset is split into two subsets, a training set and a testing set, so that the algorithms can be
fitted on the training set and then used to detect phishing websites in the testing set.
30% of the data is reserved for the testing set, so the model trains on the remaining 70%.

IV.MACHINE LEARNING APPROACH

Machine learning provides simplified and efficient methods for data analysis. It has recently shown
promising results in real-time classification problems. The key advantage of machine learning is
the ability to create flexible models for specific tasks like phishing detection. Since phishing
detection is a classification problem, machine learning models can be used as a powerful tool. Machine
learning models can adapt quickly to changes and identify patterns of fraudulent transactions, which
helps to develop a learning-based identification system. Most of the machine learning models discussed
here are classified as supervised machine learning, in which an algorithm tries to learn a function that
maps an input to an output based on example input-output pairs. It infers a function from labeled
training data consisting of a set of training examples. We present the machine learning methods that we
used in our study.

A. Logistic Regression

Logistic Regression is a classification algorithm used to assign observations to a discrete set of
classes. Unlike linear regression, which outputs continuous values, Logistic Regression transforms its
output using the logistic sigmoid function to return a probability value, which can then be mapped to
two or more discrete classes. Logistic regression works well when the relationship in the data is almost
linear, but it performs poorly when there are complex nonlinear relationships between variables.
It also requires more statistical assumptions than other techniques.

B. Gradient Boosting

Gradient Boosting trains many models incrementally and sequentially. The main difference between
AdaBoost and the Gradient Boosting algorithm is how they identify the shortcomings of weak
learners such as decision trees. While the AdaBoost model identifies the shortcomings by using highly
weighted data points, Gradient Boosting does so by using gradients of the loss function.

C. XGBoost

XGBoost is a refined and customized version of Gradient Boosting designed to provide better performance
and speed. The most important factor behind the success of XGBoost is its scalability in all scenarios.
XGBoost runs more than ten times faster than popular solutions on a single machine and scales to
billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to
several important algorithmic optimizations. These innovations include a novel tree learning algorithm
for handling sparse data and a theoretically justified weighted quantile sketch procedure that enables
handling instance weights in approximate tree learning. Parallel and distributed computing make
learning faster, which enables quicker model exploration.
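The boosting idea described in the Gradient Boosting subsection can be illustrated with a small, self-contained sketch. It is not the XGBoost implementation: it omits regularization, second-order information, sparsity handling and the weighted quantile sketch, and it uses one-split regression stumps as weak learners. Each round fits a stump to the negative gradient of the logistic loss (the label minus the predicted probability) and adds it to the running score. The single feature (URL length), the toy data and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:  # the largest value would leave one side empty
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, rounds=20, learning_rate=0.5):
    """Sequentially add stumps fitted to the negative gradient of the log loss."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of the logistic loss w.r.t. the score: y - p.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * predict_stump(stump, x)
                  for s, x in zip(scores, xs)]
    return {"learning_rate": learning_rate, "stumps": stumps}


def predict_proba(model, x):
    """Probability that x belongs to the positive (phishing) class."""
    score = sum(model["learning_rate"] * predict_stump(s, x)
                for s in model["stumps"])
    return sigmoid(score)


# Toy data: hypothetical URL lengths, 0 = benign, 1 = phishing.
url_lengths = [12, 18, 25, 30, 55, 60, 72, 90]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gradient_boosting(url_lengths, labels)
```

Calling `predict_proba(model, 80)` on this toy model returns a probability above 0.5, and `predict_proba(model, 20)` a probability below it; adding rounds pushes both further from 0.5, which is the incremental correction the subsection describes.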
V. MODELING PHISHING URLS WITH RECURRENT NEURAL NETWORKS

A neural network is a bio-inspired machine learning model that consists of a set of artificial neurons
with connections between them. Recurrent Neural Networks (RNN) are a type of neural network that is
able to model sequential patterns. The distinctive characteristic of RNNs is that they introduce the
notion of time to the model, which in turn allows them to process sequential data one element at a time
and learn their sequential dependencies.

Figure 2. Recurrent neural network for classifying phishing URLs based on LSTM units.

One limitation of general RNNs is that they are unable to learn the correlation between
elements more than 5 or 10 time steps apart [29]. A model that overcomes this problem is Long Short
Term Memory (LSTM). This model can bridge elements separated by more than 1,000 time steps
without loss of short time lag capabilities [30].
LSTM is an adaptation of RNN. Here, each neuron is replaced by a memory cell that, in
addition to a conventional neuron representing an internal state, uses multiplicative units as gates to
control the flow of information. A typical LSTM cell has an input gate that controls the input of
information from the outside, a forget gate that controls whether to keep or forget the information
in the internal state, and an output gate that allows or prevents the internal state from being seen
from the outside.
In this work, we used LSTM units to build a model that receives a URL as a character
sequence and predicts whether or not the URL corresponds to a case of phishing. The architecture
is illustrated in Fig. 2. Each input character is translated by a 128-dimension embedding. The
translated URL is fed into an LSTM layer as a 150-step sequence. Finally, the classification is
performed using an output sigmoid neuron. The network is trained by backpropagation using a
cross-entropy loss function and dropout in the last layer.

VII.RESULTS

The phishing website detection model has been trained and tested using several classifiers and ensemble
algorithms in order to compare their results and find the best accuracy. Once all the algorithms have
returned their results, each is compared with the others to see which provides the highest accuracy,
as shown in Table 1. Each algorithm’s accuracy is also depicted in a confusion matrix for greater
comprehension. The dataset is also trained using a deep learning algorithm. The final accuracy
comparison of the algorithms is shown in Figure 3.

Table 1
Classifiers            Training set accuracy (%)   Testing set accuracy (%)   Precision (%)
Logistic Regression    92.00                       92.00                      89.00
XGBoost                93.80                       93.40                      93.42

Figure 3. Comparison of ML Algorithms
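The accuracy and precision columns of Table 1 are the kind of figures that can be read off a confusion matrix. The following sketch shows that calculation under the assumption that phishing is the positive class (label 1) and benign the negative class (label 0); the function names and the example predictions are made up for illustration and do not reproduce the values reported in this paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # phishing correctly flagged
            else:
                fp += 1  # benign wrongly flagged as phishing
        else:
            if actual == positive:
                fn += 1  # phishing missed
            else:
                tn += 1  # benign correctly passed
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def precision(tp, fp):
    """Share of URLs flagged as phishing that really are phishing."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


# Made-up labels and predictions: 1 = phishing, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
example_accuracy = accuracy(tp, fp, tn, fn)
example_precision = precision(tp, fp)
```

Reporting precision alongside accuracy matters for phishing detection because a false positive blocks a legitimate site, a cost that accuracy alone does not show.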
VIII.CONCLUSION
This paper aims to enhance the detection of phishing websites using machine learning technology.
We achieved 97.14% detection accuracy using the random forest algorithm, with the lowest false positive
rate. The results also show that classifiers perform better when more data is used for training.
In future, a hybrid technology will be implemented to detect phishing websites more accurately,
combining the random forest algorithm of machine learning technology with the blacklist method.
REFERENCES
[1] AO Kaspersky Lab. (2017). The Dangers of Phishing: Help employees avoid the lure of
cybercrime. [Online]. Available: https://go.kaspersky.com/DangersPhishingLanding-Page-Soc.html
[Oct 30, 2017].
[2] “Financial threats in 2016: Every Second Phishing Attack Aims to Steal Your Money,” 2017
financialthreatsin-2016. Feb 22, 2017 [Oct 30, 2017].
[3] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: A Content-based Approach to Detecting
Phishing Web Sites,” New York, NY, USA, 2007, pp. 639-648.
[4] N. Sanglerdsinlapachai and A. Rungsawang, “Web Phishing Detection Using Classifier
Ensemble,” New York, NY, USA, 2010, pp. 210-215.
[5] R. M. Mohammad, F. Thabtah, and L. McCluskey, “Predicting phishing websites based on
self-structuring”