Detection of Spyware by Mining Executable Files
Objectives
The main objective of our project is to establish a method for spyware detection
using data mining techniques. These techniques are commonly used for information
retrieval and classification. We apply them with only one change: the input consists
of computer programs rather than text documents.
In this project, binary features are extracted from executable files. A feature
reduction method is then applied to obtain a subset of the data, which serves as the
training set for automatically generating classifiers. The generated classifiers are
used to classify new, previously unseen binaries as either legitimate software or
spyware. We will choose a value of n (the n-gram size) that yields high performance
and a machine learning algorithm that produces high accuracy.
Project idea
The goal of the project is to detect spyware by using data mining and machine
learning. We use the Waikato Environment for Knowledge Analysis (WEKA) to perform
the experiments. WEKA is a suite of machine learning algorithms and analysis tools,
which is used in practice for solving data mining problems. First, we extract features
from the binary files. We then apply a feature reduction method to reduce the complexity
of the data set. Finally, we convert the reduced feature set into the Attribute Relation
File Format (ARFF). ARFF files are ASCII text files that include a set of data instances, each
described by a set of features. Figure 2.1 shows the steps involved in our proposed
method.
Figure 2.1: Proposed System
We organized our work into the following stages:
1. Data Collection
2. Byte Sequence Generation
3. N-gram Generation
4. Feature Extraction
5. Feature Reduction
6. ARFF Generation
7. Model Training
Step 1: Data Collection
Our data set consists of two classes of binary files:
(1) Benign files
(2) Spyware files.
Step 2: Byte Sequence Generation
This step converts every binary file in each class into a byte sequence. For the
conversion we use xxd, a UNIX utility that produces a hexadecimal dump of a file.
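The conversion can also be sketched in Java, the project language. This is a minimal stand-in for the xxd step, not the command used in the project: the source does not say which xxd options were used, so the class name ByteSequence and the unwrapped, lowercase hex output are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: turns raw bytes into a continuous hex string.
class ByteSequence {
    /** Encodes bytes as a lowercase hex string, two characters per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative Java bytes do not sign-extend.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Reads a whole file and returns its hex representation. */
    static String fileToHex(String path) throws IOException {
        return toHex(Files.readAllBytes(Paths.get(path)));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(fileToHex(args[0]));
        } else {
            // Demo input: four illustrative bytes.
            System.out.println(toHex(new byte[] {0x4d, 0x5a, (byte) 0x90, 0x00}));
        }
    }
}
```

Reading the whole file into memory keeps the sketch short. For very large binaries a streaming reader would be the safer choice.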
Step 3: N-gram Generation
This step splits each byte sequence into n-grams of the desired size n (namely 4, 5
and 6). An n-gram is a sequence of n elements. The output is written so that each
line contains exactly one n-gram, so the length of every line equals n.
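A possible Java sketch of the n-gram step follows. The source does not say whether n counts bytes or characters, nor whether consecutive n-grams overlap. The sketch assumes that n counts bytes (two hex characters each) and exposes the stride as a parameter: step = 1 gives overlapping n-grams and step = n gives non-overlapping chunks. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a hex string into n-grams of n bytes each.
class NGramGenerator {
    /**
     * @param hex  byte sequence as a hex string (two characters per byte)
     * @param n    n-gram size in bytes (assumption: bytes, not characters)
     * @param step stride in bytes between the starts of consecutive n-grams
     */
    static List<String> generate(String hex, int n, int step) {
        if (n <= 0 || step <= 0) {
            throw new IllegalArgumentException("n and step must be positive");
        }
        List<String> grams = new ArrayList<>();
        int width = 2 * n; // hex characters per n-gram
        // Trailing bytes that do not fill a complete n-gram are dropped.
        for (int i = 0; i + width <= hex.length(); i += 2 * step) {
            grams.add(hex.substring(i, i + width));
        }
        return grams;
    }

    public static void main(String[] args) {
        // One n-gram per line, as described in the text.
        for (String gram : generate("4d5a90000300", 4, 1)) {
            System.out.println(gram);
        }
    }
}
```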
Step 4: Feature Extraction
We extract the features by using two different approaches: Common Feature Based
Extraction (CFBE) and Frequency Based Feature Extraction (FBFE). Both methods are
used to obtain Reduced Feature Sets (RFSs) which are then used to generate the Attribute
Relation File Format (ARFF) files.
1. Frequency Based Feature Extraction (FBFE):
In FBFE, the frequency of each n-gram in each class is calculated.
2. Common Feature Based Extraction (CFBE):
In CFBE, the common n-grams are extracted from each class.
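The two extraction approaches can be sketched as follows. The FBFE method counts every occurrence of an n-gram across all files of one class. For CFBE the source does not define "common" precisely, so the sketch assumes one interpretation: an n-gram is common if it occurs in at least two files of the class. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper covering both extraction approaches for one class.
class FeatureExtraction {
    /** FBFE: total occurrences of each n-gram over all files of the class. */
    static Map<String, Integer> frequencies(List<List<String>> filesOfClass) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> file : filesOfClass) {
            for (String gram : file) {
                freq.merge(gram, 1, Integer::sum);
            }
        }
        return freq;
    }

    /** CFBE (assumed meaning): n-grams found in at least two files. */
    static Set<String> commonNGrams(List<List<String>> filesOfClass) {
        Map<String, Integer> fileCount = new HashMap<>();
        for (List<String> file : filesOfClass) {
            // A set per file so repeats inside one file count only once.
            for (String gram : new HashSet<>(file)) {
                fileCount.merge(gram, 1, Integer::sum);
            }
        }
        Set<String> common = new TreeSet<>();
        for (Map.Entry<String, Integer> e : fileCount.entrySet()) {
            if (e.getValue() >= 2) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
                Arrays.asList("aa", "bb", "aa"),
                Arrays.asList("bb", "cc"));
        System.out.println(frequencies(files));
        System.out.println(commonNGrams(files));
    }
}
```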
Step 5: Feature Reduction
In FBFE, all n-grams whose frequency lies within a specified range (50-500) are retained
and the rest (1-49) are discarded. In CFBE, each feature is represented only once
per class. To obtain the Reduced Feature Sets (RFSs) for CFBE and FBFE, the unique
n-grams of both classes are merged.
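The reduction for FBFE can be sketched as a frequency filter followed by a set union. The sketch assumes that the range 50-500 is inclusive at both ends and that n-grams above the upper bound are dropped as well, which the text does not state explicitly. The names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: frequency filter and merge of per-class feature sets.
class FeatureReduction {
    /** Keeps n-grams whose frequency f satisfies min <= f <= max. */
    static Set<String> filterByFrequency(Map<String, Integer> freq, int min, int max) {
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int f = e.getValue();
            if (f >= min && f <= max) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    /** Union of the two per-class sets; a set keeps each n-gram only once. */
    static Set<String> merge(Set<String> benign, Set<String> spyware) {
        Set<String> merged = new TreeSet<>(benign);
        merged.addAll(spyware);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> benignFreq = new HashMap<>();
        benignFreq.put("4d5a9000", 120);
        benignFreq.put("00000000", 7);
        Map<String, Integer> spywareFreq = new HashMap<>();
        spywareFreq.put("deadbeef", 60);
        Set<String> rfs = merge(
                filterByFrequency(benignFreq, 50, 500),
                filterByFrequency(spywareFreq, 50, 500));
        System.out.println(rfs);
    }
}
```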
Step 6: ARFF Generation (Data Set Generation)
This step generates two ARFF databases: a frequency based feature database and a
common feature based database. All attributes in a database are treated as Boolean
attributes. For each class, the ARFF generation process searches for every n-gram of
the reduced feature set in all byte sequences and assigns the corresponding attribute
a value of 1 if the n-gram is present and 0 if it is not.
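A minimal sketch of the ARFF generation is given below. It rests on several assumptions that are not in the source: one data instance per file, the class labels benign and spyware, and the prefix ng_ on attribute names so that they start with a letter. The class ArffWriter is hypothetical and is not part of WEKA.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: renders Boolean n-gram features as ARFF text.
class ArffWriter {
    /**
     * @param features   reduced feature set, in attribute order
     * @param fileNGrams n-grams found in each file (one set per file)
     * @param labels     class label per file, "benign" or "spyware"
     */
    static String toArff(String relation, List<String> features,
                         List<Set<String>> fileNGrams, List<String> labels) {
        if (fileNGrams.size() != labels.size()) {
            throw new IllegalArgumentException("one label per file is required");
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String f : features) {
            sb.append("@attribute ng_").append(f).append(" {0,1}\n");
        }
        sb.append("@attribute class {benign,spyware}\n\n@data\n");
        for (int i = 0; i < fileNGrams.size(); i++) {
            for (String f : features) {
                // 1 if the n-gram is present in this file, otherwise 0.
                sb.append(fileNGrams.get(i).contains(f) ? '1' : '0').append(',');
            }
            sb.append(labels.get(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Set<String> first = new HashSet<>(Arrays.asList("4d5a9000"));
        Set<String> second = new HashSet<>(Arrays.asList("4d5a9000", "deadbeef"));
        System.out.print(toArff("spyware_ngrams",
                Arrays.asList("4d5a9000", "deadbeef"),
                Arrays.asList(first, second),
                Arrays.asList("benign", "spyware")));
    }
}
```

Declaring each attribute as the nominal set {0,1} is one way to model the Boolean present/not present values described above.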
Step 7: Model Training
The ARFF file is used as input to WEKA for applying machine learning
algorithms. The algorithms used in the experiment are: ZeroR, Naive Bayes, SVM
(Support Vector Machines), J48, Random Forest and JRip.
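Of the listed algorithms, ZeroR is the simplest: it ignores all attributes and always predicts the most frequent class in the training data, which makes it a useful baseline for judging the other classifiers. The following plain Java sketch illustrates that idea only. It does not use or reproduce the WEKA implementation, and its names and tie-breaking rule are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a majority-class baseline in the style of ZeroR.
class ZeroRBaseline {
    /** Returns the most frequent label; ties go to the alphabetically first. */
    static String majorityClass(List<String> trainLabels) {
        if (trainLabels.isEmpty()) {
            throw new IllegalArgumentException("training labels must not be empty");
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (String label : trainLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Fraction of test labels that equal the constant prediction. */
    static double accuracy(String predicted, List<String> testLabels) {
        int correct = 0;
        for (String label : testLabels) {
            if (label.equals(predicted)) {
                correct++;
            }
        }
        return (double) correct / testLabels.size();
    }

    public static void main(String[] args) {
        String prediction = majorityClass(Arrays.asList("spyware", "spyware", "benign"));
        System.out.println(prediction);
        System.out.println(accuracy(prediction,
                Arrays.asList("spyware", "benign", "spyware", "spyware")));
    }
}
```

Any classifier trained on the ARFF data should beat this baseline to be considered useful.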
Hardware Requirements
Pentium processor, 1.6 GHz or faster
RAM, 128 MB or more
HDD, 40 GB or more.
Software Requirements
Platform: Linux OS
Language: Java
Editor: gedit
WEKA (Machine Learning Tool)