The remaining parameters were left at their default values.

– MLPClassifier(): A feed-forward neural network. The parameter hidden_layer_sizes represents the number of neurons in each hidden layer (see the sketch after this list):
  – A call to this library with only one parameter value implies that there is a single hidden layer with the specified number of neurons. For instance, MLPClassifier(100) implies one hidden layer with 100 neurons.
  – A call to this library with two values implies that there are two hidden layers whose numbers of neurons are specified. For instance, MLPClassifier(100, 1) means that there are two hidden layers with 100 and 1 neurons, respectively.
  – A call to this library with three parameter values implies that there are three hidden layers whose numbers of neurons are specified as inputs. For instance, MLPClassifier(100, 50, 1) means that there are three hidden layers with 100, 50, and 1 neurons, respectively.
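A minimal sketch of these configurations in scikit-learn. Note that, strictly speaking, scikit-learn expects the layer sizes as a single tuple passed to hidden_layer_sizes; the positional notation above is a shorthand:

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer with 100 neurons.
mlp_one = MLPClassifier(hidden_layer_sizes=(100,))

# Two hidden layers with 100 and 1 neurons, respectively.
mlp_two = MLPClassifier(hidden_layer_sizes=(100, 1))

# Three hidden layers with 100, 50, and 1 neurons, respectively.
mlp_three = MLPClassifier(hidden_layer_sizes=(100, 50, 1))
```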
We noticed some differences in performance when dealing with raw (i.e., non-standardized) versus standardized data. Standardizing the data means rescaling each feature so that it has zero mean and unit standard deviation, i.e., each value x is replaced by z = (x − μ)/σ, where μ and σ are the feature's mean and standard deviation. For comparison purposes, and to discover whether standardization affects performance, we performed the experiments on both non-standardized and standardized data. The Python StandardScaler(...) library was used for standardization.
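As an illustration (ours, not the paper's code), StandardScaler applies the z-score transformation column-wise:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)  # each column now has mean 0 and SD 1

print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # [1. 1.]
```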
We performed a stratified 10-fold cross-validation on the dataset. Furthermore, we employed the Python pipeline (i.e., pipeline(...)) to sequentially apply a list of transforms and produce a final estimator. The primary purpose of utilizing the pipeline is to add several layers to the fitting-assessing procedure. More specifically, the pipeline helped us stack two processes: 1) the standardization of the data, and 2) the application of the specified classifier.
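The paper does not list its exact code; the following sketch shows how the described setup (standardization piped into a classifier, evaluated with stratified 10-fold cross-validation) might be assembled. The generated data stand in for the malware dataset's features and labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the malware dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Stack the two processes: 1) standardization, 2) the classifier.
# Inside cross-validation, the scaler is re-fitted on each training
# fold only, which avoids leaking test-fold statistics.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50)),
])

cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.4f} (SD {scores.std():.4f})")
```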
VI. RESULTS

Table III reports the performance of each classifier. The performance is measured in terms of the mean values of accuracy obtained, along with their standard deviations. The table also reports the performance achieved by each classifier when the data were standardized or left as raw (non-standardized). The classifiers are ordered with respect to the accuracy achieved by each classifier.

The classifiers are grouped into five classes with respect to their demonstrated performance: I) poorly performing classifiers, II) conventional machine learning classifiers with good performance, III) simple neural networks with a single hidden layer, IV) deep learners with small numbers of epochs, and V) deep learners with larger numbers of epochs.

A. RQ. 1: The Influence of Standardization?

A quick glance at the accuracy achieved by applying each classifier to non-standardized and standardized forms of the data shows that, except in a couple of cases (Gaussian Naive Bayes and Quadratic Discriminant Analysis (QDA)), the classifiers demonstrated improved accuracy when the given data are standardized. It was also observed that it takes considerably less time and effort to train models with standardized data; whereas, when non-standardized data are fed into the classifiers, training and fitting the models takes a considerable amount of time.

The mean accuracy values obtained for the different classes of classifiers when non-standardized data are modeled are 61.57%, 90.61%, 94.78%, and 58.70% for poorly performing classifiers, conventional machine learners, simple neural networks with only one hidden layer, and deep learning models, respectively; whereas the calculated mean accuracy values when standardized data are used are 47.79%, 98.88%, 99.21%, and 98.90%, respectively. Excluding the poorly performing class of classifiers, we observe that standardization helps improve accuracy by 98.88 − 90.61 = 8.27%, 99.21 − 94.78 = 4.43%, and 98.90 − 58.70 = 40.2% for conventional machine learners, simple neural networks with only one hidden layer, and deep learning models, respectively. It was also observed that the standard deviations calculated for accuracy were much higher for non-standardized data than for standardized data. The primary reason might be that larger and wider-scale raw numerical values were used in model fitting.
TABLE III
THE PERFORMANCE OF CLASSIFIERS IN ASCENDING ORDER OF "ACCURACY" FOR PIPED AND STANDARDIZED DATA (cv = 10).

                                                  Accuracy %
Classifier (sorted)                  Non-Standardized          Piped and Standardized
                                     Mean        SD            Mean        SD
I) Poorly Performed Learners
Gaussian Naive Bayes 76.60% 39.49% 46.31% 0.46%
Quadratic Discriminant Analysis (QDA) 46.55% 34.29% 49.28% 0.45%
Average 61.57% 47.79%
II) Conventional Machine Learning
Logistic Regression 30.28% 45.20% 97.51% 0.19%
Linear SVM (LinearSVC) 84.51% 27.68% 97.61% 0.16%
AdaBoost(n = 50) 97.57% 3.65% 98.78% 0.09%
AdaBoost(n = 100) 97.77% 3.35% 98.88% 0.06%
AdaBoost(n = 200) 97.53% 3.87% 98.99% 0.08%
Nearest Neighbors (k = 7) 97.88% 2.01% 98.99% 0.08%
Nearest Neighbors (k = 5) 97.89% 2.12% 99.07% 0.08%
Nearest Neighbors (k = 3) 97.81% 2.45% 99.12% 0.06%
Decision Tree 95.20% 9.60% 99.24% 0.07%
Random Forest (n = 10) 96.85% 6.94% 99.45% 0.06%
Random Forest (n = 100) 97.04% 6.89% 99.51% 0.08%
Random Forest (n = 50) 97.07% 6.66% 99.51% 0.06%
Average 90.61% 98.88%
III) Simple Neural Network (One Layer)
Multi-Layer Perceptron: MLP (52)[BatchSize = Auto] 94.70% 4.36% 99.14% 0.07%
Multi-Layer Perceptron: MLP (100)[BatchSize = Auto] 95.19% 4.55% 99.21% 0.11%
Multi-Layer Perceptron: MLP (200)[BatchSize = Auto] 95.52% 4.12% 99.24% 0.09%
Multi-Layer Perceptron: MLP (400)[BatchSize = Auto] 93.72% 5.05% 99.25% 0.08%
Average 94.78% 99.21%
IV) Deep Learning (Multiple Layers): Epoch = 5
Multi-Layer Perceptron: MLP (30, 1)[BatchSize = 100] 65.01% 11.97% 98.73% 0.09%
Multi-Layer Perceptron: MLP (52, 1)[BatchSize = 100] 57.13% 16.26% 98.79% 0.07%
Multi-Layer Perceptron: MLP (100, 1)[BatchSize = 100] 45.04% 17.57% 98.83% 0.07%
Multi-Layer Perceptron: MLP (30, 1)[BatchSize = 5] 65.88% 11.99% 98.88% 0.09%
Multi-Layer Perceptron: MLP (52, 1)[BatchSize = 5] 58.03% 18.26% 98.90% 0.09%
Multi-Layer Perceptron: MLP (52, 30, 1)[BatchSize = 100] 67.58% 15.08% 98.90% 0.07%
Multi-Layer Perceptron: MLP (100, 1)[BatchSize = 5] 45.04% 17.57% 98.92% 0.10%
Multi-Layer Perceptron: MLP (100, 50, 1)[BatchSize = 100] 50.02% 20.05% 98.95% 0.10%
Multi-Layer Perceptron: MLP (52, 30, 1)[BatchSize = 5] 64.54% 18.83% 98.95% 0.09%
Multi-Layer Perceptron: MLP (100, 80, 60, 40, 20, 10, 1)[BatchSize = 5] 66.05% 12.04% 98.97% 0.07%
Multi-Layer Perceptron: MLP (100, 50, 1)[BatchSize = 5] 50.02% 20.05% 98.99% 0.10%
Multi-Layer Perceptron: MLP (100, 80, 60, 40, 20, 10, 1)[BatchSize = 100] 70.07% 0.00% 98.99% 0.10%
Average 58.70% 98.90%
V) Deep Learning (Multiple Layers): Different Epochs
Multi-Layer Perceptron: MLP (100, 50, 1)[BatchSize = 100],[Epoch = 50] – – 99.24% 0.10%
Multi-Layer Perceptron: MLP (100, 50, 1)[BatchSize = 100],[Epoch = 100] – – 99.28% 0.08%
Multi-Layer Perceptron: MLP (100, 50, 1)[BatchSize = 100],[Epoch = 400] – – 99.29% 0.06%
Multi-Layer Perceptron: MLP (100, 50, 1)[BatchSize = 100],[Epoch = 200] – – 99.33% 0.06%
Average – – 99.28%
B. RQ. 2: Achieving 100% Accuracy?

The research team tested different classifiers with different tuning parameters with the goal of achieving a mean accuracy of 100%. It turned out that achieving such high accuracy in model fitting and prediction was infeasible. While some of the classifiers demonstrated very high accuracy, and thus promising results, it seems that building a model that reduces the false positive and false negative ratios to zero is very difficult; hence, there are some penalties for missing such cases when detecting zero-day malware. However, it is also possible that building a perfect model with 100% accuracy would imply that the model is overfitted and would perform poorly when classifying unseen data. To avoid such an overfitting problem, we employed a 10-fold cross-validation to optimize the models, and that might explain why it was infeasible to build a perfect model with zero false positive and false negative values.
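The paper reports accuracy only; as a hedged illustration, the false positives and false negatives behind such accuracy figures could be counted from out-of-fold predictions with a confusion matrix (placeholder data again stand in for the malware dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the malware dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=50))

# Out-of-fold predictions from a 10-fold split, then count the errors.
y_pred = cross_val_predict(pipe, X, y, cv=10)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
```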
C. RQ. 3: The Best Classifier?

As Table III shows, random forest was the best classifier, with an outstanding average accuracy of 99.51%, followed by the decision tree at 99.24%. The simple neural networks with a single hidden layer also performed very well, achieving an average accuracy of 99.21%. The number of neurons in the single hidden layer seems to have only a slight impact on accuracy. For instance, an MLP model with one hidden layer of 52 neurons achieved 99.14% accuracy, whereas increasing the number of neurons to 100, 200, and 400 increased the accuracy to 99.21%, 99.24%, and 99.25%, respectively. Since there are some computational costs associated with the number of neurons in the layer, and given the slight improvement in observed accuracy, the question is whether a more complex model is worth building or whether a simpler model with slightly lower accuracy would be sufficient for the prediction. The choice in this trade-off depends entirely on the application domain. As a special case, detecting zero-day malware is an important and critical task, and thus increasing the accuracy as much as possible is indeed needed, regardless of the cost.
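A sketch of the single-hidden-layer width sweep described above, with the widths taken from Table III (the data and pipeline are the placeholder setup from the earlier sketch, not the paper's code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the malware dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Sweep the width of the single hidden layer, as in Table III.
for width in (52, 100, 200, 400):
    pipe = make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(width,)))
    scores = cross_val_score(pipe, X, y, cv=10, scoring="accuracy")
    print(f"MLP({width}): {scores.mean():.4f}")
```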
D. RQ. 4: The Influence of Batch Size in Deep Learning?

The authors controlled the batch size parameter for the deep learning classifiers with multiple layers. According to our observations, the smaller the batch size is, the better fitted the model will be. For instance, a neural network with two hidden layers of 30 and 1 neurons achieved accuracies of 98.73% and 98.88% with batch sizes of 100 and 5, respectively. The improvement seems to be very small. On the other hand, training with a smaller batch size appeared to take more computation time and is thus more expensive than training a model with a larger batch size.
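In scikit-learn terms, the batch size comparison might look like the following sketch; whether the paper used scikit-learn or another framework for this particular experiment is not stated:

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers (30 and 1 neurons), trained with mini-batches of
# 100 versus 5 samples; batch_size='auto' (the table's "Auto") uses
# min(200, n_samples).
mlp_batch_100 = MLPClassifier(hidden_layer_sizes=(30, 1), batch_size=100)
mlp_batch_5 = MLPClassifier(hidden_layer_sizes=(30, 1), batch_size=5)
```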
It was also observed that for deep learning-based approaches with multiple layers, the larger and deeper the model is, the better fitted the classification will be. For instance, a model with two layers of 30 and 1 neurons and a batch size of 100 achieved an accuracy of 98.73%, whereas a deeper model with seven layers of 100, 80, 60, 40, 20, 10, and 1 neurons and the same batch size of 100 achieved an accuracy of 98.99%. However, the improvement is very small.

E. RQ. 5: The Influence of Epochs in Deep Learning?

The authors performed a systematic analysis of the influence of the number of iterations (epochs) needed to train the classifiers. It was observed that the greater the number of epochs is, the more accurate the model will be. For instance, an MLP model (100, 50, 1) with Epochs = 50 achieved 99.24% accuracy, whereas increasing the epochs to 200 enhanced the accuracy only marginally, to 99.33%. This observation might indicate that a smaller number of epochs is sufficient to learn the key and significant features of the data, and thus, by adding more rounds of training, the model will not learn anything further (i.e., all features are already learned). As an example, Figure 2 illustrates the improvement of accuracy over epochs for the model MLP(100, 50, 1)[BatchSize = 100][Epoch = 200].

Fig. 2. Accuracy vs. Epochs.
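If these epoch experiments were run in scikit-learn (the paper does not name the framework), the number of training epochs would be capped via max_iter, e.g.:

```python
from sklearn.neural_network import MLPClassifier

# For the stochastic solvers ('adam', 'sgd'), max_iter is the number
# of epochs, i.e., full passes over the training data.
mlp_200_epochs = MLPClassifier(hidden_layer_sizes=(100, 50, 1),
                               batch_size=100, max_iter=200)
```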
VII. DISCUSSION

A. Standardization Is Important for Classification

According to our results, standardization is critical for classification. The primary reason might be the computational expense involved in dealing with large numbers, and thus with higher standard deviations. Some of these classifiers utilize a distance metric (e.g., Euclidean distance), in which the square root of the sum of the squared differences between observations is calculated to cluster the data items. As a result, when larger values are provided as data, the demands for accommodating such expensive computations increase; moreover, a larger standard deviation will affect the accuracy of the prediction.
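A small numeric illustration (ours, not the paper's) of why unscaled features distort a Euclidean distance metric: the large-magnitude feature dominates the metric until the data are standardized:

```python
import numpy as np

# Two samples: feature 1 is on a scale of thousands, feature 2 in [0, 1].
a = np.array([1000.0, 0.1])
b = np.array([1200.0, 0.9])

# The raw Euclidean distance is dominated by the large-scale feature;
# feature 2's difference (0.8) contributes almost nothing.
print(np.linalg.norm(a - b))  # ~200.0, essentially |1000 - 1200|
```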
B. Feature Reductions and Parameters Tuning

The authors tuned several parameters of the classifiers. Furthermore, they studied several machine/deep learning algorithms. These learners perform the task of classification differently. For instance, some utilize hyperplanes to separate the classes (e.g., SVM), whereas others use ensemble learning and take majority votes to decide (e.g., random forest). Moreover, some of these techniques apply feature reduction and then tune the parameters and build a model, whereas other, more advanced algorithms try to take all features into account and then, through deeper analysis, adjust their contributions and weights in the final model. A potential drawback of utilizing all features in the computation is the overfitting problem: the model may suffer from being tightly coupled to the seen data and thus be unable to perform well on unseen data. It may also introduce noisy features into the computations. On the other hand, reducing features may cause the problem of missing important relationships between parameters. The choice of feature reduction depends on the type of dataset; it is an important decision and should be handled with additional care.

C. Conventional vs. Deep Learning Classifiers

The authors expected to observe much better performance from the deep learning-based algorithms. While these deep learners performed very well, surprisingly, some of the conventional machine learning classifiers performed comparably or even better. Given the lower cost of training associated with the conventional machine learning algorithms, and at the same time the considerably greater cost of training deep classifiers, the conventional machine learning algorithms might be an even better choice than the deep learning-based algorithms. The deep learning-based classifiers demonstrate a consistent improvement achieved by building larger models and performing additional training. However, a simple random forest algorithm still outperforms even larger deep learning-based classifiers with additional training. For instance, the contrast between the performance demonstrated by random forest (i.e., 99.51%) and the best performance achieved by deep learning (i.e., 99.33% after 200 epochs with three hidden layers) is remarkable.
D. Deep or Deeper Classifiers?

According to our results, larger and deeper classifiers tend to perform better and build a more accurate model. However, the improvement does not seem to be significant. A simpler deep learning classifier (e.g., MLP(30, 1) with batch size = 100 and accuracy of 98.73%) might perform very similarly to a deeper and larger classifier (e.g., MLP(100, 80, 60, 40, 20, 10, 1) with batch size = 100 and accuracy of 98.99%). Hence, the choice of the depth of deep classifiers depends on the desired level of accuracy.

VIII. CONCLUSIONS AND FUTURE WORK

This paper empirically explored whether machine and deep learning classifiers are effective in detecting zero-day malware. Addressing such a question is important from a security perspective because zero-day malware are unknown applications, and thus there might not be any malicious signature similar to their patterns. We empirically compared a good number of well-known conventional machine learning and deep learning classifiers and observed that some of the conventional machine learning algorithms (e.g., random forests) perform very well in comparison with their deep learning-based counterparts. This result implies that some of the conventional and deep learning-based approaches are good classifiers for detecting zero-day malware. However, even though they achieve very high accuracy (e.g., 99.51%), these algorithms never achieve 100% accuracy, and thus these classifiers might slightly misclassify some zero-day malware.

This paper focused on measuring the accuracy of classifiers using a 10-fold cross-validation. It is important to carry out additional experiments and measure precision, recall, accuracy, and F1 measures all together, along with the ROC measure, for these classifiers, and to capture the exact values of false positives and false negatives. It is also important to replicate the experiments reported in this paper with other datasets and perform a meta-analysis [18] to gain better insight into the machine learning algorithms and their classification performance. Furthermore, given the outstanding performance of random forest, it would be interesting to observe whether ensemble-based deep learning classifiers perform better than other classifiers. It is also an interesting question to investigate whether evidence theory [19], uncertainty reasoning [20], or control-theoretical approaches and decision-based processes [21] can be utilized in accordance with learning algorithms to detect zero-day vulnerabilities.

ACKNOWLEDGMENT

This work is supported in part by the National Science Foundation (NSF) under grants 1821560 and 1723765.

REFERENCES

[1] "YARA - the pattern matching swiss knife for malware researchers," https://virustotal.github.io/yara/, accessed 2019.
[2] L. Xie, X. Zhang, J.-P. Seifert, and S. Zhu, "pBMDS: A behavior-based malware detection system for cellphone devices," in ACM Conference on Wireless Network Security, ser. WiSec '10, 2010, pp. 37–48.
[3] Y. Ye, T. Li, D. Adjeroh, and S. S. Iyengar, "A survey on malware detection using data mining techniques," ACM Comput. Surv., vol. 50, no. 3, pp. 41:1–41:40, Jun. 2017.
[4] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, "Learning and classification of malware behavior," in Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2008, pp. 108–125.
[5] D. Gavriluţ, M. Cimpoeşu, D. Anton, and L. Ciortuz, "Malware detection using machine learning," in International Multi-conference on Computer Science and Information Technology (IMCSIT), 2009.
[6] W. Hardy, L. Chen, S. Hou, Y. Ye, and X. Li, "DL4MD: A deep learning framework for intelligent malware detection," in Int'l Conf. Data Mining (DMIN'16), 2016.
[7] S. Siami-Namini, N. Tavakoli, and A. S. Namin, "A comparison of ARIMA and LSTM in forecasting time series," in International Conference on Machine Learning and Applications, ICMLA, Orlando, FL, USA, 2018, pp. 1394–1401.
[8] N. Tavakoli, "Modeling genome data using bidirectional LSTM," in Annual Computer Software and Applications Conference, COMPSAC, Milwaukee, WI, USA, 2019, pp. 183–188.
[9] M. Chatterjee and A. S. Namin, "Detecting phishing websites through deep reinforcement learning," in Annual Computer Software and Applications Conference, COMPSAC, WI, USA, 2019, pp. 227–232.
[10] L. Bilge and T. Dumitras, "Before we knew it: An empirical study of zero-day attacks in the real world," in ACM Conference on Computer and Communications Security, Oct. 2012, pp. 833–844.
[11] M. G. Miller, "Are we protected yet? Developing a machine learning detection system to combat zero-day malware attacks," Ph.D. dissertation, 2018.
[12] V. Sharma, J. Kim, S. Kwon, I. You, K. Lee, and K. Yim, "A framework for mitigating zero-day attacks in IoT," CoRR, vol. abs/1804.05549, 2018.
[13] M. Alazab, S. Venkatraman, P. Watters, and M. Alazab, "Zero-day malware detection based on supervised learning algorithms of API call signatures," in Ninth Australasian Data Mining Conference - Volume 121, ser. AusDM '11, 2011, pp. 171–182.
[14] P. M. Comar, L. Liu, S. Saha, P. Tan, and A. Nucci, "Combining supervised and unsupervised learning for zero-day malware detection," in 2013 Proceedings IEEE INFOCOM, Apr. 2013, pp. 2022–2030.
[15] Q. Zhou and D. Pezaros, "Evaluation of machine learning classifiers for zero-day intrusion detection - an analysis on CIC-AWS-2018 dataset," CoRR, vol. abs/1905.03685, 2019.
[16] L. Xiao, X. Wan, X. Lu, Y. Zhang, and D. Wu, "IoT security techniques based on machine learning: How do IoT devices use AI to enhance security?" IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 41–49, Sep. 2018.
[17] "Malware detection - make your own malware security system, in association with meraz'18 malware security partner Max Secure Software," https://www.kaggle.com/c/malware-detection, accessed 2019.
[18] S. Kakarla, S. Momotaz, and A. S. Namin, "An evaluation of mutation and data-flow testing: A meta-analysis," in International Conference on Software Testing, Verification and Validation, ICST, Berlin, Germany, Workshop Proceedings, 2011, pp. 366–375.
[19] M. Chatterjee, A. S. Namin, and P. Datta, "Evidence fusion for malicious bot detection in IoT," in International Conference on Big Data, WA, USA, 2018, pp. 4545–4548.
[20] S. Sartoli and A. S. Namin, "Adaptive reasoning in the presence of imperfect security requirements," in Annual Computer Software and Applications Conference, COMPSAC, GA, USA, 2016, pp. 498–499.
[21] J. Zheng and A. S. Namin, "A Markov decision process to determine optimal policies in moving target," in ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 2018, pp. 2321–2323.