
On Ensemble Learning

Mark Stamp∗, Aniket Chandak‡, Gavin Wong§, Allen Ye¶

arXiv:2103.12521v1 [cs.CR] 7 Mar 2021

Abstract
In this paper, we consider ensemble classifiers, that is, machine learning based classifiers that
utilize a combination of scoring functions. We provide a framework for categorizing such classifiers,
and we outline several ensemble techniques, discussing how each fits into our framework. From this
general introduction, we then pivot to the topic of ensemble learning within the context of malware
analysis. We present a brief survey of some of the ensemble techniques that have been used in malware
(and related) research. We conclude with an extensive set of experiments, where we apply ensemble
techniques to a large and challenging malware dataset. While many of these ensemble techniques
have appeared in the malware literature, previously there has been no way to directly compare results
such as these, as different datasets and different measures of success are typically used. Our common
framework and empirical results are an effort to bring some sense of order to the chaos that is evident
in the evolving field of ensemble learning—both within the narrow confines of the malware analysis
problem, and in the larger realm of machine learning in general.

1 Introduction
In ensemble learning, multiple learning algorithms are combined, with the goal of improved accuracy
as compared to the individual algorithms. Ensemble techniques are widely used, and as a testament
to their strength, ensembles have won numerous machine learning contests in recent years, including
the KDD Cup [15], the Kaggle competition [14], and the Netflix prize [26].
Many such ensembles resemble Frankenstein’s monster [33], in the sense that they are an ag-
glomeration of disparate components, with some of the components being of questionable value—an
“everything and the kitchen sink” approach clearly prevails. This effect can be clearly observed in
the aforementioned machine learning contests, where there is little (if any) incentive to make systems
that are efficient or practical, as accuracy is typically the only criterion for success. In the case of the
Netflix prize, the winning team was awarded $1,000,000, yet Netflix never implemented the winning
scheme, since the improvements in accuracy “did not seem to justify the engineering effort needed to
bring them into a production environment” [3]. In real-world systems, practicality and efficiency are
necessarily crucial factors.
In this paper, we provide a straightforward framework for categorizing ensemble techniques. We
then consider specific (and relatively simple) examples of various categories of such ensembles, and
we show how these fit into our framework. For various examples of ensembles, we also provide
experimental results, based on a large and diverse malware dataset.
While many of the techniques that we consider have previously appeared in the malware literature,
we are not aware of any comparable study focused on the effectiveness of various ensembles using
a common dataset and common measures of success. While we believe that these examples are
interesting in their own right, they also provide a basis for discussing various tradeoffs between
measures of accuracy and practical considerations.
The remainder of this paper is organized as follows. In Section 2 we discuss ensemble classifiers,
including our framework for categorizing such classifiers. Section 3 contains our experimental results.
This section also includes a discussion of our dataset, scoring metrics, software used, and so on.
Finally, Section 4 concludes the paper and includes suggestions for future work.

2 Ensemble Classifiers
In this section, we first give a selective survey of some examples of malware (and closely related)
research involving ensemble learning. Then we provide a framework for discussing ensemble classifiers
in general.
[email protected]
[email protected]
§ [email protected]
[email protected]
Department of Computer Science, San Jose State University, San Jose, California
2.1 Examples of Related Work
The paper [18] discusses various ways to combine classifiers and provides a theoretical framework for
such combinations. The focus is on straightforward combinations, such as a maximum, sum, product,
majority vote, and so on. The work in [18] has clearly been influential, but it seems somewhat dated,
given the wide variety of ensemble methods that are used today.
The book [20] presents the topic of ensemble learning from a similar perspective as [18] but in
much more detail. Perhaps not surprisingly, the more recent book [62] seems to have a somewhat
more modern perspective with respect to ensemble methods, but retains the theoretical flavor of [20]
and [18]. The brief blog at [35] provides a highly readable (if highly selective) summary of some of
the topics covered in the books [20] and [62].
Here, we take an approach that is, in some sense, more concrete than that in [18, 20, 62]. Our
objective is to provide a relatively straightforward framework for categorizing and discussing ensemble
techniques. We then use this framework as a frame of reference for experimental results based on a
variety of ensemble methods.
Table 1 provides a summary of several research papers where ensemble techniques have been
applied to security-related problems. The emphasis here is on malware, but we have also included a
few closely related topics. In any case, this represents a small sample of the many papers that have
been published, and is only intended to provide an indication as to the types and variety of ensemble
strategies that have been considered to date. On this list, we see examples of ensemble methods based
on bagging, boosting, and stacking, as discussed below in Section 2.3.

Table 1: Security research papers using ensemble classifiers

Authors Application Features Ensemble


Alazab et al. [2] Detection API calls Neural networks
Comar et al. [8] Detection Network traffic Random forest
Dimjaševic et al. [9] Android System calls RF and SVM
Guo et al. [10] Detection API calls BKS
Idrees et al. [12] Android Permissions, intents RF and others
Jain & Meena [13] Detection Byte 𝑛-grams AdaBoost
Khan et al. [17] Detection Network based Boosting
Kong & Yan [19] Classification Function call graph Boosting
Morales et al. [24] Android Permissions Several
Narouei et al. [25] Detection DLL dependency Random forest
Shahzad et al. [31] Detection Opcodes Voting
Sheen et al. [32] Various Detection efficiency Pruning
Singh et al. [34] Detection Opcodes SVM
Smutz & Stavrou [36] Malicious PDF Metadata Random forest
Toolan & Carthy [40] Phishing Various C5.0, boosting
Ye et al. [58] Detection API calls, strings SVM, bagging
Ye et al. [59] Categorization Opcodes Clustering
Yerima et al. [60] Zero day 179 features RF, regression
Zhang et al. [61] Detection 𝑛-grams Dempster-Shafer

2.2 A Framework for Ensemble Classifiers


In this section, we consider various means of constructing ensemble classifiers, as viewed from a
high-level perspective. We then provide an equally high-level framework that we find useful in our
subsequent discussion of ensemble classifiers in Sections 2.3 and, especially, in Section 2.4.
We consider ensemble learners that are based on combinations of scoring functions. In the general
case, we assume the scoring functions are real valued, but the more restricted case of zero-one valued
“scoring” functions (i.e., classifiers) easily fits into our framework. We place no additional restrictions
on the scoring functions and, in particular, they do not necessarily represent “learning” algorithms,
per se. Hence, we are dealing with ensemble methods broadly speaking, rather than ensemble learners
in a strict sense. We assume that the ensemble method itself—as opposed to the scoring functions
that comprise the ensemble—is for classification, and hence ensemble functions are zero-one valued.

Let 𝜔1 , 𝜔2 , . . . , 𝜔𝑛 be training samples, and let 𝑣𝑖 be a feature vector of length 𝑚, where the
features that comprise 𝑣𝑖 are extracted from sample 𝜔𝑖 . We collect the feature vectors for all 𝑛
training samples into an 𝑚 × 𝑛 matrix that we denote as
𝑉 = ( 𝑣1 𝑣2 · · · 𝑣𝑛 ) (1)

where each 𝑣𝑖 is a column of the matrix 𝑉 . Note that each row of 𝑉 corresponds to a specific feature
type, while column 𝑖 of 𝑉 corresponds to the features extracted from the training sample 𝜔𝑖 .
Let 𝑆 : R𝑚 → R be a scoring function. Such a scoring function will be determined based on
training data, where this training data is given by a feature matrix 𝑉 , as in equation (1). A scoring
function 𝑆 will generally also depend on a set of 𝑘 parameters that we denote as
Λ = ( 𝜆1 𝜆2 · · · 𝜆𝑘 ) (2)

The score generated by the scoring function 𝑆 when applied to sample 𝑥 is given by

𝑆(𝑥; 𝑉, Λ)

where we have explicitly included the dependence on the training data 𝑉 and the function parame-
ters Λ.
For any scoring function 𝑆, there is a corresponding classification function that we denote as 𝑆̂ :
R𝑚 → {0, 1}. That is, once we determine a threshold to apply to the scoring function 𝑆, it provides a
binary classification function that we denote as 𝑆̂. As with 𝑆, we explicitly indicate the dependence
on training data 𝑉 and the function parameters Λ by writing

𝑆̂(𝑥; 𝑉, Λ).

For example, each training sample 𝜔𝑖 could be a malware executable file, where all of the 𝜔𝑖
belong to the same malware family. Then an example of an extracted feature 𝑣𝑖 would be the opcode
histogram, that is, the relative frequencies of the mnemonic opcodes that are obtained when 𝜔𝑖 is
disassembled. The scoring function 𝑆 could, for example, be based on a hidden Markov model that
is trained on the feature matrix 𝑉 as given in equation (1), with the parameters Λ in equation (2)
being the initial values that are selected when training the HMM.
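As a concrete sketch of such a feature vector, the opcode histogram can be computed as below. The opcode list and vocabulary here are hypothetical stand-ins for actual disassembly output:

```python
from collections import Counter

def opcode_histogram(opcodes, vocabulary):
    """Relative frequency of each mnemonic opcode over a fixed vocabulary."""
    counts = Counter(opcodes)
    return [counts[op] / len(opcodes) for op in vocabulary]

# Hypothetical disassembly output for one sample
opcodes = ["mov", "push", "mov", "call", "mov", "pop", "call", "mov"]
vocab = ["mov", "push", "pop", "call"]
print(opcode_histogram(opcodes, vocab))  # [0.5, 0.125, 0.125, 0.25]
```

Each sample's histogram forms one column 𝑣𝑖 of the feature matrix 𝑉.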
In its most general form, an ensemble method for a binary classification problem can be viewed
as a function 𝐹 : Rℓ → {0, 1} of the form
𝐹(𝑆1(𝑥; 𝑉1, Λ1), 𝑆2(𝑥; 𝑉2, Λ2), . . . , 𝑆ℓ(𝑥; 𝑉ℓ, Λℓ)) (3)

That is, the ensemble method defined by the function 𝐹 produces a classification based on the
scores 𝑆1 , 𝑆2 , . . . , 𝑆ℓ , where scoring function 𝑆𝑖 is trained using the data 𝑉𝑖 and parameters Λ𝑖 .

2.3 Classifying Ensemble Classifiers


From a high level perspective, ensemble classifiers can be categorized as bagging, boosting, stacking,
or some combination thereof [20, 35, 62]. In this section, we briefly introduce each of these general
classes of ensemble methods and give their generic formulation in terms of equation (3).

2.3.1 Bagging
In bootstrap aggregation (i.e., bagging), different subsets of the data or features (or both) are used
to generate different scores. The results are then combined in some way, such as a sum of the scores,
or a majority vote of the corresponding classifications. For bagging we assume that the same scoring
method is used for all scores in the ensemble. For example, bagging is used when generating a random
forest, where each individual scoring function is based on a decision tree structure. One benefit of
bagging is that it reduces overfitting, which is a particular problem for decision trees.
For bagging, the general equation (3) is restricted to
𝐹(𝑆(𝑥; 𝑉1, Λ), 𝑆(𝑥; 𝑉2, Λ), . . . , 𝑆(𝑥; 𝑉ℓ, Λ)) (4)

That is, in bagging, each scoring function is essentially the same, but each is trained on a different
feature set. For example, suppose that we collect all available feature vectors into a matrix 𝑉 as in
equation (1). Then bagging based on subsets of samples would correspond to generating 𝑉𝑖 by deleting
a subset of the columns of 𝑉 . On the other hand, bagging based on features would correspond to
generating 𝑉𝑖 by deleting a subset of the rows of 𝑉 . Of course, we can easily extend this to bagging
based on both the data and features simultaneously, as in a random forest. In Section 2.4, we discuss
specific examples of bagging.
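A minimal sketch of this subsetting, with a small random matrix standing in for the feature matrix 𝑉 of equation (1):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6                  # m feature types, n training samples
V = rng.random((m, n))       # columns are per-sample feature vectors

# Bagging over samples: keep a random subset of the columns of V
cols = rng.choice(n, size=4, replace=False)
V_i = V[:, cols]             # shape (4, 4): fewer samples, all features

# Bagging over features: keep a random subset of the rows of V
rows = rng.choice(m, size=2, replace=False)
V_j = V[rows, :]             # shape (2, 6): all samples, fewer features

print(V_i.shape, V_j.shape)  # (4, 4) (2, 6)
```

Training the same scoring function on several such 𝑉𝑖 yields the component scores in equation (4).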

2.3.2 Boosting
Boosting is a process whereby distinct classifiers are combined to produce a stronger classifier. Gen-
erally, boosting deals with weak classifiers that are combined in an adaptive or iterative manner so as
to improve the overall classifier. We restrict our definition of boosting to cases where the classifiers
are closely related, in the sense that they differ only in terms of parameters. From this perspective,
boosting can be viewed as “bagging” based on classifiers, rather than data or features. That is, all of
the scoring functions are reparameterized versions of the same scoring technique. Under this definition
of boosting, the general equation (3) becomes
𝐹(𝑆(𝑥; 𝑉, Λ1), 𝑆(𝑥; 𝑉, Λ2), . . . , 𝑆(𝑥; 𝑉, Λℓ)) (5)

That is, the scoring functions differ only by re-parameterization, while the scoring data and features
do not change.
Below, in Section 2.4, we discuss specific examples of boosting; in particular, we discuss the most
popular method of boosting, AdaBoost. In addition, we show that some other popular techniques fit
our definition of boosting.

2.3.3 Stacking
Stacking is an ensemble method that combines disparate scores using a meta-classifier [35]. In this
generic form, stacking is defined by the general case in equation (3), where the scoring functions can
be (and typically are) significantly different. Note that from this perspective, stacking is easily seen
to be a generalization of both bagging and boosting.
Because stacking generalizes both bagging and boosting, it is not surprising that stacking based
ensemble methods can outperform bagging and boosting methods, as evidenced by recent machine
learning competitions, including the KDD Cup [15], the Kaggle competition [14], as well as the
infamous Netflix prize [26]. However, this is not the end of the story, as efficiency and practicality are
often ignored in such competitions, whereas in practice, it is virtually always necessary to consider
such issues. Of course, the appropriate tradeoffs will depend on the specifics of the problem at hand.
Our empirical results in Section 3 provide some insights into these tradeoff issues within the malware
analysis domain.
In the next section, we discuss concrete examples of bagging, boosting, and stacking techniques.
Then in Section 3 we present our experimental results, which include selected bagging, boosting, and
stacking architectures.

2.4 Ensemble Classifier Examples


Here, we consider a variety of ensemble methods and discuss how each fits into the general framework
presented above. We begin with a few fairly generic examples, and then discuss several more specific
examples.

2.4.1 Maximum
In this case, we have
𝐹(𝑆1(𝑥; 𝑉1, Λ1), 𝑆2(𝑥; 𝑉2, Λ2), . . . , 𝑆ℓ(𝑥; 𝑉ℓ, Λℓ)) = max𝑖{𝑆𝑖(𝑥; 𝑉𝑖, Λ𝑖)} (6)

2.4.2 Averaging
Averaging is defined by

𝐹(𝑆1(𝑥; 𝑉1, Λ1), 𝑆2(𝑥; 𝑉2, Λ2), . . . , 𝑆ℓ(𝑥; 𝑉ℓ, Λℓ)) = (1/ℓ) ∑𝑖 𝑆𝑖(𝑥; 𝑉𝑖, Λ𝑖) (7)

2.4.3 Voting
Voting could be used as a form of boosting, provided that no bagging is involved (i.e., the same data
and features are used in each case). Voting is also applicable to stacking, and is generally applied
in such a mode, or at least with significant diversity in the scoring functions, since we want limited
correlation when voting.

In the case of stacking, a simple majority vote is of the form
𝐹(𝑆̂1(𝑥; 𝑉1, Λ1), 𝑆̂2(𝑥; 𝑉2, Λ2), . . . , 𝑆̂ℓ(𝑥; 𝑉ℓ, Λℓ)) = maj(𝑆̂1(𝑥; 𝑉1, Λ1), 𝑆̂2(𝑥; 𝑉2, Λ2), . . . , 𝑆̂ℓ(𝑥; 𝑉ℓ, Λℓ))

where “maj” is the majority vote function. Note that the majority vote is well defined in this case,
provided that ℓ is odd—if ℓ is even, we can simply flip a coin in case of a tie.
As an aside, we note that it is easy to see why we want to avoid correlation when voting is used
as a combining function. Consider the following example from [47]. Suppose that we have the three
highly correlated scores
𝑆̂1 = ( 1 1 1 1 1 1 1 1 0 0 )
𝑆̂2 = ( 1 1 1 1 1 1 1 1 0 0 )
𝑆̂3 = ( 1 0 1 1 1 1 1 1 0 0 )

where each 1 indicates correct classification, and each 0 is an incorrect classification. Then, both 𝑆̂1
and 𝑆̂2 are 80% accurate, and 𝑆̂3 is 70% accurate. If we use a simple majority vote, then we obtain
the classifier

𝐶 = ( 1 1 1 1 1 1 1 1 0 0 )
which is 80% accurate. On the other hand, the less correlated classifiers
𝑆̂1′ = ( 1 1 1 1 1 1 1 1 0 0 )
𝑆̂2′ = ( 0 1 1 1 0 1 1 1 0 1 )
𝑆̂3′ = ( 1 0 0 0 1 0 1 1 1 1 )

are only 80%, 70% and 60% accurate, respectively, but the majority vote in this case gives us

𝐶′ = ( 1 1 1 1 1 1 1 1 0 1 )

which is 90% accurate.
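This example can be verified directly; the script below recomputes the majority vote and the accuracies for the less correlated classifiers:

```python
def majority_vote(*classifiers):
    """Position-wise majority vote over zero-one classification vectors."""
    return [int(sum(bits) > len(bits) / 2) for bits in zip(*classifiers)]

def accuracy(c):
    return sum(c) / len(c)          # here, 1 marks a correct classification

# The less correlated classifiers from the example above
S1 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
S2 = [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]
S3 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]

C = majority_vote(S1, S2, S3)
print(accuracy(S1), accuracy(S2), accuracy(S3), accuracy(C))  # 0.8 0.7 0.6 0.9
```

The ensemble is more accurate than any of its components, even though the components are individually weaker than the correlated set.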

2.4.4 ML-Based Combination


Recall that the most general formulation of an ensemble classifier is given in equation (3). In this
formulation, we can select the function 𝐹 based on a machine learning technique, which is applied
to the individual scores 𝑆(𝑥; 𝑉𝑖 , Λ𝑖 ). In the remainder of this section, we consider specific ensemble
examples involving machine learning techniques.

2.4.5 AdaBoost
Given a collection of (weak) classifiers 𝑐1 , 𝑐2 , . . . , 𝑐ℓ , AdaBoost is an iterative algorithm that generates
a series of (generally, stronger) classifiers, 𝐶1 , 𝐶2 , . . . , 𝐶𝑀 based on the classifiers 𝑐𝑖 . Each classifier is
determined from the previous classifier by the simple linear extension

𝐶𝑚 (𝑥) = 𝐶𝑚−1 (𝑥) + 𝛼𝑚 𝑐𝑖 (𝑥)

and the final classifier is given by 𝐶 = 𝐶𝑀 . Note that at each iteration, we include a previously
unused 𝑐𝑖 from the set of (weak) classifiers and determine a new weight 𝛼𝑖 . A greedy approach is
used when selecting 𝑐𝑖 , but it is not a hill climb, so that results might get worse at any step in the
AdaBoost process.
From this description, we see that the AdaBoost algorithm fits the form in equation (5), with
𝑆̂(𝑥; 𝑉, Λ𝑖) = 𝐶𝑖(𝑥), and

𝐹(𝑆̂(𝑥; 𝑉, Λ1), 𝑆̂(𝑥; 𝑉, Λ2), . . . , 𝑆̂(𝑥; 𝑉, Λ𝑀)) = 𝑆̂(𝑥; 𝑉, Λ𝑀) = 𝐶𝑀(𝑥)
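As an illustration of the iterative reweighting described above, the following is a minimal AdaBoost sketch using one-feature threshold stumps on toy data. This is not the exact algorithm or data used in our experiments, and the helper names are ours; unlike the description above, the sketch allows a stump to be re-selected:

```python
import numpy as np

def adaboost_stumps(X, y, M=10):
    """Minimal binary AdaBoost with threshold stumps; labels y in {-1, +1}."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))           # sample weights
    model = []                                  # (feature, thresh, sign, alpha)
    for _ in range(M):
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] <= t, 1, -1)
                    err = w[pred != y].sum()    # weighted error of this stump
                    if best is None or err < best[0]:
                        best = (err, j, t, s, pred)
        err, j, t, s, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this stump
        model.append((j, t, s, alpha))
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
    return model

def adaboost_predict(model, X):
    X = np.asarray(X, dtype=float)
    score = np.zeros(len(X))
    for j, t, s, alpha in model:
        score += alpha * s * np.where(X[:, j] <= t, 1, -1)
    return np.sign(score)

# Toy data that no single stump can classify, but three boosted stumps can
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
print(adaboost_predict(adaboost_stumps(X, y, M=3), X))
```

On this toy set, the best single stump misclassifies two of the six samples, while the weighted sum of three stumps classifies all six correctly.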

2.4.6 SVM as Meta-Classifier


It is natural to use an SVM as a meta-classifier to combine scores [38]. For example, in [34], an
SVM is used to generate a malware classifier based on several machine learning and statistical based
malware scores. In [34], it is shown that the resulting SVM classifier consistently outperforms any of
the component scores, and the differences are most pronounced in the most challenging cases.
The use of SVM in this meta-classifier mode can be viewed as a general stacking method. Thus,
this SVM technique is equivalent to equation (3), where the function 𝐹 is simply an SVM classifier
based on the component scores 𝑆𝑖 (𝑥; 𝑉𝑖 , Λ𝑖 ), for 𝑖 = 1, 2, . . . , ℓ.
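A minimal sketch of this kind of stacking, using scikit-learn's StackingClassifier on synthetic data with an SVM as the meta-classifier. This is not the setup of [34]; the component classifiers shown here are illustrative stand-ins for the scores 𝑆𝑖:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for malware feature vectors (not the paper's dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The component classifiers play the role of the scores S_i, while the
# SVM meta-classifier plays the role of the combining function F
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=SVC(),
)
stack.fit(X, y)
print(round(stack.score(X, y), 2))  # training accuracy of the stacked model
```

Internally, StackingClassifier trains the meta-SVM on cross-validated component predictions, which guards against the meta-classifier simply memorizing overfit component outputs.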

2.4.7 HMM with Random Restarts
A hidden Markov model can be viewed as a discrete hill climb technique [37, 38]. As with any hill
climb, when training an HMM we are only assured of a local maximum, and we can often significantly
improve our results by executing the hill climb multiple times with different initial values, selecting
the best of the resulting models. For example, in [51] it is shown that an HMM can be highly effective
for breaking classic substitution ciphers and, furthermore, by using a large number of random restarts,
we can significantly increase the success rate in the most difficult cases. The work in [51] is closely
related to that in [7], where such an approach is used to analyze the unsolved Zodiac 340 cipher.
From the perspective considered in this paper, an HMM with random restarts can be seen as a special
case of boosting. If we simply select the best model, then the “combining” function is particularly
simple, and is given by
𝐹(𝑆(𝑥; 𝑉, Λ1), 𝑆(𝑥; 𝑉, Λ2), . . . , 𝑆(𝑥; 𝑉, Λℓ)) = max𝑖{𝑆(𝑥; 𝑉, Λ𝑖)} (8)

Here, each scoring function is an HMM, where the trained models differ based only on different initial
values. We see that equation (8) is a special case of equation (6). However, the “max” in equation (8)
is the maximum over the HMM model scores, not the maximum over any particular set of input
values. That is, we select the highest scoring model and use it for scoring. Of course, we could use
other combining functions, such as an average or majority vote of the corresponding classifiers. In any
case, since there is a score associated with each model generated by an HMM, any such combining
function is well-defined.
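The restart-and-select pattern is independent of the underlying model. In the sketch below, a toy two-peak score surface and a simple hill climb stand in for HMM training; the function names and the score surface are ours, for illustration only:

```python
import random

def score(x):
    """Toy score surface: a local maximum at x = -2 (score 4) and a
    global maximum at x = 3 (score 9)."""
    return max(4 - (x + 2) ** 2, 9 - (x - 3) ** 2)

def train_model(seed, step=0.01, iters=2000):
    """Stand-in for one HMM training run: a hill climb from a random
    initial value (the analogue of training with initial parameters Λ_i)."""
    x = random.Random(seed).uniform(-5.0, 5.0)
    for _ in range(iters):
        if score(x + step) > score(x):
            x += step
        elif score(x - step) > score(x):
            x -= step
    return x, score(x)

def best_of_restarts(n_restarts, base_seed=0):
    """Train with several different initial values; keep the best model."""
    return max((train_model(base_seed + i) for i in range(n_restarts)),
               key=lambda ms: ms[1])

x_best, s_best = best_of_restarts(20)
print(round(x_best, 1), round(s_best, 1))  # lands near the global maximum
```

A single run that happens to start in the basin of the local maximum stalls at the lower score; selecting the maximum over many restarts recovers the global peak, which is the "max over model scores" of equation (8).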

2.4.8 Bagged Perceptron


Like a linear SVM, a perceptron will separate linearly separable data. However, unlike an SVM, a
perceptron will not necessarily produce the optimal separation, in the sense of maximizing the margin.
If we generate multiple perceptrons, each with different random initial weights, and then average
these models, the resulting classifier will tend to be nearer to optimal, in the sense of maximizing the
margin [21, 47]. That is, we construct a classifier

𝐹(𝑆(𝑥; 𝑉, Λ1), 𝑆(𝑥; 𝑉, Λ2), . . . , 𝑆(𝑥; 𝑉, Λℓ)) = (1/ℓ) ∑𝑖 𝑆(𝑥; 𝑉, Λ𝑖) (9)

where 𝑆 is a perceptron and each Λ𝑖 represents a set of initial values. We see that equation (9) is a
special case of the averaging example given in equation (7). Also, we note that in this sum, we are
averaging the perceptron models, not the classifications generated by the models.
Although this technique is sometimes referred to as “bagged” perceptrons [47], by our criteria, it
is a boosting scheme. That is, the “bagging” here is done with respect to parameters of the scoring
functions, which is our working definition of boosting.
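A minimal sketch of this model averaging, on toy linearly separable data (the perceptron implementation and the data are ours, for illustration). Note that we average the weight vectors, not the classifications, and since every converged perceptron satisfies 𝑦𝑖(𝑤 · 𝑥𝑖) > 0 on the training data, the averaged weights do as well:

```python
import numpy as np

def train_perceptron(X, y, rng, epochs=100, lr=0.1):
    """Perceptron for labels y in {-1, +1}; the last weight is a bias term."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append constant for the bias
    w = rng.normal(size=Xb.shape[1])            # Λ_i: random initial weights
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:              # misclassified: update
                w += lr * yi * xi
    return w

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 1.0], [2.0, 2.0]])
y = np.array([-1, -1, 1, 1])

# Average the models (weight vectors), not the classifications they produce
W = np.mean([train_perceptron(X, y, rng) for _ in range(25)], axis=0)
Xb = np.hstack([X, np.ones((len(X), 1))])
print(np.sign(Xb @ W))  # agrees with y on this separable toy set
```

Each individual run yields a different separating hyperplane depending on its random initialization; their average tends toward a more central separator.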

2.4.9 Bagged Hidden Markov Model


Like the HMM with random restarts example given above, in this case, we generate multiple HMMs.
However, here we leave the model parameters unchanged, and simply train each on a subset of the
data. We could then average the model scores (for example) as a way of combining the HMMs into
a single score, from which we can easily construct a classifier.

2.4.10 Bagged and Boosted Hidden Markov Model


Of course, we could combine the HMM with random restarts discussed in Section 2.4.7 with
the bagging approach discussed in the previous section. This process would yield an HMM-based
ensemble technique that combines both bagging and boosting.

3 Experiments and Results


In this section, we consider a variety of experiments that illustrate various ensemble techniques.
These experiments involve malware classification, based on a challenging dataset that includes a large
number of samples from a significant number of malware families.

3.1 Dataset and Features
Our dataset consists of samples from the 21 malware families listed in Table 2. These families represent
various types of malware, including Trojans, worms, backdoors, password stealers, so-called
VirTools, and so on.

Table 2: Type of each malware family

Index Family Type Index Family Type


1 Adload [41] Trojan Downloader 12 Renos [43] Trojan Downloader
2 Agent [42] Trojan 13 Rimecud [54] Worm
3 Allaple [52] Worm 14 Small [44] Trojan Downloader
4 BHO [45] Trojan 15 Toga [46] Trojan
5 Bifrose [4] Backdoor 16 VB [6] Backdoor
6 CeeInject [48] VirTool 17 VBinject [50] VirTool
7 Cycbot [5] Backdoor 18 Vobfus [55] Worm
8 FakeRean [53] Rogue 19 Vundo [56] Trojan Downloader
9 Hotbar [1] Adware 20 Winwebsec [22] Rogue
10 Injector [49] VirTool 21 Zbot [23] Password Stealer
11 OnLineGames [28] Password Stealer — — —

Each of the malware families in Table 2 is summarized below.


Adload downloads an executable file, stores it remotely, executes the file, and disables proxy set-
tings [41].
Agent downloads Trojans or other software from a remote server [42].
Allaple is a worm that can be used as part of a denial of service (DoS) attack [52].
BHO can perform a variety of actions, guided by an attacker [45].
Bifrose is a backdoor Trojan that enables a variety of attacks [4].
CeeInject uses advanced obfuscation to avoid being detected by antivirus software [48].
Cycbot connects to a remote server, exploits vulnerabilities, and spreads through backdoor ports [5].
FakeRean pretends to scan the system, notifies the user of supposed issues, and asks the user to
pay to clean the system [53].
Hotbar is adware that shows ads on webpages and installs additional adware [1].
Injector loads other processes to perform attacks on its behalf [49].
OnLineGames steals login information of online games and tracks user keystroke activity [28].
Renos downloads software that claims the system has spyware and asks for a payment to remove
the nonexistent spyware [43].
Rimecud is a sophisticated family of worms that perform a variety of activities and can spread
through instant messaging [54].
Small is a family of Trojans that downloads unwanted software. This downloaded software can
perform a variety of actions, such as a fake security application [44].
Toga is a Trojan that can perform a variety of actions of the attacker’s choice [46].
VB is a backdoor that enables an attacker to gain access to a computer [6].
VBinject is a generic description of malicious files that are obfuscated in a specific manner [50].
Vobfus is a worm that downloads malware and spreads through USB drives or other removable
devices [55].
Vundo displays pop-up ads and may download files. It uses advanced techniques to defeat detec-
tion [56].
Winwebsec displays alerts that ask the user for money to fix supposed issues [22].
Zbot is installed through email and shares a user’s personal information with attackers. In addition,
Zbot can disable a firewall [23].

From each available malware sample, we extract the first 1000 mnemonic opcodes using the reversing
tool Radare2 (also known as R2) [29]. We discard any malware executable that yields fewer than 1000
opcodes, as well as a number of executables that were found to be corrupted. The resulting opcode
sequences, each of length 1000, serve as the feature vectors for our machine learning experiments.
Table 3 gives the number of samples (per family) from which we successfully obtained opcode
feature vectors. Note that our dataset contains a total of 9725 samples from the 21 malware families
and that the dataset is highly imbalanced—the number of samples per family varies from a low of 129
to a high of nearly 1000.

Table 3: Number of samples per malware family

Index Family Samples Index Family Samples


1 Adload 162 12 Renos 532
2 Agent 184 13 Rimecud 153
3 Allaple 986 14 Small 180
4 BHO 332 15 Toga 406
5 Bifrose 156 16 VB 346
6 CeeInject 873 17 VBinject 937
7 Cycbot 597 18 Vobfus 929
8 FakeRean 553 19 Vundo 762
9 Hotbar 129 20 Winwebsec 837
10 Injector 158 21 Zbot 303
11 OnLineGames 210 Total 9725

3.2 Metrics
The metrics used to quantify the success of our experiments are accuracy, balanced accuracy, precision,
recall, and the F1 score. Accuracy is simply the ratio of correct classifications to the total number of
classifications. In contrast, the balanced accuracy is the average accuracy per family.
Precision, which is also known as the positive predictive value, is the number of true positives
divided by the sum of the true positives and false positives. That is, the precision is the ratio of
samples classified as positives that are actually positive to all samples that are classified as positive.
Recall, which is also known as the true positive rate or sensitivity, is computed by dividing the
number of true positives by the number of true positives plus the number of false negatives. That is,
the recall is the fraction of positive samples that are classified as such. The F1 score is computed as

F1 = 2 · (precision · recall) / (precision + recall),

which is the harmonic mean of the precision and recall.
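These metrics follow directly from the confusion-matrix counts; a small helper (ours, for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive,
    and false negative counts."""
    precision = tp / (tp + fp)          # positive predictive value
    recall = tp / (tp + fn)             # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For example, 80 true positives, 20 false positives, and 20 false
# negatives gives precision = recall = F1 = 0.8
print(precision_recall_f1(80, 20, 20))
```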

3.3 Software
The software packages used in our experiments include hmmlearn [11], XGBoost [57], Keras [16],
TensorFlow [39], and scikit-learn [30], as indicated in Table 4. In addition, we use NumPy [27] for
linear algebra and various tools available in the package scikit-learn (also known as sklearn) for
general data processing. These packages are all widely used in machine learning.

3.4 Overview of Experiments


For all of our experiments, we use opcode sequences of length 1000 as features. For CNNs, the
sequences are interpreted as images.
We consider three broad categories of experiments. First, we apply “standard” machine learning
techniques. These experiments serve as a baseline for comparison in our subsequent experiments.
Among other things, these standard experiments show that the malware classification problem that
we are dealing with is challenging.

Table 4: Software used in experiments

Technique Software
HMM hmmlearn
XGBoost XGBoost
AdaBoost sklearn
CNN Keras, TensorFlow
LSTM Keras, TensorFlow
Random Forest sklearn

We also conduct bagging and boosting experiments based on a subset of the techniques considered
in our baseline standard experiments. These results demonstrate that both bagging and boosting can
provide some improvement over our baseline techniques.
Finally, we consider a set of stacking experiments, where we restrict our attention to simple voting
schemes, all of which are based on architectures previously considered in this paper. Although these
are very basic stacking architectures, they clearly show the potential benefit of stacking multiple
techniques.

3.5 Standard Techniques


For our “standard” techniques, we test several machine learning methods that are typically used
individually. Specifically, we consider hidden Markov models (HMM), convolutional neural networks
(CNN), random forest, and long short-term memory (LSTM). The parameters that we have tested in
each of these cases are listed in Table 5, with those that gave the best results in boldface.

Table 5: Parameters for standard techniques

Technique Parameters Values tested


n_components [1,2,5,10]
HMM n_iter [50,100,200,300,500]
tol [0.01,0.5]
learning_rate [0.001,0.0001]
CNN batch_size [32,64,128]
epochs [50,75,100]
n_estimators [100,200,300,500,800]
min_samples_split [2,5,10,15,20]
Random Forest min_samples_leaf [1,2,5,10,15]
max_features [auto,sqrt,log2]
max_depth [30,40,50,60,70,80]
layers [1,3]
directional [uni-dir,bi-dir]
LSTM learning_rate [0.01]
batch_size [1,16,32]
epochs [20]

From Table 5, we note that a significant number of parameter combinations were tested in each
case. For example, in the case of our random forest model, we tested
5³ · 3 · 6 = 2250
different combinations of parameters.
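This count can be checked directly from the value lists in Table 5:

```python
from itertools import product

# Random forest search grid, from Table 5
grid = {
    "n_estimators": [100, 200, 300, 500, 800],
    "min_samples_split": [2, 5, 10, 15, 20],
    "min_samples_leaf": [1, 2, 5, 10, 15],
    "max_features": ["auto", "sqrt", "log2"],
    "max_depth": [30, 40, 50, 60, 70, 80],
}
combos = list(product(*grid.values()))
print(len(combos))  # 2250
```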
The confusion matrices for all of the experiments in this section can be found in the Appendix in
Figure 2 (a) through Figure 2 (d). We present the results of all of these experiments—in terms of the
metrics discussed previously (i.e., accuracy, balanced accuracy, precision, recall, and F1 score)—in
Section 3.9, below.

3.6 Bagging Experiments
Recall from our discussion above that we use the term bagging to mean a multi-model approach where
the individual models are trained with the same technique and essentially the same parameters, but
on different subsets of the data or features. In contrast, we use boosting to refer to multi-model
cases where the data and features are essentially the same and the models are of the same type,
but with the model parameters varied.
AdaBoost and XGBoost serve as our representative examples of boosting; these are considered in
Section 3.7, below. Here, we consider bagging experiments (in the sense described in the previous
paragraph) involving each of the HMM, CNN, and LSTM architectures. The results of these three
bagging experiments—in the form of confusion matrices—are given in Figure 3 in the Appendix. In
terms of the metrics discussed above, the results of these experiments are summarized in Section 3.9,
below.
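Bagging in the sense used here can be sketched as training several copies of one model type, with the same parameters, on different bootstrap samples of the data, then combining them by majority vote. In this minimal sketch, decision trees and synthetic data stand in for the HMM/CNN/LSTM models and the real malware features; everything here is illustrative, not the authors' code.

```python
# Sketch of bagging: same model type and parameters, different data subsets,
# combined by majority vote. Decision trees are stand-in base models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = []
for _ in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap sample
    models.append(
        DecisionTreeClassifier(max_depth=5, random_state=0).fit(X[idx], y[idx]))

# Majority vote over the individual model predictions.
votes = np.stack([m.predict(X) for m in models])
ensemble_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print((ensemble_pred == y).mean())
```

The same pattern applies with bagged HMM, CNN, or LSTM models in place of the trees; only the base model changes.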

3.7 Boosting Experiments


As representative examples of boosting techniques, we consider AdaBoost and XGBoost. In each
case, we experiment with a variety of parameters, as listed in Table 6. The parameter selections that
yielded the best results are highlighted in boldface.

Table 6: Parameters for boosting techniques

Technique  Parameters     Values tested
AdaBoost   n_estimators   [100, 200, 300, 500, 800, 1000]
           learning_rate  [0.5, 1.0, 1.5, 2.0]
           algorithm      [SAMME, SAMME.R]
XGBoost    eta            [0.05, 0.1, 0.2, 0.3, 0.5]
           max_depth      [1, 2, 3, 4]
           objective      [multi:softprob, binary:logistic]
           steps          [1, 5, 10, 20, 50]

Confusion matrices for these two boosting experiments are given in Figure 4 in the Appendix.
The results of these experiments are summarized in Section 3.9, below, in terms of accuracy, balanced
accuracy, and so on.
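As an illustration of these boosting experiments, the following sketch fits scikit-learn's AdaBoostClassifier with parameter values drawn from Table 6. The data is synthetic, standing in for the malware dataset, and the particular values chosen here are illustrative rather than the boldfaced best selections.

```python
# Sketch (synthetic stand-in data): an AdaBoost run with Table 6 values.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5,
                         random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)
```

An XGBoost run is analogous, with `eta`, `max_depth`, and `objective` taking the place of the AdaBoost parameters.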

3.8 Voting Experiments


Since there exists an essentially unlimited number of possible stacking architectures, we have limited
our attention to one of the simplest, namely, voting. These results serve as a lower bound on the
results that can be obtained with stacking architectures.
We consider six different stacking architectures. These stacking experiments can be summarized
as follows.
CNN consists of the plain and bagged CNN models discussed above. The confusion matrix for this
experiment is given in Figure 5 (a).
LSTM consists of the plain and bagged LSTM models discussed above. The confusion matrix for
this experiment is given in Figure 5 (b).
Bagged neural networks combines our bagged CNN and bagged LSTM models. The confusion
matrix for this experiment is given in Figure 5 (c).
Classic techniques combines (via voting) all of the classic models considered above, namely, HMM,
bagged HMM, random forest, AdaBoost, and XGBoost. The confusion matrix for this experi-
ment is given in Figure 5 (d).
All neural networks consists of all of the CNN and LSTM models, bagged and plain. The confusion
matrix for this experiment is given in Figure 5 (e).
All models combines all of the classic and neural network models into one voting scheme. The
confusion matrix for this experiment is given in Figure 5 (f).
In the next section, we present the results for each of the voting experiments discussed in this
section in terms of our various metrics. These metrics enable us to directly compare all of our
experimental results.
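A simple hard-voting scheme of the sort described above can be sketched with scikit-learn's VotingClassifier. Here logistic regression and tree-based models stand in for the actual HMM, CNN, and LSTM components, and the data is synthetic; the point is only the combination mechanism.

```python
# Sketch of hard (majority) voting over heterogeneous classifiers,
# analogous to the "all models" experiment. Stand-in models and data.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    voting="hard")  # hard voting = simple majority vote over predicted labels
vote.fit(X, y)
acc = vote.score(X, y)
print(acc)
```

More elaborate stacking schemes would replace the majority vote with a trained meta-classifier over the component outputs.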

3.9 Discussion
Table 7 summarizes the results of all of the experiments discussed above, in terms of the following
metrics: accuracy, balanced accuracy, precision, recall, and F1 score. These metrics were introduced
in Section 3.1, above.

Table 7: Comparison of experimental results

Experiments  Case                    Accuracy  Balanced accuracy  Precision  Recall  F1 score
Standard     HMM                     0.6717    0.6336             0.7325     0.6717  0.6848
             CNN                     0.8211    0.7245             0.8364     0.8211  0.8104
             Random Forest           0.7549    0.6610             0.7545     0.7523  0.7448
             LSTM                    0.8410    0.7185             0.7543     0.7185  0.8145
Bagging      Bagged HMM              0.7168    0.6462             0.7484     0.7168  0.7165
             Bagged CNN              0.8910    0.8105             0.9032     0.8910  0.8838
             Bagged LSTM             0.8602    0.7754             0.8571     0.8602  0.8549
Boosting     AdaBoost                0.5378    0.4060             0.5231     0.5378  0.5113
             XGBoost                 0.7472    0.6636             0.7371     0.7472  0.7285
Voting       Classic                 0.8766    0.8079             0.8747     0.8766  0.8719
             CNN                     0.9260    0.8705             0.9321     0.9260  0.9231
             LSTM                    0.8560    0.7470             0.8511     0.8560  0.8408
             Bagged neural networks  0.9337    0.8816             0.9384     0.9337  0.9313
             All neural networks     0.9208    0.8613             0.9284     0.9208  0.9171
             All models              0.9188    0.8573             0.9249     0.9188  0.9154

In Table 7, the best result for each type of experiment is in boldface, with the best results overall
also being boxed. We see that a voting strategy based on all of the bagged neural network techniques
gives us the best result for each of the five statistics that we have computed.
Since our dataset is highly imbalanced, we consider the balanced accuracy as the best measure of
success. The balanced accuracy results in Table 7 are given in the form of a bar graph in Figure 1.
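The distinction matters because plain accuracy rewards always predicting the majority class, while balanced accuracy, the mean of the per-class recalls, does not. A small illustration with hypothetical labels:

```python
# Balanced accuracy vs. accuracy on an imbalanced toy example.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Eight samples of class 0, two of class 1; always predict the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)
print(acc)  # 0.8
print(bal)  # (1.0 + 0.0) / 2 = 0.5
```

On a highly imbalanced dataset such as ours, balanced accuracy therefore gives a more honest picture of per-family performance.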

[Bar graph: balanced accuracy for each experiment, grouped by experiment type (standard, bagging, boosting, voting). Rendering omitted.]

Figure 1: Balanced accuracy results

Note that the results in Figure 1 clearly show that stacking techniques are beneficial, as compared
to the corresponding “standard” techniques. Stacking not only yields the best results, but it dominates
in all categories. We note that five of the six stacking experiments perform better than any of the
standard, bagging, or boosting experiments. This is particularly noteworthy since we only considered
a simple stacking approach. As a result, our stacking experiments likely provide only a weak lower
bound on stacking in general, and more advanced stacking techniques may improve significantly on
the results that we have obtained.

4 Conclusion and Future Work


In this paper, we have attempted to impose some structure on the field of ensemble learning. We
showed that combination architectures can be classified as either bagging, boosting, or in the more
general case, stacking. We then provided experimental results involving a challenging malware dataset
to illustrate the potential benefits of ensemble architectures. Our results clearly show that ensembles
improve on standard techniques, with respect to our specific dataset. Of course, in principle, we
expect such combination architectures to outperform standard techniques, but it is instructive to
confirm this empirically, and to show that the improvement can be substantial. These results make
it clear that there is a reason why complex stacking architectures win machine learning competitions.
However, stacking models are not without potential pitfalls. As the architectures become more
involved, training can become impractical. Furthermore, scoring can also become prohibitively costly,
especially if large numbers of features are used in complex schemes involving extensive use of bagging
or boosting.
For future work, it would be useful to quantify the tradeoff between accuracy and model complexity.
While stacking will generally improve results, marginal improvements in accuracy that come at great
additional cost in training and scoring are unlikely to be of value in real-world applications. More
concretely, future work involving additional features would be very interesting, as it would allow for
a more thorough analysis of bagging, and it would enable us to draw firmer conclusions regarding the
relative merits of bagging and boosting. Of course, more complex classes of stacking techniques could
also be considered.

References
[1] Adware:win32/hotbar. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=Adware:Win32/Hotbar&threatId=6204.
[2] Mamoun Alazab, Sitalakshmi Venkatraman, Paul Watters, and Moutaz Alazab. Zero-day mal-
ware detection based on supervised learning algorithms of API call signatures. In Proceedings
of the Ninth Australasian Data Mining Conference, volume 121 of AusDM ’11, pages 171–182.
Australian Computer Society, 2011.
[3] Xavier Amatriain and Justin Basilico. Netflix recommendations: Beyond the 5 stars
(part 1). https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-
stars-part-1-55838468f429, 2012.
[4] Backdoor:win32/bifrose. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=Backdoor:Win32/Bifrose&threatId=-2147479537.
[5] Backdoor:win32/cycbot.g. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=Backdoor:Win32/Cycbot.G.
[6] Backdoor:win32/vb. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=Backdoor:Win32/VB&threatId=7275.
[7] Taylor Berg-Kirkpatrick and Dan Klein. Decipherment with a million random restarts. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP
2013, pages 874–878, 2013.
[8] Prakash Mandayam Comar, Lei Liu, Sabyasachi Saha, Pang-Ning Tan, and Antonio Nucci. Com-
bining supervised and unsupervised learning for zero-day malware detection. In 2013 Proceedings
IEEE INFOCOM, pages 2022–2030. IEEE, 2013.
[9] Marko Dimjaševic, Simone Atzeni, Ivo Ugrina, and Zvonimir Rakamaric. Android malware
detection based on system calls. Technical Report UUCS-15-003, School of Computing, University
of Utah, Salt Lake City, Utah, 2015.
[10] Shanqing Guo, Qixia Yuan, Fengbo Lin, Fengyu Wang, and Tao Ban. A malware detection algo-
rithm based on multi-view fusion. In International Conference on Neural Information Processing,
ICONIP 2010, pages 259–266. Springer, 2010.
[11] hmmlearn. https://hmmlearn.readthedocs.io/en/latest/.

[12] Fauzia Idrees, Muttukrishnan Rajarajan, Mauro Conti, Thomas M Chen, and Yogachandran
Rahulamathavan. Pindroid: A novel android malware detection system using ensemble learning
methods. Computers & Security, 68:36–46, 2017.
[13] Sachin Jain and Yogesh Kumar Meena. Byte level 𝑛-gram analysis for malware detection. In
Computer Networks and Intelligent Computing, pages 51–59. Springer, 2011.
[14] Kaggle. Welcome to Kaggle competitions. https://www.kaggle.com/competitions, 2018.
[15] KDD Cup of fresh air. https://biendata.com/competition/kdd_2018/, 2018.
[16] Keras: The Python deep learning API. https://keras.io/.
[17] Muhammad Salman Khan, Sana Siddiqui, Robert D McLeod, Ken Ferens, and Witold Kinsner.
Fractal based adaptive boosting algorithm for cognitive detection of computer malware. In 15th
International Conference on Cognitive Informatics & Cognitive Computing, ICCI*CC, pages 50–
59. IEEE, 2016.
[18] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. On combining classifiers.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, March 1998.
[19] Deguang Kong and Guanhua Yan. Discriminant malware distance learning on structural informa-
tion for automated malware classification. In Proceedings of the 19th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 1357–1365. ACM, 2013.
[20] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms.
Wiley, Hoboken, New Jersey, 2004. https://pdfs.semanticscholar.org/453c/
2b407c57d7512fdbe19fa1cefa08dd22614a.pdf.
[21] Marios Michailidis. Investigating machine learning methods in recommender systems. Thesis,
University College London, 2017.
[22] Microsoft malware protection center, winwebsec. https://www.microsoft.com/security/
portal/threat/encyclopedia/entry.aspx?Name=Win32%2fWinwebsec.
[23] Symantec security response, zbot. http://www.symantec.com/security_response/writeup.
jsp?docid=2010-011016-3514-99.
[24] Salvador Morales-Ortega, Ponciano Jorge Escamilla-Ambrosio, Abraham Rodriguez-Mota, and
Lilian D Coronado-De-Alba. Native malware detection in smartphones with Android OS using
static analysis, feature selection and ensemble classifiers. In 11th International Conference on
Malicious and Unwanted Software, MALWARE 2016, pages 1–8. IEEE, 2016.
[25] Masoud Narouei, Mansour Ahmadi, Giorgio Giacinto, Hassan Takabi, and Ashkan Sami.
Dllminer: structural mining for malware detection. Security and Communication Networks,
8(18):3311–3322, 2015.
[26] Netflix Prize. https://www.netflixprize.com, 2009.
[27] Numpy. https://numpy.org/.
[28] Pws:win32/onlinegames. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=PWS%3AWin32%2FOnLineGames.
[29] Radare2: Libre and portable reverse engineering framework. https://rada.re/n/.
[30] scikit-learn: Machine learning in Python. https://scikit-learn.org/stable/.
[31] Raja Khurram Shahzad and Niklas Lavesson. Comparative analysis of voting schemes for
ensemble-based malware detection. Journal of Wireless Mobile Networks, Ubiquitous Computing,
and Dependable Applications, 4(1):98–117, 2013.
[32] Shina Sheen, R Anitha, and P Sirisha. Malware detection by pruning of parallel ensembles using
harmony search. Pattern Recognition Letters, 34(14):1679–1686, 2013.
[33] Mary Wollstonecraft Shelley. Frankenstein or The Modern Prometheus. Dent, 1869.
[34] Tanuvir Singh, Fabio Di Troia, Visaggio Aaron Corrado, Thomas H. Austin, and Mark Stamp.
Support vector machines and malware detection. Journal of Computer Virology and Hacking
Techniques, 12(4):203–212, 2016.
[35] Vadim Smolyakov. Ensemble learning to improve machine learning results. https://blog.
statsbot.co/ensemble-learning-d1dcd548e936, 2017.
[36] Charles Smutz and Angelos Stavrou. Malicious pdf detection using metadata and structural
features. In Proceedings of the 28th Annual Computer Security Applications Conference, ACSAC
2012, pages 239–248. ACM, 2012.

[37] Mark Stamp. A revealing introduction to hidden Markov models. https://www.cs.sjsu.edu/
~stamp/RUA/HMM.pdf, 2004.
[38] Mark Stamp. Introduction to Machine Learning with Applications in Information Security. Chap-
man and Hall/CRC, Boca Raton, 2017.
[39] TensorFlow: An end-to-end open source machine learning platform. https://www.tensorflow.
org/.
[40] Fergus Toolan and Joe Carthy. Phishing detection using classifier ensembles. In eCrime Re-
searchers Summit, 2009, eCRIME ’09, pages 1–9. IEEE, 2009.
[41] Trojandownloader:win32/adload. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=TrojanDownloader%3AWin32%2FAdload.
[42] Trojandownloader:win32/agent. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=TrojanDownloader:Win32/Agent&ThreatID=14992.
[43] Trojandownloader:win32/renos. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=TrojanDownloader:Win32/Renos&threatId=16054.
[44] Trojandownloader:win32/small. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=TrojanDownloader:Win32/Small&threatId=15508.
[45] Trojan:win32/bho. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=Trojan:Win32/BHO&threatId=-2147364778.
[46] Trojan:win32/toga. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=Trojan:Win32/Toga&threatId=-2147259798.
[47] Hendrik Jacob van Veen, Le Nguyen The Dat, and Armando Segnini. Kaggle ensembling guide.
https://mlwave.com/kaggle-ensembling-guide/, 2015.
[48] Virtool:win32/ceeinject. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=VirTool%3AWin32%2FCeeInject.
[49] Virtool:win32/injector. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=VirTool:Win32/Injector&threatId=-2147401697.
[50] Virtool:win32/vbinject. https://www.microsoft.com/en-us/wdsi/threats/malware-
encyclopedia-description?Name=VirTool:Win32/VBInject&threatId=-2147367171.
[51] Rohit Vobbilisetty, Fabio Di Troia, Richard M. Low, Corrado Aaron Visaggio, and Mark Stamp.
Classic cryptanalysis using hidden Markov models. Cryptologia, 41(1):1–28, 2017.
[52] Win32/allaple. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-
description?Name=Win32/Allaple&threatId=.
[53] Win32/fakerean. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-
description?Name=Win32/FakeRean.
[54] Win32/rimecud. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-
description?Name=Win32/Rimecud&threatId=.
[55] Win32/vobfus. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-
description?Name=Win32/Vobfus&threatId=.
[56] Win32/vundo. https://www.microsoft.com/en-us/wdsi/threats/malware-encyclopedia-
description?Name=Win32/Vundo&threatId=.
[57] XGBoost documentation. https://xgboost.readthedocs.io/en/latest/.
[58] Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and Min Zhao. Sbmds: an
interpretable string based malware detection system using svm ensemble with bagging. Journal
in Computer Virology, 5(4):283, 2009.
[59] Yanfang Ye, Tao Li, Yong Chen, and Qingshan Jiang. Automatic malware categorization us-
ing cluster ensemble. In Proceedings of the 16th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’10, pages 95–104. ACM, 2010.
[60] Suleiman Y Yerima, Sakir Sezer, and Igor Muttik. High accuracy android malware detection
using ensemble learning. IET Information Security, 9(6):313–320, 2015.
[61] Boyun Zhang, Jianping Yin, Jingbo Hao, Dingxing Zhang, and Shulin Wang. Malicious codes
detection based on ensemble learning. In International Conference on Autonomic and Trusted
Computing, ATC 2007, pages 468–477. Springer, 2007.
[62] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton,
Florida, 2012. http://www2.islab.ntua.gr/attachments/article/86/Ensemble%20methods%
20-%20Zhou.pdf.

Appendix: Confusion Matrices
[Heatmap confusion matrices over the malware families Adload, Agent, Allaple, BHO, Bifrose, CeeInject, Cycbot, FakeRean, Hotbar, Injector, OnLineGames, Renos, Rimecud, Small, Toga, VB, VBinject, Vobfus, Vundo, Winwebsec, and Zbot; panels: (a) HMM, (b) CNN, (c) Random Forest, (d) LSTM. Rendering omitted.]

Figure 2: Confusion matrices for standard techniques

[Heatmap confusion matrices over the same malware families; panels: (a) Bagged HMM, (b) Bagged CNN, (c) Bagged LSTM. Rendering omitted.]

Figure 3: Confusion matrices for bagging experiments

[Heatmap confusion matrices over the same malware families; panels: (a) AdaBoost, (b) XGBoost. Rendering omitted.]

Figure 4: Confusion matrices for boosting techniques

16
[Figure 5 appears here as six confusion matrices, one per voting ensemble: panels (a) CNN, (b) LSTM, (c) bagged neural networks, (d) classic techniques, (e) all neural networks, and (f) all models. Rows and columns are the malware families in the dataset (Adload, Agent, Allaple, BHO, Bifrose, CeeInject, Cycbot, FakeRean, Hotbar, Injector, OnLineGames, Renos, Rimecud, Small, Toga, VB, VBinject, Vobfus, Vundo, Winwebsec, Zbot); each cell gives the fraction of samples from the row's family that were classified as the column's family.]

Figure 5: Confusion matrices for voting ensembles
