
Hindawi

Wireless Communications and Mobile Computing


Volume 2021, Article ID 5533963, 12 pages
https://doi.org/10.1155/2021/5533963

Research Article
Webshell Detection Based on Executable Data Characteristics of
PHP Code

Zulie Pan,1,2 Yuanchao Chen,1,2 Yu Chen,1,2 Yi Shen,1,2 and Xuanzhen Guo1,2

1 College of Electronic Engineering, National University of Defense Technology, Hefei 230011, China
2 Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China

Correspondence should be addressed to Yuanchao Chen; [email protected]

Received 7 January 2021; Revised 3 February 2021; Accepted 8 March 2021; Published 23 March 2021

Academic Editor: Di Zhang

Copyright © 2021 Zulie Pan et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A webshell is a malicious backdoor that allows remote access to and control of a web server by executing arbitrary commands. The
wide use of obfuscation and encryption technologies has greatly increased the difficulty of webshell detection. To this end, we
propose a novel webshell detection model leveraging the grammatical features extracted from the PHP code. The key idea is to
combine the executable data characteristics of the PHP code with static text features for webshell classification. To verify the
proposed model, we construct a cleaned webshell data set consisting of 2,917 samples from 17 webshell collection projects
and conduct extensive experiments. We design three sets of controlled experiments. The results show that the
accuracy of all three algorithms exceeds 99.40%, with the highest reaching 99.66%; the recall rate increases
by at least 1.8% and by at most 6.75%; and the F1 value increases by 2.02% on average. This not only confirms the
effectiveness of the grammatical features in webshell detection but also shows that our system significantly outperforms several
state-of-the-art rivals in terms of detection accuracy and recall rate.

1. Introduction

A webshell is a web backdoor written in a web scripting language that provides a covert way to communicate with the server [1]. With the fast development of the Internet, web attacks occur much more frequently, and implanting a webshell into a target website is one of the means most commonly used by attackers [2]. Attackers can use a webshell to gain control of the website server and then conduct information sniffing, data theft, or tampering. To bypass webshell detection tools, attackers usually use obfuscation and encryption or simply embed webshell code into normal files, thus preventing webshell files from being detected. Therefore, precisely detecting and identifying webshells is becoming much harder, and accurate webshell detection has become an important problem in preventing web-based attacks.

Currently, there are two distinct technical routes to webshell detection: dynamic feature detection and static feature detection. Specifically, dynamic feature detection is mainly based on webshell file behavior [3], webshell communication traffic [4–6], and other characteristics. It only works when the webshell is dynamically executed. On the one hand, dynamic detection needs hooks [7], sandboxes [8], and other technologies. On the other hand, it has to monitor file operations and traffic communication. Therefore, dynamic detection can consume much of the server's computing resources and significantly degrade its performance. As a result, static feature detection methods have become the focus of research.

Static feature detection methods mainly analyze webshell text content as well as web log information [9, 10]. Regular expressions [11] were the earliest widely used method for detecting webshell content, but they have to be extracted from existing webshells and need to be constantly updated. As a result, this kind of static detection cannot detect unknown webshells. Moreover, as webshell code obfuscation and encryption technology continues to mature, detection methods based on regular expressions can be easily bypassed. Machine learning has recently emerged in various fields, including cyberspace security [12]. Some researchers
began to apply this technology to webshell detection, and the results prove that it can play a vital role. Feature engineering, that is, choosing the webshell characteristics used for model training, plays a critical role in webshell detection methods based on machine learning.

The main contributions of this paper are summarized as follows:

(1) We are the first to propose the concept of the executable data characteristics of PHP code and to use it for webshell detection research. From the perspective of the grammatical features of webshells, we construct a webshell detection model based on the executable data characteristics of PHP code.

(2) We construct a cleaned data set to facilitate subsequent related research by collecting 17 existing webshell data sets on Github. The md5 algorithm is used to remove duplicate webshell samples, and some non-PHP webshell files are removed through manual analysis.

(3) We conduct extensive experiments to evaluate the performance of our proposed method. The results show that the executable data characteristics of PHP code are one of the important grammatical features of PHP webshells and significantly improve the accuracy of the detection model. Moreover, the executable data characteristics of PHP code improve the distinguishing ability of the detection model more than the traditional static statistical characteristics do.

In the next section, we review some representative related research to outline the motivation of our research. In Section 3, we introduce our system model in detail, including the opcode text vector library and the data executable features of PHP code. In Section 4, we systematically evaluate our detection model. Finally, we summarize our work in Section 5.

2. Related Work

For webshell detection based on static features, feature selection is mainly divided into two types: text features and grammatical features. Current research on webshell detection using machine learning mainly focuses on text features. To make up for the shortcomings of detection with regular expressions, researchers have proposed detection methods based on statistical features from the perspective of text features. That is, statistical features of a webshell, such as information entropy, longest word, index of coincidence, and compression ratio, are extracted and combined into a feature matrix for model construction. Later experiments found that PHP opcode can help improve the detection of webshells written in PHP. By combining the statistical features, PHP opcode word frequency, PHP opcode text vector library, signature functions, word relevance, and other features into a feature matrix used as the input for training a classification model, significant performance has been achieved.

Hu et al. proposed a webshell detection model based on the decision tree, focusing on the detection of PHP webshells [13]. Basic attributes of the sample, such as the number of words and lines in the text and the called functions, are extracted to construct the features and train the decision tree classification model. Although the number of words, the number of lines, and the called functions cannot by themselves distinguish normal files from webshell files, this research made a good attempt to build a detection model using the text features of webshells.

Meng et al. [14] built a feature matrix and used the SVM algorithm to train a model for webshell detection. To this end, web page attribute characteristics are extracted, including page length, number of code lines, number of comments, and page operation attributes. Calls to encryption and decryption functions; calls to system, eval, exec, shell_exec, and similar functions; character operation function calls; system function calls; file operations; FTP (File Transfer Protocol) operations; database operations; ActiveX control calls; and other features are also included in the matrix.

Hu proposed a webshell detection model based on Bayesian theory [15] and extracted common statistical features of file analysis for feature engineering. The features involved are information entropy, longest word length, compression ratio, index of coincidence, and others, all of which are fed into a Bayesian classifier to train the model.

Fang et al. first proposed a webshell detection model based on random forest with FastText, which extracts the opcode of PHP code as a feature for model construction [16]. The opcode of the PHP code is extracted to train a FastText model, which is used for preclassification. Statistical features such as the longest string, information entropy, and index of coincidence and text features such as sensitive functions and blacklist keywords are then used as input to train a random deep forest classification model.

Cui et al. proposed a webshell detection model based on the random forest–gradient boosting decision tree algorithm [17]. First, opcode hash vector and text vector library features are extracted from the PHP opcode using TF-IDF (term frequency–inverse document frequency) [18] vectorization. These features are used to train a random deep forest model to obtain preliminary results. These results are then combined with statistical features such as information entropy, index of coincidence, compression ratio, length of the longest word, and number of signature function matches to construct a feature matrix as the input of the GBDT (Gradient Boosting Decision Tree) classifier, which produces the final detection result.

In 2019, Li et al. proposed a webshell detection model based on the word attention mechanism [19]. The text content is first vectorized with word2vec, then a GRU (Gated Recurrent Unit) model with an attention mechanism is trained, and finally a fully connected layer with a sigmoid function performs the webshell classification.

Figure 1: The structure of the detection model. (Diagram: webshell samples pass through data preprocessing, which comprises text feature extraction [text vector library of PHP opcode and static statistical features, columns F1 to Fn+6] and data executable characteristic extraction of PHP code [D.E.P., column Fn+7]; the results go to feature matrix construction and then to sample classification with a supervised learning algorithm [SVM, MLP, RFC].)
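The statistical features that recur in the work surveyed above can be computed directly from a file's raw bytes. The following Python sketch covers four of them under common textbook definitions. These definitions are assumptions, since the text does not spell them out: Shannon entropy over byte values, index of coincidence over byte values, length of the longest whitespace-delimited word, and compression ratio taken as compressed size divided by original size with zlib as the compressor. All function names are hypothetical.

```python
import math
import zlib
from collections import Counter


def information_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def index_of_coincidence(data: bytes) -> float:
    """Probability that two distinct positions drawn at random hold the same byte."""
    total = len(data)
    if total < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(data).values()) / (total * (total - 1))


def longest_word_length(data: bytes) -> int:
    """Length of the longest whitespace-delimited token."""
    return max((len(word) for word in data.split()), default=0)


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; zlib is an assumed compressor."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


def static_features(data: bytes) -> list:
    """Four-element feature vector in a fixed order."""
    return [
        information_entropy(data),
        index_of_coincidence(data),
        float(longest_word_length(data)),
        compression_ratio(data),
    ]
```

Each function works on the whole file and ignores PHP syntax entirely, which is exactly the file-level view of a sample that the following paragraph discusses.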

Statistical characteristics are mainly attribute values of certain aspects of a file; they summarize the characteristics of a webshell from the perspective of the entire file. However, with the rapid development of web services, a large number of web service frameworks have emerged, and developers have begun to use code obfuscation and encryption techniques in project code to avoid leaking source code. As a result, the statistical features of normal files come to resemble those of webshells, and webshell detection models based on statistical features lose their original advantage. This also illustrates that webshell detection based on statistical characteristics alone is not comprehensive: a webshell not only has the attributes of a file but also has the structure of a scripting language.

According to the analysis above, given the growing disadvantages of statistical features, this paper is dedicated to proving that the executable data characteristics of PHP code can be effective in webshell detection.

3. Model Architecture

The structure of the detection model is shown in Figure 1. It includes three parts: data preprocessing, feature matrix construction, and classification via supervised learning algorithms.

First, we preprocess the data and extract features from the text and from data executability, respectively. For the text, we use the opcode text vector library, which has the best discrimination ability in current research, together with the static statistical features of the samples. Second, we combine the PHP opcode text vector library extracted in the preprocessing stage, the static statistical features of the samples, and the executable data characteristics of PHP code to form the feature matrix that describes the samples. Finally, a supervised learning algorithm is used for training and classification.

3.1. Static Statistical Features. We selected six types of static statistical features used in the literature [12]: information entropy, index of coincidence, length of the longest word, number of matched signatures, data compression ratio, and uses of the eval function. Combining these features gives a 6-dimensional feature vector.

3.2. Text Vector Library of PHP Opcode. The running process of PHP code includes three stages: lexical analysis, syntax analysis, and Zend engine execution. After these three stages, the corresponding result of the PHP code is output, as shown in Figure 2.

Figure 2: The running process of PHP code. (Diagram: the input <?php echo('Hello World');?> passes through lexical analysis, syntactic analysis, and Zend Engine execution and produces the output Hello World.)

Through lexical analysis, PHP code is divided into language fragments. After syntax analysis, the language fragments are transformed into meaningful expressions, which are input into the Zend engine for execution. In the execution stage, the Zend engine translates the expressions generated by syntax analysis into a series of opcodes and performs the corresponding operations according to the opcodes. For example, the original code of a webshell sample is shown in Code 1. The opcode sequence generated by the Zend engine is ['FETCH_CONSTANT', 'FETCH_R', 'FETCH_DIM_R', 'INCLUDE_OR_EVAL', 'RETURN'], as shown in Table 1.

<?php eval($_REQUEST['password']); ?>

Code 1

Table 1: Opcode sequence.

No.  Opcode
0    FETCH_CONSTANT
1    FETCH_R
2    FETCH_DIM_R
3    INCLUDE_OR_EVAL
4    RETURN

That is to say, the Zend engine parses and executes the code according to the opcode sequence generated by compilation. Obfuscating and encrypting the above webshell code yields Code 2.

<?php
error_reporting(E_ALL^E_NOTICE);
define('%uFFFD%uFFFD','%uFFFD%uFFFD%uFFFD');
$_SERVER[%uFFFD%uFFFD]=explode('|-|; |(',"password");
eval($_REQUEST[$_SERVER{%uFFFD%uFFFD}[0]])
?>

Code 2

The opcode sequence of this code is as follows: ['INIT_FCALL', 'SEND_VAL', 'DO_ICALL', 'INIT_FCALL', 'SEND_VAL', 'SEND_VAL', 'DO_ICALL', 'FETCH_CONSTANT', 'INIT_FCALL', 'SEND_VAL', 'SEND_VAL', 'DO_ICALL', 'FETCH_W', 'ASSIGN_DIM', 'OP_DATA', 'FETCH_CONSTANT', 'FETCH_R', 'FETCH_DIM_R', 'FETCH_DIM_R', 'FETCH_R', 'FETCH_DIM_R', 'INCLUDE_OR_EVAL', 'RETURN']. By comparison, it can be found that the opcode generated from the obfuscated and encrypted code contains the opcode of the source code. From the perspective of opcode, obfuscation and encryption add some operations that change the code style on top of the original operation sequence but do not change the original core opcode sequence. Therefore, opcode can play a great role in webshell detection. Reference [16] first demonstrated, through a comparative experiment in 2018, that applying PHP opcode to webshell detection can improve the discrimination ability of the detection model.

Previous research usually uses PHP's VLD [20] extension to extract opcode. However, when dealing with a large amount of obfuscated and encrypted code, the VLD extension processes the output opcode too inefficiently, so we use PHP's native debugger PHPDBG [21] to extract the opcode of PHP code.

In the detection model of this paper, we focus on the text vector library of opcode. The opcode sequence of each PHP file has a certain correlation; that is to say, each opcode is correlated with the opcodes before and after it. Therefore, the n-gram model is used to preprocess the opcode and generate the opcode word frequency matrix, and then the TF-IDF model is introduced to calculate the TF-IDF value of each opcode fragment. The text vector library of opcode is generated by filtering out the opcode fragments with less distinguishing ability. The concrete process is shown in Figure 3.

N-gram [22] is a probabilistic language model based on the Markov assumption that the occurrence of the nth word is related only to the preceding (n − 1) words; the probability of the entire sentence is equal to the product of the probabilities of its words. Assuming that the sentence S is composed of the word sequence W1, W2, W3, ..., Wn, the probability of S appearing can be expressed as

$$P(S) = P(W_1) P(W_2) P(W_3) \cdots P(W_n). \tag{1}$$

The probability of each word occurrence is calculated by sample statistics when the model is built. Using the n-gram model to preprocess opcode, a large number of opcode sequence samples are divided into opcode corpus fragments of length n. For example, for the opcode sequences ['ASSIGN', 'INCLUDE_OR_EVAL', 'CONCAT', 'INCLUDE_OR_EVAL', 'RETURN'] and ['ASSIGN', 'INIT_FCALL', 'SEND_VAL', 'DO_ICALL', 'CONCAT', 'INCLUDE_OR_EVAL', 'RETURN'], the corpus fragments generated with N = 3 are shown in Table 2.

The frequency matrix of the opcode sequences can be constructed by counting the occurrences of each corpus fragment:

$$\begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 1 \end{bmatrix} \tag{2}$$

TF-IDF [18] is used to evaluate the importance of a corpus fragment in the corpus, where TF (term frequency) represents the frequency of a corpus fragment in a file. Specifically, $n_{i,j}$ represents the number of occurrences of corpus fragment $t_i$ in file $d_j$, and $\sum_k n_{k,j}$ is the total number of occurrences of all corpus fragments in file $d_j$. The formula is

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}. \tag{3}$$

IDF (inverse document frequency) indicates whether a corpus fragment has good class discrimination ability in the corpus. The formula is (4), in which $|D|$ represents the total number of files in the corpus and $|\{j : t_i \in d_j\}|$ represents the number of files that contain corpus fragment $t_i$. If the corpus fragment does not occur in the corpus, the denominator is 0, so the denominator is generally written as $|\{j : t_i \in d_j\}| + 1$. The fewer files contain corpus fragment $t_i$, the larger the IDF value. This shows that

Figure 3: PHP opcode processing flow. (Diagram: data set → use phpdbg to extract opcode → opcode text set → n-gram pretreatment → word frequency matrix → TF-IDF processing → opcode text vector library.)
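The n-gram and TF-IDF stages of Figure 3 can be sketched in a few lines of Python on the two example opcode sequences. The first two functions reproduce the trigram fragments of Table 2 and the frequency matrix (2). The third applies a smoothed IDF, ln((1 + |D|) / (1 + df)) + 1, followed by L2 normalization of each row. That weighting is an assumption: it is the default of scikit-learn's TF-IDF transformer rather than formula (4) as written, and it is used here because it reproduces the first row of matrix (6) and matches the second row to about three decimal places. All function names are hypothetical.

```python
import math
from collections import Counter

# The two example opcode sequences from the text.
SEQUENCES = [
    ["ASSIGN", "INCLUDE_OR_EVAL", "CONCAT", "INCLUDE_OR_EVAL", "RETURN"],
    ["ASSIGN", "INIT_FCALL", "SEND_VAL", "DO_ICALL", "CONCAT",
     "INCLUDE_OR_EVAL", "RETURN"],
]


def ngrams(seq, n=3):
    """Slide a window of length n over one opcode sequence."""
    return [" ".join(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def count_matrix(sequences, n=3):
    """Return the sorted fragment vocabulary and one count row per sequence."""
    per_file = [Counter(ngrams(seq, n)) for seq in sequences]
    vocab = sorted(set().union(*per_file))
    return vocab, [[counts[frag] for frag in vocab] for counts in per_file]


def tfidf(counts):
    """Smoothed IDF plus L2 row normalization (assumed variant, see lead-in)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted = []
    for row in counts:
        w = [c * i for c, i in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0  # guard all-zero rows
        weighted.append([x / norm for x in w])
    return weighted


vocab, counts = count_matrix(SEQUENCES)
weights = tfidf(counts)
```

Sorting the vocabulary alphabetically gives the fragment order of Table 2, so column k of `counts` and `weights` corresponds to fragment k in that table.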

Table 2: The corpus fragments generated by N = 3.

#  Corpus fragment
0  ['ASSIGN', 'INCLUDE_OR_EVAL', 'CONCAT']
1  ['ASSIGN', 'INIT_FCALL', 'SEND_VAL']
2  ['CONCAT', 'INCLUDE_OR_EVAL', 'RETURN']
3  ['DO_ICALL', 'CONCAT', 'INCLUDE_OR_EVAL']
4  ['INCLUDE_OR_EVAL', 'CONCAT', 'INCLUDE_OR_EVAL']
5  ['INIT_FCALL', 'SEND_VAL', 'DO_ICALL']
6  ['SEND_VAL', 'DO_ICALL', 'CONCAT']

<?php echo($_GET['txt']); ?>

Code 3

<?php eval($_GET['txt']); ?>

Code 4

the corpus fragment performs very well in distinguishing categories:

$$\mathrm{idf}_{i,j,D} = \log \frac{|D|}{|\{ j : t_i \in d_j \}|}. \tag{4}$$

TF-IDF is the product of TF and IDF, as given in formula (5). The TF-IDF value can be used to filter out some of the corpus fragments, keeping those with good discriminatory ability.

$$\mathrm{tf\text{-}idf}_{i,j,D} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i,j,D}. \tag{5}$$

The opcode text vector library is obtained by passing the frequency matrix generated by the n-gram preprocessing above into the TF-IDF model:

$$\begin{bmatrix} 0.6317 & 0 & 0.4494 & 0 & 0.6317 & 0 & 0 \\ 0 & 0.4712 & 0.3351 & 0.4712 & 0 & 0.4712 & 0.4712 \end{bmatrix} \tag{6}$$

3.3. The Executable Data Characteristics of PHP Code. In the parsing of the PHP language, there is no difference between the data segment and the code segment. When PHP receives data from users, the input data may not only be processed as characters but may also be parsed and executed as PHP code. For example, consider the PHP code in Code 3.

We can directly understand the function of this code: the data input by the user is printed in the page by the echo function. The overall function of the code is not affected by different user inputs, so the function of the code is clear. If the echo function in the above code is replaced with eval, it becomes a one-sentence webshell in PHP (Code 4).

The specific function of this code cannot be determined directly. If the data obtained by $_GET['txt'] is '1 +1', this code would output the calculation result '2'. If the data obtained by $_GET['txt'] is the PHP function phpinfo(), this code would print the relevant PHP configuration information of the server. If the acquired data is 'system(whoami)', the user input is converted into a call to the system function that executes the corresponding system command. In other words, the data entered by the user is actually executed as PHP code. To sum up, the input data determines the actual function of the code, so we make the following definition:

Definition 1. In a section of PHP code, if the input data is parsed and executed as PHP code or a system command and thereby determines the actual function of this section of code, the code is said to have the data executable characteristic.

A webshell is able to realize various functions through a simple piece of code. For example, it can get information about the running environment of the web server; perform file uploading, downloading, or editing operations; connect to the database; get the command execution environment of the server; and so on. Therefore, we are certain that the majority of webshells respond to users' inputs to achieve the corresponding functionality. In other words, the majority of webshells possess the data executable characteristic.

array(
    0: Stmt_Expression(
        expr: Expr_Eval(
            expr: Expr_ArrayDimFetch(
                var: Expr_Variable(
                    name: _REQUEST
                )
                dim: Scalar_String(
                    value: txt
                )
            )
        )
    )
)

Code 5

Input: PHP sample files
Output: one-dimensional matrix of executable data characteristics of the samples
1. Convert the PHP code to an abstract syntax tree; go to step 2.
2. Judge whether there are Eval, FuncCall, MethodCall, or ShellExec nodes under the Expr nodes in the abstract syntax tree. If so, go to step 3; otherwise, return 0.
3. Judge whether the functions in those nodes can execute data as PHP code or system commands, such as eval, exec, and system. If so, go to step 4; otherwise, return 0.
4. Judge whether the parameter of the function is a variable. If it is, return 1; otherwise, return 0.

Algorithm 1: Executable data characteristic extraction of PHP code.
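Steps 2 to 4 of Algorithm 1 can be illustrated with a minimal Python sketch that walks an abstract syntax tree held as nested dictionaries and lists. The dictionary shape is a hypothetical simplification that only borrows the node names visible in Code 5; it is not claimed to be the exact output of PHP-Parser. The sink set contains the functions named in the text plus passthru, which is an addition made here for illustration. The sketch also scans every candidate node before returning 0, a design choice that avoids stopping at the first harmless call.

```python
# Functions assumed to run their argument as a system command.
# exec, system, and shell_exec are named in the text; passthru is our addition.
EXEC_SINKS = {"exec", "system", "shell_exec", "passthru"}


def walk(node):
    """Yield every dict node in a nested dict/list tree, depth first."""
    if isinstance(node, dict):
        yield node
        for child in node.values():
            yield from walk(child)
    elif isinstance(node, list):
        for child in node:
            yield from walk(child)


def contains_variable(node):
    """Step 4: does the argument subtree contain a variable node?"""
    return any(n.get("nodeType") == "Expr_Variable" for n in walk(node))


def data_executable(ast):
    """Return 1 if some execution sink receives a variable argument, else 0."""
    for node in walk(ast):
        kind = node.get("nodeType")
        if kind == "Expr_Eval":                      # eval(...) is always a sink
            payload = node.get("expr")
        elif kind in ("Expr_FuncCall", "Expr_MethodCall"):
            name = node.get("name")
            if not isinstance(name, str) or name.lower() not in EXEC_SINKS:
                continue                             # step 3 fails for this node
            payload = node.get("args", [])
        elif kind == "Expr_ShellExec":               # backtick operator
            payload = node.get("parts", [])
        else:
            continue                                 # step 2: not a candidate
        if contains_variable(payload):
            return 1
    return 0


# The tree of Code 5, written in the assumed dictionary shape.
EVAL_AST = [{
    "nodeType": "Stmt_Expression",
    "expr": {
        "nodeType": "Expr_Eval",
        "expr": {
            "nodeType": "Expr_ArrayDimFetch",
            "var": {"nodeType": "Expr_Variable", "name": "_REQUEST"},
            "dim": {"nodeType": "Scalar_String", "value": "txt"},
        },
    },
}]

# Same argument passed to echo instead of eval (compare Code 3).
ECHO_AST = [{
    "nodeType": "Stmt_Echo",
    "exprs": [{
        "nodeType": "Expr_ArrayDimFetch",
        "var": {"nodeType": "Expr_Variable", "name": "_GET"},
        "dim": {"nodeType": "Scalar_String", "value": "txt"},
    }],
}]
```

With these two trees, `data_executable(EVAL_AST)` gives 1 and `data_executable(ECHO_AST)` gives 0, which is the distinction between Code 4 and Code 3 drawn in Section 3.3.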
An abstract syntax tree [23] of the PHP code is needed to extract the executable data characteristics of PHP code. An abstract syntax tree is a tree that represents the grammatical structure of a program's source code, where each node represents a structure in the source code. The abstract syntax tree of PHP code can be generated with PHP-Parser [24], an open-source PHP abstract syntax tree generation tool written in PHP and based on the Zend engine. For example, the abstract syntax tree generated by PHP-Parser for the one-sentence webshell mentioned above is shown in Code 5.

Stmt and Expr represent statement nodes and expression nodes, respectively. Variables in an expression are represented by Variable, and a string constant is represented by Scalar_String. Through the abstract syntax tree, we can intuitively understand the overall grammatical structure of the code. To extract the executable data characteristics of PHP code, we match the Eval, FuncCall, MethodCall, and ShellExec nodes under the Expr nodes of the abstract syntax tree and analyze their attributes (function name, parameter type) to judge whether the function in the expression node can execute data as PHP code or a system command and whether the parameter of the function is a variable node. If so, the function in the expression node executes the variable parameter as PHP code or a system command, and the actual function of the code is determined dynamically by the value of the variable. If all the above conditions are met, the PHP code is judged to have executable data characteristics. The extraction procedure is shown in Algorithm 1.

Finally, a one-dimensional matrix is generated to describe the executable data characteristics of the samples. Samples that possess executable data characteristics are marked as 1, and samples without them are marked as 0.

3.4. Feature Matrix Construction and Supervised Learning Algorithm. The feature matrix is constructed by combining the opcode text vector library, the static statistical features of the samples, and the data executable features. The feature matrix has M rows and N + 7 columns, where M is the total number of samples. The first N columns represent the opcode text vector of the sample. Columns N + 1 to N + 6 represent the static statistical characteristics of the sample. Column N + 7 represents the executable data characteristics of the sample. This paper focuses on whether the executable data characteristics of PHP code can play a vital role in webshell detection, so the impact of the specific algorithm on detection is not discussed. Supervised learning uses a set of labeled data to learn a mapping from input to output and applies this mapping to unknown data in order to perform classification or regression. The constructed feature matrix is fed into the supervised learning algorithm to build the model. Finally, the model is used to classify each test sample.

4. Experimental Analysis

4.1. Data Sets. Since we found no ready-made, cleaned data set of PHP webshells available on the internet, we collected a total of 6,021 PHP webshell samples from 17 open-source projects. The Github projects involved are shown in Table 3.

Since the webshell samples collected by the Github projects inevitably include duplicate files, we used the md5 algorithm to deduplicate the 6,021 webshell samples so that repeated files would not affect the experimental results, obtaining a total of 3,211 unique webshell sample files. Meanwhile, to ensure the accuracy of the data, 294 non-PHP webshell files were excluded by manual analysis, so the final number of webshell samples was 2,917. Testing for executable data characteristics showed that a total of 2,696 samples have this property. The remaining 221 webshell samples were analyzed manually, and we found that the samples in which the executable data characteristics of PHP code were not detected perform only a single function, such as file upload, database connection, or file system directory operations. This data also fully demonstrated
that most PHP webshells have the executable data characteristics of PHP code. We used the same approach to collect 9,736 unique normal PHP web page samples from 9 well-known open-source web content management systems (CMSs). The relevant CMSs are shown in Table 4.

Table 3: Github projects list.

No.  Github project
0    JohnTroony/PHP-Webshells
1    xl7dev/Webshell
2    ysrc/Webshell-sample
3    tennc/Webshell
4    BlackArch/Webshells
5    JoyChou93/Webshell
6    bartblaze/PHP-backdoors
7    WangYihang/Webshell-sniper
8    tanjiti/WebshellSample
9    tdifg/Webshell
10   LandGrey/Webshell-detect-bypass
11   backlion/Webshell
12   Webshellpub/awsome-Webshell
13   x-o-r-r-o/PHP-Webshells-Collection
14   S9MF/S9MF-PHP-Webshell-bypass
15   backdoorhub/shell-backdoor-list
16   amitnaik/PHP-backdoor

Table 4: Content management system list.

#  CMS        Version
0  Wordpress  5.4
1  Joomla     3.9.16
2  Laravel    7.6.2
3  PHPBB      3.3.0
4  Typecho    1.1
5  ThinkPHP   5.0.24
6  Seacms     10.1
7  MetInfo    7.0.0
8  DiscuzX    3.4

We divided the 2,917 webshell samples and 9,736 normal web page samples randomly into a training set and a test set at a ratio of 7 : 3; that is, the training set consists of 2,044 webshell samples and 6,815 normal web page samples, totaling 8,859, and the test set consists of 3,794 samples, comprising 873 webshell samples and 2,921 normal web page samples. After the

tor library, the divided training set and the test set will have relevance in the opcode text vector library if the data is preprocessed in a unified way. To eliminate the relevance between the training set and the test set, we divided the data into the training set and the test set before preprocessing. The preprocessing of the data and the construction of the feature matrix are carried out separately to ensure that the data in the training set and the test set are uncorrelated and that the experimental results are valid. We uploaded the cleaned data set to a Github project for use in subsequent webshell detection research; it can be downloaded from https://github.com/Cyc1e183/PHP-Webshell-Dataset.

4.2. Algorithm Parameter Setting. To better compare the experimental results and analyze the impact of the executable data characteristics of PHP code on model performance, we adopt the random forest (RF) algorithm [25], the support vector machine (SVM) algorithm [14], and the multilayer perceptron (MLP) [26] with the same training set for webshell classification. We do not explore the optimal algorithm parameter settings. In the experiment, we set the N value of the n-gram algorithm used in opcode text vector library processing to 3. For the random forest algorithm, the sample-set split criterion was set to information entropy, the number of trees was set to 100, the random seed was set to 2, and the remaining parameters were left at their default values. For the support vector machine algorithm, the kernel type was set to the linear kernel function and the penalty factor was set to 1; the remaining parameters were left at their default values. For the multilayer perceptron, the weight optimization algorithm was set to a stochastic gradient-based optimization algorithm, and the regularization parameter was set to 0.0001. One hidden layer with 100 hidden units was used, with the logistic function as the hidden layer activation function. The random seed was set to 1, the maximum number of iterations was set to 150, and the initial learning rate was set to 0.089. The remaining parameters were left at their default values.

4.3. Experimental Results and Analysis. We divided the experiment into three groups. All groups used the same parameters for the random forest algorithm, the support vector machine algorithm, and the multilayer perceptron. The first group combined the opcode text vector library, the static statistical features of the samples, and the executable data characteristics of PHP code into a feature matrix to build a model (we call it model 1). The second group used the opcode text vector library combined with the static statistical features as a feature matrix to build a model (we call it model 2), and the last group used the opcode text
assignment was completed, the normal web samples and vector library and executable data characteristics of PHP
webshell samples for training were combined to form the code to constitute a feature matrix construction model for
training set of the model. The normal web samples and the experimentation (we call it model 3). Three experiments are
webshell samples for classification testing were mixed to set to compare the excellency of executable data characteris-
form the test set of the model. In the previous research, the tics of the PHP code with executable data characteristics.
data set is usually preprocessed uniformly, and then, the What is more, we can also know whether executable data
training set and the test set are divided. Since we use the N characteristics of PHP code would affect the performance of
-gram and TF-IDF algorithm to extract the opcode text vec- the detection model.
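The reason for splitting before preprocessing can be illustrated with a small sketch: the 3-gram vocabulary and the IDF weights are learned from the training opcode sequences only, and the test sequences are then mapped onto that fixed vocabulary. This is a minimal illustration rather than the authors' implementation. The function names, the smoothed IDF formula, and the opcode sequences are all assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n=3):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fit_idf(train_docs, n=3):
    """Learn the n-gram vocabulary and IDF weights from training documents only."""
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(ngrams(tokens, n)))
    total = len(train_docs)
    # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1.
    return {gram: math.log((1 + total) / (1 + df)) + 1
            for gram, df in doc_freq.items()}


def transform(tokens, idf, n=3):
    """Map one document to a sparse TF-IDF vector; unseen n-grams are dropped."""
    counts = Counter(g for g in ngrams(tokens, n) if g in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: (count / total) * idf[gram] for gram, count in counts.items()}


# Hypothetical opcode sequences, for illustration only (not from the data set).
train_docs = [
    ["FETCH_R", "FETCH_DIM_R", "INCLUDE_OR_EVAL", "RETURN"],
    ["ASSIGN", "ECHO", "RETURN"],
    ["FETCH_R", "FETCH_DIM_R", "INCLUDE_OR_EVAL", "ECHO", "RETURN"],
]
test_docs = [
    ["FETCH_R", "FETCH_DIM_R", "INCLUDE_OR_EVAL", "RETURN"],
    ["INIT_FCALL", "SEND_VAL", "DO_ICALL", "RETURN"],
]

idf = fit_idf(train_docs)                              # training split only
train_matrix = [transform(doc, idf) for doc in train_docs]
test_matrix = [transform(doc, idf) for doc in test_docs]  # reuses training vocabulary
```

Because the test documents never contribute to the vocabulary or the document frequencies, no statistic of the test set can leak into the features used for training.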

Table 5: Confusion matrices of each group of experiments.

Model name   Classification algorithm   Actual     Predicted positive   Predicted negative
Model 1      Random forest              Positive   TP = 869             FN = 4
                                        Negative   FP = 9               TN = 2912∗
             Support vector machine     Positive   TP = 865∗            FN = 8
                                        Negative   FP = 14              TN = 2907
             Multilayer perceptron      Positive   TP = 867∗            FN = 6
                                        Negative   FP = 12              TN = 2909∗
Model 2      Random forest              Positive   TP = 854             FN = 19
                                        Negative   FP = 9               TN = 2912∗
             Support vector machine     Positive   TP = 806             FN = 67
                                        Negative   FP = 12              TN = 2909∗
             Multilayer perceptron      Positive   TP = 848             FN = 25
                                        Negative   FP = 24              TN = 2897
Model 3      Random forest              Positive   TP = 867             FN = 6
                                        Negative   FP = 16              TN = 2905
             Support vector machine     Positive   TP = 863             FN = 10
                                        Negative   FP = 34              TN = 2887
             Multilayer perceptron      Positive   TP = 859             FN = 14
                                        Negative   FP = 17              TN = 2904

∗ The maximum value of the same algorithm in different models.
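As a quick check of how the counts in Table 5 translate into summary scores, the sketch below applies the definitions of accuracy, precision, recall, and F1 from equation (7) to the random forest confusion matrix of model 1. The function name `evaluate` is an assumption made for illustration; only the four counts come from the table.

```python
def evaluate(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)          # share of predicted webshells that are real
    recall = tp / (tp + fn)             # share of real webshells that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Model 1, random forest (counts taken from Table 5).
scores = evaluate(tp=869, fn=4, fp=9, tn=2912)
rounded = {name: round(value, 4) for name, value in scores.items()}
print(rounded)
```

Rounded to four decimals, the result is 0.9966, 0.9897, 0.9954, and 0.9926, which agrees with the random forest row of model 1 in Table 6.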

We marked the webshell samples as positive and the normal web page file samples as negative in the experiment. The confusion matrices of each group of experiments are shown in Table 5.

TP (true positive) indicates the number of webshell samples that the model recognizes correctly, and FN (false negative) indicates the number of webshell samples recognized by the classification model as normal web page files. In contrast, TN (true negative) indicates the number of normal web page files that the classification model correctly identifies, and FP (false positive) indicates the number of normal web page files that the classification model recognizes as webshells. Comparing the experimental data shows that, after adding the executable data characteristics of PHP code, the FN and FP values displayed in the confusion matrix have been significantly reduced.

We can see from the data in Table 5 that, compared to model 2 and model 3, the random forest and MLP classifiers in model 1 achieve the maximum values of TP and TN, while the TP of the SVM classifier also achieves the maximum value. TP indicates the number of webshell samples that the model correctly identifies, and accurately identifying webshells is the more critical requirement for system security. Compared with model 2, the TP of the three classifiers in model 1 increases by 31 on average, while the value of TN is not significantly improved. Compared with model 2, model 3 has an average increase of 9 in the TP value of the three classifiers.

To evaluate the experiments more accurately and to make comparison with other detection methods easier, we used five commonly used evaluation indicators for webshell detection: accuracy, precision, recall, F1 value, and a comprehensive evaluation using ROC curves [27]. The first four evaluation indicators are calculated as follows:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + FN + FP + TN},\\
\text{Precision} &= \frac{TP}{TP + FP},\\
\text{Recall} &= \frac{TP}{TP + FN},\\
F1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
\tag{7}
$$

Accuracy indicates the proportion of correctly predicted samples in the whole test set, and precision indicates the proportion of true-positive samples among all samples predicted as positive. Recall indicates the proportion of webshell samples that are correctly predicted. The F1 value is a comprehensive evaluation index combining precision and recall. The evaluation index results of the models are shown in Table 6.

Model 1 involves a feature matrix that combines the opcode text vector library, sample static statistical features, and the executable data characteristics of PHP code. Model 2 adopts a feature matrix that only uses the combination of

Table 6: The evaluation index results.

Model name   Classification algorithm   Accuracy   Precision   Recall    F1 score
Model 1      Random forest              0.9966∗    0.9897∗     0.9954∗   0.9926∗
             Support vector machine     0.9942∗    0.9841      0.9908∗   0.9874∗
             Multilayer perceptron      0.9953∗    0.9863∗     0.9931∗   0.9897∗
Model 2      Random forest              0.9926     0.9896      0.9782    0.9839
             Support vector machine     0.9792     0.9853∗     0.9233    0.9533
             Multilayer perceptron      0.9871     0.9725      0.9714    0.9719
Model 3      Random forest              0.9942     0.9819      0.9931    0.9875
             Support vector machine     0.9884     0.9621      0.9885    0.9751
             Multilayer perceptron      0.9918     0.9806      0.9840    0.9823

∗ The maximum value of the indicator for the same algorithm in different models.

Figure 4: Data comparison of the three models (accuracy, precision, recall, and F1 score of the random forest, support vector machine, and multilayer perceptron classifiers for models 1 to 3).

opcode text vector library and sample static statistics. By comparing the experimental data of model 1 and model 2, we can see that model 1 performs better than model 2. The evaluation indicators are improved considerably: the accuracy of all three algorithms exceeds 99.40%, with the highest reaching 99.66%; the recall rate is increased by at least 1.8% and at most 6.75%; and the F1 value is increased by 2.02% on average. The average precision does not change much. The data show that the executable data characteristics of PHP code can effectively improve the distinguishing ability of the model.

By comparing the experimental data of model 2 and model 3, we find that the performance of the model constructed using the feature matrix that combines the opcode text vector library and the executable data characteristics of the PHP code is slightly better than that of model 2. The accuracy, recall, and F1 value are improved on average by 0.52%, 3.07%, and 1.19%, respectively. The data indicate that the executable data characteristics of PHP code describe webshells better than static statistical features and have better discrimination ability. The differences among the experimental results of the three models can be seen in Figure 4.

Figures 5–7 show the ROC curves of the three groups of experiments, respectively. The horizontal coordinate is the false-positive rate (FPR), which

Figure 5: Model 1 ROC curve (true-positive rate versus false-positive rate). AUC: random forest 0.9999, multilayer perceptron 0.9997, support vector machine 0.9997.

Figure 6: Model 2 ROC curve (true-positive rate versus false-positive rate). AUC: random forest 0.9974, multilayer perceptron 0.9982, support vector machine 0.9969.

represents the false alarm rate on normal web page files. The vertical coordinate is the true-positive rate (TPR), which describes the accuracy of the classification of webshell files. An ideal detection model fully separates webshells from normal web files, which corresponds to a TPR of 1 and an FPR of 0. In other words, the closer the area under the ROC curve (AUC) is to 1, the higher the recognition accuracy of the detector. A comparison between the ROC curves validates the performance of our detection model.

The above data show that the executable data characteristics of PHP code are an important grammatical feature of PHP-language webshells. This grammatical feature describes webshells better than traditional statistics-based text features and has a better ability to distinguish between webshell files and normal web pages. By adding

Figure 7: Model 3 ROC curve (true-positive rate versus false-positive rate). AUC: random forest 0.9995, multilayer perceptron 0.9975, support vector machine 0.9984.
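How a ROC curve and its AUC are obtained from classifier scores can be sketched in a few lines. The example below sweeps a decision threshold over the predicted scores, records one (FPR, TPR) point per distinct score, and integrates the curve with the trapezoidal rule. The labels and scores are made-up values for illustration, and the function names are assumptions. It is not the plotting code behind Figures 5–7.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR) from binary labels (1 = webshell) and scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample sharing this score so ties give a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative labels and scores only (not experimental data).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]
curve = roc_points(labels, scores)
area = auc(curve)
```

A detector that ranks every webshell above every normal file yields the points (0, 0), (0, 1), (1, 1) and an AUC of exactly 1, which is the ideal case described in the text.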

the executable data characteristics of the PHP code, the performance of the webshell detection model has been significantly improved.

In addition, we also randomly divided the training set and the test set at a ratio of 8 : 2; that is, the training set consists of 7,789 normal web page samples and 2,334 webshell samples, and the test set consists of 1,947 normal web page samples and 583 webshell samples. Compared with model 2, the highest accuracy in model 1 reaches 99.64%, with the largest increase being 1.46% and the smallest 0.31%. The recall rate increases by at least 0.86% and at most 6.35%. The F1 value increases by 2.33% on average, and the average precision increases by 1.16%. Comparing the experimental data of model 2 and model 3 shows that model 3 performs slightly better than model 2: the accuracy, precision, recall, and F1 value are improved by 0.94%, 0.83%, 3.37%, and 2.10%, respectively. Overall, the results of the random divisions at the ratios of 8 : 2 and 7 : 3 are similar, which further confirms that the executable data characteristics of the PHP code can significantly improve the detection performance of the model. We therefore do not describe in detail the experimental results of randomly dividing the training set and the test set at the 8 : 2 ratio.

4.4. Comparison with Well-Known Webshell Detection Tools. We selected four of the currently most popular webshell detection tools to compare with the model in this paper: D shield [28], Baidu WEBDIR+ [29], PHP Malware Finder [30], and SHELLPUB [31]. Using these tools to scan all the sample files in our test set, we obtained the results shown in Table 7.

Table 7: The evaluation index results.

Webshell detection tools   Version           Accuracy   Recall
D shield                   V2.0.9            0.9863     0.9633
WEBDIR+                    2020-0423-1800    0.9797     0.9118
PHP Malware Finder         -                 0.8996     0.7904
SHELLPUB                   V 1.7.0           0.9027     0.6277
Our model                  -                 0.9966     0.9954

It can be seen from the table that, although the accuracy of D shield is as high as 98.63% and its recall as high as 96.33%, the accuracy and recall of our model reach 99.66% and 99.54%. The detection ability of our model is thus better than that of these webshell detection tools. It is worth mentioning that the false-positive rate of WEBDIR+ for whitelist detection on the test data is 0%, which is also the goal we expect to achieve in subsequent work.

5. Conclusion

In this paper, we propose a webshell detection model based on static features of PHP code. The key idea is to leverage the executable data characteristics of PHP code, viewed as syntax features of the PHP code, for webshell detection. To systematically evaluate our model, we first construct a cleaned webshell data set consisting of 2,917 samples from 17 webshell collection projects and then conduct extensive experiments. The experimental results verify the effectiveness of our model, which achieves 99.66% detection accuracy without exploring the optimization of the machine learning algorithm. Moreover, our detection model outperforms

several popular webshell detection tools in terms of accuracy, precision, recall rate, F1 value, and ROC curve. It is confirmed that the executable data characteristics of PHP code are significant grammatical features of webshells and can effectively improve the performance of the detection model.

Data Availability

We uploaded the data set to GitHub at https://github.com/Cyc1e183/PHP-Webshell-Dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We are very thankful to Hui Guo, Zhijie Xie, and Zhihao Hu for their help in the preparation of the experiment and paper review. This research was funded by the National Key R&D Program “Cyberspace Security” (2017YFB0802900).

References

[1] J. Kim, D.-H. Yoo, H. Jang, and K. Jeong, “WebSHArk 1.0: a benchmark collection for malicious web shell detection,” Journal of Information Processing Systems, vol. 11, no. 2, pp. 229–238, 2015.
[2] T. D. Tu, C. Guang, G. Xiaojun, and P. Wubin, “Webshell detection techniques in web applications,” in Fifth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pp. 1–7, Hefei, China, 2014.
[3] B. Yong, X. Liu, Y. Liu, H. Yin, L. Huang, and Q. Zhou, “Web behavior detection based on deep neural network,” in 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1911–1916, 2018.
[4] Y. Tian, J. Wang, Z. Zhou, and S. Zhou, “CNN-webshell: malicious web shell detection with convolutional neural network,” in Proceedings of the 2017 VI International Conference on Network, Communication and Computing - ICNCC 2017, pp. 75–79, Kunming, China, 2017.
[5] H. Zhang, H. Guan, H. Yan et al., “Webshell traffic detection with character-level features based on deep learning,” IEEE Access, vol. 6, pp. 75268–75277, 2018.
[6] W. Yang, B. Sun, and B. Cui, Innovative Mobile and Internet Services in Ubiquitous Computing. IMIS 2018, Springer, 2018.
[7] J. Riordan, A. Wespi, and D. Zamboni, “How to hook worms [computer network security],” IEEE Spectrum, vol. 42, no. 5, pp. 32–36, 2005.
[8] J. Shukla, “Application sandbox to detect, remove, and prevent malware,” US Patent 11/769,297, 2008.
[9] S. Liuyang and F. Yong, “Webshell detection method research based on web log,” Journal of Information Security Research, vol. 1, p. 11, 2016.
[10] Y. Wu, Y. Sun, C. Huang, P. Jia, and L. Liu, “Session-based webshell detection using machine learning in web logs,” Security and Communication Networks, vol. 2019, Article ID 3093809, 11 pages, 2019.
[11] K. Thompson, “Programming techniques: regular expression search algorithm,” Communications of the ACM, vol. 11, no. 6, pp. 419–422, 1968.
[12] J. B. Fraley and J. Cannady, “The promise of machine learning in cybersecurity,” in SoutheastCon 2017, pp. 1–6, Charlotte, NC, USA, 2017.
[13] J. Hu, Z. Xu, D. Ma, and J. Yang, “Research of webshell detection based on decision tree,” Journal of Network New Media, vol. 6, 2012.
[14] Z. Meng, R. Mei, T. Zhang, and W. P. Wen, “Research of Linux WebShell detection based on SVM classifier,” Netinfo Security, vol. 5, pp. 5–9, 2014.
[15] B. Hu, “Research on webshell detection method based on Bayesian theory,” Science Mosaic, vol. 6, pp. 66–70, 2016.
[16] Y. Fang, Y. Qiu, L. Liu, and C. Huang, “Detecting webshell based on random forest with fasttext,” in Proceedings of the 2018 International Conference on Computing and Artificial Intelligence - ICCAI 2018, pp. 52–56, Chengdu, China, 2018.
[17] H. Cui, D. Huang, Y. Fang, L. Liu, and C. Huang, “Webshell detection based on random forest–gradient boosting decision tree algorithm,” in 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 153–160, Guangzhou, China, 2018.
[18] J. Ramos, “Using TF-IDF to determine word relevance in document queries,” in Proceedings of the first instructional conference on machine learning, pp. 29–48, 2003.
[19] T. Li, C. Ren, Y. Fu, J. Xu, J. Guo, and X. Chen, “Webshell detection based on the word attention mechanism,” IEEE Access, vol. 7, pp. 185140–185147, 2019.
[20] “PECL:: package:: vld,” 2020, http://pecl.PHP.net/package/vld.
[21] “PHP: PHPdbg-Manual,” 2020, http://www.PHP.net/PHPdbg.
[22] W. B. Cavnar and J. M. Trenkle, “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, vol. 161175, 1994.
[23] I. Neamtiu, J. S. Foster, and M. Hicks, “Understanding source code evolution using abstract syntax tree matching,” in Proceedings of the 2005 international workshop on Mining software repositories - MSR '05, pp. 1–5, 2005.
[24] “nikic/PHP-Parser: a PHP parser written in PHP,” 2020, https://github.com/nikic/PHP-Parser.
[25] T. K. Ho, “Random decision forests,” in Proceedings of 3rd international conference on document analysis and recognition, pp. 278–282, Montreal, QC, Canada, 1995.
[26] Z. Wang, J. Yang, M. Dai, R. Xu, and X. Liang, “A method of detecting webshell based on multi-layer perception,” Academic Journal of Computing & Information Science, vol. 2, no. 1, 2019.
[27] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[28] “D shield,” http://www.d99net.net/.
[29] BAIDU, “WEBDIR+webshell detector,” https://scanner.baidu.com.
[30] “php-malware-finder,” https://github.com/jvoisin/php-malware-finder.
[31] “SHELLPUB,” https://ml.shellpub.com/.
