Research Article
Webshell Detection Based on Executable Data Characteristics of
PHP Code
Zulie Pan,1,2 Yuanchao Chen,1,2 Yu Chen,1,2 Yi Shen,1,2 and Xuanzhen Guo1,2
1 College of Electronic Engineering, National University of Defense Technology, Hefei 230011, China
2 Anhui Province Key Laboratory of Cyberspace Security Situation Awareness and Evaluation, Hefei 230037, China
Received 7 January 2021; Revised 3 February 2021; Accepted 8 March 2021; Published 23 March 2021
Copyright © 2021 Zulie Pan et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
A webshell is a malicious backdoor that allows remote access to and control of a web server by executing arbitrary commands. The wide use of obfuscation and encryption technologies has greatly increased the difficulty of webshell detection. To this end, we propose a novel webshell detection model leveraging the grammatical features extracted from the PHP code. The key idea is to combine the executable data characteristics of the PHP code with static text features for webshell classification. To verify the proposed model, we construct a cleaned webshell data set consisting of 2,917 samples from 17 webshell collection projects and conduct extensive experiments. We design three sets of controlled experiments, the results of which show that the accuracy of all three algorithms exceeds 99.40%, with the highest reaching 99.66%; the recall rate increases by at least 1.8% and at most 6.75%; and the F1 value increases by 2.02% on average. These results not only confirm the effectiveness of the grammatical features in webshell detection but also show that our system significantly outperforms several state-of-the-art rivals in terms of detection accuracy and recall rate.
began to apply this technology to webshell detection, and results prove that it can play a vital role in webshell detection. Feature engineering, namely, the choice of webshell characteristics used for model training, plays a critical role in webshell detection methods based on machine learning.

The main contributions of this paper are summarized as follows:

(1) We are the first to propose the concept of the executable data characteristics of PHP code and to use it for webshell detection research. From the perspective of the grammatical features of webshell, we construct a webshell detection model based on the executable data characteristics of PHP code.

(2) We construct a cleaned data set to facilitate subsequent related research by collecting 17 existing webshell data sets on Github. The MD5 algorithm is used to remove duplicate webshell samples, and some non-PHP webshell files are removed through manual analysis.

(3) We conduct extensive experiments to evaluate the performance of our proposed method. The results show that the executable data characteristics of PHP code are one of the important grammatical features of PHP webshell and significantly improve the accuracy of the detection model. Moreover, the executable data characteristics of PHP code improve the distinguishing ability of the detection model more than the traditional static statistical characteristics do.

In the next section, we review some representative related research to outline the motivation of our research. In Section 3, we introduce our system model in detail, including the opcode text vector library and data executable features of PHP code. In Section 4, we systematically evaluate our detection model. Finally, we summarize our work in Section 5.

2. Related Work

For webshell detection based on static features, feature selection is mainly divided into two types: text features and grammatical features. Current research on webshell detection using machine learning mainly focuses on text features. To make up for the shortcomings of detection based on regular expressions, researchers have proposed detection methods based on statistical features from the perspective of text features. That is, statistical features of a webshell, such as information entropy, longest word, index of coincidence, and compression ratio, are extracted and combined to form a feature matrix for model construction. Later experiments found that PHP opcode can help improve the detection of webshell written in PHP. By combining statistical features, PHP opcode word frequency, the PHP opcode text vector library, signature functions, word relevance, and other features into a feature matrix that is input to a training model for classification, significant performance has been achieved.

Hu et al. proposed a webshell detection model based on the decision tree, focusing on the detection of PHP webshell [13]. Basic attributes of the sample, such as the number of words and lines in the text and the called functions, are extracted to construct the features and train the decision tree classification model for webshell detection. Although the number of words, the number of lines, and the called functions cannot by themselves distinguish between normal files and webshell files, this research made a good attempt to build a detection model using the text features of webshell.

Meng et al. [14] built a feature matrix and used the SVM algorithm to train a model for webshell detection. To this end, web page attribute characteristics are extracted. These characteristics include page length, code lines, number of comments and other characteristics, and page operation attributes. In addition, encryption and decryption function calls; system, eval, exec, shell_exec, and other function calls; character operation function calls; system function calls; file operations; FTP (File Transfer Protocol) operations; database operations; ActiveX control calls; and other features are all involved in the matrix building.

Hu proposed a webshell detection model based on Bayesian theory [15] and extracted the common statistical features of file analysis for feature engineering. The features involved are information entropy, longest word length, compression ratio, index of coincidence, and other features, all of which are fed into the Bayesian classifier to train the model for webshell detection.

Fang et al. first proposed a model for detecting webshell based on random forest with FastText, extracting the opcode of PHP code as a feature for model construction [16]. The opcode extracted from the PHP code is used to train a FastText model, which is used for preclassification. Statistical features such as the longest string, information entropy, and index of coincidence, together with text features such as sensitive functions and blacklist keywords, are used as input to train a random deep forest classification model.

Cui et al. proposed a webshell detection model based on the random forest–gradient boosting decision tree algorithm [17]. First, the opcode hash vector and text vector library features are extracted from PHP opcode processed by the TF-IDF (term frequency–inverse document frequency) [18] vector. All these features are used to train a random deep forest model, so as to obtain preliminary preprocessing results. These results are then combined with statistical features such as information entropy, index of coincidence, compression ratio, length of the longest word, and number of signature function matches to construct a feature matrix as the input of the GBDT (Gradient Boosting Decision Tree) classifier for model training, from which the final detection result is obtained.

In 2019, Li et al. proposed a webshell detection model based on the word attention mechanism [19]: first, word2vec is used to vectorize the text content; then the GRU (Gated Recurrent Unit) model and the attention mechanism model are used for training; and finally the output is fed to a fully connected layer with a sigmoid function for webshell classification.
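The statistical text features that recur across the work reviewed above (information entropy, index of coincidence, compression ratio, and longest word) can be sketched as follows. This is a minimal illustration and not the implementation used in any of the cited papers: the byte-level definitions, the use of zlib for compression, and the function names are all assumptions made for the example.

```python
import math
import re
import zlib
from collections import Counter
from typing import List


def information_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def index_of_coincidence(data: bytes) -> float:
    """Probability that two bytes drawn without replacement are equal."""
    total = len(data)
    if total < 2:
        return 0.0
    counts = Counter(data)
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (zlib is an assumed choice)."""
    if not data:
        return 0.0
    return len(zlib.compress(data)) / len(data)


def longest_word_length(data: bytes) -> int:
    """Length of the longest run of non-whitespace bytes."""
    words = re.findall(rb"\S+", data)
    return max((len(w) for w in words), default=0)


def statistical_features(data: bytes) -> List[float]:
    """One row of a statistical feature matrix for a single file."""
    return [
        information_entropy(data),
        index_of_coincidence(data),
        compression_ratio(data),
        float(longest_word_length(data)),
    ]
```

Obfuscated or encoded payloads tend to show up as long unbroken strings with unusual entropy, which is why these four values are commonly used together as one feature row per file.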
Wireless Communications and Mobile Computing 3
[Figure: overview of the detection model. Labels: webshell sample, text feature extraction, data executable characteristic, feature vector (s1, s2, s3, ..., sm), classify.]
[Figure: construction of the opcode text vector library. Data set → use phpdbg to extract opcode → opcode text set → N-gram pretreatment → word frequency matrix → TF-IDF processing → opcode text vector library.]
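The two processing steps named in the figure, N-gram counting followed by TF-IDF weighting, can be sketched in plain Python as below. This is a hedged illustration, not the authors' code: the smoothed IDF formula and the L2 row normalization follow scikit-learn's default conventions and are assumptions here, and the opcode strings in the usage note are toy inputs rather than real phpdbg output.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def opcode_ngrams(opcodes: str, n: int = 3) -> List[NGram]:
    """Split a whitespace-separated opcode string into consecutive n-grams."""
    tokens = opcodes.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fit_vocabulary(docs: Sequence[str], n: int = 3) -> Dict[NGram, int]:
    """Assign a column index to every n-gram seen in the given documents."""
    grams = sorted({g for d in docs for g in opcode_ngrams(d, n)})
    return {g: i for i, g in enumerate(grams)}


def count_matrix(docs: Sequence[str], vocab: Dict[NGram, int],
                 n: int = 3) -> List[List[int]]:
    """Word frequency matrix: one row per document, one column per n-gram."""
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for gram, count in Counter(opcode_ngrams(d, n)).items():
            if gram in vocab:  # n-grams outside the vocabulary are ignored
                row[vocab[gram]] = count
        rows.append(row)
    return rows


def tfidf(counts: List[List[int]]) -> List[List[float]]:
    """Weight counts by smoothed IDF, then scale each row to unit L2 norm."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted_rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return weighted_rows
```

For example, `tfidf(count_matrix(docs, fit_vocabulary(docs)))` with `docs = ["ASSIGN INCLUDE_OR_EVAL RETURN", "ASSIGN INIT_FCALL SEND_VAL DO_FCALL ECHO RETURN"]` yields one row per sample over a five-column vocabulary of 3-grams.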
Table 3: Github projects lists.

No.  Github projects
0    JohnTroony/PHP-Webshells
1    xl7dev/Webshell
2    ysrc/Webshell-sample
3    tennc/Webshell
4    BlackArch/Webshells
5    JoyChou93/Webshell
6    bartblaze/PHP-backdoors
7    WangYihang/Webshell-sniper
8    tanjiti/WebshellSample
9    tdifg/Webshell
10   LandGrey/Webshell-detect-bypass
11   backlion/Webshell
12   Webshellpub/awsome-Webshell
13   x-o-r-r-o/PHP-Webshells-Collection
14   S9MF/S9MF-PHP-Webshell-bypass
15   backdoorhub/shell-backdoor-list
16   amitnaik/PHP-backdoor

Table 4: Content management system lists.

#  CMS        Version
0  Wordpress  5.4
1  Joomla     3.9.16
2  Laravel    7.6.2
3  PHPBB      3.3.0
4  Typecho    1.1
5  ThinkPHP   5.0.24
6  Seacms     10.1
7  MetInfo    7.0.0
8  DiscuzX    3.4

the truth that most PHP language webshell had executable data characteristics of PHP code. We used the same approach to collect 9,736 nonrepeated samples of normal PHP web pages from 9 well-known open-source web content management systems (CMS). The relevant CMS are shown in Table 4.

We randomly divided the 2,917 webshell samples and 9,736 normal web page samples into a training set and a test set at a ratio of 7 : 3; that is, the training set consists of 2,044 webshell samples and 6,815 normal web page samples, totaling 8,859, and the test set consists of 873 webshell samples and 2,921 normal web page samples, totaling 3,794. After the assignment was completed, the normal web samples and webshell samples for training were combined to form the training set of the model, and the normal web samples and webshell samples for classification testing were mixed to form the test set of the model. In previous research, the data set is usually preprocessed uniformly and only then divided into the training set and the test set. Since we use the N-gram and TF-IDF algorithms to extract the opcode text vector library, the training set and the test set would be correlated through the opcode text vector library if the data were preprocessed together. To eliminate this correlation, we divided the training set and test set before preprocessing. The preprocessing of the data and the construction of the feature matrix are carried out separately to ensure that the data in the training set and the test set are uncorrelated and to ensure the validity of the experimental results. We uploaded the cleaned data set to a Github project for use in subsequent webshell detection research; it can be downloaded from https://github.com/Cyc1e183/PHP-Webshell-Dataset.

4.2. Algorithm Parameter Setting. To better compare the experimental results and analyze the impact of the executable data characteristics of PHP code on model performance, we adopt the random forest (RF) algorithm [25], the support vector machine (SVM) algorithm [14], and the multilayer perceptron (MLP) [26] with the same training set for webshell classification. We do not explore optimal parameter settings. In the experiment, we set the N value of the N-gram algorithm used in opcode text vector library processing to 3. For the random forest algorithm, the sample-set segmentation strategy was set to information entropy, the number of trees was set to 100, the random seed was set to 2, and the remaining parameters were left at their defaults. For the support vector machine algorithm, the kernel type was set to the linear kernel function, the penalty factor was set to 1, and the remaining parameters were left at their defaults. For the multilayer perceptron, the weight optimization algorithm was set to a stochastic gradient-based optimization algorithm, the regularization parameter was set to 0.0001, and one hidden layer with 100 hidden units was used, with the logistic function as the hidden layer activation function. The random seed was set to 1, the maximum number of iterations was set to 150, and the initial learning rate was set to 0.089. The remaining parameters were left at their defaults.

4.3. Experimental Results and Analysis. We divided the experiment into three groups. All groups used the same parameters for the random forest algorithm, support vector machine algorithm, and multilayer perceptron. The first group combined the opcode text vector library, sample static statistical features, and executable data characteristics of the PHP code into a feature matrix to build a model (we call it model 1). The second group used the opcode text vector library combined with static statistical features as a feature matrix to build a model (we call it model 2), and the last group used the opcode text vector library and executable data characteristics of PHP code as a feature matrix to build a model (we call it model 3). The three experiments are designed to compare the executable data characteristics of the PHP code with static statistical features. They also show whether the executable data characteristics of PHP code affect the performance of the detection model.
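The protocol described above (split first, then build the opcode text vector library, then train the three classifiers with the stated parameters) might look like the following scikit-learn sketch. Several points are assumptions rather than facts from the paper: the use of scikit-learn itself, reading "separately" as fitting the vectorizer on the training set and only transforming the test set, leaving the MLP solver at the library default because the text does not name one unambiguously, and the helper names `build_classifiers` and `run_experiment`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def build_classifiers():
    """Classifiers configured with the parameters listed in Section 4.2."""
    return {
        # Entropy split criterion, 100 trees, random seed 2.
        "random_forest": RandomForestClassifier(
            n_estimators=100, criterion="entropy", random_state=2),
        # Linear kernel, penalty factor 1.
        "svm": SVC(kernel="linear", C=1.0),
        # One hidden layer of 100 logistic units, alpha 0.0001, seed 1,
        # at most 150 iterations, initial learning rate 0.089.
        # The solver is left at the library default (an assumption).
        "mlp": MLPClassifier(
            hidden_layer_sizes=(100,), activation="logistic", alpha=0.0001,
            learning_rate_init=0.089, max_iter=150, random_state=1),
    }


def run_experiment(opcode_texts, labels, seed=0):
    """Split 7:3 first, then vectorize, so the test set never shapes the vocabulary."""
    train_x, test_x, train_y, test_y = train_test_split(
        opcode_texts, labels, test_size=0.3, random_state=seed, stratify=labels)

    # 3-grams over whitespace-separated opcode names, weighted by TF-IDF.
    vectorizer = TfidfVectorizer(
        ngram_range=(3, 3), token_pattern=r"\S+", lowercase=False)
    train_matrix = vectorizer.fit_transform(train_x)  # fitted on training data only
    test_matrix = vectorizer.transform(test_x)        # reuses the training vocabulary

    scores = {}
    for name, clf in build_classifiers().items():
        clf.fit(train_matrix, train_y)
        scores[name] = clf.score(test_matrix, test_y)  # accuracy on the test set
    return scores
```

Fitting the vectorizer after the split is what keeps document frequencies from the test samples out of the training features, which is the correlation the paragraph above sets out to avoid.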
Table 5: Confusion matrices of each group of experiments.

Model name  Classification algorithm  Actual    Predicted positive  Predicted negative
Model 1     Random forest             Positive  TP = 869∗           FN = 4
                                      Negative  FP = 9              TN = 2912∗
            Support vector machine    Positive  TP = 865∗           FN = 8
                                      Negative  FP = 14             TN = 2907
            Multilayer perceptron     Positive  TP = 867∗           FN = 6
                                      Negative  FP = 12             TN = 2909∗
Model 2     Random forest             Positive  TP = 854            FN = 19
                                      Negative  FP = 9              TN = 2912∗
            Support vector machine    Positive  TP = 806            FN = 67
                                      Negative  FP = 12             TN = 2909∗
            Multilayer perceptron     Positive  TP = 848            FN = 25
                                      Negative  FP = 24             TN = 2897
Model 3     Random forest             Positive  TP = 867            FN = 6
                                      Negative  FP = 16             TN = 2905
            Support vector machine    Positive  TP = 863            FN = 10
                                      Negative  FP = 34             TN = 2887
            Multilayer perceptron     Positive  TP = 859            FN = 14
                                      Negative  FP = 17             TN = 2904

∗The maximum value of the same algorithm in different models.
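As a worked check, the four counts of a confusion matrix (true positives TP, false negatives FN, false positives FP, true negatives TN) can be turned into accuracy, precision, recall, and F1 with the standard definitions. The sketch below applies them to the random forest row of model 1 from Table 5; the function name `evaluate` is only illustrative.

```python
def evaluate(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard evaluation indicators computed from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Model 1, random forest (counts taken from Table 5).
rf_model1 = evaluate(tp=869, fn=4, fp=9, tn=2912)
print({name: round(value, 4) for name, value in rf_model1.items()})
# {'accuracy': 0.9966, 'precision': 0.9897, 'recall': 0.9954, 'f1': 0.9926}
```

The rounded values agree with the random forest row reported for model 1 in Table 6.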
We marked the webshell sample as positive and the normal web page file sample as negative in the experiment. Confusion matrices of each group of experiments are shown in Table 5.

TP (true positive) indicates the number of webshell samples that the model recognizes correctly, and FN (false negative) indicates the number of webshell samples recognized by the classification model as normal web page files. In contrast, TN (true negative) indicates the number of normal web page files that the classification model correctly identifies, and FP (false positive) indicates the number of normal web pages that the classification model recognizes as webshell. Comparing the experimental data shows that after adding the executable data characteristics of PHP code, the FN and FP values in the confusion matrix are significantly reduced.

We can see from the data in Table 5 that, compared to model 2 and model 3, the random forest and MLP classifiers in model 1 achieve the maximum values of TP and TN, while the TP of the SVM classifier also achieves the maximum value. TP indicates the number of webshell samples that the model correctly identifies; being able to accurately identify webshell is the more critical property for system security. Compared with model 2, the TP of the three classifiers increases by an average of 31 in model 1; the value of TN is not significantly improved. Compared with model 2, model 3 has an average increase of 9 in the TP value of the three models' results.

To evaluate the experiments more accurately and to make comparison with other detection methods easier, we used five commonly used evaluation indicators for webshell detection: accuracy, precision, recall, F1 value, and a comprehensive evaluation using ROC curves [27]. The first four evaluation indicators are calculated as follows:

\begin{equation}
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + FN + FP + TN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall} &= \frac{TP}{TP + FN}, \\
F1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
\tag{7}
\end{equation}

Accuracy indicates the proportion of correctly predicted samples in the whole test set, and precision indicates the proportion of true-positive samples among all samples predicted as positive. Recall rate indicates the proportion of webshell samples that are correctly predicted. The F1 value is a comprehensive evaluation index combining precision and recall rate. The evaluation index results of the model are shown in Table 6.

Table 6: Evaluation index results of the models.

Model name  Classification algorithm  Accuracy  Precision  Recall   F1 score
Model 1     Random forest             0.9966∗   0.9897∗    0.9954∗  0.9926∗
            Support vector machine    0.9942∗   0.9841     0.9908∗  0.9874∗
            Multilayer perceptron     0.9953∗   0.9863∗    0.9931∗  0.9897∗
Model 2     Random forest             0.9926    0.9896     0.9782   0.9839
            Support vector machine    0.9792    0.9853∗    0.9233   0.9533
            Multilayer perceptron     0.9871    0.9725     0.9714   0.9719
Model 3     Random forest             0.9942    0.9819     0.9931   0.9875
            Support vector machine    0.9884    0.9621     0.9885   0.9751
            Multilayer perceptron     0.9918    0.9806     0.9840   0.9823

∗Comparing the same algorithm in different models, the maximum value of the indicator.

Model 1 uses a feature matrix that combines the opcode text vector library, sample static statistical features, and the executable data characteristics of PHP code. Model 2 adopts a feature matrix that only uses the combination of the opcode text vector library and sample static statistics. By comparing the experimental data of model 1 and model 2, we can see that model 1 performs better than model 2. The evaluation indicators improve considerably: the accuracy of all three algorithms exceeds 99.40%, with the highest reaching 99.66%; the recall rate increases by at least 1.8% and at most 6.75%; and the F1 value increases by 2.02% on average. The average value of precision does not change much. The data show that the executable data characteristics of PHP code can effectively improve the distinguishing ability of the model.

Comparing the experimental data of model 2 and model 3 leads to a further result. Specifically, the performance of the model constructed using the feature matrix that combines the opcode text vector library and the executable data characteristics of the PHP code (model 3) is slightly better than the performance of model 2. The accuracy rate, recall rate, and F1 value improve on average by 0.52%, 3.07%, and 1.19%, respectively. The data indicate that the executable data characteristics of PHP code can describe webshell better than static statistical features and have better discrimination ability. We can intuitively see the difference in the experimental results of the three models from Figure 4.

[Figure 4: Accuracy, precision, recall, and F1 score of the random forest, support vector machine, and multilayer perceptron classifiers for model 1, model 2, and model 3.]

Figures 5–7 show the ROC curves of the three groups of experiments, respectively. The horizontal coordinate is the false-positive rate (FPR), which represents the false alarm rate for normal web page files. The ordinate is the true-positive rate (TPR), which describes the accuracy of the classification of webshell files. An ideal detection model fully separates webshell and normal web files, with a TPR of 1 and an FPR of 0. In other words, the closer the area under the ROC curve is to 1, the higher the recognition accuracy of the detection model. A comparison between the ROC curves has validated the performance of our detection model.

[Figures 5 and 6: ROC curves. Horizontal axis: false positive rate; vertical axis: true positive rate.]

The above data show that the data executable feature of PHP code is an important grammatical feature of PHP webshell. This grammatical feature describes webshell better than traditional statistics-based text features and has a better ability to distinguish between webshell files and normal web pages. By adding
[Figure 7: ROC curves. Horizontal axis: false positive rate; vertical axis: true positive rate.]
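An ROC curve of the kind plotted in these figures, together with the area under it, can be computed from per-sample scores as sketched below. This is a generic illustration, not the authors' plotting code: it assumes each sample has a numeric score where higher means "more likely webshell", that labels are 1 for webshell and 0 for normal pages, and that both classes are present.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives  # both counts must be non-zero
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every webshell above every normal page produces a curve through (0, 1) and an area of 1, which is the ideal case described for TPR = 1 and FPR = 0.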
several popular webshell detection tools in terms of accuracy, precision, recall rate, F1 value, and ROC curve. It is confirmed that the executable data characteristics of PHP code are significant grammatical features of webshell and can effectively improve the performance of the detection model.

Data Availability

We uploaded the data set to Github at https://github.com/Cyc1e183/PHP-Webshell-Dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We are very thankful to Hui Guo, Zhijie Xie, and Zhihao Hu for their help in the preparation of the experiment and paper review. This research was funded by the National Key R&D Program “Cyberspace Security” (2017YFB0802900).

References

[1] J. Kim, D.-H. Yoo, H. Jang, and K. Jeong, “WebSHArk 1.0: a benchmark collection for malicious web shell detection,” Journal of Information Processing Systems, vol. 11, no. 2, pp. 229–238, 2015.
[2] T. D. Tu, C. Guang, G. Xiaojun, and P. Wubin, “Webshell detection techniques in web applications,” in Fifth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pp. 1–7, Hefei, China, 2014.
[3] B. Yong, X. Liu, Y. Liu, H. Yin, L. Huang, and Q. Zhou, “Web behavior detection based on deep neural network,” in 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1911–1916, 2018.
[4] Y. Tian, J. Wang, Z. Zhou, and S. Zhou, “CNN-webshell: malicious web shell detection with convolutional neural network,” in Proceedings of the 2017 VI International Conference on Network, Communication and Computing - ICNCC 2017, pp. 75–79, Kunming, China, 2017.
[5] H. Zhang, H. Guan, H. Yan et al., “Webshell traffic detection with character-level features based on deep learning,” IEEE Access, vol. 6, pp. 75268–75277, 2018.
[6] W. Yang, B. Sun, and B. Cui, Innovative Mobile and Internet Services in Ubiquitous Computing. IMIS 2018, Springer, 2018.
[7] J. Riordan, A. Wespi, and D. Zamboni, “How to hook worms [computer network security],” IEEE Spectrum, vol. 42, no. 5, pp. 32–36, 2005.
[8] J. Shukla, “Application sandbox to detect, remove, and prevent malware,” US Patent 11/769,297, 2008.
[9] S. Liuyang and F. Yong, “Webshell detection method research based on web log,” Journal of Information Security Research, vol. 1, p. 11, 2016.
[10] Y. Wu, Y. Sun, C. Huang, P. Jia, and L. Liu, “Session-based webshell detection using machine learning in web logs,” Security and Communication Networks, vol. 2019, Article ID 3093809, 11 pages, 2019.
[11] K. Thompson, “Programming techniques: regular expression search algorithm,” Communications of the ACM, vol. 11, no. 6, pp. 419–422, 1968.
[12] J. B. Fraley and J. Cannady, “The promise of machine learning in cybersecurity,” in SoutheastCon 2017, pp. 1–6, Charlotte, NC, USA, 2017.
[13] J. Hu, Z. Xu, D. Ma, and J. Yang, “Research of webshell detection based on decision tree,” Journal of Network New Media, vol. 6, 2012.
[14] Z. Meng, R. Mei, T. Zhang, and W. P. Wen, “Research of Linux WebShell detection based on SVM classifier,” Netinfo Security, vol. 5, pp. 5–9, 2014.
[15] B. Hu, “Research on webshell detection method based on Bayesian theory,” Science Mosaic, vol. 6, pp. 66–70, 2016.
[16] Y. Fang, Y. Qiu, L. Liu, and C. Huang, “Detecting webshell based on random forest with fasttext,” in Proceedings of the 2018 International Conference on Computing and Artificial Intelligence - ICCAI 2018, pp. 52–56, Chengdu, China, 2018.
[17] H. Cui, D. Huang, Y. Fang, L. Liu, and C. Huang, “Webshell detection based on random forest–gradient boosting decision tree algorithm,” in 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 153–160, Guangzhou, China, 2018.
[18] J. Ramos, “Using TF-IDF to determine word relevance in document queries,” in Proceedings of the first instructional conference on machine learning, pp. 29–48, 2003.
[19] T. Li, C. Ren, Y. Fu, J. Xu, J. Guo, and X. Chen, “Webshell detection based on the word attention mechanism,” IEEE Access, vol. 7, pp. 185140–185147, 2019.
[20] “PECL:: package:: vld,” 2020, http://pecl.PHP.net/package/vld.
[21] “PHP: PHPdbg-Manual,” 2020, http://www.PHP.net/PHPdbg.
[22] W. B. Cavnar and J. M. Trenkle, “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, vol. 161175, 1994.
[23] I. Neamtiu, J. S. Foster, and M. Hicks, “Understanding source code evolution using abstract syntax tree matching,” in Proceedings of the 2005 international workshop on Mining software repositories - MSR '05, pp. 1–5, 2005.
[24] “nikic/PHP-Parser: a PHP parser written in PHP,” 2020, https://github.com/nikic/PHP-Parser.
[25] T. K. Ho, “Random decision forests,” in Proceedings of 3rd international conference on document analysis and recognition, pp. 278–282, Montreal, QC, Canada, 1995.
[26] Z. Wang, J. Yang, M. Dai, R. Xu, and X. Liang, “A method of detecting webshell based on multi-layer perception,” Academic Journal of Computing & Information Science, vol. 2, no. 1, 2019.
[27] J. A. Hanley and B. J. McNeil, “The meaning and use of the area under a receiver operating characteristic (ROC) curve,” Radiology, vol. 143, no. 1, pp. 29–36, 1982.
[28] “D shield,” http://www.d99net.net/.
[29] BAIDU, “WEBDIR+webshell detector,” https://scanner.baidu.com.
[30] “php-malware-finder,” https://github.com/jvoisin/php-malware-finder.
[31] “SHELLPUB,” https://ml.shellpub.com/.