
2018 Sixteenth International Conference on ICT and Knowledge Engineering

Anomaly-Based Network Intrusion Detection System through Feature Selection and Hybrid Machine Learning Technique

Apichit Pattawaro
Master of Science in Information Technology, Department of Computer Science, Faculty of Science, Srinakharinwirot University, Bangkok, Thailand
[email protected]

Chantri Polprasert
Master of Science in Information Technology, Department of Computer Science, Faculty of Science, Srinakharinwirot University, Bangkok, Thailand
[email protected]

Abstract—In this paper, we propose an anomaly-based network intrusion detection system based on a combination of feature selection, K-Means clustering and an XGBoost classification model. We test the performance of our proposed system over the NSL-KDD dataset using the KDDTest+ dataset. A feature selection method based on attribute ratio (AR) [14] is applied to construct a reduced feature subset of the NSL-KDD dataset. After applying K-Means clustering, hyperparameter tuning of the classification model corresponding to each cluster is implemented. Using only 2 clusters, our proposed model obtains an accuracy of 84.41%, with a detection rate of 86.36% and a false alarm rate of 18.20% on the KDDTest+ dataset. The performance of our proposed model outperforms those obtained using the recurrent neural network (RNN)-based deep neural network and other tree-based classifiers. In addition, due to feature selection, our proposed model employs only 75 out of 122 features (61.47%) to achieve a level of performance comparable to models trained on the full set of features.

Keywords—Hybrid Clustering and Classification, NSL-KDD, network security
I. INTRODUCTION

Network Intrusion Detection Systems (NIDS) [1, 2] play a crucial role in protecting computer systems from network-based malicious attacks that could disrupt the services of the system. Providing a powerful and robust NIDS is a very challenging task due to several factors. For instance, with the growth of Internet traffic, which consists of a variety of data types traversing the network, a NIDS must be able to analyze these huge volumes of traffic and differentiate between normal and malicious behavior with acceptable accuracy. Typically, NIDS are broadly categorized into 2 types: 1) misuse-based systems (sometimes called signature-based) and 2) anomaly-based systems.

A misuse-based NIDS relies on an extensive database of attack signatures. Each signature is a set of rules corresponding to intrusion attacks that occurred in the past. Therefore, with an up-to-date signature database, a misuse-based system is very powerful in detecting past attacks. However, such a system is vulnerable to zero-day attacks and suffers from long processing times. An anomaly-based IDS detects attacks on the computer system by observing unusual traffic statistics or patterns. This system can be used to detect zero-day attacks, and results from the discovery of an attack can be added into the database for future detection through the misuse-based approach. However, an anomaly-based system suffers from a high false alarm rate, since normal traffic exhibiting unusual behavior can trigger the alarm of the system. In practical systems, a hybrid of the misuse-based and anomaly-based approaches is usually employed to mitigate the impact of both zero-day attacks and high false alarm rates.

Recently, anomaly-based NIDS employing machine learning techniques have gained widespread attention. A number of classification models, e.g. KNN [3], genetic algorithms [4], Random Forest (RF) [5, 6], Support Vector Machine (SVM) [7] and Extreme Gradient Boosting (XGBoost) [8], have been used to differentiate between normal and suspicious traffic. However, the feasibility of this approach is still limited due to poor detection performance. This could be due to many reasons, such as the diverse nature of traffic, imbalanced traffic classes and ineffective feature selection processes. To overcome this limitation, a number of papers utilize deep neural network (DNN) methods such as the recurrent neural network (RNN) [9], long short-term memory (LSTM) [10] and convolutional neural network (CNN) [11] for anomaly-based NIDS. Even though the DNN approaches exhibit enhanced detection performance, they require a significant amount of training time and a large volume of effective training data.

A hybrid approach for anomaly-based NID [12] is another promising alternative. By combining traditional ML models, significant performance improvements in terms of accuracy, precision and false alarm rate (FAR) are exhibited with acceptable additional computational complexity.

In this paper, we focus on the NID binary classification problem, where the system differentiates between normal and attack activities. We propose an anomaly-based network intrusion detection system based on a combination of feature selection, K-Means clustering and an XGBoost classification model. The reasons behind the selection of the XGBoost classifier are its strong performance, its variety of tunable hyperparameters, its fast implementation and its popularity among machine learning communities. We test the performance of our proposed system over the NSL-KDD dataset [13] using the KDDTest+ dataset. A feature selection method based on Attribute Ratio (AR) [14] is applied to construct a reduced feature subset of the NSL-KDD dataset. After applying K-Means clustering, hyperparameter tuning of the classification model corresponding to each cluster is implemented. Using only 2 clusters, our proposed model obtains a best accuracy of 84.41%, TPR of 86.36%, FPR of 18.20% and AUC of 0.922 for the KDDTest+ dataset. In addition, the performance of our proposed model outperforms those obtained from the RNN-based deep neural network and other tree-based classifiers. Moreover, due to feature selection, our proposed model employs only 75 out of 122 features (61.47%) to achieve a level of performance comparable to models trained on the full set of features.

The remainder of this paper is organized as follows. Section 2 discusses the NSL-KDD dataset. The proposed system is explained in Section 3. Results are evaluated and discussed in Section 4, and our findings are summarized in Section 5.
2. NSL-KDD DATASET

Previously, the KDD-Cup 99 dataset [15] was widely used to test the performance of anomaly-based intrusion detection systems. However, researchers [15] pointed out 2 critical issues, based on a statistical analysis of the dataset, that lead to over-simplistic prediction results. To circumvent this problem, they proposed the NSL-KDD dataset, which has the following advantages over KDD-Cup 99:

1) In NSL-KDD, many redundant and duplicate records encountered in KDD-Cup 99 are removed from the datasets.

2) To achieve a more accurate evaluation of different learning techniques, the number of selected records from each difficulty-level group is inversely proportional to the percentage of records in the original KDD dataset.

The NSL-KDD dataset is divided into 4 separate datasets:

1) KDDTrain+: This is the overall train dataset, consisting of 125,973 records.

2) KDDTrain+_20Percent: This is the train dataset consisting of only 20% of the total train dataset; it has 25,192 records.

3) KDDTest+: This is the test dataset, consisting of 22,544 records.

4) KDDTest-21: This dataset consists of 11,850 records. It is obtained by applying 21 machine learning models to the KDDTest+ dataset to predict the label of each record (Normal/Attack); records that are accurately predicted by all 21 models are discarded.

Table 2 lists the ratio of normal/attack records for each dataset in NSL-KDD. NSL-KDD categorizes attacks into 4 types, consisting of Denial-of-Service, Probe, Remote to Local and User to Root, as presented in Table 1. Each type of attack can be explained as follows:

1) Denial-of-Service (DoS): This type of attack overwhelms the target's resources (network, CPU or memory) so that typical operations cannot be performed as expected. Examples include sending a huge number of packets to the targeted server so that normal users cannot access it.

2) Probe: This type of attack involves port scanning to identify vulnerabilities in computer systems for further attacks.

3) Remote to Local (R2L): The attackers try to access unauthorized computer resources in order to destroy or modify operations of the targeted computer systems.

4) User to Root (U2R): The attackers try to gain access to unauthorized resources using root privileges.

Table 1. Type of Attacks

Category                | Attacks
DoS (Denial of Service) | neptune, pod, smurf, teardrop, processtable, apache2, mailbomb, back
Probe                   | ipsweep, nmap, portsweep, satan, mscan, saint
R2L (Remote to Local)   | warezmaster, multihop, httptunnel, ftp_write, named, snmpgetattack, xlock, sendmail, guess_passwd
U2R (User to Root)      | buffer_overflow, rootkit, ps, xterm

Table 2. Ratio of Normal/Attack records in each NSL-KDD dataset

Dataset    |         | Total  | Normal | Anomaly
KDDTrain+  | Number  | 125973 | 67343  | 58630
           | Percent | 100%   | 53.46% | 46.54%
KDDTest+   | Number  | 22544  | 9711   | 12833
           | Percent | 100%   | 43.08% | 56.92%
KDDTest-21 | Number  | 11850  | 4342   | 7508
           | Percent | 100%   | 36.64% | 63.36%

The NSL-KDD dataset contains 41 features categorized into 3 types: 3 nominal features, 6 binary features and 32 numeric features [13].
3. METHODS

Figure 1 illustrates the block diagram of the proposed NIDS model. We employ the KDDTrain+ dataset to train the proposed ML model and evaluate its performance in terms of accuracy, AUC, precision and recall using the KDDTest+ dataset.
Figure 1. Block diagram of the proposed NIDS model.

3.1) Data Preprocessing

There are three sub-processes within Data Preprocessing: One-Hot Encoding, Scaling and Feature Selection.

3.1.1) One-Hot Encoding and Scaling

We apply One-Hot Encoding to transform the 3 nominal features, protocol_type, service and flag, into 84 binary features (protocol_type yields 3 features, service yields 70 features and flag yields 11 features). In summary, after the One-Hot Encoding process, there are 122 features entering the Normalization process. During Normalization, we scale the dataset so that the mean of every feature is equal to zero and its standard deviation is equal to one.
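To make this step concrete, the following is a minimal sketch of the preprocessing stage using pandas and scikit-learn. The DataFrame names train_df and test_df are hypothetical placeholders; the paper does not publish its implementation.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# The 3 nominal NSL-KDD features expanded by One-Hot Encoding.
nominal = ['protocol_type', 'service', 'flag']

# One-hot encode; align the test columns to the training layout so the
# feature count stays at 122 even if a category is absent from the test set.
X_train = pd.get_dummies(train_df.drop(columns=['label']), columns=nominal)
X_test = pd.get_dummies(test_df.drop(columns=['label']), columns=nominal)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

# Scale every feature to zero mean and unit standard deviation.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)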

3.1.2) Feature Selection

To enhance model efficiency, reduce computational complexity and remove irrelevant features, we implement feature selection based on calculating the AR. This value is used to determine the importance of every feature, and its calculation can be explained as follows. In the AR approach, we employ the attribute average for numeric features and the attribute frequency for binary features. The AR of the jth feature, AR(j), is calculated as

    AR(j) = max_{i ∈ [0,4]} CR_i(j)                              (1)

where CR_i(j) is the Class Ratio of the jth feature for the ith label (i ∈ [0,4], where i = 0 for the Normal class, i = 1 for DoS, i = 2 for Probe, i = 3 for R2L and i = 4 for U2R). For the jth numeric feature, CR_i(j) is expressed as

    CR_i(j) = AVG_{i,j} / AVG_j                                  (2)

where AVG_{i,j} = C_{i,j} / N_{i,j}; C_{i,j} is the sum of the jth feature over records with the ith label, and N_{i,j} is the number of records with the ith label. The denominator

    AVG_j = (Σ_i C_{i,j}) / N_j                                  (3)

is the sum of the jth feature over all records divided by the total number of records (N_j). For the jth binary feature, CR_i(j) is written as

    CR_i(j) = Freq(1)_{i,j} / Freq(0)_{i,j}                      (4)

where Freq(1)_{i,j} is the number of records with the ith label whose jth feature is equal to one, and Freq(0)_{i,j} is the number of records with the ith label whose jth feature is equal to zero. Figure 2 displays the top ten highest-importance features obtained from Eq. (1). Features whose AR values are less than 0.01 are removed from the analysis. The threshold 0.01 is judiciously selected to obtain the best performance with acceptable computational complexity. After applying the feature selection method with a threshold of 0.01, only 75 out of 122 features (61.47%) are left to train the model.

Figure 2. A list of the top ten highest-importance features.
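As an illustration of Eqs. (1)-(4), the sketch below computes AR(j) with pandas under our reading of the formulas; X, y and binary_cols are placeholders, and this is not the authors' code.

import numpy as np
import pandas as pd

def attribute_ratio(X: pd.DataFrame, y: pd.Series, binary_cols) -> pd.Series:
    ar = {}
    for j in X.columns:
        col = X[j]
        crs = []
        for i in sorted(y.unique()):
            in_class = col[y == i]
            if j in binary_cols:
                # Eq. (4): ratio of ones to zeros within class i.
                ones = (in_class == 1).sum()
                zeros = (in_class == 0).sum()
                crs.append(ones / zeros if zeros > 0 else np.inf)
            else:
                # Eq. (2): class average divided by the overall average, Eq. (3).
                avg_ij = in_class.mean()
                avg_j = col.mean()
                crs.append(avg_ij / avg_j if avg_j != 0 else 0.0)
        ar[j] = max(crs)  # Eq. (1): AR(j) = max_i CR_i(j)
    return pd.Series(ar)

# Keep only features meeting the 0.01 threshold from the paper:
# selected = ar_values[ar_values >= 0.01].index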
3.1.3) Clustering

The main objective of applying K-Means clustering to the NSL-KDD dataset is to group normal and attack traffic that exhibit similar patterns into the same partitions. Then, the ML model corresponding to each partition is trained to differentiate normal from attack data within that group. To determine the number of clusters (K), we run K-Means clustering on the NSL-KDD dataset with different numbers of clusters and evaluate performance using the sum of squared errors (SSE). Figure 3 shows the SSE of K-Means clustering over a range of K, where random_state is equal to 3425. From the figure, SSE yields its highest drop (23.89%) when K is increased from 1 to 2. SSE gradually decreases for higher values of K (the SSE drop is reduced to 9.30% when K is increased from 2 to 3). In addition, no significant number of records is assigned to a new group when K is greater than 10.

Figure 3. SSE of K-Means clustering vs. the number of clusters (K).
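This elbow analysis can be reproduced with scikit-learn, where KMeans exposes the SSE as inertia_; the sketch below uses random_state=3425 from the paper, while the K range of 1-10 is our assumption based on the discussion above.

from sklearn.cluster import KMeans

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=3425)  # random_state from the paper
    km.fit(X_train_scaled)
    sse[k] = km.inertia_  # sum of squared distances to the closest centroid

# The relative drop between successive K values identifies the elbow (K=2).
drops = {k: (sse[k - 1] - sse[k]) / sse[k - 1] for k in range(2, 11)}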
3.1.4) XGBoost Classifier

XGBoost was designed for speed and performance based on gradient-boosted decision tree algorithms. It provides the benefit of algorithmic enhancements and model tuning, and it can be deployed in different computing environments. In addition, it allows the addition or tuning of regularization parameters to mitigate the impact of over-fitting.
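To make the cluster-then-classify flow concrete, below is a minimal sketch (our construction, not the authors' released code) that trains one XGBoost model per K-Means cluster and routes each test record to the model of its assigned cluster; y_train is a placeholder label vector.

import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBClassifier

# Partition the training data into K=2 clusters, as chosen in Section 3.1.3.
kmeans = KMeans(n_clusters=2, random_state=3425).fit(X_train_scaled)
train_cluster = kmeans.labels_

# Train a separate classifier on each cluster's records.
models = {}
for c in range(2):
    mask = train_cluster == c
    model = XGBClassifier()  # tuned per cluster in the paper (see Table 3)
    model.fit(X_train_scaled[mask], y_train[mask])
    models[c] = model

# Route each test record to the model of its nearest cluster.
test_cluster = kmeans.predict(X_test_scaled)
y_pred = np.empty(len(X_test_scaled), dtype=int)
for c in range(2):
    mask = test_cluster == c
    y_pred[mask] = models[c].predict(X_test_scaled[mask])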
3.2) EVALUATION METRICS

We evaluate the performance of our proposed model using Accuracy, True Positive Rate (TPR, or Recall) and False Positive Rate (FPR). In addition, we use the Area Under the Curve (AUC) as an overall measure of performance across all possible classification thresholds. The accuracy metric can be written as

    Accuracy = (TP + TN) / (TP + FN + TN + FP)                   (5)

where TP, TN, FP and FN represent True Positives, True Negatives, False Positives and False Negatives, respectively. Recall, or TPR, is the ratio of items correctly classified as attack to all items that are actually attacks, and can be written as

    TPR = TP / (TP + FN)                                         (6)

FPR is the ratio of items incorrectly classified as attack to all items that belong to the normal class, and can be written as

    FPR = FP / (FP + TN)                                         (7)
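For reference, Eqs. (5)-(7) map directly onto a confusion matrix; the sketch below (our construction) computes them with scikit-learn, where y_true, y_pred and y_score are placeholders and the attack class is the positive label.

from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + fn + tn + fp)   # Eq. (5)
tpr = tp / (tp + fn)                         # Eq. (6), detection rate / recall
fpr = fp / (fp + tn)                         # Eq. (7), false alarm rate
auc = roc_auc_score(y_true, y_score)         # threshold-free overall measure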
4. RESULTS

We compare the performance of our proposed hybrid ML system in terms of accuracy, TPR, FPR and AUC with the RNN approach and other tree-based classifiers. We use the KDDTrain+ dataset to train our model and evaluate its performance over the KDDTest+ dataset. Our hybrid model employs feature selection and K-Means clustering followed by an XGBoost prediction model to differentiate between normal and attack traffic. We perform feature selection based on AR values, where features with an AR value less than 0.01 are discarded from the analysis. This threshold is judiciously selected to obtain maximum performance. For K-Means clustering, as presented in Fig. 3, we use two clusters (K=2) due to the steepest drop in SSE. For hyperparameter tuning of the XGBoost model corresponding to each cluster, we employ the RandomizedSearchCV algorithm to select hyperparameters that yield the best model performance in terms of accuracy. The set of hyperparameters we are interested in is presented in Figure 4. The following are some of the hyperparameters we use in our model:

- n_estimators: The total number of trees used in the model.
- max_depth: The maximum depth of each tree. The higher the value, the more complex the model becomes.
- learning_rate: This parameter mitigates the over-fitting problem. It controls the step-size shrinkage and weighting factors for corrections when adding new trees to the model.
- subsample: The fraction of samples to be randomly selected for each tree.
- colsample_bytree: The fraction of columns to be randomly sampled for each tree.
- colsample_bylevel: The subsample ratio of columns for each split, at each depth level of the tree.
- min_child_weight: The minimum sum of weights of all observations required in a child. It is used to control over-fitting.
- gamma: The minimum loss reduction required to make a split.
- reg_lambda: The L2 regularization term on weights. It is used to manage the regularization part of the XGBoost loss function.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, 15, 20],
    'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
    'subsample': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [0.4, 0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
    'gamma': [0, 0.25, 0.5, 1.0],
    'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0]
}

Figure 4. List of XGBoost hyperparameters.
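A minimal sketch of how this per-cluster search could be set up with scikit-learn and xgboost is shown below; X_c and y_c stand for one cluster's training records, and n_iter is our assumption, since the paper does not state the number of sampled configurations.

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    estimator=XGBClassifier(objective='binary:logistic'),
    param_distributions=param_grid,  # the grid from Figure 4
    n_iter=50,           # number of sampled hyperparameter sets (assumption)
    scoring='accuracy',  # the paper tunes for accuracy
    cv=5,                # 5-fold cross-validation, as in the paper
    random_state=0,
)
search.fit(X_c, y_c)
best_model_for_cluster = search.best_estimator_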
For each set of hyperparameters, we implement 5-fold cross-validation to validate the performance of our model in each cluster. Table 3 lists the selected set of XGBoost hyperparameters for each cluster. From Table 3, after hyperparameter tuning, both clusters yield identical hyperparameters.

Table 3. A list of XGBoost hyperparameters for each cluster.

With 2 clusters employing the hyperparameters listed in Table 3, the performance of our proposed model is exhibited in Table 4. From the table, the proposed model with K=2 yields accuracy equal to 84.41%, TPR = 86.36%, FPR = 18.20% and AUC = 0.84. We compared the performance of our model over a range of cluster counts and found that K=2 yields the highest accuracy and AUC. TPR and FPR are of the same order over the range of cluster counts presented in Table 4. This justifies our selection of K = 2.

Table 4. Performance of the proposed model in terms of accuracy, TPR and FPR over a range of K-Means cluster counts.

Table 5 lists the top ten highest-importance features of both clusters. From the table, both clusters exhibit a similar pattern, where the src_bytes feature yields the highest importance for both clusters.

Table 5. A list of the top ten highest-importance features of cluster 0 and cluster 1 on the KDDTrain+ dataset

Feature Name                 | Cluster 0 | Cluster 1
src_bytes                    | 0.140192  | 0.132889
dst_bytes                    | 0.074614  | 0.062317
dst_host_srv_count           | 0.063247  | 0.070306
dst_host_diff_srv_rate       | 0.056252  | 0.060719
dst_host_count               | 0.052171  | 0.057257
dst_host_same_src_port_rate  | 0.050131  | 0.045806
duration                     | 0.042845  | 0.047137
dst_host_rerror_rate         | 0.041387  | 0.040213
dst_host_same_srv_rate       | 0.041096  | 0.045806
dst_host_srv_diff_host_rate  | 0.037015  | 0.033289

The performance of our proposed model in terms of accuracy is compared with those from the RNN and other tree-based classifiers (Random Forest and Adaboost) in Table 6. From the table, our proposed model obtains the highest accuracy compared to the others for both the KDDTrain+ and KDDTest+ datasets. Due to clustering, each XGBoost classifier is trained using data exhibiting a similar pattern within its cluster, and this improves the detection performance compared to that obtained by training the classifier without clustering [16] (Accuracy = 77%, TPR = 62% and FPR = 3%). This could be one of the main reasons that contribute to its superiority over the RNN model. In comparison with the RF and Adaboost models, with more customizable hyperparameter selection, the proposed model yields superior performance. In addition, by implementing feature selection using AR, our proposed model uses only 75 out of 122 features (61.47%) to achieve strong performance comparable to models trained on the full 122 features.
Table 6. Comparison of the accuracy metric

Model                        | KDDTrain+ | KDDTest+
Baseline                     | 99.81%    | 83.28%
K-Means+XGBoost (Our model)  | 99.85%    | 84.41%
K-Means+Random Forest        | 99.67%    | 75.67%
K-Means+Adaboost             | 99.61%    | 72.64%

5. CONCLUSIONS

We proposed a hybrid machine learning technique for network intrusion detection based on a combination of feature selection, K-Means clustering and XGBoost classification models. We tested the performance of our proposed system over the NSL-KDD (KDDTrain+, KDDTest+) dataset. A feature selection method based on AR is applied to construct a reduced feature subset of the NSL-KDD dataset. After applying K-Means clustering, hyperparameter tuning of the classification model corresponding to each cluster is implemented. Our proposed model obtains a best accuracy of 84.41%, with a detection rate of 86.36%, a false alarm rate of 18.20% and an AUC of 0.84 for the KDDTest+ dataset. In addition, the performance of our proposed model in terms of accuracy outperforms that obtained from the RNN-based deep neural network. Due to feature selection, our proposed model employs only 75 out of 122 features (61.47%) to achieve a level of performance comparable to models trained on the full set of features.

REFERENCES

[1] Buczak, Anna L., and Erhan Guven. "Using Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection." International Journal of Recent Trends in Engineering and Research, vol. 3, no. 4, 2017, pp. 109-111.

[2] Vemuri, V. Rao. "Cyber-Security and Cyber-Trust." Enhancing Computer Security with Smart Technology, 2005, pp. 1-8.

[3] Rao, B. Basaveswara, and K. Swathi. "Fast KNN Classifiers for Network Intrusion Detection System." Indian Journal of Science and Technology, vol. 10, no. 14, Jan. 2017, pp. 1-10.

[4] Li, Fan. "Hybrid Neural Network Intrusion Detection System Using Genetic Algorithm." 2010 International Conference on Multimedia Technology, 2010, pp. 597-602.

[5] Farnaaz, Nabila, and M. A. Jabbar. "Random Forest Modeling for Network Intrusion Detection System." Procedia Computer Science, vol. 89, 2016, pp. 213-217.

[6] Zhang, Jiong, et al. "Random-Forests-Based Network Intrusion Detection Systems." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 5, 2008, pp. 649-659.

[7] Pervez, Muhammad Shakil, and Dewan Md. Farid. "Feature Selection and Intrusion Classification in NSL-KDD Cup 99 Dataset Employing SVMs." The 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014), 2014.

[8] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 2016, pp. 785-794.

[9] Yin, Chuanlong, et al. "A Deep Learning Approach for Intrusion Detection Using Recurrent Neural Networks." IEEE Access, vol. 5, 2017, pp. 21954-21961.

[10] Staudemeyer, Ralf C. "Applying Long Short-Term Memory Recurrent Neural Networks to Intrusion Detection." South African Computer Journal, vol. 56, no. 1, Nov. 2015, pp. 136-154.

[11] Li, Zhipeng, et al. "Intrusion Detection Using Convolutional Neural Networks for Representation Learning." Neural Information Processing, Lecture Notes in Computer Science, 2017, pp. 858-866.

[12] Kuang, Fangjun, et al. "A Novel Hybrid KPCA and SVM with GA Model for Intrusion Detection." Applied Soft Computing, vol. 18, 2014, pp. 178-184.

[13] "NSL-KDD Dataset." University of New Brunswick, www.unb.ca/cic/datasets/nsl.html.

[14] Sang-Hyun, Choi, and Chae Hee-Su. "Feature Selection Using Attribute Ratio in NSL-KDD Data." International Conference on Data Mining, Civil and Mechanical Engineering (ICDMCME 2014), Bali, Indonesia, 4-5 Feb. 2014, pp. 90-92.

[15] Tavallaee, Mahbod, et al. "A Detailed Analysis of the KDD CUP 99 Data Set." 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, 2009.

[16] "Network Intrusion Detection." NYC Data Science Academy Blog, nycdatascience.com/blog/student-works/network-intrusion-detection-2.
