2023 Scopus Ensemble Based Dimensionality
Abstract: In the present digital era, internet technologies are evolving rapidly, and as a result a large number of devices are connected to the public network. Consequently, a huge volume of data is generated and transmitted over the network. At the same time, attackers have devised a large number of ways to access information on the network. Because of the availability of the internet, almost every device is interconnected and exposed to data breaches. Hence, an Intrusion Detection System (IDS) is mandatory to ensure that information is secure from attackers. As technology develops, attackers identify new techniques to breach secure data. Machine learning approaches have been employed in different domains and have given better results in terms of performance and accuracy. In order to design an effective IDS, this study utilizes machine learning models. This article proposes an ensemble-based hybrid approach for intrusion detection in networks. In the initial stage, the important features are extracted using Blended Linear Discriminant Analysis (BLDA). The dimensionality-reduced dataset is then used to detect intrusions using a Random Forest classifier. Two benchmark datasets, namely NSL-KDD and UNSW-NB15, are used to evaluate the potential of the proposed method. To demonstrate the effectiveness of our approach, the proposed scheme is compared with classical LDA, PCA and PLS based feature selection schemes. The accuracy of the proposed method is 90.12 % and 91.0 % for the NSL-KDD and UNSW-NB15 datasets respectively. The results clearly show that our proposed method provides considerable improvement in the performance of IDS.

Keywords: Intrusion detection system, Dimensionality reduction, Random Forest Classifier, Linear Discriminant Analysis

I. INTRODUCTION
The latest developments in the internet and IoT have increased the number of gadgets attached to the public network and created security and privacy concerns. The security of the network is a major concern as malicious attacks are very frequent. Attackers formulate new approaches to breach the network and its security mechanisms [1]. An IDS is a mechanism designed to identify unethical activities and to safeguard data from malware and untrusted access in the network.
The fast computing capabilities of machine learning and deep learning models are used in developing security solutions for IoT environments [2]. Machine learning generally utilizes labelled data for performing classification tasks. The most commonly used machine learning method for classification is the Support Vector Machine (SVM). A large number of IDS methods involving SVM in various forms have been designed by researchers [3]. KNN is a traditional distance based classification strategy. Weighted KNN and optimization algorithms are also used in building IDS [4]. The Naïve Bayes classifier uses probability models to classify data based on the details of known data using Bayes' rule. Feature selection is the process of retrieving the important features, thereby reducing the problem of overfitting. A variety of schemes are utilized in existing methods for this purpose. A hybrid feature selection approach involving set theory has been employed to rank relevant features [5]. Nature based approaches are widely used to perform the feature retrieval task [6]. These approaches generally pick the attributes that reach a threshold score of relevance and importance. A hybrid approach encompassing black hole and PSO has been considered for selecting features for cancer detection [7].
Deep learning models are often reported to outperform classical machine learning models. They are capable of learning and performing the classification task automatically. Still, these DL models face certain challenges such as overfitting and the requirement of well-balanced data. A hybrid approach involving an OHDNN classifier and ECRF based feature selection produced a better ID scheme [8]. CNN is a deep learning model well suited for dealing with images. Due to its robustness it is employed in IDS. Using a CNN requires the features to be converted into images. These images are then used by the CNN for the detection of attacks [9]. Selecting the necessary features from the dataset and applying a CNN over the reduced dataset has shown prominent improvement over the performance of the traditional CNN model [10]. RNN models are widely used for sequential data classification problems. LSTM is a modified version of RNN that involves three gates and is capable of using previous data for processing. Since the size of the dataset is very large, LSTM is applied in developing IDS. These models are capable of performing both binary and multiclass classification in detecting intrusions [11]. The LSTM performance is further enhanced by selecting appropriate features using a grey wolf scheme [12]. Autoencoders are capable of reducing the data and are widely used for feature selection. The reduced datasets are then used by DL models to detect intrusions. Processing the reduced dataset provides the advantage of faster processing and accurate detection [13]. Hybrid deep learning approaches have shown significant improvement in the performance of deep learning models [14,15]. The challenges in designing an efficient IDS are unbalanced data, the requirement to select essential attributes
Authorized licensed use limited to: Kyungpook National Univ. Downloaded on March 20,2023 at 07:20:57 UTC from IEEE Xplore. Restrictions apply.
Proceedings of the 5th International Conference on Smart Systems and Inventive Technology (ICSSIT 2023)
IEEE Xplore Part Number: CFP23P17-ART; ISBN: 978-1-6654-7467-2
from the dataset, issues of overfitting, estimation overhead and performance considerations. These challenges are addressed by designing a hybrid of the BLDA approach and a random forest classifier.
With the objective of designing an efficient IDS that can produce better detection rates, the main contributions of the article are as follows:
• Selection of the most relevant features is carried out using the BLDA approach.
• The reduced features are then fed to a random forest classifier to determine the presence of an attack.
• The performance of the proposed IDS scheme is validated against existing methods and shows promising results.

II. PROPOSED METHODOLOGY
The IDS scheme in this article uses the BLDA approach for retrieving the necessary features of the dataset. The primary task is preprocessing of the data, which involves handling missing and null values. A one-hot encoder is used to transform the categorical attributes into numerical ones. Min-max normalization is then performed to scale the data into a uniform range. 80% of the dataset is used for training the model and its performance is evaluated using the remaining 20% as test data. The reduced features are then classified using the RF classifier. Fig. 1 gives the overall structure of the proposed IDS scheme.

A. Linear Discriminant Analysis (LDA)
LDA is a common strategy for feature selection. It transforms the data from a higher dimensional space to a lower dimensional space that linearly separates the classes and reduces overfitting. LDA consists of the following steps.
1. Compute the between-class variance, i.e. the distance between the means of the different classes, using the formula
$S_b = \sum_{i=1}^{n} N_i (\bar{X}_i - \bar{X})(\bar{X}_i - \bar{X})^T$ (1)
2. Estimate the within-class variance, i.e. the distance between the samples of a class and the mean of that class
$S_w = \sum_{i=1}^{n} \sum_{j=1}^{N_i} N_i (X_{i,j} - \bar{X}_i)(X_{i,j} - \bar{X}_i)^T$ (2)
3. Compute the eigenvalues λ1, λ2, …, λn and their corresponding eigenvectors v1, v2, …, vn, where n is 41 for the NSL-KDD and 49 for the UNSW-NB15 dataset. Calculate the transformation matrix as
$S_b X = \lambda S_w X$ (3)
4. Pick the first k eigenvectors from the list sorted in descending order of eigenvalue and construct the lower dimensional space having smaller within-class variance and higher between-class variance
$Y = X V_k$ (4)
Thus, LDA reduces the M-dimensional feature set to k features and eliminates the M-k features that are irrelevant. A feature is considered good if it is relevant to the class concept and, at the same time, not redundant. In our proposed method, in addition to the features generated using LDA, highly correlated features are included. The correlation between each independent feature and the dependent feature is computed, and the top ‘n’ highly correlated features are selected and included in the dataset used for training the classifier. To avoid including redundant features, the value of ‘n’ is kept small, within the range 1 to 5.
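The LDA steps and the blending of the top ‘n’ correlated features can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`lda_projection`, `top_correlated_features`, `blda_transform`) and the toy data are hypothetical, the within-class scatter is the standard textbook form (an unweighted sum of per-sample outer products), Pearson correlation is assumed as the correlation measure, and a pseudo-inverse is used so that a singular within-class matrix does not break the eigen-decomposition.

```python
import numpy as np

def lda_projection(X, y, k):
    """Project X onto the k leading discriminant directions (textbook LDA)."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))  # between-class scatter
    S_w = np.zeros((d, d))  # within-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += X_c.shape[0] * (diff @ diff.T)
        centered = X_c - mean_c
        S_w += centered.T @ centered
    # Eigen-decomposition of pinv(S_w) @ S_b; pinv guards against a singular S_w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(eigvals.real)[::-1][:k]  # largest eigenvalues first
    V_k = eigvecs[:, order].real
    return X @ V_k  # Y = X V_k

def top_correlated_features(X, y, n):
    """Indices of the n features with the largest |Pearson correlation| with y."""
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    corr = np.nan_to_num(corr)  # a constant column yields NaN; treat it as 0
    return np.argsort(corr)[::-1][:n]

def blda_transform(X, y, k, n):
    """Blend: k LDA components plus the n original features most correlated with y."""
    Y = lda_projection(X, y, k)
    idx = top_correlated_features(X, y, n)
    return np.hstack([Y, X[:, idx]])

# Toy data (hypothetical): 200 records, 6 numeric features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
Z = blda_transform(X, y, k=1, n=2)
print(Z.shape)
```

With two classes the between-class matrix has rank one, so k = 1 is the only informative LDA component in this toy setting; the blended output then has k + n columns.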
Fig. 1. Overview of Proposed IDS (flow: Input Data, Data Preprocessing, Training Set (80%) / Test Set (20%), Feature Selection, RF classifier model, Evaluate dataset, Attack / Normal)

B. Random Forest Classifier
Random Forest is an ensemble-based model for classification tasks, proposed by Breiman in 2001 [16]. It performs classification by creating multiple decision trees. These trees are built using split criteria such as Information Gain, Gini Index and Gain Ratio. The accuracy of a single decision tree is limited; when an ensemble of decision trees is used, the class with the maximum score is taken as the final result, thereby improving the accuracy compared to a single decision tree predictor. The steps followed in the random forest classifier are given below:
1. Transform the input dataset from a high dimensional space to a low dimensional space using Blended Linear Discriminant Analysis (BLDA).
2. Select K random points from the dimension-reduced dataset.
3. For each of these K points, construct a decision tree predictor. Use these decision trees to predict the result as attack or normal.
4. Make the final decision considering the maximum score of the individual decision tree predictors.
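The sampling, tree construction and voting steps above can be illustrated with a deliberately small NumPy sketch. It is a toy ensemble, not the full algorithm of [16] and not the authors' code: each "tree" is a depth-1 stump chosen by Gini impurity over a random subset of features, each stump is fitted on a bootstrap sample (one common reading of "select K random points"), and the final label is the majority vote. All names (`fit_stump`, `fit_forest`, `predict_forest`) and the synthetic data, which stands in for a BLDA-reduced dataset, are assumptions made for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 0/1 label array."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def fit_stump(X, y, rng):
    """Depth-1 tree: best Gini split over a random subset of the features."""
    n_features = X.shape[1]
    n_try = max(1, int(np.sqrt(n_features)))
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in rng.choice(n_features, size=n_try, replace=False):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if left.size == 0 or right.size == 0:
                continue
            score = (left.size * gini(left) + right.size * gini(right)) / y.size
            if best is None or score < best[0]:
                best = (score, f, t, int(left.mean() >= 0.5), int(right.mean() >= 0.5))
    if best is None:  # no valid split: predict the majority label everywhere
        majority = int(y.mean() >= 0.5)
        return (0, np.inf, majority, majority)
    return best[1:]

def predict_stump(stump, X):
    f, t, left_label, right_label = stump
    return np.where(X[:, f] <= t, left_label, right_label)

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap sample
        forest.append(fit_stump(X[idx], y[idx], rng))
    return forest

def predict_forest(forest, X):
    votes = np.stack([predict_stump(s, X) for s in forest])  # (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote: 1 = attack

# Synthetic stand-in for a reduced dataset: 4 noisy views of one latent signal.
rng = np.random.default_rng(42)
signal = rng.normal(size=300)
X = signal[:, None] + 0.3 * rng.normal(size=(300, 4))
y = (signal > 0).astype(int)  # 1 = attack, 0 = normal (hypothetical labels)
perm = rng.permutation(300)
train, test = perm[:240], perm[240:]  # 80% / 20% split
forest = fit_forest(X[train], y[train], n_trees=25, seed=0)
accuracy = (predict_forest(forest, X[test]) == y[test]).mean()
print(f"test accuracy: {accuracy:.2f}")
```

In practice a library implementation with full-depth trees would be used; the sketch only makes the bootstrap, random feature subset and voting steps concrete.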
In our proposed BLDA approach, the high dimensional dataset is first transformed to a low dimensional dataset using classic Linear Discriminant Analysis (LDA). Then the ‘n’ highly correlated features are added to the transformed dataset. The resulting optimized attributes are then employed to train the classifier model. The overall framework of our proposed method is depicted in Fig. 2.

A. Evaluation Parameters
The performance of the proposed BLDA approach is analyzed using two benchmark datasets and compared using three evaluation parameters: accuracy, specificity and sensitivity.
True Positive (TP): Count of attack type records correctly categorized as attack type.
False Positive (FP): Count of normal records wrongly categorized as attack type.
True Negative (TN): Count of normal records correctly categorized as normal.
False Negative (FN): Count of attack type records that are categorized as normal.
Accuracy measures the ratio of correctly classified instances to the total number of instances in the test set and is estimated using
$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$ (5)

Performance in % (NSL-KDD dataset)
Method      Accuracy  Specificity  Sensitivity
PCA + RF    84.71     85.51        83.8
LDA + RF    86.09     87.08        84.94
Corr + RF   83.41     85.71        80.08
PLS + RF    79.35     81.43        77.21

Fig. 3. Performance comparison of different models over NSL-KDD dataset

Our method also gives a significant improvement in performance for the UNSW-NB15 dataset. The comparison with other approaches is carried out using various metrics; the results are given in Table 2 and presented graphically in Fig. 4. It is observed that our approach gives a better classification accuracy of 91.08 % and a sensitivity of 88.75 %. PCA+RF gives 85.66 % accuracy, 88.07 % specificity and 82.73 % sensitivity. Among the other methods, LDA+RF gives 88.32 % accuracy, 91.25 % specificity and 86.34 % sensitivity, and the Corr+RF method gives 81.65 % accuracy, 87.28 % specificity and 78.66 % sensitivity.
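The three evaluation parameters can be computed from the four confusion counts as in the short sketch below. It assumes attack is the positive class, which is the reading consistent with the FP and FN definitions in this section, and it uses the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The function names and the example labels are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Return (TP, FP, TN, FN) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # attack correctly flagged
            else:
                fp += 1  # normal record wrongly flagged as attack
        else:
            if actual == positive:
                fn += 1  # attack missed (categorized as normal)
            else:
                tn += 1  # normal record correctly passed
    return tp, fp, tn, fn

def evaluate(y_true, y_pred, positive="attack"):
    """Accuracy, sensitivity and specificity as fractions in [0, 1]."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }

# Hypothetical test-set labels and predictions.
y_true = ["attack", "attack", "attack", "normal", "normal", "normal", "normal", "attack"]
y_pred = ["attack", "attack", "normal", "normal", "normal", "attack", "normal", "attack"]
metrics = evaluate(y_true, y_pred)
print(metrics)  # accuracy, sensitivity and specificity are all 0.75 here
```

Multiplying each value by 100 gives percentages comparable to those reported in the tables.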
Performance in %
Method      Accuracy  Specificity  Sensitivity
BLDA + RF   91.08     92.81        88.75
PCA + RF    85.66     88.07        82.73
LDA + RF    88.32     91.25        86.34
Corr + RF   81.65     87.28        78.66
PLS + RF    80.21     82.78        77.85

Fig. 4. Performance comparison of different models over UNSW-NB15 dataset

Fig. 5. Overall performance of proposed IDS

The proposed model is also used to perform multiclass classification to determine the attack class. The NSL-KDD dataset consists of four categories of attack.
features carried out by the BLDA approach has shown a significant improvement in performance over other feature selection strategies.

IV. CONCLUSION
In this article, a blended LDA based dimensionality reduction method is introduced for intrusion detection systems. The BLDA method effectively eliminates the features that are irrelevant for building the classification model and also takes the necessary measures to handle the redundant features present in the input dataset. The essential features of the dataset are extracted and then used to train the Random Forest classifier. The proposed model is then used to determine the presence of attacks in the test set. The effectiveness of our approach is evaluated using two benchmark datasets, namely NSL-KDD and UNSW-NB15, and its capability is compared with other ML methods using the evaluation metrics. The results clearly indicate that our proposed BLDA based intrusion detection system gives better performance in classifying network attacks.
Implementation of BLDA for detecting multiclass attacks using deep learning approaches can be carried out as future work. The model can also be trained on real-time datasets for detecting intrusions.

REFERENCES
[12] Karthic, S., Manoj Kumar, S., & Senthil Prakash, P.N. (2022). Grey wolf based feature reduction for intrusion detection in WSN using LSTM. International Journal of Information Technology.
[13] Rao, K.N., Rao, K.V., & Reddy, P.V. (2021). A hybrid Intrusion Detection System based on Sparse autoencoder and Deep Neural Network. Comput. Commun., 180, 77-88.
[14] Kim, B., Yuvaraj, N., Sri Preethaa, K.R., & Arun Pandian, R. (2021). Surface crack detection using deep learning with shallow CNN architecture for enhanced computation. Neural Computing and Applications, 33, 9289-9305.
[15] Kim, B., Yuvaraj, N., Sri Preethaa, K.R., Santhosh, R., & Sabari, A. (2020). Enhanced pedestrian detection using optimized deep convolution neural network for smart building surveillance. Soft Computing, 1-12.
[16] Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.