Development of Intrusion Detection Systems Using-1
BY
NOU183033484
COMPUTER SCIENCE
FOR
COMPUTER SCIENCE
CERTIFICATION
I hereby certify that this project contains work carried out by Odunjo Dorcas Damilola of the
satisfactorily completed the requirements for the award of a Bachelor of Science Degree in
Computer Science, Faculty of Science and Technology, National Open University of Nigeria
(NOUN).
..................................... .....................................
(Project Supervisor)
..................................... .....................................
..................................... .....................................
(Dean, Faculty of Sciences)
DEDICATION
This research project is dedicated to Almighty God for His grace, divine wisdom, knowledge and
understanding given to me to complete this project. I appreciate Him for His mercy and divine
ACKNOWLEDGEMENT
I want to use this opportunity to thank the Almighty God for the success of this work and for the strength
and wisdom bestowed on me from the beginning to the end. My appreciation goes to my Supervisor,
Dr Ayodele Oloyede, for his relentless efforts in this work and his suggestions, encouragement,
guidance and assistance where necessary. I extend my profound gratitude to all staff of the National Open
University of Nigeria (NOUN), Lagos Study Centre, Victoria Island, Lagos, for their contributions.
ABSTRACT
Intrusion Detection Systems (IDS) are important tools used to monitor network traffic and detect
unauthorized access or cyberattacks. Many older IDS methods find it difficult to detect new or
unknown threats, which can leave systems at risk. This project explores how a machine learning
algorithm called Two-Class Decision Forest, available in Microsoft Azure Machine Learning Studio,
can be used to build an IDS.
The goal of the project was to build a system that can accurately identify whether a network activity is
normal or harmful. I used a popular dataset called NSL-KDD, which contains labeled examples of
both normal and attack traffic. The data was cleaned and prepared before training the model.
After the model was trained in Azure ML Studio, it was tested on unseen data to check how well it
could predict network intrusions. I also created a simple web-based interface that connects to the
model, allowing users to test the system by entering sample values. The results showed that the
Two-Class Decision Forest algorithm could detect intrusions effectively with good accuracy.
This project shows how machine learning tools in Azure can be used to build useful security solutions,
and it gives a practical example of how such systems can be tested and used in real time.
TABLE OF CONTENTS
CERTIFICATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
1.2 Statement of the Problem
1.3 Aim and Objectives of the Study
1.4 Scope of the Study
1.5 Significance of the Study
1.6 Operational Definition of Terms
CHAPTER TWO
LITERATURE REVIEW
2.1 Overview
2.2 Conceptual Framework
2.3 Empirical Review
2.4 Theoretical Framework
2.5 Summary of Reviewed Related Works and Knowledge Gap
CHAPTER THREE
METHODOLOGY
3.1 Research Approach
i. Data Collection
ii. Data Preprocessing
iii. Model Development/Training: Decision Forests
iv. Evaluation Metrics
v. Analysis And Interpretation Of Results
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
In today’s world, the internet is used for almost everything, from banking and shopping to
communication and business operations. As more people and companies depend on online systems,
the need to protect sensitive information from cyberattacks has become more important than ever. One
key way to do this is by using Intrusion Detection Systems (IDS). IDS are programs that watch over
computer networks to spot any suspicious activities or security threats. They act like security guards
that alert users when something unusual is happening (Kumar and Patel, 2021).
Traditionally, most IDS were built to recognize attacks by comparing incoming traffic to a list of
known attack patterns. These are called signature-based systems. While they are good at identifying
past threats, they cannot detect new types of attacks that don’t match any known pattern. This
limitation makes it easy for modern cybercriminals to bypass them using unknown or modified
methods (Ali et al., 2023). To solve this problem, researchers have started using machine learning
(ML) in IDS.
Machine learning gives computers the ability to learn from data and recognize patterns without being
manually programmed. With this approach, IDS can be trained on large sets of network data to
understand what normal behavior looks like, and then detect anything that doesn’t fit that pattern.
ML-based IDS are more flexible and can detect new or unexpected attacks better than traditional
methods (Nguyen et al., 2022). They are also more scalable, meaning they can handle large amounts of
network data.
In this project, I used the Two-Class Decision Forest algorithm, which is a type of machine learning
model that combines multiple decision trees to make predictions. It works well for detecting attacks
because it is accurate, fast, and able to manage messy or unbalanced data. The model was built using
Microsoft Azure Machine Learning Studio, a platform that allows users to build ML models without
needing to write complex code. By training the model on a well-known dataset called NSL-KDD, this
project aims to build a working IDS that can help identify network intrusions in a real-world
setting.
1.2 Statement of the Problem
Many Intrusion Detection Systems (IDS) have been developed using different machine learning
algorithms, but not all of them perform well in real-world situations. For example, Logistic Regression
is commonly used for binary classification tasks, but it often struggles when the dataset has
overlapping features or when the relationship between inputs and outputs is not linear. In intrusion
detection, this can lead to wrong classifications, especially when the data is complex (Olaoye et al.,
2022).
Another popular method, Support Vector Machines (SVMs), can handle both linear and non-linear
data, but it becomes very slow and memory-intensive when working with large datasets. SVMs also
require a lot of tuning and adjustment to work properly, which makes them harder to use in practical
scenarios (Yassin and Alshamrani, 2021). Similarly, K-Nearest Neighbors (KNN) is simple to
understand but performs poorly with large data because it stores the entire dataset in memory and
becomes very slow during prediction. It is also highly sensitive to noise and irrelevant features (Li and
Huang, 2023).
Naive Bayes, another commonly used algorithm, assumes that all features are independent, which is
often not true for network data. This can lead to inaccurate results because relationships between
network activities are usually complex (Bello et al., 2024). Neural Networks, especially deep learning
models, have been used in some studies and can offer high accuracy. However, they require large
computing power and a lot of training time, and they are often seen as “black-box” models, meaning their
decisions are hard to explain.
In contrast, the Two-Class Decision Forest algorithm used in this project offers a better balance of
speed, accuracy, and simplicity. It builds multiple decision trees and combines their outputs to make
more reliable predictions. This method is less likely to overfit the training data and performs well even
with noisy or unbalanced datasets. It also does not require complex parameter tuning and is easier to
understand and explain compared to more advanced models like neural networks (Zhang et al., 2025).
Because of these advantages, the Two-Class Decision Forest was chosen for this project. It allowed the
model to be trained efficiently using Azure Machine Learning Studio and made it easier to test the
1.3 Aim and Objectives of the Study
This study aims to develop a decision forest model for an Intrusion Detection System (IDS).
Specific Objectives
3. Evaluating the performance of the Two-Class Decision Forest model by checking its precision,
1.4 Scope of the Study
This project focuses on building and testing an Intrusion Detection System (IDS) using the Two-Class
Decision Forest algorithm in Microsoft Azure Machine Learning Studio. The model was trained with
the NSL-KDD dataset, which includes labeled examples of both normal and attack network traffic.
● Creating a simple web-based interface that allows users to enter sample data and get results.
● Testing the model’s ability to detect common attack types such as Denial of Service (DoS),
Probe, Remote-to-Local (R2L), and User-to-Root (U2R).
1.5 Significance of the Study
This study is important because it shows how machine learning can be used to improve network
security in a simple and practical way. Many traditional Intrusion Detection Systems (IDS) are limited
because they rely only on known attack patterns. This project demonstrates how a machine learning
model trained using the Two-Class Decision Forest algorithm can help detect both common and
The use of Microsoft Azure Machine Learning Studio makes this project accessible, especially for
people without advanced programming skills. By using Azure’s drag-and-drop interface, the model
was built, trained, and evaluated with minimal code. This approach helps show that modern tools can
make machine learning more user-friendly and practical, even for small teams or individuals.
The NSL-KDD dataset, which was used in the project, provides a good variety of real-world attack
types. The project shows how this dataset can be used to train a model that detects major network
threats like DoS, Probe, R2L, and U2R. The model’s performance was measured using standard
metrics like accuracy, precision, recall, and F1-score, giving a clear picture of how well it works.
In addition, a simple web-based interface was created to show how the system could be used in
practice. This part of the project gives users a way to test the model by entering real-time values,
● Students and researchers by showing a clear and practical example of machine learning in
cybersecurity.
● IT professionals by offering a simple model that can be improved or expanded for real use.
● Organizations by providing a low-cost approach to testing IDS with machine learning.
By focusing on easy-to-use tools and clear results, the study adds to current research in cybersecurity
and helps bridge the gap between academic work and real-world applications (Zhang et al., 2025; Ali
et al., 2023).
1.6 Operational Definition of Terms
1. Intrusion Detection System (IDS):
A system that monitors network traffic and identifies suspicious or harmful activities. In this
project, IDS is built using a machine learning model to classify whether a connection is normal
or an attack.
2. Machine Learning (ML):
A branch of artificial intelligence that allows computers to learn from data and make decisions
without being manually programmed. In this project, ML is used to train the model to detect
network intrusions.
3. Two-Class Decision Forest:
A supervised machine learning algorithm that combines the results of many decision trees to
make a final prediction. It is used in this project to classify network traffic as either normal or
an intrusion. This method is known for being accurate, fast, and easy to understand.
4. Microsoft Azure Machine Learning Studio:
A cloud-based platform that allows users to build, train, and deploy machine learning models
using a visual, no-code interface. It was used to build and train the IDS in this project.
5. NSL-KDD Dataset:
A benchmark dataset for evaluating intrusion detection systems. It contains labeled records of
both normal and different types of attack traffic, such as DoS, Probe, R2L, and U2R. It was
used to train and test the model in this project.
6. False Positive Rate (FPR):
The percentage of normal network traffic that is wrongly flagged as an attack by the IDS. A
lower FPR is better.
7. Precision:
A measure of how many of the alerts raised by the IDS are actually correct. High precision
means the system does not raise too many false alarms.
8. Recall:
Also known as sensitivity, it measures how many real attacks the system successfully detects.
9. F1-Score:
A single number that combines both precision and recall. It gives a balanced measure of how
10. Denial of Service (DoS) Attack:
A type of cyberattack where a system is flooded with traffic to make it unavailable to users.
11. Remote-to-Local (R2L) Attack:
An attack where a remote user tries to gain unauthorized access to a system on the network.
12. User-to-Root (U2R) Attack:
A situation where a regular user tries to gain administrator or root access to a system. The
13. Probe Attack:
A type of attack where an attacker scans the network to find weak spots that could be used
later. This is one of the attack types the IDS can detect.
14. Web Interface:
A simple online page created in this project, where users can enter values and test if the model
CHAPTER TWO
LITERATURE REVIEW
2.1 Overview
Intrusion Detection Systems (IDS) are designed to monitor network activity and alert users when there
are signs of unauthorized access or attacks. Over the years, the rise in cybercrime has made these
systems a critical part of digital security. However, traditional IDS often rely on fixed rules or known
attack patterns, which makes them ineffective against new or evolving threats (Ali et al., 2023).
To address this limitation, researchers have explored the use of machine learning (ML) in intrusion
detection. ML allows systems to learn from data and detect unusual patterns that may signal an attack,
even if the attack is new or has never been seen before. This approach improves both detection speed
and accuracy, and has become one of the most promising areas in cybersecurity research (Nguyen et
al., 2022).
This project builds on that idea by using a specific ML algorithm, the Two-Class Decision Forest, to
classify network traffic as either normal or an intrusion. The goal is to improve detection accuracy and
reduce false alarms while keeping the system simple and efficient enough for practical use.
2.2 Conceptual Framework
This project is built around three key concepts: Intrusion Detection Systems (IDS), machine
learning algorithms (specifically, the Two-Class Decision Forest), and the NSL-KDD dataset.
● Intrusion Detection Systems (IDS):
IDS monitor data flowing through a network and try to identify any abnormal or malicious
activity. Traditional IDS often fail to detect new threats because they depend on signatures,
which are known patterns of past attacks. In recent studies, ML-based IDS have shown better
performance because they can detect unusual behavior, not just known attacks (Zhang et al.,
2025).
Fig 1. ML based systems efficiently analyze network traffic. (Wang et al., 2024).
Machine learning helps IDS learn from large sets of network data. Instead of relying on fixed
rules, the model is trained to recognize normal activity and then flag anything that doesn’t fit
that pattern. This makes the system more adaptable to modern attacks. Algorithms like
Decision Trees, Random Forests, and Support Vector Machines have been tested in IDS, but
many of them either require heavy tuning or become too slow with large datasets (Olaoye et
al., 2022). The Two-Class Decision Forest, however, combines multiple decision trees into one
strong model and performs well on classification tasks like intrusion detection (Chowdhury
and Sahu, 2023).
● NSL-KDD Dataset:
The NSL-KDD dataset is an improved
version of the KDD 1999 dataset, created to remove duplicate and redundant records. It
contains labeled examples of both normal and attack traffic, including DoS, R2L, U2R, and
Probe attacks. The dataset is considered balanced and more suitable for training ML models in
Fig 3. Flow chart of load identification decision Forests algorithm (Zhao et al., ResearchGate, 2022)
Together, these components form the base of this project: a smart IDS trained on NSL-KDD using a
Two-Class Decision Forest model, built and tested in Azure Machine Learning Studio.
Fig 5. Model Performance Metrics (MarkovML, 2023).
2.3 Empirical Review
Several researchers have worked on improving Intrusion Detection Systems (IDS) using machine
learning. Many of these studies tested different algorithms to see which one performs best for
detecting various types of network attacks. However, only a few of them focused specifically on
Decision Forest-based models or made use of platforms like Azure Machine Learning Studio for
implementation.
For example, Olaoye et al. (2022) compared multiple machine learning models, including Decision
Trees, Logistic Regression, and Naive Bayes, using the NSL-KDD dataset. Their results showed that
Decision Tree models had better detection accuracy, especially for complex attacks, but they warned
Zhang et al. (2025) expanded on this by testing ensemble learning methods, including Random Forests
and Decision Forests. Their findings showed that models like the Two-Class Decision Forest, which
combine many decision trees, are more stable, reduce errors, and handle unbalanced data better than
Nguyen et al. (2022) evaluated Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and
Decision Forests on intrusion detection tasks. They concluded that while SVMs had good precision,
they required long training time and fine-tuning. Decision Forests, on the other hand, delivered high
accuracy without much adjustment, making them more practical for quick deployment.
Another recent study by Bello et al. (2024) highlighted the importance of using clean and balanced
datasets. They pointed out that the NSL-KDD dataset remains relevant because it avoids many of the
issues found in older datasets, such as data duplication and imbalance, which can affect model
performance.
Lastly, Chowdhury and Sahu (2023) tested deep learning models like LSTM and CNN on network
intrusion data and reported high detection accuracy. However, they also noted that such models need
more computing power and are harder to explain. For student projects or small organizations, simpler
and more transparent algorithms like Decision Forests are often a better choice.
These studies support the decision to use a Two-Class Decision Forest for this project. It offers a good
balance between accuracy, speed, and ease of use, especially when implemented on a platform like
Azure Machine Learning Studio.
2.4 Theoretical Framework
The main theory behind this project is ensemble learning, which is a method in machine learning
where multiple models are combined to solve a problem better than using a single model. In this
project, the Two-Class Decision Forest algorithm is used. It is based on a type of ensemble method
called bagging (bootstrap aggregating), where many decision trees are trained separately and their
predictions are combined.
Each decision tree in the forest makes a prediction, and the final output is based on the majority vote
from all the trees. This helps to reduce the chances of making errors that one single tree might make.
Because of this, decision forests are more stable and accurate, especially when dealing with noisy or
unbalanced data.
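The bootstrap sampling and majority-vote ideas described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the three "trees" are hypothetical hand-written rules rather than trained models, the thresholds are invented for the example, and nothing here is meant to reflect the internal implementation of the Two-Class Decision Forest module in Azure.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) items with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(records) for _ in records]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees; thresholds are invented.
def tree_a(record):
    return "attack" if record["src_bytes"] > 10000 else "normal"


def tree_b(record):
    is_burst = record["duration"] == 0 and record["src_bytes"] > 5000
    return "attack" if is_burst else "normal"


def tree_c(record):
    return "attack" if record["num_failed_logins"] >= 3 else "normal"


def forest_predict(record, trees):
    """Ask every tree for a label, then combine the answers by voting."""
    return majority_vote([tree(record) for tree in trees])


# Each tree would normally be trained on its own bootstrap sample.
resampled = bootstrap_sample(list(range(10)), random.Random(0))

sample = {"src_bytes": 20000, "duration": 0, "num_failed_logins": 0}
print(forest_predict(sample, [tree_a, tree_b, tree_c]))
```

In this toy case two of the three rules flag the record, so the vote returns "attack" even though one rule disagrees, which is the error-smoothing effect the paragraph above describes.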
This theory works well for intrusion detection because cyberattacks can appear in many different
forms, and a single model might not be able to catch all of them. By using multiple trees trained on
slightly different parts of the data, the decision forest can handle a wide range of attack types more
The Two-Class Decision Forest also benefits from the divide-and-conquer principle, where the data is
split based on features to create decision paths. This structure makes the model easier to understand
In summary, the theoretical foundation of ensemble learning supports the use of decision forests in
IDS by improving accuracy, reducing overfitting, and making the system more reliable and
2.5 Summary of Reviewed Related Works and Knowledge Gap
From the literature reviewed, it is clear that machine learning has become a useful tool for improving
Intrusion Detection Systems (IDS). Many studies have explored different algorithms such as Support
Vector Machines, K-Nearest Neighbors, Logistic Regression, and Neural Networks. While these
models can perform well under certain conditions, they often require complex tuning, high computing
power, or produce results that are difficult to explain (Nguyen et al., 2022; Chowdhury and Sahu,
2023).
On the other hand, Decision Tree-based models have shown promising results due to their simplicity
and fast execution. Recent research suggests that combining multiple decision trees using ensemble
learning techniques like bagging can improve performance. The Two-Class Decision Forest algorithm
is one such method, offering good accuracy, low false positive rates, and better handling of unbalanced
data, all of which are important in IDS applications (Zhang et al., 2025; Bello et al., 2024).
However, there is still a gap in studies that show how these models can be applied in a simple and
practical way using modern tools like Azure Machine Learning Studio. Most research focuses on
complex models or environments that are not accessible to beginners or students. Also, very few
works demonstrate how to connect a trained model to an interface that allows users to interact with it
in real time.
This project addresses these gaps by:
● Using the Two-Class Decision Forest in a no-code, cloud-based platform (Azure ML Studio),
● And building a simple web interface to demonstrate how IDS can work in practice.
This approach not only supports the findings of past studies but also shows how intrusion detection
CHAPTER THREE
METHODOLOGY
3.1 Research Approach
The system was developed using Microsoft Azure Machine Learning Studio, a cloud-based platform
that allows users to build machine learning models using a visual, no-code environment. The project
● And creating a basic interface for users to test the system with real inputs.
The entire process was completed using Azure’s drag-and-drop Designer interface, along with a
Python notebook used at the end to export results. Each phase of the project is explained in the
sections below.
Fig 3.1. Framework of a developed Two-Class Decision Forest algorithm against the existing Naive
Bayes model.
i. Data Collection
The dataset used in this project is the NSL-KDD dataset, which is commonly used in intrusion
detection research. It is an improved version of the original KDD Cup 1999 dataset, created to remove
duplicate records and reduce data imbalance issues. The dataset includes labeled records of both
normal and attack network activities, making it ideal for training and testing a machine learning model
● R2L (Remote-to-Local) – when an attacker gains access to a machine from outside the
network.
The dataset was downloaded from Kaggle, where it was already cleaned and structured in CSV format.
It was then uploaded into Azure Machine Learning Studio for use in the training pipeline. In Azure
ML, the dataset was loaded into the workspace and connected directly to other modules in the
experiment.
The figure above shows the dataset in my Azure ML workspace and its connection to the
pipeline.
The dataset includes important features such as protocol type, service, source bytes, destination bytes,
duration, login status, and others. These were used by the model to learn how to classify network
traffic.
ii. Data Preprocessing
Before training the model, the dataset needed to be cleaned and prepared. This step is called data
preprocessing, and it helps improve the model’s accuracy by removing errors, fixing data types, and
selecting useful features. All preprocessing steps were done using the built-in modules in Azure
ML Studio.
1. Clean Missing Data
Although the NSL-KDD dataset from Kaggle was already cleaned, the Clean Missing Data module
was added to ensure that any unexpected gaps in the values were handled properly. This step ensures
The figure above shows the "Clean Missing Data" module in my pipeline.
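As a rough illustration of what this step guards against, the sketch below drops any record that contains a missing value. It rests on assumptions: the cleaning mode selected in the Clean Missing Data module is not stated here, so row removal is only one possibility, and the helper name drop_rows_with_missing and the sample records are hypothetical.

```python
def drop_rows_with_missing(records, missing_markers=(None, "")):
    """Keep only records in which every field holds a real value."""
    return [
        row for row in records
        if all(value not in missing_markers for value in row.values())
    ]


# Hypothetical sample records using NSL-KDD style column names.
records = [
    {"duration": 0, "protocol_type": "tcp", "src_bytes": 181},
    {"duration": None, "protocol_type": "udp", "src_bytes": 105},  # missing duration
    {"duration": 2, "protocol_type": "", "src_bytes": 1032},       # missing protocol
]

cleaned = drop_rows_with_missing(records)
print(len(cleaned))
```

Note that a legitimate value of 0 (for example a zero-second duration) is kept, since only None and the empty string are treated as missing in this sketch.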
2. Edit Metadata
Some of the data columns, especially the text-based ones like protocol_type, service, and
flag, were originally classified as string data. These were converted to the categorical data type using
the Edit Metadata module. This allowed the model to treat them as categories, which improves
classification.
The figure above shows how the Edit Metadata module was used to change data types.
3. Convert to Indicator Values
After updating the data types, the Convert to Indicator Values module was used to apply one-hot
encoding to the categorical columns. This step creates separate columns for each category (e.g., TCP,
UDP, ICMP) so that the model can understand and use them during training.
The figure above shows the Convert to Indicator Values module applied to the dataset.
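To make the one-hot encoding step concrete, the following minimal Python sketch expands a categorical column into one 0/1 indicator column per category. It is an illustration only: the helper name one_hot, the sample rows, and the "column-value" naming of the new columns are assumptions made for the example, not output copied from the Azure module.

```python
def one_hot(records, column):
    """Replace `column` with one 0/1 indicator column per distinct category."""
    categories = sorted({row[column] for row in records})
    encoded = []
    for row in records:
        # Copy every field except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}-{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows; protocol_type is one of the categorical NSL-KDD columns.
rows = [
    {"protocol_type": "tcp", "src_bytes": 181},
    {"protocol_type": "udp", "src_bytes": 105},
    {"protocol_type": "icmp", "src_bytes": 1032},
]

encoded_rows = one_hot(rows, "protocol_type")
for row in encoded_rows:
    print(row)
```

Each record ends up with exactly one indicator set to 1, so the model never has to interpret text values such as "tcp" directly.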
4. Select Columns in Dataset
This step was used to remove any columns that were not needed for training. Unnecessary features
(like certain flags or identifiers) were excluded to reduce noise and help the model focus on the most
useful inputs.
The figure above shows the "Select Columns in Dataset" module connected in my pipeline.
All these preprocessing steps were connected and executed in sequence within the Azure ML Studio
interface. This allowed the cleaned and transformed data to be passed directly into the training phase.
iii. Model Development/Training: Decision Forests
After preprocessing, the cleaned dataset was ready to be used for model training. In this project, the
Two-Class Decision Forest algorithm was selected to build the Intrusion Detection System. This
algorithm is available as a built-in module in Microsoft Azure Machine Learning Studio Designer and
is designed for binary classification problems, such as predicting whether a network activity is normal
or an attack.
This model works by combining many decision trees into one strong model. Each tree is trained on a
different part of the data, and their individual results are combined to make the final prediction. This
method helps improve accuracy, reduce overfitting, and handle unbalanced data better than single
decision trees.
1. The Train Model module was used to connect the processed data to the Two-Class Decision
Forest.
2. The target column (label) was set to the column indicating whether each record was “normal”
or an “attack”.
3. The Score Model module was added to generate predictions on the test data so that they could be compared with the actual labels.
4. The Evaluate Model module was used to show how accurate the model was using key metrics
The figure above shows the model pipeline with Train Model, Score Model, and Evaluate Model
connected.
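Because the algorithm is a two-class method, the target column must hold exactly two values. The sketch below shows one way such a binary label could be derived, assuming the raw label column holds attack names as in the original NSL-KDD files (the copy used in this project may already be binary). The helper name to_binary_label and the sample list are hypothetical.

```python
from collections import Counter


def to_binary_label(raw_label):
    """Collapse a raw label into the two classes used for training."""
    return "normal" if raw_label.strip().lower() == "normal" else "attack"


# Assumed raw labels; neptune, smurf and guess_passwd are attack names in NSL-KDD.
raw_labels = ["normal", "neptune", "smurf", "normal", "guess_passwd"]
binary_labels = [to_binary_label(label) for label in raw_labels]

print(binary_labels)
print(Counter(binary_labels))  # quick check of class balance
```

Counting the two classes after the mapping is a cheap way to see how balanced the training data is before it reaches the Train Model step.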
The entire model was trained and tested using the Designer interface, which allowed the components
to be connected in a logical flow. The final output was used to check how well the model could detect
intrusions.
iv. Evaluation Metrics
After training the model, it was important to check how well it performed. This was done using the
Evaluate Model module in Azure Machine Learning Studio, which compares the model’s predictions
with the actual results in the dataset. The evaluation focused on five main metrics: accuracy, precision,
recall, F1-score, and false positive rate (FPR).
These metrics help show if the model is good at catching attacks without making too many mistakes.
1. Accuracy
Accuracy is the percentage of all predictions the model got right, both normal and attack traffic. High
2. Precision
Precision tells us how many of the records the model predicted as attacks were actually attacks. High
3. Recall
Recall shows how many of the actual attacks the model was able to detect. High recall means the
4. F1-Score
F1-score is a combination (the harmonic mean) of precision and recall. It gives a single value to measure the balance
between the two.
5. False Positive Rate (FPR)
This measures how often the model mistakenly flags normal traffic as an attack. A lower FPR is better
All these results were generated directly from the Evaluate Model module in Azure ML Studio. The
metrics helped confirm that the Two-Class Decision Forest model was able to detect intrusions with a
The figure above displays the Evaluate Model output showing performance metrics.
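The five metrics described in this section can all be derived from the counts of true and false positives and negatives. The Python sketch below shows the arithmetic on a small invented example. The labels are toy data chosen for illustration, not the results of this project, and the function name evaluate is hypothetical.

```python
def evaluate(actual, predicted, positive="attack"):
    """Compute accuracy, precision, recall, F1-score and FPR for two classes."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1  # attack correctly flagged
        elif guess == positive:
            fp += 1  # normal traffic wrongly flagged
        elif truth == positive:
            fn += 1  # attack that was missed
        else:
            tn += 1  # normal traffic correctly passed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }


# Toy labels invented for the example: 3 attacks and 5 normal records.
actual = ["attack", "attack", "attack", "normal", "normal", "normal", "normal", "normal"]
predicted = ["attack", "attack", "normal", "attack", "normal", "normal", "normal", "normal"]

metrics = evaluate(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here one attack is missed and one normal record is wrongly flagged, which gives an accuracy of 0.75 and a false positive rate of 0.2 (one false alarm out of five normal records).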
v. Analysis And Interpretation Of Results
Once the model was trained and evaluated, the next step was to understand how well it performed
based on the metrics provided. The evaluation results confirmed that the Two-Class Decision Forest
1. Accuracy and Overall Performance
The model achieved a high accuracy score, meaning it correctly classified most of the network traffic
records. This shows that the model learned the patterns of both normal and malicious behavior
effectively. The good accuracy also suggests that the preprocessing steps like feature selection, data
The figure above shows the summary of the scored results used for analysis.
The precision value was strong, meaning the model didn’t raise too many false alarms. This is
important because too many wrong alerts can make users ignore real threats. The recall score was also
high, showing that the model successfully detected a large portion of actual intrusions. Together, this
balance means the system is both careful and alert, which is exactly what’s needed in a good IDS.
The F1-score confirmed the balance between precision and recall. A high F1-score means the model
is not only accurate but also consistent. The false positive rate was low, which is especially important
in real-world applications as it reduces unnecessary warnings that can distract security teams or users.
One of the biggest strengths of this model is that it performed well without needing complicated setup
or adjustments. It worked well using default settings in Azure ML Studio, which proves that
To make the results more accessible, a Python notebook was used in Azure ML Studio to export the
scored data to a CSV file. This made it easier to analyze and even test the system further outside
Azure.
The figure above shows the use of a Jupyter Notebook for exporting results.
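The export step can be pictured with the short Python sketch below, which writes scored rows to CSV using only the standard library. It is not the notebook used in this project: the function name export_scored_rows is hypothetical, the sample values are invented, and the column names "Scored Labels" and "Scored Probabilities" are assumed to match the output of the Score Model module.

```python
import csv
import io


def export_scored_rows(rows, handle):
    """Write a list of dict rows to an open text handle in CSV format."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Invented sample of scored output; column names are assumptions.
scored = [
    {"label": "normal", "Scored Labels": "normal", "Scored Probabilities": 0.03},
    {"label": "attack", "Scored Labels": "attack", "Scored Probabilities": 0.97},
]

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
export_scored_rows(scored, buffer)
print(buffer.getvalue())
```

To write an actual file instead, the same function can be called with a handle from open("scored_results.csv", "w", newline=""), where the file name is only a placeholder.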
In summary, the results showed that the chosen model was a good fit for intrusion detection. It
achieved strong performance without needing high computing power or advanced tuning, which
REFERENCES
Ali, A., Musa, K. and Usman, H., 2023. Improving signature-based intrusion detection systems with
Bello, T.M., Ogunyemi, A.K. and Salisu, I., 2024. A critical analysis of the NSL-KDD dataset for
intrusion detection research. Journal of Computer Science and Security, 11(2), pp.21-32.
Chowdhury, S. and Sahu, P., 2023. Comparative performance of deep learning and ensemble learning
Kumar, S. and Patel, V., 2021. Machine learning-based intrusion detection: A review of supervised
Li, W. and Huang, Y., 2023. Evaluating distance-based classifiers for network intrusion detection.
Nguyen, L., Okafor, C. and Adeyemi, T., 2022. Machine learning approaches for detecting
Olaoye, A.A., Danjuma, M. and Eze, C., 2022. Performance comparison of classification algorithms
for intrusion detection using NSL-KDD dataset. Nigerian Journal of Information Security, 14(1),
pp.11-20.
Yassin, S. and Alshamrani, R., 2021. SVM-based intrusion detection: Challenges and solutions.
Zhang, R., Bello, F. and Amadi, L., 2025. An ensemble-based intrusion detection system using
Two-Class Decision Forests. International Journal of Data Science and Cyber Defense, 3(1), pp.1-15.