Development of Intrusion Detection Systems Using-1

It will be useful to students, lecturers and researchers in the field of computer science and electrical electronics engineering.

Uploaded by

ibrahim Paramole

DEVELOPMENT OF INTRUSION DETECTION SYSTEMS

USING MACHINE LEARNING

BY

ODUNJO DORCAS DAMILOLA

NOU183033484

A PROJECT SUBMITTED TO THE DEPARTMENT OF

COMPUTER SCIENCE

FACULTY OF SCIENCE AND TECHNOLOGY

NATIONAL OPEN UNIVERSITY OF NIGERIA (NOUN)

IN PARTIAL FULFILMENT OF THE REQUIREMENTS

FOR

THE AWARD OF A BACHELOR OF SCIENCE DEGREE IN

COMPUTER SCIENCE

CERTIFICATION

I hereby certify that this project was carried out by Odunjo Dorcas Damilola (matriculation number NOU183033484) of the Department of Computer Science, and that it satisfactorily meets the requirements for the award of a Bachelor of Science Degree in Computer Science, Faculty of Science and Technology, National Open University of Nigeria (NOUN).

..................................... .....................................

DR AYODELE OLOYEDE DATE

(Project Supervisor)

..................................... .....................................

DR (MRS) AKUJOBI A. T. DATE

(Study Center Director)

..................................... .....................................

PROF. SAHEED AJIBOLA DATE

(Dean, Faculty of Sciences)

DEDICATION

This research project is dedicated to Almighty God for the grace, divine wisdom, knowledge and understanding given to me to complete this project. I appreciate Him for the mercy and divine provision I enjoyed throughout my program.

ACKNOWLEDGEMENT

I want to use this opportunity to thank the Almighty God for the success of this work, and for the strength and wisdom bestowed on me from the beginning to the end. My appreciation goes to my supervisor, Dr Ayodele Oloyede, for his relentless efforts in this work and for his suggestions, encouragement, guidance and assistance where necessary. I extend my profound gratitude to all staff of the National Open University of Nigeria (NOUN), Lagos Study Centre, Victoria Island, Lagos, for their contributions.

ABSTRACT

Intrusion Detection Systems (IDS) are important tools used to monitor network traffic and detect unauthorized access or cyberattacks. Many older IDS methods find it difficult to detect new or unknown threats, which can leave systems at risk. This project explores how a machine learning algorithm called Two-Class Decision Forest, available in Microsoft Azure Machine Learning Studio, can be used to improve intrusion detection.

The goal of the project was to build a system that can accurately identify whether a network activity is

normal or harmful. I used a popular dataset called NSL-KDD, which contains labeled examples of

both normal and attack traffic. The data was cleaned and prepared before training the model.

After training the model in Azure ML Studio, it was tested using unseen data to check how well it

could predict network intrusions. I also created a simple web-based interface that connects to the

model, allowing users to test the system by entering sample values. The results showed that the

Two-Class Decision Forest algorithm could detect intrusions effectively with good accuracy.

This project shows how machine learning tools in Azure can be used to build useful security solutions,

and it gives a practical example of how such systems can be tested and used in real-time.

TABLE OF CONTENTS

CERTIFICATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
CHAPTER ONE
INTRODUCTION
1.1 Background to the Study
1.2 Statement of the Problem
1.3 Aim and Objectives of the Study
1.4 Scope of the Study
1.5 Significance of the Study
1.6 Operational Definition of Terms
CHAPTER TWO
LITERATURE REVIEW
2.1 Overview
2.2 Conceptual Framework
2.3 Empirical Review
2.4 Theoretical Framework
2.5 Summary of Reviewed Related Works and Knowledge Gap
CHAPTER THREE
METHODOLOGY
3.1 Research Approach
i. Data Collection
ii. Data Preprocessing
iii. Model Development/Training: Decision Forests
iv. Evaluation Metrics
v. Analysis and Interpretation of Results

CHAPTER ONE

INTRODUCTION

1.1 ​ Background to the Study

In today’s world, the internet is used for almost everything, from banking and shopping to

communication and business operations. As more people and companies depend on online systems,

the need to protect sensitive information from cyberattacks has become more important than ever. One

key way to do this is by using Intrusion Detection Systems (IDS). IDS are programs that watch over

computer networks to spot any suspicious activities or security threats. They act like security guards

that alert users when something unusual is happening (Kumar and Patel, 2021).

Traditionally, most IDS were built to recognize attacks by comparing incoming traffic to a list of

known attack patterns. These are called signature-based systems. While they are good at identifying

past threats, they cannot detect new types of attacks that don’t match any known pattern. This

limitation makes it easy for modern cybercriminals to bypass them using unknown or modified

methods (Ali et al., 2023). To solve this problem, researchers have started using machine learning

(ML) techniques to improve IDS performance.

Machine learning gives computers the ability to learn from data and recognize patterns without being

manually programmed. With this approach, IDS can be trained on large sets of network data to

understand what normal behavior looks like, and then detect anything that doesn’t fit that pattern.

ML-based IDS are more flexible and can detect new or unexpected attacks better than traditional

methods (Nguyen et al., 2022). They are also more scalable, meaning they can handle large amounts of

network traffic without slowing down.

In this project, I used the Two-Class Decision Forest algorithm, which is a type of machine learning

model that combines multiple decision trees to make predictions. It works well for detecting attacks

because it is accurate, fast, and able to manage messy or unbalanced data. The model was built using

Microsoft Azure Machine Learning Studio, a platform that allows users to build ML models without

needing to write complex code. By training the model on a well-known dataset called NSL-KDD, this

project aims to build a working IDS that can help identify network intrusions in a real-world

environment (Zhang et al., 2025).

1.2​ Statement of the Problem

Many Intrusion Detection Systems (IDS) have been developed using different machine learning

algorithms, but not all of them perform well in real-world situations. For example, Logistic Regression

is commonly used for binary classification tasks, but it often struggles when the dataset has

overlapping features or when the relationship between inputs and outputs is not linear. In intrusion

detection, this can lead to wrong classifications, especially when the data is complex (Olaoye et al., 2022).

Another popular method, Support Vector Machines (SVMs), can handle both linear and non-linear

data, but it becomes very slow and memory-intensive when working with large datasets. SVMs also

require a lot of tuning and adjustment to work properly, which makes them harder to use in practical

scenarios (Yassin and Alshamrani, 2021). Similarly, K-Nearest Neighbors (KNN) is simple to

understand but performs poorly with large data because it stores the entire dataset in memory and

becomes very slow during prediction. It is also highly sensitive to noise and irrelevant features (Li and

Huang, 2023).

Naive Bayes, another commonly used algorithm, assumes that all features are independent, which is

often not true for network data. This can lead to inaccurate results because relationships between

network activities are usually complex (Bello et al., 2024). Neural Networks, especially deep learning

models, have been used in some studies and can offer high accuracy. However, they require large

computing power, a lot of training time, and are often seen as “black-box” models, meaning their

decisions are hard to explain (Chowdhury and Sahu, 2023).

In contrast, the Two-Class Decision Forest algorithm used in this project offers a better balance of

speed, accuracy, and simplicity. It builds multiple decision trees and combines their outputs to make

more reliable predictions. This method is less likely to overfit the training data and performs well even

with noisy or unbalanced datasets. It also does not require complex parameter tuning and is easier to

understand and explain compared to more advanced models like neural networks (Zhang et al., 2025).

Because of these advantages, the Two-Class Decision Forest was chosen for this project. It allowed the

model to be trained efficiently using Azure Machine Learning Studio and made it easier to test the

system using real-world intrusion data.

1.3 ​ Aim and Objectives of the Study

This study aims to develop a decision forest model for an Intrusion Detection System (IDS).

Specific Objectives

1. Developing a Two-Class Decision Forest model.

2. Implementing the developed model.

3. Evaluating the performance of the Two-Class Decision Forest model by checking its precision, accuracy, and F1-score.

1.4​ Scope of the Study

This project focuses on building and testing an Intrusion Detection System (IDS) using the Two-Class

Decision Forest algorithm in Microsoft Azure Machine Learning Studio. The model was trained with

the NSL-KDD dataset, which includes labeled examples of both normal and attack network traffic.

The study covers:

●​ Importing and preparing the dataset using built-in Azure tools.

●​ Training and evaluating the model using key performance metrics.

●​ Creating a simple web-based interface that allows users to enter sample data and get results.

●​ Testing the model’s ability to detect common attack types such as Denial of Service (DoS),

Probe, Remote-to-Local (R2L), and User-to-Root (U2R).

1.5​ Significance of the Study

This study is important because it shows how machine learning can be used to improve network

security in a simple and practical way. Many traditional Intrusion Detection Systems (IDS) are limited

because they rely only on known attack patterns. This project demonstrates how a machine learning

model trained using the Two-Class Decision Forest algorithm can help detect both common and

unusual types of attacks more effectively.

The use of Microsoft Azure Machine Learning Studio makes this project accessible, especially for

people without advanced programming skills. By using Azure’s drag-and-drop interface, the model

was built, trained, and evaluated with minimal code. This approach helps show that modern tools can

make machine learning more user-friendly and practical, even for small teams or individuals.

The NSL-KDD dataset, which was used in the project, provides a good variety of real-world attack

types. The project shows how this dataset can be used to train a model that detects major network

threats like DoS, Probe, R2L, and U2R. The model’s performance was measured using standard

metrics like accuracy, precision, recall, and F1-score, giving a clear picture of how well it works.

In addition, a simple web-based interface was created to show how the system could be used in

practice. This part of the project gives users a way to test the model by entering real-time values,

making the IDS more interactive and easier to understand.

Overall, this project can benefit:

●​ Students and researchers by showing a clear and practical example of machine learning in

cybersecurity.

●​ IT professionals by offering a simple model that can be improved or expanded for real use.

●​ Organizations by providing a low-cost approach to testing IDS with machine learning.

By focusing on easy-to-use tools and clear results, the study adds to current research in cybersecurity

and helps bridge the gap between academic work and real-world applications (Zhang et al., 2025; Ali

et al., 2023; Bello et al., 2024).

1.6 Operational Definition of Terms

1.​ Intrusion Detection System (IDS):​

A system that monitors network traffic and identifies suspicious or harmful activities. In this

project, IDS is built using a machine learning model to classify whether a connection is normal

or an attack.

2.​ Machine Learning (ML):​

A branch of artificial intelligence that allows computers to learn from data and make decisions

without being manually programmed. In this project, ML is used to train the model to detect

network intrusions.

3.​ Two-Class Decision Forest:​

A supervised machine learning algorithm that combines the results of many decision trees to

make a final prediction. It is used in this project to classify network traffic as either normal or

an intrusion. This method is known for being accurate, fast, and easy to understand.

4.​ Microsoft Azure Machine Learning Studio:​

A cloud-based platform that allows users to build, train, and deploy machine learning models

using a visual, no-code interface. It was used to build and train the IDS in this project.

5.​ NSL-KDD Dataset:​

A benchmark dataset for evaluating intrusion detection systems. It contains labeled records of

both normal and different types of attack traffic, such as DoS, Probe, R2L, and U2R. It was

used to train and test the model in this study.

6.​ False Positive Rate (FPR):​

The percentage of normal network traffic that is wrongly flagged as an attack by the IDS. A

lower false positive rate means the system is more reliable.

7.​ Precision:​

A measure of how many of the alerts raised by the IDS are actually correct. High precision

means the system does not raise too many false alarms.

8.​ Recall:​

Also known as sensitivity, it measures how many real attacks the system successfully detects.

High recall means the system is good at catching intrusions.

9.​ F1-Score:​

A single number that combines both precision and recall. It gives a balanced measure of how

well the model performs.

10.​Denial-of-Service (DoS) Attack:​

A type of cyberattack where a system is flooded with traffic to make it unavailable to users.

The IDS in this project is trained to detect such attacks.

11.​Remote-to-Local (R2L) Attack:​

An attack where a remote user tries to gain unauthorized access to a system on the network.

This type of attack is included in the training data.

12.​User-to-Root (U2R) Attack:​

A situation where a regular user tries to gain administrator or root access to a system. The

model was trained to recognize this type of threat.

13.​Probe Attack:​

A type of attack where an attacker scans the network to find weak spots that could be used

later. This is one of the attack types the IDS can detect.

14.​Web Interface:​

A simple online page created in this project, where users can enter values and test if the model

detects an intrusion. It connects to the trained model to show live predictions.
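The metric definitions in items 6 to 9 above can be made concrete with a short Python sketch. It treats "attack" as the positive class and computes each metric from the four confusion-matrix counts. The function name ids_metrics and the counts used are made-up illustrations, not results from this project.

```python
def ids_metrics(tp, fp, tn, fn):
    """Compute IDS metrics from confusion-matrix counts (attack = positive)."""
    # Precision: share of raised alerts that were real attacks.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): share of real attacks that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # False positive rate: share of normal traffic wrongly flagged.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1-score: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall,
            "fpr": fpr, "accuracy": accuracy, "f1": f1}


# Illustrative counts only (hypothetical, not measured in this study).
m = ids_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, 10 of the 100 alerts are false alarms and 20 of the 110 real attacks are missed, which is why precision and recall differ even though accuracy looks high.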

CHAPTER TWO

LITERATURE REVIEW

2.1 Overview

Intrusion Detection Systems (IDS) are designed to monitor network activity and alert users when there

are signs of unauthorized access or attacks. Over the years, the rise in cybercrime has made these

systems a critical part of digital security. However, traditional IDS often rely on fixed rules or known

attack patterns, which makes them ineffective against new or evolving threats (Ali et al., 2023).

To address this limitation, researchers have explored the use of machine learning (ML) in intrusion

detection. ML allows systems to learn from data and detect unusual patterns that may signal an attack,

even if the attack is new or has never been seen before. This approach improves both detection speed

and accuracy, and has become one of the most promising areas in cybersecurity research (Nguyen et

al., 2022).

This project builds on that idea by using a specific ML algorithm, the Two-Class Decision Forest, to

classify network traffic as either normal or an intrusion. The goal is to improve detection accuracy and

reduce false alarms while keeping the system simple and efficient enough for practical use.

2.2 Conceptual Framework

This project is built around three key concepts: Intrusion Detection Systems (IDS), machine learning algorithms (specifically, the Two-Class Decision Forest), and the NSL-KDD dataset.

●​ Intrusion Detection Systems (IDS):​

IDS monitor data flowing through a network and try to identify any abnormal or malicious

activity. Traditional IDS often fail to detect new threats because they depend on signatures,

which are known patterns of past attacks. In recent studies, ML-based IDS have shown better

performance because they can detect unusual behavior, not just known attacks (Zhang et al.,

2025).

Fig 1. ML-based systems efficiently analyze network traffic (Wang et al., 2024).

●​ Machine Learning in IDS:​

Machine learning helps IDS learn from large sets of network data. Instead of relying on fixed

rules, the model is trained to recognize normal activity and then flag anything that doesn’t fit

that pattern. This makes the system more adaptable to modern attacks. Algorithms like

Decision Trees, Random Forests, and Support Vector Machines have been tested in IDS, but

many of them either require heavy tuning or become too slow with large datasets (Olaoye et

al., 2022). The Two-Class Decision Forest, however, combines multiple decision trees into one

strong model and performs well on classification tasks like intrusion detection (Chowdhury

and Sahu, 2023).

Fig 2. Hybrid ML work-flow (SPE - Hernandez, 2023).

●​ The NSL-KDD Dataset:​

The NSL-KDD dataset is widely used in intrusion detection research. It is an improved

version of the KDD 1999 dataset, created to remove duplicate and redundant records. It

contains labeled examples of both normal and attack traffic, including DoS, R2L, U2R, and

Probe attacks. The dataset is considered better balanced than the original KDD 1999 data and more suitable for training ML models in academic projects (Bello et al., 2024).

Fig 3. Flow chart of load identification decision Forests algorithm (Zhao et al., ResearchGate, 2022)

Together, these components form the base of this project: a smart IDS trained on NSL-KDD using a

Two-Class Decision Forest model, built and tested in Azure Machine Learning Studio.

Fig 5. Model Performance Metrics (MarkovML, 2023).

2.3 Empirical Review

Several researchers have worked on improving Intrusion Detection Systems (IDS) using machine

learning. Many of these studies tested different algorithms to see which one performs best for

detecting various types of network attacks. However, only a few of them focused specifically on

Decision Forest-based models or made use of platforms like Azure Machine Learning Studio for

implementation.

For example, Olaoye et al. (2022) compared multiple machine learning models, including Decision

Trees, Logistic Regression, and Naive Bayes, using the NSL-KDD dataset. Their results showed that

Decision Tree models had better detection accuracy, especially for complex attacks, but they warned

about overfitting when only a single tree was used.

Zhang et al. (2025) expanded on this by testing ensemble learning methods, including Random Forests

and Decision Forests. Their findings showed that models like the Two-Class Decision Forest, which

combine many decision trees, are more stable, reduce errors, and handle unbalanced data better than

single-tree methods or simpler models like Logistic Regression.

Nguyen et al. (2022) evaluated Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and

Decision Forests on intrusion detection tasks. They concluded that while SVMs had good precision,

they required long training time and fine-tuning. Decision Forests, on the other hand, delivered high

accuracy without much adjustment, making them more practical for quick deployment.

Another recent study by Bello et al. (2024) highlighted the importance of using clean and balanced

datasets. They pointed out that the NSL-KDD dataset remains relevant because it avoids many of the

issues found in older datasets, such as data duplication and imbalance, which can affect model

performance.

Lastly, Chowdhury and Sahu (2023) tested deep learning models like LSTM and CNN on network

intrusion data and reported high detection accuracy. However, they also noted that such models need

more computing power and are harder to explain. For student projects or small organizations, simpler

and more transparent algorithms like Decision Forests are often a better choice.

These studies support the decision to use a Two-Class Decision Forest for this project. It offers a good

balance between accuracy, speed, and ease of use, especially when implemented on a platform like

Azure ML Studio, which simplifies the training and evaluation process.

2.4 Theoretical Framework

The main theory behind this project is ensemble learning, which is a method in machine learning

where multiple models are combined to solve a problem better than using a single model. In this

project, the Two-Class Decision Forest algorithm is used. It is based on a type of ensemble method

called bagging (bootstrap aggregating), where many decision trees are trained separately and their

results are combined to make a final decision (Zhang et al., 2025).

Each decision tree in the forest makes a prediction, and the final output is based on the majority vote

from all the trees. This helps to reduce the chances of making errors that one single tree might make.

Because of this, decision forests are more stable and accurate, especially when dealing with noisy or unbalanced data, which is often the case in network traffic.

This theory works well for intrusion detection because cyberattacks can appear in many different

forms, and a single model might not be able to catch all of them. By using multiple trees trained on

slightly different parts of the data, the decision forest can handle a wide range of attack types more

effectively than individual classifiers.
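The bagging and majority-vote idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each "tree" is a one-feature threshold rule (a decision stump), the data are made-up values, and the names train_stump, train_forest and predict are hypothetical. It is not the internal implementation of the Azure module.

```python
import random
from collections import Counter


def train_stump(rows):
    """Fit a rule 'predict attack (1) when x >= threshold' with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        errors = sum((x >= threshold) != bool(y) for x, y in rows)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def train_forest(rows, n_trees, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw len(rows) rows with replacement.
        sample = [rng.choice(rows) for _ in rows]
        forest.append(train_stump(sample))
    return forest


def predict(forest, x):
    """Return the class chosen by the majority of the stumps."""
    votes = Counter(int(x >= threshold) for threshold in forest)
    return votes.most_common(1)[0][0]


# Made-up (feature value, label) pairs: 0 = normal, 1 = attack.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (50, 1), (60, 1), (70, 1), (80, 1)]
forest = train_forest(data, n_trees=5)
print(predict(forest, 0), predict(forest, 100))
```

Because every stump sees a slightly different sample, the learned thresholds can differ from tree to tree; the vote smooths out the choices of any single stump. An odd number of trees avoids tied votes in the two-class case.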

The Two-Class Decision Forest also benefits from the divide-and-conquer principle, where the data is

split based on features to create decision paths. This structure makes the model easier to understand

and explain, which is important when applying it in real-world environments.
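To illustrate the divide-and-conquer step, the sketch below picks the threshold on a single numeric feature that best separates normal from attack records. Gini impurity is used as the split score here as an assumption, since it is a common choice for decision trees; the criterion used inside the Azure module is not stated in this report. The data and the names gini and best_split are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows):
    """Return (threshold, weighted_gini) for the best 'x < threshold' split.

    Returns None when the feature has fewer than two distinct values.
    """
    best = None
    values = sorted({x for x, _ in rows})
    # Skip the smallest value: splitting there would leave the left side empty.
    for threshold in values[1:]:
        left = [y for x, y in rows if x < threshold]
        right = [y for x, y in rows if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Made-up (feature value, label) pairs: 0 = normal, 1 = attack.
rows = [(0, 0), (2, 0), (5, 0), (300, 1), (450, 1), (900, 1)]
split = best_split(rows)
print(split)
```

A full tree repeats this search on each resulting subset until the groups are pure or a depth limit is reached, which is what produces the readable decision paths mentioned above.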

In summary, the theoretical foundation of ensemble learning supports the use of decision forests in

IDS by improving accuracy, reducing overfitting, and making the system more reliable and

interpretable (Bello et al., 2024).

2.5 Summary of Reviewed Related Works and Knowledge Gap

From the literature reviewed, it is clear that machine learning has become a useful tool for improving

Intrusion Detection Systems (IDS). Many studies have explored different algorithms such as Support

Vector Machines, K-Nearest Neighbors, Logistic Regression, and Neural Networks. While these

models can perform well under certain conditions, they often require complex tuning, high computing

power, or produce results that are difficult to explain (Nguyen et al., 2022; Chowdhury and Sahu,

2023).

On the other hand, Decision Tree-based models have shown promising results due to their simplicity

and fast execution. Recent research suggests that combining multiple decision trees using ensemble

learning techniques like bagging can improve performance. The Two-Class Decision Forest algorithm

is one such method, offering good accuracy, low false positive rates, and better handling of unbalanced

data, all of which are important in IDS applications (Zhang et al., 2025; Bello et al., 2024).

However, there is still a gap in studies that show how these models can be applied in a simple and

practical way using modern tools like Azure Machine Learning Studio. Most research focuses on

complex models or environments that are not accessible to beginners or students. Also, very few

works demonstrate how to connect a trained model to an interface that allows users to interact with it

in real time.

This project addresses these gaps by:

●​ Using the Two-Class Decision Forest in a no-code, cloud-based platform (Azure ML Studio),

●​ Training and testing it on the NSL-KDD dataset,

●​ And building a simple web interface to demonstrate how IDS can work in practice.​

This approach not only supports the findings of past studies but also shows how intrusion detection

can be implemented in a user-friendly and realistic way.

CHAPTER THREE

METHODOLOGY

3.1​ Research Approach

The system was developed using Microsoft Azure Machine Learning Studio, a cloud-based platform

that allows users to build machine learning models using a visual, no-code environment. The project

used the NSL-KDD dataset for both training and testing.

The methodology includes several key phases:

●​ Uploading and preparing the dataset in Azure ML,

●​ Preprocessing the data using built-in modules,

●​ Training a Two-Class Decision Forest model,

●​ Evaluating the model’s performance,

●​ And creating a basic interface for users to test the system with real inputs.

The entire process was completed using Azure’s drag-and-drop Designer interface, along with a

Python notebook used at the end to export results. Each phase of the project is explained in the

sections that follow.
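As a rough illustration of the last phase, the sketch below shows how values entered in a web form might be checked and packaged into a JSON request body for a deployed model. Everything specific here is an assumption made for illustration: the feature subset in FIELDS, the function name build_payload, and the Inputs / WebServiceInput0 envelope. The actual input schema of the deployed model is not given in this report and should be taken from the endpoint itself. No network call is made.

```python
import json

# Hypothetical subset of NSL-KDD feature names used as form fields.
FIELDS = ("duration", "protocol_type", "service", "flag",
          "src_bytes", "dst_bytes")


def build_payload(form_values):
    """Validate form input and wrap it in a JSON request body."""
    missing = [name for name in FIELDS if name not in form_values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    record = {name: form_values[name] for name in FIELDS}
    # Assumed request envelope; check the endpoint's own schema before use.
    body = {"Inputs": {"WebServiceInput0": [record]}, "GlobalParameters": {}}
    return json.dumps(body)


# Made-up sample connection record.
sample = {"duration": 0, "protocol_type": "tcp", "service": "http",
          "flag": "SF", "src_bytes": 181, "dst_bytes": 5450}
payload = build_payload(sample)
print(payload)
```

Validating the form before building the request keeps incomplete input from reaching the model, so the interface can show a clear error instead of a failed prediction.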

Fig 3.1. Framework of a developed Two-Class Decision Forest algorithm against the existing Naive

Bayes model.

i.​ Data Collection

The dataset used in this project is the NSL-KDD dataset, which is commonly used in intrusion

detection research. It is an improved version of the original KDD Cup 1999 dataset, created to remove

duplicate records and reduce data imbalance issues. The dataset includes labeled records of both

normal and attack network activities, making it ideal for training and testing a machine learning model

for intrusion detection.

The NSL-KDD dataset contains different types of attacks such as:

●​ DoS (Denial-of-Service) – attacks that try to make a service unavailable.

●​ Probe – attempts to scan and gather information about the network.

●​ R2L (Remote-to-Local) – when an attacker gains access to a machine from outside the

network.

●​ U2R (User-to-Root) – when a normal user tries to gain administrative control.
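Because the model is a two-class classifier, the individual attack names in the dataset have to be collapsed into a single "attack" class. The sketch below shows one way to do this in Python. It assumes the label column holds raw NSL-KDD attack names such as neptune or rootkit; the Kaggle copy used in this project may encode labels differently. The mapping is deliberately partial, and the names ATTACK_FAMILY, to_family and to_binary are hypothetical.

```python
# Partial, illustrative mapping from NSL-KDD attack names to attack families.
ATTACK_FAMILY = {
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "portsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "guess_passwd": "R2L", "warezclient": "R2L", "ftp_write": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
}


def to_family(label):
    """Map a raw label to Normal, DoS, Probe, R2L, U2R, or Unknown."""
    if label == "normal":
        return "Normal"
    return ATTACK_FAMILY.get(label, "Unknown")


def to_binary(label):
    """Two-class target: 0 for normal traffic, 1 for any attack."""
    return 0 if label == "normal" else 1


for raw in ("normal", "neptune", "nmap", "guess_passwd", "rootkit"):
    print(raw, to_family(raw), to_binary(raw))
```

Keeping the family mapping alongside the binary target makes it possible to report later how well each attack family is detected, even though the model itself only predicts normal or attack.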

The dataset was downloaded from Kaggle, where it was already cleaned and structured in CSV format.

It was then uploaded into Azure Machine Learning Studio for use in the training pipeline. In Azure

ML, the dataset was loaded into the workspace and connected directly to other modules in the

experiment.

The figure above shows the dataset in my Azure ML workspace and its connection to the

pipeline.

The dataset includes important features such as protocol type, service, source bytes, destination bytes,

duration, login status, and others. These were used by the model to learn how to classify network

traffic as either normal or an intrusion.

ii.​ Data Preprocessing

Before training the model, the dataset needed to be cleaned and prepared. This step is called data

preprocessing, and it helps improve the model’s accuracy by removing errors, fixing data types, and

selecting useful features. All preprocessing steps were done using the built-in modules in Azure

Machine Learning Studio Designer.

The main preprocessing steps were:

1. Clean Missing Data​

Although the NSL-KDD dataset from Kaggle was already cleaned, the Clean Missing Data module

was added to ensure that any unexpected gaps in the values were handled properly. This step ensures

consistency across all the rows before training the model.

The figure above shows the "Clean Missing Data" module in my pipeline.
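For readers without access to the Designer, the effect of this step can be approximated in plain Python. The sketch below fills missing numeric values with the column mean and missing categorical values with the most frequent value. The choice of mean and mode is an assumption, since the cleaning mode selected in the module is not stated here, and the function name clean_missing and the sample rows are hypothetical.

```python
from collections import Counter
from statistics import mean


def clean_missing(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None replaced by column mean or mode."""
    fills = {}
    for col in numeric_cols:
        # Mean of the values that are present in this column.
        fills[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_cols:
        # Most frequent value that is present in this column.
        counts = Counter(r[col] for r in rows if r[col] is not None)
        fills[col] = counts.most_common(1)[0][0]
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


# Made-up rows with gaps marked as None.
rows = [
    {"src_bytes": 100, "protocol_type": "tcp"},
    {"src_bytes": None, "protocol_type": "tcp"},
    {"src_bytes": 300, "protocol_type": None},
]
cleaned = clean_missing(rows, ["src_bytes"], ["protocol_type"])
print(cleaned)
```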

2. Edit Metadata​

Some of the data columns, especially the text-based ones like protocol_type, service, and

flag, were originally classified as string data. These were converted to the categorical data type using

the Edit Metadata module. This allowed the model to treat them as categories, which improves

classification.

The figure above shows how the Edit Metadata module was used to change data types.

3. Convert to Indicator Values​

After updating the data types, the Convert to Indicator Values module was used to apply one-hot

encoding to the categorical columns. This step creates separate columns for each category (e.g., TCP,

UDP, ICMP) so that the model can understand and use them during training.

The figure above shows the Convert to Indicator Values module applied to the dataset.
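One-hot encoding can be illustrated with a short plain-Python sketch. It replaces one categorical column with a 0/1 column per category. The function name one_hot and the "column-value" naming pattern for the new columns are assumptions made for this example.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}-{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"protocol_type": "tcp", "src_bytes": 181},
    {"protocol_type": "udp", "src_bytes": 105},
    {"protocol_type": "icmp", "src_bytes": 8},
]
encoded_rows = one_hot(rows, "protocol_type")
```

Each encoded row has exactly one indicator set to 1, so no artificial ordering is implied between TCP, UDP, and ICMP.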

4. Select Columns in Dataset​

This step was used to remove any columns that were not needed for training. Unnecessary features

(like certain flags or identifiers) were excluded to reduce noise and help the model focus on the most

useful inputs.

The figure above shows the "Select Columns in Dataset" module connected in my pipeline.

All these preprocessing steps were connected and executed in sequence within the Azure ML Studio

interface. This allowed the cleaned and transformed data to be passed directly into the training phase.

iii.​ Model Development/Training: Decision Forests

After preprocessing, the cleaned dataset was ready to be used for model training. In this project, the

Two-Class Decision Forest algorithm was selected to build the Intrusion Detection System. This

algorithm is available as a built-in module in Microsoft Azure Machine Learning Studio Designer and

is designed for binary classification problems, such as predicting whether a network activity is normal

or an attack.

Why Two-Class Decision Forest?​

This model works by combining many decision trees into one strong model. Each tree is trained on a

different part of the data, and their individual results are combined to make the final prediction. This

method helps improve accuracy, reduce overfitting, and handle unbalanced data better than single

decision trees (Zhang et al., 2025).

It is also easier to use than many other models because:

●​ It requires minimal configuration.

●​ It runs quickly even with large datasets.

●​ It gives clear predictions that are easy to explain.
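The idea of training each tree on a different part of the data and combining the votes can be sketched in plain Python. This is not the Azure implementation of Two-Class Decision Forest. It is a simplified illustration that uses bootstrap samples and one-level trees (decision stumps), and the toy features (source bytes, failed logins) and all function names are assumptions.

```python
import random
from collections import Counter


def train_stump(samples):
    """Find the single (feature, threshold) split with the fewest errors."""
    best = None
    for feature in range(len(samples[0][0])):
        for threshold in sorted({x[feature] for x, _ in samples}):
            for above, below in (("attack", "normal"), ("normal", "attack")):
                errors = sum(
                    (above if x[feature] > threshold else below) != y
                    for x, y in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above, below)
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, above, below = stump
    return above if x[feature] > threshold else below


def train_forest(samples, n_trees=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]
        forest.append(train_stump(bootstrap))
    return forest


def predict_forest(forest, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy records: (src_bytes, failed_logins) with a known label.
data = [
    ((200, 0), "normal"), ((180, 0), "normal"),
    ((220, 1), "normal"), ((150, 0), "normal"),
    ((5000, 4), "attack"), ((4200, 5), "attack"),
    ((6100, 3), "attack"), ((3900, 6), "attack"),
]
forest = train_forest(data)
quiet_prediction = predict_forest(forest, (100, 0))
noisy_prediction = predict_forest(forest, (7000, 5))
```

Because every stump sees a slightly different sample, a single badly placed split is outvoted by the others, which is the intuition behind the reduced overfitting described above.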

Steps Taken in Azure ML Studio:

1.​ The Train Model module was used to connect the processed data to the Two-Class Decision

Forest.

2.​ The target column (label) was set to the column indicating whether each record was “normal”

or an “attack”.

3.​ The Score Model module was added to compare the model’s predictions with the actual labels.

4.​ The Evaluate Model module was used to show how accurate the model was using key metrics

like accuracy, precision, recall, and F1-score.​

The figure above shows the model pipeline with Train Model, Score Model, and Evaluate Model

connected.

The entire model was trained and tested using the Designer interface, which allowed the components

to be connected in a logical flow. The final output was used to check how well the model could detect

intrusions using unseen data.

iv.​ Evaluation Metrics

After training the model, it was important to check how well it performed. This was done using the

Evaluate Model module in Azure Machine Learning Studio, which compares the model’s predictions

with the actual results in the dataset. The evaluation focused on five main metrics: accuracy, precision,

recall, F1-score, and false positive rate.

These metrics help show if the model is good at catching attacks without making too many mistakes.

1. Accuracy​

Accuracy is the percentage of all predictions the model got right, both normal and attack traffic. High

accuracy means the model is reliable overall.

2. Precision​

Precision tells us how many of the records the model predicted as attacks were actually attacks. High

precision means fewer false alarms.

3. Recall​

Recall shows how many of the actual attacks the model was able to detect. High recall means the

model is good at catching threats.

4. F1-Score​

F1-score is the harmonic mean of precision and recall. It gives a single value to measure the balance

between finding true attacks and avoiding false alarms.

5. False Positive Rate (FPR)​

This measures how often the model mistakenly flags normal traffic as an attack. A lower FPR is better

because it means fewer unnecessary alerts.

All these results were generated directly from the Evaluate Model module in Azure ML Studio. The

metrics helped confirm that the Two-Class Decision Forest model was able to detect intrusions with a

good level of accuracy and minimal false positives.

The figure above displays the Evaluate Model output showing performance metrics.
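The five metrics can be computed from the four confusion-matrix counts, as in the plain-Python sketch below. "Attack" is treated as the positive class. The example labels are invented for illustration and are not the results obtained in this project, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive="attack"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """Return the five metrics, using 0.0 when a denominator is zero."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Invented labels, used only to exercise the formulas.
actual = ["attack", "attack", "normal", "normal", "normal", "attack"]
predicted = ["attack", "normal", "normal", "attack", "normal", "attack"]
counts = confusion_counts(actual, predicted)
metrics = evaluation_metrics(*counts)
```

In this toy case one attack is missed and one normal record is flagged, so precision and recall are both 2/3 and the false positive rate is 1/3.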

v.​ Analysis And Interpretation Of Results

Once the model was trained and evaluated, the next step was to understand how well it performed

based on the metrics provided. The evaluation results confirmed that the Two-Class Decision Forest

model was effective in detecting intrusions in the NSL-KDD dataset.

1. Accuracy and Overall Performance​

The model achieved a high accuracy score, meaning it correctly classified most of the network traffic

records. This shows that the model learned the patterns of both normal and malicious behavior

effectively. The good accuracy also suggests that the preprocessing steps like feature selection, data

cleaning, and one-hot encoding helped improve performance.

The figure above shows the summary of the scored results used for analysis.

2. Precision and Recall​

The precision value was strong, meaning the model didn’t raise too many false alarms. This is

important because too many wrong alerts can make users ignore real threats. The recall score was also

high, showing that the model successfully detected a large portion of actual intrusions. Together, this

balance means the system is both careful and alert, which is exactly what is needed in a good IDS.

3. F1-Score and False Positive Rate​

The F1-score confirmed the balance between precision and recall. A high F1-score means the model

achieves high precision and high recall together, rather than trading one for the other. The false

positive rate was low, which is especially important

in real-world applications as it reduces unnecessary warnings that can distract security teams or users.

4. Model Reliability and Simplicity​

One of the biggest strengths of this model is that it performed well without needing complicated setup

or adjustments. It worked well using default settings in Azure ML Studio, which suggests that

Two-Class Decision Forest is both effective and easy to use for this task.

5. Exporting Results for Further Analysis​

To make the results more accessible, a Python notebook was used in Azure ML Studio to export the

scored data to a CSV file. This made it easier to analyze and even test the system further outside

Azure.

The figure above shows the use of a Jupyter Notebook for exporting results.
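The export step can be sketched with the Python standard library alone. The notebook code used in the project is not reproduced here, so the sketch below is an assumption about its general shape. The column names "Scored Labels" and "Scored Probabilities" are assumed to match the Score Model output, and the two example records are invented.

```python
import csv
import io


def write_scored_rows(rows, handle):
    """Write scored records (dicts sharing the same keys) as CSV text."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Invented scored records, used only to show the file layout.
scored_rows = [
    {"label": "normal", "Scored Labels": "normal", "Scored Probabilities": 0.03},
    {"label": "attack", "Scored Labels": "attack", "Scored Probabilities": 0.97},
]

# An in-memory buffer stands in for a real file in this sketch.
buffer = io.StringIO()
written = write_scored_rows(scored_rows, buffer)
csv_text = buffer.getvalue()
```

To write an actual file, the same function can be given a handle opened with open(path, "w", newline=""), which prevents blank lines between rows on some platforms.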

In summary, the results showed that the chosen model was a good fit for intrusion detection. It

achieved strong performance without needing high computing power or advanced tuning, which

supports the aim of building a practical, efficient, and user-friendly IDS.

REFERENCES

Ali, A., Musa, K. and Usman, H., 2023. Improving signature-based intrusion detection systems with

machine learning models: A survey. International Journal of Cybersecurity, 9(1), pp.33-45.

Bello, T.M., Ogunyemi, A.K. and Salisu, I., 2024. A critical analysis of the NSL-KDD dataset for

intrusion detection research. Journal of Computer Science and Security, 11(2), pp.21-32.

Chowdhury, S. and Sahu, P., 2023. Comparative performance of deep learning and ensemble learning

in intrusion detection. African Journal of Computer Research, 7(1), pp.54-67.

Kumar, S. and Patel, V., 2021. Machine learning-based intrusion detection: A review of supervised

methods. International Journal of Advanced Computer Science, 12(4), pp.92-101.

Li, W. and Huang, Y., 2023. Evaluating distance-based classifiers for network intrusion detection.

Journal of Intelligent Computing Systems, 5(3), pp.18-29.

Nguyen, L., Okafor, C. and Adeyemi, T., 2022. Machine learning approaches for detecting

cyberattacks in real-time systems. Journal of Cyber Intelligence, 8(2), pp.44-59.

Olaoye, A.A., Danjuma, M. and Eze, C., 2022. Performance comparison of classification algorithms

for intrusion detection using NSL-KDD dataset. Nigerian Journal of Information Security, 14(1),

pp.11-20.

Yassin, S. and Alshamrani, R., 2021. SVM-based intrusion detection: Challenges and solutions.

Arabian Journal of Computer Engineering, 6(4), pp.77-88.

Zhang, R., Bello, F. and Amadi, L., 2025. An ensemble-based intrusion detection system using

Two-Class Decision Forests. International Journal of Data Science and Cyber Defense, 3(1), pp.1-15.
