University of Sindh Journal of Information and Communication Technology (USJICT)
Volume 7, Issue 2
ISSN-E: 2523-1235, ISSN-P: 2521-5582 © Published by University of Sindh, Jamshoro
Website: http://sujo.usindh.edu.pk/index.php/USJICT/

Evaluating Diabetes Detection Methods: A Multilinear Regression Approach vs. Other Machine Learning Classifiers

Hasnain Hyder¹, Khawaja Haider Ali¹, Dr. Abdul Aziz¹, Lubina Iram¹
¹Department of Electrical Engineering, Sukkur IBA University, Sukkur, Pakistan
[email protected], [email protected], [email protected], and [email protected]

Abstract: Machine learning has become an important tool in many fields, including healthcare. In this research paper, we apply multilinear regression to a diabetes dataset and compare its performance with that of several machine learning classifiers. The novelty of this research lies in evaluating the diabetes dataset using multilinear regression and comparing its performance against several other classifiers, including Decision Trees (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Support Vector Machines (SVM). Because little prior research has used multilinear regression for diabetes detection, it is important to examine how well it performs on diabetes data. Introducing multilinear regression into the analysis and measuring its success against other recognized machine learning classifiers sheds light on its suitability for diabetes detection. Our results show that multilinear regression achieved an accuracy of 80.5%; however, other classifiers such as random forest and logistic regression outperformed it, achieving accuracy scores of 81.4% and 81.25%, respectively. Furthermore, we observed that the decision tree, KNN, and SVM, which are often used for classification tasks, did not perform as well on this dataset, achieving accuracies of only 78.7%, 80.5%, and 79.6%, respectively. This suggests that a model's performance can be greatly impacted by classifier selection. Our findings indicate that while linear regression can be used for predicting diabetes, other classifiers such as random forest and logistic regression are more effective for this dataset. To choose the best classifier for a given job, it is crucial to assess and contrast the performance of several classifiers.

Keywords: Multi-Linear Regression; Diabetes Dataset; Logistic Regression (LR); Support Vector Machine (SVM); K-Nearest Neighbour (KNN); Decision Tree (DT)

I. INTRODUCTION

Diabetes is a widespread chronic disease affecting millions of individuals globally, and its early diagnosis can significantly improve patient outcomes [1-2]. Therefore, developing accurate prediction models for diabetes is crucial for effective treatment and management of the disease [3-5]. Diabetes comes in two primary forms: Type 1, typically diagnosed during childhood, often involves immune-related mechanisms. Type 2 diabetes, on the other hand, tends to develop later in life, particularly as individuals age, and is often associated with pancreatic diseases [6].

In 2014, 8.5% of persons over the age of 18 had blood glucose levels affected by diabetes, a chronic illness. Almost half of the 1.5 million deaths directly caused by diabetes in 2019 occurred in those under the age of 70. Diabetes also contributed to 460,000 fatalities from kidney illness, and high blood glucose levels were associated with 20% of deaths from heart and blood vessel issues [7]. Diabetes-related mortality rates increased by 3% after adjusting for age from 2000 to 2019. The increase was especially high in lower-middle-income countries, where diabetes caused 13% more deaths [8].

In this research paper, we aim to apply multilinear regression to the diabetes dataset and compare its performance with that of diverse machine learning classifiers. To accomplish this, we employ the diabetes dataset sourced from Kaggle, a widely recognized platform for data science competitions. By analyzing this data, our study aims to determine the effectiveness of various predictive models. Ultimately, our findings will help healthcare professionals make informed decisions when diagnosing and treating diabetes, leading to improved care for patients.

Our goal is to use multilinear regression to forecast diabetes and contrast it with other common classifiers such as LR, DT, SVM, and RF. We evaluate the performance of each classifier using metrics such as accuracy, precision, recall, and F1 score. We seek to identify the best classifier for diabetes forecasting and to understand the factors that affect the performance of the different classifiers.

The rest of the paper is organized as follows. Section II surveys previous work on machine learning for diabetes forecasting. Section III explains the data source, the data preparation, and the classifier training methods used in this study. Section IV reports our findings and compares the accuracy of the different classifiers. Section V discusses the limitations of the study and directions for future work, and Section VI summarizes the paper.

II. LITERATURE REVIEW

This section provides a detailed analysis of prior research focused on diabetes detection. It comprises a detailed survey of the scholarly work undertaken in this domain, offering an extensive summary of the methodologies, findings, and conclusions that constitute the current academic landscape regarding diabetes identification.

Smith et al. [9] conducted research aiming to predict diabetes using multiple linear regression models while placing emphasis on feature selection techniques. The authors explore various methods to identify pertinent features from the diabetes dataset, with the goal of enhancing the predictive performance of the multiple linear regression model. According to the study's findings, their model predicts diabetes with an accuracy of 78%.

Similarly, Deepti et al. [10] developed a model to estimate the likelihood of diabetes in patients using machine learning classifiers. They employed the Pima Indians Diabetes Database and evaluated three classifiers (DT, SVM, and Naive Bayes). They discovered that Naive Bayes achieved the highest accuracy of 76.30%. They also assessed the performance of the classifiers using precision, F-measure, and recall, and validated their results using ROC curves.

Furthermore, Priyanka et al. [11] created a logistic regression diabetes prediction model and investigated methods to improve its accuracy and performance. They made use of two datasets, PIMA Indians Diabetes and Vanderbilt, and used feature selection and ensemble methods. They achieved the highest accuracy of 78% for Dataset 1 and 93% for Dataset 2 using ensemble techniques. The research highlighted the importance of data preprocessing, feature selection, and ensemble methods in improving model accuracy and speed. Logistic regression was recognized as an effective algorithm for developing prediction models for diabetes analysis.

Sihao Wang et al. [12] explore the application of the LASSO (Least Absolute Shrinkage and Selection Operator) regression model in predicting diabetes incidence. The study outlines the methodology of data collection and preprocessing, feature selection using LASSO regression, and the construction and training of the LASSO regression model. It emphasizes the model's ability to deal with multicollinearity and overfitting by shrinking certain coefficients to zero, thus simplifying the model and focusing on significant predictors.

Boon Feng Wee et al. [13] present the application of machine learning (ML) and deep learning (DL) techniques for identifying diabetes. The study reviews a range of ML models including SVM, decision tree, random forest, k-nearest neighbor, and logistic regression. In addition, DL methods such as Convolutional Neural Network (CNN), Deep Belief Network (DBN), and Deep Neural Network (DNN) are utilized to enhance the accuracy and efficiency of diabetes identification. The paper also discusses the use of oversampling techniques and feature selection to enhance the performance of diabetes detection systems. It concludes by highlighting the potential future direction of employing advanced feature selection techniques to further improve the reliability and accuracy of diabetes detection models.

Adel Al-Zebari et al. [14] focus on evaluating various machine learning methods for detecting diabetes. The study uses the Pima Indian Diabetes Dataset (PIDD) and employs multiple classifiers including decision tree, logistic regression, discriminant analysis, support vector machines, k-nearest neighbors, and ensemble learners. All methods were implemented using the MATLAB Classification Learner tool. The study concludes that logistic regression was the most effective model, achieving the highest accuracy of 77.9%. The authors suggest that future work may include applying deep learning techniques and advanced feature selection methods to enhance classification accuracy.

Turki Alghamdi et al. [15] present a study in which data mining and machine learning techniques are used to predict diabetes and its complications. The study highlights the application of various computational intelligence techniques such as decision trees, logistic regression, support vector machines, neural networks, and particularly the XGBoost classifier, which demonstrated a notable accuracy rate of 89%. This research underscores the potential of computational intelligence in transforming healthcare approaches to diabetes, emphasizing early diagnosis and personalized treatment plans to mitigate the disease's impact.

Md Shahin Ali et al. [16] discuss a machine learning approach to improve diabetes detection through optimal parameter selection and feature engineering. The study introduces a fine-tuned Random Forest algorithm with best parameters (RFWBP), which incorporates feature engineering techniques to enhance early diabetes detection. Several data processing and mining techniques were applied to enrich the primary dataset used for training multiple machine learning models, including AdaBoost, SVM, logistic regression, and more. The proposed RFWBP achieved accuracy rates of 95.83% with 5-fold cross-validation and 90.68% without, outperforming traditional machine learning methods. The study underscores the significance of early and accurate diabetes detection in managing and mitigating the adverse impacts of the disease. Through rigorous testing and comparison with conventional methods, this research demonstrates the potential of machine learning in improving diagnostic processes for chronic conditions like diabetes.
After reviewing the literature, it is clear that there is not much research on using multilinear regression to detect diabetes, so it is important to check how well it works with diabetes data. Introducing multilinear regression to the analysis and measuring its success against other recognized machine learning classifiers will shed light on its suitability for diabetes detection. This is a key step because it fills a gap in the research and helps us understand the pros and cons of multilinear regression compared to other methods. This detailed study will help us learn more about the best ways to detect diabetes.

III. METHODS
The study outlines its methodology in a step-by-step process, as illustrated in Fig. 1. This structured approach not only improves the clarity and understandability of the analysis, but each step is also designed to add to the overall soundness and reliability of the research findings.

Figure 1. Proposed design flow for diabetes detection using various classifiers.

A. Data Collection

First, the dataset is taken from the Kaggle website and comprises a total of 768 instances. Among these instances, 268 are labeled as positive, indicating the presence of diabetes, while 500 are labeled as negative, indicating the absence of diabetes. The dataset consists of 9 variables in total, with 8 input variables and 1 output variable. Table I displays the details of the first five instances, including their features and target values.

Table I: DETAILS OF THE FIRST FIVE INSTANCES OF THE DIABETES DATASET

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
6            148      72             35             0        33.6  0.627                     50   1
1            85       66             29             0        26.6  0.351                     31   0
8            183      64             0              0        23.3  0.672                     32   1
1            89       66             23             94       28.1  0.167                     21   0
0            137      40             35             168      43.1  2.288                     33   1

B. Data Pre-Processing

In the initial phase of our methodology, we address the need for data refinement, normalization, and preparation for subsequent analysis. This crucial step involves several processes, including the removal of missing values, the handling of outliers, and the conversion of categorical variables into numerical values. First, we check whether there are any missing values in our dataset. Fig. 2 shows the resulting bar chart of missing values.

Figure 2. Details of the missing values in the diabetes dataset.

Fig. 2 presents, for each variable in the diabetes dataset, a bar whose height reflects the number of recorded (non-missing) values. The x-axis shows the variables included: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome; the vertical scale ranges from 0 to 768. Variables such as Pregnancies, Glucose, BloodPressure, BMI, DiabetesPedigreeFunction, Age, and Outcome reach the chart's maximum height, indicating that they have almost no missing data. However, the bar for SkinThickness is noticeably shorter, showing a moderate amount of missing data at roughly half of the maximum count, while the bar for Insulin is lower still, indicating that it has more missing values than the other variables. This visualization highlights which variables are most affected by missing data and may require imputation or removal strategies to ensure robust analysis. After analyzing the missing values, we filled them in. Fig. 3 shows the result after filling the missing values: the dataset no longer contains any missing values.

Figure 3. Details of the missing values in our dataset after imputation.

After addressing the missing values in our dataset, we proceeded to investigate the presence of outliers. Upon inspection, it was evident that outliers were present in the dataset. To enhance the dataset's robustness and ensure the reliability of our analysis, we employed outlier removal techniques. By eliminating these outliers, we aimed to preserve the dataset's integrity and enhance the accuracy of our subsequent analyses and model training processes.
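For illustration, the following minimal sketch (Python with the pandas/NumPy stack; the file name diabetes.csv is the conventional Kaggle name and is an assumption, as is the median imputation strategy, since the text only states that the missing values were filled) shows how the counts behind Fig. 2 and Fig. 3 can be reproduced:

    import numpy as np
    import pandas as pd

    # Load the Kaggle Pima Indians diabetes dataset (assumed file name).
    df = pd.read_csv("diabetes.csv")

    # In this dataset, physiologically impossible zeros act as missing-value
    # codes for these columns (e.g., a BMI or glucose level of 0).
    cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
    df[cols] = df[cols].replace(0, np.nan)

    # Count missing values per variable (the information shown in Fig. 2).
    print(df.isna().sum())

    # Fill the missing values; median imputation is one common choice and is
    # an assumption here, as the paper does not name its strategy.
    df[cols] = df[cols].fillna(df[cols].median())
    print(df.isna().sum())  # all zeros now, as reflected in Fig. 3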
C. Feature Selection

Following the data pre-processing stage, the focus shifts to feature selection. In this step, a meticulous approach is employed to identify and choose relevant features from the preprocessed data. This selection process is essential for improving model accuracy and interpretability. For feature selection, we first calculate a weight for each feature, which highlights how important that feature is. Fig. 4 shows a bar chart illustrating the relative importance of the various features in the diabetes dataset. The chart shows which factors are most predictive of diabetes outcomes based on the data analyzed.

At the top of the importance scale is glucose, with a value close to 0.175, indicating that glucose levels are a critical predictor of diabetes. Following closely is insulin, suggesting its significant role in diabetes management and as a predictor of the condition. Body Mass Index (BMI) also features prominently, underscoring the connection between body weight and diabetes risk. Age is another factor considered, with moderate importance, reflecting its role in increasing diabetes risk as it advances. The Diabetes Pedigree Function, which gauges genetic predisposition to diabetes, also shows a notable level of importance, though less than the aforementioned factors. Skin thickness and blood pressure appear further down the scale, suggesting they have a lesser, yet still measurable, impact on diabetes prediction. The least important feature, according to the chart, is the number of pregnancies, which holds some relevance, particularly in the context of gestational diabetes, but is less critical compared to the other factors.

After evaluating the feature importance, it was determined that glucose, insulin, BMI, age, diabetes pedigree function, and skin thickness are the pivotal parameters. Consequently, these six features were selected for model training, ensuring a comprehensive consideration of the factors crucial to diabetes prediction.

Figure 4. Relative importance of various features in diabetes prediction.

To further examine the relationships between the features, we create a correlation heatmap for the various factors associated with diabetes, using a color scale from green to red to indicate the strength and direction of the correlations between variables. The colors reflect the correlation strength, with green indicating positive and red indicating negative correlations. The values range from 1, which signifies a perfect positive correlation, to -1, indicating a perfect negative correlation, with values around 0 showing little to no correlation.

The heatmap shown in Fig. 5 highlights several relationships, such as a strong positive correlation of 0.54 between age and the number of pregnancies, suggesting that older women tend to have had more pregnancies. There is also a notable positive correlation of 0.55 between skin thickness and BMI, indicating that higher BMI may be associated with greater skin thickness. Glucose levels demonstrate moderate positive correlations with insulin and BMI, with coefficients of 0.36 and 0.23 respectively, suggesting that higher glucose levels might be linked to higher insulin levels and a higher body mass index. This visualization is crucial for understanding the interrelationships among different physiological factors in the context of diabetes, which can aid in medical research and the development of treatment strategies.

Figure 5. Correlation heatmap of various factors associated with diabetes.
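A minimal sketch of how such importance scores and the correlation heatmap can be produced is given below (Python with scikit-learn and seaborn, continuing from the DataFrame df of the previous sketch; the paper does not state which importance estimator it used, so the impurity-based random forest score here is an assumption):

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.ensemble import RandomForestClassifier

    X = df.drop(columns=["Outcome"])
    y = df["Outcome"]

    # Impurity-based feature importances, one common way to obtain weights
    # like those plotted in Fig. 4 (the paper's exact estimator is unstated).
    rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
    for name, score in sorted(zip(X.columns, rf.feature_importances_),
                              key=lambda item: item[1], reverse=True):
        print(f"{name}: {score:.3f}")

    # Correlation heatmap over all variables, as in Fig. 5 (a red-to-green
    # colour scale, matching the description above).
    sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="RdYlGn")
    plt.show()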

D. Classification Algorithms

Upon the completion of dataset splitting, the subsequent phase involves the strategic selection of models for analysis. At this stage, diverse classifiers are employed, encompassing methodologies such as MLR, RF, KNN, DT, LR, and SVM. The primary objective is to train these models comprehensively using the diabetes disease dataset and subsequently evaluate their performance.

The utilization of various classifiers serves a dual purpose: it allows for a thorough examination of how well each model handles the intricacies of the dataset, and it facilitates the assessment of their efficacy in predicting diabetes. By employing a range of classifiers, our goal is to learn more about the advantages and disadvantages of each approach, providing a comprehensive understanding of their applicability in the context of diabetes detection. This diversified approach ensures a robust evaluation and comparison of the models, paving the way for informed decisions on the most suitable model for our specific dataset. The following models will be used to gauge performance; a short sketch showing how regression outputs can be turned into class labels follows this list.

• Regression: Regression is a method used to understand how independent variables relate to a dependent variable. It is commonly employed in machine learning to forecast continuous values [17]. The objective of regression is to determine the values of the dependent variable, also called the outcome, based on a set of independent variables, also known as features [17]. For instance, regression can be used to predict weather conditions or the likelihood of a disease occurrence [18]. The general mathematical equation for regression is given in Eq. (1).

  Y = b0 + b1X1 + e    (1)

  In regression, the intercept (b0) is the value of the outcome variable (Y) when the predictor variable (X) is zero. The slope (b1) indicates how the outcome variable varies with the predictor variable. The error term (e) is the discrepancy between the observed and estimated values of the outcome variable, representing the variation in the data that the regression equation fails to account for. Regression is classified into two parts as follows:

  i. Linear Regression: Linear regression is a statistical method used to model the relationship between two or more variables. It assumes a linear relationship between the dependent variable (the one you want to predict) and one or more independent variables (the ones used to make the prediction).

  ii. Multi-Linear Regression: Multilinear regression is a technique to forecast a variable that is influenced by two or more other variables. In contrast to simple linear regression, which has one predictor variable and one outcome variable, multilinear regression has more than one predictor variable along with the outcome variable. The formula for multilinear regression is given in Eq. (2).

  Y = b0 + b1X1 + b2X2 + ... + bnXn + e    (2)

  where Y is the variable that depends on other variables, X1, X2, ..., Xn are the variables that affect Y, b0, b1, b2, ..., bn are the coefficients that show how much each variable changes Y, and e is the difference between the actual and predicted values of Y.

• Random Forest: This approach is a versatile and straightforward algorithm commonly employed for both classification and regression purposes [19] [20]. It utilizes multiple individual decision trees working collectively as a unified model. Each tree categorizes instances into specific classes, and the predicted class is determined by the one with the highest number of votes [19].

• K-Nearest Neighbour (KNN): This is a basic supervised machine learning technique that operates by locating the k nearest points from the training set to a specific input point [21-22]. The predicted class of the new input data point is determined by considering the majority vote of the classes among its k nearest neighbors [21].
• Decision Tree: A decision tree is a versatile supervised learning technique that creates forecasts using a structure resembling a tree [23]. It builds a hierarchical decision process by examining the input features, and it can address both regression and classification problems [23].

• Logistic Regression: Logistic regression is a statistical method for binary classification problems. It uses the independent variables to compute the probability of an outcome, or of membership in a class. It differs from linear regression, which predicts continuous outcomes, by modeling the connection between the independent variables and a binary outcome [24-25].

• Support Vector Machine (SVM): This supervised machine learning method is mostly applied to classification problems. It looks for a boundary in a high-dimensional space that best separates the various classes. This boundary is formed by choosing important instances called support vectors, which are essential for defining the decision boundary [26-27].
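As noted above, regression predicts a continuous value, so a cutoff is required before it can act as a detector of the binary Outcome variable. The paper does not state its decision rule, so the 0.5 threshold in the sketch below is an assumption; the splits X_train, X_test, y_train, y_test are those produced in Section III-E:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import accuracy_score

    # Ordinary least squares with default parameters (Section III-F).
    mlr = LinearRegression().fit(X_train, y_train)

    # Predictions are continuous; threshold them at an assumed 0.5 cutoff
    # to obtain binary diabetes/no-diabetes labels.
    y_pred = (mlr.predict(X_test) >= 0.5).astype(int)
    print("accuracy:", accuracy_score(y_test, y_pred))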
E. Dataset Splitting

The diabetes dataset was divided into two parts: a training set and a testing set. The training set has 428 instances and helps the model learn from the data. The testing set has 108 instances and tests the model's performance and generalization. Splitting the data in this way is essential to measure how well the model applies its learned knowledge to new data.
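The reported counts (428 training and 108 test instances) are consistent with an 80/20 split of the 536 instances remaining after outlier removal; the ratio itself is an inference, since the paper does not state it. A sketch of this step:

    from sklearn.model_selection import train_test_split

    # An 80/20 split; on 536 remaining instances this yields exactly
    # 428 training and 108 test instances (the ratio is assumed).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(len(X_train), "training instances,", len(X_test), "test instances")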
F. Model Training

Once the classification algorithms are chosen, we proceed to train the models and assess their performance on the diabetes dataset. In order to compare the performance of linear regression with the other classifiers, we iteratively train the different models. This iterative process allows us to evaluate how well the multi-linear regression model performs in contrast to the other classifiers when applied to the diabetes disease dataset. The following hyperparameters are used while training the models; a configuration sketch follows this list.
• Multi-linear Regression: In multi-linear regression, we use the default parameters that are applied automatically when an instance of the LinearRegression class is created without explicitly specifying any parameters. These defaults are the settings the model relies on unless specified otherwise.

• Random Forest: In Random Forest, we utilize the parameter n_estimators to define the number of trees in the forest. By setting it to 100, an ensemble comprising 100 decision trees is created. Additionally, we employ max_depth to establish the maximum depth of each decision tree, which is set to 5, thereby capping the depth at 5 levels. To ensure consistent results across different runs, we utilize random_state, fixing it at 42 to set the random seed for reproducibility.

• Logistic Regression: In logistic regression, we use the default parameters to train the model. The default penalty parameter is L2, which specifies the type of regularization applied. max_iter specifies the maximum number of iterations for the solver to converge, with a default of 100.

• K-Nearest Neighbour: In KNN, we use the n_neighbors parameter, which defines the number of neighbors to consider when making predictions. We set this parameter to 5, meaning the classifier will consider the labels of the five nearest neighbors. We use the metric parameter, which determines the distance metric used for calculating the distance between points. Here, minkowski is employed, which is a generalization of other distance metrics such as the Euclidean and Manhattan distances. We also use the p parameter, which applies when the minkowski metric is selected and defines the power parameter of the Minkowski metric. When p is set to 2, it corresponds to using the Euclidean distance.

• Decision Tree: Here we use the max_depth parameter, which determines the maximum depth of the decision tree. We set this parameter to 5, meaning the tree will grow to a maximum depth of 5 levels.

• Support Vector Machine: Here we use the kernel parameter, which specifies the type of kernel used by the SVM. We use a linear kernel, which indicates a linear decision boundary. We also use the random_state parameter, which sets the seed used by the random number generator. Fixing it to 0 ensures reproducibility of results across different runs.
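Gathered in one place, the configurations described above can be sketched as follows (scikit-learn assumed; every parameter not mentioned in the text is left at its library default):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    models = {
        # Multi-linear regression: LinearRegression with default parameters.
        "MLR": LinearRegression(),
        # Random forest: 100 trees, depth capped at 5, fixed seed.
        "RF": RandomForestClassifier(n_estimators=100, max_depth=5,
                                     random_state=42),
        # Logistic regression: defaults (L2 penalty, max_iter=100).
        "LR": LogisticRegression(),
        # KNN: 5 neighbours, Minkowski metric with p=2 (Euclidean distance).
        "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
        # Decision tree: depth capped at 5.
        "DT": DecisionTreeClassifier(max_depth=5),
        # SVM: linear kernel, fixed seed.
        "SVM": SVC(kernel="linear", random_state=0),
    }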

G. Performance Metrics

There are various evaluation measures that may be used to assess each classifier's performance. The following evaluation metrics are used in this research, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

i. Accuracy: Accuracy is the most straightforward metric, measuring the proportion of correctly classified instances among all instances. The accuracy can be calculated using Eq. (3).

  Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

ii. Precision: Precision measures the proportion of true positive predictions among all positive predictions made by the classifier. The precision can be calculated using Eq. (4).

  Precision = TP / (TP + FP)    (4)

iii. Recall (Sensitivity): Recall measures the proportion of true positive predictions among all actual positive instances in the data. The recall can be calculated using Eq. (5).

  Recall = TP / (TP + FN)    (5)

iv. F1-score: The F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics. The F1 score can be calculated using Eq. (6).

  F1-score = 2 x (Precision x Recall) / (Precision + Recall)    (6)
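All four metrics follow directly from the confusion-matrix counts and map onto standard scikit-learn helpers; a sketch (assuming binary test labels y_test and predictions y_pred):

    from sklearn.metrics import (accuracy_score, f1_score,
                                 precision_score, recall_score)

    # Eq. (3)-(6), computed internally from the TP/TN/FP/FN counts.
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))
    print("f1-score :", f1_score(y_test, y_pred))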

IV. RESULTS AND DISCUSSION

The results of applying the different classifiers to the diabetes dataset are presented in this section. We now discuss the findings in detail, highlighting the performance of each classifier on the given dataset.

A. Multi-Linear Regression (MLR)

Our main objective was to utilize a diabetes dataset and apply multi-linear regression to build a predictive model. We then proceeded to compare its performance against the other classification algorithms. We trained a multi-linear regression model and tested it on data that was not used for training. The model had an accuracy of 80.5%. It also had an F1 score of 0.76, a precision of 0.78, and a recall of 0.73. The confusion matrix of the model is shown in Fig. 6.

Figure 6. Confusion matrix illustrating the performance evaluation of the multi-linear regression model applied to the diabetes dataset.

B. Random Forest (RF)

Continuing on the diabetes dataset, we trained a random forest classifier as the next algorithm. The random forest model demonstrated a higher accuracy of approximately 81.4% on the unseen test data compared to the multi-linear regression. Moreover, the random forest classifier achieved an improved F1 score of 0.795, precision of 0.795, and recall of 0.795. This means that the random forest algorithm was more accurate and better at making predictions than the multi-linear regression model. Fig. 7 shows the confusion matrix of the random forest.

Figure 7. Confusion matrix illustrating the performance evaluation of the random forest model applied to the diabetes dataset.

C. Logistic Regression (LR)

Continuing with our classifier evaluation, we applied logistic regression to the diabetes dataset. On the test dataset, the LR model outperformed the multi-linear regression, scoring 81.4% accuracy. Furthermore, the LR model obtained an F1 score of 0.79, precision of 0.80, and recall of 0.85. These results demonstrate that the logistic regression algorithm outperformed the multi-linear regression model in all aspects. The logistic regression confusion matrix is shown in Fig. 8.

Figure 8. Confusion matrix illustrating the performance evaluation of the logistic regression model applied to the diabetes dataset.

D. K-Nearest Neighbour (KNN)

Another classifier we utilized for training on the diabetes dataset was K-Nearest Neighbors (KNN). Interestingly, the KNN classifier exhibited performance similar to the multi-linear regression model. It achieved a test accuracy of 80.5%, along with an F1 score of 0.79, precision of 0.75, and recall of 0.86. These results indicate that the KNN classifier performed at a level comparable to the multi-linear regression model. The confusion matrix of the k-nearest neighbour model is displayed in Fig. 9.

Figure 9. Confusion matrix illustrating the performance evaluation of the K-Nearest Neighbour model applied to the diabetes dataset.

E. Decision Tree (DT)

We applied the decision tree to the diabetes dataset as well. However, this model showed poor performance compared to the multi-linear regression. The decision tree classifier achieved 78.7% accuracy on the test data, with an F1 score of 0.76, precision of 0.78, and recall of 0.73. These metrics indicate that the decision tree model performed less well than the multi-linear regression in appropriately identifying occurrences. Fig. 10 depicts the confusion matrix of the decision tree.

Figure 10. Confusion matrix illustrating the performance evaluation of the decision tree model applied to the diabetes dataset.

F. Support Vector Machine (SVM)

Finally, we trained a Support Vector Machine (SVM) on the diabetes data. The SVM model exhibited an accuracy of 79.6%, which is comparable to the accuracy achieved by the multilinear regression model. Moreover, the F1 score, precision, and recall of the SVM model were 0.76, 0.81, and 0.71, respectively. The SVM confusion matrix is shown in Fig. 11.

Figure 11. Confusion matrix illustrating the performance evaluation of the support vector machine model applied to the diabetes dataset.

G. Comparison Table

A comparison of the several classifiers and multilinear regression based on performance criteria, namely accuracy, F1 score, precision, and recall, is shown in Table II. This comparison allows for an assessment of the classifiers' effectiveness in terms of their ability to accurately predict outcomes.

Table II: COMPARISON OF MODEL PERFORMANCE IN DIABETES DETECTION

Classifier  Accuracy (%)  F1 Score  Precision  Recall
MLR         80.5          0.76      0.78       0.73
RF          81.4          0.795     0.795      0.795
LR          81.4          0.79      0.80       0.85
KNN         80.5          0.79      0.75       0.86
DT          78.7          0.76      0.78       0.73
SVM         79.6          0.76      0.81       0.71
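For illustration, a sketch of the evaluation loop that produces numbers of the kind reported in Table II, reusing the models dictionary and data splits from the sketches above (the 0.5 cutoff for the regression outputs is again an assumption):

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import (accuracy_score, f1_score,
                                 precision_score, recall_score)

    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        if isinstance(model, LinearRegression):
            # Regression outputs are continuous; threshold to class labels.
            y_pred = (y_pred >= 0.5).astype(int)
        print(f"{name}: acc={accuracy_score(y_test, y_pred):.3f}, "
              f"f1={f1_score(y_test, y_pred):.3f}, "
              f"prec={precision_score(y_test, y_pred):.3f}, "
              f"rec={recall_score(y_test, y_pred):.3f}")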
H. Enhancing Diabetes Prediction through Classifier Comparison and Analysis

i. Random Forest and Logistic Regression: Both RF and LR outperformed Multi-Linear Regression (MLR) in terms of accuracy, F1 score, precision, and recall. This could be attributed to their ability to capture non-linear relationships between the features and the target variable. RF, being an ensemble method, combines multiple decision trees, which can handle complex interactions between features better than MLR. LR, on the other hand, is inherently suited to binary classification tasks like diabetes prediction and can learn complex decision boundaries.

ii. K-Nearest Neighbors: KNN exhibited performance similar to MLR, suggesting that its simple approach might not be as effective at capturing the underlying patterns in the dataset as more sophisticated algorithms like RF and LR. KNN relies heavily on the choice of the number of neighbors (k) and the distance metric, which might not be optimal for this particular dataset.

iii. Decision Tree and Support Vector Machine: DT performed less well than MLR, indicating that it might have overfit the training data or failed to capture the underlying patterns effectively. SVM, while achieving accuracy comparable to MLR, had lower recall, suggesting that it might have misclassified some positive instances as negative. This could be due to the choice of hyperparameters or the nature of the dataset.

iv. Model Complexity and Generalization: The performance differences among the classifiers highlight the importance of selecting appropriate models that balance complexity and generalization. More complex models like RF and LR might perform better on the test data but could be prone to overfitting if not regularized properly.
V. LIMITATIONS AND FUTURE WORK

This section outlines the limitations encountered during the course of this research. The study faced several constraints, including:

• Imbalanced Class Distribution: The dataset exhibited an imbalanced distribution among the classes, potentially affecting the classifiers' performance and the generalization of the results.

• Biases in Data Collection: Biases such as sampling bias or selection bias were identified in the data collection process. These biases may have ramifications for the generalizability of the study findings, as they could skew the representation of certain population segments.

• Dataset Size: Additionally, the size of the dataset presents a notable limitation, as it is relatively small. The limited number of instances may restrict the robustness of the models developed and could contribute to poorer performance in terms of predictive accuracy and generalization.

Deep Neural Networks (DNNs) could be a game-changer for improving diabetes prediction models. Known for their precision in various classification and prediction tasks, DNNs could significantly boost the accuracy and robustness of these models. By analyzing diabetes data with DNNs, we can expect major strides in how accurately the condition can be predicted: their ability to learn complex patterns through multiple layers means they can find hidden relationships in the data that simpler models might miss.

DNNs are also flexible and can handle large, diverse datasets, which is often a challenge in medical research. They can automatically pick out important features from the data, saving time and possibly revealing new insights. However, they are complex and need a lot of computing power, which can make them hard to work with and interpret. There is therefore a push to make DNNs more transparent and efficient, especially if they are to be used in healthcare.
VI. CONCLUSION

The aim of this research paper was to apply multilinear regression to the diabetes dataset and compare its performance with various machine learning classifiers. The novelty of this research stems from the assessment of the diabetes dataset using multilinear regression and the subsequent comparison of its performance with several other classifiers, namely decision trees, random forests, K-nearest neighbors, logistic regression, and SVM. We found that linear regression obtained an accuracy of 80.5%, while other classifiers such as random forest and logistic regression surpassed linear regression, with accuracy scores of 81.4% and 81.25%, respectively. Conversely, the decision tree, k-nearest neighbour, and SVM were less effective, obtaining accuracies of 78.7%, 80.5%, and 79.6%, respectively.

These results indicate that the selection of classifier has a considerable influence on the predictive performance of the model. Although linear regression can be utilized for predicting diabetes, other classifiers such as random forest and logistic regression demonstrate greater efficiency for this specific dataset. To select the best classifier for a given job, it is therefore crucial to assess and contrast the performance of several classifiers.
REFERENCES

[1] Chowdhury, M. M., Ayon, R. S., & Hossain, M. S. (2024). An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset. Healthcare Analytics, 5, 100297.
[2] Alam, A., Dhoundiyal, S., Ahmad, N., & Rao, G. K. (2024). Unveiling diabetes: Categories, genetics, diagnostics, treatments, and future horizons. Current Diabetes Reviews, 20(4), 10-22.
[3] Franks, P. W., Cefalu, W. T., Dennis, J., Florez, J. C., Mathieu, C., Morton, R. W., ... & Stehouwer, C. D. (2023). Precision medicine for cardiometabolic disease: a framework for clinical translation. The Lancet Diabetes & Endocrinology, 11(11), 822-835.
[4] Mustafa, S. M. N., Zaki, H., Zehra, S. S., & Shoaib, M. (2022). Significance and challenges of big data in healthcare: A review. University of Sindh Journal of Information and Communication Technology, 6(1), 25-30. Retrieved from https://sujo.usindh.edu.pk/index.php/USJICT/article/view/6265
[5] Muhammad, A. H., & Faisal, A. (2022). Integration of artificial intelligence and human computer interaction in healthcare. University of Sindh Journal of Information and Communication Technology, 6(3), 101-107. Retrieved from https://sujo.usindh.edu.pk/index.php/USJICT/article/view/6279
[6] ElSayed, N. A., Aleppo, G., Aroda, V. R., Bannuru, R. R., Brown, F. M., Bruemmer, D., ... & American Diabetes Association. (2023). 2. Classification and diagnosis of diabetes: standards of care in diabetes—2023. Diabetes Care, 46(Supplement_1), S19-S40.
[7] Mumtaz, M. T., Khan, M. M. F., Uzair, M., Khan, M. A., Salman, A., & Khan, H. F. T. (2023). The silent killer: Investigating the influence of stress on cardiovascular health of diabetic patients. Pakistan Journal of Medical & Health Sciences, 17(05), 397-397.
[8] Liu, Y., Wang, D., Huang, X., Liang, R., Tu, Z., You, X., ... & Chen, W. (2023). Temporal trend and global burden of type 2 diabetes attributable to non-optimal temperature, 1990-2019: an analysis for the Global Burden of Disease Study 2019. Environmental Science and Pollution Research, 1-10.
[9] Alhussan, A. A., Abdelhamid, A. A., Towfek, S. K., Ibrahim, A., Eid, M. M., Khafaga, D. S., & Saraya, M. S. (2023). Classification of diabetes using feature selection and hybrid Al-Biruni earth radius and dipper throated optimization. Diagnostics, 13(12), 2038.
[10] Sisodia, D., & Sisodia, D. S. (2018). Prediction of diabetes using classification algorithms. Procedia Computer Science, 132, 1578-1585.
[11] Rajendra, P., & Latifi, S. (2021). Prediction of diabetes using logistic regression and ensemble techniques. Computer Methods and Programs in Biomedicine Update, 1, 100032.
[12] Wang, S., Chen, Y., Cui, Z., Lin, L., & Zong, Y. (2024). Diabetes risk analysis based on machine learning LASSO regression model. Journal of Theory and Practice of Engineering Science, 4(01), 58-64.
[13] Wee, B. F., Sivakumar, S., Lim, K. H., Wong, W. K., & Juwono, F. H. (2024). Diabetes detection based on machine learning and deep learning approaches. Multimedia Tools and Applications, 83(8), 24153-24185.
[14] Al-Zebari, A., & Sengur, A. (2019, November). Performance comparison of machine learning techniques on diabetes disease detection. In 2019 1st International Informatics and Software Engineering Conference (UBMYK) (pp. 1-4). IEEE.
[15] Alghamdi, T. (2023). Prediction of diabetes complications using computational intelligence techniques. Applied Sciences, 13(5), 3030.
[16] Ali, M. S., Islam, M. K., Das, A. A., Duranta, D. U. S., Haque, M., & Rahman, M. H. (2023). A novel approach for best parameters selection and feature engineering to analyze and detect diabetes: Machine learning insights. BioMed Research International, 2023.
[17] Kumari, K., & Yadav, S. (2018). Linear regression analysis study. Journal of the Practice of Cardiovascular Sciences, 4(1), 33-36.
[18] Leung, P., & Tran, L. T. (2000). Predicting shrimp disease occurrence: artificial neural networks vs. logistic regression. Aquaculture, 187(1-2), 35-49.
[19] Brown, S. H. (2009). Multiple linear regression analysis: a matrix approach with MATLAB. Alabama Journal of Mathematics, 34, 1-3.
[20] Jeong, J. H., Resop, J. P., Mueller, N. D., Fleisher, D. H., Yun, K., Butler, E. E., ... & Kim, S. H. (2016). Random forests for global and regional crop yield predictions. PLoS ONE, 11(6), e0156571.
[21] Hu, J., & Szymczak, S. (2023). A review on longitudinal data analysis with random forest. Briefings in Bioinformatics, 24(2), bbad002.
[22] Cunningham, P., & Delany, S. J. (2021). k-Nearest neighbour classifiers: a tutorial. ACM Computing Surveys (CSUR), 54(6), 1-25.
[23] Syriopoulos, P. K., Kalampalikis, N. G., Kotsiantis, S. B., & Vrahatis, M. N. (2023). kNN classification: a review. Annals of Mathematics and Artificial Intelligence, 1-33.
[24] Shehadeh, A., Alshboul, O., Al Mamlook, R. E., & Hamedat, O. (2021). Machine learning models for predicting the residual value of heavy construction equipment: An evaluation of modified decision tree, LightGBM, and XGBoost regression. Automation in Construction, 129, 103827.
[25] Peng, C. Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. The Journal of Educational Research, 96(1), 3-14.
[26] Loh, W. Y. (2023). Logistic regression tree analysis. In Springer Handbook of Engineering Statistics (pp. 593-604). London: Springer London.
[27] Shahi, T. B., & Pant, A. K. (2018, February). Nepali news classification using Naive Bayes, support vector machines and neural networks. In 2018 International Conference on Communication Information and Computing Technology (ICCICT) (pp. 1-5). IEEE.
