Sindh University
Abstract: Machine learning has become an important tool in many fields, including healthcare. In this research
paper, we apply multi-linear regression to a diabetes dataset and compare its performance with that of several
machine learning classifiers. The novelty of this research lies in evaluating the diabetes dataset using
multi-linear regression and then comparing its performance against several other classifiers, including
Decision Trees (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Support
Vector Machines (SVM). There is little research on using multi-linear regression to detect diabetes, so
it is important to check how well it works on diabetes data. Introducing multi-linear regression to the
analysis and measuring its performance against other established machine learning classifiers sheds light
on its suitability for diabetes detection. Our results show that multi-linear regression achieved an accuracy of
80.5%. However, random forest and logistic regression outperformed it,
achieving accuracy scores of 81.4% and 81.25%, respectively. Furthermore, decision tree, KNN,
and SVM, which are often used for classification tasks, did not surpass multi-linear regression on this dataset,
achieving accuracies of 78.7%, 80.5%, and 79.6%, respectively. This suggests that model performance depends
on the choice of classifier. Our findings suggest that, although linear regression can be used for predicting diabetes,
random forest and logistic regression are more effective for this dataset. To choose the best
classifier for a given task, it is crucial to assess and compare the performance of several classifiers.
Keywords: Multi-Linear Regression; Diabetes Dataset; Logistic Regression (LR); Support Vector Machine (SVM); K-Nearest Neighbour
(KNN); Decision Tree (DT);
methods used in this study. In Section 4, we report our findings and compare the accuracy of different classifiers. In Section 5, we discuss the significance of our results and summarize the paper.

II. LITERATURE REVIEW
This section provides a detailed analysis of prior research focused on diabetes detection. It comprises a survey of the scholarly work undertaken in this domain, summarizing the methodologies, findings, and conclusions that constitute the current academic landscape regarding diabetes identification.
Smith et al. [9] conducted research aiming to predict diabetes using multiple linear regression models while placing emphasis on feature selection techniques. The authors explore various methods to identify pertinent features from the diabetes dataset, with the goal of enhancing the predictive performance of the multiple linear regression model. According to the study's findings, their constructed model predicts diabetes with an accuracy rate of 78%.
Similarly, Deepti et al. [10] developed a model to estimate the likelihood of diabetes in patients using machine learning classifiers. They employed the Pima Indians Diabetes Database and evaluated three classifiers (DT, SVM, and Naive Bayes). They found that Naive Bayes achieved the highest accuracy of 76.30%. They also assessed the performance of the classifiers using Precision, F-Measure, and Recall, and validated their results using ROC curves.
Furthermore, Priyanka et al. [11] created a logistic regression diabetes prediction model and investigated methods to improve its accuracy and performance. They made use of two datasets, PIMA Indians Diabetes and Vanderbilt, and used feature selection and ensemble methods. Using ensemble techniques, they achieved the highest accuracy of 78% for Dataset 1 and 93% for Dataset 2. The research highlighted the importance of data preprocessing, feature selection, and ensemble methods in improving model accuracy and speed. Logistic regression was recognized as an effective algorithm for developing prediction models for diabetes analysis.
Sihao Wang et al. [12] explore the application of the LASSO (Least Absolute Shrinkage and Selection Operator) regression model in predicting diabetes incidence. The study outlines the methodology of data collection and preprocessing, feature selection using LASSO regression, and the construction and training of the LASSO regression model. It emphasizes the model's ability to deal with multicollinearity and overfitting by shrinking certain coefficients to zero, thus simplifying the model and focusing on significant predictors.
Boon Feng Wee et al. [13] present the application of machine learning (ML) and deep learning (DL) techniques for identifying diabetes. The study reviews a range of ML models including SVM, decision tree, random forest, k-nearest neighbor, and logistic regression. In addition, DL methods such as Convolutional Neural Network (CNN), Deep Belief Network (DBN), and Deep Neural Network (DNN) are utilized to enhance the accuracy and efficiency of diabetes identification. Additionally, the paper discusses the use of oversampling techniques and feature selection to enhance the performance of diabetes detection systems. It concludes by highlighting the potential future direction of employing advanced feature selection techniques to further improve the reliability and accuracy of diabetes detection models.
Adel Al-Zebari et al. [14] evaluate various machine learning methods for detecting diabetes. The study uses the Pima Indian Diabetes Dataset (PIDD) and employs multiple classifiers including decision tree, logistic regression, discriminant analysis, support vector machines, k-nearest neighbors, and ensemble learners. All methods were implemented using the MATLAB Classification Learner Tool. The study concludes that Logistic Regression was the most effective model, achieving the highest accuracy of 77.9%. The authors suggested that future work may include applying deep learning techniques and advanced feature selection methods to enhance classification accuracy.
Turki Alghamdi et al. [15] present a study in which data mining and machine learning techniques are used to predict diabetes and its complications. The study highlights the application of various computational intelligence techniques such as decision trees, logistic regression, support vector machines, neural networks, and particularly the XGBoost classifier, which demonstrated a notable accuracy rate of 89%. This research underscores the potential of computational intelligence in transforming healthcare approaches towards diabetes, emphasizing early diagnosis and personalized treatment plans to mitigate the disease's impact.
Md Shahin Ali et al. [16] discuss a machine learning approach to improve diabetes detection through optimal parameter selection and feature engineering. The study introduces a fine-tuned Random Forest algorithm with best parameters (RFWBP), which incorporates feature engineering techniques to enhance early diabetes detection. Several data processing and mining techniques were applied to enrich the primary dataset used for training multiple machine learning models, including AdaBoost, SVM, logistic regression, and more. The proposed RFWBP achieved accuracy rates of 95.83% with 5-fold cross-validation and 90.68% without, outperforming traditional machine learning methods. The study underscores the significance of early and accurate diabetes detection to manage and mitigate the adverse impacts of the disease. Through rigorous testing and comparison with conventional methods, this research demonstrates the potential of machine learning in improving diagnostic processes for chronic conditions like diabetes.
University of Sindh Journal of Information and Communication Technology (USJICT) Vol.7(2), pg.: 47-56
enhance the accuracy of our subsequent analyses and model training processes.

C. Feature Selection
Following the data pre-processing stage, the focus shifts to feature selection. In this step, relevant features are identified and chosen from the preprocessed data. This selection is essential for improving model accuracy and interpretability. We first calculate a weight for each feature that indicates how important it is. Fig. 4 shows a bar chart of the relative importance of the features in the diabetes dataset, indicating which factors are most predictive of diabetes outcomes in the data analyzed. At the top of the importance scale is glucose, with a value close to 0.175, indicating that glucose level is a critical predictor of diabetes. Following closely is insulin, suggesting its significant role in diabetes management and as a predictor of the condition. Body Mass Index (BMI) also features prominently, underscoring the connection between body weight and diabetes risk.
Age is another factor, with moderate importance, reflecting the increase in diabetes risk as age advances. The Diabetes Pedigree Function, which gauges genetic predisposition to diabetes, also shows a notable level of importance, though less than the aforementioned factors. Skin thickness and blood pressure appear further down the scale, suggesting they have a lesser, yet still measurable, impact on diabetes prediction. The least important feature, according to the chart, is the number of pregnancies, which holds some relevance, particularly in the context of gestational diabetes, but is less critical than the other factors.

Figure 4 Relative importance of various features in diabetes prediction

After evaluating the feature importance, it was determined that glucose, insulin, BMI, age, diabetes pedigree function, and skin thickness are pivotal parameters. Consequently, these six features were selected for model training, ensuring a comprehensive consideration of factors crucial to diabetes prediction.
To further examine the relationships between the features, we create a correlation heatmap for the factors associated with diabetes, using a color scale from green to red to indicate the strength and direction of correlations between variables, with green indicating positive and red indicating negative correlations. The values range from 1, which signifies a perfect positive correlation, to -1, indicating a perfect negative correlation, with values around 0 showing little to no correlation.
The heat map shown in Fig. 5 highlights several relationships, such as a positive correlation of 0.54 between age and the number of pregnancies, suggesting that older women tend to have had more pregnancies. There is also a notable positive correlation of 0.55 between skin thickness and BMI, indicating that higher BMI may be associated with greater skin thickness. Glucose levels show weak to moderate positive correlations with insulin and BMI, with coefficients of 0.36 and 0.23 respectively, suggesting that higher glucose levels might be linked to higher insulin levels and a higher body mass index.
This visualization is crucial for understanding the interrelationships among different physiological factors in the context of diabetes, which can aid in medical research and the development of treatment strategies.

Table I: DETAILS OF THE FIRST FIVE INSTANCES OF DIABETES DATASET
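The correlation analysis behind the heatmap can be illustrated with a minimal sketch. It assumes Pearson correlation (the text does not name the coefficient) and uses synthetic stand-in columns rather than the diabetes dataset; the column names, the helper `correlated_pairs`, and the 0.3 threshold are hypothetical choices made only for illustration.

```python
import numpy as np

# Synthetic stand-in columns (NOT the diabetes dataset); names are illustrative.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(35.0, 10.0, n)
pregnancies = 0.1 * age + rng.normal(0.0, 1.5, n)      # built to correlate with age
bmi = rng.normal(30.0, 5.0, n)
skin_thickness = 0.8 * bmi + rng.normal(0.0, 5.0, n)   # built to correlate with BMI

names = ["Age", "Pregnancies", "BMI", "SkinThickness"]
data = np.column_stack([age, pregnancies, bmi, skin_thickness])

# Pearson correlation matrix; rowvar=False treats each column as one variable.
corr = np.corrcoef(data, rowvar=False)


def correlated_pairs(corr, names, threshold=0.3):
    """List (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs


for a, b, r in correlated_pairs(corr, names):
    print(f"{a} vs {b}: r = {r}")
```

The matrix `corr` is what a heatmap would color; any plotting library that accepts a 2-D array could render it with a diverging color scale.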
considering the majority vote of the classes among its k nearest neighbors [21].
• Decision Tree: A decision tree is a versatile supervised learning technique that makes predictions using a tree-like structure [23]. It builds a hierarchical decision process by examining the input features and can address both regression and classification problems [23].
• Logistic Regression: Logistic regression is a statistical method for binary classification problems. It uses independent variables to compute the probability of an outcome or of membership in a class. It differs from linear regression, which predicts continuous outcomes, by modeling the relationship between independent variables and a binary outcome [24-25].
• Support Vector Machine (SVM): This supervised machine learning method is mostly applied to classification problems. It looks for a boundary in a high-dimensional space that best separates the classes. This decision boundary is defined by important instances called support vectors [26-27].

F. Model Training
Once the classification algorithms are chosen, we train each model and assess its performance on the diabetes dataset. To compare the performance of linear regression with that of the other classifiers, we train the models one after another. This process allows us to evaluate how well the multi-linear regression model performs in contrast to other classifiers when applied to the diabetes dataset. The following hyperparameters are used while training the models:
• Multi-linear Regression: We use the default parameters, that is, the settings applied when an instance of the LinearRegression class is created without explicitly specifying any parameters.
• Random Forest: We use the n_estimators parameter to define the number of trees in the forest. Setting it to 100 creates an ensemble of 100 decision trees. We also use max_depth to set the maximum depth of each decision tree to 5 levels. To ensure consistent results across different runs, we fix random_state at 42, which sets the random seed for reproducibility.
• Logistic Regression: We use the default parameters to train the model. The default penalty parameter is L2, which specifies the type of regularization applied. The max_iter parameter specifies the maximum number of iterations for the solver to converge, with a default of 100.
• K-Nearest Neighbour: We use the n_neighbors parameter, which defines the number of neighbors to consider when making predictions. We set it to 5, meaning the classifier considers the labels of the five nearest neighbors. The metric parameter sets the distance metric used for calculating the distance between points. Here, minkowski is used, which generalizes distance metrics such as the Euclidean and Manhattan distances. The p parameter, which applies when the 'minkowski' metric is selected, defines the power parameter of the Minkowski metric. Setting p to 2 corresponds to the Euclidean distance.
• Decision Tree: We use the max_depth parameter, which determines the maximum depth of the decision tree. We set it to 5, meaning the tree grows to a maximum depth of 5 levels.
• Support Vector Machine: We use the kernel parameter, which specifies the type of kernel used by the SVM. We use a linear kernel, which yields a linear decision boundary. We also set the random_state parameter, which seeds the random number generator, to 0 to ensure reproducibility of results across different runs.

G. Performance Metrics
Various evaluation measures may be used to assess each classifier's performance. The following evaluation metrics are used in this research, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives:
i. Accuracy: Accuracy is the most straightforward metric, measuring the proportion of correctly classified instances among all instances. The accuracy can be calculated using Eq. (3).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (3)$$
ii. Precision: Precision measures the proportion of true positive predictions among all positive predictions made by the classifier. The precision can be calculated using Eq. (4).
$$\text{Precision} = \frac{TP}{TP + FP} \quad (4)$$
iii. Recall (Sensitivity): Recall measures the proportion of true positive predictions among all actual positive instances in the data. The recall can be calculated using Eq. (5).
$$\text{Recall} = \frac{TP}{TP + FN} \quad (5)$$
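The hyperparameter settings listed under Model Training can be sketched as follows. This is a minimal illustration, assuming scikit-learn (which the parameter names suggest) and synthetic stand-in data rather than the diabetes dataset. The 0.5 threshold used to turn the linear regression output into a class label, the 80/20 split, and the helper names `build_models` and `predict_labels` are assumptions for illustration and are not stated in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models():
    """Return the six models with the hyperparameters stated in the text."""
    return {
        "multi_linear_regression": LinearRegression(),  # library defaults
        "random_forest": RandomForestClassifier(
            n_estimators=100, max_depth=5, random_state=42
        ),
        # L2 regularization is the library default; max_iter=100 is also the default.
        "logistic_regression": LogisticRegression(max_iter=100),
        "knn": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
        "decision_tree": DecisionTreeClassifier(max_depth=5),
        "svm": SVC(kernel="linear", random_state=0),
    }


def predict_labels(model, X):
    """Predict 0/1 labels; regression output is thresholded at 0.5 (assumption)."""
    raw = model.predict(X)
    if isinstance(model, LinearRegression):
        return (raw >= 0.5).astype(int)
    return raw


# Synthetic stand-in data with six features (NOT the diabetes dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=1, random_state=0
)
# The 80/20 split is an assumption; the text does not state the split used.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train each model in turn and record its test accuracy.
accuracies = {}
for name, model in build_models().items():
    model.fit(X_train, y_train)
    accuracies[name] = float(np.mean(predict_labels(model, X_test) == y_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

The accuracies printed here describe the synthetic data only and are not comparable to the results reported in the paper.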
Figure 3 Representation of the confusion matrix illustrating the performance evaluation of a multi-linear regression model applied to the diabetes dataset.

B. Random Forest (RF)
After applying multi-linear regression to the diabetes dataset, we trained a random forest classifier as the next algorithm. The random forest model achieved a higher accuracy than multi-linear regression, approximately 81.4% on the unseen test data. Moreover, the random forest classifier achieved an F1 score of 0.795, precision of 0.795, and recall of 0.795. This means that the random forest algorithm was more accurate and better at making predictions than the multi-linear regression model. Fig. 7 shows the confusion matrix of the random forest.

Figure 5 Representation of the confusion matrix illustrating the performance evaluation of a logistic regression model applied to the diabetes dataset.

D. K-Nearest Neighbour (KNN)
Another classifier we trained on the diabetes dataset was K-Nearest Neighbors (KNN). Interestingly, the KNN classifier exhibited similar performance to the multi-linear regression model. It achieved a test accuracy of 80.5%, along with an F1 score of 0.79, precision of 0.75, and recall of 0.86. These results indicate that the KNN classifier performed at a comparable level to the multi-linear regression model. The confusion matrix of the k-nearest neighbor classifier is displayed in Fig. 9.
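The accuracy, precision, recall, and F1 values reported for each classifier all follow from the four counts of a binary confusion matrix. The sketch below shows that arithmetic under stated assumptions: the function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the paper's confusion matrices.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the paper's results).
metrics = classification_metrics(tp=30, fp=10, fn=5, tn=55)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall, which is a quick sanity check on any reported triple of values.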
and the distance metric, which might not be optimal for this particular dataset.
iii. Decision Tree and Support Vector Machine: DT performed less well than MLR, indicating that it might have overfit the training data or failed to capture the underlying patterns effectively. SVM, while achieving accuracy comparable to MLR, had lower recall, suggesting that it misclassified some positive instances as negative. This could be due to the choice of hyperparameters or the nature of the dataset.
iv. Model Complexity and Generalization: The performance differences among classifiers highlight the importance of selecting models that balance complexity and generalization. More complex models like RF and LR might perform better on the test data but could be prone to overfitting if not regularized properly.

V. LIMITATIONS AND FUTURE WORK
This section outlines the limitations encountered during the course of this research. The study encountered several constraints, including:
• Imbalanced Class Distribution: The dataset exhibited an imbalanced distribution among the classes, potentially affecting the classifiers' performance and the generalization of the results.
• Biases in Data Collection: Biases such as sampling bias or selection bias were identified in the data collection process. These biases may limit the generalizability of the study findings, as they could skew the representation of certain population segments.
• Dataset Size: The size of the dataset is a notable limitation, as it is relatively small. The limited number of instances may restrict the robustness of the models developed and could contribute to poorer predictive accuracy and generalization.

Deep Neural Networks (DNNs) could substantially improve diabetes prediction models. Known for their precision in various classification and prediction tasks, DNNs could significantly boost the accuracy and robustness of these models.
By analyzing diabetes data with DNNs, we can expect major strides in how accurately the condition can be predicted. DNNs' ability to learn complex patterns through multiple layers means they can find hidden relationships in the data that simpler models might miss.
DNNs are also highly flexible and can handle large, diverse datasets, which is often a challenge in medical research. They can automatically pick out important features from the data, saving time and possibly revealing new insights. However, they are complex and need a lot of computing power, which can make them hard to work with and understand. There is therefore a push to make DNNs more transparent and efficient, especially if they are going to be used in healthcare.

VI. CONCLUSION
The aim of this research paper was to apply multilinear regression to the diabetes dataset and compare its performance with various machine learning classifiers. The novelty of this research stems from the assessment of the diabetes dataset using multilinear regression and the subsequent comparison of its performance with several other classifiers, such as decision trees, random forests, K-nearest neighbors, logistic regression, and SVM. We found that linear regression obtained an accuracy of 80.5%, while random forest and logistic regression surpassed it, with accuracy scores of 81.4% and 81.25% respectively. Conversely, the decision tree, k-nearest neighbour, and SVM did not surpass linear regression, obtaining accuracies of 78.7%, 80.5%, and 79.6% respectively.
These results indicate that the selection of classifier influences the predictive performance of the model. Although linear regression can be used for predicting diabetes, random forest and logistic regression perform better on this specific dataset. To select the best classifier for a given task, it is therefore crucial to assess and compare the performance of several classifiers.

REFERENCES
[1] G Chowdhury, M. M., Ayon, R. S., & Hossain, M. S. (2024). An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset. Healthcare Analytics, 5, 100297.
[2] Alam, A., Dhoundiyal, S., Ahmad, N., & Rao, G. K. (2024). Unveiling diabetes: Categories, genetics, diagnostics, treatments, and future horizons. Current Diabetes Reviews, 20(4), 10-22.
[3] Franks, P. W., Cefalu, W. T., Dennis, J., Florez, J. C., Mathieu, C., Morton, R. W., ... & Stehouwer, C. D. (2023). Precision medicine for cardiometabolic disease: a framework for clinical translation. The Lancet Diabetes & Endocrinology, 11(11), 822-835.
[4] Syed Muhammad Nabeel Mustafa, Hassan Zaki, Syeda Sundus Zehra, & Muhammad Shoaib. (2022). Significance and Challenges of Big Data in Healthcare: A Review. University of Sindh Journal of Information and Communication Technology, 6(1), 25-30. Retrieved from https://sujo.usindh.edu.pk/index.php/USJICT/article/view/6265
[5] Abdul Hafeez Muhammad, & Amna Faisal. (2022). Integration of Artificial Intelligence and Human Computer Interaction in Healthcare. University of Sindh Journal of Information and Communication Technology, 6(3), 101-107. Retrieved from https://sujo.usindh.edu.pk/index.php/USJICT/article/view/6279
[6] ElSayed, N. A., Aleppo, G., Aroda, V. R., Bannuru, R. R., Brown, F. M., Bruemmer, D., ... & American Diabetes Association. (2023). 2. Classification and diagnosis of diabetes: standards of care in diabetes—2023. Diabetes care, 46(Supplement_1), S19-S40.
[7] Mumtaz, M. T., Khan, M. M. F., Uzair, M., Khan, M. A., Salman, A., & Khan, H. F. T. (2023). The Silent Killer: Investigating the Influence of Stress on Cardiovascular Health of Diabetic Patients. Pakistan Journal of Medical & Health Sciences, 17(05), 397-397.
[8] Liu, Y., Wang, D., Huang, X., Liang, R., Tu, Z., You, X., ... & Chen, W. (2023). Temporal trend and global burden of type 2 diabetes attributable to non-optimal temperature, 1990–2019: an analysis for the Global Burden of Disease Study 2019. Environmental Science and Pollution Research, 1-10.
[9] Alhussan, A. A., Abdelhamid, A. A., Towfek, S. K., Ibrahim, A., Eid, M. M., Khafaga, D. S., & Saraya, M. S. (2023). Classification of Diabetes Using Feature Selection and Hybrid Al-Biruni Earth Radius and Dipper Throated Optimization. Diagnostics, 13(12), 2038.
[10] Sisodia, D., & Sisodia, D. S. (2018). Prediction of diabetes using classification algorithms. Procedia computer science, 132, 1578-1585.
[11] Rajendra, P., & Latifi, S. (2021). Prediction of diabetes using logistic regression and ensemble techniques. Computer Methods and Programs in Biomedicine Update, 1, 100032.
[12] Wang, S., Chen, Y., Cui, Z., Lin, L., & Zong, Y. (2024). Diabetes Risk Analysis Based on Machine Learning LASSO Regression Model. Journal of Theory and Practice of Engineering Science, 4(01), 58-64.
[13] Wee, B. F., Sivakumar, S., Lim, K. H., Wong, W. K., & Juwono, F. H. (2024). Diabetes detection based on machine learning and deep learning approaches. Multimedia Tools and Applications, 83(8), 24153-24185.
[14] Al-Zebari, A., & Sengur, A. (2019, November). Performance comparison of machine learning techniques on diabetes disease detection. In 2019 1st international informatics and software engineering conference (UBMYK) (pp. 1-4). IEEE.
[15] Alghamdi, T. (2023). Prediction of diabetes complications using computational intelligence techniques. Applied Sciences, 13(5), 3030.
[16] Ali, M. S., Islam, M. K., Das, A. A., Duranta, D. U. S., Haque, M., & Rahman, M. H. (2023). A novel approach for best parameters selection and feature engineering to analyze and detect diabetes: Machine learning insights. BioMed Research International, 2023.
[17] Kumari, K., & Yadav, S. (2018). Linear regression analysis study. Journal of the practice of Cardiovascular Sciences, 4(1), 33-36.
[18] Leung, P., & Tran, L. T. (2000). Predicting shrimp disease occurrence: artificial neural networks vs. logistic regression. Aquaculture, 187(1-2), 35-49.
[19] Brown, S. H. (2009). Multiple linear regression analysis: a matrix approach with MATLAB. Alabama Journal of Mathematics, 34, 1-3.
[20] Jeong, J. H., Resop, J. P., Mueller, N. D., Fleisher, D. H., Yun, K., Butler, E. E., ... & Kim, S. H. (2016). Random forests for global and regional crop yield predictions. PloS one, 11(6), e0156571.
[21] Hu, J., & Szymczak, S. (2023). A review on longitudinal data analysis with random forest. Briefings in Bioinformatics, 24(2), bbad002.
[22] Cunningham, P., & Delany, S. J. (2021). k-Nearest neighbour classifiers-A Tutorial. ACM computing surveys (CSUR), 54(6), 1-25.
[23] Syriopoulos, P. K., Kalampalikis, N. G., Kotsiantis, S. B., & Vrahatis, M. N. (2023). k NN Classification: a review. Annals of Mathematics and Artificial Intelligence, 1-33.
[24] Shehadeh, A., Alshboul, O., Al Mamlook, R. E., & Hamedat, O. (2021). Machine learning models for predicting the residual value of heavy construction equipment: An evaluation of modified decision tree, LightGBM, and XGBoost regression. Automation in Construction, 129, 103827.
[25] Peng, C. Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. The journal of educational research, 96(1), 3-14.
[26] Loh, W. Y. (2023). Logistic regression tree analysis. In Springer handbook of engineering statistics (pp. 593-604). London: Springer London.
[27] Shahi, T. B., & Pant, A. K. (2018, February). Nepali news classification using Naive Bayes, support vector machines and neural networks. In 2018 international conference on communication information and computing technology (iccict) (pp. 1-5). IEEE.