Diabetes Prediction with Machine Learning
Submitted by:
CANDIDATE’S DECLARATION
I hereby certify that the work which is being presented in the Synopsis entitled “Diabetes
Prediction Using Machine Learning” in partial fulfillment of the requirements for the
award of the Degree of Master of Computer Applications in the Department of Computer
Applications of the Graphic Era (Deemed to be University), Dehradun shall be carried out by
the undersigned under the supervision of Mr. Harendra Singh Negi, Assistant Professor,
Department of Computer Applications, Graphic Era (Deemed to be University), Dehradun.
The above-mentioned student shall be working under the supervision of the undersigned on
the “Diabetes Prediction Using Machine Learning”.
Signature Signature
Supervisor Head of the Department
Table of Contents
Abstract
1.1 Introduction
2.1 Introduction
4.2 Methodology
4.6 Objectives
Description
5.1 Evaluation
5.2 Results
Chapter 6 Conclusion
References
Abstract
Diabetes is a disease that causes health problems all around the world. According to the
International Diabetes Federation (IDF), 382 million people in the world have diabetes, and
by 2035 this number is projected to grow to 592 million. Diabetes increases the risk of
chronic problems such as heart disease and chronic kidney disease (CKD). If the disease is
caught early, people can live long, healthy lives. Predictions in the medical field are difficult,
but they can ultimately help doctors make timely, data-driven decisions about a patient's
health. Machine learning techniques offer a way to address this problem.
This project aims to create a model that reliably predicts diabetes in patients.
Machine learning models trained on appropriate data can help diagnose diabetes at an early
stage. Efficient preprocessing techniques such as standardization have also been applied.
To detect diabetes at an early stage, this project applies machine learning classifiers,
namely K-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, and
Logistic Regression. The Pima Indians Diabetes Database (PIDD) is used in the experiment.
Its purpose is to diagnose whether a patient has diabetes using the diagnostic measures
included in the dataset. Various measures such as Precision, Accuracy, Specificity, and
F1 Score are computed over the classified instances using the confusion matrix.
The accuracies of the algorithms were compared, and the project concludes that the Random
Forest classifier produced the best results, with an accuracy of 75.97%. Using
machine learning methods, this project aims to assist doctors and physicians in the early
detection of diabetes.
Chapter 1
Diabetes is an illness that causes health problems all around the world.
According to the IDF, 382 million people worldwide have diabetes, and by 2035 this number
is projected to grow to 592 million. Diabetes is a disease caused by increased blood glucose
levels. High serum glucose can cause symptoms such as frequent micturition, increased thirst,
and increased hunger. Diabetes is a leading cause of blindness, kidney failure,
amputation due to diabetic foot, congestive heart failure (CHF), and stroke. When we consume
food, it is converted into glucose. In response to high glucose, the pancreas secretes the
hormone insulin. Insulin helps transport glucose into our cells, where it is used as energy.
However, this well-organised mechanism fails in diabetes. Type 1 and type 2 diabetes are the
most common types of the disease, but there are others, such as gestational diabetes and less
common forms like MODY and type 1.5 diabetes. Machine learning is a field of data science
that studies how machines can learn from experience and apply what they have learned to
new problems.
1.1 Introduction
Diabetes is a disorder in which the beta cells of the pancreas cannot produce enough insulin
or the body becomes resistant to it [1]. Insulin is mainly responsible for
regulating blood glucose levels. Factors influencing diabetes mellitus (DM) include obesity,
lack of exercise, high blood pressure, and abnormal cholesterol levels. It causes many
problems, of which increased micturition is the most common [2]. It damages the skin, nerves,
and eyes and, if not treated early, can lead to renal failure and diabetic retinopathy, which
have a poor prognosis. According to the International Diabetes Federation (IDF), 537 million
people worldwide have diabetes [3]. According to 2019 statistics, approximately 7.1 million
people in Bangladesh are affected by this disease [2].
According to the World Health Organization (WHO), diabetes affects 8.5% of people over the
age of 18 and causes 1.6 million deaths worldwide (World Health Organization, 2021).
Although premature deaths from diabetes decreased between 2000 and 2010 in many developing
countries, the statistics increased between 2010 and 2016. Chronic respiratory diseases and
diabetes have killed more than 18% of the world's population and have become a public health
problem.
Artificial intelligence and machine learning technologies give clinicians a tool that helps
them understand the disease and reduces their workload.
1.2 Types of Diabetes
such as obesity, a history of gestational diabetes in previous pregnancies, and a family
history of diabetes contribute significantly.
Symptoms: Gestational diabetes often does not cause noticeable symptoms, but it can
be detected through routine screening tests during pregnancy.
Management: Management includes lifestyle changes such as healthy eating and
regular physical activity. Some women may need insulin therapy if lifestyle changes
are not sufficient to control blood sugar levels. Regular monitoring of blood glucose
levels is essential to ensure they remain within a healthy range.
1.3 Symptoms of Diabetes
Diabetes can present with a range of symptoms that vary in intensity and onset depending
on the type and severity of the disease. Understanding these symptoms is crucial for early
diagnosis and effective management.
Peripheral neuropathy is a serious complication that needs to be addressed to prevent
further nerve damage.
1.4 Causes of Diabetes
Diabetes is a multifaceted disease with various underlying causes. The primary types of
diabetes—Type 1, Type 2, and gestational diabetes—each have distinct causes, though some
factors may overlap. Understanding these causes can help in managing and potentially
preventing the disease.
Type 1 Diabetes
Type 1 diabetes is an autoimmune condition. It occurs when the body’s immune system
mistakenly attacks and destroys the insulin-producing beta cells in the pancreas. The precise
cause of this autoimmune response is not completely understood, but several factors are
believed to contribute:
1. Genetic Factors: Certain genes increase the risk of developing Type 1 diabetes. Family
history plays a significant role, and individuals with close relatives who have Type 1 diabetes
are at a higher risk.
2. Environmental Triggers: Environmental factors, such as exposure to certain viruses,
might trigger the autoimmune response in genetically predisposed individuals. Possible viral
triggers include the Coxsackievirus, mumps, and rubella.
3. Autoimmune Reactions: In Type 1 diabetes, the immune system’s T cells attack the
insulin-producing beta cells. The exact mechanism behind this autoimmune reaction is still
being researched, but it involves a complex interaction of genetic and environmental factors.
Type 2 Diabetes
Type 2 diabetes is primarily associated with insulin resistance, where the body’s cells do not
respond effectively to insulin, and with inadequate insulin production over time. Several
factors contribute to the development of Type 2 diabetes:
1. Lifestyle Factors:
Obesity: Excess body fat, particularly around the abdomen, increases the body’s resistance
to insulin.
Physical Inactivity: A sedentary lifestyle contributes to insulin resistance and weight gain,
increasing the risk of Type 2 diabetes.
Unhealthy Diet: Diets high in processed foods, sugars, and unhealthy fats can lead to weight
gain and insulin resistance.
2. Genetic Factors: A family history of Type 2 diabetes significantly increases the risk.
Certain genes related to glucose metabolism and insulin production can predispose
individuals to the disease.
3. Age: The risk of Type 2 diabetes increases with age, particularly after the age of 45.
However, the incidence is also rising among younger populations, including children and
adolescents, due to increasing rates of obesity and inactivity.
4. Ethnicity: Certain ethnic groups, including African Americans, Hispanics, Native
Americans, and Asian Americans, have a higher prevalence of Type 2 diabetes, suggesting a
genetic predisposition in these populations.
5. Metabolic Syndrome: A cluster of conditions—high blood pressure, high blood sugar,
abnormal cholesterol levels, and excess abdominal fat—collectively known as metabolic
syndrome, significantly raises the risk of Type 2 diabetes.
Gestational Diabetes
Gestational diabetes occurs during pregnancy and typically resolves after childbirth, but it
increases the risk of developing Type 2 diabetes later in life. The causes of gestational
diabetes include:
1. Hormonal Changes: During pregnancy, the placenta produces hormones that help the
baby grow. Some of these hormones can make the mother’s cells more resistant to insulin. As
the pregnancy progresses, the placenta enlarges and produces more of these hormones,
increasing insulin resistance.
2. Insulin Demand: As insulin resistance increases, the pancreas tries to compensate by
producing more insulin. If the pancreas cannot keep up with the increased demand, blood
sugar levels rise, leading to gestational diabetes.
3. Genetic and Lifestyle Factors: Similar to Type 2 diabetes, genetic predisposition and
lifestyle factors like obesity and physical inactivity also play a role in the development of
gestational diabetes.
Machine learning holds significant promise in the medical field, offering opportunities to
improve diagnostics, treatment planning, disease management, and patient outcomes.
Continued advancements in machine learning algorithms, coupled with the availability of
large-scale medical data, have the potential to revolutionize healthcare and usher in a new era
of personalized and effective medical interventions.
Disease Diagnosis and Detection: Machine learning algorithms can analyze large amounts of
medical data, including patient records, medical images, and genetic information, to assist in
disease diagnosis and detection. By learning patterns and relationships in data, machine
learning models can identify subtle signs or indicators of diseases, enabling earlier and more
accurate diagnoses.
Personalized Treatment Planning: Machine learning algorithms can analyze patient data, such
as medical history, genetic information, and treatment outcomes, to create personalized
treatment plans. By considering individual patient characteristics, machine learning models
can assist in selecting the most effective treatments and predicting potential adverse
reactions, ultimately leading to improved patient outcomes.
With ongoing advancements in machine learning algorithms and the availability of extensive
medical data, the potential for further transformative applications in the medical field is
substantial. Machine learning, when used in conjunction with human expertise, has the power
to revolutionize healthcare, improving diagnostics, treatment planning, disease management,
and patient outcomes.
Machine learning (ML) plays a crucial role in predicting diabetes by analyzing vast amounts
of medical and lifestyle data to identify individuals at risk. This application leverages various
algorithms and data sources to provide accurate and early predictions, which are essential for
effective prevention and management of diabetes.
False negative type: the patient is actually diabetic, but the test results show that the
person does not have diabetes.
False positive type: the patient is not actually diabetic, but the test report
shows that he or she is diabetic.
Unclassifiable type: the system cannot determine the patient's status. This happens when
there is not enough information in previous data, so the patient is assigned an unknown type.
Such misdiagnosis can lead to inappropriate treatment or to missing treatment when it is
needed. To prevent this or reduce its severity, machine learning algorithms need to be
developed that provide accurate results and reduce human effort.
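The three error types above can be illustrated with a small sketch. The function name `outcome_type`, the use of `None` to represent an undetermined prediction, and the sample cases are assumptions made for illustration; they are not taken from the project's code.

```python
from typing import Optional


def outcome_type(actual_diabetic: bool, predicted_diabetic: Optional[bool]) -> str:
    """Name the outcome for one patient, given the truth and the prediction."""
    if predicted_diabetic is None:
        # The system could not decide (assumed encoding for the unknown type).
        return "unclassifiable"
    if actual_diabetic:
        return "true positive" if predicted_diabetic else "false negative"
    return "false positive" if predicted_diabetic else "true negative"


# Hypothetical (actual, predicted) pairs, one per outcome type.
cases = [(True, False), (False, True), (True, None), (True, True), (False, False)]
labels = [outcome_type(actual, predicted) for actual, predicted in cases]
print(labels)
```

A false negative is usually the costlier error in screening, because a diabetic patient goes untreated.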
We use various classification and association methods to predict diabetes. Machine learning
is a method used to train a computer or machine. Many machine learning methods gather
information by building various classification and association models from the collected data
to provide useful results.
These programs make predictions based on variables such as your health history and lifestyle.
They examine many samples of people with and without diabetes to make better predictions.
For example, they might focus on how much sugar someone eats or how much exercise they
get. By doing this, they can give early warning to people who are at risk of developing
diabetes so they can take better care of themselves.
The Pima Indian dataset is an open-source dataset that is publicly available for machine
learning research and is used in this project in conjunction with private datasets [4]. It
contains data on 768 patients, of whom 268 are diabetic.
Figure 1.1 Percentage of people having diabetes in the Pima Indian dataset
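The class split behind Figure 1.1 follows directly from the counts stated above (768 patients, 268 diabetic). The short sketch below works it out. The majority-class baseline is an added illustration, not a result reported in this project: it shows the accuracy a model would reach by always predicting "non-diabetic".

```python
total_patients = 768
diabetic = 268
non_diabetic = total_patients - diabetic

diabetic_pct = 100 * diabetic / total_patients
non_diabetic_pct = 100 * non_diabetic / total_patients

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline_accuracy = max(diabetic, non_diabetic) / total_patients

print(f"Diabetic: {diabetic_pct:.1f}%")
print(f"Non-diabetic: {non_diabetic_pct:.1f}%")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```

Because the classes are imbalanced, accuracy alone can be misleading, which is one reason to report recall and precision as well.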
Chapter 2
Literature Survey
2.1 Introduction
The accurate and timely detection of diabetes plays a critical role in effective diagnosis,
treatment planning, and patient outcomes. Over the years, extensive research has been
conducted in this field, leveraging advancements in machine learning algorithms and
techniques. These studies have paved the way for the development of sophisticated and
reliable approaches for diabetes detection.
Research efforts are now using advanced technologies, primarily machine learning (ML), to
improve health outcomes. Diabetes is a chronic and common disease, and many studies have
focused on using machine learning to better predict and manage its onset and progression.
This section reviews some of the important
existing work in this field.
2.3 Evolution of Machine Learning in Diabetes Detection
Machine learning algorithms, with their capacity to learn from data patterns and make
predictions, have transformed the field of medical diagnostics, particularly in the detection
and management of diabetes. These algorithms enhance the accuracy, speed, and objectivity
of diabetes detection, providing healthcare professionals with valuable insights. In recent
years, the application of machine learning techniques, such as neural networks, decision trees,
and ensemble models, has yielded promising results in diabetes prediction and diagnosis.
Machine learning models can analyze vast datasets, including electronic health records,
genetic information, and lifestyle factors, to identify individuals at risk of developing
diabetes. By leveraging large-scale data, these models detect subtle patterns that might be
missed by traditional methods, enabling earlier and more precise diagnosis.
The integration of machine learning in diabetes detection marks a significant advancement in
healthcare, offering automated and efficient solutions that support medical professionals in
making informed decisions. As these technologies continue to evolve, they hold the promise
of further enhancing the early detection and management of diabetes, ultimately leading to
better patient outcomes and quality of life.
2.4 Overview of Prior Research
Yashoda et al. [5] created a diabetes patient database by collecting data from a hospital's
repository; it contained 200 cases with nine attributes. The attributes fall into two
groups: blood tests and urine tests. In this study, the data were classified using WEKA and
evaluated with the 10-fold cross-validation method, which works well on small datasets, and
the results were compared. The Naive Bayes, J48, REP Tree, and Random Tree algorithms were
used. J48 gave the best result, with 60.2% accuracy.
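To show what 10-fold cross-validation means in practice, here is a minimal sketch of how 200 cases can be split into 10 folds. It is a plain-Python illustration, not WEKA's implementation: the helper `k_fold_indices` is hypothetical, and it neither shuffles nor stratifies the data, which real tools typically do.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 200 cases, as in the study described above.
folds = list(k_fold_indices(200, k=10))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training cases, {len(test)} test cases")
```

Each case is used for testing exactly once and for training in the other nine folds, so every sample contributes to the evaluation even when the dataset is small.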
Zou et al. (2018) [6] predicted diabetes using decision trees, random forests, and neural
networks. The data were collected from physical examinations in Luzhou, China. Principal
component analysis (PCA) was used to reduce the dimensionality of the data. They selected
several machine learning methods and conducted independent tests to verify the validity of
the approach.
Deeraj Shetty et al. [8] proposed an Intelligent Diabetes Disease Prediction System, built
with data mining techniques, that analyzes diabetes using a database of diabetes patients.
In this system, they propose applying algorithms such as Bayesian classification and KNN
(K-Nearest Neighbor) to the patient database and analyzing various attributes of diabetes
to predict the disease.
Tejas N [9] proposed using machine learning to predict diabetes with three machine
learning methods: SVM, logistic regression, and ANN. This study provides a
useful tool for rapid diagnosis of diabetes.
Aishwarya et al. [10] aimed to discover solutions for diabetes diagnosis by analyzing
patterns found in the data through classification analysis with decision tree and Naive
Bayes algorithms. The research sought a faster way to recognize the disease, which would
help treat patients in a timely manner. The J48 algorithm achieved an accuracy of
74.8%, while Naive Bayes achieved an accuracy of 79.5% using a 70:30 split.
Gupta et al. [11] aimed to calculate the accuracy, sensitivity, and specificity of
various classification methods and to compare the results of several methods
deployed in WEKA. They also compared the performance of the same classifiers, using the
same parameters (accuracy, sensitivity, and specificity), in other tools, including
RapidMiner and MATLAB. They used the JRIP, Jgraft, and BayesNet algorithms. The results
showed that Jgraft had the highest accuracy of 81.3%, with a sensitivity of 59.7% and a
specificity of 81.4%. It was concluded that WEKA works better than MATLAB and RapidMiner.
Lee et al. [12] focused on applying a decision tree algorithm called CART to diabetes
medical records after applying a resampling filter to the data. The author highlights the
class imbalance problem and argues that it needs to be handled before applying any
algorithm in order to achieve better accuracy. Class imbalance often occurs in datasets
whose class variable is binary, that is, it has two possible outcomes. If the imbalance is
identified early, during data preprocessing, it can be handled easily and will help improve
the accuracy of the prediction model.
In their recent study [13], Mohan and Jain employed the SVM algorithm to analyze and
predict diabetes using the Pima Indian Diabetes Dataset. They experimented with four
distinct types of kernels: linear, polynomial, RBF, and sigmoid, to perform the predictions on
a machine learning platform. The accuracies achieved with these different kernels varied,
ranging from 0.69 to 0.82. Notably, the SVM method utilizing the radial basis function (RBF)
kernel achieved the highest accuracy, reaching 0.82.
Olisah et al. [14] carried out diabetes mellitus prediction using advanced feature selection and
various machine learning models. They utilized two open-source datasets: the Pima Indian
and LMCH Iraqi databases. To handle missing samples, a polynomial regression-based
preprocessing technique was applied. Hyperparameter tuning was conducted for the random
forest, decision tree, and deep neural network (DNN) models. The optimized DNN technique
achieved the highest accuracy, with scores of 0.972 for the Pima dataset and 0.973 for the
LMCH dataset.
Ramesh et al. [15] developed an automated remote system for predicting diabetes using the
Pima Indian dataset. They applied various data preprocessing methods, including feature
scaling, feature selection, and SMOTE. The SVM with an RBF kernel achieved the highest
accuracy of 83.2%. This machine learning framework was integrated into an Android
application.
Jyotismita et al. [16] developed a mechanism for detecting and analyzing diabetes using six
facets: dataset, processing methods, feature extraction, machine learning identification, and
classification and diagnosis of diabetes mellitus (DM), addressing the limitations of
classification. They compared various supervised, unsupervised, and clustering techniques.
Each dataset presented unique challenges, highlighting the need for significant improvements
to enhance the efficiency of detecting different diabetic conditions.
Branimir et al. [17] introduced a system aimed at addressing two main challenges: the
heterogeneity of previous techniques and the lack of transparency in feature selection.
Utilizing the PRISMA methodology, they conducted a comparison of 18 different models,
including tree-based algorithms. The study concluded that KNN and SVM are predominantly
used for prediction.
Nur et al. [18] concentrated primarily on data preprocessing, which involved removing
missing values, balancing the dataset, assessing feature importance, and performing data
augmentation. They used Random Forest (RF) and Logistic Regression (LR) for
classification. The results showed a 20% increase in precision and a 24% increase in recall
compared to data that had not undergone preprocessing.
Safial et al. [19] proposed a strategy for diagnosing diabetes using a deep learning (DL)
network, employing 5-fold and 10-fold cross-validation for training. Utilizing the Pima
Indians dataset, they achieved a prediction accuracy of 98.35% with 10-fold cross-validation.
Bavkar et al. [20] developed a pipeline model utilizing deep learning (DL) techniques to
predict diabetes. The model includes data augmentation with a variational autoencoder
(VAE), feature augmentation with a sparse autoencoder (SAE), and classification with a
convolutional neural network (CNN). Using the Pima dataset from the UCI Repository, they
achieved an accuracy of 92.31% by training the CNN classifier in conjunction with SAE for
feature augmentation, compared to a well-balanced dataset.
Goyal and his team [21] developed a smart home health monitor to detect diabetes. The
authors also used the Pima Indian dataset for their research. They used formal judgments
to estimate blood pressure, and they used SVM, KNN, and decision trees to predict diabetes.
Among these models, SVM outperformed the other classification algorithms, with 75% accuracy.
Further research could focus on comparing and optimizing different classification
algorithms to improve the accuracy and reliability of diabetes prediction.
Limited Analysis of Multi-Class Diabetes Classification: While many studies
focus on binary classification of diabetes (diabetic vs. non-diabetic), there is a
significant gap in the literature concerning the comprehensive evaluation and
comparison of algorithms for multi-class diabetes classification. Diabetes presents
in various forms, such as Type 1, Type 2, and gestational diabetes, each requiring
different management strategies.
Investigating the Effectiveness of Different Machine Learning Approaches in
Classifying Various Types of Diabetes: Exploring diverse machine learning
approaches to accurately classify the types of diabetes could significantly enhance
diagnostic tools.
Chapter 3
Literature Review
The objectives of the proposed work are as follows:
Chapter 4
Which machine learning algorithms demonstrate high accuracy in classifying individuals as
diabetic or non-diabetic?
How do different feature extraction methods influence the prediction performance?
What is the impact of dataset characteristics?
4.2 Methodology
In the methodology, the data preprocessing steps are described, including the normalization
process and the preparation of the dataset for machine learning algorithms. By scaling and
standardizing the features, consistent input for subsequent processing and training is enabled.
S. No   Attributes
1       Pregnancy
2       Glucose
3       Blood Pressure
4       Skin Thickness
5       Insulin
6       BMI
7       Age
8       Diabetes Pedigree Function
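The standardization step described in the methodology can be sketched as follows. This is a minimal standard-library illustration of z-score scaling, not the project's actual preprocessing code. The helper `standardize` and the glucose readings are hypothetical. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Hypothetical glucose readings for five patients.
glucose = [85, 148, 183, 89, 137]
glucose_scaled = standardize(glucose)
print([round(z, 3) for z in glucose_scaled])
```

Scaling matters most for distance-based and margin-based models such as KNN and SVM, where a feature with a large numeric range would otherwise dominate the others.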
4.3 Objectives
The objective of this research is to conduct a comprehensive comparative study of machine
learning-based diabetes prediction using various clinical and physiological datasets. The
primary aim is to identify the most accurate and effective machine learning algorithm for
predicting diabetes. By evaluating and analyzing the performance of different algorithms,
considering various factors such as dataset characteristics and feature extraction methods, we
seek to contribute to the development of automated and efficient diabetes prediction systems.
To analyze the impact of dataset characteristics and feature extraction methods: The
dataset characteristics, including the distribution of diabetic and non-diabetic
instances, as well as the feature extraction methods employed, will be analyzed to
understand their influence on the performance of the machine learning algorithms.
The aim is to identify the key factors that contribute to accurate and reliable diabetes
prediction and provide insights into improving the effectiveness of the algorithms.
To recommend the most accurate and effective machine learning algorithm for
diabetes prediction: Based on the comparative analysis and evaluation results, a
recommendation will be made regarding the most accurate and effective machine
learning algorithm for diabetes prediction. The recommendation will consider factors
such as classification accuracy, robustness, computational efficiency, and suitability
for real-world clinical applications.
Chapter 5
Result Analysis
In this project, a variety of tools and Python libraries were instrumental in conducting the
experiment and reaching conclusions. The data collection phase commenced with the
acquisition of diabetes-related datasets, followed by meticulous labeling to classify instances
as either indicative or non-indicative of diabetes. This classification was denoted by assigning
a label of 1 to instances indicating diabetes and 0 to those not indicating the condition.
Following training, confusion matrices were generated to quantitatively assess the models'
performance.
To evaluate the effectiveness of each algorithm in predicting diabetes, a range of performance
metrics was employed. These included Recall (R), F1-score (F1), Accuracy (A), and Precision
(P). By extracting true positive (T_P), true negative (T_N), false positive (F_P), and false
negative (F_N) values from the confusion matrices, these metrics were calculated and used to
determine the most effective algorithm for diabetes prediction.
Several tools were employed to initiate and complete the study on diabetes prediction using
machine learning techniques. Data were collected from a Kaggle dataset and processed for
the training and testing of models. Tools provided by various Python libraries such as
Matplotlib, scikit-learn, and pandas supported the analysis in this research. The workflow began
with the collection of the dataset, followed by the labeling process to classify the data into
two categories: non-diabetic instances labeled as 0 and diabetic instances labeled as 1.
The data were then subjected to a pre-processing phase. All features were normalized to
ensure uniformity in scale and to enhance the performance of the machine learning models.
The processed data was then used to train various predictive models.
Once the models were trained, a confusion matrix was generated for each model to quantify
its performance. Performance metrics such as Recall (R), F1-score (F1), Accuracy (A), and
Precision (P) were utilized to analyze the efficacy of the algorithms employed in this experiment.
These metrics helped determine which algorithm offered the best performance in predicting
diabetes.
The confusion matrices provided true positive (TP), true negative (TN), false positive (FP),
and false negative (FN) values, which were essential for computing the aforementioned
performance parameters. This comprehensive evaluation approach allowed for a robust
analysis of the machine learning algorithms' capabilities in diabetes prediction, guiding the
selection of the most effective model for practical applications in healthcare.
Fig 5.4 Confusion Matrix And Classification Report Of Logistic Regression
5.1 Evaluation
This is the last step of the prediction model. Here, we use various metrics such as
classification accuracy, the confusion matrix, precision, recall, and F1 score to evaluate
the prediction results.
1. Accuracy: It is the ratio of the number of correct predictions to the total number of
input samples. It is given as:
\[
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions Made}} \qquad (5.1)
\]
2. Confusion Matrix: It gives us a matrix as output and describes the complete
performance of the model.
Where, TP: True Positive
FP: False Positive
FN: False Negative
TN: True Negative
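As a minimal sketch of how the four cells are obtained, the snippet below counts TP, FP, FN, and TN from true and predicted labels, using the labeling stated earlier (1 for diabetic, 0 for non-diabetic). The helper `confusion_counts` and the sample labels are hypothetical and are not the project's data.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = diabetic)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical labels for eight test patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```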
3. Precision : Precision is defined as the ratio of correctly classified positive samples
(True Positive) to a total number of classified positive samples (either correctly or
incorrectly).
4. Recall : The recall is calculated as the ratio between the numbers of Positive samples
correctly classified as Positive to the total number of Positive samples. The recall
measures the model's ability to detect positive samples.
5. F1 Score : The F1 score is calculated as the harmonic mean of precision and recall.
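The definitions above can be written directly in terms of the confusion-matrix cells: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R). The sketch below applies them to hypothetical counts, which are illustrative only and are not results from this project.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 200 test patients.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The zero checks guard against division by zero when a model predicts no positives or the test set contains none.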
The KNN algorithm demonstrated an accuracy of 0.7143, indicating a 71.43%
overall correct classification rate. It achieved a recall of 0.88, implying its ability to
correctly identify 88% of the diabetes-positive cases. The F1-score of 0.80 indicates a
balanced performance between precision and recall.
The SVM algorithm demonstrated an accuracy of 0.7468, indicating a 74.68%
overall correct classification rate. It achieved a recall of 0.89, implying its ability to
correctly identify 89% of the diabetes-positive cases. The F1-score of 0.82 indicates a
balanced performance between precision and recall.
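Because F1 is the harmonic mean of precision and recall, the precision implied by a reported F1 and recall can be recovered by rearranging F1 = 2PR / (P + R) into P = F1 · R / (2R − F1). The sketch below applies this to the recall and F1 values quoted above. The inputs are rounded to two decimals, so the outputs are approximations derived here, not figures reported by the project.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for precision P."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("No valid precision exists for this F1 and recall.")
    return f1 * recall / denominator


# Recall and F1 values as quoted in the text (rounded to two decimals).
knn_precision = implied_precision(f1=0.80, recall=0.88)
svm_precision = implied_precision(f1=0.82, recall=0.89)
print(f"KNN implied precision: {knn_precision:.2f}")
print(f"SVM implied precision: {svm_precision:.2f}")
```

In both cases the implied precision is noticeably lower than the recall, so the F1-score sits between the two rather than reflecting equal values.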
5.1.3 Results
This study proceeds in several steps. The scheme uses different classification and ensemble
methods and is implemented in Python. These are machine learning techniques
used to extract the most useful information from data. In this study, the Random Forest
classifier achieved better results than the other classifiers. Overall, we applied suitable
machine learning techniques to make predictions and obtained good results.
Figure 5.8 Heatmap of correlation analysis
A heatmap is a graphical representation of data where values in a matrix are represented as
colors. It is particularly useful for visualizing the relationships between two variables across
multiple data points. In a heatmap, each cell in the matrix is assigned a color based on its
value, creating a visual representation of the data's patterns and correlations.
Heatmaps are often used in data analysis and visualization to:
Identify Patterns: Heatmaps help identify patterns and trends in large datasets by
visually highlighting areas of high or low values.
Overall, heatmaps provide a powerful visual tool for understanding complex data and
extracting meaningful insights from it.
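The matrix behind a correlation heatmap can be sketched as follows. This standard-library example computes pairwise Pearson correlation coefficients for three of the dataset's attributes. The readings are hypothetical, and the plotting step, which would map each cell to a color, is omitted.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical readings for five patients.
features = {
    "Glucose": [85, 148, 183, 89, 137],
    "Blood Pressure": [66, 72, 64, 66, 40],
    "Age": [31, 50, 32, 21, 33],
}
names = list(features)
correlation = [[pearson(features[a], features[b]) for b in names] for a in names]

for name, row in zip(names, correlation):
    print(f"{name:>15}: " + "  ".join(f"{value:+.2f}" for value in row))
```

The matrix is symmetric with ones on the diagonal, which is why heatmaps of this kind show a bright diagonal line.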
Figure 5.9 Scatter Plot
Chapter 6
Conclusion
Diabetes can reduce both life expectancy and quality of life. In the long term, early
diagnosis of this disease can reduce the risk of many complications. In this
project, automatic prediction of diabetes is proposed using various machine learning methods.
Machine learning techniques can support doctors in diagnosing and treating diabetes. We
should note that improving classification accuracy helps machine learning models achieve
better results. The performance analysis compares the accuracy of all the
classification techniques used: Decision Tree, SVM, Random Forest, Logistic
Regression, and K-Nearest Neighbors.
In this project, several machine learning algorithms were applied to the task of diabetes
prediction using the Pima Indian Diabetes dataset. The evaluated algorithms included
Logistic Regression, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision
Tree, and Random Forest. The primary objective was to compare their performance and
identify the most effective algorithm for accurately predicting diabetes cases.
The results obtained from the experiments provided valuable insights into the performance of
these algorithms. Confusion matrices were analyzed to derive performance metrics such as
accuracy, precision, recall, and F1-score. Based on the accuracy metric, the Random
Forest algorithm achieved the highest accuracy score of 0.7597, closely followed by the SVM
with an accuracy score of 0.7468. The Logistic Regression and KNN algorithms also
demonstrated good accuracy scores, between 0.7143 and 0.7403. The Decision Tree
algorithm achieved the lowest accuracy score of 0.6623.
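A comparison of this kind can be sketched with scikit-learn as shown below. This is not the project's actual code: the data are synthetic (generated with make_classification and sized like the Pima dataset, 768 rows and 8 features), and the 80/20 split, the default hyperparameters, and the use of StandardScaler for the scale-sensitive models are assumptions. The accuracies it prints therefore will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption: 768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale-sensitive models are wrapped in a pipeline with standardization.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Report the models from most to least accurate.
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.4f}")
```

Evaluating every model on the same held-out split keeps the accuracy scores directly comparable.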
It's worth noting that the effectiveness of these algorithms can be influenced by various
factors such as the specific characteristics of the dataset, the preprocessing techniques
applied, and the parameter configurations utilized. Further exploration and optimization may
be necessary to maximize their performance. Nevertheless, this research provides valuable
insights into the capabilities of different machine learning algorithms for diabetes prediction
using the Pima Indian Diabetes dataset. These findings can contribute significantly to the
advancement of more accurate and reliable diagnostic tools for diabetes prediction, ultimately
leading to improved healthcare outcomes for individuals affected by the condition.
6.1 Scope of Future Work
The project presented here offers valuable insights into the performance of machine learning
algorithms for diabetes prediction using clinical and physiological data. However, there
remain several opportunities for future exploration and enhancement to further improve the
accuracy and effectiveness of predictive models. This section outlines potential directions for
future work and areas of focus within the scope of diabetes prediction projects.
Algorithmic Modifications:
While the algorithms employed in this project have shown promising results, there is
potential for further optimization and customization. Future research could focus on
algorithmic modifications tailored specifically for diabetes prediction, such as
parameter tuning, ensemble learning, and algorithm selection. Exploring novel
algorithmic architectures and incorporating domain-specific knowledge into models
may also lead to performance improvements.
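As one illustration of the parameter tuning mentioned above, the sketch below runs a small cross-validated grid search over two Random Forest hyperparameters using scikit-learn's GridSearchCV. The data are synthetic, and the parameter grid, the number of folds, and the scoring metric are assumptions chosen to keep the example short; they are not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: 8 features, as in the Pima dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=0)

# Hypothetical search space; a real study would choose ranges more carefully.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

Because each candidate is scored by cross-validation rather than on a single split, the selected configuration is less likely to be an artifact of one particular train/test partition.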
Validation and Clinical Trials:
Conducting rigorous validation studies and clinical trials is essential to assess the
clinical relevance and utility of predictive models in real-world healthcare settings.
Collaborations with healthcare professionals and institutions are critical for designing
and executing validation studies, evaluating model performance against established
clinical standards, and assessing the impact of model predictions on patient outcomes.
Additionally, incorporating feedback from clinicians and end-users throughout the
development process can ensure the practical applicability and acceptance of
predictive models in clinical practice.
In conclusion, while the project lays the foundation for diabetes prediction using machine
learning techniques, there are numerous avenues for future research and improvement. By
integrating additional features, expanding the training dataset, improving preprocessing
techniques, modifying algorithms, exploring deep learning approaches, and conducting
validation studies, the accuracy and reliability of diabetes prediction models can be
significantly enhanced. These advancements have the potential to contribute to the
development of more accurate diagnostic tools and personalized healthcare interventions for
individuals at risk of diabetes.
References
[1] Kharroubi, A.T. , Darwish, H.M. : Diabetes mellitus: The epidemic of the century. World
[2] Papatheodorou, K., Banach, M., Edmonds, M., Papanas, N., Papazoglou, D.:
Complications of diabetes. J. Diabetes Res. 2015, 1–6 (2015)
[3] Atlas, G. : Diabetes. International Diabetes Federation. 10th ed., IDF Diabetes Atlas
[4] Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., Johannes, R.S.: Using the ADAP
learning algorithm to forecast the onset of diabetes mellitus. In: Annual Symposium on
Computer Applications in Medical Care, pp. 261–265 (1988)
[5] Aljumah, A.A., Ahamad, M.G., Siddiqui, M.K., 2013. Application of data mining:
Diabetes health care in young and old patients. Journal of King Saud University -
Computer and Information Sciences 25, 127–136. doi:10.1016/j.jksuci.2012.10.003.
[6] Zou, Q., Qu, K., et al., 2018. Predicting diabetes mellitus with machine learning techniques.
[7] K. VijiyaKumar, B. Lavanya, I. Nirmala, S. Sofia Caroline, "Random Forest Algorithm for
the Prediction of Diabetes". Proceedings of International Conference on Systems
Computation Automation and Networking, 2019.
[8] Deeraj Shetty, Kishor Rit, Sohail Shaikh, Nikita Patil, "Diabetes Disease Prediction Using
Data Mining". International Conference on Innovations in Information, Embedded and
Communication Systems (ICIIECS), 2017.
[9] Tejas N. Joshi, Prof. Pramila M. Chawan, "Diabetes Prediction Using Machine Learning
Techniques". Int. Journal of Engineering Research and Application, Vol. 8, Issue 1 (Part-II),
January 2018, pp. 09-13.
[10] Arora, R., Suman, 2012. Comparative Analysis of Classification Algorithms on
Different Datasets using WEKA. International Journal of Computer Applications 54, 21–25.
doi:10.5120/8626-2492.
[11] Bamnote, M.P., G.R., 2014. Design of Classifier for Detection of Diabetes Mellitus
Using Genetic Programming. Advances in Intelligent Systems and Computing
[12] Choubey, D.K., Paul, S., Kumar, S., Kumar, S., 2017. Classification of Pima Indian
diabetes dataset using naive Bayes with genetic algorithm as an attribute selection, in:
Communication and Computing Systems: Proceedings of the International Conference
on Communication and Computing System (ICCCS 2016), pp. 451–455.
[13] Mohan, N., Jain, V.: Performance analysis of support vector machine in diabetes
prediction. In: International Conference on Electronics, Communication and Aerospace
Technology, pp. 1–3 (2020)
[14] Olisah, C.C., Smith, L., Smith, M.: Diabetes mellitus prediction and diagnosis from a
data preprocessing and machine learning perspective. Comput. Methods Programs
Biomed. 220, 1–12 (2022)
[15] Ramesh, J., Aburukba, R., Sagahyroon, A.: A remote healthcare monitoring
framework for diabetes prediction using machine learning. Healthcare Technol. Lett. (2021)
[16] J. Chaki, S. T. Ganesh, S. K. Cidham, and S. Ananda Theertan, “Machine learning and
artificial intelligence based diabetes mellitus detection and self-management: a systematic
review,” Journal of King Saud University - Computer and Information Sciences, vol. 34