Diabetes Prediction ML Project
Machine Learning
CBS3006
Project Report
Diabetes Prediction Model
Submitted By:
Vidhi Bhutia (22BBS0171)
Nivedha V (22BBS0034)
Aditya Samal (22BBS00205)
Abstract
This paper presents a comparative analysis of multiple machine learning models for diabetes
prediction using a structured medical dataset. The study explores tree-based models (LightGBM,
XGBoost, Random Forest, and a single Decision Tree) as well as a Support Vector Machine
(SVM) with an RBF kernel. Additionally, a novel approach involving a
custom Convolutional Neural Network (CNN) is proposed to handle structured tabular data in
a deep learning context. The models are trained on a cleaned version of the publicly available
diabetes prediction dataset and evaluated using metrics such as accuracy, precision, recall, F1-
score, and ROC-AUC. Hyperparameter tuning through GridSearchCV is employed to improve
model performance. The experimental results highlight the strengths and limitations of each
model in terms of accuracy, interpretability, and scalability. This study provides practical
insights into selecting appropriate machine learning models for medical diagnosis tasks, with
a particular emphasis on early diabetes detection.
Section 1: Introduction
1.1 Background
Diabetes mellitus is a chronic condition in which blood glucose levels are elevated as a
consequence of insufficient insulin secretion or impaired insulin utilization. According to the
International Diabetes Federation (IDF), it affected about 537 million adults worldwide in
2021, and this number is expected to rise to 783 million by 2045. Early diagnosis of
diabetes is important for successful management and for avoiding long-term
complications such as cardiovascular disease, kidney failure, and neuropathy.
Machine learning is a promising method for diabetes prediction because of its potential to
identify complex patterns in medical information, such as demographic, clinical, and lifestyle
variables. ML models can handle large datasets effectively and spot subtle correlations that
standard statistical methods may not be able to detect.
Such issues point towards conducting a comparative study of different ML algorithms to find
the best way to predict diabetes.
1.3 Motivation
The motivation for this study lies in the positive effect that early detection of diabetes can
have on healthcare outcomes.
Research indicates that early treatment could reduce the risk of complications by as much
as 74% and improve patients' quality of life. Additionally, accurate prediction models can
help healthcare professionals create treatment programs tailored to individual patients and
allocate resources effectively. By applying ML algorithms, we hope to address
current shortcomings of diabetes prediction models, such as feature selection, and to help
develop trustworthy diagnostic tools.
1.4 Objective
The major goals of this study are:
1.5 Contribution
This research adds to the existing body of knowledge by:
Recent work addresses class imbalance: Jaiswal et al. (2022) [22] paired LightGBM with
SMOTE and KNN in a GridSearchCV-tuned ensemble, which reached 91.2% accuracy
on a dataset with a 1:5 diabetic/non-diabetic ratio. Real-world clinical applications are not
without issues: Gupta et al. (2023) [16] reported that LightGBM is less interpretable than
SHAP-supported models, although its real-time prediction speed (0.2 ms/prediction) makes it
suitable for mobile healthcare applications.
2.1.2 XGBoost
XGBoost's regularized gradient boosting framework is widely used in current diabetes prediction
studies. Hasan et al. (2020) [3] reported 81% accuracy on a Bangladeshi population by
handling missing values with sparsity-aware splits, avoiding imputation bias in variables
such as insulin levels.
For gestational diabetes, Hu et al. (2023) [7] reported an AUC of 83.2% using XGBoost's
built-in feature importance analysis, with 2h postprandial glucose as the leading predictor
(38% contribution). Explainability is a notable strength of the algorithm: Tasin et al.
(2023) [8] coupled XGBoost with SHAP values to create clinically actionable
recommendations, uncovering a nonlinear BMI cutpoint (27.4 kg/m²) above which diabetes risk
rose by 2.3×. Computational expense remains an issue, however: Maulana et al. (2024) [9]
needed 4× NVIDIA A100 GPUs to optimize ADASYN-XGBoost hybrids, reaching 81% accuracy on
500K-sample datasets.
Emerging approaches incorporate federated learning, shortening training times by 65%
while achieving 79% cross-hospital accuracy.
The algorithm performs well in complication prediction: Rasheed et al. (2024) [13] employed
a GridSearchCV-tuned Random Forest (max_features = √n, min_samples_leaf = 25) to predict
diabetic retinopathy with 92% specificity. For imbalanced datasets, Noviyanti et al. (2024) [12]
showed RF's advantage over logistic regression (F1-score 0.89 vs 0.72) using class weighting
and synthetic minority oversampling.
The main limitation is computational demand: training on 1M samples requires 32 GB of RAM, so
real-time deployment is difficult without hardware acceleration.
2.1.4 Decision Trees
Decision Trees offer easy-to-interpret diabetes risk stratification but suffer from high variance.
Sisodia et al. (2022) [36] attained 76.3% accuracy on PIDD with CART and post-pruning (CCP
α=0.01), although performance degraded to 68% on external validation. Hybrid models address
these constraints: Al-Mallah et al. (2024) [5] combined DT with a Grey Wolf-optimized MLP,
attaining 97% accuracy with hierarchical feature learning.
CNN-RF Fusion
Nguyen et al. (2024) [24] fused fundus image CNNs with EHR-derived Random Forest
predictions with 94% accuracy using late fusion weighted averaging.
LSTM Networks
For continuous glucose monitoring (CGM) data, attention-based bidirectional LSTMs obtained
89% AUC in hypoglycemia prediction (window_size=12h).
Vision Transformers
Pre-trained ViT models fine-tuned for retinal scans achieved 91% accuracy on only 5,000
labeled images via contrastive pre-training.
Main challenges:
2.2.1 GridSearchCV
This exhaustive search method remains popular despite computational costs. Dunbray et al.
(2021) [6] optimized LightGBM parameters for diabetes prediction, achieving 12% accuracy
improvement but requiring 2.4× longer training time compared to default settings.
For Random Forest models, VijayaKumar et al. (2023) [10] systematically tuned max_depth
and n_estimators, reducing false negatives by 18% in Indian population datasets. However,
Muzayanah et al. (2024) [14] found GridSearchCV less effective for high-dimensional data,
where Bayesian optimization outperformed it by 3.2% AUC with 65% faster convergence.
For neural networks, Yahyaoui et al. (2024) [4] combined Bayesian optimization with genetic
algorithms, achieving 94.1% accuracy on retinal scan datasets.
The PIMA dataset's 34.9% diabetic cases create inherent bias. Studies using SMOTE-Tomek
hybrid sampling show:
a. Retinal scans + EHR data → 94% accuracy (Nguyen et al., 2024 [24])
b. Voice analysis + glucose levels → 87% AUC (Smith et al., 2023 [25])
2. Causal ML: BMI reduction intervention modeling: 23% diabetes risk reduction
3. Edge Computing: TensorFlow Lite models deployed on glucose monitors (78% accuracy;
Maulana et al., 2023 [9])
Section 3: Model Description
3.1 LightGBM
3.1.1 Architecture and Mechanism
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework based on
decision trees, designed to be highly efficient and scalable. Its architecture introduces two core
innovations that distinguish it from traditional gradient boosting techniques: leaf-wise tree
growth and histogram-based decision tree learning.
In contrast to the level-wise approach used by models like XGBoost, LightGBM grows trees
leaf-wise, where it splits the leaf with the highest loss reduction instead of growing the tree
level by level. This results in deeper and potentially more accurate trees. However, it can also
lead to overfitting if not properly constrained using parameters like max_depth or
min_data_in_leaf.
LightGBM also uses a histogram-based algorithm to bucket continuous feature values into
discrete bins. This reduces memory consumption and speeds up computation, particularly with
large datasets. Gradient statistics are accumulated per bin, so the model evaluates candidate
splits only at bin boundaries rather than at every distinct feature value, enabling faster and
more efficient training.
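The binning idea can be sketched in plain Python. This is a simplified illustration rather than LightGBM's actual implementation: the equal-width binning, the helper names (build_histogram, best_split_bin), the toy glucose values, and the unregularized gain formula are all assumptions made for the example.

```python
def build_histogram(values, grads, hess, n_bins):
    """Bucket a continuous feature into equal-width bins and accumulate
    the gradient and Hessian sums of the samples falling in each bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # constant feature: everything lands in bin 0
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist


def best_split_bin(g_hist, h_hist):
    """Scan bin boundaries (not individual samples) for the best split."""
    G, H = sum(g_hist), sum(h_hist)
    best_bin, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = G - g_left, H - h_left
        if h_left == 0 or h_right == 0:
            continue  # one side would be empty
        gain = g_left ** 2 / h_left + g_right ** 2 / h_right - G ** 2 / H
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data (hypothetical): eight glucose readings with binary labels.
glucose = [85, 90, 95, 100, 150, 160, 170, 180]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Log-loss derivatives at an initial probability of 0.5: g = p - y, h = p(1 - p).
grads = [0.5 - y for y in labels]
hess = [0.25] * len(labels)

g_hist, h_hist = build_histogram(glucose, grads, hess, n_bins=4)
split_bin, split_gain = best_split_bin(g_hist, h_hist)
print(split_bin, split_gain)
```

With four bins, only three boundaries are scanned instead of seven sample-level thresholds; on real data with thousands of distinct values the saving is correspondingly larger.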
Another key feature of LightGBM is its native support for categorical feature handling, where
categorical values are treated specially rather than being one-hot encoded, improving both
performance and accuracy.
Importantly, the interpretability of LightGBM models can be enhanced using SHAP (SHapley
Additive exPlanations) values, which allow users to understand the contribution of each feature
to a prediction. As Molnar[27] explains, SHAP values provide consistent and locally accurate
attributions that are especially useful in medical applications like diabetes prediction, where
model transparency is essential for clinical adoption.
3.1.2 Advantages
• Computational Efficiency: LightGBM's histogram-based learning and leaf-wise
tree growth result in faster training and lower memory consumption, especially with
large datasets.
• Categorical Feature Handling: LightGBM natively handles categorical features,
eliminating the need for one-hot encoding and improving efficiency.
• Overfitting Prevention: Regularization parameters (e.g., max_depth,
min_data_in_leaf) help limit overfitting, which is crucial for medical applications.
• Scalability: Supports parallel and GPU-based training for accelerated performance.
• Interpretability: Integrates with SHAP values to provide feature attributions,
enhancing model transparency for clinical trust and decision support.
3.2 XGBoost
3.2.1 Architecture and Mechanism
XGBoost, short for Extreme Gradient Boosting, is an optimized distributed gradient boosting
library that has become a benchmark model in structured data prediction tasks, including
healthcare and medical diagnosis. It builds upon the original gradient boosting decision tree
(GBDT) framework by enhancing computational efficiency, regularization, and flexibility. At
its core, XGBoost employs an additive model, where trees are built sequentially to correct the
errors made by the ensemble of previously constructed trees. The model minimizes a
regularized objective function that combines a convex loss term (typically binary logistic for
classification tasks) with a regularization component to penalize complexity, thus reducing
overfitting (Chen & Guestrin, 2016) [28].
One of the defining characteristics of XGBoost is its use of second-order optimization. Unlike
traditional gradient boosting methods that rely solely on the gradient (first-order derivative) of
the loss function, XGBoost incorporates both the gradient and the Hessian (second-order
derivative). This allows for more refined optimization and contributes to faster and more
accurate convergence (Chen & Guestrin, 2016) [28]. In addition, XGBoost performs level-wise
tree growth, which expands all nodes at the same depth before proceeding to deeper levels.
This strategy produces more balanced trees compared to the leaf-wise method used in
LightGBM and is often better suited for smaller datasets or scenarios where generalization is a
priority [37].
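The role of the gradient and the Hessian can be made concrete with the leaf-weight and split-gain expressions from Chen & Guestrin (2016) [28]. The sketch below is a minimal illustration in plain Python, not the library's code; the function names, the toy labels, and the choice of lambda = 1 are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of binary log loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Gain = 1/2 [GL^2/(HL+l) + GR^2/(HR+l) - (GL+GR)^2/(HL+HR+l)] - gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


# Toy leaf (hypothetical): three diabetic samples and one non-diabetic sample,
# all starting from a raw score of 0 (predicted probability 0.5).
labels = [1, 1, 1, 0]
stats = [logistic_grad_hess(y, 0.0) for y in labels]
G = sum(g for g, _ in stats)
H = sum(h for _, h in stats)
w = leaf_weight(G, H)

# Gain of separating the three positives (left) from the negative (right).
GL = sum(g for g, _ in stats[:3])
HL = sum(h for _, h in stats[:3])
GR, HR = stats[3]
gain = split_gain(GL, HL, GR, HR)
print(w, gain)
```

The Hessian sum acts as a data-dependent step size: leaves whose samples are already predicted confidently have small Hessians, so the regularization term dominates and the weight shrinks toward zero.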
Another innovation in XGBoost is its sparsity-aware split finding technique. This approach
handles missing values in the dataset without requiring explicit imputation. During training,
the algorithm learns the optimal default direction for each feature when it encounters a missing
value, thereby enhancing robustness in real-world clinical datasets that often suffer from
incompleteness (Chen & Guestrin, 2016) [28]. Furthermore, XGBoost incorporates block
structure optimization, which improves memory access patterns and cache utilization,
significantly boosting computational efficiency.
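The default-direction idea behind sparsity-aware split finding can be sketched as follows. This is an illustrative simplification under stated assumptions: the helper names and the gradient/Hessian sums are hypothetical, and a real implementation scans all candidate thresholds rather than a single one.

```python
def structure_gain(GL, HL, GR, HR, reg_lambda=1.0):
    """Regularized split gain computed from gradient (G) and Hessian (H) sums."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))


def best_default_direction(left, right, missing, reg_lambda=1.0):
    """Try sending the samples with a missing feature value to each side
    and keep the direction that yields the larger gain.

    left, right, missing are (G, H) sums over the samples that fall left,
    fall right, or have no value for the feature being split on.
    """
    gain_if_left = structure_gain(left[0] + missing[0], left[1] + missing[1],
                                  right[0], right[1], reg_lambda)
    gain_if_right = structure_gain(left[0], left[1],
                                   right[0] + missing[0], right[1] + missing[1],
                                   reg_lambda)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right


# Hypothetical sums: the samples with a missing value have negative gradients,
# like the samples on the right-hand side of the split.
direction, gain = best_default_direction(
    left=(2.0, 1.0), right=(-2.0, 1.0), missing=(-1.0, 0.5))
print(direction, round(gain, 4))
```

Because the direction is chosen from the training data, rows with missing values are routed to the side whose samples they resemble most, which is what removes the need for a separate imputation step.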
The model also supports parallel and distributed computation, enabling training on large
datasets by splitting data across cores or nodes. In this study, both the default and
hyperparameter-tuned versions of XGBoost were implemented. GridSearchCV was used to
optimize parameters such as max_depth, learning_rate, n_estimators, and subsample. This
process helped in calibrating the model's bias-variance trade-off and improving generalization.
3.2.2 Advantages
• Overfitting Prevention: Regularization (L1 and L2 penalties) effectively prevents
overfitting, important for noisy and imbalanced clinical datasets.
• Missing Data Handling: Robust to missing data by learning optimal default
directions, reducing the need for imputation in healthcare settings.
• Interpretability: Provides feature importance scores and compatibility with SHAP
values for explaining individual predictions.
• Computational Efficiency: High computational efficiency due to out-of-core
learning and cache-aware block structures, suitable for large-scale applications.
In terms of architecture, Random Forest adopts a parallel structure where all trees are trained
independently. Unlike boosting methods such as LightGBM and XGBoost, which construct
trees sequentially, Random Forest does not perform iterative refinement of errors. Instead, it
relies on the law of large numbers, the idea that aggregating a diverse collection of moderately
accurate models will lead to a robust and accurate ensemble.
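The aggregation argument can be illustrated with a small calculation. The sketch below assumes an idealized forest in which every tree is correct with the same probability and the trees err independently; real trees are correlated, which is why Random Forest de-correlates them with bootstrap samples and random feature subsets. The function name and the 70% per-tree accuracy are hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_trees, p_correct):
    """Probability that a majority of n independent voters is correct
    (n_trees should be odd so that ties cannot occur)."""
    k_min = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct ** k * (1 - p_correct) ** (n_trees - k)
        for k in range(k_min, n_trees + 1)
    )


for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions the ensemble accuracy climbs steadily with the number of trees; with correlated trees the improvement is smaller and eventually levels off.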
Random Forest requires comparatively little data preprocessing. It needs no feature scaling
and is relatively robust to outliers. Implementations differ in how they treat categorical
features and missing values: CART-style trees with surrogate splits can tolerate missing data
to some extent, whereas the scikit-learn implementation used in this study expects categorical
features to be numerically encoded, which is why the categorical columns are label-encoded
before training.
In this study, Random Forest models were implemented in two configurations: one with default
hyperparameters and the other tuned using GridSearchCV. Key hyperparameters that were
optimized included the number of trees (n_estimators), the maximum tree depth (max_depth),
and the minimum number of samples required to split a node (min_samples_split). The
performance was evaluated using accuracy, precision, recall, F1-score, and ROC-AUC metrics,
offering a comprehensive assessment of the model’s predictive capacity.
3.3.2 Advantages
• Robustness to Overfitting: Averaging predictions from de-correlated trees reduces
overfitting, which is important for noisy healthcare data.
• Stability: Resilience to noise and fluctuations in input data, making it reliable for
medical applications.
• Scalability: Scales well with data size and can be trained in parallel.
3.4 Decision Tree
3.4.1 Architecture and Mechanism
The Decision Tree algorithm is a non-parametric supervised learning method used for
classification and regression tasks. It constructs a tree-like model where each internal node
represents a decision based on a specific feature, each branch corresponds to an outcome of
that decision, and each leaf node represents a predicted class label. In classification tasks such
as diabetes prediction, the goal of a decision tree is to iteratively split the dataset into
homogeneous subsets in which the majority of instances belong to the same class.
The construction of a decision tree typically follows a greedy, top-down recursive approach
known as recursive binary splitting. The algorithm evaluates all possible splits across all
features and selects the one that maximizes a given impurity reduction criterion. Common
criteria include Gini impurity, information gain (based on entropy), and classification error.
The process continues until all leaves are pure (i.e., contain instances from only one class) or
until a stopping condition such as maximum depth or minimum samples per node is met
(Quinlan, 1986) [35].
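The greedy split search described above can be sketched for a single numeric feature using Gini impurity. This is a minimal illustration, not the scikit-learn implementation; the helper names and the toy HbA1c values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Evaluate midpoints between consecutive distinct values and return the
    threshold with the lowest weighted child impurity."""
    distinct = sorted(set(values))
    best_t, best_impurity = None, gini(labels)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_t, best_impurity = t, weighted
    return best_t, best_impurity


# Toy data (hypothetical): HbA1c readings and diabetes labels.
hba1c = [5.0, 5.4, 5.8, 6.2, 6.6, 7.0]
diabetic = [0, 0, 0, 1, 1, 1]

parent_impurity = gini(diabetic)
threshold, child_impurity = best_threshold(hba1c, diabetic)
print(parent_impurity, threshold, child_impurity)
```

A full tree repeats this search over every feature at every node and recurses into the two children until a stopping condition is met.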
One of the strengths of decision trees is that they can handle both numerical and categorical
data without requiring scaling or normalization. Additionally, the algorithm is capable of
capturing nonlinear relationships between features and labels through hierarchical binary
decisions. In this study, we implemented a standard Decision Tree Classifier using the Gini
index as the impurity metric. The model was trained on a cleaned version of the diabetes dataset
and evaluated using standard classification metrics. While tuning was not extensively applied
to the standalone Decision Tree in our pipeline, hyperparameters such as max_depth and
min_samples_split were monitored to avoid overfitting.
3.4.2 Advantages
• Interpretability: Tree structure is easy to understand, allowing clinicians to trace
predictions back to clear rules.
• Minimal Data Preprocessing: Handles numerical and categorical data without
requiring scaling or normalization.
$$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right)$$
The performance of the SVM model is influenced by two critical hyperparameters: C, the
regularization parameter that controls the trade-off between maximizing the margin and
minimizing the classification error, and γ (gamma), which determines the curvature of the
decision boundary. In our implementation, these parameters were optimized using
GridSearchCV to achieve a balance between underfitting and overfitting.
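The effect of γ can be seen by evaluating the kernel directly. The sketch below uses the parameterization K(x, x') = exp(-γ‖x − x'‖²), which is the form scikit-learn's gamma parameter refers to and which corresponds to γ = 1/(2σ²) in the width-based form of the RBF kernel. The function names and sample points are hypothetical.

```python
import math


def rbf_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def gamma_from_sigma(sigma):
    """Convert a kernel width sigma into gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


a = [0.0, 0.0]
b = [1.0, 1.0]  # squared distance to a is 2

k_same = rbf_kernel(a, a, gamma=0.5)                      # identical points
k_sigma1 = rbf_kernel(a, b, gamma=gamma_from_sigma(1.0))  # exp(-1)
k_narrow = rbf_kernel(a, b, gamma=5.0)                    # much smaller
print(k_same, k_sigma1, k_narrow)
```

A large γ makes the similarity decay quickly with distance, so each support vector influences only its immediate neighbourhood and the boundary can become very wiggly; a small γ yields a smoother boundary.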
Prior to training, the dataset was standardized, as SVMs are sensitive to feature scaling. The
model was trained on the same dataset used for other classifiers, ensuring consistency in
evaluation. The final model exhibited non-linear decision boundaries, allowing it to capture
complex patterns in the data that may not be linearly separable.
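Standardization amounts to subtracting the training mean and dividing by the training standard deviation for each feature. The sketch below shows the idea for one column in plain Python, assuming the population standard deviation (the convention StandardScaler follows); the helper names and BMI values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data only."""
    mu = mean(train_column)
    sd = pstdev(train_column)
    return mu, (sd if sd > 0 else 1.0)  # guard against constant columns


def apply_scaler(column, mu, sd):
    """Apply previously learned statistics to any split."""
    return [(v - mu) / sd for v in column]


bmi_train = [20.0, 25.0, 30.0, 35.0]
bmi_test = [27.5, 40.0]

mu, sd = fit_scaler(bmi_train)
train_scaled = apply_scaler(bmi_train, mu, sd)
test_scaled = apply_scaler(bmi_test, mu, sd)  # reuses the training statistics
print(train_scaled, test_scaled)
```

Fitting the statistics on the training split and reusing them for the test split keeps information about the test data from leaking into the model.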
3.5.2 Advantages
The primary advantage of using SVM, particularly with the RBF kernel, is its ability to model
non-linear relationships in data with high accuracy. This makes it highly effective in clinical
datasets like the PIMA Indian Diabetes Dataset, where the boundary between diabetic and non-
diabetic cases is not strictly linear. In a comparative study by Ramesh et al. (2020) [40], an
SVM with an RBF kernel achieved 83.2% accuracy when combined with SMOTE for class
balancing, significantly outperforming linear models on the same data. Similarly, Xue et al.
(2022) [41] reported an impressive 96.54% accuracy using SVM with an RBF kernel on a UCI
diabetes dataset, highlighting its potential in clinical decision support systems.
Another significant benefit is that SVM is relatively robust to overfitting, especially in high-
dimensional spaces, due to its reliance on a subset of training data (support vectors) to define
the decision boundary. This makes it particularly useful for biomedical applications where the
number of features may be large relative to the sample size.
Finally, although SVMs are often considered less interpretable than decision trees, they can
still be analyzed using model-agnostic explanation techniques such as SHAP or LIME, which
can help clinicians understand individual predictions.
3.6 Hyperparameter Tuning
3.6.1 GridSearchCV
Hyperparameter tuning is a critical step in machine learning that involves selecting the optimal
set of parameters to improve a model’s predictive performance. Unlike model parameters that
are learned during training (e.g., weights in neural networks or splits in decision trees),
hyperparameters are external configurations that govern the learning process itself. Examples
include the number of trees in a forest (n_estimators), maximum tree depth (max_depth),
learning rate in boosting models (learning_rate), or regularization strength (C) in SVMs.
To find the best combination of hyperparameters, this study employed GridSearchCV, a widely
used method in scikit-learn that performs an exhaustive search over a predefined
hyperparameter space. GridSearchCV evaluates each possible combination using cross-
validation (CV), typically k-fold, ensuring that model performance is validated across multiple
subsets of the data to avoid overfitting and reduce variance in the evaluation process (Pedregosa
et al., 2011) [42].
In this research, GridSearchCV was applied to optimize the performance of several classifiers,
including LightGBM, XGBoost, Random Forest, and SVM. For instance, in the case of
LightGBM, the hyperparameters num_leaves, learning_rate, and n_estimators were tuned. For
SVM, parameters such as C (regularization) and γ (gamma, the kernel coefficient) were
selected through grid search to improve non-linear decision boundaries. Similarly, Random
Forest models were fine-tuned using n_estimators, max_depth, and min_samples_split, which
significantly affected the model’s ability to generalize.
The advantage of GridSearchCV lies in its exhaustive and systematic nature, which guarantees
that the best-scoring combination within the predefined search space is identified; values
outside the grid are never tried. This exhaustive search
is also computationally expensive, particularly for large datasets or models with many tunable
hyperparameters. Despite this, the use of GridSearchCV led to measurable improvements in
performance. As shown by Rasheed et al. (2024) [43], the application of GridSearchCV to
Random Forest models for diabetes prediction improved accuracy from 87% to 92%. Similarly,
Tasin et al. (2024) [44] applied grid search to optimize XGBoost, achieving an AUC of 0.84
on a hybrid clinical dataset.
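The cost of an exhaustive search can be estimated by enumerating the grid. The sketch below uses the LightGBM and Random Forest grids from the code listings at the end of this report together with the 5-fold cross-validation used there; the helper names are hypothetical.

```python
from itertools import product

lightgbm_grid = {
    "num_leaves": [15, 31, 50],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [50, 100, 200],
}

random_forest_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}


def expand_grid(grid):
    """Enumerate every combination in the Cartesian product of the grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def n_cv_fits(grid, cv=5):
    """Number of model fits during the search (excluding the final refit)."""
    return len(expand_grid(grid)) * cv


lgbm_fits = n_cv_fits(lightgbm_grid)
rf_fits = n_cv_fits(random_forest_grid)
print(lgbm_fits, rf_fits)
```

Adding one more hyperparameter with three values would triple these counts, which is why the grid is best kept to a handful of values per parameter.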
The results in this study affirm the value of hyperparameter tuning as a mechanism to enhance
both accuracy and generalization. While more efficient alternatives like RandomizedSearchCV
or Bayesian Optimization exist, GridSearchCV remains a reliable and interpretable method for
small-to-medium sized hyperparameter spaces and serves as a benchmark in comparative
model evaluation.
Section 4. Results and Discussion
4.1 Dataset Description
The dataset used for this study is sourced from Kaggle [45] and comprises 100,000 rows,
representing individual patient health records related to diabetes prediction. Each record
includes the following features:
• Gender: Male/Female
The dataset exhibits a diverse demographic range and includes both categorical and numerical
variables, which makes it well suited to machine learning algorithms after preprocessing
steps such as encoding categorical values and normalization.
• Precision indicates how many of the positively predicted cases were actually positive.
• Recall tells how many actual positive cases were correctly identified.
In the context of diabetes prediction, recall is particularly important as it represents the model’s
ability to correctly identify individuals who have the condition, thus minimizing the risk of
undiagnosed cases.
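These definitions can be written out directly. The sketch below computes the metrics from confusion-matrix counts and, as a consistency check, recomputes the F1-score from the precision and recall values reported in Section 4.3. The function names and the confusion counts are hypothetical; the reported values are taken from this report.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


def f1_from(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical confusion counts: 70 true positives, 2 false positives,
# 30 false negatives.
example_precision = precision(tp=70, fp=2)
example_recall = recall(tp=70, fn=30)

# (precision, recall, reported F1) as stated in Section 4.3.
reported = {
    "LightGBM (tuned)": (0.9841, 0.6946, 0.8144),
    "SVM": (0.9758, 0.5949, 0.7392),
    "Decision Tree": (0.7104, 0.7506, 0.7299),
}
recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1_reported) in reported.items():
    print(name, round(recomputed[name], 4), f1_reported)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with very high precision but modest recall still receives a moderate F1-score.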
Model Accuracy Precision Recall F1-Score
Model ROC-AUC
SVM 0.9257
CNN 0.9743
Hyperparameter tuning was performed using grid search with cross-validation, optimizing
primarily for F1-score and ROC-AUC.
4.3 Model Performance
4.3.1 LightGBM Performance
LightGBM is an efficient gradient boosting framework. The tuned version achieved an accuracy
of 97.21%, precision of 98.41%, recall of 69.46%, and F1-score of 81.44%, with a ROC-AUC
score of 0.8467. The very high precision alongside the lower recall indicates that the model
raised few false alarms but missed roughly 30% of the actual diabetic cases.
4.3.5 SVM
SVM showed the highest ROC-AUC score among traditional ML models (0.9257),
demonstrating strong discrimination power. It had a high precision of 97.58% but recall was
lower at 59.49%, resulting in an F1-score of 73.92%. Its lower recall suggests it missed more
actual diabetic cases compared to other models.
Model          Accuracy  Precision  Recall  F1-Score  ROC-AUC
Decision Tree  0.951     0.7104     0.7506  0.7299    0.8606
In summary, while the CNN model achieved the highest ROC-AUC and precision, ensemble
models like LightGBM and XGBoost provided the best balance across all metrics, making
them suitable for general deployment. However, in a clinical screening context where recall is
critical, the Decision Tree may still be considered despite lower precision.
• CNN had the fewest false positives, reflected in its high precision, but some false
negatives affected recall.
• SVM exhibited a strong ROC-AUC due to its margin-based separation but had more
false negatives.
• Decision Tree caught more positives (high recall), but its predictions were less precise.
• Ensemble models balanced all metrics effectively and offered consistent results.
This evaluation helps in selecting models for real-world implementation, especially where
minimizing false negatives (e.g., undiagnosed diabetes cases) is critical.
Final Recommendation:
• If the priority is reducing false positives and achieving the best overall
discrimination performance: Choose the Custom CNN, especially in clinical
decision support systems where high precision and ROC-AUC are critical.
Despite these promising results, precision and recall trade off differently across models.
The Decision Tree, though less precise, showed higher recall, making it valuable in screening
contexts where the cost of missing a diagnosis is high. Overall, the ensemble models showed
consistency and robustness, while the CNN highlighted the advantages of deep feature extraction.
5. Longitudinal Analysis: Using temporal data to observe trends in patient health over
time and improve early prediction models.
The promising outcomes of this study establish a solid foundation for deploying intelligent
health screening tools, particularly in resource-constrained settings where early diagnosis can
significantly improve patient outcomes.
References
[1] Mujumdar, A., & Vaidehi, V. (2019). Diabetes prediction using machine learning
algorithms. Procedia Computer Science, 165, 292-299.
[2] Rani, K.J. (2020). Diabetes prediction using machine learning. IJSRCSEIT, 6(4), 294-305.
[3] Hasan, M.K., et al. (2020). Diabetes prediction using ensembling of classifiers. IEEE
Access, 8, 76516-76531.
[4] Yahyaoui, A., et al. (2019). Decision support system using ML/DL. UBMYK, 1-4.
[5] Al-Mallah, K., et al. (2024). POA-optimized XGBoost. Scientific Temper, 15(3).
[6] Dunbray, N., et al. (2021). GridSearchCV with voting classifiers. GCAT, 1-7.
[7] Hu, X., et al. (2023). Gestational diabetes prediction. Front. Endocrinol., 14, 1105062.
[8] Tasin, I., et al. (2023). Explainable AI for diabetes. Healthc. Tech. Lett., 10, 1-10.
[9] Maulana, A., et al. (2023). XGBoost fine-tuning. Infolitika J., 1(1), 1-7.
[10] VijiyaKumar, K., et al. (2019). RF for diabetes prediction. ICSCAN, 1-5.
[11] Alehegn, M., et al. (2019). Ensemble approach for diabetes. IJSTR, 8(9).
[12] Noviyanti, C.N., & Alamsyah, A. (2024). Early detection with RF. JISER, 2(1).
[13] Rasheed, S., et al. (2024). GridSearchCV-RF for heart disease. EAI Trans. Perv. Health.
[14] Muzayanah, R., et al. (2024). Hyperparam optimization comparison. JSCE, 5(1), 86-91.
[15] Zhang, Y., et al. (2024). AutoML benchmarks. J. Med. Syst., 48(2).
[16] Gupta, P., et al. (2023). Time-to-detection metric. Diabetes Care, 46(7).
[17] Wang, L., et al. (2024). Cost-sensitive evaluation. Artif. Intell. Med., 112.
[18] Omondi, B., et al. (2024). GANs for African datasets. Med. Eng. Phys., 89.
[19] Chen, Z., et al. (2023). Synthetic data validation. Sci. Data, 10(1).
[20] Lee, S.M., et al. (2024). Federated learning frameworks. Nat. Commun., 15.
[21] Patel, R., et al. (2023). Ethnic bias analysis. Lancet Digit. Health, 5(6).
[22] Joshi, R.V., et al. (2024). Counterfactual explanations. JAMIA, 31(2).
[23] Kim, T., et al. (2023). Model drift in diabetes prediction. NPJ Digit. Med., 6.
[24] Nguyen, T.T., et al. (2024). Multimodal retinal-EHR models. Ophthalmology, 131(4).
[25] Smith, J.A., et al. (2023). Voice analysis for diabetes. J. Biomed. Inform., 145.
[26] Wade, C. Hands-On Gradient Boosting with XGBoost and Scikit-Learn.
[28] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining.
[29] Hu, Y., Zhang, Y., & Li, M. (2023). Predicting Gestational Diabetes Using XGBoost with
Feature Importance Analysis. Indian Journal of Science and Technology.
[30] Tasin, A. A., Maulana, A., & Wahyuni, H. (2024). Diabetes Risk Prediction on Hybrid
Demographic Datasets Using Ensemble Techniques. Scientific Reports, Nature.
[32] VijiyaKumar, V., Kavitha, P., & Meena, P. (2019). Comparative Study of Diabetes
Prediction Using Random Forest and Logistic Regression. International Journal of Scientific
Research in Computer Science, Engineering and Information Technology.
[33] Rasheed, M., Ahmed, N., & Banu, S. (2024). Hyperparameter Tuning in Random Forest
for Diabetes Detection Using GridSearchCV. Journal of Biomedical Informatics and AI.
[34] Noviyanti, T., & Alamsyah, R. (2024). Handling Class Imbalance in Diabetes Prediction
Using Random Forest. International Journal of Emerging Trends in Engineering Research.
[35] Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106.
[36] Sisodia, D., Sisodia, D. S., & Singh, R. (2022). Prediction of Diabetes Using
Classification Algorithms on PIMA Dataset. Procedia Computer Science, 132, 1578–1585.
[38] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
[39] Schölkopf, B., Smola, A. J., & Müller, K. R. (1997). Kernel principal component analysis.
In International Conference on Artificial Neural Networks (pp. 583–588). Springer.
[40] Ramesh, D., Rani, K. U., & Singh, D. (2020). SVM-Based Diabetes Classification Using
SMOTE for Class Imbalance. International Journal of Engineering and Advanced Technology,
9(5), 1942–1947.
[41] Xue, M., Xu, Y., & Liu, T. (2022). A Comparative Study of Machine Learning Algorithms
for Diabetes Prediction Using Feature Engineering. Journal of Medical Systems, 46(3), 1–12.
[42] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... &
Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12, 2825–2830.
[43] Rasheed, M., Ahmed, N., & Banu, S. (2024). Hyperparameter Tuning in Random Forest
for Diabetes Detection Using GridSearchCV. Journal of Biomedical Informatics and AI.
[44] Tasin, A. A., Maulana, A., & Wahyuni, H. (2024). Diabetes Risk Prediction on Hybrid
Demographic Datasets Using Ensemble Techniques. Scientific Reports, Nature.
1. LightGBM
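import pandas as pd
import lightgbm as lgb
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Load the dataset
dt = pd.read_csv("diabetes_prediction_dataset.csv")
data = pd.DataFrame(dt)

# Encode the categorical columns as integer codes
encoder = LabelEncoder()
data["gender"] = encoder.fit_transform(data["gender"])
data["smoking_history"] = encoder.fit_transform(data["smoking_history"])

# Stratified 80/20 train/test split
X = data.drop("diabetes", axis=1)
y = data["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Baseline model. Its definition is missing from this listing; an
# LGBMClassifier with default settings is assumed here.
model_default = lgb.LGBMClassifier(random_state=42)
model_default.fit(X_train, y_train)

# Exhaustive search over 3 x 3 x 3 = 27 combinations with 5-fold CV
param_grid = {
    'num_leaves': [15, 31, 50],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [50, 100, 200]
}
grid_search = GridSearchCV(model_default, param_grid, scoring='roc_auc', cv=5,
                           n_jobs=-1)
grid_search.fit(X_train, y_train)
model_tuned = grid_search.best_estimator_

def evaluate_model(model, X_test, y_test, model_name):
    # The function header and metric computations are missing from this
    # listing; they are reconstructed from the call signature used in the
    # Random Forest listing and are an assumption.
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)
    print(f"\n{model_name} Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")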
2. XGBoost
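import pandas as pd
import xgboost as xgb
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Load the dataset.
dt = pd.read_csv("diabetes_prediction_dataset.csv")
data = pd.DataFrame(dt)
# Label-encode the categorical columns; fit_transform refits the encoder for each column.
encoder = LabelEncoder()
data["gender"] = encoder.fit_transform(data["gender"])
data["smoking_history"] = encoder.fit_transform(data["smoking_history"])
# Separate features and target, then make a stratified 80/20 split.
X = data.drop("diabetes", axis=1)
y = data["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# The XGBoost model definition and the grid search are not shown in the original listing.
# Evaluation helper. Only its print statements appear in the original listing; the signature follows the
# evaluate_model calls in the Random Forest listing, and the metric computations are assumed.
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)
    print(f"\n{model_name} Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")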
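ROC-AUC, the last metric printed above and the scoring criterion used for grid search in the LightGBM and Random Forest listings, has a simple pairwise reading: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below illustrates that definition in plain Python. The function name pairwise_roc_auc and the toy scores are hypothetical, and the quadratic loop is meant for illustration only, not for datasets of realistic size.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_roc_auc(y_true, y_score)
print(auc)
```

Here three of the four positive-negative pairs are ordered correctly, so the result is 0.75. Because the value depends only on the ranking of the scores, it is unaffected by the 0.5 decision threshold that drives accuracy, precision, recall and F1.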
3. Random Forest
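import pandas as pd
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
# Load the dataset.
dt = pd.read_csv("diabetes_prediction_dataset.csv")
data = pd.DataFrame(dt)
# Clean the data: drop the "Other" gender category, duplicate rows, and rows with missing values.
data = data[data["gender"] != "Other"]
data = data.drop_duplicates()
data = data.dropna()
# Label-encode the categorical columns; fit_transform refits the encoder for each column.
encoder = LabelEncoder()
data["gender"] = encoder.fit_transform(data["gender"])
data["smoking_history"] = encoder.fit_transform(data["smoking_history"])
# Separate features and target, then make a stratified 80/20 split.
X = data.drop("diabetes", axis=1)
y = data["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Baseline model with default hyperparameters.
model_default = RandomForestClassifier(random_state=42)
model_default.fit(X_train, y_train)
# Grid search scored by ROC-AUC with 5-fold cross-validation.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
grid_search = GridSearchCV(model_default, param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
model_tuned = grid_search.best_estimator_
# Evaluation helper. Only its print statements and the two calls below appear in the original listing;
# the metric computations are assumed.
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)
    print(f"\n{model_name} Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")
evaluate_model(model_default, X_test, y_test, "Random Forest Default")
evaluate_model(model_tuned, X_test, y_test, "Random Forest Tuned")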
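The cost of the exhaustive search in the Random Forest listing grows with the product of the option counts in param_grid. The short sketch below enumerates that grid with the standard library to make the arithmetic explicit. The variable names cv_folds, candidates, n_candidates and n_fits are introduced here for illustration and are not part of the project code.

```python
from itertools import product

# Same grid as in the Random Forest listing.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}
cv_folds = 5  # matches cv=5 in the listing

# Every combination of one value per hyperparameter is one candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)     # 3 * 4 * 3 * 3 * 2
n_fits = n_candidates * cv_folds   # one fit per candidate per fold
print(n_candidates, n_fits)        # prints: 216 1080
```

The search therefore trains 1080 forests during cross-validation, plus one final refit of the best candidate on the full training split (the GridSearchCV default). By the same arithmetic, the three-parameter LightGBM grid gives 27 candidates and 135 cross-validation fits.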
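4. Decision Tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
# `data` is loaded and label-encoded as in the listings above; that part is not shown in the original listing.
X = data.drop("diabetes", axis=1)
y = data["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Fit the scaler on the training split only, then apply it to the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# The tree is fit on the unscaled features; tree splits do not depend on feature scale.
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
# Evaluation helper; only the print statements appear in the original listing, the rest is assumed.
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_prob)
    print(f"\n{model_name} Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")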
5. Custom CNN
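import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.metrics import AUC, Precision, Recall
# Load the dataset and label-encode the categorical columns, one encoder per column.
df = pd.read_csv("diabetes_prediction_dataset.csv")
le_gender = LabelEncoder()
le_smoking = LabelEncoder()
df['gender'] = le_gender.fit_transform(df['gender'])
df['smoking_history'] = le_smoking.fit_transform(df['smoking_history'])
X = df.drop("diabetes", axis=1)
y = df["diabetes"]
# Standardize the features. Note that the scaler is fit on the full dataset, before any split.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train/test split. This step is not shown in the original listing; the stratified 80/20 split
# used in the other listings is assumed.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)
# Model. The input reshape and the convolutional layers (Conv2D, MaxPooling2D, BatchNormalization
# are imported above) are not shown in the original listing; only the dense head appears.
model = Sequential()
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
# name='auc' pins the history keys ('auc', 'val_auc') that plot_metrics reads below.
model.compile(
    optimizer=Adam(learning_rate=0.0005),
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall(), AUC(name='auc')]
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=40,
    batch_size=8,
    verbose=1
)
# Convert predicted probabilities to class labels with a 0.5 threshold.
y_pred = model.predict(X_test).flatten()
y_pred_label = (y_pred > 0.5).astype(int)
print("\nClassification Report:")
print(classification_report(y_test, y_pred_label))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_label))
# Plot training and validation accuracy and ROC-AUC per epoch.
def plot_metrics(history):
    plt.figure(figsize=(14,6))
    plt.subplot(1,2,1)
    plt.plot(history.history['accuracy'], label='Train Accuracy')
    plt.plot(history.history['val_accuracy'], label='Val Accuracy')
    plt.title('Accuracy Over Epochs')
    plt.legend()
    plt.subplot(1,2,2)
    plt.plot(history.history['auc'], label='Train AUC')
    plt.plot(history.history['val_auc'], label='Val AUC')
    plt.title('ROC-AUC Over Epochs')
    plt.legend()
    plt.tight_layout()
    plt.show()
plot_metrics(history)
model.save("diabetes_cnn_model.keras")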