AI Diabetic Health Prediction Report

The project report details a mini project on predicting diabetic health using the K-Nearest Neighbors (KNN) algorithm, highlighting its methodology, dataset, and results. The dataset utilized is from the Behavioral Risk Factor Surveillance System (BRFSS) and contains 253,680 responses with 21 feature variables. The KNN model achieved an accuracy of 85.64% in predicting diabetic health status, demonstrating its effectiveness in this domain.

Uploaded by

mahima.t139

PROJECT REPORT

on

AI IN MEDICAL FIELD: DIABETIC HEALTH PREDICTION
(CSE IV Semester Mini project)
2023-2024

Submitted to: Ms. Shagun Dasawat (CC-CSE-F1-IV-Sem)
Guided by: Mr. Piyush Agarwal
Submitted by: Ms. Shalini Semalti, Roll No.: 2219612, CSE-F1-IV-Semester, Session: 2023-2024
GRAPHIC ERA HILL UNIVERSITY, DEHRADUN
DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
CERTIFICATE

Certified that Ms. Shalini Semalti (Roll No.- 2219612) has developed a mini project on “AI IN
MEDICAL FIELD: DIABETIC HEALTH PREDICTION” for the CSE IV Semester Mini Project
Lab at Graphic Era Hill University, Dehradun. The project carried out by the student is their own work,
to the best of my knowledge.

Date:

Ms.Shagun Dasawat Ms. Shalini Semalti

Class Co-ordinator Project Guide

CSE-F1-IV-Sem Resource Person

(CSE Department) (CSE Department)

GEHU Dehradun GEHU Dehradun


ACKNOWLEDGMENT

We would like to express our gratitude to The Almighty Shiva Baba, the most Beneficent and the most
Merciful, for the completion of this project.

We wish to thank our parents for their continuing support and encouragement, and for providing us
with the opportunity to reach this far in our studies.

We would particularly like to thank our project Co-ordinator, Ms. Shagun Dasawat, and our Project
Guide, Mr. Piyush Agarwal, for their patience, support and encouragement throughout the completion
of this project, and for having faith in us.

Last but not least, we are greatly indebted to all others who directly or indirectly helped us during
this work.

Ms. Shalini Semalti


Roll No.- 2219612
CSE-F1-IV-Sem
Session: 2023-2024
GEHU, Dehradun
TABLE OF CONTENTS

CONTENT

1. INTRODUCTION
2. METHODOLOGY
3. DATA SET
4. DATA PREPROCESSING
5. MODEL
6. RESULTS
7. REFERENCES
1. INTRODUCTION
Diabetes is among the most prevalent chronic diseases in the world, impacting millions of people each
year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease
in which individuals lose the ability to effectively regulate levels of glucose in the blood, and can lead
to reduced quality of life and life expectancy. After different foods are broken down into sugars
during digestion, the sugars are then released into the bloodstream. This signals the pancreas to
release insulin. Insulin helps enable cells within the body to use those sugars in the bloodstream for
energy. Diabetes is generally characterized by either the body not making enough insulin or being
unable to use the insulin that is made as effectively as needed.

In this study, the K-Nearest Neighbors (KNN) algorithm is employed for the predictive modeling of
diabetes. Leveraging a dataset encompassing diverse features, the KNN algorithm is applied to
discern patterns indicative of diabetes risk. By systematically evaluating the algorithm's performance
on a comprehensive range of 'k' values, the research aims to identify an optimal configuration for
accurate diabetes prediction.

1.1. K-Nearest Neighbours (KNN) Algorithm:


KNN is a simple and versatile supervised machine learning algorithm used for classification and
regression tasks. It belongs to the category of instance-based or lazy learning algorithms, where the
model is not explicitly trained during the learning phase but makes predictions based on the similarity
of instances in the training data.
The fundamental concept of KNN involves predicting the class or value of a data point based on the
classes or values of its neighbouring data points in the feature space. The algorithm considers the 'k'
nearest neighbours to make predictions.

1.1.1. KNN Algorithm Workflow


The algorithm relies on a distance metric to quantify the similarity between data points in the feature
space. The choice of distance metric depends on the nature of the data and the problem at hand.

Fig 1: KNN workflow for classification of a new data point into class A or B on the
basis of Euclidean distance using value of k=3.
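
The distance computation that the workflow relies on can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the two sample points are hypothetical and do not come from the project code. Manhattan distance is included only as an example of an alternative metric.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute per-feature differences (an alternative metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical two-feature points, for illustration only.
p = (1.0, 2.0)
q = (4.0, 6.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```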
1.1.2. Parameter 'k':
The hyperparameter 'k' determines the number of neighbors considered when making predictions. A
smaller 'k' leads to more flexible models, potentially sensitive to noise, while a larger 'k' tends to
smooth out local variations.

1.1.3. Voting Mechanism:


For classification tasks, the algorithm uses a majority voting mechanism among the 'k' neighbors to
assign a class label to the query data point. In regression tasks, the algorithm averages the target
values of the 'k' neighbors to predict a continuous value.
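
The two aggregation rules can be sketched in a few lines. The helper names and the sample neighbor lists below are assumptions made for illustration, not part of the project code.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the most common label among the k neighbors (classification)."""
    return Counter(labels).most_common(1)[0][0]

def average_target(values):
    """Return the mean target value of the k neighbors (regression)."""
    return mean(values)

neighbor_labels = ["A", "B", "A"]   # hypothetical labels of k = 3 neighbors
neighbor_values = [5.0, 7.0, 9.0]   # hypothetical numeric targets
print(majority_vote(neighbor_labels))   # A
print(average_target(neighbor_values))  # 7.0
```

With an even 'k', two classes can receive the same number of votes. This sketch then returns whichever tied label appears first in the list, which is an arbitrary choice; other tie-breaking rules are possible.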

1.1.4. Training and Prediction:


Unlike traditional machine learning algorithms that involve an explicit training phase, KNN does not
build an internal model during training. The entire training dataset is stored in memory for future
predictions.

1.1.5. Prediction Process:


When a new data point needs to be classified or predicted, the algorithm calculates the distances to all
data points in the training set. It then selects the 'k' nearest neighbors and applies the voting or
averaging mechanism to assign the class label or predict the value.
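
Put together, the prediction process for classification might look like the sketch below. It is a homemade, brute-force version written for illustration (the function name `knn_predict` and the toy data are assumptions), not the implementation used in this project.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_X, train_y)
    ]
    # Keep the k closest points and vote on their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per point, binary labels.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (2.0, 2.0), k=3))  # 0
print(knn_predict(train_X, train_y, (6.0, 7.0), k=3))  # 1
```

Because every query is compared against every training point, the cost of one prediction grows linearly with the size of the training set.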

1.1.6. Key Considerations and Variants:


Effect of 'k':
The choice of 'k' significantly influences the algorithm's performance. A small 'k' may result in an
overly sensitive model, while a large 'k' may oversmooth the decision boundaries. Cross-validation is
often employed to determine an optimal 'k' for a given dataset.
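
One way to choose 'k' by cross-validation is sketched below. The data are synthetic (two Gaussian clusters generated with a fixed seed), the candidate values of 'k' and the number of folds are arbitrary assumptions, and the helper names are hypothetical; the report does not describe its own search procedure in this detail.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, folds=5):
    """Mean accuracy of k-NN over `folds` cross-validation splits."""
    scores = []
    for i in range(folds):
        test = data[i::folds]  # every folds-th row, starting at i, is held out
        train = [row for j, row in enumerate(data) if j % folds != i]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / folds

# Synthetic two-class data (illustrative only, not the BRFSS dataset).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(60)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(60)]
rng.shuffle(data)

results = {k: cv_accuracy(data, k) for k in (1, 3, 5, 7, 9)}
best_k = max(results, key=results.get)
print(results, best_k)
```

The value of 'k' with the highest mean validation accuracy is kept; when several values tie, this sketch keeps the smallest of them because `max` returns the first maximum it encounters.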

1.1.7. Curse of Dimensionality:


KNN can be sensitive to the curse of dimensionality, where the effectiveness of distance metrics
diminishes as the number of features increases. Feature scaling and dimensionality reduction
techniques may be applied to mitigate this issue.
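
Feature scaling matters for KNN because features with large numeric ranges otherwise dominate the distance. The sketch below standardizes each column to zero mean and unit variance; the rows are made-up (BMI, HighBP) pairs and the helper name is hypothetical. The report does not state whether scaling was applied in this project.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical rows: (BMI, HighBP). Values are made up for illustration.
rows = [(22.0, 0), (24.0, 1), (38.0, 0), (40.0, 1)]
scaled = standardize(rows)

# Raw distances are driven almost entirely by BMI, because its range
# dwarfs the 0/1 range of HighBP; scaling puts both on a common footing.
print(math.dist(rows[0], rows[1]), math.dist(rows[0], rows[2]))
print(math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```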

1.1.8. Strengths and Weaknesses:


Strengths: KNN is simple, easy to understand, and applicable to both classification and regression
tasks. It adapts well to non-linear decision boundaries and, with a sufficiently large 'k', is fairly
robust to noisy data.

Weaknesses: The algorithm can be computationally expensive, particularly for large datasets. It may
also be sensitive to irrelevant or redundant features. Additionally, the prediction process requires
storing the entire training dataset in memory.

1.1.9. Use Cases and Conclusion:


KNN finds applications in various domains, including medical diagnosis, recommendation systems,
and pattern recognition. Its effectiveness in capturing local patterns makes it suitable for diverse
datasets.

2. METHODOLOGY

Step 1: Data Collection. Collecting the required diabetes data.

Step 2: Exploratory Data Analysis. Cleaning the data or filling in the missing data, if any.

Step 3: Creating Training and Testing Datasets. Splitting the data into training and testing datasets.

Step 4: Model.

Step 5: Prediction.

Fig 2: Methodology Followed

3. DATA SET
The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey that is
collected annually by the CDC. Each year, the survey collects responses from over 400,000
Americans on health-related risk behaviors, chronic health conditions, and the use of preventative
services. It has been conducted every year since 1984. For this project, a CSV file of the dataset available
on Kaggle for the year 2015 was used. The original dataset contains responses from 441,455
individuals and has 330 features. These features are either questions directly asked of participants, or
calculated variables based on individual participant responses.

The version used here is a cleaned dataset of 253,680 survey responses to the CDC's BRFSS 2015. The target
variable Diabetes_binary has 2 classes: 0 is for no diabetes, and 1 is for prediabetes or diabetes. The dataset
has 21 feature variables.

3.1. Feature Description


Table 1: Description of Independent variables

Feature               Description
Diabetes_binary       0 = no diabetes; 1 = prediabetes or diabetes
HighBP                0 = no high BP; 1 = high BP
HighChol              0 = no high cholesterol; 1 = high cholesterol
CholCheck             0 = no cholesterol check in 5 years; 1 = cholesterol check in 5 years
BMI                   Body Mass Index
Smoker                Have you smoked at least 100 cigarettes in your entire life?
                      [Note: 5 packs = 100 cigarettes] 0 = no; 1 = yes
Stroke                (Ever told) you had a stroke. 0 = no; 1 = yes
HeartDiseaseorAttack  Coronary heart disease (CHD) or myocardial infarction (MI). 0 = no; 1 = yes
PhysActivity          Physical activity in past 30 days, not including job. 0 = no; 1 = yes
Fruits                Consume fruit 1 or more times per day. 0 = no; 1 = yes
Veggies               Consume vegetables 1 or more times per day. 0 = no; 1 = yes
HvyAlcoholConsump     Heavy drinkers (adult men having more than 14 drinks per week and adult
                      women having more than 7 drinks per week)
AnyHealthcare         Have any kind of health care coverage, including health insurance, prepaid
                      plans such as HMO, etc. 0 = no; 1 = yes
NoDocbcCost           Was there a time in the past 12 months when you needed to see a doctor but
                      could not because of cost? 0 = no; 1 = yes
GenHlth               Would you say that in general your health is: scale 1-5.
                      1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor
MentHlth              Now thinking about your mental health, which includes stress, depression,
                      and problems with emotions, for how many days during the past 30 days was
                      your mental health not good? Scale 1-30 days
PhysHlth              Now thinking about your physical health, which includes physical illness
                      and injury, for how many days during the past 30 days was your physical
                      health not good? Scale 1-30 days
DiffWalk              Do you have serious difficulty walking or climbing stairs? 0 = no; 1 = yes
Sex                   0 = female; 1 = male
Age                   13-level age category. 1 = 18-24, 9 = 60-64, 13 = 80 or older
Education             Education level (EDUCA, see codebook), scale 1-6.
                      1 = Never attended school or only kindergarten
                      2 = Grades 1 through 8 (Elementary)
                      3 = Grades 9 through 11 (Some high school)
                      4 = Grade 12 or GED (High school graduate)
                      5 = College 1 year to 3 years (Some college or technical school)
                      6 = College 4 years or more (College graduate)
Income                Income scale (INCOME2, see codebook), scale 1-8.
                      1 = less than $10,000, 5 = less than $35,000, 8 = $75,000 or more

4. DATA PREPROCESSING

The provided dataset is already clean and contains no missing values, so it required no further
cleaning.

4.1.1 Correlation Between Independent Variables

To identify redundant variables, a Pearson correlation analysis was applied to every pair of independent
variables. The analysis revealed no strong correlation between any pair of variables.

Fig 3: Pearson correlation between independent variables

As no variables are highly correlated with each other, all of them are retained for modeling.
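
The correlation check described above can be sketched as follows. The column values are made up (only the feature names come from Table 1), and the threshold of 0.8 for "high" correlation is an assumption, since the report does not state the cut-off it used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.8):
    """List feature pairs whose absolute correlation exceeds `threshold`."""
    names = list(columns)
    return [
        (a, b, pearson(columns[a], columns[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) > threshold
    ]

# Made-up values for three feature names from Table 1 (not real BRFSS rows).
columns = {
    "BMI":      [31, 22, 40, 27, 35],
    "GenHlth":  [1, 3, 2, 5, 4],
    "PhysHlth": [0, 10, 5, 30, 20],
}
print(highly_correlated_pairs(columns))
```

In this made-up example only the GenHlth and PhysHlth columns exceed the threshold. A pair flagged this way would be a candidate for dropping one of the two variables; an empty list corresponds to the outcome reported here, where every variable is kept.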

5. MODEL
The dataset was divided into a training set, utilized for training the KNN model, and a testing set to
evaluate its predictive capabilities. The KNN algorithm, chosen for its simplicity and effectiveness in
classification tasks, was employed to predict diabetic health based on relevant features. The model's
accuracy was measured against the testing set to assess its reliability in making accurate predictions.
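
A train/test split of the kind described can be sketched as below. The 80/20 ratio, the fixed seed, and the helper name `split_rows` are assumptions for illustration; the report does not state the split ratio that was actually used.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = rows[:]  # copy so the caller's list is not mutated
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder rows standing in for (features, label) records.
rows = [((i, i % 2), i % 2) for i in range(100)]
train, test = split_rows(rows)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting avoids any ordering in the source file leaking into the two sets, and fixing the seed makes the split reproducible.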

6. RESULTS
With k = 4, the KNN model achieved an accuracy of 85.64% on the testing set, meaning that it predicted
the correct diabetic health status for 85.64% of the test responses. The outcome suggests promising
potential for the KNN algorithm in the realm of diabetic health prediction.
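
An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common class. The sketch below shows how both numbers could be computed; the label lists are made up and the helper names are hypothetical, and the report does not give the class balance of its testing set.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority] * len(y_test))

# Made-up labels (0 = no diabetes, 1 = prediabetes or diabetes).
y_train = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_test = [0, 0, 1, 0, 0]
y_pred = [0, 0, 0, 0, 1]  # hypothetical model output

print(accuracy(y_test, y_pred))                     # 0.6
print(majority_baseline_accuracy(y_train, y_test))  # 0.8
```

A model is only informative to the extent that its accuracy exceeds this baseline, which can be high when one class is much more common than the other.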

7. REFERENCES
1. Dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data

2. https://blakelobato1.medium.com/k-nearest-neighbor-classifier-implement-homemade-class-compare-with-sklearn-import-6896f49b89e
