Programming for DA and AI
Coursework: Applying Machine Learning Techniques on Stroke Dataset
Number of Words: 3971
School of Computing, University of Portsmouth
Module M33145
Submitted by: UP2291855
Table of Contents
Task 1: Descriptive analytics……………………………………………………………………… 4
Task 2: Data Preparation……………………………………………………………………………9
Task 3: Classification……………………………………………………………………………... 10
Task 4: Regression ………………………………………………………………………………….19
Task 5: Clustering………………………………………………………………………………….. 25
References……………………………………………………………………………………………27
AI Statement
I confirm that I utilized AI tools to assist in revising the English language in my coursework to ensure
clarity, coherence, and academic integrity.
Task 1: Descriptive analytics
In the initial step, I examined the dataset. A general overview of the dataset is shown in figure 1.
Figure 1: General Overview of Stroke Heart attack Dataset
The basic information about the dataset is shown in figure 2:
Figure 2: Row and Column size in Stroke Heart attack Dataset
I checked the dataset for duplicated samples and null values; the results are shown in figure 3
and figure 4.
Figure 3: Number of duplicated samples
Figure 4: Checking null values and data types
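As an illustration, a minimal sketch of the kind of inspection commands that would produce this output is shown below. It assumes the data has been loaded into a pandas DataFrame named df and that the file name is as given; the exact code used in the coursework is the one shown in the figures.

```python
import pandas as pd

# File name assumed for illustration; the coursework figures show the actual loading step.
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# General overview and size of the dataset (figures 1 and 2)
print(df.head())
print(df.shape)

# Duplicated samples and missing values per column (figures 3 and 4)
print(df.duplicated().sum())
print(df.isnull().sum())

# Descriptive statistics of the numerical attributes (figure 5)
print(df.describe())
```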
Further statistical information about the dataset is shown in figure 5; it relates to the numerical
attributes. I also examined the ‘age’, ‘avg_glucose_level’ and ‘bmi’ attributes in more detail, because of
their continuous nature and the potential for BMI to contain outliers. This information is presented in
figure 6.
Figure 5: Numerical attributes statistical information
Figure 6: Age, Average Glucose Level and BMI statistics
Figure 7: Distribution of age
Figure 8: Distribution of avg glucose level
Figure 7, a histogram, illustrates the distribution of ages among individuals included in the stroke dataset.
Each bar represents a specific age range roughly 4 years wide, and its height indicates how many individuals
fall within that range. The histogram shows a prominent number of individuals in the middle-age bracket,
particularly around 40 to 60 years. This is significant because stroke risk typically increases with age, and
middle age is a critical time for the onset of risk factors associated with stroke. There is a noticeable decrease
in the number of individuals older than 60 years. While the risk of stroke increases with age, the dataset's
decline in older age groups could reflect higher mortality rates or less participation in the data collection.
The histogram in figure 8 illustrates the distribution of average glucose levels among individuals in the stroke
dataset. Each bar represents the frequency of individuals falling within specific ranges of glucose levels,
measured in milligrams per deciliter (mg/dL). The histogram displays a prominent peak around glucose levels
of 50 to 100 mg/dL, followed by a gradual decline in frequency as glucose levels increase. There are smaller
peaks observed at higher glucose levels, suggesting the presence of individuals with significantly elevated
glucose levels.
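A minimal sketch of how histograms like figures 7 and 8 could be produced with seaborn is given below; the bin settings are illustrative assumptions, and df is the DataFrame from the earlier sketch.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of age with bins roughly 4 years wide (figure 7)
sns.histplot(df["age"], binwidth=4)
plt.title("Distribution of age")
plt.show()

# Histogram of average glucose level (figure 8)
sns.histplot(df["avg_glucose_level"], bins=30)
plt.title("Distribution of avg_glucose_level")
plt.show()
```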
The histogram in figure 9 represents the distribution of
BMI among individuals in the stroke dataset. The
histogram displays a pronounced peak around a BMI of
approximately 25 to 30 kg/m², indicating that a
substantial portion of the population falls within the
overweight category. The distribution is skewed towards
higher BMIs, with a long tail extending into the obese
range, although the frequency diminishes as the BMI
increases beyond 40 kg/m².
Figure 9: Distribution of bmi
Figure 10: Count Plot of Gender
Figure 11: Count Plot of Hypertension
Figure 10 clearly demonstrates that females are more prevalent in this dataset than males by a significant
margin. This distribution could have implications depending on the focus of the study or the analysis.
Depending on the machine learning methods used, it may be necessary to apply balancing techniques to the
dataset. The presence of an "Other" category with only one entry indicates that the dataset essentially contains
two main genders, so it can be treated as a binary gender attribute.
The bar chart in figure 11 illustrates the distribution of individuals with and without hypertension in a dataset.
The number of individuals without hypertension is 4,612, while the number of individuals with hypertension is
498. This chart shows that within this dataset, less than 10% of the population has hypertension.
Figure 12: Count Plot of Heart Disease
Figure 13: Count Plot of ever married
The bar chart in figure 12 shows the distribution of individuals with and without heart disease in the dataset.
In this dataset, 4,834 individuals do not have heart disease, and only 276 individuals have heart disease. It
means around 5.5% of the population has heart disease.
The bar chart in figure 13 displays the distribution of individuals based on their marital status in the dataset,
categorized into those who have ever been married (3,353 samples) and those who have not (1,757 samples).
Figure 14: Count Plot of work type
Figure 15: Count Plot of Residence Type
The bar chart in figure 14 shows the distribution of individuals across different work types in the dataset. It
categorizes individuals into five groups based on their employment status and demonstrates a significant
prevalence of individuals working in the private sector, which is considerably higher compared to other
work types.
The bar chart in figure 15 illustrates the distribution of individuals based on their residence type, categorized
as either "Urban" or "Rural". The dataset shows a relatively balanced distribution between urban and rural
residents.
Figure 16: Count Plot of smoking status
Figure 17: Count Plot of stroke
The bar chart in figure 16 shows the distribution of individuals based on their smoking status,
categorized into four groups. The chart indicates that a large number of individuals have either
never smoked or have formerly smoked.
The bar chart in figure 17 shows the distribution of individuals in the dataset based on whether they
have experienced a stroke: 4,861 individuals without a stroke and 249 individuals with a stroke. This
shows that fewer than 5% of the samples relate to individuals who experienced a stroke.
The scatter plot in figure 18 illustrates the relationship
between average glucose levels and BMI among
individuals in the stroke dataset. Each point on the plot
represents an individual, with their glucose level plotted
along the horizontal axis and their BMI along the vertical
axis. The scatter plot shows a wide distribution of glucose
levels and BMI among individuals, with most data points
clustering around glucose levels below 150 mg/dL and
BMIs ranging from about 20 to 40 kg/m². A few outliers
appear at higher glucose levels and BMI, but there is no
apparent strong correlation or clear trend indicating a
direct relationship between the two variables.
Figure 18: Scatter Plot of Avg Glucose Level vs BMI
Figure 19: Box Plot for BMI
The box plot in figure 19 visualizes the distribution of BMI values within a dataset. The majority of BMI values
are concentrated between approximately 20 and 40, suggesting that most individuals in the dataset fall within
what might be considered a normal to moderately obese range according to standard BMI categories. Outliers
in this chart are shown as dots; these points are typically extreme values that might be data errors or
genuinely unusual cases.
Figure 20: Box Plot for Average Glucose
The box plot in figure 20 visualizes the distribution of average glucose levels within a stroke dataset. The
majority of the average glucose levels are concentrated between roughly 75 and 115 mg/dL.
Figure 21: Box Plot of age
The box plot in figure 21 shows the distribution of age within a dataset. The majority of the ages are
concentrated between roughly 25 and 60 years, suggesting a middle-aged demographic predominates in this
dataset.
Task 2: Data Preparation
In this task, I start by handling missing data within the dataset. As shown in figure 22, I used the median
value of ‘bmi’ to replace the missing values in this attribute.
Figure 22: Handling missing data in 'bmi' attribute
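A minimal sketch of this median imputation, assuming the DataFrame df from the earlier sketches:

```python
# Replace missing 'bmi' values with the median of the column
bmi_median = df["bmi"].median()
df["bmi"] = df["bmi"].fillna(bmi_median)

# Confirm that no missing values remain in 'bmi'
print(df["bmi"].isnull().sum())
```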
Then I used the Interquartile Range (IQR) method to remove outliers from the ‘avg_glucose_level’ and ‘bmi’
attributes. Figure 23 shows the related code and results. The first (0.25) and third (0.75) quartiles define
the IQR, and values lying well beyond them are treated as potential outliers.
Figure 23: Interquartile Range (IQR) method to remove outliers
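The following is a sketch of the IQR filtering described above. The 1.5 multiplier is the conventional choice and is assumed here rather than read from the figure.

```python
def remove_outliers_iqr(data, column, k=1.5):
    """Keep only rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data[column] >= lower) & (data[column] <= upper)]

for col in ["avg_glucose_level", "bmi"]:
    df = remove_outliers_iqr(df, col)

print(df.shape)  # number of samples remaining after outlier removal
```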
The next step is to normalise the data. For this, based on the data values, three features 'avg_glucose_level',
'bmi', and 'age' were selected to be normalised. Figure 24 shows the code for performing normalisation and
the corresponding results.
Figure 24: Data Normalisation
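The exact scaler is not visible in the figure; the sketch below uses scikit-learn's MinMaxScaler on the three selected features as one plausible implementation.

```python
from sklearn.preprocessing import MinMaxScaler

features_to_scale = ["avg_glucose_level", "bmi", "age"]

scaler = MinMaxScaler()  # scales each selected feature to the [0, 1] range
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

print(df[features_to_scale].describe())
```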
Task 3: Classification
In this task, the data needs to be split into a training set and a test set. I decided to use 30% of the data
as test data and 70% as training data. Figure 25 shows this splitting.
Figure 25: Splitting data to train and test
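A minimal sketch of this split is shown below. The one-hot encoding of the categorical columns, the random_state and the stratification are illustrative assumptions; the coursework figure shows the actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns (assumed here for illustration)
X = pd.get_dummies(df.drop(columns=["stroke"]), drop_first=True)
y = df["stroke"]

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```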
In this task I decided to apply five different classifiers and compare their results. These classifiers are:
Logistic Regression
Decision Tree Classifier
Neural Network (Not covered in module)
K Neighbors Classifier
Ada Boost Classifier (Not covered in module)
In each case, I applied the classifier and computed the machine learning metrics, and finally I compare them
at the end of this section. (Some of these classifiers are covered in the module and some are not.)
Logistic Regression
In figure 26, the implementation of the logistic regression classifier in Python is shown. In this report I
used Python libraries such as Keras and Scikit-learn, together with a very small number of TensorFlow
commands, so I only called the library functions that implement the ML methods. The remaining parts of my
code prepare results and figures.
Figure 26: Logistic Regression implementation
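A minimal sketch of such a logistic regression classifier with scikit-learn, using the X_train/X_test split from the previous task; max_iter is raised only to help convergence and is an assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```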
The results of classification by the logistic regression classifier are presented in figure 27. The model correctly
predicted 96.2% of the samples. While this might seem high, it is misleading due to the class imbalance
evident in the dataset. The model performs well in identifying class '0', but it fails entirely on class '1'.
This imbalance leads to a high accuracy score, which doesn't accurately reflect the model's
performance on minority classes.
Figure 27: Logistic Regression results
The ROC curve of the logistic regression classifier is shown in figure 28. The area under the ROC curve is
0.83. This is a good score and generally indicates that the model has a high degree of separability
between the positive and negative classes. The AUC ranges from 0 to 1, where 0.5 denotes a model
with no discrimination ability (equivalent to random guessing) and 1 denotes a perfect model. Values
above 0.7 are considered acceptable, with values closer to 1 being desirable.
Based on this information, we can say that the logistic regression classifier detects non-stroke samples with
about 98% accuracy, while the stroke samples are largely misclassified as non-stroke.
Figure 28: The ROC curve for applied Logistic Regression
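For reference, a ROC curve like the one in figure 28 can be plotted from the predicted probabilities; the sketch below assumes the log_reg model from the earlier sketch.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive (stroke) class
y_score = log_reg.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```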
Decision Tree Classifier
In figure 29, the implementation of the Decision Tree classifier in Python is shown.
Figure 29: Decision Tree implementation
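A minimal sketch of the decision tree classifier, assuming scikit-learn's default hyperparameters:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))
```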
The results of classification by the Decision Tree classifier are presented in figure 30. The model correctly
predicted 92.5% of the samples. However, given the significant class imbalance, this metric might not
fully reflect the model's effectiveness, particularly for the minority class. Although the overall accuracy is
lower than that of logistic regression, the decision tree at least identifies some positive cases, but its
performance still needs to improve. The model's high performance on the majority class (negative)
overshadows its poor performance on the minority class (positive). The low recall (0.16) and precision (0.12)
for the positive class indicate that the model struggles to correctly identify and predict the positive cases.
This could lead to significant issues, especially because our positive class represents stroke detection.
Figure 30: Decision Tree Results
The ROC curve of the Decision Tree classifier is shown in figure 31. The area under the ROC curve
is 0.56. This is not a good score and generally indicates that the model does not have a high degree of
separability between the positive and negative classes.
Figure 31: The ROC curve for Decision Tree
K Neighbors Classifier
In figure 32, the implementation of the K-Neighbors classifier in Python is shown.
Figure 32: K Neighbors Classifier Implementation
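A minimal sketch of the K-Neighbors classifier; the number of neighbours (k = 5, the scikit-learn default) is an assumption.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

y_pred_knn = knn_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))
```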
The results of classification by the K-Neighbors classifier are presented in figure 33. The model correctly
predicted 95.61% of the samples. However, given the significant class imbalance, this metric might
not fully reflect the model's effectiveness, particularly for the minority class. On the positive class, the
result is better than logistic regression (which detects no positive cases) but not as good as the Decision
Tree. The model's high performance on the majority class (negative) overshadows its poor performance on
the minority class (positive). The low recall (0.04) and precision (0.17) for the positive class indicate that
the model struggles to correctly identify and predict the positive cases. This could lead to significant
issues, especially because our positive class represents stroke detection.
Figure 33: K Neighbors Classifier Results
The ROC curve of the K Neighbors Classifier is shown in figure 34. The area under the ROC curve is
0.65, so the model's ability to distinguish between the classes is moderate.
Figure 34: The ROC for K Neighbors Classifier
Neural Network
In figure 35, the implementation of the Neural Network classifier in Python is shown. I used a simple
sequential model with two Dense layers. The first Dense layer has 10 neurons and uses ‘relu’ as its
activation function. The second layer has a single neuron (because this is binary classification) and uses
‘sigmoid’ as its activation function. I used ‘adam’ as the optimizer.
Figure 35: Neural Network implementation
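A sketch of this architecture in Keras is given below; the layer sizes, activations and optimizer follow the description above, while the binary cross-entropy loss and batch size are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(10, activation="relu"),    # first Dense layer: 10 neurons, relu
    layers.Dense(1, activation="sigmoid"),  # single output neuron for binary classification
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.AUC(name="auc")])

# 50 epochs, as described in the text; batch size is an illustrative choice
history = model.fit(X_train.astype("float32"), y_train,
                    epochs=50, batch_size=32,
                    validation_data=(X_test.astype("float32"), y_test),
                    verbose=0)
```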
I trained the Neural Network model for 50 epochs, and figure 36 shows the accuracy, AUC and loss values for
the final 10 epochs. As we can see, a good level of accuracy is achieved and the AUC is high, which reflects
the quality of classification by the Neural Network. Again, similar to the previous methods, the confusion
matrix contains two zero values. This shows that the model performs well in identifying class '0', but it
fails entirely on class '1'. This imbalance leads to a high accuracy score, which does not accurately reflect
the model's performance on the minority class.
Figure 36: Neural Network Results
The ROC curve of the Neural Network is shown in figure 37. The area under the ROC curve is 0.83. This
is a good score and generally indicates that the model has a high degree of separability between the
positive and negative classes.
Figure 37: The ROC for Neural Network
Ada Boost Classifier
In figure 38, the implementation of the Ada Boost Classifier in Python is shown.
Figure 38: Ada Boost Classifier Implementation
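A minimal sketch of the Ada Boost classifier, assuming scikit-learn's default number of estimators:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report

ada_clf = AdaBoostClassifier(random_state=42)
ada_clf.fit(X_train, y_train)

y_pred_ada = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))
```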
The results of classification by the Ada Boost Classifier are presented in figure 39. The model correctly
predicted 96.2% of the samples. However, given the significant class imbalance, this metric might not
fully reflect the model's effectiveness, particularly for the minority class. The result is similar to logistic
regression, but not as good as the Decision Tree and the Neural Network. The model's high performance on
the majority class (negative) overshadows its poor performance on the minority class (positive).
Figure 39: Ada Boost Classifier Results
The ROC curve of the Ada Boost Classifier is shown in figure 40. The area under the ROC curve is 0.82.
This is a good score and generally indicates that the model has a high degree of separability between
the positive and negative classes.
Figure 40: The ROC for Ada Boost Classifier
Comparison of applied classifiers
Based on table 1, Logistic Regression, the Neural Network, and the Ada Boost Classifier demonstrate good
performance, each achieving an accuracy of 96.2%, with ROC AUC values of 0.83, 0.83 and 0.82,
respectively. These metrics suggest that these models are highly effective at distinguishing between
classes, indicative of their robustness in handling complex patterns and relationships within the data,
which are critical in predicting stroke outcomes.
The Decision Tree Classifier, while exhibiting a reasonable accuracy of 92.56%, shows a significantly
lower ROC AUC of 0.56. This indicates a limited ability to differentiate effectively between the
classes, possibly due to the model's sensitivity to the dataset's variance or its proneness to overfitting,
which is a common challenge with decision trees. On the other hand, the K Neighbors Classifier
displays a better balance between accuracy (95.61%) and ROC AUC (0.65) than the Decision Tree.
However, it still lags behind the Logistic Regression, Neural Network, and Ada Boost models.
Table 1: Comparing Classification Algorithms
Method Accuracy (%) ROC AUC
Logistic Regression 96.2 0.83
Decision Tree Classifier 92.56 0.56
Neural Network (Not covered in module) 96.2 0.83
K Neighbors Classifier 95.61 0.65
Ada Boost Classifier (Not covered in module) 96.2 0.82
It is important to note that the imbalance of the entire dataset with respect to the stroke variable remains
a challenge for all methods.
In summary, the Neural Network and Ada Boost models emerge as particularly potent for this dataset,
likely due to their ability to model complex nonlinear relationships and leverage ensemble learning
techniques, respectively. These models would be preferable in scenarios where the primary objective
is to maximize predictive performance and where computational resources are not a limiting factor.
Conversely, for applications requiring faster model training and predictions, or where interpretability
is a key concern, Logistic Regression or even K Neighbors Classifier might be more appropriate despite
some sacrifice in ROC AUC.
Task 4: Regression
In this task I decided to apply three different regressors and compare their results. These regressors are:
Random Forest Regression
Support Vector Regression (SVR)
Ridge Regression
In each case, I applied the regressor and computed the machine learning metrics, and finally I compare
them at the end of this section. (Some of these regressors are covered in the module and some are not.)
Random Forest Regression
In figure 41, the implementation of Random Forest Regression in Python is shown.
Figure 41: Random forest Regression Implementation
For regression methods I used two performance metrics.
Mean Squared Error (MSE) is the average of the squares of the errors—that is, the average
squared difference between the estimated values and the actual value.
R-squared is a statistical measure that represents the proportion of the variance for a dependent
variable that's explained by an independent variable or variables in a regression model.
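A minimal sketch combining the Random Forest regressor with these two metrics is shown below. It assumes, as the later figures suggest, that BMI is the regression target and the remaining (encoded) attributes are the features; the actual feature selection is in the coursework figures.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# BMI as the regression target (assumption based on the figures that follow)
X_reg = pd.get_dummies(df.drop(columns=["bmi"]), drop_first=True)
y_reg = df["bmi"]
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(Xr_train, yr_train)
yr_pred = rf_reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred))
print("R-squared:", r2_score(yr_test, yr_pred))
```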
The results of the Random Forest Regression are presented in figure 42. A lower MSE
indicates a better fit of the model to the data. Given that the MSE is 0.7923, without context on the range
and scale of the target variable, it is difficult to determine whether this is high or low.
An R-squared of 0.1279 means that approximately 12.79% of the variance in the dependent variable is
predictable from the independent variables. This value typically ranges from 0 to 1, where 0 indicates that
the model explains none of the variability of the response data around its mean, and 1 indicates that the
model explains all the variability.
Figure 42: Random Forest Regression results
The scatter plot in figure 43 shows the relationship between the actual BMI values and the predicted BMI
values obtained from the Random Forest regression model. The general grouping of points around the line
suggests that the model has a moderate fit to the data. The R-squared value (0.1279) seems low considering
the visual evidence here, which might indicate that while the model tracks the direction of changes in BMI, it
does not explain much of the variability.
Figure 43: Comparing predicted and actual BMI by Random Forest Regression
Support Vector Regression (SVR)
In figure 44, the implementation of Support Vector Regression (SVR) in Python is shown.
Figure 44: Support Vector Regression implementation
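A minimal sketch of the SVR model on the same regression split; the RBF kernel is scikit-learn's default and is assumed here.

```python
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

svr_reg = SVR(kernel="rbf")
svr_reg.fit(Xr_train, yr_train)
yr_pred_svr = svr_reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred_svr))
print("R-squared:", r2_score(yr_test, yr_pred_svr))
```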
The results of the Support Vector Regression are presented in figure 45. A lower MSE
indicates a better fit of the model to the data. Given that the MSE is 0.6718, this model fits the data
better than the previous one.
An R-squared of 0.2604 means that approximately 26.04% of the variance in the dependent variable
is predictable from the independent variables. This value typically ranges from 0 to 1, where 0 indicates
that the model explains none of the variability of the response data around its mean, and 1 indicates
that the model explains all the variability.
Figure 45: Support Vector Regression results
The scatter plot in figure 46 shows the relationship between the actual BMI values and the predicted
BMI values obtained from the Support Vector Regression model. The general grouping of points around
the line suggests that the model has a moderate fit to the data. The R-squared value (0.2604) seems
low considering the visual evidence here, which might indicate that while the model tracks the direction
of changes in BMI, it does not explain much of the variability, although it is clearly better than the
previous method.
Figure 46: Comparing predicted and actual BMI by Support Vector Regression
Ridge Regression
In figure 47, the implementation of Ridge Regression in Python is shown.
Figure 47: Ridge Regression Implementation
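A minimal sketch of the Ridge regressor on the same split; the regularisation strength alpha = 1.0 is the scikit-learn default and an assumption here.

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(Xr_train, yr_train)
yr_pred_ridge = ridge_reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred_ridge))
print("R-squared:", r2_score(yr_test, yr_pred_ridge))
```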
The results of the Ridge Regression are presented in figure 48. A lower MSE
indicates a better fit of the model to the data. Given that the MSE is 0.7961, without context on the range
and scale of the target variable, it is difficult to determine whether this is high or low.
An R-squared of 0.1236 means that approximately 12.36% of the variance in the dependent variable
is predictable from the independent variables. This value typically ranges from 0 to 1, where 0 indicates
that the model explains none of the variability of the response data around its mean, and 1 indicates
that the model explains all the variability. So, this method is comparable with the first method.
Figure 48: Ridge Regression results
The scatter plot in figure 49 shows the relationship between the actual BMI values and the predicted
BMI values obtained from the Ridge Regression model. The general grouping of points around the line
suggests that the model has a moderate fit to the data. The R-squared value (0.1236) seems low
considering the visual evidence here, which might indicate that while the model tracks the direction of
changes in BMI, it does not explain much of the variability.
Figure 49: Comparing predicted and actual BMI by Ridge Regression
Comparison of applied Regressors
Based on table 2, Random Forest shows a moderate MSE, which implies an average level of prediction
error. The R-squared value is quite low, indicating that only about 12.78% of the variability in the
dependent variable is explained by the model. This suggests that while Random Forest is robust for
complex datasets with non-linear relationships, it may not be capturing all the nuances in the stroke
dataset effectively. SVR performs better in this comparison, with a lower MSE and a higher R-squared
value. The lower MSE indicates that the predictions are, on average, closer to the actual data points,
and the higher R-squared suggests that about 26.04% of the variability is captured by the model. SVR's
ability to handle non-linear data effectively through kernel tricks might be contributing to its better
performance. Ridge Regression has the highest MSE and a similarly low R-squared value as Random
Forest, indicating a less effective model in terms of both error rate and variability explained. This
model, which includes regularization to handle multicollinearity and reduce model complexity, still
struggles to effectively model the stroke dataset, possibly due to the dataset's characteristics or the
inherent linear nature of Ridge Regression.
Table 2: Comparing Regression Algorithms
Method MSE R-squared
Random Forest Regression 0.7923 0.1278
Support Vector Regression 0.6718 0.2604
Ridge Regression 0.7961 0.1236
Task 5: Clustering
In this task I decided to apply two different clustering algorithms and compare their results. These
algorithms are:
K-means Clustering
DBSCAN Clustering
In each case, I applied the clustering algorithm and computed the machine learning metrics, and finally
I compare them at the end of this section. (Some of these clustering algorithms are covered in the
module and some are not.)
K-means Clustering
In figure 50, the implementation of K-means Clustering in Python is shown.
Figure 50: K-means Clustering Implementation
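A minimal sketch of K-means with three clusters, matching the result described below; the choice of the three scaled numerical attributes as clustering features is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Clustering features assumed to be the scaled numerical attributes
X_clust = df[["age", "avg_glucose_level", "bmi"]]

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_clust)

print(pd.Series(cluster_labels).value_counts())  # size of each cluster
```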
In figure 51, the results of applying K-means clustering to the stroke dataset are shown. As we can see, the
data is partitioned into three clusters, and the result is not accurate. One option is to try different values
of K, and another is to use other algorithms such as hierarchical clustering.
Figure 51: K-means Clustering results
DBSCAN Clustering
In figure 52, the implementation of DBSCAN Clustering in Python is shown.
Figure 52: DBSCAN Clustering Implementation
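A minimal sketch of DBSCAN on the same feature matrix X_clust; eps and min_samples are illustrative values, as the figure's exact settings are not visible.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_clust)

# Label -1 marks noise points; all other labels are cluster indices
print(pd.Series(db_labels).value_counts())
```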
In figure 53, the results of applying DBSCAN clustering to the stroke dataset are shown. As we can see, the
data is assigned to a single cluster, so the result is not accurate.
Figure 53: DBSCAN Clustering results
Comparison of applied Clustering algorithms
Both applied algorithms produced low-quality results, and my suggestion is to try hierarchical clustering.
References
[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, visit: 19/11/2024
[2] https://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html, visit: 19/11/2024
[3] https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html, visit: 19/11/2024
[4] https://keras.io/guides/sequential_model/, visit: 19/11/2024
[5] https://scikit-learn.org/1.5/modules/tree.html, visit: 19/11/2024
[6] Rigatti, S. J. (2017). Random forest. Journal of Insurance Medicine, 47(1), 31-39.
[7] Chi, Z. (1995, November). MLP classifiers: overtraining and solutions. In Proceedings of ICNN'95-International
Conference on Neural Networks (Vol. 5, pp. 2821-2824). IEEE.
[8] Cunningham, P., & Delany, S. J. (2021). K-nearest neighbour classifiers-a tutorial. ACM computing surveys (CSUR),
54(6), 1-25.
[9] Song, Y. Y., & Ying, L. U. (2015). Decision tree methods: applications for classification and prediction. Shanghai archives
of psychiatry, 27(2), 130.