Programming for DA and AI
Coursework: Applying Machine Learning Techniques on Stroke Dataset
Number of Words: 3971
School of Computing, University of Portsmouth
Module M33145
Submitted by: UP2291855
Table of Contents
Task 1: Descriptive analytics……………………………………………………………………… 4
Task 2: Data Preparation……………………………………………………………………………9
Task 3: Classification……………………………………………………………………………... 10
Task 4: Regression ………………………………………………………………………………….19
Task 5: Clustering………………………………………………………………………………….. 25
References……………………………………………………………………………………………27
AI Statement
I confirm that I utilized AI tools to assist in revising the English language in my coursework to ensure
clarity, coherence, and academic integrity.
Task 1: Descriptive analytics
In the initial step, I examined the dataset. A general overview of the dataset is shown in figure 1.
Figure 1: General Overview of Stroke Heart attack Dataset
The basic information about the dataset is shown in figure 2:
Figure 2: Row and Column size in Stroke Heart attack Dataset
I checked the dataset for duplicated samples and null values; the results are shown in figure 3
and figure 4.
Figure 3: Number of duplicated samples
Figure 4: Checking null values and data types
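As an illustration, a minimal sketch of the kind of inspection commands that would produce this output is shown below. It assumes the data has been loaded into a pandas DataFrame named df and that the file name is as given; the exact code used in the coursework is the one shown in the figures.

```python
import pandas as pd

# File name assumed for illustration; the coursework figures show the actual loading step.
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# General overview and size of the dataset (figures 1 and 2)
print(df.head())
print(df.shape)

# Duplicated samples and missing values per column (figures 3 and 4)
print(df.duplicated().sum())
print(df.isnull().sum())

# Descriptive statistics of the numerical attributes (figure 5)
print(df.describe())
```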
Further statistical information about the dataset is shown in figure 5; it relates to the numerical
attributes. I also examined the ‘age’, ‘avg_glucose_level’ and ‘bmi’ attributes in more detail, because of
their continuous nature and the potential for BMI to contain outliers. This information is presented in
figure 6.
Figure 5: Numerical attributes statistical information
Figure 6: Age, Average Glucose Level and BMI statistics
Figure 7: Distribution of age
Figure 8: Distribution of avg glucose level
Figure 7, a histogram, illustrates the distribution of ages among individuals included in the stroke dataset.
Each bar represents a specific age range roughly 4 years wide, and its height indicates how many individuals
fall within that range. The histogram shows a prominent number of individuals in the middle-age bracket,
particularly around 40 to 60 years. This is significant because stroke risk typically increases with age, and
middle age is a critical time for the onset of risk factors associated with stroke. There is a noticeable decrease
in the number of individuals older than 60 years. While the risk of stroke increases with age, the dataset's
decline in older age groups could reflect higher mortality rates or less participation in the data collection.
The histogram in figure 8 illustrates the distribution of average glucose levels among individuals in the stroke
dataset. Each bar represents the frequency of individuals falling within specific ranges of glucose levels,
measured in milligrams per deciliter (mg/dL). The histogram displays a prominent peak around glucose levels
of 50 to 100 mg/dL, followed by a gradual decline in frequency as glucose levels increase. There are smaller
peaks observed at higher glucose levels, suggesting the presence of individuals with significantly elevated
glucose levels.
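A minimal sketch of how histograms like figures 7 and 8 could be produced with seaborn is given below; the bin settings are illustrative assumptions, and df is the DataFrame from the earlier sketch.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of age with bins roughly 4 years wide (figure 7)
sns.histplot(df["age"], binwidth=4)
plt.title("Distribution of age")
plt.show()

# Histogram of average glucose level (figure 8)
sns.histplot(df["avg_glucose_level"], bins=30)
plt.title("Distribution of avg_glucose_level")
plt.show()
```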
The histogram in figure 9 represents the distribution of
BMI among individuals in the stroke dataset. The
histogram displays a pronounced peak around a BMI of
approximately 25 to 30 kg/m², indicating that a
substantial portion of the population falls within the
overweight category. The distribution is skewed towards
higher BMIs, with a long tail extending into the obese
range, although the frequency diminishes as the BMI
increases beyond 40 kg/m².
Figure 9: Distribution of bmi
Figure 10: Count Plot of Gender
Figure 11: Count Plot of Hypertension
Figure 10 clearly demonstrates that females are more prevalent in this dataset than males by a significant
margin. This distribution could have implications depending on the focus of the study or the analysis.
Depending on the machine learning methods used, it may be necessary to apply balancing techniques to the
dataset. The presence of an "Other" category with only one entry indicates that the dataset essentially contains
two main genders, so it can be treated as a binary gender attribute.
The bar chart in figure 11 illustrates the distribution of individuals with and without hypertension in a dataset.
The number of individuals without hypertension is 4,612, while the number of individuals with hypertension is
498. This chart shows that within this dataset, less than 10% of the population has hypertension.
Figure 12: Count Plot of Heart Disease
Figure 13: Count Plot of ever married
The bar chart in figure 12 shows the distribution of individuals with and without heart disease in the dataset.
In this dataset, 4,834 individuals do not have heart disease, and only 276 individuals have heart disease. It
means around 5.5% of the population has heart disease.
The bar chart in figure 13 displays the distribution of individuals based on their marital status in the dataset,
categorized into those who have ever been married (3,353 samples) and those who have not (1,757 samples).
Figure 14: Count Plot of work type
Figure 15: Count Plot of Residence Type
The bar chart in figure 14 shows the distribution of individuals across different work types in the dataset. It
categorizes individuals into five groups based on their employment status and demonstrates a significant
prevalence of individuals working in the private sector, which is considerably higher compared to other
work types.
The bar chart in figure 15 illustrates the distribution of individuals based on their residence type, categorized
as either "Urban" or "Rural". The dataset shows a relatively balanced distribution between urban and rural
residents.
Figure 16: Count Plot of smoking status
Figure 17: Count Plot of stroke
The bar chart in figure 16 shows the distribution of individuals based on their smoking status,
categorized into four groups. The chart indicates that a large number of individuals have either
never smoked or have formerly smoked.
The bar chart in figure 17 shows the distribution of individuals in the dataset based on whether they
have experienced a stroke: 4,861 individuals without a stroke and 249 individuals with a stroke. This
shows that fewer than 5% of the samples relate to individuals who experienced a stroke.
The scatter plot in figure 18 illustrates the relationship
between average glucose levels and BMI among
individuals in the stroke dataset. Each point on the plot
represents an individual, with their glucose level plotted
along the horizontal axis and their BMI along the vertical
axis. The scatter plot shows a wide distribution of glucose
levels and BMI among individuals, with most data points
clustering around glucose levels below 150 mg/dL and
BMIs ranging from about 20 to 40 kg/m². A few outliers
appear at higher glucose levels and BMI, but there is no
apparent strong correlation or clear trend indicating a
direct relationship between the two variables.
Figure 18: Scatter Plot of Avg Glucose Level vs BMI
Figure 19: Box Plot for BMI
The box plot in figure 19 visualizes the distribution of BMI values within a dataset. The majority of BMI values
are concentrated between approximately 20 and 40, suggesting that most individuals in the dataset fall within
what might be considered a normal to moderately obese range according to standard BMI categories. Outliers
in this chart are shown as dots; these points are typically extreme values that might be data errors or
genuinely unusual cases.
Figure 20: Box Plot for Average Glucose
The box plot in figure 20 visualizes the distribution of average glucose levels within a stroke dataset. The
majority of the average glucose levels are concentrated between roughly 75 and 115 mg/dL.
Figure 21: Box Plot of age
The box plot in figure 21 shows the distribution of age within a dataset. The majority of the ages are
concentrated between roughly 25 and 60 years, suggesting a middle-aged demographic predominates in this
dataset.
Task 2: Data Preparation
In this task, I start by handling missing data within the dataset. As shown in figure 22, I used the median
value of ‘bmi’ to replace the missing values in this attribute.
Figure 22: Handling missing data in 'bmi' attribute
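A minimal sketch of this median imputation, assuming the DataFrame df from the earlier sketches:

```python
# Replace missing 'bmi' values with the median of the column
bmi_median = df["bmi"].median()
df["bmi"] = df["bmi"].fillna(bmi_median)

# Confirm that no missing values remain in 'bmi'
print(df["bmi"].isnull().sum())
```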
Then I used the Interquartile Range (IQR) method to remove outliers from the ‘avg_glucose_level’ and ‘bmi’
attributes. Figure 23 shows the related code and results. The first (0.25) and third (0.75) quartiles define
the IQR, and values lying well beyond them are treated as potential outliers.
Figure 23: Interquartile Range (IQR) method to remove outliers
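The following is a sketch of the IQR filtering described above. The 1.5 multiplier is the conventional choice and is assumed here rather than read from the figure.

```python
def remove_outliers_iqr(data, column, k=1.5):
    """Keep only rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[(data[column] >= lower) & (data[column] <= upper)]

for col in ["avg_glucose_level", "bmi"]:
    df = remove_outliers_iqr(df, col)

print(df.shape)  # number of samples remaining after outlier removal
```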
The next step is to normalise the data. For this, based on the data values, three features 'avg_glucose_level',
'bmi', and 'age' were selected to be normalised. Figure 24 shows the code for performing normalisation and
the corresponding results.
Figure 24: Data Normalisation
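The exact scaler is not visible in the figure; the sketch below uses scikit-learn's MinMaxScaler on the three selected features as one plausible implementation.

```python
from sklearn.preprocessing import MinMaxScaler

features_to_scale = ["avg_glucose_level", "bmi", "age"]

scaler = MinMaxScaler()  # scales each selected feature to the [0, 1] range
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

print(df[features_to_scale].describe())
```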
Task 3: Classification
In this task, the data needs to be split into a training set and a test set. I decided to use 30% of the data
as test data and 70% as training data. Figure 25 shows this splitting.
Figure 25: Splitting data to train and test
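A minimal sketch of this split is shown below. The one-hot encoding of the categorical columns, the random_state and the stratification are illustrative assumptions; the coursework figure shows the actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns (assumed here for illustration)
X = pd.get_dummies(df.drop(columns=["stroke"]), drop_first=True)
y = df["stroke"]

# 70% training data, 30% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```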
In this task I decided to apply five different classifiers and compare their results. These classifiers are:
Logistic Regression
Decision Tree Classifier
Neural Network (Not covered in module)
K Neighbors Classifier
Ada Boost Classifier (Not covered in module)
In each case, I applied the classifier and computed the machine learning metrics, and finally I compare them
at the end of this section. (Some of these classifiers are covered in the module and some are not.)
Logistic Regression
In figure 26, the implementation of the logistic regression classifier in Python is shown. In this report I
used Python libraries such as Keras and Scikit-learn, together with a very small number of TensorFlow
commands, so I only called the library functions that implement the ML methods. The remaining parts of my
code prepare results and figures.
Figure 26: Logistic Regression implementation
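A minimal sketch of such a logistic regression classifier with scikit-learn, using the X_train/X_test split from the previous task; max_iter is raised only to help convergence and is an assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```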
The results of classification by the logistic regression classifier are presented in figure 27. The model correctly
predicted 96.2% of the samples. While this might seem high, it is misleading due to the class imbalance
evident in the dataset. The model performs well in identifying class '0', but it fails entirely on class '1'.
This imbalance leads to a high accuracy score, which doesn't accurately reflect the model's
performance on minority classes.
Figure 27: Logistic Regression results
The ROC curve of the logistic regression classifier is shown in figure 28. The area under the ROC curve is
0.83. This is a good score and generally indicates that the model has a high degree of separability
between the positive and negative classes. The AUC ranges from 0 to 1, where 0.5 denotes a model
with no discrimination ability (equivalent to random guessing) and 1 denotes a perfect model. Values
above 0.7 are considered acceptable, with values closer to 1 being desirable.
Based on this information, we can say that the logistic regression classifier detects non-stroke samples with
about 98% accuracy, while the stroke samples are largely misclassified as non-stroke.
Figure 28: The ROC curve for applied Logistic Regression
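For reference, a ROC curve like the one in figure 28 can be plotted from the predicted probabilities; the sketch below assumes the log_reg model from the earlier sketch.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive (stroke) class
y_score = log_reg.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```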
Decision Tree Classifier
In figure 29, the implementation of the Decision Tree classifier in Python is shown.
Figure 29: Decision Tree implementation
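A minimal sketch of the decision tree classifier, assuming scikit-learn's default hyperparameters:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))
```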
The results of classification by the Decision Tree classifier are presented in figure 30. The model correctly
predicted 92.5% of the samples. However, given the significant class imbalance, this metric might not
fully reflect the model's effectiveness, particularly for the minority class. Although the overall accuracy is
lower than that of logistic regression, the decision tree at least identifies some positive cases, but its
performance still needs to improve. The model's high performance on the majority class (negative)
overshadows its poor performance on the minority class (positive). The low recall (0.16) and precision (0.12)
for the positive class indicate that the model struggles to correctly identify and predict the positive cases.
This could lead to significant issues, especially because our positive class represents stroke detection.
Figure 30: Decision Tree Results
The ROC curve of the Decision Tree classifier is shown in figure 31. The area under the ROC curve
is 0.56. This is not a good score and generally indicates that the model does not have a high degree of
separability between the positive and negative classes.
Figure 31: The ROC curve for Decision Tree
K Neighbors Classifier
In figure 32, the implementation of the K-Neighbors classifier in Python is shown.
Figure 32: K Neighbors Classifier Implementation
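A minimal sketch of the K-Neighbors classifier; the number of neighbours (k = 5, the scikit-learn default) is an assumption.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)

y_pred_knn = knn_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))
```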
The results of classification by the K-Neighbors classifier are presented in figure 33. The model correctly
predicted 95.61% of the samples. However, given the significant class imbalance, this metric might
not fully reflect the model's effectiveness, particularly for the minority class. On the positive class, the
result is better than logistic regression (which detects no positive cases) but not as good as the Decision
Tree. The model's high performance on the majority class (negative) overshadows its poor performance on
the minority class (positive). The low recall (0.04) and precision (0.17) for the positive class indicate that
the model struggles to correctly identify and predict the positive cases. This could lead to significant
issues, especially because our positive class represents stroke detection.
Figure 33: K Neighbors Classifier Results
The ROC curve of the K Neighbors Classifier is shown in figure 34. The area under the ROC curve is
0.65, so the model's ability to distinguish between the classes is moderate.
Figure 34: The ROC for K Neighbors Classifier
Neural Network
In figure 35, the implementation of the Neural Network classifier in Python is shown. I used a simple
sequential model with two Dense layers. The first Dense layer has 10 neurons and uses ‘relu’ as its
activation function. The second layer has a single neuron (because this is binary classification) and uses
‘sigmoid’ as its activation function. I used ‘adam’ as the optimizer.
Figure 35: Neural Network implementation
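A sketch of this architecture in Keras is given below; the layer sizes, activations and optimizer follow the description above, while the binary cross-entropy loss and batch size are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(10, activation="relu"),    # first Dense layer: 10 neurons, relu
    layers.Dense(1, activation="sigmoid"),  # single output neuron for binary classification
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.AUC(name="auc")])

# 50 epochs, as described in the text; batch size is an illustrative choice
history = model.fit(X_train.astype("float32"), y_train,
                    epochs=50, batch_size=32,
                    validation_data=(X_test.astype("float32"), y_test),
                    verbose=0)
```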
I trained the Neural Network model for 50 epochs, and figure 36 shows the accuracy, AUC and loss values for
the final 10 epochs. As we can see, a good level of accuracy is achieved and the AUC is high, which reflects
the quality of classification by the Neural Network. Again, similar to the previous methods, the confusion
matrix contains two zero values. This shows that the model performs well in identifying class '0', but it
fails entirely on class '1'. This imbalance leads to a high accuracy score, which does not accurately reflect
the model's performance on the minority class.
Figure 36: Neural Network Results
The ROC curve of the Neural Network is shown in figure 37. The area under the ROC curve is 0.83. This
is a good score and generally indicates that the model has a high degree of separability between the
positive and negative classes.
Figure 37: The ROC for Neural Network
Ada Boost Classifier
In figure 38, the implementation of the Ada Boost Classifier in Python is shown.
Figure 38: Ada Boost Classifier Implementation
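A minimal sketch of the Ada Boost classifier, assuming scikit-learn's default number of estimators:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report

ada_clf = AdaBoostClassifier(random_state=42)
ada_clf.fit(X_train, y_train)

y_pred_ada = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_ada))
print(classification_report(y_test, y_pred_ada))
```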
The results of classification by the Ada Boost Classifier are presented in figure 39. The model correctly
predicted 96.2% of the samples. However, given the significant class imbalance, this metric might not
fully reflect the model's effectiveness, particularly for the minority class. The result is similar to logistic
regression, but not as good as the Decision Tree and the Neural Network. The model's high performance on
the majority class (negative) overshadows its poor performance on the minority class (positive).
Figure 39: Ada Boost Classifier Results
The ROC curve of the Ada Boost Classifier is shown in figure 40. The area under the ROC curve is 0.82.
This is a good score and generally indicates that the model has a high degree of separability between
the positive and negative classes.
Figure 40: The ROC for Ada Boost Classifier
Comparison of applied classifiers
Based on table 1, Logistic Regression, the Neural Network, and the Ada Boost Classifier demonstrate good
performance, each achieving an accuracy of 96.2%, with ROC AUC values of 0.83, 0.83 and 0.82,
respectively. These metrics suggest that these models are highly effective at distinguishing between
classes, indicative of their robustness in handling complex patterns and relationships within the data,
which are critical in predicting stroke outcomes.
The Decision Tree Classifier, while exhibiting a reasonable accuracy of 92.56%, shows a significantly
lower ROC AUC of 0.56. This indicates a limited ability to differentiate effectively between the
classes, possibly due to the model's sensitivity to the dataset's variance or its proneness to overfitting,
which is a common challenge with decision trees. On the other hand, the K Neighbors Classifier
displays a better balance between accuracy (95.61%) and ROC AUC (0.65) than the Decision Tree.
However, it still lags behind the Logistic Regression, Neural Network, and Ada Boost models.
Table 1: Comparing Classification Algorithms
Method Accuracy (%) ROC AUC
Logistic Regression 96.2 0.83
Decision Tree Classifier 92.56 0.56
Neural Network (Not covered in module) 96.2 0.83
K Neighbors Classifier 95.61 0.65
Ada Boost Classifier (Not covered in module) 96.2 0.82
It is important to note that the imbalance of the entire dataset with respect to the stroke variable remains
a challenge for all methods.
In summary, the Neural Network and Ada Boost models emerge as particularly potent for this dataset,
likely due to their ability to model complex nonlinear relationships and leverage ensemble learning
techniques, respectively. These models would be preferable in scenarios where the primary objective
is to maximize predictive performance and where computational resources are not a limiting factor.
Conversely, for applications requiring faster model training and predictions, or where interpretability
is a key concern, Logistic Regression or even K Neighbors Classifier might be more appropriate despite
some sacrifice in ROC AUC.
Task 4: Regression
In this task I decided to apply three different regressors and compare their results. These regressors are:
Random Forest Regression
Support Vector Regression (SVR)
Ridge Regression
In each case, I applied the regressor and computed the machine learning metrics, and finally I compare
them at the end of this section. (Some of these regressors are covered in the module and some are not.)
Random Forest Regression
In figure 41, the implementation of Random Forest Regression in Python is shown.
Figure 41: Random forest Regression Implementation
For regression methods I used two performance metrics.
Mean Squared Error (MSE) is the average of the squares of the errors—that is, the average
squared difference between the estimated values and the actual value.
R-squared is a statistical measure that represents the proportion of the variance for a dependent
variable that's explained by an independent variable or variables in a regression model.
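A minimal sketch combining the Random Forest regressor with these two metrics is shown below. It assumes, as the later figures suggest, that BMI is the regression target and the remaining (encoded) attributes are the features; the actual feature selection is in the coursework figures.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# BMI as the regression target (assumption based on the figures that follow)
X_reg = pd.get_dummies(df.drop(columns=["bmi"]), drop_first=True)
y_reg = df["bmi"]
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(Xr_train, yr_train)
yr_pred = rf_reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred))
print("R-squared:", r2_score(yr_test, yr_pred))
```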
The results of the Random Forest Regression are presented in figure 42. A lower MSE
indicates a better fit of the model to the data. Given that the MSE is 0.7923, without context on the range
and scale of the target variable, it is difficult to determine whether this is high or low.
An R-squared of 0.1279 means that approximately 12.79% of the variance in the dependent variable is
predictable from the independent variables. This value typically ranges from 0 to 1, where 0 indicates that
the model explains none of the variability of the response data around its mean, and 1 indicates that the
model explains all the variability.
Figure 42: Random Forest Regression results
The scatter plot in figure 43 shows the relationship between the actual BMI values and the predicted BMI
values obtained from the Random Forest regression model. The general grouping of points around the line
suggests that the model has a moderate fit to the data. The R-squared value (0.1279) seems low considering
the visual evidence here, which might indicate that while the model tracks the direction of changes in BMI, it
does not explain much of the variability.
Figure 43: Comparing predicted and actual BMI by Random Forest Regression
Support Vector Regression (SVR)
In figure 44, the implementation of Support Vector Regression (SVR) in Python is shown.
Figure 44: Support Vector Regression implementation
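A minimal sketch of the SVR model on the same regression split; the RBF kernel is scikit-learn's default and is assumed here.

```python
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

svr_reg = SVR(kernel="rbf")
svr_reg.fit(Xr_train, yr_train)
yr_pred_svr = svr_reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred_svr))
print("R-squared:", r2_score(yr_test, yr_pred_svr))
```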
The results of the Support Vector Regression are presented in figure 45. A lower MSE
indicates a better fit of the model to the data. Given that the MSE is 0.6718, this model fits the data
better than the previous one.
An R-squared of 0.2604 means that approximately 26.04% of the variance in the dependent variable
is predictable from the independent variables. This value typically ranges from 0 to 1, where 0 indicates
that the model explains none of the variability of the response data around its mean, and 1 indicates
that the model explains all the variability.
Figure 45: Support Vector Regression results
The scatter plot in figure 46 shows the relationship between the actual BMI values and the predicted
BMI values obtained from the Support Vector Regression model. The general grouping of points around
the line suggests that the model has a moderate fit to the data. The R-squared value (0.2604) seems
low considering the visual evidence here, which might indicate that while the model tracks the direction
of changes in BMI, it does not explain much of the variability, although it is clearly better than the
previous method.
Figure 46: Comparing predicted and actual BMI by Support Vector Regression
Ridge Regression
In figure 47, the implementation of Ridge Regression in Python is shown.
Figure 47: Ridge Regression Implementation
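A minimal sketch of the Ridge regressor on the same split; the regularisation strength alpha = 1.0 is the scikit-learn default and an assumption here.

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(Xr_train, yr_train)
yr_pred_ridge = ridge_reg.predict(Xr_test)

print("MSE:", mean_squared_error(yr_test, yr_pred_ridge))
print("R-squared:", r2_score(yr_test, yr_pred_ridge))
```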
The results of the Ridge Regression are presented in figure 48. A lower MSE
indicates a better fit of the model to the data. Given that the MSE is 0.7961, without context on the range
and scale of the target variable, it is difficult to determine whether this is high or low.
An R-squared of 0.1236 means that approximately 12.36% of the variance in the dependent variable
is predictable from the independent variables. This value typically ranges from 0 to 1, where 0 indicates
that the model explains none of the variability of the response data around its mean, and 1 indicates
that the model explains all the variability. So, this method is comparable with the first method.
Figure 48: Ridge Regression results
The scatter plot in figure 49 shows the relationship between the actual BMI values and the predicted
BMI values obtained from the Ridge Regression model. The general grouping of points around the line
suggests that the model has a moderate fit to the data. The R-squared value (0.1236) seems low
considering the visual evidence here, which might indicate that while the model tracks the direction of
changes in BMI, it does not explain much of the variability.
Figure 49: Comparing predicted and actual BMI by Ridge Regression
Comparison of applied Regressors
Based on table 2, Random Forest shows a moderate MSE, which implies an average level of prediction
error. The R-squared value is quite low, indicating that only about 12.78% of the variability in the
dependent variable is explained by the model. This suggests that while Random Forest is robust for
complex datasets with non-linear relationships, it may not be capturing all the nuances in the stroke
dataset effectively. SVR performs better in this comparison, with a lower MSE and a higher R-squared
value. The lower MSE indicates that the predictions are, on average, closer to the actual data points,
and the higher R-squared suggests that about 26.04% of the variability is captured by the model. SVR's
ability to handle non-linear data effectively through kernel tricks might be contributing to its better
performance. Ridge Regression has the highest MSE and a similarly low R-squared value as Random
Forest, indicating a less effective model in terms of both error rate and variability explained. This
model, which includes regularization to handle multicollinearity and reduce model complexity, still
struggles to effectively model the stroke dataset, possibly due to the dataset's characteristics or the
inherent linear nature of Ridge Regression.
Table 2: Comparing Regression Algorithms
Method MSE R-squared
Random Forest Regression 0.7923 0.1278
Support Vector Regression 0.6718 0.2604
Ridge Regression 0.7961 0.1236
Task 5: Clustering
In this task I decided to apply two different clustering algorithms and compare their results. These
algorithms are:
K-means Clustering
DBSCAN Clustering
In each case, I applied the clustering algorithm and computed the machine learning metrics, and finally
I compare them at the end of this section. (Some of these clustering algorithms are covered in the
module and some are not.)
K-means Clustering
In figure 50, the implementation of K-means Clustering in Python is shown.
Figure 50: K-means Clustering Implementation
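A minimal sketch of K-means with three clusters, matching the result described below; the choice of the three scaled numerical attributes as clustering features is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Clustering features assumed to be the scaled numerical attributes
X_clust = df[["age", "avg_glucose_level", "bmi"]]

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_clust)

print(pd.Series(cluster_labels).value_counts())  # size of each cluster
```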
In figure 51, the results of applying K-means clustering to the stroke dataset are shown. As we can see, the
data is partitioned into three clusters, and the result is not accurate. One option is to try different values
of K, and another is to use other algorithms such as hierarchical clustering.
Figure 51: K-means Clustering results
DBSCAN Clustering
In figure 52, the implementation of DBSCAN Clustering in Python is shown.
Figure 52: DBSCAN Clustering Implementation
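A minimal sketch of DBSCAN on the same feature matrix X_clust; eps and min_samples are illustrative values, as the figure's exact settings are not visible.

```python
import pandas as pd
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
db_labels = dbscan.fit_predict(X_clust)

# Label -1 marks noise points; all other labels are cluster indices
print(pd.Series(db_labels).value_counts())
```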
In figure 53, the results of applying DBSCAN clustering to the stroke dataset are shown. As we can see, the
data is assigned to a single cluster, so the result is not accurate.
Figure 53: DBSCAN Clustering results
Comparison of applied Clustering algorithms
Both applied algorithms produced low-quality results, and my suggestion is to try hierarchical clustering.
References
[1] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, visit: 19/11/2024
[2] https://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html, visit: 19/11/2024
[3] https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html, visit: 19/11/2024
[4] https://keras.io/guides/sequential_model/, visit: 19/11/2024
[5] https://scikit-learn.org/1.5/modules/tree.html, visit: 19/11/2024
[6] Rigatti, S. J. (2017). Random forest. Journal of Insurance Medicine, 47(1), 31-39.
[7] Chi, Z. (1995, November). MLP classifiers: overtraining and solutions. In Proceedings of ICNN'95-International
Conference on Neural Networks (Vol. 5, pp. 2821-2824). IEEE.
[8] Cunningham, P., & Delany, S. J. (2021). K-nearest neighbour classifiers-a tutorial. ACM computing surveys (CSUR),
54(6), 1-25.
[9] Song, Y. Y., & Ying, L. U. (2015). Decision tree methods: applications for classification and prediction. Shanghai archives
of psychiatry, 27(2), 130.