INTRODUCTION
This project presents a detailed data-driven study aiming to uncover the key contributors to
stroke occurrences among a diverse population of 40,852 individuals. Using a binary
classification setup, the core objective is to explore how various personal, medical, and
lifestyle-related factors influence the likelihood of experiencing a stroke.
The study employs IBM SPSS Modeler as the analytical platform and applies two advanced
predictive methods: Binary Logistic Regression and the CHAID (Chi-squared Automatic
Interaction Detection) algorithm. Both models are thoroughly evaluated based on
performance metrics and validation procedures to determine which delivers better predictive
accuracy and clarity. The final model is selected based on test data performance, and the most
impactful stroke predictors are highlighted and interpreted within a practical, real-world
context.
The available variables are listed below:
1) ID: unique identifier
2) sex: 1 - "Male", 0 - "Female"
3) age: age of the patient (years)
4) hypertension: 0 - "the patient doesn't have hypertension", 1 - "the patient has hypertension"
5) heart_disease: 0 - "the patient doesn't have any heart diseases", 1 - "the patient has a heart disease"
6) ever_married: 0 - "No" or 1 - "Yes"
7) work_type: 1 - "children", 2 - "Govt_job", 3 - "Self-employed", 4 - "Private", 0 - "Never_worked"
8) Residence_type: 0 - "Rural" or 1 - "Urban"
9) avg_glucose_level: average glucose level in blood
10) BMI: body mass index
11) smoking_status: 0 - "non-smoker", 1 - "smoker"
12) stroke: 1 - "the patient had a stroke", 0 - "the patient did not have a stroke"
Data Upload
The dataset titled Stroke_Data.csv was successfully imported into IBM SPSS Modeler using the Var.
File node, an appropriate choice for handling comma-separated values (CSV) files.
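Outside Modeler, the same import step can be sketched with Python's standard csv module; the two records below are made-up stand-ins for rows of Stroke_Data.csv, following the variable list above.

```python
import csv
import io

# A small in-memory sample standing in for Stroke_Data.csv
# (column names follow the variable list above; values are illustrative).
sample = io.StringIO(
    "ID,sex,age,hypertension,heart_disease,avg_glucose_level,BMI,smoking_status,stroke\n"
    "1,1,67,0,1,228.69,36.6,1,1\n"
    "2,0,61,0,0,202.21,28.9,0,1\n"
)

rows = list(csv.DictReader(sample))
print(len(rows))          # number of records parsed
print(rows[0]["age"])     # fields arrive as strings; typing happens later
```

This mirrors what the Var. File node does on import: read delimited records and expose named fields, with measurement levels assigned afterwards.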
DATA UNDERSTANDING
At the outset of our analysis, a thorough review of the dataset's structure and integrity was undertaken
to ensure its suitability for predictive modeling. Given the presence of a variety of variable types, the
Type node from the Field Operations palette in SPSS Modeler was employed to assign correct
measurement levels (e.g., nominal, continuous) and to apply descriptive value labels. This critical
preprocessing step established a coherent data structure, enabling standardized manipulation and
enhancing the interpretability of subsequent analytical results. The finalized variable configurations
are presented in Figure 1.
Figure 1: Illustration of Variable Type
Furthermore, a critical aspect of the preliminary data assessment involves evaluating the dataset for
potential data quality issues, including missing values, irregularities, and outliers that could affect the
robustness of the analysis. To facilitate this process, the Data Audit node in IBM SPSS Modeler was
employed. This node offers comprehensive descriptive summaries and diagnostic visualizations,
allowing for a systematic examination of variable distributions and data completeness. The outcomes
of this audit are presented graphically in Figure 2, providing a foundational understanding of the
dataset’s integrity prior to model development.
Figure 2: Implementation of Data Audit
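The kind of summary the Data Audit node produces (missing counts, minima, maxima, means) can be sketched outside Modeler in a few lines of Python; the BMI values below are illustrative, not taken from the dataset.

```python
import statistics

# Illustrative BMI column with a missing entry (None) and an extreme value,
# mimicking the kinds of issues a data audit surfaces.
bmi = [28.9, 36.6, None, 24.0, 97.6, 31.2]

present = [v for v in bmi if v is not None]
audit = {
    "n_missing": bmi.count(None),
    "min": min(present),
    "max": max(present),
    "mean": round(statistics.mean(present), 2),
}
print(audit)
```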
Following the data audit, it was observed that the dataset contains a limited number of missing values.
A detailed review of the variable distributions further identified data points that fall substantially
outside the expected range, indicating the presence of outliers, particularly within the BMI variable, as
well as missing entries in the sex variable. These data quality concerns are depicted in Figure 3.
Figure 3: Missing Values and Outliers
To address the missing data, the most frequent category (mode) was used to impute the missing
values in the sex variable. For the extreme values in BMI, a null assignment was applied where
necessary, while outliers were addressed through coercion techniques, ensuring that the dataset
remained analytically robust for predictive modeling. Figure 4 shows the data audit after this process.
Figure 4: Data Audit after handling missing values, extremes and outliers.
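A minimal Python sketch of the two cleaning steps described above, on made-up values: mode imputation for the sex variable, and coercion of extreme BMI values back to a cap (the cap of 60 here is an assumption for illustration, not a value taken from the analysis).

```python
import statistics

# Hypothetical columns mirroring the issues found in the audit.
sex = [1, 0, None, 1, 1, 0]           # missing entries in sex
bmi = [28.9, 36.6, 24.0, 97.6, 31.2]  # 97.6 is an implausible extreme

# Mode imputation for the categorical sex variable.
mode_sex = statistics.mode([v for v in sex if v is not None])
sex_filled = [mode_sex if v is None else v for v in sex]

# Coercion: pull extreme BMI values back to a chosen cap.
CAP = 60.0  # assumed threshold for illustration
bmi_coerced = [min(v, CAP) for v in bmi]

print(sex_filled)
print(bmi_coerced)
```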
DATA PREPARATION
A structured sequence of preprocessing steps was undertaken to ensure the dataset’s accuracy,
consistency, and readiness for robust modeling. To mitigate the potential effects of multicollinearity
and maintain the validity of the predictor variables, a methodical approach was employed. Central to
this process was the use of the Partition and Statistics nodes, which facilitated the division of
data for model training and testing, while also supporting the evaluation of variable relationships and
distributions. These steps were pivotal in shaping the analytical trajectory of the modeling phase.
Figure 5: Descriptive Statistics of Continuous Variables
As shown in Figure 5, descriptive statistics for the continuous variables—Age, Average Glucose
Level, and Body Mass Index (BMI)—were generated using the Statistics node from the Output palette
in IBM SPSS Modeler. The analysis revealed that the mean age of individuals in the dataset is
approximately 52 years, the average glucose level is around 122 mg/dL, and the mean BMI is 30.41,
indicating a population with generally elevated body mass. These summary statistics provide a
foundational understanding of the central tendencies within the data and are essential for informing
subsequent modeling steps.
Figure 6: Correlation Values of Continuous variables
The correlation matrix presented in Figure 6 indicates that there are no strong linear associations
among the continuous variables under consideration. The observed correlation coefficients fall within
a range that suggests only weak relationships, thereby confirming the absence of significant
multicollinearity. This finding supports the suitability of these variables for inclusion in subsequent
predictive modeling without the risk of redundancy or inflated variance.
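For reference, the Pearson coefficient behind a matrix like the one in Figure 6 can be computed directly; the helper below is a plain-Python sketch, and the age/BMI values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative (made-up) age and BMI values.
age = [25, 40, 52, 61, 70, 33]
bmi = [24.1, 31.0, 29.5, 27.8, 30.2, 26.4]
print(round(pearson(age, bmi), 3))
```

A coefficient near 0 indicates a weak linear association; values near ±1 would signal the multicollinearity the analysis checks for.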
A feature selection process was undertaken to isolate the most influential predictors contributing to
the classification of stroke outcomes. While the preliminary analysis, illustrated in Figure 7,
indicated that all variables exhibited relatively high importance scores, the objective was to enhance
model efficiency by retaining only the most impactful features. This selective approach aims to
simplify the model structure, reduce dimensionality, and improve interpretability, all while
maintaining robust predictive performance.
Figure 7: Feature Selection Rank of the Variables
Drawing from the feature importance analysis and the theoretical relevance of predictors to stroke
incidence outcomes, a subset of six variables was selected for focused examination.
These include:
Age, given its established correlation with stroke incidence;
Hypertension, a recognized clinical risk factor;
Heart Disease, which exhibits a strong association with cerebrovascular events;
Average Glucose Level, indicative of potential metabolic disorders such as diabetes;
Body Mass Index (BMI), reflecting both underweight and obesity-related risks; and
Smoking Status, a well-documented behavioral risk factor.
These variables were selected not only for their statistical significance but also for their substantive
relevance in evaluating health-related risk factors for stroke. They were subsequently incorporated
into the modeling phase to assess their predictive influence on the binary classification of stroke.
Figure 8 shows a preview of the selected variables.
Figure 8: Overview of selected variables.
MAIN ANALYSIS
Two distinct predictive modeling techniques were employed: Binary Logistic Regression and a
Decision Tree utilizing the CHAID algorithm. Both models were seamlessly integrated into the
production flow within IBM SPSS Modeler.
We commenced with the Binary Logistic Regression model, which is ideally suited for binary
classification tasks. In this context, the model was used to predict the likelihood of a stroke outcome,
categorizing patients into two mutually exclusive classes: "the patient had a stroke" and "the patient
did not have a stroke".
The model equation is given as follows:

ln(P / (1 − P)) = constant + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + β₆X₆
Let P represent the probability of stroke incidence.
The predictors are:
- Average Glucose Level (X₁) – Higher glucose levels, indicating potential diabetes, increase stroke
risk (β₁ > 0).
- BMI (X₂) – Both obesity and being underweight are risk factors, potentially raising stroke
probability (β₂ > 0).
- Hypertension (X₃) – Presence of hypertension significantly increases stroke risk (β₃ > 0).
- Heart Disease (X₄) – A history of heart disease is strongly correlated with a higher probability of
stroke (β₄ > 0).
- Smoking Status (X₅) – Smoking is a well-established risk factor for stroke, increasing its
likelihood (β₅ > 0).
- Age (X₆) – Older age is positively associated with a higher likelihood of stroke (β₆ > 0).
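As a sketch of how the fitted equation turns predictor values into a probability, the snippet below plugs the two continuous coefficients reported in the model output (+0.008 for glucose, -0.017 for BMI) into the logistic function; the intercept is not reproduced in the text, so the constant here is an assumed placeholder, and the categorical terms are omitted for brevity.

```python
import math

def stroke_probability(glucose, bmi, constant=-4.0):
    """Logistic model P = 1 / (1 + exp(-(constant + B1*x1 + B2*x2))).

    Only the two continuous predictors are shown. The coefficients
    (+0.008 for glucose, -0.017 for BMI) come from the fitted model;
    the constant is an assumed placeholder, since the intercept is
    not reproduced in the text.
    """
    z = constant + 0.008 * glucose - 0.017 * bmi
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate at the sample means reported earlier (glucose ~122, BMI ~30.41).
p = stroke_probability(glucose=122.0, bmi=30.41)
print(round(p, 4))
```

With a positive glucose coefficient, raising glucose while holding BMI fixed raises the predicted probability, matching the interpretation of β₁ > 0 above.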
Figure 9: Parameter estimates from the Binary Logistic Regression
Figure 9 presents the parameter estimates from a binary logistic regression model.
Model Interpretation
Overall Model Interpretation
Reference category: "The patient did not have a stroke" (baseline).
All predictors are statistically significant (Sig. < 0.001), meaning they contribute
meaningfully to predicting stroke.
Exp(B) = Odds Ratio (OR): Indicates how much the odds of stroke increase (OR > 1) or
decrease (OR < 1) per unit change in the predictor.
Table 1: Continuous Variables

Variable            B (Coefficient)   Exp(B) (Odds Ratio)   Interpretation
avg_glucose_level   +0.008            1.008                 For every 1-unit increase in glucose, stroke odds increase by 0.8%.
BMI                 -0.017            0.983                 For every 1-unit increase in BMI, stroke odds decrease by 1.7% (counterintuitive; possible confounding or data issue).
Table 2: Categorical Variables (Dummy-Coded)

Variable         Category                   Exp(B) (OR)   Interpretation
Hypertension     No (Ref: Yes)              0.305         Patients without hypertension have 69.5% lower odds of stroke than hypertensive patients.
Heart Disease    No (Ref: Yes)              0.302         Patients without heart disease have 69.8% lower odds of stroke than those with heart disease.
Smoking Status   Non-smoker (Ref: Smoker)   0.817         Non-smokers have 18.3% lower odds of stroke than smokers.
Age Group        Young (Reference)          1.0           Baseline group.
Age Group        Middle                     0.764         Middle-aged patients have 23.6% lower odds than young patients (unexpected; requires validation).
Age Group        Old                        1.161         Older patients have 16.1% higher odds than young patients.
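The odds-ratio readings above follow from two lines of arithmetic: Exp(B) = e^B, and the percent change in odds implied by an odds ratio is (Exp(B) - 1) * 100. A quick check against two of the reported values:

```python
import math

# Odds ratio from a coefficient: Exp(B) = e^B.
b_glucose = 0.008
or_glucose = math.exp(b_glucose)
print(round(or_glucose, 3))             # 1.008: odds up ~0.8% per unit of glucose

# Percent change in odds implied by an odds ratio below 1,
# e.g. Exp(B) = 0.305 for "no hypertension" vs. "hypertension".
or_no_htn = 0.305
print(round((1 - or_no_htn) * 100, 1))  # 69.5% lower odds
```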
Figure 10: Predictor Importance from the CHAID Algorithm
The Predictor Importance graph above illustrates the relative importance of each variable in predicting
the likelihood of a stroke.
1. Most Important Predictor:
- Average Glucose Level is by far the most important predictor, with a value close to 1. This
indicates a strong association between higher glucose levels and the likelihood of having a stroke.
2. Moderately Important Predictors:
- BMI and Hypertension both show moderate importance in predicting stroke risk, suggesting they
have some influence but are less critical than glucose levels.
3. Less Important Predictors:
- Smoking Status, Heart Disease, and Age Group have lower importance in predicting stroke.
Among these, Age Group shows the least impact on the prediction model, indicating that other factors
(like glucose levels and hypertension) may be more influential in predicting stroke risk.
In summary, the analysis strongly suggests that glucose levels are the primary driver of stroke risk,
while other factors, although relevant, contribute less significantly to the prediction model.
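CHAID chooses and merges splits using Pearson's chi-squared test of independence between each predictor and the target, which is what drives the importance ranking above. A minimal sketch of that statistic, on made-up glucose-by-stroke counts:

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table,
    the test CHAID uses to choose and merge split categories."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: glucose level group vs. stroke outcome.
observed = [[20, 10],   # high glucose:   20 stroke, 10 no stroke
            [10, 20]]   # normal glucose: 10 stroke, 20 no stroke

print(round(chi_squared(observed), 2))
```

A larger statistic (relative to its degrees of freedom) means a stronger predictor-target association, so that predictor is split on earlier in the tree.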
Model Results
Figure 11: Binary Logistic ROC
The curve appears to rise sharply initially, indicating strong predictive power at lower thresholds. The
close alignment between training and testing results suggests good generalization.
Figure 12: Binary Logistic Confusion Matrix
The confusion matrix shows that in Partition 1, which represents the training data, the model correctly
predicted the stroke outcome in 19,400 out of 28,494 cases, resulting in an overall accuracy of
approximately 68.08%. Conversely, it made incorrect predictions in 9,094 instances, accounting for
31.92% of the training data. Similarly, in Partition 2, representing the test dataset, the model achieved
a correct classification rate of 67.93%, accurately predicting 8,377 out of 12,332 cases. The
proportion of incorrect predictions in the test set stood at 32.07%.
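The accuracy figures above are straightforward to verify from the confusion-matrix counts:

```python
def accuracy(correct, total):
    """Overall accuracy as a percentage, from confusion-matrix counts."""
    return round(100 * correct / total, 2)

# Counts quoted above for the Binary Logistic Regression model.
print(accuracy(19400, 28494))  # training partition: 68.08
print(accuracy(8377, 12332))   # test partition: 67.93
```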
Figure 13: CHAID Algorithm ROC
A steep initial rise in the curve indicates high sensitivity at low false positive rates, which is crucial
for risk models where catching true stroke cases is prioritized. The closeness of the training and test
lines suggests good generalization.
Figure 14: CHAID Algorithm Confusion Matrix
The confusion matrix shows that in Partition 1, which represents the training data, the model correctly
predicted the stroke outcome in 20,296 out of 28,494 cases, resulting in an overall accuracy of
approximately 71.23%. Conversely, it made incorrect predictions in 8,198 instances, accounting for
28.77% of the training data. Similarly, in Partition 2, representing the test dataset, the model achieved
a correct classification rate of 70.75%, accurately predicting 8,725 out of 12,332 cases. The
proportion of incorrect predictions in the test set stood at 29.25%.
Model Comparison
Model Comparison: Binary Logistic Regression vs. CHAID

Metric                  Binary Logistic Regression   CHAID
Training Accuracy       68.08% (19,400/28,494)       71.23% (20,296/28,494)
Test Accuracy           67.93% (8,377/12,332)        70.75% (8,725/12,332)
Error Rate (Training)   31.92% (9,094/28,494)        28.77% (8,198/28,494)
Error Rate (Test)       32.07% (3,955/12,332)        29.25% (3,607/12,332)
CHAID outperforms Logistic Regression in accuracy (~3% higher in both training and test sets).
CONCLUSION
After evaluating both Binary Logistic Regression and CHAID, the CHAID model is recommended
for stroke prediction due to its higher accuracy (71.23% training, 70.75% test) and ability to capture
non-linear relationships, such as the critical role of glucose levels and hypertension. While logistic
regression provides interpretable odds ratios, its lower accuracy (68.08% training, 67.93% test) and
counterintuitive findings (e.g., BMI’s negative effect) limit its reliability. CHAID’s tree-based
structure also offers actionable decision rules for clinical use, making it the superior choice for real-
world risk stratification.