INTRODUCTION
This project presents a detailed data-driven study aiming to uncover the key contributors to
stroke occurrences among a diverse population of 40,852 individuals. Using a binary
classification setup, the core objective is to explore how various personal, medical, and
lifestyle-related factors influence the likelihood of experiencing a stroke.
The study employs IBM SPSS Modeler as the analytical platform and applies two advanced
predictive methods: Binary Logistic Regression and the CHAID (Chi-squared Automatic
Interaction Detection) algorithm. Both models are thoroughly evaluated based on
performance metrics and validation procedures to determine which delivers better predictive
accuracy and clarity. The final model is selected based on test data performance, and the most
impactful stroke predictors are highlighted and interpreted within a practical, real-world
context.
The available variables are listed below:
1) ID: unique identifier
2) sex: 1 - "Male", 0 - "Female"
3) age: age of the patient (years)
4) hypertension: 0 - "the patient doesn't have hypertension", 1 - "the patient has hypertension"
5) heart_disease: 0 - "the patient doesn't have any heart diseases", 1 - "the patient has a heart disease"
6) ever_married: 0 - "No" or 1 - "Yes"
7) work_type: 1 - "children", 2 - "Govt_job", 3 - "Self-employed", 4 - "Private", 0 - "Never_worked"
8) Residence_type: 0 - "Rural" or 1 - "Urban"
9) avg_glucose_level: average glucose level in blood
10) BMI: body mass index
11) smoking_status: 0 - "non-smoker", 1 - "smoker"
12) stroke: 1 - "the patient had a stroke", 0 - "the patient did not have a stroke"
Data Upload
The dataset titled Stroke_Data.csv was successfully imported into IBM SPSS Modeler using the Var.
File node, an appropriate choice for handling comma-separated values (CSV) files.
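Outside Modeler, the same import step can be sketched with Python's standard csv module; the two records below are made-up stand-ins for rows of Stroke_Data.csv, following the variable list above.

```python
import csv
import io

# A small in-memory sample standing in for Stroke_Data.csv
# (column names follow the variable list above; values are illustrative).
sample = io.StringIO(
    "ID,sex,age,hypertension,heart_disease,avg_glucose_level,BMI,smoking_status,stroke\n"
    "1,1,67,0,1,228.69,36.6,1,1\n"
    "2,0,61,0,0,202.21,28.9,0,1\n"
)

rows = list(csv.DictReader(sample))
print(len(rows))          # number of records parsed
print(rows[0]["age"])     # fields arrive as strings; typing happens later
```

This mirrors what the Var. File node does on import: read delimited records and expose named fields, with measurement levels assigned afterwards.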
DATA UNDERSTANDING
At the outset of our analysis, a thorough review of the dataset's structure and integrity was undertaken
to ensure its suitability for predictive modeling. Given the presence of a variety of variable types, the
Type node from the Field Operations palette in SPSS Modeler was employed to assign correct
measurement levels (e.g., nominal, continuous) and to apply descriptive value labels. This critical
preprocessing step established a coherent data structure, enabling standardized manipulation and
enhancing the interpretability of subsequent analytical results. The finalized variable configurations
are presented in Figure 1.
Figure 1: Illustration of Variable Type
Furthermore, a critical aspect of the preliminary data assessment involves evaluating the dataset for
potential data quality issues, including missing values, irregularities, and outliers that could affect the
robustness of the analysis. To facilitate this process, the Data Audit node in IBM SPSS Modeler was
employed. This node offers comprehensive descriptive summaries and diagnostic visualizations,
allowing for a systematic examination of variable distributions and data completeness. The outcomes
of this audit are presented graphically in Figure 2, providing a foundational understanding of the
dataset’s integrity prior to model development.
Figure 2: Implementation of Data Audit
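The kind of summary the Data Audit node produces (missing counts, minima, maxima, means) can be sketched outside Modeler in a few lines of Python; the BMI values below are illustrative, not taken from the dataset.

```python
import statistics

# Illustrative BMI column with a missing entry (None) and an extreme value,
# mimicking the kinds of issues a data audit surfaces.
bmi = [28.9, 36.6, None, 24.0, 97.6, 31.2]

present = [v for v in bmi if v is not None]
audit = {
    "n_missing": bmi.count(None),
    "min": min(present),
    "max": max(present),
    "mean": round(statistics.mean(present), 2),
}
print(audit)
```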
Following the data audit, it was observed that the dataset contains a limited number of missing values.
A detailed review of the variable distributions further identified data points that fall substantially
outside the expected range, indicating the presence of outliers, particularly within the BMI variable, as
well as missing entries in the sex variable. These data quality concerns are depicted in Figure 3.
Figure 3: Missing Values and Outliers
To address the missing data, the most frequent category (mode) was used to impute the missing
values in the sex variable. For the extreme values in BMI, a null assignment was applied where
necessary, while outliers were addressed through coercion techniques, ensuring that the dataset
remained analytically robust for predictive modeling. Figure 4 shows the data audit after this process.
Figure 4: Data Audit after handling missing values, extremes and outliers.
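A minimal Python sketch of the two cleaning steps described above, on made-up values: mode imputation for the sex variable, and coercion of extreme BMI values back to a cap (the cap of 60 here is an assumption for illustration, not a value taken from the analysis).

```python
import statistics

# Hypothetical columns mirroring the issues found in the audit.
sex = [1, 0, None, 1, 1, 0]           # missing entries in sex
bmi = [28.9, 36.6, 24.0, 97.6, 31.2]  # 97.6 is an implausible extreme

# Mode imputation for the categorical sex variable.
mode_sex = statistics.mode([v for v in sex if v is not None])
sex_filled = [mode_sex if v is None else v for v in sex]

# Coercion: pull extreme BMI values back to a chosen cap.
CAP = 60.0  # assumed threshold for illustration
bmi_coerced = [min(v, CAP) for v in bmi]

print(sex_filled)
print(bmi_coerced)
```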
DATA PREPARATION
A structured sequence of preprocessing steps was undertaken to ensure the dataset’s accuracy,
consistency, and readiness for robust modeling. To mitigate the potential effects of multicollinearity
and maintain the validity of the predictor variables, a methodical approach was employed. Central to
this process was the use of the Partition and Statistics nodes, which facilitated the division of
data for model training and testing, while also supporting the evaluation of variable relationships and
distributions. These steps were pivotal in shaping the analytical trajectory of the modeling phase.
Figure 5: Descriptive Statistics of Continuous Variables
As shown in Figure 5, descriptive statistics for the continuous variables—Age, Average Glucose
Level, and Body Mass Index (BMI)—were generated using the Statistics node from the Output palette
in IBM SPSS Modeler. The analysis revealed that the mean age of individuals in the dataset is
approximately 52 years, the average glucose level is around 122 mg/dL, and the mean BMI is 30.41,
indicating a population with generally elevated body mass. These summary statistics provide a
foundational understanding of the central tendencies within the data and are essential for informing
subsequent modeling steps.
Figure 6: Correlation Values of Continuous variables
The correlation matrix presented in Figure 6 indicates that there are no strong linear associations
among the continuous variables under consideration. The observed correlation coefficients fall within
a range that suggests only weak relationships, thereby confirming the absence of significant
multicollinearity. This finding supports the suitability of these variables for inclusion in subsequent
predictive modeling without the risk of redundancy or inflated variance.
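For reference, the Pearson coefficient behind a matrix like the one in Figure 6 can be computed directly; the helper below is a plain-Python sketch, and the age/BMI values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative (made-up) age and BMI values.
age = [25, 40, 52, 61, 70, 33]
bmi = [24.1, 31.0, 29.5, 27.8, 30.2, 26.4]
print(round(pearson(age, bmi), 3))
```

A coefficient near 0 indicates a weak linear association; values near ±1 would signal the multicollinearity the analysis checks for.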
A feature selection process was undertaken to isolate the most influential predictors contributing to
the classification of stroke outcomes. While the preliminary analysis, illustrated in Figure 7,
indicated that all variables exhibited relatively high importance scores, the objective was to enhance
model efficiency by retaining only the most impactful features. This selective approach aims to
simplify the model structure, reduce dimensionality, and improve interpretability, all while
maintaining robust predictive performance.
Figure 7: Feature Selection Rank of the Variables
Drawing from the feature importance analysis and the theoretical relevance of predictors to stroke
incidence outcomes, a subset of six variables was selected for focused examination.
These include:
Age, given its established correlation with stroke incidence;
Hypertension, a recognized clinical risk factor;
Heart Disease, which exhibits a strong association with cerebrovascular events;
Average Glucose Level, indicative of potential metabolic disorders such as diabetes;
Body Mass Index (BMI), reflecting both underweight and obesity-related risks; and
Smoking Status, a well-documented behavioral risk factor.
These variables were selected not only for their statistical significance but also for their substantive
relevance in evaluating health-related risk factors for stroke. They were subsequently incorporated
into the modeling phase to assess their predictive influence on the binary classification of stroke.
Figure 8 shows a preview of the selected variables.
Figure 8: Overview of selected variables.
MAIN ANALYSIS
Two distinct predictive modeling techniques were employed: Binary Logistic Regression and a
Decision Tree utilizing the CHAID algorithm. Both models were seamlessly integrated into the
production flow within IBM SPSS Modeler.
We commenced with the Binary Logistic Regression model, which is ideally suited for binary
classification tasks. In this context, the model was used to predict the likelihood of a stroke outcome,
categorizing patients into two mutually exclusive classes: "the patient had a stroke" and "the patient
did not have a stroke".
The model equation is given as follows:

ln(P / (1 − P)) = constant + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅ + β₆X₆
Let P represent the probability of stroke incidence.
The predictors are:
- Average Glucose Level (X₁) – Higher glucose levels, indicating potential diabetes, increase stroke
risk (β₁ > 0).
- BMI (X₂) – Both obesity and being underweight are risk factors, potentially raising stroke
probability (β₂ > 0).
- Hypertension (X₃) – Presence of hypertension significantly increases stroke risk (β₃ > 0).
- Heart Disease (X₄) – A history of heart disease is strongly correlated with a higher probability of
stroke (β₄ > 0).
- Smoking Status (X₅) – Smoking is a well-established risk factor for stroke, increasing its
likelihood (β₅ > 0).
- Age (X₆) – Older age is positively associated with a higher likelihood of stroke (β₆ > 0).
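As a sketch of how the fitted equation turns predictor values into a probability, the snippet below plugs the two continuous coefficients reported in the model output (+0.008 for glucose, -0.017 for BMI) into the logistic function; the intercept is not reproduced in the text, so the constant here is an assumed placeholder, and the categorical terms are omitted for brevity.

```python
import math

def stroke_probability(glucose, bmi, constant=-4.0):
    """Logistic model P = 1 / (1 + exp(-(constant + B1*x1 + B2*x2))).

    Only the two continuous predictors are shown. The coefficients
    (+0.008 for glucose, -0.017 for BMI) come from the fitted model;
    the constant is an assumed placeholder, since the intercept is
    not reproduced in the text.
    """
    z = constant + 0.008 * glucose - 0.017 * bmi
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate at the sample means reported earlier (glucose ~122, BMI ~30.41).
p = stroke_probability(glucose=122.0, bmi=30.41)
print(round(p, 4))
```

With a positive glucose coefficient, raising glucose while holding BMI fixed raises the predicted probability, matching the interpretation of β₁ > 0 above.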
Figure 9: Parameter estimates from the Binary Logistic Regression
Figure 9 presents the parameter estimates from a binary logistic regression model.
Model Interpretation
Overall Model Interpretation
Reference category: "The patient did not have a stroke" (baseline).
All predictors are statistically significant (Sig. < 0.001), meaning they contribute
meaningfully to predicting stroke.
Exp(B) = Odds Ratio (OR): Indicates how much the odds of stroke increase (OR > 1) or
decrease (OR < 1) per unit change in the predictor.
Table 1: Continuous Variables

Variable            B (Coefficient)   Exp(B) (Odds Ratio)   Interpretation
avg_glucose_level   +0.008            1.008                 For every 1-unit increase in glucose, stroke odds increase by 0.8%.
BMI                 -0.017            0.983                 For every 1-unit increase in BMI, stroke odds decrease by 1.7% (counterintuitive; possible confounding or data issue).
Table 2: Categorical Variables (Dummy-Coded)

Variable         Category                   Exp(B) (OR)   Interpretation
Hypertension     No (Ref: Yes)              0.305         Patients without hypertension have 69.5% lower odds of stroke than hypertensive patients.
Heart Disease    No (Ref: Yes)              0.302         Patients without heart disease have 69.8% lower odds of stroke than those with heart disease.
Smoking Status   Non-smoker (Ref: Smoker)   0.817         Non-smokers have 18.3% lower odds of stroke than smokers.
Age Group        Young (Reference)          1.0           Baseline group.
Age Group        Middle                     0.764         Middle-aged patients have 23.6% lower odds than young patients (unexpected; requires validation).
Age Group        Old                        1.161         Older patients have 16.1% higher odds than young patients.
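The odds-ratio readings above follow from two lines of arithmetic: Exp(B) = e^B, and the percent change in odds implied by an odds ratio is (Exp(B) - 1) * 100. A quick check against two of the reported values:

```python
import math

# Odds ratio from a coefficient: Exp(B) = e^B.
b_glucose = 0.008
or_glucose = math.exp(b_glucose)
print(round(or_glucose, 3))             # 1.008: odds up ~0.8% per unit of glucose

# Percent change in odds implied by an odds ratio below 1,
# e.g. Exp(B) = 0.305 for "no hypertension" vs. "hypertension".
or_no_htn = 0.305
print(round((1 - or_no_htn) * 100, 1))  # 69.5% lower odds
```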
Figure 10: Predictor Importance from the CHAID Algorithm
The Predictor Importance graph above illustrates the relative importance of each variable in predicting
the likelihood of a stroke.
1. Most Important Predictor:
- Average Glucose Level is by far the most important predictor, with a value close to 1. This
indicates a strong association between higher glucose levels and the likelihood of having a stroke.
2. Moderately Important Predictors:
- BMI and Hypertension both show moderate importance in predicting stroke risk, suggesting they
have some influence but are less critical than glucose levels.
3. Less Important Predictors:
- Smoking Status, Heart Disease, and Age Group have lower importance in predicting stroke.
Among these, Age Group shows the least impact on the prediction model, indicating that other factors
(like glucose levels and hypertension) may be more influential in predicting stroke risk.
In summary, the analysis strongly suggests that glucose levels are the primary driver of stroke risk,
while other factors, although relevant, contribute less significantly to the prediction model.
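CHAID chooses and merges splits using Pearson's chi-squared test of independence between each predictor and the target, which is what drives the importance ranking above. A minimal sketch of that statistic, on made-up glucose-by-stroke counts:

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table,
    the test CHAID uses to choose and merge split categories."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: glucose level group vs. stroke outcome.
observed = [[20, 10],   # high glucose:   20 stroke, 10 no stroke
            [10, 20]]   # normal glucose: 10 stroke, 20 no stroke

print(round(chi_squared(observed), 2))
```

A larger statistic (relative to its degrees of freedom) means a stronger predictor-target association, so that predictor is split on earlier in the tree.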
Model Results
Figure 11: Binary Logistic ROC
The curve appears to rise sharply initially, indicating strong predictive power at lower thresholds. The
close alignment between training and testing results suggests good generalization.
Figure 12: Binary Logistic Confusion Matrix
The confusion matrix shows that in Partition 1, which represents the training data, the model correctly
predicted the stroke outcome in 19,400 out of 28,494 cases, resulting in an overall accuracy of
approximately 68.08%. Conversely, it made incorrect predictions in 9,094 instances, accounting for
31.92% of the training data. Similarly, in Partition 2, representing the test dataset, the model achieved
a correct classification rate of 67.93%, accurately predicting 8,377 out of 12,332 cases. The
proportion of incorrect predictions in the test set stood at 32.07%.
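The accuracy figures above are straightforward to verify from the confusion-matrix counts:

```python
def accuracy(correct, total):
    """Overall accuracy as a percentage, from confusion-matrix counts."""
    return round(100 * correct / total, 2)

# Counts quoted above for the Binary Logistic Regression model.
print(accuracy(19400, 28494))  # training partition: 68.08
print(accuracy(8377, 12332))   # test partition: 67.93
```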
Figure 13: CHAID Algorithm ROC
A steep initial rise in the curve indicates high sensitivity at low false positive rates, which is crucial
for risk models where catching true stroke cases is prioritized. The closeness of the training and test
lines suggests good generalization.
Figure 14: CHAID Algorithm Confusion Matrix
The confusion matrix shows that in Partition 1, which represents the training data, the model correctly
predicted the stroke outcome in 20,296 out of 28,494 cases, resulting in an overall accuracy of
approximately 71.23%. Conversely, it made incorrect predictions in 8,198 instances, accounting for
28.77% of the training data. Similarly, in Partition 2, representing the test dataset, the model achieved
a correct classification rate of 70.75%, accurately predicting 8,725 out of 12,332 cases. The
proportion of incorrect predictions in the test set stood at 29.25%.
Model Comparison
Model Comparison: Binary Logistic Regression vs. CHAID

Metric                  Binary Logistic Regression   CHAID
Training Accuracy       68.08% (19,400/28,494)       71.23% (20,296/28,494)
Test Accuracy           67.93% (8,377/12,332)        70.75% (8,725/12,332)
Error Rate (Training)   31.92% (9,094/28,494)        28.77% (8,198/28,494)
Error Rate (Test)       32.07% (3,955/12,332)        29.25% (3,607/12,332)
CHAID outperforms Logistic Regression in accuracy (~3% higher in both training and test sets).
CONCLUSION
After evaluating both Binary Logistic Regression and CHAID, the CHAID model is recommended
for stroke prediction due to its higher accuracy (71.23% training, 70.75% test) and ability to capture
non-linear relationships, such as the critical role of glucose levels and hypertension. While logistic
regression provides interpretable odds ratios, its lower accuracy (68.08% training, 67.93% test) and
counterintuitive findings (e.g., BMI’s negative effect) limit its reliability. CHAID’s tree-based
structure also offers actionable decision rules for clinical use, making it the superior choice for real-
world risk stratification.