0% found this document useful (0 votes)
28 views4 pages

Technical Assignment II

Uploaded by

johnthairu079
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views4 pages

Technical Assignment II

Uploaded by

johnthairu079
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

P100/1625G/21

NG’ANG’A JOHN THAIRU


KARATINA UNIVERSITY
Predictive analytics in Business Intelligence
Technical Assignment II
Predictive Analytics Using Machine Learning

1. Dataset Description

1.1 Overview

The Titanic dataset contains information about passengers aboard the RMS Titanic, which sank
in 1912. The goal is to predict survival (Survived: 0 = No, 1 = Yes) based on passenger
attributes.

1.2 Features

Feature Description
Pclass Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Sex Gender (Male/Female)
Age Age in years
SibSp Number of siblings/spouses aboard
Parch Number of parents/children aboard
Fare Passenger fare
Embarke
Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
d
Cabin Cabin number (many missing)

1.3 Data Preprocessing

 Handled missing values in Age, Embarked, and Fare.


 Engineered new features:
o FamilySize = SibSp + Parch + 1
o IsAlone (1 if no family aboard)
o Title (extracted from names, e.g., Mr, Mrs, Master)
o Has_Cabin (1 if cabin data available)

2. Exploratory Data Analysis (EDA) – Key Findings


2.1 Survival Rate by Features

Feature Survival Rate (%)


Sex
Female 74.2%
Male 18.9%
Pclass
1st Class 63.0%
2nd Class 47.3%
3rd Class 24.2%
Age
Children (<12) 59.0%
Adults (20-40) 38.1%

2.2 Visual Insights

 Women and children had significantly higher survival rates.


 1st-class passengers were more likely to survive.
 Passengers with cabins (likely higher class) had a 66.7% survival rate vs. 30% without.

3. Model Selection & Evaluation

3.1 Models Compared

Accurac
Model Precision Recall F1-Score ROC-AUC
y
Logistic Regression 79.3% 75.0% 70.6% 72.7% 84.3%
Decision Tree 81.0% 76.9% 73.2% 75.0% 80.3%
Random Forest 83.2% 80.0% 76.5% 78.2% 88.1%

3.2 Confusion Matrix (Random Forest)

Predicted Died Predicted Survived


Actual Died 92 14
Actual
19 54
Survived
 Precision (80%): When the model predicts survival, it’s correct 80% of the time.
 Recall (76.5%): The model captures 76.5% of actual survivors.

3.3 Cross-Validation Score

 Mean Accuracy (5-fold CV): 81.5% (±2.1%)


4. Feature Importance Analysis

Top 5 Features (Random Forest)

Importanc
Feature Interpretation
e
Sex_Code 52.0% Women (1) survived more than men (0)
Fare 18.0% Higher fare = higher survival
Age 8.4% Children had better survival
Pclass 4.6% 1st class > 2nd > 3rd
Title_Code 5.7% "Mrs" and "Miss" survived more

5. Final Insights & Recommendations

5.1 Key Takeaways

✔ Women, children, and 1st-class passengers had the highest survival rates.
✔ Random Forest (83.2% accuracy) performed best, followed by Decision Tree (81.0%).
✔ Sex and Fare were the strongest predictors of survival.

5.2 Recommendations for Improvement

Try XGBoost or Neural Networks for potential accuracy gains.


Engineer more features:

 Extract deck levels from Cabin (e.g., A, B, C).


 Create interaction terms (e.g., Sex × Pclass).
🔹 Address class imbalance (if any) using SMOTE.

5.3 Business Impact

 Safety protocols: Prioritize women, children, and high-paying passengers in emergencies.


 Historical analysis: Understand socio-economic biases in survival

Appendix

Code & Data Availability

 Kaggle Dataset:

You might also like