P100/1625G/21
NG’ANG’A JOHN THAIRU
KARATINA UNIVERSITY
Predictive Analytics in Business Intelligence
Technical Assignment II
Predictive Analytics Using Machine Learning
1. Dataset Description
1.1 Overview
The Titanic dataset contains information about passengers aboard the RMS Titanic, which sank
in 1912. The goal is to predict survival (Survived: 0 = No, 1 = Yes) based on passenger
attributes.
1.2 Features
Feature    Description
Pclass     Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Sex        Gender (Male/Female)
Age        Age in years
SibSp      Number of siblings/spouses aboard
Parch      Number of parents/children aboard
Fare       Passenger fare
Embarked   Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Cabin      Cabin number (many missing)
1.3 Data Preprocessing
Handled missing values in Age, Embarked, and Fare.
Engineered new features:
o FamilySize = SibSp + Parch + 1
o IsAlone (1 if no family aboard)
o Title (extracted from names, e.g., Mr, Mrs, Master)
o Has_Cabin (1 if cabin data available)
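The engineered features listed above can be sketched in plain Python. This is a minimal illustration, not the assignment's actual code. It assumes each passenger is a dict with the standard Kaggle column names (Name, SibSp, Parch, Cabin), for example a row from csv.DictReader, and that a missing cabin is an empty string or None. The helper name add_features and the title regular expression are hypothetical.

```python
import re

# Captures the title between the comma and the first period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr". The pattern is an assumption.
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def add_features(passenger):
    """Return a copy of one passenger record with the engineered features added."""
    row = dict(passenger)

    # FamilySize = SibSp + Parch + 1 (the passenger themselves)
    family_size = int(row["SibSp"]) + int(row["Parch"]) + 1
    row["FamilySize"] = family_size

    # IsAlone = 1 if no family aboard
    row["IsAlone"] = 1 if family_size == 1 else 0

    # Title extracted from the name; "Unknown" is a hypothetical fallback label
    match = TITLE_PATTERN.search(row.get("Name") or "")
    row["Title"] = match.group(1).strip() if match else "Unknown"

    # Has_Cabin = 1 if cabin data is available
    cabin = row.get("Cabin")
    row["Has_Cabin"] = 1 if cabin and str(cabin).strip() else 0
    return row


example = add_features(
    {"Name": "Braund, Mr. Owen Harris", "SibSp": 1, "Parch": 0, "Cabin": ""}
)
print(example["FamilySize"], example["IsAlone"], example["Title"], example["Has_Cabin"])
```

Working on a copy of each record keeps the raw data unchanged, which makes it easier to rerun the preprocessing with different rules.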
2. Exploratory Data Analysis (EDA) – Key Findings
2.1 Survival Rate by Features
Feature             Survival Rate (%)
Sex
  Female            74.2%
  Male              18.9%
Pclass
  1st Class         63.0%
  2nd Class         47.3%
  3rd Class         24.2%
Age
  Children (<12)    59.0%
  Adults (20-40)    38.1%
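Group-wise survival rates like those in the table are the share of survivors within each group. The sketch below shows one way to compute them in plain Python. The function name survival_rate_by is hypothetical, the records are assumed to be dicts with a Survived field of 0 or 1, and the sample is made up purely to show the call.

```python
from collections import defaultdict


def survival_rate_by(passengers, key):
    """Return {group value: survival rate in percent} for the given column."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for p in passengers:
        group = p[key]
        total[group] += 1
        survived[group] += int(p["Survived"])
    return {g: 100.0 * survived[g] / total[g] for g in total}


# Tiny made-up sample, only to demonstrate the call; NOT the real dataset.
sample = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
]
rates = survival_rate_by(sample, "Sex")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.1f}%")
```

The same function can be called with "Pclass" or a binned age column as the key.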
2.2 Visual Insights
Women and children had significantly higher survival rates.
1st-class passengers were more likely to survive.
Passengers with a recorded cabin number (likely higher class) had a 66.7% survival rate, versus 30% for those without.
3. Model Selection & Evaluation
3.1 Models Compared
Model                 Accuracy   Precision   Recall   F1-Score   ROC-AUC
Logistic Regression   79.3%      75.0%       70.6%    72.7%      84.3%
Decision Tree         81.0%      76.9%       73.2%    75.0%      80.3%
Random Forest         83.2%      80.0%       76.5%    78.2%      88.1%
3.2 Confusion Matrix (Random Forest)
                  Predicted Died   Predicted Survived
Actual Died       92               14
Actual Survived   19               54
Precision (80%): When the model predicts survival, it’s correct 80% of the time.
Recall (76.5%): The model captures 76.5% of actual survivors.
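The definitions behind these metrics can be made concrete by computing them from the four confusion-matrix counts. The sketch below is illustrative only: the function name classification_metrics is hypothetical, and "Survived" is assumed to be the positive class.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # of predicted survivors, how many survived
    recall = tp / (tp + fn)     # of actual survivors, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts read from the matrix in Section 3.2, with "Survived" as the positive class.
metrics = classification_metrics(tn=92, fp=14, fn=19, tp=54)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Evaluated on the counts as printed, this gives roughly 81.6% accuracy, 79.4% precision, 74.0% recall and 76.6% F1. These are slightly below the Random Forest row in Section 3.1 (83.2%, 80.0%, 76.5%, 78.2%), so the matrix and the table may come from different runs or test splits.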
3.3 Cross-Validation Score
Mean Accuracy (5-fold CV): 81.5% (±2.1%)
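A cross-validation summary of this form is usually the mean of the per-fold accuracies plus a spread. The sketch below assumes the ± value is a standard deviation across the five folds, which the report does not state. The fold accuracies are placeholders and the function name summarize_cv is hypothetical.

```python
from statistics import mean, stdev


def summarize_cv(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    return mean(fold_accuracies), stdev(fold_accuracies)


# Placeholder fold accuracies for illustration only; NOT the assignment's results.
fold_accuracies = [0.80, 0.84, 0.79, 0.83, 0.82]
avg, spread = summarize_cv(fold_accuracies)
print(f"Mean Accuracy (5-fold CV): {avg:.1%} (±{spread:.1%})")
```

Some libraries report the population standard deviation instead (statistics.pstdev in the standard library), which gives a slightly smaller spread for the same folds.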
4. Feature Importance Analysis
Top 5 Features (Random Forest)
Feature      Importance   Interpretation
Sex_Code     52.0%        Women (1) survived more than men (0)
Fare         18.0%        Higher fare = higher survival
Age          8.4%         Children had better survival
Title_Code   5.7%         "Mrs" and "Miss" survived more
Pclass       4.6%         1st class > 2nd > 3rd
5. Final Insights & Recommendations
5.1 Key Takeaways
✔ Women, children, and 1st-class passengers had the highest survival rates.
✔ Random Forest (83.2% accuracy) performed best, followed by Decision Tree (81.0%).
✔ Sex and Fare were the strongest predictors of survival.
5.2 Recommendations for Improvement
Try XGBoost or Neural Networks for potential accuracy gains.
Engineer more features:
o Extract deck levels from Cabin (e.g., A, B, C).
o Create interaction terms (e.g., Sex × Pclass).
Address class imbalance (if any) using SMOTE.
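The two suggested features could be derived along the following lines. This is a sketch under stated assumptions: the deck is taken to be the first letter of the Cabin string, "U" is a hypothetical label for a missing cabin, and the function names are illustrative.

```python
def extract_deck(cabin):
    """Return the deck letter from a cabin string such as "C85", or "U" if missing."""
    if not cabin or not str(cabin).strip():
        return "U"  # hypothetical placeholder label for unknown deck
    return str(cabin).strip()[0].upper()


def sex_pclass_interaction(sex, pclass):
    """Combine Sex and Pclass into one categorical feature, e.g. "female_1"."""
    return f"{str(sex).lower()}_{int(pclass)}"


print(extract_deck("C85"), extract_deck(""), sex_pclass_interaction("Female", 1))
```

Encoding the interaction as a single category lets a model treat, for example, 1st-class women and 3rd-class women as distinct groups. It would still need one-hot or ordinal encoding before training.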
5.3 Business Impact
Safety protocols: Prioritize women, children, and high-paying passengers in emergencies.
Historical analysis: Understand socio-economic biases in survival.
Appendix
Code & Data Availability
Kaggle Dataset: