DAL Assignment 2 Endsem
Department of Chemical Engineering
IIT Madras
Chennai, India
Abstract—The survival rates of RMS Titanic's passengers are investigated in this paper, using logistic regression to determine the underlying factors. The study identifies critical variables influencing survival odds by analyzing passenger demographics, ticket information, and cabin class. The findings underline how important factors like gender, age, and cabin class are in determining survival rates. The model is also cross-validated to estimate the expected classification metrics. The effect of using regularizing methods such as the L2-norm is investigated. We finally predict passenger survival on an unseen dataset. Section II has been changed.

Index Terms—logistic regression, bootstrapping, RMS Titanic, cross-validation

I. INTRODUCTION
The sinking of the RMS Titanic in 1912, a heartbreaking tragedy, continues to captivate our curiosity as one of the deadliest maritime disasters in history. It claimed the lives of over 1,500 people and prompted a profound exploration of maritime safety and crisis response. Understanding what determined who survived and who didn't on the Titanic has intrigued researchers for a century.

This paper sets out to investigate the factors behind passenger survival on the Titanic, using a statistical method called logistic regression. We're fortunate to have a vast dataset with details about the passengers, including their demographics, tickets, cabins, and family connections. This dataset allows us to dig deep into the complex factors that affected survival.
Our goal is to reveal the subtle details that governed who lived and who perished on the Titanic. We'll explore questions like whether women and children were indeed given priority during evacuation, if passenger class influenced survival chances, and whether having family members onboard made a difference.

Through logistic regression, we aim to uncover hidden patterns and relationships between these variables and survival outcomes. Our objective is not only to shed light on the Titanic tragedy but also to provide insights relevant to modern disaster planning, emergency response, and the enduring lessons we can draw from this historic event.

As we embark on this journey into the past, we acknowledge that the knowledge gained may have broader implications. It may offer valuable insights for improving disaster preparedness, evacuation procedures, and safety measures today, ensuring that the lessons from the Titanic continue to guide us toward safer journeys on the seas.

II. DATA AND CLEANING

A. The Datasets

The logistic regression model was constructed using two datasets: a training set and a test set. The test set lacked the target labels (survival status), making it impossible to assess the accuracy of the model's predictions. Apart from the target label ('survival'), all other labels were consistent between the two datasets. Table I summarizes the collected data and their respective types.

TABLE I
TABLE OF THE FEATURES IN THE GIVEN DATASETS ALONG WITH THEIR DESCRIPTIONS. NOTE THAT THE 'SURVIVAL' VARIABLE IS ABSENT IN THE TEST SET. WE OBSERVE THAT MOST VARIABLES ARE CATEGORICAL.

Feature    Description                          Type
survival   Survival                             Categorical (0, 1)
pclass     Ticket class                         Categorical (classes 1, 2, 3)
sex        Sex                                  Categorical (M, F)
age        Age                                  Numerical (years)
sibsp      Num. of siblings/spouses aboard      Numerical
parch      Num. of parents/children aboard      Numerical
ticket     Ticket number                        -
fare       Passenger fare                       Numerical
cabin      Cabin number                         -
embarked   Port of embarkation                  Categorical (C, Q, S)

B. Data Cleaning

A coded pipeline is designed to clean a dataset in the specified format, based on a flag indicating whether it's for training or testing ('train' or 'test'). Individuals with variables, such as 'Sex' and 'Embarked', containing missing values that cannot be imputed are excluded. Subsequently, 'Sex' and 'Embarked' undergo One-Hot Encoding. Certain columns, namely 'Name', 'Ticket', and 'PassengerId', are dropped due to encoding complexities. The 'Cabin' column, with a substantial number of missing values and encoding challenges, is also removed.

A K-Nearest Neighbours (KNN) Imputer is employed, treating a distribution as a data cluster. This imputer identifies the k nearest neighbours for each data point with a missing value and imputes based on them. Given the close proximity of the points used in imputation, this method is expected to preserve the distribution of each variable effectively. The variables are then converted to their appropriate types, and the cleaned dataset is returned. Notably, no confounding symbols are identified in the train or test data; only missing values are addressed.
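The pipeline described above can be sketched in a few lines with pandas and scikit-learn. This is a minimal illustration, not the exact code behind our results: the column names ('Survived', 'Sex', etc.) follow the usual naming of this dataset, and the choice of k = 5 neighbours is an assumed default, not a value reported in this paper.

    import pandas as pd
    from sklearn.impute import KNNImputer

    def clean(df, split='train'):
        # Drop columns that are hard to encode; 'Cabin' is also mostly missing
        df = df.drop(columns=['Name', 'Ticket', 'PassengerId', 'Cabin'])
        # Exclude rows whose categorical values are missing and cannot be imputed
        df = df.dropna(subset=['Sex', 'Embarked'])
        # One-Hot Encode the categorical variables
        df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
        # Set the target aside (absent in the test set) before imputation
        target = df.pop('Survived') if split == 'train' else None
        # KNN imputation for the remaining missing values (e.g., Age);
        # k = 5 is an assumed default
        imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df),
                               columns=df.columns)
        if target is not None:
            imputed['Survived'] = target.to_numpy()
        return imputed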
Fig. 1. The probability and cumulative distributions of the Fare of the various passengers are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.

Fig. 2. The probability and cumulative distributions of the SibSp of the various passengers are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.

We then explore the cleaned dataset to obtain better insights. From Fig. 4, we find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.

In Fig. 6, we find that the older passengers in general had better Passenger Classes. However, we find that extremely old people did not have a good survival rate, even in the First Class, even though it was better than the other Passenger Classes. This suggests a trade-off between having a lower Survival chance due to Age but also a higher one due to being able to afford a better Passenger Class.

In Fig. 7, we have a Swarm Plot of Passenger Class and Fare. We find that the First Class has a large spread in terms of price, when compared to the Second and Third Classes, indicating that some First Class cabins were more desirable than the others. The spread could also be due to the different Embarking Ports, with some having lower prices than the others. We find that the difference between the Second and Third Class cabins is smaller than the difference between the First and Second Class cabins.

From Fig. 8, we find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.

Fig. 4. Count Plot of Survival on the Titanic based on the Passenger Class. We find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.

Fig. 5. Box Plot of Passenger Class and Fare on the Titanic. We find that the First Class has a large spread in terms of price, when compared to the Second and Third Classes, indicating that some First Class cabins were more desirable than the others. The spread could also be due to the different Embarking Ports, with some having lower prices than the others. We find that the difference between the Second and Third Class cabins is smaller than the difference between the First and Second Class cabins.

Fig. 6. Violin Plot of Passenger Class and Age, based on Survival on the Titanic. We find that the older passengers in general had better Passenger Classes. This is expected since older people are generally wealthier than younger people. However, we find that extremely old people did not have a good survival rate, even in the First Class, even though it was better than the other Passenger Classes. This suggests a trade-off between having a lower Survival chance due to Age but also a higher one due to being able to afford a better Passenger Class.

Fig. 7. Swarm Plot of Passenger Class and Fare on the Titanic. We find that the First Class has a large spread in terms of price, when compared to the Second and Third Classes, indicating that some First Class cabins were more desirable than the others. The spread could also be due to the different Embarking Ports, with some having lower prices than the others. We find that the difference between the Second and Third Class cabins is smaller than the difference between the First and Second Class cabins.

Fig. 8. Stacked Bar Plot of Survival on the Titanic based on the Passenger Class. We find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.
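Plots like those above are standard seaborn calls. As a minimal sketch, the count plot of Fig. 4 could be produced as follows, reusing the clean() sketch from Section II-B; the file name and column names are assumptions.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    train = clean(pd.read_csv('train.csv'))  # clean() from the sketch above
    # Count of survivors and casualties within each passenger class (cf. Fig. 4)
    sns.countplot(data=train, x='Pclass', hue='Survived')
    plt.xlabel('Passenger Class')
    plt.ylabel('Count')
    plt.show()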
III. METHODS

A. Logistic Regression

Logistic regression is a fundamental statistical technique used for binary classification. In this section, we provide a detailed mathematical analysis of logistic regression, including its derivation, the likelihood function, and parameter estimation.

At the core of logistic regression is the logistic (sigmoid) function, denoted as σ(z), defined as:

    σ(z) = 1 / (1 + e^{-z})    (1)

where z is given by:

    z = β0 + β1 x1 + β2 x2 + . . . + βn xn    (2)

and the xi's are the features used in our model.
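Equations (1) and (2) translate directly into code. The following is a minimal NumPy sketch; the coefficients and feature values are made-up numbers for illustration only.

    import numpy as np

    def sigmoid(z):
        # Eq. (1): maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    beta = np.array([0.5, -0.9, 1.3])  # hypothetical [beta0, beta1, beta2]
    x = np.array([1.0, 3.0, 0.0])      # leading 1 multiplies the intercept beta0
    p = sigmoid(beta @ x)              # Eq. (2) then Eq. (1): P(y = 1 | x)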
The sigmoid maps the linear combination to the range (0, 1), allowing it to model probabilities.

The logistic regression model assumes that the log-odds (logit) of the positive class can be represented as a linear combination of input features.

Let p be the probability of the positive class. The odds of the positive class are defined as:

    O(p) = p / (1 - p)    (3)

Taking the logarithm of the positive class odds gives us:

    log(O(p)) = log(p / (1 - p))    (4)

Expressing the log-odds as a linear combination of the features:

    log(p / (1 - p)) = β0 + β1 x1 + β2 x2 + . . . + βn xn    (5)

Here, β0, β1, . . . , βn are regression coefficients that are estimated from the given data. Solving for p, we get:

    p / (1 - p) = e^{β0 + β1 x1 + β2 x2 + . . . + βn xn}    (6)

    p = e^{β0 + β1 x1 + . . . + βn xn} / (1 + e^{β0 + β1 x1 + . . . + βn xn})    (7)

This is the logistic regression model, where p represents the probability of the positive class, and β0, β1, . . . , βn are the model parameters to be estimated.
The likelihood function measures how well the logistic regression model predicts the observed outcomes in the training data. The likelihood for a single data point is given by:

    L(β) = p^y (1 - p)^{1-y}    (8)

Here:
y : actual outcome (0 or 1)
p : predicted probability of the positive class

If we assume that our predictions are i.i.d., the likelihood for the entire training dataset is the product of these individual likelihoods. The regression coefficients β0, β1, . . . , βn can be estimated by maximizing the likelihood function. Since a closed-form maximizer of L(β) is not available, this is done using numerical optimization techniques.
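In practice one maximizes the log-likelihood rather than the raw product. The sketch below does this with plain gradient ascent; it is a minimal illustration under an assumed learning rate and iteration count, not the exact optimizer behind our results (the gradient of the log-likelihood for this model is X^T (y - p)).

    import numpy as np

    def fit_logistic(X, y, lr=0.1, iters=5000):
        # X: (N, n) feature matrix, y: (N,) labels in {0, 1}
        X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ beta))  # Eq. (7)
            beta += lr * X.T @ (y - p) / len(y)  # gradient ascent step
        return beta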
B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier. Some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives.

1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as:

    Accuracy = Number of Correct Predictions / Total Number of Predictions    (9)

It measures the proportion of correct predictions made by the model. While accuracy provides an overall sense of model performance, it may not be suitable for imbalanced datasets, where one class dominates the other.

2) Recall: Recall, also known as sensitivity or true positive rate, quantifies a model's ability to correctly identify positive instances:

    Recall = True Positives / (True Positives + False Negatives)    (10)

Recall is essential when the cost of missing positive cases (false negatives) is high, such as in medical diagnoses.

3) Precision: Precision measures the accuracy of positive predictions made by the model:

    Precision = True Positives / (True Positives + False Positives)    (11)

Precision is valuable when minimizing false positive predictions is critical, like in spam email detection.

4) F1-score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two:

    F1 Score = 2 · (Precision · Recall) / (Precision + Recall)    (12)

It is particularly useful when there is an uneven class distribution or when both precision and recall need to be considered simultaneously.

5) Receiver Operator Characteristic Curve (ROC Curve): The ROC curve is a graphical representation of a model's performance across different classification thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold values. The area under the ROC curve (AUC-ROC) quantifies the model's overall performance. A higher AUC-ROC indicates a better model at distinguishing between positive and negative instances.

Fig. 9. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity. Based on the problem, we may be required to optimize for only one.
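All of these metrics are available in scikit-learn. A minimal sketch, assuming arrays y_true and y_pred of true and predicted labels and y_prob of predicted positive-class probabilities:

    from sklearn.metrics import (accuracy_score, recall_score,
                                 precision_score, f1_score, roc_auc_score)

    acc = accuracy_score(y_true, y_pred)    # Eq. (9)
    rec = recall_score(y_true, y_pred)      # Eq. (10)
    prec = precision_score(y_true, y_pred)  # Eq. (11)
    f1 = f1_score(y_true, y_pred)           # Eq. (12)
    auc = roc_auc_score(y_true, y_prob)     # area under the ROC curve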
C. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique in linear algebra. It breaks down a matrix into three separate matrices, capturing the inherent structure of the original matrix. SVD has a wide range of applications, and one of its practical uses is in data compression. Given a matrix A of dimensions N × p, SVD decomposes A into three matrices:

    A = U Σ V^T    (13)

where:
• U is an N × N orthogonal matrix (eigenvectors of A A^T).
• Σ is an N × p diagonal matrix with non-negative singular values.
• V is a p × p orthogonal matrix (eigenvectors of A^T A).

The columns of U are called the left singular vectors, the diagonal entries of Σ are the singular values, and the columns of V are the right singular vectors. A rank-k approximation of A is obtained by keeping only the k largest singular values:

    A ≈ U(:, 1:k) Σ(1:k, 1:k) V(:, 1:k)^T    (14)

where U(:, 1:k) contains the first k columns of U, Σ(1:k, 1:k) is the upper-left k × k submatrix of Σ, and V(:, 1:k) contains the first k columns of V.

By using a lower-rank approximation, the original data can be represented more compactly, leading to data compression. The extent of compression depends on the choice of k. A smaller value of k reduces the storage requirements but may lead to a loss of information.

Sometimes the columns of the data matrix A may contain linear relationships between themselves. In this case, if n linear relationships exist, n singular values are 0. We can drop up to n columns and perform a loss-less compression of the data. This allows us to use fewer independent variables in our regression models, i.e., enforce parsimony.
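The rank-deficiency check described above is a one-liner with NumPy. A minimal sketch, assuming X is the numeric matrix of independent variables after encoding, and with 1e-10 as an assumed zero tolerance:

    import numpy as np

    s = np.linalg.svd(X, compute_uv=False)  # singular values, largest first
    n_dependent = np.sum(s < 1e-10)         # number of (near-)zero singular values
    # n_dependent columns can be dropped loss-lessly (cf. Fig. 10)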
IV. RESULTS

A. Existence of Linear Relationships among Survival Factors

Exploratory analysis of the independent variables indicates the existence of linear relationships between them. This could allow us to loss-lessly reduce the number of independent variables used in our model. This is evident from Fig. 10, where the singular values of the independent variables dataset are presented. Three linear relationships exist between the variables.

Fig. 10. Singular values of the Independent Variables are presented. The last three singular values are of order < 10^{-13} and can be considered to be 0. This allows us to loss-lessly remove up to three variables from the dataset.

The correlation heatmap for the independent variables is shown in Fig. 11. We observe several variables that are perfectly correlated with each other. This is an artefact of our encoding method. When we encoded our categorical variables, at least one class will be highly correlated with all other classes. For example, in our 'sex' feature, only the 'M' and 'F' classes are present. If a sample has 'sex' attribute 'M', then it cannot have 'F', making the two classes, which have now become features, perfectly negatively correlated. The same is true between the 'C', 'Q', and 'S' features. Recall that they came from the categorical feature 'embarked'.

Fig. 11. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each independent variable. The color gradient indicates the magnitude of the correlation between the variables.

These are the linear relationships that the SVD indicates. We can remove either one of the M and F variables, and either one of the C, Q, and S variables. We choose to remove the M and S variables and retain the others. The singular values after removal and the correlation heatmap between the variables after removal are presented in Fig. 12 and Fig. 13. We also find that the retained C and Q features show a weak correlation with the fare, possibly since we did not remove one of the two.

Fig. 12. Singular values of the Independent Variables dataset, after loss-less compression, are presented. The last three singular values are no longer of order < 10^{-13} and cannot be considered to be 0. While we can still remove some more variables based on the singular values, the compression would no longer be loss-less.
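This encoding artefact is commonly avoided at encoding time. A minimal sketch, assuming the same DataFrame as in the cleaning pipeline: pandas' drop_first flag removes one dummy column per categorical feature, which plays the same role as dropping the M and S variables (though the dropped categories need not be the same ones we chose).

    import pandas as pd

    # One dummy per categorical feature is redundant; drop it at the source
    df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)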
TABLE II
EVALUATION METRICS OF THE LOGISTIC REGRESSION CLASSIFIER. WE FIND THAT MOST METRICS ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.

Fig. 17. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 18. The left plot contains the histogram of the accuracy obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the accuracy obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 20. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

TABLE III
EVALUATION METRICS OF THE L2-NORM LOGISTIC REGRESSION CLASSIFIER. WE FIND THAT MOST METRICS ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.

Fig. 21. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

TABLE IV
REGRESSION COEFFICIENTS OF ALL FEATURES USED, BASED ON THE LOGISTIC REGRESSION MODEL WITHOUT AND WITH THE L2 PENALTY. WE FIND NO SIGNIFICANT CHANGES BETWEEN THE TWO SETS OF COEFFICIENTS.

Feature   Coeff. based on Log. Reg.   Coeff. based on L2-norm Log. Reg.
Pclass    -0.912                      -0.893
Age       -0.5                        -0.487
SibSp     -0.373                      -0.363
Parch     -0.135                      -0.132
Fare      0.183                       0.187
female    1.288                       1.269
C         0.195                       0.194
Q         0.087                       0.085

Fig. 23. The Receiver Operator Characteristic curve obtained for the L2-norm logistic regression classifier. We find that we can achieve a good True Positive Rate with a small False Positive Rate, indicating that our classifier is robust to class imbalances.
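The bootstrap distributions in Figs. 17-21 can be reproduced by resampling the validation split with replacement and recomputing the metric on each resample. A minimal sketch, assuming NumPy arrays y_true and y_pred on the validation split (1000 resamples is an assumed count):

    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    n = len(y_true)
    f1_samples = []
    for _ in range(1000):                 # assumed number of bootstrap samples
        idx = rng.integers(0, n, size=n)  # resample indices with replacement
        f1_samples.append(f1_score(y_true[idx], y_pred[idx]))
    # The histogram and ECDF of f1_samples give plots like Fig. 17 / Fig. 21.

For the L2-penalized variant, scikit-learn's LogisticRegression applies penalty='l2' by default, with the regularization strength controlled through its C parameter.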