DAL Assignment 2 Endsem
Department of Chemical Engineering
IIT Madras
Chennai, India
Abstract—The survival rates of RMS Titanic's passengers are investigated in this paper, using logistic regression to determine the underlying factors. The study identifies critical variables influencing survival odds by analyzing passenger demographics, ticket information, and cabin class. The findings underline how important factors like gender, age, and cabin class are in determining survival rates. The model is also cross-validated to estimate the expected classification metrics. The effect of using regularizing methods such as the L2-norm is investigated. We finally predict passenger survival on an unseen dataset. Section II has been changed.

Index Terms—logistic regression, bootstrapping, RMS Titanic, cross-validation

I. INTRODUCTION
The sinking of the RMS Titanic in 1912, a heartbreaking tragedy, continues to captivate our curiosity as one of the deadliest maritime disasters in history. It claimed the lives of over 1,500 people and prompted a profound exploration of maritime safety and crisis response. Understanding what determined who survived and who didn't on the Titanic has intrigued researchers for a century.

This paper sets out to investigate the factors behind passenger survival on the Titanic, using a statistical method called logistic regression. We're fortunate to have a vast dataset with details about the passengers, including their demographics, tickets, cabins, and family connections. This dataset allows us to dig deep into the complex factors that affected survival.
Our goal is to reveal the subtle details that governed who lived and who perished on the Titanic. We'll explore questions like whether women and children were indeed given priority during evacuation, if passenger class influenced survival chances, and whether having family members onboard made a difference.

Through logistic regression, we aim to uncover hidden patterns and relationships between these variables and survival outcomes. Our objective is not only to shed light on the Titanic tragedy but also to provide insights relevant to modern disaster planning, emergency response, and the enduring lessons we can draw from this historic event.

As we embark on this journey into the past, we acknowledge that the knowledge gained may have broader implications. It may offer valuable insights for improving disaster preparedness, evacuation procedures, and safety measures today, ensuring that the lessons from the Titanic continue to guide us toward safer journeys on the seas.

II. DATA AND CLEANING

A. The Datasets

The logistic regression model was constructed using two datasets: a training set and a test set. The test set lacked the target labels (survival status), making it impossible to assess the accuracy of the model's predictions. Apart from the target label ('survival'), all other labels were consistent between the two datasets. Table I summarizes the collected data and their respective types.

TABLE I
TABLE OF THE FEATURES IN THE GIVEN DATASETS ALONG WITH THEIR DESCRIPTIONS. NOTE THAT THE 'SURVIVAL' VARIABLE IS ABSENT IN THE TEST SET. WE OBSERVE THAT MOST VARIABLES ARE CATEGORICAL.

Feature    Description                          Type
survival   Survival                             Categorical (0, 1)
pclass     Ticket class                         Categorical (classes 1, 2, 3)
sex        Sex                                  Categorical (M, F)
age        Age                                  Numerical (years)
sibsp      Num. of siblings/spouses aboard      Numerical
parch      Num. of parents/children aboard      Numerical
ticket     Ticket number                        -
fare       Passenger fare                       Numerical
cabin      Cabin number                         -
embarked   Port of embarkation                  Categorical (C, Q, S)

B. Data Cleaning

A coded pipeline is designed to clean a dataset in the specified format, based on a flag indicating whether it's for training or testing ('train' or 'test'). Individuals with variables, such as 'Sex' and 'Embarked', containing missing values that cannot be imputed are excluded. Subsequently, 'Sex' and 'Embarked' undergo One-Hot Encoding. Certain columns, namely 'Name', 'Ticket', and 'PassengerId', are dropped due to encoding complexities. The 'Cabin' column, with a substantial number of missing values and encoding challenges, is also removed.

A K-Nearest Neighbours (KNN) Imputer is employed, treating a distribution as a data cluster. This imputer identifies the k nearest neighbours for each data point with a missing value and imputes based on them. Given the close proximity of the points used in imputation, this method is expected to preserve the distribution of each variable effectively. The variables are then converted to their appropriate types, and the cleaned dataset is returned. Notably, no confounding symbols are identified in the train or test data; only missing values are addressed.
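The pipeline described above can be sketched in a few lines with pandas and scikit-learn. This is a minimal illustration, not the exact code behind our results: the column names ('Survived', 'Sex', etc.) follow the usual naming of this dataset, and the choice of k = 5 neighbours is an assumed default, not a value reported in this paper.

    import pandas as pd
    from sklearn.impute import KNNImputer

    def clean(df, split='train'):
        # Drop columns that are hard to encode; 'Cabin' is also mostly missing
        df = df.drop(columns=['Name', 'Ticket', 'PassengerId', 'Cabin'])
        # Exclude rows whose categorical values are missing and cannot be imputed
        df = df.dropna(subset=['Sex', 'Embarked'])
        # One-Hot Encode the categorical variables
        df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
        # Set the target aside (absent in the test set) before imputation
        target = df.pop('Survived') if split == 'train' else None
        # KNN imputation for the remaining missing values (e.g., Age);
        # k = 5 is an assumed default
        imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(df),
                               columns=df.columns)
        if target is not None:
            imputed['Survived'] = target.to_numpy()
        return imputed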
Fig. 1. The probability and cumulative distributions of the Fare of the various passengers are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.

Fig. 2. The probability and cumulative distributions of the SibSp of the various passengers are plotted. The left image contains the KDE of the data before and after KNN Imputation. The right image shows the ECDFs of the data before and after KNN Imputation. In both distribution functions, no significant differences between the two distributions can be observed, indicating that the imputer does not change the underlying distribution.

We then explore the cleaned dataset to obtain better insights. From Fig. 4, we find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.

In Fig. 6, we find that the older passengers in general had better Passenger Classes. However, we find that extremely old people did not have a good survival rate, even in the First Class, even though it was better than the other Passenger Classes. This suggests a trade-off between having a lower Survival chance due to Age but also a higher one due to being able to afford a better Passenger Class.

In Fig. 7, we have a Swarm Plot of Passenger Class and Fare. We find that the First Class has a large spread in terms of price, when compared to the Second and Third Classes, indicating that some First Class cabins were more desirable than the others. The spread could also be due to the different Embarking Ports, with some having lower prices than the others. We find that the difference between the Second and Third Class cabins is smaller than the difference between the First and Second Class cabins.

From Fig. 8, we find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.

Fig. 4. Count Plot of Survival on the Titanic based on the Passenger Class. We find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.

Fig. 5. Box Plot of Passenger Class and Fare on the Titanic. We find that the First Class has a large spread in terms of price, when compared to the Second and Third Classes, indicating that some First Class cabins were more desirable than the others. The spread could also be due to the different Embarking Ports, with some having lower prices than the others. We find that the difference between the Second and Third Class cabins is smaller than the difference between the First and Second Class cabins.

Fig. 6. Violin Plot of Passenger Class and Age, based on Survival on the Titanic. We find that the older passengers in general had better Passenger Classes. This is expected since older people are generally wealthier than younger people. However, we find that extremely old people did not have a good survival rate, even in the First Class, even though it was better than the other Passenger Classes. This suggests a trade-off between having a lower Survival chance due to Age but also a higher one due to being able to afford a better Passenger Class.

Fig. 7. Swarm Plot of Passenger Class and Fare on the Titanic. We find that the First Class has a large spread in terms of price, when compared to the Second and Third Classes, indicating that some First Class cabins were more desirable than the others. The spread could also be due to the different Embarking Ports, with some having lower prices than the others. We find that the difference between the Second and Third Class cabins is smaller than the difference between the First and Second Class cabins.

Fig. 8. Stacked Bar Plot of Survival on the Titanic based on the Passenger Class. We find a clear disparity between the three classes. First class passengers have a far higher survival rate than the other classes. This is expected since passengers in First Class were likely to be richer, and more preferred during the rescue. The chance of survival in the Second Class is around 50%, with the Third Class having the worst Survival Rate.
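Plots like those above are standard seaborn calls. As a minimal sketch, the count plot of Fig. 4 could be produced as follows, reusing the clean() sketch from Section II-B; the file name and column names are assumptions.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    train = clean(pd.read_csv('train.csv'))  # clean() from the sketch above
    # Count of survivors and casualties within each passenger class (cf. Fig. 4)
    sns.countplot(data=train, x='Pclass', hue='Survived')
    plt.xlabel('Passenger Class')
    plt.ylabel('Count')
    plt.show()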
III. METHODS

A. Logistic Regression

Logistic regression is a fundamental statistical technique used for binary classification. In this section, we provide a detailed mathematical analysis of logistic regression, including its derivation, the likelihood function, and parameter estimation.

At the core of logistic regression is the logistic (sigmoid) function, denoted as σ(z), defined as:

    σ(z) = 1 / (1 + e^{-z})    (1)

where z is given by:

    z = β0 + β1 x1 + β2 x2 + . . . + βn xn    (2)

and the xi's are the features used in our model.
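Equations (1) and (2) translate directly into code. The following is a minimal NumPy sketch; the coefficients and feature values are made-up numbers for illustration only.

    import numpy as np

    def sigmoid(z):
        # Eq. (1): maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    beta = np.array([0.5, -0.9, 1.3])  # hypothetical [beta0, beta1, beta2]
    x = np.array([1.0, 3.0, 0.0])      # leading 1 multiplies the intercept beta0
    p = sigmoid(beta @ x)              # Eq. (2) then Eq. (1): P(y = 1 | x)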
The sigmoid maps the linear combination to the range (0, 1), allowing it to model probabilities.

The logistic regression model assumes that the log-odds (logit) of the positive class can be represented as a linear combination of input features.

Let p be the probability of the positive class. The odds of the positive class are defined as:

    O(p) = p / (1 - p)    (3)

Taking the logarithm of the positive class odds gives us:

    log(O(p)) = log(p / (1 - p))    (4)

Expressing the log-odds as a linear combination of the features:

    log(p / (1 - p)) = β0 + β1 x1 + β2 x2 + . . . + βn xn    (5)

Here, β0, β1, . . . , βn are regression coefficients that are estimated from the given data. Solving for p, we get:

    p / (1 - p) = e^{β0 + β1 x1 + β2 x2 + . . . + βn xn}    (6)

    p = e^{β0 + β1 x1 + . . . + βn xn} / (1 + e^{β0 + β1 x1 + . . . + βn xn})    (7)

This is the logistic regression model, where p represents the probability of the positive class, and β0, β1, . . . , βn are the model parameters to be estimated.
The likelihood function measures how well the logistic regression model predicts the observed outcomes in the training data. The likelihood for a single data point is given by:

    L(β) = p^y (1 - p)^{1-y}    (8)

Here:
y : actual outcome (0 or 1)
p : predicted probability of the positive class

If we assume that our predictions are i.i.d., the likelihood for the entire training dataset is the product of these individual likelihoods. The regression coefficients β0, β1, . . . , βn can be estimated by maximizing the likelihood function. Since a closed-form maximizer of L(β) is not available, this is done using numerical optimization techniques.
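In practice one maximizes the log-likelihood rather than the raw product. The sketch below does this with plain gradient ascent; it is a minimal illustration under an assumed learning rate and iteration count, not the exact optimizer behind our results (the gradient of the log-likelihood for this model is X^T (y - p)).

    import numpy as np

    def fit_logistic(X, y, lr=0.1, iters=5000):
        # X: (N, n) feature matrix, y: (N,) labels in {0, 1}
        X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
        beta = np.zeros(X.shape[1])
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-X @ beta))  # Eq. (7)
            beta += lr * X.T @ (y - p) / len(y)  # gradient ascent step
        return beta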
B. Classification Metrics

There are various metrics that can evaluate the goodness-of-fit of a given classifier. Some of these metrics are presented in this section. In classification tasks, it is essential to choose appropriate evaluation metrics based on the problem's context and objectives.

1) Accuracy: Accuracy is one of the most straightforward classification metrics and is defined as:

    Accuracy = Number of Correct Predictions / Total Number of Predictions    (9)

It measures the proportion of correct predictions made by the model. While accuracy provides an overall sense of model performance, it may not be suitable for imbalanced datasets, where one class dominates the other.

2) Recall: Recall, also known as sensitivity or true positive rate, quantifies a model's ability to correctly identify positive instances:

    Recall = True Positives / (True Positives + False Negatives)    (10)

Recall is essential when the cost of missing positive cases (false negatives) is high, such as in medical diagnoses.

3) Precision: Precision measures the accuracy of positive predictions made by the model:

    Precision = True Positives / (True Positives + False Positives)    (11)

Precision is valuable when minimizing false positive predictions is critical, like in spam email detection.

4) F1-score: The F1 score is the harmonic mean of precision and recall, providing a balance between the two:

    F1 Score = 2 · (Precision · Recall) / (Precision + Recall)    (12)

It is particularly useful when there is an uneven class distribution or when both precision and recall need to be considered simultaneously.

5) Receiver Operator Characteristic Curve (ROC Curve): The ROC curve is a graphical representation of a model's performance across different classification thresholds. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at various threshold values. The area under the ROC curve (AUC-ROC) quantifies the model's overall performance. A higher AUC-ROC indicates a better model at distinguishing between positive and negative instances.

Fig. 9. A sample ROC curve from a classifier. Note the trade-off between sensitivity and specificity. Based on the problem, we may be required to optimize for only one.
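All of these metrics are available in scikit-learn. A minimal sketch, assuming arrays y_true and y_pred of true and predicted labels and y_prob of predicted positive-class probabilities:

    from sklearn.metrics import (accuracy_score, recall_score,
                                 precision_score, f1_score, roc_auc_score)

    acc = accuracy_score(y_true, y_pred)    # Eq. (9)
    rec = recall_score(y_true, y_pred)      # Eq. (10)
    prec = precision_score(y_true, y_pred)  # Eq. (11)
    f1 = f1_score(y_true, y_pred)           # Eq. (12)
    auc = roc_auc_score(y_true, y_prob)     # area under the ROC curve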
C. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique in linear algebra. It breaks down a matrix into three separate matrices, capturing the inherent structure of the original matrix. SVD has a wide range of applications, and one of its practical uses is in data compression. Given a matrix A of dimensions N × p, SVD decomposes A into three matrices:

    A = U Σ V^T    (13)

where:
• U is an N × N orthogonal matrix (eigenvectors of A A^T).
• Σ is an N × p diagonal matrix with non-negative singular values.
• V is a p × p orthogonal matrix (eigenvectors of A^T A).

The columns of U are called the left singular vectors, the diagonal entries of Σ are the singular values, and the columns of V are the right singular vectors. A rank-k approximation of A is obtained by keeping only the k largest singular values:

    A ≈ U(:, 1:k) Σ(1:k, 1:k) V(:, 1:k)^T    (14)

where U(:, 1:k) contains the first k columns of U, Σ(1:k, 1:k) is the upper-left k × k submatrix of Σ, and V(:, 1:k) contains the first k columns of V.

By using a lower-rank approximation, the original data can be represented more compactly, leading to data compression. The extent of compression depends on the choice of k. A smaller value of k reduces the storage requirements but may lead to a loss of information.

Sometimes the columns of the data matrix A may contain linear relationships between themselves. In this case, if n linear relationships exist, n singular values are 0. We can drop up to n columns and perform a loss-less compression of the data. This allows us to use fewer independent variables in our regression models, i.e., enforce parsimony.
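The rank-deficiency check described above is a one-liner with NumPy. A minimal sketch, assuming X is the numeric matrix of independent variables after encoding, and with 1e-10 as an assumed zero tolerance:

    import numpy as np

    s = np.linalg.svd(X, compute_uv=False)  # singular values, largest first
    n_dependent = np.sum(s < 1e-10)         # number of (near-)zero singular values
    # n_dependent columns can be dropped loss-lessly (cf. Fig. 10)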
IV. RESULTS

A. Existence of Linear Relationships among Survival Factors

Exploratory analysis of the independent variables indicates the existence of linear relationships between them. This could allow us to loss-lessly reduce the number of independent variables used in our model. This is evident from Fig. 10, where the singular values of the independent variables dataset are presented. Three linear relationships exist between the variables.

Fig. 10. Singular values of the Independent Variables are presented. The last three singular values are of order < 10^{-13} and can be considered to be 0. This allows us to loss-lessly remove up to three variables from the dataset.

The correlation heatmap for the independent variables is shown in Fig. 11. We observe several variables that are perfectly correlated with each other. This is an artefact of our encoding method. When we encoded our categorical variables, at least one class will be highly correlated with all other classes. For example, in our 'sex' feature, only the 'M' and 'F' classes are present. If a sample has 'sex' attribute 'M', then it cannot have 'F', making the two classes, which have now become features, perfectly negatively correlated. The same is true between the 'C', 'Q', and 'S' features. Recall that they came from the categorical feature 'embarked'.

Fig. 11. The correlation heatmap between all independent variables. This was obtained by finding the pairwise correlation coefficient between each independent variable. The color gradient indicates the magnitude of the correlation between the variables.

These are the linear relationships that the SVD indicates. We can remove either one of the M and F variables, and either one of the C, Q, and S variables. We choose to remove the M and S variables and retain the others. The singular values after removal and the correlation heatmap between the variables after removal are presented in Fig. 12 and Fig. 13. We also find that the retained C and Q features show a weak correlation with the fare, possibly since we did not remove one of the two.

Fig. 12. Singular values of the Independent Variables dataset, after loss-less compression, are presented. The last three singular values are no longer of order < 10^{-13} and cannot be considered to be 0. While we can still remove some more variables based on the singular values, the compression would no longer be loss-less.
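This encoding artefact is commonly avoided at encoding time. A minimal sketch, assuming the same DataFrame as in the cleaning pipeline: pandas' drop_first flag removes one dummy column per categorical feature, which plays the same role as dropping the M and S variables (though the dropped categories need not be the same ones we chose).

    import pandas as pd

    # One dummy per categorical feature is redundant; drop it at the source
    df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)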
TABLE II
EVALUATION METRICS OF THE LOGISTIC REGRESSION CLASSIFIER. WE FIND THAT MOST METRICS ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.

Fig. 17. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 18. The left plot contains the histogram of the accuracy obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the accuracy obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

Fig. 20. The left plot contains the histogram of the precision obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the precision obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

TABLE III
EVALUATION METRICS OF THE L2-NORM LOGISTIC REGRESSION CLASSIFIER. WE FIND THAT MOST METRICS ARE REASONABLY HIGH. THE VARIANCE IN THESE ESTIMATES IS ALSO ACCEPTABLE.

Fig. 21. The left plot contains the histogram of the F1 score obtained for each bootstrap sample from the validation split. The right plot contains the ECDF of the F1 score obtained for each bootstrap sample from the validation split. We find that the metric is high and its variance is acceptable.

TABLE IV
REGRESSION COEFFICIENTS OF ALL FEATURES USED, BASED ON THE LOGISTIC REGRESSION MODEL WITHOUT AND WITH THE L2 PENALTY. WE FIND NO SIGNIFICANT CHANGES BETWEEN THE TWO SETS OF COEFFICIENTS.

Feature   Coeff. based on Log. Reg.   Coeff. based on L2-norm Log. Reg.
Pclass    -0.912                      -0.893
Age       -0.5                        -0.487
SibSp     -0.373                      -0.363
Parch     -0.135                      -0.132
Fare      0.183                       0.187
female    1.288                       1.269
C         0.195                       0.194
Q         0.087                       0.085

Fig. 23. The Receiver Operator Characteristic curve obtained for the L2-norm logistic regression classifier. We find that we can achieve a good True Positive Rate with a small False Positive Rate, indicating that our classifier is robust to class imbalances.
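The bootstrap distributions in Figs. 17-21 can be reproduced by resampling the validation split with replacement and recomputing the metric on each resample. A minimal sketch, assuming NumPy arrays y_true and y_pred on the validation split (1000 resamples is an assumed count):

    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    n = len(y_true)
    f1_samples = []
    for _ in range(1000):                 # assumed number of bootstrap samples
        idx = rng.integers(0, n, size=n)  # resample indices with replacement
        f1_samples.append(f1_score(y_true[idx], y_pred[idx]))
    # The histogram and ECDF of f1_samples give plots like Fig. 17 / Fig. 21.

For the L2-penalized variant, scikit-learn's LogisticRegression applies penalty='l2' by default, with the regularization strength controlled through its C parameter.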