Survival Analysis - Project one
Load required modules
In [1]: import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
[Link]("ignore")
Load the dataset
In [2]: df = pd.read_csv("C:/Users/ADMIN/Desktop/Data Science/Datasets/survival/Survival…")
Inspect the data
In [4]: ## First few observations
df.head(10)
Out[4]:    id  Start_date   Gender  Race   Pay_hourly  Pay_sat   Turnover  Turnover_date  LOS
        0  1073  5/4/2014    Man    Black  12.56       4.666667  1         1/14/2017      986
        1  1598  9/30/2016   Man    Black  17.68       3.000000  1         5/7/2018       584
        2  1742  3/14/2018   Man    Black  12.56       4.333333  0         NaN            293
        3  1125  9/7/2014    Man    Black  15.15       3.000000  1         4/13/2017      949
        4  1676  4/19/2017   Man    Black  14.67       3.333333  0         NaN            622
        5  1420  12/25/2015  Man    Black  15.91       4.333333  1         7/31/2018      949
        6  1443  1/20/2016   Man    Black  15.15       3.333333  2         10/2/2018      986
        7  1698  6/18/2017   Man    Black  14.49       4.666667  1         12/18/2018     548
        8  1661  2/23/2017   Man    Black  17.15       3.666667  1         8/25/2017      183
        9  1347  9/13/2015   Man    Black  14.59       3.666667  0         NaN            1206
In [5]: ## normalize names
df.columns = [c.strip().lower() for c in df.columns]
print("columns:", df.columns.tolist())
columns: ['id', 'start_date', 'gender', 'race', 'pay_hourly', 'pay_sat', 'turnover', 'turnover_date', 'los']
In [6]: ## Data types
df.dtypes
Out[6]: id int64
start_date object
gender object
race object
pay_hourly float64
pay_sat float64
turnover int64
turnover_date object
los int64
dtype: object
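Both date columns are plain strings (object dtype). An optional check, not in the original run, is to parse them with pd.to_datetime and confirm that los is measured in days; a minimal sketch assuming the M/D/YYYY format shown above:

# Hypothetical consistency check: los should equal the day span for event rows
start = pd.to_datetime(df["start_date"], format="%m/%d/%Y")
end = pd.to_datetime(df["turnover_date"], format="%m/%d/%Y")  # NaT for censored rows
gap_days = (end - start).dt.days
print((gap_days - df["los"]).abs().max())  # near zero if los is in days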
Clean data & coerce event to binary safely
In [7]: ## Select or detect duration/event columns (explicit choice recommended)
DURATION = "los"       # <-- set according to your file
EVENT = "turnover"     # <-- set according to your file

## Safe coercion of event column to 0/1
def coerce_event(series):
    s = series.copy()
    # handle booleans and numeric
    if pd.api.types.is_bool_dtype(s):
        return s.astype("Int64")  # nullable integer
    if pd.api.types.is_numeric_dtype(s):
        vals = sorted(pd.unique(s.dropna()))
        if set(vals).issubset({0, 1}):
            return s.astype("Int64")
        if len(vals) == 2:
            ## map min -> 0, max -> 1
            return (s == vals[-1]).astype("Int64")
    ## common strings mapping
    yes = {"yes", "y", "true", "t", "event", "death", "left", "1"}
    no = {"no", "n", "false", "f", "censored", "stayed", "0"}
    def map_val(v):
        if pd.isna(v): return pd.NA
        v2 = str(v).strip().lower()
        if v2 in yes: return 1
        if v2 in no: return 0
        return pd.NA
    mapped = s.map(map_val).astype("Int64")
    return mapped

## Apply cleaning
df[DURATION] = pd.to_numeric(df[DURATION], errors="coerce")
df[EVENT] = coerce_event(df[EVENT])
df = df.dropna(subset=[DURATION, EVENT])  # remove rows missing core info
df = df[df[DURATION] >= 0]                # ensure non-negative times
df[EVENT] = df[EVENT].astype(int)         # finally cast to int (0/1)
print("n rows after cleaning:", len(df))
n rows after cleaning: 685
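Not in the original notebook, but a quick sanity check confirms the coercion left only 0/1 values; any unmapped value, such as the stray turnover value of 2 in the raw preview, becomes NA and is dropped, which is why the row count falls:

print(df[EVENT].value_counts(dropna=False))  # expect only 0 and 1 after cleaning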
In [8]: df.head()
Out[8]: id start_date gender race pay_hourly pay_sat turnover turnover_date los
0 1073 5/4/2014 Man Black 12.56 4.666667 1 1/14/2017 986
1 1598 9/30/2016 Man Black 17.68 3.000000 1 5/7/2018 584
2 1742 3/14/2018 Man Black 12.56 4.333333 0 NaN 293
3 1125 9/7/2014 Man Black 15.15 3.000000 1 4/13/2017 949
4 1676 4/19/2017 Man Black 14.67 3.333333 0 NaN 622
Descriptive summary (counts, event rate, time range)
In [10]: ## Descriptives
n = len(df)
events = df[EVENT].sum()
censored = n - events
event_rate = events / n
time_min = df[DURATION].min()
time_median = df[DURATION].median()
time_max = df[DURATION].max()
summary = {
    "n": int(n),
    "events": int(events),
    "censored": int(censored),
    "event_rate": float(event_rate),
    "min_time": float(time_min),
    "median_time": float(time_median),
    "max_time": float(time_max)
}
summary
Out[10]: {'n': 685,
'events': 463,
'censored': 222,
'event_rate': 0.6759124087591241,
'min_time': 73.0,
'median_time': 913.0,
'max_time': 1935.0}
Interpretation: basic cohort metrics. The event rate shows the proportion of observed events (failures) versus censored records; here 463 of 685 employees (68%) had an observed turnover event. If event counts are very low, Kaplan–Meier estimates will be imprecise and the Cox model will be underpowered.
Kaplan–Meier overall (two options)
Option A: manual Kaplan–Meier estimator (NumPy)
In [41]: # Define KM estimator
def km_estimator(times, events):
    times = np.asarray(times)
    events = np.asarray(events)
    # sort by time
    order = np.argsort(times)
    t_sorted = times[order]
    e_sorted = events[order]
    # unique event times
    uniq = np.unique(t_sorted[e_sorted == 1])
    survival = []
    at_risk = []
    events_at_t = []
    s = 1.0
    for ut in uniq:
        # number at risk just before ut
        risk = np.sum(t_sorted >= ut)
        # number of events at ut
        d = np.sum((t_sorted == ut) & (e_sorted == 1))
        if risk > 0:
            s *= (1 - d / risk)
        survival.append(s)
        at_risk.append(risk)
        events_at_t.append(d)
    return uniq, np.array(survival), np.array(at_risk), np.array(events_at_t)

# Apply on dataset
DURATION = "los"
EVENT = "turnover"
times = df[DURATION].values
events = df[EVENT].values
uniq_times, surv_vals, at_risk, ev_counts = km_estimator(times, events)

# Build life table
life_table = pd.DataFrame({
    "time": uniq_times,
    "at_risk": at_risk,
    "events": ev_counts,
    "survival": surv_vals
})
print(life_table.head(10))
time at_risk events survival
0 73 685 12 0.982482
1 110 666 1 0.981007
2 146 665 3 0.976581
3 183 661 4 0.970671
4 219 656 3 0.966232
5 256 652 4 0.960304
6 292 644 1 0.958813
7 329 640 1 0.957315
8 365 637 6 0.948298
9 402 626 7 0.937694
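Since S(t) is a step function, the KM median is the earliest time at which the estimated survival drops to 0.5 or below; a minimal sketch reading it off the life table built above:

below_half = life_table[life_table["survival"] <= 0.5]
km_median = below_half["time"].iloc[0] if len(below_half) else float("nan")  # NaN if S(t) never reaches 0.5
print("KM median survival time:", km_median)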
In [42]: ## Kaplan–Meier Survival Plot
plt.figure(figsize=(8,6))
plt.step(uniq_times, surv_vals, where="post", label="KM Estimate", color="blue")

## Formatting
plt.title("Kaplan–Meier Survival Curve", fontsize=14)
plt.xlabel("Time (LOS)", fontsize=12)
plt.ylabel("Survival Probability S(t)", fontsize=12)
plt.ylim(0, 1.05)
plt.grid(True, linestyle="--", alpha=0.6)
plt.legend()
plt.show()
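Option B: using lifelines (recommended)
The same overall curve via KaplanMeierFitter, which the stratified section below also uses, gives confidence intervals for free; a minimal sketch (the label string is illustrative):

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(df[DURATION], event_observed=df[EVENT], label="KM Estimate (lifelines)")
kmf.plot_survival_function()  # step curve with 95% confidence band
print("median survival:", kmf.median_survival_time_)
plt.title("Kaplan–Meier Survival Curve (lifelines)")
plt.xlabel("Time (LOS)"); plt.ylabel("Survival Probability S(t)")
plt.show()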
KM stratified by groups (example: gender, race)
In [24]: group_cols = [c for c in ("gender","race") if c in df.columns]
## using lifelines (if available)
from lifelines import KaplanMeierFitter
for g in group_cols:
    kmf = KaplanMeierFitter()
    plt.figure()
    for level in df[g].astype("category").cat.categories:
        mask = df[g] == level
        kmf.fit(df.loc[mask, DURATION], event_observed=df.loc[mask, EVENT], label=str(level))
        kmf.plot(ci_show=False)
    plt.title(f"KM by {g}")
    plt.xlabel("Time"); plt.ylabel("Survival")
    plt.savefig(f"km_by_{g}.png", bbox_inches="tight")
    plt.show()
[figures: Kaplan–Meier curves by gender and by race]
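As a follow-up, not in the original notebook, each group's median survival can be reported alongside the curves; kmf.median_survival_time_ is inf when a group's curve never falls below 0.5:

for g in group_cols:
    for level in df[g].astype("category").cat.categories:
        mask = df[g] == level
        kmf = KaplanMeierFitter()
        kmf.fit(df.loc[mask, DURATION], event_observed=df.loc[mask, EVENT])
        print(f"{g}={level}: median = {kmf.median_survival_time_}")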
Log-rank test (compare survival across groups)
Using lifelines (pairwise)
In [26]: from lifelines.statistics import logrank_test
## compare two groups A vs B for group g
g = "gender" # choose an available group
levels = df[g].astype("category").[Link]
pairs = []
for i in range(len(levels)):
for j in range(i+1, len(levels)):
A = df[df[g] == levels[i]]
B = df[df[g] == levels[j]]
res = logrank_test(A[DURATION], B[DURATION], event_observed_A=A[EVENT],
[Link]({
"group": g,
"level_A": levels[i],
"level_B": levels[j],
"p_value": float(res.p_value),
"test_statistic": float(res.test_statistic)
})
pairs
Out[26]: [{'group': 'gender',
'level_A': 'Man',
'level_B': 'Woman',
'p_value': 0.6864647030647238,
'test_statistic': 0.1629392601399209}]
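With p = 0.69, there is no evidence that survival differs by gender. When a grouping variable has more than two levels (race here has three), a single joint test avoids multiple pairwise comparisons; a minimal sketch with lifelines' multivariate_logrank_test:

from lifelines.statistics import multivariate_logrank_test

res_race = multivariate_logrank_test(df[DURATION], df["race"], df[EVENT])
print("race log-rank: statistic =", res_race.test_statistic, "p =", res_race.p_value)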
Cox proportional hazards model (multivariable)
Using lifelines
In [39]: from lifelines import CoxPHFitter
## Prepare covariates: numeric columns + one-hot small categoricals
work = df.copy()
exclude = {DURATION, EVENT, "id", "start_date", "turnover_date"}  # keep as needed
covariates = []
for c in work.columns:
    if c in exclude: continue
    if pd.api.types.is_numeric_dtype(work[c]):
        covariates.append(c)
    else:
        # add one-hot dummies for small-cardinality categoricals
        if work[c].nunique(dropna=True) <= 10:
            dummies = pd.get_dummies(work[c], prefix=c, drop_first=True)
            work = pd.concat([work, dummies], axis=1)
            covariates += list(dummies.columns)
model_df = work[[DURATION, EVENT] + covariates].dropna()
print("Cox model n=", len(model_df), "num covariates=", len(covariates))
cph = CoxPHFitter()
cph.fit(model_df, duration_col=DURATION, event_col=EVENT)
cph.print_summary()  # coefficients, HR, p-values, 95% CI
Cox model n= 650 num covariates= 5
model                      lifelines.CoxPHFitter
duration col               'los'
event col                  'turnover'
baseline estimation        breslow
number of observations     650
number of events observed  445
partial log-likelihood     -2437.89
time fit was run           2025-09-01 … UTC

                     coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to
gender_Woman        -0.04       0.96      0.10           -0.23            0.15                 0.79                 1.16    0.00
race_HispanicLatino  0.07       1.07      0.16           -0.24            0.38                 0.79                 1.46    0.00
race_White          -0.07       0.93      0.10           -0.28            0.13                 0.76                 1.14    0.00
pay_hourly          -0.12       0.89      0.03           -0.18           -0.05                 0.83                 0.95    0.00
pay_sat              0.13       1.14      0.08           -0.03            0.29                 0.97                 1.33    0.00

Concordance                0.55
Partial AIC                4885.78
log-likelihood ratio test  17.33 on 5 df
-log2(p) of ll-ratio test  8.00
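Here pay_hourly is the only covariate whose 95% CI for exp(coef) excludes 1 (HR ≈ 0.89, roughly an 11% lower turnover hazard per extra dollar of hourly pay). The estimates can also be shown graphically; a minimal sketch using lifelines' built-in coefficient plot:

ax = cph.plot()  # horizontal 95% CIs for each coefficient on the log(HR) scale
ax.set_title("Cox coefficients (log hazard ratios)")
plt.show()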
In [33]: ## extract hazard ratios & save
summary = cph.summary.copy()
summary["HR"] = np.exp(summary["coef"])
summary
Out[33]:
covariate             coef       exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  …
gender_Woman         -0.040394   0.960411   0.097361  -0.231217        0.150429        0.793567            …
race_HispanicLatino   0.071419   1.074031   0.157860  -0.237981        0.380820        0.788217            …
race_White           -0.074764   0.927962   0.103886  -0.278376        0.128848        0.757012            …
pay_hourly           -0.116725   0.889830   0.032939  -0.181283       -0.052167        0.834199            …
pay_sat               0.129821   1.138625   0.080736  -0.028418        0.288060        0.971982            …
In [34]: ## baseline survival
cph.baseline_survival_.plot()
[Link]("Cox baseline survival")
[Link]()
In [35]: ## Concordance index
print("Concordance index:", cph.concordance_index_)
Concordance index: 0.5509943292436867
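Interpretation: a concordance of about 0.55 is only slightly above the 0.50 expected from random ranking, so these covariates have limited ability to discriminate who leaves earlier; this matches the modest log-likelihood ratio test above.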
Check proportional hazards assumption
In [36]: ## Check proportional hazards (lifelines)
cph.check_assumptions(model_df, p_value_threshold=0.05, show_plots=True)
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Proportional hazard assumption looks okay.
Out[36]: [[<Axes: xlabel='rank-transformed time\n(p=0.0707)'>,
<Axes: xlabel='km-transformed time\n(p=0.0625)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.4316)'>,
<Axes: xlabel='km-transformed time\n(p=0.5350)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.9220)'>,
<Axes: xlabel='km-transformed time\n(p=0.4784)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.4284)'>,
<Axes: xlabel='km-transformed time\n(p=0.4532)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.4763)'>,
<Axes: xlabel='km-transformed time\n(p=0.4112)'>]]
[figures: scaled Schoenfeld residual plots against rank-transformed and KM-transformed time for each covariate]
Diagnostics & multicollinearity (optional but important)
In [37]: ## Variance Inflation Factor (VIF) to detect multicollinearity
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = model_df[covariates].astype(float)
X = X.dropna()
X_const = sm.add_constant(X)
vif = pd.DataFrame({
    "variable": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]
})
print(vif.sort_values("VIF", ascending=False).head(10))
variable VIF
0 const 135.237148
3 race_White 1.169985
2 race_HispanicLatino 1.138402
4 pay_hourly 1.056765
1 gender_Woman 1.019376
5 pay_sat 1.002665
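Interpretation: all covariate VIFs are close to 1, so multicollinearity is not a concern for the Cox model; the large VIF on const is an artifact of the intercept column and can be ignored.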