Survival Analysis - Python

The document outlines a survival analysis project that involves loading and cleaning a dataset, performing descriptive statistics, and applying Kaplan-Meier estimation and Cox proportional hazards modeling. It includes steps for data normalization, event coercion, and visualizing survival curves stratified by groups such as gender and race. The analysis also incorporates statistical tests like the log-rank test to compare survival across groups.

Survival Analysis - Project one

Load required modules

In [1]: import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

Load the dataset

In [2]: df = pd.read_csv("C:/Users/ADMIN/Desktop/Data Science/Datasets/survival/Survival

Inspect the data

In [4]: ## First few observations


[Link](10)

Out[4]:
     id  Start_date Gender   Race  Pay_hourly   Pay_sat  Turnover Turnover_date   LOS
0  1073    5/4/2014    Man  Black       12.56  4.666667         1     1/14/2017   986
1  1598   9/30/2016    Man  Black       17.68  3.000000         1      5/7/2018   584
2  1742   3/14/2018    Man  Black       12.56  4.333333         0           NaN   293
3  1125    9/7/2014    Man  Black       15.15  3.000000         1     4/13/2017   949
4  1676   4/19/2017    Man  Black       14.67  3.333333         0           NaN   622
5  1420  12/25/2015    Man  Black       15.91  4.333333         1     7/31/2018   94…
6  1443   1/20/2016    Man  Black       15.15  3.333333         2     10/2/2018   98…
7  1698   6/18/2017    Man  Black       14.49  4.666667         1    12/18/2018   54…
8  1661   2/23/2017    Man  Black       17.15  3.666667         1     8/25/2017   18…
9  1347   9/13/2015    Man  Black       14.59  3.666667         0           NaN  120…

In [5]: ## normalize names

df.columns = [c.strip().lower() for c in df.columns]  # normalize names
print("columns:", df.columns.tolist())

columns: ['id', 'start_date', 'gender', 'race', 'pay_hourly', 'pay_sat', 'turnover', 'turnover_date', 'los']

In [6]: ## Data types

df.dtypes


Out[6]: id int64
start_date object
gender object
race object
pay_hourly float64
pay_sat float64
turnover int64
turnover_date object
los int64
dtype: object
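start_date and turnover_date come in as plain strings (object dtype). They are not strictly needed below, since los already encodes the observed duration in days, but if they were, a minimal parsing sketch (assuming the M/D/YYYY format visible above):

## Optional: parse the date columns (assumes M/D/YYYY strings as shown above)
df["start_date"] = pd.to_datetime(df["start_date"], format="%m/%d/%Y")
df["turnover_date"] = pd.to_datetime(df["turnover_date"], format="%m/%d/%Y", errors="coerce")  # NaN -> NaT for censored rows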

Clean data & coerce event to binary safely

In [7]: ## Select or detect duration/event columns (explicit choice recommended)

DURATION = "los"    # <-- set according to your file
EVENT = "turnover"  # <-- set according to your file

## Safe coercion of event column to 0/1

def coerce_event(series):
    s = series.copy()
    # handle booleans and numeric
    if pd.api.types.is_bool_dtype(s):
        return s.astype("Int64")  # nullable integer
    if pd.api.types.is_numeric_dtype(s):
        vals = sorted(pd.unique(s.dropna()))
        if set(vals).issubset({0, 1}):
            return s.astype("Int64")
        if len(vals) == 2:
            ## map min -> 0, max -> 1
            return (s == vals[-1]).astype("Int64")
    ## common strings mapping
    yes = {"yes", "y", "true", "t", "event", "death", "left", "1"}
    no = {"no", "n", "false", "f", "censored", "stayed", "0"}
    def map_val(v):
        if pd.isna(v): return np.nan
        v2 = str(v).strip().lower()
        if v2 in yes: return 1
        if v2 in no: return 0
        return np.nan  # anything unrecognized (e.g. turnover == 2) becomes missing
    mapped = s.map(map_val).astype("Int64")
    return mapped

## Apply cleaning
df[DURATION] = pd.to_numeric(df[DURATION], errors="coerce")
df[EVENT] = coerce_event(df[EVENT])
df = df.dropna(subset=[DURATION, EVENT]) # remove rows missing core info
df = df[df[DURATION] >= 0] # ensure non-negative times
df[EVENT] = df[EVENT].astype(int) # finally cast to int (0/1)
print("n rows after cleaning:", len(df))

n rows after cleaning: 685

In [8]: df.head()


Out[8]:
     id  start_date gender   race  pay_hourly   pay_sat  turnover turnover_date  los
0  1073    5/4/2014    Man  Black       12.56  4.666667         1     1/14/2017  986
1  1598   9/30/2016    Man  Black       17.68  3.000000         1      5/7/2018  584
2  1742   3/14/2018    Man  Black       12.56  4.333333         0           NaN  293
3  1125    9/7/2014    Man  Black       15.15  3.000000         1     4/13/2017  949
4  1676   4/19/2017    Man  Black       14.67  3.333333         0           NaN  622

Descriptive summary (counts, event rate, time range)

In [10]: ## Descriptives
n = len(df)
events = df[EVENT].sum()
censored = n - events
event_rate = events / n
time_min = df[DURATION].min()
time_median = df[DURATION].median()
time_max = df[DURATION].max()

summary = {
"n": int(n),
"events": int(events),
"censored": int(censored),
"event_rate": float(event_rate),
"min_time": float(time_min),
"median_time": float(time_median),
"max_time": float(time_max)
}
summary

Out[10]: {'n': 685,
 'events': 463,
 'censored': 222,
 'event_rate': 0.6759124087591241,
 'min_time': 73.0,
 'median_time': 913.0,
 'max_time': 1935.0}

Interpretation: basic cohort metrics. event_rate gives the proportion of observed events
(failures) as opposed to censored observations; if event counts are very low, survival
estimates will be unstable and the Cox model will be underpowered.
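Before stratifying by group later on, it also helps to check that each group has enough events; a minimal sketch of per-group event counts, assuming the gender and race columns seen above:

## Events per group (sketch; uses the cleaned df, DURATION and EVENT from above)
for g in ("gender", "race"):
    if g in df.columns:
        print(df.groupby(g)[EVENT].agg(n="count", events="sum", event_rate="mean"), "\n")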

Kaplan–Meier overall (two options)

Option A: a manual estimator (NumPy)

In [41]: # Define KM estimator

def km_estimator(times, events):
    times = np.asarray(times)
    events = np.asarray(events)

    # sort by time
    order = np.argsort(times)
    t_sorted = times[order]
    e_sorted = events[order]

    # unique event times
    uniq = np.unique(t_sorted[e_sorted == 1])

    survival = []
    at_risk = []
    events_at_t = []
    s = 1.0

    for ut in uniq:
        # number at risk just before ut
        risk = np.sum(t_sorted >= ut)
        # number of events at ut
        d = np.sum((t_sorted == ut) & (e_sorted == 1))

        if risk > 0:
            s *= (1 - d / risk)

        survival.append(s)
        at_risk.append(risk)
        events_at_t.append(d)

    return uniq, np.array(survival), np.array(at_risk), np.array(events_at_t)

# Apply on dataset
DURATION = "los"
EVENT = "turnover"

times = df[DURATION].values
events = df[EVENT].values

uniq_times, surv_vals, at_risk, ev_counts = km_estimator(times, events)

# Build life table
life_table = pd.DataFrame({
    "time": uniq_times,
    "at_risk": at_risk,
    "events": ev_counts,
    "survival": surv_vals
})

print(life_table.head(10))

   time  at_risk  events  survival
0    73      685      12  0.982482
1   110      666       1  0.981007
2   146      665       3  0.976581
3   183      661       4  0.970671
4   219      656       3  0.966232
5   256      652       4  0.960304
6   292      644       1  0.958813
7   329      640       1  0.957315
8   365      637       6  0.948298
9   402      626       7  0.937694

In [42]: ## Kaplan–Meier Survival Plot

plt.figure(figsize=(8,6))
plt.step(uniq_times, surv_vals, where="post", label="KM Estimate", color="blue")

## Formatting
plt.title("Kaplan–Meier Survival Curve", fontsize=14)
plt.xlabel("Time (LOS)", fontsize=12)
plt.ylabel("Survival Probability S(t)", fontsize=12)
plt.ylim(0, 1.05)
plt.grid(True, linestyle="--", alpha=0.6)
plt.legend()

plt.show()
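Option B: using lifelines (recommended)

The same curve can be obtained from lifelines' KaplanMeierFitter, which also provides confidence intervals; a minimal sketch, assuming the cleaned df, DURATION and EVENT from above:

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(df[DURATION], event_observed=df[EVENT], label="KM Estimate")
ax = kmf.plot_survival_function()  # step plot with a 95% CI band
ax.set_title("Kaplan–Meier Survival Curve (lifelines)")
ax.set_xlabel("Time (LOS)"); ax.set_ylabel("Survival Probability S(t)")
plt.show()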

KM stratified by groups (example: gender, race)

In [24]: group_cols = [c for c in ("gender","race") if c in df.columns]

## using lifelines (if available)

from lifelines import KaplanMeierFitter
for g in group_cols:
    kmf = KaplanMeierFitter()
    plt.figure()
    for level in df[g].astype("category").cat.categories:
        mask = df[g] == level
        kmf.fit(df.loc[mask, DURATION], event_observed=df.loc[mask, EVENT], label=str(level))
        kmf.plot(ci_show=False)
    plt.title(f"KM by {g}")
    plt.xlabel("Time"); plt.ylabel("Survival")
    plt.savefig(f"km_by_{g}.png", bbox_inches="tight")
    plt.show()


Log-rank test (compare survival across groups)

Using lifelines (pairwise)

In [26]: from lifelines.statistics import logrank_test

## compare two groups A vs B for group g

g = "gender"  # choose an available group

levels = df[g].astype("category").cat.categories
pairs = []
for i in range(len(levels)):
    for j in range(i+1, len(levels)):
        A = df[df[g] == levels[i]]
        B = df[df[g] == levels[j]]
        res = logrank_test(A[DURATION], B[DURATION],
                           event_observed_A=A[EVENT], event_observed_B=B[EVENT])
        pairs.append({
            "group": g,
            "level_A": levels[i],
            "level_B": levels[j],
            "p_value": float(res.p_value),
            "test_statistic": float(res.test_statistic)
        })
pairs

Out[26]: [{'group': 'gender',
  'level_A': 'Man',
  'level_B': 'Woman',
  'p_value': 0.6864647030647238,
  'test_statistic': 0.1629392601399209}]
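The Man/Woman comparison is far from significant (p ≈ 0.69). For a factor with more than two levels, such as race, lifelines also offers a joint k-sample log-rank test; a minimal sketch using the same df:

from lifelines.statistics import multivariate_logrank_test

## Joint log-rank test across all levels of one grouping column
res = multivariate_logrank_test(df[DURATION], df["race"], df[EVENT])
print("statistic:", res.test_statistic, "p-value:", res.p_value)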

Cox proportional hazards model (multivariable)

Using lifelines

In [39]: from lifelines import CoxPHFitter

## Prepare covariates: numeric columns + one-hot small categoricals

work = df.copy()
exclude = {DURATION, EVENT, "id", "start_date", "turnover_date"}  # keep as needed
covariates = []
for c in work.columns:
    if c in exclude: continue
    if pd.api.types.is_numeric_dtype(work[c]):
        covariates.append(c)
    else:
        # add one-hot dummies for small-cardinality categoricals
        if work[c].nunique(dropna=True) <= 10:
            dummies = pd.get_dummies(work[c], prefix=c, drop_first=True)
            work = pd.concat([work, dummies], axis=1)
            covariates += list(dummies.columns)

model_df = work[[DURATION, EVENT] + covariates].dropna()

print("Cox model n=", len(model_df), "num covariates=", len(covariates))

cph = CoxPHFitter()
cph.fit(model_df, duration_col=DURATION, event_col=EVENT)
cph.print_summary()  # coefficients, HR, p-values, 95% CI

Cox model n= 650 num covariates= 5

model                      lifelines.CoxPHFitter
duration col               'los'
event col                  'turnover'
baseline estimation        breslow
number of observations     650
number of events observed  445
partial log-likelihood     -2437.89
time fit was run           2025-09-01 … UTC

                      coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to
gender_Woman         -0.04       0.96      0.10           -0.23            0.15                 0.79                 1.16    0.00
race_HispanicLatino   0.07       1.07      0.16           -0.24            0.38                 0.79                 1.46    0.00
race_White           -0.07       0.93      0.10           -0.28            0.13                 0.76                 1.14    0.00
pay_hourly           -0.12       0.89      0.03           -0.18           -0.05                 0.83                 0.95    0.00
pay_sat               0.13       1.14      0.08           -0.03            0.29                 0.97                 1.33    0.00

Concordance                0.55
Partial AIC                4885.78
log-likelihood ratio test  17.33 on 5 df
-log2(p) of ll-ratio test  8.00

In [33]: ## extract hazard ratios & save

summary = cph.summary.copy()
summary["HR"] = np.exp(summary["coef"])
summary

Out[33]:
                         coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  ...
covariate
gender_Woman        -0.040394   0.960411  0.097361       -0.231217        0.150429             0.793567  ...
race_HispanicLatino  0.071419   1.074031  0.157860       -0.237981        0.380820             0.788217  ...
race_White          -0.074764   0.927962  0.103886       -0.278376        0.128848             0.757012  ...
pay_hourly          -0.116725   0.889830  0.032939       -0.181283       -0.052167             0.834199  ...
pay_sat              0.129821   1.138625  0.080736       -0.028418        0.288060             0.971982  ...


In [34]: ## baseline survival

cph.baseline_survival_.plot()
plt.title("Cox baseline survival")
plt.show()

In [35]: ## Concordance index


print("Concordance index:", cph.concordance_index_)

Concordance index: 0.5509943292436867
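Interpretation: a concordance index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so 0.55 means these covariates give the model only weak discriminative ability.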

Check proportional hazards assumption

In [36]: ## Check proportional hazards (lifelines)


cph.check_assumptions(model_df, p_value_threshold=0.05, show_plots=True)

Bootstrapping lowess lines. May take a moment...

Bootstrapping lowess lines. May take a moment...

Bootstrapping lowess lines. May take a moment...

Bootstrapping lowess lines. May take a moment...

Bootstrapping lowess lines. May take a moment...

Proportional hazard assumption looks okay.

[Link] Analysis - Project [Link] 9/13


9/1/25, 11:27 PM Survival Analysis - Project one

Out[36]: [[<Axes: xlabel='rank-transformed time\n(p=0.0707)'>,
  <Axes: xlabel='km-transformed time\n(p=0.0625)'>],
 [<Axes: xlabel='rank-transformed time\n(p=0.4316)'>,
  <Axes: xlabel='km-transformed time\n(p=0.5350)'>],
 [<Axes: xlabel='rank-transformed time\n(p=0.9220)'>,
  <Axes: xlabel='km-transformed time\n(p=0.4784)'>],
 [<Axes: xlabel='rank-transformed time\n(p=0.4284)'>,
  <Axes: xlabel='km-transformed time\n(p=0.4532)'>],
 [<Axes: xlabel='rank-transformed time\n(p=0.4763)'>,
  <Axes: xlabel='km-transformed time\n(p=0.4112)'>]]
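All per-covariate p-values are above 0.05, matching the "looks okay" verdict. Had a covariate failed the test, one common remedy (not used in this analysis) is to refit with that variable as a stratum; a minimal sketch, assuming hypothetically that gender_Woman were the offender:

## Hypothetical remedy: stratify on a covariate that violates PH
cph_strat = CoxPHFitter()
cph_strat.fit(model_df, duration_col=DURATION, event_col=EVENT, strata=["gender_Woman"])
cph_strat.print_summary()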


Diagnostics & multicollinearity (optional but important)

In [37]: ## Variance Inflation Factor (VIF) to detect multicollinearity

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = model_df[covariates].astype(float)
X = X.dropna()  # model_df is already complete-case, so this is a safeguard
X_const = sm.add_constant(X)
vif = pd.DataFrame({
    "variable": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]
})
print(vif.sort_values("VIF", ascending=False).head(10))

variable VIF
0 const 135.237148
3 race_White 1.169985
2 race_HispanicLatino 1.138402
4 pay_hourly 1.056765
1 gender_Woman 1.019376
5 pay_sat 1.002665
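Interpretation: the large VIF on const belongs to the intercept and can be ignored; every covariate VIF is close to 1, so multicollinearity is not a concern for this model.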
