Survival Analysis - Project one
Load required modules
In [1]: import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
[Link]("ignore")
Load the dataset
In [2]: df = pd.read_csv("C:/Users/ADMIN/Desktop/Data Science/Datasets/survival/Survival…")
Inspect the data
In [4]: ## First few observations
df.head(10)
Out[4]:    id  Start_date   Gender  Race   Pay_hourly  Pay_sat   Turnover  Turnover_date  LOS
        0  1073  5/4/2014    Man    Black  12.56       4.666667  1         1/14/2017      986
        1  1598  9/30/2016   Man    Black  17.68       3.000000  1         5/7/2018       584
        2  1742  3/14/2018   Man    Black  12.56       4.333333  0         NaN            293
        3  1125  9/7/2014    Man    Black  15.15       3.000000  1         4/13/2017      949
        4  1676  4/19/2017   Man    Black  14.67       3.333333  0         NaN            622
        5  1420  12/25/2015  Man    Black  15.91       4.333333  1         7/31/2018      949
        6  1443  1/20/2016   Man    Black  15.15       3.333333  2         10/2/2018      986
        7  1698  6/18/2017   Man    Black  14.49       4.666667  1         12/18/2018     548
        8  1661  2/23/2017   Man    Black  17.15       3.666667  1         8/25/2017      183
        9  1347  9/13/2015   Man    Black  14.59       3.666667  0         NaN            1206
In [5]: ## normalize names
df.columns = [c.strip().lower() for c in df.columns]
print("columns:", df.columns.tolist())
columns: ['id', 'start_date', 'gender', 'race', 'pay_hourly', 'pay_sat', 'turnover', 'turnover_date', 'los']
In [6]: ## Data types
df.dtypes
Out[6]: id int64
start_date object
gender object
race object
pay_hourly float64
pay_sat float64
turnover int64
turnover_date object
los int64
dtype: object
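Both date columns are plain strings (object dtype). An optional check, not in the original run, is to parse them with pd.to_datetime and confirm that los is measured in days; a minimal sketch assuming the M/D/YYYY format shown above:

# Hypothetical consistency check: los should equal the day span for event rows
start = pd.to_datetime(df["start_date"], format="%m/%d/%Y")
end = pd.to_datetime(df["turnover_date"], format="%m/%d/%Y")  # NaT for censored rows
gap_days = (end - start).dt.days
print((gap_days - df["los"]).abs().max())  # near zero if los is in days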
Clean data & coerce event to binary safely
In [7]: ## Select or detect duration/event columns (explicit choice recommended)
DURATION = "los"       # <-- set according to your file
EVENT = "turnover"     # <-- set according to your file

## Safe coercion of event column to 0/1
def coerce_event(series):
    s = series.copy()
    # handle booleans and numeric
    if pd.api.types.is_bool_dtype(s):
        return s.astype("Int64")  # nullable integer
    if pd.api.types.is_numeric_dtype(s):
        vals = sorted(pd.unique(s.dropna()))
        if set(vals).issubset({0, 1}):
            return s.astype("Int64")
        if len(vals) == 2:
            ## map min -> 0, max -> 1
            return (s == vals[-1]).astype("Int64")
    ## common strings mapping
    yes = {"yes", "y", "true", "t", "event", "death", "left", "1"}
    no = {"no", "n", "false", "f", "censored", "stayed", "0"}
    def map_val(v):
        if pd.isna(v): return pd.NA
        v2 = str(v).strip().lower()
        if v2 in yes: return 1
        if v2 in no: return 0
        return pd.NA
    mapped = s.map(map_val).astype("Int64")
    return mapped

## Apply cleaning
df[DURATION] = pd.to_numeric(df[DURATION], errors="coerce")
df[EVENT] = coerce_event(df[EVENT])
df = df.dropna(subset=[DURATION, EVENT])  # remove rows missing core info
df = df[df[DURATION] >= 0]                # ensure non-negative times
df[EVENT] = df[EVENT].astype(int)         # finally cast to int (0/1)
print("n rows after cleaning:", len(df))
n rows after cleaning: 685
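Not in the original notebook, but a quick sanity check confirms the coercion left only 0/1 values; any unmapped value, such as the stray turnover value of 2 in the raw preview, becomes NA and is dropped, which is why the row count falls:

print(df[EVENT].value_counts(dropna=False))  # expect only 0 and 1 after cleaning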
In [8]: df.head()
Out[8]: id start_date gender race pay_hourly pay_sat turnover turnover_date los
0 1073 5/4/2014 Man Black 12.56 4.666667 1 1/14/2017 986
1 1598 9/30/2016 Man Black 17.68 3.000000 1 5/7/2018 584
2 1742 3/14/2018 Man Black 12.56 4.333333 0 NaN 293
3 1125 9/7/2014 Man Black 15.15 3.000000 1 4/13/2017 949
4 1676 4/19/2017 Man Black 14.67 3.333333 0 NaN 622
Descriptive summary (counts, event rate, time range)
In [10]: ## Descriptives
n = len(df)
events = df[EVENT].sum()
censored = n - events
event_rate = events / n
time_min = df[DURATION].min()
time_median = df[DURATION].median()
time_max = df[DURATION].max()
summary = {
    "n": int(n),
    "events": int(events),
    "censored": int(censored),
    "event_rate": float(event_rate),
    "min_time": float(time_min),
    "median_time": float(time_median),
    "max_time": float(time_max)
}
summary
Out[10]: {'n': 685,
'events': 463,
'censored': 222,
'event_rate': 0.6759124087591241,
'min_time': 73.0,
'median_time': 913.0,
'max_time': 1935.0}
Interpretation: basic cohort metrics. The event rate shows the proportion of observed events (failures) versus censored records; here 463 of 685 employees (68%) had an observed turnover event. If event counts are very low, Kaplan–Meier estimates will be imprecise and the Cox model will be underpowered.
Kaplan–Meier overall (two options)
Option A: manual Kaplan–Meier estimator (NumPy)
In [41]: # Define KM estimator
def km_estimator(times, events):
    times = np.asarray(times)
    events = np.asarray(events)
    # sort by time
    order = np.argsort(times)
    t_sorted = times[order]
    e_sorted = events[order]
    # unique event times
    uniq = np.unique(t_sorted[e_sorted == 1])
    survival = []
    at_risk = []
    events_at_t = []
    s = 1.0
    for ut in uniq:
        # number at risk just before ut
        risk = np.sum(t_sorted >= ut)
        # number of events at ut
        d = np.sum((t_sorted == ut) & (e_sorted == 1))
        if risk > 0:
            s *= (1 - d / risk)
        survival.append(s)
        at_risk.append(risk)
        events_at_t.append(d)
    return uniq, np.array(survival), np.array(at_risk), np.array(events_at_t)

# Apply on dataset
DURATION = "los"
EVENT = "turnover"
times = df[DURATION].values
events = df[EVENT].values
uniq_times, surv_vals, at_risk, ev_counts = km_estimator(times, events)

# Build life table
life_table = pd.DataFrame({
    "time": uniq_times,
    "at_risk": at_risk,
    "events": ev_counts,
    "survival": surv_vals
})
print(life_table.head(10))
time at_risk events survival
0 73 685 12 0.982482
1 110 666 1 0.981007
2 146 665 3 0.976581
3 183 661 4 0.970671
4 219 656 3 0.966232
5 256 652 4 0.960304
6 292 644 1 0.958813
7 329 640 1 0.957315
8 365 637 6 0.948298
9 402 626 7 0.937694
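Since S(t) is a step function, the KM median is the earliest time at which the estimated survival drops to 0.5 or below; a minimal sketch reading it off the life table built above:

below_half = life_table[life_table["survival"] <= 0.5]
km_median = below_half["time"].iloc[0] if len(below_half) else float("nan")  # NaN if S(t) never reaches 0.5
print("KM median survival time:", km_median)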
In [42]: ## Kaplan–Meier Survival Plot
plt.figure(figsize=(8,6))
plt.step(uniq_times, surv_vals, where="post", label="KM Estimate", color="blue")

## Formatting
plt.title("Kaplan–Meier Survival Curve", fontsize=14)
plt.xlabel("Time (LOS)", fontsize=12)
plt.ylabel("Survival Probability S(t)", fontsize=12)
plt.ylim(0, 1.05)
plt.grid(True, linestyle="--", alpha=0.6)
plt.legend()
plt.show()
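Option B: using lifelines (recommended)
The same overall curve via KaplanMeierFitter, which the stratified section below also uses, gives confidence intervals for free; a minimal sketch (the label string is illustrative):

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(df[DURATION], event_observed=df[EVENT], label="KM Estimate (lifelines)")
kmf.plot_survival_function()  # step curve with 95% confidence band
print("median survival:", kmf.median_survival_time_)
plt.title("Kaplan–Meier Survival Curve (lifelines)")
plt.xlabel("Time (LOS)"); plt.ylabel("Survival Probability S(t)")
plt.show()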
KM stratified by groups (example: gender, race)
In [24]: group_cols = [c for c in ("gender","race") if c in df.columns]
## using lifelines (if available)
from lifelines import KaplanMeierFitter
for g in group_cols:
    kmf = KaplanMeierFitter()
    plt.figure()
    for level in df[g].astype("category").cat.categories:
        mask = df[g] == level
        kmf.fit(df.loc[mask, DURATION], event_observed=df.loc[mask, EVENT], label=str(level))
        kmf.plot(ci_show=False)
    plt.title(f"KM by {g}")
    plt.xlabel("Time"); plt.ylabel("Survival")
    plt.savefig(f"km_by_{g}.png", bbox_inches="tight")
    plt.show()
[figures: Kaplan–Meier curves by gender and by race]
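As a follow-up, not in the original notebook, each group's median survival can be reported alongside the curves; kmf.median_survival_time_ is inf when a group's curve never falls below 0.5:

for g in group_cols:
    for level in df[g].astype("category").cat.categories:
        mask = df[g] == level
        kmf = KaplanMeierFitter()
        kmf.fit(df.loc[mask, DURATION], event_observed=df.loc[mask, EVENT])
        print(f"{g}={level}: median = {kmf.median_survival_time_}")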
Log-rank test (compare survival across groups)
Using lifelines (pairwise)
In [26]: from lifelines.statistics import logrank_test
## compare two groups A vs B for group g
g = "gender" # choose an available group
levels = df[g].astype("category").[Link]
pairs = []
for i in range(len(levels)):
for j in range(i+1, len(levels)):
A = df[df[g] == levels[i]]
B = df[df[g] == levels[j]]
res = logrank_test(A[DURATION], B[DURATION], event_observed_A=A[EVENT],
[Link]({
"group": g,
"level_A": levels[i],
"level_B": levels[j],
"p_value": float(res.p_value),
"test_statistic": float(res.test_statistic)
})
pairs
Out[26]: [{'group': 'gender',
'level_A': 'Man',
'level_B': 'Woman',
'p_value': 0.6864647030647238,
'test_statistic': 0.1629392601399209}]
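With p = 0.69, there is no evidence that survival differs by gender. When a grouping variable has more than two levels (race here has three), a single joint test avoids multiple pairwise comparisons; a minimal sketch with lifelines' multivariate_logrank_test:

from lifelines.statistics import multivariate_logrank_test

res_race = multivariate_logrank_test(df[DURATION], df["race"], df[EVENT])
print("race log-rank: statistic =", res_race.test_statistic, "p =", res_race.p_value)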
Cox proportional hazards model (multivariable)
Using lifelines
In [39]: from lifelines import CoxPHFitter
## Prepare covariates: numeric columns + one-hot small categoricals
work = df.copy()
exclude = {DURATION, EVENT, "id", "start_date", "turnover_date"}  # keep as needed
covariates = []
for c in work.columns:
    if c in exclude: continue
    if pd.api.types.is_numeric_dtype(work[c]):
        covariates.append(c)
    else:
        # add one-hot dummies for small-cardinality categoricals
        if work[c].nunique(dropna=True) <= 10:
            dummies = pd.get_dummies(work[c], prefix=c, drop_first=True)
            work = pd.concat([work, dummies], axis=1)
            covariates += list(dummies.columns)
model_df = work[[DURATION, EVENT] + covariates].dropna()
print("Cox model n=", len(model_df), "num covariates=", len(covariates))
cph = CoxPHFitter()
cph.fit(model_df, duration_col=DURATION, event_col=EVENT)
cph.print_summary()  # coefficients, HR, p-values, 95% CI
Cox model n= 650 num covariates= 5
model                      lifelines.CoxPHFitter
duration col               'los'
event col                  'turnover'
baseline estimation        breslow
number of observations     650
number of events observed  445
partial log-likelihood     -2437.89
time fit was run           2025-09-01 … UTC

                     coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%  cmp to
gender_Woman        -0.04       0.96      0.10           -0.23            0.15                 0.79                 1.16    0.00
race_HispanicLatino  0.07       1.07      0.16           -0.24            0.38                 0.79                 1.46    0.00
race_White          -0.07       0.93      0.10           -0.28            0.13                 0.76                 1.14    0.00
pay_hourly          -0.12       0.89      0.03           -0.18           -0.05                 0.83                 0.95    0.00
pay_sat              0.13       1.14      0.08           -0.03            0.29                 0.97                 1.33    0.00

Concordance                0.55
Partial AIC                4885.78
log-likelihood ratio test  17.33 on 5 df
-log2(p) of ll-ratio test  8.00
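Here pay_hourly is the only covariate whose 95% CI for exp(coef) excludes 1 (HR ≈ 0.89, roughly an 11% lower turnover hazard per extra dollar of hourly pay). The estimates can also be shown graphically; a minimal sketch using lifelines' built-in coefficient plot:

ax = cph.plot()  # horizontal 95% CIs for each coefficient on the log(HR) scale
ax.set_title("Cox coefficients (log hazard ratios)")
plt.show()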
In [33]: ## extract hazard ratios & save
summary = cph.summary.copy()
summary["HR"] = np.exp(summary["coef"])
summary
Out[33]:
covariate             coef       exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  …
gender_Woman         -0.040394   0.960411   0.097361  -0.231217        0.150429        0.793567            …
race_HispanicLatino   0.071419   1.074031   0.157860  -0.237981        0.380820        0.788217            …
race_White           -0.074764   0.927962   0.103886  -0.278376        0.128848        0.757012            …
pay_hourly           -0.116725   0.889830   0.032939  -0.181283       -0.052167        0.834199            …
pay_sat               0.129821   1.138625   0.080736  -0.028418        0.288060        0.971982            …
In [34]: ## baseline survival
cph.baseline_survival_.plot()
[Link]("Cox baseline survival")
[Link]()
In [35]: ## Concordance index
print("Concordance index:", cph.concordance_index_)
Concordance index: 0.5509943292436867
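Interpretation: a concordance of about 0.55 is only slightly above the 0.50 expected from random ranking, so these covariates have limited ability to discriminate who leaves earlier; this matches the modest log-likelihood ratio test above.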
Check proportional hazards assumption
In [36]: ## Check proportional hazards (lifelines)
cph.check_assumptions(model_df, p_value_threshold=0.05, show_plots=True)
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Bootstrapping lowess lines. May take a moment...
Proportional hazard assumption looks okay.
Out[36]: [[<Axes: xlabel='rank-transformed time\n(p=0.0707)'>,
<Axes: xlabel='km-transformed time\n(p=0.0625)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.4316)'>,
<Axes: xlabel='km-transformed time\n(p=0.5350)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.9220)'>,
<Axes: xlabel='km-transformed time\n(p=0.4784)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.4284)'>,
<Axes: xlabel='km-transformed time\n(p=0.4532)'>],
[<Axes: xlabel='rank-transformed time\n(p=0.4763)'>,
<Axes: xlabel='km-transformed time\n(p=0.4112)'>]]
[figures: scaled Schoenfeld residual plots against rank-transformed and KM-transformed time for each covariate]
Diagnostics & multicollinearity (optional but important)
In [37]: ## Variance Inflation Factor (VIF) to detect multicollinearity
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = model_df[covariates].astype(float)
X = X.dropna()
X_const = sm.add_constant(X)
vif = pd.DataFrame({
    "variable": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]
})
print(vif.sort_values("VIF", ascending=False).head(10))
variable VIF
0 const 135.237148
3 race_White 1.169985
2 race_HispanicLatino 1.138402
4 pay_hourly 1.056765
1 gender_Woman 1.019376
5 pay_sat 1.002665
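Interpretation: all covariate VIFs are close to 1, so multicollinearity is not a concern for the Cox model; the large VIF on const is an artifact of the intercept column and can be ignored.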