Exploratory Data Analysis

The document outlines an exploratory data analysis (EDA) of the Haberman dataset, which contains information about patients who underwent surgery for breast cancer. It highlights the imbalanced survival status of patients, with more survivors than non-survivors, and provides various statistical analyses, visualizations, and insights regarding patient age, operation year, and positive axillary nodes. The analysis reveals that patients with fewer positive axillary nodes are more likely to survive, while visualizations indicate a lack of clear distinctions between survival status based on age and operation year.

Uploaded by

verma.shubham.01.05.08


Exploratory Data Analysis (EDA)

Importing libraries and loading data

In [2]:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy.stats as stats
    import warnings

    # header=0 treats the first line of the file as a header row;
    # the column names are then overwritten below.
    df = pd.read_csv("D:/Downloads/archive (21)/haberman.csv", header=0)
    df.columns = ['patient_age', 'operation_year', 'positive_axillary_nodes', 'survival_status']
    df

Out[2]:
         patient_age  operation_year  positive_axillary_nodes  survival_status
    0             30              62                        3                1
    1             30              65                        0                1
    2             31              59                        2                1
    3             31              65                        4                1
    4             33              58                       10                1
    ...
    300           75              62                        1                1
    301           76              67                        0                1
    302           77              65                        3                1
    303           78              65                        1                2
    304           83              58                        2                2

    305 rows x 4 columns

Data understanding

In [3]:
    df.shape

Out[3]:
    (305, 4)

In [4]:
    df['survival_status'].value_counts()

Out[4]:
    survival_status
    1    224
    2     81
    Name: count, dtype: int64

The dataset is imbalanced: out of 305 patients, 224 survived and 81 died within 5 years, so survivors outnumber non-survivors by nearly 3 to 1.

In [5]:
    df.info()

    RangeIndex: 305 entries, 0 to 304
    Data columns (total 4 columns):
     #   Column                   Non-Null Count  Dtype
     0   patient_age              305 non-null    int64
     1   operation_year           305 non-null    int64
     2   positive_axillary_nodes  305 non-null    int64
     3   survival_status          305 non-null    int64
    dtypes: int64(4)
    memory usage: 9.7 KB

In [6]:
    # output shows all integer, non-null values

In [7]:
    # Recode the class label: 1 = survived 5 years or longer, 2 = died within 5 years
    df['survival_status'] = df['survival_status'].map({1: 'yes', 2: 'no'})
    df

Out[7]:
         patient_age  operation_year  positive_axillary_nodes  survival_status
    0             30              62                        3              yes
    1             30              65                        0              yes
    2             31              59                        2              yes
    3             31              65                        4              yes
    4             33              58                       10              yes
    ...
    300           75              62                        1              yes
    301           76              67                        0              yes
    302           77              65                        3              yes
    303           78              65                        1               no
    304           83              58                        2               no

    305 rows x 4 columns

In [8]:
    df.describe()

Out[8]:
           patient_age  operation_year  positive_axillary_nodes
    count   305.000000      305.000000               305.000000
    mean     52.531148       62.849180                 4.036066
    std      10.744024        3.254078                 7.199370
    min      30.000000       58.000000                 0.000000
    25%      44.000000       60.000000                 0.000000
    50%      52.000000       63.000000                 1.000000
    75%      61.000000       66.000000                 4.000000
    max      83.000000       69.000000                52.000000

Observations:
- The oldest patient was operated on at age 83.
- The average number of positive axillary nodes detected is about 4.
- 50th percentile: the median number of positive axillary nodes is 1.
- 75th percentile: 75% of the patients have 4 or fewer nodes detected.
- For positive_axillary_nodes the mean (about 4) is well above the median (1). A mean larger than the median suggests a right-skewed distribution in which a few very large values pull the mean up. This indicates potential outliers, but it is not conclusive proof.

Class-wise statistical analysis

In [9]:
    survival_yes = df[df['survival_status'] == 'yes']
    survival_yes.describe()

Out[9]:
           patient_age  operation_year  positive_axillary_nodes
    count   224.000000      224.000000               224.000000
    mean     52.116071       62.857143                 2.799107
    std      10.937446        3.229231                 5.882237
    min      30.000000       58.000000                 0.000000
    25%      43.000000       60.000000                 0.000000
    50%      52.000000       63.000000                 0.000000
    75%      60.000000       66.000000                 3.000000
    max      77.000000       69.000000                46.000000

In [10]:
    survival_no = df[df['survival_status'] == 'no']
    survival_no.describe()

Out[10]:
           patient_age  operation_year  positive_axillary_nodes
    count    81.000000       81.000000                81.000000
    mean     53.679012       62.827160                 7.456790
    std      10.167137        3.342118                 9.185654
    min      34.000000       58.000000                 0.000000
    25%      46.000000       59.000000                 1.000000
    50%      53.000000       63.000000                 4.000000
    75%      61.000000       65.000000                11.000000
    max      83.000000       69.000000                52.000000

1. The mean age and the year in which the patient is operated on are nearly the same in both classes.
2. Patients who died within 5 years had, on average, about 4 to 5 more positive axillary nodes than patients who survived (7.46 vs 2.80).

Uni-variate data analysis

In [11]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence library warnings for this cell
        sns.FacetGrid(df, hue="survival_status").map(sns.distplot, "patient_age").add_legend()
        plt.figure(figsize=(15, 8))  # note: this opens a new, empty figure; it does not resize the grid
        plt.show()

[Figure: density of patient_age, coloured by survival_status]
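The mean-versus-median comparison for positive axillary nodes made above can also be checked numerically. The following is a minimal sketch, not the notebook's own code: it runs on a small made-up sample instead of haberman.csv, and `iqr_outliers` is a hypothetical helper that applies Tukey's 1.5 x IQR rule, which is the same rule a box plot uses to draw isolated points.

```python
import pandas as pd

# Hypothetical right-skewed sample standing in for positive_axillary_nodes;
# these values are illustrative and not taken from haberman.csv.
nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 4, 8, 30], name="positive_axillary_nodes")


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]


mean_nodes = nodes.mean()
median_nodes = nodes.median()
skewness = nodes.skew()          # sample skewness; positive means a long right tail
outliers = iqr_outliers(nodes)

print(f"mean={mean_nodes:.2f} median={median_nodes:.2f} skew={skewness:.2f}")
print("flagged by the IQR rule:", outliers.tolist())
```

In the notebook the same calls could be applied to `df['positive_axillary_nodes']`. A positive skewness together with flagged values would support the reading that a few large node counts pull the mean above the median.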

Among all age groups, patients aged 40-60 are the most numerous.

In [12]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        sns.FacetGrid(df, hue="survival_status").map(sns.distplot, "operation_year")
        plt.show()

[Figure: density of operation_year, coloured by survival_status]

There is a large overlap between the class labels, so no firm conclusion about survival status can be drawn from the operation year or the patient's age alone.

Number of positive axillary nodes

In [13]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        g = sns.FacetGrid(df, hue="survival_status")
        g.map(sns.distplot, "positive_axillary_nodes", kde=True)
        g.add_legend()
        plt.figure(figsize=(12, 6))  # note: this opens a new, empty figure; it does not resize the grid
        plt.show()

[Figure: density of positive_axillary_nodes, coloured by survival_status]
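The node distribution plotted above can also be summarized as a table of survival shares on either side of a node-count threshold. This is a minimal sketch under stated assumptions: the rows are made up for illustration, the threshold of 4 is chosen only to match the 75th percentile reported earlier, and the names `sample`, `node_group` and `survival_share` are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from haberman.csv.
sample = pd.DataFrame({
    "positive_axillary_nodes": [0, 0, 1, 2, 4, 5, 9, 13, 23, 0],
    "survival_status": ["yes", "yes", "yes", "no", "yes",
                        "no", "no", "yes", "no", "yes"],
})

# Label each patient by whether the node count is at most 4.
sample["node_group"] = (
    sample["positive_axillary_nodes"].le(4).map({True: "<=4", False: ">4"})
)

# Row-normalised crosstab: each row sums to 1, so the cells are the
# share of survivors and non-survivors within each node group.
survival_share = pd.crosstab(
    sample["node_group"], sample["survival_status"], normalize="index"
)
print(survival_share.round(2))
```

Replacing `sample` with the notebook's `df` would give the actual shares; a table like this puts a number on what the overlapping density curves only suggest.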

Patients having 4 or fewer positive axillary nodes: a large majority of these patients survived 5 years.

Box plot

The box plot, commonly referred to as a box and whisker plot, is a visual summary of a distribution based on five key metrics: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum data values.

In [14]:
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 3, 1)
    sns.boxplot(x='survival_status', y='patient_age', data=df)
    plt.subplot(1, 3, 2)
    sns.boxplot(x='survival_status', y='operation_year', data=df)
    plt.subplot(1, 3, 3)
    sns.boxplot(x='survival_status', y='positive_axillary_nodes', data=df)
    plt.show()

[Figure: box plots of patient_age, operation_year and positive_axillary_nodes by survival_status]

The patient age and operation year plots show similar statistics for both classes. The isolated points in the box plot of positive axillary nodes are the outliers in the data. Such a high number of outliers is to be expected in medical datasets.

Violin plot

A violin plot displays the same information as the box and whisker plot; additionally, it shows a density-smoothed plot of the underlying distribution.

In [15]:
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 3, 1)
    sns.violinplot(x='survival_status', y='patient_age', data=df)
    plt.subplot(1, 3, 2)
    sns.violinplot(x='survival_status', y='operation_year', data=df)
    plt.subplot(1, 3, 3)
    sns.violinplot(x='survival_status', y='positive_axillary_nodes', data=df)
    plt.show()

[Figure: violin plots of patient_age, operation_year and positive_axillary_nodes by survival_status]

In the violin plot for positive axillary nodes, the distribution is highly skewed for the 'yes' class label and moderately skewed for the 'no' label.

Bi-variate data analysis

In [16]:
    # pair plot
    sns.set_style('whitegrid')
    sns.pairplot(df, hue='survival_status')
    plt.show()

[Figure: pair plot of patient_age, operation_year and positive_axillary_nodes, coloured by survival_status]

As the pair plot shows, the classes overlap heavily for every pair of features, so no clear distinction can be made between the class labels.

Joint plot

While the pair plot gives a visual overview of all pairwise relationships, the joint plot shows a single bivariate plot together with the univariate marginal distributions.

In [17]:
    sns.jointplot(x="patient_age", y="positive_axillary_nodes", data=df)
    plt.show()

[Figure: joint plot of patient_age against positive_axillary_nodes]

The pair plot and the joint plot reveal no correlation between the patient's age and the number of positive axillary nodes detected. The histogram on the top edge indicates that patients are more likely to be operated on between the ages of 40 and 60 than in other age groups. The histogram on the right edge indicates that the majority of patients had fewer than 4 positive axillary nodes.

In [18]:
    sns.jointplot(x="patient_age", y="positive_axillary_nodes", data=df, hue="survival_status")
    plt.show()

[Figure: joint plot of patient_age against positive_axillary_nodes, coloured by survival_status]

Heatmap

In [19]:
    plt.figure(figsize=(8, 6))
    sns.heatmap(df.iloc[:, 0:3].corr(), cmap="YlGnBu", annot=True)
    plt.show()

[Figure: correlation heatmap of patient_age, operation_year and positive_axillary_nodes]

The off-diagonal values are nearly 0 for every pair, so there is no linear correlation between any pair of variables.

Multivariate analysis

A contour plot represents a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format (3D to 2D).

In [20]:
    # kind='kde' draws density contours
    sns.jointplot(x='patient_age', y='operation_year', data=df, kind='kde')
    plt.show()

[Figure: contour (density) plot of patient_age against operation_year]

The years 1959-1964 saw more patients in the 45-55 age group.
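The heatmap above is built from `DataFrame.corr()`, which computes Pearson coefficients by default and therefore measures only linear association. Because the node counts are heavily skewed, a rank-based (Spearman) matrix may be a useful cross-check. The sketch below is an illustration on made-up rows, not a result from the Haberman data; the frame `toy` and its values are assumptions.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from haberman.csv.
toy = pd.DataFrame({
    "patient_age": [30, 34, 41, 45, 52, 57, 61, 66, 70, 78],
    "operation_year": [64, 59, 65, 60, 69, 58, 63, 66, 61, 67],
    "positive_axillary_nodes": [3, 0, 10, 0, 1, 25, 0, 4, 0, 1],
})

# Pearson: linear association (the default used for the heatmap).
pearson = toy.corr(method="pearson")

# Spearman: Pearson correlation of the ranks, so it is less sensitive
# to extreme values such as very large node counts.
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

If both matrices stay close to zero off the diagonal on the real data, that strengthens the conclusion that the features are not related, whether linearly or monotonically.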
