The document outlines an exploratory data analysis (EDA) of the Haberman dataset, which contains information about patients who underwent surgery for breast cancer. It highlights the imbalanced survival status of patients, with more survivors than non-survivors, and provides various statistical analyses, visualizations, and insights regarding patient age, operation year, and positive axillary nodes. The analysis reveals that patients with fewer positive axillary nodes are more likely to survive, while visualizations indicate a lack of clear distinctions between survival status based on age and operation year.
Exploratory Data Analysis (EDA)
Importing libraries and loading data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import warnings
df = pd.read_csv("D:/Downloads/archive (21)/haberman.csv", header=0)
df.columns = ['patient_age', 'operation_year', 'positive_axillary_nodes', 'survival_status']
df
     patient_age  operation_year  positive_axillary_nodes  survival_status
0             30              62                        3                1
1             30              65                        0                1
2             31              59                        2                1
3             31              65                        4                1
4             33              58                       10                1
..           ...             ...                      ...              ...
300           75              62                        1                1
301           76              67                        0                1
302           77              65                        3                1
303           78              65                        1                2
304           83              58                        2                2
305 rows x 4 columns
Data understanding
df.shape
(305, 4)
df['survival_status'].value_counts()
survival_status
1    224
2     81
Name: count, dtype: int64
The dataset is imbalanced: of the 305 patients, 224 survived and 81 died within 5 years, so survivors outnumber non-survivors by nearly 3 to 1.
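The imbalance can also be expressed as class shares and as a ratio. The sketch below is only an illustration: `status` is a hypothetical stand-in for df['survival_status'], rebuilt from the two class counts reported above rather than read from the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df['survival_status'], rebuilt from the class
# counts reported above (224 rows coded 1, 81 rows coded 2).
status = pd.Series([1] * 224 + [2] * 81, name="survival_status")

counts = status.value_counts()                 # absolute count per class
shares = status.value_counts(normalize=True)   # fraction of rows per class
ratio = counts.loc[1] / counts.loc[2]          # survivors per non-survivor

print(counts)
print(shares.round(3))
print(f"survivors per non-survivor: {ratio:.2f}")
```

On the real data frame the same two value_counts() calls can be applied directly to df['survival_status'].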
df.info()
RangeIndex: 305 entries, 0 to 304
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype
 0   patient_age              305 non-null    int64
 1   operation_year           305 non-null    int64
 2   positive_axillary_nodes  305 non-null    int64
 3   survival_status          305 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB
# The output shows that every column holds non-null integer values
df['survival_status'] = df['survival_status'].map({1: 'yes', 2: 'no'})
df
     patient_age  operation_year  positive_axillary_nodes survival_status
0             30              62                        3             yes
1             30              65                        0             yes
2             31              59                        2             yes
3             31              65                        4             yes
4             33              58                       10             yes
..           ...             ...                      ...             ...
300           75              62                        1             yes
301           76              67                        0             yes
302           77              65                        3             yes
303           78              65                        1              no
304           83              58                        2              no
305 rows x 4 columns
df.describe()
       patient_age  operation_year  positive_axillary_nodes
count   305.000000      305.000000               305.000000
mean     52.531148       62.849180                 4.036066
std      10.744024        3.254078                 7.199370
min      30.000000       58.000000                 0.000000
25%      44.000000       60.000000                 0.000000
50%      52.000000       63.000000                 1.000000
75%      61.000000       66.000000                 4.000000
max      83.000000       69.000000                52.000000
The oldest patient was operated on at the age of 83.
The average number of positive axillary nodes detected is about 4.
50th percentile: the median number of positive axillary nodes is 1.
75th percentile: 75% of the patients have 4 or fewer positive nodes.
There is a large gap between the mean and the median (50th percentile) of positive_axillary_nodes (mean about 4, median 1). A mean well above the median points to a right-skewed distribution in which a few very large values pull the mean upward. This indicates potential outliers, but it is not conclusive proof of them.
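The mean-versus-median check can be made explicit with a few lines of pandas. This is a minimal sketch on made-up values, not on the Haberman data: `nodes` is a hypothetical series chosen so that its mean is 4 and its median is 1, mirroring the figures quoted above.

```python
import pandas as pd

# Illustrative values only (not the Haberman data): mostly small counts
# plus a few large ones, mimicking a right-skewed node-count column.
nodes = pd.Series([0, 0, 0, 0, 1, 1, 2, 3, 4, 9, 24])

mean = nodes.mean()        # pulled upward by the few large values
median = nodes.median()    # robust to those large values
skewness = nodes.skew()    # sample skewness; > 0 means a longer right tail

print(f"mean={mean:.2f}  median={median:.2f}  skewness={skewness:.2f}")
```

A positive skewness together with mean > median supports the reading of a right-skewed column; on the real data the same three calls would be applied to df['positive_axillary_nodes'].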
Class-wise statistical analysis
survival_yes = df[df['survival_status'] == "yes"]
survival_yes.describe()
       patient_age  operation_year  positive_axillary_nodes
count   224.000000      224.000000               224.000000
mean     52.116071       62.857143                 2.799107
std      10.937446        3.229231                 5.882237
min      30.000000       58.000000                 0.000000
25%      43.000000       60.000000                 0.000000
50%      52.000000       63.000000                 0.000000
75%      60.000000       66.000000                 3.000000
max      77.000000       69.000000                46.000000
survival_no = df[df['survival_status'] == "no"]
survival_no.describe()
       patient_age  operation_year  positive_axillary_nodes
count    81.000000       81.000000                81.000000
mean     53.679012       62.827160                 7.456790
std      10.167137        3.342118                 9.185654
min      34.000000       58.000000                 0.000000
25%      46.000000       59.000000                 1.000000
50%      53.000000       63.000000                 4.000000
75%      61.000000       65.000000                11.000000
max      83.000000       69.000000                52.000000
1. The average age at which patients were operated on is nearly the same in both classes.
2. Patients who died within 5 years had, on average, about 4 to 5 more positive axillary nodes than patients who survived (mean 7.46 versus 2.80).
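The two describe() calls can be condensed into a single grouped summary. The sketch below assumes a tiny hypothetical frame, `sample`, that reuses the column names of df with made-up values; it illustrates the groupby pattern only, not the Haberman figures.

```python
import pandas as pd

# Tiny hypothetical frame with the same column names as df; values are made up.
sample = pd.DataFrame({
    "patient_age": [30, 45, 52, 61, 47, 66],
    "positive_axillary_nodes": [0, 1, 3, 0, 9, 13],
    "survival_status": ["yes", "yes", "yes", "yes", "no", "no"],
})

# One row per class, with the mean and median of each numeric column.
summary = (
    sample.groupby("survival_status")[["patient_age", "positive_axillary_nodes"]]
    .agg(["mean", "median"])
)

# Difference in mean node count between the two classes.
gap = (
    summary.loc["no", ("positive_axillary_nodes", "mean")]
    - summary.loc["yes", ("positive_axillary_nodes", "mean")]
)

print(summary)
print(f"mean node gap (no - yes): {gap:.1f}")
```

Grouping keeps both classes side by side in one table, which makes differences such as the node-count gap easier to read than two separate outputs.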
Uni-variate data analysis
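with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the deprecation warning raised by sns.distplot
    g = sns.FacetGrid(df, hue="survival_status")
    g.map(sns.distplot, "patient_age").add_legend()
    g.figure.set_size_inches(15, 8)  # resize the grid's own figure; plt.figure() would open a new, empty one
    plt.show()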
[Figure: density of patient_age by survival_status]
Among all age groups, patients aged 40-60 years are the most numerous.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sns.FacetGrid(df, hue="survival_status").map(sns.distplot, "operation_year")
    plt.show()
[Figure: density of operation_year by survival_status]
The two classes overlap heavily in both plots, so no clear conclusion about survival status can be drawn from the operation year or the patient's age alone.
Number of positive axillary nodes
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    g = sns.FacetGrid(df, hue="survival_status")
    g.map(sns.distplot, "positive_axillary_nodes", kde=True)
    g.add_legend()
    g.figure.set_size_inches(12, 6)  # resize the grid's own figure; plt.figure() would open a new, empty one
    plt.show()
[Figure: density of positive_axillary_nodes by survival_status]
A large majority of the patients with 4 or fewer positive axillary nodes survived 5 years.
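The observation about 4 or fewer nodes can be checked numerically by comparing survival rates on either side of that threshold. The sketch below uses a small hypothetical frame, `sample`, with made-up rows; the group labels and the `node_group` and `survived` columns are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration; column names follow df.
sample = pd.DataFrame({
    "positive_axillary_nodes": [0, 0, 1, 2, 4, 4, 5, 9, 13, 22],
    "survival_status": ["yes", "yes", "yes", "yes", "yes", "no",
                        "yes", "no", "no", "no"],
})

# Label each patient by node-count group, then flag survivors as True/False.
sample["node_group"] = np.where(
    sample["positive_axillary_nodes"] <= 4, "4 or fewer", "more than 4"
)
sample["survived"] = sample["survival_status"] == "yes"

# The mean of a boolean column is the fraction of True values,
# i.e. the survival rate within each group.
rate = sample.groupby("node_group")["survived"].mean()
print(rate)
```

Applied to df, a clearly higher rate in the "4 or fewer" group would back up the reading of the density plot with a number.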
Box plot
The box plot, commonly referred to as a box and whisker plot, is a visual summary of a variable's distribution based on five key metrics: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum data values.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.boxplot(x='survival_status', y='patient_age', data=df)
plt.subplot(1, 3, 2)
sns.boxplot(x='survival_status', y='operation_year', data=df)
plt.subplot(1, 3, 3)
sns.boxplot(x='survival_status', y='positive_axillary_nodes', data=df)
plt.show()
[Figure: box plots of patient_age, operation_year, and positive_axillary_nodes by survival_status]
The box plots for patient age and operation year show similar statistics for both classes.
The isolated points in the box plot of positive axillary nodes are the outliers in the data. A large number of such outliers is not unusual in medical datasets.
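The isolated points come from the box plot's whisker rule: with seaborn's default setting, any value further than 1.5 times the interquartile range (IQR) from the box is drawn as an individual point. The sketch below reproduces that rule on made-up values; `nodes` is a hypothetical series, not the Haberman column.

```python
import pandas as pd

# Illustrative node counts (not the Haberman data).
nodes = pd.Series([0, 0, 0, 1, 1, 2, 3, 4, 8, 30])

q1, q3 = nodes.quantile([0.25, 0.75])   # lower and upper quartiles
iqr = q3 - q1                           # interquartile range (height of the box)

lower_fence = q1 - 1.5 * iqr            # values below this are flagged
upper_fence = q3 + 1.5 * iqr            # values above this are flagged

outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]
print(f"fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```

The rule is a plotting convention rather than a statistical test, so flagged values are candidates for inspection, not errors by definition.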
Violin plot
A violin plot displays the same information as the box and whisker plot; additionally, it also shows the density-smoothed plot of the underlying distribution.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.violinplot(x='survival_status', y='patient_age', data=df)
plt.subplot(1, 3, 2)
sns.violinplot(x='survival_status', y='operation_year', data=df)
plt.subplot(1, 3, 3)
sns.violinplot(x='survival_status', y='positive_axillary_nodes', data=df)
plt.show()
[Figure: violin plots of patient_age, operation_year, and positive_axillary_nodes by survival_status]
In the violin plot for positive axillary nodes, the distribution is highly skewed for the 'yes' class label and moderately skewed for the 'no' label.
Bi-variate data analysis
# pair plot
sns.set_style('whitegrid')
sns.pairplot(df, hue="survival_status")
plt.show()
[Figure: pair plot of patient_age, operation_year, and positive_axillary_nodes, colored by survival_status]
As the pair plot shows, the two classes overlap heavily for every pair of features, so no clear distinction can be made between the class labels.
Joint plot
While the pair plot provides a visual insight into all pairwise relationships, the joint plot provides a bivariate plot with univariate marginal distributions.
sns.jointplot(x="patient_age", y="positive_axillary_nodes", data=df)
plt.show()
[Figure: joint plot of patient_age against positive_axillary_nodes with marginal histograms]
The pair plot and the joint plot reveal that there is no correlation
between the patient’s age and the number of positive axillary nodes
detected.
The histogram on the top edge indicates that patients are more likely to be operated on between the ages of 40 and 60 than in other age groups.
The histogram on the right edge indicates that the majority of patients had fewer than 4 positive axillary nodes.
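The visual impression of "no correlation" can be backed by a correlation coefficient and its p-value. The sketch below runs on synthetic data: `age` and `nodes` are hypothetical arrays drawn independently of each other, so they only demonstrate the calls and are not the Haberman columns.

```python
import numpy as np
from scipy import stats

# Synthetic, independently drawn data for illustration only.
rng = np.random.default_rng(0)
age = rng.integers(30, 84, size=200)   # hypothetical patient ages
nodes = rng.poisson(4, size=200)       # hypothetical node counts

# Pearson measures linear association; Spearman measures monotonic
# association and is less sensitive to outliers and skew.
r, p_value = stats.pearsonr(age, nodes)
rho, p_rho = stats.spearmanr(age, nodes)

print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
```

For a skewed column such as the node count, Spearman's coefficient is a reasonable companion to Pearson's, since a handful of extreme values can dominate the latter.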
sns.jointplot(x="patient_age", y="positive_axillary_nodes", data=df, hue="survival_status")
plt.show()
[Figure: joint plot of patient_age against positive_axillary_nodes, colored by survival_status]
Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df.iloc[:, 0:3].corr(), cmap="YlGnBu", annot=True)
plt.show()
[Figure: correlation heatmap of patient_age, operation_year, and positive_axillary_nodes]
The correlation values are close to 0 for every pair, so there is no appreciable linear correlation between any pair of variables.
Multivariate analysis
A contour plot represents a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format (3D to 2D).
sns.jointplot(x='patient_age', y='operation_year', data=df, kind="kde")
plt.show()
[Figure: contour plot of patient_age against operation_year]
The years 1959-1964 saw more patients in the 45-55 age group.
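The "constant z slices" idea behind the contour plot in this section can be sketched numerically: estimate a 2D density on a grid, then read off the region above a chosen height. The example below is a sketch under stated assumptions: `x` and `y` are synthetic stand-ins for age and operation year, and scipy's gaussian_kde is used as one possible density estimator, not necessarily the one seaborn applies internally.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for patient_age and operation_year (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=300)
y = rng.normal(63, 3, size=300)

# Fit a 2D kernel density estimate; gaussian_kde expects shape (dims, samples).
kde = stats.gaussian_kde(np.vstack([x, y]))

# Evaluate the density surface z = f(x, y) on a regular 40 x 40 grid.
xs = np.linspace(x.min(), x.max(), 40)
ys = np.linspace(y.min(), y.max(), 40)
grid_x, grid_y = np.meshgrid(xs, ys)
density = kde(np.vstack([grid_x.ravel(), grid_y.ravel()])).reshape(grid_x.shape)

# A contour at level c is the boundary of the region where density >= c.
# Here: the region enclosed by the contour at half the peak height.
inside = density >= 0.5 * density.max()

# The densest grid cell marks the peak of the surface.
row, col = np.unravel_index(density.argmax(), density.shape)
peak_x, peak_y = xs[col], ys[row]
print(f"peak near x={peak_x:.1f}, y={peak_y:.1f}; cells inside half-height contour: {inside.sum()}")
```

Reading a contour plot amounts to locating such enclosed regions: the innermost contours surround the combinations of values that occur most often.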