
Exploratory Data Analysis

The document outlines an exploratory data analysis (EDA) of the Haberman dataset, which contains information about patients who underwent surgery for breast cancer. It highlights the imbalanced survival status of patients, with more survivors than non-survivors, and provides various statistical analyses, visualizations, and insights regarding patient age, operation year, and positive axillary nodes. The analysis reveals that patients with fewer positive axillary nodes are more likely to survive, while visualizations indicate a lack of clear distinctions between survival status based on age and operation year.
Exploratory Data Analysis (EDA)

Importing libraries and loading data

In [2]:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy.stats as stats
    import warnings

    # header=0 uses the first row of the file as column names before they are
    # renamed below. If haberman.csv has no header row, that first record is
    # lost; header=None together with names=[...] would keep it.
    df = pd.read_csv("D:/Downloads/archive (21)/haberman.csv", header=0)
    df.columns = ['patient_age', 'operation_year', 'positive_axillary_nodes', 'survival_status']
    df

Out[2]:
         patient_age  operation_year  positive_axillary_nodes  survival_status
    0             30              62                        3                1
    1             30              65                        0                1
    2             31              59                        2                1
    3             31              65                        4                1
    4             33              58                       10                1
    ...
    300           75              62                        1                1
    301           76              67                        0                1
    302           77              65                        3                1
    303           78              65                        1                2
    304           83              58                        2                2

    305 rows x 4 columns

Data understanding

In [3]:
    df.shape

Out[3]:
    (305, 4)

In [4]:
    df['survival_status'].value_counts()

Out[4]:
    survival_status
    1    224
    2     81
    Name: count, dtype: int64

The dataset is imbalanced: of the 305 patients, 224 survived, roughly 2.8 times the 81 who died within 5 years.

In [5]:
    df.info()

    RangeIndex: 305 entries, 0 to 304
    Data columns (total 4 columns):
     #   Column                   Non-Null Count  Dtype
     0   patient_age              305 non-null    int64
     1   operation_year           305 non-null    int64
     2   positive_axillary_nodes  305 non-null    int64
     3   survival_status          305 non-null    int64
    dtypes: int64(4)
    memory usage: 9.7 KB

In [6]:
    # The output shows that every column holds non-null integer values.

In [7]:
    df['survival_status'] = df['survival_status'].map({1: 'yes', 2: 'no'})
    df

Out[7]:
         patient_age  operation_year  positive_axillary_nodes  survival_status
    0             30              62                        3              yes
    1             30              65                        0              yes
    2             31              59                        2              yes
    3             31              65                        4              yes
    4             33              58                       10              yes
    ...
    300           75              62                        1              yes
    301           76              67                        0              yes
    302           77              65                        3              yes
    303           78              65                        1               no
    304           83              58                        2               no

    305 rows x 4 columns

In [8]:
    df.describe()

Out[8]:
           patient_age  operation_year  positive_axillary_nodes
    count   305.000000      305.000000               305.000000
    mean     52.531148       62.849180                 4.036066
    std      10.744024        3.254078                 7.199370
    min      30.000000       58.000000                 0.000000
    25%      44.000000       60.000000                 0.000000
    50%      52.000000       63.000000                 1.000000
    75%      61.000000       66.000000                 4.000000
    max      83.000000       69.000000                52.000000

Patients were operated on at ages up to 83.
The average number of positive axillary nodes detected is about 4.
50th percentile: the median number of positive axillary nodes is 1.
75th percentile: 75% of the patients have 4 or fewer nodes detected.

For positive_axillary_nodes the mean (about 4) is well above the median (1). A mean greater than the median points to a right-skewed distribution in which a few large values pull the mean upward. This indicates potential outliers, but it is not conclusive proof of them.

Class-wise statistical analysis

In [9]:
    survival_yes = df[df['survival_status'] == 'yes']
    survival_yes.describe()

Out[9]:
           patient_age  operation_year  positive_axillary_nodes
    count   224.000000      224.000000               224.000000
    mean     52.116071       62.857143                 2.799107
    std      10.937446        3.229231                 5.882237
    min      30.000000       58.000000                 0.000000
    25%      43.000000       60.000000                 0.000000
    50%      52.000000       63.000000                 0.000000
    75%      60.000000       66.000000                 3.000000
    max      77.000000       69.000000                46.000000

In [10]:
    survival_no = df[df['survival_status'] == 'no']
    survival_no.describe()

Out[10]:
           patient_age  operation_year  positive_axillary_nodes
    count    81.000000       81.000000                81.000000
    mean     53.679012       62.827160                 7.456790
    std      10.167137        3.342118                 9.185654
    min      34.000000       58.000000                 0.000000
    25%      46.000000       59.000000                 1.000000
    50%      53.000000       63.000000                 4.000000
    75%      61.000000       65.000000                11.000000
    max      83.000000       69.000000                52.000000

1. The age and the year at which the patient was operated on are nearly the same in both classes.
2. Patients who died within 5 years had, on average, about 4 to 5 more positive axillary nodes than patients who survived.

Uni-variate data analysis

In [11]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # suppress warnings (sns.distplot is deprecated)
        sns.FacetGrid(df, hue="survival_status").map(sns.distplot, "patient_age").add_legend()
        # Note: plt.figure() opens a new, empty figure; it does not resize the
        # FacetGrid, whose size is set through its height= and aspect= arguments.
        plt.figure(figsize=(15, 8))
        plt.show()

[Figure: density of patient_age, colored by survival_status]
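A quick arithmetic check on the class-wise tables: the overall mean of a column must equal the size-weighted average of the two class means. The sketch below uses only the counts and means as read from the describe() outputs (224 'yes' rows, 81 'no' rows); these readings are assumptions, and the helper name pooled_mean is illustrative rather than part of the original notebook.

```python
# Class sizes and class-wise means, as read from the describe() tables
# (assumed readings of the notebook output).
N_YES, N_NO = 224, 81
means_yes = {
    "patient_age": 52.116071,
    "operation_year": 62.857143,
    "positive_axillary_nodes": 2.799107,
}
means_no = {
    "patient_age": 53.679012,
    "operation_year": 62.827160,
    "positive_axillary_nodes": 7.456790,
}


def pooled_mean(mean_a, n_a, mean_b, n_b):
    """Overall mean of two groups: the size-weighted average of their means."""
    return (mean_a * n_a + mean_b * n_b) / (n_a + n_b)


# Recombine the two class means into an overall mean for every column.
overall = {
    col: pooled_mean(means_yes[col], N_YES, means_no[col], N_NO)
    for col in means_yes
}

# Difference in the average node count between the two classes.
node_gap = means_no["positive_axillary_nodes"] - means_yes["positive_axillary_nodes"]

for col, value in overall.items():
    print(f"{col}: {value:.4f}")
print(f"extra nodes on average in the 'no' class: {node_gap:.2f}")
```

The recombined means come out at roughly 52.53, 62.85 and 4.04, in line with the overall df.describe() output, and the gap in mean node count between the classes is about 4.66, which supports the "4 to 5 more nodes" observation.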
Patients aged 40 to 60 form the largest age group.

In [12]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        sns.FacetGrid(df, hue="survival_status").map(sns.distplot, "operation_year")
        plt.show()

[Figure: density of operation_year, colored by survival_status]

The class labels overlap heavily, so no distinctive conclusion about survival status can be drawn from the operation year or the patient's age alone.

Number of positive axillary nodes

In [13]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        g = sns.FacetGrid(df, hue="survival_status")
        g.map(sns.distplot, "positive_axillary_nodes", kde=True)
        g.add_legend()
        plt.figure(figsize=(12, 6))  # opens a new, empty figure; does not resize the grid
        plt.show()

[Figure: density of positive_axillary_nodes, colored by survival_status]
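The overlap seen in the three density plots can also be put into numbers. One option, not used in the original notebook, is the standardized mean difference (Cohen's d with a pooled standard deviation), computed here from the class-wise means and standard deviations as read from the describe() tables. The readings are assumptions and the function names are illustrative.

```python
import math


def pooled_sd(sd_a, n_a, sd_b, n_b):
    """Pooled sample standard deviation of two groups."""
    return math.sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    )


def standardized_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d: difference of group means in units of the pooled SD."""
    return (mean_b - mean_a) / pooled_sd(sd_a, n_a, sd_b, n_b)


N_YES, N_NO = 224, 81

# (mean, std) for the 'yes' class, then for the 'no' class, as read from the
# class-wise describe() tables (assumed readings of the notebook output).
summary = {
    "patient_age": ((52.116071, 10.937446), (53.679012, 10.167137)),
    "operation_year": ((62.857143, 3.229231), (62.827160, 3.342118)),
    "positive_axillary_nodes": ((2.799107, 5.882237), (7.456790, 9.185654)),
}

effect = {}
for col, ((m_yes, s_yes), (m_no, s_no)) in summary.items():
    effect[col] = standardized_difference(m_yes, s_yes, N_YES, m_no, s_no, N_NO)
    print(f"{col}: {effect[col]:+.2f}")
```

Under these readings the difference is about 0.15 pooled standard deviations for age, about -0.01 for operation year, and about 0.67 for positive axillary nodes, which matches the visual impression that only the node count separates the classes to any degree.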
A large majority of the patients with 4 or fewer positive axillary nodes survived 5 years.

Box plot

The box plot, commonly referred to as a box and whisker plot, summarizes the data using five key metrics: the minimum, the lower quartile (25th percentile), the median (50th percentile), the upper quartile (75th percentile), and the maximum.

In [14]:
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 3, 1)
    sns.boxplot(x='survival_status', y='patient_age', data=df)
    plt.subplot(1, 3, 2)
    sns.boxplot(x='survival_status', y='operation_year', data=df)
    plt.subplot(1, 3, 3)
    sns.boxplot(x='survival_status', y='positive_axillary_nodes', data=df)
    plt.show()

[Figure: box plots of patient_age, operation_year and positive_axillary_nodes by survival_status]

The patient age and operation year plots show similar statistics for both classes.
The isolated points in the box plot of positive axillary nodes are the outliers in the data. Such a high number of outliers is to be expected in medical datasets.

Violin plot

A violin plot displays the same information as the box and whisker plot and additionally shows a density-smoothed plot of the underlying distribution.

In [15]:
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 3, 1)
    sns.violinplot(x='survival_status', y='patient_age', data=df)
    plt.subplot(1, 3, 2)
    sns.violinplot(x='survival_status', y='operation_year', data=df)
    plt.subplot(1, 3, 3)
    sns.violinplot(x='survival_status', y='positive_axillary_nodes', data=df)
    plt.show()

[Figure: violin plots of patient_age, operation_year and positive_axillary_nodes by survival_status]

The violin plot for positive axillary nodes shows that the distribution is highly skewed for the 'yes' class label and moderately skewed for the 'no' label.

Bi-variate data analysis

In [16]:
    # pair plot
    sns.set_style('whitegrid')
    sns.pairplot(df, hue='survival_status')
    plt.show()

[Figure: pair plot of patient_age, operation_year and positive_axillary_nodes, colored by survival_status]
As the pair plot above shows, there is a high overlap between any two features, so no clear distinction can be made between the class labels.

Joint plot

While the pair plot gives a visual overview of all pairwise relationships, the joint plot provides a bivariate plot together with the univariate marginal distributions.

In [17]:
    sns.jointplot(x="patient_age", y="positive_axillary_nodes", data=df)
    plt.show()

[Figure: joint plot of patient_age against positive_axillary_nodes with marginal histograms]

The pair plot and the joint plot reveal no correlation between the patient's age and the number of positive axillary nodes detected.
The histogram on the top edge indicates that patients are more likely to be operated on between the ages of 40 and 60 than in other age groups.
The histogram on the right edge indicates that the majority of patients had fewer than 4 positive axillary nodes.

In [18]:
    sns.jointplot(x="patient_age", y="positive_axillary_nodes", data=df, hue="survival_status")
    plt.show()

[Figure: joint plot of patient_age against positive_axillary_nodes, colored by survival_status]

Heatmap

In [19]:
    plt.figure(figsize=(8, 6))
    sns.heatmap(df.iloc[:, 0:3].corr(), cmap="YlGnBu", annot=True)
    plt.show()

[Figure: annotated correlation heatmap of patient_age, operation_year and positive_axillary_nodes]

The correlation coefficients are close to 0 for every pair of variables, so there is no appreciable linear correlation between any pair.

Multivariate analysis

A contour plot represents a 3-dimensional surface in a 2-dimensional format by plotting constant z slices, called contours.

In [20]:
    sns.jointplot(x='patient_age', y='operation_year', data=df, kind = …
    plt.show()

[Figure: joint plot of patient_age against operation_year]

The years 1959 to 1964 saw more patients in the 45-55 age group.
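For reference, the isolated points in the box plots follow from the whisker rule: with seaborn's default whis=1.5, any observation lying more than 1.5 times the interquartile range beyond the box is drawn individually. A minimal sketch for positive_axillary_nodes, assuming the class-wise quartiles and maxima read from the describe() tables (quartiles 0 and 3 for 'yes', 1 and 11 for 'no'); tukey_fences is a hypothetical helper name, not part of the notebook.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# (25th percentile, 75th percentile, maximum) of positive_axillary_nodes per
# class, as read from the class-wise describe() tables (assumed readings).
node_stats = {"yes": (0.0, 3.0, 46.0), "no": (1.0, 11.0, 52.0)}

fences = {}
for label, (q1, q3, maximum) in node_stats.items():
    lower, upper = tukey_fences(q1, q3)
    fences[label] = (lower, upper)
    print(
        f"{label}: upper fence = {upper}, max = {maximum}, "
        f"max plotted as outlier: {maximum > upper}"
    )
```

This gives an upper fence of 7.5 nodes for the 'yes' class and 26 for the 'no' class, so the class maxima of 46 and 52 both plot as outliers.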
