Problem Statement
You are provided with a dataset containing various data points, each described by a set of features. Your
task is to apply unsupervised learning techniques to cluster these data points into 'n' distinct
clusters. Additionally, you need to identify the most important features, and their corresponding
values, that contribute significantly to the grouping of data points within these clusters.
Objective:
1. Clustering: Implement an unsupervised learning algorithm to partition the data points
into 'n' clusters. You should select an appropriate clustering algorithm based on the
characteristics of the dataset and the problem requirements.
2. Feature Importance: Identify the most important features that influence the formation of
clusters. Determine the relevance and significance of each feature in grouping data
points together. This analysis will help you understand which features are driving the
cluster assignments.
3. Feature Values: Determine the specific values or ranges of values for the identified
important features that are associated with each cluster. In other words, find the feature-
value combinations that differentiate one cluster from another (a minimal workflow sketch follows this list).
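As a preview, here is a minimal sketch of one possible workflow: standardize the features, cluster with K-Means, then fit a classifier on the resulting cluster labels to rank feature importance. The file name data.csv and the assumption of a fully numeric dataset are placeholders, not part of the case study.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")                        # hypothetical numeric-only dataset
X = StandardScaler().fit_transform(df)              # scale features so distances are comparable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Rank features by how well they predict the cluster assignments
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, labels)
top_features = pd.Series(rf.feature_importances_, index=df.columns).nlargest(10)
print(top_features)                                 # most influential features driving the clusters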
Index
##Solution
##1. Data Preprocessing: Handled missing values using the mean (numerical columns) and the mode (categorical columns).
Standardized the features using StandardScaler.
Encoded categorical variables using LabelEncoder.
##2. Clustering: Chose an appropriate clustering algorithm (K-Means clustering).
Determined the optimal number of clusters, which is '2', based on both the elbow method and the
silhouette score.
Applied the chosen algorithm to cluster the data points into '2' clusters.
##3. Evaluation:
Evaluated the quality of the clustering results using the silhouette score, the Davies-Bouldin index and
the Calinski-Harabasz score.
##4. Feature Importance: Employed dimensionality reduction (PCA) together with the K-Means
centroids to analyse the clusters.
Determined the relevance and contribution of each feature to the cluster assignments using a
Random Forest Classifier (RFC).
##5. Insights:
Tabulated the contribution of the original features in forming the PCA components.
Created a bar graph that represents the influencing feature-value distributions in each cluster.
For each identified important feature, analysed its values or value ranges that are prevalent in
each cluster, in table form.
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
Importing Data
data=pd.read_csv('METABRIC_RNA_Mutation.csv')
Data Exploration
data.head()
   patient_id  age_at_diagnosis type_of_breast_surgery    cancer_type
0           0             75.65             MASTECTOMY  Breast Cancer
1           2             43.19      BREAST CONSERVING  Breast Cancer
2           5             48.87             MASTECTOMY  Breast Cancer
3           6             47.68             MASTECTOMY  Breast Cancer
4           8             76.97             MASTECTOMY  Breast Cancer

                        cancer_type_detailed cellularity  chemotherapy
0           Breast Invasive Ductal Carcinoma         NaN             0
1           Breast Invasive Ductal Carcinoma        High             0
2           Breast Invasive Ductal Carcinoma        High             1
3  Breast Mixed Ductal and Lobular Carcinoma    Moderate             1
4  Breast Mixed Ductal and Lobular Carcinoma        High             1

  pam50_+_claudin-low_subtype  cohort er_status_measured_by_ihc  ...
0                 claudin-low     1.0                   Positve  ...
1                        LumA     1.0                   Positve  ...
2                        LumB     1.0                   Positve  ...
3                        LumB     1.0                   Positve  ...
4                        LumB     1.0                   Positve  ...

(The trailing mutation columns mtap_mut, ppp2cb_mut, smarcd1_mut, nras_mut, ndfip1_mut,
hras_mut, prps2_mut, smarcb1_mut, stmn2_mut and siah1_mut are all 0 for these five rows.)

[5 rows x 693 columns]
data.shape
(1904, 693)
data.describe(include='all')
Summary of data.describe(include='all') over all 693 columns (1904 rows):
patient_id: count 1904, mean 3921.98, std 2358.48, range 0-7299
age_at_diagnosis: count 1904, mean 61.09, std 12.98, range 21.93-96.29
type_of_breast_surgery: 1882 non-null, 2 unique, top MASTECTOMY (freq 1127)
cancer_type: 1904 non-null, 2 unique, top Breast Cancer (freq 1903)
cancer_type_detailed: 1889 non-null, 6 unique, top Breast Invasive Ductal Carcinoma (freq 1500)
cellularity: 1850 non-null, 3 unique, top High (freq 939)
chemotherapy: mean 0.21 (binary 0/1)
pam50_+_claudin-low_subtype: 1904 non-null, 7 unique, top LumA (freq 679)
cohort: mean 2.64, std 1.23, range 1-5
er_status_measured_by_ihc: 1874 non-null, 2 unique, top Positve (freq 1445)
*_mut columns (mtap_mut ... siah1_mut): 1904 non-null, a handful of unique values each,
dominated by the value 0
[11 rows x 693 columns]
#separating categorical and numerical value based columns
cat_cols=data.describe(include='object').columns.tolist()
num_cols=data.describe(exclude='object').columns.tolist()
#converting object datatypes into categorical columns
for i in cat_cols:
    data[i] = data[i].astype('category')
Data Preprocessing
Missing Values
#identifying and noting down the cols with missing values
missing_cat = [i for i in cat_cols if data[i].isna().sum() > 0]
missing_num = [i for i in num_cols if data[i].isna().sum() > 0]
#viewing the % of missing values in the columns
print((data[missing_cat].isna().sum())/data.shape[0])
print((data[missing_num].isna().sum())/data.shape[0])
#we will drop the columns that have more than 10% missing values
data1=data.drop(['tumor_stage'],axis=1)
#handling missing values with mode and mean for cat and num cols respectively
for i in missing_cat:
    if i in data1.columns:
        data1[i] = data1[i].fillna(data1[i].mode()[0])
for i in missing_num:
    if i in data1.columns:
        data1[i] = data1[i].fillna(data1[i].mean())
#redefining cat and num cols after dropping the ones with an excess % of missing data
new_cat=data1.describe(include='category').columns.tolist()
new_num=data1.describe(exclude='category').columns.tolist()
type_of_breast_surgery 0.011555
cancer_type_detailed 0.007878
cellularity 0.028361
er_status_measured_by_ihc 0.015756
tumor_other_histologic_subtype 0.007878
primary_tumor_laterality 0.055672
oncotree_code 0.007878
3-gene_classifier_subtype 0.107143
death_from_cancer 0.000525
dtype: float64
neoplasm_histologic_grade 0.037815
mutation_count 0.023634
tumor_size 0.010504
tumor_stage 0.263130
dtype: float64
data1.isna().sum().sum()
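As an aside, the same mode/mean fill could be done with scikit-learn's SimpleImputer. The sketch below is shown only for comparison (running it after the loops above would be redundant), and it returns plain arrays, so the 'category' dtype set earlier would not be preserved.
from sklearn.impute import SimpleImputer

# Alternative sketch: impute numeric columns with the mean and categorical columns with the mode
num_present = [c for c in missing_num if c in data1.columns]   # 'tumor_stage' was already dropped
data1[num_present] = SimpleImputer(strategy='mean').fit_transform(data1[num_present])
data1[missing_cat] = SimpleImputer(strategy='most_frequent').fit_transform(data1[missing_cat])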
Encoding Categorical Variables
encoder=LabelEncoder()
#encoding columns that contain a mixture of numerical and text values
for i in new_cat:
    try:
        data1[i] = encoder.fit_transform(data1[i])
    except TypeError:
        # mixed int/str values cannot be compared for sorting; map the numeric 0 to a string first
        data1[i] = data1[i].replace(0, 'Zero')
        data1[i] = encoder.fit_transform(data1[i])
Standardization
Standardization ensures that all features are on a similar scale, which matters for algorithms that
rely on distance calculations (K-Means clustering uses Euclidean distances).
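A tiny illustrative example (with made-up numbers, not taken from the dataset) of why scale matters for Euclidean distance:
import numpy as np

# Two hypothetical patients differing by 10 years of age and by 0.5 in one expression value
a = np.array([50.0, 0.1])    # [age_at_diagnosis, expression feature]
b = np.array([60.0, 0.6])
print(np.linalg.norm(a - b))  # ~10.01: the distance is dominated by the larger-scaled feature
# StandardScaler rescales each column to (x - mean) / std, so both differences
# contribute on a comparable scale to the K-Means distance computation.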
data.describe(exclude='category')
Summary of data.describe(exclude='category') over the 503 numeric columns (1904 rows). The
clinical columns sit on very different scales, for example:
patient_id: mean 3921.98, std 2358.48, range 0-7299
age_at_diagnosis: mean 61.09, std 12.98, range 21.93-96.29
chemotherapy / hormone_therapy: binary 0/1 indicators
cohort: mean 2.64, range 1-5
neoplasm_histologic_grade: count 1832, mean 2.42, range 1-3
lymph_nodes_examined_positive: mean 2.00, max 45
mutation_count: count 1859, mean 5.70, max 80
nottingham_prognostic_index: mean 4.03, range 1.00-6.36
overall_survival_months: mean 125.12, max 355.2
The gene-expression columns (srd5a1 ... ugt2b7) are already approximately zero-mean with unit
standard deviation.
[8 rows x 503 columns]
scaler=StandardScaler()
scaled_data=pd.DataFrame(scaler.fit_transform(data1[new_num]), columns=new_num)
print('Scaled data shape:',scaled_data.shape)
new_data=pd.concat((data1[new_cat],scaled_data),axis=1)
print('New data shape:',new_data.shape)
Scaled data shape: (1904, 502)
New data shape: (1904, 692)
Clustering
To find the ideal number of clusters using the elbow method and the silhouette score.
X=new_data.copy()
X.columns = X.columns.astype(str)
#Elbow method
# Calculate WCSS for a range of K values
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plot the Elbow Method curve
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='-', color='b')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()
#silhouette score method
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))
# Plot the Silhouette Method curve
plt.figure(figsize=(8, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='-', color='b')
plt.title('Silhouette Method')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.xticks(range(2, 11))
plt.grid(True)
plt.show()
Both plots indicate that the ideal number of clusters is 2.
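If a programmatic choice is preferred over reading the plots, the silhouette scores computed above can be used directly; a small sketch (silhouette_scores is the list built in the previous cell):
# Pick the k (2..10) with the highest silhouette score instead of eyeballing the plot
best_k = range(2, 11)[int(np.argmax(silhouette_scores))]
print('Best number of clusters by silhouette score:', best_k)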
Clustering the dimensionally reduced dataset through PCA, evaluating the results and obtaining the important features and their corresponding values
n_components = 5 #the number of components
pca = PCA(n_components=n_components)
X=new_data.copy()
X.columns = X.columns.astype(str)
X_pca = pca.fit_transform(X)
# Analyze the PCA results
explained_variance_ratio = pca.explained_variance_ratio_
# Print the explained variance ratios for each component
print("Explained Variance Ratios:")
print(explained_variance_ratio)
# Optional: Visualize the explained variance
plt.plot(range(1, n_components + 1), np.cumsum(explained_variance_ratio))
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance vs. Number of Components")
plt.show()
Explained Variance Ratios:
[0.33570044 0.13648734 0.09348513 0.05628623 0.04368038]
The five principal components together explain approximately 67% of the variance in the data.
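If more variance coverage were desired, scikit-learn can choose the number of components for a target explained-variance ratio directly; a small sketch (the 90% threshold is illustrative, and X is the scaled matrix from above):
# Keep as many components as needed to explain 90% of the variance
pca_90 = PCA(n_components=0.90)
X_pca_90 = pca_90.fit_transform(X)
print('Components needed for 90% variance:', pca_90.n_components_)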
K-Means clustering is used here; after clustering, the cluster labels also make it straightforward to obtain the important features. The quality of the clusters is evaluated with three metrics:
davies_bouldin_score: a lower Davies-Bouldin index suggests that the clusters are well-separated and distinct.
silhouette_score: a higher silhouette score indicates that the data points are well-separated into clusters.
calinski_harabasz_score: higher values indicate better separation between clusters.
# Performing K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0,max_iter=1000)
labels = kmeans.fit_predict(X_pca)
# Evaluating the clustering results
silhouette_avg = silhouette_score(X, labels)
db_index = davies_bouldin_score(X, labels)
ch_index = calinski_harabasz_score(X, labels)
print(f"Silhouette Score: {silhouette_avg}")
print(f"Davies-Bouldin Index: {db_index}")
print(f"Calinski-Harabasz Index: {ch_index}")
Silhouette Score: 0.3530636931145265
Davies-Bouldin Index: 1.2331561638066597
Calinski-Harabasz Index: 804.105981173214
#obtaining the features
df=pd.DataFrame(X_pca)
df['ClusterLabel'] = kmeans.fit_predict(df)
# Get the centroids (cluster centers)
centroids = kmeans.cluster_centers_
# Identify the most influential PCA components based on the centroids
important_features = np.argsort(np.abs(centroids), axis=1)[:, -5:]  # top 5 components for each cluster
print("Important Features for Each Cluster:")
print(important_features)
Important Features for Each Cluster:
[[2 4 3 1 0]
[2 4 3 1 0]]
#obtaining the original features from the dimensionally reduced components and their weightage in the process
loadings = pca.components_
# Get the original feature names (same order as the columns fed to PCA)
original_feature_names = new_data.columns.astype(str).tolist()
orig_cols=new_data.columns.tolist()
# For a given principal component, report the top-3 loadings together with their feature names
def feature_names(loadings, original_feature_names, component_name):
    feat = {'feature_name': [], 'weightage': []}
    for feature_name, loading in zip(original_feature_names, loadings):
        feat['feature_name'].append(feature_name)
        feat['weightage'].append(loading)
    values = sorted(feat['weightage'], reverse=True)[0:3]
    indices = {'feature_name': [], 'weightage': [], 'component_name': []}
    for i in values:
        x = [index for index, value in enumerate(feat['weightage']) if value == i]
        indices['feature_name'].append(orig_cols[x[0]])
        indices['weightage'].append(i)
        indices['component_name'].append(component_name)
    return indices
f_names=pd.DataFrame()
for comp in [0, 1, 3, 2, 4]:   # component order matches the table shown below
    f_names=pd.concat((f_names,pd.DataFrame(feature_names(loadings[comp],original_feature_names,comp))))
x_cols=new_data.columns.tolist()
x_new=new_data.copy()
x_new['ClusterLabel']=df['ClusterLabel']
feat_contribution={'Feature_Name':[], '%_of_Influence_in_cluster_formation':[]}
data=x_new.drop(['ClusterLabel'],axis=1)
labels=x_new['ClusterLabel']
# Fit a Random Forest with the cluster labels as targets, then read off the feature importances
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data,labels)
feature_importance = model.feature_importances_.tolist()
values=sorted(feature_importance,reverse=True)[0:10]
for i in values:
    x=[index for index, value in enumerate(feature_importance) if value == i]
    feat_contribution['Feature_Name'].append(x_cols[x[0]])
    feat_contribution['%_of_Influence_in_cluster_formation'].append(i*100)
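The same top-10 ranking can be obtained more compactly with pandas; a sketch using the fitted model above:
# Equivalent, more compact ranking of the ten most influential features (in %)
importance_series = pd.Series(model.feature_importances_, index=x_cols).nlargest(10) * 100
print(importance_series)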
Insights Obtained
Top 3 original features that have the greatest influence on
the formation of the PCA components (listed
component-wise)
1. feature_name: Name of the feature as per the dataframe.
2. weightage: Weightage of the feature in the formation of the component.
3. component_name: The index of the PCA component (0-4).
f_names
feature_name weightage component_name
0 tp53_mut 0.994655 0
1 muc16_mut 0.047248 0
2 syne1_mut 0.023867 0
0 muc16_mut 0.952263 1
1 ahnak2_mut 0.284344 1
2 kmt2c_mut 0.054483 1
0 kmt2c_mut 0.926952 3
1 pik3ca_mut 0.294682 3
2 map3k1_mut 0.203274 3
0 ahnak2_mut 0.954922 2
1 syne1_mut 0.029579 2
2 ahnak_mut 0.024780 2
0 syne1_mut 0.843752 4
1 kmt2c_mut 0.166811 4
2 dnah11_mut 0.119488 4
The table below shows the importance of each feature in
the formation of the clusters (in %)
pd.DataFrame(feat_contribution)
Feature_Name %_of_Influence_in_cluster_formation
0 tp53_mut 19.821408
1 bcl2 2.640829
2 aph1b 2.100798
3 chek1 1.870523
4 er_status 1.277683
5 gata3 1.042316
6 e2f3 0.806697
7 mapk1 0.749623
8 cdkn2a 0.734711
9 srd5a1 0.703182
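A quick bar chart of these importances makes the dominance of tp53_mut obvious; a sketch based on the feat_contribution dictionary built above:
# Bar chart of the top-10 feature importances (in %)
imp_df = pd.DataFrame(feat_contribution)
plt.figure(figsize=(8, 6))
plt.barh(imp_df['Feature_Name'], imp_df['%_of_Influence_in_cluster_formation'], color='b')
plt.xlabel('% of influence in cluster formation')
plt.title('Top 10 features driving the cluster assignments')
plt.gca().invert_yaxis()   # most important feature at the top
plt.show()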
The visualizations below show the cluster-wise
distribution of values (frequency-value pairs) for the
top 10 features that influenced the formation of the
clusters.
#processing the obtained indices to extract the original feature names from the dataset and their distribution cluster-wise
important_features=feat_contribution['Feature_Name']
# Group data by cluster
cluster_label_data=new_data.copy()
cluster_label_data['ClusterLabel']=df['ClusterLabel']
cluster_groups = cluster_label_data.groupby('ClusterLabel')
insights0,insights1=pd.DataFrame(),pd.DataFrame()
j=0
# Split each important feature's values into the two per-cluster frames
for feature in important_features:
    for label, group in cluster_groups:
        j += 1
        if j % 2 == 0:
            insights0 = pd.concat((insights0, group[feature]), axis=1)
        else:
            insights1 = pd.concat((insights1, group[feature]), axis=1)
insights={'cluster':[], 'max_freq':[], 'most_occured_value':[], 'mean_value':[],
          'median_value':[], 'std_dev':[], 'feature_name':[]}
# Analyze feature distributions for each cluster
for feature in important_features:
    for label, group in cluster_groups:
        insights['feature_name'].append(feature)
        insights['cluster'].append(label)
        insights['most_occured_value'].append(group[feature].mode()[0])
        insights['max_freq'].append(group[feature].value_counts().tolist()[0])
        insights['mean_value'].append(group[feature].mean())
        insights['median_value'].append(group[feature].median())
        insights['std_dev'].append(group[feature].std())
    # Visualize feature distributions
    plt.figure(figsize=(8, 6))
    for label, group in cluster_groups:
        plt.hist(group[feature], bins=10, alpha=0.5, label=f'Cluster {label}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.legend()
    plt.title(f'Distribution of {feature} by Cluster')
    plt.show()
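The insights dictionary built above holds the per-cluster value summary for each important feature; it can be shown as the promised table with a one-liner (sketch):
# Tabulate the per-cluster summary (mode, frequency, mean, median, std) for each important feature
insights_table = pd.DataFrame(insights)
print(insights_table)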
The tables below give the statistical description of the
values of the top 10 features that influenced the
formation of the clusters.
#cluster_label=0
insights0.describe()
         tp53_mut       bcl2      aph1b      chek1  er_status      gata3
count  472.000000 472.000000 472.000000 472.000000 472.000000 472.000000
mean   238.042373  -0.697808  -0.757289   0.733421   0.447034  -0.787537
std     57.288430   0.942895   0.788225   1.051694   0.497714   0.983745
min    113.000000  -2.791904  -2.917802  -1.504302   0.000000  -2.772898
25%    198.750000  -1.399302  -1.302625  -0.058999   0.000000  -1.631724
50%    237.000000  -0.828201  -0.782550   0.681002   0.000000  -0.726699
75%    274.250000  -0.129875  -0.282425   1.373253   1.000000   0.027776
max    342.000000   2.656105   2.212302   3.952808   1.000000   1.745900

             e2f3      mapk1     cdkn2a     srd5a1
count  472.000000 472.000000 472.000000 472.000000
mean     0.696880   0.512338   0.581673   0.672567
std      1.073119   1.015709   1.408055   1.160492
min     -1.992600  -2.659901  -1.331901  -1.201500
25%     -0.102249  -0.225300  -0.452076  -0.180400
50%      0.601601   0.547700   0.031799   0.414699
75%      1.330601   1.164900   1.494925   1.287924
max      4.458301   4.294400   5.837501   6.534898
#cluster_label=1
insights1.describe()
          tp53_mut        bcl2       aph1b       chek1   er_status
count  1432.000000 1432.000000 1432.000000 1432.000000 1432.000000
mean     10.045391    0.230004    0.249609   -0.241742    0.871508
std      24.314789    0.907946    0.935167    0.854743    0.334753
min       0.000000   -2.625804   -2.854401   -1.949803    0.000000
25%       2.000000   -0.276325   -0.344600   -0.833076    1.000000
50%       2.000000    0.316401    0.314051   -0.381150    1.000000
75%       2.000000    0.848577    0.878576    0.172951    1.000000
max     125.000000    2.534905    3.881904    4.015408    1.000000

             gata3        e2f3       mapk1      cdkn2a      srd5a1
count  1432.000000 1432.000000 1432.000000 1432.000000 1432.000000
mean      0.259579   -0.229698   -0.168871   -0.191725   -0.221684
std       0.860239    0.859374    0.935873    0.727733    0.829995
min      -2.812598   -2.885000   -3.069801   -1.356901   -2.120800
25%      -0.024749   -0.787700   -0.807651   -0.610651   -0.708675
50%       0.468950   -0.295050   -0.254400   -0.324951   -0.379500
75%       0.804675    0.226376    0.353850    0.015474    0.027400
max       2.202800    4.480201    3.617600    4.304301    5.345998
Conclusion
The objective of the case study, which was to cluster the data and to identify the important features
along with their contribution to the clustering, has been achieved.