Problem Statement
You are provided with a dataset containing various data points, each described by a set of features. Your
task is to apply unsupervised learning techniques to cluster these data points into 'n' distinct
clusters. Additionally, you need to identify the most important features, and their corresponding
values, that contribute significantly to the grouping of data points within these clusters.
Objective:
1. Clustering: Implement an unsupervised learning algorithm to partition the data points
into 'n' clusters. You should select an appropriate clustering algorithm based on the
characteristics of the dataset and the problem requirements.
2. Feature Importance: Identify the most important features that influence the formation of
clusters. Determine the relevance and significance of each feature in grouping data
points together. This analysis will help you understand which features are driving the
cluster assignments.
3. Feature Values: Determine the specific values or ranges of values for the identified
important features that are associated with each cluster. In other words, find the feature-
value combinations that differentiate one cluster from another (a minimal workflow sketch follows this list).
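As a preview, here is a minimal sketch of one possible workflow: standardize the features, cluster with K-Means, then fit a classifier on the resulting cluster labels to rank feature importance. The file name data.csv and the assumption of a fully numeric dataset are placeholders, not part of the case study.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("data.csv")                        # hypothetical numeric-only dataset
X = StandardScaler().fit_transform(df)              # scale features so distances are comparable
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Rank features by how well they predict the cluster assignments
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(df, labels)
top_features = pd.Series(rf.feature_importances_, index=df.columns).nlargest(10)
print(top_features)                                 # most influential features driving the clusters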
Index
##Solution
##1. Data Preprocessing: Handled missing values using the mean (numerical columns) and the mode (categorical columns).
Standardized the features using StandardScaler.
Encoded categorical variables using LabelEncoder.
##2. Clustering: Chose an appropriate clustering algorithm (K-Means clustering).
Determined the optimal number of clusters, which is '2', based on both the elbow method and the
silhouette score.
Applied the chosen algorithm to cluster the data points into '2' clusters.
##3. Evaluation:
Evaluated the quality of the clustering results using the silhouette score, the Davies-Bouldin index and
the Calinski-Harabasz score.
##4. Feature Importance: Employed dimensionality reduction (PCA) together with the K-Means
centroids to analyse the clusters.
Determined the relevance and contribution of each feature to the cluster assignments using a
Random Forest Classifier (RFC).
##5. Insights:
Tabulated the contribution of the original features in forming the PCA components.
Created a bar graph that represents the influencing feature-value distributions in each cluster.
For each identified important feature, analysed its values or value ranges that are prevalent in
each cluster, in table form.
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
Importing Data
data=pd.read_csv('METABRIC_RNA_Mutation.csv')
Data Exploration
data.head()
   patient_id  age_at_diagnosis type_of_breast_surgery    cancer_type
0           0             75.65             MASTECTOMY  Breast Cancer
1           2             43.19      BREAST CONSERVING  Breast Cancer
2           5             48.87             MASTECTOMY  Breast Cancer
3           6             47.68             MASTECTOMY  Breast Cancer
4           8             76.97             MASTECTOMY  Breast Cancer

                        cancer_type_detailed cellularity  chemotherapy
0           Breast Invasive Ductal Carcinoma         NaN             0
1           Breast Invasive Ductal Carcinoma        High             0
2           Breast Invasive Ductal Carcinoma        High             1
3  Breast Mixed Ductal and Lobular Carcinoma    Moderate             1
4  Breast Mixed Ductal and Lobular Carcinoma        High             1

  pam50_+_claudin-low_subtype  cohort er_status_measured_by_ihc  ...
0                 claudin-low     1.0                   Positve  ...
1                        LumA     1.0                   Positve  ...
2                        LumB     1.0                   Positve  ...
3                        LumB     1.0                   Positve  ...
4                        LumB     1.0                   Positve  ...

(The trailing mutation columns mtap_mut, ppp2cb_mut, smarcd1_mut, nras_mut, ndfip1_mut,
hras_mut, prps2_mut, smarcb1_mut, stmn2_mut and siah1_mut are all 0 for these five rows.)

[5 rows x 693 columns]
data.shape
(1904, 693)
data.describe(include='all')
Summary of data.describe(include='all') over all 693 columns (1904 rows):
patient_id: count 1904, mean 3921.98, std 2358.48, range 0-7299
age_at_diagnosis: count 1904, mean 61.09, std 12.98, range 21.93-96.29
type_of_breast_surgery: 1882 non-null, 2 unique, top MASTECTOMY (freq 1127)
cancer_type: 1904 non-null, 2 unique, top Breast Cancer (freq 1903)
cancer_type_detailed: 1889 non-null, 6 unique, top Breast Invasive Ductal Carcinoma (freq 1500)
cellularity: 1850 non-null, 3 unique, top High (freq 939)
chemotherapy: mean 0.21 (binary 0/1)
pam50_+_claudin-low_subtype: 1904 non-null, 7 unique, top LumA (freq 679)
cohort: mean 2.64, std 1.23, range 1-5
er_status_measured_by_ihc: 1874 non-null, 2 unique, top Positve (freq 1445)
*_mut columns (mtap_mut ... siah1_mut): 1904 non-null, a handful of unique values each,
dominated by the value 0
[11 rows x 693 columns]
#separating categorical and numerical value based columns
cat_cols=data.describe(include='object').columns.tolist()
num_cols=data.describe(exclude='object').columns.tolist()
#converting object datatypes into categorical columns
for i in cat_cols:
    data[i] = data[i].astype('category')
Data Preprocessing
Missing Values
#identifying and noting down the cols with missing values
missing_cat = [i for i in cat_cols if data[i].isna().sum() > 0]
missing_num = [i for i in num_cols if data[i].isna().sum() > 0]
#viewing the % of missing values in the columns
print((data[missing_cat].isna().sum())/data.shape[0])
print((data[missing_num].isna().sum())/data.shape[0])
#we will drop the columns that have more than 10% missing values
data1=data.drop(['tumor_stage'],axis=1)
#handling missing values with mode and mean for cat and num cols respectively
for i in missing_cat:
    if i in data1.columns:
        data1[i] = data1[i].fillna(data1[i].mode()[0])
for i in missing_num:
    if i in data1.columns:
        data1[i] = data1[i].fillna(data1[i].mean())
#redefining cat and num cols after dropping the ones with an excess % of missing data
new_cat=data1.describe(include='category').columns.tolist()
new_num=data1.describe(exclude='category').columns.tolist()
type_of_breast_surgery 0.011555
cancer_type_detailed 0.007878
cellularity 0.028361
er_status_measured_by_ihc 0.015756
tumor_other_histologic_subtype 0.007878
primary_tumor_laterality 0.055672
oncotree_code 0.007878
3-gene_classifier_subtype 0.107143
death_from_cancer 0.000525
dtype: float64
neoplasm_histologic_grade 0.037815
mutation_count 0.023634
tumor_size 0.010504
tumor_stage 0.263130
dtype: float64
data1.isna().sum().sum()
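As an aside, the same mode/mean fill could be done with scikit-learn's SimpleImputer. The sketch below is shown only for comparison (running it after the loops above would be redundant), and it returns plain arrays, so the 'category' dtype set earlier would not be preserved.
from sklearn.impute import SimpleImputer

# Alternative sketch: impute numeric columns with the mean and categorical columns with the mode
num_present = [c for c in missing_num if c in data1.columns]   # 'tumor_stage' was already dropped
data1[num_present] = SimpleImputer(strategy='mean').fit_transform(data1[num_present])
data1[missing_cat] = SimpleImputer(strategy='most_frequent').fit_transform(data1[missing_cat])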
Encoding Categorical Variables
encoder=LabelEncoder()
#encoding columns that contain a mixture of numerical and text values
for i in new_cat:
    try:
        data1[i] = encoder.fit_transform(data1[i])
    except TypeError:
        # mixed int/str values cannot be compared for sorting; map the numeric 0 to a string first
        data1[i] = data1[i].replace(0, 'Zero')
        data1[i] = encoder.fit_transform(data1[i])
Standardization
Standardization ensures that all features are on a similar scale, which matters for algorithms that
rely on distance calculations (K-Means clustering uses Euclidean distances).
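A tiny illustrative example (with made-up numbers, not taken from the dataset) of why scale matters for Euclidean distance:
import numpy as np

# Two hypothetical patients differing by 10 years of age and by 0.5 in one expression value
a = np.array([50.0, 0.1])    # [age_at_diagnosis, expression feature]
b = np.array([60.0, 0.6])
print(np.linalg.norm(a - b))  # ~10.01: the distance is dominated by the larger-scaled feature
# StandardScaler rescales each column to (x - mean) / std, so both differences
# contribute on a comparable scale to the K-Means distance computation.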
data.describe(exclude='category')
Summary of data.describe(exclude='category') over the 503 numeric columns (1904 rows). The
clinical columns sit on very different scales, for example:
patient_id: mean 3921.98, std 2358.48, range 0-7299
age_at_diagnosis: mean 61.09, std 12.98, range 21.93-96.29
chemotherapy / hormone_therapy: binary 0/1 indicators
cohort: mean 2.64, range 1-5
neoplasm_histologic_grade: count 1832, mean 2.42, range 1-3
lymph_nodes_examined_positive: mean 2.00, max 45
mutation_count: count 1859, mean 5.70, max 80
nottingham_prognostic_index: mean 4.03, range 1.00-6.36
overall_survival_months: mean 125.12, max 355.2
The gene-expression columns (srd5a1 ... ugt2b7) are already approximately zero-mean with unit
standard deviation.
[8 rows x 503 columns]
scaler=StandardScaler()
scaled_data=pd.DataFrame(scaler.fit_transform(data1[new_num]), columns=new_num)
print('Scaled data shape:',scaled_data.shape)
new_data=pd.concat((data1[new_cat],scaled_data),axis=1)
print('New data shape:',new_data.shape)
Scaled data shape: (1904, 502)
New data shape: (1904, 692)
Clustering
To find the ideal number of clusters using the elbow method and the silhouette score.
X=new_data.copy()
X.columns = X.columns.astype(str)
#Elbow method
# Calculate WCSS for a range of K values
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plot the Elbow Method curve
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='-', color='b')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.xticks(range(1, 11))
plt.grid(True)
plt.show()
#silhouette score method
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=500, n_init=10, random_state=0)
    kmeans.fit(X)
    silhouette_scores.append(silhouette_score(X, kmeans.labels_))
# Plot the Silhouette Method curve
plt.figure(figsize=(8, 6))
plt.plot(range(2, 11), silhouette_scores, marker='o', linestyle='-', color='b')
plt.title('Silhouette Method')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.xticks(range(2, 11))
plt.grid(True)
plt.show()
Both plots indicate that the ideal number of clusters is 2.
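If a programmatic choice is preferred over reading the plots, the silhouette scores computed above can be used directly; a small sketch (silhouette_scores is the list built in the previous cell):
# Pick the k (2..10) with the highest silhouette score instead of eyeballing the plot
best_k = range(2, 11)[int(np.argmax(silhouette_scores))]
print('Best number of clusters by silhouette score:', best_k)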
Clustering the dimensionally reduced dataset through PCA, evaluating the results and obtaining the important features and their corresponding values
n_components = 5 #the number of components
pca = PCA(n_components=n_components)
X=new_data.copy()
X.columns = X.columns.astype(str)
X_pca = pca.fit_transform(X)
# Analyze the PCA results
explained_variance_ratio = pca.explained_variance_ratio_
# Print the explained variance ratios for each component
print("Explained Variance Ratios:")
print(explained_variance_ratio)
# Optional: Visualize the explained variance
plt.plot(range(1, n_components + 1), np.cumsum(explained_variance_ratio))
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Explained Variance vs. Number of Components")
plt.show()
Explained Variance Ratios:
[0.33570044 0.13648734 0.09348513 0.05628623 0.04368038]
The five principal components together explain approximately 67% of the variance in the data.
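If more variance coverage were desired, scikit-learn can choose the number of components for a target explained-variance ratio directly; a small sketch (the 90% threshold is illustrative, and X is the scaled matrix from above):
# Keep as many components as needed to explain 90% of the variance
pca_90 = PCA(n_components=0.90)
X_pca_90 = pca_90.fit_transform(X)
print('Components needed for 90% variance:', pca_90.n_components_)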
K-Means clustering is used here; after clustering, the cluster labels also make it straightforward to obtain the important features. The quality of the clusters is evaluated with three metrics:
davies_bouldin_score: a lower Davies-Bouldin index suggests that the clusters are well-separated and distinct.
silhouette_score: a higher silhouette score indicates that the data points are well-separated into clusters.
calinski_harabasz_score: higher values indicate better separation between clusters.
# Performing K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0,max_iter=1000)
labels = kmeans.fit_predict(X_pca)
# Evaluating the clustering results
silhouette_avg = silhouette_score(X, labels)
db_index = davies_bouldin_score(X, labels)
ch_index = calinski_harabasz_score(X, labels)
print(f"Silhouette Score: {silhouette_avg}")
print(f"Davies-Bouldin Index: {db_index}")
print(f"Calinski-Harabasz Index: {ch_index}")
Silhouette Score: 0.3530636931145265
Davies-Bouldin Index: 1.2331561638066597
Calinski-Harabasz Index: 804.105981173214
#obtaining the features
df=pd.DataFrame(X_pca)
df['ClusterLabel'] = kmeans.fit_predict(df)
# Get the centroids (cluster centers)
centroids = kmeans.cluster_centers_
# Identify the most influential PCA components based on the centroids
important_features = np.argsort(np.abs(centroids), axis=1)[:, -5:]  # top 5 components for each cluster
print("Important Features for Each Cluster:")
print(important_features)
Important Features for Each Cluster:
[[2 4 3 1 0]
[2 4 3 1 0]]
#obtaining the original features from the dimensionally reduced components and their weightage in the process
loadings = pca.components_
# Get the original feature names (same order as the columns fed to PCA)
original_feature_names = new_data.columns.astype(str).tolist()
orig_cols=new_data.columns.tolist()
# For a given principal component, report the top-3 loadings together with their feature names
def feature_names(loadings, original_feature_names, component_name):
    feat = {'feature_name': [], 'weightage': []}
    for feature_name, loading in zip(original_feature_names, loadings):
        feat['feature_name'].append(feature_name)
        feat['weightage'].append(loading)
    values = sorted(feat['weightage'], reverse=True)[0:3]
    indices = {'feature_name': [], 'weightage': [], 'component_name': []}
    for i in values:
        x = [index for index, value in enumerate(feat['weightage']) if value == i]
        indices['feature_name'].append(orig_cols[x[0]])
        indices['weightage'].append(i)
        indices['component_name'].append(component_name)
    return indices
f_names=pd.DataFrame()
for comp in [0, 1, 3, 2, 4]:   # component order matches the table shown below
    f_names=pd.concat((f_names,pd.DataFrame(feature_names(loadings[comp],original_feature_names,comp))))
x_cols=new_data.columns.tolist()
x_new=new_data.copy()
x_new['ClusterLabel']=df['ClusterLabel']
feat_contribution={'Feature_Name':[], '%_of_Influence_in_cluster_formation':[]}
data=x_new.drop(['ClusterLabel'],axis=1)
labels=x_new['ClusterLabel']
# Fit a Random Forest with the cluster labels as targets, then read off the feature importances
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data,labels)
feature_importance = model.feature_importances_.tolist()
values=sorted(feature_importance,reverse=True)[0:10]
for i in values:
    x=[index for index, value in enumerate(feature_importance) if value == i]
    feat_contribution['Feature_Name'].append(x_cols[x[0]])
    feat_contribution['%_of_Influence_in_cluster_formation'].append(i*100)
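The same top-10 ranking can be obtained more compactly with pandas; a sketch using the fitted model above:
# Equivalent, more compact ranking of the ten most influential features (in %)
importance_series = pd.Series(model.feature_importances_, index=x_cols).nlargest(10) * 100
print(importance_series)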
Insights Obtained
Top 3 original features that have the greatest influence on
the formation of the PCA components (listed
component-wise)
1. feature_name: Name of the feature as per the dataframe.
2. weightage: Weightage of the feature in the formation of the component.
3. component_name: The index of the PCA component (0-4).
f_names
feature_name weightage component_name
0 tp53_mut 0.994655 0
1 muc16_mut 0.047248 0
2 syne1_mut 0.023867 0
0 muc16_mut 0.952263 1
1 ahnak2_mut 0.284344 1
2 kmt2c_mut 0.054483 1
0 kmt2c_mut 0.926952 3
1 pik3ca_mut 0.294682 3
2 map3k1_mut 0.203274 3
0 ahnak2_mut 0.954922 2
1 syne1_mut 0.029579 2
2 ahnak_mut 0.024780 2
0 syne1_mut 0.843752 4
1 kmt2c_mut 0.166811 4
2 dnah11_mut 0.119488 4
The table below shows the importance of each feature in
the formation of the clusters (in %)
pd.DataFrame(feat_contribution)
Feature_Name %_of_Influence_in_cluster_formation
0 tp53_mut 19.821408
1 bcl2 2.640829
2 aph1b 2.100798
3 chek1 1.870523
4 er_status 1.277683
5 gata3 1.042316
6 e2f3 0.806697
7 mapk1 0.749623
8 cdkn2a 0.734711
9 srd5a1 0.703182
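A quick bar chart of these importances makes the dominance of tp53_mut obvious; a sketch based on the feat_contribution dictionary built above:
# Bar chart of the top-10 feature importances (in %)
imp_df = pd.DataFrame(feat_contribution)
plt.figure(figsize=(8, 6))
plt.barh(imp_df['Feature_Name'], imp_df['%_of_Influence_in_cluster_formation'], color='b')
plt.xlabel('% of influence in cluster formation')
plt.title('Top 10 features driving the cluster assignments')
plt.gca().invert_yaxis()   # most important feature at the top
plt.show()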
The visualizations below show the cluster-wise
distribution of values (frequency-value pairs) for the
top 10 features that influenced the formation of the
clusters.
#processing the obtained indices to extract the original feature names from the dataset and their distribution cluster-wise
important_features=feat_contribution['Feature_Name']
# Group data by cluster
cluster_label_data=new_data.copy()
cluster_label_data['ClusterLabel']=df['ClusterLabel']
cluster_groups = cluster_label_data.groupby('ClusterLabel')
insights0,insights1=pd.DataFrame(),pd.DataFrame()
j=0
# Split each important feature's values into the two per-cluster frames
for feature in important_features:
    for label, group in cluster_groups:
        j += 1
        if j % 2 == 0:
            insights0 = pd.concat((insights0, group[feature]), axis=1)
        else:
            insights1 = pd.concat((insights1, group[feature]), axis=1)
insights={'cluster':[], 'max_freq':[], 'most_occured_value':[], 'mean_value':[],
          'median_value':[], 'std_dev':[], 'feature_name':[]}
# Analyze feature distributions for each cluster
for feature in important_features:
    for label, group in cluster_groups:
        insights['feature_name'].append(feature)
        insights['cluster'].append(label)
        insights['most_occured_value'].append(group[feature].mode()[0])
        insights['max_freq'].append(group[feature].value_counts().tolist()[0])
        insights['mean_value'].append(group[feature].mean())
        insights['median_value'].append(group[feature].median())
        insights['std_dev'].append(group[feature].std())
    # Visualize feature distributions
    plt.figure(figsize=(8, 6))
    for label, group in cluster_groups:
        plt.hist(group[feature], bins=10, alpha=0.5, label=f'Cluster {label}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.legend()
    plt.title(f'Distribution of {feature} by Cluster')
    plt.show()
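The insights dictionary built above holds the per-cluster value summary for each important feature; it can be shown as the promised table with a one-liner (sketch):
# Tabulate the per-cluster summary (mode, frequency, mean, median, std) for each important feature
insights_table = pd.DataFrame(insights)
print(insights_table)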
The tables below give the statistical description of the
values of the top 10 features that influenced the
formation of the clusters.
#cluster_label=0
insights0.describe()
         tp53_mut       bcl2      aph1b      chek1  er_status      gata3
count  472.000000 472.000000 472.000000 472.000000 472.000000 472.000000
mean   238.042373  -0.697808  -0.757289   0.733421   0.447034  -0.787537
std     57.288430   0.942895   0.788225   1.051694   0.497714   0.983745
min    113.000000  -2.791904  -2.917802  -1.504302   0.000000  -2.772898
25%    198.750000  -1.399302  -1.302625  -0.058999   0.000000  -1.631724
50%    237.000000  -0.828201  -0.782550   0.681002   0.000000  -0.726699
75%    274.250000  -0.129875  -0.282425   1.373253   1.000000   0.027776
max    342.000000   2.656105   2.212302   3.952808   1.000000   1.745900

             e2f3      mapk1     cdkn2a     srd5a1
count  472.000000 472.000000 472.000000 472.000000
mean     0.696880   0.512338   0.581673   0.672567
std      1.073119   1.015709   1.408055   1.160492
min     -1.992600  -2.659901  -1.331901  -1.201500
25%     -0.102249  -0.225300  -0.452076  -0.180400
50%      0.601601   0.547700   0.031799   0.414699
75%      1.330601   1.164900   1.494925   1.287924
max      4.458301   4.294400   5.837501   6.534898
#cluster_label=1
insights1.describe()
          tp53_mut        bcl2       aph1b       chek1   er_status
count  1432.000000 1432.000000 1432.000000 1432.000000 1432.000000
mean     10.045391    0.230004    0.249609   -0.241742    0.871508
std      24.314789    0.907946    0.935167    0.854743    0.334753
min       0.000000   -2.625804   -2.854401   -1.949803    0.000000
25%       2.000000   -0.276325   -0.344600   -0.833076    1.000000
50%       2.000000    0.316401    0.314051   -0.381150    1.000000
75%       2.000000    0.848577    0.878576    0.172951    1.000000
max     125.000000    2.534905    3.881904    4.015408    1.000000

             gata3        e2f3       mapk1      cdkn2a      srd5a1
count  1432.000000 1432.000000 1432.000000 1432.000000 1432.000000
mean      0.259579   -0.229698   -0.168871   -0.191725   -0.221684
std       0.860239    0.859374    0.935873    0.727733    0.829995
min      -2.812598   -2.885000   -3.069801   -1.356901   -2.120800
25%      -0.024749   -0.787700   -0.807651   -0.610651   -0.708675
50%       0.468950   -0.295050   -0.254400   -0.324951   -0.379500
75%       0.804675    0.226376    0.353850    0.015474    0.027400
max       2.202800    4.480201    3.617600    4.304301    5.345998
Conclusion
The objective of the case study, which was to cluster the data and to identify the important features
along with their contribution to the clustering, has been achieved.