SPLEX TME 2
Clustering
The goal of this TME is to learn how to use some popular clustering methods (unsupervised
learning) and how to interpret their results.
We will use the scikit-learn Python library (https://scikit-learn.org), which is already installed
on the computers.
Data (simulated data sets + data sets of TME 1)
We explore two data sets downloadable from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml):
• Breast Cancer Wisconsin (Diagnostic) Data Set (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
• Mice Protein Expression Data Set (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression)
Libraries
You will need to load the following packages:
import matplotlib.pyplot as plt
from sklearn import cluster
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons
Analysis
Before running the analysis on the Breast and Mice data sets, we will first analyse three simulated
data sets to better understand what the different clustering methods do and why they produce
different clusterings. Generate and visualize the artificial data as follows:
# First simulated data set
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_samples=200, n_features=2, n_redundant=0, n_informative=2,
                             n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1, s=25, edgecolor='k')
# Second simulated data set
plt.title("Three blobs", fontsize='small')
X2, Y2 = make_blobs(n_samples=200, n_features=2, centers=3)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2, s=25, edgecolor='k')
# Third simulated data set
plt.title("Non-linearly separated data sets", fontsize='small')
X3, Y3 = make_moons(n_samples=200, shuffle=True, noise=None, random_state=None)
plt.scatter(X3[:, 0], X3[:, 1], marker='o', c=Y3, s=25, edgecolor='k')
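Note that the snippet above draws all three data sets on the same axes. A minimal sketch (reusing the variables X1 to Y3 defined above) that shows them side by side in subplots instead:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
panels = [(X1, Y1, "One cluster per class"),
          (X2, Y2, "Three blobs"),
          (X3, Y3, "Two moons")]
for ax, (X, Y, title) in zip(axes, panels):
    # Same plotting conventions as above, one panel per data set.
    ax.scatter(X[:, 0], X[:, 1], marker='o', c=Y, s=25, edgecolor='k')
    ax.set_title(title, fontsize='small')
plt.show()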
Apply the following clustering methods to the three simulated data sets.
Clustering Methods
1. K-means
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
An example of k-means clustering (where k is the number of clusters you want to produce,
and X is the data matrix):
km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
You can also visualize the clustering (and compare it to the true class assignment):
plt.scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)
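For instance, a quick sketch on the three-blob data from above (k = 3 here, matching the number of generated centers):
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
km.fit(X2)
# Left panel: true classes; right panel: k-means clusters.
fig, (ax_true, ax_km) = plt.subplots(1, 2, figsize=(8, 4))
ax_true.scatter(X2[:, 0], X2[:, 1], s=10, c=Y2)
ax_true.set_title("True classes", fontsize='small')
ax_km.scatter(X2[:, 0], X2[:, 1], s=10, c=km.labels_)
ax_km.set_title("K-means clusters", fontsize='small')
plt.show()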
2. Hierarchical clustering
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
An example of hierarchical clustering (where k is the number of clusters you want to produce,
and X is the data matrix):
for linkage in ('ward', 'average', 'complete'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=k)
    clustering.fit(X)
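The loop above fits the three linkage strategies but does not display the results. A possible visualization on the two-moons data (assuming X3 from above and k = 2):
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, linkage in zip(axes, ('ward', 'average', 'complete')):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=2)
    clustering.fit(X3)
    # Color each point by its assigned cluster, one panel per linkage.
    ax.scatter(X3[:, 0], X3[:, 1], s=10, c=clustering.labels_)
    ax.set_title("linkage = %s" % linkage, fontsize='small')
plt.show()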
3. Spectral clustering
Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
An example of spectral clustering (where k is the number of clusters you want to produce,
and X is the data matrix):
spectral = cluster.SpectralClustering(n_clusters=k, eigen_solver='arpack',
                                      affinity="nearest_neighbors")
spectral.fit(X)
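The nearest-neighbours affinity lets spectral clustering follow non-convex shapes. As a quick check on the two-moons data (assuming X3 from above and k = 2):
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack',
                                      affinity="nearest_neighbors")
spectral.fit(X3)
# Each moon should end up in its own cluster.
plt.scatter(X3[:, 0], X3[:, 1], s=10, c=spectral.labels_)
plt.title("Spectral clustering on the two moons", fontsize='small')
plt.show()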
4. Analyse the results of clustering in terms of the following scores (a computation sketch follows this list):
• Homogeneity: metrics.homogeneity_score()
• Completeness: metrics.completeness_score()
• V-measure: metrics.v_measure_score()
• Adjusted Rand Index: metrics.adjusted_rand_score()
• Silhouette Coefficient: metrics.silhouette_score()
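A minimal sketch of how these scores can be computed; here X, Y and km are placeholder names for the data matrix, the true labels and a fitted clustering model:
labels = km.labels_
# The first four scores compare the clustering to the true labels;
# the Silhouette Coefficient uses only the data and the cluster labels.
print("Homogeneity:   %.3f" % metrics.homogeneity_score(Y, labels))
print("Completeness:  %.3f" % metrics.completeness_score(Y, labels))
print("V-measure:     %.3f" % metrics.v_measure_score(Y, labels))
print("Adjusted Rand: %.3f" % metrics.adjusted_rand_score(Y, labels))
print("Silhouette:    %.3f" % metrics.silhouette_score(X, labels))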
5. What is an optimal clustering method for each simulated data set?
6. Re-run the clustering methods on the Breast cancer and Mice data sets. Do not include the
class variables in the clustering itself, but compare the obtained clusters with the true class
labels (a sketch of this workflow follows).
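As a sketch of this workflow on the Breast cancer data; the file name wdbc.data, its column layout and the use of pandas are assumptions, so adapt them to how you loaded the data in TME 1:
import pandas as pd

# Assumed layout of wdbc.data: column 0 = ID, column 1 = diagnosis (M/B),
# columns 2-31 = the 30 numeric features; the file has no header row.
data = pd.read_csv("wdbc.data", header=None)
X = data.iloc[:, 2:].values                      # class variable excluded
labels_true = (data.iloc[:, 1] == 'M').astype(int).values

km = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1)
km.fit(X)
print("Adjusted Rand Index vs. true diagnosis: %.3f"
      % metrics.adjusted_rand_score(labels_true, km.labels_))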