PW4-Unsupervised machine learning:
Kmeans, Hclustering, DBSCAN
ESILV (A4-All) De Vinci Group
Zhiqiang WANG @ 2023-2024
***Requirements for the work: Exercise 10 and Exercise 11 must be uploaded to the DVL repository before the deadline. Your Colab file must be named correctly.***
Learning outcomes PW4
In this section, we will talk about three unsupervised machine learning algorithms:
Hierarchical Clustering, Kmeans and DBSCAN. We use the rand index metric to evaluate the
performance of each algorithm and compare them with a random clustering algorithm.
1. Part I: Generate the original data with labels manually and define the metric function
(rand index).
2. Part II: Cluster the original data with Hierarchical Clustering.
3. Part III: Cluster the original data with Kmeans.
4. Part IV: Cluster the original data with DBSCAN.
Part I: Generate the original data with labels manually and define the metric function (rand index)
Import tools. Please import others if needed
In [ ]: import numpy as np
import matplotlib.pyplot as plt
import scipy as sc
from scipy import cluster
import sklearn as sk
from sklearn import datasets
import pandas as pd
Exercise 1: Generate a synthetic dataset
1. Generate a set of 100 points in a 2-dimensional space split into 4 non-overlapping
clusters. You may need the function numpy.random.rand and the function
numpy.concatenate().
In [ ]: syn = np.empty([0, 2])
synLabels = []
for k in range(4):
    # each cluster: 25 points in the unit square, shifted by 2*k so the clusters do not overlap
    syn = np.concatenate((syn, k*2 + np.random.rand(25, 2)), axis=0)
    synLabels = np.concatenate((synLabels, np.ones(25)*k))
In [ ]: syn
synLabels
2. Display the set with one color per cluster using the scatter function from
matplotlib.pyplot. Here, you may use plt.scatter(X, Y, c=color)
In [ ]: plt.scatter(syn[:, 0], syn[:, 1], c=synLabels)
plt.show()
3. Cluster this original data with the random clustering algorithm as the baseline.
Cluster this dataset into k clusters by assigning a random integer value between 0 and k-1 to
each point. You may need the function numpy.random.randint()
In [ ]: def randomClustering(syn, k):
    # assign each point a random cluster label in {0, ..., k-1}
    return np.random.randint(0, k, syn.shape[0])
rand = randomClustering(syn, 4)
In [ ]: rand
Exercise 2: Create the rand index as the metric to evaluate the performance of
clustering.
The Rand index or Rand measure (named after William M. Rand) in statistics, and in
particular in data clustering, is a measure of the similarity between two data clusterings. A
form of the Rand index may be defined that is adjusted for the chance grouping of elements;
this is the adjusted Rand index. The Rand index is the accuracy of determining if a link
belongs within a cluster or not.
Given a set of $n$ elements $S = \{o_1, \dots, o_n\}$ and two partitions of $S$ to compare,
$X = \{X_1, \dots, X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, \dots, Y_s\}$, a partition of $S$ into $s$
subsets, define the following:
- $a$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in the same subset in $Y$;
- $b$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in different subsets in $Y$;
- $c$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in different subsets in $Y$;
- $d$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in the same subset in $Y$.

The Rand index, $R$, is:

$$R = \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}} = \frac{a+b}{n(n-1)/2}$$
Intuitively, a + b can be considered as the number of agreements between X and Y and
c + d as the number of disagreements between X and Y .
Since the denominator is the total number of pairs, the Rand index represents the frequency
of occurrence of agreements over the total pairs, or the probability that X and Y will agree
on a randomly chosen pair.
Implement the rand index criterion (see [Link] for
reference)
In [ ]: def rand_index(ref, est):
    num = 0
    if len(ref) != len(est):
        raise ValueError('arrays must be of the same size')
    for k in range(len(ref)):
        for l in range(len(ref)-k-1):
            # agreement: the pair (k, k+l+1) is grouped the same way in both partitions
            if (ref[k]==ref[k+l+1] and est[k]==est[k+l+1]) or (ref[k]!=ref[k+l+1] and est[k]!=est[k+l+1]):
                num += 1
    den = len(ref)*(len(ref)-1)/2
    return num/den
Calculate the rand index for the random clustering (the ground truth was generated at the first step as
synLabels)
In [ ]: ri = rand_index(rand, rand)
print(ri)
ri = rand_index(synLabels, rand)
print(ri)
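As a quick sanity check, the hand-written rand_index can be compared with scikit-learn's implementation (assuming scikit-learn >= 0.24, which provides sklearn.metrics.rand_score); the two values should match.
In [ ]: # Optional sanity check against the library implementation (sklearn.metrics.rand_score,
# available from scikit-learn 0.24 onwards); both values should be identical.
from sklearn.metrics import rand_score
print(rand_index(synLabels, rand))
print(rand_score(synLabels, rand))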
Compute the rand index between the reference clustering and 100 runs of the baseline
algorithm.
In [ ]: ri = np.zeros(100)
for k in range(100):
    ri[k] = rand_index(synLabels, randomClustering(syn, 4))
Display results and compute the mean and standard deviation.
In [ ]: plt.bar(0, np.mean(ri), yerr=np.std(ri), alpha=0.5, ecolor='black', capsize=10)
plt.show()
Part II: Hierarchical Clustering
Exercise 3: Compute the Euclidean distance matrix using the pdist function from
scipy.spatial.distance.
In [ ]: import scipy
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
euc_distance_matrix = pdist(syn)
Display and interpret its shape
In [ ]: # We convert the condensed distances into a square matrix with scipy.spatial.distance.squareform
# and plot a heatmap through seaborn
import seaborn as sbn
from scipy.spatial.distance import squareform
square_dist = squareform(euc_distance_matrix)
plt.title("Euclidean distance matrix")
sbn.heatmap(square_dist)
plt.show()
answer here:
Compute the single-link hierarchical clustering using the linkage function from
scipy.cluster.hierarchy.
In [ ]: l = linkage(euc_distance_matrix)  # the default method is 'single' (single-link clustering)
Exercise 4: Display the corresponding dendrogram using the dendrogram function
from scipy.cluster.hierarchy.
In [ ]: from scipy.cluster.hierarchy import dendrogram
plt.figure()
dn = dendrogram(l)
Implement a clustering algorithm that cuts the dendrogram in order to produce k clusters
using the fcluster function from scipy.cluster.hierarchy.
In [ ]: from scipy.cluster.hierarchy import fcluster

def agglomerativeClustering(x, k):
    d = pdist(x)
    l = linkage(d)
    return fcluster(l, k, criterion='maxclust')  # cut the dendrogram into at most k clusters
agg = agglomerativeClustering(syn, 4)
Exercise 5: Compute the rand index between the reference clustering and 100 runs of
this clustering algorithm.
In [ ]: ri = np.zeros(100)
for k in range(100):
    ri[k] = rand_index(synLabels, agglomerativeClustering(syn, 4))
Display results and compute the mean and variance.
In [ ]: plt.bar(0, np.mean(ri), yerr=np.std(ri), alpha=0.5, ecolor='black', capsize=10)
plt.show()
Explain why the standard deviation is 0.
The algorithm is deterministic: it involves no random initialisation, so every run produces exactly the same partition and hence the same rand index.
Part III: Partitional Clustering - Kmeans
Exercise 6: Implement the k-means algorithm (see the Wikipedia article on k-means clustering,
[Link], section "Standard algorithm", for reference).
Hint: please consider the cdist function from scipy.spatial.distance to compute the distances
of the points to the centroids.
In [ ]: from scipy.spatial.distance import cdist

def kMeans(dataset, k, nbIterations=1000, show=False):
    # initialise k random centroids within the range of the data
    centroids = np.random.random((k, dataset.shape[1])) * np.max(dataset)
    labels = np.zeros(dataset.shape[0])
    for i in range(nbIterations):
        if show:
            plt.scatter(dataset[:, 0], dataset[:, 1], c=labels)
            plt.scatter(centroids[:, 0], centroids[:, 1], s=200)
            plt.show()
        # assignment step: distance of every centroid to every point
        cd = cdist(centroids, dataset)
        pastLabels = labels.copy()
        for d in range(dataset.shape[0]):
            labels[d] = np.argmin(cd[:, d])
        # update step: move each centroid to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centroids[c, :] = np.mean(dataset[labels==c, :], axis=0)
        # stop when the assignment no longer changes
        if np.all(pastLabels == labels):
            break
    return labels

labels = kMeans(syn, 4)
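As an optional cross-check (a minimal sketch, not the required implementation), the same dataset can be clustered with scikit-learn's KMeans; its random initialisation may permute the cluster labels between runs, but the rand index is insensitive to label permutations.
In [ ]: # Optional comparison with scikit-learn's KMeans; the label values may differ from ours,
# but the rand index against the reference labels should be comparable.
from sklearn.cluster import KMeans
sk_labels = KMeans(n_clusters=4, n_init=10).fit_predict(syn)
print(rand_index(synLabels, sk_labels))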
Exercise 7: Compute the rand index between the reference clustering and 100 runs of
this clustering algorithm.
In [ ]: ri = np.zeros(100)
for k in range(100):
    ri[k] = rand_index(synLabels, kMeans(syn, 4))
Display results and compute the mean and variance.
In [ ]: plt.bar(0, np.mean(ri), yerr=np.std(ri), alpha=0.5, ecolor='black', capsize=10)
plt.show()
Part IV: DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering
algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
It is a density-based clustering non-parametric algorithm: given a set of points in some
space, it groups together points that are closely packed together (points with many nearby
neighbors), marking as outliers points that lie alone in low-density regions (whose nearest
neighbors are too far away). DBSCAN has two important parameters: eps and min_samples.
Unlike Kmeans, we don't have to specify the number of clusters before clustering.
Exercise 8: Cluster the data syn with DBSCAN.
Here, you can directly use sklearn.cluster.DBSCAN(eps=0.5, min_samples=5,
metric='euclidean')
In [ ]: from sklearn.cluster import DBSCAN
clustering_DBSCAN = DBSCAN(eps=0.5, min_samples=5).fit(syn)
After clustering, display the label assigned to each observation.
In [ ]: clustering_DBSCAN.labels_
Compute the rand index between the reference clustering and 100 runs of this DBSCAN
clustering algorithm.
In [ ]: ri = np.zeros(100)
for k in range(100):
clustering_DBSCAN = DBSCAN(eps=0.5, min_samples=5).fit(syn)
ri[k] = rand_index(synLabels, clustering_DBSCAN.labels_)
In [ ]: plt.bar(0, np.mean(ri), yerr=np.std(ri), alpha=0.5, ecolor='black', capsize=10)
plt.show()
Exercise 9: Performance Analysis
Display the performance of the 4 clustering algorithms (Random, Hierarchical Clustering,
Kmeans and DBSCAN) on the synthetic dataset using the bar function from
matplotlib.pyplot.
In [ ]: ri = np.zeros((4, 100))
for k in range(100):
    ri[0, k] = rand_index(synLabels, randomClustering(syn, 4))
    ri[1, k] = rand_index(synLabels, agglomerativeClustering(syn, 4))
    ri[2, k] = rand_index(synLabels, kMeans(syn, 4))
    ri[3, k] = rand_index(synLabels, DBSCAN(eps=0.5, min_samples=5).fit(syn).labels_)

x = [0, 1, 2, 3]
plt.bar(x, np.mean(ri, axis=1), yerr=np.std(ri, axis=1), alpha=0.5, ecolor='black', capsize=10)
plt.ylabel('rand index')
plt.xticks(x, ['Random', 'Hierarchical', 'kmeans', 'DBSCAN'])
plt.show()
Exercise 10: Load the iris dataset using the sklearn.datasets.load_iris function from
scikit-learn and perform the same performance analysis using this dataset.
This dataset consists of 3 different types of irises (Setosa, Versicolour and Virginica), whose petal
and sepal measurements are stored in a 150x4 numpy.ndarray. The reference labels are stored in
'target' and the 4 features are stored in 'data'.
The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length
and Petal Width. A minimal loading sketch is given below.
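A minimal loading sketch (the performance analysis itself is left to you, since this exercise must be uploaded):
In [ ]: # Load the iris data and its reference labels; redo the 4-algorithm comparison on these.
iris = datasets.load_iris()
iris_data = iris['data']        # 150x4 feature matrix
iris_labels = iris['target']    # reference labels (3 classes)
print(iris_data.shape, np.unique(iris_labels))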
Exercise 11: Load the Breast Cancer Wisconsin (diagnostic) dataset
Use the sklearn.datasets.load_breast_cancer function from scikit-learn and perform the same
performance analysis using this dataset.
The breast cancer dataset is a classic and very easy binary classification dataset. The features
are stored in 'data' and the reference labels are stored in 'target'.
Exercise 12: Determining the number of clusters for KMeans
Implement the gap statistic method for determining the optimal number of clusters for the 3
datasets. The gap statistic compares the within-cluster dispersion of the data with the dispersion
expected under a uniform reference distribution. A large gap value means that the clustering
structure is very different from what a uniform distribution would produce.
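In the formulation of Tibshirani, Walther and Hastie, the gap for a given $k$ compares the log of the within-cluster dispersion $W_k$ of the data with its expectation under a uniform reference distribution:

$$\mathrm{Gap}_n(k) = E^*_n\left[\log W_k\right] - \log W_k$$

The code below approximates the expectation by averaging the dispersion of nrefs random uniform reference datasets, and simply returns the $k$ with the largest gap (the original paper uses a slightly more conservative rule based on the standard error of the reference dispersions).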
In [ ]: # We use K-means from scikit-learn in this method.
# The code is taken from the article: [Link]
from sklearn.cluster import KMeans

def optimalK(data, nrefs=3, maxClusters=15):
    """
    Calculates KMeans optimal K using the Gap Statistic from Tibshirani, Walther, Hastie
    Parameters
    ----------
    data:
        ndarray of shape (n_samples, n_features)
    nrefs:
        number of sample reference datasets to create
    maxClusters:
        Maximum number of clusters to test for
    Returns
    -------
    (optimalK, resultsdf): tuple
    """
    gaps = np.zeros((len(range(1, maxClusters)),))
    resultsdf = pd.DataFrame({'clusterCount': [], 'gap': []})
    for gap_index, k in enumerate(range(1, maxClusters)):
        # Holder for reference dispersion results
        refDisps = np.zeros(nrefs)
        # For n references, generate a random sample and perform kmeans, getting the resulting dispersion
        for i in range(nrefs):
            # Create new random reference set
            randomReference = np.random.random_sample(size=data.shape)
            # Fit to it
            km = KMeans(k)
            km.fit(randomReference)
            refDisp = km.inertia_
            refDisps[i] = refDisp
        # Fit cluster to original data and compute its dispersion
        km = KMeans(k)
        km.fit(data)
        origDisp = km.inertia_
        # Calculate gap statistic
        gap = np.log(np.mean(refDisps)) - np.log(origDisp)
        # Assign this loop's gap statistic to gaps
        gaps[gap_index] = gap
        resultsdf = pd.concat([resultsdf, pd.DataFrame({'clusterCount': [k], 'gap': [gap]})],
                              ignore_index=True)
    return (gaps.argmax() + 1, resultsdf)  # Plus 1 because an index of 0 means 1 cluster is optimal
In [ ]: # We calculate the gap statistic for each of the three datasets; first, the synthetic data:
k, gapdf = optimalK(syn, nrefs=3, maxClusters=11)
print('Optimal k is: ', k)

# Visualization
plt.plot(gapdf.clusterCount, gapdf.gap, linewidth=3)
plt.scatter(gapdf[gapdf.clusterCount == k].clusterCount,
            gapdf[gapdf.clusterCount == k].gap, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count for synthetic data')
plt.show()
In [ ]: # We repeat the analysis for the iris data.
dataset = datasets.load_iris()
iris_data = dataset['data']
k, gapdf = optimalK(iris_data, nrefs=3, maxClusters=10)
print('Optimal k is: ', k)

# Visualization
plt.plot(gapdf.clusterCount, gapdf.gap, linewidth=3)
plt.scatter(gapdf[gapdf.clusterCount == k].clusterCount,
            gapdf[gapdf.clusterCount == k].gap, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count for iris_data')
plt.show()
In [ ]: # We repeat the analysis for the cancer data.
dataset = datasets.load_breast_cancer()
cancer_data = dataset['data']
k, gapdf = optimalK(cancer_data, nrefs=3, maxClusters=11)
print('Optimal k is: ', k)

# Visualization
plt.plot(gapdf.clusterCount, gapdf.gap, linewidth=3)
plt.scatter(gapdf[gapdf.clusterCount == k].clusterCount,
            gapdf[gapdf.clusterCount == k].gap, s=250, c='r')
plt.grid(True)
plt.xlabel('Cluster Count')
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count for cancer_data')
plt.show()