Session 11 Hierarchical DBSCAN

The document provides an overview of two clustering techniques: Hierarchical Clustering and DBSCAN. Hierarchical Clustering involves creating a tree-like structure through agglomerative or divisive methods, while DBSCAN is a density-based algorithm that identifies clusters based on high-density regions. Each method has its advantages and disadvantages, with practical examples and implementations using Python's sklearn library included.


Fundamentals of Machine Learning
Unsupervised learning / Hierarchical and DBSCAN



Agenda

Hierarchical Clustering

DBSCAN Clustering



Section 1

CLUSTERING: HIERARCHICAL CLUSTERING



Hierarchical Clustering
▪ Hierarchical clustering is characterized by the development of
a hierarchy or tree-like structure.
✓ Agglomerative clustering starts with each object in a separate cluster.
Clusters are formed by grouping objects into bigger and bigger clusters.
✓ Divisive clustering starts with all the objects grouped in a single
cluster. Clusters are divided or split until each object is in a separate
cluster.
▪ Agglomerative methods are commonly used in marketing
research. They consist of linkage methods, variance methods,
and centroid methods.



Hierarchical Clustering
▪ Cluster similarity or dissimilarity
✓ Distance metric

• Euclidean distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$

• Manhattan distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$


✓ Linkage criteria
• Single linkage
• Complete linkage
• Average linkage
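
As a quick check of both metrics, the formulas above can be evaluated directly with numpy (a minimal sketch; the points p and q are arbitrary examples):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

# Euclidean distance: square root of the summed squared coordinate differences
euclidean = np.sqrt(np.sum((q - p) ** 2))

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(q - p))

print(euclidean)  # 3.6055... = sqrt(13)
print(manhattan)  # 5.0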



Hierarchical Clustering
▪ Hierarchical Agglomerative Clustering - Linkage Methods
✓ The single linkage method is based on the minimum distance, or the nearest-neighbor rule.
✓ The complete linkage method is based on the maximum distance, or the furthest-neighbor approach.
✓ In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster.

[Figure: minimum, maximum, and average distances between Cluster 1 and Cluster 2]
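
The three linkage criteria can be compared side by side with scipy (a minimal sketch; the toy data and the cut into two clusters are assumptions for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Build a full merge tree under each linkage criterion, then cut it into 2 clusters
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge matrix
    labels = fcluster(Z, t=2, criterion="maxclust")  # cluster label per point
    print(method, labels)

On such well-separated data all three criteria agree; they differ on elongated or noisy clusters, where single linkage tends to chain and complete linkage tends to break clusters apart.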



Hierarchical Clustering
▪ Agglomerative Clustering Algorithm
▪ Basic algorithm
✓ Compute the distance matrix between the input data points.
✓ Let each data point be a cluster
✓ Repeat
• Merge the two closest clusters
• Update the distance matrix
Until only a single cluster remains (see the sketch below)
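
The loop above can be written out directly; below is a naive single-linkage version (a sketch for illustration, assuming Euclidean distance, not optimized code):

import numpy as np

def agglomerative(X, n_clusters=1):
    # Start with every point as its own cluster
    clusters = [[i] for i in range(len(X))]
    # Distance matrix between all input points, computed once
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        # Find the two closest clusters (single linkage: minimum pairwise distance)
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        # Merge them and repeat until the stopping condition is met
        a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
print(agglomerative(X, n_clusters=2))  # [[0, 1, 2], [3, 4, 5]]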



Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Input/Initial setting
• Start with clusters of individual points and a distance matrix
[Figure: nine points (1-9) in a scatter plot, each starting as its own cluster, alongside the initial distance matrix]
Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Intermediate State
• After some merging steps, we have some clusters
[Figure: after several merge steps the points form three clusters C1, C2, and C3; the distance matrix is updated to match]


Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Intermediate State
• Merge the two closest clusters (C1 and C2) and update the distance matrix

[Figure: C1 and C2 merge into cluster C12; the distance matrix is reduced to entries for C12 and C3]


Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Stop
• Only a single cluster remains

[Figure: all points eventually merge into a single cluster; the full merge history is drawn as a dendrogram]
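
The dendrogram for a small dataset can be reproduced with scipy (a minimal sketch; the six-point toy data is an assumption):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Build the merge tree, then draw it; the y-axis shows the distance at each merge
Z = linkage(X, method="average")
dendrogram(Z)
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()

Cutting the tree with a horizontal line yields a flat clustering; the height of the cut plays the role of the number-of-clusters choice.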
Hierarchical Clustering
▪ Advantages vs Disadvantages
✓ Advantages
• Does not require the number of clusters to be specified
• Easy to implement
• Produces a dendrogram, which helps with understanding the data
✓ Disadvantages
• Can never undo any previous merge steps during the algorithm
• Generally has a long runtime on large datasets
• It is sometimes difficult to identify the number of clusters from the dendrogram



Hierarchical Clustering
▪ Agglomerative Clustering with sklearn
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Average linkage with Euclidean distances; recent sklearn versions use the
# `metric` parameter (older versions called it `affinity`)
cluster = AgglomerativeClustering(n_clusters=2, linkage='average', metric='euclidean')
clustering = cluster.fit(X)
labels = clustering.labels_
print(labels)
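
On this toy data the two clusters separate along the first coordinate, so the call typically prints something like [1 1 1 0 0 0]; which numeric label lands on which group is not guaranteed.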



Hierarchical Clustering
▪ Practice



Section 2

CLUSTERING: DBSCAN



DBSCAN
▪ What is DBSCAN?
✓ DBSCAN is a density-based algorithm
✓ DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise
✓ Density-based Clustering locates regions of high density that are
separated from one another by regions of low density
Density = number of points within a specified radius (R)
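
Counting neighbors inside a radius is all that "density" means here (a minimal numpy sketch; the points and radius are arbitrary examples):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 1.0], [1.2, 1.4], [5.0, 5.0]])
R = 1.0

# Density of point 0: how many points (counting itself) lie within radius R of it
dists = np.linalg.norm(X - X[0], axis=1)
print(np.sum(dists <= R))  # 3 -> points 0, 1, and 2 are within R of point 0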



DBSCAN
▪ K-Means vs DBSCAN?
✓ K-Means: assigns every point to a cluster, even points that do not belong in any
✓ DBSCAN: locates regions of high density and separates out outliers as noise
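
The contrast is easiest to see on non-convex data (a minimal sketch using sklearn's make_moons; the eps and min_samples values are assumptions tuned for this toy set):

import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

# Two interleaving half-moons with a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means forces every point into one of k roughly spherical clusters
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN follows the dense crescents and labels stragglers as -1 (noise)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(np.unique(km_labels))  # [0 1] -- every point is assigned somewhere
print(np.unique(db_labels))  # typically [0 1], with -1 added if any point is noise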



DBSCAN
▪ Radius and minimum number of neighbors
✓ R - radius of the neighborhood
• A radius that, if it encloses enough points, defines a dense area
✓ M - minimum number of neighbors
• The minimum number of data points we want in a neighborhood to define a cluster

[Figure: a neighborhood of radius R around a point; an isolated point with too few neighbors is labeled noise]


DBSCAN
▪ Core point, border point and noise point
✓ A point is a core point if it has at least the specified number of points (M) within radius R
✓ A border point has fewer than M points within R, but lies in the neighborhood of a core point
✓ A noise point is any point that is neither a core point nor a border point
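
These three labels follow directly from the pairwise distances (a minimal sketch; the data, R, and M values are arbitrary examples):

import numpy as np

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [2.1, 1.0], [8.0, 8.0]])
R, M = 1.0, 3

# Pairwise Euclidean distances; a point counts as its own neighbor here
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbor_counts = np.sum(dist <= R, axis=1)

core = neighbor_counts >= M
# Border: not core, but within R of at least one core point
border = ~core & np.any(dist[:, core] <= R, axis=1)
noise = ~core & ~border

for i in range(len(X)):
    kind = "core" if core[i] else ("border" if border[i] else "noise")
    print(i, kind)  # points 0-2 are core, point 3 is border, point 4 is noise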


DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• Randomly select a point and check whether or not it is a core point

[Figure: two candidate points with their R-neighborhoods; one has enough neighbors to be a core point, the other does not]


DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• Once a point is determined to be a core point, we check its current border points to see whether any of them is the next core point

[Figure: the R-neighborhood of the current core point is examined for further core points]


DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• We continue until every point in the dataset has been visited and labeled as either core, border, or outlier

[Figure: the full dataset with every point labeled core, border, or outlier]


DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• The next step is to connect core points that are neighbors and put them in the same cluster
• A cluster is formed from at least one core point, plus all core points reachable from it, plus all of their border points (a minimal end-to-end sketch follows)
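
Putting the steps together, here is a compact DBSCAN written from the definitions above (a sketch for illustration, not the reference implementation; label -1 marks noise, matching sklearn's convention):

import numpy as np

def dbscan(X, R, M):
    # Precompute neighborhoods; a point counts as its own neighbor
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= R) for i in range(n)]
    core = np.array([len(nb) >= M for nb in neighbors])

    labels = np.full(n, -1)  # everything is noise until claimed by a cluster
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Start a new cluster at an unvisited core point and expand outward
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster                # claim core or border point
                if core[j]:
                    frontier.extend(neighbors[j])  # only core points keep expanding
        cluster += 1
    return labels

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
print(dbscan(X, R=3, M=2))  # [ 0  0  0  1  1 -1]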


DBSCAN
▪ Advantages vs Disadvantages
✓ Advantages of DBSCAN:
• Arbitrarily shaped clusters
• Robust to outliers
• Does not require specification of the number of clusters
✓ Disadvantages of DBSCAN
• Does not work very well for sparse datasets or datasets with varying density.
• Sensitive to R and M parameters.
• Not easily parallelizable across multiprocessor systems
DBSCAN
▪ DBSCAN with sklearn
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# eps is the neighborhood radius R; min_samples is the core-point threshold M
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
print(labels)  # [ 0  0  0  1  1 -1] -> -1 marks the noise point

✓ eps: float, default=0.5: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
✓ min_samples: int, default=5: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point.
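
With the example above, eps=3 and min_samples=2 put the three points near the origin in cluster 0, the pair around (8, 7) and (8, 8) in cluster 1, and leave the far-away point (25, 80) labeled -1 as noise.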



DBSCAN
▪ Practice



Thank you

