Fundamentals of Machine Learning
Unsupervised learning / Hierarchical and DBSCAN
Agenda
Hierarchical Clustering
DBSCAN Clustering
Section 1
CLUSTERING
HIERARCHICAL CLUSTERING
Hierarchical Clustering
▪ Hierarchical clustering is characterized by the development of
a hierarchy or tree-like structure.
✓ Agglomerative clustering starts with each object in a separate cluster.
Clusters are formed by grouping objects into bigger and bigger clusters.
✓ Divisive clustering starts with all the objects grouped in a single
cluster. Clusters are divided or split until each object is in a separate
cluster.
▪ Agglomerative methods are commonly used in marketing
research. They consist of linkage methods, variance methods,
and centroid methods.
Hierarchical Clustering
▪ Cluster similarity or dissimilarity
✓ Distance metric
• Euclidean distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
• Manhattan distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$ (a quick numeric check of both follows this list)
✓ Linkage criteria
• Single linkage
• Complete linkage
• Average linkage
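As a quick sanity check on the two distance metrics above, here is a minimal NumPy sketch; the points p and q are made up for illustration:

import numpy as np

# Two illustrative points; any n-dimensional vectors work.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((q - p) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(q - p))          # sum of absolute coordinate differences

print(euclidean)  # 5.0 (a 3-4-5 triangle)
print(manhattan)  # 7.0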
Hierarchical Clustering
▪ Hierarchical Agglomerative Clustering - Linkage Methods
✓ The single linkage method is based on the minimum distance, or the nearest-neighbor rule.
✓ The complete linkage method is based on the maximum distance, or the furthest-neighbor approach.
✓ In the average linkage method, the distance between two clusters is defined as the average of the distances between all pairs of objects, one from each cluster.
[Figure: minimum, maximum, and average distances between Cluster 1 and Cluster 2]
Hierarchical Clustering
▪ Agglomerative Clustering Algorithm
▪ Basic algorithm
✓ Compute the distance matrix between the input data points.
✓ Let each data point be a cluster
✓ Repeat
• Merge the two closest clusters
• Update the distance matrix
Until only a single cluster remains (a runnable sketch of these steps follows)
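A minimal from-scratch sketch of the basic algorithm, assuming Euclidean distance and single (or complete) linkage; the function name and structure are illustrative, not scikit-learn's implementation:

import numpy as np

def agglomerative_merges(X, linkage="single"):
    """Merge the two closest clusters until one remains,
    recording each merge (naive reference version, for clarity)."""
    clusters = [[i] for i in range(len(X))]  # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Scan all cluster pairs for the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_dists = [np.linalg.norm(X[i] - X[j])
                              for i in clusters[a] for j in clusters[b]]
                d = min(pair_dists) if linkage == "single" else max(pair_dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a][:], clusters[b][:], d))
        clusters[a] += clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return merges

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
for left, right, d in agglomerative_merges(X):
    print(left, "+", right, f"merged at distance {d:.2f}")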
Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Input/Initial setting
• Start with clusters of individual points and a distance matrix
[Figure: nine points (1-9), each starting as its own cluster, alongside the initial distance matrix]
Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Intermediate State
• After some merging steps, we have some clusters
[Figure: after some merging steps the points form clusters C1, C2, and C3; the distance matrix is updated accordingly]
Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Intermediate State
• Merge the two closest clusters (C1 and C2) into C12 and update the distance matrix
[Figure: C1 and C2 merge into C12; the distance matrix shrinks to rows and columns for C12 and C3]
Hierarchical Clustering
▪ Agglomerative Clustering Example
✓ Stop
• Only a single cluster remains
[Figure: all points merged into a single cluster; the merge history is shown as a dendrogram]
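To reproduce a dendrogram like the one sketched above, SciPy's hierarchy module is a common choice; this uses the same toy points as the sklearn example below:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

Z = linkage(X, method='average')  # full merge history (n-1 merges)
dendrogram(Z)                     # tree of merges, one leaf per data point
plt.title('Dendrogram')
plt.show()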
Hierarchical Clustering
▪ Advantages vs Disadvantages
✓ Advantages
• Does not require the number of clusters to be specified
• Easy to implement
• Produces a dendrogram, which helps with understanding the data
✓ Disadvantages
• Can never undo any previous step during the algorithm
• Generally has a long runtime
• It is sometimes difficult to identify the number of clusters from the dendrogram
Hierarchical Clustering
▪ Agglomerative Clustering with sklearn
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Average linkage with Euclidean distance. Recent scikit-learn versions
# take metric= (the older affinity= parameter has been removed).
cluster = AgglomerativeClustering(n_clusters=2, linkage='average',
                                  metric='euclidean')
clustering = cluster.fit(X)
labels = clustering.labels_
print(labels)  # two clusters, e.g. [1 1 1 0 0 0]
Hierarchical Clustering
▪ Practice
Section 2
CLUSTERING
DBSCAN
DBSCAN
▪ What is DBSCAN?
✓ DBSCAN is a density-based algorithm
✓ DBSCAN stands for Density-Based Spatial Clustering of
Applications with Noise
✓ Density-based Clustering locates regions of high density that are
separated from one another by regions of low density
Density = number of points within a specified radius (R)
DBSCAN
▪ K-Means vs DBSCAN?
✓ K-Means assigns every point to a cluster, even points that do not belong in any
✓ DBSCAN locates regions of high density and separates outliers as noise (a small comparison in code follows)
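A small illustration of this difference; the toy points and parameters are made up, and eps/min_samples correspond to the R and M parameters introduced below:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two tight groups plus one obvious outlier (toy data).
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8],
              [50, 50]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=2, min_samples=2).fit(X)

print(km.labels_)  # every point is assigned to some cluster, outlier included
print(db.labels_)  # the outlier is labeled -1 (noise), e.g. [0 0 0 1 1 1 -1]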
DBSCAN
▪ Radius and minimum number of neighbors
✓ R - radius of neighborhood
• A radius such that, if enough points fall within it, we call the area dense
✓ M - minimum number of neighbors
• The minimum number of data points we want in a neighborhood to define a cluster
[Figure: a neighborhood of radius R around a point, with an isolated noise point outside any dense area]
DBSCAN
▪ Core point, border point, and noise point
✓ A point is a core point if it has at least a specified number of points (M) within radius R
✓ A border point has fewer than M points within R, but lies in the neighborhood of a core point
✓ A noise point is any point that is neither a core point nor a border point (a sketch of these definitions follows)
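A minimal sketch of these three definitions, assuming Euclidean distance; the function name and sample data are illustrative:

import numpy as np

def classify(X, R, M):
    """Label every point as 'core', 'border', or 'noise'."""
    # Pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = D <= R                  # neighborhood membership (includes the point itself)
    core = within.sum(axis=1) >= M   # core: at least M points within R
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif core[within[i]].any():  # border: some core point lies within R
            labels.append("border")
        else:
            labels.append("noise")
    return labels

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
print(classify(X, R=3, M=2))  # ['core', 'core', 'core', 'core', 'core', 'noise']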
DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• Randomly select a point and check whether or not it is a core point
[Figure: radius-R neighborhoods around two candidate points; one qualifies as a core point, the other does not]
DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• Once a point is determined to be a core point, we check the points in its neighborhood to see whether any of them is the next core point
DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• We continue until we have visited all the points in the dataset and labeled each one as core, border, or outlier
DBSCAN
▪ How DBSCAN works
✓ Example: R = 3, M = 4
• The next step is to connect neighboring core points and put them in the same cluster
• A cluster is formed from at least one core point, plus all core points reachable from it, plus all of their border points (see the sketch below)
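Putting the steps together, a minimal from-scratch sketch of the whole procedure, growing each cluster by breadth-first expansion from core points; names and data are illustrative, not sklearn's implementation:

import numpy as np
from collections import deque

def dbscan_sketch(X, R, M):
    """Grow one cluster per unvisited core point, as described above."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(D[i] <= R) for i in range(len(X))]
    core = np.array([len(nb) >= M for nb in neighbors])
    labels = np.full(len(X), -1)  # -1 = noise until assigned to a cluster
    cluster = 0
    for i in range(len(X)):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        queue = deque([i])        # breadth-first expansion from a core point
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster  # core or border point joins the cluster
                    if core[q]:
                        queue.append(q)  # only core points keep expanding
        cluster += 1
    return labels

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
print(dbscan_sketch(X, R=3, M=2))  # e.g. [ 0  0  0  1  1 -1]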
DBSCAN
▪ Advantages vs Disadvantages
✓ Advantages of DBSCAN:
• Arbitrarily shaped clusters
• Robust to outliers
• Does not require specification of the number of clusters
✓ Disadvantages of DBSCAN
• Does not work well for sparse datasets or datasets with varying density
• Sensitive to the R and M parameters (a common heuristic for choosing R is sketched below)
• Difficult to parallelize across multiprocessor systems
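To soften the parameter sensitivity, a common heuristic is the k-distance plot: sort each point's distance to the M-th member of its neighborhood and look for the elbow. A sketch under assumed toy data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(0).rand(200, 2)  # toy data for illustration
M = 4

# Querying the training set itself, so each point's nearest "neighbor"
# is the point itself; distances[:, -1] is then the distance to the
# M-th member of its neighborhood, matching DBSCAN's counting.
nn = NearestNeighbors(n_neighbors=M).fit(X)
distances, _ = nn.kneighbors(X)

plt.plot(np.sort(distances[:, -1]))
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {M}-th neighborhood member")
plt.show()  # the elbow of this curve is a reasonable eps (R) candidate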
DBSCAN
▪ DBSCAN with sklearn
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
print(labels)  # [0 0 0 1 1 -1]; -1 marks the noise point
✓ eps: float, default=0.5. The maximum distance between two samples for one to be considered as in the neighborhood of the other.
✓ min_samples: int, default=5. The number of samples (or total weight) in a neighborhood for a point to be considered a core point.
DBSCAN
▪ Practice
Thank you