Unit - III
Cluster Analysis
INT303R01 - Data Warehousing and Data Mining
Today’s Topic
Hierarchical Methods
Agglomerative and Divisive
Hierarchical Clustering
◻ While partitioning methods meet the basic clustering requirement of organizing a
set of objects into a number of exclusive groups, in some situations we may want to
partition our data into groups at different levels such as in a hierarchy.
◻ A hierarchical clustering method works by grouping data objects into a hierarchy or
“tree” of clusters.
◻ Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
◻ Hierarchical clustering methods organize objects into a hierarchy using either a
bottom-up (agglomerative) or a top-down (divisive) strategy.
◻ Agglomerative methods start with individual objects as clusters, which are
iteratively merged to form larger clusters.
◻ Conversely, divisive methods initially let all the given objects form one cluster, which
they iteratively split into smaller clusters.
Hierarchical Clustering
◻ Uses a distance matrix as the clustering criterion. This method does not require the
number of clusters k as input, but it needs a termination condition.
Figure: AGNES (agglomerative) proceeds from Step 0 to Step 4 on five objects a–e,
merging a and b into (ab), d and e into (de), then c with (de) into (cde), and finally
(ab) with (cde) into (abcde). DIANA (divisive) runs the same tree in reverse, splitting
from Step 4 back to Step 0.
INT303R01 - Data Warehousing and Data Mining
AGNES (Agglomerative Nesting)
5
◻ Introduced in Kaufmann and Rousseeuw (1990)
◻ Implemented in statistical packages, e.g., S-PLUS
◻ Uses the single-link method and the dissimilarity matrix
◻ Merges the nodes that have the least dissimilarity
◻ Merging continues in a non-descending order of dissimilarity
◻ Eventually all nodes belong to the same cluster
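The AGNES loop described above can be sketched in a few lines of pure Python (an illustrative sketch, not the S-PLUS implementation; `single_link` and `agnes` are hypothetical names):

```python
# Pure-Python sketch of AGNES with single link (illustrative only).

def single_link(dist, a, b):
    """Least dissimilarity between clusters a and b (sets of indices)."""
    return min(dist[i][j] for i in a for j in b)

def agnes(dist):
    """dist: symmetric matrix of pairwise dissimilarities.
    Returns the merges in non-descending order of dissimilarity."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters with the least dissimilarity.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda p: single_link(dist, *p),
        )
        merges.append((sorted(a), sorted(b), single_link(dist, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges
```

On the one-dimensional example used later in this deck, {7, 10, 20, 28, 35}, the merge dissimilarities come out as 3, 7, 8, 10 — non-descending, as stated.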
DIANA (Divisive Analysis)
◻ Introduced in Kaufmann and Rousseeuw (1990)
◻ Implemented in statistical analysis packages, e.g., S-PLUS
◻ Works in the inverse order of AGNES
◻ Eventually each node forms a cluster on its own
DIANA (Divisive Analysis)
◻ A challenge with divisive methods is how to partition a large cluster into several
smaller ones.
◻ For example, there are 2^(n−1) − 1 possible ways to partition a set of n objects into
two nonempty exclusive subsets.
◻ When ‘n’ is large, it is computationally prohibitive to examine all possibilities.
◻ Consequently, a divisive method typically uses heuristics in partitioning, which can
lead to inaccurate results.
◻ For the sake of efficiency, divisive methods typically do not backtrack on partitioning
decisions that have been made.
◻ Once a cluster is partitioned, any alternative partitioning of this cluster will not be
considered again.
◻ Due to the challenges in divisive methods, there are many more agglomerative
methods than divisive methods.
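The 2^(n−1) − 1 count above can be checked by brute-force enumeration (an illustrative sketch; `two_way_splits` is a hypothetical helper, usable only for small n precisely because of this blow-up):

```python
from itertools import combinations

def two_way_splits(items):
    """All ways to split items into two nonempty exclusive subsets."""
    first, rest = items[0], items[1:]
    splits = []
    # Pinning the first item to one side avoids counting mirror images twice.
    for r in range(len(items)):
        for combo in combinations(rest, r):
            left = (first,) + combo
            right = tuple(x for x in items if x not in left)
            if right:  # both sides must be nonempty
                splits.append((left, right))
    return splits

print(len(two_way_splits((1, 2, 3, 4))))  # 2**(4 - 1) - 1 = 7
```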
Dendrogram: Shows How Clusters are Merged
◻ A tree structure called a dendrogram is
commonly used to represent the process
of hierarchical clustering.
◻ It shows how objects are grouped
together (in an agglomerative method) or
partitioned (in a divisive method)
step-by-step.
◻ A clustering of the data objects is obtained
by cutting the dendrogram at the
desired level; each connected
component then forms a cluster.
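For single link, cutting the dendrogram at level t amounts to taking the connected components of the graph that joins points at distance at most t. A minimal sketch (`cut_single_link` is a hypothetical helper; it assumes the cut keeps merges of height ≤ t and defaults to 1-D absolute difference as the distance):

```python
# Cutting a single-link dendrogram at level t = connected components of the
# graph whose edges join points at distance <= t.
def cut_single_link(points, t, d=lambda a, b: abs(a - b)):
    clusters = []
    for p in points:
        # Any existing cluster within t of p gets fused together with p.
        near = [c for c in clusters if any(d(p, q) <= t for q in c)]
        far = [c for c in clusters if c not in near]
        clusters = far + [sum(near, []) + [p]]
    return clusters

print(cut_single_link([7, 10, 20, 28, 35], t=9))  # [[7, 10], [20, 28, 35]]
```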
Distance Measures
◻ Whether an agglomerative or a divisive method is used,
a core need is to measure the distance between two clusters,
where each cluster is generally a set of objects.
◻ Four widely used measures for the distance between clusters are as follows,
where |p − p′| is the distance between two objects or points p and p′,
mi is the mean of cluster Ci, and
ni is the number of objects in Ci.
◻ They are also known as linkage measures.
Distance Measures
◻ Single link (minimum distance): the smallest distance between an element in one
cluster and an element in the other, i.e., dist_min(Ci, Cj) = min over p ∈ Ci, p′ ∈ Cj of |p − p′|
◻ Complete link (maximum distance): the largest distance between an element in one
cluster and an element in the other, i.e., dist_max(Ci, Cj) = max over p ∈ Ci, p′ ∈ Cj of |p − p′|
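Both linkage measures translate directly into code (an illustrative sketch; the function names are hypothetical):

```python
# Direct transcription of the two linkage measures.
def dist_single(Ci, Cj, d):
    """Single link: smallest pairwise distance between the clusters."""
    return min(d(p, q) for p in Ci for q in Cj)

def dist_complete(Ci, Cj, d):
    """Complete link: largest pairwise distance between the clusters."""
    return max(d(p, q) for p in Ci for q in Cj)

d = lambda a, b: abs(a - b)
print(dist_single([7, 10], [20, 28, 35], d))    # 10  -> |10 - 20|
print(dist_complete([7, 10], [20, 28, 35], d))  # 28  -> |7 - 35|
```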
Distance Measures
◻ Mean distance: the distance between the cluster means, i.e.,
dist_mean(Ci, Cj) = |mi − mj|
◻ Average distance: the average of all pairwise distances, i.e.,
dist_avg(Ci, Cj) = (1 / (ni nj)) · sum over p ∈ Ci, p′ ∈ Cj of |p − p′|
Hierarchical Clustering
◻ When an algorithm uses the minimum distance, dist_min(Ci, Cj), to measure the
distance between clusters, it is sometimes called a nearest-neighbor clustering
algorithm.
◻ If the clustering process is terminated when the distance between nearest clusters
exceeds a user-defined threshold, it is called a single-linkage algorithm.
◻ When an algorithm uses the maximum distance, dist_max(Ci, Cj), to measure the
distance between clusters, it is sometimes called a farthest-neighbor clustering
algorithm.
◻ If the clustering process is terminated when the maximum distance between
nearest clusters exceeds a user-defined threshold, it is called a complete-linkage
algorithm.
Example 1: Single-Dimensional Data
Find the clusters using the single-link technique. Use Euclidean
distance and draw the dendrogram.
◻ {7,10,20,28,35}
Example 1: Single-Dimensional Data
◻ {7,10,20,28,35}
◻ Visualization
Example 1: Single-Dimensional Data
◻ {7, 10, 20, 28, 35}
7 10 20 28 35: adjacent distances 3, 10, 8, 7 → min = 3, cluster (7,10)
(7,10) 20 28 35: distances 10, 8, 7 → min = 7, cluster (28,35)
(7,10) 20 (28,35): distances 10, 8 → min = 8, cluster (20,28,35)
(7,10) (20,28,35): distance 10 → final merge into (7,10,20,28,35)
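The merge sequence above can be replayed in a few lines of Python (a standalone sketch that recomputes the single-link distance at each step):

```python
# Replay of Example 1: repeatedly merge the two clusters at the smallest
# single-link (minimum pairwise) distance.
clusters = [[7], [10], [20], [28], [35]]
while len(clusters) > 1:
    link = lambda i, j: min(abs(p - q) for p in clusters[i] for q in clusters[j])
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda ij: link(*ij))
    print(f"min = {link(i, j)}, merge {clusters[i]} + {clusters[j]}")
    clusters[i] += clusters.pop(j)  # i < j, so index i stays valid
```

This prints the same sequence of minima as the worked steps: 3, 7, 8, 10.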
Example 2
Find the clusters using the single-link technique. Use Euclidean
distance and draw the dendrogram.
Visualization
Distance Matrix
P1 = (0.40, 0.53)
P2 = (0.22, 0.38)

       P1     P2     P3    P4    P5    P6
P1     0      0.23
P2     0.23   0
P3                   0
P4                         0
P5                               0
P6                                     0
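Only P1 and P2 are given explicitly in the text (the remaining coordinates appear only in the slide's figure), so only that matrix entry can be reproduced here. Using Euclidean distance:

```python
# Euclidean distance behind the matrix entries; only P1 and P2 are stated
# in the text, so only dist(P1, P2) is computed.
from math import dist  # Python 3.8+

P1 = (0.40, 0.53)
P2 = (0.22, 0.38)
print(round(dist(P1, P2), 2))  # 0.23, matching the matrix entry
```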
Distance Matrix
Min Distance
Cluster P3,P6
       P1     P2     P3,P6  P4    P5
P1     0
P2     0.24   0
P3,P6  ?      ?      0
P4     0.37   0.20   ?      0
P5     0.34   0.14   ?            0
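The "?" entries follow from the single-link update rule: the distance from the merged cluster (P3,P6) to any other cluster k is the minimum of the old distances d(P3, k) and d(P6, k). A sketch with hypothetical toy values (the slide's actual numbers live in the figure, so none are reproduced here):

```python
# Single-link update rule: after merging a and b, the distance from the new
# cluster to any other cluster k is min(d[a, k], d[b, k]).
def single_link_merge(d, a, b):
    """d maps frozenset({x, y}) -> distance; returns the updated matrix."""
    others = {c for pair in d for c in pair} - {a, b}
    merged = a + "," + b
    out = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
    for k in others:
        out[frozenset({merged, k})] = min(d[frozenset({a, k})], d[frozenset({b, k})])
    return out

# Toy distances (hypothetical values, NOT the slide's figures):
d = {frozenset({"P3", "P6"}): 0.11,
     frozenset({"P3", "P4"}): 0.20, frozenset({"P6", "P4"}): 0.22,
     frozenset({"P3", "P5"}): 0.28, frozenset({"P6", "P5"}): 0.39,
     frozenset({"P4", "P5"}): 0.29}
print(single_link_merge(d, "P3", "P6"))
```

With the slide's real distances substituted for the toy values, this rule fills each "?" in the matrix above.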
Update distance matrix
Updated Distance Matrix for P3,P6
Next Min Distance
Dendrogram
Cluster P2,P5
       P1     P2,P5  P3,P6  P4
P1     0
P2,P5         0
P3,P6  0.22          0
P4     0.37   0.15          0
Update distance matrix
Updated distance matrix after P2,P5
Next Min Distance
Dendrogram
Update Distance Matrix
Updated Distance Matrix after P2,P5,P3,P6
Next Min Distance
Dendrogram
Update Distance Matrix
Updated Distance Matrix after P2,P5,P3,P6,P4
Final Dendrogram
Extensions to Hierarchical Clustering
◻ Major weakness of agglomerative clustering methods
Hierarchical clustering methods can encounter difficulties regarding the
selection of merge or split points.
Such a decision is critical, because once a group of objects is merged or
split, the process at the next step will operate on the newly generated
clusters.
It will neither undo what was done previously, nor perform object
swapping between clusters.
Thus, merge or split decisions, if not well chosen, may lead to low-quality
clusters.
The methods do not scale well because each decision of merge or split
needs to examine and evaluate many objects or clusters.
Problem for Practice
Example
Distance Matrix
◻ Here is the distance matrix
Single Linkage: Minimum Distance
◻ The smallest entry in the matrix is the distance between D & F
Single Linkage: Minimum Distance
◻ Cluster D & F
Single Linkage: Minimum Distance
◻ Distance between (D,F) & A
◻ Distance between (D,F) & B
◻ Distance between (D,F) & C
◻ Finally, Distance between (D,F) & E
Resultant Matrix
Single Linkage: Minimum Distance
◻ Distance between (A,B) & C
◻ Distance between (A,B) & (D,F)
◻ Distance between (A,B) & E
Resultant Distance Matrix
Single Linkage: Minimum Distance
Dendrogram
Extensions to Hierarchical Clustering
◻ Major weaknesses of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²), where n is the total
number of objects
Hierarchical Clustering
◻ There are several ways to categorize hierarchical clustering methods.
◻ For instance, they may be categorized into algorithmic methods, probabilistic
methods, and Bayesian methods.
◻ Agglomerative, divisive, and multiphase methods are algorithmic, meaning they
consider data objects as deterministic and compute clusters according to the
deterministic distances between objects.
◻ Probabilistic methods use probabilistic models to capture clusters and measure
the quality of clusters by the fitness of models.
◻ Bayesian methods compute a distribution of possible clusterings.
◻ That is, instead of outputting a single deterministic clustering over a data set, they
return a group of clustering structures and their probabilities, conditional on the
given data.