
Unit - III
Cluster Analysis

INT303R01 - Data Warehousing and Data Mining


Today’s Topic

Hierarchical Methods
Agglomerative and Divisive


Hierarchical Clustering

◻ While partitioning methods meet the basic clustering requirement of organizing a set of objects into a number of exclusive groups, in some situations we may want to partition our data into groups at different levels, such as in a hierarchy.
◻ A hierarchical clustering method works by grouping data objects into a hierarchy or “tree” of clusters.
◻ Representing data objects in the form of a hierarchy is useful for data summarization and visualization.
◻ Hierarchical clustering methods organize objects into a hierarchy using either a bottom-up or a top-down strategy.
◻ Agglomerative (bottom-up) methods start with individual objects as clusters, which are iteratively merged to form larger clusters.
◻ Conversely, divisive (top-down) methods initially let all the given objects form one cluster, which they iteratively split into smaller clusters.


Hierarchical Clustering

◻ Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

   Step 0    Step 1    Step 2    Step 3    Step 4
    a ──┐
        ├── ab ──────────────┐
    b ──┘                    ├── abcde
    c ─────────┐             │
    d ──┐      ├── cde ──────┘
        ├── de ┘
    e ──┘
   Step 4    Step 3    Step 2    Step 1    Step 0

   agglomerative (AGNES): read left to right (Step 0 → Step 4)
   divisive (DIANA): read right to left (Step 4 → Step 0)
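The AGNES direction of this figure can be reproduced with a short single-link loop over a distance function. This is an illustrative sketch (not code from the slides); it stops when a requested number of clusters k remains, which is one possible termination condition:

```python
def single_link_agnes(points, k):
    """Agglomerative (AGNES-style) clustering with single link.

    points: list of numbers (1-D) or coordinate tuples; k: desired
    cluster count. Returns the list of clusters.
    """
    def dist(a, b):
        # Euclidean distance; bare numbers are treated as 1-D points.
        if isinstance(a, (int, float)):
            return abs(a - b)
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Bottom-up: start with every object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Single link: smallest distance between any pair of members.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Merge the two least-dissimilar clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# On the one-dimensional data used later in this unit:
print(single_link_agnes([7, 10, 20, 28, 35], 2))  # [[7, 10], [20, 28, 35]]
```

Stopping at k = 2 recovers the two natural groups of the later worked example; with k = 1 all objects eventually belong to the same cluster, as the AGNES bullets state.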


AGNES (Agglomerative Nesting)

◻ Introduced in Kaufmann and Rousseeuw (1990)
◻ Implemented in statistical packages, e.g., Splus
◻ Uses the single-link method and the dissimilarity matrix
◻ Merges the nodes that have the least dissimilarity
◻ Continues in a non-descending fashion
◻ Eventually all nodes belong to the same cluster


DIANA (Divisive Analysis)

◻ Introduced in Kaufmann and Rousseeuw (1990)
◻ Implemented in statistical analysis packages, e.g., Splus
◻ Inverse order of AGNES
◻ Eventually each node forms a cluster on its own


DIANA (Divisive Analysis)

◻ A challenge with divisive methods is how to partition a large cluster into several smaller ones.
◻ For example, there are 2^(n−1) − 1 possible ways to partition a set of n objects into two exclusive, non-empty subsets, where n is the number of objects.
◻ When n is large, it is computationally prohibitive to examine all possibilities.
◻ Consequently, a divisive method typically uses heuristics in partitioning, which can lead to inaccurate results.
◻ For the sake of efficiency, divisive methods typically do not backtrack on partitioning decisions that have been made.
◻ Once a cluster is partitioned, any alternative partitioning of it will not be considered again.
◻ Due to these challenges, there are many more agglomerative methods than divisive methods.
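The count above is easy to verify: each of the n objects goes to one of two sides, giving 2^n assignments; discarding the two empty splits and halving for symmetry leaves 2^(n−1) − 1. A quick check of how fast it grows:

```python
def bisection_count(n):
    """Number of ways to split n objects into two non-empty subsets."""
    return 2 ** (n - 1) - 1

# The count explodes quickly, which is why divisive methods use heuristics
# instead of examining every possible bisection.
for n in (4, 10, 20, 30):
    print(n, bisection_count(n))
```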
Dendrogram: Shows How Clusters are Merged

◻ A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering.
◻ It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step by step.
◻ A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
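Cutting the dendrogram at a height h amounts to applying only the merges recorded below h and reading off the connected components. A minimal union-find sketch, using the merge history of the single-link example later in this unit ({7, 10, 20, 28, 35}, objects numbered 0–4):

```python
def cut_dendrogram(n, merges, h):
    """n objects labelled 0..n-1; merges = [(height, i, j), ...] recorded
    during agglomeration; return the flat clustering from cutting at h."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Apply only the merges that happened strictly below the cut height.
    for height, i, j in merges:
        if height < h:
            parent[find(i)] = find(j)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())

# Single-link merge history for {7, 10, 20, 28, 35}:
# (7,10) at 3, (28,35) at 7, 20 joins (28,35) at 8, all merge at 10.
merges = [(3, 0, 1), (7, 3, 4), (8, 2, 3), (10, 0, 2)]
print(cut_dendrogram(5, merges, 9))  # [[0, 1], [2, 3, 4]]
```

Cutting between heights 8 and 10 yields two connected components, i.e. the clusters {7, 10} and {20, 28, 35}.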



Distance Measures

◻ Whether an agglomerative method or a divisive method is used, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects.
◻ Four widely used measures for the distance between clusters are given below, where |pi − pj| is the distance between two objects or points pi and pj, mi is the mean of cluster Ci, and ni is the number of objects in Ci.
◻ They are also known as linkage measures.



Distance Measures

◻ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(|tip − tjq|) over all objects tip in Ki and tjq in Kj
◻ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(|tip − tjq|) over all objects tip in Ki and tjq in Kj
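Both linkage measures are straightforward to compute directly. A minimal sketch for one-dimensional points (the function names are my own):

```python
def single_link(ci, cj):
    """Smallest pairwise distance between members of two 1-D clusters."""
    return min(abs(p - q) for p in ci for q in cj)

def complete_link(ci, cj):
    """Largest pairwise distance between members of two 1-D clusters."""
    return max(abs(p - q) for p in ci for q in cj)

# Using two clusters drawn from the later worked example:
print(single_link([7, 10], [20, 28]))    # |10 - 20| = 10
print(complete_link([7, 10], [20, 28]))  # |7 - 28| = 21
```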





Hierarchical Clustering

◻ When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm.
◻ If the clustering process is terminated when the distance between the nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm.
◻ When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
◻ If the clustering process is terminated when the maximum distance between the nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
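The two termination rules can be contrasted on the one-dimensional data from the next example: with the same threshold, a single-linkage run and a complete-linkage run may stop with different clusterings. A sketch (the helper names are illustrative):

```python
def cluster_until_threshold(points, linkage, threshold):
    """Merge the closest clusters (under the given linkage) repeatedly,
    stopping once the nearest pair exceeds the user-defined threshold."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # nearest clusters now exceed the threshold: terminate
        clusters[i] += clusters.pop(j)
    return sorted(sorted(c) for c in clusters)

dmin = lambda a, b: min(abs(p - q) for p in a for q in b)  # single link
dmax = lambda a, b: max(abs(p - q) for p in a for q in b)  # complete link

pts = [7, 10, 20, 28, 35]
print(cluster_until_threshold(pts, dmin, 8))  # [[7, 10], [20, 28, 35]]
print(cluster_until_threshold(pts, dmax, 8))  # [[7, 10], [20], [28, 35]]
```

With threshold 8, single linkage still merges 20 into (28, 35) because dmin = 8, while complete linkage stops earlier because dmax for that merge already exceeds the threshold.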



Example 1: Single-Dimensional Data

Find the clusters using the single-link technique. Use Euclidean distance and draw the dendrogram.

◻ {7, 10, 20, 28, 35}



Example 1: Single-Dimensional Data

◻ {7, 10, 20, 28, 35}
◻ [Figure: the points plotted on a number line]


Example 1: Single-Dimensional Data

◻ {7, 10, 20, 28, 35}

  7      10     20     28     35
     3      10     8      7          min = 3  → cluster (7,10)

  (7,10)   20     28     35
        10     8      7              min = 7  → cluster (28,35)

  (7,10)   20     (28,35)
        10     8                     min = 8  → cluster (20,28,35)

  (7,10)   (20,28,35)
        10                           min = 10 → cluster (7,10,20,28,35)
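For one-dimensional data under single link, each merge distance equals a gap between consecutive sorted values, so the table above can be checked in a few lines:

```python
pts = sorted([7, 10, 20, 28, 35])

# Gaps between neighbouring points; under single link these are exactly
# the distances at which merges happen.
gaps = [b - a for a, b in zip(pts, pts[1:])]
print(gaps)          # [3, 10, 8, 7]
print(sorted(gaps))  # merge order: 3, 7, 8, then 10
```

The sorted gaps reproduce the merge sequence of the table: (7,10) at 3, (28,35) at 7, (20,28,35) at 8, and everything at 10.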





Example - 2

Find the clusters using the single-link technique. Use Euclidean distance and draw the dendrogram.

◻ [Table: the six data points P1–P6; coordinates were shown in the original slides]



Visualization

[Figure: scatter plot of the data points]


Distance Matrix

P1 = (0.40, 0.53)
P2 = (0.22, 0.38)

        P1     P2     P3     P4     P5     P6
P1      0      0.23
P2      0.23   0
P3                    0
P4                           0
P5                                  0
P6                                         0
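The 0.23 entry follows directly from the Euclidean distance formula; a quick check (the helper name is illustrative):

```python
from math import hypot

def euclid(p, q):
    """Euclidean distance between two 2-D points."""
    return hypot(p[0] - q[0], p[1] - q[1])

P1 = (0.40, 0.53)
P2 = (0.22, 0.38)
print(round(euclid(P1, P2), 2))  # 0.23, the P1-P2 entry in the matrix
```

The remaining entries of the matrix are filled in the same way from the other points' coordinates.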



Distance Matrix

[Figure: the completed 6 × 6 Euclidean distance matrix]


Min Distance

[Figure: the smallest off-diagonal entry of the matrix, which identifies P3 and P6 as the first pair to merge]


Cluster P3,P6

        P1     P2     P3,P6   P4     P5
P1      0
P2      0.24   0
P3,P6   ?      ?      0
P4      0.37   0.20   ?       0
P5      0.34   0.14   ?              0
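Each "?" entry is obtained from the single-link update rule: d((P3,P6), X) = min(d(P3, X), d(P6, X)). A sketch of the update; the numeric distances below are hypothetical placeholders, since the actual values appeared in figures accompanying these slides:

```python
def single_link_update(d, merged, others):
    """Single-link (minimum) update of a distance table after a merge.

    d: dict mapping frozenset({a, b}) -> distance; merged: the pair of
    labels just merged; others: the remaining cluster labels.
    """
    a, b = merged
    return {x: min(d[frozenset({a, x})], d[frozenset({b, x})])
            for x in others}

# Hypothetical distances, for illustration only (not from the slides):
d = {
    frozenset({"P3", "P4"}): 0.15, frozenset({"P6", "P4"}): 0.22,
    frozenset({"P3", "P5"}): 0.28, frozenset({"P6", "P5"}): 0.39,
}
print(single_link_update(d, ("P3", "P6"), ["P4", "P5"]))
```

With these placeholder values the merged cluster's row becomes {"P4": 0.15, "P5": 0.28}: each new entry is simply the smaller of the two old ones.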



[Figures: the distance matrix is updated after merging P3 and P6 using the single-link rule; the next minimum distance identifies P2 and P5; the dendrogram is extended accordingly]


Cluster P2,P5

        P1     P2,P5   P3,P6   P4
P1      0
P2,P5          0
P3,P6   0.22           0
P4      0.37   0.15            0



[Figures: the matrix is updated after merging P2 and P5; the next minimum distance merges (P2,P5) with (P3,P6); the matrix is updated again and the dendrogram extended]


Final Dendrogram

[Figures: the matrix is updated after merging P4 into (P2,P5,P3,P6); the final dendrogram joins the remaining cluster with P1 at the top]


Extensions to Hierarchical Clustering

◻ Major weaknesses of agglomerative clustering methods:
   Hierarchical clustering methods can encounter difficulties in selecting merge or split points.
   Such a decision is critical, because once a group of objects is merged or split, the process at the next step operates on the newly generated clusters.
   It will neither undo what was done previously nor perform object swapping between clusters.
   Thus, merge or split decisions, if not well chosen, may lead to low-quality clusters.
   These methods do not scale well, because each merge or split decision needs to examine and evaluate many objects or clusters.



References

◻ https://www.youtube.com/watch?v=RdT7bhm1M3E
◻ https://www.youtube.com/watch?v=oNYtYm0tFso


Problem for Practice

[Figures: a second worked example with points A–F; the data appeared in the original slides]


Distance Matrix

◻ Here is the distance matrix

[Figure: the distance matrix for points A–F]


Single Linkage: Minimum Distance

◻ The minimum entry is the distance between D & F
◻ Cluster D & F


Single Linkage: Minimum Distance

◻ Distance between (D,F) & A
◻ Distance between (D,F) & B
◻ Distance between (D,F) & C
◻ Finally, distance between (D,F) & E


Resultant Matrix

[Figure: the updated distance matrix after merging D and F]


Single Linkage: Minimum Distance

◻ Distance between (A,B) & C
◻ Distance between (A,B) & (D,F)
◻ Distance between (A,B) & E


Dendrogram

[Figures: the remaining merge steps of the practice problem and its final dendrogram]


Extensions to Hierarchical Clustering

◻ Major weaknesses of agglomerative clustering methods:
   They can never undo what was done previously.
   They do not scale well: the time complexity is at least O(n²), where n is the total number of objects.


Hierarchical Clustering

◻ There are several ways to categorize hierarchical clustering methods.
◻ For instance, they may be categorized into algorithmic methods, probabilistic methods, and Bayesian methods.
◻ Agglomerative, divisive, and multiphase methods are algorithmic: they treat data objects as deterministic and compute clusters according to the deterministic distances between objects.
◻ Probabilistic methods use probabilistic models to capture clusters and measure the quality of clusters by the fitness of the models.
◻ Bayesian methods compute a distribution of possible clusterings.
◻ That is, instead of outputting a single deterministic clustering over a data set, they return a group of clustering structures and their probabilities, conditional on the given data.
