COMP5009 DATA MINING
CLUSTERING
CURTIN UNIVERSITY, SEMESTER 2

Outline
Introduction
Clustering
Cluster Validation
Clustering Algorithms
Applications
Summary

Aggarwal Chapters
6.1, 6.3-6.6, 6.9: Clustering Algorithms
7.1, 7.2.1: Applications
GIVEN A SET OF DATA POINTS, PARTITION THEM INTO GROUPS CONTAINING VERY SIMILAR DATA POINTS.
CLUSTERING APPLICATIONS
Data summarization
  Measure properties/behaviors of K clusters instead of N items
  K << N
  Related to "representative" examples
Data segmentation
  Group data into segments, each of which may have different behaviors or needs
  e.g. customer segmentation for advertising or recommender systems
Social network analysis
  Use linkages between network nodes to identify communities
  Communities of people, businesses, bot-nets, etc.
Data mining "block"
  Clustering can be used as a building block or pre-processing stage for other DM problems
  e.g. clustering prior to classification or outlier detection
HOW DO WE CLUSTER?
Partition data points into intuitively similar groups
  Many ways of defining "similar"
  Distance / similarity metrics
  Clusters are context / domain dependent
How many groups do we need?
How do we define a "good" grouping?
CLUSTERING AMBIGUITY
EXAMPLES
DIFFERENT FROM ASSOCIATION PATTERN MINING
Association Pattern Mining
  Find attributes/features that co-occur
  Focus is on the items
Clustering
  Group transactions that are similar
  Focus is on the transactions (or customers)
CLUSTERING ALGORITHMS
Representative-based: k-means and k-modes
Hierarchical clustering: bottom-up agglomerative methods
Grid-based and density-based: DBSCAN
Probabilistic algorithms: Gaussian Mixture Model
REPRESENTATIVE BASED ALGORITHMS
User typically defines the number of clusters k
Use an intuitive distance function (e.g. Lp)
Choose a representative for each cluster
  Like stereotypes or tropes in stories
Cluster membership is assigned based on the nearest representative
Algorithm is therefore iterative
Determine representatives Y_j for data points X_i to minimize the objective:
  $O = \sum_{i=1}^{n} \min_{j} \mathrm{dist}(X_i, Y_j)$
  i.e. each point contributes its distance to the nearest representative
THE K-MEANS ALGORITHM
Choose the squared Euclidean distance: $\mathrm{dist}(X_i, Y_j) = \|X_i - Y_j\|_2^2$, where $\|\cdot\|_p$ is the Lp-norm.
The optimal choice for Y_j is then the mean of the data points in each cluster.
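As a rough illustration (not from the unit materials), the assign-and-update loop described above can be sketched in a few lines of NumPy. The toy data and function name below are made up for demonstration only.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate nearest-representative assignment
    and mean updates until the representatives stop moving."""
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=k, replace=False)]   # initial representatives
    for _ in range(n_iter):
        # Assignment step: squared L2 distance to every representative
        d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each representative becomes the mean of its cluster
        new_Y = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                          for j in range(k)])
        if np.allclose(new_Y, Y):
            break
        Y = new_Y
    return labels, Y

# Toy data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, representatives = kmeans(X, k=2)
```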
[Figure: k-means with k = 2; shapes represent cluster labels (Zaki & Meira)]
Note that the representatives are not necessarily real data points (μ1 = 7 in the figure)
If you impose this constraint you end up with the k-medoids algorithm
BASIC K-MEANS CLUSTERING
Pro
  Simple, intuitive
  Good results when clusters are well separated
  Relatively efficient
Con
  Local solution only: initialization may be critical
  Need to specify the number of clusters: how would you know that?
  Not robust against highly overlapping data/noise/outliers
  Euclidean distance: may not be optimal, does not work for categorical data
MAHALANOBIS K-MEANS
Instead of Lp you can use the Mahalanobis distance:
  $\mathrm{dist}(X_i, Y_j) = (X_i - Y_j)\, \Sigma_j^{-1} (X_i - Y_j)^{T}$
where $\Sigma_j$ is the covariance matrix of cluster j. This accounts for local orientation and density variation.
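A minimal sketch of the per-cluster Mahalanobis distance, assuming NumPy; the data and function name are illustrative, not part of the unit materials.

```python
import numpy as np

def mahalanobis_sq(X, centroid, cov):
    """Squared Mahalanobis distance of each row of X to a cluster centroid,
    using that cluster's covariance matrix Sigma_j."""
    inv_cov = np.linalg.inv(cov)
    diff = X - centroid
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Illustrative use inside a k-means-style assignment step: estimate Sigma_j from
# the points currently assigned to cluster j, then assign each point to the
# representative with the smallest Mahalanobis distance.
rng = np.random.default_rng(0)
cluster_pts = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.3, 0.5]])
cov_j = np.cov(cluster_pts, rowvar=False)
d2 = mahalanobis_sq(cluster_pts, cluster_pts.mean(axis=0), cov_j)
```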
K-MODES CLUSTERING
k-Modes: simple adaptation of k-Means for categorical data
k-Means requires taking averages and computing distances ⇒ not applicable to categorical data
Note the two key steps:
  Find the representative (centroid) of a cluster
  Identify which representative is most similar to a data sample
Mode: dominant category in a cluster
Key idea: instead of taking the average to find the centroid, use the mode to find the dominant category in each cluster
Multi-dimensional categorical data: mode finding for each attribute separately
For the re-assignment step: a richer set of similarity functions for computing the distances between data points and their modes
  e.g. inverse occurrence frequency to normalize for the skew in the attribute values
K-MODES EXAMPLE
Representatives could be either {Blue, Cube}, or {Green, Cube}
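A minimal sketch of the per-attribute mode step, using made-up (colour, shape) records rather than the exact data from the slide figure.

```python
import numpy as np

def mode_representative(cluster_rows):
    """Per-attribute mode: for each categorical attribute, pick the most frequent
    value among the rows currently assigned to the cluster (ties broken arbitrarily)."""
    rep = []
    for col in range(cluster_rows.shape[1]):
        values, counts = np.unique(cluster_rows[:, col], return_counts=True)
        rep.append(values[counts.argmax()])
    return rep

# Hypothetical cluster of (colour, shape) records
cluster = np.array([
    ["Blue",  "Cube"],
    ["Green", "Cube"],
    ["Blue",  "Sphere"],
    ["Green", "Cube"],
])
# Prints ['Blue', 'Cube']; with the colour tie here, {Green, Cube} would be an
# equally valid representative, as noted on the slide.
print(mode_representative(cluster))
```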
HIERARCHICAL CLUSTERING
Top down (splitting or divisive)
Bottom up (grouping or agglomerative)
Both require a distance or similarity measure
between items
[Link]
AGGLOMERATIVE CLUSTERING
Initially A-F are all in their own clusters
A and B are the closest two clusters, so they are joined
Followed by CD and EF
AB is then joined to CD
ABCD is then joined to EF
DENDROGRAMS (TREE PLOTS)
The largest distance between merged clusters indicates the optimal number of clusters (2 in this example)
[Figure: dendrogram; y-axis: Inter-Cluster Distance]
[Link] [Link]
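A dendrogram like the one referenced above can be drawn, for example, with SciPy and matplotlib (an assumption; the slide does not name a library). Toy data only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # toy data

Z = linkage(X, method='single')   # bottom-up merges using single ("best") linkage
dendrogram(Z)                     # tree plot; merge height is the inter-cluster distance
plt.ylabel("Inter-Cluster Distance")
plt.show()

# Cutting where the merge distance jumps the most gives the flat clustering;
# here the two clusters are requested directly.
labels = fcluster(Z, t=2, criterion='maxclust')
```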
LINKAGE CRITERIA
Best (single) linkage
  Smallest distance between pairs of objects in different clusters
Worst (complete) linkage
  Largest distance between pairs of objects in different clusters
Group-average linkage
  Average distance of all possible pairs from different clusters
Closest centroid
  Distance between centroids of clusters
Ward's method
  Compute change in intra-group sum of square errors
LINKAGE CRITERIA
Agglomerative clustering with n_clusters=3, using different linkage criteria
All using a distance metric (or affinity) of the L2 norm
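The parameter names above match scikit-learn's AgglomerativeClustering, so a comparison of linkage criteria could look like the sketch below (the data are made up; the library choice is an assumption).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))   # toy data

for link in ["single", "complete", "average", "ward"]:
    # Euclidean (L2) distance is the default metric
    model = AgglomerativeClustering(n_clusters=3, linkage=link)
    labels = model.fit_predict(X)
    print(link, np.bincount(labels))   # cluster sizes under each linkage criterion
```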
[Link]
CLUSTERING EXAMPLE
Compute the clusters at each level of
agglomerative clustering using:
o Best linkage
o Worst linkage
AGGLOMERATIVE CLUSTERING
Pro
  Does not assume a number of clusters (c.f. k-means)
  May generate meaningful taxonomies
Con
  Slow for large data sets, O(n² log n)
  O(n²) for computing the distance matrix
DBSCAN
DBSCAN: Density-based spatial clustering of
applications with noise
Density-based: clusters are dense regions
Can discover clusters of arbitrary shape (spherical, drawn-out, linear, elongated, etc.)
Requires only minimal information about the data
Efficient for large data sets
Can handle outliers
[Link]
DBSCAN CONCEPTS
Density description
ε: neighborhood radius
MinPts: number of points that define a threshold for
core points
Point types
Core point: has at least MinPts points within its ε
neighborhood
Boundary point: has fewer than MinPts points within its ε neighborhood, but is within the ε neighborhood of a core point
Outlier point: neither a core point nor a boundary point
[Link]
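For reference, scikit-learn's DBSCAN exposes these concepts directly (an assumed library choice; eps corresponds to ε and min_samples to MinPts, with the caveat that scikit-learn counts the point itself in its own neighbourhood). Toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((200, 2))   # toy data

db = DBSCAN(eps=0.1, min_samples=5).fit(X)
labels = db.labels_                                  # -1 marks outlier (noise) points
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True              # core points
is_border = (labels != -1) & ~is_core                # in a cluster but not a core point
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```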
EXAMPLE
MinPts = 2, ε = 2
3 Core points {A, D, E}
1 Border point {B}
3 Outlier points {C,F,G}
DBSCAN
Pro
  Does not need to specify the number of clusters k
  Can find arbitrary shapes
  Can handle noise and outliers
  Mostly insensitive to the ordering of data samples
Con
  Border points between two overlapping clusters?
  Cannot cluster data with large differences in densities
  Region query not as efficient compared to grid-based methods
  Region query using Euclidean distance may be difficult with high-dimensional data
  How to choose ε if the data are not well understood?
GAUSSIAN MIXTURE MODEL
Probabilistic: instead of assigning data to a cluster, assign a probability that the data belong to each of the clusters
An example of expectation-maximization (EM)
Similar to k-means but with "soft" boundaries
Algorithm:
  Choose starting guesses for the location and shape of each cluster
  Repeat until converged:
    E-step: for each point, find weights encoding the probability of membership in each cluster
    M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights
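A short sketch of soft assignments with a Gaussian mixture, using scikit-learn's GaussianMixture as one possible implementation (an assumption; the data below are made up).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])  # toy data

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
probs = gmm.predict_proba(X)   # soft membership: one probability per cluster per point
hard = gmm.predict(X)          # hard labels, comparable to k-means output
print(gmm.means_)              # fitted cluster locations (gmm.covariances_ gives shapes)
```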
K-MEANS VS GMM
CLUSTER VALIDATION
How do you know if a clustering solution is 'good'?
How do you compare one solution against another?
How to make use of domain information?
CLUSTER VALIDATION
Internal validation
  No other information about the data available
  Use some geometry-based functions
  Most typical in practice
  Biased to the choice of functions
  Better used when comparing clustering algorithms of the same type
External validation
  When class labels are available, e.g. benchmark datasets
  Commonly used to develop/benchmark algorithms
  Main assumption: data samples belonging to the same cluster should have the same class labels
INTERNAL VALIDATION
Sum of squared distances to centroids (SSQ)
Collect all data samples belonging to a cluster
Compute the centroid (average)
Compute the distances from these data samples to
the centroid
Square the distances and sum them up
Repeat for all clusters and take the final sum
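A minimal sketch of these steps, assuming NumPy arrays for the data and cluster labels (names are illustrative).

```python
import numpy as np

def ssq(X, labels):
    """Sum of squared distances of every point to its own cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)                # cluster centroid (average)
        total += ((pts - centroid) ** 2).sum()     # squared distances, summed
    return total

# For k-means specifically, scikit-learn exposes the same quantity as
# KMeans(...).fit(X).inertia_
```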
INTERNAL VALIDATION (2)
Intracluster-to-intercluster distance ratio
  Sample r pairs of data points
    P: pairs where both points are in the same cluster
    Q: pairs where the points are in different clusters
  Ratio = (average intracluster distance over P) / (average intercluster distance over Q)
Interpretation
  Clusters well separated: intracluster distance << intercluster distance
  Thus, a smaller value indicates a better clustering result
  Note: the number of sampled pairs r needs to be sufficiently large
Related to the Davies-Bouldin score
[Figure: intracluster vs intercluster distances]
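A small sketch of the sampled ratio described above, assuming NumPy arrays; the function name and defaults are illustrative.

```python
import numpy as np

def intra_inter_ratio(X, labels, r=10_000, seed=0):
    """Average distance of sampled same-cluster pairs (P) divided by the average
    distance of sampled different-cluster pairs (Q); smaller is better."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=r)
    j = rng.integers(0, len(X), size=r)
    keep = i != j                                 # discard degenerate pairs
    i, j = i[keep], j[keep]
    d = np.linalg.norm(X[i] - X[j], axis=1)       # pairwise Euclidean distances
    same = labels[i] == labels[j]                 # P vs Q membership
    return d[same].mean() / d[~same].mean()
```

As the slide notes, r must be large enough that both P and Q contain a reasonable number of pairs.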
INTERNAL VALIDATION (3)
Silhouette coefficient:
  $S_i = \dfrac{Dmin_i^{out} - Davg_i^{in}}{\max\{Dmin_i^{out},\, Davg_i^{in}\}}$
  $Davg_i^{in}$: average distance of $X_i$ to the data points within its own cluster
  $Dmin_i^{out}$: minimum average distance of $X_i$ to the data points of any other cluster
  $S_i = 1$ is good
  $S_i = -1$ is bad (data are in the wrong cluster)
  $S_i$ lies in $[-1, 1]$
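In practice this is usually computed with a library; a sketch using scikit-learn (an assumed choice, toy data only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))    # mean S_i over all points, in [-1, 1]
s_i = silhouette_samples(X, labels)   # per-point coefficients, useful for diagnosing clusters
```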
EXTERNAL VALIDATION
Consider clusters as labels and use classification
metrics
Purity
Confusion matrix
Gini index
Entropy
GINI INDEX
Recall from the Data Classification workshop
For each cluster $S_r$:
  $p_1, p_2, \ldots, p_k$: fraction of samples from each of the $k$ classes
  Gini index: $G(S_r) = 1 - \sum_{j=1}^{k} p_j^2$
Average weighted Gini index is then
  $G = \sum_{r} \frac{n_r}{n}\, G(S_r)$, where $n_r = |S_r|$ and $n$ is the total number of samples
ENTROPY
Again, from Data Classification:
For each subset $S_r$:
  $p_1, p_2, \ldots, p_k$: fraction of samples from each of the $k$ classes
  Entropy: $E(S_r) = -\sum_{j=1}^{k} p_j \log(p_j)$
Average weighted entropy is then
  $E = \sum_{r} \frac{n_r}{n}\, E(S_r)$
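A compact sketch computing both the weighted Gini index and the weighted entropy of a clustering against ground-truth labels, assuming integer-coded class labels (names are illustrative).

```python
import numpy as np

def weighted_gini_entropy(cluster_labels, class_labels):
    """Weighted Gini index and entropy over clusters, given integer-coded
    ground-truth class labels; lower values indicate purer clusters."""
    n = len(cluster_labels)
    gini = entropy = 0.0
    for c in np.unique(cluster_labels):
        cls = class_labels[cluster_labels == c]
        p = np.bincount(cls) / len(cls)        # class fractions p_1 ... p_k in this cluster
        p = p[p > 0]
        gini += (len(cls) / n) * (1.0 - np.sum(p ** 2))
        entropy += (len(cls) / n) * (-np.sum(p * np.log2(p)))   # log base 2
    return gini, entropy
```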
FURTHER VALIDATION
WRAP UP
DATA CLUSTERING: PRACTICAL CONSIDERATIONS
Clustering is generally difficult (ambiguity, no unique solution)
Clustering algorithms typically have controlling parameters
  k-means and k-modes: k
  DBSCAN: ε
  Others
Noise: outliers and noisy clusters may be present
How to select the right parameters? How many clusters?
Model selection: trade-off between how well the model explains the data and model complexity, i.e. the simplest model that explains the data well
Interpretability: do clusters give an interesting interpretation? ⇒ domain specific
HOW TO SELECT THE RIGHT CLUSTERING METHOD?
Very challenging: real-world data unlikely to have ideal characteristics for distance-based, density-based, or hierarchical approaches
Need to compare clustering solutions produced by different approaches, e.g. cluster validation
Need to examine the interpretability
Visualization: project data to 2D or 3D (e.g. principal component analysis) for visual inspection
Cluster ensembles: combine clustering results from different approaches to obtain robust clustering (features/attributes or instances)
Computationally feasible?
APPLICATIONS
To other data mining tasks
  Data summarization: indicates how data points are distributed
  Outlier analysis: small, fragmented clusters are indicative of outliers
  Classification: kNN classifier can use centroids for improved computational speed
Customer segmentation and collaborative filtering
  Group customers according to profiles, demographics, etc.
  Collaborative filtering: as per every group found via the clustering step
Text applications: hierarchical clustering potentially useful for organization of web pages
Multimedia applications: clustering index for effective retrieval of multimedia contents
Social network analysis: discover sub-communities
SUMMARY
Data clustering basics
Definition
Cluster validation
Internal
External
Clustering algorithms
Representative-based: k-means and k-modes
Hierarchical-based: Bottom-up agglomerative clustering
Density-based: DBSCAN
Applications
NEXT: OUTLIER ANALYSIS
CHAPTER: 8