COMP5009 DATA MINING
CLUSTERING
CURTIN UNIVERSITY, SEMESTER 2

Outline
Introduction
Clustering
Cluster Validation
Clustering Algorithms
Applications
Summary

Aggarwal Chapters
6.1, 6.3-6.6, 6.9: Clustering Algorithms
7.1, 7.2.1: Applications
GIVEN A SET OF DATA POINTS, PARTITION THEM INTO GROUPS CONTAINING VERY SIMILAR DATA POINTS.
CLUSTERING APPLICATIONS
Data summarization
  Measure properties/behaviors of K clusters instead of N items
  K << N
  Related to "representative" examples
Data segmentation
  Group data into segments, each of which may have different behaviors or needs
  e.g. customer segmentation for advertising or recommender systems
Social network analysis
  Use linkages between network nodes to identify communities
  Communities of people, businesses, bot-nets, etc.
Data mining "block"
  Clustering can be used as a building block or pre-processing stage for other DM problems
  e.g. clustering prior to classification or outlier detection
HOW DO WE CLUSTER?
Partition data points into intuitively similar groups
  Many ways of defining "similar"
  Distance / similarity metrics
  Clusters are context / domain dependent
How many groups do we need?
How do we define a "good" grouping?
CLUSTERING AMBIGUITY
EXAMPLES
DIFFERENT FROM ASSOCIATION PATTERN MINING
Association Pattern Mining
  Find attributes/features that co-occur
  Focus is on the items
Clustering
  Group transactions that are similar
  Focus is on the transactions (or customers)
CLUSTERING ALGORITHMS
Representative-based: k-means and k-modes
Hierarchical clustering: bottom-up agglomerative methods
Grid-based and density-based: DBSCAN
Probabilistic algorithms: Gaussian Mixture Model
REPRESENTATIVE BASED ALGORITHMS
User typically defines the number of clusters k
Use an intuitive distance function (e.g. Lp)
Choose a representative for each cluster
  Like stereotypes or tropes in stories
Cluster membership is assigned based on the nearest representative
Algorithm is therefore iterative
Determine representatives Y_j for data points X_i to minimize the objective:
  $O = \sum_{i=1}^{n} \min_{j} \mathrm{dist}(X_i, Y_j)$
  i.e. each point contributes its distance to the nearest representative
THE K-MEANS ALGORITHM
Choose the squared Euclidean distance: $\mathrm{dist}(X_i, Y_j) = \|X_i - Y_j\|_2^2$, where $\|\cdot\|_p$ is the Lp-norm.
The optimal choice for Y_j is then the mean of the data points in each cluster.
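As a rough illustration (not from the unit materials), the assign-and-update loop described above can be sketched in a few lines of NumPy. The toy data and function name below are made up for demonstration only.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: alternate nearest-representative assignment
    and mean updates until the representatives stop moving."""
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=k, replace=False)]   # initial representatives
    for _ in range(n_iter):
        # Assignment step: squared L2 distance to every representative
        d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each representative becomes the mean of its cluster
        new_Y = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Y[j]
                          for j in range(k)])
        if np.allclose(new_Y, Y):
            break
        Y = new_Y
    return labels, Y

# Toy data: two well-separated blobs (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, representatives = kmeans(X, k=2)
```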
[Figure: k-means with k = 2; shapes represent cluster labels (Zaki & Meira)]
Note that the representatives are not necessarily real data points (μ1 = 7 in the figure)
If you impose this constraint you end up with the k-medoids algorithm
BASIC K-MEANS CLUSTERING
Pro
  Simple, intuitive
  Good results when clusters are well separated
  Relatively efficient
Con
  Local solution only: initialization may be critical
  Need to specify the number of clusters: how would you know that?
  Not robust against highly overlapping data/noise/outliers
  Euclidean distance: may not be optimal, does not work for categorical data
MAHALANOBIS K-MEANS
Instead of Lp you can use the Mahalanobis distance:
  $\mathrm{dist}(X_i, Y_j) = (X_i - Y_j)\, \Sigma_j^{-1} (X_i - Y_j)^{T}$
where $\Sigma_j$ is the covariance matrix of cluster j. This accounts for local orientation and density variation.
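A minimal sketch of the per-cluster Mahalanobis distance, assuming NumPy; the data and function name are illustrative, not part of the unit materials.

```python
import numpy as np

def mahalanobis_sq(X, centroid, cov):
    """Squared Mahalanobis distance of each row of X to a cluster centroid,
    using that cluster's covariance matrix Sigma_j."""
    inv_cov = np.linalg.inv(cov)
    diff = X - centroid
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Illustrative use inside a k-means-style assignment step: estimate Sigma_j from
# the points currently assigned to cluster j, then assign each point to the
# representative with the smallest Mahalanobis distance.
rng = np.random.default_rng(0)
cluster_pts = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.3, 0.5]])
cov_j = np.cov(cluster_pts, rowvar=False)
d2 = mahalanobis_sq(cluster_pts, cluster_pts.mean(axis=0), cov_j)
```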
K-MODES CLUSTERING
k-Modes: simple adaptation of k-Means for categorical data
k-Means requires taking averages and computing distances ⇒ not applicable to categorical data
Note the two key steps:
  Find the representative (centroid) of a cluster
  Identify which representative is most similar to a data sample
Mode: dominant category in a cluster
Key idea: instead of taking the average to find the centroid, use the mode to find the dominant category in each cluster
Multi-dimensional categorical data: mode finding for each attribute separately
For the re-assignment step: a richer set of similarity functions for computing the distances between data points and their modes
  e.g. inverse occurrence frequency to normalize for the skew in the attribute values
K-MODES EXAMPLE
Representatives could be either {Blue, Cube}, or {Green, Cube}
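A minimal sketch of the per-attribute mode step, using made-up (colour, shape) records rather than the exact data from the slide figure.

```python
import numpy as np

def mode_representative(cluster_rows):
    """Per-attribute mode: for each categorical attribute, pick the most frequent
    value among the rows currently assigned to the cluster (ties broken arbitrarily)."""
    rep = []
    for col in range(cluster_rows.shape[1]):
        values, counts = np.unique(cluster_rows[:, col], return_counts=True)
        rep.append(values[counts.argmax()])
    return rep

# Hypothetical cluster of (colour, shape) records
cluster = np.array([
    ["Blue",  "Cube"],
    ["Green", "Cube"],
    ["Blue",  "Sphere"],
    ["Green", "Cube"],
])
# Prints ['Blue', 'Cube']; with the colour tie here, {Green, Cube} would be an
# equally valid representative, as noted on the slide.
print(mode_representative(cluster))
```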
HIERARCHICAL CLUSTERING
Top down (splitting or divisive)
Bottom up (grouping or agglomerative)
Both require a distance or similarity measure
between items
[Link]
AGGLOMERATIVE CLUSTERING
Initially A-F are all in their own clusters
A and B are the closest two clusters, so they are joined
Followed by CD and EF
AB is then joined to CD
ABCD is then joined to EF
DENDROGRAMS (TREE PLOTS)
The largest distance between merged clusters indicates the optimal number of clusters (2 in this example)
[Figure: dendrogram; y-axis: Inter-Cluster Distance]
[Link] [Link]
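A dendrogram like the one referenced above can be drawn, for example, with SciPy and matplotlib (an assumption; the slide does not name a library). Toy data only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # toy data

Z = linkage(X, method='single')   # bottom-up merges using single ("best") linkage
dendrogram(Z)                     # tree plot; merge height is the inter-cluster distance
plt.ylabel("Inter-Cluster Distance")
plt.show()

# Cutting where the merge distance jumps the most gives the flat clustering;
# here the two clusters are requested directly.
labels = fcluster(Z, t=2, criterion='maxclust')
```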
LINKAGE CRITERIA
Best (single) linkage
  Smallest distance between pairs of objects in different clusters
Worst (complete) linkage
  Largest distance between pairs of objects in different clusters
Group-average linkage
  Average distance of all possible pairs from different clusters
Closest centroid
  Distance between centroids of clusters
Ward's method
  Compute change in intra-group sum of square errors
LINKAGE CRITERIA
Agglomerative clustering with n_clusters=3, using different linkage criteria
All using a distance metric (or affinity) of the L2 norm
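The parameter names above match scikit-learn's AgglomerativeClustering, so a comparison of linkage criteria could look like the sketch below (the data are made up; the library choice is an assumption).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))   # toy data

for link in ["single", "complete", "average", "ward"]:
    # Euclidean (L2) distance is the default metric
    model = AgglomerativeClustering(n_clusters=3, linkage=link)
    labels = model.fit_predict(X)
    print(link, np.bincount(labels))   # cluster sizes under each linkage criterion
```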
[Link]
CLUSTERING EXAMPLE
Compute the clusters at each level of
agglomerative clustering using:
o Best linkage
o Worst linkage
AGGLOMERATIVE CLUSTERING
Pro
  Does not assume a number of clusters (c.f. k-means)
  May generate meaningful taxonomies
Con
  Slow for large data sets, O(n² log n)
  O(n²) for computing the distance matrix
DBSCAN
DBSCAN: Density-based spatial clustering of
applications with noise
Density-based: clusters are dense regions
Can discover clusters of arbitrary shape (spherical, drawn-out, linear, elongated, etc.)
Requires only minimal information about the data
Efficient for large data sets
Can handle outliers
[Link]
DBSCAN CONCEPTS
Density description
ε: neighborhood radius
MinPts: number of points that define a threshold for
core points
Point types
Core point: has at least MinPts points within its ε
neighborhood
Boundary point: has fewer than MinPts points within its ε neighborhood, but is within the ε neighborhood of a core point
Outlier point: neither a core point nor a boundary point
[Link]
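For reference, scikit-learn's DBSCAN exposes these concepts directly (an assumed library choice; eps corresponds to ε and min_samples to MinPts, with the caveat that scikit-learn counts the point itself in its own neighbourhood). Toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((200, 2))   # toy data

db = DBSCAN(eps=0.1, min_samples=5).fit(X)
labels = db.labels_                                  # -1 marks outlier (noise) points
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True              # core points
is_border = (labels != -1) & ~is_core                # in a cluster but not a core point
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```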
EXAMPLE
MinPts = 2, ε = 2
3 Core points {A, D, E}
1 Border point {B}
3 Outlier points {C,F,G}
DBSCAN
Pro
  Does not need to specify the number of clusters k
  Can find arbitrary shapes
  Can handle noise and outliers
  Mostly insensitive to the ordering of data samples
Con
  Border points between two overlapping clusters?
  Cannot cluster data with large differences in densities
  Region query not as efficient compared to grid-based methods
  Region query using Euclidean distance may be difficult with high-dimensional data
  How to choose ε if the data are not well understood?
GAUSSIAN MIXTURE MODEL
Probabilistic: instead of assigning data to a cluster, assign a probability that the data belong to each of the clusters
An example of expectation-maximization (EM)
Similar to k-means but with "soft" boundaries
Algorithm:
  Choose starting guesses for the location and shape of each cluster
  Repeat until converged:
    E-step: for each point, find weights encoding the probability of membership in each cluster
    M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights
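A short sketch of soft assignments with a Gaussian mixture, using scikit-learn's GaussianMixture as one possible implementation (an assumption; the data below are made up).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])  # toy data

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
probs = gmm.predict_proba(X)   # soft membership: one probability per cluster per point
hard = gmm.predict(X)          # hard labels, comparable to k-means output
print(gmm.means_)              # fitted cluster locations (gmm.covariances_ gives shapes)
```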
K-MEANS VS GMM
CLUSTER VALIDATION
How do you know if a clustering solution is 'good'?
How do you compare one solution against another?
How to make use of domain information?
CLUSTER VALIDATION
Internal validation
  No other information about the data available
  Use some geometry-based functions
  Most typical in practice
  Biased to the choice of functions
  Better used when comparing clustering algorithms of the same type
External validation
  When class labels are available, e.g. benchmark datasets
  Commonly used to develop/benchmark algorithms
  Main assumption: data samples belonging to the same cluster should have the same class labels
INTERNAL VALIDATION
Sum of squared distances to centroids (SSQ)
Collect all data samples belonging to a cluster
Compute the centroid (average)
Compute the distances from these data samples to
the centroid
Square the distances and sum them up
Repeat for all clusters and take the final sum
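A minimal sketch of these steps, assuming NumPy arrays for the data and cluster labels (names are illustrative).

```python
import numpy as np

def ssq(X, labels):
    """Sum of squared distances of every point to its own cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        centroid = pts.mean(axis=0)                # cluster centroid (average)
        total += ((pts - centroid) ** 2).sum()     # squared distances, summed
    return total

# For k-means specifically, scikit-learn exposes the same quantity as
# KMeans(...).fit(X).inertia_
```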
INTERNAL VALIDATION (2)
Intracluster-to-intercluster distance ratio
  Sample r pairs of data points
    P: pairs where both points are in the same cluster
    Q: pairs where the points are in different clusters
  Ratio = (average intracluster distance over P) / (average intercluster distance over Q)
Interpretation
  Clusters well separated: intracluster distance << intercluster distance
  Thus, a smaller value indicates a better clustering result
  Note: the number of sampled pairs r needs to be sufficiently large
Related to the Davies-Bouldin score
[Figure: intracluster vs intercluster distances]
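A small sketch of the sampled ratio described above, assuming NumPy arrays; the function name and defaults are illustrative.

```python
import numpy as np

def intra_inter_ratio(X, labels, r=10_000, seed=0):
    """Average distance of sampled same-cluster pairs (P) divided by the average
    distance of sampled different-cluster pairs (Q); smaller is better."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=r)
    j = rng.integers(0, len(X), size=r)
    keep = i != j                                 # discard degenerate pairs
    i, j = i[keep], j[keep]
    d = np.linalg.norm(X[i] - X[j], axis=1)       # pairwise Euclidean distances
    same = labels[i] == labels[j]                 # P vs Q membership
    return d[same].mean() / d[~same].mean()
```

As the slide notes, r must be large enough that both P and Q contain a reasonable number of pairs.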
INTERNAL VALIDATION (3)
Silhouette coefficient:
  $S_i = \dfrac{Dmin_i^{out} - Davg_i^{in}}{\max\{Dmin_i^{out},\, Davg_i^{in}\}}$
  $Davg_i^{in}$: average distance of $X_i$ to the data points within its own cluster
  $Dmin_i^{out}$: minimum average distance of $X_i$ to the data points of any other cluster
  $S_i = 1$ is good
  $S_i = -1$ is bad (data are in the wrong cluster)
  $S_i$ lies in $[-1, 1]$
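In practice this is usually computed with a library; a sketch using scikit-learn (an assumed choice, toy data only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))    # mean S_i over all points, in [-1, 1]
s_i = silhouette_samples(X, labels)   # per-point coefficients, useful for diagnosing clusters
```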
EXTERNAL VALIDATION
Consider clusters as labels and use classification
metrics
Purity
Confusion matrix
Gini index
Entropy
GINI INDEX
Recall from the Data Classification workshop
For each cluster $S_r$:
  $p_1, p_2, \ldots, p_k$: fraction of samples from each of the $k$ classes
  Gini index: $G(S_r) = 1 - \sum_{j=1}^{k} p_j^2$
Average weighted Gini index is then
  $G = \sum_{r} \frac{n_r}{n}\, G(S_r)$, where $n_r = |S_r|$ and $n$ is the total number of samples
ENTROPY
Again, from Data Classification:
For each subset $S_r$:
  $p_1, p_2, \ldots, p_k$: fraction of samples from each of the $k$ classes
  Entropy: $E(S_r) = -\sum_{j=1}^{k} p_j \log(p_j)$
Average weighted entropy is then
  $E = \sum_{r} \frac{n_r}{n}\, E(S_r)$
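A compact sketch computing both the weighted Gini index and the weighted entropy of a clustering against ground-truth labels, assuming integer-coded class labels (names are illustrative).

```python
import numpy as np

def weighted_gini_entropy(cluster_labels, class_labels):
    """Weighted Gini index and entropy over clusters, given integer-coded
    ground-truth class labels; lower values indicate purer clusters."""
    n = len(cluster_labels)
    gini = entropy = 0.0
    for c in np.unique(cluster_labels):
        cls = class_labels[cluster_labels == c]
        p = np.bincount(cls) / len(cls)        # class fractions p_1 ... p_k in this cluster
        p = p[p > 0]
        gini += (len(cls) / n) * (1.0 - np.sum(p ** 2))
        entropy += (len(cls) / n) * (-np.sum(p * np.log2(p)))   # log base 2
    return gini, entropy
```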
FURTHER VALIDATION
WRAP UP
DATA CLUSTERING: PRACTICAL CONSIDERATIONS
Clustering is generally difficult (ambiguity, no unique solution)
Clustering algorithms typically have controlling parameters
  k-means and k-modes: k
  DBSCAN: ε
  Others
Noise: outliers and noisy clusters may be present
How to select the right parameters? How many clusters?
Model selection: trade-off between how well the model explains the data and model complexity, i.e. the simplest model that explains the data well
Interpretability: do clusters give an interesting interpretation? ⇒ domain specific
HOW TO SELECT THE RIGHT CLUSTERING METHOD?
Very challenging: real-world data unlikely to have ideal characteristics for distance-based, density-based, or hierarchical approaches
Need to compare clustering solutions produced by different approaches, e.g. cluster validation
Need to examine the interpretability
Visualization: project data to 2D or 3D (e.g. principal component analysis) for visual inspection
Cluster ensembles: combine clustering results from different approaches to obtain robust clustering (features/attributes or instances)
Computationally feasible?
APPLICATIONS
To other data mining tasks
  Data summarization: indicates how data points are distributed
  Outlier analysis: small, fragmented clusters are indicative of outliers
  Classification: kNN classifier can use centroids for improved computational speed
Customer segmentation and collaborative filtering
  Group customers according to profiles, demographics, etc.
  Collaborative filtering: as per every group found via the clustering step
Text applications: hierarchical clustering potentially useful for organization of web pages
Multimedia applications: clustering index for effective retrieval of multimedia contents
Social network analysis: discover sub-communities
SUMMARY
Data clustering basics
Definition
Cluster validation
Internal
External
Clustering algorithms
Representative-based: k-means and k-modes
Hierarchical-based: Bottom-up agglomerative clustering
Density-based: DBSCAN
Applications
NEXT: OUTLIER ANALYSIS
CHAPTER: 8