COMP5009

DATA MINING

CLUSTERING

CURTIN UNIVERSITY
SEMESTER 2
CLUSTERING

 Introduction
 Cluster Validation
 Clustering Algorithms
 Applications
 Summary

Aggarwal Chapters 6.1, 6.3-6.6, 6.9, 7.1, 7.2.1


GIVEN A SET OF DATA POINTS, PARTITION THEM INTO GROUPS CONTAINING VERY SIMILAR DATA POINTS.


CLUSTERING APPLICATIONS

 Data summarization
   Measure properties/behaviors of K clusters instead of N items
   K << N
   Related to "representative" examples

 Data segmentation
   Group data into segments, each of which may have different behaviors or needs
   e.g. customer segmentation for advertising or recommender systems

 Social network analysis
   Use linkages between network nodes to identify communities
   Communities of people, businesses, bot-nets, etc.

 Data mining "block"
   Clustering can be used as a building block or pre-processing stage for other DM problems
   e.g. clustering prior to classification or outlier detection


HOW DO WE CLUSTER?

 Partition data points into intuitively similar groups
   Many ways of defining similar
   Distance / similarity metrics
   Clusters are context / domain dependent

 How many groups do we need?

 How do we define a "good" grouping?


CLUSTERING AMBIGUITY



EXAMPLES



DIFFERENT FROM ASSOCIATION PATTERN MINING

Association Pattern Mining:
 Find attributes/features that co-occur
 Focus is on the items

Clustering:
 Group transactions that are similar
 Focus is on the transactions (or customers)


CLUSTERING ALGORITHMS

 Representative-based: k-means and k-modes
 Hierarchical clustering: bottom-up agglomerative methods
 Grid-based and density-based: DBSCAN
 Probabilistic algorithms: Gaussian Mixture Model


REPRESENTATIVE BASED ALGORITHMS

 User typically defines the number of clusters k
 Use an intuitive distance function (e.g. Lp)
 Choose a representative for each cluster
 Cluster membership is assigned based on the nearest representative
   Like stereotypes or tropes in stories

 Determine representatives Y1, ..., Yk that minimize the objective, i.e. the sum over all data points Xi of the distance to the nearest representative:

   O = Σ_i=1..n  [ min_j Dist(Xi, Yj) ]

 The algorithm is therefore iterative: alternately assign each point to its nearest representative, then update the representatives


THE K-MEANS ALGORITHM

 Choose Dist(Xi, Yj) = ||Xi − Yj||_2², the squared Euclidean distance, where ||.||_p is the Lp-norm (here p = 2)

 The optimal choice for each representative Yj is then the mean of the data points in its cluster
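As a rough illustration (not from the slides), a k-means run with scikit-learn might look like the sketch below; the toy data, parameter values and variable names are assumptions made for the example:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

# k must be chosen by the user; k-means++ initialization and several
# restarts (n_init) help reduce sensitivity to the starting representatives
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)       # cluster membership of each point
centroids = km.cluster_centers_  # the representatives (means of each cluster)
print(labels[:5], centroids)

Multiple restarts are a common practical answer to the initialization sensitivity discussed on the next slide.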


 Shapes represent cluster labels
 k-means with k=2
 Note that the representatives are not necessarily real data points (μ1 = 7)
   If you impose this constraint you end up with the k-medoids algorithm

(Figure: Zaki + Meira)
BASIC K-MEANS CLUSTERING

Pro:
 Simple, intuitive
 Good results when clusters are well separated
 Relatively efficient

Con:
 Local solution only: initialization may be critical
 Need to specify the number of clusters: how would you know that?
 Not robust against highly overlapping data/noise/outliers
 Euclidean distance: may not be optimal, does not work for categorical data


MAHALANOBIS K-MEANS

 Instead of Lp you can use the Mahalanobis distance:

   Dist(Xi, Yj) = (Xi − Yj) Σj⁻¹ (Xi − Yj)ᵀ

   where Σj is the covariance matrix of the points in cluster j

 This accounts for local orientation and density variation
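A minimal sketch (assuming numpy/scipy; the cluster data, mean and covariance below are hypothetical) of evaluating the Mahalanobis distance between a point and a cluster representative:

import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical elongated cluster: its points, mean (representative) and covariance
cluster_pts = np.random.default_rng(1).normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.0, 1.0]])
mu = cluster_pts.mean(axis=0)
Sigma = np.cov(cluster_pts, rowvar=False)
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([3.0, 1.0])           # a query point
d = mahalanobis(x, mu, Sigma_inv)  # accounts for the orientation/scale of the cluster
print(d)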


K-MODES CLUSTERING

 k-Modes: a simple adaptation of k-Means for categorical data
   k-Means requires taking averages and computing distances ⇒ not applicable to categorical data
   Mode: the dominant category in a cluster

 Two key steps:
   Find the representative (centroid) of each cluster
   Identify which representative is most similar to a data sample

 Key idea: instead of taking the average to find the centroid, use the mode, i.e. the dominant category in each cluster
   Multi-dimensional categorical data: find the mode of each attribute separately

 For the re-assignment step: a richer set of similarity functions can be used to compute the distances between data points and their modes
   e.g. inverse occurrence frequency to normalize for the skew in the attribute values
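A minimal sketch (pure numpy; the toy categorical data and function names are assumptions) of the two key steps of k-modes: taking the per-attribute mode as the cluster representative, and using a simple matching dissimilarity for re-assignment:

import numpy as np

# Hypothetical toy categorical data: columns = (colour, shape)
X = np.array([["Blue", "Cube"], ["Green", "Cube"], ["Blue", "Sphere"],
              ["Red", "Cone"], ["Red", "Cone"], ["Red", "Cube"]])

def mode_representative(cluster):
    """Per-attribute mode: the dominant category of each column."""
    rep = []
    for col in cluster.T:
        values, counts = np.unique(col, return_counts=True)
        rep.append(values[np.argmax(counts)])
    return np.array(rep)

def matching_distance(x, rep):
    """Simple matching dissimilarity: number of attributes that differ."""
    return int(np.sum(x != rep))

rep = mode_representative(X[:3])   # representative of a (hypothetical) cluster
print(rep, matching_distance(X[3], rep))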


K-MODES EXAMPLE

 Representatives could be either {Blue, Cube} or {Green, Cube}


HIERARCHICAL CLUSTERING

 Top-down (splitting or divisive)

 Bottom-up (grouping or agglomerative)

 Both require a distance or similarity measure between items


AGGLOMERATIVE CLUSTERING

 Initially A-F are all in their own clusters
 A and B are the closest two clusters, so they are joined
 Followed by CD and EF
 AB is then joined to CD
 ABCD is then joined to EF


DENDROGRAMS (TREE PLOTS)

 The largest jump in inter-cluster distance between successive merges indicates the optimal number of clusters (here, 2)
 (Figure: dendrogram with inter-cluster distance on the vertical axis)
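A minimal sketch (not from the slides; assumes scipy and matplotlib, with hypothetical points A-F) of building and plotting a dendrogram from the agglomerative merges:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical toy points A-F
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)

Z = linkage(X, method="average")     # bottom-up agglomerative merges
dendrogram(Z, labels=list("ABCDEF"))
plt.ylabel("Inter-cluster distance")
plt.show()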


LINKAGE CRITERIA

 Best (single) linkage
   Smallest distance between pairs of objects in different clusters

 Worst (complete) linkage
   Largest distance between pairs of objects in different clusters

 Group-average linkage
   Average distance over all possible pairs from different clusters

 Closest centroid
   Distance between the centroids of the clusters

 Ward's method
   Change in intra-group sum of squared errors when two clusters are merged


LINKAGE CRITERIA

 Agglomerative clustering with n_clusters=3, using different linkage criteria

 All using a distance metric (or affinity) of the L2 norm
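A minimal sketch (assuming scikit-learn and synthetic blob data) of running agglomerative clustering with n_clusters=3 under different linkage criteria, as on this slide:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical data with three blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Compare linkage criteria; the distance metric defaults to Euclidean (L2)
for linkage in ["single", "complete", "average", "ward"]:
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, np.bincount(labels))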


CLUSTERING EXAMPLE

 Compute the clusters at each level of agglomerative clustering using:
   o Best linkage
   o Worst linkage


AGGLOMERATIVE CLUSTERING

Pro:
 Does not assume a number of clusters (c.f. k-means)
 May generate meaningful taxonomies

Con:
 Slow for large data sets: O(n² log n) overall
 O(n²) for computing the distance matrix


DBSCAN

 DBSCAN: Density-Based Spatial Clustering of Applications with Noise

 Density-based: clusters are dense regions of points

 Can discover clusters of arbitrary shape (spherical, drawn-out, linear, elongated, etc.)

 Requires only minimal information about the data

 Efficient for large data sets

 Can handle outliers


DBSCAN CONCEPTS

Density description
 ε: neighborhood radius
 MinPts: the number of points that defines the threshold for core points

Point types
 Core point: has at least MinPts points within its ε neighborhood
 Boundary (border) point: has fewer than MinPts points within its ε neighborhood, but lies within the ε neighborhood of a core point
 Outlier point: neither a core point nor a boundary point
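A minimal sketch (assuming scikit-learn; the data and parameter values are hypothetical) of DBSCAN, where eps corresponds to ε and min_samples to MinPts:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Hypothetical non-spherical data: two interleaved half-moons
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # cluster ids; -1 marks outlier (noise) points
print("clusters:", len(set(labels) - {-1}), "outliers:", np.sum(labels == -1))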


EXAMPLE
minpts = 2, ε = 2

 3 core points: {A, D, E}
 1 border point: {B}
 3 outlier points: {C, F, G}


DBSCAN

Pro:
 Does not need to specify the number of clusters k
 Can find clusters of arbitrary shape
 Can handle noise and outliers
 Mostly insensitive to the ordering of data samples

Con:
 Border points between two overlapping clusters are ambiguous
 Cannot cluster data with large differences in densities
 Region query not as efficient compared to grid-based methods
 Region query using Euclidean distance may be difficult with high-dimensional data
 How to choose ε if the data are not well understood? (see the k-distance sketch below)
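One common heuristic (not from the slides) for choosing ε is the k-distance plot: sort each point's distance to its MinPts-th closest point and look for the "elbow". A minimal sketch, assuming scikit-learn and hypothetical data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # hypothetical data
min_pts = 5

# Distance from each point to its min_pts-th closest point (including itself)
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by distance")
plt.ylabel(f"distance to {min_pts}-th closest point")
plt.show()  # the 'elbow' of this curve is a reasonable choice of ε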


GAUSSIAN MIXTURE MODEL

 Probabilistic: instead of assigning each data point to a single cluster, assign a probability that the point belongs to each of the clusters
 An example of expectation-maximization (EM)
 Similar to k-means but with "soft" boundaries

 Algorithm:
   Choose starting guesses for the location and shape of each cluster
   Repeat until converged:
     E-step: for each point, find weights encoding the probability of membership in each cluster
     M-step: for each cluster, update its location, normalization, and shape based on all data points, making use of the weights
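A minimal sketch (assuming scikit-learn and synthetic data) of fitting a Gaussian Mixture Model by EM and reading off the "soft" membership probabilities:

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Hypothetical data with three clusters of different spreads
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)       # most probable cluster (like a k-means output)
soft_probs = gmm.predict_proba(X)  # "soft" membership probabilities per cluster
print(soft_probs[0].round(3))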


K-MEANS VS GMM



CLUSTER VALIDATION

 How do you know if a clustering solution is 'good'?
 How do you compare one solution against another?
 How can you make use of domain information?


CLUSTER VALIDATION

 Internal validation
   No other information about the data is available
   Use some geometry-based functions
   Most typical in practice
   Biased to the choice of functions
   Better used when comparing clustering algorithms of the same type

 External validation
   When class labels are available, e.g. benchmark datasets
   Commonly used to develop/benchmark algorithms
   Main assumption: data samples belonging to the same cluster should have the same class labels


INTERNAL VALIDATION

Sum of squared distances to centroids (SSQ)

 Collect all data samples belonging to a cluster
 Compute the centroid (average)
 Compute the distances from these data samples to the centroid
 Square the distances and sum them up
 Repeat for all clusters and take the final sum, i.e. SSQ = Σ_j Σ_(X in cluster j) Dist(X, μj)²
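A minimal sketch (assuming scikit-learn, where KMeans exposes SSQ as the inertia_ attribute) of using SSQ to compare different numbers of clusters, the so-called elbow method:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # hypothetical data

# SSQ (inertia_) always decreases with k; look for the "elbow" where the
# decrease flattens out as a guide to the number of clusters
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))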


INTERNAL VALIDATION (2)

Intracluster-to-intercluster distance ratio
 Sample r pairs of data points
   P: the sampled pairs whose two points lie in the same cluster
   Q: the sampled pairs whose two points lie in different clusters
 Intracluster distance: average distance over the pairs in P
 Intercluster distance: average distance over the pairs in Q
 Ratio = Intracluster distance / Intercluster distance
   (related in spirit to the Davies-Bouldin score, which also compares intracluster scatter to intercluster separation)

Interpretation
 Clusters well separated: intracluster distance << intercluster distance
 Thus, a smaller value indicates a better clustering result
 Note: the number of sampled pairs r needs to be sufficiently large
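A minimal sketch (pure numpy; the function name and sampling scheme are assumptions) of the sampled intracluster-to-intercluster distance ratio:

import numpy as np

def intra_inter_ratio(X, labels, r=10000, seed=None):
    """Sampled intracluster-to-intercluster distance ratio (smaller is better)."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=r)
    j = rng.integers(0, len(X), size=r)
    keep = i != j                                     # discard degenerate pairs
    d = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
    same = labels[i[keep]] == labels[j[keep]]         # P pairs vs Q pairs
    return d[same].mean() / d[~same].mean()

# Hypothetical usage with any clustering result:
# ratio = intra_inter_ratio(X, kmeans_labels, r=20000)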


INTERNAL VALIDATION (3)

Silhouette coefficient:
 Davg_i^in: average distance from Xi to the other data points within its own cluster
 Dmin_i^out: minimum, over the other clusters, of the average distance from Xi to the points of that cluster

   Si = (Dmin_i^out − Davg_i^in) / max(Dmin_i^out, Davg_i^in)

 Si lies in (-1, 1)
 Si = 1 is good
 Si = -1 is bad (the data point is in the wrong cluster)
 The overall silhouette coefficient is the average of Si over all data points
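A minimal sketch (assuming scikit-learn's silhouette_score, which averages Si over all points) of comparing clusterings by their silhouette coefficient:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # hypothetical data

# Average silhouette coefficient over all points: closer to 1 is better
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))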


EXTERNAL VALIDATION

 Consider the clusters as labels and use classification metrics:
   Purity
   Confusion matrix
   Gini index
   Entropy


GINI INDEX

 Recall from the Data Classification workshop
 For each cluster Sr:
   p1, p2, ..., pk are the fractions of samples from the k classes
   Gini index: G(Sr) = 1 − Σ_i pi²
 Average weighted Gini index: G = Σ_r (nr / n) G(Sr), where nr is the number of samples in cluster Sr
 Lower values indicate purer clusters


ENTROPY

 Again, from Data Classification
 For each cluster Sr:
   p1, p2, ..., pk are the fractions of samples from the k classes
   Entropy: E(Sr) = − Σ_i pi log(pi)
 Average weighted entropy: E = Σ_r (nr / n) E(Sr)
 Lower values indicate purer clusters
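A minimal sketch (pure numpy; the function name and toy labels are assumptions) of computing the weighted Gini index and entropy of a clustering against known class labels:

import numpy as np

def external_validation(cluster_labels, class_labels):
    """Weighted Gini index and entropy of a clustering against known classes."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    n = len(cluster_labels)
    gini, entropy = 0.0, 0.0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()  # class fractions p1..pk within this cluster
        gini += (len(members) / n) * (1.0 - np.sum(p ** 2))
        entropy += (len(members) / n) * (-np.sum(p * np.log2(p)))
    return gini, entropy

# Hypothetical usage: clusters found by an algorithm vs. benchmark class labels
print(external_validation([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))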


FURTHER VALIDATION



WRAP UP



DATA CLUSTERING: PRACTICAL CONSIDERATIONS

 Clustering is generally difficult (ambiguity, no unique solution)

 Clustering algorithms typically have controlling parameters
   k-means and k-modes: k
   DBSCAN: ε
   Others have their own parameters
 How to select the right parameters? How many clusters?

 Model selection: trade-off between how well the model explains the data and model complexity, i.e. choose the simplest model that explains the data well
 Interpretability: do the clusters give an interesting interpretation? ⇒ domain specific
 Noise: outliers and noisy clusters may be present


HOW TO SELECT THE RIGHT CLUSTERING METHOD?

 Very challenging: real-world data are unlikely to have the ideal characteristics for distance-based, density-based, or hierarchy-based approaches

 Visualization: project data to 2D or 3D (e.g. principal component analysis) for visual inspection

 Is the method computationally feasible?

 Need to compare clustering solutions produced by different approaches, e.g. via cluster validation

 Need to examine the interpretability

 Cluster ensembles: combine clustering results from different approaches (over features/attributes or over instances) to obtain a robust clustering


APPLICATIONS

To other data mining tasks
 Data summarization: indicates how data points are distributed
 Outlier analysis: small, fragmented clusters are indicative of outliers
 Classification: a kNN classifier can use centroids for improved computational speed

Customer segmentation and collaborative filtering
 Group customers according to profiles, demographics, etc.
 Collaborative filtering: recommendations are made per group found via the clustering step

Other domains
 Text applications: hierarchical clustering is potentially useful for organizing web pages
 Multimedia applications: a clustering index enables effective retrieval of multimedia content
 Social network analysis: discover sub-communities


SUMMARY

 Data clustering basics
   Definition

 Cluster validation
   Internal
   External

 Clustering algorithms
   Representative-based: k-means and k-modes
   Hierarchical-based: bottom-up agglomerative clustering
   Density-based: DBSCAN

 Applications


NEXT: OUTLIER ANALYSIS
CHAPTER: 8
