INSTITUTE OF INFORMATION TECHNOLOGY & MANAGEMENT
Accredited ‘A’ Grade by NAAC & Recognised U/s 2(f) of the UGC Act
Rated Category ‘A+’ by SFRC & ‘A’ by JAC, Govt. of NCT of Delhi
Approved by AICTE & Affiliated to GGS Indraprastha University, New Delhi
Machine Learning with Python
Programme : BCA
Semester : V
Subject Code : BCAT311
Subject : Machine Learning with Python
Topic : Clustering
Faculty : Ms. Shilpi Bansal
© Institute of Information Technology and Management, D-29, Institutional Area, Janakpuri, New Delhi-110058
List of Topics
Introduction to clustering
K-means clustering
Hierarchical clustering
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
Land use: Identification of areas of similar land use in an earth
observation database
Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
Urban planning: Identifying groups of houses according to their house
type, value, and geographical location
Seismology: Observed earthquake epicenters should cluster along
continental faults
What Is a Good Clustering?
A good clustering method will produce clusters with
High intra-class similarity
Low inter-class similarity
Precise definition of clustering quality is difficult
Application-dependent
Ultimately subjective
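One common way to quantify intra- and inter-cluster similarity in Python is the silhouette coefficient; the snippet below is a minimal sketch using scikit-learn on synthetic, purely illustrative data.

# Silhouette coefficient as a rough measure of clustering quality.
# The synthetic blobs and k = 3 are illustrative assumptions only.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Values near +1 mean high intra-cluster similarity and low inter-cluster
# similarity; values near 0 or below suggest overlapping or poor clusters.
print("silhouette:", silhouette_score(X, labels))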
Requirements for Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal domain knowledge required to determine input
parameters
Ability to deal with noise and outliers
Insensitivity to order of input records
Robustness with respect to high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Similarity and Dissimilarity Between Objects
The same measures used for instance-based learning (IBL), e.g., the Lp norm
Euclidean distance (p = 2):
d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
Properties of a metric d(i, j):
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
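A minimal NumPy sketch of the Euclidean distance and a check of the four metric properties (the example points are illustrative):

import numpy as np

def euclidean(x, y):
    # L2 distance: sqrt of the sum over attributes of |x_k - y_k|^2
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

i, j, k = np.array([1.0, 2.0]), np.array([4.0, 6.0]), np.array([0.0, 0.0])

assert euclidean(i, j) >= 0                                   # non-negativity
assert euclidean(i, i) == 0                                   # identity
assert euclidean(i, j) == euclidean(j, i)                     # symmetry
assert euclidean(i, j) <= euclidean(i, k) + euclidean(k, j)   # triangle inequality
print(euclidean(i, j))                                        # 5.0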
Major Clustering Approaches
Partitioning: Construct various partitions and then evaluate them by
some criterion
Hierarchical: Create a hierarchical decomposition of the set of objects
using some criterion
Model-based: Hypothesize a model for each cluster and find best fit of
models to data
Density-based: Guided by connectivity and density functions
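As a rough, illustrative mapping (an assumption about common library choices, not part of the slides), each approach has a familiar scikit-learn counterpart:

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

partitioning  = KMeans(n_clusters=3)                    # partitioning
hierarchical  = AgglomerativeClustering(n_clusters=3)   # hierarchical (agglomerative)
model_based   = GaussianMixture(n_components=3)         # model-based (Gaussian mixture)
density_based = DBSCAN(eps=0.5, min_samples=5)          # density-based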
Partitioning Algorithms
Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters
Given k, find a partition into k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen, 1967): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw,
1987): Each cluster is represented by one of the objects in the cluster
K-Means Clustering
Given k, the k-means algorithm consists of four steps:
Select k initial centroids at random.
Assign each object to the cluster with the nearest centroid.
Recompute each centroid as the mean of the objects assigned to it.
Repeat the previous two steps until the assignments no longer change.
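A from-scratch NumPy sketch of these four steps; the variable names, convergence test, and lack of empty-cluster handling are illustrative choices, not part of the slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids at random from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centers = kmeans(np.random.rand(100, 2), k=3)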
K-Means Clustering (contd.)
Example
[Figure: four scatter plots (axes 0–10) illustrating successive k-means iterations on a 2-D example.]
Comments on the K-Means Method
Strengths
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
Often terminates at a local optimum. The global optimum may be found
using techniques such as simulated annealing and genetic algorithms
Weaknesses
Applicable only when mean is defined (what about categorical data?)
Need to specify k, the number of clusters, in advance
Trouble with noisy data and outliers
Not suitable to discover clusters with non-convex shapes
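Because k must be chosen in advance, a standard workaround (not from the slides) is to compare the within-cluster sum of squares (inertia) across candidate values of k and look for an "elbow"; a hedged sketch with scikit-learn:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # illustrative data

# Inertia decreases as k grows; the "elbow" where it flattens is a common choice.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)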
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not
require the number of clusters k as an input, but it does need a termination
condition
[Figure: AGNES proceeds agglomeratively from Step 0 to Step 4, merging a, b, c, d, e into ab, de, cde, and finally abcde; DIANA proceeds divisively in the reverse direction, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
Produces tree of clusters (nodes)
Initially: each object is a cluster (leaf)
Recursively merges nodes that have the least dissimilarity
Merge criteria: minimum distance, maximum distance, average distance, centroid distance
Eventually all nodes belong to the same cluster (root)
[Figure: three scatter plots (axes 0–10) illustrating successive AGNES merge steps on a 2-D example.]
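A minimal AGNES-style example using scikit-learn's AgglomerativeClustering; the linkage choice and synthetic data are illustrative assumptions.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=1)

# linkage='average' corresponds to the average-distance criterion above;
# 'single' and 'complete' correspond to minimum and maximum distance.
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
print(agg.fit_predict(X)[:10])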
A Dendrogram Shows How the Clusters Are Merged Hierarchically
Decompose data objects into several levels of nested
partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level. Then each connected
component forms a cluster.
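A sketch of building and cutting a dendrogram with SciPy; the linkage method and the cut threshold are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.random((20, 2))            # illustrative 2-D data

Z = linkage(X, method="average")   # hierarchical tree of clusters
# dendrogram(Z)                    # plot the tree (needs matplotlib)

# Cutting at a chosen height yields one flat clustering; 0.5 is an arbitrary threshold.
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)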
DIANA (Divisive Analysis)
Inverse order of AGNES
Start with root cluster containing all objects
Recursively divide into subclusters
Eventually each cluster contains a single object
[Figure: three scatter plots (axes 0–10) illustrating successive DIANA splits on a 2-D example.]
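DIANA itself is not implemented in scikit-learn or SciPy; purely to illustrate the divisive (top-down) idea, the sketch below repeatedly bisects the largest cluster with 2-means (a bisecting strategy, not DIANA's exact splitting rule).

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    # Start with one cluster holding all objects, then split until n_clusters remain.
    labels = np.zeros(len(X), dtype=int)
    while len(np.unique(labels)) < n_clusters:
        biggest = np.bincount(labels).argmax()          # pick the largest cluster
        mask = labels == biggest                        # (assumed to have >= 2 points)
        sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
        idx = np.where(mask)[0]
        labels[idx[sub == 1]] = labels.max() + 1        # one half becomes a new cluster
    return labels

print(divisive(np.random.rand(60, 2), n_clusters=3))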