Data Mining
(19ADOCN1001)
Mr.M.VijayaKumar, AP/AI&DS
19ADCN1303 - Data Mining 1
Course Outcomes
CO4: Classify
data for the given dataset using real world
applications.
19ADCN1303 - Data Mining 2
UNIT IV – Classification and Clustering
Classification: Basic Concepts - Decision Tree
Induction – Bayes Classification Methods – Rule
Based Classification – K-Nearest-Neighbor
Classifier - Model Evaluation and Selection –
Techniques to Improve Classification Accuracy.
Cluster Analysis: Basic Concepts and Methods-
Cluster Analysis - Partitioning Methods -
Hierarchical Methods - Density-Based Methods -
Grid-Based Methods.
19ADCN1303 - Data Mining 3
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is
the centroid or medoid of cluster Ci)
E ik1 pCi ( p ci ) 2
• Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
Data Mining 4
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four
steps:
• Partition objects into k nonempty subsets
• Compute seed points as the centroids of the clusters
of the current partitioning (the centroid is the center,
i.e., mean point, of the cluster)
• Assign each object to the cluster with the nearest
seed point
• Go back to Step 2, stop when the assignment does not
change
19ADCN1303 - Data Mining 5
An Example of K-Means Clustering
19ADCN1303 - Data Mining 6
Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
• Comment: Often terminates at a local optimal.
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of
data
• Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k (see Hastie et al.,
2009)
• Sensitive to noisy data and outliers
• Not suitable to discover19ADCN1303
clusters- Data
with non-convex shapes 7
Mining
Variations of the K-Means Method
• Most of the variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical
objects
• Using a frequency-based method to update modes of clusters
• A mixture of categorical and numerical data: k-prototype method
19ADCN1303 - Data Mining 8
What Is the Problem of the K-Means
Method?
• The k-means algorithm is sensitive to outliers !
• Since an object with an extremely large value may substantially
distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster as
a reference point, medoids can be used, which is the most centrally
located object in a cluster
19ADCN1303 - Data Mining 9
PAM: A Typical K-Medoids Algorithm
19ADCN1303 - Data Mining 10
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well
for large data sets (due to the computational complexity)
• Efficiency improvement on PAM
• CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
• CLARANS (Ng & Han, 1994): Randomized re-sampling
19ADCN1303 - Data Mining 11
Summary
• Partitioning Methods
19ADCN1303 - Data Mining 12
Reference
1. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining:
Concepts and Techniques”, 3rd Edition, Elsevier, 2012.
19ADCN1303 - Data Mining 13
Thank you
19ADCN1303 - Data Mining 14