Unit - 5
Data Mining: Cluster Analysis
1. Basic Concepts
The aim of cluster analysis is to identify groups of similar observations - formally, forming
groups so that:
(a) within a group, the observations are most similar to each other,
(b) and between groups the observations are most dissimilar to each other.
Cluster analysis is a form of unsupervised classification: there are no pre-defined classes, and it can be considered a form of descriptive data mining.
2. Examples
⚫ Humans, as a society, have been “clustering” for a long time in attempts to understand (and
simplify) the environment we live in:
⚫ Clustering the animal and plant kingdoms into a hierarchy of similarities.
⚫ Clustering chemical structures.
⚫ Every day we see grocery items clustered into similar groups.
⚫ We cluster student populations into similar groups of students from similar backgrounds or
studying similar combinations of subjects.
Retail Marketing
A retail company may collect the following information on households:
⚫ Household income
⚫ Household size
⚫ Distance from nearest urban area
⚫ Occupation of the head of household
They can then feed these variables into a clustering algorithm to perhaps identify the following
clusters:
⚫ Cluster 1: Small family, high spenders
⚫ Cluster 2: Larger family, high spenders
⚫ Cluster 3: Small family, low spenders
⚫ Cluster 4: Large family, low spenders
The company can then send personalized advertisements or sales letters to each household based on
how likely it is to respond to specific types of advertisements.
Streaming Services
A streaming service may collect the following data about individuals:
⚫ Minutes watched per day
⚫ Total viewing sessions per week
⚫ Number of unique shows viewed per month
Using these metrics, a streaming service can perform cluster analysis to identify high-usage and
low-usage users, so that it knows which users it should spend most of its advertising dollars on.
Health Insurance
An actuary may collect the following information about households:
⚫ Total number of doctor visits per year
⚫ Total household size
⚫ Total number of chronic conditions per household
⚫ Average age of household members
An actuary can then feed these variables into a clustering algorithm to identify households that are
similar. The health insurance company can then set monthly premiums based on how often they
expect households in specific clusters to use their insurance.
3. K-Means Algorithm (A Centroid-Based Technique)
K-means is one of the most commonly used algorithms for partitioning a given data set into a set of k
groups (i.e., k clusters), where k represents the number of groups. It classifies objects into multiple
groups (i.e., clusters) such that objects within the same cluster are as similar as possible (i.e.,
high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e.,
low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid),
which corresponds to the mean of the points assigned to the cluster. The basic idea behind k-means
clustering consists of defining clusters so that the total intra-cluster variation (known as the total
within-cluster variation) is minimized.
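Formally, if the data are partitioned into clusters C_1, ..., C_K and μ_k denotes the centroid (mean) of cluster C_k, the total within-cluster variation that k-means minimizes is commonly written as:

$$W(C_1, \ldots, C_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$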
Steps involved in K-Means Clustering:
⚫ The first step when using k-means clustering is to indicate the number of clusters (k) that
will be generated in the final solution.
⚫ The algorithm starts by randomly selecting k objects from the data set to serve as the initial
centers for the clusters. The selected objects are also known as cluster means or centroids.
⚫ Next, each of the remaining objects is assigned to its closest centroid, where “closest” is
defined using the Euclidean distance between the object and the cluster mean. This step is
called the “cluster assignment step”.
⚫ After the assignment step, the algorithm computes the new mean value of each cluster. The
term “centroid update” is used to describe this step. Now that the centers have been
recalculated, every observation is checked again to see if it might be closer to a different
cluster. All the objects are reassigned using the updated cluster means.
⚫ The cluster assignment and centroid update steps are repeated iteratively until the cluster
assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the
current iteration are the same as those obtained in the previous iteration. A code sketch of this
loop follows the list.
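Below is a minimal NumPy sketch of the loop described above. It is illustrative only: the function name kmeans, the random initialization, and the max_iter safeguard are choices made for this example, and it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X is an (n, d) array of observations."""
    rng = np.random.default_rng(seed)
    # Randomly pick k observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update step: mean of the points assigned to each cluster
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: centroids (and hence assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```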
4. Hierarchical Clustering in Data Mining
Hierarchical clustering refers to an unsupervised learning procedure that determines
successive clusters based on previously defined clusters. It works by grouping data into a tree of
clusters. Hierarchical clustering starts by treating each data point as an individual cluster. The
endpoint is a set of clusters, where each cluster is distinct from the others,
and the objects within each cluster are broadly similar to one another.
There are two types of hierarchical clustering:
o Agglomerative Hierarchical Clustering
o Divisive Clustering
Agglomerative hierarchical clustering
Agglomerative clustering is one of the most common types of hierarchical clustering used to group
similar objects into clusters. Agglomerative clustering is also known as AGNES (Agglomerative
Nesting). In agglomerative clustering, each data point starts as an individual cluster, and clusters
are grouped in a bottom-up fashion: at each iteration, the two most similar clusters are combined,
until a single cluster remains.
Agglomerative hierarchical clustering algorithm
1. Consider each data point as an individual cluster.
2. Determine the similarity between every pair of clusters (i.e., build the proximity matrix).
3. Combine the two most similar (closest) clusters.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of a graphical representation using a dendrogram.
The demonstration below shows how the algorithm works step by step; no distances are actually
calculated here, and the proximities among the clusters are assumed.
Let’s suppose we have six different data points P, Q, R, S, T, V.
Step 1:
Consider each point (P, Q, R, S, T, V) as an individual cluster, and find the distance between each
individual cluster and all the others.
Step 2:
Now, merge the comparable clusters into single clusters. Let’s say clusters Q and R are closest
to each other, and likewise clusters S and T, so we merge them in this step. We are left with the
clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximities as per the algorithm and combine the two closest clusters, (ST)
and (V), to form the new set of clusters [(P), (QR), (STV)].
Step 4:
Repeat the same process. The clusters (QR) and (STV) are now the closest and are combined to form
a new cluster. We are left with [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged to form the single cluster [(PQRSTV)]. A code sketch
of this bottom-up process follows.
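A minimal SciPy sketch of the walkthrough above is shown below. The 2-D coordinates are invented for this example and chosen so that the merges happen in the order just described; linkage performs one merge at a time and records each merge as a row of its output.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical coordinates for the six points P, Q, R, S, T, V.
points = np.array([
    [0.0, 8.0],   # P
    [4.0, 4.0],   # Q
    [4.2, 4.1],   # R
    [9.0, 1.0],   # S
    [9.1, 1.1],   # T
    [8.5, 1.6],   # V
])

# Single linkage: the distance between two clusters is the distance
# between their closest members. Z has one row per merge, bottom-up:
# first S-T, then Q-R, then (ST)-V, then (QR)-(STV), then P joins last.
Z = linkage(points, method='single')
print(Z)

# dendrogram(Z, labels=['P', 'Q', 'R', 'S', 'T', 'V']) draws the tree.
```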
Advantages of Hierarchical clustering
o It is simple to implement and gives good results in some cases.
o It is easy to use and produces a hierarchy, a structure that contains more information than a flat partition.
o It does not require us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
o It sometimes breaks large clusters.
o It has difficulty handling clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o Once a merge (or split) has been performed, it can never be undone in a later step.
5. Density-Based Clustering in Data Mining
Density-based clustering refers to a method that is based on a local cluster criterion, such as
density-connected points. In this section, we will discuss density-based clustering with examples.
What is Density-based clustering?
Density-based clustering is one of the most popular unsupervised learning
methodologies used in model building and machine learning. Data points that lie in the
low-density regions separating clusters are treated as noise. The surroundings within
a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of
an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Density-Based Clustering - Background
Two parameters determine density in density-based clustering:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points required in the Eps-neighborhood of a point.
The Eps-neighborhood of a point i is NEps(i) = { k ∈ D | dist(i, k) ≤ Eps }.
Directly density-reachable:
A point i is directly density-reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k) and k satisfies the core point condition below.
Core point condition:
|NEps(k)| ≥ MinPts
Density-reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if
there is a chain of points i1, ..., in with i1 = j and in = i such that each point i(t+1) in the
chain is directly density-reachable from i(t).
Density-connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o
such that both i and j are density-reachable from o with respect to Eps and MinPts.
Working of Density-Based Clustering
Suppose a set of objects is denoted by D'. An object i is directly density-reachable
from an object j only if i is located within the ε-neighborhood of j and j is a core object.
An object i is density-reachable from an object j with respect to ε and MinPts in a given set of
objects D' only if there is a chain of objects i1, ..., in with i1 = j and in = i such that each i(t+1) is
directly density-reachable from i(t) with respect to ε and MinPts.
An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D'
only if there is an object o in D' such that both i and j are density-reachable from o
with respect to ε and MinPts.
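These definitions translate almost directly into code. The sketch below uses function names invented for this example and assumes X is an (n, d) NumPy array whose points are referred to by integer index.

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices k with dist(X[i], X[k]) <= eps (includes i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def is_core(X, i, eps, min_pts):
    """Core object condition: |NEps(i)| >= MinPts."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

def directly_density_reachable(X, i, j, eps, min_pts):
    """i is directly density-reachable from j iff i is in NEps(j)
    and j is a core object."""
    return is_core(X, j, eps, min_pts) and i in eps_neighborhood(X, j, eps)
```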
Major Features of Density-Based Clustering
The primary features of density-based clustering are given below.
o It is a one-scan method.
o It requires density parameters as a termination condition.
o It handles noise in the data.
o It can discover clusters of arbitrary shape.
Density-Based Clustering Methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a
density-based notion of cluster and can identify clusters of arbitrary shape in a spatial database
with noise (outliers). A brief usage sketch follows.
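For reference, a widely used implementation is scikit-learn's DBSCAN; the data below is made up for the example. The eps parameter plays the role of ε and min_samples the role of MinPts (scikit-learn counts a point as part of its own neighborhood).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus two isolated points that should come out as noise.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [4.0, 15.0], [15.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # one cluster index per point; -1 marks noise
```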