Unit - 5
Data Mining: Cluster Analysis
1. Basic Concepts
The aim of cluster analysis is to identify groups of similar observations - formally, forming
groups so that:
(a) within a group, the observations are most similar to each other,
(b) and between groups the observations are most dissimilar to each other.
Cluster analysis is a form of unsupervised classification: there are no pre-defined classes, and it can be considered a form of descriptive data mining.
2. Examples
⚫ Humans, as a society, have been “clustering” for a long time in attempts to understand (and
simplify) the environment we live in:
⚫ Clustering the animal and plant kingdoms into a hierarchy of similarities.
⚫ Clustering chemical structures.
⚫ Every day we see grocery items clustered into similar groups.
⚫ We cluster student populations into similar groups of students from similar backgrounds or
studying similar combinations of subjects.
Retail Marketing
A retail company may collect the following information on households:
⚫ Household income
⚫ Household size
⚫ Distance from nearest urban area
⚫ Occupation of the head of household
They can then feed these variables into a clustering algorithm to perhaps identify the following
clusters:
⚫ Cluster 1: Small family, high spenders
⚫ Cluster 2: Larger family, high spenders
⚫ Cluster 3: Small family, low spenders
⚫ Cluster 4: Large family, low spenders
The company can then send personalized advertisements or sales letters to each household based on
how likely it is to respond to specific types of advertisements.
Streaming Services
A streaming service may collect the following data about individuals:
⚫ Minutes watched per day
⚫ Total viewing sessions per week
⚫ Number of unique shows viewed per month
Using these metrics, a streaming service can perform cluster analysis to identify high-usage and
low-usage users, so that it knows which users it should spend most of its advertising dollars on.
Health Insurance
An actuary may collect the following information about households:
⚫ Total number of doctor visits per year
⚫ Total household size
⚫ Total number of chronic conditions per household
⚫ Average age of household members
An actuary can then feed these variables into a clustering algorithm to identify households that are
similar. The health insurance company can then set monthly premiums based on how often they
expect households in specific clusters to use their insurance.
3. K-Means Algorithm (A Centroid-Based Technique)
K-means is one of the most commonly used algorithms for partitioning a given data set into a set of k
groups (i.e., k clusters), where k represents the number of groups. It classifies objects into multiple
groups (i.e., clusters) such that objects within the same cluster are as similar as possible (i.e.,
high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e.,
low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid),
which corresponds to the mean of the points assigned to the cluster. The basic idea behind k-means
clustering consists of defining clusters so that the total intra-cluster variation (known as the total
within-cluster variation) is minimized.
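Formally, if the data are partitioned into clusters C_1, ..., C_K and μ_k denotes the centroid (mean) of cluster C_k, the total within-cluster variation that k-means minimizes is commonly written as:

$$W(C_1, \ldots, C_K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$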
Steps involved in K-Means Clustering:
⚫ The first step when using k-means clustering is to indicate the number of clusters (k) that
will be generated in the final solution.
⚫ The algorithm starts by randomly selecting k objects from the data set to serve as the initial
centers for the clusters. The selected objects are also known as cluster means or centroids.
⚫ Next, each of the remaining objects is assigned to its closest centroid, where “closest” is
defined using the Euclidean distance between the object and the cluster mean. This step is
called the “cluster assignment step”.
⚫ After the assignment step, the algorithm computes the new mean value of each cluster. The
term “centroid update” is used to describe this step. Now that the centers have been
recalculated, every observation is checked again to see if it might be closer to a different
cluster. All the objects are reassigned using the updated cluster means.
⚫ The cluster assignment and centroid update steps are repeated iteratively until the cluster
assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the
current iteration are the same as those obtained in the previous iteration. A code sketch of this
loop follows the list.
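Below is a minimal NumPy sketch of the loop described above. It is illustrative only: the function name kmeans, the random initialization, and the max_iter safeguard are choices made for this example, and it assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch. X is an (n, d) array of observations."""
    rng = np.random.default_rng(seed)
    # Randomly pick k observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Cluster assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update step: mean of the points assigned to each cluster
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: centroids (and hence assignments) stop changing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```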
4. Hierarchical Clustering in Data Mining
Hierarchical clustering refers to an unsupervised learning procedure that determines
successive clusters based on previously defined clusters. It works by grouping data into a tree of
clusters. Hierarchical clustering starts by treating each data point as an individual cluster. The
endpoint is a set of clusters, where each cluster is distinct from the others,
and the objects within each cluster are broadly similar to one another.
There are two types of hierarchical clustering:
o Agglomerative Hierarchical Clustering
o Divisive Clustering
Agglomerative hierarchical clustering
Agglomerative clustering is one of the most common types of hierarchical clustering used to group
similar objects into clusters. Agglomerative clustering is also known as AGNES (Agglomerative
Nesting). In agglomerative clustering, each data point starts as an individual cluster, and clusters
are grouped in a bottom-up fashion: at each iteration, the two most similar clusters are combined,
until a single cluster remains.
Agglomerative hierarchical clustering algorithm
1. Consider each data point as an individual cluster.
2. Determine the similarity between every pair of clusters (i.e., build the proximity matrix).
3. Combine the two most similar (closest) clusters.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of a graphical representation using a dendrogram.
The demonstration below shows how the algorithm works step by step; no distances are actually
calculated here, and the proximities among the clusters are assumed.
Let’s suppose we have six different data points P, Q, R, S, T, V.
Step 1:
Consider each point (P, Q, R, S, T, V) as an individual cluster, and find the distance between each
individual cluster and all the others.
Step 2:
Now, merge the comparable clusters into single clusters. Let’s say clusters Q and R are closest
to each other, and likewise clusters S and T, so we merge them in this step. We are left with the
clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximities as per the algorithm and combine the two closest clusters, (ST)
and (V), to form the new set of clusters [(P), (QR), (STV)].
Step 4:
Repeat the same process. The clusters (QR) and (STV) are now the closest and are combined to form
a new cluster. We are left with [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged to form the single cluster [(PQRSTV)]. A code sketch
of this bottom-up process follows.
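A minimal SciPy sketch of the walkthrough above is shown below. The 2-D coordinates are invented for this example and chosen so that the merges happen in the order just described; linkage performs one merge at a time and records each merge as a row of its output.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical coordinates for the six points P, Q, R, S, T, V.
points = np.array([
    [0.0, 8.0],   # P
    [4.0, 4.0],   # Q
    [4.2, 4.1],   # R
    [9.0, 1.0],   # S
    [9.1, 1.1],   # T
    [8.5, 1.6],   # V
])

# Single linkage: the distance between two clusters is the distance
# between their closest members. Z has one row per merge, bottom-up:
# first S-T, then Q-R, then (ST)-V, then (QR)-(STV), then P joins last.
Z = linkage(points, method='single')
print(Z)

# dendrogram(Z, labels=['P', 'Q', 'R', 'S', 'T', 'V']) draws the tree.
```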
Advantages of Hierarchical clustering
o It is simple to implement and gives good results in some cases.
o It is easy to use and produces a hierarchy, a structure that contains more information than a flat partition.
o It does not require us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
o It sometimes breaks large clusters.
o It has difficulty handling clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o Once a merge (or split) has been performed, it can never be undone in a later step.
5. Density-Based Clustering in Data Mining
Density-based clustering refers to a method that is based on a local cluster criterion, such as
density-connected points. In this section, we will discuss density-based clustering with examples.
What is Density-based clustering?
Density-based clustering is one of the most popular unsupervised learning
methodologies used in model building and machine learning. Data points that lie in the
low-density regions separating clusters are treated as noise. The surroundings within
a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of
an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Density-Based Clustering - Background
Two parameters determine density in density-based clustering:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points required in the Eps-neighborhood of a point.
The Eps-neighborhood of a point i is NEps(i) = { k ∈ D | dist(i, k) ≤ Eps }.
Directly density-reachable:
A point i is directly density-reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k) and k satisfies the core point condition below.
Core point condition:
|NEps(k)| ≥ MinPts
Density-reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if
there is a chain of points i1, ..., in with i1 = j and in = i such that each point i(t+1) in the
chain is directly density-reachable from i(t).
Density-connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o
such that both i and j are density-reachable from o with respect to Eps and MinPts.
Working of Density-Based Clustering
Suppose a set of objects is denoted by D'. An object i is directly density-reachable
from an object j only if i is located within the ε-neighborhood of j and j is a core object.
An object i is density-reachable from an object j with respect to ε and MinPts in a given set of
objects D' only if there is a chain of objects i1, ..., in with i1 = j and in = i such that each i(t+1) is
directly density-reachable from i(t) with respect to ε and MinPts.
An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D'
only if there is an object o in D' such that both i and j are density-reachable from o
with respect to ε and MinPts.
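These definitions translate almost directly into code. The sketch below uses function names invented for this example and assumes X is an (n, d) NumPy array whose points are referred to by integer index.

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices k with dist(X[i], X[k]) <= eps (includes i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def is_core(X, i, eps, min_pts):
    """Core object condition: |NEps(i)| >= MinPts."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

def directly_density_reachable(X, i, j, eps, min_pts):
    """i is directly density-reachable from j iff i is in NEps(j)
    and j is a core object."""
    return is_core(X, j, eps, min_pts) and i in eps_neighborhood(X, j, eps)
```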
Major Features of Density-Based Clustering
The primary features of density-based clustering are given below.
o It is a one-scan method.
o It requires density parameters as a termination condition.
o It handles noise in the data.
o It can discover clusters of arbitrary shape.
Density-Based Clustering Methods
DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on a
density-based notion of cluster and can identify clusters of arbitrary shape in a spatial database
with noise (outliers). A brief usage sketch follows.
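For reference, a widely used implementation is scikit-learn's DBSCAN; the data below is made up for the example. The eps parameter plays the role of ε and min_samples the role of MinPts (scikit-learn counts a point as part of its own neighborhood).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus two isolated points that should come out as noise.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [4.0, 15.0], [15.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # one cluster index per point; -1 marks noise
```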