Clustering
Standard problem:
Regression and classification: we are given labelled training data and want to find a function that fits it. In clustering there are no labels, so the task changes.
Task: group the similar data points together.
Consider a set of data points plotted in space.
The task of clustering is to group the similar data points together, so nearby points end up in the same group. What counts as "similar" is very much task specific.
Here the data is given purely as points in mathematical form, i.e. D = {x_i} with no labels.
Measuring the performance of a clustering algorithm: there are various metrics that can measure the quality of a clustering.
Clustering algos: K-Means, Hierarchical clustering, DBSCAN.
These are the most used algos; there are other algorithms suited to other kinds of data.
Unsupervised learning: clustering is referred to as unsupervised learning, because no labels are given. Both classification and regression are supervised learning.
There is also an area called semi-supervised learning, where the dataset is the union of a small labelled set D1 and a large unlabelled set D2. This happens when the cost of labelling the data is expensive.
Applications of clustering: clustering is studied more in data mining than in machine learning, and it has many applications.
Customer segmentation: a company wants to group similar customers based on their purchasing behaviour; purchasing behaviour can also hint at attributes such as income level.
Once the customers are grouped into clusters, each cluster can be treated as a separate class of customers, and we can give different offers to different clusters.
Image segmentation: this is a problem in computer vision and image processing. It is about segmenting the pixels of an image into regions; clustering the pixels can help segment the image, and ML algorithms (e.g. for object detection) can then be applied on the segments.
Semi-supervised labelling of reviews: suppose the polarity (positive/negative) of a review is the class label. Manually labelling every review is very time consuming and expensive. Instead, we can apply a clustering algorithm to the reviews, pick a few reviews from each cluster, and assign one label per cluster (a rough sketch follows).
Now we can train the machine learning algorithms on these approximately labelled reviews.
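As a rough illustration of this idea, the sketch below clusters a handful of made-up reviews with TF-IDF + KMeans and then assigns one manually chosen polarity label per cluster; the reviews, the label names and K = 2 are all hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy reviews (made up); in practice this would be a large unlabelled corpus.
reviews = [
    "great product, works perfectly",
    "excellent quality, very happy",
    "terrible, broke after one day",
    "awful experience, waste of money",
]

# Cluster the reviews instead of labelling each one by hand.
X = TfidfVectorizer().fit_transform(reviews)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Inspect a few reviews per cluster and assign one polarity label per cluster.
cluster_to_label = {0: "positive", 1: "negative"}  # decided by a human after inspection
labels = [cluster_to_label[c] for c in km.labels_]
print(list(zip(reviews, labels)))
```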
Metrics for measuring Clustering: What is a good clustering result?
Distances between points inside the same cluster are called intra-cluster distances; distances between points belonging to different clusters are called inter-cluster distances. Keeping the intra-cluster distances small and the inter-cluster distances large leads to a good clustering; this is the basis of clustering effectiveness.
In an ideal world, we want the inter-cluster distances to be very high and the intra-cluster distances to be very low.
Dunn index: the numerator is based on inter-cluster distances and the denominator on intra-cluster distances, so a larger value means a better clustering.
Numerator: compute the distance d(i, j) between every pair of clusters i and j; the standard Dunn index then takes the smallest such separation, so that even the two closest clusters must be well apart.
Denominator: for each cluster, compute the distance between its two farthest points (the intra-cluster diameter), then take the max over these distances d'1, d'2, ...
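For reference, the standard Dunn index can be written as follows (here d(C_i, C_j) is the chosen inter-cluster distance and d'(C_k) the intra-cluster diameter; the exact distance definitions vary by variant):

```latex
% Standard Dunn index for a clustering C_1, ..., C_K
D = \frac{\min_{1 \le i < j \le K} d(C_i, C_j)}{\max_{1 \le k \le K} d'(C_k)}
```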
K-Means: Geometric intuition, Centroids
K-Means is the most popular clustering algorithm, and it has several variants.
K is the number of clusters; it is the hyperparameter in K-Means.
K-Means assigns every point to exactly one cluster. In this case we have three clusters, and the intersection of any two clusters is the null set.
With K clusters we have K centroids and K sets of points; each centroid is the mean (geometric centre) of the points in its cluster.
K-Means clustering is a centroid-based scheme.
There are other families of clustering, such as
hierarchical clustering and
density-based clustering (DBSCAN).
The challenge is how to find the K centroids. Once we have the K centroids, we can compute the K sets of points by assigning each point to its nearest centroid. There are algos to find the K centroids,
i.e. to find the K central points and then assign every point to the cluster of its nearest centroid (a small example follows).
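A tiny sketch of these two ideas, computing a centroid as the mean of its cluster and assigning a point to the nearest centroid; the toy points are made up:

```python
import numpy as np

# The centroid of a cluster is just the mean of its points.
cluster = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
centroid = cluster.mean(axis=0)             # geometric centre of the cluster

# A point is assigned to the cluster whose centroid is nearest.
centroids = np.array([[0.0, 0.0], [2.0, 3.0], [10.0, 10.0]])
x = np.array([1.5, 2.5])
nearest = np.linalg.norm(centroids - x, axis=1).argmin()
print(centroid, nearest)                    # -> [2. 3.] and cluster index 1
```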
K-Means: Mathematical formulation: Objective function
K-Means is posed as an optimization problem:
find the K centroids c_1, ..., c_K (equivalently the K disjoint sets of points S_1, ..., S_K) that minimize the total sum of squared distances from each point to the centroid c_i of its cluster i.
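Written out, the objective is commonly stated like this (a standard formulation; S_i denotes the set of points assigned to centroid c_i):

```latex
% K-Means objective: minimize the within-cluster sum of squared distances
\underset{c_1,\ldots,c_K}{\operatorname{arg\,min}} \; \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - c_i \rVert^{2},
\qquad c_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```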
This problem is NP-hard to solve exactly. When a problem is this hard we fall back on approximation algorithms,
built using some clever hacks and math.
Lloyd's algorithm: this is the standard approximation used to solve K-Means.
Step 1 (initialization): pick K points at random as the initial centroids.
Step 2 (assignment): assign each point to its nearest centroid.
Step 3 (recompute centroids): recompute each centroid as the mean of the points assigned to it. This is also called the update stage of centroids.
Step 4:
Repeat step 2 and step 3 until convergence.
Convergence: compare the new set of centroids with the old set;
if the centroids hardly move, i.e. the distance between the old and new centroids shows no (or very little) change, we stop.
The exact mathematical problem is very hard; this iterative procedure is Lloyd's algorithm.
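A minimal NumPy sketch of Lloyd's algorithm, assuming Euclidean distance; the function name, tolerance and seed are my own choices:

```python
import numpy as np

def lloyds_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assignment - each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update - recompute each centroid as the mean of its points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids barely move between iterations.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```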
How to initialize: K-Means++
We can do random initialization: pick K points at random from the dataset as the K centroids. There is one problem,
initialization sensitivity.
Even given an ideal dataset, depending on where the initial centroids land we can end up in an optimal or a sub-optimal clustering state;
applying the same Lloyd's algorithm from different initializations gives completely different solutions.
The second way is K-Means++: instead of purely random initialization we use a smart
initialization.
Step 1:
Pick the first centroid c1 uniformly at random from the dataset.
Step 2:
Pick the next centroid probabilistically, with probability proportional to each point's squared distance from its nearest already-chosen centroid. The chance of picking a point near an existing centroid is very low,
while points far from the chosen centroids have a higher probability of becoming the next centroid. Repeat until all K centroids are picked.
In other words, the initialization tries to pick points that are as far as possible from the centroids already picked.
Why can't we just deterministically pick the point farthest from its nearest centroid? Because we can have outliers,
and we would then pick the outliers as centroids; a deterministic farthest-point rule would be badly affected by outliers.
That is the whole reason we do it probabilistically: to reduce the chance of picking an outlier as a centroid (a sketch of this D^2 sampling follows).
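A minimal sketch of K-Means++ initialization (the so-called D² sampling); the function name and seed are my own choices:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ initialization: pick k centroids by D^2 probability sampling."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-chosen centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Step 2: sample the next centroid with probability proportional to d^2,
        # so far-away points are likely, but no point is picked deterministically.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```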
Failure cases/Limitations:
K-Means struggles when we have clusters of different sizes, different densities and non-globular shapes.
Different sizes and different densities:
K-Means tends to fail when clusters have very different densities (or sizes): it splits large/dense clusters and merges small/sparse ones.
Non-globular shapes:
K-Means assumes roughly globular (spherical) clusters, so if we give it non-globular data it cannot recover the true structure.
If we continue with K-Means anyway, one workaround is to use a larger K, say K = 10; we then get many small pieces of the true clusters,
which we can put together afterwards to recover the actual clusters (a sketch follows).
K-Means is never a perfect algorithm,
but in the case of non-globular structures this over-cluster-and-merge trick is often the best solution it can offer.
Evaluating clustering is not easy, because there is no ground truth;
in classification and regression the ground-truth labels remove this uncertainty when measuring results.
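A small sketch of this workaround on a classic non-globular dataset, scikit-learn's make_moons; the choice of K = 10 and the dataset are just for illustration, and the final merging step is left as a manual/second-stage step:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two intertwined half-moons: a non-globular dataset where K-Means with K = 2 fails.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# Workaround: over-cluster with a larger K (here K = 10) ...
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))   # sizes of the 10 small pieces

# ... and then merge the small pieces (by hand or with a second-stage method)
# into the two actual moon-shaped clusters.
```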
K-Medoids: in K-Means the centroids may not be interpretable.
A centroid is a mean of points, so it need not be an actual data point; for example, the mean of a set of word vectors is not itself a word, so the centroid is hard to interpret.
When we instead require the cluster centre to be an actual data point from the dataset, the method is called K-Medoids.
This matters especially when interpretation is important.
K-Medoids:
Partitioning Around Medoids (PAM) algorithm:
Start with K medoids and assign each point to its nearest medoid. Then try swapping each medoid with a non-medoid point in the dataset.
The loss is the same idea as in K-Means: the sum of (squared) distances from each point to its nearest medoid.
After each swap, recompute the loss.
If the loss decreases by swapping, we keep the point that decreases the loss; there are a lot of swaps
possible.
A swap is successful only when the loss decreases. A medoid is always an actual data point.
Because PAM only needs pairwise distances, we can use a kernel matrix or a distance matrix and apply K-Medoids directly. The massive advantages are
1. More interpretability (the cluster centres are real data points).
2. Kernelization.
It is trivially kernelizable (a sketch of the PAM swap step follows).
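A simplified sketch of the PAM swap step on a precomputed distance matrix; the function name, the random initialization and the loss (sum of plain distances to the nearest medoid) are my own simplifications:

```python
import numpy as np

def pam_kmedoids(D, k, max_iter=100, seed=0):
    """Simplified PAM: D is an (n, n) pairwise distance matrix, k the number of medoids."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))  # random initial medoids

    def loss(meds):
        # Loss: sum over all points of the distance to their nearest medoid.
        return D[:, meds].min(axis=1).sum()

    best = loss(medoids)
    for _ in range(max_iter):
        improved = False
        # Try swapping every medoid with every non-medoid point.
        for i in range(k):
            for p in range(n):
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                c_loss = loss(candidate)
                if c_loss < best:            # keep the swap only if the loss decreases
                    medoids, best, improved = candidate, c_loss, True
        if not improved:
            break
    labels = D[:, medoids].argmin(axis=1)    # assign each point to its nearest medoid
    return medoids, labels
```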
Determining the right K:
In K-Means, K is a hyperparameter, and we only have unlabelled data D = {x_i}.
1. Domain knowledge: sometimes we know roughly how many clusters there should be.
2. Otherwise we use the elbow method (or knee method): run K-Means for a range of K values,
plot the loss we are minimizing (the within-cluster sum of squared distances) against K,
and pick the K at the elbow where the loss stops dropping sharply.
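A minimal elbow-method sketch using scikit-learn's KMeans; the synthetic blobs and the range of K values are arbitrary choices for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with a few well-separated blobs.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
losses = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    losses.append(km.inertia_)   # inertia_ = within-cluster sum of squared distances

# Plot loss vs. K and look for the "elbow" where the curve flattens.
plt.plot(list(ks), losses, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Loss (inertia)")
plt.title("Elbow method for choosing K")
plt.show()
```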
Code examples:
Sklearn – [Link]
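Since the link points to scikit-learn, here is a minimal usage sketch (KMeans with the default k-means++ initialization; the toy blobs are just for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# init="k-means++" is the smart initialization discussed above (and the default).
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # the 3 learned centroids
print(km.labels_[:10])       # cluster assignment of the first 10 points
print(km.inertia_)           # final K-Means loss
```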
Time and space complexity:
K-Means (Lloyd's algorithm): O(n k d i), where n is the number of points, d the dimensionality, k the number of clusters and i the number of iterations; for fixed k, d and i this is still linear in n.
As far as space is concerned, we need to store the data plus the centroids, i.e. O(nd + kd), which is also linear.
It is fast and simple to understand.
Cluster Amazon reviews: