K-Means Algorithm
1. What is K-Means Clustering?
K-Means is an unsupervised learning algorithm used for clustering data into K groups based on similarity. It minimizes the variance within each cluster.
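Before walking through the steps, here is a minimal end-to-end sketch using scikit-learn (assumed installed); the dataset matches the worked example later in these notes:

```python
# Minimal K-Means sketch with scikit-learn (assumed installed).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 3], [5, 7], [8, 8], [9, 10]])

# n_clusters is K; n_init=10 restarts from 10 random seeds and keeps
# the run with the lowest within-cluster variance (inertia).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # cluster index (0 or 1) per point

print(labels)
print(kmeans.cluster_centers_)  # one centroid per cluster
```

With K = 2, the two lower-left points end up in one cluster and the three upper-right points in the other.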
2. Steps of K-Means Algorithm
Step 1: Choose the Number of Clusters (K)
* Decide the number of clusters K in advance.
* This can be done using methods like the Elbow Method or Silhouette Score.
Step 2: Initialize K Cluster Centroids
* There are two common ways to initialize:
1. Random Initialization: Randomly select K data points as initial centroids.
2. KMeans++ Initialization: Smartly selects initial centroids to improve accuracy.
Step 3: Assign Each Data Point to the Nearest Centroid
* Calculate the Euclidean distance from each data point to each centroid.
* Assign each point to the nearest centroid, forming K clusters.
Step 4: Compute New Centroids
* For each cluster, compute the new centroid as the mean (average) of all points in that cluster.
Step 5: Repeat Until Convergence
* Repeat Steps 3 and 4 until:
  o Centroids do not change (or change very little).
  o A maximum number of iterations is reached.

3. Example Calculation (Step-by-Step)
Let's go through a small dataset example.
Dataset (2D Points):
Data Points = {(2, 3), (3, 3), (5, 7), (8,8), (9, 10)}
Let's assume K = 2 (two clusters).
Step 1: Initialize K Centroids
Assume random initial centroids:
* C1 = (2, 3)
* C2 = (8, 8)
Step 2: Assign Points to Nearest Centroid
We calculate the Euclidean distance:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Data Point   Distance to C1 (2, 3)   Distance to C2 (8, 8)   Assigned Cluster
(2, 3)       0.00                    7.81                    Cluster 1
(3, 3)       1.00                    7.07                    Cluster 1
(5, 7)       5.00                    3.16                    Cluster 2
(8, 8)       7.81                    0.00                    Cluster 2
(9, 10)      9.90                    2.24                    Cluster 2
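These distances and cluster assignments can be reproduced with a short NumPy sketch (NumPy assumed available):

```python
import numpy as np

points = np.array([[2, 3], [3, 3], [5, 7], [8, 8], [9, 10]])
c1, c2 = np.array([2, 3]), np.array([8, 8])

d1 = np.linalg.norm(points - c1, axis=1)  # distance of each point to C1
d2 = np.linalg.norm(points - c2, axis=1)  # distance of each point to C2
assigned = np.where(d1 <= d2, 1, 2)       # nearest centroid wins

print(np.round(d1, 2))  # C1 distances: 0.00, 1.00, 5.00, 7.81, 9.90
print(np.round(d2, 2))  # C2 distances: 7.81, 7.07, 3.16, 0.00, 2.24
print(assigned)         # clusters: 1, 1, 2, 2, 2
```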
Step 3: Compute New Centroids
New centroid for Cluster 1:
C1 = ((2 + 3)/2, (3 + 3)/2) = (2.5, 3)
New centroid for Cluster 2:
C2 = ((5 + 8 + 9)/3, (7 + 8 + 10)/3) = (7.33, 8.33)
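The centroid update is just a per-coordinate mean, which can be checked in NumPy:

```python
import numpy as np

cluster1 = np.array([[2, 3], [3, 3]])
cluster2 = np.array([[5, 7], [8, 8], [9, 10]])

# New centroid = mean of the points in the cluster, per coordinate.
new_c1 = cluster1.mean(axis=0)  # (2.5, 3.0)
new_c2 = cluster2.mean(axis=0)  # (7.33, 8.33) after rounding

print(new_c1, np.round(new_c2, 2))
```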
Step 4: Repeat Until Convergence
* Reassign points to new centroids.
* Update centroids again.
* Stop when centroids don't change.

1. Intra-Cluster Distance (Within-Cluster Distance)
* The average or sum of distances between data points within the same cluster.
* Measures how compact a cluster is.
Formula:
Intra-Cluster Distance = Σ_{x_i ∈ C_k} d(x_i, C_k)

Where:
* d(x_i, C_k) is the distance between a data point x_i and its cluster centroid C_k.
* The lower the intra-cluster distance, the better the clustering (points are close together).
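A minimal NumPy sketch of the (summed) intra-cluster distance for the example's Cluster 1:

```python
import numpy as np

cluster = np.array([[2, 3], [3, 3]])  # points in one cluster
centroid = cluster.mean(axis=0)       # (2.5, 3.0)

# Sum of distances from each point to its own cluster centroid.
intra = np.linalg.norm(cluster - centroid, axis=1).sum()
print(intra)  # 1.0 (each point is 0.5 from the centroid)
```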
2. Inter-Cluster Distance (Between-Cluster Distance)
Definition:
* The distance between different clusters (usually measured between centroids).
* Measures how well-separated clusters are.
Formula:
Inter-Cluster Distance = d(C_i, C_j)

Where:
* C_i and C_j are centroids of two different clusters.
* The higher the inter-cluster distance, the better the separation.
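Using the two centroids from the worked example, the inter-cluster distance is simply:

```python
import numpy as np

c1 = np.array([2.5, 3.0])    # centroid of Cluster 1
c2 = np.array([7.33, 8.33])  # centroid of Cluster 2

# Inter-cluster distance: Euclidean distance between the two centroids.
inter = np.linalg.norm(c1 - c2)
print(round(inter, 2))  # 7.19
```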
3. Good Clustering Characteristics
* Low Intra-Cluster Distance (points inside clusters are close to each other).
* High Inter-Cluster Distance (clusters are well-separated from each other).
This balance ensures that clusters are coherent and distinct.
The Elbow Method is a technique used to determine the optimal number of clusters (K).

1. Concept of the Elbow Method
The idea is to plot the Within-Cluster Sum of Squares (WCSS) vs. K and find the "elbow point" where adding more clusters doesn't significantly improve the clustering.
* WCSS (Within-Cluster Sum of Squares) measures the compactness of the clusters:

WCSS = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} ||x_i - C_k||²

where:
* x_i is a data point in cluster C_k,
* C_k is the centroid of cluster k,
* ||x_i - C_k||² is the squared Euclidean distance.
* Goal: Minimize WCSS while keeping K reasonable.
2. Steps of the Elbow Method
Step 1: Run K-Means for Different Values of K
* Choose different values of K (e.g., from 1 to 10).
* For each K, calculate the WCSS.
Step 2: Plot K vs. WCSS
* On the x-axis, plot K (number of clusters).
* On the y-axis, plot WCSS.
Step 3: Find the Elbow Point
* The "elbow" is where WCSS stops decreasing sharply.
* This means increasing K further does not reduce WCSS significantly.
* The optimal K is at the elbow point.

[Figure: Elbow Method for Optimal K. WCSS on the y-axis against Number of Clusters (K) on the x-axis; the curve drops steeply at first and then flattens, and the bend marks the elbow.]
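The loop over K can be sketched with scikit-learn, whose inertia_ attribute is exactly the WCSS; the plotting lines are left commented out (matplotlib assumed if uncommented):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 3], [5, 7], [8, 8], [9, 10]])

wcss = []
for k in range(1, 5):  # K must not exceed the number of points
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this K

# Plot K vs. WCSS and look for the bend:
# import matplotlib.pyplot as plt
# plt.plot(range(1, 5), wcss, marker="o")
# plt.xlabel("Number of Clusters (K)")
# plt.ylabel("WCSS")
# plt.show()
```

WCSS decreases as K grows; the point where the decrease levels off is the elbow.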
Why Use StandardScaler?
1. K-Means Uses Euclidean Distance
* If features have different scales, larger values dominate smaller ones.
* Example:
  - Annual Income (0-100)
  - Spending Score (0-1) → will be effectively ignored by K-Means due to its small range.
2. Improves Performance
* Standardization makes clustering more accurate by balancing feature influence.
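A small sketch of the scale problem, using hypothetical values for the two features: the Euclidean distance between two customers is driven almost entirely by the large-scale income feature.

```python
import numpy as np

# Hypothetical customers: [Annual Income, Spending Score].
# Both features differ a lot relative to their own ranges,
# but spending score lives on a tiny absolute scale.
a = np.array([20.0, 0.1])
b = np.array([80.0, 0.9])

d = np.linalg.norm(a - b)
print(round(d, 4))  # 60.0053: almost entirely the income difference of 60
```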
How StandardScaler Works
It transforms data using the formula:

z = (X - μ) / σ

Where:
* X = original data value
* μ = mean of the feature
* σ = standard deviation of the feature

Example
Imagine this dataset:
Customer   Annual Income ($)   Spending Score
A          20                  10
B          80                  50
C          40                  30
Step 1: Compute Mean & Standard Deviation
* Annual Income: Mean μ = 46.67, Std σ = 24.94
* Spending Score: Mean μ = 30, Std σ = 16.33
Step 2: Apply StandardScaler Formula

Customer   Annual Income (Scaled)        Spending Score (Scaled)
A          (20 - 46.67)/24.94 = -1.07    (10 - 30)/16.33 = -1.22
B          (80 - 46.67)/24.94 = 1.34     (50 - 30)/16.33 = 1.22
C          (40 - 46.67)/24.94 = -0.27    (30 - 30)/16.33 = 0
Now, both features have mean = 0 and variance = 1, making them equally important.
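The hand calculation above can be verified with scikit-learn's StandardScaler, which uses the population standard deviation and so matches the values computed here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows: customers A, B, C; columns: Annual Income, Spending Score.
X = np.array([[20, 10], [80, 50], [40, 30]], dtype=float)

scaler = StandardScaler()  # z = (X - mean) / std, per feature
X_scaled = scaler.fit_transform(X)

print(np.round(X_scaled, 2))
# rows: [-1.07, -1.22], [1.34, 1.22], [-0.27, 0.0]
```

After scaling, each column has mean 0 and variance 1, so both features carry equal weight in the distance computation.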
1. What is WCSS?
WCSS (Within-Cluster Sum of Squares) measures how close data points are to their assigned cluster
centroids.
* Lower WCSS → clusters are tight and well-defined.
* Higher WCSS → clusters are spread out (not ideal).

Formula:

WCSS = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x - μ_i||²

Where:
* x = data point
* μ_i = centroid of cluster i
* C_i = set of points in cluster i
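As a check, this formula can be evaluated by hand and compared against scikit-learn's inertia_ on the example dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 3], [5, 7], [8, 8], [9, 10]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Manual WCSS: for each point, squared distance to its assigned centroid.
wcss = sum(
    np.sum((x - km.cluster_centers_[label]) ** 2)
    for x, label in zip(X, km.labels_)
)

print(round(wcss, 2))  # equals km.inertia_
```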