K-Means Algorithm

K-Means Clustering is an unsupervised learning algorithm that groups data into K clusters based on similarity, minimizing variance within each cluster. The algorithm involves selecting the number of clusters, initializing centroids, assigning data points to the nearest centroid, computing new centroids, and repeating until convergence. Key concepts include intra-cluster and inter-cluster distances, the Elbow Method for determining optimal K, and the use of StandardScaler to standardize features for improved clustering performance.

1. What is K-Means Clustering?

K-Means is an unsupervised learning algorithm used for clustering data into K groups based on similarity. It minimizes the variance within each cluster.

2. Steps of the K-Means Algorithm

Step 1: Choose the Number of Clusters (K)
* Decide the number of clusters K in advance.
* This can be done using methods like the Elbow Method or the Silhouette Score.

Step 2: Initialize K Cluster Centroids
* There are two common ways to initialize:
  1. Random Initialization: randomly select K data points as the initial centroids.
  2. KMeans++ Initialization: selects well-spread initial centroids to improve accuracy.

Step 3: Assign Each Data Point to the Nearest Centroid
* Calculate the Euclidean distance from each data point to each centroid.
* Assign each point to the nearest centroid, forming K clusters.

Step 4: Compute New Centroids
* For each cluster, compute the new centroid as the mean (average) of all points in that cluster.

Step 5: Repeat Until Convergence
* Repeat Steps 3 and 4 until:
  * the centroids do not change (or change very little), or
  * a maximum number of iterations is reached.

3. Example Calculation (Step-by-Step)

Let's go through a small dataset example.

Dataset (2D points): {(2, 3), (3, 3), (5, 7), (8, 8), (9, 10)}

Let's assume K = 2 (two clusters).

Step 1: Initialize K Centroids

Assume random initial centroids: C1 = (2, 3) and C2 = (8, 8).

Step 2: Assign Points to the Nearest Centroid

We calculate the Euclidean distance:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

| Data Point | Distance to C1 (2, 3) | Distance to C2 (8, 8) | Assigned Cluster |
|------------|-----------------------|-----------------------|------------------|
| (2, 3)     | 0                     | 7.81                  | Cluster 1        |
| (3, 3)     | 1.00                  | 7.07                  | Cluster 1        |
| (5, 7)     | 5.00                  | 3.16                  | Cluster 2        |
| (8, 8)     | 7.81                  | 0                     | Cluster 2        |
| (9, 10)    | 9.90                  | 2.24                  | Cluster 2        |

Step 3: Compute New Centroids

* New centroid for Cluster 1: ((2 + 3)/2, (3 + 3)/2) = (2.5, 3)
* New centroid for Cluster 2: ((5 + 8 + 9)/3, (7 + 8 + 10)/3) = (7.33, 8.33)

Step 4: Repeat Until Convergence
* Reassign points to the new centroids.
* Update the centroids again.
* Stop when the centroids don't change (a runnable sketch of this loop appears after the distance metrics below).

1. Intra-Cluster Distance (Within-Cluster Distance)

Definition:
* The average or sum of distances between data points within the same cluster.
* Measures how compact a cluster is.

Formula:

$$\text{Intra-Cluster Distance} = \sum_{i \in C_k} d(x_i, C_k)$$

Where:
* $d(x_i, C_k)$ is the distance between a data point $x_i$ and its cluster centroid $C_k$.
* The lower the intra-cluster distance, the better the clustering (points are close together).

2. Inter-Cluster Distance (Between-Cluster Distance)

Definition:
* The distance between different clusters (usually between their centroids).
* Measures how well-separated the clusters are.

Formula:

$$\text{Inter-Cluster Distance} = d(C_i, C_j)$$

Where:
* $C_i$ and $C_j$ are the centroids of two different clusters.
* The higher the inter-cluster distance, the better the separation.

3. Good Clustering Characteristics

* Low intra-cluster distance (points inside clusters are close to each other).
* High inter-cluster distance (clusters are well-separated from each other).

This balance ensures that clusters are coherent and distinct.
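To see Steps 3-5 in action, below is a minimal NumPy sketch of the loop, run on the five points and initial centroids from the worked example. The variable names and the iteration cap are illustrative choices, not part of the original notes.

```python
import numpy as np

# The five 2-D points and the initial centroids from the worked example.
X = np.array([[2, 3], [3, 3], [5, 7], [8, 8], [9, 10]], dtype=float)
centroids = np.array([[2, 3], [8, 8]], dtype=float)  # C1, C2
K = 2

for _ in range(100):  # iteration cap, per Step 5's stopping rule
    # Step 3: Euclidean distance from every point to every centroid,
    # then assign each point to its nearest centroid.
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 4: each new centroid is the mean of the points assigned to it.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

    # Step 5: stop once the centroids (effectively) stop moving.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)              # [0 0 1 1 1] -> {(2,3),(3,3)} and {(5,7),(8,8),(9,10)}
print(centroids.round(2))  # [[2.5  3.  ]
                           #  [7.33 8.33]]
```

On this data the loop converges on the second pass, reproducing the centroids (2.5, 3) and (7.33, 8.33) computed by hand above.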
The Elbow Method is a technique used to determine the optimal number of clusters (K).

1. Concept of the Elbow Method

The idea is to plot the Within-Cluster Sum of Squares (WCSS) against K and find the "elbow point" where adding more clusters no longer significantly improves the clustering.

* WCSS (Within-Cluster Sum of Squares) measures the compactness of the clusters:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - C_k\|^2$$

where:
  * $x_i$ is a data point in cluster $C_k$,
  * $C_k$ is the centroid of cluster $k$,
  * $\|x_i - C_k\|^2$ is the squared Euclidean distance.

* Goal: minimize WCSS while keeping K reasonable.

2. Steps of the Elbow Method

Step 1: Run K-Means for Different Values of K
* Choose different values of K (e.g., from 1 to 10).
* For each K, calculate the WCSS.

Step 2: Plot K vs. WCSS
* On the x-axis, plot K (the number of clusters).
* On the y-axis, plot WCSS.

Step 3: Find the Elbow Point
* The "elbow" is where WCSS stops decreasing sharply; increasing K further does not reduce WCSS significantly.
* The optimal K is at the elbow point.

[Figure: "Elbow Method for Optimal K". WCSS on the y-axis versus the number of clusters K on the x-axis; the curve drops steeply and then flattens after the elbow.]

Why Use StandardScaler?

1. K-Means Uses Euclidean Distance
* If features have different scales, larger values dominate smaller ones.
* Example: if Annual Income ranges over 0-100 while Spending Score ranges over 0-1, the Spending Score is effectively ignored by K-Means because of its small range.

2. Improves Performance
* Standardization makes clustering more accurate by balancing each feature's influence.

How StandardScaler Works

It transforms the data using the formula:

$$z = \frac{X - \mu}{\sigma}$$

Where:
* $X$ = original data
* $\mu$ = mean of the feature
* $\sigma$ = standard deviation of the feature

Example

Imagine this dataset:

| Customer | Annual Income ($) | Spending Score |
|----------|-------------------|----------------|
| A        | 20                | 10             |
| B        | 80                | 50             |
| C        | 40                | 30             |

Step 1: Compute Mean & Standard Deviation
* Annual Income: mean $\mu$ = 46.67, std $\sigma$ = 24.94
* Spending Score: mean $\mu$ = 30, std $\sigma$ = 16.33

Step 2: Apply the StandardScaler Formula

| Customer | Annual Income (Scaled)     | Spending Score (Scaled) |
|----------|----------------------------|-------------------------|
| A        | (20 - 46.67)/24.94 = -1.07 | (10 - 30)/16.33 = -1.22 |
| B        | (80 - 46.67)/24.94 = 1.34  | (50 - 30)/16.33 = 1.22  |
| C        | (40 - 46.67)/24.94 = -0.27 | (30 - 30)/16.33 = 0     |

Now both features have mean = 0 and variance = 1, making them equally important.

1. What is WCSS?

WCSS (Within-Cluster Sum of Squares) measures how close data points are to their assigned cluster centroids.

* Lower WCSS: clusters are tight and well-defined.
* Higher WCSS: clusters are spread out (not ideal).

Formula:

$$\text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$$

where:
* $x$ = data point
* $\mu_i$ = centroid of cluster $i$
* $C_i$ = set of points in cluster $i$
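As a sketch of how these pieces fit together in scikit-learn (assuming scikit-learn is installed; `inertia_` is the library's name for WCSS), the three-customer example above can be standardized and then swept over K:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# The three-customer example from above: [annual income, spending score].
X = np.array([[20, 10], [80, 50], [40, 30]], dtype=float)

# Standardize each feature: z = (X - mu) / sigma.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.round(2))
# [[-1.07 -1.22]
#  [ 1.34  1.22]
#  [-0.27  0.  ]]

# Elbow method: fit K-Means for several K and record the WCSS,
# which scikit-learn exposes as the fitted model's inertia_ attribute.
# (K only runs up to 3 here because the toy dataset has 3 points;
#  a real dataset would typically sweep K from 1 to 10.)
for k in range(1, 4):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X_scaled)
    print(f"K={k}: WCSS={km.inertia_:.2f}")
```

Plotting K against the recorded WCSS values and looking for the flattening point gives the elbow plot described above.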
