Customer Segmentation using Clustering (K-Means)
Customer segmentation means grouping customers into different clusters
based on their purchasing behavior or attributes. This helps businesses tailor
marketing strategies or services to each group.
Clustering is an unsupervised machine learning technique that automatically
finds natural groupings in data without pre-labeled categories.
Why K-Means Clustering?
K-Means partitions data points into K clusters, assigning each point to the
cluster with the nearest mean (centroid).
It’s simple, efficient, and widely used in marketing segmentation.
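The "nearest mean" idea can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm, not the production scikit-learn implementation (which adds k-means++ initialization and empty-cluster handling); the two-blob data here is made up for demonstration:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k points from the data
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each cluster's mean
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centroids stopped moving: converged
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs of 50 points each
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```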
Step-by-step Explanation
1. Dataset
We consider two features for each customer:
o Annual Income (in thousands)
o Spending Score (a score from 1 to 100 that indicates how much
the customer spends)
2. Data Standardization
Since these features have different scales, we standardize them to have
zero mean and unit variance. This prevents bias where features with
larger scales dominate the clustering.
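A quick sketch of what standardization does, using scikit-learn's StandardScaler on a few made-up income/score pairs:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income in thousands vs. score on 1-100: very different scales
X = np.array([[15., 39.], [16., 81.], [120., 6.], [137., 83.]])
scaled = StandardScaler().fit_transform(X)

# Each column now has zero mean and unit variance,
# so neither feature dominates the distance computation
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```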
3. Choosing Number of Clusters (K)
We use the Elbow Method:
o Run K-Means for a range of K (say 1 to 10).
o Calculate the within-cluster sum of squares (WCSS), i.e. the sum of
squared distances between points and their cluster centers.
o Plot WCSS vs K and look for the "elbow" where adding more
clusters does not reduce WCSS significantly.
o This “elbow” point indicates a good trade-off between model
complexity and explained variance.
4. Run K-Means
Using the chosen K, cluster the data points.
5. Interpretation
Each cluster represents a group of customers with similar income and
spending patterns, helping businesses understand customer profiles like:
o High income, high spending
o Low income, low spending
o Medium income, high spending, etc.
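Once each customer has a cluster label, per-cluster feature means make these profiles concrete. A small sketch with hypothetical clustered data (the values and labels below are invented for illustration):

```python
import pandas as pd

# Hypothetical clustered output: income (k$), spending score, assigned cluster
df = pd.DataFrame({
    'Annual Income (k$)':     [20, 25, 130, 140, 70, 75],
    'Spending Score (1-100)': [80, 75, 15, 10, 90, 85],
    'Cluster':                [0, 0, 1, 1, 2, 2],
})

# Per-cluster means reveal the profile of each segment,
# e.g. cluster 0 = low income / high spending
profile = df.groupby('Cluster').mean()
print(profile)
```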
Full runnable code with output plot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Create sample data
np.random.seed(42)
data = {
    'CustomerID': range(1, 201),
    'Annual Income (k$)': np.random.randint(15, 150, 200),
    'Spending Score (1-100)': np.random.randint(1, 100, 200)
}
df = pd.DataFrame(data)

# 2. Visualize data distribution
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=df)
plt.title('Customer Data Distribution')
plt.show()

# 3. Scale features
features = df[['Annual Income (k$)', 'Spending Score (1-100)']]
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# 4. Elbow method to find optimal K
# n_init is set explicitly because its default changed in recent scikit-learn versions
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(scaled_features)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method For Optimal K')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# 5. From the elbow plot, let's choose K=5
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_features)

# 6. Visualize clusters
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)',
                hue='Cluster', palette='Set1', data=df)
plt.title('Customer Segments (K=5)')
plt.show()

# 7. Print cluster centers in original scale (optional)
centers = scaler.inverse_transform(kmeans.cluster_centers_)
print("Cluster centers (Annual Income, Spending Score):")
print(centers)
In summary:
o Created synthetic customer data.
o Standardized the features.
o Used the Elbow Method to find the optimal number of clusters.
o Applied K-Means clustering.
o Visualized the customer segments.
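As a complementary check on the choice of K, the silhouette score measures how well each point fits its own cluster versus the nearest other cluster (higher is better). A minimal sketch on synthetic blob data standing in for customer segments (the data here is invented; scikit-learn's silhouette_score is assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated synthetic blobs standing in for three segments
X = np.vstack([rng.normal(loc, 0.5, (50, 2)) for loc in (0, 5, 10)])

# Score each candidate K; the true structure (K=3) should score highest
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette = {scores[k]:.3f}")
```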