Performing Cluster Analysis in Python: A Step-by-Step Tutorial

Cluster analysis refers to the set of tools, algorithms, and methods for finding hidden groups in a dataset based on similarity, and subsequently analyzing the characteristics and properties of data belonging to each identified group. Discovering the groups that underlie the data, and acting on them, is a common and valuable analytic approach in domains like marketing (for segmenting customers or products), finance (for discovering fraudulent transactions), and more.

Whilst the process of finding groups within the data (clustering) is a key step of cluster analysis, these often interchanged terms are not synonyms. Clustering is the grouping process, typically governed by an algorithm like k-means, DBSCAN, hierarchical clustering, etc. Meanwhile, cluster analysis encapsulates both clustering and the subsequent analysis and interpretation of clusters, ultimately leading to decision-making outcomes based on the insights obtained.

This tutorial illustrates a step-by-step cluster analysis pipeline in Python, consisting of the following stages:

Preparing and preprocessing data
Setting the number of clusters
Applying the clustering algorithm
Visualizing the results
Evaluating the goodness of clusters
Interpreting results

Data Preparation and Preprocessing

For this tutorial, we will use the mall customers dataset, which is publicly and widely available for download from several GitHub repositories, including the one referenced in the code below.

This dataset contains information about customers in a shopping mall: gender, age, annual income, and spending score. The latter represents an indicator ranging from 1 to 100 of the money spent by the customer in the mall. Thus, we aim to find and analyze hidden groups of customers with similar traits or patterns.

The first few lines of code below import the necessary libraries, load the dataset from a GitHub repository into a pandas DataFrame, and display the first five customers in the dataset.

# Import necessary libraries
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset from URL
url = 'https://raw.githubusercontent.com/kennedykwangari/Mall-Customer-Segmentation-Data/master/Mall_Customers.csv'
df = pd.read_csv(url)

# Quick glimpse of the data
df.head()

The output shows the first five customers together with their attributes.

Next, we prepare and preprocess the data. The data contain no missing values but have some disparity in the ranges of values found across attributes, as well as irrelevant attributes like customer ID and a categorical attribute that may not be very relevant for segmenting customers: gender.

Based on these premises, our data preparation will consist of two simple steps:

Feature selection: select relevant numerical attributes for clustering.
Normalization: scale the values of attributes. This will be helpful for the effectiveness of the clustering algorithm.

# Select relevant features for clustering (e.g., Age, Annual Income, Spending Score)
# Store the selected data attributes in a new DataFrame, named 'X'
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Feature scaling (standardization) using the StandardScaler() class available in sklearn library
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
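
To see what standardization does, the sketch below applies the same StandardScaler to a small synthetic table and checks that every column ends up with mean 0 and standard deviation 1. The data and the names X_demo, scaler_demo, and manual are hypothetical stand-ins with the same column names, not the real mall dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the three selected columns (not the real dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'Age': rng.integers(18, 71, size=200),
    'Annual Income (k$)': rng.integers(15, 138, size=200),
    'Spending Score (1-100)': rng.integers(1, 101, size=200),
})

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# StandardScaler rescales each column as (x - mean) / std,
# using the population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean()) / X_demo.std(ddof=0)

col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```

After scaling, age, income, and spending score contribute on a comparable scale to the distances that K-means relies on, instead of the attribute with the widest range dominating.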

Setting the Number of Clusters

Our data is now ready for a clustering algorithm! Depending on the algorithm type chosen, we may need to specify some configuration parameters (called hyperparameters, in pure machine learning jargon). Concretely, for the k-means clustering algorithm (the iterative clustering method we will use in this tutorial), it is necessary to specify a priori the number of groups or clusters we aim to find.

Sometimes, depending on domain or a priori knowledge about the problem and data at hand, we may have a clue of the exact or approximate number of groups (denoted K) we would like our clustering algorithm to discover. If that’s not the case, don’t worry: there is a more systematic approach than just “trial and error”. It is called the Elbow Method, and it consists of running K-means several times with an increasing number of clusters K and plotting the inertia of each run. The inertia is the sum of squared distances from each data point to the centroid of its assigned cluster, so lower inertia values indicate more compact, consistently defined groups.

Inertia tends to keep decreasing as K grows, so the goal is not the lowest possible inertia but a good trade-off: low inertia with as few clusters as possible. Hence, if the inertia values obtained for different K are plotted as a curve, the point closest to the bottom-left corner (0,0), the “elbow” where the curve stops dropping sharply, indicates the “best” value for K, given the dataset.
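
To make the inertia metric concrete, here is a minimal sketch that recomputes it by hand and compares it with the value reported by scikit-learn. It runs on synthetic points from make_blobs (an assumption made so the example is self-contained), and X_demo, km, and manual_inertia are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (hypothetical): 200 points around 4 centers
X_demo, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

# n_init is set explicitly so the run does not depend on the library's default
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_demo)

# Inertia = sum of squared distances from each point to its assigned centroid
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two numbers should agree up to floating-point error, which confirms that inertia measures nothing more than how tightly points sit around their own centroid.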

For instance, let’s try the elbow method for possible values of K ranging from 1 to 10, and visualize the inertia associated with each K.

# Determine optimal number of clusters (K) using the Elbow Method
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

# Plot the elbow method to decide on the best 'K'
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

The resulting elbow curve clearly hints at K=5 as an ideal number of clusters to find.
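
Reading the elbow from a plot is subjective. One possible way to make the bottom-left-corner rule described earlier explicit is to rescale K and inertia to the range [0, 1] and pick the K whose point lies closest to the origin. The sketch below is only a heuristic on synthetic stand-in data (make_blobs, hypothetical variable names); it is not a scikit-learn feature and may disagree with a visual reading.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (hypothetical), not the mall dataset
X_demo, _ = make_blobs(n_samples=300, centers=5, n_features=3,
                       cluster_std=0.8, random_state=1)

ks = np.arange(1, 11)
inertias = np.array([
    KMeans(n_clusters=int(k), n_init=10, random_state=42).fit(X_demo).inertia_
    for k in ks
])

# Rescale both axes to [0, 1] so that K and inertia are comparable
k_norm = (ks - ks.min()) / (ks.max() - ks.min())
inertia_norm = (inertias - inertias.min()) / (inertias.max() - inertias.min())

# Distance of each (K, inertia) point to the bottom-left corner (0, 0)
dist_to_origin = np.sqrt(k_norm ** 2 + inertia_norm ** 2)
best_k = int(ks[np.argmin(dist_to_origin)])
print(best_k)
```

Treat the suggested K as a starting point: the rescaling choice influences the result, so it is worth checking it against the plot and against domain knowledge.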

Applying the Clustering Algorithm

Performing the K-means clustering algorithm in Python is straightforward thanks to the scikit-learn library. Indeed, we have already done this several times as part of the elbow method to find the best K. All that remains is to apply it one last time with the chosen number of clusters.

# Apply K-means with K=5
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_scaled)

# Add the cluster identifiers as a new attribute in the original data
df['Cluster'] = kmeans.labels_

Each data object (customer) will belong to one of the five clusters found, and each cluster has in turn an associated numerical identifier, which scikit-learn numbers from 0 to 4. Therefore, after applying K-means we have saved the information about the cluster ID each customer belongs to, as a new attribute in the dataset. This new attribute will be extremely useful in visualizing and analyzing the clustering results.
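
As a quick sanity check on the new attribute, the following sketch lists the cluster identifiers, counts the customers per cluster, and assigns a previously unseen customer to a cluster. It uses a synthetic table with the same column names; demo, scaler_demo, km, and new_customer are hypothetical, and the new customer's values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

# Hypothetical stand-in data with the same columns as the tutorial
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 71, size=200),
    'Annual Income (k$)': rng.integers(15, 138, size=200),
    'Spending Score (1-100)': rng.integers(1, 101, size=200),
})

scaler_demo = StandardScaler()
km = KMeans(n_clusters=5, n_init=10, random_state=42)
km.fit(scaler_demo.fit_transform(demo[features]))
demo['Cluster'] = km.labels_

# Identifiers are zero-based: 0, 1, ..., K-1
cluster_ids = sorted(demo['Cluster'].unique().tolist())
cluster_sizes = demo['Cluster'].value_counts().sort_index()
print(cluster_ids)
print(cluster_sizes)

# A new customer must go through the same scaler before prediction
new_customer = pd.DataFrame([[30, 60, 50]], columns=features)
new_label = int(km.predict(scaler_demo.transform(new_customer))[0])
print(new_label)
```

Reusing the fitted scaler matters here: predicting on unscaled values would compare the new customer against centroids that live in standardized units.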

Visualizing the Results

There are multiple ways to visualize clustering results when the data used for clustering has more than two attributes. The simplest approach is to choose any two attributes and show a scatter plot where dots are colored differently depending on the cluster they belong to. This may not yield an insightful cluster visualization at first, but if at least one pair of visualized attributes leads to an interesting visualization of clearly defined clustering, then that’s a good sign that the clusters are insightful.

For example, by plotting the clusters customers belong to based on the two finance-related attributes (annual income and spending score), we can already get a pretty interesting segmentation of customers:

# Visualize the clusters using annual income and spending score
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['Annual Income (k$)'], y=df['Spending Score (1-100)'], hue=df['Cluster'], palette='viridis', s=100)
# Centroids are in standardized units; map them back to the original units before plotting
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
plt.scatter(centroids[:, 1], centroids[:, 2], s=300, c='red', label='Centroids')
plt.title('Customer Segments Based on Income and Spending Score')
plt.legend()
plt.show()

Evaluating the Goodness of Clusters

The customer segments obtained are realistically far from perfect, as there are some overlaps, especially between segments 1 and 4. Yet the results are not bad either: they allow us to easily get some useful insights that may help to make decisions related to marketing and personalization, as described next.
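
Beyond visual inspection, cluster quality can also be quantified. A common option (not used in the original walkthrough, so treat it as a suggested addition) is the silhouette score, which ranges from -1 to 1, with higher values indicating better separated clusters. The sketch below runs on synthetic stand-in data with hypothetical variable names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), standardized like the tutorial data
X_raw, _ = make_blobs(n_samples=200, centers=5, n_features=3, random_state=7)
X_demo = StandardScaler().fit_transform(X_raw)

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_demo)

# Overall score: mean silhouette over all points
overall = silhouette_score(X_demo, km.labels_)

# Per-cluster averages help spot the segments that overlap with others
per_point = silhouette_samples(X_demo, km.labels_)
per_cluster = {
    int(c): float(per_point[km.labels_ == c].mean())
    for c in np.unique(km.labels_)
}

print(round(overall, 3))
print(per_cluster)
```

Clusters with a noticeably lower average silhouette than the rest are the natural candidates for the kind of overlap observed in the scatter plot.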

Interpreting Results

In broad terms, customers can be categorized into multiple types based on their financial profile and purchase behavior. A possible interpretation cluster by cluster could be:

Cluster 0 (purple): low-income, low-spending customers, who are likely less engaged with the brand or more price-sensitive.
Cluster 1 (blue): low-income, high-spending customers, likely a group of frequent shoppers or individuals who spend a large portion of their income at the mall and/or live nearby.
Cluster 2 (emerald): moderate to high-income, high-spending customers, likely the most valuable customers for the business.
Cluster 3 (green): moderate to high-income, low-spending customers, likely affluent customers with selective spending habits.
Cluster 4 (yellow): middle-income, middle-spending customers, representing an average customer segment with balanced spending habits.
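
Interpretations like these are easier to justify with a per-cluster summary in the original units. A minimal sketch, assuming a table with the tutorial's column names plus the Cluster attribute (built here from synthetic data, with hypothetical names demo and profile):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

# Hypothetical stand-in data with the same columns as the tutorial
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 71, size=200),
    'Annual Income (k$)': rng.integers(15, 138, size=200),
    'Spending Score (1-100)': rng.integers(1, 101, size=200),
})
demo['Cluster'] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(demo[features])
)

# Average profile of each segment, in the original (unscaled) units
profile = demo.groupby('Cluster')[features].mean().round(1)
profile['Customers'] = demo['Cluster'].value_counts().sort_index()
print(profile)
```

On the real data, a table like this shows at a glance whether a segment labeled "low-income, high-spending" actually has a low mean income and a high mean spending score, and how many customers it contains.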

These insights can be turned into well-targeted marketing strategies, e.g. tailored promotions or newsletters for each customer segment, or they could be combined with other analyses to build strategies oriented towards retaining high-spending customers, incentivizing higher spending by lower-spending groups, etc.

As a final remark, the elbow method is not infallible, and sometimes an analysis of results may motivate trying other clustering configurations, for example, K=4, to see if strongly overlapping segments like 1 and 4 would become one single segment, or if they are still separable under other K-means configurations.
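
One way to explore this is to cross-tabulate the cluster labels obtained with K=5 against those obtained with K=4 and see which segments merge. The sketch below uses synthetic stand-in data; labels_k5, labels_k4, and overlap are hypothetical names, and the outcome on the real dataset may look different.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), standardized like the tutorial data
X_raw, _ = make_blobs(n_samples=200, centers=5, n_features=3, random_state=7)
X_demo = StandardScaler().fit_transform(X_raw)

labels_k5 = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_demo)
labels_k4 = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

# Rows: segments found with K=5; columns: segments found with K=4
overlap = pd.crosstab(
    pd.Series(labels_k5, name='K=5'),
    pd.Series(labels_k4, name='K=4'),
)
print(overlap)
```

If two K=5 rows place most of their customers in the same K=4 column, those two segments have effectively merged. Note that cluster numbers are arbitrary and are not guaranteed to match between the two runs.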