
Cluster Analysis

Cluster analysis is a statistical method for organizing data into groups based on closely
associated characteristics. The goal of cluster analysis is to find distinct groups, or "clusters",
within a data set.

The purpose of clustering is to divide the customer base into subgroups, allowing marketers to
differentiate their approach by segment in order to maximize customer value.
Clustering Algorithms
Two important classes
Agglomerative Methods

These methods create a tree-like structure of clusters. At each step, the algorithm identifies the
two clusters that are closest and merges them.

Partitioning Methods

The goal is to divide the data set into non-overlapping groups such that the points within each
group are relatively similar and points in different groups are relatively dissimilar.
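
A minimal sketch of both classes in Python with scikit-learn, run on a small synthetic data set, is
shown below; the two features, the choice of K = 3, and the random seed are illustrative
assumptions rather than part of the material.

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans

    rng = np.random.default_rng(42)
    # Toy two-feature "customer" data drawn from three blobs (assumed for illustration).
    X = np.vstack([
        rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
        rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
        rng.normal(loc=[0, 3], scale=0.5, size=(50, 2)),
    ])

    # Agglomerative: builds a tree by repeatedly merging the two closest clusters.
    agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

    # Partitioning (K-means): divides the data into K non-overlapping groups.
    kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(np.bincount(agglo_labels), np.bincount(kmeans_labels))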
Hierarchical Clustering

• A hierarchical clustering approach is based on the determination of successive clusters based on
previously defined clusters.

• It is a technique aimed at grouping data into a tree of clusters called a dendrogram, which
graphically represents the hierarchical relationship between the underlying clusters.

• Hierarchical clustering has a variety of applications in our day-to-day life, including biology,
image processing, marketing, economics, and social network analysis.
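
As a small illustration, the sketch below builds a dendrogram with SciPy's hierarchical clustering
routines; the 20-point toy data set is an assumption made for the example.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))      # 20 points with 2 features (assumed)

    Z = linkage(X, method="ward")     # records the successive merges of the closest clusters
    dendrogram(Z)                     # draws the tree of clusters
    plt.xlabel("Data point index")
    plt.ylabel("Merge distance")
    plt.show()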
K-means Clustering

• K-means clustering is a popular unsupervised machine learning algorithm for grouping data points
into a predefined number of clusters (K).

• It works by iteratively minimizing the within-cluster variance, aiming to create clusters that
are highly similar within themselves and clearly distinct from each other.

• K-means is a centroid-based, or distance-based, algorithm.
Steps in K-means Clustering
▪ Initialization:

▪ Choose the number of clusters (K).

▪ Randomly pick K data points as the initial centroids (cluster centers/cluster seeds).

▪ Assignment:

▪ Assign each data point to the closest centroid based on distance (usually Euclidean distance).

▪ Centroid Update:

▪ Re-calculate the centroid of each cluster as the average of its assigned data points.

▪ Reassignment & Termination:

▪ Repeat the Assignment and Centroid Update steps:

▪ Re-assign data points based on the new centroids.

▪ Re-calculate centroids based on the newly assigned data points.

▪ Stop when the centroids no longer move significantly (clusters stabilize) or a maximum number of iterations is reached.
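
A bare-bones NumPy sketch of these steps is given below; the function and variable names are
illustrative, and in practice a library implementation such as scikit-learn's KMeans would normally
be used.

    import numpy as np

    def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: randomly pick K data points as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Assignment: each point goes to its closest centroid (Euclidean distance).
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Centroid update: average of the points assigned to each cluster
            # (an empty cluster keeps its previous centroid).
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Termination: stop when the centroids no longer move significantly.
            if np.linalg.norm(new_centroids - centroids) < tol:
                break
            centroids = new_centroids
        return labels, centroids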
Interpreting Clusters

What cluster members have in common

• Centroid used to define the typical member (a hypothetical customer who has the average value in
each of the cluster dimensions)

How each cluster is different from other clusters

• Key to differentiating segments

• Take the average value of each variable in the cluster and compare it to the average of the same
variable in the entire customer base
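
A short pandas sketch of this comparison is shown below; the column names and the cluster labels
are assumptions made for the example.

    import numpy as np
    import pandas as pd

    # Toy customer table with two variables (assumed names and values).
    customers = pd.DataFrame({
        "spend":  [120, 80, 300, 40, 260, 90],
        "visits": [4, 2, 10, 1, 8, 3],
    })
    labels = np.array([0, 0, 1, 0, 1, 0])   # cluster assignments from a prior clustering (assumed)

    cluster_means = customers.groupby(labels).mean()   # the "typical member" of each cluster
    overall_means = customers.mean()                   # the average over the entire customer base

    # A ratio above 1 means the cluster sits above the base average on that variable.
    profile = cluster_means / overall_means
    print(profile)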
Evaluating Clusters
• Cluster Diameter – The maximum distance between any two points within the cluster; it indicates
the maximum dissimilarity between members of the same cluster. The lower the diameter, the more
similar the cluster members are.

• Cluster Variance – The sum of the squared distances of the cluster's points from its centroid.
The lower the variance, the tighter and more similar the cluster.

• Cluster Silhouette – Measures how well a point in a cluster is matched to that cluster compared
to other clusters. The silhouette score of a clustering solution is the average of the silhouette
scores of all individual customers in the database. Scores can be computed at various levels: the
customer base, individual clusters, and individual customers.
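
The sketch below computes the diameter and variance of a single cluster as described above; the
small array of cluster members is an assumed example.

    import numpy as np
    from scipy.spatial.distance import pdist

    cluster_points = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.3], [1.2, 2.1]])

    # Cluster diameter: maximum pairwise distance within the cluster.
    diameter = pdist(cluster_points).max()

    # Cluster variance: sum of squared distances from the cluster centroid.
    centroid = cluster_points.mean(axis=0)
    variance = np.sum(np.linalg.norm(cluster_points - centroid, axis=1) ** 2)

    print(diameter, variance)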
Silhouette Score
• Silhouette score for each customer = (b − a) / max(a, b)
Where
• a = Mean Intra-cluster distance (Average of distance of each customer from all other
customers within the cluster)
• b = Mean Inter-cluster distance (Average of distance of each customer from all customers
in the nearest cluster)

The higher the silhouette score, the more similar the customer is to other customers in its
cluster relative to customers in the nearest cluster.
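
The sketch below computes silhouette scores at the individual-customer level and for the whole
clustering solution using scikit-learn, which implements the (b − a) / max(a, b) definition above;
the blob data and the choice of K = 3 are assumptions made for the example.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)    # toy customer data (assumed)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    per_customer = silhouette_samples(X, labels)                   # one score per customer
    per_cluster = [per_customer[labels == k].mean() for k in range(3)]
    overall = silhouette_score(X, labels)                          # average over the whole base

    print(overall, per_cluster)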
