What is the concept of clustering?
Clustering is an unsupervised machine learning technique designed to group unlabeled
examples based on their similarity to each other. (If the examples are labeled, this kind of
grouping is called classification.) Consider, for example, a hypothetical patient study designed to
evaluate a new treatment protocol: clustering could reveal groups of patients who respond to the
treatment in similar ways.
K-means clustering is a popular, unsupervised machine learning algorithm used to group similar
data points together. It's a centroid-based algorithm that aims to partition a dataset into k
distinct clusters, where k is a predefined number. The algorithm works iteratively by assigning
each data point to the nearest cluster centroid, and then recalculating the centroids based on
the newly assigned data points. This process continues until the centroids stabilize or the
algorithm reaches a predetermined number of iterations.
Here's a more detailed breakdown:
1. Initial Setup:
Choose the number of clusters (k):
This is often determined using methods such as the elbow method or domain knowledge.
Randomly initialize cluster centroids:
These are the initial "center points" for each cluster.
2. Iterative Process:
Assign data points to clusters:
Each data point is assigned to the closest centroid.
Recalculate cluster centroids:
The new centroid for each cluster is calculated as the mean of all data points assigned to that
cluster.
Repeat the assignment and update steps:
This process continues until the centroids no longer move significantly, or a maximum number
of iterations is reached.
3. Goal of K-means:
Minimize within-cluster variance: The algorithm aims to find centroids that
minimize the sum of squared distances between each data point and its assigned
centroid (written out just after this list).
Maximize between-cluster variance: Ideally, clusters should be distinct and well-
separated.
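Written out, the within-cluster variance objective (the within-cluster sum of squares, sometimes
called inertia) for $k$ clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$ is

$$ J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 $$

Each iteration of K-means either lowers J or leaves it unchanged, which is why the centroids
eventually stabilize.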
4. Use Cases:
Customer segmentation: Grouping customers based on purchasing behavior or
demographics.
Document clustering: Organizing documents based on similarity in content.
Image segmentation: Dividing an image into different regions or objects.
Anomaly detection: Identifying data points that fall outside of the typical clusters.
In essence, K-means is a powerful tool for grouping data points based on their proximity to
centroids, enabling insights into the underlying structure of the data.
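As a minimal sketch of the loop described above (not the only way to run it), scikit-learn's
KMeans wraps the whole procedure; the data X below is synthetic and the choice k = 2 is arbitrary:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two loose blobs, purely for illustration
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# n_clusters is the predefined k; n_init repeats the random initialization and keeps the best run
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index assigned to each data point
print(kmeans.cluster_centers_)      # final (stabilized) centroids
print(kmeans.inertia_)              # within-cluster sum of squared distances (the objective J)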
Hierarchical Clustering in Data Mining
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes
the following steps:
1. Identify the two clusters that are closest together, and
2. Merge these two most similar clusters.
These steps are repeated until all the clusters are merged together.
In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. This
hierarchy is represented graphically by a dendrogram, a tree-like (inverted-tree) diagram that
records the sequence of merges or splits: read bottom-up, it shows the order in which points are
merged; read top-down, it shows the order in which clusters are split.
What is Hierarchical Clustering?
Hierarchical clustering is a method of cluster analysis in data mining that creates a hierarchical
representation of the clusters in a dataset. The method starts by treating each data point as a
separate cluster and then iteratively combines the closest clusters until a stopping criterion is
reached. The result of hierarchical clustering is a tree-like structure, called a dendrogram, which
illustrates the hierarchical relationships among the clusters.
Hierarchical clustering has several advantages over other clustering methods:
The ability to handle non-convex clusters and clusters of different sizes and densities.
The ability to handle missing data and noisy data.
The ability to reveal the hierarchical structure of the data, which can be useful for
understanding the relationships among the clusters.
Drawbacks of Hierarchical Clustering
The need for a criterion to stop the clustering process and determine the final number of
clusters.
The computational cost and memory requirements of the method can be high,
especially for large datasets.
The results can be sensitive to noise and outliers, and to the linkage criterion and
distance metric used.
In summary, Hierarchical clustering is a method of data mining that groups similar data
points into clusters by creating a hierarchical structure of the clusters.
This method can handle different types of data and reveal the relationships among the
clusters. However, it can have high computational cost and results can be sensitive to
some conditions.
Types of Hierarchical Clustering
Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster and, at every step, merge the nearest
pair of clusters (it is a bottom-up method). At first, every data point is considered an individual
entity or cluster. At every iteration, clusters merge with other clusters until only one cluster
remains.
The algorithm for Agglomerative Hierarchical Clustering is:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (compute the proximity
matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the merged clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the algorithm works; no calculation has been
performed below, and the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.
Agglomerative Hierarchical clustering
Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
Step-2: In the second step, comparable clusters are merged to form a single cluster. Let's
say cluster (B) and cluster (C) are very similar to each other, so we merge them in this
step; similarly for clusters (D) and (E). We are left with the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process, the clusters DEF and BC are comparable and are
merged together to form a new cluster. We are now left with clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
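A merge sequence like the one above can be computed and drawn with SciPy's hierarchical
clustering utilities. The sketch below is only illustrative: the 2-D coordinates chosen for A-F are
made up so that B/C and D/E end up close together.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Made-up 2-D coordinates for the six points A..F
points = np.array([[0.0, 0.0], [2.0, 2.0], [2.2, 1.9], [6.0, 6.0], [6.1, 5.8], [9.0, 0.0]])
names = ["A", "B", "C", "D", "E", "F"]

# 'average' linkage uses the mean pairwise distance between clusters;
# 'single', 'complete' and 'ward' are common alternatives
Z = linkage(points, method="average")

dendrogram(Z, labels=names)   # plot the bottom-up merge hierarchy
plt.show()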
2. Divisive Hierarchical clustering
We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we start by treating all of the data
points as a single cluster and, in every iteration, we split off the data points that are least similar
to the rest of their cluster. In the end, we are left with N clusters (one per data point).
Divisive Hierarchical clustering
Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm
used in machine learning to partition data into clusters based on the density of points: regions of
high density separated by regions of low density. It is effective at identifying and removing noise
in a data set, making it useful for data cleaning and outlier detection.
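As a rough sketch of how this looks in practice (synthetic data; the eps and min_samples values
are arbitrary), scikit-learn's DBSCAN marks points it considers noise with the label -1:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points acting as noise (synthetic data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               rng.normal(4, 0.3, (40, 2)),
               rng.uniform(-2, 6, (5, 2))])

# eps: neighborhood radius; min_samples: points required to form a dense region
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)   # one cluster index per point; -1 marks points treated as noise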
Evaluation of Clustering in Data Mining
Introduction to Data Mining
The process of extracting patterns, connections and information from sizable datasets is known
as data mining. It is important in many fields, including business, medicine, and scientific
research. Clustering, a subset of data mining, focuses on grouping related data points.
What is the Evaluation of Clustering?
Evaluation of Clustering is a process that determines the quality and value of clustering
outcomes in data mining and machine learning.
In data mining, to assess how well the data points have been clustered, we need to choose an
appropriate clustering algorithm, set its parameters, and select the metrics or techniques to be
used.
The main objective of clustering evaluation is to analyze the data with specific objectives to
improve performance and provide a better understanding of clustering solutions.
Importance of Clustering in Data Mining
The following are some major reasons why Clustering is so important in data mining:
1. Pattern Discovery
In data mining, clustering helps us discover patterns and connections in data. This makes the
data simpler to understand: by combining similar data points, we can reveal structure in
otherwise unstructured data.
2. Data Summarization
With the help of clustering, we can also summarize large data sets into a smaller number of
clusters that are much easier to manage. The data analysis process can be made simpler by
working with clusters rather than individual data points.
3. Anomaly Detection
Clustering helps us identify anomalies and outliers in the data. Data points that are not part of
any cluster, or that form small, unusual clusters, could indicate errors or unusual events that
need to be addressed.
4. Customer Segmentation
Clustering is a technique used in business and marketing to divide customers into different
groups according to their behaviour, preferences, or demographics. This segmentation enables
the customization of marketing plans and product offerings for particular customer groups.
5. Image and Document Categorization
Clustering is useful for categorizing images and documents. It assists in classifying and
organizing texts, images, or documents based on similarities, making it simpler to manage and
retrieve information.
6. Recommendation Systems
In data mining, clustering can be used in e-commerce and content recommendation systems to
place similar users and products in the same group. This allows recommendation systems to
suggest content a user is likely to find interesting, based on the preferences of their cluster.
7. Scientific Research
Clustering categorizes scientific data, such as classifying stars in astronomy or identifying genes
in bioinformatics. It helps interpret challenging scientific datasets.
8. Data preprocessing
Clustering can be used to reduce the dimensionality and noise in data as a preprocessing step in
data mining. The data is streamlined and made ready for additional analysis.
9. Risk Assessment
Using clustering, we can assess risk and spot fraud in the finance sector. It also helps group
unusual patterns in financial transactions for further investigation.
In conclusion, clustering is a flexible and essential data mining technique for organizing,
comprehending and making sense of complex datasets. It helps us extract important information
from data, and it finds broad application in fields ranging from business and marketing to
scientific research and beyond.
Types of Clustering Algorithms
There are several clustering algorithms, and each has a distinctive methodology. The most
typical ones are:
1. Hierarchical Clustering
Hierarchical Clustering is a well-liked and effective method in data analysis and mining for
classifying data points into hierarchical cluster structures. Clusters are created iteratively based
on the similarity between data points using a bottom-up or top-down approach. A dendrogram,
which graphically depicts the relationships between data points and clusters, is produced by
hierarchical Clustering.
2. K means Clustering
A common data mining and machine learning technique called K-Means clustering involves
dividing data points into a predetermined number of clusters, denoted by the letter "K."
Important K-Means Clustering Features:
o Centroid Based: In K-means clustering, each cluster is represented by its centroid, the
mean (average) of the data points assigned to that cluster.
o K-Determination: In K-means clustering, the number of clusters K must be chosen in
advance, which can be difficult; techniques such as the silhouette score and the elbow
method are commonly used to find a suitable value of K (see the sketch after this list).
o Iterative Algorithm: K-Means employs an iterative process to minimize within-cluster
variance. Cluster centroids are first randomly initialized, data points are assigned to the
closest centroid, and each centroid is then recalculated as the mean of its cluster; these
steps repeat until convergence is achieved.
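As one possible sketch of K-determination (synthetic data; the candidate range 2-6 is arbitrary),
both the inertia used by the elbow method and the silhouette score can be computed with
scikit-learn:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 4, 8)])  # three synthetic blobs

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia keeps dropping as k grows (look for the "elbow");
    # the silhouette score typically peaks near a good k
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))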
Unsupervised cluster evaluation assesses the quality and validity of clusters formed by
algorithms like K-means without relying on pre-labeled data. This evaluation is crucial because
unsupervised learning, unlike supervised learning, doesn't have ground truth labels to compare
against. Instead, internal and external indices, along with stability checks and visual inspection,
are used to determine if the clusters are meaningful and consistent.
Here's a breakdown of key aspects:
1. Internal Indices:
Cluster Cohesion:
Measures how tightly packed data points are within a cluster. For example, a low average
distance between points in a cluster indicates good cohesion.
Cluster Separation:
Measures how well-separated clusters are from each other. A large distance between cluster
centroids indicates good separation.
Silhouette Coefficient:
Compares, for each data point, the mean distance to the other points in its own cluster with the
mean distance to points in the nearest neighboring cluster. Values range from -1 to 1, and a
higher value indicates better clustering. (A combined code sketch computing these internal
indices follows this list.)
Davies-Bouldin Index:
Measures the ratio of the average distance within a cluster to the average distance between
clusters. A lower value indicates better clustering.
Calinski-Harabasz Index:
Measures the ratio of between-cluster variance to within-cluster variance. A higher value
indicates better clustering.
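A minimal sketch of computing these internal indices with scikit-learn (synthetic data and an
arbitrary K-means run, purely for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.4, (50, 2)), rng.normal(3, 0.4, (50, 2))])  # synthetic data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # higher is better (range -1 to 1)
print(davies_bouldin_score(X, labels))      # lower is better
print(calinski_harabasz_score(X, labels))   # higher is better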
2. External Indices:
These metrics rely on comparing the clustering results to a pre-existing ground truth or
labeled dataset.
Adjusted Rand Index: Measures the similarity between the clustering results and the
known labels, accounting for chance.
Mutual Information: Quantifies the dependency between the clustering results and the
known labels (see the sketch after this list).
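A short sketch of these external indices, assuming hypothetical ground-truth labels (both label
vectors below are made up):

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels    = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical ground truth
cluster_labels = [1, 1, 0, 0, 0, 0, 2, 2, 2]   # labels produced by some clustering run

print(adjusted_rand_score(true_labels, cluster_labels))           # 1.0 = perfect agreement, ~0 = chance level
print(normalized_mutual_info_score(true_labels, cluster_labels))  # normalized variant of mutual information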
3. Stability Analysis:
Evaluating the consistency of clustering results across different runs or with slight
variations in the data.
This helps determine if the clusters are robust and not just a result of randomness.
4. Visual Inspection:
Plotting the data points with their cluster assignments to visually inspect the cluster
shapes, separation, and potential outliers (a minimal plotting sketch follows this list).
This can be particularly useful for low-dimensional data where the clusters can be
visualized directly.
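For low-dimensional data, a small plotting sketch like the following is often enough for such an
inspection (synthetic data; matplotlib assumed available):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])  # synthetic 2-D data
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)   # color each point by its cluster assignment
plt.title("Cluster assignments")
plt.show()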
5. Choosing the Right Metric:
The best evaluation metrics depend on the specific clustering algorithm, data
characteristics, and the goals of the analysis.
For example, if you're using K-means, the Silhouette coefficient or Davies-Bouldin index
might be suitable.
If you have ground truth labels, external indices like the Adjusted Rand Index or Mutual
Information can be used.
In essence, unsupervised cluster evaluation involves a combination of these methods to provide
a comprehensive assessment of the quality and validity of the clusters formed by the algorithm.
Cohesion and separation are key concepts in software design, particularly when discussing
clustering and object-oriented programming. Cohesion refers to how closely related the
elements within a module (like a class or function) are to each other, and a high degree of
cohesion means the elements work together towards a single, focused purpose. Separation of
concerns, on the other hand, emphasizes dividing a complex system into distinct, independent
modules, each with its own specific responsibility. By combining high cohesion with separation
of concerns, you can create more manageable, reusable, and maintainable software.
Cohesion:
Definition:
Cohesion measures how well the elements within a module are related and focused on a single
purpose.
Example:
A class that manages user data should have high cohesion. All its methods should relate to user
management, not, for example, handling financial transactions.
Benefits:
High cohesion makes code easier to understand, maintain, and reuse. It also reduces the
likelihood of unintended side effects when making changes.
Separation of Concerns:
Definition:
Separation of concerns involves dividing a system into modules, each with a clear responsibility,
minimizing dependencies between them.
Example:
In a website, different modules might handle user authentication, data storage, and rendering
the user interface, each with its own responsibilities.
Benefits:
This separation makes the system more modular, testable, and easier to adapt to changes. Each
module can be developed and maintained independently, reducing the impact of changes in
one area on others.
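As a hypothetical sketch of these two ideas together (the class names and methods below are
invented for illustration), each class does one focused job, and the authentication logic depends
only on the repository's small interface rather than on how the data is stored:

class UserRepository:
    """High cohesion: only stores and retrieves user records."""
    def __init__(self):
        self._users = {}

    def add(self, user_id, name):
        self._users[user_id] = name

    def get(self, user_id):
        return self._users.get(user_id)


class UserAuthenticator:
    """Separate concern: only decides whether a user is known; storage details stay in the repository."""
    def __init__(self, repository):
        self._repository = repository

    def is_known_user(self, user_id):
        return self._repository.get(user_id) is not None


repo = UserRepository()
repo.add(1, "Alice")
auth = UserAuthenticator(repo)
print(auth.is_known_user(1))   # True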
Relationship between Cohesion and Separation:
Interdependence:
High cohesion and separation of concerns work together to create a well-structured
system. High cohesion within modules is easier to achieve when you have clearly separated
concerns.
Benefits of Combining:
By combining high cohesion within modules with separation of concerns, you create a system
that is:
Easier to Understand: Each module has a clear purpose, making it easier to
understand its functionality.
More Maintainable: Changes are localized to the specific module affected,
reducing the risk of unintended consequences.
More Reusable: Well-defined modules can be reused in different parts of the