Unit V Unsupervised Learning
Unsupervised learning is a type of machine learning that analyzes and models data without
labelled responses or predefined categories. Unlike supervised learning, where the algorithm
learns from input-output pairs, unsupervised learning algorithms work solely with input data
and aim to discover hidden patterns, structures or relationships within the dataset
independently, without any human intervention or prior knowledge of the data's meaning.
The image shows a set of animals such as elephants, camels and cows, representing the raw data that the unsupervised learning algorithm will process.
• The "Interpretation" stage signifies that the algorithm doesn't have predefined labels
or categories for the data. It needs to figure out how to group or organize the data
based on inherent patterns.
The output shows the results of the unsupervised learning process. In this case, the algorithm
might have grouped the animals into clusters based on their species (elephants, camels,
cows).
2. Select an Algorithm
• The algorithm looks for similarities, relationships or hidden structures within the data.
• The algorithm organizes data into groups (clusters), rules or lower-dimensional forms
without human input.
• Example: It may group similar animals together or extract key patterns from large
datasets.
• Analyze the discovered groups, rules or features to gain insights or use them for further
tasks like visualization, anomaly detection or as input for other models.
1. Clustering Algorithms
Clustering is an unsupervised machine learning technique that groups unlabeled data into
clusters based on similarity. Its goal is to discover patterns or relationships within the data
without any prior knowledge of categories or labels.
• Works purely from the input data without any output labels.
• K-means Clustering: Groups data into K clusters based on how close the points are to
each other.
• Density-Based Clustering (DBSCAN): Finds clusters in dense areas and treats scattered
points as noise.
• Mean-Shift Clustering: Discovers clusters by moving points toward the most crowded
areas.
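As an illustrative sketch (not part of the original notes), the snippet below runs these three algorithms on a small synthetic blobs dataset with scikit-learn; the dataset size, eps and other parameter values are assumptions chosen only for demonstration.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, MeanShift

# Assumed synthetic dataset for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # -1 marks noise points
meanshift_labels = MeanShift().fit_predict(X)

print("K-means clusters:", np.unique(kmeans_labels))
print("DBSCAN clusters (-1 = noise):", np.unique(dbscan_labels))
print("Mean-Shift clusters:", np.unique(meanshift_labels))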
3. Dimensionality Reduction
Dimensionality reduction techniques simplify complex data, making it easier to analyze and visualize. They also improve the efficiency and performance of machine learning algorithms by reducing noise and computational cost.
• It reduces the dataset’s feature space from many dimensions to fewer, more
meaningful ones.
Unsupervised learning has diverse applications across industries and domains. Key
applications include:
• Image and Text Clustering: Groups similar images or documents for tasks like
organization, classification or content recommendation.
Advantages
• No need for labeled data: Works with raw, unlabeled data, saving time and effort on data annotation.
• Discovers hidden patterns: Finds natural groupings and structures that might be
missed by humans.
• Handles complex and large datasets: Effective for high-dimensional or vast amounts
of data.
• Useful for anomaly detection: Can identify outliers and unusual data points without
prior examples.
Challenges
• Noisy Data: Outliers and noise can distort patterns and reduce the effectiveness of
algorithms.
• Overfitting Risk: Overfitting can occur when models capture noise instead of
meaningful patterns in the data.
• Limited Guidance: The absence of labels restricts the ability to guide the algorithm
toward specific outcomes.
• Cluster Interpretability: Results such as clusters may lack clear meaning or alignment
with real-world categories.
• Lack of Ground Truth: Unsupervised learning lacks labeled data, making it difficult to evaluate the accuracy of results.
Principal Component Analysis (PCA)
PCA is commonly used as a data preprocessing step for machine learning algorithms. It helps to remove redundancy, improve computational efficiency and make data easier to visualize and analyze, especially when dealing with high-dimensional data.
PCA uses linear algebra to transform data into new features called principal components. It finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the covariance matrix. PCA then selects the top components with the highest eigenvalues and projects the data onto them to simplify the dataset.
Note: It prioritizes the directions where the data varies the most because more variation =
more useful information.
Imagine you’re looking at a messy cloud of data points like stars in the sky and want to simplify
it. PCA helps you find the "most important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:
Different features may have different units and scales like salary vs. age. To compare them
fairly PCA first standardizes the data by making each feature have:
• A mean of 0
• A standard deviation of 1
Z = (X − μ) / σ
where X is the original feature value, μ is the mean of the feature and σ is its standard deviation.
Next, PCA calculates the covariance matrix to see how features relate to each other, i.e., whether they increase or decrease together. The covariance between two features x1 and x2 is:
cov(x1, x2) = Σᵢ₌₁ⁿ (x1i − x̄1)(x2i − x̄2) / (n − 1)
where x̄1 and x̄2 are the means of x1 and x2, and n is the number of samples.
PCA identifies new axes where the data spreads out the most:
• 1st Principal Component (PC1): The direction of maximum variance (most spread).
• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1 and so
on.
These directions come from the eigenvectors of the covariance matrix, and their importance is measured by eigenvalues. For a square matrix A, an eigenvector X (a non-zero vector) and its corresponding eigenvalue λ satisfy:
AX = λX
This means that multiplying A by X only scales X by λ; the direction of X does not change.
After calculating the eigenvalues and eigenvectors, PCA ranks them by the amount of information they capture. We then:
1. Select the top k components that capture most of the variance (for example, 95%).
This means we reduce the number of features (dimensions) while keeping the important
patterns in the data.
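To make these steps concrete, here is a minimal NumPy sketch of the PCA computation (standardize, covariance matrix, eigen-decomposition, projection). The small data matrix is an assumed example, not from the notes.
import numpy as np

# Assumed toy data: 5 samples, 3 features
X = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 2.0],
              [4.0, 8.0, 2.5],
              [5.0, 10.0, 3.0],
              [6.0, 12.0, 3.5]])

# Step 1: standardize each feature (mean 0, std 1)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues (importance) and eigenvectors (directions)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: keep the top k components and project the data onto them
k = 2
X_pca = X_std @ eigvecs[:, :k]
print("Explained variance ratio:", eigvals[:k] / eigvals.sum())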
In the above image the original dataset has two features, "Radius" and "Area", represented by the black axes. PCA identifies two new directions, PC₁ and PC₂, which are the principal components.
• These new axes are rotated versions of the original ones. PC₁ captures the maximum variance in the data, meaning it holds the most information, while PC₂ captures the remaining variance and is perpendicular to PC₁.
• The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is chosen
for dimensionality reduction. By projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the important structure
and patterns.
Hence, PCA uses a linear transformation that preserves the most variance in the data using the least number of dimensions. It involves the following steps:
We import the necessary libraries such as pandas, numpy, scikit-learn, seaborn and matplotlib to process the data and visualize results.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
We create a small dataset with three numeric features (Height, Weight and Age) and a Gender label.
data = {
    'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
    'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
    'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
    # Gender labels are not shown in the source; illustrative values assumed here
    'Gender': ['M', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F']
}
df = pd.DataFrame(data)
print(df)
Output:
Since the features have different scales (e.g., Height vs. Age), we standardize the data. This makes all features have mean = 0 and standard deviation = 1 so that no feature dominates just because of its units.
X = df.drop('Gender', axis=1)
y = df['Gender']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # standardize only the numeric features
• We reduce the data from 3 features to 2 new features called principal components.
These components capture most of the original information but in fewer dimensions.
• We split the data into 70% training and 30% testing sets.
• We train a logistic regression model on the reduced training data and predict gender
labels on the test set.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# 70/30 train-test split on the reduced features (random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
The confusion matrix compares actual vs predicted labels. This makes it easy to see where
predictions were correct or wrong.
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, cmap='Blues')  # heatmap styling assumed for illustration
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Output:
y_numeric = pd.factorize(y)[0]  # encode the class labels as integers for colouring

plt.figure(figsize=(12, 5))

# Left: original standardized data, first two features
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_numeric)
plt.title('Before PCA')
plt.colorbar(label='Target classes')

# Right: data projected onto the top 2 principal components
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric)
plt.title('After PCA')
plt.colorbar(label='Target classes')

plt.tight_layout()
plt.show()
Output:
• Left Plot (Before PCA): This shows the original standardized data plotted using the first two features. There is no guarantee of clear separation between classes, as these are raw input dimensions.
• Right Plot (After PCA): This displays the transformed data using the top 2 principal components. These new components capture the maximum variance, often showing better class separation and structure, making it easier to analyze or model.
2. Noise Reduction: Eliminates components with low variance, enhancing data clarity.
3. Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones deviate
significantly in the reduced space.
2. Data Scaling Sensitivity: Requires proper scaling of the data before application; otherwise the results may be misleading.
3. Information Loss: Reducing dimensions may lose some important information if too
few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are linear
and may struggle with non-linear data.
6. Risk of Overfitting: Using too many components or working with a small dataset might
lead to models that don't generalize well.
Competitive networks are a family of neural network models in which neurons compete
among themselves to become active. Unlike feedforward networks where many neurons can
be active simultaneously, competitive nets implement a mechanism where typically only one
neuron, or a small subset of neurons, is declared the "winner." This principle is often called
the winner-takes-all (WTA) strategy.
In fixed weight competitive networks, the weights of the neurons are predefined and do not
change during learning. Each neuron corresponds to a prototype or class, and the neuron
whose weight vector most closely resembles the input vector becomes the winner.
The fixed weights act as reference vectors or templates against which input data is compared.
This allows the network to classify new inputs efficiently without performing iterative
learning.
1. Input Layer:
o Contains a set of neurons, each associated with a fixed weight vector w_j.
o Each neuron computes a similarity measure between the input and its weight vector.
3. Winner-Takes-All Mechanism:
o The neuron with the highest similarity score produces an output of 1 (winner).
Working Principle
The working of a fixed weight competitive net can be explained step by step:
1. Input Presentation:
An input vector is presented to all neurons simultaneously.
2. Activation Calculation:
Each neuron computes its activation using a similarity measure, often the dot product:
y_j = w_j · x = Σᵢ w_ji xᵢ
3. Competition Phase:
The network compares activations across all neurons.
4. Selection of Winner:
The neuron with the highest activation is selected as the winner.
5. Output Generation:
The winner neuron outputs 1, and all other neurons output 0.
This mechanism ensures that only the neuron whose weight vector is closest to the input
vector represents the input.
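The following minimal Python sketch (with made-up prototype weight vectors) illustrates the winner-takes-all step described above: each neuron's activation is the dot product of its fixed weight vector with the input, and only the most active neuron outputs 1.
import numpy as np

# Fixed weight vectors, one prototype per neuron (values assumed for illustration)
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0]])

def winner_takes_all(x, W):
    # Activation of each neuron: dot product of its weight vector with the input
    activations = W @ x
    # Only the neuron with the highest activation outputs 1; all others output 0
    output = np.zeros(len(W))
    output[np.argmax(activations)] = 1
    return output, activations

x = np.array([0.9, 0.1, 0.0])          # example input
out, act = winner_takes_all(x, W)
print("Activations:", act)             # [0.9, 0.1, 0.5]
print("WTA output:", out)              # neuron 0 wins -> [1, 0, 0]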
5. Mathematical Representation
• Suppose there are m neurons, each with an n-dimensional weight vector.
Example
This shows that the input is most similar to the prototype represented by w_3.
Characteristics
Applications
Fixed weight competitive networks are used in areas where prototypes are known in advance,
and rapid classification is required. Some applications include:
Advantages
Limitations
Fixed weight competitive networks are simple yet powerful models for classification and
pattern recognition tasks where reference prototypes are predetermined. Their strength lies
in their efficiency and ability to perform fast recognition without requiring complex learning
algorithms. However, their limitation is their inability to adapt, which makes them suitable
only for static problems.
A Self Organizing Map (SOM) or Kohonen Map is an unsupervised neural network algorithm
based on biological neural models from the 1970s. It uses a competitive learning approach
and is primarily designed for clustering and dimensionality reduction. SOM effectively maps
high-dimensional data to a lower-dimensional grid enabling easier interpretation and
visualization of complex datasets.
1. Initialization
The weights of the output neurons are randomly initialized. These weights represent the
features of each neuron and will be adjusted during training.
2. Competition
For each input vector, SOM computes the Euclidean distance between the input and the
weight vectors of all neurons. The neuron with the smallest distance is the winning neuron.
Formula: D(j) = Σᵢ₌₁ⁿ (wij − xi)²
where D(j) is the distance of the input from neuron j, wij is the weight connecting input component i to neuron j and xi is the i-th component of the input vector.
3. Weight Update
The winning neuron’s weights are updated to move closer to the input vector. The weights of
neighboring neurons are also adjusted but with smaller changes.
Formula: wij(new) = wij(old) + α · (xi − wij(old))
where α is the learning rate, xi is the i-th input component and wij is the weight being updated.
4. Learning Rate Decay
The learning rate α decreases over time, allowing the map to converge to stable values.
Formula: α(t+1) = 0.5 · α(t)
5. Stopping Condition
The training stops when the maximum number of epochs is reached or when the weights
converge.
Now let’s walk through a Python implementation of the SOM algorithm. The code is divided
into blocks for clarity.
We will use the math library to compute the Euclidean distance between the input vector and
the weight vector.
import math
In this class, we define two important functions, winner() to compute the winning neuron by
calculating the Euclidean distance between the input and weight vectors of each cluster and
update() to update the weight vectors of the winning neuron according to the weight update
rule.
class SOM:

    # Compute the winning cluster for a sample by comparing squared
    # Euclidean distances to the weight vectors of the two clusters
    def winner(self, weights, sample):
        D0 = 0
        D1 = 0
        for i in range(len(sample)):
            D0 += math.pow((sample[i] - weights[0][i]), 2)
            D1 += math.pow((sample[i] - weights[1][i]), 2)
        # Return the index of the closer (winning) cluster
        return 0 if D0 < D1 else 1

    # Move the winning cluster's weights towards the sample using
    # the rule w(new) = w(old) + alpha * (x - w(old))
    def update(self, weights, sample, J, alpha):
        for i in range(len(weights[0])):
            weights[J][i] = weights[J][i] + alpha * (sample[i] - weights[J][i])
        return weights
In this section, we define the training data and initialize the weights. We also specify the
number of epochs and the learning rate.
• T: This is the training data with four examples, each having four features.
• weights: These are the initial weights for two clusters, each with four features.
def main():
    # Training data: 4 examples with 4 features each (values assumed for illustration)
    T = [[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]]
    # Initial weights for two clusters, each with 4 features (values assumed)
    weights = [[0.2, 0.6, 0.5, 0.9], [0.8, 0.4, 0.7, 0.3]]

    m, n = len(T), len(T[0])
    ob = SOM()
    epochs = 3
    alpha = 0.5
Here, we loop through each training example for the specified number of epochs, compute
the winning cluster and update the weights. For each epoch and training sample we:
    for i in range(epochs):
        for j in range(m):
            sample = T[j]
            # Find the winning cluster and update its weights
            J = ob.winner(weights, sample)
            weights = ob.update(weights, sample, J, alpha)
After training the SOM network, we use a test sample s and classify it into one of the clusters
by computing which cluster has the closest weight vector to the input sample. Finally, we print
the cluster assignment and the trained weights for each cluster.
    s = [0, 0, 0, 1]
    J = ob.winner(weights, s)
    print("Test sample s belongs to cluster:", J)
    print("Trained weights:", weights)
The following block runs the main() function when the script is executed.
if __name__ == "__main__":
    main()
Output:
The output will display which cluster the test sample belongs to and the final trained weights
of the clusters.
• The trained weights for both clusters are displayed after 3 epochs.
Clustering is an unsupervised machine learning technique that groups similar data points
together into clusters based on their characteristics, without using any labeled data. The
objective is to ensure that data points within the same cluster are more similar to each other
than to those in different clusters, enabling the discovery of natural groupings and hidden
patterns in complex datasets.
• How: Data points are assigned to clusters based on similarity or distance measures.
• Output: Each group is assigned a cluster ID, representing shared characteristics within
the cluster.
For example, if we have customer purchase data, clustering can group customers with similar
shopping habits. These clusters can then be used for targeted marketing, personalized
recommendations or customer segmentation.
Types of Clustering
1. Hard Clustering: In hard clustering, each data point strictly belongs to exactly one cluster,
no overlap is allowed. This approach assigns a clear membership, making it easier to interpret
and use for definitive segmentation tasks.
• Example: If clustering customer data into 2 segments, each customer belongs fully to
either Cluster 1 or Cluster 2 without partial memberships.
The example below illustrates the difference between hard and soft clustering using a distribution.
2. Soft Clustering: Soft clustering assigns each data point a probability or degree of
membership to multiple clusters simultaneously, allowing data points to partially belong to
several groups.
• Example: A data point may have a 70% membership in Cluster 1 and 30% in Cluster 2,
reflecting uncertainty or overlap in group characteristics.
• Use cases: Situations with overlapping class boundaries, fuzzy categories like customer
personas or medical diagnosis.
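As an illustrative sketch (not from the original notes), hard memberships can be obtained from K-means labels while soft memberships can be obtained from a Gaussian mixture model's predict_proba; the dataset and all parameters below are assumed for demonstration.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

# Hard clustering: each point gets exactly one cluster ID
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Hard assignment of first point:", hard_labels[0])

# Soft clustering: each point gets a degree of membership in every cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)
print("Soft memberships of first point:", soft_memberships[0])  # probabilities over the 2 clusters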
Clustering methods can be classified on the basis of how they form clusters:
1. Centroid-based Clustering
Centroid-based clustering organizes data points around central prototypes called centroids, where each cluster is represented by the mean (or medoid) of its members. The number of clusters is specified in advance and the algorithm allocates points to the nearest centroid, making this technique efficient for spherical and similarly sized clusters but sensitive to outliers and initialization.
Algorithms:
• K-medoids: Similar to K-means but uses actual data points (medoids) as centers, robust
to outliers.
Pros:
Cons:
2. Density-based Clustering
Density-based clustering defines clusters as contiguous regions of high data density separated by areas of lower density. This approach can identify clusters of arbitrary shapes, handles noise well and does not require predefining the number of clusters, though its effectiveness depends on the chosen density parameters.
Algorithms:
Pros:
Cons:
3. Hierarchical Clustering
Approaches:
• Divisive (Top-down): Start with one cluster; iteratively split into smaller clusters.
Pros:
Cons:
4. Distribution-based Clustering
Algorithm:
Pros:
Cons:
• Sensitive to initialization.
5. Fuzzy Clustering
Fuzzy clustering extends traditional methods by allowing each data point to belong to multiple
clusters with varying degrees of membership. This approach captures ambiguity and soft
boundaries in data and is particularly useful when the clusters overlap or boundaries are not
clear-cut.
Algorithm:
• Fuzzy C-Means: Similar to K-means but with fuzzy memberships updated iteratively.
Pros:
Cons:
Use Cases
• Image Segmentation: Dividing images into meaningful parts for object detection,
medical diagnostics or computer vision tasks.
Hierarchical Clustering
Hierarchical clustering is an unsupervised learning technique used to group similar data points into clusters by building a hierarchy (tree-like structure). Unlike flat clustering methods such as k-means, hierarchical clustering does not require specifying the number of clusters in advance.
The algorithm builds clusters step by step either by progressively merging smaller clusters or
by splitting a large cluster into smaller ones. The process is often visualized using
a dendrogram, which helps to understand data similarity.
Imagine we have four fruits with different weights: an apple (100g), a banana (120g), a cherry
(50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its own group.
• Merge the closest items: grape (30g) and cherry (50g) are grouped first.
Finally all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.
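As a small illustrative sketch (assuming SciPy's single linkage, i.e., minimum-distance merging), the fruit weights above can be fed to scipy's linkage function: grape–cherry and apple–banana (each pair 20 g apart) are merged before the two groups are finally joined.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Fruit weights from the example above (grams)
weights = np.array([[100.0],   # apple
                    [120.0],   # banana
                    [50.0],    # cherry
                    [30.0]])   # grape

# Single linkage: repeatedly merge the two clusters with the smallest minimum distance
Z = linkage(weights, method='single')
print(Z)   # each row: the two clusters merged, their distance, new cluster size

dendrogram(Z, labels=['apple', 'banana', 'cherry', 'grape'])
plt.ylabel('Distance (g)')
plt.show()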
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group and as we move
up, similar groups are combined. The lower the merge point, the more similar the groups are.
It helps us see how things are grouped step by step.
• At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
• As we move up, the closest points are merged into a single group.
• The lines connecting the points show how they are progressively merged based on
similarity.
• The height at which they are connected shows how similar the points are to each other; the shorter the line, the more similar they are.
Now we understand the basics of hierarchical clustering. There are two main types of
hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
1. Start with individual points: Each data point is its own cluster. For example if we have
5 data points we start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two
data points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.
4. Update distance matrix: After merging we now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until we have only one cluster left.
Implementation
• Repeat merging until the desired number of clusters or one cluster remains.
• The dendrogram visualizes these merges as a tree, showing cluster relationships and
distances.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Sample 2-D data (assumed; the original dataset is not shown in the source)
X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Agglomerative clustering into 3 clusters for the scatter plot
clustering = AgglomerativeClustering(n_clusters=3)
labels = clustering.fit_predict(X)

# Fit the full merge tree (no fixed cluster count) to build the dendrogram
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
model.fit(X)

def plot_dendrogram(model, **kwargs):
    # Count the samples under each merge to build the linkage matrix
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1              # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left: clustered data points
ax1.scatter(X[:, 0], X[:, 1], c=labels)
ax1.set_title("Agglomerative Clustering")
ax1.set_xlabel("Feature 1")
ax1.set_ylabel("Feature 2")

# Right: dendrogram of the merges
plt.sca(ax2)
plot_dendrogram(model)
plt.xlabel("Sample index")
plt.ylabel("Distance")

plt.tight_layout()
plt.show()
Output:
1. Start with all data points in one cluster: Treat the entire dataset as a single large
cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to
separate the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process: Choose
the cluster with the most dissimilar points and split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until every data
point is its own cluster or the stopping condition (such as a predefined number of
clusters) is met.
Implementation
• Finds the largest cluster and splits it into two using KMeans.
• Repeats splitting the largest cluster until reaching the desired number of clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Sample 2-D data (assumed; the original dataset is not shown in the source)
X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

def divisive_clustering(X, n_clusters):
    # Start with all points in a single cluster
    clusters = [X]
    while len(clusters) < n_clusters:
        # Pick the largest cluster and split it into two with KMeans
        largest_idx = np.argmax([len(c) for c in clusters])
        cluster_to_split = clusters.pop(largest_idx)
        kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
        kmeans.fit(cluster_to_split)
        cluster1 = cluster_to_split[kmeans.labels_ == 0]
        cluster2 = cluster_to_split[kmeans.labels_ == 1]
        clusters.extend([cluster1, cluster2])
    return clusters

clusters = divisive_clustering(X, n_clusters=3)
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

plt.figure(figsize=(12, 5))

# Left: clusters produced by the divisive procedure
plt.subplot(1, 2, 1)
for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1], label=f"Cluster {i + 1}")
plt.title("Divisive Clustering")
plt.legend()

# Right: dendrogram built from a SciPy linkage of the data (linkage method assumed)
plt.subplot(1, 2, 2)
linked = linkage(X, method='ward')
dendrogram(linked, orientation='top',
           distance_sort='descending', show_leaf_counts=True)
plt.title("Dendrogram")

plt.tight_layout()
plt.show()
Output:
While merging two clusters we check the distance between every pair of clusters and merge the pair with the least distance (most similarity). But how is that distance determined? There are different ways of defining inter-cluster distance/similarity, some of which are listed below (see the short sketch after this list):
• Min Distance: Find the minimum distance between any two points of the cluster.
• Max Distance: Find the maximum distance between any two points of the cluster.
• Group Average: Find the average distance between every two points of the clusters.
• Ward's Method: The similarity of two clusters is based on the increase in squared error
when two clusters are merged.
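These criteria correspond to the method argument of SciPy's linkage function; the brief sketch below (with assumed random data) shows how each one is selected.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.random((10, 2))                 # 10 random 2-D points (assumed data)

Z_min = linkage(X, method='single')     # Min distance between clusters
Z_max = linkage(X, method='complete')   # Max distance between clusters
Z_avg = linkage(X, method='average')    # Group average distance
Z_ward = linkage(X, method='ward')      # Ward's increase-in-squared-error criterion

# Each linkage matrix row records: cluster i, cluster j, merge distance, new size
print(Z_ward[:3])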
K means Clustering
K-Means Clustering is an unsupervised machine learning algorithm that helps group data
points into clusters based on their inherent similarity. Unlike supervised learning, where we
train models using labeled data, K-Means is used when we have data that is not labeled and
the goal is to uncover hidden patterns or structures. For example, an online store can use K-
Means to segment customers into groups like "Budget Shoppers," "Frequent Buyers," and
"Big Spenders" based on their purchase history.
Suppose we are given a data set of items with certain features and values for these features (like a vector). The task is to categorize those items into groups. To achieve this we will use the K-means algorithm; "k" represents the number of groups or clusters we want to classify our items into.
The algorithm will categorize the items into "k" groups or clusters of similarity. To calculate that similarity we will use the Euclidean distance as a measurement. The algorithm works as follows:
1. Initialization: We begin by randomly selecting k cluster centroids.
2. Assignment Step: Each data point is assigned to the nearest centroid, forming clusters.
3. Update Step: After the assignment, we recalculate the centroid of each cluster by
averaging the points within it.
4. Repeat: This process repeats until the centroids no longer change or the maximum
number of iterations is reached.
The goal is to partition the dataset into k clusters such that data points within each cluster are more similar to each other than to those in other clusters.
We will be using blobs datasets and show how clusters are made using Python programming
language.
Step 1: Importing the necessary libraries
We will be importing the following libraries.
• Numpy: for numerical operations (e.g., distance calculation).
• Matplotlib: for plotting data and results.
• Scikit-learn: to create a synthetic dataset using make_blobs.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Create a synthetic blobs dataset (parameters assumed for illustration) and plot it
X, y_true = make_blobs(n_samples=500, centers=3, random_state=23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Output:
k = 3  # number of clusters (assumed to match the 3 blobs)
clusters = {}
np.random.seed(23)

# Randomly initialise k cluster centres (initialisation range assumed for illustration)
for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster

clusters
Output:
We will now plot the data points and the initial centroids.
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')  # mark each initial centre
plt.show()
Output:
def distance(p1, p2):
    # Euclidean distance between two points
    return np.sqrt(np.sum((p1 - p2) ** 2))
Next, we define functions to assign points to the nearest centroid and update the centroids
based on the average of the points assigned to each cluster.
• curr_cluster = np.argmin(dist): Finds the index of the closest cluster by selecting the minimum distance.
def assign_clusters(X, clusters):
    # Assign every data point to the nearest cluster centre
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters
def update_clusters(X, clusters):
    # Move each cluster centre to the mean of its assigned points
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters
We create a function to predict the cluster for each data point based on the final centroids.
• pred.append(np.argmin(dist)): Appends the index of the closest cluster (the one with the minimum distance) to pred.
def pred_cluster(X, clusters):
    # Predict the cluster of every data point from the final centres
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
We assign points to clusters, update the centroids and predict the final cluster labels.
• pred_cluster(X, clusters): Predicts the final clusters for all data points.
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
Finally, we plot the data points, colored by their predicted clusters, along with the updated
centroids.
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')  # final cluster centres
plt.show()
Output:
• Choosing the Right Number of Clusters (k): One of the biggest challenges is deciding how many clusters to use.
• Sensitive to Initial Centroids: The final clusters can vary depending on the initial
random placement of centroids.
• Non-Spherical Clusters: K-Means assumes that the clusters are spherical and equally
sized. This can be a problem when the actual clusters in the data are of different shapes
or densities.
• Outliers: K-Means is sensitive to outliers, which can distort the centroid and,
ultimately, the clusters.