Name: MOHD HANIF BIN MOHD JAME Matric: BK22110288
Instruction
Submit the assignment (in PDF) to ITEL latest by 31 May 2024 (Friday).
Objective
To implement the K-Means clustering algorithm from scratch, apply it to a dataset, and analyse the results.
Assignment 4 (10 marks)
Develop a K-means clustering algorithm for the given iris flower dataset. The dataset is downloaded from the UCI
Machine Learning Repository. It is a commonly used dataset for clustering tasks and contains 150 samples
of iris flowers. Each sample is described by four features, namely sepal length, sepal width, petal length and petal
width.
a. How to determine the suitable number of clusters (𝐾)?
(3 marks)
- We can determine a suitable number of clusters (K) using the Elbow Method. From the graph, the within-cluster
sum of squares (WCSS) decreases sharply as K increases from 1 to 3, and the rate of decrease slows down
noticeably after K = 3. This change in slope marks the "elbow point", so the suitable number of clusters is K = 3.
Source Code:
% Load and standardize the data
data = [
5.1,3.5,1.4,0.2; 4.9,3.0,1.4,0.2; 4.7,3.2,1.3,0.2; 4.6,3.1,1.5,0.2;
5.0,3.6,1.4,0.2; 5.4,3.9,1.7,0.4; 4.6,3.4,1.4,0.3; 5.0,3.4,1.5,0.2;
4.4,2.9,1.4,0.2; 4.9,3.1,1.5,0.1; 5.4,3.7,1.5,0.2; 4.8,3.4,1.6,0.2;
4.8,3.0,1.4,0.1; 4.3,3.0,1.1,0.1; 5.8,4.0,1.2,0.2; 5.7,4.4,1.5,0.4;
5.4,3.9,1.3,0.4; 5.1,3.5,1.4,0.3; 5.7,3.8,1.7,0.3; 5.1,3.8,1.5,0.3;
5.4,3.4,1.7,0.2; 5.1,3.7,1.5,0.4; 4.6,3.6,1.0,0.2; 5.1,3.3,1.7,0.5;
4.8,3.4,1.9,0.2; 5.0,3.0,1.6,0.2; 5.0,3.4,1.6,0.4; 5.2,3.5,1.5,0.2;
5.2,3.4,1.4,0.2; 4.7,3.2,1.6,0.2; 4.8,3.1,1.6,0.2; 5.4,3.4,1.5,0.4;
5.2,4.1,1.5,0.1; 5.5,4.2,1.4,0.2; 4.9,3.1,1.5,0.1; 5.0,3.2,1.2,0.2;
5.5,3.5,1.3,0.2; 4.9,3.1,1.5,0.1; 4.4,3.0,1.3,0.2; 5.1,3.4,1.5,0.2;
5.0,3.5,1.3,0.3; 4.5,2.3,1.3,0.3; 4.4,3.2,1.3,0.2; 5.0,3.5,1.6,0.6;
5.1,3.8,1.9,0.4; 4.8,3.0,1.4,0.3; 5.1,3.8,1.6,0.2; 4.6,3.2,1.4,0.2;
5.3,3.7,1.5,0.2; 5.0,3.3,1.4,0.2; 7.0,3.2,4.7,1.4; 6.4,3.2,4.5,1.5;
6.9,3.1,4.9,1.5; 5.5,2.3,4.0,1.3; 6.5,2.8,4.6,1.5; 5.7,2.8,4.5,1.3;
6.3,3.3,4.7,1.6; 4.9,2.4,3.3,1.0; 6.6,2.9,4.6,1.3; 5.2,2.7,3.9,1.4;
5.0,2.0,3.5,1.0; 5.9,3.0,4.2,1.5; 6.0,2.2,4.0,1.0; 6.1,2.9,4.7,1.4;
5.6,2.9,3.6,1.3; 6.7,3.1,4.4,1.4; 5.6,3.0,4.5,1.5; 5.8,2.7,4.1,1.0;
6.2,2.2,4.5,1.5; 5.6,2.5,3.9,1.1; 5.9,3.2,4.8,1.8; 6.1,2.8,4.0,1.3;
6.3,2.5,4.9,1.5; 6.1,2.8,4.7,1.2; 6.4,2.9,4.3,1.3; 6.6,3.0,4.4,1.4;
6.8,2.8,4.8,1.4; 6.7,3.0,5.0,1.7; 6.0,2.9,4.5,1.5; 5.7,2.6,3.5,1.0;
5.5,2.4,3.8,1.1; 5.5,2.4,3.7,1.0; 5.8,2.7,3.9,1.2; 6.0,2.7,5.1,1.6;
5.4,3.0,4.5,1.5; 6.0,3.4,4.5,1.6; 6.7,3.1,4.7,1.5; 6.3,2.3,4.4,1.3;
5.6,3.0,4.1,1.3; 5.5,2.5,4.0,1.3; 5.5,2.6,4.4,1.2; 6.1,3.0,4.6,1.4;
5.8,2.6,4.0,1.2; 5.0,2.3,3.3,1.0; 5.6,2.7,4.2,1.3; 5.7,3.0,4.2,1.2;
5.7,2.9,4.2,1.3; 6.2,2.9,4.3,1.3; 5.1,2.5,3.0,1.1; 5.7,2.8,4.1,1.3;
6.3,3.3,6.0,2.5; 5.8,2.7,5.1,1.9; 7.1,3.0,5.9,2.1; 6.3,2.9,5.6,1.8;
6.5,3.0,5.8,2.2; 7.6,3.0,6.6,2.1; 4.9,2.5,4.5,1.7; 7.3,2.9,6.3,1.8;
6.7,2.5,5.8,1.8; 7.2,3.6,6.1,2.5; 6.5,3.2,5.1,2.0; 6.4,2.7,5.3,1.9;
6.8,3.0,5.5,2.1; 5.7,2.5,5.0,2.0; 5.8,2.8,5.1,2.4; 6.4,3.2,5.3,2.3;
6.5,3.0,5.5,1.8; 7.7,3.8,6.7,2.2; 7.7,2.6,6.9,2.3; 6.0,2.2,5.0,1.5;
6.9,3.2,5.7,2.3; 5.6,2.8,4.9,2.0; 7.7,2.8,6.7,2.0; 6.3,2.7,4.9,1.8;
6.7,3.3,5.7,2.1; 7.2,3.2,6.0,1.8; 6.2,2.8,4.8,1.8; 6.1,3.0,4.9,1.8;
6.4,2.8,5.6,2.1; 7.2,3.0,5.8,1.6; 7.4,2.8,6.1,1.9; 7.9,3.8,6.4,2.0;
6.4,2.8,5.6,2.2; 6.3,2.8,5.1,1.5; 6.1,2.6,5.6,1.4; 7.7,3.0,6.1,2.3;
6.3,3.4,5.6,2.4; 6.4,3.1,5.5,1.8; 6.0,3.0,4.8,1.8; 6.9,3.1,5.4,2.1;
6.7,3.1,5.6,2.4; 6.9,3.1,5.1,2.3; 5.8,2.7,5.1,1.9; 6.8,3.2,5.9,2.3;
6.7,3.3,5.7,2.5; 6.7,3.0,5.2,2.3; 6.3,2.5,5.0,1.9; 6.5,3.0,5.2,2.0;
6.2,3.4,5.4,2.3; 5.9,3.0,5.1,1.8
];
data = zscore(data);
% Initialize variables
maxK = 10;
wcss = zeros(maxK, 1);
% Compute K-means clustering for different values of K
for k = 1:maxK
[~, ~, sumd] = kmeans(data, k, 'Replicates', 5, 'MaxIter', 1000);
wcss(k) = sum(sumd);
end
% Plot the WCSS against K to find the elbow point
figure;
plot(1:maxK, wcss, '-o');
xlabel('Number of clusters (K)');
ylabel('Within-cluster sum of squares (WCSS)');
title('Elbow Method for Determining Optimal K');
grid on;
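As an optional numeric cross-check of the visual elbow (not part of the question), the bend in the WCSS curve can be approximated by the largest second difference of the values already stored in wcss above; this is only a rough heuristic and assumes the script above has just been run.
% Rough numeric estimate of the elbow from the WCSS values computed above.
% The largest second difference approximates the point of sharpest bend.
d2 = diff(wcss, 2);              % second differences; d2(i) is centred at K = i + 1
[~, idx] = max(d2);
fprintf('Estimated elbow at K = %d\n', idx + 1);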
b. Develop a K-means algorithm in Matlab/Excel/Octave/Scilab to cluster the data in the dataset. Visualise the
clustering results using a scatter plot, and colour the points based on their cluster assignments.
[Remarks: using the basic arithmetic operators, loops and decision making]
% Load the dataset from the provided text data
data_str = [
"5.1 3.5 1.4 0.2"; "4.9 3.0 1.4 0.2"; "4.7 3.2 1.3 0.2"; "4.6 3.1 1.5 0.2";
"5.0 3.6 1.4 0.2"; "5.4 3.9 1.7 0.4"; "4.6 3.4 1.4 0.3"; "5.0 3.4 1.5 0.2";
"4.4 2.9 1.4 0.2"; "4.9 3.1 1.5 0.1"; "5.4 3.7 1.5 0.2"; "4.8 3.4 1.6 0.2";
"4.8 3.0 1.4 0.1"; "4.3 3.0 1.1 0.1"; "5.8 4.0 1.2 0.2"; "5.7 4.4 1.5 0.4";
"5.4 3.9 1.3 0.4"; "5.1 3.5 1.4 0.3"; "5.7 3.8 1.7 0.3"; "5.1 3.8 1.5 0.3";
"5.4 3.4 1.7 0.2"; "5.1 3.7 1.5 0.4"; "4.6 3.6 1.0 0.2"; "5.1 3.3 1.7 0.5";
"4.8 3.4 1.9 0.2"; "5.0 3.0 1.6 0.2"; "5.0 3.4 1.6 0.4"; "5.2 3.5 1.5 0.2";
"5.2 3.4 1.4 0.2"; "4.7 3.2 1.6 0.2"; "4.8 3.1 1.6 0.2"; "5.4 3.4 1.5 0.4";
"5.2 4.1 1.5 0.1"; "5.5 4.2 1.4 0.2"; "4.9 3.1 1.5 0.1"; "5.0 3.2 1.2 0.2";
"5.5 3.5 1.3 0.2"; "4.9 3.1 1.5 0.1"; "4.4 3.0 1.3 0.2"; "5.1 3.4 1.5 0.2";
"5.0 3.5 1.3 0.3"; "4.5 2.3 1.3 0.3"; "4.4 3.2 1.3 0.2"; "5.0 3.5 1.6 0.6";
"5.1 3.8 1.9 0.4"; "4.8 3.0 1.4 0.3"; "5.1 3.8 1.6 0.2"; "4.6 3.2 1.4 0.2";
"5.3 3.7 1.5 0.2"; "5.0 3.3 1.4 0.2"; "7.0 3.2 4.7 1.4"; "6.4 3.2 4.5 1.5";
"6.9 3.1 4.9 1.5"; "5.5 2.3 4.0 1.3"; "6.5 2.8 4.6 1.5"; "5.7 2.8 4.5 1.3";
"6.3 3.3 4.7 1.6"; "4.9 2.4 3.3 1.0"; "6.6 2.9 4.6 1.3"; "5.2 2.7 3.9 1.4";
"5.0 2.0 3.5 1.0"; "5.9 3.0 4.2 1.5"; "6.0 2.2 4.0 1.0"; "6.1 2.9 4.7 1.4";
"5.6 2.9 3.6 1.3"; "6.7 3.1 4.4 1.4"; "5.6 3.0 4.5 1.5"; "5.8 2.7 4.1 1.0";
"6.2 2.2 4.5 1.5"; "5.6 2.5 3.9 1.1"; "5.9 3.2 4.8 1.8"; "6.1 2.8 4.0 1.3";
"6.3 2.5 4.9 1.5"; "6.1 2.8 4.7 1.2"; "6.4 2.9 4.3 1.3"; "6.6 3.0 4.4 1.4";
"6.8 2.8 4.8 1.4"; "6.7 3.0 5.0 1.7"; "6.0 2.9 4.5 1.5"; "5.7 2.6 3.5 1.0";
"5.5 2.4 3.8 1.1"; "5.5 2.4 3.7 1.0"; "5.8 2.7 3.9 1.2"; "6.0 2.7 5.1 1.6";
"5.4 3.0 4.5 1.5"; "6.0 3.4 4.5 1.6"; "6.7 3.1 4.7 1.5"; "6.3 2.3 4.4 1.3";
"5.6 3.0 4.1 1.3"; "5.5 2.5 4.0 1.3"; "5.5 2.6 4.4 1.2"; "6.1 3.0 4.6 1.4";
"5.8 2.6 4.0 1.2"; "5.0 2.3 3.3 1.0"; "5.6 2.7 4.2 1.3"; "5.7 3.0 4.2 1.2";
"5.7 2.9 4.2 1.3"; "6.2 2.9 4.3 1.3"; "5.1 2.5 3.0 1.1"; "5.7 2.8 4.1 1.3";
"6.3 3.3 6.0 2.5"; "5.8 2.7 5.1 1.9"; "7.1 3.0 5.9 2.1"; "6.3 2.9 5.6 1.8";
"6.5 3.0 5.8 2.2"; "7.6 3.0 6.6 2.1"; "4.9 2.5 4.5 1.7"; "7.3 2.9 6.3 1.8";
"6.7 2.5 5.8 1.8"; "7.2 3.6 6.1 2.5"; "6.5 3.2 5.1 2.0"; "6.4 2.7 5.3 1.9";
"6.8 3.0 5.5 2.1"; "5.7 2.5 5.0 2.0"; "5.8 2.8 5.1 2.4"; "6.4 3.2 5.3 2.3";
"6.5 3.0 5.5 1.8"; "7.7 3.8 6.7 2.2"; "7.7 2.6 6.9 2.3"; "6.0 2.2 5.0 1.5";
"6.9 3.2 5.7 2.3"; "5.6 2.8 4.9 2.0"; "7.7 2.8 6.7 2.0"; "6.3 2.7 4.9 1.8";
"6.7 3.3 5.7 2.1"; "7.2 3.2 6.0 1.8"; "6.2 2.8 4.8 1.8"; "6.1 3.0 4.9 1.8";
"6.4 2.8 5.6 2.1"; "7.2 3.0 5.8 1.6"; "7.4 2.8 6.1 1.9"; "7.9 3.8 6.4 2.0";
"6.4 2.8 5.6 2.2"; "6.3 2.8 5.1 1.5"; "6.1 2.6 5.6 1.4"; "7.7 3.0 6.1 2.3";
"6.3 3.4 5.6 2.4"; "6.4 3.1 5.5 1.8"; "6.0 3.0 4.8 1.8"; "6.9 3.1 5.4 2.1";
"6.7 3.1 5.6 2.4"; "6.9 3.1 5.1 2.3"; "5.8 2.7 5.1 1.9"; "6.8 3.2 5.9 2.3";
"6.7 3.3 5.7 2.5"; "6.7 3.0 5.2 2.3"; "6.3 2.5 5.0 1.9"; "6.5 3.0 5.2 2.0";
"6.2 3.4 5.4 2.3"; "5.9 3.0 5.1 1.8"];
% Convert the data_str to a matrix
data = str2double(split(data_str));
% Number of clusters
K = 3;
% Number of data points
n = size(data, 1);
% Randomly initialize the centroids
rng(1); % For reproducibility
centroids = data(randperm(n, K), :);
% To store the cluster assignment
clusters = zeros(n, 1);
% To store the previous centroids
prev_centroids = centroids;
% Maximum number of iterations
max_iters = 100;
% K-means clustering algorithm
for iter = 1:max_iters % Loop
% Assign clusters
for i = 1:n % Loop
distances = zeros(K, 1);
for j = 1:K % Loop
distances(j) = sum((data(i, :) - centroids(j, :)).^2); % Arithmetic operations
end
[~, cluster] = min(distances); % Decision making
clusters(i) = cluster;
end
% Update centroids
for j = 1:K % Loop
points = data(clusters == j, :);
if ~isempty(points) % Decision making
centroids(j, :) = mean(points); % Arithmetic operations
end
end
% Check for convergence
if isequal(centroids, prev_centroids) % Decision making
break;
end
prev_centroids = centroids;
end
% Visualize the clustering results
figure;
hold on;
colors = ['r', 'g', 'm', 'b']; % one colour per cluster (fourth entry so the same script also runs for K = 4 in part c)
for j = 1:K % Loop
scatter(data(clusters == j, 1), data(clusters == j, 2), [], colors(j));
end
scatter(centroids(:, 1), centroids(:, 2), 100, 'k', 'filled');
xlabel('Feature 1');
ylabel('Feature 2');
title('K-means Clustering Results');
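% Note: the species names below were matched to the clusters by inspecting the plot;
% the mapping depends on the random initialisation.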
legend('Iris-virginica', 'Iris-setosa', 'Iris-versicolour', 'Centroids');
hold off;
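As a quick sanity check of the hand-written clustering (a small sketch, assuming the variables data, clusters, centroids and K from the script above are still in the workspace, and MATLAB R2016b or later for implicit expansion), the size and within-cluster sum of squares of each cluster can be printed:
% Report each cluster's size and within-cluster sum of squared distances.
for j = 1:K % Loop
    nj = sum(clusters == j);                                        % cluster size
    wj = sum(sum((data(clusters == j, :) - centroids(j, :)).^2));   % within-cluster SSE
    fprintf('Cluster %d: %d points, WCSS = %.2f\n', j, nj, wj);
end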
c. Re-compute the developed algorithm with different values of K and observe how the clustering results change.
Then, discuss any interesting observations gained from the clustering results.
[Figures: scatter plots of the clustering results for K = 2, K = 1 and K = 4]
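The figures above were produced by re-running the part (b) script with different values of K. One way to script these re-runs is to wrap the assignment and update steps in a small helper function and call it for each K; the sketch below uses a hypothetical function name simple_kmeans (not part of the submitted code), reuses the data matrix loaded in part (b), and plots the first two features as before. The local function at the end requires MATLAB R2016b or later, or it can be saved in its own file simple_kmeans.m.
% Re-run the hand-written K-means for several values of K and plot the results.
rng(1);                                    % reproducible initialisation
cols = ['r', 'g', 'm', 'b'];
for K = [1 2 4]
    [clusters, centroids] = simple_kmeans(data, K, 100);
    figure; hold on;
    for j = 1:K
        scatter(data(clusters == j, 1), data(clusters == j, 2), [], cols(j));
    end
    scatter(centroids(:, 1), centroids(:, 2), 100, 'k', 'filled');
    xlabel('Feature 1'); ylabel('Feature 2');
    title(sprintf('K-means Clustering Results (K = %d)', K));
    hold off;
end

function [clusters, centroids] = simple_kmeans(data, K, max_iters)
% Same logic as part (b): random seeding, squared-Euclidean assignment,
% mean update, and stop when the centroids no longer change.
n = size(data, 1);
centroids = data(randperm(n, K), :);
clusters = zeros(n, 1);
for iter = 1:max_iters
    for i = 1:n
        d = sum((data(i, :) - centroids).^2, 2);   % distance to each centroid
        [~, clusters(i)] = min(d);
    end
    prev = centroids;
    for j = 1:K
        pts = data(clusters == j, :);
        if ~isempty(pts)
            centroids(j, :) = mean(pts, 1);
        end
    end
    if isequal(centroids, prev)
        break;
    end
end
end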
COMPARISON
Clustering with K = 2
When clustering the data with K=2, we observe two distinct clusters. Cluster 1, represented by red points,
contains data with lower values for Feature 1 and varying values for Feature 2. Cluster 2, represented by
green points, contains data with higher values for Feature 1 and a wider range of values for Feature 2.
The centroids, marked by black markers, indicate the center of each cluster.
In this scenario the division between the two clusters is broad, and some overlap is visible near the boundary.
Cluster 1 likely includes species with generally smaller Feature 1 values, while Cluster 2 includes species with
larger Feature 1 values; both show varying Feature 2 values. Although the two-cluster solution provides a simple
division of the dataset, it does not capture the finer distinctions between the data points, and this simplicity may
miss important variations within the data.
Clustering with K = 1
With K = 1, all data points are assigned to one cluster, represented by a single centroid. This centroid is
the mean of all the data points, and the cluster is shown in red. With only one cluster, no separation
between data points is captured. This clustering solution does not differentiate between different groups
or patterns in the data. The centroid, marked by a black marker, indicates the center of this single cluster.
It represents the average values for Feature 1 and Feature 2 across the entire dataset. While this provides
the simplest possible division, it offers no insight into any inherent structure within the data. It's useful
as a baseline but is inadequate for most practical purposes, where distinguishing between different groups
is essential.
Clustering with K = 4
When the number of clusters is increased to K = 4, the clustering becomes more refined. Cluster 1 (red) covers
points within specific ranges of both features, Cluster 2 (green) forms another distinct group with different
feature ranges, Cluster 3 (magenta) holds points with intermediate values for both features, and Cluster 4 (blue)
holds points with higher values for Feature 2. The centroids, marked by distinct markers, indicate the
center of each cluster and represent the mean values for Feature 1 and Feature 2 within each cluster.
The four clusters show more separation and capture more specific patterns within the data. However,
this increased specificity can lead to overfitting, where the clusters become too detailed and lose their
general applicability. This over-segmentation reduces the generality of the clusters, making them less
practical for broader use.
DISCUSSION
The clustering results change significantly as K increases, moving from broad, less distinct clusters
(K = 1) to more detailed, finer clusters (K = 4). With K = 1, all data points are assigned to a single cluster.
This approach provides a baseline understanding of the overall average position of the data points.
However, it does not reveal any subgroups or patterns within the data, offering no useful information
about the structure or distribution of the data. With K = 2, the data is divided into two distinct clusters:
Cluster 1 contains data points with lower values for Feature 1, while Cluster 2 contains data points
with higher values for Feature 1. Noticeable overlap between the clusters can be seen where they are not well
separated. Although this provides a simple division of the dataset, it misses finer details and subgroups.
With K = 3, the data splits into three groups that broadly correspond to the three iris species labelled in part (b),
giving clearer structure than K = 2 without the over-segmentation seen at higher K. With K = 4, the clusters
become too specific, capturing finer patterns but risking overfitting and reduced generality. This is why K = 3 is
the best choice for clustering this dataset, since it strikes a balance between capturing the essential patterns and
avoiding over-segmentation.
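To put numbers on this trade-off, the within-cluster sum of squares can also be recomputed with the hand-written implementation for each value of K considered above (a sketch only, assuming the hypothetical simple_kmeans helper from part (c) and the data matrix from part (b)):
% WCSS of the hand-written K-means for K = 1..4, for comparison with part (a).
rng(1);
for K = 1:4
    [clusters, centroids] = simple_kmeans(data, K, 100);
    wcss = 0;
    for j = 1:K
        wcss = wcss + sum(sum((data(clusters == j, :) - centroids(j, :)).^2));
    end
    fprintf('K = %d: WCSS = %.2f\n', K, wcss);
end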