0% found this document useful (0 votes)
30 views4 pages

Complex Computing Problem KMeans Clustering

Uploaded by

MUNEEB UR REHMAN
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views4 pages

Complex Computing Problem KMeans Clustering

Uploaded by

MUNEEB UR REHMAN
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 4

Department of Computer Science

School of Systems and Technology


University of Management and Technology, Lahore
CS4172- Parallel and Distributed Computing, Fall 2023
Instructor: Hassan Bashir

Complex Computing Problem

Group Members:

1. MUHAMMAD TALHA HUSSAIN - F2020332037


2. MUNEEB UR REHMAN SHAH - F2020266395

K-Means Clustering Algorithm:

1. Initialization:

 Choose the number of clusters k.

 Randomly initialize k cluster centroids.

2. Assignment:

 Assign each data point to the nearest cluster centroid. This is done by computing the
Euclidean distance between each point and all centroids, and assigning the point to the
cluster whose centroid is closest.

3. Update:
 Recalculate the centroids of the clusters based on the current assignments.
 The new centroid for each cluster is the mean of all data points assigned to that cluster.

Mathematically, for each cluster j:

4. Repeat:
 Repeat steps 2 and 3 until convergence. Convergence occurs when the assignments of
data points to clusters no longer change significantly, or when a specified number of
iterations is reached.

The algorithm minimizes the within-cluster sum of squares (WCSS), which is the sum of
squared distances between each data point and its assigned centroid. This leads to tight
clusters where points within the same cluster are close to each other.

Parallelization in Dask:
Department of Computer Science
School of Systems and Technology
University of Management and Technology, Lahore
CS4172- Parallel and Distributed Computing, Fall 2023
Instructor: Hassan Bashir

Dask is a parallel computing library that enables parallel processing of large datasets. In the provided
code:

 Dask DataFrame: The dataset is loaded as a Dask DataFrame, which is a parallelized version of
the Pandas DataFrame. Dask divides the dataset into partitions, and operations on Dask
DataFrames are parallelized across these partitions.

 K-Means Training: The KMeans model from dask_ml.cluster is used to perform K-Means
clustering. The fit method trains the model in parallel on the Dask DataFrame partitions, and the
predict method assigns data points to clusters.

 Conversion to Dask Array: After obtaining the cluster assignments with predict, the result is
converted to a Dask array using dd.from_dask_array. This is necessary for column assignment to
the Dask DataFrame.

 Column Assignment: The cluster assignments are assigned to new columns in the Dask
DataFrame ('Cluster_Seq' for the sequential approach and 'Cluster_Parallel' for the parallel
approach).

 Computation: The compute method is used to trigger the actual computation, bringing the
results from the Dask world to the local Python environment.

Finally, the execution times for the sequential and parallel approaches are compared.

In summary, the Dask library is used to handle large datasets in a parallelized manner, and the K-Means
algorithm is applied in a distributed fashion to leverage parallel computation capabilities.

Source Code:

Dataset link: https://drive.google.com/file/d/1xCxibuxS87qWLLybODG3VvCVG7Gkd4zk/view?


usp=drive_link

import dask.dataframe as dd
from dask_ml.cluster import KMeans
import numpy as np
import time
import matplotlib.pyplot as plt

dataset_path = '/content/drive/MyDrive/kmeans_dataset.csv'
ddf = dd.read_csv(dataset_path)

features = ddf[['Feature_1', 'Feature_2']]


Department of Computer Science
School of Systems and Technology
University of Management and Technology, Lahore
CS4172- Parallel and Distributed Computing, Fall 2023
Instructor: Hassan Bashir

num_clusters = 3

ddf = ddf.set_index('Feature_1')

ddf = ddf.compute()

# Sequential Approach
start_time = time.time()
kmeans_seq = KMeans(n_clusters=num_clusters, random_state=42)
kmeans_seq.fit(features)
ddf['Cluster_Seq'] =
dd.from_dask_array(kmeans_seq.predict(features.compute()))
sequential_execution_time = time.time() - start_time

# Parallel Approach
start_time = time.time()
kmeans_parallel = KMeans(n_clusters=num_clusters, random_state=42,
n_jobs=-1)
kmeans_parallel.fit(features)
ddf['Cluster_Parallel'] =
dd.from_dask_array(kmeans_parallel.predict(features.compute()))
parallel_execution_time = time.time() - start_time

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.scatter(ddf.index, ddf['Feature_2'], c=ddf['Cluster_Seq'],
cmap='viridis', alpha=0.5)
plt.scatter(kmeans_seq.cluster_centers_[:, 0],
kmeans_seq.cluster_centers_[:, 1], marker='X', s=200, c='red')
plt.title('Sequential K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(1, 2, 2)
plt.scatter(ddf.index, ddf['Feature_2'], c=ddf['Cluster_Parallel'],
cmap='viridis', alpha=0.5)
plt.scatter(kmeans_parallel.cluster_centers_[:, 0],
kmeans_parallel.cluster_centers_[:, 1], marker='X', s=200, c='red')
Department of Computer Science
School of Systems and Technology
University of Management and Technology, Lahore
CS4172- Parallel and Distributed Computing, Fall 2023
Instructor: Hassan Bashir

plt.title('Parallel K-Means Clustering')


plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

print(f"Sequential Execution Time: {sequential_execution_time} seconds")


print(f"Parallel Execution Time: {parallel_execution_time} seconds")

Final Results:

Execution time:

Video Link:

https://drive.google.com/file/d/151KBwxHeHl-A4uZfQvqfXo7w0pZH6H9_/view?usp=drivesdk

You might also like