Department of Computer Science
School of Systems and Technology
University of Management and Technology, Lahore
CS4172- Parallel and Distributed Computing, Fall 2023
Instructor: Hassan Bashir
Complex Computing Problem
Group Members:
1. MUHAMMAD TALHA HUSSAIN - F2020332037
2. MUNEEB UR REHMAN SHAH - F2020266395
K-Means Clustering Algorithm:
1. Initialization:
Choose the number of clusters k.
Randomly initialize k cluster centroids.
2. Assignment:
Assign each data point to the nearest cluster centroid. This is done by computing the
Euclidean distance between each point and all centroids, and assigning the point to the
cluster whose centroid is closest.
3. Update:
Recalculate the centroids of the clusters based on the current assignments.
The new centroid for each cluster is the mean of all data points assigned to that cluster.
Mathematically, for each cluster j the new centroid is μ_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i, where C_j is the set of points currently assigned to cluster j.
4. Repeat:
Repeat steps 2 and 3 until convergence. Convergence occurs when the assignments of
data points to clusters no longer change significantly, or when a specified number of
iterations is reached.
The algorithm minimizes the within-cluster sum of squares (WCSS), WCSS = Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||², i.e. the sum of squared distances between each data point and its assigned centroid. Minimizing this objective produces tight clusters in which points within the same cluster lie close to one another.
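To make the four steps above concrete, the following is a minimal, illustrative NumPy sketch of plain (single-machine) K-Means. The toy data, function name, and default parameters are assumptions for illustration only, not part of the assignment code, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point goes to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence) or max_iter is hit.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

# Toy example on random 2-D data.
X = np.random.default_rng(0).normal(size=(300, 2))
labels, centroids = kmeans(X, k=3)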
Parallelization in Dask:
Dask is a parallel computing library that scales Python data processing to large datasets. In the provided code:
Dask DataFrame: The dataset is loaded as a Dask DataFrame, which is a parallelized version of
the Pandas DataFrame. Dask divides the dataset into partitions, and operations on Dask
DataFrames are parallelized across these partitions.
K-Means Training: The KMeans model from dask_ml.cluster performs the K-Means clustering. The feature columns are first converted to a Dask array with known chunk sizes (to_dask_array(lengths=True)), since dask_ml's KMeans works on array input; the fit method then trains the model in parallel across those chunks, and the predict method assigns each data point to a cluster.
Conversion to a Dask Series: The cluster labels returned by predict form a Dask array; dd.from_dask_array wraps this array as a Dask Series (aligned on the DataFrame's index) so that it can be assigned as a column of the Dask DataFrame. A short standalone sketch of this pattern follows below.
Column Assignment: The cluster assignments are assigned to new columns in the Dask
DataFrame ('Cluster_Seq' for the sequential approach and 'Cluster_Parallel' for the parallel
approach).
Computation: The compute method triggers the actual computation and brings the result back from the Dask workers into local memory as an ordinary pandas DataFrame, which is then used for plotting.
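To make the conversion-and-assignment pattern concrete, here is a short, self-contained sketch on toy data; the column names, label values, and partition counts are assumptions for illustration, not the course dataset:

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# A small Dask DataFrame split into 4 partitions, standing in for the CSV data.
pdf = pd.DataFrame({'Feature_1': range(8), 'Feature_2': range(8, 16)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.npartitions)  # 4 -- operations on ddf run per partition, in parallel

# Stand-in for the output of KMeans.predict: a Dask *array* of cluster labels,
# chunked so that each chunk lines up with one DataFrame partition (2 rows each).
labels = da.from_array(np.array([0, 1, 2, 0, 1, 2, 0, 1]), chunks=2)

# Wrap the array as a Dask Series aligned on ddf's index, then assign it as a
# new column -- the same pattern used for 'Cluster_Seq' and 'Cluster_Parallel'.
ddf['Cluster'] = dd.from_dask_array(labels, index=ddf.index)

# Nothing has run yet; compute() triggers the work and returns an ordinary
# pandas DataFrame in local memory.
result = ddf.compute()
print(result)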
Finally, the execution times for the sequential and parallel approaches are compared.
In summary, the Dask library is used to handle large datasets in a parallelized manner, and the K-Means
algorithm is applied in a distributed fashion to leverage parallel computation capabilities.
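The code in the next section relies on Dask's default scheduler. As an optional sketch (not part of the original code), the same workload could also be run on an explicit local cluster of worker processes via dask.distributed; the worker and thread counts below are arbitrary assumptions:

from dask.distributed import Client, LocalCluster

# Start a local "cluster" of worker processes so Dask tasks are spread across
# several CPU cores; the counts here are placeholders, not tuned values.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
print(client.dashboard_link)  # web dashboard showing the parallel tasks live

# ... run the K-Means code from the Source Code section here, unchanged ...

client.close()
cluster.close()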
Source Code:
Dataset link: https://drive.google.com/file/d/1xCxibuxS87qWLLybODG3VvCVG7Gkd4zk/view?usp=drive_link
import dask.dataframe as dd
from dask_ml.cluster import KMeans
import numpy as np
import time
import matplotlib.pyplot as plt
dataset_path = '/content/drive/MyDrive/kmeans_dataset.csv'
ddf = dd.read_csv(dataset_path)
# Convert the feature columns to a Dask array with known chunk sizes,
# as dask_ml's KMeans works on Dask arrays.
features = ddf[['Feature_1', 'Feature_2']].to_dask_array(lengths=True)
num_clusters = 3

# Sequential Approach
start_time = time.time()
kmeans_seq = KMeans(n_clusters=num_clusters, random_state=42)
kmeans_seq.fit(features)
# predict returns a Dask array of labels; wrap it as a Dask Series aligned on
# the DataFrame's index so it can be assigned as a new column.
ddf['Cluster_Seq'] = dd.from_dask_array(kmeans_seq.predict(features), index=ddf.index)
sequential_execution_time = time.time() - start_time

# Parallel Approach
start_time = time.time()
kmeans_parallel = KMeans(n_clusters=num_clusters, random_state=42, n_jobs=-1)
kmeans_parallel.fit(features)
ddf['Cluster_Parallel'] = dd.from_dask_array(kmeans_parallel.predict(features), index=ddf.index)
parallel_execution_time = time.time() - start_time

# Trigger the actual computation and bring the results into local memory for plotting.
df = ddf.compute()

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.scatter(df['Feature_1'], df['Feature_2'], c=df['Cluster_Seq'], cmap='viridis', alpha=0.5)
plt.scatter(kmeans_seq.cluster_centers_[:, 0], kmeans_seq.cluster_centers_[:, 1], marker='X', s=200, c='red')
plt.title('Sequential K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(1, 2, 2)
plt.scatter(df['Feature_1'], df['Feature_2'], c=df['Cluster_Parallel'], cmap='viridis', alpha=0.5)
plt.scatter(kmeans_parallel.cluster_centers_[:, 0], kmeans_parallel.cluster_centers_[:, 1], marker='X', s=200, c='red')
plt.title('Parallel K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
plt.show()
print(f"Sequential Execution Time: {sequential_execution_time} seconds")
print(f"Parallel Execution Time: {parallel_execution_time} seconds")
Final Results:
Execution time:
Video Link:
https://drive.google.com/file/d/151KBwxHeHl-A4uZfQvqfXo7w0pZH6H9_/view?usp=drivesdk