Department of Computer Science
School of Systems and Technology
University of Management and Technology, Lahore
CS4172- Parallel and Distributed Computing, Fall 2023
Instructor: Hassan Bashir
Complex Computing Problem
Group Members:
1. MUHAMMAD TALHA HUSSAIN - F2020332037
2. MUNEEB UR REHMAN SHAH - F2020266395
K-Means Clustering Algorithm:
1. Initialization:
Choose the number of clusters k.
Randomly initialize k cluster centroids.
2. Assignment:
Assign each data point to the nearest cluster centroid. This is done by computing the
Euclidean distance between each point and all centroids, and assigning the point to the
cluster whose centroid is closest.
3. Update:
Recalculate the centroids of the clusters based on the current assignments.
The new centroid for each cluster is the mean of all data points assigned to that cluster.
Mathematically, for each cluster j the new centroid is μ_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i, where C_j is the set of points currently assigned to cluster j.
4. Repeat:
Repeat steps 2 and 3 until convergence. Convergence occurs when the assignments of
data points to clusters no longer change significantly, or when a specified number of
iterations is reached.
The algorithm minimizes the within-cluster sum of squares (WCSS), WCSS = Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||², i.e. the sum of squared distances between each data point and its assigned centroid. Minimizing this objective produces tight clusters in which points within the same cluster lie close to one another.
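To make the four steps above concrete, the following is a minimal, illustrative NumPy sketch of plain (single-machine) K-Means. The toy data, function name, and default parameters are assumptions for illustration only, not part of the assignment code, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point goes to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat until the centroids stop moving (convergence) or max_iter is hit.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

# Toy example on random 2-D data.
X = np.random.default_rng(0).normal(size=(300, 2))
labels, centroids = kmeans(X, k=3)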
Parallelization in Dask:
Dask is a parallel computing library that scales Python data processing to large datasets. In the provided code:
Dask DataFrame: The dataset is loaded as a Dask DataFrame, which is a parallelized version of
the Pandas DataFrame. Dask divides the dataset into partitions, and operations on Dask
DataFrames are parallelized across these partitions.
K-Means Training: The KMeans model from dask_ml.cluster performs the K-Means clustering. The feature columns are first converted to a Dask array with known chunk sizes (to_dask_array(lengths=True)), since dask_ml's KMeans works on array input; the fit method then trains the model in parallel across those chunks, and the predict method assigns each data point to a cluster.
Conversion to a Dask Series: The cluster labels returned by predict form a Dask array; dd.from_dask_array wraps this array as a Dask Series (aligned on the DataFrame's index) so that it can be assigned as a column of the Dask DataFrame. A short standalone sketch of this pattern follows below.
Column Assignment: The cluster assignments are assigned to new columns in the Dask
DataFrame ('Cluster_Seq' for the sequential approach and 'Cluster_Parallel' for the parallel
approach).
Computation: The compute method triggers the actual computation and brings the result back from the Dask workers into local memory as an ordinary pandas DataFrame, which is then used for plotting.
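To make the conversion-and-assignment pattern concrete, here is a short, self-contained sketch on toy data; the column names, label values, and partition counts are assumptions for illustration, not the course dataset:

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

# A small Dask DataFrame split into 4 partitions, standing in for the CSV data.
pdf = pd.DataFrame({'Feature_1': range(8), 'Feature_2': range(8, 16)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.npartitions)  # 4 -- operations on ddf run per partition, in parallel

# Stand-in for the output of KMeans.predict: a Dask *array* of cluster labels,
# chunked so that each chunk lines up with one DataFrame partition (2 rows each).
labels = da.from_array(np.array([0, 1, 2, 0, 1, 2, 0, 1]), chunks=2)

# Wrap the array as a Dask Series aligned on ddf's index, then assign it as a
# new column -- the same pattern used for 'Cluster_Seq' and 'Cluster_Parallel'.
ddf['Cluster'] = dd.from_dask_array(labels, index=ddf.index)

# Nothing has run yet; compute() triggers the work and returns an ordinary
# pandas DataFrame in local memory.
result = ddf.compute()
print(result)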
Finally, the execution times for the sequential and parallel approaches are compared.
In summary, the Dask library is used to handle large datasets in a parallelized manner, and the K-Means
algorithm is applied in a distributed fashion to leverage parallel computation capabilities.
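The code in the next section relies on Dask's default scheduler. As an optional sketch (not part of the original code), the same workload could also be run on an explicit local cluster of worker processes via dask.distributed; the worker and thread counts below are arbitrary assumptions:

from dask.distributed import Client, LocalCluster

# Start a local "cluster" of worker processes so Dask tasks are spread across
# several CPU cores; the counts here are placeholders, not tuned values.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
print(client.dashboard_link)  # web dashboard showing the parallel tasks live

# ... run the K-Means code from the Source Code section here, unchanged ...

client.close()
cluster.close()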
Source Code:
Dataset link: https://drive.google.com/file/d/1xCxibuxS87qWLLybODG3VvCVG7Gkd4zk/view?usp=drive_link
import dask.dataframe as dd
from dask_ml.cluster import KMeans
import numpy as np
import time
import matplotlib.pyplot as plt
dataset_path = '/content/drive/MyDrive/kmeans_dataset.csv'
ddf = dd.read_csv(dataset_path)
# Convert the feature columns to a Dask array with known chunk sizes,
# as dask_ml's KMeans works on Dask arrays.
features = ddf[['Feature_1', 'Feature_2']].to_dask_array(lengths=True)
num_clusters = 3

# Sequential Approach
start_time = time.time()
kmeans_seq = KMeans(n_clusters=num_clusters, random_state=42)
kmeans_seq.fit(features)
# predict returns a Dask array of labels; wrap it as a Dask Series aligned on
# the DataFrame's index so it can be assigned as a new column.
ddf['Cluster_Seq'] = dd.from_dask_array(kmeans_seq.predict(features), index=ddf.index)
sequential_execution_time = time.time() - start_time

# Parallel Approach
start_time = time.time()
kmeans_parallel = KMeans(n_clusters=num_clusters, random_state=42, n_jobs=-1)
kmeans_parallel.fit(features)
ddf['Cluster_Parallel'] = dd.from_dask_array(kmeans_parallel.predict(features), index=ddf.index)
parallel_execution_time = time.time() - start_time

# Trigger the actual computation and bring the results into local memory for plotting.
df = ddf.compute()

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.scatter(df['Feature_1'], df['Feature_2'], c=df['Cluster_Seq'], cmap='viridis', alpha=0.5)
plt.scatter(kmeans_seq.cluster_centers_[:, 0], kmeans_seq.cluster_centers_[:, 1], marker='X', s=200, c='red')
plt.title('Sequential K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(1, 2, 2)
plt.scatter(df['Feature_1'], df['Feature_2'], c=df['Cluster_Parallel'], cmap='viridis', alpha=0.5)
plt.scatter(kmeans_parallel.cluster_centers_[:, 0], kmeans_parallel.cluster_centers_[:, 1], marker='X', s=200, c='red')
plt.title('Parallel K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
plt.show()
print(f"Sequential Execution Time: {sequential_execution_time} seconds")
print(f"Parallel Execution Time: {parallel_execution_time} seconds")
Final Results:
Execution time:
Video Link:
https://drive.google.com/file/d/151KBwxHeHl-A4uZfQvqfXo7w0pZH6H9_/view?usp=drivesdk