Master in Big Data
TECHNOLOGICAL FUNDAMENTALS IN THE BIG DATA WORLD
LAB1. K-MEANS PARALLELIZATION in PYTHON
Creation of the datasets
In the lab material you will find a file named “computers-generator.py”. Use it
to generate the computer datasets for the lab. To generate a dataset, execute
the command:
$>python computers-generator.py numrows
Here “numrows” is a parameter specifying the number of rows (computers) in the dataset.
For development:
$>python computers-generator.py 5000
To test the performance of the solution you deliver, use at least 500,000 rows; you can
test with as many as you want:
$>python computers-generator.py 500000
The generated file "computers.csv" contains a dataset describing a list of computers,
with the following fields per computer:
"id","price","speed","hd","ram","screen","cores","cd","laptop","trend"
IMPORTANT: Do not modify or touch the file, and do not create transformed fields.
Extra files will not be accepted in the lab delivery. We will use the same command
to generate the dataset.
Notice: The data contains two fields that are not numerical:
- cd
- laptop
As they have only two values, you can substitute them with 0 (no) and 1 (yes) to
normalize the data.
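The substitution described above can be sketched as follows. This is a minimal illustration, not a required implementation: it assumes the two fields contain the literal strings "yes" and "no" and that the header uses the field names listed earlier. The helper name `load_rows` is hypothetical. The conversion happens in memory only, so the file itself is left untouched.

```python
import csv
import io

# Assumed spellings of the two possible values in "cd" and "laptop".
YES_NO = {"no": 0.0, "yes": 1.0}

def load_rows(handle):
    """Read computer records and return (ids, numeric feature rows)."""
    ids, rows = [], []
    for record in csv.DictReader(handle):
        ids.append(record["id"])
        row = []
        for name, value in record.items():
            if name == "id":
                continue  # the id identifies a row; it is not a feature
            if name in ("cd", "laptop"):
                row.append(YES_NO[value.strip().lower()])
            else:
                row.append(float(value))
        rows.append(row)
    return ids, rows

# Two made-up records, used only to show the conversion.
sample = io.StringIO(
    "id,price,speed,hd,ram,screen,cores,cd,laptop,trend\n"
    "1,1499,25,80,4,14,2,no,yes,1\n"
    "2,1795,33,85,2,14,4,yes,no,1\n"
)
ids, rows = load_rows(sample)
print(rows[0])
```

With the real dataset the same function could be called as `with open("computers.csv", newline="") as f: ids, rows = load_rows(f)`.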
Laboratory Description
You are asked to extract useful information from the computer dataset by implementing
a Python program that applies the k-means algorithm on the “price” attribute.
Use the path “computers.csv” for the file. Do not include the full path on your
computer (e.g. things like C:\mifolder\computers.csv or ../tmp/computers.csv are
forbidden). You have to execute the Python program in the directory where the file is.
Part one – Python serial
1.- Implement the k-means algorithm
- Use euclidean distance
- Random centroids at the beginning
2.- Construct the elbow graph and find the optimal number of clusters (k).
3.- Cluster the data with k-means using the optimal value of k.
4.- Measure the time and print it
5.- Plot the results of the elbow graph.
6.- Plot the first 2 dimensions of the clusters (price, speed)
7.- Find the cluster with the highest average price and print the id of the cluster
and the average price.
8.- Print a heat map using the values of the clusters’ centroids.
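As a starting point for the steps above, here is a minimal serial sketch of k-means with Euclidean distance and random initial centroids, written in plain Python. The function names (`kmeans`, `assign`, `highest_average_price`), the convergence test, and the choice to keep the old centroid when a cluster becomes empty are assumptions made for illustration. The sample points are made up and use only (price, speed).

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points]

def kmeans(points, k, max_iter=100, seed=None):
    """Return (centroids, labels, wcss) for a list of numeric rows."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random start
    dims = len(points[0])
    for _ in range(max_iter):
        labels = assign(points, centroids)
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for p, lab in zip(points, labels):
            counts[lab] += 1
            for d, v in enumerate(p):
                sums[lab][d] += v
        # Assumption: an empty cluster keeps its previous centroid.
        updated = [[s / counts[c] for s in sums[c]] if counts[c] else centroids[c]
                   for c in range(k)]
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    labels = assign(points, centroids)
    # Within-cluster sum of squares: the quantity plotted in an elbow graph.
    wcss = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, wcss

def highest_average_price(points, labels, k, price_col=0):
    """Return (cluster id, average price) of the most expensive cluster."""
    totals, counts = [0.0] * k, [0] * k
    for p, lab in zip(points, labels):
        totals[lab] += p[price_col]
        counts[lab] += 1
    averages = {c: totals[c] / counts[c] for c in range(k) if counts[c]}
    best = max(averages, key=averages.get)
    return best, averages[best]

# Made-up (price, speed) rows forming two obvious groups.
points = [[1000, 25], [1100, 33], [1050, 30], [3000, 66], [3100, 75], [2900, 70]]
centroids, labels, wcss = kmeans(points, 2, seed=0)
best, avg = highest_average_price(points, labels, 2)
print(f"cluster {best} has the highest average price: {avg:.2f}")
```

Calling `kmeans` for a range of k values and recording the returned `wcss` gives the data for the elbow graph.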
Part two – Python parallel, multiprocessing
1.- Write a parallel version of your program using multiprocessing
2.- Measure the time and optimize the program to get the fastest version you
can.
3.- Measure the time and print it
4.- Plot the first 2 dimensions of the clusters (price, speed)
5.- Find the cluster with the highest average price and print the id of the cluster
and the average price.
6.- Print a heat map using the values of the clusters’ centroids.
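One possible way to parallelize a k-means iteration with multiprocessing is to split the points into chunks, let each worker compute per-cluster sums and counts for its chunk, and merge the partial results in the parent process. The sketch below follows that idea. The function names, the chunking strategy, and the empty-cluster rule are assumptions, and other decompositions may turn out faster.

```python
import math
from multiprocessing import Pool

def partial_sums(task):
    """Worker: assign one chunk to its nearest centroids.

    Returns per-cluster coordinate sums and point counts for the chunk.
    """
    chunk, centroids = task
    k, dims = len(centroids), len(centroids[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        # math.dist computes the Euclidean distance (Python 3.8+).
        best = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        counts[best] += 1
        for d, v in enumerate(p):
            sums[best][d] += v
    return sums, counts

def merge(partials, old_centroids):
    """Combine the workers' partial results into new centroids."""
    k, dims = len(old_centroids), len(old_centroids[0])
    total_sums = [[0.0] * dims for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for c in range(k):
            total_counts[c] += counts[c]
            for d in range(dims):
                total_sums[c][d] += sums[c][d]
    # Assumption: an empty cluster keeps its previous centroid.
    return [[s / total_counts[c] for s in total_sums[c]] if total_counts[c]
            else list(old_centroids[c]) for c in range(k)]

def split(points, n_chunks):
    """Cut the point list into at most n_chunks contiguous pieces."""
    size = math.ceil(len(points) / n_chunks)
    return [points[i:i + size] for i in range(0, len(points), size)]

def parallel_step(pool, points, centroids, n_chunks):
    """One k-means iteration with the assignment work spread over a Pool."""
    tasks = [(chunk, centroids) for chunk in split(points, n_chunks)]
    return merge(pool.map(partial_sums, tasks), centroids)

# Typical use inside a script (the guard matters where workers are spawned):
# if __name__ == "__main__":
#     with Pool(processes=4) as pool:
#         centroids = parallel_step(pool, points, centroids, n_chunks=4)
```

Creating the pool once and reusing it across iterations avoids paying the process start-up cost in every step. The data still has to be sent to the workers on each call, which is one of the overheads worth measuring.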
Part three – Python parallel, threading
1.- Write a parallel version of your program using threads
2.- Measure the time and optimize the program to get the fastest version you
can.
3.- Measure the time and print it
4.- Plot the first 2 dimensions of the clusters (price, speed)
5.- Find the cluster with the highest average price and print the id of the cluster
and the average price.
6.- Print a heat map using the values of the clusters’ centroids.
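For the threaded version, a comparable sketch can use `concurrent.futures.ThreadPoolExecutor` to label chunks of points concurrently. The function names and the chunking are assumptions. Note that in standard CPython builds the global interpreter lock keeps threads from running Python bytecode at the same time, so a pure-Python computation like this one may show little or no speedup. Measuring and explaining that effect is likely part of what the report should discuss.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def label_chunk(chunk, centroids):
    """Label every point of one chunk."""
    return [nearest(p, centroids) for p in chunk]

def threaded_labels(points, centroids, n_threads=4):
    """Assign all points to centroids using a pool of threads."""
    size = math.ceil(len(points) / n_threads)
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() yields results in the same order as the input chunks.
        parts = list(pool.map(label_chunk, chunks, [centroids] * len(chunks)))
    return [label for part in parts for label in part]

# Made-up points and centroids, used only to show the call.
points = [[1.0, 1.0], [3.0, 3.0], [10.0, 10.0], [12.0, 12.0]]
centroids = [[0.0, 0.0], [11.0, 11.0]]
print(threaded_labels(points, centroids, n_threads=2))
```

Threads share memory, so unlike the multiprocessing approach nothing needs to be copied to the workers.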
Part four
1.- Write a report explaining your results (maximum 12 pages).
- Important: show the measured speedup of the parallel versions.
- Show the possible speedup according to Amdahl's law.
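The two quantities mentioned above can be computed as in the following sketch. Measured speedup is the serial time divided by the parallel time. Amdahl's law bounds the speedup by 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of workers. The numbers in the example are illustrative only and are not measurements.

```python
def measured_speedup(serial_seconds, parallel_seconds):
    """Observed speedup of a parallel run over the serial run."""
    return serial_seconds / parallel_seconds

def amdahl_speedup(parallel_fraction, n_workers):
    """Speedup predicted by Amdahl's law for n_workers."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

def amdahl_limit(parallel_fraction):
    """Upper bound on the speedup as the number of workers grows."""
    return 1.0 / (1.0 - parallel_fraction)

# Illustrative values: 90% of the run time parallelizable, 4 workers.
print(f"{amdahl_speedup(0.9, 4):.2f}")  # prints 3.08
```

The parallel fraction p has to be estimated, for example by timing the parts of the serial program that stay sequential, such as reading the file.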
Important
- To measure time correctly, you cannot stop the program to
show the graphs (e.g. k-means). You have to keep all the
graphs and display them at the end, together with the
messages printed on the screen.
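One way to follow this rule is to queue every output action while the timed work runs and execute the queue only after the clock has stopped. The sketch below does this for printed messages, and plot-drawing calls could be queued in the same way. The names `timed` and `deferred` and the stand-in workload are hypothetical.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

deferred = []  # output actions, executed only after timing has finished

def work(n):
    """Stand-in for the clustering; queues its message instead of printing."""
    total = sum(i * i for i in range(n))
    deferred.append(lambda: print(f"sum of squares below {n}: {total}"))
    return total

result, elapsed = timed(work, 1000)
deferred.append(lambda: print(f"elapsed: {elapsed:.4f} s"))

# The timed region is over: now show everything at once.
for action in deferred:
    action()
```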
- The optimum k can be computed using the elbow
method, deriving the interval (k1, k2) from the
slope of the curve. For example, in the picture a possible
interval would be [3,6].
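The text does not specify how the interval should be derived from the slope, so the following is only one possible heuristic. It compares each drop in the within-cluster sum of squares with the largest drop: k1 is the first k where the curve stops being steep, and k2 is the first k where it is nearly flat. The function name, both thresholds (0.25 and 0.02), and the sample values are assumptions chosen for illustration.

```python
def elbow_interval(wcss_by_k, steep=0.25, flat=0.02):
    """Return (k1, k2) estimated from the slope of the elbow curve.

    wcss_by_k maps each tested k to its within-cluster sum of squares.
    """
    ks = sorted(wcss_by_k)
    # Drop in WCSS when going from each k to the next tested k.
    drops = {k: wcss_by_k[k] - wcss_by_k[nxt] for k, nxt in zip(ks, ks[1:])}
    largest = max(drops.values())
    # First k whose following drop is below each fraction of the largest drop.
    k1 = next((k for k in ks[:-1] if drops[k] < steep * largest), ks[-1])
    k2 = next((k for k in ks[:-1] if drops[k] < flat * largest), ks[-1])
    return k1, k2

# Made-up curve values, not measurements.
sample = {1: 1000, 2: 500, 3: 250, 4: 200, 5: 170, 6: 150, 7: 145, 8: 142}
print(elbow_interval(sample))
```

Different thresholds give different intervals, so the chosen values should be justified in the report.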
Laboratory Delivery
Maximum group size: 4 people.
Do not deliver Jupyter notebooks. Deliver .py files.
You have to deliver a compressed file named: “yournia_computers_2024.zip” (e.g.
100023456_computers_2024.zip) including:
- A PDF file with the report (include author names)
- An “authors.txt” file, with one line per author (NIA, SURNAMES, NAME)
- Three Python programs: the serial version and the parallel versions with
multiprocessing and threads. Names:
• Computer-serial.py
• Computer-mp.py
• Computer-th.py
Delivery date: October 8th, 2024, 23:30 hours.