
Master in Big Data

TECHNOLOGICAL FUNDAMENTALS IN THE BIG DATA WORLD

LAB1. K-MEANS PARALLELIZATION in PYTHON

Creation of the datasets


In the lab material you will find a file named “computers-generator.py”. You have
to use it to generate the computers datasets for the lab. To generate a dataset,
execute the command:

$>python computers-generator.py numrows

where “numrows” is a parameter specifying the number of computers (rows) in the dataset.

For development:
$>python computers-generator.py 5000

To test the performance of the solution you must deliver results for at least 500,000
rows, but you can test as many as you want:

$>python computers-generator.py 500000

The created file "computers.csv" contains a dataset about a list of computers,
including the following information per computer:

"id","price","speed","hd","ram","screen","cores","cd","laptop","trend"

IMPORTANT: Do not modify or touch the file, and do not create transformed fields.
Extra files will not be accepted for the lab delivery. We will use the same command
to generate the dataset.

Notice: In the data you have 2 fields that are not numerical:
cd,
laptop

As they have only two values, you can substitute them with 0 (no) and 1 (yes) to
normalize the data.
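The substitution above can be sketched with the standard library's csv module. This is a minimal illustration, assuming the generator writes the literal strings "yes"/"no" in those two columns; the sample rows below are made up, only the header matches the lab file.

```python
import csv
import io

# Two illustrative rows in the same shape as computers.csv
# (values are invented; only the header matches the lab file).
sample = io.StringIO(
    "id,price,speed,hd,ram,screen,cores,cd,laptop,trend\n"
    "1,1499,66,340,8,15,2,yes,no,94\n"
    "2,1795,133,500,16,15,4,no,yes,94\n"
)

binary = {"no": 0, "yes": 1}
rows = []
for row in csv.DictReader(sample):
    row["cd"] = binary[row["cd"]]          # encode yes/no as 1/0
    row["laptop"] = binary[row["laptop"]]
    rows.append(row)

print([(r["cd"], r["laptop"]) for r in rows])  # [(1, 0), (0, 1)]
```

In the actual program you would read "computers.csv" from the current directory instead of the in-memory sample.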

Laboratory Description
You are asked to extract useful information from the computer dataset by implementing
a program using the k-means algorithm in Python on the “price” attribute.

Use the path “computers.csv” for the file. Do not include the full path on your
computer (e.g. things like C:\mifolder\computers.csv or ../tmp/computers.csv are
forbidden). You have to execute the Python program in the directory where the file is.

Part one – Python serial

1.- Implement the k-means algorithm

- Use Euclidean distance
- Random centroids at the beginning


2.- Construct the elbow graph and find the optimal number of clusters (k).

3.- Cluster the data with k-means using the optimum value of k.

4.- Measure the time and print it.

5.- Plot the results of the elbow graph.

6.- Plot the first 2 dimensions of the clusters (price, speed)

7.- Find the cluster with the highest average price and print the cluster id and
the average price.

8.- Print a heat map using the values of the clusters’ centroids.
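The serial steps above can be sketched as follows. This is a minimal 1-D k-means with Euclidean distance and random initial centroids, plus the within-cluster sum of squared distances used to build the elbow graph; the function names (kmeans_1d, inertia) are illustrative, not required by the lab.

```python
import random

def kmeans_1d(values, k, iters=100, seed=0):
    """Serial k-means on a single attribute (e.g. "price").
    In one dimension the Euclidean distance is the absolute difference."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)          # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:                       # assignment step
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # update step: move each centroid to the mean of its cluster
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                   # converged
            break
        centroids = new
    return centroids, clusters

def inertia(centroids, clusters):
    """Within-cluster sum of squared distances, plotted for the elbow graph."""
    return sum((v - c) ** 2 for c, grp in zip(centroids, clusters) for v in grp)
```

For the elbow graph you would run kmeans_1d for a range of k values and plot inertia against k, looking for the point where the slope flattens.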

Part two – Python parallel, multiprocessing

1.- Write a parallel version of your program using multiprocessing.

2.- Measure the time and optimize the program to get the fastest version you
can.

3.- Measure the time and print it.

4.- Plot the first 2 dimensions of the clusters (price, speed)


5.- Find the cluster with the highest average price and print the cluster id and
the average price.

6.- Print a heat map using the values of the clusters’ centroids.
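One way to apply multiprocessing, sketched below under the assumption that the assignment step dominates the runtime: distribute it over a process pool with a large chunksize to keep inter-process overhead low. The names (nearest, assign_parallel) are illustrative.

```python
import multiprocessing as mp

def nearest(args):
    """Index of the closest centroid for one value (1-D Euclidean distance)."""
    v, centroids = args
    return min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))

def assign_parallel(values, centroids, workers=4):
    """Parallelize the assignment step across a pool of worker processes.
    A large chunksize amortizes the cost of sending work to each process."""
    with mp.Pool(workers) as pool:
        return pool.map(nearest, [(v, centroids) for v in values],
                        chunksize=max(1, len(values) // workers))

if __name__ == "__main__":
    print(assign_parallel([1.0, 9.0, 1.2, 8.8], [1.0, 9.0], workers=2))
```

The update step (averaging each cluster) is cheap and can stay serial; when optimizing, measure how chunksize and the number of workers affect the total time.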

Part three – Python parallel, threading

1.- Write a parallel version of your program using threads.

2.- Measure the time and optimize the program to get the fastest version you
can.

3.- Measure the time and print it.

4.- Plot the first 2 dimensions of the clusters (price, speed)

5.- Find the cluster with the highest average price and print the cluster id and
the average price.

6.- Print a heat map using the values of the clusters’ centroids.

Part four

1.- Write a report explaining your results (maximum 12 pages).


- Important: show measurements with the speedup of the parallel versions.
- Show the possible speedup using Amdahl's Law.
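Amdahl's Law bounds the speedup achievable with n workers when only a fraction p of the runtime parallelizes: S(n) = 1 / ((1 - p) + p / n). A small helper for the report (the 90% figure below is just an example, not a measured value):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: S(n) = 1 / ((1 - p) + p / n),
    where p is the parallel fraction and n the number of workers."""
    return 1.0 / ((1.0 - p) + p / n)

# If, say, 90% of the runtime parallelizes, 4 workers give at most:
print(round(amdahl_speedup(0.9, 4), 2))  # 3.08
```

Estimate p from your own serial timings (time spent in the parallelized step divided by total time) and compare the predicted bound with the speedup you measure.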

Important

- To correctly measure time, you cannot stop the program to see the graphs
  (e.g. k-means). You have to keep all the graphs and print them at the end,
  together with the messages on the screen.
- Computing the optimum K can be done using the elbow algorithm and computing
  the interval (k1, k2) using the slope of the curve. For example, in the
  picture a possible interval would be [3, 6].

Laboratory Delivery
Maximum group size: 4 people.

Do not deliver Jupyter notebooks. Deliver .py files.

You have to deliver a compressed file named “yourNIA_computers_2024.zip” (e.g.
100023456_computers_2024.zip) including:

- A PDF report (include the authors' names)
- An “authors.txt” file, including a line per author (NIA, SURNAMES, NAME)
- Three Python programs: the serial version and the parallel versions with
  multiprocessing and threading. Names:
  • Computer-serial.py
  • Computer-mp.py
  • Computer-th.py

Delivery date: October 8th 2024. 23:30 hours.
