
Master in Big Data

TECHNOLOGICAL FUNDAMENTALS IN THE BIG DATA WORLD

LAB1. K-MEANS PARALLELIZATION in PYTHON

Creation of the datasets

In the lab material you will find a file named “computers-generator.py”. You have
to use it to generate the computers datasets for the lab. To generate a data set, execute
the command:

$>python computers-generator.py numrows

where “numrows” is a parameter specifying the number of rows (computers) in the dataset.

For development:
$>python computers-generator.py 5000

To test the performance of the solution you deliver, use at least 500,000 rows, but you can test with as
many as you want:

$>python computers-generator.py 500000

The generated file "computers.csv" contains a data set describing a list of computers,
with the following fields per computer:

"id","price","speed","hd","ram","screen","cores","cd","laptop","trend"

IMPORTANT: Do not modify the file or add transformed fields to it. Extra files
will not be accepted in the lab delivery. We will use the same command to generate
the dataset.

Notice: In the data you have 2 fields that are not numerical:
cd,
laptop

As they have only two values, you can substitute them with 0 (no) and 1 (yes) to
normalize the data.
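
The substitution described above can be sketched as follows. This is a minimal illustration using only the standard library; it assumes the generator writes the literal strings "yes" and "no" in the cd and laptop columns and that every other field parses as a number. The function name read_computers is hypothetical. The encoding is done in memory, so the file itself is left untouched.

```python
import csv

# Field order taken from the header listed in the assignment ("id" is kept apart).
FEATURES = ["price", "speed", "hd", "ram", "screen", "cores", "cd", "laptop", "trend"]
BINARY = {"cd", "laptop"}  # assumption: these columns hold "yes" / "no"


def encode(name, value):
    """Turn one CSV cell into a float; yes/no columns become 1.0 / 0.0."""
    if name in BINARY:
        return 1.0 if value.strip().lower() == "yes" else 0.0
    return float(value)


def read_computers(fileobj):
    """Read an open CSV file and return (ids, rows) with rows as lists of floats."""
    ids, rows = [], []
    for record in csv.DictReader(fileobj):
        ids.append(record["id"])
        rows.append([encode(name, record[name]) for name in FEATURES])
    return ids, rows
```

A typical call would open the file by its relative name only, for example `with open("computers.csv", newline="") as f: ids, rows = read_computers(f)`.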

Laboratory Description
You are asked to extract useful information from the computers data set by implementing a
Python program that applies the k-means algorithm, focusing on the “price” attribute.

Use the path “computers.csv” for the file. Do not include the full path on your
computer (e.g. paths such as C:\mifolder\computers.csv or ../tmp/computers.csv are
forbidden). You have to execute the Python program in the directory where the file is.

Part one – Python serial

1.- Implement the k-means algorithm

- Use Euclidean distance

- Random centroids at the beginning

2.- Construct the elbow graph and find the optimal number of clusters (k).

3.- Cluster the data with k-means using the optimal value of k.

4.- Measure the time and print it.

5.- Plot the results of the elbow graph.

6.- Plot the first 2 dimensions of the clusters (price, speed).

7.- Find the cluster with the highest average price and print the id of the cluster and the
average price.

8.- Print a heat map using the values of the clusters' centroids.
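
As a rough guide to the serial steps above, the sketch below shows one possible shape for k-means with Euclidean distance and random initial centroids, the within-cluster sum of squares used for an elbow graph, the search for the cluster with the highest average price, and a timing helper. It is a plain standard-library illustration, not a reference solution: all function names are hypothetical, the price is assumed to be column 0 of each row, and no feature scaling is applied. For 500,000 rows an array library such as NumPy would be a natural choice.

```python
import math
import random
import time


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: euclidean(point, centroids[j]))


def kmeans(points, k, max_iter=100, seed=None):
    """Return (centroids, labels); stops when the labels no longer change."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = []
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances (the y axis of an elbow graph)."""
    return sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def highest_price_cluster(points, labels, k, price_col=0):
    """Return (cluster id, average price) for the most expensive cluster."""
    averages = {}
    for j in range(k):
        prices = [p[price_col] for p, lab in zip(points, labels) if lab == j]
        if prices:
            averages[j] = sum(prices) / len(prices)
    best = max(averages, key=averages.get)
    return best, averages[best]


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

An elbow curve could then be collected by calling `kmeans` for each candidate k and storing `wcss(points, centroids, labels)`, keeping the values in a list so that they can be plotted once all timing has finished.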

Part two – Python parallel, multiprocessing

1.- Write a parallel version of your program using multiprocessing.

2.- Measure the time and optimize the program to get the fastest version you
can.

3.- Measure the time and print it.

4.- Plot the first 2 dimensions of the clusters (price, speed)

5.- Find the cluster with the highest average price and print the id of the cluster and the
average price.

6.- Print a heat map using the values of the clusters’ centroids.
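
One common way to parallelize k-means with multiprocessing, shown here only as a sketch under assumptions, is to split the rows into chunks, let each worker label its chunk and return per-cluster partial sums and counts, and merge those partial results into new centroids in the parent process. The helper names (split_chunks, assign_chunk, merge_partials, parallel_step) are hypothetical, and the pool argument is assumed to be any object with a `map` method, such as `multiprocessing.Pool`.

```python
import math
import multiprocessing
from functools import partial


def split_chunks(points, n_chunks):
    """Split points into at most n_chunks contiguous slices of similar size."""
    size = max(1, math.ceil(len(points) / n_chunks))
    return [points[i:i + size] for i in range(0, len(points), size)]


def assign_chunk(chunk, centroids):
    """Worker: label one chunk and return (per-cluster sums, per-cluster counts)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_partials(partials, centroids):
    """Combine the workers' partial sums into the new centroids."""
    k, dim = len(centroids), len(centroids[0])
    merged = []
    for j in range(k):
        count = sum(counts[j] for _, counts in partials)
        if count == 0:  # empty cluster: keep the previous centroid
            merged.append(list(centroids[j]))
        else:
            merged.append([sum(sums[j][d] for sums, _ in partials) / count
                           for d in range(dim)])
    return merged


def parallel_step(pool, chunks, centroids):
    """One k-means iteration: map chunks to workers, then merge the results."""
    partials = pool.map(partial(assign_chunk, centroids=centroids), chunks)
    return merge_partials(partials, centroids)


# Usage sketch (keep it under the main guard so worker start-up does not re-run it):
# if __name__ == "__main__":
#     n = multiprocessing.cpu_count()
#     with multiprocessing.Pool(n) as pool:
#         chunks = split_chunks(points, n)
#         centroids = parallel_step(pool, chunks, centroids)
```

Returning small partial sums rather than full label lists keeps the data sent back from each worker short, which is one of the things worth measuring when optimizing.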

Part three – Python parallel, threading

1.- Write a parallel version of your program using threads.

2.- Measure the time and optimize the program to get the fastest version you
can.

3.- Measure the time and print it.

4.- Plot the first 2 dimensions of the clusters (price, speed)

5.- Find the cluster with the highest average price and print id of cluster and the
average price.

6.- Print a heat map using the values of the clusters’ centroids.
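
For the threaded version, a possible starting point is sketched below with `concurrent.futures.ThreadPoolExecutor`; the function names and the default of 4 threads are assumptions made for illustration. In standard CPython builds the global interpreter lock allows only one thread at a time to execute Python bytecode, so a pure-Python assignment step like this one may show little or no speedup, which is itself a result worth reporting.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def label_chunk(chunk, centroids):
    """Return the index of the nearest centroid for every point in chunk."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in chunk]


def threaded_labels(points, centroids, n_threads=4):
    """Label all points, giving each thread one contiguous chunk."""
    size = max(1, math.ceil(len(points) / n_threads))
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        # executor.map yields results in the same order as the input chunks
        parts = list(executor.map(label_chunk, chunks, [centroids] * len(chunks)))
    return [label for part in parts for label in part]
```

Because the chunks are contiguous and the results come back in order, the flattened label list lines up with the original rows.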

Part four

1.- Write a report explaining your results (maximum 12 pages).

- Important: show the measured speedup of the parallel versions.
- Show the possible speedup according to Amdahl's law.
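
The two speedup figures can be computed as in the short sketch below. The measured speedup is the serial time divided by the parallel time. Amdahl's law gives the theoretical bound S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the serial run time that can be parallelized and n is the number of workers. The function names are hypothetical, and the numbers in the comments are made-up inputs, not measurements.

```python
def measured_speedup(t_serial, t_parallel):
    """Observed speedup: serial run time divided by parallel run time."""
    return t_serial / t_parallel


def amdahl_speedup(p, n):
    """Amdahl's law: bound on speedup with n workers and parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


# Hypothetical example: a 12 s serial run and a 4 s parallel run.
# measured_speedup(12.0, 4.0) gives 3.0
# If 90% of the serial time were parallelizable, 4 workers could reach at most
# amdahl_speedup(0.9, 4), which is about 3.08.
```

Comparing the measured value with the Amdahl bound for the same number of workers indicates how much of the gap comes from overhead such as process start-up and data transfer.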

Important

- To measure time correctly, you cannot stop the program to
see the graphs (e.g. the k-means plot). You have to keep all the
graphs and show them at the end, together with the messages
printed on the screen.
- The optimum k can be computed using the elbow
method, obtaining the interval (k1, k2) from the
slope of the curve. For example, in the picture a possible
interval would be [3,6].
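
The assignment does not say how the slope should be turned into an interval, so the following is only one possible heuristic, with hypothetical names and thresholds. It looks at how much the elbow curve drops between consecutive values of k, takes k1 as the first k after which the drop falls below a "steep" fraction of the largest drop, and k2 as the first k after which it falls below a smaller "flat" fraction. It assumes consecutive k values and a decreasing curve.

```python
def elbow_interval(ks, wcss_values, steep=0.25, flat=0.05):
    """Return (k1, k2) estimated from the drops between consecutive k values."""
    # drops[i] is the improvement obtained when going from ks[i] to ks[i + 1]
    drops = [a - b for a, b in zip(wcss_values, wcss_values[1:])]
    largest = max(drops)
    k1 = k2 = ks[-1]  # fallback when the curve never flattens enough
    for k, drop in zip(ks, drops):
        if drop < steep * largest:
            k1 = k
            break
    for k, drop in zip(ks, drops):
        if drop < flat * largest:
            k2 = k
            break
    return k1, k2


# Made-up curve for illustration only (not produced from computers.csv):
example_ks = [1, 2, 3, 4, 5, 6, 7, 8]
example_wcss = [1000, 600, 350, 300, 270, 250, 245, 242]
print(elbow_interval(example_ks, example_wcss))  # prints (3, 6)
```

The thresholds 0.25 and 0.05 are arbitrary choices and would need to be justified or tuned against the actual elbow graph.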

Laboratory Delivery
Maximum group: 4 people.

Do not deliver Jupyter notebooks. Deliver .py files.

You have to deliver a compressed file named “yournia_computers_2024.zip” (e.g.
100023456_computers_2024.zip) including:

- A PDF file with the report (include author names)

- “authors.txt” file, including one line per author (NIA, SURNAMES, NAME)
- Three Python programs with the serial version and the two parallel versions
(multiprocessing and threading) of the program. Names:
• Computer-serial.py
• Computer-mp.py
• Computer-th.py

Delivery date: October 8th, 2024, 23:30 hours.
