02 Working With Data

This document outlines the second project in a data engineering course, focusing on loading and processing real-life data using Python. It includes instructions for creating a Jupyter notebook to perform tasks such as data loading, value scaling, category assignment, number extraction, and one-hot encoding. The project emphasizes good practices in analytic script preparation and requires saving results in specific formats.

Uploaded by

stifi.extra

Data Engineering – Project 2: Working with Data

March 20, 2025

1 Introduction
Welcome to the second project. This set of exercises is focused on working with real-life data.
As before, they will require that you load a dataset and process it as required, saving the indicated
file. This time, we will focus on loading datasets and some basic data cleaning operations like type
casting, value filtering and mapping, as well as using categories and encodings.
From this project onwards, the best result out of the first 5 (five) attempts will be taken into account
when computing your grades.

2 Instructions
As before, please write one Python program which performs all the specified tasks in a sequence,
producing the desired results.
Please remember the good practices of analytic script preparation:
1. Creating your software as a Jupyter notebook is probably a good way to start.
2. When your program is ready, check if it runs correctly as a whole (instead of using individual
cells in a REPL environment) – you can do this by selecting Kernel > Restart Kernel and
Run All Cells… from the top menu, or by simply clicking the “fast forward” button in the
notebook toolbar.
3. Finally, you can download your notebook as a .py file by selecting Save and Export Notebook
As… > Executable Script (or using JupyText). Remember to launch a terminal (or open a
terminal tab in JupyterLab), activate the environment, and run the program in pure Python.
Your solution should be committed to the repository as project02/[Link].

3 Exercises
3.1 Exercise 1: Load the file
2 points
Create a DataFrame from a file called proj2_data.csv, knowing that:
• it contains more than one column,
• the separator is either |, ; or ,,
• at least one column contains pure floating-point numbers,
• the decimal part is separated using . or ,,
• the decimal separator is not the same as the column separator,
• thousands are not grouped,
• all columns containing pure floating-point numbers have the same format,
• the file also contains text columns.
The DataFrame obtained by loading the file will be referred to as our initial DataFrame.
Save the DataFrame, as imported, to proj2_ex01.pkl.
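
One possible way to handle the unknown separators is to try each allowed combination and keep the first one that satisfies the guarantees listed above. The sketch below is only an illustration under that assumption: the helper name load_flexible and the sample text are hypothetical, and the check is a heuristic (for example, an integer column with missing values is also parsed as floating-point).

```python
import io

import pandas as pd


def load_flexible(text):
    """Return the first parse that has several columns and a float column."""
    for sep in ("|", ";", ","):
        for decimal in (".", ","):
            if decimal == sep:
                continue  # the task guarantees the two separators differ
            try:
                df = pd.read_csv(io.StringIO(text), sep=sep, decimal=decimal)
            except pd.errors.ParserError:
                continue  # a wrong separator can produce ragged rows
            has_float = any(pd.api.types.is_float_dtype(t) for t in df.dtypes)
            if df.shape[1] > 1 and has_float:
                return df
    raise ValueError("no separator/decimal combination matched")


# Hypothetical sample standing in for the contents of proj2_data.csv.
sample = "name;value\nalpha;1,5\nbeta;2,25\n"
df = load_flexible(sample)
print(df.dtypes)
```

For the real task, the text would be read from proj2_data.csv and the result could then be stored with df.to_pickle("proj2_ex01.pkl").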

3.2 Exercise 2: Value scale

3 points
The file proj2_scale.txt contains strings, one in each line, forming a scale. Subsequent values
are (implicitly) associated with natural numbers 1, 2, 3, … , 𝑛. For instance, given a file with values:
very bad
bad
average
good
very good
the value very bad should be associated with 1, and the value very good – with 5.
Create a copy of the initial DataFrame. Locate columns in which the values are a subset of the
values loaded from the text file. In these columns, replace the values with their numeric counterparts.
Save the resulting DataFrame to proj2_ex02.pkl.
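
The replacement step can be sketched as follows. This is a minimal illustration, not a reference solution: the scale is written inline instead of being read from proj2_scale.txt, the sample DataFrame and its column names are made up, and missing values are ignored when checking the subset condition (an assumption the task does not spell out).

```python
import pandas as pd

# Stand-in for the lines of proj2_scale.txt, in file order.
scale = ["very bad", "bad", "average", "good", "very good"]
mapping = {label: rank for rank, label in enumerate(scale, start=1)}

# Hypothetical initial DataFrame.
initial = pd.DataFrame({
    "rating": ["good", "bad", "very good"],
    "comment": ["fine", "good", "n/a"],
})

scaled = initial.copy()
# A column qualifies when it is non-empty and every value is on the scale.
scale_columns = [
    col for col in scaled.columns
    if scaled[col].notna().any() and scaled[col].dropna().isin(scale).all()
]
for col in scale_columns:
    scaled[col] = scaled[col].map(mapping)

print(scale_columns)
print(scaled)
```

Here the comment column is left alone because only one of its values appears on the scale.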

3.3 Exercise 3: Categories

3 points
Create another copy of the initial DataFrame. Change the type of columns identified in Exercise 2
to categorical. Set the categories for these columns to reflect the entire list loaded from the text
file, even if not all values are present in the source data.
Save the resulting DataFrame to proj2_ex03.pkl.
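
A short sketch of the conversion, assuming the scale columns are already known (here a single hypothetical column named rating). Passing the full list to pd.CategoricalDtype keeps categories that never occur in the data. Whether the categories should also be ordered is not stated in the task, so the default (unordered) is used here.

```python
import pandas as pd

# Stand-in for the lines of proj2_scale.txt, in file order.
scale = ["very bad", "bad", "average", "good", "very good"]

# Hypothetical initial DataFrame with one scale column.
initial = pd.DataFrame({"rating": ["good", "bad", "good"]})

categorized = initial.copy()
scale_dtype = pd.CategoricalDtype(categories=scale)
categorized["rating"] = categorized["rating"].astype(scale_dtype)

# All five categories are present even though only two values occur.
print(categorized["rating"].cat.categories)
```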

3.4 Exercise 4: Number extraction

4 points
Scan the strings in all non-numeric columns of your DataFrame for numbers, following the guidelines
below:
• include both integers (e.g. 123) and floating-point numbers (e.g. 4567.89),
• include negative numbers (e.g. -3.14),
• the decimal can be separated by a dot (.) or a comma (,),
• thousands are not separated,
• if there is more than one number, pick the first one.
Create a DataFrame containing only those columns from which at least one number was extracted,
and save it to proj2_ex04.pkl.
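
The guidelines above map fairly directly onto a regular expression. The sketch below is one way to apply it, under several assumptions: the pattern and the sample data are illustrative, every extracted number is converted to float, and a hyphen directly before a digit is always read as a minus sign (so a string like "item-5" would yield -5).

```python
import pandas as pd

# Optional minus sign, digits, optional decimal part after "." or ",".
NUMBER = r"(-?\d+(?:[.,]\d+)?)"

# Hypothetical sample data.
df = pd.DataFrame({
    "size": ["approx. 12,5 kg", "-3.14 units", "none"],
    "label": ["red", "green", "blue"],
    "count": [1, 2, 3],
})

extracted = {}
for col in df.select_dtypes(exclude="number").columns:
    numbers = (
        df[col].astype(str)
        .str.extract(NUMBER, expand=False)   # first match only
        .str.replace(",", ".", regex=False)  # normalise the decimal comma
        .astype(float)
    )
    if numbers.notna().any():  # keep columns where something was found
        extracted[col] = numbers

result = pd.DataFrame(extracted)
print(result)
```

Rows without any number end up as NaN, and columns without a single match are dropped from the result.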

3.5 Exercise 5: One-hot encoding
3 points
In the initial DataFrame, find columns which:
• contain text data,
• contain no more than 10 unique values,
• only have values consisting of small letters, i.e. the [a-z] range,
• have values that do not appear in the text file loaded in Exercise 2.
For these columns, perform one-hot encoding, obtaining a separate DataFrame with encoded values
for each original column. The column names should match the values within the column, without
any prefixes or suffixes.
For example, for a column that contains 3 distinct values, red, green, and blue, column names in
the resulting DataFrame should be exactly that – red, green, and blue.
Save the DataFrame created for each column to files named proj2_ex05_X.pkl, where X is a
consecutive natural number (1, 2, 3, etc.), e.g. proj2_ex05_1.pkl, proj2_ex05_2.pkl.
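
The column selection and encoding can be sketched as below. The sample data is hypothetical, and the last condition is interpreted as "none of the column's values appears on the scale", which is one reading of the requirement rather than a confirmed one. pd.get_dummies applied to a single Series names the output columns after the values, without a prefix.

```python
import pandas as pd

# Stand-in for the lines of proj2_scale.txt.
scale = ["very bad", "bad", "average", "good", "very good"]

# Hypothetical initial DataFrame.
initial = pd.DataFrame({
    "colour": ["red", "green", "blue", "red"],
    "rating": ["good", "bad", "good", "average"],
    "city": ["Oslo", "Rome", "Oslo", "Rome"],
    "price": [1.5, 2.0, 3.25, 4.0],
})

encoded_frames = []
for col in initial.select_dtypes(exclude="number").columns:
    values = initial[col].dropna().astype(str)
    if values.nunique() > 10:
        continue  # too many distinct values
    if not values.str.fullmatch(r"[a-z]+").all():
        continue  # something other than small letters
    if values.isin(scale).any():
        continue  # overlaps with the scale from Exercise 2
    encoded_frames.append(pd.get_dummies(initial[col]))

print(encoded_frames[0])
```

Each frame in encoded_frames could then be written out in order, for example with to_pickle and a counter starting at 1 to build the proj2_ex05_X.pkl names.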
