0% found this document useful (0 votes)

14 views6 pages

Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences

Uploaded by

Valiant Cheung

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views6 pages

Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences

Uploaded by

Valiant Cheung

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

Tung Wah College

GEN3005 / GED3005 Big Data and Data Sciences

Lecture 4: Data cleaning and pre-processing

Learning Outcomes:

By completing the Lecture 4, you should be able to

1) identify processes involved for data cleaning in data science;
2) dealing with missing values in data frame;
3) perform standardization on values in data frame;
4) convert data type in data frame;
5) identify the ways to do data normalization and perform such process; and
6) convert categorical variables into numeric values for building data models.

1. Introduction

Data pre-processing is the process of converting or mapping data from one raw form into
another format to make it ready for further analysis. It is also called data cleaning or data
wrangling.

2. Dealing with missing values

Missing value refers to the case that there are no values stored in a column for a particular
record. Missing values are usually represented by a blank cell.

To deal with missing values, it is possible to check with the data collection source for the
missing values again or leave the missing values as they were. Apart from these two
approaches, the following methods could be used with the help of computer programs.

A. Dropping the data

There are two ways to drop the data with missing value: by dropping the whole column, or
by removing the record with the missing value.

In Python program, the method dropna() in the pandas library can be used to drop
data with missing values. You can choose to drop rows or columns that contain missing
values like NaN.

1
By specifying axis=0 as a parameter, the method dropna() drops the entire row with
missing value in a certain column, and by specifying axis=1, the method dropna()
drops the entire column that contains the missing values.

For example, the following method is used to drop the rows with a missing value in the
“Sleep Duration” column (specified in the subset parameter) from the dataframe df.

df = df.dropna(subset = ["Sleep Duration"], axis = 0)

print(df[["Person ID", "Sleep Duration"]].head(10))

B. Replacing the data

This method tries to replace the missing value with a guess of what the data should be. For
example, the missing value can be replaced by the average of the variable. If a variable
cannot be averaged, we can replace the missing value by the mode (i.e., the value which
occurs the most frequently).

To replace missing values with actual values, pandas library has a built-in method called
replace which can be used to fill in the missing values with the newly calculated values.

For example, to replace the missing values by the average of values in a column, we first
compute the mean of the values in a column. Then, the missing values are replaced by the
mean using the method replace. In the method, the first parameter is the value to be
replaced, while the second parameter is the value that we would like to replace with.

The sample codes are shown below to replace the missing value in the “Sleep Duration”
column by the average of the values in the column.

mean = df["Sleep Duration"].mean()

df["Sleep Duration"] = df["Sleep Duration"].replace(np.nan, mean)
print(df[["Person ID", "Sleep Duration"]].head(10))

3. Standardization

In big data applications, data are collected from various sources, from different places, and
by different people. This could result in data being stored in different formats. In data
preparation, it is important to bring data into a common standard that allows meaningful
analysis. This ensures data is consistent and easily understandable.

2
For example, people may use different expressions to represent “male” for gender, as “M”,
“m”, “Male”, or “male”. Standardization should be done so that a common code is used to
represent “male” or “female” in the “Gender” variable.

To standardize the value of “Gender” to either “M” or “F”, we can first apply the method
df["Gender"].str.upper() to convert the string to upper case letters, and then
obtain the first character of the string (by .str[0]) as the value of the field.

print(df["Gender"].value_counts())

df["Gender"] = df["Gender"].str.upper()
df["Gender"] = df["Gender"].str[0]

print(df["Gender"].value_counts())

As another example, to represent blood pressure, the following forms are all possible to
represent the systolic blood pressure and diastolic blood pressure: “100/80”, “100 / 80”. To
rectify this issue and separate the systolic blood pressure and diastolic blood pressure, we
can use the split() method to separate the two blood pressure values as two new
columns (as string), in the following way. after spliting
126/83
df["Sys BP"] = df["Blood Pressure"].str.split("/").str[0]
df["Dia BP"] = df["Blood Pressure"].str.split("/").str[1]
print(df[["Blood Pressure", "Sys BP", "Dia BP"]].head())
print(df[["Blood Pressure", "Sys BP", "Dia BP"]].dtypes)

4. Converting data type in a dataframe

Sometimes, you may observe a wrong data type on a variable. For example, a numeric
variable is treated wrongly as a string (or object) by pandas because of some invalid entries,
or one used a string to represent missing information in records. This happens when
someone typed “.” for a numeric variable to denote the information is missing. Such
numeric variable will then be treated as a string by pandas. Further analysis on this
variable will be incorrect (or even impossible) if it has a string type.

To rectify this problem, we can firstly treat the entries with “.” as missing values, by
replacing “.” by “NaN”. Then, the data type of the column should be converted to the float
data type by astype().

The following codes shows the way to replace records with “.” in the “Quality of Sleep”
variable by “NaN”, and to covert its data type to float64.

3
df["Quality of Sleep"] = df["Quality of Sleep"].replace(".",
np.nan)
df["Quality of Sleep"] = df["Quality of Sleep"].astype(float)

After this step, the data can be processed by the data cleaning approaches described in
Section 2.

5. Data normalization

Normalization is the process of transforming values of variables from one unit to another.
This is useful when data of a variable are collected from different sources but they have
different units, or when we wish to analyse a set of variables with different units but their
scales should be common.

A. Normalizing data with different units

If data are collected from different sources, values of a variable may be of different units.
Consider the weight variables which is measured in pounds. You can change the unit from
pounds to kilograms by the following method (1 pound = 2.2046 kg).

df["Weight (in kg)"] = df["Weight (in pounds)"]/2.2046 .round(2)

df["Weight (in kg)"] = df["Weight (in kg)"].round(2)

B. Normalizing data into similar range

Another form of normalization is to transform variables with different units into a similar
range. Two variables may have different units, say “Physical Activity Level” (from 30 to
90) and “Stress Level” (from 3 to 8). It is difficult to have a fair comparison on these two
variables with different range. For some data analysis algorithms, we have to make sure
variables are in a common scale in the data model.

There are three techniques of normalization. They are (i) simple feature scaling, (ii) min-
max, and (iii) z-score or standard score.

(i) Simple feature scaling: the value 𝑥𝑜𝑙𝑑 is transformed into

𝑥𝑜𝑙𝑑
𝑥𝑛𝑒𝑤 =
𝑥𝑚𝑎𝑥

(ii) Min-max: the value 𝑥𝑜𝑙𝑑 is transformed into

𝑥𝑜𝑙𝑑 − 𝑥𝑚𝑖𝑛
𝑥𝑛𝑒𝑤 =
𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛

old scale is 3-8 4

new scale is 0-1
3=0, 8=1
(iii) Z-score or standard score: the value 𝑥𝑜𝑙𝑑 is transformed into
𝑥𝑜𝑙𝑑 − 𝜇
𝑥𝑛𝑒𝑤 =
𝜎
how much difference to the mean
Let’s consider normalizing the variable “BMI” using the above three techniques. The
following codes are useful.

(i) Simple feature scaling:

df["BMI - simple"] = df["BMI"]/df["BMI"].max()

transformed original max value in the column
(ii) Min-max:
transformed
df["BMI - minmax"] = (df["BMI"]-
df["BMI"].min())/(df["BMI"].max()-df["BMI"].min())

(iii) Z-score or standard score:

mean = df["BMI"].mean()
sd = df["BMI"].std(ddof = 0)
df["BMI - zscore"] = (df["BMI"]-mean)/sd
new column BMI is
smaller or
Note that, the method std() computes the population standard deviation when ddof = greater than
0 is specified. mean by ?
standard
6. Converting categorical variables into quantitative variables deviation

Most statistical models cannot take in objects or strings as input. To train a data model,
only numbers will be taken as inputs. Therefore, if values of a variable are strings, we have
to convert them to numbers.

For example, the “BMI Category” is a string, with possible values like “Normal”,
“Overweight”, or “Obese”. To convert them into numeric values, we can encode the values
by adding new variables corresponding to each unique element in the original variable we
would like to encode.

In this case, when the “BMI Category” has 3 values, we create three new variables,
“Normal”, “Overweight”, and “Obese”. If the “BMI Category” is Normal, the variable
“Normal” will be set as 1, while others (“Overweight” and “Obese”) will be set to 0. This
technique is known as “One-hot encoding”.

5
In Python, the get_dummies() method can be used to convert categorical variables to
dummy variables.

dummy_variable = pd.get_dummies(df["BMI Category"]).astype(int)

print(dummy_variable.head(10))

The following result about the dummy variable created could be obtained.

We can merge the dummy variable into the dataframe, and to remove the original variable
“BMI Category” from the dataframe. The following codes are used.
print whole column
df = pd.concat([df, dummy_variable], axis=1)
df = df.drop("BMI Category", axis=1)remove whole column
print(df.head(10))

IntroToPython Unit 5
No ratings yet
IntroToPython Unit 5
42 pages
EDA - Exploratory Data Analysis
No ratings yet
EDA - Exploratory Data Analysis
16 pages
Dsbda Ass2
No ratings yet
Dsbda Ass2
49 pages
Pandas
No ratings yet
Pandas
30 pages
Pandas: Data Cleaning Essentials
No ratings yet
Pandas: Data Cleaning Essentials
6 pages
ML - Preprocessing - Introduction
No ratings yet
ML - Preprocessing - Introduction
14 pages
Week 10
No ratings yet
Week 10
50 pages
Lecture 4 New Data Pre Processing
No ratings yet
Lecture 4 New Data Pre Processing
41 pages
Python Data Analysis with Numpy & Pandas
No ratings yet
Python Data Analysis with Numpy & Pandas
19 pages
Python Basics - Hamza Zahoor
No ratings yet
Python Basics - Hamza Zahoor
6 pages
Lab File
No ratings yet
Lab File
96 pages
ML Proj Diabetes
No ratings yet
ML Proj Diabetes
51 pages
Lecture 4 Data Pre-Processing
No ratings yet
Lecture 4 Data Pre-Processing
43 pages
Exploratory Data Analysis
No ratings yet
Exploratory Data Analysis
48 pages
LP II Practical
No ratings yet
LP II Practical
5 pages
Data Preprocessing PT 2
No ratings yet
Data Preprocessing PT 2
7 pages
TOPIC 3 Pima Indian
No ratings yet
TOPIC 3 Pima Indian
16 pages
Data Mining Lab 03
No ratings yet
Data Mining Lab 03
10 pages
Aide Memoire Preparation Des Données
No ratings yet
Aide Memoire Preparation Des Données
2 pages
Machine Learning Lab Manual
No ratings yet
Machine Learning Lab Manual
42 pages
Abhiml ML File
No ratings yet
Abhiml ML File
74 pages
Unit-1 AI ML PYTHON - Jupyter Notebook
No ratings yet
Unit-1 AI ML PYTHON - Jupyter Notebook
10 pages
Download
No ratings yet
Download
3 pages
Data Preparation Techniques in Python
No ratings yet
Data Preparation Techniques in Python
9 pages
Lesson 3. Data Preparation and Structuring 1 Data Cleaning
No ratings yet
Lesson 3. Data Preparation and Structuring 1 Data Cleaning
36 pages
Unit 4 - Working With Graphs - Python
No ratings yet
Unit 4 - Working With Graphs - Python
49 pages
Asfasdas
No ratings yet
Asfasdas
36 pages
Data Cleaning Essentials
No ratings yet
Data Cleaning Essentials
42 pages
Data Science Unit 2 Second Half Notes
No ratings yet
Data Science Unit 2 Second Half Notes
18 pages
Presentation - University
No ratings yet
Presentation - University
52 pages
Student Notebook HR Analysis
No ratings yet
Student Notebook HR Analysis
11 pages
Data Cleaning With Python and Pandas
No ratings yet
Data Cleaning With Python and Pandas
49 pages
Statistical Transform Data Cleaning
No ratings yet
Statistical Transform Data Cleaning
30 pages
Academic Performance Data Wrangling
No ratings yet
Academic Performance Data Wrangling
9 pages
PW2 DataCleaning
No ratings yet
PW2 DataCleaning
6 pages
DSBDA Lab Manual24-25
No ratings yet
DSBDA Lab Manual24-25
58 pages
Python (Unit - 2)
No ratings yet
Python (Unit - 2)
22 pages
Data Wrangling
No ratings yet
Data Wrangling
18 pages
Chapter 1. Data Preparation
No ratings yet
Chapter 1. Data Preparation
74 pages
Dunit I-Part-2
No ratings yet
Dunit I-Part-2
82 pages
Part A Assignment 6
No ratings yet
Part A Assignment 6
28 pages
Lab2!17!07-2025 - Demonstrate Various Data Pre-Processing Techniques For A Given Dataset.
No ratings yet
Lab2!17!07-2025 - Demonstrate Various Data Pre-Processing Techniques For A Given Dataset.
17 pages
Ch01 - Introduction To Data Science
No ratings yet
Ch01 - Introduction To Data Science
65 pages
Data Preprocessing Techniques in ML
No ratings yet
Data Preprocessing Techniques in ML
12 pages
Pandas Module (Part-I)
No ratings yet
Pandas Module (Part-I)
36 pages
Data Cleaning Techniques in Python
No ratings yet
Data Cleaning Techniques in Python
5 pages
Week 6 - Data Cleaning
No ratings yet
Week 6 - Data Cleaning
8 pages
Tutorial 4
No ratings yet
Tutorial 4
8 pages
Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences
No ratings yet
Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences
6 pages
Unit V
No ratings yet
Unit V
47 pages
Data Science Cheat Sheet: KEY Imports
100% (1)
Data Science Cheat Sheet: KEY Imports
1 page
Lab Manual 5 Solved 40
No ratings yet
Lab Manual 5 Solved 40
13 pages
Machine Learning Unit 2
No ratings yet
Machine Learning Unit 2
71 pages
DWDM Lab Manual
No ratings yet
DWDM Lab Manual
32 pages
Exp-12 Iaiml
No ratings yet
Exp-12 Iaiml
13 pages
Pandas
No ratings yet
Pandas
18 pages
Move To Focus (June-July)
No ratings yet
Move To Focus (June-July)
28 pages
Entry Form SIC 2024
No ratings yet
Entry Form SIC 2024
4 pages
EDA Graph
No ratings yet
EDA Graph
69 pages
Wolters Kluwer: Ppincott Illiams & Wilki Li
No ratings yet
Wolters Kluwer: Ppincott Illiams & Wilki Li
596 pages
Title Page Template
No ratings yet
Title Page Template
1 page
Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences
No ratings yet
Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences
8 pages
Prefix Prefix symbol Magnitude: 必讀概要 1 國際標準單位 (S.I. Units)
No ratings yet
Prefix Prefix symbol Magnitude: 必讀概要 1 國際標準單位 (S.I. Units)
3 pages
Heep Hong Society Kahoot
No ratings yet
Heep Hong Society Kahoot
2 pages
GEB1305 China and The World - Lecture 6 (Part 2) 複製
No ratings yet
GEB1305 China and The World - Lecture 6 (Part 2) 複製
12 pages
Thread Standards Reference Guide
No ratings yet
Thread Standards Reference Guide
1 page
Neet Dataset
No ratings yet
Neet Dataset
11 pages
Math 5 DLP 9 - Finding The Greatest Common Factor of Two or More Numbers PDF
No ratings yet
Math 5 DLP 9 - Finding The Greatest Common Factor of Two or More Numbers PDF
8 pages
Chm102 CBT CA Questions
No ratings yet
Chm102 CBT CA Questions
95 pages
Chapter 4
No ratings yet
Chapter 4
14 pages
Prometheus Part 12 Internals Storage & Security
No ratings yet
Prometheus Part 12 Internals Storage & Security
27 pages
Laboratory Fermentors LiFlus GX/GM
No ratings yet
Laboratory Fermentors LiFlus GX/GM
6 pages
Investigating Newton's Law of Cooling
No ratings yet
Investigating Newton's Law of Cooling
9 pages
DC21 Manual
No ratings yet
DC21 Manual
79 pages
Week 1 - Webinar Slides - 1 Per Slide
No ratings yet
Week 1 - Webinar Slides - 1 Per Slide
38 pages
Orthopedic Examination Guide
No ratings yet
Orthopedic Examination Guide
25 pages
Key-Value Databases Guide
No ratings yet
Key-Value Databases Guide
5 pages
Horizontal Format - Calculating Lengths and Angles in Shapes
No ratings yet
Horizontal Format - Calculating Lengths and Angles in Shapes
2 pages
Solving Recurrence Relations in Algorithms
No ratings yet
Solving Recurrence Relations in Algorithms
4 pages
D1 Probset Revalida Review
No ratings yet
D1 Probset Revalida Review
3 pages
New Chat: 1. Predicting Uber Ride Prices
No ratings yet
New Chat: 1. Predicting Uber Ride Prices
16 pages
Axlr8r Racing Intro Guide
No ratings yet
Axlr8r Racing Intro Guide
49 pages
Model 435 437 Datasheet Sensor de Presion
No ratings yet
Model 435 437 Datasheet Sensor de Presion
7 pages
Performance Analysis of A Steam Turbine Power Plant at Part Load Conditions
No ratings yet
Performance Analysis of A Steam Turbine Power Plant at Part Load Conditions
8 pages
NABL Policy on Calibration & Measurement
No ratings yet
NABL Policy on Calibration & Measurement
11 pages
Single-Phase Induction Motor Guide
No ratings yet
Single-Phase Induction Motor Guide
27 pages
Online Donation of Unused Medicines
No ratings yet
Online Donation of Unused Medicines
5 pages
Diesel Engine Tuning Specifications 75071
No ratings yet
Diesel Engine Tuning Specifications 75071
7 pages
Java Loops
No ratings yet
Java Loops
2 pages
Page 011
No ratings yet
Page 011
1 page
Edm 2018 8434989
No ratings yet
Edm 2018 8434989
4 pages
Statistical Estimation for Analysts
No ratings yet
Statistical Estimation for Analysts
7 pages
4TEC3F60B1000A
No ratings yet
4TEC3F60B1000A
21 pages
Thesis Defense Presentation
No ratings yet
Thesis Defense Presentation
18 pages
Java Cryptography Algorithms Guide
No ratings yet
Java Cryptography Algorithms Guide
16 pages

Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences

Uploaded by

Tung Wah College GEN3005 / GED3005 Big Data and Data Sciences

Uploaded by

Tung Wah College

GEN3005 / GED3005 Big Data and Data Sciences

Lecture 4: Data cleaning and pre-processing

By completing the Lecture 4, you should be able to

2. Dealing with missing values

A. Dropping the data

df = df.dropna(subset = ["Sleep Duration"], axis = 0)

B. Replacing the data

mean = df["Sleep Duration"].mean()

4. Converting data type in a dataframe

A. Normalizing data with different units

df["Weight (in kg)"] = df["Weight (in pounds)"]/2.2046 .round(2)

B. Normalizing data into similar range

(i) Simple feature scaling: the value 𝑥𝑜𝑙𝑑 is transformed into

(ii) Min-max: the value 𝑥𝑜𝑙𝑑 is transformed into

old scale is 3-8 4

(i) Simple feature scaling:

df["BMI - simple"] = df["BMI"]/df["BMI"].max()

(iii) Z-score or standard score:

dummy_variable = pd.get_dummies(df["BMI Category"]).astype(int)

You might also like