Tung Wah College
GEN3005 / GED3005 Big Data and Data Sciences
Lecture 4: Data cleaning and pre-processing
Learning Outcomes:
By completing the Lecture 4, you should be able to
1) identify processes involved for data cleaning in data science;
2) dealing with missing values in data frame;
3) perform standardization on values in data frame;
4) convert data type in data frame;
5) identify the ways to do data normalization and perform such process; and
6) convert categorical variables into numeric values for building data models.
1. Introduction
Data pre-processing is the process of converting or mapping data from one raw form into
another format to make it ready for further analysis. It is also called data cleaning or data
wrangling.
2. Dealing with missing values
Missing value refers to the case that there are no values stored in a column for a particular
record. Missing values are usually represented by a blank cell.
To deal with missing values, it is possible to check with the data collection source for the
missing values again or leave the missing values as they were. Apart from these two
approaches, the following methods could be used with the help of computer programs.
A. Dropping the data
There are two ways to drop the data with missing value: by dropping the whole column, or
by removing the record with the missing value.
In Python program, the method dropna() in the pandas library can be used to drop
data with missing values. You can choose to drop rows or columns that contain missing
values like NaN.
1
By specifying axis=0 as a parameter, the method dropna() drops the entire row with
missing value in a certain column, and by specifying axis=1, the method dropna()
drops the entire column that contains the missing values.
For example, the following method is used to drop the rows with a missing value in the
“Sleep Duration” column (specified in the subset parameter) from the dataframe df.
df = df.dropna(subset = ["Sleep Duration"], axis = 0)
print(df[["Person ID", "Sleep Duration"]].head(10))
B. Replacing the data
This method tries to replace the missing value with a guess of what the data should be. For
example, the missing value can be replaced by the average of the variable. If a variable
cannot be averaged, we can replace the missing value by the mode (i.e., the value which
occurs the most frequently).
To replace missing values with actual values, pandas library has a built-in method called
replace which can be used to fill in the missing values with the newly calculated values.
For example, to replace the missing values by the average of values in a column, we first
compute the mean of the values in a column. Then, the missing values are replaced by the
mean using the method replace. In the method, the first parameter is the value to be
replaced, while the second parameter is the value that we would like to replace with.
The sample codes are shown below to replace the missing value in the “Sleep Duration”
column by the average of the values in the column.
mean = df["Sleep Duration"].mean()
df["Sleep Duration"] = df["Sleep Duration"].replace(np.nan, mean)
print(df[["Person ID", "Sleep Duration"]].head(10))
3. Standardization
In big data applications, data are collected from various sources, from different places, and
by different people. This could result in data being stored in different formats. In data
preparation, it is important to bring data into a common standard that allows meaningful
analysis. This ensures data is consistent and easily understandable.
2
For example, people may use different expressions to represent “male” for gender, as “M”,
“m”, “Male”, or “male”. Standardization should be done so that a common code is used to
represent “male” or “female” in the “Gender” variable.
To standardize the value of “Gender” to either “M” or “F”, we can first apply the method
df["Gender"].str.upper() to convert the string to upper case letters, and then
obtain the first character of the string (by .str[0]) as the value of the field.
print(df["Gender"].value_counts())
df["Gender"] = df["Gender"].str.upper()
df["Gender"] = df["Gender"].str[0]
print(df["Gender"].value_counts())
As another example, to represent blood pressure, the following forms are all possible to
represent the systolic blood pressure and diastolic blood pressure: “100/80”, “100 / 80”. To
rectify this issue and separate the systolic blood pressure and diastolic blood pressure, we
can use the split() method to separate the two blood pressure values as two new
columns (as string), in the following way. after spliting
126/83
df["Sys BP"] = df["Blood Pressure"].str.split("/").str[0]
df["Dia BP"] = df["Blood Pressure"].str.split("/").str[1]
print(df[["Blood Pressure", "Sys BP", "Dia BP"]].head())
print(df[["Blood Pressure", "Sys BP", "Dia BP"]].dtypes)
4. Converting data type in a dataframe
Sometimes, you may observe a wrong data type on a variable. For example, a numeric
variable is treated wrongly as a string (or object) by pandas because of some invalid entries,
or one used a string to represent missing information in records. This happens when
someone typed “.” for a numeric variable to denote the information is missing. Such
numeric variable will then be treated as a string by pandas. Further analysis on this
variable will be incorrect (or even impossible) if it has a string type.
To rectify this problem, we can firstly treat the entries with “.” as missing values, by
replacing “.” by “NaN”. Then, the data type of the column should be converted to the float
data type by astype().
The following codes shows the way to replace records with “.” in the “Quality of Sleep”
variable by “NaN”, and to covert its data type to float64.
3
df["Quality of Sleep"] = df["Quality of Sleep"].replace(".",
np.nan)
df["Quality of Sleep"] = df["Quality of Sleep"].astype(float)
After this step, the data can be processed by the data cleaning approaches described in
Section 2.
5. Data normalization
Normalization is the process of transforming values of variables from one unit to another.
This is useful when data of a variable are collected from different sources but they have
different units, or when we wish to analyse a set of variables with different units but their
scales should be common.
A. Normalizing data with different units
If data are collected from different sources, values of a variable may be of different units.
Consider the weight variables which is measured in pounds. You can change the unit from
pounds to kilograms by the following method (1 pound = 2.2046 kg).
df["Weight (in kg)"] = df["Weight (in pounds)"]/2.2046 .round(2)
df["Weight (in kg)"] = df["Weight (in kg)"].round(2)
B. Normalizing data into similar range
Another form of normalization is to transform variables with different units into a similar
range. Two variables may have different units, say “Physical Activity Level” (from 30 to
90) and “Stress Level” (from 3 to 8). It is difficult to have a fair comparison on these two
variables with different range. For some data analysis algorithms, we have to make sure
variables are in a common scale in the data model.
There are three techniques of normalization. They are (i) simple feature scaling, (ii) min-
max, and (iii) z-score or standard score.
(i) Simple feature scaling: the value 𝑥𝑜𝑙𝑑 is transformed into
𝑥𝑜𝑙𝑑
𝑥𝑛𝑒𝑤 =
𝑥𝑚𝑎𝑥
(ii) Min-max: the value 𝑥𝑜𝑙𝑑 is transformed into
𝑥𝑜𝑙𝑑 − 𝑥𝑚𝑖𝑛
𝑥𝑛𝑒𝑤 =
𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛
old scale is 3-8 4
new scale is 0-1
3=0, 8=1
(iii) Z-score or standard score: the value 𝑥𝑜𝑙𝑑 is transformed into
𝑥𝑜𝑙𝑑 − 𝜇
𝑥𝑛𝑒𝑤 =
𝜎
how much difference to the mean
Let’s consider normalizing the variable “BMI” using the above three techniques. The
following codes are useful.
(i) Simple feature scaling:
df["BMI - simple"] = df["BMI"]/df["BMI"].max()
transformed original max value in the column
(ii) Min-max:
transformed
df["BMI - minmax"] = (df["BMI"]-
df["BMI"].min())/(df["BMI"].max()-df["BMI"].min())
(iii) Z-score or standard score:
mean = df["BMI"].mean()
sd = df["BMI"].std(ddof = 0)
df["BMI - zscore"] = (df["BMI"]-mean)/sd
new column BMI is
smaller or
Note that, the method std() computes the population standard deviation when ddof = greater than
0 is specified. mean by ?
standard
6. Converting categorical variables into quantitative variables deviation
Most statistical models cannot take in objects or strings as input. To train a data model,
only numbers will be taken as inputs. Therefore, if values of a variable are strings, we have
to convert them to numbers.
For example, the “BMI Category” is a string, with possible values like “Normal”,
“Overweight”, or “Obese”. To convert them into numeric values, we can encode the values
by adding new variables corresponding to each unique element in the original variable we
would like to encode.
In this case, when the “BMI Category” has 3 values, we create three new variables,
“Normal”, “Overweight”, and “Obese”. If the “BMI Category” is Normal, the variable
“Normal” will be set as 1, while others (“Overweight” and “Obese”) will be set to 0. This
technique is known as “One-hot encoding”.
5
In Python, the get_dummies() method can be used to convert categorical variables to
dummy variables.
dummy_variable = pd.get_dummies(df["BMI Category"]).astype(int)
print(dummy_variable.head(10))
The following result about the dummy variable created could be obtained.
We can merge the dummy variable into the dataframe, and to remove the original variable
“BMI Category” from the dataframe. The following codes are used.
print whole column
df = pd.concat([df, dummy_variable], axis=1)
df = df.drop("BMI Category", axis=1)remove whole column
print(df.head(10))