Data Cleaning Using
R
Data Cleaning Definition
• The process to transform raw data into consistent data that
can be easily analyzed.
• It is aimed at filtering the content of statistical statements .
• To improves your data quality and overall productivity.
Objective of Data Cleaning
The following are the various purposes of data cleaning in R:
• Eliminate Errors
• Eliminate Redundancy
• Increase Data Reliability
• Delivery Accuracy
• Ensure Consistency
• Assure Completeness
• Standardize your approach
Clean Data Vs Messy Data
Messy Data Clean Data
• Special characters (e.g. commas • Free of duplicate rows/values
in numeric values) • Error-free (misspellings free )
• Numeric values stored as • Relevant (special characters
text/character data types free )
• Duplicate rows • The appropriate data type for
• Misspellings analysis
• Inaccuracies • Free of outliers (or only contain
• White space outliers that have been
• Missing data identified/understood)
• Zeros instead of null values vary. • Neat and clean data structure
Data Cleaning Example
Using inbuilt datasets(“airquality” datasets)
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
The NA value inside the columns
Summary Function
> summary(airquality)
We can get a clear visual of the
irregular data using a boxplot.
boxplot(airquality)
Removing irregularities data
with is.na() methods
New_df = airquality
New_df$Ozone = ifelse(is.na(New_df$Ozone), median(New_df$Ozone,
na.rm = TRUE),)
Performing the same operation in
another column.
New_df$Solar.R = ifelse(is.na(New_df$Solar.R), median(New_df$Solar.R,
na.rm = TRUE), New_df$Solar.R) )
summary(New_df)
head(New_df)
An Illustration with Example
1) Creation of Example Data
2) Example 1: Modify Column Names
3) Example 2: Format Missing Values
4) Example 3: Remove Empty Rows & Columns
5) Example 4: Remove Rows with Missing Values
6) Example 5: Remove Duplicates
7) Example 6: Modify Classes of Columns
8) Example 7: Detect & Remove Outliers
9) Example 8: Remove Spaces in Character Strings
10) Example 9: Combine Categories
Creation of Example Data
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA), # Create example data
frame
x1 = c(1:5, 1, "NA", 1, 1, "NA"),
x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
x4 = "",
x5 = NA)
data
Example 1: Modify Column Names
Let’s first have a closer look at the names
of our data frame columns:
colnames(data)
# Print column names# [1] "x1" "x1.1"
"x1.2" "x4" "x5“
Let’s assume that we want to change
these column names to a consecutive
range with the prefix “col”. Then, we can
apply the colnames, paste0,
and ncol functions as shown below.
#Modify all column namesdata
colnames(data) <- paste0("col",
1:ncol(data))
# Print updated data frame
Example 2: Format Missing Values
• In the R programming language, missing values are usually represented by NA.
For that reason, it is useful to convert all missing values to this NA format.
• Some missing values are represented by blank character strings.
data[data == ""]
# Print blank data cells# [1] NA NA NA "" "" "" ""
"" "" "" "" "" "" NA NA NA NA NA NA NA NA NA NA
• Assign NA values to those blank cells, we can use the following syntax:
data[data == ""] <- NA #
Replace blanks by NA
have a closer look at the column col2:
data$col2 #
Print column# [1] "1" "2" "3" "4" "5" "1" "NA" "1"
"1" "NA"
Example 2: Format Missing Values
The NA values in this column are shown between quotes (i.e. “NA”). This indicates
that those NA values are formatted as characters instead of real NA values.
We can change that using the following R code:
data$col2[data$col2 == "NA"] <- NA #
Replace character "NA"
data #
Print
Example 3: Remove Empty Rows & Columns
Use the rowSums, is.na, and ncol functions to remove only-NA rows:
data <- data[rowSums(is.na(data)) != ncol(data), ]
# Drop empty rowsdata
# Print updated data frame
Example 3: Remove Empty Rows & Columns
Also exclude columns that contain only NA values
data <- data[ , colSums(is.na(data)) != nrow(data)]
# Drop empty columnsdata
# Print updated
data frame
Example 4: Remove Rows with Missing Values
in case you have decided to remove all rows with one or more NA values, you may use
the na.omit function as shown below:
data <- na.omit(data) # Delete rows with missing values
data
# Print updated data frame
Example 5: Remove Duplicates
Use the unique function to our data frame as demonstrated in the following R
snippet:
data <- unique(data) # Exclude duplicates
data
# Print updated data frame
Example 6: Modify Classes of Columns
• The class of the columns of a data frame is another critical topic when it
comes to data cleaning.
• This example explains how to format each column to the most appropriate
data type automatically.
• Let’s first check the current classes of our data frame columns:
sapply(data, class) # Print classes of all columns
# col1 col2 col3
# "numeric" "character" "character"
use the type.convert function to change the column classes whenever it is appropriate:
data <- type.convert(data, as.is = TRUE)
data
# Print updated data frame
Print the data types of our columns once again, we can see that the first two
columns have been changed to the integer class.
sapply(data, class) # Print classes of
updated columns
# col1 col2 col3
# "integer" "integer" "character"
Example 7: Detect & Remove Outliers
One method to detect outliers is provided by the boxplot.stats function
# Identify outliers in column# [1] 99999
data$col1[data$col1 %in% boxplot.stats(data$col1)$out]
• The previous output has returned one outlier (i.e. the value 99999). This
value is obviously much higher than the other values in this column.
• Apply the R code below to remove the outlier:
# Remove rows with outliersdata
data <- data[! data$col1 %in%
boxplot.stats(data$col1)$out, ]
Example 8: Remove Spaces in Character Strings
• Use the gsub function as demonstrated below
# Delete white space in character stringsdata
data$col3 <- gsub(" ", "", data$col3)
Example 9: Combine Categories
• Use the gsub function as demonstrated below
The following R code illustrates how to group the categories “a”, “b”, and
“c” in a single category “a”.
data$col3[data$col3 %in% c("b", "c")] <- "a" # Merge
categoriesdata
# Print updated data frame
Thanks