R Data Cleaning Techniques

Uploaded by

jainpranav3882

Data Cleaning Using R
Data Cleaning Definition

• Data cleaning is the process of transforming raw data into consistent data that can be easily analyzed.
• It is aimed at filtering the content of statistical statements.
• It improves data quality and overall productivity.
Objective of Data Cleaning
The following are the various purposes of data cleaning in R:
• Eliminate Errors
• Eliminate Redundancy
• Increase Data Reliability
• Deliver Accuracy
• Ensure Consistency
• Assure Completeness
• Standardize your approach
Clean Data Vs Messy Data

Messy Data
• Special characters (e.g. commas in numeric values)
• Numeric values stored as text/character data types
• Duplicate rows
• Misspellings
• Inaccuracies
• White space
• Missing data
• Zeros instead of null values

Clean Data
• Free of duplicate rows/values
• Error-free (free of misspellings)
• Relevant (free of special characters)
• The appropriate data type for analysis
• Free of outliers (or only contains outliers that have been identified/understood)
• Neat and clean data structure
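Two of the messy-data issues listed above (commas in numeric values, and numbers stored as text) can be illustrated with a minimal sketch. The vector `raw_price` and its values are made up for illustration and are not part of the original slides.

```r
# Hypothetical example values: numbers stored as text, with thousands separators
raw_price <- c("1,200", "950", " 3,400 ", NA)

# Remove surrounding white space and commas, then convert to numeric
clean_price <- as.numeric(gsub(",", "", trimws(raw_price)))

class(raw_price)     # character
class(clean_price)   # numeric
clean_price
```

The missing entry stays NA after conversion, so it can be handled later together with the other missing values.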
Data Cleaning Example
Using a built-in dataset (the “airquality” dataset):
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Note the NA values inside the Ozone and Solar.R columns.
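To see how many values are missing in each column, rather than reading them off the first rows, one option is to count them with is.na. This is a small sketch; the variable name `na_counts` is chosen here for illustration.

```r
# Count missing values per column of the built-in airquality data set
na_counts <- colSums(is.na(airquality))
na_counts

# Share of rows that contain at least one NA
mean(!complete.cases(airquality))
```

Only Ozone and Solar.R contain missing values in this data set, which is why the following slides focus on those two columns.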

Summary Function
> summary(airquality)
We can get a clear visual of the irregular data using a boxplot:
> boxplot(airquality)
Removing irregular data with the is.na() function
New_df = airquality
# Replace missing Ozone values with the column median
New_df$Ozone = ifelse(is.na(New_df$Ozone), median(New_df$Ozone, na.rm = TRUE), New_df$Ozone)
Performing the same operation on another column:
New_df$Solar.R = ifelse(is.na(New_df$Solar.R), median(New_df$Solar.R, na.rm = TRUE), New_df$Solar.R)
summary(New_df)
head(New_df)
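When several columns need the same treatment, the median replacement can be wrapped in a function. The helper `impute_median` below is a hypothetical name introduced for this sketch; it is not part of base R or of the original slides.

```r
# Hypothetical helper: replace NA in every numeric column by that column's median
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      med <- median(df[[col]], na.rm = TRUE)
      df[[col]][is.na(df[[col]])] <- med
    }
  }
  df
}

New_df2 <- impute_median(airquality)
colSums(is.na(New_df2))   # no missing values remain
```

Median imputation keeps all rows but changes the distribution of the column, so it is worth comparing summary() before and after.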
An Illustration with Example

1) Creation of Example Data

2) Example 1: Modify Column Names
3) Example 2: Format Missing Values
4) Example 3: Remove Empty Rows & Columns
5) Example 4: Remove Rows with Missing Values
6) Example 5: Remove Duplicates
7) Example 6: Modify Classes of Columns
8) Example 7: Detect & Remove Outliers
9) Example 8: Remove Spaces in Character Strings
10) Example 9: Combine Categories
Creation of Example Data
data <- data.frame(x1 = c(1:4, 99999, 1, NA, 1, 1, NA), # Create example data frame
                   x1 = c(1:5, 1, "NA", 1, 1, "NA"),
                   x1 = c(letters[c(1:3)], "x x", "x", " y y y", "x", "a", "a", NA),
                   x4 = "",
                   x5 = NA)
data # Print example data frame
Example 1: Modify Column Names
Let’s first have a closer look at the names of our data frame columns:

colnames(data) # Print column names
# [1] "x1" "x1.1" "x1.2" "x4" "x5"

Let’s assume that we want to change these column names to a consecutive range with the prefix “col”. Then, we can apply the colnames, paste0, and ncol functions as shown below.

colnames(data) <- paste0("col", 1:ncol(data)) # Modify all column names
data # Print updated data frame
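The names x1, x1.1 and x1.2 arise because data.frame(), with its default check.names = TRUE, makes repeated column names unique. The sketch below reproduces this on a small made-up data frame `df` and then applies the same renaming; `seq_len(ncol(df))` is used here as a slightly more defensive alternative to `1:ncol(df)`.

```r
# Three columns given the same name; data.frame() de-duplicates them
df <- data.frame(x1 = 1:2, x1 = 3:4, x1 = 5:6)
old_names <- colnames(df)
old_names

# Rename to a consecutive range with the prefix "col"
colnames(df) <- paste0("col", seq_len(ncol(df)))
colnames(df)
```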

Example 2: Format Missing Values
• In the R programming language, missing values are usually represented by NA.
For that reason, it is useful to convert all missing values to this NA format.

• Some missing values are represented by blank character strings.

data[data == ""] # Print blank data cells
# [1] NA NA NA "" "" "" "" "" "" "" "" "" "" NA NA NA NA NA NA NA NA NA NA

• To assign NA values to those blank cells, we can use the following syntax:

data[data == ""] <- NA # Replace blanks by NA

Let’s have a closer look at the column col2:

data$col2 # Print column
# [1] "1" "2" "3" "4" "5" "1" "NA" "1" "1" "NA"

The NA values in this column are shown between quotes (i.e. “NA”). This indicates that those NA values are formatted as character strings instead of real NA values. We can change that using the following R code:

data$col2[data$col2 == "NA"] <- NA # Replace character "NA"
data # Print updated data frame
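Real data often contains more than one textual stand-in for a missing value. As a sketch, the blank and "NA" replacements can be combined into one pass over all character columns. The data frame `df` and the token list `missing_tokens` are assumptions made for this illustration.

```r
# Hypothetical data with several textual stand-ins for "missing"
df <- data.frame(id = 1:4,
                 score = c("10", "NA", "", "n/a"),
                 stringsAsFactors = FALSE)

# Assumed list of tokens to treat as missing; adjust to your data
missing_tokens <- c("", "NA", "n/a")

# Replace the tokens by real NA values in every character column
is_chr <- sapply(df, is.character)
df[is_chr] <- lapply(df[is_chr], function(x) {
  x[x %in% missing_tokens] <- NA
  x
})

df$score
```

When the data comes from a file, the na.strings argument of read.csv() can perform the same conversion at import time.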
Example 3: Remove Empty Rows & Columns
Use the rowSums, is.na, and ncol functions to remove rows that contain only NA values:

data <- data[rowSums(is.na(data)) != ncol(data), ] # Drop empty rows
data # Print updated data frame

Also exclude columns that contain only NA values:

data <- data[ , colSums(is.na(data)) != nrow(data)] # Drop empty columns
data # Print updated data frame
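Both steps can be combined in one function. The helper name `drop_empty` is hypothetical. The sketch also passes drop = FALSE, because subsetting a data frame down to a single column would otherwise return a plain vector instead of a data frame.

```r
# Hypothetical helper: drop rows and columns that contain only NA values
drop_empty <- function(df) {
  keep_rows <- rowSums(is.na(df)) != ncol(df)
  keep_cols <- colSums(is.na(df)) != nrow(df)
  # drop = FALSE keeps the result a data frame even if one column remains
  df[keep_rows, keep_cols, drop = FALSE]
}

df <- data.frame(a = c(1, NA, 3), b = NA)
res <- drop_empty(df)
res
```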
Example 4: Remove Rows with Missing Values
In case you have decided to remove all rows with one or more NA values, you may use the na.omit function as shown below:
data <- na.omit(data) # Delete rows with missing values
data # Print updated data frame
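na.omit removes a row as soon as any column is missing, which can discard more data than intended. A possible alternative, sketched here on a made-up data frame `df`, is to require completeness only in the columns that matter for the analysis by using complete.cases.

```r
df <- data.frame(id = 1:4,
                 age = c(23, NA, 31, 40),
                 note = c("ok", "ok", NA, "ok"))

# na.omit() drops every row with at least one NA
df_all <- na.omit(df)
nrow(df_all)

# Keep rows that are complete in selected columns only (here: age)
df_age <- df[complete.cases(df[, "age", drop = FALSE]), ]
nrow(df_age)
```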
Example 5: Remove Duplicates
Apply the unique function to our data frame as demonstrated in the following R snippet:
data <- unique(data) # Exclude duplicates
data # Print updated data frame
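Before deleting duplicates it can be useful to look at them. The sketch below uses duplicated() on a small made-up data frame `df` to list the repeated rows, and shows that dropping them gives the same rows as unique().

```r
df <- data.frame(id = c(1, 2, 2, 3), value = c("a", "b", "b", "c"))

# duplicated() flags every repeat of an earlier row
dup_rows <- df[duplicated(df), ]
dup_rows

# Dropping the flagged rows keeps the first occurrence of each row
df_unique <- df[!duplicated(df), ]
df_unique
```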
Example 6: Modify Classes of Columns
• The class of the columns of a data frame is another critical topic when it
comes to data cleaning.
• This example explains how to format each column to the most appropriate
data type automatically.
• Let’s first check the current classes of our data frame columns:

sapply(data, class) # Print classes of all columns

# col1 col2 col3
# "numeric" "character" "character"

Use the type.convert function to change the column classes whenever it is appropriate:

data <- type.convert(data, as.is = TRUE) # Convert column classes
data # Print updated data frame

Printing the data types of our columns once again, we can see that the first two columns have been changed to the integer class:

sapply(data, class) # Print classes of updated columns
#        col1        col2        col3
#   "integer"   "integer" "character"
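To get a feeling for what type.convert decides, it can be applied to single character vectors. The vectors below are made-up examples; as.is = TRUE keeps text as character instead of converting it to a factor.

```r
# Character vectors as they might look after import
int_col <- type.convert(c("1", "2", "3"), as.is = TRUE)
dbl_col <- type.convert(c("1.5", "2", NA), as.is = TRUE)
chr_col <- type.convert(c("a", "2", "3"), as.is = TRUE)

class(int_col)   # whole numbers become integer
class(dbl_col)   # decimal numbers become numeric
class(chr_col)   # any non-numeric entry keeps the vector as character
```

A single stray text entry is enough to keep a whole column as character, so this step works best after missing-value tokens have been converted to NA.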
Example 7: Detect & Remove Outliers
One method to detect outliers is provided by the boxplot.stats function:

data$col1[data$col1 %in% boxplot.stats(data$col1)$out] # Identify outliers in column
# [1] 99999

• The previous output has returned one outlier (i.e. the value 99999). This value is obviously much higher than the other values in this column.
• Apply the R code below to remove the outlier:

data <- data[! data$col1 %in% boxplot.stats(data$col1)$out, ] # Remove rows with outliers
data # Print updated data frame
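By default, boxplot.stats flags values that lie more than 1.5 times the interquartile range beyond the box (its coef argument). The sketch below applies a similar rule by hand with quantile(). The vector `x` is a made-up example, and because quantile() and the boxplot hinges can differ slightly for small samples, this should be read as an approximation of the same idea rather than an exact reimplementation.

```r
x <- c(1, 2, 3, 4, 99999, 1)   # example values with one extreme entry

# Quartiles and interquartile range
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR below the first and above the third quartile
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

is_outlier <- x < lower | x > upper
x[is_outlier]              # values flagged as outliers
x_clean <- x[!is_outlier]  # values kept
```

Whether a flagged value should be removed is a judgment call; a genuine extreme measurement is different from a data-entry error.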
Example 8: Remove Spaces in Character Strings
• Use the gsub function as demonstrated below:

data$col3 <- gsub(" ", "", data$col3) # Delete white space in character strings
data # Print updated data frame
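Removing every space is not always what is wanted, for example when values legitimately consist of several words. The sketch below, on a made-up vector `x`, contrasts the gsub call used above with trimws, which only strips leading and trailing white space.

```r
x <- c("x x", "  y y y", "a ")

trimmed  <- trimws(x)          # leading/trailing white space only
no_space <- gsub(" ", "", x)   # every space, as in the example above
no_ws    <- gsub("\\s+", "", "a\tb c")  # also tabs and other white space

trimmed
no_space
no_ws
```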
Example 9: Combine Categories
• The following R code illustrates how to group the categories “a”, “b”, and “c” in a single category “a”:

data$col3[data$col3 %in% c("b", "c")] <- "a" # Merge categories
data # Print updated data frame
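When more than one group of categories has to be merged, a named lookup vector keeps the recoding rules in one place. This is a sketch on a made-up vector `x`; the name `recode_map` is introduced here for illustration.

```r
x <- c("a", "b", "c", "x", "yyy", "b")

# Hypothetical recoding table: names are old categories, values are new ones
recode_map <- c(b = "a", c = "a")

# Replace only the entries that appear in the recoding table
to_change <- x %in% names(recode_map)
x[to_change] <- unname(recode_map[x[to_change]])

table(x)   # frequency of each remaining category
```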
Thanks
