Data Analytics Lesson 10 Slides

Lesson 2 of the Data Analytics Diploma focuses on data wrangling, specifically data cleaning and merging datasets using R. It emphasizes the importance of tidy data for easier manipulation and visualization, and introduces the tidyverse collection of packages for data analytics. Key functions for data manipulation in R, such as those in the dplyr package, are also highlighted, along with a practical challenge involving the Titanic dataset.
Diploma in Data Analytics
Lesson 2: Data wrangling

Lesson Objectives
• Data cleaning
• Described in R
• Merging datasets
Data cleaning
• Process of detecting and correcting inaccuracies in data
• Can take up to 80% of an analyst's time
Recap: packages in R
• Collection of R functions, compiled code and sample data
Info on packages
• DESCRIPTION file
• packageDescription("package")
• help(package = "package")
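The two calls above can be tried on any installed package. A minimal sketch, assuming the built-in "stats" package as a stand-in for the slide's "package" placeholder:

```r
# "stats" ships with every R installation; it is used here only as an
# example in place of the slide's "package" placeholder.
pkg_info <- packageDescription("stats")

# Fields from the package's DESCRIPTION file are available as list elements.
print(pkg_info$Package)
print(pkg_info$Version)

# Open the package's help index (its documented functions and datasets).
# Guarded so the script also runs in a non-interactive session.
if (interactive()) {
  help(package = "stats")
}
```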
What is tidy data?
“Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
Hadley Wickham
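To make the definition concrete, here is a small sketch that reshapes an untidy table into a tidy one with tidyr's pivot_longer(). The table, its column names and its values are invented for illustration and do not come from the lesson.

```r
library(tidyr)

# Hypothetical untidy table: one column per year, so the variable "year"
# is hidden in the column names instead of having its own column.
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 25),
  check.names = FALSE
)

# Tidy version: each variable (country, year, cases) is a column and
# each observation (one country in one year) is a row.
tidy <- pivot_longer(
  untidy,
  cols = c(`2019`, `2020`),
  names_to = "year",
  values_to = "cases"
)

print(tidy)
```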
Why tidy data?
Tidy data is easier to manipulate, model and visualise.
Image by: https://www.linkedin.com/learning/learning-the-r-tidyverse/what-is-the-tidyverse
What is tidyverse?
Collection of R packages for data analytics, used for transforming and visualising data
Packages available in the tidyverse library
ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats
More about… (packages contained in the tidyverse library)
• tidyr: helps create tidy data
• dplyr: helps manipulate data
• readr: helps import data
Data Frames
data.frame()
• R mostly works with data in data frames
• Table with rows and columns
• Similar to a spreadsheet or a table in an SQL database
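A minimal sketch of building and inspecting a data frame with data.frame(). The passenger names and values are invented; only the column names loosely follow the Titanic dataset used later in the lesson.

```r
# Each argument becomes a column; all columns must have the same length.
passengers <- data.frame(
  Name = c("Passenger A", "Passenger B", "Passenger C"),
  Age = c(22, 38, NA),        # NA marks a missing value
  Pclass = c(3L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Structure: column names, types and the first few values.
str(passengers)

# Rows are observations, columns are variables.
print(nrow(passengers))
print(ncol(passengers))

# Select one column with $, or rows with a logical condition.
print(passengers$Age)
third_class <- passengers[passengers$Pclass == 3, ]
print(third_class)
```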
Titanic
Data management in Module 1 with Power Query
Described in R
Titanic
Described in Module 1 with Data Analysis Toolpak
ggplot2
Helps create visualisations
Box, density and time series plots
Part of tidyverse
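As a rough illustration of two of the plot types listed, the sketch below draws a box plot and a density plot with ggplot2. The data frame is invented for the example; the column names Pclass and Age are assumptions borrowed from the Titanic dataset.

```r
library(ggplot2)

# Hypothetical data: passenger class and age for six passengers.
plot_data <- data.frame(
  Pclass = factor(c(1, 1, 2, 2, 3, 3)),
  Age = c(38, 54, 27, 35, 22, 19)
)

# Box plot: distribution of Age within each passenger class.
box_plot <- ggplot(plot_data, aes(x = Pclass, y = Age)) +
  geom_boxplot()

# Density plot: overall distribution of Age.
density_plot <- ggplot(plot_data, aes(x = Age)) +
  geom_density()

# Printing a ggplot object renders it on the active graphics device.
print(box_plot)
print(density_plot)
```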
Help with R
RStudio community
Stack Overflow
Merging datasets
• Bringing datasets together
• Combining datasets
Source: https://www.iconfinder.com/icons/3030755/combining_datasets_files_mashup_process_icon
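One common way to bring two datasets together is a key-based join. The sketch below uses dplyr's join functions on two invented tables that share a hypothetical PassengerId key; it illustrates the idea and is not the lesson's own example.

```r
library(dplyr)

# Two hypothetical tables sharing the key column PassengerId.
passengers <- data.frame(
  PassengerId = c(1, 2, 3),
  Name = c("A", "B", "C")
)
tickets <- data.frame(
  PassengerId = c(1, 2, 4),
  Fare = c(7.25, 71.28, 8.05)
)

# Left join: keep every passenger; Fare is NA where no ticket matches.
left <- left_join(passengers, tickets, by = "PassengerId")

# Inner join: keep only ids present in both tables.
inner <- inner_join(passengers, tickets, by = "PassengerId")

# Full join: keep every id from either table.
full <- full_join(passengers, tickets, by = "PassengerId")

print(left)
print(inner)
print(full)
```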
dplyr for data manipulation
• select()
• filter()
• group_by()
• summarise()
• arrange()
• mutate()
• join functions such as left_join() and inner_join()
Source: https://www.edureka.co/blog/sql-joins-types
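The dplyr verbs can be chained into a single pipeline. This is a small sketch on invented data (the column names are assumptions modelled on the Titanic dataset), showing one possible order of the verbs rather than a prescribed workflow.

```r
library(dplyr)

# Hypothetical passenger data.
passengers <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Pclass = c(1, 1, 2, 3, 3),
  Age = c(38, 54, 27, 22, 19),
  Fare = c(71.3, 51.9, 13.0, 7.3, 8.1)
)

summary_by_class <- passengers %>%
  select(Pclass, Age, Fare) %>%           # keep only the needed columns
  filter(Age >= 20) %>%                   # keep rows meeting a condition
  mutate(FareRounded = round(Fare)) %>%   # add a derived column
  group_by(Pclass) %>%                    # define groups
  summarise(                              # one summary row per group
    MeanAge = mean(Age),
    MeanFare = mean(FareRounded),
    Passengers = n()
  ) %>%
  arrange(desc(MeanAge))                  # sort the result

print(summary_by_class)
```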
Challenge
Join the training and test datasets from the Titanic and remove missing values from Age
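One possible approach to the challenge, sketched on tiny invented stand-ins for the two files. It assumes that "join" here means stacking the rows of the two datasets, and that the test set has no Survived column, as in the commonly distributed version of the data. With the real files, the data frames would be read in first, for example with readr's read_csv().

```r
library(dplyr)

# Hypothetical stand-ins for the training and test datasets.
train <- data.frame(
  PassengerId = 1:3,
  Survived = c(0, 1, 1),
  Age = c(22, NA, 26)
)
test <- data.frame(
  PassengerId = 4:5,
  Age = c(NA, 47)
)

# Stack the rows; columns missing from one table (Survived) are filled
# with NA. The Source column records which dataset each row came from.
titanic <- bind_rows(train = train, test = test, .id = "Source")

# Remove rows where Age is missing.
titanic_clean <- titanic %>%
  filter(!is.na(Age))

print(titanic_clean)
```

If the two datasets instead needed to be matched on a shared key, a join function such as left_join() would be the alternative to bind_rows().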

#exploredata
