Data Science using R
Introduction to dplyr
© Kalasalingam Academy of Research and Education
dplyr is a powerful R package designed for data manipulation and transformation,
making it easier to work with data frames and perform common data operations. It is part of
the tidyverse, a collection of R packages that share an underlying design philosophy and
grammar, which makes data science in R more efficient and intuitive.
Features of dplyr
• Simplified Syntax: dplyr offers a clean and consistent set of functions that allow for
straightforward data manipulation.
• Chaining Operations: You can use the pipe operator (%>%) to chain together multiple
operations, making your code more readable and concise.
• Performance: dplyr is optimized for performance, especially with large datasets,
making it suitable for data analysis tasks.
Functions of dplyr
1. filter()
The filter() function is used to select rows from a data frame that meet specific conditions.
Syntax filter(data, condition)
Example
# Load dplyr
library(dplyr)
# Create a sample data frame
df <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David"),
Age = c(25, 30, 35, 40),
Salary = c(50000, 60000, 70000, 80000)
)
# Filter rows where Age is greater than 30
filtered_df <- filter(df, Age > 30)
print(filtered_df)
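filter() also accepts several conditions at once. The sketch below reuses the sample data frame from the example; the variable names older_mid_salary and young_or_top are illustrative only.

```r
library(dplyr)

# Same sample data frame as in the example above
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000)
)

# Conditions separated by commas are combined with AND
older_mid_salary <- filter(df, Age > 30, Salary < 80000)

# Use | when either condition may hold (OR)
young_or_top <- filter(df, Age < 30 | Salary >= 80000)

print(older_mid_salary)  # only Charlie satisfies both conditions
print(young_or_top)      # Alice (Age < 30) and David (Salary >= 80000)
```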
2. select()
The select() function is used to choose specific columns from a data frame.
Syntax select(data, columns)
Example
# Select the Name and Salary columns
selected_df <- select(df, Name, Salary)
print(selected_df)
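Beyond listing column names, select() can drop columns, match them by pattern, and rename them. A brief sketch under the same sample data; the new column name Employee is a made-up label for illustration.

```r
library(dplyr)

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000)
)

# Drop a column by prefixing it with a minus sign
no_age <- select(df, -Age)

# Keep columns whose names start with "S"
s_cols <- select(df, starts_with("S"))

# Rename a column while selecting it (Employee is a hypothetical name)
renamed <- select(df, Employee = Name, Salary)

print(names(no_age))
print(names(s_cols))
print(names(renamed))
```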
3. mutate()
The mutate() function is used to create new columns or modify existing ones.
Syntax mutate(data, new_column = expression)
Example
# Add a new column for Annual Salary
mutated_df <- mutate(df, Annual_Salary = Salary * 12)
print(mutated_df)
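A single mutate() call can create several columns, and later expressions may refer to columns created earlier in the same call. In this sketch the Bonus column and its 10% rate are assumptions made purely for illustration.

```r
library(dplyr)

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000)
)

multi_df <- mutate(
  df,
  Annual_Salary = Salary * 12,
  Bonus = Annual_Salary * 0.10,  # hypothetical 10% bonus, uses the column created just above
  Age = Age + 1                  # overwriting an existing column works the same way
)

print(multi_df)
```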
4. summarise() (or summarize())
The summarise() function is used to compute summary statistics for a data frame.
Syntax summarise(data, summary_statistic = function(column))
Example
# Calculate the average salary
summary_df <- summarise(df, Average_Salary = mean(Salary))
print(summary_df)
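summarise() can return several statistics in one row. A small sketch on the same data; the output column names are illustrative.

```r
library(dplyr)

df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Salary = c(50000, 60000, 70000, 80000)
)

stats_df <- summarise(
  df,
  Average_Salary = mean(Salary),
  Max_Salary = max(Salary),
  Employees = n()  # n() counts the rows being summarised
)

print(stats_df)
```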
5. group_by()
The group_by() function is used to group data by one or more variables. It is often used in
conjunction with summarise() to perform calculations on each group.
Syntax group_by(data, grouping_variable)
Example
# Create another data frame with a department column
df2 <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David"),
Department = c("HR", "IT", "IT", "HR"),
Salary = c(50000, 60000, 70000, 80000)
)
# Group by Department and calculate the average salary
grouped_df <- df2 %>%
  group_by(Department) %>%
  summarise(Average_Salary = mean(Salary))
print(grouped_df)
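To make the effect of grouping concrete, the sketch below adds a per-group row count with n() to the same summary. With this sample data each department has two employees, and both averages happen to come out at 65000.

```r
library(dplyr)

df2 <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("HR", "IT", "IT", "HR"),
  Salary = c(50000, 60000, 70000, 80000)
)

dept_summary <- df2 %>%
  group_by(Department) %>%
  summarise(
    Employees = n(),                # rows in each group
    Average_Salary = mean(Salary)   # computed separately per group
  )

print(dept_summary)
```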
6. arrange()
The arrange() function is used to reorder rows in a data frame based on one or more
variables.
Syntax arrange(data, column)
Example
# Arrange the data frame by Salary in descending order
arranged_df <- arrange(df, desc(Salary))
print(arranged_df)
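arrange() accepts more than one sort key; later keys break ties within earlier ones. A minimal sketch, using a data frame with a Department column like the one in the group_by() example:

```r
library(dplyr)

df2 <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("HR", "IT", "IT", "HR"),
  Salary = c(50000, 60000, 70000, 80000)
)

# Sort by Department (ascending), then by Salary (descending) within each department
sorted_df <- arrange(df2, Department, desc(Salary))

print(sorted_df)
```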
7. Join Functions
dplyr provides several functions for joining data frames, including inner_join(),
left_join(), right_join(), and full_join().
Example
# Create another data frame for joining
df3 <- data.frame(
Name = c("Alice", "Bob"),
Department = c("HR", "IT")
)
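# Inner join df2 with df3 on Name
# Department exists in both data frames but is not a join key, so the result
# contains Department.x (from df2) and Department.y (from df3)
joined_df <- inner_join(df2, df3, by = "Name")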
print(joined_df)
8. Pipe Operator (%>%)
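The pipe operator, which dplyr re-exports from the magrittr package, lets you chain multiple
operations together in a readable way. It passes the result on its left as the first argument
of the function on its right.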
Example
# Using pipe to chain operations
result_df <- df %>%
  filter(Age > 30) %>%
  select(Name, Salary) %>%
  mutate(Annual_Salary = Salary * 12)
print(result_df)
Data manipulation in R with dplyr
Data manipulation in R with dplyr is a key aspect of data analysis, allowing you to
clean, transform, and summarize data efficiently. Below, we will explore various common
data manipulation tasks using the dplyr package, including filtering, selecting, mutating,
summarizing, grouping, and arranging data.
Setting Up dplyr
Before using dplyr, ensure that you have it installed and loaded into your R session.
# Install dplyr if you haven't already
install.packages("dplyr")
# Load the dplyr package
library(dplyr)
Sample Data
Let's create a sample data frame that we will use for our examples:
# Create a sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Age = c(25, 30, 35, 40, 28),
Salary = c(50000, 60000, 70000, 80000, 65000),
Department = c("HR", "IT", "IT", "HR", "Finance")
)
Common Data Manipulation Tasks
1. Filtering Rows with filter()
The filter() function is used to subset rows based on specific conditions.
# Filter rows where Age is greater than 30
filtered_data <- filter(data, Age > 30)
print(filtered_data)
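Conditions in filter() are not limited to simple comparisons. The following sketch uses %in% for set membership and dplyr's between() for an inclusive range; the chosen departments and age limits are arbitrary examples.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 28),
  Salary = c(50000, 60000, 70000, 80000, 65000),
  Department = c("HR", "IT", "IT", "HR", "Finance")
)

# Keep rows whose Department is one of several values
hr_or_finance <- filter(data, Department %in% c("HR", "Finance"))

# Keep rows whose Age lies between 28 and 35, both ends included
age_28_to_35 <- filter(data, between(Age, 28, 35))

print(hr_or_finance)
print(age_28_to_35)
```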
2. Selecting Columns with select()
The select() function is used to choose specific columns from a data frame.
# Select the Name and Salary columns
selected_data <- select(data, Name, Salary)
print(selected_data)
3. Adding New Columns with mutate()
The mutate() function allows you to create new columns or modify existing ones.
# Add a new column for Annual Salary
mutated_data <- mutate(data, Annual_Salary = Salary * 12)
print(mutated_data)
4. Summarizing Data with summarise()
The summarise() function is used to calculate summary statistics.
# Calculate the average salary
summary_data <- summarise(data, Average_Salary = mean(Salary))
print(summary_data)
5. Grouping Data with group_by() and Summarizing
The group_by() function is used to group data by one or more variables. It is often followed by
summarise() to perform calculations on each group.
# Group by Department and calculate the average salary
grouped_data <- data %>%
  group_by(Department) %>%
  summarise(Average_Salary = mean(Salary), .groups = "drop")
print(grouped_data)
6. Arranging Rows with arrange()
The arrange() function is used to reorder rows based on the values of one or more columns.
# Arrange the data frame by Salary in descending order
arranged_data <- arrange(data, desc(Salary))
print(arranged_data)
7. Chaining Operations with the Pipe Operator (%>%)
The pipe operator allows you to chain multiple dplyr functions together for a more
readable workflow.
# Chain operations to filter, select, and mutate data
result_data <- data %>%
  filter(Age < 35) %>%
  select(Name, Salary) %>%
  mutate(Annual_Salary = Salary * 12)
print(result_data)
8. Joining Data with Join Functions
dplyr provides various functions for joining data frames, such as inner_join(), left_join(),
right_join(), and full_join().
Example of left_join()
# Create another data frame for joining
additional_data <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Bonus = c(5000, 6000, 7000)
)
# Perform a left join
joined_data <- left_join(data, additional_data, by = "Name")
print(joined_data)
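In the left join above, David and Eva have no row in additional_data, so their Bonus is NA. The sketch below checks this and shows one possible follow-up, replacing the missing bonuses with 0 via coalesce(); treating a missing bonus as zero is an assumption for illustration, not something the data dictates.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 28),
  Salary = c(50000, 60000, 70000, 80000, 65000),
  Department = c("HR", "IT", "IT", "HR", "Finance")
)
additional_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Bonus = c(5000, 6000, 7000)
)

joined_data <- left_join(data, additional_data, by = "Name")

# All five rows of the left table are kept; unmatched rows get NA in Bonus
print(is.na(joined_data$Bonus))

# Assumption: a missing bonus means no bonus, so replace NA with 0
joined_filled <- mutate(joined_data, Bonus = coalesce(Bonus, 0))
print(joined_filled)
```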
Selecting, Mutating, Filtering, Arranging and Summarising
In R programming, particularly when using the dplyr package, the functions for
selecting, mutating, filtering, arranging, and summarizing data frames are essential for
effective data manipulation. Below, we will explore each of these operations in detail,
along with examples to illustrate their usage.
Setting Up dplyr
Before we start, make sure you have the dplyr package installed and loaded in your R
session:
# Install dplyr if you haven't already
install.packages("dplyr")
# Load the dplyr package
library(dplyr)
Sample Data Frame
We will use a sample data frame for demonstration purposes:
# Create a sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Age = c(25, 30, 35, 40, 28),
Salary = c(50000, 60000, 70000, 80000, 65000),
Department = c("HR", "IT", "IT", "HR", "Finance")
)
# Print the original data
print(data)
1. Selecting Columns with select()
The select() function is used to choose specific columns from a data frame.
Syntax select(data, columns)
Example
# Select the Name and Salary columns
selected_data <- select(data, Name, Salary)
print(selected_data)
2. Mutating Data with mutate()
The mutate() function is used to add new columns or modify existing ones.
Syntax mutate(data, new_column = expression)
Example
# Add a new column for Annual Salary
mutated_data <- mutate(data, Annual_Salary = Salary * 12)
print(mutated_data)
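New columns do not have to be arithmetic. As a sketch, if_else() can derive a label from a condition; the Seniority column and the cutoff of 30 are hypothetical choices for this example.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 28),
  Salary = c(50000, 60000, 70000, 80000, 65000),
  Department = c("HR", "IT", "IT", "HR", "Finance")
)

# Hypothetical rule: employees aged 30 or more are labelled "Senior"
labelled_data <- mutate(
  data,
  Seniority = if_else(Age >= 30, "Senior", "Junior")
)

print(labelled_data)
```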
3. Filtering Rows with filter()
The filter() function is used to subset rows based on specific conditions.
Syntax filter(data, condition)
Example
# Filter rows where Age is greater than 30
filtered_data <- filter(data, Age > 30)
print(filtered_data)
4. Arranging Rows with arrange()
The arrange() function is used to reorder rows based on the values of one or more columns.
Syntax arrange(data, column)
Example
# Arrange the data frame by Salary in descending order
arranged_data <- arrange(data, desc(Salary))
print(arranged_data)
5. Summarizing Data with summarise()
The summarise() function is used to calculate summary statistics for one or more columns.
Syntax summarise(data, summary_statistic = function(column))
Example
# Calculate the average salary
summary_data <- summarise(data, Average_Salary = mean(Salary))
print(summary_data)
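To summarise more than one column with the same function, across() can be used inside summarise(). This sketch assumes dplyr 1.0.0 or later, where across() is available.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 28),
  Salary = c(50000, 60000, 70000, 80000, 65000),
  Department = c("HR", "IT", "IT", "HR", "Finance")
)

# Apply mean() to both Age and Salary; the output keeps the column names
means_data <- summarise(data, across(c(Age, Salary), mean))

print(means_data)
```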
6. Grouping Data with group_by()
The group_by() function is often used in conjunction with summarise() to perform calculations
on grouped data.
Syntax group_by(data, grouping_variable)
Example
# Group by Department and calculate the average salary
grouped_data <- data %>%
  group_by(Department) %>%
  summarise(Average_Salary = mean(Salary), .groups = "drop")
print(grouped_data)
7. Combining Operations
You can combine these operations using the pipe operator (%>%) for a more readable
workflow.
Example
# Combine filtering, selecting, and mutating in one chain
result_data <- data %>%
  filter(Age < 35) %>%
  select(Name, Salary) %>%
  mutate(Annual_Salary = Salary * 12)
print(result_data)
Pipe operator R programming
The pipe operator (%>%) in R, provided by the magrittr package (which is also
part of the tidyverse), is a powerful tool for chaining together multiple functions in a
clean and readable way. It allows you to take the output of one function and pass it
directly as an input to the next function, enabling a streamlined workflow in data
manipulation and analysis.
Benefits of Using the Pipe Operator
1. Readability: The pipe operator enhances the readability of your code by allowing
you to express a sequence of operations in a linear fashion, resembling natural
language.
2. Conciseness: It reduces the need for temporary variables and makes the code
cleaner.
3. Chaining Functions: It allows you to easily combine multiple operations without
nesting functions.
Basic Syntax
The general syntax for using the pipe operator is as follows:
data %>% function1(arguments) %>% function2(arguments) %>%
function3(arguments)
In this syntax:
• data is the initial dataset.
• function1, function2, and function3 are the functions you want to apply sequentially.
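Put differently, x %>% f(y) is evaluated as f(x, y). A tiny sketch with base R functions makes the equivalence visible; the vector values is just sample input.

```r
library(dplyr)  # dplyr makes %>% available

values <- c(1, 4, 9, 16)

# Piped form: read left to right
piped <- values %>% sqrt() %>% sum()

# Equivalent nested form: read inside out
nested <- sum(sqrt(values))

print(piped)
print(identical(piped, nested))
```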
Example of the Pipe Operator
Let's walk through a comprehensive example using a sample data frame to illustrate how the
pipe operator works in R.
Setting Up
First, ensure that you have the necessary packages installed and loaded:
# Install the tidyverse if you haven't already
install.packages("tidyverse")
# Load dplyr, which is installed with the tidyverse and makes %>% available
library(dplyr)
Sample Data Frame
We will use the following sample data frame for our examples:
# Create a sample data frame
data <- data.frame(
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Age = c(25, 30, 35, 40, 28),
Salary = c(50000, 60000, 70000, 80000, 65000),
Department = c("HR", "IT", "IT", "HR", "Finance")
)
Example: Using the Pipe Operator
Here’s how you can use the pipe operator to perform a series of operations on the data:
# Using the pipe operator for data manipulation
result <- data %>%
  filter(Age > 28) %>%                     # Keep rows where Age is greater than 28
  select(Name, Salary) %>%                 # Keep only the Name and Salary columns
  mutate(Annual_Salary = Salary * 12) %>%  # Add an Annual_Salary column
  arrange(desc(Annual_Salary))             # Sort by Annual_Salary, descending
# Print the result
print(result)
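For comparison, here is a sketch of the same four steps written as nested calls without the pipe. Both versions should produce the same three rows (David, Charlie, Bob), but the nested form has to be read from the inside out.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 28),
  Salary = c(50000, 60000, 70000, 80000, 65000),
  Department = c("HR", "IT", "IT", "HR", "Finance")
)

# Piped version, as in the example above
result <- data %>%
  filter(Age > 28) %>%
  select(Name, Salary) %>%
  mutate(Annual_Salary = Salary * 12) %>%
  arrange(desc(Annual_Salary))

# Nested version: the innermost call (filter) runs first
nested_result <- arrange(
  mutate(
    select(filter(data, Age > 28), Name, Salary),
    Annual_Salary = Salary * 12
  ),
  desc(Annual_Salary)
)

print(isTRUE(all.equal(result, nested_result)))
```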
Data blending and joining R programming
Data blending and joining in R involves combining multiple datasets into a
single cohesive dataset for analysis. This is a crucial step in data preparation,
allowing you to create a unified dataset that includes all relevant information from
different sources.
In R, the dplyr package provides a set of powerful functions for joining
data frames. Below are the most common types of joins, along with examples to
demonstrate their usage.
Setting Up dplyr
Ensure that you have the dplyr package installed and loaded:
# Install dplyr if you haven't already
install.packages("dplyr")
# Load the dplyr package
library(dplyr)
Sample Data Frames
Let’s create two sample data frames that we can use for our joining examples:
# Create the first data frame
employees <- data.frame(
Employee_ID = c(1, 2, 3, 4, 5),
Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
Department = c("HR", "IT", "IT", "HR", "Finance")
)
# Create the second data frame
salaries <- data.frame(
Employee_ID = c(1, 2, 3, 4, 6),
Salary = c(50000, 60000, 70000, 80000, 90000)
)
# Print the data frames
print(employees)
print(salaries)
Types of Joins in dplyr
1. Inner Join (inner_join()): Returns only the rows with matching values in both data
frames.
inner_joined <- inner_join(employees, salaries, by = "Employee_ID")
print(inner_joined)
2. Left Join (left_join()): Returns all rows from the left data frame and the matched rows
from the right data frame. If there is no match, the result will contain NA for columns
from the right data frame.
left_joined <- left_join(employees, salaries, by = "Employee_ID")
print(left_joined)
3. Right Join (right_join()): Returns all rows from the right data frame and the matched
rows from the left data frame. If there is no match, the result will contain NA for
columns from the left data frame.
right_joined <- right_join(employees, salaries, by = "Employee_ID")
print(right_joined)
4. Full Join (full_join()): Returns all rows from both data frames, with NA in places
where there is no match.
full_joined <- full_join(employees, salaries, by = "Employee_ID")
print(full_joined)
5. Semi Join (semi_join()): Returns all rows from the left data frame where there are
matching values in the right data frame, but does not include any columns from the right
data frame.
semi_joined <- semi_join(employees, salaries, by = "Employee_ID")
print(semi_joined)
6. Anti Join (anti_join()): Returns all rows from the left data frame where there are no
matching values in the right data frame.
anti_joined <- anti_join(employees, salaries, by = "Employee_ID")
print(anti_joined)
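A quick way to compare the six joins is to count the rows each one returns. In the sample data, Employee_ID 5 appears only in employees and Employee_ID 6 only in salaries, which explains the differences. The following is a sketch; join_rows is just an illustrative name.

```r
library(dplyr)

employees <- data.frame(
  Employee_ID = c(1, 2, 3, 4, 5),
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Department = c("HR", "IT", "IT", "HR", "Finance")
)
salaries <- data.frame(
  Employee_ID = c(1, 2, 3, 4, 6),
  Salary = c(50000, 60000, 70000, 80000, 90000)
)

# Number of rows returned by each join type
join_rows <- c(
  inner = nrow(inner_join(employees, salaries, by = "Employee_ID")),  # IDs 1-4
  left  = nrow(left_join(employees, salaries, by = "Employee_ID")),   # IDs 1-5
  right = nrow(right_join(employees, salaries, by = "Employee_ID")),  # IDs 1-4 and 6
  full  = nrow(full_join(employees, salaries, by = "Employee_ID")),   # IDs 1-6
  semi  = nrow(semi_join(employees, salaries, by = "Employee_ID")),   # IDs 1-4, left columns only
  anti  = nrow(anti_join(employees, salaries, by = "Employee_ID"))    # ID 5 only
)

print(join_rows)
```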
Outliers and Missing Values Treatment
Handling outliers and missing values is a crucial part of data preprocessing in any
data analysis or machine learning project. In R, you can use various techniques to
identify, treat, and impute these data issues. Below, we will explore both outliers and
missing values treatment in detail.
Outliers Treatment
Outliers are data points that significantly differ from the rest of the dataset. They can
skew results and lead to misleading conclusions if not handled appropriately.
1. Identifying Outliers
You can identify outliers using several methods:
• Visual Methods: Boxplots and scatter plots can visually reveal outliers.
• Statistical Methods: Use the IQR (Interquartile Range) method or Z-scores.
Example: Using Boxplots and IQR
# Load necessary library
library(ggplot2)
# Create a sample dataset
data <- data.frame(
  Value = c(10, 12, 12, 13, 12, 15, 18, 19, 100) # 100 is an outlier
)
# Boxplot to visualize outliers
ggplot(data, aes(y = Value)) + geom_boxplot() + ggtitle("Boxplot to Identify Outliers")
Example: Using IQR Method
# Calculate the IQR
Q1 <- quantile(data$Value, 0.25)
Q3 <- quantile(data$Value, 0.75)
IQR_value <- Q3 - Q1
# Determine outlier boundaries
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Identify outliers
outliers <- data$Value[data$Value < lower_bound | data$Value > upper_bound]
print(outliers)
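The Z-score method mentioned earlier can be sketched in the same way. The cutoff of 2 used here is an assumption chosen for this small sample: the extreme value inflates both the mean and the standard deviation, so its Z-score is only about 2.65 and the often-quoted cutoff of 3 would flag nothing.

```r
# Same sample values as in the IQR example
values <- c(10, 12, 12, 13, 12, 15, 18, 19, 100)

# Z-score: distance from the mean in units of the standard deviation
z_scores <- (values - mean(values)) / sd(values)

# Assumed cutoff of 2 (adjust to the size and shape of your data)
z_cutoff <- 2
z_outliers <- values[abs(z_scores) > z_cutoff]

print(round(z_scores, 2))
print(z_outliers)
```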
2. Treating Outliers
Once identified, you can treat outliers in several ways:
• Remove Outliers: Simply exclude them from the dataset.
data_no_outliers <- data[data$Value >= lower_bound & data$Value <= upper_bound, ]
• Transform Data: Apply transformations (like log or square root) to reduce the effect of
outliers.
• Impute Values: Replace outliers with a statistical measure (e.g., mean or median).
# Replace outliers with the median
data$Value[data$Value < lower_bound | data$Value > upper_bound] <-
median(data$Value)
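The transformation option listed above has no example, so here is a minimal sketch using a natural log. It assumes all values are strictly positive, since log() is undefined for zero and negative numbers. The transformation does not remove the extreme value, but it compresses the gap between it and the rest of the data.

```r
# Same sample values as in the outlier examples (all strictly positive)
values <- c(10, 12, 12, 13, 12, 15, 18, 19, 100)

log_values <- log(values)

# Ratio of the largest value to the median, before and after the transformation
ratio_before <- max(values) / median(values)
ratio_after <- max(log_values) / median(log_values)

print(ratio_before)
print(ratio_after)
```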
Missing Values Treatment
Missing values can occur for various reasons and can significantly impact data analysis. Handling
them appropriately is vital.
1. Identifying Missing Values
You can check for missing values using the is.na() function or the summary() function.
# Create a sample dataset with missing values
data_with_na <- data.frame(
  Name = c("Alice", "Bob", NA, "David", "Eva"),
  Age = c(25, NA, 35, 40, 28)
)
# Check for missing values
summary(data_with_na)
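summary() reports NA counts for numeric columns, but for a character column such as Name it shows only length, class, and mode. A sketch of the is.na() approach, which counts missing values in every column regardless of type:

```r
data_with_na <- data.frame(
  Name = c("Alice", "Bob", NA, "David", "Eva"),
  Age = c(25, NA, 35, 40, 28)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(data_with_na))

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(data_with_na))

print(na_per_column)
print(rows_with_na)
```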
2. Treating Missing Values
There are several strategies to handle missing values:
• Remove Rows with Missing Values: This is straightforward but may lead to loss
of important data.
data_cleaned <- na.omit(data_with_na)
• Impute Missing Values: Replace missing values with appropriate substitutes
(mean, median, mode, or using predictive models).
Example: Mean Imputation
# Impute missing age with the mean age
data_with_na$Age[is.na(data_with_na$Age)] <- mean(data_with_na$Age, na.rm = TRUE)
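Median imputation follows the same pattern and is less sensitive to extreme values than the mean. The sketch below works on a fresh copy of the sample data (named data_median here for illustration) so that it does not depend on the mean imputation above.

```r
# Fresh copy of the sample data with a missing Age
data_median <- data.frame(
  Name = c("Alice", "Bob", NA, "David", "Eva"),
  Age = c(25, NA, 35, 40, 28)
)

# Median of the observed ages (NA values are ignored)
age_median <- median(data_median$Age, na.rm = TRUE)

# Replace the missing ages with that median
data_median$Age[is.na(data_median$Age)] <- age_median

print(age_median)
print(data_median)
```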
Example: Using mice Package for Multiple Imputation
# Install mice package if not installed
install.packages("mice")
library(mice)
# Use mice to impute missing values
imputed_data <- mice(data_with_na, m = 5, method = 'pmm', maxit = 50)
completed_data <- complete(imputed_data)
print(completed_data)