Data Manipulation (R)
Liana Harutyunyan
Programming for Data Science
April 25, 2024
American University of Armenia
[email protected]
1
Data Manipulation
• dplyr is a package that helps you to solve most
common data manipulation tasks.
• Again for installing and importing the package, you
need:
install.packages("dplyr")
library(dplyr)
2
Data Manipulation
These are the most common functions from the library:
• filter — filters rows;
• select - filters columns;
• arrange — sorts a dataframe;
• mutate — creates new columns;
• group by — groups data by a specific key;
• summarize - performs aggregate functions
3
Data Manipulation
• Not necessary, but dplyr works the best with pipe like
operator from magrittr package.
• % > % operator takes the object from its left hand side
and uses it as an argument in the function on the right
hand side.
• Understand how to use it by replacing the pipe
operation with ‘then’ (in your mind, not in the code)
• example: filter(...) % > % select(...) - FILTER,
THEN on the filtered SELECT
4
Mutate
mutate - adds new columns or modifies current variables
in the dataset.
Adding new columns filled with the same value:
diamonds %>%
mutate(JustOne = 1,
Values = "something",
Simple = TRUE)
5
Mutate
Can use existing columns:
diamonds %>%
mutate(price_discounted = price * 0.9)
Can modify the existing ones:
diamonds %>%
mutate(price = price * 0.9,
mean_price = mean(price))
6
ifelse
ifelse returns a value with the same shape as test which is
filled with elements selected from either yes or no
depending on whether the element of test is TRUE or FALSE.
ifelse(test, yes, no)
Example:
vector <- c(1:10)
vector <- ifelse(vector > 5, "high", "low")
7
ifelse with mutate
Will change / create a new column depending on another
column’s values.
practice %>%
mutate(Health = ifelse(Subject == 1,
"sick",
"healthy"))
8
Filter
Only retain specific rows of data that meet the specified
requirement(s).
diamonds %>%
filter(cut == "Fair")
Will return only those rows that have ”cut” equal to ”Fair”.
Equivalent to:
diamonds[diamonds$cut == "Fair", ]
9
Filter
To have multiple conditions,
• for OR, you can use ”|”
• for AND, you can use ”,” (comma).
diamonds %>%
filter(cut == "Fair" | cut == "Good",
price <= 600)
Same as
diamonds %>%
filter(cut %in% c("Fair", "Good"),
price <= 600)
10
Select
• Select only the columns that you want to see. Gets rid of
all other columns.
• Can use columns positions or by name.
• The order in which you list the column names/positions
is the order that the columns will be displayed.
diamonds %>%
select(cut, color)
Same as:
diamonds %>%
select(1:5)
11
Select with negative sign
To exclude columns, you can use minus sign, both with
position numbers or column names.
Examples:
diamonds %>% select(-cut)
diamonds %>% select(-cut, -color)
diamonds %>% select(-c(cut, color))
diamonds %>% select(-c(2, 3))
12
group by and summarize
Groups variables different categories together for future
operations.
data %>%
group_by(Country) %>%
summarize(m = mean(Score),
s = sd(Score),
n = n()) # calculate the count
Equivalent to Python’s:
data.groupby("Country")["Score"].agg(["mean", "std", "count
Can also groupby with multiple columns, using
group by(Country, City)
13
arrange
Allows you arrange values within a variable in ascending or
descending order.
This can apply to both numerical and non-numerical
(alphabetical order) values.
diamonds %>%
arrange(cut)
diamonds %>%
arrange(desc(price))
14
Data Manipulation - examples set 1
• Exercise: Filter the data to have a diamond subset,
which is the original diamond data’s only Ideal cuts.
15
Data Manipulation - examples set 1
• Exercise: Filter the data to have a diamond subset,
which is the original diamond data’s only Ideal cuts.
• Exercise: Add a new column *volume* that will be
multiplication result of columns x, y and z.
• Exercise: During last class, we calculated diamond price
means for each type of cut, using for loop. Do the same
with dplyr package.
15
Data Manipulation - examples set 1
• Exercise: Filter the data to have a diamond subset,
which is the original diamond data’s only Ideal cuts.
• Exercise: Add a new column *volume* that will be
multiplication result of columns x, y and z.
• Exercise: During last class, we calculated diamond price
means for each type of cut, using for loop. Do the same
with dplyr package.
• Exercise: Use the previous point to plot a barplot on
obtained table.
• Exercise: Calculate how many of each cut there is in the
dataset.
15
Data Manipulation - examples set 2
• Read summer.csv dataset in R.
• Filter the dataset to have only USA data. Now count the
type of medals USA received.
• Count how many medals received all the countries and
sort by the most medals to least. Take top 10.
• Take the data of countries ”USA”, ”FRA” and ”GBR”.
Group by by 2 columns Country and Medal and take
count for each.
• Plot the data from last point in a graph. What is the best
graph for this.
16
Data Manipulation - examples set 3
• Load *airquality* built-in dataset from R.
• Filter to have only those rows that do not complete NA
values and plot a scaterplot of *wind* and
*temperature*.
• Change the temperature from Fahrenheit to Celsius.
• Create a new column, where if temperature is bigger
than the mean temperature, write ”high”, otherwise
write ”low”.
• Calculate mean temperature for each month.
• Plot graph of multiple line plots, where each line is the
month and y is the temperature value.
17
Summary
Reading
https://bookdown.org/yih huynh/GuidetoRBook/basicdata-
management.html
Questions?
18