Cheat Sheet: Exploratory Data Analysis
Command Syntax Description Example
summarize()
  Syntax: summarize(.data, ...)
  Description: The summarize function reduces a data frame to a summary of just one vector or value.
    .data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame.
    ...: Name-value pairs of summary functions. The name will be the name of the variable in the result. The value should be an expression that returns a single value like min(x), n(), or sum(is.na(y)).
  Example:
    avg_delays <- sub_airline %>%
      group_by(Reporting_Airline, DayOfWeek) %>%
      summarize(mean_delays = mean(ArrDelayMinutes), .groups = 'keep')

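To make the "one row per group" behavior concrete, here is a minimal sketch on a made-up data frame. The `toy` tibble and its values are hypothetical stand-ins for sub_airline, not data from the course.

```r
library(dplyr)

# Hypothetical toy data standing in for sub_airline (values are made up)
toy <- tibble(
  Reporting_Airline = c("AA", "AA", "AS", "AS"),
  ArrDelayMinutes   = c(10, 20, 0, 40)
)

# No grouping: the whole data frame collapses to a single summary row
overall <- toy %>%
  summarize(mean_delays = mean(ArrDelayMinutes), n_flights = n())

# Grouped: summarize() returns one row per airline
by_airline <- toy %>%
  group_by(Reporting_Airline) %>%
  summarize(mean_delays = mean(ArrDelayMinutes))

print(overall)
print(by_airline)
```

Each name on the left of `=` becomes a column in the result, which is why the value on the right should return a single number per group.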
group_by()
  Syntax: group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
  Description: The group_by function takes an existing table and converts it into a grouped table where operations are performed "by group".
    .data: A data frame, data frame extension (e.g. a tibble), or a lazy data frame.
    .add: When FALSE, the default, group_by() will override existing groups.
    .drop: Drop groups formed by factor levels that don't appear in the data.
  Example:
    sub_airline %>%
      group_by(Reporting_Airline) %>%
      summarize(mean_delays = mean(ArrDelayMinutes))

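The effect of .add is easiest to see by checking the grouping variables after two successive group_by() calls. This is a small sketch on hypothetical toy data; only the column names are borrowed from the examples above.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  Reporting_Airline = c("AA", "AA", "AS", "AS"),
  DayOfWeek         = c(1, 2, 1, 1),
  ArrDelayMinutes   = c(10, 20, 0, 40)
)

# Default (.add = FALSE): the second group_by() replaces the first grouping
replaced <- toy %>%
  group_by(Reporting_Airline) %>%
  group_by(DayOfWeek)

# .add = TRUE: the second group_by() is added to the existing grouping
added <- toy %>%
  group_by(Reporting_Airline) %>%
  group_by(DayOfWeek, .add = TRUE)

print(group_vars(replaced))
print(group_vars(added))
```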
cor()
  Syntax: cor(x, use=, method= )
  Description: The cor function computes the correlation coefficient.
    x: Matrix or data frame.
    use: Specifies the handling of missing data.
    method: Specifies the type of correlation. Options are pearson, spearman or kendall.
  Example:
    sub_airline %>%
      select(DepDelayMinutes, ArrDelayMinutes) %>%
      cor(method = "pearson")

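The use argument matters as soon as a column contains NA. The sketch below uses a made-up data frame (hypothetical values) to compare the default handling with use = "complete.obs", and to show a rank-based method.

```r
# Hypothetical delays with one missing departure delay (values are made up)
delays <- data.frame(
  DepDelayMinutes = c(0, 5, 10, 15, NA),
  ArrDelayMinutes = c(2, 4, 12, 14, 30)
)

# Default use = "everything": the NA propagates into the correlation
with_na <- cor(delays)

# Drop rows that contain any NA before computing the correlation
complete <- cor(delays, use = "complete.obs")

# Same rows, but rank-based (Spearman) instead of Pearson
rank_based <- cor(delays, use = "complete.obs", method = "spearman")

print(with_na)
print(complete)
print(rank_based)
```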
cor.test()
  Syntax: cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), exact = NULL, conf.level = 0.95, continuity = FALSE, ...)
  Description: The cor.test function is a test for association/correlation between paired samples. It returns both the correlation coefficient and the significance level (or p-value) of the correlation.
    x, y: Numeric vectors of data values. x and y must have the same length.
  Example:
    sub_airline %>%
      cor.test(~DepDelayMinutes + ArrDelayMinutes, data = .)

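The object returned by cor.test() can be queried for its individual pieces. A minimal sketch with made-up numbers (the `delays` data frame is hypothetical):

```r
# Hypothetical paired delays (values are made up)
delays <- data.frame(
  DepDelayMinutes = c(0, 5, 10, 15, 30, 45),
  ArrDelayMinutes = c(2, 4, 12, 14, 33, 40)
)

# Formula interface, as in the cheat sheet example
res <- cor.test(~ DepDelayMinutes + ArrDelayMinutes, data = delays,
                method = "pearson", conf.level = 0.95)

print(res$estimate)  # correlation coefficient
print(res$p.value)   # significance level of the test
print(res$conf.int)  # confidence interval (Pearson only)
```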
aov()
  Syntax: aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
  Description: The aov function performs Analysis of Variance (ANOVA), a statistical method used to test whether there are significant differences between the means of two or more groups.
    formula: A formula specifying the model.
    data: A data frame in which the variables specified in the formula will be found. If missing, the variables are searched for in the standard way.
  Example:
    aa_as_subset <- sub_airline %>%
      select(ArrDelay, Reporting_Airline) %>%
      filter(Reporting_Airline == 'AA' | Reporting_Airline == 'AS')
    ad_aov <- aov(ArrDelay ~ Reporting_Airline, data = aa_as_subset)

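A fitted aov object is usually inspected with summary(). The sketch below, on a hypothetical two-airline data frame with made-up delays, shows one way to read the F statistic and p-value from the summary table.

```r
# Hypothetical arrival delays for two airlines (values are made up)
toy <- data.frame(
  ArrDelay          = c(5, 7, 6, 20, 22, 21),
  Reporting_Airline = factor(rep(c("AA", "AS"), each = 3))
)

fit <- aov(ArrDelay ~ Reporting_Airline, data = toy)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(fit)[[1]]
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

print(anova_table)
```

A small p-value suggests that the group means differ. In this toy data the two groups are far apart by construction.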
count()
  Syntax: count(df, vars = NULL, wt_var = NULL)
  Description: The count function lets you quickly count the unique values of one or more variables.
    df: Data frame to be processed.
    vars: Variables to count unique values of.
  Example:
    sub_airline %>%
      count(Reporting_Airline)

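As a quick illustration of the output shape, here is a sketch that assumes the dplyr version of count(), as used in the piped example above. The `toy` tibble is hypothetical. The counts land in a column named n.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(Reporting_Airline = c("AA", "AS", "AA", "DL", "AA", "AS"))

# One row per unique airline; sort = TRUE puts the most frequent first
counts <- toy %>%
  count(Reporting_Airline, sort = TRUE)

print(counts)
```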
ggplot()
  Syntax: ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())
  Description: The ggplot function initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.
  Example:
    avg_delays %>%
      ggplot(aes(x = Reporting_Airline, y = DayOfWeek, fill = mean_delays))

corrplot()
  Syntax: corrplot(method=, type=,....)
  Description: The corrplot function provides a visual exploratory tool on a correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.
    method: There are seven visualization methods (parameter method) in the corrplot package, named 'circle', 'square', 'ellipse', 'number', 'shade', 'color', 'pie'.
    type: There are three layout types (parameter type): 'full', 'upper' and 'lower'.
  Example:
    corrplot(airlines_cor, method = "color", col = col(200),
             type = "upper", order = "hclust",
             addCoef.col = "black",           # Add coefficient of correlation
             tl.col = "black", tl.srt = 45)   # Text label color and rotation

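The corrplot example relies on two objects it does not define, airlines_cor and col. One plausible way to build them is sketched below. The data values, the TaxiOut column, and the palette colors are all assumptions chosen for illustration.

```r
library(corrplot)

# Hypothetical numeric columns (values are made up)
delays <- data.frame(
  DepDelayMinutes = c(0, 5, 10, 15, 30, 45),
  ArrDelayMinutes = c(2, 4, 12, 14, 33, 40),
  TaxiOut         = c(20, 12, 18, 11, 16, 14)
)

# Correlation matrix that corrplot() will draw
airlines_cor <- cor(delays, method = "pearson")

# Assumed palette: colorRampPalette() returns a function, so col(200)
# yields 200 interpolated colors
col <- colorRampPalette(c("#BB4444", "#FFFFFF", "#4477AA"))

corrplot(airlines_cor, method = "color", col = col(200),
         type = "upper", order = "hclust",
         addCoef.col = "black",          # print the coefficients
         tl.col = "black", tl.srt = 45)  # label color and rotation
```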
geom_bar()
  Syntax: geom_bar(mapping = NULL, data = NULL, stat = "bin", position = "stack", ...)
  Description: The geom_bar function is used to produce 1d area plots: bar charts for categorical x, and histograms for continuous y.
  Example:
    # Assumes a data frame with these columns is piped in ahead of ggplot()
    ggplot(aes(x = Reporting_Airline, y = Average_Delays)) +
      geom_bar(stat = "identity") +
      ggtitle("Average Arrival Delays by Airline")

geom_tile()
  Syntax: geom_tile(mapping = NULL, data = NULL, stat = "identity", position = "identity", ...)
  Description: The geom_tile function tiles the plane with rectangles.
  Example:
    ggplot(avg_delays, aes(x = Reporting_Airline,
                           y = lubridate::wday(DayOfWeek, label = TRUE),
                           fill = bins)) +
      geom_tile(colour = "white", size = 0.2)

geom_text()
  Syntax: geom_text(mapping = NULL, data = NULL, stat = "identity", position = "identity", parse = FALSE, ...)
  Description: The geom_text function is used for text annotation.
  Example:
    ggplot(avg_delays, aes(x = Reporting_Airline,
                           y = lubridate::wday(DayOfWeek, label = TRUE),
                           fill = bins)) +
      geom_tile(colour = "white", size = 0.2) +
      geom_text(aes(label = round(mean_delays, 3)))

labs()
  Syntax: labs(...)
  Description: The labs function changes axis labels and legend titles.
    ...: A list of new names in the form aesthetic = "new name".
  Example:
    ggplot(avg_delays, aes(x = Reporting_Airline,
                           y = lubridate::wday(DayOfWeek, label = TRUE),
                           fill = bins)) +
      labs(x = "Reporting Airline", y = "Day of Week",
           title = "Average Arrival Delays")

scale_fill_manual()
  Syntax: scale_fill_manual(..., values)
  Description: The scale_fill_manual function lets you specify your own mapping from levels in the data to fill colors.
    ...: Common discrete scale parameters: name, breaks, labels, na.translate, limits and guide. See discrete_scale for more details.
    values: A set of aesthetic values to map data values to.
  Example:
    scale_fill_manual(values = c("#d53e4f", "#f46d43", "#fdae61",
                                 "#fee08b", "#e6f598", "#abdda4"))

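The heatmap examples map fill to a bins column that the cheat sheet never creates. The sketch below shows one possible way to derive it with cut() and then combine geom_tile(), geom_text(), scale_fill_manual() and labs(). The data values, the break points, and the bin labels are assumptions, and DayOfWeek is given as plain text labels here to avoid the lubridate dependency.

```r
library(dplyr)
library(ggplot2)

# Hypothetical summary table (values are made up)
avg_delays <- tibble(
  Reporting_Airline = rep(c("AA", "AS"), each = 3),
  DayOfWeek         = rep(c("Mon", "Tue", "Wed"), times = 2),
  mean_delays       = c(3, 12, 25, 8, 18, 31)
) %>%
  # Assumed binning: split mean delays into three labeled intervals
  mutate(bins = cut(mean_delays,
                    breaks = c(0, 10, 20, 40),
                    labels = c("0-10", "10-20", "20-40")))

heatmap <- ggplot(avg_delays,
                  aes(x = Reporting_Airline, y = DayOfWeek, fill = bins)) +
  geom_tile(colour = "white") +                    # one tile per airline/day
  geom_text(aes(label = round(mean_delays, 1))) +  # print the value on each tile
  scale_fill_manual(values = c("0-10"  = "#abdda4",
                               "10-20" = "#fee08b",
                               "20-40" = "#d53e4f")) +
  labs(x = "Reporting Airline", y = "Day of Week",
       fill = "Mean delay (min)", title = "Average Arrival Delays")

print(heatmap)
```

Naming the elements of values ties each color to a specific bin, so the mapping does not depend on the order of the factor levels.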
Author(s)
Lakshmi Holla
Changelog
Date Version Changed by Change Description
2023-05-11 1.1 Eric Hao & Vladislav Boyko Updated Page Frames
2021-08-09 1.0 Lakshmi Holla Initial Version