STATS 20: Chapter 6 - Factors
Thomas Maierhofer
Fall 2024
Learning Objectives
After studying this chapter, you should be able to:
▶ Identify when to use factors.
▶ Create factors using factor().
▶ Differentiate between character vectors and factors.
▶ Understand how R stores factors.
▶ Summarize a categorical variable using table().
▶ Assign and reassign levels to a factor.
▶ Order the levels of a factor.
▶ Apply a function to subsets of a vector using tapply().
Basic Definitions
Factors in Experimental Design
▶ In experimental design, a factor is an explanatory variable controlled by the
experimenter.
▶ Its levels are the different values a factor can take.
▶ Example: In an experiment on headache medications:
▶ The factor might be the medication.
▶ The levels could be the types of medication, like acetaminophen, ibuprofen, and
naproxen.
Factors are Categorical Variables
▶ The levels (categories) of a factor are often represented by numbers, either to
show an ordering or simply as arbitrary codes.
▶ Example: The Saffir-Simpson hurricane scale uses numbers to denote hurricane
categories:
▶ Category 1, Category 2, etc., based on maximum sustained wind speeds.
▶ Problem: If entered as a numeric vector, R may not recognize that this represents
categorical data.
Review: Analyzing Categorical vs. Numerical Data
▶ Categorical variables and numerical variables are analyzed differently:
▶ The mean of ibuprofen and naproxen, for instance, would not be meaningful.
▶ Computing relative frequencies of reaction times in milliseconds does not make
sense.
Factors in R
▶ Factors in R provide a way to store categorical data, especially when a vector
represents categories (levels).
▶ The factor() and as.factor() functions create or coerce vectors into factors.
group <- c("control", "treatment", "control", "treatment", "treatment")
group # Character vector
## [1] "control" "treatment" "control" "treatment" "treatment"
# Convert to a factor
group <- factor(group)
group # Factor vector
## [1] control treatment control treatment treatment
## Levels: control treatment
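To see why this matters for number-coded categories such as the hurricane scale, here is a minimal sketch. The vector codes is made up for illustration and is not part of the course data.

```r
# Hypothetical hurricane category codes, entered as plain numbers
codes <- c(3, 1, 2, 5, 3, 3, 5)

# R treats them as numeric, so it computes a mean (about 3.14)
# that has no meaning for categories
mean(codes)

# Stored as a factor, the same values are summarized by counts per category
codes_factor <- factor(codes)
counts <- table(codes_factor)
counts

# Relative frequencies are the natural summary for categorical data
prop.table(counts)
```

Calling mean() on codes_factor instead would return NA with a warning, which is R's way of flagging that the operation is not meaningful for a factor.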
Operations on Factors
▶ Factors represent categorical data, so arithmetic operations cannot be applied
directly to them.
▶ Warning: Attempting to apply numeric operations to factors will cause a warning
and produce NA values.
group + 1
## Warning in Ops.factor(group, 1): ‘+’ not meaningful for factors
## [1] NA NA NA NA NA
Working with Levels
Efficient Storage of Factors
▶ Factors are stored efficiently in R by internally coding levels as integers.
▶ This reduces memory usage for repeated values compared to character vectors.
typeof(group) # Internal storage type of the factor vector
## [1] "integer"
as.integer(group) # How levels of `group` are coded in R internally
## [1] 1 2 1 2 2
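One practical consequence of the integer coding, sketched with a hypothetical factor f whose labels look like numbers: as.integer() returns the internal codes, not the labels, so recovering the numbers means going through the labels.

```r
# Hypothetical factor whose labels happen to look like numbers
f <- factor(c("10", "20", "10", "30"))

codes <- as.integer(f)                 # internal codes: 1 2 1 3
codes

values <- as.numeric(as.character(f))  # labels converted to numbers: 10 20 10 30
values

# Same result, converting each level only once and then indexing by the factor
values2 <- as.numeric(levels(f))[f]
values2
```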
The levels() Function
The levels() function accesses the levels attribute of a factor vector.
attributes(group)
## $levels
## [1] "control" "treatment"
##
## $class
## [1] "factor"
The levels are stored as a character vector:
levels(group) # Access the factor levels
## [1] "control" "treatment"
Modifying Factor Labels with levels()
▶ You can use levels() with the assignment operator <- to change factor labels.
▶ For example, change "control" to "placebo" in the group factor.
levels(group)[1] <- "placebo"
group # factor label for the first level is now "placebo"
## [1] placebo treatment placebo treatment treatment
## Levels: placebo treatment
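The same mechanism can also merge levels. As a small sketch with a hypothetical factor answers (not part of the course data), giving one level the label of another collapses the two into a single level:

```r
# Hypothetical survey answers with inconsistent labels
answers <- factor(c("y", "n", "yes", "n", "y"))
levels(answers)   # "n" "y" "yes"

# Relabel "y" as "yes"; the two levels are merged into one
levels(answers)[levels(answers) == "y"] <- "yes"

levels(answers)   # "n" "yes"
answers
```

Selecting the level by its label rather than by position avoids relabeling the wrong level if the alphabetical order is not what you expect.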
Counting and Summarizing Levels
▶ The nlevels() function returns the number of levels in the factor.
▶ The table() function outputs a frequency table summarizing the factor levels.
nlevels(group) # Number of levels in `group`
## [1] 2
table(group) # Frequency table of the `group` factor
## group
## placebo treatment
## 2 3
Caution: Changing Factor Values
▶ Assigning a factor element a value that is not an existing level will:
▶ Replace the value with NA.
▶ Throw a warning.
group[5] <- "control" # Warning: "control" is not an existing level
## Warning in `[<-.factor`(`*tmp*`, 5, value = "control"): invalid factor level,
## NA generated
group
## [1] placebo treatment placebo treatment <NA>
## Levels: placebo treatment
group[5] <- "placebo" # No warning: "placebo" is an existing level
group
## [1] placebo treatment placebo treatment placebo
## Levels: placebo treatment
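One way to avoid the NA, shown here as a sketch with a hypothetical factor g, is to add the new level before assigning it:

```r
# Hypothetical treatment groups
g <- factor(c("placebo", "treatment", "placebo"))

# Register the new level first ...
levels(g) <- c(levels(g), "control")

# ... then the assignment works without a warning
g[3] <- "control"
g
table(g)
```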
Specifying All Possible Levels
The levels argument in the factor() function allows us to specify all possible
levels, even if some levels are not (yet) observed in the data.
# Sample hurricane category data with all possible levels
hurricanes <- factor(c(3, 1, 2, 5, 3, 3, 5), levels = c(1, 2, 3, 4, 5))
hurricanes
## [1] 3 1 2 5 3 3 5
## Levels: 1 2 3 4 5
Here, levels 1 through 5 are specified, even though level 4 is not observed.
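A quick sketch of why this is useful: with all levels declared, table() reports the unobserved category with a count of zero instead of silently omitting it.

```r
# Same hurricane data as above, with all five categories declared as levels
hurricanes <- factor(c(3, 1, 2, 5, 3, 3, 5), levels = c(1, 2, 3, 4, 5))

counts <- table(hurricanes)
counts   # category 4 is listed with a count of 0
```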
Adding Levels Using the levels() Function
Levels can also be added to an existing factor by modifying the levels attribute
directly.
# Sample gender data
gender <- factor(c("M", "F", "F", "M", "M"))
levels(gender) # Current levels: "F", "M" (alphabetical order)
## [1] "F" "M"
levels(gender)[3] <- "X" # Add a new level "X"
levels(gender) # View updated levels
## [1] "F" "M" "X"
gender
## [1] M F F M M
## Levels: F M X
The gender factor now has an additional level “X”, even though “X” is not observed in
the data.
Extracting Values from Factors
▶ Factors are special vectors that can be subset using square brackets (numeric
indices and logical indices work).
▶ When subsetting, the levels attribute of the original factor is retained, even if the
subset does not include all levels.
hurricanes[1:3] # Only contains 1, 2, 3
## [1] 3 1 2
## Levels: 1 2 3 4 5
hurricanes[c(rep(TRUE, 3), rep(FALSE, 4))] # same as above
## [1] 3 1 2
## Levels: 1 2 3 4 5
Removing Unobserved Levels
To remove the unobserved levels, we could invoke the factor() function again to
reset the levels attribute:
factor(hurricanes[1:3]) # resets the levels attribute
## [1] 3 1 2
## Levels: 1 2 3
or more directly remove unobserved levels by specifying the argument drop = TRUE in
the square brackets:
hurricanes[1:3, drop = TRUE] # remove unobserved level
## [1] 3 1 2
## Levels: 1 2 3
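Base R also provides droplevels(), which is not covered on these slides but does the same job. A minimal sketch using the hurricane factor from above:

```r
# Hurricane categories with all five levels declared
hurricanes <- factor(c(3, 1, 2, 5, 3, 3, 5), levels = c(1, 2, 3, 4, 5))

first_three <- hurricanes[1:3]
nlevels(first_three)            # 5: the unused levels are still attached

trimmed <- droplevels(first_three)
nlevels(trimmed)                # 3: only the observed levels remain
levels(trimmed)
```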
Ordered Levels
Ordered vs. Unordered Levels
▶ Ordinal variables: Categorical variables with a natural ordering (e.g., hurricane
categories, coffee sizes).
▶ Nominal variables: Categorical variables without a natural ordering (e.g., gender,
eye color).
Default Ordering in factor()
▶ By default, factor() orders character levels alphabetically and numeric levels in
increasing order.
▶ In most locales, lowercase letters come before uppercase in alphabetical order
(a < A); the exact sort order depends on the locale.
Example: Month Names in Alphabetical Order
If we create a factor of month names, the natural ordering will not be preserved.
month.name # Built-in character vector of month names
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
factor(month.name) # Alphabetical order
## [1] January February March April May June July
## [8] August September October November December
## 12 Levels: April August December February January July June March ... September
The table() function orders its output by the factor's levels, which gives unexpected
results when the levels are not in the intended order.
table(x = factor(month.name))
## x
##     April    August  December  February   January      July      June     March
##         1         1         1         1         1         1         1         1
##       May  November   October September
##         1         1         1         1
The same happens with the plot() function:
plot(x = factor(month.name), y = 1:12)
[Figure: plot of y (1:12) against factor(month.name); the x-axis shows the months in alphabetical order.]
Specifying a Custom Order for Levels
To set a custom order for levels, use the levels argument in factor(). Setting
ordered = TRUE additionally tells R to store the result explicitly as an ordered factor.
factor(month.name, levels = month.name) # levels in correct calendar order
## [1] January February March April May June July
## [8] August September October November December
## 12 Levels: January February March April May June July August ... December
table(factor(month.name, levels = month.name))
##
##   January  February     March     April       May      June      July    August
##         1         1         1         1         1         1         1         1
## September   October  November  December
##         1         1         1         1
plot(x = factor(month.name, levels = month.name), y = 1:12) # much better
[Figure: plot of y (1:12) against factor(month.name, levels = month.name); the x-axis shows the months in calendar order.]
Explicitly Creating an Ordered Factor
▶ There is a subclass of “factor” called “ordered”, which marks the variable as
an ordered factor, i.e., an ordinal categorical variable.
▶ Most functions do not care about this distinction and simply use whatever order the
levels are in.
ordered <- factor(month.name, levels = month.name, ordered = FALSE)
explicitly_ordered <- factor(month.name, levels = month.name, ordered = TRUE)
attributes(ordered) # class is "factor", levels are in correct order
## $levels
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
##
## $class
## [1] "factor"
attributes(explicitly_ordered) # this is of class "ordered" as well as "factor"
## $levels
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
##
## $class
## [1] "ordered" "factor"
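One place where the distinction does matter is comparison. The sketch below uses a hypothetical ordered factor of coffee sizes (made-up data) to show that relational operators and max() respect the level order of an ordered factor:

```r
# Hypothetical coffee orders, stored as an ordered factor
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

smaller <- sizes < "large"   # TRUE FALSE TRUE TRUE
smaller

biggest <- max(sizes)        # large
biggest

sort(sizes)                  # sorted by level order, not alphabetically
```

With an unordered factor, the same comparison would return NA with a warning, because R has no ordering to compare by.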
Operations on Subsets of Data
The Split-Apply-Combine Strategy
▶ Split the data into groups based on some criteria.
▶ Apply a function to each group independently.
▶ Combine the results into a data structure.
More specifically:
▶ Using factor levels to subset and analyze specific categories is common in
statistical analysis, such as finding means or counts per level.
▶ Subsetting and logical indexing allow us to extract subsets of an object (usually
a vector or matrix) based on a condition or criterion.
▶ A natural application is to subset an object based on the levels of a factor (i.e.,
categories of a categorical variable).
▶ We can then apply functions to these subsets, enabling flexible data
manipulation and analysis by category.
▶ Combine the results in a vector or matrix.
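The three steps can be sketched by hand with split() and sapply(). The vector times below is made-up illustration data; the grouping factor mirrors the group example from earlier in the chapter.

```r
# Hypothetical reaction times (in ms) and their group labels
times <- c(310, 295, 402, 388, 350)
group <- factor(c("control", "treatment", "control", "treatment", "treatment"))

# Split: a list with one numeric vector per factor level
pieces <- split(times, group)
pieces

# Apply and combine: compute the mean of each piece and
# collect the results in a named numeric vector
means <- sapply(pieces, mean)
means
```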
The tapply() Function
The tapply() function is used to apply a function to subsets of a vector.
The syntax of tapply() is tapply(X, INDEX, FUN, ..., simplify = TRUE),
where the arguments are:
▶ X: A numeric or logical vector
▶ INDEX: A factor or list of factors that identifies the subsets. Non-factors will be
coerced into factors.
▶ FUN: The function to be applied.
▶ ...: Any optional arguments to be passed to the FUN function.
▶ simplify: Logical value that specifies whether to simplify the output to a matrix
or array.
The tapply() function splits the values of the vector X into groups, each group
corresponding to a level of the INDEX factor, then applies the function in FUN to each
group.
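As a small sketch of how the ... argument is passed on to FUN, consider hypothetical scores with one missing value (scores and section are made-up names and data). Note that section is a character vector, which tapply() coerces to a factor.

```r
# Hypothetical scores, one of them missing, for two course sections
scores  <- c(80, NA, 75, 90, 85)
section <- c("A", "A", "B", "B", "B")

# Without extra arguments, the NA propagates into the mean for section A
plain <- tapply(X = scores, INDEX = section, FUN = mean)
plain

# na.rm = TRUE is passed through ... to mean()
cleaned <- tapply(X = scores, INDEX = section, FUN = mean, na.rm = TRUE)
cleaned
```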
Example: Hurricanes Data
As an example, we will consider the hurricanes.RData file, which has four objects
category, pressure, wind, and year, containing measurements on 455 hurricanes
that occurred between 2006 and 2011.
load("hurricanes.RData") # Load the objects in the hurricanes data
category[1:10] # The Saffir-Simpson classification
## [1] 1 2 1 1 2 2 1 2 1 1
## Levels: 1 < 2 < 3 < 4 < 5
pressure[1:10] # Air pressure at the hurricane's center (in millibars)
## [1] 983 968 981 960 952 983 981 953 985 990
wind[1:10] # Hurricane's maximum sustained wind speed (in knots)
## [1] 65 90 80 65 95 85 70 95 80 70
Example (continued): Using tapply() for Grouped Calculations
▶ Suppose we want to determine if mean air pressure at a hurricane’s center is
related to its category.
▶ The tapply() function allows us to split the data by category and compute the
mean for each subset.
# Compute mean pressure grouped by hurricane category
tapply(X = pressure, INDEX = category, FUN = mean)
## 1 2 3 4 5
## 979.3766 964.3333 954.7407 940.3220 924.3000
From the output, we see that the mean pressure at a hurricane’s center is lower for
higher category hurricanes.
Question: How would you find the mean maximum sustained wind speed in each year?
Example (continued): Using tapply() with Multiple Factors
▶ Suppose we want the mean pressure for each category/year combination.
▶ The tapply() function can group values by combinations of levels from
multiple factors.
▶ When using multiple factors in tapply(), put the factors in a list in the INDEX
argument.
# Compute the mean pressure for each category/year combination
tapply(X = pressure, INDEX = list(category, year), FUN = mean)
## 2006 2007 2008 2009 2010 2011
## 1 983.9 981.5217 979.8158 977.9524 977.1948 979.3000
## 2 969.5 973.6000 957.0385 967.5000 966.5862 964.5385
## 3 957.0 948.0000 955.4286 953.6667 955.0000 954.6923
## 4 NA 933.7143 945.3750 948.8000 938.5238 942.6667
## 5 NA 924.3000 NA NA NA NA
Question: How would you find out how many observations are in each category (or
Last Slide: Why I Don’t Like Factors
▶ Factors can be unintuitive, especially with the default alphabetical ordering.
▶ Modifying factors (e.g., adding levels) is cumbersome and can lead to
unexpected behavior, such as warnings and NA values.
▶ Arithmetic operations and other functions often don’t handle factors as expected
(character labels vs. internal storage as integers).
▶ Just use character vectors; they are simpler and more transparent for categorical
variables.
Why I Still Teach Factors
▶ Factors are a foundational data type in base R, widely used and often
encountered in code and data.
▶ Understanding factors is essential for using R, including in many R packages and
statistical functions.