Unit 1: Factor, List and Data Frames

Factors
In this section, you’ll look at some simple functions related to creating, handling, and
inspecting factors. Factors are R’s most natural way of representing data points that fit in only
one of a finite number of distinct categories, rather than belonging to a continuum. Categorical
variables in R are called “factors”. Factors have as many levels as there are unique categories.

# create a vector called 'gender'
gender <- c("f", "f", "f", "m", "m", "m", "m")
# transform 'gender' into a factor object
gender <- factor(gender)
# examine the structure of 'gender'
str(gender)
## Factor w/ 2 levels "f","m": 1 1 1 2 2 2 2
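
To inspect a factor beyond str, a few base R helpers are handy. The following is a minimal sketch that rebuilds the gender factor from the example above and queries its levels, the number of levels, the count per level, and the underlying integer codes.

```r
# Rebuild the 'gender' factor from the example above
gender <- factor(c("f", "f", "f", "m", "m", "m", "m"))

# The distinct categories (sorted alphabetically by default)
levels(gender)      # "f" "m"

# The number of levels
nlevels(gender)     # 2

# How many observations fall into each level
table(gender)       # f: 3, m: 4

# The integer codes that str() showed after the level names
as.integer(gender)  # 1 1 1 2 2 2 2
```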

Lists of Objects

The list is an incredibly useful data structure. It can be used to group together any mix of R
structures and objects. A single list could contain a numeric matrix, a logical array, a single
character string, and a factor object. You can even have a list as a component of another list.
In this section, you’ll see how to create, modify, and access components of these flexible
structures. Creating a list is much like creating a vector. You supply the elements that you
want to include to the list function, separated by commas.

R> foo <- list(matrix(data=1:4,nrow=2,ncol=2),c(T,F,T,T),"hello")
R> foo
[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[2]]
[1]  TRUE FALSE  TRUE  TRUE

[[3]]
[1] "hello"

In the list foo, you’ve stored a 2-by-2 numeric matrix, a logical vector, and a character string.
These are printed in the order they were supplied to list. Just as with vectors, you can use the
length function to check the number of components in a list.

R> length(x=foo)
[1] 3
You can retrieve components from a list using indexes, which are entered in double square
brackets.

R> foo[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4
R> foo[[3]]
[1] "hello"

This action is known as a member reference. When you’ve retrieved a component this way,
you can treat it just like a stand-alone object in the workspace.
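
As a rough illustration of treating a retrieved member like a stand-alone object, the sketch below rebuilds foo and then indexes and computes on its members directly. The comparison with single square brackets is an addition not covered in the text: single brackets return a shorter list rather than the member itself. The name sub is hypothetical.

```r
# Rebuild the list from the example above
foo <- list(matrix(data = 1:4, nrow = 2, ncol = 2),
            c(TRUE, FALSE, TRUE, TRUE),
            "hello")

# A retrieved member behaves like any other matrix: index it or do arithmetic
foo[[1]][1, 2]   # row 1, column 2 of the matrix: 3
foo[[1]] + 10L   # adds 10 to every element of the matrix

# Count the TRUE values in the logical vector
sum(foo[[2]])    # 3

# Single square brackets return a list containing the chosen members
sub <- foo[2:3]  # 'sub' is a hypothetical name
class(sub)       # "list"
length(sub)      # 2
```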

To overwrite a member of foo, you use the assignment operator.

R> foo[[3]]
[1] "hello"
R> foo[[3]] <- paste(foo[[3]],"you!")
R> foo
[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[2]]
[1]  TRUE FALSE  TRUE  TRUE

[[3]]
[1] "hello you!"

Naming
You can name list components to make the elements more recognizable and easy to work
with. Just like the information stored about factor levels, a name is an R attribute.

Let’s start by adding names to the list foo from earlier.

R> names(foo) <- c("mymatrix","mylogicals","mystring")
R> foo
$mymatrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$mylogicals
[1]  TRUE FALSE  TRUE  TRUE

$mystring
[1] "hello you!"

This has changed how the object is printed to the console. Where earlier it printed [[1]], [[2]],
and [[3]] before each component, now it prints the names you specified: $mymatrix,
$mylogicals, and $mystring. You can now perform member referencing using these names
and the dollar operator, rather than the double square brackets.

R> foo$mymatrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

This is the same as calling foo[[1]]. In fact, even when an object is named, you can still use
the numeric index to obtain a member.

R> foo[[1]]
     [,1] [,2]
[1,]    1    3
[2,]    2    4
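
Names can also be supplied when the list is created, and a named list can itself be a member of another list, as mentioned earlier. The following is a small sketch under those assumptions; the outer list baz and its member names inner and note are hypothetical and chosen only for illustration.

```r
# Supply the names directly in the call to list()
foo <- list(mymatrix = matrix(1:4, nrow = 2, ncol = 2),
            mylogicals = c(TRUE, FALSE, TRUE, TRUE),
            mystring = "hello you!")
names(foo)   # "mymatrix" "mylogicals" "mystring"

# A list stored as a member of another list ('baz' is a hypothetical name)
baz <- list(inner = foo, note = "a list inside a list")

# Member references can be chained, by name or by double square brackets
baz$inner$mystring                   # "hello you!"
baz[["inner"]][["mylogicals"]][2]    # FALSE
```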

Data Frames
A data frame is R’s most natural way of presenting a data set with a collection of recorded
observations for one or more variables. Like lists, data frames have no restriction on the data
types of the variables; you can store numeric data, factor data, and so on. The R data frame
can be thought of as a list with some extra rules attached. The most important distinction is
that in a data frame (unlike a list), the members must all be vectors of equal length.
The data frame is one of the most important and frequently used tools in R for statistical data
analysis. To create a data frame from scratch, use the data.frame function. You supply your
data, grouped by variable, as vectors of the same length—the same way you would construct
a named list. Consider the following example data set:
R> mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie"),
                        age=c(42,40,17,14,1),
                        gender=factor(c("M","F","F","M","M")))
R> mydata
  person age gender
1  Peter  42      M
2   Lois  40      F
3    Meg  17      F
4  Chris  14      M
5 Stewie   1      M

Here, you’ve constructed a data frame with the first name, age in years, and gender of five
individuals. The returned object should make it clear why vectors passed to data.frame must
be of equal length: vectors of differing lengths wouldn’t make sense in this context. If you
pass vectors of unequal length to data.frame, R recycles a shorter vector only when the
longest length is an exact multiple of its length, which can throw your data off and allocate
observations to the wrong record; otherwise, R stops with an error. Notice that data frames
are printed to the console in rows and columns, so they look more like a matrix than a named
list. This natural spreadsheet style makes it easy to read and manipulate data sets. Each row
in a data frame is called a record, and each column is a variable.
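
To see what happens with unequal lengths, here is a minimal sketch using two small made-up data frames. The names ok and result, and the column names a and b, are hypothetical. tryCatch is used only so that the error message can be captured instead of halting the script.

```r
# Length 4 is an exact multiple of length 2, so 'b' is silently recycled
ok <- data.frame(a = 1:4, b = c("x", "y"))
nrow(ok)             # 4
as.character(ok$b)   # "x" "y" "x" "y"

# Length 3 is not a multiple of length 2, so data.frame() stops with an error;
# the error message is captured as a character string
result <- tryCatch(
  data.frame(a = 1:3, b = c("x", "y")),
  error = function(e) conditionMessage(e)
)
result
```

Silent recycling is the riskier of the two outcomes, because the data frame is created without complaint even though the values in b may not belong to the records they end up in.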
You can extract portions of the data by specifying row and column index positions (much as
with a matrix). Here’s an example:

R> mydata[2,2]
[1] 40

This gives you the element at row 2, column 2—the age of Lois. Now extract the third, fourth,
and fifth elements of the third column:

R> mydata[3:5,3]
[1] F M M
Levels: F M

This returns a factor vector with the gender of Meg, Chris, and Stewie. The following extracts
the entire third and first columns (in that order):

R> mydata[,c(3,1)]
  gender person
1      M  Peter
2      F   Lois
3      F    Meg
4      M  Chris
5      M Stewie

This results in another data frame giving the gender and then the name of each person.
You can also use the names of the vectors that were passed to data.frame to access variables
even if you don’t know their column index positions, which can be useful for large data sets.
You use the same dollar operator you used for member-referencing named lists.

R> mydata$age
[1] 42 40 17 14 1

You can subset this returned vector, too:

R> mydata$age[2]
[1] 40

This returns the same thing as the earlier call of mydata[2,2].
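
The equivalence of these access styles can be checked directly. The sketch below rebuilds mydata and retrieves Lois's age in several ways. The forms mydata[["age"]] and mydata[2,"age"] are further base R alternatives that the text does not use, and the last line selects the record by a condition on person rather than by position.

```r
mydata <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
                     age = c(42, 40, 17, 14, 1),
                     gender = factor(c("M", "F", "F", "M", "M")),
                     stringsAsFactors = FALSE)

mydata[2, 2]         # row and column positions: 40
mydata$age[2]        # dollar operator, then vector index: 40
mydata[["age"]][2]   # list-style member reference: 40
mydata[2, "age"]     # row position and column name: 40

# Select the row by a condition instead of a position
mydata[mydata$person == "Lois", "age"]   # 40
```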


You can report the size of a data frame—the number of records and variables—just as you’ve
seen for the dimensions of a matrix.
R> nrow(mydata)
[1] 5

R> ncol(mydata)
[1] 3

R> dim(mydata)
[1] 5 3

The nrow function retrieves the number of rows (records), ncol retrieves the number of
columns (variables), and dim retrieves both.
In versions of R before 4.0.0, the default behavior of data.frame was to convert each character
vector it was passed into a factor object. In such a version, you would observe the following:

R> mydata$person
[1] Peter Lois Meg Chris Stewie
Levels: Chris Lois Meg Peter Stewie

Notice that this variable has levels, which shows it’s being treated as a factor. But this isn’t
what you intended when you defined mydata earlier: you explicitly defined gender to be a
factor but left person as a vector of character strings. To prevent this automatic conversion of
character strings to factors when using data.frame, set the optional argument stringsAsFactors
to FALSE. (It defaulted to TRUE before R 4.0.0; from 4.0.0 onward the default is FALSE, so
setting it explicitly is harmless.) Reconstructing mydata with this in place looks like this:

R> mydata <- data.frame(person=c("Peter","Lois","Meg","Chris","Stewie"),
                        age=c(42,40,17,14,1),
                        gender=factor(c("M","F","F","M","M")),
                        stringsAsFactors=FALSE)

R> mydata
  person age gender
1  Peter  42      M
2   Lois  40      F
3    Meg  17      F
4  Chris  14      M
5 Stewie   1      M

R> mydata$person
[1] "Peter" "Lois" "Meg" "Chris" "Stewie"

You now have person in the desired, nonfactor form.
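
Rather than relying on how a column prints, you can check its type directly. The following is a small sketch that rebuilds mydata with stringsAsFactors=FALSE and inspects each variable; the name col_classes is hypothetical.

```r
mydata <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
                     age = c(42, 40, 17, 14, 1),
                     gender = factor(c("M", "F", "F", "M", "M")),
                     stringsAsFactors = FALSE)

# The class of every column at once ('col_classes' is a hypothetical name)
col_classes <- sapply(mydata, class)
col_classes   # person: "character", age: "numeric", gender: "factor"

# Individual checks
is.character(mydata$person)   # TRUE
is.factor(mydata$gender)      # TRUE
```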


Adding Data Columns and Combining Data Frames
Say you want to add data to an existing data frame. This could be a set of observations for a
new variable (adding to the number of columns), or it could be more records (adding to the
number of rows). Once again, you can use some of the functions you’ve already seen applied
to matrices.
Recall the rbind and cbind functions, which let you append rows and columns, respectively.
These same functions can be used to extend data frames intuitively. For example, suppose
you had another record to include in mydata: the age and gender of another individual, Brian.
The first step is to create a new data frame that contains Brian’s information.

R> newrecord <- data.frame(person="Brian",age=7,
                           gender=factor("M",levels=levels(mydata$gender)))
R> newrecord
  person age gender
1  Brian   7      M

To avoid any confusion, it’s important to make sure the variable names and the data types
match the data frame you’re planning to add this to. Note that for a factor, you can extract the
levels of the existing factor variable using levels.
Now, you can simply call the following:

R> mydata <- rbind(mydata,newrecord)
R> mydata
  person age gender
1  Peter  42      M
2   Lois  40      F
3    Meg  17      F
4  Chris  14      M
5 Stewie   1      M
6  Brian   7      M

Using rbind, you combined mydata with the new record and overwrote mydata with the result.
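
The advice about matching variable names can be illustrated with a short sketch. Here badrecord is a hypothetical record whose first column is deliberately misnamed, and tryCatch captures the resulting error so the script keeps running. The names combined and outcome are also hypothetical.

```r
mydata <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
                     age = c(42, 40, 17, 14, 1),
                     gender = factor(c("M", "F", "F", "M", "M")),
                     stringsAsFactors = FALSE)

# A record with matching names and types binds cleanly
newrecord <- data.frame(person = "Brian", age = 7,
                        gender = factor("M", levels = levels(mydata$gender)),
                        stringsAsFactors = FALSE)
combined <- rbind(mydata, newrecord)
nrow(combined)              # 6
levels(combined$gender)     # "F" "M"

# A record whose column is called 'name' instead of 'person' is rejected
badrecord <- data.frame(name = "Brian", age = 7,
                        gender = factor("M", levels = levels(mydata$gender)),
                        stringsAsFactors = FALSE)
outcome <- tryCatch(rbind(mydata, badrecord),
                    error = function(e) conditionMessage(e))
outcome                     # the captured error message
```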
Adding a variable to a data frame is also quite straightforward. Let’s say you’re now given
data on the classification of how funny these six individuals are, defined as a “degree of
funniness.” The degree of funniness can take three possible values: Low, Med (medium), and
High. Suppose Peter, Lois, and Stewie have a high degree of funniness, Chris and Brian have
a medium degree of funniness, and Meg has a low degree of funniness. In R, you’d have a
factor vector like this:

R> funny <- c("High","High","Low","Med","High","Med")
R> funny <- factor(x=funny, levels=c("Low","Med","High"))
R> funny
[1] High High Low  Med  High Med
Levels: Low Med High
The first line creates the basic character vector as funny, and the second line overwrites funny
by turning it into a factor. The order of these elements must correspond to the records in your
data frame. Now, you can simply use cbind to append this factor vector as a column to the
existing mydata.

R> mydata <- cbind(mydata,funny)
R> mydata
  person age gender funny
1  Peter  42      M  High
2   Lois  40      F  High
3    Meg  17      F   Low
4  Chris  14      M   Med
5 Stewie   1      M  High
6  Brian   7      M   Med
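
The funny factor was created with an explicit levels argument. As a rough sketch of why that matters, the code below compares it with the same data converted without levels, in which case R sorts the levels alphabetically. The name funny_default is hypothetical.

```r
values <- c("High", "High", "Low", "Med", "High", "Med")

# Levels in the order supplied: Low, Med, High
funny <- factor(x = values, levels = c("Low", "Med", "High"))
levels(funny)       # "Low" "Med" "High"
as.integer(funny)   # 3 3 1 2 3 2
table(funny)        # counts in level order: Low 1, Med 2, High 3

# Without 'levels', the levels are sorted alphabetically
funny_default <- factor(values)   # hypothetical name
levels(funny_default)             # "High" "Low" "Med"
```

The level order determines how summaries such as table are laid out, so supplying it explicitly keeps Low, Med, and High in their natural sequence.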

The rbind and cbind functions aren’t the only ways to extend a data frame. One useful
alternative for adding a variable is to use the dollar operator, much like adding a new member
to a named list. Suppose now you want to add another variable to mydata by including a
column with the age of the individuals in months, not years, calling this new variable age.mon.

R> mydata$age.mon <- mydata$age*12
R> mydata
  person age gender funny age.mon
1  Peter  42      M  High     504
2   Lois  40      F  High     480
3    Meg  17      F   Low     204
4  Chris  14      M   Med     168
5 Stewie   1      M  High      12
6  Brian   7      M   Med      84

This creates a new age.mon column with the dollar operator and at the same time assigns it
the vector of ages in years (already stored as age) multiplied by 12.
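
The dollar operator is not the only list-style way to add a column. The following minimal sketch, on a shortened three-record version of mydata, also uses double square brackets, which is convenient when the column name is held in a variable, and removes a column by assigning NULL. The column age.days and the variable newname are hypothetical, and 365 days per year is an approximation.

```r
mydata <- data.frame(person = c("Peter", "Lois", "Meg"),
                     age = c(42, 40, 17),
                     stringsAsFactors = FALSE)

# Add a column with the dollar operator, as in the text
mydata$age.mon <- mydata$age * 12

# Add a column whose name is stored in a variable (hypothetical name)
newname <- "age.days"
mydata[[newname]] <- mydata$age * 365   # approximate: ignores leap years
names(mydata)   # "person" "age" "age.mon" "age.days"

# Remove a column by assigning NULL to it
mydata$age.days <- NULL
names(mydata)   # "person" "age" "age.mon"
```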

Logical Record Subsets


You can use logical flag vectors to subset data structures. This is a particularly useful
technique with data frames, where you’ll often want to examine a subset of entries that meet
certain criteria. For example, when working with data from a clinical drug trial, a researcher
might want to examine the results for just male participants and compare them to the results
for females. Or the researcher might want to look at the characteristics of individuals who
responded most positively to the drug.
Let’s continue to work with mydata. Say you want to examine all records corresponding to
males. The following line identifies the relevant positions in the gender factor
vector:
R> mydata$gender=="M"
[1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE

This flags the male records. You can use this with the matrix-like syntax to get the male-only
subset.

R> mydata[mydata$gender=="M",]
  person age gender funny age.mon
1  Peter  42      M  High     504
4  Chris  14      M   Med     168
5 Stewie   1      M  High      12
6  Brian   7      M   Med      84

This returns data for all variables for only the male participants. You can use the same
behavior to pick and choose which variables to return in the subset. For example, since you
know you are selecting the males only, you could omit gender from the result using a negative
numeric index in the column dimension.

R> mydata[mydata$gender=="M",-3]
  person age funny age.mon
1  Peter  42  High     504
4  Chris  14   Med     168
5 Stewie   1  High      12
6  Brian   7   Med      84

If you don’t have the column number or if you want to have more control over the returned
columns, you can use a character vector of variable names instead.

R> mydata[mydata$gender=="M",c("person","age","funny","age.mon")]
  person age funny age.mon
1  Peter  42  High     504
4  Chris  14   Med     168
5 Stewie   1  High      12
6  Brian   7   Med      84

The logical conditions you use to subset a data frame can be as simple or as complicated as
you need them to be. The logical flag vector you place in the square brackets just has to match
the number of records in the data frame. Let’s extract from mydata the full records for
individuals who are more than 10 years old OR have a high degree of funniness.

R> mydata[mydata$age>10|mydata$funny=="High",]
  person age gender funny age.mon
1  Peter  42      M  High     504
2   Lois  40      F  High     480
3    Meg  17      F   Low     204
4  Chris  14      M   Med     168
5 Stewie   1      M  High      12
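
Conditions can be combined with AND (&) as well as OR (|). The sketch below rebuilds the six-record mydata and shows an AND condition, the row positions a condition selects via which, and the base R function subset, which the text does not use but which expresses the same kind of selection. The result names are hypothetical.

```r
mydata <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie", "Brian"),
                     age = c(42, 40, 17, 14, 1, 7),
                     gender = factor(c("M", "F", "F", "M", "M", "M")),
                     funny = factor(c("High", "High", "Low", "Med", "High", "Med"),
                                    levels = c("Low", "Med", "High")),
                     stringsAsFactors = FALSE)

# AND: males who also have a high degree of funniness (Peter and Stewie)
funny_males <- mydata[mydata$gender == "M" & mydata$funny == "High", ]
funny_males$person

# The row positions picked out by the OR condition from the text: 1 2 3 4 5
which(mydata$age > 10 | mydata$funny == "High")

# subset() takes the condition and the columns to keep (Meg and Chris)
older_not_high <- subset(mydata, age > 10 & funny != "High",
                         select = c(person, age))
older_not_high
```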
