Please Read R Manual For Stats

An overview document regarding statistics in R

R Reference Manual:

A gentle overview

Table of Contents

1. Introduction to R
 Obtaining R
 Basics and Orientation
 Scripts and Workspaces
 Importing Data

2. Data Analysis and Statistical Concepts
 Concept 1 – Measurements of Central Tendency
 Concept 2 – Measurements of Dispersion
 Concept 3 – Visualization of Univariate Data
 Concept 4 – Visualization of Multivariate Data
 Concept 5 – Random Number Generation And Simple Sampling
 Concept 6 – Confidence Intervals

2 Developed and maintained by the Center for Statistics and Analytical Services of Kennesaw State University
These reference manuals have been developed to assist students in the basics of statistical computing –
sort of a “Statistical Computing for Dummies”. It is not our intention to use this manual to teach
statistical concepts1…but rather to demonstrate how to utilize previously taught statistical and data
analysis concepts the way that professionals and practitioners apply them – through the able assistance
of computing. Proficiency in software allows students to focus more on the interpretation of the output
and on the application of results rather than on the mathematical computations.

We should pause here and strongly make the point that computers should serve as a medium of
expediency of calculation – not as a substitution for the ability to execute a calculation.

In the Basic Concepts manual, we present statistical concepts, context for their use, and formulas where
appropriate. We provide exercises to execute these concepts by hand. Then, in each subsequent manual,
the concepts are applied in a consistent manner using each of the five major statistical computing
packages – Excel, SPSS, Minitab, R and SAS.

1 Readers of this manual are assumed to have completed some introductory statistics course. For individuals wishing to
review statistical concepts, we recommend Introduction to Stats by DeVeaux, Velleman and Bock.

What is R?

Unlike MS Excel, SPSS, and Minitab, yet similar to SAS, R is a command-driven programming
environment for statistical analysis. Unlike all of the other software packages we have discussed,
which are proprietary2 (including SAS), R is an open-source program that is free and readily available via
download from the internet.

Of all the packages, we acknowledge that both R and SAS represent substantial challenges for students.
However, like SAS, R is among the most analytically comprehensive and most flexible of the statistical
software applications. Furthermore, R is becoming quite popular in quantitative analysis in many fields
including statistics, social science research (Psychology, Sociology, Education, etc.), marketing research,
business intelligence, etc.

R is an implementation of the S programming language, which was originally developed at Bell Labs in
the 1970s. S-Plus is a commercial implementation of the same language. Therefore, S-Plus and R code are
often interchangeable, and instructions for one program will usually be applicable to the other.

2 And therefore very expensive.

Obtaining R

Before importing the WidgeOne example data into R and tackling the basic statistics that we all know and
love, we will first discuss how to obtain R.

Remember how we said R is free? Let’s download it from the internet…for free! Follow these steps to
download and install R:

Step 1. The official R website is called CRAN (The Comprehensive R Archive Network). Therefore, search
for CRAN3 in your favorite internet search engine.

3 The URL for CRAN is [Link]

Step 2. From the main CRAN page, select the appropriate download version for your operating
system.

Step 3. Next, select "base" to download the basic R program.

Step 4. Next, select "Download R XX for XX" to download the R installation program (where XX is the
version number).

Step 5. Save and then run the [Link] file (where XX is the version number).

Step 6. Follow the steps of the R setup wizard (you can stick with the defaults). If you want to be clever
and save R on a flash drive, simply browse to your flash drive location during the “Set Destination
Location” step of the setup wizard. Then you can impress your friends with fast access to R from any
computer!

If you choose to save it to your flash drive, be sure to also create a shortcut to launch R from your flash
drive. The easiest way to do this is simply to move the shortcut that will appear on your desktop to your
flash drive (you can copy and paste, or just drag it).

R Basics & Orientation

After installing and launching R for the first time, you should see this:

The main component of this interface is the R Console. This is where the user submits commands to the
program AND where R prints the results of those commands. However, typing commands directly into
the console is often not done because it is easy to make an error and difficult to re-create what you did at
a later time. Therefore, one can also write, develop (debug), and submit R code from a separate savable
file called a script. If you are a SAS user, an R script is very much like your SAS programming file that
you develop in the Enhanced Editor (i.e., the .sas file).

R as a Calculator

It may be easy to feel overwhelmed with the R environment at first, but do not let your hearts be
troubled. Just think of R as a super graphing-calculator, much like your old Texas Instruments TI-83, but
a bit more powerful (and cheaper). You can simply type mathematical expressions into the R console, hit
"Enter" and the result is printed in the R console.

Much like any good calculator, there are a large number of mathematical and statistical functions that
are available to the user. The following table presents a few of these. Try them out.

Function/Operation  Description                                                 Example             Result

Mathematical
+                   Addition                                                    3+4                 7
-                   Subtraction                                                 3-4                 -1
*                   Multiplication                                              3*4                 12
/                   Division                                                    3/4                 0.75
x^2                 The power function.                                         2^2                 4
sqrt(x)             The square root of x.                                       sqrt(4)             2
log(x)              The natural logarithm of x (default base of e = 2.718281…)  log(100)            4.60517
log(x,base=10)      The logarithm of x (base of 10)                             log(100,base=10)    2
exp(x)              The exponential of x.                                       exp(10)             22026.47
sin(x)              The sine function of x.                                     sin(100)            -0.50637
cos(x)              The cosine function of x.                                   cos(100)            0.862319
tan(x)              The tangent function of x.                                  tan(100)            -0.58721
asin(x)             The arc-sine function of x.                                 asin(.5)            0.523599
round(x)            The rounding function.                                      round(4.60517)      5

Statistical
mean(x)             The mean of x.                                              mean(c(3,4,5))      4
median(x)           The median of x.                                            median(c(3,4,5))    4
sd(x)               The standard deviation of x.                                sd(c(3,4,5))        1
var(x)              The variance of x.                                          var(c(3,4,5))       1
min(x)              The minimum of x.                                           min(c(3,4,5))       3
max(x)              The maximum of x.                                           max(c(3,4,5))       5
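
To try the table out, a few of its entries can be typed straight into the console. The lines below are only a small sample of the functions listed above; each one prints its result as soon as it is submitted.

```r
# Arithmetic and mathematical functions are evaluated immediately.
3 + 4
2^2
sqrt(4)
log(100)              # natural logarithm
log(100, base = 10)   # base-10 logarithm
round(4.60517)

# Statistical functions take several numbers, combined with c().
mean(c(3, 4, 5))
sd(c(3, 4, 5))
```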

Notice that the main argument in the mathematical functions is a single real number. However, the
statistical functions have multiple real numbers as the main argument. This brings up a very important
point in understanding how R operates and/or "thinks". R is often called an object-oriented programming
language. This means that all of the data are stored in objects and that all R functions operate on objects.
Objects can be single numbers or character strings, a list of numbers or character strings
(conceptualized as a vector, but you can simply think of it as a column in a data set), or multiple lists of
numbers or character strings (conceptualized as a matrix, very much like a data set in SPSS or SAS and a
worksheet in MS Excel). Put these ideas on "hold" for the moment and we will return to them shortly.

For now, realize that, like any good graphing calculator, R can be used to make variable assignments.
These assignments allow the user to generalize and re-use code (less typing for us!!). Variable
assignment is done using an assignment statement with the "<-" (pronounced "gets") operator. The gets
operator is nothing special: it literally is the less than sign (<) followed immediately by the hyphen (-).
Therefore, the statement:

a <- 4

is read/pronounced "a gets 4".

Now, anytime "a" is used in an expression or function call, 4 is substituted for a.
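
As a minimal sketch of how the assignment behaves once it has been submitted (the follow-up expressions are our own illustrations, not taken from the original screenshots):

```r
a <- 4      # "a gets 4"
a           # typing the name prints the stored value
a + 1       # 4 is substituted for a
sqrt(a)     # the same substitution happens inside a function call
```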

So, "a" is an object. In particular, it is called a scalar (a special mathematical name for a single real
number). Now let's create a list of numbers (a vector). Once again, this is done with the gets operator;
however, now we need to tell R that the variable (or object) is a list of numbers. We do this using, think
about... yes!: a function (everything in R is performed using functions). In this case it is the concatenate
function or simply "c" for short. For an example, let's enter the first five values for the years on the job
(YRONJOB) variable from the example WidgeOne data set into a new variable simply named "b".

Essentially, this statement reads: "b gets the list of values of 11.10, 11.00...". We hit the "Enter" button
on our keyboard after typing this. Notice that we do not get any feedback from R. Nothing happens. This
is actually a good thing. If we did it wrong, we would get an error. For example, if we forgot the
concatenate function (the "c") then we would get something like:

Not good. So, the fact that we did not get any feedback earlier when we entered the statement
correctly is ok. The object (or vector or list of values or variable) has been properly saved in R's working
memory with the name "b". If we want to actually see it, we must type its name.
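
The steps just described can be sketched as follows. Only the first two values (11.10 and 11.00) appear in the text; the other three are hypothetical placeholders, not the actual WidgeOne data. The parse() and tryCatch() calls are used here only so the error from a missing c() can be captured rather than stopping the example.

```r
# First two values are from the text; the last three are made up.
b <- c(11.10, 11.00, 9.50, 12.25, 10.40)
b   # typing the name prints the vector

# Leaving out c() is a syntax error. The faulty statement is passed as
# text so that the error can be caught and inspected.
bad <- tryCatch(parse(text = "b <- 11.10, 11.00"), error = function(e) e)
inherits(bad, "error")
```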

As a side note, the user can always get a list of all objects currently saved in the workspace using the ls()
function. So, right now, we have two objects saved in the workspace: a & b. Once again, they are saved
in R's temporary working memory. If we were to close the program, these would be erased. We will talk about
saving a session permanently later on. As another side note, the user can always click on the R console
and press "Ctrl+L" to clear the console (when it gets cluttered).
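
A quick sketch of ls() in action, re-creating the two objects so the example runs on its own (the vector here is shortened for illustration):

```r
a <- 4
b <- c(11.10, 11.00)   # shortened vector, for illustration only
ls()                   # lists the names of the objects in the workspace
```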

Now, realize that since we have defined b as a list of numbers, we can use the statistical functions from the
table and specify "b" as the main argument. This saves us from having to type all the data again!

We can also save these values as variables and then use them in subsequent expressions. Here we save
the mean of the vector b as a new variable called simply "m" for short.
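
A minimal sketch of both ideas, assuming the same placeholder vector as before (only its first two values come from the text):

```r
b <- c(11.10, 11.00, 9.50, 12.25, 10.40)   # last three values are hypothetical

mean(b)        # statistical functions accept the vector by name
sd(b)

m <- mean(b)   # save the mean as a new object called m
m
m * 2          # m can now be used in later expressions
```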

Ok, it is not part of the STAT 3010 curriculum, but you most likely remember Z-scores from elementary
statistics (one of the prerequisites for 3010, check your transcripts!). For a refresher, remember we
subtract the mean from each value of the variable of interest and then divide by its standard deviation to
get a Z-score for each value. Here is the formula (that I'm sure you know and love):

Z = (x - mean of x) / (standard deviation of x)

We are using this example because it is SO EASY to do in R and really showcases R's power and utility.
Check this out:

It really is that simple. Once again, the first statement reads "a new vector (variable) called z gets the
value of b minus the mean of b divided by the standard deviation of b". You would not believe how
difficult this is to do in a SAS DATA step...(of course, there is a special SAS procedure for this, however, it
is still WAY too complicated to do in a DATA step...).

Also, this showcases how R performs operations element-wise. This means that R performs a given
operation on each value of a vector separately and produces an entire vector of results whose length (the
number of values or elements in a vector) is equal to the length of the input vector.
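
The Z-score calculation can be sketched in a single statement. The vector is the same placeholder used earlier (only 11.10 and 11.00 come from the text). Note the parentheses around the subtraction: without them, R would divide mean(b) by sd(b) first.

```r
b <- c(11.10, 11.00, 9.50, 12.25, 10.40)   # last three values are hypothetical

z <- (b - mean(b)) / sd(b)   # element-wise: one Z-score per value of b
z

length(z) == length(b)       # the result is as long as the input
```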

Now pretend that we had already saved both the mean and standard deviation of b before we wanted to
calculate the Z-scores.

Then, the statement to calculate the Z-scores is even simpler:

Note: R is case-sensitive. That means that objects named "m" and "M" are different. For example:

The object m was previously defined as the mean of the vector b. However, the object M has not been
previously defined; therefore R produces an error message.
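
Both points can be sketched together. The name "s" for the saved standard deviation is our own choice, and tryCatch() is used only so that the error from the undefined object M is captured as text instead of halting the example.

```r
b <- c(11.10, 11.00, 9.50, 12.25, 10.40)   # last three values are hypothetical

m <- mean(b)       # saved mean
s <- sd(b)         # saved standard deviation ("s" is a hypothetical name)
z <- (b - m) / s   # the simpler Z-score statement
z

# R is case-sensitive: m exists, but M was never defined.
res <- tryCatch(M, error = function(e) conditionMessage(e))
res                # the text of the error message
```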

Getting Help in R

Obtaining help documentation in R is rather simple; however, the usefulness of that documentation is
debatable. Because most everything in R is accomplished using functions, the typical R user will have
questions about the use of one or more functions. In order to obtain the help page for a given function
submit one of the two options below to the R console:

help(function-name)
?function-name

In the following example, we obtain the R help page for the log function.
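
Substituting log for the function name, the two forms look like this (either one is enough on its own):

```r
help("log")   # request the help page for the log function
?log          # shorthand for the same request
```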

When either of these commands is submitted to the R console, the appropriate help page is opened in
your default internet browser (however, you do not have to be connected to the internet; R just
uses the browser as a document viewer).

Now, as hinted at earlier, the utility of these help pages is debatable. It has been our experience that
they often are written for people who already know a great deal about R, and therefore are not very
useful to the nascent user. Consequently, it is a good idea to have more "help resources" in your toolbox.
The most powerful of these is the official R-Help listserv.

We highly recommend that you use the R-Help listserv. You can either browse the existing discussions
for a situation like the one you are encountering (see [Link]) or you can
email the listserv a specific question/issue that you are dealing with. Most often when you email the
listserv, you will obtain top-notch assistance for your specific problem from half a dozen "R professionals"
within a short amount of time. For more information about the listserv, go to
[Link]. Warning: be sure to read the posting guide before emailing the
listserv (see [Link]). There are standards for online etiquette.

Working with Scripts

As mentioned previously, typing R code directly into the console is often impractical for a
number of reasons. Therefore, we use scripts. Scripts allow the user to develop, debug, and save code for
later use during an R session. To open a brand new script, select New Script from the File drop down
menu.

A few important facts about R scripts: 1) Files with a .R file extension are associated with (and easily
recognized by) the R program. 2) R script files, regardless of the file extension, are simple text files (so
you could open and view them with any text editing software; however, they will only run in R). 3) When
you save an R script (it is highly recommended that you SAVE OFTEN, no matter what software
package you are using), R, unlike most software packages, does NOT automatically add
the .R extension. The user actually has to type the .R extension at the end of the file name in the File
name field when saving the script file.

We suggest resizing the script window and placing it side by side with the console. Then you can write R
code and double-check it before submitting it to the console. To submit code to the console from the
script, highlight the desired piece of code (you usually don’t want to submit a whole script at once) and
press "Ctrl+R" (you could also copy and paste).


Often, R users will not write brand-new code for a new project, but instead work from existing code
that they developed in the past. For example, there is a sample script entitled stat.3010.R that contains
all the code necessary to perform a full STAT 3010-style analysis of the WidgeOne data (the code is also
included at the end of this document). In order to open an existing script, select Open script... from the
File drop down menu, navigate to where the desired script is saved, and either double-click on the file or
single click on the file and then select Open.

You should see something like this:

Note: you probably noticed that there are several lines in the stat.3010.R file that begin with the hash
mark (#). The hash mark in R signifies the beginning of a comment. A comment in typical computer
programming is a note to the human-users that aids in understanding the purpose of code. These
comments are not processed by the computer. In R, comments begin with a hash and continue for the
rest of that line.
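
A tiny illustration of both placements of a comment (the code itself is our own example, not a line from stat.3010.R):

```r
# This whole line is a comment and is ignored by R.
x <- c(3, 4, 5)   # a comment can also follow code on the same line
mean(x)
```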

Saving and Loading a Workspace

There are actually some options here; however, we have a lot to cover, so we are only going to present
the easiest approach to saving your work in R and returning to it at a later date.

To save your work, left-click on the R console in a null space so that it becomes active. Next, from the
File drop down menu, select Save Workspace... Now, specify the desired physical location and file name
to which you want to save the file and select Save. This is a very nice function: it saves all objects (data)
in the current working memory. Note that the workspace file does not include your script, so save the
script separately as described earlier.

To load or re-start a previously saved R session, yup, you guessed it: Launch a new session of R, select
Load Workspace... from the File drop-down menu, navigate to the appropriate sub-directory (folder),
select the desired file, and select Open.
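
For readers who prefer typing to menus, the same round trip can be sketched with functions. This is an alternative to the menu route described above, not the manual's own method; the file name and temporary folder are hypothetical choices so the example runs anywhere.

```r
a <- 4
ws_file <- file.path(tempdir(), "example_workspace.RData")   # hypothetical file name

save(list = ls(), file = ws_file)   # save every object; save.image() is the usual shortcut
rm(a)                               # simulate starting a fresh session
load(ws_file)                       # restore the saved objects
a
```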

Getting Data into R (Importing Data)

From what we have discussed so far, it would be possible for us to enter our data into R one column at a
time using the concatenate function. However, we have better things to do with our time. So, what do we
do? This is R: We use a function!

The base R package has a number of functions that can be used to import data. The most common one is
[Link](). However, our data (i.e., the WidgeOne example data set) are saved in MS Excel. Admittedly,
importing data from Excel to R is something that R does not do very well. There are some special add-on
packages (see xlsx & xlsReadWrite) for this task; however, it is our experience that they are not very
reliable (in other words, sometimes they work and sometimes they don't...). On the other hand, R is very
good at importing non-proprietary file formats (*.txt, *.csv, *.dbf, etc.). Therefore, the most reliable and
stable method for importing MS Excel data into R is to open the file in MS Excel, save it as a .csv file
(comma-separated file), and then use the proper function in R to import the .csv file.

To save a MS Excel file as a .csv file, open the file in Excel, select the File tab then select “Save As…”:

Select CSV (Comma delimited) (*.csv) from the Save as type drop down menu:

Now we are ready to submit the proper function call to R to import these data. Let’s use the [Link]()
function. The call to [Link]() is presented in the STAT 3010 example script entitled stat.3010.R. We
highly recommend that you follow along with this discussion on your own computer from here on out by
either opening that file, or by using the code we provide.

Notice that in the example script, we have specified the pathway (the physical location of where the CSV
file resides) that is specific and unique to each computer setup. In this example, the [Link] file is
on the E drive, in the folder STAT3010. You will need to customize this pathway to your situation.

Open My Computer, navigate to the folder where you saved the CSV file, right click, select Properties and
under the General tab, you should see the following:

Copy the location and paste it into R.

Note: You do have to specify both the pathway and the file name in the call to [Link](), so go ahead and
type in the file name along with the file extension, “[Link]”, after the location. Next, and this
is VERY IMPORTANT: the backslash character (\) in R is a special character, so after you copy and paste
the pathway, you WILL NEED to add a second backslash for the pathway to be correctly specified in R
parlance. Therefore, every \ in the pathway needs to become \\.
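
Here is a hedged sketch of the import step. The function and file names are elided in this copy of the manual, so read.csv() and the file name widge_example.csv are assumptions; the tiny data set is a stand-in written to a temporary folder so that the example runs on any computer. Only the column name YRONJOB comes from the text.

```r
# Hypothetical stand-in for the WidgeOne CSV file.
csv_path <- file.path(tempdir(), "widge_example.csv")
write.csv(data.frame(EMPID = 1:3, YRONJOB = c(11.10, 11.00, 9.50)),
          csv_path, row.names = FALSE)

widge <- read.csv(csv_path)   # read.csv() is assumed to be the import function
widge

# On Windows, a pasted pathway needs every backslash doubled inside the quotes.
# Forward slashes also work. Both strings name the same (hypothetical) file.
win_path_a <- "E:\\STAT3010\\widge_example.csv"
win_path_b <- "E:/STAT3010/widge_example.csv"
cat(win_path_a, "\n")         # printed with single backslashes
```

On your own computer you would pass your customized pathway (in either form) directly to the import function in place of csv_path.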

Once you have made the necessary changes to the call to [Link](), notice what it does: you are giving R
instructions to read in data from the [Link] file and save it to an R object named widge
(remember, R is case-sensitive, so widge is not the same as Widge or WIDGE). When you are ready,
highlight the code and press "Ctrl+R" to submit it to the R console.

Notice that the command is copied to the console; however, nothing else happens. This is ok. Most often
during assignment statements, no feedback from the console is good news.

Your next step should be to verify that the data were correctly imported into R. The easiest way to do this
is to simply view the data. As mentioned previously, we view objects in R by typing their name and
pressing the "Enter" key. So far, everything looks good!

Often, when you are working with very large datasets, it is not useful to print the entire data set at
once. R has a very nice function called head() that prints only the first 6 rows of data (by default) with the
corresponding column names.
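
A short sketch of head(), using a made-up 10-row data frame in place of the real widge data:

```r
widge <- as.data.frame(matrix(1:90, nrow = 10))   # hypothetical 10 x 9 stand-in

head(widge)          # first 6 rows by default
head(widge, n = 5)   # ask for a different number of rows with n
```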

A few side notes here are important to be aware of from here on out:

1) The [Link]() and [Link]() functions return a special kind of R object: the data frame. In other
words, the widge data as currently saved in R's working memory is a data frame. A data frame is a
special kind of matrix. A matrix can be thought of as a collection of column vectors (or simply columns of
data). However, in R, a matrix must consist of all numeric or all character vectors. Statistical data,
however, is most often a combination of both numeric and character data. As mentioned a moment ago, a
data frame is a special kind of matrix: it is a matrix that may consist of a mixture of numeric and
character column vectors: Exactly what we need for most statistical applications.

2) Often times we need to work with only parts of a data frame (or matrix, or vector). There are a number
of ways to subset objects in R.

a) We may want to perform an operation (using a function!) on just one column of the widge data frame
(in other words, just one variable in the WidgeOne data). We may do this using a combination of the data
frame name and the column name. The two are delimited by the special character $. For example, earlier we
obtained the mean of the first five observations of the variable years on the job (YRONJOB). Now, let's
obtain the mean for all N = 40 observations of that variable:

Once again, notice that we delimited the object name from the column name by a $. Try this with any
numeric variable in the WidgeOne data.
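
The $ notation can be sketched on a small stand-in data frame. The real WidgeOne data are not reproduced here, so the values below are random placeholders with 6 rows instead of 40; only the column name YRONJOB and its position (eighth of nine columns) follow the text.

```r
# Hypothetical stand-in: 6 rows, 9 numeric columns, YRONJOB in column 8.
set.seed(1)
widge <- as.data.frame(matrix(round(runif(54, 1, 20), 1), nrow = 6))
names(widge)[8] <- "YRONJOB"

widge$YRONJOB         # the whole column, selected by name
mean(widge$YRONJOB)   # mean of every observation in that column
```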

b) We can perform the same operation using explicit subsetting of the parent data frame (the source of
the data, in this case the widge data frame). For example, in order to perform the exact same operation
using subsetting, we specify the widge data frame name with the square brackets [ ]. R expects two
arguments with the square brackets: the rows to be used and the columns to be used. These are
delimited within the brackets with a single comma (,) with row first and columns second. Furthermore, if
we leave one (or both) of these blank, R assumes we want to select all rows and/or columns. Let's look at
some examples:

YRONJOB is the eighth column or variable in the widge data frame (counting from left to right).
Therefore, in order to select (in this case print) all N = 40 observations of YRONJOB, we submit the
following to the R console:

If we want the mean of YRONJOB for all N = 40, then:

Notice we obtain the same exact result that we obtained when we specified YRONJOB using names in
item (a) above.

Now, perhaps we want the mean of only the first five observations of YRONJOB. We could use either of
the following:

This instructs R to obtain the mean of YRONJOB for observations 1 through (:) 5. Notice that there is only
one argument within the square brackets (there is no comma separating the rows and columns). In other
words, 1:5 is considered as a single row specification by R. Furthermore, because widge$YRONJOB is a
single column, we do not need to specify a column number as in the example above, where the object to
subset (the widge data frame) had multiple columns.

Alternatively, we could subset the data frame. Here we will need to supply both a row and column
argument within the square brackets:

Notice that here we have specified the first five observations (1:5) of the 8th column of widge. We obtain
the same results.
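
The bracket-based selections discussed in item (b) can be sketched side by side. As before, the data frame is a random stand-in with 6 rows rather than the real 40-row WidgeOne data; only the name and position of YRONJOB follow the text.

```r
# Hypothetical stand-in: 6 rows, 9 numeric columns, YRONJOB in column 8.
set.seed(1)
widge <- as.data.frame(matrix(round(runif(54, 1, 20), 1), nrow = 6))
names(widge)[8] <- "YRONJOB"

widge[, 8]                 # all rows of the 8th column
mean(widge[, 8])           # mean of the whole column

mean(widge$YRONJOB[1:5])   # first five observations, subsetting the column
mean(widge[1:5, 8])        # same thing, subsetting the data frame
```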

c) Now, we often want to work with variables in R and let's face it, typing the data frame name along with
the $ character is a pain. We can make temporary copies of all columns in an object (either a data frame
or matrix) to R's working memory. Then, we could refer to them just by the column name. This is easily
done using the attach() function.

If we attempt to access the YRONJOB variable BEFORE attaching the widge data frame, R essentially
tells us that it does not exist:

Now, let's attach it and attempt to access the data using the exact same call to the column name:

Now, we can obtain the mean of YRONJOB with the following AFTER attaching the widge data frame:
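
A sketch of the before-and-after behaviour, again on a random stand-in for the widge data. The detach() call at the end is our addition; it simply undoes the attach when you are finished.

```r
# Hypothetical stand-in: 6 rows, 9 numeric columns, YRONJOB in column 8.
set.seed(1)
widge <- as.data.frame(matrix(round(runif(54, 1, 20), 1), nrow = 6))
names(widge)[8] <- "YRONJOB"

exists("YRONJOB")               # FALSE: not visible by its bare name yet

attach(widge)                   # make the columns available by name
m_attached <- mean(YRONJOB)     # now the bare column name works
m_attached

detach(widge)                   # undo the attach when finished
```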

3) We talked about subsetting a moment ago. In a similar vein, you can always obtain the total number of
rows and the total number of columns of a data frame or matrix by using the dim() function:

The dim function returns an object (i.e., a vector) of length 2: The first element is the total number of
rows, the second the total number of columns. Therefore, we now know that the widge data consists of N
= 40 employees and 9 characteristics (traits, variables, columns, etc.) for those individuals. The dim()
function is appropriate for multi-dimensional arrays (i.e., matrices and data frames).

In order to obtain the length of a single column (vector), we use the length() function in like manner:
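
Both functions can be sketched on a stand-in data frame. With the real WidgeOne data, dim() would report 40 rows and 9 columns; the placeholder below has only 6 rows.

```r
# Hypothetical stand-in: 6 rows, 9 numeric columns, YRONJOB in column 8.
set.seed(1)
widge <- as.data.frame(matrix(round(runif(54, 1, 20), 1), nrow = 6))
names(widge)[8] <- "YRONJOB"

dim(widge)              # number of rows, then number of columns
dim(widge)[1]           # just the number of rows
length(widge$YRONJOB)   # number of values in a single column
```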

4) Before moving on, you should be aware that R uses the missing place holder "NA" for missing data.
This is much like a period for missing numeric data in SAS or SPSS. Therefore, do not be alarmed if you
see "NA" values peppered throughout your data.

5) We have already discussed how to create a new object using the gets operator. FYI: In order to remove
or delete an object from R's working memory, we use the remove() or rm() (either one works!) functions:

If we want to remove multiple objects at once, we delimit their names by commas in the reference to
them in the remove() function:
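
A minimal sketch of both cases, using throwaway objects created just for the purpose:

```r
a <- 4
b <- c(1, 2)
m <- mean(b)

rm(a)         # remove a single object (remove(a) does the same)
rm(b, m)      # remove several objects at once, separated by commas

exists("a")   # FALSE: the object is gone
```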

Free R Documentation & Manuals

There are a number of free, readily-available manuals for R on the internet. We recommend the following:

1) This manual!
2) R for SAS and SPSS Users by Bob Muenchen at: [Link]
3) The Quick-R website at: [Link]

Concept 1: Using R for Measurements of Central Tendency

We have already seen a demonstration of the mean() function. We can obtain the median using the
median() function in similar fashion.

We can also obtain the mean or median (or any other summary function 4) for multiple variables at once.
To do this we simply specify the appropriate columns from the widge data frame using subsetting
operations we discussed previously:
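
A hedged sketch of both requests on a random stand-in for the widge data. The original call is not shown in this copy; in recent versions of R, mean() and median() do not accept a whole data frame, so sapply() is used here (as an assumption on our part) to apply the function to each selected column.

```r
# Hypothetical stand-in: 6 rows, 9 numeric columns, YRONJOB in column 8.
set.seed(1)
widge <- as.data.frame(matrix(round(runif(54, 1, 20), 1), nrow = 6))
names(widge)[8] <- "YRONJOB"

median(widge$YRONJOB)          # median of a single variable

col_means <- sapply(widge[, 5:9], mean)   # mean of each of columns 5 through 9
col_means
sapply(widge[, 5:9], median)              # median of the same columns
```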

However, if we want the measures of central tendency AND other distributional information for several
columns at once, then this approach is inefficient. Alternatively, we can use the summary() function:

4
Generally in statistics, a summary function is any statistical function that "summarizes" a random variable of length N in
N-1 or fewer values (usually just 1). Essentially, it is a dimension reduction. Examples include the mean, median,
standard deviation, range, quartiles, etc. The use of the term summary function here should not be confused with the actual
summary function in R (the next topic of discussion).
This is very nice: We get the mean, median, first and third quartiles, and the minimum and maximum for
all numeric variables in the data set and a basic frequency count for all character variables. IMPORTANT:
WARNING: CAUTION: Notice that R analyzes the Employee ID numbers. Is this an
appropriate/meaningful/useful analysis? Obviously the computer does not know any better; however, you,
as the analyst, are held to a higher standard.

Notice that the TRUE quantitative variables in the WidgeOne data reside in columns 5 through 9 in the
widge data frame. Therefore, using what we learned about subsetting objects in the last
section, we can obtain summary results for ONLY the quantitative variables with the following call to the
summary() function:
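A sketch of the idea on a made-up data frame (demo is hypothetical; in it the quantitative variables sit in columns 2 and 3 rather than 5 through 9):

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  ID      = 1:6,
  YRONJOB = c(2, 7, 12, 4, 9, 15),
  JOBSAT  = c(6.5, 7.0, 8.1, 5.9, 7.4, 6.8)
)

# Summarize only the quantitative columns, leaving the ID column out
s <- summary(demo[, 2:3])
print(s)
```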

A Stratified Analysis in R

Similar to SAS, stratified analyses can also be obtained in R using the by() function:
[Annotated code figure. Callouts, in argument order: (1) variable of interest (here, the variable
contained in the 8th column of widge); (2) the separator variable (we could also specify the column
number of this variable with widge[,3]); (3) summary function; (4) what we want R to do with missing
values.]

The by() function has 4 arguments here. The first argument is the variable we're interested in,
referenced by column number (the 8th column of the widge data frame is the variable YRONJOB). Next, we
specify the stratification factor. Here we want a separate analysis for each of two groups, males and
females. Therefore, we specify Gender as the stratification factor. We could have also typed widge[,3]
because Gender is the third column vector in the widge data frame. Here, Gender works because we
previously attached the widge data frame (we would have received an error otherwise!). Next, we specify
which summary function is of interest. Here we instruct R to return the mean. Last, the na.rm argument,
which by() passes along to the summary function, instructs R how to deal with missing values. This
argument takes two values: TRUE or T removes missing values of the analysis variable before the summary
is computed, while FALSE or F leaves them in (in this case, if missing values do exist in a group, R
returns NA (missing) for the value of the function in that group). FALSE is the default.
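A minimal sketch of such a call, using a made-up data frame (demo, with invented values and one missing YRONJOB) so that the effect of na.rm is visible. The columns are referenced with $ here rather than through attach().

```r
# Hypothetical stand-in for the widge data frame (invented values, one NA)
demo <- data.frame(
  Gender  = c("F", "M", "M", "F", "F", "M"),
  YRONJOB = c(2, 7, NA, 4, 9, 15)
)

# Mean years on job by gender; na.rm = TRUE is handed on to mean()
res_rm <- by(demo$YRONJOB, demo$Gender, mean, na.rm = TRUE)

# With the default (na.rm = FALSE), a group containing an NA returns NA
res_na <- by(demo$YRONJOB, demo$Gender, mean)

print(res_rm)
print(res_na)
```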

Consider the following call to the by() function. What is being asked?

See the end of this chapter for the answer.

Now, in order to obtain frequency tables for categorical variables outside of the summary function, we
use the table() function:

Concept 2: Using R for Measurements of Dispersion

Much like the mean() and median() functions, we can obtain measures of dispersion in R. The standard
deviation and the variance of a variable are obtained with the sd() and var() functions.

Just like with the other summary functions, we can obtain the measures of dispersion for multiple
variables at once using subsetting operations on the data frame of interest:
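A sketch on made-up data (demo is a hypothetical stand-in for widge). Because sd() in current versions of R expects a vector rather than a whole data frame, this sketch applies sd() and var() to each selected column with sapply().

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  ID      = 1:6,
  YRONJOB = c(2, 7, 12, 4, 9, 15),
  JOBSAT  = c(6.5, 7.0, 8.1, 5.9, 7.4, 6.8)
)

# Single variable
sd_yr  <- sd(demo$YRONJOB)
var_yr <- var(demo$YRONJOB)

# Several variables at once: one value per selected column
sds  <- sapply(demo[, 2:3], sd)
vars <- sapply(demo[, 2:3], var)

print(sds)
print(vars)
```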

Using R to Categorize a Continuous Variable

Often it is of interest to categorize or create meaningful groups or "bins" out of a continuous variable.
This is often done in applied biomedical and social science research. For example, researchers often take
continuous attributes like age, income, etc. and create groups from them. This can easily be
accomplished in R using assignment statements with the subsetting operator [ ]. See the example code
below.

[Annotated code figure. Callouts: (1) we are creating a new variable, Jobten, in the widge data frame;
(2) the values of Jobten are conditional on the value of YRONJOB.]
The first statement essentially says, any row (and remember rows in this data frame represent
employees) with a value less than 5 for YRONJOB gets the value "New" for Jobten. Next, any row with a
value of YRONJOB between 5 and just less than 10 gets the value "Experienced" for Jobten. Finally, any
row with a value of greater than or equal to 10 for YRONJOB gets the value of "Mature" for Jobten.
REMEMBER, we can reference YRONJOB directly here because we attached the parent data frame
(widge), otherwise we would need to specify widge$YRONJOB.
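The three assignment statements described above can be sketched as follows. The data frame demo and its values are invented for illustration, and YRONJOB is referenced with $ so the sketch does not depend on attach().

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  ID      = 1:6,
  YRONJOB = c(2, 7, 12, 4, 9, 15)
)

# Each statement fills Jobten only for the rows where the condition is TRUE
demo$Jobten[demo$YRONJOB < 5] <- "New"
demo$Jobten[demo$YRONJOB >= 5 & demo$YRONJOB < 10] <- "Experienced"
demo$Jobten[demo$YRONJOB >= 10] <- "Mature"

print(demo)
```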

Now, we double-check our work by printing the data frame:

Notice that the new variable Jobten was added as the 10th column to the widge data frame and the
values of Jobten are conditional on the corresponding values of YRONJOB. Look back at the code: We
didn't have to type much code in order to do this: R is very efficient at operations like this.

Since we added a new column to widge, we re-attach the data frame so that Jobten is available via
column name only and then we obtain a frequency table of the newly created variable in order to
summarize the amount of professional experience of these 40 employees.

Notice, as we re-attach the data frame, R gives us a warning that it is copying over the old attached
versions of the column vectors.

Concept 3: Using R for Visualization/Organization of Univariate Data

Unlike all of the other software packages discussed (with the possible exception of Minitab), R has
excellent graphing capabilities and allows the user to create and customize presentation-quality graphics.

To replicate the pie chart developed in the Basic Concepts Manual, execute the following code:

Notice now nothing happens in the R console, but another graphics window opens up and the pie chart is
printed to the new window.

Now, this pie chart looks nice, but it’s missing the percentages of each section. We can go back and add
them in by showing the percentages instead of the labels, so we’ll have to insert a legend as well. Let’s
change the colors while we’re at it.

The following code accomplishes this:

[Annotated code figure. Callouts, in order: (1) this line of code calculates the percentages of each
category, rounded to 1 decimal place; (2) this line creates labels that display the calculated
percentages with a "%" attached; (3) here we're assigning some colors to a variable named "colors";
(4) creating the pie chart with the labels and colors that we specified; (5) this line adds the legend
with the appropriate color scheme.]
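A sketch of the five steps, under stated assumptions: the jobten vector, the color names, the chart title, and the legend position are all invented for illustration and are not the values used in the original figure.

```r
# Hypothetical tenure categories (invented values)
jobten <- c("New", "New", "Experienced", "Mature",
            "Experienced", "New", "Mature", "New")
counts <- table(jobten)

# 1) Percentages of each category, rounded to 1 decimal place
pct <- round(100 * counts / sum(counts), 1)

# 2) Labels that show the percentages with a "%" attached
lbls <- paste(pct, "%", sep = "")

# 3) Colors stored in a variable named colors
colors <- c("lightblue", "orange", "gray")

# 4) Pie chart with the labels and colors specified above
pie(counts, labels = lbls, col = colors, main = "Job Tenure")

# 5) Legend that maps each color back to its category
legend("topright", legend = names(counts), fill = colors)
```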

The final Pie Chart looks like this:

To replicate the bar chart in the Basic Concepts manual, execute the following code. Here we add an
informative x-axis label using the xlab argument. This argument can be used in almost every call to an
R graphing function.

Now, what do you think about this graphic? Is it appropriate? Is it correct? NO!! Why not? The answer is
because the variable Jobten is an ordinal variable and this graphic does not reflect the natural order of
the categories. Therefore, more revision is necessary in order to get this right.

In order to specify any variable as an ordinal variable in R, we specify it as an ordered factor. A factor is a
special variable type that instructs R that a variable is categorical by nature. We specify a variable as an
ordered factor using the ordered() function:
[Annotated code figure. Callout: Jobten replaced with a call to the ordered() function.]

Notice that the old reference to the variable Jobten is now replaced by:

ordered(Jobten,c("New","Experienced","Mature"))

This is the beauty of R: you don't even have to create a new variable in order to do this (although you
could...) and because functions can be called within other functions (this is called nesting or nested
functions) you can do all of this in a few simple lines of code 5. For the ordered() function, the first
argument is the input variable that you want to be treated as an ordinal variable. The second argument is
a character vector (notice the values are enclosed in quotes and delimited by commas) using the
concatenate function (c()). This character vector communicates the proper order of the ordinal variable
values to R.
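A sketch of the difference the ordered factor makes, using an invented jobten vector. The axis label text is an assumption.

```r
# Hypothetical tenure categories (invented values)
jobten <- c("New", "New", "Experienced", "Mature",
            "Experienced", "New", "Mature", "New")

# Without ordering, table() sorts the categories alphabetically
counts_alpha <- table(jobten)

# ordered() nested inside table() fixes the natural order of the categories
counts_ord <- table(ordered(jobten, c("New", "Experienced", "Mature")))

print(counts_alpha)
print(counts_ord)

# The bar chart now follows the natural order
barplot(counts_ord, xlab = "Job Tenure")
```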

Much Better!

5
Calling a function within another function call is often done in more advanced R programming. When
one function call resides within another function call these are "called" (HA!) nested functions.

The histogram is generated in R using the hist() function.

A simple box plot is generated using the boxplot() function.

Side by side box plots are also generated using the boxplot() function. However, the structure of the
arguments is quite different here. If you want side by side box plots, boxplot() expects that you specify an
expression in the form of: "a quantitative variable is modeled as (the tilde (~) in R is read as "is modeled
as") the categorical variable (or stratification factor)". So, in the example below, we are obtaining side by
side box plots of job satisfaction stratified by job position. JOBSAT~POSITION is read as "job satisfaction
is modeled as (or by) job position". Notice now we must include the data= argument in the call to
boxplot().

You should see this:

Again, do you think this is sufficient? Does it stand on its own? No! There are abbreviations (for Hourly &
Management for the x-axis tick mark labels) that are an unnecessary source of confusion that should be
avoided at all costs. Professional presentation quality statistical evidence (usually in the form of tables
and graphs) should not be confusing. Instead they should be clear, concise, easily-digestible for the
audience, and informative! We can correct this graphic using the following where we explicitly tell R
what we want printed as the x-axis tick mark labels.
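A sketch of side by side box plots with explicit tick mark labels. The data frame demo, the position codes "H" and "M", and the axis label text are assumptions made for this illustration. The names argument supplies the labels in the same order as the groups (alphabetical here, so "H" comes before "M").

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  POSITION = c("H", "H", "H", "M", "M", "M"),
  JOBSAT   = c(6.5, 7.0, 5.9, 8.1, 7.4, 6.8)
)

# JOBSAT ~ POSITION: job satisfaction modeled by job position
bp <- boxplot(JOBSAT ~ POSITION, data = demo,
              names = c("Hourly", "Management"),
              xlab = "Job Position", ylab = "Job Satisfaction")
```

The value returned by boxplot() (stored in bp here) holds the statistics behind each box, which is handy for checking that the groups are in the order you expect before relabeling them.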
Concept 4: Using R for Visualization/Organization of Multivariate Data

We can also obtain 2-way contingency tables using the table() function: we simply add another column
name as a second argument (and, of course, arguments are delimited by commas). Remember, N-way
contingency tables are appropriate for summarizing the joint and marginal distributions of 2 or more
categorical variables. Here notice that the first column will be the row variable (Plant) and the second
column will be the column variable (Gender) in the resulting contingency table:

Likewise, we can obtain total percents 6 for the 2-way table above by specifying the table() function as the
argument to the prop.table() function. This is an excellent example of nested functions, which we
introduced earlier.

6
We still call them percents even though prop.table() returns proportions. REMEMBER: In order to transform a proportion
into a percent simply multiply it by 100.
In order to obtain row percents for this table, we add an optional second argument to the prop.table()
function (REMEMBER: You can learn more about prop.table() by submitting either: help(prop.table) or
?prop.table to the R console).

Alternatively, you could assign the results of the table() function to a matrix called t1, for example, and
then submit the call to prop.table() using t1 as the first argument:

You can also obtain the column percents in like manner:
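A sketch of total, row, and column proportions for a 2-way table. The data frame demo and the plant labels "A" and "B" are invented. The second argument of prop.table() is 1 for rows and 2 for columns.

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  Plant  = c("A", "A", "A", "B", "B", "B", "B", "A"),
  Gender = c("F", "M", "M", "F", "F", "M", "F", "F")
)

t1 <- table(demo$Plant, demo$Gender)  # rows: Plant, columns: Gender

total_p <- prop.table(t1)     # each cell divided by the grand total
row_p   <- prop.table(t1, 1)  # each row sums to 1
col_p   <- prop.table(t1, 2)  # each column sums to 1

# Multiply by 100 to report percents instead of proportions
print(round(100 * row_p, 1))
print(round(100 * col_p, 1))
```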

The stacked bar chart is an excellent visualization of a 2-way contingency table. Like the simple bar
chart, the stacked bar chart is also generated in R using the barplot() function. Notice here that the first
argument to this call to barplot() is not the raw widge data, but rather the results of the table() function:
Another example of nested functions. Notice, also, that a legend is necessary for this graphic to be
meaningful and we are supplying information for the legend to be extracted from the row names of the
results of the table() function.

You should see this:

Notice we have abbreviation issues again. Therefore, we do it again, and explicitly tell R what we want
printed in the legend using the concatenate function (c()). Realize, however, it is helpful to generate the
incorrect graph once so we know for sure the order of the groups in the legend. Then, we refine it and
generate a final product appropriate for our audience.

Now we redo it:

Now, this graphic is incorrect for the same reason that our first univariate bar chart was incorrect: It
suffers from misrepresenting the ordered nature of the ordinal variable Jobten.

Just like before, we use the ordered() function nested within barplot() to instruct R how to order the
categories:

Notice that the old reference to the variable Jobten is now replaced by:

ordered(Jobten,c("New", "Experienced", "Mature"))

Now, we are not as draconian about this, but you will notice that the printing of the legend looks a little
less than ideal here. We can actually tell R where to print the legend (do this in your assignments and
REALLY impress us!).

Here we are telling R to suppress the printing of the legend through the barplot() function and using a
separate call to the legend() function where we have more control. Obtain the R help page on legend() for
more details on how this works. BTW, we figured out the appropriate x and y coordinates for the
placement of the legend here just by trial and error. The final graphic is printed on the next page.

Now, we can easily generate a 100% stacked bar chart simply by nesting the table() function within the
prop.table() function in the call to barplot() (Yes, there is a lot of nesting going on here. Don't forget a
parenthesis!!).

So, essentially what we are doing is generating our 100% stacked bar chart from the column percents.
The only problem is that prop.table() returns these in the form of proportions, not percents. As a result,
the y-axis of our resulting graphic ranges between 0 and 1.0.

Now, here is how cool R is: You can actually specify a mathematical expression within the call to
barplot(). Therefore, all we have to do to correct this is to multiply the column proportions from
prop.table() by 100 WITHIN the call to barplot(). Notice we also added a y-axis label using the ylab
argument and we were forced to change the y coordinate specification (the 2nd argument) in the call to the
legend() function.
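A simplified sketch of a 100% stacked bar chart on the percent scale. The data frame demo and the axis labels are invented, and for brevity the legend is drawn through barplot()'s own legend.text argument rather than through a separate call to legend() with hand-picked coordinates.

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  Gender = c("F", "M", "M", "F", "F", "M"),
  Jobten = c("New", "Experienced", "Mature", "New", "Experienced", "Mature")
)

# Rows: ordered tenure categories; columns: gender
t2 <- table(ordered(demo$Jobten, c("New", "Experienced", "Mature")),
            demo$Gender)

# Column proportions multiplied by 100 so the y-axis runs from 0 to 100
col_pct <- prop.table(t2, 2) * 100

barplot(col_pct, xlab = "Gender", ylab = "Percent",
        legend.text = rownames(col_pct))
```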

We’ve changed the scale from 0 to 100, so be
sure to change the legend coordinates from
0.9 to 90

A scatterplot is generated using the plot() function. The first argument is the x-axis variable, the second
the y-axis variable. In order to obviate abbreviations from the start, we use the xlab and ylab arguments
to provide proper labeling for the audience.

Concept 5: Using R for Random Number Generation and Simple Random Sampling

As we have seen through our previous STAT 3010 studies, there is great utility in the ability to generate
random numbers ranging from sampling applications to random assignment of observations and
developing computer simulations (ok, simulations are beyond the scope of 3010, but you will encounter
these if you continue on your journey in studying statistics). R is extraordinarily effective and efficient as
a random number generator. Like the other packages, R uses the computer clock time as the default seed
for all random number functions. To generate uniformly distributed random numbers, we use the runif()
function:

In the example above, we generate 40 random numbers and store them in the vector named Ran and
then print them to the console7. The runif() function has one mandatory argument, the number of random
numbers to generate. The default is to generate numbers between 0 and 1 (which is nice). Pretend for a
moment that we really wanted a set of N = 40 random whole numbers that varied between 0 and 100. We
could obtain this by multiplying Ran by 100 and using the round function in order to round the numbers
to the nearest whole number (here we named the result R100, but this is completely arbitrary):
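The two steps described above can be sketched as follows (Ran and R100 are the names used in the text; no seed is set, so your values will differ on every run):

```r
# 40 uniformly distributed random numbers between 0 and 1
Ran <- runif(40)
print(Ran)

# Rescale to 0-100 and round to the nearest whole number
R100 <- round(Ran * 100)
print(R100)
```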

7
Obviously, you should not expect to obtain the same exact results as we do here due to the use of the default seed.
Verify that the seed is set to the clock time by re-submitting the same code. You should obtain different
values for your N = 40 generated numbers.

Next, use the set.seed() function to set the seed so that you can obtain the same exact results at a later
date (this is often desirable):

Note: to reproduce the same values, it is necessary to call the set.seed() function with the same initial
value (here the value 974) before each new call to the random number generating function. Also, it is
important to keep in mind that set.seed() only uses the integer portion of the initial seed value.
Therefore, if a fractional value is supplied to the function, set.seed() automatically truncates it to an
integer (be mindful!).
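A short sketch of this behavior. The object names (first, second, third, frac) are made up for the illustration; 974 is the seed value used in the text.

```r
set.seed(974)
first <- runif(40)

set.seed(974)           # same seed again, so the same 40 numbers come back
second <- runif(40)

third <- runif(40)      # no re-seeding here, so these values differ

set.seed(974.9)         # only the integer portion (974) is used
frac <- runif(40)

print(identical(first, second))
print(identical(second, third))
print(identical(first, frac))
```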

Now if we desired to create statistically independent groups from the WidgeOne data, we use simple
assignment statements much like we did when we categorized a continuous variable.

Notice that the first assignment statement in the example above reads "the new variable Group appended
to the widge data frame gets a value of 1 if the associated random number is less than .5". The second
statement is read in similar manner. We then print the results in order to confirm the effectiveness of our
code.
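A sketch of the random group assignment on a made-up data frame. The names demo and Ran, the 10 rows, and the seed are assumptions for this illustration; Group is the variable name used in the text.

```r
set.seed(974)

# Hypothetical stand-in for the widge data frame
demo <- data.frame(ID = 1:10)
demo$Ran <- runif(nrow(demo))   # one random number per row

# Group gets 1 if the random number is less than .5, and 2 otherwise
demo$Group[demo$Ran < 0.5]  <- 1
demo$Group[demo$Ran >= 0.5] <- 2

print(demo)
```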

After performing random group assignment, it is often desirable to sort the data by the new group
membership. This is easily done in R using the order() function specified within the square bracket
operators:

Notice that we began by re-attaching the data frame
(otherwise we would have to specify widge$Group
instead of simply Group when referencing the new group
membership).

In this example, we not only sorted the data by group membership, but then within groups we sorted by
employee ID. Notice that we created a new version of the widge data frame (widge2) that is sorted. The
operative statement reads something like "a new data frame named widge2 gets the old version of widge
after it is sorted in ascending order (the default) by Group and then by employee ID within Group". Also,
it is important to realize that the order() function is called in the area within the square brackets that is
associated with rows. Therefore, we are sorting rows, not columns. Packages like MS Excel, SPSS, and
SAS only allow rows to be sorted in this way; R, however, is much more flexible in this regard.
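A sketch of the sort on invented data (demo and demo2 play the roles of widge and widge2):

```r
# Hypothetical stand-in for the widge data frame (invented values)
demo <- data.frame(
  ID    = c(4, 1, 3, 2, 6, 5),
  Group = c(2, 1, 2, 1, 1, 2)
)

# order() sits in the row position of [ , ]: sort by Group, then by ID within Group
demo2 <- demo[order(demo$Group, demo$ID), ]

print(demo2)
```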

In order to obtain a simple random sample of the WidgeOne data, we use: Yes! the sample() function! In
the example below, we desire to sample the rows of the parent data frame, so the sample() function is
specified just like the order() function in the example above (i.e., in the row area within the square
brackets):

Here we create a new data frame named sam1. The first argument of the sample() function is the row
numbers of the parent object to sample from. Therefore, we want to sample from 1 through 40 (the
nrow() function returns the number of rows of a 2-dimensional R object (like a matrix or data
frame)). The second argument is the size of the sample. So in this example, we want a sample of 30
employees from the original data containing N = 40 employees. Finally, we specify not to perform
sampling with replacement so that the same employee cannot be chosen twice for inclusion in the
sample.
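A sketch of the sampling call on a made-up data frame with 40 rows (demo and its values are invented; the seed is set only so that the sketch is repeatable):

```r
set.seed(974)

# Hypothetical stand-in for the widge data frame: 40 employees
demo <- data.frame(
  ID     = 1:40,
  JOBSAT = round(runif(40, min = 1, max = 10), 1)
)

# sample() sits in the row position: 30 of the 40 row numbers, without replacement
sam1 <- demo[sample(1:nrow(demo), 30, replace = FALSE), ]

print(dim(sam1))
```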
Concept 6: Using R for Confidence Intervals

Base R reports a confidence interval (CI) for a mean as part of the output of its t.test() function, but we
are not aware of a dedicated "canned" (i.e., ready-made) CI report of the kind found in SPSS and SAS.
HOWEVER, this is a great opportunity to showcase how easily this kind of thing can be done with a little
bit of user generated code. The following code performs the CI calculation and generates a little report:
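A sketch of such a calculation, under stated assumptions: the JOBSAT values are invented, and the names n, xbar, me, uclm, mean_js, and report are hypothetical (only alpha and lclm are named in the text). The resulting limits can be cross-checked against the conf.int component returned by t.test().

```r
# Hypothetical job satisfaction scores (invented values, one missing)
JOBSAT <- c(6.5, 7.0, 8.1, 5.9, 7.4, 6.8, NA, 7.2)

alpha <- 0.05                     # corresponds to a 95% confidence level
n     <- sum(!is.na(JOBSAT))      # number of non-missing values
xbar  <- mean(JOBSAT, na.rm = TRUE)

# Margin of error: t quantile times the standard error of the mean
me <- qt(1 - alpha / 2, n - 1) * sd(JOBSAT, na.rm = TRUE) / sqrt(n)

lclm    <- round(xbar - me, digits = 3)   # lower confidence limit for the mean
uclm    <- round(xbar + me, digits = 3)   # upper confidence limit for the mean
mean_js <- round(xbar, digits = 3)

# Bind the four scalars into a 1-row matrix for printing
report <- cbind(n, mean_js, lclm, uclm)
print(report)
```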

There is a lot going on here. First, we set the alpha level to .05 which, of course, corresponds with a 95%
confidence level. Notice that alpha is not a function or an argument to a function. Here it is a simple user-
defined (which means that we made it up...) R object (in this case, it is a scalar). Then, we count the
number of non-missing values of the vector JOBSAT.

FYI: The is.na() function returns a logical vector of the same length as the input vector with a TRUE or
FALSE for each element answering the question "is this value/element missing (NA)?":

We obtained 40 FALSE's because there are no missing values of the variable JOBSAT. Now, the sum
function works here because, just like SAS, R interprets TRUE as 1 and FALSE as 0. So,
sum(is.na(JOBSAT)) counts the number of missing values in the JOBSAT vector.

Now, we want the number of non-missing values, so we add the ! operator to the expression. The !
operator means NOT in R.
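A small sketch of both counts. The JOBSAT values here are invented and, unlike the WidgeOne data, include two missing values so that the counts differ.

```r
# Hypothetical job satisfaction scores (invented values, two missing)
JOBSAT <- c(6.5, NA, 8.1, 5.9, NA, 6.8)

miss <- is.na(JOBSAT)                # TRUE where a value is missing
print(miss)

n_missing    <- sum(is.na(JOBSAT))   # TRUE counts as 1, FALSE as 0
n_nonmissing <- sum(!is.na(JOBSAT))  # ! flips TRUE and FALSE

print(n_missing)
print(n_nonmissing)
```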

As a result, we are now counting the number of non-missing values of the input vector. Of course, we use
this information to determine the degrees of freedom in the calculation of the margin of error of these
CIs. Next, we calculate the lower confidence limit for the mean (lclm) using a number of R functions
(e.g., round(), mean(), sd(), sqrt(), and qt()). Thus far, we have discussed all of these except qt(). Like any
good statistical package, R contains a number of functions to obtain values of reference statistical
distributions (like the normal, t, chi-square, and F-distributions). The qt() function returns the appropriate
quantile from Student's t-distribution given a probability value (here, 1-alpha/2) and the correct degrees
of freedom (here, n-1). We then do the same for the upper limit of this interval. Next, we calculate the
associated sample mean value. Finally, we use the cbind() (short for column bind) function to "paste" or
bind the four computed scalars into a little matrix (with only 1 row, sort of like a row vector) for ease of
printing and viewing. This is very much like the output one would obtain from SAS; however, we
customized it to exactly the information we wanted.

REMEMBER: When reporting CIs ALWAYS, ALWAYS, ALWAYS provide the appropriate interpretation of
the results. For example, “Based on a representative sample of 40 employees, we are 95% confident that
the mean job satisfaction for all employees is between 6.53 and 7.17”.

R Lagniappe

What is a “Lagniappe”? This word derives from New World Spanish la ñapa, “the gift”. The word came
into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf
States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a
purchase.

Part 1: Writing Your Own Functions

As we have seen so far, R's utility and power is a result of its efficiency and ease in customizing your own
programs and results. Let's take this a step further.

We introduced and discussed several functions that are available to the user through the base package.
Additionally, there are a number of add-on packages that allow you to use functions that other users have
written and developed (see [Link] for more information on
R packages). Now, we can also write functions of our own...cool.

Let's use our code for generating CIs in the previous section. What if we could generalize and package
that code so that all the user had to do is type 1 line of code to call all of our source code and compute
and print the CIs for any variable they want? It's actually pretty easy to do in R (If you are a SAS user,
this would be like writing your own procedure, however, that is not an option in SAS).

How do we write our own function? This is R!: We use a function! And, in this case, it is actually called
function (ok, we did not mean to be confusing here...).

Check this out:

So, here CI gets or is defined as a function (a function, not a data object!) with 2 arguments: x (which we
assume is a continuous random variable 8) and alpha, the significance level associated with the desired
confidence level. Then the curly braces are used to instruct R that everything within the braces is the body
of the function. Notice we made some slight changes (added a field for the variable name, the confidence
level, and the margin of error (me)). Now after we define the function, from now on all we or anyone else
with this function loaded into their R session has to do is call the CI function while supplying the
appropriate information for the 2 arguments, and the function returns the desired confidence limits and all
the information associated with them.
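A sketch of what such a function could look like. This is an illustration rather than the original definition: the field names, the use of data.frame() for the report, the fixed 3 decimal places, and the JOBSAT values are all assumptions.

```r
# A two-argument CI function (hypothetical field names)
CI <- function(x, alpha) {
  varname <- deparse(substitute(x))   # name of the variable that was passed in
  n    <- sum(!is.na(x))              # number of non-missing values
  xbar <- mean(x, na.rm = TRUE)
  me   <- qt(1 - alpha / 2, n - 1) * sd(x, na.rm = TRUE) / sqrt(n)
  data.frame(
    variable   = varname,
    conf_level = 100 * (1 - alpha),
    n          = n,
    mean       = round(xbar, digits = 3),
    me         = round(me, digits = 3),
    lclm       = round(xbar - me, digits = 3),
    uclm       = round(xbar + me, digits = 3)
  )
}

# Hypothetical job satisfaction scores (invented values, one missing)
JOBSAT <- c(6.5, 7.0, 8.1, 5.9, 7.4, 6.8, NA, 7.2)

ci95 <- CI(JOBSAT, 0.05)
ci90 <- CI(JOBSAT, 0.10)
print(ci95)
print(ci90)
```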

Here are 3 instances of calling the function and obtaining the results in the example below. Pretty sweet!

8
Here the term "random variable" is used as it is used in statistical theory: "random variable" or stochastic variable refers to a
variable whose value results from a measurement on some type of random process. It should not be confused with random
number generation, the topic of the previous section.
Now, let's put this in hyper-drive. Let's add a default value to the alpha argument and another argument,
an optional argument, that allows the user to specify the decimal precision of the results (i.e., the number
of decimal places used in the results).

Here we add alpha=.05. Then .05 becomes the default value of alpha. The user can change it, however, if
they don't specify anything, they get 95% CIs (just like SAS!). Also we add the dec=3 specification in the
call to the function() function (HA!) and replace the value of the digits argument with dec.
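A sketch of the same idea with default values. As before, this is an illustration under assumptions (hypothetical field names, data.frame() for the report, invented JOBSAT values); only alpha=.05 and dec=3 come from the text.

```r
# CI function with a default alpha and an optional decimal precision
CI <- function(x, alpha = .05, dec = 3) {
  varname <- deparse(substitute(x))   # name of the variable that was passed in
  n    <- sum(!is.na(x))
  xbar <- mean(x, na.rm = TRUE)
  me   <- qt(1 - alpha / 2, n - 1) * sd(x, na.rm = TRUE) / sqrt(n)
  data.frame(
    variable   = varname,
    conf_level = 100 * (1 - alpha),
    n          = n,
    mean       = round(xbar, digits = dec),
    me         = round(me, digits = dec),
    lclm       = round(xbar - me, digits = dec),
    uclm       = round(xbar + me, digits = dec)
  )
}

# Hypothetical job satisfaction scores (invented values, one missing)
JOBSAT <- c(6.5, 7.0, 8.1, 5.9, 7.4, 6.8, NA, 7.2)

ci_default <- CI(JOBSAT)               # 95% CI, 3 decimal places
ci_90      <- CI(JOBSAT, alpha = .10)  # 90% CI
ci_1dp     <- CI(JOBSAT, dec = 1)      # 95% CI, 1 decimal place

print(ci_default)
print(ci_90)
print(ci_1dp)
```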

Now look at the sample calls to this function:

Part 2: Outputting Results from R

Ok, if one were working on a... homework, for example, one may desire to output the results they receive
to a format that can easily be used in a homework document. In that case, we will discuss 2 options for
outputting R results for 1) tabular output and 2) graphics.

Outputting Tabular Output in R

Arguably, this is another major shortcoming of R: There is no function at the present time that allows the
user to easily create properly formatted tables from R output 9. The best way to create presentation-
quality tables from R output is to copy and paste the results from the console into MS EXCEL and then
properly format the tabular information in EXCEL (e.g., adding titles, table lines, replacing abbreviations,
etc.). Unfortunately, even this approach requires several steps.

1) Highlight and copy output from the R console.

9
In other words, there is no analog to SAS's ODS RTF statement in R.
2) Paste the R output into MS EXCEL. Unfortunately, the pasted output often ends up in a single cell.
When that happens, use the Text to Columns function in the Data tab to split it.

3) Select Next and then Finish from the resulting dialog box.

4) The information is now split into separate columns.

5) Next, use basic MS EXCEL functionality to properly format the table.

6) Finally, copy and paste this formatted table into a word processing document.
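
As a possible shortcut for the copy, paste and Text to Columns steps, the results can be written to a .csv
file, which EXCEL opens with each value already in its own column. The following is a minimal sketch and
not part of the workflow described above: the data frame results and the file name are made up for
illustration, and write.csv() and read.csv() are base R functions.

```r
# Sketch only: results is a made-up table standing in for real R output.
results <- data.frame(variable = c("x1", "x2"),
                      n        = c(40, 40),
                      mean     = c(5.2, 71.3))

# Write to a temporary folder; in practice, use a path of your choosing.
out_file <- file.path(tempdir(), "results.csv")
write.csv(results, out_file, row.names = FALSE)

# Reading the file back confirms that the columns survived the round trip.
read.csv(out_file)
```

The formatting itself (titles, table lines, and so on) would still be done in EXCEL.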

R Chapter Answers (Actually there is only one...)

This call to by() requests a stratified analysis of the standard deviation of the Productivity Scores by Plant
while removing rows with missing values.
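
The same idea can be tried on a small example. The sketch below is self-contained and uses a made-up data
frame, toy, with hypothetical columns Plant and PRDCTY; it is not the widge data, and the values are for
illustration only.

```r
# Sketch only: toy is made-up data, not the widge data set.
toy <- data.frame(Plant  = c("A", "A", "A", "B", "B", "B"),
                  PRDCTY = c(70, 75, NA, 60, 66, 72))

# sd() is applied to PRDCTY separately within each Plant;
# na.rm = TRUE is passed through to sd() so that missing values are dropped.
res <- by(toy$PRDCTY, toy$Plant, sd, na.rm = TRUE)
res
```

The arguments are, in order, the data to summarize, the grouping variable, the function to apply, and any
extra arguments for that function.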

R code
#R Reference Manual Script

#Importing data into R with the [Link]() function

widge<-[Link]("E:\\STAT3010\\[Link]")

#Viewing data
widge

#Printing the first 5 observations with the head() function
#(head() shows 6 rows unless told otherwise)

head(widge,5)

#mean of Years on Job

mean(widge$YRONJOB)

#Printing YRONJOB
widge[,8]

#Mean of YRONJOB
mean(widge[,8])

#Mean of first 5 values of YRONJOB

mean(widge$YRONJOB[1:5])

mean(widge[1:5,8])

attach(widge)
YRONJOB
mean(YRONJOB)
dim(widge)
length(YRONJOB)
median(YRONJOB)
#mean() does not accept a data frame in current versions of R, so apply it column by column
sapply(widge[,5:9],mean)

#descriptive statistics
summary(widge[,5:9])

#stratified
by(widge[,6],widge[,2],sd,na.rm=TRUE)

#frequency
table(Plant)

#Categorizing a Continuous Variable

widge$Jobten[YRONJOB < 5] <- "New"
widge$Jobten[YRONJOB >= 5 & YRONJOB < 10] <- "Experienced"
widge$Jobten[YRONJOB >= 10] <- "Mature"
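
#Alternative sketch (not part of the original script): cut() can assign all
#three categories in one step. yrs is a made-up stand-in for YRONJOB.
#right=FALSE makes each interval include its lower bound, matching the rules above.
#The result is a factor whose levels are already in New, Experienced, Mature order.
yrs <- c(2, 5, 9.5, 10, 23)
tenure <- cut(yrs, breaks=c(-Inf,5,10,Inf), right=FALSE,
labels=c("New","Experienced","Mature"))
table(tenure)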

#reattaching
attach(widge)

#frequency
table(Jobten)

#Pie Chart
pie(table(Jobten),main="Figure i: Pie Chart of Job Tenure")

#Pie Chart with Percentages, Grayscale and legend

Jobten_labels <- round((table(Jobten))/sum(table(Jobten)) * 100, 1)
Jobten_labels <- paste(Jobten_labels, "%", sep="")
colors <- c("white","grey","black")
pie(table(Jobten), main="Figure i: Pie Chart of Job Tenure",col=colors,
labels=Jobten_labels)
#table() sorts the categories alphabetically, so take the legend text from the table itself
legend(1, 0.5, names(table(Jobten)),fill=colors)

#barplot (the counts are on the vertical axis, so they are labeled with ylab)
barplot(table(Jobten),main = "Figure i: Bar Chart of Job Tenure", ylab="Number of Employees")

#barplot with labels

barplot(table(ordered(Jobten,c("New","Experienced","Mature"))),
main = "Figure i: Bar Chart of Job Tenure", ylab="Number of Employees")

#histogram
hist(PRDCTY,main="Figure i: Histogram of Productivity Scores",
xlab="Productivity Scores")

#boxplot
boxplot(PRDCTY,main="Figure i: Boxplot of Productivity Scores",
xlab="Productivity Scores")

#boxplot with labels

boxplot(PRDCTY~POSITION,data=widge,
main="Figure i: Side-by-Side Boxplots of Productivity by Position",
ylab="Productivity Scores",
xlab="Position",names=c("Hourly","Management"))

#Row Percents
prop.table(table(Plant,Gender),1)

#proportions (1 = by row, 2 = by column)
t1<- table(Plant,Gender)
prop.table(t1,1)
prop.table(t1,2)

#Barcharts
barplot(table(Gender,Jobten),
main = "Figure i: Stacked Bar Chart of Job Tenure",
ylab = "Number of Employees",
legend = c("Female","Male"))

barplot(table(Gender,ordered(Jobten,c("New","Experienced","Mature"))),
main = "Figure i: Stacked Bar Chart of Job Tenure",
ylab = "Number of Employees",
legend = c("Female","Male"))

#same chart, with the legend drawn separately so that it can be positioned
#gray.colors(2) is the default palette barplot() uses for a 2-row table
barplot(table(Gender,ordered(Jobten,c("New","Experienced","Mature"))),
main = "Figure i: Stacked Bar Chart of Job Tenure",
ylab = "Number of Employees",
legend = NULL)
legend(.25,15,fill=gray.colors(2),c("Female","Male"))

#100% stacked chart on the proportion scale (0 to 1)
barplot(prop.table(table(Gender,ordered(Jobten,
c("New","Experienced","Mature"))),2),
main = "Figure i: 100% Stacked Bar Chart of Job Tenure",
ylab = "Proportion",legend = NULL)
legend(.35,.9,fill=gray.colors(2),c("Female","Male"))

#100% stacked chart on the percent scale (0 to 100)
barplot(prop.table(table(Gender,ordered(Jobten,
c("New","Experienced","Mature"))),2)*100,
main = "Figure i: 100% Stacked Bar Chart of Job Tenure",
xlab = "Job Tenure",ylab = "Percent",
legend = NULL)
legend(.35,90,fill=gray.colors(2),c("Female","Male"))

#scatter plots
plot(YRONJOB,PRDCTY,
main="Figure i: Scatterplot of Productivity by Years on the Job",
xlab="Years on the Job",
ylab="Productivity Scores")

#generating random number

ran<-runif(40)
ran

r100<-round(ran*100)
r100
#setting the seed makes the random numbers reproducible
set.seed(974)
ran<-runif(40)
ran

#randomly assigning data to 2 groups

widge$Group[ran<.5]<-1
widge$Group[ran>=.5]<-2
widge

#reattaching
attach(widge)
widge2<-widge[order(Group,EmpID),]
widge2

#sampling
sam1<-widge2[sample(1:nrow(widge2),30,replace=FALSE),]
sam1

#confidence intervals
alpha<-.05
n<-sum(!is.na(JOBSAT))
xbar<-mean(JOBSAT)
me<-qt(1-alpha/2,n-1)*sd(JOBSAT)/sqrt(n)
lclm<-round(xbar-me,digits=3)
uclm<-round(xbar+me,digits=3)
limits<-cbind(n,mean=round(xbar,digits=3),lclm,uclm)
limits

#checking for missing

is.na(JOBSAT)

#CI() prints a t-based confidence interval for the mean of x
CI<-function(x,alpha){
  n<-sum(!is.na(x))                               #number of non-missing values
  con<-(1-alpha)*100                              #confidence level in percent
  me<-qt(1-alpha/2,n-1)*sd(x,na.rm=TRUE)/sqrt(n)  #margin of error
  xbar<-mean(x,na.rm=TRUE)
  #the column name conf.level is an assumption
  limits<-data.frame(variable=deparse(substitute(x)),n,conf.level=con,
    mean=round(xbar,digits=3),me=round(me,digits=3),
    lclm=round(xbar-me,digits=3),uclm=round(xbar+me,digits=3))
  print(limits)
}

CI(JOBSAT,.05)
CI(PRDCTY,.10)
CI(YRONJOB,.01)

#CI() with default values: alpha=.05 (95% CI) and dec=3 (decimal places)
CI<-function(x,alpha=.05,dec=3){
  n<-sum(!is.na(x))                               #number of non-missing values
  con<-(1-alpha)*100                              #confidence level in percent
  me<-qt(1-alpha/2,n-1)*sd(x,na.rm=TRUE)/sqrt(n)  #margin of error
  xbar<-mean(x,na.rm=TRUE)
  #the column name conf.level is an assumption
  limits<-data.frame(variable=deparse(substitute(x)),n,conf.level=con,
    mean=round(xbar,digits=dec),me=round(me,digits=dec),
    lclm=round(xbar-me,digits=dec),uclm=round(xbar+me,digits=dec))
  print(limits)
}

CI(JOBSAT)
CI(PRDCTY,.10,2)
CI(YRONJOB,.01,5)
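
One way to sanity-check a hand-written function like CI() is to compare its limits with those from
t.test(), a base R function that reports a t-based confidence interval for a mean and drops missing values
on its own. The following is a minimal sketch under that assumption; the vector scores is made up and is
not a column of widge.

```r
# Sketch only: scores is made-up data, not a column of widge.
scores <- c(4, 7, 6, NA, 5, 8, 6, 7)
alpha  <- .05

# Interval computed by hand, using the same formula as the script
n    <- sum(!is.na(scores))
me   <- qt(1 - alpha/2, n - 1) * sd(scores, na.rm = TRUE) / sqrt(n)
xbar <- mean(scores, na.rm = TRUE)
hand <- c(xbar - me, xbar + me)

# Interval reported by t.test(); its conf.level argument is 1 - alpha
builtin <- t.test(scores, conf.level = 1 - alpha)$conf.int

# TRUE when the two intervals agree (up to numerical tolerance)
all.equal(hand, as.numeric(builtin))
```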

Congratulations. You are now an even bigger Geek. Take a
bow.
