R Tutorials for Regression and Time Series for Actuaries

Margie Rosenberg
University of Wisconsin-Madison

August 2013

This document is a set of tutorials to assist students who are first learning the statistical software R
to gain the basic skills for use in a regression class. These tutorials can be used in a computer classroom
environment or as homework to guide the student through an example.
The tutorials have been designed over a two-year period for use in a one-semester regression and
time series class. The textbook used is Frees’ Regression Modeling with Actuarial and Financial
Applications. Data used for the tutorials can be found at: [Link]
jfreesbooks/Regression%20Modeling/BookWebDec2010/[Link]. There is one additional data
set [Link] that can be found at [Link]
aspx.
The tutorials are ordered somewhat sequentially. Some material is duplicated, both to reinforce
concepts and to explain the same material in a different way.
As users of R know, there are many ways of coding, and many packages duplicate similar functions.
I evolved my approach to teaching this course over the semesters, and the tutorials reflect changes in
the material taught, as well as some of my approaches to coding in R. I am far from an R expert. I used
my own learning of the software as a guide to teaching others. My style in the tutorials is intended to be
informal.
The tutorials use a mix of showing R code directly and doing the coding with Rcmdr, a package in R
that has a drop-down menu. While the drop-down menu is a great kickstarter for the long learning curve
of R, I tried to introduce R code separately soon afterwards so that students can learn R and not rely
on the menus. I also suggest using a program like RStudio or TinnR rather than the R GUI. As with
other things, in the end personal preference and comfort level are critical to success.
Certainly these tutorials can be improved and expanded. I welcome suggestions for new ones, and
certainly corrections if errors are spotted.
I want to specifically acknowledge and thank Andy Korger, a student here at UW-Madison, who
helped design a few of the tutorials.
Contents
Page

1 To Get You Started 1

2 FILE and DATA Menus 4

3 Introduction to Data Frames 6

4 Simulation and Statistics 11

5 Importing, Some Basic Functions, Loading, Saving 14

6 Simulation and Regression 16

7 Introduction to Scatterplots 19

8 A Little More on Scatterplots: Using Matrix Plots 25

9 Subsetting, Summaries and Graphing 33

10 Simulation, Regression, and Confidence/Prediction Intervals 35

11 Introduction to Regression with Categorical Explanatory Variables, Part I 40

12 Introduction to Regression with Categorical Explanatory Variables, Part II 45

13 Categorical Explanatory Variables: Labels and Levels 52

14 Regression with Categorical Explanatory Variables: With and Without Intercept Term 56
15 Categorical Explanatory Variables: Impact of Interaction Terms 62

16 More on Categorical Explanatory Variables 69

17 Influential Points and Residual Analysis 75

18 Impact of Outlier on Regression Results 82

19 Example of Residual Plots 88

20 Missing Data and Added Variable Approach 94

21 Assessing the Model 99


Act Sci 654: Regression and Time Series for Actuaries 1

1 To Get You Started


This exercise will help get you started to analyze data and provide you some helpful references.

1. Download R from the [Link] website.

(a) Refer to the Regression book website for other tips:


(b) Refer to [Link] for a video
on installing R for windows. Note that the current version is now 2.15.1.

2. Start R. We will now want to download some packages to help us. These packages are written by
regular people and become part of the vast library of resources of R. These packages are macros
and functions to help you do your work more easily and more efficiently.

(a) In the menu bar, go to Packages > Set CRAN mirror. These mirrors are different web
servers to download the packages. I usually choose the Iowa one in the USA as it seems to
be faster. Usually, you choose one close to where you currently are located.
(b) Next go to Packages > Install Packages. This command pulls up a list of many, many, many
packages that you could download. If you push the Ctrl button, you can select multiple
packages. Do not download all of them, as your computer will be unavailable for a long,
long, long time. You can download packages as you need them. But once the packages
are on your computer, you need not install them each time. However, if you know that a
package has been updated, then you can choose that option to update the package on your
system. (If you are using a new computer or a computer at school, you may need to install
the packages.)
(c) There are many packages already built into the base distribution, and too many packages to
list here. Check out [Link] for a list.
(d) Initially, let us download just a couple of packages: Hmisc and psych. You can download
one at a time or as a group by selecting with the Ctrl button.
(e) You can also type a command at the R prompt (>) to download packages. Type the following:

install.packages("Rcmdr", dep=TRUE)

This command installs the package Rcmdr (R commander), together with all of the other
packages that Rcmdr requires.
(f) These packages are now loaded into storage on your computer, but when you need to use
them, you need to open the package. You do this with the library statement.
(g) To bring Rcmdr into R, type:
library(Rcmdr)

or go to the R menu Packages > Load Packages and click on Rcmdr. Rcmdr is a package
that lets you work in R through a menu-driven interface, which helps speed getting started
and eases the learning curve. Let us focus initially on Rcmdr.

3. Note that there are 3 windows: the Script, Output, and Message Windows.

(a) The Script window is where you type commands or where the commands are placed after
you select them from the Menu Bar. Note the Submit button. You highlight the command
(if it takes more than one line) or place the cursor at the end of the line and then push the
Submit button.
(b) The Output is displayed in the Output window. You can copy and paste output into WORD
for your homework and project. Use the Courier font for R output so that everything lines
up nicely.
(c) When something seems afoot, check the Message window. This window shows you when
there are errors; not necessarily what the errors are, but it alerts you to whether your data
are loaded and other items of note.
(d) You can clear a window by left-clicking on the particular window to select it (like the
Output window), then right-clicking and selecting Clear. This is especially helpful in the
Message window when you are correcting errors: then you know whether you are seeing a
new error or everything is fixed.

4. Check out the menus for Rcmdr.

File Menu: Items for loading and saving script files; for saving output and the R workspace;
and for exiting.
Edit Menu: Items (such as Cut, Copy, Paste) for editing the contents of the script and output
windows. Right clicking in the script or output window also brings up an edit context menu.
Data: Sub-menus containing menu items for reading and manipulating data.
Statistics: Sub-menus containing menu items for a variety of basic statistical analyses.
Graphs Menu: Items for creating simple statistical graphs.
Models Menu: Items and sub-menus for obtaining numerical summaries, confidence intervals,
hypothesis tests, diagnostics, and graphs for a statistical model, and for adding diagnostic
quantities to the data set.
Distributions: Probabilities, quantiles, and graphs of standard statistical distributions.
Tools Menu: Items for loading R packages unrelated to the Rcmdr package (e.g., to access
data saved in another package), and for setting some options.
Help Menu: Items to obtain information about the R Commander (including an introductory
manual derived from this paper). As well, each R Commander dialog box has a Help button.

5. Rcmdr is a useful package for getting started programming in R. You can learn the commands
by using the menu and then seeing the commands that Rcmdr prints.

6. How to get help.

(a) I would like to say that learning R with Rcmdr is easy. I could say it, but it would not
be accurate. There is a learning curve to R, and initially, and most probably throughout
the semester, you might be a little frustrated. I like to believe that the time you
spend learning R will be worthwhile in class, certainly, and beyond class into your career.
(b) I will be continually adding resources to help you with R on the course website. There are
articles about R, Rcmdr, reference cards, and other materials.

(c) Note the links on Frees’ Regression Book Web Site for R and Rcmdr videos to get started.
(d) Note that R is case-sensitive, i.e. log does not equal Log nor LOG.
(e) I am always available for help, at least during waking hours.
(f) The course website has a forum set up to allow you to ask questions of all in the class for
help.
(g) I have found that, with a basic understanding, you will soon be able to google for help and
understand a possible solution.
(h) One great website is: [Link]. Not only can you type in questions, but there is also
a very nice reference card that you could print and have with you.
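The case-sensitivity point in (d) above can be seen directly at the R prompt; a quick sketch:

```r
# 'log' is the built-in natural logarithm; 'Log' and 'LOG' do not exist in base R.
log(100)       # 4.60517
exists("log")  # TRUE
exists("Log")  # FALSE -- typing Log(100) would give
               # Error: could not find function "Log"
```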

2 FILE and DATA Menus


This exercise will illustrate some of the items under the FILE and DATA menu.

1. Download [Link] from the regression book website.

2. Open [Link] in NOTEPAD to show that it is a comma-separated values file. (Note: if you just
double-clicked, the file would open in EXCEL. You can right-click on the file icon and see under
Properties that it opens automatically in EXCEL.)

3. Start R and Rcmdr. If you need to re-install Rcmdr, recall from the email that you can use the
command install.packages("Rcmdr", dep=TRUE). This installs Rcmdr, together
with all of the other packages that Rcmdr requires. To bring Rcmdr into R, type library(Rcmdr)
or go to the R menu Packages > Load Packages and click on Rcmdr.

4. Let’s make some mistakes.

(a) For the [Link] file, try to Load the file into Rcmdr using: Data > Load. That does not
work.
(b) Using the same [Link] file, import the file, put a clever name like AutoBI for the
Dataset, and then click ok. Everything looks good until you go into View and see that the
data are not separated into columns.

5. For the [Link] file, import the file, put a clever name like AutoBI.1 for the Dataset, click
on "comma" in the "Field Separator" and then click ok. Everything looks good even in View.
(Note: the decimal-separator "comma" is for other systems that represent 100.3 as 100,3.)

6. Type names(AutoBI.1) and click Submit.

7. Type str(AutoBI.1) and click Submit.

8. Type head(AutoBI.1) and click Submit.

9. Type tail(AutoBI.1) and click Submit.

10. Type AutoBI.1[1:10, ] and click Submit. How does this differ from the head command?

11. Type AutoBI.1[ ,2:3 ] and click Submit. What does this command do?

12. Type attributes(AutoBI.1) and click Submit. See how it contains the names, the class (here a
data frame), and the row numbers. This command is helpful to summarize the R object.

13. Type summary(AutoBI.1) and click Submit. Summary is a useful command that provides dif-
ferent output for different R objects. Here, on a data frame, the summary command shows the
descriptive statistical summary of the data.

14. Compute a new variable [Link] with CLMINSUR using Data > Manage variables > Compute
new variable. Function is: log. (Note: Check out R functions link in R Resource section on
Course Website.) View the data set again to see that the new variable is added to the data set
and computed correctly.
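The same computation can be done directly in the Script window. A sketch, assuming the new variable is the natural log of CLMINSUR; the column name log.CLMINSUR is my own illustrative choice, not the tutorial's:

```r
# Script equivalent of Data > Manage variables > Compute new variable
# (log.CLMINSUR is an illustrative name -- use whatever name you chose)
AutoBI.1$log.CLMINSUR <- log(AutoBI.1$CLMINSUR)
head(AutoBI.1)  # verify that the new column was added
```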

15. Go to Data > Active Data Set > Variables in active data set. This produces the function names(),
which lists the variable names. It is a powerful function that will come in handy when we get to
regression modeling.

16. Go to Data > Active Data Set and save active data set. Note extension of file. (This is what you
would use for the Data > Load function. This saves your data, including the transformations, in
a data set that can be restored.)

17. Go to Data > Active Data Set and export active data set. This writes the data to a .csv file so
that you can use it in EXCEL, for example.

18. Go to File > save script as and save script. This lets you keep your code for future times.

19. Go to File > save output as. This gives you the output in a text file. (Use courier font for tables
and output so that it lines up better.)

20. Go to File > save workspace as. This is similar to saving the active data set, but saves everything
in the workspace. So if you have created multiple datasets, then all are saved within this one file.
Whereas in (16), the command only saves the active data set.
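The menu items in steps 16 through 20 have direct script equivalents; a sketch, with file names that are illustrative only:

```r
# Save the active data set only (step 16); restore it later with Data > Load
save(AutoBI.1, file="AutoBI1.RData")

# Export the active data set to a .csv file for use in EXCEL (step 17)
write.csv(AutoBI.1, file="AutoBI1.csv", row.names=FALSE)

# Save everything in the workspace, not just the active data set (step 20)
save.image(file="MyWorkspace.RData")
```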

21. Now clear all of the working areas: click in the script, then right click to clear the contents. Do
the same for the output and message windows.

22. Type rm(list=ls()). This clears all of the objects from the workspace. Click on the Active Dataset
box and everything disappears. You can see this by typing ls() and you’ll get character(0).

23. Now bring in one of the .RData files using Data > Load. See that all your data are restored.
Now type ls() and see that it lists the name(s) of the datasets in that file.

3 Introduction to Data Frames


This exercise introduces you to the data frame object in R that we use throughout the semester.
Understanding the notation and how you can get information in and out of it will be useful. What
makes a data frame useful is that it can contain both numbers and letters. For example, if your SEX
variable is labeled M or F, then a data frame can be used.

1. The code that I used in R is attached if you want to use it.

2. The exercise uses the data from Modified UN Life Expectancy from our course website.

3. Please remember to save your R scripts as you work in Rcmdr, just in case your system freezes.

4. Start R/Rcmdr. You can use Rcmdr to submit a couple of lines at one time. That might be
easier than using the R GUI.[1] Also recall that R is case-sensitive and space-sensitive.

5. Before you start, type in the following statement:

rm(list=ls())

I like to start with this statement as it removes all of the objects in your workspace so that you
start fresh. If you type:

ls()

you see that there is nothing in the workspace, i.e. the result is character(0).

6. Read in [Link]. You can use Rcmdr to do this. In my code at the end
of this document, I called the data frame UN. Look at the code that comes up in the Script
window. It is a general way of reading data into R.

7. The <- is called the assignment operator. The information on the right-hand side is assigned to
the object named on the left-hand side. Be sure that there is NO space between the < and
the -.
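A short demonstration of why that space matters:

```r
x <- 5    # assignment: x now holds the value 5
x < - 5   # NOT assignment: with a space, R reads this as the
          # comparison "x < -5", which here returns FALSE
```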

8. Now type ls() again to see that UN is showing up as in the workspace.

9. You can shorten the code by using the read.csv statement:

UN.1 <- read.csv("d:/ActSci654/DataForFrees/[Link]", header=TRUE)

Here I make sure that I note that there are variable names at the top of the data set. I can verify
that by either looking at the data in notepad or in Excel. I changed the name of the data frame
to UN.1 so that both UN and UN.1 are in the workspace. Note that the workspace is where the
data are stored and calculations made. This workspace is temporary, i.e. unless you save it, the
workspace and all the work will be gone if you close R.
[1] Also remember that RStudio and TinnR are great substitutes for Rcmdr where you can use a script window and have an
output window.

10. See what the commands str(UN), names(UN), [Link](UN), and attributes(UN) display.

11. You can view the entire data frame by typing UN, or you can view the top 6 rows by typing
head(UN) or the bottom 6 rows by typing tail(UN).

12. Many times, we want to take just a few rows or a few columns from the data frame. Suppose we
create a new object UN.sub1 by the following:

UN.sub1 <- UN[1:10, ]


UN.sub1

Here we create a new object that is based on our original data. The data frame is accessed by
rows and columns (remember your matrices?). Note the use of the square brackets. By typing
UN.sub1, we can view the subset. Note that the colon (:) says to take all rows from 1 to 10.

13. Similarly, we can take just a few of the columns using:

UN.sub2 <- UN[1:10, 1:2 ]


UN.sub2

14. And, we can take just a few of the columns using their names :

UN.sub3 <- UN[1:10, c("LIFEEXP","PUBLICEDUCATION") ]


UN.sub3

Here we use quotes around the variable names (remember that R is picky about upper and lower
case, as well as spaces), and we use the c() notation as a combine operator.

15. How would you pull the odd rows: 1, 3, 5, 7?
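One possible answer, sketched two ways (the attached code at the end of this tutorial uses the c() version):

```r
UN.odd  <- UN[c(1, 3, 5, 7), ]       # list the row numbers explicitly
UN.odd2 <- UN[seq(1, 7, by = 2), ]   # seq() generates 1 3 5 7
```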

16. An interesting way of subsetting the data frame to keep only a specific group of variables is:

keep <- c("LIFEEXP","PUBLICEDUCATION")


keep

UN.sub4 <- UN[,keep]


UN.sub4

Here we create a new object keep that contains just the names of the two variables we want to
retain. Then we use that keep object in the subsetting.

17. We will always want to find descriptive summaries of our data. We have seen the summary()
statement. Note that it does not include the standard deviation, an important statistic. As we
know, there are many ways in R to do the same thing.

18. Suppose you type summary(PUBLICEDUCATION) to get summary statistics. Why does this
not work?

19. Now type the command below and try to understand why it works:

summary(UN.sub4$PUBLICEDUCATION)

The command that erred was looking in the general workspace for the variable PUBLICEDU-
CATION, which does not exist there (type ls() to see). The second summary command works
because we accessed PUBLICEDUCATION inside the data frame UN.sub4.

20. But we can take summaries over the entire data frame:

summary(UN.sub4)
mean(UN.sub4)
sd(UN.sub4)
colMeans(UN.sub4)
sapply(UN.sub4,mean)
sapply(UN.sub4,sd)

This data set has missing data, as shown with the NA. Missing values interfere with
calculations, so the mean and sd functions do not work and produce NA.

21. To correct this, we add the argument na.rm=TRUE to remove the NAs from the calculations.

summary(UN.sub4$PUBLICEDUCATION)

sapply(UN.sub4,mean,na.rm=TRUE)
sapply(UN.sub4,sd,na.rm=TRUE)

22. R packages are created by people to make their lives, and yours, easier. We have already
downloaded the Rcmdr package and will now use the Hmisc and psych packages. Remember,
these packages need to be installed on your computer.

## psych package

library(psych)
describe(UN.sub4)

## Hmisc package

library(Hmisc)
describe(UN.sub4)

Note that the results are the same, but the format differs. These packages each have different
functions built-in, like correlation. We will get more into these in later tutorials.
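As one example of such a built-in capability, base R's cor() computes correlations; a sketch for the two retained variables, with missing values dropped:

```r
# Pairwise correlations for UN.sub4, ignoring rows that contain NA
cor(UN.sub4, use = "complete.obs")
```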

rm(list=ls())
ls()

UN <- read.table("d:/ActSci654/DataForFrees/[Link]",
header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

ls()

## or easier

UN.1 <- read.csv("d:/ActSci654/DataForFrees/[Link]", header=TRUE)

str(UN)

names(UN)

[Link](UN)

attributes(UN)

UN

head(UN)

tail(UN)

UN.sub1 <- UN[1:10,]


UN.sub1

UN.sub2 <- UN[1:10, 1:2]


UN.sub2

UN.sub3 <- UN.sub1[1:10, c("LIFEEXP","PUBLICEDUCATION")]


UN.sub3

UN.sub3a <- UN[c(1,3,5,7), c("LIFEEXP","PUBLICEDUCATION")]


UN.sub3a

keep <- c("LIFEEXP","PUBLICEDUCATION")


keep

UN.sub4 <- UN[,keep]


UN.sub4

# gives an error
summary(PUBLICEDUCATION)

# makes it right
summary(UN.sub4$PUBLICEDUCATION)

summary(UN.sub4)
mean(UN.sub4)
sd(UN.sub4)
colMeans(UN.sub4)
sapply(UN.sub4,mean)
sapply(UN.sub4,sd)

# corrects the code


summary(UN.sub4$PUBLICEDUCATION)

sapply(UN.sub4,mean,na.rm=TRUE)
sapply(UN.sub4,sd,na.rm=TRUE)

## psych package

library(psych)
describe(UN.sub4)

## Hmisc package

library(Hmisc)
describe(UN.sub4)

4 Simulation and Statistics


This exercise will introduce you to some aspects of simulation and relate them to statistics and hypothesis
testing. We will illustrate starting with Rcmdr and also highlight the R commands.

1. Start R and load Rcmdr with the command

library(Rcmdr)

Remember that R is case-sensitive. Also remember to check the message box for errors and
warnings.

2. From the Rcmdr menu bar, choose Distributions > Continuous Distribution > Normal Distribu-
tion > Sample from Normal distribution

(a) Call the data set name what you want, like MySample. Do not put spaces as that might
confuse R. But remember that R is case-sensitive.
(b) Set the mean at 2000 and the standard deviation at 500. (We will mimic some loss models
data.)
(c) Set the number of rows (samples) as 1000 and the number of columns (observations) as 5.
(In a way this is backwards for some purposes, but we can work with it. Also, when we get
better, we can adjust the code and do whatever we want.)
(d) Check the boxes to generate the mean, standard deviation, and sums for each sample. Click
’OK’.

3. Now let’s see what happens in Rcmdr.

(a) First, at the top of Rcmdr, just below the menu bar and just above the Script window, is a
place that says ’Data set’, with MySample shown. This was blank before we started. (If you
want, start from the beginning to check.) By creating the simulated data, MySample
becomes the active data set. In some work that we will do, we may create multiple data
sets in one session. We can move from one data set to another by clicking on the active
data set that we want.
(b) Also at the top is a View button. If you click on this you will be able to see the data set
that has been created from the commands.
i. Down the vertical column are the 1000 samples that we created, while across the top
are 5 observations, along with the sample mean, sample standard deviation, and sample
sums.
ii. Each row is a sample of 5 observations. In probability and statistics, you would talk
about a sample X1, ..., X5 from a normal distribution with µ = 2000 and
σ = 500.
iii. Check out your first few rows. The first 5 of mine is:
obs1 obs2 obs3 obs4 obs5 mean sum sd
sample1 1041.808 2233.845 3527.794 2290.63 2051.026 2229.021 11145.1 884.8428
sample2 2523.594 2692.7 2159.205 2475.304 1985.262 2367.213 11836.06 287.7648
sample3 1278.55 1867.606 2212.036 1893.822 2465.663 1943.535 9717.677 445.8564
sample4 1265.572 2158.062 1426.195 1646.197 1795.433 1658.292 8291.459 345.2616
sample5 2407.344 1565.79 2516.14 2599.611 1611.374 2140.052 10700.26 508.2715

iv. Each row represents a random sample of 5 observations from this normal distribution.
Note that none of the sample means equals 2,000 and none of the sample standard
deviations equals 500.
v. In real life we usually get one sample, like row 1. We do not know that µ = 2000 or
that σ = 500. Our job in real life is to provide a guess (an estimate) of what might
be a good choice for the mean and the standard deviation.
vi. In the Message window, note that 1000 rows and 8 columns have been created: one
row for each sample, and 8 columns comprising the 5 normally distributed observations,
the sample mean, the sample standard deviation, and the sample sum.
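Point v above can be made concrete: treating the first row of my simulated data as the one sample we would see in real life, a sketch of the estimates (your numbers will differ):

```r
# The five observations from sample1 in the table above
x <- c(1041.808, 2233.845, 3527.794, 2290.630, 2051.026)
mean(x)  # 2229.021 -- our estimate of mu    (true value 2000)
sd(x)    # 884.8428 -- our estimate of sigma (true value 500)
```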
(c) For this active data set MySample, let us get some summary statistics.
i. From the menu, take Statistics > Summaries > Active Data Set.
summary(MySample)
obs1 obs2 obs3 obs4
Min. : 290.1 Min. : 344.1 Min. : 588.8 Min. : 244.4
1st Qu.:1657.0 1st Qu.:1649.8 1st Qu.:1616.6 1st Qu.:1617.9
Median :1990.4 Median :1999.4 Median :1984.9 Median :1984.2
Mean :1991.5 Mean :1996.9 Mean :1973.0 Mean :1998.3
3rd Qu.:2333.1 3rd Qu.:2318.6 3rd Qu.:2290.5 3rd Qu.:2380.6
Max. :3496.7 Max. :3583.9 Max. :3635.2 Max. :3429.7
obs5 mean sum sd
Min. : 368.4 Min. :1236 Min. : 6180 Min. : 76.64
1st Qu.:1683.0 1st Qu.:1851 1st Qu.: 9256 1st Qu.: 348.59
Median :2046.5 Median :2006 Median :10030 Median : 463.51
Mean :2041.3 Mean :2000 Mean :10001 Mean : 472.10
3rd Qu.:2377.9 3rd Qu.:2149 3rd Qu.:10747 3rd Qu.: 577.20
Max. :4188.0 Max. :2766 Max. :13829 Max. :1055.33

ii. Check out the output window for the results. What would you expect the mean of obs
1 to obs 5 to be? The mean and median each hover around 2,000. Look at the mean
of the mean. That is spot-on (at least for me) at 2,000. The mean of the standard
deviation for me is 472.10, not the 500 that we would want it to be.
iii. From the menu, take Statistics > Summaries > Numerical Summaries. Click on the
mean and sd (pushing Ctrl at the same time to select both).

mean sd 0% 25% 50% 75% 100% n


mean 2000.211 225.4406 1236.039 1851.212 2005.932 2149.363 2765.718 1000
sd 472.098 165.6786 76.63666 348.5862 463.5127 577.195 1055.328 1000

iv. This output is the same as that provided above, except it shows the standard deviation.
v. Finally let us try some plots. Go to Graphs > Histograms and click on mean. Note the
distribution of the sample means.
vi. Similarly, go to Graphs > Histograms and click on sd. Note the distribution of the
sample standard deviations.
vii. We will refer back to this example when we talk about Sampling Distributions.
(d) Let us take a closer look at the code generated by this example in the Script window.

set.seed(654) # sets a seed so that each time you run this code,
# you get the same results

MySample <- as.data.frame(matrix(rnorm(1000*5, mean=2000, sd=500), ncol=5))


rownames(MySample) <- paste("sample", 1:1000, sep="")
colnames(MySample) <- paste("obs", 1:5, sep="")

MySample$mean <- rowMeans(MySample[,1:5]) # Rcmdr


MySample$mean.1 <- apply(MySample[,1:5], 1, mean) # same result another way

MySample$sum <- rowSums(MySample[,1:5])

MySample$sd <- apply(MySample[,1:5], 1, sd) # from Rcmdr


MySample$var <- apply(MySample[,1:5], 1, var) # cannot do in Rcmdr

summary(MySample)
numSummary(MySample[,c("mean", "sd")], statistics=c("mean", "sd",
"quantiles"), quantiles=c(0,.25,.5,.75,1))

Hist(MySample$mean, scale="frequency", breaks="Sturges", col="darkgray")


Hist(MySample$sd, scale="frequency", breaks="Sturges", col="darkgray")

4. Some other code that we will discuss in class

library(psych)
## Relate to Intro to Stat material

summary(MySample$mean)
describe(MySample$mean) ## from psych package

summary(MySample$var)
describe(MySample$var) ## from psych package

summary(MySample$sd)
describe(MySample$sd) ## from psych package

hist(MySample$mean)
hist(MySample$var)
hist(MySample$sd)

5 Importing, Some Basic Functions, Loading, Saving


This exercise will illustrate some of the items under the FILE and DATA menu.

1. Download [Link] from regression book website

2. Open [Link] in NOTEPAD to show that it is a comma-separated values file. (Note: if you just
double-clicked, the file would open in EXCEL. You can right-click on the file icon and see under
Properties that it opens automatically in EXCEL.)

3. Start R and Rcmdr. If you need to re-install Rcmdr, recall from the email that you can use the
command install.packages("Rcmdr", dep=TRUE). This installs Rcmdr, together
with all of the other packages that Rcmdr requires. To bring Rcmdr into R, type library(Rcmdr)
or go to the R menu Packages > Load Packages and click on Rcmdr.

4. Let’s make some mistakes.

(a) For the [Link] file, try to Load the file into Rcmdr using: Data > Load. That does not
work.
(b) Using the same [Link] file, import the file, put a clever name like AutoBI for the
Dataset, and then click ok. Everything looks good until you go into View and see that the
data are not separated into columns.

5. For the [Link] file, import the file, put a clever name like AutoBI.1 for the Dataset, click
on "comma" in the "Field Separator" and then click ok. Everything looks good even in View.
(Note: the decimal-separator "comma" is for other systems that represent 100.3 as 100,3.)

6. Type names(AutoBI.1) and click Submit.

7. Type str(AutoBI.1) and click Submit.

8. Type head(AutoBI.1) and click Submit.

9. Type tail(AutoBI.1) and click Submit.

10. Type AutoBI.1[1:10, ] and click Submit. Compare this output to the command head.

11. Type AutoBI.1[ ,2:3 ] and click Submit. How does this command differ from the one above?

12. Type attributes(AutoBI.1) and click Submit. See how it contains the names, the class (here a
data frame), and the row numbers. This command is helpful to summarize the R object.

13. Type summary(AutoBI.1) and click Submit. Summary is a useful command that provides dif-
ferent output for different R objects. Here, on a data frame, the summary command shows the
descriptive statistical summary of the data.

14. Compute a new variable [Link] with CLMINSUR using Data > Manage variables > Compute
new variable. Function is: log. (Note: Check out R functions link in R Resource section on
Course Website.) View the data set again to see that the new variable is added to the data set
and computed correctly.

15. Go to Data > Active Data Set > Variables in active data set. This produces the function names(),
which lists the variable names. It is a powerful function that will come in handy when we get to
regression modeling.

16. Go to Data > Active Data Set and save active data set. Note extension of file. (This is what you
would use for the Data > Load function. This saves your data, including the transformations, in
a data set that can be restored.)

17. Go to Data > Active Data Set and export active data set. This writes the data to a .csv file so
that you can use it in EXCEL, for example.

18. Go to File > save script as and save script. This lets you keep your code for future times.

19. Go to File > save output as. This gives you the output in a text file. (Use courier font for tables
and output so that it lines up better.)

20. Go to File > save workspace as. This is similar to saving the active data set, but saves everything
in the workspace. So if you have created multiple datasets, then all are saved within this one file.
Whereas in (16), the command only saves the active data set.

21. Now clear all of the working areas: click in the script, then right click to clear the contents. Do
the same for the output and message windows.

22. Type rm(list=ls()). This clears all of the objects from the workspace. Click on the Active Dataset
box and everything disappears. You can see this by typing ls() and you’ll get character(0).

23. Now bring in one of the .RData files using Data > Load. See that all your data are restored.
Now type ls() and see that it lists the name(s) of the datasets in that file.

6 Simulation and Regression


This exercise will introduce you to the basics of simulation with an application in basic regression. We
will establish the ‘true’ regression parameters and simulate some data. We will then use the data to
estimate the regression parameters and compare the estimated line with the ‘true’ line. With
simulation, you can reproduce the study quickly (just by re-submitting the code) and see how the
results may change. This is an example of sampling variability.

1. The code that I used in R is attached if you want to use it. At this stage, it would be beneficial
for you to start paying attention to the R code so that you can modify it for use in the future.
Using R directly becomes faster, provides more options, and also does not freeze up the system.
Please remember to save your R scripts as you are working just in case your system freezes.

2. Start R and Rcmdr. The Rcmdr package should still be on your hard drive. Try library(Rcmdr)
first to see, but if you need to re-install Rcmdr, use: install.packages("Rcmdr", dep=TRUE). If
other packages are needed, please add them. To bring Rcmdr into R, type library(Rcmdr) or go
to the R menu Packages > Load Packages and click on Rcmdr.

3. Go to File > Change working directory to select where you want the data to be placed and
where you want R to look. Remember that you can also change the ‘Start in’ box in the R-icon
properties. Also, the link on the course website for instructions for installing Rcmdr has some
other set-up tips.

4. You can type code in the Script Window as well as working from the menus. You can place your
cursor at the end of the line and click the Submit button.

5. I find typing in the code for simulation easier. (For those who are interested, you can see how
Rcmdr produces simulated outcomes. It is in the Distributions tab. We will be working with the
Gamma and Normal distributions here, but notice the set of distributions available. Rcmdr puts
the data in a matrix, which I find harder to work with.) We will simulate some xi's from a Gamma
distribution and some εi's from a Normal distribution. Recall that in regression, the xi's are assumed
to be fixed, i.e., given. The Gamma distribution is just a way to get some data. Type the following
(along with Submit) and remember that R/Rcmdr is CASE sensitive and space sensitive (i.e.,
cares about spacing).

(a) num <- 25

(b) beta.0 <- 12500
(c) beta.1 <- 50
(d) sigma <- 2000 (This is our population value of σ for the standard deviation of the error
term.)
(e) x <- rgamma(num, shape = 50, scale = 15)
(f) epsilon <- rnorm(num, mean = 0, sd = sigma) (Note: this is the ε in our linear model.)

The num sets the number of random observations; β0, β1, and σ set the values of the ‘true’
parameters; x and ε are vectors of random variables.

6. Find the sample mean and sample standard deviation of x and . Type mean(x) and sd(x) and
mean(epsilon) and sd(epsilon). How do these numbers compare with truth?
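For item 6, the population values to compare against follow from the distributions we sampled: the Gamma(shape, scale) mean is shape × scale and its variance is shape × scale². A quick check:

```r
# Population moments implied by the choices above:
# Gamma mean = shape * scale, Gamma variance = shape * scale^2
shape <- 50; scale <- 15
mean.x <- shape * scale         # population mean of x: 750
sd.x   <- sqrt(shape) * scale   # population sd of x: about 106.07
# epsilon ~ Normal(0, 2000), so mean(epsilon) should be near 0 and
# sd(epsilon) near 2000, up to sampling error with only 25 draws.
```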

7. Draw histograms of x and ε. You can find histograms in the Rcmdr menu Graphs or use the R
command hist(xxx), where xxx is the variable you would like to graph. What do you think of
these? Are they what you expected?
8. Create the truth vector: truth.y <- beta.0 + beta.1 * x. (The name truth.y is one choice; any
valid object name works.) This is the population version. We do not know the truth in practice,
but here we do because we are making things up.
9. Now let us combine the values to create our observation vector y: y <- beta.0 + beta.1 * x +
epsilon. This command creates a vector y that is the same length as x and ε. We take our truth
and add a little error to it (which is what we observe).
10. Find the sample mean and standard deviation of y. Why does this not equal the sample standard
deviation of ε?
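One way to reason about item 10: since y = β0 + β1 x + ε with x and ε generated independently, Var(y) is approximately β1² Var(x) + Var(ε), so the sd of y exceeds the sd of ε. A quick self-contained sketch (the seed is my addition, purely for reproducibility):

```r
set.seed(1)  # hypothetical seed so the run is reproducible
num <- 25; beta.0 <- 12500; beta.1 <- 50; sigma <- 2000
x <- rgamma(num, shape = 50, scale = 15)
epsilon <- rnorm(num, mean = 0, sd = sigma)
y <- beta.0 + beta.1 * x + epsilon
sd(y)                                    # noticeably larger than sd(epsilon)
sqrt(beta.1^2 * var(x) + var(epsilon))   # close to sd(y) in this sample
```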
11. Suppose you want to view these observations. If you look at the active dataset, you cannot
view any of the data. Why? These variables have been created in a temporary workspace. You
can see that they exist by typing ls(). You can still view them using the head() command. Type
head(y), head(x), head(epsilon), head(truth.y). You can see that if you apply beta.0 and
beta.1 appropriately to each of the x then you match truth.y, and if you add ε, you match y.
12. Combine y and x into a data.frame to run a regression. (In this example, it is not necessary as
the variables are in the local workspace, but when you have variables that you will be importing,
they will all be in a data.frame.) Use the code: myDataSet <- data.frame(x, y). Note that these
data now show in Rcmdr if you click on the Dataset window.
13. To run a regression type: RegModel <- lm(y ~ x, data = myDataSet) and then type
summary(RegModel). The first command runs a regression and puts the results in an object called
RegModel. The second command shows a summary of the results. (If you run multiple regressions,
you can name them RegModel.1, RegModel.2, and so on.)
(a) A helpful command is: names().
(b) Type: names(RegModel)
(c) Type: names(summary(RegModel))
(d) The names() command provides you a list of the components that are available for your use in other
aspects of your analysis. For example, type: coef.fit <- coefficients(RegModel) and then
type coef.fit and click Submit. (The name coef.fit is just a choice of name for this vector.)
The coefficients command places the estimated values of your regression in a vector.
14. You can do this in Rcmdr under Statistics > Fit Models (and choose either the Linear Regression
or the Linear Models entry). The Linear Models entry is more flexible, as you can do transformations,
so I would recommend using that one. Note that Rcmdr automatically provides the summary
command. Note now that the regression model that you created has become the active Model
at the top right in Rcmdr. To use this model further in your analysis, make sure that it is the one
indicated there. (This is important when you create different models and all are in the
workspace. Only one can be active.)
15. Examine the results. How do your values for the intercept and slope terms compare to the truth?
Where does the estimate for σ come into play in the regression output?
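As a pointer for item 15: the estimate of σ appears as the "Residual standard error" in the summary output. The sketch below (with its own simulated data and a hypothetical seed) shows one way to extract it and verify the formula s = sqrt(SSE/(n − 2)):

```r
set.seed(2)  # hypothetical seed; your simulated numbers will differ
x <- rgamma(25, shape = 50, scale = 15)
y <- 12500 + 50 * x + rnorm(25, sd = 2000)
fit <- lm(y ~ x)
s <- summary(fit)$sigma                 # the "Residual standard error"
n <- length(residuals(fit))
s.byhand <- sqrt(sum(residuals(fit)^2) / (n - 2))  # same quantity by formula
```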
16. Now let us plot some of the results. You cannot do this in Rcmdr. Type the following code,
noting that # is a comment character in R; there is no need to type the comments in. They are
included as instructions to you.

plot(y ~ x, cex = 1.5, pch = 19, col = c("blue"),
     main = "Comparison of Data, Truth and Estimated Regression Line",
     xlab = "My label of x", ylab = "My label of y",
     data = myDataSet)            # click Submit for this block and see what it does

abline(a = beta.0, b = beta.1, lwd = 4, lty = 2)   # click Submit for this block and see what it does

abline(coef = coef.fit, col = "red", lwd = 3, lty = 1)   # click Submit for this block and see what it does

legend("topleft", c("Observed Data", "Truth", "Least Squares"),
       col = c("blue", "black", "red"), pch = c(19, 19, 19))   # click Submit for this block and see what it does

17. What does your graph say about Truth vs. Estimate?

18. We will learn (or have learned) about the ANOVA table. You can get this as:

(a) anova(RegModel) in R
(b) in Rcmdr under Models > Hypothesis Tests > ANOVA table, which issues the anova command for you.

7 Introduction to Scatterplots
This exercise will illustrate some of the features of the plot command in R and uses the data set
Sports.csv, located on the course website, which contains statistics for UW-Madison men's
football and basketball wins and winning percentages for the years 2002 - 2011.

1. Start R and set your working directory.

2. Import the Sports dataset into R using the read.csv command and name it Sports.

3. Sports is a small data set, so we can just list the entire data.frame by typing:
Sports

4. Note that there are 5 variables (all in upper case): YEAR, FB (# games won for football), BB
(# games won for basketball), WPFB (winning percentage for football), and WPBB (winning
percentage for basketball).

5. If we wanted to see a plot of number of wins by year for football, we could use the plot command:

plot(Sports$YEAR, Sports$FB)

This code plots YEAR on the x-axis and number of games won during the football season on the
y-axis.

6. We could plot the same data with the command:

plot(FB~YEAR, data=Sports)

Here, this approach uses the "~" operator, similar to the linear regression command. Again,
YEAR is on the x-axis and the number of games won during the football season is on the y-axis. Note
that this command allows you to name the data.frame in the plot command rather than needing
to use the dataset.name$variable.name approach. Note that the labels for the variables differ in
the two plots because of the syntax of the code. (We will learn how to change the labels a little
further on in this tutorial.)

7. Notice there is no title and the graph looks pretty bland. Let’s spice it up a bit. We will keep
adding to our previous plot code inside the parentheses. The order of code within the plot
parentheses typically does not matter; it is not imperative to add the code in the exact order
shown here.

(a) Let’s add a title with the main parameter:


plot(FB~YEAR, main="Recent Badger Football Win Totals", data=Sports)

(b) Now let’s label the y-axis more effectively with ylab. (You can similarly change the x-axis
with xlab.)
plot(FB~YEAR, main="Recent Badger Football Win Totals",
ylab="Win Count", data=Sports)

(c) Let’s make the points solid, and change the color to red.
plot(FB~YEAR, main="Recent Badger Football Win Totals",
ylab="Win Count", pch=19, col="red", data=Sports)

(d) The pch command changes the type of point in the graph. Codes 15 to 20 are solid objects,
but typing ?pch will bring up a help page with more detail. The course website shows a
complete listing.
(e) You can also see the entire list of available colors by typing colors(). A neat variety of
colors can be found with the following commands: rainbow(n), heat.colors(n),
terrain.colors(n), topo.colors(n), and cm.colors(n), where n is the number of contiguous
colors.
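These palette functions simply return vectors of hex color codes, which you can inspect or preview with a quick barplot; a small sketch:

```r
pal <- rainbow(5)   # a vector of five hex color strings
pal
# preview the palette as five colored bars
barplot(rep(1, 5), col = pal, border = NA, main = "rainbow(5)")
```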
(f) The size of the points in the plot can be changed using the cex command.
plot(FB~YEAR, main="Recent Badger Football Win Totals", ylab="Win Count",
pch=19, col="red", cex=1.5, data=Sports)

(g) Since the data are plotted over time, instead of having just the points plotted, we might
want to connect the dots. We can use the type parameter as follows:
plot(FB~YEAR, main="Recent Badger Football Win Totals", ylab="Win Count",
pch=19, col="red", cex=1.5, type = "l", data=Sports)

The "l" parameter indicates a line, while the "o" parameter not only connects the dots but
includes the dots as well.
(h) If we also want to make the line thicker, we use the "lwd" command. The "lty" command
changes the type of line. You can check out the entire set of graphics parameters by typing
"?par". (You can see that "?" is a help command. You can also type help(par) to get help.)
plot(FB ~ YEAR, main="Recent Badger Football Win Totals",
ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, data = Sports)

(i) Let’s plot the number of wins by year for basketball. Here you can copy and paste the most
recent command to generate the football wins. The variable needs to be changed to BB,
and the title must be changed as well. You could also change the color of the points. Here’s
what we have so far:
plot(FB ~ YEAR, main="Recent Badger Football Win Totals",
ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball Win Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, data = Sports)

(j) In order to get multiple graphs within the same window, the following code should precede
the code for the two graphs:

par(mfrow=c(1,2), oma = c(0,0,2,0))

This command changes the print setting to allow multiple graphs in one window. The first
set of numbers means one row, two columns. If you wanted to make a 2x2 table to display
four graphs, for example, you would enter mfrow=c(2,2). The parameter oma stands for
outer margin area. The order of the numbers when adjusting the margins is: bottom, left, top,
right (it goes clockwise). By setting the third number to 2, we allow for a wider
top margin so that you can put a grand title over all the graphs. You can experiment with
changes to this setting on your own.
(k) Adding this par command prior to the two plot commands produces graphs side-by-side.
Notice that the y-axis scales differ. Use the ylim parameter (or xlim for the x-axis) to change
the scaling. The code below changes the y-axis to go from 5 to 35.
par(mfrow=c(1,2), oma = c(0,0,2,0))

plot(FB ~ YEAR, main="Recent Badger Football Win Totals",


ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball Win Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

(l) The reason to use the oma parameter is to insert a grand title (or other margin text). The
mtext command can be combined with lots of other options in order to do this.
I show my love for the Badgers in the bottom right corner of the graph.
mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)
Explore ?mtext for other choices. The first portion of the code is the text that you want
included. The side parameter tells R which cardinal direction to place the text in. Recall the
sides are numbered from 1 to 4 beginning with the bottom and moving clockwise (2 would be left).
The adj command moves the text left or right, where 0 corresponds to the left and 1 to the right.
Finally, the line portion controls how far the text is from the plot zone. Setting
line=0 places the text very close to the plot, and larger values move the text farther away.
Font can be bolded or italicized. The code 1 corresponds to regular, 2 is bold, 3 is italic,
and 4 is both bold and italic. With font = 2, we set the text as bold.
(m) With the mtext parameter, we can put another phrase in the top left, very close to the plot
area. Note how the adjustments in code change the position of the text.
mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)
So the total code to have now is:
(Footnote: You may also note that the comparison is not appropriate, as there are more games played during a basketball season
than a football season. The winning percentages are also included in this data set, so you can compare those graphs on
your own.)

par(mfrow=c(1,2), oma = c(0,0,2,0))

plot(FB ~ YEAR, main="Recent Badger Football Win Totals",


ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball Win Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)

mtext("We Love R", outer=TRUE, cex = 1.5, col="olivedrab")

Note that the two GO BADGERS statements go with just the second graph, while the
margin text for We Love R is on top of both graphs.
The last mtext statement places the text outside of both graphs because the outer parameter
is TRUE. Notice, too, that here the cex parameter increases the size of the font, whereas
before it changed the size of the plotting character (pch).
(n) You can also play with the way the titles are displayed. The titles may overlap. We can
either shorten them or use another approach: typing \n in a string of text begins a new
line. Let's change our title code in this manner:
par(mfrow=c(1,2), oma = c(0,0,2,0))

plot(FB ~ YEAR, main="Recent Badger Football \nWin Totals",


ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball \nWin Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)
mtext("We Love R", outer=TRUE, cex = 1.5, col="olivedrab")
(o) You can write your plots directly to a pdf file. You can set your working directory or enter
the path where you want the file to be saved. The following command saves the file to the
d-drive in a file named [Link].
pdf(file='d:/[Link]')
plot(FB ~ YEAR, main="Recent Badger Football \nWin Totals",
ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball \nWin Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,

type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

par(mfrow=c(1,2), oma = c(0,0,2,0))

plot(FB ~ YEAR, main="Recent Badger Football Win Totals",


ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball Win Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)
mtext("We Love R", outer=TRUE, cex = 1.5, col="olivedrab")

dev.off()
The dev.off() command closes the device so that the file is closed and the output can be
reviewed.
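A minimal self-contained template for the pdf-device pattern (tempfile() is used here so no particular drive or file name is assumed):

```r
# Open a pdf device, draw a plot, then close the device.
f <- tempfile(fileext = ".pdf")  # illustrative path; use your own file name
pdf(file = f)
plot(1:10, (1:10)^2, type = "o", main = "Minimal pdf example")
dev.off()        # close the device so the file is complete
file.exists(f)   # TRUE once the device is closed
```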
(p) Now compare the output just generated with that generated here:
pdf(file='d:/[Link]', width=10, height=6)
plot(FB ~ YEAR, main="Recent Badger Football \nWin Totals",
ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball \nWin Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

par(mfrow=c(1,2), oma = c(0,0,0,0))

plot(FB ~ YEAR, main="Recent Badger Football Win Totals",


ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball Win Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)
mtext("We Love R", outer=TRUE, cex = 1.5, col="olivedrab")

dev.off()
Notice that we have changed the relative printing dimensions of the window. You can play
with the settings to determine what is best for your current output. Remember that you
can create multiple pdfs with different print settings within one R session.

Final code:

pdf(file='d:/[Link]')
plot(FB ~ YEAR, main="Recent Badger Football \nWin Totals",
ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball \nWin Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

par(mfrow=c(1,2), oma = c(0,0,0,0))

plot(FB ~ YEAR, main="Recent Badger Football Win Totals",


ylab = "Win Count", pch = 19, col="red", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

plot(BB ~ YEAR, main="Recent Badger Basketball Win Totals",


ylab = "Win Count", pch = 19, col="purple", cex = 1.5,
type = "o", lwd = 2, lty = 2, ylim = c(5,35), data = Sports)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)
mtext("We Love R", outer=TRUE, cex = 1.5, col="olivedrab")

dev.off()

Thanks to Andy Korger for his design of this tutorial.



8 A Little More on Scatterplots: Using Matrix Plots


This exercise will reinforce many of the features of the plot command in R that we introduced in the
Introduction to Scatterplots, and again uses the data set Sports.csv, located on the course website,
containing statistics for UW-Madison men's football and basketball wins and winning percentages for
the years 2002 - 2011.
In this tutorial we will graph both football and basketball in the same plot, which is more
complicated than making two separate plots, using what is known as a matrix plot.

1. Start R and set your working directory.

2. Import the Sports dataset into R using the read.csv command and name it Sports.

3. We can create a simple matrix plot with the command:

matplot(Sports$YEAR, Sports[ , c("BB","FB")])

To recognize a variable, R often requires it in the format above, which lists the dataset followed
by a dollar sign before the variable name (Sports$YEAR). The second part of the code combines
the football and basketball totals in one plot.

Note that the notation Sports[ , c("BB","FB")] indicates that the data.frame Sports is treated as an
array. The observations of the data.frame are the rows and the variables are the columns. The
blank before the first comma indicates that all rows of Sports are included. The
entry after that comma indicates that only the columns for BB and FB are included.

Note that the basketball wins (since BB is listed first in the code) are labeled 1 and the football
wins are labeled 2.
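The row/column indexing can be seen with a tiny made-up data.frame; the numbers below are illustrative, not the real Sports data:

```r
# df[rows, cols]: an empty rows slot means "all rows" (values are made up)
df <- data.frame(YEAR = 2002:2004, FB = c(8, 7, 9), BB = c(24, 25, 19))
df[ , c("BB", "FB")]   # all rows, just the BB and FB columns (in that order)
df[1, ]                # the first row, all columns
df[df$FB > 7, ]        # rows can also be selected with a logical condition
```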

4. We can label things better too. Let us add the code that was covered in the previous tutorial.

matplot(Sports$YEAR, Sports[ , c("BB","FB")], xlab="Year", ylab="Win Total",


main="Recent Win Totals of Football & Basketball Teams")

5. Instead of having the points represented by numbers, let us make them both solid circles and
color code them. Add the third line of the following to the plot code:

matplot(Sports$YEAR, Sports[, c("BB","FB")], xlab="Year", ylab="Win Total",


main="Recent Win Totals of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

This is how to include multiple variables in a plot. The "c" in c("BB","FB") stands for combine.
Basketball was listed first, so we now have the basketball points in orange and the football points
in brown. We could change the point size in the same way with the cex command, as you may
remember from the first tutorial.

6. The plot is not self-explanatory to an outsider so we will add a legend. The legend code is added
outside of the plot code in a separate line below.

matplot(Sports$YEAR, Sports[, c("BB","FB")], xlab="Year", ylab="Win Total",


main="Recent Win Totals of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

The legend can be placed in any corner or spot. Make sure that the colors and
point types inside the legend match those in the plot. Here the "topleft" parameter places the
legend at the top left of the plot.

7. As we noted in the first plot tutorial, it would be better to compare the quality of a football
season to a basketball season using winning percentage rather than win counts, so we will
switch to that metric. We can keep almost the same code, but switch the
variables to WPBB and WPFB, and change the y-axis label and title appropriately as well.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], xlab="Year", ylab="Win Percentage",


main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

8. The graph may be more difficult to interpret because the colored points are scattered all over the
place. We can add lines to the plot to illustrate the year-to-year differences. Add the following
inside the matplot code:

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", xlab="Year",


ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

There are a number of types that you could use. Type "p" graphs points, "l" graphs lines, and
"b" stands for both lines and points. Note the subtle difference using "o". You can look at all
the types by using ?plot in R.
You can also select the line type with the "lty" command. I used lty = c(1,1) in the following
examples. Note that since there are two series, we need two codes. The value lty = 1 means a solid line.

9. To improve the appearance of the graph, we can change the scaling of the y-axis. Let us change
it to have twice as many tick marks.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1), xlab="Year",


ylab="Win Percentage", main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

(Footnote: The extra line spacing is not necessary but is included here to improve readability.)

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(0.5,1,by=.05)


ylabels <- seq(0.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels)

We create two new vectors, yaxis and ylabels, that are sequences from 0.5 to 1.0, incremented
by 0.05. The parameters in the seq function refer to (min, max, tick interval). With this
code, we are setting the y-axis to include tick marks at every 0.05 on the interval (0.5, 1). (Note:
we could have just used one vector, as the sequences are identical. For clarity we create a second
vector.)
The axis command uses 2 for the y-axis (1 is used for the x-axis). The parameter "at" indicates
where the tick marks are set (here as defined by the vector yaxis) and "labels" indicates how the ticks are
labeled (here as defined by the vector ylabels). We can add more ticks to the x-axis in the same
manner, but let's leave them unlabeled this time.
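As a quick check of what seq produces here:

```r
yaxis <- seq(0.5, 1, by = 0.05)
length(yaxis)   # 11 tick positions
range(yaxis)    # runs from 0.5 up to 1
```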

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1),


xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)
10. Let's say we want to look at only seasons where the teams won more than 3/4 of their games. We
can adjust the zoom of a graph using xlim or ylim.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1),


xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
ylim=c(.75,1), pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)

This code simply sets the minima and maxima for an axis. Note that because of these limits
some of the points are not plotted, so we need to be careful when we set the limits. Barry
Alvarez's final year as football coach was 2005, so let's adjust the x-axis as well to examine the
Bret Bielema era.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1),


xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
xlim=c(2006,2011), ylim=c(.75,1), pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)

11. Now let’s learn how to increase the width of a line and also return to the original zoom level.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1),


lwd=c(1,5), xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)

12. The font can be bolded or italicized. A value of 1 corresponds to regular, 2 is bold, 3 is italic,
and 4 is both bold and italic. Let's apply both bold and italics to our axes.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1), lwd=c(1,5),


font=4, xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
(Footnote: You could find the minimum and maximum of the data to set the limits.)

pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)

After looking at the graph, you may notice that the font change only affected the labeling on
the standard axes. If you want the font change to apply to other aspects of the graph, you must
insert it into those respective areas as well. For instance, you could apply font = 4 to the
axis(2) code to apply that change everywhere on the y-axis.
Text can also be inserted in the graph's margins. The mtext command can be combined with
other options. Let's add an acknowledgement of the Badgers in the bottom right corner of our graph.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1), lwd=c(1,5),


font=4, xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels, font=4)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)

Recall that you can use ?mtext to find out which choices are available to you when using margin
text. The first portion is the text that you want included. The parameter "side" is where to
place the text. As noted in the Introduction to Scatterplots, the sides are numbered from 1
to 4 beginning with the bottom and moving clockwise (2 would be left).
The "adj" command moves the text left or right: 0 corresponds to the left and 1 to the
right. Finally, the "line" portion controls how far the text is from the plot zone. Setting
line = 0 places the text very close to the plot, and larger values move the text farther
away.

To illustrate, I will put another phrase in the top left, very close to the plot area. Note how
the adjustments in code change the position of the text. With the plot opened, if you type the
following, the text will be added to the plot.

mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)

13. You may have noticed the y-axis text is not parallel to the bottom of the screen, making it
difficult to read. To remedy this problem, we can use the "las" command to rotate the text.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1), lwd=c(1,5),


font=4, las=1, xlab="Year", ylab="Win Percentage",
main="Recent Winning Percentage of Football & Basketball Teams",
pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels,font=4)

xaxis <- seq(2002,2011,by=1)


axis(1,at=xaxis)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)

Unfortunately, this does not completely produce the result we desire. The text rotates to the
proper angle only at the default tick locations. Adding the same argument to our custom
y-axis code remedies the issue.

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1), lwd=c(1,5),
    font=4, las=1, xlab="Year", ylab="Win Percentage",
    main="Recent Winning Percentage of Football & Basketball Teams",
    pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels,font=4,las=1) # change in this line

xaxis <- seq(2002,2011,by=1)


axis(1,at=xaxis)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)

That's it for this tutorial. Thanks for your participation.



Final code:

matplot(Sports$YEAR, Sports[, c("WPBB","WPFB")], type="b", lty=c(1,1),
    lwd=c(1,5), font=4, las=1, xlab="Year", ylab="Win Percentage",
    main="Recent Winning Percentage of Football & Basketball Teams",
    pch=c(19,19), col=c("orange","brown"))

legend("topleft",c("Basketball","Football"), col=c("orange","brown"), pch=c(19,19))

yaxis <- seq(.5,1,by=.05)


ylabels <- seq(.5,1,by=.05)

axis(2,at=yaxis,labels=ylabels,font=4,las=1)
xaxis <- seq(2002,2011,by=1)

axis(1,at=xaxis)

mtext("GO BADGERS", side=1, adj=1, line=2, col="red", font=2)


mtext("GO BADGERS", side=3, adj=0, line=0, col="red", font=2)

Thanks to Andy Korger for his design of this tutorial.



9 Subsetting, Summaries and Graphing


This exercise will illustrate some of the items under the GRAPH menu.

1. Download [Link] from regression book website.

2. Start R and Rcmdr. If you need to re-install Rcmdr, recall from the email that you can use the
command: install.packages("Rcmdr", dep=TRUE). This installs Rcmdr, together with all of the
other packages that Rcmdr requires. However, note whether other packages need to be loaded and add
them. To bring Rcmdr into R, type library(Rcmdr) or go to the R menu Packages > Load
Packages and click on Rcmdr.

3. Recall that in R/Rcmdr everything is CASE Sensitive.

4. Go to File > Change working directory to select where you want the data to be placed and
where you want R to look. Remember that you can also change the ‘Start in’ box in the R-icon
properties. Also, the link on the course website for instructions for installing Rcmdr has some
other set-up tips.

5. Import [Link], put a clever name like MEPS for the Dataset, click on "comma" in
the "Field Separator" and then click ok.

6. Type str(MEPS) and click Submit. Notice now that there are variables called Factors that are
discrete variables (rather than continuous). We will cover Factors in Chapter 4.

7. Type head(MEPS) and click Submit.

8. Go to Data > Active Data Set > Subset active data set. Include all of the variables (i.e. keep the
box checked). In the subset expression box, type COUNTIP > 0. Put another name for the new
data set, like [Link] and click ok. Note that the active data set (at the top left) has changed
to this new data set.

9. The full dataset is still available. Click the Data set box and click on MEPS to restore it as the
active dataset.

10. Under Statistics > Summaries > Numerical summaries and choose COUNTIP and EXPENDIP
as the variables. (Note you have to push the CNTL key to select multiple variables.)

11. Now, choose [Link] as your active dataset and re-do the numerical summaries for those two
variables. Why such a difference?
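Equivalently, the subsetting and summary steps above can be done with typed R commands rather than through the Rcmdr menus. Here is a minimal sketch with made-up numbers standing in for the real MEPS file (the data values and the subset name MEPS.ip are illustrative only):

```r
# Toy stand-in for the imported MEPS data: inpatient counts and expenditures.
MEPS <- data.frame(COUNTIP  = c(0, 0, 2, 0, 1, 0, 3, 0),
                   EXPENDIP = c(0, 0, 5200, 0, 800, 0, 12400, 0))

# Keep only observations with at least one inpatient stay,
# mirroring the subset expression COUNTIP > 0 used in Rcmdr.
MEPS.ip <- subset(MEPS, COUNTIP > 0)

# Compare the numerical summaries: the subset drops every zero row,
# so its means are much larger than those of the full data.
summary(MEPS[, c("COUNTIP", "EXPENDIP")])
summary(MEPS.ip[, c("COUNTIP", "EXPENDIP")])

nrow(MEPS.ip)   # 3 rows remain in this toy example
```

The large difference in summaries arises for the same reason as in the tutorial data: conditioning on positive counts removes the mass of zeros.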

12. Now let us practice some plots.

(a) With [Link], choose Graphs > Boxplot and choose EXPENDIP. Click the box to identify
outliers with a mouse. Close the dialog box when it appears and click on the outlier. Note
that it shows you the observation number of that point. In the top left menu, click stop
locator when done. You can save the graph under File in the graphics window or in Rcmdr
in the Graphs > Save Graph to File.
(b) Let us compare the Boxplot for EXPENDIP for the full dataset. Reselect MEPS as the
active dataset and choose Graphs > Boxplot and choose EXPENDIP. Click the box to
identify outliers with a mouse. Close the dialog box when it appears and click on the
outlier. Note that it shows you the observation number of that point. In the top left menu,
click stop locator when done. You can save the graph under File in the graphics window or
in Rcmdr in the Graphs > Save Graph to File.
(c) It is hard to view these two graphs side by side to compare. Let us do this a different way.

Type the command:


par(mfrow=c(1,2), oma = c(0,0,2,0))

Note: oma = outer margin area.


This changes the print setting to allow 2 graphs in a window.

(d) Now re-do the two boxplots and see how they are in one R graphics window.
(e) If you want to complete the graph, add the following command:

mtext("Compare two Graphs", outer=TRUE, cex = 1.5)

This will put some margin text above the two graphs. See the Help with R Graphs document
on the course website. Also, note you can re-size the graphs by clicking on a R graphics
window and dragging the window.
(f) You can return to 1 plot per graphics window by changing the mfrow to mfrow=c(1,1) in
the par setting, i.e. just re-submit that code.
(g) You can save the graphs in two ways: one in the File menu in the R graphics window (this
seems to hang up Rcmdr for me) or the other in Rcmdr in the Graphs > Save Graph as
tab. See if the first way works for you. I have found depending which computer I am using,
that Rcmdr does some funny things (like lock-up). This is another reason why I recommend
weaning ourselves off of it and using R (through Tinn-R or R-Studio).
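For those working outside Rcmdr, the same side-by-side layout can be produced entirely with typed commands. A minimal sketch with made-up expenditure values (the numbers are placeholders, not the real MEPS data):

```r
# Toy stand-ins for the full data and the COUNTIP > 0 subset.
expend.all <- c(0, 0, 800, 0, 5200, 0, 12400, 150, 0, 60)
expend.ip  <- expend.all[expend.all > 0]

par(mfrow = c(1, 2), oma = c(0, 0, 2, 0))  # 2 plots per window, outer margin on top

boxplot(expend.ip,  main = "COUNTIP > 0")  # subset on the left
boxplot(expend.all, main = "Full data")    # full data on the right

mtext("Compare two Graphs", outer = TRUE, cex = 1.5)  # title in the outer margin

par(mfrow = c(1, 1))  # back to 1 plot per window
```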

13. Save your script file (maybe email to yourself) so that you can see what commands that you used
and can re-create in the future.

10 Simulation, Regression, and Confidence/Prediction Intervals


This exercise will use simulation with an application in basic regression and introduce you to the
calculation and graphing of confidence and prediction intervals. As before, we will establish the ‘true’
regression parameters and simulate some data. We will use the data then to estimate the regression
parameters and then compare the estimated line with the ‘true’ line.
Note: There is a plug-in in Rcmdr called [Link]. You will need to install it on your
computer as you did Rcmdr (i.e. from a CRAN mirror). In theory, you can load Rcmdr and then within
Rcmdr under Tools, you could load the Plugin. My computers have been crashing so I worked out this
example for you in R. (You could try the following two commands simultaneously: library(Rcmdr) and
library([Link]) – that may work.) Under the Model menu some items related to prediction
intervals come up. You can explore those on your own. I wanted to provide you something here that
would work independently of Rcmdr.

1. The code that I used in R is attached if you want to use it. The example is modified from one
located on the web at: [Link]
Please remember to save your R scripts as you are working just in case your system freezes.

2. Start R.

3. Remember that R is case-sensitive and space-sensitive.

4. We will simulate some xi ’s from a Normal(0, 1) distribution and some error terms εi ’s from a
Normal(0, 1) distribution. Recall that in Regression, the xi ’s are assumed as fixed, i.e. given.

(a) x <- rnorm(15)

(b) y <- x + rnorm(15)
(c) xy <- data.frame(x,y)

Note here we left off the mean and sd for the Normal deviates. The default in R is that these
values are 0 and 1 respectively.

5. View the data by typing xy (the name of our dataset). Type summary(xy) and sapply(xy, sd)
(sd no longer accepts a whole data frame) to find the summary statistics of x and y. These will
come in handy if you wanted to verify some of the results by hand.

6. Run a regression using: RegModel <- lm(y ~ x) and then type summary(RegModel). Note the
results. Recall the ‘true’ model does NOT contain an intercept term, but this one has it. Is the
intercept term significant? We will re-address the intercept term later in this exercise.

7. One nice command is model.matrix. You can view the data used to estimate the model by typing
model.matrix(RegModel). Note there is a column of 1’s and then your data. The column of 1’s
shows that there is an intercept term estimated in the model. We will see in chapter 3, that all
x’s in the model will be included in the model.matrix (that is, it is linked to the x’s that you
actually include in the model, not all those that are in the data frame).

8. Built-in to R is a predict function. If you type: predict(RegModel), the fitted values will be
shown. You can also get at this by typing: fitted.values(RegModel). This is an example of
multiple ways of doing the same thing. And they get the same answer too!

9. For this example, we know that the range of a N (0, 1) variable (i.e. x here) can essentially be
captured between (−3, 3). (Really you would need to go to (−5, 5), but the chance of that happening
is really small. If you feel better you can change the example and do these more extreme limits.) To do
this, we create a new dataset, cleverly called ‘new’, using the sequence command. Type: new <-
data.frame(x = seq(-3, 3, 0.5)). Then type new. See how x varies from -3 to 3 by increments of
0.5. We use these values of x to find fitted values on the line and then find the associated values
on the confidence and prediction intervals.
10. Type: predict(RegModel, new, se.fit = TRUE). This shows you the fitted values (and their
standard errors) for all of the x’s in the data frame new. Note that these fitted values depend on
the model estimated, so it does include an intercept term. Of course, you could check the
calculations by hand just to verify.
11. To obtain the confidence intervals (for the expected value of the line itself at a particular value
of x) and the prediction intervals (for a new observation), type:

mylevel <- 0.95 # provides a flexible way of changing the confidence level
conf.int <- predict(RegModel, new, interval="confidence", level = mylevel)
pred.int <- predict(RegModel, new, interval="prediction", level = mylevel)

The first command finds the values associated with the confidence interval for each of the x’s in your
data new (i.e. the x’s from -3 to 3). You can see the data by typing conf.int or pred.int.
Note that these data sets include the fitted value and then the lower and upper values.
12. Now let us do the graph.

mytitle <- paste("Comparison of Fitted Values, Confidence Interval,
    & Prediction Interval at ", mylevel*100, "% Significance", sep = "")

matplot(new$x, cbind(conf.int, pred.int[,-1]),
    lty=c(1,2,2,3,3), type="l", xlab="x", ylab="Values of y", main=mytitle)

legend("topleft",c("Fitted Value","Lower Confidence Interval",
    "Upper Confidence Interval", "Lower Prediction Interval",
    "Upper Prediction Interval"),
    col=palette(), pch = c(19, 19, 19, 19, 19))

The line with mytitle provides a way to include the mylevel dynamically in the title. So that
when you change the mylevel from 0.95 to 0.90 and re-run to get new values, you do not need to
remember to change the title in the graph.
The graph in the original example on the web did not include a legend (one is included here).
If you print conf.int and pred.int, you see that both contain fitted values.
The cbind function combines the columns. But to graph, we only need one column for the fitted
values. So the ‘-1’ index in pred.int[,-1] says to include all columns BUT the first one.
(Remember matrix notation is (rows, columns). Same is true here within the [ ].)
The order of the colors follows the palette function: type palette() to see the default order. R
uses the colors in order and, if it needs more, it just repeats. Since the fitted-value column is listed
first, it is drawn first. Check out the order and see that it makes sense to you.
Including the legend helps so that the reader does not need to think so hard.

13. We can modify the graph to help make it clearer and simpler for the reader:

matplot(new$x, cbind(conf.int, pred.int[,-1]),
    col=c("black","red","red","blue","blue"), lwd=c(2,1,1,1.5,1.5),
    lty=c(1,2,2,3,3), type="l", xlab="Values of x", ylab="Values of y",
    main = mytitle)

legend("topleft",c("Fitted Value","Confidence Interval",


"Prediction Interval"),
col=c("black","red","blue"), pch = c(19, 19, 19))

Now it is clearer which is the confidence interval and which is the prediction interval. And the
pictures are what we talked about in class. You can check the values using the formulas provided.

14. Finally let us see what happens to these lines when we constrain the model and not have an
intercept term. You can copy your previous code and make some changes (but be careful) or
type the code below. Note that the Regression Model statement has a ‘-1’ that indicates NO
INTERCEPT. And the title has changed so as to be indicated appropriately in the graph. Here
I am just showing the simpler of the matplots.

## Model without intercept term


RegModel.o <- lm(y ~ x - 1)

model.matrix(RegModel.o)
summary(RegModel.o)
names(RegModel.o)
predict(RegModel.o)
fitted.values(RegModel.o) ## same as predict

predict(RegModel.o, new, se.fit = TRUE)

conf.int.o <- predict(RegModel.o, new, interval="confidence", level = mylevel)
pred.int.o <- predict(RegModel.o, new, interval="prediction", level = mylevel)

mytitle1 <- paste("Comparison of Fitted Values, Confidence Interval,
    & Prediction Interval with NO Intercept Term at ", mylevel*100,
    "% Significance", sep = "")

matplot(new$x, cbind(conf.int.o, pred.int.o[,-1]),
    col=c("black","red","red","blue","blue"), lwd=c(2,1,1,1.5,1.5),
    lty=c(1,2,2,3,3), type="l", xlab="Values of x", ylab="Values of y",
    main=mytitle1)
legend("topleft",c("Fitted Value","Confidence Interval",
    "Prediction Interval"),
    col=c("black","red","blue"), pch = c(19, 19, 19))

Entire R code

# Modified Example from


# [Link]

## Simulated Example
x <- rnorm(15)
y <- x + rnorm(15)
xy <- data.frame(x,y)

summary(xy)
sapply(xy, sd)  # sd() no longer accepts a whole data frame

## Model with intercept term
RegModel <- lm(y ~ x)

model.matrix(RegModel)
summary(RegModel)
names(RegModel)
predict(RegModel)
fitted.values(RegModel) ## same as predict

new <- data.frame(x = seq(-3, 3, 0.5))
predict(RegModel, new, se.fit = TRUE)

mylevel <- 0.95
conf.int <- predict(RegModel, new, interval="confidence", level = mylevel)
pred.int <- predict(RegModel, new, interval="prediction", level = mylevel)

mytitle <- paste("Comparison of Fitted Values, Confidence Interval,
    & Prediction Interval at ", mylevel*100, "% Significance", sep = "")

matplot(new$x, cbind(conf.int, pred.int[,-1]),
    lty=c(1,2,2,3,3), type="l", xlab="x", ylab="Values of y", main=mytitle)
legend("topleft",c("Fitted Value","Lower Confidence Interval",
    "Upper Confidence Interval", "Lower Prediction Interval",
    "Upper Prediction Interval"),
    col=palette(), pch = c(19, 19, 19, 19, 19))

matplot(new$x, cbind(conf.int, pred.int[,-1]),
    col=c("black","red","red","blue","blue"), lwd=c(2,1,1,1.5,1.5),
    lty=c(1,2,2,3,3), type="l", xlab="Values of x", ylab="Values of y",
    main = mytitle)

legend("topleft",c("Fitted Value","Confidence Interval",
    "Prediction Interval"),
    col=c("black","red","blue"), pch = c(19, 19, 19))

## Model without intercept term
RegModel.o <- lm(y ~ x - 1)

model.matrix(RegModel.o)
summary(RegModel.o)
names(RegModel.o)
predict(RegModel.o)
fitted.values(RegModel.o) ## same as predict

predict(RegModel.o, new, se.fit = TRUE)

conf.int.o <- predict(RegModel.o, new, interval="confidence", level = mylevel)
pred.int.o <- predict(RegModel.o, new, interval="prediction", level = mylevel)

mytitle1 <- paste("Comparison of Fitted Values, Confidence Interval,
    & Prediction Interval with NO Intercept Term at ", mylevel*100,
    "% Significance", sep = "")

matplot(new$x, cbind(conf.int.o, pred.int.o[,-1]),
    col=c("black","red","red","blue","blue"), lwd=c(2,1,1,1.5,1.5),
    lty=c(1,2,2,3,3), type="l", xlab="Values of x", ylab="Values of y",
    main=mytitle1)
legend("topleft",c("Fitted Value","Confidence Interval",
    "Prediction Interval"),
    col=c("black","red","blue"), pch = c(19, 19, 19))

11 Introduction to Regression with Categorical Explanatory Variables, Part I
This exercise will illustrate how to create binary (or indicator or dummy) variables. Finally, the
impact on a regression analysis of including an interaction term of a binary and continuous explanatory
variables will be demonstrated.

1. Download [Link] from our course website. This is a random subset of 1000 observations
from the Medical Expenditure Panel Study ([Link]). It is a more recent year (2009)
and has more variables than the health expenditures data set on the regression
book website. There is a codebook and documentation that explains the data in more depth.

2. Read in the data and let us keep a smaller subset of variables:

rm(list=ls())
library(psych)

setwd("D:/ActSci654/Spring2013") # set your working directory

MEPS <- read.csv("[Link]", header = TRUE)

keep <- c("panel","AGE09X","sex","racex","educyr","BMINDX53",
    "RXEXP09","RXTOT09","angidx","arthdx","asthdx",
    "cancerdx","chddx","choldx","diabdx","emphdx")

dat <- MEPS[,keep]
str(dat)

3. Our variable of interest is RXEXP09, drug expenditures in 2009.

4. Before we start on the categorical variables, let us first work on the missing data.

NAs <- (dat < 0)


dat[NAs] <- NA

In the MEPS dataset, the codebook shows negative numbers for missing data. For example, if
someone refused to answer a question or if the question does not apply, then some negative code
would be entered. We want to set all of these to NA.
The first line creates a new matrix, NAs, with dimension 1000 rows by 16 columns: a logical matrix
of TRUE and FALSE entries indicating where an entry in the dataset is less than 0. To see the
dimension you can type dim(NAs), or to view the matrix, you can type NAs.
The next line recodes the data.frame dat, setting the (i, j)th entry to NA whenever the
corresponding entry of NAs is TRUE, i.e. a missing value.
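To see the recoding mechanics on something small, here is a sketch with a made-up data frame (the negative codes and variable names are illustrative, not the actual MEPS coding):

```r
# Toy data: negative values play the role of MEPS missing-data codes
# (e.g. -1 = refused, -9 = not ascertained).
toy <- data.frame(AGE = c(45, -1, 30), BMI = c(27.2, 31.0, -9))

NAs <- (toy < 0)   # logical matrix, TRUE wherever an entry is negative
toy[NAs] <- NA     # overwrite exactly those entries with NA

toy
sum(is.na(toy))    # 2 entries were recoded
```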

5. Another nice function is the sum function:



sum(is.na(dat$BMINDX53))

sum(dat$RXTOT09 == 0)

The first line counts the number of observations that are missing (the is.na function) a value
for BMINDX53, body mass index, a measure of obesity. The second line counts the number of
observations where the number of prescriptions in a year is zero. Remember there are 1000
observations in the data set, so many are missing BMINDX53 and many are zero counts.
Not that it is really relevant, but how would you count how many are missing BMINDX53 and
have zero counts? You could do this by typing:

sum(is.na(dat$BMINDX53) & dat$RXTOT09 == 0)

The & symbol operates as an AND operator.

6. One final clean-up item is creating a log-transform, but recoding the zero expenditure and count
variables for drugs to NA. We can do this with an ifelse statement, similar to the IF statement
in Excel.

dat$logRXTOT <- ifelse(dat$RXTOT09 > 0, log(dat$RXTOT09), NA)
dat$logRXEXP <- ifelse(dat$RXEXP09 > 0, log(dat$RXEXP09), NA)

7. As a simple introduction to categorical variables, we will start with the sex variable. Look at
your data.frame with the head(dat) command. Or do a summary(dat) to show values. Note that
the sex variable is coded as 1 or 2 (i.e. no NAs or other values). The categorical variable has two
levels: 1, 2.
If you type str(dat), you see that sex is shown as an integer variable. But which code represents
males? You would need to check the MEPS codebook. It is better to code the variable so that
you do not have to think each time.

(a) Because there are two levels for this variable, one way is to create binary variable.
(b) We could create one variable for Males and another variable for Females with:
dat$Male <- 1*(dat$sex == 1)
dat$Female <- 1*(dat$sex == 2)

Use a head(dat) statement to see what this code creates. Notice that wherever sex is coded
as a 1, then the Male variable shows a 1 and the Female variable shows a 0. Similarly when
sex is coded as a 2, then the Female variable shows a 1 and the Male variable shows a zero.
Create a new variable that is the sum of Male and Female and see what the value is across
all observations.

sexsum <- dat$Male + dat$Female


summary(sexsum)

Note that I did not add sexsum to the data.frame dat, so I did not need to put the dat$
before it.
Why are the values all ones?
(c) Now run 4 regressions. To make the results a little more interesting and to tie it to other
tutorials, let us add AGE09X (or age) to the model.
i. Blindly use the original sex variable.
Mod.1 <- lm(logRXEXP ~ AGE09X + sex, data = dat)
summary(Mod.1)
What is the underlying assumption for this model? How would you check to see if it is
correct?
One way would be to use plots. We have done a lot of scatterplots. Let us try that
where we use the plotting symbol as a way to distinguish Males and Females by Age.
(We will learn about the factor command in the next tutorial.)
dat$pch.f <- c(16,24)[factor(dat$sex)]
plot(logRXEXP ~ AGE09X, data = dat, pch=pch.f,
main="Log(Rx Expend) vs. Age", ylim = c(0,10))
legend("topleft", c("Male","Female"), pch = c(16,24))
Or instead of plotting by Age, we could just look by sex.
plot(logRXEXP ~ sex, data = dat, pch=pch.f,
main="Log(Rx Expend) vs. Sex", ylim = c(0,10))
Or we could use boxplots, which are very helpful for categorical variables.
boxplot(logRXEXP ~ sex, data = dat, main="Log(Rx Expend) vs. Sex")
boxplot(logRXEXP ~ Male, data = dat,
main="Log(Rx Expend) vs. Sex (Male = 1)")
Does it make sense to use sex blindly?
ii. Use the Male variable for sex.
Mod.2 <- lm(logRXEXP ~ AGE09X + Male, data = dat)
summary(Mod.2)
iii. Use the Female variable for sex.
Mod.3 <- lm(logRXEXP ~ AGE09X + Female, data = dat)
summary(Mod.3)
iv. Include both the Male and Female variables.
Mod.4 <- lm(logRXEXP ~ AGE09X + Male + Female, data = dat)
summary(Mod.4)
(d) First look at Mod.4. Why does NA show up?
(e) Now examine models 2 and 3. How do they differ?

8. Let us focus on the model with the Male indicator variable. Suppose we created a new model
that includes AGE09X, Male and a new variable that is AGE09X times Male.

dat$MaleAge <- dat$Male * dat$AGE09X


Mod.5 <- lm(logRXEXP ~ AGE09X + Male + MaleAge, data = dat)
summary(Mod.5)

Write out an equation (on paper) that represents the fitted values for this model and compare it
with Mod.2. What are your thoughts?
Now graph Mod.2 and Mod.5 to see these differences.

plot(logRXEXP ~ AGE09X, data = dat, pch=pch.f,
main="Log(Rx Expend) vs. Age", ylim = c(0,10))
legend("topleft", c("Male","Female"), pch = c(16,24))
abline(Mod.2, lty = "solid", lwd = 2, col = "red")
abline(Mod.5, lty = "longdash", lwd = 2, col = "blue")
legend("bottomright", c("Mod.2","Mod.5"),
lty = c("solid", "longdash"), col = c("red","blue"),
lwd = c(2,2))

Note that adding the interaction term changes the slope and the intercept. Why?
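To see why, write the interaction model as y = b0 + b1·Age + b2·Male + b3·(Male·Age): males get intercept b0 + b2 and slope b1 + b3, while females keep b0 and b1. A sketch on simulated data (all values made up, not the MEPS file) confirms that the interaction model exactly reproduces separate by-sex regressions:

```r
set.seed(1)
n    <- 200
Age  <- runif(n, 20, 80)
Male <- rbinom(n, 1, 0.5)
y    <- 1 + 0.05*Age + 0.8*Male + 0.02*Male*Age + rnorm(n)

# Interaction model: coefficients are (b0, b1, b2, b3).
fit.int <- lm(y ~ Age + Male + I(Male*Age))
b <- coef(fit.int)

# Separate regressions by sex give the same fitted lines.
fit.m <- lm(y ~ Age, subset = Male == 1)
fit.f <- lm(y ~ Age, subset = Male == 0)

all.equal(unname(coef(fit.f)), unname(b[1:2]))                   # female: (b0, b1)
all.equal(unname(coef(fit.m)), unname(c(b[1]+b[3], b[2]+b[4])))  # male: (b0+b2, b1+b3)
```

Both comparisons return TRUE: the full interaction model is algebraically equivalent to fitting the two groups separately, which is why both intercept and slope shift.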

9. Let us see how these lines look relative to the female only data and the male only data.

plot(logRXEXP ~ AGE09X, data = dat, subset = sex == 1,
main="Log(Rx Expend) vs. Age\n Males only", ylim=c(0,10),
pch = 16)
abline(Mod.2, lty = "solid", lwd = 2, col = "red")
abline(Mod.5, lty = "longdash", lwd = 2, col = "blue")
legend("bottomright", c("Mod.2","Mod.5"),
lty = c("solid", "longdash"), col = c("red","blue"),
lwd = c(2,2))

plot(logRXEXP ~ AGE09X, data = dat, subset = sex == 2,
main="Log(Rx Expend) vs. Age\n Females only", ylim=c(0,10),
pch = 24)
abline(Mod.2, lty = "solid", lwd = 2, col = "red")
abline(Mod.5, lty = "longdash", lwd = 2, col = "blue")
legend("bottomright", c("Mod.2","Mod.5"),
lty = c("solid", "longdash"), col = c("red","blue"),
lwd = c(2,2))

What are your conclusions?

10. While convenient to have a one-model-fits-all, that might not always be the best choice. Suppose
instead you decide to run separate regression models by sex. Compare the output of these with
Mod.2 and Mod.5.

Mod.m <- lm(logRXEXP ~ AGE09X, subset = sex == 1, data = dat)
summary(Mod.m)

plot(logRXEXP ~ AGE09X, data = dat, subset = sex == 1,
main="Log(Rx Expend) vs. Age\n Males Only", ylim=c(0,10),
pch = 16)

abline(Mod.m, lwd = 2)

Mod.f <- lm(logRXEXP ~ AGE09X, subset = sex == 2, data = dat)
summary(Mod.f)

plot(logRXEXP ~ AGE09X, data = dat, subset = sex == 2,
main="Log(Rx Expend) vs. Age\n Females Only", ylim=c(0,10),
pch = 24)
abline(Mod.f, lwd = 2)

11. Save your data.frame to a csv file so that you can start on this again and not have to create all
of the variables.

write.csv(dat, file = "put where you want it")



12 Introduction to Regression with Categorical Explanatory Variables, Part II
This exercise furthers the learning of using categorical explanatory variables in a regression.

1. Load the *.csv file that you saved from Part I or download [Link] from our
course website. This is 1000 observations from the Medical Expenditure Panel Study
(http://[Link]/mepsweb/) that we examined in Part I, but it contains a smaller number of
variables and all of the changes that we made previously.

2. Read in the data and let us keep a smaller subset of variables:

rm(list=ls())
library(psych)

setwd("yourdirectory") # set your working directory

dat <- read.csv("[Link]", header = TRUE) # my filename

3. Again, our variable of interest is RXEXP09, drug expenditures in 2009.

4. In Part I, we learned about indicator variables and how to include them in a regression. Recall
that the sex variable was originally coded as 1 = Males and 2 = Females.

5. Rcmdr can be helpful in your initial coding and understanding of the use of categorical variables.

6. Suppose we want to convert the numeric variable sex to a categorical variable, called a factor in
R.

dat$SEX.f <- factor(dat$sex, labels = c("Male","Female"))


dat$SEX.f

The labels command gives you the words that are associated with the coding. Note the ordering
of the labels is important, where they follow the order of the coding (Male = 1 and Female = 2).
Viewing the data with the code dat$SEX.f, you see a listing of the coding, ending with Levels:
Male Female. Note that the listing shows the labels Male and Female, and not the original coding.
You could use the command:

unclass(dat$SEX.f)

to remind yourself of the underlying values of the variable.


I like to put an *.f after the variable to signify that it is a factor variable, although this is not
necessary. In the str() command, you can always see what kind of variable it is.

In Rcmdr, you can create categorical variables using Data > Manage Data > Convert numeric
variable to factor. Be sure to always specify a new variable for the categorical variable so that
you can verify that your approach worked.

7. Let us see the impact of the use of a categorical variable in a regression. First, let us remind
ourselves of the output from the regression model where we used the Male and Female indicators.

Mod.2 <- lm(logRXEXP ~ AGE09X + Male, data = dat)
summary(Mod.2)

Mod.3 <- lm(logRXEXP ~ AGE09X + Female, data = dat)
summary(Mod.3)

We saw in Part I that the output is equivalent, i.e. it does not matter which indicator you put
in the model. Essentially the model controls for sex: given a person’s age, we adjust the
intercept up or down depending on the person’s sex. Recall that if you omitted the sex indicator,
you would have an aggregate model whose intercept term is (roughly) a weighted average
reflecting the balance in the data between males and females.
So what happens when you use the following:

Mod.2a <- lm(logRXEXP ~ AGE09X + SEX.f, data = dat)
summary(Mod.2a)

Here, R leaves out Males and shows the results for Females. We call Males, the reference category.
Why? Because the results shown for Females are relative to that for Males. The intercept term
is for Males (why?), thus the intercept term for Females is adjusted downwards in this case.
R chose Male as the reference category, as it was coded with the lowest value, i.e. a 1 vs. a 2 for
Female. What if we wanted Female to be the reference category? We will see this application in
the next set of examples with regards to race.

dat$SEX1.f <- factor(dat$SEX.f, levels = c("Female","Male"))

unclass(dat$SEX1.f)

Mod.2b <- lm(logRXEXP ~ AGE09X + SEX1.f, data = dat)
summary(Mod.2b)

You start with the factor variable, and you change the levels from Male-Female to Female-Male.
With the unclass statement, you can see how R views the underlying coding.
Now see the regression results. Using SEX1.f, we have now made Female the reference category.
You can also re-order the categories pretty easily in Rcmdr.
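As a side note, base R's relevel() function offers an equivalent one-step way to change the reference category: it moves one chosen level to the front and leaves the rest in their original order. A small self-contained sketch (the data here are made up):

```r
sex <- factor(c("Male", "Female", "Female", "Male"),
              levels = c("Male", "Female"))

# Make Female the reference category without retyping the full level list.
sex1 <- relevel(sex, ref = "Female")

levels(sex)    # "Male"   "Female"  -- original ordering unchanged
levels(sex1)   # "Female" "Male"    -- Female is now first, i.e. the reference
```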

8. As an important aside, suppose you do all of this work and save your file in a .csv file.

write.csv(dat, "[Link]")
dat1 <- read.csv("[Link]", header = TRUE)
str(dat1)
head(dat1)

Write the data.frame to a file and then open it in Excel. Note that the factor variables are saved
in the csv with just the labels and not the underlying coding (i.e. 1, 2). So when you bring
it back into R, you will need to redefine the levels to get what you want as the reference
category. This is one advantage of saving the R workspace, as it also saves all of the underlying
information.
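A middle ground between a csv and the whole workspace is to save the object itself with saveRDS(), which keeps a factor's levels (and their order) intact. A small sketch with made-up data:

```r
# A factor whose level ordering matters (Female is the reference).
f <- factor(c("Male", "Female"), levels = c("Female", "Male"))

path <- tempfile(fileext = ".rds")
saveRDS(f, path)       # writes the R object, including factor metadata
f2 <- readRDS(path)    # reads it back unchanged

identical(levels(f2), c("Female", "Male"))   # TRUE: the ordering survives
```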

9. As a reminder, if you wanted to change both the intercept and the slope by sex, then you could
include an interaction term of dat$AGE09X*dat$SEX.f in the model as we did in part I.

10. What if we had more categories? Race (racex in MEPS) is a good example. First let us use the
summary() command or the table() command to see what is happening.

summary(dat$racex)
table(dat$racex)

Of course from this output, we have no idea of what this means for racex. The summary command
treats racex as a number. Remember it is shown in the str() command as an integer variable
(int). The table command provides more information to show that there are 6 categories and the
frequency by category. But we do not know which category is which.

Looking in the MEPS codebook, we can create the following code:

dat$race.f <- factor(dat$racex, labels = c("white","black",
    "amer/native indian","asian","hawaiian","multiple"))

summary(dat$race.f)
table(dat$race.f)

Now both of the commands produce a more understandable output. You might be tempted to
collapse the last 4 categories into one bigger category. That might be reasonable. But before we
do that, let us explore the data a little further.

11. Let us bring in the dependent variable, log(RxEXP). Do a boxplot by race.

boxplot(logRXEXP ~ race.f, data = dat,
    main="Log(Rx Expend) vs. Race")

See if you can widen the plot to see all of the labeling. Of course, you can play with the font size
to make it readable (which you would do for anything that you turn in).
Here the output shows there are vast differences in median and spread of log(RxEXP) by race.
And the medians do not grow linearly (i.e. in the order of the original coding).

boxplot(logRXEXP ~ SEX.f*race.f, data = dat,
    col=c("grey","red"), main="Log(Rx Expend) vs. Race by SEX")

This boxplot takes advantage of the interaction approach, using SEX.f*race.f to allow us to view
the boxplot by sex and race categories.

12. What if you did the following?

Mod.3 <- lm(logRXEXP ~ AGE09X + SEX.f + racex, data = dat)
summary(Mod.3)

How do you interpret the output? Does it make sense relative to the boxplots?

13. Now let us control for race as a factor.

Mod.3a <- lm([Link] ~ AGE09X + SEX.f + race.f, data = dat)
summary(Mod.3a)

How do you interpret the output? Does it make sense relative to the boxplots?

14. What we have not yet done is controlled for the interaction between sex and race (or even with
AGE). We could create a new variable or just use the colon, ‘:’, as the operator for interaction in
the lm command.

Mod.3b <- lm([Link] ~ AGE09X + SEX.f + race.f + SEX.f:race.f, data = dat)
summary(Mod.3b)

Mod.3c <- lm([Link] ~ AGE09X + SEX.f + AGE09X:SEX.f + race.f + SEX.f:race.f,
    data = dat)
summary(Mod.3c)

Mod.3d <- lm([Link] ~ AGE09X + SEX.f + AGE09X:SEX.f + race.f + SEX.f:race.f
    + AGE09X:race.f, data = dat)
summary(Mod.3d)

Mod.3e <- lm([Link] ~ AGE09X + SEX.f + AGE09X:SEX.f + race.f + SEX.f:race.f
    + AGE09X:race.f + AGE09X:SEX.f:race.f, data = dat)
summary(Mod.3e)

How do you interpret the output? Does it make sense relative to the boxplots?

15. But going back to our table() command, there are few observations in some of the categories. We
might want to recode the variable and collapse some of these categories together.
As one extreme, we could just use an indicator variable for each race category.

dat$white <- 1*(dat$race.f == "white")
dat$black <- 1*(dat$race.f == "black")
dat$indian <- 1*(dat$race.f == "amer/native indian")
dat$asian <- 1*(dat$race.f == "asian")
dat$hawaiian <- 1*(dat$race.f == "hawaiian")
dat$multiple <- 1*(dat$race.f == "multiple")

table(dat$race.f)

table(dat$white)
table(dat$black)
table(dat$indian)
table(dat$asian)
table(dat$hawaiian)
table(dat$multiple)

racesum <- dat$white + dat$black + dat$indian + dat$asian +
    dat$hawaiian + dat$multiple
summary(racesum)

And then we can throw all of these variables in a model:

Mod.4 <- lm([Link] ~ AGE09X + SEX.f + white + black + indian +
    asian + hawaiian + multiple, data = dat)
summary(Mod.4)

How do you interpret the output?


Or we can use only a subset of these variables in a model:

Mod.4a <- lm([Link] ~ AGE09X + SEX.f + white + black, data = dat)
summary(Mod.4a)

How do you interpret the output?

16. Another way would be to define a new factor variable. Remember that the ! operator means
‘NOT’ and the & operator means ‘AND’.

dat$race.1[dat$racex == 1] <- "white"
dat$race.1[dat$racex == 2] <- "black"
dat$race.1[dat$racex != 1 & dat$racex != 2] <- "other"

table(dat$race.1)

str(dat)

Note that race.1 is defined as a character variable, not a factor. So we need to convert it to a
factor.

dat$race1.f <- factor(dat$race.1)

table(dat$race1.f)
str(dat)

See how these numbers compare to previous summaries.

17. Now let us return to the boxplot and see how collapsing the categories changes things.

boxplot([Link] ~ SEX.f*race1.f, data = dat,
    col=c("grey","red"), main="Log(Rx Expend) vs. Race by SEX")

Of course, the plots of white and black by sex are the same as before. But how does the collapsing
change the ‘other’ category relative to the individual race categories? Be aware of the assumptions
that you are making when you collapse.

18. We could go through all of the steps as we did before to examine different regression outputs.
What if we just compared this race composition to that done in Mod.3c?

Mod.3c <- lm([Link] ~ AGE09X + SEX.f + AGE09X:SEX.f + race.f + SEX.f:race.f,
    data = dat)
summary(Mod.3c)

Mod.5c <- lm([Link] ~ AGE09X + SEX.f + AGE09X:SEX.f + race1.f + SEX.f:race1.f,
    data = dat)
summary(Mod.5c)

How do you compare the output? The reference category for race differs between the two models.
It would be easier to have them the same. (Note: conclusions drawn from the model do not change.
They may look different, but everything about them is the same. We will learn about the partial
F-test to test the significance of groups of variables, like race.)
We learned how to change the levels above when we first were examining SEX. Here is an easy way
to do it with the relevel command.

dat$race2.f <- relevel(dat$race1.f, ref = "white")

Mod.6c <- lm([Link] ~ AGE09X + SEX.f + AGE09X:SEX.f + race2.f + SEX.f:race2.f,
    data = dat)
summary(Mod.6c)

Now re-running the model with this new variable you can more readily compare the outputs.
So what do you think?

19. Changing gears slightly, we think about using a continuous variable (or an almost continuous
one, like Age) by just including Age in the model. If we do this, we are assuming that the
relationship between log(RxEXP) and Age is linear. What if it is not, and the relationship varies?
What if it does not vary smoothly, so that we cannot use a quadratic or other nice function?
We could create a categorical variable for Age and use it that way. One simple way of doing it is
as follows (similar to what we did for race):

dat$age[dat$AGE09X >= 65] <- "senior"
dat$age[dat$AGE09X < 21] <- "junior"
dat$age[dat$AGE09X >= 21 & dat$AGE09X < 65] <- "adult"

table(dat$age)

Of course, this is one of infinitely many ways you can do this. How you slice up the data is part
of the ‘art’ of statistics.
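An equivalent, more compact approach uses R’s cut() function, which buckets a numeric vector and returns a factor in one step. The sketch below runs on a made-up vector of whole-year ages (ages is hypothetical, not a MEPS variable); the same call would work on dat$AGE09X.

```r
# cut() default intervals are right-closed, so (-Inf,20] matches
# AGE09X < 21 when ages are whole years
ages <- c(5, 21, 64, 65, 80)              # toy data for illustration
age.f <- cut(ages, breaks = c(-Inf, 20, 64, Inf),
             labels = c("junior", "adult", "senior"))
table(age.f)
```

Note that cut() returns a factor directly, so no separate factor() conversion is needed.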

20. Related to the above, but beyond the scope of this tutorial, check out the following boxplots for
education and age and see what you think.

boxplot([Link] ~ educyr, data = dat,
    main="Log(Rx Expend) vs. # Years Education")
boxplot([Link] ~ educyr*SEX.f, data = dat,
main="Log(Rx Expend) vs. # Years Education and Sex")
boxplot([Link] ~ educyr, data = dat, subset = SEX.f == "Male",
main="Log(Rx Expend) vs. # Years Education and Male")
boxplot([Link] ~ educyr, data = dat, subset = SEX.f == "Female",
main="Log(Rx Expend) vs. # Years Education and Female")

boxplot([Link] ~ AGE09X, data = dat,
    main="Log(Rx Expend) vs. Age")
boxplot([Link]~ AGE09X*SEX.f, data = dat,
main="Log(Rx Expend) vs. Age and Sex")
boxplot([Link] ~ AGE09X, data = dat, subset = SEX.f == "Male",
main="Log(Rx Expend) vs. Age and Male")
boxplot([Link] ~ AGE09X, data = dat, subset = SEX.f == "Female",
main="Log(Rx Expend) vs. Age and Female")

13 Categorical Explanatory Variables: Labels and Levels


In this tutorial we will understand some of the basic aspects of regression with categorical explanatory
variables. As we discussed, categorical variables are those with a defined set of buckets. For the variable
Sex, we have Male and Female. For the variable Insurance, we could have a number of different buckets
depending on the available data. For health insurance, we could have No Insurance, Public Insurance,
Private Insurance. For survey or other types of data, we could also have a Blank, or missing data, or
Do Not Want to Answer, another form of missing data. How to work with missing data will be the
subject of another tutorial.
Understanding the impact of the way you incorporate a categorical variable in a model so that you
properly interpret the results is fundamental. Much of actuarial data are categorical variables. In this
tutorial we will work with the [Link] file from the course website.

1. Start R and set your working directory.

2. Load the [Link] dataset into R or Rcmdr and name the dataset Auto.

3. Submit the str(Auto) command to see the variables.


Variable # Missing Description
CASENUM 0 Claim identification #
ATTORNEY 0 Whether claimant represented by attorney
( 1 if yes; 2 if no)
CLMSEX 12 Claimant’s gender ( 1 if male; 2 if female)
MARITAL 16 Claimant’s marital status
( 1 if married; 2 if single; 3 if widowed; 4 if divorced/separated)
CLMINSUR 41 Uninsured driver of claimant’s vehicle
(1 if yes; 2 if no; 3 if not applicable)
SEATBELT 48 Claimant wearing seatbelt/child restraint
(1 if yes; 2 if no; 3 if not applicable)

4. The variables CLMSEX and CLMINSUR are coded as either 1 or 2. Think about this way of
coding. For CLMSEX, if you see a 1 in the field for this variable, you know that the claimant
is a male. Instead of 1 or 2, we could just as easily have coded this as 3 and 7. Essentially, we
can put claimants into buckets that represent their sex. Thus, whether the coding is 1 or 2, or
3 or 7, it only indicates into which bucket to put the claimant.

5. When data sets are created, it is less time-consuming to code the data as 1 or 2, rather than type
male or female. It is up to the programmer to use the data appropriately.

6. In R, categorical variables are called factors and the buckets are referred to as labels.

7. Categorical variables with 2 buckets can be converted easily to a variable that is a 1/0 variable.
These 1/0 variables are commonly referred to as binary variables, indicator variables, or dummy
variables. As an example, we can convert the variable CLMSEX to an indicator variable by:

Auto$Male <- 1*(Auto$CLMSEX == 1)

The code specifically alerts the user that the variable is 1 if male and 0 if female. The code
(Auto$CLMSEX==1) is a logical condition. When the condition is TRUE, the multiplication
by 1 will result in a 1; when the condition is FALSE, the multiplication by 1 will result in a 0.
Of course, you could create an indicator variable for Female by:

Auto$Female <- 1*(Auto$CLMSEX == 2)
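To see the TRUE/FALSE-to-1/0 conversion in isolation, try it on a small made-up vector (sex here is hypothetical, not a column of Auto):

```r
sex <- c(1, 2, 2, 1)      # toy CLMSEX-style coding
sex == 1                  # logical vector: TRUE FALSE FALSE TRUE
1*(sex == 1)              # multiplying by 1 coerces it to: 1 0 0 1
```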

8. In general, it is best to convert the numeric variables that really are categorical variables to factors
so that it is clear to R and to the user that these variables are categorical variables. Suppose we
create a new variable Sex as a categorical variable:

Auto$Sex <- factor(Auto$CLMSEX)
head(Auto)
str(Auto)

Note that the head(Auto) command shows the new variable Sex and lists the same elements as
CLMSEX. So for this command it does not look like anything happened.
However with the str(Auto) command, the new variable Sex shows it as a factor and indicates
that it has 2 levels, or R-speak for buckets. The labels for each of the two levels are 1 or 2.
9. It would be helpful to label categorical variables more clearly. Here we have coded Sex as 1 for
Males and 2 for Females. Without the data dictionary, you will quickly forget whether Males
are 1 or 2. The way to remember is to insert the labels Male and Female.

Auto$Sex <- factor(Auto$CLMSEX, labels = c("Male", "Female"))
head(Auto)
str(Auto)

Of course you need to be careful as to the order of the labels. Since the data are numeric, the
labels will be applied in order, from the smallest to the largest. Thus, since our data are coded
1/2, then 1 is labeled Male and 2 is labeled Female.
Notice what happens now. The head command shows the new labeling of Male/Female. The
structure command shows that Sex is a factor with 2 levels, Male and Female, along with the
original 1/2 coding of the variable.
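If you are worried about getting the order wrong, you can state the levels and the labels together so the pairing is explicit. This is a sketch on a standalone vector (clmsex below is made up, standing in for Auto$CLMSEX):

```r
clmsex <- c(2, 1, 2, 2)   # toy 1/2 coding
# pair each level with its label explicitly: 1 -> Male, 2 -> Female
sex <- factor(clmsex, levels = c(1, 2), labels = c("Male", "Female"))
table(sex)
```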
10. As an illustration, suppose we run a regression of LOSS on Sex.

Model.1 <- lm(LOSS ~ Sex, data = Auto)
summary(Model.1)

Notice that R automatically includes the Female-level in the model. As discussed in class, for
a categorical variable with 2 levels, we need to only include 1 level in the regression. Since the
Male-level is excluded (reflected by the intercept term), we call the Male-level (or bucket) the
reference category.
11. Suppose we really wanted Female to be the reference category. We can accomplish that in one of
two ways.
(a) We could change the order of the levels.
Auto$Sex.1 <- factor(Auto$Sex, levels = c("Female","Male"))
head(Auto)
str(Auto)

Here we start with the factor variables with the labels already defined. We create a new factor
variable that reverses the order of the levels. Look at the results of the head command. The
entries for Sex and Sex.1 are identical. That is good, as we did not want to change the value
of the observation. We only wanted to change the reference category.
To show the impact of the change, we re-run the regression with this new factor variable.
Model.2 <- lm(LOSS ~ Sex.1, data = Auto)
summary(Model.2)

See that the overall results are the same, but the sign of the Sex variable differs from Sex.1.
The intercept term changes accordingly. But the overall model summary statistics are the
same, as will be the fitted values.
(b) We could change the reference level directly.
Auto$Sex.2 <- relevel(Auto$Sex, ref="Female")
head(Auto)
str(Auto)

Model.3 <- lm(LOSS ~ Sex.2, data = Auto)
summary(Model.3)

Note that the results of these commands are as we would have hoped.
(c) You could also use Rcmdr to create a factor variable, establish the labels, and change the
levels.
i. Go to Data > Manage variables in active data set > Convert numeric variables to factors
ii. Be sure to change the name of the variable so that you do not overwrite the original
variable. I call mine Sex.R.
iii. Check the box that says you will supply the level names. Here for 1, enter Male, and for
2, enter Female, as we did to set up Sex.
Note the Rcmdr coding shows:
Auto$Sex.R <- factor(Auto$CLMSEX, labels=c(’Male’,’Female’))
The difference between ’level’ and ’label’ is sometimes confusing. Think of the levels
as the buckets and the labels as the names of each bucket.
iv. To re-order the levels in Rcmdr, you can use Data > Manage variables in active data
set > Reorder factor levels. Choose Sex.R, choose a new name for the created variable,
Sex.R1, and click OK. In the next dialog box, just reverse the numbering of the levels.
Note that the level numbered ‘1’ is the reference level.
Now the Rcmdr coding shows:
Auto$Sex.R1 <- factor(Auto$Sex.R, levels=c(’Female’,’Male’))
This is the same coding that we did above.

Final code:

setwd("D:/ActSci654/Work")
# THIS CODE USES THE DATASET [Link]

rm(list=ls())

Auto <- [Link]("d:/ActSci654/DataForFrees/[Link]", header=TRUE)

str(Auto)

Auto$Sex <- factor(Auto$CLMSEX)
str(Auto)
head(Auto)
levels(Auto$Sex)

Auto$Sex <- factor(Auto$CLMSEX, labels = c("Male", "Female"))
head(Auto)
str(Auto)

Model.1 <- lm(LOSS ~ Sex, data = Auto)
summary(Model.1)

Auto$Sex.1 <- factor(Auto$Sex, levels = c("Female","Male"))
head(Auto)
str(Auto)

Model.2 <- lm(LOSS ~ Sex.1, data = Auto)
summary(Model.2)

Auto$Sex.2 <- relevel(Auto$Sex, ref="Female")
head(Auto)
str(Auto)

Model.3 <- lm(LOSS ~ Sex.2, data = Auto)
summary(Model.3)

library(Rcmdr)

Thanks to Andy Korger for his help with design of this tutorial.

14 Regression with Categorical Explanatory Variables: With and Without Intercept Term

This exercise will compare the output of regression when you have a binary variable as the only
explanatory variable and (i) include an intercept term versus (ii) excluding an intercept term using a
factor variable.
The [Link] dataset is used. In the data set GENDER is coded as a numeric variable with
0 = Female and 1 = Male.
We define two new variables and add them to the dataset Term:

Term$Gender.f <- factor(Term$GENDER, labels=c(’Female’,’Male’))

## change order of levels
Term$Gender.f1 <- factor(Term$Gender.f, levels=c(’Male’,’Female’))

Gender.f is now a factor variable (categorical). Note the labels argument defines that the 0’s are
Females and the 1’s are Males. Similarly, Gender.f1 re-orders the levels of Gender.f so that
now Males are first. This switching of levels affects which is the reference category in the regression
analysis.
Assume that we have taken a subset of the data and use only those observations where FACE > 0
and the GENDER variable. We call that new dataset Face.0.
If you do a str() on Face.0, you will see that GENDER is numeric, while Gender.f and Gender.f1
show as factors. Also, as a frame of reference:

FACE Mean
Overall 747,600.0
Female 111,621.6
Male 846,449.2

Our task is to compare the regression output of FACE on GENDER in 4 different ways:
1. Regression of FACE on the binary variable GENDER (where 0 = Female and 1 = Male).
2. Regression of FACE on the factor GENDER (called Gender.f), where Female is the first level and Male
the second.
3. Regression of FACE on the factor GENDER (called Gender.f1), where Male is the first level and Female
the second.
4. Regression of FACE on the factor GENDER (called Gender.f1), but excluding an intercept term.

The output:

1. Regression of FACE on GENDER


The coefficient for GENDER (representing Male) has a p-value of 0.0127, with the R2 of 0.02251.
This example shows that a variable, GENDER, is statistically significant in explaining the vari-
ability of FACE, but alone, only explains 2% of the variation. Compare the coefficient of the
intercept with the mean for Females. Compare the sum of the coefficient of the intercept and the
Gender variable to the mean for Males.

lm(formula = FACE ~ GENDER, data = Face.0)

Residuals:
Min 1Q Median 3Q Max
-845649 -757949 -596449 -26535 13153551

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111622 272646 0.409 0.6826
GENDER 734828 293074 2.507 0.0127 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1658000 on 273 degrees of freedom
Multiple R-squared: 0.02251,  Adjusted R-squared: 0.01893
F-statistic: 6.287 on 1 and 273 DF,  p-value: 0.01275

2. Regression of FACE on Gender.f


The results for this regression are identical to that in Regression (1) even though we used GEN-
DER as a categorical variable. Here the reference category is Female.

lm(formula = FACE ~ Gender.f, data = Face.0)

Residuals:
Min 1Q Median 3Q Max
-845649 -757949 -596449 -26535 13153551

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111622 272646 0.409 0.6826
[Link] 734828 293074 2.507 0.0127 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1658000 on 273 degrees of freedom
Multiple R-squared: 0.02251,  Adjusted R-squared: 0.01893
F-statistic: 6.287 on 1 and 273 DF,  p-value: 0.01275

3. Regression of FACE on Gender.f1 (but changed the order of the levels)


The summary results for this regression are identical to that in Regression (1) and (2). Here
Male is the omitted category. The t-statistic on the GENDER variable is the same, as well as
the diagnostics. To obtain an estimate for Males, we just use the intercept term (as we would
multiply the Female coefficient by 0). Also included in this analysis is the ANOVA table to be
used as comparison in Regression (4).
Find the SS Total for FACE. You find this by finding the variance of FACE multiplied by (number
of observations - 1). Compare this number to the sum of the SSR (Gender.f1 in this case) and
SSR (Residuals) in the ANOVA table. This sum is equal to the SS Total.
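The SS Total calculation described above can be sketched on a toy vector (face below is made up, standing in for Face.0$FACE):

```r
face <- c(100, 250, 400, 150)                 # toy stand-in for FACE
sst.var <- var(face) * (length(face) - 1)     # variance times (n - 1)
sst.dev <- sum((face - mean(face))^2)         # direct sum of squared deviations
all.equal(sst.var, sst.dev)                   # the two formulas agree
```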

lm(formula = FACE ~ Gender.f1, data = Face.0)

Residuals:
Min 1Q Median 3Q Max
-845649 -757949 -596449 -26535 13153551

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 846449 107501 7.874 8.15e-14 ***
Gender.f1Female -734828 293074 -2.507 0.0127 *
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1658000 on 273 degrees of freedom
Multiple R-squared: 0.02251,  Adjusted R-squared: 0.01893
F-statistic: 6.287 on 1 and 273 DF,  p-value: 0.01275

> Analysis of Variance Table

Response: FACE
Df Sum Sq Mean Sq F value Pr(>F)
Gender.f1 1 1.7291e+13 1.7291e+13 6.2866 0.01275 *
Residuals 273 7.5087e+14 2.7504e+12
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

4. Regression of FACE on Gender.f NO INTERCEPT


In this last regression, we conduct a regression without an intercept term. Think of the ‘-1’ term
as a vector that removes the constant term from all of the rows of the design matrix. We notice
that the estimates for Female and Male agree with the estimates for FACE for Female and Male.
The p-value for Female is the same as the p-value for the intercept term in Regressions (1) and
(2), while the p-value for Male is the same as the p-value for the intercept term in Regression
(3). The Residual standard error (s) is the same as in Regressions (1), (2), and (3).
But the R2, Ra2, and F-statistic (and its p-value) are vastly different. Why is this so? This
question is answered after the result.

lm(formula = FACE ~ Gender.f1 - 1, data = Face.0)

Residuals:
Min 1Q Median 3Q Max
-845649 -757949 -596449 -26535 13153551

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Gender.f1Male 846449 107501 7.874 8.15e-14 ***
Gender.f1Female 111622 272646 0.409 0.683

---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 1658000 on 273 degrees of freedom
Multiple R-squared: 0.1855,  Adjusted R-squared: 0.1795
F-statistic: 31.08 on 2 and 273 DF,  p-value: 6.89e-13

> Analysis of Variance Table

Response: FACE
Df Sum Sq Mean Sq F value Pr(>F)
Gender.f1 2 1.7098e+14 8.5491e+13 31.083 6.89e-13 ***
Residuals 273 7.5087e+14 2.7504e+12
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

In this regression, note that the ANOVA shows 2 degrees of freedom for the factor variable.
The degrees of freedom for the residuals stay at the same level as before. Recall, we removed the
intercept term. Thus, the Total SS for this ANOVA table is equal to Σ FACE², as the mean
of FACE is assumed to be 0. There are 2 degrees of freedom for the Gender factor, as there are two
explanatory variables included in the model (and NO intercept term).
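You can verify this decomposition on a toy dataset (face and g below are made up): without an intercept, the factor SS plus the residual SS add up to Σ FACE², not to the usual total corrected for the mean.

```r
face <- c(100, 250, 400, 150)           # toy FACE values
g <- factor(c("F", "F", "M", "M"))      # toy two-level gender factor
fit <- lm(face ~ g - 1)                 # no intercept, one column per level
ss <- anova(fit)[["Sum Sq"]]            # factor SS and residual SS
all.equal(sum(ss), sum(face^2))         # together they equal sum(FACE^2)
```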

Here the hypothesis test is:

H0: β_Females = β_Males = 0
H1: At least one of β_Females or β_Males ≠ 0

Since the coefficients of Female and Male are the means of FACE for Females and Males
respectively, the hypothesis test is equivalent to:
H0: Mean of Face for Females = Mean of Face for Males = 0
H1: At least one of Mean of Face for Females or Mean of Face for Males ≠ 0

Given the sample means earlier, it is no surprise that we reject the hypothesis that the means
are equal to 0.

As a contrast with Regression (3) including an intercept term, the corresponding hypotheses are:
H0: β_Females = β_Males = 0 (looks at differences from the overall mean of FACE)
H1: At least one of β_Females or β_Males ≠ 0
Here the null hypothesis is that you would use the overall mean of FACE for an estimate (i.e.
the slope of the line is horizontal at the overall mean).

The hypothesis test is equivalent to:
H0: Mean of Face for Females = Mean of Face for Males
H1: Mean of Face for Females ≠ Mean of Face for Males

Again, from the sample statistics above, you would not conclude that the mean of FACE for Males =
mean of FACE for Females. You would again reject the null hypothesis.

In conclusion, all of the models are equivalent. You just need to be aware of the summary statistics
in the regression models and what they are trying to tell you.

R-code for this analysis

rm(list=ls())
Term <- [Link]("D:/ActSci654/DataForFrees/[Link]",
header = TRUE)

Term$Gender.f <- factor(Term$GENDER, labels=c(’Female’,’Male’))

## change order of levels
Term$Gender.f1 <- factor(Term$Gender.f, levels=c(’Male’,’Female’))

## subset data to remove those values of FACE == 0
Face.0 <- Term[Term$FACE > 0 , ]

str(Face.0)

nobs <- length(Face.0$FACE)

# Find summary statistics
summary(Face.0$FACE)
tapply(Face.0$FACE, Face.0$Gender.f, mean)

RegMod <- lm(FACE ~ GENDER, data = Face.0)
summary(RegMod)

RegMod.1 <- lm(FACE ~ Gender.f, data = Face.0)
summary(RegMod.1)

RegMod.2 <- lm(FACE ~ Gender.f1, data = Face.0)
summary(RegMod.2)
anova(RegMod.2)

## compare to SST.2
var <- sd(Face.0$FACE) * sd(Face.0$FACE)*(nobs - 1)
## total sum of squares
SST.2 <- 1.7291e+13 + 7.5087e+14
## sum of squares of FACE
SS <- t(Face.0$FACE) %*% Face.0$FACE
SST.4 <- 1.7098e+14 + 7.5087e+14

RegMod.3 <- lm(FACE ~ Gender.f1 -1, data = Face.0)
summary(RegMod.3)
anova(RegMod.3)

15 Categorical Explanatory Variables: Impact of Interaction Terms

In this tutorial we will examine the use of categorical explanatory variables in a regression and the
impact of including interaction terms. As we discussed, categorical variables are those with a defined
set of buckets. In this tutorial we will use the [Link] file from the course website.

1. Start R and set your working directory.

2. Load the [Link] dataset into R or Rcmdr and name the dataset Auto.

3. Submit the str(Auto) command to see the variables.

4. From the course website, the data dictionary for this data set is:
Variable # Missing Description
CASENUM 0 Claim identification #
ATTORNEY 0 Whether claimant represented by attorney
( 1 if yes; 2 if no)
CLMSEX 12 Claimant’s gender ( 1 if male; 2 if female)
MARITAL 16 Claimant’s marital status
( 1 if married; 2 if single; 3 if widowed; 4 if divorced/separated)
CLMINSUR 41 Uninsured driver of claimant’s vehicle
(1 if yes; 2 if no; 3 if not applicable)
SEATBELT 48 Claimant wearing seatbelt/child restraint
(1 if yes; 2 if no; 3 if not applicable)

5. Let us focus on the LOSS variable as our outcome variable. As a start do a histogram of the
outcome variable and see that it is highly skewed. As such, take a log of the variable:

Auto$[Link] <- log(Auto$LOSS)

and then create another histogram with the [Link] variable to see that it is more symmetric.
So we will model the [Link].

6. We will focus in this tutorial on two of the categorical variables: ATTORNEY and CLMSEX.
First, let us convert these variables to factors (or categorical variables) for use in R. I have
included changing the variable MARITAL to a factor just in case you want to extend this tutorial
to this variable. Another tutorial covers other aspects of categorical variables and as mentioned,
we only focus on the use of interaction terms here.

Auto$Sex.f <- factor(Auto$CLMSEX, labels = c("Male", "Female"))
Auto$Attorney.f <- factor(Auto$ATTORNEY, labels = c("Yes", "No"))
Auto$Marital.f <- factor(Auto$MARITAL,
    labels = c("Married", "Single","Widowed","Divorced/Separated"))

Re-do the str(Auto) command to see the addition of these variables and that they are indeed
labeled factors.

7. As discussed in class, we noted that boxplots are really helpful in examining the outcome variable
relative to the various levels of a categorical variable. Insert the following code and see the plots.
What are your conclusions from these plots? Note the last one combines the Sex factor and the
Attorney factor so we have 4 categories. (I could have dressed up the plots by putting titles
and other things that we have learned.)

boxplot([Link] ~ Sex.f, data = Auto)
boxplot([Link] ~ Attorney.f, data = Auto)
boxplot([Link] ~ Sex.f*Attorney.f, data = Auto)

8. One last creation item before we begin modeling is to create new binary variables for each of the
4 specific categories: Female with Attorney; Male with Attorney; Female without Attorney; and
Male Without Attorney. Note the use of the ifelse statement, similar in structure to Excel’s IF
statement.

Auto$[Link] <- ifelse(Auto$Sex.f == "Female" & Auto$Attorney.f == "Yes", 1, 0)
Auto$[Link] <- ifelse(Auto$Sex.f == "Male" & Auto$Attorney.f == "Yes", 1, 0)
Auto$[Link] <- ifelse(Auto$Sex.f == "Female" & Auto$Attorney.f == "No", 1, 0)
Auto$[Link] <- ifelse(Auto$Sex.f == "Male" & Auto$Attorney.f == "No", 1, 0)
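The ifelse() call works element-by-element, returning the second argument where the condition is TRUE and the third where it is FALSE. A quick sketch on made-up vectors (sex and att here are hypothetical, not columns of Auto):

```r
sex <- c("Female", "Male", "Female")
att <- c("Yes", "Yes", "No")
# 1 only where the claimant is a Female with an Attorney
ifelse(sex == "Female" & att == "Yes", 1, 0)
```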

9. So now it is time to create the models. Below are 6 models that I have created for illustrative
purposes.

Model.1 <- lm([Link] ~ Sex.f + Attorney.f, data = Auto)
summary(Model.1)

Model.2 <- lm([Link] ~ [Link] + [Link] + [Link] + [Link], data = Auto)
summary(Model.2)

Model.2a <- lm([Link] ~ [Link] + [Link] + [Link] + [Link] -1, data = Auto)
summary(Model.2a)

Model.3 <- lm([Link] ~ Sex.f + Attorney.f + Sex.f:Attorney.f, data = Auto)
summary(Model.3)

Model.3a <- lm([Link] ~ Sex.f + Attorney.f + Sex.f:Attorney.f -1, data = Auto)
summary(Model.3a)

Model.4 <- lm([Link] ~ Sex.f*Attorney.f, data = Auto)
summary(Model.4)

Model.4a <- lm([Link] ~ Sex.f*Attorney.f -1, data = Auto)
summary(Model.4a)
(a) Model 1 includes the factors Sex and Attorney.
(b) Model 2 uses the 4 new binary variables that we created.
(c) Model 2a is a variation of Model 2 but eliminates the intercept term with the ’-1’ code.
(d) Model 3 is a variation of Model 1 and includes an interaction term with the ’:’ symbol.
(e) Model 3a is a variation of Model 3 but eliminates the intercept term with the ’-1’ code.
(f) Model 4 is the same as Model 3. The ’*’ code tells R to include the main effects + the
interaction term.
(g) Model 4a is the same as Model 3a and is also a variation of Model 4 eliminating the intercept
term.
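One way to see exactly what ’:’ and ’*’ do is to inspect the design matrix that R builds from the formula. This sketch uses a tiny made-up data frame (d, sex, and att are hypothetical) with two 2-level factors:

```r
d <- data.frame(sex = factor(c("M", "F", "M", "F")),
                att = factor(c("Y", "Y", "N", "N")))
# '*' expands to both main-effect dummies plus the interaction column
colnames(model.matrix(~ sex * att, data = d))
```

With treatment contrasts, the reference levels ("F" and "N", first alphabetically) are absorbed into the intercept, which is why only one dummy appears per factor.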

Model Output

> Model.1 <- lm([Link] ~ Sex.f + Attorney.f, data = Auto)
> summary(Model.1)

Call:
lm(formula = [Link] ~ Sex.f + Attorney.f, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1638 -0.6794 0.0136 0.8099 5.6966

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.21584 0.06223 19.539 <2e-16 ***
[Link] 0.06084 0.07181 0.847 0.397
[Link] -1.41125 0.07133 -19.786 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.295 on 1325 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared: 0.2283,  Adjusted R-squared: 0.2271
F-statistic: 196 on 2 and 1325 DF,  p-value: < 2.2e-16

>
> Model.2 <- lm([Link] ~ [Link] + [Link] + [Link] + [Link], data = Auto)
> summary(Model.2)

Call:
lm(formula = [Link] ~ [Link] + [Link] + [Link] + [Link],
data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1811 -0.6828 0.0170 0.8051 5.7158

Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.22129 0.08019 -2.759 0.00587 **
[Link] 1.47878 0.10583 13.973 < 2e-16 ***
[Link] 1.45791 0.10768 13.539 < 2e-16 ***
[Link] 0.10404 0.10361 1.004 0.31548
[Link] NA NA NA NA
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.296 on 1324 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared: 0.2285,  Adjusted R-squared: 0.2268
F-statistic: 130.7 on 3 and 1324 DF,  p-value: < 2.2e-16

>
> Model.2a <- lm([Link] ~ [Link] + [Link] + [Link] + [Link] -1, data = Auto)
> summary(Model.2a)

Call:

lm(formula = [Link] ~ [Link] + [Link] + [Link] + [Link] -
    1, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1811 -0.6828 0.0170 0.8051 5.7158

Coefficients:
Estimate Std. Error t value Pr(>|t|)
[Link] 1.25750 0.06905 18.210 < 2e-16 ***
[Link] 1.23663 0.07187 17.207 < 2e-16 ***
[Link] -0.11724 0.06560 -1.787 0.07415 .
[Link] -0.22129 0.08019 -2.759 0.00587 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.296 on 1324 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared: 0.3254,  Adjusted R-squared: 0.3233
F-statistic: 159.6 on 4 and 1324 DF,  p-value: < 2.2e-16

>
> Model.3 <- lm([Link] ~ Sex.f + Attorney.f + Sex.f:Attorney.f, data = Auto)
> summary(Model.3)

Call:
lm(formula = [Link] ~ Sex.f + Attorney.f + Sex.f:Attorney.f,
data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1811 -0.6828 0.0170 0.8051 5.7158

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.23663 0.07187 17.207 <2e-16 ***
[Link] 0.02087 0.09967 0.209 0.834
[Link] -1.45791 0.10768 -13.539 <2e-16 ***
[Link]:[Link] 0.08317 0.14376 0.579 0.563
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.296 on 1324 degrees of freedom
  (12 observations deleted due to missingness)
Multiple R-squared: 0.2285,  Adjusted R-squared: 0.2268
F-statistic: 130.7 on 3 and 1324 DF,  p-value: < 2.2e-16

>
> Model.3a <- lm([Link] ~ Sex.f + Attorney.f + Sex.f:Attorney.f -1, data = Auto)
> summary(Model.3a)

Call:
lm(formula = LN.Loss ~ Sex.f + Attorney.f + Sex.f:Attorney.f -
1, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1811 -0.6828 0.0170 0.8051 5.7158

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Sex.fMale                  1.23663    0.07187  17.207   <2e-16 ***
Sex.fFemale                1.25750    0.06905  18.210   <2e-16 ***
Attorney.fNo              -1.45791    0.10768 -13.539   <2e-16 ***
Sex.fFemale:Attorney.fNo   0.08317    0.14376   0.579    0.563
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.296 on 1324 degrees of freedom


(12 observations deleted due to missingness)
Multiple R-squared: 0.3254,Adjusted R-squared: 0.3233
F-statistic: 159.6 on 4 and 1324 DF, p-value: < 2.2e-16

>
> Model.4 <- lm(LN.Loss ~ Sex.f*Attorney.f , data = Auto)
> summary(Model.4)

Call:
lm(formula = LN.Loss ~ Sex.f * Attorney.f, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1811 -0.6828 0.0170 0.8051 5.7158

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.23663 0.07187 17.207 <2e-16 ***
Sex.fFemale                0.02087    0.09967   0.209    0.834
Attorney.fNo              -1.45791    0.10768 -13.539   <2e-16 ***
Sex.fFemale:Attorney.fNo   0.08317    0.14376   0.579    0.563
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.296 on 1324 degrees of freedom


(12 observations deleted due to missingness)
Multiple R-squared: 0.2285,Adjusted R-squared: 0.2268
F-statistic: 130.7 on 3 and 1324 DF, p-value: < 2.2e-16

>
> Model.4a <- lm(LN.Loss ~ Sex.f*Attorney.f -1, data = Auto)
> summary(Model.4a)

Call:
lm(formula = LN.Loss ~ Sex.f * Attorney.f - 1, data = Auto)

Residuals:
Min 1Q Median 3Q Max
-5.1811 -0.6828 0.0170 0.8051 5.7158

Coefficients:
Estimate Std. Error t value Pr(>|t|)
Sex.fMale                  1.23663    0.07187  17.207   <2e-16 ***
Sex.fFemale                1.25750    0.06905  18.210   <2e-16 ***
Attorney.fNo              -1.45791    0.10768 -13.539   <2e-16 ***
Sex.fFemale:Attorney.fNo   0.08317    0.14376   0.579    0.563
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.296 on 1324 degrees of freedom


(12 observations deleted due to missingness)
Multiple R-squared: 0.3254,Adjusted R-squared: 0.3233
F-statistic: 159.6 on 4 and 1324 DF, p-value: < 2.2e-16

Final code:

setwd("D:/ActSci654/Work")
# THIS CODE USES THE DATASET AutoBI.csv
rm(list=ls())
Auto <- read.csv("d:/ActSci654/DataForFrees/AutoBI.csv", header=TRUE)
str(Auto)

Auto$Sex.f <- factor(Auto$CLMSEX, labels = c("Male", "Female"))


Auto$Marital.f <- factor(Auto$MARITAL,
labels = c("Married", "Single","Widowed","Divorced/Separated"))
Auto$Attorney.f <- factor(Auto$ATTORNEY, labels = c("Yes", "No"))
Auto$LN.Loss <- log(Auto$LOSS)
head(Auto)
str(Auto)

hist(Auto$LOSS)
hist(Auto$LN.Loss)

table(Auto$Sex.f,Auto$Attorney.f)

boxplot(LN.Loss ~ Sex.f, data = Auto)
boxplot(LN.Loss ~ Attorney.f, data = Auto)
boxplot(LN.Loss ~ Sex.f*Attorney.f, data = Auto)

Auto$Fem.Yes  <- ifelse(Auto$Sex.f == "Female" & Auto$Attorney.f == "Yes", 1, 0)
Auto$Male.Yes <- ifelse(Auto$Sex.f == "Male" & Auto$Attorney.f == "Yes", 1, 0)
Auto$Fem.No   <- ifelse(Auto$Sex.f == "Female" & Auto$Attorney.f == "No", 1, 0)
Auto$Male.No  <- ifelse(Auto$Sex.f == "Male" & Auto$Attorney.f == "No", 1, 0)

Model.1 <- lm(LN.Loss ~ Sex.f + Attorney.f, data = Auto)
summary(Model.1)

Model.2 <- lm(LN.Loss ~ Fem.Yes + Male.Yes + Fem.No + Male.No, data = Auto)
summary(Model.2)

Model.2a <- lm(LN.Loss ~ Fem.Yes + Male.Yes + Fem.No + Male.No -1, data = Auto)
summary(Model.2a)

Model.3 <- lm(LN.Loss ~ Sex.f + Attorney.f + Sex.f:Attorney.f, data = Auto)
summary(Model.3)

Model.3a <- lm(LN.Loss ~ Sex.f + Attorney.f + Sex.f:Attorney.f -1, data = Auto)
summary(Model.3a)

Model.4 <- lm(LN.Loss ~ Sex.f*Attorney.f, data = Auto)
summary(Model.4)

Model.4a <- lm(LN.Loss ~ Sex.f*Attorney.f -1, data = Auto)
summary(Model.4a)

16 More on Categorical Explanatory Variables


This exercise demonstrates differences in regression output and interpretation using categorical ex-
planatory variables.
1. The code that I used in R is attached if you want to use it.
2. The data uses the UNLifeExpectancy.csv file from the course website.
Please remember to save your R scripts as you are working just in case your system freezes.
3. Start R/Rcmdr. You can use Rcmdr to submit a couple of lines at one time. That might be
easier than using the R GUI. Also recall that R is case-sensitive and space-sensitive.
4. Open the UNLifeExpectancy.csv file, call your data frame UN, and do a str() command
to see the variables.
5. Create a histogram of HEALTHEXPEND, 2004 health expenditures per capita in US dollars.
Note the skewness in the data. So take log(HEALTHEXPEND), store it as LN.Health, and do a
histogram of that, along with a qqplot. In R, you can type qqnorm(name-of-variable) to get
the plot. Recall that you can type par(mfrow = c(1, 2)) to change the parameters for the
graphs to get 2 graphs in one output window.
6. Let us take a subset of the data and put it in a separate data frame, UN.sub1. Use the following
command:

UN.sub1 <- UN[, c("LN.Health", "REGION", "FERTILITY", "PHYSICIAN")]

View the data set by typing: head(UN.sub1). See that just the 4 variables are in the data frame,
using all of the rows.
7. Now do a correlation on UN.sub1 using the function from the Hmisc package called rcorr. Type:
rcorr(as.matrix(UN.sub1), type = "pearson"). How do you interpret the correlations? Are they
significant?
8. Recall that REGION is a categorical variable for region of the world, FERTILITY is the births
per woman, LN.Health is the logarithm of the 2004 health expenditures, and PHYSICIAN is the
variable for the number of physicians per 100,000 people. So, how do you interpret the correlation
of REGION with the other variables? The calculation worked numerically as the variable is read
into R as a numeric variable. The description is as a category. Taken literally, the output says
that as REGION gets larger in 'number', then LN.Health gets larger. What does it mean for
REGION to get larger in 'number'? Does that have any interpretable value? Let us delve into
this further.
9. Before we do, let us first create a categorical variable for REGION, called Region.f.

UN$Region.f <- factor(UN$REGION, labels=c("Arab States", "East Asia & Pacific",
         "Latin American & Carribean","South Asia","Southern Europe",
         "Sub-Saharan Africa","Central & Eastern Europe",
         "High-Income OECD"))

levels(UN$Region.f)

The first command creates the factor variable with labels. The second command shows you the
levels for that variable. Now if we do a str(UN), we see that we have two variables for
Region: one is a numeric variable, one is a categorical variable. Before we move on, try doing a
correlation of Region.f with LN.Health. What happens? Not surprisingly, the correlation will
not work with a categorical variable.
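As a quick illustration of why this fails, here is a minimal sketch on toy data (not the UN data): cor() requires numeric input, so handing it a factor stops with an error.

```r
# Toy example: cor() refuses a factor, just as it refuses Region.f
f <- factor(c("A", "B", "A", "B"))
x <- c(1, 2, 3, 4)
msg <- tryCatch(cor(f, x), error = function(e) conditionMessage(e))
msg   # the error message complains that the input must be numeric
```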

10. Now create a new data frame, UN.NA, that contains no missing data for
("LN.Health","REGION","FERTILITY","PHYSICIAN").

## just delete NA's for those variables

UN.NA <- UN[!is.na(UN$LN.Health) & !is.na(UN$REGION) &
   !is.na(UN$FERTILITY) & !is.na(UN$PHYSICIAN),]

str(UN.NA)

There are now 175 observations in the data frame (as compared to 185 in the UN data frame).

11. Run a regression of LN.Health ~ REGION using data frame = UN.NA. This regression shows
that REGION is significant with a p-value of 0.0032. Now interestingly, we learned in basic
regression (which this is) that the p-value of the regression should be equivalent to the p-value
from the correlation. (Of course, you can also find the correlation of (LN.Health, REGION)
using the regression coefficient and the standard deviations of LN.Health and REGION.) But
the p-values are close, not equivalent. Why? (Remember the missing data?)
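This equivalence is easy to verify with a simulated sketch (toy data, not the UN data): when both calculations use exactly the same observations, the slope p-value from lm() matches the p-value from cor.test().

```r
# Simulated check: simple-regression slope p-value == correlation p-value
set.seed(1)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
p.lm  <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
p.cor <- cor.test(x, y)$p.value
all.equal(p.lm, p.cor)   # TRUE -- they differ only if the observations differ
```

With missing data, rcorr() and lm() can silently use different observation sets, which is exactly why the p-values above are close but not equal.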

12. But again, what is the interpretation of using REGION as a numeric variable? The regression
says that going from Region 1 to Region 2 is the same change in LN.Health as going
from Region 5 to Region 6. Does this make sense? Look at the categories. These are regions of
the world. Is there an order to the labels?

13. Using REGION as a numeric variable here does not make sense. So we use the factor version
of REGION.

14. Run a regression of LN.Health ~ Region.f. Now the regions are like 'buckets', not numeric
values. In this version with an intercept, we have what is shown in the book as y = mu + tau_j + e.
See the text on page 126 under equation (4.11) for a thorough explanation. Essentially, the mu here
is that of the reference category (say mu_r, Arab States), and the tau_j are the differences mu_j - mu_r.
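A small simulated check of this parameterization (toy data, not the UN data): with an intercept, the intercept recovers the reference-category mean exactly, and each factor coefficient is a difference of group means.

```r
# Simulated check of the mu + tau_j parameterization
set.seed(2)
g <- factor(rep(c("r", "a", "b"), each = 20), levels = c("r", "a", "b"))
y <- rnorm(60, mean = c(1, 3, 5)[as.integer(g)])
fit   <- lm(y ~ g)
means <- tapply(y, g, mean)
all.equal(unname(coef(fit)[1]), unname(means["r"]))                  # intercept = reference mean
all.equal(unname(coef(fit)["ga"]), unname(means["a"] - means["r"]))  # tau_a = mean difference
```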

15. If you run a regression without an intercept term, each level is represented in the output. This
form is consistent with equation (4.7). You can compare your output without an intercept term
with the means by category. This code uses the ddply function as part of the plyr package.

RegModel.2a <- lm(LN.Health ~ Region.f - 1, data=UN.NA)
summary(RegModel.2a)

ddply(UN.NA, ~ Region.f, summarise, count=length(LN.Health),
      mean=mean(LN.Health), sd=sd(LN.Health))

16. But looking at the output, the standard deviations in the summary from ddply do not match
the standard errors shown in the regression output (without an intercept term). How can that
be? They measure different things. The standard deviation in the summary output measures
the spread of LN.Health within each of the Regions. That is, given a particular Region, what is
the standard deviation of LN.Health? The standard error of the regression coefficient measures
the error in the estimate of the regression coefficient. Formula (4.8) shows that the value of the
standard error of the regression coefficient is s/sqrt(n_j). Note that s is the residual standard
error for the regression.
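Formula (4.8) can be checked directly on simulated data (a sketch, not the UN data): in a one-factor, no-intercept fit, each coefficient's standard error equals s/sqrt(n_j).

```r
# Simulated check of formula (4.8): SE of each cell-mean coefficient is s/sqrt(n_j)
set.seed(3)
g <- factor(rep(c("A", "B"), times = c(30, 70)))
y <- rnorm(100, mean = c(2, 4)[as.integer(g)])
fit <- lm(y ~ g - 1)
s   <- summary(fit)$sigma    # residual standard error
nj  <- table(g)              # group sizes n_j
all.equal(unname(coef(summary(fit))[, "Std. Error"]),
          as.numeric(s / sqrt(nj)))   # TRUE
```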

17. You can do a box plot by Region to illustrate any difference.

par(mfrow=c(1,1)) # resets the parameter back to one plot per window

boxplot(LN.Health ~ Region.f, data = UN.NA, main="Plot of Log(Health Expenditures)
   By Region") # this needs some tinkering to get all of the labels plotted

18. Finally run a regression on LN.Health ~ Region.f + FERTILITY + PHYSICIAN. What
if the question was asked whether Region is a good variable to include in the model? We do a
partial F-test.

RegModel.3 <- lm(LN.Health ~ Region.f + FERTILITY + PHYSICIAN, data=UN.NA)
summary(RegModel.3)

RegModel.3a <- lm(LN.Health ~ FERTILITY + PHYSICIAN, data=UN.NA)
summary(RegModel.3a)

You would then do a partial F-test using the hypothesis: H0: Region should not be in the
model, versus H1: Region should be in the model. You would use equation (4.4) (or a variant)
to calculate. Here p = 7, df_full = 165, R²_full = 0.7894 and R²_reduced = 0.6327:

F-ratio = (R²_full - R²_reduced)/(1 - R²_full) × (n - (k + p + 1))/p = 17.539

(Note: This variation of formula (4.4) reflects that the full model has k + p variables, while the
reduced model has k variables. Thus the numerator has p degrees of freedom.)
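The arithmetic can be reproduced directly from the quantities quoted above:

```r
# Partial F statistic from the reported R-squared values
R2.full <- 0.7894
R2.red  <- 0.6327
p  <- 7      # Region.f contributes 7 coefficients
df <- 165    # n - (k + p + 1) = 175 - 10
F.ratio <- (R2.full - R2.red) / (1 - R2.full) * df / p
round(F.ratio, 3)   # 17.539, matching the value above
qf(0.95, p, df)     # 5% critical value; F.ratio far exceeds it, so reject H0
```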

19. Otherwise, you can use the anova command. BUT, you need to be careful!! The ANOVA table
groups the Region factor as one variable.

RegModel.3 <- lm(LN.Health ~ Region.f + FERTILITY + PHYSICIAN, data=UN.NA)
anova(RegModel.3)

RegModel.3b <- lm(LN.Health ~ FERTILITY + PHYSICIAN + Region.f, data=UN.NA)
anova(RegModel.3b)

Same model, but the Region.f variable is in different positions in the R statement. Without going
into much explanation, if you are going to use this command, the order must be right. The second
one is the correct order.
Another command to get the appropriate output is: drop1(RegModel.3, ~., test = "F"). It uses
the original model, but does a different method to get the F statistics. Note that the F stat of
the anova output matches the output of the partial F-test.
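A simulated sketch (toy data, not the UN data) of why the order matters: anova() is sequential, so a term's F statistic agrees with drop1() only when that term is entered last.

```r
# Sequential anova vs drop1: the factor's F-test agrees only when it is last
set.seed(8)
g <- factor(rep(c("a", "b", "c"), each = 15))
x <- rnorm(45)
y <- as.integer(g) + 0.5 * x + rnorm(45)
m <- lm(y ~ x + g)                            # g entered last
a <- anova(m)
d <- drop1(lm(y ~ g + x), ~ ., test = "F")    # order is irrelevant for drop1
all.equal(a["g", "F value"], d["g", "F value"])   # TRUE
```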

library(Hmisc) # for correlations


library(car) # vif
library(plyr) # for means by group (ddply)
library(psych) # for nice descriptives and means by group
library(Rcmdr)

rm(list=ls())
UN <- read.table("d:/ActSci654/DataForFrees/UNLifeExpectancy.csv",
   header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

str(UN)

hist(UN$HEALTHEXPEND)

UN$LN.Health <- log(UN$HEALTHEXPEND)

par(mfrow=c(1,2))
hist(UN$LN.Health)
qqnorm(UN$LN.Health)

UN.sub1 <- UN[,c("LN.Health","REGION","FERTILITY","PHYSICIAN")]

rcorr(as.matrix(UN.sub1), type="pearson")

UN$Region.f <- factor(UN$REGION, labels=c("Arab States", "East Asia & Pacific",


"Latin American & Carribean","South Asia","Southern Europe",
"Sub-Saharan Africa","Central & Eastern Europe",
"High-Income OECD"))

levels(UN$Region.f)

UN.sub2 <- UN[,c("LN.Health","REGION","Region.f")]

rcorr(as.matrix(UN.sub2), type="pearson")

## just delete NA’s for those variables


UN.NA <- UN[!is.na(UN$LN.Health) & !is.na(UN$REGION) &
   !is.na(UN$FERTILITY) & !is.na(UN$PHYSICIAN),]

str(UN.NA)

#### Part a
RegModel.1 <- lm(LN.Health ~ REGION, data=UN.NA)
summary(RegModel.1)
anova(RegModel.1)

RegModel.1a <- lm(LN.Health ~ REGION, data=UN)


summary(RegModel.1a)

RegModel.2 <- lm(LN.Health ~ Region.f, data=UN.NA)



summary(RegModel.2)
anova(RegModel.2)

RegModel.2a <- lm(LN.Health ~ Region.f - 1, data=UN.NA)


summary(RegModel.2a)

ddply(UN.NA, ~ Region.f, summarise, count=length(LN.Health),
      mean=mean(LN.Health), sd=sd(LN.Health))

par(mfrow=c(1,1))
boxplot(LN.Health ~ Region.f, data = UN.NA, main="Plot of Log(Health Expenditures)
   By Region")

RegModel.3 <- lm(LN.Health ~ Region.f + FERTILITY + PHYSICIAN, data=UN.NA)


summary(RegModel.3)

RegModel.3a <- lm(LN.Health ~ FERTILITY + PHYSICIAN, data=UN.NA)


summary(RegModel.3a)

anova(RegModel.3)
RegModel.3b <- lm(LN.Health ~ FERTILITY + PHYSICIAN + Region.f, data=UN.NA)
anova(RegModel.3b)
drop1(RegModel.3,~.,test="F")

17 Influential Points and Residual Analysis


This exercise demonstrates ways of analyzing the residuals and how to use these to modify your model.

1. The code that I used in R is attached if you want to use it.

2. The exercise uses simulated data from the course website.

3. Please remember to save your R scripts if you are working in Rcmdr as you are working just in
case your system freezes.

4. Start R/Rcmdr. You can use Rcmdr to submit a couple of lines at one time. That might be
easier than using the R GUI.6 Also recall that R is case-sensitive and space-sensitive.

5. You will need the following libraries for this exercise:

library(car) # avPlots and vif


library(MPV) # gets PRESS statistic
library(psych) # for correlation

6. Read in [Link] and do a str() to see the variables. y is the dependent variable.

7. Do a correlation matrix and a scatterplot matrix on these variables. Look at the relationships.

8. You decide to run a regression on all of the variables. What are your conclusions?

9. Now let us look at the residual plots and see if there are other things going on.

(a) Use avPlots(Model), where Model is your model name. These are the added variable
plots that we introduced in Chapter 3. What do these indicate? Anything?
(b) Use vif(Model), where Model is your model name, to get the variance inflation factors.
What does this indicate? Any collinearity amongst the explanatory variables?
(c) We can examine the leverage in a couple of ways:

leverage <- hatvalues(Model)


leverage
lev <- hat(model.matrix(Model))
lev

summary(leverage)
hist(leverage)

plot(hat(model.matrix(Model)), type = "h")

6 Also remember RStudio and TinnR are great substitutes for Rcmdr where you can use a script window and have an
output window.

The hatvalues(Model) or hat(model.matrix(Model)) command is a way to save the leverage
values (h_ii). You can summarize the leverage to see descriptive statistics. What is
the average value of the leverage, and does this tie to what we discussed in class? What
observations can you make about the histogram? What does the plot of the leverages tell
you?
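One fact worth checking from class: the leverages always average to (k+1)/n, the number of regression coefficients over the sample size. A simulated sketch (not the class data):

```r
# Simulated check: mean leverage equals (k+1)/n
set.seed(4)
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
all.equal(mean(hatvalues(fit)), 3 / n)   # TRUE: (k+1)/n with k = 2
```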
(d) We can also examine the residuals. We talked about residuals, standardized residuals, and
studentized residuals.
## summarize the 3 types

summary(residuals(Model))
summary(rstandard(Model))
summary(rstudent(Model))

## histograms of the 3 types


par(mfrow=c(1,3))
hist(residuals(Model))
hist(rstandard(Model))
hist(rstudent(Model))

## plots of the 3 types


par(mfrow=c(1,3))
plot(rstudent(Model), type = "h", main = "Studentized")
plot(rstandard(Model), type = "h", main = "Standardized")
plot(residuals(Model), type = "h", main = "Residuals")

Any concerns here?
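To see how the types relate, here is a simulated sketch (not the class data): standardized residuals rescale each raw residual by s*sqrt(1 - h_ii); studentized residuals do the same but with the leave-one-out estimate of s.

```r
# Simulated check: rstandard(fit) == residuals / (s * sqrt(1 - leverage))
set.seed(5)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)
e <- residuals(fit)
h <- hatvalues(fit)
s <- summary(fit)$sigma
all.equal(unname(rstandard(fit)), unname(e / (s * sqrt(1 - h))))   # TRUE
```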


(e) Finally we discussed the PRESS and AIC statistics.

PRESS(Model)
AIC(Model)

These two results are not useful by themselves, but in relation to other model summaries.
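If the MPV package is unavailable, PRESS can be computed from quantities lm() already provides, since the leave-one-out prediction error is e_i/(1 - h_ii):

```r
# PRESS from ordinary residuals and leverages (no extra package needed)
set.seed(6)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press   # matches what MPV's PRESS(fit) reports
```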
(f) From the original summary statistics, it seems that x3 is not significant, but the original
scatterplots looked like there was some relationship to x2. So run two other models, one of
y ~ x1 + x2 and the other of y ~ x1 + x3. Do a vif, PRESS, and AIC on each of these
models and compare the results to the model using x1, x2, x3.
(g) As final information, the "true" model is: y <- 0.5 + 5*x1 + 7*x2 + error, where error ~
N(0, sigma = 2). Do your results approximate truth?

10. Read in [Link] and do a str() to see the variables. y is the dependent variable.

11. Do a correlation matrix and a scatterplot matrix on these variables. Look at the relationships.

12. You decide to run a regression on x1, x2, x4. What are your conclusions? Get the vif, PRESS, and
AIC for comparison purposes to later models.

13. Now let us look at the residual plots and see if there are other things going on.

(a) Use avPlots(Model), where Model is your model name. Here are the added variable
plots that we introduced in Chapter 3. What do these indicate? Anything?
(b) We can also examine the residuals. It looks like something interesting is going on with
respect to the residuals and x4.

resid.mod1b <- rstandard(Model.1b)

colx4.f = c("red","blue")[as.factor(data$x4)]

par(mfrow = c(2,2), oma = c(0,0,2,0))


plot(data$x1,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x2,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x3,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x4,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
mtext("Residual Analysis for Additional Relationships",
outer = TRUE, cex = 1.5)

Here we color-coded the binary variable x4. Look at the relationship of the residuals from the
regression and x3. Interesting. Even though the simple correlations showed no relationship
between y and x3, now there looks to be something going on when we look at the residuals
by group.
(c) So now include x3 in the model and see if things improve. Do the avPlots, vif, PRESS, and
AIC. Do the above color-coded residual plots. The answer is: something else is going on.
(d) Let's add an interaction term. You could create an additional variable x5 <- x3*x4 or use
x3:x4 in the regression equation.

Model.1a <- lm(y ~ x1 + x2 + x3 + x4 + x3:x4,data=data )


summary(Model.1a)

Do all of the above avPlots, vif, PRESS, AIC and residual plots and see what you find.
(e) As final information, the "true" model is: y <- 1.5 - 5*x1 + 7*x2 + 2*x3 + 3.33*x4 - 2.5*x5
+ error, where x5 = x3*x4 and error ~ N(0, sigma = 2.7). Do your results approximate
truth?

library(car) # avPlots and vif


library(MPV) # gets PRESS statistic
library(psych) # for correlation

rm(list=ls())

data <- [Link](file="d:/ActSci654/ClassExamples/[Link]",


header = TRUE)
str(data)

corr.test(data, use = "pairwise", method = "pearson")

pairs(~data$y + data$x1 + data$x2 + data$x3,


main="Simple Scatterplot Matrix")

Model <- lm(y ~ x1 + x2 + x3, data=data )


summary(Model)

avPlots(Model, id.method="identify") # added variable plot
                                     # with identify option

## collinearity
vif(Model)

## ways to get leverage


leverage <- hatvalues(Model)
leverage
lev <- hat(model.matrix(Model))
lev

summary(leverage)
hist(leverage)

plot(hat(model.matrix(Model)), type = "h")

## summarize the 3 types

summary(residuals(Model))
summary(rstandard(Model))
summary(rstudent(Model))

## histograms of the 3 types


par(mfrow=c(1,3))
hist(residuals(Model), main = "Residuals")
hist(rstandard(Model), main = "Standardized")
hist(rstudent(Model), main = "Studentized")

## plots of the 3 types


par(mfrow=c(1,3))
plot(rstudent(Model), type = "h", main = "Studentized")
plot(rstandard(Model), type = "h", main = "Standardized")
plot(residuals(Model), type = "h", main = "Residuals")

## cook’s distance
par(mfrow=c(1,1))
plot(cooks.distance(Model), type ="h")

## Press and AIC statistic


PRESS(Model)
AIC(Model)

## try other models


Model.a <- lm(y ~ x1 + x2, data=data )
summary(Model.a)
vif(Model.a)
PRESS(Model.a)

Model.b <- lm(y ~ x1 + x3, data=data )


summary(Model.b)
vif(Model.b)
PRESS(Model.b)
AIC(Model.b)

################# New Exercise #####################################


par(mfrow=c(1,1))

rm(list=ls())

data <- [Link](file="d:/ActSci654/ClassExamples/[Link]",


header = TRUE)
str(data)

corr.test(data, use = "pairwise", method = "pearson")

pairs(~data$y + data$x1 + data$x2 + data$x3 +


data$x4, main="Simple Scatterplot Matrix")

Model.1b <- lm(y ~ x1 + x2 + x4, data=data )


summary(Model.1b)

vif(Model.1b)

PRESS(Model.1b)
AIC(Model.1b)

avPlots(Model.1b) # added variable plot


# no identify option

## check out residuals of x4 on x1, x2, x3


resid.mod1b <- rstandard(Model.1b)

colx4.f = c("red","blue")[as.factor(data$x4)]

par(mfrow = c(2,2), oma = c(0,0,2,0))


plot(data$x1,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x2,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x3,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x4,resid.mod1b, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
mtext("Residual Analysis for Additional Relationships",
outer = TRUE, cex = 1.5)

Model.1 <- lm(y ~ x1 + x2 + x3 + x4, data=data )


summary(Model.1)

avPlots(Model.1) # added variable plot

## all plots
par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
plot(Model.1, which = c(1:6))

vif(Model.1)

PRESS(Model.1)
AIC(Model.1)

## check out residuals of x4 on x1, x2, x3


resid.mod1 <- rstandard(Model.1)

colx4.f = c("red","blue")[as.factor(data$x4)]

par(mfrow = c(2,2), oma = c(0,0,2,0))


plot(data$x1,resid.mod1, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)

plot(data$x2,resid.mod1, col=colx4.f, pch=20, cex=2)


abline(h=0,col=3,lty=3)
plot(data$x3,resid.mod1, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
plot(data$x4,resid.mod1, col=colx4.f, pch=20, cex=2)
abline(h=0,col=3,lty=3)
mtext("Residual Analysis for Additional Relationships",
outer = TRUE, cex = 1.5)

## add interaction term

Model.1a <- lm(y ~ x1 + x2 + x3 + x4 + x3:x4,


data=data )
summary(Model.1a)

avPlots(Model.1a) # added variable plot

# Influence Plot
par(mfrow=c(1,1))
influencePlot(Model.1a, id.method="identify",
   main="Influence Plot",
   sub="Circle size is proportional to Cook's Distance" )

# Cook’s D plot
# identify D values > 4/(n)
cutoff <- 4/(nrow(model.matrix(Model.1a)))
plot(Model.1a, which=4, cook.levels=cutoff)

par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))


plot(Model.1a, which = c(1:6))

vif(Model.1a)

PRESS(Model.1a)
AIC(Model.1a)

18 Impact of Outlier on Regression Results


This exercise will use the dataset [Link] on the Regression Book Website. These data are those
used in the example on page 43 of the textbook. We will focus on just the first two columns of the
dataset: X and Y .

1. The code that I used in R is attached if you want to use it.

2. Start R/Rcmdr. You can use Rcmdr to submit a couple of lines at one time. That might be
easier than using the R GUI. Also recall that R is case-sensitive and space-sensitive.

3. For this exercise, we will use the package calibrate to help with labeling on the graphs. You will
need to install the package and then load it with library(calibrate).

4. Read in the data into a dataset named Outlier and do a str() to see the information within the
dataset. It is a small dataset, so it would be helpful to view it. You can type the name of the
dataset to view. See how the data are listed: the 19 base points plus 3 extra points that are A,
B, and C in Figure 2.7.

5. Let us subset the data so that we have 4 datasets: (1) 19 base points, (2) 19 base points + point
A, (3) 19 base points + point B, and (4) 19 base points + point C.

Outlier.0 <- Outlier[c(1:19), 1:2]

Outlier.A <- Outlier[c(1:19,20), 1:2]

Outlier.B <- Outlier[c(1:19,21), 1:2]

Outlier.C <- Outlier[c(1:19,22), 1:2]

Recall that c() combines items and that Outlier[rows, columns] subsets the dataset. The first line
then pulls the first 19 rows and first two columns of Outlier. The last line pulls the first 19 rows plus
the 22nd row, along with the first two columns. You can view the data sets by typing the name.

6. This next set of code is where you need the calibrate package. You can type:

plot(Outlier$X,Outlier$Y, pch = 19, cex = 1.5, main="Plot of the Outlier Data")


textxy(4.3,8.0, "A", cx = 1.0, dcol = "darkred") # puts label at x-y coordinate
textxy(9.5,8.0, "B", cx = 1.0, dcol = "royalblue4")
textxy(9.5,2.5, "C", cx = 1.0, dcol = "forestgreen")

The first line is our usual plot command, where we make the dots solid, a little bigger, and a
title. The next 3 lines are ways are labeling the data points. The first entries are the (x, y) values
of the point, how you would want them to be labeled, the relative size, and the color. Note the
commands are slightly different than the basic R plot. (There are other ways of doing this too;
this is just one way.)

7. Run regression models on Y ∼ X for each of the four models and summarize the results. Note
how different are the intercept and slope terms and compare your results to that on page 44.

8. Let us put all of the plots on one page for easier reference:

dev.new() # creates a new plot


par(mfrow=c(2,2), oma = c(0,0,2,0))
plot(Outlier.0$X,Outlier.0$Y, main="No Influential Points")
abline(Model.0)
plot(Outlier.A$X,Outlier.A$Y, main="Base Points + A")
abline(Model.A)
plot(Outlier.B$X,Outlier.B$Y, main="Base Points + B")
abline(Model.B)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C")
abline(Model.C)
mtext("Plots of Data vs. Least Squares Lines", outer = TRUE, cex = 1.5)

This is nice, but it is hard to compare as the axes are of different scales. So the following set of
code corrects this issue:

dev.new()
par(mfrow=c(2,2), oma = c(0,0,2,0))
plot(Outlier.0$X,Outlier.0$Y, main="No Influential Points", xlim= c(0,10),
ylim = c(0,8))
abline(Model.0)
plot(Outlier.A$X,Outlier.A$Y, main="Base Points + A", xlim= c(0,10),
ylim = c(0,8))
abline(Model.A)
plot(Outlier.B$X,Outlier.B$Y, main="Base Points + B", xlim= c(0,10),
ylim = c(0,8))
abline(Model.B)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C", xlim= c(0,10),
ylim = c(0,8))
abline(Model.C)
mtext("Plots of Data vs. Least Squares Lines", outer = TRUE, cex = 1.5)

Here the xlim and ylim set the limits for the x and y axes. Note the c() notation again here.

9. One approach to handling outliers and high leverage points is through the use of indicator vari-
ables. One way to do this is the following:

## create an indicator variable and run the regression


Outlier.C$C <- 1*(Outlier.C$X == 9.5)
Model.C1 <- lm(Y ~ X + C, data=Outlier.C)
summary(Model.C1)

Essentially, 1*(Outlier.C$X == 9.5) creates an indicator variable, with a 1 when the condition
is true and 0 otherwise. This new variable, C, is included in the model as an additional
explanatory variable. We will learn more about extra variables in chapter 3.

10. Note the output from the regression. Look at the intercept and slope terms. Which model does it
resemble? What happens to the fitted value/residual value for the observation with an X = 9.5?

11. Now let us plot all of the graphs for comparison. Same code as above (so just copy the prior code
and add the last 3 lines (other than the mtext command).

## plot original data and fit


dev.new()
par(mfrow=c(3,2), oma = c(0,0,2,0))
plot(Outlier.0$X,Outlier.0$Y, main="No Influential Points", xlim= c(0,10),
ylim = c(0,8))
abline(Model.0)
plot(Outlier.A$X,Outlier.A$Y, main="Base Points + A", xlim= c(0,10),
ylim = c(0,8))
abline(Model.A)
plot(Outlier.B$X,Outlier.B$Y, main="Base Points + B", xlim= c(0,10),
ylim = c(0,8))
abline(Model.B)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C", xlim= c(0,10),
ylim = c(0,8))
abline(Model.C)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C + Indicator on C",
xlim= c(0,10), ylim = c(0,8))
abline(Model.C1)
mtext("Plots of Data vs. Least Squares Lines", outer = TRUE, cex = 1.5)

#### Outlier examples

rm(list=ls())
library(calibrate) # to get text labels in graphs

Outlier <- read.csv("D:/ActSci654/DataForFrees/[Link]", header=TRUE)
header=TRUE)

# shows variables in dataset


str(Outlier)

## list out dataset


Outlier

Outlier.0 <- Outlier[c(1:19), 1:2]


Outlier.A <- Outlier[c(1:19,20), 1:2]
Outlier.B <- Outlier[c(1:19,21), 1:2]
Outlier.C <- Outlier[c(1:19,22), 1:2]

## plot "big" data set with all outliers, together with text labels
plot(Outlier$X,Outlier$Y, pch = 19, cex = 1.5)
textxy(4.3,8.0, "A", cx = 1.0, dcol = "darkred") # puts label at x-y coordinate
textxy(9.5,8.0, "B", cx = 1.0, dcol = "royalblue4")
textxy(9.5,2.5, "C", cx = 1.0, dcol = "forestgreen")

## run models

Model.0 <- lm(Y ~ X, data=Outlier.0)


summary(Model.0)
Model.A <- lm(Y ~ X, data=Outlier.A)
summary(Model.A)
Model.B <- lm(Y ~ X, data=Outlier.B)
summary(Model.B)
Model.C <- lm(Y ~ X, data=Outlier.C)
summary(Model.C)

## plot original data and fit


dev.new()
par(mfrow=c(2,2), oma = c(0,0,2,0))
plot(Outlier.0$X,Outlier.0$Y, main="No Influential Points")
abline(Model.0)
plot(Outlier.A$X,Outlier.A$Y, main="Base Points + A")
abline(Model.A)

plot(Outlier.B$X,Outlier.B$Y, main="Base Points + B")


abline(Model.B)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C")
abline(Model.C)
mtext("Plots of Data vs. Least Squares Lines", outer = TRUE, cex = 1.5)

## plot original data and fit -- re-do axes limits


dev.new()
par(mfrow=c(2,2), oma = c(0,0,2,0))
plot(Outlier.0$X,Outlier.0$Y, main="No Influential Points", xlim= c(0,10),
ylim = c(0,8))
abline(Model.0)
plot(Outlier.A$X,Outlier.A$Y, main="Base Points + A", xlim= c(0,10),
ylim = c(0,8))
abline(Model.A)
plot(Outlier.B$X,Outlier.B$Y, main="Base Points + B", xlim= c(0,10),
ylim = c(0,8))
abline(Model.B)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C", xlim= c(0,10),
ylim = c(0,8))
abline(Model.C)
mtext("Plots of Data vs. Least Squares Lines", outer = TRUE, cex = 1.5)

## Modified model with point C


## include an indicator variable on C

Outlier.C$C <- 1*(Outlier.C$X == 9.5)


Model.C1 <- lm(Y ~ X + C, data=Outlier.C)
summary(Model.C1)

## plot original data and fit -- re-do axes limits


dev.new()
par(mfrow=c(3,2), oma = c(0,0,2,0))
plot(Outlier.0$X,Outlier.0$Y, main="No Influential Points", xlim= c(0,10),
ylim = c(0,8))
abline(Model.0)
plot(Outlier.A$X,Outlier.A$Y, main="Base Points + A", xlim= c(0,10),
ylim = c(0,8))
abline(Model.A)
plot(Outlier.B$X,Outlier.B$Y, main="Base Points + B", xlim= c(0,10),
ylim = c(0,8))
abline(Model.B)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C", xlim= c(0,10),
ylim = c(0,8))
abline(Model.C)
plot(Outlier.C$X,Outlier.C$Y, main="Base Points + C + Indicator on C",
xlim= c(0,10), ylim = c(0,8))
abline(Model.C1)
mtext("Plots of Data vs. Least Squares Lines", outer = TRUE, cex = 1.5)
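
As a side check (toy data below, not the tutorial's Outlier data sets), an indicator that equals 1 for exactly one observation effectively removes that point from the least squares fit: the intercept and slope match a regression with the point deleted, and the indicator's coefficient absorbs that point's deviation.

```r
## Toy illustration: a dummy for a single observation "deletes" it from the fit
x <- c(1:10, 9.5)
y <- c(2 + 0.5 * (1:10), 12)                       # last point is far off the line
d <- data.frame(x = x, y = y, flag = as.numeric(seq_along(x) == 11))

fit.ind  <- lm(y ~ x + flag, data = d)             # indicator on the 11th point
fit.drop <- lm(y ~ x, data = d[-11, ])             # same fit with the point removed

all.equal(unname(coef(fit.ind)[1:2]), unname(coef(fit.drop)))  # TRUE
```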

19 Example of Residual Plots


This exercise uses simulation to generate some data and then we will explore the data omitting some
variables in the regression to examine the residual plots.

1. The code that I used in R is attached if you want to use it.

2. Please remember to save your R scripts as you are working just in case your system freezes.

3. Start R/Rcmdr. You can use Rcmdr to submit a couple of lines at one time. That might be
easier than using the R GUI. Also recall that R is case-sensitive and space-sensitive.

4. Load the library Hmisc using the command: library(Hmisc).

5. Let us generate the data:

## generate the data

x1 <- 3* runif(100, min=0, max=1)


x2 <- 2* runif(100, min=0, max=1)
error <- rnorm(100, mean = 0, sd = 2)

y <- 0.5 + 5*x1 + 7*x2 + error ## true model

Here we established two explanatory variables, x1 and x2 and tied them to y in a linear model.
This is what we call ‘truth’.

6. Using the ls() command, list the objects in your workspace. Note: the way we set this up, there
is no dataset combining all variables.

7. Construct a histogram of y and a qqplot of y using:

par(mfrow = c(1,2)) # puts both the histogram and qqplot in one window
hist(y, main= "Histogram of y")

qqnorm(y)
qqline(y) # this adds a line to the qq-plot.

Each time you construct this example, that is, generate new data, you would generate different
x’s and different y’s. You can play with this example by also changing the number of random
numbers you generate in the sample. Here we generate 100.
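One practical note (an aside, not in the tutorial text): because runif and rnorm draw fresh random numbers on every run, you can fix the seed first if you want a reproducible example. The seed value 654 below is an arbitrary choice.

```r
set.seed(654)   # any integer works; fixes the random number stream
x1 <- 3 * runif(100, min = 0, max = 1)
x2 <- 2 * runif(100, min = 0, max = 1)
error <- rnorm(100, mean = 0, sd = 2)
y <- 0.5 + 5 * x1 + 7 * x2 + error

## re-running from the same seed reproduces the data exactly
set.seed(654)
x1.again <- 3 * runif(100, min = 0, max = 1)
identical(x1, x1.again)   # TRUE
```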

8. Suppose you decide just to plunge in and run a model of y on x1 only.

## example one: leave out variable x2 from model estimation


Model.1 <- lm(y ~ x1)
summary(Model.1)

resid.1 <- residuals(Model.1)

par(mfrow = c(1,3))
plot(x1, y, pch= 19, main="Plot of x1 vs y")
abline(Model.1) # puts regression line in plot
plot(x1, resid.1, pch= 19, main="Plot of x1 vs residuals")
plot(x2, resid.1, pch= 19, main="Plot of x2 vs residuals from y ~ x1")

9. Examine the plots. The one on the far left plots y vs. x1. The plot shows a nice linear relationship.
What about the one in the center? This is a plot of the residuals from the regression of y ~ x1.
So it looks like, after removing the impact of x1, there is no other relationship between y and x1,
i.e. the pattern looks random. You can do a correlation of resid.1 and x1 to confirm this. What
about the one on the far right? This is a graph between resid.1 and x2. So after you remove the
effects of x1 from y and see what you have left, there seems to be a relationship between resid.1
and x2. This is the clue to add x2 to the model. You can also check the correlation between
resid.1 and x2.

10. Just to look at this relationship another way, plot y vs. x2 and compare that against resid.1 and
x2 (residuals from y ∼ x1 vs. x2 ).

par(mfrow = c(1,2))
plot(x2, y, pch= 19, main="Plot of x2 vs y")
plot(x2, resid.1, pch= 19, main="Plot of x2 vs residuals of y ~ x1")

The pictures are similar, but different. How could you explain the difference?

11. Run a regression of y vs. x1 + x2 . See how closely your estimates are to truth. Compare your
results to the regression found by regressing y only on x1 .

12. If you had other variables, you would follow the same process by taking the residuals and plotting
against the variables not in the model. Recall the added variable plot that we learned earlier.
This approach follows what we did there.

13. Now clear your workspace so we can do another example.

rm(list=ls()) ## get rid of objects


ls() # see it is empty now

## re-generate the data


x1 <- 3* runif(100, min=0, max=1)
x2 <- 2* runif(100, min=0, max=1)
error <- rnorm(100, mean = 0, sd = 2)

## Revised Truth
y <- 0.5 + 5*x1 + 3*x1*x1 + 7*x2 + error

Notice that we added an x1-squared term to the model.

14. Run a regression of y ∼ x1 + x2

Model.2 <- lm(y ~ x1 + x2)


summary(Model.2)

resid.2 <- residuals(Model.2)

par(mfrow = c(1,2))
plot(x1, y, pch= 19, main="Plot of x1 vs y")
plot(x1, resid.2, pch= 19, main="Plot of x1 vs residuals")

The first plot shows a linear relationship between x1 and y. What does the second plot show?
Here it is the residuals (the remaining variation of y after removing the effects of x1 and x2 )
plotted against x1 . There looks to be a quadratic relationship, so this is the clue to add the
squared term.

15. Run a regression of y ~ x1 + x2 + x1^2 and compare your results against the other model and truth.
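
One caution when you code this (an aside, not from the tutorial text): inside an R formula, x1*x1 expands by the formula algebra to x1 + x1 + x1:x1, which collapses back to just x1, so no squared term is fit. Wrap the square in I() (or use poly(x1, 2)):

```r
set.seed(1)
x1 <- 3 * runif(100); x2 <- 2 * runif(100)
y <- 0.5 + 5 * x1 + 3 * x1^2 + 7 * x2 + rnorm(100, sd = 2)

wrong <- lm(y ~ x1 + x1 * x1 + x2)   # x1*x1 collapses to x1: no quadratic term
right <- lm(y ~ x1 + I(x1^2) + x2)   # I() protects ^ from formula interpretation

length(coef(wrong))   # 3 coefficients: intercept, x1, x2
length(coef(right))   # 4 coefficients: the quadratic term is now included
```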

# Residual plots

library(Hmisc)
rm(list=ls())

## generate the data


x1 <- 3* runif(100, min=0, max=1)
x2 <- 2* runif(100, min=0, max=1)
error <- rnorm(100, mean = 0, sd = 2)

y <- 0.5 + 5*x1 + 7*x2 + error ## true model

ls() ## objects in workspace

par(mfrow = c(1,2))
hist(y, main= "Histogram of y")
# normal fit
qqnorm(y)
qqline(y)

data <- [Link](y, x1, x2)


rcorr([Link](data), type="pearson")

## example one: leave out variable x2 from model estimation


Model.1 <- lm(y ~ x1)
summary(Model.1)

resid.1 <- residuals(Model.1)

par(mfrow = c(1,3))
plot(x1, y, pch= 19, main="Plot of x1 vs y")
abline(Model.1)
plot(x1, resid.1, pch= 19, main="Plot of x1 vs residuals")
plot(x2, resid.1, pch= 19, main="Plot of x2 vs residuals from y ~ x1")

# Find correlations of residuals


rcorr(x1, resid.1)
r.1 <- rcorr(x1, resid.1)
names(r.1) # way to see more digits in output
r.1$r
r.1$P

par(mfrow = c(1,2))
plot(x2, y, pch= 19, main="Plot of x2 vs y")
plot(x2, resid.1, pch= 19, main="Plot of x2 vs residuals of y ~ x1")

rcorr(x2, resid.1)

r.1a <- rcorr(x2, resid.1)


r.1a$r
r.1a$P

Model.1a <- lm(y ~ x1 + x2)


summary(Model.1a)

#######################
rm(list=ls()) ## get rid of objects
ls()

## re-generate the data


x1 <- 3* runif(100, min=0, max=1)
x2 <- 2* runif(100, min=0, max=1)
error <- rnorm(100, mean = 0, sd = 2)

## Revised Truth
y <- 0.5 + 5*x1 + 3*x1*x1 + 7*x2 + error

par(mfrow = c(1,2))
hist(y, main= "Histogram of y with squared x1")
# normal fit
qqnorm(y)
qqline(y)

## example two: leave out squared x1 from model


Model.2 <- lm(y ~ x1 + x2)
summary(Model.2)

resid.2 <- residuals(Model.2)

par(mfrow = c(1,2))
plot(x1, y, pch= 19, main="Plot of x1 vs y")
plot(x1, resid.2, pch= 19, main="Plot of x1 vs residuals")

r.2 <- rcorr(x1, resid.2)


r.2$r
r.2$P

plot(x2, y, pch= 19, main="Plot of x2 vs y")


plot(x2, resid.2, pch= 19, main="Plot of x2 vs residuals of Model.2")

r.2a <- rcorr(x2, resid.2)


r.2a$r
r.2a$P

Model.2a <- lm(y ~ x1 + I(x1^2) + x2)  ## I() needed: x1*x1 in a formula collapses to x1
summary(Model.2a)

20 Missing Data and Added Variable Approach


This exercise will consider the case when you want to create an added variable plot and the original
data set has missing values.

1. The code that I used in R is attached if you want to use it. The data uses the UNLifeExpectan-
[Link] file from the Regression Book website. Please remember to save your R scripts as
you are working just in case your system freezes.

2. Start R/Rcmdr. You can use Rcmdr to submit a couple of lines at one time. That might be
easier than using the R GUI. Also recall that R is case-sensitive and space-sensitive.

3. Load the libraries Hmisc and car with the commands: library(Hmisc) and library(car). Note the
log file to see which functions are masked with the loading of these packages.

4. Read in the data to a dataset and call it UN. Run a str(UN) command and see the list of
variables. We will focus on the variables LIFEEXP, FERTILITY, PUBLICEDUCATION,
PRIVATEHEALTH.

5. Pull these variables into a separate dataset to see correlations and missing data using

[Link] <- UN[, c("LIFEEXP", "FERTILITY", "PUBLICEDUCATION", "PRIVATEHEALTH")]

Note that we are including all of the rows of the dataset, but only a subset of the variables.

6. Using the Hmisc package, find the pair-wise correlation of these variables using the command
rcorr([Link]([Link]), type="pearson"). You can see that there are 185 observations in
the entire dataset, but in some of the pairs, many of the observations are missing.

7. You could also do a summary([Link]) to see the actual number of missing observations for each
variable.

8. Remember that a variable may be missing for one observation and another observation may have
other missing variables.
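
A quick way to see this pattern (a base-R sketch on toy data here, not the UN file) is to count the NAs per variable and then count the rows that are complete across all variables:

```r
d <- data.frame(a = c(1, NA, 3, 4),
                b = c(NA, NA, 7, 8),
                c = c(9, 10, 11, 12))

colSums(is.na(d))        # NAs per variable: a 1, b 2, c 0
sum(complete.cases(d))   # rows with no missing value anywhere: 2
```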

9. Run a regression of

lm(LIFEEXP ~ FERTILITY + PUBLICEDUCATION + PRIVATEHEALTH, data = UN)

You can see that the regression ran despite having missing values for some of the variables. The
output does indicate that it deleted 33 observations. The end result is 148 degrees of freedom.
(To get to the total number, take 148 + 4 regression coefficients + 33 missing = 185)
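
The same bookkeeping can be read directly off a fitted object; a sketch on simulated data with missing values (nobs(), df.residual(), and the na.action component are standard lm accessors):

```r
set.seed(2)
d <- data.frame(y = rnorm(20), x = rnorm(20))
d$x[1:3] <- NA                  # make 3 observations incomplete

fit <- lm(y ~ x, data = d)      # the default na.action drops the NA rows

nobs(fit)                       # 17 observations actually used
length(fit$na.action)           # 3 observations deleted due to missingness
df.residual(fit)                # 17 - 2 estimated coefficients = 15
```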

10. Let us create an added variable plot. (This first way will bring errors, so be prepared.)

11. Run a regression of

RegModel.1 <- lm(LIFEEXP ~ FERTILITY + PRIVATEHEALTH, data = UN)

Note that there are 177 residual degrees of freedom. This is the first clue to upcoming problems
(as the first regression had 148 df).

12. Run the next regression of

RegModel.2 <- lm(PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH, data = UN)

Here we only have 149 degrees of freedom.

13. We want to plot the residuals against each other in a scatterplot. Essentially RegModel.1 takes
LIFEEXP and removes the explained variation from FERTILITY and PRIVATEHEALTH,
while RegModel.2 takes PUBLICEDUCATION and removes the explained variation from
FERTILITY and PRIVATEHEALTH. The residuals of each regression represent the unex-
plained variation in LIFEEXP and PUBLICEDUCATION after controlling for FERTILITY
and PRIVATEHEALTH.

14. If you run the following statement:

RegModel.3 <- lm(residuals(RegModel.1) ~ residuals(RegModel.2))

you get an error saying that the lengths of the two vectors (meaning the residuals) differ. We
already had the clue that they would differ, from the df in the regression output.

15. If you try to blindly run a plot statement, you get the same error.

16. To correct this, we need to remove the missing data. One way would be with the following
statement:

## one way does the whole data set but removes too much
[Link] <- UN[[Link](UN),]
str([Link])

You end up with 45 observations of the original 185. Simple, but the command retains only
complete observations where values for all 16 variables in the data set exist.

17. The following commands remove the missing data for just the variables we want to include:

## another way just deletes NA’s for those variables


[Link]<-UN[![Link](UN$LIFEEXP) & ![Link](UN$PUBLICEDUCATION) &
![Link](UN$FERTILITY) & ![Link](UN$PRIVATEHEALTH),]

The & operates as and, while ![Link] keeps the observations whose value is not NA. Doing a str([Link])
shows that this dataset has 152 observations, which is what we want.
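
Equivalently (an alternative not used in the tutorial), you can subset the needed columns first and let na.omit() drop the incomplete rows; the toy data below, with an irrelevant all-NA column, also shows why filtering on the whole data set removes too much:

```r
d <- data.frame(y    = c(1, NA, 3, 4),
                x1   = c(5, 6, NA, 8),
                x2   = c(9, 10, 11, 12),
                junk = c(NA, NA, NA, NA))      # column we do not need

keep <- na.omit(d[, c("y", "x1", "x2")])       # drop NAs in relevant columns only
nrow(keep)                  # 2 rows survive
sum(complete.cases(d))      # 0 -- requiring ALL columns complete loses everything
```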

18. To finish the analysis you can run the following code:

## added variable plot


RegModel.2a1 <- lm(LIFEEXP ~ FERTILITY + PRIVATEHEALTH, data=[Link])
summary(RegModel.2a1)
## Note there are 149 degrees of freedom

RegModel.2b1 <- lm(PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH, data=[Link])


summary(RegModel.2b1)

## Note there are 149 degrees of freedom

RegModel.2c1 <- lm(residuals(RegModel.2a1) ~ residuals(RegModel.2b1))


summary(RegModel.2c1)

plot(residuals(RegModel.2b1), residuals(RegModel.2a1),
     main="Added Variable Plot of LIFEEXP vs. PUBLICEDUCATION",
     xlab="Residuals of PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH",
     ylab="Residuals of LIFEEXP ~ FERTILITY + PRIVATEHEALTH", pch = 19)

cor(residuals(RegModel.2a1),residuals(RegModel.2b1))

resid <- cbind(residuals(RegModel.2a1),residuals(RegModel.2b1))


rcorr([Link](resid), type="pearson")

Now the residuals are of equal length. Check the output to see the relationship between the
correlation and the p-values for the residuals.

library(Hmisc) # for correlations


library(car) # vif

rm(list=ls())
UN <-[Link]("d:/ActSci654/DataForFrees/[Link]",
header=TRUE, sep=",", [Link]="NA", dec=".", [Link]=TRUE)

str(UN)

[Link] <- UN[,c("LIFEEXP","FERTILITY","PUBLICEDUCATION","PRIVATEHEALTH")]


rcorr([Link]([Link]), type="pearson")

summary([Link])
sd([Link])
head([Link])

RegModel.1 <- lm(LIFEEXP ~ FERTILITY + PUBLICEDUCATION +
                 PRIVATEHEALTH, data=UN)
summary(RegModel.1)

## added variable plot part d


RegModel.2a <- lm(LIFEEXP ~ FERTILITY +PRIVATEHEALTH, data=UN)
summary(RegModel.2a)
## Note there are 177 degrees of freedom

RegModel.2b <- lm(PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH, data=UN)


summary(RegModel.2b)
## Note there are 149 degrees of freedom

## will not run


RegModel.2c <- lm(residuals(RegModel.2a) ~ residuals(RegModel.2b))
summary(RegModel.2c)

## will not plot


plot(residuals(RegModel.2b), residuals(RegModel.2a),
     main="Added Variable Plot of LIFEEXP vs. PUBLICEDUCATION",
     xlab="Residuals of PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH",
     ylab="Residuals of LIFEEXP ~ FERTILITY + PRIVATEHEALTH", pch = 19)

## one way does the whole data set but removes too much
[Link] <- UN[[Link](UN),]
str([Link])

## another way just deletes NA’s for those variables


[Link]<-UN[![Link](UN$LIFEEXP) & ![Link](UN$PUBLICEDUCATION) &
![Link](UN$ FERTILITY) & ![Link](UN$PRIVATEHEALTH),]

## added variable plot part d



RegModel.2a1 <- lm(LIFEEXP ~ FERTILITY + PRIVATEHEALTH, data=[Link])


summary(RegModel.2a1)
## Note there are 149 degrees of freedom

RegModel.2b1 <- lm(PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH, data=[Link])


summary(RegModel.2b1)
## Note there are 149 degrees of freedom

RegModel.2c1 <- lm(residuals(RegModel.2a1) ~ residuals(RegModel.2b1))


summary(RegModel.2c1)

plot(residuals(RegModel.2b1), residuals(RegModel.2a1),
     main="Added Variable Plot of LIFEEXP vs. PUBLICEDUCATION",
     xlab="Residuals of PUBLICEDUCATION ~ FERTILITY + PRIVATEHEALTH",
     ylab="Residuals of LIFEEXP ~ FERTILITY + PRIVATEHEALTH", pch = 19)

cor(residuals(RegModel.2a1),residuals(RegModel.2b1))

resid <- cbind(residuals(RegModel.2a1),residuals(RegModel.2b1))


rcorr([Link](resid), type="pearson")

21 Assessing the Model


This exercise is an expanded version of Frees Problem 5.4 and also re-illustrates letter plots shown in
Figures 3.5 and 3.7.

1. We will use the [Link] from our course website. Set a working directory and put the file
in that directory. This is a modified version of Frees 5.4.

(a) Reproduce Figure 3.5 with the following code:

Term <- [Link]("d:/ActSci654/DataForFrees/[Link]", header=TRUE)

Term2 <- subset(Term, subset=FACE > 0)


str(Term2)

Term2$[Link] <- log(Term2$FACE)


Term2$[Link] <- log(Term2$INCOME)
Term2$MARSTAT.f <- factor(Term2$MARSTAT,
labels = c("Other","Married", "LivingTogether"))
table(Term2$MARSTAT.f)

## defined as in Frees text


Term2$Single <- 1*(Term2$MARSTAT.f == "Other")
table(Term2$Single)

Term2$Sex.f <- factor(Term2$GENDER, labels = c( "Female", "Male"))


table (Term2$Sex.f)

# Example Figure 3.5


Model.1 <- lm([Link] ~ [Link] + Single, data = Term2)
summary(Model.1)
coef.1 <- Model.1$coefficients
coef.1

ccol <- c("navy", "red" )


Term2$sym <- factor(Term2$Single, labels = c("o","s"))

plot([Link] ~ [Link], data = Term2, type="n", main="Figure 3.5")


text(Term2$[Link], Term2$[Link], col=ccol[[Link](Term2$Single)],
labels=Term2$sym, font=2)
abline(a = coef.1[1], b=coef.1[2], lty = 2, col = "navy")
abline(a = coef.1[1] + coef.1[3], b=coef.1[2], lty = 1, col = "red")

Note that coef.1 is a vector with 3 entries: the intercept term, the slope term for [Link],
and the coefficient on Single. You can pull out the first entry with coef.1[1]; for the others,
change the [1] to [2] or [3] as appropriate.
I set up a vector of colors for the points and another vector of labels that correspond
to the variable Single.

The type = "n" argument in the plot statement says "start with no points". Then I add the
points with the text command, which uses the color coding and symbols defined initially.
The abline commands add the two lines: one for the singles and one for other than singles.
I could have enhanced the graph further by including a legend. You can try to do that on
your own. This graph differs slightly from the textbook, which sampled 50 points from the
data set rather than using the entire data frame.
The plot shows the distribution of [Link] by [Link], with the points indicating
whether they are single or other. This is a nice way of displaying categorical variables
and of showing 3 variables in a two-dimensional setting.
The regression lines are parallel because the model is additive and does not contain an interaction
term.
(b) Now modify the above code to reproduce Figure 3.7. Also interpret this graph.

Model.2 <- lm([Link] ~ [Link] + Single + [Link]:Single, data = Term2)


summary(Model.2)
coef.2 <- Model.2$coefficients
coef.2

ccol <- c("navy", "red" )


Term2$sym <- factor(Term2$Single, labels = c("o","s"))

plot([Link] ~ [Link], data = Term2, type="n", main="Figure 3.7")


text(Term2$[Link], Term2$[Link], col=ccol[[Link](Term2$Single)],
labels=Term2$sym, font=2)
abline(a = coef.2[1], b=coef.2[2], lty = 2, col = "navy")
abline(a = coef.2[1] + coef.2[3], b=coef.2[2] + coef.2[4],
lty = 1, col = "red")

With this model, we add an interaction term of [Link] and Single, using the ':' operator.
Thus the new coefficient vector has an additional entry.
The coding for the plot command is the same, but the second abline command needs to be
altered. Here we need to add the interaction coefficient to both the intercept term and the
slope term. Write out the equation to see why this is so, or check page 96 in the text for
the formula.
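Writing out the fitted equation makes the two abline calls transparent. Using generic names (x for the continuous regressor and s for the Single indicator, since the tutorial's variable names are abbreviated here):

```latex
\hat{y} = b_0 + b_1 x + b_2 s + b_3 (x\,s)
\qquad\Longrightarrow\qquad
\begin{cases}
\hat{y} = b_0 + b_1 x, & s = 0\\[2pt]
\hat{y} = (b_0 + b_2) + (b_1 + b_3)\,x, & s = 1
\end{cases}
```

So the second line's intercept adds the coefficient on Single and its slope adds the interaction coefficient, exactly as in the second abline call.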
(c) Which model do you like better? Please justify.
