This document outlines the tutorial for ECON20003, focusing on simple linear regression and its application using Excel and R. It provides instructions for completing tutorial exercises, calculating regression parameters, and interpreting results, including the coefficient of determination and standard errors.

ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 8

Download the t8e2, t8e3 and t8e4 Excel data files from the subject website and save them to your computer or USB flash drive. Read this handout and try to complete the tutorial exercises before your tutorial class, so that you can ask your tutor for help during the Zoom session if necessary.

After you have completed the tutorial exercises, attempt the “Exercises for assessment”. For each assessment exercise type your answer in the corresponding box available in the Quiz. If the exercise requires you to use R, insert the relevant R/RStudio script and printout in the same Quiz box below your answer. To get the tutorial mark for week 9, you must submit your answers to these exercises in the Tutorial 8 Canvas Quiz by 10am Wednesday 4 May and attend Tutorial 9.

Simple Linear Regression

A simple linear regression model relates the expected value of the dependent variable, E(Y), to a single independent variable, X, via some function that is linear in the unknown intercept and slope parameters. If this function is linear in the independent variable as well, then

E(Y) = β₀ + β₁X

By adding a random error term (ε) to both sides we obtain the population regression model:

Y = E(Y) + ε = β₀ + β₁X + ε

The OLS estimators of the unknown parameters β₀ and β₁ are

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = s_xy / s_x² ,    β̂₀ = ȳ − β̂₁ x̄

and

ŷᵢ = β̂₀ + β̂₁xᵢ

specifies the sample (or fitted) regression line.
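The estimator formulas above translate directly into code. Below is a minimal Python sketch (Python rather than R, purely as an illustration; the function name ols_simple is made up):

```python
def ols_simple(x, y):
    """OLS intercept and slope for a simple linear regression,
    computed from the deviation-from-the-mean formulas above."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx         # slope estimate (beta1-hat)
    b0 = ybar - b1 * xbar  # intercept estimate (beta0-hat)
    return b0, b1

# Noise-free data generated from y = 2 + 3x, so the estimates
# recover the true intercept and slope exactly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 + 3.0 * a for a in x]
b0, b1 = ols_simple(x, y)
print(b0, b1)  # prints 2.0 3.0
```

The same point estimates would come out of R's lm() for these data.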

The quality of the fit of this regression line to the sample data can be evaluated with the
sample coefficient of determination, which measures the proportion of the total variation in
Y that can be explained by X within the estimated regression model:

R² = SSR / SST = 1 − SSE / SST

where

SST = Σᵢ₌₁ⁿ (yᵢ − ȳ)² = (n − 1) s_y² ,    SSR = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)² ,

SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (n − 1) ( s_y² − s_xy² / s_x² )

In addition, a useful computational formula is

R² = s_xy² / ( s_x² s_y² )
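These identities (the SST = SSR + SSE decomposition and the two expressions for R²) can be verified numerically. A small Python check with made-up data:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.1, 2.8, 4.4, 4.9, 6.3]   # roughly linear made-up data

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
sx2 = sum((a - xbar) ** 2 for a in x) / (n - 1)
sy2 = sum((b - ybar) ** 2 for b in y) / (n - 1)

b1 = sxy / sx2
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * a for a in x]

SST = sum((b - ybar) ** 2 for b in y)
SSR = sum((h - ybar) ** 2 for h in yhat)
SSE = sum((b - h) ** 2 for b, h in zip(y, yhat))

# SST = SSR + SSE holds for OLS with an intercept, and the two
# R-squared formulas give the same number.
assert abs(SST - (SSR + SSE)) < 1e-9
assert abs(SSR / SST - sxy ** 2 / (sx2 * sy2)) < 1e-9
```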

Under the first five classical assumptions of linear regression (see LR1-LR5 in the week 7 lecture notes), the OLS estimators of the y-intercept and slope parameters are the best linear unbiased estimators (i.e. BLUE) and have the following sampling distributions:

β̂₀ ~ N(β₀, σ²_{β̂₀}) ,    σ_{β̂₀} = σ √[ Σᵢ₌₁ⁿ xᵢ² / ( n Σᵢ₌₁ⁿ (xᵢ − x̄)² ) ]

β̂₁ ~ N(β₁, σ²_{β̂₁}) ,    σ_{β̂₁} = σ / √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² ]

where  is the common standard deviation of all random errors. It is an unknown parameter,
just like 0 and 1, but it can be estimated from the sum of squared OLS residuals, SSE, as

SSE
s 
n2

This is called the estimated standard error of regression or the standard error of estimate. Using this estimate of σ, the estimated standard errors of the β̂₀ and β̂₁ estimators are

s_{β̂₀} = s_ε √[ ( Σᵢ₌₁ⁿ xᵢ² / n ) / ( (n − 1) s_x² ) ] ,    s_{β̂₁} = s_ε √[ 1 / ( (n − 1) s_x² ) ]

From the sampling distributions and the estimated standard errors of the β̂₀ and β̂₁ estimators, the confidence interval estimator of βᵢ (i = 0, 1) is

β̂ᵢ ± t_{α/2, n−2} s_{β̂ᵢ}

and the test statistic for H₀: βᵢ = β_{0,i} is

t_{β̂ᵢ} = (β̂ᵢ − β_{0,i}) / s_{β̂ᵢ}

Provided that H₀ is true, these test statistics have a t distribution with df = n − 2.

Recall that the sample correlation coefficient between X and Y and the OLS estimator of β₁ are

r_xy = s_xy / (s_x s_y) ,    β̂₁ = s_xy / s_x²

Combining these two formulas we obtain

r_xy = β̂₁ (s_x / s_y)

This formula shows that β̂₁ = 0 implies r_xy = 0, and vice versa. Consequently, the t-test for H₀: β₁ = 0 can also be used to test H₀: ρ_xy = 0.

After having estimated the parameters of a simple linear regression model, always ask
yourself the following three questions.

1) Does the slope estimate appear reasonable, i.e., does β̂₁ have the logical sign and an acceptable magnitude?

Compare the sign (and magnitude) of your slope estimate to what you expect based purely on common sense and/or some relevant theory.[1]

2) How good is the fit of the estimated regression model to the data?

Use the coefficient of determination to measure the proportion of the total variation
in the sample Y values that can be accounted for by the regression model and hence
by the sample variation of the independent variable in the model.

3) Is the slope estimate significant in the logical (maybe illogical) direction?

If you have a solid reason to expect a positive / negative slope parameter, perform a
right-tail / left-tail t-test for the slope, otherwise perform a two-tail t-test.

[1] You should always form your expectations before undertaking any statistical analysis, so that you are not influenced by the actual data at hand, let alone by your regression results.
Exercise 1 (Selvanathan et al., p. 736, ex. 17.3)

Using the same data as in Exercise 4 of Tutorial 7, perform the following tasks both manually
and with R.

a) Determine the sample regression line that expresses the Price of a car as a function of
the number of kilometres on the Odometer.

From the calculations in part (d) of Exercise 4, Tutorial 7,

sxy 4.077 1623.7 3601.1


ˆ1  2
  0.094 , ˆ0  y  ˆ1 x   (0.094)   19.622
s x 6.5962 100 100

and the sample regression line is

ŷᵢ = β̂₀ + β̂₁xᵢ = 19.622 − 0.094xᵢ
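As a quick cross-check of the arithmetic above, here is a Python sketch using the summary statistics quoted from Tutorial 7. With the unrounded slope the intercept comes out as about 19.61, matching R's 19.61139 below; the manual value 19.622 carries the rounded slope −0.094:

```python
s_xy = -4.077         # sample covariance of Odometer and Price
s_x = 6.596           # sample standard deviation of Odometer
ybar = 1623.7 / 100   # sample mean of Price
xbar = 3601.1 / 100   # sample mean of Odometer

b1 = s_xy / s_x ** 2
b0 = ybar - b1 * xbar
print(round(b1, 4), round(b0, 2))  # prints -0.0937 19.61
```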

To estimate the regression model with R, launch RStudio, create a new project and
script, and name them t8e1. Import the data from the t7e4 Excel file and attach it to your
project.

A linear regression can be set up and estimated in R with the lm() function, which has the following general format:[2]

lm(formula = y ~ x1 + x2 + …)

where y is the dependent variable and x1, x2 etc. denote the independent variables. By
default, the model has an implied intercept term, and the “formula =” part of the specification
can be omitted for the sake of brevity.

In the current example, the dependent variable is Price and the single independent
variable is Odometer, so execute

lm(Price ~ Odometer)

In return, RStudio displays the following printout in the Console:

[2] This specification of the lm() function assumes that the regression is going to be performed on the active data set. Otherwise, lm() has to be augmented with the data = [data source] argument.
This printout is rather succinct; it just echoes the model specification and shows the
point estimates of the intercept and slope parameters that we just calculated manually.

We can access further regression results by saving the model under some name and then using the

summary(name)

generic function, which produces summaries of the results of various model-fitting functions like lm().

Suppose, for example, that we call the model m and execute the

m = lm(Price ~ Odometer)

command. In return, R re-estimates the regression and saves the model object on your
Environment tab.

Then, the

summary(m)

command produces the following printout:

It echoes the lm() command and shows some location statistics of the residuals, the
regression coefficients with the corresponding standard errors, t-ratios and p-values,
and some useful summary measures of the goodness of fit, like the standard error of
regression, the unadjusted and adjusted R2 statistics, and the details of the F-test of
overall significance.

Starting with the estimates of the coefficients, the sample regression equation is

Price-hatᵢ = β̂₀ + β̂₁ Odometerᵢ = 19.61139 − 0.093705 Odometerᵢ

You can plot this sample regression equation by executing the following commands:

plot(Odometer, Price,
main = "Scatterplot of Price versus Odometer", col = "blue", pch = 19)
abline(lm(Price ~ Odometer), col = "red")

The blue dots representing the observations are scattered rather loosely around the red fitted (sample) regression line. This suggests that this simple linear regression model does not fit the sample data very well.

b) Interpret the coefficients in the actual context of this exercise.

In general, the y-intercept coefficient is the estimated expected value of the dependent variable when the independent variable(s) are equal to zero. Although it is an essential mathematical term, in practical applications it has no logical interpretation when the independent variables cannot assume the value zero, or when all sample values are relatively far from zero. From a practical point of view, the really important coefficient is the slope coefficient. It shows the estimated change in the conditional expected value of the dependent variable due to a one-unit increase in the independent variable.

In this example, the independent variable is Odometer. It can be zero, but since the
sample includes used cars only, zero is not a realistic value. In fact, as you can easily
verify, in the sample the Odometer values range between 19.1 and 49.2, so it is better
not to assign any interpretation to the y-intercept estimate. The slope estimate is about
−0.094, which suggests that with each additional 1000 km on the odometer the price of a used car decreases on average by $1000 × 0.094, i.e. by $94.

In Exercise 4, Tutorial 7 we already discussed that, ceteris paribus, cars with higher
odometer reading sell for less. This implies that 1 < 0 and our slope estimate is indeed
negative. Its magnitude is also reasonable, at least it does not sound astounding that
an extra 10,000 km reduces the price by $940.

c) Calculate the coefficient of determination. What does its value tell you about the
relationship between the two variables, Odometer and Price?

Since we already know the sample standard deviations and covariance, we can use the
computational formula:

R² = s_xy² / ( s_x² s_y² ) = (−4.077)² / ( 6.596² × 0.765² ) ≈ 0.653

Also recall that the coefficient of determination is the square of the correlation coefficient between the dependent variable and its estimate, which, in the case of a simple linear regression model, is the same as the square of the correlation coefficient between the independent and the dependent variables. Hence, since in part (d) of Exercise 4, Tutorial 7 we already found the correlation coefficient between Odometer and Price,

R² = r²_{x,y} = (−0.808)² ≈ 0.653

On the previous R printout (see p. 5) this statistic is called Multiple R-squared and its
value is 0.6533.

The sample coefficient of determination indicates that given this model, about 65% of
the total variation in the auction Price is explained by, or due to, the variation in the
Odometer readings, while 35% remains unexplained.

d) Find the estimated standard error of the regression model and the standard errors of
the intercept and slope estimators.

The standard error of the regression model, also known as the standard error of
estimate, is

s_ε = √( SSE / (n − 2) )   where   SSE = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (n − 1) ( s_y² − s_xy² / s_x² )

Therefore,

SSE = (n − 1) ( s_y² − s_xy² / s_x² ) = 99 × ( 0.765² − (−4.077)² / 6.596² ) ≈ 20.114

s_ε = √( SSE / (n − 2) ) = √( 20.114 / 98 ) ≈ 0.453

On the R printout (p. 5) this statistic is called Residual standard error and it is equal to
0.4526.
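The same arithmetic takes only a few lines of Python (a sketch; the summary statistics are those used above):

```python
import math

n = 100
s_y, s_x, s_xy = 0.765, 6.596, -4.077   # summary statistics from Tutorial 7

SSE = (n - 1) * (s_y ** 2 - s_xy ** 2 / s_x ** 2)
s_eps = math.sqrt(SSE / (n - 2))

print(round(SSE, 3), round(s_eps, 3))   # prints 20.114 0.453
```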

On its own the standard error of the regression model is not particularly informative, but
it is necessary for confidence interval estimation and hypothesis testing because the
standard errors of the y-intercept and slope estimators are related to it.

These standard errors are

s_{β̂₀} = s_ε √[ ( Σᵢ₌₁ⁿ xᵢ² / n ) / ( (n − 1) s_x² ) ] ,    s_{β̂₁} = s_ε √[ 1 / ( (n − 1) s_x² ) ]

The first formula requires the sum of squared odometer observations,[3] which is 133986.6. Therefore,

s_{β̂₀} = s_ε √[ ( Σᵢ xᵢ² / n ) / ( (n − 1) s_x² ) ] = 0.453 × √[ (133986.6/100) / (99 × 6.596²) ] ≈ 0.253

and

s_{β̂₁} = s_ε √[ 1 / ( (n − 1) s_x² ) ] = 0.453 × √[ 1 / (99 × 6.596²) ] ≈ 0.0069

On the R printout (p. 5) these statistics are displayed in the Std. Error column.
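These two standard error calculations can be cross-checked in a few lines of Python (a sketch using the figures above):

```python
import math

n = 100
s_eps = 0.453       # standard error of estimate
s_x = 6.596         # sample standard deviation of Odometer
sum_x2 = 133986.6   # sum of the squared Odometer observations

sxx = (n - 1) * s_x ** 2                 # sum of squared deviations of x
se_b0 = s_eps * math.sqrt(sum_x2 / n / sxx)
se_b1 = s_eps * math.sqrt(1.0 / sxx)

print(round(se_b0, 3), round(se_b1, 4))  # prints 0.253 0.0069
```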

e) Determine the 95% confidence interval estimates of 0 and 1.

The relevant formula is

β̂ᵢ ± t_{α/2, n−2} s_{β̂ᵢ}

[3] You can get this by executing the sum(Odometer^2) command.
From the t-table the reliability factor is t_{0.025, 98} ≈ t_{0.025, 100} = 1.984. Therefore,

β̂₀ ± t_{α/2, n−2} s_{β̂₀} = 19.622 ± 1.984 × 0.253 = (19.120, 20.124)

β̂₁ ± t_{α/2, n−2} s_{β̂₁} = −0.094 ± 1.984 × 0.0069 = (−0.108, −0.080)

To obtain these confidence intervals in R, execute the

confint(m, level=0.95)

command. It returns the 95% confidence interval for each coefficient:

Since we did not interpret the y-intercept estimate, do not worry about the confidence interval for β₀ either.

As for β₁, its 95% confidence interval suggests that with each additional 1000 km on the odometer, the price of a used car decreases on average by about $1000 × 0.107 = $107 to $1000 × 0.080 = $80.

f) Test to determine whether there is enough evidence to infer that a linear relationship
exists between the price of a car and its odometer reading. Use a significance level of
5%.

If there is a linear relationship between the independent and the dependent variable,
then the slope parameter is different from zero. Therefore, the question implies a two-
tail test and the following hypotheses:

H₀: β₁ = 0    vs.    HA: β₁ ≠ 0

The test statistic is

t_{β̂₁} = (β̂₁ − β_{0,1}) / s_{β̂₁} ~ t_{n−2}

Since the significance level is 5%, the absolute values of the lower and upper critical values equal the reliability factor in part (e). Accordingly, H₀ is to be rejected if the observed test statistic value is smaller than −1.984 or larger than 1.984.

The observed or calculated test statistic is

t_{β̂₁} = β̂₁ / s_{β̂₁} = −0.094 / 0.0069 ≈ −13.623

It is well below the lower critical value (−1.984), so at the 5% significance level we reject H₀ and conclude that the slope parameter is different from zero, i.e. there is a linear relationship between the odometer reading and the auction price of cars. Note that this also implies that the odometer reading and the auction price of cars are correlated with each other.

On the R printout (p. 5) the test statistics for H₀: βᵢ = 0 (i = 0, 1) are in the t value column, while the two-tail p-values are in the Pr(>|t|) column.[4] The t-statistic for Odometer, i.e. for β₁, is −13.59 and the corresponding p-value is practically zero, so H₀ can be rejected at any reasonable significance level.

Multiple Linear Regression

The multiple linear regression model is the generalization of the simple linear regression model to more than one (k > 1) independent variable. Hence, the population regression model becomes:

Y = E(Y) + ε = β₀ + β₁X₁ + β₂X₂ + … + β_kX_k + ε

Based on the least squares principle, the unknown y-intercept and slope parameters of this model (β₀ and β₁, β₂, …, β_k) can be estimated similarly to those of a simple linear regression model. The calculations are more demanding, but the idea is the same: given the sample, find the set of y-intercept and slope estimates that minimizes the sum of squared errors (SSE).[5]

The resulting sample regression equation is

Ŷ = β̂₀ + β̂₁X₁ + β̂₂X₂ + … + β̂_kX_k

The fit of a multiple linear regression model can be assessed with the adjusted coefficient
of determination, which is R2 adjusted to the sample size (n) and to the number of
independent variables in the model (k):

R̄² = 1 − [ SSE / (n − k − 1) ] / [ SST / (n − 1) ] = 1 − (1 − R²) (n − 1) / (n − k − 1)

When the model fits the data perfectly, i.e. each observed Y value is equal to its estimate, adjusted R² equals one, just like its unadjusted counterpart R²; otherwise adjusted R² is smaller than R² and in extreme cases can even assume negative values.[6] The difference between adjusted R² and (unadjusted) R² is relatively small when k is small compared to n and thus (n − 1) / (n − k − 1) is close to one.

[4] On regression printouts the p-values for the t-tests on the intercept and slope parameters are always reported for two-tail tests with zero hypothesized parameter values. If you are interested in a one-tail p-value, you need to check whether the point estimate has the sign implied by the alternative hypothesis and divide the reported p-value by 2.

[5] We do not consider the formulas of the OLS estimators for multiple linear regression because in this course we always estimate these models with R.

[6] In that case one can round the adjusted R² statistic to zero.

The significance of a multiple linear regression model can be tested with the F-test for overall significance.[7] In this test the null hypothesis is that all slope parameters are simultaneously zero (H₀: β₁ = β₂ = … = β_k = 0). If it is true, R² is also zero and the regression model is useless, because each ŷ is equal to the sample mean (ȳ), which can be calculated without estimating the regression model. The alternative hypothesis is that at least one slope parameter is different from zero, i.e. at least one of the independent variables has an impact on the dependent variable, and thus R² is positive.

The F-test statistic is the ratio of the mean square due to regression to the mean square due
to errors, i.e.

F = MSR / MSE = ( SSR / k ) / ( SSE / (n − k − 1) ) = ( R² / k ) / ( (1 − R²) / (n − k − 1) )

and if H0 is correct, it is distributed as an F random variable with df1 = k and df2 = n – k – 1.


The null hypothesis is to be rejected when the calculated value of this test statistic is larger than the critical value, F_{α, df1, df2}.
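The equivalence of the two expressions for the F statistic (sums of squares versus R²) can be checked with a small Python sketch (the numbers are made up; note that R² = SSR/SST with SST = SSR + SSE):

```python
def f_from_ss(SSR, SSE, n, k):
    """F statistic from the sums of squares."""
    return (SSR / k) / (SSE / (n - k - 1))

def f_from_r2(R2, n, k):
    """F statistic from the coefficient of determination."""
    return (R2 / k) / ((1 - R2) / (n - k - 1))

SSR, SSE, n, k = 30.0, 20.0, 26, 2
R2 = SSR / (SSR + SSE)

# Both routes give the same F value.
assert abs(f_from_ss(SSR, SSE, n, k) - f_from_r2(R2, n, k)) < 1e-9
```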

Under the six classical assumptions of linear regression (see Review 5 and the week 7
lecture notes), the OLS estimators of the y-intercept and slope parameters are the best
linear unbiased estimators (i.e., BLUE) and have normally distributed sampling distributions.

The estimate of the common standard deviation of all random error terms (i.e., σ) is the so-called estimated standard error of regression or the standard error of estimate:

s_ε = √( SSE / (n − k − 1) )

With s, it is possible to estimate the standard errors of the i-hat estimators8, to develop
confidence intervals for the regression parameters (i = 0, 1, …, k),

βˆi  tα / 2 ,n  k 1 s βˆ
i

and to perform t-tests on individual parameters,

ˆi  0,i
tˆ   tnk 1
i
sˆ
i

[7] This test can be used for simple linear regression models as well, but then it is equivalent to a two-tail t-test on the single slope parameter with zero hypothesized value.

[8] The standard error formulas are a bit complicated. We do not consider them because in this course we always estimate the standard errors of multiple linear regression coefficients with R.
Exercise 2 (McClave et al., p. 887, ex. 13.113)

The data saved in the t8e2 file were collected for a random sample of 26 households in
Washington DC during 2002. An economist wants to relate household food consumption
(foodcon, $1,000s) to household income (income, $1,000s) and household size (size) by
estimating the following multiple linear regression model:

foodconᵢ = β₀ + β₁incomeᵢ + β₂sizeᵢ + εᵢ

a) Given this model, do you expect the slope parameters to be positive or negative?

It seems to be reasonable to assume that given the size of a household, the higher the
household income, the higher the household food consumption. Similarly, given the
income of a household, it seems logical that larger households consume more food.
Hence, both slope parameters are expected to be positive.

b) Plot each independent variable against the dependent variable. Do the scatter plots
support your expectations about the signs of the slope parameters? Do they support the
model specification? In particular, do the partial relationships between foodcon and
income and between foodcon and size seem to be linear?

Import the data to R and develop the following scatterplots.[9]

[Two scatterplots: foodcon versus income, and foodcon versus size.]

On the first plot all points but the one in the upper-left corner are in a horizontal band
(say, between 2.5 and 5.75), giving the impression that in this sample foodcon does not
really depend on income. The second plot, on the other hand, indicates that in this
sample foodcon is positively and more or less linearly related to size.

[9] If you do not remember how to do so, see page 18 of Tutorial 2.
c) Using this data, estimate the multiple linear regression model.

Multiple linear regression models can be estimated with the lm() function of R similarly to simple linear regression models; we just need to list all independent variables joined by plus signs (+) in the model specification. Let’s call the regression model m again.
Set up and estimate the model by executing the

m = lm(foodcon ~ income + size)

command and then display a summary of the results by executing the

summary(m)

command. You should get the following printout:

The sample regression equation is

foodcon-hatᵢ = 2.794 − 0.00016 incomeᵢ + 0.383 sizeᵢ

d) Carefully explain the meanings of the estimated slope coefficients. Do they have the
logical signs?

When you interpret some slope estimates, remember that their meanings are conditional
on all independent variables in the model. Each slope estimate shows the expected
change in the dependent variable in response to a one unit increase in the
corresponding independent variable, while the other independent variables in the model
are kept constant.

Accordingly, the first slope estimate (−0.00016) means that, keeping household size constant, an extra $1,000 of household income is likely to be accompanied by a $0.16 decrease in household food consumption. The negative sign of this slope estimate is a bit surprising; one would expect food consumption to increase with income. It is probably also true, though, that above a certain level higher household income is accompanied by higher household consumption, but not necessarily by higher food consumption.

The second slope estimate (0.383) is positive as expected. It means that, keeping
household income constant, with every additional household member household food
consumption is expected to increase by $383.

e) What does the unadjusted coefficient of determination tell you about the fit of the model
to the data?

From the printout, the unadjusted coefficient of determination (R²) is 0.558. It suggests that about 56% of the total variation in household food consumption can be explained by the variations in household income and size.

f) How large is the adjusted coefficient of determination? Although this statistic is also on
the printout (called Adjusted R-squared), calculate its value manually as well using the
relationship between the unadjusted and adjusted coefficients of determination.

n 1 25
R2  1 (1  R 2 )  1  (1  0.558)  0.520
n  k 1 23

It is smaller than the unadjusted coefficient of determination and shows that, after having
adjusted for the sample size and for the number of independent variables, 52% of the
total variation in household food consumption can be explained by the variations in
household income and size.
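A one-line Python check of this calculation (illustrative only):

```python
n, k, R2 = 26, 2, 0.558
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)  # adjusted R-squared
print(round(adj_R2, 3))  # prints 0.52
```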

g) Test the overall significance of the model at the 5 percent level.

The purpose of this test is to find out whether (H0) all slope parameters are
simultaneously zero, implying that overall the model is useless; or that (HA) at least one
slope parameter is different from zero and thus the corresponding independent variable
has some impact on the dependent variable and thus the model is useful.

The hypotheses are

H₀: β₁ = β₂ = 0 ,    HA: β₁ ≠ 0 and/or β₂ ≠ 0

From the printout, the test statistic (F-statistic) is F = 14.52. The numerator and denominator degrees of freedom are k = 2 and n − k − 1 = 23, respectively. The p-value is practically zero, so we reject H₀ and conclude that the model is significant: at least one of the two independent variables (income, size) has a significant influence on the dependent variable (foodcon).

h) Find the 95% confidence intervals for the slope parameters, first manually using the estimated standard errors from the printout in part (c), and then with R. What do these confidence intervals tell you?

From the t-table the reliability factor is t_{0.025, 23} = 2.069 and the estimated standard errors of the slopes are 0.00656 and 0.07189, respectively. Therefore,

β̂₁ ± t_{α/2, n−k−1} s_{β̂₁} = −0.00016 ± 2.069 × 0.00656 = (−0.0137, 0.0134)

β̂₂ ± t_{α/2, n−k−1} s_{β̂₂} = 0.38348 ± 2.069 × 0.07189 = (0.2347, 0.5322)

To get these confidence intervals with R, execute the

confint(m, level=0.95)

command, which returns

Hence, with 95% confidence, given size, with an extra $1,000 household income food
consumption is expected to change by –14 to 13 dollars. Since zero is included in this
confidence interval, the impact of income on foodcon is insignificant at the 5% level.

Also, with 95% confidence, given income, with an extra household member food
consumption is likely to increase by $235 to $532. This confidence interval has positive
lower and upper limits, so the impact of size on foodcon is significant at the 5% level.

i) Develop and test appropriate hypotheses concerning the slope coefficients using t-tests
at the 5 percent level. What do you conclude?

Given that the logical sign for both slope parameters is positive, both t-tests are right-
tail and the hypotheses are

H₀: βᵢ = 0 ,    HA: βᵢ > 0    (i = 1, 2)

For both tests the hypothesized parameter value is zero, so the test statistics are

t_{β̂₁} = β̂₁ / s_{β̂₁} = −0.00016 / 0.00656 ≈ −0.0244

t_{β̂₂} = β̂₂ / s_{β̂₂} = 0.38348 / 0.07189 ≈ 5.3343

Using the critical value approach, the critical value is t_{0.05, 23} = 1.714; reject H₀ at the 5% significance level if the test statistic is above this value.

The first t-statistic is negative, so the first null hypothesis cannot be rejected. Hence,
there is not a significantly positive relationship between household income and food
consumption.
The second t-statistic is larger than 1.714, so we reject H0 and conclude that, given
household income, household size has a significantly positive effect on household food
consumption.

Alternatively, relying on the R printout, we can argue as follows.

The first slope estimate is negative, so we cannot reject H₀ in favour of a right-sided alternative hypothesis.

The second slope estimate is positive, so the t-ratio has the sign implied by HA, and the corresponding one-tail p-value (half of the reported two-tail p-value) is practically zero. Therefore, we reject H₀.

j) Does it appear that the normality requirement is satisfied? If not, what is the practical
implication?

According to the 6th classical assumption of the linear regression model (LR6), the
conditional distribution of the random error is normal. This assumption is not required
for the estimation of the regression model with OLS, but it makes statistical inference
about the population regression model easier, especially when the sample size is
relatively small.

Since we cannot observe the random errors (εᵢ, i = 1, …, n), we study the residuals (eᵢ) and check normality in the usual way, i.e., by illustrating their distribution graphically, calculating appropriate descriptive statistics and performing a normality test, e.g., the Shapiro-Wilk test.

Once a regression model is estimated in R, the residuals can be extracted from the model
object by the following function:

residuals(model)

where model is the name of the model object.

Execute the following command:

m_res = residuals(m)

It extracts the residuals from the m model object and saves them under the name m_res,
as you can see in the values section of your Environment tab:

Now you can apply the normality checks you learnt about in Tutorial 3 to the m_res series. Start by executing

hist(m_res, freq = FALSE, col = "lightslateblue")
lines(seq(-1.5, 3.5, by = 0.001), dnorm(seq(-1.5, 3.5, by = 0.001),
      mean(m_res), sd(m_res)), col = "red")

to obtain a relative frequency histogram of the residuals with a normal curve superimposed on it.

[Relative frequency histogram of m_res with the fitted normal curve; Density on the vertical axis.]

It shows that the distribution of the residuals is skewed to the right.

17
L. Kónya, 2022, Semester 1 ECON20003 - Tutorial 8
Next, run the following commands to develop the normal Q-Q plot shown below:

qqnorm(m_res, main = "Normal Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
       pch = 19, col = "salmon")
qqline(m_res, col = "royalblue4")

Several dots on this plot fall relatively far from the straight line, indicating that the residuals are not normally distributed.

[Normal Q-Q plot of m_res: Sample Quantiles plotted against Theoretical Quantiles.]

Finally, obtain descriptive statistics and perform the SW test by executing the following
commands:

library(pastecs)
round(stat.desc(m_res, basic = FALSE, norm = TRUE), 4)

They return

As you can see, the median and the mean are very close to each other, but skew.2SE
and kurt.2SE are both larger than one, indicating at the 5% significance level that S is
different from zero and K is different from 3. Consequently, the residuals are not normally
distributed. Finally, the p-value of the SW test (normtest.p) is zero, so we can reject the
null hypothesis of normality.

All these checks suggest that the residuals are not normally distributed, so the random errors might not be normal either. Given the limited sample size (n = 26), this means that the confidence intervals and the F- and t-tests in parts (g)-(i) might be unreliable.

Exercises for Assessment

Exercise 3

Lotteries have become important sources of revenue for governments. Many people have
criticised lotteries, however, referring to them as a tax on the poor and uneducated. In an
examination of the issue a random sample of 100 adults was asked how much they spend
on lottery tickets as a percentage of the total household income. They were also interviewed
about various socioeconomic variables, like number of years of education, age, number of
children, and personal income (in thousands of dollars). The data are stored in file t8e3.

Obtain and test appropriate correlation coefficients with R to study the following beliefs. Use α = 0.05.

a) Relatively uneducated people spend a greater proportion of their income on lotteries than do relatively educated people.

b) Older people spend a greater proportion of their income on lottery tickets than do
younger people.

c) People with more children spend a greater proportion of their income on lotteries than
do people with fewer children.

d) Relatively poor people spend a greater proportion of their income on lotteries than do
relatively rich people.

Exercise 4 (Selvanathan et al., p. 783, ex. 17.78)

The head office of a life insurance company believed that regional managers should have
weekly meetings with their salespeople, not only to keep them abreast of current market
trends but also to provide them with important facts and figures that would help them in their
sales. Furthermore, the company felt that these meetings should be used for pep talks. One
of the points the management felt strongly about was the high value of new contact initiation
and follow-up phone calls.

To dramatize the importance of phone calls on prospective clients and (ultimately) on sales,
the company undertook the following small study. Twenty randomly selected life insurance
salespeople were surveyed to determine the number of weekly calls they made and the
number of policy sales they concluded. The data (Calls and Sales) are saved in file t8e4.
Perform the following tasks with R.

a) Do you expect Calls and Sales to be related to each other? If yes, do you expect the relationship between them to be positive or negative?

b) Illustrate the data on a scattergram. What does this plot suggest about the relationship
between the two variables?

c) Find the correlation coefficient between Calls and Sales. What does this coefficient and
the corresponding t-test statistic and p-value tell you about the relationship between the
two variables? Can we rely on this t-test?

d) Find the least squares regression line that expresses the number of Sales as a function
of the number of Calls.

e) What do the coefficients tell you?

f) What proportion of the variability in the number of sales can be attributed to the variability
in the number of calls?

g) Is there enough evidence (with α = 0.05) to indicate that the larger the number of calls,
the larger the number of sales?

L. Kónya, 2022, Semester 1 ECON20003 - Tutorial 8