Simple linear regression and correlation
18.1 Introduction
18.2 Model
18.3 Least squares method
18.4 Error variable: Required conditions
18.5 Assessing the model
18.6 Using the regression equation
18.7 Coefficients of correlation (Optional)
18.8 Regression diagnostics — I
18.9 Summary
Introduction

Regression analysis is used in almost all areas of business to forecast variables such as product demand, prices of raw materials and labour costs. The technique involves developing a mathematical equation that describes the relationship between the variable to be forecast, which is called the dependent variable, and variables that the statistician believes are related to the dependent variable. The dependent variable is denoted y, while the related variables are called independent variables and are denoted x1, x2, ..., xk (where k is the number of independent variables). If we are interested only in determining whether a relationship exists, we employ correlation analysis.

Because regression analysis involves a number of new techniques and concepts, we split our presentation into three chapters. In this chapter, we present techniques that allow us to determine the relationship between only two variables. In Chapter 19, we expand our discussion to more than two variables, and in Chapter 20, we discuss how to build regression models.
immeasurable, that are not part of the model. The value of ε will vary from one sale to the next, even if x remains constant. That is, houses of exactly the same size will sell for different prices because of differences in location, selling season, decorations and other variables.
In the three chapters devoted to regression analysis, we will present only probabilistic models. Additionally, to simplify the presentation, all models will be linear. In this chapter, we restrict the number of independent variables to one. The model to be used in this chapter is called the first-order linear model or the simple linear regression model.
Simple linear regression model

y = β₀ + β₁x + ε

where
y = dependent variable
x = independent variable
β₀ = y-intercept
β₁ = slope of the line (defined as the ratio rise/run or change in y/change in x)
ε = error variable
Figure 18.1 depicts the deterministic component of the model.

Figure 18.1  Simple linear model: deterministic component (y = β₀ + β₁x, with y-intercept β₀ and slope rise/run)
The problem objective addressed by the model is to analyse the relationship between two variables, both of which must be quantitative. To define the relationship between x and y, we need to know the value of the coefficients of the linear model β₀ and β₁. However, these coefficients are population parameters, which are almost always unknown. In the next section, we discuss how these parameters are estimated.
EXERCISES
18.1 Graph each of the following straight lines. Identify the intercept and the slope.

AUSTRALIAN BUSINESS STATISTICS
18.2 For each of the following data sets, plot the points on a graph and determine whether a linear model is reasonable.
18.3 Graph the following observations of x and y.

x   1   2   3   4   5   6
y   4   6   7   7   9   11

Draw a straight line through the data. What are the intercept and the slope of the line you drew?
18.3 Least squares method
We estimate the parameters β₀ and β₁ in a way similar to the methods used to estimate all the other parameters discussed in this book. We draw a random sample from the populations of interest and calculate the sample statistics we need. Because β₀ and β₁ are the coefficients of a straight line, their estimators are based on drawing a straight line through the sample data. To see how this is done, consider the following simple example.
EXAMPLE 18.1

Given the following six observations of variables x and y, determine the straight line that best fits these data.

x   2   4   8   10   13   16
y   2   7   25  26   38   50
SOLUTION
As a first step we graph the data, as shown in Figure 18.2. Recall (from Chapter 2) that this graph is called a scatter diagram. The scatter diagram usually reveals whether or not a straight-line model fits the data reasonably well. Evidently, in this case a linear model is justified. Our task is to draw the straight line that provides the best possible fit.
Figure 18.2  Scatter diagram for Example 18.1
We can define what we mean by best in various ways. For example, we can draw the line that minimises the sum of the differences between the line and the points. Because some of the differences will be positive (points above the line) and others will be negative (points below the line), a cancelling effect might produce a straight line that does not fit the data at all. To eliminate the positive and negative differences, we will draw the line that minimises the sum of squared differences. That is, we want to determine the line that minimises

Σ(yᵢ − ŷᵢ)²

where yᵢ represents the observed value of y and ŷᵢ represents the value of y calculated from the equation of the line. That is,

ŷᵢ = β̂₀ + β̂₁xᵢ
The technique that produces this line is called the least squares method. The line itself is called the least squares line, the fitted line or the regression line. The 'hats' on the coefficients remind us that they are estimators of the parameters β₀ and β₁.
By using calculus, we can produce formulas for β̂₀ and β̂₁. Although we're sure that you are keenly interested in the calculus derivation of the formulas, we will not provide them, because we promised to keep the mathematics to a minimum. Instead, we offer the following formulas, which were derived by calculus.
Calculation of β̂₀ and β̂₁

β̂₁ = SS_xy / SS_x
β̂₀ = ȳ − β̂₁x̄

where

SS_xy = Σ(xᵢ − x̄)(yᵢ − ȳ)
SS_x = Σ(xᵢ − x̄)²
x̄ = Σxᵢ / n
ȳ = Σyᵢ / n
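The formulas in this box translate directly into a few lines of code. The following sketch is ours, not the text's; it implements the slope and intercept estimators as a small Python function (the name `least_squares` is our own).

```python
# A direct implementation of the box above: the slope is SS_xy / SS_x and
# the intercept is y-bar minus slope times x-bar. (Illustrative sketch only.)
def least_squares(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# A quick sanity check on points that lie exactly on y = 1 + 2x.
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```

When the points lie exactly on a line, the estimators recover that line, since every squared difference is then zero.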
The formula for SS_x should look familiar; it is the numerator in the calculation of sample variance s². We introduced the SS notation in Chapter 15; it stands for sum of squares. The statistic SS_x is the sum of squared differences between the observations of x and their mean. Strictly speaking, SS_xy is not a sum of squares. The formula for SS_xy may also be familiar; it is the numerator in the calculation for the covariance and the coefficient of correlation (introduced in Chapter 3).
As was the case with the analysis of variance procedures introduced in Chapter 15, calculating the statistics manually in any realistic example is extremely time consuming. Naturally, we recommend the use of statistical software to produce the statistics we need. However, it may be worthwhile to manually perform the calculations for several small-sample problems. Such efforts may provide you with insights into the workings of regression analysis. To that end, we provide shortcut formulas for the various statistics that are calculated in this chapter.
Shortcut formulas for SS_x and SS_xy

SS_x = Σxᵢ² − (Σxᵢ)²/n   or   Σxᵢ² − nx̄²
SS_xy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n   or   Σxᵢyᵢ − nx̄ȳ
As you can see, to estimate the regression coefficients by hand, we need to determine the following summations.

Sum of x: Σxᵢ
Sum of y: Σyᵢ
Sum of x squared: Σxᵢ²
Sum of x times y: Σxᵢyᵢ
Returning to our example, we find

Σxᵢ = 53
Σyᵢ = 148
Σxᵢ² = 609
Σxᵢyᵢ = 1786

Using these summations in our shortcut formulas, we find

SS_x = Σxᵢ² − (Σxᵢ)²/n = 609 − (53)²/6 = 140.833
SS_xy = Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n = 1786 − (53)(148)/6 = 478.667

so that

β̂₁ = SS_xy / SS_x = 478.667/140.833 = 3.399
β̂₀ = ȳ − β̂₁x̄ = 148/6 − 3.399(53/6) = 24.667 − 30.023 = −5.356

Thus, the least squares regression line is ŷ = −5.356 + 3.399x.
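The hand calculation above can be checked with a short script (ours, not the text's) that applies the shortcut formulas to the six observations of Example 18.1.

```python
# Least squares estimates for Example 18.1, via the shortcut formulas
# SS_x = sum(x_i^2) - (sum x_i)^2 / n and SS_xy = sum(x_i*y_i) - (sum x_i)(sum y_i)/n.
x = [2, 4, 8, 10, 13, 16]
y = [2, 7, 25, 26, 38, 50]
n = len(x)

sum_x = sum(x)                                  # 53
sum_y = sum(y)                                  # 148
sum_x2 = sum(xi ** 2 for xi in x)               # 609
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 1786

ss_x = sum_x2 - sum_x ** 2 / n                  # about 140.833
ss_xy = sum_xy - sum_x * sum_y / n              # about 478.667

b1 = ss_xy / ss_x                               # slope, about 3.399
b0 = sum_y / n - b1 * sum_x / n                 # intercept, about -5.356

print(f"y-hat = {b0:.3f} + {b1:.3f}x")
```

The printed line reproduces the regression equation obtained by hand.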
Figure 18.3 depicts the regression line. As you can see, the line fits the data quite well. We can measure how well by calculating the value of the minimised sum of squared differences. The differences between the points and the line are called errors or residuals.
Figure 18.3  Scatter diagram with regression line (ŷ = −5.356 + 3.399x): Example 18.1
Sum of squares for error

SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)²
The calculation of SSE in this example is shown in Figure 18.4. Notice that we calculate ŷᵢ by substituting xᵢ into the formula for the regression line. The residuals eᵢ are the differences between the observed values yᵢ and the calculated values ŷᵢ. The following table describes the calculation of SSE.
i    xᵢ    yᵢ    ŷᵢ        eᵢ = yᵢ − ŷᵢ    Residual squared eᵢ²
1     2     2    1.442        0.558            0.3114
2     4     7    8.240       −1.240            1.5376
3     8    25   21.836        3.164           10.0109
4    10    26   28.634       −2.634            6.9380
5    13    38   38.831       −0.831            0.6906
6    16    50   49.028        0.972            0.9448
                                      Σeᵢ² = 20.4333
Thus, SSE = 20.4333. No other straight line will produce a sum of squared errors as small as 20.4333. In that sense, the regression line fits the data best. The sum of squares for error is an important statistic because it is the basis for other statistics that assess how well the linear model fits the data. We will introduce these statistics later in this chapter.
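The residual table can be reproduced with a short script (ours, not the text's), using the rounded coefficients from the hand calculation.

```python
# Residuals and SSE for Example 18.1: y-hat_i comes from the fitted line,
# e_i = y_i - y-hat_i, and SSE is the sum of the squared residuals (about 20.43).
x = [2, 4, 8, 10, 13, 16]
y = [2, 7, 25, 26, 38, 50]
b0, b1 = -5.356, 3.399   # rounded coefficients from the hand calculation

y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
sse = sum(e ** 2 for e in residuals)

for xi, yi, yh, e in zip(x, y, y_hat, residuals):
    print(f"x={xi:2d}  y={yi:2d}  y_hat={yh:7.3f}  e={e:7.3f}")
print(f"SSE = {sse:.4f}")
```

Each printed row matches a row of the table above, and the final line reproduces SSE = 20.4333 to rounding.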
Figure 18.4  Calculation of SSE: Example 18.1 (ŷ = −5.356 + 3.399x)
We now apply the technique to a more practical problem.

EXAMPLE 18.2

A critical factor for used-car buyers in determining the value of a car is how far the car has been driven. However, there is not much information available about this in the public domain. To examine this issue, a used-car dealer randomly selected 100 five-year-old Ford Lasers that were sold at auctions during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning. The dealer recorded the price and the number of kilometres on the odometer. These data are stored in file XM18-02; some of the data are listed below. The dealer wants to find the regression line.
Car    Odometer reading    Auction selling price ($)
1          37 388                 5318
2          44 758                 5061
3          45 833                 5008
...
100        36 392                 5133
SOLUTION
Identifying the technique
Notice that the problem objective is to analyse the relationship between two quantitative variables. Because we want to know how the odometer reading affects the selling price, we identify the former as the independent variable, which we label x, and the latter as the dependent variable, which we label y.
Solving by hand
To determine the coefficient estimates, we must calculate SS_x and SS_xy. They are
SS_x = Σ(xᵢ − x̄)² = Σxᵢ² − nx̄² = 4 309 340 160
SS_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) = Σxᵢyᵢ − nx̄ȳ = −134 269 296
Using the sums of squares, we find the slope coefficient:

β̂₁ = SS_xy / SS_x = −134 269 296 / 4 309 340 160 = −0.0311577

To determine the intercept, we need to find x̄ and ȳ. They are

x̄ = Σxᵢ / n = 3 600 945 / 100 = 36 009.45

and

ȳ = Σyᵢ / n = 541 141 / 100 = 5411.41

Thus,

β̂₀ = ȳ − β̂₁x̄ = 5411.41 − (−0.0311577)(36 009.45) = 6533.38

The sample regression line is

ŷ = 6533.38 − 0.0312x
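The same estimates can be produced from the aggregate statistics alone, as this short script (ours, not the text's) shows.

```python
# Coefficient estimates for Example 18.2, using only the aggregate
# statistics reported in the text (n = 100 cars, odometer x, price y).
n = 100
ss_x = 4_309_340_160          # SS_x
ss_xy = -134_269_296          # SS_xy
x_bar = 3_600_945 / n         # 36 009.45
y_bar = 541_141 / n           # 5411.41

b1 = ss_xy / ss_x             # slope, about -0.0312
b0 = y_bar - b1 * x_bar       # intercept, about 6533.38

print(f"y-hat = {b0:.2f} {b1:+.4f}x")
```

Notice that the raw data are not needed once the sums of squares and the two sample means are known.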
Using the computer
The complete printouts are shown below. The printouts include more statistics than we need right now; we will discuss the rest of the printouts later. We have also included the scatter diagram, which is often a first step in the regression analysis. Notice that there does appear to be a straight-line relationship between the two variables.
Excel output for Example 18.2
[Excel regression printout for Example 18.2: regression statistics (including Standard Error = 151.5687515), an ANOVA table, and coefficient rows for the intercept and the odometer reading, with their standard errors, t statistics, p-values and 95% confidence limits.]
COMMANDS                                          COMMANDS FOR EXAMPLE 18.2
1 Type or import the data into two                Open file XM18-02
  adjacent columns.
2 Click Tools, Data Analysis
  and Regression. Click OK.
3 Specify Input Y Range.
4 Specify Input X Range.
  Click Labels (if necessary).
5 To draw the scatter diagram click
  Line Fit Plots before clicking OK.
(You can also draw the scatter diagram using the commands described in Chapter 2.)
In the line fit plot (scatter diagram), the diamonds are the values of y and the squares are the predicted values of y. Use a ruler to join the squares to draw the regression line.
The scatter diagram will be drawn with the two axes starting at zero. This may cause the points to be bunched together, leaving large blank spaces in the diagram. To modify the chart, proceed as follows.
6 Right-click the mouse on the Y-axis,
  click Format axis, click Scale
  (if necessary) and change the
  Minimum, Maximum and/or                         4800 (Minimum) 6000 (Maximum)
  Major and Minor Units.                          500 (Major Units) 100 (Minor Units)
  Click OK.
7 Repeat for the X-axis.                          19000 50000 10000 2000
8 Click anywhere within the boundaries
  of the box and use the mouse to increase/
  decrease the length and height of
  the box as required.
Interpreting the coefficients
The coefficient β̂₁ is −0.0312, which means that for each additional kilometre on the odometer, the price decreases by an average of $0.0312 (3.12 cents).
Applying the techniques
18.8 Self-correcting exercise. The accompanying table exhibits, for 8 delicatessens, the annual profit per dollar of sales y (measured in cents) and the number of employees per store x.
Fe nee ee i
y   2   3   20   3   8   2   2   19
a Find the least squares regression line to help predict profit per sales dollar on the basis of the number of employees.
b Plot the points, and graph the regression line.
c Does it appear that a straight-line model is reasonable?
d Make an economic interpretation of the slope.
18.9 A custom jobber of speciality fibreglass-bodied cars wished to estimate overhead expenses (labelled y and measured in $000s) as a function of the number of cars (labelled x) produced monthly. A random sample of 12 months was recorded and the following statistics calculated.
Sreisy Sya5? Ley=9e7 Exes. Ly=ais
Find the least squares regression line and interpret the value of β̂₁.
18.10 Twelve secretaries at a university in Queensland were asked to take a special three-day intensive course to improve their keyboarding skills. At the beginning and again at the end of the course, they were given a particular two-page letter and asked to type it flawlessly. The data shown in the following table were recorded.
Number of years Improvement
of experience (words per minute)
Typist x y
9
6 u
3 8
8 2
0 it
5 9
10 4
n B
2
9
8
L 10
Dea Dey = 1102
Ee = 848
a Find the equation of the regression line.
b As a check of your calculations in part (a), plot the 12 points and graph the line.
c Does it appear that the secretaries' experience is linearly related to their improvement?
18.11 Advertising is often touted as the key to success. In seeking to determine just how influential advertising is, the management of a recently established retail chain has collected data over the previous 15 weeks on sales revenue and advertising expenditures from its chain stores, with the results shown in the following table.
The intercept is β̂₀ = 6533. Technically, the intercept is the point at which the regression line and the y-axis intersect. This means that when x = 0 (i.e. the car was not driven at all) the selling price is $6533. We might be tempted to interpret this number as the price of cars that have not been driven. However, in this case, the intercept is probably meaningless. Because our sample did not include any cars with zero kilometres on the odometer, we have no basis for interpreting β̂₀. As a general rule, we cannot determine the value of y for a value of x that is far outside the range of the sample values of x. In this example, the smallest and largest values of x are 19 057 and 49 223, respectively. Because x = 0 is not in this interval, we cannot safely interpret the value of y when x = 0.
In the sections that follow, we will return to this problem and the computer output to introduce other statistics associated with regression analysis.
EXERCISES
Most of the exercises that follow were created to allow you to see how regression analysis is used to solve realistic problems. As a result, most feature a large number of observations. We anticipate that most students will solve these problems using a computer and statistical software. However, for students without these resources, we have calculated the sums of squares (SS_x, SS_xy, and one that will be needed later, SS_y, which is the sum of squares for the dependent variable) that will permit them to complete the calculations manually. We believe that it is pointless for students to calculate the sums of squares from the raw data, except for several small-sample exercises that are found below. In any case, students will have previously calculated sums of squares as part of a variety of procedures, including sample variance (Chapter 3), covariance and correlation (Chapter 3), and analysis of variance (Chapter 15).
18.4 You are given the following six points.
ze fp es 0
isms Chem
a Draw the scatter diagram.
b Determine the least squares line.
18.5 The observations of two variables were recorded as shown below.
ea eS eo
|e ase massa 7 wee ee
a Draw the scatter diagram.
b Find the least squares line.
18.6 A set of 10 observations to be analysed by a regression model yields the following summations:
Zresl Ly=37 Lay=75 Se-103 Ly ass
Find the least squares regression line.
18.7 A set of 25 observations of two variables x and y produced the following summations:
Frees Ly=19.0 Lxynill Lea3178 Sy = 9564
Find the least squares regression line.
Advertising expenditures    Sales
($000s)                     ($000s)
x                           y
3.0                           50
5.0                          250
7.0                          700
6.0                          450
6.5                          600
8.0                         1000
3.5                           75
4.0                          150
4.5                          200
6.5                          550
7.0                          750
7.5                          800
7.5                          900
8.5                         1100
7.0                          600
Σx = 91.5   Σy = 8175   Σxy = 57 787.5
Σx² = 598.75   Σy² = 6 070 625
a Find the coefficients of the regression line, using the least squares method.
b Make an economic interpretation of the slope.
c If the sign of the slope were negative, what would that say about the advertising?
d What does the value of the intercept tell you?
18.12 The term 'regression' was originally used in 1885 by Sir Francis Galton in his analysis of the relationship between the heights of children and parents. He formulated the 'law of universal regression', which specifies that 'each peculiarity in a man is shared by his kinsmen, but on average in a less degree'. (Evidently, people spoke this way in 1885.) In 1903, two statisticians, K. Pearson and A. Lee, took a random sample of 1078 father-son pairs to examine Galton's law ('On the Laws of Inheritance in Man, I. Inheritance of Physical Characteristics', Biometrika, vol. 2, pp. 457-62). Their sample regression line was

Son's height = 33.73 + 0.516 × Father's height
a Interpret the coefficients.
b What does the regression line tell you about the heights of sons of tall fathers?
¢ What does the regression line tell you about the heights of sons of short fathers?
Computer/Solving by hand exercises
18.13
The objective of commercials is to have as many viewers as possible remember the product in a favourable way and eventually buy it. In an experiment to determine how the length of a commercial affects people's memory of it, 60 randomly selected people were asked to watch a one-hour television show. In the middle of the show, a commercial advertising a brand of toothpaste appeared. Each viewer watched a commercial whose length varied between 20 and 60 seconds. The essential content of the commercials was the same. After the show, each person was given a test to measure how much he or she remembered about the product. The commercial times and test scores (on a 30-point test) are stored in file XR18-13.
a Draw a scatter diagram of the data to determine whether a linear model appears to be appropriate.
b Determine the least squares line.
c Interpret the coefficients.
Use a software package to solve this problem, OR complete your answer manually using the given sums of squares:
SS = 2829.6
18.14 After several semesters without much success, Pat Statsdud (a student in the lower quarter of a statistics subject) decided to try to improve. Pat needed to know the secret of success for university students. After many hours of discussion with other, more successful, students, Pat postulated a rather radical theory: the longer one studied, the better one's grade. To test the theory, Pat took a random sample of 100 students in an economics subject and asked each to report the average amount of time he or she studied economics and the final mark received. These data are stored in columns 1 (study time in hours) and 2 (final mark out of 100) in file XR18-14.
a Determine the sample regression line.
b Interpret the coefficients.
¢ Is the sign of the slope logical? If the slope had had the opposite sign, what
would that tell you?
Use a software package to solve this problem, OR complete your answer manually using the given sums of squares:
SS_studytime,finalmark = 15 241
36 020
18.15 Suppose that a statistician wanted to update the study described in Exercise 18.12. She collected data on 400 father-son pairs and stored the data in columns 1 (fathers' heights in centimetres) and 2 (sons' heights in centimetres) in file XR18-15.
a Determine the sample regression line.
b What does the value of β̂₀ tell you?
c What does the value of β̂₁ tell you?
Use a software package to solve this problem, OR complete your answer manually using the given sums of squares:
SS_father,son = 19 626.25
SS_son = 35 265
SS_father = 40 980.6
18.16 Until now, most Australian private health insurance companies charged the same annual premium regardless of the length of time a person had private health cover. Recently, the Australian federal government announced a 30% rebate on the health insurance premium if a person has private health cover. The health insurance companies are of the view that this government incentive will be attractive to senior members of the population who do not currently have private health cover. They are lobbying the federal government to accept their proposal to charge a higher premium to the new members and a lower premium to those who have had private health cover for a longer period. A health economist is very suspicious about the motive of the insurance companies' proposal. He gathered data concerning the age and mean daily medical expenses of a random sample of 1548 Australians during the previous 12-month period. The data are stored in file XR18-16 (column 1 = age; column 2 = mean daily medical expense).
a Determine the sample regression line.
b Interpret the coefficients.
c What rate plan would you suggest?
Use a software package to solve this problem, OR complete your answer manually using the given sums of squares:
SS_age,expense = 69 408
SS_expense = 307 461
For Exercise 18.4, calculate the sum of squared errors (SSE) by calculating each value of ŷᵢ, subtracting yᵢ, squaring the differences, and summing the squared differences. Then, to check your calculation, calculate SSE by the shortcut method.
For Exercise 18.5, calculate SSE by calculating each value of ŷᵢ, subtracting yᵢ, squaring the differences, and summing the squared differences. Then calculate SSE by the shortcut method, and compare the results.
In a study of the relationship between two variables x and y, the following summations were calculated:
Yx=105 Dyassie Ley = 37525 Fv2956 Ly=1818421 n= 15
Calculate SSE, β̂₀ and β̂₁.
18.4 Error variable: Required conditions
In the previous section, we described the least squares method of estimating the coefficients of the probabilistic model. A critical part of this model is the error variable ε. In the next section, we present methods of assessing how well the straight line fits the data. In order for these methods to be valid, however, five requirements involving the probability distribution of the error variable must be satisfied.
Required conditions for the error variable
1 The probability distribution of ε is normal.
2 The mean of the distribution is zero; that is, E(ε) = 0.
3 The standard deviation of ε is σ_ε, which is a constant no matter what the value of x is. (Errors with this property are called homoscedastic.)
4 The errors associated with any two values of y are independent. As a result, the value of the error variable at one point does not affect the value of the error variable at another point. (Errors that do not satisfy this requirement are known as autocorrelated errors.)
5 The errors are independent of the independent variable.
Requirements 1, 2 and 3 can be interpreted in another way: for each value of x, y is a normally distributed random variable whose mean is

E(y) = β₀ + β₁x

and whose standard deviation is σ_ε. Notice that the mean depends on x. To reflect this dependence, the expected value is sometimes expressed as

E(y|x) = β₀ + β₁x

The standard deviation, however, is not influenced by x, because it is a constant over all values of x. Figure 18.5 depicts this interpretation. In Section 18.8, we will discuss how departures from these required conditions affect the regression analysis and how they are identified.
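The required conditions can be made concrete with a small simulation. The sketch below is ours, not the text's; the parameter values merely echo Example 18.2, and the point is only that, for a fixed x, repeated draws of y average out to E(y|x) = β₀ + β₁x.

```python
# Simulation (illustrative only) of the model y = beta0 + beta1*x + epsilon
# where epsilon is normal with mean 0 and a constant standard deviation sigma,
# drawn independently for each observation.
import random

random.seed(1)
beta0, beta1, sigma = 6533.0, -0.0312, 150.0  # assumed values echoing Example 18.2

def draw_y(x):
    """One observation of y for a given x: E(y|x) = beta0 + beta1*x."""
    return beta0 + beta1 * x + random.gauss(0.0, sigma)

# For a fixed x, the sample mean of many draws should be close to E(y|x).
x = 36000
ys = [draw_y(x) for _ in range(20000)]
mean_y = sum(ys) / len(ys)
print(f"E(y|x={x}) = {beta0 + beta1 * x:.1f}, simulated mean = {mean_y:.1f}")
```

Because σ_ε does not depend on x, repeating the experiment at a different odometer reading would shift the mean but leave the spread of the simulated prices unchanged.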
Experimental and observational data
Statisticians often design controlled experiments where regression analysis will be used. They do so by setting several different values of x and observing the corresponding values of y. For example, the data in Exercise 18.13 were gathered through a controlled experiment. To determine the effect of the length of a television commercial on its viewers' memories of the product advertised, the statistician arranged for 60 television viewers to watch commercials of differing lengths and then tested their memories of that commercial. Each viewer was randomly assigned a commercial length. The values of x ranged from 20 to 60 and were set by the statistician as part of the experiment. For each value of x, the distribution of the memory test scores is assumed to be normally distributed with a constant variance.
Figure 18.5  Distribution of y given x
In many cases, it is difficult or impossible to design a controlled experiment. Thus, we have no alternative to gathering observational data. As was the case with other techniques, whether the data are observational or experimental does not affect the choice of statistical method; we can apply regression analysis to both experimental and observational data. However, when the data are observational, we must be more careful about the
interpretations of the results. To illustrate, suppose that an analysis of observational data
showed that there is a linear relationship (with a positive value of f,) between a university
lecturer's salary and teaching evaluations. We may interpret these resulis to infer that
better teachers are more highly rewarded. However, it may be that a university’s best
researchers are also its best teachers, and, because research is rewarded, there appears to
be a relationship between salary and teaching evaluations. A multiple regression analysis
(Chapter 19) may be able to produce a definite conclusion.
18.5 Assessing the model
The least squares method produces the best straight line. However, there may in fact be no relationship or perhaps a non-linear (e.g. quadratic) relationship between the two variables. If so, the use of a linear model is pointless. Consequently, it is important for us to assess how well the linear model fits the data. If the fit is poor, we should discard the linear model and seek another one.
Several methods are used to evaluate the model. In this section, we present two statistics and one test procedure to determine whether a linear model should be employed. They are the standard error of estimate, the t-test of the slope, and the coefficient of determination.
Standard error of estimate
In Section 18.4, we pointed out that the error variable ε is normally distributed with mean zero and standard deviation σ_ε. If σ_ε is large, some of the errors will be large, which implies that the model's fit is poor. If σ_ε is small, the errors tend to be close to the mean (which is zero) and, as a result, the model fits well. Hence, we could use σ_ε to measure the suitability of using a linear model. Unfortunately, σ_ε is a population parameter and, like most parameters, is unknown. We can, however, estimate σ_ε from the data. The estimate is based on the statistic we introduced in Section 18.3, the sum of squares for error, SSE.
Recall that SSE is the minimised sum of squared differences between the points and the line. That is, for the least squares line,

SSE = Σeᵢ² = Σ(yᵢ − ŷᵢ)²

For Example 18.1, we showed how SSE is found. We determined the value of ŷ for each value of x, calculated the difference between y and ŷ, squared the difference, and added. This procedure can be quite time-consuming. Fortunately, we can also express SSE as a function of the sums of squares.
Simplified calculation of SSE

SSE = SS_y − (SS_xy)²/SS_x

where

SS_y = Σ(yᵢ − ȳ)² = Σyᵢ² − (Σyᵢ)²/n   or   Σyᵢ² − nȳ²
We can estimate σ_ε² by dividing SSE by the number of observations minus 2, where 2 represents the number of parameters estimated in the regression model, namely β₀ and β₁. That is, the sample statistic

s_ε² = SSE/(n − 2)

is an unbiased estimator of σ_ε². The square root of s_ε² is called the standard error of estimate.
Standard error of estimate

s_ε = √(SSE/(n − 2))
EXAMPLE 18.3
Find the standard error of estimate for Example 18.2.
SOLUTION
Solving by hand
To calculate the standard error of estimate, we need to find the sum of squares for error. This requires the calculation of SS_x, SS_xy and SS_y. In Example 18.2, we calculated

SS_x = 4 309 340 160 and SS_xy = −134 269 296

From the data we find

SS_y = Σ(yᵢ − ȳ)² = 6 434 890

We can now determine the sum of squares for error:

SSE = SS_y − (SS_xy)²/SS_x = 6 434 890 − (−134 269 296)²/4 309 340 160 = 2 251 363

Thus, the standard error of estimate is

s_ε = √(SSE/(n − 2)) = √(2 251 363/(100 − 2)) = 151.569
Using the computer
Refer to the Excel printout for Example 18.2. Excel reports the standard error of estimate as

Standard Error  151.5687515
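The hand calculation in Example 18.3 can be verified with a few lines of Python (ours, not the text's), using the simplified SSE formula.

```python
# Standard error of estimate for Example 18.3:
# SSE = SS_y - SS_xy**2 / SS_x, then s_e = sqrt(SSE / (n - 2)).
import math

n = 100
ss_x = 4_309_340_160
ss_xy = -134_269_296
ss_y = 6_434_890

sse = ss_y - ss_xy ** 2 / ss_x   # about 2 251 360
s_e = math.sqrt(sse / (n - 2))   # about 151.57

print(f"SSE = {sse:.0f}, s_e = {s_e:.3f}")
```

The result agrees with the Excel figure of 151.5687515 to rounding.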
Interpreting the results
The smallest value that s_ε can assume is zero, which occurs when SSE = 0, that is, when all the points fall on the regression line. Thus, when s_ε is small, the fit is excellent, and the linear model is likely to be an effective analytical and forecasting tool. If s_ε is large, the model is a poor one, and the statistician should improve it or discard it.
We judge the value of s_ε by comparing it to the values of the dependent variable y, or more specifically to the sample mean ȳ. In this example, s_ε (= 151.6) is only 2.8% relative to ȳ (= 5411.4). Therefore, we could admit that the standard error of estimate is reasonably small. In general, however, the standard error of estimate cannot be used alone as an absolute measure of the model's utility.
Nonetheless, s_ε is useful in comparing models. If the statistician has several models from which to choose, the one with the smallest value of s_ε should generally be the one used. As you'll see, s_ε is also an important statistic in other procedures associated with regression analysis.
Estimating the slope and the intercept
As discussed in Chapter 8, there are two types of estimators available, namely point estimators and interval estimators, to estimate an unknown parameter. In Section 18.3 we used the least squares method to derive the point estimators β̂₀ and β̂₁ of the intercept β₀ and the slope coefficient β₁. Now we provide the interval estimators for β₀ and β₁.
Confidence interval estimators of β₀ and β₁

β̂₀ ± t_{α/2,n−2} s_{β̂₀}
β̂₁ ± t_{α/2,n−2} s_{β̂₁}

where s_{β̂₁} is the standard deviation of β̂₁ (also called the standard error of β̂₁) and is equal to

s_{β̂₁} = s_ε / √SS_x

and s_{β̂₀} is the standard error of β̂₀ and is equal to

s_{β̂₀} = s_ε √(1/n + x̄²/SS_x)
EXAMPLE 18.4
Determine the 95% confidence interval estimate of the slope β₁ for Example 18.2.
SOLUTION
Solving by hand
In Example 18.2, we have s_ε = 151.569 and SS_x = 4 309 340 160. Thus

s_{β̂₁} = s_ε / √SS_x = 151.569 / √4 309 340 160 = 0.002309

Therefore, the 95% confidence interval estimate of β₁ for Example 18.2 is

β̂₁ ± t_{α/2,n−2} s_{β̂₁}
= −0.0311577 ± 1.984 × 0.002309
= −0.0311577 ± 0.0045811
= [−0.036, −0.027]

Thus the 95% confidence interval estimate of the slope is the interval from −0.036 to −0.027.
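The interval can be reproduced with a short script (ours, not the text's); the critical value 1.984 is taken directly from the text's t-table rather than computed.

```python
# 95% confidence interval for the slope in Example 18.4, using
# s_b1 = s_e / sqrt(SS_x) and the text's critical value t(0.025, 98) = 1.984.
import math

n = 100
b1 = -0.0311577
s_e = 151.569
ss_x = 4_309_340_160
t_crit = 1.984                 # t_{alpha/2, n-2} from the t-table

s_b1 = s_e / math.sqrt(ss_x)   # about 0.002309
margin = t_crit * s_b1         # about 0.00458
lower, upper = b1 - margin, b1 + margin

print(f"s_b1 = {s_b1:.6f}")
print(f"95% CI for beta1: [{lower:.3f}, {upper:.3f}]")
```

Because zero lies well outside this interval, the interval already hints at the conclusion of the t-test presented next.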
Using the computer
Excel output for Example 18.4
Refer to the Excel printout for Example 18.2. Excel reports the confidence interval estimate in the columns next to that of the p-value.
Testing the slope
To understand this method of assessing the linear model, consider the consequences of applying the regression technique to two variables that are not at all linearly related. If we could observe the entire population and draw the scatter diagram, we would observe the graph shown in Figure 18.6. The line is horizontal, which means that the value of y is unaffected by the value of x. Recall that a horizontal straight line has a slope of zero; that is, β₁ = 0.
Figure 18.6 Scatter diagram of a population in which β₁ = 0: the line E(y) = β₀ + β₁x is horizontal
Because we rarely examine complete populations, the parameters are unknown. However, we can draw inferences about the population slope β₁ from the sample slope β̂₁.
The process of testing hypotheses about β₁ is identical to the process of testing any other parameter. We begin with the hypotheses. The null hypothesis specifies that there is no linear relationship, which means that the slope is zero. Thus, we specify
H₀: β₁ = 0
We can conduct one- or two-tail tests of β₁. Most often, we perform a two-tail test to determine whether there is sufficient evidence to infer that a linear relationship exists. We test
H_A: β₁ ≠ 0
The test statistic is

t = (β̂₁ − β₁) / s_{β̂₁}

where s_{β̂₁} is the standard deviation of β̂₁ (also called the standard error of β̂₁) and is equal to

s_{β̂₁} = s_ε / √(Σ(x_i − x̄)²)
If the error variable is normally distributed, the test statistic is Student t distributed with n − 2 degrees of freedom. Note that 2 represents the number of parameters estimated in the regression model.
EXAMPLE 18.5
Test to determine whether there is enough evidence in Example 18.2 to infer that there is a linear relationship between the price and the odometer reading. Use a significance level of 5%.
SOLUTION
We test the hypotheses
H₀: β₁ = 0
H_A: β₁ ≠ 0
Test statistic: t = (β̂₁ − β₁) / s_{β̂₁}
Level of significance: α = 0.05
Decision rule: Reject H₀ if |t| > t_{α/2, n−2} = t_{0.025, 98} = 1.984
Value of the test statistic:
Solving by hand
To calculate the value of the test statistic, we need β̂₁ and s_{β̂₁}. In Example 18.2, we found Σ(x_i − x̄)² = 4 309 340 160 and β̂₁ = −0.0311577. In Example 18.3, we found s_ε = 151.569. Thus,

s_{β̂₁} = s_ε / √(Σ(x_i − x̄)²) = 151.569 / √4 309 340 160 = 0.002309

The value of the test statistic is

t = (β̂₁ − β₁) / s_{β̂₁} = (−0.0311577 − 0) / 0.002309 = −13.49
Using the computer
Excel output for Example 18.5
The output below was taken from the Excel output for Example 18.2. The printout includes the standard deviation of β̂₁ (Standard Error), the t-statistic (t Stat) and the two-tail p-value of the test (P-value). These values are 0.002308896, −13.49465083 and 4.44346E-24 (which is practically 0), respectively. Notice that the printout includes a test for β₀. However, as we've pointed out before, interpreting the value of the y-intercept can lead to erroneous, if not ridiculous, conclusions. As a result, we will ignore the test of β₀.
Conclusion: Since t = −13.49 < −1.984, reject the null hypothesis.
Interpreting the results
The value of the test statistic is t = −13.49, with a p-value of 0.000. There is overwhelming evidence to infer that a linear relationship exists. (Figure 18.7 depicts the sampling distribution of the test statistic.) What this means is that the odometer reading does affect the auction selling price of the cars.
Figure 18.7 Sampling distribution of the test statistic for Example 18.5 (rejection regions |t| > 1.984; observed t = −13.49)
As was the case when we interpreted the y-intercept, the conclusion we draw here is valid only over the range of the values of the independent variable. That is, we can infer that there is a linear relationship between odometer reading and auction price for five-year-old Ford Lasers whose odometer readings lie between 19 057 and 49 223 km (the minimum and maximum values of x in the sample). Because we have no observations outside this range, we do not know how, or even whether, the two variables are related. Figure 18.8 depicts several possible relationships over a wider range of values of x than the sample data in Example 18.2. As you can see, all three figures show a linear relationship when x lies between 20 000 and 50 000 km. Outside this range the relationship may be linear (Figure 18.8a) or nonlinear (Figures 18.8b and 18.8c). This issue is particularly important to remember when we use the regression equation to estimate or forecast (see Section 18.6).
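The hand test in Example 18.5 can be sketched in a few lines of Python. The summary statistics are those given for Examples 18.2 and 18.3; the decision rule compares |t| with the table value 1.984:

```python
import math

# Summary statistics from Examples 18.2 and 18.3 (n = 100)
b1 = -0.0311577        # estimated slope
s_eps = 151.569        # standard error of estimate
ss_x = 4_309_340_160   # sum of (x_i - x_bar)^2
t_crit = 1.984         # two-tail critical value t_{.025, 98}

s_b1 = s_eps / math.sqrt(ss_x)   # standard error of the slope
t_stat = (b1 - 0) / s_b1         # test statistic under H0: beta_1 = 0

print(f"t = {t_stat:.2f}")       # -13.49
if abs(t_stat) > t_crit:
    print("Reject H0: evidence of a linear relationship")
else:
    print("Do not reject H0")
```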
Coefficient of determination
The test of β₁ addresses only the question of whether there is enough evidence to infer that a linear relationship exists. In many cases, however, it is also useful to measure the strength of that linear relationship, particularly when we want to compare several different models. The statistic that performs this function is the coefficient of determination.
The coefficient of determination, denoted R² and defined as the ratio SSR/SS_y (where SSR is the sum of squares for regression, defined below, and SS_y = Σ(y_i − ȳ)²), is calculated in the following way.
Coefficient of determination

R² = (s_xy)² / (s_x² s_y²)

With a little algebra, mathematicians can show that

R² = 1 − SSE / SS_y
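Both forms of R² can be checked on a small data set. The numbers below are illustrative only (they are not from Example 18.2); note that the (n − 1) factors in the sample variances and covariance cancel, so raw sums of squared deviations can be used directly:

```python
# Illustrative data (hypothetical, not from Example 18.2)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_x = sum((xi - x_bar) ** 2 for xi in x)              # sum of (x_i - x_bar)^2
ss_y = sum((yi - y_bar) ** 2 for yi in y)              # SS_y
s_xy = sum((xi - x_bar) * (yi - y_bar)
           for xi, yi in zip(x, y))                    # sum of cross-deviations

# Least squares estimates
b1 = s_xy / ss_x
b0 = y_bar - b1 * x_bar

# R^2 two ways: via SSE, and directly (the (n-1) factors cancel)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r2_via_sse = 1 - sse / ss_y
r2_direct = s_xy ** 2 / (ss_x * ss_y)

print(round(r2_direct, 6), round(r2_via_sse, 6))  # the two forms agree
```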
Figure 18.8 Relationships between odometer reading and auction price: three panels (a), (b) and (c), each plotting price (vertical axis) against odometer reading (horizontal axis). All three agree within the sampled range of odometer readings, but outside it the relationship is linear in (a) and nonlinear in (b) and (c).
The significance of this formula is based on the analysis of variance technique. In Chapter 15, we partitioned the total sum of squares into two sources of variation. Here, we begin the discussion by observing that the deviation between y_i and ȳ can be decomposed into two parts. That is,

y_i − ȳ = (y_i − ŷ_i) + (ŷ_i − ȳ)

This equation is represented graphically (for i = 1) in Figure 18.9.