Regression Analysis
CHAPTER 18
Simple linear regression and correlation

18.1 Introduction
18.2 Model
18.3 Least squares method
18.4 Error variable: required conditions
18.5 Assessing the model
18.6 Using the regression equation
18.7 Coefficients of correlation (optional)
18.8 Regression diagnostics I
18.9 Summary

Introduction

Forecasting is required in almost all areas of business, for quantities such as product demand, prices of raw materials and labour costs. The technique involves developing a mathematical equation that describes the relationship between the variable to be forecast, which is called the dependent variable, and variables that the statistician believes are related to the dependent variable. The dependent variable is denoted y, while the related variables are called independent variables and are denoted x1, x2, ..., xk (where k is the number of independent variables). If we are interested only in determi…

Because regression analysis involves a number of new techniques and concepts, we divide our presentation into three chapters. In this chapter, we present techniques that allow us to determine the relationship between only two variables. In Chapter 19, we expand our discussion to more than two variables, and in Chapter 20, we discuss how to build regression models.

The error variable ε accounts for all other variables, measurable and immeasurable, that are not part of the model. The value of ε will vary from one sale to the next, even if x remains constant. That is, houses of exactly the same size will sell for different prices because of differences in location, selling season, decorations and other variables.

In the three chapters devoted to regression analysis, we will present only probabilistic models. Additionally, to simplify the presentation, all models will be linear. In this chapter, we restrict the number of independent variables to one.
The model to be used in this chapter is called the first-order linear model, or the simple linear regression model.

Simple linear regression model

y = β0 + β1x + ε

where
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (defined as the ratio rise/run, or change in y / change in x)
ε = error variable

Figure 18.1 depicts the deterministic component of the model.

Figure 18.1 Simple linear model: deterministic component
[Line y = β0 + β1x, showing the intercept β0 and the slope as rise over run]

The problem objective addressed by the model is to analyse the relationship between two variables, x and y, both of which must be quantitative. To define the relationship between x and y, we need to know the values of the coefficients of the linear model, β0 and β1. However, these coefficients are population parameters, which are almost always unknown. In the next section, we discuss how these parameters are estimated.

EXERCISES

18.1 Graph each of the following straight lines. Identify the intercept and the slope.

18.2 For each of the following data sets, plot the points on a graph and determine whether a linear model is reasonable.

18.3 Graph the following observations of x and y.

x   1   2   3   4   5   6
y   4   6   7   7   9   11

Draw a straight line through the data. What are the intercept and the slope of the line you drew?

18.3 Least squares method

We estimate the parameters β0 and β1 in a way similar to the methods used to estimate all the other parameters discussed in this book. We draw a random sample from the populations of interest and calculate the sample statistics we need. Because β0 and β1 are the coefficients of a straight line, their estimators are based on drawing a straight line through the sample data. To see how this is done, consider the following simple example.

EXAMPLE 18.1

Given the following six observations of variables x and y, determine the straight line that best fits these data.

x   2   4   8   10   13   16
y   2   7   25  26   38   50

SOLUTION

As a first step we graph the data, as shown in Figure 18.2.
Recall (from Chapter 2) that this graph is called a scatter diagram. The scatter diagram usually reveals whether or not a straight-line model fits the data reasonably well. Evidently, in this case a linear model is justified. Our task is to draw the straight line that provides the best possible fit.

Figure 18.2 Scatter diagram for Example 18.1
[Scatter diagram of the six (x, y) points]

We can define what we mean by best in various ways. For example, we can draw the line that minimises the sum of the differences between the line and the points. Because some of the differences will be positive (points above the line) and others will be negative (points below the line), a cancelling effect might produce a straight line that does not fit the data at all. To eliminate the positive and negative differences, we will draw the line that minimises the sum of squared differences. That is, we want to determine the line that minimises

Σ(yi − ŷi)²

where yi represents the observed value of y and ŷi represents the value of y calculated from the equation of the line. That is,

ŷi = β̂0 + β̂1xi

The technique that produces this line is called the least squares method. The line itself is called the least squares line, the fitted line or the regression line. The 'hats' on the coefficients remind us that they are estimators of the parameters β0 and β1.

By using calculus, we can produce formulas for β̂0 and β̂1. Although we are sure that you are keenly interested in the calculus derivation of the formulas, we will not provide it, because we promised to keep the mathematics to a minimum. Instead, we offer the following formulas, which were derived by calculus.

Calculation of β̂0 and β̂1

β̂1 = SSxy / SSx
β̂0 = ȳ − β̂1x̄

where

SSxy = Σ(xi − x̄)(yi − ȳ)
SSx = Σ(xi − x̄)²
x̄ = Σxi / n
ȳ = Σyi / n

The formula for SSx should look familiar; it is the numerator in the calculation of the sample variance s². We introduced the SS notation in Chapter 15; it stands for sum of squares.
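These formulas translate directly into code. The following Python sketch (ours, not part of the text) applies them to the data of Example 18.1; any statistical package computes the same quantities.

```python
# Least squares estimates via the definition formulas:
#   b1 = SSxy / SSx,  b0 = y-bar - b1 * x-bar
def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)                       # SSx
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # SSxy
    b1 = ss_xy / ss_x            # slope estimate
    b0 = y_bar - b1 * x_bar      # intercept estimate
    return b0, b1

x = [2, 4, 8, 10, 13, 16]
y = [2, 7, 25, 26, 38, 50]
b0, b1 = least_squares(x, y)
print(round(b0, 3), round(b1, 3))    # -5.356 3.399
```

The same estimates emerge from the shortcut formulas introduced next; the two forms are algebraically identical.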
The statistic SSx is the sum of squared differences between the observations of x and their mean. Strictly speaking, SSxy is not a sum of squares; its formula resembles the numerator in the calculation of the covariance and the coefficient of correlation (introduced in Chapter 3).

As was the case with the analysis of variance procedures introduced in Chapter 15, calculating the statistics manually in any realistic example is extremely time consuming. Naturally, we recommend the use of statistical software to produce the statistics we need. However, it may be worthwhile to perform the calculations manually for several small, simple problems. Such efforts may provide you with insights into the workings of regression analysis. To that end, we provide shortcut formulas for the various statistics that are calculated in this chapter.

Shortcut formulas for SSx and SSxy

SSx = Σxi² − (Σxi)²/n
SSxy = Σxiyi − (Σxi)(Σyi)/n

As you can see, to estimate the regression coefficients by hand, we need to determine the following summations.

Sum of x: Σxi
Sum of y: Σyi
Sum of x squared: Σxi²
Sum of x times y: Σxiyi

Returning to our example, we find

Σxi = 53
Σyi = 148
Σxi² = 609
Σxiyi = 1786

Using these summations in our shortcut formulas, we find

SSx = Σxi² − (Σxi)²/n = 609 − (53)²/6 = 140.833
SSxy = Σxiyi − (Σxi)(Σyi)/n = 1786 − (53)(148)/6 = 478.667

so that

β̂1 = SSxy/SSx = 478.667/140.833 = 3.399
β̂0 = ȳ − β̂1x̄ = 148/6 − 3.399(53/6) = 24.667 − 3.399(8.833) = −5.356

Thus, the least squares regression line is

ŷ = −5.356 + 3.399x

Figure 18.3 describes the regression line. As you can see, the line fits the data quite well. We can measure how well by calculating the value of the minimised sum of squared differences. The differences between the points and the line are called errors or residuals.

Figure 18.3 Scatter diagram with regression line: Example 18.1
[The six data points with the fitted line ŷ = −5.356 + 3.399x]

Sum of squares for error

SSE = Σei² = Σ(yi − ŷi)²

The calculation of SSE in this example is shown in Figure 18.4. Notice that we calculate ŷi by substituting xi into the formula for the regression line.
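The same calculation is easily done programmatically. This sketch (ours, not part of the text) computes each ŷi from the fitted line and sums the squared residuals; the coefficients are the least squares estimates carried to extra decimal places.

```python
# SSE for Example 18.1: sum of squared vertical distances from each
# point to the fitted line y-hat = b0 + b1*x.
def sum_squared_errors(x, y, b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [2, 4, 8, 10, 13, 16]
y = [2, 7, 25, 26, 38, 50]
b0, b1 = -5.356213, 3.398817      # least squares estimates, full precision
sse = sum_squared_errors(x, y, b0, b1)
print(round(sse, 3))              # about 20.433
```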
The residuals ei are the differences between the observed values yi and the calculated values ŷi. The following table describes the calculation of SSE.

i   xi   yi   ŷi       ei = yi − ŷi   ei²
1   2    2    1.442    0.558          0.3114
2   4    7    8.240    −1.240         1.5376
3   8    25   21.836   3.164          10.0109
4   10   26   28.634   −2.634         6.9380
5   13   38   38.831   −0.831         0.6906
6   16   50   49.028   0.972          0.9448
                               Σei² = 20.433

Thus, SSE = 20.433. No other straight line will produce a sum of squared errors as small as 20.433. In that sense, the regression line fits the data best. The sum of squares for error is an important statistic because it is the basis for other statistics that assess how well the linear model fits the data. We will introduce these statistics later in this chapter.

Figure 18.4 Calculation of SSE: Example 18.1
[Scatter diagram showing the vertical distances between the points and the line ŷ = −5.356 + 3.399x]

We now apply the technique to a more practical problem.

EXAMPLE 18.2

A critical factor for used-car buyers in determining the value of a car is how far the car has been driven. However, there is not much information available about this in the public domain. To examine this issue, a used-car dealer randomly selected 100 five-year-old Ford Lasers that were sold at auction during the past month. Each car was in top condition and equipped with automatic transmission, AM/FM cassette tape player and air conditioning. The dealer recorded the price and the number of kilometres on the odometer. These data are stored in file XM18-02; some of the data are listed below. The dealer wants to find the regression line.

Car   Odometer reading (km)   Auction selling price ($)
1     37 388                  5318
2     44 758                  5061
3     45 833                  5008
⋮     ⋮                       ⋮
100   36 392                  5133

SOLUTION

Identifying the technique

Notice that the problem objective is to analyse the relationship between two quantitative variables.
Because we want to know how the odometer reading affects the selling price, we identify the former as the independent variable, which we label x, and the latter as the dependent variable, which we label y.

Solving by hand

To determine the coefficient estimates, we must calculate SSx and SSxy. They are

SSx = Σ(xi − x̄)² = Σxi² − nx̄² = 4 309 340 160
SSxy = Σ(xi − x̄)(yi − ȳ) = Σxiyi − nx̄ȳ = −134 269 296

Using the sums of squares, we find the slope coefficient:

β̂1 = SSxy/SSx = −134 269 296 / 4 309 340 160 = −0.0311577

To determine the intercept, we need to find x̄ and ȳ. They are

ȳ = Σyi/n = 541 141/100 = 5411.41
x̄ = Σxi/n = 3 600 945/100 = 36 009.45

Thus,

β̂0 = ȳ − β̂1x̄ = 5411.41 − (−0.0311577)(36 009.45) = 6533.38

The sample regression line is

ŷ = 6533 − 0.0312x

Using the computer

The complete printouts are shown below. The printouts include more statistics than we need right now; we will discuss the rest of the printouts later. We have also included the scatter diagram, which is often a first step in the regression analysis. Notice that there does appear to be a straight-line relationship between the two variables.

Excel output for Example 18.2
[Excel regression printout: Multiple R, R Square, Standard Error, the ANOVA table, and the coefficients, standard errors, t statistics and p-values for the intercept and the odometer variable]

COMMANDS FOR EXAMPLE 18.2
1 Type or import the data into two adjacent columns. (Open file XM18-02.)
2 Click Tools, Data Analysis and Regression. Click OK.
3 Specify Input Y Range.
4 Specify Input X Range. Click Labels (if necessary).
5 To draw the scatter diagram, click Line Fit Plots before clicking OK. (You can also draw the scatter diagram using the commands described in Chapter 2.)
6 In the line fit plot (scatter diagram), the diamonds are the values of y and the squares are the predicted values of y. Use a ruler to join the squares to draw the regression line. The scatter diagram will be drawn with the two axes starting at zero. This may cause the points to be bunched together, leaving large blank spaces in the diagram. To modify the chart, right-click the mouse on the Y-axis, click Format Axis, click Scale (if necessary) and change the Minimum, Maximum and/or Major and Minor Units: 4800 (Minimum), 6000 (Maximum), 500 (Major Units), 100 (Minor Units). Click OK.
7 Repeat for the X-axis: 19 000 (Minimum), 50 000 (Maximum), 10 000 (Major Units), 2000 (Minor Units).
8 Click anywhere within the boundaries of the box and use the mouse to increase/decrease the length and height of the box as required.

Interpreting the coefficients

The coefficient β̂1 is −0.0312, which means that for each additional kilometre on the odometer, the price decreases by an average of $0.0312 (3.12 cents).

Applying the techniques

18.8 Self-correcting exercise. The accompanying table exhibits, for 8 delicatessens, the annual profit per dollar of sales y (measured in cents) and the number of employees per store x.

[Data table; the values are illegible in the source.]

a Find the least squares regression line to help predict profit per sales dollar on the basis of the number of employees.
b Plot the points, and graph the regression line.
c Does it appear that a straight-line model is reasonable?
d Make an economic interpretation of the slope.

18.9 A custom jobber of speciality fibreglass-bodied cars wished to estimate overhead expenses (labelled y and measured in $'000s) as a function of the number of cars (labelled x) produced monthly. A random sample of 12 months was recorded and the following statistics calculated.

[Summations; the values are illegible in the source.]

Find the least squares regression line and interpret the value of β̂1.

18.10 Twelve secretaries at a university in Queensland were asked to take a special three-
day intensive course to improve their keyboarding skills. At the beginning and again at the end of the course, they were given a particular two-page letter and asked to type it flawlessly. The data shown in the following table were recorded.

[Table: for each of the 12 typists, the number of years of experience x and the improvement in words per minute y; several values and the summations are illegible in the source.]

a Find the equation of the regression line.
b As a check of your calculations in part (a), plot the 12 points and graph the line.
c Does it appear that the secretaries' experience is linearly related to their improvement?

18.11 Advertising is often touted as the key to success. In seeking to determine just how influential advertising is, the management of a recently set up retail chain has collected data over the previous 15 weeks on sales revenue and advertising expenditure from its chain stores, with the results shown in the following table.

The intercept is β̂0 = 6533. Technically, the intercept is the point at which the regression line and the y-axis intersect. This means that when x = 0 (i.e. the car was not driven at all), the selling price is $6533. We might be tempted to interpret this number as the price of cars that have not been driven. However, in this case the intercept is probably meaningless. Because our sample did not include any cars with zero kilometres on the odometer, we have no basis for interpreting β̂0. As a general rule, we cannot determine the value of ŷ for a value of x that is far outside the range of the sample values of x. In this example, the smallest and largest values of x are 19 057 and 49 223, respectively. Because x = 0 is not in this interval, we cannot safely interpret the value of ŷ when x = 0.

In the sections that follow, we will return to this problem and the computer output to introduce other statistics associated with regression analysis.
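The hand calculations for Example 18.2 can be verified in a few lines. This sketch (ours, not part of the text) starts from the sums of squares and sample means reported in the example.

```python
# Example 18.2: slope and intercept from the reported sums of squares.
ss_x = 4_309_340_160          # SSx
ss_xy = -134_269_296          # SSxy
x_bar, y_bar = 36_009.45, 5411.41

b1 = ss_xy / ss_x             # slope: about -0.0311577
b0 = y_bar - b1 * x_bar       # intercept: about 6533.38
print(round(b1, 7), round(b0, 2))
```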
EXERCISES

Most of the exercises that follow were created to allow you to see how regression analysis is used to solve realistic problems. As a result, most feature a large number of observations. We anticipate that most students will solve these problems using a computer and statistical software. However, for students without these resources, we have calculated the sums of squares (SSx, SSxy, and one that will be needed later, SSy, which is the sum of squares for the dependent variable) that will permit them to complete the calculations manually. We believe that it is pointless for students to calculate the sums of squares from the raw data, except for the several small-sample exercises found below. In any case, students will have previously calculated sums of squares as part of a variety of procedures, including the sample variance (Chapter 3), covariance and correlation (Chapter 3), and analysis of variance (Chapter 15).

18.4 You are given the following six points.

[Data values illegible in the source.]

a Draw the scatter diagram.
b Determine the least squares line.

18.5 The observations of two variables were recorded as shown below.

[Data values illegible in the source.]

a Draw the scatter diagram.
b Find the least squares line.

18.6 A set of 10 observations to be analysed by a regression model yields the following summations:

Σx = …   Σy = 37   Σxy = 75   Σx² = 103   Σy² = …

Find the least squares regression line.

18.7 A set of 25 observations of two variables x and y produced the following summations:

Σx = …   Σy = 19.0   Σxy = …   Σx² = …   Σy² = …

Find the least squares regression line.

[Continuation of Exercise 18.11]

Advertising expenditure x ($'000s)   Sales revenue y ($'000s)
3.0    50
5.0    250
7.0    700
6.0    450
6.5    600
8.0    1000
3.5    75
4.0    150
4.5    200
6.5    550
7.0    750
7.5    800
7.5    900
8.5    1100
7.0    600

Σx = 91.5   Σy = 8175   Σxy = 57 787.5   Σx² = 598.75   Σy² = 6 070 625

a Find the coefficients of the regression line, using the least squares method.
b Make an economic interpretation of the slope.
c If the sign of the slope were negative, what would that say about the advertising?
d What does the value of the intercept tell you?

18.12 The term 'regression' was originally used in 1885 by Sir Francis Galton in his analysis of the relationship between the heights of children and parents. He formulated the 'law of universal regression', which specifies that 'each peculiarity in a man is shared by his kinsmen, but on average in a less degree'. (Evidently, people spoke this way in 1885.) In 1903, two statisticians, K. Pearson and A. Lee, took a random sample of 1078 father-son pairs to examine Galton's law ('On the Laws of Inheritance in Man, I. Inheritance of Physical Characteristics', Biometrika, vol. 2, pp. 457-62). Their sample regression line was

Son's height = 33.73 + 0.516 × Father's height

a Interpret the coefficients.
b What does the regression line tell you about the heights of sons of tall fathers?
c What does the regression line tell you about the heights of sons of short fathers?

Computer/Solving by hand exercises

18.13 The objective of commercials is to have as many viewers as possible remember the product in a favourable way and eventually buy it. In an experiment to determine how the length of a commercial affects people's memory of it, 60 randomly selected people were asked to watch a one-hour television program. In the middle of the show, a commercial advertising a brand of toothpaste appeared. Each viewer watched a commercial whose length varied between 20 and 60 seconds. The essential content of the commercials was the same. After the show, each person was given a test to measure how much he or she remembered about the product. The commercial times and test scores (on a 30-point test) are stored in file XR18-13.

a Draw a scatter diagram of the data to determine whether a linear model appears to be appropriate.
b Determine the least squares line.
c Interpret the coefficients.

Use a software package to solve this problem, or complete your answer manually using the sums of squares given below.
The sums of squares are

SS… = 2829.6 [the remaining values are illegible in the source]

18.14 After several semesters without much success, Pat Statsdud (a student in the lower quarter of a statistics subject) decided to try to improve. Pat needed to know the secret of success for university students. After many hours of discussion with other, more successful, students, Pat postulated a rather radical theory: the longer one studied, the better one's grade. To test the theory, Pat took a random sample of 100 students in an economics subject and asked each to report the average amount of time he or she studied economics and the final mark received. These data are stored in columns 1 (study time in hours) and 2 (final mark out of 100) in file XR18-14.

a Determine the sample regression line.
b Interpret the coefficients.
c Is the sign of the slope logical? If the slope had had the opposite sign, what would that tell you?

Use a software package to solve this problem, or complete your answer manually using the sums of squares:

SSstudytime,finalmark = 15 241   SS… = 36 020

18.15 Suppose that a statistician wanted to update the study described in Exercise 18.12. She collected data on 400 father-son pairs and stored the data in columns 1 (fathers' heights in centimetres) and 2 (sons' heights in centimetres) in file XR18-15.

a Determine the sample regression line.
b What does the value of β̂0 tell you?
c What does the value of β̂1 tell you?

Use a software package to solve this problem, or complete your answer manually using the sums of squares:

SSfather,son = 19 626.25   SSfather = 35 265   SSson = 40 980.6

18.16 Until now, most Australian private health insurance companies charged the same annual premium regardless of the length of time a person had private health cover. Recently, the Australian federal government announced a 30% rebate on the health insurance premium if a person has private health cover.
The health insurance companies are of the view that this government incentive will be attractive to senior members of the population who do not currently have private health cover. They are lobbying the federal government to accept their proposal to charge a higher premium to new members and a lower premium to those who have had private health cover for a longer period. A health economist is very suspicious about the motive behind the insurance companies' proposal. He gathered data concerning the age and mean daily medical expenses of a random sample of 1548 Australians over the previous 12-month period. The data are stored in file XR18-16 (column 1 = age; column 2 = mean daily medical expense).

a Determine the sample regression line.
b Interpret the coefficients.
c What rate plan would you suggest?

Use a software package to solve this problem, or complete your answer manually using the sums of squares:

SSage,expense = 69 408   SSage = 307 461

18.17 For Exercise 18.4, calculate the sum of squared errors (SSE) by calculating each value of ŷi, subtracting yi, squaring the differences, and summing the squared differences. Then, to check your calculation, calculate SSE by the shortcut method.

18.18 For Exercise 18.5, calculate SSE by calculating each value of ŷi, subtracting yi, squaring the differences, and summing the squared differences. Then calculate SSE by the shortcut method, and compare the results.

18.19 In a study of the relationship between two variables x and y, the following summations were calculated:

Σx = 105   Σy = …   Σxy = 37 525   Σx² = …   Σy² = 1 818 421   n = 15

Calculate SSE, sε² and sε.

18.4 Error variable: required conditions

In the previous section, we described the least squares method of estimating the coefficients of the probabilistic model.
A critical part of this model is the error variable ε. In the next section, we present methods of assessing how well the straight line fits the data. In order for these methods to be valid, however, five requirements involving the probability distribution of the error variable must be satisfied.

Required conditions for the error variable
1 The probability distribution of ε is normal.
2 The mean of the distribution is zero; that is, E(ε) = 0.
3 The standard deviation of ε is σε, which is a constant no matter what the value of x is. (Errors with this property are called homoscedastic.)
4 The errors associated with any two values of y are independent. As a result, the value of the error variable at one point does not affect the value of the error variable at another point. (Errors that do not satisfy this requirement are known as autocorrelated errors.)
5 The errors are independent of the independent variables.

Requirements 1, 2 and 3 can be interpreted in another way: for each value of x, y is a normally distributed random variable whose mean is

E(y) = β0 + β1x

and whose standard deviation is σε. Notice that the mean depends on x. To reflect this dependence, the expected value is sometimes expressed as

E(y|x) = β0 + β1x

The standard deviation, however, is not influenced by x, because it is a constant over all values of x. Figure 18.5 depicts this interpretation. In Section 18.8, we will discuss how departures from these required conditions affect the regression analysis and how they are identified.

Experimental and observational data

Statisticians often design controlled experiments in which regression analysis will be used. They do so by setting several different values of x and observing the corresponding values of y. For example, the data in Exercise 18.13 were gathered through a controlled experiment.
To determine the effect of the length of a television commercial on its viewers' memories of the product advertised, the statistician arranged for 60 television viewers to watch commercials of differing lengths and then tested their memories of those commercials. Each viewer was randomly assigned a commercial length. The values of x ranged from 20 to 60 and were set by the statistician as part of the experiment. For each value of x, the distribution of the memory test scores is assumed to be normally distributed with a constant variance.

Figure 18.5 Distribution of y given x
[For each of several values of x, a normal curve centred on E(y|x) = β0 + β1x, all with common standard deviation σε]

In many cases, it is difficult or impossible to design a controlled experiment. Thus, we have no alternative to gathering observational data. As was the case with other techniques, whether the data are observational or experimental does not affect the choice of statistical method; we apply regression analysis to both experimental and observational data. However, when the data are observational, there may be differences in the
Consequently, it is important for us to assess how well the linear model fits the data. Ifthe St is poor, we should discard the linear model and seek another one, Several methods are used to evaluate the model. In this section, we present two statistics and one test procedure to determine whether a linear model should be employed. They are the standard error of estimate, the test of the slope, and the coefficient of determination. Standard error of estimate In Section 184, we pointed out that the error variable ¢ is normally distributed with mean zeroand standard deviation 9, If gs large, some of the errors will be large, which implies that the model's fit is poor. If g, is small, the errors tend to be close to the mean (which is zero), and, as a result, the model fits well. Hence, we could use g; to measure the suitability of using a linear model. Unfortunately, g, is a population parameter and, like most parameters, is unknown. We can, however, estimate o, from the data. The estimates based on the statistic we introduced in Section 18.2, the sum of squares for error, SSE. Recall that SSE is the minimised sum of squared differences between the points and the line. That is, for the least squares line, SSE = Eet= Dyi- 9" For Example 18.1, we showed how SSE is found. We determined the value of for each value of x, calculated the difference betweeri y and g, squared the difference, and added. This procedure can be quite time-consuming, Fortunately, we can also express SSE as a function of sums of squares ‘Simplified calculation for SSE S83 SSE =, = rss $5,- Eu. =Zy- SH or Seng zach asa 634 AUSTRALIAN BUSINESS STATISTICS We can estimate o} by dividing SSE by the number of observations minus 2, where 2 represents the number of parameters estimated in the regression model—namely, fy and j. That is, the sample statistic SSE is an unbiased estimator of of. The square root of s2is called the standard error of estimate. 
Standard error of estimate

sε = √(SSE/(n − 2))

EXAMPLE 18.3

Find the standard error of estimate for Example 18.2.

SOLUTION

Solving by hand

To calculate the standard error of estimate, we need to find the sum of squares for error. This requires the calculation of SSx, SSxy and SSy. In Example 18.2, we calculated

SSx = 4 309 340 160 and SSxy = −134 269 296

From the data we find

SSy = Σ(yi − ȳ)² = 6 434 890

We can now determine the sum of squares for error:

SSE = SSy − (SSxy)²/SSx = 6 434 890 − (−134 269 296)²/4 309 340 160 = 2 251 363

Thus, the standard error of estimate is

sε = √(SSE/(n − 2)) = √(2 251 363/98) = 151.569

Using the computer

Refer to the Excel printout for Example 18.2. Excel reports the standard error of estimate as

Standard Error   151.5687515

Interpreting the results

The smallest value that sε can assume is zero, which occurs when SSE = 0, that is, when all the points fall on the regression line. Thus, when sε is small, the fit is excellent, and the linear model is likely to be an effective analytical and forecasting tool. If sε is large, the model is a poor one, and the statistician should improve or discard it.

We judge the value of sε by comparing it to the values of the dependent variable y, or more specifically to the sample mean ȳ. In this example, sε (= 151.6) is only 2.8% relative to ȳ (= 5411.4). Therefore, we could conclude that the standard error of estimate is reasonably small. In general, the standard error of estimate cannot be used alone as an absolute measure of the model's utility. Nonetheless, sε is useful in comparing models: if the statistician has several models from which to choose, the one with the smallest value of sε should generally be the one used.
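The arithmetic in Example 18.3 can be confirmed with a short script (ours, not part of the text), using the simplified SSE formula and the sums of squares reported above.

```python
import math

# Standard error of estimate for Example 18.3.
ss_x = 4_309_340_160
ss_xy = -134_269_296
ss_y = 6_434_890
n = 100

sse = ss_y - ss_xy ** 2 / ss_x       # simplified SSE: SSy - SSxy^2 / SSx
s_e = math.sqrt(sse / (n - 2))       # standard error of estimate
print(round(sse), round(s_e, 4))
```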
As you'll see, sε is also an important statistic in other procedures associated with regression.

Estimating the slope and the intercept

As discussed in Chapter 8, two types of estimators are available to estimate an unknown parameter, namely point estimators and interval estimators. In Section 18.3 we used the least squares method to derive the point estimators β̂0 and β̂1 of the intercept β0 and the slope coefficient β1. Now we provide the interval estimators of β1 and β0.

Confidence interval estimators of β1 and β0

β̂1 ± t(α/2, n−2) s(β̂1)
β̂0 ± t(α/2, n−2) s(β̂0)

where s(β̂1), the standard deviation of β̂1 (also called the standard error of β̂1), is equal to

s(β̂1) = sε/√SSx

and s(β̂0), the standard error of β̂0, is equal to

s(β̂0) = sε √(1/n + x̄²/SSx)

EXAMPLE 18.4

Determine the 95% confidence interval estimate of the slope β1 for Example 18.2.

SOLUTION

Solving by hand

In Example 18.2, we found SSx = 4 309 340 160, and in Example 18.3 we found sε = 151.569. Thus,

s(β̂1) = sε/√SSx = 151.569/√4 309 340 160 = 0.002309

Therefore, the 95% confidence interval estimate of β1 for Example 18.2 is

β̂1 ± t(.025, 98) s(β̂1) = −0.0311577 ± 1.984 × 0.002309 = −0.0311577 ± 0.0045811

Thus, the 95% confidence interval estimate of the slope is the interval from −0.036 to −0.027.
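The interval in Example 18.4 can be reproduced as follows (a sketch, not from the text; the critical value 1.984 for t(.025, 98) is taken from the example).

```python
import math

# 95% confidence interval for the slope, Example 18.4.
b1 = -0.0311577
s_e = 151.569
ss_x = 4_309_340_160
t_crit = 1.984                       # t(.025, 98), from the t table

s_b1 = s_e / math.sqrt(ss_x)         # standard error of the slope
lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 3), round(upper, 3))   # -0.036 -0.027
```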
However, we can draw inferences about the population slope $\beta_1$ from the sample slope $\hat{\beta}_1$. The process of testing hypotheses about $\beta_1$ is identical to the process of testing any other parameter. We begin with the hypotheses. The null hypothesis specifies that there is no linear relationship, which means that the slope is zero. Thus, we specify

$$H_0:\ \beta_1 = 0$$

We can conduct one- or two-tail tests of $\beta_1$. Most often, we perform a two-tail test to determine whether there is sufficient evidence to infer that a linear relationship exists. We test

$$H_1:\ \beta_1 \ne 0$$

The test statistic is

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}}$$

where $s_{\hat{\beta}_1}$ is the standard deviation of $\hat{\beta}_1$ (also called the standard error of $\hat{\beta}_1$) and is equal to

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{SS_x}}$$

If the error variable is normally distributed, the test statistic is Student t distributed with n - 2 degrees of freedom. Note that 2 represents the number of parameters estimated in the regression model.

EXAMPLE 18.5

Test to determine whether there is enough evidence in Example 18.2 to infer that there is a linear relationship between the price and the odometer reading. Use a significance level of 5%.

SOLUTION

We test the hypotheses

$$H_0:\ \beta_1 = 0 \qquad H_1:\ \beta_1 \ne 0$$

Level of significance: $\alpha = 0.05$

Decision rule: Reject $H_0$ if $|t| > t_{\alpha/2,\,n-2} = t_{.025,\,98} = 1.984$.

Value of the test statistic:

Solving by hand

To calculate the value of the test statistic, we need $\hat{\beta}_1$ and $s_{\hat{\beta}_1}$. In Example 18.2, we found $SS_x = 4\,309\,340\,160$ and $\hat{\beta}_1 = -0.0311577$. In Example 18.3, we found $s_\varepsilon = 151.569$. Thus,

$$s_{\hat{\beta}_1} = \frac{s_\varepsilon}{\sqrt{SS_x}} = \frac{151.569}{\sqrt{4\,309\,340\,160}} = 0.0023089$$

The value of the test statistic is

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} = \frac{-0.0311577 - 0}{0.0023089} = -13.49$$

Conclusion: Since t = -13.49 < -1.984, we reject the null hypothesis.

Using the computer

Excel output for Example 18.5

The output below was taken from the Excel output for Example 18.2. The printout includes the standard deviation of $\hat{\beta}_1$ (Standard Error), the t-statistic (t Stat) and the two-tail p-value of the test (P-value). These values are 0.002308896, -13.49465083 and 4.44346E-24 (which is practically 0), respectively. Notice that the printout also includes a test for $\beta_0$. However, as we've pointed out before, interpreting the value of the y-intercept can lead to erroneous, if not ridiculous, conclusions. As a result, we will ignore the test of $\beta_0$.

Interpreting the results

The value of the test statistic is t = -13.49, with a p-value of 0.000. There is overwhelming evidence to infer that a linear relationship exists. (Figure 18.7 depicts the sampling distribution of the test statistic.) What this means is that the odometer reading does affect the auction selling price of the cars.

Figure 18.7  Sampling distribution of the test statistic for Example 18.5: rejection regions beyond -1.984 and 1.984; observed t = -13.49

As was the case when we interpreted the y-intercept, the conclusion we draw here is valid only over the range of the values of the independent variable. That is, we can infer that there is a linear relationship between odometer reading and auction price for five-year-old Ford Lasers whose odometer readings lie between 19 057 and 49 223 km (the minimum and maximum values of x in the sample). Because we have no observations outside this range, we do not know how, or even whether, the two variables are related. Figure 18.8 depicts several possible relationships over a wider range of values of x than the sample data in Example 18.2. As you can see, all three figures show a linear relationship when x lies between 20 000 and 50 000 km. Outside this range the relationship may be linear (Figure 18.8a) or nonlinear (Figures 18.8b and 18.8c). This issue is particularly important to remember when we use the regression equation to estimate or forecast (see Section 18.6).
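The t test of the slope in Example 18.5 reduces to a couple of arithmetic steps. A minimal sketch (in Python, for illustration; the inputs are the values computed by hand in Examples 18.2–18.4):

```python
# Two-tail t test of H0: beta1 = 0 against H1: beta1 != 0 (Example 18.5)
b1 = -0.0311577            # sample slope from Example 18.2
s_b1 = 0.0023089           # standard error of the slope from Example 18.4
t_crit = 1.984             # critical value t_{.025, 98} at alpha = 0.05

# Test statistic: t = (b1 - 0) / s_b1
t_stat = (b1 - 0) / s_b1

# Decision rule: reject H0 if |t| exceeds the critical value
reject_h0 = abs(t_stat) > t_crit

print(t_stat, reject_h0)
```

The statistic comes out at about -13.49, far beyond the critical value of -1.984, so the null hypothesis of no linear relationship is rejected, in agreement with the near-zero p-value on the Excel printout.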
Coefficient of determination

The test of $\beta_1$ addresses only the question of whether there is enough evidence to infer that a linear relationship exists. In many cases, however, it is also useful to measure the strength of that linear relationship, particularly when we want to compare several different models. The statistic that performs this function is the coefficient of determination.

The coefficient of determination, denoted $R^2$ and defined as the ratio $SSR/SS_y$ (where SSR is the sum of squares for regression, defined below), is calculated in the following way:

$$R^2 = \frac{(SS_{xy})^2}{SS_x\, SS_y}$$

With a little algebra, mathematicians can show that

$$R^2 = 1 - \frac{SSE}{SS_y}$$

Figure 18.8  Relationships between odometer reading and auction price: three scatter diagrams, (a), (b) and (c), each plotting price (from about 4000 to 6000) against odometer reading

The significance of this formula is based on the analysis of variance technique. In Chapter 15, we partitioned the total sum of squares into two sources of variation. Here, we begin the discussion by observing that the deviation between $y_i$ and $\bar{y}$ can be decomposed into two parts. That is,

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$

This equation is represented graphically (for i = 1) in Figure 18.9.
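Applying the coefficient of determination to the data of Example 18.2 is straightforward. A sketch (again in Python, using the sums of squares computed earlier in the chapter; the resulting $R^2$ value is arithmetic from those inputs, not a figure quoted in this excerpt):

```python
# Coefficient of determination for the Example 18.2 data
ss_x = 4_309_340_160       # sum of squared deviations of x
ss_xy = -134_269_296       # sum of cross-deviations of x and y
ss_y = 6_434_890           # sum of squared deviations of y

# Sum of squares for regression and for error
ssr = ss_xy**2 / ss_x
sse = ss_y - ssr

# R^2 two equivalent ways: SSR/SS_y and 1 - SSE/SS_y
r2 = ssr / ss_y
r2_alt = 1 - sse / ss_y

print(r2, r2_alt)
```

Both formulas give the same value, about 0.65, meaning roughly 65% of the variation in auction price is explained by the variation in odometer reading, with the remaining 35% left unexplained.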
