DECISION ANALYSIS

Linear Regression
Chapter Contents
8.1 Simple Linear Regression Model
8.2 Least Squares Method
8.3 Assessing the Fit of the Simple Linear Regression Model
8.4 The Multiple Linear Regression Model
8.5 Inference and Linear Regression
8.6 Categorical Independent Variables
8.7 Modeling Nonlinear Relationships
8.8 Model Fitting
8.9 Big Data and Linear Regression
8.10 Prediction with Linear Regression
Summary



Learning Objectives (1 of 2)
After completing this chapter, you will be able to:
LO 8-1 Construct an estimated simple linear regression model that estimates
how a dependent variable is related to an independent variable.
LO 8-2 Construct an estimated multiple linear regression model that
estimates how a dependent variable is related to multiple independent
variables.
LO 8-3 Compute and interpret the estimated coefficient of determination for a
linear regression model.
LO 8-4 Assess whether the conditions necessary for valid inference in a least
squares linear regression model are satisfied.
LO 8-5 Test hypotheses about the parameters of a linear regression model
and interpret the results of these hypotheses tests.



Learning Objectives (2 of 2)
LO 8-6 Compute and interpret confidence intervals for the parameters of a
linear regression model.
LO 8-7 Use dummy variables to incorporate categorical independent
variables in a linear regression model and interpret the associated
estimated regression parameters.
LO 8-8 Use a quadratic regression model, a piecewise linear regression
model, and interaction between independent variables to account for
curvilinear relationships between independent variables and the
dependent variable in a regression model and interpret the estimated
parameters.
LO 8-9 Use an estimated linear regression model to predict the value of the
dependent variable given values of the independent variables.



Introduction
Managerial decisions are often based on the relationship between variables.
Regression analysis is a statistical procedure that uses data to develop an
equation showing how the variables are related.
• Dependent (or response) variable is the variable being predicted.
• Independent (or predictor) variables (or features) are variables used to
predict the value of the dependent variable.
• Simple linear regression is a form of regression analysis in which a
single (“simple”) independent variable, 𝑥, is used to develop a “linear”
relationship (straight line) with the dependent variable, 𝑦.
• Multiple linear regression is a more general form of regression
analysis involving two or more independent variables.



8.1 Simple Linear Regression Model
The simple linear regression model is an equation that describes how the
dependent variable 𝑦 is related to the independent variable 𝑥 and error term 𝜀.
𝒚 = 𝜷𝟎 + 𝜷𝟏 𝒙 + 𝜺
Where
𝑦 is the dependent variable.
𝑥 is the independent variable.
𝛽0 and 𝛽1 are referred to as the population parameters.
𝜀 is the error term. It accounts for the variability in 𝑦 that cannot be
explained by the linear relationship between 𝑥 and 𝑦.



8.1 Estimated Simple Linear Regression Equation
The estimated simple linear regression equation is described as follows.
$\hat{y} = b_0 + b_1 x$
Where $\hat{y}$ is the point estimator of $E(y|x)$, the mean of $y$ for a given $x$.

𝑏0 is the point estimator of 𝛽0 and the 𝑦-intercept of the regression.
The 𝑦-intercept 𝑏0 is the estimated value of the dependent variable 𝑦 when
the independent variable 𝑥 is equal to 0.
𝑏1 is the point estimator of 𝛽1 and the slope of the regression.
The slope 𝑏1 is the estimated change in the value of the dependent
variable 𝑦 that is associated with a one unit increase in the independent
variable 𝑥.



8.1 The Estimation Process in Simple Linear Regression
The estimation of $b_0$ and $b_1$ is a statistical process much like the
estimation of the population mean $\mu$ described in Chapter 7.
$\beta_0$ and $\beta_1$ are the unknown parameters of interest, and $b_0$ and
$b_1$ are the sample statistics used to estimate them.
The flow chart to the right summarizes the estimation process for simple
linear regression.



8.2 Least Squares Method
The least squares method is a procedure for using sample data to find the
estimated linear regression equation (see notes for the $b_0$ and $b_1$ equations).

$\min \sum e_i^2 = \min \sum (y_i - \hat{y}_i)^2 = \min \sum (y_i - b_0 - b_1 x_i)^2$

Where
$e_i = y_i - \hat{y}_i$ is referred to as the $i$th residual: the error made in
estimating the value of the dependent variable for the $i$th observation.
$x_i$ and $y_i$ are the values of the independent and dependent variables for the
$i$th observation.
$\hat{y}_i$ is the predicted value of the dependent variable for the $i$th observation.
$n$ is the total number of observations.
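As a concrete check, here is a minimal Python sketch (numpy only) of the closed-form least squares estimates; the miles and hours values are the ten Butler Trucking assignments tabulated later in this section, and the variable names are our own.

```python
import numpy as np

# Ten Butler Trucking assignments (from the table later in this section)
miles = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)
hours = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])

# Closed-form least squares estimates:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),  b0 = ybar - b1 * xbar
x_bar, y_bar = miles.mean(), hours.mean()
b1 = np.sum((miles - x_bar) * (hours - y_bar)) / np.sum((miles - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")  # b0 = 1.2739, b1 = 0.0678
```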



8.2 The Butler Trucking Company Example
We use a sample of 10 randomly selected driving assignments made by the
Butler Trucking Company to build a scatter chart depicting the relationship
between the travel time (in hours) and the miles traveled.
DATAfile: butler
Because the scatter diagram
shows a positive linear
relationship, we choose the
simple linear regression model
to represent the relationship
between travel time (𝑦) and
miles traveled (𝑥).



8.2 Regression Equation for Butler Trucking Co.
Computer software produces the following simple linear regression equation.
$\hat{y} = 1.2739 + 0.0678x$
The slope and intercept of the regression are 𝑏1 = 0.0678 and 𝑏0 = 1.2739.
Thus, we estimate that if the length
of a driving assignment were 1 mile
longer, the mean travel time would
be 0.0678 hours or ~4 minutes
longer.
Also, if the length of a driving
assignment were 0 miles, the mean
travel time would be 1.2739 hours
or ~76 minutes. See notes for Excel.



8.2 Experimental Region and Extrapolation
The regression model is valid only over the experimental region, defined as
the range of values of the independent variables in the data used to estimate
the model.
Extrapolation, the prediction of the value of the dependent variable outside
the experimental region, is risky and should be avoided unless we have
empirical evidence dictating otherwise.
The experimental region for the Butler Trucking data is from 50 to 100 miles.
• Any prediction of travel time for a driving distance less than
50 miles or greater than 100 miles is not a reliable estimate.
• Thus, for this model the estimate of 𝛽0 is meaningless.



8.2 Estimating Travel Time for the Butler Trucking Co.
We can use the estimated model for the Butler Trucking Company example,
and the known values for miles traveled for a driving assignment to estimate
the mean travel time in hours.
For example, the first driving assignment in the data set has a value for miles
traveled of 𝑥 = 100, and a value for travel time of 𝑦 = 9.3 hours.
The mean travel time for this driving assignment is estimated to be
$\hat{y}_i = 1.2739 + 0.0678(100) = 8.0539$ hours
The resulting residual of the estimate is
$e_i = y_i - \hat{y}_i = 9.3 - 8.0539 = 1.2461$ hours
The next slide shows the calculations for the 10 observations in the data set.



8.2 Predicted Travel Time and Residuals

Assignment i | x_i = Miles Traveled | y_i = Travel Time (hours) | ŷ_i = b0 + b1 x_i | e_i = y_i − ŷ_i | e_i² = (y_i − ŷ_i)²
1  | 100 | 9.3 | 8.0539 |  1.2461 | 1.5528
2  |  50 | 4.8 | 4.6639 |  0.1361 | 0.0185
3  | 100 | 8.9 | 8.0539 |  0.8461 | 0.7159
4  | 100 | 6.5 | 8.0539 | −1.5539 | 2.4146
5  |  50 | 4.2 | 4.6639 | −0.4639 | 0.2152
6  |  80 | 6.2 | 6.6979 | −0.4979 | 0.2479
7  |  75 | 7.4 | 6.3589 |  1.0411 | 1.0839
8  |  65 | 6.0 | 5.6809 |  0.3191 | 0.1018
9  |  90 | 7.6 | 7.3759 |  0.2241 | 0.0502
10 |  90 | 6.1 | 7.3759 | −1.2759 | 1.6279
Totals: Σy_i = 67.0, Σŷ_i = 67.0000, Σe_i = 0.0000, Σe_i² = 8.0288



8.3 The Sums of Squares
The value of the sum of squares due to error (SSE) is a measure of the
error that results from using the $\hat{y}_i$ values to predict the $y_i$ values.
$SSE = \sum (y_i - \hat{y}_i)^2$
The value of the total sum of squares (SST) is a measure of the error that
results from using the sample mean $\bar{y}$ to predict the $y_i$ values.
$SST = \sum (y_i - \bar{y})^2$
The value of the sum of squares due to regression (SSR) is a measure of
how much the $\hat{y}_i$ values deviate from the sample mean $\bar{y}$.
$SSR = \sum (\hat{y}_i - \bar{y})^2$
The relationship between these three sums of squares is $SST = SSR + SSE$.



8.3 Total Sum of Squares for the Butler Trucking Co.

Assignment i | x_i = Miles Traveled | y_i = Travel Time (hours) | y_i − ȳ | (y_i − ȳ)²
1  | 100 | 9.3 |  2.6 | 6.76
2  |  50 | 4.8 | −1.9 | 3.61
3  | 100 | 8.9 |  2.2 | 4.84
4  | 100 | 6.5 | −0.2 | 0.04
5  |  50 | 4.2 | −2.5 | 6.25
6  |  80 | 6.2 | −0.5 | 0.25
7  |  75 | 7.4 |  0.7 | 0.49
8  |  65 | 6.0 | −0.7 | 0.49
9  |  90 | 7.6 |  0.9 | 0.81
10 |  90 | 6.1 | −0.6 | 0.36
Totals: Σy_i = 67.0, Σ(y_i − ȳ) = 0, SST = 23.90



8.3 Coefficient of Determination
The ratio $SSR/SST$ is called the coefficient of determination, denoted by $r^2$.
$r^2 = SSR/SST$
The coefficient of determination can only assume values between 0 and 1 and
is used to evaluate the goodness of fit for the estimated regression equation.
A perfect fit exists when $y_i$ is identical to $\hat{y}_i$ for every observation $i$, so that all
residuals $y_i - \hat{y}_i = 0$.
• In such a case, $SSE = 0$, $SSR = SST$, and $r^2 = SSR/SST = 1$.
Poorer fits between $y_i$ and $\hat{y}_i$ result in larger values of $SSE$ and lower
$r^2$ values.
• The poorest fit occurs when $SSE = SST$, $SSR = 0$, and $r^2 = 0$.



8.3 Goodness of Fit for the Butler Trucking Co.
From our previous calculations for the sum of squares due to error, we
already know that
$SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2 = 8.0288$
Similar calculations for the total sum of squares reveal that
$SST = \sum (y_i - \bar{y})^2 = 23.90$
Because of the sum of squares relationship, we can write
$r^2 = SSR/SST = 1 - SSE/SST = 1 - 8.0288/23.90 = 0.6641$
Thus, we can conclude that 66.41% of the variability in the values of travel
time can be explained by the linear relationship between the miles traveled
and travel time. See notes for Excel instructions.
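Continuing the numpy sketch from Section 8.2 (it assumes miles, hours, b0, and b1 are still in scope), the sums of squares and $r^2$ follow directly from the fitted values:

```python
# Fitted values, residuals, and goodness of fit for the simple model
y_hat = b0 + b1 * miles
sse = np.sum((hours - y_hat) ** 2)           # sum of squares due to error
sst = np.sum((hours - hours.mean()) ** 2)    # total sum of squares
ssr = sst - sse                              # sum of squares due to regression
r2 = ssr / sst

print(f"SSE = {sse:.4f}, SST = {sst:.2f}, r^2 = {r2:.4f}")
# SSE = 8.0287, SST = 23.90, r^2 = 0.6641
# (the slide's SSE of 8.0288 reflects rounding b0 and b1 to four decimals)
```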



8.4 Multiple Linear Regression Model
The multiple linear regression model describes how the dependent variable
𝑦 is related to the independent variables 𝑥1 , 𝑥2 , … , 𝑥𝑞 and an error term 𝜀.
𝒚 = 𝜷𝟎 + 𝜷𝟏 𝒙𝟏 + 𝜷𝟐 𝒙𝟐 + … + 𝜷𝒒 𝒙𝒒 + 𝜺
Where
𝛽0 , 𝛽1 , 𝛽2 , … , 𝛽𝑞 are the parameters of the model.
𝜀 is the error term that accounts for the variability in 𝑦 that cannot be
explained by the linear effect of the 𝑞 independent variables.
The coefficient 𝛽𝑗 (with 𝑗 = 1 … 𝑞) represents the change in the mean value of
𝑦 that corresponds to a one unit increase in the independent variable 𝑥𝑗 ,
holding the values of all other independent variables in the model constant.



8.4 The Estimation Process in Multiple Regression
The estimated multiple linear regression equation is
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_q x_q$
Where:
$\hat{y}$ is a point estimate of $E(y)$ for
a given set of $q$ independent
variables, $x_1, x_2, \dots, x_q$.
A simple random sample is
used to compute the sample
statistics 𝑏0 , 𝑏1 , 𝑏2 , … , 𝑏𝑞 that
are used as estimates of
𝛽0 , 𝛽1 , 𝛽2 , … , 𝛽𝑞 .



8.4 Least Squares Method and Multiple Regression
The least squares method uses the sample data to provide the values of the
sample statistics $b_0, b_1, b_2, \dots, b_q$ that minimize the sum of the squared errors
between the $y_i$ and the $\hat{y}_i$.
$\min \sum (y_i - \hat{y}_i)^2 = \min \sum (y_i - (b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_q x_{iq}))^2 = \min \sum e_i^2$
Where
$y_i$ is the value of the dependent variable for the $i$th observation.
$\hat{y}_i$ is the predicted value of the dependent variable for the $i$th observation.
Because the formulas for the regression coefficients involve the use of
matrix algebra, we rely on computer software packages to perform the
calculations.
The emphasis will be on how to interpret the computer output rather than on
how the estimates are computed.

8.4 Multiple Regression with Two Independent
Variables
DATAfile: butlerwithdeliveries
We add a second independent variable, the number of deliveries made per
driving assignment, which also contributes to the total travel time.
The estimated multiple linear regression with two independent variables is
𝑦ො = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2
Where
𝑦ො = estimated mean travel time
𝑥1 = distance traveled (miles)
𝑥2 = number of deliveries
The 𝑆𝑆𝐸, 𝑆𝑆𝑇, 𝑆𝑆𝑅 and coefficient of determination (denoted 𝑅2 in multiple
linear regression) are computed as we saw for simple linear regression.
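A sketch of the same fit in Python using statsmodels; note that the deliveries column lives in the butlerwithdeliveries DATAfile and is not reproduced on these slides, so the values below are placeholders for illustration only:

```python
import numpy as np
import statsmodels.api as sm

miles = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)
deliveries = np.array([4, 3, 4, 2, 2, 2, 3, 4, 3, 2], dtype=float)  # placeholder values
hours = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])

X = sm.add_constant(np.column_stack([miles, deliveries]))  # prepend the intercept column
model = sm.OLS(hours, X).fit()

print(model.params)    # b0, b1, b2 (the actual DATAfile yields ~0.1273, 0.0672, 0.6900)
print(model.rsquared)  # R^2 (the actual DATAfile yields ~0.8173)
```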



8.4 Butler Trucking Co. and Multiple Regression
The estimated multiple linear regression equation (see notes for Excel
instructions), after rounding the sample coefficients to four decimal places, is
$\hat{y} = 0.1273 + 0.0672x_1 + 0.6900x_2$
• For a fixed number of deliveries, the mean travel time is expected to
increase by 0.0672 hours (~4 minutes) when the distance traveled
increases by 1 mile.
• For a fixed distance traveled, the mean travel time is expected to increase
by 0.69 hours (~41 minutes) for each additional delivery.
• The interpretation of the estimated y-intercept is not meaningful because it
results from extrapolation.
• We can now explain 𝑅2 = 81.73% of the variability in total travel time.



8.4 Butler Trucking Co. Excel Regression
Output



8.4 Graph of the Multiple Linear Regression
Equation
With two independent variables 𝑥1 and 𝑥2 , we now generate a predicted
value of 𝑦 for every combination of values of 𝑥1 and 𝑥2 .
• Instead of a regression line, we now create a 3-D regression plane.
The graph of the estimated
regression plane shows the
seventh driving assignment
for the Butler Trucking
Company example.
*See notes for details on the
interpretation of the graph.



8.5 Conditions for Valid Inference in
Regression
Given a multiple linear regression model expressed as
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + … + 𝛽𝑞 𝑥𝑞 + 𝜀
The least squares method is used to develop estimates of the model
parameters resulting in the estimated multiple linear regression equation.
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_q x_q$
The validity of inferences depends on two conditions on the error
term $\varepsilon$:
1. For any given combination of values of the independent variables
𝑥1 , 𝑥2 , … , 𝑥𝑞 , the population of potential error terms 𝜀 is normally
distributed with a mean of 0 and a constant variance.
2. The values of 𝜀 are statistically independent.



8.5 Illustration of the Conditions for Valid
Inference
The value of 𝐸(𝑦|𝑥) changes
linearly according to the specific
value of 𝑥 considered, and so the
mean error is zero at each value of
𝑥.
The error term 𝜀 and hence the
dependent variable 𝑦 are normally
distributed with the same variance.
The specific value of the error term
𝜀 at any particular point depends on
whether the actual value of 𝑦 is
greater or less than 𝐸(𝑦|𝑥).



8.5 Scatter Chart of the Residuals
A simple scatter chart of the residuals is an extremely effective method for
assessing whether the error term conditions are violated.
The example to the right displays a random error pattern for a scatter chart
of residuals versus the predicted values of the dependent variable.
For proper inference, the scatter chart must exhibit a random pattern with
• residuals centered around zero,
• a constant spread of the residuals
throughout, and
• residuals symmetrically distributed
with the values near zero
occurring more frequently than
those outside.
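Such a residual chart is straightforward to produce with matplotlib; this sketch assumes the hours and y_hat arrays from the Section 8.2 sketch:

```python
import matplotlib.pyplot as plt

residuals = hours - y_hat  # residuals from the earlier fit

plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")   # residuals should center on zero
plt.xlabel("Predicted value of y")
plt.ylabel("Residual")
plt.title("Residuals vs. predicted values")
plt.show()
```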



8.5 Common Error Term Violations
[Residual plots illustrating four common violations: a nonlinear pattern, a
nonconstant spread, non-normal residuals, and non-independent residuals.]


8.5 Excel Residual Plots for the Butler Trucking Co.
[Two residual plots: residuals vs. Miles and residuals vs. Deliveries.]
Both residual plots show valid conditions for inference. In Excel, select the
Residual Plots option in the Residuals area of the Regression dialog box.


8.5 Scatter Chart of Residuals vs. Predicted Variable
A scatter chart of the
residuals against the
predicted values $\hat{y}$ is
also commonly used.
The scatter chart to
the right for the Butler
Trucking Company
data shows valid
conditions for
inference.
See notes to create
the data and this chart
in Excel.
8.5 t Test for Individual Significance
In a multiple regression model with $q$ independent variables, for each
parameter $\beta_j$ ($j = 1 \dots q$), we use a t test to test the hypothesis that parameter
$\beta_j$ is zero.
$H_0\!: \beta_j = 0$
$H_a\!: \beta_j \neq 0$
The test statistic follows a t distribution with $n - q - 1$ degrees of freedom.
$t = b_j / s_{b_j}$

Where $s_{b_j}$ is the estimated standard deviation of the regression coefficient $b_j$.


If p−value ≤ 𝛼, we reject 𝐻0 and conclude that there is a linear relationship
between the dependent variable 𝑦 and the independent variable 𝑥𝑗 .
Statistical software will generally report a p−value for each test statistic.



8.5 Individual t Tests for the Butler Trucking Co. Example
The multiple regression output for the Butler Trucking Company example
shows the t-ratio calculations. The values of $b_1$, $b_2$, $s_{b_1}$, and $s_{b_2}$ are as follows.
Variable Miles: $b_1 = 0.0672$, $s_{b_1} = 0.00245$
Variable Deliveries: $b_2 = 0.6900$, $s_{b_2} = 0.02952$
Calculation of the t-ratios provides the test statistic for the hypotheses involving
parameters $\beta_1$ and $\beta_2$, also provided by the computer output.
$b_1 / s_{b_1} = 0.0672 / 0.00245 = 27.37$
$b_2 / s_{b_2} = 0.6900 / 0.02952 = 23.37$
Using $\alpha = 0.01$, the p-values of 0.000 in the output indicate that we can reject
$H_0\!: \beta_1 = 0$ and $H_0\!: \beta_2 = 0$. Hence, both parameters are statistically significant.



8.5 Testing Regression Coefficients with Confidence Intervals
Confidence intervals can be used to test whether each of the regression
parameters $\beta_1, \beta_2, \dots, \beta_q$ is equal to zero.
To test that $\beta_j$ is zero (i.e., there is no linear relationship between $x_j$ and $y$) at
some predetermined level of significance (say 0.05), first build a confidence
interval at the (1 − 0.05)100% confidence level.
If the resulting confidence interval does not contain zero, we conclude that $\beta_j$
differs from zero at the predetermined level of significance.
The multiple regression output for the Butler Trucking Company example
shows each regression coefficient's confidence intervals at the 95% and 99%
confidence levels.
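With statsmodels, the t statistics, p-values, and confidence intervals are attributes of the fitted results object; a sketch assuming the model object from the Section 8.4 sketch:

```python
# t statistics and p-values for each coefficient (H0: beta_j = 0)
print(model.tvalues)
print(model.pvalues)

# Confidence intervals for the parameters: an interval that excludes
# zero rejects H0: beta_j = 0 at the corresponding significance level
print(model.conf_int(alpha=0.05))  # 95% intervals
print(model.conf_int(alpha=0.01))  # 99% intervals
```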



8.5 Addressing Nonsignificant Independent Variables
If practical experience dictates that a nonsignificant independent variable 𝑥𝑗
is related to the dependent variable 𝑦, the independent variable 𝑥𝑗 should be
left in the model.
If the model sufficiently explains the dependent variable 𝑦 without the
nonsignificant independent variable 𝑥𝑗 , then consider rerunning the
regression without the nonsignificant independent variable 𝑥𝑗 .
At times, the estimates of the other regression coefficients and their 𝑝−values
may change considerably when we remove the nonsignificant independent
variable 𝑥𝑗 from the model.
The appropriate treatment of the inclusion or exclusion of the 𝑦-intercept
when 𝑏0 is not statistically significant may require special consideration (*see
notes.)



8.5 Multicollinearity
Multicollinearity refers to the correlation among the independent variables in
multiple regression analysis (*see notes.)
• Multicollinearity increases the standard errors of the regression
estimates of $\beta_1, \beta_2, \dots, \beta_q$ and of the predicted values of the dependent
variable $y$, so that inference based on these estimates is less precise
than it should be.
In t tests for the significance of individual parameters, it is possible to
conclude that a parameter associated with one of the multicollinear
independent variables is not significantly different from zero even when the
independent variable actually has a strong relationship with the dependent
variable.
Multicollinearity is not a concern when there is little correlation among the
independent variables.
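Pairwise correlations and variance inflation factors (VIFs) are common multicollinearity diagnostics; a sketch assuming the miles, deliveries, and design matrix X from the Section 8.4 sketch (VIFs are not mentioned on these slides but are a standard companion check):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Pairwise correlation between the two independent variables
print(np.corrcoef(miles, deliveries)[0, 1])

# VIF for each non-intercept column of the design matrix X;
# values well above ~10 are commonly read as a multicollinearity warning
for j in range(1, X.shape[1]):
    print(f"VIF for column {j}: {variance_inflation_factor(X, j):.2f}")
```
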
8.5 Multicollinearity in the Butler Trucking Co. Data
DATAfile: butlerwithgasconsumption
The regression output to the right
has miles driven (𝑥1 ) and gasoline
consumption (𝑥2 ) as independent
variables.
The two variables are highly
correlated, as gas consumption
increases with total miles driven,
with a correlation coefficient of
0.9572.
Because of multicollinearity, the
regression coefficient 𝛽2 is not
significant, with 𝑝−𝑣𝑎𝑙𝑢𝑒 = 0.6588.
8.6 Dummy Variables
Thus far, the regression examples we have considered involved quantitative
independent variables such as distance traveled, gas consumption, and
number of deliveries.
Often, we must work with categorical independent variables, such as:
• gender (male, female)
• method of payment (cash, credit card, check)
To add a two-level categorical independent variable into a regression model,
such as whether a driver should take the highway during afternoon rush hour
in the Butler Trucking Company problem, we define a dummy variable as
follows:
$x_3 = 1$ if an assignment includes driving on the highway during rush hour;
$x_3 = 0$ if it does not.



8.6 Effect of Afternoon Rush Hour on Travel
Time

A review of the residuals for the current model with miles traveled (𝑥1 ) and the
number of deliveries (𝑥2 ) as independent variables reveals that driving on the
highway during afternoon rush hour affects the total travel time (*see notes.)



8.6 Regression Output with Highway Rush Hour Variable
DATAfile: butlerhighway
Excel regression output for the
Butler Trucking Company
regression model including the
independent variables:
• miles traveled (𝑥1 )
• number of deliveries (𝑥2 )
• highway rush hour (𝑥3 )
All independent variables are
significant, and they explain
(𝑅2 = 0.8838) about 88.4% of
the total travel time variability.



8.6 Interpreting the Parameters for the Butler
Example
$\hat{y} = -0.3302 + 0.0672x_1 + 0.6735x_2 + 0.9980x_3$
The model estimates that travel time increases by:
1. 0.0672 hours (about 4 minutes) for every increase of 1 mile traveled,
holding constant the number of deliveries and whether the driver uses
the highway during afternoon rush hour.
2. 0.6735 hours (about 40 minutes) for every delivery, holding constant
the number of miles traveled and whether the driver uses the highway
during afternoon rush hour.
3. 0.9980 hours (about 60 minutes) if the driver uses the highway during
afternoon rush hour, holding constant the number of miles traveled and
the number of deliveries.



8.6 More Complex Categorical Variables
If an independent categorical variable has 𝑘 levels, 𝑘 − 1 dummy variables are
required, with each dummy variable being coded as 0 or 1.
Consider the situation faced by a manufacturer of vending machines that sells its
products to three sales territories: region A, B, and C.
To code the sales regions in a regression model that explains the dependent
variable (𝑦) number of units sold, we need to define 𝑘 − 1 = 2 dummy variables as
follows:
$x_1 = 1$ if sales region B, 0 otherwise
$x_2 = 1$ if sales region C, 0 otherwise

Sales Region | x1 | x2
A | 0 | 0
B | 1 | 0
C | 0 | 1

The regression equation relating the expected number of units sold to the sales
region can be written as $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
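In Python, pandas can generate the $k - 1$ dummy columns automatically; a minimal sketch for the three-region example (the region labels below are illustrative):

```python
import pandas as pd

# Hypothetical region labels; any Series of A/B/C values works the same way
regions = pd.Series(["A", "B", "C", "B", "A", "C"], name="region")

# drop_first=True keeps k - 1 = 2 dummy columns, making region A the baseline
dummies = pd.get_dummies(regions, prefix="region", drop_first=True)
print(dummies)  # region_B and region_C columns of 0/1 (boolean in newer pandas)
```
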
8.6 Interpretation of the Parameters for a Categorical Variable
with Three Levels
To interpret 𝛽0 , 𝛽1 , and 𝛽2 , in the sales territory example, consider the following
variations of the regression equation.
$E(y \mid \text{region A}) = \beta_0 + \beta_1(0) + \beta_2(0) = \beta_0$
$E(y \mid \text{region B}) = \beta_0 + \beta_1(1) + \beta_2(0) = \beta_0 + \beta_1$
$E(y \mid \text{region C}) = \beta_0 + \beta_1(0) + \beta_2(1) = \beta_0 + \beta_2$
Thus, the regression parameters are interpreted as follows.
𝛽0 is the mean or expected value of sales for region A.
𝛽1 is the difference between the mean number of units sold in region B and
the mean number of units sold in region A.
𝛽2 is the difference between the mean number of units sold in region C and
the mean number of units sold in region A.



8.7 Modeling Nonlinear Relationships
A manager at Reynolds, Inc., a manufacturer of industrial scales, wants to
investigate the relationship between length of employment (𝑥) and the
number of electronic laboratory scales sold (𝑦) for a sample of 123
salespeople.
DATAfile: reynolds
The estimated linear regression equation is
$Sales = 183.291 + 2.358\, Months$
The scatter diagram indicates
a curvilinear relationship
between the length of time
employed and the number of
units sold.
8.7 A Curvilinear Pattern in the Reynolds Data
The pattern in the scatter chart of residuals against the predicted values of
the dependent variable suggests that a curvilinear relationship may provide a
better fit to the data.
We may wish to consider an
alternative to simple linear
regression if we have a
practical reason to suspect a
curvilinear relationship.
For example, a salesperson
who has been employed for a
long time may eventually
become burned out and less
efficient.



8.7 A Quadratic Regression Model
To account for the curvilinear relationship, we add an independent variable,
MonthSq, as the square of the number of months the salesperson has been
with the firm. See notes for Excel.
The following equation describes a
quadratic regression model.
$\hat{y} = b_0 + b_1 x + b_2 x^2$
The regression output produces the
estimated quadratic regression
equation for the Reynolds problem:
$Sales = 101.985 + 6.136\, Months - 0.0341\, MonthSq$



8.7 Interpreting a Quadratic Regression Equation
If the estimated parameters $b_1$ and $b_2$ corresponding to the linear term $x$ and the
squared term $x^2$ have the same sign, the estimated dependent variable $\hat{y}$ is
a) increasing over the experimental range of $x$ when $b_1 > 0$ and $b_2 > 0$, or
b) decreasing over the experimental range of $x$ when $b_1 < 0$ and $b_2 < 0$.
If, on the other hand, the estimated parameters $b_1$ and $b_2$ corresponding to the linear
term $x$ and the squared term $x^2$ have different signs, $\hat{y}$ has
c) a maximum over the experimental range of $x$ when $b_1 > 0$ and $b_2 < 0$, or
d) a minimum over the experimental range of $x$ when $b_1 < 0$ and $b_2 > 0$.
In the case of the Reynolds data, we can use calculus to demonstrate that the
maximum sales occur at $x = 90$.
Thus, maximum sales are $Sales = 101.985 + 6.136(90) - 0.0341(90^2) = 378$.
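The same maximum can be located without calculus: a quadratic $b_0 + b_1 x + b_2 x^2$ has its vertex at $x = -b_1/(2b_2)$. A quick check using the slide's estimates:

```python
b0, b1, b2 = 101.985, 6.136, -0.0341  # estimates from the Reynolds output

x_max = -b1 / (2 * b2)                       # vertex of the quadratic
sales_max = b0 + b1 * x_max + b2 * x_max**2

print(f"x = {x_max:.0f} months, sales = {sales_max:.0f}")  # x = 90 months, sales = 378
```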



8.7 Types of Quadratic Regression Models



8.7 Interaction Between Independent
Variables
An interaction is a relationship between the dependent variable and one
independent variable that is different at various values of a second
independent variable.
If the original data set consists of observations for 𝑦 and two independent
variables 𝑥1 and 𝑥2 , we can incorporate an 𝑥1 𝑥2 interaction term into the
estimated multiple linear regression equation in the following manner.
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$
When an interaction term between two variables is present, we cannot study
the relationship between one independent variable and the dependent
variable $y$ independently of the other variable.
See notes and next slide for an Excel application that uses the DATAfile:
tyler. A Python sketch of the same construction follows.
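Mechanically, the interaction term is just the elementwise product of the two independent-variable columns. The sketch below uses synthetic data because the tyler DATAfile is not reproduced on these slides:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative synthetic data standing in for the actual DATAfile values
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 2 + 0.5 * x1 + 0.3 * x2 + 0.2 * x1 * x2 + rng.normal(0, 1, 50)

X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))  # interaction column = x1 * x2
model = sm.OLS(y, X).fit()
print(model.params)  # b0, b1, b2, b3
```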



8.7 Regression Output for the Tyler Personal Care Example



8.7 Piecewise Linear Regression Model
A piecewise linear regression model is a type of interaction with a dummy
variable that allows fitting nonlinear relationships as two linear regressions
joined at the value of $x$ at which the relationship between $x$ and $y$ changes.
• The value of the independent variable, $x^{(k)}$, at which the relationship
between $x$ and $y$ changes is called a knot, or breakpoint.
• A dummy variable $x_k$ is added to the model such that
$x_k = 0$ if $x_1 \leq x^{(k)}$, and $x_k = 1$ if $x_1 > x^{(k)}$
Then, the following estimated regression equation is fit:
$\hat{y} = b_0 + b_1 x_1 + b_2 (x_1 - x^{(k)}) x_k$



8.7 Piecewise Linear Regression Model for Reynolds Data
We observe that below some value of Months Employed, the relationship
with Sales appears to be positive and linear, whereas the relationship
becomes negative and linear for the remaining observations. See notes for
Excel.
As shown in the previous slide, we
add a dummy variable to the model
with the knot at $x^{(k)} = 70$.
The regression output, shown in the
next slide, produces the following
estimated regression equation, with
all independent variables significant:
$\hat{y} = 147.46 + 3.354 x_1 - 3.661 (x_1 - 70) x_k$
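The piecewise fit reduces to ordinary least squares on two constructed columns. The sketch below uses synthetic stand-ins for the reynolds DATAfile (not reproduced here) with the knot at 70:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative stand-ins for the months and sales columns of the DATAfile
rng = np.random.default_rng(1)
months = rng.uniform(1, 120, 80)
sales = 147 + 3.35 * months - 3.66 * np.clip(months - 70, 0, None) + rng.normal(0, 20, 80)

knot = 70.0
x_k = (months > knot).astype(float)   # dummy: 1 beyond the knot, 0 otherwise
hinge = (months - knot) * x_k         # the (x1 - 70) * x_k term

X = sm.add_constant(np.column_stack([months, hinge]))
model = sm.OLS(sales, X).fit()
print(model.params)  # with the real data: ~147.46, 3.354, -3.661
```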



8.7 Piecewise Regression Output for Reynolds Data



8.8 Variable Selection Procedures
There are four major variable selection procedures we can use to find the
best estimated regression equation for a set of independent variables (*see
notes):
1. Stepwise Regression
2. Forward Selection
3. Backward Elimination
4. Best-Subsets Regression
The first three procedures are iterative; one independent variable at a time is
added or deleted, and there is no guarantee that the best model will be found.
In the fourth procedure, all possible subsets of the independent variables are
evaluated.



8.8 Overfitting
Overfitting generally results from creating an overly complex regression model to
explain idiosyncrasies in the sample data.
An overfit model will overperform on the sample data used to fit the model and
underperform on other data from the population.
To avoid overfitting a model:
• Use only real and meaningful independent variables.
• Only use complex models when you have reasonable expectations about them.
• Use variable selection procedures only for guidance.
• If you have sufficient data, consider cross-validation, in which you assess the
model on data other than the sample data used to generate the model.
• One example of cross-validation is the holdout method, which divides the
data set between a training set and a validation set.
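A minimal holdout split is easy to implement directly; this sketch assumes a design matrix X (including a column of 1s) and response y, and fits on the training rows only:

```python
import numpy as np

def holdout_r2(X, y, train_frac=2/3, seed=0):
    """Fit least squares on a random training subset; return out-of-sample R^2."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(train_frac * len(y))
    train, valid = idx[:n_train], idx[n_train:]

    # Least squares fit on the training rows only
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

    # Assess on the held-out validation rows
    resid = y[valid] - X[valid] @ coef
    sse = np.sum(resid ** 2)
    sst = np.sum((y[valid] - y[valid].mean()) ** 2)
    return 1 - sse / sst
```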



8.9 Inference and Very Large Samples
Virtually all regression
coefficients will be
statistically significant if the
sample is sufficiently large.
DATAfile: largecredit
The regression output to the right shows a modest $R^2$ for a data set with n = 3,000.
All the regression
coefficients are significant.
See notes for details.



8.9 Model Selection
When dealing with large samples, it is often difficult to discern the most
appropriate model.
• If developing a regression model for explanatory purposes, the practical
significance of the estimated regression coefficients should be
considered when interpreting the model and considering which variables
to keep.
• If developing a regression model to make future predictions, the selection
of independent variables to include in the model should be based on
predictive accuracy for observations that were not used to fit the model.
For example, the credit card data set could be split into two data sets:
1. a training data set with 𝑛 = 2,000, and
2. a validation data set with 𝑛 = 1,000.
8.10 Prediction with Linear Regression
In addition to the point estimate, there are two types of interval estimates
associated with the regression equation:
• A confidence interval is an interval estimate of the mean $y$ value given
values of the independent variables $x_1, x_2, \dots, x_q$.
$\hat{y} \pm t_{\alpha/2}\, s_{\hat{y}}$, where $s_{\hat{y}}$ is the estimated standard deviation of $\hat{y}$
• A prediction interval is an interval estimate of an individual $y$ value
given values of the independent variables $x_1, x_2, \dots, x_q$.
$\hat{y} \pm t_{\alpha/2} \sqrt{s_{\hat{y}}^2 + SSE/(n - q - 1)}$, where $s_{\hat{y}}^2$ is the estimated variance of $\hat{y}$
The calculation of the confidence and prediction intervals uses matrix algebra
and requires the use of specialized statistical software.
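In Python, statsmodels produces both intervals from a single call; a sketch assuming the fitted model and design-matrix layout from the Section 8.4 sketch:

```python
import numpy as np

# New assignment: intercept term, 100 miles, 3 deliveries (hypothetical inputs)
x_new = np.array([[1.0, 100.0, 3.0]])

pred = model.get_prediction(x_new)
frame = pred.summary_frame(alpha=0.05)  # 95% intervals

print(frame[["mean", "mean_ci_lower", "mean_ci_upper"]])  # confidence interval for E(y|x)
print(frame[["obs_ci_lower", "obs_ci_upper"]])            # prediction interval for a single y
```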



8.10 Prediction of New Routes for Butler Trucking Co.
Predicted Values and 95% Confidence Intervals and Prediction
Intervals for 10 New Butler Trucking Routes (DATAfile: butler)

Assignment | Miles | Deliveries | Predicted Value | 95% CI Half-Width (±) | 95% PI Half-Width (±)
301 | 105 | 3 | 9.25 | 0.193 | 1.645
302 |  60 | 4 | 6.92 | 0.112 | 1.637
303 |  95 | 5 | 9.96 | 0.173 | 1.642
304 | 100 | 1 | 7.54 | 0.225 | 1.649
305 |  40 | 3 | 4.88 | 0.177 | 1.643
306 |  80 | 3 | 7.57 | 0.108 | 1.637
307 |  65 | 4 | 7.25 | 0.103 | 1.637
308 |  55 | 3 | 5.89 | 0.124 | 1.638
309 |  95 | 2 | 7.89 | 0.175 | 1.643
310 |  95 | 3 | 8.58 | 0.154 | 1.641



Summary
• In this chapter, we showed how linear regression analysis is used to determine how a
dependent variable 𝑦 is related to one or more independent variables.
• We used sample data and the least squares method to develop the estimated simple
linear regression equation, interpreted its coefficients, and presented the coefficient of
determination as a measure of its goodness of fit.
• We then extended our discussion to include multiple independent variables, reviewed
how to use Excel to find the estimated multiple linear regression equation, showed how to
build interval estimates in the form of prediction and confidence intervals, and discussed
the ramifications of multicollinearity.
• We discussed the conditions on the error term that must hold for valid inference in the
linear regression model.
• We showed how to incorporate categorical independent variables into a regression model
and discussed how to fit nonlinear relationships.
• Finally, we discussed various variable selection procedures, the problem of overfitting,
and the implication of big data on regression analysis.

© 2024 Cengage Group. All Rights Reserved.
