CHAPTER 6

Multiple Linear Regression
In this chapter we introduce linear regression models for the purpose of prediction. We discuss the differences between fitting and using regression models for the purpose of inference (as in classical statistics) and for prediction. A predictive goal calls for evaluating model performance on a validation set, and for using predictive metrics. We then discuss the challenges of using many predictors and describe variable selection algorithms that are often implemented in linear regression procedures.
6.1 INTRODUCTION
The most popular model for making predictions is the multiple linear regression model encountered in most introductory statistics courses and textbooks. This model is used to fit a relationship between a quantitative dependent variable Y (also called the outcome, target, or response variable) and a set of predictors X1, X2, ..., Xp (also referred to as independent variables, input variables, regressors, or covariates). The assumption is that the following function approximates the relationship between the input and outcome variables:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon, \qquad (6.1)$$
where β0, β1, ..., βp are coefficients and ε is the noise or unexplained part. Data are then used to estimate the coefficients and to quantify the noise. In predictive modeling, the data are also used to evaluate model performance.
Regression modeling means not only estimating the coefficients but also choosing which input variables to include and in what form. For example, a numerical input can be included as-is, or in logarithmic form (log X), or in a binned form (e.g., age group). Choosing the right form depends on domain knowledge, data availability, and needed predictive power.
Multiple linear regression is applicable to numerous predictive modeling situations. Examples are predicting customer activity on credit cards from their demographics and historical activity patterns, predicting the time to failure of equipment based on utilization and environmental conditions, predicting staffing requirements at help desks based on historical data and product and sales information, predicting expenditures on vacation travel based on historical frequent flyer data, predicting sales from cross-selling of products from database information, and predicting the impact of discounts on sales in retail outlets.
6.2 EXPLANATORY VS. PREDICTIVE MODELING
Before introducing the use of linear regression for prediction, we must clarify an important distinction that often escapes those with earlier familiarity with linear regression from courses in statistics. In particular, there are two popular but different objectives behind fitting a regression model:

1. Explaining or quantifying the average effect of inputs on an output (explanatory or descriptive task, respectively)

2. Predicting the outcome value for new records, given their input values (predictive task)
The classical statistical approach is focused on the first objective. In that scenario, the data are treated as a random sample from a larger population of interest. The regression model estimated from this sample is an attempt to capture the average relationship in the larger population. This model is then used in decision making to generate statements such as "a unit increase in service speed (X1) is associated with an average increase of 5 points in customer satisfaction (Y), all other predictors (X2, X3, ..., Xp) being equal." If X1 is known to cause Y, then such a statement indicates actionable policy changes; this is called explanatory modeling. When the causal structure is unknown, then this model quantifies the degree of association between the inputs and output, and the approach is called descriptive modeling.
In predictive analytics, however, the focus is typically on the second goal:
predicting new individual observations. Here we are not interested in the co-
efficients themselves, nor in the “average record,” but rather in the predictions
that this model can generate for new records. In this scenario, the model is used
for micro-decision-making at the record level. In our previous example, we
would use the regression model to predict customer satisfaction for each new customer.
Both explanatory and predictive modeling involve using a dataset to fit a model (i.e., to estimate coefficients), checking model validity, assessing its performance, and comparing to other models. However, the modeling steps and performance assessment differ in the two cases, usually leading to different final models. Therefore the choice of model is closely tied to whether the goal is explanatory or predictive.
In explanatory and descriptive modeling, where the focus is on modeling the average record, we try to fit the best model to the data in order to learn about the underlying relationship in the population. In contrast, in predictive modeling (data mining), the goal is to find a regression model that best predicts new individual records. A regression model that fits the existing data too well is not likely to perform well with new data. Hence, we look for a model that has the highest predictive power by evaluating it on a holdout set and using predictive metrics (see Chapter 5).

Let us summarize the main differences in using a linear regression in the two scenarios:
1. A good explanatory model is one that fits the data closely, whereas a good predictive model is one that predicts new cases accurately. Choices of input variables and their form can therefore differ.

2. In explanatory models, the entire dataset is used for estimating the best-fit model, to maximize the amount of information that we have about the hypothesized relationship in the population. When the goal is to predict outcomes of new individual cases, the data are typically split into a training set and a validation set. The training set is used to estimate the model, and the validation or holdout set is used to assess this model's predictive performance on new, unobserved data.
3. Performance measures for explanatory models measure how close the
data fit the model (how well the model approximates the data) and
how strong the average relationship is, whereas in predictive models
performance is measured by predictive accuracy (how well the model
predicts new individual cases).

4. In explanatory models the focus is on the coefficients (β), whereas in predictive models the focus is on the predictions ($\hat{y}$).
For these reasons it is extremely important to know the goal of the analysis before beginning the modeling process. A good predictive model can have a looser fit to the data on which it is based, and a good explanatory model can have low prediction accuracy. In the remainder of this chapter we focus on predictive models because these are more popular in data mining and because most statistics textbooks focus on explanatory modeling.
6.3 ESTIMATING THE REGRESSION EQUATION AND PREDICTION
Once we determine the input variables to include and their form, we estimate the coefficients of the regression formula from the data using a method called ordinary least squares (OLS). This method finds values $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$ that minimize the sum of squared deviations between the actual outcome values (Y) and their predicted values based on that model ($\hat{Y}$).

To predict the value of the output variable for a record with input values x1, x2, ..., xp, we use the equation

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p. \qquad (6.2)$$
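To make equation (6.2) concrete, here is a minimal Python sketch (not the XLMiner procedure used in this book) of computing the least squares coefficients and a prediction for a new record; the numeric values and variable choices are made up for illustration.

```python
import numpy as np

# Hypothetical training data: each row is a record with two predictors (e.g., Age, Kilometers)
X = np.array([[23, 46000], [24, 41700], [30, 38500], [32, 61000]], dtype=float)
y = np.array([13500, 13950, 13750, 12950], dtype=float)  # outcome (e.g., Price)

# Prepend a column of 1s so the first coefficient plays the role of the intercept beta_0
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimize the sum of squared deviations between y and X1 @ beta
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Equation (6.2): prediction for a new record with predictor values x1, ..., xp
x_new = np.array([1.0, 28, 50000])  # leading 1 matches the intercept column
y_hat = x_new @ beta_hat
print(beta_hat, y_hat)
```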
Predictions based on this equation are the best predictions possible in the sense that they will be unbiased (equal to the true values, on average) and will have the smallest average squared error compared to any unbiased estimates if we make the following assumptions:

1. The noise ε (or equivalently, Y) follows a normal distribution.

2. The choice of variables and their form is correct (linearity).

3. The cases are independent of each other.

4. The variability in the outcome values for a given set of predictors is the same regardless of the values of the predictors (homoskedasticity).
An important fact for the predictive goal is that even if we drop the first assumption and allow the noise to follow an arbitrary distribution, these estimates are very good for prediction, in the sense that among all linear models, as defined by equation (6.1), the model using the least squares estimates, $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$, will have the smallest average squared errors. The assumption of a normal distribution is required in explanatory modeling, where it is used for constructing confidence intervals and statistical tests for the model parameters.
Even if the other assumptions are violated, it is still possible that the resulting predictions are sufficiently accurate and precise for the purpose they are intended for. The key is to evaluate the predictive performance of the model, which is the main priority. Satisfying assumptions is of secondary interest, and residual analysis can give clues to potential improved models to examine.
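To give a flavor of what such an evaluation might look like in code, the following Python sketch (an illustration, not an XLMiner feature) computes a few residual-based measures for a holdout set, given arrays of actual and predicted outcome values:

```python
import numpy as np

def predictive_metrics(y_actual, y_pred):
    """Residual-based performance measures on a validation (holdout) set."""
    residuals = y_actual - y_pred
    return {
        "average_error": residuals.mean(),         # signed; near 0 if predictions are unbiased
        "MAE": np.abs(residuals).mean(),           # mean absolute error
        "RMSE": np.sqrt(np.mean(residuals ** 2)),  # root-mean-squared error
    }

def residual_summary(y_actual, y_pred):
    """Quartiles of the residuals, e.g., for a quick boxplot-style check."""
    return np.percentile(y_actual - y_pred, [25, 50, 75])
```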
TABLE 6.1  VARIABLES IN THE TOYOTA COROLLA EXAMPLE

Variable      Description
Price         Offer price in euros
Age           Age in months as of August 2006
Kilometers    Accumulated kilometers on odometer
Fuel type     Fuel type (Petrol, Diesel, CNG)
HP            Horsepower
Metallic      Metallic color? (Yes = 1, No = 0)
Automatic     Automatic (Yes = 1, No = 0)
CC            Cylinder volume in cubic centimeters
Doors         Number of doors
Quart tax     Quarterly road tax in euros
Weight        Weight in kilograms
Example: Predicting the Price of Used Toyota Corolla Cars

A large Toyota car dealership offers purchasers of new Toyota cars the option to buy their used car as part of a trade-in. In particular, a new promotion promises to pay high prices for used Toyota Corolla cars for purchasers of a new car. The dealer then sells the used cars for a small profit. To ensure a reasonable profit, the dealer needs to be able to predict the price that the dealership will get for the used cars. For that reason, data were collected on all previous sales of used Toyota Corollas at the dealership. The data include the sales price and other information on the car, such as its age, mileage, fuel type, and engine size. A description of each of these variables is given in Table 6.1. A sample of this dataset is shown in Table 6.2.

The total number of records in the dataset is 1000 cars (we use the first 1000 cars from the dataset ToyotaCorolla.xls). After partitioning the data into training (60%) and validation (40%) sets, we fit a multiple linear regression model between price (the output variable) and the other variables (as predictors) using only the training set. Figure 6.1 shows the estimated coefficients, as computed by XLMiner.¹ Notice that the Fuel Type predictor has three categories (Petrol, Diesel, and CNG). We therefore have two dummy variables in the model: Petrol (0/1) and Diesel (0/1); the third, CNG (0/1), is redundant given the information in the first two dummies. Inclusion of this redundant variable will cause typical regression software to fail due to a multicollinearity error, since the redundant variable will be a perfect linear combination of the other two (see Section 4.5).

The regression coefficients are then used to predict prices of individual used Toyota Corolla cars based on their age, mileage, and so on. Figure 6.2 shows a sample of predicted prices for 20 cars in the validation set, using the estimated model. It gives the predictions and their errors (relative to the actual prices) for these 20 cars.

¹In some versions of XLMiner, the intercept in the coefficients table is called "constant term."
TABLE 6.2  PRICES AND ATTRIBUTES FOR USED TOYOTA COROLLA CARS (SELECTED ROWS AND COLUMNS ONLY)

Columns: Price, Age, Kilometers, Fuel Type, HP, Metallic, Automatic, CC, Doors, Quart Tax, Weight
On the right of Figure 6.2 we get overall measures of predictive accuracy. Note that the average error is $111. A boxplot of the residuals (Figure 6.3) shows that 50% of the errors fall approximately within ±$850. This error magnitude might be small relative to the car price, but should be taken into account when estimating the profit. Another observation of interest is the large positive residuals (underpredictions), which may or may not be a concern, depending on the application. Measures such as the average error and error percentiles are used to assess the predictive performance of a model and to compare models. We discuss such measures in the next section. This example also illustrates the point about the relaxation of the normality assumption. A histogram or probability plot of prices shows a right-skewed distribution. In a descriptive/explanatory modeling case where the goal is to obtain a good fit to the data, the output variable would be transformed (e.g., by taking a logarithm) to achieve a more
FIGURE 6.1  ESTIMATED COEFFICIENTS FOR REGRESSION MODEL OF PRICE VS. CAR ATTRIBUTES

FIGURE 6.2  (A) PREDICTED PRICES (AND ERRORS) FOR 20 CARS IN VALIDATION SET, AND (B) SUMMARY PREDICTIVE MEASURES FOR ENTIRE VALIDATION SET
"normal" variable. Although the fit of such a model to the training data is expected to be better, it will not necessarily improve predictive performance. In this example the average error in a model of log(price) is -$160, compared to $111 in the original model for price.
FIGURE 6.3  BOXPLOT OF THE RESIDUALS (vertical axis: Residual)
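The steps of this example (partitioning, dummy coding of Fuel Type, fitting on the training set, and scoring the validation set) can also be outlined in Python. The sketch below illustrates the same workflow but is not the XLMiner procedure used in the book; the column names are assumed to match Table 6.1, and the spellings in the actual ToyotaCorolla.xls file may differ.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Column names follow Table 6.1 (assumed; check the actual file)
cols = ["Price", "Age", "Kilometers", "Fuel_Type", "HP", "Metallic",
        "Automatic", "CC", "Doors", "Quart_Tax", "Weight"]
df = pd.read_excel("ToyotaCorolla.xls").loc[:, cols].iloc[:1000]  # first 1000 records

# Dummy-code the three Fuel Type categories; one category (here CNG) is dropped
# as the redundant reference level to avoid perfect multicollinearity
df = pd.get_dummies(df, columns=["Fuel_Type"], drop_first=True)

predictors = [c for c in df.columns if c != "Price"]

# Partition into 60% training and 40% validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    df[predictors], df["Price"], test_size=0.4, random_state=1)

# Fit the multiple linear regression on the training set only
model = LinearRegression().fit(X_train, y_train)

# Score the validation set and summarize the prediction errors
errors = y_valid - model.predict(X_valid)
print(errors.mean(), errors.abs().mean())  # average error and MAE on the holdout set
```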
6.4 VARIABLE SELECTION IN LINEAR REGRESSION

Reducing the Number of Predictors

A frequent problem in data mining is that of using a regression equation to predict the value of a dependent variable when we have many variables available to choose as predictors in our model. Given the high speed of modern algorithms for multiple linear regression calculations, it is tempting in such a situation to take a kitchen-sink approach: Why bother to select a subset? Just use all the variables in the model.
Another consideration favoring the inclusion of numerous variables is the hope that a previously hidden relationship will emerge. For example, a company found that customers who had purchased anti-scuff protectors for chair and table legs had lower credit risks. However, there are several reasons for exercising caution before throwing all possible variables into a model:
• It may be expensive or not feasible to collect a full complement of predictors for future predictions.

• We may be able to measure fewer predictors more accurately (e.g., in surveys).

• The more predictors there are, the higher the chance of missing values in the data. If we delete or impute cases with missing values, multiple predictors will lead to a higher rate of case deletion or imputation.

• Parsimony is an important property of good models. We obtain more insight into the influence of predictors in models with few parameters.
• Estimates of regression coefficients are likely to be unstable, due to multicollinearity in models with many variables. (Multicollinearity is the presence of two or more predictors sharing the same linear relationship with the outcome variable.) Regression coefficients are more stable for parsimonious models. One very rough rule of thumb is to have a number of cases n larger than 5(p + 2), where p is the number of predictors.

• It can be shown that using predictors that are uncorrelated with the dependent variable increases the variance of predictions.

• It can be shown that dropping predictors that are actually correlated with the dependent variable can increase the average error (bias) of predictions.
The last two points mean that there is a trade-off between too few and too many predictors. In general, accepting some bias can reduce the variance in predictions. This bias-variance trade-off is particularly important for large numbers of predictors, since in that case it is very likely that there are variables in the model that have small coefficients relative to the standard deviation of the noise and also exhibit at least moderate correlation with other variables. Dropping such variables will improve the predictions, as it reduces the prediction variance. This type of bias-variance trade-off is a basic aspect of most data mining procedures for prediction and classification. In light of this, methods for reducing the number of predictors to a smaller set are often used.
The first step in reducing the number of predictors should always be to use domain knowledge. It is important to understand what the various predictors measure and why they are relevant for predicting the outcome variable. With this knowledge, the set of predictors should be reduced to a sensible set that reflects the problem at hand. Some practical reasons for predictor elimination are the expense of collecting this information in the future, inaccuracy, high correlation with another predictor, many missing values, or simply irrelevance. Also helpful in examining potential predictors are summary statistics and graphs, such as frequency and correlation tables, predictor-specific summary statistics and plots, and missing value counts.
The next step makes use of computational power and statistical significance.
In general, there are two types of methods for reducing the number of predictors
in a model. The first is an exhaustive search for the "best" subset of predictors by
fitting regression models with all the possible combinations of predictors. The
second is to search through a partial set of models. We describe these two
approaches next.
Exhaustive Search  The idea here is to evaluate all subsets. Since the number of subsets for even moderate values of p is very large, after the algorithm creates the subsets and runs all the models, we need some way to examine the most promising subsets and to select from them. Criteria for evaluating and comparing models are based on metrics computed from the training data. One popular criterion is the adjusted R², which is defined as

$$R^2_{adj} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2).$$

Higher values of adjusted R² indicate better fit. Unlike R², which does not account for the number of predictors used, adjusted R² uses a penalty on the number of predictors. This avoids the artificial increase in R² that can result from simply increasing the number of predictors but not the amount of information. It can be shown that using $R^2_{adj}$ to choose a subset is equivalent to picking the subset that minimizes $\hat{\sigma}^2$.
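For a candidate subset, adjusted R² is straightforward to compute from R², the number of training records n, and the number of predictors p; a small helper function (an illustration, not part of XLMiner) might look like this:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared: penalizes R-squared for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Example: R^2 = 0.85 with n = 600 training records and p = 11 predictors
# adjusted_r2(0.85, 600, 11) -> roughly 0.847
```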
Another criterion that is often used for subset selection is known as Mallow's Cp (see formula below). This criterion assumes that the full model (with all predictors) is unbiased, although it may have predictors that, if dropped, would reduce prediction variability. With this assumption, if a subset model is unbiased, the average Cp value equals the number of parameters p + 1 (= number of predictors + 1), the size of the subset. So a reasonable approach to identifying subset models with small bias is to examine those with values of Cp that are near p + 1. Cp is also an estimate of the error² for predictions at the x-values observed in the training set. Thus good models are those that have values of Cp near p + 1 and that have small p (i.e., are of small size). Cp is computed from the formula

$$C_p = \frac{SSE}{\hat{\sigma}^2_{full}} + 2(p+1) - n, \qquad (6.3)$$

where $\hat{\sigma}^2_{full}$ is the estimated value of σ² in the full model that includes all predictors.
It is important to remember that the usefulness of this approach depends heavily on the reliability of the estimate of σ² for the full model. This requires that the training set contain a large number of observations relative to the number of predictors. Finally, a useful point to note is that for a fixed size of subset, R², $R^2_{adj}$, and Cp all select the same subset. There is in fact no difference among them in the order of merit that they ascribe to subsets of a fixed size. This is good to know if comparing models with the same number of predictors, but often we want to compare models with different numbers of predictors.
²In particular, it is the sum of the MSE standardized by dividing by σ².
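The idea of an exhaustive search can be sketched as a brute-force loop over all predictor subsets, scoring each by Mallow's Cp from equation (6.3). The Python sketch below is an illustration of the idea, not the book's implementation, and it is practical only when the number of predictors is modest, since the number of subsets grows as 2^p.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def sse(X, y, cols):
    """Sum of squared errors of an OLS fit using the given predictor columns."""
    model = LinearRegression().fit(X[:, cols], y)
    resid = y - model.predict(X[:, cols])
    return float(resid @ resid)

def exhaustive_cp(X, y):
    """Score every non-empty predictor subset by Mallow's Cp (equation 6.3)."""
    n, p_full = X.shape
    # sigma^2 estimated from the full model (assumed unbiased)
    sigma2_full = sse(X, y, list(range(p_full))) / (n - p_full - 1)
    results = []
    for size in range(1, p_full + 1):
        for cols in combinations(range(p_full), size):
            cp = sse(X, y, list(cols)) / sigma2_full + 2 * (size + 1) - n
            results.append((cols, cp))
    # Good subsets have Cp near p + 1, where p is the subset size
    return sorted(results, key=lambda r: r[1])
```

Note that for the full model this formula gives Cp = p + 1 by construction, which is a handy sanity check on an implementation.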