
94 Marill • MULTIPLE LINEAR REGRESSION

Advanced Statistics: Linear Regression, Part II: Multiple Linear Regression

Keith A. Marill, MD
Abstract
The applications of simple linear regression in medical research are limited, because in most situations, there are multiple relevant predictor variables. Univariate statistical techniques such as simple linear regression use a single predictor variable, and they often may be mathematically correct but clinically misleading. Multiple linear regression is a mathematical technique used to model the relationship between multiple independent predictor variables and a single dependent outcome variable. It is used in medical research to model observational data, as well as in diagnostic and therapeutic studies in which the outcome is dependent on more than one factor. Although the technique generally is limited to data that can be expressed with a linear function, it benefits from a well-developed mathematical framework that yields unique solutions and exact confidence intervals for regression coefficients. Building on Part I of this series, this article acquaints the reader with some of the important concepts in multiple regression analysis. These include multicollinearity, interaction effects, and an expansion of the discussion of inference testing, leverage, and variable transformations to multivariate models. Examples from the first article in this series are expanded on using a primarily graphic, rather than mathematical, approach. The importance of the relationships among the predictor variables and the dependence of the multivariate model coefficients on the choice of these variables are stressed. Finally, concepts in regression model building are discussed. Key words: regression analysis; linear models; least-squares analysis; statistics; models, statistical; epidemiologic methods. ACADEMIC EMERGENCY MEDICINE 2004; 11:94–102.

Multiple linear regression is a generalization of simple linear regression in which there is more than one predictor variable. If the investigator suspects that the outcome of interest may be associated with or depend on more than one predictor variable, then the approach using simple linear regression may be inappropriate. A multiple regression model that accounts for multiple predictor variables simultaneously may be used. For example, in the first scenario discussed in Part I of this series, the investigator studied the relationship between the intensity of insulin therapy and the resolution of serum acidosis in patients with diabetic ketoacidosis (DKA). The resolution of acidosis seems to depend on the intensity of insulin therapy, but there may be other important factors too. These could include: the initial severity of the DKA episode, the severity of the patient's underlying disease, the administration of other treatments such as intravenous (IV) fluid, etc. Multiple linear regression allows the investigator to account for all of these potentially important factors in one model. The advantages of this approach are that it may lead to a more accurate and precise understanding of the association of each individual factor with the outcome. It also yields an understanding of the association of all of the factors as a whole with the outcome, and the associations between the various predictor variables themselves.

Expanding the schematic approach introduced in Figure 6 of Part I, the introduction of another predictor variable to the model is represented by the addition of another circle that overlaps the outcome variable circle. This overlap is labeled area C in Part II, Figures 1A and 1B. In general, the addition of the new predictor circle and its overlap, area C, with the outcome circle will increase the total portion of the outcome explained by the regression, areas A + C, and decrease the unknown or residual portion, area B. The new predictor circle and area C may, to some degree, overlap the original predictor circle and area A, depending on the relationship between these two predictor variables. This represents the variable amount of redundancy and collinearity existing between the two predictor variables in the model.

From the Division of Emergency Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA.
Received July 24, 2001; revision received July 9, 2002, and April 21, 2003; accepted September 10, 2003.
Series editor: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor–UCLA Medical Center, Torrance, CA.
Based on a didactic lecture, "Concepts in Multiple Linear Regression Analysis," given at the SAEM annual meeting, St. Louis, MO, May 2002.
Address for correspondence and reprints: Keith A. Marill, MD, Massachusetts General Hospital, 55 Fruit Street, Clinics 115, Boston, MA 02114. Fax: 617-724-0917; e-mail: kmarill@[Link].
Part I appears on page 87.
doi:10.1197/S1069-6563(03)00601-8

THE MULTIPLE LINEAR REGRESSION MODEL

The multiple linear regression model is built on the same foundation as simple linear regression, and the
ACAD EMERG MED • January 2004, Vol. 11, No. 1 • [Link] 95

Figure 1. Multivariate schematics.

four fundamental assumptions made with simple linear regression must also be true for multiple linear regression. However, in addition to the concepts discussed thus far for simple linear regression, which remain applicable, a new set of concepts must be introduced. This discussion will concentrate on the situation in which there are two predictor variables and one outcome variable. With a total of three variables, a three-dimensional figure can be used to visualize the data. Models with a larger number of predictor variables follow the same principles, but are more difficult to visualize.

The equation for the regression model now represents a flat plane. Letting z be the outcome variable and x and y be the predictor variables, we have:

z = k1x + k2y + c   (equation 1)

where k1 and k2 are the constant coefficients for x and y, respectively, and c is the z intercept at x = y = 0. k1 and k2 determine the tilt of the plane along the x- and y-axes, respectively. Note that the outcome variable, z, is a linear function of each of the predictor variables, x and y, and this forces the regression model to be a flat plane with no curves or bending. Figure 2A demonstrates a regression plane where k1 = k2.

Figure 2. (A) z = 0.5x + 0.5y, R² = 1.0. (B) y = 1x, Rpred² = 1.0. NS = normal saline; IV = intravenous.

The plane that fits the data best can again be found using the least-squares technique described in the first article in this series. This approach finds the plane that minimizes the sum of the residuals squared. The residual value for each data point equals the actual value of z at that point minus the corresponding predicted value of z on the regression plane (Figure 5A). The optimal coefficients c, k1, and k2 are found such that the regression plane has the proper elevation and tilt that minimizes SSres. This approach leads to three equations and three unknowns, and there usually is a unique solution. It is still true that SStot = SSreg + SSres and R² = SSreg/SStot, but the meaning of these equations has changed somewhat. SSreg includes the contribution of both of the predictor variables to the regression, not each one individually. R is now called the multiple correlation coefficient. R², which is called the coefficient of determination, suggests what proportion of the variation in the outcome variable can be attributed to both of the predictor variables in the linear model as a whole. Referring to the schematics in Figures 1A and 1B, R² = (A + C)/(A + B + C), where the overlap, if any, of areas A and C would only be counted once.

In simple linear regression, a test for whether the relationship in the regression model is statistically significant and unlikely to be due to chance is equivalent to a t-test in which the ratio of the slope to its standard error (SE) is computed and checked for significance. In multiple regression, this test has a different meaning because there are multiple predictor variables and multiple slopes. Instead, an analysis of variance (ANOVA) is used to test for the significance of the model as a whole.
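The least-squares fit of the plane in equation 1 can be sketched numerically. The following is a minimal pure-Python illustration, not part of the original article; the data points are hypothetical, chosen only so that the predictors are not collinear and the normal equations have a unique solution.

```python
def solve3(a, b):
    """Solve a 3x3 linear system a*x = b by Gaussian elimination
    with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_plane(xs, ys, zs):
    """Return (k1, k2, c) minimizing the sum of squared residuals for
    z = k1*x + k2*y + c, via the normal equations for columns [x, y, 1]."""
    n = len(xs)
    sxx = sum(x * x for x in xs); sxy = sum(x * y for x, y in zip(xs, ys))
    sx = sum(xs); syy = sum(y * y for y in ys); sy = sum(ys)
    sxz = sum(x * z for x, z in zip(xs, zs))
    syz = sum(y * z for y, z in zip(ys, zs)); sz = sum(zs)
    a = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    return solve3(a, [sxz, syz, sz])

def r_squared(xs, ys, zs, k1, k2, c):
    """Coefficient of determination: 1 - SSres/SStot."""
    mean = sum(zs) / len(zs)
    ss_tot = sum((z - mean) ** 2 for z in zs)
    ss_res = sum((z - (k1 * x + k2 * y + c)) ** 2
                 for x, y, z in zip(xs, ys, zs))
    return 1 - ss_res / ss_tot

# Hypothetical, non-collinear data lying exactly on z = 0.5x + 0.5y.
xs, ys = [0, 1, 2, 3], [0, 2, 1, 3]
zs = [0.5 * x + 0.5 * y for x, y in zip(xs, ys)]
k1, k2, c = fit_plane(xs, ys, zs)
```

Because the points sit exactly on a plane and the design is full rank, the fit recovers k1 = k2 = 0.5 with c = 0 and R² = 1, mirroring the ideal case of Figure 2A.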

Schematically, this is equivalent to comparing the size of area A + C versus area B in Figures 1A and 1B after adjusting for the number of predictor variables and data points. If the regression area A + C is relatively large compared with the residual area B, then it is concluded that the predictors taken together are associated with the outcome beyond mere chance. Mathematically, a comparison is made between the mean SSreg = MSreg = SSreg/ktot, which includes contributions from all of the predictor variables, and the mean SSres = MSres = SSres/(n − ktot − 1), where n is the number of data points and ktot is the number of predictor variables. F = MSreg/MSres, and if, after accounting for the appropriate degrees of freedom, F is sufficiently large, then the null hypothesis is rejected and it is concluded that the multiple linear model explains some of the variation in the outcome variable.

RELATIONSHIPS AMONG THE PREDICTORS

Perhaps the most important difference in multiple versus simple linear regression is that the multiple regression model includes the linear relationships among the predictor variables themselves. These relationships, termed "multicollinearity," can have a tremendous effect on the model coefficients and the precision with which they are known. To illustrate this, we return to the simple univariate predictor models in Figures 2 through 5 of Part I of this series, and now include multivariate data with two predictor variables. Figures 2A through 5A of this article, Part II, correspond to Figures 2 through 5 of Part I and include the same data points; however, a second predictor variable, y, has been added. How does inclusion of an additional predictor variable affect the regression model? The answer is "it depends": it depends on whether there is a linear relationship between the new predictor variable and the predictor variable or variables that already are present in the model.

COLLINEARITY

In Figure 2 of Part I of this series, the investigator studied the association of the intensity of intravenous (IV) insulin therapy with the rate of resolution of DKA in two diabetic patients. It was found that the predictor and outcome variables were proportional, and for every one unit per hour of insulin therapy, there was an associated 1-mEq/L increase in the serum bicarbonate level after four hours of therapy. The researcher now returns to the data and investigates whether the intensity of IV normal saline (NS) fluid therapy also is associated with the rate of DKA resolution.

The outcome variable, the increase in the serum bicarbonate after four hours of therapy, is graphed as a function of the intensity of insulin and IV NS therapy for the two study patients in Figure 2A, Part II. Notice that the resolution of DKA seems to be associated with the intensity of both insulin and fluid therapy. Recall that the original equation for the relationship between insulin therapy and the improvement in serum bicarbonate was z = x. Would a correct model with two predictor variables now be z = x + y? No. If the intensity of insulin therapy is 4 units per hour and IV NS hydration is 4 in 100 mL/hr units, then the improvement in serum bicarbonate would be 4 + 4 = 8 mEq/L after four hours of therapy, instead of 4 mEq/L as in the figure. This would overestimate the improvement in outcome. A more correct model would be z = 1/2x + 1/2y. By adding the y variable to the model, the value of the coefficient in the x-axis has decreased by 50%, from 1 to 0.5. Why is this so? It is because the data display collinearity in the x,y plane.

Figures 2B through 5B are two-dimensional graphs of the same data in Figures 2A through 5A, but only the x- and y-axes are displayed. This allows a clear display of the relationship between the predictor variables x and y in each data set. Notice that the value of y varies with the value of x in Figure 2B. The patient who received more insulin therapy also received more IV NS hydration. They increase together linearly, and thus display positive collinearity. In the simple linear model of Part I, Figure 2, the entire improvement in the serum bicarbonate was associated only with the insulin therapy, whereas in the multivariable model in Part II, Figure 2A, we chose to apportion the improvement in the serum bicarbonate equally between the insulin and IV NS treatments. When IV NS treatment is included in the model, the improvement in serum bicarbonate associated with treatment must be shared between two therapies, insulin and NS. When collinearity is present, the magnitudes of the predictor coefficients change depending on which predictor variables are included in the model. In particular, when an increase in one predictor variable is associated with an increase in another predictor, there is positive collinearity. When there is positive collinearity, the value of positive predictor coefficients will tend to decrease as more predictors are included in the model. The association with the outcome must be shared among multiple predictors.

The change in the regression model associated with the addition of a new predictor variable can be even more dramatic. Consider the second example in Part I of this series: the investigator examined the hypothesis that a higher initial respiratory rate may be associated with a greater improvement in DKA. It was found in Part I, Figure 3, that patients with higher initial respiratory rates demonstrated greater improvement. The investigator now realizes, however, that the intensity of insulin therapy should also be included in the analysis.
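The apportioning effect described above can be checked numerically. The sketch below (not from the article; the two data points are hypothetical stand-ins for the DKA example, with the NS variable y tracking insulin x exactly) shows that with perfectly collinear predictors, z = 1.0x + 0.0y and z = 0.5x + 0.5y fit the data equally well, so the least-squares apportioning between the two therapies is not unique.

```python
def ss_res(points, k1, k2, c=0.0):
    """Sum of squared residuals for the plane z = k1*x + k2*y + c."""
    return sum((z - (k1 * x + k2 * y + c)) ** 2 for x, y, z in points)

# Each point: (insulin units/hr, NS in 100 mL/hr units, bicarbonate gain
# in mEq/L).  y tracks x exactly, so Rpred^2 = 1 (as in Part II Figure 2B).
data = [(2, 2, 2), (4, 4, 4)]

univariate = ss_res(data, 1.0, 0.0)   # all credit given to insulin
shared = ss_res(data, 0.5, 0.5)       # credit split equally
all_fluid = ss_res(data, 0.0, 1.0)    # all credit given to NS hydration
```

All three residual sums are exactly zero: the data cannot distinguish among the apportionings, which is why the choice of included predictors drives the coefficients when collinearity is complete.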

Part II, Figure 3A, is a graph of the improvement in serum bicarbonate as a function of the initial respiratory rate and the intensity of insulin treatment. As expected, the increase in serum bicarbonate is greater in patients who receive more insulin. Interestingly, the improvement in serum bicarbonate is now less in patients with higher initial respiratory rates. The sign of the x coefficient relating the initial respiratory rate to the improvement in serum bicarbonate has changed from positive to negative. Once again, this change has occurred because there is collinearity between the two predictor variables, the initial respiratory rate and the intensity of insulin therapy. Patients with higher initial respiratory rates presumably have more severe disease and are treated more aggressively with insulin. This association and positive collinearity are demonstrated in the plot relating the two predictor variables, x and y (Figure 3B). The greater improvement in serum bicarbonate originally attributed to a higher initial respiratory rate in the univariate analysis actually seems to be due to more aggressive insulin therapy. After analyzing the same data as in Part I, but with two predictor variables instead of one, the investigator finds that there is no longer evidence suggesting that patients can hyperventilate their way out of DKA.

Figure 3. (A) z = −0.1x + 1.25y − 1, R² = 1.0. (B) y = 0.2x, Rpred² = 0.25. IV = intravenous.

These examples have demonstrated effects due to collinearity and confounding between two predictor variables, and they are represented schematically in Figure 1B, in which the two predictor areas A and C overlap. The term "multicollinearity" is used to describe the same types of collinear effects that can occur among three or more predictor variables in a data set. Both of these examples demonstrated positive collinearity. Sometimes, the value of one predictor variable may decrease as the other predictor variable increases. This would be negative collinearity. Data sets with many predictor variables may contain complex multicollinearities with both positive and negative collinear relationships.

NO COLLINEARITY

In Part I, Figure 4B, the investigator determined that the log of the duration of the intensive care unit (ICU) stay for patients with DKA varied with the initial intensity of insulin therapy: log z = kx + c. Perhaps there are other factors that might also help explain the duration of ICU admission, such as the patient's age, comorbid illnesses, or secondary infections. In Part II, Figure 4A, the investigator graphed the log of the ICU stay as a function of two predictor variables, the initial intensity of insulin therapy, x, and the patient's age in years, y.

Figure 4. (A) q = 0.097x + 0.074y − 2.37, R² = 1.0. (B) y = 0x + 35, Rpred² = 0. ICU = intensive care unit.
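The sign reversal seen in the respiratory-rate example can be reproduced with the closed-form two-predictor least-squares formulas on centered sums. The data below are hypothetical, constructed so that z rises with x marginally but falls with x once the collinear predictor y is included.

```python
def centered_sums(pts):
    """Centered sums of squares and cross-products for (x, y, z) points."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    mz = sum(p[2] for p in pts) / n
    s = {"xx": 0.0, "yy": 0.0, "xy": 0.0, "xz": 0.0, "yz": 0.0}
    for x, y, z in pts:
        dx, dy, dz = x - mx, y - my, z - mz
        s["xx"] += dx * dx; s["yy"] += dy * dy; s["xy"] += dx * dy
        s["xz"] += dx * dz; s["yz"] += dy * dz
    return s

# Hypothetical points: z = -0.5x + 2y exactly, with y rising along with x.
pts = [(0, 0, 0), (2, 1, 1), (4, 2, 2), (6, 4, 5)]
s = centered_sums(pts)

# Univariate slope of z on x alone: positive.
slope_univariate = s["xz"] / s["xx"]

# Two-predictor least-squares coefficients: the x coefficient turns negative.
den = s["xx"] * s["yy"] - s["xy"] ** 2
k1 = (s["yy"] * s["xz"] - s["xy"] * s["yz"]) / den
k2 = (s["xx"] * s["yz"] - s["xy"] * s["xz"]) / den
```

Here the univariate slope is +0.8, but the bivariate x coefficient is −0.5: the apparent benefit of x alone was carried by the collinear predictor y, just as the respiratory rate's apparent effect was carried by insulin.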

When comparing the new figure with two predictor variables, Part II, Figure 4A, to the previous graph with one predictor variable, Part I, Figure 4B, note that although a new coefficient in the y-plane representing patient age has been added, there is no change in the x-axis coefficient relating insulin therapy to ICU stay. Inspection of the relationship between the two predictor variables in Part II, Figure 4B reveals that there is no linear relationship in the x,y plane between the two predictor variables, because the slope of the regression line is zero. The intensity of initial insulin therapy is unrelated to patient age, and thus there is no collinearity between the two variables. As a result, each predictor is independently associated with the outcome, and the inclusion of the age predictor has no bearing on the association or coefficient of insulin therapy with ICU stay. This situation is represented schematically in Figure 1A. Also notice in Part II, Figure 4A that the regression plane now fits the data perfectly, and all of the residuals are zero. Addition of a predictor variable that demonstrates no collinearity with the other predictors usually will improve the model by reducing the residuals without altering the coefficients that already are present.

ASSESSING COLLINEARITY

How is the degree of collinearity among two predictor variables or multicollinearity among multiple predictor variables assessed? The degree of collinearity between two predictor variables is quantified by their correlation coefficient, Rpred². The correlation coefficient of one predictor variable with another can be labeled Rpred² to distinguish it from the coefficient of determination of all of the predictors with the outcome, which remains R². Returning to Part II, Figure 2B, it can be observed that there is complete collinearity in the x,y plane representing the two predictors, because the data form a straight line (Rpred² = 1). In Figure 3B, there is partial collinearity (Rpred² = 0.25), and in Figure 4B, there is no collinearity (Rpred² = 0). When there are more than two predictor variables, multicollinearity can be assessed by determining the Rpred² or coefficient of determination of the predictor variable of interest with the other predictor variables. This essentially is a regression of one predictor variable with all of the others, and it represents a regression among the predictors within the larger regression model. Schematically, it represents the total proportion of the predictor-of-interest circle that is overlapped by other predictor circles in the model (Figure 1).

QUANTIFYING UNCERTAINTY OF THE COEFFICIENTS

Figure 5. (A) z = 5.5x + 19.5y − 0.25, R² = 0.85. (B) y = 0x + 0.5, Rpred² = 0. (C) z = −x + 13y + 13xy + 3, R² = 0.94.

In the experiment described in Part I, Figure 5, the investigator administered either placebo, potassium, bicarbonate, or both agents to each of four groups of animals with experimental salicylate overdose. Evaluating the results with respect to potassium infusion alone in Part I, Figure 5, it was found that salicylate clearance was higher in the animals that received potassium. The investigator now takes a multivariate approach and analyzes salicylate clearance as a function of both potassium and bicarbonate treatment in Part II, Figure 5A. Similar to the potassium predictor variable, the bicarbonate variable, y, is given dummy variable values of 0 or 1, corresponding to the absence or presence of bicarbonate infusion, respectively.
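For a pair of predictors, Rpred² is simply their squared Pearson correlation. A minimal sketch follows; the two small datasets are hypothetical analogues of the complete-collinearity case (Figure 2B) and the no-collinearity case (Figure 4B).

```python
def rpred_squared(xs, ys):
    """Squared Pearson correlation of one predictor with another."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Complete collinearity: the second predictor rises in lockstep with the first.
complete = rpred_squared([2, 4], [2, 4])

# No collinearity: age (hypothetical values) is unrelated to insulin intensity.
none = rpred_squared([0, 1, 2, 3], [30, 40, 40, 30])
```

Here `complete` is 1.0 and `none` is 0.0, the two extremes discussed above; with more than two predictors, the analogous quantity is the R² from regressing the predictor of interest on all the others.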

Inspection of Part II, Figure 5B reveals that, by design, there is no linear relationship or collinearity between the two predictor variables, potassium and bicarbonate infusion, and each data point represents two animals that received identical treatment. Thus, addition of the bicarbonate variable to the analysis in Part II, Figure 5A causes no change in the value of the potassium coefficient, 5.5, from the original model in Part I, Figure 5.

In Part I, the investigator also determined the SE of the potassium coefficient, 8.7, and used this to perform inference testing and to calculate the 95% confidence interval (CI) of the coefficient, −15.8 to 26.8. How is the SE of a predictor coefficient determined in the multiple regression model, and how is it affected by the inclusion of other predictor variables?

It was stated in Part I that the SE of a coefficient in simple linear regression is:

SE(coefficient)simple reg = [SSres / ((n − 2) Σ(X − Xmean)²)]^1/2   (equation 2)

In multiple linear regression, a similar formula is used, but a modification must be made to account for possible multicollinearity. When collinearity is present among two or more predictor variables, there is additional uncertainty in the value of their coefficients.

Returning to Figure 1, whenever there is a linear relationship between two variables in the experimental sample, their circles overlap. The degree or strength of overlap will have some uncertainty when inferences are made on a different or larger population. Specifically, when two predictor variables exhibit collinearity, their circles overlap, as in Figure 1B. Due to this overlap, the sizes of the individual areas A and C are less certain. The extent of overlap of one predictor variable with all of the others is quantified by its multiple correlation coefficient with the other predictors, Rpred².

The increased uncertainty due to collinearity also can be visualized with inspection of the linear regression plane. Compare Figures 2 and 4, which demonstrate the extremes of complete and no collinearity between the x and y predictor variables. Imagine that the regression plane is a piece of cardboard that is balanced on the data points in space. In Figure 2A, the cardboard easily can be tilted or rotated around the line connecting the two data points, whereas in Figure 4A, the cardboard sits on a stable platform of four points spaced apart. As the data points become more linearly oriented in the x,y predictor plane, the cardboard becomes less stable and more easily tilted over the line that represents the regression of the x and y predictors. This is analogous to the simple linear regression model in which the slope of the regression line becomes less certain as the spacing of the points along the x-axis is decreased.

The increase in the SE of a predictor coefficient due to multicollinearity is quantified by the square root of the variance inflation factor (VIF), where:

VIF = 1 / (1 − Rpred²)   (equation 3)

Rpred² can have any value from 0 to 1. The greater the overlap and collinearity of the predictor of interest with the other predictors, the greater are Rpred² and the VIF. Recall that the SE is the square root of the variance. To determine the SE of a predictor coefficient in multiple linear regression, we multiply the formula from simple linear regression times the square root of the VIF to obtain:

SE(coefficient)mult reg = [SSres / ((n − ktot − 1) Σ(X − Xmean)²) × 1/(1 − Rpred²)]^1/2   (equation 4)

where the denominator of the original formula has been slightly modified to account for the number of predictor variables, ktot. What happens to the SE of the potassium coefficient in the salicylate clearance experiment when the bicarbonate predictor variable is included in the regression model? We already have determined from Figure 5B that there is no collinearity between the potassium and bicarbonate predictor coefficients. Therefore, Rpred² for the potassium coefficient is zero, and the VIF = 1/(1 − 0) = 1. So there is no inflation of the variance or the SE. Does this mean that the SE remains unchanged? No. A review of Figure 1A reveals that when a second predictor variable represented by area C is included in the model, the area representing the residuals or uncertainty, area B, becomes smaller. The new information serves to increase the certainty of the model, and this is manifested by a decrease in the total error or residuals, SSres. According to equation 4, if the SSres decreases, then the SE of the coefficient also decreases. By including the bicarbonate coefficient, the SE of the potassium coefficient decreases from 8.7 to 3.8, and the span of the 95% CI of the coefficient decreases from −15.8 to 26.8 to the narrower range of −4.3 to 15.3. It was previously noted that inclusion of a noncollinear predictor variable in the regression model usually adds new information and decreases the total uncertainty or SSres. It is now apparent that this generally leads to a decrease in the SE of the other predictor coefficients.
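Equations 3 and 4 are straightforward to compute directly. A sketch with illustrative numbers (not the actual values from the salicylate experiment) shows the two competing influences: VIF inflates the SE when Rpred² grows, while a shrinking SSres deflates it.

```python
import math

def vif(rpred_sq):
    """Variance inflation factor (equation 3)."""
    return 1.0 / (1.0 - rpred_sq)

def se_coefficient(ss_res, n, k_tot, sxx, rpred_sq):
    """SE of a predictor coefficient in multiple regression (equation 4).
    sxx is the predictor's centered sum of squares, sum((X - Xmean)**2)."""
    return math.sqrt(ss_res / ((n - k_tot - 1) * sxx) * vif(rpred_sq))

# No collinearity (Rpred^2 = 0): VIF = 1, so the SE is not inflated.
base = se_coefficient(60.0, 8, 3, 2.0, 0.0)

# Rpred^2 = 0.75: VIF = 4, so the variance quadruples and the SE doubles.
inflated = se_coefficient(60.0, 8, 3, 2.0, 0.75)
```

Holding SSres fixed, raising Rpred² from 0 to 0.75 exactly doubles the SE; conversely, adding a noncollinear predictor that lowers SSres lowers the SE, which is the mechanism behind the narrowing of the potassium CI above.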

versus multivariate analysis. In general, the power clearance? It could be because those particular
to determine a statistically significant effect is based animals happened to have highly efficient kidneys,
on the magnitude of the effect and the degree of or perhaps there was a dosing or measurement error.
uncertainty in the results. If the magnitude of the An alternative explanation would be that there is an
effect is relatively large with respect to the uncertainty interaction between the two treatments. Bicarbonate
of the results, then the null hypothesis is rejected and infusion alone may alkalanize the urine and increase
statistical significance is concluded. The power of the salicylate excretion somewhat, and potassium alone
test can only be increased by either increasing the may have little effect. Potassium infusion and the
magnitude of the effect or decreasing the uncertainty presence of excess renal potassium combined with
in the results. Perhaps the most common approach to bicarbonate infusion may allow bicarbonate to alka-
decreasing uncertainty and increasing power is by lanize the urine to a much greater extent. The effect of
collecting more data and increasing the number of the combined treatment would be greater than the
data points, n. Including another noncollinear in- sum of the individual treatments alone.
dependent predictor variable in an analysis is another To describe the interaction effect in addition to the
technique that can be used to decrease uncertainty individual effects of each of the two treatments, an
and increase power without increasing n. In essence, interaction term is added to the regression equation to
instead of collecting more data points, the investigator yield:
is using more information from each data point that
already is included to improve the precision of the z ¼ k1 x þ k2 y þ k3 xy þ c ðequation 5Þ
model. In the example above, a t-test can be used in where k3 is the coefficient of the interaction term, xy.
the original univariate model in Part I, Figure 5 to Part II, Figure 5C demonstrates the new model that
determine the effect of potassium infusion on salicy- includes the interaction term for the salicylate clear-
late clearance. In Part II, Figure 5A, the investigator ance experiment. The interaction term adds shape to
has moved to a multivariate approach that includes the previously flat plane. Imagine that the regression
both the potassium and bicarbonate therapies. The plane is a flat piece of paper instead of cardboard. If
comparable multivariate statistical test would be we pick up the corner of the paper farthest from the
a two-way ANOVA. Although the magnitude of the origin and allow the paper to curve down to the
potassium therapy effect is the same in both tests, the origin, then this is the shape associated with inclusion
power to determine its statistical significance may be of a positive interaction term in the regression
increased using the ANOVA approach. equation.
LEVERAGE REVISITED

The multivariate model depicted in Part II, Figure 5A is an improvement over the univariate model in Part I, Figure 5, as evidenced by a decrease in the SSres from 905 to 145 and a corresponding increase in R² from 0.06 to 0.85. Careful inspection of the graphs, however, reveals that the two animals that received both potassium and bicarbonate had a salicylate clearance that was remarkably higher than the other three groups. These two data points are exerting upward leverage in both the univariate and multivariate regression models. The evaluation of leverage in multivariate regression is comparable with that in simple linear regression. In multivariate regression, Cook's distance corresponds to the combined change in all of the predictor coefficients and the z intercept when the data point in question is removed. If Cook's distance is relatively large for one or more data points as compared with the others, then those points may have a disproportionate influence on the regression model.
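The case-deletion idea behind Cook's distance can be sketched numerically. The snippet below uses invented data in the spirit of the salicylate example (two binary therapy indicators and a continuous clearance outcome; none of the numbers are the article's): for each observation, the model is refitted without that observation, and the scaled shift in all fitted values is recorded.

```python
import numpy as np

# Invented data: columns are intercept, potassium (0/1), bicarbonate (0/1);
# z is a hypothetical salicylate clearance. Not the article's data.
X = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0],
              [1, 0, 1], [1, 0, 1], [1, 1, 1], [1, 1, 1]], float)
z = np.array([14.0, 16.0, 15.0, 17.0, 30.0, 33.0, 55.0, 60.0])

beta, *_ = np.linalg.lstsq(X, z, rcond=None)
fit = X @ beta
p = X.shape[1]                              # number of coefficients
s2 = np.sum((z - fit) ** 2) / (len(z) - p)  # residual variance

# Cook's distance for point i: refit without point i and measure the
# combined change in all fitted values, scaled by p * s2.
cooks = []
for i in range(len(z)):
    keep = np.arange(len(z)) != i
    b_i, *_ = np.linalg.lstsq(X[keep], z[keep], rcond=None)
    cooks.append(float(np.sum((fit - X @ b_i) ** 2) / (p * s2)))

print([round(c, 2) for c in cooks])
```

Relatively large values flag points whose removal shifts the whole fitted surface disproportionately.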
INTERACTION EFFECT

Why did the two animals that received both potassium and bicarbonate have such a high salicylate clearance?

… the previously flat plane. Imagine that the regression plane is a flat piece of paper instead of cardboard. If we pick up the corner of the paper farthest from the origin and allow the paper to curve down to the origin, then this is the shape associated with inclusion of a positive interaction term in the regression equation.

The x variable coefficient representing potassium infusion has changed from +5.5 to −1, and the y coefficient has changed from 19.5 to 13 in the new model incorporating the interaction effect. The x and y coefficients have changed as a result of some expected collinearity of each with the new xy term. This is depicted schematically in Figure 1C, in which the new interaction effect is represented by area D, and it partly overlaps areas A and C. In summary, the improvement in salicylate clearance attributed to potassium infusion in Part I, Figure 5 may actually be the result of both bicarbonate infusion alone and the combined effects of bicarbonate and potassium infusion together.
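The effect of adding an xy interaction column can be reproduced in a few lines. The data below are invented (a balanced two-by-two design with a synergistic fourth group, not the article's measurements); the point is the qualitative behavior described above: the x coefficient shrinks sharply once the interaction term absorbs the synergy.

```python
import numpy as np

# Invented illustration: x = potassium (0/1), y = bicarbonate (0/1),
# z = salicylate clearance. The synergy in the (1,1) group mimics the
# interaction described in the text; the numbers are not the article's.
x = np.array([0, 0, 1, 1, 0, 0, 1, 1], float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], float)
z = np.array([14, 16, 15, 17, 30, 33, 55, 60], float)

ones = np.ones_like(x)
X_add = np.column_stack([ones, x, y])         # additive model
X_int = np.column_stack([ones, x, y, x * y])  # adds the xy interaction

b_add, *_ = np.linalg.lstsq(X_add, z, rcond=None)
b_int, *_ = np.linalg.lstsq(X_int, z, rcond=None)

# For these invented numbers the x coefficient drops from 13.5 (additive)
# to 1.0 once the interaction term (25.0) is included.
print(round(b_add[1], 2), round(b_int[1], 2), round(b_int[3], 2))
```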
ACAD EMERG MED • January 2004, Vol. 11, No. 1 • 101

SE OF THE COEFFICIENT: TWO COMPETING EFFECTS

The multivariate model depicted in Figure 5C appears to fit the data best, and a small, negative effect of potassium therapy alone is suggested by the potassium coefficient of −1. What is the effect of inclusion of the interaction term on the SE of the potassium coefficient? In one sense, the new interaction term has improved the fit of the model, as evidenced by a decrease in the SSres to 60 and a corresponding increase in R² to 0.94. This is represented schematically in Figure 1C by the portion
of area D that does not overlap the predictor areas A and C and results in a decrease in the residual area, B. This decrease in the uncertainty of the model leads to a decrease in the SE of the other predictors, such as the potassium coefficient.

The interaction term also has some collinearity with the potassium coefficient, and this is represented by the overlap of area D with area A in Figure 1C. New explanatory information is not added in this area; instead, the collinearity represents redundancy in the two predictors. This redundancy leads to uncertainty in the distribution of the association of these predictors with the outcome variable. The size of area A becomes less certain. The extent of collinearity between the predictor variable x and the other predictors y and xy is quantified by its coefficient of determination, R²pred, which is now 0.50. Then, according to equation 3, VIF = 1/(1 − 0.50) = 2.0. This means that the SE of the coefficient relating potassium infusion to salicylate clearance is increased, or inflated, by a factor of [VIF]^1/2 = [2.0]^1/2 = 1.41, or 41%, when the interaction variable is included in the model.
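The VIF calculation can be checked directly: regress the predictor of interest on the remaining predictors, take that model's R², and apply VIF = 1/(1 − R²pred). With the invented balanced two-by-two indicator design used in the sketches above, the regression of x on y and xy happens to give R²pred = 0.50 exactly, matching the value in the text.

```python
import numpy as np

# Invented balanced design: regress the potassium indicator x on the
# other predictors (y and the xy interaction) to quantify collinearity.
x = np.array([0, 0, 1, 1, 0, 0, 1, 1], float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], float)
ones = np.ones_like(x)
Z = np.column_stack([ones, y, x * y])   # the "other" predictors

coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
resid = x - Z @ coef
r2_pred = 1 - np.sum(resid ** 2) / np.sum((x - x.mean()) ** 2)

vif = 1 / (1 - r2_pred)                 # variance inflation factor
print(round(float(r2_pred), 2), round(float(vif), 2),
      round(float(np.sqrt(vif)), 2))   # 0.5, 2.0, and sqrt(2) = 1.41
```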
Thus, inclusion of the collinear interaction term has two competing effects on the SE of the potassium coefficient. New information is included in the model, and this is manifested by a decrease in the total error or residuals, SSres, and a decrease in the SE of the potassium predictor coefficient. Conversely, the interaction term is partly collinear with the potassium predictor, and its inclusion increases the uncertainty and SE of the potassium effect. In this example, the overall effect is that the SE of the potassium coefficient increases from 3.8 to 3.9, and the 95% CI of the potassium coefficient is −11.8 to 9.8 after inclusion of the interaction term. In general, the SE of existing predictor coefficients may increase or decrease with the inclusion of a new collinear predictor in the regression model. The total effect depends on the relative amount of new explanatory information added versus the extent of collinearity and redundancy the new predictor displays with each of the previously existing predictor variables.
CHOOSING PROPER VARIABLES

The art and science of building the multiple regression model require active collaboration between the clinical or laboratory scientist and the statistician. The general goal should be the inclusion of all predictor variables that add substantial independent information while avoiding excessive collinearity or overlap. When multicollinearity exists, which usually is the case in medical research, the predictor variable coefficients can be biased or their SEs can be increased by inclusion of either too few or too many variables in the regression model.
Consider the following examples. Suppose one is trying to predict the likelihood of myocardial infarction (MI) in patients who present to the emergency department. The researcher may tabulate a list of the predictor variables, each with its own individual univariate statistic, such as an odds ratio (OR) and an associated 95% CI. Let us also assume that some of the predictors are positively correlated, such as the degree of elevation of the electrocardiogram (ECG) ST segment and the serum troponin level. Each individual univariate statistic will attribute all of the overlap in predictive value with the other predictors to the particular predictor of interest. As we move down the list of predictors, the entire overlap with the other variables is attributed to each individual predictor variable in turn. Consequently, although each univariate statistic is numerically correct, it is biased toward a higher value, and the association of the predictor variables as a whole with the outcome variable will seem larger than it truly is. When there is positive collinearity among positive predictor variables, each predictor coefficient will be highest when viewed in a univariate model.

Conversely, consider the same situation, but in this case, an excessive number of collinear variables are included in a single multivariate logistic model in which the outcome is the probability of an MI, which varies between 0 and 1. In addition to a history of smoking, the investigators also measured the history of coughing, frequency of visitation to a drinking establishment, and dental coloration. It is expected that all of these variables would display positive collinearity with the smoking history. Based on our experience and previous science, we know that, fundamentally, smoking might lead to an increase in the likelihood of an acute MI, but the other variables likely would not. Inclusion of multiple extraneous collinear variables that are not associated with the outcome variable in the multiple regression model would not be expected to bias the predictor variable of interest, smoking history. In practice, the situation is often complex, and these variables may actually be associated with the outcome for a variety of unanticipated reasons. For example, history of coughing may be a better measure of cigarette use than the reported smoking history, and bar patrons may suffer from second-hand smoke. In this situation, inclusion of these variables in the model may alter the history of smoking coefficient. Regardless of any bias that may occur, inclusion of nonpredictive collinear variables will generally inflate the variance and uncertainty of the predictor of interest. Finally, it is interesting to consider the possible consequences of removing the smoking history variable from the model. The model might still demonstrate an excellent R² with the remaining extraneous variables, which could now be viewed as positively biased.
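The inflation of univariate estimates under positive collinearity is easy to demonstrate by simulation. The snippet below uses synthetic continuous data and a plain linear (rather than logistic) model for simplicity: two positively correlated predictors each truly contribute a coefficient of 1.0, and the univariate estimate for the first predictor absorbs the shared signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two positively correlated predictors (think ST elevation and troponin,
# purely as illustrative labels) and an outcome driven equally by both.
common = rng.normal(size=n)
x1 = common + rng.normal(scale=0.7, size=n)
x2 = common + rng.normal(scale=0.7, size=n)
z = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ols(X, z):
    """Least-squares slopes (intercept added internally)."""
    A = np.column_stack([np.ones(len(z)), X])
    b, *_ = np.linalg.lstsq(A, z, rcond=None)
    return b[1:]

uni1 = ols(x1.reshape(-1, 1), z)[0]         # x1 alone: credits itself with the overlap
multi = ols(np.column_stack([x1, x2]), z)   # both predictors: overlap is shared

# The univariate estimate exceeds the multivariate one (about 1.67 vs 1.0
# in expectation for this setup).
print(round(float(uni1), 2), round(float(multi[0]), 2))
```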
There are numerous manual and automated methods for building multiple linear regression models [1,2]. As one can appreciate from the examples above, methods that rely on univariate screening [3,4] and
automated stepwise techniques [5] are prone to bias and scientific error. The use of all automated predictor variable selection methods is discouraged. Even when using manual methods to choose predictor variables, there is active debate regarding whether investigators should generally use the smallest number of predictor variables that yield a good fit to the data, or whether all scientifically credible predictor variables should be included to maximize the predictive value of the model [6,7]. Perhaps the best and simplest approach is to design an experiment to collect data on the most important known and suspected fundamental predictor variables based on current scientific knowledge. The investigator then can examine the results to confirm that the data satisfy the assumptions of the linear model, and that all of the included predictor variables contribute unique information and do not demonstrate excessive multicollinearity. Regardless of the approach used to construct the regression model, the results should eventually be validated using either a similar, but separate, set of data, or using another method such as cross-validation [8,9].
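Cross-validation, mentioned above as a validation method, can be sketched in a few lines: split the data into folds, fit the regression on all but one fold, and score predictions on the held-out fold. Synthetic data and a plain least-squares fit are used here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))                  # two synthetic predictors
z = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

def fit(X, z):
    A = np.column_stack([np.ones(len(z)), X])
    b, *_ = np.linalg.lstsq(A, z, rcond=None)
    return b

def predict(b, X):
    return np.column_stack([np.ones(len(X)), X]) @ b

# 5-fold cross-validation: fit on 4/5 of the data, score on the held-out fold.
idx = rng.permutation(n)
folds = np.array_split(idx, 5)
errors = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    b = fit(X[train], z[train])
    errors.append(np.mean((z[test] - predict(b, X[test])) ** 2))

print("cross-validated MSE:", round(float(np.mean(errors)), 2))
```

The cross-validated mean squared error estimates how the model would perform on new data, rather than how well it fits the data used to build it.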
MULTICOLLINEARITY: DEALING WITH IT

Sometimes multicollinearity among important predictors is inevitable. How can the investigator deal with this? One approach would be to collect more data to decrease collinearity. Sometimes, however, this may be difficult or impossible. For example, we know that patients with elevated troponin are more likely to have an elevated ST segment on ECG, and finding enough patients with an elevated troponin and normal ST segment may be difficult. Collinear predictors may be combined to form a summary variable or score. The Goldman criteria are an example of a combined score composed of multiple collinear risk factors for heart disease used to evaluate cardiac risk in patients undergoing noncardiac surgical procedures [10]. Another approach would be to study different predictors that may be more fundamental and display less collinearity. Instead of measuring patient age and history of hypertension and elevated cholesterol as predictors of an MI or angina, the investigator might instead measure the degree of luminal narrowing on cardiac catheterization. This is a variation of the concept used in the technique of principal component analysis [11].
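The idea of replacing collinear predictors with a single summary variable can be illustrated with the first principal component: center the collinear columns and take the leading direction from a singular value decomposition. The troponin/ST-segment names below are purely illustrative labels on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
common = rng.normal(size=n)
# Two collinear measurements of the same underlying quantity (synthetic).
trop = common + rng.normal(scale=0.5, size=n)
st = common + rng.normal(scale=0.5, size=n)

M = np.column_stack([trop, st])
Mc = M - M.mean(axis=0)                 # center before extracting components
U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
pc1 = Mc @ Vt[0]                        # first principal component scores

# Share of total variance carried by the first component; a high value
# suggests the two collinear columns can be summarized by one score.
explained = float(s[0] ** 2 / np.sum(s ** 2))
print(round(explained, 2))
```

The scores in `pc1` could then replace the two collinear columns as a single predictor in the regression model.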
Finally, there are modifications of the least-squares approach, such as ridge regression. This analytic technique introduces bias in the estimates of highly collinear variable coefficients in exchange for a decrease in their uncertainty [12–14].
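Ridge regression replaces the least-squares normal equations X'Xb = X'z with (X'X + λI)b = X'z, shrinking the coefficients of nearly collinear predictors. A minimal sketch on synthetic, nearly collinear data (the choice λ = 10 is arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
common = rng.normal(size=n)
x1 = common + rng.normal(scale=0.1, size=n)   # nearly collinear pair
x2 = common + rng.normal(scale=0.1, size=n)
z = x1 + x2 + rng.normal(size=n)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                       # center predictors and outcome
zc = z - z.mean()

def ridge(Xc, zc, lam):
    # Solve (X'X + lam * I) b = X'z; lam = 0 recovers ordinary least squares.
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ zc)

b_ols = ridge(Xc, zc, 0.0)
b_ridge = ridge(Xc, zc, 10.0)                 # arbitrary illustrative lambda

# The ridge coefficients are shrunk (smaller norm) relative to OLS,
# trading a little bias for less variance.
print(np.round(b_ols, 2), np.round(b_ridge, 2))
```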
CONCLUSIONS

Most problems in clinical medicine are multivariate. Consequently, a univariate approach in research analysis often is flawed and may produce quantitatively or qualitatively incorrect predictor coefficients, and incorrect conclusions with inference testing. A multivariate approach often is required. Multiple linear regression is a useful technique for modeling many phenomena in medical research. For data sets that meet the necessary assumptions, it offers a well-developed model that usually can be solved exactly, yielding estimates of the predictor variable coefficients and their SE or uncertainty. This can lead to a better understanding of the relative effects and importance of the predictors of interest, and allows the investigator to prognosticate the outcome of future data. Applications in clinical medicine include models to determine diagnosis, prognosis, and therapeutic outcomes.

The author thanks Doctors Elaine Rabin, Lillian Oshva, Lisa Campanella, Ellen Weber, Lewis Goldfrank, and the statistical editors of Academic Emergency Medicine for their thoughtful encouragement, suggestions, challenges, and support.

References

1. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–87.
2. Nick TG, Hardin JM. Regression modeling strategies: an illustrative case study from medical rehabilitation outcomes research. Am J Occup Ther. 1999;53:469–70.
3. Sun GW, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol. 1996;49:907–16.
4. Cardo DM, Culver DH, Ciesielski CA, et al. A case–control study of HIV seroconversion in health care workers after percutaneous exposure. N Engl J Med. 1997;337:1485–90.
5. Derksen S, Keselman HJ. Backward, forward and stepwise automated subset-selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol. 1992;45:265–82.
6. Wears RL, Lewis RJ. Statistical models and Occam's razor. Acad Emerg Med. 1999;6:93–4.
7. Kepermann N, Willits N. In response to "Statistical models and Occam's razor." Acad Emerg Med. 2000;7:100–3.
8. Efron B. Computer-intensive methods in statistics. Sci Am. 1983;248:116–30.
9. Efron B, Gong G. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat. 1983;37:36–48.
10. Goldman L, Caldera DL, Nussbaum SR, et al. Multifactorial index of cardiac risk in noncardiac surgical patients. N Engl J Med. 1977;297:845–50.
11. Glantz SA, Slinker BK. Using principal components to diagnose and treat multicollinearity. In: Primer of Applied Regression and Analysis of Variance, 2nd ed. New York, NY: McGraw-Hill, 2001, pp. 219–37.
12. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
13. Hoerl AE, Kennard RW. Ridge regression: applications to nonorthogonal problems. Technometrics. 1970;12:69–82.
14. Myers RH. Ridge regression. In: Classical and Modern Regression with Applications, 2nd ed. Pacific Grove, CA: Duxbury Press, 1990, pp. 392–411.