Linear Regression
Regression analysis is a statistical technique that attempts to explore and model the relationship between two
or more variables. For example, an analyst may want to know if there is a relationship between road accidents
and the age of the driver. Regression analysis forms an important part of the statistical analysis of the data
obtained from designed experiments and is discussed briefly in this chapter.
A linear regression model attempts to explain the relationship between two or more variables using a straight
line. Consider the data obtained from a chemical process where the yield of the process is thought to be related
to the reaction temperature (see the table below).
A scatter plot of these data can be obtained as shown in the following figure. In the scatter plot, the yield, Y, is plotted against the different temperature values, x.
It is clear that no line can be found to pass through all points of the plot. Thus no functional relation exists between the two variables x and Y. However, the scatter plot does give an indication that a straight line may exist such that all the points on the plot are scattered randomly around this line. A statistical relation is said to exist in this case. The statistical relation between x and Y may be expressed as follows:

Y = β0 + β1x + ε

The above equation is the linear regression model that can be used to explain the relation between x and Y that is seen on the scatter plot above. In this model, the mean value of Y (abbreviated as E(Y)) is assumed to follow the linear relation:

E(Y) = β0 + β1x

The actual values of Y (which are observed as yield from the chemical process from time to time and are random in nature) are assumed to be the sum of the mean value, E(Y), and a random error term, ε:

Y = E(Y) + ε = β0 + β1x + ε
The regression model here is called a simple linear regression model because there is just one independent variable, x, in the model. In regression models, the independent variables are also referred to as regressors or predictor variables. The dependent variable, Y, is also referred to as the response. The slope, β1, and the intercept, β0, of the line are called regression coefficients. The slope, β1, can be interpreted as the change in the mean value of Y for a unit change in x.
The random error term, ε, is assumed to follow the normal distribution with a mean of 0 and a variance of σ². Since Y is the sum of this random term and the mean value, E(Y), which is a constant, the variance of Y at any given value of x is also σ². Therefore, at any given value of x, say xi, the dependent variable Y follows a normal distribution with a mean of β0 + β1xi and a standard deviation of σ. This is illustrated in the following figure.
In practice the true values of the regression coefficients β0 and β1 are unknown and must be estimated from an observed data set. The estimates, b0 and b1, are calculated using least squares. (For details on least squares estimates, refer to Hahn & Shapiro (1967).) The estimated regression line, obtained using the values of b0 and b1, is called the fitted line. The least squares estimates are:

b1 = Σ [ (xi - x̄)(yi - ȳ) ] / Σ [ (xi - x̄)² ]
b0 = ȳ - b1x̄

where ȳ is the mean of all the observed values of Y and x̄ is the mean of all values of the predictor variable x.
Once b0 and b1 are known, the fitted regression line can be written as:

ŷ = b0 + b1x

where ŷ is the fitted or estimated value based on the fitted regression model. It is an estimate of the mean value, E(Y). The fitted value, ŷi, for a given value of the predictor variable, xi, may be different from the corresponding observed value, yi. The difference between the two values is called the residual, ei:

ei = yi - ŷi

The least squares estimates of the regression coefficients can be obtained for the data in the preceding table using these formulas.
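As a rough sketch of how these estimates are computed, the following Python snippet applies the least squares formulas above; the temperature and yield values are hypothetical, chosen only for illustration.

# Least squares estimates for a simple linear regression (illustrative sketch).
# The temperature (x) and yield (y) values below are hypothetical.
x = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71]            # reaction temperature
y = [122, 118, 128, 121, 125, 136, 144, 142, 149, 161]  # process yield

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2),  b0 = y_bar - b1 * x_bar
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Fitted values and residuals (ei = yi - yhat_i)
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

print(b0, b1)
print(sum(residuals))   # sums to (numerically) zero for a least squares fit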
Example
In this lesson, we apply regression analysis to some fictitious data, and we show how to
interpret the results of our analysis.
Problem Statement
Last year, five randomly selected students took a math aptitude test before they began their
statistics course. The Statistics Department has three questions.
1. What linear regression equation best predicts statistics performance, based on math aptitude scores?
2. If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
3. How well does the regression equation fit the data?
In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column
shows statistics grades. The last two rows show sums and mean scores that we will use to
conduct the regression analysis.
Student   xi     yi     (xi - x̄)   (yi - ȳ)   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
1         95     85     17          8           289          64           136
2         85     95     7           18          49           324          126
3         80     70     2           -7          4            49           -14
4         70     65     -8          -12         64           144          96
5         60     70     -18         -7          324          49           126
Sum       390    385                            730          630          470
Mean      78     77
b1 = Σ [ (xi - x̄)(yi - ȳ) ] / Σ [ (xi - x̄)² ]
b1 = 470/730 = 0.644

b0 = ȳ - b1 * x̄
b0 = 77 - (0.644)(78) = 26.768
Once you have the regression equation, using it is a snap. Choose a value for the
independent variable (x), perform the computation, and you have an estimated value (ŷ) for
the dependent variable.
In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:

ŷ = b0 + b1x = 26.768 + 0.644(80) = 26.768 + 51.52 = 78.288
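The hand calculations above can be double-checked with a few lines of Python; the x and y values are the aptitude scores and statistics grades from the table.

# Reproduce the worked example: b1 = 470/730 = 0.644, b0 = about 26.77
x = [95, 85, 80, 70, 60]   # aptitude test scores
y = [85, 95, 70, 65, 70]   # statistics grades

n = len(x)
x_bar = sum(x) / n   # 78
y_bar = sum(y) / n   # 77

num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 470
den = sum((xi - x_bar) ** 2 for xi in x)                         # 730
b1 = num / den                                                   # about 0.644
b0 = y_bar - b1 * x_bar                                          # about 26.78

# Predicted statistics grade for an aptitude score of 80
print(b0 + b1 * 80)   # about 78.3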
Warning: When you use a regression equation, do not use values for the independent
variable that are outside the range of values used to create the equation. That is
called extrapolation, and it can produce unreasonable estimates.
In this example, the aptitude test scores used to create the regression equation ranged from
60 to 95. Therefore, only use values inside that range to estimate statistics grades. Using
values outside that range (less than 60 or greater than 95) is problematic.
Whenever you use a regression equation, you should ask how well the equation fits the data.
One way to assess fit is to check the coefficient of determination, which can be computed from the following formula:

R² = { ( 1 / N ) * Σ [ (xi - x̄)(yi - ȳ) ] / ( σx * σy ) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y. Computations for the sample problem of this lesson are shown below.
σx = sqrt [ Σ (xi - x̄)² / N ] = sqrt( 730/5 ) = sqrt(146) = 12.083
σy = sqrt [ Σ (yi - ȳ)² / N ] = sqrt( 630/5 ) = sqrt(126) = 11.225

R² = { ( 1/5 ) * 470 / ( 12.083 * 11.225 ) }² = ( 94 / 135.63 )² = ( 0.693 )² = 0.48
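These figures can be verified with a short script; note that the formula uses population standard deviations (dividing by N), matching the definitions above.

import math

x = [95, 85, 80, 70, 60]
y = [85, 95, 70, 65, 70]
N = len(x)
x_bar, y_bar = sum(x) / N, sum(y) / N

sigma_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / N)   # sqrt(146) = 12.083
sigma_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / N)   # sqrt(126) = 11.225

cov_term = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / N   # 470/5 = 94
r = cov_term / (sigma_x * sigma_y)   # correlation, about 0.693
r_squared = r ** 2                   # coefficient of determination, about 0.48
print(round(r_squared, 2))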
A coefficient of determination equal to 0.48 indicates that about 48% of the variation in
statistics grades (the dependent variable) can be explained by the relationship to math
aptitude scores (the independent variable). This would be considered a good fit to the data,
in the sense that it would substantially improve an educator's ability to predict student
performance in statistics class.
Residual Analysis in Regression
Because a linear regression model is not always appropriate for the data, you should assess
the appropriateness of the model by defining residuals and examining residual plots.
Residuals
The difference between the observed value of the dependent variable (y) and the predicted
value (ŷ) is called the residual (e). Each data point has one residual.
Both the sum and the mean of the residuals are equal to zero. That is, Σ e = 0 and e = 0.
Residual Plots
A residual plot is a graph that shows the residuals on the vertical axis and the independent
variable on the horizontal axis. If the points in a residual plot are randomly dispersed around
the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-
linear model is more appropriate.
Below, the table shows inputs and outputs from a simple linear regression analysis, and the accompanying chart displays the residual (e) plotted against the independent variable (x) as a residual plot. The fitted values use the coefficients from the earlier example (ŷ = 26.768 + 0.644x).

x    60       70       80       85       95
y    70       65       70       95       85
ŷ    65.408   71.848   78.288   81.508   87.948
e    4.592    -6.848   -8.288   13.492   -2.948
The residual plot shows a fairly random pattern - the first residual is positive, the next two
are negative, the fourth is positive, and the last residual is negative. This random pattern
indicates that a linear model provides a decent fit to the data.
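A residual plot like the one described here can be sketched in a few lines of Python (matplotlib assumed); the fitted values use the rounded coefficients from the earlier example.

import matplotlib.pyplot as plt

x = [60, 70, 80, 85, 95]
y = [70, 65, 70, 95, 85]
b0, b1 = 26.768, 0.644   # coefficients from the earlier example

y_hat = [b0 + b1 * xi for xi in x]                  # fitted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # e = y - yhat

print(round(sum(residuals), 3))   # close to zero, as expected

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")    # horizontal reference line at e = 0
plt.xlabel("x (independent variable)")
plt.ylabel("residual (e)")
plt.show()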
Below, the residual plots show three typical patterns. The first plot shows a random pattern,
indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped
and inverted U), suggesting a better fit for a non-linear model.
In the next lesson, we will work on a problem where the residual plot shows a non-random pattern, and we will show how to "transform" the data so that a linear model can be used with nonlinear data.
There are many ways to transform variables to achieve linearity for regression analysis.
Some common methods are summarized below.
Method                       Transformation(s)                  Regression equation       Predicted value (ŷ)
Standard linear regression   None                               y = b0 + b1x              ŷ = b0 + b1x
Exponential model            Dependent variable = log(y)        log(y) = b0 + b1x         ŷ = 10^(b0 + b1x)
Quadratic model              Dependent variable = sqrt(y)       sqrt(y) = b0 + b1x        ŷ = ( b0 + b1x )²
Logarithmic model            Independent variable = log(x)      y = b0 + b1log(x)         ŷ = b0 + b1log(x)
Power model                  Dependent variable = log(y),       log(y) = b0 + b1log(x)    ŷ = 10^(b0 + b1log(x))
                             independent variable = log(x)
Each row shows a different nonlinear transformation method. The second column shows the
specific transformation applied to dependent and/or independent variables. The third
column shows the regression equation used in the analysis. And the last column shows the
"back transformation" equation used to restore the dependent variable to its original, non-
transformed measurement scale.
In practice, these methods need to be tested on the data to which they are applied to be
sure that they increase rather than decrease the linearity of the relationship. Testing the
effect of a transformation method involves looking at residual plots and correlation
coefficients, as described in the following sections.
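As a rough illustration of that kind of check, the sketch below compares the correlation of x with y before and after a log transformation of the dependent variable; the data values are made up and roughly exponential.

import math

# Hypothetical data with a roughly exponential trend
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.3, 8.8, 17.5, 36.0, 71.0]

def pearson_r(a, b):
    # Pearson correlation coefficient between two equal-length lists
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (sa * sb)

log_y = [math.log10(yi) for yi in y]   # exponential model: log(y) = b0 + b1*x

print(pearson_r(x, y))       # linearity before the transformation
print(pearson_r(x, log_y))   # linearity after the transformation (closer to 1 here)

# A prediction on the original scale would be back-transformed: yhat = 10 ** (b0 + b1 * x_new)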
Logistic Regression
What is the logistic curve? What is the base of the natural logarithm? Why do
statisticians prefer logistic regression to ordinary linear regression when the DV is
binary? How are probabilities, odds and logits related? What is an odds ratio? How
can logistic regression be considered a linear regression? What is a loss function?
What is a maximum likelihood estimate? How is the b weight in logistic regression for
a categorical variable related to the odds ratio of its constituent categories?
For this chapter only, we are going to deal with a dependent variable that is binary (a
categorical variable that has two values such as "yes" and "no") rather than
continuous.
Suppose we want to predict whether someone is male or female (DV, M=1, F=0)
using height in inches (IV). We could plot the relations between the two variables as
we customarily do in regression. The plot might look something like this:
Points to notice about the graph (data are fictional):
1. The regression line is a rolling average, just as in linear regression. The Y-axis is P, which indicates the proportion of 1s at any given value of height.
2. The regression line is nonlinear.
3. None of the observations --the raw data points-- actually fall on the regression line. They all fall on zero or one.
When I was in graduate school, people didn't use logistic regression with a binary DV.
They just used ordinary linear regression instead. Statisticians won the day, however,
and now most psychologists use logistic regression with a binary DV for the
following reasons:
1. If you use linear regression, the predicted values will become greater than one
and less than zero if you move far enough on the X-axis. Such values are
theoretically inadmissible.
2. One of the assumptions of regression is that the variance of Y is constant across
values of X (homoscedasticity). This cannot be the case with a binary variable,
because the variance is PQ. When 50 percent of the people are 1s, then the
variance is .25, its maximum value. As we move to more extreme values, the
variance decreases. When P=.10, the variance is .1*.9 = .09, so as P approaches
1 or zero, the variance approaches zero.
3. The significance testing of the b weights rests upon the assumption that errors of
prediction (Y-Y') are normally distributed. Because Y only takes the values 0
and 1, this assumption is pretty hard to justify, even approximately. Therefore,
the tests of the regression weights are suspect if you use linear regression with a
binary DV.
The logistic regression model writes P, the probability of a 1, as a function of X:

P = e^(a + bX) / ( 1 + e^(a + bX) )

or, equivalently,

P = 1 / ( 1 + e^-(a + bX) )
where P is the probability of a 1 (the proportion of 1s, the mean of Y), e is the base of
the natural logarithm (about 2.718) and a and b are the parameters of the model. The
value of a yields P when X is zero, and b adjusts how quickly the probability changes when X changes by a single unit (we can have standardized and unstandardized b weights in logistic regression, just as in ordinary linear regression). Because the
relation between X and P is nonlinear, b does not have a straightforward interpretation
in this model as it does in ordinary linear regression.
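A minimal sketch of this curve, with made-up values of a and b chosen only to produce an S-shape over plausible heights, is:

import math

def logistic_p(x, a, b):
    # Probability of a 1 at value x under the logistic curve
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))

# Hypothetical parameters for predicting male (1) vs. female (0) from height in inches
a, b = -45.0, 0.65

for height in (62, 66, 70, 74):
    print(height, round(logistic_p(height, a, b), 3))   # rises from near 0 toward 1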
Loss Function
A loss function is a measure of fit between a mathematical model of data and the
actual data. We choose the parameters of our model to minimize the badness-of-fit or
to maximize the goodness-of-fit of the model to the data. With least squares (the only
loss function we have used thus far), we minimize SSres, the sum of squares residual.
This also happens to maximize SSreg, the sum of squares due to regression. With linear
or curvilinear models, there is a mathematical solution to the problem that will
minimize the sum of squares, that is,
b = (X'X)^-1 X'y

or, for standardized weights,

β = R^-1 r
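In code, this matrix solution can be sketched as follows (numpy assumed; the numbers are the small data set from the earlier regression example, with a column of ones added for the intercept).

import numpy as np

x = np.array([60., 70., 80., 85., 95.])
y = np.array([70., 65., 70., 95., 85.])

# Design matrix X with a leading column of ones so that b includes the intercept
X = np.column_stack([np.ones_like(x), x])

# b = (X'X)^-1 X'y; np.linalg.solve is the numerically safer way to apply it
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)   # intercept and slope, about [26.78, 0.644] for these numbers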
With some models, like the logistic curve, there is no mathematical solution that will
produce least squares estimates of the parameters. For many of these models, the loss
function chosen is called maximum likelihood. A likelihood is a conditional
probability (e.g., P(Y|X), the probability of Y given X). We can pick the parameters of
the model (a and b of the logistic curve) at random or by trial-and-error and then
compute the likelihood of the data given those parameters (actually, we do better than
trial-and-error, but not perfectly). We will choose as our parameters those that result in the greatest likelihood computed. The estimates are called maximum likelihood
because the parameters are chosen to maximize the likelihood (conditional probability
of the data given parameter estimates) of the sample data. The techniques actually
employed to find the maximum likelihood estimates fall under the general
label numerical analysis. There are several methods of numerical analysis, but they all
follow a similar series of steps. First, the computer picks some initial estimates of the
parameters. Then it will compute the likelihood of the data given these parameter
estimates. Then it will improve the parameter estimates slightly and recalculate the
likelihood of the data. It will do this forever until we tell it to stop, which we usually
do when the parameter estimates do not change much (usually a change of .01 or .001 is
small enough to tell the computer to stop). [Sometimes we tell the computer to stop
after a certain number of tries or iterations, e.g., 20 or 250. This usually indicates a
problem in estimation.]
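The iterative idea can be sketched as simple gradient ascent on the log likelihood with a stopping rule based on how much the estimates change; this is only an illustration of the general procedure, not the particular algorithm a package such as SAS uses.

import math

def log_likelihood(a, b, xs, ys):
    # Log of the probability of the observed 0/1 data given parameters a and b
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll

def fit_logistic(xs, ys, step=0.01, tol=0.001, max_iter=10000):
    a, b = 0.0, 0.0   # initial parameter estimates
    for _ in range(max_iter):
        # Gradient of the log likelihood with respect to a and b
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (y - p)
            grad_b += (y - p) * x
        new_a, new_b = a + step * grad_a, b + step * grad_b
        # Stop when the estimates barely change between iterations
        if abs(new_a - a) < tol and abs(new_b - b) < tol:
            return new_a, new_b
        a, b = new_a, new_b
    return a, b   # hit the iteration limit, which may indicate an estimation problem

# Hypothetical 0/1 outcomes and a single 0/1 predictor
xs = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
ys = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
a, b = fit_logistic(xs, ys)
print(a, b, -2 * log_likelihood(a, b, xs, ys))   # -2 Log L for the fitted model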
The odds of an event are the probability of the event divided by the probability of the non-event, P/(1-P). Suppose, for example, that the probability of being male at a given height is .90, so the probability of being female is .10. (Odds can also be found by counting the number of people in each group and dividing one number by the other. Clearly, the probability is not the same as the odds.) In our example, the odds of being male would be .90/.10 or 9 to one. The odds of being female would be .10/.90 or 1/9 or .11. This asymmetry is unappealing, because the odds of being a male should be the opposite of the odds of being a female. We can take care of this asymmetry through the natural logarithm, ln. The natural log of 9 is 2.197 (ln(.9/.1) = 2.197). The natural log of 1/9 is -2.197 (ln(.1/.9) = -2.197), so the log odds of
being male is exactly opposite to the log odds of being female. The natural log
function looks like this:
Note that the natural log is zero when X is 1. When X is larger than one, the log
curves up slowly. When X is less than one, the natural log is less than zero, and
decreases rapidly as X approaches zero. When P = .50, the odds are .50/.50 or 1, and
ln(1) = 0. If P is greater than .50, ln(P/(1-P)) is positive; if P is less than .50, ln(odds) is negative. [A number taken to a negative power is one divided by that number, e.g. e^-10 = 1/e^10. A logarithm is an exponent from a given base, for example ln(e^10) = 10.]
In logistic regression, the dependent variable is a logit, which is the natural log of the
odds, that is,
logit(P) = a + bX,
which is assumed to be linear; that is, the log odds (logit) is assumed to be linearly related to X, our IV. So there's an ordinary regression hidden in there. We could in
theory do ordinary regression with logits as our DV, but of course, we don't have
logits in there, we have 1s and 0s. Then, too, people have a hard time understanding
logits. We could talk about odds instead. Of course, people like to talk about
probabilities more than odds. To get there (from logits to probabilities), we first take the log out of both sides of the equation, which gives the odds:

P/(1-P) = e^(a + bX)

Then we convert the odds to a simple probability:

P = e^(a + bX) / ( 1 + e^(a + bX) )
The simple probability is this ugly equation that you saw earlier. If log odds are
linearly related to X, then the relation between X and P is nonlinear, and has the form
of the S-shaped curve you saw in the graph and the function form (equation) shown
immediately above.
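The round trip from probabilities to odds to logits and back can be written directly; the .90 example from above comes out as odds of 9 and a logit of about 2.197.

import math

def logit(p):
    # Natural log of the odds
    return math.log(p / (1 - p))

def prob_from_logit(log_odds):
    # Invert the logit: convert a log odds back to a simple probability
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p = 0.90
odds = p / (1 - p)                 # 9.0
print(odds, round(logit(p), 3))    # 9.0 and about 2.197
print(prob_from_logit(logit(p)))   # back to 0.90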
An Example
Suppose that we are working with some doctors on heart attack patients. The
dependent variable is whether the patient has had a second heart attack within 1 year
(yes = 1). We have two independent variables. One is whether the patient completed a treatment consisting of anger control practices (yes = 1). The other IV is a score on a
trait anxiety scale (a higher score means more anxious).
Our data:
Note that half of our patients have had a second heart attack. Knowing nothing else
about a patient, and following the best in current medical practice, we would flip a
coin to predict whether they will have a second attack within 1 year. According to our
correlation coefficients, those in the anger treatment group are less likely to have
another attack, but the result is not significant. Greater anxiety is associated with a
higher probability of another attack, and the result is significant (according to r).
Now let's look at the logistic regression, for the moment examining the treatment of
anger by itself, ignoring the anxiety test scores. SAS prints this:
Response Levels: 2
Number of Observations: 20

Response Profile

Ordered Value   Value   Count
1               0       10
2               1       10
SAS tells us what it understands us to model, including the name of the DV, and its
distribution.
Then we calculate probabilities with and without including the treatment variable.
Criterion   Intercept Only   Intercept and Covariates   Chi-Square for Covariates
-2 LOG L    27.726           25.878                     1.848 with 1 df (p=.17)
The computer calculates the likelihood of the data. Because there are equal numbers
of people in the two groups, the probability of group membership initially (without
considering anger treatment) is .50 for each person. Because the people are
independent, the probability of the entire set of people is .50^20, a very small number.
Because the number is so small, it is customary to first take the natural log of the
probability and then multiply the result by -2. The latter step makes the result positive.
The statistic -2LogL (minus 2 times the log of the likelihood) is a badness-of-fit
indicator, that is, large numbers mean poor fit of the model to the data. SAS prints the
result as -2 LOG L. For the initial model (intercept only), our result is the value
27.726. This is a baseline number indicating model fit. This number has no direct
analog in linear regression. It is roughly analogous to generating some random
numbers and finding R2 for these numbers as a baseline measure of fit in ordinary
linear regression. By including a term for treatment, the loss function reduces to
25.878, a difference of 1.848, shown in the chi-square column. The difference
between the two values of -2LogL is known as the likelihood ratio test.
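The baseline value of 27.726 follows directly from the arithmetic described above, and the chi-square in the printout is just the drop in -2 Log L.

import math

# Intercept-only model: each of the 20 patients is assigned probability .50
n = 20
likelihood = 0.50 ** n                 # probability of the whole sample, a very small number
minus_2_log_l = -2 * math.log(likelihood)
print(round(minus_2_log_l, 3))         # 27.726

# Adding the treatment term reduces -2 Log L to 25.878
chi_square = 27.726 - 25.878
print(round(chi_square, 3))            # 1.848, with 1 df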
When taken from large samples, the difference between two values of -2LogL is
distributed as chi-square:

χ² = (-2LogL for the restricted model) - (-2LogL for the full model) = -2 ln( L_restricted / L_full )

This says that the -2LogL for a restricted (smaller) model minus the -2LogL for a full (larger) model is the same as -2 times the log of the ratio of the two likelihoods, which is distributed as chi-square. The full or larger model has all the parameters of interest in
it. The restricted model is said to be nested in the larger model. The restricted model has one or more of the parameters in the full model restricted to some value (usually zero). The
parameters in the nested model must be a proper subset of the parameters in the full
model. For example, suppose we have two IVs, one categorical and one continuous,
and we are looking at an ATI design. A full model could have included terms for the
continuous variable, the categorical variable and their interaction (3 terms). Restricted
models could delete the interaction or one or more main effects (e.g., we could have a
model with only the categorical variable). A nested model cannot have as a single IV,
some other categorical or continuous variable not contained in the full model. If it
does, then it is no longer nested, and we cannot compare the two values of -2LogL to
get a chi-square value. The chi-square is used to statistically test whether including a
variable reduces the badness-of-fit measure. This is analogous to producing an increment
in R-square in hierarchical regression. If chi-square is significant, the variable is
considered to be a significant predictor in the equation, analogous to the significance
of the b weight in simultaneous regression.
For our example with anger treatment only, SAS produces the following:
The intercept is the value of a, in this case -.5596. As usual, we are not terribly
interested in whether a is equal to zero. The value of b given for Anger Treatment is
1.2528. The chi-square associated with this b is not significant, just as the chi-square
for covariates was not significant. Therefore we cannot reject the hypothesis that b is
zero in the population. Our equation can be written either:
Logit(P) = -.5596 + 1.2528X

or

P = e^(-.5596 + 1.2528X) / ( 1 + e^(-.5596 + 1.2528X) )
Now the odds for another group would also be P/(1-P) for that group. Suppose we
arrange our data in the following way:
Anger Treatment
Heart Attack Yes (1) No (0) Total
Yes (1) 3 (a) 7 (b) 10 (a+b)
No (0) 6 (c) 4 (d) 10 (c+d)
Total 9 (a+c) 11 (b+d) 20 (a+b+c+d)
Now we can compute the odds of having a heart attack for the treatment group and the
no treatment group. For the treatment group, the odds are 3/6 = 1/2. The probability of
a heart attack is 3/(3+6) = 3/9 = .33. The odds from this probability are .33/(1-.33)
= .33/.66 = 1/2. The odds for the no treatment group are 7/4 or 1.75. The odds ratio is
calculated to compare the odds across groups.
If the odds are the same across groups, the odds ratio (OR) will be 1.0. If not, the OR
will be larger or smaller than one. People like to see the ratio phrased in the larger direction. In our case, this would be 1.75/.5 or 1.75*2 = 3.50.
Now if we go back up to the last column of the printout where it says odds ratio in the treatment column, you will see that the odds ratio is 3.50, which is what we got by finding the odds ratio for the odds from the two treatment conditions. It also happens that e^1.2528 = 3.50. Note that the exponent is our value of b for the logistic curve.
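The agreement between the 2 x 2 table and the regression output can be checked in a couple of lines.

import math

# Counts from the 2 x 2 table above
a, b, c, d = 3, 7, 6, 4   # a, c = treatment column; b, d = no-treatment column

odds_treatment = a / c        # 3/6 = 0.5
odds_no_treatment = b / d     # 7/4 = 1.75
odds_ratio = odds_no_treatment / odds_treatment
print(odds_ratio)             # 3.5

# The b weight for anger treatment reproduces the same ratio
b_weight = 1.2528
print(round(math.exp(b_weight), 2))   # about 3.50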