8/25/22
Lecture 10:
Simple Linear Regression
and Correlation
Outline
• Simple linear regression model
• Required conditions for the model
• Model assessment
• Confidence and prediction intervals
• Model diagnostics
• Outliers
Introduction
• In this lecture, we employ regression analysis
to examine the relationship between
quantitative variables.
• The technique is used to predict the value of
one variable (the dependent variable, y)
based on the values of other variables
(the independent variables x1, x2, …, xk).
Predicting home prices
Recent family home sales in San Antonio
provided the data displayed (partly) in the next
slide (San Antonio Realty Watch website,
November, 2008). We wish to predict the home
prices using the square footage.
Part of the data
What is the dependent variable?
What is the independent variable?
Linear relationship?
The model
• The first-order linear model, or simple linear
regression model:

  y = b0 + b1x + e

  y = dependent variable
  x = independent variable
  b0 = y-intercept
  b1 = slope of the line (rise/run)
  e = error variable

b0 and b1 are unknown, therefore they are
estimated from the data.
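To make the model concrete, here is a minimal simulation sketch in Python (the lecture's computing is done in R, so this is an illustrative translation; the values chosen for b0, b1 and the error standard deviation are arbitrary, not from the slides):

```python
import random

random.seed(1)  # reproducible illustration

# Illustrative parameter values (not from the slides)
b0, b1, sigma_e = 2.0, 0.5, 1.0

# Each observation is the line value plus a normal error e ~ N(0, sigma_e)
xs = list(range(1, 21))
ys = [b0 + b1 * x + random.gauss(0, sigma_e) for x in xs]
```

Plotting ys against xs would show points scattered around the line y = 2.0 + 0.5x, which is exactly the picture the model describes.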
Least squares method
• Estimates of the coefficients are determined by
– drawing a sample from the population of interest
– calculating sample statistics
– fitting a straight line through the data.
The question is: which straight line fits best?
[Scatter plot of the sample points with a candidate line drawn through them.]
Least squares method
The best line is the one that minimises the sum of squared
vertical differences between the points and the line.
Sum of squared differences for the line y = x: (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Sum of squared differences for the line y = 2.5: (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
Let us compare two lines through the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2);
the second line is horizontal. The smaller the sum of squared differences,
the better the fit of the line to the data.
[Scatter plot of the four points with the two candidate lines.]
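The two sums above can be verified with a short Python sketch (illustrative; the lecture uses R):

```python
# The four sample points from the slide
points = [(1, 2.0), (2, 4.0), (3, 1.5), (4, 3.2)]

def sum_sq_diff(predict, pts):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - predict(x)) ** 2 for x, y in pts)

ssd_sloped = sum_sq_diff(lambda x: x, points)    # the line y = x
ssd_flat = sum_sq_diff(lambda x: 2.5, points)    # the horizontal line y = 2.5

print(round(ssd_sloped, 2))  # 7.89
print(round(ssd_flat, 2))    # 3.99
```

The horizontal line wins here (3.99 < 7.89), but the least squares method searches over all possible lines, not just these two.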
3
8/25/22
Least squares method
The least squares line is ŷ = b0 + b1x, where

  b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  b0 = ȳ − b1x̄
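A sketch of these formulas in Python, applied to the four toy points from the previous slide (illustrative; the lecture's least squares fits are computed in R):

```python
# Toy data from the line-fitting comparison slide
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# b1 = S_xy / S_xx, then b0 = y_bar - b1 * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

print(round(b1, 2), round(b0, 2))  # 0.11 2.4
```

For these four points the least squares line works out to ŷ ≈ 2.4 + 0.11x.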
R output for linear model
Interpreting estimated parameters
Error variable: Required conditions
• The error e is a critical part of the regression model.
• Five requirements involving the distribution of e must
be satisfied:
– The mean of e is zero: E(e) = 0.
– The standard deviation of e is a constant (s e) for all
values of x.
– The errors are independent.
– The errors are independent of the independent
variable x.
– The probability distribution of e is normal.
Error variable: Required conditions
[Diagram: the conditional distributions of y at x1, x2 and x3 are normal curves
with means µ1 = b0 + b1x1, µ2 = b0 + b1x2 and µ3 = b0 + b1x3.]
The standard deviation remains constant, but the mean value changes with x.
From the first three assumptions we have: y is normally distributed with
mean E(y) = b0 + b1x and a constant standard deviation se.
Assessing the model
• The least squares method will produce a
regression line whether or not there is a linear
relationship between x and y.
• Consequently, it is important to assess how
well the linear model fits the data.
• Several methods are used to assess the
model:
– using descriptive measurements
– testing and/or estimating the coefficients
Sum of squares for errors
– The sum of squares for errors, SSE = Σ(yi − ŷi)², is the sum of the
squared vertical differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data.
– This statistic plays a role in every statistical technique we employ
to assess the model.
Standard error of estimate
– The mean error is equal to zero.
– If σe is small, the errors tend to be close to zero (close to the mean
error), and the model fits the data well.
– Therefore we can use an estimate of σe as a measure of the suitability
of using a linear model.
– An unbiased estimator of σe² is se² = SSE / (n − 2); the standard error
of estimate is se = √(SSE / (n − 2)).
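Continuing the four-point toy example in Python (illustrative; b0 = 2.4 and b1 = 0.11 are the least squares estimates for those points, not the home-price model):

```python
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)
b0, b1 = 2.4, 0.11  # least squares estimates for these four points

# SSE: sum of squared residuals about the fitted line
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)

# Unbiased estimator of the error variance, and the standard error of estimate
s_e2 = sse / (n - 2)
s_e = sqrt(s_e2)

print(round(sse, 3), round(s_e, 2))  # 3.807 1.38
```

Judging whether se = 1.38 is "small" requires a scale of comparison; that is why the home-price example compares the standard error to the sample mean price.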
Home Prices example (cont.)
Read the standard error of estimate from the R
output and describe what it tells you about the
model fit.
Note: We can find the sample mean price to be
$120,270
Coefficient of determination
• When we want to measure the strength of
the linear relationship, we use the
coefficient of determination.
Coefficient of determination
[Diagram: the overall variability in y is explained in part by the
regression model; the rest remains, in part, unexplained (the error).]
Coefficient of determination
[Diagram: two data points (x1, y1) and (x2, y2) of a certain sample are
shown, with their deviations from ȳ split by the regression line.]
Total variation in y = variation explained by the regression line
+ unexplained variation (error)
Coefficient of determination
• R² measures the proportion of the variation
in y that is explained by the variation in x:

  R² = SSR / SST, where SST = variation in y = SSR + SSE.

• R² takes on any value between zero and one.
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between x and y.
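A Python sketch of the decomposition SST = SSR + SSE and of R², again on the four illustrative toy points rather than the home-price data:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
b0, b1 = 2.4, 0.11  # least squares fit for these four points
y_bar = sum(ys) / len(ys)

sst = sum((y - y_bar) ** 2 for y in ys)                       # total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # unexplained
ssr = sst - sse                                               # explained by the line

r2 = ssr / sst
print(round(r2, 3))  # 0.016
```

For the toy points R² is tiny: almost none of the variation in y is explained by x, which matches the weak linear pattern in their scatter.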
Home prices example (cont.)
Find the coefficient of determination. What does
this statistic tell you about the model?
57% of the variation in the home prices is explained by
the variation in square footage. The rest (43%) remains
unexplained by this model.
Testing the slope
When no relationship exists between two variables,
the regression line should be horizontal.
[Two scatter plots with fitted lines: a sloped line (relationship)
and a horizontal line (no relationship).]
Relationship: different inputs (x) yield different outputs (y);
the slope is not equal to zero.
No relationship: different inputs (x) yield the same output (y);
the slope is equal to zero.
Testing the slope
H0: b1 = 0
HA: b1 ≠ 0 (or < 0, or > 0)
– The test statistic is

  t = b1 / s_b1, where s_b1 = se / √((n − 1)s_x²)

  is the standard error of b1.
– If the error variable is normally distributed, the
statistic is Student t-distributed with d.f. = n − 2.
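A Python sketch of the slope test on the four illustrative toy points (the lecture reads this statistic from the R output instead):

```python
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)
b0, b1 = 2.4, 0.11  # least squares fit for these four points

# Standard error of estimate
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_e = sqrt(sse / (n - 2))

# Standard error of b1: s_b1 = s_e / sqrt(S_xx), with S_xx = (n - 1) * s_x^2
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_b1 = s_e / sqrt(s_xx)

t = b1 / s_b1
print(round(t, 2))  # 0.18
```

With d.f. = 2, a t statistic of 0.18 is nowhere near any usual critical value, so for the toy points we would not reject H0: b1 = 0.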
Testing the slope using the R output
Coefficient of correlation
• The coefficient of correlation is used to measure the
strength of a linear association between two variables.
• The coefficient values range between –1 and 1.
– If r = –1 (perfect negative linear association) or r =
+1 (perfect positive linear association): every point
falls on the regression line.
– If r = 0: there is no linear association.
• The coefficient can be used to test for linear
relationships between two variables.
Testing the coefficient of correlation
– When there is no linear relationship between two
variables, ρ = 0.
– The hypotheses are:
  H0: ρ = 0
  HA: ρ ≠ 0
– The test statistic is

  t = r √((n − 2) / (1 − r²))

The statistic is Student t-distributed with d.f. = n − 2,
provided the variables are bivariate normally distributed.
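A Python sketch of the correlation test on the same four toy points. Note that for simple regression this t statistic coincides with the slope-test statistic:

```python
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)          # sample coefficient of correlation
t = r * sqrt((n - 2) / (1 - r ** 2))  # test statistic, d.f. = n - 2

print(round(r, 3), round(t, 2))  # 0.125 0.18
```

The weak correlation (r ≈ 0.125) yields the same t = 0.18 as the slope test, so the two tests reach the same conclusion.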
Home prices example (cont.)
Test the coefficient of correlation to determine if a
linear relationship exists. R output is provided below:
Using the Regression equation
• Before using the regression model, we need
to assess how well it fits the data.
• If we are satisfied with how well the model fits
the data and the model assumptions are
satisfied, we can use it to make predictions for
y.
Example
– Predict the price of a home with square footage =
1500
Prediction interval and confidence
interval
§ Two intervals can be used to discover how closely
the predicted value will match the true value of y:
• prediction interval – for a particular value of y:
  ŷ ± t_{α/2} se √(1 + 1/n + (xg − x̄)² / ((n − 1)s_x²))
• confidence interval – for the expected value of y:
  ŷ ± t_{α/2} se √(1/n + (xg − x̄)² / ((n − 1)s_x²))
The prediction interval is wider than the confidence interval.
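A Python sketch of both interval half-widths on the four toy points, with xg chosen at x̄ for illustration; the critical value t_{0.025} with d.f. = 2 is taken as 4.303 from a t table:

```python
from math import sqrt

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.5, 3.2]
n = len(xs)
b0, b1 = 2.4, 0.11  # least squares fit for these four points

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_e = sqrt(sse / (n - 2))
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)

t_crit = 4.303  # t_{0.025}, d.f. = 2 (from a t table)
x_g = 2.5       # prediction point, here chosen at x_bar for illustration

core = 1 / n + (x_g - x_bar) ** 2 / s_xx
half_ci = t_crit * s_e * sqrt(core)      # confidence interval half-width
half_pi = t_crit * s_e * sqrt(1 + core)  # prediction interval half-width

print(round(half_ci, 2), round(half_pi, 2))
```

Because the prediction interval's square root contains the extra "1 +", its half-width always exceeds the confidence interval's at the same xg.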
Home Prices example (cont.)
– Provide a 95% prediction interval estimate for
the price of a home with square footage = 1500
– R output:
Home Prices example (cont.)
– Provide a 95% confidence interval estimate for
the mean price of homes with square footage =
1500
– R output:
The effect of the given value of x
on the intervals
– As xg moves away from x̄, the interval becomes
longer. That is, the shortest interval is found at x̄.
[Diagram: confidence intervals at three values of xg,
widening as xg moves away from x̄.]
Regression diagnostics
• The three important conditions required for
the validity of the regression analysis are:
– The error variable is normally distributed.
– The error variance is constant for all values of x.
– The errors are independent of each other.
• How can we diagnose violations of these
conditions?
Regression diagnostics
• Examining the residuals (or standardized
residuals), we can identify violations of the
required conditions.
• For the details → self-study (read the
textbooks for guidelines). We will give some
examples in the next few slides.
Example: Heteroscedasticity
When the requirement of a constant variance is
violated, we have heteroscedasticity.
[Residual-versus-ŷ plot: the spread of the residuals increases with ŷ.]
Example: Heteroscedasticity
When the requirement of a constant variance is
not violated, we have homoscedasticity.
[Residual-versus-ŷ plot: the spread of the data points does not change much.]
Example: Heteroscedasticity
When the requirement of a constant variance is
not violated, we have homoscedasticity.
[Residual-versus-ŷ plot: the residuals are evenly spread around zero.
As far as the even spread goes, this is a much better situation.]
Example: Non-independence of the
error variable
• Data collected over time constitute a time series.
• If the errors are independent, no pattern should be
observed when the residuals are examined over time.
• When a pattern is detected, the errors are said
to be auto-correlated.
• Autocorrelation can be detected by graphing
the residuals against time.
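Beyond graphing, one standard numerical companion to the residual-versus-time plot is the Durbin–Watson statistic; it is not shown in the slides, so the sketch below is an addition, with made-up residual sequences for illustration:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no autocorrelation,
    near 0 positive autocorrelation (runs), near 4 negative (oscillation)."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

runs = [1.0, 1.2, 0.8, -1.1, -0.9, -1.3]        # runs of same-signed residuals
oscillating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every period

print(round(durbin_watson(runs), 2))         # well below 2
print(round(durbin_watson(oscillating), 2))  # well above 2
```

The "runs" pattern drives the statistic toward 0 and the oscillating pattern toward 4, matching the two plots described on the next slide.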
Patterns in the appearance of the residuals
over time indicate that autocorrelation exists.
[Two residual-versus-time plots. Left: note the runs of positive residuals,
replaced by runs of negative residuals. Right: note the oscillating
behaviour of the residuals around zero.]
Outliers
• An outlier is an observation that is unusually
small or large.
• Several possibilities need to be investigated when
an outlier is observed:
– There was an error in recording the value.
– The point does not belong in the sample.
– The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect that an observation is an
outlier if its |standardized residual| > 2.
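A small Python sketch of the standardized-residual rule. This is a simplification: it standardizes by the standard error of estimate only, whereas statistical software also adjusts each residual for its leverage; the function name and residual values are hypothetical illustrations.

```python
from math import sqrt

def flag_outliers(residuals, threshold=2.0):
    """Indices of observations whose |standardized residual| exceeds threshold.

    Simplified: residuals are divided by the standard error of estimate only.
    """
    n = len(residuals)
    s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [i for i, e in enumerate(residuals) if abs(e / s_e) > threshold]

# Illustrative residuals: observation 3 is unusually large
res = [0.4, -0.6, 0.3, 9.0, -0.5, 0.2, -0.3, 0.1]
print(flag_outliers(res))  # [3]
```

A flagged observation is only a suspect: the next step is to investigate the three possibilities listed above before deciding what to do with it.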
An outlier and an influential observation
[Scatter plot: the outlier causes a shift in the regression line,
so some outliers may be very influential.]
Procedure for simple linear regression
analysis
– Develop a model that has a theoretical basis.
– Gather data for the two variables in the model.
– Draw the scatter diagram to determine whether a linear
model appears to be appropriate.
– Check the required conditions for the errors.
– Assess the model fit.
– If the model fits the data and the assumptions are
satisfied, use the regression equation.
Summary
• Simple linear regression model
• Required conditions for the model
• Model assessment
• Confidence and prediction intervals
• Model diagnostics
• Outliers