
The Simple Regression Model

Model:
• DGP: $Y_i = \beta_0 + \beta_1 X_i + u_i$ where the conditional distribution of the random error $u_i$ given $X_i$ is iid$(0, \sigma^2)$ or iid $N(0, \sigma^2)$.
• Y is the dependent or left-hand-side variable.
• Y is linearly related to X, the regressor, independent variable, explanatory variable or right-hand-side variable.
• The sample or data is a collection of N independent observations on
X and Y, which is a reasonable assumption with cross section data.
• The data are non-experimental so X and Y are random variables,
variables whose values are subject to chance.
• Many textbooks treat the X’s as “fixed”, but this is not very realistic
in economics since experimental data are very rare.
• However, treating the X’s as fixed is a lot easier since there is no
need to distinguish between conditional and unconditional
expectations!
• Assumptions and results in the fixed X case can be interpreted as
statements conditional on the X’s, allowing us to dispense with “|X”
terms etc.
• When deriving the properties of the OLS estimators we will
consider the classical linear regression model – fixed X’s and normal
random error terms – and then sketch out the extension to the case
where the X’s are random and the random errors are not necessarily
normally distributed.
The Data Generation Process
• DGP: $Y_i = \beta_0 + \beta_1 X_i + u_i$, $\quad u_i \mid X_i \sim$ iid$(0, \sigma^2)$, $\quad i = 1, \ldots, N$.
• In some cases, we assume that $u_i \mid X_i \sim$ iid $N(0, \sigma^2)$. In large samples, the OLS estimators will be normally distributed in any case by the CLT (central limit theorem).
• The $\beta$'s are unknown parameters to be estimated and the $u_i$'s are unobserved random error terms with certain properties set out below.
• The part of the r.h.s. involving X, $\beta_0 + \beta_1 X_i$, is the regression or regression function and the $\beta$'s are often called the regression coefficients.
• The linearity assumption is not very restrictive. Y and X could be
transformations of the variables in question.
• Interpretation: $E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i$ and $dE(Y_i \mid X_i)/dX_i = \beta_1$, etc.

$$Y_i = \beta_0 + \beta_1 X_i + u_i = \text{Conditional Mean} + \text{Random Error}$$
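As a concrete illustration (a minimal simulation sketch, not part of the original notes), the DGP can be generated as follows, using the parameter values that appear in the figures below ($\beta_0 = 1$, $\beta_1 = 2$, $X_i, u_i \sim$ iid $N(0,1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
beta0, beta1, sigma = 1.0, 2.0, 1.0   # true parameters of the DGP

X = rng.normal(0.0, 1.0, N)           # regressor, iid N(0, 1)
u = rng.normal(0.0, sigma, N)         # random error, iid N(0, sigma^2)
Y = beta0 + beta1 * X + u             # conditional mean + random error
```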

• Figures illustrate the DGP.


[Fig. 1: The Conditional Mean of Y. The figure plots $E(Y \mid X) = \beta_0 + \beta_1 X$ together with one observation $(X_i, Y_i)$, its conditional mean $E(Y_i \mid X_i)$ and its error $u_i$, for the DGP $Y_i = \beta_0 + \beta_1 X_i + u_i$ with $X_i, u_i \sim$ iid $N(0,1)$, $\beta_0 = 1$, $\beta_1 = 2$.]


[Fig. 2: The Conditional Pdf of Y (same DGP as in Fig. 1).]


• What about the random error term $u_i$? The conditional distribution of the random error $u_i$ given $X_i$ is assumed to be iid$(0, \sigma^2)$.
• The conditional iid assumption is a reasonable assumption for cross section data.
• The assumption that $E(u_i \mid X_i) = 0$ means that X is exogenous so there is no feedback from Y to X.
• The zero conditional mean assumption is innocuous when there is an intercept term $\beta_0$ in the regression (as there almost always is).
• Two implications of the zero conditional mean assumption are:
$$E(u_i) = E\big[E(u_i \mid X_i)\big] = 0$$
$$E(u_i X_i) = E\big[E(u_i X_i \mid X_i)\big] = E\big[X_i E(u_i \mid X_i)\big] = 0$$
using the "Law of Total Expectations".
• Thus, the unconditional mean of the random error term is zero and the random error term u is orthogonal to the explanatory variable X (zero cross moment $E(u_i X_i) = 0$). In turn this implies that $Corr(u_i, X_i) = 0$ since $Cov(u_i, X_i) = E(u_i X_i) = 0$.
• Finally, the assumption that $V(u_i \mid X_i) = \sigma^2$, a constant, is the homoscedasticity assumption, which is a reasonable initial assumption.
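A quick numerical check of the two implications above on the simulated sample (a sketch; the variables `X` and `u` come from the earlier simulation snippet):

```python
# Sample analogues of E(u_i) = 0 and E(X_i u_i) = 0; with N = 100 draws
# they are close to, but not exactly, zero.
print("mean of u:   ", u.mean())
print("mean of X*u: ", (X * u).mean())
print("corr(X, u):  ", np.corrcoef(X, u)[0, 1])
```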

Some more terminology: the residual $\hat u_i$, and the actual and fitted values of $Y_i$:
$$Y_i = \hat\beta_0 + \hat\beta_1 X_i + \hat u_i$$
$$\text{Actual } Y_i = \text{Fitted } Y_i + \text{Residual} = \hat Y_i + \hat u_i$$
[Fig. 3: The OLS Regression Line and Residuals. The figure plots the fitted line $\hat Y = \hat\beta_0 + \hat\beta_1 X$ and shows a residual $\hat u_i = Y_i - \hat Y_i$ (Actual $Y_i$ − Fitted $Y_i$).]


[Fig. 4: The Conditional Mean and OLS Regression Line. The figure plots the conditional mean $E(Y_i \mid X_i) = 1.0 + 2.0 X_i$ and the fitted OLS line $\hat Y_i = 0.97 + 2.06 X_i$.]


Overview

• The least squares principle and the OLS estimators of $\beta_0$ and $\beta_1$. Choose $\hat\beta_0$ and $\hat\beta_1$ to minimise the residual sum of squares.
• OLS first order conditions:
$$\sum_i \hat u_i = 0 \quad\text{and}\quad \sum_i X_i \hat u_i = 0$$
• The f.o.c.'s are sample analogs of the population moments:
$$\frac{1}{N}\sum_i \hat u_i = 0 \quad\text{and}\quad \frac{1}{N}\sum_i X_i \hat u_i = 0$$
$$E(u_i) = 0 \quad\text{and}\quad E(X_i u_i) = 0$$
• The OLS estimators are:
$$\hat\beta_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2} \quad\text{and}\quad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
where $y_i = Y_i - \bar Y$ and $x_i = X_i - \bar X$ are deviations from the sample means. These are also sample analogs of population moments.
• The OLS, ML and method of moments estimators of $\beta_0$ and $\beta_1$ are all the same.
• The distribution of $\hat\beta_1$ – derivation and intuition.
$$\hat\beta_1 = \beta_1 + \frac{\sum_i x_i u_i}{\sum_i x_i^2} \quad\Rightarrow\quad \hat\beta_1 \mid X \sim AN\!\left(\beta_1,\ \frac{\sigma^2}{\sum_i x_i^2}\right)$$
This is the distribution conditional on X.
• The Gauss-Markov theorem – the OLS estimators of $\beta_0$ and $\beta_1$ are BLUE, i.e. best or minimum variance within the class of linear, unbiased estimators.
• If the random errors (the ui 's ) are assumed to be normally
distributed, then the OLS estimators are best within the class of
unbiased estimators.
• This stronger result is just an application of the Cramer-Rao
theorem, which says that the inverse of the information matrix is a
lower bound for the variance of an unbiased estimator.

• An unbiased estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{1}{N-2}\sum_i \hat u_i^2$. The ML estimator, which does not adjust for degrees of freedom, is consistent.
• Inference using $\hat\beta_1$ and $\hat\sigma^2$ – confidence intervals and hypothesis tests – is standard.
• $R^2$ and all that. The decomposition TSS = ESS + RSS and the $R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$ measure of goodness of fit, which ranges between 0 and 1. This is just the fraction of the variation in Y explained by X.
The OLS Estimators

The OLS estimators minimize the residual sum of squares:
$$\hat\beta_0, \hat\beta_1 = \arg\min_{b_0, b_1} \sum_{i=1}^{N} (Y_i - b_0 - b_1 X_i)^2$$

The first order conditions are:
$$-2\sum_i \big(Y_i - \hat\beta_0 - \hat\beta_1 X_i\big) = 0$$
$$-2\sum_i X_i \big(Y_i - \hat\beta_0 - \hat\beta_1 X_i\big) = 0$$
Note that these may be written compactly as $\sum_i \hat u_i = \sum_i X_i \hat u_i = 0$, where $\hat u_i = Y_i - \hat Y_i$ is the residual.
[Fig. 5: The Residual Sum of Squares. The figure plots the surface $RSS(b_0, b_1) = \sum_i (Y_i - b_0 - b_1 X_i)^2$ for a sample of $N = 100$ observations from the DGP $Y_i = 1 + 2X_i + u_i$ with $X_i$ and $u_i$ iid standard normal.]
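A numerical counterpart of the figure (a sketch, not from the notes): minimise the RSS directly over $(b_0, b_1)$ on the simulated sample from the earlier snippets and compare the minimiser with the closed-form OLS estimates derived below. The use of `scipy.optimize.minimize` is just one convenient choice.

```python
from scipy.optimize import minimize

def rss(b, X, Y):
    """Residual sum of squares RSS(b0, b1) = sum_i (Y_i - b0 - b1*X_i)^2."""
    b0, b1 = b
    return ((Y - b0 - b1 * X) ** 2).sum()

# Numerical minimisation of the RSS surface shown in Fig. 5
res = minimize(rss, x0=[0.0, 0.0], args=(X, Y))

# Closed-form OLS estimates for comparison
x, y = X - X.mean(), Y - Y.mean()
b1_hat = (x * y).sum() / (x * x).sum()
b0_hat = Y.mean() - b1_hat * X.mean()

print("numerical arg min:", res.x)
print("closed form:      ", b0_hat, b1_hat)
```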


We have:
$$\sum_i Y_i - N\hat\beta_0 - \hat\beta_1 \sum_i X_i = 0$$
$$\sum_i X_i Y_i - \hat\beta_0 \sum_i X_i - \hat\beta_1 \sum_i X_i^2 = 0$$
The first equation implies that:
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
Substitute this into the second equation:
$$\sum_i X_i Y_i - \big(\bar Y - \hat\beta_1 \bar X\big)\sum_i X_i - \hat\beta_1 \sum_i X_i^2 = 0$$
Now rewrite $\sum_i X_i$ as $N\bar X$ to obtain:
$$\sum_i X_i Y_i - \big(\bar Y - \hat\beta_1 \bar X\big)N\bar X - \hat\beta_1 \sum_i X_i^2 = 0$$
This equation may be rearranged as:
$$\Big(\sum_i X_i Y_i - N\bar X \bar Y\Big) - \hat\beta_1\Big(\sum_i X_i^2 - N\bar X^2\Big) = 0$$
Thus:
$$\hat\beta_1 = \frac{\sum_i X_i Y_i - N\bar X \bar Y}{\sum_i X_i^2 - N\bar X^2} \equiv \frac{\sum_i x_i y_i}{\sum_i x_i^2}$$
where we have introduced the notation $y_i = Y_i - \bar Y$ and $x_i = X_i - \bar X$ to simplify the expression for $\hat\beta_1$. Small $x_i$ and $y_i$ denote deviations about the sample means.
It is straightforward to show that the sum of squared deviations of X about its sample mean is:
$$\sum_i x_i^2 = \sum_i (X_i - \bar X)^2 = \sum_i X_i^2 - N\bar X^2$$
which is the denominator in the expression for $\hat\beta_1$.

Likewise, the sum of the products of the deviations of X and Y about their respective sample means is:
$$\sum_i x_i y_i = \sum_i (X_i - \bar X)(Y_i - \bar Y) = \sum_i X_i Y_i - N\bar X \bar Y$$
which is the numerator in this expression.


To summarize, we showed that the OLS estimators are:
$$\hat\beta_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2} \quad\text{and}\quad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$$
where $y_i = Y_i - \bar Y$ and $x_i = X_i - \bar X$ are deviations from the sample means. The formula for the OLS estimator for $\beta_0$ says that the OLS regression line goes through the sample means of X and Y.
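A short check of these properties on the simulated sample (a sketch; `b0_hat` and `b1_hat` are the closed-form estimates from the previous snippet):

```python
Y_hat = b0_hat + b1_hat * X        # fitted values
u_hat = Y - Y_hat                  # residuals

# The two OLS first order conditions: residuals sum to zero and are
# orthogonal to X (up to floating-point error).
print(u_hat.sum(), (X * u_hat).sum())

# The OLS regression line passes through the point of sample means (X_bar, Y_bar).
print(np.isclose(b0_hat + b1_hat * X.mean(), Y.mean()))
```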

The formulae for the $\hat\beta$'s are also sample analogs of population moments, as the next section shows.
Analogy between Population and Sample Moments*

The formulae for the $\hat\beta$'s are sample analogs of population moments. For simplicity assume that the X's are iid so $EX_i = EX$. Since $E(u_i \mid X_i) = 0$ implies that $Eu_i = 0$, it follows that $EY = \beta_0 + \beta_1 EX$. This means that $\beta_0 = EY - \beta_1 EX$, with sample analog $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$.

Now consider $\beta_1$. Subtracting $EY = \beta_0 + \beta_1 EX$ from $Y_i = \beta_0 + \beta_1 X_i + u_i$ yields $Y_i - EY = \beta_1(X_i - EX) + u_i$. Multiply this by $(X_i - EX)$ and take expectations:
$$(X_i - EX)(Y_i - EY) = \beta_1 (X_i - EX)^2 + (X_i - EX)u_i$$
$$\Rightarrow\quad E(Y_i - EY)(X_i - EX) = \beta_1 E(X_i - EX)^2 \qquad \because\ Eu_i = EX_i u_i = 0$$
This says that $Cov(X, Y) = \beta_1 Var(X)$, so $\beta_1 = \dfrac{Cov(X, Y)}{Var(X)}$. The sample analog of this is
$$\hat\beta_1 = \frac{\frac{1}{N-1}\sum_i x_i y_i}{\frac{1}{N-1}\sum_i x_i^2} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}.$$
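A one-line check of this analogy (a sketch; numpy's `cov` and `var` use the same $N-1$ normalisation, which cancels in the ratio):

```python
# Sample Cov(X, Y) / Var(X) reproduces the OLS slope estimate b1_hat.
print(np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1))
```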
The Distribution of $\hat\beta_1$ in the Classical Regression Model

We showed that $\hat\beta_1 = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2}$. Now substitute $\beta_1 x_i + u_i - \bar u$ for $y_i$:$^1$
$$\hat\beta_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2} = \frac{\sum_i x_i(\beta_1 x_i + u_i - \bar u)}{\sum_i x_i^2} = \frac{\beta_1 \sum_i x_i^2 + \sum_i x_i u_i - \bar u \sum_i x_i}{\sum_i x_i^2}$$
$$\Rightarrow\quad \hat\beta_1 = \beta_1 + \frac{\sum_i x_i u_i}{\sum_i x_i^2}$$
using the result that $\bar u$ is a constant and $\sum_i x_i = \sum_i (X_i - \bar X) = 0$.

$^1$ $y_i = Y_i - \bar Y = \beta_1(X_i - \bar X) + u_i - \bar u \;\Rightarrow\; y_i = \beta_1 x_i + u_i - \bar u$.

It is useful to introduce a little bit of extra notation. Let the "weight" $w_i = x_i \big/ \sum_j x_j^2$, so:
$$\hat\beta_1 = \beta_1 + \sum_i w_i u_i$$
which says that the OLS estimator $\hat\beta_1$ equals the true value $\beta_1$ plus a weighted sum of the random errors. It is easy to check that $\sum_i w_i = 0$ and $\sum_i w_i^2 = 1\big/\sum_i x_i^2$. Using these results, it is easy to find the distribution of $\hat\beta_1$, since the "weights" are fixed in the classical regression model (fixed X's and normally distributed u's).
First, $\hat\beta_1$ is unbiased since the mean $E\hat\beta_1 = \beta_1$:
$$E\hat\beta_1 = \beta_1 + E\Big[\sum_i w_i u_i\Big] = \beta_1 + \sum_i w_i E(u_i) = \beta_1 + \sum_i w_i \cdot 0 = \beta_1$$
The variance of $\hat\beta_1$ is $\sigma^2\big/\sum_i x_i^2$:
$$V\big(\hat\beta_1\big) = V\Big(\beta_1 + \sum_i w_i u_i\Big) = V\Big(\sum_i w_i u_i\Big) = \sum_i w_i^2 V(u_i) = \sigma^2 \sum_i w_i^2 = \frac{\sigma^2}{\sum_i x_i^2}$$

Question: What results have we used in the derivation?


Note the intuition: the more variation there is in X, the more precise is the estimate of the slope $\beta_1$. Illustrate this!
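One way to illustrate this (a simulation sketch under the classical assumptions, not part of the original notes): hold $N$ and $\sigma^2$ fixed, scale up the spread of the fixed X's, and compare the Monte Carlo standard deviation of $\hat\beta_1$ across replications.

```python
def slope_sd(x_scale, n_reps=2000, N=50, beta0=1.0, beta1=2.0, sigma=1.0):
    """Monte Carlo standard deviation of the OLS slope for a given spread of X."""
    rng = np.random.default_rng(1)
    X = x_scale * rng.normal(size=N)       # fixed X's, held constant across replications
    x = X - X.mean()
    slopes = []
    for _ in range(n_reps):
        u = rng.normal(0.0, sigma, N)
        Y = beta0 + beta1 * X + u
        slopes.append((x * (Y - Y.mean())).sum() / (x * x).sum())
    return np.std(slopes)

# More variation in X => smaller sampling variance of the slope estimate.
print(slope_sd(x_scale=1.0), slope_sd(x_scale=3.0))
```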

As more data accumulates, the denominator term in the expression for $\hat\beta_1$ gets larger, so the variance of $\hat\beta_1$ gets smaller:
$$\sum_i x_i^2 = N\,\frac{\sum_i x_i^2}{N} = N\hat\sigma_X^2 \to \infty \ \text{ as } N \to \infty$$
Thus $\hat\beta_1$ is consistent since it is unbiased and its variance goes to zero as the sample size N goes to infinity.
When the $u_i$'s are normally distributed then so is $\hat\beta_1$, since it is linear in the $u_i$'s. Thus, in the classical regression model:
$$\hat\beta_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{\sum_i x_i^2}\right)$$

The normality result means that statistical inference is standard. $\sigma^2$ is unknown but can be replaced by a consistent estimate in large samples:
$$\hat\sigma^2 = \frac{\sum_i \hat u_i^2}{N - 2} = \frac{RSS}{d.o.f.}$$
where d.o.f. stands for the degrees of freedom of the regression.
Statistical Inference

The standard error or SE of $\hat\beta_1$ is just the square root of the estimated variance of $\hat\beta_1$:
$$SE\big(\hat\beta_1\big) = \sqrt{\frac{\hat\sigma^2}{\sum_i x_i^2}}$$
In large samples, a 95% confidence interval for $\beta_1$ is
$$\hat\beta_1 \pm 1.96\sqrt{\frac{\hat\sigma^2}{\sum_i x_i^2}}$$
i.e. the coefficient estimate plus or minus approximately two standard errors.
 
Under the null or maintained hypothesis that $\beta_1 = k$, $\hat\beta_1 \sim N\!\left(k,\ \frac{\sigma^2}{\sum_i x_i^2}\right)$, which implies that
$$\big(\hat\beta_1 - k\big)\Big/\sqrt{\frac{\sigma^2}{\sum_i x_i^2}} \sim N(0, 1).$$
We don't know $\sigma^2$ but we can replace it by the consistent estimate $\hat\sigma^2$, since we can show that
$$\big(\hat\beta_1 - k\big)\Big/\sqrt{\frac{\hat\sigma^2}{\sum_i x_i^2}} \sim AN(0, 1)$$
in large samples. Hence the test statistic $\dfrac{\hat\beta_1 - k}{SE\big(\hat\beta_1\big)}$ may be used to test the null hypothesis that $\beta_1 = k$, using tables of the standard normal distribution.
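A sketch of these calculations on the simulated sample (the variables `X`, `Y`, `u_hat` and `b1_hat` are carried over from the earlier snippets):

```python
N = len(Y)
x = X - X.mean()

sigma2_hat = (u_hat ** 2).sum() / (N - 2)      # unbiased estimate of sigma^2: RSS / d.o.f.
se_b1 = np.sqrt(sigma2_hat / (x ** 2).sum())   # standard error of the slope

ci_95 = (b1_hat - 1.96 * se_b1, b1_hat + 1.96 * se_b1)   # large-sample 95% CI for beta1

k = 0.0                                        # hypothesised value under H0: beta1 = k
t_stat = (b1_hat - k) / se_b1                  # compare with standard normal tables
print(ci_95, t_stat)
```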
The Gauss-Markov Theorem

This theorem says that the OLS estimator is BLUE = best (i.e. minimum variance) within the class of linear, unbiased estimators. We shall prove this for the OLS slope estimator in the OLS model in deviations form:
$$y_i = \beta_1 x_i + u_i - \bar u \qquad i = 1, \ldots, N$$
Nothing essential is lost by considering this model.$^2$ The OLS estimator $\hat\beta_1 = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2}$ is clearly linear in the $y_i$'s.

$^2$ In fact, nothing essential is lost by considering the simpler model $y_i = \beta_1 x_i + u_i$.
Without loss of generality, any other linear estimator $\tilde\beta_1$ may be written as the OLS estimator plus a sum of terms which are linear in the $y_i$'s:
$$\tilde\beta_1 = \hat\beta_1 + \sum_i d_i y_i$$
where the $d_i$'s are arbitrary. If $\tilde\beta_1$ is unbiased, $E\tilde\beta_1 = \beta_1$, so let's check this:
$$E\tilde\beta_1 = E\Big[\hat\beta_1 + \sum_i d_i y_i\Big] = E\hat\beta_1 + \sum_i d_i E y_i = \beta_1 + \sum_i d_i(\beta_1 x_i) \qquad \because\ Ey_i = \beta_1 x_i$$
$$= \beta_1 + \beta_1 \sum_i d_i x_i = \beta_1 \quad\text{iff}\quad \sum_i d_i x_i = 0$$
Thus, the $d_i$'s are not completely arbitrary: for unbiasedness, $\sum_i d_i x_i$ must equal zero. The estimator then equals the true parameter plus a mean zero term involving the random errors:
$$\tilde\beta_1 = \hat\beta_1 + \sum_i d_i y_i = \beta_1 + \sum_i w_i u_i + \sum_i d_i(\beta_1 x_i + u_i - \bar u)$$
$$= \beta_1 + \sum_i w_i u_i + \sum_i d_i(u_i - \bar u) \qquad \because\ \sum_i d_i x_i = 0$$

We use this expression to calculate the variance of β and show that it is


1
.
greater than or equal to the variance of the OLS estimator β 1
   
V β 1 = V  β1 + ∑ wiui + ∑ di ( ui − u )  = V  ∑ wiui + ∑ di ( ui − u ) 

 i i   i i 
( β1 is a constant)
 
= V  ∑ wiui + ∑ ( di − d ) ui  ( Prove this general result.3
)
 i i 
= ∑V (( w + ( d − d ))u )
i i i (ui 's independent)
i

= ∑V ( wiui ) + ∑V ( ( di − d ) ui )
i i

(∑ di xi = 0 ⇒∑ di wi = 0 and ∑x i = 0)
i i i
_____________________________________

$^3$ The general result is $\sum_i (a_i - \bar a)(b_i - \bar b) = \sum_i (a_i - \bar a)b_i = \sum_i a_i(b_i - \bar b)$. This holds for both random and non-random variables.
Since $V\big(\tilde\beta_1\big) = \sum_i V(w_i u_i) + \sum_i V\big((d_i - \bar d)u_i\big)$ and $Vu_i = \sigma^2$, we deduce that:
$$V\big(\tilde\beta_1\big) = \sigma^2 \sum_i w_i^2 + \sigma^2 \sum_i (d_i - \bar d)^2 = V\big(\hat\beta_1\big) + \sigma^2 \sum_i (d_i - \bar d)^2 \geq V\big(\hat\beta_1\big)$$
which is the Gauss-Markov result.
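A simulation sketch of the theorem (not from the notes): compare OLS with another linear unbiased estimator of the slope, here a "two-point" estimator that uses only the observations with the smallest and largest X. Both are unbiased with fixed X's, but OLS has the smaller sampling variance.

```python
def compare_estimators(n_reps=5000, N=50, beta0=1.0, beta1=2.0, sigma=1.0):
    """Monte Carlo means and variances of OLS and a two-point linear unbiased estimator."""
    rng = np.random.default_rng(2)
    X = rng.normal(size=N)                    # fixed X's, held constant across replications
    x = X - X.mean()
    j, k = X.argmax(), X.argmin()             # observations with the largest and smallest X
    ols, two_point = [], []
    for _ in range(n_reps):
        Y = beta0 + beta1 * X + rng.normal(0.0, sigma, N)
        ols.append((x * (Y - Y.mean())).sum() / (x * x).sum())
        two_point.append((Y[j] - Y[k]) / (X[j] - X[k]))   # linear and unbiased, but inefficient
    return np.mean(ols), np.var(ols), np.mean(two_point), np.var(two_point)

# Both means are close to 2.0; the OLS variance is the smaller of the two.
print(compare_estimators())
```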


R2 And All That

• When there is a constant in the regression model, we can decompose the sum of squares (i.e. the sum of squared deviations about the sample mean$^4$) of Y into two separate sums of squares – the sum of squares of the fitted values and the sum of squares of the residuals.
• In other words, the total sum of squares (TSS) equals the explained sum of squares (ESS) plus the residual sum of squares:
$$TSS = ESS + RSS$$
where $TSS = \sum_i (Y_i - \bar Y)^2$, $ESS = \sum_i (\hat Y_i - \bar{\hat Y})^2$ and $RSS = \sum_i (\hat u_i - \bar{\hat u})^2$, where $\bar{\hat Y}$ and $\bar{\hat u}$ are the sample means of the fitted values and the residuals respectively. It turns out that these are equal to $\bar Y$ and 0.

$^4$ The sum of squares of a random variable Z is $\sum_i (Z_i - \bar Z)^2 = \sum_i Z_i^2 - N\bar Z^2$. This reduces to $\sum_i Z_i^2$ when the sample mean of Z is zero.

This decomposition is the basis of the $R^2$ (R squared) measure of goodness of fit, which is the fraction of the variation in Y explained by X:
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} \qquad 0 \leq R^2 \leq 1$$

The $R^2$ measure is useful but you should not rely on it exclusively.
• If there is no intercept in the model (or you are not using a least squares estimator), $R^2$ can be negative.
• $R^2$ generally increases as you add additional explanatory variables to the model.
• $R^2$ tends to be high if you are using time series data and low in cross section data.
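A sketch of the decomposition and the $R^2$ calculation on the simulated sample (variable names carried over from the earlier snippets):

```python
TSS = ((Y - Y.mean()) ** 2).sum()
ESS = ((Y_hat - Y_hat.mean()) ** 2).sum()   # Y_hat.mean() equals Y.mean()
RSS = (u_hat ** 2).sum()                    # the residuals have mean zero

R2 = ESS / TSS
print(np.isclose(TSS, ESS + RSS), R2, 1 - RSS / TSS)
```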

Proof: Show that TSS = ESS + RSS


• By definition, $Y_i = \hat Y_i + \hat u_i$ so:
$$\sum_i Y_i^2 = \sum_i \big(\hat Y_i + \hat u_i\big)^2 = \sum_i \hat Y_i^2 + \sum_i \hat u_i^2 + 2\sum_i \hat Y_i \hat u_i$$
• The two OLS f.o.c.'s imply that the cross product term is zero:
$$\sum_i \hat Y_i \hat u_i = \sum_i \big(\hat\beta_0 + \hat\beta_1 X_i\big)\hat u_i = \hat\beta_0 \sum_i \hat u_i + \hat\beta_1 \sum_i X_i \hat u_i = 0$$
so $\sum_i Y_i^2 = \sum_i \hat Y_i^2 + \sum_i \hat u_i^2$.

• Since the residuals sum to zero, the mean residual is zero, $\bar{\hat u} = 0$, and the actual and fitted values of Y have the same mean, $\bar Y = \bar{\hat Y}$.
• Hence:
$$\Big(\sum_i Y_i^2 - N\bar Y^2\Big) = \Big(\sum_i \hat Y_i^2 - N\bar Y^2\Big) + \sum_i \hat u_i^2$$
i.e. TSS = ESS + RSS

which is the result we require.


Appendix – Random X’s and Non-Normal Random Error Terms

• DGP: $Y_i = \beta_0 + \beta_1 X_i + u_i$, $\quad u_i \mid X\text{'s} \sim$ iid$(0, \sigma^2)$, $\quad i = 1, \ldots, N$

• We showed that $\hat\beta_1 = \dfrac{\sum_i x_i y_i}{\sum_i x_i^2}$, which implied that:
$$\hat\beta_1 = \beta_1 + \frac{\sum_i x_i u_i}{\sum_i x_i^2} = \beta_1 + \sum_i w_i u_i$$
• $\hat\beta_1$ is unbiased since $E\hat\beta_1 = E\big[E(\hat\beta_1 \mid X)\big] = E[\beta_1] = \beta_1$:
$$E\big(\hat\beta_1 \mid X\big) = \beta_1 + E\Big[\sum_i w_i u_i \mid X\Big] = \beta_1 + \sum_i w_i E(u_i \mid X) = \beta_1 + \sum_i w_i \cdot 0 = \beta_1$$
The conditional variance is:
$$V\big(\hat\beta_1 \mid X\big) = V\Big(\sum_i w_i u_i \mid X\Big) = \sum_i w_i^2 V(u_i \mid X) = \sum_i w_i^2 \sigma^2$$
$$\Rightarrow\quad V\big(\hat\beta_1 \mid X\big) = \sigma^2 \sum_i w_i^2 = \frac{\sigma^2}{\sum_i x_i^2}$$
which obviously depends on the X's.

In order to derive the conditional distribution, consider:
$$\hat\beta_1 = \beta_1 + \frac{\sum_i x_i u_i}{\sum_i x_i^2} = \beta_1 + \frac{\frac{1}{N}\sum_i x_i u_i}{\frac{1}{N}\sum_i x_i^2}$$
Under suitable regularity conditions:
• A Law of Large Numbers applies to the denominator $\frac{1}{N}\sum_i x_i^2$, which has a plim (equal to the variance of X).
• A Central Limit Theorem applies to the numerator sample mean $\frac{1}{N}\sum_i x_i u_i$, which has mean 0 and conditional variance $\frac{\sigma^2}{N^2}\sum_i x_i^2$:
$$\frac{1}{N}\sum_i x_i u_i \underset{approx}{\sim} N\!\left(0,\ \frac{\sigma^2}{N^2}\sum_i x_i^2\right)$$
These two results may be combined (using Slutsky's theorem) to show that:
$$\Big(\tfrac{1}{N}\sum_i x_i^2\Big)^{-1}\Big(\tfrac{1}{N}\sum_i x_i u_i\Big) \underset{approx}{\sim} \Big(\tfrac{1}{N}\sum_i x_i^2\Big)^{-1} N\!\left(0,\ \frac{\sigma^2}{N^2}\sum_i x_i^2\right)$$
$$\Rightarrow\quad \Big(\tfrac{1}{N}\sum_i x_i^2\Big)^{-1}\Big(\tfrac{1}{N}\sum_i x_i u_i\Big) \sim AN\!\left(0,\ \sigma^2\Big(\sum_i x_i^2\Big)^{-1}\right)$$

Thus, in large samples, the conditional distribution of $\hat\beta_1$ is:
$$\hat\beta_1 \mid X\text{'s} \sim AN\!\left(\beta_1,\ \sigma^2\Big(\sum_i x_i^2\Big)^{-1}\right)$$
since $\hat\beta_1 = \beta_1 + \Big(\tfrac{1}{N}\sum_i x_i^2\Big)^{-1}\Big(\tfrac{1}{N}\sum_i x_i u_i\Big)$.
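A simulation sketch of this asymptotic result (not part of the notes): even with markedly non-normal errors, here a centred exponential, the standardised OLS slope is approximately standard normal when N is moderately large.

```python
def standardised_slopes(n_reps=5000, N=200, beta0=1.0, beta1=2.0):
    """Monte Carlo draws of (beta1_hat - beta1) / sqrt(sigma^2 / sum_i x_i^2)
    with random X's and non-normal (centred exponential) errors."""
    rng = np.random.default_rng(3)
    sigma = 1.0                               # an Exponential(1) error has variance 1
    z = []
    for _ in range(n_reps):
        X = rng.normal(size=N)                # random regressor
        u = rng.exponential(1.0, N) - 1.0     # non-normal error with mean 0
        Y = beta0 + beta1 * X + u
        x = X - X.mean()
        b1 = (x * (Y - Y.mean())).sum() / (x * x).sum()
        z.append((b1 - beta1) / np.sqrt(sigma ** 2 / (x * x).sum()))
    z = np.array(z)
    return z.mean(), z.std()                  # approximately 0 and 1

print(standardised_slopes())
```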
