SIMPLE LINEAR REGRESSION MODEL
The Data Generating Process (DGP), or the population, is described by the following linear model:
$$Y_j = \beta_0 + \beta_1 X_j + \varepsilon_j$$
$Y_j$ is the j-th observation of the dependent variable Y (it is known)
$X_j$ is the j-th observation of the independent variable X (it is known)
$\beta_0$ is the intercept term (it is unknown)
$\beta_1$ is the slope parameter (it is unknown)
$\varepsilon_j$ is the j-th error, the j-th unobserved factor that, besides X, affects Y (it is unknown)
Since the values of $X_j$ and $Y_j$ are known but the values of $\beta_0$, $\beta_1$ and $\varepsilon_j$ are unknown, the regression model that describes the relationship between X and Y is also unknown.
Graphically, the errors are the vertical distances between the observed data points and the values predicted by the linear regression model.
The ordinary least squares criterion (OLS)
With the OLS method we find the estimators of the unknown parameters $\beta_0$ and $\beta_1$ that maximize the explanatory power of the linear model $\beta_0 + \beta_1 X_j$ and that, therefore, minimize the sum of squared errors:
$$(\hat\beta_0, \hat\beta_1) = \arg\min_{\beta_0, \beta_1} \sum_{j=1}^{n} (Y_j - \beta_0 - \beta_1 X_j)^2 = \arg\min_{\beta_0, \beta_1} \sum_{j=1}^{n} \varepsilon_j^2$$
where $\sum_{j=1}^{n} (Y_j - \beta_0 - \beta_1 X_j)^2$ is the objective function $O(\beta_0, \beta_1)$, which must be differentiated with respect to $\beta_0$ and $\beta_1$ and set equal to 0 in order to find $\hat\beta_0$ and $\hat\beta_1$.
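As an illustration (not part of the derivation that follows), the minimization can also be checked numerically on simulated data. This is a minimal sketch assuming a hypothetical DGP with $\beta_0 = 1$ and $\beta_1 = 2$; it uses scipy.optimize.minimize, and the numerical minimizer should coincide with the closed-form solution derived below.

```python
# Minimal sketch: minimize the OLS objective O(beta0, beta1) numerically
# on simulated data (hypothetical DGP: beta0 = 1, beta1 = 2, normal errors).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
Y = 1.0 + 2.0 * X + eps                       # Y_j = beta0 + beta1*X_j + eps_j

def objective(b):
    """Sum of squared errors O(beta0, beta1)."""
    beta0, beta1 = b
    return np.sum((Y - beta0 - beta1 * X) ** 2)

result = minimize(objective, x0=[0.0, 0.0])   # numerical minimization
print(result.x)                               # close to the true (1, 2)
```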
The 1st of the first order conditions (FOC) is
$$\frac{\partial O}{\partial \beta_0} = 0 \iff \sum_{j=1}^{n} 2\,(Y_j - \hat\beta_0 - \hat\beta_1 X_j)(-1) = 0$$
$$\Downarrow$$
$$\hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{X}_n$$
where:
$\bar{Y}_n = \frac{1}{n}\sum_{j=1}^{n} Y_j$ is the sample average of Y
$\bar{X}_n = \frac{1}{n}\sum_{j=1}^{n} X_j$ is the sample average of X
The 2nd of the first order conditions (FOC) is
$$\frac{\partial O}{\partial \beta_1} = 0 \iff \sum_{j=1}^{n} 2\,(Y_j - \hat\beta_0 - \hat\beta_1 X_j)(-X_j) = 0$$
$$\Downarrow$$
$$\hat\beta_1 = \frac{\sum_{j=1}^{n} (X_j - \bar{X}_n)(Y_j - \bar{Y}_n)}{\sum_{j=1}^{n} (X_j - \bar{X}_n)^2} = \frac{\frac{1}{n}\sum_{j=1}^{n} (X_j - \bar{X}_n)(Y_j - \bar{Y}_n)}{\frac{1}{n}\sum_{j=1}^{n} (X_j - \bar{X}_n)^2} = \frac{\text{sample covariance between } X \text{ and } Y}{\text{sample variance of } X}$$
If $\hat\beta_0$ and $\hat\beta_1$ are the OLS estimators, the predicted values are defined as
$$\hat{Y}_j = \hat\beta_0 + \hat\beta_1 X_j$$
while the residuals are defined as
$$\hat\varepsilon_j = Y_j - \hat{Y}_j = Y_j - (\hat\beta_0 + \hat\beta_1 X_j)$$
The closer the residuals $\hat\varepsilon_j$ are to 0, the better the quality of the regression.
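The closed-form formulas above translate directly into code. A minimal sketch in Python/NumPy, assuming hypothetical simulated data with $\beta_0 = 1$ and $\beta_1 = 2$:

```python
import numpy as np

# Hypothetical simulated sample from Y_j = 1 + 2*X_j + eps_j
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

Xbar, Ybar = X.mean(), Y.mean()

# Slope: sample covariance of (X, Y) divided by sample variance of X
beta1_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
# Intercept from the first FOC: beta0_hat = Ybar - beta1_hat * Xbar
beta0_hat = Ybar - beta1_hat * Xbar

Y_hat = beta0_hat + beta1_hat * X   # predicted values
resid = Y - Y_hat                   # residuals eps_hat_j

print(beta0_hat, beta1_hat)         # close to the true (1, 2)
```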
The crucial assumptions
Consider two random variables X and Y and the following regression model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
We know that $E[Y \mid X] = \beta_0 + \beta_1 X$ and $\varepsilon = Y - (\beta_0 + \beta_1 X)$. Therefore
$$E[\varepsilon \mid X] = E[Y - (\beta_0 + \beta_1 X) \mid X] = E[Y \mid X] - E[\beta_0 + \beta_1 X \mid X] = E[Y \mid X] - \{\beta_0 + \beta_1 E[X \mid X]\} = E[Y \mid X] - (\beta_0 + \beta_1 X) = 0$$
So, by the Law of iterated expectations,
$$E[\varepsilon] = E\big[E[\varepsilon \mid X]\big] = E[0] = 0$$
In conclusion, the expectation of the unknown factors is not influenced by X and it is equal to 0:
$$E[\varepsilon \mid X] = E[\varepsilon] = 0$$
This implies that
$$E[X\varepsilon] = 0$$
because, by the Law of iterated expectations, $E[X\varepsilon] = E\big[E[X\varepsilon \mid X]\big] = E\big[X\,E[\varepsilon \mid X]\big] = E[X \cdot 0] = 0$.
Moreover
$$E[\hat\varepsilon_j] = 0$$
3 important properties
We can derive 3 properties from the previous conclusions:
Property 1: the sample average of the fitted values $\hat{Y}_j$ coincides with the sample average of Y:
$$\bar{\hat{Y}}_n = \bar{Y}_n$$
Property 2: the sample average of the OLS residuals $\hat\varepsilon_j$ is 0:
$$\bar{\hat\varepsilon}_n = \frac{1}{n}\sum_{j=1}^{n} \hat\varepsilon_j = 0$$
Property 3: the sample covariance between the regressors and the OLS residuals is always 0:
$$\frac{1}{n}\sum_{j=1}^{n} (X_j - \bar{X}_n)(\hat\varepsilon_j - \bar{\hat\varepsilon}_n) = 0$$
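A minimal numerical check of the three properties, assuming the same kind of hypothetical simulated data as above (the equalities hold up to floating-point error):

```python
import numpy as np

# Hypothetical simulated sample and its OLS fit
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
Y_hat = beta0_hat + beta1_hat * X
resid = Y - Y_hat

print(np.isclose(Y_hat.mean(), Y.mean()))   # Property 1: mean of fitted values = mean of Y
print(np.isclose(resid.mean(), 0.0))        # Property 2: mean of residuals = 0
print(np.isclose(np.mean((X - X.mean()) * (resid - resid.mean())), 0.0))  # Property 3
```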
Measures of goodness of prediction
The $R^2$ measures the proportion of the variance of Y that is explained by X. The $R^2$ varies between 0 and 1. The higher the $R^2$, the better the fit of the model to the data.
$$R^2 = \frac{SSE}{SST} = \frac{SST - SSR}{SST} = 1 - \frac{SSR}{SST}$$
where:
The Total sum of squares (SST) measures the data dispersion (total variance of the data):
$$SST = \sum_{j=1}^{n} (Y_j - \bar{Y}_n)^2$$
The Explained sum of squares (SSE) measures the fitted values dispersion (variance explained by the regression):
$$SSE = \sum_{j=1}^{n} (\hat{Y}_j - \bar{Y}_n)^2 = \sum_{j=1}^{n} (\hat{Y}_j - \bar{\hat{Y}}_n)^2$$
The Sum of squared residuals (SSR) measures the residuals dispersion (variance explained by the residuals):
$$SSR = \sum_{j=1}^{n} (Y_j - \hat{Y}_j)^2 = \sum_{j=1}^{n} \hat\varepsilon_j^2$$
Theorem: the total variance of the data is given by the sum of the variance explained by the regression and
the variance explained by the residuals.
$$SST = SSE + SSR$$
Of course:
- the smaller the SSR, the better the fit of the regression to the data:
$$SSR \ll SST \Rightarrow R^2 \approx 1 \quad (\text{good fit})$$
- the higher the SSR, the worse the fit of the regression to the data:
$$SSR \approx SST \Rightarrow R^2 \approx 0 \quad (\text{poor fit})$$
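A minimal sketch computing SST, SSE, SSR and $R^2$ on hypothetical simulated data, and checking the decomposition $SST = SSE + SSR$:

```python
import numpy as np

# Hypothetical simulated sample and its OLS fit
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(scale=0.5, size=n)

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
Y_hat = beta0_hat + beta1_hat * X

SST = np.sum((Y - Y.mean()) ** 2)       # total sum of squares
SSE = np.sum((Y_hat - Y.mean()) ** 2)   # explained sum of squares
SSR = np.sum((Y - Y_hat) ** 2)          # sum of squared residuals

print(np.isclose(SST, SSE + SSR))       # decomposition SST = SSE + SSR
print(SSE / SST, 1 - SSR / SST)         # two equivalent expressions for R^2
```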
Unbiasedness of the OLS estimators
Assume that
1) the DGP of $(X_j, Y_j)$ is $Y_j = \beta_0 + \beta_1 X_j + \varepsilon_j$
2) $E[\varepsilon_j] = 0$
3) $E[\varepsilon_j \mid X_1, \dots, X_n] = E[\varepsilon_j] = 0$
These 3 assumptions are enough to prove that $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of $\beta_0$ and $\beta_1$.
$\hat\beta_1$ can also be expressed as
$$\hat\beta_1 = \beta_1 + \frac{\sum_{j=1}^{n} (X_j - \bar{X}_n)\,\varepsilon_j}{SST_X}$$
where $SST_X = \sum_{j=1}^{n} (X_j - \bar{X}_n)^2$ is the total sum of squares of X.
From this, we derive that
$$E[\hat\beta_1 \mid X] = \beta_1$$
And then, by the Law of iterated expectations, $\hat\beta_1$ is an unbiased estimator of $\beta_1$, in the sense that
$$E[\hat\beta_1] = \beta_1$$
because $E[\hat\beta_1] = E\big[E[\hat\beta_1 \mid X]\big] = E[\beta_1] = \beta_1$.
$\hat\beta_0$ can also be expressed as
$$\hat\beta_0 = \beta_0 + (\beta_1 - \hat\beta_1)\,\bar{X}_n + \bar{\varepsilon}_n$$
From this, we derive that
$$E[\hat\beta_0 \mid X] = \beta_0$$
And then, by the Law of iterated expectations, $\hat\beta_0$ is an unbiased estimator of $\beta_0$, in the sense that
$$E[\hat\beta_0] = \beta_0$$
because $E[\hat\beta_0] = E\big[E[\hat\beta_0 \mid X]\big] = E[\beta_0] = \beta_0$.
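A minimal Monte Carlo sketch of unbiasedness, assuming a hypothetical DGP with $\beta_0 = 1$ and $\beta_1 = 2$: averaging the OLS estimates over many simulated samples should approximate the true parameters.

```python
import numpy as np

# Monte Carlo check of unbiasedness (hypothetical DGP: beta0 = 1, beta1 = 2)
rng = np.random.default_rng(4)
beta0, beta1, n, reps = 1.0, 2.0, 100, 5000
b0_hats, b1_hats = [], []

for _ in range(reps):
    X = rng.normal(size=n)
    Y = beta0 + beta1 * X + rng.normal(size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    b0_hats.append(b0)
    b1_hats.append(b1)

print(np.mean(b0_hats), np.mean(b1_hats))   # close to the true (1.0, 2.0)
```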
Variance of the OLS estimators
Assume that
1) the DGP of $(X_j, Y_j)$ is $Y_j = \beta_0 + \beta_1 X_j + \varepsilon_j$
2) $E[\varepsilon_j] = 0$
3) $E[\varepsilon_j \mid X_1, \dots, X_n] = E[\varepsilon_j] = 0$
4) the $\varepsilon_j$ are independent
5) $V[\varepsilon_j \mid X_1, \dots, X_n] = E[\varepsilon_j^2 \mid X_1, \dots, X_n] - \big(E[\varepsilon_j \mid X_1, \dots, X_n]\big)^2 = E[\varepsilon_j^2 \mid X_1, \dots, X_n] = E[\varepsilon_j^2] = \sigma_\varepsilon^2$
Then $E[\hat\beta_1^2 \mid X] = \beta_1^2 + \dfrac{\sigma_\varepsilon^2}{SST_X}$, so
$$V[\hat\beta_1 \mid X] = \frac{\sigma_\varepsilon^2}{SST_X}$$
because $V[\hat\beta_1 \mid X] = E[\hat\beta_1^2 \mid X] - \big(E[\hat\beta_1 \mid X]\big)^2 = \beta_1^2 + \dfrac{\sigma_\varepsilon^2}{SST_X} - \beta_1^2 = \dfrac{\sigma_\varepsilon^2}{SST_X}$.
Note that:
- As the variance of the errors goes to 0, the variance of the estimator goes to 0, so the estimator gets more and more precise:
$$\sigma_\varepsilon^2 \to 0 \Rightarrow V[\hat\beta_1 \mid X] \to 0$$
- As the dimension of the sample goes to $+\infty$, the variance of the estimator goes to 0, so the estimator gets more and more precise:
$$n \to +\infty \Rightarrow SST_X \to +\infty \Rightarrow V[\hat\beta_1 \mid X] \to 0$$
Similarly, $E[\hat\beta_0^2 \mid X] = \beta_0^2 + \dfrac{\sigma_\varepsilon^2}{SST_X}\,\bar{X}_n^2 + \dfrac{1}{n}\sigma_\varepsilon^2$, so
$$V[\hat\beta_0 \mid X] = \frac{\sigma_\varepsilon^2 \sum_{j=1}^{n} X_j^2}{n \, SST_X}$$
because
$$V[\hat\beta_0 \mid X] = E[\hat\beta_0^2 \mid X] - \big(E[\hat\beta_0 \mid X]\big)^2 = \beta_0^2 + \frac{\sigma_\varepsilon^2}{SST_X}\,\bar{X}_n^2 + \frac{1}{n}\sigma_\varepsilon^2 - \beta_0^2 = \sigma_\varepsilon^2\left(\frac{\bar{X}_n^2}{SST_X} + \frac{1}{n}\right) = \frac{\sigma_\varepsilon^2 \sum_{j=1}^{n} X_j^2}{n \, SST_X}$$
Note that:
- As the variance of the errors goes to 0, the variance of the estimator goes to 0, so the estimator gets more and more precise:
$$\sigma_\varepsilon^2 \to 0 \Rightarrow V[\hat\beta_0 \mid X] \to 0$$
- As the dimension of the sample goes to $+\infty$, the variance of the estimator goes to 0, so the estimator gets more and more precise:
$$n \to +\infty \Rightarrow SST_X \to +\infty \Rightarrow V[\hat\beta_0 \mid X] \to 0$$
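A minimal Monte Carlo sketch of these variance formulas, assuming a hypothetical DGP in which the regressors are drawn once and then held fixed across replications (so that the conditional variances apply); the value $\sigma_\varepsilon = 0.8$ is an illustrative assumption of the sketch.

```python
import numpy as np

# Monte Carlo check of the conditional variances: X is drawn once and held
# fixed, only the errors are redrawn (hypothetical DGP, sigma_eps = 0.8).
rng = np.random.default_rng(5)
beta0, beta1, n, sigma_eps, reps = 1.0, 2.0, 100, 0.8, 20000
X = rng.normal(size=n)                  # fixed design
SST_X = np.sum((X - X.mean()) ** 2)

b0_hats, b1_hats = [], []
for _ in range(reps):
    Y = beta0 + beta1 * X + rng.normal(scale=sigma_eps, size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / SST_X
    b0 = Y.mean() - b1 * X.mean()
    b0_hats.append(b0)
    b1_hats.append(b1)

print(np.var(b1_hats), sigma_eps**2 / SST_X)                        # V[beta1_hat | X]
print(np.var(b0_hats), sigma_eps**2 * np.sum(X**2) / (n * SST_X))   # V[beta0_hat | X]
```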
Estimator of the error’s variance
How can we compute $V[\hat\beta_0 \mid X]$ and $V[\hat\beta_1 \mid X]$? $\sigma_\varepsilon^2$ is not observed!
Assume that
1) the DGP of $(X_j, Y_j)$ is $Y_j = \beta_0 + \beta_1 X_j + \varepsilon_j$
2) $E[\varepsilon_j] = 0$
3) $E[\varepsilon_j \mid X_1, \dots, X_n] = E[\varepsilon_j] = 0$
4) the $\varepsilon_j$ are independent
5) $V[\varepsilon_j \mid X_1, \dots, X_n] = E[\varepsilon_j^2 \mid X_1, \dots, X_n] - \big(E[\varepsilon_j \mid X_1, \dots, X_n]\big)^2 = E[\varepsilon_j^2 \mid X_1, \dots, X_n] = E[\varepsilon_j^2] = \sigma_\varepsilon^2$
Therefore, the random variable
$$\hat\sigma_\varepsilon^2 = \frac{1}{n-2}\sum_{j=1}^{n} \hat\varepsilon_j^2 = \frac{SSR}{n-2}$$
is an unbiased estimator of the error's variance $\sigma_\varepsilon^2$, in the sense that
$$E[\hat\sigma_\varepsilon^2] = \sigma_\varepsilon^2$$
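A minimal Monte Carlo sketch of this unbiasedness result, assuming a hypothetical DGP with $\sigma_\varepsilon = 0.8$ (so $\sigma_\varepsilon^2 = 0.64$): the average of $SSR/(n-2)$ over many simulated samples should be close to the true error variance.

```python
import numpy as np

# Monte Carlo check: the mean of SSR / (n - 2) over many samples should be
# close to the true error variance sigma_eps^2 (hypothetical DGP, sigma_eps = 0.8).
rng = np.random.default_rng(6)
beta0, beta1, n, sigma_eps, reps = 1.0, 2.0, 50, 0.8, 20000
estimates = []

for _ in range(reps):
    X = rng.normal(size=n)
    Y = beta0 + beta1 * X + rng.normal(scale=sigma_eps, size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    resid = Y - b0 - b1 * X
    estimates.append(np.sum(resid**2) / (n - 2))   # sigma_hat^2 = SSR / (n - 2)

print(np.mean(estimates), sigma_eps**2)            # both close to 0.64
```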