0% found this document useful (0 votes)

197 views14 pages

Variance of OLS with Non-Spherical Errors

This document discusses efficient ordinary least squares (OLS) estimation and some key properties. It summarizes that the OLS estimator is unbiased, with bias equal to 0. Its variance is equal to σ2(X'X)-1, where σ2 is the variance of the error term. It also explains that the sample residual e can be used to estimate σ2, where the expected value of e'e is equal to σ2(N - K), with N being the number of observations and K being the number of regressors. This provides an estimate of the variance of the OLS estimator.

Uploaded by

Peterson Sihotang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

197 views14 pages

Variance of OLS with Non-Spherical Errors

Uploaded by

Peterson Sihotang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Non-Spherical Errors

Krishna Pendakur
February 15, 2016

1 Efficient OLS
1. Consider the model

Y = Xβ + ε
0
E [X ε] = 0K
E [εε0 ] = Ω = σ 2 IN .

2. Consider the estimated OLS parameter vector

−1
β̂OLS = (X 0 X) X 0 Y.

3. Its bias is
h i h i
−1 −1
E β̂OLS −β = E (X 0 X) X 0 Xβ + (X 0 X) X 0 ε − β
h i
−1 −1
= E (X 0 X) X 0 ε = (X 0 X) 0K = 0K

The variance of the estimated parameter vector is the expectation of the square of
(X 0 X)−1 X 0 ε, which is the deviation between the estimate and its mean (which happily
is its target):
h i 0
V β̂OLS = E β̂OLS − β β̂OLS − β
h i
−1 −1
= E (X 0 X) X 0 εεX (X 0 X)
−1 −1
= (X 0 X) X 0 E [εε0 ] X (X 0 X)
−1 −1
= (X 0 X) X 0 σ 2 IN X (X 0 X)
−1 −1
= σ 2 (X 0 X) X 0 X (X 0 X)
−1
= σ 2 (X 0 X)

4. How do we know it is the lowest variance linear unbiased estimator? [Gauss-Markov

Theorem]

1
(a) Let

Y = Xβ + ε
E [ε] = 0N ,
E [εε0 ] = Ω = σ 2 IN ,

and note that E[ε] = 0N =⇒ E [X 0 ε] = 0K for fixed X.

(b) Let CY be another estimator, also linear, where C = (X 0 X)−1 X 0 + D and D is
a KxN matrix. This deviates from the OLS estimator by the matrix D.
(c) We have already shown the OLS estimator to be unbiased under the conditions
above, so D will have to be pretty special.
−1
E [CY ] − β = (X 0 X) X 0 Xβ + DXβ + E[Dε] − β

E [CY ] − β = β + DXβ + 0 − β
E[ε] = 0N =⇒ E [D0 ε] = 0K . Unbiasedness thus requires DX = 0.
(d) As with the OLS estimator, since it is unbiased, CY − β = Cε. Its variance is
the expectation of the square of this:

V [CY ] = V [Cε] = σ 2 CC 0
−1 −1 −1 −1
= σ 2 [(X 0 X) X 0 X (X 0 X) + DX (X 0 X) + (X 0 X) X 0 D0 + DD0 ]
−1
= σ 2 [(X 0 X) + 0 + 0 + DD0 ]
(e) Since DD0 is a square, it is positive semidefinite, and its minimum is zero when
D = 0. Consequently, the lowest-variance unbiased estimator for the homoskedas-
tic linear model is when D = 0, which is the OLS estimator.
h i
5. Getting back to V β̂ = σ 2 (X 0 X)−1 , a problem is that σ 2 is not observed. So, we
don’t yet have a useful object.

(a) However, we have a sample analog: the sample residual e:

e = Y − X β̂OLS .

6. So how exactly does e relate to ε?

−1
e = Y − X (X 0 X) X 0 Y
h i
−1
= I − X (X 0 X) X 0 Y
h i h i
−1 −1
= I − X (X 0 X) X 0 Xβ + I − X (X 0 X) X 0 ε
h i
0 −1 0
= Xβ − Xβ + I − X (X X) X ε
h i
−1
= I − X (X 0 X) X 0 ε

2
e is a linear transformation of ε. However, although I − X (X 0 X)−1 X 0 is an N xN

matrix, it is not a full rank matrix: its columns are related. Indeed, this N xN weighting
matrix is all driven by the identity matrix, which has rankN , and the matrix X, which
only has K columns. The full matrix I − X (X 0 X)−1 X 0 has rank N − K.

7. Matrices like I − X (X 0 X)−1 X 0 and X (X 0 X)−1 X 0 are called projection matrices,

and they come up a lot.

(a) for any matrix Z, denote its projection matrix PZ = Z(Z 0 Z)−1 Z 0 and its error
projection as MZ = I − Z(Z 0 Z)−1 Z 0
(b) These are convenient. We can write the OLS estimate of Xβ as

X β̂OLS = PX Y,

and the OLS residuals Y − X β̂OLS as

e = MX Y

and also,
e = MX ε
(c) We say stuff like “The matrix PX projects X onto Y .”
(d) These matrices have a few useful properties:
i. they are symmetric.
ii. they are idempotent, which means they equal their own square: PZ PZ = PZ ,
MZ MZ = MZ

8. Getting back to σ 2 and our estimate of it, compute e0 e in terms of ε:

e = MX ε

so,

E [e0 e] = E [ε0 MX MX ε]
= E [ε0 MX ε]
h i
0 0 0 −1 0
= E [ε ε] − E ε X (X X) X ε
= N σ 2 − Kσ 2

because ε0 P X ε = ε0 PX0 PX ε, which is the sum of squares of PX ε. PX has rank K and

projects ε on to X. Even though ε is noise, you can still get K perfect fits with K
regressors, so its sum of squares picks up these K perfect fits. Consequently,

E [e0 e]
= σ2.
N −K

3
So, we can use an estimate
2 e0 e
σ̂ =
N −K
An estimate of the variance of the OLS estimator is thus given by
h i
−1
V̂ β̂OLS = σ̂ 2 (X 0 X) .

Now, we can compute the BLUE estimate, and say something about its bias (zero)
and its sampling variability.
9. Time to crack open Kennedy for Ballentines (Kennedy, Peter E. 2002. More on Venn
Diagrams for Regression, Journal of Statistics Education Volume 10, Number 1), linked
on the course website. Go through ε vs cov(X, Y ) and the 2 regressor case.

(a) size of circle is amount of variation.

(b) overlap in circles is amount of covariation.
(c) more overlap means more information to identify the parameter associated with
that regressor.
(d) with 2 regressors, there is overlap between all 3 variables. this covariation can-
not be uniquely attributed to either regressor. this covariation can, however, be
attributed to a linear combination of the two regressors.
(e) show correlated missing regressor case. show uncorrelated missing regressor case.
(f) go through Frisch-Waugh-Lovell with the circles.

10. Precision is good. Lowh variance

i is precision. How do you get a precise estimate?
One can think about V β̂OLS = σ 2 (X 0 X)−1 in 3 pieces:

(a) The variance of ε is the variance of Y conditional on X. Less variation of Y

around the regression line yields greater precision.
0
(b) N is the number of observations. It shows up, implicitly, inside
PNX X. 2 This is
0
easiest to see if X has just one column: in this case, X X = i=1 (xi ) , which
for xi drawn fromh some
i density f (x) has an expectation that increases linearly
with N . So, V β̂OLS goes inversely proportionally with N .
(c) X 0 X is related to the variance matrix of the vectors x0i , i = 1, ..., N . Indeed, for
xi ∼ iid, X 0 X is an estimate of E[x0i xi ]. By definition, E[x0i xi ] = E[xi ]E[xi ]0 +
V [xi ]. For xi with a given mean E[xi ], increasing V [xi ] with a mean-preserving
spread implies increasing E[x0i xi ], which in turn
h is associated
i with larger X 0 X. If
X 0 X is bigger, then (X 0 X)−1 is smaller, so V β̂OLS is smaller and the estimate
β̂OLS is more precise.
(d) If the columns of X covary a lot, then the off-diagonals of X 0 X are larger. If
the off-diagonals of X 0 X are larger, then the diagonals of (X 0 X)−1 , which give
the variance of each estimated coefficient, are bigger, which means we have less
precision.

4
i. Consider a model with 2 columns in X and no constant, and let the columns
of X be positively correlated.
ii. Then, X 0 X has elements Xj0 Xk for j, k = 1, 2. Its diagonals are the sum
of squares of each column (call them a and b), and its off-diagonal is the
a c
cross-product (call that c). So, X 0 X is given by . Then, (X 0 X)−1
c b
1 b −c b a
is given by ab−c2 , whose diagonal elements are ab−c 2 and ab−c2 .
−c a
These elements increase as c, the cross-product of the columns, increases. So,
more correlation of the X’s means higher variance of the estimates.
(e) If the columns of X covary a lot, then although we have less precision on any
one regressor’s coefficient, we may have a lot of precision on particular linear
combinations of coefficients.
i. Suppose the two variables mentioned above strongly positively covary, so that
c is positive and big. The covariance of the two coefficients is −c/(ab − c2 ).
When c goes up, the numerator goes up and the denominator goes down,
so the overall ratio goes up a lot. Since a, b are both sums of squares, they
are positive. The cross-product c cannot exceed a or b, so the denominator
is positive. Thus, these positively covarying regressors would result in esti-
mated coefficients that each have high variance, but are strongly negatively
correlated.
ii. Consider the variance of the sum of the two coefficients: V (β1 + β2 ) =
1
ab−c2
(b + b − 2c). The larger is c, the smaller is this variance.

11. The Frisch-Waugh-Lovell Theorem can be expressed in a simple way using projection
matrices. Let
Y = X1 β1 + X2 β2 + ε
and consider premultiplying the whole thing by the error-maker matrix for X2 , MX2 .
This gives
MX2 Y = MX2 X1 β1 + MX2 X2 β2 + MX2 ε.
The projection of X2 onto itself is perfect, so it has no error, so MX2 X2 = 0. Thus, we
have
MX2 Y = MX2 X1 β1 + MX2 ε.
Writing Ŷ = MX2 Y and X̂1 = MX2 X1 , we have

Ŷ = X̂β1 + MX2 ε.

Since X2 is uncorrelated with ε by assumption, MX2 ε = ε − X2 (X20 X2 )−1 X20 ε is also

uncorrelated with X̂:

E[X̂10 MX2 ε] = E[X10 MX0 2 MX2 ε] = E[X10 MX2 ε]

= E[X10 ε] − E[X10 X2 (X20 X2 )−1 X20 ε] = 0 − X10 X2 (X20 X2 )−1 E[X20 ε] = 0

so this error term is exogenous with respect to X̂ .

5
(a) so you can regress Ŷ on X̂ to get an estimate of β1 .
(b) In terms of Kennedy’s Ballentines, M X2 Y is the part of Y that has had X2 cleaned
out of it, and MX2 X1 is the part of X1 that has had X2 cleaned out of it.

2 NonSpherical errors
1. In a model
Y = g (X, β) + ε,
if errors satisfy
E [εε0 ] = σ 2 IN ,
we call them spherical. Independently normal errors are spherical, but the assump-
tion of independent normality is much stronger than the assumption that errors are
spherical, because normality restricts all products of all powers of all errors. In con-
trast, the restriction that errors are spherical restricts only the squares of errors and
cross-products of errors:

(a) The first implication of spherical errors is

E (εi )2 = σ 2 ,

for all i = 1, ..., N , which we usually call homoskedasticity. Homoskedastic errors

have the same variance for all observations.
(b) The second implication is that

E [εi εj ] = 0,

for all i 6= j. This means that there are no correlations in errors across ob-
servations. This rules out over-time correlations in time-series data, and spatial
correlations in cross-sectional data.

2. OLS is inefficient if errors are nonspherical. This is easy to see by example.

(a) Imagine that we have a linear model with a constant α and one regressor (the
vector X):

Y = α + Xβ + ε,
E [ε] = 0N
E [εε0 ] = Ω =
6 σ 2 IN

where
0 0 0
Ω = 0 IN −2 0 .
0 0 0
That is, we have an environment where we know that the first and last observation
have a error term of zero, and all the rest are of the usual kind.

6
i. Consider a regression line that connects the first and last data points, and
ignores all the rest. This regression line is exactly right. Including other data
in the estimate only adds wrongness. Thus, the best linear unbiased estimator
in this case is the line connecting the first and last dots. Consequently, OLS
is inefficient—it does not have the lowest variance.
ii. The point is that you want to pay close attention where the errors have low
variance and not pay much attention where the errors have high variance.
(b) Alternatively, imagine that

σ2 0 0
2 0
Ω = 0 σ ιN −1 ιN −1 0 .
0 0 σ2

Here, the notation ιK indicates a K−vector of ones. Thus, ιN −2 ι0N −2 is an

(N − 2)x(N − 2) matrix of ones, and σ 2 ιN −1 ι0N −1 is a matrix filled with σ 2 . This
covariance matrix would arise if observations 1 and N had independent errors
with variance σ 2 , and observations 2, ..., N − 1 had the same error term. Not just
error terms drawn from the same distribution, but literally the same value of ε
for each of those observations.
i. In this case, you’d want to treat observations 2, ..., N − 1 as if they were
just one observation: for example, they all had a big positive error, you
wouldn’t want to pull the regression line up very much, because you’d know
that what seemed like a lot of positive errors was really just one big outlier.
Consequently, since OLS wouldn’t do any grouping like this, OLS is not
efficient.

3 Generalised Least Squares

1. Generalised Least Squares (GLS) is used when we face a model like

Y = α + Xβ + ε,
E [ε] = 0N
E [εε0 ] = Ω

Here, if Ω 6= σ 2 IN , you have some known form of nonspherical errors: either het-
eroskesticity, or correlations across observations. Note that E [ε] = 0 ⇒ E[X 0 ε] = 0
for fixed X.

2. We know that OLS is the efficient estimator given homoskedastic errors, but what
about the above case?

3. The trick is to convert this problem back to a homoskedastic problem. Consider pre-
multiplying Y and X by Ω−1/2

Ω−1/2 Y = Ω−1/2 Xβ + Ω−1/2 ε

7
Here is a model with the error term premultiplied by this weird inverse-matrix-square-
root thing.
4. What is the mean and variance of this new transformed error term?
E Ω−1/2 ε = Ω−1/2 E [ε] = 0

E Ω−1/2 εε0 Ω−1/2 = Ω−1/2 E [εε0 ] Ω−1/2

= Ω−1/2 ΩΩ−1/2
= Ω−1/2 Ω1/2 Ω1/2 Ω−1/2 = IN IN = IN

(see Kennedy’s appendix “All About Variance” for more rules on variance computa-
tions).
5. So the premultiplied model is homoskedastic with unit variance errors.
6. Given that the coefficients in the transformed model are the same as those in the
untransformed model, we can estimate them by using OLS on the transformed model.
7. Tranforming data by a known variance matrix and then applying OLS is called Gen-
eralised Least Squares (GLS).
8. We refer to the matrix
T = Ω−1/2
as the Transformation Matrix.
9. GLS in Stata is
10. reg TY TX

3.1 Group Means Regressands

1. One such known variance matrix is that associated with dependent variable data whose
elements are group means: eg, average income in a country. In this case, the averages
have known relative variances: the variance of the mean of something goes with the
square root of the sample size used to compute it. If every country has the same
variance σ 2 in each observation it uses to calculate its average income, the averages
will have variances inversely proportional to the sample sizes used to compute them.
So, in the model where i indexes countries, and each country computes its mean off
of a sample with size Si , and the errors are not correlated across countries, and the
covariance matrix is  1 
S1
0 0
Ω = σ 2  0 S1i 0 
0 0 S1N
and, therefore,  √ 
S1 √0 0
1
T = 0 Si √0 
σ
0 0 SN

8
2. The transformation matrix T amounts to multiplying each Y and each X by the square
root of the sample size used in each country.

3. One need not include the scalar σ in T , because leaving it out just loads it on to the
second stage which would have variance σ 2 IN instead of IN .

4. This strategy, in which you premultiply each observation separately, rather than pre-
multiplying a whole vector of Y and a whole matrix of X, is appropriate when the
covariance matrix is diagonal as it is in the grouped mean data case. This strategy is
referred to as Weighted Least Squares (WLS).

(a) in Stata,
(b) reg Y X [aweight=S]

3.2 Feasible Generalized Least Squares

1. GLS is all great if you know the covariance matrix of the errors, but usually, you don’t.
A similar strategy, called Feasible Generalised Least Squares (FGLS) covers the case
where you don’t know this covariance matrix, but you can estimate it.

2. FGLS uses two steps:

(a) Get a consistent estimate Ω̂ of Ω.

i. A consistent estimate is one which is asymptotically unbiased and whose
variance declines as the sample size increases.
ii. Not all things can be estimated consistently. Examples will come somewhat
later.
(b) Compute T̂ = Ω̂−1/2 , and run GLS.

3. The Random Effects Model uses FGLS

(a) Assume that

Yit = Xit β + θi + εit

(Actually, this is a bit stronger than what is needed: you just need θi orthogonal
to Xit , but the differing subscripts makes that assumption notationally cumber-
some.) The fact that θi are mean zero no matter what value X takes is strong.
For example, if X includes education and θi is meant to capture smartness, we
would expect correlation between them. We also need the variance of θi to be in-
dependent of X. For example, if half of all people are lazy and lazy people never
go to college, then the variance of θi would covary positively with X observed
post-secondary schooling.

9
(b) Given the assumption on θi , we get

Yit = Xit β + uit

where
uit = θi + εit
is a composite error term which satisfies exogeneity, but does not satisfy the
spherical error term requirement for efficiency of OLS.
(c) One could use OLS of Y on X and get unbiased consistent estimates of β. The
reason is that the nonspherical error term only hurts the efficiency of the OLS
estimator; it is still unbiased.
(d) However, this approach leaves out important information that could improve the
precision of our estimate. In particular, we have assumed that the composite
errors have a chunk which is the same for every t for a given i. There is a GLS
approach to take advantage of this assumption. If we knew the variance of the θi
terms, σθ2 , and knew the variance of the true errors, σε2 , we could take advantage
of this fact.
(e) Under the model, we can compute the covariance of errors of any two observations:

Ω = E [uit ujs ] = E[(θi + εit )(θj + εjs )] = I[i = j]σθ2 + I[s = t]σε2

where I[.] is the indicator function. This covariance matrix is block diagonal,
where each block consists of the sum of the two variances σθ2 and σε2 on the
diagonal, and just σθ2 off the diagonal. These blocks lie on the diagonal of the
big matrix, and the off-diagonal blocks are all zero. (see Green around p 295 for
further exposition). So, Ω has diagonal elements equal to σθ2 + σε2 and within-
person off-diagonal elements equal to σθ2 and across-person off-diagonal elements
equal to 0.
(f) FGLS requires a consistent estimate of the two variances. A fixed effects model
can be run in advance to get estimates of these variances. Or, one could run OLS
and construct an estimate of the error covariance matrix directly. Either yields a
consistent estimate.
i. reg Y X [Link]
ii. compute the variance of the person dummies for σθ2 and use the estimate of
the variance of the fixed effects error term for σε2 .
iii. reg Y X
iv. take the average squared error as an estimate of σε2 and the average cross-
product of errors for a given person as an estimate of σθ2 + σε2 .
(g) Now, form Ω and T and run GLS.
(h) The FGLS estimator uses a consistent pre-estimate of Ω, but this estimate is
only exactly right asymptotically. Thus, the FGLS estimator is only efficient
asymptotically. In the small-sample, it could be kind of crappy because the pre-
estimate of Ω might be kind of crappy.

10
4. The trick with FGLS is that the covariance matrix Ω has N (N − 1)/2 elements (it is
symmetric, so it doesn’t have N xN elements). Thus, it always has more elements than
you have observations. So, you cannot estimate the covariance matrix of the errors
without putting some structure on it. We’ll do this over and over later on.

4 Inefficient OLS
1. What if errors are not spherical? OLS is inefficient, but so what? Quit your bellyachin’—
it still minimizes prediction error, it still forces orthogonality of errors to regressors, it
is still easy to do, easy to explain, just plain easy.

2. But, with non-spherical errors, the variance of the OLS estimated coefficient is different
from when errors are spherical. Consider the model

Y = Xβ + ε
0
E [X ε] = 0K
E [εε0 ] = Ω =
6 σ 2 IN .

Recall that
h i h i
−1 −1
E β̂OLS − β = E (X 0 X) X 0 Xβ + (X 0 X) X 0 ε − β
h i
−1 −1
= E (X X) X ε = (X 0 X) 0K = 0K
0 0

The variance of the estimated parameter vector is the expectation of the square of this
quantity:
h i 0
V β̂OLS = E β̂OLS − β β̂OLS − β
h i
0 −1 0 0 −1
= E (X X) X εεX (X X)
−1 −1
= (X 0 X) X 0 E [εε0 ] X (X 0 X)
−1 −1
= (X 0 X) X 0 ΩX (X 0 X) .

If Ω = σ 2 IN , a pair of X 0 X’s cancel leaving σ 2 (X 0 X)−1 . If not, then not.

3. It seems like you could do something like with the spherical case to get rid of the bit
with Ω: After all E [εε0 ] = Ω, so perhaps we could just substitute in some errors. For
example, we could compute
−1 −1
(X 0 X) X 0 ee0 X (X 0 X) .

Unfortunately, since OLS satisfies the moment condition X 0 e = 0, this would result in
−1 −1
(X 0 X) 0K 00K X (X 0 X) = 0K 00K .

So, that’s not gonna work.

11
4. The problem for estimating Ω is the same as with FGLS: Ω has too many parameters to
consistently estimate without structure. You might think that a model like that used
for WLS might be restrictive enough: you reduce Ω to just N variance parameters
and no off-diagonal terms. Unfortunately, with N observations, you cannot estimate
N parameters consistently.
5. Robust Standard Errors.

(a) The trick here is to come up with an estimate of X 0 ΩX (which is a KxK ma-
trix). There are many strategies, and they are typically referred to as ’robust’
variance estimates (because they are robust to nonspherical errors) or as ’sand-
wich’ variance estimates, because you sandwich an estimate Xd 0 ΩX inside a pair
−10
of (X 0 X) ’s. For the same reason as above, you cannot substitute ee0 for Ω,
because you’d get Xd 0 ΩX = X 0 ee0 X = 0.

(b) General Heteroskedastic errors. Imagine that errors are not correlated with each
other, but they don’t have identical variances. We use the Eicker-White Hetero-
robust variance estimator.
i. First, restrict Ω to be a diagonal matrix with diagonal elements σi2 and off-
diagonal elements equal to 0. This is the structure you have imposed on the
model: diagonal Ω.
ii. Then, construct an estimate of X 0 ΩX that satisfies this structure:
0 ΩX = X 0 DX
Xd
where D is a diagonal matrix with (ei )2 on the main diagonal.
(c) You cannot get a consistent estimate of D, because D has N elements: adding
observations will not increase the precision of the estimate of any element of D.
(d) However, X 0 DX is only KxK, which does not grow in size with N . Recall that
asymptotic variance is equal to the variance divided by N , and it is used because
the variance goes to 0 as the sample size goes to infinity. To talk about variance
as the sample size grows, you have to reflate it by something, in this case N .
(The choice of what to reflate it by underlies much of nonparametric econometric
theory—in some models, you have to reflate by N raised to a power less than 1).
So, h i 1 −1 −1
asy.V β̂OLS = (X 0 X) X 0 ΩX (X 0 X) ,
N
and
h i 1 −1 0 ΩX (X 0 X)−1 =
asy.V̂ β̂OLS = (X 0 X) Xd
N
0
−1 X ΩX −1
d
(X 0 X) (X 0 X) .
N
Consider a model where X = 1, a column of ones. Then,
0 ΩX 0 DX
PN 2
Xd Xd i=1 (ei )
= = .
N N N
12
As N grows, this thing gets closer and closer to the average σi2. .
(e) In Stata, you can get the hetero-robust standard errors as follows:
(f) reg Y X, robust

6. Clustered Standard Errors.

(a) Imagine that within groups of observations stratified by a grouping variable g,

errors are correlated, but across groups, they are P
not. Let g = 1, .., G and let each
group have ng observations (and note that N = g ng ).
i. For example, suppose your data are drawn from a multistage sample wherein
first we sample city blocks and then we sample individuals within those city
blocks. The individuals on the same block might actually know each other,
and thus have correlations across each other.
ii. Or, suppose that you are interested in networks and your underlying data
are people, but the data you run regressions on compare distances between
pairs of people in different groups and the interactions among pairs of people
in those groups. Then, you would almost certainly have correlations between
people within groups.
iii. Or, suppose that you have data on patients in hospitals, many patients in
each of many hospitals. Certainly, diseases can travel from person to person
in a hospital, but less so across them. So, you’d expect correlations across
patients in hospitals but not across them.
(b) In this environment, Ω has a blockish structure. Sort the data by groups. The
across group blocks of Ω are 0, but the within-group blocks of Ω are non-zero.
The upper-left block is an n1 xn1 symmetric matrix with unknown elements. To
its right, we find an n1 x(N − n1 ) matrix of zeroes, and below it, we find an
(N − n1 )xn1 matrix of zeroes. The next diagonal block is an n2 xn2 symmetric
matrix with unknown elements. And so it goes.
(c) This is very like the random-effect model, except that we have not imposed the
further structure that within-group correlations are all the same and all the groups
are the same.
i. So, it is like a random-effects structure, but much more general. Unfortu-
nately, it is so much more general that we cannot use FGLS as we would with
random-effects. P The reason is that FGLS requires a consistent estimate of Ω.
This model has g ng (ng + 1) /2 parameters, which increases faster than N .
So, we cannot construct a consistent estimate of Ω.
(d) So, analogous to the hetero-robust standard errors, we use the clustered variance
estimator. We construct an estimate of X 0 ΩX that is consistent with the structure
we’ve imposed. In particular, construct
0 ΩX = X 0 CX
Xd

13
where C is block diagonal, with elements equal to ei ej (or their average) in the
blocks and zero elsewhere. Then,
h i 1 −1 −1
asy.V̂ β̂OLS = (X 0 X) X 0 CX (X 0 X) .
N

(e) In Stata,
(f) reg Y X, cluster(g)
(g) A question for you: why can’t we go maximally general, and have just 1 big
cluster?

FECO Note 2 - Simple Linear Regression: Xuan Chinh Mai
No ratings yet
FECO Note 2 - Simple Linear Regression: Xuan Chinh Mai
7 pages
EC2C4 Econometrics II
No ratings yet
EC2C4 Econometrics II
56 pages
OLS Matrix Analysis for Statisticians
No ratings yet
OLS Matrix Analysis for Statisticians
14 pages
Properties of The OLS Estimator: Quantitative Methods 2
No ratings yet
Properties of The OLS Estimator: Quantitative Methods 2
57 pages
L5 2025 Spring
No ratings yet
L5 2025 Spring
40 pages
Chapter 6: Regression
No ratings yet
Chapter 6: Regression
7 pages
Lecture 24: Weighted and Generalized Least Squares 1 Weighted Least Squares
No ratings yet
Lecture 24: Weighted and Generalized Least Squares 1 Weighted Least Squares
8 pages
OLS Estimator: Key Statistical Insights
No ratings yet
OLS Estimator: Key Statistical Insights
12 pages
Violations of Classical Linear Regression Assumptions Mis-Specification
No ratings yet
Violations of Classical Linear Regression Assumptions Mis-Specification
7 pages
Emet2007 Notes
No ratings yet
Emet2007 Notes
6 pages
Simple Linear Regression Model
No ratings yet
Simple Linear Regression Model
6 pages
Lecture Slides On Statistics at Uni St. Gallen
No ratings yet
Lecture Slides On Statistics at Uni St. Gallen
20 pages
Understanding Simple Linear Regression
No ratings yet
Understanding Simple Linear Regression
30 pages
Trix - Post Midsem Merged
No ratings yet
Trix - Post Midsem Merged
49 pages
LLICO2b ECO1 English
No ratings yet
LLICO2b ECO1 English
15 pages
Wooldridge Notes
No ratings yet
Wooldridge Notes
15 pages
Week 3-4
No ratings yet
Week 3-4
75 pages
Univariate Regression with OLS Analysis
No ratings yet
Univariate Regression with OLS Analysis
72 pages
Econometrics Handout Session 2
No ratings yet
Econometrics Handout Session 2
18 pages
OLS Estimator Properties and Assumptions
100% (1)
OLS Estimator Properties and Assumptions
23 pages
Week 2, OLS
No ratings yet
Week 2, OLS
83 pages
OLS Estimates: Finite-Sample Properties
No ratings yet
OLS Estimates: Finite-Sample Properties
20 pages
Im ch08
No ratings yet
Im ch08
12 pages
EC501 Lecture 02
No ratings yet
EC501 Lecture 02
27 pages
Week2 Lecture2
No ratings yet
Week2 Lecture2
59 pages
Econometrics for Finance Students
No ratings yet
Econometrics for Finance Students
64 pages
Panel Data
No ratings yet
Panel Data
14 pages
Linear Regression Analysis: Module - Vii
No ratings yet
Linear Regression Analysis: Module - Vii
10 pages
Classical Linear Regression and Its Assumptions
No ratings yet
Classical Linear Regression and Its Assumptions
63 pages
Econometrics Regression Insights
No ratings yet
Econometrics Regression Insights
20 pages
Multiple Regression Analysis Guide
No ratings yet
Multiple Regression Analysis Guide
11 pages
Multiple Regression Analysis: y + X + X + - . - X + U
No ratings yet
Multiple Regression Analysis: y + X + X + - . - X + U
26 pages
SLRM Note
No ratings yet
SLRM Note
15 pages
Lecture 11 - Stochastic Regressors Measurement Errors
No ratings yet
Lecture 11 - Stochastic Regressors Measurement Errors
6 pages
統計摘要
No ratings yet
統計摘要
12 pages
K-Variable Linear Regression Model Explained
No ratings yet
K-Variable Linear Regression Model Explained
22 pages
Gauss-Markov Theorem
No ratings yet
Gauss-Markov Theorem
5 pages
Lecture 6
No ratings yet
Lecture 6
45 pages
Econometrics: CLM & OLS Basics
No ratings yet
Econometrics: CLM & OLS Basics
11 pages
Finite-Sample OLS Analysis
No ratings yet
Finite-Sample OLS Analysis
35 pages
Lect 6
No ratings yet
Lect 6
20 pages
CH 03
No ratings yet
CH 03
17 pages
Multiple Linear Reegression
No ratings yet
Multiple Linear Reegression
21 pages
OLS Regression and Gauss-Markov Assumptions
No ratings yet
OLS Regression and Gauss-Markov Assumptions
40 pages
Chapter3
No ratings yet
Chapter3
52 pages
Chapter 11 Lecture Notes .
No ratings yet
Chapter 11 Lecture Notes .
22 pages
Econometrics
No ratings yet
Econometrics
13 pages
Cameron & Trivedi 2005 Microeconometrics Methods and Applications Solutions
0% (3)
Cameron & Trivedi 2005 Microeconometrics Methods and Applications Solutions
19 pages
Cameron & Trivedi - Solution Manual Cap. 4-5
0% (1)
Cameron & Trivedi - Solution Manual Cap. 4-5
12 pages
Classical Multiple Regression
No ratings yet
Classical Multiple Regression
5 pages
Multiple Regression Model and Multicollinearity
No ratings yet
Multiple Regression Model and Multicollinearity
25 pages
2023 Foundation English Test Test Format Overview
No ratings yet
2023 Foundation English Test Test Format Overview
8 pages
Rata-Rata Konsumsi Rumah Tangga Untuk Telekomunikasi Menurut Provinsi, 2017-2019
No ratings yet
Rata-Rata Konsumsi Rumah Tangga Untuk Telekomunikasi Menurut Provinsi, 2017-2019
2 pages
PSYC3001 2024 Assignment 1
No ratings yet
PSYC3001 2024 Assignment 1
5 pages
Rapid Survey
No ratings yet
Rapid Survey
11 pages
Chemistry Lab: Voltaic Cells Analysis
No ratings yet
Chemistry Lab: Voltaic Cells Analysis
7 pages
Resilience Lessons from Louis Zamperini
No ratings yet
Resilience Lessons from Louis Zamperini
2 pages
Dependence vs. Independence in Society
No ratings yet
Dependence vs. Independence in Society
2 pages
GCCRP Assignment 2 Naman Sharma
No ratings yet
GCCRP Assignment 2 Naman Sharma
5 pages
Railway Apprentice Recruitment 2023
No ratings yet
Railway Apprentice Recruitment 2023
10 pages
PQRS Practice Questions
No ratings yet
PQRS Practice Questions
13 pages
686 Mitigating The Effects of Arcs in MV Switchgear
100% (2)
686 Mitigating The Effects of Arcs in MV Switchgear
75 pages
Detailed Course Outline For Term - V - 2016-17
0% (1)
Detailed Course Outline For Term - V - 2016-17
124 pages
101 Final Exam
No ratings yet
101 Final Exam
5 pages
Veterinary and Pharma Directory
No ratings yet
Veterinary and Pharma Directory
16 pages
Usais Pamphlet 350-6 Expert Infantryman Badge
No ratings yet
Usais Pamphlet 350-6 Expert Infantryman Badge
84 pages
Ajoy Roy
No ratings yet
Ajoy Roy
3 pages
(Ebook) American Literature in Context To 1865 by Susan Castillo ISBN 9781405188647, 1405188642 PDF Download
100% (1)
(Ebook) American Literature in Context To 1865 by Susan Castillo ISBN 9781405188647, 1405188642 PDF Download
57 pages
Hydrodynamic Derivatives - Inoue - Hirano
No ratings yet
Hydrodynamic Derivatives - Inoue - Hirano
14 pages
Alchemist by Ben Johnson
No ratings yet
Alchemist by Ben Johnson
1 page
Dolinen, Kimberly B. Compare The Cultures of Philippines and Indonesia
No ratings yet
Dolinen, Kimberly B. Compare The Cultures of Philippines and Indonesia
2 pages
Lembar Kerja Siswa Procedure Text
No ratings yet
Lembar Kerja Siswa Procedure Text
8 pages
Crim Pro Cases - Notes
No ratings yet
Crim Pro Cases - Notes
8 pages
Krebs Cycle: Steps, Products, Significance
No ratings yet
Krebs Cycle: Steps, Products, Significance
13 pages
Khalid Shah Proposal
No ratings yet
Khalid Shah Proposal
12 pages
Principles of Pest Management - 2024
No ratings yet
Principles of Pest Management - 2024
23 pages
Smear Preparation and Staining: Unit Intended Learning Outcomes
No ratings yet
Smear Preparation and Staining: Unit Intended Learning Outcomes
5 pages
Chalk Liner Project Overview
No ratings yet
Chalk Liner Project Overview
12 pages
Ge Elec 1-3
No ratings yet
Ge Elec 1-3
10 pages
International A Level Pure Maths Exam
No ratings yet
International A Level Pure Maths Exam
36 pages
55566643555801.the Horn of Evenwood
100% (1)
55566643555801.the Horn of Evenwood
26 pages
Precalculus Syllabus
No ratings yet
Precalculus Syllabus
10 pages
4+of+4+Google+SketchUp+for+Interior+Design+and+Space+Planning+ How+to+Communicate+Your+Ideas+in+a+Convincing+Way+ (Book+4+PREVIEW)
100% (1)
4+of+4+Google+SketchUp+for+Interior+Design+and+Space+Planning+ How+to+Communicate+Your+Ideas+in+a+Convincing+Way+ (Book+4+PREVIEW)
14 pages
Introduction to Telecommunication Engineering
No ratings yet
Introduction to Telecommunication Engineering
34 pages
SF2 Daily Attendance Report 2020
No ratings yet
SF2 Daily Attendance Report 2020
2 pages

Variance of OLS with Non-Spherical Errors

Uploaded by

Variance of OLS with Non-Spherical Errors

Uploaded by

Non-Spherical Errors

2. Consider the estimated OLS parameter vector

4. How do we know it is the lowest variance linear unbiased estimator? [Gauss-Markov

and note that E[ε] = 0N =⇒ E [X 0 ε] = 0K for fixed X.

(a) However, we have a sample analog: the sample residual e:

6. So how exactly does e relate to ε?

7. Matrices like I − X (X 0 X)−1 X 0 and X (X 0 X)−1 X 0 are called projection matrices,

and they come up a lot.

and the OLS residuals Y − X β̂OLS as

8. Getting back to σ 2 and our estimate of it, compute e0 e in terms of ε:

because ε0 P X ε = ε0 PX0 PX ε, which is the sum of squares of PX ε. PX has rank K and

(a) size of circle is amount of variation.

10. Precision is good. Lowh variance

(a) The variance of ε is the variance of Y conditional on X. Less variation of Y

Since X2 is uncorrelated with ε by assumption, MX2 ε = ε − X2 (X20 X2 )−1 X20 ε is also

E[X̂10 MX2 ε] = E[X10 MX0 2 MX2 ε] = E[X10 MX2 ε]

= E[X10 ε] − E[X10 X2 (X20 X2 )−1 X20 ε] = 0 − X10 X2 (X20 X2 )−1 E[X20 ε] = 0

(a) The first implication of spherical errors is

for all i = 1, ..., N , which we usually call homoskedasticity. Homoskedastic errors

2. OLS is inefficient if errors are nonspherical. This is easy to see by example.

Here, the notation ιK indicates a K−vector of ones. Thus, ιN −2 ι0N −2 is an

3 Generalised Least Squares

Ω−1/2 Y = Ω−1/2 Xβ + Ω−1/2 ε

E Ω−1/2 εε0 Ω−1/2 = Ω−1/2 E [εε0 ] Ω−1/2

3.1 Group Means Regressands

3.2 Feasible Generalized Least Squares

2. FGLS uses two steps:

(a) Get a consistent estimate Ω̂ of Ω.

3. The Random Effects Model uses FGLS

(a) Assume that

Yit = Xit β + θi + εit

Yit = Xit β + uit

If Ω = σ 2 IN , a pair of X 0 X’s cancel leaving σ 2 (X 0 X)−1 . If not, then not.

So, that’s not gonna work.

6. Clustered Standard Errors.

(a) Imagine that within groups of observations stratified by a grouping variable g,

You might also like