0% found this document useful (0 votes)
90 views53 pages

Econometrics Notes

This document provides an overview of econometrics. It discusses how econometrics combines economic theory, data, and statistics to scientifically study economic phenomena. Econometrics allows economists to test hypotheses about relationships between economic variables even when data is non-experimental. It does this by using statistical methods that account for the effect of other unspecified factors, as suggested by economic theory. Econometrics is thus key to establishing economics as a social science based on the scientific method.

Uploaded by

xhardy27
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
90 views53 pages

Econometrics Notes

This document provides an overview of econometrics. It discusses how econometrics combines economic theory, data, and statistics to scientifically study economic phenomena. Econometrics allows economists to test hypotheses about relationships between economic variables even when data is non-experimental. It does this by using statistical methods that account for the effect of other unspecified factors, as suggested by economic theory. Econometrics is thus key to establishing economics as a social science based on the scientific method.

Uploaded by

xhardy27
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 53

Chapter 1

What Is Econometrics
1.1 Economics
1.1.1 The Economic Problem
Economics concerns itself with how best to satisfy unlimited wants with lim-
ited resources. As such it proposes solutions that involve the production and
consumption of goods using a particular allocation of resources. These solu-
tions are usually the results of optimizing behavior by economics agents such as
consumers or producers.
1.1.2 Some Economic Questions
Should the Eurozone countries co-ordinate their scal as well as monetary poli-
cies? What is the best time for Professor Brown to retire? Does adopting
green restictions on production and consumption help or hinder the economy?
1.1.3 Economics as a Science
Science can be dened as the intellectual and practical activity encompassing
the systematic study of the structure and behavior of the physical and natural
world through observation and experiment. The purpose of the experiment is to
control for various factors that might vary when the environment is not managed
as in an experiment. This allows the observer to focus on the relationship
between a limited number of variables.
So to be a science economics must be founded on the notion of studying and
explaining the world through observation and experimentation. Economics is
classied as a social science since it is concerned with social phenomena such as
organization and operation of the household or society.
1
2 CHAPTER 1. WHAT IS ECONOMETRICS
1.1.4 Providing Scientic Answers
As such, economics would attempt to provide answers to the above questions, for
example, using scientic approaches that would yield similar answers in similar
circumstances. In that sense, it attempts to bring the replicability of scientic
experiments into the analysis and formulation of answers. This is accomplished
by combining economic theory with the analysis of pre-existing relevant data.
1.2 Data
1.2.1 Accounting
Most of the data used in econmetrics is accounting data. Accounting data
is routinely recorded data (that is, records of transactions) as part of market
activities. Accounting data has many shortcomings, since it is not collected
specically for the purposes of the econometrician. The available data may not
measure the phenomenon we are interested in explaining but a related one.
1.2.2 Nonexperimental
Moreover, the data is typically nonexperimental. The econometrician does not
have any control over nonexperimental data. In this regard economics is in
much the same situation as meteorology and astronomy. This nonexperimental
nature of the data causes various problems. Econometrics is largely concerned
with how to manage these problems in such a fashion that we can have the same
conndence in our ndings as when experimental data is used.
1.2.3 Data Types
Denition 1.1 Time-series data are data that represent repeated observations
of some variable in subseqent time periods. A time-series variable is often sub-
scripted with the letter t. 2
Denition 1.2 Cross-sectional data are data that represent a set of obser-
vations of some variable at one specic instant over several agents. A cross-
sectional variable is often subscripted with the letter i. 2
Denition 1.3 Time-series cross-sectional data are data that are both time-
series and cross-sectional. 2
A special case of time-series cross-sectional data is panel data. Panel data
are observations on the same set of agents or a panel over time.
1.3. MODELS 3
1.2.4 Empirical Regularities
We need to look at the data to detect regularities. These regularities that persist
across time or individuals beg for explanation of some sort. Why do the occur
and why do they persist? An example would be the observation that the savings
rate in the U.S. was very low during the record expansion of the nineties and
has risen during the subsequent great recession. In economics we explain the
regularities with models.
1.3 Models
1.3.1 Simplifying Assumptions
Models are simplications of the real world. As such they introduce simplifying
assumptions that hold a number of factors constant and allows the analysis
of the relationship between a limited number of variables. These simplifying
assumptions are the factors that would be controled if we were conducting a
controlled experiment to verify the model.
Models are sometimes criticised as unrealistic but if they were realistic they
would not be models but in fact represent reality. The real question is whether
the simplictions introduced by the assumptions are useful for dealing with the
phenomena we attempt to explain. Hopefully, we will provide some metric of
what is a useful model.
1.3.2 Economic Theory
As a result of optimizing behavior by consumers, producers, or other agents in
the economy, economic theory will suggest how certain variables will be related.
For example demand for a consumer good can be represented as a function of
its own price, prices of complimentary goods, prices of competing goods, and
the income level of the consumer.
1.3.3 Equations
Economic models are usually stated in terms of one or more equations. For
example, is a simple macroeconomic model we might represent C
t
(aggregate
consumption at time t) as a linear function of 1
t
(aggregate income):
C
t
= c +,1
t
where c and , are parameters to be determined and their values the possibly to
target of inquiry. The choice of the linear model is usually a simplication but
the variables that go into the model will typically follow directly from economic
theory.
4 CHAPTER 1. WHAT IS ECONOMETRICS
1.3.4 Error Component
Because none of our models are exact, we include an error component in our
equations, usually denoted n
i
. Thus we have
C
t
= c +,1
t
+n
t
This error term could be viewed as an approximation error, a misspecication
error, or result from errors in optimization by economic agents. In any event
the linear parametric relationship is not exact for any choice of the parameters.
1.4 Statistics
1.4.1 Stochastic Component
In order to conduct analysis of the model once the error term has been intro-
duced, we assume that it is a random variable and subject to distributional
restrictions. For example, we might assume that n
t
is an i.i.d. variable with
E[n
t
j1
t
] = 0
E[n
2
t
j1
t
] = o
2
.
Again, this is a modeling assumption and may be a useful simplication or
not. We have combined economic theory to explain the systematic part of the
behavior and statistical theory to characterize the unsystematic part.
1.4.2 Statistical Analysis
The introduction of the stochastic assumptions on the error component allows
the use of statistical techniques to perform various tasks in analyzing the com-
bined model.
1.4.3 Tasks
There are three main tasks of econometrics:
1. Estimating the parameters of the model. These parameters would include
the parameters of the systematic part of the model and the parameters
of the distribution of the error component. These estimates are typically
point estimates.
2. Hypothesis tests concerning the behavior of the model. This where the
approach really becomes scientic. We might be interested in whether
accumulation of wealth is a an adequate explanation for the low saving
rate situation mentioned above. This could be tested using an hypothesis
test.
1.5. INTERACTION 5
3. Forecast the behavior of the dependent variable outside the sample us-
ing the estimated model. Such forecasts can be point forecasts such as
expected behavior or interval forecasts which give a range of values with
an attached probability of occurance. Usually, these forecasts will be
conditional on knowledge of the values of the explanatory variables.
1.4.4 Econometrics
Econometrics is the combination of economic theory together with statistical
theory as a way of explaining the regularities in the data. Since our data is, in
general, nonexperimental, econometrics makes use of economic theory to adjust
for the lack of proper data. If we were conducting an experiment, we would
be able to control the various other factors and focus on the variables whose
relationship interests us. With non-experimental data we must make statistical
assumptions to eectively control for the eect of these other factors suggested
by economic theory.
1.5 Interaction
1.5.1 Role Of Econometrics
The three components of econometrics are:
1. data;
2. economic theory;
3. statistics;
These components are interdependent, and each helps shape the others.
1.5.2 Scientic Method
In the scientic method, we take a hypothesis, construct an experiment with
which to test the hypothesis, and based on the results of the hypothesis we
either reject the hypothesis or not. The purpose of the experiment is to control
for various factors that might impact the outcome but are not of interest. This
allows us to focus our attention on the relationship between the variables of
interest that result from the experiment. Based on the outcomes for these
variables of interest, we either reject or not.
In econometrics, we let the systematic model and the statistical model or
some specied component of them to play the role of the hypothesis. The avail-
able data, ex post plays the role of the outcome of an experiment. Statistical
analysis of the data using the combined model enabels us to reject the hypoth-
esis or not. So econometrics is what allows us to use the scientic method and
hence is critical in economics being treated as a science.
6 CHAPTER 1. WHAT IS ECONOMETRICS
1.5.3 Ocam's Razor
Often in econometrics, we are faced with the problem of choosing one model
over an alternative. The simplest model that ts the data is often the best
choice. This is known as using Ocam's razor.
Chapter 2
Some Useful Distributions
2.1 Introduction
2.1.1 Inferences
Statistical Statements
As statisticians, we are often called upon to answer questions or make statements
concerning certain random variables. For example: is a coin fair (i.e. is the
probability of heads = 0.5) or what is the expected value of GNP for the quarter.
Population Distribution
Typically, answering such questions requires knowledge of the distribution of
the random variable. Unfortunately, we usually do not know this distribution
(although we may have a strong idea, as in the case of the coin).
Experimentation
In order to gain knowledge of the distribution, we draw several realizations of
the random variable. The notion is that the observations in this sample contain
information concerning the population distribution.
Inference
Denition 2.1 The process by which we make statements concerning the pop-
ulation distribution based on the sample observations is called inference. 2
Example 2.1 We decide whether a coin is fair by tossing it several times and
observing whether it seems to be heads about half the time. 2
7
8 CHAPTER 2. SOME USEFUL DISTRIBUTIONS
2.1.2 Random Samples
Suppose we draw n observations of a random variable, denoted {x
1
, x
2
, ..., x
n
}
and each x
i
is independent and has the same (marginal) distribution, then
{x
1
, x
2
, ..., x
n
} constitute a simple random sample.
Example 2.2 We toss a coin three times. Supposedly, the outcomes are inde-
pendent. If x
i
counts the number of heads for toss i, then we have a simple
random sample. 2
Note that not all samples are simple random.
Example 2.3 We are interested in the income level for the population in gen-
eral. The n observations available in this class are not identical since the higher
income individuals will tend to be more variable. 2
Example 2.4 Consider the aggregate consumption level. The n observations
available in this set are not independent since a high consumption level in one
period is usually followed by a high level in the next. 2
2.1.3 Sample Statistics
Denition 2.2 Any function of the observations in the sample which is the
basis for inference is called a sample statistic. 2
Example 2.5 In the coin tossing experiment, let S count the total number of
heads and P =
S
3
count the sample proportion of heads. Both S and P are
sample statistics. 2
2.1.4 Sample Distributions
A sample statistic is a random variable its value will vary from one experiment
to another. As a random variable, it is subject to a distribution.
Denition 2.3 The distribution of the sample statistic is the sample distribu-
tion of the statistic. 2
Example 2.6 The statistic S introduced above has a multinomial sample dis-
tribution. Specically Pr(S = 0) = 1/8, Pr(S = 1) = 3/8, Pr(S = 2) = 3/8, and
Pr(S = 3) = 1/8. 2
2.2. NORMALITY AND THE SAMPLE MEAN 9
2.2 Normality And The Sample Mean
2.2.1 Sample Sum
Consider the simple random sample {x
1
, x
2
, ..., x
n
}, where x measures the height
of an adult female. We will assume that Ex
i
= and Var(x
i
) =
2
, for all
i = 1, 2, ..., n
Let S = x
1
+x
2
+ +x
n
denote the sample sum. Now,
ES = E( x
1
+x
2
+ +x
n
) = Ex
1
+ Ex
2
+ + Ex
n
= n (2.1)
Also,
Var(S) = E( S ES )
2
= E( S n)
2
= E( x
1
+x
2
+ +x
n
n)
2
= E
"
n
X
i=1
( x
i
)
#
2
= E[( x
1
)
2
+ ( x
1
)( x
2
) + +
( x
2
)( x
1
) + ( x
2
)
2
+ + ( x
n
)
2
]
= n
2
. (2.2)
Note that E[( x
i
)( x
j
)] = 0 by independence.
2.2.2 Sample Mean
Let x =
S
n
denote the sample mean or average.
2.2.3 Moments Of The Sample Mean
The mean of the sample mean is
Ex = E
S
n
=
1
n
ES =
1
n
n = . (2.3)
The variance of the sample mean is
Var(x) = E(x )
2
= E(
S
n
)
2
= E[
1
n
( S n) ]
2
=
1
n
2
E( S n)
2
10 CHAPTER 2. SOME USEFUL DISTRIBUTIONS
=
1
n
2
n
2
=

2
n
. (2.4)
2.2.4 Sampling Distribution
We have been able to establish the mean and variance of the sample mean.
However, in order to know its complete distribution precisely, we must know
the probability density function (pdf) of the random variable x.
2.3 The Normal Distribution
2.3.1 Density Function
Denition 2.4 A continuous random variable x with the density function
f ( x) =
1

2
2
e

1
2
2
( x)
2
(2.5)
follows the normal distribution, where and
2
are the mean and variance of
x, respectively.2
Since the distribution is characterized by the two parameters and
2
, we
denote a normal random variable by x
i
N( ,
2
).
The normal density function is the familiar bell-shaped curve, as is shown
in Figure 2.1 for = 0 and
2
= 1. It is symmetric about the mean .
Approximately
2
3
of the probability mass lies within of and about .95
lies within 2. There are numerous examples of random variables that have
this shape. Many economic variables are assumed to be normally distributed.
2.3.2 Linear Transformation
Consider the transformed random variable
Y = a +bx
We know that

Y
= EY
i
= a +b
x
and

2
Y
= E(Y
Y
)
2
= b
2

2
x
If x is normally distributed, then Y is normally distributed as well. That is,
Y N(
Y
,
2
Y
)
2.3. THE NORMAL DISTRIBUTION 11
Figure 2.1: The Standard Normal Distribution
Moreover, if x
i
N(
x
,
2
x
) and z
i
N(
z
,
2
z
) are independent, then
Y = a +bx +cz N( a +b
x
+c
z
, b
2

2
x
+c
2

2
z
)
These results will be formally demonstrated in a more general setting in the
next chapter.
2.3.3 Distribution Of The Sample Mean
If, for each i = 1, 2, . . . , n, the x
i
s are independent, identically distributed
(i.i.d.) normal random variables, then
x N(
x
,

2
x
n
) (2.6)
2.3.4 The Standard Normal
The distribution of x will vary with dierent values of
x
and
2
x
, which is
inconvenient. Rather than dealing with a unique distribution for each case, we
12 CHAPTER 2. SOME USEFUL DISTRIBUTIONS
perform the following transformation:
Z =
x
x
q

2
x
n
=
x
q

2
x
n


x
q

2
x
n
(2.7)
Now,
EZ =
Ex
q

2
x
n


x
q

2
x
n
=

x
q

2
x
n


x
q

2
x
n
= 0.
Also,
Var( Z ) = E

x
q

2
x
n


x
q

2
x
n

2
= E

2
x
( x
x
)
2

=
n

2
x

2
x
n
= 1.
Thus Z N(0, 1). The N( 0, 1 ) distribution is the standard normal and is well-
tabulated. The probability density function for the standard normal distribution
evaluated at point z is
f ( z ) =
1

2
e

1
2
z
2
= (z) (2.8)
2.4 The Central Limit Theorem
2.4.1 Normal Theory
The normal density has a prominent position in statistics. This is not only
because many random variables appear to be normal, but also because most
any sample mean appears normal as the sample size increases.
Specically, suppose x
1
, x
2
, . . . , x
n
is a simple random sample, Ex
i
=
x
,
and Ex
2
i
=
2
x
, then as n , the distribution of x, suitably transformed,
2.5. DISTRIBUTIONS ASSOCIATEDWITHTHE NORMAL DISTRIBUTION13
becomes normal. Specically, let f
n
() denote the probability density function
(PDF) of Z, then
lim
n
f
n
( c ) = ( c ) (2.9)
pointwise for any point of evaluation c. Likewise, for F
n
() denoting the analo-
gous cumulative distribution function (CLT) of Z and ( c ) denoting the CDF
of the standard normal, then
lim
n
F
n
( c ) = ( c )
pointwise at any point of evaluation c. This is the Lindberg-Levy form of the
central limit theorem and will be treated more extensively in Chapter 4.
2.5 Distributions Associated With The Normal
Distribution
2.5.1 The Chi-Squared Distribution
Denition 2.5 Suppose that Z
1
, Z
2
, . . . , Z
m
is a simple random sample, and
Z
i
N( 0, 1 ). Then
m
X
i=1
Z
2
i
X
2
m
, (2.10)
where m are the degrees of freedom of the Chi-squared distribution. 2
The probability density function for the X
2
m
is
f

2( x) =
1
2
n/2
( m/2 )
x
m/21
e
x/2
, x > 0 (2.11)
where (x) is the gamma function. See Figure 2.2. If x
1
, x
2
, . . . , x
m
is a simple
random sample, and x
i
N(
x
,
2
x
), then
m
X
i=1

x
i

2
X
2
m
. (2.12)
The chi-squared distribution will prove useful in testing hypotheses on both
the variance of a single variable and the (conditional) means of several. This
multivariate usage will be explored in the next chapter.
Example 2.7 Consider the estimate of
2
s
2
=
P
n
i=1
( x
i
x)
2
n 1
.
Then
( n 1 )
s
2

2
X
2
n1
. (2.13)
2
14 CHAPTER 2. SOME USEFUL DISTRIBUTIONS
Figure 2.2: Some Chi-Squared Distributions
2.5.2 The t Distribution
Denition 2.6 Suppose that Z N( 0, 1 ), Y v
2
m
, and that Z and Y are
independent. Then
Z
q
Y
k
t
m
, (2.14)
where k are the degrees of freedom of the t distribution. 2
The probability density function for a t random variable with m degrees of
freedom is
f
t
( x) =

m+1
2

m
2

1 +
x
2
m

(m+1)/2
, (2.15)
for < x < . See Figure 2.3
The t (also known as Students t) distribution, is named after W.S. Gosset,
who published under the pseudonym Student. It is useful in testing hypotheses
concerning the (conditional) mean when the variance is estimated.
Example 2.8 Consider the sample mean from a simple random sample of nor-
mals. We know that x N( ,
2
/n) and
Z =
x
q

2
n
N( 0, 1 ).
2.5. DISTRIBUTIONS ASSOCIATEDWITHTHE NORMAL DISTRIBUTION15
Figure 2.3: Some t Distributions
Also, we know that
Y = ( n 1 )
s
2

2
X
2
n1
,
where s
2
is the unbiased estimator of
2
. Thus, if Z and Y are independent
(which, in fact, is the case), then
Z
q
Y
( n1 )
=
x

2
n
r
( n1 )
s
2

2
( n1 )
= ( x )
s
n

2
s
2

2
=
x
q
s
2
n
t
n1
(2.16)
2
16 CHAPTER 2. SOME USEFUL DISTRIBUTIONS
2.5.3 The F Distribution
Denition 2.7 Suppose that Y X
2
m
, W X
2
n
, and that Y and W are inde-
pendent. Then
Y
m
W
n
F
m,n
, (2.17)
where m,n are the degrees of freedom of the F distribution. 2
The probability density function for a F random variable with m and n
degrees of freedom is
f
F
( x) =

m+n
2

(m/n)
m/2

m
2

n
2

x
(m/2)1
(1 +mx/n)
(m+n)/2
(2.18)
The F distribution is named after the great statistician Sir Ronald A. Fisher,
and is used in many applications, most notably in the analysis of variance. This
situation will arise when we seek to test multiple (conditional) mean parameters
with estimated variance. Note that when x t
n
then x
2
F
1,n
. Some
examples of the F distribution can be seen in Figure 2.4.
Figure 2.4: Some F Distributions
Chapter 3
Multivariate Distributions
3.1 Matrix Algebra Of Expectations
3.1.1 Moments of Random Vectors
Let

x
1
x
2
.
.
.
x
m

= x
be an m 1 vector-valued random variable. Each element of the vector is a
scalar random variable of the type discussed in the previous chapter.
The expectation of a random vector is
E[ x] =

E[ x
1
]
E[ x
2
]
.
.
.
E[ x
m
]

2
.
.
.

= . (3.1)
Note that is also an m1 column vector. We see that the mean of the vector
is the vector of the means.
Next, we evaluate the following:
E[( x )( x )
0
]
= E

( x
1

1
)
2
( x
1

1
)( x
2

2
) ( x
1

1
)( x
m

m
)
( x
2

2
)( x
1

1
) ( x
2

2
)
2
( x
2

2
)( x
m

m
)
.
.
.
.
.
.
.
.
.
.
.
.
( x
m

m
)( x
1

1
) ( x
m

m
)( x
2

2
) ( x
m

m
)
2

17
18 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS
=

11

12

1m

21

22

2m
.
.
.
.
.
.
.
.
.
.
.
.

m1

m2

mm

= . (3.2)
, the covariance matrix, is an m m matrix of variance and covariance
terms. The variance
2
i
=
ii
of x
i
is along the diagonal, while the cross-product
terms represent the covariance between x
i
and x
j
.
3.1.2 Properties Of The Covariance Matrix
Symmetric
The variance-covariance matrix is a symmetric matrix. This can be shown
by noting that

ij
= E( x
i

i
)( x
j

j
) = E( x
j

j
)( x
i

i
) =
ji
.
Due to this symmetry will only have m(m+ 1)/2 unique elements.
Positive Semidenite
is a positive semidenite matrix. Recall that any m m matrix is posi-
tive semidenite if and only if it meets any of the following three equivalent
conditions:
1. All the principle minors are nonnegative;
2.
0
0, for all
|{z}
m1
6= 0;
3. = PP
0
, for some P
|{z}
mm
.
The rst condition (actually we use negative deniteness) is useful in the study
of utility maximization while the latter two are useful in econometric analysis.
The second condition is the easiest to demonstrate in the current context.
Let 6= 0. Then, we have

0
=
0
E[( x )( x )
0
]
= E[
0
( x )( x )
0
]
= E{ [
0
( x )]
2
} 0,
3.2. CHANGE OF VARIABLES 19
since the term inside the expectation is a quadratic. Hence, is a positive
semidenite matrix.
Note that P satisfying the third relationship is not unique. Let D be any
m m orthonormal martix, then DD
0
= I
m
and P

= PD yields P

P
0
=
PDD
0
P
0
= PI
m
P
0
= . Usually, we will choose P to be an upper or lower
triangular matrix with m(m+ 1)/2 nonzero elements.
Positive Denite
Since is a positive semidenite matrix, it will be a positive denite matrix if
and only if det() 6= 0. Now, we know that = PP
0
for some m m matrix
P. This implies that det(P) 6= 0.
3.1.3 Linear Transformations
Let y
|{z}
m1
= b
|{z}
m1
+ B
|{z}
mm
m1
z}|{
x . Then
E[ y] = b +BE[ x]
= b +B
=
y
(3.3)
Thus, the mean of a linear transformation is the linear transformation of the
mean.
Next, we have
E[ ( y
y
)( y
y
)
0
] = E{[ B( x )][ (B( x ))
0
]}
= BE[( x )( x )
0
]B
0
= BB
0
= BB
0
(3.4)
=
y
(3.5)
where we use the result ( ABC )
0
= C
0
B
0
A
0
, if conformability holds.
3.2 Change Of Variables
3.2.1 Univariate
Let x be a random variable and f
x
() be the probability density function of x.
Now, dene y = h(x), where
h
0
( x) =
d h( x)
d x
> 0.
20 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS
That is, h( x) is a strictly monotonically increasing function and so y is a one-
to-one transformation of x. Now, we would like to know the probability density
function of y, f
y
(y). To nd it, we note that
Pr( y h( a ) ) = Pr( x a ), (3.6)
Pr( x a ) =
Z
a

f
x
( x) dx = F
x
( a ), (3.7)
and,
Pr( y h( a ) ) =
Z
h( a)

f
y
( y ) dy = F
y
( h( a )), (3.8)
for all a.
Assuming that the cumulative density function is dierentiable, we use (3.6)
to combine (3.7) and (3.8), and take the total dierential, which gives us
dF
x
( a ) = dF
y
( h( a ))
f
x
( a )da = f
y
( h( a ))h
0
( a )da
for all a. Thus, for a small perturbation,
f
x
( a ) = f
y
( h( a ))h
0
( a ) (3.9)
for all a. Also, since y is a one-to-one transformation of x, we know that h()
can be inverted. That is, x = h
1
(y). Thus, a = h
1
(y), and we can rewrite
(3.9) as
f
x
( h
1
( y )) = f
y
( y )h
0
( h
1
( y )).
Therefore, the probability density function of y is
f
y
( y ) = f
x
( h
1
( y ))
1
h
0
( h
1
( y ))
. (3.10)
Note that f
y
( y ) has the properties of being nonnegative, since h
0
() > 0. If
h
0
() < 0, (3.10) can be corrected by taking the absolute value of h
0
(), which
will assure that we have only positive values for our probability density function.
3.2.2 Geometric Interpretation
Consider the graph of the relationship shown in Figure 3.1. We know that
Pr[ h( b ) > y > h( a )] = Pr( b > x > a ).
Also, we know that
Pr[ h( b ) > y > h( a )] ' f
y
[ h( b )][ h( b ) h( a )],
3.2. CHANGE OF VARIABLES 21
a b x
h(b)
h(a)
y
h(x)
Figure 3.1: Change of Variables
and
Pr( b > x > a ) ' f
x
( b )( b a ).
So,
f
y
[ h( b )][ h( b ) h( a )] ' f
x
( b )( b a )
f
y
[ h( b )] ' f
x
( b )
1
[ h( b ) h( a )]/( b a )]
(3.11)
Now, as we let a b, the denominator of (3.11) approaches h
0
(). This is then
the same formula as (3.10).
3.2.3 Multivariate
Let
x
|{z}
m1
f
x
( x).
Dene a one-to-one transformation
y
|{z}
m1
=
m1
z}|{
h ( x).
Since h() is a one-to-one transformation, it has an inverse:
x = h
1
( y).
22 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS
We also assume that
h( x)
x
0
exists. This is the mm Jacobian matrix, where
h( x)
x
0
=

h
1
( x)
h
2
( x)
.
.
.
h
m
( x)

(x
1
x
2
x
m
)
=

h1( x)
x1
h2( x)
x1

hm( x)
x1
h
1
( x)
x
2
h
2
( x)
x
2

h
m
( x)
x
2
.
.
.
.
.
.
.
.
.
.
.
.
h1( x)
xm
h2( x)
xm

hm( x)
xm

= J
x
( x) (3.12)
Given this notation, the multivariate analog to (3.11) can be shown to be
f
y
( y) = f
x
[ h
1
( y)]
1
|det(J
x
[ h
1
( y)])|
(3.13)
Since h() is dierentiable and one-to-one then det(J
x
[ h
1
( y)]) 6= 0.
Example 3.1 Let y = b
0
+b
1
x, where x, b
0
, and b
1
are scalars. Then
x =
y b
0
b
1
and
dy
dx
= b
1
.
Therefore,
f
y
( y ) = f
x

y b
0
b
1

1
|b
1
|
. 2
Example 3.2 Let y = b + Bx, where y is an m 1 vector and det(B) 6= 0.
Then
x = B
1
( y b)
and
y
x
0
= B = J
x
( x).
Thus,
f
y
( y) = f
x

B
1
( y b)

1
|det( B)|
. 2
3.3. MULTIVARIATE NORMAL DISTRIBUTION 23
3.3 Multivariate Normal Distribution
3.3.1 Spherical Normal Distribution
Denition 3.1 An m 1 random vector z is said to be spherically normally
distributed if
f(z) =
1
( 2 )
n/2
e

1
2
z
0
z
. 2
Such a random vector can be seen to be a vector of independent standard
normals. Let z
1
, z
2
, . . . , z
m
, be i.i.d. random variables such that z
i
N(0, 1).
That is, z
i
has pdf given in (2.8), for i = 1, ..., m. Then, by independence, the
joint distribution of the z
i
s is given by
f( z
1
, z
2
, . . . , z
m
) = f( z
1
)f( z
2
) f( z
m
)
=
n
Y
i=1
1

2
e

1
2
z
2
i
=
1
( 2 )
m/2
e

1
2

n
i=1
z
2
i
=
1
( 2 )
m/2
e

1
2
z
0
z
, (3.14)
where z
0
= (z
1
z
2
... z
m
).
3.3.2 Multivariate Normal
Denition 3.2 The m 1 random vector x with density
f
x
( x) =
1
( 2 )
m/2
[ det( )]
1/2
e

1
2
( x)
0

1
( x)
(3.15)
is said to be distributed multivariate normal with mean vector and positive
denite covariance matrix . 2
Such a distribution for x is denoted by x N(, ). The spherical normal
distribution is seen to be a special case where = 0 and = I
m
.
There is a one-to-one relationship between the multivariate normal random
vector and a spherical normal random vector. Let z be an m 1 spherical
normal random vector and
x
|{z}
m1
= +Az,
where z is dened above, and det(A) 6= 0. Then,
Ex = +AEz = , (3.16)
24 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS
since E[z] = 0.
Also, we know that
E( zz
0
) = E

z
2
1
z
1
z
2
z
1
z
m
z
2
z
1
z
2
2
z
1
z
m
.
.
.
.
.
.
.
.
.
.
.
.
z
m
z
1
z
m
z
2
z
2
m

1 0 0
0 1 0
.
.
.
.
.
.
.
.
.
.
.
.
0 0 1

= I
m
, (3.17)
since E[z
i
z
j
] = 0, for all i 6= j, and E[z
2
i
] = 1, for all i. Therefore,
E[( x )( x )
0
] = E( Azz
0
A
0
)
= AE( zz
0
)A
0
= AI
m
A
0
= , (3.18)
where is a positive denite matrix (since det(A) 6= 0).
Next, we need to nd the probability density function f
x
( x) of x. We know
that
z = A
1
( x ) ,
z
0
= ( x )
0
A
1
0
,
and
J
z
( z ) = A,
so we use (3.13) to get
f
x
( x) = f
z
[ A
1
( x )]
1
| det( A)|
=
1
( 2 )
m/2
e

1
2
( x)
0
A
10
A
1
( x)
1
| det( A)|
=
1
( 2 )
m/2
e

1
2
( x)
0
( AA
0
)
1
( x)
1
| det( A)|
(3.19)
where we use the results (ABC) = C
1
B
1
A
1
and A
01
= A
10
. However,
= AA
0
, so det() =det(A) det(A), and |det(A)| = [ det()]
1/2
. Thus we
can rewrite (3.19) as
f
x
( x) =
1
( 2 )
m/2
[ det( )]
1/2
e

1
2
( x)
0

1
( x)
(3.20)
and we see that x N(, ) with mean vector and covariance matrix .
Since this process is completely reversable the relationship is one-to-one.
3.3. MULTIVARIATE NORMAL DISTRIBUTION 25
3.3.3 Linear Transformations
Theorem 3.1 Suppose x N(, ) with det() 6= 0 and y = b +Bx with B
square and det(B) 6= 0. Then y N(
y
,
y
). 2
Proof: From (3.3) and (3.4), we have Ey = b+B =
y
and E[( y
y
)( y

y
)
0
] = BB
0
=
y
. To nd the probability density function f
y
( y) of y, we
again use (3.13), which gives us
f
y
( y) = f
x
[ B
1
( y b)]
1
| det( B)|
=
1
( 2 )
m/2
[ det( )]
1/2
e

1
2
[ B
1
( yb)]
0

1
[ B
1
( yb)]
1
| det( B)|
=
1
( 2 )
m/2
[ det( BB
0
)]
1/2
e

1
2
( ybB)
0
( BB
0
)
1
( ybB)
=
1
( 2 )
m/2
[ det(
y
)]
1/2
e

1
2
( y
y
)
0

1
y
( y
y
)
(3.21)
So, y N(
y
,
y
). 2
Thus we see, as asserted in the previous chapter, that linear transformations
of multivariate normal random variables are also multivariate normal random
variables. And any linear combination of independent normals will also be
normal.
3.3.4 Quadratic Forms
Theorem 3.2 Let x N( , ), where det() 6= 0, then ( x )
0

1
( x
) X
2
m
. 2
Proof Let = PP
0
, which is possible since positive denite.. Then
( x ) N( 0, ),
and
z = P
1
( x ) N( 0, I
m
).
Therefore,
z
0
z =
n
X
i=1
z
2
i
X
2
m
= P
1
( x )
0
P
1
( x )
= ( x )
0
P
1
0
P
1
( x )
= ( x )
0

1
( x ) X
2
m
. 2 (3.22)

1
is called the weight matrix. With this result, we can use the X
2
m
to
make inferences about the vector mean of x.
26 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS
3.4 Normality and the Sample Mean
3.4.1 Moments of the Sample Mean
Consider the m1 vector x
i
i.i.d. jointly, with m1 vector mean E[x
i
] =
and mm covariance matrix E[(x
i
)(x
i
)
0
] = . Dene x
n
=
1
n
P
n
i=1
x
i
as the vector sample mean which is also the vector of scalar sample means. The
mean of the vector sample mean follows directly:
E[x
n
] =
1
n
n
X
i=1
E[x
i
] = .
Alternatively, this result can be obtained by applying the scalar results element
by element. The second moment matrix of the vector sample mean is given by
E[(x
n
)(x
n
)
0
] =
1
n
2
E
h
(
P
n
i=1
x
i
n) (
P
n
i=1
x
i
n)
0
i
=
1
n
2
E[{(x
1
) + (x
2
) +... + (x
n
)}
{(x
1
) + (x
2
) +... + (x
n
)}
0
]
=
1
n
2
n =
1
n

since the covariances between dierent observations are zero.


3.4.2 Distribution of the Sample Mean
Suppose x
i
i.i.d. N(, ) jointly. Then it follows from joint multivariate
normality that x
n
must also be multivariate normal since it is a linear transfor-
mation. Specically, we have
x
n
N(,
1
n
)
or equivalently
x
n
N(0,
1
n
)

n(x
n
) N(0, )

n
1/2
(x
n
) N(0, I
m
)
where =
1/2

1/20
and
1/2
= (
1/2
)
1
.
3.4.3 Multivariate Central Limit Theorem
Theorem 3.3 Suppose that (i) x
i
i.i.d jointly, (ii) E[x
i
] = , and (iii)
E[(x
i
)(x
i
)
0
] = , then

n(x
n
)
d
N(0, )
3.5. NONCENTRAL DISTRIBUTIONS 27
or equivalently
z =

n
1/2
(x
n
)
d
N(0, I
m
) 2 .
These results apply even if the original underlying distribution is not normal
and follow directly from the scalar results applied to any linear combination of
x
n
.
3.4.4 Limiting Behavior of Quadratic Forms
Consider the following quadratic form
n (x
n
)
0

1
(x
n
) = n (x
n
)
0
(
1/2

1/20
)
1
(x
n
)
= [n (x
n
)
0

1/20

1/2
(x
n
)
= [

n
1/2
(x
n
)]
0
[

n
1/2
(x
n
)]
= z
0
z
2
d

2
m
.
This form is convenient for asymptotic joint test concerning more than one mean
at a time.
3.5 Noncentral Distributions
3.5.1 Noncentral Scalar Normal
Denition 3.3 Let x N(,
2
). Then, for

= /,
z

=
x

N(

, 1 ) (3.23)
has a noncentral normal distribution with noncentrality parameter

. 2
Example 3.3 When we do a hypothesis test of the mean, with known variance,
we have, under the null hypothesis H
0
: =
0
,
x
0

N( 0, 1 ) (3.24)
and, under the alternative H
1
: =
1
6=
0
,
x
0

=
x
1

+

1

= N( 0, 1 ) +

1

, 1

. (3.25)
Thus, the behavior of
x
0

under the alternative hypothesis follows a non-


central normal distribution. 2
As this example makes clear, the noncentral normal distribution is especially
useful when carefully exploring the behavior of the alternative hypothesis.
28 CHAPTER 3. MULTIVARIATE DISTRIBUTIONS
3.5.2 Noncentral t
Denition 3.4 Let z

N(

, 1 ), w
2
k
, and let z

and w be independent.
Then
z

p
w/k
t
k
(

) (3.26)
has a noncentral t distribution. 2
The noncentral t distribution is used in tests of the mean, when the variance
is unknown and must be estimated.
3.5.3 Noncentral Chi-Squared
Denition 3.5 Let z

N(

, I
m
). Then
z
0
z

X
2
m
( ), (3.27)
has a noncentral chi-aquared distribution, where =
0

is the noncentrality
parameter. 2
In the noncentral chi-squared distribution, the probability mass is shifted to
the right as compared to a central chi-squared distribution.
Example 3.4 When we do a test of , with known , we have
H
0
: =
0
( x
0
)
0

1
( x
0
) X
2
m
(3.28)
H
1
: =
1
6=
0
Let z

=P
1
(x
0
) for = PP
0
. Then, we have
( x
0
)
0

1
( x
0
) = ( x
0
)
0
P
1
0
P
1
( x
0
)
= z
0
z

X
2
m
[(
1

0
)
0

1
(
1

0
)] (3.29)
2
3.5.4 Noncentral F
Denition 3.6 Let Y
2
m
(), W
2
n
, and let Y and W be independent
random variables. Then
Y/m
W/n
F
m,n
( ), (3.30)
has a noncentral F distribution, where > 0 is the noncentrality parameter. 2
3.5. NONCENTRAL DISTRIBUTIONS 29
The noncentral F distribution is used in tests of mean vectors, where the
variance-covariance matrix is unknown and must be estimated. As with the
other noncentral distributions the noncentrality parameter determines the amount
the distribution is shifted (to the right in this case) when nonzero. Such a shift
would arise under an alternative hypothesis as with the chi-squared example
above.
Chapter 4
Asymptotic Theory
4.1 Convergence Of Random Variables
4.1.1 Limits And Orders Of Sequences
Denition 4.1 A sequence of real numbers a
1
, a
2
, . . . , a
n
, . . . , is said to be
convergent to or have a limit of if for every > 0, there exists a positive real
number N such that for all n > N, | a
n
| < . This is denoted as
lim
n
a
n
= . 2
Denition 4.2 A sequence of real numbers { a
n
} is said to be of at most order
n
k
, and we write { a
n
} is O( n
k
), if
lim
n
1
n
k
a
n
= c,
where c is any real constant. 2
Example 4.1 Let { a
n
} = 3 + 1/n, and { b
n
} = 4 n
2
. Then, { a
n
} is O( 1 )
=O( n
0
), since
lim
n
1
n
a
n
= 3,
and { b
n
} is O( n
2
), since
lim
n
1
n
2
b
n
=-1. 2
Denition 4.3 A sequence of real numbers { a
n
} is said to be of order smaller
than n
k
, and we write { a
n
} is o( n
k
), if
lim
n
1
n
k
a
n
= 0. 2
30
4.1. CONVERGENCE OF RANDOM VARIABLES 31
Example 4.2 Let { a
n
} = 1/n. Then, { a
n
} is o( 1 ), since
lim
n
1
n
0
a
n
=0. 2
4.1.2 Convergence In Probability
Denition 4.4 A sequence of random variables y
1
, y
2
, . . . , y
n
, . . . , with distri-
bution functions F
1
(), F
2
(), . . . , F
n
(), . . . , is said to converge weakly in proba-
bility to some constant c if
lim
n
Pr[ | y
n
c | > ] = 0. 2 (4.1)
for every real number > 0.
Weak convergence in probability is denoted by
plim
n
y
n
= c, (4.2)
or sometimes,
y
n
p
c, (4.3)
or
y
n

p
c.
This denition is equivalent to saying that we have a sequence of tail proba-
bilities (of being greater than c+ or less than c), and that the tail probabil-
ities approach 0 as n , regardless of how small is chosen. Equivalently,
the probability mass of the distribution of y
n
is collapsing about the point c.
Denition 4.5 A sequence of random variables y
1
, y
2
, . . . , y
n
, . . . , is said to
converge strongly in probability to some constant c if
lim
N
Pr[ sup
n>N
| y
n
c | > ] = 0, (4.4)
for any real > 0. 2
Strong convergence is also called almost sure convergence and is denoted
y
n
a.s.
c, (4.5)
or
y
n

a.s.
c.
Notice that if a sequence of random variables converges strongly in probability, it
converges weakly in probability. The dierence between the two is that almost
32 CHAPTER 4. ASYMPTOTIC THEORY
sure convergence involves an element of uniformity that weak convergence does
not. A sequence that is weakly convergent can have Pr[| y
n
c | > ] wiggle-
waggle above and below the constant used in the limit and then settle down
to subsequently be smaller and meet the condition. For strong convergence,
once the probability falls below for a particular N in the sequence it will
subsequently be smaller for all larger N.
Denition 4.6 A sequence of random variables y
1
, y
2
, . . . , y
n
, . . . , is said to
converge in quadratic mean if
lim
n
E[ y
n
] = c
and
lim
n
Var[ y
n
] = 0. 2
By Chebyshevs inequality, convergence in quadratic mean implies weak con-
vergence in probability. For a random variable x with mean and variance

2
, Chebyshevs inequality states Pr(|x | k)
1
k
2
. Let
2
n
denote
the variance on y
n
, then we can write the condition for the present case as
Pr(|y
n
E[y
n
]| k
n
)
1
k
2
. Since
2
n
0 and E[y
n
] c the probability will
be less than
1
k
2
for suciently large n for any choice of k. But this is just weak
convergence in probability to c.
4.1.3 Orders In Probability
Denition 4.7 Let y
1
, y
2
, . . . , y
n
, . . . be a sequence of random variables. This
sequence is said to be bounded in probability if for any 1 > > 0, there exist a
< and some N suciently large such that
Pr(| y
n
| > ) < ,
for all n > N. 2
These conditions require that the tail behavior of the distributions of the
sequence not be pathological. Specically, the tail mass of the distributions
cannot be drifting away from zero as we move out in the sequence.
Denition 4.8 The sequence of random variables { y
n
} is said to be at most
of order in probability n

, and is denoted O
p
( n

), if n

y
n
is bounded in
probability. 2
Example 4.3 Suppose z N(0, 1) and y
n
= 3 +n z, then n
1
y
n
= 3/n+z is
a bounded random variable since the rst term is asymptotically negligible and
we see that y
n
= O
p
( n).
4.1. CONVERGENCE OF RANDOM VARIABLES 33
Denition 4.9 The sequence of random variables { y
n
} is said to be of order
in probability smaller than n

, and is denoted o
p
( n

), if n

y
n
p
0. 2
Example 4.4 Convergence in probability can be represented in terms of order
in probability. Suppose that y
n
p
c or equivalently y
n
c
p
0, then
n
0
(y
n
c)
p
0 and y
n
c = o
p
( 1 ).
4.1.4 Convergence In Distribution
Denition 4.10 A sequence of random variables y
1
, y
2
, . . . , y
n
, . . . , with cu-
mulative distribution functions F
1
(), F
2
(), . . . , F
n
(), . . . , is said to converge
in distribution to a random variable y with a cumulative distribution function
F( y ), if
lim
n
F
n
() = F(), (4.6)
for every point of continuity of F(). The distribution F() is said to be the
limiting distribution of this sequence of random variables. 2
For notational convenience, we often write y
n
d
F() or y
n

d
F() if
a sequence of random variables converges in distribution to F(). Note that
the moments of elements of the sequence do not necessarily converge to the
moments of the limiting distribution.
4.1.5 Some Useful Propositions
In the following propositions, let x
n
and y
n
be sequences of random vectors.
Proposition 4.1 If x
n
y
n
converges in probability to zero, and y
n
has a
limiting distribution, then x
n
has a limiting distribution, which is the same. 2
Proposition 4.2 If y
n
has a limiting distribution and plim
n
x
n
= 0, then for
z
n
= y
n
0
x
n
,
plim
n
z
n
= 0. 2
Proposition 4.3 Suppose that y
n
converges in distribution to a random vari-
able y, and plim
n
x
n
= c. Then x
n
0
y
n
converges in distribution to c
0
y. 2
Proposition 4.4 If g() is a continuous function, and if x
n
y
n
converges in
probability to zero, then g( x
n
) g( y
n
) converges in probability to zero. 2
Proposition 4.5 If g() is a continuous function, and if x
n
converges in proba-
bility to a constant c, then z
n
= g( x
n
) converges in distribution to the constant
g( c ). 2
34 CHAPTER 4. ASYMPTOTIC THEORY
Proposition 4.6 If g() is a continuous function, and if x
n
converges in dis-
tribution to a random variable x, then z
n
= g( x
n
) converges in distribution to
a random variable g( x). 2
4.2 Estimation Theory
4.2.1 Properties Of Estimators
Denition 4.11 An estimator
b

n
of the p1 parameter vector is a function
of the sample observations x
1
, x
2
, ..., x
n
. 2
It follows that
b

1
,
b

2
, . . . ,
b

n
form a sequence of random variables.
Denition 4.12 The estimator
b

n
is said to be unbiased if E
b

n
= , for all n.
2
Denition 4.13 The estimator
b

n
is said to be asympotically unbiased if lim
n
E
b

n
=.
2
Note that an estimator can be biased in nite samples, but asymptotically
unbiased.
Denition 4.14 The estimator
b

n
is said to be consistent if plim
n
b

n
=. 2
Consistency neither implies nor is implied by asymptotic unbiasedness, as
demonstrated by the following examples.
Example 4.5 Let
e

n
=

, with probability 1 1/n


+ nc, with probability 1/n
We have E
e

n
= +c, so
e

n
is a biased estimator, and lim
nE
e

n
= + c, so
e

n
is asymptotically biased as well. However, lim
n
Pr(|
e

n
| > ) = 0, so
e

n
is a consistent estimator. 2
Example 4.6 Suppose x
i
i.i.d. N(,
2
) for i = 1, 2, ..., n, ... and let e x
n
= x
n
be an estimator of . Now E[e x
n
] = so the estimator is unbiased but
Pr(|e x
n
| > 1.96) = .05
so the probability mass is not collapsing about the target point so the esti-
mator is not consistent.
4.2. ESTIMATION THEORY 35
4.2.2 Laws Of Large Numbers And Central Limit Theo-
rems
Most of the large sample properties of the estimators considered in the sequel
derive from the fact the estimators involve sample averages and the asymptotic
behavior of averages is well studied. In addition to the central limit theorems
presented in the previous two chapters we have the following two laws of large
numbers:
Theorem 4.1 If x
1
, x
2
, . . . , x
n
is a simple random sample, that is, the x
i
s are
i.i.d., and Ex
i
= and Var( x
i
) =
2
, then by Chebyshevs Inequality,
plim
n
x
n
= 2 (4.7)
Theorem 4.2 (Khitchine) Suppose that x
1
, x
2
, . . . , x
n
are i.i.d. random vari-
ables, such that for all i = 1, . . . , n, Ex
i
= , then,
plim
n
x
n
= 2 (4.8)
Both of these results apply element-by-element to vectors of estimators.
For sake of completeness we repeat the following scalar central limit theorem.
Theorem 4.3 (Linberg-Levy) Suppose that x
1
, x
2
, . . . , x
n
are i.i.d. random
variables, such that for all i = 1, . . . , n, Ex
i
= and Var( x
i
) =
2
, then

n( x
n
)
d
N( 0,
2
), or
lim
n
f( x
n
) =
1

2
2
e
1
2
2
( x
n
)
2
= N( 0,
2
) 2 (4.9)
This result is easily generalized to obtain the multivariate version given in The-
orem 3.3.
Theorem 4.4 (Multivariate CLT) Suppose that m1 random vectors x
1
, x
2
, . . . , x
n
are (i) jointly i.i.d., (ii) Ex
i
= , and (iii) Cov( x
i
) = , then

n( x
n
)
d
N( 0, ) (4.10)
4.2.3 CUAN And Eciency
Denition 4.15 An estimator is said to be consistently uniformly asymptoti-
cally normal (CUAN) if it is consistent, and if

n(
b

n
) converges in distrib-
ution to N( 0, ), and if the convergence is uniform over some compact subset
of the parameter space. 2
36 CHAPTER 4. ASYMPTOTIC THEORY
Suppose that

n(
b

n
) converges in distribution to N( 0, ). Let
e

n
be an alternative estimator such that

n(
e

n
) converges in distribution to
N( 0, ).
Denition 4.16 If
b

n
is CUAN with asymptotic covariance and
e

n
is CUAN
with asymptotic covariance , then
b

n
is asymptotically ecient relative to
e

n
if is a positive semidenite matrix. 2
Among other properties asymptotic relative eciency implies that the diag-
onal elements of are no larger than those of , so the asymptotic variances
of
b

n,i
are no larger than those of
e

n,i
. And a similar result applies for the
asymptotic variance of any linear combination.
Denition 4.17 A CUAN estimator
b

n
is said to be asymptotically ecient if
it is asymptotically ecient relative to any other CUAN estimator. 2
4.3 Asymptotic Inference
4.3.1 Normal Ratios
Now, under the conditions of the central limit theorem,

n( x
n
)
d

N( 0,
2
), so
( x
n
)
p

2
/n
d
N( 0, 1 ) (4.11)
Suppose that
c

2
is a consistent estimator of
2
. Then we also have
( x
n
)
p
b
2
/n
=
p

2
/n
p
b
2
/n
( x
n
)
p

2
/n
=
r

2
b
2
( x
n
)
p

2
/n
d
N( 0, 1 ) (4.12)
since the term under the square root converges in probability to one and the
remainder converges in distribution to N( 0, 1).
Most typically, such ratios will be used for inference in testing a hypothesis.
Now, for H
0
: =
0
, we have
( x
n

0
)
p
b
2
/n
d
N( 0, 1 ), (4.13)
while under H
1
: =
1
6=
0
, we nd that
( x
n

0
)
p
b
2
/n
=

n( x
n

1
)

b
2
+

n(
1

0
)

b
2
= N( 0, 1 ) + O
p
(

n) (4.14)
4.3. ASYMPTOTIC INFERENCE 37
Thus, extreme values of the ratio can be taken as rare events under the null or
typical events under the alternative.
Such ratios are of interest in estimation and inference with regard to more
general parameters. Suppose that
b
is an estimator of the parameter vector
and

n(
b

p1
)
d
N( 0,
pp
).
Then, if
j
is the parameter of particular interest, we consider
(
b

j
)
p

jj
/n
d
N( 0, 1 ), (4.15)
and duplicating the arguments for the sample mean
(
b

j
)
q
b

jj
/n
d
N( 0, 1 ), (4.16)
where
b
is a consistent estimator of . This ratio will have a similar behavior
under a null and alternative hypothesis with regard to
j
.
4.3.2 Asymptotic Chi-Square
Suppose that

n(
b
)
d
N( 0, ),
where
b
is a consistent estimator of the nonsingular pp matrix . Then, from
the previous chapter we have

n(
b
)
0

n(
b
) = n (
b
)
0

1
(
b
)
d

2
p
(4.17)
and
n (
b
)
0
b

1
(
b
)
d

2
p
(4.18)
for
b
a consistent estimator of .
This result can be used to conduct infence by testing the entire parameter
vector. If H
0
:
1
=
0
1
, then
n (
b

0
)
0
b

1
(
b

0
)
d

2
p
, (4.19)
and large positive values are rare events. while for H
1
:
1
=
1
1
6=
0
1
, we can
show (later)
n (
b

0
)
0
b

1
(
b

0
) = n ((
b

1
) + (
1

0
))
0

1
((
b

1
) + (
1

0
))
= n (
b

1
)
0
b

1
(
b

1
) + 2n (
1

0
)
0

1
(
b

1
)
+n (
1

0
)
0
b

1
(
1

0
)
=
2
p
+ O
p
(

n) + O
p
(n) (4.20)
38 CHAPTER 4. ASYMPTOTIC THEORY
Thus, if we obtain a large value of the statistic, we may take it as evidence that
the null hypothesis is incorrect.
This result can also be applied to any subvector of . Let
=

,
where
1
is a p
1
1 vector. Then

n(
b

1
)
d
N( 0,
11
), (4.21)
where
11
is the upper left-hand p
1
p
1
submatrix of and,
n (
b

1
)
0
b

1
11
(
b

1
)
d

2
p
1
(4.22)
4.3.3 Tests Of General Restrictions
We can use similar results to test general nonlinear restrictions. Suppose that
r() is a q 1 continuously dierentiable function, and

n(
b
)
d
N( 0, ).
By the intermediate value theorem we can obtain the exact Taylors series rep-
resentation
r(
b
) = r() +
r(

0
(
b
)
or equivalently

n(r(
b
) r()) =
r(

n(
b
)
= R(

n(
b
)
where R(

) =
r(

0
and

lies between
b
and . Now
b

p
so


p

and R(

) R( ). Thus, we have

n[ r(
b
) r( )]
d
N( 0, R( )R
0
( )). (4.23)
Thus, under H
0
: r( ) = 0, assuming R( )R
0
( ) is nonsingular, we have
n r(
b
)
0
[R( )R
0
( )]
1
r(
b
)
d

2
q
, (4.24)
where q is the length of r(). In practice, we substitute the consistent estimates
R(
b
) for R( ) and
b
for to obtain, following the arguments given above
n r(
b
)
0
[R(
b
)
b
R
0
(
b
)]
1
r(
b
)
d

2
q
, (4.25)
The behavior under the alternative hypothesis will be O
p
(n) as above.
4.3. ASYMPTOTIC INFERENCE 39
4.A Appendix
In the sequel, we will entertain various estimators that result from an optimiza-
tion operation. Such estimators comprise the class of extremum estimators.
Examples are: least squares, which nds a minimum; maximum likelihood,
which nds the maximum; and generalized method of moments (GMM), which
nds a minimum. In each case, we obtain rst order conditions for the ob-
jective function, in terms of the unknown parameters, and use the solution of
these as estimators. In this section, we will develop results which establish
the consistency and asymptotic normality of such estimators under very general
conditions.
Consistency and laws of large numbers usually result from the fact that the
rst-order conditions are, at least asymptotically, averages. Accordingly, we
suppose the basis of the average is the function
v
i
() = v(w
i
, )
where is a p 1 parameter vector, w
i
is a k 1 vector of observations on the
data, and v() is a p 1 vector function. The average is based on a sample of
size n, written
v() =
1
n
n
X
i=1
v
i
().
Coming from rst-order conditions, the estimator, say
b
, is obtained by setting
this average to zero and solving for so
0 = v(
b
).
The following asymptotic results will rely on higher-order assumptions based
on concepts of uniform continuity and uniform convergence in probability which
need to be dened.
Denition 4.18 A function is continuous if, for each positive number and
each point x
0
, there exists a positive number such that whenever |x x
0
| <
then |f(x) f(x
0
)| < .
Denition 4.19 A function is uniformly continuous if, for each positive num-
ber , there exists a positive number such for all x
0
, whenever |x x
0
| <
then |f(x) f(x
0
)| < .
The dierence between these two denitions is that in the second depends
only on the choice of whereas in the rst it can depend on and x
0
.
Denition 4.20 A sequence of functions {f
n
}, n = 1, 2, 3, ..., is said to be
uniformly convergent to the function f for a set E of values of x if, for each
> 0, an integer N can be found such that |f
n
(x) f(x)| < for n > N and
x E.
40 CHAPTER 4. ASYMPTOTIC THEORY
Denition 4.21 A sequence of random functions {f
n
= f
n
(x)}, n = 1, 2, 3, ...,
is said to be uniformly convergent in probability to the function f for a set E of
values of x if, for each > 0 and > 0, an integer N can be found such that
Pr(|f
n
(x) f(x)| > ) < for n > N and x E.
This denition diers from pointwise convergence in probability in that an
do not depend on the value of x. In pointwise convergence these constants can
vary depending on the point of evaluation for x.
Chapter 5
Maximum Likelihood
Methods
5.1 Maximum Likelihood Estimation (MLE)
5.1.1 Motivation
Suppose we have a model for the random variable y
i
, for i = 1, 2, . . . , n, with
unknown (p 1) parameter vector . In many cases, the model will imply a
distribution f( y
i
, ) for each realization of the variable y
i
.
A basic premise of statistical inference is to avoid unlikely or rare models,
for example, in hypothesis testing. If we have a realization of a statistic that
exceeds the critical value then it is a rare event under the null hypothesis. Under
the alternative hypothesis, however, such a realization is much more likely to
occur and we reject the null in favor of the alternative. Thus in choosing
between the null and alternative, we select the model that makes the realization
of the statistic more likely to have occured.
Carrying this idea over to estimation, we select values of such that the
corresponding values of f( y
i
, ) are not unlikely. After all, we do not want a
model that disagrees strongly with the data. Maximum likelihood estimation
is merely a formalization of this notion that the model chosen should not be
unlikely. Specically, we choose the values of the parameters that make the
realized data most likely to have occured. This approach does, however, require
that the model be specied in enough detail to imply a distribution for the
variable of interest.
41
42 CHAPTER 5. MAXIMUM LIKELIHOOD METHODS
5.1.2 The Likelihood Function
Suppose that the random variables y
1
, y
2
, . . . , y
n
are i.i.d. Then, the joint
density function for n realizations is
f( y
1
, y
2
, . . . , y
n
| ) = f( y
1
| ) f( y
2
, | ) . . . f( y
n
| )
=
n
Y
i=1
f( y
i
| ) (5.1)
Given values of the parameter vector , this function allows us to assign local
probability measures for various choices of the random variables y
1
, y
2
, . . . , y
n
.
This is the function which must be integrated to make probability statements
concerning the joint outcomes of y
1
, y
2
, . . . , y
n
.
Given a set of realized values of the random variables, we use this same
function to establish the probability measure associated with various choices of
the parameter vector .
Denition 5.1 The likelihood function of the parameters, for a particular sam-
ple of y
1
, y
2
, . . . , y
n
, is the joint density function considered as a function of
given the y
i
s. That is,
L( |y
1
, y
2
, . . . , y
n
) =
n
Y
i=1
f( y
i
| ) 2 (5.2)
5.1.3 Maximum Likelihood Estimation
For a particular choice of the parameter vector , the likelihood function gives
a probability measure for the realizations that occured. Consistent with the
approach used in hypothesis testing, and using this function as the metric, we
choose that make the realizations most likely to have occured.
Denition 5.2 The maximum likelihood estimator of is the estimator ob-
tained by maximizing L( |y
1
, y
2
, . . . , y
n
) with respect to . That is,
max

L( |y
1
, y
2
, . . . , y
n
) =
b
, (5.3)
where
b
is called the MLE of . 2
Equivalently, since log() is a strictly monotonic transformation, we may nd
the MLE of by solving
max

L( |y
1
, y
2
, . . . , y
n
), (5.4)
where
L( |y
1
, y
2
, . . . , y
n
) = log L( |y
1
, y
2
, . . . , y
n
)
5.1. MAXIMUM LIKELIHOOD ESTIMATION (MLE) 43
is denoted the log-likelihood function. In practice, we obtain
b
by solving the
rst-order conditions (FOC)
L( ; y
1
, y
2
, . . . , y
n
)

=
n
X
i=1
log f( y
i
;
b
)

= 0.
The motivation for using the log-likelihood function is apparent since the sum-
mation form will result, after division by n, in estimators that are approximately
averages, about which we know a lot. This advantage is particularly clear in
the folowing example.
Example 5.1 Suppose that y
i
i.i.d.N( ,
2
), for i = 1, 2, . . . , n. Then,
f( y
i
|,
2
) =
1

2
2
e

1
2
2
( y
i
)
2
,
for i = 1, 2, . . . , n. Using the likelihood function (5.2), we have
L( ,
2
|y
1
, y
2
, . . . , y
n
) =
n
Y
i=1
1

2
2
e

1
2
2
( y
i
)
2
. (5.5)
Next, we take the logarithm of (5.5), which gives us
log L( ,
2
|y
1
, y
2
, . . . , y
n
) =
n
X
i=1

1
2
log( 2
2
)
1
2
2
( y
i
)
2

=
n
2
log( 2
2
)
1
2
2
n
X
i=1
( y
i
)
2
(5.6)
We then maximize (5.6) with respect to both and
2
. That is, we solve the
following rst order conditions:
(A)
log L()

=
1

2
P
n
i=1
( y
i
) = 0;
(B)
log L()

2
=
n
2
2
+
1
2
4
P
n
i=1
( y
i
)
2
= 0.
By solving (A), we nd that =
1
n
P
n
i=1
y
i
= y
n
. Solving (B) gives us

2
=
1
n
P
n
i=1
( y
i
y
n
). Therefore, b = y
n
, and b
2
=
1
n
P
n
i=1
( y
i
b ). 2
Note that b
2
=
1
n
P
n
i=1
( y
i
b ) 6= s
2
, where s
2
=
1
n1
P
n
i=1
( y
i
b ). s
2
is
the familiar unbiased estimator for
2
, and b
2
is a biased estimator.
44 CHAPTER 5. MAXIMUM LIKELIHOOD METHODS
5.2 Asymptotic Behavior of MLEs
5.2.1 Assumptions
For the results we will derive in the following sections, we need to make five
assumptions:

1. The $y_i$'s are i.i.d. random variables with density function $f(y_i, \theta)$ for
$i = 1, 2, \ldots, n$;

2. $\log f(y_i, \theta)$, and hence $f(y_i, \theta)$, possesses derivatives with respect to $\theta$ up to
the third order for all $\theta$;

3. The range of $y_i$ is independent of $\theta$, hence differentiation under the integral
is possible;

4. The parameter vector $\theta$ is globally identified by the density function;

5. $\partial^3 \log f(y_i, \theta) / \partial\theta_i \partial\theta_j \partial\theta_k$ is bounded in absolute value by some function
$H_{ijk}(y)$ for all $y$ and $\theta$, which, in turn, has a finite expectation for all $\theta$.

The first assumption is fundamental and the basis of the estimator. If it is
not satisfied then we are misspecifying the model and there is little hope of
obtaining correct inferences, at least in finite samples. The second assumption
is a regularity condition that is usually satisfied and easily verified. The third
assumption is also easily verified and is guaranteed to be satisfied in models where
the dependent variable has smooth and infinite support. The fourth assumption
must be verified, which is easier in some cases than in others. The last assumption
is crucial; verifying it bears a cost and really should be done before MLE is
undertaken, but it is usually ignored.
5.2.2 Some Preliminaries
Now, we know that

$$\int L(\theta \mid y)\, dy = \int f(y \mid \theta)\, dy = 1 \qquad (5.7)$$

for any value of the parameter vector $\theta$. Therefore,

$$0 = \frac{\partial \int L(\theta \mid y)\, dy}{\partial \theta} = \int \frac{\partial L(\theta \mid y)}{\partial \theta}\, dy = \int \frac{\partial f(y \mid \theta)}{\partial \theta}\, dy = \int \frac{\partial \log f(y \mid \theta)}{\partial \theta} f(y \mid \theta)\, dy \qquad (5.8)$$

for any $\theta$, and

$$0 = \mathrm{E}\left[ \frac{\partial \log f(y \mid \theta_0)}{\partial \theta} \right] = \mathrm{E}\left[ \frac{\partial \log L(\theta_0 \mid y)}{\partial \theta} \right] \qquad (5.9)$$

for any value of the true parameter vector $\theta_0$.

Differentiating (5.8) again yields

$$0 = \int \left[ \frac{\partial^2 \log f(y \mid \theta)}{\partial \theta\, \partial \theta'} f(y \mid \theta) + \frac{\partial \log f(y \mid \theta)}{\partial \theta} \frac{\partial f(y \mid \theta)}{\partial \theta'} \right] dy. \qquad (5.10)$$

Since

$$\frac{\partial f(y \mid \theta)}{\partial \theta'} = \frac{\partial \log f(y \mid \theta)}{\partial \theta'} f(y \mid \theta), \qquad (5.11)$$

then we can rewrite (5.10) as

$$0 = \mathrm{E}\left[ \frac{\partial^2 \log f(y \mid \theta_0)}{\partial \theta\, \partial \theta'} \right] + \mathrm{E}\left[ \frac{\partial \log f(y \mid \theta_0)}{\partial \theta} \frac{\partial \log f(y \mid \theta_0)}{\partial \theta'} \right], \qquad (5.12)$$

or, in terms of the likelihood function,

$$\mathcal{I}(\theta_0) = \mathrm{E}\left[ \frac{\partial \log L(\theta_0 \mid y)}{\partial \theta} \frac{\partial \log L(\theta_0 \mid y)}{\partial \theta'} \right] = -\mathrm{E}\left[ \frac{\partial^2 \log L(\theta_0 \mid y)}{\partial \theta\, \partial \theta'} \right]. \qquad (5.13)$$

The matrix $\mathcal{I}(\theta_0)$ is called the information matrix and the relationship given
in (5.13) the information matrix equality.

Finally, we note that

$$\mathrm{E}\left[ \frac{\partial \log L(\theta_0 \mid y)}{\partial \theta} \frac{\partial \log L(\theta_0 \mid y)}{\partial \theta'} \right] = \mathrm{E}\left[ \sum_{i=1}^{n} \frac{\partial \log f_i(y \mid \theta_0)}{\partial \theta} \sum_{i=1}^{n} \frac{\partial \log f_i(y \mid \theta_0)}{\partial \theta'} \right] = \sum_{i=1}^{n} \mathrm{E}\left[ \frac{\partial \log f_i(y \mid \theta_0)}{\partial \theta} \frac{\partial \log f_i(y \mid \theta_0)}{\partial \theta'} \right], \qquad (5.14)$$

since the covariances between different observations are zero.
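The information matrix equality (5.13) lends itself to a quick Monte Carlo check. The sketch below is illustrative only (the normal model and all numerical values are assumptions, not part of the notes): it averages the outer product of the score and the negative Hessian over many simulated samples, and the two matrices should roughly coincide.

# Simulation check of E[score * score'] = -E[Hessian] for the N(mu, sigma^2) model.
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma2_0, n, reps = 0.5, 1.5, 200, 2000

outer_sum = np.zeros((2, 2))
hess_sum = np.zeros((2, 2))
for _ in range(reps):
    y = rng.normal(mu0, np.sqrt(sigma2_0), size=n)
    # Score of the log-likelihood w.r.t. (mu, sigma^2), evaluated at the true values.
    s_mu = np.sum(y - mu0) / sigma2_0
    s_s2 = -n / (2 * sigma2_0) + np.sum((y - mu0) ** 2) / (2 * sigma2_0 ** 2)
    score = np.array([s_mu, s_s2])
    outer_sum += np.outer(score, score)
    # Hessian of the log-likelihood w.r.t. (mu, sigma^2), also at the true values.
    h = np.array([
        [-n / sigma2_0, -np.sum(y - mu0) / sigma2_0 ** 2],
        [-np.sum(y - mu0) / sigma2_0 ** 2,
         n / (2 * sigma2_0 ** 2) - np.sum((y - mu0) ** 2) / sigma2_0 ** 3],
    ])
    hess_sum += h

print(outer_sum / reps)    # approximates the information matrix I(theta_0)
print(-hess_sum / reps)    # should be close to the matrix above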
5.2.3 Asymptotic Properties
Consistent Root Exists
For notational simplicity, consider the case where $p = 1$. Then, expanding in a
Taylor's series and using the intermediate value theorem on the quadratic term
yields

$$\frac{1}{n} \frac{\partial \log L(\widehat{\theta} \mid y)}{\partial \theta} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \log f(y_i \mid \theta_0)}{\partial \theta} + \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log f(y_i \mid \theta_0)}{\partial \theta^2} (\widehat{\theta} - \theta_0) + \frac{1}{2} \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^3 \log f(y_i \mid \theta^*)}{\partial \theta^3} (\widehat{\theta} - \theta_0)^2, \qquad (5.15)$$

where $\theta^*$ lies between $\widehat{\theta}$ and $\theta_0$. Now, by assumption 5, we have

$$\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^3 \log f(y_i \mid \theta^*)}{\partial \theta^3} = k \, \frac{1}{n} \sum_{i=1}^{n} H(y_i), \qquad (5.16)$$

for some $|k| < 1$. So,

$$\frac{1}{n} \frac{\partial \log L(\widehat{\theta} \mid y)}{\partial \theta} = a\delta^2 + b\delta + c, \qquad (5.17)$$

where

$$\delta = \widehat{\theta} - \theta_0, \quad a = \frac{k}{2} \frac{1}{n} \sum_{i=1}^{n} H(y_i), \quad b = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log f(y_i \mid \theta_0)}{\partial \theta^2}, \quad \text{and} \quad c = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \log f(y_i \mid \theta_0)}{\partial \theta}.$$

Note that $|a| \leq \frac{1}{2} \frac{1}{n} \sum_{i=1}^{n} H(y_i) = \frac{1}{2} \mathrm{E}[H(y_i)] + o_p(1) = O_p(1)$, $\operatorname{plim}_{n\to\infty} c = 0$, and
$\operatorname{plim}_{n\to\infty} b = -\mathcal{I}(\theta_0)$.

Now, since $\partial \log L(\widehat{\theta} \mid y)/\partial \theta = 0$, we have $a\delta^2 + b\delta + c = 0$. There are two
possibilities. If $a \neq 0$ with probability 1, which will occur when the FOC are
nonlinear in $\theta$, then

$$\delta = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \qquad (5.18)$$

Since $ac = o_p(1)$, one of these roots satisfies $\delta \overset{p}{\to} 0$, while the other satisfies
$\delta \overset{p}{\to} \mathcal{I}(\theta_0)/\alpha$ if $\operatorname{plim}_{n\to\infty} a = \alpha \neq 0$ exists. If $a = 0$, then the FOC are linear
in $\theta$, whereupon $\delta = -c/b$ and again $\delta \overset{p}{\to} 0$. If the FOC are nonlinear but
asymptotically linear, then $a \overset{p}{\to} 0$ and $ac$ in the numerator of (5.18) will still
go to zero faster than $a$ in the denominator, so $\delta \overset{p}{\to} 0$. Thus there exists at
least one consistent solution $\widehat{\theta}$ which satisfies

$$\operatorname{plim}_{n\to\infty} (\widehat{\theta} - \theta_0) = 0, \qquad (5.19)$$

and in the asymptotically nonlinear case there is also a possibly inconsistent
solution.

For the case of a vector $\theta$, we can apply a similar style of proof to show there
exists a solution $\widehat{\theta}$ to the FOC that satisfies $\operatorname{plim}_{n\to\infty} (\widehat{\theta} - \theta_0) = 0$. And in the event
of asymptotically nonlinear FOC there is at least one other possibly inconsistent
root.
Global Maximum Is Consistent
In the event of multiple roots, we are left with the problem of selecting between
them. By assumption 4, the parameter $\theta$ is globally identified by the density
function. Formally, this means that

$$f(y, \theta) = f(y, \theta_0) \qquad (5.20)$$

for all $y$ implies that $\theta = \theta_0$. Now,

$$\mathrm{E}\left[ \frac{f(y, \theta)}{f(y, \theta_0)} \right] = \int \frac{f(y, \theta)}{f(y, \theta_0)} f(y, \theta_0)\, dy = 1. \qquad (5.21)$$

Thus, by Jensen's inequality, we have

$$\mathrm{E}\left[ \log \frac{f(y, \theta)}{f(y, \theta_0)} \right] < \log \mathrm{E}\left[ \frac{f(y, \theta)}{f(y, \theta_0)} \right] = 0, \qquad (5.22)$$

unless $f(y, \theta) = f(y, \theta_0)$ for all $y$, or $\theta = \theta_0$. Therefore, $\mathrm{E}[\log f(y, \theta)]$
achieves a maximum if and only if $\theta = \theta_0$.

However, we are solving

$$\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log f(y_i, \theta) \overset{p}{\to} \mathrm{E}[\log f(y_i, \theta)], \qquad (5.23)$$

and if we choose an inconsistent root, we will not obtain a global maximum.
Thus, asymptotically, the global maximum is a consistent root. This choice of
the global root has added appeal since it is, in fact, the MLE among the possible
alternatives and hence the choice that makes the realized data most likely to
have occurred.

There are complications in finite samples since the values of the likelihood
function for alternative roots may cross over as the sample size increases. That
is, the global maximum in small samples may not be the global maximum in
larger samples. An added problem is to identify all the alternative roots so
we can choose the global maximum. Sometimes a solution is available in a
simple consistent estimator, which may be used to start the nonlinear MLE
optimization.
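As an illustration of this last remark (not from the notes), the Cauchy location model is a standard case in which the likelihood equation can have several roots; the sample median is a simple consistent estimator of the location and provides a sensible starting value for the numerical optimization. The sketch below assumes that model with arbitrary simulated data.

# Illustrative sketch: starting the MLE search from a simple consistent estimator.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
theta_true = 3.0
y = stats.cauchy.rvs(loc=theta_true, size=100, random_state=rng)

def neg_loglik(theta):
    # Negative log-likelihood of the Cauchy location model.
    return -stats.cauchy.logpdf(y, loc=theta).sum()

start = np.median(y)                      # consistent (if inefficient) estimator of the location
res = optimize.minimize(neg_loglik, x0=np.array([start]))
print(start, res.x[0])                    # starting value and the refined MLE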
Asymptotic Normality
For $p = 1$, we have $a\delta^2 + b\delta + c = 0$, so

$$\delta = \widehat{\theta} - \theta_0 = \frac{-c}{a\delta + b} \qquad (5.24)$$

and

$$\sqrt{n}(\widehat{\theta} - \theta_0) = \frac{-\sqrt{n}\, c}{a(\widehat{\theta} - \theta_0) + b} = -\frac{1}{a(\widehat{\theta} - \theta_0) + b} \sqrt{n}\, c.$$

Now since $a = O_p(1)$ and $\widehat{\theta} - \theta_0 = o_p(1)$, then $a(\widehat{\theta} - \theta_0) = o_p(1)$ and

$$a(\widehat{\theta} - \theta_0) + b \overset{p}{\to} -\mathcal{I}(\theta_0). \qquad (5.25)$$

And by the CLT we have

$$\sqrt{n}\, c = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial \log f(y_i \mid \theta_0)}{\partial \theta} \overset{d}{\to} N(0, \mathcal{I}(\theta_0)). \qquad (5.26)$$

Combining these two results, we find

$$\sqrt{n}(\widehat{\theta} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}(\theta_0)^{-1}).$$

In general, for $p > 1$, we can apply the same scalar proof to show
$\sqrt{n}(\lambda'\widehat{\theta} - \lambda'\theta_0) \overset{d}{\to} N(0, \lambda' \mathcal{I}(\theta_0)^{-1} \lambda)$ for any vector $\lambda$, which means

$$\sqrt{n}(\widehat{\theta} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}(\theta_0)), \qquad (5.27)$$

if $\widehat{\theta}$ is the global maximum.
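A small Monte Carlo experiment makes (5.27) concrete. The sketch below is illustrative only (an exponential model and arbitrary numerical values are assumed): for $y_i$ exponential with rate $\lambda$, the MLE is $1/\bar{y}$ and the per-observation information is $1/\lambda^2$, so $\sqrt{n}(\widehat{\lambda} - \lambda_0)$ should have mean near zero and variance near $\lambda_0^2$.

# Monte Carlo sketch of asymptotic normality of the MLE in the exponential model.
import numpy as np

rng = np.random.default_rng(3)
lam0, n, reps = 2.0, 400, 5000

draws = rng.exponential(scale=1.0 / lam0, size=(reps, n))
lam_hat = 1.0 / draws.mean(axis=1)               # MLE of the rate in each replication
z = np.sqrt(n) * (lam_hat - lam0)

print(z.mean(), z.var())                         # should be near 0 and lambda_0^2 = 4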
Cramér-Rao Lower Bound
In addition to being the asymptotic covariance matrix of the MLE, $\mathcal{I}^{-1}(\theta_0)$ defines a lower
bound for covariance matrices with certain desirable properties. Let $\widetilde{\theta}(y)$ be
any unbiased estimator; then

$$\mathrm{E}[\widetilde{\theta}(y)] = \int \widetilde{\theta}(y) f(y; \theta)\, dy = \theta, \qquad (5.28)$$

for any underlying $\theta = \theta_0$. Differentiating both sides of this relationship with
respect to $\theta$ yields

$$I_p = \frac{\partial \mathrm{E}[\widetilde{\theta}(y)]}{\partial \theta'} = \int \widetilde{\theta}(y) \frac{\partial f(y; \theta_0)}{\partial \theta'}\, dy = \int \widetilde{\theta}(y) \frac{\partial \log f(y; \theta_0)}{\partial \theta'} f(y; \theta_0)\, dy = \mathrm{E}\left[ \widetilde{\theta}(y) \frac{\partial \log f(y; \theta_0)}{\partial \theta'} \right] = \mathrm{E}\left[ (\widetilde{\theta}(y) - \theta_0) \frac{\partial \log f(y; \theta_0)}{\partial \theta'} \right]. \qquad (5.29)$$

Next, we let

$$C(\theta_0) = \mathrm{E}\left[ (\widetilde{\theta}(y) - \theta_0)(\widetilde{\theta}(y) - \theta_0)' \right] \qquad (5.30)$$

be the covariance matrix of $\widetilde{\theta}(y)$; then

$$\operatorname{Cov}\left[ \begin{array}{c} \widetilde{\theta}(y) \\ \partial \log L / \partial \theta \end{array} \right] = \left[ \begin{array}{cc} C(\theta_0) & I_p \\ I_p & \mathcal{I}(\theta_0) \end{array} \right], \qquad (5.31)$$

where $I_p$ is a $p \times p$ identity matrix, and (5.31), as a covariance matrix, is positive
semidefinite.

Now, for any $(p \times 1)$ vector $a$, we have

$$\left[ a' \;\; -a' \mathcal{I}(\theta_0)^{-1} \right] \left[ \begin{array}{cc} C(\theta_0) & I_p \\ I_p & \mathcal{I}(\theta_0) \end{array} \right] \left[ \begin{array}{c} a \\ -\mathcal{I}(\theta_0)^{-1} a \end{array} \right] = a'\left[ C(\theta_0) - \mathcal{I}(\theta_0)^{-1} \right] a \geq 0. \qquad (5.32)$$

Thus, any unbiased estimator $\widetilde{\theta}(y)$ has a covariance matrix that exceeds $\mathcal{I}(\theta_0)^{-1}$
by a positive semidefinite matrix. And if the MLE is unbiased, it is
efficient within the class of unbiased estimators. Likewise, any CUAN estimator
will have a covariance exceeding $\mathcal{I}(\theta_0)^{-1}$. Since the asymptotic covariance
of the MLE is, in fact, $\mathcal{I}(\theta_0)^{-1}$, it is asymptotically efficient. $\mathcal{I}(\theta_0)^{-1}$ is called
the Cramér-Rao lower bound.
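A concrete check of the bound (again an assumed example, not from the notes): for the normal mean with $\sigma^2$ known, the per-sample information for $\mu$ is $n/\sigma^2$, so the Cramér-Rao bound for unbiased estimators of $\mu$ is $\sigma^2/n$, which the sample mean attains.

# Simulation sketch: the sample mean attains the Cramer-Rao bound sigma^2 / n.
import numpy as np

rng = np.random.default_rng(4)
mu0, sigma, n, reps = 1.0, 2.0, 50, 20000

ybar = rng.normal(mu0, sigma, size=(reps, n)).mean(axis=1)
print(ybar.var())        # Monte Carlo variance of the unbiased estimator ybar
print(sigma ** 2 / n)    # Cramer-Rao lower bound, the inverse information for mu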
5.3 Maximum Likelihood Inference
5.3.1 Likelihood Ratio Test
Suppose we wish to test $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$. Then, we define

$$L_u = \max_{\theta} L(\theta \mid y) = L(\widehat{\theta} \mid y) \qquad (5.33)$$

and

$$L_r = L(\theta_0 \mid y), \qquad (5.34)$$

where $L_u$ is the unrestricted likelihood and $L_r$ is the restricted likelihood. We
then form the likelihood ratio

$$\lambda = \frac{L_r}{L_u}. \qquad (5.35)$$

Note that the restricted likelihood can be no larger than the unrestricted, which
maximizes the function.

As with estimation, it is more convenient to work with the logs of the likelihood
functions. It will be shown below that, under $H_0$,

$$LR = -2 \log \lambda = -2 \log\left( \frac{L_r}{L_u} \right) = 2\left[ \mathcal{L}(\widehat{\theta} \mid y) - \mathcal{L}(\theta_0 \mid y) \right] \overset{d}{\to} \chi^2_p, \qquad (5.36)$$

where $\widehat{\theta}$ is the unrestricted MLE and $\theta_0$ is the restricted MLE. If $H_1$ applies,
then $LR = O_p(n)$. Large values of this statistic indicate that the restrictions
make the observed values much less likely than does the unrestricted estimate, so we
prefer the unrestricted model and reject the restrictions.

In general, for $H_0: r(\theta) = 0$ and $H_1: r(\theta) \neq 0$, we have

$$L_u = \max_{\theta} L(\theta \mid y) = L(\widehat{\theta} \mid y), \qquad (5.37)$$

and

$$L_r = \max_{\theta:\, r(\theta) = 0} L(\theta \mid y) = L(\widetilde{\theta} \mid y). \qquad (5.38)$$

Under $H_0$,

$$LR = 2\left[ \mathcal{L}(\widehat{\theta} \mid y) - \mathcal{L}(\widetilde{\theta} \mid y) \right] \overset{d}{\to} \chi^2_q, \qquad (5.39)$$

where $q$ is the length of $r(\theta)$.

Note that in the general case, the likelihood ratio test requires calculation
of both the restricted and the unrestricted MLE.
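For concreteness, the following sketch (an assumed normal-model example, not from the notes) computes the LR statistic for $H_0: \mu = \mu_0$ with $\sigma^2$ unrestricted; both the restricted and unrestricted MLEs have closed forms here, so no numerical optimization is needed.

# Illustrative LR test for H0: mu = mu_0 in the normal model (assumed example).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(0.3, 1.0, size=200)
mu_0 = 0.0

def loglik(mu, sigma2):
    return np.sum(stats.norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)))

# Unrestricted MLE
mu_hat = y.mean()
s2_hat = ((y - mu_hat) ** 2).mean()
# Restricted MLE under H0 (mu fixed at mu_0, sigma^2 re-estimated)
s2_tilde = ((y - mu_0) ** 2).mean()

LR = 2 * (loglik(mu_hat, s2_hat) - loglik(mu_0, s2_tilde))
print(LR, stats.chi2.sf(LR, df=1))    # statistic and asymptotic p-value, q = 1 restriction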
5.3.2 Wald Test
The asymptotic normality of MLE may be used to obtain a test based only on
the unrestricted estimates.
Now, under $H_0: \theta = \theta_0$, we have

$$\sqrt{n}(\widehat{\theta} - \theta_0) \overset{d}{\to} N(0, \mathcal{I}^{-1}(\theta_0)). \qquad (5.40)$$

Thus, using the results on the asymptotic behavior of quadratic forms from the
previous chapter, we have

$$W = n (\widehat{\theta} - \theta_0)' \mathcal{I}(\theta_0) (\widehat{\theta} - \theta_0) \overset{d}{\to} \chi^2_p, \qquad (5.41)$$

which is the Wald test. As we discussed for quadratic tests, in general, under
$H_1: \theta \neq \theta_0$, we would have $W = O_p(n)$.

In practice, since

$$-\frac{1}{n} \frac{\partial^2 \mathcal{L}(\widehat{\theta} \mid y)}{\partial \theta\, \partial \theta'} = -\sum_{i=1}^{n} \frac{1}{n} \frac{\partial^2 \log f(y_i \mid \widehat{\theta})}{\partial \theta\, \partial \theta'} \overset{p}{\to} \mathcal{I}(\theta_0), \qquad (5.42)$$

we use

$$W^* = (\widehat{\theta} - \theta_0)' \left[ -\frac{\partial^2 \mathcal{L}(\widehat{\theta} \mid y)}{\partial \theta\, \partial \theta'} \right] (\widehat{\theta} - \theta_0) \overset{d}{\to} \chi^2_p. \qquad (5.43)$$

Aside from having the same asymptotic distribution, the Likelihood Ratio
and Wald tests are asymptotically equivalent in the sense that

$$\operatorname{plim}_{n\to\infty} (LR - W^*) = 0. \qquad (5.44)$$

This is shown by expanding $\mathcal{L}(\theta_0 \mid y)$ in a Taylor's series about $\widehat{\theta}$. That is,

$$\mathcal{L}(\theta_0) = \mathcal{L}(\widehat{\theta}) + \frac{\partial \mathcal{L}(\widehat{\theta})}{\partial \theta'} (\theta_0 - \widehat{\theta}) + \frac{1}{2} (\theta_0 - \widehat{\theta})' \frac{\partial^2 \mathcal{L}(\widehat{\theta})}{\partial \theta\, \partial \theta'} (\theta_0 - \widehat{\theta}) + \frac{1}{6} \sum_i \sum_j \sum_k \frac{\partial^3 \mathcal{L}(\theta^*)}{\partial \theta_i\, \partial \theta_j\, \partial \theta_k} (\theta_{0i} - \widehat{\theta}_i)(\theta_{0j} - \widehat{\theta}_j)(\theta_{0k} - \widehat{\theta}_k), \qquad (5.45)$$

where the last term applies the intermediate value theorem for $\theta^*$ between $\widehat{\theta}$
and $\theta_0$. Now $\partial \mathcal{L}(\widehat{\theta})/\partial \theta = 0$, and the third-order term can be shown to be $O_p(1/\sqrt{n})$
under assumption 5, whereupon we have

$$\mathcal{L}(\widehat{\theta}) - \mathcal{L}(\theta_0) = -\frac{1}{2} (\theta_0 - \widehat{\theta})' \frac{\partial^2 \mathcal{L}(\widehat{\theta})}{\partial \theta\, \partial \theta'} (\theta_0 - \widehat{\theta}) + O_p(1/\sqrt{n}) \qquad (5.46)$$

and

$$LR = W^* + O_p(1/\sqrt{n}). \qquad (5.47)$$

In general, we may test $H_0: r(\theta) = 0$ with

$$W^* = r(\widehat{\theta})' \left[ R(\widehat{\theta}) \left( -\frac{\partial^2 \mathcal{L}(\widehat{\theta})}{\partial \theta\, \partial \theta'} \right)^{-1} R'(\widehat{\theta}) \right]^{-1} r(\widehat{\theta}) \overset{d}{\to} \chi^2_q, \qquad (5.48)$$

where $R(\theta) = \partial r(\theta) / \partial \theta'$.
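Continuing the same assumed normal-model example (illustrative, not from the notes), the sketch below forms the Wald statistic in the style of (5.43) for the single restriction $H_0: \mu = \mu_0$. Because the $\mu$–$\sigma^2$ cross-derivative of the log-likelihood is zero at the unrestricted MLE, the statistic reduces to $(\widehat{\mu} - \mu_0)^2 \, n / \widehat{\sigma}^2$.

# Illustrative Wald test sketch for H0: mu = mu_0 in the normal model (assumed example).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(0.3, 1.0, size=200)
n, mu_0 = y.size, 0.0

mu_hat = y.mean()
sigma2_hat = ((y - mu_hat) ** 2).mean()

# -d^2 log L / d mu^2 at the unrestricted MLE equals n / sigma2_hat, so W* has a simple form.
W_star = (mu_hat - mu_0) ** 2 * (n / sigma2_hat)
print(W_star, stats.chi2.sf(W_star, df=1))    # statistic and asymptotic p-value (one restriction)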
5.4 Lagrange Multiplier
Alternatively, but in the same fashion, we can expand $\mathcal{L}(\widehat{\theta})$ about $\theta_0$ to obtain

$$\mathcal{L}(\widehat{\theta}) = \mathcal{L}(\theta_0) + \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta'} (\widehat{\theta} - \theta_0) + \frac{1}{2} (\widehat{\theta} - \theta_0)' \frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'} (\widehat{\theta} - \theta_0) + O_p(1/\sqrt{n}). \qquad (5.49)$$

Likewise, we can also expand $\frac{1}{n} \frac{\partial \mathcal{L}(\widehat{\theta})}{\partial \theta}$ about $\theta_0$, which yields

$$0 = \frac{1}{n} \frac{\partial \mathcal{L}(\widehat{\theta})}{\partial \theta} = \frac{1}{n} \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} + \frac{1}{n} \frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'} (\widehat{\theta} - \theta_0) + O_p(1/n), \qquad (5.50)$$

or

$$(\widehat{\theta} - \theta_0) = -\left[ \frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'} \right]^{-1} \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} + O_p(1/n). \qquad (5.51)$$

Substituting (5.51) into (5.49) gives us

$$\mathcal{L}(\widehat{\theta}) - \mathcal{L}(\theta_0) = -\frac{1}{2} \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta'} \left[ \frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'} \right]^{-1} \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} + O_p(1/\sqrt{n}), \qquad (5.52)$$

and $LR = LM + O_p(1/\sqrt{n})$, where

$$LM = -\frac{\partial \mathcal{L}(\theta_0)}{\partial \theta'} \left[ \frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta\, \partial \theta'} \right]^{-1} \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} \qquad (5.53)$$

is the Lagrange Multiplier test.

Thus, under $H_0: \theta = \theta_0$,

$$\operatorname{plim}_{n\to\infty} (LR - LM) = 0 \qquad (5.54)$$

and

$$LM \overset{d}{\to} \chi^2_p. \qquad (5.55)$$

Note that the Lagrange Multiplier test only requires the restricted values of the
parameters.

In general, we may test $H_0: r(\theta) = 0$ with

$$LM = -\frac{\partial \mathcal{L}(\widetilde{\theta})}{\partial \theta'} \left[ \frac{\partial^2 \mathcal{L}(\widetilde{\theta})}{\partial \theta\, \partial \theta'} \right]^{-1} \frac{\partial \mathcal{L}(\widetilde{\theta})}{\partial \theta} \overset{d}{\to} \chi^2_q, \qquad (5.56)$$

where $\mathcal{L}(\cdot)$ is the unrestricted log-likelihood function and $\widetilde{\theta}$ is the restricted
MLE.
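To round out the trio with the same assumed normal-model example (illustrative only), the sketch below computes the LM statistic (5.56) for $H_0: \mu = \mu_0$: the score and Hessian of the unrestricted log-likelihood are evaluated at the restricted MLE $(\mu_0, \widetilde{\sigma}^2)$.

# Illustrative LM (score) test sketch for H0: mu = mu_0 in the normal model (assumed example).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
y = rng.normal(0.3, 1.0, size=200)
n, mu_0 = y.size, 0.0

s2_tilde = ((y - mu_0) ** 2).mean()       # restricted MLE of sigma^2 under H0

# Score of the full log-likelihood at the restricted MLE; the sigma^2 component is zero
# by construction, since s2_tilde solves its own first-order condition.
g = np.array([np.sum(y - mu_0) / s2_tilde, 0.0])

# Hessian of the log-likelihood at the restricted MLE.
H = np.array([
    [-n / s2_tilde, -np.sum(y - mu_0) / s2_tilde ** 2],
    [-np.sum(y - mu_0) / s2_tilde ** 2,
     n / (2 * s2_tilde ** 2) - np.sum((y - mu_0) ** 2) / s2_tilde ** 3],
])

LM = -g @ np.linalg.solve(H, g)           # LM = -score' Hessian^{-1} score
print(LM, stats.chi2.sf(LM, df=1))        # statistic and asymptotic p-value (one restriction)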
5.5 Choosing Between Tests
The above analysis demonstrates that the three tests (likelihood ratio, Wald,
and Lagrange multiplier) are asymptotically equivalent. In large samples, not
only do they have the same limiting distribution, but they will accept and reject
together. This is not the case in finite samples, where one test can reject when another
does not. This might lead a cynical analyst to use one rather than another by
choosing the one that yields the results (s)he wants to obtain. Making
an informed choice based on their finite sample behavior is beyond the scope of
this course.

In many cases, however, one of the tests is a much more natural choice than
the others. Recall that the Wald test only requires the unrestricted estimates while
the Lagrange multiplier test only requires the restricted estimates. In some
cases the unrestricted estimates are much easier to obtain than the restricted,
and in other cases the reverse is true. In the first case we might be inclined
to use the Wald test, while in the latter we would prefer the Lagrange
multiplier.

Another issue is the possible sensitivity of the test results to how the restrictions
are written. For example, $\theta_1 + \theta_2 = 0$ can also be written $\theta_2 / \theta_1 = -1$.
The Wald test, in particular, is sensitive to how the restriction is written. This
is yet another situation where a cynical analyst might be tempted to choose the
normalization of the restriction that forces the desired result. The Lagrange
multiplier test, as presented above, is also sensitive to the normalization of the
restriction, but it can be modified to avoid this difficulty. The likelihood ratio test,
however, is unaffected by the choice of how to write the restriction.