3

The Geometry of Influence Functions

As we will describe shortly, most reasonable estimators for the parameter β, in either parametric or semiparametric models, are asymptotically linear
and can be uniquely characterized by the influence function of the estimator.
The class of influence functions for such estimators belongs to the Hilbert
space of all mean-zero q-dimensional random functions with finite variance
that was defined in Chapter 2. As such, this construction will allow us to view
estimators or, more specifically, the influence function of estimators, from a
geometric point of view. This will give us intuitive insight into the construction
of such estimators and a geometric way of assessing the relative efficiencies of
the various estimators.
As always, consider the statistical model where Z1 , . . . , Zn are iid ran-
dom vectors and the density of a single Z is assumed to belong to the class
{pZ(z; θ), θ ∈ Ω} with respect to some dominating measure νZ. The parameter
θ can be written as (β T , η T )T , where β q×1 is the parameter of interest and
η, the nuisance parameter, may be finite- or infinite-dimensional. The truth
will be denoted by θ0 = (β0T , η0T )T. For the remainder of this chapter, we will
only consider parametric models where θ = (β T , η T )T and the vector θ is
p-dimensional, the parameter of interest β is q-dimensional, and the nuisance
parameter η is r-dimensional, with p = q + r.
An estimator β̂n of β is a q-dimensional measurable random function of
Z1 , . . . , Zn . Most reasonable estimators for β are asymptotically linear; that
is, there exists a random vector (i.e., a q-dimensional measurable random
function) ϕ^{q×1}(Z), such that E{ϕ(Z)} = 0^{q×1},

n1/2(β̂n − β0) = n−1/2 Σ_{i=1}^n ϕ(Zi) + op(1),   (3.1)

where op(1) is a term that converges in probability to zero as n goes to infinity and E(ϕϕT) is finite and nonsingular.
Remark 1. The function ϕ(Z) is defined with respect to the true distribution p(z, θ0) that generates the data. Consequently, we sometimes may write ϕ(Z, θ) to emphasize that this random function will vary according to the value of θ in the model. Unless otherwise stated, it will be assumed that ϕ(Z) is evaluated at the truth and expectations are taken with respect to the truth. Therefore, E{ϕ(Z)} is shorthand for Eθ0{ϕ(Z, θ0)}. □
The random vector ϕ(Zi ) in (3.1) is referred to as the i-th influence func-
tion of the estimator β̂n or the influence function of the i-th observation of
the estimator β̂n . The term influence function comes from the robustness lit-
erature, where, to first order, ϕ(Zi ) is the influence of the i-th observation on
β̂n ; see Hampel (1974).

Example 1. As a simple example, consider the model where Z1, . . . , Zn are iid N(µ, σ²). The maximum likelihood estimators for µ and σ² are given by µ̂n = n−1 Σ_{i=1}^n Zi and σ̂n² = n−1 Σ_{i=1}^n (Zi − µ̂n)², respectively. That the estimator µ̂n for µ is asymptotically linear follows immediately because

n1/2(µ̂n − µ0) = n−1/2 Σ_{i=1}^n (Zi − µ0).

Therefore, µ̂n is an asymptotically linear estimator for µ whose i-th influence function is given by ϕ(Zi) = (Zi − µ0).
After some straightforward algebra, we can express the estimator σ̂n² minus the estimand as

(σ̂n² − σ0²) = n−1 Σ_{i=1}^n {(Zi − µ0)² − σ0²} − (µ̂n − µ0)².   (3.2)

Multiplying (3.2) by n1/2, we obtain

n1/2(σ̂n² − σ0²) = n−1/2 Σ_{i=1}^n {(Zi − µ0)² − σ0²} − n1/2(µ̂n − µ0)².

Since n1/2(µ̂n − µ0) converges to a normal distribution and (µ̂n − µ0) converges in probability to zero, this implies that n1/2(µ̂n − µ0)² converges in probability to zero (i.e., is op(1)). Consequently, we have demonstrated that σ̂n² is an asymptotically linear estimator for σ² whose i-th influence function is given by ϕ(Zi) = {(Zi − µ0)² − σ0²}. □
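As a quick Monte Carlo sanity check (my addition, not from the text), the asymptotic linearity of σ̂n² can be verified numerically: for a large sample, n1/2(σ̂n² − σ0²) minus the normalized sum of influence functions equals the op(1) remainder, whose magnitude is exactly n1/2(µ̂n − µ0)² and is small. A minimal sketch using numpy:

```python
import numpy as np

# Check the asymptotic linearity of the MLE of sigma^2 in the N(mu, sigma^2)
# model, using the influence function phi(Z_i) = (Z_i - mu_0)^2 - sigma_0^2.
rng = np.random.default_rng(0)
mu0, sigma0 = 2.0, 3.0
n = 100_000

z = rng.normal(mu0, sigma0, size=n)
mu_hat = z.mean()
sigma2_hat = np.mean((z - mu_hat) ** 2)       # MLE of sigma^2 (divisor n)

lhs = np.sqrt(n) * (sigma2_hat - sigma0**2)   # n^{1/2}(sigma2_hat - sigma0^2)
ilin = np.sum((z - mu0) ** 2 - sigma0**2) / np.sqrt(n)  # n^{-1/2} sum phi(Z_i)

# The remainder has magnitude n^{1/2}(mu_hat - mu0)^2, which is o_p(1)
remainder = lhs - ilin
assert abs(abs(remainder) - np.sqrt(n) * (mu_hat - mu0) ** 2) < 1e-6
assert abs(remainder) < 1.0   # small for large n
```

The exact algebraic identity for the remainder makes the first assertion hold up to floating-point error; the second only holds approximately, reflecting the op(1) behavior.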

When considering the asymptotic properties of an asymptotically linear estimator (e.g., asymptotic normality and asymptotic variance), it suffices to consider the influence function of the estimator. This follows as a simple consequence of the central limit theorem (CLT). Since, by definition,

n1/2(β̂n − β0) = n−1/2 Σ_{i=1}^n ϕ(Zi) + op(1),

then, by the central limit theorem,

n−1/2 Σ_{i=1}^n ϕ(Zi) −D→ N(0^{q×1}, E(ϕϕT)),

and, by Slutsky’s theorem,

n1/2(β̂n − β0) −D→ N(0, E(ϕϕT)).
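This limiting distribution is easy to check by simulation (my addition, not from the text): for the sample mean of N(µ0, σ0²) data, ϕ(Z) = Z − µ0, so across replications n1/2(µ̂n − µ0) should have mean near 0 and variance near E(ϕ²) = σ0². A sketch using numpy:

```python
import numpy as np

# Monte Carlo distribution of n^{1/2}(mu_hat - mu0) over many replicated samples
rng = np.random.default_rng(1)
mu0, sigma0 = 0.5, 2.0
n, reps = 500, 4000

z = rng.normal(mu0, sigma0, size=(reps, n))
root_n_err = np.sqrt(n) * (z.mean(axis=1) - mu0)

# CLT + Slutsky: limit is N(0, E(phi^2)) with phi(Z) = Z - mu0, E(phi^2) = sigma0^2
assert abs(root_n_err.mean()) < 0.15
assert abs(root_n_err.var() - sigma0**2) < 0.4
```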

In an asymptotic sense, an asymptotically linear estimator can be identified through its influence function, as we now demonstrate in the following theorem.
Theorem 3.1. An asymptotically linear estimator has a unique (a.s.) influ-
ence function.

Proof (by contradiction). Suppose not. Then there exists another influence function ϕ∗(Z) such that

E{ϕ∗(Z)} = 0

and

n1/2(β̂n − β0) = n−1/2 Σ_{i=1}^n ϕ∗(Zi) + op(1).

Since n1/2(β̂n − β0) is also equal to n−1/2 Σ_{i=1}^n ϕ(Zi) + op(1), this implies that

n−1/2 Σ_{i=1}^n {ϕ(Zi) − ϕ∗(Zi)} = op(1).

However, by the CLT,

n−1/2 Σ_{i=1}^n {ϕ(Zi) − ϕ∗(Zi)} −D→ N(0, E{(ϕ − ϕ∗)(ϕ − ϕ∗)T}).

In order for this limiting normal distribution to be op(1), we would require that the covariance matrix

E{(ϕ − ϕ∗)(ϕ − ϕ∗)T} = 0^{q×q},

which implies that ϕ(Z) = ϕ∗(Z) a.s. □

The representation of estimators through their influence function lends itself nicely to geometric interpretations in terms of Hilbert spaces, discussed in Chapter 2. Before describing this geometry, we briefly comment on some regularity conditions that will be imposed on the class of estimators we will consider.

Reminder. We know that the variance of any unbiased estimator must be greater than or equal to the Cramér-Rao lower bound; see, for example, Casella and Berger (2002, Section 7.3). When considering asymptotic theory, where we let the sample size n go to infinity, most reasonable estimators are asymptotically unbiased. Thus, we might expect the asymptotic variance of such asymptotically unbiased estimators also to be greater than or equal to the Cramér-Rao lower bound. This indeed is the case for the most part, and estimators whose asymptotic variance equals the Cramér-Rao lower bound are referred to as asymptotically efficient. For parametric models, with suitable regularity conditions, the maximum likelihood estimator (MLE) is an example of an efficient estimator. One of the peculiarities of asymptotic theory is that asymptotically unbiased estimators can be constructed that have asymptotic variance equal to the Cramér-Rao lower bound for most of the parameter values in the model but smaller asymptotic variance than the Cramér-Rao lower bound for the remaining parameter values. Such estimators are referred to as super-efficient, and for completeness we give the construction of such an estimator (due to Hodges) as an example.

3.1 Super-Efficiency
Example Due to Hodges

Let Z1, . . . , Zn be iid N(µ, 1), µ ∈ R. For this simple model, we know that the maximum likelihood estimator (MLE) of µ is given by the sample mean Z̄n = n−1 Σ_{i=1}^n Zi and that

n1/2(Z̄n − µ) −D(µ)→ N(0, 1).

Now, consider the estimator µ̂n given by Hodges in 1951 (see LeCam, 1953):

µ̂n = Z̄n if |Z̄n| > n−1/4,
µ̂n = 0 if |Z̄n| ≤ n−1/4.
Some of the properties of this estimator are as follows.

If µ ≠ 0, then with increasing probability, the support of Z̄n moves away from 0 (see Figure 3.1).

Fig. 3.1. When µ ≠ 0, Pµ(Z̄n ≠ µ̂n) → 0

Therefore n1/2(Z̄n − µ) = n1/2(µ̂n − µ) + op(1) and n1/2(µ̂n − µ) −D(µ)→ N(0, 1).
If µ = 0, then the support of Z̄n will be concentrated in an O(n−1/2 )
neighborhood about the origin and hence, with increasing probability, will be
within ±n−1/4 (see Figure 3.2).

Fig. 3.2. When µ = 0, P0{|Z̄n| < n−1/4} → 1

Therefore, this implies that P0(µ̂n = 0) → 1. Hence P0{n1/2 µ̂n = 0} → 1, and n1/2(µ̂n − 0) −P→ 0, or −D(0)→ N(0, 0). Consequently, the asymptotic variance of n1/2(µ̂n − µ) is equal to 1 for all µ ≠ 0, as it is for the MLE Z̄n, but for µ = 0, the asymptotic variance of n1/2(µ̂n − µ) equals 0; thus µ̂n is super-efficient.
Although super-efficiency, at the surface, may seem like a good property
for an estimator to possess, upon further study we find that super-efficiency
is gained at the expense of poor estimation in a neighborhood of zero. To
illustrate this point, consider the sequence µn = n−1/3, which converges to zero, the value at which the estimator µ̂n is super-efficient. The MLE has the property that
n1/2(Z̄n − µn) −D(µn)→ N(0, 1).
However, Z̄n concentrates its mass in an O(n−1/2) neighborhood about µn = n−1/3, which eventually, as n increases, will be completely contained within the range ±n−1/4 with probability converging to one (see Figure 3.3).

Fig. 3.3. When µn = n−1/3, Pµn(µ̂n = 0) → 1

Therefore,

Pµn{n1/2(µ̂n − µn) = −n1/2 µn} → 1.

Consequently, if µn = n−1/3, then −n1/2 µn = −n1/6 → −∞. Therefore, n1/2(µ̂n − µn) diverges to −∞.
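Both regimes of the Hodges estimator are easy to reproduce by simulation (my addition, not from the text): at µ = 0 the estimator collapses to exactly 0 (super-efficiency), while along the local sequence µn = n−1/3 it is also thresholded to 0, so n1/2(µ̂n − µn) drifts toward −n1/6. A sketch using numpy:

```python
import numpy as np

rng = np.random.default_rng(2)

def hodges(zbar, n):
    # Hodges' estimator: replace the sample mean by 0 when |Z_bar| <= n^{-1/4}
    return np.where(np.abs(zbar) > n ** (-0.25), zbar, 0.0)

n, reps = 10_000, 500

# At mu = 0, the estimator equals 0 with probability tending to one
zbar0 = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
assert np.mean(hodges(zbar0, n) == 0.0) > 0.99

# Along mu_n = n^{-1/3}, nearly every replicate is thresholded to 0, so
# n^{1/2}(mu_hat_n - mu_n) is close to -n^{1/2} mu_n = -n^{1/6} (about -4.6 here)
mun = n ** (-1 / 3)
zbarn = rng.normal(mun, 1.0, size=(reps, n)).mean(axis=1)
scaled_err = np.sqrt(n) * (hodges(zbarn, n) - mun)
assert np.mean(hodges(zbarn, n) == 0.0) > 0.9
assert scaled_err.mean() < -3.0
```

As n grows, −n1/6 → −∞, which is the poor local behavior described above.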


Although super-efficient estimators exist, they are unnatural and have undesirable local properties associated with them. Therefore, in order to avoid problems associated with super-efficient estimators, we will impose some additional regularity conditions on the class of estimators that will exclude such estimators. Specifically, we will require that an estimator be regular, as we now define.

Definition 1. Consider a local data generating process (LDGP), where, for each n, the data are distributed according to “θn,” where n1/2(θn − θ∗) converges to a constant (i.e., θn is close to some fixed parameter θ∗). That is,

Z1n, Z2n, . . . , Znn are iid p(z, θn),

where θn = (βnT, ηnT)T and θ∗ = (β∗T, η∗T)T. An estimator β̂n, more specifically β̂n(Z1n, . . . , Znn), is said to be regular if, for each θ∗, n1/2(β̂n − βn) has a limiting distribution that does not depend on the LDGP. □

For our purposes, this will ordinarily mean that if

n1/2{β̂n(Z1n, . . . , Znn) − β∗} −D(θ∗)→ N(0, Σ∗),

where Z1n, . . . , Znn are iid p(z, θ∗) for all n, then

n1/2{β̂n(Z1n, . . . , Znn) − βn} −D(θn)→ N(0, Σ∗),

where Z1n, . . . , Znn are iid p(z, θn) and n1/2(θn − θ∗) → τ^{p×1}, where τ is any arbitrary constant vector.

It is easy to see that, in our previous example, the MLE Z̄n is a regular
estimator, whereas the super-efficient estimator µ̂n , given by Hodges, is not.
From now on, we will restrict ourselves to regular estimators; in fact,
we will only consider estimators that are regular and asymptotically linear
(RAL). Although most reasonable estimators are RAL, regular estimators do
exist that are not asymptotically linear. However, as a consequence of Hájek’s
(1970) representation theorem, it can be shown that the most efficient regular
estimator is asymptotically linear; hence, it is reasonable to restrict attention
to RAL estimators.
In Theorem 3.2 and its subsequent corollary, given below, we present a
very powerful result that allows us to describe the geometry of influence func-
tions for regular asymptotically linear (RAL) estimators. This will aid us in
defining and visualizing efficiency and will also help us generalize ideas to
semiparametric models.
First, we define the score vector for a single observation Z in a parametric
model, where Z ∼ pZ(z, θ), θ = (βT, ηT)T, by Sθ(Z, θ0), where

Sθ(z, θ0) = ∂ log pZ(z, θ)/∂θ |θ=θ0   (3.3)

is the p-dimensional vector of derivatives of the log-likelihood with respect to the elements of the parameter θ, and θ0 denotes the true value of θ that generates the data.
This vector can be partitioned according to β (the parameters of interest)
and η (the nuisance parameters) as
Sθ(Z, θ0) = {SβT(Z, θ0), SηT(Z, θ0)}T,

where the q-dimensional

Sβ(z, θ0) = ∂ log pZ(z, θ)/∂β |θ=θ0

and the r-dimensional

Sη(z, θ0) = ∂ log pZ(z, θ)/∂η |θ=θ0.

Although, in many applications, we can naturally partition the parameter θ as (βT, ηT)T, we will first give results for the more general representation where we define the q-dimensional parameter of interest as a smooth q-dimensional function of the p-dimensional parameter θ; namely, β(θ). As we will show later, especially when we consider infinite-dimensional semiparametric models, in some applications this will be a more natural representation. For parametric models, this is really not a great distinction, as we can always reparametrize the problem so that there is a one-to-one relationship between {βT(θ), ηT(θ)}T and θ for some r-dimensional nuisance function η(θ).

Theorem 3.2. Let the parameter of interest β(θ) be a q-dimensional function of the p-dimensional parameter θ, q < p, such that Γ^{q×p}(θ) = ∂β(θ)/∂θT, the q × p-dimensional matrix of partial derivatives, exists, has rank q, and is continuous in θ in a neighborhood of the truth θ0. Also let β̂n be an asymptotically linear estimator with influence function ϕ(Z) such that Eθ(ϕTϕ) exists and is continuous in θ in a neighborhood of θ0. Then, if β̂n is regular, this will imply that

E{ϕ(Z)SθT(Z, θ0)} = Γ(θ0).   (3.4)
In the special case where θ can be partitioned as (β T , η T )T , we obtain the
following corollary.

Corollary 1.
(i) E{ϕ(Z)SβT(Z, θ0)} = I^{q×q}
and
(ii) E{ϕ(Z)SηT(Z, θ0)} = 0^{q×r},
where I^{q×q} denotes the q × q identity matrix and 0^{q×r} denotes the q × r matrix of zeros.
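As a numerical illustration (my addition, not from the text), the two conditions of Corollary 1 can be checked by Monte Carlo in the normal model of Example 1, taking β = µ as the parameter of interest, η = σ² as the nuisance parameter, and ϕ(Z) = Z − µ0 as the influence function of the sample mean. A sketch using numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
mu0, s2 = 1.0, 4.0          # truth: beta = mu, nuisance eta = sigma^2
n = 1_000_000

z = rng.normal(mu0, np.sqrt(s2), size=n)
phi = z - mu0                                # influence function of the sample mean
s_beta = (z - mu0) / s2                      # score for mu
s_eta = ((z - mu0) ** 2 - s2) / (2 * s2**2)  # score for sigma^2

# Corollary 1: E(phi * S_beta) = 1 (identity, here scalar) and E(phi * S_eta) = 0
assert abs(np.mean(phi * s_beta) - 1.0) < 0.01
assert abs(np.mean(phi * s_eta)) < 0.01
```

The second condition reflects the orthogonality of the influence function to the nuisance score, a fact given geometric meaning in Section 3.3.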
Theorem 3.2 follows from the definition of regularity together with sufficient smoothness conditions that make a local data generating process contiguous (to be defined shortly) to the sequence of distributions at the truth. For completeness, we will give an outline of the proof. Before giving the general proof of Theorem 3.2, which is complicated and can be skipped by the reader not interested in all the technical details, we can gain some insight by first showing how Corollary 1 could be proved for the special (and important) case of the class of m-estimators.

3.2 m-Estimators (Quick Review)


In order to define an m-estimator, we consider a p × 1-dimensional function of Z and θ, m(Z, θ), such that

Eθ{m(Z, θ)} = 0^{p×1},

Eθ{mT(Z, θ)m(Z, θ)} < ∞, and Eθ{m(Z, θ)mT(Z, θ)} is positive definite for all θ ∈ Ω. Additional regularity conditions are also necessary and will be defined as we need them.
The m-estimator θ̂n is defined as the solution (assuming it exists) of

Σ_{i=1}^n m(Zi, θ̂n) = 0

from a sample Z1, . . . , Zn iid pZ(z, θ), θ ∈ Ω ⊂ Rp.

Under suitable regularity conditions, the maximum likelihood estimator (MLE) of θ is an m-estimator. The MLE is defined as the value of θ that maximizes the likelihood

Π_{i=1}^n pZ(Zi, θ),

or, equivalently, the value of θ that maximizes the log-likelihood

Σ_{i=1}^n log pZ(Zi, θ).

Under suitable regularity conditions, the maximum is found by taking the derivative of the log-likelihood with respect to θ and setting it equal to zero. That is, solving the score equation in θ,

Σ_{i=1}^n Sθ(Zi, θ) = 0,   (3.5)

where Sθ(z, θ) is the score vector (i.e., the derivative of the log-density) defined in (3.3). Since the score vector Sθ(Z, θ), under suitable regularity conditions, has the property that Eθ{Sθ(Z, θ)} = 0 (see, for example, equation (7.3.8) of Casella and Berger, 2002), this implies that the MLE is an example of an m-estimator.
In order to prove the consistency and asymptotic normality of m-estimators, we need to assume certain regularity conditions. Some of the conditions that are discussed in Chapter 36 of the Handbook of Econometrics by Newey and McFadden (1994) include that E{∂m(Z, θ0)/∂θT} be nonsingular, where ∂m(Zi, θ)/∂θT is defined as the p × p matrix of all partial derivatives of the elements of m(·) with respect to the elements of θ, and that

n−1 Σ_{i=1}^n ∂m(Zi, θ)/∂θT −P→ Eθ0{∂m(Z, θ)/∂θT}

uniformly in θ in a neighborhood of θ0. For example, uniform convergence would be satisfied if the sample paths of ∂m(Z, θ)/∂θT are continuous in θ about θ0 almost surely and

sup_{θ∈N(θ0)} ‖∂m(Z, θ)/∂θT‖ ≤ g(Z), E{g(Z)} < ∞,

where N(θ0) denotes a neighborhood in θ about θ0. In fact, these regularity conditions would suffice to prove that the estimator θ̂n is consistent; that is, θ̂n −P→ θ0.
Therefore, assuming that these regularity conditions hold, the influence function for θ̂n is found by using the expansion

0 = Σ_{i=1}^n m(Zi, θ̂n) = Σ_{i=1}^n m(Zi, θ0) + {Σ_{i=1}^n ∂m(Zi, θn∗)/∂θT}(θ̂n − θ0),

where θn∗ is an intermediate value between θ̂n and θ0.


Because we have assumed sufficient regularity conditions to guarantee the consistency of θ̂n,

n−1 Σ_{i=1}^n ∂m(Zi, θn∗)/∂θT −P→ E{∂m(Z, θ0)/∂θT},

and by the nonsingularity assumption

{n−1 Σ_{i=1}^n ∂m(Zi, θn∗)/∂θT}−1 −P→ [E{∂m(Z, θ0)/∂θT}]−1.

Therefore,

n1/2(θ̂n − θ0) = −{n−1 Σ_{i=1}^n ∂m(Zi, θn∗)/∂θT}−1 {n−1/2 Σ_{i=1}^n m(Zi, θ0)}
= −[E{∂m(Z, θ0)/∂θT}]−1 {n−1/2 Σ_{i=1}^n m(Zi, θ0)} + op(1).

Since, by definition, E{m(Z, θ0)} = 0, we immediately deduce that the influence function of θ̂n is given by

−[E{∂m(Z, θ0)/∂θT}]−1 m(Zi, θ0)   (3.6)

and

n1/2(θ̂n − θ0) −D→ N(0, [E{∂m(Z, θ0)/∂θT}]−1 var{m(Z, θ0)} [E{∂m(Z, θ0)/∂θT}]−1T),   (3.7)

where

var{m(Z, θ0)} = E{m(Z, θ0)mT(Z, θ0)}.

Estimating the Asymptotic Variance of an m-Estimator

In order to use an m-estimator for the parameter θ in practical applications, such as constructing confidence intervals for θ or a subset of θ, we must be able to derive a consistent estimator for the asymptotic variance of θ̂n. Under suitable regularity conditions, such an estimator can be derived intuitively using what is referred to as the “sandwich” variance estimator. This estimator is motivated by considering the asymptotic variance derived in (3.7). The following heuristic argument is used.
If θ0 (the truth) is known, then a simple application of the weak law of large numbers can be used to obtain a consistent estimator for E{∂m(Z, θ0)/∂θT}, namely

Ê{∂m(Z, θ0)/∂θT} = n−1 Σ_{i=1}^n ∂m(Zi, θ0)/∂θT,   (3.8)

and a consistent estimator for var{m(Z, θ0)} can be obtained by using

v̂ar{m(Z, θ0)} = n−1 Σ_{i=1}^n m(Zi, θ0)mT(Zi, θ0).   (3.9)

Since θ0 is not known, we instead substitute θ̂n for θ0 in equations (3.8) and (3.9) to obtain the sandwich estimator for the asymptotic variance (3.7) of θ̂n, given by

[Ê{∂m(Z, θ̂n)/∂θT}]−1 v̂ar{m(Z, θ̂n)} [Ê{∂m(Z, θ̂n)/∂θT}]−1T.   (3.10)

The estimator (3.10) is referred to as the sandwich variance estimator because the term v̂ar(·) is sandwiched between two terms involving Ê(·). The sandwich variance will be discussed in greater detail in Chapter 4, when we introduce estimators that solve generalized estimating equations (i.e., the so-called GEE estimators). For more details on m-estimators, we refer the reader to the excellent expository article by Stefanski and Boos (2002).
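To make the heuristic concrete, here is a minimal numerical sketch (my addition, not from the text) of the sandwich estimator (3.10) for the m-estimator of θ = (µ, σ²)T with m(Z, θ) = {Z − µ, (Z − µ)² − σ²}T; for iid N(0, 1) data the sandwich should approach diag(1, 2), the asymptotic variances of n1/2(µ̂n − µ0) and n1/2(σ̂n² − σ0²):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
z = rng.normal(0.0, 1.0, size=n)   # truth: mu0 = 0, sigma0^2 = 1

# m-estimator for theta = (mu, sigma^2): solved by the sample moments
mu_hat = z.mean()
s2_hat = np.mean((z - mu_hat) ** 2)

# stacked m(Z_i, theta_hat), one row per observation
m = np.column_stack([z - mu_hat, (z - mu_hat) ** 2 - s2_hat])

# A = empirical E{dm/dtheta^T} at theta_hat; B = empirical var{m} as in (3.9)
A = np.array([[-1.0, 0.0],
              [-2.0 * np.mean(z - mu_hat), -1.0]])
B = (m.T @ m) / n

Ainv = np.linalg.inv(A)
sandwich = Ainv @ B @ Ainv.T   # estimates avar of n^{1/2}(theta_hat - theta0)

# for iid N(0, 1): avar(mu_hat) = 1 and avar(s2_hat) = 2 sigma0^4 = 2
assert abs(sandwich[0, 0] - 1.0) < 0.05
assert abs(sandwich[1, 1] - 2.0) < 0.1
```

For non-normal data the same code remains valid; that robustness to variance misspecification is the usual motivation for the sandwich form.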
When we consider the special case where the m-estimator is the MLE of θ (i.e., where m(Z, θ) = Sθ(Z, θ); see (3.5)), we note that −∂m(Z, θ)/∂θT = −∂Sθ(Z, θ)/∂θT corresponds to minus the p × p matrix of second partial derivatives of the log-likelihood with respect to θ, which we denote by −Sθθ(Z, θ). Under suitable regularity conditions (see Section 7.3 of Casella and Berger, 2002), the information matrix, which we denote by I(θ0), is given by

I(θ0) = Eθ0{−Sθθ(Z, θ0)} = Eθ0{Sθ(Z, θ0)SθT(Z, θ0)}.   (3.11)

As a consequence of (3.6) and (3.7), we obtain the well-known results that the i-th influence function of the MLE is given by {I(θ0)}−1 Sθ(Zi, θ0) and that the asymptotic distribution is normal with mean zero and variance matrix equal to I−1(θ0) (i.e., the inverse of the information matrix).
Returning to the general m-estimator, since θ = (βT, ηT)T and θ̂n = (β̂nT, η̂nT)T, the influence function of β̂n is made up of the first q elements of the p-dimensional influence function for θ̂n given above.
We will now illustrate why Corollary 1 applies to m-estimators. By definition,

Eθ{m(Z, θ)} = 0^{p×1}.

That is,

∫ m(z, θ)p(z, θ)dν(z) = 0 for all θ.

Therefore,

(∂/∂θT) ∫ m(z, θ)p(z, θ)dν(z) = 0.

Assuming suitable regularity conditions that allow us to interchange integration and differentiation, we obtain

∫ {∂m(z, θ)/∂θT} p(z, θ)dν(z) + ∫ m(z, θ) {∂p(z, θ)/∂θT / p(z, θ)} p(z, θ)dν(z) = 0,   (3.12)

where the term in braces in the second integral, {∂p(z, θ)/∂θT}/p(z, θ), is SθT(z, θ), the transpose of the score vector. At θ = θ0, we deduce from equation (3.12) that


E{∂m(Z, θ0)/∂θT} = −E{m(Z, θ0)SθT(Z, θ0)},

which can also be written as

I^{p×p} = −[E{∂m(Z, θ0)/∂θT}]−1 E{m(Z, θ0)SθT(Z, θ0)},   (3.13)

where I^{p×p} denotes the p × p identity matrix. Recall that the influence function for θ̂n, given by (3.6), is

ϕθ̂n(Zi) = −[E{∂m(Z, θ0)/∂θT}]−1 m(Zi, θ0)

and can be partitioned as {ϕTβ̂n(Zi), ϕTη̂n(Zi)}T. The covariance of the influence function ϕθ̂n(Zi) and the score vector Sθ(Zi, θ0) is
E{ϕθ̂n(Zi)SθT(Zi, θ0)} = −[E{∂m(Z, θ0)/∂θT}]−1 E{m(Z, θ0)SθT(Z, θ0)},   (3.14)

which by (3.13) is equal to I^{(q+r)×(q+r)}, the identity matrix. This covariance matrix (3.14) can be partitioned as

E [ ϕβ̂n(Zi)SβT(Zi, θ0)   ϕβ̂n(Zi)SηT(Zi, θ0) ;
    ϕη̂n(Zi)SβT(Zi, θ0)   ϕη̂n(Zi)SηT(Zi, θ0) ].

Consequently,

(i) E{ϕβ̂n(Zi)SβT(Zi, θ0)} = I^{q×q} (the q × q identity matrix)

and

(ii) E{ϕβ̂n(Zi)SηT(Zi, θ0)} = 0^{q×r}.

Thus, we have verified that the two conditions of Corollary 1 hold for influence functions of m-estimators.

Proof of Theorem 3.2

In order to prove Theorem 3.2, we must introduce the theory of contiguity, which we now review briefly. An excellent overview of contiguity theory can be found in the Appendix of Hájek and Šidák (1967). Those readers not interested in the theoretical details can skip the remainder of this section.

Definition 2. Let Vn be a sequence of random vectors and let P1n and P0n be sequences of probability measures with densities p1n(vn) and p0n(vn), respectively. The sequence of probability measures P1n is contiguous to the sequence of probability measures P0n if, for any sequence of events An defined with respect to Vn, P0n(An) → 0 as n → ∞ implies that P1n(An) → 0 as n → ∞. □

In our applications, we let Vn = (Z1n, . . . , Znn), where Z1n, . . . , Znn are iid random vectors and

p0n(vn) = Π_{i=1}^n p(zin, θ0),
p1n(vn) = Π_{i=1}^n p(zin, θn),

where n1/2(θn − θ0) → τ, τ being a p-dimensional vector of constants.


Letting the parameter θ0 denote the true value of the parameter that
generates the data, then p1n (·) is an example of a local data generating process
(LDGP) as given by Definition 1. If we could show that the sequence P1n is
contiguous to the sequence P0n , then a sequence of random variables Tn (Vn )
that converges in probability to zero under the truth (i.e., for every # > 0,
P0n (|Tn | > #) → 0) would also satisfy that P1n (|Tn | > #) → 0; hence, Tn (Vn )
would converge in probability to zero for the LDGP. This fact can be very
useful because in some problems it may be relatively easy to show that a
sequence of random variables converges in probability to zero under the truth,
in which case convergence in probability to zero under the LDGP follows
immediately from contiguity.
LeCam, in a series of lemmas (see Hájek and Sidak, 1967), proved some
important results regarding contiguity. One of LeCam’s results that is of par-
ticular use to us is as follows.
3.2 m-Estimators (Quick Review) 35

Lemma 3.1 (LeCam). If

log{p1n(Vn)/p0n(Vn)} −D(P0n)→ N(−σ²/2, σ²),   (3.15)

then the sequence P1n is contiguous to the sequence P0n.

Heuristic justification of contiguity for LDGP

To illustrate that (3.15) holds for LDGPs under sufficient smoothness and regularity conditions, we sketch out the following heuristic argument. Define

Ln(Vn) = p1n(Vn)/p0n(Vn) = Π_{i=1}^n p(Zin, θn)/p(Zin, θ0).

By a simple Taylor series expansion, we obtain

log{Ln(Vn)} = Σ_{i=1}^n {log p(Zin, θn) − log p(Zin, θ0)}
= (θn − θ0)T Σ_{i=1}^n Sθ(Zin, θ0) + (1/2)(θn − θ0)T {Σ_{i=1}^n Sθθ(Zin, θn∗)}(θn − θ0),   (3.16)

where Sθ(z, θ0) is the p-dimensional score vector defined as ∂ log p(z, θ0)/∂θ, Sθθ(z, θn∗) is the p × p matrix ∂² log p(z, θn∗)/∂θ∂θT, and θn∗ is some intermediate value between θn and θ0.
The expression (3.16) can be written as

n1/2(θn − θ0)T {n−1/2 Σ_{i=1}^n Sθ(Zin, θ0)}
+ (1/2) n1/2(θn − θ0)T {n−1 Σ_{i=1}^n Sθθ(Zin, θn∗)} n1/2(θn − θ0).
Under P0n:

(i) Sθ(Zin, θ0), i = 1, . . . , n are iid mean-zero random vectors with variance matrix equal to the information matrix I(θ0) defined by (3.11). Consequently, by the CLT,

n−1/2 Σ_{i=1}^n Sθ(Zin, θ0) −D(P0n)→ N(0, I(θ0)).

(ii) Since θn∗ → θ0 and Sθθ(Zin, θ0), i = 1, . . . , n are iid random matrices with mean −I(θ0), then, under sufficient smoothness conditions,

n−1 Σ_{i=1}^n {Sθθ(Zin, θn∗) − Sθθ(Zin, θ0)} −P→ 0,

and by the weak law of large numbers

n−1 Σ_{i=1}^n Sθθ(Zin, θ0) −P→ −I(θ0);

hence

n−1 Σ_{i=1}^n Sθθ(Zin, θn∗) −P→ −I(θ0).
By assumption, n1/2(θn − θ0) → τ. Therefore, (i), (ii), and Slutsky’s theorem imply that

log{Ln(Vn)} −D(P0n)→ N(−τTI(θ0)τ/2, τTI(θ0)τ).

Consequently, by LeCam’s lemma, the sequence P1n is contiguous to the sequence P0n.
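The conclusion of LeCam's lemma is easy to check by simulation in the normal case (my addition, not from the text): with Zin iid N(0, 1) under P0n and θn = τ/n1/2, the log-likelihood ratio works out to θn Σ Zin − nθn²/2, which is distributed N(−τ²/2, τ²) with σ² = τ²I(θ0) = τ². A sketch using numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
tau, n, reps = 1.5, 2_000, 4000
theta_n = tau / np.sqrt(n)   # LDGP: theta_n = theta_0 + tau / n^{1/2}, theta_0 = 0

# data generated under the TRUTH P_0n: Z_in iid N(0, 1)
z = rng.normal(0.0, 1.0, size=(reps, n))

# log likelihood ratio of N(theta_n, 1) vs N(0, 1):
# log L_n = sum_i [-(Z_i - theta_n)^2/2 + Z_i^2/2] = theta_n * sum Z_i - n theta_n^2/2
log_Ln = theta_n * z.sum(axis=1) - n * theta_n**2 / 2

# LeCam: log L_n ~ N(-sigma^2/2, sigma^2) with sigma^2 = tau^2 I(theta_0) = tau^2
assert abs(log_Ln.mean() + tau**2 / 2) < 0.1
assert abs(log_Ln.var() - tau**2) < 0.2
```

The characteristic mean of −σ²/2 paired with variance σ² is exactly the pattern (3.15) requires.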

Now we are in a position to prove Theorem 3.2.

Proof (Theorem 3.2). Consider the sequence of densities p0n(vn) = Π_{i=1}^n p(zin, θ0) and the LDGP p1n(vn) = Π_{i=1}^n p(zin, θn), where n1/2(θn − θ0) → τ. By the definition of asymptotic linearity,

n1/2{β̂n − β(θ0)} = n−1/2 Σ_{i=1}^n ϕ(Zin) + oP0n(1),   (3.17)

where oP0n(1) is a sequence of random vectors that converges in probability to zero with respect to the sequence of probability measures P0n. Consider the LDGP defined by θn. By contiguity, terms that are oP0n(1) are also oP1n(1). Consequently, by (3.17),

n1/2{β̂n − β(θ0)} = n−1/2 Σ_{i=1}^n ϕ(Zin) + oP1n(1).

By adding and subtracting common terms, we obtain

n1/2{β̂n − β(θn)} = n−1/2 Σ_{i=1}^n [ϕ(Zin) − Eθn{ϕ(Z)}] + n1/2 Eθn{ϕ(Z)} − n1/2{β(θn) − β(θ0)} + oP1n(1).   (3.18)

By assumption, the estimator β̂n is regular; that is,

n1/2{β̂n − β(θn)} −D(P1n)→ N(0, Eθ0(ϕϕT)).   (3.19)

Also, under P1n, [ϕ(Zin) − Eθn{ϕ(Z)}], i = 1, . . . , n are iid mean-zero random vectors with variance matrix Eθn(ϕϕT) − Eθn(ϕ)Eθn(ϕT). By the smoothness assumption, Eθn(ϕϕT) → Eθ0(ϕϕT) and Eθn(ϕ) → 0 as n → ∞. Hence, by the CLT, we obtain

n−1/2 Σ_{i=1}^n [ϕ(Zin) − Eθn{ϕ(Z)}] −D(P1n)→ N(0, Eθ0(ϕϕT)).   (3.20)

By a simple Taylor series expansion, we deduce that β(θn) ≈ β(θ0) + Γ(θ0)(θn − θ0), where Γ(θ0) = ∂β(θ0)/∂θT. Hence,

n1/2{β(θn) − β(θ0)} → Γ(θ0)τ.   (3.21)

Finally,

n1/2 Eθn{ϕ(Z)} = n1/2 ∫ ϕ(z)p(z, θn)dν(z)
= n1/2 ∫ ϕ(z)p(z, θ0)dν(z) + n1/2 ∫ ϕ(z){∂p(z, θn∗)/∂θ}T(θn − θ0)dν(z)
→ 0 + ∫ ϕ(z){∂p(z, θ0)/∂θ / p(z, θ0)}T p(z, θ0)dν(z) τ = Eθ0{ϕ(Z)SθT(Z, θ0)}τ,   (3.22)

where θn∗ is some intermediate value between θn and θ0. The only way that (3.19) and (3.20) can both hold is if the limit of (3.18), as n → ∞, is identically equal to zero. By (3.21) and (3.22), this implies that

[Eθ0{ϕ(Z)SθT(Z, θ0)} − Γ(θ0)]τ = 0^{q×1}.

Since τ is arbitrary, this implies that

Eθ0{ϕ(Z)SθT(Z, θ0)} = Γ(θ0),

which proves Theorem 3.2. □
We now show how the results of Theorem 3.2 lend themselves to a geometric interpretation that allows us to compare the efficiency of different RAL estimators using our intuition of minimum distance and orthogonality.

3.3 Geometry of Influence Functions for Parametric Models

Consider the Hilbert space H of all q-dimensional measurable functions of Z with mean zero and finite variance, equipped with the inner product ⟨h1, h2⟩ = E(hT1 h2). We first note that the score vector Sθ(Z, θ0), under suitable regularity conditions, has mean zero (i.e., E{Sθ(Z, θ0)} = 0^{p×1}). Similar to Example 2 of Chapter 2, we can define the finite-dimensional linear subspace T ⊂ H spanned by the p-dimensional score vector Sθ(Z, θ0) as the set of all q-dimensional mean-zero random vectors of the form

B^{q×p} Sθ(Z, θ0)

for all q × p matrices B. The linear subspace T is referred to as the tangent space.
In the case where θ can be partitioned as (βT, ηT)T, consider the linear subspace spanned by the nuisance score vector Sη(Z, θ0),

B^{q×r} Sη(Z, θ0),   (3.23)

for all q × r matrices B. This space is referred to as the nuisance tangent space and will be denoted by Λ. We note that condition (ii) of Corollary 1 is equivalent to saying that the q-dimensional influence function ϕβ̂n(Z) for β̂n is orthogonal to the nuisance tangent space Λ.
In addition to being orthogonal to the nuisance tangent space, the influence function of β̂n must also satisfy condition (i) of Corollary 1; namely,

E{ϕβ̂n(Z)SβT(Z, θ0)} = I^{q×q}.

Although influence functions of RAL estimators for β must satisfy conditions (i) and (ii) of Corollary 1, a natural question is whether the converse is true; that is, for any element of the Hilbert space satisfying conditions (i) and (ii) of Corollary 1, does there exist an RAL estimator for β with that influence function?
Remark 2. To prove this in full generality, especially later when we consider infinite-dimensional nuisance parameters, is difficult and requires that some careful technical regularity conditions hold. Nonetheless, it may be instructive to see how one may, heuristically, construct estimators that have influence functions corresponding to elements in the subspace of the Hilbert space satisfying conditions (i) and (ii). □

Constructing Estimators

Let ϕ(Z) be a q-dimensional measurable function with zero mean and finite variance that satisfies conditions (i) and (ii) of Corollary 1. Define

m(Z, β, η) = ϕ(Z) − Eβ,η{ϕ(Z)}.

Assume that we can find a root-n consistent estimator for the nuisance pa-
rameter η̂n (i.e., where n1/2 (η̂n −η0 ) is bounded in probability). In many cases
the estimator η̂n will be β-dependent (i.e., η̂n (β)). For example, we might use
the MLE for η, or the restricted MLE for η, fixing the value of β.
We will now argue that the solution to the equation

    Σ_{i=1}^n m{Zi, β, η̂n(β)} = 0,                              (3.24)

which we denote by β̂n, will be an asymptotically linear estimator with influence
function ϕ(Z).
By construction, we have

    E_{β0,η}{m(Z, β0, η)} = 0,

or

    ∫ m(z, β0, η) p(z, β0, η) dν(z) = 0.

Consequently,

    ∂/∂η^T |_{η=η0} ∫ m(z, β0, η) p(z, β0, η) dν(z) = 0,

or

    ∫ {∂m(z, β0, η0)/∂η^T} p(z, β0, η0) dν(z)
      + ∫ m(z, β0, η0) Sη^T(z, β0, η0) p(z, β0, η0) dν(z) = 0.   (3.25)

By definition, ϕ(Z) = m(Z, β0, η0) must satisfy

    E{ϕ(Z) Sη^T(Z, θ0)} = 0.

(This is condition (ii) of Corollary 1.) Consequently, by (3.25), we obtain

    E{∂m(Z, β0, η0)/∂η^T} = 0.                                  (3.26)

Similarly, we can show that

    E{∂m(Z, β0, η0)/∂β^T} = −I^{q×q}.                           (3.27)

A standard expansion yields

    0 = Σ_{i=1}^n m{Zi, β̂n, η̂n(β̂n)}
      = Σ_{i=1}^n m{Zi, β0, η̂n(β̂n)}
        + [Σ_{i=1}^n ∂m/∂β^T {Zi, βn∗, η̂n(β̂n)}] (β̂n − β0),      (3.28)

where βn∗ is an intermediate value between β̂n and β0 and the argument
η̂n(β̂n) is held fixed when taking the derivative with respect to β. Therefore,

    n^{1/2}(β̂n − β0)
      = −[n^{-1} Σ_{i=1}^n ∂/∂β^T m{Zi, βn∗, η̂n(β̂n)}]^{-1}
          × [n^{-1/2} Σ_{i=1}^n m{Zi, β0, η̂n(β̂n)}],             (3.29)

where, by (3.27), the first bracketed term converges in probability to

    [E{∂/∂β^T m(Z, β0, η0)}]^{-1} = −I^{q×q}.

Let us consider the second term of (3.29), namely n^{-1/2} Σ_{i=1}^n m{Zi, β0, η̂n(β̂n)}.
By expansion, this equals

    n^{-1/2} Σ_{i=1}^n m(Zi, β0, η0)
      + [n^{-1} Σ_{i=1}^n ∂m(Zi, β0, ηn∗)/∂η^T] [n^{1/2}{η̂n(β̂n) − η0}],   (3.30)

where ηn∗ is an intermediate value between η̂n(β̂n) and η0, the first bracketed
term converges in probability to E{∂m(Z, β0, η0)/∂η^T} = 0 by (3.26), and the
second bracketed term is bounded in probability.


Combining (3.29) and (3.30), we obtain

    n^{1/2}(β̂n − β0) = n^{-1/2} Σ_{i=1}^n m(Zi, β0, η0) + op(1)
                     = n^{-1/2} Σ_{i=1}^n ϕ(Zi) + op(1),

which illustrates that ϕ(Zi) is the influence function for the i-th observation
of the estimator β̂n above.
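As a numerical sanity check, the following sketch (our own toy example, not from the text) takes Z ~ N(β, η) with truth β0 = 2, η0 = 4 and the candidate influence function ϕ(Z) = Z − β0, which satisfies conditions (i) and (ii) of Corollary 1 in this model. Here m(Z, β, η) = ϕ(Z) − E_{β,η}{ϕ(Z)} = Z − β, so the solution of (3.24) is the sample mean and the asymptotic linearity above holds exactly:

```python
import numpy as np

# Toy example (ours): Z ~ N(beta, eta), truth beta0 = 2, eta0 = 4.
# phi(Z) = Z - beta0 satisfies conditions (i) and (ii) of Corollary 1 here:
# S_beta = (Z - beta)/eta gives E{phi * S_beta} = 1, and E{phi * S_eta} = 0
# because the third central moment of a normal distribution is zero.
rng = np.random.default_rng(0)
beta0, eta0, n = 2.0, 4.0, 5000
Z = rng.normal(beta0, np.sqrt(eta0), size=n)

# m(Z, beta, eta) = phi(Z) - E_{beta,eta}{phi(Z)} = (Z - beta0) - (beta - beta0)
#                 = Z - beta, so equation (3.24) is solved by the sample mean,
# regardless of how the nuisance parameter eta is estimated (cf. Remark 3).
beta_hat = Z.mean()

# Asymptotic linearity: n^{1/2}(beta_hat - beta0) vs n^{-1/2} sum_i phi(Z_i)
lhs = np.sqrt(n) * (beta_hat - beta0)
rhs = np.sum(Z - beta0) / np.sqrt(n)
```

In this toy model the linear expansion is exact, which makes the identity easy to verify; in general the two sides differ by an op(1) term.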
Remark 3. This argument was independent of the choice of the root-n consistent
estimator for the nuisance parameter η. □
Remark 4. In the derivation above, the asymptotic distribution of the esti-
mator obtained by solving the estimating equation, which uses the estimating
function m(Z, β, η̂n ), is the same as the asymptotic distribution of the estima-
tor solving the estimating equation using the estimating function m(Z, β, η0 )
had the true value of the nuisance parameter η0 been known to us. This
fact follows from the orthogonality of the estimating function (evaluated at
the truth) to the nuisance tangent space. This type of robustness, where the
asymptotic distribution of an estimator is independent of whether the true
value of the nuisance parameter is known or whether (and how) the nuisance
parameter is estimated in an estimating equation, is one of the bonuses of
working with estimating equations with estimating functions that are orthog-
onal to the nuisance tangent space. □
Remark 5. We want to make it clear that the estimator we just presented is
for theoretical purposes only and not of practical use. The starting point was
the choice of a function satisfying the conditions of Lemma 3.1. To find such
a function necessitates knowledge of the truth, which, of course, we don’t
have. Nonetheless, starting with some truth, say θ0 , and some function ϕ(Z)
satisfying the conditions of Corollary 1 (under the assumed true model), we
constructed an estimator whose influence function is ϕ(Z) when θ0 is the
truth. If, however, the data were generated, in truth, by some other value of
the parameter, say θ∗ , then the estimator constructed by solving (3.24) would
have some other influence function ϕ∗ (Z) satisfying the conditions of Lemma
3.1 at θ∗. □
Thus, by Corollary 1, all RAL estimators have influence functions that belong
to the subspace of our Hilbert space satisfying
(i) E{ϕ(Z) Sβ^T(Z, θ0)} = I^{q×q}
and
(ii) E{ϕ(Z) Sη^T(Z, θ0)} = 0^{q×r},
and, conversely, any element in the subspace above is the influence function
of some RAL estimator.

Why Is this Important?

RAL estimators are asymptotically normally distributed; i.e.,

    n^{1/2}(β̂n − β0) →_D N{0, E(ϕϕ^T)}.

Because of this, we can compare competing RAL estimators for β by looking
at the asymptotic variance, where clearly the better estimator is the one with
smaller asymptotic variance. We argued earlier, however, that the asymptotic
variance of an RAL estimator is the variance of its influence function. There-
fore, it suffices to consider the variance of influence functions. We already
illustrated that influence functions can be viewed as elements in a subspace
of a Hilbert space. Moreover, in this Hilbert space the distance to the origin
(squared) of any element (random function) is the variance of the element.
Consequently, the search for the best estimator (i.e., the one with the small-
est asymptotic variance) is equivalent to the search for the element in the
subspace of influence functions that has the shortest distance to the origin.

Remark 6. We want to emphasize again that Hilbert spaces are characterized
by both the elements that make up the space (random functions in our case)
and the inner product, ⟨h1, h2⟩ = E(h1^T h2), where expectation is always taken
with respect to the truth (θ0). Therefore, for different θ0, we have different
Hilbert spaces. This also means that the subspace that defines the class of
influence functions is θ0-dependent. □

3.4 Efficient Influence Function


We will show how the geometry of Hilbert spaces will allow us to identify the
most efficient influence function (i.e., the influence function with the smallest
variance). First, however, we give some additional notation and definitions
regarding operations on linear subspaces that will be needed shortly.

Definition 3. We say that M ⊕ N is a direct sum of two linear subspaces
M ⊂ H and N ⊂ H if M ⊕ N is a linear subspace in H and if every element
x ∈ M ⊕ N has a unique representation of the form x = m + n, where m ∈ M
and n ∈ N. □

Definition 4. The set of elements of a Hilbert space that are orthogonal to a
linear subspace M is denoted by M⊥. The space M⊥ is also a linear subspace
(referred to as the orthogonal complement of M), and the entire Hilbert space
satisfies H = M ⊕ M⊥. □

Condition (ii) of Corollary 1 can now be stated as follows: If ϕ(Z) is an
influence function of an RAL estimator, then ϕ ∈ Λ⊥, where Λ denotes the
nuisance tangent space defined by (3.23).

Definition 5. If we consider any arbitrary element h(Z) ∈ H, then by the
projection theorem, there exists a unique element a0(Z) ∈ Λ such that the
norm ‖h − a0‖ is minimized, and a0 must uniquely satisfy the relationship

    ⟨h − a0, a⟩ = 0 for all a ∈ Λ.

The element a0 is referred to as the projection of h onto the space Λ and is
denoted by Π(h|Λ). The element with the minimum norm, h − a0, is sometimes
referred to as the residual of h after projecting onto Λ, and it is easy to show
that h − a0 = Π(h|Λ⊥). □

As we discussed earlier, condition (ii) of Corollary 1 is equivalent to an
element h(Z) in our Hilbert space H being orthogonal to the nuisance tangent
space; i.e., the linear subspace generated by the nuisance score vector, namely

    Λ = {B^{q×r} Sη(Z, θ0) for all B^{q×r}}.

If we want to identify all elements orthogonal to the nuisance tangent space,
we can consider the set of elements h − Π(h|Λ) for all h ∈ H, where, using the
results in Example 2 of Chapter 2,

    Π(h|Λ) = E(h Sη^T){E(Sη Sη^T)}^{-1} Sη(Z, θ0).

It is also straightforward to show that the tangent space

    T = {B^{q×p} Sθ(Z, θ0) for all B^{q×p}}

can be written as the direct sum of the nuisance tangent space and the tangent
space generated by the score vector with respect to the parameter of interest
β. That is, if we define Tβ as the space {B^{q×q} Sβ(Z, θ0) for all B^{q×q}}, then
T = Tβ ⊕ Λ.
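The projection formula above is easy to illustrate numerically. The sketch below (our own construction; population expectations are replaced by Monte Carlo averages) projects an arbitrary mean-zero h(Z) onto the nuisance tangent space spanned by Sη in the model Z ~ N(β, η) and verifies that the residual h − Π(h|Λ) is orthogonal to Λ:

```python
import numpy as np

# Monte Carlo sketch (ours): approximate
#   Pi(h | Lambda) = E(h S_eta^T) {E(S_eta S_eta^T)}^{-1} S_eta
# with sample averages in the model Z ~ N(beta, eta) at beta0 = 0, eta0 = 1.
rng = np.random.default_rng(1)
n = 100_000
Z = rng.normal(0.0, 1.0, size=n)

# Nuisance score S_eta = -1/(2 eta) + (Z - beta)^2/(2 eta^2) = (Z^2 - 1)/2 here
S_eta = (Z**2 - 1.0) / 2.0

# An arbitrary mean-zero element of the Hilbert space (q = 1)
h = Z + Z**2 - 1.0

B = np.mean(h * S_eta) / np.mean(S_eta**2)   # E(h S_eta) {E(S_eta^2)}^{-1}
proj = B * S_eta                             # Pi(h | Lambda)
resid = h - proj                             # h - Pi(h | Lambda)

# The residual is orthogonal to the nuisance tangent space:
inner = np.mean(resid * S_eta)               # sample version of <resid, S_eta>
```

Because the projection coefficient is itself computed by least squares from the same sample, the orthogonality holds exactly up to floating-point rounding.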

Asymptotic Variance when Dimension Is Greater than One

When the parameter of interest β has dimension ≥ 2, we must be careful
about what we mean by smaller asymptotic variance for an estimator or its
influence function. Consider two RAL estimators for β with influence functions
ϕ^(1)(Z) and ϕ^(2)(Z), respectively. We say that

    var{ϕ^(1)(Z)} ≤ var{ϕ^(2)(Z)}

if and only if

    var{a^T ϕ^(1)(Z)} ≤ var{a^T ϕ^(2)(Z)}

for all q × 1 constant vectors a. Equivalently,

    a^T E{ϕ^(1)(Z) ϕ^(1)T(Z)} a ≤ a^T E{ϕ^(2)(Z) ϕ^(2)T(Z)} a.

This means that

    a^T [E{ϕ^(2)(Z) ϕ^(2)T(Z)} − E{ϕ^(1)(Z) ϕ^(1)T(Z)}] a ≥ 0,

or E(ϕ^(2) ϕ^(2)T) − E(ϕ^(1) ϕ^(1)T) is nonnegative definite.

If H^(1) is the Hilbert space of one-dimensional mean-zero random functions
of Z, where we use the superscript (1) to emphasize one-dimensional random
functions, and if h1 and h2 are elements of H^(1) that are orthogonal
to each other, then, by the Pythagorean theorem, we know that
var(h1 + h2 ) = var(h1 ) + var(h2 ), making it clear that var(h1 + h2 ) is greater
than or equal to var(h1 ) or var(h2 ). Unfortunately, when H consists of q-
dimensional mean-zero random functions, there is no such general relationship
with regard to the variance matrices. However, there is an important special
case when this does occur, which we now discuss.
Definition 6. q-replicating linear space
A linear subspace U ⊂ H is a q-replicating linear space if U is of the form
U^(1) × . . . × U^(1), or {U^(1)}^q, where U^(1) denotes a linear subspace in H^(1) and
{U^(1)}^q ⊂ H represents the linear subspace in H that consists of elements
h = (h^(1), . . . , h^(q))^T such that h^(j) ∈ U^(1) for all j = 1, . . . , q; i.e., {U^(1)}^q
consists of q-dimensional random functions, where each element in the vector
is an element of U^(1); that is, the space U^(1) stacked upon itself q times. □
The linear subspace spanned by an r-dimensional vector of mean-zero finite-
variance random functions v^{r×1}(Z), namely the subspace

    S = {B^{q×r} v(Z) : for all constant matrices B^{q×r}},

is such a subspace. This is easily seen by defining U^(1) to be the space

    {b^T v(Z) : for all constant r-dimensional vectors b^{r×1}},

in which case S = {U^(1)}^q. Since tangent spaces and nuisance tangent spaces
are linear subspaces spanned by score vectors, these are examples of q-replicating
linear spaces.
Theorem 3.3. Multivariate Pythagorean theorem
If h ∈ H is an element of a q-replicating linear space U, and ℓ ∈ H is
orthogonal to U, then

    var(ℓ + h) = var(ℓ) + var(h),                               (3.31)

where var(h) = E(hh^T). As a consequence of (3.31), we obtain a multivariate
version of the Pythagorean theorem; namely, for any h∗ ∈ H,

    var(h∗) = var(Π[h∗|U]) + var(h∗ − Π[h∗|U]).                 (3.32)

Proof. It is easy to show that an element ℓ = (ℓ^(1), . . . , ℓ^(q))^T ∈ H is orthogonal
to U = {U^(1)}^q if and only if each element ℓ^(j), j = 1, . . . , q, is orthogonal
to U^(1). Consequently, such an element ℓ is not only orthogonal to h ∈ {U^(1)}^q
in the sense that E(ℓ^T h) = 0 but also in that E(ℓh^T) = E(hℓ^T) = 0^{q×q}. This
is important because for such ℓ and h, we obtain

    var(ℓ + h) = var(ℓ) + var(h),

where var(h) = E(hh^T). □

This means that, for such cases, the variance matrix of ℓ + h, for q-dimensional
ℓ and h, is larger (in the multidimensional sense defined above) than either
the variance matrix of ℓ or the variance matrix of h.
In many of the arguments that follow, we will be decomposing elements
of the Hilbert space as the projection to a tangent space or a nuisance tan-
gent space plus the residual after the projection. For such problems, because
the tangent space or nuisance tangent space is a q-replicating linear space,
we now know that we can immediately apply the multivariate version of the
Pythagorean theorem where the variance matrix of any element is always
larger than the variance matrix of the projection or the variance matrix of the
residual after projection. Consequently, we don’t have to distinguish between
the Hilbert space of one-dimensional random functions and q-dimensional ran-
dom functions.
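A small numerical check of Theorem 3.3 (our own construction, not from the text): with q = 2, take h in a q-replicating space {U^(1)}^2 spanned componentwise by a single random function v(Z), and ℓ with each component orthogonal to U^(1); the 2 × 2 variance matrices then add exactly:

```python
import numpy as np

# Numerical check (ours) of the multivariate Pythagorean theorem:
# if each component of h lies in U^(1) and each component of l is
# orthogonal to U^(1), then var(l + h) = var(l) + var(h) as q x q matrices.
rng = np.random.default_rng(2)
n, q = 50_000, 2
v = rng.standard_normal(n)
v -= v.mean()                         # v spans U^(1) = {b * v}

# h: each component is a scalar multiple of v, so h is in {U^(1)}^q
h = np.vstack([1.5 * v, -0.7 * v])    # shape (q, n)

# l: arbitrary mean-zero functions with their U^(1) component removed
w = rng.standard_normal((q, n))
w -= w.mean(axis=1, keepdims=True)
coef = (w @ v) / (v @ v)              # least-squares coefficient per component
l = w - np.outer(coef, v)             # each component now orthogonal to v

def var(x):                           # sample analogue of var(.) = E(x x^T)
    return (x @ x.T) / x.shape[1]

lhs = var(l + h)
rhs = var(l) + var(h)
```

The cross-terms E(ℓh^T) and E(hℓ^T) vanish exactly here because the least-squares residual of each component of ℓ is orthogonal to v by construction.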

Geometry of Influence Functions

Before describing the geometry of influence functions, we first give the definition
of a linear variety (sometimes also called an affine space).

Definition 7. A linear variety is the translation of a linear subspace away
from the origin; i.e., a linear variety V can be written as V = x0 + M, where
x0 ∈ H, x0 ∉ M, ‖x0‖ ≠ 0, and M is a linear subspace (see Figure 3.4). □

[Fig. 3.4. Depiction of a linear variety: the linear subspace M translated by x0.]

Theorem 3.4. The set of all influence functions, namely the elements of H
that satisfy condition (3.4) of Theorem 3.2, is the linear variety ϕ∗(Z) + T⊥,
where ϕ∗(Z) is any influence function and T⊥ is the space perpendicular to
the tangent space.

Proof. Any element l(Z) ∈ T⊥ must satisfy

    E{l(Z) Sθ^T(Z, θ0)} = 0^{q×p}.                              (3.33)

Therefore, if we take

    ϕ(Z) = ϕ∗(Z) + l(Z),

then

    E{ϕ(Z) Sθ^T(Z, θ0)} = E[{ϕ∗(Z) + l(Z)} Sθ^T(Z, θ0)]
                        = E{ϕ∗(Z) Sθ^T(Z, θ0)} + E{l(Z) Sθ^T(Z, θ0)}
                        = Γ(θ0) + 0^{q×p} = Γ(θ0).

Hence, ϕ(Z) is an influence function satisfying condition (3.4) of Theorem 3.2.
Conversely, if ϕ(Z) is an influence function satisfying (3.4) of Theorem 3.2, then

    ϕ(Z) = ϕ∗(Z) + {ϕ(Z) − ϕ∗(Z)}.

It is a simple exercise to verify that {ϕ(Z) − ϕ∗(Z)} ∈ T⊥. □

Deriving the Efficient Influence Function

The efficient influence function ϕeff(Z), if it exists, is the influence function
with the smallest variance matrix; that is, for any influence function
ϕ(Z) ≠ ϕeff(Z), var{ϕ(Z)} − var{ϕeff(Z)} is nonnegative definite. That an
efficient influence function exists and is unique is now easy to see from the
geometry of the problem.
Theorem 3.5. The efficient influence function is given by

    ϕeff(Z) = ϕ∗(Z) − Π(ϕ∗(Z)|T⊥) = Π(ϕ∗(Z)|T),                 (3.34)

where ϕ∗(Z) is an arbitrary influence function and T is the tangent space,
and can explicitly be written as

    ϕeff(Z) = Γ(θ0) I^{-1}(θ0) Sθ(Z, θ0).                       (3.35)

Proof. By Theorem 3.4, the class of influence functions is a linear variety,
ϕ∗(Z) + T⊥. Let ϕeff = ϕ∗ − Π(ϕ∗|T⊥) = Π(ϕ∗|T). Because Π(ϕ∗|T⊥) ∈
T⊥, this implies that ϕeff is an influence function and, moreover, is orthogonal
to T⊥. Consequently, any other influence function can be written as ϕ = ϕeff + l,
with l ∈ T⊥. The tangent space T and its orthogonal complement T⊥ are
examples of q-replicating linear spaces as defined by Definition 6. Therefore,
because of Theorem 3.3, equation (3.31), we obtain var(ϕ) = var(ϕeff) + var(l),
which demonstrates that ϕeff, constructed as above, is the efficient influence
function.
We deduce from the argument above that the efficient influence function
ϕeff = Π(ϕ∗|T) is an element of the tangent space T and hence can be expressed
as ϕeff(Z) = Beff^{q×p} Sθ(Z, θ0) for some constant matrix Beff^{q×p}. Since
ϕeff(Z) is an influence function, it must also satisfy relationship (3.4) of Theorem
3.2; i.e.,

    E{ϕeff(Z) Sθ^T(Z, θ0)} = Γ(θ0),

or

    Beff E{Sθ(Z, θ0) Sθ^T(Z, θ0)} = Γ(θ0),

which implies

    Beff = Γ(θ0) I^{-1}(θ0),

where I(θ0) = E{Sθ(Z, θ0) Sθ^T(Z, θ0)} is the information matrix. Consequently,
the efficient influence function is given by

    ϕeff(Z) = Γ(θ0) I^{-1}(θ0) Sθ(Z, θ0). □
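Formula (3.35) can be illustrated numerically. In the sketch below (our own example, not from the text), Z ~ N(β, η) with θ = (β, η) and parameter of interest β, so Γ(θ0) = [1, 0]; the information matrix is estimated by sample moments, and the resulting ϕeff = Γ(θ0) Î^{-1} Sθ is checked against conditions (i) and (ii) of Corollary 1 (which hold exactly when sample moments are used) and against the known variance η0:

```python
import numpy as np

# Sketch (ours) of phi_eff = Gamma(theta0) I^{-1}(theta0) S_theta(Z, theta0)
# in the model Z ~ N(beta, eta), theta = (beta, eta), parameter of interest beta.
rng = np.random.default_rng(3)
beta0, eta0, n = 1.0, 2.0, 200_000
Z = rng.normal(beta0, np.sqrt(eta0), size=n)

# Scores at the truth
S_beta = (Z - beta0) / eta0
S_eta = -1.0 / (2.0 * eta0) + (Z - beta0) ** 2 / (2.0 * eta0**2)
S_theta = np.vstack([S_beta, S_eta])          # p x n with p = 2

I_hat = (S_theta @ S_theta.T) / n             # sample information matrix
Gamma = np.array([[1.0, 0.0]])                # q x p with q = 1

phi_eff = (Gamma @ np.linalg.inv(I_hat) @ S_theta).ravel()

# Conditions (i) and (ii) hold exactly by construction with sample moments:
cond_i = np.mean(phi_eff * S_beta)            # should equal 1
cond_ii = np.mean(phi_eff * S_eta)            # should equal 0

# Here I(theta0) = diag(1/eta0, 1/(2 eta0^2)), so phi_eff is close to
# Z - beta0, whose variance is eta0 (up to Monte Carlo error).
var_eff = phi_eff.var()
```

In this model the score components are orthogonal, so the efficient influence function for β coincides with that of the sample mean; with nuisance correlation it would not.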

It is instructive to consider the special case θ = (β^T, η^T)^T. We first define
the important notion of an efficient score vector and then show the relationship
of the efficient score to the efficient influence function.

Definition 8. The efficient score is the residual of the score vector with respect
to the parameter of interest after projecting it onto the nuisance tangent
space; i.e.,

    Seff(Z, θ0) = Sβ(Z, θ0) − Π(Sβ(Z, θ0)|Λ).

Recall that

    Π(Sβ(Z, θ0)|Λ) = E(Sβ Sη^T){E(Sη Sη^T)}^{-1} Sη(Z, θ0). □

Corollary 2. When the parameter θ can be partitioned as (β^T, η^T)^T, where β
is the parameter of interest and η is the nuisance parameter, then the efficient
influence function can be written as

    ϕeff(Z, θ0) = {E(Seff Seff^T)}^{-1} Seff(Z, θ0).             (3.36)

Proof. By construction, the efficient score vector is orthogonal to the nuisance
tangent space; i.e., it satisfies condition (ii) of being an influence function.
By appropriately scaling the efficient score, we can construct an influence
function, which we will show is the efficient influence function. We first note
that E{Seff(Z, θ0) Sβ^T(Z, θ0)} = E{Seff(Z, θ0) Seff^T(Z, θ0)}. This follows because

    E{Seff(Z, θ0) Sβ^T(Z, θ0)}
      = E{Seff(Z, θ0) Seff^T(Z, θ0)} + E{Seff(Z, θ0) Π(Sβ|Λ)^T},

where the last term equals zero since Seff(Z, θ0) ⊥ Λ. Therefore, if we define

    ϕeff(Z, θ0) = {E(Seff Seff^T)}^{-1} Seff(Z, θ0),

then
(i) E{ϕeff(Z, θ0) Sβ^T(Z, θ0)} = I^{q×q}
and
(ii) E{ϕeff(Z, θ0) Sη^T(Z, θ0)} = 0^{q×r};
i.e., ϕeff(Z, θ0) satisfies conditions (i) and (ii) of Corollary 1 and thus is an
influence function.
As argued above, the efficient influence function is the unique influence
function belonging to the tangent space T. Since both Sβ(Z, θ0) and
Π(Sβ(Z, θ0)|Λ) are elements of T, so is

    ϕeff(Z, θ0) = {E(Seff Seff^T)}^{-1} {Sβ(Z, θ0) − Π(Sβ|Λ)},

thus demonstrating that (3.36) is the efficient influence function for RAL
estimators of β. □
Remark 7. When the parameter θ can be partitioned as (β^T, η^T)^T, then Γ(θ0)
can be partitioned as [I^{q×q} : 0^{q×r}], and it is a straightforward exercise to show
that (3.35) leads to (3.36). □
Remark 8. If we denote by (β̂n^MLE, η̂n^MLE) the values of β and η that maximize
the likelihood

    ∏_{i=1}^n p(Zi, β, η),

then under suitable regularity conditions, the estimator β̂n^MLE of β is an RAL
estimator whose influence function is the efficient influence function given by
(3.36). See Exercise 3.2 below. □
Remark 9. If the parameter of interest is given by β(θ) and we denote by θ̂n^MLE
the value of θ that maximizes the likelihood

    ∏_{i=1}^n p(Zi, θ),

then, under suitable regularity conditions, the estimator β(θ̂n^MLE) of β is an
RAL estimator with efficient influence function (3.35). □
Remark 10. By definition,

    ϕeff(Z, θ0) = {E(Seff Seff^T)}^{-1} Seff(Z, θ0)

has variance equal to

    {E(Seff Seff^T)}^{-1},

the inverse of the variance matrix of the efficient score. If we define Iββ =
E(Sβ Sβ^T), Iηη = E(Sη Sη^T), and Iβη = E(Sβ Sη^T), then we obtain the well-known
result that the minimum variance for the most efficient RAL estimator
is

    {Iββ − Iβη Iηη^{-1} Iβη^T}^{-1},

where Iββ, Iβη, Iηη are elements of the information matrix used in likelihood
theory. □

3.5 Review of Notation for Parametric Models


We now give a quick review of some of the notation and ideas developed in
Chapter 3 as a useful reference.
– Z1, . . . , Zn iid p(z, β, η),
      β ∈ R^q,
      η ∈ R^r,
      θ = (β^T, η^T)^T, θ ∈ R^p, p = q + r.
– Truth is denoted as θ0 = (β0^T, η0^T)^T.
– n^{1/2}(β̂n − β0) = n^{-1/2} Σ_{i=1}^n ϕ(Zi) + op(1), where ϕ(Zi) is the influence
  function for the i-th observation of β̂n.
– Hilbert space: q-dimensional measurable functions of Z with mean zero and
  finite variance, equipped with the covariance inner product E{h1^T(Z) h2(Z)}.
– Score vector: For θ = (β^T, η^T)^T,

    Sβ(z) = ∂ log p(z, θ)/∂β |_{θ0},
    Sη(z) = ∂ log p(z, θ)/∂η |_{θ0},
    Sθ(z) = ∂ log p(z, θ)/∂θ |_{θ0} = {Sβ^T(z), Sη^T(z)}^T.

Linear subspaces

Nuisance tangent space:

    Λ = {B^{q×r} Sη : for all B^{q×r}}.

Tangent space:

    T = {B^{q×p} Sθ : for all B^{q×p}},
    T = Tβ ⊕ Λ, where Tβ = {B^{q×q} Sβ : for all B^{q×q}},

and ⊕ denotes the direct sum of linear subspaces.

Influence functions ϕ must satisfy

(i) E{ϕ Sβ^T} = I^{q×q}
and
(ii) E{ϕ Sη^T} = 0^{q×r}; i.e., ϕ ⊥ Λ, or ϕ ∈ Λ⊥.

• Efficient score:

    Seff(Z, θ0) = Sβ(Z, θ0) − Π(Sβ|Λ);
    Π(Sβ|Λ) = E(Sβ Sη^T){E(Sη Sη^T)}^{-1} Sη(Z, θ0).

• Efficient influence function:

    ϕeff(Z) = {E(Seff Seff^T)}^{-1} Seff(Z, θ0).

• Any influence function is equal to

    ϕ(Z) = ϕeff(Z) + l(Z),  l(Z) ∈ T⊥.

That is, influence functions lie on a linear variety and

    E(ϕϕ^T) = E(ϕeff ϕeff^T) + E(ll^T).

3.6 Exercises for Chapter 3


1. Prove that the Hodges super-efficient estimator µ̂n , given in Section 3.1,
is not asymptotically regular.
2. Let Z1, . . . , Zn be iid p(z, β, η), where β ∈ R^q and η ∈ R^r. Assume all the
   usual regularity conditions that allow the maximum likelihood estimator
   to be a solution to the score equation

       Σ_{i=1}^n {Sβ^T(Zi, β, η), Sη^T(Zi, β, η)}^T = 0^{(q+r)×1},

   and be consistent and asymptotically normal.


   a) Show that the influence function for β̂n is the efficient influence
      function.
   b) Sketch out an argument that shows that the solution to the estimating
      equation

          Σ_{i=1}^n Seff{Zi, β, η̂n∗(β)} = 0^{q×1},

      for any root-n consistent estimator η̂n∗(β), yields an estimator that is
      asymptotically linear with the efficient influence function.
3. Assume Y1, . . . , Yn are iid with distribution function F(y) = P(Y ≤ y),
   which is differentiable everywhere with density f(y) = dF(y)/dy. The median
   is defined as β = F^{-1}(1/2). The sample median is defined as

       β̂n ≈ F̂n^{-1}(1/2),

   where F̂n(y) = n^{-1} Σ_{i=1}^n I(Yi ≤ y) is the empirical distribution function.
   Equivalently, β̂n is the solution to the m-estimating equation

       Σ_{i=1}^n {I(Yi ≤ β) − 1/2} ≈ 0.

Remark 11. We use "≈" to denote approximate equality because the estimating
equation is not continuous in β and therefore will not always have an exact
solution. However, for large n, you can get very close to zero, the difference
being asymptotically negligible. □

(a) Find the influence function for the sample median β̂n.
    Hint: You may assume the following to get your answer.
    (i) β̂n is consistent; i.e., β̂n → β0 = F^{-1}(1/2).
    (ii) Stochastic equicontinuity:

        n^{1/2}{F̂n(β̂n) − F(β̂n)} − n^{1/2}{F̂n(β0) − F(β0)} →_P 0.

(b) Let Y1, . . . , Yn be iid N(μ, σ²), μ ∈ R, σ² > 0. Clearly, for this model, the
    median β is equal to μ. Verify, by direct calculation, that the influence
    function for the sample median satisfies the two conditions of Corollary 1.
