Chapter 7 for BST 695: Special Topics in Statistical Theory.
Kui Zhang, 2011
Chapter 7 – Methods of Finding Estimators
Section 7.1 – Introduction
Definition 7.1.1 A point estimator is any function $W(\mathbf{X}) = W(X_1, X_2, \ldots, X_n)$ of a sample; that is, any statistic is a point estimator.
Notes:
estimator: a function of the sample, $W(\mathbf{X}) = W(X_1, X_2, \ldots, X_n)$
estimate: the realized value (a number) of an estimator, $w(\mathbf{x}) = w(x_1, x_2, \ldots, x_n)$
Section 7.2 – Methods of Finding Estimators
7.2.1 Method of Moments (MME)
Notes:
the oldest method, dating back at least to Karl Pearson in the late 1800s
the idea is simple; however, the resulting estimators can often be improved upon
Let $X_1, \ldots, X_n$ be iid from a pmf or pdf $f(x \mid \theta_1, \ldots, \theta_k)$. We have

1st sample moment: $m_1 = \frac{1}{n}\sum_{i=1}^{n} X_i$

1st population moment: $\mu_1' = E X^1 = \mu_1'(\theta_1, \ldots, \theta_k)$

$k$th sample moment: $m_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k$

$k$th population moment: $\mu_k' = E X^k = \mu_k'(\theta_1, \ldots, \theta_k)$

To get the MME: "Equate" the first $k$ sample moments to the corresponding $k$ population moments and solve these equations for $(\theta_1, \ldots, \theta_k)$ in terms of $(m_1, \ldots, m_k) = \left(\frac{1}{n}\sum_{i=1}^{n} X_i, \frac{1}{n}\sum_{i=1}^{n} X_i^2, \ldots, \frac{1}{n}\sum_{i=1}^{n} X_i^k\right)$.
Example 7.2.1 (Normal method of moments) Suppose $X_1, \ldots, X_n$ are iid from a $n(\theta, \sigma^2)$. In this case, $k = 2$, $\theta_1 = \theta$, and $\theta_2 = \sigma^2$.
Solution: From $m_1 = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X} = \theta$ and $m_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 = \theta^2 + \sigma^2$, we can get:
$$\tilde{\theta} = \bar{X} \quad \text{and} \quad \tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{n-1}{n} S^2.$$
Example 7.2.2 (Binomial method of moments) Suppose $X_1, \ldots, X_n$ are iid from a binomial$(m, p)$ where both $m$ and $p$ are unknown. In this case, $k = 2$, $\theta_1 = m$, and $\theta_2 = p$.
Solution: From $\bar{X} = mp$ and $\frac{1}{n}\sum_{i=1}^{n} X_i^2 = mp(1-p) + m^2 p^2 = mp(1 - p + mp)$, we have:
$$\tilde{m} = \frac{\bar{X}^2}{\bar{X} - \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2} \quad \text{and} \quad \tilde{p} = \frac{\bar{X}}{\tilde{m}}.$$
Note: Method of moments may give estimates that are outside the range of the parameters.
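As a quick numerical illustration (a minimal sketch; the data below are made up), the binomial MME from Example 7.2.2 can produce an estimate of $m$ outside the parameter range whenever the sample variance term is close to or exceeds $\bar{X}$:

```python
import numpy as np

# Hypothetical counts assumed to come from a binomial(m, p) model
x = np.array([5, 4, 6, 7, 3, 5, 6, 8, 4, 5])

xbar = x.mean()
v = ((x - xbar) ** 2).mean()      # (1/n) * sum (x_i - xbar)^2, as in Example 7.2.2

m_mme = xbar**2 / (xbar - v)      # may be non-integer, smaller than max(x), or even negative
p_mme = xbar / m_mme

print(m_mme, p_mme)               # if v >= xbar, m_mme is negative: outside the parameter range
```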
7.2.2 Maximum Likelihood (MLE)
Let $X_1, \ldots, X_n$ be iid from a pmf or pdf $f(x \mid \theta_1, \ldots, \theta_k)$. The likelihood function is defined by
$$L(\boldsymbol{\theta} \mid \mathbf{x}) = L(\theta_1, \ldots, \theta_k \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \ldots, \theta_k).$$
Definition 7.2.4 For each sample point $\mathbf{x}$, let $\hat{\theta}(\mathbf{x})$ be a parameter value at which $L(\boldsymbol{\theta} \mid \mathbf{x})$ attains its maximum as a function of $\boldsymbol{\theta}$, with $\mathbf{x}$ held fixed. A maximum likelihood estimator (MLE) of the parameter $\boldsymbol{\theta}$ based on a sample $\mathbf{X}$ is $\hat{\theta}(\mathbf{X})$.
Notes:
1. Finding the MLE can be difficult in some cases.
2. The MLE may be obtained through differentiation, but in some cases differentiation will not work.
3. When differentiation is used to find the MLE, it is often easier to work with the natural log of the likelihood.
4. Maximization should be carried out only over the range of the parameter.
5. If the MLE cannot be obtained analytically, it can be obtained numerically (a minimal numerical sketch follows).
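As a sketch of the numerical route in note 5 (the gamma model, data, and starting values here are hypothetical, chosen only for illustration), a general-purpose optimizer can be applied to the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

# Hypothetical data assumed to come from a gamma(alpha, beta) model
x = np.array([1.2, 0.8, 2.5, 1.7, 0.9, 3.1, 1.4, 2.2])

def neg_log_lik(params):
    alpha, beta = params
    if alpha <= 0 or beta <= 0:          # keep the search inside the parameter space
        return np.inf
    return -np.sum(gamma.logpdf(x, a=alpha, scale=beta))

# Numerical MLE: minimize the negative log-likelihood from a rough starting value
result = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
alpha_hat, beta_hat = result.x
print(alpha_hat, beta_hat)
```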
Example 7.2.5 (Normal likelihood) Let $X_1, \ldots, X_n$ be iid from a $n(\theta, 1)$. Show that $\bar{X}$ is the MLE of $\theta$ using derivatives.
Solution:
Step 1: Find the solutions of the equation $\frac{d}{d\theta} L(\theta \mid \mathbf{x}) = 0$, which gives the possible solutions.
Step 2: Verify that the solution achieves the global maximum ($\frac{d^2}{d\theta^2} L(\theta \mid \mathbf{x}) \big|_{\theta = \bar{x}} < 0$ in this case).
Step 3: Check the boundaries ($\theta \to \pm\infty$ in this case; it is not necessary in this case).
Example 7.2.6 Recall Theorem 5.2.4 (p. 212) part (a): If $x_1, \ldots, x_n$ are any numbers and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, then for any real number $a$,
$$\sum_{i=1}^{n} (x_i - a)^2 \ge \sum_{i=1}^{n} (x_i - \bar{x})^2,$$
with equality if and only if $a = \bar{x}$. This implies that for any $\theta$,
$$\exp\!\left(-\frac{1}{2}\sum_{i=1}^{n} (x_i - \theta)^2\right) \le \exp\!\left(-\frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2\right),$$
with equality if and only if $\theta = \bar{x}$. So $\bar{X}$ is the MLE.
Example 7.2.7 (Bernoulli MLE) Let $X_1, \ldots, X_n$ be iid Bernoulli($p$). Find the MLE of $p$, where $0 \le p \le 1$. Note that we include the possibility that $p = 0$ or $p = 1$.
Solution: Use the natural log of the likelihood function.
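A sketch of the standard calculation, writing $y = \sum_{i=1}^{n} x_i$ for the number of successes:
$$\log L(p \mid \mathbf{x}) = y \log p + (n - y)\log(1 - p), \qquad \frac{d}{dp}\log L(p \mid \mathbf{x}) = \frac{y}{p} - \frac{n - y}{1 - p} = 0 \;\Longrightarrow\; \hat{p} = \frac{y}{n} = \bar{x},$$
with the boundary cases $y = 0$ and $y = n$ handled separately (there the likelihood is maximized at $\hat{p} = 0$ and $\hat{p} = 1$, respectively).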
Example 7.2.8 (Restricted range MLE) Let $X_1, \ldots, X_n$ be iid from a $n(\theta, 1)$, where $\theta \ge 0$.
Solution: Without any restriction, $\bar{X}$ is the MLE. So when $\bar{x} \ge 0$, $\hat{\theta} = \bar{x}$. When $\bar{x} < 0$, $L(\theta \mid \mathbf{x})$ achieves its maximum over $\theta \ge 0$ at $\hat{\theta} = 0$, so $\hat{\theta} = 0$ in this situation. In summary:
$$\hat{\theta} = \bar{X}\, I_{[0, \infty)}(\bar{X}) = \begin{cases} \bar{X}, & \bar{X} \ge 0; \\ 0, & \bar{X} < 0. \end{cases}$$
Example 7.2.9 (Binomial MLE, unknown number of trials) Let $X_1, \ldots, X_n$ be iid binomial($k$, $p$). Find the MLE of $k$, where $p$ is known and $k$ is unknown. (This is an example where differentiation will not be used to obtain the MLE.)
Solution: The likelihood function is:
$$L(k \mid p, \mathbf{x}) = \prod_{i=1}^{n} \binom{k}{x_i} p^{x_i} (1-p)^{k - x_i}.$$
Then consider the ratio $L(k \mid p, \mathbf{x}) / L(k-1 \mid p, \mathbf{x})$ and determine the values of $k \ge \max_i x_i$ at which the likelihood stops increasing.
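A minimal numerical sketch of this search (the data and the known $p$ below are hypothetical): evaluate the log-likelihood over a grid of candidate $k$ values starting at $\max_i x_i$ and take the maximizer.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical data assumed to be iid binomial(k, p) with p known and k unknown
x = np.array([3, 5, 4, 6, 2, 5, 4])
p = 0.4

k_candidates = np.arange(x.max(), x.max() + 50)   # k must be at least max(x_i)
log_lik = np.array([binom.logpmf(x, k, p).sum() for k in k_candidates])

k_hat = k_candidates[np.argmax(log_lik)]
print(k_hat)
```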
Invariance Property of Maximum Likelihood Estimators
Definition Consider a function $\tau(\theta)$ that is not necessarily one-to-one, so that for a given value $\eta$ there may be more than one value $\theta$ such that $\tau(\theta) = \eta$. The induced likelihood function, $L^*$, of $\eta = \tau(\theta)$ is given by:
$$L^*(\eta \mid \mathbf{x}) = \sup_{\{\theta:\, \tau(\theta) = \eta\}} L(\theta \mid \mathbf{x}).$$
The value $\hat{\eta}$ that maximizes $L^*(\eta \mid \mathbf{x})$ will be called the MLE of $\eta = \tau(\theta)$.
Theorem 7.2.10 (Invariance Property of MLEs) If $\hat{\theta}$ is the MLE of $\theta$, then for any function $\tau(\theta)$, the MLE of $\tau(\theta)$ is $\tau(\hat{\theta})$.
Example Let $X_1, \ldots, X_n$ be iid $n(\theta, 1)$. The MLE of $\theta^2$ is $\bar{X}^2$.
Example Let $X_1, \ldots, X_n$ be iid binomial($k$, $p$) where $k$ is known and $p$ is unknown. Find the MLE of the variance and of the standard deviation of $X_1$.
Solution: The MLE of $p$ is $\hat{p} = \bar{X}/k$. The variance of $X_1$ is $kp(1-p)$, so its MLE is $k\hat{p}(1-\hat{p}) = \bar{X}(1 - \bar{X}/k)$.
The standard deviation of $X_1$ is $\sqrt{kp(1-p)}$, so its MLE is $\sqrt{k\hat{p}(1-\hat{p})} = \sqrt{\bar{X}(1 - \bar{X}/k)}$.
Example Let $X_1, \ldots, X_n$ be iid Poisson($\lambda$). Find the MLE of $P(X = 0)$.
Solution: The MLE of $\lambda$ is $\hat{\lambda} = \bar{X}$. Because $P(X = 0) = \exp(-\lambda)$, the MLE of $P(X = 0)$ is $\exp(-\bar{X})$.
Note: Theorem 7.2.10 includes the multivariate case. If the MLE of $(\theta_1, \ldots, \theta_k)$ is $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$, and if $\tau(\theta_1, \ldots, \theta_k)$ is any function of the parameter vector, then by the invariance property of the MLE, the MLE of $\tau(\theta_1, \ldots, \theta_k)$ is $\tau(\hat{\theta}_1, \ldots, \hat{\theta}_k)$.
Example 7.2.11 (Normal MLEs, $\mu$ and $\sigma^2$ unknown) Let $X_1, \ldots, X_n$ be iid from a $n(\mu, \sigma^2)$ where both $\mu$ and $\sigma^2$ are unknown. Then the MLE of $\mu$ is $\hat{\mu} = \bar{X}$ and the MLE of $\sigma^2$ is
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{n-1}{n} S^2.$$
Solution: Verify these estimators using (a) univariate calculus (this Example, Example 7.2.11) and (b) multivariate
calculus (Example 7.2.12).
Notes:
1. The MLE is susceptible to numerical instability when the MLEs do not have an explicit expression.
2. How sensitive is the MLE to measurement error in the data? (see Example 7.2.13.)
7.2.3 Bayes Estimators
Bayesian Approach to Statistics
The parameter $\theta$ is treated as a random quantity described by a probability distribution known as the prior distribution.
A sample is then taken from the population indexed by $\theta$.
The prior distribution is updated with this sample information to get what is known as the posterior distribution, using Bayes' rule (Theorem 1.3.5, p. 23). Let $\pi(\theta)$ denote the prior distribution of $\theta$ and let $f(\mathbf{x} \mid \theta)$ be the sampling distribution of the sample. The posterior distribution of $\theta$ given the sample $\mathbf{x}$ is given by
$$\pi(\theta \mid \mathbf{x}) = \frac{f(\mathbf{x} \mid \theta)\, \pi(\theta)}{m(\mathbf{x})},$$
where $m(\mathbf{x})$ is the marginal distribution of $\mathbf{x}$, i.e., $m(\mathbf{x}) = \int f(\mathbf{x} \mid \theta)\, \pi(\theta)\, d\theta$.
The posterior distribution is then used to make statements about $\theta$. For instance, the mean of the posterior distribution may be used as a point estimate of $\theta$.
Example 7.2.14 (Binomial Bayes estimation) Let $X_1, \ldots, X_n$ be iid Bernoulli($p$), where $p$ is unknown. Then $Y = \sum_{i=1}^{n} X_i$ is binomial($n$, $p$). We assume the prior distribution on $p$ is beta($\alpha$, $\beta$). The posterior distribution of $p$ given $Y = y$, $f(p \mid y)$, is beta($y + \alpha$, $n - y + \beta$). Hence the Bayes estimate for $p$ is the mean of the posterior distribution, i.e.,
$$\hat{p}_B = \frac{y + \alpha}{\alpha + \beta + n}.$$
Note that the mean of the prior distribution is $\alpha/(\alpha + \beta)$, and $\hat{p}_B$ may be written as
$$\hat{p}_B = \left(\frac{n}{\alpha + \beta + n}\right) \frac{y}{n} + \left(\frac{\alpha + \beta}{\alpha + \beta + n}\right) \frac{\alpha}{\alpha + \beta}.$$
Hence, the Bayes estimator is a linear combination of the sample mean and the prior mean, with weights determined by $\alpha$, $\beta$, and $n$.
Note: The prior and the posterior distributions are both beta distributions.
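A minimal numerical sketch of Example 7.2.14 (the prior parameters and data below are made up for illustration):

```python
import numpy as np

# Hypothetical Bernoulli data and a beta(alpha, beta) prior on p
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
alpha, beta = 2.0, 2.0

n, y = len(x), x.sum()

# Posterior is beta(y + alpha, n - y + beta); its mean is the Bayes estimate
post_a, post_b = y + alpha, n - y + beta
p_bayes = post_a / (post_a + post_b)

# Equivalent weighted-average form: weight on the sample mean vs. the prior mean
p_bayes_alt = (n / (alpha + beta + n)) * (y / n) \
            + ((alpha + beta) / (alpha + beta + n)) * (alpha / (alpha + beta))

print(p_bayes, p_bayes_alt)   # both equal (y + alpha) / (alpha + beta + n)
```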
Definition 7.2.15 Let $\mathcal{F}$ denote the class of pdfs or pmfs $f(x \mid \theta)$ (indexed by $\theta$). A class $\Pi$ of prior distributions is a conjugate family for $\mathcal{F}$ if the posterior distribution is in the class $\Pi$ for all $f \in \mathcal{F}$, all priors in $\Pi$, and all $x \in \mathcal{X}$.
Example 7.2.16 (Normal Bayes estimation) Let $X \sim n(\theta, \sigma^2)$ where $\sigma^2$ is known. We assume the prior distribution on $\theta$ is $n(\mu, \tau^2)$. The posterior distribution of $\theta$ given $X = x$ is also normal (homework problem) with mean
$$E(\theta \mid x) = \frac{\tau^2}{\tau^2 + \sigma^2}\, x + \frac{\sigma^2}{\tau^2 + \sigma^2}\, \mu$$
and variance
$$\operatorname{Var}(\theta \mid x) = \frac{\sigma^2 \tau^2}{\sigma^2 + \tau^2}.$$
Notes:
1. The normal family is its own conjugate family.
2. If the prior information is vague (i.e., $\tau^2$ is very large), then more weight is given to the sample data.
3. If the prior information is good (i.e., $\tau^2$ is small relative to $\sigma^2$), then more weight is given to the prior mean.
Section 7.3 – Methods of Evaluating Estimators
7.3.1 Mean Squared Error
Definition 7.3.1 The mean squared error (MSE) of an estimator $W$ of a parameter $\theta$ is the function of $\theta$ defined by
$$\operatorname{MSE}_\theta(W) = E_\theta (W - \theta)^2 = \operatorname{Var}_\theta W + (\operatorname{Bias}_\theta W)^2,$$
where $\operatorname{Bias}_\theta W = E_\theta W - \theta$.
Definition 7.3.2 The bias of a point estimator $W$ of a parameter $\theta$ is the difference between the expected value of $W$ and $\theta$. An estimator whose bias is identically (in $\theta$) equal to 0 is called unbiased and satisfies $E_\theta W = \theta$ for all $\theta$.
If $W$ is unbiased, then
$$\operatorname{MSE}_\theta(W) = E_\theta (W - \theta)^2 = \operatorname{Var}_\theta W.$$
Example 7.3.3 (Normal MSE) Let $X_1, \ldots, X_n$ be iid from a $n(\mu, \sigma^2)$. We know that $\bar{X}$ and $S^2$ are unbiased estimators of $\mu$ and $\sigma^2$, respectively:
$$E \bar{X} = \mu \quad \text{and} \quad E S^2 = \sigma^2 \quad \text{for all } \mu \text{ and } \sigma^2$$
(which is true even without normality; see Theorem 5.2.6). Thus
$$\operatorname{MSE}(\bar{X}) = E(\bar{X} - \mu)^2 = \operatorname{Var} \bar{X} = \frac{\sigma^2}{n},$$
and
$$\operatorname{MSE}(S^2) = E(S^2 - \sigma^2)^2 = \operatorname{Var} S^2 = \frac{2\sigma^4}{n-1}.$$
(Recall that $(n-1)S^2/\sigma^2 \sim$ chi-square with $n-1$ degrees of freedom, which is gamma$((n-1)/2,\, 2)$, and $\operatorname{Var}(Y) = \alpha\beta^2$ if $Y \sim$ gamma$(\alpha, \beta)$.)
Example 7.3.4 Let $X_1, \ldots, X_n$ be iid from a $n(\mu, \sigma^2)$. Recall that the MLE (and MME) of $\sigma^2$ is
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{n-1}{n} S^2.$$
Note that
$$E(\hat{\sigma}^2) = E\!\left(\frac{n-1}{n} S^2\right) = \frac{n-1}{n} \sigma^2$$
and
$$\operatorname{Var} \hat{\sigma}^2 = \operatorname{Var}\!\left(\frac{n-1}{n} S^2\right) = \left(\frac{n-1}{n}\right)^2 \operatorname{Var}(S^2) = \frac{2(n-1)\sigma^4}{n^2},$$
so that
$$\operatorname{MSE}(\hat{\sigma}^2) = \operatorname{Var}(\hat{\sigma}^2) + [\operatorname{Bias}(\hat{\sigma}^2)]^2 = \frac{2(n-1)\sigma^4}{n^2} + \frac{1}{n^2}\sigma^4 = \frac{2n-1}{n^2}\sigma^4.$$
From these formulas, you can verify that $\hat{\sigma}^2$ has a smaller MSE than $S^2$.
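A quick numerical check of this claim (a sketch only; $\sigma^4$ is set to 1 since both formulas scale with it):

```python
import numpy as np

sigma4 = 1.0                               # sigma^4; both MSEs are proportional to it
n = np.arange(2, 11)

mse_s2 = 2 * sigma4 / (n - 1)              # MSE of the unbiased estimator S^2
mse_mle = (2 * n - 1) * sigma4 / n**2      # MSE of the MLE sigma-hat^2

print(np.all(mse_mle < mse_s2))            # True: the MLE has uniformly smaller MSE
```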
Example 7.3.5 (MSE of binomial Bayes estimators) Suppose $X_1, \ldots, X_n$ are iid Bernoulli($p$).
(1) MLE: $\hat{p} = \bar{X}$ is an unbiased estimator of $p$, and
$$\operatorname{MSE}(\hat{p}) = E_p(\hat{p} - p)^2 = \operatorname{Var}_p(\bar{X}) = \frac{p(1-p)}{n}.$$
(2) Bayes estimator: $\hat{p}_B = \dfrac{Y + \alpha}{\alpha + \beta + n}$ is a biased estimator, because
$$E_p(\hat{p}_B) = \frac{np + \alpha}{\alpha + \beta + n} \ne p.$$
The MSE of pˆ B is
$$\begin{aligned}
\operatorname{MSE}(\hat{p}_B) &= \operatorname{Var}_p(\hat{p}_B) + [E_p(\hat{p}_B) - p]^2 \\
&= \operatorname{Var}_p\!\left(\frac{Y + \alpha}{\alpha + \beta + n}\right) + \left(\frac{np + \alpha}{\alpha + \beta + n} - p\right)^2 \\
&= \frac{np(1-p)}{(\alpha + \beta + n)^2} + \left(\frac{np + \alpha}{\alpha + \beta + n} - p\right)^2.
\end{aligned}$$
If we choose $\alpha = \beta = \sqrt{n/4}$, we have $\operatorname{MSE}(\hat{p}_B) = \dfrac{n}{4(n + \sqrt{n})^2}$, a constant in $p$, and $\hat{p}_B = \dfrac{Y + \sqrt{n/4}}{n + \sqrt{n}}$. In this situation, we can determine which of these two estimators is better in terms of the MSE.
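A minimal sketch of that comparison (sample sizes chosen only for illustration): plugging both MSE formulas into Python shows the Bayes estimator doing better near $p = 1/2$ for small $n$, and the MLE doing better over most of the range for large $n$.

```python
import numpy as np

def mse_mle(p, n):
    return p * (1 - p) / n

def mse_bayes(p, n):
    # alpha = beta = sqrt(n/4), which makes the MSE constant in p
    a = b = np.sqrt(n / 4)
    mean = (n * p + a) / (a + b + n)
    var = n * p * (1 - p) / (a + b + n) ** 2
    return var + (mean - p) ** 2

p = np.linspace(0.01, 0.99, 99)
for n in (4, 400):
    # Fraction of the p-grid on which the Bayes estimator has smaller MSE
    frac_bayes_better = np.mean(mse_bayes(p, n) < mse_mle(p, n))
    print(n, frac_bayes_better)
```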
Skip equivariance example (Example 7.3.6)
7.3.2 Best Unbiased Estimator
Consider the class of estimators
$$\mathcal{C}_\tau = \{W : E_\theta W = \tau(\theta)\}.$$
For any $W_1, W_2 \in \mathcal{C}_\tau$, we have $\operatorname{Bias}_\theta W_1 = \operatorname{Bias}_\theta W_2$, so
$$\operatorname{MSE}(W_1) - \operatorname{MSE}(W_2) = E_\theta(W_1 - \tau(\theta))^2 - E_\theta(W_2 - \tau(\theta))^2 = \operatorname{Var}_\theta(W_1) - \operatorname{Var}_\theta(W_2).$$
Definition 7.3.7 An estimator $W^*$ is a best unbiased estimator of $\tau(\theta)$ if it satisfies $E_\theta W^* = \tau(\theta)$ for all $\theta$ and, for any other estimator $W$ with $E_\theta W = \tau(\theta)$, we have $\operatorname{Var}_\theta(W^*) \le \operatorname{Var}_\theta(W)$ for all $\theta$. $W^*$ is also called a uniform minimum variance unbiased estimator (UMVUE) of $\tau(\theta)$.
Note:
A UMVUE may not necessarily exist.
If a UMVUE exists, it is unique (Theorem 7.3.19).
Example 7.3.8 (Poisson unbiased estimation) Let $X_1, \ldots, X_n$ be iid from a Poisson($\lambda$). Note that $E(\bar{X}) = \lambda$ and $E(S^2) = \lambda$ for all $\lambda$. Thus, both $\bar{X}$ and $S^2$ are unbiased estimators of $\lambda$.
Also, note that the class of estimators given by
$$W_a(\bar{X}, S^2) = a\bar{X} + (1-a)S^2$$
is a class of unbiased estimators of $\lambda$ for $0 \le a \le 1$.
To determine which estimator has the smallest MSE, we would need to calculate $\operatorname{Var}(\bar{X})$, $\operatorname{Var}(S^2)$, and $\operatorname{Var}(a\bar{X} + (1-a)S^2)$. The calculation can be very lengthy.
The question here is: how can we find the best, i.e., smallest-variance, unbiased estimator?
Theorem 7.3.9 (Cramér-Rao inequality) Let $X_1, \ldots, X_n$ be a sample with pdf $f(\mathbf{x} \mid \theta)$, and let $W(\mathbf{X}) = W(X_1, \ldots, X_n)$ be any estimator satisfying
$$\frac{d}{d\theta} E_\theta W(\mathbf{X}) = \int_{\mathcal{X}} \frac{\partial}{\partial \theta} \left[ W(\mathbf{x}) f(\mathbf{x} \mid \theta) \right] d\mathbf{x}$$
and
$$\operatorname{Var}_\theta(W(\mathbf{X})) < \infty.$$
Then
$$\operatorname{Var}_\theta(W(\mathbf{X})) \ge \frac{\left(\frac{d}{d\theta} E_\theta W(\mathbf{X})\right)^2}{E_\theta\!\left(\left(\frac{\partial}{\partial \theta} \log f(\mathbf{X} \mid \theta)\right)^2\right)},$$
where log is the natural logarithm.
Corollary 7.3.10 (Cramér-Rao inequality, iid case) If the assumptions of Theorem 7.3.9 are satisfied and, additionally, $X_1, \ldots, X_n$ are iid with pdf $f(x \mid \theta)$, then
$$\operatorname{Var}_\theta(W(\mathbf{X})) \ge \frac{\left(\frac{d}{d\theta} E_\theta W(\mathbf{X})\right)^2}{n\, E_\theta\!\left(\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right)^2\right)}.$$
Notes:
1. The quantity $E_\theta\!\left(\left(\frac{\partial}{\partial \theta} \log f(\mathbf{X} \mid \theta)\right)^2\right)$ is called the information number, or Fisher information, of the sample.
2. The information number gives a bound on the variance of the best unbiased estimator of $\theta$.
3. As the information number increases, we have more information about $\theta$, and we have a smaller bound.
The following lemma helps in the computation of the CRLB.
Lemma 7.3.11 If $f(x \mid \theta)$ satisfies
$$\frac{d}{d\theta} E_\theta\!\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right) = \int \frac{\partial}{\partial \theta}\!\left[\left(\frac{\partial}{\partial \theta} \log f(x \mid \theta)\right) f(x \mid \theta)\right] dx$$
(true for an exponential family), then
$$E_\theta\!\left(\left(\frac{\partial}{\partial \theta} \log f(X \mid \theta)\right)^2\right) = -E_\theta\!\left(\frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta)\right).$$
Example 7.3.12 Recall the Poisson problem. We will show that $\bar{X}$ is the UMVUE of $\lambda$.
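A sketch of the calculation: for the Poisson pmf $f(x \mid \lambda) = e^{-\lambda} \lambda^x / x!$,
$$\frac{\partial}{\partial \lambda} \log f(x \mid \lambda) = \frac{x}{\lambda} - 1, \qquad -E_\lambda\!\left(\frac{\partial^2}{\partial \lambda^2} \log f(X \mid \lambda)\right) = E_\lambda\!\left(\frac{X}{\lambda^2}\right) = \frac{1}{\lambda},$$
so the CRLB for unbiased estimators of $\lambda$ is $\lambda/n$. Since $E_\lambda \bar{X} = \lambda$ and $\operatorname{Var}_\lambda(\bar{X}) = \lambda/n$ attains this bound, $\bar{X}$ is the UMVUE of $\lambda$.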
Note:
A key assumption of the Cramér-Rao Theorem is that one can differentiate under the integral sign. Below is an example where this assumption is not satisfied.
Example 7.3.13 (Unbiased estimator for the scale uniform) Let $X_1, \ldots, X_n$ be iid with pdf
$$f(x \mid \theta) = 1/\theta, \quad 0 < x < \theta.$$
Note:
The Cramér-Rao Lower Bound (CRLB) is not guaranteed to be sharp, i.e., there is no guarantee that the CRLB can be attained.
Example 7.3.14 (Normal variance bound) Let $X_1, \ldots, X_n$ be iid $n(\mu, \sigma^2)$. We have:
$$\text{CRLB} = \frac{2\sigma^4}{n} \quad \text{but} \quad \operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1},$$
hence $S^2$ has variance larger than the CRLB.
Question:
How do we know if there exists an unbiased estimator that achieves the CRLB?
Corollary 7.3.15 (Attainment) Let $X_1, \ldots, X_n$ be iid with pdf $f(x \mid \theta)$, where $f(x \mid \theta)$ satisfies the conditions of the Cramér-Rao Theorem. Let $L(\theta \mid \mathbf{x}) = \prod_{i=1}^{n} f(x_i \mid \theta)$ denote the likelihood function. If $W(\mathbf{X}) = W(X_1, \ldots, X_n)$ is any unbiased estimator of $\tau(\theta)$, then $W(\mathbf{X})$ attains the CRLB if and only if
$$a(\theta)\left[W(\mathbf{x}) - \tau(\theta)\right] = \frac{\partial}{\partial \theta} \log L(\theta \mid \mathbf{x})$$
for some function $a(\theta)$.
Example 7.3.16 Recall the normal problem.
$$L(\mu, \sigma^2 \mid \mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right\},$$
so that
$$\frac{\partial}{\partial \sigma^2} \log L(\mu, \sigma^2 \mid \mathbf{x}) = \frac{n}{2\sigma^4}\left(\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n} - \sigma^2\right).$$
If $\mu$ is known, the CRLB is achieved and the UMVUE is $W(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2$. Otherwise, no unbiased estimator of $\sigma^2$ will achieve the CRLB.
Question:
1. What can we do to find the "best" estimator if $f(x \mid \theta)$ does not satisfy the assumptions of the Cramér-Rao Theorem?
2. If the CRLB is not attainable, how do we know whether our estimator is the "best"?
7.3.3 Sufficiency and Unbiasedness
Recall two important results:
$$E(X) = E[E(X \mid Y)] \quad \text{and} \quad \operatorname{Var}(X) = \operatorname{Var}[E(X \mid Y)] + E[\operatorname{Var}(X \mid Y)].$$
Theorem 7.3.17 (Rao-Blackwell) Let $W$ be any unbiased estimator of $\tau(\theta)$, and let $T$ be a sufficient statistic for $\theta$. Define $\phi(T) = E(W \mid T)$. Then $E_\theta(\phi(T)) = \tau(\theta)$ and $\operatorname{Var}_\theta(\phi(T)) \le \operatorname{Var}_\theta(W)$ for all $\theta$; i.e., $\phi(T)$ is a uniformly better unbiased estimator of $\tau(\theta)$.
Notes:
1. Conditioning any unbiased estimator on a sufficient statistic will result in an improved estimator.
2. To find the UMVUE, we only need to consider functions of the sufficient statistic.
3. Sufficiency is needed so that the resulting quantity (estimator) after conditioning on the sufficient statistic does not depend on $\theta$.
Example 7.3.18 (Conditioning on an insufficient statistic) Let $X_1$ and $X_2$ be iid from $n(\theta, 1)$. Then $\bar{X}$ is an unbiased estimator (and a sufficient statistic) of $\theta$. Suppose we condition $\bar{X}$ on $X_1$, which is not a sufficient statistic. Let $\phi(X_1) = E(\bar{X} \mid X_1)$. Then $\phi(X_1)$ is unbiased for $\theta$ and has a smaller variance than $\bar{X}$, but $\phi(X_1) = (X_1 + \theta)/2$ depends on $\theta$ and hence is not a valid estimator.
Theorem 7.3.19 If $W$ is a best unbiased estimator of $\tau(\theta)$, then $W$ is unique.
Let $W$ be such that $E_\theta(W) = \tau(\theta)$ and let $U$ be such that $E_\theta(U) = 0$ for all $\theta$. Then
$$\phi_a = W + aU,$$
where $a$ is a constant, forms a class of unbiased estimators of $\tau(\theta)$ with
$$\operatorname{Var}_\theta(\phi_a) = \operatorname{Var}_\theta W + 2a \operatorname{Cov}_\theta(W, U) + a^2 \operatorname{Var}_\theta U.$$
Question:
Which is a better estimator, $W$ or $\phi_a$?
Theorem 7.3.20 If $E_\theta W = \tau(\theta)$, then $W$ is the best unbiased estimator of $\tau(\theta)$ if and only if $W$ is uncorrelated with all unbiased estimators of 0.
Example 7.3.21 (Unbiased estimators of 0) Let $X$ be an observation from a uniform$(\theta, \theta+1)$ distribution. Then
$$E_\theta X = \int_\theta^{\theta+1} x\, dx = \theta + \frac{1}{2} \quad \text{and} \quad \operatorname{Var}_\theta X = \frac{1}{12}.$$
Therefore, $X - \frac{1}{2}$ is an unbiased estimator of $\theta$. We will show that $X - \frac{1}{2}$ is correlated with an unbiased estimator of 0, and hence cannot be a best unbiased estimator of $\theta$.
Note:
If a family of pdfs $f(x \mid \theta)$ has the property that there are no unbiased estimators of 0 other than 0 itself, then our search would be ended, since $\operatorname{Cov}_\theta(W, 0) = 0$. What is this property called?
Example 7.3.22 (continuation of Example 7.3.13) Let $X_1, \ldots, X_n$ be iid uniform$(0, \theta)$. Then $\frac{n+1}{n} Y$, where $Y = X_{(n)}$, is an unbiased estimator of $\theta$.
Solution:
1. The conditions of the Cramér-Rao Theorem were not satisfied.
2. By the Rao-Blackwell Theorem, we only need to consider unbiased estimators of $\theta$ based on $Y$.
3. $Y$ is a complete sufficient statistic; therefore $Y$ is uncorrelated with all unbiased estimators of 0, since the only such estimator is 0 itself.
4. Hence $\frac{n+1}{n} Y$ is the best unbiased estimator of $\theta$.
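A small simulation sketch (the values of $\theta$, $n$, and the number of replications are arbitrary) comparing $\frac{n+1}{n} X_{(n)}$ with the moment-based unbiased estimator $2\bar{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 3.0, 10, 100_000

x = rng.uniform(0, theta, size=(reps, n))

est_order = (n + 1) / n * x.max(axis=1)    # ((n+1)/n) * X_(n)
est_moment = 2 * x.mean(axis=1)            # 2 * Xbar, also unbiased for theta

# Both means should be close to theta; the order-statistic estimator has much smaller variance
print(est_order.mean(), est_order.var())
print(est_moment.mean(), est_moment.var())
```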
Important Note:
What is critical is the completeness of the family of distributions of the sufficient statistic, not the completeness of the original family.
Theorem 7.3.23 Let $T$ be a complete sufficient statistic for a parameter $\theta$, and let $\phi(T)$ be any estimator based only on $T$. Then $\phi(T)$ is the unique best unbiased estimator of its expected value.
Result:
If $T$ is a complete sufficient statistic for a parameter $\theta$ and $h(X_1, \ldots, X_n)$ is any unbiased estimator of $\tau(\theta)$, then $\phi(T) = E[h(X_1, \ldots, X_n) \mid T]$ is the unique best unbiased estimator of $\tau(\theta)$.
Example 7.3.24 (Binomial best unbiased estimation) Let $X_1, \ldots, X_n$ be iid binomial$(k, \theta)$. We want to estimate
$$\tau(\theta) = P_\theta(X = 1) = k\theta(1-\theta)^{k-1}.$$
Solution: Recall that $\sum_{i=1}^{n} X_i \sim$ binomial$(kn, \theta)$ is a complete sufficient statistic for $\theta$.
Question:
How do we find an unbiased estimator of $\tau(\theta)$? Once we find an unbiased estimator, how do we get the best unbiased estimator?
7.3.4 Loss Function Optimality
Decision Theory:
Setting: Observed data $\mathbf{X} = \mathbf{x}$, where $\mathbf{X} \sim f(\mathbf{x} \mid \theta)$.
Let $\mathcal{A}$ = action space, i.e., the set of allowable decisions regarding $\theta$.
Definition: A loss function $L(\theta, a)$ is a nonnegative function that generally increases as the distance between an action $a$ and $\theta$ increases.
Note:
$L(\theta, \theta) = 0$ (What does this mean? The loss is minimal if the action is correct.)
If $\theta$ is real-valued, two commonly used loss functions are
absolute error loss $L(\theta, a) = |a - \theta|$: relatively more penalty on small discrepancies
squared error loss $L(\theta, a) = (a - \theta)^2$: relatively more penalty on large discrepancies
Other examples:
$$L(\theta, a) = \begin{cases} (a - \theta)^2, & a < \theta; \\ 10(a - \theta)^2, & a \ge \theta, \end{cases}$$
which penalizes overestimation more than underestimation;
$$L(\theta, a) = \frac{(a - \theta)^2}{|\theta| + 1},$$
which penalizes errors in estimation more if $\theta$ is near 0 than if $|\theta|$ is large.
Definition: In decision theoretic analysis, the quality of an estimator, $\delta(\mathbf{X})$, is quantified by its risk function, defined by
$$R(\theta, \delta) = E_\theta L(\theta, \delta(\mathbf{X})),$$
i.e., at a given $\theta$ the risk function is the average loss that will be incurred if the estimator $\delta(\mathbf{X})$ is used.
Notes:
MSE is an example of a risk function, with respect to the squared error loss:
$$R(\theta, \delta) = E_\theta L(\theta, \delta(\mathbf{X})) = E_\theta(\delta(\mathbf{X}) - \theta)^2 = \operatorname{Var}_\theta(\delta(\mathbf{X})) + (\operatorname{Bias}_\theta(\delta(\mathbf{X})))^2.$$
We want to find an estimator whose risk function is small for all $\theta$ relative to another estimator. However, most of the time the risk functions of two estimators cross.
Example 7.3.25 (Binomial risk functions) Recall Example 7.3.5, comparing the Bayes estimator and the MLE of the Bernoulli parameter $p$:
$$\hat{p}_B = \frac{\sum_{i=1}^{n} X_i + \sqrt{n/4}}{n + \sqrt{n}} \quad \text{and} \quad \hat{p} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Example 7.3.26 (Risk of normal variance) Let $X_1, \ldots, X_n$ be iid from $n(\mu, \sigma^2)$. We want to estimate $\sigma^2$, considering estimators of the form $\delta_b(\mathbf{X}) = bS^2$.
Solution: Recall that $E S^2 = \sigma^2$ and, for normal samples, $\operatorname{Var}(S^2) = \dfrac{2\sigma^4}{n-1}$.
The risk function with respect to the squared error loss is
$$\begin{aligned}
R((\mu, \sigma^2), \delta_b) &= \operatorname{Var}(bS^2) + (E(bS^2) - \sigma^2)^2 \\
&= b^2 \operatorname{Var}(S^2) + (b\sigma^2 - \sigma^2)^2 \\
&= \left[\frac{2b^2}{n-1} + (b-1)^2\right] \sigma^4.
\end{aligned}$$
Notes:
1. The resulting risk function does not depend on $\mu$.
2. The value of $b$ that minimizes this risk function is $b = \dfrac{n-1}{n+1}$. Thus, for every value of $(\mu, \sigma^2)$, the estimator with the smallest risk among all estimators of the form $\delta_b(\mathbf{X}) = bS^2$ is
$$\frac{n-1}{n+1} S^2 = \frac{1}{n+1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
(see Figure 7.3.2, p. 351, for $n = 5$).
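A short sketch evaluating this risk (in units of $\sigma^4$) on a grid of $b$ values and confirming that the minimizer is approximately $(n-1)/(n+1)$:

```python
import numpy as np

n = 5
b = np.linspace(0.2, 1.5, 1301)

# Risk of b*S^2 under squared error loss, in units of sigma^4
risk = 2 * b**2 / (n - 1) + (b - 1) ** 2

b_min = b[np.argmin(risk)]
print(b_min, (n - 1) / (n + 1))   # approximately 0.667 = (n-1)/(n+1) for n = 5
```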
Example 7.3.27 (Variance estimation using Stein's loss) Let $X_1, \ldots, X_n$ be iid from a population with positive, finite variance $\sigma^2$. We want to estimate $\sigma^2$.
Solution: Consider estimators of the form $\delta_b(\mathbf{X}) = bS^2$ and the loss function
$$L(\sigma^2, a) = \frac{a}{\sigma^2} - 1 - \log\frac{a}{\sigma^2} \quad \text{(attributed to Stein).}$$
In this case, the risk function is given by
$$R(\sigma^2, \delta_b) = E\!\left(\frac{bS^2}{\sigma^2} - 1 - \log\frac{bS^2}{\sigma^2}\right) = b - 1 - \log b - E\!\left(\log\frac{S^2}{\sigma^2}\right).$$
Note that $E(\log(S^2/\sigma^2))$ does not depend on $b$. To minimize this risk function, we find the $b$ that minimizes $b - \log(b)$, which is $b = 1$. Hence, the estimator with the smallest risk for all values of $\sigma^2$ is
$$\delta_1(\mathbf{X}) = S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$
Bayesian Approach to Loss Function
Definition: Given a prior distribution $\pi(\theta)$, the Bayes risk of an estimator $\delta$ is
$$\int_\Theta R(\theta, \delta)\, \pi(\theta)\, d\theta = \int_\Theta \left(\int_{\mathcal{X}} L(\theta, \delta(\mathbf{x}))\, f(\mathbf{x} \mid \theta)\, d\mathbf{x}\right) \pi(\theta)\, d\theta,$$
and the estimator that results in the smallest value of the Bayes risk is known as the Bayes rule with respect to the prior $\pi(\theta)$.
Note that
$$\int_\Theta R(\theta, \delta)\, \pi(\theta)\, d\theta = \int_{\mathcal{X}} \left[\int_\Theta L(\theta, \delta(\mathbf{x}))\, \pi(\theta \mid \mathbf{x})\, d\theta\right] m(\mathbf{x})\, d\mathbf{x},$$
where the quantity in the square brackets is known as the posterior expected loss.
The action $\delta(\mathbf{x})$ that minimizes the posterior expected loss will also minimize the Bayes risk.
Example 7.3.28 (Two Bayes rules) Suppose we want to estimate $\theta$.
1. For the squared error loss, the posterior expected loss is
$$\int (\theta - a)^2\, \pi(\theta \mid \mathbf{x})\, d\theta = E\left((\theta - a)^2 \mid \mathbf{X} = \mathbf{x}\right),$$
where $\theta \sim \pi(\theta \mid \mathbf{x})$. This is minimized by $\delta^\pi(\mathbf{x}) = E(\theta \mid \mathbf{x})$, so the Bayes rule is the mean of the posterior distribution (Example 2.2.6).
2. For the absolute error loss, the posterior expected loss is
$$\int |\theta - a|\, \pi(\theta \mid \mathbf{x})\, d\theta = E\left(|\theta - a| \mid \mathbf{X} = \mathbf{x}\right),$$
which is minimized by taking $\delta^\pi(\mathbf{x})$ to be the median of the posterior distribution (Exercise 2.18).
Example 7.3.29 (Normal Bayes estimates) Let $X_1, \ldots, X_n$ be iid from $n(\theta, \sigma^2)$ and let $\pi(\theta)$ be $n(\mu, \tau^2)$, where $\sigma^2$, $\mu$, $\tau^2$ are known. From Example 7.2.16 and your homework problem (Exercise 7.22), the posterior distribution of $\theta$ given $\bar{X} = \bar{x}$ is normal with mean
$$E(\theta \mid \bar{x}) = \frac{\tau^2}{\tau^2 + \sigma^2/n}\, \bar{x} + \frac{\sigma^2/n}{\tau^2 + \sigma^2/n}\, \mu$$
and variance
$$\operatorname{Var}(\theta \mid \bar{x}) = \frac{\tau^2\, \sigma^2/n}{\tau^2 + \sigma^2/n}.$$
1. For the squared error loss, $\delta^\pi(\mathbf{x}) = E(\theta \mid \bar{x})$.
2. For the absolute error loss, $\delta^\pi(\mathbf{x})$ = median of the posterior distribution $= E(\theta \mid \bar{x})$ (since the posterior is normal, its mean and median coincide).