Statistical Estimation Essentials
Estimators
An estimator (normally in the form of an equation) is a rule that tells us how to calculate
an estimate (a number) based on the measurements contained in a sample. It is possible to
obtain many different estimators (rules) for a single population parameter. This should not be
surprising. Ten engineers, each assigned to estimate the cost of construction of a new cruise
terminal, would most likely arrive at different estimates. Such engineers, called estimators in
the construction industry, use certain fixed guidelines plus intuition to arrive at their estimates.
Each represents a unique, subjective rule for obtaining a single estimate. This brings us
to the most important point: some estimators are good and some are bad. How do we
define good or bad here?
Desired Properties of θ̂
To estimate the mean weight (µ) of 7 million residents of HK, a sample of size 8 is taken. The
results (in kg) are as follows: 71.2, 60, 55.3, 65.4, 32.7, 78.6, 68.8, 59.6. Estimate µ.
From the 8 data values we can easily obtain µ̂ = X̄ = 61.45. The estimate µ̂ = X̄ = 61.45 uses ALL
the information in the sample FAIRLY. There is no intention to overestimate or underestimate
µ. It is called an unbiased estimator of µ.
To appreciate this, suppose we adopt a rule of dropping the largest value and using the average
of the remaining data as an estimate for µ. Such a rule systematically tends to produce estimates
below µ, i.e., it is a biased estimator.
Minimum Variance. In addition to unbiasedness, we would like the spread of the sampling
distribution of the estimator to be as small as possible. In other words, we want Var(θ̂) to be
a minimum.
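To see both properties concretely, here is a small simulation sketch in Python. It uses an illustrative normal population with µ = 62 and σ = 10 (these values are assumptions for the illustration, not taken from the HK sample) and compares the usual sample mean with the "drop the largest value" rule: the sample mean averages to µ, while the other rule is systematically too low.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 62.0, 10.0, 8, 100_000      # illustrative population, not the HK data

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)                                 # usual sample mean
drop_max = np.sort(x, axis=1)[:, :-1].mean(axis=1)    # drop the largest value, average the rest

print("average of X-bar         ≈", xbar.mean(), "  (close to mu = 62)")
print("average of drop-max rule ≈", drop_max.mean(), "  (systematically below mu)")
print("Var(X-bar) ≈", xbar.var(), " vs sigma^2/n =", sigma**2 / n)
```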
Example 1:
(a) Let θ be the parameter of interest and let θ̂ be an estimator of that parameter. Select the
correct definition for an unbiased estimator from the following:
(b) Let Xi have mean µ and variance σ², for i = 1, 2, . . . , n. Prove that E(X̄) = µ, where
\[
\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}.
\]
\begin{align*}
E(\bar{X}) &= E\!\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right)
= \frac{1}{n}\left[E(X_1) + E(X_2) + \cdots + E(X_n)\right] \\
&= \frac{1}{n}(\mu + \mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu.
\end{align*}
Thus, E(X̄) = µ. That is, X̄ is an unbiased estimator of µ.
Example 2:
(a) Let θ be the parameter of interest and let θ̂ be an estimator of θ. Which of the following is
the correct definition of an unbiased estimator?
(b) Let {X1, X2, X3} be a random sample taken from a population where the mean is µ and
the variance is σ².
Thus, E(X̃) = µ. That is, X̃ is also an unbiased estimator of µ.
(v) By comparing your answers in part (ii) and part (iv), which estimator of µ, X̄ or X̃, is
better? Why?
\[
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{3} = \frac{25\sigma^2}{75}
\quad\text{and}\quad
\operatorname{Var}(\tilde{X}) = \frac{9\sigma^2}{25} = \frac{27\sigma^2}{75}.
\]
Since Var(X̄) < Var(X̃) and both estimators are unbiased, the sample mean X̄ is the better (more efficient) estimator of µ.
Example 3:
Let {X1, X2, X3, X4} be a random sample from a population where the mean is µ and the
variance is σ².
(v) By comparing your answers in part (ii) and part (iv), which estimator of µ, X̄ or X̃, is
better? Why?
\[
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{4} = \frac{4\sigma^2}{16}
\quad\text{and}\quad
\operatorname{Var}(\tilde{X}) = \frac{5\sigma^2}{16}.
\]
Again, Var(X̄) < Var(X̃), so the sample mean X̄ is the better estimator of µ.
Why do we divide by n − 1 in the sample variance S²? Suppose X1, X2, . . . , Xn is a random
sample from a normal population with mean µ and variance σ². Then
\[
Z = \frac{X_i - \mu}{\sigma} \text{ is a standard normal random variable,}
\]
\[
Z^2 = \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_1, \quad \ldots (1)
\]
\[
\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n, \quad \ldots (2)
\]
\[
\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}, \quad \ldots (3)
\]
\[
\sum_{i=1}^{n}(X_i - \bar{X})^2 \sim \sigma^2 \chi^2_{n-1}. \quad (*)
\]
Hence
\[
S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} \sim \frac{\sigma^2 \chi^2_{n-1}}{n-1},
\]
\begin{align*}
E(S^2) &= E\!\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}\right]
= E\!\left[\frac{\sigma^2 \chi^2_{n-1}}{n-1}\right] \\
&= \frac{\sigma^2}{n-1} E(\chi^2_{n-1}) = \frac{\sigma^2}{n-1} \times (n-1) \quad \ldots (4) \\
&= \sigma^2.
\end{align*}
Notes:
(1) If you square a standard normal random variable, you get a random variable that follows
a Chi-squared distribution with 1 degree of freedom (df).
(2) If you add up n independent Chi-squared random variables (with 1 df each) you get a
random variable that follows a Chi-squared distribution with n degrees of freedom.
(3) If you replace a parameter with its estimate you lose 1 degree of freedom.
(4) The mean of a Chi-squared random variable is equal to the degrees of freedom of the
distribution.
By contrast, if we divide by n instead of n − 1, we obtain a biased estimator:
\[
D^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n} \sim \frac{\sigma^2 \chi^2_{n-1}}{n},
\]
\begin{align*}
E(D^2) &= E\!\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}\right]
= E\!\left[\frac{\sigma^2 \chi^2_{n-1}}{n}\right] \\
&= \frac{\sigma^2}{n} E(\chi^2_{n-1}) = \frac{\sigma^2}{n} \times (n-1) \quad \ldots (4) \\
&\neq \sigma^2.
\end{align*}
The key fact E[Σ(Xi − X̄)²] = (n − 1)σ² can also be obtained directly, without the chi-squared argument:
\begin{align*}
E\!\left[\sum_{i=1}^{n}(X_i-\bar{X})^2\right]
&= E\!\left[\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right] \\
&= E\!\left[\sum_{i=1}^{n}X_i^2 - \frac{1}{n}(X_1+X_2+\cdots+X_n)^2\right] \\
&= E\!\left[\sum_{i=1}^{n}X_i^2 - \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}X_iX_j\right] \\
&= E\!\left[\sum_{i=1}^{n}X_i^2 - \frac{1}{n}\sum_{i=1}^{n}X_i^2 - \frac{2}{n}\sum_{i<j}X_iX_j\right] \\
&= \frac{n-1}{n}\sum_{i=1}^{n}E\!\left(X_i^2\right) - \frac{2}{n}\sum_{i<j}E\!\left(X_iX_j\right) \\
&= (n-1)\left(\mu^2+\sigma^2\right) - \frac{2}{n}\cdot\frac{n(n-1)}{2}\,E(X_i)E(X_j) \\
&= (n-1)\left(\mu^2+\sigma^2\right) - (n-1)\mu^2 \\
&= (n-1)\sigma^2.
\end{align*}
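The two expectations, E(S²) = σ² and E(D²) = (n − 1)σ²/n, can be checked by simulation. The sketch below uses illustrative values (σ = 2, n = 5) and simply compares the long-run averages of S² and D² with the theoretical values.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000     # illustrative values

x = rng.normal(mu, sigma, size=(reps, n))
s2 = x.var(axis=1, ddof=1)   # S^2, divisor n - 1
d2 = x.var(axis=1, ddof=0)   # D^2, divisor n

print("mean of S^2 ≈", s2.mean(), " vs sigma^2 =", sigma**2)
print("mean of D^2 ≈", d2.mean(), " vs (n-1)/n * sigma^2 =", (n - 1) / n * sigma**2)
```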
For reference, the table below lists some common parameters, their usual point estimators and the variances of those estimators.

Parameter θ     Sample size(s)    Point estimator θ̂    Variance of θ̂
p               n                 p̂                    pq/n
µ1 − µ2         n1, n2            Ȳ1 − Ȳ2              σ1²/n1 + σ2²/n2
p1 − p2         n1, n2            p̂1 − p̂2              p1q1/n1 + p2q2/n2
σ²              n                 S²                    2σ⁴/(n − 1)
We will briefly discuss two important estimation methods (techniques) in statistics here, namely
the Method of Moments and the Maximum Likelihood Estimation Method.
Definitions:
(1) E(X^k) is the k-th (theoretical) raw moment of the random variable X (about the origin),
for k = 1, 2, . . .
(2) E[(X − µ)^k] is the k-th (theoretical) central moment of X (about the mean), for k =
1, 2, . . .
The corresponding k-th sample moment about the origin is M_k = (1/n) Σ_{i=1}^{n} X_i^k. The
method of moments first equates M_1 with E(X), then (if needed) M_2 with E(X²), and so on:
(3) Continue equating sample moments about the origin, M_k, with the corresponding theoretical
moments E(X^k), k = 3, 4, . . ., until you have the same number of equations as unknown
parameters.
The resulting estimators are called method of moments estimators (or simply moment estimators).
Example 5
Let X1 , X2 , . . . , Xn be Bernoulli random variables with parameter p. What is the method of
moments estimator of p?
Solution: For X ∼ Bin(n = 1, p), the first theoretical moment about the origin is
E(X) = np = p since n = 1.
We have just one parameter for which we are trying to derive the method of moments estimator.
Therefore, we need just one equation. Equating the first theoretical moment about the origin
with the corresponding sample moment, we get p = X̄.
We just need to put a "hat" on the parameter to make it clear that it is an estimator. We
can also subscript the estimator with "MM" to indicate that it is the method of
moments estimator:
\[
\hat{p}_{MM} = \bar{X}.
\]
Example 6
Let X1, X2, . . . , Xn be normal random variables with mean µ and variance σ². What are the
method of moments estimators of the mean, µ, and the variance, σ²?
Solution: Here X ∼ N(µ, σ²). The first and second theoretical moments about the origin are
E(X) = µ and E(X²) = σ² + µ².
In this case, we have two parameters for which we are trying to derive method of moments
estimators. Therefore, we need two equations here. Equating the first theoretical moment
about the origin with the corresponding sample moment, we get
\[
\hat{\mu}_{MM} = \bar{X}.
\]
And, equating the second theoretical moment about the origin with the corresponding sample
moment, we get
\[
E(X^2) = \sigma^2 + \mu^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2.
\]
Now, the first equation tells us that the method of moments estimator of the mean µ is the
sample mean. And, substituting the sample mean in for µ in the second equation and solving
for σ 2 , we get that the method of moments estimator of the variance σ 2 is
\[
\hat{\sigma}^2_{MM} = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2.
\]
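In code, the two moment estimators amount to computing the first two sample moments. The sketch below does this for a simulated normal sample; the simulated parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=500)   # simulated N(10, 9) sample (illustrative)

m1 = x.mean()                 # first sample moment about the origin
m2 = np.mean(x**2)            # second sample moment about the origin

mu_mm = m1                    # method of moments estimate of mu
sigma2_mm = m2 - m1**2        # method of moments estimate of sigma^2

print("mu_MM     =", mu_mm)
print("sigma2_MM =", sigma2_mm)
```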
Example 7
Let X1 , X2 , . . . , Xn be gamma random variables with parameters α and λ. What are the method
of moments estimators of α and λ?
Solution: The probability density function is
\[
f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \quad \text{for } x > 0,
\]
with
\[
E(X) = \frac{\alpha}{\lambda} \quad\text{and}\quad \operatorname{Var}(X) = \frac{\alpha}{\lambda^2}.
\]
Equating the first moments,
\[
\frac{\alpha}{\lambda} = \bar{X} \iff \hat{\alpha}_{MM} = \lambda\bar{X}.
\]
Equating the second moments,
\begin{align*}
E(X^2) &= \frac{\alpha}{\lambda^2} + \frac{\alpha^2}{\lambda^2}, \\
\frac{\sum_{i=1}^{n} X_i^2}{n} &= \frac{\bar{X}}{\lambda} + \bar{X}^2, \\
\frac{\bar{X}}{\lambda} &= \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n}.
\end{align*}
Therefore,
\[
\hat{\lambda}_{MM} = \frac{n\bar{X}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}.
\]
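The two formulas translate directly into code. The sketch below applies them to a simulated Gam(α = 3, λ = 2) sample; the chosen parameter values are illustrative only, and note that numpy parameterizes the gamma distribution by shape and scale = 1/λ.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_true, lam_true = 3.0, 2.0
# numpy uses shape and scale = 1/rate
x = rng.gamma(shape=alpha_true, scale=1.0 / lam_true, size=2000)

n = len(x)
xbar = x.mean()
lam_mm = n * xbar / (np.sum(x**2) - n * xbar**2)   # lambda_hat_MM
alpha_mm = lam_mm * xbar                           # alpha_hat_MM = lambda_hat * xbar

print("lambda_MM =", lam_mm, "  alpha_MM =", alpha_mm)
```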
Example 8
Sometimes, M.M.E.'s (method of moments estimators, or moment estimators) are meaningless.
Consider the following example.
A random variable X follows the uniform distribution Uni(0, θ) and we obtained some realized
values of X as x1 = 2, x2 = 3, x3 = 5 and x4 = 12.
We know that
\[
E(X) = \frac{\theta}{2},
\]
so the M.M.E. of θ is
\[
\hat{\theta}_{MM} = 2\bar{X}.
\]
The value of X̄ = (2 + 3 + 5 + 12)/4 = 5.5, so the estimate for θ in the above example is
θ̂_MM = 2 × 5.5 = 11.
However, this estimate for θ, which is the upper bound of the uniform distribution, is clearly not
acceptable because the realized value x4 = 12 has already exceeded this estimated "bound".
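The failure is just arithmetic: θ̂_MM = 11 while the largest observation is 12. A two-line check in Python:

```python
x = [2, 3, 5, 12]
theta_mm = 2 * sum(x) / len(x)               # moment estimate 2 * xbar
print(theta_mm, max(x), theta_mm < max(x))   # 11.0 12 True -> estimate below an observed value
```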
Introduction
In 1922, the famous English statistician Sir Ronald A. Fisher proposed a new method of parameter
estimation, namely maximum likelihood estimation (M.L.E.).
The idea is to find the value of the unknown parameter(s) that makes the observed (realized)
data the most likely possible outcome. Calculus is used to maximize the "likelihood".
The likelihood function based on observed data x1, x2, . . . , xn is L(θ) = f(x1, x2, . . . , xn; θ),
which is the probability of observing the given data as a function of θ. The maximum
likelihood estimate (M.L.E.) for θ, namely θ̂, is the value of θ that maximizes L(θ) [or equiva-
lently ℓ(θ) = ln L(θ)]: it is the value that makes the observed data the most probable. If the
Xi's are i.i.d., then the likelihood function simplifies to
\[
L(\theta) = \prod_{i=1}^{n} f(x_i; \theta),
\]
and we maximize
\[
\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)
\]
with respect to θ.
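When no closed form is available, or simply as a check, ℓ(θ) can be maximized numerically. The sketch below evaluates the exponential log-likelihood ℓ(λ) = n ln λ − λ Σ xᵢ on a grid for a small made-up data set and compares the grid maximizer with the closed-form value 1/x̄, the MLE for exponential data derived later in these notes.

```python
import numpy as np

x = np.array([0.8, 1.5, 0.3, 2.2, 1.1])        # illustrative data, assumed Exp(lambda)

def loglik(lam):
    # l(lambda) = n*ln(lambda) - lambda*sum(x)
    return len(x) * np.log(lam) - lam * x.sum()

grid = np.linspace(0.01, 5.0, 100_000)          # crude grid search over lambda
lam_hat = grid[np.argmax(loglik(grid))]

print("grid maximizer ≈", lam_hat, "  closed form 1/xbar =", 1.0 / x.mean())
```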
Note, however, that maximum likelihood estimators are not necessarily unbiased.
* For X ∼ Geo(p), the MLE p̂_ML = 1/X (see the Geometric distribution section below) satisfies
\begin{align*}
E(\hat{p}_{ML}) = E\!\left(\frac{1}{X}\right)
&= \sum_{x=1}^{\infty} \frac{1}{x}(1-p)^{x-1}p \\
&= \frac{p}{1-p}\sum_{x=1}^{\infty} \frac{1}{x}(1-p)^{x} \\
&= \frac{p}{1-p}\sum_{x=1}^{\infty} \int_{p}^{1} (1-t)^{x-1}\,dt \\
&= \frac{p}{1-p}\int_{p}^{1} \sum_{x=1}^{\infty} (1-t)^{x-1}\,dt \\
&= \frac{p}{1-p}\int_{p}^{1} \frac{1}{t}\,dt \\
&= \frac{p}{1-p}\Big[\ln|t|\Big]_{p}^{1} \\
&= -\frac{p\ln p}{1-p} \neq p.
\end{align*}
It can also be seen that when α = 1, i.e., X ∼ Gam(1, λ) ≡ Exp(λ), E(λ̂_ML) does not exist
at all.
* Similarly, for X ∼ Uni(0, θ), the MLE is θ̂_ML = X_(n) = max(X1, X2, . . . , Xn) (see the
Uniform distribution section below), which is also biased. The pdf of X_(n) is
\begin{align*}
f(x) &= \frac{d}{dx} P\!\left(X_{(n)} \le x\right) \\
&= \frac{d}{dx} P\{\max(X_1, X_2, \ldots, X_n) \le x\} \\
&= \frac{d}{dx} P(X_1 \le x \cap X_2 \le x \cap \cdots \cap X_n \le x) \\
&= \frac{d}{dx}\left[P(X_1 \le x)P(X_2 \le x)\cdots P(X_n \le x)\right] \\
&= \frac{d}{dx}\left(\frac{x}{\theta}\right)^{n} \\
&= \frac{nx^{n-1}}{\theta^{n}}, \quad \text{for } 0 < x < \theta,
\end{align*}
and zero otherwise. Therefore,
\[
E\!\left(X_{(n)}\right) = \int_{0}^{\theta} x\,\frac{nx^{n-1}}{\theta^{n}}\,dx
= \frac{n}{\theta^{n}}\int_{0}^{\theta} x^{n}\,dx
= \frac{n}{\theta^{n}}\left[\frac{x^{n+1}}{n+1}\right]_{0}^{\theta}
= \frac{n\theta}{n+1} \neq \theta.
\]
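This bias is easy to see by simulation; the sketch below uses the illustrative values θ = 10 and n = 5.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 10.0, 5, 200_000               # illustrative values

x_max = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)

print("E(X_(n)) ≈", x_max.mean())                # close to n*theta/(n+1)
print("n*theta/(n+1) =", n * theta / (n + 1))
print("bias-corrected (n+1)/n * X_(n) ≈", ((n + 1) / n * x_max).mean())   # close to theta
```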
Binomial distribution
For X ∼ Bin(n, p), a single observation x has pmf
\[
p(x) = \binom{n}{x} p^{x}(1-p)^{n-x}, \quad x = 0, 1, \ldots, n,
\]
so that
\[
\ell(p; x) = \ln\binom{n}{x} + x\ln p + (n-x)\ln(1-p).
\]
Then,
\[
\frac{\partial \ell}{\partial p} = 0 \implies \frac{x}{p} - \frac{n-x}{1-p} = 0 \implies p = \frac{x}{n}.
\]
The MLE of p is
\[
\hat{p}_{ML} = \frac{X}{n}.
\]
Note: MLE and MME are the same in this case because E(X) = np.
Remark:
With m observations, the likelihood function is
\[
L(p; \mathbf{x}) = \prod_{i=1}^{m} L(p; x_i)
= \prod_{i=1}^{m} \binom{n}{x_i} p^{x_i}(1-p)^{n-x_i}
= \left[\prod_{i=1}^{m}\binom{n}{x_i}\right] \times p^{\sum_{i=1}^{m}x_i}(1-p)^{\sum_{i=1}^{m}(n-x_i)}.
\]
Then,
\begin{align*}
\frac{\partial \ell}{\partial p} = 0
&\implies \frac{\sum_{i=1}^{m}x_i}{p} - \frac{mn-\sum_{i=1}^{m}x_i}{1-p} = 0 \\
&\implies \frac{m\bar{x}}{p} - \frac{mn-m\bar{x}}{1-p} = 0 \\
&\implies p = \frac{\bar{x}}{n}.
\end{align*}
The MLE of p is
\[
\hat{p}_{ML} = \frac{\bar{X}}{n}.
\]
Poisson distribution
For X ∼ Poi(λ), a single observation x has pmf
\[
p(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad \text{for } x = 0, 1, 2, \ldots,
\]
so that ℓ(λ; x) = −λ + x ln λ − ln(x!). Then,
\[
\frac{\partial \ell}{\partial \lambda} = 0 \implies -1 + \frac{x}{\lambda} = 0 \implies \lambda = x.
\]
The MLE of λ is
\[
\hat{\lambda}_{ML} = X.
\]
Note: MLE and MME are the same in this case because E(X) = λ.
Geometric distribution
For X ∼ Geo(p), a single observation x has pmf p(x) = (1 − p)^{x−1} p for x = 1, 2, . . ., so that
ℓ(p; x) = (x − 1) ln(1 − p) + ln p. Then,
\[
\frac{\partial \ell}{\partial p} = 0 \implies -\frac{x-1}{1-p} + \frac{1}{p} = 0 \implies p = \frac{1}{x}.
\]
The MLE of p is
\[
\hat{p}_{ML} = \frac{1}{X}.
\]
Note: MLE and MME are the same in this case because E(X) = 1/p.
Note: For the negative binomial distribution, MLE and MME are likewise the same because E(X) = r/p.
Exponential distribution
For X ∼ Exp(λ), a single observation x has pdf f(x) = λe^{−λx} for x > 0, so that
ℓ(λ; x) = ln λ − λx. Then,
\[
\frac{\partial \ell}{\partial \lambda} = 0 \implies \frac{1}{\lambda} - x = 0 \implies \lambda = \frac{1}{x}.
\]
The MLE of λ is
\[
\hat{\lambda}_{ML} = \frac{1}{X}.
\]
Note: MLE and MME are the same in this case because E(X) = 1/λ.
Note: For the Gam(α, λ) distribution with known shape parameter α (and unknown rate λ),
MLE and MME are the same because E(X) = α/λ.
Remark: For a Gam(α, λ) distribution with known rate parameter λ but unknown shape
parameter α, the MLE of α has no closed form.
Note: For the N(µ, σ²) distribution with known variance σ² (and unknown mean µ), MLE and
MME are the same because E(X) = µ.
Suppose that X ∼ N(µ, σ²) with known mean µ but unknown variance parameter σ² (not σ). The pdf is
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{for } -\infty < x < \infty.
\]
The likelihood function is
\[
L(\mu, \sigma^2; x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.
\]
The log-likelihood function is
\[
\ell(\mu, \sigma^2; x) = \ln L(\mu, \sigma^2; x) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}.
\]
Then,
\[
\frac{\partial \ell}{\partial \sigma^2} = 0 \implies -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2(\sigma^2)^2} = 0 \implies \sigma^2 = (x-\mu)^2.
\]
The MLE of σ² is
\[
\hat{\sigma}^2_{ML} = (X-\mu)^2.
\]
Remark:
With n observations, the likelihood function is
\[
L(\mu, \sigma^2; \mathbf{x}) = \prod_{i=1}^{n} L(\mu, \sigma^2; x_i)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}
= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}.
\]
Then,
\[
\frac{\partial \ell}{\partial \sigma^2} = 0 \implies -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i-\mu)^2 = 0 \implies \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2.
\]
The MLE of σ² is
\[
\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\mu)^2.
\]
Suppose that X ∼ N(µ, σ²) with unknown mean parameter µ and unknown variance parameter
σ² (not σ). The pdf is
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \text{for } -\infty < x < \infty.
\]
The MLE of µ is
\[
\hat{\mu}_{ML} = \bar{X}.
\]
The MLE of σ² is
\[
\hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2.
\]
Note 1: MLE and MME are the same in this case because
\[
E(X) = \mu, \qquad E(X^2) = \sigma^2 + \mu^2
\implies \sigma^2 = E(X^2) - [E(X)]^2
\implies \hat{\sigma}^2_{MM} = m_2' - (m_1')^2,
\]
where m′₁ = X̄ and m′₂ = (1/n) Σ Xᵢ² are the first and second sample moments about the origin,
and
\begin{align*}
\hat{\sigma}^2_{ML} &= \frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(X_i^2 - 2X_i\bar{X} + \bar{X}^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}X_i^2 - 2\bar{X}\sum_{i=1}^{n}X_i + n\bar{X}^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}X_i^2 - 2\bar{X}\cdot n\bar{X} + n\bar{X}^2\right) \\
&= \frac{1}{n}\left(\sum_{i=1}^{n}X_i^2 - n\bar{X}^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar{X}^2 \\
&= m_2' - (m_1')^2.
\end{align*}
Uniform distribution
For X ∼ Uni(0, θ), f(x; θ) = 1/θ for 0 ≤ x ≤ θ, so the likelihood based on n observations is
L(θ; x) = 1/θ^n, provided θ ≥ max(x1, . . . , xn), and zero otherwise.
That is, the smaller θ is, the larger the likelihood function is. Therefore, we look for the smallest
value that θ can take based on the n observations. Since all the observations
of X (i.e., x1, x2, . . . , xn) must lie between 0 and θ, the smallest admissible value of θ is the
maximum of these n observations. Thus, the MLE of θ is
\[
\hat{\theta}_{ML} = X_{(n)} = \max(X_1, X_2, \ldots, X_n),
\]
that is, the n-th order statistic of X, or the largest observation in the sample.
Exercises
Exercise 1: Let X1, X2, . . . , Xn be a random sample from a distribution with probability
density function f(x; θ) = θx^{θ−1}, for 0 < x < 1, where θ > 0.
(a) Find the maximum likelihood estimator of θ.
(b) Use the result in part (a) to estimate θ from the following observed sample:
{0.718, 0.571, 0.662, 0.975, 0.746, 0.979, 0.429, 0.509, 0.876, 0.666}
Solution:
(a) Log-likelihood function:
\[
\ell(\theta; \mathbf{x}) = \ln L(\theta; \mathbf{x}) = n\ln\theta + (\theta-1)\sum_{i=1}^{n}\ln x_i.
\]
Consider
\[
\frac{\partial}{\partial \theta}\ell(\theta; \mathbf{x}) = 0
\implies \frac{n}{\theta} + \sum_{i=1}^{n}\ln x_i = 0
\implies \theta = -n\left(\sum_{i=1}^{n}\ln x_i\right)^{-1}.
\]
The log-likelihood function attains its maximum at this value of θ, as the second derivative
\[
\frac{\partial^2}{\partial\theta^2}\ell(\theta; \mathbf{x}) = -\frac{n}{\theta^2} < 0.
\]
Therefore the MLE for θ is given by
\[
\hat{\theta}_{ML} = -n\left(\sum_{i=1}^{n}\ln X_i\right)^{-1} = -\left(\frac{1}{n}\sum_{i=1}^{n}\ln X_i\right)^{-1}.
\]
(b) Note that
\[
\frac{1}{10}\sum_{i=1}^{10}\ln x_i = -0.3704.
\]
The maximum likelihood estimate for θ is
\[
\hat{\theta}_{ML} = -\frac{1}{-0.3704} = 2.7000.
\]
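The arithmetic in part (b) can be reproduced in a few lines of Python:

```python
import numpy as np

x = np.array([0.718, 0.571, 0.662, 0.975, 0.746,
              0.979, 0.429, 0.509, 0.876, 0.666])

mean_log = np.log(x).mean()
theta_ml = -1.0 / mean_log

print("mean of ln(x_i) ≈", round(mean_log, 4))   # about -0.3704
print("theta_ML        ≈", round(theta_ml, 4))   # about 2.70
```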
Exercise 2: Eddy's budgerigar, Marigold, chirps for X minutes until she is fed. The probability
density function of X is given as follows:
\[
f(x) = \begin{cases} Cxe^{-\lambda x}, & \text{for } x > 0; \\ 0, & \text{otherwise}, \end{cases}
\]
where C and λ are two positive constants.
Eddy records the chirping time (in minutes) of Marigold for 20 days as follows:
2, 5, 5, 7, 8, 8, 9, 12, 12, 12, 12, 12, 15, 16, 18, 18, 18, 18, 18, 25
(i) Estimate the parameter λ and write down the corresponding variance of X.
(ii) Eddy makes 5 more observations and re-estimates λ using all the observed data. He
finds that with the new estimation the standard deviation of the chirping time is 8
minutes. Determine the average chirping time of the 5 new observations.
Solution:
(c) Show that the moment estimator of λ is λ̂ = 2/X̄.
Since f is the Gam(2, λ) density (so that C = λ²), the first moment gives E(X) = 2/λ, and
equating it with the sample mean yields X̄ = 2/λ. Therefore,
\[
\hat{\lambda} = \frac{2}{\bar{X}}.
\]
(d) Show that the maximum likelihood estimator of λ is the same as its moment estimator.
The likelihood is
\[
L(x_1, \ldots, x_n; \lambda) = \prod_{i=1}^{n} \lambda^2 x_i e^{-\lambda x_i}
= \lambda^{2n}\left(\prod_{i=1}^{n} x_i\right) e^{-\lambda\sum_{i=1}^{n} x_i}.
\]
The loglikelihood is
\[
\ell(x_1, \ldots, x_n; \lambda) = \ln L(x_1, \ldots, x_n; \lambda)
= 2n\ln\lambda + \ln\left(\prod_{i=1}^{n} x_i\right) - \lambda\sum_{i=1}^{n} x_i.
\]
\[
\left(\text{Note: } \ln\prod_{i=1}^{n} x_i = \sum_{i=1}^{n}\ln x_i.\right)
\]
Consider
\[
\frac{\partial \ell}{\partial \lambda} = \frac{d\ell}{d\lambda} = \frac{2n}{\lambda} - \sum_{i=1}^{n} x_i.
\]
For ∂ℓ/∂λ = 0,
\begin{align*}
\frac{2n}{\hat{\lambda}} - \sum_{i=1}^{n} x_i &= 0, \\
\frac{2n}{\hat{\lambda}} &= \sum_{i=1}^{n} x_i, \\
\hat{\lambda} &= \frac{2n}{\sum_{i=1}^{n} x_i} = \frac{2}{\bar{X}}.
\end{align*}
(e) Eddy records the chirping time (in minutes) of Marigold for 20 days as follows:
2, 5, 5, 7, 8, 8, 9, 12, 12, 12, 12, 12, 15, 16, 18, 18, 18, 18, 18, 25
(i) Estimate the parameter λ and write down the corresponding variance of X.
(ii) Eddy makes 5 more observations and re-estimates λ using all the observed data. He
finds that with the new estimation the standard deviation of the chirping time is 8
minutes. Determine the average chirping time of the 5 new observations.
(i) The sample mean of the 20 observations is x̄ = 250/20 = 12.5, so λ̂ = 2/x̄ = 0.16 and the
corresponding variance is Var(X) = 2/λ̂² = 78.125.
(ii) The new standard deviation is 8 minutes, so consider
\[
\operatorname{Var}(X) = \frac{2}{\hat{\lambda}^2} = 8^2 = 64.
\]
We have
\[
\hat{\lambda} = \sqrt{\frac{2}{64}} = \frac{\sqrt{2}}{8},
\qquad
\frac{2}{\bar{x}} = \frac{\sqrt{2}}{8},
\qquad
\bar{x} = \frac{16}{\sqrt{2}} = 8\sqrt{2}.
\]
That is, the mean of all 25 observations is 8√2 ≈ 11.31 minutes, so their total is 25 × 8√2 = 200√2 ≈ 282.84
minutes. The first 20 observations sum to 250 minutes, so the average chirping time of the 5 new
observations is (200√2 − 250)/5 = 40√2 − 50 ≈ 6.57 minutes.
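A short numerical check of parts (i) and (ii), assuming only the Gam(2, λ) model already used above:

```python
import numpy as np

x20 = np.array([2, 5, 5, 7, 8, 8, 9, 12, 12, 12, 12, 12,
                15, 16, 18, 18, 18, 18, 18, 25], dtype=float)

# part (i): lambda_hat = 2 / xbar and Var(X) = 2 / lambda_hat^2
lam_20 = 2.0 / x20.mean()
print("lambda_hat (20 obs) =", lam_20, "  Var(X) =", 2.0 / lam_20**2)

# part (ii): new SD is 8, so 2 / lambda_new^2 = 64  ->  lambda_new = sqrt(2)/8
lam_new = np.sqrt(2.0 / 64.0)
xbar_25 = 2.0 / lam_new                            # mean of all 25 observations
mean_new5 = (25 * xbar_25 - x20.sum()) / 5
print("average of the 5 new observations ≈", mean_new5)   # about 6.57 minutes
```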
Exercise 3: A theory suggests that X, the number of accidents incurred by a worker per
year, has the following probability function:
\[
f(x) = \begin{cases} \theta(1-\theta)^{x}, & \text{for } x = 0, 1, 2, \ldots; \\ 0, & \text{otherwise}, \end{cases}
\]
where 0 < θ < 1 is an unknown parameter of the distribution of X, with mean (1 − θ)/θ and
variance (1 − θ)/θ². Let {X1, . . . , Xn} be a random sample of size n from the population.
(a) Construct θ̂, the maximum likelihood estimator of θ, in terms of X1, . . . , Xn.
(b) Construct θ̃, the moment estimator of θ, in terms of X1, . . . , Xn.
Solution:
(a) The likelihood is
\[
L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)
= \prod_{i=1}^{n} \theta(1-\theta)^{x_i}
= \theta^{n}(1-\theta)^{\sum_{i=1}^{n} x_i}.
\]
The loglikelihood is
\[
\ell(\theta; x_1, \ldots, x_n) = \ln L(\theta; x_1, \ldots, x_n)
= n\ln\theta + \sum_{i=1}^{n} x_i \ln(1-\theta).
\]
Now set
\[
\frac{\partial \ell}{\partial \theta} = \frac{n}{\theta} - \frac{\sum_{i=1}^{n} x_i}{1-\theta} = 0.
\]
Thus,
\begin{align*}
\frac{n}{\hat{\theta}} &= \frac{\sum_{i=1}^{n} x_i}{1-\hat{\theta}}, \\
\frac{1-\hat{\theta}}{\hat{\theta}} &= \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}, \\
1-\hat{\theta} &= \bar{x}\hat{\theta}, \\
1 &= \bar{x}\hat{\theta} + \hat{\theta} = \hat{\theta}(\bar{x}+1).
\end{align*}
Therefore,
\[
\hat{\theta} = \frac{1}{\bar{X}+1}.
\]
Check the second derivative:
\[
\frac{\partial^2 \ell}{\partial \theta^2} = -n\theta^{-2} - \sum_{i=1}^{n} x_i (1-\theta)^{-2}
= -\frac{n}{\theta^2} - \frac{n\bar{x}}{(1-\theta)^2}.
\]
When θ = 1/(x̄ + 1),
\[
\frac{\partial^2 \ell}{\partial \theta^2} = -\frac{n}{\theta^2} - \frac{n\bar{x}}{(1-\theta)^2}
= -n(\bar{x}+1)^2 - \frac{n(\bar{x}+1)^2}{\bar{x}}
= -n(\bar{x}+1)^2\left(1 + \frac{1}{\bar{x}}\right) < 0.
\]
Therefore, θ̂ = 1/(X̄ + 1) is indeed a maximum (as long as X̄ > 0, i.e., at least one accident is observed).
(b) Construct θ̃, the moment estimator of θ, in terms of X1, . . . , Xn.
Given E(X) = (1 − θ)/θ, solve
\[
\bar{X} = \frac{1-\tilde{\theta}}{\tilde{\theta}}
\]
for θ̃, which gives
\[
\tilde{\theta} = \frac{1}{\bar{X}+1}.
\]
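Both estimators reduce to 1/(X̄ + 1). As a quick check, the sketch below simulates accident counts from this distribution with an arbitrary true value θ = 0.3 (numpy's geometric generator counts trials rather than failures, so one is subtracted to match the pmf θ(1 − θ)^x).

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true, n = 0.3, 5000                         # illustrative values

# f(x) = theta * (1 - theta)^x, x = 0, 1, 2, ... counts failures before the first
# success; numpy's geometric counts trials, so subtract 1.
x = rng.geometric(theta_true, size=n) - 1

theta_hat = 1.0 / (x.mean() + 1.0)                # MLE = moment estimator
print("theta_hat ≈", theta_hat, " (true value", theta_true, ")")
```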
Exercise 4: Let {X1, . . . , Xn} be a random sample from a distribution with probability density
function f(x; θ) = (θ + 1)x^θ, for 0 < x < 1, where θ > −1 is an unknown parameter. Construct
(a) the maximum likelihood estimator θ̂1 and (b) the moment estimator θ̂2 of θ.
Solution:
(a) The likelihood is
\[
L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)
= \prod_{i=1}^{n} (\theta+1)x_i^{\theta}
= (\theta+1)^{n}(x_1 x_2 \cdots x_n)^{\theta}.
\]
The loglikelihood is
\[
\ell(\theta; x_1, \ldots, x_n) = \ln L(\theta; x_1, \ldots, x_n)
= n\ln(\theta+1) + \theta\ln(x_1 x_2 \cdots x_n).
\]
\[
\left(\text{Note: } \ln\prod_{i=1}^{n} x_i = \sum_{i=1}^{n}\ln x_i.\right)
\]
\[
\frac{\partial \ell}{\partial \theta} = \frac{d\ell}{d\theta} = \frac{n}{\theta+1} + \sum_{i=1}^{n}\ln(x_i).
\]
For ∂ℓ/∂θ = 0,
\begin{align*}
\frac{n}{\hat{\theta}+1} + \sum_{i=1}^{n}\ln(x_i) &= 0, \\
\frac{n}{\hat{\theta}+1} &= -\sum_{i=1}^{n}\ln(x_i), \\
\hat{\theta}+1 &= \frac{-n}{\sum_{i=1}^{n}\ln(x_i)}.
\end{align*}
Therefore,
\[
\hat{\theta}_1 = -1 - \frac{n}{\sum_{i=1}^{n}\ln(X_i)}.
\]
Remark: Check the second derivative. Since ∂ℓ/∂θ = n(θ + 1)^{−1} + Σ ln(x_i),
\[
\frac{\partial^2 \ell}{\partial \theta^2} = -n(\theta+1)^{-2} = \frac{-n}{(\theta+1)^2} < 0,
\]
therefore the above is a maximum.
(b) The first theoretical moment is
\[
E(X) = \int_{0}^{1} x(\theta+1)x^{\theta}\,dx
= \int_{0}^{1} (\theta+1)x^{\theta+1}\,dx
= \left[\frac{(\theta+1)x^{\theta+2}}{\theta+2}\right]_{0}^{1}
= \frac{\theta+1}{\theta+2}.
\]
Equating it with the sample mean, (θ̂ + 1)/(θ̂ + 2) = X̄, so
\begin{align*}
\hat{\theta}+1 &= \bar{X}\hat{\theta} + 2\bar{X}, \\
\hat{\theta} - \bar{X}\hat{\theta} &= 2\bar{X} - 1, \\
\hat{\theta}(1-\bar{X}) &= 2\bar{X} - 1.
\end{align*}
Therefore,
\[
\hat{\theta}_2 = \frac{2\bar{X}-1}{1-\bar{X}}.
\]
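Since f(x) = (θ + 1)x^θ on (0, 1) is the Beta(θ + 1, 1) density, the two estimators can be compared on simulated data; the true value θ = 2 below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true, n = 2.0, 5000                              # illustrative values

# f(x) = (theta + 1) x^theta on (0, 1) is the Beta(theta + 1, 1) density
x = rng.beta(theta_true + 1.0, 1.0, size=n)

theta_ml = -1.0 - n / np.log(x).sum()                  # MLE:  -1 - n / sum(ln x_i)
theta_mm = (2.0 * x.mean() - 1.0) / (1.0 - x.mean())   # MoM: (2*xbar - 1) / (1 - xbar)

print("theta_ML ≈", theta_ml, "  theta_MM ≈", theta_mm)
```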