MS-A0503 First course in probability and statistics J Kohonen
Department of mathematics and systems analysis Spring 2020
Aalto SCI Exercise 4B
4B Parameter estimation
About notation. Here we use notation like fλ(x) to mean “the density function of the given
form, when its parameter has value λ”. Note that the subscript now refers to the parameter,
not to the random variable as in fX(x); which meaning is intended must be read from context.
There are also other ways of indicating the parameter: Ross writes f(x | λ), and some people
write f(x; λ). Varying notation is a fact of life.
Class problems
4B1 (Service requests) A computing server receives service requests at random intervals. The
intervals between each two consecutive requests are independent, and follow the density function
    fλ(x) = λ e^(−λx),  x > 0,
    fλ(x) = 0,          x ≤ 0,
where λ > 0 is an unknown parameter. We have measured the intervals 0.16, 1.85, 0.15, 0.72, 1.65.
(a) In the graph below, the measured values are marked on the x axis as small bars. There
are also three proposed density functions for the data, corresponding to the parameter
values λ = 0.25 (red), λ = 1.00 (blue) and λ = 3.00 (gray).
By looking at the data, give your opinion on which of the three proposed density functions
might be the best match for (the empirical distribution of) the data.
(b) Find the maximum likelihood estimate for the parameter λ.
The likelihood function L(λ) and its logarithm ℓ(λ) = log(L(λ)) are maximized at the same
value of λ, so you can use either function. The latter may have more convenient derivatives.
[Figure: the five measured intervals marked on the x axis, together with the three proposed
densities, λ = 0.25 (red), λ = 1.00 (blue), λ = 3.00 (gray).]
Solution.
(a) (One possible way of arguing.) Three of the five data points are in interval [0, 1], so
the density function should be such that it gives reasonable (roughly 3/5) area for this
interval. The remaining two points are in [1.5, 2] so we should also have reasonable area
(roughly 2/5) in this interval. Comparing the three proposals, the red curve seems to
have far too small areas here; and the gray curve seems a bad fit in the [1.5, 2] interval.
The blue curve seems the most reasonable.
(b) Let the observations be x1 , . . . , x5 . The likelihood function (given this data) is
    L(λ) = ∏_{i=1}^{5} λ e^(−λxi) = λ^5 e^(−λ(x1 + ··· + x5)).
The maximum of L(λ) is at the same λ as the maximum of the logarithmic likelihood
ℓ(λ) = log L(λ). The logarithmic likelihood is

    ℓ(λ) = log(λ^5 e^(−λ(x1 + ··· + x5))) = 5 log(λ) − λ(x1 + ··· + x5),

and its first two derivatives are

    ℓ′(λ) = 5 λ^(−1) − (x1 + ··· + x5)

and

    ℓ″(λ) = −5 λ^(−2).
The first derivative is zero only at one point,

    λ = 5 / (x1 + ··· + x5).
Because ℓ″(λ) ≤ 0 for all λ > 0, ℓ is maximized at the zero of its first derivative.
This is also where the likelihood function L(λ) is maximized. So the maximum likelihood
estimate for λ is
    λ̂(x⃗) = 5 / (x1 + ··· + x5) = 5 / (0.16 + 1.85 + 0.15 + 0.72 + 1.65) ≈ 1.104.
Compare the result to what we argued in (a).
In general, if the density function has the form given in this exercise (you may recognize it is
the exponential distribution), then for any n-element data set ~x = (x1 , . . . , xn ) we will have
λ̂(~x) = 1/m(~x), where m(~x) = n1 (x1 + · · · + xn ). You may want to explain to yourself why and
how this makes sense, given that λ is the “rate” parameter of the process.
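The computation above is easy to check numerically; a small sketch (standard library only) computing the maximum likelihood estimate from the observed intervals:

```python
# MLE of the exponential rate parameter from the observed intervals (4B1).
# For this density, lambda-hat = n / (x1 + ... + xn) = 1 / (sample mean).
data = [0.16, 1.85, 0.15, 0.72, 1.65]
n = len(data)
sample_mean = sum(data) / n
lam_hat = 1.0 / sample_mean
print(round(lam_hat, 3))  # → 1.104
```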
4B2 (Continuous uniform distribution) The (continuous) uniform distribution over the interval
[0, b] has density

    fb(x) = 1/b,  0 ≤ x ≤ b,
    fb(x) = 0,    elsewhere.
(a) Find the maximum likelihood estimate for the parameter b, from data (1.3, 1.9, 3.6, 1.1, 5.1).
(Note that we are assuming that the left end of the interval is fixed at zero; only the right
end b is unknown.) Hint. To find where the likelihood function reaches its maximum, it may
be enough to look at its functional form (or a graph). You might not even need logarithms or
derivatives (but you can use them if they help).
(b) Write an expression that gives the maximum likelihood estimator b̂(~x) for the given density
function and any data set ~x = (x1 , . . . , xn ).
(c) In the (very simple!) case n = 1, find whether the estimator b̂(~x) is biased or unbiased.
(d) Another estimator for b could be obtained from the expression
    b̃(x⃗) = (2/n) ∑_{i=1}^{n} xi.
Find whether this estimator is biased or unbiased (for general n).
Solution.
(a) Let the observed data be x1 , . . . , x5 . The likelihood function is then
    L(b) = ∏_{i=1}^{5} fb(xi) = (1/b)^5,  if xi ∈ [0, b] for all i = 1, . . . , 5,
    L(b) = 0,                             otherwise.
Another way of writing this (because all observations are nonnegative) is
    L(b) = (1/b)^5,  if b ≥ max(x1, . . . , x5),
    L(b) = 0,        otherwise.
Looking at the functional form, we see it has its maximum at the point b = max(x1 , . . . , x5 ).
Thus the maximum likelihood estimate for the unknown parameter b is
b̂(x) = max(x1 , . . . , x5 ) = 5.1.
(b) Going through the same steps as in (a), for any n-element data ~x = (x1 , . . . , xn ), we find
that the maximum likelihood estimate is
b̂(~x) = max(x1 , . . . , xn ).
(c) Let n = 1. Let us consider the stochastic model given by the distribution fb : our data will
be a single random number X that is uniformly distributed over the unknown interval
[0, b]. Whatever value X takes, our estimate will then be b̂ = max{X} = X. Let us find
the expected value of our estimator.
E(b̂) = E(X) = b/2.
Clearly b/2 < b (if b > 0), so our maximum likelihood estimator is biased (downwards).
For any n, one can show (see Ross, Example 7.7.c) that E[b̂(X⃗)] = (n/(n+1)) · b, so even if
we have more data, the maximum likelihood estimate is biased downwards, although the bias
decreases as n increases.
(d) According to our stochastic model, our data is now an n-element random vector
X⃗ = (X1, . . . , Xn), whose elements are independent, and each follows the uniform
distribution over [0, b]. Now
    E[b̃(X⃗)] = E[(2/n) ∑_{i=1}^{n} Xi] = (2/n) ∑_{i=1}^{n} E(Xi) = (2/n) · n · (b/2) = b.

So E[b̃(X⃗)] = b, showing that the estimator b̃ is unbiased.
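A quick simulation makes the contrast concrete. The sketch below (illustrative values: true b = 4, n = 5) estimates the expected value of both estimators by Monte Carlo; the ML estimator averages near n/(n+1) · b ≈ 3.33, while b̃ averages near b.

```python
import random

random.seed(0)
b_true, n, reps = 4.0, 5, 100_000
sum_hat = sum_tilde = 0.0
for _ in range(reps):
    x = [random.uniform(0, b_true) for _ in range(n)]
    sum_hat += max(x)              # ML estimator b-hat (biased downwards)
    sum_tilde += 2 * sum(x) / n    # estimator b-tilde (unbiased)
print(sum_hat / reps)    # close to n/(n+1) * b = 3.33...
print(sum_tilde / reps)  # close to b = 4
```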
Home problems
4B3 (Serial numbers) Battle tanks of a foreign army are numbered serially 1, 2, . . . , N , and
the serial numbers are visibly marked. Our observers have seen four tanks carrying the serial
numbers x1 = 13, x2 = 77, x3 = 111 and x4 = 145. Based on this data, find the maximum
likelihood estimate for N , the number of tanks that the foreign army has. Note that N is here
the unknown parameter (not a random variable). Assume that the observed serial numbers
are independent and each has the discrete uniform distribution on the set {1, 2, . . . , N}.
Hint: Compare to problem 4B2.
Grading.
+1 p for the correct likelihood function.
+1 p for finding the maximum point.
Solution. The density function for the discrete uniform distribution in {1, . . . , N } is
    fN(k) = 1/N,  k = 1, . . . , N,
    fN(k) = 0,    otherwise.
The likelihood function of N , given the data x = (x1 , x2 , x3 , x4 ), is
    L(N) = fN(x1) fN(x2) fN(x3) fN(x4) = (1/N)^4,  if 1 ≤ xi ≤ N for all i,
    L(N) = 0,                                      otherwise.
Another way of writing this is
    L(N) = (1/N)^4,  if N ≥ max(x1, x2, x3, x4),
    L(N) = 0,        otherwise.
From this form, we see directly (without even taking derivatives) that the likelihood function
has its maximum at point N = max(x1 , x2 , x3 , x4 ). Thus the maximum likelihood estimate for
N is
N̂ (x) = max(x1 , x2 , x3 , x4 ) = max(13, 77, 111, 145) = 145.
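Since L(N) = 0 for N < 145 and (1/N)^4 is strictly decreasing for N ≥ 145, the maximum must be at N = 145. A small numeric check (the search range 1..1000 is an arbitrary choice for illustration):

```python
xs = [13, 77, 111, 145]

def likelihood(N):
    # L(N) = (1/N)^4 when all observations fit in {1, ..., N}, else 0.
    return N ** -4 if N >= max(xs) else 0.0

N_hat = max(range(1, 1001), key=likelihood)
print(N_hat)  # → 145
```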
4B4 (Fitting a geometric distribution) A random variable X has geometric distribution with
parameter p, that is, it has density
    fp(x) = p (1 − p)^x,  x = 0, 1, 2, . . . ,
    fp(x) = 0,            otherwise.
Application: X is obtained when an experiment has a constant probability p of succeeding, each
time; the experiment is repeated until it succeeds, and we count the number of failures. For example,
tossing a coin until heads is obtained, or asking random people until you find a supporter of party P.
From this distribution, we have three independent observations x1 = 5, x2 = 3 and x3 = 10.
Find the maximum likelihood estimate for the parameter p. Looking at the value of p, explain
what kind of experiment might have produced the data.
Using the logarithmic likelihood is probably convenient here.
Grading.
+1 p for forming the correct likelihood function.
+1 p for finding the maximum point. (For the point, it is enough to find where the first
derivative is zero — looking at the second derivative is not required.)
Solution. The likelihood function is
    L(p) = fp(5) · fp(3) · fp(10)
         = p(1 − p)^5 · p(1 − p)^3 · p(1 − p)^10
         = p^3 (1 − p)^18,
and the logarithmic likelihood is

    ℓ(p) = log L(p) = 3 log p + 18 log(1 − p).
The derivative of ℓ is

    ℓ′(p) = 3/p − 18/(1 − p).
The derivative is zero when

    3/p = 18/(1 − p),  that is,  p = 1/7.
This point is indeed a maximum point of the logarithmic likelihood function, because the second
derivative ℓ″(p) = −3/p² − 18/(1 − p)² is negative for all 0 < p < 1, so the function ℓ is curving
down. Because the logarithm preserves order, this point is also where the (non-logarithmic)
likelihood function takes its maximum. So the maximum likelihood estimate for p is 1/7.
(Possible) examples of processes that might produce such data:
• Spin a roulette wheel that has seven slots numbered 1 . . . 7, until we get our lucky number
(whatever it is). First we spun the wheel 5 times without luck, giving x1 = 5, and got
the lucky number on the sixth spin. Then we went on, and again spun 3 times without
luck, giving x2 = 3, and got the lucky number on the fourth spin. Finally, we spun 10
times without luck, giving x3 = 10, and got the lucky number on the eleventh spin.
• In a large population, 1/7 of the people have a certain property. We wish to find three such
people for our medical experiments. Because we have no other information about where to
find these people, we pick random people repeatedly until we have the three persons we
need. This time, counting both failures and successes, we gathered 5+1+3+1+10+1 = 21
people and found the three we need.
Of course, the same data could have been obtained even if the true population parameter was, say,
0.12 or 0.15. The value p = 1/7 ≈ 0.143 is just the value that has the highest probability of generating
this data. It may be instructive to compare the likelihoods of nearby values of p, and see that they are
not very different!
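That comparison is easy to do numerically; a sketch evaluating L(p) = p^3 (1 − p)^18 at the MLE and at the two nearby values mentioned above (the ratios come out close to 1):

```python
def L(p):
    # Likelihood of the data (x1, x2, x3) = (5, 3, 10) under the geometric model.
    return p**3 * (1 - p)**18

p_hat = 1 / 7
for p in (0.12, p_hat, 0.15):
    # Ratios relative to the maximum; all close to 1.
    print(f"p = {p:.4f}, L(p)/L(p_hat) = {L(p) / L(p_hat):.3f}")
```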