Introduction to maximum likelihood (ML) Lecture
Søren Højsgaard, Ulrich Halekoh
Unit of Statistics and Decision Analysis, Faculty of Agricultural Sciences, University of Aarhus
November 26, 2007
http://gbi.agrsci.dk/statistics/courses/Rcourse-DJF2008/
Contents

1 Introduction
2 Estimation
3 The likelihood function
  3.1 ML Estimates and ML estimators
  3.2 Properties of ML estimators
  3.3 Variance of MLE
  3.4 Maximizing the likelihood
  3.5 Calculating the asymptotic variance
  3.6 Transformation of MLE
4 Tests of hypotheses
  4.1 The likelihood ratio test
  4.2 The Wald test
  4.3 The score test*
5 Confidence intervals
  5.1 Wald based intervals and the Wald test
  5.2 Confidence intervals based on logodds
  5.3 Likelihood ratio based intervals
  5.4 Comparisons of the intervals
  5.5 Interpretation of confidence intervals
  5.6 Reparameterizing the binomial model*
6 Taylor approximation*
7 Newton–Raphson method*
8 Transformations of normal variables: the ∆-method*

Introduction

The presented material aims at giving a brief introduction to likelihood methods and related topics. These topics will be dealt with over and over again throughout the course. Therefore, do not panic if they seem very abstract at first reading; the topics will become more concrete later on.
Estimation
An estimate (more precisely, a point estimate) of an unknown parameter θ in a model is a guess based on observed data y.

Example 2.1 Out of n = 20 male tobacco budworm moths (Heliothis virescens), y = 4 died after they were given a specific dose of the pesticide trans-cypermethrin. We assume that each moth has the same probability θ of dying and that the moths die independently of each other. Then the observations follow a binomial model Y ∼ bin(n, θ), i.e. the probability of observing y deaths out of n is

p(y; θ) = P(Y = y; θ) = (n choose y) · θ^y (1 − θ)^(n−y)

where (n choose y) = n!/(y!(n − y)!) with n! = n · (n − 1) · (n − 2) · · · 2 · 1.
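As a quick illustration (our own R snippet, not part of the original notes), this probability can be evaluated for a given value of θ:

p <- function(theta) choose(20, 4) * theta^4 * (1 - theta)^16   # p(4; theta) from Example 2.1
p(0.2)                              # about 0.218
dbinom(4, size = 20, prob = 0.2)    # the same value via R's built-in binomial pmf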
A natural estimate of θ is the proportion θ̂ = y/n = 4/20 = 0.2.

We speak of an interval estimate (θ_L(y), θ_U(y)) if we state that θ_L(y) < θ < θ_U(y) with some degree of certainty.

Example 2.2 A 95% confidence interval for θ is given by

(θ_L(y); θ_U(y)) = (θ̂ − 1.96·√Var(θ̂); θ̂ + 1.96·√Var(θ̂)) = (0.2 − 0.18; 0.2 + 0.18) = (0.02; 0.38).
The likelihood function
With n = 20 and y = 4 we have (20 choose 4) = 4845, so the probability of observing y = 4 dead moths is

p(4; θ) = 4845 · θ^4 (1 − θ)^16.

Once data are observed, p(y; θ) is a function of the unknown parameter θ only. We call this the likelihood function, written L(θ) = p(y; θ).
Figure 1 shows L(θ) = p(4; θ) for different values of θ, i.e. how likely it is to observe y = 4 under different values of θ. The value θ̂ which maximizes L(θ) is called the maximum likelihood estimate, or MLE for short. Figure 1 suggests that θ̂ = 0.2 = 4/20, the relative frequency.
Figure 1: Plot of the likelihood function Lik(theta, y = 4, n = 20) against theta.

One can think of θ̂ as the value of θ which maximizes the probability of observing the data which one actually has observed. In that sense θ̂ is the value of θ which fits the data most closely.
3.1 ML Estimates and ML estimators
The value θ̂(y = 4) = 4/20 = 0.2 is the maximum likelihood estimate (MLE). The question is how good the estimate is. If n = 200 and y = 40, the MLE is also 0.2, but intuition suggests that the estimate is more precise in the latter case. Hence there must be a characteristic of the estimate which is not simply the estimate itself. To discuss this we need to make the distinction between an estimate and an estimator. Consider the model Y ∼ bin(n, θ). Here Y is a random variable, and the outcome of Y (the number of dead moths) will not be the same when the experiment is repeated several times. Hence θ̂(Y) = Y/n is a random variable because it is a function of Y. We say that θ̂(Y) is the maximum likelihood estimator.
Defining the function L(θ) in R:

Lik <- function(theta, y, n) {
    choose(n, y) * theta^y * (1 - theta)^(n - y)
}
If you find it difficult to think about θ̂(Y) as a random variable in abstract terms, then think about the distribution of θ̂(Y) as what we would get if the experiment was repeated a large number of times: Suppose the experiment with n = 20 is repeated, say, 1000 times, and that y_1, y_2, ..., y_1000 denote the numbers of dead moths in these experiments. Based on these experiments we can calculate the estimates θ̂(y_k) = θ̂_k = y_k/n for k = 1, ..., 1000. We can then look at the distribution of the θ̂_k's. We cannot repeat the experiment in practice, but we can simulate data from a model where the true value is θ = 0.2. This scheme can be repeated for other values of n, e.g. n = 200; see Figure 2.
We have choose(n, y) = (n choose y). Note that we just wrote the probability function of the binomial distribution, and hence an alternative definition would use

Lik <- function(theta, y, n) {
    dbinom(y, size = n, prob = theta)
}
The plot is produced by

theta <- seq(0, 1, 0.05)
plot(Lik(theta, y = 4, n = 20) ~ theta, type = "b")
Figure 2: Distribution of the MLE θ̂ when θ = 0.2. Left: n = 20; right: n = 200.
We generate 1000 binomially distributed observations, once with n = 20 and once with n = 200:
set.seed(234)
y.20 <- rbinom(1000, size = 20, prob = 0.2)
y.200 <- rbinom(1000, size = 200, prob = 0.2)
3.2 Properties of ML estimators
Figure 2 illustrates three important points about the distribution of θ̂(Y):

1. The distribution is centered around the true value θ = 0.2. More precisely, E(θ̂(Y)) ≈ θ.
2. The variance v = Var(θ̂(Y)) becomes smaller when n becomes large. In Section 3.5 we outline how an estimate of this variance can be calculated from the likelihood function.
3. The distribution is approximately normal. More precisely, θ̂(Y) is approximately N(θ, v).

An additional property is:

4. Nice transformations of an MLE are approximately normal: if g(θ) is a smooth monotone function (and g' its first derivative), then g(θ̂(Y)) is approximately N(g(θ), g'(θ)²·v).

To be able to reproduce the results we initialize the random number generator with set.seed, a function that takes an integer. Now we estimate the probability θ for each of the 1000 repetitions:

set.seed(234)
theta.20 <- y.20/20
theta.200 <- y.200/200

The histograms (Fig. 2) are produced with

par(mfrow = c(1, 2))
hist(theta.20, xlim = c(0, 0.6), xlab = bquote(hat(theta)), main = "")
hist(theta.200, xlim = c(0, 0.6), xlab = bquote(hat(theta)), main = "")
3.3 Variance of MLE

The key to discussing how good the estimate θ̂(y) is lies in considering the variance of the estimator. With n = 20 and y = 4, as well as with n = 200 and y = 40, we get θ̂ = 0.2. Yet intuition suggests that an estimate based on 200 observations is more precise than an estimate based on 20 observations. Figure 2 shows that the estimator has much smaller variance when n = 200 than when n = 20. A simple way of summarizing the variation is to calculate the empirical variance of the simulated θ̂_k's. We then find that Var(θ̂_k) ≈ 0.008 for n = 20 and Var(θ̂_k) ≈ 0.001 for n = 200:

var(theta.20)

[1] 0.00761606

var(theta.200)

[1] 0.000799019

For n = 20: √Var(θ̂) ≈ 0.09; for n = 200: √Var(θ̂) ≈ 0.03.

3.4 Maximizing the likelihood

In Example 2.1 we defined

L(θ) = p(4; θ) = (20 choose 4) · θ^4 (1 − θ)^16

to be the likelihood function. Since the term (20 choose 4) does not involve θ (it is just a constant) we can ignore it and define the likelihood function to be simply L(θ) = θ^4 (1 − θ)^16, or more generally L(θ) = θ^y (1 − θ)^(n−y).

It is often easier to maximize the log-likelihood function

l(θ) = log L(θ) = y log θ + (n − y) log(1 − θ)   (1)

than the likelihood function L(θ) itself. To maximize the log-likelihood function one has to

1. calculate the first derivative l'(θ) of l(θ), and
2. solve the equation l'(θ) = 0 to find the maximum likelihood estimate θ̂.
3. It must also be investigated whether θ̂ is indeed a maximum; we shall not go into such details here.

The function l'(θ) is called the score function, and the equation l'(θ) = 0 to be solved is called the likelihood equation. For the binomial model we find

l'(θ) = y/θ − (n − y)/(1 − θ).   (2)

Solving l'(θ) = 0 for θ is straightforward for the binomial model and gives the relative frequency:

θ̂ = y/n.

(This follows from y/θ − (n − y)/(1 − θ) = 0, which implies (1 − θ)y − (n − y)θ = 0, whereby y = θy + nθ − θy = nθ.)

In most practical applications the likelihood equation cannot be solved as easily as above. Instead one has to use iterative methods; see Section 7.
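As a quick numerical check (our own R snippet, not part of the original notes; the name binlogLik is ours), the log-likelihood in Eq. (1) can also be maximized with R's general-purpose optimize function, which reproduces θ̂ = y/n without any algebra:

binlogLik <- function(theta, y, n) y * log(theta) + (n - y) * log(1 - theta)
# maximize over theta in (0, 1) for the moth data y = 4, n = 20
optimize(binlogLik, interval = c(0.001, 0.999), y = 4, n = 20, maximum = TRUE)$maximum
# close to 4/20 = 0.2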
3.5 Calculating the asymptotic variance

The variance of the ML estimator is approximately

v̂ = 1/(−l''(θ̂)).

Differentiating the score function and changing sign gives

−l''(θ) = y/θ² + (n − y)/(1 − θ)².   (3)

In practice θ is unknown, but it can be justified to plug the estimate θ̂ = y/n into −l''(θ), and this gives

−l''(θ̂) = n/(θ̂(1 − θ̂)).

So the variance becomes

v̂ = θ̂(1 − θ̂)/n = y(n − y)/n³.

Example 3.1 For the moth data in Example 2.1 we have θ̂ = 0.2 and hence v̂ = 4 · 16/20³ = 0.008.

3.6 Transformation of MLE

A smooth monotone transformation of the MLE is approximately normal.

Example 3.2 In the moth example it may be more natural to work with odds: the estimated odds of dying are ψ̂ = θ̂/(1 − θ̂) = 0.2/0.8 = 0.25. Let ψ = g(θ) = θ/(1 − θ); then ψ̂ = g(θ̂) = 0.25. Differentiation gives g'(θ) = 1/(1 − θ)². In Examples 2.1 and 3.1 we found that θ̂ = 0.2 and Var(θ̂) ≈ 0.008. Hence g'(0.2) = 1.56, so ψ̂ is approximately N(ψ, 1.56² · 0.008) = N(ψ, 0.02).
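A small R sketch (our own code, not from the original notes) reproduces these numbers:

y <- 4; n <- 20
theta.hat <- y/n
v.hat <- y * (n - y)/n^3                 # plug-in variance, 0.008 (Example 3.1)
psi.hat <- theta.hat/(1 - theta.hat)     # estimated odds, 0.25
g.prime <- 1/(1 - theta.hat)^2           # g'(theta) for g(theta) = theta/(1 - theta)
v.psi <- g.prime^2 * v.hat               # approximate variance of the odds, about 0.02
c(v.hat = v.hat, psi.hat = psi.hat, v.psi = v.psi)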
Tests of hypotheses
4.1 The likelihood ratio test
Suppose interest is in testing whether θ is equal to a specific fixed value θ₀. For example, suppose interest is in testing the hypothesis that θ = θ₀ = 0.3 for the moth data. We mention three tests: the likelihood ratio test, the Wald test and the score test. These are asymptotically equivalent in the sense that when the amount of information in the data tends to infinity, they give the same result. Still, the likelihood ratio test is often the preferred one.
The maximum likelihood estimate θ̂ is the value of θ which gives the observed data the highest probability, namely L(θ̂) = p(y; θ̂). If the value θ₀ assigns nearly the same probability L(θ₀) = p(y; θ₀) as θ̂ does, we would be tempted to accept the hypothesis that θ = θ₀. This suggests considering the likelihood ratio test statistic Q defined by

Q = L(θ₀)/L(θ̂).

Clearly Q is a number between 0 and 1, and values close to 1 are in favor of the hypothesis.
If the hypothesis H0: θ = θ₀ is true, then

−2 log Q = 2(log L(θ̂) − log L(θ₀)) = 2(l(θ̂) − l(θ₀))   (4)

has (when n is large) approximately a χ² distribution with 1 degree of freedom. Large values of −2 log Q lead to rejection of the hypothesis. In Figure 3 it can be seen that −2 log Q is twice the vertical distance between the value of l at θ̂ and at θ₀.

Figure 3: Overview of the likelihood ratio test, the Wald test and the score test. The curve is the log-likelihood l(θ); marked are l(θ̂), l(θ₀) and the slope l'(θ₀).
4.2 The Wald test

When n is large and the hypothesis is true, the distribution of the Wald test statistic

W = −(θ̂ − θ₀)² · l''(θ̂)

will look like a χ² distribution with 1 degree of freedom. Recall that Var(θ̂) ≈ 1/(−l''(θ̂)), so

W = (θ̂ − θ₀)²/Var(θ̂).   (5)

The Wald test compares the values of θ̂ and θ₀ directly, corresponding to the horizontal distance in Figure 3.

Note: In the literature one frequently uses the term Wald test for

Z = (θ̂ − θ₀)/√Var(θ̂).

If the hypothesis is true, then Z approximately has an N(0, 1) distribution. Note that Z² = W.

4.3 The score test*

When n is large and the hypothesis is true, the distribution of the score test statistic

S = −l'(θ₀)²/l''(θ₀)   (6)

will look like a χ² distribution with 1 degree of freedom. The score test is obtained by considering the slope of l at the point θ₀. It is known that the slope of l at θ̂ is 0 (l'(θ̂) = 0 by definition of the MLE). Hence values of l'(θ₀) near 0 will also speak in favor of the hypothesis.

Example 4.1 For the moth data we find l(θ̂) = l(0.20) = −10.008 and l(θ₀) = l(0.3) = −10.523. Hence the likelihood ratio test statistic is −2 log Q = 1.03. The Wald test statistic is obtained by first calculating −l''(θ̂) = 125 and then W = 1.25. To calculate the score test statistic we note that l'(θ₀) = −9.52 and −l''(θ₀) = 77.10; hence S = 1.18. The 95% quantile of the χ² distribution with 1 degree of freedom is 3.84, so none of the tests is significant.
We define the log-likelihood function, Eq. (1):

logLik <- function(theta, y, n) { y * log(theta) + (n - y) * log(1 - theta) }

the first derivative, Eq. (2):

logLikD <- function(theta, y, n) { y/theta - (n - y)/(1 - theta) }

and the second derivative, cf. Eq. (3):

logLikDD <- function(theta, y, n) { -y/theta^2 - (n - y)/(1 - theta)^2 }

Hence we have the LR test statistic (Eq. 4)

-2 * (logLik(0.3, 4, 20) - logLik(0.2, 4, 20))

[1] 1.02928

the Wald test statistic W (Eq. 5)

-(0.2 - 0.3)^2 * logLikDD(0.2, 4, 20)

[1] 1.25

and the score test statistic S (Eq. 6)

-logLikD(0.3, 4, 20)^2/logLikDD(0.3, 4, 20)

[1] 1.17647

Confidence intervals

A common way of presenting the precision of an estimate is a confidence interval, which is an interval estimate. Many textbooks on statistics present the following (approximate) 95% confidence interval for θ for the moth example:

θ̂ ± 1.96·√Var(θ̂) = θ̂ ± 1.96·√(y(n − y)/n³).   (7)

5.1 Wald based intervals and the Wald test

Wald based intervals and the Wald test are closely connected: Consider testing the hypothesis H0: θ = θ₀ for a range of different values of θ₀, e.g. 0.05, 0.10, 0.15, 0.20, . . . Because θ̂ is approximately normal, one way of testing the hypothesis is to use the statistic

z = (θ̂ − θ₀)/√Var(θ̂)

which under the hypothesis approximately has an N(0, 1) distribution.
Thus if the test is at the 5% significance level, the hypothesis is rejected if |z| > 1.96. Some of the hypotheses will be accepted; others will be rejected. The values of θ₀ that lead to acceptance of the hypothesis H0: θ = θ₀ define the confidence interval. A simple calculation gives the interval as

(θ̂ − 1.96·√Var(θ̂); θ̂ + 1.96·√Var(θ̂)).

Example 5.1 A Wald confidence interval based on Eq. (7) for n = 20, y = 4 and for n = 10, y = 2 (in both cases θ̂ = 0.2) is given in Table 1. Note that Var(θ̂) = θ̂(1 − θ̂)/n = y(n − y)/n³.

Table 1: Confidence interval (approximate) for probabilities.

                2.5%   97.5%
n = 20, y = 4   0.02    0.38
n = 10, y = 2  -0.05    0.45

Note: Since 0 ≤ θ ≤ 1, it does not make much sense to have a negative lower boundary on a confidence interval. The problem arises for two reasons:

1. The Wald test relies on the MLE being approximately normal. This approximation is generally poor when n is small.
2. The parameter is bounded to 0 ≤ θ ≤ 1. This is not accounted for in the approximate interval.

5.2 Confidence intervals based on logodds

Example 5.2 Instead of working with probabilities in the binomial example we can work with logodds (see Section 5.6), defined as

η = log(θ/(1 − θ)).

When θ varies from 0 to 1, η will vary from −∞ to ∞, and θ can be expressed in terms of the logodds as

θ = e^η/(1 + e^η).   (8)

The MLE for η is the empirical logodds

η̂ = log(y/(n − y))

and the variance of the estimator is Var(η̂) ≈ n/(y(n − y)).

Since the MLE η̂ is approximately normal, an approximate 95% confidence interval for η is

η̂ ± 1.96·√Var(η̂) = η̂ ± 1.96·√(n/(y(n − y))).   (9)
Recall that to get from logodds to probabilities we use the inverse-logit transformation, i.e. the logistic function (8). The same transformation is then applied to the end points of the confidence intervals to obtain an approximate confidence interval for θ in the two cases.
Reconsider the moth example with n = 20 moths and y = 4 dead moths. Consider also the case with n = 10 moths and y = 2 dead moths. In both cases θ̂ = 4/20 = 2/10 = 0.2, while η̂ = log(4/16) = log(2/8) = log(0.25) = −1.39. Confidence intervals for the logodds are in the two cases given in Table 2.

Table 2: Confidence interval (approximate) for logodds.

                2.5%   97.5%
n = 20, y = 4  -2.48   -0.29
n = 10, y = 2  -2.94    0.16
The confidence interval in the first row of Table 2 (Eq. 9) is

eta <- log(4/(20 - 4))
CI.eta <- eta + 1.96 * c(-1, 1) * sqrt(20/(4 * (20 - 4)))
CI.eta
[1] -2.481968 -0.290621
The result is given in the first two columns of Table 3. Note that these confidence intervals are not symmetric around θ̂ = 0.2, and they are always between 0 and 1. Compare the intervals to those based on the proportions themselves (Table 1).

Table 3: Confidence interval (approximate) for probabilities.

                transforming logits     likelihood based
                2.5%   97.5%            2.5%   97.5%
n = 20, y = 4   0.08    0.43            0.07    0.41
n = 10, y = 2   0.05    0.54            0.04    0.50
The confidence interval for the probability (n = 20, y = 4), obtained by transforming the log-odds via the logistic function, is
plogis(CI.eta)
[1] 0.077132 0.427852
The LR-based confidence interval cannot be calculated with the computational and conceptual means available at the moment. Just for later reference, one would fit a simple logistic regression model and use the function confint (after loading the package MASS) for the confidence interval calculation:
M <- glm(cbind(4, 20 - 4) ~ 1, family = binomial)
eta <- coef(M)
library(MASS)
CI.eta <- confint(M)
    2.5 %    97.5 %
-2.636494 -0.383287
CI <- plogis(CI.eta)
    2.5 %    97.5 %
0.0668264 0.4053344
5.3 Likelihood ratio based intervals

An alternative is to base the interval on the likelihood ratio test. The test is now based on the fact that −2 log Q = 2(l(θ̂) − l(θ₀)) approximately has a χ² distribution with 1 degree of freedom when H0 is true.

We can then find the interval of θ₀'s which do not lead to rejection of H0, in the same spirit as before but with one exception: it is not possible to write down the interval explicitly as before. It is necessary to run through a series of possible values of θ₀ and see whether the hypothesis is accepted or not. Yet, with modern computers, this is not a problem. The likelihood based intervals for the example are given in the rightmost two columns of Table 3. These intervals will always be within [0, 1].

5.4 Comparisons of the intervals

For large samples the different confidence intervals discussed here are (asymptotically) equivalent, meaning that the intervals will be very similar. For small samples and/or parameter estimates close to the limit of the permissible values the intervals can be quite different, and Wald intervals based on parameters close to their limits can be quite misleading. In this case it is recommended either to calculate intervals for parameters which are not restricted to lie in a specific interval and then transform the intervals, or, alternatively, to calculate the likelihood based intervals.
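As a small sketch of how such intervals can be obtained (our own code, not from the original notes), the Wald interval of Section 5.1 can be written down directly, and the likelihood ratio interval of Section 5.3 can be found by scanning a grid of θ₀ values and keeping those not rejected by the LR test:

y <- 4; n <- 20
theta.hat <- y/n
logL <- function(theta) y * log(theta) + (n - y) * log(1 - theta)

# Wald interval (Section 5.1), cf. Table 1
theta.hat + c(-1, 1) * 1.96 * sqrt(y * (n - y)/n^3)

# Likelihood ratio interval by grid search (Section 5.3), cf. Table 3
theta0 <- seq(0.001, 0.999, by = 0.001)
keep <- 2 * (logL(theta.hat) - logL(theta0)) <= qchisq(0.95, df = 1)
range(theta0[keep])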
5.5 Interpretation of confidence intervals
A common mistake is to interpret a confidence interval along the following lines: "There is 95% probability that θ falls in the interval [0.02; 0.38]." It is not correct to attach such a probabilistic meaning to a confidence interval. There is only one true value of θ, and it will either lie in the interval or not, as the case might be.

To associate a probability with a confidence interval, imagine repeating the experiment under identical conditions a large number of times. For each experiment the confidence interval for θ is calculated, as illustrated in Figure 4. Some of these intervals will include the true parameter value and some will not. The relative frequency with which the intervals include the true value is called the coverage probability. If the interval calculated for each experiment is a 95% confidence interval, the coverage probability is 95%; that is, in 95% of the experiments the confidence interval will include the true parameter. But that is not the same as saying that there is 95% probability that the true value of θ is contained in the interval [0.02; 0.38]. For this reason we use an alternative word: we are 95% confident that θ is in the interval [0.02; 0.38].

Figure 4: Coverage of confidence intervals.
5.6 Reparameterizing the binomial model*
Recall the binomial probability function for Y ∼ bin(n, θ):

p(y; θ) = P(Y = y; θ) = (n choose y) · θ^y (1 − θ)^(n−y).   (10)

We can reparameterize the model as follows: Define the logodds as

η = log(θ/(1 − θ)).

Then, when θ varies from 0 to 1, η will vary from −∞ to ∞, and θ can be expressed in terms of the logodds as

θ = e^η/(1 + e^η).   (11)

If we substitute (11) into (10) then we get

p(y; η) = P(Y = y; η) = (n choose y) · (e^η/(1 + e^η))^y · (1/(1 + e^η))^(n−y).   (12)

Now, (12) is also the binomial probability function; it is just expressed in terms of another parameter, the logodds η. This is called a re-parametrization of the model. We can then proceed as before: The loglikelihood becomes (where we ignore the term (n choose y)) after some calculations

l(η) = yη − n log(1 + e^η).

The score function becomes

l'(η) = y − n·e^η/(1 + e^η).

Solving l'(η) = 0 gives the (empirical) logodds:

η̂ = log(y/(n − y)).

A very important property of ML estimates is illustrated by the following: Inserting η̂ into θ = e^η/(1 + e^η) gives θ̂. Hence it does not matter whether the model is parametrized in terms of θ or η; this is a general result on ML estimates. The variance of the estimator is Var(η̂) ≈ n/(y(n − y)).

Taylor approximation*

The simplest version of the Taylor approximation (also called a first order Taylor approximation) is based on approximating a nonlinear function h(x) by a linear function l(x). Let x0 and x be two numbers (not too far apart) and assume that h is a nice (i.e. differentiable) function. Then it is known that

h(x) ≈ h(x0) + h'(x0)(x − x0)   (13)
     = [h(x0) − h'(x0)·x0] + h'(x0)·x = a + b·x = l(x)   (14)

i.e. a = [h(x0) − h'(x0)·x0] and b = h'(x0). The further x is away from x0, the worse this approximation becomes.
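A minimal R sketch (our own code, using the odds transformation from Example 3.2 purely as an illustration) shows how the linear approximation behaves near and away from x0:

h <- function(x) x/(1 - x)                       # a nonlinear function
h1 <- function(x) 1/(1 - x)^2                    # its first derivative
x0 <- 0.2
taylor <- function(x) h(x0) + h1(x0) * (x - x0)  # the linear approximation l(x), Eq. (14)
x <- c(0.18, 0.20, 0.22, 0.30)
rbind(exact = h(x), approx = taylor(x))          # close near x0, worse further away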
Newton–Raphson method*
Suppose we wish to solve the equation h(x) = 0 for x. A classical method for doing so is the Newton–Raphson algorithm: Suppose that x0 is an initial guess for a solution. Then the Taylor approximation gives

0 = h(x) ≈ h(x0) + h'(x0)(x − x0).

Solving the last equation gives x − x0 ≈ −h(x0)/h'(x0) and hence

x ≈ x0 − h(x0)/h'(x0).

This suggests an iterative scheme known as Newton–Raphson iteration: Start with some initial value x0 and calculate successively

x1 = x0 − h(x0)/h'(x0),   x2 = x1 − h(x1)/h'(x1),   x3 = ...

In most (but certainly not all) problems arising in practice there will be very little change in the x-values after a few iterations.

Example 7.1 We wish to solve the likelihood equation

S(θ) = l'(θ) = y/θ − (n − y)/(1 − θ) = 0

for the binomial example with y = 4 and n = 20. Differentiating the score function gives

S'(θ) = −y/θ² − (n − y)/(1 − θ)².

With a simple pocket calculator we can now do the following: Starting with θ0 = 0.1 we get the sequence of values

0.100  0.153  0.191  0.200  0.200  0.200

where we note that the scheme converges after 4 iterations. One can then stop, and the final value is taken to be the maximum likelihood estimate.
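The same iteration is easy to do in R; this is our own sketch (not code from the original notes) and it reproduces the sequence above:

y <- 4; n <- 20
S  <- function(theta)  y/theta - (n - y)/(1 - theta)        # score function l'(theta)
dS <- function(theta) -y/theta^2 - (n - y)/(1 - theta)^2    # its derivative l''(theta)
theta <- 0.1                                                # starting value
for (i in 1:5) {
    cat(format(round(theta, 3), nsmall = 3), "")
    theta <- theta - S(theta)/dS(theta)                     # Newton-Raphson update
}
cat(format(round(theta, 3), nsmall = 3), "\n")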
Transformations of normal variables: the ∆-method*

It is well known that a linear function of a normal random variable is itself normal: if x ∼ N(μ, σ²) and l(x) = a + bx, then z = l(x) = a + bx ∼ N(a + bμ, b²σ²). A nonlinear function z = h(x) of x is, however, not normal. Yet, in some cases z is not too far from being normal; more specifically, h(x) is approximately distributed as

N(h(μ), h'(μ)²·σ²).
This claim can be loosely justified as follows: We apply the Taylor approximation (see Section 6) to h(x), where we let x0 = E(x) = μ. Then

z = h(x) ≈ h(μ) + h'(μ)(x − μ)   (15)
         = [h(μ) − h'(μ)·μ] + h'(μ)·x = a + b·x.   (16)
From (16) it follows that h(x) is approximately a linear function of x, which is normal, so h(x) is approximately normal. It only remains to determine the (approximate) mean and variance of h(x). To do so, recall that E(a + bx) = a + b·E(x) and Var(a + bx) = b²·Var(x). From (16) it now follows that E(h(x)) ≈ h(μ) and Var(h(x)) ≈ h'(μ)²·Var(x). How good the approximation is depends on the shape of h and on the variance σ²: the closer h is to being linear and the smaller σ² is, the better the approximation.
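A short simulation sketch (our own code, with μ and σ chosen to mimic θ̂ for n = 20) illustrates how well the ∆-method approximation works for the odds transformation:

set.seed(1)
mu <- 0.2; sigma <- 0.09                     # roughly the mean and sd of theta-hat for n = 20
x <- rnorm(10000, mean = mu, sd = sigma)
h <- function(x) x/(1 - x)                   # odds transformation
c(mean(h(x)), var(h(x)))                     # simulated mean and variance of h(x)
c(mu/(1 - mu), (1/(1 - mu)^2)^2 * sigma^2)   # delta-method approximations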