Stat 535 C - Statistical Computing & Monte Carlo Methods
Arnaud Doucet
Email: [email protected]
Slides available on the Web before lectures:
www.cs.ubc.ca/~arnaud/stat535.html
Textbook: C.P. Robert & G. Casella, Monte Carlo Statistical Methods,
Springer, 2nd Edition.
Additional lecture notes available on the Web.
Textbooks which might also be of help:
A. Gelman, J.B. Carlin, H. Stern and D.B. Rubin, Bayesian Data
Analysis, Chapman & Hall/CRC, 2nd edition.
C.P. Robert, The Bayesian Choice, Springer, 2nd edition.
2.1 Outline
Summary of Previous Lecture.
Maximum Likelihood.
Bayesian Statistics.
3.1 Likelihood function
Parametric modelling: The observations $x$ are the realization of a random variable $X$ with probability density function $f(x \mid \theta)$.
The function $f(x \mid \theta)$, considered as a function of $\theta$ for a fixed realization of the observation $X = x$, is called the likelihood function.
The likelihood function is written
$$\ell(\theta \mid x) = f(x \mid \theta)$$
to emphasize that the observations are fixed.
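A minimal numerical sketch (not from the slides) of evaluating a likelihood function on a grid; the model (normal with unknown mean $\mu$ and known $\sigma = 1$) and the data are assumptions made for the demo:

```python
# Hedged sketch: the likelihood l(mu | x) = f(x | mu), viewed as a function of
# mu with the observations x held fixed (assumed normal model, made-up data).
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])      # hypothetical observations
mus = np.linspace(-1.0, 4.0, 200)            # grid of parameter values

loglik = np.array([norm.logpdf(x, loc=m, scale=1.0).sum() for m in mus])
print(mus[loglik.argmax()])                   # close to the sample mean
```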
3.2 Sufficient statistics
When $X \sim f(x \mid \theta)$, a function $T$ of $X$ (also called a statistic) is said to be sufficient if the distribution of $X$ conditional upon $T(X)$ is independent of $\theta$; i.e.
$$f(x \mid \theta) = h(x)\, g(T(x) \mid \theta).$$
Let $X = (X_1, \ldots, X_n)$ be i.i.d. from $\mathcal{P}(\theta)$ with distribution $f(x_i \mid \theta) = e^{-\theta}\, \dfrac{\theta^{x_i}}{x_i!}$. Then
$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = \underbrace{\frac{1}{\prod_{i=1}^{n} x_i!}}_{h(x)}\;\underbrace{\theta^{\sum_{i=1}^{n} x_i}\, e^{-n\theta}}_{g(T(x) \mid \theta)}.$$
The statistic $T(x) = \sum_{i=1}^{n} x_i$ is sufficient.
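A minimal numerical check (the two datasets below are made up for the demo) that the Poisson likelihood depends on the data only through $T(x) = \sum_i x_i$: datasets with the same sum give log-likelihood curves that differ only by the constant $\log h(x)$, hence the same inference about $\theta$.

```python
# Hedged sketch of sufficiency in the Poisson example: two made-up datasets
# with the same sum T(x) = 10 yield log-likelihoods differing by a constant.
import numpy as np
from scipy.stats import poisson

x1 = np.array([2, 0, 3, 1, 4])
x2 = np.array([1, 1, 2, 2, 4])
assert x1.sum() == x2.sum()

thetas = np.linspace(0.1, 6.0, 200)
ll1 = np.array([poisson.logpmf(x1, t).sum() for t in thetas])
ll2 = np.array([poisson.logpmf(x2, t).sum() for t in thetas])

# Constant difference log h(x1) - log h(x2): same shape in theta, same argmax.
print(np.allclose(ll1 - ll2, (ll1 - ll2)[0]), thetas[ll1.argmax()], thetas[ll2.argmax()])
```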
3.3 Sufficiency principle
Sufficiency principle: Two observations $x$ and $y$ such that $T(x) = T(y)$ must lead to the same inference on $\theta$.
Another way to think of it is that the inference on $\theta$ is only based on $T(x)$ and not on $x$: $T(x)$ is sufficient.
Note that the sufficiency principle is also useful in practice: it is cheaper to store $T(x)$ than $x$.
3.4 Likelihood Principle
Likelihood Principle. The information brought by an observation $x$ about $\theta$ is entirely contained in the likelihood function $\ell(\theta \mid x) = f(x \mid \theta)$. Moreover, two likelihood functions contain the same information about $\theta$ if they are proportional to each other; i.e. if
$$\ell_1(\theta \mid x) = c(x)\, \ell_2(\theta \mid x).$$
A simpler (?) way to think of it: you can have two different probabilistic models for the data. However, if $\ell_1(\theta \mid x) \propto \ell_2(\theta \mid x)$ then this should lead to the same inference.
Some standard classical statistics procedures do not satisfy this principle because they rely on quantities such as $\Pr_\theta(X > c) = \int_c^{+\infty} f(x \mid \theta)\, dx$, whereas the likelihood principle does not bother about data you have not observed!
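A classic illustration (the numbers below, 3 successes in 12 Bernoulli trials, are an assumption for the demo, not from the slides): a binomial sampling model and a negative-binomial (sample-until-3-successes) sampling model give proportional likelihoods for the same data, so under the likelihood principle they carry the same information about $\theta$.

```python
# Hedged sketch: binomial vs. negative-binomial likelihoods for the same data
# are proportional in theta (constant ratio), as the likelihood principle uses.
import numpy as np
from scipy.special import comb

k, n = 3, 12                               # successes, trials (assumed numbers)
thetas = np.linspace(0.01, 0.99, 99)

lik_binomial = comb(n, k) * thetas**k * (1 - thetas)**(n - k)
lik_negbinom = comb(n - 1, k - 1) * thetas**k * (1 - thetas)**(n - k)

ratio = lik_binomial / lik_negbinom
print(np.allclose(ratio, ratio[0]))        # True: same information about theta
```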
4.1 Maximum Likelihood Estimation
The likelihood principle is fairly vague since it does not lead to the selection of a particular procedure.
Maximum likelihood estimation is one way to implement the sufficiency and likelihood principles:
$$\widehat{\theta} = \arg\sup_{\theta}\, \ell(\theta \mid x).$$
Proof:
$$\arg\sup_{\theta}\, \ell(\theta \mid x) = \arg\sup_{\theta}\, h(x)\, g(T(x) \mid \theta) = \arg\sup_{\theta}\, g(T(x) \mid \theta),$$
$$\ell_1(\theta \mid x) = c(x)\, \ell_2(\theta \mid x) \;\Rightarrow\; \arg\sup_{\theta}\, \ell_1(\theta \mid x) = \arg\sup_{\theta}\, \ell_2(\theta \mid x).$$
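A small sketch (the Poisson model and data are assumptions carried over from the earlier sufficiency example): the maximum of the likelihood can be found numerically and agrees with the closed-form MLE $\widehat{\theta} = \bar{x}$, which depends on the data only through the sufficient statistic.

```python
# Hedged sketch: numerical maximum-likelihood estimation for the assumed
# Poisson model; the optimum matches the analytic MLE, the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

x = np.array([2, 0, 3, 1, 4])                        # hypothetical observations
neg_loglik = lambda theta: -poisson.logpmf(x, theta).sum()

res = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
print(res.x, x.mean())                                # both about 2.0
```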
4.2 Maximum Likelihood Estimation
Be careful: Maximum likelihood estimation is just one way to implement the likelihood principle.
Maximization can be difficult, or there may be several equivalent global maxima. However, the ML estimator is consistent and efficient in most cases (asymptotic properties).
ML estimates can vary widely for small variations of the observations (for small sample sizes).
Example: If $X_i \sim f(x_i \mid \theta) = \theta^{-1}\, \mathbf{1}_{[0,\theta]}(x_i)$ then, for $n$ data,
$$\ell(\theta \mid x) = \prod_{i=1}^{n} f(x_i \mid \theta) = \frac{1}{\theta^{n}}\, \mathbf{1}_{[\max_i\{x_i\},\, +\infty)}(\theta) \quad\Rightarrow\quad \widehat{\theta} = \max_i\{X_i\}.$$
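A small simulation sketch (true $\theta = 1$ and $n = 5$ are assumptions for the demo) showing how variable $\widehat{\theta} = \max_i\{X_i\}$ is for small samples, and that it systematically underestimates $\theta$:

```python
# Hedged sketch: sampling variability of the uniform MLE max(X_i) for small n.
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.0, 5

estimates = np.array([rng.uniform(0.0, theta, size=n).max() for _ in range(10_000)])
# The mean is about n / (n + 1) = 0.83 < theta, with a sizeable spread for n = 5.
print(estimates.mean(), estimates.std())
```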
Tests require frequentist justifications.
5.1 Alternative Approaches
Many approaches have been proposed: penalized likelihood
(e.g. Akaike Information Criterion) or stochastic complexity theory.
Many of these approaches have a Bayesian flavor.
5.2 Bayesian Statistics
A Bayesian model is made of a parametric statistical model $(\mathcal{X}, f(x \mid \theta))$ and a prior distribution on the parameters $(\Theta, \pi(\theta))$.
The unknown parameters are now considered RANDOM.
Many statisticians do not like this, although they accept the probabilistic modeling of the observations.
Example: Assume you want to measure the speed of light given
some observations. Why should I put a prior on this physical constant?
Because of the limited accuracy of the measurement, this constant will never be known exactly, and thus it is justified to put, say, a (uniform) prior on this parameter reflecting this uncertainty.
5.2 Bayesian Statistics
In the Bayesian approach, probability describes degrees of belief.
In the frequentist interpretation, you should repeat an experiment an infinite number of times, and the probabilities correspond to the limiting frequencies.
Problem. How do you attribute a probability to the following event: "There will be a major earthquake in Tokyo on the 27th of April 2013"?
The selection of a prior has an obvious impact on the inference
results! However, Bayesian statisticians are honest about it.
5.2 Bayesian Statistics
Based on a Bayesian model, we can define
The joint distribution of $(\theta, X)$:
$$\pi(\theta, x) = \pi(\theta)\, f(x \mid \theta).$$
The marginal distribution of $X$:
$$\pi(x) = \int \pi(\theta)\, f(x \mid \theta)\, d\theta.$$
For a realization $X = x$, $\pi(x)$ is called the marginal likelihood or evidence.
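A minimal sketch (the conjugate Beta-Bernoulli model, the Beta(2, 2) prior and the data counts are assumptions for the demo, not from the slides) computing the evidence $\pi(x)$ by numerical integration and checking it against the closed form available for this model:

```python
# Hedged sketch: evidence pi(x) = \int pi(theta) f(x | theta) dtheta for an
# assumed Beta(2, 2) prior and k successes out of n Bernoulli observations.
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta
from scipy.special import beta as beta_fn

a, b = 2.0, 2.0            # prior hyperparameters (assumption)
k, n = 7, 10               # observed successes / trials (assumption)

integrand = lambda t: beta.pdf(t, a, b) * t**k * (1 - t) ** (n - k)
evidence_numeric, _ = quad(integrand, 0.0, 1.0)

evidence_exact = beta_fn(a + k, b + n - k) / beta_fn(a, b)   # conjugate closed form
print(evidence_numeric, evidence_exact)                      # the two agree
```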
5.3 Ingredients of Bayesian Inference
Given the prior $\pi(\theta)$ and the likelihood $\ell(\theta \mid x) = f(x \mid \theta)$, Bayes' formula yields
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int f(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
It represents all the information on $\theta$ that can be extracted from $x$.
Note the integral appearing in the denominator of Bayes' rule!
The predictive distribution of $Y$, when $Y \sim g(y \mid \theta, x)$, is
$$g(y \mid x) = \int g(y \mid \theta, x)\, \pi(\theta \mid x)\, d\theta.$$
This is to be distinguished from prediction based on $g(y \mid \widehat{\theta}, x)$.
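Continuing the assumed Beta-Bernoulli example from the previous sketch (conjugate, so the integrals have closed forms), a small sketch of the posterior and of the predictive probability for the next observation:

```python
# Hedged sketch: posterior and predictive for the assumed Beta-Bernoulli model.
a, b = 2.0, 2.0            # prior Beta(a, b) (assumption)
k, n = 7, 10               # observed successes / trials (assumption)

# Posterior pi(theta | x) is Beta(a + k, b + n - k).
post_a, post_b = a + k, b + n - k

# Predictive probability of a success at the next draw:
# g(y = 1 | x) = \int theta pi(theta | x) dtheta = posterior mean.
pred_success = post_a / (post_a + post_b)
print(post_a, post_b, pred_success)          # Beta(9, 5), predictive prob. 9/14
```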
5.3 Ingredients of Bayesian Inference
In the case where $\theta = (\theta_1, \ldots, \theta_p)$ and one is only interested in the parameter $\theta_k$, then $\theta_{-k} = (\theta_1, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_p)$ are so-called nuisance parameters.
Bayesian inference tells us that all the information on $\theta_k$ that can be extracted from $x$ is the marginal posterior distribution
$$\pi(\theta_k \mid x) = \int \pi(\theta \mid x)\, d\theta_{-k}.$$
Once more, computing $\pi(\theta_k \mid x)$ requires computing a (possibly high-dimensional) integral.
Nuisance parameters are often handled using a profile likelihood technique in a maximum likelihood framework.
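A small sketch (the normal model with unknown mean and standard deviation, flat priors, and the data are assumptions for the demo) of integrating out a nuisance parameter numerically on a grid to obtain a marginal posterior:

```python
# Hedged sketch: marginal posterior of mu after integrating out the nuisance
# parameter sigma on a grid (assumed normal model with flat priors).
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])                 # hypothetical observations
mu_grid = np.linspace(-1.0, 4.0, 200)
sigma_grid = np.linspace(0.1, 3.0, 200)
MU, SIGMA = np.meshgrid(mu_grid, sigma_grid, indexing="ij")

# Unnormalized joint posterior pi(mu, sigma | x) on the grid.
log_post = norm.logpdf(x[None, None, :], MU[..., None], SIGMA[..., None]).sum(axis=-1)
post = np.exp(log_post - log_post.max())

# Marginal posterior of mu: integrate (sum) the joint over the sigma axis.
marginal_mu = post.sum(axis=1)
marginal_mu /= marginal_mu.sum() * (mu_grid[1] - mu_grid[0])
print(mu_grid[marginal_mu.argmax()])                     # posterior mode of mu
```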
5.3 Ingredients of Bayesian Inference
Bayesian statistics automatically satisfy the sufficiency principle and the likelihood principle.
Sufficiency principle: If $f(x \mid \theta) = h(x)\, g(T(x) \mid \theta)$ then
$$\pi(\theta \mid x) = \frac{h(x)\, g(T(x) \mid \theta)\, \pi(\theta)}{\int h(x)\, g(T(x) \mid \theta)\, \pi(\theta)\, d\theta} = \frac{g(T(x) \mid \theta)\, \pi(\theta)}{\int g(T(x) \mid \theta)\, \pi(\theta)\, d\theta} = \pi(\theta \mid T(x)).$$
Likelihood principle: Assume we have $f_1(x \mid \theta) = c(x)\, f_2(x \mid \theta)$; then
$$\pi(\theta \mid x) = \frac{f_1(x \mid \theta)\, \pi(\theta)}{\int f_1(x \mid \theta)\, \pi(\theta)\, d\theta} = \frac{c(x)\, f_2(x \mid \theta)\, \pi(\theta)}{\int c(x)\, f_2(x \mid \theta)\, \pi(\theta)\, d\theta} = \frac{f_2(x \mid \theta)\, \pi(\theta)}{\int f_2(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
5.4 Simple Examples
For events $A$ and $B$, Bayes' rule is
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \bar{A})\, P(\bar{A})} = \frac{P(B \mid A)\, P(A)}{P(B)}.$$
Be careful about the subtle exchange of $P(A \mid B)$ for $P(B \mid A)$.
Prosecutor's Fallacy. A zealous prosecutor has collected evidence and has an expert testify that the probability of finding this evidence if the accused were innocent is one in a million. The prosecutor concludes that the probability of the accused being innocent is one in a million. This is WRONG.
5.4 Simple Examples
Assume no other evidence is available and the population is 10 million people.
Defining $A$ = "The accused is guilty", then $P(A) = 10^{-7}$.
Defining $B$ = "Finding this evidence", then $P(B \mid A) = 1$ and $P(B \mid \bar{A}) = 10^{-6}$.
Bayes' formula yields
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \bar{A})\, P(\bar{A})} = \frac{10^{-7}}{10^{-7} + 10^{-6}\,(1 - 10^{-7})} \approx 0.1.$$
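A tiny sketch of the same calculation (the helper function below is illustrative, not from the course material):

```python
# Hedged sketch: Bayes' rule for binary events, applied to the prosecutor's
# fallacy numbers quoted above.
def posterior_prob(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A | B) from P(A), P(B | A) and P(B | not A)."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence

# Probability of guilt given the evidence: roughly 0.1, not 1 - 1e-6.
print(posterior_prob(1e-7, 1.0, 1e-6))
```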
Real-life example: Sally Clark was convicted in the UK (the RSS pointed out the mistake). Her conviction was eventually quashed (on other grounds).
5.4 Simple Examples
Coming back from a trip, you feel sick and your GP thinks
you might have contracted a rare disease (0.01% of the
population has the disease).
A test is available but not perfect.
If a tested patient has the disease, 100% of the time the test
will be positive.
If a tested patient does not have the disease, 95% of the time the test will be negative (5% false positive rate).
Your test is positive, should you really care?
5.4 Simple Examples
Let $A$ be the event that the patient has the disease and $B$ be the event that the test returns a positive result:
$$P(A \mid B) = \frac{1 \times 0.0001}{1 \times 0.0001 + 0.05 \times 0.9999} \approx 0.002.$$
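The same calculation in odds form (a minimal sketch using the numbers quoted on the slide) makes the role of the very low prevalence explicit:

```python
# Hedged sketch: Bayes' rule in odds form for the medical-test example.
prior_odds = 0.0001 / 0.9999          # P(A) / P(not A): the 0.01% prevalence
likelihood_ratio = 1.0 / 0.05         # P(B | A) / P(B | not A)

posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1.0 + posterior_odds)
print(posterior_prob)                 # about 0.002, as above
```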
Such a test would be a complete waste of money for you or the National
Health System.
A similar question was put to 60 students and staff at Harvard Medical School: 18% got the right answer; the modal response was 95%!
5.5 What do we gain from information?
Bayesian inference involves passing from a prior $\pi(\theta)$ to a posterior $\pi(\theta \mid x)$. We might expect that, because the posterior incorporates the information from the data, it will be less variable than the prior.
We have the following identities:
$$E[\theta] = E[E[\theta \mid X]],$$
$$\operatorname{var}[\theta] = E[\operatorname{var}[\theta \mid X]] + \operatorname{var}[E[\theta \mid X]].$$
This means that, on average (over the realizations of the data $X$), we expect the conditional expectation $E[\theta \mid X]$ to be equal to $E[\theta]$, and the posterior variance to be on average smaller than the prior variance, by an amount that depends on the variation of the posterior means over the distribution of possible data.
5.6 Variance Decomposition Identity
If $(\theta, X)$ are two scalar random variables then we have
$$\operatorname{var}(\theta) = E(\operatorname{var}(\theta \mid X)) + \operatorname{var}(E(\theta \mid X)).$$
Proof:
$$\begin{aligned}
\operatorname{var}(\theta) &= E\!\left(\theta^{2}\right) - \left(E(\theta)\right)^{2} \\
&= E\!\left(E\!\left(\theta^{2} \mid X\right)\right) - \left(E\!\left(E(\theta \mid X)\right)\right)^{2} \\
&= E\!\left(E\!\left(\theta^{2} \mid X\right) - \left(E(\theta \mid X)\right)^{2}\right) + E\!\left(\left(E(\theta \mid X)\right)^{2}\right) - \left(E\!\left(E(\theta \mid X)\right)\right)^{2} \\
&= E\!\left(\operatorname{var}(\theta \mid X)\right) + \operatorname{var}\!\left(E(\theta \mid X)\right).
\end{aligned}$$
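A quick Monte Carlo sanity check of the identity in an assumed conjugate Beta-Binomial model (prior Beta(2, 3), $X \mid \theta \sim \text{Binomial}(20, \theta)$; these choices are for the demo only), where the conditional mean and variance are available in closed form:

```python
# Hedged sketch: Monte Carlo check of
# var(theta) = E[var(theta | X)] + var(E[theta | X]) for a Beta-Binomial model.
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 2.0, 3.0, 20

theta = rng.beta(a, b, size=200_000)         # theta drawn from the prior
x = rng.binomial(n, theta)                   # data drawn from f(x | theta)

# Conjugate posterior given X = x is Beta(a + x, b + n - x).
post_a, post_b = a + x, b + n - x
post_mean = post_a / (post_a + post_b)
post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1.0))

lhs = theta.var()                            # var(theta) under the prior
rhs = post_var.mean() + post_mean.var()      # E[var(theta | X)] + var(E[theta | X])
print(lhs, rhs)                              # agree up to Monte Carlo error
```

Note that this check only works because the data are generated from the marginal $\pi(x)$, which is precisely the caveat raised on the next slide.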
5.7 Be careful
Such results appear attractive but one should be careful.
Here there is an underlying assumption that the observations are indeed distributed according to $\pi(x) = \int \pi(\theta)\, f(x \mid \theta)\, d\theta$.