Understanding Normal and Gamma Distributions
In the case where µ = 0 and σ = 1, the distribution is called the standard normal distribution. It
can be shown that if X has a normal(µ, σ²) distribution, then E(X) = µ and Var(X) = σ².
3. X = σZ + µ has a normal(µ, σ²) distribution, where Z is a standard normal, i.e., normal(0, 1),
variable and σ ≥ 0.
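As a quick numerical check of property 3, here is a minimal Python sketch (assuming numpy is available; the values of µ and σ below are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0                    # illustrative parameter values
z = rng.standard_normal(1_000_000)      # Z ~ normal(0, 1)
x = sigma * z + mu                      # X = sigma*Z + mu should be normal(mu, sigma^2)
print(x.mean(), x.var())                # close to mu = 2 and sigma^2 = 9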
Remark 2 Note that the expectation, and therefore the moment-generating function, might not exist
for all values of t.
Remark 3 If X1 , · · · , Xn are independent random variables with moment generating functions
MX1 , · · · , MXn , then X1 + · · · + Xn has moment generating function given by
M_{X_1 + ··· + X_n}(t) = ∏_{i=1}^{n} M_{X_i}(t).    (2)
Theorem 4 If the mgf exists for t in an open interval containing zero, it uniquely determines the
probability distribution.
Theorem 5 If the mgf exists in an open interval containing zero, then M^{(r)}(0) = E(X^r), where
M^{(r)}(0) is the rth derivative of M at 0.
The advantage of Theorem 5 is that when a moment of a variable is difficult to calculate directly
(the calculation involves integration), we can instead differentiate the mgf to obtain the same result,
and differentiation is purely mechanical.
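To illustrate Theorem 5, here is a small symbolic sketch (assuming sympy is available) that recovers the first two moments of an exponential(λ) variable from its mgf M(t) = λ/(λ − t):

import sympy as sp

t, lam = sp.symbols('t lambda', positive=True)
M = lam / (lam - t)                       # mgf of the exponential(lambda) distribution
EX = sp.diff(M, t, 1).subs(t, 0)          # first moment M'(0)
EX2 = sp.diff(M, t, 2).subs(t, 0)         # second moment M''(0)
print(sp.simplify(EX), sp.simplify(EX2))  # 1/lambda and 2/lambda**2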
Example 6 (Gamma Distribution) The gamma(α, λ) density function depends on two parameters,
α > 0 and λ > 0, and has density function
g_{α,λ}(t) = (λ^α / Γ(α)) t^{α−1} e^{−λt} for t ≥ 0, and g_{α,λ}(t) = 0 for t < 0,

where Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du for x > 0.
Remark 7 It follows by integration by parts that, for all p > 0, Γ(p + 1) = pΓ(p), and that Γ(k) = (k − 1)!
for positive integers k.
Remark 8 If α = 1, the gamma density coincides with the exponential density. The parameter α
is called a shape parameter for the gamma density, and λ is called a scale parameter. Varying α
changes the shape of the density, whereas varying λ changes the scale of the density.
We will find E(X) and Var(X) for a gamma variable X. The mgf of the gamma distribution is

M(t) = ∫_0^∞ e^{tx} (λ^α / Γ(α)) x^{α−1} e^{−λx} dx
     = (λ^α / Γ(α)) ∫_0^∞ x^{α−1} e^{(t−λ)x} dx
     = (λ^α / Γ(α)) · (Γ(α) / (λ − t)^α)
       (since ∫_0^∞ x^{α−1} e^{(t−λ)x} dx converges for t < λ and can be calculated
        by relating it to the gamma density with parameters α and λ − t)
     = (λ / (λ − t))^α.    (3)
Therefore,

EX = M^{(1)}(0) = α/λ,
EX² = M^{(2)}(0) = α(α + 1)/λ²,

and

Var(X) = EX² − [EX]²
       = α(α + 1)/λ² − α²/λ²
       = α/λ².
□
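A simulation sketch of these results (assuming numpy; note that numpy parameterizes the gamma distribution by shape α and scale 1/λ, and the parameter values below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 3.0, 2.0                                        # illustrative shape and rate
x = rng.gamma(shape=alpha, scale=1.0 / lam, size=1_000_000)  # gamma(alpha, lambda) samples
print(x.mean(), x.var())                                     # close to alpha/lam = 1.5 and alpha/lam**2 = 0.75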
3 Joint Distribution
3.1 For Random Variables
The table below summarizes the formulae for calculating some quantities related to joint distributions.
However, the table serves only as an outline. When doing actual calculations, one should, instead of
relying on the formulae, use reasoning and the basic conditional probability concepts we developed in
the first lecture. The examples below demonstrate some of these skills.
Let X1, X2, ..., Xn be independent random variables, where Xi has an exponential distribution with
rate λi, i = 1, 2, ..., n. Find the distribution of Xmin = min(X1, ..., Xn).
Each Xi has cdf

Fi(x) = 0 for x < 0, and Fi(x) = 1 − e^{−λi x} for x ≥ 0.

Since the Xi's are non-negative, so is their minimum, so Xmin has cdf Fmin(x) = 0 for x < 0.
For x ≥ 0,

Fmin(x) = 1 − P(Xmin > x) = 1 − P(X1 > x, ..., Xn > x) = 1 − ∏_{i=1}^{n} e^{−λi x} = 1 − e^{−(λ1 + ··· + λn)x}.

That is, Xmin has an exponential distribution with rate λ1 + ··· + λn.
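A short simulation sketch (assuming numpy, with three arbitrary illustrative rates) consistent with this conclusion:

import numpy as np

rng = np.random.default_rng(2)
rates = np.array([1.0, 2.0, 3.0])                  # illustrative rates lambda_i
n = 1_000_000
samples = rng.exponential(size=(n, 3)) / rates     # column i is exponential with rate rates[i]
xmin = samples.min(axis=1)
print(xmin.mean())                                 # close to 1/(1 + 2 + 3) = 1/6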
For X and Y independent uniform(0, 1) random variables, we compute several probabilities.

a) P(X² + Y² ≤ 1) = π/4

b) P(X² + Y² ≤ 1 | X + Y ≥ 1) = P(X² + Y² ≤ 1, X + Y ≥ 1) / P(X + Y ≥ 1)
   = (π/4 − 1/2) / (1/2)
   = π/2 − 1

c) P(Y ≤ X²) = ∫_0^1 x² dx = (1/3) x³ |_0^1 = 1/3

d) P(|X − Y| ≤ 0.5) = 1 − 1/4 = 3/4

e) P(|X/Y − 1| ≤ 0.5) = P((2/3)X ≤ Y ≤ 2X) = 1 − (1/2 · 1/2 + 1/2 · 2/3) = 5/12

f) P(Y ≥ X | Y ≥ 1/2) = (1/2 − 1/8) / (1/2) = 3/4.
□
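A Monte Carlo sketch (assuming numpy) that checks parts a)–f):

import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x, y = rng.uniform(size=n), rng.uniform(size=n)

a = (x**2 + y**2 <= 1).mean()                    # ~ pi/4
b = (x**2 + y**2 <= 1)[x + y >= 1].mean()        # ~ pi/2 - 1
c = (y <= x**2).mean()                           # ~ 1/3
d = (np.abs(x - y) <= 0.5).mean()                # ~ 3/4
e = (np.abs(x / y - 1) <= 0.5).mean()            # ~ 5/12
f = (y >= x)[y >= 0.5].mean()                    # ~ 3/4
print(a, b, c, d, e, f)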
Example 13 Let X and Y be independent exponentially distributed random variables with parameters
λ and µ, respectively. Find P(X < Y).
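The standard answer is P(X < Y) = λ/(λ + µ), obtained by conditioning on Y. A quick simulation sketch (assuming numpy, with arbitrary illustrative rates):

import numpy as np

rng = np.random.default_rng(4)
lam, mu = 2.0, 5.0
n = 1_000_000
x = rng.exponential(scale=1.0 / lam, size=n)   # X ~ exponential(lambda)
y = rng.exponential(scale=1.0 / mu, size=n)    # Y ~ exponential(mu)
print((x < y).mean(), lam / (lam + mu))        # both close to 2/7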
Exercise 14 Suppose U(1) < U(2) < ... < U(5) are the order statistics of 5 independent uniform(0, 1)
variables U1, U2, ..., U5, so U(i) is the ith smallest of U1, U2, ..., U5. (See Pitman [1993], p. 352,
Example 3.)
Theorem 15 Linear combinations of independent normal variables are always normally distributed.
In addition, if X and Y are independent with normal(λ, σ²) and normal(µ, τ²) distributions, then
X + Y has a normal(λ + µ, σ² + τ²) distribution.
The proof of the theorem makes use of the rotational symmetry of the joint distribution of
independent standard normal random variables X and Y. See Pitman [1993].
Example 16 For σ = 1, 2, 3 suppose Xσ has normal (0, σ 2 ) distribution, and these three random
variables are independent.
Remark 17 If X and Y are independent with density functions fX(x) and fY(y), so that their joint
density is fX(x)fY(y) on the plane R², then the density function fX+Y(z) of Z = X + Y is given by
the convolution formula

fX+Y(z) = ∫_{−∞}^{∞} fX(x) fY(z − x) dx.
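A numerical sketch of the convolution formula (assuming numpy and scipy): convolving two standard normal densities on a grid reproduces the normal(0, 2) density, consistent with Theorem 15.

import numpy as np
from scipy.stats import norm

grid = np.linspace(-10, 10, 2001)            # symmetric grid with spacing 0.01
dx = grid[1] - grid[0]
fx = norm.pdf(grid)                          # density of X ~ N(0, 1)
fy = norm.pdf(grid)                          # density of Y ~ N(0, 1)
fz = np.convolve(fx, fy, mode='same') * dx   # discretized convolution integral
target = norm.pdf(grid, scale=np.sqrt(2))    # N(0, 2) density
print(np.max(np.abs(fz - target)))           # close to zero (up to discretization error)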
Exercise 18 Suppose that X and Y are independent and normally distributed with mean 0 and
variance 1. Find the distribution of X/Y. (See discussion and solution in Pitman [1993].)
χ², t, and F Distributions
Definition 19 If Z1, Z2, ..., Zn are independent standard normal random variables, the distribution
of U = ∑_{i=1}^{n} Zi² is called the chi-square distribution with n degrees of freedom, denoted χ²_n.
It can be shown that χ²_1 is a special case of the gamma distribution with parameters 1/2 and 1/2.
In Example 9, we see that the sum of independent gamma random variables sharing the same value
of λ follows a gamma distribution. Thus, χ²_n is a gamma distribution with α = n/2 and λ = 1/2.
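A quick check of this identification (assuming scipy), comparing the χ²_n density with the gamma density of shape n/2 and rate 1/2 (scale 2 in scipy's parameterization):

import numpy as np
from scipy.stats import chi2, gamma

n = 5
x = np.linspace(0.01, 20, 200)
# the two densities agree pointwise
print(np.max(np.abs(chi2.pdf(x, n) - gamma.pdf(x, n / 2, scale=2.0))))  # ~ 0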
Definition 20 If Z ∼ N(0, 1) and U ∼ χ²_n, and Z and U are independent, then the distribution of
Z/√(U/n) is called the t distribution with n degrees of freedom.
Theorem 21 If X1, X2, ..., Xn are independent N(µ, σ²) variables, then X̄ and S² are independent,
X̄ is N(µ, σ²/n), and (n − 1)S²/σ² is χ²_{n−1}, where X̄ = (1/n) ∑_{i=1}^{n} Xi and
S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)².
Proof. (From Rice [1995]) The proof of the statement is built on the fact that X̄ and the vector
of random variables (X1 − X̄, X2 − X̄, ..., Xn − X̄) are independent. We will not prove this fact
here; the interested reader is referred to Rice [1995] for a treatment using the moment-generating
function.
Since S² is a function of (X1 − X̄, X2 − X̄, ..., Xn − X̄), and functions of independent vectors are
also independent, we can conclude that X̄ and S² are independent.
Since X1, X2, ..., Xn are independent N(µ, σ²), an extension of Theorem 15 shows that ∑_{i=1}^{n} Xi
is N(nµ, nσ²). Thus, dividing by the constant n makes X̄ = (1/n) ∑_{i=1}^{n} Xi a normal variable with
mean E(X̄) = nµ/n = µ and variance Var(X̄) = nσ²/n² = σ²/n. In addition, ((X̄ − µ)/(σ/√n))²
follows χ²_1 by Definition 19.
Now,

(1/σ²) ∑_{i=1}^{n} (Xi − µ)² = ∑_{i=1}^{n} ((Xi − µ)/σ)² ∼ χ²_n, since (Xi − µ)/σ ∼ N(0, 1),

and

(1/σ²) ∑_{i=1}^{n} (Xi − µ)² = (1/σ²) ∑_{i=1}^{n} [(Xi − X̄) + (X̄ − µ)]²
  = (1/σ²) { ∑_{i=1}^{n} (Xi − X̄)² + 2 ∑_{i=1}^{n} (Xi − X̄)(X̄ − µ) + ∑_{i=1}^{n} (X̄ − µ)² }
  = (1/σ²) ∑_{i=1}^{n} (Xi − X̄)² + ∑_{i=1}^{n} ((X̄ − µ)/σ)²   (the cross term vanishes since ∑_{i=1}^{n} (Xi − X̄) = 0)
  = (1/σ²) ∑_{i=1}^{n} (Xi − X̄)² + ((X̄ − µ)/(σ/√n))².    (4)

Let W = (1/σ²) ∑_{i=1}^{n} (Xi − µ)², U = (1/σ²) ∑_{i=1}^{n} (Xi − X̄)², and V = ((X̄ − µ)/(σ/√n))²;
then (4) says that W = U + V.
Since U is a function of (X1 − X̄, X2 − X̄, ..., Xn − X̄), and V is a function of X̄, U and V are
independent by the fact we mentioned at the beginning of the proof.
So far, we have shown that W ∼ χ²_n and V ∼ χ²_1. Let MW(t) be the mgf for W, and so on. Since
W = U + V with U and V independent, MW(t) = MU(t) MV(t), so

MU(t) = MW(t) / MV(t) = (1 − 2t)^{−n/2} / (1 − 2t)^{−1/2} = (1 − 2t)^{−(n−1)/2},

which is the mgf of the χ²_{n−1} distribution. Since U = (1/σ²) ∑_{i=1}^{n} (Xi − X̄)² = (n − 1)S²/σ²,
Theorem 4 gives (n − 1)S²/σ² ∼ χ²_{n−1}. □
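A simulation sketch (assuming numpy) consistent with Theorem 21: (n − 1)S²/σ² should have mean n − 1 and variance 2(n − 1), and X̄ and S² should be uncorrelated.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 2.0, 10, 200_000
x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)            # sample variance S^2
u = (n - 1) * s2 / sigma**2           # should be chi-square with n - 1 degrees of freedom
print(u.mean(), u.var())              # close to 9 and 18
print(np.corrcoef(xbar, s2)[0, 1])    # close to 0, consistent with independence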
Definition 22 Let U and V be independent chi-square random variables with m and n degrees of
freedom, respectively. The distribution of

W = (U/m) / (V/n)

is called the F distribution with m and n degrees of freedom and is denoted by F_{m,n}.
Example 23 (Gamma and uniform) Suppose X has a gamma(2, λ) distribution, and that given
X = x, Y has a uniform(0, x) distribution. Find the joint density of X and Y.
By the definition of the gamma distribution,

fX(x) = λ² x e^{−λx} for x > 0, and fX(x) = 0 for x ≤ 0,

and from the uniform(0, x) distribution of Y given X = x,

fY(y | X = x) = 1/x for 0 < y < x, and 0 otherwise.

So by the multiplication rule for densities,

f(x, y) = fX(x) fY(y | X = x) = λ² e^{−λx} for 0 < y < x, and 0 otherwise.
Example 24 Find the marginal density of Y.
Integrating out x in the joint density gives the marginal density of Y: for y > 0,

fY(y) = ∫_0^∞ f(x, y) dx = ∫_y^∞ λ² e^{−λx} dx = λ e^{−λy}.

The density is of course 0 for y ≤ 0. That is to say, Y has an exponential(λ) distribution.
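A simulation sketch of Examples 23–24 (assuming numpy, with an arbitrary λ): draw X from gamma(2, λ), then Y uniformly on (0, X), and check that Y behaves like an exponential(λ) variable.

import numpy as np

rng = np.random.default_rng(6)
lam, n = 2.0, 1_000_000
x = rng.gamma(shape=2.0, scale=1.0 / lam, size=n)  # X ~ gamma(2, lambda)
y = x * rng.uniform(size=n)                        # Y | X = x ~ uniform(0, x)
print(y.mean(), y.var())                           # close to 1/lambda = 0.5 and 1/lambda^2 = 0.25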
4.1.1 Bayes’ Rule
Let q(z | y) denote the conditional probability mass function of Z given Y = y. Then,

p(y | z) = pY(y) q(z | y) / ∑_t pY(t) q(z | t),

whenever the denominator is positive.
The conditional expectation E(Y | Z = z) = ∑_y y p(y | z) is well defined here: since {Y = y, Z = z} ⊆ {Y = y},
we have p(y, z) ≤ pY(y), and hence ∑_y |y| p(y | z) ≤ E(|Y|)/pZ(z). Thus, when pZ(z) > 0, the conditional
expected value of Y is finite whenever the expected value of Y is finite.
Definition 28 Let g(z) = E(Y | Z = z). The random variable g(Z) is written E(Y | Z) and is
called the conditional expectation of Y given Z.
Example 29 As an example we calculate E(Y1 | Z), where Y1 and Z are given in Example 26. We
have

E(Y1 | Z = i) = P(Y1 = 1 | Z = i) = C(n − 1, i − 1) / C(n, i) = i/n,

where C(n, i) denotes the binomial coefficient "n choose i". The first of these equalities holds because
Y1 is an indicator. The second follows from the equation in Example 26, because C(n − 1, i − 1) is just
the number of ways i successes can occur in n Bernoulli trials with the first trial being a success.
Therefore,

E(Y1 | Z) = Z/n.
Exercise 30 Let X1 and X2 be the numbers on two independent fair-die rolls. Let X be the mini-
mum and Y the maximum of X1 and X2 . Calculate: E(Y |X = x) and E(X|Y = y).
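A brute-force enumeration sketch for Exercise 30 (plain Python, no libraries), tabulating E(Y | X = x) and E(X | Y = y) over the 36 equally likely outcomes:

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls (X1, X2)
for x in range(1, 7):
    ys = [max(a, b) for a, b in outcomes if min(a, b) == x]
    print(f"E(Y | X = {x}) = {sum(ys) / len(ys)}")
for y in range(1, 7):
    xs = [min(a, b) for a, b in outcomes if max(a, b) == y]
    print(f"E(X | Y = {y}) = {sum(xs) / len(xs)}")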
Exercise 31 Repeat the last exercise with X1 and X2 two draws without replacement from {1, 2, ..., n}.
Properties of Conditional Expected Values In the context of the previous lecture, the conditional
distribution of a random vector Y given Z = z corresponds to a single probability function Pz on
(Ω, A), defined for A ∈ A by Pz(A) = P(A | Z = z). A key consequence is that

E(E(Y | Z)) = ∑_z pZ(z) [ ∑_y y p(y | z) ] = ∑_{y,z} y p(y | z) pZ(z) = ∑_{y,z} y p(y, z) = E(Y).

The interchange of summation used is valid because the finiteness of E(|Y|) implies that all sums
converge absolutely.
As an illustration, we check E(E(Y | Z)) = E(Y) for E(Y1 | Z) = Z/n given before. In this case,

E(E(Y1 | Z)) = E(Z/n) = np/n = p = E(Y1).
In general, the conditional distribution of Y given Z = z is given by

p(y | z) = p(y, z) / pZ(z),  if pZ(z) > 0.
4.2.1 Bayes’ Rule
p(y | z) = pY(y) q(z | y) / ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} pY(t) q(z | t) dt1 ··· dtn,

where q is the conditional density of Z given Y = y.
Remark 33 If Y and Z are independent, the conditional distributions equal the marginals as in the
discrete case.
Remark 34 If E(|Y |) < ∞, we denote the conditional expectation of Y given Z = z in analogy to
the discrete case as the expected value of a random variable with density p(y | z). More generally, if
E(|r(Y)|) < ∞, the conditional expectation of r(Y) given Z = z can be obtained from
E(r(Y) | Z = z) = ∫_{−∞}^{∞} r(y) p(y | z) dy.
pY(y1, y2) = (1/2) pX((y1 + y2)/2, (y1 − y2)/2)
           = (1/(8π)) exp{ −(1/2) [ (1/4)(y1 + y2)² + (1/16)(y1 − y2)² ] }
           = (1/(8π)) exp[ −(1/32)(5y1² + 5y2² + 6y1y2) ].

This is an example of a bivariate normal density.
Gamma and Beta Distribution A random variable X has a Beta(r, s) distribution if it has density

b_{r,s}(x) = x^{r−1}(1 − x)^{s−1} / B(r, s),  for 0 < x < 1,

where B(r, s) = Γ(r)Γ(s)/Γ(r + s) is the beta function and Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du as in the gamma
distribution.
Example 37 If X1 and X2 are independent random variables with gamma(p, λ) and gamma(q, λ)
distributions, respectively, then Y1 = X1 + X2 and Y2 = X1/(X1 + X2) are independent and have,
respectively, gamma(p + q, λ) and Beta(p, q) distributions.
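A simulation sketch of Example 37 (assuming numpy, with arbitrary p, q, λ):

import numpy as np

rng = np.random.default_rng(7)
p, q, lam, n = 2.0, 3.0, 1.5, 1_000_000
x1 = rng.gamma(shape=p, scale=1.0 / lam, size=n)
x2 = rng.gamma(shape=q, scale=1.0 / lam, size=n)
y1, y2 = x1 + x2, x1 / (x1 + x2)
print(y1.mean(), (p + q) / lam)        # Y1 behaves like gamma(p + q, lambda)
print(y2.mean(), p / (p + q))          # Y2 behaves like Beta(p, q)
print(np.corrcoef(y1, y2)[0, 1])       # close to 0, consistent with independence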
6 Markov Chain
Example 38 (A random walk model) A Markov chain whose state space is given by the integers
i = 0, ±1, ±2, ... is said to be a random walk if, for some number 0 < p < 1,

P_{i,i+1} = p = 1 − P_{i,i−1},  i = 0, ±1, ±2, ....
Chapman-Kolmogorov Equations
We have defined the one-step transition probability Pij. We now define the n-step transition
probabilities P^n_{ij} to be the probability that a process in state i will be in state j after n additional
transitions. That is,

P^n_{ij} = P{X_{n+k} = j | X_k = i},  n ≥ 0, i, j ≥ 0.

The Chapman-Kolmogorov equations state that

P^{n+m}_{ij} = ∑_k P^n_{ik} P^m_{kj}  for all n, m ≥ 0;

in matrix form, the n-step transition matrix is the nth power of the one-step transition matrix.
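The matrix form is easy to exercise numerically; a sketch (assuming numpy) with an arbitrary illustrative 3-state transition matrix:

import numpy as np

P = np.array([[0.5, 0.3, 0.2],         # an illustrative one-step transition matrix
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
P4 = np.linalg.matrix_power(P, 4)       # 4-step transition probabilities
P22 = np.linalg.matrix_power(P, 2) @ np.linalg.matrix_power(P, 2)
print(np.allclose(P4, P22))             # Chapman-Kolmogorov: P^(2+2) = P^(2) P^(2)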
6.2 Continuous case
Suppose we have a continuous-time stochastic process {X(t), t ≥ 0} taking on values in the set
of non-negative integers. In analogy with the definition of a discrete-time Markov chain, we say
that the process {X(t), t ≥ 0} is a continuous-time Markov chain if for all s, t ≥ 0 and non-negative
integers i, j, x(u), 0 ≤ u < s,

P{X(t + s) = j | X(s) = i, X(u) = x(u), 0 ≤ u < s} = P{X(t + s) = j | X(s) = i}.

In other words, a continuous-time Markov chain is a stochastic process having the Markovian
property that the conditional distribution of the future X(t + s) given the present X(s) and the past
X(u), 0 ≤ u < s, depends only on the present and is independent of the past. If, in addition,

P{X(t + s) = j | X(s) = i}

is independent of s, then the continuous-time Markov chain is said to have stationary or homogeneous
transition probabilities.
Example 39 Suppose the service in a particular barber shop consists of two procedures: a customer
upon arrival goes initially to chair 1, where his/her hair will be washed by an assistant; after this is
done the customer moves on to chair 2, where his/her hair will be cut by the stylist. The service times
at the two steps are assumed to be independent random variables that are exponentially distributed
with respective rates µ1 and µ2. Suppose that potential customers arrive in accordance with a
Poisson process having rate λ, and that a potential customer will enter the system only if both chairs
are empty.
This problem can be modeled as a continuous-time Markov chain. Since a potential customer
will enter the shop only if there are no other customers in the shop, there will always be either 0 or 1
customers in the shop. If there is 1 customer in the shop, then we need to know which chair the
customer is in. Therefore, an appropriate state space consists of three states: 0 = shop is empty,
1 = chair 1 is occupied, and 2 = chair 2 is occupied.
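A small numerical sketch of the resulting chain (assuming numpy, and assuming the natural transition rates read off from the description: 0 → 1 at rate λ, 1 → 2 at rate µ1, 2 → 0 at rate µ2; the numerical rates below are illustrative). It builds the generator matrix and solves for the stationary distribution:

import numpy as np

lam, mu1, mu2 = 2.0, 3.0, 4.0                 # illustrative rates
Q = np.array([[-lam,  lam,  0.0],             # state 0: shop empty
              [ 0.0, -mu1,  mu1],             # state 1: customer in chair 1
              [ mu2,  0.0, -mu2]])            # state 2: customer in chair 2
# the stationary distribution pi solves pi Q = 0 with the entries summing to 1
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print(pi)                                     # long-run fraction of time in each state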
7 Delta Method
(The following is taken from Rice [1995].)
Suppose that we know the expectation and the variance of a random variable X, but not the
entire distribution and that we are interested in the mean and variance of Y = g(X) for some fixed
function g. For example, we might be able to measure X and determine its mean and variance, but
really be interested in Y , which is related to X in a known way. We might wish to know V ar(Y ),
at least approximately, in order to assess the accuracy of the indirect measurement process. From
the results given in this section, we cannot in general find E(Y) = µY and Var(Y) = σY² from
E(X) = µX and Var(X) = σX², unless the function g is linear. However, if g is nearly linear in a
range in which X has high probability, it can be approximated by a linear function and approximate
moments of Y can be found.
In proceeding as just described, we follow a tack often taken in applied mathematics: When
confronted with a nonlinear problem that we cannot solve, we linearize. In probability and statistics,
this method is called propagation of error, or the δ method. Linearization is carried out through
a Taylor series expansion of g about µX. To the first order,

Y = g(X) ≈ g(µX) + (X − µX) g′(µX).

We have expressed Y as approximately equal to a linear function of X. Recalling that if U = a + bV,
then E(U) = a + bE(V) and Var(U) = b² Var(V), we find

µY ≈ g(µX),
σY² ≈ σX² [g′(µX)]².
We know that in general E(Y) ≠ g(E(X)), even though the first-order approximation gives µY ≈ g(µX).
In fact, we can carry out the Taylor series expansion to the second order to get an improved
approximation of µY:

Y = g(X) ≈ g(µX) + (X − µX) g′(µX) + (1/2)(X − µX)² g″(µX).

Taking the expectation of the right-hand side, we have, since E(X − µX) = 0,

E(Y) ≈ g(µX) + (1/2) σX² g″(µX).
How good such approximations are depends on how nonlinear g is in a neighborhood of µX and
on the size of σX . From Chebyshev’s inequality we know that X is unlikely to be many standard
deviations away from µX ; if g can be reasonably well approximated in this range by a linear function,
the approximations for the moments will be reasonable as well.
Example 40 The relation of voltage, current, and resistance is V = IR. Suppose that the voltage
is held constant at a value V0 across a medium whose resistance fluctuates randomly as a result, say,
of random fluctuations at the molecular level. The current therefore also varies randomly. Suppose
that it can be determined experimentally to have mean µI and variance σI². We wish to find the
mean and variance of the resistance, R, and since we do not know the distribution of I, we must
resort to an approximation. We have
R = g(I) = V0/I,
g′(µI) = −V0/µI²,
g″(µI) = 2V0/µI³.

Thus,

µR ≈ V0/µI + (V0/µI³) σI²,
σR² ≈ (V0²/µI⁴) σI².
We see that the variability of R depends on both the mean level of I and the variance of I. This makes
sense, since if I is quite small, small variations in I will result in large variations in R = V0 /I,
whereas if I is large, small variations will not affect R as much. The second-order correction factor
for µR also depends on µI and is large if µI is small. In fact, when I is near zero, the function
g(I) = V0 /I is quite nonlinear, and the linearization is not a good approximation.
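A sketch comparing the delta-method approximations for R = V0/I with Monte Carlo estimates (assuming numpy, and assuming for illustration only that I is normal with mean µI and standard deviation σI, since the text does not specify the distribution of I):

import numpy as np

rng = np.random.default_rng(8)
v0, mu_i, sigma_i = 10.0, 5.0, 0.2            # illustrative values; sigma_i small relative to mu_i
i_samples = rng.normal(mu_i, sigma_i, size=1_000_000)
r = v0 / i_samples                            # Monte Carlo draws of R = V0 / I

mu_r_approx = v0 / mu_i + (v0 / mu_i**3) * sigma_i**2   # second-order delta-method mean
var_r_approx = (v0**2 / mu_i**4) * sigma_i**2           # first-order delta-method variance
print(r.mean(), mu_r_approx)                  # both near 2.0032
print(r.var(), var_r_approx)                  # both near 0.0064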
References
[1] Peter J. Bickel and Kjell A. Doksum (2001) Mathematical Statistics: Basic Ideas and Selected
Topics, Vol. I, 2nd Edition. Prentice Hall.
[2] Geoffrey Grimmett and David Stirzaker (2002) Probability and Random Processes, 3rd Edition,
Oxford University Press.
[3] Jim Pitman (1993) Probability, Springer-Verlag New York, Inc.
[4] John A. Rice (1995) Mathematical Statistics and Data Analysis, 2nd Edition, Duxbury Press.
[5] Sheldon M. Ross (2003) Introduction to Probability Models, Academic Press.