Lecture 03.
Discrete Random Variables
(Chapter 4)
Ping Yu
HKU Business School
The University of Hong Kong
Plan of This Lecture
Random Variables
Probability Distributions for Discrete Random Variables
Properties of Discrete Random Variables
Binomial Distribution
Poisson Distribution
Hypergeometric Distribution
Jointly Distributed Discrete Random Variables
Random Variables
Discrete Random Variables
A random variable (r.v.) is a variable that takes on numerical values realized by the
outcomes in the sample space generated by a random experiment.
- Mathematically, a random variable is a function from S to R.
- In this and the next lecture, we use capital letters, such as X, to denote a random
variable, and the corresponding lowercase letter, x, to denote a possible value.
A discrete random variable is a random variable that can take on no more than a
countable number of values.
- e.g., the number of customers visiting a store in one day, the number of claims on
a medical insurance policy in a particular year, etc.
- "Countable" includes "finite" and "countably infinite".
Continuous Random Variables
A continuous random variable is a random variable that can take any value in an
interval (i.e., for any two values, there is some third value that lies between them).
- e.g., the yearly income for a family, the highest temperature in one day, etc.
- Probability can only be assigned to a range of values, since the probability of
any single value is always zero.
Recall the distinction between discrete numerical variables and continuous
numerical variables in Lecture 1.
Modeling a r.v. as continuous is usually for convenience as the differences
between adjacent discrete values (e.g., $35,276.21 and $35,276.22) are of no
importance.
On the other hand, we model a r.v. as discrete when probability statements about
the individual possible outcomes have worthwhile meaning.
Probability Distributions for Discrete Random Variables
Probability Distribution Function
The probability distribution (function), p (x ), of a discrete r.v. X represents the
probability that X takes the value x, as a function of x, i.e.,
p (x ) = P (X = x ) for all values of x.
- Sometimes, the probability distribution of a discrete r.v. is called the probability
mass function (pmf).
- Note that X = x must be an event; otherwise, P (X = x ) is not well defined.
p (x ) must satisfy the following properties (implied by the probability postulates in
Lecture 2):
1 0 ≤ p(x) ≤ 1 for any value x,
2 ∑_{x∈S} p(x) = 1, where S is called the support of X, i.e., the set of all x values
such that p(x) > 0.
Notation: I will use p(x) and p (rather than P(x) and P as in the textbook) for the
pmf and a probability of interest, to avoid confusion with the probability symbol P.
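As a quick check (not from the textbook), a pmf can be stored as a plain Python dictionary and the two properties verified directly; the probabilities below are hypothetical:

```python
# Hypothetical pmf of a discrete r.v. X: value -> probability.
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# Property 1: 0 <= p(x) <= 1 for every x in the support.
assert all(0 <= p <= 1 for p in pmf.values())

# Property 2: the probabilities sum to 1 over the support S.
assert abs(sum(pmf.values()) - 1) < 1e-12

print(pmf[2])  # p(2) = P(X = 2) = 0.4
```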
Example 4.1: Number of Product Sales
Cumulative Distribution Function
The cumulative distribution function (cdf), F(x_0), of a r.v. X represents the
probability that X does not exceed the value x_0, as a function of x_0, i.e.,
F(x_0) = P(X ≤ x_0).
- The definition of the cdf applies to both discrete and continuous r.v.'s, and x_0 ∈ ℝ.
- F(x_0) for a discrete r.v. is a step function with jumps only at support points in S.
[figure here]
- p(·) and F(·) are probabilistic counterparts of the histogram and ogive in Lecture 1.
Relationship between pmf and cdf for discrete r.v.'s:
F(x_0) = ∑_{x ≤ x_0} p(x).
From the definition of the cdf, we have (i) 0 ≤ F(x_0) ≤ 1 for any x_0; (ii) if x_0 < x_1,
then F(x_0) ≤ F(x_1), i.e., F(·) is a (weakly) increasing function.
From the figure in the next slide, we can also see that (iii) F(x_0) is right-continuous,
i.e., lim_{x↓x_0} F(x) = F(x_0); (iv) lim_{x_0→−∞} F(x_0) = 0 and lim_{x_0→∞} F(x_0) = 1.
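A minimal Python sketch of this pmf-to-cdf relationship, reusing the hypothetical pmf above; note that F is flat between support points:

```python
# Hypothetical pmf; the cdf sums p(x) over all support points x <= x0.
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

def cdf(x0):
    """F(x0) = P(X <= x0): a right-continuous step function."""
    return sum(p for x, p in pmf.items() if x <= x0)

print(cdf(-1))   # 0: below the support
print(cdf(1))    # ~0.3 = p(0) + p(1), up to floating-point rounding
print(cdf(1.5))  # same as cdf(1): F is flat between support points
print(cdf(3))    # ~1.0
```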
Example Continued
[figure here: the cdf of the example, a step function plotted over x from −1 to 4 with y-axis ticks at 0, 0.1, 0.3, and 0.7]
Properties of Discrete Random Variables
Mean
The pmf contains all information about the probability properties of a discrete r.v.,
but it is desirable to have some summary measures of the pmf’s characteristics.
The mean (or expected value, or expectation), E[X], of a discrete r.v. X is defined
as
E[X] = µ = ∑_{x∈S} x p(x).
- The mean of X is the same as the population mean in Lecture 1, µ = (∑_{i=1}^N x_i)/N,
but we use the probability language here: think of E[X] in terms of relative
frequencies,
(∑_{i=1}^N x_i)/N = ∑_{x∈S} x · (N_x/N),
where N_x is the number of population members taking the value x, so we are
weighting each possible value x by its probability.
- In other words, the mean of X is a weighted average of all possible values of X.
- For example, if we roll a die once, the expected outcome is
E[X] = ∑_{i=1}^6 i · (1/6) = 3.5.
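The die calculation can be verified in one line of Python:

```python
# E[X] for one roll of a fair die: each face i = 1..6 has probability 1/6.
mean = sum(i * (1/6) for i in range(1, 7))
print(mean)  # 3.5
```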
Variance
The variance, Var(X), of a discrete r.v. X is defined as
Var(X) = σ² = E[(X − µ)²] = ∑_{x∈S} (x − µ)² p(x).
- This definition of Var (X ) is the same as the population variance in Lecture 1.
- It is not hard to see that
σ² = ∑_{x∈S} (x − µ)² p(x) = ∑_{x∈S} x² p(x) − 2µ ∑_{x∈S} x p(x) + µ² ∑_{x∈S} p(x)
   = E[X²] − 2µE[X] + µ² = E[X²] − 2µ² + µ²
   = E[X²] − µ²,
i.e., the second moment minus the first moment squared,¹ where in the third equality,
p(x) is the probability of X² = x²,² and ∑_{x∈S} p(x) = 1.
The standard deviation, σ = √σ², is the same as the population standard
deviation in Lecture 1.
¹ σ² is also called the second central moment.
² What will happen if X can take both 1 and −1? ∑_{x²=1} x² P(X² = x²) = ∑_{x²=1} x² (p(1) + p(−1)) = 1² p(1) + (−1)² p(−1).
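A short Python sketch evaluating the definition E[(X − µ)²] and the shortcut E[X²] − µ² side by side for the fair die; the two agree:

```python
# Fair-die pmf: p(i) = 1/6 for i = 1..6.
pmf = {i: 1/6 for i in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())                   # E[X] = 3.5
var_def = sum((x - mu)**2 * p for x, p in pmf.items())    # E[(X - mu)^2]
var_alt = sum(x**2 * p for x, p in pmf.items()) - mu**2   # E[X^2] - mu^2

print(mu, var_def, var_alt)  # 3.5 and twice ~2.9167 = 35/12
```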
Mean of Functions of a R.V.
For a function of X, g(X), its mean, E[g(X)], is defined as
E[g(X)] = ∑_{x∈S} g(x) p(x).
- e.g., X is the time to complete a contract, and g (X ) is the cost when the
completion time is X ; we want to know the expected cost.
E[g(X)] ≠ g(E[X]) in general; e.g., if g(X) = X², then
E[g(X)] − g(E[X]) = E[X²] − µ² = σ² > 0.
- However, when g (X ) is linear in X , E [g (X )] = g (E [X ]).
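A Python sketch of the contract-cost example just mentioned, with a hypothetical pmf of completion times and a hypothetical nonlinear cost function g; it shows E[g(X)] ≠ g(E[X]):

```python
# Hypothetical completion-time pmf (weeks) and a nonlinear cost function.
pmf = {1: 0.3, 2: 0.5, 3: 0.2}

def g(x):
    return 1000 + 100 * x**2   # hypothetical cost of finishing in x weeks

e_g = sum(g(x) * p for x, p in pmf.items())  # E[g(X)] = 1410.0
g_e = g(sum(x * p for x, p in pmf.items()))  # g(E[X]) ~ 1361.0
print(e_g, g_e)  # E[g(X)] != g(E[X]) since g is nonlinear
```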
Mean and Variance of Linear Functions
For Y = a + bX with a and b being fixed constants,
µ_Y := E[Y] = E[a + bX] = a + bE[X] =: a + bµ_X,
σ²_Y := Var(Y) = Var(a + bX) = b²Var(X) =: b²σ²_X,
and
σ_Y = √Var(Y) = |b|σ_X.
- The proof follows similar steps to those in the last slide. [Exercise]
- The constant a will not contribute to the variance of Y .
Some Special Linear Functions:
- If b = 0, i.e., Y = a, then E [a] = a and Var (a) = 0.
- If a = 0, i.e., Y = bX , then E [bX ] = bE [X ] and Var (bX ) = b2 Var (X ).
- If a = −µ_X/σ_X and b = 1/σ_X, i.e., Y = (X − µ_X)/σ_X is the z-score of X, then
E[(X − µ_X)/σ_X] = µ_X/σ_X − µ_X/σ_X = 0
and
Var((X − µ_X)/σ_X) = Var(X)/σ²_X = 1.
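A Python sketch verifying the formulas for Y = a + bX and the z-score on the fair-die pmf; a and b are arbitrary constants chosen for illustration:

```python
pmf = {i: 1/6 for i in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())
var = sum((x - mu)**2 * p for x, p in pmf.items())
sd = var ** 0.5

a, b = 2.0, -3.0                       # arbitrary constants for Y = a + bX
mu_y = sum((a + b*x) * p for x, p in pmf.items())
var_y = sum((a + b*x - mu_y)**2 * p for x, p in pmf.items())
print(mu_y, a + b*mu)                  # both -8.5
print(var_y, b**2 * var)               # both 26.25 (up to rounding)

# z-score: mean 0, variance 1.
mu_z = sum(((x - mu)/sd) * p for x, p in pmf.items())
var_z = sum(((x - mu)/sd - mu_z)**2 * p for x, p in pmf.items())
print(mu_z, var_z)                     # ~0 and ~1 (up to rounding)
```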
Binomial Distribution
Bernoulli Distribution
The Bernoulli r.v. is a r.v. taking only two values, 0 and 1, labeled as "failure" and
"success". [figure here]
If the probability of success is p(1) = p, then the probability of failure is
p(0) = 1 − p(1) = 1 − p. This distribution is known as the Bernoulli distribution,
and we denote a r.v. X with this distribution as X ∼ Bernoulli(p).
The mean of a Bernoulli(p) r.v. X is
µ_X = E[X] = 1 · p + 0 · (1 − p) = p,
and the variance is
σ²_X = Var(X) = (1 − p)² · p + (0 − p)² · (1 − p)
= p(1 − p).
- When p = 0.5, σ²_X achieves its maximum; when p = 0 or 1, σ²_X = 0. [why?]
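A small Python sketch of the mean and variance formulas across several values of p, showing the variance p(1 − p) peaking at p = 0.5 and vanishing at p = 0 and 1:

```python
# Var(X) = p(1 - p) for X ~ Bernoulli(p): zero at p = 0 and 1, maximal at 0.5.
for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    mean = 1*p + 0*(1 - p)
    var = (1 - p)**2 * p + (0 - p)**2 * (1 - p)
    print(p, mean, var)   # var = p*(1-p): 0, 0.1875, 0.25, 0.1875, 0
```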
Jacob Bernoulli (1655-1705), Swiss
Jacob Bernoulli (1655-1705) was one of the many prominent mathematicians in
the Bernoulli family.
Binomial Distribution
The binomial r.v. X is the number of successes in n independent trials of a
Bernoulli(p) r.v., denoted as X ∼ Binomial(n, p).
Denote by X_i the outcome of the ith trial; then the binomial r.v. is X = ∑_{i=1}^n X_i.
After some thinking, we can figure out that the number of sequences with x
successes in n trials is C_x^n, and the probability of any sequence with x successes
is p^x (1 − p)^{n−x} by the multiplication rule.
By the addition rule, the binomial distribution is
p(x|n, p) = C_x^n p^x (1 − p)^{n−x}, x = 0, 1, …, n.
From the discussion on multivariate r.v.'s below, we can show
µ_X = E[X] = ∑_{i=1}^n E[X_i] = np,   (*)
and
σ²_X = Var(X) = ∑_{i=1}^n Var(X_i) = np(1 − p).   (**)
- (*) holds even if the X_i's are dependent, while (**) depends on the independence of
the X_i's; see the slides on jointly distributed r.v.'s.
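A Python sketch of the binomial pmf via math.comb, checking the mean np and variance np(1 − p) against direct summation over the pmf; n and p are arbitrary illustrative values:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p): C(n, x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 20, 0.3   # arbitrary illustrative values
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var = sum(x**2 * binom_pmf(x, n, p) for x in range(n + 1)) - mean**2
print(mean, n * p)           # both 6.0 (up to rounding)
print(var, n * p * (1 - p))  # both 4.2 (up to rounding)
```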
[figure here: pmfs of Binomial(20, 0.1), Binomial(20, 0.5), Binomial(20, 0.7), Binomial(40, 0.1), Binomial(40, 0.5), and Binomial(40, 0.7)]
Figure: Binomial Distributions with Different n and p
Poisson Distribution
Poisson Distribution
The Poisson distribution was proposed by Siméon Poisson in 1837. [figure here]
Assume that an interval is divided into a very large number of equal subintervals,
so that the probability of an event occurring in any subinterval is very small
(e.g., 0.05). The Poisson distribution models the number of events occurring in
that interval, assuming
1 The probability of the occurrence of an event is constant for all subintervals.
2 There can be no more than one occurrence in each subinterval.
3 Occurrences are independent.
From these assumptions, we can see the Poisson distribution can be used to
model, e.g., the number of failures in a large computer system during a given day,
the number of ships arriving at a dock during a 6-hour loading period, the number
of defective products in large production runs, etc.
The Poisson distribution is particularly useful in waiting-line, or queuing, problems,
e.g., the probability of various numbers of customers waiting for a phone line or
waiting to check out of a large retail store.
- For a store manager, how to balance long lines (too few checkout lines, losing
customers) against idle customer-service associates (too many lines, resulting in
waste)?
Siméon D. Poisson (1781-1840), French
Poisson Distribution (continued)
Intuitively, the Poisson r.v. is the limit of the binomial r.v. as p → 0 and n → ∞. If
np → λ, which specifies the average number of occurrences (successes) for a
particular time (and/or space), then the binomial distribution converges to the
Poisson distribution:
p(x|λ) = e^{−λ} λ^x / x!, x = 0, 1, 2, …,
where e ≈ 2.71828 is the base of natural logarithms, called Euler's number.
[proof not required]
- We denote a r.v. X with the above Poisson distribution as X ∼ Poisson(λ).
- When n is large and np is of only moderate size (preferably np ≤ 7), the binomial
distribution can be approximated by Poisson(np). [figure here]
µ_X = E[X] = λ, and σ²_X = Var(X) = λ.
- np → λ, and np(1 − p) = np − np·p → λ − λ·0 = λ.
The sum of independent Poisson r.v.'s is also a Poisson r.v.; e.g., the sum of K
independent Poisson(λ) r.v.'s is a Poisson(Kλ) r.v.
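A Python sketch of the limit: Binomial(n, λ/n) probabilities approach the Poisson(λ) probability as n grows; λ and x are arbitrary illustrative values:

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam): e^(-lam) lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

lam, x = 2.0, 3   # arbitrary illustrative values
for n in [10, 100, 1000]:
    print(n, binom_pmf(x, n, lam / n))  # ~0.2013, ~0.1823, ~0.1806
print(poisson_pmf(x, lam))              # ~0.1804: the limiting value
```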
[figure here: pmfs of Binomial(100, 2%) and Poisson(2)]
Figure: Poisson Approximation
For an example where the approximation is not as good, see Assignment II.8(iii).
Hypergeometric Distribution
Hypergeometric Distribution
If the binomial distribution can be viewed as arising from random sampling with
replacement from a population of size N, S of which are successes (so S/N = p),
then the hypergeometric distribution models the number of successes under
random sampling without replacement.
- These two random sampling schemes will be discussed further in Lecture 5.
The hypergeometric distribution is
p(x|n, N, S) = C_x^S C_{n−x}^{N−S} / C_n^N, x = max(0, n − (N − S)), …, min(n, S),
where n is the size of the random sample, and x is the number of successes.
- A r.v. with this distribution is denoted as X ∼ Hypergeometric(n, N, S).
The binomial distribution assumes the items are drawn independently, with the
probability of selecting an item being constant.
This assumption can be met in practice if a small sample is drawn (without
replacement) from a large population (e.g., N > 10,000 and n/N < 1%). [figure
here]
When we draw from a small population, the probability of selecting an item is
changing with each selection because the number of remaining items is changing.
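A Python sketch of the hypergeometric pmf via math.comb, compared with the binomial pmf that assumes sampling with replacement; as N grows with p = S/N fixed, the two get close:

```python
from math import comb

def hypergeom_pmf(x, n, N, S):
    """P(X = x) successes when drawing n items without replacement
    from a population of N items, S of which are successes."""
    return comb(S, x) * comb(N - S, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, x = 20, 4   # arbitrary illustrative values; p = S/N = 0.2 throughout
for N, S in [(100, 20), (1000, 200), (10000, 2000)]:
    print(N, hypergeom_pmf(x, n, N, S), binom_pmf(x, n, S / N))
# The hypergeometric values approach the binomial value as N grows.
```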
[figure here: pmfs of Binomial(20, 0.2), Hypergeometric(20, 100, 20), and Hypergeometric(20, 1000, 200)]
Figure: Comparison of Binomial and Hypergeometric Distributions
µ_X = E[X] = np, and σ²_X = Var(X) = np(1 − p) · (N − n)/(N − 1) ≈ np(1 − p),³
where p = S/N. [proof not required]
³ When n/N is small, (N − n)/(N − 1) is close to 1, matching the variance of the binomial r.v.
Jointly Distributed Discrete Random Variables
Bivariate Discrete R.V.’s: Joint and Marginal Probability Distributions
We can use a bivariate probability distribution to model the relationship between two
univariate r.v.'s.
For two discrete r.v.'s X and Y, their joint probability distribution expresses the
probability that simultaneously X takes the specific value x and Y takes the value
y, as a function of x and y:
p(x, y) = P(X = x ∩ Y = y), x ∈ S_X and y ∈ S_Y.
- p(x, y) is a straightforward extension of joint probabilities in Lecture 2, where
X = x and Y = y are two events with x and y indexing them.
- From the probability postulates in Lecture 2, 0 ≤ p(x, y) ≤ 1, and
∑_{x∈S_X} ∑_{y∈S_Y} p(x, y) = 1.
The marginal probability distribution of X is
p(x) = ∑_{y∈S_Y} p(x, y),
and the marginal probability distribution of Y is
p(y) = ∑_{x∈S_X} p(x, y).
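A Python sketch storing a hypothetical joint pmf as a dictionary keyed by (x, y) and summing out one variable to obtain each marginal:

```python
# Hypothetical joint pmf of (X, Y): (x, y) -> P(X = x, Y = y).
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}

# Marginal of X: p(x) = sum over y of p(x, y); similarly for Y.
p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

print(p_x)  # {0: ~0.3, 1: ~0.7}
print(p_y)  # {0: ~0.4, 1: ~0.6}
```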
Conditional Probability Distribution and Independence of Bivariate R.V.’s
These two concepts are parallel to conditional probabilities and independent
events in Lecture 2.
The conditional probability distribution of Y, given that X takes the value x,
expresses the probability that Y takes the value y, as a function of y, when the
value x is fixed for X:
p(y|x) = p(x, y)/p(x);
similarly, the conditional probability distribution of X, given Y = y, is
p(x|y) = p(x, y)/p(y).
- One way of thinking of conditioning is filtering a data set based on the value of X .
Two r.v.'s X and Y are independent iff
p(x, y) = p(x) p(y)
for all x ∈ S_X and y ∈ S_Y, i.e., independence of r.v.'s can be understood as a set
of independencies of events. E.g., "height" and "musical talent" are independent.
- Generally, k r.v.'s are independent if p(x_1, …, x_k) = p(x_1) p(x_2) ⋯ p(x_k).
- X and Y are independent iff p(y|x) = p(y) or p(x|y) = p(x) (the two conditions are symmetric).
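Continuing the hypothetical joint pmf above, a Python sketch of the conditional pmf and a cell-by-cell independence check:

```python
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}
p_x = {0: 0.30, 1: 0.70}
p_y = {0: 0.40, 1: 0.60}

# Conditional pmf of Y given X = 0: p(y|x) = p(x, y) / p(x).
p_y_given_0 = {y: joint[(0, y)] / p_x[0] for y in (0, 1)}
print(p_y_given_0)                      # {0: ~0.333, 1: ~0.667}

# Independence requires p(x, y) = p(x) p(y) for ALL (x, y).
indep = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
            for x in (0, 1) for y in (0, 1))
print(indep)  # False: e.g., p(0,0) = 0.10 != 0.30 * 0.40 = 0.12
```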
Conditional Mean and Variance
The conditional mean of Y, given that X takes the value x, is given by
µ_{Y|X=x} = E[Y|X = x] = ∑_{y∈S_Y} y p(y|x).
- For any constants a and b, E[a + bY | X = x] = a + bE[Y|X = x].
The conditional variance of Y, given that X takes the value x, is given by
σ²_{Y|X=x} = Var(Y|X = x) = ∑_{y∈S_Y} (y − µ_{Y|X=x})² p(y|x).
- For any constants a and b, Var(a + bY | X = x) = b²Var(Y|X = x).
Notation: The notations used in the textbook, µ_{Y|X} and σ²_{Y|X}, are not clear.
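Continuing the same hypothetical joint pmf, a Python sketch of E[Y|X = x] and Var(Y|X = x):

```python
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}
x = 0
p_x = sum(p for (xx, y), p in joint.items() if xx == x)   # p(0) = 0.3
cond = {y: joint[(x, y)] / p_x for y in (0, 1)}           # p(y|0)
mu = sum(y * p for y, p in cond.items())                  # E[Y|X=0]
var = sum((y - mu)**2 * p for y, p in cond.items())       # Var(Y|X=0)
print(mu, var)  # ~0.667 and ~0.222
```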
Mean and Variance of (Linear) Functions
For a function of (X, Y), g(X, Y), its mean, E[g(X, Y)], is defined as
E[g(X, Y)] = ∑_{x∈S_X} ∑_{y∈S_Y} g(x, y) p(x, y).
For a linear function of (X, Y), W = aX + bY,
µ_W := E[W] = aµ_X + bµ_Y, [verified in the next slide]
σ²_W := Var(W) = a²σ²_X + b²σ²_Y + 2abσ_XY [see the covariance slide below for σ_XY].
- e.g., W is the total revenue of two products with (X, Y) being the sales and (a, b)
the prices.
- If a = b = 1, then E[X + Y] = E[X] + E[Y], i.e., the mean of a sum is the sum of
the means.
- If a = 1 and b = −1, then E[X − Y] = E[X] − E[Y], i.e., the mean of a difference is
the difference of the means.
- If a = b = 1 and σ_XY = 0, then Var(X + Y) = Var(X) + Var(Y), i.e., the
variance of a sum is the sum of the variances.
- If a = 1, b = −1 and σ_XY = 0, then Var(X − Y) = Var(X) + Var(Y), i.e., the
variance of a difference is the sum of the variances.
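A Python sketch verifying both formulas on the hypothetical joint pmf above, with σ_XY computed directly from its definition (given on the covariance slide below); a and b are hypothetical prices:

```python
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}
a, b = 2.0, 3.0   # hypothetical prices for W = aX + bY

mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())
var_x = sum((x - mu_x)**2 * p for (x, y), p in joint.items())
var_y = sum((y - mu_y)**2 * p for (x, y), p in joint.items())
cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())

mu_w = sum((a*x + b*y) * p for (x, y), p in joint.items())
var_w = sum((a*x + b*y - mu_w)**2 * p for (x, y), p in joint.items())

print(mu_w, a*mu_x + b*mu_y)                        # both 3.2
print(var_w, a**2*var_x + b**2*var_y + 2*a*b*cov)   # both ~2.76
```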
(*) Verification and Extensions
µ_W:
µ_W = ∑_{x∈S_X} ∑_{y∈S_Y} (ax + by) p(x, y)
    = a ∑_{x∈S_X} x ∑_{y∈S_Y} p(x, y) + b ∑_{y∈S_Y} y ∑_{x∈S_X} p(x, y)
    = a ∑_{x∈S_X} x p(x) + b ∑_{y∈S_Y} y p(y)
    = aµ_X + bµ_Y.
σ²_W can be derived based on this result. [Exercise]
Extension I: If W = ∑_{i=1}^K a_i X_i, then
µ_W = E[W] = ∑_{i=1}^K a_i E[X_i] =: ∑_{i=1}^K a_i µ_i,
(*) Verification and Extensions (continued)
and
σ²_W = Var(W) = ∑_{i=1}^K a_i² Var(X_i) + 2 ∑_{i=1}^{K−1} ∑_{j>i} a_i a_j Cov(X_i, X_j)
     =: ∑_{i=1}^K a_i² σ_i² + 2 ∑_{i=1}^{K−1} ∑_{j>i} a_i a_j σ_ij.
- If a_i = 1 for all i, then we have
E[∑_{i=1}^K X_i] = ∑_{i=1}^K µ_i and Var(∑_{i=1}^K X_i) = ∑_{i=1}^K σ_i² + 2 ∑_{i=1}^{K−1} ∑_{j>i} σ_ij,
where Var(∑_{i=1}^K X_i) reduces to ∑_{i=1}^K σ_i² if σ_ij = 0 for all i ≠ j.
Extension II: For W = aX + bY and a r.v. Z different from (X, Y),
E[W|Z = z] = aµ_{X|Z=z} + bµ_{Y|Z=z},
Var(W|Z = z) = a²σ²_{X|Z=z} + b²σ²_{Y|Z=z} + 2abσ_{XY|Z=z},
where Var(W|Z = z) reduces to a²σ²_{X|Z=z} + b²σ²_{Y|Z=z} if σ_{XY|Z=z} = 0.
Covariance and Correlation
These two concepts are the same as those in Lecture 1 but in the probability
language.
The covariance between X and Y is
Cov(X, Y) = σ_XY = E[(X − µ_X)(Y − µ_Y)] = ∑_{x∈S_X} ∑_{y∈S_Y} (x − µ_X)(y − µ_Y) p(x, y).
- It is not hard to show that Cov(X, Y) = E[XY] − µ_X µ_Y, which reduces to
Var(X) = E[X²] − µ²_X when X = Y.
The correlation between X and Y is
Corr(X, Y) = ρ_XY = Cov(X, Y) / (σ_X σ_Y).
Recall that σ_XY is not unit-free and so is unbounded, while ρ_XY ∈ [−1, 1] is more
useful.
Recall that σ_XY and ρ_XY have the same sign: if they are positive, X and Y are
called positively dependent; if they are negative, X and Y are called negatively
dependent; if they are zero, there is no linear relationship between X and Y.
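Using the same hypothetical joint pmf once more, a Python sketch of σ_XY and ρ_XY:

```python
joint = {(0, 0): 0.10, (0, 1): 0.20,
         (1, 0): 0.30, (1, 1): 0.40}
mu_x = sum(x * p for (x, y), p in joint.items())   # 0.7
mu_y = sum(y * p for (x, y), p in joint.items())   # 0.6
cov = sum(x * y * p for (x, y), p in joint.items()) - mu_x * mu_y  # E[XY] - mu_x*mu_y
sd_x = sum((x - mu_x)**2 * p for (x, y), p in joint.items()) ** 0.5
sd_y = sum((y - mu_y)**2 * p for (x, y), p in joint.items()) ** 0.5
print(cov, cov / (sd_x * sd_y))  # ~-0.02 and ~-0.089: weak negative dependence
```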
Covariance and Independence
If X and Y are independent, then Cov (X , Y ) = Corr (X , Y ) = 0. [Exercise]
- The converse is not true; recall the figure in Lecture 1.
Here is a concrete example: if the distribution of X is
p(−1) = 1/4, p(0) = 1/2 and p(1) = 1/4,
then Cov(X, Y) = 0 with Y = X². Why?
Because X can determine Y , X and Y are not independent.
The distribution of X implies E [X ] = 0.
The distribution of Y is p (0) = p (1) = 1/2, i.e., Y is a Bernoulli r.v., which implies
E [Y ] = 1/2.
The joint distribution of (X , Y ) is
p(−1, 1) = 1/4, p(0, 0) = 1/2, p(1, 1) = 1/4,
which implies E[XY] = 0, so Cov(X, Y) = E[XY] − E[X]E[Y] = 0.
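A Python sketch verifying the claims in this example:

```python
# Joint pmf of (X, Y) with Y = X^2: only (-1,1), (0,0), (1,1) have mass.
joint = {(-1, 1): 0.25, (0, 0): 0.50, (1, 1): 0.25}

e_x = sum(x * p for (x, y), p in joint.items())       # E[X] = 0.0
e_y = sum(y * p for (x, y), p in joint.items())       # E[Y] = 0.5
e_xy = sum(x * y * p for (x, y), p in joint.items())  # E[XY] = 0.0
print(e_xy - e_x * e_y)  # Cov(X, Y) = 0, yet Y = X^2 is determined by X
```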
Portfolio analysis in the textbook (Pages 190-192, 236-240) will be discussed in
the next tutorial class.