Probability A Lively Introduction
Probability has applications in many areas of modern science, not to mention in our
daily life, and its importance as a mathematical discipline cannot be overrated. This
engaging book, with its easy-to-follow writing style, provides a comprehensive, yet
concise, introduction to the subject. It covers all of the standard material for
undergraduate and first-year-graduate-level courses, as well as many topics that are
usually not found in standard texts – such as Bayesian inference, Markov chain Monte
Carlo simulation, and Chernoff bounds.
The student-friendly text has the following additional features:
HENK TIJMS
Vrije Universiteit, Amsterdam
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi – 110002, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781108418744
DOI: 10.1017/9781108291361
© Henk Tijms 2018
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2018
Printed in the United Kingdom by Clays, St Ives plc
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Tijms, H.
Title: Probability: a lively introduction / Henk Tijms, Vrije Universiteit, Amsterdam.
Description: Cambridge : Cambridge University Press, 2017. | Includes index.
Identifiers: LCCN 2017014908 | ISBN 9781108418744 (Hardback : alk. paper) |
ISBN 9781108407847 (pbk. : alk. paper)
Subjects: LCSH: Probabilities–Textbooks.
Classification: LCC QA273.2 .T55 2017 | DDC 519.2–dc23 LC record
available at https://lccn.loc.gov/2017014908
ISBN 978-1-108-41874-4 Hardback
ISBN 978-1-108-40784-7 Paperback
Additional resources for this publication at www.cambridge.org/TijmsProbability
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Preface
Why do so many students find probability difficult? Even the most mathe-
matically competent often find probability a subject that is difficult to use
and understand. The difficulty stems from the fact that most problems in
probability, even ones that are easy to understand, cannot be solved by using
cookbook recipes as is sometimes the case in other areas of mathematics.
Instead, each new problem often requires imagination and creative thinking.
That is why probability is difficult, but also why it is fun and engaging.
Probability is a fascinating subject and I hope, in this book, to share my
enthusiasm for the subject.
Probability is best taught to beginning students by using motivating exam-
ples and problems, and a solution approach that gives the students confidence
to solve problems on their own. The examples and problems should be relevant,
clear, and instructive. This book is not written in a theorem–proof style, but
proofs flow with the subsequent text and no mathematics is introduced without
specific examples and applications to motivate the theory. It distinguishes itself
from other introductory probability texts by its emphasis on why probability
is so relevant and how to apply it. Every attempt has been made to create a
student-friendly book and to help students understand what they are learning,
not just to learn it.
This textbook is designed for a first course in probability at an undergraduate
level or first-year graduate level. It covers all of the standard material for
such courses, but it also contains many topics that are usually not found in
introductory probability books – such as stochastic simulation. The emphasis
throughout the book is on probability, but attention is also given to statistics. In
particular, Bayesian inference is discussed at length and illustrated with several
illuminating examples. The book can be used in a variety of disciplines, rang-
ing from applied mathematics and statistics to computer science, operations
research, and engineering, and is suitable not only for introductory courses, but
also for self-study. The prerequisite knowledge is a basic course in calculus.
Good problems are an essential part of each textbook. The field of probabil-
ity is well known for being a subject that can best be acquired by the process
of learning-by-doing. Much care has been taken to present problems that will
enhance the student’s understanding of probability. He or she will be asked to
think about ideas rather than simply plug numbers into formulas. Working
through the problems, students may often find that probability problems are harder than
they first appear. This book has more than 750 carefully designed problems,
both “easy ones” and challenging ones. Problems are grouped according to the
section they are based on, which should be convenient for both students and
instructors. An important feature of this textbook is that it contains detailed
solutions to the odd-numbered problems, which helps stimulate active learning
and contributes to students’ confidence in their own problem-solving skills. It
is my belief that there is an enormous increase in content when worked-out
solutions to exercises are included. Solutions for the even-numbered problems
and further detailed solutions to the odd-numbered problems for instructors
can be found on the book’s webpage (www.cambridge.org/TijmsProbability).
Another added feature is that the student will find many tips on problem-
solving strategies.
For centuries, mankind lived with the idea that uncertainty was the domain of
the gods and fell beyond the reach of human calculation. Common gambling
led to the first steps toward understanding probability theory, and the colorful
Italian mathematician and physician Gerolamo Cardano (1501–1575) was the
first to attempt a systematic study of the calculus of probabilities. As an ardent
gambler himself, Cardano wrote a handbook for fellow gamblers entitled Liber
de Ludo Aleae (The Book of Games of Chance) about probabilities in games
of chance like dice. He originated and introduced the concept of the set of
outcomes of an experiment, and for cases in which all outcomes are equally
probable, he defined the probability of any one event occurring as the ratio of
the number of favorable outcomes to the total number of possible outcomes.
This may seem obvious today, but in Cardano’s day such an approach marked
an enormous leap forward in the development of probability theory.
Nevertheless, many historians mark 1654 as the birth of the study of
probability, since in that year questions posed by gamblers led to an exchange
of letters between the great French mathematicians Pierre de Fermat (1601–
1665) and Blaise Pascal (1623–1662). This famous correspondence laid the
groundwork for the birth of the study of probability, especially their question
of how two players in a game of chance should divide the stakes if the
game ends prematurely. This problem of points, which will be discussed in
Chapter 3, was the catalyst that enabled probability theory to develop beyond
mere combinatorial enumeration.
In 1657, the Dutch astronomer Christiaan Huygens (1629–1695) learned of
the Fermat–Pascal correspondence and shortly thereafter published the book
De Ratiociniis de Ludo Aleae (On Reasoning in Games of Chance), in which
he worked out the concept of expected value and unified various problems
that had been solved earlier by Fermat and Pascal. Huygens’ work led the
field for many years until, in 1713, the Swiss mathematician Jakob Bernoulli
1 Foundations of Probability Theory
The purpose of this chapter is to give the student a solid basis for solving
probability problems. We first discuss the intuitive and fundamental axioms of
probability, and from them derive a number of basic rules for the calculation
of probabilities. These rules include the complement rule, the addition rule,
and the inclusion–exclusion rule, for which many illustrative examples are
provided. These examples, which are instructive and provide insight into the
theory, include classical probability problems such as the birthday problem and
the hat-check problem.
Traditionally, introductory probability books begin with a comprehensive
discussion of set theory and combinatorics before presenting the “real stuff.”
This will not be done in this book, as it is not necessary and often stifles
the student’s natural curiosity and fascination with probability. Rather, the
appendices at the end of the book provide a self-contained overview of the
essentials of set theory and the basic tools from combinatorics, and these tools
are not introduced until they are needed.
• The experiment is to measure the time until the first emission of a particle
from a radioactive source. The sample space is the set (0, ∞) of the positive
real numbers, where outcome t indicates that it takes a time t until the first
emission of a particle. Taking an appropriate unit of time, the probability
$\int_a^b e^{-t}\,dt$ can be assigned to each time interval (a, b) on the basis of physical
properties of radioactive material, where e = 2.71828 . . . is the base of the
natural logarithm.
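For readers who like to check such a model on the computer, here is a minimal Python sketch (the interval endpoints a = 5 and b = 7 and the function name are chosen purely for illustration): it simulates emission times with unit rate and compares the simulated frequency of the event a < t < b with the exact value $\int_a^b e^{-t}\,dt = e^{-a} - e^{-b}$.

import math
import random

def emission_interval_prob(a, b, trials=100_000, seed=1):
    """Estimate the probability that the emission time falls in (a, b) by simulation."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if a < rng.expovariate(1.0) < b)
    return hits / trials

a, b = 5.0, 7.0
exact = math.exp(-a) - math.exp(-b)          # the integral of e^(-t) over (a, b)
print(exact, emission_interval_prob(a, b))   # both are roughly 0.0058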
In probability applications we are typically interested in particular subsets of
the sample space, which in probability language are called events. The terms
event and subset of outcomes of an experiment are used interchangeably in
probability theory. In the second example, the event that an odd number is
rolled is the subset A = {1, 3, 5} of the sample space. In the fourth example, the
event that more than six rolls are needed to get a six is the subset A = {7, 8, . . .}
of the sample space. In the fifth example, the event that it takes between 5 and 7
time units until the first emission of a particle is the subset A = {t : 5 ≤ t ≤ 7}
of the sample space, where “:” means “such that.”
Various choices for the sample space are sometimes possible. In the
experiment of tossing a fair coin twice, a possible choice for the sample
space is the set {HH, HT, TH, TT}. Another possible choice is the set {0, 1, 2},
where the outcome indicates the number of heads obtained. The assignment of
probabilities to the elements of the sample space differs for the two choices.
In the first choice of the sample space, the four elements are equally likely
and each element gets assigned the probability $\frac{1}{4}$. In the second choice of the
sample space, the elements are not equally likely and the elements 0, 1, and 2
get assigned the probabilities $\frac{1}{4}$, $\frac{1}{2}$, and $\frac{1}{4}$. In general, it is preferable to use a
sample space with equally likely outcomes whenever possible.
In the first three examples above, the sample space is a finite set. In the fourth
example the sample space is a so-called countably infinite set, while in the fifth
example the sample space is a so-called uncountable set. Let us briefly explain
these basic concepts from set theory, see also Appendix B. The set of natural
numbers (positive integers) is an infinite set and is the prototype of a countably
infinite set. In general, a nonfinite set is called countably infinite if a one-to-
one function exists which maps the elements of the set to the set of natural
numbers. In other words, every element of the set can be assigned to a unique
natural number and conversely each natural number corresponds to a unique
element of the set. For example, the set of squared numbers 1, 4, 9, 16, 25, . . .
is countably infinite. Not all sets with an infinite number of elements are
countably infinite. The set of all points on a line and the set of all real
numbers between 0 and 1 are examples of infinite sets that are not countable.
The union $\bigcup_{i=1}^{\infty} A_i$ of the subsets $A_1, A_2, \ldots$ is defined as the set of all outcomes
which belong to at least one of the subsets $A_1, A_2, \ldots$. The subsets $A_1, A_2, \ldots$
are said to be pairwise disjoint when any two subsets have no element in
common. In probability terms, any subset of the sample space is called an
event. If the outcome of the chance experiment belongs to the subset A, then the
event A is said to occur. The events A1 , A2 , . . . are said to be mutually exclusive
(or disjoint) if the corresponding sets A1 , A2 , . . . are pairwise disjoint. In other
words, events A1 , A2 , . . . are mutually exclusive if the occurrence of one of
these events implies the non-occurrence of the others. For example, suppose
that a die is rolled. The outcome is one of the numbers from 1 to 6. Let A be
the event that the outcome 1 or 2 occurs and B be the event that the outcome 5
or 6 occurs. Then A and B are mutually exclusive events. As another example,
suppose that a coin is tossed until heads appears for the first time. Let Ak be
the event that heads appears for the first time at the kth toss. Then the events
A1 , A2 , . . . are mutually exclusive.
The first two axioms simply express a probability as a number between 0
and 1. The crucial third axiom states that, for any infinite sequence of mutually
exclusive events, the probability of at least one of these events occurring is
the sum of their individual probabilities. This property also holds for any finite
sequence of mutually exclusive events. Using the concept of the empty set, the
proof of this result is almost trivial, see Rule 1.1 in Section 1.5. The countable
additivity in Axiom 3 is required to have a unified framework for finite and
nonfinite sample spaces. Starting with the three axioms and a few definitions,
a powerful and beautiful theory of probability can be developed.
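For ease of reference, the three axioms discussed above are the standard Kolmogorov axioms; writing $\Omega$ for the sample space, as is done below, they can be summarized as:
$$\text{Axiom 1: } P(A) \geq 0 \text{ for every event } A.$$
$$\text{Axiom 2: } P(\Omega) = 1.$$
$$\text{Axiom 3: } P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) \text{ for every sequence of mutually exclusive events } A_1, A_2, \ldots$$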
The standard notation for the sample space is the symbol $\Omega$. An outcome of
the sample space is denoted by $\omega$. A sample space together with a collection of
events and an assignment of probabilities to the events is called a probability
space. For a countable sample space $\Omega$, it is sufficient to assign a probability
$p(\omega)$ to each element $\omega \in \Omega$ such that $p(\omega) \geq 0$ and $\sum_{\omega \in \Omega} p(\omega) = 1$.
A probability measure $P$ on $\Omega$ is then defined by specifying the probability of
each event A as
$$P(A) = \sum_{\omega \in A} p(\omega).$$
In other words, P(A) is the sum of the individual probabilities of the outcomes
ω that belong to the set A. It is left to the reader to verify that P satisfies Axioms
1 to 3.
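A minimal Python sketch of this construction (the two-coin-toss sample space is used purely as an illustration): the function p is stored as a dictionary over the outcomes, and P(A) is obtained by summing p(ω) over the outcomes ω in A.

# Probability measure on the countable sample space of two tosses of a fair coin.
p = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def P(event):
    """P(A) = sum of the probabilities p(omega) over the outcomes omega in A."""
    return sum(p[omega] for omega in event)

at_least_one_head = {"HH", "HT", "TH"}
print(P(at_least_one_head))   # 0.75
print(P(p.keys()))            # 1.0, in agreement with Axiom 2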
A probability model is constructed with a specific situation or experiment in
mind. The assignment of probabilities is part of the translation process from a
concrete context into a mathematical model. Probabilities may be assigned to
events any way you like, as long as the above axioms are satisfied. To make
your choice of the probabilities useful, the assignment should result in a “good”
model for the real-world situation. There are two main approaches to assigning
probabilities to events. In the relative-frequency approach, probabilities are
assigned to the outcomes of a physical experiment having the feature that it
can be repeated over and over under identical conditions. Think of spinning a
roulette wheel or rolling dice. Then one may speak of physical probabilities
and such probabilities can be determined experimentally. In the subjective
approach, the word probability is roughly synonymous with plausibility and
probability is defined as the degree of belief a particular person holds in the
occurrence of an event. Think of the chances of your favorite horse winning a
race or the chances of your favorite baseball team winning the World Series.
Hence judgment is used as the basis for assigning subjective probabilities.
The use of the subjective approach is usually limited to experiments that are
unrepeatable. In this book the emphasis is on physical probabilities, but we
will also pay attention to subjective probabilities in Chapters 2 and 7.
Example 1.2 Three players enter a room and are given a red or a blue hat to
wear. The color of each hat is determined by a fair coin toss. Players cannot see
the color of their own hats, but do see the color of the other two players’ hats.
The game is won when at least one of the players correctly guesses the color
of his own hat and no player gives an incorrect answer. In addition to having
the opportunity to guess a color, players may also pass. Communication of any
kind between the players is not permissible after they have been given their
hats; however, they may agree on a group strategy beforehand. The players
decided upon the following strategy. A player who sees that the other two
players wear a hat with the same color guesses the opposite color for his/her
own hat; otherwise, the player says nothing. What is the probability of winning
the game under this strategy?
Solution. This chance experiment can be seen as tossing a fair coin three
times. As sample space, we take the set consisting of the eight elements RRR,
RRB, RBR, BRR, BBB, BBR, BRB, RBB, where R stands for a red hat and B
for a blue hat. Each element of the sample space is equally probable and gets
assigned a probability of $\frac{1}{8}$. The strategy wins if one of the six outcomes
RRB, RBR, BRR, BBR, BRB, or RBB occurs (verify!). Thus, the probability of
winning the game under the chosen strategy is $\frac{3}{4}$.
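The answer 3/4 is also easy to confirm by simulation. The following Python sketch (purely illustrative; the names and the number of rounds are arbitrary) draws the three hat colors by fair coin tosses and applies the stated strategy: a player who sees two hats of the same color guesses the opposite color, and everyone else passes.

import random

def play_round(rng):
    hats = [rng.choice("RB") for _ in range(3)]
    guesses = []
    for i in range(3):
        others = [hats[j] for j in range(3) if j != i]
        if others[0] == others[1]:                    # sees two hats of the same color
            guesses.append("R" if others[0] == "B" else "B")
        else:
            guesses.append(None)                      # passes
    made = [(g, hats[i]) for i, g in enumerate(guesses) if g is not None]
    # the game is won if at least one guess is made and no guess is wrong
    return len(made) > 0 and all(g == h for g, h in made)

rng = random.Random(0)
rounds = 100_000
print(sum(play_round(rng) for _ in range(rounds)) / rounds)   # close to 0.75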
In Example 1.2 we have encountered a useful problem-solving strategy: see
whether the problem can be related to a familiar problem.
As preparation for the next example, consider a task that involves a sequence
of r choices. Suppose that n1 is the number of possible ways the first choice
can be made, n2 is the number of possible ways the second choice can be made
after the first choice has been made, and n3 is the number of possible ways the
third choice can be made after the first two choices have been made, etc. Then
the total number of possible ways the task can be performed is n1 ×n2 ×· · ·×nr .
For example, the total number of possible ways five people can stand in line is
5 × 4 × 3 × 2 × 1 = 120. In other words, there are 120 permutations.
Example 1.3 In a Monte Carlo casino the roulette wheel is divided into
37 sections numbered 1 to 36 and 0. What is the probability that all numbers
showing up in eight spins of the wheel are different?
Solution. Take as sample space the set of all ordered sequences (i1 , . . . , i8 ),
where ik is the number showing up at the kth spin of the wheel. The sample
space has $37 \times 37 \times \cdots \times 37 = 37^8$ equiprobable elements. The number
of elements for which all components are different is $37 \times 36 \times \cdots \times 30$.
Therefore, the sought probability is
$$\frac{37 \times 36 \times \cdots \times 30}{37^8} = 0.4432.$$
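This number is quickly reproduced on a computer (an illustrative Python check): compute the exact ratio and compare it with a straightforward simulation of eight spins.

import math
import random

exact = math.prod(range(30, 38)) / 37**8        # (37 x 36 x ... x 30) / 37^8
print(round(exact, 4))                          # 0.4432

rng = random.Random(7)
trials = 200_000
hits = sum(len({rng.randrange(37) for _ in range(8)}) == 8 for _ in range(trials))
print(hits / trials)                            # close to 0.4432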
We reiterate the concept of the binomial coefficient before continuing.
Definition 1.1 The binomial coefficient $\binom{n}{k}$ denotes the total number of ways
to choose k different objects out of n distinguishable objects, with order not
mattering.
In other words, $\binom{n}{k}$ is the total number of combinations of k different objects
out of n. The key difference between permutations and combinations is order:
combinations are unordered selections, permutations are ordered arrangements.
In Appendix A these important concepts are discussed extensively and
illustrated with several combinatorial probability problems. For any integers n
and k with $1 \leq k \leq n$, the binomial coefficient can be calculated as
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$
where $m!$ is shorthand for $1 \times 2 \times \cdots \times m$ with the convention $0! = 1$. Note that
$\binom{n}{k} = \binom{n}{n-k}$ with the convention $\binom{n}{0} = 1$. For example, the number of ways
to choose three jurors out of five candidates is $\binom{5}{3} = \frac{5!}{3!\,2!} = \frac{120}{6 \times 2} = 10$, with
order not mattering.
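In Python, binomial coefficients are available directly through the standard library (a small illustrative check of the juror example):

import math

print(math.comb(5, 3))                      # 10 ways to choose three jurors out of five
print(math.comb(5, 3) == math.comb(5, 2))   # True, since C(n, k) = C(n, n - k)
print(math.factorial(5) // (math.factorial(3) * math.factorial(2)))   # the same value via n!/(k!(n-k)!)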
In no book on introductory probability should problems on tossing coins,
rolling dice, and dealing cards be missing. The next examples deal with these
sorts of probability problems and use binomial coefficients to solve them.
Example 1.4 A fair coin is tossed 100 times. What is the probability of getting
exactly 50 heads?
Solution. Take as sample space all possible sequences of zeros and ones to a
length of 100, where a zero stands for tails and a one for heads. The sample
space has $2^{100}$ equiprobable elements. The number of elements having exactly
50 ones is $\binom{100}{50}$. Therefore, the probability of getting exactly 50 heads is
$$\frac{\binom{100}{50}}{2^{100}}.$$
This probability is the ratio of two enormously large numbers and its computation
requires special provisions. However, a very accurate approximation
to this probability can be given. To this end, consider the general case of 2N
tosses of a fair coin. Then the probability $p_N$ of getting exactly N heads is
$$p_N = \frac{\binom{2N}{N}}{2^{2N}}.$$
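As an aside, the exact value is easy to obtain with Python's integer arithmetic (an illustrative computation):

import math

def p_N(N):
    """Probability of exactly N heads in 2N tosses of a fair coin."""
    return math.comb(2 * N, N) / 2**(2 * N)

print(p_N(50))   # approximately 0.0796, the probability of exactly 50 heads in 100 tosses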
sample space made up of ordered elements. Imagine that the 52 cards of the
deck are numbered 1, 2, . . . , 52 and that the 13 cards of the bridge hand are
randomly dealt one at a time. The sample space made up of ordered elements
is the set of all possible 13-tuples (i1 , . . . , i13 ), where ik is the number of the kth
card dealt. The sample space has 52 × 51 × · · · × 40 equally likely outcomes.
The number of cards having a value nine or lower is 8 × 4 = 32, and so there
are 32 × 31 × · · · × 20 outcomes for which there is no card higher than a nine
among the 13 cards dealt. Thus, the probability of a Yarborough is
$$\frac{32 \times 31 \times \cdots \times 20}{52 \times 51 \times \cdots \times 40} = 0.000547.$$
Alternatively, this probability can be computed by using a sample space made
up of unordered elements. The order in which the 13 cards are dealt is not
relevant in such a sample space.1 Each outcome of this sample space is a set
of 13 different cards from the deck of 52 cards. In a set we don’t care which
element is first, only which elements are actually present. The number of ways
you can choose a set of 13 different cards from a deck of 52 cards is given
by the binomial coefficient $\binom{52}{13}$. Thus, the sample space made up of unordered
elements has $\binom{52}{13}$ equally likely outcomes. The number of outcomes with no
card above a nine is $\binom{32}{13}$, and so the probability of a Yarborough can also be
calculated as
$$\frac{\binom{32}{13}}{\binom{52}{13}} = 0.000547.$$
The odds against a Yarborough are 1,827 to 1. A Yarborough is named after
Lord Yarborough (1809–1862). The second Earl of Yarborough would offer
his whist-playing friends a wager of £1,000 to 1 against them picking up such
a hand. You see that the odds were on his side. There is no record that he ever
paid out.
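Both routes to this probability are easy to reproduce exactly (an illustrative Python check using the standard library; the Fraction type keeps the arithmetic exact):

import math
from fractions import Fraction

ordered   = Fraction(math.prod(range(20, 33)), math.prod(range(40, 53)))   # (32 x 31 x ... x 20) / (52 x 51 x ... x 40)
unordered = Fraction(math.comb(32, 13), math.comb(52, 13))                  # C(32,13) / C(52,13)
print(ordered == unordered, float(ordered))   # True 0.000547...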
Example 1.7 A well-shuffled standard deck of 52 cards is divided into four
piles of 13 cards each. What is the probability that each pile contains exactly
one ace?
Solution. A subtle approach is as follows. The deck of cards has 52 positions
from top to bottom. Imagine that the first 13 positions are for the first pile, the
next 13 positions for the second pile, etc. Take as sample space the collection
$(B_j, R_k)$, and the $7 \times 6$ outcomes $(B_j, B_k)$, where the first component of each
outcome refers to the first card picked and the second component to the second
card picked. There is a total of 420 equally likely outcomes. The probability
of picking two red cards is $\frac{14 \times 13}{420}$ and is less than the probability $\frac{14 \times 7 + 7 \times 14}{420}$
Problems
1.1 Two fair dice are rolled. What is the probability that the sum of the two
numbers rolled is odd? What is the probability that the product of the two
numbers rolled is odd?
1.2 In a township, there are two plumbers. On a particular day three resi-
dents call village plumbers, independently of each other. Each resident
randomly chooses one of the two plumbers. What is the probability that
all three residents will choose the same plumber?
1.3 Take a random permutation of the letters in the word “randomness.” What
is the probability that the permutation begins and ends with a vowel?
What is the probability that the three vowels are adjacent to each other in
the permutation?
1.4 A dog has a litter of four puppies. Can we correctly say that the litter
more likely consists of three puppies of one gender and one of the other
than that it consists of two puppies of each gender?
1.5 Suppose that you pick m numbers from a set of n different numbers
without replacement. What is the probability of getting the largest
number in the set?
1.6 Three girls and three boys have a dinner party. They agree that two of
them are going to do the dishes. The unlucky persons are determined by
drawing lots. What is the probability that two boys will do the dishes?
1.7 A box contains n identical balls, where two balls have a winning number
written on them. In a prespecified order, n persons each choose a ball at
random from the box. Show that each person has the same probability of
getting a winning ball.
1.8 Players A and B play a game of rolling two fair dice. If the largest number
rolled is one, two, three, or four, player A wins; otherwise, player B wins.
What is the probability of player A winning?
1.9 In the 6/42 lottery, six distinct numbers are picked at random from
the numbers 1, . . . , 42. What is the probability that number 10 will be
picked? What is the probability that each of the six numbers picked is 20
or more?
1.10 Five people are sitting at a table in a restaurant. Two of them order
coffee and the other three order tea. The waiter forgets who ordered what
and puts the drinks in a random order for the five persons. What is the
probability that each person gets the correct drink?
1.11 Two black socks, two brown socks, and one white sock lie mixed up in a
drawer. You grab two socks without looking. What is the probability that
you have grabbed two black socks or two brown socks?
1.12 Two letters have fallen out of the word “Cincinnati” at random places.
What is the probability that these two letters are the same?
1.13 You choose at random two cards from a standard deck of 52 cards. What
is the probability of getting a ten and a heart?
1.14 You choose at random a letter from the word “chance” and from the word
“choice.” What is the probability that the two letters are the same?
1.15 You roll a fair die 10 times. Can you explain why the probability of
rolling a total of s points is the same as the probability of rolling a total
of 70 − s points for s = 10, . . . , 34?
1.16 You roll 12 fair dice. What is the probability that each number will appear
exactly twice?
1.17 A building has three floors. Each floor has four apartments and
each apartment has one resident. Three residents meet each other
1.3 Geometric Probability Model
means that equal probabilities are assigned to sets of equal lengths, areas, or
volumes. The next examples illustrate the geometric probability model.
Example 1.8 Joe is driving around, looking for a parking spot. He plans to stay
in the area for exactly 2 hours and 15 minutes. He comes across a free parking
spot that only allows free parking for 2 hours. The parking enforcement officer
comes around once every 2 hours. At each inspection the officer marks cars
that were not there at his last visit and tickets earlier-marked cars. Joe arrives
at a random moment between two visits of the officer, and Joe has no idea how
long ago the officer’s last visit was. Joe decides to take a gamble and parks in
the free spot. What is the probability that Joe will receive a ticket?
Solution. Define a probability model with the interval (0, 120) as sample
space. The outcome x means that it takes x minutes until the next visit of the
officer, measured from the moment that Joe arrives. To each subinterval, we
assign as probability the length of the interval divided by 120. Let A be the
event that Joe receives a ticket. The event A occurs if and only if the outcome
x lies in the interval (0, 15). Thus $P(A) = \frac{15}{120}$, so the probability that Joe will
receive a ticket is $\frac{1}{8}$.
Example 1.9 You randomly throw a dart at a circular dartboard with radius R.
It is assumed that the dart is infinitely sharp and lands at a random point on the
dartboard. How do you calculate the probability of the dart hitting the bullseye,
having radius b?
Solution. The sample space of this experiment is the set of pairs of real
numbers (x, y) with x2 + y2 ≤ R2 , where (x, y) indicates the point at which
the dart hits the dartboard. This sample space is uncountable. We first make
the following observation. The probability that the dart lands exactly at a
prespecified point is zero. It only makes sense to speak of the probability of
the dart hitting a given region of the dartboard. This observation expresses
a fundamental difference between a probability model with a countable
sample space and a probability model with an uncountable sample space. The
assumption of the dart hitting the dartboard at a random point is translated by
assigning the probability
$$P(A) = \frac{\text{area of } A}{\pi R^2}$$
to each subset A of the sample space. Therefore, the probability of the dart
hitting the bullseye is $\pi b^2/(\pi R^2) = b^2/R^2$.
[Figure 1.1: a needle dropped on the ruled floor, with y the distance from the needle's center to the closest line and x the angle at which the needle falls.]
Example 1.10 A floor is ruled with equally spaced parallel lines a distance D
apart. A needle of length L is dropped at random on the floor. It is assumed that
L ≤ D. What is the probability that the needle will intersect one of the lines?
This problem is known as Buffon’s needle problem.
Solution. This geometric probability problem can be translated into the
picking of a random point inside a certain region. Let y be the distance from the
center of the needle to the closest line and x be the angle at which the needle
falls, where x is measured against a line parallel to the lines on the floor; see
Figure 1.1. The sample space of the experiment can be taken as the rectangle
R consisting of the points (x, y), with $0 \leq x \leq \pi$ and $0 \leq y \leq \frac{1}{2}D$. Dropping
the needle at random on the floor can be seen to be equivalent to choosing
a random point inside the rectangle R. The needle will land on a line only if
the hypotenuse of the right-angled triangle in Figure 1.1 is less than half of the
length L of the needle. That is, we get an intersection if and only if $\frac{y}{\sin(x)} < \frac{1}{2}L$.
Thus, the probability that the needle will intersect one of the lines equals the
probability that a point (x, y) chosen at random in the rectangle R satisfies
$y < \frac{1}{2}L\sin(x)$. In other words, the area under the curve $y = \frac{1}{2}L\sin(x)$ divided
by the total area of the rectangle R gives the probability of an intersection. This
ratio is
$$\frac{\int_0^{\pi} \frac{1}{2}L\sin(x)\,dx}{\frac{1}{2}\pi D} = \frac{-L\cos(x)\big|_0^{\pi}}{\pi D}$$
and so
$$P(\text{needle intersects one of the lines}) = \frac{2L}{\pi D}.$$
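The 2L/(πD) answer lends itself to a direct Monte Carlo check (an illustrative Python sketch; the values L = 1 and D = 2 are arbitrary choices with L ≤ D): draw the point (x, y) at random in the rectangle R and count how often $y < \frac{1}{2}L\sin(x)$.

import math
import random

def buffon(L, D, trials=200_000, seed=3):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, math.pi)     # angle at which the needle falls
        y = rng.uniform(0.0, D / 2)       # distance from the needle's center to the closest line
        if y < 0.5 * L * math.sin(x):
            hits += 1
    return hits / trials

L, D = 1.0, 2.0
print(buffon(L, D), 2 * L / (math.pi * D))   # both close to 0.3183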
Problems
1.26 A stick of length one is broken into two pieces at a random point. What
is the probability that the length of the longer piece will be at least three
times the length of the shorter piece?
1.27 Franc-carreau was a popular game in eighteenth-century France. In this
game, a coin is tossed on a chessboard. The player wins if the coin
does not fall on one of the lines of the board. Suppose now that a round
coin with a diameter of d is blindly tossed on a large table. The surface
of the table is divided into squares whose sides measure a in length,
where a > d. Define an appropriate probability space and calculate the
probability of the coin falling entirely within the confines of a square.
Hint: Consider the position of the center of the coin.
1.28 Each morning during the week you arrive at a train station at a random
moment between 7 a.m. and 8 a.m. Fast trains in the direction of your
destination arrive every 15 minutes starting at 7 a.m., whereas slow trains
in the direction of your destination arrive every 30 minutes starting at
7:05 a.m. You take the first train that arrives. What is the probability that
you will take a slow train on any given morning?
1.29 Consider again Problem 1.28. Suppose that your friend arrives at the train
station at a random moment between 7.30 a.m. and 8 a.m. and also takes
the first train in the direction of your destination. What is the probability
that he gets on the same train as you?
1.30 Two people have agreed to meet at a central point in the city. Indepen-
dently of one another, each person is to appear at a random moment
between 12 p.m. and 1 p.m. What is the probability that the two persons
will meet within 10 minutes of one another?
1.31 The numbers B and C are chosen at random between −1 and 1,
independently of each other. What is the probability that the quadratic
equation x2 + Bx + C = 0 has real roots? Also, derive a general
expression for this probability when B and C are chosen at random from
the interval (−q, q) for any q > 0.
1.32 The Manhattan distance from a point (x, y) in the plane to the origin
(0, 0) is defined as |x| + |y|. You choose a random point inside the unit
square {(x, y) : 0 ≤ x, y ≤ 1}. What is the probability that the Manhattan
distance between this point and the point (0, 0) is no more than a for
0 ≤ a ≤ 2? Does the answer change when the point is randomly chosen
in the square {(x, y) : −1 ≤ x, y ≤ 1}?
1.33 Consider the following variant of Buffon’s needle problem from
Example 1.10. A rectangular card with side lengths a and b is dropped
at random on the floor. It is assumed that the length $\sqrt{a^2 + b^2}$ of the
diagonal of the card is smaller than the distance D between the parallel
lines on the floor. Show that the probability of the card intersecting one
of the lines is given by $\frac{2(a+b)}{\pi D}$.
1.34 A random point is chosen inside a triangle with height h and base of
length b. What is the probability that the perpendicular distance from
the point to the base is larger than a given value d with 0 < d < h?
What is the probability that the randomly chosen point and the base of
the triangle will form a triangle with an obtuse angle when the original
triangle is equilateral?
1.35 You choose a number v at random from (0, 1) and next a number w at
random from (0, 1 − v). What is the probability that a triangle can be
formed with side lengths v, w, and 1 − v − w? Next answer the following
question. What is the probability that a triangle can be formed with three
pieces of a broken stick if the stick is first broken at random into two
pieces and next the longer piece is randomly broken into two pieces.
Hint: Use the fact that the sum of any two side lengths must be greater
than the third side length and represent (v, w) as (x, y(1−v)), where (x, y)
is a random point inside the unit square.
1.36 A stick is broken at two places. The break points are chosen at random
on the stick, independently of each other. What is the probability that a
triangle can be formed with the three pieces of the broken stick?
1.37 Choose a random point inside a circle with radius r and construct
the (unique) chord with the chosen point as its midpoint. What is the
probability that the chord is longer than a side of an equilateral triangle
inscribed in the circle?
This choice for the probability measure is not only intuitively the obvious one,
but it can also be proved that it is the only probability measure satisfying the
property P(AB) = P(A)P(B) when the elementary experiments that generate
event A are physically independent of those elementary experiments that give
rise to event B. This important result of the uniqueness of the probability
measure satisfying this property justifies the use of the product rule for
compound chance experiments.
Example 1.11 In the “Reynard the Fox” café, it normally costs $3.50 to buy
a pint of beer. On Thursday nights, however, customers pay $0.25, $1.00,
or $2.50 for the first pint. In order to determine how much they will pay,
customers must throw a dart at a dartboard that rotates at high speed. The
dartboard is divided into eight segments of equal size. Two of the segments
read $0.25, four of the segments read $1, and two more of the segments read
$2.50. You pay whatever you hit. Two friends, John and Peter, each throw a
dart at the board and hope for the best. What is the probability that the two
friends will have to pay no more than $2 between them for their first pint?
Solution. This problem can be modeled as a compound experiment that con-
sists of two subexperiments. The physically independent subexperiments are
the throws of John and Peter. The elementary outcomes of each subexperiment
are L, M, and H, where L stands for hitting a low-priced segment, M stands
for hitting a medium-priced segment, and H stands for hitting a high-priced
segment. Assuming that the dart hits the dartboard at a random point, the
probabilities $p(L) = \frac{2}{8}$, $p(M) = \frac{4}{8}$, and $p(H) = \frac{2}{8}$ are assigned to the outcomes
L, M, and H. The sample space of the compound experiment consists of the
nine outcomes (L, L), (L, M), (M, L), (L, H), (H, L), (M, M), (M, H), (H, M),
(H, H). The probability $\frac{1}{4} \times \frac{1}{4} = \frac{1}{16}$ is assigned to each of the outcomes (L, L),
(L, H), (H, L), (H, H), the probability $\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$ to the outcome (M, M), and
the probability $\frac{1}{2} \times \frac{1}{4} = \frac{1}{8}$ to each of the outcomes (L, M), (M, L), (H, M),
(M, H). The two friends will have to pay no more than $2 between them for
their first pint if one of the four outcomes (L, L), (L, M), (M, L), (M, M) occurs.
The probability of this event is $\frac{1}{16} + \frac{1}{8} + \frac{1}{8} + \frac{1}{4} = \frac{9}{16}$.
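The 9/16 can be confirmed by enumerating the nine compound outcomes directly (a small illustrative Python check; the prices and segment probabilities are those of the example):

from fractions import Fraction
from itertools import product

price = {"L": Fraction(1, 4), "M": Fraction(1, 1), "H": Fraction(5, 2)}   # $0.25, $1.00, $2.50
prob  = {"L": Fraction(1, 4), "M": Fraction(1, 2), "H": Fraction(1, 4)}

total = sum(prob[a] * prob[b]
            for a, b in product("LMH", repeat=2)
            if price[a] + price[b] <= 2)
print(total)   # 9/16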
Problems
1.38 In a tennis tournament between three players A, B, and C, each player
plays the others once. The strengths of the players are as follows: P(A
beats B) = 0.5, P(A beats C) = 0.7, and P(B beats C) = 0.4. Assuming
independence of the match results, calculate the probability that player A
wins at least as many games as any other player.
1.39 An electronic system has four components labeled 1, 2, 3, and 4. The
system has to be used during a given time period. The probability that
component i will fail during that time period is fi for i = 1, . . . , 4. Failures
of the components are physically independent of each other. A system
failure occurs if component 1 fails or if at least two of the other three
components fail. What is the probability that the system will fail?
1.40 Bill and Mark take turns picking a ball at random from a bag containing
four red and seven white balls. The balls are drawn out of the bag with
replacement. What is the probability that Bill is the first person to pick a
red ball when Mark is the first person to start?
1.41 Two dice are tossed until the dice sum 7 or 8 appears. What is the
probability of getting a total of 8 before a total of 7?
1.42 Three desperados A, B, and C play Russian roulette in which they take
turns pulling the trigger of a six-cylinder revolver loaded with one bullet.
Each time, the magazine is spun to randomly select a new cylinder to
fire as long as the deadly shot has not already occurred. The desperados
shoot according to the order A, B, C, A, B, C, . . . . Determine, for each of
the three desperados, the probability that this desperado will be the one
to shoot himself dead.
1.43 Two persons each roll a pair of dice, independently of each other. What
is the probability that the sums rolled are different?
occurs. The probability measure on the class of cylinder sets can be extended
to one defined on a sufficiently general class of subsets capable of representing
all possible events of this chance experiment. This extension requires measure
theory, however, and is beyond the scope of this book.
It is intuitively clear that the fraction of coin tosses in which heads occurs
converges to $\frac{1}{2}$ when the number of tosses increases without limit. We can state
this claim more rigorously with the help of the probability measure $P^{(\infty)}$. To do
so, we adopt the notation $K_n(\omega)$ to represent the number of heads occurring
in the first n elements of $\omega$. Furthermore, let C be the set of all outcomes $\omega$
for which $\lim_{n\to\infty} K_n(\omega)/n = \frac{1}{2}$. For very many sequences $\omega$, the number
$K_n(\omega)/n$ does not converge to $\frac{1}{2}$ as $n \to \infty$ (e.g., this is the case for any
sequence $\omega$ with finitely many H's). However, "nature" chooses a sequence
from the set C according to $P^{(\infty)}$. More precisely, the probability measure
$P^{(\infty)}$ assigns a probability of 1 to the set C, or, in mathematical notation,
$$P^{(\infty)}\Big(\Big\{\omega : \lim_{n\to\infty} \frac{K_n(\omega)}{n} = \frac{1}{2}\Big\}\Big) = 1.$$
This type of convergence is called convergence with probability one, or almost-
sure convergence. The terminology of an “almost-sure” event means that
realizations not in this event are theoretically possible but will not happen
in reality. The convergence result is a special case of the strong law of
large numbers. This law is of enormous importance: it provides a direct link
between theory and practice. It was a milestone in probability theory when,
around 1930, A. N. Kolmogorov proved this law from the simple axioms of
probability. In Chapter 9 we come back to the strong law of large numbers.
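A quick numerical illustration of this convergence (an illustrative Python sketch, of course not a proof): simulate one long sequence of fair-coin tosses and print the running fraction of heads.

import random

rng = random.Random(2024)
heads = 0
for n in range(1, 1_000_001):
    heads += rng.randrange(2)                          # 1 counts as heads, 0 as tails
    if n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(n, heads / n)                            # the fractions drift toward 0.5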
Problems
1.44 Use the Kolmogorov axioms to prove the following inequalities:
(a) P(A) ≤ P(B) when A ⊆ B. Hint: Let C = B\A be the set of all
outcomes in B that are not in A and note that B is the union of the
disjoint sets A and C.
(b) $P\big(\bigcup_{k=1}^{\infty} A_k\big) \leq \sum_{k=1}^{\infty} P(A_k)$ for any sequence of subsets $A_1, A_2, \ldots$.
This is Boole's inequality, also known as the union bound. Hint:
Define the pairwise disjoint sets $B_k$ by $B_1 = A_1$ and $B_k = A_k \setminus (A_1 \cup \cdots \cup A_{k-1})$ for $k \geq 2$.
1.45 In the coin-tossing experiment of repeatedly tossing a fair coin, a run of
length r is said to have occurred if heads have just been tossed r times
in a row. Prove that a run of length r is certain to occur somewhere
if a fair coin is tossed indefinitely often. Hint: Analyze the coin-
tossing experiment as an infinite sequence of physically independent
r-experiments. The r-experiment consists of r consecutive tosses of the
coin. Define $B_n$ as the event that no run of length r occurs in any of the
first n of the r-experiments and evaluate $P\big(\bigcap_{n=1}^{\infty} B_n\big)$.
1.46 You toss a fair coin until you obtain 10 heads in a row and then you
stop. What is the probability of seeing at least 10 consecutive tails in the
sequence prior to stopping? Hint: Use the result of Problem 1.45.
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \cdots + (-1)^{n-1} P(A_1 A_2 \cdots A_n).$$
The proofs of these rules are simple and instructive. They nicely demonstrate
how useful propositions can be obtained from “minimal” axioms.
To prove Rule 1.1, denote by ∅ the empty set of outcomes. We first show that
$$P(\emptyset) = 0.$$
Using Axiom 3 with $A_i = \emptyset$ for all $i$ gives $P(\emptyset) = \sum_{i=1}^{\infty} a_i$, where $a_i = P(\emptyset)$
for all $i$. This implies that $P(\emptyset) = 0$. Let $A_1, \ldots, A_n$ be any finite sequence of
pairwise disjoint sets. Augment this sequence with $A_{n+1} = \emptyset, A_{n+2} = \emptyset, \ldots$.
Then, by Axiom 3,
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i).$$
The proof of Rule 1.2 is as follows. The set A ∪ Ac is by definition equal to
the sample space. Thus, by Axiom 2, P (A ∪ Ac ) = 1. The sets A and Ac are
disjoint. It now follows from Rule 1.1 that P (A ∪ Ac ) = P(A) + P(Ac ). This
gives the complement rule P(A) = 1 − P(Ac ).
To prove Rule 1.3, denote by A1 the set of outcomes that belong to A but not
to B. Let B1 be the set of outcomes that are in B but not in A and let C = AB
be the set of outcomes that are both in A and B. The sets A1 , B1 , and C are
pairwise disjoint. Moreover,
$$A \cup B = A_1 \cup B_1 \cup C, \quad A = A_1 \cup C, \quad \text{and} \quad B = B_1 \cup C.$$
Applying Rule 1.1 gives
$$P(A \cup B) = P(A_1) + P(B_1) + P(C).$$
Also, $P(A) = P(A_1) + P(C)$ and $P(B) = P(B_1) + P(C)$. By substituting the
latter two relations into the expression for $P(A \cup B)$ and noting that $C = AB$,
we find $P(A \cup B) = P(A) + P(B) - P(AB)$.
Example 1.13 How many rolls of a fair die are required to have at least a 50%
chance of rolling at least one six? How many rolls of two fair dice are required
to have at least a 50% chance of rolling at least one double six?
Solution. Let us first consider the experiment of rolling a single die r times,
where r is fixed in advance. The sample space of this experiment consists of all
elements (i1 , i2 , . . . , ir ), where ik denotes the outcome of the kth roll of the die.
The sample space has $6 \times 6 \times \cdots \times 6 = 6^r$ equally likely elements. Let A be
the event that at least one six is obtained in r rolls of the die. To compute P(A),
it is easiest to compute the probability of the complementary event $A^c$ that no
six is obtained in r rolls of the die. The set $A^c$ consists of $5 \times 5 \times \cdots \times 5 = 5^r$
elements. Thus, $P(A^c) = 5^r/6^r$ and so
$$P(A) = 1 - \frac{5^r}{6^r}.$$
This probability has the value 0.4213 for r = 3 and the value 0.5177 for r = 4.
Therefore four rolls of the die are required.
The following sample space is taken for the experiment of rolling two dice
r times. Imagining that one die is blue and the other is red, the sample space
consists of all elements ((i1 , j1 ), (i2 , j2 ), . . . , (ir , jr )), where ik and jk denote the
outcomes of the blue and the red die on the kth roll of the two dice. The sample
space consists of $36 \times 36 \times \cdots \times 36 = 36^r$ elements. All elements are equally
likely. Let A be the event that at least one double six is obtained in the r rolls
of the two dice. The complementary event $A^c$ of rolling no double six in r
rolls of the two dice can occur in $35 \times 35 \times \cdots \times 35 = 35^r$ ways. Therefore,
$P(A^c) = 35^r/36^r$ and so
$$P(A) = 1 - \frac{35^r}{36^r}.$$
This probability has the value 0.4914 for r = 24 and the value 0.5055 for r =
25. Therefore 25 rolls of the two dice are required. De Méré must have been an
assiduous player in order to have established empirically that the probability
of rolling at least one double six in 24 rolls of a pair of dice lies just under
one-half.
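Both threshold values of r follow at once from the complement-rule formulas (an illustrative Python computation):

def first_r(p_single):
    """Smallest r with 1 - p_single**r >= 0.5."""
    r = 1
    while 1 - p_single**r < 0.5:
        r += 1
    return r

print(first_r(5 / 6))      # 4 rolls of one die for at least one six
print(first_r(35 / 36))    # 25 rolls of two dice for at least one double six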
The complement rule is also the key to the solution of the famous birthday
problem.
Example 1.14 What is the probability that in a class of n (≤ 365) children
(no twins), two or more children have the same birthday? Assume that the
year has 365 days (ignore February 29) and that all possible birthdays are
equally likely. What is the smallest value of n such that this probability is at
least 50%?
Solution. Take as the sample space all ordered n-tuples of numbers selected
from the integers 1, 2, . . . , 365, where the first number of the tuple is the
birthday of the first child, the second number is the birthday of the second
child, etc. The sample space has $365 \times 365 \times \cdots \times 365 = 365^n$ equally likely
elements. Let A be the event that at least two children have the same birthday.
The number of elements for which all n children have a different birthday is
$365 \times 364 \times \cdots \times (365 - n + 1)$. Then, by the complement rule,
$$P(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}.$$
The smallest value of n for which this probability is 50% or more is n = 23.
The probability has the value 0.5073 for n = 23.⁵ The finding that a small
value of n = 23 suffices is no longer surprising when you realize that there are
$\binom{23}{2} = 253$ combinations of two children in a class of 23 children, where any
combination leads to a match with a probability of $\frac{1}{365}$.
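The threshold n = 23 is found directly from the complement-rule formula (an illustrative Python computation):

def p_match(n):
    """Probability that at least two of n children share a birthday."""
    p_all_different = 1.0
    for k in range(n):
        p_all_different *= (365 - k) / 365
    return 1 - p_all_different

n = 1
while p_match(n) < 0.5:
    n += 1
print(n, round(p_match(n), 4))   # 23 0.5073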
The psychology of probability intuition is an interesting feature of the
birthday problem. Many people think that more than 23 people are needed
for a birthday match and the number 183 is commonly suggested. A similar
misconception can be seen in the words of a lottery official regarding his
lottery, in which a four-digit number was drawn daily from the 10,000 number
sequence 0000, 0001, . . . , 9999. On the second anniversary of the lottery, the
official deemed it highly improbable that any of the 10,000 possible numbers
had been drawn two or more times in the last 625 drawings. The lottery
official was wildly off the mark: the probability that some number will not
be drawn two or more times in 625 drawings is inconceivably small and of
the order of magnitude $10^{-9}$. This probability can be calculated by looking
at the problem as a “birthday problem” with 10,000 possible birthdays and a
group of 625 people. The birthday problem nicely illustrates that coincidences
may not be so unusual after all, see also Problems 1.54 and 1.55. Nearly
all coincidences can be explained by simple probability rules. What looks
unexpected usually turns out to be expected.
Rule 1.3 is often called the addition rule for probabilities. When P(A) and
P(B) are added for non-disjoint sets A and B, the probability of the intersection
5 In reality, birthdays are not uniformly distributed throughout the year, but follow a seasonal
pattern. The probability of a match only becomes larger for any deviation from equal birth
frequencies, as can be understood intuitively by imagining a group of people coming from a
planet on which people are always born on the same day. However, for birth frequency
variation as occurring in reality, the match probability is very insensitive to deviations from
uniform birth rates and the group size of 23 for a fifty-fifty match probability does not change,
see T. S. Nunnikhoven, “A birthday problem for nonuniform birth frequencies,” The American
Statistician 46 (1992): 270–274.
where, for example, the outcome ♣7 means that the seven of clubs is drawn.
All possible outcomes are equally likely, and thus each outcome gets assigned
the same probability $\frac{1}{52}$. Let A be the event that the card drawn is a heart and
B the event that the card drawn is an ace. These two events are not mutually
exclusive. We are looking for the probability P(A ∪ B) that at least one of the
events A and B occurs. This probability can be calculated by applying Rule 1.3:
In this case, P(AB) stands for the probability that the card drawn is the ace of
hearts. The events A and B correspond to sets that contain 13 and 4 elements,
respectively, and thus have respective probabilities $\frac{13}{52}$ and $\frac{4}{52}$. The event AB
corresponds to a set that is a singleton and thus has probability $\frac{1}{52}$. Therefore,
the probability that the card drawn is either a heart or an ace or both is equal to
$$P(A \cup B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52}.$$
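Since the sample space has only 52 equally likely outcomes, the addition rule can also be checked by brute-force enumeration (an illustrative Python sketch):

from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))                        # 52 equally likely outcomes

favorable = [card for card in deck if card[1] == "hearts" or card[0] == "A"]
print(len(favorable), "/", len(deck))                     # 16 / 52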
Example 1.16 Suppose that you are playing blackjack against the dealer. What
is the probability that neither of you are dealt a blackjack when getting two
cards each from a thoroughly shuffled standard deck of 52 cards? A blackjack
consists of one ace together with one card from the 16 cards formed by the
tens, jacks, queens, and kings.
Solution. The sample space consists of all combinations of an unordered pair
of cards for the player and an unordered pair of cards for the dealer. It has
$\binom{52}{2}\binom{50}{2}$ equiprobable elements. Let A be the event that neither the player nor
the dealer gets a blackjack. The sought probability P(A) is easiest computed by
using the complement rule. Let A1 be the event that the player gets a blackjack
and A2 be the event that the dealer gets a blackjack. Then P(A) = 1 − P(A1 ∪
A2 ). By the addition rule, P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 A2 ). Thus
This gives
$$P(E \cup G) = \frac{5{,}760}{8!} + \frac{5{,}760}{8!} - \frac{1{,}152}{8!} = 0.2571.$$
Problems
1.47 Of those visiting a particular car dealer selling second-hand cars and
Japanese cars, 55% buy no car, 25% buy a second-hand car, and 30%
buy a Japanese car. What is the probability that a visit leads to buying a
second-hand Japanese car?
1.48 Suppose that 50% of households in a certain city subscribe to the morning
newspaper, 70% of households subscribe to the afternoon newspaper, and
80% of households subscribe to at least one of the two newspapers. What
proportion of the households subscribe to both newspapers?
1.49 A transport firm has two vehicles, a truck and a van. The truck is used
75% of the time. Both vehicles are used 30% of the time and neither of
the vehicles is used for 10% of the time. What is the probability that the
van is used on any given day? What is the probability that only the van is
used on any given day?
1.50 The event A has probability $\frac{2}{3}$ and there is a probability of $\frac{3}{4}$ that at
least one of the events A and B occurs. What are the smallest and largest
possible values for the probability of event B?
1.51 For any two events A and B, prove that the probability of exactly one of
them occurring is $P(A) + P(B) - 2P(AB)$. Hint: Consider
$P\big((A \cap B^c) \cup (B \cap A^c)\big)$.
1.52 Let $A_1, \ldots, A_n$ be any finite sequence of events. Use induction to verify
the bounds
$$\sum_{i=1}^{n} P(A_i) - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} P(A_i A_j) \leq P\Big(\bigcup_{i=1}^{n} A_i\Big) \leq \sum_{i=1}^{n} P(A_i).$$
1.53 In a class of 30 children, each child writes down a randomly chosen
number from 1, 2, . . . , 250. What is the probability that two or more
children have chosen the same number?
1.54 Suppose that n independent repetitions are done of an experiment that
has m equally likely outcomes $O_1, \ldots, O_m$. What is the probability that
some outcome occurs two or more times? Show that this probability can
be approximated by $1 - e^{-\frac{1}{2} n(n-1)/m}$ when m is much larger than n. What
is the probability that the particular outcome $O_1$ occurs at least once?
Verify that for a fifty-fifty probability the required number of trials is
about $1.177\sqrt{m} + 0.5$ for the first probability and about $0.6931 m$ for the
second probability when m is large. Hint: Use the approximation $1 - x \approx e^{-x}$
for x close to zero.
1.55 Twenty-five people each choose two distinct numbers at random from
the numbers 1, 2, . . . , 25, independently of each other. What is the
probability that at least two people will choose the same two numbers?
1.56 On Wednesday, June 21, 1995, a remarkable thing occurred in the
German 6/49 lottery, in which six different numbers are drawn from
the numbers 1, . . . , 49. On the day in question, the mid-week drawing
produced this six-number result: 15-25-27-30-42-48. These were the
same numbers as had been drawn previously on Saturday, December 20,
1986, and it was the first time in the 3,016 drawings of the German lottery
until June 21, 1995 that the same sequence had been drawn twice. Is this
an incredible occurrence, given that in the German lottery there are nearly
14 million possible combinations of the six numbers in question?
1.57 Canadian lottery officials had no knowledge of the birthday problem and
its treacherous variants when they put this idea into play. They purchased
500 Oldsmobile cars from nonclaimed prize monies, to be raffled off as
bonus prizes among their 2.4 million registered subscribers. A computer
chose the winners by selecting 500 subscriber numbers from a pool
of 2.4 million registered numbers without regard for whether or not a
given number had already appeared. The unsorted list of the 500 winning
numbers was published and to the astonishment of lottery officials, one
subscriber put in a claim for two cars. How surprising is it that this
happened?
1.58 A bowl contains 10 red and 10 blue balls. Three times you pick at random
two balls out of the bowl without replacement. What is the probability
that at least one pick has two balls of the same color?
1.59 In the Massachusetts Numbers Game, a four-digit number is drawn from
the numbers 0000, 0001, . . . , 9999 every evening (except Sundays). Let’s
assume that the same lottery takes place in ten other states each evening.
(a) What is the probability that the same number will be drawn in two or
more states next Tuesday evening?
(b) What is the probability that on some evening in the coming 300 draw-
ings, the same number will be drawn in two or more states?
1.60 The playlist of your iPhone has 15 songs by each of 30 artists. After each
song the playlist is shuffled and the next song is selected at random from
the 450 songs. What is the probability that you will hear the same artist
more than once in any 10-song block?
1.61 A producer of a new brand of breakfast cereal offers a baseball card in
each cereal box purchased. Ten different baseball cards are distributed
equally among the cereal boxes. What is the probability of getting the cards
of your two favorite teams when you buy five boxes?
1.62 What is the probability of getting at least one ace in a poker hand of five
cards dealt from an ordinary deck of 52 cards? What is the probability of
getting five cards of the same suit in a poker hand of five cards?
1.63 A fair die is rolled six times. What is the probability that each of the
numbers 5 and 6 appears at least once?
1.64 In the casino game of Chuck-a-Luck, three dice are contained within
an hourglass-shaped, rotating cage. You bet on one of the six numbers
1, . . . , 6 and the cage is rotated. You lose money only if your number
does not come up on any of the three dice. What is the probability that
your number will come up?
1.65 What is the probability that in a class of 23 children exactly two children
have birthdays on the same day? Hint: Number the $\binom{23}{2} = 253$ possible
combinations of two children as $1, \ldots, 253$ and define $A_i$ as the event that
the two children from combination i are the only two children having the
same birthday.
1.66 You roll a fair die six times in a row. What is the probability that all of
the six face values will appear? What is the probability that one or more
sixes will appear? What is the probability that the largest number rolled
is r?
1.67 For the upcoming drawing of the Bingo Lottery, five extra prizes have
been added to the pot. Each prize consists of an all-expenses-paid
vacation trip. Each prize winner may choose from among three possible
destinations A, B, and C. The three destinations are equally popular.
The prize winners choose their destinations independently of each other.
Calculate the probability that at least one of the destinations A and B
will be chosen. Also, calculate the probability that not all of the three
destinations will be chosen.
1.68 In the 6/42 lottery, six distinct numbers are picked at random from
the numbers 1, . . . , 42. What is the probability that at least two of the
numbers 7, 14, 21, 28, 35, and 42 will be picked?
1.69 You draw at random five cards from a standard deck of 52 cards. What is
the probability of having an ace among the five cards together with either
a king or a queen or both?
1.70 John and Paul play the following game. They each roll one fair die. John
wins the game if his score is larger than Paul’s score or if the product of
the two scores is an odd number. Is this a fair game?
1.71 From an ordinary deck of 52 cards, cards are randomly drawn one by one
and without replacement. What is the probability that an ace will appear
before any of the cards 2 through 10? What is the probability that two
aces will appear before any of the cards 2 through 10?
Hint: Define $A_{n,k}$ as the event that the rth appearance of outcome $O_1$ occurs at the nth trial and that outcome $O_2$ appears exactly k times in the first n − 1 trials. Use the identity $\sum_{l=0}^{\infty} (l+1)\cdots(l+m)\,x^l = m!/(1-x)^{m+1}$.
6 A permutation cycle is a subset of a permutation whose elements trade places with one another.
For example, take the permutation (2, 4, 3, 1) of (1, 2, 3, 4), then the subset {1, 4, 2} is a cycle of
length 3 with 1 → 4 → 2 → 1 and the subset {3} is a cycle of length 1 with 3 → 3.
prisoner either finds his own name or has inspected 50 boxes. What is the
probability that the group will not be exiled? Hint: Consider the contents
of the boxes as a random permutation of the integers 1, 2, . . . , 2n with
n = 50 and verify that the probability of having no cycle of length n + 1
or more is equal to $1 - \sum_{k=n+1}^{2n} 1/k$.
1.75 In a television show the remaining couple is shown three closed doors
behind which are hidden a new car, the car key, and a goat. One member
of the couple gets the task to find the car and the other member gets the
task to find the car key. The couple wins the car only if both partners
succeed in their respective tasks. Each member of the couple is allowed
to open two doors, where the second person cannot see what is behind
the doors opened by the first person. The couple may, however, agree on
a strategy beforehand. What is the best strategy?
1.76 Two-up is a traditional Australian gambling game. A spinner repeatedly
throws two pennies up in the air. “Heads” appear if both pennies land
with the head side facing up, “tails” appear if both pennies land with
the tails side facing up, and “odds” appear if one penny lands with the
head side up and the other lands with the tails side up. Consider the
following game variant. The spinner wins if three “heads” appear before
any appearance of “tails” and also before any intervening sequence of five
consecutive “odds.” A win pays 8.5 for 1. What is the win probability of
the spinner? Is the game favorable for the bettor?
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \sum_{k=0}^{n} \frac{(-1)^k}{k!}.$$
A surprising conclusion can be drawn from this result. A basic result from calculus is that $\sum_{k=0}^{\infty} (-1)^k/k! = e^{-1}$ with e = 2.71828 . . . , see Appendix C. Thus, for sufficiently large n, the probability that at least one person will receive his or her own coat is approximately equal to $1 - e^{-1} = 0.63212\ldots$, regardless of how large n is. This approximation is accurate to seven or more decimal places for n ≥ 10.
Having obtained the probability of at least one person getting the correct coat, it is not difficult to argue that
$$P(\text{exactly } j \text{ persons will receive their own coats}) = \frac{1}{j!} \sum_{k=0}^{n-j} \frac{(-1)^k}{k!},$$
using the fact that the number $N_m$ of permutations of m elements that leave no element in its original position satisfies
$$\frac{N_m}{m!} = \sum_{k=0}^{m} \frac{(-1)^k}{k!}.$$
Problems
1.77 A certain person is taking part in a blind taste test of ten different wines.
The person has been made aware of the names of the ten wine producers
beforehand, but does not know what order the wines will be served in.
Each wine producer may be named only once. After the tasting session
is over, it turns out that he has correctly identified five of the ten wines.
Do you think he is a connoisseur?
1.78 You belong to a class of 15 students. The teacher hands 15 graded
assignments back to the students in random order. What is the probability
that you are the only person who receives his own paper?
1.79 Five friends enter a restaurant on a rainy evening and check their coats
and umbrellas. As they are leaving, the lazy coat-checker returns both the
coats and the umbrellas in random order. What is the probability that at
least one person gets both the correct coat and the correct umbrella?
1.80 A blind taste test of ten different wines takes place. The participants are
told the names of the ten wines beforehand. The wines are presented in
random order and a participant may name each wine only once. There are
three Italian wines among the ten wines. A person who has no knowledge
about wine participates in the test. What is the probability that none of
the three Italian wines is correctly guessed by this person?
1.81 Three people each choose five distinct numbers at random from the num-
bers 1, 2, . . . , 25, independently of each other. What is the probability
that there is some number that is chosen by each of the three people?
1.82 What is the probability that a hand of 13 cards from an ordinary deck of
52 cards contains four cards of the same rank, such as four queens?
1.83 What is the probability that in a 13-card bridge hand at least one suit will
be missing? What is the probability that in a 5-card poker hand at least
one suit will be missing?
1.84 Suppose 8 Germans and 4 Italians are randomly assigned to 6 double
rooms. What is the probability that no room will have two people of
different nationalities?
1.85 Consider a communication network with four nodes n1 , n2 , n3 , and n4 ,
and six directed links l1 = (n1 , n2 ), l2 = (n1 , n3 ), l3 = (n2 , n3 ), l4 =
(n3 , n2 ), l5 = (n2 , n4 ), and l6 = (n3 , n4 ). A message has to be sent from
the source node n1 to the destination node n4 . The network is unreliable.
The probability that the link li is functioning is pi for i = 1, . . . , 6. The
links behave physically independently of each other. A path from node n1
to node n4 is only functioning if each of its links is functioning. What is
the probability that there is some functioning path from node n1 to node
n4 ? What is this probability when pi = p for all i?
1.86 What is the probability that in 30 lottery drawings of six distinct numbers
from the numbers 1 to 45, not all of the 45 numbers will come up?
1.87 The coupon-collector problem is as follows. Each time you buy a certain
product (chewing gum, for example) you receive a coupon (a baseball
card, for example), which is equally likely to be any one of n types.
2.1 Concept of Conditional Probability
$$f_n(A \mid B) = \frac{f_n(AB)}{f_n(B)}.$$
This relationship accounts for the definition of the conditional probability
P(A | B). The relative frequency interpretation also tells us how a conditional
probability must be estimated in a simulation study.
Example 2.1 Someone has rolled two dice out of your sight. You ask this
person to answer “yes or no” to the question of whether there is a six among
the two rolls. He truthfully answers “yes.” What is the probability that two
sixes have been rolled?
Solution. Let A be the event that two sixes show up and B the event that
at least one six shows up. It is helpful to imagine that one of the dice is
red and the other is blue. The sample space of this experiment is the set
{(i, j) | i, j = 1, . . . , 6}, where i and j denote the outcomes of the red and the blue die. A probability of 1/36 is assigned to each element of the sample space. The event A is given by the set A = {(6, 6)} and the event B by the
set B = {(1, 6), . . . , (5, 6), (6, 6), (6, 5), . . . , (6, 1)}. The probability of event B
occurring is 11/36 and the probability of both event A and event B occurring is 1/36. Given that you know that event B has occurred, the probability that event A has also occurred is P(A | B). Applying the definition of conditional probability gives
$$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{1/36}{11/36} = \frac{1}{11}.$$
Thus, the sought probability is 1/11 (not 1/6).
The above example illustrates once again how careful you have to be when
you are interpreting the information a problem is conveying. The wording of
the problem is crucial: you know that one of the dice turned up a six but you do
not know which one. In the case where one of the dice had dropped on the floor
and you had seen the outcome six for that die, the probability of the other die turning up a six would have been 1/6. An intuitive explanation of the difference
between the two probabilities is the fact that in the second scenario you have
the additional information of which one of the two dice (the red or the blue
die) shows a six.
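The relative-frequency interpretation mentioned earlier can be checked directly by simulation. The following sketch (illustrative Python, not part of the original text; the trial count is arbitrary) estimates P(A | B) for Example 2.1 as a ratio of relative frequencies and should return a value close to 1/11 rather than 1/6.

```python
import random

# Illustrative sketch (not from the text): estimate P(A | B) of Example 2.1 as the
# ratio of relative frequencies f_n(AB)/f_n(B). The estimate should approach 1/11.
n_trials = 1_000_000
count_B = 0    # trials with at least one six
count_AB = 0   # trials with two sixes

for _ in range(n_trials):
    red, blue = random.randint(1, 6), random.randint(1, 6)
    if red == 6 or blue == 6:
        count_B += 1
        if red == 6 and blue == 6:
            count_AB += 1

print("estimate of P(A | B):", count_AB / count_B)   # close to 1/11, not 1/6
```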
Example 2.2 John, Pedro, and Rosita are experienced darts players. The probability of John hitting the bullseye in a single throw is 1/3. This hitting probability is 1/5 for Pedro and 1/4 for Rosita. The three players each throw one dart simultaneously. Two of the darts hit the bullseye and one of the darts misses the bullseye. What is the probability that John is the one who missed?
Solution. The sample space of the chance experiment consists of the eight
elements (H, H, H), (H, H, M), (H, M, H), (H, M, M), (M, H, H), (M, H, M),
(M, M, H), (M, M, M), where H stands for “hit” and M stands for “miss.” The
first component of each element of the sample space refers to John’s throw, the
second component refers to Pedro’s throw, and the third component refers to
Rosita’s throw. By the independence of the outcomes of the individual throws,
we assign the probability 1/3 × 1/5 × 1/4 = 1/60 to the outcome (H, H, H), the probability 1/3 × 1/5 × 3/4 = 3/60 to the outcome (H, H, M), the probability 1/3 × 4/5 × 1/4 = 4/60 to the outcome (H, M, H), and so on, up to the probability 2/3 × 4/5 × 3/4 = 24/60 to the outcome (M, M, M). Let A be the event that John misses
and let B be the event that exactly two of the darts hit the target. The event
AB occurs if the outcome (M, H, H) occurs and the event B occurs if one of
the outcomes (H, H, M), (H, M, H), (M, H, H) occurs, and so P(AB) = 2/60 and P(B) = 3/60 + 4/60 + 2/60 = 9/60. Thus, the conditional probability that John is the one who missed is P(A | B) = (2/60)/(9/60) = 2/9.
Problems
2.1 Two fair dice are rolled. What is the conditional probability that the sum
of the two dice is 8, given that the two dice show different numbers?
2.2 You toss a nickel, a dime, and a quarter, independently of each other.
What is the conditional probability that the quarter shows up heads given
that the coins showing up heads represent an amount of at least 15 cents?
2.3 Every evening, two weather stations issue a weather forecast for the next
day. The weather forecasts of the two stations are independent of each
other. On average, the weather forecast of station 1 is correct in 90% of
cases, irrespective of the weather type. This percentage is 80% for station
2. On a given day, station 1 predicts sunny weather for the next day,
whereas station 2 predicts rain. What is the probability that the weather
forecast of station 1 will be correct?
2.4 A professor has given two subsequent tests to her students. 50% of the
students passed both tests and 80% of the students passed the first test.
What percentage of those who passed the first test also passed the second
test?
2.5 In a certain village, 30% of households have a cat and 25% of households
have a dog. Of the households with a cat, 20% also have a dog. What
percentage of households with a dog also have a cat?
2.6 The experiment is to toss a fair coin once and to roll a fair die once. Let A
be the event that the coin lands “heads” and B the event that the die lands
“six.” Given that at least one of the two events has occurred, what is the
probability that both events have occurred and what is the probability that
event A has occurred?
2.7 You simultaneously grab two balls at random from an urn containing
two red balls, one blue ball, and one green ball. What is the probability
that you have grabbed two non-red balls given that you have grabbed at
least one non-red ball? What is the probability that you have grabbed two
non-red balls given that you have grabbed the green ball? Can you give
an intuitive explanation of why the second probability is larger than the
first one?
2.8 The following game is played in a particular carnival tent. The carnival
master has two covered beakers, each containing one die. He shakes the
beakers thoroughly, removes the lids, and peers inside. You have agreed
that whenever at least one of the two dice shows an even number of
points, you will bet with even odds that the other die will also show an
even number of points. Is this a fair bet?
2.9 Suppose a bridge player’s hand of 13 cards contains an ace. What is the
probability that the player has at least one more ace? What is the answer
to this question if you know that the player had the ace of hearts? Can
you explain why the second probability is larger than the first one?
2.10 A hand of 13 cards is dealt from a standard deck of 52 cards. What
is the probability that it contains more aces than tens? How does this
probability change when you have the information that the hand contains
at least one ace?
2.11 A fair die is rolled three times. What is the probability that each number
rolled is higher than all those that were rolled earlier? Hint: Condition on
the event that each roll gives a different outcome.
2.12 Let A and B be two events with 0 < P(A) < 1 and 0 < P(B) < 1.
(a) Suppose that P(A | B) > P(B | A). Verify that P(A) > P(B).
(b) Suppose that P(A | B) > P(A). Verify the inequalities P(B | A) >
P(B) and P(Bc | A) ≤ P(Bc ), where Bc is the complement of event B.
(c) What is P(A | B) if A and B are disjoint? What is P(A | B) if B is a
subset of A?
In words, the probability that events A and B both occur is equal to the
probability that event A occurs multiplied by the probability that event B occurs
given that A has occurred. This phrasing lines up naturally with the intuitive
way people think about probabilities.
The formula P(AB) = P(A)P(B | A) is often referred to as the multiplication
rule for probabilities. This rule is a very useful tool for assigning or calculating
probabilities. In many cases, the rule is used in attributing probabilities to
elements of the sample space. In illustration of this, consider the experiment
in which two marbles are randomly chosen without replacement from a
receptacle holding seven red and three white marbles. One possible choice
for the sample space of this experiment is the set consisting of four elements
(R, R), (R, W), (W, W), and (W, R), where R stands for red and W for white.
The first component of each element indicates the color of the first marble
chosen and the second component the color of the second marble chosen. On
grounds of the reasoning that P(1st marble is red) = 7/10 and P(2nd marble is white | 1st marble is red) = 3/9, we attribute the probability P(R, W) = 7/10 × 3/9 = 7/30 to the element (R, W). In the same way we attribute the probabilities P(R, R) = 7/10 × 6/9 = 7/15, P(W, W) = 3/10 × 2/9 = 1/15, and P(W, R) = 3/10 × 7/9 = 7/30.
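As a quick sanity check, one can verify with exact fractions that the four probabilities attributed by the multiplication rule add up to 1. The sketch below is illustrative (the variable names are not from the text).

```python
from fractions import Fraction as F

# Illustrative sketch (not from the text): the multiplication-rule assignment for
# 7 red and 3 white marbles; the four probabilities must add up to 1.
p_RR = F(7, 10) * F(6, 9)
p_RW = F(7, 10) * F(3, 9)
p_WW = F(3, 10) * F(2, 9)
p_WR = F(3, 10) * F(7, 9)

print(p_RR, p_RW, p_WW, p_WR)       # 7/15, 7/30, 1/15, 7/30
print(p_RR + p_RW + p_WW + p_WR)    # 1
```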
Solution. The following probability model can be used to answer the question.
A bowl contains four red and four blue balls. Four times you pick at random
two balls from the bowl without replacement. What is the probability that you
pick each time a red and a blue ball? Defining Ai as the event that the ith pick
gives a red and a blue ball, this probability is P(A1 A2 A3 A4). We have P(A1) = (8 × 4)/(8 × 7) = 4/7 (or, alternatively, P(A1) = $\binom{4}{1}\binom{4}{1}\big/\binom{8}{2} = 4/7$), P(A2 | A1) = (6 × 3)/(6 × 5) = 3/5, P(A3 | A1 A2) = (4 × 2)/(4 × 3) = 2/3, and P(A4 | A1 A2 A3) = 1. Hence, by the chain rule,
$$P(A_1 A_2 A_3 A_4) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 A_2)P(A_4 \mid A_1 A_2 A_3) = \frac{4}{7} \times \frac{3}{5} \times \frac{2}{3} \times 1 = \frac{8}{35}.$$
In other words, the probability that the four British teams will avoid each other in the draw is 8/35.
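A simulation along the following lines can be used to check the answer 8/35 ≈ 0.2286 for a random pairing of the eight quarterfinalists (an illustrative sketch; the team labels are invented).

```python
import random

# Illustrative sketch (team labels invented): estimate the probability that the
# four British teams avoid each other in a random pairing of eight teams.
def british_avoid_each_other():
    teams = ["B1", "B2", "B3", "B4", "O1", "O2", "O3", "O4"]   # 4 British, 4 others
    random.shuffle(teams)
    pairs = [(teams[2 * i], teams[2 * i + 1]) for i in range(4)]
    return all(not (a.startswith("B") and b.startswith("B")) for a, b in pairs)

n = 200_000
print(sum(british_avoid_each_other() for _ in range(n)) / n)   # close to 8/35 = 0.2286
```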
Example 2.4 Two players A and B take turns rolling a fair die. Player A rolls
the die first and wins on a “six.” If A fails in rolling a “six,” then player B rolls
the die and wins on a “five” or “six.” If B fails, then A rolls again and wins on
a “four,” “five,” or “six,” and so on. What is the probability that player A will
be the winner?
Solution. Let Ei be the event that player A wins on the ith turn for i = 1, 3,
and 5. The events E1 , E3 , and E5 are mutually exclusive and so P(A wins) =
P(E1 ) + P(E3 ) + P(E5 ). Using the chain rule for conditional probabilities,
$$P(E_1) = \frac{1}{6}, \quad P(E_3) = \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6}, \quad\text{and}\quad P(E_5) = \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6}.$$
This gives
$$P(A \text{ wins}) = \frac{1}{6} + \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} + \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6} = \frac{169}{324}.$$
The game is not fair. Player A has a slight edge over player B.
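The three terms can be added with exact fractions, for instance as in this small sketch (illustrative, not from the text).

```python
from fractions import Fraction as F

# Illustrative sketch (not from the text): exact evaluation of the three terms.
p_e1 = F(1, 6)
p_e3 = F(5, 6) * F(4, 6) * F(3, 6)
p_e5 = F(5, 6) * F(4, 6) * F(3, 6) * F(2, 6) * F(5, 6)
print(p_e1 + p_e3 + p_e5, float(p_e1 + p_e3 + p_e5))   # 169/324, about 0.5216
```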
Example 2.5 A coin-tossing game is played with 10 players including the two
friends John and Pete. Each player has the same starting capital of d dollars.
There are nine rounds of the game. In each round, two persons play against
each other. A fair coin is tossed and, depending on the outcome of the toss, one
player pays one dollar to the other player. The two players continue tossing the
coin until one of them has won all the other player’s money. The winner of the
first round has doubled his starting capital and next plays against another player
until one has won all the other’s money, and so on. Suppose that John is in the
first game and Pete plays the last game against the survivor of the first nine
players. Who has the best chance of being the ultimate winner – John or Pete?
Solution. To solve this problem, consider first a game in which the player is
equally likely to either win or lose one dollar on each gamble. If the player
starts with a dollars and plays the game repeatedly until he either goes broke
or increases his bankroll to a + b dollars, then the probability of reaching his
a
desired goal and thus not going broke is a+b . This intuitively obvious result
is a special case of the gambler’s ruin formula that will be proved in the next
section. Next we turn to the stated problem. The answer to this problem is that
1
each of the 10 players has the same probability 10 of being the ultimate winner.
This answer might surprise you. Let us first argue that the probability of John
1
being the ultimate winner is 10 . Define Ai as the event that John enters the ith
round and wins that round. The gambler's ruin formula gives P(A1) = d/(d + d) = 1/2, P(A2 | A1) = 2d/(2d + d) = 2/3, P(A3 | A1, A2) = 3d/(3d + d) = 3/4, and so on. Hence, by the chain rule,
$$P(A_1 A_2 \cdots A_9) = \frac{1}{2} \times \frac{2}{3} \times \frac{3}{4} \times \cdots \times \frac{9}{10} = \frac{1}{10},$$
verifying that John will be the ultimate winner with probability 1/10. By the
same argument, the player meeting the winner of the first round in the second
round will be the ultimate winner with probability 1/3 × 3/4 × ⋯ × 9/10 = 1/10. Following this reasoning through, we find the same win probability of 1/10 for
each of the 10 participants of the game.
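The telescoping chain-rule products can be evaluated for every entry round at once. The sketch below is illustrative (not from the text) and assumes the gambler's ruin result quoted above for each round; it returns 1/10 for the player entering in each of the nine rounds, covering all ten participants.

```python
from fractions import Fraction as F
from math import prod

# Illustrative sketch (not from the text): the telescoping product for the player
# who enters the game in round k (k = 1 for John, k = 9 for Pete).
def win_probability(k):
    first = F(1, k + 1)   # win the entering round against a combined bankroll of k*d
    return first * prod(F(i, i + 1) for i in range(k + 1, 10))

print([win_probability(k) for k in range(1, 10)])   # every value equals 1/10
```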
Problems
2.17 You and your friend have lunch with three business partners. After lunch
each one puts a business card in a bowl. Two of the five cards are
randomly drawn from the bowl to determine who pays for the lunch.
What is the probability that you and your friend do not have to pay?
2.18 You are dealt a hand of four cards from a well-shuffled deck of 52 cards.
What is the probability that you receive the four cards J, Q, K, A in any
order, with suit irrelevant? Hint: Let Ai be the event that the ith card you
receive is a picture card that you have not received before.
2.19 A fair die is rolled six times. What is the probability of one or more sixes and no ones occurring?
2.20 There are two Spanish teams and two German teams among the eight
teams that have reached the quarterfinals of the soccer Champions
League. What is the probability that none of the Spanish teams will be
paired with a German team when the teams are paired randomly?
2.21 A jar contains three white and two black balls. Each time, you pick at
random one ball from the jar. If it is a white ball, a black ball is inserted
instead; otherwise, a white ball is inserted instead. You continue until all
balls in the jar are black. What is the probability that you need three picks
to achieve this? What is the probability that you need five picks?
2.22 Use conditional probabilities to solve Problems 1.6, 1.9–1.11, and 1.23.
2.23 You have received a tip that the management of a theater will give a
free ticket to the first person in line having the same birthday as someone
before him or her in line. Assuming that people enter the line one at a time
and you can join the line any time, what position in the line maximizes
your probability of getting the free ticket?
2.24 You travel from Amsterdam to Sydney with a change of airplane in Dubai and Singapore. You have one piece of luggage. At each stop your luggage is transferred from one airplane to the other. At the airport in Amsterdam there is a probability of 5% that your luggage is not placed in the right plane. This probability is 3% at the airport in Dubai and 2% at the airport in Singapore. What is the probability that your luggage does not reach Sydney with you? If your luggage does not reach Sydney with you, what is the probability that it was lost at Dubai airport?
2.25 Seven individuals have reserved tickets at the opera. The seats they have
been assigned are all in the same row of seven seats. The row of seats
is accessible from either end. Assume that the seven individuals arrive
in a random order, one by one. What is the probability of all seven
individuals taking their seats without having to squeeze past an already
seated individual? Hint: Consider the reverse process of people leaving
in random order.
2.26 In a poker game with three players A, B, and C, the dealer is chosen by
the following procedure. In the order A, B, C, A, B, C, . . ., a card from a
well-shuffled deck is dealt to each player until someone gets an ace. This
first player receiving an ace gets to start the game as dealer. Do you think
that everyone has an equal chance of becoming dealer?
2.27 Suppose there are n women–men couples at a party. What is the
probability that there are at least two couples such that the two women
have the same birthday and the two men have the same birthday? Hint:
Let Ai be the event that there is no match between the ith couple and any
of the couples 1, . . . , i − 1.
2.28 What is the probability that, in the next m draws of the 5/39 lottery,
there will be three consecutive draws with the feature that the same five
numbers appear in two or more of these three draws? In each draw five
distinct numbers are drawn at random from 1, 2, . . . , 39. Hint: Define Ai
as the event that each of the three draws i, i + 1, and i + 2 has a different
set of five numbers.
2.29 A bowl contains r red and b blue balls. The balls are removed from the
bowl, one at a time and at random. What is the probability that the first
red ball is removed on the kth pick? Can you tell without doing any
calculations what the probability is that there are still blue balls in the
bowl when the last red ball is picked?
2.30 In a group of 25 people, a person tells a rumor to a second person,
who in turn tells it to a third person, and so on. Each person tells the
rumor to just one of the people chosen at random, excluding the person
from whom he/she heard the rumor. The rumor is told 10 times. What
is the probability that the rumor will not be repeated to any one person
once more? What is the probability that the rumor will not return to the
originator?
P(AB) = P(A)P(B).
The reader should be aware that independent events and disjoint events are
completely different things. If events A and B are disjoint, you calculate the
probability of the union A ∪ B by adding the probabilities of A and B. For
independent events A and B, you calculate the probability of the intersection
AB by multiplying the probabilities of A and B. Beginning students sometimes
think that independent events are disjoint. This is not true. If two events A and
B are disjoint, then A ∩ B = ∅ and so P(AB) = 0, while P(AB) = P(A)P(B) for
independent events A and B. This shows that independent events with nonzero
probability are never disjoint.
Example 2.6 Suppose that a red and a blue die are thrown. Let A be the event
that the number shown by the red die is 4, B be the event that the sum of the
dice is 7, and C be the event that the sum of the dice is 8. Show that the events
A and B are independent, whereas the events A and C are dependent. Can you
explain this result intuitively?
Solution. The experiment has 36 possible outcomes (i, j), where i is the
number shown by the red die and j the number shown by the blue die. All
36 possible outcomes are equally likely. Simply, by counting, P(A) = 6/36, P(B) = 6/36, and P(AB) = 1/36. Since P(AB) = P(A)P(B), the events A and B are independent.
Problems
2.31 Suppose that a red and a blue die are thrown. Let A be the event that the
number shown by the red die is even, and B the event that the sum of the
dice is odd. Do you think the events A and B are independent?
2.32 Fifty different numbers are arranged in a matrix with 5 rows and
10 columns. You pick at random one number from the matrix. Let A
be the event that the number comes from an odd-numbered row and B
be the event that the number comes from the first five columns. Are the
events A and B independent?
2.33 Let A and B be independent events. Denote by the events Ac and Bc the
complements of the events A and B. Verify that the events A and Bc are
independent. Conclude from this result that the events Ac and Bc are also
independent.
2.34 Let A and B be two events such that P(A | B) = P(A | Bc ) and 0 <
P(B) < 1. Prove that A and B are independent events.
2.35 Suppose A1 , . . . , An are independent events. Prove that the probability that at least one of the events A1 , . . . , An occurs is $1 - (1 - P(A_1)) \cdots (1 - P(A_n))$.
2.36 Suppose that the events A, B, and C are independent, where P(A) = 1/2, P(B) = 1/3, and P(C) = 1/4. What is P(A ∪ B ∪ C)?
2.37 Suppose that A1 , A2 , . . . is a sequence of independent events with the property that $\sum_{n=1}^{\infty} P(A_n) = \infty$. Define A as the event that An occurs for infinitely many values of n. Prove that P(A) = 1. This result is known as the second Borel–Cantelli lemma. Hint: Argue that $P(A) = \lim_{n \to \infty} P\big(\bigcup_{k=n}^{\infty} A_k\big)$ and use the inequality $1 - x \le e^{-x}$ for $0 \le x \le 1$.
Rule 2.2 Let A be an event that can only occur if one of the mutually exclusive
events B1 , . . . , Bn occurs. Then,
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i).$$
To see this, write A = AB1 ∪ · · · ∪ ABn , where the set ABi is the intersection of the sets A and Bi . The assumption that the sets B1 , . . . , Bn are pairwise disjoint implies that the sets AB1 , . . . , ABn are also pairwise disjoint. By Rule 1.1 in Chapter 1, we then have
$$P(A) = P(AB_1) + \cdots + P(AB_n).$$
This relationship and the definition P(A | B) = P(AB)/P(B) lead to the law of
conditional probability. Obviously, this law is also applicable when the sample
space is divided by a countably infinite number of disjoint subsets B1 , B2 , . . .
instead of by a finite number.
It is often the case that the unconditional probability P(A) of an event A is
found most easily by conditioning on appropriately chosen events B1 , . . . , Bn
such that the Bi are mutually exclusive and event A can only occur when
one of the events Bi occurs. This is one of the most useful problem-solving
strategies in probability. A nice illustration of the law of conditional probability
is provided by the next example.
Example 2.7 In a television game show, you can win a small prize, a medium
prize, and a large prize. The large prize is a sports car. The three prizes are
“locked up” in different boxes. There are five keys randomly arranged in front
of you. Three of the keys can open only one lock and each of these keys fits
a different box. Another key is a dud that does not open any of the locks. The
final key is the “master key” that opens all three locks. You have a chance to
choose up to two keys. For that purpose, you are asked two quiz questions.
For each correct answer, you can select one key. The probability of correctly
answering any given quiz question is 0.5. Any key you have gained is tried on
all three boxes. What is the probability of winning the sports car?
Solution. Let A be the event that the sports car is won. To find P(A), condition
on the first step of the experiment and let Bi be the event that you have correctly
answered i questions, i = 0, 1, 2. By the law of conditional probability, P(A) =
P(A | B0 )P(B0 ) + P(A | B1 )P(B1 ) + P(A | B2 )P(B2 ). It is immediate that
P(B0) = 1/4, P(B1) = 2/4, and P(B2) = 1/4. Obviously, P(A | B0) = 0. Since two of the five keys fit the box with the sports car, we have P(A | B1) = 2/5. To get P(A | B2), it is easier to compute the complementary probability. The probability that neither of two randomly chosen keys fits the box with the sports car can be calculated as 3/5 × 2/4 = 3/10 (or, alternatively, as $\binom{3}{2}/\binom{5}{2} = 3/10$). This gives P(A | B2) = 7/10. Thus, the probability of winning the sports car is
$$P(A) = 0 \times \frac{1}{4} + \frac{2}{5} \times \frac{2}{4} + \frac{7}{10} \times \frac{1}{4} = \frac{3}{8}.$$
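A Monte Carlo check of this answer might look as follows (an illustrative sketch, not from the text; the key labels are invented, with only the master key and the car key opening the box with the sports car).

```python
import random

# Illustrative sketch (key labels invented): Monte Carlo check of Example 2.7.
# Only the 'car' key and the 'master' key open the box with the sports car.
def wins_car():
    keys = ["car", "master", "other1", "other2", "dud"]
    n_correct = sum(random.random() < 0.5 for _ in range(2))   # two quiz questions
    chosen = random.sample(keys, n_correct)
    return any(k in ("car", "master") for k in chosen)

n = 200_000
print(sum(wins_car() for _ in range(n)) / n)   # close to 3/8 = 0.375
```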
All aspects of conditional probability are addressed in the next example.
Example 2.8 Two boxes are placed in front of you. One box contains nine
$1 bills and one $5 bill, while the other box contains two $1 bills and one
$100 bill. You choose at random one of the two boxes and then two bills are
randomly picked out of the chosen box. It appears that these two bills are $1
bills. Next you get the opportunity to pick a third bill out of one of the two
boxes. Should you stick to the chosen box or should you switch to the other
box when you want to maximize the probability of picking the $100 bill?
Solution. Let A be the event that you have chosen the box with the $100 bill
and B the event that you have chosen the other box. Also, let E be the event
that the two bills taken out from the chosen box are $1 bills. The probability
of picking the $100 bill as the third bill is P(A | E) × 1 if you stick to the box chosen and is P(B | E) × 1/3 if you switch to the other box. Since
P(B | E) = 1 − P(A | E), it suffices to find P(A | E). To do so, note that
$$P(E) = P(E \mid A) \times \frac{1}{2} + P(E \mid B) \times \frac{1}{2},$$
by the law of conditional probability. Obviously,
$$P(E \mid A) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3} \quad\text{and}\quad P(E \mid B) = \frac{9}{10} \times \frac{8}{9} = \frac{4}{5}.$$
This gives P(E) = (1/3) × (1/2) + (4/5) × (1/2) = 17/30, and so P(A | E) = (1/3 × 1/2)/(17/30) = 5/17. Hence, the probability of picking the $100 bill as the third bill is 5/17 ≈ 0.294 if you stick to the chosen box and (1 − 5/17) × 1/3 = 4/17 ≈ 0.235 if you switch to the other box. You had better stick to the chosen box. A
surprising finding!
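The conditional probabilities 5/17 and 4/17 can also be estimated by simulation, discarding the trials in which the event E does not occur (an illustrative sketch, not from the text).

```python
import random

# Illustrative sketch (not from the text): Monte Carlo check of Example 2.8.
# Trials in which the event E (two $1 bills drawn) does not occur are discarded.
def trial():
    box_a = [1, 1, 100]          # box with the $100 bill
    box_b = [1] * 9 + [5]        # box with the $5 bill
    chosen, other = (box_a, box_b) if random.random() < 0.5 else (box_b, box_a)
    drawn = random.sample(chosen, 2)
    if drawn != [1, 1]:
        return None
    remaining = chosen.copy()
    remaining.remove(1)
    remaining.remove(1)
    stick_win = random.choice(remaining) == 100
    switch_win = random.choice(other) == 100
    return stick_win, switch_win

results = [r for r in (trial() for _ in range(400_000)) if r is not None]
print(sum(s for s, _ in results) / len(results))   # close to 5/17 = 0.294 (stick)
print(sum(w for _, w in results) / len(results))   # close to 4/17 = 0.235 (switch)
```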
The next example deals with a famous problem known as the gambler’s
ruin problem. This problem can be seen as a random walk with two absorbing
barriers.
Example 2.9 John and Pete enter a coin-tossing game, where John starts with
a dollars and Pete with b dollars. They repeatedly flip a coin until one of
them has won all of the money. If the coin lands heads, John gets one dollar
from Pete; otherwise, John loses one dollar to Pete. The coin lands heads with
probability p and tails with probability q = 1 − p, where 0 < p < 1. What is
the probability that John will win all of the money?
Solution. Let P(a, b) denote the probability of John winning all of the money
when John starts with a dollars and Pete with b dollars. The gambler’s ruin
formula states that
$$P(a, b) = \begin{cases} \dfrac{1 - (q/p)^a}{1 - (q/p)^{a+b}} & \text{if } p \neq q, \\[1ex] \dfrac{a}{a + b} & \text{if } p = q. \end{cases}$$
To prove this formula, let A be the event that John wins all of the money. A
recurrence equation for P(A) = P(a, b) is obtained by conditioning on the
outcome of the first flip. Let E be the event that the first flip lands heads and let
F be the event that the first flip lands tails. By the law of conditional probability,
P(A) = P(A | E)P(E) + P(A | F)P(F).
If the first flip lands heads, you get the changed situation that John has a + 1
dollars and Pete has b − 1 dollars. This gives P(A | E) = P(a + 1, b − 1).
Similarly, P(A | F) = P(a − 1, b + 1). Since P(E) = p and P(F) = q, we
obtain the recurrence equation
P(a, b) = P(a + 1, b − 1) × p + P(a − 1, b + 1) × q.
Copying this equation with the starting capitals a and b replaced by i and a +
b − i gives a system of equations for Pi = P(i, a + b − i). These equations are
Pi = pPi+1 + qPi−1 for i = 1, . . . , a + b − 1
with the boundary condition P0 = 0. Since p + q = 1, the recursion can be rewritten as $P_{i+1} - P_i = (q/p)(P_i - P_{i-1})$. Iterating this relation and using P0 = 0 gives $P_k - P_{k-1} = (q/p)^{k-1} P_1$, and so
$$P_k = P_1 \sum_{i=0}^{k-1} \Big(\frac{q}{p}\Big)^i \quad\text{for } 1 \le k \le a + b.$$
Using the identity $(1 - a) \sum_{i=0}^{m} a^i = 1 - a^{m+1}$, it follows that
$$P_k = \frac{1 - (q/p)^k}{1 - q/p}\, P_1 \quad\text{if } p \neq q \qquad\text{and}\qquad P_k = k P_1 \quad\text{if } p = q = \tfrac{1}{2}.$$
The boundary condition $P_{a+b} = 1$ leads to $P_1 = (1 - q/p)\big/\big(1 - (q/p)^{a+b}\big)$ if p ≠ q and $P_1 = 1/(a + b)$ if p = q = 1/2. Hence,
$$P_k = \frac{1 - (q/p)^k}{1 - (q/p)^{a+b}} \quad\text{for } k = 1, \ldots, a + b$$
if p ≠ q and $P_k = k/(a + b)$ if p = q = 1/2, proving the gambler's ruin formula.
To illustrate the gambler’s ruin formula, suppose that you go to the casino
with $50, and it is your goal to multiply your capital to $250. You decide to
play European roulette and to stake each time the same amount of s dollars on
red for each spin of the wheel. You get back twice your stake with probability
18 19
37 and lose your stake with probability 37 . What is the probability of reaching
your goal for s = 10, 25, and 50? To find the probabilities, we take p = 18 37
and q = 1937 in the gambler’s ruin formula and calculate P(a, b) for a = 5 and
b = 20 when s = 10, for a = 2 and b = 8 when s = 25, and for a = 1
and b = 4 when s = 50. This results in the probabilities 0.1084, 0.1592, and
0.1790 for s = 10, 25, and 50. It is intuitively obvious that the probability gets
larger for higher stakes, since your bankroll is exposed to the house advantage
of the casino for a shorter time when you play more boldly.
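A small sketch of this calculation (illustrative Python; the function name is not from the text):

```python
# Illustrative sketch (function name not from the text): the gambler's ruin
# probabilities for the roulette illustration with p = 18/37.
def ruin_formula(a, b, p):
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        return a / (a + b)
    r = q / p
    return (1.0 - r**a) / (1.0 - r**(a + b))

p = 18 / 37
for s, (a, b) in [(10, (5, 20)), (25, (2, 8)), (50, (1, 4))]:
    print(s, round(ruin_formula(a, b, p), 4))   # 0.1084, 0.1592, 0.1790
```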
The gambler’s ruin formula was derived by parameterizing the starting state
and then conditioning on what happens in the first step of the process. This
powerful approach can also be used to analyze success runs.
n/r 3 4 5 6 7 8 9 10
10 0.826 0.465 0.217 0.094 0.039 0.016 0.006 0.002
25 0.993 0.848 0.550 0.300 0.151 0.073 0.035 0.017
50 1.000 0.981 0.821 0.544 0.309 0.162 0.082 0.041
75 1.000 0.998 0.929 0.703 0.438 0.242 0.126 0.064
100 1.000 1.000 0.972 0.807 0.542 0.315 0.169 0.087
150 1.000 1.000 0.996 0.918 0.697 0.440 0.247 0.131
200 1.000 1.000 0.999 0.965 0.799 0.542 0.318 0.172
with a fair coin, you should expect the longest run of heads to be about seven
in a row and the longest run of either heads or tails to be about eight in a row.
A rule of thumb is that, in n tosses of a coin with probability p of heads, the probability mass of the longest run of heads is strongly concentrated around the integer nearest to $\log_{1/p}\big(n(1-p)\big)$, while the probability mass of the longest run of either heads or tails is strongly concentrated around the integer nearest to $\log_{1/p}\big(n(1-p)\big) + 1$, provided that n(1 − p) is sufficiently large.1 The
theoretical findings for runs in coin tossing put winning streaks in baseball into
perspective. In the 2015 baseball season the New York Mets won 11 games in
a row, which got a lot of attention. While this event is remarkable in isolation,
you can expect the event to happen at a certain moment. In the baseball season
each team plays 162 games. The probability is 23.06% that, among seven seasons of 162 games, there will be some season in which a particular team wins 11 or more games in a row, when this team wins any single game with probability 50%. The most likely outcome for the longest winning streak
of the team in a typical 162-game season is six games. An 11-game winning
streak is the most likely outcome in a season when the team has a winning
percentage of 70%.
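The rule of thumb quoted above can be evaluated directly; the sketch below (illustrative, not from the text) reproduces the values six and eleven for a 162-game season.

```python
from math import log

# Illustrative sketch (not from the text): the rule-of-thumb center of the
# longest-run distribution, applied to a 162-game season.
def most_likely_longest_run(n, p):
    return round(log(n * (1 - p), 1 / p))

print(most_likely_longest_run(162, 0.5))   # 6, for a team winning 50% of its games
print(most_likely_longest_run(162, 0.7))   # 11, for a team winning 70% of its games
```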
To conclude this section, we use conditional probabilities to solve a real-life
problem.
Example 2.11 Two candidates A and B remain in the finale of a television
game show. At this point, each candidate must spin a wheel of fortune. The
numbers 1, 2, . . . , 100 are listed on the wheel and when the wheel has stopped
spinning, a pointer stops randomly on one of the numbers. Each player has a
choice of spinning the wheel one or two times, whereby a second spin must
immediately follow the first. The goal is to reach a total closest to but not
exceeding 100 points. A player whose total exceeds 100 gets a final score of
zero. The winner is the player who gets the highest score. Should both players
have the same final score, then the player to spin the wheel first is the winner.
Player A has to spin first. What is the optimal strategy for player A?
Solution. Let us take a general setup and assume that the numbers 1, 2, . . . , R
are listed on the wheel. The optimal strategy of the second player B is obvious.
This player stops after the first spin only if the score is larger than the final score
of player A. To find the optimal strategy of player A, let S(a) be the conditional
win probability of player A if A stops after the first spin given that this spin
results in a score of a points, and let C(a) be the conditional win probability of
1 This result was obtained in M. F. Schilling, “The surprising predictability of long runs,”
Mathematics Magazine 85 (2012): 141–149.
player A if A continues after the first spin given that this spin results in a score
of a points. To find S(a), we condition on the outcome of the first spin of player
B. Given that player B has scored b points in the first spin with b ≤ a, then
player A is the final winner if player B scores in the second spin either at most
a − b points or more than R − b points. By the law of conditional probability,
$$S(a) = \sum_{b=1}^{a} \Big(\frac{a - b}{R} + \frac{b}{R}\Big) \times \frac{1}{R} = \frac{a^2}{R^2} \quad\text{for } 1 \le a \le R.$$
The expression for C(a) follows directly from S(a). By conditioning on the
outcome of the second spin of player A, we have
$$C(a) = \sum_{k=1}^{R-a} S(a + k) \times \frac{1}{R} = \sum_{k=1}^{R-a} \frac{(a + k)^2}{R^3} \quad\text{for } 1 \le a \le R.$$
Taking R = 100, numerical calculations give that C(a) > S(a) for 1 ≤ a ≤ 53
and S(a) > C(a) for 53 < a ≤ 100. Hence, player A stops after the first spin if
this spin gives a score larger than 53; otherwise, A continues. Then, by the law
of conditional probability, the probability of player A winning is
$$\sum_{a=1}^{53} C(a) \times \frac{1}{100} + \sum_{a=54}^{100} S(a) \times \frac{1}{100} = 0.4596.$$
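The numerical calculations behind the threshold 53 and the win probability 0.4596 can be reproduced with a few lines of code (an illustrative sketch, not the author's program).

```python
# Illustrative sketch (not the author's program): numerical check of Example 2.11.
R = 100
S = {a: a**2 / R**2 for a in range(1, R + 1)}
C = {a: sum((a + k)**2 for k in range(1, R - a + 1)) / R**3 for a in range(1, R + 1)}

threshold = max(a for a in range(1, R + 1) if C[a] > S[a])
win_prob = (sum(C[a] for a in range(1, threshold + 1)) +
            sum(S[a] for a in range(threshold + 1, R + 1))) / R
print(threshold, round(win_prob, 4))   # 53 and 0.4596
```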
Problems
2.38 You have two identical boxes in front of you. One of the boxes contains
10 balls numbered 1 to 10 and the other box contains 25 balls numbered
1 to 25. You choose at random one of the boxes and pick a ball at random
from the chosen box. What is the probability of picking the ball with
number 7 written on it?
2.39 A drunkard removes two randomly chosen letters of the message HAPPY
HOUR that is attached to a billboard in a pub. His drunk friend puts the
two letters back in a random order. What is the probability that HAPPY
HOUR appears again?
2.40 On the television show “Deal or No Deal,” you are faced with 26 brief-
cases in which various amounts of money have been placed, including
the amounts $1,000,000 and $750,000. You first choose one case. This
case is “yours” and is kept out of play until the very end of the game.
Then you play the game and in each round you open a number of the
other cases. What is the probability that the cases with $1,000,000 and
$750,000 will still be in the game when you are going to open 20 cases?
2.41 In the UK National Lottery, six different numbers are drawn from a
bucket with 59 balls numbered 1 to 59. To win the jackpot, you have
to match all six numbers drawn. The lottery has added a new prize. If
you match exactly two numbers, you win a lucky-dip ticket. The ticket
is used on the next draw. If you then match exactly two numbers on the
lucky-dip ticket, you get a new lucky-dip ticket, and so on. What is the
probability of ever winning the jackpot when you buy one ticket only
once?
2.42 Suppose that Joe arrives home on time with probability 0.8. If Joe
does not arrive home on time, the probability that his dinner is burnt
is 0.5; otherwise, his dinner is burnt with probability 0.15. What is the
probability that Joe’s dinner will be burnt on any given day? What is the
probability that Joe arrived home on time given that his dinner is burnt?
2.43 You owe $20,000 to a loan shark and you will run into deep trouble if the
loan shark is not repaid before midnight. You decide to go to a newly
opened casino having a special roulette game with the three roulette
bets: a 9-numbers bet, a 12-numbers bet, and an 18-numbers bet. For
a k-numbers bet, you get back 36/k times your stake with probability k/37
070469. At the other end of the ticket, another six-digit number is printed,
but this number is hidden by a layer of scratch-away silver paint. The
ticket holder scratches the paint away to reveal the underlying number. If
the number is the same as the number at the other end of the ticket, it is
a winning ticket. The two six-digit numbers on each of the one million
tickets printed each week are randomly generated in such a way that no
two tickets are printed with the same visible numbers or the same hidden
numbers. Assume that in a particular week only half of the tickets printed
are sold. What is the probability of exactly r winners in that week for r =
0, 1, . . .? Hint: Use the results for the matching problem in Example 1.19.
2.47 The points 1, 2, . . . , 12 are around a ring like a clock. A transition from a
point is to either of the two adjacent points with equal probabilities. What
is the probability of visiting all points before returning to the starting
point?
2.48 A thoroughly shuffled deck of cards has r red and b black cards.
The top card is removed from the deck and then a card is picked at
random from the deck. Use the law of conditional probability to find
the probability that the second card picked is red. Can you explain the
answer intuitively?
2.49 John and Pete each toss a fair coin, independently of each other. John
tosses the coin until two heads have been obtained and Pete tosses the
coin until three heads have been obtained. What is the probability that
John will need more tosses than Pete?
2.50 Twenty-five persons attended a “reverse raffle,” in which everyone
bought a number. Numbered balls were then drawn out of a bin, one
at a time and at random. The last ball in the bin would be the winner. But
when the organizers got down to the last ball, they discovered that three
numbered balls had been unintentionally overlooked. They added those
balls to the bin and continued the draw. Was the raffle still fair? Use the
law of conditional probability to answer this question.
2.51 Two players A and B enter a coin-tossing game, where player A starts
with a = 4 dollars and player B with b = 5 dollars. They repeatedly flip
a fair coin until one of them has won all of the money. If the coin lands
heads, then player A gets one dollar from player B; otherwise, player A
loses one dollar to player B. After the game has been played, you are
informed that player A is the winner. What is the probability distribution
of the lowest value of player A's bankroll during the game?
2.52 The upcoming Tour de France bicycle tournament will take place from
July 1 through July 23. One hundred eighty cyclists will participate in
the event. What is the probability that two or more participating cyclists
will have birthdays on the same day during the tournament? Hint: Let
Bi be the event that there are i cyclists having their birthday during the
tournament.
2.53 A drunkard leaves the pub at 22:30 hours. Twenty steps to the left of
the pub is a police station and ten steps to the right is his home. Every
30 seconds the drunkard makes a step either to the left or to the right. Any
step is with equal probabilities to the left or to the right, independently of
the other steps. The drunkard is thrown in a cell if he reaches the police
station. What is the probability that he is not locked up in a police cell
and reaches his home no later than midnight?
2.54 A tennis tournament is arranged for 8 players. It is organized as a
knockout tournament. First, the 8 players are randomly allocated over
four groups of two players each. In the semifinals, the winners of groups
1 and 2 meet each other, as do the winners of groups 3 and 4. In any match,
either player has a probability 0.5 of winning. John and Pete are among
the 8 players. What is the probability that they meet each other in the
semifinals? What is the probability that they meet each other in the final?
2.55 There are two bags in front of you. One bag has three white balls and one
red ball. The other has one white ball and three red balls. You choose one
of the bags at random and pick randomly one ball out of this bag. You
notice the ball picked is red. You then put it back and pick another ball
out of the same bag at random. What is the probability that the second
ball picked is red?
2.56 Dave and Eric alternately roll two dice. The first player who fails to
surpass the sum of the two dice in the previous roll loses. What is the
probability that Dave will win the game when Dave starts the game?
Hint: Parameterize and define ds as the probability that Dave will win the
game when the game begins with Dave rolling the dice and Dave has to
roll more than s points in his first roll.
2.57 Consider again Example 1.18. What is the probability that there will be
exactly j stops at which nobody gets off. More generally, what is the prob-
ability of exactly j empty bins when m balls are sequentially placed into
one of b bins with m ≥ b, where for each ball a bin is selected at random?
2.58 A jar contains five blue and five red balls. You roll a fair die. Next you
simultaneously draw at random as many balls from the jar as the score
of the die. What is the probability that each of the balls drawn is blue?
What is the inverse probability that the score of the die is r, given that
each of the balls drawn is blue?
2.59 Two fair dice are rolled twice. What is the probability that both rolls
show the same combination of two numbers?
2.60 In some basketball competition, 14 teams participate. The first six teams
in the final ranking are automatically placed for the playoffs. There are
no ties in the final ranking. The other eight teams participate in a lottery
for the remaining two spots in the playoffs. The team placed 7th in the
competition is given 8 tickets in the lottery, the team placed 8th is given
7 tickets, and so on, until the team placed 14th is given 1 ticket. A ticket
is picked at random and the first-place draft is awarded to the team
holding that ticket. That winner’s remaining tickets are removed and a
second ticket is selected at random for second place in the draft. What
is the probability that the team placed jth in the competition will win the
first place in the draft and what is the probability that this team will win
the second place in the draft for j = 7, . . . , 14?
2.61 Imagine a coin is tossed infinitely often. The coin lands on heads with
probability p and on tails with probability 1 − p. It is assumed that p ≤ 1/2.
Use the gambler’s ruin formula to explain that after the first toss, the num-
ber of heads will ever be equal to the number of tails with probability 2p.
2.62 A drunkard is wandering back and forth on a road. At each step he moves
two units distance to the north with a probability of 1/2, or one unit to the south with a probability of 1/2. Show that with probability $\frac{1}{2}(\sqrt{5} - 1)$ the
drunkard will ever visit the point which is one unit distance south from
his starting point. Can you explain why this probability also gives the
probability that the number of heads will ever exceed twice the number
of tails if a fair coin is tossed over and over?
2.63 Your friend has chosen at random a card from a standard deck of 52 cards,
but keeps this card concealed. You have to guess what card it is. Before
doing so, you can ask your friend either the question whether the chosen
card is red or the question whether the card is the ace of spades. Your
friend will answer truthfully. What question would you ask?
2.64 Player 1 tosses m + 1 times a fair coin and player 2 tosses m times a fair
coin. Player 1 wins the game if player 1 tosses more heads than player 2;
otherwise, player 2 wins. What is the probability that player 1 will win
the game?
2.65 You repeatedly roll two fair dice. What is the probability of two
consecutive totals of 7 appearing before a roll with double sixes?
2.66 Consider again Example 2.11. It is now assumed that lots are drawn to
determine the winner when there is a tie. Find the optimal strategy for
player A and the corresponding win probability.
2.67 A carnival booth offers the following game of chance. Under each of six
inverted cups is a colored ball, in some random order. The six balls are
colored red, blue, yellow, orange, green, and purple. You get six tokens.
All you have to do is guess the color of the ball under each of the cups,
where you handle one cup at a time. Every time you guess, you risk a
token. If your guess is wrong, you lose the token. Each time you guess
correctly, the ball is uncovered and you keep the token. Use conditional
probabilities to obtain the probability of guessing all six balls before
running out of tokens. Hint: Define p(i, t) as your success probability
once you have reached cup i with t tokens left and derive a recursion
equation for p(i, t).
2.68 A fair die is rolled repeatedly. Let pn be the probability that the sum
of scores will ever be n. Use the law of conditional probability to
find a recursion equation for pn . Verify numerically that pn tends to 1/3.5 = 0.2857 as n gets large. Can you give an intuitive explanation of this result?
2.69 Use the recursive approach given in Example 2.10 to solve the following
two longest-run problems.
(a) A sequence of independent trials is performed. Each trial results in
a success with probability p and in a failure with probability 1 − p.
What is the probability of getting a run of either at least r successes
or at least r failures, or both, in N trials?
(b) A bowl contains r red balls and b blue balls. Balls are randomly
picked out of the bowl, one by one and without replacement. What
is the probability that the longest run of red balls will be L or more?
2.70 Each of seven dwarfs has his own bed in a common dormitory. Every
night, they retire to bed one at a time, always in the same sequential
order. On a particular evening, the youngest dwarf, who always retires
first, has had too much to drink. He randomly chooses one of the seven
beds to fall asleep on. As each of the other dwarfs retires, he chooses
his own bed if it is not occupied, and otherwise randomly chooses
another unoccupied bed. What is the probability that the kth dwarf can
sleep in his own bed for k = 1, . . . , 7? Hint: Use a general setup and
define p(k, n) as the probability that the kth dwarf will not sleep in
his own bed for the situation of n dwarfs with the first dwarf being
drunk. Use a conditioning argument to obtain a recursion equation for
p(k, n).
2.71 (difficult) Consider the three-player variant of Example 2.11. What is
the optimal strategy for the first player A and what is the overall win
probability of player A? Next use the probability distribution function
of the final score of player A to find the overall win probabilities of the
players B and C.
2.4 Bayes' Rule in Odds Form
For example, an event A with P(A) = 2/3 has odds 2. It is said that the odds are 2 to 1 (written 2:1) in favor of the event A. Conversely, odds $o_A$ of an event A correspond to the probability
$$P(A) = \frac{o_A}{1 + o_A}.$$
The odds form of Bayes’ rule is stated in the next rule.
Rule 2.3 The posterior probability P(H | E) satisfies
$$\frac{P(H \mid E)}{P(\overline{H} \mid E)} = \frac{P(H)}{P(\overline{H})} \times \frac{P(E \mid H)}{P(E \mid \overline{H})}.$$
In words, Bayes' rule in odds form states that
$$\text{posterior odds} = \text{prior odds} \times \text{likelihood ratio}.$$
This insightful formula follows from first principles. Using the definitions
P(AB) = P(A)P(B | A) and P(BA) = P(B)P(A | B), we get the basic form of
the rule of Bayes:
$$P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)}.$$
This is one of the most beautiful formulas in probability theory. Next, Bayes’
rule in odds form follows by noting that
$$P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(E)} \quad\text{and}\quad P(\overline{H} \mid E) = \frac{P(\overline{H})\,P(E \mid \overline{H})}{P(E)}.$$
The factor $P(H)/P(\overline{H})$ represents the prior odds in favor of the hypothesis H before the evidence has been presented. The ratio of $P(E \mid H)$ and $P(E \mid \overline{H})$ is called
the likelihood ratio or the Bayes factor. Bayes’ rule updates the prior odds of
the hypothesis H by multiplying them by the likelihood ratio and thus measures
how much new evidence should alter a belief in a hypothesis. The likelihood
ratio is the probability of finding the evidence if the hypothesis is true divided
by the probability of finding the evidence if the hypothesis is not true. If the
likelihood ratio is greater than one then the evidence supports the hypothesis, if
the likelihood ratio is less than one it supports the negation of the hypothesis,
and if the likelihood ratio equals one then the evidence favors neither. The
likelihood ratio is a measure of the probative value of the evidence. In practical
situations such as in judicial decision making, the likelihood ratio is typically
determined by an expert. However, it is not the expert’s task to tell the court
what the prior odds are. The prior probability P(H) represents the personal
opinion of the court before the evidence is taken into account.
With two pieces of evidence E1 and E2 that are sequentially obtained, Bayes’
rule can be applied iteratively. You could use the first piece of evidence to
calculate initial posterior odds, and then use that posterior odds as new prior
odds to calculate second posterior odds given the second piece of evidence.
As an illustration, suppose that a closed box contains either three red balls or
two red balls together with one blue ball. Initially you believe that the two
possibilities are equally likely. Then a randomly chosen ball is removed from
the box. It appears to be a red ball. What is your new belief in the hypothesis
H that the box originally contained three red balls? The posterior odds are
calculated as $\frac{1/2}{1/2} \times \frac{1}{2/3} = \frac{3}{2}$. Therefore, the probability $\frac{3/2}{1 + 3/2} = \frac{3}{5}$ represents your new belief in the hypothesis H. Next, a second ball is randomly taken out of the box. Again, it appears to be a red ball. Then the new posterior odds are calculated as $\frac{3/5}{2/5} \times \frac{1}{1/2} = 3$ and your belief in the hypothesis H is revised to $\frac{3}{1 + 3} = \frac{3}{4}$.
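The iterative updating just described is easy to mechanize; the following sketch (illustrative Python, not from the text) reproduces the beliefs 3/5 and 3/4 for the closed-box example.

```python
# Illustrative sketch (not from the text): iterative updating with Bayes' rule in
# odds form for the closed-box example (H: the box holds three red balls).
def update_odds(prior_odds, p_evidence_given_h, p_evidence_given_not_h):
    return prior_odds * (p_evidence_given_h / p_evidence_given_not_h)

odds = 0.5 / 0.5                        # prior odds for H
odds = update_odds(odds, 1.0, 2 / 3)    # first ball drawn is red -> odds 3/2
print(odds / (1 + odds))                # belief 3/5
odds = update_odds(odds, 1.0, 1 / 2)    # second ball drawn is red -> odds 3
print(odds / (1 + odds))                # belief 3/4
```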
In applying Bayes’ rule in odds form, the main step is identifying the
hypothesis H and the evidence E. Bayes’ rule forces you to make transparent
any implicit assumption you are making.
Before we give several applications of Bayes’ rule, the following remark is
made. In both legal and medical cases, the conditional probabilities P(H | E)
and P(E | H) are sometimes confused with each other. As a classic example,
suppose that at the scene of a crime blood is found that must belong to the
offender. The blood type is found in approximately one in every hundred
thousand people. In the police database a person is found who matches
the blood type of the offender, but there is no other evidence against this
person. The prosecutor argues: “the probability that the suspect would have
the particular blood type is 0.001% if he were innocent. The suspect has the
particular blood type and therefore the probability is only 0.001% that he is
not guilty.” Letting H be the event that the suspect is innocent and E be the
event that the blood of the suspect matches the blood type of the offender,
the prosecutor confuses the probability P(H | E) – the relevant probability –
with the probability P(E | H). A famous example of the prosecutor’s fallacy
is the court case of People vs. Collins in Los Angeles in 1964. In this case,
a couple matching the description of a couple that had committed an armed
robbery was arrested. Based on expert testimony, the district attorney claimed
that the frequency of couples matching the description was roughly 1 in
12 million. Although this was the estimate for the probability that an innocent
couple matches the description, the district attorney treated this estimate as if
it was the probability that a couple matching the description is innocent and
incorrectly concluded that the couple was guilty beyond reasonable doubt. The
prosecutor’s fallacy had dramatic consequences in the case of Regina vs. Sally
Clark in the UK in 1999. Sally Clark was convicted of murder because of the cot deaths of two of her newborn children within a period of one year. A review of her case benefited from Bayesian arguments and led to her release in 2001.
Example 2.12 It is believed by a team of divers that a sought-after wreck will
be in a certain sea area with probability p = 0.4. A search in that area will
detect the wreck with probability d = 0.9 if it is there. What is the revised
probability of the wreck being in the area when the area is searched and no
wreck is found?
Solution. Let the hypothesis H be the event that the wreck is in the area in
question and the evidence E be the event that the wreck has not been detected
in that area. Before the outcome of the search, the events H and H̄ have the subjective probabilities P(H) = 0.4 and P(H̄) = 0.6. Further, P(E | H) = 1 − d = 0.1 and P(E | H̄) = 1. By Bayes' rule in odds form,

P(H | E) / P(H̄ | E) = (0.4/0.6) × (0.1/1) = 1/15,

so the revised probability that the wreck is in the area in question is (1/15)/(1 + 1/15) = 1/16 ≈ 0.0625.
2 Lawrence D. Stone et al., “Search for the wreckage of Air France flight AF 447,” Statistical
Science 59 (2014): 69–80.
The priors are P(H) = 0.01 and P(H̄) = 0.99. Further, we have P(E | H) = 1 and P(E | H̄) = 0.5. Then, by applying Bayes' rule in odds form, we get

P(H | E) / P(H̄ | E) = (0.01/0.99) × (1/0.5) = 1/49.5.

In the perception of the woman, the updated value of the probability that she has the particular gene is equal to (1/49.5)/(1 + 1/49.5) = 2/101 when the test is done under the
deal with her doctor and she is not told the test results.
Example 2.14 A murder is committed. The perpetrator is either one or the
other of the two persons X and Y. Both persons are on the run from the
authorities, and after an initial investigation, both fugitives appear equally
likely to be the perpetrator. Further investigation reveals that the actual
perpetrator has blood type A. Ten percent of the population belongs to the
group having this blood type. Additional inquiry reveals that person X has
blood type A, but offers no information concerning the blood type of person
Y. In light of this new information, what is the probability that person X is the
perpetrator?
Solution. In answering this question, use H to denote the event that person X is
the perpetrator. Let E represent the new evidence that person X has blood type
A. Before the appearance of evidence E, the events H and H̄ have probabilities P(H) = P(H̄) = 1/2. Further, we have that P(E | H) = 1 and P(E | H̄) = 1/10. By Bayes' rule in odds form,

P(H | E) / P(H̄ | E) = (1/2)/(1/2) × 1/(1/10) = 10.

The odds in favor, then, are 10 to 1 that person X is the perpetrator given that this person has blood type A. Otherwise stated, P(H | E) = 10/11. The probability of Y being the perpetrator is 1 − 10/11 = 1/11 and not, as may be thought, 1/10 × 1/2 = 1/20. The error in this reasoning is that the probability of person Y having blood type A is not 1/10, because Y is not a randomly chosen person; rather, Y is first of all a person having a 50% probability of being the perpetrator, whether or not he is found at a later time to have blood type A.
The Bayesian analysis sharpens our intuition in a natural way.
Another nice illustration of Bayes’ rule in odds form is provided by legal
arguments used in the discussion of the O. J. Simpson trial.3 In the analysis of
this example, we will use an extension of Bayes' rule in odds form:

P(G | ME) / P(Ḡ | ME) = [P(G | M) / P(Ḡ | M)] × [P(E | GM) / P(E | ḠM)],

where G represents the event that the husband is guilty of the murder of his wife, M the event that the wife has been murdered, and E the event that the husband has physically abused his wife in the past. How do we estimate the conditional probabilities on the right-hand side
of this formula? In 1992, 4,936 women were murdered in the United States,
of which roughly 1,430 were murdered by their (ex-)husbands or boyfriends.
This results in an estimate of 1,430/4,936 = 0.29 for the prior probability P(G | M) and an estimate of 0.71 for the prior probability P(Ḡ | M). Furthermore, it
is also known that roughly 5% of married women in the United States have
at some point been physically abused by their husbands. If we assume that a
woman who has been murdered by someone other than her husband had the
same chance of being abused by her husband as a randomly selected woman,
then the probability P(E | ḠM) is equal to 5%. Our estimate of the probability
P(E | GM) is based on the reported remarks made by Simpson’s defense
attorney, Alan Dershowitz, in a newspaper article. In this article, Dershowitz
admitted that a substantial percentage of the husbands who murder their wives
have, previous to the murders, also physically abused their wives. Given this
statement, the probability P(E | GM) will be taken to be 0.5. By substituting
the various estimated values for the probabilities into Bayes’ formula in odds
form, we find that
P(G | ME) / P(Ḡ | ME) = (0.29/0.71) × (0.5/0.05) = 4.08.

By P(Ḡ | ME) = 1 − P(G | ME), we can convert the odds to a probability.
This results in P(G | ME) = 0.81. In words, there is an estimated probability
of 81% that the husband is the murderer of his wife in light of the knowledge
that he had previously physically abused her. The fact that O. J. Simpson had
physically abused his wife in the past was therefore certainly very relevant to
the case.
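The arithmetic behind this analysis takes only a few lines of Python; the sketch below simply plugs the estimates quoted above into the extended odds form (the variable names are mine, not the author's).

    # Estimates quoted in the text
    p_G_given_M     = 0.29   # P(G | M): a murdered woman was killed by her (ex-)husband or boyfriend
    p_notG_given_M  = 0.71   # P(not-G | M)
    p_E_given_GM    = 0.50   # P(E | GM): prior abuse, given the husband is the murderer
    p_E_given_notGM = 0.05   # P(E | not-G, M): prior abuse, given she was murdered by someone else

    posterior_odds = (p_G_given_M / p_notG_given_M) * (p_E_given_GM / p_E_given_notGM)
    print(round(posterior_odds, 2))                         # about 4.08
    print(round(posterior_odds / (1 + posterior_odds), 2))  # about 0.80, i.e. roughly the 81% quoted above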
The next example involves a subtle application of Bayes’ rule in odds form.
Example 2.16 A diamond merchant has lost a case containing a very
expensive diamond somewhere in a large city in an isolated area. The case
has been found again but the diamond has vanished. However, the empty
case contains the DNA of the person who took the diamond. The city has
150,000 inhabitants, and each is considered a suspect in the diamond theft. An
expert declares that the probability of a randomly chosen person matching the
DNA profile is 10^−6. The police search a database with 5,120 DNA profiles
and find one person matching the DNA from the case. Apart from the DNA
evidence, there is no additional background evidence related to the suspect.
On the basis of the extreme infrequency of the DNA profile and the fact that
the population of potential finders of the diamond is only 150,000 people, the
prosecutor jumps to the conclusion that the odds of the suspect not being the
thief are practically nil and calls for a tough sentence. What do you think of
this conclusion?
Solution. The conclusion made by the prosecutor could not be more wrong.
The prosecutor argues: “The probability that a person chosen at random would
match the DNA profile found on the diamond case is negligible and the
number of inhabitants of the city is not very large. The suspect matches
this DNA profile, thus it is nearly one hundred percent certain that he is the
perpetrator.” This is a textbook example of the faulty use of probabilities.
The probability that the suspect is innocent of the crime is generally quite
different from the probability that a randomly chosen person matches the DNA
profile in question. What we are actually looking for is the probability that
among all persons matching the DNA profile in question, the arrested person
is the perpetrator. Counsel for defense could reason as follows to estimate this
probability: “We know that the suspect has matching DNA, but among the
other 150,000 − 5,120 = 144,880 individuals the expected number of people
matching the DNA profile is 144,880 × 10^−6 = 0.14488. So the probability
that the suspect is guilty is 1/(1 + 0.14488) = 0.8735. It is not beyond
reasonable doubt that the suspect is guilty and thus the suspect must be
released.”
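The defense counsel's calculation, and the more general formula that is derived below, are easy to check numerically. Here is a small Python sketch under the stated assumptions (r denotes the ratio p0/p1 introduced below):

    n, n0, lam = 150_000, 5_120, 1e-6    # population size, database size, DNA match probability

    def p_suspect_is_thief(r):
        # P(suspect is the thief) = r / (r + (n - n0) * lambda)
        return r / (r + (n - n0) * lam)

    print(round(p_suspect_is_thief(1), 4))   # 0.8735, the defense counsel's figure
    print(round(p_suspect_is_thief(2), 4))   # 0.9325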
The reasoning of the defense is on the whole correct and can be supported
by Bayes’ rule. Let us perform the analysis under the assumption that each
of the 150,000 inhabitants of the city is equally likely to be the finder of the
diamond and that the finder keeps the diamond with probability p0 when the
finder is a person in the database and with probability p1 otherwise. Let
r = p0/p1.

It will appear that only the value of the ratio r is required and thus not the values of p0 and p1. The following formula will be derived:

P(suspect is the thief) = r / (r + (n − n0)λ),

where n = 150,000, n0 = 5,120, and λ = 10^−6 denotes the probability that a randomly chosen person matches the DNA profile in question. This posterior probability has the value 0.8735 for r = 1 and the value 0.9325 for r = 2.
This result confirms that the counsel for defense is correct in stating that it is
not beyond reasonable doubt that the suspect is guilty, so the suspect must be
released if there is no additional proof. The derivation of the above formula
uses again the extension of Bayes’ rule and proceeds as follows. Define the
following events:
Problems
2.72 In a binary transmission channel, a 1 is transmitted with probability 0.8
and a 0 with probability 0.2. The conditional probability of receiving a 1
given that a 1 was sent is 0.95, the conditional probability of receiving a 0
when a 0 was sent is 0.99. What is the probability that a 1 was sent when
receiving a 1? Use Bayes’ rule in odds form to answer this question.
2.73 An oil explorer performs a seismic test to determine whether oil is likely
to be found in a certain area. The probability that the test indicates the
presence of oil is 90% if oil is indeed present in the test area, while the
2.78 On the island of liars, each inhabitant lies with probability 2/3. You
overhear an inhabitant making a statement. Next you ask another inhab-
itant whether the inhabitant you overheard spoke truthfully. What is the
probability that the inhabitant you overheard indeed spoke truthfully,
given that the other inhabitant says so?
2.79 Suppose that a person has confessed to a crime under a certain amount
of police pressure. Besides this confession, there is no other proof that
the person is guilty. Use Bayes’ rule in odds form to verify that the
confession only adds to the evidence of guilt if the confession is more
likely to come from the guilty than from the innocent. Do you think that
in real life a hardened person who is guilty is more likely to confess than
an unstable person who is innocent?
2.80 An opaque bowl contains one ball. The ball is equally likely to be red
or blue. A red ball is added to the bowl. Then a ball is randomly picked
from the bowl. The ball that has been picked from the bowl turns out to
be red. Use Bayes’ rule in odds form to calculate the probability that the
bowl originally contained a red ball.
2.81 A doctor discovers a lump in a woman’s breast during a routine physical
examination. The lump could be a cancer. Without performing any
further tests, the probability that the woman has breast cancer is 0.01. A
further test can be done. On average, this test is able to establish correctly
whether a tumor is benign or cancerous 90% of the time. A positive test
result indicates that a tumor is cancerous. What is the probability that the
woman has breast cancer if the test result is positive?
2.82 On average, one in every 125 births produces a set of fraternal twins and one in every 300 births produces a set of identical twins (identical twins are always of the same sex, whereas the sexes of fraternal twins are independent of each other). Elvis had a
twin brother, Jesse Garon, who died at birth. What is the probability that
Elvis was an identical twin?
2.83 A box contains 10,000 coins. One of the coins has heads on both sides
but all the other coins are fair coins. You choose at random one of the
coins. Use Bayes’ rule in odds form to find the probability that you have
chosen the two-headed coin given that the first n tosses have all resulted
in heads. What are the numerical values of the posterior probability for
n = 10, 15, and 25?
The foundation of Bayesian inference was laid by the English clergyman Thomas Bayes (1702–
1761), but the Bayesian interpretation of probability was developed mainly by
Pierre Simon Laplace (1749–1827). Astronomers have contributed much to the
Bayesian approach. Astronomers cannot do experiments on the universe and
thus have to make probabilistic inferences from evidence left behind. This is
very much the same situation as in forensic science, where Bayesian inference
also plays a very important role. In Bayesian inference one typically deals with
nonrepeatable chance experiments.
Fundamentally, the distinction between Bayesian inference and frequentist
inference in classical statistics concerns the interpretation of probability. To
illustrate, suppose that an astronomer wants to make a statement about the
photon flux of a nonvariable star. In the frequentist view, the true flux is a single
fixed value and a statistical estimate of this unknown value can only be based
on many measurements of the flux. In a Bayesian view, one can meaningfully
treat the flux of the star as a random variable and talk about the probability
that the true value of the flux of the star is between specific bounds. That
probability is subjective and represents one’s knowledge of the value based
on prior information and available data. The fundamental difference between
the classical and the Bayesian approach can also be illustrated nicely with
the next example. Imagine a multiple-choice exam consisting of 50 questions,
each of which has three possible answers. A student receives a pass grade if
he/she correctly answers more than half of the questions. Take the case of a
student who manages to answer 26 of the 50 questions correctly and claims
not to have studied, but rather to have obtained 26 correct answers merely by
guessing. The classical approach in statistics is to calculate the probability of
correctly answering 26 or more of the 50 questions by luck alone. This excess
probability is equal to 0.0049 and is called the p-value in classical statistics. On
the basis of the small p-value one might conclude that the student is bluffing
and in fact did prepare for the exam. However, the p-value does not give the
probability that the student went unprepared to the exam, but only gives the
probability of getting 26 or more correct answers when the student did not
prepare for the exam. In reality, one might have information about the earlier
achievements of the student in question. The Bayesian approach does use this
information and then calculates the probability that the student did not prepare
for the exam given that he/she answered 26 of the 50 questions correctly. This
is the probability we are actually looking for! The Bayesian approach requires
that we first specify a prior distribution for the various ways the student may
have prepared for the exam. This distribution concerns the situation before
the exam and may be based on information of the student’s earlier academic
performance in homework or previous exams.
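The excess probability mentioned for the multiple-choice example is a binomial tail probability, which can be reproduced with a few lines of Python (a sketch of my own, not part of the original text):

    from math import comb

    n, p = 50, 1 / 3     # 50 questions, success probability 1/3 per question under pure guessing
    p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(26, n + 1))
    print(round(p_value, 4))   # should reproduce the excess probability of about 0.0049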
Example 2.17 A new treatment is tried out for a disease for which the
historical success rate of the standard treatment is 35%. The discrete uniform
probability distribution on 0, 0.01, . . . , 0.99, 1 is taken as prior for the success
probability of the new treatment. The experimental design is to make exactly
10 observations by treating 10 patients. The experimental study yields seven
successes and three failures. What is the posterior probability that the new
treatment is more effective than the standard treatment?4
Solution. Model the unknown success probability of the new treatment by the random variable Θ. Assume that our state of knowledge about the unknown parameter is expressed by the “non-informative” prior distribution p0(θ) = 1/101 for θ = 0, 0.01, . . . , 0.99, 1. To update the prior distribution given the observed data, we need the so-called likelihood function L(data | θ). This function is defined as the probability of getting the data given that the success probability has value θ. In the particular situation of seven successes in the treatment of 10 patients,

L(data | θ) = \binom{10}{7} θ^7 (1 − θ)^3 for θ = 0, 0.01, . . . , 0.99, 1.

To find the posterior probability p(θ) = P(Θ = θ | data), we use

P(Θ = θ | data) = P(data | θ) p0(θ) / P(data).

By the law of conditional probability, P(data) = Σ_θ P(data | θ) p0(θ). Thus, using the notation L(data | θ) for P(data | θ), the posterior distribution is given by the Bayesian formula

p(θ) = L(data | θ) p0(θ) / Σ_θ L(data | θ) p0(θ) for θ = 0, 0.01, . . . , 0.99, 1.

The posterior p(θ) is proportional to prior p0(θ) × likelihood L(data | θ). Letting θi = i/100 and inserting the formulas for L(data | θ) and p0(θ), we get

p(θi) = θi^7 (1 − θi)^3 / Σ_{k=0}^{100} θk^7 (1 − θk)^3 for i = 0, 1, . . . , 100.
4 This example is based on the paper by D. A. Berry, “Bayesian clinical trials,” Nature Reviews
Drug Discovery 5 (2006): 27–36.
The posterior probability of the new treatment not being more effective than
the standard treatment is 0.0134. In this particular example, the value 0.0134
of the posterior probability is not very different from the value 0.0260 of the
excess probability of obtaining seven or more successes in 10 trials under
the null hypothesis that the new treatment causes no difference and thus has
success probability 0.35. In classical statistics, this so-called p-value would
have been calculated and the hypothesis that the new treatment causes no
difference would also have been rejected.
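The posterior distribution of Example 2.17 and the classical p-value can both be computed directly; the following Python sketch follows the formulas given above (the code itself is mine, not the author's):

    from math import comb

    thetas = [i / 100 for i in range(101)]     # grid 0, 0.01, ..., 0.99, 1
    prior = [1 / 101] * 101                    # non-informative uniform prior

    lik = [comb(10, 7) * t**7 * (1 - t)**3 for t in thetas]   # seven successes in ten patients
    norm = sum(l * p0 for l, p0 in zip(lik, prior))
    posterior = [l * p0 / norm for l, p0 in zip(lik, prior)]

    # Posterior probability that the new treatment is not more effective (theta <= 0.35)
    print(round(sum(pt for t, pt in zip(thetas, posterior) if t <= 0.35), 4))   # about 0.0134

    # Classical p-value: seven or more successes in ten trials when the success probability is 0.35
    p_val = sum(comb(10, k) * 0.35**k * 0.65**(10 - k) for k in range(7, 11))
    print(round(p_val, 4))                                                      # about 0.0260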
In general, p-values overestimate the evidence against the null hypothesis.
To illustrate this, consider the following experiment. Suppose that there is
reason to believe that a coin might be slightly biased toward heads. To test this,
you decide to throw the coin 1,000 times. Before performing the experiment,
you express your uncertainty about the unbiasedness of the coin by assuming
that the probability of getting heads in a single toss of the coin can take on the
values θ = 0.50, 0.51, and 0.52 with respective prior probabilities p0(θ) = 1/2, 1/3, and 1/6. Next the experiment is performed and 541 heads are obtained in 1,000 tosses of the coin. The likelihood of getting 541 heads in 1,000 tosses is

L(541 | θ) = \binom{1,000}{541} θ^541 (1 − θ)^459 for θ = 0.50, 0.51, and 0.52.

This leads to the posterior probability p(θ | 541) = 0.1282 for θ = 0.50. In other words,
your posterior belief that the coin is fair equals 0.1282. In classical statistics,
one would compute the probability of getting 541 or more heads in 1,000 tosses
of the coin under the hypothesis that the coin is fair. This excess probability
is equal to 0.0052. Many classical statisticians would consider this small
p-value as significant evidence that the coin is biased toward heads. However,
your subjective Bayesian probability of 0.1282 for the hypothesis of a fair
coin is not strong enough evidence for such a conclusion. The difference
in the conclusions can be explained as follows. The p-value is based on
the set of all possible observations that cast as much or more doubt on
the hypothesis as the actual observations do. It is not possible to base
the p-value only on the actual data, because it frequently happens that
all individual outcomes have such small probabilities that every outcome
would look significant. The inclusion of unobserved data means that the
resulting p-value may greatly exaggerate the strength of evidence against
the hypothesis.5 The p-value, which is the probability of getting data at
least as extreme as the observed data when the null hypothesis is true, is
5 That’s why physicists required an extremely small p-value of about 1 in 3.5 million (the
5-sigma rule) before declaring the “discovery” of the long-sought Higgs boson. The evidence
of the efficacy of the famous Salk vaccine against polio was also based on an extremely small
p-value, see Problem 3.65. It is true that p-values are overused and often misused, but they were
the right tool in the search for the Higgs boson and in showing the efficacy of the Salk vaccine.
in favor of candidate A equals θ. Using the likelihood L(data | θ) = \binom{1,000}{517} θ^517 (1 − θ)^483, we get the posterior distribution

p(θ) = \binom{1,000}{517} θ^517 (1 − θ)^483 p0(θ) / [ Σ_{a=30}^{70} \binom{1,000}{517} (a/100)^517 (1 − a/100)^483 p0(a/100) ].
Performing the numerical calculations, we find that the posterior probability
of candidate A getting the majority of the votes at the election equals
p(0.51) + · · · + p(0.70) = 0.763. Let’s see what the posterior probability
of candidate A winning the election is when the prior p0(θ) = c/(1 − θ^2) for θ = 0.30, 0.31, . . . , 0.70 is used, where the normalizing constant c = 0.0174484.
Note that the shape of this prior distribution having expected value 0.520 is
much different from the shape of the triangular prior distribution. We then
find the value 0.786 for the posterior probability of candidate A winning the
election.
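The posterior probability under the second prior can be verified directly. The Python sketch below (my own code) uses only the prior p0(θ) = c/(1 − θ^2); the triangular prior behind the value 0.763 is specified in the part of the example not repeated here.

    from math import comb

    grid = [a / 100 for a in range(30, 71)]        # theta = 0.30, 0.31, ..., 0.70

    weights = [1 / (1 - t**2) for t in grid]       # prior proportional to 1/(1 - theta^2)
    c = 1 / sum(weights)
    prior = [c * w for w in weights]
    prior_mean = sum(t * p0 for t, p0 in zip(grid, prior))
    print(round(c, 7), round(prior_mean, 3))       # about 0.0174484 and 0.52

    lik = [comb(1000, 517) * t**517 * (1 - t)**483 for t in grid]
    norm = sum(l * p0 for l, p0 in zip(lik, prior))
    posterior = [l * p0 / norm for l, p0 in zip(lik, prior)]

    # p(0.51) + ... + p(0.70): posterior probability that candidate A wins the election
    print(round(sum(pt for t, pt in zip(grid, posterior) if t >= 0.51), 3))   # about 0.786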
Problems
2.84 Your friend has fabricated a loaded die. In doing so, he has chosen at
random one of the values 0.1, 0.2, 0.3, or 0.4 and has loaded the die
in such a way that any roll of the die results in the outcome 6 with a
probability that is equal to the randomly chosen value. He does not tell
you what the chosen value is. You ask him to roll the die 300 times and to
inform you how often the outcome 6 has appeared. You are informed that
the outcome 6 has appeared 75 times. What is the posterior distribution
of the probability that a single roll of the die gives a 6?
2.85 You wonder who is the better player of the tennis players Alassi and
Bicker. Your prior assigns equal probabilities to the possible values 0.4,
0.5, and 0.6 for the probability that Alassi wins a given match. Then
you learn about a tournament at which a best-of-five series of matches
is played between Alassi and Bicker over a number of days. In such an
encounter the first player to win three matches is the overall winner. It
turns out that Alassi wins the best-of-five contest. How should you update
your prior? Note: This problem is taken from J. Albert, “Teaching Bayes’
rule: a data-oriented approach,” The American Statistician 51 (1997):
247–253.
2.86 Consider again Example 2.17. Assume that the 10 observations for the
new treatment are SSFSSFSSSF, where S stands for success and F for
failure. Update the posterior probability of the new treatment being more
effective than the standard treatment after each observation in order to
A random variable is not a variable in the traditional sense of the word and
actually it is a little misleading to call it a variable. Formally, a random
variable is defined as a real-valued function on the sample space of a chance
experiment. A random variable X assigns a numerical value X(ω) to each
element ω of the sample space. For example, suppose that the random variable
X is defined as the smallest of the two numbers that will result from the
experiment of rolling a fair die twice. Then the random variable X assigns
the numerical value X(ω) = min(i, j) to the outcome ω = (i, j) of the chance
experiment, where i is the number obtained in the first roll and j is the number
obtained in the second roll. As said before, a random variable X takes on its
value by chance. A random variable gets its value only after the underlying
chance experiment has been performed. Before the experiment is performed,
you can only speak of the probability that the random variable will take on a
specific value (or a value in a specific interval for the case that the range of
the random variable is some continuum of real numbers). It is common to use
uppercase letters such as X, Y, and Z to denote random variables, and lowercase
letters x, y, and z to denote their possible numerical values.
Illustrative examples of random variables are:
• The number of winners in a football pool next week.
• The number of major hurricanes that will hit the United States next year.
• The number of claims that will be submitted to an insurance company next
week.
• The amount of rainfall that the city of London will receive next year.
• The lifetime of a newly bought battery.
• The duration of your next mobile phone call.
The first three examples are examples of discrete random variables taking on
a discrete number of values and the other three examples describe continuous
random variables taking on a continuum of values.
In this chapter we consider only discrete random variables. A random
variable X is said to be discrete if its set of possible values is finite or countably
infinite. The set of possible values of X is called the range of X and is denoted
by I. The probabilities associated with these possible values are determined
by the probability measure P on the sample space of the chance experiment.
The probability mass function of a discrete random variable X is defined by
P(X = x) for x ∈ I, where the notation P(X = x) is shorthand for P({ω : X(ω) = x}).
As an example, suppose that you are going to roll a fair die twice. The
sample space of this experiment consists of 36 equally likely outcomes (i, j),
where i is the number obtained in the first roll and j is the number obtained in
the second roll. Let the random variable X be defined as the smallest of the two
numbers that will result from this experiment. The random variable X assigns
the numerical value min(i, j) to the outcome (i, j) of the sample space. Thus, the
range of X is the set I = {1, 2, . . . , 6}. The random variable X takes on the value
1 if one of the 11 outcomes (1, 1), (1, 2), . . . , (1, 6), (2, 1), (3, 1), . . . , (6, 1)
occurs and so P(X = 1) = 11/36. In the same way, P(X = 2) = 9/36, P(X = 3) = 7/36, P(X = 4) = 5/36, P(X = 5) = 3/36, and P(X = 6) = 1/36.
As another example, suppose that the random variable S is defined as the sum
of the two rolls. Then S assigns the numerical value i + j to the outcome (i, j)
of the sample space and so the range of S is the set I = {2, 3, . . . , 12}. It is
left to the reader to verify that the probability mass function of S is given by
P(S = 2) = P(S = 12) = 1/36, P(S = 3) = P(S = 11) = 2/36, P(S = 4) = P(S = 10) = 3/36, P(S = 5) = P(S = 9) = 4/36, P(S = 6) = P(S = 8) = 5/36, and P(S = 7) = 6/36.
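Both probability mass functions can be generated by brute force over the 36 equally likely outcomes; here is a small Python sketch (my own illustration, not part of the text):

    from collections import Counter

    outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 equally likely outcomes

    pmf_min = Counter(min(i, j) for i, j in outcomes)   # counts for X = min(i, j)
    pmf_sum = Counter(i + j for i, j in outcomes)       # counts for S = i + j

    print({x: f"{c}/36" for x, c in sorted(pmf_min.items())})
    # {1: '11/36', 2: '9/36', 3: '7/36', 4: '5/36', 5: '3/36', 6: '1/36'}
    print({s: f"{c}/36" for s, c in sorted(pmf_sum.items())})
    # {2: '1/36', 3: '2/36', ..., 7: '6/36', ..., 12: '1/36'}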
Example 3.1 In your pocket you have three dimes (coins of 10 cents) and two
quarters (coins of 25 cents). You grab at random two coins from your pocket.
What is the probability mass function of the amount of money you will grab?
Solution. The sample space of the chance experiment is chosen as Ω =
{(D, D), (D, Q), (Q, D), (Q, Q)}. The outcome (D, D) occurs if the first coin
taken is a dime and the second one is also a dime, the outcome (D, Q) occurs
if the first coin taken is a dime and the second one is a quarter, etc. The
probability 3/5 × 2/4 = 3/10 is assigned to the outcome (D, D), the probability 3/5 × 2/4 = 3/10 to the outcome (D, Q), the probability 2/5 × 3/4 = 3/10 to the outcome (Q, D), and the probability 2/5 × 1/4 = 1/10 to the outcome (Q, Q). Let the random variable X denote the total number of cents you will grab. The random variable X has 20, 35, and 50 as possible values. The random variable X takes on the value 20 if the outcome (D, D) occurs, the value 35 if either the outcome (D, Q) or (Q, D) occurs, and the value 50 if the outcome (Q, Q) occurs. Thus, the probability mass function of X is

P(X = 20) = 3/10, P(X = 35) = 3/10 + 3/10 = 3/5, and P(X = 50) = 1/10.
Example 3.2 You are going to remove cards one at a time from a well-shuffled
standard deck of 52 cards until you get an ace. Let the random variable X be the
number of cards that must be removed. What is the probability mass function
of X?
Solution. The range of the random variable X is the set {1, 2, . . . , 49}.
Obviously, P(X = 1) = 4/52. Using the chain rule for conditional probabilities, we find for k = 2, . . . , 49:

P(X = k) = 48/52 × · · · × (48 − (k − 2))/(52 − (k − 2)) × 4/(52 − (k − 1)).

Alternatively, P(X = k) can be calculated as [\binom{48}{k−1}/\binom{52}{k−1}] × 4/(52 − (k − 1)). This representation is based on the multiplication rule P(AB) = P(A)P(B | A) and can be explained
as the probability that the first k − 1 picks are non-aces multiplied by the
conditional probability that the kth pick is an ace given that the first k − 1
picks are non-aces.
The last example gives rise to the following important observation. Often,
an explicit listing of the sample space is not necessary to assign a probability
distribution to a random variable. Usually, the probability distribution of a ran-
dom variable is modeled without worrying about the assignment of probability
to an underlying sample space. In most problems, you will perform probability
calculations without explicitly specifying a sample space; an assignment of
probabilities to properly chosen events usually suffices.
Problems
3.1 As an added incentive for students to do homework for their probability
course, the professor randomly picks two students at each class meeting.
These two students are asked to explain a homework problem. On a
particular day, five students from the class of 20 students have not done the
homework. Let the random variable X be the number of students picked
who have not done the homework. What is the probability mass function
of X?
3.2 Imagine that people enter a room one by one and announce their birthdays.
Let the random variable X be the number of people required to have a
matching birthday. What is the probability mass function of X?
3.3 A bag contains three coins. One coin is two-headed and the other two are
normal. A coin is chosen at random from the bag and is tossed twice. Let
the random variable X denote the number of heads that will appear. What
is the probability mass function of X?
3.4 In a charity lottery, 1,000 tickets numbered 000, 001, . . . , 999 are sold.
Each contestant buys only one ticket. The prize winners of the lottery
are determined by drawing one number at random from the numbers
000, 001, . . . , 999. You are a prize winner when the number on your ticket
is the same as the number drawn or is a random permutation of the number
drawn. What is the probability mass function of the number of prize
winners?
3.5 Accidentally, two depleted batteries get into a set of five batteries. To
remove the two depleted batteries, the batteries are tested one by one in
random order. Let the random variable X denote the number of batteries
that must be tested to find the two depleted batteries. What is the
probability mass function of X?
3.6 You are allowed to toss a fair coin either five times or until tails appears,
whichever occurs first. You receive starting capital of $10 when the first
toss results in heads; otherwise, you receive nothing. Each subsequent toss
that results in heads doubles your current capital. Let the random variable
X be the end capital you will get. What is the probability mass function
of X?
3.7 A box contains seven envelopes. Two envelopes contain $10 each, one
contains $5, and the other four are empty. Envelopes are drawn at random
from the box, one by one and without replacement. You may keep the
money from the envelopes that are drawn before an empty envelope is
drawn. Let the random variable X be the amount of money you will get.
What is the probability mass function of X?
or the average win per game is approximately −0.05 dollars (meaning that the
average “win” is actually a loss). If we define the random variable X as the win
achieved in just a single repetition of the game, then the number −0.05 is said
to be the expected value of X. In the casino game, the expected value of X is
calculated as
For a discrete random variable X with range I, it is said that the expected value E(X) exists if Σ_{x∈I} |x| P(X = x) < ∞ or if X is a nonnegative random variable, where the value E(X) = ∞ is allowed when X is nonnegative. It is important to emphasize that the expected value E(X) is finite if and only if Σ_{k∈I} |k| P(X = k) < ∞. An example of a random variable X for which E(X) does not exist is the random variable X with probability mass function P(X = 0) = 0 and P(X = k) = 3/(π^2 k^2) for k = ±1, ±2, . . . . In this example, Σ_{k=−∞}^{∞} |k| P(X = k) = ∞, by the fact that Σ_{k=1}^{∞} 1/k = ∞. However, the expected value of the nonnegative random variable with probability mass function 6/(π^2 k^2) for k = 1, 2, . . . does exist (and is equal to ∞).
Example 3.1 (continued) What is the expected value of the random variable
X to be defined as the number of cents you will grab from your pocket?
Solution. Since the probability mass function of X is P(X = 20) = 3/10, P(X = 35) = 3/5, and P(X = 50) = 1/10, the expected value of X is

E(X) = 20 × 3/10 + 35 × 3/5 + 50 × 1/10 = 32 cents.
Example 3.2 (continued) What is the expected value of the random variable
X to be defined as the number of cards that must be picked until you get the
first ace?
Solution. Using the result found earlier for P(X = k), we get
E(X) = 1 × 4/52 + Σ_{k=2}^{49} k × 48/52 × · · · × (48 − (k − 2))/(52 − (k − 2)) × 4/(52 − (k − 1)) = 10.6.
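The value 10.6 is easily confirmed numerically. The sketch below (a check of my own, not from the text) uses the alternative representation of P(X = k) from Example 3.2 with exact fractions:

    from fractions import Fraction as F
    from math import comb

    def p(k):
        # P(X = k): the first k-1 picks are non-aces and the kth pick is an ace.
        return F(comb(48, k - 1), comb(52, k - 1)) * F(4, 52 - (k - 1))

    print(sum(p(k) for k in range(1, 50)))              # 1, so the probabilities sum to one
    print(float(sum(k * p(k) for k in range(1, 50))))   # 10.6, the expected value E(X)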
Example 3.3 Joe and his friend make a guess every week whether the Dow
Jones index will have risen at the end of the week or not. Both put $10 in the
pot. Joe observes that his friend is just guessing and is making his choice by
the toss of a fair coin. Joe asks his friend if he could contribute $20 to the pot
and submit his guess together with that of his brother. The friend agrees. In
each week, however, Joe’s brother submits a prediction opposite to that of Joe.
The person having a correct prediction wins the entire pot. If more than one
person has a correct prediction, the pot is split evenly. How favorable is the
game to Joe and his brother?
Solution. Let the random variable X denote the payoff to Joe and his brother
in any given week. Either Joe or his brother will have a correct prediction. If
Joe’s friend is wrong he wins nothing, and if he is correct he shares the $30
pot with either Joe or his brother. Thus X takes on the values 30 and 15 with
equal chances. This gives E(X) = 1/2 × 30 + 1/2 × 15 = 22.5 dollars. Joe and his
brother have an expected profit of $2.5 every week.
Example 3.4 Three friends go to the cinema together every week. Each week,
in order to decide which friend will pay for the other two, they all toss a fair
coin into the air simultaneously. They continue to toss coins until one of the
three gets a different outcome from the other two. What is the expected value
of the number of trials required?
Solution. Let the random variable X denote the number of trials until one of
the three friends gets a different outcome from the other two. The probability
that any given trial does not lead to three equal outcomes is 1 − 1/8 − 1/8 = 3/4. Thus,

P(X = j) = (1 − p)^(j−1) p for j = 1, 2, . . .

with p = 3/4. The expected value of X is

E(X) = Σ_{j=1}^{∞} j (1 − p)^(j−1) p = p/[1 − (1 − p)]^2 = 1/p,

using the fact that Σ_{j=1}^{∞} j x^(j−1) = 1/(1 − x)^2 for all 0 < x < 1, see Appendix C. Thus, the expected value of the number of trials required is 4/3.
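A quick simulation confirms the value 4/3 (a sketch of my own; the seed is arbitrary):

    import random

    rng = random.Random(1)   # fixed seed so the run is reproducible

    def rounds_until_odd_one_out():
        # Number of simultaneous-toss rounds until one of the three friends differs from the other two.
        n = 0
        while True:
            n += 1
            tosses = [rng.randint(0, 1) for _ in range(3)]
            if len(set(tosses)) == 2:    # not all three tosses equal
                return n

    trials = 100_000
    print(sum(rounds_until_odd_one_out() for _ in range(trials)) / trials)   # close to E(X) = 4/3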
The concept of expected value is at the heart of the sequential decision
problem that is discussed in the next example.
Example 3.5 Eleven closed boxes are put in random order in front of you. One
of these boxes contains a devil’s penny and the other 10 boxes contain money.
You know which dollar amounts a1 , . . . , a10 are in the 10 boxes. You may open
as many boxes as you wish, but they must be opened one by one. You can keep
the money from the boxes you have opened, as long as you have not opened
the box with the devil’s penny. Once you open this box, the game is over and
you lose all the money gathered so far. What is a good stopping rule when you
want to maximize the expected value of your gain? What is the expected value
of the gain of the game when ai = 1 for all i?
Solution. The one-stage-look-ahead rule is the key. This is an intuitively
appealing rule that looks only one step ahead, as the name says. The rule
prescribes stopping in the states in which it is at least as good to stop now as to
continue one more step and then stop. For the situation that you have collected
so far a dollars and k + 1 boxes are still closed, including the box with the
devil’s penny, define the random variable Xk (a) as the amount by which your
capital gathered so far would change when you would decide to open one more
1 Using advanced optimization theory, it can be shown that this stopping rule is indeed optimal
for the criterion of the expected gain. In general, the one-stage-look-ahead rule is optimal if the
problem has the property that continuing in a state in which this rule calls for stopping results
again in a state in which the rule calls for stopping. To put it differently, the set of states in
which stopping is better than continuing for one step and then stopping must be a closed set
from which no escape is possible.
laid face down. You can flip over as many cards as you wish, but they must be
flipped over one by one. Once you flip over the joker card, the game is over
and you win nothing. If you stop before the joker card is flipped over, you win
as many dollars as the sum of the values of the cards you have flipped over (the
ace counts for $1). This is the devil’s penny problem with ai = i for all i.
Expected value is a useful concept in casino games, lotteries, and sports
betting. The expected value is what the player will win or lose on average if
the player were to make the same bet over and over again. This result is known
as the strong law of large numbers. This law will be made more precise in
Section 9.2.
The next example gives a practical application of the concept of expected
value. The example is about group testing for disease and has its historical
significance in World War II. The young men who were drafted were subject
to a blood test in the fight against syphilis, at a time when mass-produced
penicillin was not available. The Harvard economist Robert Dorfman devised
an elegant method to cut down on the number of blood tests needed. The
idea was to pool the blood samples of soldiers together and test the pooled
sample.
Example 3.6 A large group of people are undergoing a blood test for a
particular disease. The probability that a randomly chosen person has the
disease in question is equal to p = 0.005. In order to save on costs, it is
decided to split the group into smaller groups each consisting of r people.
The blood samples of the r people are then mixed and tested all at once. The
mixture will only test negative if all the individual samples are negative. If the
test result is negative, then one test will have been sufficient for that whole
group. Otherwise, r extra tests will be necessary in order to test each of the r
people individually. What value of r minimizes the average number of tests per
person?
Solution. Let the random variable X be the number of tests needed for a group
of r people. The possible values of the random variable X are 1 and r + 1.
The random variable X takes on the value 1 if and only if all the r samples
are negative. An individual sample will test negative with probability 1 − p.
Assuming independence between the test outcomes of individual people, it
follows that P(X = 1) is equal to (1 − p) × (1 − p) × · · · × (1 − p) = (1 − p)^r. Since P(X = r + 1) = 1 − P(X = 1), the probability P(X = r + 1) is equal to 1 − (1 − p)^r. Thus, the expected value of X is

E(X) = 1 × (1 − p)^r + (r + 1) × (1 − (1 − p)^r) = 1 + r(1 − (1 − p)^r).
Therefore, by combining blood samples from r people into one mixture, the average number of tests per person is (1/r)(1 + r(1 − (1 − p)^r)). For the case
p = 0.005, this expression is minimal for r = 15 and its minimal value is
0.1391. Thus, group testing cuts down the number of tests performed by about
86% when p = 0.005 when the test concerns many people. Also, an interesting
quantitative result can be given. For p close to zero, the optimal value of the
group size and the minimal value of the average number of tests per person
are approximately equal to 1/√p and 2√p. This result follows easily by noting that 1 − (1 − p)^r ≈ rp for p close to zero and next minimizing the function (1/x)(1 + px^2) with respect to x (verify!).
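A few lines of Python reproduce the optimal group size and the small-p approximation (a sketch under the assumptions of the example, not part of the original text):

    def tests_per_person(r, p=0.005):
        # Expected number of tests per person when the blood of r people is pooled.
        return (1 + r * (1 - (1 - p) ** r)) / r

    best_r = min(range(1, 101), key=tests_per_person)
    print(best_r, round(tests_per_person(best_r), 4))            # 15 and 0.1391

    # Small-p approximation: group size about 1/sqrt(p), minimum about 2*sqrt(p)
    print(round(0.005 ** -0.5, 1), round(2 * 0.005 ** 0.5, 4))   # about 14.1 and 0.1414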
Problems
3.8 There are 10 cards face down numbered 1 through 10. You pick at
random one card. Your payoff is $0.50 if the number on the card is less
than 5 and is the dollar value on the card otherwise. What is the expected
value of your payoff?
3.9 Four professors give one course each. The courses have 15, 20, 70, and
125 students. No student takes more than one course. Let the random
variable X be the number of students in a randomly chosen class and
Y be the number of students in the class of a randomly chosen student.
Without doing any calculation, can you explain why E(X) is larger than
E(Y)? What are E(X) and E(Y)?
3.10 A firm exports to a variety of countries. An order for $10,000 has been
received from a politically unstable country. The export firm estimates
that there is a probability of 75% that the full amount of $10,000 will
be paid, a probability of 15% that only $5,000 will be paid, and a
probability of 10% that nothing will be paid. Therefore, the firm has
decided to purchase export credit insurance. The insurance company has
asked a price of $2,000 to insure the order. The insurance guarantees
compensation for any debt experienced on the order. What is the expected
value of the cost of risk reduction?
3.11 A bowl contains ten white and two red balls. In one shot you pick m
balls at random from the bowl, where m can be chosen at your discretion.
If all the balls picked are white, you win m dollars; otherwise, you win
nothing. What is the expected value of your winnings and what is the
maximizing value of m?
3.12 Six cards are taken from a deck: two kings and four aces. The six cards
are thoroughly shuffled, after which the two top cards are revealed.
If both cards are aces, you win $1.25; otherwise, you lose $1. Is this a
fair bet? A gambling game is said to be fair when the expected value of
your net winnings is zero.
3.13 Consider again Example 3.2. Let the random variable Xj be the number
of picks until an ace is obtained for the jth time. What is the expected
value of Xj for j = 2, 3, and 4? Can you explain the answer intuitively?
3.14 In the game “Unders and Overs,” two dice are rolled and you can bet
whether the total of the two dice will be under 7, over 7, or equal to 7.
The gambling table is divided into three sections marked “Under 7,” “7,”
and “Over 7.” The payoff odds for a bet on “Under 7” are 1 to 1, for a
bet on “Over 7” are 1 to 1, and for a bet on “7” are 4 to 1 (payoffs of r
to 1 mean that you get r + 1 dollars back for each dollar bet if you win;
otherwise, you get nothing back). Each player can put chips on one or
more sections of the gambling table. Your strategy is to bet one chip on
“Under 7” and one chip on “7” each time. What is the expected value of
your net winnings?
3.15 In European roulette, players bet on the outcome of a turning wheel,
which is fitted with 37 spokes numbered from 0 to 36. Of the spokes
numbered from 1 to 36, 18 are red and 18 are black. The 0 represents a
win for the casino. A bet on the color red has a 1 to 1 payout. There is
a stake limit of $1,000. A player decides to make no more than 11 bets
under a doubling strategy and to stop as soon as he has won a bet. The
player begins by staking one dollar on red. If he loses, he doubles his
stake on red, and continues doubling until red wins. If the player has lost
10 bets in a row, the maximum amount of $1,000 is staked in the 11th
bet. What is the expected value of the total amount staked by the player
and what is the expected value of the amount won by the player?
3.16 A fair coin is tossed repeatedly until heads appears for the first time or
the coin has been tossed m times, whichever occurs first. The payoff is
two dollars if heads turns up in the first toss, four dollars if heads turns
up for the first time in the second toss, etc. In general, the payoff is 2^k
dollars if heads turns up for the first time in the kth toss, for 1 ≤ k ≤ m.
The payoff is zero if tails is tossed m times in a row. What is the expected
value of the payoff? What should be the stake to make the game a fair
bet? Note: This game is known as the St. Petersburg game when m = ∞.
3.17 You throw darts at a circular target on which two concentric circles of
radius 1 cm and 3 cm are drawn. The target itself has a radius of 5 cm.
You receive 15 points for hitting the target inside the smaller circle, 8
points for hitting the middle annular region, and 5 points for hitting the
outer annular region. The probability of hitting the target at all is 0.75.
If the dart hits the target, then the hitting point is a random point on the
target. Let the random variable X be the number of points scored on a
single throw of the dart. What is the expected value of X?
3.18 The following game is played in a particular carnival tent. You pay one
dollar to draw blindly three balls from a box without replacement. The
box contains 10 balls and four of those balls are gold colored. You get
back your original one-dollar stake if you draw exactly two gold-colored
balls, while you win 10 dollars and get back your original one-dollar
stake if you draw three gold-colored balls; otherwise, you get nothing
back. What is the expected value of your net winnings?
3.19 You spin a game-board spinner with 1,000 equal sections numbered 1 to
1,000. After your first spin, you have to decide whether to spin the
spinner for a second time. Your payoff is the total score of your spins,
as long as this score does not exceed 1,000; otherwise, your payoff is
zero. What strategy maximizes the expected value of your payoff?
3.20 A casino owner has given you an advance of a fiches to play a particular
casino game. Your win probability for each play of the game is p, where
p < 1/2. The stake for each play is one fiche. If you win, you get two
fiches back; otherwise, you lose the stake. The fiches are given to you
as a reward for services rendered to the casino and under the following
agreement. You must play until you have lost all fiches or have won b
fiches on top of the advance of a fiches, where b must be set beforehand.
In the case that you lose all fiches, you are owed nothing by the casino;
otherwise, you return the advance of a fiches and keep the b fiches won.
You decide to choose the level b in such a way that the expected value of
the number of fiches you may keep is maximal.
(a) Show that for large a, the maximizing value of b satisfies b ≈ 1/ln(q/p), where q = 1 − p.
matches the red die, you lose your stake; otherwise, you get paid k + 1
dollars if exactly k of the blue dice match the red die. In the case that
exactly one blue die matches the red die, you get paid an additional $0.50
if the other two blue dice show the same number. What is the expected
payoff of the game?
3.22 You play a game in which four fair dice are rolled. The stake is $1. The
payoff is $100 if all four dice show the same number and $10 if two dice
show the same even number and the other two dice show the same odd
number. Is this a favorable game? What about the following game with
a stake of $2? A fair coin is tossed no more than five times. The game
ends if the coin comes up tails or five straight heads appear, whichever
happens first. You get a payoff of $1 each time heads appears, plus a
bonus of $25 if five heads appear in a row.
3.23 A stick of length 1 is broken at random into two pieces. You bet on
the ratio of the length of the longer piece to the length of the smaller
piece. You receive $k if the ratio is between k and k + 1 for some k with
1 ≤ k ≤ m − 1, while you receive $m if the ratio is larger than m. Here,
m is a given positive integer. What should your stake be to make this a
fair bet?
3.24 In the dice game of Pig, you repeatedly roll a single die in any round.
Upon rolling a 1, your turn is over and nothing is added to your score.
Otherwise, you can stop whenever you want and then the total number of
points rolled is added to your score. The goal is to maximize the expected
number of points gained in one round. Under the hold-at-20 rule, you stop
when you have rolled 20 points or more in the round. Explain why the
hold-at-20 rule is optimal.
3.25 You play a dice game in which you repeatedly roll a single die. If the
total number of points rolled exceeds 10, the game is over and you get no
reward. Otherwise, you can stop whenever you want and then your dollar
reward is the accumulated number of points rolled. What is an optimal
stopping rule when you want to maximize the expected value of your
reward?
3.26 A game machine can be used to drop balls into bins. The balls are
dropped one at a time and any ball will land at random into one of 25 bins.
You can stop dropping balls whenever you wish. At the end of the game,
you win $1 for every bin with exactly one ball and you lose $0.50 for
every bin with two or more balls. Empty bins do not count. How does
the one-stage-look-ahead rule work for this problem? Also answer this
question when you lose k/2 dollars rather than half a dollar for every bin
containing k ≥ 2 balls.
3.27 A bag contains w white and r red balls. You pick balls out of the bag
at random, one at a time and without replacement. You win one dollar
each time you pick a white ball. If you pick a red ball, the game is over
and you lose all the money gathered so far. You can stop picking balls
whenever you wish. What is an optimal stopping rule when you want to
maximize your expected gain?
3.28 Consider the following dice game in which you repeatedly roll two dice.
The dollar reward for each roll is the dice total, provided that the two
numbers shown are different; otherwise, the game is over and you lose
the total reward gathered so far. You can stop rolling the dice whenever
you wish. What is an optimal stopping rule when you want to maximize
the expected value of your total reward?
3.29 Let the random variable X be nonnegative and integer valued. Verify that
E(X) = Σ_{k=0}^{∞} P(X > k).
Apply this useful formula to calculate the expected value of the largest
of 10 randomly chosen numbers from 1 to 100.
where P(X = x, Y = y) denotes the probability of the joint event that X takes on the value x and Y the value y. Thus, by E(Z) = Σ_z z P(Z = z), we have E(Z) = Σ_z z Σ_{x,y: x+y=z} P(X = x, Y = y). This gives

E(Z) = Σ_z Σ_{x,y: x+y=z} (x + y) P(X = x, Y = y) = Σ_{x,y} (x + y) P(X = x, Y = y).

Thus, E(Z) = Σ_{x,y} x P(X = x, Y = y) + Σ_{x,y} y P(X = x, Y = y). This expression can be rewritten as

E(Z) = Σ_x x Σ_y P(X = x, Y = y) + Σ_y y Σ_x P(X = x, Y = y).

For fixed x, the events {X = x, Y = y} are disjoint and their union is the event {X = x}. Thus, by the third axiom of probability theory,

Σ_y P(X = x, Y = y) = P(X = x) for any x.

In the same way, Σ_x P(X = x, Y = y) = P(Y = y) for any y. This gives

E(Z) = Σ_x x P(X = x) + Σ_y y P(Y = y) = E(X) + E(Y),
This approximation is very accurate. For example, the approximate and exact
values of E(X) are 29.298 and 29.290 when n = 10.
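For the record, these two numbers are easy to reproduce in Python, assuming E(X) here is the expected number of purchases needed to collect all n distinct coupons, n Σ_{k=1}^{n} 1/k, and that the approximation referred to is the standard n(ln n + γ + 1/(2n)):

    from math import log

    EULER_GAMMA = 0.5772156649

    def expected_exact(n):
        # Expected number of draws to collect all n coupons: n * (1 + 1/2 + ... + 1/n).
        return n * sum(1 / k for k in range(1, n + 1))

    def expected_approx(n):
        return n * (log(n) + EULER_GAMMA + 1 / (2 * n))

    print(round(expected_exact(10), 3), round(expected_approx(10), 3))   # about 29.290 and 29.298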
The problem in Example 3.7 is known as the coupon-collector problem. This
problem has many applications. A nice application is the birthday-coverage
problem. What is the expected number of people needed so that each of the
365 possible birthdays of the year (ignore February 29) is represented in the
group of people? The answer is 365 Σ_{k=1}^{365} 1/k = 2,364.65. The calculation of
Example 3.8 Suppose that m balls (or tokens) are placed sequentially into
one of b bins, where the bin for each ball is selected at random. What is the
expected value of the number of empty bins?
Solution. Let the random variable X be the number of empty bins. Then,
X = X1 + · · · + Xb ,
where the random variable Xi equals 1 if bin i remains empty and 0 otherwise. Each of the m balls misses bin i with probability 1 − 1/b, so E(Xi) = P(Xi = 1) = (1 − 1/b)^m. This gives

E(X) = b (1 − 1/b)^m.

A useful approximation applies to E(X) for b large. Noting that (1 − 1/b)^b ≈ e^(−1) for b large and writing (1 − 1/b)^m as [(1 − 1/b)^b]^(m/b), it follows that E(X) ≈ b e^(−m/b).
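The quality of the approximation is easy to inspect numerically; the values of m and b below are my own illustrative choices, not taken from the text.

    from math import exp

    m, b = 2500, 1000     # 2,500 balls dropped into 1,000 bins

    exact = b * (1 - 1 / b) ** m
    approx = b * exp(-m / b)
    print(round(exact, 1), round(approx, 1))   # about 82.0 and 82.1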
Example 3.9 What is the expected value of the number of times that, in a
thoroughly shuffled deck of 52 cards, two adjacent cards are of the same rank
(two aces, two kings, etc.)?
Solution. Let the random variable Xi be equal to 1 if the cards in positions i
and i + 1 are of the same rank and 0 otherwise. Then P(Xi = 1) = 3/51 and so E(Xi) = 3/51 for i = 1, . . . , 51. The expected value of the number of times that two adjacent cards are of the same rank is given by E(X1 + · · · + X51) = 51 × 3/51 = 3.
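A short simulation sketch (my own, with an arbitrary seed) confirms that the expected number of adjacent equal-rank pairs is 3:

    import random

    rng = random.Random(7)
    deck = [rank for rank in range(13) for _ in range(4)]   # 52 cards: four cards of each of 13 ranks

    def adjacent_same_rank():
        rng.shuffle(deck)
        return sum(deck[i] == deck[i + 1] for i in range(51))

    trials = 50_000
    print(sum(adjacent_same_rank() for _ in range(trials)) / trials)   # close to 3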
Example 3.10 Suppose that you have a graph with n nodes. Each pair of nodes
is connected by an edge with probability p, independently of the occurrence of
edges between other pairs of nodes. What is the expected number of isolated
nodes?
Solution. Label the nodes k = 1, . . . , n and let the indicator variable Ik be
1 if node k is isolated and 0 otherwise. Then the number of isolated nodes is
Σ_{k=1}^{n} Ik. For each k, P(Ik = 1) = (1 − p)^(n−1) and so E(Ik) = (1 − p)^(n−1). Thus, the expected number of isolated nodes is

Σ_{k=1}^{n} E(Ik) = n(1 − p)^(n−1).
As a corollary, the probability that there is at least one isolated node is at most n(1 − p)^(n−1).
This bound is a direct consequence of the inequality P(X ≥ 1) ≤ E(X) for any
nonnegative, integer-valued random variable X (explain!).
Problems
3.30 A group of m people simultaneously enter an elevator at the ground floor.
Each person randomly chooses one of the r floors 1, 2, . . . , r as the exit
floor, where the choices of the persons are independent of each other. The
elevator only stops on a floor if at least one person wants to exit on that
floor. No other people enter the elevator at any of the floors 1, 2, . . . , r.
What is the expected value of the number of stops the elevator will make?
3.31 Use indicator variables to give an alternative proof of the inclusion–
exclusion formula in Rule 1.4. Hint: Use the relation I_{A1^c ∩ ··· ∩ An^c} = (1 − I_{A1}) · · · (1 − I_{An}), where the indicator variable I_A is equal to 1 if
the event A occurs and 0 otherwise.
3.32 What is the expected value of the number of times that two adjacent
letters will be the same in a random permutation of the 11 letters of the
word “Mississippi?”
3.33 Twelve married couples participate in a tournament. The group of 24
people is randomly split into eight teams of three people each. What is
the expected value of the number of teams with a married couple?
3.34 A particle starting at the origin moves every time unit one step up or down with equal probabilities 1/2. Verify that the expected number of returns of the random walk to the zero level during the first n time units is approximately equal to √(2/π) √n for n large. On the basis of this result, can you explain intuitively why the expected time between two successive returns of the random walk to the zero level is infinite? Hint: Use Example 1.5 and the fact that Σ_{j=1}^{m} 1/√j ≈ 2√m for m large.
3.35 In a bowl, there are 2r red balls and 2b blue balls. Each time, you
simultaneously remove two balls from the bowl until the bowl is empty.
What is the expected number of times that you pick two balls of the same
color?
3.36 What is the expected number of distinct birthdays within a randomly
formed group of 100 persons? What is the expected number of children in
a class with r children sharing a birthday with some child in another class
with s children? Assume that the year has 365 equally likely birthdays.
3.37 Let S be a given set consisting of n distinct items. A new set T is
constructed as follows. In each step, an item is chosen at random from
the set S (with replacement) and a copy of the item is added to the set T.
What is the expected value of the number of distinct items in the set T
after n steps?
3.38 Consider again Problem 1.92. What is the expected value of the number
of people who survive the first round?
3.39 A bag contains r red and w white balls. Each time, you take one ball out
of the bag at random and without replacement. You stop as soon as all the
red balls have been taken out of the bag. What is the expected number of
white balls remaining in the bag when you stop?
3.40 Consider again Problem 2.30. What is the expected value of the number
of persons having knowledge of the rumor?
3.41 What is the expected number of times that two consecutive numbers will
show up in a lottery drawing of six different numbers from the numbers
1, 2, . . . , 45?
3.42 You play a sequence of s games, where s ≥ 2. The outcomes of the
various games are independent of each other. The probability that you
will win the kth game is 1k for k = 1, 2, . . . , s. You get one dollar each
time you win two games in a row. What is the expected value of the total
amount you will get?
3.43 You distribute randomly 25 apples over 10 boxes. What is the expected
value of the number of boxes that will contain more than 3 apples?
3.44 Verify that the expected number of permutation cycles in a random
permutation of the integers 1, . . . , n is about ln(n) + γ for n large,
where γ = 0.57722 . . . is Euler’s constant. Hint: For any fixed k with
1 ≤ k ≤ n, define X_i = 1 if the integer i belongs to a cycle of length k
and X_i = 0 otherwise. Use the fact that (1/k) ∑_{i=1}^{n} X_i is the number of cycles
of length k.
3.45 Take a random permutation of the integers 1, 2, . . . , n. Let us say that
the integers i and j with i ≠ j are switched if the integer i occupies the
jth position in the random permutation and the integer j the ith position.
What is the expected value of the total number of switches?
with the convention that an empty sum is zero. However, it is not necessary to
use the probability mass function of the random variable g(X) to calculate its
expected value. The expected value of g(X) can be calculated directly from the
probability mass function of X.
Rule 3.2 For any function g of the random variable X,
E[g(X)] = ∑_{x∈I} g(x) P(X = x),
provided that ∑_{x∈I} |g(x)| P(X = x) < ∞, or the function g(x) is nonnegative.
This rule is called the substitution rule and is also known as the law of the
unconscious statistician. The proof of the rule is simple. Let Y = g(X) and
denote by J the range of Y. Then,
∑_{x∈I} g(x) P(X = x) = ∑_{y∈J} y ∑_{x: g(x)=y} P(X = x) = ∑_{y∈J} y P(g(X) = y)
= ∑_{y∈J} y P(Y = y) = E(Y) = E[g(X)].
Note that the order in which the terms of the series are added does not matter
in view of the assumption that the series is absolutely convergent or has only
nonnegative terms.
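A quick way to convince yourself of the substitution rule numerically is to compute E[g(X)] along both routes; the minimal Python sketch below does this for an arbitrarily chosen probability mass function and the function g(x) = x², both of which are hypothetical choices made only for the check.

```python
from collections import defaultdict
from fractions import Fraction

# Arbitrary illustrative pmf of X on I = {-2, -1, 0, 1, 2} (hypothetical numbers).
pmf_X = {-2: Fraction(1, 10), -1: Fraction(2, 10), 0: Fraction(3, 10),
         1: Fraction(2, 10), 2: Fraction(2, 10)}
g = lambda x: x * x                      # the function g(x) = x^2

# Substitution rule: E[g(X)] = sum over x of g(x) P(X = x).
lhs = sum(g(x) * p for x, p in pmf_X.items())

# Direct route: first build the pmf of Y = g(X), then compute E(Y).
pmf_Y = defaultdict(Fraction)
for x, p in pmf_X.items():
    pmf_Y[g(x)] += p
rhs = sum(y * p for y, p in pmf_Y.items())

print(lhs, rhs)                          # both equal E[g(X)]
assert lhs == rhs
```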
A frequently made mistake of beginning students is to set E[g(X)] equal to
g(E(X)). In general, E[g(X)] ≠ g(E(X))! Stated differently, the average value
of the input X does not in general determine the average value of the output
g(X). As a counterexample, take the random variable X with P(X = 1) =
P(X = −1) = 0.5 and take the function g(x) = x². An exception is the case of
a linear function g(x) = ax + b. Then, by Rule 3.2,
Rule 3.3 For any constants a and b,
E(aX + b) = aE(X) + b,
provided that E(X) is finite.
This rule is valid for any type of random variable X for which E(X) is
finite. Another result that is valid for any type of random variable is Jensen’s
inequality:
Rule 3.4 Suppose that the function g(x) is convex on a line segment
containing the range of the random variable X. Then,
E[g(X)] ≥ g(E(X)),
provided that the expectations are finite.
The proof is simple. For ease, assume that g(x) is differentiable. A differen-
tiable function g(x) is convex only if, for any point x0 , the graph of g(x) lies
entirely above the tangent line at the point x0 :
g(x) ≥ g(x_0) + g′(x_0)(x − x_0) for all x.
Choosing x = X and x0 = E(X) in this inequality and taking the expectation on
both sides of the inequality, the inequality is preserved and Jensen’s inequality
follows. Here we use Rule 3.3 and the fact that E[g1 (X)] ≥ E[g2 (X)] if g1 (x) ≥
g2 (x) for all x.
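A one-line numerical check of Jensen's inequality, added here purely as an illustration, using the convex function g(x) = x² and the two-point random variable from the counterexample above:

```python
# Check E[g(X)] >= g(E(X)) for the convex function g(x) = x**2.
pmf = {1: 0.5, -1: 0.5}                 # the two-point X used in the text's counterexample
g = lambda x: x ** 2

EX = sum(x * p for x, p in pmf.items())
EgX = sum(g(x) * p for x, p in pmf.items())
print(EgX, g(EX))                        # 1.0 versus 0.0: E[g(X)] >= g(E(X))
assert EgX >= g(EX)
```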
3.4.1 Variance
An important case of a function of X is the random variable g(X) = (X − μ)²,
where μ = E(X) denotes the expected value of X and is assumed to be finite.
The expected value of (X − μ)² is called the variance of X. It is denoted by
var(X) = E[(X − μ)²].
It is a measure of the spread of the possible values of X. Often one uses the
standard deviation, which is defined as the square root of the variance. It is
useful to work with the standard deviation, because of the fact that it has the
same units as the random variable X itself.
The formula for var(X) allows for another useful representation. Since
(X − μ)² = X² − 2μX + μ², it follows from the linearity of the expectation
operator and Rule 3.3 that E[(X − μ)²] = E(X²) − 2μE(X) + μ². Therefore,
var(X) is also given by
var(X) = E(X²) − μ².
Rule 3.3 for the expectation operator has the following analogue for the
variance operator:
Rule 3.5 For any constants a and b,
var(aX + b) = a2 var(X).
To illustrate this relation, suppose that the rate of return on stock A is a random
variable X taking on the values 30%, 10%, and −10% with respective proba-
bilities 0.25, 0.50, and 0.25. The rate of return on stock B is a random variable
Y taking on the values 50%, 10%, and −30% with the same probabilities 0.25,
0.50, and 0.25. Without calculating the actual values of the standard deviation,
one can see that the standard deviation of the rate of return on stock B is twice
as large as that on stock A. The explanation is that the random variable Y is
distributed as 2X − 0.1.
Example 3.11 What is the standard deviation of the total score of a roll of two
dice?
Solution. Let the random variable X denote the total score. Since E(X) =
3.5 + 3.5 = 7 and P(X = k) = P(X = 14 − k) = (k − 1)/36 for 2 ≤ k ≤ 7, we get
var(X) = ∑_{k=2}^{7} k² (k − 1)/36 + ∑_{k=8}^{12} k² (13 − k)/36 − 7² = 35/6.
The standard deviation of X is √var(X) = 2.415.
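The numbers in this example are easy to confirm by brute-force enumeration of the 36 equally likely outcomes; a minimal Python sketch, added for illustration:

```python
from itertools import product
from math import sqrt

# Enumerate all 36 equally likely outcomes of two dice.
scores = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]
EX = sum(scores) / 36
EX2 = sum(s * s for s in scores) / 36
var = EX2 - EX ** 2
print(EX, var, sqrt(var))    # 7.0, 5.8333... (= 35/6), 2.415...
```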
Example 3.12 Suppose that the random variable X has the so-called Poisson
distribution P(X = k) = e−λ λk /k! for k = 0, 1, . . . . What are the expected
value and the variance of X?
Solution. It will be shown that the Poisson distribution has the remarkable
property that its variance is equal to its expected value. That is,
var(X) = E(X) = λ.
To verify this, note that E(X) = λe^{−λ} + 2 (λ²/2!) e^{−λ} + 3 (λ³/3!) e^{−λ} + · · · and so
E(X) = λe^{−λ} (1 + λ/1! + λ²/2! + · · ·) = λe^{−λ} e^{λ} = λ,
where the third equality uses the power-series expansion e^x = 1 + x/1! + x²/2! + · · ·
for every x. Using the identity k² = k(k − 1) + k, we have
E(X²) = ∑_{k=0}^{∞} k(k − 1) e^{−λ} λ^k/k! + ∑_{k=0}^{∞} k e^{−λ} λ^k/k! = λ² e^{−λ} ∑_{k=2}^{∞} λ^{k−2}/(k − 2)! + E(X)
= λ² e^{−λ} ∑_{n=0}^{∞} λ^n/n! + λ = λ² + λ.
Hence var(X) = E(X²) − [E(X)]² = λ² + λ − λ² = λ.
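The equality of mean and variance for the Poisson distribution is also easy to see numerically; the sketch below (an illustrative check with an arbitrarily chosen λ) builds the Poisson probabilities recursively and accumulates truncated sums.

```python
from math import exp

lam = 4.7                        # arbitrary Poisson parameter for the check
p = exp(-lam)                    # P(X = 0)
EX = EX2 = total = 0.0
for k in range(400):             # truncate the infinite sums; the tail is negligible
    total += p
    EX += k * p
    EX2 += k * k * p
    p *= lam / (k + 1)           # P(X = k+1) = P(X = k) * lam/(k+1)
print(total, EX, EX2 - EX**2)    # ~1.0, ~4.7, ~4.7: mean and variance both equal lambda
```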
E(X²) = ∑_{i=1}^{b} E(X_i²) + 2 ∑_{i=1}^{b−1} ∑_{j=i+1}^{b} E(X_i X_j).
Thus, E(X_i X_j) = (1 − 2/b)^m for all i ≠ j. This leads to
E(X²) = b (1 − 1/b)^m + b(b − 1) (1 − 2/b)^m.
∑_{k=150}^{250} g(k) P(X = k) = ∑_{k=150}^{200} (−100 + k) (1/101) + ∑_{k=201}^{250} 100 (1/101),
and so
E[g(X)] = 3,825/101 + 5,000/101 = 87.3762.
To find the standard deviation of g(X), we apply the formula var(Z) = E(Z²) −
[E(Z)]² with Z = g(X). Letting h(x) = [g(x)]², we have h(x) = (−100 + x)² for x ≤ 200
and h(x) = 100² for x > 200. Using the substitution rule again, we get
E[h(X)] = ∑_{k=150}^{250} h(k) P(X = k) = ∑_{k=150}^{200} (−100 + k)² (1/101) + ∑_{k=201}^{250} 100² (1/101).
Problems
3.46 There are two investment projects A and B each involving an initial
investment of $2,000. The possible payoffs of investment A are $1,000,
$2,000, and $3,000 with respective probabilities 0.20, 0.40, and 0.40,
while the possible payoffs of investment B are $1,600, $2,000, and
$2,750 with respective probabilities 0.25, 0.35, and 0.40. Calculate the
expected value and the standard deviation of the payoff for each of the
two investments.
3.47 Let the random variable X have a discrete uniform distribution on the
integers a, a + 1, . . . , b with a < b, that is, P(X = k) = 1/(b − a + 1) for
a ≤ k ≤ b. What is the variance of X?
3.48 Suppose that the random variable X satisfies P(X = a) = p and
P(X = b) = 1 − p, where a < b. Verify that var(X) is largest for p = 1/2.
3.49 Consider again Problem 3.33. What is the standard deviation of the
number of teams with a married couple?
3.50 In each drawing of the 6/45 lottery, six different integers are randomly
chosen from 1, 2, . . . , 45. What are the expected value and the standard
deviation of the number of integers from 1, 2, . . . , 45 that do not show up
in 15 drawings?
3.51 The number of paramedical treatments for a particular sports injury is a
random variable X with probability mass function P(X = k) = (11 − k)/55 for
k = 1, . . . , 10.
The manipulations with the summations are justified by the assumption that
the series ∑_{x∈I} x P(X = x) and ∑_{y∈J} y P(Y = y) are absolutely convergent.
The converse of the above result is not true. It is possible that E(XY) =
E(X)E(Y), while X and Y are not independent. A simple example is as follows.
Suppose that two fair dice are tossed. Denote by the random variable V1 the
number appearing on the first die and by the random variable V2 the number
appearing on the second die. Let X = V1 + V2 and Y = V1 − V2 . It is obvious
that the random variables X and Y are not independent. We leave it to the reader
to verify that nevertheless E(XY) = E(X)E(Y).
using the linearity property of the expectation operator and using Rule 3.8
together with E[(X − μX )] = E[(Y − μY )] = 0.
Rule 3.9 is valid for any type of independent random variables X and
Y. Also, Rule 3.9 can be extended directly to the case of finitely many
independent random variables by using the algebraic formula (a_1 + · · · + a_n)² =
∑_{i=1}^{n} a_i² + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^{n} a_i a_j. Hence, for independent random variables
X_1, X_2, . . . , X_n,
var(X_1 + X_2 + · · · + X_n) = var(X_1) + var(X_2) + · · · + var(X_n).
This alternative formulation follows by noting that σ (aX) = aσ (X) for any
constant a > 0. Rule 3.10 is known as the square-root law. This law is
sometimes called de Moivre’s equation. The law was discovered by Abraham
de Moivre (1667–1754) in 1730.3 This law had an immediate impact on the
way the mass of golden coins that were struck by the London Mint was
controlled. The allowable deviation in the guinea was 1/400 of its intended
weight of 128 grains, which amounts to 0.32 grains. A run of 100 coins
was taken periodically from the Mint and the total weight of these coins was
compared to the weight of a standard of 100 coins. For almost 600 years the
watchdogs allowed a variability of 100 × 0.32 = 32 grains in the weight of the
100 coins, but after the discovery of the square-root law in 1730, the allowable
variability for 100 coins was changed to √100 × 0.32 = 3.2 grains. The gold
for the coins was provided by the kings of England. Ignorance of the square-
root law may have cost them a fortune.
Convolution Formula
Suppose that X and Y are two discrete random variables each having the set of
nonnegative integers as the range of possible values. A useful rule is
Rule 3.11 If the nonnegative, integer-valued random variables X and Y are
independent, then
P(X + Y = k) = ∑_{j=0}^{k} P(X = j) P(Y = k − j) for k = 0, 1, . . . .
This rule is called the convolution rule. The proof is as follows. Fix k. Let A be
the event {X + Y = k} and B_j be the event {X = j}. Then A = ∪_{j=0}^{∞} A B_j. The
events A B_j are disjoint and so P(A) = ∑_{j=0}^{∞} P(A B_j). Thus,
P(X + Y = k) = ∑_{j=0}^{∞} P(X + Y = k, X = j).
Since P(X + Y = k, X = j) = 0 for j > k and, by independence, P(X + Y = k, X = j) =
P(X = j)P(Y = k − j) for j ≤ k, the convolution formula follows. As an illustration,
suppose that X and Y are independent random variables having Poisson distributions
with respective expected values λ and μ. Then
P(X + Y = k) = ∑_{j=0}^{k} e^{−λ} (λ^j/j!) e^{−μ} (μ^{k−j}/(k − j)!) = (e^{−(λ+μ)}/k!) ∑_{j=0}^{k} \binom{k}{j} λ^j μ^{k−j}.
Next, by applying Newton’s binomium (a + b)^k = ∑_{j=0}^{k} \binom{k}{j} a^j b^{k−j}, we get
the desired result P(X + Y = k) = e^{−(λ+μ)} (λ + μ)^k/k! for k = 0, 1, . . . .
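This closure property of the Poisson family under addition can also be checked numerically; the sketch below (an added illustration with arbitrarily chosen λ and μ and a harmless truncation) convolves two Poisson probability mass functions and compares the result with the Poisson(λ + μ) probabilities.

```python
from math import exp

def poisson_pmf(lam, n_max):
    """Return [P(X = 0), ..., P(X = n_max)] for a Poisson(lam) random variable."""
    p, out = exp(-lam), []
    for k in range(n_max + 1):
        out.append(p)
        p *= lam / (k + 1)
    return out

lam, mu, n_max = 2.3, 1.1, 60          # arbitrary parameters for the check
px, py = poisson_pmf(lam, n_max), poisson_pmf(mu, n_max)
pz = poisson_pmf(lam + mu, n_max)

for k in range(10):
    conv = sum(px[j] * py[k - j] for j in range(k + 1))   # convolution formula
    print(k, round(conv, 10), round(pz[k], 10))           # the two columns agree
```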
Problems
3.56 A drunkard is standing in the middle of a very large town square. He
begins to walk. Each step is a unit distance in one of the four directions
east, west, north, and south. All four possible directions are equally
probable. The direction of each step is chosen independently of the
direction of the other steps. The drunkard takes a total of n steps. Verify
that the squared distance between the drunkard’s position after n steps
and his starting position has expected value n for any value of n. Hint:
The squared distance can be written as (∑_{i=1}^{n} X_i)² + (∑_{i=1}^{n} Y_i)², where
the random variables Xi and Yi denote the changes in the x-coordinate and
the y-coordinate of the position of the drunkard caused by his ith step.
3.57 Let the random variable X be defined by X = YZ, where Y and Z are
independent random variables each taking on the values −1 and 1 with
probability 0.5. Verify that X is independent of both Y and Z, but not of
Y + Z.
3.58 Suppose that X_1, . . . , X_n are independent random variables each having
expected value μ and variance σ². Define the sample mean and the
sample variance by the random variables X̄_n = (1/n) ∑_{k=1}^{n} X_k and
S_n² = (1/n) ∑_{k=1}^{n} (X_k − X̄_n)². Show that E(X̄_n) = μ and E(S_n²) = ((n − 1)/n) σ². Hint:
Writing X_k − X̄_n as X_k − μ + μ − X̄_n, verify that ∑_{k=1}^{n} (X_k − X̄_n)² =
∑_{k=1}^{n} (X_k − μ)² − n(X̄_n − μ)².
2 2
3.59 Let Xi denote the number of integers smaller than i that precede i in
a random permutation of the integers 1, . . . , 10. What are the expected
value and the variance of the sum X_2 + · · · + X_10? Hint: Verify that
X_i = ∑_{k=2}^{i} R_k, where R_k = 1 if number k − 1 precedes number k in
the random permutation and R_k = 0 otherwise.
3.60 Suppose that X and Y are independent random variables each having the
same probability mass function {(1 − p)^{i−1} p, i = 1, 2, . . .}. What is the
probability mass function of X + Y?
3.61 Let X1 , X2 , . . . be a sequence of independent random variables each
having the same distribution with finite mean. Suppose N is a nonneg-
ative, integer-valued random variable such that the event {N ≥ n} does
not depend on X_n, X_{n+1}, . . . for any n ≥ 1. For the case that the X_k
are nonnegative, prove Wald’s equation⁴ stating that E(∑_{k=1}^{N} X_k) =
E(N)E(X_1). Hint: Define the indicator variable I_k by I_k = 1 if N ≥ k
and I_k = 0 otherwise and note that ∑_{k=1}^{N} X_k = ∑_{k=1}^{∞} X_k I_k.
It is worthwhile pointing out that the binomial distribution has nearly all
of its mass within three standard deviations of the expected value when
np(1 − p) ≥ 25. In a bold advertisement campaign, a beer company once made
clever use of this fact. The beer company invited 100 beer drinkers for a blind
taste test to compare the company’s beer with the beer of a key competitor.
The live taste test was broadcast at half time of the Super Bowl, in front of tens
of millions of people watching the playoffs on television. The brilliant move of
the marketeers was not to invite the average beer drinker, but beer drinkers
of the competing beer. Since many beers taste about the same and the typical
beer drinker cannot tell the difference between two such beers in a blind taste
test, the marketeers were pretty sure that the test would reveal that 35% or
more of the beer drinkers would prefer the other beer over their favorite beer
(a value less than 35 for a binomial random variable with n = 100 and
p = 1/2 would be more than three standard deviations below the expected
value). Such a test result would be very impressive, since it concerns beer
drinkers who stated that they swear by the competing beer. In the advertisement
campaign the laws of probability indeed acted in favor of the beer company, as
anticipated.
A classical problem of historical significance in which the binomial distri-
bution shows up is the problem of points. This problem was the subject of
the correspondence between the great French mathematicians Blaise Pascal
(1623–1662) and Pierre de Fermat (1601–1665). This correspondence laid
the groundwork for the birth of the study of probability. It was the cata-
lyst that enabled probability theory to develop beyond mere combinatorial
enumeration.5
Example 3.14 Two equally strong players A and B play a ball game such that a
total of six points is required to win the game. Each inning counts for one point.
The pot is 24 ducats and goes to the winner of the game. By some incident,
the players cannot finish the game when player A has five points and player B
three. How should the pot be divided between the two players?
Solution. This problem had been the subject of debate for quite some time
before it was solved by Pascal and Fermat in the seventeenth century. In the
fifteenth and sixteenth centuries, the Italian mathematicians Pacioli, Tartaglia,
and Cardano proposed different schemes to divide the pot, but these schemes
were based on deterministic arguments and were not satisfying. The idea of
Pacioli was to divide the pot according to the score 5:3 and to split the pot
of 24 ducats into (5/8) × 24 = 15 ducats for player A and 9 ducats for player B.
Tartaglia made the following comment to Pacioli’s rule: “his rule seems neither
agreeable nor good, since, if one player has, by chance, one point and the other
no points, then, by following this rule, the player who has one point should
take the whole pot, which obviously does not make sense.” Tartaglia proposed
another scheme that is based on the difference between the numbers of points
already won by the two players. He calculated that the difference 5 − 3 = 2 in
the points already won by the two players is one-third of the required 6 points
and proposed giving player A half of the pot plus one-third of half of the pot,
which amounts to 12 + (1/3) × 12 = 16 ducats.
In the summer of 1654, Pascal and Fermat addressed the problem of
points. They realized that in a fair division of the pot, the amounts of
prize money received by the players should be proportional to the respective
win probabilities of the players if the game were to be continued. These
probabilities depend only on the number of points left to be won. At the point
of stopping, player A is 1 point away from the required 6 points and player B
3 points. In the actual game, at most 1 + 3 − 1 = 3 more innings would be
needed to declare a winner. A trick to solve the problem is to imagine that three
additional innings were played. The probability of player A being the ultimate
winner if the original game was to be continued is the same as the probability
5 Until the middle of the seventeenth century, probability calculations were restricted to counting
the number of favorable outcomes and the total number of outcomes in games of chance, using
the handbook Liber de Ludo Aleae (The Book of Games of Chance) by the colorful Italian
mathematician and physician Gerolamo Cardano (1501–1576).
that A would win one or more innings in three additional innings (explain!).
This probability is given by the binomial probability
∑_{k=1}^{3} \binom{3}{k} (1/2)^k (1/2)^{3−k} = 7/8.
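Following the division principle of Pascal and Fermat stated above (stakes proportional to the win probabilities in the imagined extra innings), the computation is a few lines of Python; the sketch below is an added illustration and the resulting 7 : 1 split of the 24 ducats simply applies that principle.

```python
from math import comb

# Player A needs 1 more point, player B needs 3; imagine 3 extra innings.
p_A = sum(comb(3, k) * 0.5**3 for k in range(1, 4))   # P(A wins at least one of 3 innings)
print(p_A)                                            # 0.875 = 7/8

pot = 24
print(p_A * pot, (1 - p_A) * pot)                     # proportional division: 21 and 3 ducats
```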
This probability has the value 0.0038 for r = 2 and the value 0.00037 for
r = 3. Therefore, three reserve detectors should be installed.
Problems
3.62 Daily Airlines flies every day from Amsterdam to London. The price for
a ticket on this popular route is $75. The aircraft has a capacity of 150
passengers. Demand for tickets is greater than capacity, and tickets are
sold out well in advance of flight departures. The airline company sells
160 tickets for each flight, to protect itself against no-show passengers.
The probability of a passenger being a no-show is q = 0.1. No-show
passengers are refunded half the price of their tickets. Passengers that do
show up and are not able to board the flight due to the overbooking are
refunded the full amount of their tickets plus an extra $425 compensation.
What is the probability that more passengers will turn up for a flight than
the aircraft has the seating capacity for? What are the expected value and
standard deviation of the daily return for the airline?
3.63 What are the chances of getting at least one six in one throw of 6 dice,
at least two sixes in one throw of 12 dice, and at least three sixes in one
throw of 18 dice? Which game do you think is more likely to win? This
problem is known as the Newton–Pepys problem and was brought to the
attention of Isaac Newton by Samuel Pepys, who was President of the
Royal Society of London and apparently a gambling man.
3.64 The Yankees and the Mets are playing a best-four-of-seven series. The
winner takes all the prize money of one million dollars. The two teams
are evenly matched and the outcomes of the games are independent of
each other. Unexpectedly, the competition must be suspended when the
Yankees lead two games to one. How should the prize money be divided
between the two teams?
3.65 The Salk vaccine against polio was tested in 1954 in a carefully designed
field experiment. Approximately 400,000 children took part in this exper-
iment. Using a randomization procedure, the children were randomly
divided into two groups of equal size, a treatment group and a control
group. The vaccine was given only to the children in the treatment group;
the control-group children received placebo injections. The children did
not know which of the two groups they had been placed into. The
diagnosticians also lacked this information (double-blind experiment).
Fifty-seven children in the treatment group went on to contract polio,
while 142 children in the control group contracted the illness. How likely
would such a difference in outcomes be when the assignment to the
treatment or control group had absolutely no effect on the outcomes?
3.66 A game of chance played historically by Canadian Indians involved
throwing eight flat beans into the air and seeing how they fell. The beans
were symmetrical and were painted white on one side and black on the
other. The bean thrower would win one point if an odd number of beans
came up white, two points if either zero or eight white beans came up,
and would lose one point for any other configurations. Does the bean
thrower have the advantage in this game?
3.67 In order to qualify for a certain program, you are put through n tests. You
pass any test with probability p, independently of the other tests. If you
have passed k of the n tests, then you are admitted to the program with
probability k/n for k = 0, 1, . . . , n. What is the unconditional probability
that you will be admitted to the program?
3.68 The final match of the soccer World Cup is a draw at the end of extra
time. The teams then proceed to penalty kicks. Each team attempts five
shots. Suppose that each shot results in a goal with probability 0.7,
independently of the other shots. What is the probability that the match
is still tied after one round of five shots by each team?
3.69 You flip 100 coins. Those that land heads are set aside. You then flip the
coins that have landed tails and again set aside all those that land heads.
Finally, you flip a third time all coins that landed tails twice and again set
aside all those that land heads. What is the probability mass function of
the number of coins that are set aside?
3.70 In an ESP experiment, a medium has to guess the correct symbol on each
of 250 Zener cards. Each card has one of the five possible Zener symbols
on it and each of the symbols is equally likely to appear. The medium
will get $100,000 if he gives 82 or more correct answers. Can you give a
quick assessment of the probability that the medium has to be paid out?
3.71 On bridge night, the cards are dealt ten times. Only twice you do receive
cards with an ace. From the beginning, you had your doubts as to whether
the cards were being shuffled thoroughly. Are these doubts confirmed?
3.72 An experiment has r possible outcomes O1 , O2 , . . . , Or with respective
probabilities of p1 , p2 , . . . , pr . Suppose that n independent repetitions of
the experiment are performed. Let the random variable Xi be the number
of times that the outcome Oi occurs. Argue that
P(X_1 = x_1, X_2 = x_2, . . . , X_r = x_r) = [n!/(x_1! x_2! · · · x_r!)] p_1^{x_1} p_2^{x_2} · · · p_r^{x_r}
for all nonnegative integers x1 , x2 , . . . , xr with x1 + x2 + · · · + xr = n.
This distribution is called the multinomial distribution.
3.73 A particular game is played with five poker dice. Each die displays an
ace, king, queen, jack, ten, and nine. Players may bet on two of the six
images displayed. When the dice are thrown and the bet-on images turn
up, the player receives three times the amount wagered. In all other cases,
the amount of the wager is forfeited. Is this game advantageous for the
player?
The Poisson model was of great importance in the analysis of the distribution
of hits of flying bombs (V-1 and V-2 missiles) in London during World War
II. The British authorities were anxious to know if these weapons could be
aimed accurately at a particular target, or whether they were landing at random.
If the missiles were in fact only randomly targeted, the British could simply
disperse important installations to decrease the likelihood of their being hit.
An area of 36 km2 in South London was divided into 576 regions, 250 m wide
by 250 m long, and the number of hits in each region was determined. The
576 regions were struck by 535 bombs, and so the average number of hits per
region was 0.9288. There were 229 regions with zero hits, 211 regions with
one hit, 93 regions with two hits, 35 regions with three hits, 7 regions with
four hits, 1 region with five hits, and 0 regions with six or more hits. The
analysis showed that the observed relative frequencies of the number of hits
were each very close to the corresponding probabilities given by a Poisson
distribution with expected value 0.9288. This implied that the distribution of
hits in the area was much like the distribution of hits when each of the many
flying bombs was to fall on any of the equally sized regions with the same small
probability, independently of the other flying bombs. It will be seen below that
this characterizes the Poisson distribution. The analysis convinced the British
military that the bombs struck at random and had no advanced aiming ability.
The Poisson distribution can be seen as a limiting case of the binomial
distribution with parameters n and p, when n is very large and p is very
small. That is, in a very large number of independent repetitions of a chance
experiment having a very small probability of success, the total number of
successes is approximately Poisson distributed. This insight is essential in
order to apply the Poisson distribution in practical situations. To give a precise
mathematical formulation of the limiting result, let X represent a binomially
distributed random variable with the parameters n and p. Assume now that n
becomes very large and p becomes very small, while np is kept equal to the
constant λ. The following is then true:
lim_{n→∞, p→0} P(X = k) = e^{−λ} λ^k/k! for k = 0, 1, . . . .
The proof is as follows. Since p = λ/n, P(X = k) = \binom{n}{k} (λ/n)^k (1 − λ/n)^{n−k} and so
P(X = k) = (λ^k/k!) (1 − λ/n)^n [n!/(n^k (n − k)!)] (1 − λ/n)^{−k}.
k n
Now let’s look at the different terms separately. Take a fixed value for k, where
0 ≤ k ≤ n. The term n!/(n^k (n − k)!) can be written as
n(n − 1) · · · (n − k + 1)/n^k = (1 − 1/n) · · · (1 − (k − 1)/n).
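The convergence of the binomial(n, λ/n) probabilities to the Poisson(λ) probabilities can be watched numerically; the following sketch is an added illustration with λ = 1 chosen arbitrarily.

```python
from math import comb, exp, factorial

lam = 1.0
for n in (10, 100, 1000, 10000):
    p = lam / n
    binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5)]
    print(n, [round(b, 5) for b in binom])        # binomial probabilities P(X = 0..4)

poisson = [exp(-lam) * lam**k / factorial(k) for k in range(5)]
print("limit", [round(q, 5) for q in poisson])    # the Poisson limit
```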
the Poisson model. In order to illustrate this, suppose you read in a newspaper
that, based on an average of 1,000 traffic deaths per year in previous years, the
number of traffic deaths for last year rose 12%. How can you evaluate this?
The number of traffic deaths over a period of one year can be modeled as a
Poisson-distributed random variable with expected value 1,000 (why is this
model reasonable?). An increase of 12% on an average of 1,000 is an increase
of 120, or rather an increase of 120/√1,000 = 3.8 standard deviations above
the expected value of 1,000. This can be considered exceptional. In this way,
we find justification for the conclusion that the increase in the number of traffic
deaths is not coincidental. What would your conclusions have been if, based on
an average of 100 traffic deaths per year, a total of 112 traffic deaths occurred
in the past year? You see that the Poisson model is a practically useful model
to make quick statistical assessments.
Example 3.18 The Pegasus Insurance Company has introduced a policy that
covers certain forms of personal injury with a standard payment of $100,000.
The yearly premium for the policy is $25. On average, 100 claims per year
lead to payment. There are more than one million policyholders. What is the
probability that more than 15 million dollars will have to be paid out in the
space of a year?
Solution. In fact, every policyholder conducts a personal experiment in prob-
ability after purchasing this policy, which can be considered to be “successful”
if the policyholder files a rightful claim during the ensuing year. Let the
random variable X denote the total number of claims that will be approved for
payment during the year of coverage. In view of the many policyholders, there
is a large number of independent probability experiments each having a very
small probability of success. This means that the probability distribution of the
random variable X can be modeled by a Poisson distribution with parameter
λ = 100. The probability of having to pay out more than 15 million dollars
is given by P(X > 150). Since E(X) = 100 and σ (X) = 10, a value
of 150 claims lies five standard deviations above the expected value. Thus,
without doing any further calculations, we can draw the conclusion that the
probability of paying out more than 15 million dollars in the space of a year
must be extremely small. The precise value of the probability P(X > 150) is
1 − ∑_{k=0}^{150} e^{−100} 100^k/k! = 1.23 × 10^{−6}. Not a probability the insurance executives
need worry about.
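The tail probability quoted here can be reproduced directly from the Poisson(100) probabilities; the short sketch below, added as an illustration, builds the terms recursively to avoid overflow.

```python
from math import exp

lam = 100.0
p = exp(-lam)                 # P(X = 0); about 3.7e-44, still representable
cdf = 0.0
for k in range(151):          # accumulate P(X <= 150)
    cdf += p
    p *= lam / (k + 1)
print(1.0 - cdf)              # P(X > 150), approximately 1.23e-6
```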
sheik’s visit to Lightstone, the jackpot is listed at 12.5 million dollars. The oil
sheik decides to take a gamble and orders his retinue to fill in 5 million tickets
in his name. These 5 million tickets are not filled in by hand but are random
picks generated by the computer of the lottery organization (inevitably, many
six-number sequences will be generated more than once6 ). Suppose that the
local people have purchased 2 million tickets and that these tickets are also
random picks. Each ticket costs $1. What is the probability that the sheik will
be among the jackpot winners and what is the probability that the sheik will be
the only winner? What is the probability that the sheik’s initial outlay will be
won back from the jackpot?
Solution. The probability that a particular ticket is a winning ticket for the
jackpot is 1/\binom{40}{6} = 1/3,838,380. Very many tickets are filled in by the sheik
and the local people. Therefore, the Poisson model is applicable. Let the
random variable X denote the number of winning tickets for the jackpot among
the 5 million tickets of the sheik and the random variable Y be the number of
winning tickets for the jackpot among the 2 million tickets of the locals. The
random variables X and Y can be modeled as Poisson random variables with
respective parameters
λ = 5,000,000/3,838,380 and μ = 2,000,000/3,838,380.
The random variables X and Y are independent. The probability that the sheik
will be among the jackpot winners is
P(X ≥ 1) = 1 − e−λ = 0.7282.
The probability that the sheik is the only jackpot winner is
P(X ≥ 1, Y = 0) = P(X ≥ 1)P(Y = 0) = (1 − e−λ ) e−μ = 0.4325.
To answer the last question, let p(r, s) denote the probability that the sheik has
r winning tickets for the jackpot and the locals have s winning tickets for the
jackpot. Then,
p(r, s) = e^{−λ} (λ^r/r!) × e^{−μ} (μ^s/s!) for r, s = 0, 1, . . . .
Let A = {(r, s) : (r/(r + s)) × 12.5 ≥ 5}. Then, the probability that the sheik’s initial
outlay will be won back from the jackpot is ∑_{(r,s)∈A} p(r, s) = 0.6935.
6 The expected number of different combinations among 5 million random picks is 2,795,050, as
follows by using the result of Example 3.8 with m = 5 × 10^6 and b = \binom{40}{6}.
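The three probabilities of this example (0.7282, 0.4325, and 0.6935) can be reproduced with a small amount of code; the sketch below is an added illustration that truncates the Poisson ranges where the remaining mass is negligible.

```python
from math import exp

lam = 5_000_000 / 3_838_380     # expected number of winning tickets for the sheik
mu = 2_000_000 / 3_838_380      # expected number of winning tickets for the locals

def poisson_pmf(a, n_max):
    p, out = exp(-a), []
    for k in range(n_max + 1):
        out.append(p)
        p *= a / (k + 1)
    return out

px, py = poisson_pmf(lam, 40), poisson_pmf(mu, 40)    # truncation at 40 is more than enough
print(1 - exp(-lam))                                  # P(X >= 1) = 0.7282
print((1 - exp(-lam)) * exp(-mu))                     # P(X >= 1, Y = 0) = 0.4325

# The sheik wins back his 5 million outlay when his share r/(r+s) of the
# 12.5 million jackpot is at least 5 million, i.e., when r >= 1 and s <= 1.5 r.
prob = sum(px[r] * py[s] for r in range(1, 41) for s in range(41) if s <= 1.5 * r)
print(prob)                                           # approximately 0.6935
```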
P(X = j) = e^{−λp} (λp)^j/j! and P(Y = k) = e^{−λ(1−p)} (λ(1 − p))^k/k!
for all j, k ≥ 0. Thus, X and Y are Poisson distributed with expected values λp
and λ(1 − p). Moreover,
P(X = j, Y = k) = P(X = j)P(Y = k) for all j, k ≥ 0,
showing the remarkable result that X and Y are independent.
weak because of the large number (365) of possible birth dates. It is therefore
reasonable to approximate the distribution of X by a Poisson distribution with
expected value λ = np = (3/2) m(m − 1)/365. In particular, P(X ≥ 1) ≈ 1 − e^{−λ}.
Thus,
P(two or more people have birthdays within one day of each other)
≈ 1 − e^{−(3/2) m(m−1)/365}.
More generally, the Poisson heuristic leads to the approximation formula
1 − e^{−((2r+1)/2) m(m−1)/365}
for the probability that two or more people in a randomly assembled group
of m people have their birthdays within r days of each other. For r = 1,
the smallest value of m for which this approximate probability is at least 0.5
is 14. The approximate value of the probability that two or more people in a
group of m = 14 people will have birthdays within one day of each other is
1 − e−0.74795 = 0.5267. The exact value is 0.5375, and is calculated from the
following formula for the probability of two or more people in a group of n
people having their birthdays within r days from each other:8
1 − (365 − 1 − nr)!/[365^{n−1} (365 − (r + 1)n)!].
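The quality of the Poisson-heuristic approximation can be checked against the exact formula of Naus; the sketch below, added for illustration, does so for r = 1 and a group of 14 people, reproducing the values 0.5267 and 0.5375 quoted above.

```python
from math import exp

def approx_near_birthday(m, r):
    """Poisson-heuristic approximation from the text."""
    return 1 - exp(-(2 * r + 1) * m * (m - 1) / (2 * 365))

def exact_near_birthday(n, r):
    """Exact formula: 1 - (365-1-nr)! / (365^(n-1) * (365-(r+1)n)!)."""
    ratio = 1.0
    for j in range(365 - (r + 1) * n + 1, 365 - 1 - n * r + 1):
        ratio *= j / 365      # accumulates the factorial ratio divided by 365^(n-1)
    return 1 - ratio

print(approx_near_birthday(14, 1))   # 0.5267...
print(exact_near_birthday(14, 1))    # 0.5375...
```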
Another interesting problem is the following. How many people are required
in order to have all 365 possible birthdays covered with a probability of at least
50%? This problem is called the birthday-coverage problem and can be framed
as a coupon-collector problem, where birthdays correspond to coupons and
people to purchases. The coupon-collector problem is as follows. Each time
you purchase a certain product you receive a coupon, which is equally likely
to be any one of n types. What is the probability of collecting a complete set
of coupons in exactly r purchases? Imagine that you conduct a trial for each
of the n coupons. Trial i is said to be successful if coupon i is not among the r
purchases. Each trial has the same success probability p = ((n − 1)/n)^r. A complete
set of coupons is obtained only if there is no successful trial. Thus, by the
Poisson heuristic, the probability of collecting a complete set of coupons in
exactly r purchases is approximately
e^{−n×[(n−1)/n]^r}.
8 This formula is taken from J. I. Naus, “An extension of the birthday problem,” The American
Statistician 22 (1968): 27–29.
k 0 1 2 3 4 5 6 7
exact 0.0162 0.0689 0.1442 0.1982 0.2013 0.1613 0.1052 0.060
approx. 0.0183 0.0733 0.1465 0.1954 0.1954 0.1563 0.1042 0.060
Problems
3.74 The Brederode Finance Corporation has begun the following adver-
tising campaign in Holland. Each new loan application submitted is
9 The exact values are taken from F. F. Knudsen and I. Skau, “On the asymptotic solution of a
card-matching problem,” Mathematics Magazine 69 (1996): 190–197.
practically a sure thing. Imagine 100 6/42 lotteries all over the world. In
each lottery, each of one million people sends in five randomly chosen
sequences of six numbers from the numbers 1 to 42 for each drawing. In
each lottery there are two drawings every week. What is the probability
that in at least one of the lotteries some person will win the jackpot twice
in a period of three years?
3.81 The football pool is a betting pool based on predicting the outcome of
13 football matches in the coming week. In the Dutch football pool the
average number of weekly winners who have correctly predicted all 13
matches is 0.25. Last week there were 3 winners. Can you explain why
this is exceptional?
3.82 Let X be a random variable on the integers 0, 1, . . . and λ > 0 be a given
number. Verify that X is Poisson distributed with expected value λ if and
only if E[λg(X + 1) − Xg(X)] = 0 for any bounded function g(x) on
the nonnegative integers. This result is called the Stein–Chen identity for
the Poisson distribution. Hint: To verify the “if” assertion, choose g(x) as
g(r) = 1 and g(x) = 0 for x ≠ r, where r is a given nonnegative integer.
3.83 Use the Poisson heuristic to approximate the probability that three or
more people from a randomly selected group of 25 people will have
birthdays on the same day and the probability that three or more persons
from the group will have birthdays falling within one day of each other.
3.84 Use the Poisson heuristic to approximate the probability that some suit
will be missing in a bridge hand of 13 cards.
3.85 Consider again Problem 1.94. Use the Poisson heuristic to approximate
the probability that no couple will be paired as bridge partners.
3.86 What is the probability that no two cards of the same face value (two
aces, for example) will succeed one another in a well-shuffled deck of 52
playing cards? Use the Poisson heuristic to verify that this probability is
approximately e−3 .
3.87 A company has 75 employees in service. The administrator of the
company notices, to his astonishment, that there are seven days on which
two or more employees have birthdays. Use the Poisson heuristic to argue
that this is not so astonishing after all.
3.88 What is the probability that in a random permutation of the integers
1, 2, . . . , n, no two integers have interchanged their original positions?
Use the Poisson heuristic to show that this probability is approximately
equal to 1/√e for large n.
3.89 Imagine a family of five brothers and sisters, together with their spouses.
The family does a gift exchange every year at Christmas. Each member
of the family buys a Christmas gift. The gifts are labeled 1 to 10, where
each person knows the label of their gift and the label of the gift of their
spouse. Cards with the numbers 1, . . . , 10 are put in a hat. The family
members consecutively pull a card out of the hat in order to determine
the present he or she will receive. If a person draws a card with their own
label, or the card with the label of their spouse, all the cards go back in the
hat and the drawing is redone. Use the Poisson heuristic to approximate
the probability that the drawing need not be redone.
3.90 Consider again Example 3.8. Use the Poisson heuristic to verify that
the probability mass function of the number of empty bins can be
approximated by a Poisson distribution with expected value b(1 − 1/b)^m
for large b.
3.91 Use the Poisson heuristic to approximate the probability that in a
randomly formed group of m people, nobody has a lone birthday. What
is the smallest value of m so that this probability is at least 50%?
3.92 Sixteen teams remain in a soccer tournament. A drawing of lots will
determine which eight matches will be played. Before the drawing takes
place, it is possible to place bets with bookmakers over the outcome of
the drawing. You are asked to predict all eight matches, paying no regard
to the order of the two teams in each match. Use the Poisson heuristic
to approximate the probability distribution of the number of correctly
predicted matches.
3.93 In the 6/45 lottery, six different numbers are randomly picked from the
numbers 1 to 45. Use the Poisson heuristic to approximate the probability
of two or more consecutive numbers in a random pick and the probability
of three or more consecutive numbers.
together with the observation that E(X_i²) = E(X_1²) for all i and E(X_i X_j) =
E(X_1 X_2) for all i ≠ j. Therefore,
E(X²) = nE(X_1²) + n(n − 1)E(X_1 X_2).
Since E(X_1 X_2) = P(X_1 = 1, X_2 = 1) = P(X_1 = 1)P(X_2 = 1 | X_1 = 1), it
follows that E(X_1 X_2) = [R/(R + W)] × [(R − 1)/(R + W − 1)]. Thus,
E(X²) = n R/(R + W) + n(n − 1) [R/(R + W)] × [(R − 1)/(R + W − 1)].
Next, it is a matter of some algebra to get the desired expression for var(X).
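The expression for E(X²) derived above is easy to verify against a brute-force computation over the hypergeometric probability mass function; the sketch below is an added illustration with arbitrarily chosen R, W, and n.

```python
from math import comb

R, W, n = 7, 13, 5        # arbitrary parameters for the check

# Brute force: E(X^2) from the hypergeometric pmf.
EX2_pmf = sum(k * k * comb(R, k) * comb(W, n - k) / comb(R + W, n)
              for k in range(max(0, n - W), min(R, n) + 1))

# Closed form derived in the text.
EX2_formula = n * R / (R + W) + n * (n - 1) * (R / (R + W)) * ((R - 1) / (R + W - 1))

print(EX2_pmf, EX2_formula)   # the two values coincide
```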
The hypergeometric model is a versatile model and has many applications,
particularly to lotteries. The hypergeometric distribution with parameters R, W,
and n can be approximated by the binomial distribution with parameters n and
p = R/(R + W) if R + W ≫ n. Then it does not make much difference whether the
sampling is with or without replacement. Indeed, the above formulas for E(X)
and var(X) are very close to those of the binomial distribution with parameters
n and p = R/(R + W) if R + W ≫ n.
Example 3.20 The New York state lottery offers the game called Quick Draw,
a variant of Keno. The game can be played in bars, restaurants, bowling areas,
and other places. A new game is played every four or five minutes and so a lot
of games can be played on a day. In the game of Quick Draw a maximum of
20 individual bets can be made with a single game card. Each individual bet
can be for 1, 2, 5, or 10 dollars. A player chooses four numbers from 1 to 80.
The lottery then randomly chooses 20 numbers from 1 to 80. The number of
matches determines the payoff. The payoffs on an individual bet are 55 times
the bet size for four matches, 5 times the bet size for three matches, once the
bet size for two matches, and nothing otherwise. In November 1997, the state
lottery offered a promotion “Big Dipper Wednesday,” where payoffs on the
game were doubled on the four Wednesdays in that month. Is this a good deal
for the player or just a come-on for a sucker bet?
Solution. Let us first have a look at the game with the ordinary payoffs. The
hypergeometric model with R = 4, W = 76, and n = 20 is applicable to this
game of Quick Draw. Let the random variable X indicate how many numbers
on a single ticket are matched. Then,
P(X = k) = \binom{4}{k} \binom{76}{20 − k} / \binom{80}{20} for k = 0, 1, . . . , 4.
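With this probability mass function, the expected payoff per dollar staked under the ordinary and the doubled "Big Dipper" payoffs follows from a short computation; the sketch below is an added illustration using the payoff multiples quoted above (55, 5, and 1 times the bet).

```python
from math import comb

# Hypergeometric pmf for the number of matches: R = 4, W = 76, n = 20.
pmf = [comb(4, k) * comb(76, 20 - k) / comb(80, 20) for k in range(5)]

payoff = {4: 55, 3: 5, 2: 1}                      # payoff per dollar for k matches
expected = sum(payoff.get(k, 0) * pmf[k] for k in range(5))
print(expected)          # about 0.60 per dollar: a steep house edge with ordinary payoffs
print(2 * expected)      # about 1.19 per dollar: doubled payoffs favor the player
```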
what is the probability that the removal of the illegal votes will change the
result of the election?
Solution. The problem can be translated into the urn model with 1,422 red
and 1,405 white balls. If a is the number of illegal votes for candidate A and
b the number of illegal votes for candidate B, then candidate A will no longer
have most of the votes only if a − b ≥ 17. Since a + b = 101, the inequality
a − b ≥ 17 boils down to a ≥ 59. The probability that the removal of the
illegal votes will change the result of the election is the same as the probability
of picking 59 or more red balls from an urn that contains 1,422 red and 1,405
white balls. This probability is given by
∑_{a=59}^{101} \binom{1,422}{a} \binom{1,405}{101 − a} / \binom{2,827}{101} = 0.0592.
Example 3.22 Two people, perfect strangers to one another, both living in the
same city of one million inhabitants, meet each other. Each has approximately
500 acquaintances in the city. Assuming that for each of the two people, the
acquaintances represent a random sampling of the city’s various population
sectors, what is the probability of the two people having an acquaintance in
common?
Solution. A bit of imagination shows that this problem can be translated
into the urn model with R = 500 red and W = 999,498 white balls, where
the red balls represent the 500 acquaintances of the first person. The sought
probability is given by the probability that at least one red ball will be drawn
when 500 balls are randomly
999,998 taken out of the urn. This probability is equal to
1 − 999,498
500 / 500 = 0.2214. The probability of 22% is surprisingly large.
Events are often less “coincidental” than we may tend to think!
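The value 0.2214 can be reproduced without evaluating the huge binomial coefficients directly, by writing their ratio as a product; a minimal sketch, added for illustration:

```python
# P(no common acquaintance) = C(999498, 500) / C(999998, 500)
#                           = prod_{i=0}^{499} (999498 - i) / (999998 - i)
ratio = 1.0
for i in range(500):
    ratio *= (999_498 - i) / (999_998 - i)
print(1 - ratio)          # approximately 0.2214
```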
Problems
3.94 In the game “Lucky 10,” 20 numbers are drawn from the numbers 1 to
80. You tick 10 numbers on the game form. What is the probability of
matching r of the 20 numbers drawn for r = 0, 1, . . . , 10?
3.95 The New Amsterdam lottery offers the game Take Five. In this game,
players must tick five different numbers from the numbers 1 to 39.
The lottery draws five distinct numbers from the numbers 1 to 39. For
every one dollar staked, the payoff is $100,000 for five correct numbers,
$500 for four correct numbers, and $25 for three correct numbers.
For two correct numbers, the player wins a free game. What is the house
percentage for this lottery?
3.96 Ten identical pairs of shoes are jumbled together in one large box.
Without looking, someone picks four shoes out of the box. What is the
probability that, among the four shoes chosen, there will be both a left
and a right shoe?
3.97 A psychologist claims that he can determine from a person’s handwrit-
ing whether the person is left-handed or not. You do not believe the
psychologist and therefore present him with 50 handwriting samples,
of which 25 were written by left-handed people and 25 were written
by right-handed people. You ask the psychologist to say which 25 were
written by left-handed people. Will you change your opinion of him if
the psychologist correctly identifies 18 of the 25 left-handers?
3.98 The Massachusetts Cash Winfall lottery was established in 2004 and
was ended in 2012. In this lottery, the jackpot was won when the
six numbers chosen from 1 through 46 were correctly predicted. The
Cash Winfall lottery had the special feature that the jackpot was “rolled
down,” with the secondary prizes increased when the jackpot rose above
$2 million and was not won. Several gambling syndicates made big
profits by buying tickets in bulk when the jackpot was approaching
$2 million. Suppose that a syndicate has invested $400,000 in buying
200,000 randomly selected $2 tickets when the jackpot is “rolled down.”
The roll-down prizes are $25,000, $925, and $27.50 for matching five,
four, or three numbers, respectively. What is the expected amount of
money won by the syndicate? How can you approximate the standard
deviation of the amount won?
3.99 For a final exam, your professor gives you a list of 15 items to study.
He indicates that he will choose eight for the actual exam. You will be
required to answer correctly at least five of those. You decide to study
10 of the 15 items. What is the probability that you will pass the exam?
3.100 In the 6/45 lottery, six different numbers are drawn at random from the
numbers 1 to 45. What are the probability mass functions of the largest
number drawn and the smallest number drawn?
3.101 You play Bingo together with 35 other people. Each player purchases
one card with 24 different numbers that were selected at random out of
the numbers 1 to 80. The organizer of the game calls out chosen distinct
numbers between 1 and 80, randomly and one at a time. What is the
probability that more than 70 numbers must be called out before one of
the players has achieved a full card? What is the probability that you will
be the first player to achieve a full card while no other player has a full
card at the same time as you? What is the probability that you will be
among the first players achieving a full card? Hint: Let Qk = P(X > k),
where the random variable X counts how many numbers have to be
called out before a particular player has achieved a full card.
3.102 Suppose that r different numbers are picked at random from the num-
bers 1, 2, . . . , s. Let the random variable X be the sum of the r numbers
picked. Show that E(X) = (1/2) r(s + 1) and σ²(X) = (1/12) r(s + 1)(s − r).
Hint: Proceed as in the derivation of the expected value and the variance
of the hypergeometric distribution.
3.103 A bowl contains a red and b white balls. You randomly pick without
replacement one ball at a time until you have r red balls. What is the
probability mass function of the number of picks needed?
3.104 Suppose you know that the hands of you and your bridge partner contain
eight of the 13 spades in the deck. What is the probability of a 3-2 split
of the remaining five spades in the bridge hands of your opponents?
3.105 A deck of cards has 8 diamonds and 7 spades. The diamonds are
assigned to player A and the spades to player B. The cards are turned up
one by one, and the player whose suit is first to be turned up five times
wins. What is the probability that player A wins?
3.106 In the 6/49 lottery, six different numbers are drawn at random from
1, 2, . . . , 49. What is the probability that the next drawing will have no
numbers in common with the last two drawings?
3.107 There is a concert and 2,500 tickets are to be raffled off. You have sent in
100 applications. The total number of applications is 125,000. What are
your chances of getting a ticket? Can you explain why this probability
is approximately equal to 1 − e^{−2}?
P(X = k) = 1/(b − a + 1) for k = a, a + 1, . . . , b.
The random variable X can be interpreted as the number of trials until the
first success occurs in a sequence of independent Bernoulli trials with success
probability p. Note that the geometric probability (1 − p)k−1 p is maximal
for k = 1. This explains that in a sequence of independent Bernoulli trials,
successes are likely to show up “clumped” rather than evenly spaced.
∞
Using the relations ∞ k=1 kx
k−1 = (1 − x)−2 and
k=1 k(k − 1)x
k−2 =
−3
2(1 − x) for |x| < 1, see Appendix C, it is easily verified that
1 1−p
E(X) = and var(X) = .
p p2
Example 3.24 Each time, you simultaneously toss a fair coin and roll a fair
die until heads is tossed or a six is rolled, or both. What is the probability mass
function of the number of trials?
Solution. Let the random variable N be the number of trials until heads is
tossed or a six is rolled, or both. Then, by the independence of the trials,
P(N > k) = (1/2)^k × (5/6)^k = (5/12)^k for k = 0, 1, . . . .
Hence P(N = k) = P(N > k − 1) − P(N > k) = (7/12)(5/12)^{k−1} for k = 1, 2, . . . ,
that is, N is geometrically distributed with parameter 7/12.
The result of Example 3.24 is a special case of the general result that
min(X, Y) is geometrically distributed with parameter p1 + p2 − p1 p2 when
the random variables X and Y are independent and geometrically distributed
with parameters p1 and p2 (verify!).
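This general result can be checked numerically: for independent geometric X and Y, P(min(X, Y) > k) = [(1 − p1)(1 − p2)]^k, which is exactly the tail of a geometric distribution with parameter p1 + p2 − p1p2. A short sketch, added as an illustration and using the parameters of Example 3.24:

```python
p1, p2 = 0.5, 1 / 6                         # coin and die from Example 3.24
p = p1 + p2 - p1 * p2                       # claimed parameter of min(X, Y), here 7/12

for k in range(6):
    tail_min = ((1 - p1) * (1 - p2)) ** k   # P(min(X, Y) > k) by independence
    tail_geo = (1 - p) ** k                 # tail of a geometric(p) distribution
    print(k, tail_min, tail_geo)            # identical columns
```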
E(X) = r/p and var(X) = r(1 − p)/p².
Example 3.25 Suppose that a single die is rolled over and over. How many
rolls are needed so that the probability of rolling a six three times within this
number of rolls is at least 50%?
Solution. Let the random variable X be defined as the number of rolls of the
die until a six appears for the third time. Then the random variable X has a
negative binomial distribution with parameters r = 3 and p = 1/6.
Problems
3.108 In the final of the World Series baseball, two teams play a series
consisting of at most seven games until one of the two teams has
won four games. Two unevenly matched teams are pitted against each
other and the probability that the weaker team will win any given
game is equal to 0.45. Assuming that the results of the various games
are independent of each other, calculate the probability of the weaker
team winning the final. What are the expected value and the standard
deviation of the number of games the final will take?
3.109 You perform a sequence of independent Bernoulli trials, each having
success probability 3/4. What is the probability of 15 successes occurring
before 5 failures?
11 This value of k is the median of the distribution. In general, the median of an integer-valued
random variable X is any integer m with P(X ≤ m) ≥ 0.5 and P(X ≥ m) ≥ 0.5. The median
of a probability distribution need not be unique.
3.110 A red bag contains 15 balls and a blue bag 5 balls. Each time, you pick
one of the two bags and remove one ball from the chosen bag. At each
pick, the red bag is chosen with probability 3/4 and the blue bag with
probability 1/4. What is the probability that the red bag will be emptied
first? What is the probability that the red bag is emptied while there are
still k balls in the blue bag for k = 1, . . . , 5?
3.111 Two players A and B each roll a fair die until player A has rolled a
1 or 2, or player B has rolled a 4, 5, or 6. The first player to roll one of
his assigned numbers is the winner, where player A is also the winner if
both players roll at the same time one of their assigned numbers. What
is the probability of player A winning the game? What is the probability
mass function of the length of the game?
3.112 In European roulette, the ball lands on one of the numbers 0, 1, . . . , 36
at every spin of the wheel. A gambler offers at even odds the bet that
the house number 0 will come up at least once in every 25 spins of the
wheel. What is the gambler’s expected profit per dollar bet?
3.113 Players A and B toss their own coins at the same time. A toss of the coin
of player A results in heads with probability a and a toss of the coin of
player B with probability b. The first player to toss heads wins. If they
both get heads at the same time, the game ends in a draw. What is the
probability of player A winning and what is the probability of a draw?
What is the probability mass function of the length of the game?
3.114 A fair coin is tossed until heads appears for the third time. Let the
random variable X be the number of tails shown up to that point. What
is the probability mass function of X? What are the expected value and
the standard deviation of X?
3.115 You are offered the following game. You can repeatedly pick at random
a number from 1 to 25. Each pick costs you one dollar. If you decide
to stop, you get paid the dollar amount of your last pick. What strategy
should you use to maximize your expected net payoff?
3.116 You toss a biased coin with probability p of heads, while your friend
tosses at the same time a fair coin. What is the probability distribution
of the number of tosses until both coins simultaneously show the same
outcome?
3.117 John and Pete had a nice evening at the pub. They decided to play the
following game in order to determine who would pay for the beer. Each
of them rolls two dice. The game ends if the dice total of John is the
same as that of Pete; otherwise, they play another round. Upon ending
the game, John pays for the beer if the dice total is odd; otherwise,
Pete pays. What is the probability mass function of the length of the
game? What is the probability of John paying for the beer?
3.118 You have a thoroughly shuffled deck of 52 cards. Each time, you choose
one card from the deck. The drawn card is put back in the deck and all 52
cards are again thoroughly shuffled. You continue this procedure until
you have seen all four different aces. What are the expected value and
the standard deviation of the number of times you have to draw a card
until you have seen all four different aces?
4
Continuous Random Variables
Solution. The sample space of the chance experiment is the interval (0, 1),
where the outcome ω = u means that the point at which the stick is broken is
a distance u from the beginning of the stick. Let the random variable X denote
the ratio of the length of the shorter piece to that of the longer piece of the
broken stick. Fix 0 < a < 1. The probability that the ratio of the length of
the shorter piece to that of the longer piece is less than or equal to a is nothing
other than the probability that a random number from the interval (0,1) falls
either into (1/(1 + a), 1) or into (0, 1 − 1/(1 + a)) (verify!). The latter probability is equal
to 2(1 − 1/(1 + a)) = 2a/(1 + a). Thus,
P(X ≤ a) = 2a/(1 + a) for 0 < a < 1.
Obviously, P(X ≤ a) = 0 for a ≤ 0 and P(X ≤ a) = 1 for a ≥ 1. Denote by
f(a) = 2/(1 + a)² the derivative of 2a/(1 + a) for 0 < a < 1 and let f(a) = 0
outside the interval (0, 1).
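The distribution function P(X ≤ a) = 2a/(1 + a) just found is easy to confirm by simulation; the sketch below, an added illustration, breaks the stick at a uniform random point many times and compares the empirical fractions with the formula.

```python
import random

random.seed(1)
n = 200_000
ratios = []
for _ in range(n):
    u = random.random()                     # break point of the unit stick
    ratios.append(min(u, 1 - u) / max(u, 1 - u))

for a in (0.1, 0.25, 0.5, 0.9):
    empirical = sum(r <= a for r in ratios) / n
    print(a, empirical, 2 * a / (1 + a))    # empirical fraction versus 2a/(1+a)
```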
The notation P(X ≤ a) stands for the probability that is assigned by the
probability measure P to the set of all outcomes ω for which X (ω) ≤ a.
The function f (x) is called the probability density function of X, and is a sort
of analogue of the probability mass function of a discrete random variable.
The cumulative distribution function of any random variable X, denoted by
F(x), describes the probability that the random variable X will take on a value
less than or equal to x:
F(x) = P(X ≤ x).
1 In general, let X be a random variable with cumulative distribution function F(x) and V be a
finite subset of R = (−∞, ∞). Suppose that F(x) is continuous on R and differentiable on
R\V with a continuous derivative. Then, by the fundamental theorem of integral calculus, X is
continuously distributed with density f(x) = F′(x) for x ∈ R\V and f(x) = 0 for x ∈ V.
P(X > a) = ∫_a^∞ f(x) dx.
To see this, note that the event {X > a} is the complement of the event {X ≤ a}
and so P(X > a) = 1 − P(X ≤ a). Since ∫_{−∞}^{∞} f(x) dx = 1, it follows that
P(X > a) is equal to ∫_{−∞}^{∞} f(x) dx − ∫_{−∞}^{a} f(x) dx = ∫_a^∞ f(x) dx. More generally,
we can express P(a < X ≤ b) in terms of the density f(x). For any constants a
and b with a < b,
P(a < X ≤ b) = ∫_a^b f(x) dx.
To see this, note that the event {X ≤ b} is the union of the two disjoint events
{a < X ≤ b} and {X ≤ a}. As a result, P(X ≤ b) = P(a < X ≤ b) + P(X ≤ a),
or, equivalently, P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a). This gives that
P(a < X ≤ b) is equal to ∫_{−∞}^{b} f(x) dx − ∫_{−∞}^{a} f(x) dx = ∫_a^b f(x) dx.
The integral representation of P(a < X ≤ b) tells us that this probability is
given by the area under the graph of f (x) between the points a and b.
Example 4.4 The maximum outdoor air temperature (in degrees Celsius) in a
certain area on any given day in May can be modeled as a continuous random
variable X with density function f(x) = (1/4,500)(30x − x²) for 0 < x < 30 and
f (x) = 0 otherwise. What are the probabilities P(X ≤ 10), P(X > 25), and
P(15 < X ≤ 20)?
Solution. The probabilities P(X ≤ 10) and P(X > 25) are given by
P(X ≤ 10) = ∫_0^{10} (1/4,500)(30x − x²) dx = (1/4,500)[15x² − x³/3]_0^{10} = 7/27,
P(X > 25) = ∫_{25}^{30} (1/4,500)(30x − x²) dx = (1/4,500)[15x² − x³/3]_{25}^{30} = 2/27.
The probability P(15 < X ≤ 20) is evaluated as
P(15 < X ≤ 20) = ∫_{15}^{20} (1/4,500)(30x − x²) dx = (1/4,500)[15x² − x³/3]_{15}^{20} = 13/54.
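The three probabilities can be double-checked by evaluating the antiderivative F(x) = (15x² − x³/3)/4,500 at the integration limits; a minimal sketch, added for illustration:

```python
from fractions import Fraction

def F(x):
    """Antiderivative of f(x) = (30x - x^2)/4500 on (0, 30)."""
    x = Fraction(x)
    return (15 * x**2 - x**3 / 3) / 4500

print(F(10) - F(0))     # P(X <= 10)      = 7/27
print(F(30) - F(25))    # P(X > 25)       = 2/27
print(F(20) - F(15))    # P(15 < X <= 20) = 13/54
```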
Problems
4.1 Sizes of insurance claims can be modeled by a continuous random variable
X with probability density f (x) = c(10 − x) for 0 < x < 10 and f (x) = 0
otherwise, where c is some constant. What is the value of c? What is the
probability that the size of a particular claim is no more than 5 and what is
the probability that the size is more than 2?
4.2 Let the random variable X be the portion of a flood insurance claim for
flooding damage to a house. The probability density of X has the form
f(x) = c(3x² − 8x − 5) for 0 < x < 1. What is the value of the constant c?
What is the cumulative distribution function of X?
4.3 The mileage (in thousands of miles) you can get out of a specific type of tire is a continuous random variable X with probability density function $f(x) = \frac{2}{625}\,x e^{-x^2/625}$ for x > 0 and f(x) = 0 otherwise. What is the probability that the tire will last at most 15,000 miles? What is the probability that the tire will last more than 30,000 miles? What is the probability that the tire will last more than 20,000 miles but not more than 25,000 miles?
4.4 The lengths of phone calls (in minutes) made by a travel agent can be
modeled as a continuous random variable X with probability density
f (x) = 0.25e−0.25x for x > 0. What is the probability that a particular
phone call will take more than 7 minutes?
4.5 A particular pumping engine will only function properly if an essential component functions properly. The time to failure of the component (in thousands of hours) is a random variable X with probability density $f(x) = 0.02\,x e^{-0.01x^2}$ for x > 0. What is the proportion of pumping engines that will not fail before 10,000 hours of use? What is the probability that the engine will survive for another 5,000 hours, given that it has functioned properly during the past 5,000 hours?
4.6 The probability density function f(x) of the electrical resistance (in ohms) of a strain gauge produced by a certain firm can be modeled by $f(x) = \frac{1}{25}(x - 115)$ for 115 < x ≤ 120, $f(x) = \frac{1}{25}(125 - x)$ for 120 < x < 125, and f(x) = 0 otherwise.
The fact that the area under the graph of f(x) can be interpreted as a probability leads to an intuitive interpretation of f(a). Let a be a given continuity point of f(x). Consider now a small interval of length Δa around the point a, say [a − ½Δa, a + ½Δa]. Since
$$P\Big(a - \tfrac{1}{2}\Delta a \le X \le a + \tfrac{1}{2}\Delta a\Big) = \int_{a-\frac{1}{2}\Delta a}^{a+\frac{1}{2}\Delta a} f(x)\,dx$$
and $\int_{a-\frac{1}{2}\Delta a}^{a+\frac{1}{2}\Delta a} f(x)\,dx \approx f(a)\,\Delta a$ for Δa small, we have the insightful result
$$P\Big(a - \tfrac{1}{2}\Delta a \le X \le a + \tfrac{1}{2}\Delta a\Big) \approx f(a)\,\Delta a \quad \text{for } \Delta a \text{ small}.$$
In other words, the probability of random variable X taking on a value in a small interval around point a is approximately equal to f(a)Δa when Δa is the
length of the interval. You see that the number f (a) itself is not a probability,
but it is a relative measure for the likelihood that the random variable X will
take on a value in the immediate neighborhood of point a. Stated differently,
the probability density function f (x) expresses how densely the probability
mass of the random variable X is smeared out in the neighborhood of point x.
Hence the name density function. The probability density function provides
the most useful description of a continuous random variable. The graph of the
density function provides a good picture of the likelihood of the possible values
of the random variable.
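This interpretation is easy to check numerically. The sketch below, an illustration only, compares the exact probability of a small interval around a point with the approximation f(a)Δa, reusing the temperature density of Example 4.4 and an arbitrarily chosen point a = 12.

```python
# P(a - Δa/2 <= X <= a + Δa/2) ≈ f(a)·Δa, illustrated with the density of Example 4.4.
from scipy.integrate import quad

f = lambda x: (30 * x - x ** 2) / 4500   # valid for 0 < x < 30

a = 12.0
for da in (1.0, 0.1, 0.01):
    exact, _ = quad(f, a - da / 2, a + da / 2)   # exact probability of the interval
    approx = f(a) * da                            # the f(a)·Δa approximation
    print(da, exact, approx)
# The two columns agree ever more closely as Δa shrinks.
```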
All of the foregoing examples follow the same procedure in order to find
the probability density function of a random variable X. The cumulative
distribution function P(X ≤ x) is determined first and this distribution function
is then differentiated to obtain the probability density.
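The same two-step procedure can be mechanized with a computer algebra system. The sketch below, offered only as an illustration, differentiates the distribution function P(X ≤ x) = 2x/(1 + x) of the ratio X from Example 4.1 and recovers the density 2/(1 + x)².

```python
# Differentiate a cumulative distribution function to obtain the density.
import sympy as sp

x = sp.symbols('x', positive=True)
F = 2 * x / (1 + x)                 # P(X <= x) for the shorter/longer ratio, 0 < x < 1
f = sp.simplify(sp.diff(F, x))
print(f)                            # 2/(x + 1)**2
print(sp.integrate(f, (x, 0, 1)))   # 1, so f is indeed a density on (0, 1)
```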
Problems
4.9 Let the radius of a circle be a random variable X with density function
f (x) = 1 for 0 < x < 1 and f (x) = 0 otherwise. What is the probability
density of the area of the circle?
4.10 The density function of the continuous random variable X is $f(x) = \frac{6}{7}(x + \sqrt{x})$ for 0 < x < 1 and f(x) = 0 otherwise. What is the probability density of 1/X?
4.11 Let X be a positive random variable with probability density function
f (x). Define the random variable Y by Y = X 2 . What is the probability
density function of Y? Also, find the density function of the random
variable W = V 2 if V is a number chosen at random from the interval
(−a, a) with a > 0.
4.12 A random point Q is chosen inside the unit square. What is the density
function of the sum of the coordinates of the point Q? What is the density
function of the product of the coordinates of the point Q? Use geometry
to find these densities.
4.13 The number X is chosen at random between 0 and 1. Determine the
density functions of the random variables V = X/(1 − X) and W =
X(1 − X).
4.14 A stick of unit length is broken at random into two pieces. Let the
random variable X be the length of the shorter piece. What is the density
function of X? Also, use the cumulative distribution function of X to give
an alternative derivation of the density function of the random variable
X/(1 − X) from Example 4.1.
4.15 A random point is chosen inside the unit square. Let the random variables
V and W be defined as the largest and the smallest of the two coordinates
of the point. What are the probability densities of the random variables
V and W?
4.16 Suppose you decide to take a ride on the ferris wheel at an amusement park. The ferris wheel has a diameter of 30 m. After several turns, the ferris wheel suddenly stops due to a power outage. Let the random variable X be your height above the ground when the ferris wheel stops, where it is assumed that the bottom of the ferris wheel is level with the ground. Verify that the density function of X is given by $\big(15\pi\sqrt{1 - (x/15 - 1)^2}\,\big)^{-1}$ for 0 < x < 30. Hint: The random variable X can be modeled as X = 15 + 15 cos(Θ), where Θ is a randomly chosen angle between 0 and π (make a drawing!).
integral is well-defined with a finite value if and only if $\int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty$. Such an integral is said to be absolutely convergent. For a nonnegative random variable X, the absolute convergence of the integral is not required. Then E(X) always exists, in accordance with the basic convention from integral calculus that an integral with a nonnegative integrand is always well-defined by allowing ∞ as a possible value. For example, the expected value of the random variable X with the two-sided density function $\frac{1}{\pi(1+x^2)}$ for −∞ < x < ∞ is not defined, because $\int_{-\infty}^{\infty} |x|\,\frac{1}{\pi(1+x^2)}\,dx = 2\int_0^{\infty} x\,\frac{1}{\pi(1+x^2)}\,dx = \infty$. However, the expected value of the nonnegative random variable with the one-sided density function $\frac{2}{\pi(1+x^2)}$ for x > 0 does exist (and is equal to ∞). It is important to emphasize that the expected value of a random variable X is finite if and only if $\int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty$.
The definition of expected value in the continuous case parallels the definition $E(X) = \sum_i x_i p(x_i)$ for a discrete random variable X with x₁, x₂, . . . as possible values and p(xᵢ) = P(X = xᵢ). For dx small, the quantity f(x) dx
in a discrete approximation of the continuous case corresponds to p(x) in
the discrete case. The summation becomes an integral when dx approaches
zero. Results for discrete random variables are typically expressed as sums.
The corresponding results for continuous random variables are expressed as
integrals.
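The passage from sums to integrals can also be seen numerically. In the sketch below, which is only an illustration with an arbitrarily chosen grid, E(X) for the temperature density of Example 4.4 is approximated by the discrete sum Σ x f(x)Δx over ever finer grids.

```python
# Discrete approximation of E(X): sum of x·f(x)·dx over a fine grid.
import numpy as np

f = lambda x: (30 * x - x ** 2) / 4500   # density of Example 4.4 on (0, 30)

for n in (10, 100, 1000):
    dx = 30 / n
    x = (np.arange(n) + 0.5) * dx        # midpoints of the grid cells
    print(n, np.sum(x * f(x) * dx))      # approaches the exact value E(X) = 15
```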
Example 4.4 (continued) What is the expected value of the random variable X with probability density function $f(x) = \frac{1}{4{,}500}(30x - x^2)$ for 0 < x < 30 and f(x) = 0 otherwise?
Solution. The expected value of X is calculated as
$$E(X) = \frac{1}{4{,}500}\int_0^{30} x(30x - x^2)\,dx = \frac{1}{4{,}500}\Big[10x^3 - \frac{1}{4}x^4\Big]_0^{30} = 15.$$
Example 4.6 (continued) What is the expected value of the random variable X representing the distance from a random point inside the circular disk with radius r to the center of the disk?
Solution. The random variable X has the probability density function $f(x) = \frac{2x}{r^2}$ for 0 < x < r and f(x) = 0 otherwise. This gives
$$E(X) = \int_0^r x\,\frac{2x}{r^2}\,dx = \frac{2}{3}\,\frac{x^3}{r^2}\Big|_0^r = \frac{2}{3}\,r.$$
shorter piece to that of the longer piece? What is the expected value of the
ratio of the length of the longer piece to that of the shorter piece?
Solution. Let the random variable X be the ratio of the length of the shorter
piece to that of the longer piece and Y be the ratio of the length of the longer
piece to that of the shorter piece. In Example 4.1, we showed that $P(X \le x) = \frac{2x}{x+1}$ for 0 ≤ x ≤ 1, with $f(x) = \frac{2}{(x+1)^2}$ as its density function for 0 < x < 1. Hence,
$$E(X) = \int_0^1 x\,\frac{2}{(x+1)^2}\,dx = 2\int_0^1 \frac{1}{x+1}\,dx - 2\int_0^1 \frac{1}{(x+1)^2}\,dx = \Big[2\ln(x+1) + \frac{2}{x+1}\Big]_0^1 = 2\ln(2) - 1.$$
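The value 2 ln(2) − 1 ≈ 0.386 is easily confirmed by simulation. The sketch below, an illustration only (the sample size and seed are arbitrary choices), breaks a stick at a uniform random point many times and averages the ratio of the shorter to the longer piece.

```python
# Monte Carlo check of E(X) = 2 ln(2) - 1 for the shorter/longer ratio.
import numpy as np

rng = np.random.default_rng(1)
u = rng.random(1_000_000)                         # uniform break points in (0, 1)
shorter, longer = np.minimum(u, 1 - u), np.maximum(u, 1 - u)
print(np.mean(shorter / longer))                  # simulated estimate, about 0.386
print(2 * np.log(2) - 1)                          # exact value 0.3863...
```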
Problems
4.17 The time (in hundreds of hours) until failure of the power supply to a radar system is a random variable X with probability density function $f(x) = \frac{1}{625}(x - 50)$ for 50 < x ≤ 75, $f(x) = \frac{1}{625}(100 - x)$ for 75 < x < 100, and f(x) = 0 otherwise. What is the expected value of X?
4.18 The time (in milliseconds) for a particular chemical reaction to complete
in water is a random variable X with probability density function
$f(x) = \pi\sqrt{2}\,\cos(\pi x)$ for 0 < x < 0.25 and f(x) = 0 otherwise. What is the expected value of X?
4.19 The javelin thrower Big John throws the javelin more than x meters with probability P(x), where P(x) = 1 for 0 ≤ x < 50, $P(x) = 1 - (x-50)^2/1{,}200$ for 50 ≤ x < 80, $P(x) = (90 - x)^2/400$ for 80 ≤ x < 90, and P(x) = 0 for x ≥ 90. What is the expected value of the distance thrown in his next shot?
4.20 What is the expected value of the random variable X in Problems 4.2,
4.4, and 4.6?
4.21 A random point is chosen inside a triangle with height h and base length
b. What is the expected value of the perpendicular distance of the point
to the base?
4.22 Consider again Example 4.5. What is the expected value of the price paid
for the collector’s item?
4.23 A random point is chosen inside the unit square {(x, y) : 0 ≤ x, y ≤ 1}.
What is the expected value of the distance from this point to the point
(0, 0)?
4.24 A random point is chosen inside the unit square. What is the expected
value of the distance from this point to the closest side of the unit square?
4.25 Let X be a continuous random variable with probability density function
f (x).
(a) Explain why the natural definition of the conditional expected value
of X given that X > a is
$$E(X \mid X > a) = \frac{1}{P(X > a)} \int_a^{\infty} x f(x)\,dx.$$
Also, explain the definition $E(X \mid X \le a) = \frac{1}{P(X \le a)} \int_{-\infty}^{a} x f(x)\,dx$.
(b) What is E(X | X > a) for 0 < a < 1 when X is a randomly chosen
number from (0, 1)?
(c) What are E(X | X > a) and E(X | X ≤ a) for a > 0 when X has the exponential density function $f(x) = \lambda e^{-\lambda x}$ for x > 0 and f(x) = 0 otherwise?
4.26 Let X be a nonnegative continuous random variable with density function
f (x). Use an interchange of the order of integration to verify that
$$E(X) = \int_0^{\infty} P(X > x)\,dx.$$
Use this result to answer the following question. What is the expected
value of the smallest of n independent random numbers from (0, 1)?
1 < x ≤ 2, and g(x) = 0 for x > 2. Then g(X) is the warranty payment. Its expected value is
$$E[g(X)] = 250\int_0^1 \tfrac{1}{5}e^{-x/5}\,dx + 125\int_1^2 \tfrac{1}{5}e^{-x/5}\,dx = 250\big(1 - e^{-1/5}\big) + 125\big(e^{-1/5} - e^{-2/5}\big) = 63.87 \text{ dollars}.$$
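The figure of 63.87 dollars can be reproduced numerically from E[g(X)] = ∫ g(x)f(x) dx. The sketch below is only an illustration; it assumes, as in the integrals above, the exponential density ⅕e^{-x/5} and the payment g(x) = 250 for 0 < x ≤ 1, 125 for 1 < x ≤ 2, and 0 otherwise.

```python
# Numerical check of the expected warranty payment E[g(X)] ≈ 63.87 dollars.
from math import exp
from scipy.integrate import quad

f = lambda x: exp(-x / 5) / 5            # exponential density with expected value 5

# split the integral at x = 1, where the payment g(x) jumps from 250 to 125
v1, _ = quad(lambda x: 250 * f(x), 0, 1)
v2, _ = quad(lambda x: 125 * f(x), 1, 2) # g(x) is zero beyond x = 2
print(round(v1 + v2, 2))                 # 63.87
```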
The substitution rule often simplifies the calculation of the expected value.
As an illustration, consider again Example 4.1 in which the random variable
X is the ratio of the length of the shorter piece to that of the longer piece of a
stick of unit length that is broken at random into two pieces. In the foregoing
section, we calculated E(X) by determining first the density function of X
and then applying the definition of expected value. However, the substitution
rule provides a simpler way to calculate E(X). Let the random variable U be
distributed as a random point chosen in (0, 1). Then U has the density function
f(u) = 1 for 0 < u < 1 and X is distributed as g(U), where the function g(u) is defined by g(u) = u/(1 − u) for 0 < u ≤ ½ and g(u) = (1 − u)/u for ½ < u < 1. This gives
$$E(X) = \int_0^{1/2} \frac{u}{1-u}\,du + \int_{1/2}^{1} \frac{1-u}{u}\,du = 2\int_{1/2}^{1} \frac{1-u}{u}\,du = \Big[2\ln(u) - 2u\Big]_{1/2}^{1} = 2\ln(2) - 1.$$
Rule 4.2 Let X be a continuous random variable with a finite expected value. Then, for any constants a and b,
$$E(aX + b) = aE(X) + b \quad\text{and}\quad \mathrm{var}(aX + b) = a^2\,\mathrm{var}(X).$$
This rule is the continuous analogue of Rules 3.3 and 3.5 for the discrete case and can be verified directly from Rule 4.1. The variance of X does not have the same dimension as the values of X. Therefore, one often uses the standard deviation of the random variable X. This measure is defined by
$$\sigma(X) = \sqrt{\mathrm{var}(X)}.$$
Example 4.8 Let the random variable X be a randomly chosen number from the interval (a, b). What are the expected value and the variance of X?
Solution. The probability that X will fall into a subinterval of width w is $\frac{w}{b-a}$. Hence, $P(X \le x) = \frac{x-a}{b-a}$ for a ≤ x ≤ b and so the density function of X is $f(x) = \frac{1}{b-a}$ for a < x < b and f(x) = 0 otherwise. This gives
$$E(X) = \int_a^b x\,\frac{1}{b-a}\,dx = \frac{1}{2}\,\frac{x^2}{b-a}\Big|_a^b = \frac{1}{2}\,\frac{b^2-a^2}{b-a} = \frac{a+b}{2},$$
$$E(X^2) = \int_a^b x^2\,\frac{1}{b-a}\,dx = \frac{1}{3}\,\frac{x^3}{b-a}\Big|_a^b = \frac{1}{3}\,\frac{b^3-a^3}{b-a} = \frac{a^2+ab+b^2}{3},$$
so that $\mathrm{var}(X) = E(X^2) - \big(E(X)\big)^2 = \frac{(b-a)^2}{12}$.
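A quick simulation confirms these formulas for any particular interval. The sketch below is only an illustration; the interval (2, 7), the sample size, and the seed are arbitrary choices.

```python
# Simulation check of E(X) = (a+b)/2 and var(X) = (b-a)^2/12 for X uniform on (a, b).
import numpy as np

a, b = 2.0, 7.0                          # arbitrary example interval
rng = np.random.default_rng(0)
x = rng.uniform(a, b, size=1_000_000)
print(x.mean(), (a + b) / 2)             # both ≈ 4.5
print(x.var(), (b - a) ** 2 / 12)        # both ≈ 2.0833
```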
Problems
4.27 Suppose that the random variable X has the probability density f (x) =
12x2 (1 − x) for 0 < x < 1 and f (x) = 0 otherwise.
(a) Use Jensen's inequality from Section 3.4 to verify that E[1/X] ≥ 5/3.
(b) What are the expected value and the standard deviation of 1/X?
4.28 Let the random variables V and W be defined by $V = \sqrt{U}$ and W = U²,
where U is a number chosen at random between 0 and 1. What are the
expected values and the standard deviations of V and W?
4.29 An insurance policy for water damage pays an amount of damage up to $450. The amount of damage is a random variable X with density function $f(x) = \frac{1}{1{,}250}$ for 0 < x < 1,250. The amount of damage exceeding $450 is covered by a supplement policy up to $500. What is the expected value of the amount of damage paid by the supplement policy?
4.30 Consider again Problem 4.7. Assume that the storage tank has a capacity
of 0.9 expressed in thousands of gallons. The cost of removing x > 0
units of waste at the end of the week is 1.25 + 0.5x. Additional costs of
5+10z are incurred when the capacity of the storage tank is not sufficient
and an overflow of z > 0 units of waste occurs during the week. What is
the expected value of the weekly costs?
4.31 A manufacturer has to make a last production run for a product that is near the end of its lifetime. The final demand for the product can be modeled as a continuous random variable X having probability density $f(x) = \frac{1}{2{,}500}\,x e^{-x/50}$ for x > 0. It is decided to make a last production run of 250 units of the product. The manufacturer earns $2 for each unit of product sold but incurs a cost of $0.50 for each unit of demand occurring when out of stock. What is the expected value of the net profit of the manufacturer? What is the probability that the manufacturer runs out of stock?
4.32 A car owner insures his car, worth $20,000, for one year under a policy
with a deductible of $1,000. There is a probability of 0.01 of a total loss
of the car during the policy year and a probability of 0.02 of a repairable
damage. The cost (in thousands of dollars) of a repairable damage has
the probability density $f(x) = \frac{1}{200}(20 - x)$ for 0 < x < 20. What is the
expected value of the insurance payment? Note that this payment is a
mixed random variable.
4.33 A point is chosen at random in (0, 1); this point divides the interval (0, 1) into two subintervals. What is the expected length of the subinterval covering a given point s with 0 < s < 1?
4.34 What is the standard deviation of the random variable X in each of the
Problems 4.2, 4.4, and 4.6?
4.35 What are the expected value and the standard deviation of the area of
the circle whose radius is a random variable X with density function
f (x) = 1 for 0 < x < 1 and f (x) = 0 otherwise?
4.36 A point Q is chosen at random inside a sphere with radius r. What are
the expected value and the standard deviation of the distance from the
center of the sphere to the point Q?
4.37 Let X be a continuous random variable with probability density f (x) and
finite expected value E(X).
(a) What constant c minimizes E[(X − c)2 ] and what is the minimal
value of E[(X − c)2 ]?
(b) Prove that E(|X − c|) is minimal if c is chosen equal to the median
of X.
4.38 Consider again Problem 4.16. Calculate the expected value and the stan-
dard deviation of the height above the ground when the ferris wheel stops.
4.39 In an inventory system, a replenishment order is placed when the stock
on hand of a certain product drops to the level s, where the reorder point
s is a given positive number. The total demand for the product during
the lead time of the replenishment order has the probability density
f (x) = λe−λx for x > 0. What are the expected value and standard
deviation of the shortage (if any) when the replenishment order arrives?
4.40 Suppose that the continuous random variable X has the probability density function $f(x) = (\alpha/\beta)(\beta/x)^{\alpha+1}$ for x > β and f(x) = 0 for x ≤ β, for given values of the parameters α > 0 and β > 0. This density
is called the Pareto density, which provides a useful probability model
for income distributions, among others.
(a) Calculate the expected value, the variance, and the median of X.
(b) Assume that the annual income of the employed, measured in thou-
sands of dollars, in a given country follows a Pareto distribution with
α = 2.25 and β = 2.5. What percentage of the working population
has an annual income of between 25,000 and 40,000 dollars?
(c) Why do you think the Pareto distribution is a good model for income
distributions? Hint: Use the probabilistic interpretation of the density
f (x).
[Figure: the uniform density f(x) = 1/(b − a) on the interval (a, b).]
Problems
4.41 The lifetime of a light bulb has a uniform probability density on (2, 12).
The light bulb will be replaced upon failure or upon reaching age 10,
whichever occurs first. What are the expected value and the standard
deviation of the age of the light bulb at the time of replacement?
4.42 A rolling machine produces sheets of steel of different thickness. The
thickness of a sheet of steel is uniformly distributed between 120 and
150 mm. Any sheet having a thickness of less than 125 mm must be
scrapped. What are the expected value and the standard deviation of the thickness of a non-scrapped sheet of steel?
[Figure: the triangular density on (a, b), rising from a to its peak value 2/(b − a) at the mode m and falling back to zero at b.]
and f (x) = 0 otherwise. It is left to the reader to verify that the nonnegative
function λe−λx integrates to 1 over the interval (−∞, ∞) and thus is indeed
a density function. The parameter λ is a scale parameter. An exponentially
distributed random variable X takes on only positive values. Figure 4.3 displays
the exponential density function with λ = 1. The expected value and the
variance of the random variable X are given by
$$E(X) = \frac{1}{\lambda} \quad\text{and}\quad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$
[Figure 4.3: the exponential density function with λ = 1.]
These formulas are obtained by evaluating the integrals $\int_0^{\infty} x\lambda e^{-\lambda x}\,dx$ and $\int_0^{\infty} x^2\lambda e^{-\lambda x}\,dx$. Using partial integration, the reader easily verifies that these integrals are $\frac{1}{\lambda}$ and $\frac{2}{\lambda^2}$, showing that $E(X) = \frac{1}{\lambda}$ and $E(X^2) = \frac{2}{\lambda^2}$, so that $\mathrm{var}(X) = E(X^2) - \big(E(X)\big)^2 = \frac{1}{\lambda^2}$.
Memoryless Property
The exponential distribution is extremely important in probability. It not only
models many real-world phenomena, but also allows for tractable mathemat-
ical analysis. The reason for its mathematical tractability is the memoryless
property. This property states that
P(X > t + s | X > s) = P(X > t) for all t > 0,
regardless of the value of s. In words, imagining that the exponentially
distributed random variable X represents the lifetime of an item, the residual
life of an item has the same exponential distribution as the original lifetime,
regardless of how long the item has already been in use. The proof is simple.
For any x ≥ 0, we have that $P(X > x) = e^{-\lambda x}$. Applying the definition P(A | B) = P(AB)/P(B) with A = {X > t + s} and B = {X > s}, and noting that P(X > t + s, X > s) = P(X > t + s), we find
$$P(X > t + s \mid X > s) = \frac{P(X > t + s)}{P(X > s)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t),$$
showing the memoryless property. The exponential distribution is the only
continuous distribution possessing this property. The technical proof of this
uniqueness result is omitted.
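The memoryless property also shows up clearly in simulated data. The sketch below, an illustration only with arbitrarily chosen values of λ, s, and t, estimates P(X > t + s | X > s) from exponential samples and compares it with P(X > t) = e^{-λt}.

```python
# Empirical check of the memoryless property of the exponential distribution.
import numpy as np

lam, s, t = 0.5, 2.0, 3.0
rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / lam, size=2_000_000)

cond = np.mean(x[x > s] > t + s)                # estimate of P(X > t+s | X > s)
print(cond, np.mean(x > t), np.exp(-lam * t))   # all three ≈ 0.2231
```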
The exponential distribution is often used as probability model for the time
until a rare event occurs. Examples are the time until the next earthquake in
a certain region and the decay time of a radioactive particle. A very useful
result is that under general conditions, the time until the first occurrence of
a rare event is approximately exponentially distributed. To make this result
plausible, consider the next example. Maintenance of an operating unit in a
reliability system occurs at the scheduled times τ , 2τ , . . . , where τ > 0 is
fixed. Each maintenance takes a negligible time and the unit is again as good
as new after each maintenance. There is a probability p > 0 that the unit will
fail between two maintenance inspections, where p is very close to zero. Let
the random variable X be the time until the first system failure. Then, for any
n, we have $P(X > n\tau) = (1 - p)^n$ and so $P(X > n\tau) \approx e^{-np}$, using the fact that $e^{-p} \approx 1 - p$ for p close to zero. Hence, taking t = nτ and replacing n by t/τ, it follows that
$$P(X > t) \approx e^{-\lambda t} \quad \text{for } t > 0,$$
where λ = p/τ denotes the inverse of the expected time until the first system
failure. It is important to note that it suffices to know the expected value of
the time until the first system failure in order to approximate the probability
distribution of the failure time by an exponential distribution. An application
of this very useful result will be given in Example 4.9 below.
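The quality of the approximation is easy to inspect. The sketch below, an illustration only with an arbitrarily chosen failure probability p and inspection interval τ, compares the exact probability (1 − p)^n with the exponential approximation e^{-np} for several values of n.

```python
# Compare P(X > n·τ) = (1-p)^n with the exponential approximation e^{-n·p}.
from math import exp

p, tau = 0.01, 1.0                      # small failure probability per inspection interval
for n in (10, 50, 100, 200):
    exact = (1 - p) ** n
    approx = exp(-n * p)                # = e^{-λt} with λ = p/τ and t = n·τ
    print(n * tau, exact, approx)
```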
Also, the probability of exceeding some extreme level is often approximately equal to an exponential tail probability of the form $\alpha e^{-\beta t}$ for constants α, β > 0. An interesting example concerns the probability that a high tide of h meters or more above sea level will occur in any given year somewhere along the Dutch coastline. This probability is approximately equal to $e^{-2.97h}$
for values of h larger than 1.70 m. This empirical result was used in the design
of the Delta works that were built following the 1953 disaster when the sea
flooded a number of polders in the Netherlands.
Example 4.9 A reliability system has two identical units, where one unit is in
full operation and the other unit is in cold standby. The lifetime of an operating
unit has an exponential density with expected value 1/μ. Upon failure of the
operating unit, the other unit is put into operation provided a standby unit is
available. The replacement time of a failed unit is fixed and is equal to τ > 0.
A system failure occurs if no standby unit is available at the moment the
operating unit fails. It is assumed that the probability $1 - e^{-\mu\tau}$ is close to
zero, that is, the probability of an operating unit failing during the replacement
time τ is very small. What is an approximation to the probability distribution
of the time until the first system failure?
Solution. Let the random variable X denote the time until the first system
failure. In Example 7.11 in Section 7.3, it will be shown that
$$E(X) = \frac{2 - e^{-\mu\tau}}{\mu\big(1 - e^{-\mu\tau}\big)}.$$