Probability A Lively Introduction
Probability has applications in many areas of modern science, not to mention in our
daily life, and its importance as a mathematical discipline cannot be overrated. This
engaging book, with its easy-to-follow writing style, provides a comprehensive, yet
concise, introduction to the subject. It covers all of the standard material for
undergraduate and first-year-graduate-level courses, as well as many topics that are
usually not found in standard texts – such as Bayesian inference, Markov chain Monte
Carlo simulation, and Chernoff bounds.
The student-friendly text has the following additional features:
HENK TIJMS
Vrije Universiteit, Amsterdam
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
4843/24, 2nd Floor, Ansari Road, Daryaganj, Delhi – 110002, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781108418744
DOI: 10.1017/9781108291361
© Henk Tijms 2018
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2018
Printed in the United Kingdom by Clays, St Ives plc
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Tijms, H.
Title: Probability: a lively introduction / Henk Tijms, Vrije Universiteit, Amsterdam.
Description: Cambridge : Cambridge University Press, 2017. | Includes index.
Identifiers: LCCN 2017014908 | ISBN 9781108418744 (Hardback : alk. paper) |
ISBN 9781108407847 (pbk. : alk. paper)
Subjects: LCSH: Probabilities–Textbooks.
Classification: LCC QA273.2 .T55 2017 | DDC 519.2–dc23 LC record
available at https://lccn.loc.gov/2017014908
ISBN 978-1-108-41874-4 Hardback
ISBN 978-1-108-40784-7 Paperback
Additional resources for this publication at www.cambridge.org/TijmsProbability
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Preface
Why do so many students find probability difficult? Even the most mathe-
matically competent often find probability a subject that is difficult to use
and understand. The difficulty stems from the fact that most problems in
probability, even ones that are easy to understand, cannot be solved by using
cookbook recipes as is sometimes the case in other areas of mathematics.
Instead, each new problem often requires imagination and creative thinking.
That is why probability is difficult, but also why it is fun and engaging.
Probability is a fascinating subject and I hope, in this book, to share my
enthusiasm for the subject.
Probability is best taught to beginning students by using motivating exam-
ples and problems, and a solution approach that gives the students confidence
to solve problems on their own. The examples and problems should be relevant,
clear, and instructive. This book is not written in a theorem–proof style, but
proofs flow with the subsequent text and no mathematics is introduced without
specific examples and applications to motivate the theory. It distinguishes itself
from other introductory probability texts by its emphasis on why probability
is so relevant and how to apply it. Every attempt has been made to create a
student-friendly book and to help students understand what they are learning,
not just to learn it.
This textbook is designed for a first course in probability at an undergraduate
level or first-year graduate level. It covers all of the standard material for
such courses, but it also contains many topics that are usually not found in
introductory probability books – such as stochastic simulation. The emphasis
throughout the book is on probability, but attention is also given to statistics. In
particular, Bayesian inference is discussed at length and illustrated with several
illuminating examples. The book can be used in a variety of disciplines, rang-
ing from applied mathematics and statistics to computer science, operations
research, and engineering, and is suitable not only for introductory courses, but
also for self-study. The prerequisite knowledge is a basic course in calculus.
Good problems are an essential part of each textbook. The field of probabil-
ity is well known for being a subject that can best be acquired by the process
of learning-by-doing. Much care has been taken to present problems that will
enhance the student’s understanding of probability. He or she will be asked to
think about ideas rather than simply plug numbers into formulas. Working
through the problems, students may often find that probability problems are harder than
they first appear. This book has more than 750 carefully designed problems,
both “easy ones” and challenging ones. Problems are grouped according to the
section they are based on, which should be convenient for both students and
instructors. An important feature of this textbook is that it contains detailed
solutions to the odd-numbered problems, which helps stimulate active learning
and contributes to students’ confidence in their own problem-solving skills. It
is my belief that there is an enormous increase in content when worked-out
solutions to exercises are included. Solutions for the even-numbered problems
and further detailed solutions to the odd-numbered problems for instructors
can be found on the book’s webpage (www.cambridge.org/TijmsProbability).
Another added feature is that the student will find many tips on problem-
solving strategies.
For centuries, mankind lived with the idea that uncertainty was the domain of
the gods and fell beyond the reach of human calculation. Common gambling
led to the first steps toward understanding probability theory, and the colorful
Italian mathematician and physician Gerolamo Cardano (1501–1575) was the
first to attempt a systematic study of the calculus of probabilities. As an ardent
gambler himself, Cardano wrote a handbook for fellow gamblers entitled Liber
de Ludo Aleae (The Book of Games of Chance) about probabilities in games
of chance like dice. He originated and introduced the concept of the set of
outcomes of an experiment, and for cases in which all outcomes are equally
probable, he defined the probability of any one event occurring as the ratio of
the number of favorable outcomes to the total number of possible outcomes.
This may seem obvious today, but in Cardano’s day such an approach marked
an enormous leap forward in the development of probability theory.
Nevertheless, many historians mark 1654 as the birth of the study of
probability, since in that year questions posed by gamblers led to an exchange
of letters between the great French mathematicians Pierre de Fermat (1601–
1665) and Blaise Pascal (1623–1662). This famous correspondence laid the
groundwork for the birth of the study of probability, especially their question
of how two players in a game of chance should divide the stakes if the
game ends prematurely. This problem of points, which will be discussed in
Chapter 3, was the catalyst that enabled probability theory to develop beyond
mere combinatorial enumeration.
In 1657, the Dutch astronomer Christiaan Huygens (1629–1695) learned of
the Fermat–Pascal correspondence and shortly thereafter published the book
De Ratiociniis de Ludo Aleae (On Reasoning in Games of Chance), in which
he worked out the concept of expected value and unified various problems
that had been solved earlier by Fermat and Pascal. Huygens’ work led the
field for many years until, in 1713, the Swiss mathematician Jakob Bernoulli
1 Foundations of Probability Theory
The purpose of this chapter is to give the student a solid basis for solving
probability problems. We first discuss the intuitive and fundamental axioms of
probability, and from them derive a number of basic rules for the calculation
of probabilities. These rules include the complement rule, the addition rule,
and the inclusion–exclusion rule, for which many illustrative examples are
provided. These examples, which are instructive and provide insight into the
theory, include classical probability problems such as the birthday problem and
the hat-check problem.
Traditionally, introductory probability books begin with a comprehensive
discussion of set theory and combinatorics before presenting the “real stuff.”
This will not be done in this book, as it is not necessary and often stifles
the student’s natural curiosity and fascination with probability. Rather, the
appendices at the end of the book provide a self-contained overview of the
essentials of set theory and the basic tools from combinatorics, and these tools
are not introduced until they are needed.
• The experiment is to measure the time until the first emission of a particle
from a radioactive source. The sample space is the set (0, ∞) of the positive
real numbers, where outcome t indicates that it takes a time t until the first
emission of a particle. Taking an appropriate unit of time, the probability
$\int_a^b e^{-t}\,dt$ can be assigned to each time interval (a, b) on the basis of physical
properties of radioactive material, where e = 2.71828 . . . is the base of the
natural logarithm.
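For readers who like to check such a model on the computer, here is a minimal Python sketch (the interval endpoints a = 5 and b = 7 and the function name are chosen purely for illustration): it simulates emission times with unit rate and compares the simulated frequency of the event a < t < b with the exact value $\int_a^b e^{-t}\,dt = e^{-a} - e^{-b}$.

import math
import random

def emission_interval_prob(a, b, trials=100_000, seed=1):
    """Estimate the probability that the emission time falls in (a, b) by simulation."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if a < rng.expovariate(1.0) < b)
    return hits / trials

a, b = 5.0, 7.0
exact = math.exp(-a) - math.exp(-b)          # the integral of e^(-t) over (a, b)
print(exact, emission_interval_prob(a, b))   # both are roughly 0.0058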
In probability applications we are typically interested in particular subsets of
the sample space, which in probability language are called events. The terms
event and subset of outcomes of an experiment are used interchangeably in
probability theory. In the second example, the event that an odd number is
rolled is the subset A = {1, 3, 5} of the sample space. In the fourth example, the
event that more than six rolls are needed to get a six is the subset A = {7, 8, . . .}
of the sample space. In the fifth example, the event that it takes between 5 and 7
time units until the first emission of a particle is the subset A = {t : 5 ≤ t ≤ 7}
of the sample space, where “:” means “such that.”
Various choices for the sample space are sometimes possible. In the
experiment of tossing a fair coin twice, a possible choice for the sample
space is the set {HH, HT, TH, TT}. Another possible choice is the set {0, 1, 2},
where the outcome indicates the number of heads obtained. The assignment of
probabilities to the elements of the sample space differs for the two choices.
In the first choice of the sample space, the four elements are equally likely
and each element gets assigned the probability $\frac{1}{4}$. In the second choice of the
sample space, the elements are not equally likely and the elements 0, 1, and 2
get assigned the probabilities $\frac{1}{4}$, $\frac{1}{2}$, and $\frac{1}{4}$. In general, it is preferable to use a
sample space with equally likely outcomes whenever possible.
In the first three examples above, the sample space is a finite set. In the fourth
example the sample space is a so-called countably infinite set, while in the fifth
example the sample space is a so-called uncountable set. Let us briefly explain
these basic concepts from set theory, see also Appendix B. The set of natural
numbers (positive integers) is an infinite set and is the prototype of a countably
infinite set. In general, a nonfinite set is called countably infinite if a one-to-
one function exists which maps the elements of the set to the set of natural
numbers. In other words, every element of the set can be assigned to a unique
natural number and conversely each natural number corresponds to a unique
element of the set. For example, the set of squared numbers 1, 4, 9, 16, 25, . . .
is countably infinite. Not all sets with an infinite number of elements are
countably infinite. The set of all points on a line and the set of all real
numbers between 0 and 1 are examples of infinite sets that are not countable.
The union $\bigcup_{i=1}^{\infty} A_i$ of the subsets $A_1, A_2, \ldots$ is defined as the set of all outcomes
which belong to at least one of the subsets $A_1, A_2, \ldots$. The subsets $A_1, A_2, \ldots$
are said to be pairwise disjoint when any two subsets have no element in
common. In probability terms, any subset of the sample space is called an
event. If the outcome of the chance experiment belongs to the subset A, then the
event A is said to occur. The events A1 , A2 , . . . are said to be mutually exclusive
(or disjoint) if the corresponding sets A1 , A2 , . . . are pairwise disjoint. In other
words, events A1 , A2 , . . . are mutually exclusive if the occurrence of one of
these events implies the non-occurrence of the others. For example, suppose
that a die is rolled. The outcome is one of the numbers from 1 to 6. Let A be
the event that the outcome 1 or 2 occurs and B be the event that the outcome 5
or 6 occurs. Then A and B are mutually exclusive events. As another example,
suppose that a coin is tossed until heads appears for the first time. Let Ak be
the event that heads appears for the first time at the kth toss. Then the events
A1 , A2 , . . . are mutually exclusive.
The first two axioms simply express a probability as a number between 0
and 1. The crucial third axiom states that, for any infinite sequence of mutually
exclusive events, the probability of at least one of these events occurring is
the sum of their individual probabilities. This property also holds for any finite
sequence of mutually exclusive events. Using the concept of the empty set, the
proof of this result is almost trivial, see Rule 1.1 in Section 1.5. The countable
additivity in Axiom 3 is required to have a unified framework for finite and
nonfinite sample spaces. Starting with the three axioms and a few definitions,
a powerful and beautiful theory of probability can be developed.
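For ease of reference, the three axioms discussed above are the standard Kolmogorov axioms; writing $\Omega$ for the sample space, as is done below, they can be summarized as:
$$\text{Axiom 1: } P(A) \geq 0 \text{ for every event } A.$$
$$\text{Axiom 2: } P(\Omega) = 1.$$
$$\text{Axiom 3: } P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) \text{ for every sequence of mutually exclusive events } A_1, A_2, \ldots$$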
The standard notation for the sample space is the symbol $\Omega$. An outcome of
the sample space is denoted by $\omega$. A sample space together with a collection of
events and an assignment of probabilities to the events is called a probability
space. For a countable sample space $\Omega$, it is sufficient to assign a probability
$p(\omega)$ to each element $\omega \in \Omega$ such that $p(\omega) \geq 0$ and $\sum_{\omega \in \Omega} p(\omega) = 1$.
A probability measure $P$ on $\Omega$ is then defined by specifying the probability of
each event A as
$$P(A) = \sum_{\omega \in A} p(\omega).$$
In other words, P(A) is the sum of the individual probabilities of the outcomes
ω that belong to the set A. It is left to the reader to verify that P satisfies Axioms
1 to 3.
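A minimal Python sketch of this construction (the two-coin-toss sample space is used purely as an illustration): the function p is stored as a dictionary over the outcomes, and P(A) is obtained by summing p(ω) over the outcomes ω in A.

# Probability measure on the countable sample space of two tosses of a fair coin.
p = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def P(event):
    """P(A) = sum of the probabilities p(omega) over the outcomes omega in A."""
    return sum(p[omega] for omega in event)

at_least_one_head = {"HH", "HT", "TH"}
print(P(at_least_one_head))   # 0.75
print(P(p.keys()))            # 1.0, in agreement with Axiom 2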
A probability model is constructed with a specific situation or experiment in
mind. The assignment of probabilities is part of the translation process from a
concrete context into a mathematical model. Probabilities may be assigned to
events any way you like, as long as the above axioms are satisfied. To make
your choice of the probabilities useful, the assignment should result in a “good”
model for the real-world situation. There are two main approaches to assigning
probabilities to events. In the relative-frequency approach, probabilities are
assigned to the outcomes of a physical experiment having the feature that it
can be repeated over and over under identical conditions. Think of spinning a
roulette wheel or rolling dice. Then one may speak of physical probabilities
and such probabilities can be determined experimentally. In the subjective
approach, the word probability is roughly synonymous with plausibility and
probability is defined as the degree of belief a particular person holds in the
occurrence of an event. Think of the chances of your favorite horse winning a
race or the chances of your favorite baseball team winning the World Series.
Hence judgment is used as the basis for assigning subjective probabilities.
The use of the subjective approach is usually limited to experiments that are
unrepeatable. In this book the emphasis is on physical probabilities, but we
will also pay attention to subjective probabilities in Chapters 2 and 7.
Example 1.2 Three players enter a room and are given a red or a blue hat to
wear. The color of each hat is determined by a fair coin toss. Players cannot see
the color of their own hats, but do see the color of the other two players’ hats.
The game is won when at least one of the players correctly guesses the color
of his own hat and no player gives an incorrect answer. In addition to having
the opportunity to guess a color, players may also pass. Communication of any
kind between the players is not permissible after they have been given their
hats; however, they may agree on a group strategy beforehand. The players
decided upon the following strategy. A player who sees that the other two
players wear a hat with the same color guesses the opposite color for his/her
own hat; otherwise, the player says nothing. What is the probability of winning
the game under this strategy?
Solution. This chance experiment can be seen as tossing a fair coin three
times. As sample space, we take the set consisting of the eight elements RRR,
RRB, RBR, BRR, BBB, BBR, BRB, RBB, where R stands for a red hat and B
for a blue hat. Each element of the sample space is equally probable and gets
assigned a probability of $\frac{1}{8}$. The strategy wins if one of the six outcomes
RRB, RBR, BRR, BBR, BRB, or RBB occurs (verify!). Thus, the probability of
winning the game under the chosen strategy is $\frac{3}{4}$.
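The answer 3/4 is also easy to confirm by simulation. The following Python sketch (purely illustrative; the names and the number of rounds are arbitrary) draws the three hat colors by fair coin tosses and applies the stated strategy: a player who sees two hats of the same color guesses the opposite color, and everyone else passes.

import random

def play_round(rng):
    hats = [rng.choice("RB") for _ in range(3)]
    guesses = []
    for i in range(3):
        others = [hats[j] for j in range(3) if j != i]
        if others[0] == others[1]:                    # sees two hats of the same color
            guesses.append("R" if others[0] == "B" else "B")
        else:
            guesses.append(None)                      # passes
    made = [(g, hats[i]) for i, g in enumerate(guesses) if g is not None]
    # the game is won if at least one guess is made and no guess is wrong
    return len(made) > 0 and all(g == h for g, h in made)

rng = random.Random(0)
rounds = 100_000
print(sum(play_round(rng) for _ in range(rounds)) / rounds)   # close to 0.75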
In Example 1.2 we have encountered a useful problem-solving strategy: see
whether the problem can be related to a familiar problem.
As preparation for the next example, consider a task that involves a sequence
of r choices. Suppose that n1 is the number of possible ways the first choice
can be made, n2 is the number of possible ways the second choice can be made
after the first choice has been made, and n3 is the number of possible ways the
third choice can be made after the first two choices have been made, etc. Then
the total number of possible ways the task can be performed is n1 ×n2 ×· · ·×nr .
For example, the total number of possible ways five people can stand in line is
5 × 4 × 3 × 2 × 1 = 120. In other words, there are 120 permutations.
Example 1.3 In a Monte Carlo casino the roulette wheel is divided into
37 sections numbered 1 to 36 and 0. What is the probability that all numbers
showing up in eight spins of the wheel are different?
Solution. Take as sample space the set of all ordered sequences (i1 , . . . , i8 ),
where ik is the number showing up at the kth spin of the wheel. The sample
space has $37 \times 37 \times \cdots \times 37 = 37^8$ equiprobable elements. The number
of elements for which all components are different is $37 \times 36 \times \cdots \times 30$.
Therefore, the sought probability is
$$\frac{37 \times 36 \times \cdots \times 30}{37^8} = 0.4432.$$
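This number is quickly reproduced on a computer (an illustrative Python check): compute the exact ratio and compare it with a straightforward simulation of eight spins.

import math
import random

exact = math.prod(range(30, 38)) / 37**8        # (37 x 36 x ... x 30) / 37^8
print(round(exact, 4))                          # 0.4432

rng = random.Random(7)
trials = 200_000
hits = sum(len({rng.randrange(37) for _ in range(8)}) == 8 for _ in range(trials))
print(hits / trials)                            # close to 0.4432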
We reiterate the concept of the binomial coefficient before continuing.
Definition 1.1 The binomial coefficient $\binom{n}{k}$ denotes the total number of ways
to choose k different objects out of n distinguishable objects, with order not
mattering.
In other words, $\binom{n}{k}$ is the total number of combinations of k different objects
out of n. The key difference between permutations and combinations is order:
combinations are unordered selections, permutations are ordered arrangements.
In Appendix A these important concepts are discussed extensively and
illustrated with several combinatorial probability problems. For any integers n
and k with $1 \leq k \leq n$, the binomial coefficient can be calculated as
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$
where $m!$ is shorthand for $1 \times 2 \times \cdots \times m$ with the convention $0! = 1$. Note that
$\binom{n}{k} = \binom{n}{n-k}$ with the convention $\binom{n}{0} = 1$. For example, the number of ways
to choose three jurors out of five candidates is $\binom{5}{3} = \frac{5!}{3!\,2!} = \frac{120}{6 \times 2} = 10$, with
order not mattering.
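In Python, binomial coefficients are available directly through the standard library (a small illustrative check of the juror example):

import math

print(math.comb(5, 3))                      # 10 ways to choose three jurors out of five
print(math.comb(5, 3) == math.comb(5, 2))   # True, since C(n, k) = C(n, n - k)
print(math.factorial(5) // (math.factorial(3) * math.factorial(2)))   # the same value via n!/(k!(n-k)!)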
In no book on introductory probability should problems on tossing coins,
rolling dice, and dealing cards be missing. The next examples deal with these
sorts of probability problems and use binomial coefficients to solve them.
Example 1.4 A fair coin is tossed 100 times. What is the probability of getting
exactly 50 heads?
Solution. Take as sample space all possible sequences of zeros and ones to a
length of 100, where a zero stands for tails and a one for heads. The sample
space has $2^{100}$ equiprobable elements. The number of elements having exactly
50 ones is $\binom{100}{50}$. Therefore, the probability of getting exactly 50 heads is
$$\frac{\binom{100}{50}}{2^{100}}.$$
This probability is the ratio of two enormously large numbers and its computation
requires special provisions. However, a very accurate approximation
to this probability can be given. To this end, consider the general case of 2N
tosses of a fair coin. Then the probability $p_N$ of getting exactly N heads is
$$p_N = \frac{\binom{2N}{N}}{2^{2N}}.$$
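As an aside, the exact value is easy to obtain with Python's integer arithmetic (an illustrative computation):

import math

def p_N(N):
    """Probability of exactly N heads in 2N tosses of a fair coin."""
    return math.comb(2 * N, N) / 2**(2 * N)

print(p_N(50))   # approximately 0.0796, the probability of exactly 50 heads in 100 tosses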
sample space made up of ordered elements. Imagine that the 52 cards of the
deck are numbered 1, 2, . . . , 52 and that the 13 cards of the bridge hand are
randomly dealt one at a time. The sample space made up of ordered elements
is the set of all possible 13-tuples (i1 , . . . , i13 ), where ik is the number of the kth
card dealt. The sample space has 52 × 51 × · · · × 40 equally likely outcomes.
The number of cards having a value nine or lower is 8 × 4 = 32, and so there
are 32 × 31 × · · · × 20 outcomes for which there is no card higher than a nine
among the 13 cards dealt. Thus, the probability of a Yarborough is
$$\frac{32 \times 31 \times \cdots \times 20}{52 \times 51 \times \cdots \times 40} = 0.000547.$$
Alternatively, this probability can be computed by using a sample space made
up of unordered elements. The order in which the 13 cards are dealt is not
relevant in such a sample space.1 Each outcome of this sample space is a set
of 13 different cards from the deck of 52 cards. In a set we don’t care which
element is first, only which elements are actually present. The number of ways
you can choose a set of 13 different cards from a deck of 52 cards is given
by the binomial coefficient $\binom{52}{13}$. Thus, the sample space made up of unordered
elements has $\binom{52}{13}$ equally likely outcomes. The number of outcomes with no
card above a nine is $\binom{32}{13}$, and so the probability of a Yarborough can also be
calculated as
$$\frac{\binom{32}{13}}{\binom{52}{13}} = 0.000547.$$
The odds against a Yarborough are 1,827 to 1. A Yarborough is named after
Lord Yarborough (1809–1862). The second Earl of Yarborough would offer
his whist-playing friends a wager of £1,000 to 1 against them picking up such
a hand. You see that the odds were on his side. There is no record that he ever
paid out.
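Both routes to this probability are easy to reproduce exactly (an illustrative Python check using the standard library; the Fraction type keeps the arithmetic exact):

import math
from fractions import Fraction

ordered   = Fraction(math.prod(range(20, 33)), math.prod(range(40, 53)))   # (32 x 31 x ... x 20) / (52 x 51 x ... x 40)
unordered = Fraction(math.comb(32, 13), math.comb(52, 13))                  # C(32,13) / C(52,13)
print(ordered == unordered, float(ordered))   # True 0.000547...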
Example 1.7 A well-shuffled standard deck of 52 cards is divided into four
piles of 13 cards each. What is the probability that each pile contains exactly
one ace?
Solution. A subtle approach is as follows. The deck of cards has 52 positions
from top to bottom. Imagine that the first 13 positions are for the first pile, the
next 13 positions for the second pile, etc. Take as sample space the collection
$(B_j, R_k)$, and the $7 \times 6$ outcomes $(B_j, B_k)$, where the first component of each
outcome refers to the first card picked and the second component to the second
card picked. There is a total of 420 equally likely outcomes. The probability
of picking two red cards is $\frac{14 \times 13}{420}$ and is less than the probability $\frac{14 \times 7 + 7 \times 14}{420}$
Problems
1.1 Two fair dice are rolled. What is the probability that the sum of the two
numbers rolled is odd? What is the probability that the product of the two
numbers rolled is odd?
1.2 In a township, there are two plumbers. On a particular day three resi-
dents call village plumbers, independently of each other. Each resident
randomly chooses one of the two plumbers. What is the probability that
all three residents will choose the same plumber?
1.3 Take a random permutation of the letters in the word “randomness.” What
is the probability that the permutation begins and ends with a vowel?
What is the probability that the three vowels are adjacent to each other in
the permutation?
1.4 A dog has a litter of four puppies. Can we correctly say that the litter
more likely consists of three puppies of one gender and one of the other
than that it consists of two puppies of each gender?
1.5 Suppose that you pick m numbers from a set of n different numbers
without replacement. What is the probability of getting the largest
number in the set?
1.6 Three girls and three boys have a dinner party. They agree that two of
them are going to do the dishes. The unlucky persons are determined by
drawing lots. What is the probability that two boys will do the dishes?
1.7 A box contains n identical balls, where two balls have a winning number
written on them. In a prespecified order, n persons each choose a ball at
random from the box. Show that each person has the same probability of
getting a winning ball.
1.8 Players A and B play a game of rolling two fair dice. If the largest number
rolled is one, two, three, or four, player A wins; otherwise, player B wins.
What is the probability of player A winning?
1.9 In the 6/42 lottery, six distinct numbers are picked at random from
the numbers 1, . . . , 42. What is the probability that number 10 will be
picked? What is the probability that each of the six numbers picked is 20
or more?
1.10 Five people are sitting at a table in a restaurant. Two of them order
coffee and the other three order tea. The waiter forgets who ordered what
and puts the drinks in a random order for the five persons. What is the
probability that each person gets the correct drink?
1.11 Two black socks, two brown socks, and one white sock lie mixed up in a
drawer. You grab two socks without looking. What is the probability that
you have grabbed two black socks or two brown socks?
1.12 Two letters have fallen out of the word “Cincinnati” at random places.
What is the probability that these two letters are the same?
1.13 You choose at random two cards from a standard deck of 52 cards. What
is the probability of getting a ten and a heart?
1.14 You choose at random a letter from the word “chance” and from the word
“choice.” What is the probability that the two letters are the same?
1.15 You roll a fair die 10 times. Can you explain why the probability of
rolling a total of s points is the same as the probability of rolling a total
of 70 − s points for s = 10, . . . , 34?
1.16 You roll 12 fair dice. What is the probability that each number will appear
exactly twice?
1.17 A building has three floors. Each floor has four apartments and
each apartment has one resident. Three residents meet each other
1.3 Geometric Probability Model
means that equal probabilities are assigned to sets of equal lengths, areas, or
volumes. The next examples illustrate the geometric probability model.
Example 1.8 Joe is driving around, looking for a parking spot. He plans to stay
in the area for exactly 2 hours and 15 minutes. He comes across a free parking
spot that only allows free parking for 2 hours. The parking enforcement officer
comes around once every 2 hours. At each inspection the officer marks cars
that were not there at his last visit and tickets earlier-marked cars. Joe arrives
at a random moment between two visits of the officer, and Joe has no idea how
long ago the officer’s last visit was. Joe decides to take a gamble and parks in
the free spot. What is the probability that Joe will receive a ticket?
Solution. Define a probability model with the interval (0, 120) as sample
space. The outcome x means that it takes x minutes until the next visit of the
officer, measured from the moment that Joe arrives. To each subinterval, we
assign as probability the length of the interval divided by 120. Let A be the
event that Joe receives a ticket. The event A occurs if and only if the outcome
x lies in the interval (0, 15). Thus $P(A) = \frac{15}{120}$, so the probability that Joe will
receive a ticket is $\frac{1}{8}$.
Example 1.9 You randomly throw a dart at a circular dartboard with radius R.
It is assumed that the dart is infinitely sharp and lands at a random point on the
dartboard. How do you calculate the probability of the dart hitting the bullseye,
having radius b?
Solution. The sample space of this experiment is the set of pairs of real
numbers (x, y) with x2 + y2 ≤ R2 , where (x, y) indicates the point at which
the dart hits the dartboard. This sample space is uncountable. We first make
the following observation. The probability that the dart lands exactly at a
prespecified point is zero. It only makes sense to speak of the probability of
the dart hitting a given region of the dartboard. This observation expresses
a fundamental difference between a probability model with a countable
sample space and a probability model with an uncountable sample space. The
assumption of the dart hitting the dartboard at a random point is translated by
assigning the probability
$$P(A) = \frac{\text{area of } A}{\pi R^2}$$
to each subset A of the sample space. Therefore, the probability of the dart
hitting the bullseye is $\pi b^2/(\pi R^2) = b^2/R^2$.
[Figure 1.1: a needle dropped on the ruled floor, with y the distance from the needle's center to the closest line and x the angle at which the needle falls.]
Example 1.10 A floor is ruled with equally spaced parallel lines a distance D
apart. A needle of length L is dropped at random on the floor. It is assumed that
L ≤ D. What is the probability that the needle will intersect one of the lines?
This problem is known as Buffon’s needle problem.
Solution. This geometric probability problem can be translated into the
picking of a random point inside a certain region. Let y be the distance from the
center of the needle to the closest line and x be the angle at which the needle
falls, where x is measured against a line parallel to the lines on the floor; see
Figure 1.1. The sample space of the experiment can be taken as the rectangle
R consisting of the points (x, y), with $0 \leq x \leq \pi$ and $0 \leq y \leq \frac{1}{2}D$. Dropping
the needle at random on the floor can be seen to be equivalent to choosing
a random point inside the rectangle R. The needle will land on a line only if
the hypotenuse of the right-angled triangle in Figure 1.1 is less than half of the
length L of the needle. That is, we get an intersection if and only if $\frac{y}{\sin(x)} < \frac{1}{2}L$.
Thus, the probability that the needle will intersect one of the lines equals the
probability that a point (x, y) chosen at random in the rectangle R satisfies
$y < \frac{1}{2}L\sin(x)$. In other words, the area under the curve $y = \frac{1}{2}L\sin(x)$ divided
by the total area of the rectangle R gives the probability of an intersection. This
ratio is
$$\frac{\int_0^{\pi} \frac{1}{2}L\sin(x)\,dx}{\frac{1}{2}\pi D} = \frac{-L\cos(x)\big|_0^{\pi}}{\pi D}$$
and so
$$P(\text{needle intersects one of the lines}) = \frac{2L}{\pi D}.$$
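The 2L/(πD) answer lends itself to a direct Monte Carlo check (an illustrative Python sketch; the values L = 1 and D = 2 are arbitrary choices with L ≤ D): draw the point (x, y) at random in the rectangle R and count how often $y < \frac{1}{2}L\sin(x)$.

import math
import random

def buffon(L, D, trials=200_000, seed=3):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0.0, math.pi)     # angle at which the needle falls
        y = rng.uniform(0.0, D / 2)       # distance from the needle's center to the closest line
        if y < 0.5 * L * math.sin(x):
            hits += 1
    return hits / trials

L, D = 1.0, 2.0
print(buffon(L, D), 2 * L / (math.pi * D))   # both close to 0.3183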
Problems
1.26 A stick of length one is broken into two pieces at a random point. What
is the probability that the length of the longer piece will be at least three
times the length of the shorter piece?
1.27 Franc-carreau was a popular game in eighteenth-century France. In this
game, a coin is tossed on a chessboard. The player wins if the coin
does not fall on one of the lines of the board. Suppose now that a round
coin with a diameter of d is blindly tossed on a large table. The surface
of the table is divided into squares whose sides measure a in length,
where a > d. Define an appropriate probability space and calculate the
probability of the coin falling entirely within the confines of a square.
Hint: Consider the position of the center of the coin.
1.28 Each morning during the week you arrive at a train station at a random
moment between 7 a.m. and 8 a.m. Fast trains in the direction of your
destination arrive every 15 minutes starting at 7 a.m., whereas slow trains
in the direction of your destination arrive every 30 minutes starting at
7:05 a.m. You take the first train that arrives. What is the probability that
you will take a slow train on any given morning?
1.29 Consider again Problem 1.28. Suppose that your friend arrives at the train
station at a random moment between 7.30 a.m. and 8 a.m. and also takes
the first train in the direction of your destination. What is the probability
that he gets on the same train as you?
1.30 Two people have agreed to meet at a central point in the city. Indepen-
dently of one another, each person is to appear at a random moment
between 12 p.m. and 1 p.m. What is the probability that the two persons
will meet within 10 minutes of one another?
1.31 The numbers B and C are chosen at random between −1 and 1,
independently of each other. What is the probability that the quadratic
equation x2 + Bx + C = 0 has real roots? Also, derive a general
expression for this probability when B and C are chosen at random from
the interval (−q, q) for any q > 0.
1.32 The Manhattan distance from a point (x, y) in the plane to the origin
(0, 0) is defined as |x| + |y|. You choose a random point inside the unit
square {(x, y) : 0 ≤ x, y ≤ 1}. What is the probability that the Manhattan
distance between this point and the point (0, 0) is no more than a for
0 ≤ a ≤ 2? Does the answer change when the point is randomly chosen
in the square {(x, y) : −1 ≤ x, y ≤ 1}?
1.33 Consider the following variant of Buffon’s needle problem from
Example 1.10. A rectangular card with side lengths a and b is dropped
at random on the floor. It is assumed that the length $\sqrt{a^2 + b^2}$ of the
diagonal of the card is smaller than the distance D between the parallel
lines on the floor. Show that the probability of the card intersecting one
of the lines is given by $\frac{2(a+b)}{\pi D}$.
1.34 A random point is chosen inside a triangle with height h and base of
length b. What is the probability that the perpendicular distance from
the point to the base is larger than a given value d with 0 < d < h?
What is the probability that the randomly chosen point and the base of
the triangle will form a triangle with an obtuse angle when the original
triangle is equilateral?
1.35 You choose a number v at random from (0, 1) and next a number w at
random from (0, 1 − v). What is the probability that a triangle can be
formed with side lengths v, w, and 1 − v − w? Next answer the following
question. What is the probability that a triangle can be formed with three
pieces of a broken stick if the stick is first broken at random into two
pieces and next the longer piece is randomly broken into two pieces.
Hint: Use the fact that the sum of any two side lengths must be greater
than the third side length and represent (v, w) as (x, y(1−v)), where (x, y)
is a random point inside the unit square.
1.36 A stick is broken at two places. The break points are chosen at random
on the stick, independently of each other. What is the probability that a
triangle can be formed with the three pieces of the broken stick?
1.37 Choose a random point inside a circle with radius r and construct
the (unique) chord with the chosen point as its midpoint. What is the
probability that the chord is longer than a side of an equilateral triangle
inscribed in the circle?
This choice for the probability measure is not only intuitively the obvious one,
but it can also be proved that it is the only probability measure satisfying the
property P(AB) = P(A)P(B) when the elementary experiments that generate
event A are physically independent of those elementary experiments that give
rise to event B. This important result of the uniqueness of the probability
measure satisfying this property justifies the use of the product rule for
compound chance experiments.
Example 1.11 In the “Reynard the Fox” café, it normally costs $3.50 to buy
a pint of beer. On Thursday nights, however, customers pay $0.25, $1.00,
or $2.50 for the first pint. In order to determine how much they will pay,
customers must throw a dart at a dartboard that rotates at high speed. The
dartboard is divided into eight segments of equal size. Two of the segments
read $0.25, four of the segments read $1, and two more of the segments read
$2.50. You pay whatever you hit. Two friends, John and Peter, each throw a
dart at the board and hope for the best. What is the probability that the two
friends will have to pay no more than $2 between them for their first pint?
Solution. This problem can be modeled as a compound experiment that con-
sists of two subexperiments. The physically independent subexperiments are
the throws of John and Peter. The elementary outcomes of each subexperiment
are L, M, and H, where L stands for hitting a low-priced segment, M stands
for hitting a medium-priced segment, and H stands for hitting a high-priced
segment. Assuming that the dart hits the dartboard at a random point, the
probabilities $p(L) = \frac{2}{8}$, $p(M) = \frac{4}{8}$, and $p(H) = \frac{2}{8}$ are assigned to the outcomes
L, M, and H. The sample space of the compound experiment consists of the
nine outcomes (L, L), (L, M), (M, L), (L, H), (H, L), (M, M), (M, H), (H, M),
(H, H). The probability $\frac{1}{4} \times \frac{1}{4} = \frac{1}{16}$ is assigned to each of the outcomes (L, L),
(L, H), (H, L), (H, H), the probability $\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$ to the outcome (M, M), and
the probability $\frac{1}{2} \times \frac{1}{4} = \frac{1}{8}$ to each of the outcomes (L, M), (M, L), (H, M),
(M, H). The two friends will have to pay no more than $2 between them for
their first pint if one of the four outcomes (L, L), (L, M), (M, L), (M, M) occurs.
The probability of this event is $\frac{1}{16} + \frac{1}{8} + \frac{1}{8} + \frac{1}{4} = \frac{9}{16}$.
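The 9/16 can be confirmed by enumerating the nine compound outcomes directly (a small illustrative Python check; the prices and segment probabilities are those of the example):

from fractions import Fraction
from itertools import product

price = {"L": Fraction(1, 4), "M": Fraction(1, 1), "H": Fraction(5, 2)}   # $0.25, $1.00, $2.50
prob  = {"L": Fraction(1, 4), "M": Fraction(1, 2), "H": Fraction(1, 4)}

total = sum(prob[a] * prob[b]
            for a, b in product("LMH", repeat=2)
            if price[a] + price[b] <= 2)
print(total)   # 9/16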
Problems
1.38 In a tennis tournament between three players A, B, and C, each player
plays the others once. The strengths of the players are as follows: P(A
beats B) = 0.5, P(A beats C) = 0.7, and P(B beats C) = 0.4. Assuming
independence of the match results, calculate the probability that player A
wins at least as many games as any other player.
1.39 An electronic system has four components labeled 1, 2, 3, and 4. The
system has to be used during a given time period. The probability that
component i will fail during that time period is fi for i = 1, . . . , 4. Failures
of the components are physically independent of each other. A system
failure occurs if component 1 fails or if at least two of the other three
components fail. What is the probability that the system will fail?
1.40 Bill and Mark take turns picking a ball at random from a bag containing
four red and seven white balls. The balls are drawn out of the bag with
replacement. What is the probability that Bill is the first person to pick a
red ball when Mark is the first person to start?
1.41 Two dice are tossed until the dice sum 7 or 8 appears. What is the
probability of getting a total of 8 before a total of 7?
1.42 Three desperados A, B, and C play Russian roulette in which they take
turns pulling the trigger of a six-cylinder revolver loaded with one bullet.
Each time, the magazine is spun to randomly select a new cylinder to
fire as long as the deadly shot has not already occurred. The desperados
shoot according to the order A, B, C, A, B, C, . . . . Determine, for each of
the three desperados, the probability that this desperado will be the one
to shoot himself dead.
1.43 Two persons each roll a pair of dice, independently of each other. What
is the probability that the sums rolled are different?
occurs. The probability measure on the class of cylinder sets can be extended
to one defined on a sufficiently general class of subsets capable of representing
all possible events of this chance experiment. This extension requires measure
theory, however, and is beyond the scope of this book.
It is intuitively clear that the fraction of coin tosses in which heads occurs
converges to $\frac{1}{2}$ when the number of tosses increases without limit. We can state
this claim more rigorously with the help of the probability measure $P^{(\infty)}$. To do
so, we adopt the notation $K_n(\omega)$ to represent the number of heads occurring
in the first n elements of $\omega$. Furthermore, let C be the set of all outcomes $\omega$
for which $\lim_{n\to\infty} K_n(\omega)/n = \frac{1}{2}$. For very many sequences $\omega$, the number
$K_n(\omega)/n$ does not converge to $\frac{1}{2}$ as $n \to \infty$ (e.g., this is the case for any
sequence $\omega$ with finitely many H's). However, "nature" chooses a sequence
from the set C according to $P^{(\infty)}$. More precisely, the probability measure
$P^{(\infty)}$ assigns a probability of 1 to the set C, or, in mathematical notation,
$$P^{(\infty)}\Big(\Big\{\omega : \lim_{n\to\infty} \frac{K_n(\omega)}{n} = \frac{1}{2}\Big\}\Big) = 1.$$
This type of convergence is called convergence with probability one, or almost-
sure convergence. The terminology of an “almost-sure” event means that
realizations not in this event are theoretically possible but will not happen
in reality. The convergence result is a special case of the strong law of
large numbers. This law is of enormous importance: it provides a direct link
between theory and practice. It was a milestone in probability theory when,
around 1930, A. N. Kolmogorov proved this law from the simple axioms of
probability. In Chapter 9 we come back to the strong law of large numbers.
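A quick numerical illustration of this convergence (an illustrative Python sketch, of course not a proof): simulate one long sequence of fair-coin tosses and print the running fraction of heads.

import random

rng = random.Random(2024)
heads = 0
for n in range(1, 1_000_001):
    heads += rng.randrange(2)                          # 1 counts as heads, 0 as tails
    if n in (10, 100, 1_000, 10_000, 100_000, 1_000_000):
        print(n, heads / n)                            # the fractions drift toward 0.5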
Problems
1.44 Use the Kolmogorov axioms to prove the following inequalities:
(a) P(A) ≤ P(B) when A ⊆ B. Hint: Let C = B\A be the set of all
outcomes in B that are not in A and note that B is the union of the
disjoint sets A and C.
(b) $P\big(\bigcup_{k=1}^{\infty} A_k\big) \leq \sum_{k=1}^{\infty} P(A_k)$ for any sequence of subsets $A_1, A_2, \ldots$.
This is Boole's inequality, also known as the union bound. Hint:
Define the pairwise disjoint sets $B_k$ by $B_1 = A_1$ and $B_k = A_k \setminus (A_1 \cup \cdots \cup A_{k-1})$ for $k \geq 2$.
1.45 In the coin-tossing experiment of repeatedly tossing a fair coin, a run of
length r is said to have occurred if heads have just been tossed r times
in a row. Prove that a run of length r is certain to occur somewhere
if a fair coin is tossed indefinitely often. Hint: Analyze the coin-
tossing experiment as an infinite sequence of physically independent
r-experiments. The r-experiment consists of r consecutive tosses of the
coin. Define $B_n$ as the event that no run of length r occurs in any of the
first n of the r-experiments and evaluate $P\big(\bigcap_{n=1}^{\infty} B_n\big)$.
1.46 You toss a fair coin until you obtain 10 heads in a row and then you
stop. What is the probability of seeing at least 10 consecutive tails in the
sequence prior to stopping? Hint: Use the result of Problem 1.45.
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \cdots + (-1)^{n-1} P(A_1 A_2 \cdots A_n).$$
The proofs of these rules are simple and instructive. They nicely demonstrate
how useful propositions can be obtained from “minimal” axioms.
To prove Rule 1.1, denote by ∅ the empty set of outcomes. We first show that
$$P(\emptyset) = 0.$$
Using Axiom 3 with $A_i = \emptyset$ for all $i$ gives $P(\emptyset) = \sum_{i=1}^{\infty} a_i$, where $a_i = P(\emptyset)$
for all $i$. This implies that $P(\emptyset) = 0$. Let $A_1, \ldots, A_n$ be any finite sequence of
pairwise disjoint sets. Augment this sequence with $A_{n+1} = \emptyset, A_{n+2} = \emptyset, \ldots$.
Then, by Axiom 3,
$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{n} P(A_i).$$
The proof of Rule 1.2 is as follows. The set A ∪ Ac is by definition equal to
the sample space. Thus, by Axiom 2, P (A ∪ Ac ) = 1. The sets A and Ac are
disjoint. It now follows from Rule 1.1 that P (A ∪ Ac ) = P(A) + P(Ac ). This
gives the complement rule P(A) = 1 − P(Ac ).
To prove Rule 1.3, denote by A1 the set of outcomes that belong to A but not
to B. Let B1 be the set of outcomes that are in B but not in A and let C = AB
be the set of outcomes that are both in A and B. The sets A1 , B1 , and C are
pairwise disjoint. Moreover,
$$A \cup B = A_1 \cup B_1 \cup C, \quad A = A_1 \cup C, \quad \text{and} \quad B = B_1 \cup C.$$
Applying Rule 1.1 gives
$$P(A \cup B) = P(A_1) + P(B_1) + P(C).$$
Also, $P(A) = P(A_1) + P(C)$ and $P(B) = P(B_1) + P(C)$. By substituting the
latter two relations into the expression for $P(A \cup B)$ and noting that $C = AB$,
we find $P(A \cup B) = P(A) + P(B) - P(AB)$.
Example 1.13 How many rolls of a fair die are required to have at least a 50%
chance of rolling at least one six? How many rolls of two fair dice are required
to have at least a 50% chance of rolling at least one double six?
Solution. Let us first consider the experiment of rolling a single die r times,
where r is fixed in advance. The sample space of this experiment consists of all
elements (i1 , i2 , . . . , ir ), where ik denotes the outcome of the kth roll of the die.
The sample space has $6 \times 6 \times \cdots \times 6 = 6^r$ equally likely elements. Let A be
the event that at least one six is obtained in r rolls of the die. To compute P(A),
it is easiest to compute the probability of the complementary event $A^c$ that no
six is obtained in r rolls of the die. The set $A^c$ consists of $5 \times 5 \times \cdots \times 5 = 5^r$
elements. Thus, $P(A^c) = 5^r/6^r$ and so
$$P(A) = 1 - \frac{5^r}{6^r}.$$
This probability has the value 0.4213 for r = 3 and the value 0.5177 for r = 4.
Therefore four rolls of the die are required.
The following sample space is taken for the experiment of rolling two dice
r times. Imagining that one die is blue and the other is red, the sample space
consists of all elements ((i1 , j1 ), (i2 , j2 ), . . . , (ir , jr )), where ik and jk denote the
outcomes of the blue and the red die on the kth roll of the two dice. The sample
space consists of $36 \times 36 \times \cdots \times 36 = 36^r$ elements. All elements are equally
likely. Let A be the event that at least one double six is obtained in the r rolls
of the two dice. The complementary event $A^c$ of rolling no double six in r
rolls of the two dice can occur in $35 \times 35 \times \cdots \times 35 = 35^r$ ways. Therefore,
$P(A^c) = 35^r/36^r$ and so
$$P(A) = 1 - \frac{35^r}{36^r}.$$
This probability has the value 0.4914 for r = 24 and the value 0.5055 for r =
25. Therefore 25 rolls of the two dice are required. De Méré must have been an
assiduous player in order to have established empirically that the probability
of rolling at least one double six in 24 rolls of a pair of dice lies just under
one-half.
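Both threshold values of r follow at once from the complement-rule formulas (an illustrative Python computation):

def first_r(p_single):
    """Smallest r with 1 - p_single**r >= 0.5."""
    r = 1
    while 1 - p_single**r < 0.5:
        r += 1
    return r

print(first_r(5 / 6))      # 4 rolls of one die for at least one six
print(first_r(35 / 36))    # 25 rolls of two dice for at least one double six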
The complement rule is also the key to the solution of the famous birthday
problem.
Example 1.14 What is the probability that in a class of n (≤ 365) children
(no twins), two or more children have the same birthday? Assume that the
year has 365 days (ignore February 29) and that all possible birthdays are
equally likely. What is the smallest value of n such that this probability is at
least 50%?
Solution. Take as the sample space all ordered n-tuples of numbers selected
from the integers 1, 2, . . . , 365, where the first number of the tuple is the
birthday of the first child, the second number is the birthday of the second
child, etc. The sample space has $365 \times 365 \times \cdots \times 365 = 365^n$ equally likely
elements. Let A be the event that at least two children have the same birthday.
The number of elements for which all n children have a different birthday is
$365 \times 364 \times \cdots \times (365 - n + 1)$. Then, by the complement rule,
$$P(A) = 1 - \frac{365 \times 364 \times \cdots \times (365 - n + 1)}{365^n}.$$
The smallest value of n for which this probability is 50% or more is n = 23.
The probability has the value 0.5073 for n = 23.⁵ The finding that a small
value of n = 23 suffices is no longer surprising when you realize that there are
$\binom{23}{2} = 253$ combinations of two children in a class of 23 children, where any
combination leads to a match with a probability of $\frac{1}{365}$.
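The threshold n = 23 is found directly from the complement-rule formula (an illustrative Python computation):

def p_match(n):
    """Probability that at least two of n children share a birthday."""
    p_all_different = 1.0
    for k in range(n):
        p_all_different *= (365 - k) / 365
    return 1 - p_all_different

n = 1
while p_match(n) < 0.5:
    n += 1
print(n, round(p_match(n), 4))   # 23 0.5073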
The psychology of probability intuition is an interesting feature of the
birthday problem. Many people think that more than 23 people are needed
for a birthday match and the number 183 is commonly suggested. A similar
misconception can be seen in the words of a lottery official regarding his
lottery, in which a four-digit number was drawn daily from the 10,000 number
sequence 0000, 0001, . . . , 9999. On the second anniversary of the lottery, the
official deemed it highly improbable that any of the 10,000 possible numbers
had been drawn two or more times in the last 625 drawings. The lottery
official was wildly off the mark: the probability that some number will not
be drawn two or more times in 625 drawings is inconceivably small and of
the order of magnitude $10^{-9}$. This probability can be calculated by looking
at the problem as a “birthday problem” with 10,000 possible birthdays and a
group of 625 people. The birthday problem nicely illustrates that coincidences
may not be so unusual after all, see also Problems 1.54 and 1.55. Nearly
all coincidences can be explained by simple probability rules. What looks
unexpected usually turns out to be expected.
Rule 1.3 is often called the addition rule for probabilities. When P(A) and
P(B) are added for non-disjoint sets A and B, the probability of the intersection
5 In reality, birthdays are not uniformly distributed throughout the year, but follow a seasonal
pattern. The probability of a match only becomes larger for any deviation from equal birth
frequencies, as can be understood intuitively by imagining a group of people coming from a
planet on which people are always born on the same day. However, for birth frequency
variation as occurring in reality, the match probability is very insensitive to deviations from
uniform birth rates and the group size of 23 for a fifty-fifty match probability does not change,
see T. S. Nunnikhoven, “A birthday problem for nonuniform birth frequencies,” The American
Statistician 46 (1992): 270–274.
where, for example, the outcome ♣7 means that the seven of clubs is drawn.
All possible outcomes are equally likely, and thus each outcome gets assigned
the same probability $\frac{1}{52}$. Let A be the event that the card drawn is a heart and
B the event that the card drawn is an ace. These two events are not mutually
exclusive. We are looking for the probability P(A ∪ B) that at least one of the
events A and B occurs. This probability can be calculated by applying Rule 1.3:
In this case, P(AB) stands for the probability that the card drawn is the ace of
hearts. The events A and B correspond to sets that contain 13 and 4 elements,
respectively, and thus have respective probabilities $\frac{13}{52}$ and $\frac{4}{52}$. The event AB
corresponds to a set that is a singleton and thus has probability $\frac{1}{52}$. Therefore,
the probability that the card drawn is either a heart or an ace or both is equal to
$$P(A \cup B) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52}.$$
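Since the sample space has only 52 equally likely outcomes, the addition rule can also be checked by brute-force enumeration (an illustrative Python sketch):

from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))                        # 52 equally likely outcomes

favorable = [card for card in deck if card[1] == "hearts" or card[0] == "A"]
print(len(favorable), "/", len(deck))                     # 16 / 52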
Example 1.16 Suppose that you are playing blackjack against the dealer. What
is the probability that neither of you are dealt a blackjack when getting two
cards each from a thoroughly shuffled standard deck of 52 cards? A blackjack
consists of one ace together with one card from the 16 cards formed by the
tens, jacks, queens, and kings.
Solution. The sample space consists of all combinations of an unordered pair
of cards for the player and an unordered pair of cards for the dealer. It has
$\binom{52}{2}\binom{50}{2}$ equiprobable elements. Let A be the event that neither the player nor
the dealer gets a blackjack. The sought probability P(A) is easiest computed by
using the complement rule. Let A1 be the event that the player gets a blackjack
and A2 be the event that the dealer gets a blackjack. Then P(A) = 1 − P(A1 ∪
A2 ). By the addition rule, P(A1 ∪ A2 ) = P(A1 ) + P(A2 ) − P(A1 A2 ). Thus
This gives
$$P(E \cup G) = \frac{5{,}760}{8!} + \frac{5{,}760}{8!} - \frac{1{,}152}{8!} = 0.2571.$$
Problems
1.47 Of those visiting a particular car dealer selling second-hand cars and
Japanese cars, 55% buy no car, 25% buy a second-hand car, and 30%
buy a Japanese car. What is the probability that a visit leads to buying a
second-hand Japanese car?
1.48 Suppose that 50% of households in a certain city subscribe to the morning
newspaper, 70% of households subscribe to the afternoon newspaper, and
80% of households subscribe to at least one of the two newspapers. What
proportion of the households subscribe to both newspapers?
1.49 A transport firm has two vehicles, a truck and a van. The truck is used
75% of the time. Both vehicles are used 30% of the time and neither of
the vehicles is used for 10% of the time. What is the probability that the
van is used on any given day? What is the probability that only the van is
used on any given day?
1.50 The event A has probability $\frac{2}{3}$ and there is a probability of $\frac{3}{4}$ that at
least one of the events A and B occurs. What are the smallest and largest
possible values for the probability of event B?
1.51 For any two events A and B, prove that the probability of exactly one of
them occurring is $P(A) + P(B) - 2P(AB)$. Hint: Consider
$P\big((A \cap B^c) \cup (B \cap A^c)\big)$.
1.52 Let $A_1, \ldots, A_n$ be any finite sequence of events. Use induction to verify
the bounds
$$\sum_{i=1}^{n} P(A_i) - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} P(A_i A_j) \leq P\Big(\bigcup_{i=1}^{n} A_i\Big) \leq \sum_{i=1}^{n} P(A_i).$$
1.53 In a class of 30 children, each child writes down a randomly chosen
number from 1, 2, . . . , 250. What is the probability that two or more
children have chosen the same number?
1.54 Suppose that n independent repetitions are done of an experiment that
has m equally likely outcomes $O_1, \ldots, O_m$. What is the probability that
some outcome occurs two or more times? Show that this probability can
be approximated by $1 - e^{-\frac{1}{2} n(n-1)/m}$ when m is much larger than n. What
is the probability that the particular outcome $O_1$ occurs at least once?
Verify that for a fifty-fifty probability the required number of trials is
about $1.177\sqrt{m} + 0.5$ for the first probability and about $0.6931 m$ for the
second probability when m is large. Hint: Use the approximation $1 - x \approx e^{-x}$
for x close to zero.
1.55 Twenty-five people each choose two distinct numbers at random from
the numbers 1, 2, . . . , 25, independently of each other. What is the
probability that at least two people will choose the same two numbers?
1.56 On Wednesday, June 21, 1995, a remarkable thing occurred in the
German 6/49 lottery, in which six different numbers are drawn from
the numbers 1, . . . , 49. On the day in question, the mid-week drawing
produced this six-number result: 15-25-27-30-42-48. These were the
same numbers as had been drawn previously on Saturday, December 20,
1986, and it was the first time in the 3,016 drawings of the German lottery
until June 21, 1995 that the same sequence had been drawn twice. Is this
an incredible occurrence, given that in the German lottery there are nearly
14 million possible combinations of the six numbers in question?
1.57 Canadian lottery officials had no knowledge of the birthday problem and
its treacherous variants when they put this idea into play. They purchased
500 Oldsmobile cars from nonclaimed prize monies, to be raffled off as
bonus prizes among their 2.4 million registered subscribers. A computer
chose the winners by selecting 500 subscriber numbers from a pool
of 2.4 million registered numbers without regard for whether or not a
given number had already appeared. The unsorted list of the 500 winning
numbers was published and to the astonishment of lottery officials, one
subscriber put in a claim for two cars. How surprising is it that this
happened?
1.58 A bowl contains 10 red and 10 blue balls. Three times you pick at random
two balls out of the bowl without replacement. What is the probability
that at least one pick has two balls of the same color?
1.59 In the Massachusetts Numbers Game, a four-digit number is drawn from
the numbers 0000, 0001, . . . , 9999 every evening (except Sundays). Let’s
assume that the same lottery takes place in ten other states each evening.
(a) What is the probability that the same number will be drawn in two or
more states next Tuesday evening?
(b) What is the probability that on some evening in the coming 300 draw-
ings, the same number will be drawn in two or more states?
1.60 The playlist of your iPhone has 15 songs by each of 30 artists. After each
song the playlist is shuffled and the next song is selected at random from
the 450 songs. What is the probability that you will hear the same artist
more than once in any 10-song block?
1.61 A producer of a new brand of breakfast cereal offers a baseball card in
each cereal box purchased. Ten different baseball cards are distributed
equally among the cereal boxes. What is the probability of getting the cards
of your two favorite teams when you buy five boxes?
1.62 What is the probability of getting at least one ace in a poker hand of five
cards dealt from an ordinary deck of 52 cards? What is the probability of
getting five cards of the same suit in a poker hand of five cards?
1.63 A fair die is rolled six times. What is the probability that each of the
numbers 5 and 6 appears at least once?
1.64 In the casino game of Chuck-a-Luck, three dice are contained within
an hourglass-shaped, rotating cage. You bet on one of the six numbers
1, . . . , 6 and the cage is rotated. You lose money only if your number
does not come up on any of the three dice. What is the probability that
your number will come up?
1.65 What is the probability that in a class of 23 children exactly two children
have birthdays on the same day? Hint: Number the $\binom{23}{2} = 253$ possible
combinations of two children as $1, \ldots, 253$ and define $A_i$ as the event that
the two children from combination i are the only two children having the
same birthday.
1.66 You roll a fair die six times in a row. What is the probability that all of
the six face values will appear? What is the probability that one or more
sixes will appear? What is the probability that the largest number rolled
is r?
1.67 For the upcoming drawing of the Bingo Lottery, five extra prizes have
been added to the pot. Each prize consists of an all-expenses-paid
vacation trip. Each prize winner may choose from among three possible
destinations A, B, and C. The three destinations are equally popular.
The prize winners choose their destinations independently of each other.
Calculate the probability that at least one of the destinations A and B
will be chosen. Also, calculate the probability that not all of the three
destinations will be chosen.
1.68 In the 6/42 lottery, six distinct numbers are picked at random from
the numbers 1, . . . , 42. What is the probability that at least two of the
numbers 7, 14, 21, 28, 35, and 42 will be picked?
1.69 You draw at random five cards from a standard deck of 52 cards. What is
the probability of having an ace among the five cards together with either
a king or a queen or both?
1.70 John and Paul play the following game. They each roll one fair die. John
wins the game if his score is larger than Paul’s score or if the product of
the two scores is an odd number. Is this a fair game?
1.71 From an ordinary deck of 52 cards, cards are randomly drawn one by one
and without replacement. What is the probability that an ace will appear
before any of the cards 2 through 10? What is the probability that two
aces will appear before any of the cards 2 through 10?
Hint: Define $A_{n,k}$ as the event that the rth appearance of outcome $O_1$ occurs at the nth trial and that outcome $O_2$ appears exactly k times in the first n − 1 trials. Use the identity $\sum_{l=0}^{\infty} (l+1)\cdots(l+m)\,x^l = m!/(1-x)^{m+1}$.
6 A permutation cycle is a subset of a permutation whose elements trade places with one another.
For example, take the permutation (2, 4, 3, 1) of (1, 2, 3, 4), then the subset {1, 4, 2} is a cycle of
length 3 with 1 → 4 → 2 → 1 and the subset {3} is a cycle of length 1 with 3 → 3.
prisoner either finds his own name or has inspected 50 boxes. What is the
probability that the group will not be exiled? Hint: Consider the contents
of the boxes as a random permutation of the integers 1, 2, . . . , 2n with
n = 50 and verify that the probability of having no cycle of length n + 1
or more is equal to $1 - \sum_{k=n+1}^{2n} 1/k$.
1.75 In a television show the remaining couple is shown three closed doors
behind which are hidden a new car, the car key, and a goat. One member
of the couple gets the task to find the car and the other member gets the
task to find the car key. The couple wins the car only if both partners
succeed in their respective tasks. Each member of the couple is allowed
to open two doors, where the second person cannot see what is behind
the doors opened by the first person. The couple may, however, agree on
a strategy beforehand. What is the best strategy?
1.76 Two-up is a traditional Australian gambling game. A spinner repeatedly
throws two pennies up in the air. “Heads” appear if both pennies land
with the head side facing up, “tails” appear if both pennies land with
the tails side facing up, and “odds” appear if one penny lands with the
head side up and the other lands with the tails side up. Consider the
following game variant. The spinner wins if three “heads” appear before
any appearance of “tails” and also before any intervening sequence of five
consecutive “odds.” A win pays 8.5 for 1. What is the win probability of
the spinner? Is the game favorable for the bettor?
$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \sum_{k=0}^{n} \frac{(-1)^k}{k!}.$$
A surprising conclusion can be drawn from this result. A basic result from calculus is that $\sum_{k=0}^{\infty} (-1)^k/k! = e^{-1}$ with e = 2.71828 . . . , see Appendix C. Thus, for sufficiently large n, the probability that at least one person will receive his or her own coat is approximately equal to $1 - e^{-1} = 0.63212\ldots$, regardless of how large n is. This approximation is accurate to seven or more decimal places for n ≥ 10.
Having obtained the probability of at least one person getting the correct coat, it is not difficult to argue that
$$P(\text{exactly } j \text{ persons will receive their own coats}) = \frac{1}{j!} \sum_{k=0}^{n-j} \frac{(-1)^k}{k!},$$
using the fact that the number $N_m$ of permutations of m elements that leave no element in its original position satisfies
$$\frac{N_m}{m!} = \sum_{k=0}^{m} \frac{(-1)^k}{k!}.$$
Problems
1.77 A certain person is taking part in a blind taste test of ten different wines.
The person has been made aware of the names of the ten wine producers
beforehand, but does not know what order the wines will be served in.
Each wine producer may be named only once. After the tasting session
is over, it turns out that he has correctly identified five of the ten wines.
Do you think he is a connoisseur?
1.78 You belong to a class of 15 students. The teacher hands 15 graded
assignments back to the students in random order. What is the probability
that you are the only person who receives his own paper?
1.79 Five friends enter a restaurant on a rainy evening and check their coats
and umbrellas. As they are leaving, the lazy coat-checker returns both the
coats and the umbrellas in random order. What is the probability that at
least one person gets both the correct coat and the correct umbrella?
1.80 A blind taste test of ten different wines takes place. The participants are
told the names of the ten wines beforehand. The wines are presented in
random order and a participant may name each wine only once. There are
three Italian wines among the ten wines. A person who has no knowledge
about wine participates in the test. What is the probability that none of
the three Italian wines is correctly guessed by this person?
1.81 Three people each choose five distinct numbers at random from the num-
bers 1, 2, . . . , 25, independently of each other. What is the probability
that there is some number that is chosen by each of the three people?
1.82 What is the probability that a hand of 13 cards from an ordinary deck of
52 cards contains four cards of the same rank, such as four queens?
1.83 What is the probability that in a 13-card bridge hand at least one suit will
be missing? What is the probability that in a 5-card poker hand at least
one suit will be missing?
1.84 Suppose 8 Germans and 4 Italians are randomly assigned to 6 double
rooms. What is the probability that no room will have two people of
different nationalities?
1.85 Consider a communication network with four nodes n1 , n2 , n3 , and n4 ,
and six directed links l1 = (n1 , n2 ), l2 = (n1 , n3 ), l3 = (n2 , n3 ), l4 =
(n3 , n2 ), l5 = (n2 , n4 ), and l6 = (n3 , n4 ). A message has to be sent from
the source node n1 to the destination node n4 . The network is unreliable.
The probability that the link li is functioning is pi for i = 1, . . . , 6. The
links behave physically independently of each other. A path from node n1
to node n4 is only functioning if each of its links is functioning. What is
the probability that there is some functioning path from node n1 to node
n4 ? What is this probability when pi = p for all i?
1.86 What is the probability that in 30 lottery drawings of six distinct numbers
from the numbers 1 to 45, not all of the 45 numbers will come up?
1.87 The coupon-collector problem is as follows. Each time you buy a certain
product (chewing gum, for example) you receive a coupon (a baseball
card, for example), which is equally likely to be any one of n types.
2.1 Concept of Conditional Probability
$$f_n(A \mid B) = \frac{f_n(AB)}{f_n(B)}.$$
This relationship accounts for the definition of the conditional probability
P(A | B). The relative frequency interpretation also tells us how a conditional
probability must be estimated in a simulation study.
Example 2.1 Someone has rolled two dice out of your sight. You ask this
person to answer “yes or no” to the question of whether there is a six among
the two rolls. He truthfully answers “yes.” What is the probability that two
sixes have been rolled?
Solution. Let A be the event that two sixes show up and B the event that
at least one six shows up. It is helpful to imagine that one of the dice is
red and the other is blue. The sample space of this experiment is the set
{(i, j) | i, j = 1, . . . , 6}, where i and j denote the outcomes of the red and the blue die. A probability of 1/36 is assigned to each element of the sample space. The event A is given by the set A = {(6, 6)} and the event B by the
set B = {(1, 6), . . . , (5, 6), (6, 6), (6, 5), . . . , (6, 1)}. The probability of event B
occurring is 11/36 and the probability of both event A and event B occurring is 1/36. Given that you know that event B has occurred, the probability that event A has also occurred is P(A | B). Applying the definition of conditional probability gives
$$P(A \mid B) = \frac{P(AB)}{P(B)} = \frac{1/36}{11/36} = \frac{1}{11}.$$
Thus, the sought probability is 1/11 (not 1/6).
The above example illustrates once again how careful you have to be when
you are interpreting the information a problem is conveying. The wording of
the problem is crucial: you know that one of the dice turned up a six but you do
not know which one. In the case where one of the dice had dropped on the floor
and you had seen the outcome six for that die, the probability of the other die turning up a six would have been 1/6. An intuitive explanation of the difference
between the two probabilities is the fact that in the second scenario you have
the additional information of which one of the two dice (the red or the blue
die) shows a six.
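The relative-frequency interpretation mentioned earlier can be checked directly by simulation. The following sketch (illustrative Python, not part of the original text; the trial count is arbitrary) estimates P(A | B) for Example 2.1 as a ratio of relative frequencies and should return a value close to 1/11 rather than 1/6.

```python
import random

# Illustrative sketch (not from the text): estimate P(A | B) of Example 2.1 as the
# ratio of relative frequencies f_n(AB)/f_n(B). The estimate should approach 1/11.
n_trials = 1_000_000
count_B = 0    # trials with at least one six
count_AB = 0   # trials with two sixes

for _ in range(n_trials):
    red, blue = random.randint(1, 6), random.randint(1, 6)
    if red == 6 or blue == 6:
        count_B += 1
        if red == 6 and blue == 6:
            count_AB += 1

print("estimate of P(A | B):", count_AB / count_B)   # close to 1/11, not 1/6
```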
Example 2.2 John, Pedro, and Rosita are experienced darts players. The probability of John hitting the bullseye in a single throw is 1/3. This hitting probability is 1/5 for Pedro and 1/4 for Rosita. The three players each throw one dart simultaneously. Two of the darts hit the bullseye and one of the darts misses the bullseye. What is the probability that John is the one who missed?
Solution. The sample space of the chance experiment consists of the eight
elements (H, H, H), (H, H, M), (H, M, H), (H, M, M), (M, H, H), (M, H, M),
(M, M, H), (M, M, M), where H stands for “hit” and M stands for “miss.” The
first component of each element of the sample space refers to John’s throw, the
second component refers to Pedro’s throw, and the third component refers to
Rosita’s throw. By the independence of the outcomes of the individual throws,
we assign the probability 1/3 × 1/5 × 1/4 = 1/60 to the outcome (H, H, H), the probability 1/3 × 1/5 × 3/4 = 3/60 to the outcome (H, H, M), the probability 1/3 × 4/5 × 1/4 = 4/60 to the outcome (H, M, H), and so on, up to the probability 2/3 × 4/5 × 3/4 = 24/60 to the outcome (M, M, M). Let A be the event that John misses
and let B be the event that exactly two of the darts hit the target. The event
AB occurs if the outcome (M, H, H) occurs and the event B occurs if one of
the outcomes (H, H, M), (H, M, H), (M, H, H) occurs, and so P(AB) = 2/60 and P(B) = 3/60 + 4/60 + 2/60 = 9/60. Thus, the conditional probability that John is the one who missed is P(A | B) = (2/60)/(9/60) = 2/9.
Problems
2.1 Two fair dice are rolled. What is the conditional probability that the sum
of the two dice is 8, given that the two dice show different numbers?
2.2 You toss a nickel, a dime, and a quarter, independently of each other.
What is the conditional probability that the quarter shows up heads given
that the coins showing up heads represent an amount of at least 15 cents?
2.3 Every evening, two weather stations issue a weather forecast for the next
day. The weather forecasts of the two stations are independent of each
other. On average, the weather forecast of station 1 is correct in 90% of
cases, irrespective of the weather type. This percentage is 80% for station
2. On a given day, station 1 predicts sunny weather for the next day,
whereas station 2 predicts rain. What is the probability that the weather
forecast of station 1 will be correct?
2.4 A professor has given two subsequent tests to her students. 50% of the
students passed both tests and 80% of the students passed the first test.
What percentage of those who passed the first test also passed the second
test?
2.5 In a certain village, 30% of households have a cat and 25% of households
have a dog. Of the households with a cat, 20% also have a dog. What
percentage of households with a dog also have a cat?
2.6 The experiment is to toss a fair coin once and to roll a fair die once. Let A
be the event that the coin lands “heads” and B the event that the die lands
“six.” Given that at least one of the two events has occurred, what is the
probability that both events have occurred and what is the probability that
event A has occurred?
2.7 You simultaneously grab two balls at random from an urn containing
two red balls, one blue ball, and one green ball. What is the probability
that you have grabbed two non-red balls given that you have grabbed at
least one non-red ball? What is the probability that you have grabbed two
non-red balls given that you have grabbed the green ball? Can you give
an intuitive explanation of why the second probability is larger than the
first one?
2.8 The following game is played in a particular carnival tent. The carnival
master has two covered beakers, each containing one die. He shakes the
beakers thoroughly, removes the lids, and peers inside. You have agreed
that whenever at least one of the two dice shows an even number of
points, you will bet with even odds that the other die will also show an
even number of points. Is this a fair bet?
2.9 Suppose a bridge player’s hand of 13 cards contains an ace. What is the
probability that the player has at least one more ace? What is the answer
to this question if you know that the player had the ace of hearts? Can
you explain why the second probability is larger than the first one?
2.10 A hand of 13 cards is dealt from a standard deck of 52 cards. What
is the probability that it contains more aces than tens? How does this
probability change when you have the information that the hand contains
at least one ace?
2.11 A fair die is rolled three times. What is the probability that each number
rolled is higher than all those that were rolled earlier? Hint: Condition on
the event that each roll gives a different outcome.
2.12 Let A and B be two events with 0 < P(A) < 1 and 0 < P(B) < 1.
(a) Suppose that P(A | B) > P(B | A). Verify that P(A) > P(B).
(b) Suppose that P(A | B) > P(A). Verify the inequalities P(B | A) >
P(B) and P(Bc | A) ≤ P(Bc ), where Bc is the complement of event B.
(c) What is P(A | B) if A and B are disjoint? What is P(A | B) if B is a
subset of A?
In words, the probability that events A and B both occur is equal to the
probability that event A occurs multiplied by the probability that event B occurs
given that A has occurred. This phrasing lines up naturally with the intuitive
way people think about probabilities.
The formula P(AB) = P(A)P(B | A) is often referred to as the multiplication
rule for probabilities. This rule is a very useful tool for assigning or calculating
probabilities. In many cases, the rule is used in attributing probabilities to
elements of the sample space. In illustration of this, consider the experiment
in which two marbles are randomly chosen without replacement from a
receptacle holding seven red and three white marbles. One possible choice
for the sample space of this experiment is the set consisting of four elements
(R, R), (R, W), (W, W), and (W, R), where R stands for red and W for white.
The first component of each element indicates the color of the first marble
chosen and the second component the color of the second marble chosen. On
grounds of the reasoning that P(1st marble is red) = 7/10 and P(2nd marble is white | 1st marble is red) = 3/9, we attribute the probability P(R, W) = 7/10 × 3/9 = 7/30 to the element (R, W). In the same way we attribute the probabilities P(R, R) = 7/10 × 6/9 = 7/15, P(W, W) = 3/10 × 2/9 = 1/15, and P(W, R) = 3/10 × 7/9 = 7/30.
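As a quick sanity check, one can verify with exact fractions that the four probabilities attributed by the multiplication rule add up to 1. The sketch below is illustrative (the variable names are not from the text).

```python
from fractions import Fraction as F

# Illustrative sketch (not from the text): the multiplication-rule assignment for
# 7 red and 3 white marbles; the four probabilities must add up to 1.
p_RR = F(7, 10) * F(6, 9)
p_RW = F(7, 10) * F(3, 9)
p_WW = F(3, 10) * F(2, 9)
p_WR = F(3, 10) * F(7, 9)

print(p_RR, p_RW, p_WW, p_WR)       # 7/15, 7/30, 1/15, 7/30
print(p_RR + p_RW + p_WW + p_WR)    # 1
```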
Solution. The following probability model can be used to answer the question.
A bowl contains four red and four blue balls. Four times you pick at random
two balls from the bowl without replacement. What is the probability that you
pick each time a red and a blue ball? Defining Ai as the event that the ith pick
gives a red and a blue ball, this probability is P(A1 A2 A3 A4). We have P(A1) = (8 × 4)/(8 × 7) = 4/7 (or, alternatively, P(A1) = $\binom{4}{1}\binom{4}{1}\big/\binom{8}{2} = 4/7$), P(A2 | A1) = (6 × 3)/(6 × 5) = 3/5, P(A3 | A1 A2) = (4 × 2)/(4 × 3) = 2/3, and P(A4 | A1 A2 A3) = 1. Hence, by the chain rule,
$$P(A_1 A_2 A_3 A_4) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 A_2)P(A_4 \mid A_1 A_2 A_3) = \frac{4}{7} \times \frac{3}{5} \times \frac{2}{3} \times 1 = \frac{8}{35}.$$
In other words, the probability that the four British teams will avoid each other in the draw is 8/35.
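A simulation along the following lines can be used to check the answer 8/35 ≈ 0.2286 for a random pairing of the eight quarterfinalists (an illustrative sketch; the team labels are invented).

```python
import random

# Illustrative sketch (team labels invented): estimate the probability that the
# four British teams avoid each other in a random pairing of eight teams.
def british_avoid_each_other():
    teams = ["B1", "B2", "B3", "B4", "O1", "O2", "O3", "O4"]   # 4 British, 4 others
    random.shuffle(teams)
    pairs = [(teams[2 * i], teams[2 * i + 1]) for i in range(4)]
    return all(not (a.startswith("B") and b.startswith("B")) for a, b in pairs)

n = 200_000
print(sum(british_avoid_each_other() for _ in range(n)) / n)   # close to 8/35 = 0.2286
```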
Example 2.4 Two players A and B take turns rolling a fair die. Player A rolls
the die first and wins on a “six.” If A fails in rolling a “six,” then player B rolls
the die and wins on a “five” or “six.” If B fails, then A rolls again and wins on
a “four,” “five,” or “six,” and so on. What is the probability that player A will
be the winner?
Solution. Let Ei be the event that player A wins on the ith turn for i = 1, 3,
and 5. The events E1 , E3 , and E5 are mutually exclusive and so P(A wins) =
P(E1 ) + P(E3 ) + P(E5 ). Using the chain rule for conditional probabilities,
$$P(E_1) = \frac{1}{6}, \quad P(E_3) = \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6}, \quad\text{and}\quad P(E_5) = \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6}.$$
This gives
$$P(A \text{ wins}) = \frac{1}{6} + \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} + \frac{5}{6} \times \frac{4}{6} \times \frac{3}{6} \times \frac{2}{6} \times \frac{5}{6} = \frac{169}{324}.$$
The game is not fair. Player A has a slight edge over player B.
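The three terms can be added with exact fractions, for instance as in this small sketch (illustrative, not from the text).

```python
from fractions import Fraction as F

# Illustrative sketch (not from the text): exact evaluation of the three terms.
p_e1 = F(1, 6)
p_e3 = F(5, 6) * F(4, 6) * F(3, 6)
p_e5 = F(5, 6) * F(4, 6) * F(3, 6) * F(2, 6) * F(5, 6)
print(p_e1 + p_e3 + p_e5, float(p_e1 + p_e3 + p_e5))   # 169/324, about 0.5216
```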
Example 2.5 A coin-tossing game is played with 10 players including the two
friends John and Pete. Each player has the same starting capital of d dollars.
There are nine rounds of the game. In each round, two persons play against
each other. A fair coin is tossed and, depending on the outcome of the toss, one
player pays one dollar to the other player. The two players continue tossing the
coin until one of them has won all the other player’s money. The winner of the
first round has doubled his starting capital and next plays against another player
until one has won all the other’s money, and so on. Suppose that John is in the
first game and Pete plays the last game against the survivor of the first nine
players. Who has the best chance of being the ultimate winner – John or Pete?
Solution. To solve this problem, consider first a game in which the player is
equally likely to either win or lose one dollar on each gamble. If the player
starts with a dollars and plays the game repeatedly until he either goes broke
or increases his bankroll to a + b dollars, then the probability of reaching his
a
desired goal and thus not going broke is a+b . This intuitively obvious result
is a special case of the gambler’s ruin formula that will be proved in the next
section. Next we turn to the stated problem. The answer to this problem is that
1
each of the 10 players has the same probability 10 of being the ultimate winner.
This answer might surprise you. Let us first argue that the probability of John
1
being the ultimate winner is 10 . Define Ai as the event that John enters the ith
round and wins that round. The gambler's ruin formula gives P(A1) = d/(d + d) = 1/2, P(A2 | A1) = 2d/(2d + d) = 2/3, P(A3 | A1, A2) = 3d/(3d + d) = 3/4, and so on. Hence, by the chain rule,
$$P(A_1 A_2 \cdots A_9) = \frac{1}{2} \times \frac{2}{3} \times \frac{3}{4} \times \cdots \times \frac{9}{10} = \frac{1}{10},$$
verifying that John will be the ultimate winner with probability 1/10. By the
same argument, the player meeting the winner of the first round in the second
round will be the ultimate winner with probability 1/3 × 3/4 × ⋯ × 9/10 = 1/10. Following this reasoning through, we find the same win probability of 1/10 for
each of the 10 participants of the game.
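The telescoping chain-rule products can be evaluated for every entry round at once. The sketch below is illustrative (not from the text) and assumes the gambler's ruin result quoted above for each round; it returns 1/10 for the player entering in each of the nine rounds, covering all ten participants.

```python
from fractions import Fraction as F
from math import prod

# Illustrative sketch (not from the text): the telescoping product for the player
# who enters the game in round k (k = 1 for John, k = 9 for Pete).
def win_probability(k):
    first = F(1, k + 1)   # win the entering round against a combined bankroll of k*d
    return first * prod(F(i, i + 1) for i in range(k + 1, 10))

print([win_probability(k) for k in range(1, 10)])   # every value equals 1/10
```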
Problems
2.17 You and your friend have lunch with three business partners. After lunch
each one puts a business card in a bowl. Two of the five cards are
randomly drawn from the bowl to determine who pays for the lunch.
What is the probability that you and your friend do not have to pay?
2.18 You are dealt a hand of four cards from a well-shuffled deck of 52 cards.
What is the probability that you receive the four cards J, Q, K, A in any
order, with suit irrelevant? Hint: Let Ai be the event that the ith card you
receive is a picture card that you have not received before.
2.19 A fair die is rolled six times. What is the probability of one or more sixes and no ones occurring?
2.20 There are two Spanish teams and two German teams among the eight
teams that have reached the quarterfinals of the soccer Champions
League. What is the probability that none of the Spanish teams will be
paired with a German team when the teams are paired randomly?
2.21 A jar contains three white and two black balls. Each time, you pick at
random one ball from the jar. If it is a white ball, a black ball is inserted
instead; otherwise, a white ball is inserted instead. You continue until all
balls in the jar are black. What is the probability that you need three picks
to achieve this? What is the probability that you need five picks?
2.22 Use conditional probabilities to solve Problems 1.6, 1.9–1.11, and 1.23.
2.23 You have received a tip that the management of a theater will give a
free ticket to the first person in line having the same birthday as someone
before him or her in line. Assuming that people enter the line one at a time
and you can join the line any time, what position in the line maximizes
your probability of getting the free ticket?
2.24 You travel from Amsterdam to Sydney with a change of airplane in Dubai and Singapore. You have one piece of luggage. At each stop your luggage is transferred from one airplane to the other. At the airport in Amsterdam there is a probability of 5% that your luggage is not placed in the right plane. This probability is 3% at the airport in Dubai and 2% at the airport in Singapore. What is the probability that your luggage does not reach Sydney with you? If your luggage does not reach Sydney with you, what is the probability that it was lost at Dubai airport?
2.25 Seven individuals have reserved tickets at the opera. The seats they have
been assigned are all in the same row of seven seats. The row of seats
is accessible from either end. Assume that the seven individuals arrive
in a random order, one by one. What is the probability of all seven
individuals taking their seats without having to squeeze past an already
seated individual? Hint: Consider the reverse process of people leaving
in random order.
2.26 In a poker game with three players A, B, and C, the dealer is chosen by
the following procedure. In the order A, B, C, A, B, C, . . ., a card from a
well-shuffled deck is dealt to each player until someone gets an ace. This
first player receiving an ace gets to start the game as dealer. Do you think
that everyone has an equal chance of becoming dealer?
2.27 Suppose there are n women–men couples at a party. What is the
probability that there are at least two couples such that the two women
have the same birthday and the two men have the same birthday? Hint:
Let Ai be the event that there is no match between the ith couple and any
of the couples 1, . . . , i − 1.
2.28 What is the probability that, in the next m draws of the 5/39 lottery,
there will be three consecutive draws with the feature that the same five
numbers appear in two or more of these three draws? In each draw five
distinct numbers are drawn at random from 1, 2, . . . , 39. Hint: Define Ai
as the event that each of the three draws i, i + 1, and i + 2 has a different
set of five numbers.
2.29 A bowl contains r red and b blue balls. The balls are removed from the
bowl, one at a time and at random. What is the probability that the first
red ball is removed on the kth pick? Can you tell without doing any
calculations what the probability is that there are still blue balls in the
bowl when the last red ball is picked?
2.30 In a group of 25 people, a person tells a rumor to a second person,
who in turn tells it to a third person, and so on. Each person tells the
rumor to just one of the people chosen at random, excluding the person
from whom he/she heard the rumor. The rumor is told 10 times. What
is the probability that the rumor will not be repeated to any one person
once more? What is the probability that the rumor will not return to the
originator?
P(AB) = P(A)P(B).
The reader should be aware that independent events and disjoint events are
completely different things. If events A and B are disjoint, you calculate the
probability of the union A ∪ B by adding the probabilities of A and B. For
independent events A and B, you calculate the probability of the intersection
AB by multiplying the probabilities of A and B. Beginning students sometimes
think that independent events are disjoint. This is not true. If two events A and
B are disjoint, then A ∩ B = ∅ and so P(AB) = 0, while P(AB) = P(A)P(B) for
independent events A and B. This shows that independent events with nonzero
probability are never disjoint.
Example 2.6 Suppose that a red and a blue die are thrown. Let A be the event
that the number shown by the red die is 4, B be the event that the sum of the
dice is 7, and C be the event that the sum of the dice is 8. Show that the events
A and B are independent, whereas the events A and C are dependent. Can you
explain this result intuitively?
Solution. The experiment has 36 possible outcomes (i, j), where i is the
number shown by the red die and j the number shown by the blue die. All
36 possible outcomes are equally likely. Simply, by counting, P(A) = 6/36, P(B) = 6/36, and P(AB) = 1/36. Since P(AB) = P(A)P(B), the events A and B are independent.
Problems
2.31 Suppose that a red and a blue die are thrown. Let A be the event that the
number shown by the red die is even, and B the event that the sum of the
dice is odd. Do you think the events A and B are independent?
2.32 Fifty different numbers are arranged in a matrix with 5 rows and
10 columns. You pick at random one number from the matrix. Let A
be the event that the number comes from an odd-numbered row and B
be the event that the number comes from the first five columns. Are the
events A and B independent?
2.33 Let A and B be independent events. Denote by the events Ac and Bc the
complements of the events A and B. Verify that the events A and Bc are
independent. Conclude from this result that the events Ac and Bc are also
independent.
2.34 Let A and B be two events such that P(A | B) = P(A | Bc ) and 0 <
P(B) < 1. Prove that A and B are independent events.
2.35 Suppose A1 , . . . , An are independent events. Prove that the probability that at least one of the events A1 , . . . , An occurs is $1 - (1 - P(A_1)) \cdots (1 - P(A_n))$.
2.36 Suppose that the events A, B, and C are independent, where P(A) = 1/2, P(B) = 1/3, and P(C) = 1/4. What is P(A ∪ B ∪ C)?
2.37 Suppose that A1 , A2 , . . . is a sequence of independent events with the property that $\sum_{n=1}^{\infty} P(A_n) = \infty$. Define A as the event that An occurs for infinitely many values of n. Prove that P(A) = 1. This result is known as the second Borel–Cantelli lemma. Hint: Argue that $P(A) = \lim_{n \to \infty} P\big(\bigcup_{k=n}^{\infty} A_k\big)$ and use the inequality $1 - x \le e^{-x}$ for $0 \le x \le 1$.
Rule 2.2 Let A be an event that can only occur if one of the mutually exclusive
events B1 , . . . , Bn occurs. Then,
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i).$$
To see this, write A = AB1 ∪ · · · ∪ ABn , where the set ABi is the intersection of the sets A and Bi . The assumption that the sets B1 , . . . , Bn are pairwise disjoint implies that the sets AB1 , . . . , ABn are also pairwise disjoint. By Rule 1.1 in Chapter 1, we then have
$$P(A) = P(AB_1) + \cdots + P(AB_n).$$
This relationship and the definition P(A | B) = P(AB)/P(B) lead to the law of
conditional probability. Obviously, this law is also applicable when the sample
space is divided by a countably infinite number of disjoint subsets B1 , B2 , . . .
instead of by a finite number.
It is often the case that the unconditional probability P(A) of an event A is
found most easily by conditioning on appropriately chosen events B1 , . . . , Bn
such that the Bi are mutually exclusive and event A can only occur when
one of the events Bi occurs. This is one of the most useful problem-solving
strategies in probability. A nice illustration of the law of conditional probability
is provided by the next example.
Example 2.7 In a television game show, you can win a small prize, a medium
prize, and a large prize. The large prize is a sports car. The three prizes are
“locked up” in different boxes. There are five keys randomly arranged in front
of you. Three of the keys can open only one lock and each of these keys fits
a different box. Another key is a dud that does not open any of the locks. The
final key is the “master key” that opens all three locks. You have a chance to
choose up to two keys. For that purpose, you are asked two quiz questions.
For each correct answer, you can select one key. The probability of correctly
answering any given quiz question is 0.5. Any key you have gained is tried on
all three boxes. What is the probability of winning the sports car?
Solution. Let A be the event that the sports car is won. To find P(A), condition
on the first step of the experiment and let Bi be the event that you have correctly
answered i questions, i = 0, 1, 2. By the law of conditional probability, P(A) =
P(A | B0 )P(B0 ) + P(A | B1 )P(B1 ) + P(A | B2 )P(B2 ). It is immediate that
P(B0) = 1/4, P(B1) = 2/4, and P(B2) = 1/4. Obviously, P(A | B0) = 0. Since two of the five keys fit the box with the sports car, we have P(A | B1) = 2/5. To get P(A | B2), it is easier to compute the complementary probability. The probability that neither of two randomly chosen keys fits the box with the sports car can be calculated as 3/5 × 2/4 = 3/10 (or, alternatively, as $\binom{3}{2}/\binom{5}{2} = 3/10$). This gives P(A | B2) = 7/10. Thus, the probability of winning the sports car is
$$P(A) = 0 \times \frac{1}{4} + \frac{2}{5} \times \frac{2}{4} + \frac{7}{10} \times \frac{1}{4} = \frac{3}{8}.$$
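A Monte Carlo check of this answer might look as follows (an illustrative sketch, not from the text; the key labels are invented, with only the master key and the car key opening the box with the sports car).

```python
import random

# Illustrative sketch (key labels invented): Monte Carlo check of Example 2.7.
# Only the 'car' key and the 'master' key open the box with the sports car.
def wins_car():
    keys = ["car", "master", "other1", "other2", "dud"]
    n_correct = sum(random.random() < 0.5 for _ in range(2))   # two quiz questions
    chosen = random.sample(keys, n_correct)
    return any(k in ("car", "master") for k in chosen)

n = 200_000
print(sum(wins_car() for _ in range(n)) / n)   # close to 3/8 = 0.375
```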
All aspects of conditional probability are addressed in the next example.
Example 2.8 Two boxes are placed in front of you. One box contains nine
$1 bills and one $5 bill, while the other box contains two $1 bills and one
$100 bill. You choose at random one of the two boxes and then two bills are
randomly picked out of the chosen box. It appears that these two bills are $1
bills. Next you get the opportunity to pick a third bill out of one of the two
boxes. Should you stick to the chosen box or should you switch to the other
box when you want to maximize the probability of picking the $100 bill?
Solution. Let A be the event that you have chosen the box with the $100 bill
and B the event that you have chosen the other box. Also, let E be the event
that the two bills taken out from the chosen box are $1 bills. The probability
of picking the $100 bill as the third bill is P(A | E) × 1 if you stick to the box chosen and is P(B | E) × 1/3 if you switch to the other box. Since
P(B | E) = 1 − P(A | E), it suffices to find P(A | E). To do so, note that
$$P(E) = P(E \mid A) \times \frac{1}{2} + P(E \mid B) \times \frac{1}{2},$$
by the law of conditional probability. Obviously,
$$P(E \mid A) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3} \quad\text{and}\quad P(E \mid B) = \frac{9}{10} \times \frac{8}{9} = \frac{4}{5}.$$
This gives P(E) = (1/3) × (1/2) + (4/5) × (1/2) = 17/30, and so P(A | E) = (1/3 × 1/2)/(17/30) = 5/17. Hence, the probability of picking the $100 bill as the third bill is 5/17 ≈ 0.294 if you stick to the chosen box and (1 − 5/17) × 1/3 = 4/17 ≈ 0.235 if you switch to the other box. You had better stick to the chosen box. A
surprising finding!
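The conditional probabilities 5/17 and 4/17 can also be estimated by simulation, discarding the trials in which the event E does not occur (an illustrative sketch, not from the text).

```python
import random

# Illustrative sketch (not from the text): Monte Carlo check of Example 2.8.
# Trials in which the event E (two $1 bills drawn) does not occur are discarded.
def trial():
    box_a = [1, 1, 100]          # box with the $100 bill
    box_b = [1] * 9 + [5]        # box with the $5 bill
    chosen, other = (box_a, box_b) if random.random() < 0.5 else (box_b, box_a)
    drawn = random.sample(chosen, 2)
    if drawn != [1, 1]:
        return None
    remaining = chosen.copy()
    remaining.remove(1)
    remaining.remove(1)
    stick_win = random.choice(remaining) == 100
    switch_win = random.choice(other) == 100
    return stick_win, switch_win

results = [r for r in (trial() for _ in range(400_000)) if r is not None]
print(sum(s for s, _ in results) / len(results))   # close to 5/17 = 0.294 (stick)
print(sum(w for _, w in results) / len(results))   # close to 4/17 = 0.235 (switch)
```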
The next example deals with a famous problem known as the gambler’s
ruin problem. This problem can be seen as a random walk with two absorbing
barriers.
Example 2.9 John and Pete enter a coin-tossing game, where John starts with
a dollars and Pete with b dollars. They repeatedly flip a coin until one of
them has won all of the money. If the coin lands heads, John gets one dollar
from Pete; otherwise, John loses one dollar to Pete. The coin lands heads with
probability p and tails with probability q = 1 − p, where 0 < p < 1. What is
the probability that John will win all of the money?
Solution. Let P(a, b) denote the probability of John winning all of the money
when John starts with a dollars and Pete with b dollars. The gambler’s ruin
formula states that
$$P(a, b) = \begin{cases} \dfrac{1 - (q/p)^a}{1 - (q/p)^{a+b}} & \text{if } p \neq q, \\[1ex] \dfrac{a}{a + b} & \text{if } p = q. \end{cases}$$
To prove this formula, let A be the event that John wins all of the money. A
recurrence equation for P(A) = P(a, b) is obtained by conditioning on the
outcome of the first flip. Let E be the event that the first flip lands heads and let
F be the event that the first flip lands tails. By the law of conditional probability,
P(A) = P(A | E)P(E) + P(A | F)P(F).
If the first flip lands heads, you get the changed situation that John has a + 1
dollars and Pete has b − 1 dollars. This gives P(A | E) = P(a + 1, b − 1).
Similarly, P(A | F) = P(a − 1, b + 1). Since P(E) = p and P(F) = q, we
obtain the recurrence equation
P(a, b) = P(a + 1, b − 1) × p + P(a − 1, b + 1) × q.
Copying this equation with the starting capitals a and b replaced by i and a +
b − i gives a system of equations for Pi = P(i, a + b − i). These equations are
Pi = pPi+1 + qPi−1 for i = 1, . . . , a + b − 1
with the boundary condition P0 = 0. Since p + q = 1, the recursion can be rewritten as $P_{i+1} - P_i = (q/p)(P_i - P_{i-1})$. Iterating this relation and using P0 = 0 gives $P_k - P_{k-1} = (q/p)^{k-1} P_1$, and so
$$P_k = P_1 \sum_{i=0}^{k-1} \Big(\frac{q}{p}\Big)^i \quad\text{for } 1 \le k \le a + b.$$
Using the identity $(1 - a) \sum_{i=0}^{m} a^i = 1 - a^{m+1}$, it follows that
$$P_k = \frac{1 - (q/p)^k}{1 - q/p}\, P_1 \quad\text{if } p \neq q \qquad\text{and}\qquad P_k = k P_1 \quad\text{if } p = q = \tfrac{1}{2}.$$
The boundary condition $P_{a+b} = 1$ leads to $P_1 = (1 - q/p)\big/\big(1 - (q/p)^{a+b}\big)$ if p ≠ q and $P_1 = 1/(a + b)$ if p = q = 1/2. Hence,
$$P_k = \frac{1 - (q/p)^k}{1 - (q/p)^{a+b}} \quad\text{for } k = 1, \ldots, a + b$$
if p ≠ q and $P_k = k/(a + b)$ if p = q = 1/2, proving the gambler's ruin formula.
To illustrate the gambler’s ruin formula, suppose that you go to the casino
with $50, and it is your goal to multiply your capital to $250. You decide to
play European roulette and to stake each time the same amount of s dollars on
red for each spin of the wheel. You get back twice your stake with probability
18 19
37 and lose your stake with probability 37 . What is the probability of reaching
your goal for s = 10, 25, and 50? To find the probabilities, we take p = 18 37
and q = 1937 in the gambler’s ruin formula and calculate P(a, b) for a = 5 and
b = 20 when s = 10, for a = 2 and b = 8 when s = 25, and for a = 1
and b = 4 when s = 50. This results in the probabilities 0.1084, 0.1592, and
0.1790 for s = 10, 25, and 50. It is intuitively obvious that the probability gets
larger for higher stakes, since your bankroll is exposed to the house advantage
of the casino for a shorter time when you play more boldly.
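A small sketch of this calculation (illustrative Python; the function name is not from the text):

```python
# Illustrative sketch (function name not from the text): the gambler's ruin
# probabilities for the roulette illustration with p = 18/37.
def ruin_formula(a, b, p):
    q = 1.0 - p
    if abs(p - q) < 1e-12:
        return a / (a + b)
    r = q / p
    return (1.0 - r**a) / (1.0 - r**(a + b))

p = 18 / 37
for s, (a, b) in [(10, (5, 20)), (25, (2, 8)), (50, (1, 4))]:
    print(s, round(ruin_formula(a, b, p), 4))   # 0.1084, 0.1592, 0.1790
```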
The gambler’s ruin formula was derived by parameterizing the starting state
and then conditioning on what happens in the first step of the process. This
powerful approach can also be used to analyze success runs.
n/r 3 4 5 6 7 8 9 10
10 0.826 0.465 0.217 0.094 0.039 0.016 0.006 0.002
25 0.993 0.848 0.550 0.300 0.151 0.073 0.035 0.017
50 1.000 0.981 0.821 0.544 0.309 0.162 0.082 0.041
75 1.000 0.998 0.929 0.703 0.438 0.242 0.126 0.064
100 1.000 1.000 0.972 0.807 0.542 0.315 0.169 0.087
150 1.000 1.000 0.996 0.918 0.697 0.440 0.247 0.131
200 1.000 1.000 0.999 0.965 0.799 0.542 0.318 0.172
with a fair coin, you should expect the longest run of heads to be about seven
in a row and the longest run of either heads or tails to be about eight in a row.
A rule of thumb is that, in n tosses of a coin with probability p of heads, the probability mass of the longest run of heads is strongly concentrated around the integer nearest to $\log_{1/p}\big(n(1-p)\big)$, while the probability mass of the longest run of either heads or tails is strongly concentrated around the integer nearest to $\log_{1/p}\big(n(1-p)\big) + 1$, provided that n(1 − p) is sufficiently large.1 The
theoretical findings for runs in coin tossing put winning streaks in baseball into
perspective. In the 2015 baseball season the New York Mets won 11 games in
a row, which got a lot of attention. While this event is remarkable in isolation,
you can expect the event to happen at a certain moment. In the baseball season
each team plays 162 games. The probability is 23.06% that, among seven seasons of 162 games, there will be some season in which a particular team wins 11 or more games in a row, when this team wins any single game with probability 50%. The most likely outcome for the longest winning streak
of the team in a typical 162-game season is six games. An 11-game winning
streak is the most likely outcome in a season when the team has a winning
percentage of 70%.
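The rule of thumb quoted above can be evaluated directly; the sketch below (illustrative, not from the text) reproduces the values six and eleven for a 162-game season.

```python
from math import log

# Illustrative sketch (not from the text): the rule-of-thumb center of the
# longest-run distribution, applied to a 162-game season.
def most_likely_longest_run(n, p):
    return round(log(n * (1 - p), 1 / p))

print(most_likely_longest_run(162, 0.5))   # 6, for a team winning 50% of its games
print(most_likely_longest_run(162, 0.7))   # 11, for a team winning 70% of its games
```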
To conclude this section, we use conditional probabilities to solve a real-life
problem.
Example 2.11 Two candidates A and B remain in the finale of a television
game show. At this point, each candidate must spin a wheel of fortune. The
numbers 1, 2, . . . , 100 are listed on the wheel and when the wheel has stopped
spinning, a pointer stops randomly on one of the numbers. Each player has a
choice of spinning the wheel one or two times, whereby a second spin must
immediately follow the first. The goal is to reach a total closest to but not
exceeding 100 points. A player whose total exceeds 100 gets a final score of
zero. The winner is the player who gets the highest score. Should both players
have the same final score, then the player to spin the wheel first is the winner.
Player A has to spin first. What is the optimal strategy for player A?
Solution. Let us take a general setup and assume that the numbers 1, 2, . . . , R
are listed on the wheel. The optimal strategy of the second player B is obvious.
This player stops after the first spin only if the score is larger than the final score
of player A. To find the optimal strategy of player A, let S(a) be the conditional
win probability of player A if A stops after the first spin given that this spin
results in a score of a points, and let C(a) be the conditional win probability of
1 This result was obtained in M. F. Schilling, “The surprising predictability of long runs,”
Mathematics Magazine 85 (2012): 141–149.
player A if A continues after the first spin given that this spin results in a score
of a points. To find S(a), we condition on the outcome of the first spin of player
B. Given that player B has scored b points in the first spin with b ≤ a, then
player A is the final winner if player B scores in the second spin either at most
a − b points or more than R − b points. By the law of conditional probability,
$$S(a) = \sum_{b=1}^{a} \Big(\frac{a - b}{R} + \frac{b}{R}\Big) \times \frac{1}{R} = \frac{a^2}{R^2} \quad\text{for } 1 \le a \le R.$$
The expression for C(a) follows directly from S(a). By conditioning on the
outcome of the second spin of player A, we have
$$C(a) = \sum_{k=1}^{R-a} S(a + k) \times \frac{1}{R} = \sum_{k=1}^{R-a} \frac{(a + k)^2}{R^3} \quad\text{for } 1 \le a \le R.$$
Taking R = 100, numerical calculations give that C(a) > S(a) for 1 ≤ a ≤ 53
and S(a) > C(a) for 53 < a ≤ 100. Hence, player A stops after the first spin if
this spin gives a score larger than 53; otherwise, A continues. Then, by the law
of conditional probability, the probability of player A winning is
$$\sum_{a=1}^{53} C(a) \times \frac{1}{100} + \sum_{a=54}^{100} S(a) \times \frac{1}{100} = 0.4596.$$
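The numerical calculations behind the threshold 53 and the win probability 0.4596 can be reproduced with a few lines of code (an illustrative sketch, not the author's program).

```python
# Illustrative sketch (not the author's program): numerical check of Example 2.11.
R = 100
S = {a: a**2 / R**2 for a in range(1, R + 1)}
C = {a: sum((a + k)**2 for k in range(1, R - a + 1)) / R**3 for a in range(1, R + 1)}

threshold = max(a for a in range(1, R + 1) if C[a] > S[a])
win_prob = (sum(C[a] for a in range(1, threshold + 1)) +
            sum(S[a] for a in range(threshold + 1, R + 1))) / R
print(threshold, round(win_prob, 4))   # 53 and 0.4596
```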
Problems
2.38 You have two identical boxes in front of you. One of the boxes contains
10 balls numbered 1 to 10 and the other box contains 25 balls numbered
1 to 25. You choose at random one of the boxes and pick a ball at random
from the chosen box. What is the probability of picking the ball with
number 7 written on it?
2.39 A drunkard removes two randomly chosen letters of the message HAPPY
HOUR that is attached to a billboard in a pub. His drunk friend puts the
two letters back in a random order. What is the probability that HAPPY
HOUR appears again?
2.40 On the television show “Deal or No Deal,” you are faced with 26 brief-
cases in which various amounts of money have been placed, including
the amounts $1,000,000 and $750,000. You first choose one case. This
case is “yours” and is kept out of play until the very end of the game.
Then you play the game and in each round you open a number of the
other cases. What is the probability that the cases with $1,000,000 and
$750,000 will still be in the game when you are going to open 20 cases?
2.41 In the UK National Lottery, six different numbers are drawn from a
bucket with 59 balls numbered 1 to 59. To win the jackpot, you have
to match all six numbers drawn. The lottery has added a new prize. If
you match exactly two numbers, you win a lucky-dip ticket. The ticket
is used on the next draw. If you then match exactly two numbers on the
lucky-dip ticket, you get a new lucky-dip ticket, and so on. What is the
probability of ever winning the jackpot when you buy one ticket only
once?
2.42 Suppose that Joe arrives home on time with probability 0.8. If Joe
does not arrive home on time, the probability that his dinner is burnt
is 0.5; otherwise, his dinner is burnt with probability 0.15. What is the
probability that Joe’s dinner will be burnt on any given day? What is the
probability that Joe arrived home on time given that his dinner is burnt?
2.43 You owe $20,000 to a loan shark and you will run into deep trouble if the
loan shark is not repaid before midnight. You decide to go to a newly
opened casino having a special roulette game with the three roulette
bets: a 9-numbers bet, a 12-numbers bet, and an 18-numbers bet. For
a k-numbers bet, you get back 36/k times your stake with probability k/37
070469. At the other end of the ticket, another six-digit number is printed,
but this number is hidden by a layer of scratch-away silver paint. The
ticket holder scratches the paint away to reveal the underlying number. If
the number is the same as the number at the other end of the ticket, it is
a winning ticket. The two six-digit numbers on each of the one million
tickets printed each week are randomly generated in such a way that no
two tickets are printed with the same visible numbers or the same hidden
numbers. Assume that in a particular week only half of the tickets printed
are sold. What is the probability of exactly r winners in that week for r =
0, 1, . . .? Hint: Use the results for the matching problem in Example 1.19.
2.47 The points 1, 2, . . . , 12 are around a ring like a clock. A transition from a
point is to either of the two adjacent points with equal probabilities. What
is the probability of visiting all points before returning to the starting
point?
2.48 A thoroughly shuffled deck of cards has r red and b black cards.
The top card is removed from the deck and then a card is picked at
random from the deck. Use the law of conditional probability to find
the probability that the second card picked is red. Can you explain the
answer intuitively?
2.49 John and Pete each toss a fair coin, independently of each other. John
tosses the coin until two heads have been obtained and Pete tosses the
coin until three heads have been obtained. What is the probability that
John will need more tosses than Pete?
2.50 Twenty-five persons attended a “reverse raffle,” in which everyone
bought a number. Numbered balls were then drawn out of a bin, one
at a time and at random. The last ball in the bin would be the winner. But
when the organizers got down to the last ball, they discovered that three
numbered balls had been unintentionally overlooked. They added those
balls to the bin and continued the draw. Was the raffle still fair? Use the
law of conditional probability to answer this question.
2.51 Two players A and B enter a coin-tossing game, where player A starts
with a = 4 dollars and player B with b = 5 dollars. They repeatedly flip
a fair coin until one of them has won all of the money. If the coin lands
heads, then player A gets one dollar from player B; otherwise, player A
loses one dollar to player B. After the game has been played, you are
informed that player A is the winner. What is the probability distribution
of the lowest value of player A's bankroll during the game?
2.52 The upcoming Tour de France bicycle tournament will take place from
July 1 through July 23. One hundred eighty cyclists will participate in
the event. What is the probability that two or more participating cyclists
will have birthdays on the same day during the tournament? Hint: Let
Bi be the event that there are i cyclists having their birthday during the
tournament.
2.53 A drunkard leaves the pub at 22:30 hours. Twenty steps to the left of
the pub is a police station and ten steps to the right is his home. Every
30 seconds the drunkard makes a step either to the left or to the right. Any
step is with equal probabilities to the left or to the right, independently of
the other steps. The drunkard is thrown in a cell if he reaches the police
station. What is the probability that he is not locked up in a police cell
and reaches his home no later than midnight?
2.54 A tennis tournament is arranged for 8 players. It is organized as a
knockout tournament. First, the 8 players are randomly allocated over
four groups of two players each. In the semifinals, the winners of groups
1 and 2 meet each other, as do the winners of groups 3 and 4. In any match,
either player has a probability 0.5 of winning. John and Pete are among
the 8 players. What is the probability that they meet each other in the
semifinals? What is the probability that they meet each other in the final?
2.55 There are two bags in front of you. One bag has three white balls and one
red ball. The other has one white ball and three red balls. You choose one
of the bags at random and pick randomly one ball out of this bag. You
notice the ball picked is red. You then put it back and pick another ball
out of the same bag at random. What is the probability that the second
ball picked is red?
2.56 Dave and Eric alternately roll two dice. The first player who fails to
surpass the sum of the two dice in the previous roll loses. What is the
probability that Dave will win the game when Dave starts the game?
Hint: Parameterize and define ds as the probability that Dave will win the
game when the game begins with Dave rolling the dice and Dave has to
roll more than s points in his first roll.
2.57 Consider again Example 1.18. What is the probability that there will be
exactly j stops at which nobody gets off. More generally, what is the prob-
ability of exactly j empty bins when m balls are sequentially placed into
one of b bins with m ≥ b, where for each ball a bin is selected at random?
2.58 A jar contains five blue and five red balls. You roll a fair die. Next you
simultaneously draw at random as many balls from the jar as the score
of the die. What is the probability that each of the balls drawn is blue?
What is the inverse probability that the score of the die is r, given that
each of the balls drawn is blue?
2.59 Two fair dice are rolled twice. What is the probability that both rolls
show the same combination of two numbers?
2.60 In some basketball competition, 14 teams participate. The first six teams
in the final ranking are automatically placed for the playoffs. There are
no ties in the final ranking. The other eight teams participate in a lottery
for the remaining two spots in the playoffs. The team placed 7th in the
competition is given 8 tickets in the lottery, the team placed 8th is given
7 tickets, and so on, until the team placed 14th is given 1 ticket. A ticket
is picked at random and the first-place draft is awarded to the team
holding that ticket. That winner’s remaining tickets are removed and a
second ticket is selected at random for second place in the draft. What
is the probability that the team placed jth in the competition will win the
first place in the draft and what is the probability that this team will win
the second place in the draft for j = 7, . . . , 14?
2.61 Imagine a coin is tossed infinitely often. The coin lands on heads with
probability p and on tails with probability 1 − p. It is assumed that p ≤ 1/2.
Use the gambler’s ruin formula to explain that after the first toss, the num-
ber of heads will ever be equal to the number of tails with probability 2p.
2.62 A drunkard is wandering back and forth on a road. At each step he moves
two units distance to the north with a probability of 1/2, or one unit to the south with a probability of 1/2. Show that with probability $\frac{1}{2}(\sqrt{5} - 1)$ the
drunkard will ever visit the point which is one unit distance south from
his starting point. Can you explain why this probability also gives the
probability that the number of heads will ever exceed twice the number
of tails if a fair coin is tossed over and over?
2.63 Your friend has chosen at random a card from a standard deck of 52 cards,
but keeps this card concealed. You have to guess what card it is. Before
doing so, you can ask your friend either the question whether the chosen
card is red or the question whether the card is the ace of spades. Your
friend will answer truthfully. What question would you ask?
2.64 Player 1 tosses m + 1 times a fair coin and player 2 tosses m times a fair
coin. Player 1 wins the game if player 1 tosses more heads than player 2;
otherwise, player 2 wins. What is the probability that player 1 will win
the game?
2.65 You repeatedly roll two fair dice. What is the probability of two
consecutive totals of 7 appearing before a roll with double sixes?
2.66 Consider again Example 2.11. It is now assumed that lots are drawn to
determine the winner when there is a tie. Find the optimal strategy for
player A and the corresponding win probability.
2.67 A carnival booth offers the following game of chance. Under each of six
inverted cups is a colored ball, in some random order. The six balls are
colored red, blue, yellow, orange, green, and purple. You get six tokens.
All you have to do is guess the color of the ball under each of the cups,
where you handle one cup at a time. Every time you guess, you risk a
token. If your guess is wrong, you lose the token. Each time you guess
correctly, the ball is uncovered and you keep the token. Use conditional
probabilities to obtain the probability of guessing all six balls before
running out of tokens. Hint: Define p(i, t) as your success probability
once you have reached cup i with t tokens left and derive a recursion
equation for p(i, t).
2.68 A fair die is rolled repeatedly. Let pn be the probability that the sum
of scores will ever be n. Use the law of conditional probability to
find a recursion equation for pn . Verify numerically that pn tends to 1/3.5 = 0.2857 as n gets large. Can you give an intuitive explanation of this result?
2.69 Use the recursive approach given in Example 2.10 to solve the following
two longest-run problems.
(a) A sequence of independent trials is performed. Each trial results in
a success with probability p and in a failure with probability 1 − p.
What is the probability of getting a run of either at least r successes
or at least r failures, or both, in N trials?
(b) A bowl contains r red balls and b blue balls. Balls are randomly
picked out of the bowl, one by one and without replacement. What
is the probability that the longest run of red balls will be L or more?
2.70 Each of seven dwarfs has his own bed in a common dormitory. Every
night, they retire to bed one at a time, always in the same sequential
order. On a particular evening, the youngest dwarf, who always retires
first, has had too much to drink. He randomly chooses one of the seven
beds to fall asleep on. As each of the other dwarfs retires, he chooses
his own bed if it is not occupied, and otherwise randomly chooses
another unoccupied bed. What is the probability that the kth dwarf can
sleep in his own bed for k = 1, . . . , 7? Hint: Use a general setup and
define p(k, n) as the probability that the kth dwarf will not sleep in
his own bed for the situation of n dwarfs with the first dwarf being
drunk. Use a conditioning argument to obtain a recursion equation for
p(k, n).
2.71 (difficult) Consider the three-player variant of Example 2.11. What is
the optimal strategy for the first player A and what is the overall win
probability of player A? Next use the probability distribution function
of the final score of player A to find the overall win probabilities of the
players B and C.
2.4 Bayes' Rule in Odds Form
For example, an event A with P(A) = 2/3 has odds 2. It is said that the odds are 2 to 1 (written 2:1) in favor of the event A. Conversely, odds $o_A$ of an event A correspond to the probability
$$P(A) = \frac{o_A}{1 + o_A}.$$
The odds form of Bayes’ rule is stated in the next rule.
Rule 2.3 The posterior probability P(H | E) satisfies
$$\frac{P(H \mid E)}{P(\overline{H} \mid E)} = \frac{P(H)}{P(\overline{H})} \times \frac{P(E \mid H)}{P(E \mid \overline{H})}.$$
In words, Bayes' rule in odds form states that
$$\text{posterior odds} = \text{prior odds} \times \text{likelihood ratio}.$$
This insightful formula follows from first principles. Using the definitions
P(AB) = P(A)P(B | A) and P(BA) = P(B)P(A | B), we get the basic form of
the rule of Bayes:
$$P(A \mid B) = \frac{P(A) \times P(B \mid A)}{P(B)}.$$
This is one of the most beautiful formulas in probability theory. Next, Bayes’
rule in odds form follows by noting that
$$P(H \mid E) = \frac{P(H)\,P(E \mid H)}{P(E)} \quad\text{and}\quad P(\overline{H} \mid E) = \frac{P(\overline{H})\,P(E \mid \overline{H})}{P(E)}.$$
The factor $P(H)/P(\overline{H})$ represents the prior odds in favor of the hypothesis H before the evidence has been presented. The ratio of $P(E \mid H)$ and $P(E \mid \overline{H})$ is called
the likelihood ratio or the Bayes factor. Bayes’ rule updates the prior odds of
the hypothesis H by multiplying them by the likelihood ratio and thus measures
how much new evidence should alter a belief in a hypothesis. The likelihood
ratio is the probability of finding the evidence if the hypothesis is true divided
by the probability of finding the evidence if the hypothesis is not true. If the
likelihood ratio is greater than one then the evidence supports the hypothesis, if
the likelihood ratio is less than one it supports the negation of the hypothesis,
and if the likelihood ratio equals one then the evidence favors neither. The
likelihood ratio is a measure of the probative value of the evidence. In practical
situations such as in judicial decision making, the likelihood ratio is typically
determined by an expert. However, it is not the expert’s task to tell the court
what the prior odds are. The prior probability P(H) represents the personal
opinion of the court before the evidence is taken into account.
With two pieces of evidence E1 and E2 that are sequentially obtained, Bayes’
rule can be applied iteratively. You could use the first piece of evidence to
calculate initial posterior odds, and then use that posterior odds as new prior
odds to calculate second posterior odds given the second piece of evidence.
As an illustration, suppose that a closed box contains either three red balls or
two red balls together with one blue ball. Initially you believe that the two
possibilities are equally likely. Then a randomly chosen ball is removed from
the box. It appears to be a red ball. What is your new belief in the hypothesis
H that the box originally contained three red balls? The posterior odds are
calculated as $\frac{1/2}{1/2} \times \frac{1}{2/3} = \frac{3}{2}$. Therefore, the probability $\frac{3/2}{1 + 3/2} = \frac{3}{5}$ represents your new belief in the hypothesis H. Next, a second ball is randomly taken out of the box. Again, it appears to be a red ball. Then the new posterior odds are calculated as $\frac{3/5}{2/5} \times \frac{1}{1/2} = 3$ and your belief in the hypothesis H is revised to $\frac{3}{1 + 3} = \frac{3}{4}$.
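The iterative updating just described is easy to mechanize; the following sketch (illustrative Python, not from the text) reproduces the beliefs 3/5 and 3/4 for the closed-box example.

```python
# Illustrative sketch (not from the text): iterative updating with Bayes' rule in
# odds form for the closed-box example (H: the box holds three red balls).
def update_odds(prior_odds, p_evidence_given_h, p_evidence_given_not_h):
    return prior_odds * (p_evidence_given_h / p_evidence_given_not_h)

odds = 0.5 / 0.5                        # prior odds for H
odds = update_odds(odds, 1.0, 2 / 3)    # first ball drawn is red -> odds 3/2
print(odds / (1 + odds))                # belief 3/5
odds = update_odds(odds, 1.0, 1 / 2)    # second ball drawn is red -> odds 3
print(odds / (1 + odds))                # belief 3/4
```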
In applying Bayes’ rule in odds form, the main step is identifying the
hypothesis H and the evidence E. Bayes’ rule forces you to make transparent
any implicit assumption you are making.
Before we give several applications of Bayes’ rule, the following remark is
made. In both legal and medical cases, the conditional probabilities P(H | E)
and P(E | H) are sometimes confused with each other. As a classic example,
suppose that at the scene of a crime blood is found that must belong to the
offender. The blood type is found in approximately one in every hundred
thousand people. In the police database a person is found who matches
the blood type of the offender, but there is no other evidence against this
person. The prosecutor argues: “the probability that the suspect would have
the particular blood type is 0.001% if he were innocent. The suspect has the
particular blood type and therefore the probability is only 0.001% that he is
not guilty.” Letting H be the event that the suspect is innocent and E be the
event that the blood of the suspect matches the blood type of the offender,
the prosecutor confuses the probability P(H | E) – the relevant probability –
with the probability P(E | H). A famous example of the prosecutor’s fallacy
is the court case of People vs. Collins in Los Angeles in 1964. In this case,
a couple matching the description of a couple that had committed an armed
robbery was arrested. Based on expert testimony, the district attorney claimed
that the frequency of couples matching the description was roughly 1 in
12 million. Although this was the estimate for the probability that an innocent
couple matches the description, the district attorney treated this estimate as if
it was the probability that a couple matching the description is innocent and
incorrectly concluded that the couple was guilty beyond reasonable doubt. The
prosecutor’s fallacy had dramatic consequences in the case of Regina vs. Sally
Clark in the UK in 1999. Sally Clark was convicted of murder because of the cot deaths of two of her newborn children within a period of one year. A review of her case benefited from Bayesian arguments and led to her release in 2001.
Example 2.12 It is believed by a team of divers that a sought-after wreck will
be in a certain sea area with probability p = 0.4. A search in that area will
detect the wreck with probability d = 0.9 if it is there. What is the revised
probability of the wreck being in the area when the area is searched and no
wreck is found?
Solution. Let the hypothesis H be the event that the wreck is in the area in
question and the evidence E be the event that the wreck has not been detected
in that area. Before the outcome of the search, the events H and H̄ have the subjective probabilities P(H) = 0.4 and P(H̄) = 0.6. Further, P(E | H) = 1 − d = 0.1 and P(E | H̄) = 1. By Bayes' rule in odds form,

P(H | E) / P(H̄ | E) = (0.4/0.6) × (0.1/1) = 1/15,

so the revised probability that the wreck is in the area in question is (1/15)/(1 + 1/15) = 1/16 ≈ 0.0625.
2 Lawrence D. Stone et al., “Search for the wreckage of Air France flight AF 447,” Statistical
Science 59 (2014): 69–80.
The priors are P(H) = 0.01 and P(H̄) = 0.99. Further, we have P(E | H) = 1 and P(E | H̄) = 0.5. Then, by applying Bayes' rule in odds form, we get

P(H | E) / P(H̄ | E) = (0.01/0.99) × (1/0.5) = 1/49.5.

In the perception of the woman, the updated value of the probability that she has the particular gene is equal to (1/49.5)/(1 + 1/49.5) = 2/101 when the test is done under the
deal with her doctor and she is not told the test results.
Example 2.14 A murder is committed. The perpetrator is either one or the
other of the two persons X and Y. Both persons are on the run from the
authorities, and after an initial investigation, both fugitives appear equally
likely to be the perpetrator. Further investigation reveals that the actual
perpetrator has blood type A. Ten percent of the population belongs to the
group having this blood type. Additional inquiry reveals that person X has
blood type A, but offers no information concerning the blood type of person
Y. In light of this new information, what is the probability that person X is the
perpetrator?
Solution. In answering this question, use H to denote the event that person X is
the perpetrator. Let E represent the new evidence that person X has blood type
A. Before the appearance of evidence E, the events H and H̄ have probabilities P(H) = P(H̄) = 1/2. Further, we have that P(E | H) = 1 and P(E | H̄) = 1/10. By Bayes' rule in odds form,

P(H | E) / P(H̄ | E) = (1/2)/(1/2) × 1/(1/10) = 10.

The odds in favor, then, are 10 to 1 that person X is the perpetrator given that this person has blood type A. Otherwise stated, P(H | E) = 10/11. The probability of Y being the perpetrator is 1 − 10/11 = 1/11 and not, as may be thought, 1/10 × 1/2 = 1/20. The error in this reasoning is that the probability of person Y having blood type A is not 1/10, because Y is not a randomly chosen person; rather, Y is first of all a person having a 50% probability of being the perpetrator, whether or not he is found at a later time to have blood type A.
The Bayesian analysis sharpens our intuition in a natural way.
Another nice illustration of Bayes’ rule in odds form is provided by legal
arguments used in the discussion of the O. J. Simpson trial.3 In the analysis of
this example, we will use an extension of Bayes' rule in odds form:

P(G | ME) / P(Ḡ | ME) = [P(G | M) / P(Ḡ | M)] × [P(E | GM) / P(E | ḠM)],

where G represents the event that the husband is guilty of the murder of his wife, M the event that the wife has been murdered, and E the event that the husband has physically abused his wife in the past. How do we estimate the conditional probabilities on the right-hand side
of this formula? In 1992, 4,936 women were murdered in the United States,
of which roughly 1,430 were murdered by their (ex-)husbands or boyfriends.
This results in an estimate of 1,430/4,936 = 0.29 for the prior probability P(G | M) and an estimate of 0.71 for the prior probability P(Ḡ | M). Furthermore, it
is also known that roughly 5% of married women in the United States have
at some point been physically abused by their husbands. If we assume that a
woman who has been murdered by someone other than her husband had the
same chance of being abused by her husband as a randomly selected woman,
then the probability P(E | ḠM) is equal to 5%. Our estimate of the probability
P(E | GM) is based on the reported remarks made by Simpson’s defense
attorney, Alan Dershowitz, in a newspaper article. In this article, Dershowitz
admitted that a substantial percentage of the husbands who murder their wives
have, previous to the murders, also physically abused their wives. Given this
statement, the probability P(E | GM) will be taken to be 0.5. By substituting
the various estimated values for the probabilities into Bayes’ formula in odds
form, we find that
P(G | ME) / P(Ḡ | ME) = (0.29/0.71) × (0.5/0.05) = 4.08.

By P(Ḡ | ME) = 1 − P(G | ME), we can convert the odds to a probability.
This results in P(G | ME) = 0.81. In words, there is an estimated probability
of 81% that the husband is the murderer of his wife in light of the knowledge
that he had previously physically abused her. The fact that O. J. Simpson had
physically abused his wife in the past was therefore certainly very relevant to
the case.
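The arithmetic behind this analysis takes only a few lines of Python; the sketch below simply plugs the estimates quoted above into the extended odds form (the variable names are mine, not the author's).

    # Estimates quoted in the text
    p_G_given_M     = 0.29   # P(G | M): a murdered woman was killed by her (ex-)husband or boyfriend
    p_notG_given_M  = 0.71   # P(not-G | M)
    p_E_given_GM    = 0.50   # P(E | GM): prior abuse, given the husband is the murderer
    p_E_given_notGM = 0.05   # P(E | not-G, M): prior abuse, given she was murdered by someone else

    posterior_odds = (p_G_given_M / p_notG_given_M) * (p_E_given_GM / p_E_given_notGM)
    print(round(posterior_odds, 2))                         # about 4.08
    print(round(posterior_odds / (1 + posterior_odds), 2))  # about 0.80, i.e. roughly the 81% quoted above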
The next example involves a subtle application of Bayes’ rule in odds form.
Example 2.16 A diamond merchant has lost a case containing a very
expensive diamond somewhere in a large city in an isolated area. The case
has been found again but the diamond has vanished. However, the empty
case contains the DNA of the person who took the diamond. The city has
150,000 inhabitants, and each is considered a suspect in the diamond theft. An
expert declares that the probability of a randomly chosen person matching the
DNA profile is 10^−6. The police search a database with 5,120 DNA profiles
and find one person matching the DNA from the case. Apart from the DNA
evidence, there is no additional background evidence related to the suspect.
On the basis of the extreme infrequency of the DNA profile and the fact that
the population of potential finders of the diamond is only 150,000 people, the
prosecutor jumps to the conclusion that the odds of the suspect not being the
thief are practically nil and calls for a tough sentence. What do you think of
this conclusion?
Solution. The conclusion made by the prosecutor could not be more wrong.
The prosecutor argues: “The probability that a person chosen at random would
match the DNA profile found on the diamond case is negligible and the
number of inhabitants of the city is not very large. The suspect matches
this DNA profile, thus it is nearly one hundred percent certain that he is the
perpetrator.” This is a textbook example of the faulty use of probabilities.
The probability that the suspect is innocent of the crime is generally quite
different from the probability that a randomly chosen person matches the DNA
profile in question. What we are actually looking for is the probability that
among all persons matching the DNA profile in question, the arrested person
is the perpetrator. Counsel for defense could reason as follows to estimate this
probability: “We know that the suspect has matching DNA, but among the
other 150,000 − 5,120 = 144,880 individuals the expected number of people
matching the DNA profile is 144,880 × 10^−6 = 0.14488. So the probability
that the suspect is guilty is 1/(1 + 0.14488) = 0.8735. It is not beyond
reasonable doubt that the suspect is guilty and thus the suspect must be
released.”
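The defense counsel's calculation, and the more general formula that is derived below, are easy to check numerically. Here is a small Python sketch under the stated assumptions (r denotes the ratio p0/p1 introduced below):

    n, n0, lam = 150_000, 5_120, 1e-6    # population size, database size, DNA match probability

    def p_suspect_is_thief(r):
        # P(suspect is the thief) = r / (r + (n - n0) * lambda)
        return r / (r + (n - n0) * lam)

    print(round(p_suspect_is_thief(1), 4))   # 0.8735, the defense counsel's figure
    print(round(p_suspect_is_thief(2), 4))   # 0.9325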
The reasoning of the defense is on the whole correct and can be supported
by Bayes’ rule. Let us perform the analysis under the assumption that each
of the 150,000 inhabitants of the city is equally likely to be the finder of the
diamond and that the finder keeps the diamond with probability p0 when the
finder is a person in the database and with probability p1 otherwise. Let
r = p0/p1.

It will appear that only the value of the ratio r is required and thus not the values of p0 and p1. The following formula will be derived:

P(suspect is the thief) = r / (r + (n − n0)λ),

where n = 150,000, n0 = 5,120, and λ = 10^−6 denotes the probability that a randomly chosen person matches the DNA profile in question. This posterior probability has the value 0.8735 for r = 1 and the value 0.9325 for r = 2.
This result confirms that the counsel for defense is correct in stating that it is
not beyond reasonable doubt that the suspect is guilty, so the suspect must be
released if there is no additional proof. The derivation of the above formula
uses again the extension of Bayes’ rule and proceeds as follows. Define the
following events:
Problems
2.72 In a binary transmission channel, a 1 is transmitted with probability 0.8
and a 0 with probability 0.2. The conditional probability of receiving a 1
given that a 1 was sent is 0.95, the conditional probability of receiving a 0
when a 0 was sent is 0.99. What is the probability that a 1 was sent when
receiving a 1? Use Bayes’ rule in odds form to answer this question.
2.73 An oil explorer performs a seismic test to determine whether oil is likely
to be found in a certain area. The probability that the test indicates the
presence of oil is 90% if oil is indeed present in the test area, while the
2.78 On the island of liars, each inhabitant lies with probability 2/3. You
overhear an inhabitant making a statement. Next you ask another inhab-
itant whether the inhabitant you overheard spoke truthfully. What is the
probability that the inhabitant you overheard indeed spoke truthfully,
given that the other inhabitant says so?
2.79 Suppose that a person has confessed to a crime under a certain amount
of police pressure. Besides this confession, there is no other proof that
the person is guilty. Use Bayes’ rule in odds form to verify that the
confession only adds to the evidence of guilt if the confession is more
likely to come from the guilty than from the innocent. Do you think that
in real life a hardened person who is guilty is more likely to confess than
an unstable person who is innocent?
2.80 An opaque bowl contains one ball. The ball is equally likely to be red
or blue. A red ball is added to the bowl. Then a ball is randomly picked
from the bowl. The ball that has been picked from the bowl turns out to
be red. Use Bayes’ rule in odds form to calculate the probability that the
bowl originally contained a red ball.
2.81 A doctor discovers a lump in a woman’s breast during a routine physical
examination. The lump could be a cancer. Without performing any
further tests, the probability that the woman has breast cancer is 0.01. A
further test can be done. On average, this test is able to establish correctly
whether a tumor is benign or cancerous 90% of the time. A positive test
result indicates that a tumor is cancerous. What is the probability that the
woman has breast cancer if the test result is positive?
2.82 On average, one in every 125 births produces a set of fraternal twins and one in every 300 births produces a set of identical twins (identical twins are always of the same sex, whereas the sexes of fraternal twins are independent of each other). Elvis had a
twin brother, Jesse Garon, who died at birth. What is the probability that
Elvis was an identical twin?
2.83 A box contains 10,000 coins. One of the coins has heads on both sides
but all the other coins are fair coins. You choose at random one of the
coins. Use Bayes’ rule in odds form to find the probability that you have
chosen the two-headed coin given that the first n tosses have all resulted
in heads. What are the numerical values of the posterior probability for
n = 10, 15, and 25?
The foundation of Bayesian inference was laid by the English clergyman Thomas Bayes (1702–
1761), but the Bayesian interpretation of probability was developed mainly by
Pierre Simon Laplace (1749–1827). Astronomers have contributed much to the
Bayesian approach. Astronomers cannot do experiments on the universe and
thus have to make probabilistic inferences from evidence left behind. This is
very much the same situation as in forensic science, where Bayesian inference
also plays a very important role. In Bayesian inference one typically deals with
nonrepeatable chance experiments.
Fundamentally, the distinction between Bayesian inference and frequentist
inference in classical statistics concerns the interpretation of probability. To
illustrate, suppose that an astronomer wants to make a statement about the
photon flux of a nonvariable star. In the frequentist view, the true flux is a single
fixed value and a statistical estimate of this unknown value can only be based
on many measurements of the flux. In a Bayesian view, one can meaningfully
treat the flux of the star as a random variable and talk about the probability
that the true value of the flux of the star is between specific bounds. That
probability is subjective and represents one’s knowledge of the value based
on prior information and available data. The fundamental difference between
the classical and the Bayesian approach can also be illustrated nicely with
the next example. Imagine a multiple-choice exam consisting of 50 questions,
each of which has three possible answers. A student receives a pass grade if
he/she correctly answers more than half of the questions. Take the case of a
student who manages to answer 26 of the 50 questions correctly and claims
not to have studied, but rather to have obtained 26 correct answers merely by
guessing. The classical approach in statistics is to calculate the probability of
correctly answering 26 or more of the 50 questions by luck alone. This excess
probability is equal to 0.0049 and is called the p-value in classical statistics. On
the basis of the small p-value one might conclude that the student is bluffing
and in fact did prepare for the exam. However, the p-value does not give the
probability that the student went unprepared to the exam, but only gives the
probability of getting 26 or more correct answers when the student did not
prepare for the exam. In reality, one might have information about the earlier
achievements of the student in question. The Bayesian approach does use this
information and then calculates the probability that the student did not prepare
for the exam given that he/she answered 26 of the 50 questions correctly. This
is the probability we are actually looking for! The Bayesian approach requires
that we first specify a prior distribution for the various ways the student may
have prepared for the exam. This distribution concerns the situation before
the exam and may be based on information of the student’s earlier academic
performance in homework or previous exams.
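The excess probability mentioned for the multiple-choice example is a binomial tail probability, which can be reproduced with a few lines of Python (a sketch of my own, not part of the original text):

    from math import comb

    n, p = 50, 1 / 3     # 50 questions, success probability 1/3 per question under pure guessing
    p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(26, n + 1))
    print(round(p_value, 4))   # should reproduce the excess probability of about 0.0049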
Example 2.17 A new treatment is tried out for a disease for which the
historical success rate of the standard treatment is 35%. The discrete uniform
probability distribution on 0, 0.01, . . . , 0.99, 1 is taken as prior for the success
probability of the new treatment. The experimental design is to make exactly
10 observations by treating 10 patients. The experimental study yields seven
successes and three failures. What is the posterior probability that the new
treatment is more effective than the standard treatment?4
Solution. Model the unknown success probability of the new treatment by the random variable Θ. Assume that our state of knowledge about the unknown parameter is expressed by the “non-informative” prior distribution p0(θ) = 1/101 for θ = 0, 0.01, . . . , 0.99, 1. To update the prior distribution given the observed data, we need the so-called likelihood function L(data | θ). This function is defined as the probability of getting the data given that the success probability has value θ. In the particular situation of seven successes in the treatment of 10 patients,

L(data | θ) = \binom{10}{7} θ^7 (1 − θ)^3 for θ = 0, 0.01, . . . , 0.99, 1.

To find the posterior probability p(θ) = P(Θ = θ | data), we use

P(Θ = θ | data) = P(data | θ) p0(θ) / P(data).

By the law of conditional probability, P(data) = Σ_θ P(data | θ) p0(θ). Thus, using the notation L(data | θ) for P(data | θ), the posterior distribution is given by the Bayesian formula

p(θ) = L(data | θ) p0(θ) / Σ_θ L(data | θ) p0(θ) for θ = 0, 0.01, . . . , 0.99, 1.

The posterior p(θ) is proportional to prior p0(θ) × likelihood L(data | θ). Letting θi = i/100 and inserting the formulas for L(data | θ) and p0(θ), we get

p(θi) = θi^7 (1 − θi)^3 / Σ_{k=0}^{100} θk^7 (1 − θk)^3 for i = 0, 1, . . . , 100.
4 This example is based on the paper by D. A. Berry, “Bayesian clinical trials,” Nature Reviews
Drug Discovery 5 (2006): 27–36.
The posterior probability of the new treatment not being more effective than
the standard treatment is 0.0134. In this particular example, the value 0.0134
of the posterior probability is not very different from the value 0.0260 of the
excess probability of obtaining seven or more successes in 10 trials under
the null hypothesis that the new treatment causes no difference and thus has
success probability 0.35. In classical statistics, this so-called p-value would
have been calculated and the hypothesis that the new treatment causes no
difference would also have been rejected.
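The posterior distribution of Example 2.17 and the classical p-value can both be computed directly; the following Python sketch follows the formulas given above (the code itself is mine, not the author's):

    from math import comb

    thetas = [i / 100 for i in range(101)]     # grid 0, 0.01, ..., 0.99, 1
    prior = [1 / 101] * 101                    # non-informative uniform prior

    lik = [comb(10, 7) * t**7 * (1 - t)**3 for t in thetas]   # seven successes in ten patients
    norm = sum(l * p0 for l, p0 in zip(lik, prior))
    posterior = [l * p0 / norm for l, p0 in zip(lik, prior)]

    # Posterior probability that the new treatment is not more effective (theta <= 0.35)
    print(round(sum(pt for t, pt in zip(thetas, posterior) if t <= 0.35), 4))   # about 0.0134

    # Classical p-value: seven or more successes in ten trials when the success probability is 0.35
    p_val = sum(comb(10, k) * 0.35**k * 0.65**(10 - k) for k in range(7, 11))
    print(round(p_val, 4))                                                      # about 0.0260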
In general, p-values overestimate the evidence against the null hypothesis.
To illustrate this, consider the following experiment. Suppose that there is
reason to believe that a coin might be slightly biased toward heads. To test this,
you decide to throw the coin 1,000 times. Before performing the experiment,
you express your uncertainty about the unbiasedness of the coin by assuming
that the probability of getting heads in a single toss of the coin can take on the
values θ = 0.50, 0.51, and 0.52 with respective prior probabilities p0(θ) = 1/2, 1/3, and 1/6. Next the experiment is performed and 541 heads are obtained in 1,000 tosses of the coin. The likelihood of getting 541 heads in 1,000 tosses is

L(541 | θ) = \binom{1,000}{541} θ^541 (1 − θ)^459 for θ = 0.50, 0.51, and 0.52.

This leads to the posterior probability p(θ | 541) = 0.1282 for θ = 0.50. In other words,
your posterior belief that the coin is fair equals 0.1282. In classical statistics,
one would compute the probability of getting 541 or more heads in 1,000 tosses
of the coin under the hypothesis that the coin is fair. This excess probability
is equal to 0.0052. Many classical statisticians would consider this small
p-value as significant evidence that the coin is biased toward heads. However,
your subjective Bayesian probability of 0.1282 for the hypothesis of a fair
coin is not strong enough evidence for such a conclusion. The difference
in the conclusions can be explained as follows. The p-value is based on
the set of all possible observations that cast as much or more doubt on
the hypothesis as the actual observations do. It is not possible to base
the p-value only on the actual data, because it frequently happens that
all individual outcomes have such small probabilities that every outcome
would look significant. The inclusion of unobserved data means that the
resulting p-value may greatly exaggerate the strength of evidence against
the hypothesis.5 The p-value, which is the probability of getting data at
least as extreme as the observed data when the null hypothesis is true, is
5 That’s why physicists required an extremely small p-value of about 1 in 3.5 million (the
5-sigma rule) before declaring the “discovery” of the long-sought Higgs boson. The evidence
of the efficacy of the famous Salk vaccine against polio was also based on an extremely small
p-value, see Problem 3.65. It is true that p-values are overused and often misused, but they were
the right tool in the search for the Higgs boson and in showing the efficacy of the Salk vaccine.
in favor of candidate A equals θ. Using the likelihood L(data | θ) = \binom{1,000}{517} θ^517 (1 − θ)^483, we get the posterior distribution

p(θ) = \binom{1,000}{517} θ^517 (1 − θ)^483 p0(θ) / [ Σ_{a=30}^{70} \binom{1,000}{517} (a/100)^517 (1 − a/100)^483 p0(a/100) ].
Performing the numerical calculations, we find that the posterior probability
of candidate A getting the majority of the votes at the election equals
p(0.51) + · · · + p(0.70) = 0.763. Let’s see what the posterior probability
of candidate A winning the election is when the prior p0(θ) = c/(1 − θ^2) for θ = 0.30, 0.31, . . . , 0.70 is used, where the normalizing constant c = 0.0174484.
Note that the shape of this prior distribution having expected value 0.520 is
much different from the shape of the triangular prior distribution. We then
find the value 0.786 for the posterior probability of candidate A winning the
election.
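The posterior probability under the second prior can be verified directly. The Python sketch below (my own code) uses only the prior p0(θ) = c/(1 − θ^2); the triangular prior behind the value 0.763 is specified in the part of the example not repeated here.

    from math import comb

    grid = [a / 100 for a in range(30, 71)]        # theta = 0.30, 0.31, ..., 0.70

    weights = [1 / (1 - t**2) for t in grid]       # prior proportional to 1/(1 - theta^2)
    c = 1 / sum(weights)
    prior = [c * w for w in weights]
    prior_mean = sum(t * p0 for t, p0 in zip(grid, prior))
    print(round(c, 7), round(prior_mean, 3))       # about 0.0174484 and 0.52

    lik = [comb(1000, 517) * t**517 * (1 - t)**483 for t in grid]
    norm = sum(l * p0 for l, p0 in zip(lik, prior))
    posterior = [l * p0 / norm for l, p0 in zip(lik, prior)]

    # p(0.51) + ... + p(0.70): posterior probability that candidate A wins the election
    print(round(sum(pt for t, pt in zip(grid, posterior) if t >= 0.51), 3))   # about 0.786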
Problems
2.84 Your friend has fabricated a loaded die. In doing so, he has chosen at
random one of the values 0.1, 0.2, 0.3, or 0.4 and has loaded the die
in such a way that any roll of the die results in the outcome 6 with a
probability that is equal to the randomly chosen value. He does not tell
you what the chosen value is. You ask him to roll the die 300 times and to
inform you how often the outcome 6 has appeared. You are informed that
the outcome 6 has appeared 75 times. What is the posterior distribution
of the probability that a single roll of the die gives a 6?
2.85 You wonder who is the better player of the tennis players Alassi and
Bicker. Your prior assigns equal probabilities to the possible values 0.4,
0.5, and 0.6 for the probability that Alassi wins a given match. Then
you learn about a tournament at which a best-of-five series of matches
is played between Alassi and Bicker over a number of days. In such an
encounter the first player to win three matches is the overall winner. It
turns out that Alassi wins the best-of-five contest. How should you update
your prior? Note: This problem is taken from J. Albert, “Teaching Bayes’
rule: a data-oriented approach,” The American Statistician 51 (1997):
247–253.
2.86 Consider again Example 2.17. Assume that the 10 observations for the
new treatment are SSFSSFSSSF, where S stands for success and F for
failure. Update the posterior probability of the new treatment being more
effective than the standard treatment after each observation in order to
A random variable is not a variable in the traditional sense of the word and
actually it is a little misleading to call it a variable. Formally, a random
variable is defined as a real-valued function on the sample space of a chance
experiment. A random variable X assigns a numerical value X(ω) to each
element ω of the sample space. For example, suppose that the random variable
X is defined as the smallest of the two numbers that will result from the
experiment of rolling a fair die twice. Then the random variable X assigns
the numerical value X(ω) = min(i, j) to the outcome ω = (i, j) of the chance
experiment, where i is the number obtained in the first roll and j is the number
obtained in the second roll. As said before, a random variable X takes on its
value by chance. A random variable gets its value only after the underlying
chance experiment has been performed. Before the experiment is performed,
you can only speak of the probability that the random variable will take on a
specific value (or a value in a specific interval for the case that the range of
the random variable is some continuum of real numbers). It is common to use
uppercase letters such as X, Y, and Z to denote random variables, and lowercase
letters x, y, and z to denote their possible numerical values.
Illustrative examples of random variables are:
• The number of winners in a football pool next week.
• The number of major hurricanes that will hit the United States next year.
• The number of claims that will be submitted to an insurance company next
week.
• The amount of rainfall that the city of London will receive next year.
• The lifetime of a newly bought battery.
• The duration of your next mobile phone call.
The first three examples are examples of discrete random variables taking on
a discrete number of values and the other three examples describe continuous
random variables taking on a continuum of values.
In this chapter we consider only discrete random variables. A random
variable X is said to be discrete if its set of possible values is finite or countably
infinite. The set of possible values of X is called the range of X and is denoted
by I. The probabilities associated with these possible values are determined
by the probability measure P on the sample space of the chance experiment.
The probability mass function of a discrete random variable X is defined by
P(X = x) for x ∈ I, where the notation P(X = x) is shorthand for P({ω : X(ω) = x}).
As an example, suppose that you are going to roll a fair die twice. The
sample space of this experiment consists of 36 equally likely outcomes (i, j),
where i is the number obtained in the first roll and j is the number obtained in
the second roll. Let the random variable X be defined as the smallest of the two
numbers that will result from this experiment. The random variable X assigns
the numerical value min(i, j) to the outcome (i, j) of the sample space. Thus, the
range of X is the set I = {1, 2, . . . , 6}. The random variable X takes on the value
1 if one of the 11 outcomes (1, 1), (1, 2), . . . , (1, 6), (2, 1), (3, 1), . . . , (6, 1)
occurs and so P(X = 1) = 11/36. In the same way, P(X = 2) = 9/36, P(X = 3) = 7/36, P(X = 4) = 5/36, P(X = 5) = 3/36, and P(X = 6) = 1/36.
As another example, suppose that the random variable S is defined as the sum
of the two rolls. Then S assigns the numerical value i + j to the outcome (i, j)
of the sample space and so the range of S is the set I = {2, 3, . . . , 12}. It is
left to the reader to verify that the probability mass function of S is given by
P(S = 2) = P(S = 12) = 1/36, P(S = 3) = P(S = 11) = 2/36, P(S = 4) = P(S = 10) = 3/36, P(S = 5) = P(S = 9) = 4/36, P(S = 6) = P(S = 8) = 5/36, and P(S = 7) = 6/36.
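Both probability mass functions can be generated by brute force over the 36 equally likely outcomes; here is a small Python sketch (my own illustration, not part of the text):

    from collections import Counter

    outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]   # 36 equally likely outcomes

    pmf_min = Counter(min(i, j) for i, j in outcomes)   # counts for X = min(i, j)
    pmf_sum = Counter(i + j for i, j in outcomes)       # counts for S = i + j

    print({x: f"{c}/36" for x, c in sorted(pmf_min.items())})
    # {1: '11/36', 2: '9/36', 3: '7/36', 4: '5/36', 5: '3/36', 6: '1/36'}
    print({s: f"{c}/36" for s, c in sorted(pmf_sum.items())})
    # {2: '1/36', 3: '2/36', ..., 7: '6/36', ..., 12: '1/36'}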
Example 3.1 In your pocket you have three dimes (coins of 10 cents) and two
quarters (coins of 25 cents). You grab at random two coins from your pocket.
What is the probability mass function of the amount of money you will grab?
Solution. The sample space of the chance experiment is chosen as Ω =
{(D, D), (D, Q), (Q, D), (Q, Q)}. The outcome (D, D) occurs if the first coin
taken is a dime and the second one is also a dime, the outcome (D, Q) occurs
if the first coin taken is a dime and the second one is a quarter, etc. The
probability 3/5 × 2/4 = 3/10 is assigned to the outcome (D, D), the probability 3/5 × 2/4 = 3/10 to the outcome (D, Q), the probability 2/5 × 3/4 = 3/10 to the outcome (Q, D), and the probability 2/5 × 1/4 = 1/10 to the outcome (Q, Q). Let the random variable X denote the total number of cents you will grab. The random variable X has 20, 35, and 50 as possible values. The random variable X takes on the value 20 if the outcome (D, D) occurs, the value 35 if either the outcome (D, Q) or (Q, D) occurs, and the value 50 if the outcome (Q, Q) occurs. Thus, the probability mass function of X is

P(X = 20) = 3/10, P(X = 35) = 3/10 + 3/10 = 3/5, and P(X = 50) = 1/10.
Example 3.2 You are going to remove cards one at a time from a well-shuffled
standard deck of 52 cards until you get an ace. Let the random variable X be the
number of cards that must be removed. What is the probability mass function
of X?
Solution. The range of the random variable X is the set {1, 2, . . . , 49}.
Obviously, P(X = 1) = 4/52. Using the chain rule for conditional probabilities, we find for k = 2, . . . , 49:

P(X = k) = 48/52 × · · · × (48 − (k − 2))/(52 − (k − 2)) × 4/(52 − (k − 1)).

Alternatively, P(X = k) can be calculated as [\binom{48}{k−1}/\binom{52}{k−1}] × 4/(52 − (k − 1)). This representation is based on the multiplication rule P(AB) = P(A)P(B | A) and can be explained
as the probability that the first k − 1 picks are non-aces multiplied by the
conditional probability that the kth pick is an ace given that the first k − 1
picks are non-aces.
The last example gives rise to the following important observation. Often,
an explicit listing of the sample space is not necessary to assign a probability
distribution to a random variable. Usually, the probability distribution of a ran-
dom variable is modeled without worrying about the assignment of probability
to an underlying sample space. In most problems, you will perform probability
calculations without explicitly specifying a sample space; an assignment of
probabilities to properly chosen events usually suffices.
Problems
3.1 As an added incentive for students to do homework for their probability
course, the professor randomly picks two students at each class meeting.
These two students are asked to explain a homework problem. On a
particular day, five students from the class of 20 students have not done the
homework. Let the random variable X be the number of students picked
who have not done the homework. What is the probability mass function
of X?
3.2 Imagine that people enter a room one by one and announce their birthdays.
Let the random variable X be the number of people required to have a
matching birthday. What is the probability mass function of X?
3.3 A bag contains three coins. One coin is two-headed and the other two are
normal. A coin is chosen at random from the bag and is tossed twice. Let
the random variable X denote the number of heads that will appear. What
is the probability mass function of X?
3.4 In a charity lottery, 1,000 tickets numbered 000, 001, . . . , 999 are sold.
Each contestant buys only one ticket. The prize winners of the lottery
are determined by drawing one number at random from the numbers
000, 001, . . . , 999. You are a prize winner when the number on your ticket
is the same as the number drawn or is a random permutation of the number
drawn. What is the probability mass function of the number of prize
winners?
3.5 Accidentally, two depleted batteries get into a set of five batteries. To
remove the two depleted batteries, the batteries are tested one by one in
random order. Let the random variable X denote the number of batteries
that must be tested to find the two depleted batteries. What is the
probability mass function of X?
3.6 You are allowed to toss a fair coin either five times or until tails appears,
whichever occurs first. You receive starting capital of $10 when the first
toss results in heads; otherwise, you receive nothing. Each subsequent toss
that results in heads doubles your current capital. Let the random variable
X be the end capital you will get. What is the probability mass function
of X?
3.7 A box contains seven envelopes. Two envelopes contain $10 each, one
contains $5, and the other four are empty. Envelopes are drawn at random
from the box, one by one and without replacement. You may keep the
money from the envelopes that are drawn before an empty envelope is
drawn. Let the random variable X be the amount of money you will get.
What is the probability mass function of X?
or the average win per game is approximately −0.05 dollars (meaning that the
average “win” is actually a loss). If we define the random variable X as the win
achieved in just a single repetition of the game, then the number −0.05 is said
to be the expected value of X. In the casino game, the expected value of X is
calculated as
For a discrete random variable X with range I, it is said that the expected value E(X) exists if Σ_{x∈I} |x| P(X = x) < ∞ or if X is a nonnegative random variable, where the value E(X) = ∞ is allowed when X is nonnegative. It is important to emphasize that the expected value E(X) is finite if and only if Σ_{k∈I} |k| P(X = k) < ∞. An example of a random variable X for which E(X) does not exist is the random variable X with probability mass function P(X = 0) = 0 and P(X = k) = 3/(π^2 k^2) for k = ±1, ±2, . . . . In this example, Σ_{k=−∞}^{∞} |k| P(X = k) = ∞, by the fact that Σ_{k=1}^{∞} 1/k = ∞. However, the expected value of the nonnegative random variable with probability mass function 6/(π^2 k^2) for k = 1, 2, . . . does exist (and is equal to ∞).
Example 3.1 (continued) What is the expected value of the random variable
X to be defined as the number of cents you will grab from your pocket?
Solution. Since the probability mass function of X is P(X = 20) = 3/10, P(X = 35) = 3/5, and P(X = 50) = 1/10, the expected value of X is

E(X) = 20 × 3/10 + 35 × 3/5 + 50 × 1/10 = 32 cents.
Example 3.2 (continued) What is the expected value of the random variable
X to be defined as the number of cards that must be picked until you get the
first ace?
Solution. Using the result found earlier for P(X = k), we get
E(X) = 1 × 4/52 + Σ_{k=2}^{49} k × 48/52 × · · · × (48 − (k − 2))/(52 − (k − 2)) × 4/(52 − (k − 1)) = 10.6.
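The value 10.6 is easily confirmed numerically. The sketch below (a check of my own, not from the text) uses the alternative representation of P(X = k) from Example 3.2 with exact fractions:

    from fractions import Fraction as F
    from math import comb

    def p(k):
        # P(X = k): the first k-1 picks are non-aces and the kth pick is an ace.
        return F(comb(48, k - 1), comb(52, k - 1)) * F(4, 52 - (k - 1))

    print(sum(p(k) for k in range(1, 50)))              # 1, so the probabilities sum to one
    print(float(sum(k * p(k) for k in range(1, 50))))   # 10.6, the expected value E(X)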
Example 3.3 Joe and his friend make a guess every week whether the Dow
Jones index will have risen at the end of the week or not. Both put $10 in the
pot. Joe observes that his friend is just guessing and is making his choice by
the toss of a fair coin. Joe asks his friend if he could contribute $20 to the pot
and submit his guess together with that of his brother. The friend agrees. In
each week, however, Joe’s brother submits a prediction opposite to that of Joe.
The person having a correct prediction wins the entire pot. If more than one
person has a correct prediction, the pot is split evenly. How favorable is the
game to Joe and his brother?
Solution. Let the random variable X denote the payoff to Joe and his brother
in any given week. Either Joe or his brother will have a correct prediction. If
Joe’s friend is wrong he wins nothing, and if he is correct he shares the $30
pot with either Joe or his brother. Thus X takes on the values 30 and 15 with
equal chances. This gives E(X) = 1/2 × 30 + 1/2 × 15 = 22.5 dollars. Joe and his
brother have an expected profit of $2.5 every week.
Example 3.4 Three friends go to the cinema together every week. Each week,
in order to decide which friend will pay for the other two, they all toss a fair
coin into the air simultaneously. They continue to toss coins until one of the
three gets a different outcome from the other two. What is the expected value
of the number of trials required?
Solution. Let the random variable X denote the number of trials until one of
the three friends gets a different outcome from the other two. The probability
that any given trial does not lead to three equal outcomes is 1 − 1/8 − 1/8 = 3/4. Thus,

P(X = j) = (1 − p)^(j−1) p for j = 1, 2, . . .

with p = 3/4. The expected value of X is

E(X) = Σ_{j=1}^{∞} j (1 − p)^(j−1) p = p/[1 − (1 − p)]^2 = 1/p,

using the fact that Σ_{j=1}^{∞} j x^(j−1) = 1/(1 − x)^2 for all 0 < x < 1, see Appendix C. Thus, the expected value of the number of trials required is 4/3.
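A quick simulation confirms the value 4/3 (a sketch of my own; the seed is arbitrary):

    import random

    rng = random.Random(1)   # fixed seed so the run is reproducible

    def rounds_until_odd_one_out():
        # Number of simultaneous-toss rounds until one of the three friends differs from the other two.
        n = 0
        while True:
            n += 1
            tosses = [rng.randint(0, 1) for _ in range(3)]
            if len(set(tosses)) == 2:    # not all three tosses equal
                return n

    trials = 100_000
    print(sum(rounds_until_odd_one_out() for _ in range(trials)) / trials)   # close to E(X) = 4/3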
The concept of expected value is at the heart of the sequential decision
problem that is discussed in the next example.
Example 3.5 Eleven closed boxes are put in random order in front of you. One
of these boxes contains a devil’s penny and the other 10 boxes contain money.
You know which dollar amounts a1 , . . . , a10 are in the 10 boxes. You may open
as many boxes as you wish, but they must be opened one by one. You can keep
the money from the boxes you have opened, as long as you have not opened
the box with the devil’s penny. Once you open this box, the game is over and
you lose all the money gathered so far. What is a good stopping rule when you
want to maximize the expected value of your gain? What is the expected value
of the gain of the game when ai = 1 for all i?
Solution. The one-stage-look-ahead rule is the key. This is an intuitively
appealing rule that looks only one step ahead, as the name says. The rule
prescribes stopping in the states in which it is at least as good to stop now as to
continue one more step and then stop. For the situation that you have collected
so far a dollars and k + 1 boxes are still closed, including the box with the
devil’s penny, define the random variable Xk (a) as the amount by which your
capital gathered so far would change when you would decide to open one more
1 Using advanced optimization theory, it can be shown that this stopping rule is indeed optimal
for the criterion of the expected gain. In general, the one-stage-look-ahead rule is optimal if the
problem has the property that continuing in a state in which this rule calls for stopping results
again in a state in which the rule calls for stopping. To put it differently, the set of states in
which stopping is better than continuing for one step and then stopping must be a closed set
from which no escape is possible.
laid face down. You can flip over as many cards as you wish, but they must be
flipped over one by one. Once you flip over the joker card, the game is over
and you win nothing. If you stop before the joker card is flipped over, you win
as many dollars as the sum of the values of the cards you have flipped over (the
ace counts for $1). This is the devil’s penny problem with ai = i for all i.
Expected value is a useful concept in casino games, lotteries, and sports
betting. The expected value is what the player will win or lose on average if
the player were to make the same bet over and over again. This result is known
as the strong law of large numbers. This law will be made more precise in
Section 9.2.
The next example gives a practical application of the concept of expected
value. The example is about group testing for disease and has its historical
significance in World War II. The young men who were drafted were subject
to a blood test in the fight against syphilis, at a time when mass-produced
penicillin was not available. The Harvard economist Robert Dorfman devised
an elegant method to cut down on the number of blood tests needed. The
idea was to pool the blood samples of soldiers together and test the pooled
sample.
Example 3.6 A large group of people are undergoing a blood test for a
particular disease. The probability that a randomly chosen person has the
disease in question is equal to p = 0.005. In order to save on costs, it is
decided to split the group into smaller groups each consisting of r people.
The blood samples of the r people are then mixed and tested all at once. The
mixture will only test negative if all the individual samples are negative. If the
test result is negative, then one test will have been sufficient for that whole
group. Otherwise, r extra tests will be necessary in order to test each of the r
people individually. What value of r minimizes the average number of tests per
person?
Solution. Let the random variable X be the number of tests needed for a group
of r people. The possible values of the random variable X are 1 and r + 1.
The random variable X takes on the value 1 if and only if all the r samples
are negative. An individual sample will test negative with probability 1 − p.
Assuming independence between the test outcomes of individual people, it
follows that P(X = 1) is equal to (1 − p) × (1 − p) × · · · × (1 − p) = (1 − p)^r. Since P(X = r + 1) = 1 − P(X = 1), the probability P(X = r + 1) is equal to 1 − (1 − p)^r. Thus, the expected value of X is

E(X) = 1 × (1 − p)^r + (r + 1) × (1 − (1 − p)^r) = 1 + r(1 − (1 − p)^r).
Therefore, by combining blood samples from r people into one mixture, the average number of tests per person is (1/r)(1 + r(1 − (1 − p)^r)). For the case
p = 0.005, this expression is minimal for r = 15 and its minimal value is
0.1391. Thus, group testing cuts down the number of tests performed by about
86% when p = 0.005 when the test concerns many people. Also, an interesting
quantitative result can be given. For p close to zero, the optimal value of the
group size and the minimal value of the average number of tests per person
are approximately equal to 1/√p and 2√p. This result follows easily by noting that 1 − (1 − p)^r ≈ rp for p close to zero and next minimizing the function (1/x)(1 + px^2) with respect to x (verify!).
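A few lines of Python reproduce the optimal group size and the small-p approximation (a sketch under the assumptions of the example, not part of the original text):

    def tests_per_person(r, p=0.005):
        # Expected number of tests per person when the blood of r people is pooled.
        return (1 + r * (1 - (1 - p) ** r)) / r

    best_r = min(range(1, 101), key=tests_per_person)
    print(best_r, round(tests_per_person(best_r), 4))            # 15 and 0.1391

    # Small-p approximation: group size about 1/sqrt(p), minimum about 2*sqrt(p)
    print(round(0.005 ** -0.5, 1), round(2 * 0.005 ** 0.5, 4))   # about 14.1 and 0.1414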
Problems
3.8 There are 10 cards face down numbered 1 through 10. You pick at
random one card. Your payoff is $0.50 if the number on the card is less
than 5 and is the dollar value on the card otherwise. What is the expected
value of your payoff?
3.9 Four professors give one course each. The courses have 15, 20, 70, and
125 students. No student takes more than one course. Let the random
variable X be the number of students in a randomly chosen class and
Y be the number of students in the class of a randomly chosen student.
Without doing any calculation, can you explain why E(X) is larger than
E(Y)? What are E(X) and E(Y)?
3.10 A firm exports to a variety of countries. An order for $10,000 has been
received from a politically unstable country. The export firm estimates
that there is a probability of 75% that the full amount of $10,000 will
be paid, a probability of 15% that only $5,000 will be paid, and a
probability of 10% that nothing will be paid. Therefore, the firm has
decided to purchase export credit insurance. The insurance company has
asked a price of $2,000 to insure the order. The insurance guarantees
compensation for any debt experienced on the order. What is the expected
value of the cost of risk reduction?
3.11 A bowl contains ten white and two red balls. In one shot you pick m
balls at random from the bowl, where m can be chosen at your discretion.
If all the balls picked are white, you win m dollars; otherwise, you win
nothing. What is the expected value of your winnings and what is the
maximizing value of m?
3.12 Six cards are taken from a deck: two kings and four aces. The six cards
are thoroughly shuffled, after which the two top cards are revealed.
If both cards are aces, you win $1.25; otherwise, you lose $1. Is this a
fair bet? A gambling game is said to be fair when the expected value of
your net winnings is zero.
3.13 Consider again Example 3.2. Let the random variable Xj be the number
of picks until an ace is obtained for the jth time. What is the expected
value of Xj for j = 2, 3, and 4? Can you explain the answer intuitively?
3.14 In the game “Unders and Overs,” two dice are rolled and you can bet
whether the total of the two dice will be under 7, over 7, or equal to 7.
The gambling table is divided into three sections marked “Under 7,” “7,”
and “Over 7.” The payoff odds for a bet on “Under 7” are 1 to 1, for a
bet on “Over 7” are 1 to 1, and for a bet on “7” are 4 to 1 (payoffs of r
to 1 mean that you get r + 1 dollars back for each dollar bet if you win;
otherwise, you get nothing back). Each player can put chips on one or
more sections of the gambling table. Your strategy is to bet one chip on
“Under 7” and one chip on “7” each time. What is the expected value of
your net winnings?
3.15 In European roulette, players bet on the outcome of a turning wheel,
which is fitted with 37 spokes numbered from 0 to 36. Of the spokes
numbered from 1 to 36, 18 are red and 18 are black. The 0 represents a
win for the casino. A bet on the color red has a 1 to 1 payout. There is
a stake limit of $1,000. A player decides to make no more than 11 bets
under a doubling strategy and to stop as soon as he has won a bet. The
player begins by staking one dollar on red. If he loses, he doubles his
stake on red, and continues doubling until red wins. If the player has lost
10 bets in a row, the maximum amount of $1,000 is staked in the 11th
bet. What is the expected value of the total amount staked by the player
and what is the expected value of the amount won by the player?
3.16 A fair coin is tossed repeatedly until heads appears for the first time or
the coin has been tossed m times, whichever occurs first. The payoff is
two dollars if heads turns up in the first toss, four dollars if heads turns
up for the first time in the second toss, etc. In general, the payoff is 2^k
dollars if heads turns up for the first time in the kth toss, for 1 ≤ k ≤ m.
The payoff is zero if tails is tossed m times in a row. What is the expected
value of the payoff? What should be the stake to make the game a fair
bet? Note: This game is known as the St. Petersburg game when m = ∞.
3.17 You throw darts at a circular target on which two concentric circles of
radius 1 cm and 3 cm are drawn. The target itself has a radius of 5 cm.
You receive 15 points for hitting the target inside the smaller circle, 8
points for hitting the middle annular region, and 5 points for hitting the
outer annular region. The probability of hitting the target at all is 0.75.
If the dart hits the target, then the hitting point is a random point on the
target. Let the random variable X be the number of points scored on a
single throw of the dart. What is the expected value of X?
3.18 The following game is played in a particular carnival tent. You pay one
dollar to draw blindly three balls from a box without replacement. The
box contains 10 balls and four of those balls are gold colored. You get
back your original one-dollar stake if you draw exactly two gold-colored
balls, while you win 10 dollars and get back your original one-dollar
stake if you draw three gold-colored balls; otherwise, you get nothing
back. What is the expected value of your net winnings?
3.19 You spin a game-board spinner with 1,000 equal sections numbered 1 to
1,000. After your first spin, you have to decide whether to spin the
spinner for a second time. Your payoff is the total score of your spins,
as long as this score does not exceed 1,000; otherwise, your payoff is
zero. What strategy maximizes the expected value of your payoff?
3.20 A casino owner has given you an advance of a fiches to play a particular
casino game. Your win probability for each play of the game is p, where
p < 1/2. The stake for each play is one fiche. If you win, you get two
fiches back; otherwise, you lose the stake. The fiches are given to you
as a reward for services rendered to the casino and under the following
agreement. You must play until you have lost all fiches or have won b
fiches on top of the advance of a fiches, where b must be set beforehand.
In the case that you lose all fiches, you are owed nothing by the casino;
otherwise, you return the advance of a fiches and keep the b fiches won.
You decide to choose the level b in such a way that the expected value of
the number of fiches you may keep is maximal.
(a) Show that for large a, the maximizing value of b satisfies b ≈ 1/ln(q/p), where q = 1 − p.
matches the red die, you lose your stake; otherwise, you get paid k + 1
dollars if exactly k of the blue dice match the red die. In the case that
exactly one blue die matches the red die, you get paid an additional $0.50
if the other two blue dice show the same number. What is the expected
payoff of the game?
3.22 You play a game in which four fair dice are rolled. The stake is $1. The
payoff is $100 if all four dice show the same number and $10 if two dice
show the same even number and the other two dice show the same odd
number. Is this a favorable game? What about the following game with
a stake of $2? A fair coin is tossed no more than five times. The game
ends if the coin comes up tails or five straight heads appear, whichever
happens first. You get a payoff of $1 each time heads appears, plus a
bonus of $25 if five heads appear in a row.
3.23 A stick of length 1 is broken at random into two pieces. You bet on
the ratio of the length of the longer piece to the length of the smaller
piece. You receive $k if the ratio is between k and k + 1 for some k with
1 ≤ k ≤ m − 1, while you receive $m if the ratio is larger than m. Here,
m is a given positive integer. What should your stake be to make this a
fair bet?
3.24 In the dice game of Pig, you repeatedly roll a single die in any round.
Upon rolling a 1, your turn is over and nothing is added to your score.
Otherwise, you can stop whenever you want and then the total number of
points rolled is added to your score. The goal is to maximize the expected
number of points gained in one round. Under the hold-at-20 rule, you stop
when you have rolled 20 points or more in the round. Explain why the
hold-at-20 rule is optimal.
3.25 You play a dice game in which you repeatedly roll a single die. If the
total number of points rolled exceeds 10, the game is over and you get no
reward. Otherwise, you can stop whenever you want and then your dollar
reward is the accumulated number of points rolled. What is an optimal
stopping rule when you want to maximize the expected value of your
reward?
3.26 A game machine can be used to drop balls into bins. The balls are
dropped one at a time and any ball will land at random into one of 25 bins.
You can stop dropping balls whenever you wish. At the end of the game,
you win $1 for every bin with exactly one ball and you lose $0.50 for
every bin with two or more balls. Empty bins do not count. How does
the one-stage-look-ahead rule work for this problem? Also answer this
question when you lose k/2 dollars rather than half a dollar for every bin
containing k ≥ 2 balls.
3.27 A bag contains w white and r red balls. You pick balls out of the bag
at random, one at a time and without replacement. You win one dollar
each time you pick a white ball. If you pick a red ball, the game is over
and you lose all the money gathered so far. You can stop picking balls
whenever you wish. What is an optimal stopping rule when you want to
maximize your expected gain?
3.28 Consider the following dice game in which you repeatedly roll two dice.
The dollar reward for each roll is the dice total, provided that the two
numbers shown are different; otherwise, the game is over and you lose
the total reward gathered so far. You can stop rolling the dice whenever
you wish. What is an optimal stopping rule when you want to maximize
the expected value of your total reward?
3.29 Let the random variable X be nonnegative and integer valued. Verify that
E(X) = Σ_{k=0}^{∞} P(X > k).
Apply this useful formula to calculate the expected value of the largest
of 10 randomly chosen numbers from 1 to 100.
where P(X = x, Y = y) denotes the probability of the joint event that X takes on the value x and Y the value y. Thus, by E(Z) = Σ_z z P(Z = z), we have E(Z) = Σ_z z Σ_{x,y: x+y=z} P(X = x, Y = y). This gives

E(Z) = Σ_z Σ_{x,y: x+y=z} (x + y) P(X = x, Y = y) = Σ_{x,y} (x + y) P(X = x, Y = y).

Thus, E(Z) = Σ_{x,y} x P(X = x, Y = y) + Σ_{x,y} y P(X = x, Y = y). This expression can be rewritten as

E(Z) = Σ_x x Σ_y P(X = x, Y = y) + Σ_y y Σ_x P(X = x, Y = y).

For fixed x, the events {X = x, Y = y} are disjoint and their union is the event {X = x}. Thus, by the third axiom of probability theory,

Σ_y P(X = x, Y = y) = P(X = x) for any x.

In the same way, Σ_x P(X = x, Y = y) = P(Y = y) for any y. This gives

E(Z) = Σ_x x P(X = x) + Σ_y y P(Y = y) = E(X) + E(Y),
This approximation is very accurate. For example, the approximate and exact
values of E(X) are 29.298 and 29.290 when n = 10.
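For the record, these two numbers are easy to reproduce in Python, assuming E(X) here is the expected number of purchases needed to collect all n distinct coupons, n Σ_{k=1}^{n} 1/k, and that the approximation referred to is the standard n(ln n + γ + 1/(2n)):

    from math import log

    EULER_GAMMA = 0.5772156649

    def expected_exact(n):
        # Expected number of draws to collect all n coupons: n * (1 + 1/2 + ... + 1/n).
        return n * sum(1 / k for k in range(1, n + 1))

    def expected_approx(n):
        return n * (log(n) + EULER_GAMMA + 1 / (2 * n))

    print(round(expected_exact(10), 3), round(expected_approx(10), 3))   # about 29.290 and 29.298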
The problem in Example 3.7 is known as the coupon-collector problem. This
problem has many applications. A nice application is the birthday-coverage
problem. What is the expected number of people needed so that each of the
365 possible birthdays of the year (ignore February 29) is represented in the
group of people? The answer is 365 Σ_{k=1}^{365} 1/k = 2,364.65. The calculation of
Example 3.8 Suppose that m balls (or tokens) are placed sequentially into
one of b bins, where the bin for each ball is selected at random. What is the
expected value of the number of empty bins?
Solution. Let the random variable X be the number of empty bins. Then,
X = X1 + · · · + Xb ,
where the random variable Xi equals 1 if bin i remains empty and 0 otherwise. Each of the m balls misses bin i with probability 1 − 1/b, so E(Xi) = P(Xi = 1) = (1 − 1/b)^m. This gives

E(X) = b (1 − 1/b)^m.

A useful approximation applies to E(X) for b large. Noting that (1 − 1/b)^b ≈ e^(−1) for b large and writing (1 − 1/b)^m as [(1 − 1/b)^b]^(m/b), it follows that E(X) ≈ b e^(−m/b).
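The quality of the approximation is easy to inspect numerically; the values of m and b below are my own illustrative choices, not taken from the text.

    from math import exp

    m, b = 2500, 1000     # 2,500 balls dropped into 1,000 bins

    exact = b * (1 - 1 / b) ** m
    approx = b * exp(-m / b)
    print(round(exact, 1), round(approx, 1))   # about 82.0 and 82.1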
Example 3.9 What is the expected value of the number of times that, in a
thoroughly shuffled deck of 52 cards, two adjacent cards are of the same rank
(two aces, two kings, etc.)?
Solution. Let the random variable Xi be equal to 1 if the cards in positions i
and i + 1 are of the same rank and 0 otherwise. Then P(Xi = 1) = 3/51 and so E(Xi) = 3/51 for i = 1, . . . , 51. The expected value of the number of times that two adjacent cards are of the same rank is given by E(X1 + · · · + X51) = 51 × 3/51 = 3.
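A short simulation sketch (my own, with an arbitrary seed) confirms that the expected number of adjacent equal-rank pairs is 3:

    import random

    rng = random.Random(7)
    deck = [rank for rank in range(13) for _ in range(4)]   # 52 cards: four cards of each of 13 ranks

    def adjacent_same_rank():
        rng.shuffle(deck)
        return sum(deck[i] == deck[i + 1] for i in range(51))

    trials = 50_000
    print(sum(adjacent_same_rank() for _ in range(trials)) / trials)   # close to 3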
Example 3.10 Suppose that you have a graph with n nodes. Each pair of nodes
is connected by an edge with probability p, independently of the occurrence of
edges between other pairs of nodes. What is the expected number of isolated
nodes?
Solution. Label the nodes k = 1, . . . , n and let the indicator variable Ik be
1 if node k is isolated and 0 otherwise. Then the number of isolated nodes is
Σ_{k=1}^{n} Ik. For each k, P(Ik = 1) = (1 − p)^(n−1) and so E(Ik) = (1 − p)^(n−1). Thus, the expected number of isolated nodes is

Σ_{k=1}^{n} E(Ik) = n(1 − p)^(n−1).
As a corollary, the probability that there is at least one isolated node is at most n(1 − p)^(n−1).
This bound is a direct consequence of the inequality P(X ≥ 1) ≤ E(X) for any
nonnegative, integer-valued random variable X (explain!).
Problems
3.30 A group of m people simultaneously enter an elevator at the ground floor.
Each person randomly chooses one of the r floors 1, 2, . . . , r as the exit
floor, where the choices of the persons are independent of each other. The
elevator only stops on a floor if at least one person wants to exit on that
floor. No other people enter the elevator at any of the floors 1, 2, . . . , r.
What is the expected value of the number of stops the elevator will make?
3.31 Use indicator variables to give an alternative proof of the inclusion–
exclusion formula in Rule 1.4. Hint: Use the relation I_{A1^c ∩ ··· ∩ An^c} = (1 − I_{A1}) · · · (1 − I_{An}), where the indicator variable I_A is equal to 1 if
the event A occurs and 0 otherwise.
3.32 What is the expected value of the number of times that two adjacent
letters will be the same in a random permutation of the 11 letters of the
word “Mississippi?”
3.33 Twelve married couples participate in a tournament. The group of 24
people is randomly split into eight teams of three people each. What is
the expected value of the number of teams with a married couple?
3.34 A particle starting at the origin moves every time unit one step up or down with equal probabilities 1/2. Verify that the expected number of returns of the random walk to the zero level during the first n time units is approximately equal to √(2/π) √n for n large. On the basis of this result, can you explain intuitively why the expected time between two successive returns of the random walk to the zero level is infinite? Hint: Use Example 1.5 and the fact that Σ_{j=1}^{m} 1/√j ≈ 2√m for m large.
3.35 In a bowl, there are 2r red balls and 2b blue balls. Each time, you
simultaneously remove two balls from the bowl until the bowl is empty.
What is the expected number of times that you pick two balls of the same
color?
3.36 What is the expected number of distinct birthdays within a randomly
formed group of 100 persons? What is the expected number of children in
a class with r children sharing a birthday with some child in another class
with s children? Assume that the year has 365 equally likely birthdays.
3.37 Let S be a given set consisting of n distinct items. A new set T is
constructed as follows. In each step, an item is chosen at random from
the set S (with replacement) and a copy of the item is added to the set T.
What is the expected value of the number of distinct items in the set T
after n steps?
3.38 Consider again Problem 1.92. What is the expected value of the number
of people who survive the first round?
3.39 A bag contains r red and w white balls. Each time, you take one ball out
of the bag at random and without replacement. You stop as soon as all the
red balls have been taken out of the bag. What is the expected number of
white balls remaining in the bag when you stop?
3.40 Consider again Problem 2.30. What is the expected value of the number
of persons having knowledge of the rumor?
3.41 What is the expected number of times that two consecutive numbers will
show up in a lottery drawing of six different numbers from the numbers
1, 2, . . . , 45?
3.42 You play a sequence of s games, where s ≥ 2. The outcomes of the
various games are independent of each other. The probability that you
will win the kth game is 1k for k = 1, 2, . . . , s. You get one dollar each
time you win two games in a row. What is the expected value of the total
amount you will get?
3.43 You distribute randomly 25 apples over 10 boxes. What is the expected
value of the number of boxes that will contain more than 3 apples?
3.44 Verify that the expected number of permutation cycles in a random
permutation of the integers 1, . . . , n is about ln(n) + γ for n large,
where γ = 0.57722 . . . is Euler’s constant. Hint: For any fixed k with
1 ≤ k ≤ n, define X_i = 1 if the integer i belongs to a cycle of length k
and X_i = 0 otherwise. Use the fact that (1/k) ∑_{i=1}^{n} X_i is the number of cycles
of length k.
3.45 Take a random permutation of the integers 1, 2, . . . , n. Let us say that
the integers i and j with i ≠ j are switched if the integer i occupies the
jth position in the random permutation and the integer j the ith position.
What is the expected value of the total number of switches?
with the convention that an empty sum is zero. However, it is not necessary to
use the probability mass function of the random variable g(X) to calculate its
expected value. The expected value of g(X) can be calculated directly from the
probability mass function of X.
Rule 3.2 For any function g of the random variable X,
E[g(X)] = ∑_{x∈I} g(x) P(X = x),
provided that ∑_{x∈I} |g(x)| P(X = x) < ∞, or the function g(x) is nonnegative.
This rule is called the substitution rule and is also known as the law of the
unconscious statistician. The proof of the rule is simple. Let Y = g(X) and
denote by J the range of Y. Then,
∑_{x∈I} g(x) P(X = x) = ∑_{y∈J} y ∑_{x: g(x)=y} P(X = x) = ∑_{y∈J} y P(g(X) = y)
= ∑_{y∈J} y P(Y = y) = E(Y) = E[g(X)].
Note that the order in which the terms of the series are added does not matter
in view of the assumption that the series is absolutely convergent or has only
nonnegative terms.
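A quick way to convince yourself of the substitution rule numerically is to compute E[g(X)] along both routes; the minimal Python sketch below does this for an arbitrarily chosen probability mass function and the function g(x) = x², both of which are hypothetical choices made only for the check.

```python
from collections import defaultdict
from fractions import Fraction

# Arbitrary illustrative pmf of X on I = {-2, -1, 0, 1, 2} (hypothetical numbers).
pmf_X = {-2: Fraction(1, 10), -1: Fraction(2, 10), 0: Fraction(3, 10),
         1: Fraction(2, 10), 2: Fraction(2, 10)}
g = lambda x: x * x                      # the function g(x) = x^2

# Substitution rule: E[g(X)] = sum over x of g(x) P(X = x).
lhs = sum(g(x) * p for x, p in pmf_X.items())

# Direct route: first build the pmf of Y = g(X), then compute E(Y).
pmf_Y = defaultdict(Fraction)
for x, p in pmf_X.items():
    pmf_Y[g(x)] += p
rhs = sum(y * p for y, p in pmf_Y.items())

print(lhs, rhs)                          # both equal E[g(X)]
assert lhs == rhs
```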
A frequently made mistake of beginning students is to set E[g(X)] equal to
g(E(X)). In general, E[g(X)] ≠ g(E(X))! Stated differently, the average value
of the input X does not in general determine the average value of the output
g(X). As a counterexample, take the random variable X with P(X = 1) =
P(X = −1) = 0.5 and take the function g(x) = x². An exception is the case of
a linear function g(x) = ax + b. Then, by Rule 3.2,
Rule 3.3 For any constants a and b,
E(aX + b) = aE(X) + b,
provided that E(X) is finite.
This rule is valid for any type of random variable X for which E(X) is
finite. Another result that is valid for any type of random variable is Jensen’s
inequality:
Rule 3.4 Suppose that the function g(x) is convex on a line segment
containing the range of the random variable X. Then,
E[g(X)] ≥ g(E(X)),
provided that the expectations are finite.
The proof is simple. For ease, assume that g(x) is differentiable. A differen-
tiable function g(x) is convex only if, for any point x0 , the graph of g(x) lies
entirely above the tangent line at the point x0 :
g(x) ≥ g(x_0) + g′(x_0)(x − x_0) for all x.
Choosing x = X and x0 = E(X) in this inequality and taking the expectation on
both sides of the inequality, the inequality is preserved and Jensen’s inequality
follows. Here we use Rule 3.3 and the fact that E[g1 (X)] ≥ E[g2 (X)] if g1 (x) ≥
g2 (x) for all x.
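A one-line numerical check of Jensen's inequality, added here purely as an illustration, using the convex function g(x) = x² and the two-point random variable from the counterexample above:

```python
# Check E[g(X)] >= g(E(X)) for the convex function g(x) = x**2.
pmf = {1: 0.5, -1: 0.5}                 # the two-point X used in the text's counterexample
g = lambda x: x ** 2

EX = sum(x * p for x, p in pmf.items())
EgX = sum(g(x) * p for x, p in pmf.items())
print(EgX, g(EX))                        # 1.0 versus 0.0: E[g(X)] >= g(E(X))
assert EgX >= g(EX)
```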
3.4.1 Variance
An important case of a function of X is the random variable g(X) = (X − μ)²,
where μ = E(X) denotes the expected value of X and is assumed to be finite.
The expected value of (X − μ)² is called the variance of X. It is denoted by
var(X) = E[(X − μ)²].
It is a measure of the spread of the possible values of X. Often one uses the
standard deviation, which is defined as the square root of the variance. It is
useful to work with the standard deviation, because of the fact that it has the
same units as the random variable X itself.
The formula for var(X) allows for another useful representation. Since
(X − μ)² = X² − 2μX + μ², it follows from the linearity of the expectation
operator and Rule 3.3 that E[(X − μ)²] = E(X²) − 2μE(X) + μ². Therefore,
var(X) is also given by
var(X) = E(X²) − μ².
Rule 3.3 for the expectation operator has the following analogue for the
variance operator:
Rule 3.5 For any constants a and b,
var(aX + b) = a2 var(X).
To illustrate this relation, suppose that the rate of return on stock A is a random
variable X taking on the values 30%, 10%, and −10% with respective proba-
bilities 0.25, 0.50, and 0.25. The rate of return on stock B is a random variable
Y taking on the values 50%, 10%, and −30% with the same probabilities 0.25,
0.50, and 0.25. Without calculating the actual values of the standard deviation,
one can see that the standard deviation of the rate of return on stock B is twice
as large as that on stock A. The explanation is that the random variable Y is
distributed as 2X − 0.1.
Example 3.11 What is the standard deviation of the total score of a roll of two
dice?
Solution. Let the random variable X denote the total score. Since E(X) =
3.5 + 3.5 = 7 and P(X = k) = P(X = 14 − k) = (k − 1)/36 for 2 ≤ k ≤ 7, we get
var(X) = ∑_{k=2}^{7} k² (k − 1)/36 + ∑_{k=8}^{12} k² (13 − k)/36 − 7² = 35/6.
The standard deviation of X is √var(X) = 2.415.
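The numbers in this example are easy to confirm by brute-force enumeration of the 36 equally likely outcomes; a minimal Python sketch, added for illustration:

```python
from itertools import product
from math import sqrt

# Enumerate all 36 equally likely outcomes of two dice.
scores = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]
EX = sum(scores) / 36
EX2 = sum(s * s for s in scores) / 36
var = EX2 - EX ** 2
print(EX, var, sqrt(var))    # 7.0, 5.8333... (= 35/6), 2.415...
```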
Example 3.12 Suppose that the random variable X has the so-called Poisson
distribution P(X = k) = e−λ λk /k! for k = 0, 1, . . . . What are the expected
value and the variance of X?
Solution. It will be shown that the Poisson distribution has the remarkable
property that its variance is equal to its expected value. That is,
var(X) = E(X) = λ.
To verify this, note that E(X) = λe^{−λ} + 2 (λ²/2!) e^{−λ} + 3 (λ³/3!) e^{−λ} + · · · and so
E(X) = λe^{−λ} (1 + λ/1! + λ²/2! + · · ·) = λe^{−λ} e^{λ} = λ,
where the third equality uses the power-series expansion e^x = 1 + x/1! + x²/2! + · · ·
for every x. Using the identity k² = k(k − 1) + k, we have
E(X²) = ∑_{k=0}^{∞} k(k − 1) e^{−λ} λ^k/k! + ∑_{k=0}^{∞} k e^{−λ} λ^k/k! = λ² e^{−λ} ∑_{k=2}^{∞} λ^{k−2}/(k − 2)! + E(X)
= λ² e^{−λ} ∑_{n=0}^{∞} λ^n/n! + λ = λ² + λ.
Hence var(X) = E(X²) − [E(X)]² = λ² + λ − λ² = λ.
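The equality of mean and variance for the Poisson distribution is also easy to see numerically; the sketch below (an illustrative check with an arbitrarily chosen λ) builds the Poisson probabilities recursively and accumulates truncated sums.

```python
from math import exp

lam = 4.7                        # arbitrary Poisson parameter for the check
p = exp(-lam)                    # P(X = 0)
EX = EX2 = total = 0.0
for k in range(400):             # truncate the infinite sums; the tail is negligible
    total += p
    EX += k * p
    EX2 += k * k * p
    p *= lam / (k + 1)           # P(X = k+1) = P(X = k) * lam/(k+1)
print(total, EX, EX2 - EX**2)    # ~1.0, ~4.7, ~4.7: mean and variance both equal lambda
```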
E(X²) = ∑_{i=1}^{b} E(X_i²) + 2 ∑_{i=1}^{b−1} ∑_{j=i+1}^{b} E(X_i X_j).
Thus, E(X_i X_j) = (1 − 2/b)^m for all i ≠ j. This leads to
E(X²) = b (1 − 1/b)^m + b(b − 1) (1 − 2/b)^m.
∑_{k=150}^{250} g(k) P(X = k) = ∑_{k=150}^{200} (−100 + k) (1/101) + ∑_{k=201}^{250} 100 (1/101),
and so
E[g(X)] = 3,825/101 + 5,000/101 = 87.3762.
To find the standard deviation of g(X), we apply the formula var(Z) = E(Z²) −
[E(Z)]² with Z = g(X). Letting h(x) = [g(x)]², we have h(x) = (−100 + x)² for x ≤ 200
and h(x) = 100² for x > 200. Using the substitution rule again, we get
E[h(X)] = ∑_{k=150}^{250} h(k) P(X = k) = ∑_{k=150}^{200} (−100 + k)² (1/101) + ∑_{k=201}^{250} 100² (1/101).
Problems
3.46 There are two investment projects A and B each involving an initial
investment of $2,000. The possible payoffs of investment A are $1,000,
$2,000, and $3,000 with respective probabilities 0.20, 0.40, and 0.40,
while the possible payoffs of investment B are $1,600, $2,000, and
$2,750 with respective probabilities 0.25, 0.35, and 0.40. Calculate the
expected value and the standard deviation of the payoff for each of the
two investments.
3.47 Let the random variable X have a discrete uniform distribution on the
integers a, a + 1, . . . , b with a < b, that is, P(X = k) = 1/(b − a + 1) for
a ≤ k ≤ b. What is the variance of X?
3.48 Suppose that the random variable X satisfies P(X = a) = p and
P(X = b) = 1 − p, where a < b. Verify that var(X) is largest for p = 1/2.
3.49 Consider again Problem 3.33. What is the standard deviation of the
number of teams with a married couple?
3.50 In each drawing of the 6/45 lottery, six different integers are randomly
chosen from 1, 2, . . . , 45. What are the expected value and the standard
deviation of the number of integers from 1, 2, . . . , 45 that do not show up
in 15 drawings?
3.51 The number of paramedical treatments for a particular sports injury is a
random variable X with probability mass function P(X = k) = (11 − k)/55 for
k = 1, . . . , 10.
The manipulations with the summations are justified by the assumption that
the series ∑_{x∈I} x P(X = x) and ∑_{y∈J} y P(Y = y) are absolutely convergent.
The converse of the above result is not true. It is possible that E(XY) =
E(X)E(Y), while X and Y are not independent. A simple example is as follows.
Suppose that two fair dice are tossed. Denote by the random variable V1 the
number appearing on the first die and by the random variable V2 the number
appearing on the second die. Let X = V1 + V2 and Y = V1 − V2 . It is obvious
that the random variables X and Y are not independent. We leave it to the reader
to verify that nevertheless E(XY) = E(X)E(Y).
using the linearity property of the expectation operator and using Rule 3.8
together with E[(X − μX )] = E[(Y − μY )] = 0.
Rule 3.9 is valid for any type of independent random variables X and
Y. Also, Rule 3.9 can be extended directly to the case of finitely many
independent random variables by using the algebraic formula (a_1 + · · · + a_n)² =
∑_{i=1}^{n} a_i² + 2 ∑_{i=1}^{n−1} ∑_{j=i+1}^{n} a_i a_j. Hence, for independent random variables
X_1, X_2, . . . , X_n,
var(X_1 + X_2 + · · · + X_n) = var(X_1) + var(X_2) + · · · + var(X_n).
This alternative formulation follows by noting that σ (aX) = aσ (X) for any
constant a > 0. Rule 3.10 is known as the square-root law. This law is
sometimes called de Moivre’s equation. The law was discovered by Abraham
de Moivre (1667–1754) in 1730.3 This law had an immediate impact on the
way the mass of golden coins that were struck by the London Mint was
controlled. The allowable deviation in the guinea was 1/400 of its intended
weight of 128 grains, which amounts to 0.32 grains. A run of 100 coins
was taken periodically from the Mint and the total weight of these coins was
compared to the weight of a standard of 100 coins. For almost 600 years the
watchdogs allowed a variability of 100 × 0.32 = 32 grains in the weight of the
100 coins, but after the discovery of the square-root law in 1730, the allowable
variability for 100 coins was changed to √100 × 0.32 = 3.2 grains. The gold
for the coins was provided by the kings of England. Ignorance of the square-
root law may have cost them a fortune.
Convolution Formula
Suppose that X and Y are two discrete random variables each having the set of
nonnegative integers as the range of possible values. A useful rule is
Rule 3.11 If the nonnegative, integer-valued random variables X and Y are
independent, then
P(X + Y = k) = ∑_{j=0}^{k} P(X = j) P(Y = k − j) for k = 0, 1, . . . .
This rule is called the convolution rule. The proof is as follows. Fix k. Let A be
the event {X + Y = k} and B_j be the event {X = j}. Then A = ∪_{j=0}^{∞} A B_j. The
events A B_j are disjoint and so P(A) = ∑_{j=0}^{∞} P(A B_j). Thus,
P(X + Y = k) = ∑_{j=0}^{∞} P(X + Y = k, X = j).
Since P(X + Y = k, X = j) = 0 for j > k and, by independence, P(X + Y = k, X = j) =
P(X = j)P(Y = k − j) for j ≤ k, the convolution formula follows. As an illustration,
suppose that X and Y are independent random variables having Poisson distributions
with respective expected values λ and μ. Then
P(X + Y = k) = ∑_{j=0}^{k} e^{−λ} (λ^j/j!) e^{−μ} (μ^{k−j}/(k − j)!) = (e^{−(λ+μ)}/k!) ∑_{j=0}^{k} \binom{k}{j} λ^j μ^{k−j}.
Next, by applying Newton’s binomium (a + b)^k = ∑_{j=0}^{k} \binom{k}{j} a^j b^{k−j}, we get
the desired result P(X + Y = k) = e^{−(λ+μ)} (λ + μ)^k/k! for k = 0, 1, . . . .
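This closure property of the Poisson family under addition can also be checked numerically; the sketch below (an added illustration with arbitrarily chosen λ and μ and a harmless truncation) convolves two Poisson probability mass functions and compares the result with the Poisson(λ + μ) probabilities.

```python
from math import exp

def poisson_pmf(lam, n_max):
    """Return [P(X = 0), ..., P(X = n_max)] for a Poisson(lam) random variable."""
    p, out = exp(-lam), []
    for k in range(n_max + 1):
        out.append(p)
        p *= lam / (k + 1)
    return out

lam, mu, n_max = 2.3, 1.1, 60          # arbitrary parameters for the check
px, py = poisson_pmf(lam, n_max), poisson_pmf(mu, n_max)
pz = poisson_pmf(lam + mu, n_max)

for k in range(10):
    conv = sum(px[j] * py[k - j] for j in range(k + 1))   # convolution formula
    print(k, round(conv, 10), round(pz[k], 10))           # the two columns agree
```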
Problems
3.56 A drunkard is standing in the middle of a very large town square. He
begins to walk. Each step is a unit distance in one of the four directions
east, west, north, and south. All four possible directions are equally
probable. The direction of each step is chosen independently of the
direction of the other steps. The drunkard takes a total of n steps. Verify
that the squared distance between the drunkard’s position after n steps
and his starting position has expected value n for any value of n. Hint:
The squared distance can be written as (∑_{i=1}^{n} X_i)² + (∑_{i=1}^{n} Y_i)², where
the random variables Xi and Yi denote the changes in the x-coordinate and
the y-coordinate of the position of the drunkard caused by his ith step.
3.57 Let the random variable X be defined by X = YZ, where Y and Z are
independent random variables each taking on the values −1 and 1 with
probability 0.5. Verify that X is independent of both Y and Z, but not of
Y + Z.
3.58 Suppose that X_1, . . . , X_n are independent random variables each having
expected value μ and variance σ². Define the sample mean and the
sample variance by the random variables X̄_n = (1/n) ∑_{k=1}^{n} X_k and
S_n² = (1/n) ∑_{k=1}^{n} (X_k − X̄_n)². Show that E(X̄_n) = μ and E(S_n²) = ((n − 1)/n) σ². Hint:
Writing X_k − X̄_n as X_k − μ + μ − X̄_n, verify that ∑_{k=1}^{n} (X_k − X̄_n)² =
∑_{k=1}^{n} (X_k − μ)² − n(X̄_n − μ)².
2 2
3.59 Let Xi denote the number of integers smaller than i that precede i in
a random permutation of the integers 1, . . . , 10. What are the expected
value and the variance of the sum X_2 + · · · + X_10? Hint: Verify that
X_i = ∑_{k=2}^{i} R_k, where R_k = 1 if number k − 1 precedes number k in
the random permutation and R_k = 0 otherwise.
3.60 Suppose that X and Y are independent random variables each having the
same probability mass function {(1 − p)^{i−1} p, i = 1, 2, . . .}. What is the
probability mass function of X + Y?
3.61 Let X1 , X2 , . . . be a sequence of independent random variables each
having the same distribution with finite mean. Suppose N is a nonneg-
ative, integer-valued random variable such that the event {N ≥ n} does
not depend on X_n, X_{n+1}, . . . for any n ≥ 1. For the case that the X_k
are nonnegative, prove Wald’s equation⁴ stating that E(∑_{k=1}^{N} X_k) =
E(N)E(X_1). Hint: Define the indicator variable I_k by I_k = 1 if N ≥ k
and I_k = 0 otherwise and note that ∑_{k=1}^{N} X_k = ∑_{k=1}^{∞} X_k I_k.
It is worthwhile pointing out that the binomial distribution has nearly all
of its mass within three standard deviations of the expected value when
np(1 − p) ≥ 25. In a bold advertisement campaign, a beer company once made
clever use of this fact. The beer company invited 100 beer drinkers for a blind
taste test to compare the company’s beer with the beer of a key competitor.
The live taste test was broadcast at half time of the Super Bowl, in front of tens
of millions of people watching the playoffs on television. The brilliant move of
the marketeers was not to invite the average beer drinker, but beer drinkers
of the competing beer. Since many beers taste about the same and the typical
beer drinker cannot tell the difference between two such beers in a blind taste
test, the marketeers were pretty sure that the test would reveal that 35% or
more of the beer drinkers would prefer the other beer over their favorite beer
(a value less than 35 for a binomial random variable with n = 100 and
p = 1/2 would be more than three standard deviations below the expected
value). Such a test result would be very impressive, since it concerns beer
drinkers who stated that they swear by the competing beer. In the advertisement
campaign the laws of probability indeed acted in favor of the beer company, as
anticipated.
A classical problem of historical significance in which the binomial distri-
bution shows up is the problem of points. This problem was the subject of
the correspondence between the great French mathematicians Blaise Pascal
(1623–1662) and Pierre de Fermat (1601–1665). This correspondence laid
the groundwork for the birth of the study of probability. It was the cata-
lyst that enabled probability theory to develop beyond mere combinatorial
enumeration.5
Example 3.14 Two equally strong players A and B play a ball game such that a
total of six points is required to win the game. Each inning counts for one point.
The pot is 24 ducats and goes to the winner of the game. By some incident,
the players cannot finish the game when player A has five points and player B
three. How should the pot be divided between the two players?
Solution. This problem had been the subject of debate for quite some time
before it was solved by Pascal and Fermat in the seventeenth century. In the
fifteenth and sixteenth centuries, the Italian mathematicians Pacioli, Tartaglia,
and Cardano proposed different schemes to divide the pot, but these schemes
were based on deterministic arguments and were not satisfying. The idea of
Pacioli was to divide the pot according to the score 5:3 and to split the pot
of 24 ducats into (5/8) × 24 = 15 ducats for player A and 9 ducats for player B.
Tartaglia made the following comment to Pacioli’s rule: “his rule seems neither
agreeable nor good, since, if one player has, by chance, one point and the other
no points, then, by following this rule, the player who has one point should
take the whole pot, which obviously does not make sense.” Tartaglia proposed
another scheme that is based on the difference between the numbers of points
already won by the two players. He calculated that the difference 5 − 3 = 2 in
the points already won by the two players is one-third of the required 6 points
and proposed giving player A half of the pot plus one-third of half of the pot,
which amounts to 12 + (1/3) × 12 = 16 ducats.
In the summer of 1654, Pascal and Fermat addressed the problem of
points. They realized that in a fair division of the pot, the amounts of
prize money received by the players should be proportional to the respective
win probabilities of the players if the game were to be continued. These
probabilities depend only on the number of points left to be won. At the point
of stopping, player A is 1 point away from the required 6 points and player B
3 points. In the actual game, at most 1 + 3 − 1 = 3 more innings would be
needed to declare a winner. A trick to solve the problem is to imagine that three
additional innings were played. The probability of player A being the ultimate
winner if the original game was to be continued is the same as the probability
5 Until the middle of the seventeenth century, probability calculations were restricted to counting
the number of favorable outcomes and the total number of outcomes in games of chance, using
the handbook Liber de Ludo Aleae (The Book of Games of Chance) by the colorful Italian
mathematician and physician Gerolamo Cardano (1501–1576).
that A would win one or more innings in three additional innings (explain!).
This probability is given by the binomial probability
∑_{k=1}^{3} \binom{3}{k} (1/2)^k (1/2)^{3−k} = 7/8.
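Following the division principle of Pascal and Fermat stated above (stakes proportional to the win probabilities in the imagined extra innings), the computation is a few lines of Python; the sketch below is an added illustration and the resulting 7 : 1 split of the 24 ducats simply applies that principle.

```python
from math import comb

# Player A needs 1 more point, player B needs 3; imagine 3 extra innings.
p_A = sum(comb(3, k) * 0.5**3 for k in range(1, 4))   # P(A wins at least one of 3 innings)
print(p_A)                                            # 0.875 = 7/8

pot = 24
print(p_A * pot, (1 - p_A) * pot)                     # proportional division: 21 and 3 ducats
```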
This probability has the value 0.0038 for r = 2 and the value 0.00037 for
r = 3. Therefore, three reserve detectors should be installed.
Problems
3.62 Daily Airlines flies every day from Amsterdam to London. The price for
a ticket on this popular route is $75. The aircraft has a capacity of 150
passengers. Demand for tickets is greater than capacity, and tickets are
sold out well in advance of flight departures. The airline company sells
160 tickets for each flight, to protect itself against no-show passengers.
The probability of a passenger being a no-show is q = 0.1. No-show
passengers are refunded half the price of their tickets. Passengers that do
show up and are not able to board the flight due to the overbooking are
refunded the full amount of their tickets plus an extra $425 compensation.
What is the probability that more passengers will turn up for a flight than
the aircraft has the seating capacity for? What are the expected value and
standard deviation of the daily return for the airline?
3.63 What are the chances of getting at least one six in one throw of 6 dice,
at least two sixes in one throw of 12 dice, and at least three sixes in one
throw of 18 dice? Which game do you think is more likely to win? This
problem is known as the Newton–Pepys problem and was brought to the
attention of Isaac Newton by Samuel Pepys, who was President of the
Royal Society of London and apparently a gambling man.
3.64 The Yankees and the Mets are playing a best-four-of-seven series. The
winner takes all the prize money of one million dollars. The two teams
are evenly matched and the outcomes of the games are independent of
each other. Unexpectedly, the competition must be suspended when the
Yankees lead two games to one. How should the prize money be divided
between the two teams?
3.65 The Salk vaccine against polio was tested in 1954 in a carefully designed
field experiment. Approximately 400,000 children took part in this exper-
iment. Using a randomization procedure, the children were randomly
divided into two groups of equal size, a treatment group and a control
group. The vaccine was given only to the children in the treatment group;
the control-group children received placebo injections. The children did
not know which of the two groups they had been placed into. The
diagnosticians also lacked this information (double-blind experiment).
Fifty-seven children in the treatment group went on to contract polio,
while 142 children in the control group contracted the illness. How likely
would such a difference in outcomes be when the assignment to the
treatment or control group had absolutely no effect on the outcomes?
3.66 A game of chance played historically by Canadian Indians involved
throwing eight flat beans into the air and seeing how they fell. The beans
were symmetrical and were painted white on one side and black on the
other. The bean thrower would win one point if an odd number of beans
came up white, two points if either zero or eight white beans came up,
and would lose one point for any other configurations. Does the bean
thrower have the advantage in this game?
3.67 In order to qualify for a certain program, you are put through n tests. You
pass any test with probability p, independently of the other tests. If you
have passed k of the n tests, then you are admitted to the program with
probability k/n for k = 0, 1, . . . , n. What is the unconditional probability
that you will be admitted to the program?
3.68 The final match of the soccer World Cup is a draw at the end of extra
time. The teams then proceed to penalty kicks. Each team attempts five
shots. Suppose that each shot results in a goal with probability 0.7,
independently of the other shots. What is the probability that the match
is still tied after one round of five shots by each team?
3.69 You flip 100 coins. Those that land heads are set aside. You then flip the
coins that have landed tails and again set aside all those that land heads.
Finally, you flip a third time all coins that landed tails twice and again set
aside all those that land heads. What is the probability mass function of
the number of coins that are set aside?
3.70 In an ESP experiment, a medium has to guess the correct symbol on each
of 250 Zener cards. Each card has one of the five possible Zener symbols
on it and each of the symbols is equally likely to appear. The medium
will get $100,000 if he gives 82 or more correct answers. Can you give a
quick assessment of the probability that the medium has to be paid out?
3.71 On bridge night, the cards are dealt ten times. Only twice you do receive
cards with an ace. From the beginning, you had your doubts as to whether
the cards were being shuffled thoroughly. Are these doubts confirmed?
3.72 An experiment has r possible outcomes O1 , O2 , . . . , Or with respective
probabilities of p1 , p2 , . . . , pr . Suppose that n independent repetitions of
the experiment are performed. Let the random variable Xi be the number
of times that the outcome Oi occurs. Argue that
P(X_1 = x_1, X_2 = x_2, . . . , X_r = x_r) = [n!/(x_1! x_2! · · · x_r!)] p_1^{x_1} p_2^{x_2} · · · p_r^{x_r}
for all nonnegative integers x1 , x2 , . . . , xr with x1 + x2 + · · · + xr = n.
This distribution is called the multinomial distribution.
3.73 A particular game is played with five poker dice. Each die displays an
ace, king, queen, jack, ten, and nine. Players may bet on two of the six
images displayed. When the dice are thrown and the bet-on images turn
up, the player receives three times the amount wagered. In all other cases,
the amount of the wager is forfeited. Is this game advantageous for the
player?
The Poisson model was of great importance in the analysis of the distribution
of hits of flying bombs (V-1 and V-2 missiles) in London during World War
II. The British authorities were anxious to know if these weapons could be
aimed accurately at a particular target, or whether they were landing at random.
If the missiles were in fact only randomly targeted, the British could simply
disperse important installations to decrease the likelihood of their being hit.
An area of 36 km2 in South London was divided into 576 regions, 250 m wide
by 250 m long, and the number of hits in each region was determined. The
576 regions were struck by 535 bombs, and so the average number of hits per
region was 0.9288. There were 229 regions with zero hits, 211 regions with
one hit, 93 regions with two hits, 35 regions with three hits, 7 regions with
four hits, 1 region with five hits, and 0 regions with six or more hits. The
analysis showed that the observed relative frequencies of the number of hits
were each very close to the corresponding probabilities given by a Poisson
distribution with expected value 0.9288. This implied that the distribution of
hits in the area was much like the distribution of hits when each of the many
flying bombs was to fall on any of the equally sized regions with the same small
probability, independently of the other flying bombs. It will be seen below that
this characterizes the Poisson distribution. The analysis convinced the British
military that the bombs struck at random and had no advanced aiming ability.
The Poisson distribution can be seen as a limiting case of the binomial
distribution with parameters n and p, when n is very large and p is very
small. That is, in a very large number of independent repetitions of a chance
experiment having a very small probability of success, the total number of
successes is approximately Poisson distributed. This insight is essential in
order to apply the Poisson distribution in practical situations. To give a precise
mathematical formulation of the limiting result, let X represent a binomially
distributed random variable with the parameters n and p. Assume now that n
becomes very large and p becomes very small, while np is kept equal to the
constant λ. The following is then true:
lim_{n→∞, p→0} P(X = k) = e^{−λ} λ^k/k! for k = 0, 1, . . . .
The proof is as follows. Since p = λ/n, P(X = k) = \binom{n}{k} (λ/n)^k (1 − λ/n)^{n−k} and so
P(X = k) = (λ^k/k!) (1 − λ/n)^n [n!/(n^k (n − k)!)] (1 − λ/n)^{−k}.
k n
Now let’s look at the different terms separately. Take a fixed value for k, where
0 ≤ k ≤ n. The term n!/(n^k (n − k)!) can be written as
n(n − 1) · · · (n − k + 1)/n^k = (1 − 1/n) · · · (1 − (k − 1)/n).
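The convergence of the binomial(n, λ/n) probabilities to the Poisson(λ) probabilities can be watched numerically; the following sketch is an added illustration with λ = 1 chosen arbitrarily.

```python
from math import comb, exp, factorial

lam = 1.0
for n in (10, 100, 1000, 10000):
    p = lam / n
    binom = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(5)]
    print(n, [round(b, 5) for b in binom])        # binomial probabilities P(X = 0..4)

poisson = [exp(-lam) * lam**k / factorial(k) for k in range(5)]
print("limit", [round(q, 5) for q in poisson])    # the Poisson limit
```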
the Poisson model. In order to illustrate this, suppose you read in a newspaper
that, based on an average of 1,000 traffic deaths per year in previous years, the
number of traffic deaths for last year rose 12%. How can you evaluate this?
The number of traffic deaths over a period of one year can be modeled as a
Poisson-distributed random variable with expected value 1,000 (why is this
model reasonable?). An increase of 12% on an average of 1,000 is an increase
of 120, or rather an increase of 120/√1,000 = 3.8 standard deviations above
the expected value of 1,000. This can be considered exceptional. In this way,
we find justification for the conclusion that the increase in the number of traffic
deaths is not coincidental. What would your conclusions have been if, based on
an average of 100 traffic deaths per year, a total of 112 traffic deaths occurred
in the past year? You see that the Poisson model is a practically useful model
to make quick statistical assessments.
Example 3.18 The Pegasus Insurance Company has introduced a policy that
covers certain forms of personal injury with a standard payment of $100,000.
The yearly premium for the policy is $25. On average, 100 claims per year
lead to payment. There are more than one million policyholders. What is the
probability that more than 15 million dollars will have to be paid out in the
space of a year?
Solution. In fact, every policyholder conducts a personal experiment in prob-
ability after purchasing this policy, which can be considered to be “successful”
if the policyholder files a rightful claim during the ensuing year. Let the
random variable X denote the total number of claims that will be approved for
payment during the year of coverage. In view of the many policyholders, there
is a large number of independent probability experiments each having a very
small probability of success. This means that the probability distribution of the
random variable X can be modeled by a Poisson distribution with parameter
λ = 100. The probability of having to pay out more than 15 million dollars
is given by P(X > 150). Since E(X) = 100 and σ (X) = 10, a value
of 150 claims lies five standard deviations above the expected value. Thus,
without doing any further calculations, we can draw the conclusion that the
probability of paying out more than 15 million dollars in the space of a year
must be extremely small. The precise value of the probability P(X > 150) is
1 − ∑_{k=0}^{150} e^{−100} 100^k/k! = 1.23 × 10^{−6}. Not a probability the insurance executives
need worry about.
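The tail probability quoted here can be reproduced directly from the Poisson(100) probabilities; the short sketch below, added as an illustration, builds the terms recursively to avoid overflow.

```python
from math import exp

lam = 100.0
p = exp(-lam)                 # P(X = 0); about 3.7e-44, still representable
cdf = 0.0
for k in range(151):          # accumulate P(X <= 150)
    cdf += p
    p *= lam / (k + 1)
print(1.0 - cdf)              # P(X > 150), approximately 1.23e-6
```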
sheik’s visit to Lightstone, the jackpot is listed at 12.5 million dollars. The oil
sheik decides to take a gamble and orders his retinue to fill in 5 million tickets
in his name. These 5 million tickets are not filled in by hand but are random
picks generated by the computer of the lottery organization (inevitably, many
six-number sequences will be generated more than once6 ). Suppose that the
local people have purchased 2 million tickets and that these tickets are also
random picks. Each ticket costs $1. What is the probability that the sheik will
be among the jackpot winners and what is the probability that the sheik will be
the only winner? What is the probability that the sheik’s initial outlay will be
won back from the jackpot?
Solution. The probability that a particular ticket is a winning ticket for the
jackpot is 1/\binom{40}{6} = 1/3,838,380. Very many tickets are filled in by the sheik
and the local people. Therefore, the Poisson model is applicable. Let the
random variable X denote the number of winning tickets for the jackpot among
the 5 million tickets of the sheik and the random variable Y be the number of
winning tickets for the jackpot among the 2 million tickets of the locals. The
random variables X and Y can be modeled as Poisson random variables with
respective parameters
λ = 5,000,000/3,838,380 and μ = 2,000,000/3,838,380.
The random variables X and Y are independent. The probability that the sheik
will be among the jackpot winners is
P(X ≥ 1) = 1 − e−λ = 0.7282.
The probability that the sheik is the only jackpot winner is
P(X ≥ 1, Y = 0) = P(X ≥ 1)P(Y = 0) = (1 − e−λ ) e−μ = 0.4325.
To answer the last question, let p(r, s) denote the probability that the sheik has
r winning tickets for the jackpot and the locals have s winning tickets for the
jackpot. Then,
p(r, s) = e^{−λ} (λ^r/r!) × e^{−μ} (μ^s/s!) for r, s = 0, 1, . . . .
Let A = {(r, s) : (r/(r + s)) × 12.5 ≥ 5}. Then, the probability that the sheik’s initial
outlay will be won back from the jackpot is ∑_{(r,s)∈A} p(r, s) = 0.6935.
6 The expected number of different combinations among 5 million random picks is 2,795,050, as
follows by using the result of Example 3.8 with m = 5 × 10^6 and b = \binom{40}{6}.
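The three probabilities of this example (0.7282, 0.4325, and 0.6935) can be reproduced with a small amount of code; the sketch below is an added illustration that truncates the Poisson ranges where the remaining mass is negligible.

```python
from math import exp

lam = 5_000_000 / 3_838_380     # expected number of winning tickets for the sheik
mu = 2_000_000 / 3_838_380      # expected number of winning tickets for the locals

def poisson_pmf(a, n_max):
    p, out = exp(-a), []
    for k in range(n_max + 1):
        out.append(p)
        p *= a / (k + 1)
    return out

px, py = poisson_pmf(lam, 40), poisson_pmf(mu, 40)    # truncation at 40 is more than enough
print(1 - exp(-lam))                                  # P(X >= 1) = 0.7282
print((1 - exp(-lam)) * exp(-mu))                     # P(X >= 1, Y = 0) = 0.4325

# The sheik wins back his 5 million outlay when his share r/(r+s) of the
# 12.5 million jackpot is at least 5 million, i.e., when r >= 1 and s <= 1.5 r.
prob = sum(px[r] * py[s] for r in range(1, 41) for s in range(41) if s <= 1.5 * r)
print(prob)                                           # approximately 0.6935
```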
P(X = j) = e^{−λp} (λp)^j/j! and P(Y = k) = e^{−λ(1−p)} (λ(1 − p))^k/k!
for all j, k ≥ 0. Thus, X and Y are Poisson distributed with expected values λp
and λ(1 − p). Moreover,
P(X = j, Y = k) = P(X = j)P(Y = k) for all j, k ≥ 0,
showing the remarkable result that X and Y are independent.
weak because of the large number (365) of possible birth dates. It is therefore
reasonable to approximate the distribution of X by a Poisson distribution with
expected value λ = np = (3/2) m(m − 1)/365. In particular, P(X ≥ 1) ≈ 1 − e^{−λ}.
Thus,
P(two or more people have birthdays within one day of each other)
≈ 1 − e^{−(3/2) m(m−1)/365}.
More generally, the Poisson heuristic leads to the approximation formula
1 − e^{−((2r+1)/2) m(m−1)/365}
for the probability that two or more people in a randomly assembled group
of m people have their birthdays within r days of each other. For r = 1,
the smallest value of m for which this approximate probability is at least 0.5
is 14. The approximate value of the probability that two or more people in a
group of m = 14 people will have birthdays within one day of each other is
1 − e−0.74795 = 0.5267. The exact value is 0.5375, and is calculated from the
following formula for the probability of two or more people in a group of n
people having their birthdays within r days from each other:8
1 − (365 − 1 − nr)!/[365^{n−1} (365 − (r + 1)n)!].
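The quality of the Poisson-heuristic approximation can be checked against the exact formula of Naus; the sketch below, added for illustration, does so for r = 1 and a group of 14 people, reproducing the values 0.5267 and 0.5375 quoted above.

```python
from math import exp

def approx_near_birthday(m, r):
    """Poisson-heuristic approximation from the text."""
    return 1 - exp(-(2 * r + 1) * m * (m - 1) / (2 * 365))

def exact_near_birthday(n, r):
    """Exact formula: 1 - (365-1-nr)! / (365^(n-1) * (365-(r+1)n)!)."""
    ratio = 1.0
    for j in range(365 - (r + 1) * n + 1, 365 - 1 - n * r + 1):
        ratio *= j / 365      # accumulates the factorial ratio divided by 365^(n-1)
    return 1 - ratio

print(approx_near_birthday(14, 1))   # 0.5267...
print(exact_near_birthday(14, 1))    # 0.5375...
```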
Another interesting problem is the following. How many people are required
in order to have all 365 possible birthdays covered with a probability of at least
50%? This problem is called the birthday-coverage problem and can be framed
as a coupon-collector problem, where birthdays correspond to coupons and
people to purchases. The coupon-collector problem is as follows. Each time
you purchase a certain product you receive a coupon, which is equally likely
to be any one of n types. What is the probability of collecting a complete set
of coupons in exactly r purchases? Imagine that you conduct a trial for each
of the n coupons. Trial i is said to be successful if coupon i is not among the r
purchases. Each trial has the same success probability p = ((n − 1)/n)^r. A complete
set of coupons is obtained only if there is no successful trial. Thus, by the
Poisson heuristic, the probability of collecting a complete set of coupons in
exactly r purchases is approximately
e^{−n×[(n−1)/n]^r}.
8 This formula is taken from J. I. Naus, “An extension of the birthday problem,” The American
Statistician 22 (1968): 27–29.
k 0 1 2 3 4 5 6 7
exact 0.0162 0.0689 0.1442 0.1982 0.2013 0.1613 0.1052 0.060
approx. 0.0183 0.0733 0.1465 0.1954 0.1954 0.1563 0.1042 0.060
Problems
3.74 The Brederode Finance Corporation has begun the following adver-
tising campaign in Holland. Each new loan application submitted is
9 The exact values are taken from F. F. Knudsen and I. Skau, “On the asymptotic solution of a
card-matching problem,” Mathematics Magazine 69 (1996): 190–197.
practically a sure thing. Imagine 100 6/42 lotteries all over the world. In
each lottery, each of one million people sends in five randomly chosen
sequences of six numbers from the numbers 1 to 42 for each drawing. In
each lottery there are two drawings every week. What is the probability
that in at least one of the lotteries some person will win the jackpot twice
in a period of three years?
3.81 The football pool is a betting pool based on predicting the outcome of
13 football matches in the coming week. In the Dutch football pool the
average number of weekly winners who have correctly predicted all 13
matches is 0.25. Last week there were 3 winners. Can you explain why
this is exceptional?
3.82 Let X be a random variable on the integers 0, 1, . . . and λ > 0 be a given
number. Verify that X is Poisson distributed with expected value λ if and
only if E[λg(X + 1) − Xg(X)] = 0 for any bounded function g(x) on
the nonnegative integers. This result is called the Stein–Chen identity for
the Poisson distribution. Hint: To verify the “if” assertion, choose g(x) as
g(r) = 1 and g(x) = 0 for x ≠ r, where r is a given nonnegative integer.
3.83 Use the Poisson heuristic to approximate the probability that three or
more people from a randomly selected group of 25 people will have
birthdays on the same day and the probability that three or more persons
from the group will have birthdays falling within one day of each other.
3.84 Use the Poisson heuristic to approximate the probability that some suit
will be missing in a bridge hand of 13 cards.
3.85 Consider again Problem 1.94. Use the Poisson heuristic to approximate
the probability that no couple will be paired as bridge partners.
3.86 What is the probability that no two cards of the same face value (two
aces, for example) will succeed one another in a well-shuffled deck of 52
playing cards? Use the Poisson heuristic to verify that this probability is
approximately e−3 .
3.87 A company has 75 employees in service. The administrator of the
company notices, to his astonishment, that there are seven days on which
two or more employees have birthdays. Use the Poisson heuristic to argue
that this is not so astonishing after all.
3.88 What is the probability that in a random permutation of the integers
1, 2, . . . , n, no two integers have interchanged their original positions?
Use the Poisson heuristic to show that this probability is approximately
equal to 1/√e for large n.
3.89 Imagine a family of five brothers and sisters, together with their spouses.
The family does a gift exchange every year at Christmas. Each member
of the family buys a Christmas gift. The gifts are labeled 1 to 10, where
each person knows the label of their gift and the label of the gift of their
spouse. Cards with the numbers 1, . . . , 10 are put in a hat. The family
members consecutively pull a card out of the hat in order to determine
the present he or she will receive. If a person draws a card with their own
label, or the card with the label of their spouse, all the cards go back in the
hat and the drawing is redone. Use the Poisson heuristic to approximate
the probability that the drawing need not be redone.
3.90 Consider again Example 3.8. Use the Poisson heuristic to verify that
the probability mass function of the number of empty bins can be
approximated by a Poisson distribution with expected value b(1 − 1/b)^m
for large b.
3.91 Use the Poisson heuristic to approximate the probability that in a
randomly formed group of m people, nobody has a lone birthday. What
is the smallest value of m so that this probability is at least 50%?
3.92 Sixteen teams remain in a soccer tournament. A drawing of lots will
determine which eight matches will be played. Before the drawing takes
place, it is possible to place bets with bookmakers over the outcome of
the drawing. You are asked to predict all eight matches, paying no regard
to the order of the two teams in each match. Use the Poisson heuristic
to approximate the probability distribution of the number of correctly
predicted matches.
3.93 In the 6/45 lottery, six different numbers are randomly picked from the
numbers 1 to 45. Use the Poisson heuristic to approximate the probability
of two or more consecutive numbers in a random pick and the probability
of three or more consecutive numbers.
together with the observation that E(X_i²) = E(X_1²) for all i and E(X_i X_j) =
E(X_1 X_2) for all i ≠ j. Therefore,
E(X²) = nE(X_1²) + n(n − 1)E(X_1 X_2).
Since E(X_1 X_2) = P(X_1 = 1, X_2 = 1) = P(X_1 = 1)P(X_2 = 1 | X_1 = 1), it
follows that E(X_1 X_2) = [R/(R + W)] × [(R − 1)/(R + W − 1)]. Thus,
E(X²) = n R/(R + W) + n(n − 1) [R/(R + W)] × [(R − 1)/(R + W − 1)].
Next, it is a matter of some algebra to get the desired expression for var(X).
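The expression for E(X²) derived above is easy to verify against a brute-force computation over the hypergeometric probability mass function; the sketch below is an added illustration with arbitrarily chosen R, W, and n.

```python
from math import comb

R, W, n = 7, 13, 5        # arbitrary parameters for the check

# Brute force: E(X^2) from the hypergeometric pmf.
EX2_pmf = sum(k * k * comb(R, k) * comb(W, n - k) / comb(R + W, n)
              for k in range(max(0, n - W), min(R, n) + 1))

# Closed form derived in the text.
EX2_formula = n * R / (R + W) + n * (n - 1) * (R / (R + W)) * ((R - 1) / (R + W - 1))

print(EX2_pmf, EX2_formula)   # the two values coincide
```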
The hypergeometric model is a versatile model and has many applications,
particularly to lotteries. The hypergeometric distribution with parameters R, W,
and n can be approximated by the binomial distribution with parameters n and
p = R/(R + W) if R + W ≫ n. Then it does not make much difference whether the
sampling is with or without replacement. Indeed, the above formulas for E(X)
and var(X) are very close to those of the binomial distribution with parameters
n and p = R/(R + W) if R + W ≫ n.
Example 3.20 The New York state lottery offers the game called Quick Draw,
a variant of Keno. The game can be played in bars, restaurants, bowling areas,
and other places. A new game is played every four or five minutes and so a lot
of games can be played on a day. In the game of Quick Draw a maximum of
20 individual bets can be made with a single game card. Each individual bet
can be for 1, 2, 5, or 10 dollars. A player chooses four numbers from 1 to 80.
The lottery then randomly chooses 20 numbers from 1 to 80. The number of
matches determines the payoff. The payoffs on an individual bet are 55 times
the bet size for four matches, 5 times the bet size for three matches, once the
bet size for two matches, and nothing otherwise. In November 1997, the state
lottery offered a promotion “Big Dipper Wednesday,” where payoffs on the
game were doubled on the four Wednesdays in that month. Is this a good deal
for the player or just a come-on for a sucker bet?
Solution. Let us first have a look at the game with the ordinary payoffs. The
hypergeometric model with R = 4, W = 76, and n = 20 is applicable to this
game of Quick Draw. Let the random variable X indicate how many numbers
on a single ticket are matched. Then,
P(X = k) = \binom{4}{k} \binom{76}{20 − k} / \binom{80}{20} for k = 0, 1, . . . , 4.
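With this probability mass function, the expected payoff per dollar staked under the ordinary and the doubled "Big Dipper" payoffs follows from a short computation; the sketch below is an added illustration using the payoff multiples quoted above (55, 5, and 1 times the bet).

```python
from math import comb

# Hypergeometric pmf for the number of matches: R = 4, W = 76, n = 20.
pmf = [comb(4, k) * comb(76, 20 - k) / comb(80, 20) for k in range(5)]

payoff = {4: 55, 3: 5, 2: 1}                      # payoff per dollar for k matches
expected = sum(payoff.get(k, 0) * pmf[k] for k in range(5))
print(expected)          # about 0.60 per dollar: a steep house edge with ordinary payoffs
print(2 * expected)      # about 1.19 per dollar: doubled payoffs favor the player
```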
what is the probability that the removal of the illegal votes will change the
result of the election?
Solution. The problem can be translated into the urn model with 1,422 red
and 1,405 white balls. If a is the number of illegal votes for candidate A and
b the number of illegal votes for candidate B, then candidate A will no longer
have most of the votes only if a − b ≥ 17. Since a + b = 101, the inequality
a − b ≥ 17 boils down to a ≥ 59. The probability that the removal of the
illegal votes will change the result of the election is the same as the probability
of picking 59 or more red balls from an urn that contains 1,422 red and 1,405
white balls. This probability is given by
∑_{a=59}^{101} \binom{1,422}{a} \binom{1,405}{101 − a} / \binom{2,827}{101} = 0.0592.
Example 3.22 Two people, perfect strangers to one another, both living in the
same city of one million inhabitants, meet each other. Each has approximately
500 acquaintances in the city. Assuming that for each of the two people, the
acquaintances represent a random sampling of the city’s various population
sectors, what is the probability of the two people having an acquaintance in
common?
Solution. A bit of imagination shows that this problem can be translated
into the urn model with R = 500 red and W = 999,498 white balls, where
the red balls represent the 500 acquaintances of the first person. The sought
probability is given by the probability that at least one red ball will be drawn
when 500 balls are randomly
999,998 taken out of the urn. This probability is equal to
1 − 999,498
500 / 500 = 0.2214. The probability of 22% is surprisingly large.
Events are often less “coincidental” than we may tend to think!
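The value 0.2214 can be reproduced without evaluating the huge binomial coefficients directly, by writing their ratio as a product; a minimal sketch, added for illustration:

```python
# P(no common acquaintance) = C(999498, 500) / C(999998, 500)
#                           = prod_{i=0}^{499} (999498 - i) / (999998 - i)
ratio = 1.0
for i in range(500):
    ratio *= (999_498 - i) / (999_998 - i)
print(1 - ratio)          # approximately 0.2214
```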
Problems
3.94 In the game “Lucky 10,” 20 numbers are drawn from the numbers 1 to
80. You tick 10 numbers on the game form. What is the probability of
matching r of the 20 numbers drawn for r = 0, 1, . . . , 10?
3.95 The New Amsterdam lottery offers the game Take Five. In this game,
players must tick five different numbers from the numbers 1 to 39.
The lottery draws five distinct numbers from the numbers 1 to 39. For
every one dollar staked, the payoff is $100,000 for five correct numbers,
$500 for four correct numbers, and $25 for three correct numbers.
For two correct numbers, the player wins a free game. What is the house
percentage for this lottery?
3.96 Ten identical pairs of shoes are jumbled together in one large box.
Without looking, someone picks four shoes out of the box. What is the
probability that, among the four shoes chosen, there will be both a left
and a right shoe?
3.97 A psychologist claims that he can determine from a person’s handwrit-
ing whether the person is left-handed or not. You do not believe the
psychologist and therefore present him with 50 handwriting samples,
of which 25 were written by left-handed people and 25 were written
by right-handed people. You ask the psychologist to say which 25 were
written by left-handed people. Will you change your opinion of him if
the psychologist correctly identifies 18 of the 25 left-handers?
3.98 The Massachusetts Cash Winfall lottery was established in 2004 and
was ended in 2012. In this lottery, the jackpot was won when the
six numbers chosen from 1 through 46 were correctly predicted. The
Cash Winfall lottery had the special feature that the jackpot was “rolled
down,” with the secondary prizes increased when the jackpot rose above
$2 million and was not won. Several gambling syndicates made big
profits by buying tickets in bulk when the jackpot was approaching
$2 million. Suppose that a syndicate has invested $400,000 in buying
200,000 randomly selected $2 tickets when the jackpot is “rolled down.”
The roll-down prizes are $25,000, $925, and $27.50 for matching five,
four, or three numbers, respectively. What is the expected amount of
money won by the syndicate? How can you approximate the standard
deviation of the amount won?
3.99 For a final exam, your professor gives you a list of 15 items to study.
He indicates that he will choose eight for the actual exam. You will be
required to answer correctly at least five of those. You decide to study
10 of the 15 items. What is the probability that you will pass the exam?
3.100 In the 6/45 lottery, six different numbers are drawn at random from the
numbers 1 to 45. What are the probability mass functions of the largest
number drawn and the smallest number drawn?
3.101 You play Bingo together with 35 other people. Each player purchases
one card with 24 different numbers that were selected at random out of
the numbers 1 to 80. The organizer of the game calls out chosen distinct
numbers between 1 and 80, randomly and one at a time. What is the
probability that more than 70 numbers must be called out before one of
the players has achieved a full card? What is the probability that you will
be the first player to achieve a full card while no other player has a full
card at the same time as you? What is the probability that you will be
among the first players achieving a full card? Hint: Let Qk = P(X > k),
where the random variable X counts how many numbers have to be
called out before a particular player has achieved a full card.
3.102 Suppose that r different numbers are picked at random from the num-
bers 1, 2, . . . , s. Let the random variable X be the sum of the r numbers
picked. Show that E(X) = (1/2) r(s + 1) and σ²(X) = (1/12) r(s + 1)(s − r).
Hint: Proceed as in the derivation of the expected value and the variance
of the hypergeometric distribution.
3.103 A bowl contains a red and b white balls. You randomly pick without
replacement one ball at a time until you have r red balls. What is the
probability mass function of the number of picks needed?
3.104 Suppose you know that the hands of you and your bridge partner contain
eight of the 13 spades in the deck. What is the probability of a 3-2 split
of the remaining five spades in the bridge hands of your opponents?
3.105 A deck of cards has 8 diamonds and 7 spades. The diamonds are
assigned to player A and the spades to player B. The cards are turned up
one by one, and the player whose suit is first to be turned up five times
wins. What is the probability that player A wins?
3.106 In the 6/49 lottery, six different numbers are drawn at random from
1, 2, . . . , 49. What is the probability that the next drawing will have no
numbers in common with the last two drawings?
3.107 There is a concert and 2,500 tickets are to be raffled off. You have sent in
100 applications. The total number of applications is 125,000. What are
your chances of getting a ticket? Can you explain why this probability
is approximately equal to 1 − e^{−2}?
P(X = k) = 1/(b − a + 1) for k = a, a + 1, . . . , b.
The random variable X can be interpreted as the number of trials until the
first success occurs in a sequence of independent Bernoulli trials with success
probability p. Note that the geometric probability (1 − p)k−1 p is maximal
for k = 1. This explains that in a sequence of independent Bernoulli trials,
successes are likely to show up “clumped” rather than evenly spaced.
∞
Using the relations ∞ k=1 kx
k−1 = (1 − x)−2 and
k=1 k(k − 1)x
k−2 =
−3
2(1 − x) for |x| < 1, see Appendix C, it is easily verified that
1 1−p
E(X) = and var(X) = .
p p2
Example 3.24 Each time, you simultaneously toss a fair coin and roll a fair
die until heads is tossed or a six is rolled, or both. What is the probability mass
function of the number of trials?
Solution. Let the random variable N be the number of trials until heads is
tossed or a six is rolled, or both. Then, by the independence of the trials,
P(N > k) = (1/2)^k × (5/6)^k = (5/12)^k for k = 0, 1, . . . .
Hence P(N = k) = P(N > k − 1) − P(N > k) = (7/12)(5/12)^{k−1} for k = 1, 2, . . . ,
that is, N is geometrically distributed with parameter 7/12.
The result of Example 3.24 is a special case of the general result that
min(X, Y) is geometrically distributed with parameter p1 + p2 − p1 p2 when
the random variables X and Y are independent and geometrically distributed
with parameters p1 and p2 (verify!).
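This general result can be checked numerically: for independent geometric X and Y, P(min(X, Y) > k) = [(1 − p1)(1 − p2)]^k, which is exactly the tail of a geometric distribution with parameter p1 + p2 − p1p2. A short sketch, added as an illustration and using the parameters of Example 3.24:

```python
p1, p2 = 0.5, 1 / 6                         # coin and die from Example 3.24
p = p1 + p2 - p1 * p2                       # claimed parameter of min(X, Y), here 7/12

for k in range(6):
    tail_min = ((1 - p1) * (1 - p2)) ** k   # P(min(X, Y) > k) by independence
    tail_geo = (1 - p) ** k                 # tail of a geometric(p) distribution
    print(k, tail_min, tail_geo)            # identical columns
```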
E(X) = r/p and var(X) = r(1 − p)/p².
Example 3.25 Suppose that a single die is rolled over and over. How many
rolls are needed so that the probability of rolling a six three times within this
number of rolls is at least 50%?
Solution. Let the random variable X be defined as the number of rolls of the
die until a six appears for the third time. Then the random variable X has a
negative binomial distribution with parameters r = 3 and p = 1/6.
Problems
3.108 In the final of the World Series baseball, two teams play a series
consisting of at most seven games until one of the two teams has
won four games. Two unevenly matched teams are pitted against each
other and the probability that the weaker team will win any given
game is equal to 0.45. Assuming that the results of the various games
are independent of each other, calculate the probability of the weaker
team winning the final. What are the expected value and the standard
deviation of the number of games the final will take?
3.109 You perform a sequence of independent Bernoulli trials, each having
success probability 3/4. What is the probability of 15 successes occurring
before 5 failures?
11 This value of k is the median of the distribution. In general, the median of an integer-valued
random variable X is any integer m with P(X ≤ m) ≥ 0.5 and P(X ≥ m) ≥ 0.5. The median
of a probability distribution need not be unique.
3.110 A red bag contains 15 balls and a blue bag 5 balls. Each time, you pick
one of the two bags and remove one ball from the chosen bag. At each
pick, the red bag is chosen with probability 3/4 and the blue bag with
probability 1/4. What is the probability that the red bag will be emptied
first? What is the probability that the red bag is emptied while there are
still k balls in the blue bag for k = 1, . . . , 5?
3.111 Two players A and B each roll a fair die until player A has rolled a
1 or 2, or player B has rolled a 4, 5, or 6. The first player to roll one of
his assigned numbers is the winner, where player A is also the winner if
both players roll at the same time one of their assigned numbers. What
is the probability of player A winning the game? What is the probability
mass function of the length of the game?
3.112 In European roulette, the ball lands on one of the numbers 0, 1, . . . , 36
at every spin of the wheel. A gambler offers at even odds the bet that
the house number 0 will come up at least once in every 25 spins of the
wheel. What is the gambler’s expected profit per dollar bet?
3.113 Players A and B toss their own coins at the same time. A toss of the coin
of player A results in heads with probability a and a toss of the coin of
player B with probability b. The first player to toss heads wins. If they
both get heads at the same time, the game ends in a draw. What is the
probability of player A winning and what is the probability of a draw?
What is the probability mass function of the length of the game?
3.114 A fair coin is tossed until heads appears for the third time. Let the
random variable X be the number of tails shown up to that point. What
is the probability mass function of X? What are the expected value and
the standard deviation of X?
3.115 You are offered the following game. You can repeatedly pick at random
a number from 1 to 25. Each pick costs you one dollar. If you decide
to stop, you get paid the dollar amount of your last pick. What strategy
should you use to maximize your expected net payoff?
3.116 You toss a biased coin with probability p of heads, while your friend
tosses at the same time a fair coin. What is the probability distribution
of the number of tosses until both coins simultaneously show the same
outcome?
3.117 John and Pete had a nice evening at the pub. They decided to play the
following game in order to determine who would pay for the beer. Each
of them rolls two dice. The game ends if the dice total of John is the
same as that of Pete; otherwise, they play another round. Upon ending
the game, John pays for the beer if the dice total is odd; otherwise,
Pete pays. What is the probability mass function of the length of the
game? What is the probability of John paying for the beer?
3.118 You have a thoroughly shuffled deck of 52 cards. Each time, you choose
one card from the deck. The drawn card is put back in the deck and all 52
cards are again thoroughly shuffled. You continue this procedure until
you have seen all four different aces. What are the expected value and
the standard deviation of the number of times you have to draw a card
until you have seen all four different aces?
4
Continuous Random Variables
Solution. The sample space of the chance experiment is the interval (0, 1),
where the outcome ω = u means that the point at which the stick is broken is
a distance u from the beginning of the stick. Let the random variable X denote
the ratio of the length of the shorter piece to that of the longer piece of the
broken stick. Fix 0 < a < 1. The probability that the ratio of the length of
the shorter piece to that of the longer piece is less than or equal to a is nothing
other than the probability that a random number from the interval (0,1) falls
either into (1/(1 + a), 1) or into (0, 1 − 1/(1 + a)) (verify!). The latter probability is equal
to 2(1 − 1/(1 + a)) = 2a/(1 + a). Thus,
P(X ≤ a) = 2a/(1 + a) for 0 < a < 1.
Obviously, P(X ≤ a) = 0 for a ≤ 0 and P(X ≤ a) = 1 for a ≥ 1. Denote by
f(a) = 2/(1 + a)² the derivative of 2a/(1 + a) for 0 < a < 1 and let f(a) = 0
outside the interval (0, 1).
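The distribution function P(X ≤ a) = 2a/(1 + a) just found is easy to confirm by simulation; the sketch below, an added illustration, breaks the stick at a uniform random point many times and compares the empirical fractions with the formula.

```python
import random

random.seed(1)
n = 200_000
ratios = []
for _ in range(n):
    u = random.random()                     # break point of the unit stick
    ratios.append(min(u, 1 - u) / max(u, 1 - u))

for a in (0.1, 0.25, 0.5, 0.9):
    empirical = sum(r <= a for r in ratios) / n
    print(a, empirical, 2 * a / (1 + a))    # empirical fraction versus 2a/(1+a)
```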
The notation P(X ≤ a) stands for the probability that is assigned by the
probability measure P to the set of all outcomes ω for which X (ω) ≤ a.
The function f (x) is called the probability density function of X, and is a sort
of analogue of the probability mass function of a discrete random variable.
The cumulative distribution function of any random variable X, denoted by
F(x), describes the probability that the random variable X will take on a value
less than or equal to x:
F(x) = P(X ≤ x).
1 In general, let X be a random variable with cumulative distribution function F(x) and V be a
finite subset of R = (−∞, ∞). Suppose that F(x) is continuous on R and differentiable on
R\V with a continuous derivative. Then, by the fundamental theorem of integral calculus, X is
continuously distributed with density f(x) = F′(x) for x ∈ R\V and f(x) = 0 for x ∈ V.
P(X > a) = ∫_a^∞ f(x) dx.
To see this, note that the event {X > a} is the complement of the event {X ≤ a}
and so P(X > a) = 1 − P(X ≤ a). Since ∫_{−∞}^{∞} f(x) dx = 1, it follows that
P(X > a) is equal to ∫_{−∞}^{∞} f(x) dx − ∫_{−∞}^{a} f(x) dx = ∫_a^∞ f(x) dx. More generally,
we can express P(a < X ≤ b) in terms of the density f(x). For any constants a
and b with a < b,
P(a < X ≤ b) = ∫_a^b f(x) dx.
To see this, note that the event {X ≤ b} is the union of the two disjoint events
{a < X ≤ b} and {X ≤ a}. As a result, P(X ≤ b) = P(a < X ≤ b) + P(X ≤ a),
or, equivalently, P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a). This gives that
P(a < X ≤ b) is equal to ∫_{−∞}^{b} f(x) dx − ∫_{−∞}^{a} f(x) dx = ∫_a^b f(x) dx.
The integral representation of P(a < X ≤ b) tells us that this probability is
given by the area under the graph of f (x) between the points a and b.
Example 4.4 The maximum outdoor air temperature (in degrees Celsius) in a
certain area on any given day in May can be modeled as a continuous random
variable X with density function f(x) = (1/4,500)(30x − x²) for 0 < x < 30 and
f (x) = 0 otherwise. What are the probabilities P(X ≤ 10), P(X > 25), and
P(15 < X ≤ 20)?
Solution. The probabilities P(X ≤ 10) and P(X > 25) are given by
P(X ≤ 10) = ∫_0^{10} (1/4,500)(30x − x²) dx = (1/4,500)[15x² − x³/3]_0^{10} = 7/27,
P(X > 25) = ∫_{25}^{30} (1/4,500)(30x − x²) dx = (1/4,500)[15x² − x³/3]_{25}^{30} = 2/27.
The probability P(15 < X ≤ 20) is evaluated as
P(15 < X ≤ 20) = ∫_{15}^{20} (1/4,500)(30x − x²) dx = (1/4,500)[15x² − x³/3]_{15}^{20} = 13/54.
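The three probabilities can be double-checked by evaluating the antiderivative F(x) = (15x² − x³/3)/4,500 at the integration limits; a minimal sketch, added for illustration:

```python
from fractions import Fraction

def F(x):
    """Antiderivative of f(x) = (30x - x^2)/4500 on (0, 30)."""
    x = Fraction(x)
    return (15 * x**2 - x**3 / 3) / 4500

print(F(10) - F(0))     # P(X <= 10)      = 7/27
print(F(30) - F(25))    # P(X > 25)       = 2/27
print(F(20) - F(15))    # P(15 < X <= 20) = 13/54
```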
Problems
4.1 Sizes of insurance claims can be modeled by a continuous random variable
X with probability density f (x) = c(10 − x) for 0 < x < 10 and f (x) = 0
otherwise, where c is some constant. What is the value of c? What is the
probability that the size of a particular claim is no more than 5 and what is
the probability that the size is more than 2?
4.2 Let the random variable X be the portion of a flood insurance claim for
flooding damage to a house. The probability density of X has the form
f(x) = c(3x² − 8x − 5) for 0 < x < 1. What is the value of the constant c?
What is the cumulative distribution function of X?
4.3 The mileage (in thousands of miles) you can get out of a specific type of tire is a continuous random variable X with probability density function $f(x) = \frac{2}{625}\,x e^{-x^2/625}$ for x > 0 and f(x) = 0 otherwise. What is the probability that the tire will last at most 15,000 miles? What is the probability that the tire will last more than 30,000 miles? What is the probability that the tire will last more than 20,000 miles but not more than 25,000 miles?
4.4 The lengths of phone calls (in minutes) made by a travel agent can be
modeled as a continuous random variable X with probability density
f (x) = 0.25e−0.25x for x > 0. What is the probability that a particular
phone call will take more than 7 minutes?
4.5 A particular pumping engine will only function properly if an essential component functions properly. The time to failure of the component (in thousands of hours) is a random variable X with probability density $f(x) = 0.02\,x e^{-0.01x^2}$ for x > 0. What is the proportion of pumping engines that will not fail before 10,000 hours of use? What is the probability that the engine will survive for another 5,000 hours, given that it has functioned properly during the past 5,000 hours?
4.6 The probability density function f(x) of the electrical resistance (in ohms) of a strain gauge produced by a certain firm can be modeled by $f(x) = \frac{1}{25}(x - 115)$ for 115 < x ≤ 120, $f(x) = \frac{1}{25}(125 - x)$ for 120 < x < 125, and f(x) = 0 otherwise.
The fact that the area under the graph of f(x) can be interpreted as a probability leads to an intuitive interpretation of f(a). Let a be a given continuity point of f(x). Consider now a small interval of length Δa around the point a, say [a − ½Δa, a + ½Δa]. Since
$$P\Big(a - \tfrac{1}{2}\Delta a \le X \le a + \tfrac{1}{2}\Delta a\Big) = \int_{a-\frac{1}{2}\Delta a}^{a+\frac{1}{2}\Delta a} f(x)\,dx$$
and $\int_{a-\frac{1}{2}\Delta a}^{a+\frac{1}{2}\Delta a} f(x)\,dx \approx f(a)\,\Delta a$ for Δa small, we have the insightful result
$$P\Big(a - \tfrac{1}{2}\Delta a \le X \le a + \tfrac{1}{2}\Delta a\Big) \approx f(a)\,\Delta a \quad \text{for } \Delta a \text{ small}.$$
In other words, the probability of random variable X taking on a value in a small interval around point a is approximately equal to f(a)Δa when Δa is the
length of the interval. You see that the number f (a) itself is not a probability,
but it is a relative measure for the likelihood that the random variable X will
take on a value in the immediate neighborhood of point a. Stated differently,
the probability density function f (x) expresses how densely the probability
mass of the random variable X is smeared out in the neighborhood of point x.
Hence the name density function. The probability density function provides
the most useful description of a continuous random variable. The graph of the
density function provides a good picture of the likelihood of the possible values
of the random variable.
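This interpretation is easy to check numerically. The sketch below, an illustration only, compares the exact probability of a small interval around a point with the approximation f(a)Δa, reusing the temperature density of Example 4.4 and an arbitrarily chosen point a = 12.

```python
# P(a - Δa/2 <= X <= a + Δa/2) ≈ f(a)·Δa, illustrated with the density of Example 4.4.
from scipy.integrate import quad

f = lambda x: (30 * x - x ** 2) / 4500   # valid for 0 < x < 30

a = 12.0
for da in (1.0, 0.1, 0.01):
    exact, _ = quad(f, a - da / 2, a + da / 2)   # exact probability of the interval
    approx = f(a) * da                            # the f(a)·Δa approximation
    print(da, exact, approx)
# The two columns agree ever more closely as Δa shrinks.
```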
All of the foregoing examples follow the same procedure in order to find
the probability density function of a random variable X. The cumulative
distribution function P(X ≤ x) is determined first and this distribution function
is then differentiated to obtain the probability density.
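The same two-step procedure can be mechanized with a computer algebra system. The sketch below, offered only as an illustration, differentiates the distribution function P(X ≤ x) = 2x/(1 + x) of the ratio X from Example 4.1 and recovers the density 2/(1 + x)².

```python
# Differentiate a cumulative distribution function to obtain the density.
import sympy as sp

x = sp.symbols('x', positive=True)
F = 2 * x / (1 + x)                 # P(X <= x) for the shorter/longer ratio, 0 < x < 1
f = sp.simplify(sp.diff(F, x))
print(f)                            # 2/(x + 1)**2
print(sp.integrate(f, (x, 0, 1)))   # 1, so f is indeed a density on (0, 1)
```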
Problems
4.9 Let the radius of a circle be a random variable X with density function
f (x) = 1 for 0 < x < 1 and f (x) = 0 otherwise. What is the probability
density of the area of the circle?
4.10 The density function of the continuous random variable X is $f(x) = \frac{6}{7}(x + \sqrt{x})$ for 0 < x < 1 and f(x) = 0 otherwise. What is the probability density of 1/X?
4.11 Let X be a positive random variable with probability density function
f (x). Define the random variable Y by Y = X 2 . What is the probability
density function of Y? Also, find the density function of the random
variable W = V 2 if V is a number chosen at random from the interval
(−a, a) with a > 0.
4.12 A random point Q is chosen inside the unit square. What is the density
function of the sum of the coordinates of the point Q? What is the density
function of the product of the coordinates of the point Q? Use geometry
to find these densities.
4.13 The number X is chosen at random between 0 and 1. Determine the
density functions of the random variables V = X/(1 − X) and W =
X(1 − X).
4.14 A stick of unit length is broken at random into two pieces. Let the
random variable X be the length of the shorter piece. What is the density
function of X? Also, use the cumulative distribution function of X to give
an alternative derivation of the density function of the random variable
X/(1 − X) from Example 4.1.
4.15 A random point is chosen inside the unit square. Let the random variables
V and W be defined as the largest and the smallest of the two coordinates
of the point. What are the probability densities of the random variables
V and W?
4.16 Suppose you decide to take a ride on the ferris wheel at an amusement park. The ferris wheel has a diameter of 30 m. After several turns, the ferris wheel suddenly stops due to a power outage. Let the random variable X be your height above the ground when the ferris wheel stops, where it is assumed that the bottom of the ferris wheel is level with the ground. Verify that the density function of X is given by $\big(15\pi\sqrt{1 - (x/15 - 1)^2}\,\big)^{-1}$ for 0 < x < 30. Hint: The random variable X can be modeled as X = 15 + 15 cos(Θ), where Θ is a randomly chosen angle between 0 and π (make a drawing!).
integral is well-defined with a finite value if and only if $\int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty$. Such an integral is said to be absolutely convergent. For a nonnegative random variable X, the absolute convergence of the integral is not required. Then E(X) always exists, in accordance with the basic convention from integral calculus that an integral with a nonnegative integrand is always well-defined by allowing ∞ as a possible value. For example, the expected value of the random variable X with the two-sided density function $\frac{1}{\pi(1+x^2)}$ for −∞ < x < ∞ is not defined, because $\int_{-\infty}^{\infty} |x|\,\frac{1}{\pi(1+x^2)}\,dx = 2\int_0^{\infty} x\,\frac{1}{\pi(1+x^2)}\,dx = \infty$. However, the expected value of the nonnegative random variable with the one-sided density function $\frac{2}{\pi(1+x^2)}$ for x > 0 does exist (and is equal to ∞). It is important to emphasize that the expected value of a random variable X is finite if and only if $\int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty$.
The definition of expected value in the continuous case parallels the definition $E(X) = \sum_i x_i p(x_i)$ for a discrete random variable X with x₁, x₂, . . . as possible values and p(xᵢ) = P(X = xᵢ). For dx small, the quantity f(x) dx
in a discrete approximation of the continuous case corresponds to p(x) in
the discrete case. The summation becomes an integral when dx approaches
zero. Results for discrete random variables are typically expressed as sums.
The corresponding results for continuous random variables are expressed as
integrals.
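The passage from sums to integrals can also be seen numerically. In the sketch below, which is only an illustration with an arbitrarily chosen grid, E(X) for the temperature density of Example 4.4 is approximated by the discrete sum Σ x f(x)Δx over ever finer grids.

```python
# Discrete approximation of E(X): sum of x·f(x)·dx over a fine grid.
import numpy as np

f = lambda x: (30 * x - x ** 2) / 4500   # density of Example 4.4 on (0, 30)

for n in (10, 100, 1000):
    dx = 30 / n
    x = (np.arange(n) + 0.5) * dx        # midpoints of the grid cells
    print(n, np.sum(x * f(x) * dx))      # approaches the exact value E(X) = 15
```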
Example 4.4 (continued) What is the expected value of the random variable X with probability density function $f(x) = \frac{1}{4{,}500}(30x - x^2)$ for 0 < x < 30 and f(x) = 0 otherwise?
Solution. The expected value of X is calculated as
$$E(X) = \frac{1}{4{,}500}\int_0^{30} x(30x - x^2)\,dx = \frac{1}{4{,}500}\Big[10x^3 - \frac{1}{4}x^4\Big]_0^{30} = 15.$$
Example 4.6 (continued) What is the expected value of the random variable X representing the distance from a random point inside the circular disk with radius r to the center of the disk?
Solution. The random variable X has the probability density function $f(x) = \frac{2x}{r^2}$ for 0 < x < r and f(x) = 0 otherwise. This gives
$$E(X) = \int_0^r x\,\frac{2x}{r^2}\,dx = \frac{2}{3}\,\frac{x^3}{r^2}\Big|_0^r = \frac{2}{3}\,r.$$
shorter piece to that of the longer piece? What is the expected value of the
ratio of the length of the longer piece to that of the shorter piece?
Solution. Let the random variable X be the ratio of the length of the shorter
piece to that of the longer piece and Y be the ratio of the length of the longer
piece to that of the shorter piece. In Example 4.1, we showed that $P(X \le x) = \frac{2x}{x+1}$ for 0 ≤ x ≤ 1, with $f(x) = \frac{2}{(x+1)^2}$ as its density function for 0 < x < 1. Hence,
$$E(X) = \int_0^1 x\,\frac{2}{(x+1)^2}\,dx = 2\int_0^1 \frac{1}{x+1}\,dx - 2\int_0^1 \frac{1}{(x+1)^2}\,dx = \Big[2\ln(x+1) + \frac{2}{x+1}\Big]_0^1 = 2\ln(2) - 1.$$
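The value 2 ln(2) − 1 ≈ 0.386 is easily confirmed by simulation. The sketch below, an illustration only (the sample size and seed are arbitrary choices), breaks a stick at a uniform random point many times and averages the ratio of the shorter to the longer piece.

```python
# Monte Carlo check of E(X) = 2 ln(2) - 1 for the shorter/longer ratio.
import numpy as np

rng = np.random.default_rng(1)
u = rng.random(1_000_000)                         # uniform break points in (0, 1)
shorter, longer = np.minimum(u, 1 - u), np.maximum(u, 1 - u)
print(np.mean(shorter / longer))                  # simulated estimate, about 0.386
print(2 * np.log(2) - 1)                          # exact value 0.3863...
```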
Problems
4.17 The time (in hundreds of hours) until failure of the power supply to a radar system is a random variable X with probability density function $f(x) = \frac{1}{625}(x - 50)$ for 50 < x ≤ 75, $f(x) = \frac{1}{625}(100 - x)$ for 75 < x < 100, and f(x) = 0 otherwise. What is the expected value of X?
4.18 The time (in milliseconds) for a particular chemical reaction to complete
in water is a random variable X with probability density function
$f(x) = \pi\sqrt{2}\,\cos(\pi x)$ for 0 < x < 0.25 and f(x) = 0 otherwise. What is the expected value of X?
4.19 The javelin thrower Big John throws the javelin more than x meters with probability P(x), where P(x) = 1 for 0 ≤ x < 50, $P(x) = 1 - (x-50)^2/1{,}200$ for 50 ≤ x < 80, $P(x) = (90 - x)^2/400$ for 80 ≤ x < 90, and P(x) = 0 for x ≥ 90. What is the expected value of the distance thrown in his next shot?
4.20 What is the expected value of the random variable X in Problems 4.2,
4.4, and 4.6?
4.21 A random point is chosen inside a triangle with height h and base length
b. What is the expected value of the perpendicular distance of the point
to the base?
4.22 Consider again Example 4.5. What is the expected value of the price paid
for the collector’s item?
4.23 A random point is chosen inside the unit square {(x, y) : 0 ≤ x, y ≤ 1}.
What is the expected value of the distance from this point to the point
(0, 0)?
4.24 A random point is chosen inside the unit square. What is the expected
value of the distance from this point to the closest side of the unit square?
4.25 Let X be a continuous random variable with probability density function
f (x).
(a) Explain why the natural definition of the conditional expected value
of X given that X > a is
$$E(X \mid X > a) = \frac{1}{P(X > a)} \int_a^{\infty} x f(x)\,dx.$$
Also, explain the definition $E(X \mid X \le a) = \frac{1}{P(X \le a)} \int_{-\infty}^{a} x f(x)\,dx$.
(b) What is E(X | X > a) for 0 < a < 1 when X is a randomly chosen
number from (0, 1)?
(c) What are E(X | X > a) and E(X | X ≤ a) for a > 0 when X has the exponential density function $f(x) = \lambda e^{-\lambda x}$ for x > 0 and f(x) = 0 otherwise?
4.26 Let X be a nonnegative continuous random variable with density function
f (x). Use an interchange of the order of integration to verify that
$$E(X) = \int_0^{\infty} P(X > x)\,dx.$$
Use this result to answer the following question. What is the expected
value of the smallest of n independent random numbers from (0, 1)?
1 < x ≤ 2, and g(x) = 0 for x > 2. Then g(X) is the warranty payment. Its expected value is
$$E[g(X)] = 250\int_0^1 \tfrac{1}{5}e^{-x/5}\,dx + 125\int_1^2 \tfrac{1}{5}e^{-x/5}\,dx = 250\big(1 - e^{-1/5}\big) + 125\big(e^{-1/5} - e^{-2/5}\big) = 63.87 \text{ dollars}.$$
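The figure of 63.87 dollars can be reproduced numerically from E[g(X)] = ∫ g(x)f(x) dx. The sketch below is only an illustration; it assumes, as in the integrals above, the exponential density ⅕e^{-x/5} and the payment g(x) = 250 for 0 < x ≤ 1, 125 for 1 < x ≤ 2, and 0 otherwise.

```python
# Numerical check of the expected warranty payment E[g(X)] ≈ 63.87 dollars.
from math import exp
from scipy.integrate import quad

f = lambda x: exp(-x / 5) / 5            # exponential density with expected value 5

# split the integral at x = 1, where the payment g(x) jumps from 250 to 125
v1, _ = quad(lambda x: 250 * f(x), 0, 1)
v2, _ = quad(lambda x: 125 * f(x), 1, 2) # g(x) is zero beyond x = 2
print(round(v1 + v2, 2))                 # 63.87
```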
The substitution rule often simplifies the calculation of the expected value.
As an illustration, consider again Example 4.1 in which the random variable
X is the ratio of the length of the shorter piece to that of the longer piece of a
stick of unit length that is broken at random into two pieces. In the foregoing
section, we calculated E(X) by determining first the density function of X
and then applying the definition of expected value. However, the substitution
rule provides a simpler way to calculate E(X). Let the random variable U be
distributed as a random point chosen in (0, 1). Then U has the density function
f(u) = 1 for 0 < u < 1 and X is distributed as g(U), where the function g(u) is defined by g(u) = u/(1 − u) for 0 < u ≤ ½ and g(u) = (1 − u)/u for ½ < u < 1. This gives
$$E(X) = \int_0^{1/2} \frac{u}{1-u}\,du + \int_{1/2}^{1} \frac{1-u}{u}\,du = 2\int_{1/2}^{1} \frac{1-u}{u}\,du = \Big[2\ln(u) - 2u\Big]_{1/2}^{1} = 2\ln(2) - 1.$$
Rule 4.2 Let X be a continuous random variable with a finite expected value. Then, for any constants a and b,
$$E(aX + b) = aE(X) + b \quad\text{and}\quad \mathrm{var}(aX + b) = a^2\,\mathrm{var}(X).$$
This rule is the continuous analogue of Rules 3.3 and 3.5 for the discrete case and can be verified directly from Rule 4.1. The variance of X does not have the same dimension as the values of X. Therefore, one often uses the standard deviation of the random variable X. This measure is defined by
$$\sigma(X) = \sqrt{\mathrm{var}(X)}.$$
Example 4.8 Let the random variable X be a randomly chosen number from the interval (a, b). What are the expected value and the variance of X?
Solution. The probability that X will fall into a subinterval of width w is $\frac{w}{b-a}$. Hence, $P(X \le x) = \frac{x-a}{b-a}$ for a ≤ x ≤ b and so the density function of X is $f(x) = \frac{1}{b-a}$ for a < x < b and f(x) = 0 otherwise. This gives
$$E(X) = \int_a^b x\,\frac{1}{b-a}\,dx = \frac{1}{2}\,\frac{x^2}{b-a}\Big|_a^b = \frac{1}{2}\,\frac{b^2-a^2}{b-a} = \frac{a+b}{2},$$
$$E(X^2) = \int_a^b x^2\,\frac{1}{b-a}\,dx = \frac{1}{3}\,\frac{x^3}{b-a}\Big|_a^b = \frac{1}{3}\,\frac{b^3-a^3}{b-a} = \frac{a^2+ab+b^2}{3},$$
so that $\mathrm{var}(X) = E(X^2) - \big(E(X)\big)^2 = \frac{(b-a)^2}{12}$.
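A quick simulation confirms these formulas for any particular interval. The sketch below is only an illustration; the interval (2, 7), the sample size, and the seed are arbitrary choices.

```python
# Simulation check of E(X) = (a+b)/2 and var(X) = (b-a)^2/12 for X uniform on (a, b).
import numpy as np

a, b = 2.0, 7.0                          # arbitrary example interval
rng = np.random.default_rng(0)
x = rng.uniform(a, b, size=1_000_000)
print(x.mean(), (a + b) / 2)             # both ≈ 4.5
print(x.var(), (b - a) ** 2 / 12)        # both ≈ 2.0833
```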
Problems
4.27 Suppose that the random variable X has the probability density f (x) =
12x2 (1 − x) for 0 < x < 1 and f (x) = 0 otherwise.
(a) Use Jensen's inequality from Section 3.4 to verify that E[1/X] ≥ 5/3.
(b) What are the expected value and the standard deviation of 1/X?
4.28 Let the random variables V and W be defined by $V = \sqrt{U}$ and W = U²,
where U is a number chosen at random between 0 and 1. What are the
expected values and the standard deviations of V and W?
4.29 An insurance policy for water damage pays an amount of damage up to $450. The amount of damage is a random variable X with density function $f(x) = \frac{1}{1{,}250}$ for 0 < x < 1,250. The amount of damage exceeding $450 is covered by a supplement policy up to $500. What is the expected value of the amount of damage paid by the supplement policy?
4.30 Consider again Problem 4.7. Assume that the storage tank has a capacity
of 0.9 expressed in thousands of gallons. The cost of removing x > 0
units of waste at the end of the week is 1.25 + 0.5x. Additional costs of
5+10z are incurred when the capacity of the storage tank is not sufficient
and an overflow of z > 0 units of waste occurs during the week. What is
the expected value of the weekly costs?
4.31 A manufacturer has to make a last production run for a product that is near the end of its lifetime. The final demand for the product can be modeled as a continuous random variable X having probability density $f(x) = \frac{1}{2{,}500}\,x e^{-x/50}$ for x > 0. It is decided to make a last production run of 250 units of the product. The manufacturer earns $2 for each unit of product sold but incurs a cost of $0.50 for each unit of demand occurring when out of stock. What is the expected value of the net profit of the manufacturer? What is the probability that the manufacturer runs out of stock?
4.32 A car owner insures his car, worth $20,000, for one year under a policy
with a deductible of $1,000. There is a probability of 0.01 of a total loss
of the car during the policy year and a probability of 0.02 of a repairable
damage. The cost (in thousands of dollars) of a repairable damage has
the probability density $f(x) = \frac{1}{200}(20 - x)$ for 0 < x < 20. What is the
expected value of the insurance payment? Note that this payment is a
mixed random variable.
4.33 A point is chosen at random in (0, 1); this point divides the interval (0, 1) into two subintervals. What is the expected length of the subinterval covering a given point s with 0 < s < 1?
4.34 What is the standard deviation of the random variable X in each of the
Problems 4.2, 4.4, and 4.6?
4.35 What are the expected value and the standard deviation of the area of
the circle whose radius is a random variable X with density function
f (x) = 1 for 0 < x < 1 and f (x) = 0 otherwise?
4.36 A point Q is chosen at random inside a sphere with radius r. What are
the expected value and the standard deviation of the distance from the
center of the sphere to the point Q?
4.37 Let X be a continuous random variable with probability density f (x) and
finite expected value E(X).
(a) What constant c minimizes E[(X − c)2 ] and what is the minimal
value of E[(X − c)2 ]?
(b) Prove that E(|X − c|) is minimal if c is chosen equal to the median
of X.
4.38 Consider again Problem 4.16. Calculate the expected value and the stan-
dard deviation of the height above the ground when the ferris wheel stops.
4.39 In an inventory system, a replenishment order is placed when the stock
on hand of a certain product drops to the level s, where the reorder point
s is a given positive number. The total demand for the product during
the lead time of the replenishment order has the probability density
f (x) = λe−λx for x > 0. What are the expected value and standard
deviation of the shortage (if any) when the replenishment order arrives?
4.40 Suppose that the continuous random variable X has the probability density function $f(x) = (\alpha/\beta)(\beta/x)^{\alpha+1}$ for x > β and f(x) = 0 for x ≤ β, for given values of the parameters α > 0 and β > 0. This density
is called the Pareto density, which provides a useful probability model
for income distributions, among others.
(a) Calculate the expected value, the variance, and the median of X.
(b) Assume that the annual income of the employed, measured in thou-
sands of dollars, in a given country follows a Pareto distribution with
α = 2.25 and β = 2.5. What percentage of the working population
has an annual income of between 25,000 and 40,000 dollars?
(c) Why do you think the Pareto distribution is a good model for income
distributions? Hint: Use the probabilistic interpretation of the density
f (x).
[Figure: the uniform density f(x) = 1/(b − a) on the interval (a, b).]
Problems
4.41 The lifetime of a light bulb has a uniform probability density on (2, 12).
The light bulb will be replaced upon failure or upon reaching age 10,
whichever occurs first. What are the expected value and the standard
deviation of the age of the light bulb at the time of replacement?
4.42 A rolling machine produces sheets of steel of different thickness. The
thickness of a sheet of steel is uniformly distributed between 120 and
150 mm. Any sheet having a thickness of less than 125 mm must be
scrapped. What are the expected value and the standard deviation of the thickness of a non-scrapped sheet of steel?
[Figure: the triangular density on (a, b), rising from a to its peak value 2/(b − a) at the mode m and falling back to zero at b.]
and f (x) = 0 otherwise. It is left to the reader to verify that the nonnegative
function λe−λx integrates to 1 over the interval (−∞, ∞) and thus is indeed
a density function. The parameter λ is a scale parameter. An exponentially
distributed random variable X takes on only positive values. Figure 4.3 displays
the exponential density function with λ = 1. The expected value and the
variance of the random variable X are given by
$$E(X) = \frac{1}{\lambda} \quad\text{and}\quad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$
[Figure 4.3: the exponential density function with λ = 1.]
These formulas are obtained by evaluating the integrals $\int_0^{\infty} x\lambda e^{-\lambda x}\,dx$ and $\int_0^{\infty} x^2\lambda e^{-\lambda x}\,dx$. Using partial integration, the reader easily verifies that these integrals are $\frac{1}{\lambda}$ and $\frac{2}{\lambda^2}$, showing that $E(X) = \frac{1}{\lambda}$ and $E(X^2) = \frac{2}{\lambda^2}$, so that $\mathrm{var}(X) = E(X^2) - \big(E(X)\big)^2 = \frac{1}{\lambda^2}$.
Memoryless Property
The exponential distribution is extremely important in probability. It not only
models many real-world phenomena, but also allows for tractable mathemat-
ical analysis. The reason for its mathematical tractability is the memoryless
property. This property states that
P(X > t + s | X > s) = P(X > t) for all t > 0,
regardless of the value of s. In words, imagining that the exponentially
distributed random variable X represents the lifetime of an item, the residual
life of an item has the same exponential distribution as the original lifetime,
regardless of how long the item has already been in use. The proof is simple.
For any x ≥ 0, we have that $P(X > x) = e^{-\lambda x}$. Applying the definition P(A | B) = P(AB)/P(B) with A = {X > t + s} and B = {X > s}, and noting that P(X > t + s, X > s) = P(X > t + s), we find
$$P(X > t + s \mid X > s) = \frac{P(X > t + s)}{P(X > s)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t),$$
showing the memoryless property. The exponential distribution is the only
continuous distribution possessing this property. The technical proof of this
uniqueness result is omitted.
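The memoryless property also shows up clearly in simulated data. The sketch below, an illustration only with arbitrarily chosen values of λ, s, and t, estimates P(X > t + s | X > s) from exponential samples and compares it with P(X > t) = e^{-λt}.

```python
# Empirical check of the memoryless property of the exponential distribution.
import numpy as np

lam, s, t = 0.5, 2.0, 3.0
rng = np.random.default_rng(7)
x = rng.exponential(scale=1 / lam, size=2_000_000)

cond = np.mean(x[x > s] > t + s)                # estimate of P(X > t+s | X > s)
print(cond, np.mean(x > t), np.exp(-lam * t))   # all three ≈ 0.2231
```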
The exponential distribution is often used as probability model for the time
until a rare event occurs. Examples are the time until the next earthquake in
a certain region and the decay time of a radioactive particle. A very useful
result is that under general conditions, the time until the first occurrence of
a rare event is approximately exponentially distributed. To make this result
plausible, consider the next example. Maintenance of an operating unit in a
reliability system occurs at the scheduled times τ , 2τ , . . . , where τ > 0 is
fixed. Each maintenance takes a negligible time and the unit is again as good
as new after each maintenance. There is a probability p > 0 that the unit will
fail between two maintenance inspections, where p is very close to zero. Let
the random variable X be the time until the first system failure. Then, for any
n, we have $P(X > n\tau) = (1 - p)^n$ and so $P(X > n\tau) \approx e^{-np}$, using the fact that $e^{-p} \approx 1 - p$ for p close to zero. Hence, taking t = nτ and replacing n by t/τ, it follows that
$$P(X > t) \approx e^{-\lambda t} \quad \text{for } t > 0,$$
where λ = p/τ denotes the inverse of the expected time until the first system
failure. It is important to note that it suffices to know the expected value of
the time until the first system failure in order to approximate the probability
distribution of the failure time by an exponential distribution. An application
of this very useful result will be given in Example 4.9 below.
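The quality of the approximation is easy to inspect. The sketch below, an illustration only with an arbitrarily chosen failure probability p and inspection interval τ, compares the exact probability (1 − p)^n with the exponential approximation e^{-np} for several values of n.

```python
# Compare P(X > n·τ) = (1-p)^n with the exponential approximation e^{-n·p}.
from math import exp

p, tau = 0.01, 1.0                      # small failure probability per inspection interval
for n in (10, 50, 100, 200):
    exact = (1 - p) ** n
    approx = exp(-n * p)                # = e^{-λt} with λ = p/τ and t = n·τ
    print(n * tau, exact, approx)
```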
Also, the probability of exceeding some extreme level is often approximately equal to an exponential tail probability of the form $\alpha e^{-\beta t}$ for constants α, β > 0. An interesting example concerns the probability that a high tide of h meters or more above sea level will occur in any given year somewhere along the Dutch coastline. This probability is approximately equal to $e^{-2.97h}$
for values of h larger than 1.70 m. This empirical result was used in the design
of the Delta works that were built following the 1953 disaster when the sea
flooded a number of polders in the Netherlands.
Example 4.9 A reliability system has two identical units, where one unit is in
full operation and the other unit is in cold standby. The lifetime of an operating
unit has an exponential density with expected value 1/μ. Upon failure of the
operating unit, the other unit is put into operation provided a standby unit is
available. The replacement time of a failed unit is fixed and is equal to τ > 0.
A system failure occurs if no standby unit is available at the moment the
operating unit fails. It is assumed that the probability $1 - e^{-\mu\tau}$ is close to
zero, that is, the probability of an operating unit failing during the replacement
time τ is very small. What is an approximation to the probability distribution
of the time until the first system failure?
Solution. Let the random variable X denote the time until the first system
failure. In Example 7.11 in Section 7.3, it will be shown that
$$E(X) = \frac{2 - e^{-\mu\tau}}{\mu\big(1 - e^{-\mu\tau}\big)}.$$